[
  {
    "path": ".gitattributes",
    "content": "#common settings that generally should always be used with your language specific settings\n\n# Auto detect text files and perform LF normalization\n# http://davidlaing.com/2012/09/19/customise-your-gitattributes-to-become-a-git-ninja/\n* text=auto\n\n#\n# The above will handle all files NOT found below\n#\n\n# Scripts\n*.bat text eol=crlf\n*.cmd text eol=crlf\n*.ps1 text eol=crlf\n*.sh  text eol=lf\n\n# Documents\n*.doc\t diff=astextplain\n*.DOC\t diff=astextplain\n*.docx diff=astextplain\n*.DOCX diff=astextplain\n*.dot  diff=astextplain\n*.DOT  diff=astextplain\n*.ppt\t diff=astextplain\n*.PPT\t diff=astextplain\n*.pptx\t diff=astextplain\n*.PPTX\t diff=astextplain\n*.pdf  diff=astextplain\n*.PDF\t diff=astextplain\n*.rtf\t diff=astextplain\n*.RTF\t diff=astextplain\n*.md text\n*.adoc text\n*.textile text\n*.mustache text\n*.csv text\n*.tab text\n*.tsv text\n*.sql text\n\n# Graphics\n*.png binary\n*.jpg binary\n*.jpeg binary\n*.gif binary\n*.tif binary\n*.tiff binary\n*.ico binary\n# SVG treated as an asset (binary) by default. 
If you want to treat it as text,\n# comment-out the following line and uncomment the line after.\n*.svg binary\n#*.svg text\n*.eps binary\n\n#sources\n*.c   text eol=crlf\n*.cc  text eol=crlf\n*.cxx text eol=crlf\n*.cpp text eol=crlf\n*.c++ text eol=crlf\n*.hpp text eol=crlf\n*.h   text eol=crlf\n*.h++ text eol=crlf\n*.hh  text eol=crlf\n*.asm text eol=crlf\n*.S   text eol=crlf\n*.cfg text eol=crlf\n*.txt text eol=lf\n\n# QT Project files\n*.pro text eol=lf\n\n# Compiled Object files\n*.slo binary\n*.lo binary\n*.o binary\n*.obj binary\n\n# Precompiled Headers\n*.gch binary\n*.pch binary\n\n# Compiled Dynamic libraries\n*.so binary\n*.dylib binary\n*.dll binary\n\n# Compiled Static libraries\n*.lai binary\n*.la binary\n*.a binary\n*.lib binary\n\n# Executables\n*.exe binary\n*.out binary\n*.app binary\n\n# Custom for Visual Studio\n*.sln text eol=crlf\n*.csproj text eol=crlf\n*.vbproj text eol=crlf\n*.fsproj text eol=crlf\n*.dbproj text eol=crlf\n\n*.vcproj  text eol=crlf\n*.vcxproj text eol=crlf\n*.sln     text eol=crlf\n*.vcxitems text eol=crlf\n*.props    text eol=crlf\n*.filters  text eol=crlf\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/----.md",
    "content": "---\nname: 问题咨询\nabout: 使用问题/安全问题/其他问题\n\n---\n\n请发送邮件至： sswang@pku.edu.cn\n或在应用内“高级设置” - “建议反馈” 填写表单\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: Bug report\nabout: Create a report of bug / 如果你认为你发现了一项代码问题\n\n---\n\n**Describe the bug**\n\nA clear and concise description of what the bug is.\n\n请详细的描述这个bug的细节\n\n**To Reproduce**\n\nSteps to reproduce the behavior (including the commond line parameters)\n\n请详细描述重现这个bug的步骤（运行的命令行参数、输入的文件）\n\n\n**Expected behavior**\n\nA clear and concise description of what you expected to happen.\n\n你认为这个功能本应如何工作\n\n**Screenshots**\n\nIf applicable, add screenshots to help explain your problem.\n\n如果有可能，请提供截图\n\n**Desktop (please complete the following information):**\n - OS: [e.g. Windows10, Ubuntu 18.04]\n - Compiler [e.g. Visual Studio 2013, GCC 5.6.0]\n - yasm [e.g. 1.2.0, 1.3.0-luofl]\n\n你的操作系统（包括版本）、编译器（GCC/G++, VS)、汇编器yasm（版本号）。\n\n\n**Additional context**\n\nAdd any other context about the problem here, i.e. video sequences and bitstreams.\n\n额外的材料，例如输入的视频序列、码流文件等。\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: Feature request\nabout: Suggest an idea for this project / 功能请求\n\n---\n\n请详细填写以下四项关键元素\n\n## 功能描述\n\n## 功能带来的效应\n\n## 缺少此功能的影响\n\n## 实现的思路与方式\n"
  },
  {
    "path": ".gitignore",
    "content": "Debug/\nRelease/\nx64_Debug/\nx64_Release/\nbuild/linux/cavs2dec*\nbuild/linux/davs2*\nMy*/\n*.user\n*.suo\n*.ncb\n*.aps\n*.pdb\n*.res\n*.dat\n*.manifest\n*.map\n*.dep\n*.idb\n*.ilk\n*.htm\n*.exp\n*.lib\n*.obj\n*.dll*\n*.exe\n*.avs\n*.mkv\n*.mp4\n*.y4m\n*.yuv\n*.log\n*.bak\n*.o\n*.a\n*.so\n*.cd\n*.sdf\n*.opensdf\n*.depend\n*.pc\n*.mak\n*.so.*\n*.dec\n*.txt\nconfig.h\n*.iobj\n*.ipdb\nversion.h\n\n"
  },
  {
    "path": ".travis.yml",
    "content": "language: c\ndist: xenial\n\ninstall:\n  - wget https://www.nasm.us/pub/nasm/releasebuilds/2.14.02/nasm-2.14.02.tar.gz -O nasm-2.14.02.tar.gz\n  - tar -xvf nasm-2.14.02.tar.gz\n  - pushd nasm-2.14.02 && ./configure --prefix=/usr && make && sudo make install && popd\n\njobs:\n  include:\n   # General Linux build job\n   - name: Build\n     script:\n     - cd build/linux\n     - ./configure\n     - make -j\n\n"
  },
  {
    "path": "COPYING",
    "content": "\t\t    GNU GENERAL PUBLIC LICENSE\n\t\t       Version 2, June 1991\n\n Copyright (C) 1989, 1991 Free Software Foundation, Inc.\n     59 Temple Place, Suite 330, Boston, MA  02111-1307  USA\n Everyone is permitted to copy and distribute verbatim copies\n of this license document, but changing it is not allowed.\n\n\t\t\t    Preamble\n\n  The licenses for most software are designed to take away your\nfreedom to share and change it.  By contrast, the GNU General Public\nLicense is intended to guarantee your freedom to share and change free\nsoftware--to make sure the software is free for all its users.  This\nGeneral Public License applies to most of the Free Software\nFoundation's software and to any other program whose authors commit to\nusing it.  (Some other Free Software Foundation software is covered by\nthe GNU Library General Public License instead.)  You can apply it to\nyour programs, too.\n\n  When we speak of free software, we are referring to freedom, not\nprice.  Our General Public Licenses are designed to make sure that you\nhave the freedom to distribute copies of free software (and charge for\nthis service if you wish), that you receive source code or can get it\nif you want it, that you can change the software or use pieces of it\nin new free programs; and that you know you can do these things.\n\n  To protect your rights, we need to make restrictions that forbid\nanyone to deny you these rights or to ask you to surrender the rights.\nThese restrictions translate to certain responsibilities for you if you\ndistribute copies of the software, or if you modify it.\n\n  For example, if you distribute copies of such a program, whether\ngratis or for a fee, you must give the recipients all the rights that\nyou have.  You must make sure that they, too, receive or can get the\nsource code.  
And you must show them these terms so they know their\nrights.\n\n  We protect your rights with two steps: (1) copyright the software, and\n(2) offer you this license which gives you legal permission to copy,\ndistribute and/or modify the software.\n\n  Also, for each author's protection and ours, we want to make certain\nthat everyone understands that there is no warranty for this free\nsoftware.  If the software is modified by someone else and passed on, we\nwant its recipients to know that what they have is not the original, so\nthat any problems introduced by others will not reflect on the original\nauthors' reputations.\n\n  Finally, any free program is threatened constantly by software\npatents.  We wish to avoid the danger that redistributors of a free\nprogram will individually obtain patent licenses, in effect making the\nprogram proprietary.  To prevent this, we have made it clear that any\npatent must be licensed for everyone's free use or not licensed at all.\n\n  The precise terms and conditions for copying, distribution and\nmodification follow.\n\f\n\t\t    GNU GENERAL PUBLIC LICENSE\n   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION\n\n  0. This License applies to any program or other work which contains\na notice placed by the copyright holder saying it may be distributed\nunder the terms of this General Public License.  The \"Program\", below,\nrefers to any such program or work, and a \"work based on the Program\"\nmeans either the Program or any derivative work under copyright law:\nthat is to say, a work containing the Program or a portion of it,\neither verbatim or with modifications and/or translated into another\nlanguage.  (Hereinafter, translation is included without limitation in\nthe term \"modification\".)  Each licensee is addressed as \"you\".\n\nActivities other than copying, distribution and modification are not\ncovered by this License; they are outside its scope.  
The act of\nrunning the Program is not restricted, and the output from the Program\nis covered only if its contents constitute a work based on the\nProgram (independent of having been made by running the Program).\nWhether that is true depends on what the Program does.\n\n  1. You may copy and distribute verbatim copies of the Program's\nsource code as you receive it, in any medium, provided that you\nconspicuously and appropriately publish on each copy an appropriate\ncopyright notice and disclaimer of warranty; keep intact all the\nnotices that refer to this License and to the absence of any warranty;\nand give any other recipients of the Program a copy of this License\nalong with the Program.\n\nYou may charge a fee for the physical act of transferring a copy, and\nyou may at your option offer warranty protection in exchange for a fee.\n\n  2. You may modify your copy or copies of the Program or any portion\nof it, thus forming a work based on the Program, and copy and\ndistribute such modifications or work under the terms of Section 1\nabove, provided that you also meet all of these conditions:\n\n    a) You must cause the modified files to carry prominent notices\n    stating that you changed the files and the date of any change.\n\n    b) You must cause any work that you distribute or publish, that in\n    whole or in part contains or is derived from the Program or any\n    part thereof, to be licensed as a whole at no charge to all third\n    parties under the terms of this License.\n\n    c) If the modified program normally reads commands interactively\n    when run, you must cause it, when started running for such\n    interactive use in the most ordinary way, to print or display an\n    announcement including an appropriate copyright notice and a\n    notice that there is no warranty (or else, saying that you provide\n    a warranty) and that users may redistribute the program under\n    these conditions, and telling the user how to view a copy of this\n  
  License.  (Exception: if the Program itself is interactive but\n    does not normally print such an announcement, your work based on\n    the Program is not required to print an announcement.)\n\f\nThese requirements apply to the modified work as a whole.  If\nidentifiable sections of that work are not derived from the Program,\nand can be reasonably considered independent and separate works in\nthemselves, then this License, and its terms, do not apply to those\nsections when you distribute them as separate works.  But when you\ndistribute the same sections as part of a whole which is a work based\non the Program, the distribution of the whole must be on the terms of\nthis License, whose permissions for other licensees extend to the\nentire whole, and thus to each and every part regardless of who wrote it.\n\nThus, it is not the intent of this section to claim rights or contest\nyour rights to work written entirely by you; rather, the intent is to\nexercise the right to control the distribution of derivative or\ncollective works based on the Program.\n\nIn addition, mere aggregation of another work not based on the Program\nwith the Program (or with a work based on the Program) on a volume of\na storage or distribution medium does not bring the other work under\nthe scope of this License.\n\n  3. 
You may copy and distribute the Program (or a work based on it,\nunder Section 2) in object code or executable form under the terms of\nSections 1 and 2 above provided that you also do one of the following:\n\n    a) Accompany it with the complete corresponding machine-readable\n    source code, which must be distributed under the terms of Sections\n    1 and 2 above on a medium customarily used for software interchange; or,\n\n    b) Accompany it with a written offer, valid for at least three\n    years, to give any third party, for a charge no more than your\n    cost of physically performing source distribution, a complete\n    machine-readable copy of the corresponding source code, to be\n    distributed under the terms of Sections 1 and 2 above on a medium\n    customarily used for software interchange; or,\n\n    c) Accompany it with the information you received as to the offer\n    to distribute corresponding source code.  (This alternative is\n    allowed only for noncommercial distribution and only if you\n    received the program in object code or executable form with such\n    an offer, in accord with Subsection b above.)\n\nThe source code for a work means the preferred form of the work for\nmaking modifications to it.  For an executable work, complete source\ncode means all the source code for all modules it contains, plus any\nassociated interface definition files, plus the scripts used to\ncontrol compilation and installation of the executable.  
However, as a\nspecial exception, the source code distributed need not include\nanything that is normally distributed (in either source or binary\nform) with the major components (compiler, kernel, and so on) of the\noperating system on which the executable runs, unless that component\nitself accompanies the executable.\n\nIf distribution of executable or object code is made by offering\naccess to copy from a designated place, then offering equivalent\naccess to copy the source code from the same place counts as\ndistribution of the source code, even though third parties are not\ncompelled to copy the source along with the object code.\n\f\n  4. You may not copy, modify, sublicense, or distribute the Program\nexcept as expressly provided under this License.  Any attempt\notherwise to copy, modify, sublicense or distribute the Program is\nvoid, and will automatically terminate your rights under this License.\nHowever, parties who have received copies, or rights, from you under\nthis License will not have their licenses terminated so long as such\nparties remain in full compliance.\n\n  5. You are not required to accept this License, since you have not\nsigned it.  However, nothing else grants you permission to modify or\ndistribute the Program or its derivative works.  These actions are\nprohibited by law if you do not accept this License.  Therefore, by\nmodifying or distributing the Program (or any work based on the\nProgram), you indicate your acceptance of this License to do so, and\nall its terms and conditions for copying, distributing or modifying\nthe Program or works based on it.\n\n  6. Each time you redistribute the Program (or any work based on the\nProgram), the recipient automatically receives a license from the\noriginal licensor to copy, distribute or modify the Program subject to\nthese terms and conditions.  
You may not impose any further\nrestrictions on the recipients' exercise of the rights granted herein.\nYou are not responsible for enforcing compliance by third parties to\nthis License.\n\n  7. If, as a consequence of a court judgment or allegation of patent\ninfringement or for any other reason (not limited to patent issues),\nconditions are imposed on you (whether by court order, agreement or\notherwise) that contradict the conditions of this License, they do not\nexcuse you from the conditions of this License.  If you cannot\ndistribute so as to satisfy simultaneously your obligations under this\nLicense and any other pertinent obligations, then as a consequence you\nmay not distribute the Program at all.  For example, if a patent\nlicense would not permit royalty-free redistribution of the Program by\nall those who receive copies directly or indirectly through you, then\nthe only way you could satisfy both it and this License would be to\nrefrain entirely from distribution of the Program.\n\nIf any portion of this section is held invalid or unenforceable under\nany particular circumstance, the balance of the section is intended to\napply and the section as a whole is intended to apply in other\ncircumstances.\n\nIt is not the purpose of this section to induce you to infringe any\npatents or other property right claims or to contest validity of any\nsuch claims; this section has the sole purpose of protecting the\nintegrity of the free software distribution system, which is\nimplemented by public license practices.  Many people have made\ngenerous contributions to the wide range of software distributed\nthrough that system in reliance on consistent application of that\nsystem; it is up to the author/donor to decide if he or she is willing\nto distribute software through any other system and a licensee cannot\nimpose that choice.\n\nThis section is intended to make thoroughly clear what is believed to\nbe a consequence of the rest of this License.\n\f\n  8. 
If the distribution and/or use of the Program is restricted in\ncertain countries either by patents or by copyrighted interfaces, the\noriginal copyright holder who places the Program under this License\nmay add an explicit geographical distribution limitation excluding\nthose countries, so that distribution is permitted only in or among\ncountries not thus excluded.  In such case, this License incorporates\nthe limitation as if written in the body of this License.\n\n  9. The Free Software Foundation may publish revised and/or new versions\nof the General Public License from time to time.  Such new versions will\nbe similar in spirit to the present version, but may differ in detail to\naddress new problems or concerns.\n\nEach version is given a distinguishing version number.  If the Program\nspecifies a version number of this License which applies to it and \"any\nlater version\", you have the option of following the terms and conditions\neither of that version or of any later version published by the Free\nSoftware Foundation.  If the Program does not specify a version number of\nthis License, you may choose any version ever published by the Free Software\nFoundation.\n\n  10. If you wish to incorporate parts of the Program into other free\nprograms whose distribution conditions are different, write to the author\nto ask for permission.  For software which is copyrighted by the Free\nSoftware Foundation, write to the Free Software Foundation; we sometimes\nmake exceptions for this.  Our decision will be guided by the two goals\nof preserving the free status of all derivatives of our free software and\nof promoting the sharing and reuse of software generally.\n\n\t\t\t    NO WARRANTY\n\n  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY\nFOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  
EXCEPT WHEN\nOTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES\nPROVIDE THE PROGRAM \"AS IS\" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED\nOR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF\nMERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS\nTO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE\nPROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,\nREPAIR OR CORRECTION.\n\n  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING\nWILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR\nREDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,\nINCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING\nOUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED\nTO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY\nYOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER\nPROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE\nPOSSIBILITY OF SUCH DAMAGES.\n\n\t\t     END OF TERMS AND CONDITIONS\n\f\n\t    How to Apply These Terms to Your New Programs\n\n  If you develop a new program, and you want it to be of the greatest\npossible use to the public, the best way to achieve this is to make it\nfree software which everyone can redistribute and change under these terms.\n\n  To do so, attach the following notices to the program.  
It is safest\nto attach them to the start of each source file to most effectively\nconvey the exclusion of warranty; and each file should have at least\nthe \"copyright\" line and a pointer to where the full notice is found.\n\n    <one line to give the program's name and a brief idea of what it does.>\n    Copyright (C) <year>  <name of author>\n\n    This program is free software; you can redistribute it and/or modify\n    it under the terms of the GNU General Public License as published by\n    the Free Software Foundation; either version 2 of the License, or\n    (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU General Public License for more details.\n\n    You should have received a copy of the GNU General Public License\n    along with this program; if not, write to the Free Software\n    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA\n\n\nAlso add information on how to contact you by electronic and paper mail.\n\nIf the program is interactive, make it output a short notice like this\nwhen it starts in an interactive mode:\n\n    Gnomovision version 69, Copyright (C) year  name of author\n    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.\n    This is free software, and you are welcome to redistribute it\n    under certain conditions; type `show c' for details.\n\nThe hypothetical commands `show w' and `show c' should show the appropriate\nparts of the General Public License.  Of course, the commands you use may\nbe called something other than `show w' and `show c'; they could even be\nmouse-clicks or menu items--whatever suits your program.\n\nYou should also get your employer (if you work as a programmer) or your\nschool, if any, to sign a \"copyright disclaimer\" for the program, if\nnecessary.  
Here is a sample; alter the names:\n\n  Yoyodyne, Inc., hereby disclaims all copyright interest in the program\n  `Gnomovision' (which makes passes at compilers) written by James Hacker.\n\n  <signature of Ty Coon>, 1 April 1989\n  Ty Coon, President of Vice\n\nThis General Public License does not permit incorporating your program into\nproprietary programs.  If your program is a subroutine library, you may\nconsider it more useful to permit linking proprietary applications with the\nlibrary.  If this is what you want to do, use the GNU Library General\nPublic License instead of this License.\n\nThis program is also available under a commercial proprietary license.\nFor more information, contact us at sswang @ pku.edu.cn.\n"
  },
  {
    "path": "README.md",
    "content": "# davs2\n**davs2** is an open-source decoder of `AVS2-P2/IEEE1857.4` video coding standard.\n\nAn encoder, **xavs2**, can be found at [Github][2] or  [Gitee (mirror in China)][3].\n\n[![GitHub tag](https://img.shields.io/github/tag/pkuvcl/davs2.svg?style=plastic)]()\n[![GitHub issues](https://img.shields.io/github/issues/pkuvcl/davs2.svg)](https://github.com/pkuvcl/davs2/issues)\n[![GitHub forks](https://img.shields.io/github/forks/pkuvcl/davs2.svg)](https://github.com/pkuvcl/davs2/network)\n[![GitHub stars](https://img.shields.io/github/stars/pkuvcl/davs2.svg)](https://github.com/pkuvcl/davs2/stargazers)\n\nLinux(Ubuntu-16.04):[![Travis Build Status](https://travis-ci.org/pkuvcl/davs2.svg?branch=master)](https://travis-ci.org/pkuvcl/davs2)\nWindows(VS2013):[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/pq0b5mnc6mig6ryp?svg=true)](https://ci.appveyor.com/project/luofalei/davs2/build/artifacts)\n\nStargazers over time\n[![Stargazers over time](https://starcharts.herokuapp.com/pkuvcl/davs2.svg)](https://starcharts.herokuapp.com/pkuvcl/davs2)\n\n## Compile it\n### Windows\nUse VS2013 or latest version of  visual studio open the `./build/vs2013/davs2.sln` solution\n and set the `davs2` as the start project.\n\n#### Notes\n1. A `shell executor`, i.e. the bash in git for windows, is needed and should be found in `PATH` variable.\n For example, the path `C:\\Program Files\\Git\\bin` can be added if git-for-windows is installed.\n2. 
`nasm.exe` with version `2.13` (or later version) is needed and should be put into the `PATH` directory.\n For windows platform, you can downloaded the packege and unpack the zip file to get `nasm.exe`:\nhttps://www.nasm.us/pub/nasm/releasebuilds/2.14.02/win64/nasm-2.14.02-win64.zip\n\n### Linux\n```\n$ cd build/linux\n$ ./configure\n$ make\n```\n\n## Try it\n\nDecode AVS2 stream `test.avs` with `1` thread and output to a *YUV file* named `dec.yuv`.\n```\n./davs2 -i test.avs -t 1 -o dec.yuv\n```\n\nDecode AVS2 stream `test.avs` and display the decoding result via *ffplay*.\n```\n./davs2 -i test.avs -t 1 -o stdout | ffplay -i -\n```\n\n### Parameter Instructions\n|  Parameter       |   Alias     |   Result  |\n| :--------:       | :---------: | :--------------: |\n| --input=test.avs | -i test.avs |  Setting the input bitstream file |\n| --output=dec.yuv | -o dec.yuv  |  Setting the output YUV file |\n| --psnr=rec.yuv   | -r rec.yuv  |  Setting the reference reconstruction YUV file |\n| --threads=N      | -t N        |  Setting the threads for decoding (default: 1) |\n| --md5=M          | -m M        |  Reference MD5, used to check whether the output YUV is right |\n| --verbose        | -v          |  Enable decoding status every frame (Default: Enabled) |\n| --help           | -h          |  Showing this instruction |\n\n## Issue and Pull Request\n\n[Issues should be reported here][6]。\n\nIf you have some bugs fixed or features implemented, and would like to share with the public, please [make a Pull Request][7].\n\n## Homepages\n\n[PKU-VCL][1]\n\n`AVS2-P2/IEEE1857.4` Encoder: [xavs2 (Github)][2], [xavs2 (mirror in China)][3]\n\n`AVS2-P2/IEEE1857.4` Decoder: [davs2 (Github)][4], [davs2 (mirror in China)][5]\n\n  [1]: http://vcl.idm.pku.edu.cn/ \"PKU-VCL\"\n  [2]: https://github.com/pkuvcl/xavs2 \"xavs2 github repository\"\n  [3]: https://gitee.com/pkuvcl/xavs2 \"xavs2 gitee repository\"\n  [4]: https://github.com/pkuvcl/davs2 \"davs2 decoder@github\"\n  [5]: 
https://gitee.com/pkuvcl/davs2 \"davs2 decoder@gitee\"\n  [6]: https://github.com/pkuvcl/davs2/issues \"report issues\"\n  [7]: https://github.com/pkuvcl/davs2/pulls \"pull request\"\n"
  },
  {
    "path": "README.zh.md",
    "content": "# davs2\n\n遵循 `AVS2-P2/IEEE1857.4` 视频编码标准的解码器. \n\n对应的编码器 **xavs2** 可在 [Github][2] 或 [Gitee (mirror in China)][3] 上找到.\n\n[![GitHub tag](https://img.shields.io/github/tag/pkuvcl/davs2.svg?style=plastic)]()\n[![GitHub issues](https://img.shields.io/github/issues/pkuvcl/davs2.svg)](https://github.com/pkuvcl/davs2/issues)\n[![GitHub forks](https://img.shields.io/github/forks/pkuvcl/davs2.svg)](https://github.com/pkuvcl/davs2/network)\n[![GitHub stars](https://img.shields.io/github/stars/pkuvcl/davs2.svg)](https://github.com/pkuvcl/davs2/stargazers)\n\nLinux(Ubuntu-16.04):[![Travis Build Status](https://travis-ci.org/pkuvcl/davs2.svg?branch=master)](https://travis-ci.org/pkuvcl/davs2)\nWindows(VS2013):[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/pq0b5mnc6mig6ryp?svg=true)](https://ci.appveyor.com/project/luofalei/davs2/build/artifacts)\n\n[![Stargazers over time](https://starcharts.herokuapp.com/pkuvcl/davs2.svg)](https://starcharts.herokuapp.com/pkuvcl/davs2)\n\n## 编译方法\n### Windows\n\n可使用`VS2013`打开解决方案`./build/win32/DAVS2.sln`进行编译, 也可以使用更新的vs版本打开上述解决方案.\n打开解决方案后, 将工程`davs2`设置为启动项, 进行编译即可. \n\n#### 注意\n1. 首次编译本项目时, 需要安装一个 `shell 执行器`, 比如 `git-for-windows` 中的 `bash`, \n 需要将该 `bash` 所在的目录添加到系统环境变量 `PATH` 中.\n 如上所述, 如果您以默认配置安装了`git-for-windows`, \n 那么将 `C:\\Program Files\\Git\\bin` 添加到环境变量中即可.\n2. 
需将 `nasm.exe`放入到系统 `PATH` 目录, `nasm`版本号需为`2.13`或更新.\n  对于`windows`平台,可下载如下压缩包中，解压得到`nasm.exe`.\nhttps://www.nasm.us/pub/nasm/releasebuilds/2.14.02/win64/nasm-2.14.02-win64.zip\n\n### Linux\n\n对于linux系统, 依次执行如下命令即可完成编译:\n```\n$ cd build/linux\n$ ./configure\n$ make\n```\n\n## 运行和测试\n\n使用`1`个线程解码AVS2码流文件`test.avs`并将结果输出成YUV文件`dec.yuv`:\n```\n./davs2 -i test.avs -t 1 -o dec.yuv\n```\n\n解码AVS2码流文件`test.avs`并用ffplay播放显示:\n```\n./davs2 -i test.avs -t 1 -o stdout | ffplay -i -\n```\n\n### 参数说明\n|       参数       |  等价形式   |   意义           |\n| :--------:       | :---------: | :--------------: |\n| --input=test.avs | -i test.avs |  设置输入码流文件路径 |\n| --output=dec.yuv | -o dec.yuv  |  设置输出解码YUV文件路径 |\n| --psnr=rec.yuv   | -r rec.yuv  |  设置参考用YUV文件路径, 用于计算PSNR以确定是否匹配 |\n| --threads=N      | -t N        |  设置解码线程数 (默认值: 1) |\n| --md5=M          | -m M        |  设置参考MD5值, 用于验证输出的重构YUV是否匹配 |\n| --verbose        | -v          |  设置每帧是否输出 (默认: 开启) |\n| --help           | -h          |  显示此输出命令 |\n\n## Issue & Pull Request\n\n欢迎提交 issue，请写清楚遇到问题的环境与运行参数，包括操作系统环境、编译器环境等。\n如果可能提供原始输入`YUV/码流文件`，请尽量提供以方便更快地重现结果。\n\n[反馈问题的 issue 请按照模板格式填写][6]。\n\n如果有开发能力，建议在本地调试出错的代码，并[提供相应修正的 Pull Request][7]。\n\n## 主页链接\n\n[北京大学-视频编码算法研究室(PKU-VCL)][1]\n\n`AVS2-P2/IEEE1857.4` Encoder: [xavs2 (Github)][2], [xavs2 (mirror in China)][3]\n\n`AVS2-P2/IEEE1857.4` Decoder: [davs2 (Github)][4], [davs2 (mirror in China)][5]\n\n  [1]: http://vcl.idm.pku.edu.cn/ \"PKU-VCL\"\n  [2]: https://github.com/pkuvcl/xavs2 \"xavs2 github repository\"\n  [3]: https://gitee.com/pkuvcl/xavs2 \"xavs2 gitee repository\"\n  [4]: https://github.com/pkuvcl/davs2 \"davs2 decoder@github\"\n  [5]: https://gitee.com/pkuvcl/davs2 \"davs2 decoder@gitee\"\n  [6]: https://github.com/pkuvcl/davs2/issues \"report issues\"\n  [7]: https://github.com/pkuvcl/davs2/pulls \"pull request\"\n"
  },
  {
    "path": "appveyor.yml",
    "content": "image: Visual Studio 2015\nconfiguration: Release\n\nplatform:\n  - x64\n\ninstall:\n    - git submodule update --init\n    - appveyor DownloadFile https://www.nasm.us/pub/nasm/releasebuilds/2.14.02/win64/nasm-2.14.02-win64.zip -FileName nasm.zip\n    - ps: Expand-Archive -Path nasm.zip -DestinationPath build/vs2013 -Force:$true\n    - set PATH=%PATH%;%APPVEYOR_BUILD_FOLDER%/build/vs2013/nasm-2.14.02\n\nbuild:\n  project: build\\vs2013\\DAVS2.sln\n\nartifacts:\n    - path: build\\bin\\x64_Release\\*.* \n      name: $(APPVEYOR_PROJECT_NAME)\n"
  },
  {
    "path": "build/linux/Makefile",
    "content": "# Makefile\n\ninclude config.mak\n\nvpath %.cc $(SRCPATH)\nvpath %.c $(SRCPATH)\nvpath %.h $(SRCPATH)\nvpath %.S $(SRCPATH)\nvpath %.asm $(SRCPATH)\nvpath %.rc $(SRCPATH)\nCFLAGS += -I$(SRCPATH) -I$(SRCPATH)/.. \\\n\t\t\t-I$(SRCPATH)/x86 \\\n\t\t\t-I$(SRCPATH)/vec\n\nGENERATED =\n\nall: default\ndefault:\n\nSRCS  = common/aec.cc    common/alf.cc \\\n\t\tcommon/bitstream.cc common/block_info.cc \\\n\t\tcommon/common.cc common/davs2.cc common/cpu.cc common/cu.cc \\\n\t\tcommon/deblock.cc common/decoder.cc \\\n\t\tcommon/frame.cc  common/header.cc \\\n\t\tcommon/intra.cc common/mc.cc \\\n\t\tcommon/memory.cc \\\n\t\tcommon/pixel.cc common/predict.cc \\\n\t\tcommon/quant.cc \\\n\t\tcommon/sao.cc common/transform.cc \\\n\t\tcommon/primitives.cc \\\n\t\tcommon/threadpool.cc common/win32thread.cc\n\nSRCCLI = test/test.c\n\nSRCSO =\nOBJS =\nOBJAVX =\nOBJSO =\nOBJCLI =\n\n#OBJCHK = tools/checkasm.o\n\n## CONFIG: $(shell cat config.h)\n## \n## ifneq ($(findstring HAVE_THREAD 1, $(CONFIG)),)\n## SRCS    += common/threadpool.cc\n## endif\n## ifneq ($(findstring HAVE_WIN32THREAD 1, $(CONFIG)),)\n## SRCS    += common/win32thread.cc\n## endif\n\n# MMX/SSE optims\nifneq ($(AS),)\n# asm --------------------------------------------------------------\nX86SRC = common/x86/const-a.asm \\\n\t\t\t\tcommon/x86/blockcopy8.asm \\\n\t\t\t\tcommon/x86/cpu-a.asm \\\n\t\t\t\tcommon/x86/dct8.asm \\\n\t\t\t\tcommon/x86/mc-a2.asm \\\n\t\t\t\tcommon/x86/pixeladd8.asm \\\n\t\t\t\tcommon/x86/quant8.asm\n\nifeq ($(SYS_ARCH),X86)\nARCH_X86 = yes\nASMSRC   = $(X86SRC) \nendif\n\n## Until now, we do not have 64-bit asm\nifeq ($(SYS_ARCH),X86_64)\nARCH_X86 = yes\nSRCS     += common/vec/intrinsic.cc \\\n\t\t\tcommon/vec/intrinsic_alf.cc \\\n\t\t\tcommon/vec/intrinsic_sao.cc \\\n\t\t\tcommon/vec/intrinsic_deblock.cc \\\n\t\t\tcommon/vec/intrinsic_intra-filledge.cc \\\n\t\t\tcommon/vec/intrinsic_intra-pred.cc \\\n\t\t\tcommon/vec/intrinsic_inter_pred.cc \\\n\t\t\tcommon/vec/intrinsic_idct.cc 
\\\n\t\t\tcommon/vec/intrinsic_pixel.cc\n\nSRCSAVX = common/vec/intrinsic_sao_avx2.cc \\\n\t\t  common/vec/intrinsic_deblock_avx2.cc \\\n\t\t  common/vec/intrinsic_intra-pred_avx2.cc \\\n\t\t  common/vec/intrinsic_inter_pred_avx2.cc \\\n\t\t  common/vec/intrinsic_pixel_avx.cc \\\n\t\t  common/vec/intrinsic_idct_avx2.cc\n\nCFLAGS += -mmmx -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2 -msse4a -mssse3 -mavx\n# ASMSRC   = $(X86SRC:-32.asm=-64.asm)\nASMSRC   = $(X86SRC)\nASFLAGS += -DARCH_X86_64=1\nOBJASM  = $(ASMSRC:%.asm=%.o)\n$(OBJASM): common/x86/x86inc.asm common/x86/x86util.asm\nendif\n\nifdef ARCH_X86\nASFLAGS += -I$(SRCPATH)/x86/\n#SRCS    += x86/mc-c.cc x86/predict-c.cc\nOBJASM  = $(ASMSRC:%.asm=%.o)\n$(OBJASM): common/x86/x86inc.asm common/x86/x86util.asm\nendif\n\n# AltiVec optims\nifeq ($(SYS_ARCH),PPC)\nSRCS += common/ppc/mc.cc common/ppc/pixel.cc common/ppc/dct.cc \\\n        common/ppc/quant.cc common/ppc/deblock.cc \\\n        common/ppc/predict.cc\nendif\n\n# NEON optims\nifeq ($(SYS_ARCH),ARM)\n# x264 ARM asm sources\n# ASMSRC += common/arm/cpu-a.S common/arm/pixel-a.S common/arm/mc-a.S \\\n#           common/arm/dct-a.S common/arm/quant-a.S common/arm/deblock-a.S \\\n#           common/arm/predict-a.S common/arm/bitstream-a.S\n# SRCS   += common/arm/mc-c.cc common/arm/predict-c.cc\n# x265 ARM asm sources\nASMSRC += common/arm/blockcopy8.S common/arm/cpu-a.S common/arm/dct-a.S \\\n          common/arm/ipfilter8.S common/arm/mc-a.S common/arm/pixel-util.S \\\n          common/arm/sad-a.S common/arm/ssd-a.S\nOBJASM  = $(ASMSRC:%.S=%.o)\nendif\n\n# AArch64 NEON optims\nifeq ($(SYS_ARCH),AARCH64)\nASMSRC += common/aarch64/bitstream-a.S \\\n          common/aarch64/cabac-a.S     \\\n          common/aarch64/dct-a.S     \\\n          common/aarch64/deblock-a.S \\\n          common/aarch64/mc-a.S      \\\n          common/aarch64/pixel-a.S   \\\n          common/aarch64/predict-a.S \\\n          common/aarch64/quant-a.S\nSRCS   += common/aarch64/asm-offsets.cc 
\\\n          common/aarch64/mc-c.cc        \\\n          common/aarch64/predict-c.cc\nOBJASM  = $(ASMSRC:%.S=%.o)\nOBJCHK += tools/checkasm-aarch64.o\nendif\n\n# MSA optims\nifeq ($(SYS_ARCH),MIPS)\nifneq ($(findstring HAVE_MSA 1, $(CONFIG)),)\nSRCS += common/mips/mc-c.cc common/mips/dct-c.cc \\\n        common/mips/deblock-c.cc common/mips/pixel-c.cc \\\n        common/mips/predict-c.cc common/mips/quant-c.cc\nendif\nendif\n\n# asm --------------------------------------------------------------\nendif \n# here ends ifneq ($(AS),)\n\nifneq ($(HAVE_GETOPT_LONG),1)\nSRCS += compat/getopt/getopt.cc\nendif\n\n## Windows Dll\n## ifeq ($(SYS), WINDOWS)\n## # OBJCLI += $(if $(RC), davs2res.o)\n## ifneq ($(SONAME),)\n## SRCSO   += davs2dll.cc\n## OBJSO   += $(if $(RC), davs2res.dll.o)\n## endif\n## endif\n\nOBJS   += $(SRCS:%.cc=%.o)\nOBJAVX += $(SRCSAVX:%.cc=%.o)\nOBJCLI += $(SRCCLI:%.c=%.o)\nOBJSO  += $(SRCSO:%.cc=%.o)\n\n.PHONY: all default fprofiled clean distclean install install-* uninstall cli lib-* etags\n\ncli: davs2$(EXE)\nlib-static: $(LIBDAVS2)\nlib-shared: $(SONAME)\n\n$(LIBDAVS2): $(GENERATED) .depend $(OBJS) $(OBJAVX) $(OBJASM)\n\t@echo \"\\033[33m [linking static] $(LIBDAVS2) \\033[0m\"\n\trm -f $(LIBDAVS2)\n\t$(AR)$@ $(OBJS) $(OBJAVX) $(OBJASM)\n\t$(if $(RANLIB), $(RANLIB) $@)\n\n$(SONAME): $(GENERATED) .depend $(OBJS) $(OBJAVX) $(OBJASM) $(OBJSO)\n\t@echo \"\\033[33m [linking shared] $(SONAME) \\033[0m\"\n\t$(LD)$@ $(OBJS) $(OBJAVX) $(OBJASM) $(OBJSO) $(SOFLAGS) $(LDFLAGS)\n\nifneq ($(EXE),)\n.PHONY: davs2 checkasm\ndavs2: davs2$(EXE)\ncheckasm: checkasm$(EXE)\nendif\n\ndavs2$(EXE): $(GENERATED) .depend $(OBJCLI) $(CLI_LIBDAVS2)\n\t@echo \"\\033[33m [linking execution] davs2$(EXE) \\033[0m\"\n\t$(LD)$@ $(OBJCLI) $(CLI_LIBDAVS2) $(LDFLAGSCLI) $(LDFLAGS)\n\ncheckasm$(EXE): $(GENERATED) .depend $(OBJCHK) $(LIBDAVS2)\n\t@echo \"\\033[33m [linking checkasm] checkasm$(EXE) \\033[0m\"\n\t$(LD)$@ $(OBJCHK) $(LIBDAVS2) $(LDFLAGS)\n\n$(OBJS) $(OBJAVX) $(OBJASM) 
$(OBJSO) $(OBJCLI) $(OBJCHK): .depend\n\n%.o: %.asm common/x86/x86inc.asm common/x86/x86util.asm\n\t@echo \"\\033[33m [Compiling asm]: $< \\033[0m\"\n\t$(AS) $(ASFLAGS) -o $@ $<\n\t-@ $(if $(STRIP), $(STRIP) -x $@) # delete local/anonymous symbols, so they don't show up in oprofile\n\n%.o: %.S\n\t@echo \"\\033[33m [Compiling asm]: $< \\033[0m\"\n\t$(AS) $(ASFLAGS) -o $@ $<\n\t-@ $(if $(STRIP), $(STRIP) -x $@) # delete local/anonymous symbols, so they don't show up in oprofile\n\n%.dll.o: %.rc davs2.h\n\t@echo \"\\033[33m [Compiling dll]: $< \\033[0m\"\n\t$(RC) $(RCFLAGS)$@ -DDLL $<\n\n%.o: %.rc davs2.h\n\t@echo \"\\033[33m [Compiling rc]: $< \\033[0m\"\n\t$(RC) $(RCFLAGS)$@ $<\n\n$(OBJAVX):\n\t@echo \"\\033[33m [Compiling]: $(@:.o=.cc) \\033[0m\"\n\t$(CC) $(CFLAGS) -mavx2 -c -o $@ $(SRCPATH)/$(@:.o=.cc)\n\n%.o: %.cc\n\t@echo \"\\033[33m [Compiling]: $< \\033[0m\"\n\t$(CC) $(CFLAGS) -c -o $@ $<\n\n%.o: %.c\n\t@echo \"\\033[33m [Compiling]: $< \\033[0m\"\n\t$(CC) $(CFLAGS) -c -o $@ $<\n\n.depend: config.mak\n\t@rm -f .depend\n\t@echo \"\\033[33m dependency file generation... 
\\033[0m\"\nifeq ($(COMPILER),CL)\n\t@$(foreach SRC, $(addprefix $(SRCPATH)/, $(SRCS) $(SRCCLI) $(SRCSO)), $(SRCPATH)/tools/msvsdepend.sh \"$(CC)\" \"$(CFLAGS)\" \"$(SRC)\" \"$(SRC:$(SRCPATH)/%.cc=%.o)\" 1>> .depend;)\n\t@$(foreach SRC, $(addprefix $(SRCPATH)/, $(SRCSAVX)), $(SRCPATH)/tools/msvsdepend.sh \"$(CC)\" \"$(CFLAGS)\" \"$(SRC)\" \"$(SRC:$(SRCPATH)/%.cc=%.o)\" 1>> .depend;)\nelse\n\t@$(foreach SRC, $(addprefix $(SRCPATH)/, $(SRCS) $(SRCCLI) $(SRCSO)), $(CC) $(CFLAGS) $(SRC) $(DEPMT) $(SRC:$(SRCPATH)/%.cc=%.o) $(DEPMM) 1>> .depend;)\n\t@$(foreach SRC, $(addprefix $(SRCPATH)/, $(SRCSAVX)), $(CC) $(CFLAGS) $(SRC) $(DEPMT) $(SRC:$(SRCPATH)/%.cc=%.o) $(DEPMM) 1>> .depend;)\nendif\n\nconfig.mak:\n\t./configure\n\ndepend: .depend\nifneq ($(wildcard .depend),)\ninclude .depend\nendif\n\nSRC2 = $(SRCS) $(SRCCLI)\n# These should cover most of the important codepaths\nOPT0 = --crf 30 -b1 -m1 -r1 --me dia --no-cabac --direct temporal --ssim --no-weightb\nOPT1 = --crf 16 -b2 -m3 -r3 --me hex --no-8x8dct --direct spatial --no-dct-decimate -t0  --slice-max-mbs 50\nOPT2 = --crf 26 -b4 -m5 -r2 --me hex --cqm jvt --nr 100 --psnr --no-mixed-refs --b-adapt 2 --slice-max-size 1500\nOPT3 = --crf 18 -b3 -m9 -r5 --me umh -t1 -A all --b-pyramid normal --direct auto --no-fast-pskip --no-mbtree\nOPT4 = --crf 22 -b3 -m7 -r4 --me esa -t2 -A all --psy-rd 1.0:1.0 --slices 4\nOPT5 = --frames 50 --crf 24 -b3 -m10 -r3 --me tesa -t2\nOPT6 = --frames 50 -q0 -m9 -r2 --me hex -Aall\nOPT7 = --frames 50 -q0 -m2 -r1 --me hex --no-cabac\n\nifeq (,$(VIDS))\nfprofiled:\n\t@echo 'usage: make fprofiled VIDS=\"infile1 infile2 ...\"'\n\t@echo 'where infiles are anything that davs2 understands,'\n\t@echo 'i.e. 
YUV with resolution in the filename, y4m, or avisynth.'\nelse\nfprofiled:\n\t$(MAKE) clean\n\t$(MAKE) davs2$(EXE) CFLAGS=\"$(CFLAGS) $(PROF_GEN_CC)\" LDFLAGS=\"$(LDFLAGS) $(PROF_GEN_LD)\"\n\t$(foreach V, $(VIDS), $(foreach I, 0 1 2 3 4 5 6 7, ./davs2$(EXE) $(OPT$I) --threads 1 $(V) -o $(DEVNULL) ;))\nifeq ($(COMPILER),CL)\n# Because Visual Studio timestamps the object files within the PGD, it fails to build if they change - only the executable should be deleted\n\trm -f davs2$(EXE)\nelse\n\trm -f $(SRC2:%.cc=%.o)\nendif\n\t$(MAKE) CFLAGS=\"$(CFLAGS) $(PROF_USE_CC)\" LDFLAGS=\"$(LDFLAGS) $(PROF_USE_LD)\"\n\trm -f $(SRC2:%.cc=%.gcda) $(SRC2:%.cc=%.gcno) *.dyn pgopti.dpi pgopti.dpi.lock *.pgd *.pgc\nendif\n\nclean:\n\trm -f $(OBJS) $(OBJAVX) $(OBJASM) $(OBJCLI) $(OBJSO) $(SONAME)\n\trm -f *.a *.lib *.exp *.pdb libdavs2.so* davs2 davs2.exe .depend TAGS\n\trm -f checkasm checkasm.exe $(OBJCHK) $(GENERATED) davs2_lookahead.cclbin\n\trm -f example example.exe $(OBJEXAMPLE)\n\trm -f $(SRC2:%.cc=%.gcda) $(SRC2:%.cc=%.gcno) *.dyn pgopti.dpi pgopti.dpi.lock *.pgd *.pgc\n\ndistclean: clean\n\trm -f config.mak davs2_config.h config.h config.log davs2.pc davs2.def conftest*\n\ninstall-cli: cli\n\t$(INSTALL) -d $(DESTDIR)$(bindir)\n\t$(INSTALL) davs2$(EXE) $(DESTDIR)$(bindir)\n\ninstall-lib-dev:\n\t$(INSTALL) -d $(DESTDIR)$(includedir)\n\t$(INSTALL) -d $(DESTDIR)$(libdir)\n\t$(INSTALL) -d $(DESTDIR)$(libdir)/pkgconfig\n\t$(INSTALL) -m 644 $(SRCPATH)/davs2.h $(DESTDIR)$(includedir)\n\t$(INSTALL) -m 644 davs2_config.h $(DESTDIR)$(includedir)\n\t$(INSTALL) -m 644 davs2.pc $(DESTDIR)$(libdir)/pkgconfig\n\ninstall-lib-static: lib-static install-lib-dev\n\t$(INSTALL) -m 644 $(LIBDAVS2) $(DESTDIR)$(libdir)\n\t$(if $(RANLIB), $(RANLIB) $(DESTDIR)$(libdir)/$(LIBDAVS2))\n\ninstall-lib-shared: lib-shared install-lib-dev\nifneq ($(IMPLIBNAME),)\n\t$(INSTALL) -d $(DESTDIR)$(bindir)\n\t$(INSTALL) -m 755 $(SONAME) $(DESTDIR)$(bindir)\n\t$(INSTALL) -m 644 $(IMPLIBNAME) $(DESTDIR)$(libdir)\nelse ifneq
($(SONAME),)\n\tln -f -s $(SONAME) $(DESTDIR)$(libdir)/libdavs2.$(SOSUFFIX)\n\t$(INSTALL) -m 755 $(SONAME) $(DESTDIR)$(libdir)\nendif\n\nuninstall:\n\trm -f $(DESTDIR)$(includedir)/davs2.h $(DESTDIR)$(includedir)/davs2_config.h $(DESTDIR)$(libdir)/libdavs2.a\n\trm -f $(DESTDIR)$(bindir)/davs2$(EXE) $(DESTDIR)$(libdir)/pkgconfig/davs2.pc\nifneq ($(IMPLIBNAME),)\n\trm -f $(DESTDIR)$(bindir)/$(SONAME) $(DESTDIR)$(libdir)/$(IMPLIBNAME)\nelse ifneq ($(SONAME),)\n\trm -f $(DESTDIR)$(libdir)/$(SONAME) $(DESTDIR)$(libdir)/libdavs2.$(SOSUFFIX)\nendif\n\netags: TAGS\n\nTAGS:\n\tetags $(SRCS)\n"
  },
  {
    "path": "build/linux/config.guess",
    "content": "#! /bin/sh\n# Attempt to guess a canonical system name.\n#   Copyright 1992-2017 Free Software Foundation, Inc.\n\ntimestamp='2017-05-27'\n\n# This file is free software; you can redistribute it and/or modify it\n# under the terms of the GNU General Public License as published by\n# the Free Software Foundation; either version 3 of the License, or\n# (at your option) any later version.\n#\n# This program is distributed in the hope that it will be useful, but\n# WITHOUT ANY WARRANTY; without even the implied warranty of\n# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU\n# General Public License for more details.\n#\n# You should have received a copy of the GNU General Public License\n# along with this program; if not, see <http://www.gnu.org/licenses/>.\n#\n# As a special exception to the GNU General Public License, if you\n# distribute this file as part of a program that contains a\n# configuration script generated by Autoconf, you may include it under\n# the same distribution terms that you use for the rest of that\n# program.  
This Exception is an additional permission under section 7\n# of the GNU General Public License, version 3 (\"GPLv3\").\n#\n# Originally written by Per Bothner; maintained since 2000 by Ben Elliston.\n#\n# You can get the latest version of this script from:\n# http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess\n#\n# Please send patches to <config-patches@gnu.org>.\n\n\nme=`echo \"$0\" | sed -e 's,.*/,,'`\n\nusage=\"\\\nUsage: $0 [OPTION]\n\nOutput the configuration name of the system \\`$me' is run on.\n\nOperation modes:\n  -h, --help         print this help, then exit\n  -t, --time-stamp   print date of last modification, then exit\n  -v, --version      print version number, then exit\n\nReport bugs and patches to <config-patches@gnu.org>.\"\n\nversion=\"\\\nGNU config.guess ($timestamp)\n\nOriginally written by Per Bothner.\nCopyright 1992-2017 Free Software Foundation, Inc.\n\nThis is free software; see the source for copying conditions.  There is NO\nwarranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\"\n\nhelp=\"\nTry \\`$me --help' for more information.\"\n\n# Parse command line\nwhile test $# -gt 0 ; do\n  case $1 in\n    --time-stamp | --time* | -t )\n       echo \"$timestamp\" ; exit ;;\n    --version | -v )\n       echo \"$version\" ; exit ;;\n    --help | --h* | -h )\n       echo \"$usage\"; exit ;;\n    -- )     # Stop option processing\n       shift; break ;;\n    - )\t# Use stdin as input.\n       break ;;\n    -* )\n       echo \"$me: invalid option $1$help\" >&2\n       exit 1 ;;\n    * )\n       break ;;\n  esac\ndone\n\nif test $# != 0; then\n  echo \"$me: too many arguments$help\" >&2\n  exit 1\nfi\n\ntrap 'exit 1' 1 2 15\n\n# CC_FOR_BUILD -- compiler used by this script. 
Note that the use of a\n# compiler to aid in system detection is discouraged as it requires\n# temporary files to be created and, as you can see below, it is a\n# headache to deal with in a portable fashion.\n\n# Historically, `CC_FOR_BUILD' used to be named `HOST_CC'. We still\n# use `HOST_CC' if defined, but it is deprecated.\n\n# Portable tmp directory creation inspired by the Autoconf team.\n\nset_cc_for_build='\ntrap \"exitcode=\\$?; (rm -f \\$tmpfiles 2>/dev/null; rmdir \\$tmp 2>/dev/null) && exit \\$exitcode\" 0 ;\ntrap \"rm -f \\$tmpfiles 2>/dev/null; rmdir \\$tmp 2>/dev/null; exit 1\" 1 2 13 15 ;\n: ${TMPDIR=/tmp} ;\n { tmp=`(umask 077 && mktemp -d \"$TMPDIR/cgXXXXXX\") 2>/dev/null` && test -n \"$tmp\" && test -d \"$tmp\" ; } ||\n { test -n \"$RANDOM\" && tmp=$TMPDIR/cg$$-$RANDOM && (umask 077 && mkdir $tmp) ; } ||\n { tmp=$TMPDIR/cg-$$ && (umask 077 && mkdir $tmp) && echo \"Warning: creating insecure temp directory\" >&2 ; } ||\n { echo \"$me: cannot create a temporary directory in $TMPDIR\" >&2 ; exit 1 ; } ;\ndummy=$tmp/dummy ;\ntmpfiles=\"$dummy.c $dummy.o $dummy.rel $dummy\" ;\ncase $CC_FOR_BUILD,$HOST_CC,$CC in\n ,,)    echo \"int x;\" > $dummy.c ;\n\tfor c in cc gcc c89 c99 ; do\n\t  if ($c -c -o $dummy.o $dummy.c) >/dev/null 2>&1 ; then\n\t     CC_FOR_BUILD=\"$c\"; break ;\n\t  fi ;\n\tdone ;\n\tif test x\"$CC_FOR_BUILD\" = x ; then\n\t  CC_FOR_BUILD=no_compiler_found ;\n\tfi\n\t;;\n ,,*)   CC_FOR_BUILD=$CC ;;\n ,*,*)  CC_FOR_BUILD=$HOST_CC ;;\nesac ; set_cc_for_build= ;'\n\n# This is needed to find uname on a Pyramid OSx when run in the BSD universe.\n# (ghazi@noc.rutgers.edu 1994-08-24)\nif (test -f /.attbin/uname) >/dev/null 2>&1 ; then\n\tPATH=$PATH:/.attbin ; export PATH\nfi\n\nUNAME_MACHINE=`(uname -m) 2>/dev/null` || UNAME_MACHINE=unknown\nUNAME_RELEASE=`(uname -r) 2>/dev/null` || UNAME_RELEASE=unknown\nUNAME_SYSTEM=`(uname -s) 2>/dev/null`  || UNAME_SYSTEM=unknown\nUNAME_VERSION=`(uname -v) 2>/dev/null` || UNAME_VERSION=unknown\n\ncase 
\"${UNAME_SYSTEM}\" in\nLinux|GNU|GNU/*)\n\t# If the system lacks a compiler, then just pick glibc.\n\t# We could probably try harder.\n\tLIBC=gnu\n\n\teval $set_cc_for_build\n\tcat <<-EOF > $dummy.c\n\t#include <features.h>\n\t#if defined(__UCLIBC__)\n\tLIBC=uclibc\n\t#elif defined(__dietlibc__)\n\tLIBC=dietlibc\n\t#else\n\tLIBC=gnu\n\t#endif\n\tEOF\n\teval `$CC_FOR_BUILD -E $dummy.c 2>/dev/null | grep '^LIBC' | sed 's, ,,g'`\n\t;;\nesac\n\n# Note: order is significant - the case branches are not exclusive.\n\ncase \"${UNAME_MACHINE}:${UNAME_SYSTEM}:${UNAME_RELEASE}:${UNAME_VERSION}\" in\n    *:NetBSD:*:*)\n\t# NetBSD (nbsd) targets should (where applicable) match one or\n\t# more of the tuples: *-*-netbsdelf*, *-*-netbsdaout*,\n\t# *-*-netbsdecoff* and *-*-netbsd*.  For targets that recently\n\t# switched to ELF, *-*-netbsd* would select the old\n\t# object file format.  This provides both forward\n\t# compatibility and a consistent mechanism for selecting the\n\t# object file format.\n\t#\n\t# Note: NetBSD doesn't particularly care about the vendor\n\t# portion of the name.  
We always set it to \"unknown\".\n\tsysctl=\"sysctl -n hw.machine_arch\"\n\tUNAME_MACHINE_ARCH=`(uname -p 2>/dev/null || \\\n\t    /sbin/$sysctl 2>/dev/null || \\\n\t    /usr/sbin/$sysctl 2>/dev/null || \\\n\t    echo unknown)`\n\tcase \"${UNAME_MACHINE_ARCH}\" in\n\t    armeb) machine=armeb-unknown ;;\n\t    arm*) machine=arm-unknown ;;\n\t    sh3el) machine=shl-unknown ;;\n\t    sh3eb) machine=sh-unknown ;;\n\t    sh5el) machine=sh5le-unknown ;;\n\t    earmv*)\n\t\tarch=`echo ${UNAME_MACHINE_ARCH} | sed -e 's,^e\\(armv[0-9]\\).*$,\\1,'`\n\t\tendian=`echo ${UNAME_MACHINE_ARCH} | sed -ne 's,^.*\\(eb\\)$,\\1,p'`\n\t\tmachine=${arch}${endian}-unknown\n\t\t;;\n\t    *) machine=${UNAME_MACHINE_ARCH}-unknown ;;\n\tesac\n\t# The Operating System including object format, if it has switched\n\t# to ELF recently (or will in the future) and ABI.\n\tcase \"${UNAME_MACHINE_ARCH}\" in\n\t    earm*)\n\t\tos=netbsdelf\n\t\t;;\n\t    arm*|i386|m68k|ns32k|sh3*|sparc|vax)\n\t\teval $set_cc_for_build\n\t\tif echo __ELF__ | $CC_FOR_BUILD -E - 2>/dev/null \\\n\t\t\t| grep -q __ELF__\n\t\tthen\n\t\t    # Once all utilities can be ECOFF (netbsdecoff) or a.out (netbsdaout).\n\t\t    # Return netbsd for either.  FIX?\n\t\t    os=netbsd\n\t\telse\n\t\t    os=netbsdelf\n\t\tfi\n\t\t;;\n\t    *)\n\t\tos=netbsd\n\t\t;;\n\tesac\n\t# Determine ABI tags.\n\tcase \"${UNAME_MACHINE_ARCH}\" in\n\t    earm*)\n\t\texpr='s/^earmv[0-9]/-eabi/;s/eb$//'\n\t\tabi=`echo ${UNAME_MACHINE_ARCH} | sed -e \"$expr\"`\n\t\t;;\n\tesac\n\t# The OS release\n\t# Debian GNU/NetBSD machines have a different userland, and\n\t# thus, need a distinct triplet. However, they do not need\n\t# kernel version information, so it can be replaced with a\n\t# suitable tag, in the style of linux-gnu.\n\tcase \"${UNAME_VERSION}\" in\n\t    Debian*)\n\t\trelease='-gnu'\n\t\t;;\n\t    *)\n\t\trelease=`echo ${UNAME_RELEASE} | sed -e 's/[-_].*//' | cut -d. 
-f1,2`\n\t\t;;\n\tesac\n\t# Since CPU_TYPE-MANUFACTURER-KERNEL-OPERATING_SYSTEM:\n\t# contains redundant information, the shorter form:\n\t# CPU_TYPE-MANUFACTURER-OPERATING_SYSTEM is used.\n\techo \"${machine}-${os}${release}${abi}\"\n\texit ;;\n    *:Bitrig:*:*)\n\tUNAME_MACHINE_ARCH=`arch | sed 's/Bitrig.//'`\n\techo ${UNAME_MACHINE_ARCH}-unknown-bitrig${UNAME_RELEASE}\n\texit ;;\n    *:OpenBSD:*:*)\n\tUNAME_MACHINE_ARCH=`arch | sed 's/OpenBSD.//'`\n\techo ${UNAME_MACHINE_ARCH}-unknown-openbsd${UNAME_RELEASE}\n\texit ;;\n    *:LibertyBSD:*:*)\n\tUNAME_MACHINE_ARCH=`arch | sed 's/^.*BSD\\.//'`\n\techo ${UNAME_MACHINE_ARCH}-unknown-libertybsd${UNAME_RELEASE}\n\texit ;;\n    *:ekkoBSD:*:*)\n\techo ${UNAME_MACHINE}-unknown-ekkobsd${UNAME_RELEASE}\n\texit ;;\n    *:SolidBSD:*:*)\n\techo ${UNAME_MACHINE}-unknown-solidbsd${UNAME_RELEASE}\n\texit ;;\n    macppc:MirBSD:*:*)\n\techo powerpc-unknown-mirbsd${UNAME_RELEASE}\n\texit ;;\n    *:MirBSD:*:*)\n\techo ${UNAME_MACHINE}-unknown-mirbsd${UNAME_RELEASE}\n\texit ;;\n    *:Sortix:*:*)\n\techo ${UNAME_MACHINE}-unknown-sortix\n\texit ;;\n    alpha:OSF1:*:*)\n\tcase $UNAME_RELEASE in\n\t*4.0)\n\t\tUNAME_RELEASE=`/usr/sbin/sizer -v | awk '{print $3}'`\n\t\t;;\n\t*5.*)\n\t\tUNAME_RELEASE=`/usr/sbin/sizer -v | awk '{print $4}'`\n\t\t;;\n\tesac\n\t# According to Compaq, /usr/sbin/psrinfo has been available on\n\t# OSF/1 and Tru64 systems produced since 1995.  I hope that\n\t# covers most systems running today.  
This code pipes the CPU\n\t# types through head -n 1, so we only detect the type of CPU 0.\n\tALPHA_CPU_TYPE=`/usr/sbin/psrinfo -v | sed -n -e 's/^  The alpha \\(.*\\) processor.*$/\\1/p' | head -n 1`\n\tcase \"$ALPHA_CPU_TYPE\" in\n\t    \"EV4 (21064)\")\n\t\tUNAME_MACHINE=alpha ;;\n\t    \"EV4.5 (21064)\")\n\t\tUNAME_MACHINE=alpha ;;\n\t    \"LCA4 (21066/21068)\")\n\t\tUNAME_MACHINE=alpha ;;\n\t    \"EV5 (21164)\")\n\t\tUNAME_MACHINE=alphaev5 ;;\n\t    \"EV5.6 (21164A)\")\n\t\tUNAME_MACHINE=alphaev56 ;;\n\t    \"EV5.6 (21164PC)\")\n\t\tUNAME_MACHINE=alphapca56 ;;\n\t    \"EV5.7 (21164PC)\")\n\t\tUNAME_MACHINE=alphapca57 ;;\n\t    \"EV6 (21264)\")\n\t\tUNAME_MACHINE=alphaev6 ;;\n\t    \"EV6.7 (21264A)\")\n\t\tUNAME_MACHINE=alphaev67 ;;\n\t    \"EV6.8CB (21264C)\")\n\t\tUNAME_MACHINE=alphaev68 ;;\n\t    \"EV6.8AL (21264B)\")\n\t\tUNAME_MACHINE=alphaev68 ;;\n\t    \"EV6.8CX (21264D)\")\n\t\tUNAME_MACHINE=alphaev68 ;;\n\t    \"EV6.9A (21264/EV69A)\")\n\t\tUNAME_MACHINE=alphaev69 ;;\n\t    \"EV7 (21364)\")\n\t\tUNAME_MACHINE=alphaev7 ;;\n\t    \"EV7.9 (21364A)\")\n\t\tUNAME_MACHINE=alphaev79 ;;\n\tesac\n\t# A Pn.n version is a patched version.\n\t# A Vn.n version is a released version.\n\t# A Tn.n version is a released field test version.\n\t# A Xn.n version is an unreleased experimental baselevel.\n\t# 1.2 uses \"1.2\" for uname -r.\n\techo ${UNAME_MACHINE}-dec-osf`echo ${UNAME_RELEASE} | sed -e 's/^[PVTX]//' | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz`\n\t# Reset EXIT trap before exiting to avoid spurious non-zero exit code.\n\texitcode=$?\n\ttrap '' 0\n\texit $exitcode ;;\n    Alpha\\ *:Windows_NT*:*)\n\t# How do we know it's Interix rather than the generic POSIX subsystem?\n\t# Should we change UNAME_MACHINE based on the output of uname instead\n\t# of the specific Alpha model?\n\techo alpha-pc-interix\n\texit ;;\n    21064:Windows_NT:50:3)\n\techo alpha-dec-winnt3.5\n\texit ;;\n    Amiga*:UNIX_System_V:4.0:*)\n\techo m68k-unknown-sysv4\n\texit ;;\n 
   *:[Aa]miga[Oo][Ss]:*:*)\n\techo ${UNAME_MACHINE}-unknown-amigaos\n\texit ;;\n    *:[Mm]orph[Oo][Ss]:*:*)\n\techo ${UNAME_MACHINE}-unknown-morphos\n\texit ;;\n    *:OS/390:*:*)\n\techo i370-ibm-openedition\n\texit ;;\n    *:z/VM:*:*)\n\techo s390-ibm-zvmoe\n\texit ;;\n    *:OS400:*:*)\n\techo powerpc-ibm-os400\n\texit ;;\n    arm:RISC*:1.[012]*:*|arm:riscix:1.[012]*:*)\n\techo arm-acorn-riscix${UNAME_RELEASE}\n\texit ;;\n    arm*:riscos:*:*|arm*:RISCOS:*:*)\n\techo arm-unknown-riscos\n\texit ;;\n    SR2?01:HI-UX/MPP:*:* | SR8000:HI-UX/MPP:*:*)\n\techo hppa1.1-hitachi-hiuxmpp\n\texit ;;\n    Pyramid*:OSx*:*:* | MIS*:OSx*:*:* | MIS*:SMP_DC-OSx*:*:*)\n\t# akee@wpdis03.wpafb.af.mil (Earle F. Ake) contributed MIS and NILE.\n\tif test \"`(/bin/universe) 2>/dev/null`\" = att ; then\n\t\techo pyramid-pyramid-sysv3\n\telse\n\t\techo pyramid-pyramid-bsd\n\tfi\n\texit ;;\n    NILE*:*:*:dcosx)\n\techo pyramid-pyramid-svr4\n\texit ;;\n    DRS?6000:unix:4.0:6*)\n\techo sparc-icl-nx6\n\texit ;;\n    DRS?6000:UNIX_SV:4.2*:7* | DRS?6000:isis:4.2*:7*)\n\tcase `/usr/bin/uname -p` in\n\t    sparc) echo sparc-icl-nx7; exit ;;\n\tesac ;;\n    s390x:SunOS:*:*)\n\techo ${UNAME_MACHINE}-ibm-solaris2`echo ${UNAME_RELEASE}|sed -e 's/[^.]*//'`\n\texit ;;\n    sun4H:SunOS:5.*:*)\n\techo sparc-hal-solaris2`echo ${UNAME_RELEASE}|sed -e 's/[^.]*//'`\n\texit ;;\n    sun4*:SunOS:5.*:* | tadpole*:SunOS:5.*:*)\n\techo sparc-sun-solaris2`echo ${UNAME_RELEASE}|sed -e 's/[^.]*//'`\n\texit ;;\n    i86pc:AuroraUX:5.*:* | i86xen:AuroraUX:5.*:*)\n\techo i386-pc-auroraux${UNAME_RELEASE}\n\texit ;;\n    i86pc:SunOS:5.*:* | i86xen:SunOS:5.*:*)\n\teval $set_cc_for_build\n\tSUN_ARCH=i386\n\t# If there is a compiler, see if it is configured for 64-bit objects.\n\t# Note that the Sun cc does not turn __LP64__ into 1 like gcc does.\n\t# This test works for both compilers.\n\tif [ \"$CC_FOR_BUILD\" != no_compiler_found ]; then\n\t    if (echo '#ifdef __amd64'; echo IS_64BIT_ARCH; echo '#endif') | 
\\\n\t\t(CCOPTS=\"\" $CC_FOR_BUILD -E - 2>/dev/null) | \\\n\t\tgrep IS_64BIT_ARCH >/dev/null\n\t    then\n\t\tSUN_ARCH=x86_64\n\t    fi\n\tfi\n\techo ${SUN_ARCH}-pc-solaris2`echo ${UNAME_RELEASE}|sed -e 's/[^.]*//'`\n\texit ;;\n    sun4*:SunOS:6*:*)\n\t# According to config.sub, this is the proper way to canonicalize\n\t# SunOS6.  Hard to guess exactly what SunOS6 will be like, but\n\t# it's likely to be more like Solaris than SunOS4.\n\techo sparc-sun-solaris3`echo ${UNAME_RELEASE}|sed -e 's/[^.]*//'`\n\texit ;;\n    sun4*:SunOS:*:*)\n\tcase \"`/usr/bin/arch -k`\" in\n\t    Series*|S4*)\n\t\tUNAME_RELEASE=`uname -v`\n\t\t;;\n\tesac\n\t# Japanese Language versions have a version number like `4.1.3-JL'.\n\techo sparc-sun-sunos`echo ${UNAME_RELEASE}|sed -e 's/-/_/'`\n\texit ;;\n    sun3*:SunOS:*:*)\n\techo m68k-sun-sunos${UNAME_RELEASE}\n\texit ;;\n    sun*:*:4.2BSD:*)\n\tUNAME_RELEASE=`(sed 1q /etc/motd | awk '{print substr($5,1,3)}') 2>/dev/null`\n\ttest \"x${UNAME_RELEASE}\" = x && UNAME_RELEASE=3\n\tcase \"`/bin/arch`\" in\n\t    sun3)\n\t\techo m68k-sun-sunos${UNAME_RELEASE}\n\t\t;;\n\t    sun4)\n\t\techo sparc-sun-sunos${UNAME_RELEASE}\n\t\t;;\n\tesac\n\texit ;;\n    aushp:SunOS:*:*)\n\techo sparc-auspex-sunos${UNAME_RELEASE}\n\texit ;;\n    # The situation for MiNT is a little confusing.  The machine name\n    # can be virtually everything (everything which is not\n    # \"atarist\" or \"atariste\" at least should have a processor\n    # > m68000).  The system name ranges from \"MiNT\" over \"FreeMiNT\"\n    # to the lowercase version \"mint\" (or \"freemint\").  Finally\n    # the system name \"TOS\" denotes a system which is actually not\n    # MiNT.  
But MiNT is downward compatible to TOS, so this should\n    # be no problem.\n    atarist[e]:*MiNT:*:* | atarist[e]:*mint:*:* | atarist[e]:*TOS:*:*)\n\techo m68k-atari-mint${UNAME_RELEASE}\n\texit ;;\n    atari*:*MiNT:*:* | atari*:*mint:*:* | atarist[e]:*TOS:*:*)\n\techo m68k-atari-mint${UNAME_RELEASE}\n\texit ;;\n    *falcon*:*MiNT:*:* | *falcon*:*mint:*:* | *falcon*:*TOS:*:*)\n\techo m68k-atari-mint${UNAME_RELEASE}\n\texit ;;\n    milan*:*MiNT:*:* | milan*:*mint:*:* | *milan*:*TOS:*:*)\n\techo m68k-milan-mint${UNAME_RELEASE}\n\texit ;;\n    hades*:*MiNT:*:* | hades*:*mint:*:* | *hades*:*TOS:*:*)\n\techo m68k-hades-mint${UNAME_RELEASE}\n\texit ;;\n    *:*MiNT:*:* | *:*mint:*:* | *:*TOS:*:*)\n\techo m68k-unknown-mint${UNAME_RELEASE}\n\texit ;;\n    m68k:machten:*:*)\n\techo m68k-apple-machten${UNAME_RELEASE}\n\texit ;;\n    powerpc:machten:*:*)\n\techo powerpc-apple-machten${UNAME_RELEASE}\n\texit ;;\n    RISC*:Mach:*:*)\n\techo mips-dec-mach_bsd4.3\n\texit ;;\n    RISC*:ULTRIX:*:*)\n\techo mips-dec-ultrix${UNAME_RELEASE}\n\texit ;;\n    VAX*:ULTRIX*:*:*)\n\techo vax-dec-ultrix${UNAME_RELEASE}\n\texit ;;\n    2020:CLIX:*:* | 2430:CLIX:*:*)\n\techo clipper-intergraph-clix${UNAME_RELEASE}\n\texit ;;\n    mips:*:*:UMIPS | mips:*:*:RISCos)\n\teval $set_cc_for_build\n\tsed 's/^\t//' << EOF >$dummy.c\n#ifdef __cplusplus\n#include <stdio.h>  /* for printf() prototype */\n\tint main (int argc, char *argv[]) {\n#else\n\tint main (argc, argv) int argc; char *argv[]; {\n#endif\n\t#if defined (host_mips) && defined (MIPSEB)\n\t#if defined (SYSTYPE_SYSV)\n\t  printf (\"mips-mips-riscos%ssysv\\n\", argv[1]); exit (0);\n\t#endif\n\t#if defined (SYSTYPE_SVR4)\n\t  printf (\"mips-mips-riscos%ssvr4\\n\", argv[1]); exit (0);\n\t#endif\n\t#if defined (SYSTYPE_BSD43) || defined(SYSTYPE_BSD)\n\t  printf (\"mips-mips-riscos%sbsd\\n\", argv[1]); exit (0);\n\t#endif\n\t#endif\n\t  exit (-1);\n\t}\nEOF\n\t$CC_FOR_BUILD -o $dummy $dummy.c &&\n\t  dummyarg=`echo \"${UNAME_RELEASE}\" | sed -n 
's/\\([0-9]*\\).*/\\1/p'` &&\n\t  SYSTEM_NAME=`$dummy $dummyarg` &&\n\t    { echo \"$SYSTEM_NAME\"; exit; }\n\techo mips-mips-riscos${UNAME_RELEASE}\n\texit ;;\n    Motorola:PowerMAX_OS:*:*)\n\techo powerpc-motorola-powermax\n\texit ;;\n    Motorola:*:4.3:PL8-*)\n\techo powerpc-harris-powermax\n\texit ;;\n    Night_Hawk:*:*:PowerMAX_OS | Synergy:PowerMAX_OS:*:*)\n\techo powerpc-harris-powermax\n\texit ;;\n    Night_Hawk:Power_UNIX:*:*)\n\techo powerpc-harris-powerunix\n\texit ;;\n    m88k:CX/UX:7*:*)\n\techo m88k-harris-cxux7\n\texit ;;\n    m88k:*:4*:R4*)\n\techo m88k-motorola-sysv4\n\texit ;;\n    m88k:*:3*:R3*)\n\techo m88k-motorola-sysv3\n\texit ;;\n    AViiON:dgux:*:*)\n\t# DG/UX returns AViiON for all architectures\n\tUNAME_PROCESSOR=`/usr/bin/uname -p`\n\tif [ $UNAME_PROCESSOR = mc88100 ] || [ $UNAME_PROCESSOR = mc88110 ]\n\tthen\n\t    if [ ${TARGET_BINARY_INTERFACE}x = m88kdguxelfx ] || \\\n\t       [ ${TARGET_BINARY_INTERFACE}x = x ]\n\t    then\n\t\techo m88k-dg-dgux${UNAME_RELEASE}\n\t    else\n\t\techo m88k-dg-dguxbcs${UNAME_RELEASE}\n\t    fi\n\telse\n\t    echo i586-dg-dgux${UNAME_RELEASE}\n\tfi\n\texit ;;\n    M88*:DolphinOS:*:*)\t# DolphinOS (SVR3)\n\techo m88k-dolphin-sysv3\n\texit ;;\n    M88*:*:R3*:*)\n\t# Delta 88k system running SVR3\n\techo m88k-motorola-sysv3\n\texit ;;\n    XD88*:*:*:*) # Tektronix XD88 system running UTekV (SVR3)\n\techo m88k-tektronix-sysv3\n\texit ;;\n    Tek43[0-9][0-9]:UTek:*:*) # Tektronix 4300 system running UTek (BSD)\n\techo m68k-tektronix-bsd\n\texit ;;\n    *:IRIX*:*:*)\n\techo mips-sgi-irix`echo ${UNAME_RELEASE}|sed -e 's/-/_/g'`\n\texit ;;\n    ????????:AIX?:[12].1:2)   # AIX 2.2.1 or AIX 2.1.1 is RT/PC AIX.\n\techo romp-ibm-aix     # uname -m gives an 8 hex-code CPU id\n\texit ;;               # Note that: echo \"'`uname -s`'\" gives 'AIX '\n    i*86:AIX:*:*)\n\techo i386-ibm-aix\n\texit ;;\n    ia64:AIX:*:*)\n\tif [ -x /usr/bin/oslevel ] ; 
then\n\t\tIBM_REV=`/usr/bin/oslevel`\n\telse\n\t\tIBM_REV=${UNAME_VERSION}.${UNAME_RELEASE}\n\tfi\n\techo ${UNAME_MACHINE}-ibm-aix${IBM_REV}\n\texit ;;\n    *:AIX:2:3)\n\tif grep bos325 /usr/include/stdio.h >/dev/null 2>&1; then\n\t\teval $set_cc_for_build\n\t\tsed 's/^\t\t//' << EOF >$dummy.c\n\t\t#include <sys/systemcfg.h>\n\n\t\tmain()\n\t\t\t{\n\t\t\tif (!__power_pc())\n\t\t\t\texit(1);\n\t\t\tputs(\"powerpc-ibm-aix3.2.5\");\n\t\t\texit(0);\n\t\t\t}\nEOF\n\t\tif $CC_FOR_BUILD -o $dummy $dummy.c && SYSTEM_NAME=`$dummy`\n\t\tthen\n\t\t\techo \"$SYSTEM_NAME\"\n\t\telse\n\t\t\techo rs6000-ibm-aix3.2.5\n\t\tfi\n\telif grep bos324 /usr/include/stdio.h >/dev/null 2>&1; then\n\t\techo rs6000-ibm-aix3.2.4\n\telse\n\t\techo rs6000-ibm-aix3.2\n\tfi\n\texit ;;\n    *:AIX:*:[4567])\n\tIBM_CPU_ID=`/usr/sbin/lsdev -C -c processor -S available | sed 1q | awk '{ print $1 }'`\n\tif /usr/sbin/lsattr -El ${IBM_CPU_ID} | grep ' POWER' >/dev/null 2>&1; then\n\t\tIBM_ARCH=rs6000\n\telse\n\t\tIBM_ARCH=powerpc\n\tfi\n\tif [ -x /usr/bin/lslpp ] ; then\n\t\tIBM_REV=`/usr/bin/lslpp -Lqc bos.rte.libc |\n\t\t\t   awk -F: '{ print $3 }' | sed s/[0-9]*$/0/`\n\telse\n\t\tIBM_REV=${UNAME_VERSION}.${UNAME_RELEASE}\n\tfi\n\techo ${IBM_ARCH}-ibm-aix${IBM_REV}\n\texit ;;\n    *:AIX:*:*)\n\techo rs6000-ibm-aix\n\texit ;;\n    ibmrt:4.4BSD:*|romp-ibm:BSD:*)\n\techo romp-ibm-bsd4.4\n\texit ;;\n    ibmrt:*BSD:*|romp-ibm:BSD:*)            # covers RT/PC BSD and\n\techo romp-ibm-bsd${UNAME_RELEASE}   # 4.3 with uname added to\n\texit ;;                             # report: romp-ibm BSD 4.3\n    *:BOSX:*:*)\n\techo rs6000-bull-bosx\n\texit ;;\n    DPX/2?00:B.O.S.:*:*)\n\techo m68k-bull-sysv3\n\texit ;;\n    9000/[34]??:4.3bsd:1.*:*)\n\techo m68k-hp-bsd\n\texit ;;\n    hp300:4.4BSD:*:* | 9000/[34]??:4.3bsd:2.*:*)\n\techo m68k-hp-bsd4.4\n\texit ;;\n    9000/[34678]??:HP-UX:*:*)\n\tHPUX_REV=`echo ${UNAME_RELEASE}|sed -e 's/[^.]*.[0B]*//'`\n\tcase \"${UNAME_MACHINE}\" in\n\t    9000/31? 
)            HP_ARCH=m68000 ;;\n\t    9000/[34]?? )         HP_ARCH=m68k ;;\n\t    9000/[678][0-9][0-9])\n\t\tif [ -x /usr/bin/getconf ]; then\n\t\t    sc_cpu_version=`/usr/bin/getconf SC_CPU_VERSION 2>/dev/null`\n\t\t    sc_kernel_bits=`/usr/bin/getconf SC_KERNEL_BITS 2>/dev/null`\n\t\t    case \"${sc_cpu_version}\" in\n\t\t      523) HP_ARCH=hppa1.0 ;; # CPU_PA_RISC1_0\n\t\t      528) HP_ARCH=hppa1.1 ;; # CPU_PA_RISC1_1\n\t\t      532)                      # CPU_PA_RISC2_0\n\t\t\tcase \"${sc_kernel_bits}\" in\n\t\t\t  32) HP_ARCH=hppa2.0n ;;\n\t\t\t  64) HP_ARCH=hppa2.0w ;;\n\t\t\t  '') HP_ARCH=hppa2.0 ;;   # HP-UX 10.20\n\t\t\tesac ;;\n\t\t    esac\n\t\tfi\n\t\tif [ \"${HP_ARCH}\" = \"\" ]; then\n\t\t    eval $set_cc_for_build\n\t\t    sed 's/^\t\t//' << EOF >$dummy.c\n\n\t\t#define _HPUX_SOURCE\n\t\t#include <stdlib.h>\n\t\t#include <unistd.h>\n\n\t\tint main ()\n\t\t{\n\t\t#if defined(_SC_KERNEL_BITS)\n\t\t    long bits = sysconf(_SC_KERNEL_BITS);\n\t\t#endif\n\t\t    long cpu  = sysconf (_SC_CPU_VERSION);\n\n\t\t    switch (cpu)\n\t\t\t{\n\t\t\tcase CPU_PA_RISC1_0: puts (\"hppa1.0\"); break;\n\t\t\tcase CPU_PA_RISC1_1: puts (\"hppa1.1\"); break;\n\t\t\tcase CPU_PA_RISC2_0:\n\t\t#if defined(_SC_KERNEL_BITS)\n\t\t\t    switch (bits)\n\t\t\t\t{\n\t\t\t\tcase 64: puts (\"hppa2.0w\"); break;\n\t\t\t\tcase 32: puts (\"hppa2.0n\"); break;\n\t\t\t\tdefault: puts (\"hppa2.0\"); break;\n\t\t\t\t} break;\n\t\t#else  /* !defined(_SC_KERNEL_BITS) */\n\t\t\t    puts (\"hppa2.0\"); break;\n\t\t#endif\n\t\t\tdefault: puts (\"hppa1.0\"); break;\n\t\t\t}\n\t\t    exit (0);\n\t\t}\nEOF\n\t\t    (CCOPTS=\"\" $CC_FOR_BUILD -o $dummy $dummy.c 2>/dev/null) && HP_ARCH=`$dummy`\n\t\t    test -z \"$HP_ARCH\" && HP_ARCH=hppa\n\t\tfi ;;\n\tesac\n\tif [ ${HP_ARCH} = hppa2.0w ]\n\tthen\n\t    eval $set_cc_for_build\n\n\t    # hppa2.0w-hp-hpux* has a 64-bit kernel and a compiler generating\n\t    # 32-bit code.  
hppa64-hp-hpux* has the same kernel and a compiler\n\t    # generating 64-bit code.  GNU and HP use different nomenclature:\n\t    #\n\t    # $ CC_FOR_BUILD=cc ./config.guess\n\t    # => hppa2.0w-hp-hpux11.23\n\t    # $ CC_FOR_BUILD=\"cc +DA2.0w\" ./config.guess\n\t    # => hppa64-hp-hpux11.23\n\n\t    if echo __LP64__ | (CCOPTS=\"\" $CC_FOR_BUILD -E - 2>/dev/null) |\n\t\tgrep -q __LP64__\n\t    then\n\t\tHP_ARCH=hppa2.0w\n\t    else\n\t\tHP_ARCH=hppa64\n\t    fi\n\tfi\n\techo ${HP_ARCH}-hp-hpux${HPUX_REV}\n\texit ;;\n    ia64:HP-UX:*:*)\n\tHPUX_REV=`echo ${UNAME_RELEASE}|sed -e 's/[^.]*.[0B]*//'`\n\techo ia64-hp-hpux${HPUX_REV}\n\texit ;;\n    3050*:HI-UX:*:*)\n\teval $set_cc_for_build\n\tsed 's/^\t//' << EOF >$dummy.c\n\t#include <unistd.h>\n\tint\n\tmain ()\n\t{\n\t  long cpu = sysconf (_SC_CPU_VERSION);\n\t  /* The order matters, because CPU_IS_HP_MC68K erroneously returns\n\t     true for CPU_PA_RISC1_0.  CPU_IS_PA_RISC returns correct\n\t     results, however.  */\n\t  if (CPU_IS_PA_RISC (cpu))\n\t    {\n\t      switch (cpu)\n\t\t{\n\t\t  case CPU_PA_RISC1_0: puts (\"hppa1.0-hitachi-hiuxwe2\"); break;\n\t\t  case CPU_PA_RISC1_1: puts (\"hppa1.1-hitachi-hiuxwe2\"); break;\n\t\t  case CPU_PA_RISC2_0: puts (\"hppa2.0-hitachi-hiuxwe2\"); break;\n\t\t  default: puts (\"hppa-hitachi-hiuxwe2\"); break;\n\t\t}\n\t    }\n\t  else if (CPU_IS_HP_MC68K (cpu))\n\t    puts (\"m68k-hitachi-hiuxwe2\");\n\t  else puts (\"unknown-hitachi-hiuxwe2\");\n\t  exit (0);\n\t}\nEOF\n\t$CC_FOR_BUILD -o $dummy $dummy.c && SYSTEM_NAME=`$dummy` &&\n\t\t{ echo \"$SYSTEM_NAME\"; exit; }\n\techo unknown-hitachi-hiuxwe2\n\texit ;;\n    9000/7??:4.3bsd:*:* | 9000/8?[79]:4.3bsd:*:* )\n\techo hppa1.1-hp-bsd\n\texit ;;\n    9000/8??:4.3bsd:*:*)\n\techo hppa1.0-hp-bsd\n\texit ;;\n    *9??*:MPE/iX:*:* | *3000*:MPE/iX:*:*)\n\techo hppa1.0-hp-mpeix\n\texit ;;\n    hp7??:OSF1:*:* | hp8?[79]:OSF1:*:* )\n\techo hppa1.1-hp-osf\n\texit ;;\n    hp8??:OSF1:*:*)\n\techo hppa1.0-hp-osf\n\texit ;;\n    
i*86:OSF1:*:*)\n\tif [ -x /usr/sbin/sysversion ] ; then\n\t    echo ${UNAME_MACHINE}-unknown-osf1mk\n\telse\n\t    echo ${UNAME_MACHINE}-unknown-osf1\n\tfi\n\texit ;;\n    parisc*:Lites*:*:*)\n\techo hppa1.1-hp-lites\n\texit ;;\n    C1*:ConvexOS:*:* | convex:ConvexOS:C1*:*)\n\techo c1-convex-bsd\n\texit ;;\n    C2*:ConvexOS:*:* | convex:ConvexOS:C2*:*)\n\tif getsysinfo -f scalar_acc\n\tthen echo c32-convex-bsd\n\telse echo c2-convex-bsd\n\tfi\n\texit ;;\n    C34*:ConvexOS:*:* | convex:ConvexOS:C34*:*)\n\techo c34-convex-bsd\n\texit ;;\n    C38*:ConvexOS:*:* | convex:ConvexOS:C38*:*)\n\techo c38-convex-bsd\n\texit ;;\n    C4*:ConvexOS:*:* | convex:ConvexOS:C4*:*)\n\techo c4-convex-bsd\n\texit ;;\n    CRAY*Y-MP:*:*:*)\n\techo ymp-cray-unicos${UNAME_RELEASE} | sed -e 's/\\.[^.]*$/.X/'\n\texit ;;\n    CRAY*[A-Z]90:*:*:*)\n\techo ${UNAME_MACHINE}-cray-unicos${UNAME_RELEASE} \\\n\t| sed -e 's/CRAY.*\\([A-Z]90\\)/\\1/' \\\n\t      -e y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/ \\\n\t      -e 's/\\.[^.]*$/.X/'\n\texit ;;\n    CRAY*TS:*:*:*)\n\techo t90-cray-unicos${UNAME_RELEASE} | sed -e 's/\\.[^.]*$/.X/'\n\texit ;;\n    CRAY*T3E:*:*:*)\n\techo alphaev5-cray-unicosmk${UNAME_RELEASE} | sed -e 's/\\.[^.]*$/.X/'\n\texit ;;\n    CRAY*SV1:*:*:*)\n\techo sv1-cray-unicos${UNAME_RELEASE} | sed -e 's/\\.[^.]*$/.X/'\n\texit ;;\n    *:UNICOS/mp:*:*)\n\techo craynv-cray-unicosmp${UNAME_RELEASE} | sed -e 's/\\.[^.]*$/.X/'\n\texit ;;\n    F30[01]:UNIX_System_V:*:* | F700:UNIX_System_V:*:*)\n\tFUJITSU_PROC=`uname -m | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz`\n\tFUJITSU_SYS=`uname -p | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz | sed -e 's/\\///'`\n\tFUJITSU_REL=`echo ${UNAME_RELEASE} | sed -e 's/ /_/'`\n\techo \"${FUJITSU_PROC}-fujitsu-${FUJITSU_SYS}${FUJITSU_REL}\"\n\texit ;;\n    5000:UNIX_System_V:4.*:*)\n\tFUJITSU_SYS=`uname -p | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz | sed -e 's/\\///'`\n\tFUJITSU_REL=`echo 
${UNAME_RELEASE} | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz | sed -e 's/ /_/'`\n\techo \"sparc-fujitsu-${FUJITSU_SYS}${FUJITSU_REL}\"\n\texit ;;\n    i*86:BSD/386:*:* | i*86:BSD/OS:*:* | *:Ascend\\ Embedded/OS:*:*)\n\techo ${UNAME_MACHINE}-pc-bsdi${UNAME_RELEASE}\n\texit ;;\n    sparc*:BSD/OS:*:*)\n\techo sparc-unknown-bsdi${UNAME_RELEASE}\n\texit ;;\n    *:BSD/OS:*:*)\n\techo ${UNAME_MACHINE}-unknown-bsdi${UNAME_RELEASE}\n\texit ;;\n    *:FreeBSD:*:*)\n\tUNAME_PROCESSOR=`/usr/bin/uname -p`\n\tcase ${UNAME_PROCESSOR} in\n\t    amd64)\n\t\tUNAME_PROCESSOR=x86_64 ;;\n\t    i386)\n\t\tUNAME_PROCESSOR=i586 ;;\n\tesac\n\techo ${UNAME_PROCESSOR}-unknown-freebsd`echo ${UNAME_RELEASE}|sed -e 's/[-(].*//'`\n\texit ;;\n    i*:CYGWIN*:*)\n\techo ${UNAME_MACHINE}-pc-cygwin\n\texit ;;\n    *:MINGW64*:*)\n\techo ${UNAME_MACHINE}-pc-mingw64\n\texit ;;\n    *:MINGW*:*)\n\techo ${UNAME_MACHINE}-pc-mingw32\n\texit ;;\n    *:MSYS*:*)\n\techo ${UNAME_MACHINE}-pc-msys\n\texit ;;\n    i*:windows32*:*)\n\t# uname -m includes \"-pc\" on this system.\n\techo ${UNAME_MACHINE}-mingw32\n\texit ;;\n    i*:PW*:*)\n\techo ${UNAME_MACHINE}-pc-pw32\n\texit ;;\n    *:Interix*:*)\n\tcase ${UNAME_MACHINE} in\n\t    x86)\n\t\techo i586-pc-interix${UNAME_RELEASE}\n\t\texit ;;\n\t    authenticamd | genuineintel | EM64T)\n\t\techo x86_64-unknown-interix${UNAME_RELEASE}\n\t\texit ;;\n\t    IA64)\n\t\techo ia64-unknown-interix${UNAME_RELEASE}\n\t\texit ;;\n\tesac ;;\n    [345]86:Windows_95:* | [345]86:Windows_98:* | [345]86:Windows_NT:*)\n\techo i${UNAME_MACHINE}-pc-mks\n\texit ;;\n    8664:Windows_NT:*)\n\techo x86_64-pc-mks\n\texit ;;\n    i*:Windows_NT*:* | Pentium*:Windows_NT*:*)\n\t# How do we know it's Interix rather than the generic POSIX subsystem?\n\t# It also conflicts with pre-2.0 versions of AT&T UWIN. 
Should we\n\t# UNAME_MACHINE based on the output of uname instead of i386?\n\techo i586-pc-interix\n\texit ;;\n    i*:UWIN*:*)\n\techo ${UNAME_MACHINE}-pc-uwin\n\texit ;;\n    amd64:CYGWIN*:*:* | x86_64:CYGWIN*:*:*)\n\techo x86_64-unknown-cygwin\n\texit ;;\n    p*:CYGWIN*:*)\n\techo powerpcle-unknown-cygwin\n\texit ;;\n    prep*:SunOS:5.*:*)\n\techo powerpcle-unknown-solaris2`echo ${UNAME_RELEASE}|sed -e 's/[^.]*//'`\n\texit ;;\n    *:GNU:*:*)\n\t# the GNU system\n\techo `echo ${UNAME_MACHINE}|sed -e 's,[-/].*$,,'`-unknown-${LIBC}`echo ${UNAME_RELEASE}|sed -e 's,/.*$,,'`\n\texit ;;\n    *:GNU/*:*:*)\n\t# other systems with GNU libc and userland\n\techo ${UNAME_MACHINE}-unknown-`echo ${UNAME_SYSTEM} | sed 's,^[^/]*/,,' | tr \"[:upper:]\" \"[:lower:]\"``echo ${UNAME_RELEASE}|sed -e 's/[-(].*//'`-${LIBC}\n\texit ;;\n    i*86:Minix:*:*)\n\techo ${UNAME_MACHINE}-pc-minix\n\texit ;;\n    aarch64:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    aarch64_be:Linux:*:*)\n\tUNAME_MACHINE=aarch64_be\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    alpha:Linux:*:*)\n\tcase `sed -n '/^cpu model/s/^.*: \\(.*\\)/\\1/p' < /proc/cpuinfo` in\n\t  EV5)   UNAME_MACHINE=alphaev5 ;;\n\t  EV56)  UNAME_MACHINE=alphaev56 ;;\n\t  PCA56) UNAME_MACHINE=alphapca56 ;;\n\t  PCA57) UNAME_MACHINE=alphapca56 ;;\n\t  EV6)   UNAME_MACHINE=alphaev6 ;;\n\t  EV67)  UNAME_MACHINE=alphaev67 ;;\n\t  EV68*) UNAME_MACHINE=alphaev68 ;;\n\tesac\n\tobjdump --private-headers /bin/sh | grep -q ld.so.1\n\tif test \"$?\" = 0 ; then LIBC=gnulibc1 ; fi\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    arc:Linux:*:* | arceb:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    arm*:Linux:*:*)\n\teval $set_cc_for_build\n\tif echo __ARM_EABI__ | $CC_FOR_BUILD -E - 2>/dev/null \\\n\t    | grep -q __ARM_EABI__\n\tthen\n\t    echo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\telse\n\t    if echo __ARM_PCS_VFP | $CC_FOR_BUILD -E - 2>/dev/null \\\n\t\t| grep 
-q __ARM_PCS_VFP\n\t    then\n\t\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}eabi\n\t    else\n\t\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}eabihf\n\t    fi\n\tfi\n\texit ;;\n    avr32*:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    cris:Linux:*:*)\n\techo ${UNAME_MACHINE}-axis-linux-${LIBC}\n\texit ;;\n    crisv32:Linux:*:*)\n\techo ${UNAME_MACHINE}-axis-linux-${LIBC}\n\texit ;;\n    e2k:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    frv:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    hexagon:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    i*86:Linux:*:*)\n\techo ${UNAME_MACHINE}-pc-linux-${LIBC}\n\texit ;;\n    ia64:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    k1om:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    m32r*:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    m68*:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    mips:Linux:*:* | mips64:Linux:*:*)\n\teval $set_cc_for_build\n\tsed 's/^\t//' << EOF >$dummy.c\n\t#undef CPU\n\t#undef ${UNAME_MACHINE}\n\t#undef ${UNAME_MACHINE}el\n\t#if defined(__MIPSEL__) || defined(__MIPSEL) || defined(_MIPSEL) || defined(MIPSEL)\n\tCPU=${UNAME_MACHINE}el\n\t#else\n\t#if defined(__MIPSEB__) || defined(__MIPSEB) || defined(_MIPSEB) || defined(MIPSEB)\n\tCPU=${UNAME_MACHINE}\n\t#else\n\tCPU=\n\t#endif\n\t#endif\nEOF\n\teval `$CC_FOR_BUILD -E $dummy.c 2>/dev/null | grep '^CPU'`\n\ttest x\"${CPU}\" != x && { echo \"${CPU}-unknown-linux-${LIBC}\"; exit; }\n\t;;\n    mips64el:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    openrisc*:Linux:*:*)\n\techo or1k-unknown-linux-${LIBC}\n\texit ;;\n    or32:Linux:*:* | or1k*:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    padre:Linux:*:*)\n\techo sparc-unknown-linux-${LIBC}\n\texit ;;\n    parisc64:Linux:*:* | 
hppa64:Linux:*:*)\n\techo hppa64-unknown-linux-${LIBC}\n\texit ;;\n    parisc:Linux:*:* | hppa:Linux:*:*)\n\t# Look for CPU level\n\tcase `grep '^cpu[^a-z]*:' /proc/cpuinfo 2>/dev/null | cut -d' ' -f2` in\n\t  PA7*) echo hppa1.1-unknown-linux-${LIBC} ;;\n\t  PA8*) echo hppa2.0-unknown-linux-${LIBC} ;;\n\t  *)    echo hppa-unknown-linux-${LIBC} ;;\n\tesac\n\texit ;;\n    ppc64:Linux:*:*)\n\techo powerpc64-unknown-linux-${LIBC}\n\texit ;;\n    ppc:Linux:*:*)\n\techo powerpc-unknown-linux-${LIBC}\n\texit ;;\n    ppc64le:Linux:*:*)\n\techo powerpc64le-unknown-linux-${LIBC}\n\texit ;;\n    ppcle:Linux:*:*)\n\techo powerpcle-unknown-linux-${LIBC}\n\texit ;;\n    riscv32:Linux:*:* | riscv64:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    s390:Linux:*:* | s390x:Linux:*:*)\n\techo ${UNAME_MACHINE}-ibm-linux-${LIBC}\n\texit ;;\n    sh64*:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    sh*:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    sparc:Linux:*:* | sparc64:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    tile*:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    vax:Linux:*:*)\n\techo ${UNAME_MACHINE}-dec-linux-${LIBC}\n\texit ;;\n    x86_64:Linux:*:*)\n\techo ${UNAME_MACHINE}-pc-linux-${LIBC}\n\texit ;;\n    xtensa*:Linux:*:*)\n\techo ${UNAME_MACHINE}-unknown-linux-${LIBC}\n\texit ;;\n    i*86:DYNIX/ptx:4*:*)\n\t# ptx 4.0 does uname -s correctly, with DYNIX/ptx in there.\n\t# earlier versions are messed up and put the nodename in both\n\t# sysname and nodename.\n\techo i386-sequent-sysv4\n\texit ;;\n    i*86:UNIX_SV:4.2MP:2.*)\n\t# Unixware is an offshoot of SVR4, but it has its own version\n\t# number series starting with 2...\n\t# I am not positive that other SVR4 systems won't match this,\n\t# I just have to hope.  -- rms.\n\t# Use sysv4.2uw... 
so that sysv4* matches it.\n\techo ${UNAME_MACHINE}-pc-sysv4.2uw${UNAME_VERSION}\n\texit ;;\n    i*86:OS/2:*:*)\n\t# If we were able to find `uname', then EMX Unix compatibility\n\t# is probably installed.\n\techo ${UNAME_MACHINE}-pc-os2-emx\n\texit ;;\n    i*86:XTS-300:*:STOP)\n\techo ${UNAME_MACHINE}-unknown-stop\n\texit ;;\n    i*86:atheos:*:*)\n\techo ${UNAME_MACHINE}-unknown-atheos\n\texit ;;\n    i*86:syllable:*:*)\n\techo ${UNAME_MACHINE}-pc-syllable\n\texit ;;\n    i*86:LynxOS:2.*:* | i*86:LynxOS:3.[01]*:* | i*86:LynxOS:4.[02]*:*)\n\techo i386-unknown-lynxos${UNAME_RELEASE}\n\texit ;;\n    i*86:*DOS:*:*)\n\techo ${UNAME_MACHINE}-pc-msdosdjgpp\n\texit ;;\n    i*86:*:4.*:* | i*86:SYSTEM_V:4.*:*)\n\tUNAME_REL=`echo ${UNAME_RELEASE} | sed 's/\\/MP$//'`\n\tif grep Novell /usr/include/link.h >/dev/null 2>/dev/null; then\n\t\techo ${UNAME_MACHINE}-univel-sysv${UNAME_REL}\n\telse\n\t\techo ${UNAME_MACHINE}-pc-sysv${UNAME_REL}\n\tfi\n\texit ;;\n    i*86:*:5:[678]*)\n\t# UnixWare 7.x, OpenUNIX and OpenServer 6.\n\tcase `/bin/uname -X | grep \"^Machine\"` in\n\t    *486*)\t     UNAME_MACHINE=i486 ;;\n\t    *Pentium)\t     UNAME_MACHINE=i586 ;;\n\t    *Pent*|*Celeron) UNAME_MACHINE=i686 ;;\n\tesac\n\techo ${UNAME_MACHINE}-unknown-sysv${UNAME_RELEASE}${UNAME_SYSTEM}${UNAME_VERSION}\n\texit ;;\n    i*86:*:3.2:*)\n\tif test -f /usr/options/cb.name; then\n\t\tUNAME_REL=`sed -n 's/.*Version //p' </usr/options/cb.name`\n\t\techo ${UNAME_MACHINE}-pc-isc$UNAME_REL\n\telif /bin/uname -X 2>/dev/null >/dev/null ; then\n\t\tUNAME_REL=`(/bin/uname -X|grep Release|sed -e 's/.*= //')`\n\t\t(/bin/uname -X|grep i80486 >/dev/null) && UNAME_MACHINE=i486\n\t\t(/bin/uname -X|grep '^Machine.*Pentium' >/dev/null) \\\n\t\t\t&& UNAME_MACHINE=i586\n\t\t(/bin/uname -X|grep '^Machine.*Pent *II' >/dev/null) \\\n\t\t\t&& UNAME_MACHINE=i686\n\t\t(/bin/uname -X|grep '^Machine.*Pentium Pro' >/dev/null) \\\n\t\t\t&& UNAME_MACHINE=i686\n\t\techo ${UNAME_MACHINE}-pc-sco$UNAME_REL\n\telse\n\t\techo 
${UNAME_MACHINE}-pc-sysv32\n\tfi\n\texit ;;\n    pc:*:*:*)\n\t# Left here for compatibility:\n\t# uname -m prints for DJGPP always 'pc', but it prints nothing about\n\t# the processor, so we play safe by assuming i586.\n\t# Note: whatever this is, it MUST be the same as what config.sub\n\t# prints for the \"djgpp\" host, or else GDB configure will decide that\n\t# this is a cross-build.\n\techo i586-pc-msdosdjgpp\n\texit ;;\n    Intel:Mach:3*:*)\n\techo i386-pc-mach3\n\texit ;;\n    paragon:*:*:*)\n\techo i860-intel-osf1\n\texit ;;\n    i860:*:4.*:*) # i860-SVR4\n\tif grep Stardent /usr/include/sys/uadmin.h >/dev/null 2>&1 ; then\n\t  echo i860-stardent-sysv${UNAME_RELEASE} # Stardent Vistra i860-SVR4\n\telse # Add other i860-SVR4 vendors below as they are discovered.\n\t  echo i860-unknown-sysv${UNAME_RELEASE}  # Unknown i860-SVR4\n\tfi\n\texit ;;\n    mini*:CTIX:SYS*5:*)\n\t# \"miniframe\"\n\techo m68010-convergent-sysv\n\texit ;;\n    mc68k:UNIX:SYSTEM5:3.51m)\n\techo m68k-convergent-sysv\n\texit ;;\n    M680?0:D-NIX:5.3:*)\n\techo m68k-diab-dnix\n\texit ;;\n    M68*:*:R3V[5678]*:*)\n\ttest -r /sysV68 && { echo 'm68k-motorola-sysv'; exit; } ;;\n    3[345]??:*:4.0:3.0 | 3[34]??A:*:4.0:3.0 | 3[34]??,*:*:4.0:3.0 | 3[34]??/*:*:4.0:3.0 | 4400:*:4.0:3.0 | 4850:*:4.0:3.0 | SKA40:*:4.0:3.0 | SDS2:*:4.0:3.0 | SHG2:*:4.0:3.0 | S7501*:*:4.0:3.0)\n\tOS_REL=''\n\ttest -r /etc/.relid \\\n\t&& OS_REL=.`sed -n 's/[^ ]* [^ ]* \\([0-9][0-9]\\).*/\\1/p' < /etc/.relid`\n\t/bin/uname -p 2>/dev/null | grep 86 >/dev/null \\\n\t  && { echo i486-ncr-sysv4.3${OS_REL}; exit; }\n\t/bin/uname -p 2>/dev/null | /bin/grep entium >/dev/null \\\n\t  && { echo i586-ncr-sysv4.3${OS_REL}; exit; } ;;\n    3[34]??:*:4.0:* | 3[34]??,*:*:4.0:*)\n\t/bin/uname -p 2>/dev/null | grep 86 >/dev/null \\\n\t  && { echo i486-ncr-sysv4; exit; } ;;\n    NCR*:*:4.2:* | MPRAS*:*:4.2:*)\n\tOS_REL='.3'\n\ttest -r /etc/.relid \\\n\t    && OS_REL=.`sed -n 's/[^ ]* [^ ]* \\([0-9][0-9]\\).*/\\1/p' < 
/etc/.relid`\n\t/bin/uname -p 2>/dev/null | grep 86 >/dev/null \\\n\t    && { echo i486-ncr-sysv4.3${OS_REL}; exit; }\n\t/bin/uname -p 2>/dev/null | /bin/grep entium >/dev/null \\\n\t    && { echo i586-ncr-sysv4.3${OS_REL}; exit; }\n\t/bin/uname -p 2>/dev/null | /bin/grep pteron >/dev/null \\\n\t    && { echo i586-ncr-sysv4.3${OS_REL}; exit; } ;;\n    m68*:LynxOS:2.*:* | m68*:LynxOS:3.0*:*)\n\techo m68k-unknown-lynxos${UNAME_RELEASE}\n\texit ;;\n    mc68030:UNIX_System_V:4.*:*)\n\techo m68k-atari-sysv4\n\texit ;;\n    TSUNAMI:LynxOS:2.*:*)\n\techo sparc-unknown-lynxos${UNAME_RELEASE}\n\texit ;;\n    rs6000:LynxOS:2.*:*)\n\techo rs6000-unknown-lynxos${UNAME_RELEASE}\n\texit ;;\n    PowerPC:LynxOS:2.*:* | PowerPC:LynxOS:3.[01]*:* | PowerPC:LynxOS:4.[02]*:*)\n\techo powerpc-unknown-lynxos${UNAME_RELEASE}\n\texit ;;\n    SM[BE]S:UNIX_SV:*:*)\n\techo mips-dde-sysv${UNAME_RELEASE}\n\texit ;;\n    RM*:ReliantUNIX-*:*:*)\n\techo mips-sni-sysv4\n\texit ;;\n    RM*:SINIX-*:*:*)\n\techo mips-sni-sysv4\n\texit ;;\n    *:SINIX-*:*:*)\n\tif uname -p 2>/dev/null >/dev/null ; then\n\t\tUNAME_MACHINE=`(uname -p) 2>/dev/null`\n\t\techo ${UNAME_MACHINE}-sni-sysv4\n\telse\n\t\techo ns32k-sni-sysv\n\tfi\n\texit ;;\n    PENTIUM:*:4.0*:*)\t# Unisys `ClearPath HMP IX 4000' SVR4/MP effort\n\t\t\t# says <Richard.M.Bartel@ccMail.Census.GOV>\n\techo i586-unisys-sysv4\n\texit ;;\n    *:UNIX_System_V:4*:FTX*)\n\t# From Gerald Hewes <hewes@openmarket.com>.\n\t# How about differentiating between stratus architectures? 
-djm\n\techo hppa1.1-stratus-sysv4\n\texit ;;\n    *:*:*:FTX*)\n\t# From seanf@swdc.stratus.com.\n\techo i860-stratus-sysv4\n\texit ;;\n    i*86:VOS:*:*)\n\t# From Paul.Green@stratus.com.\n\techo ${UNAME_MACHINE}-stratus-vos\n\texit ;;\n    *:VOS:*:*)\n\t# From Paul.Green@stratus.com.\n\techo hppa1.1-stratus-vos\n\texit ;;\n    mc68*:A/UX:*:*)\n\techo m68k-apple-aux${UNAME_RELEASE}\n\texit ;;\n    news*:NEWS-OS:6*:*)\n\techo mips-sony-newsos6\n\texit ;;\n    R[34]000:*System_V*:*:* | R4000:UNIX_SYSV:*:* | R*000:UNIX_SV:*:*)\n\tif [ -d /usr/nec ]; then\n\t\techo mips-nec-sysv${UNAME_RELEASE}\n\telse\n\t\techo mips-unknown-sysv${UNAME_RELEASE}\n\tfi\n\texit ;;\n    BeBox:BeOS:*:*)\t# BeOS running on hardware made by Be, PPC only.\n\techo powerpc-be-beos\n\texit ;;\n    BeMac:BeOS:*:*)\t# BeOS running on Mac or Mac clone, PPC only.\n\techo powerpc-apple-beos\n\texit ;;\n    BePC:BeOS:*:*)\t# BeOS running on Intel PC compatible.\n\techo i586-pc-beos\n\texit ;;\n    BePC:Haiku:*:*)\t# Haiku running on Intel PC compatible.\n\techo i586-pc-haiku\n\texit ;;\n    x86_64:Haiku:*:*)\n\techo x86_64-unknown-haiku\n\texit ;;\n    SX-4:SUPER-UX:*:*)\n\techo sx4-nec-superux${UNAME_RELEASE}\n\texit ;;\n    SX-5:SUPER-UX:*:*)\n\techo sx5-nec-superux${UNAME_RELEASE}\n\texit ;;\n    SX-6:SUPER-UX:*:*)\n\techo sx6-nec-superux${UNAME_RELEASE}\n\texit ;;\n    SX-7:SUPER-UX:*:*)\n\techo sx7-nec-superux${UNAME_RELEASE}\n\texit ;;\n    SX-8:SUPER-UX:*:*)\n\techo sx8-nec-superux${UNAME_RELEASE}\n\texit ;;\n    SX-8R:SUPER-UX:*:*)\n\techo sx8r-nec-superux${UNAME_RELEASE}\n\texit ;;\n    SX-ACE:SUPER-UX:*:*)\n\techo sxace-nec-superux${UNAME_RELEASE}\n\texit ;;\n    Power*:Rhapsody:*:*)\n\techo powerpc-apple-rhapsody${UNAME_RELEASE}\n\texit ;;\n    *:Rhapsody:*:*)\n\techo ${UNAME_MACHINE}-apple-rhapsody${UNAME_RELEASE}\n\texit ;;\n    *:Darwin:*:*)\n\tUNAME_PROCESSOR=`uname -p` || UNAME_PROCESSOR=unknown\n\teval $set_cc_for_build\n\tif test \"$UNAME_PROCESSOR\" = unknown ; then\n\t    
UNAME_PROCESSOR=powerpc\n\tfi\n\tif test `echo \"$UNAME_RELEASE\" | sed -e 's/\\..*//'` -le 10 ; then\n\t    if [ \"$CC_FOR_BUILD\" != no_compiler_found ]; then\n\t\tif (echo '#ifdef __LP64__'; echo IS_64BIT_ARCH; echo '#endif') | \\\n\t\t       (CCOPTS=\"\" $CC_FOR_BUILD -E - 2>/dev/null) | \\\n\t\t       grep IS_64BIT_ARCH >/dev/null\n\t\tthen\n\t\t    case $UNAME_PROCESSOR in\n\t\t\ti386) UNAME_PROCESSOR=x86_64 ;;\n\t\t\tpowerpc) UNAME_PROCESSOR=powerpc64 ;;\n\t\t    esac\n\t\tfi\n\t\t# On 10.4-10.6 one might compile for PowerPC via gcc -arch ppc\n\t\tif (echo '#ifdef __POWERPC__'; echo IS_PPC; echo '#endif') | \\\n\t\t       (CCOPTS=\"\" $CC_FOR_BUILD -E - 2>/dev/null) | \\\n\t\t       grep IS_PPC >/dev/null\n\t\tthen\n\t\t    UNAME_PROCESSOR=powerpc\n\t\tfi\n\t    fi\n\telif test \"$UNAME_PROCESSOR\" = i386 ; then\n\t    # Avoid executing cc on OS X 10.9, as it ships with a stub\n\t    # that puts up a graphical alert prompting to install\n\t    # developer tools.  Any system running Mac OS X 10.7 or\n\t    # later (Darwin 11 and later) is required to have a 64-bit\n\t    # processor. 
This is not true of the ARM version of Darwin\n\t    # that Apple uses in portable devices.\n\t    UNAME_PROCESSOR=x86_64\n\tfi\n\techo ${UNAME_PROCESSOR}-apple-darwin${UNAME_RELEASE}\n\texit ;;\n    *:procnto*:*:* | *:QNX:[0123456789]*:*)\n\tUNAME_PROCESSOR=`uname -p`\n\tif test \"$UNAME_PROCESSOR\" = x86; then\n\t\tUNAME_PROCESSOR=i386\n\t\tUNAME_MACHINE=pc\n\tfi\n\techo ${UNAME_PROCESSOR}-${UNAME_MACHINE}-nto-qnx${UNAME_RELEASE}\n\texit ;;\n    *:QNX:*:4*)\n\techo i386-pc-qnx\n\texit ;;\n    NEO-*:NONSTOP_KERNEL:*:*)\n\techo neo-tandem-nsk${UNAME_RELEASE}\n\texit ;;\n    NSE-*:NONSTOP_KERNEL:*:*)\n\techo nse-tandem-nsk${UNAME_RELEASE}\n\texit ;;\n    NSR-*:NONSTOP_KERNEL:*:*)\n\techo nsr-tandem-nsk${UNAME_RELEASE}\n\texit ;;\n    NSX-*:NONSTOP_KERNEL:*:*)\n\techo nsx-tandem-nsk${UNAME_RELEASE}\n\texit ;;\n    *:NonStop-UX:*:*)\n\techo mips-compaq-nonstopux\n\texit ;;\n    BS2000:POSIX*:*:*)\n\techo bs2000-siemens-sysv\n\texit ;;\n    DS/*:UNIX_System_V:*:*)\n\techo ${UNAME_MACHINE}-${UNAME_SYSTEM}-${UNAME_RELEASE}\n\texit ;;\n    *:Plan9:*:*)\n\t# \"uname -m\" is not consistent, so use $cputype instead. 
386\n\t# is converted to i386 for consistency with other x86\n\t# operating systems.\n\tif test \"$cputype\" = 386; then\n\t    UNAME_MACHINE=i386\n\telse\n\t    UNAME_MACHINE=\"$cputype\"\n\tfi\n\techo ${UNAME_MACHINE}-unknown-plan9\n\texit ;;\n    *:TOPS-10:*:*)\n\techo pdp10-unknown-tops10\n\texit ;;\n    *:TENEX:*:*)\n\techo pdp10-unknown-tenex\n\texit ;;\n    KS10:TOPS-20:*:* | KL10:TOPS-20:*:* | TYPE4:TOPS-20:*:*)\n\techo pdp10-dec-tops20\n\texit ;;\n    XKL-1:TOPS-20:*:* | TYPE5:TOPS-20:*:*)\n\techo pdp10-xkl-tops20\n\texit ;;\n    *:TOPS-20:*:*)\n\techo pdp10-unknown-tops20\n\texit ;;\n    *:ITS:*:*)\n\techo pdp10-unknown-its\n\texit ;;\n    SEI:*:*:SEIUX)\n\techo mips-sei-seiux${UNAME_RELEASE}\n\texit ;;\n    *:DragonFly:*:*)\n\techo ${UNAME_MACHINE}-unknown-dragonfly`echo ${UNAME_RELEASE}|sed -e 's/[-(].*//'`\n\texit ;;\n    *:*VMS:*:*)\n\tUNAME_MACHINE=`(uname -p) 2>/dev/null`\n\tcase \"${UNAME_MACHINE}\" in\n\t    A*) echo alpha-dec-vms ; exit ;;\n\t    I*) echo ia64-dec-vms ; exit ;;\n\t    V*) echo vax-dec-vms ; exit ;;\n\tesac ;;\n    *:XENIX:*:SysV)\n\techo i386-pc-xenix\n\texit ;;\n    i*86:skyos:*:*)\n\techo ${UNAME_MACHINE}-pc-skyos`echo ${UNAME_RELEASE} | sed -e 's/ .*$//'`\n\texit ;;\n    i*86:rdos:*:*)\n\techo ${UNAME_MACHINE}-pc-rdos\n\texit ;;\n    i*86:AROS:*:*)\n\techo ${UNAME_MACHINE}-pc-aros\n\texit ;;\n    x86_64:VMkernel:*:*)\n\techo ${UNAME_MACHINE}-unknown-esx\n\texit ;;\n    amd64:Isilon\\ OneFS:*:*)\n\techo x86_64-unknown-onefs\n\texit ;;\nesac\n\ncat >&2 <<EOF\n$0: unable to guess system type\n\nThis script (version $timestamp), has failed to recognize the\noperating system you are using. 
If your script is old, overwrite\nconfig.guess and config.sub with the latest versions from:\n\n  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess\nand\n  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub\n\nIf $0 has already been updated, send the following data and any\ninformation you think might be pertinent to config-patches@gnu.org to\nprovide the necessary information to handle your system.\n\nconfig.guess timestamp = $timestamp\n\nuname -m = `(uname -m) 2>/dev/null || echo unknown`\nuname -r = `(uname -r) 2>/dev/null || echo unknown`\nuname -s = `(uname -s) 2>/dev/null || echo unknown`\nuname -v = `(uname -v) 2>/dev/null || echo unknown`\n\n/usr/bin/uname -p = `(/usr/bin/uname -p) 2>/dev/null`\n/bin/uname -X     = `(/bin/uname -X) 2>/dev/null`\n\nhostinfo               = `(hostinfo) 2>/dev/null`\n/bin/universe          = `(/bin/universe) 2>/dev/null`\n/usr/bin/arch -k       = `(/usr/bin/arch -k) 2>/dev/null`\n/bin/arch              = `(/bin/arch) 2>/dev/null`\n/usr/bin/oslevel       = `(/usr/bin/oslevel) 2>/dev/null`\n/usr/convex/getsysinfo = `(/usr/convex/getsysinfo) 2>/dev/null`\n\nUNAME_MACHINE = ${UNAME_MACHINE}\nUNAME_RELEASE = ${UNAME_RELEASE}\nUNAME_SYSTEM  = ${UNAME_SYSTEM}\nUNAME_VERSION = ${UNAME_VERSION}\nEOF\n\nexit 1\n\n# Local variables:\n# eval: (add-hook 'write-file-hooks 'time-stamp)\n# time-stamp-start: \"timestamp='\"\n# time-stamp-format: \"%:y-%02m-%02d\"\n# time-stamp-end: \"'\"\n# End:\n"
  },
  {
    "path": "build/linux/config.sub",
    "content": "#! /bin/sh\n# Configuration validation subroutine script.\n#   Copyright 1992-2017 Free Software Foundation, Inc.\n\ntimestamp='2017-04-02'\n\n# This file is free software; you can redistribute it and/or modify it\n# under the terms of the GNU General Public License as published by\n# the Free Software Foundation; either version 3 of the License, or\n# (at your option) any later version.\n#\n# This program is distributed in the hope that it will be useful, but\n# WITHOUT ANY WARRANTY; without even the implied warranty of\n# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU\n# General Public License for more details.\n#\n# You should have received a copy of the GNU General Public License\n# along with this program; if not, see <http://www.gnu.org/licenses/>.\n#\n# As a special exception to the GNU General Public License, if you\n# distribute this file as part of a program that contains a\n# configuration script generated by Autoconf, you may include it under\n# the same distribution terms that you use for the rest of that\n# program.  This Exception is an additional permission under section 7\n# of the GNU General Public License, version 3 (\"GPLv3\").\n\n\n# Please send patches to <config-patches@gnu.org>.\n#\n# Configuration subroutine to validate and canonicalize a configuration type.\n# Supply the specified configuration type as an argument.\n# If it is invalid, we print an error message on stderr and exit with code 1.\n# Otherwise, we print the canonical config type on stdout and succeed.\n\n# You can get the latest version of this script from:\n# http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub\n\n# This file is supposed to be the same for all GNU packages\n# and recognize all the CPU types, system types and aliases\n# that are meaningful with *any* GNU software.\n# Each package is responsible for reporting which valid configurations\n# it does not support.  
The user should be able to distinguish\n# a failure to support a valid configuration from a meaningless\n# configuration.\n\n# The goal of this file is to map all the various variations of a given\n# machine specification into a single specification in the form:\n#\tCPU_TYPE-MANUFACTURER-OPERATING_SYSTEM\n# or in some cases, the newer four-part form:\n#\tCPU_TYPE-MANUFACTURER-KERNEL-OPERATING_SYSTEM\n# It is wrong to echo any other type of specification.\n\nme=`echo \"$0\" | sed -e 's,.*/,,'`\n\nusage=\"\\\nUsage: $0 [OPTION] CPU-MFR-OPSYS or ALIAS\n\nCanonicalize a configuration name.\n\nOperation modes:\n  -h, --help         print this help, then exit\n  -t, --time-stamp   print date of last modification, then exit\n  -v, --version      print version number, then exit\n\nReport bugs and patches to <config-patches@gnu.org>.\"\n\nversion=\"\\\nGNU config.sub ($timestamp)\n\nCopyright 1992-2017 Free Software Foundation, Inc.\n\nThis is free software; see the source for copying conditions.  
There is NO\nwarranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\"\n\nhelp=\"\nTry \\`$me --help' for more information.\"\n\n# Parse command line\nwhile test $# -gt 0 ; do\n  case $1 in\n    --time-stamp | --time* | -t )\n       echo \"$timestamp\" ; exit ;;\n    --version | -v )\n       echo \"$version\" ; exit ;;\n    --help | --h* | -h )\n       echo \"$usage\"; exit ;;\n    -- )     # Stop option processing\n       shift; break ;;\n    - )\t# Use stdin as input.\n       break ;;\n    -* )\n       echo \"$me: invalid option $1$help\"\n       exit 1 ;;\n\n    *local*)\n       # First pass through any local machine types.\n       echo $1\n       exit ;;\n\n    * )\n       break ;;\n  esac\ndone\n\ncase $# in\n 0) echo \"$me: missing argument$help\" >&2\n    exit 1;;\n 1) ;;\n *) echo \"$me: too many arguments$help\" >&2\n    exit 1;;\nesac\n\n# Separate what the user gave into CPU-COMPANY and OS or KERNEL-OS (if any).\n# Here we must recognize all the valid KERNEL-OS combinations.\nmaybe_os=`echo $1 | sed 's/^\\(.*\\)-\\([^-]*-[^-]*\\)$/\\2/'`\ncase $maybe_os in\n  nto-qnx* | linux-gnu* | linux-android* | linux-dietlibc | linux-newlib* | \\\n  linux-musl* | linux-uclibc* | uclinux-uclibc* | uclinux-gnu* | kfreebsd*-gnu* | \\\n  knetbsd*-gnu* | netbsd*-gnu* | netbsd*-eabi* | \\\n  kopensolaris*-gnu* | cloudabi*-eabi* | \\\n  storm-chaos* | os2-emx* | rtmk-nova*)\n    os=-$maybe_os\n    basic_machine=`echo $1 | sed 's/^\\(.*\\)-\\([^-]*-[^-]*\\)$/\\1/'`\n    ;;\n  android-linux)\n    os=-linux-android\n    basic_machine=`echo $1 | sed 's/^\\(.*\\)-\\([^-]*-[^-]*\\)$/\\1/'`-unknown\n    ;;\n  *)\n    basic_machine=`echo $1 | sed 's/-[^-]*$//'`\n    if [ $basic_machine != $1 ]\n    then os=`echo $1 | sed 's/.*-/-/'`\n    else os=; fi\n    ;;\nesac\n\n### Let's recognize common machines as not being operating systems so\n### that things like config.sub decstation-3100 work.  
We also\n### recognize some manufacturers as not being operating systems, so we\n### can provide default operating systems below.\ncase $os in\n\t-sun*os*)\n\t\t# Prevent following clause from handling this invalid input.\n\t\t;;\n\t-dec* | -mips* | -sequent* | -encore* | -pc532* | -sgi* | -sony* | \\\n\t-att* | -7300* | -3300* | -delta* | -motorola* | -sun[234]* | \\\n\t-unicom* | -ibm* | -next | -hp | -isi* | -apollo | -altos* | \\\n\t-convergent* | -ncr* | -news | -32* | -3600* | -3100* | -hitachi* |\\\n\t-c[123]* | -convex* | -sun | -crds | -omron* | -dg | -ultra | -tti* | \\\n\t-harris | -dolphin | -highlevel | -gould | -cbm | -ns | -masscomp | \\\n\t-apple | -axis | -knuth | -cray | -microblaze*)\n\t\tos=\n\t\tbasic_machine=$1\n\t\t;;\n\t-bluegene*)\n\t\tos=-cnk\n\t\t;;\n\t-sim | -cisco | -oki | -wec | -winbond)\n\t\tos=\n\t\tbasic_machine=$1\n\t\t;;\n\t-scout)\n\t\t;;\n\t-wrs)\n\t\tos=-vxworks\n\t\tbasic_machine=$1\n\t\t;;\n\t-chorusos*)\n\t\tos=-chorusos\n\t\tbasic_machine=$1\n\t\t;;\n\t-chorusrdb)\n\t\tos=-chorusrdb\n\t\tbasic_machine=$1\n\t\t;;\n\t-hiux*)\n\t\tos=-hiuxwe2\n\t\t;;\n\t-sco6)\n\t\tos=-sco5v6\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-pc/'`\n\t\t;;\n\t-sco5)\n\t\tos=-sco3.2v5\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-pc/'`\n\t\t;;\n\t-sco4)\n\t\tos=-sco3.2v4\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-pc/'`\n\t\t;;\n\t-sco3.2.[4-9]*)\n\t\tos=`echo $os | sed -e 's/sco3.2./sco3.2v/'`\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-pc/'`\n\t\t;;\n\t-sco3.2v[4-9]*)\n\t\t# Don't forget version if it is 3.2v4 or newer.\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-pc/'`\n\t\t;;\n\t-sco5v6*)\n\t\t# Don't forget version if it is 3.2v4 or newer.\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-pc/'`\n\t\t;;\n\t-sco*)\n\t\tos=-sco3.2v2\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-pc/'`\n\t\t;;\n\t-udk*)\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-pc/'`\n\t\t;;\n\t-isc)\n\t\tos=-isc2.2\n\t\tbasic_machine=`echo $1 | sed -e 
's/86-.*/86-pc/'`\n\t\t;;\n\t-clix*)\n\t\tbasic_machine=clipper-intergraph\n\t\t;;\n\t-isc*)\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-pc/'`\n\t\t;;\n\t-lynx*178)\n\t\tos=-lynxos178\n\t\t;;\n\t-lynx*5)\n\t\tos=-lynxos5\n\t\t;;\n\t-lynx*)\n\t\tos=-lynxos\n\t\t;;\n\t-ptx*)\n\t\tbasic_machine=`echo $1 | sed -e 's/86-.*/86-sequent/'`\n\t\t;;\n\t-windowsnt*)\n\t\tos=`echo $os | sed -e 's/windowsnt/winnt/'`\n\t\t;;\n\t-psos*)\n\t\tos=-psos\n\t\t;;\n\t-mint | -mint[0-9]*)\n\t\tbasic_machine=m68k-atari\n\t\tos=-mint\n\t\t;;\nesac\n\n# Decode aliases for certain CPU-COMPANY combinations.\ncase $basic_machine in\n\t# Recognize the basic CPU types without company name.\n\t# Some are omitted here because they have special meanings below.\n\t1750a | 580 \\\n\t| a29k \\\n\t| aarch64 | aarch64_be \\\n\t| alpha | alphaev[4-8] | alphaev56 | alphaev6[78] | alphapca5[67] \\\n\t| alpha64 | alpha64ev[4-8] | alpha64ev56 | alpha64ev6[78] | alpha64pca5[67] \\\n\t| am33_2.0 \\\n\t| arc | arceb \\\n\t| arm | arm[bl]e | arme[lb] | armv[2-8] | armv[3-8][lb] | armv7[arm] \\\n\t| avr | avr32 \\\n\t| ba \\\n\t| be32 | be64 \\\n\t| bfin \\\n\t| c4x | c8051 | clipper \\\n\t| d10v | d30v | dlx | dsp16xx \\\n\t| e2k | epiphany \\\n\t| fido | fr30 | frv | ft32 \\\n\t| h8300 | h8500 | hppa | hppa1.[01] | hppa2.0 | hppa2.0[nw] | hppa64 \\\n\t| hexagon \\\n\t| i370 | i860 | i960 | ia16 | ia64 \\\n\t| ip2k | iq2000 \\\n\t| k1om \\\n\t| le32 | le64 \\\n\t| lm32 \\\n\t| m32c | m32r | m32rle | m68000 | m68k | m88k \\\n\t| maxq | mb | microblaze | microblazeel | mcore | mep | metag \\\n\t| mips | mipsbe | mipseb | mipsel | mipsle \\\n\t| mips16 \\\n\t| mips64 | mips64el \\\n\t| mips64octeon | mips64octeonel \\\n\t| mips64orion | mips64orionel \\\n\t| mips64r5900 | mips64r5900el \\\n\t| mips64vr | mips64vrel \\\n\t| mips64vr4100 | mips64vr4100el \\\n\t| mips64vr4300 | mips64vr4300el \\\n\t| mips64vr5000 | mips64vr5000el \\\n\t| mips64vr5900 | mips64vr5900el \\\n\t| mipsisa32 | mipsisa32el \\\n\t| 
mipsisa32r2 | mipsisa32r2el \\\n\t| mipsisa32r6 | mipsisa32r6el \\\n\t| mipsisa64 | mipsisa64el \\\n\t| mipsisa64r2 | mipsisa64r2el \\\n\t| mipsisa64r6 | mipsisa64r6el \\\n\t| mipsisa64sb1 | mipsisa64sb1el \\\n\t| mipsisa64sr71k | mipsisa64sr71kel \\\n\t| mipsr5900 | mipsr5900el \\\n\t| mipstx39 | mipstx39el \\\n\t| mn10200 | mn10300 \\\n\t| moxie \\\n\t| mt \\\n\t| msp430 \\\n\t| nds32 | nds32le | nds32be \\\n\t| nios | nios2 | nios2eb | nios2el \\\n\t| ns16k | ns32k \\\n\t| open8 | or1k | or1knd | or32 \\\n\t| pdp10 | pdp11 | pj | pjl \\\n\t| powerpc | powerpc64 | powerpc64le | powerpcle \\\n\t| pru \\\n\t| pyramid \\\n\t| riscv32 | riscv64 \\\n\t| rl78 | rx \\\n\t| score \\\n\t| sh | sh[1234] | sh[24]a | sh[24]aeb | sh[23]e | sh[234]eb | sheb | shbe | shle | sh[1234]le | sh3ele \\\n\t| sh64 | sh64le \\\n\t| sparc | sparc64 | sparc64b | sparc64v | sparc86x | sparclet | sparclite \\\n\t| sparcv8 | sparcv9 | sparcv9b | sparcv9v \\\n\t| spu \\\n\t| tahoe | tic4x | tic54x | tic55x | tic6x | tic80 | tron \\\n\t| ubicom32 \\\n\t| v850 | v850e | v850e1 | v850e2 | v850es | v850e2v3 \\\n\t| visium \\\n\t| wasm32 \\\n\t| we32k \\\n\t| x86 | xc16x | xstormy16 | xtensa \\\n\t| z8k | z80)\n\t\tbasic_machine=$basic_machine-unknown\n\t\t;;\n\tc54x)\n\t\tbasic_machine=tic54x-unknown\n\t\t;;\n\tc55x)\n\t\tbasic_machine=tic55x-unknown\n\t\t;;\n\tc6x)\n\t\tbasic_machine=tic6x-unknown\n\t\t;;\n\tleon|leon[3-9])\n\t\tbasic_machine=sparc-$basic_machine\n\t\t;;\n\tm6811 | m68hc11 | m6812 | m68hc12 | m68hcs12x | nvptx | picochip)\n\t\tbasic_machine=$basic_machine-unknown\n\t\tos=-none\n\t\t;;\n\tm88110 | m680[12346]0 | m683?2 | m68360 | m5200 | v70 | w65 | z8k)\n\t\t;;\n\tms1)\n\t\tbasic_machine=mt-unknown\n\t\t;;\n\n\tstrongarm | thumb | xscale)\n\t\tbasic_machine=arm-unknown\n\t\t;;\n\txgate)\n\t\tbasic_machine=$basic_machine-unknown\n\t\tos=-none\n\t\t;;\n\txscaleeb)\n\t\tbasic_machine=armeb-unknown\n\t\t;;\n\n\txscaleel)\n\t\tbasic_machine=armel-unknown\n\t\t;;\n\n\t# We use `pc' 
rather than `unknown'\n\t# because (1) that's what they normally are, and\n\t# (2) the word \"unknown\" tends to confuse beginning users.\n\ti*86 | x86_64)\n\t  basic_machine=$basic_machine-pc\n\t  ;;\n\t# Object if more than one company name word.\n\t*-*-*)\n\t\techo Invalid configuration \\`$1\\': machine \\`$basic_machine\\' not recognized 1>&2\n\t\texit 1\n\t\t;;\n\t# Recognize the basic CPU types with company name.\n\t580-* \\\n\t| a29k-* \\\n\t| aarch64-* | aarch64_be-* \\\n\t| alpha-* | alphaev[4-8]-* | alphaev56-* | alphaev6[78]-* \\\n\t| alpha64-* | alpha64ev[4-8]-* | alpha64ev56-* | alpha64ev6[78]-* \\\n\t| alphapca5[67]-* | alpha64pca5[67]-* | arc-* | arceb-* \\\n\t| arm-*  | armbe-* | armle-* | armeb-* | armv*-* \\\n\t| avr-* | avr32-* \\\n\t| ba-* \\\n\t| be32-* | be64-* \\\n\t| bfin-* | bs2000-* \\\n\t| c[123]* | c30-* | [cjt]90-* | c4x-* \\\n\t| c8051-* | clipper-* | craynv-* | cydra-* \\\n\t| d10v-* | d30v-* | dlx-* \\\n\t| e2k-* | elxsi-* \\\n\t| f30[01]-* | f700-* | fido-* | fr30-* | frv-* | fx80-* \\\n\t| h8300-* | h8500-* \\\n\t| hppa-* | hppa1.[01]-* | hppa2.0-* | hppa2.0[nw]-* | hppa64-* \\\n\t| hexagon-* \\\n\t| i*86-* | i860-* | i960-* | ia16-* | ia64-* \\\n\t| ip2k-* | iq2000-* \\\n\t| k1om-* \\\n\t| le32-* | le64-* \\\n\t| lm32-* \\\n\t| m32c-* | m32r-* | m32rle-* \\\n\t| m68000-* | m680[012346]0-* | m68360-* | m683?2-* | m68k-* \\\n\t| m88110-* | m88k-* | maxq-* | mcore-* | metag-* \\\n\t| microblaze-* | microblazeel-* \\\n\t| mips-* | mipsbe-* | mipseb-* | mipsel-* | mipsle-* \\\n\t| mips16-* \\\n\t| mips64-* | mips64el-* \\\n\t| mips64octeon-* | mips64octeonel-* \\\n\t| mips64orion-* | mips64orionel-* \\\n\t| mips64r5900-* | mips64r5900el-* \\\n\t| mips64vr-* | mips64vrel-* \\\n\t| mips64vr4100-* | mips64vr4100el-* \\\n\t| mips64vr4300-* | mips64vr4300el-* \\\n\t| mips64vr5000-* | mips64vr5000el-* \\\n\t| mips64vr5900-* | mips64vr5900el-* \\\n\t| mipsisa32-* | mipsisa32el-* \\\n\t| mipsisa32r2-* | mipsisa32r2el-* \\\n\t| mipsisa32r6-* | 
mipsisa32r6el-* \\\n\t| mipsisa64-* | mipsisa64el-* \\\n\t| mipsisa64r2-* | mipsisa64r2el-* \\\n\t| mipsisa64r6-* | mipsisa64r6el-* \\\n\t| mipsisa64sb1-* | mipsisa64sb1el-* \\\n\t| mipsisa64sr71k-* | mipsisa64sr71kel-* \\\n\t| mipsr5900-* | mipsr5900el-* \\\n\t| mipstx39-* | mipstx39el-* \\\n\t| mmix-* \\\n\t| mt-* \\\n\t| msp430-* \\\n\t| nds32-* | nds32le-* | nds32be-* \\\n\t| nios-* | nios2-* | nios2eb-* | nios2el-* \\\n\t| none-* | np1-* | ns16k-* | ns32k-* \\\n\t| open8-* \\\n\t| or1k*-* \\\n\t| orion-* \\\n\t| pdp10-* | pdp11-* | pj-* | pjl-* | pn-* | power-* \\\n\t| powerpc-* | powerpc64-* | powerpc64le-* | powerpcle-* \\\n\t| pru-* \\\n\t| pyramid-* \\\n\t| riscv32-* | riscv64-* \\\n\t| rl78-* | romp-* | rs6000-* | rx-* \\\n\t| sh-* | sh[1234]-* | sh[24]a-* | sh[24]aeb-* | sh[23]e-* | sh[34]eb-* | sheb-* | shbe-* \\\n\t| shle-* | sh[1234]le-* | sh3ele-* | sh64-* | sh64le-* \\\n\t| sparc-* | sparc64-* | sparc64b-* | sparc64v-* | sparc86x-* | sparclet-* \\\n\t| sparclite-* \\\n\t| sparcv8-* | sparcv9-* | sparcv9b-* | sparcv9v-* | sv1-* | sx*-* \\\n\t| tahoe-* \\\n\t| tic30-* | tic4x-* | tic54x-* | tic55x-* | tic6x-* | tic80-* \\\n\t| tile*-* \\\n\t| tron-* \\\n\t| ubicom32-* \\\n\t| v850-* | v850e-* | v850e1-* | v850es-* | v850e2-* | v850e2v3-* \\\n\t| vax-* \\\n\t| visium-* \\\n\t| wasm32-* \\\n\t| we32k-* \\\n\t| x86-* | x86_64-* | xc16x-* | xps100-* \\\n\t| xstormy16-* | xtensa*-* \\\n\t| ymp-* \\\n\t| z8k-* | z80-*)\n\t\t;;\n\t# Recognize the basic CPU types without company name, with glob match.\n\txtensa*)\n\t\tbasic_machine=$basic_machine-unknown\n\t\t;;\n\t# Recognize the various machine names and aliases which stand\n\t# for a CPU type and a company and sometimes even an OS.\n\t386bsd)\n\t\tbasic_machine=i386-unknown\n\t\tos=-bsd\n\t\t;;\n\t3b1 | 7300 | 7300-att | att-7300 | pc7300 | safari | 
unixpc)\n\t\tbasic_machine=m68000-att\n\t\t;;\n\t3b*)\n\t\tbasic_machine=we32k-att\n\t\t;;\n\ta29khif)\n\t\tbasic_machine=a29k-amd\n\t\tos=-udi\n\t\t;;\n\tabacus)\n\t\tbasic_machine=abacus-unknown\n\t\t;;\n\tadobe68k)\n\t\tbasic_machine=m68010-adobe\n\t\tos=-scout\n\t\t;;\n\talliant | fx80)\n\t\tbasic_machine=fx80-alliant\n\t\t;;\n\taltos | altos3068)\n\t\tbasic_machine=m68k-altos\n\t\t;;\n\tam29k)\n\t\tbasic_machine=a29k-none\n\t\tos=-bsd\n\t\t;;\n\tamd64)\n\t\tbasic_machine=x86_64-pc\n\t\t;;\n\tamd64-*)\n\t\tbasic_machine=x86_64-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tamdahl)\n\t\tbasic_machine=580-amdahl\n\t\tos=-sysv\n\t\t;;\n\tamiga | amiga-*)\n\t\tbasic_machine=m68k-unknown\n\t\t;;\n\tamigaos | amigados)\n\t\tbasic_machine=m68k-unknown\n\t\tos=-amigaos\n\t\t;;\n\tamigaunix | amix)\n\t\tbasic_machine=m68k-unknown\n\t\tos=-sysv4\n\t\t;;\n\tapollo68)\n\t\tbasic_machine=m68k-apollo\n\t\tos=-sysv\n\t\t;;\n\tapollo68bsd)\n\t\tbasic_machine=m68k-apollo\n\t\tos=-bsd\n\t\t;;\n\taros)\n\t\tbasic_machine=i386-pc\n\t\tos=-aros\n\t\t;;\n\tasmjs)\n\t\tbasic_machine=asmjs-unknown\n\t\t;;\n\taux)\n\t\tbasic_machine=m68k-apple\n\t\tos=-aux\n\t\t;;\n\tbalance)\n\t\tbasic_machine=ns32k-sequent\n\t\tos=-dynix\n\t\t;;\n\tblackfin)\n\t\tbasic_machine=bfin-unknown\n\t\tos=-linux\n\t\t;;\n\tblackfin-*)\n\t\tbasic_machine=bfin-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\tos=-linux\n\t\t;;\n\tbluegene*)\n\t\tbasic_machine=powerpc-ibm\n\t\tos=-cnk\n\t\t;;\n\tc54x-*)\n\t\tbasic_machine=tic54x-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tc55x-*)\n\t\tbasic_machine=tic55x-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tc6x-*)\n\t\tbasic_machine=tic6x-`echo $basic_machine | sed 
's/^[^-]*-//'`\n\t\t;;\n\tc90)\n\t\tbasic_machine=c90-cray\n\t\tos=-unicos\n\t\t;;\n\tcegcc)\n\t\tbasic_machine=arm-unknown\n\t\tos=-cegcc\n\t\t;;\n\tconvex-c1)\n\t\tbasic_machine=c1-convex\n\t\tos=-bsd\n\t\t;;\n\tconvex-c2)\n\t\tbasic_machine=c2-convex\n\t\tos=-bsd\n\t\t;;\n\tconvex-c32)\n\t\tbasic_machine=c32-convex\n\t\tos=-bsd\n\t\t;;\n\tconvex-c34)\n\t\tbasic_machine=c34-convex\n\t\tos=-bsd\n\t\t;;\n\tconvex-c38)\n\t\tbasic_machine=c38-convex\n\t\tos=-bsd\n\t\t;;\n\tcray | j90)\n\t\tbasic_machine=j90-cray\n\t\tos=-unicos\n\t\t;;\n\tcraynv)\n\t\tbasic_machine=craynv-cray\n\t\tos=-unicosmp\n\t\t;;\n\tcr16 | cr16-*)\n\t\tbasic_machine=cr16-unknown\n\t\tos=-elf\n\t\t;;\n\tcrds | unos)\n\t\tbasic_machine=m68k-crds\n\t\t;;\n\tcrisv32 | crisv32-* | etraxfs*)\n\t\tbasic_machine=crisv32-axis\n\t\t;;\n\tcris | cris-* | etrax*)\n\t\tbasic_machine=cris-axis\n\t\t;;\n\tcrx)\n\t\tbasic_machine=crx-unknown\n\t\tos=-elf\n\t\t;;\n\tda30 | da30-*)\n\t\tbasic_machine=m68k-da30\n\t\t;;\n\tdecstation | decstation-3100 | pmax | pmax-* | pmin | dec3100 | decstatn)\n\t\tbasic_machine=mips-dec\n\t\t;;\n\tdecsystem10* | dec10*)\n\t\tbasic_machine=pdp10-dec\n\t\tos=-tops10\n\t\t;;\n\tdecsystem20* | dec20*)\n\t\tbasic_machine=pdp10-dec\n\t\tos=-tops20\n\t\t;;\n\tdelta | 3300 | motorola-3300 | motorola-delta \\\n\t      | 3300-motorola | delta-motorola)\n\t\tbasic_machine=m68k-motorola\n\t\t;;\n\tdelta88)\n\t\tbasic_machine=m88k-motorola\n\t\tos=-sysv3\n\t\t;;\n\tdicos)\n\t\tbasic_machine=i686-pc\n\t\tos=-dicos\n\t\t;;\n\tdjgpp)\n\t\tbasic_machine=i586-pc\n\t\tos=-msdosdjgpp\n\t\t;;\n\tdpx20 | dpx20-*)\n\t\tbasic_machine=rs6000-bull\n\t\tos=-bosx\n\t\t;;\n\tdpx2* | dpx2*-bull)\n\t\tbasic_machine=m68k-bull\n\t\tos=-sysv3\n\t\t;;\n\te500v[12])\n\t\tbasic_machine=powerpc-unknown\n\t\tos=$os\"spe\"\n\t\t;;\n\te500v[12]-*)\n\t\tbasic_machine=powerpc-`echo $basic_machine | sed 
's/^[^-]*-//'`\n\t\tos=$os\"spe\"\n\t\t;;\n\tebmon29k)\n\t\tbasic_machine=a29k-amd\n\t\tos=-ebmon\n\t\t;;\n\telxsi)\n\t\tbasic_machine=elxsi-elxsi\n\t\tos=-bsd\n\t\t;;\n\tencore | umax | mmax)\n\t\tbasic_machine=ns32k-encore\n\t\t;;\n\tes1800 | OSE68k | ose68k | ose | OSE)\n\t\tbasic_machine=m68k-ericsson\n\t\tos=-ose\n\t\t;;\n\tfx2800)\n\t\tbasic_machine=i860-alliant\n\t\t;;\n\tgenix)\n\t\tbasic_machine=ns32k-ns\n\t\t;;\n\tgmicro)\n\t\tbasic_machine=tron-gmicro\n\t\tos=-sysv\n\t\t;;\n\tgo32)\n\t\tbasic_machine=i386-pc\n\t\tos=-go32\n\t\t;;\n\th3050r* | hiux*)\n\t\tbasic_machine=hppa1.1-hitachi\n\t\tos=-hiuxwe2\n\t\t;;\n\th8300hms)\n\t\tbasic_machine=h8300-hitachi\n\t\tos=-hms\n\t\t;;\n\th8300xray)\n\t\tbasic_machine=h8300-hitachi\n\t\tos=-xray\n\t\t;;\n\th8500hms)\n\t\tbasic_machine=h8500-hitachi\n\t\tos=-hms\n\t\t;;\n\tharris)\n\t\tbasic_machine=m88k-harris\n\t\tos=-sysv3\n\t\t;;\n\thp300-*)\n\t\tbasic_machine=m68k-hp\n\t\t;;\n\thp300bsd)\n\t\tbasic_machine=m68k-hp\n\t\tos=-bsd\n\t\t;;\n\thp300hpux)\n\t\tbasic_machine=m68k-hp\n\t\tos=-hpux\n\t\t;;\n\thp3k9[0-9][0-9] | hp9[0-9][0-9])\n\t\tbasic_machine=hppa1.0-hp\n\t\t;;\n\thp9k2[0-9][0-9] | hp9k31[0-9])\n\t\tbasic_machine=m68000-hp\n\t\t;;\n\thp9k3[2-9][0-9])\n\t\tbasic_machine=m68k-hp\n\t\t;;\n\thp9k6[0-9][0-9] | hp6[0-9][0-9])\n\t\tbasic_machine=hppa1.0-hp\n\t\t;;\n\thp9k7[0-79][0-9] | hp7[0-79][0-9])\n\t\tbasic_machine=hppa1.1-hp\n\t\t;;\n\thp9k78[0-9] | hp78[0-9])\n\t\t# FIXME: really hppa2.0-hp\n\t\tbasic_machine=hppa1.1-hp\n\t\t;;\n\thp9k8[67]1 | hp8[67]1 | hp9k80[24] | hp80[24] | hp9k8[78]9 | hp8[78]9 | hp9k893 | hp893)\n\t\t# FIXME: really hppa2.0-hp\n\t\tbasic_machine=hppa1.1-hp\n\t\t;;\n\thp9k8[0-9][13679] | hp8[0-9][13679])\n\t\tbasic_machine=hppa1.1-hp\n\t\t;;\n\thp9k8[0-9][0-9] | 
hp8[0-9][0-9])\n\t\tbasic_machine=hppa1.0-hp\n\t\t;;\n\thppa-next)\n\t\tos=-nextstep3\n\t\t;;\n\thppaosf)\n\t\tbasic_machine=hppa1.1-hp\n\t\tos=-osf\n\t\t;;\n\thppro)\n\t\tbasic_machine=hppa1.1-hp\n\t\tos=-proelf\n\t\t;;\n\ti370-ibm* | ibm*)\n\t\tbasic_machine=i370-ibm\n\t\t;;\n\ti*86v32)\n\t\tbasic_machine=`echo $1 | sed -e 's/86.*/86-pc/'`\n\t\tos=-sysv32\n\t\t;;\n\ti*86v4*)\n\t\tbasic_machine=`echo $1 | sed -e 's/86.*/86-pc/'`\n\t\tos=-sysv4\n\t\t;;\n\ti*86v)\n\t\tbasic_machine=`echo $1 | sed -e 's/86.*/86-pc/'`\n\t\tos=-sysv\n\t\t;;\n\ti*86sol2)\n\t\tbasic_machine=`echo $1 | sed -e 's/86.*/86-pc/'`\n\t\tos=-solaris2\n\t\t;;\n\ti386mach)\n\t\tbasic_machine=i386-mach\n\t\tos=-mach\n\t\t;;\n\ti386-vsta | vsta)\n\t\tbasic_machine=i386-unknown\n\t\tos=-vsta\n\t\t;;\n\tiris | iris4d)\n\t\tbasic_machine=mips-sgi\n\t\tcase $os in\n\t\t    -irix*)\n\t\t\t;;\n\t\t    *)\n\t\t\tos=-irix4\n\t\t\t;;\n\t\tesac\n\t\t;;\n\tisi68 | isi)\n\t\tbasic_machine=m68k-isi\n\t\tos=-sysv\n\t\t;;\n\tleon-*|leon[3-9]-*)\n\t\tbasic_machine=sparc-`echo $basic_machine | sed 's/-.*//'`\n\t\t;;\n\tm68knommu)\n\t\tbasic_machine=m68k-unknown\n\t\tos=-linux\n\t\t;;\n\tm68knommu-*)\n\t\tbasic_machine=m68k-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\tos=-linux\n\t\t;;\n\tm88k-omron*)\n\t\tbasic_machine=m88k-omron\n\t\t;;\n\tmagnum | m3230)\n\t\tbasic_machine=mips-mips\n\t\tos=-sysv\n\t\t;;\n\tmerlin)\n\t\tbasic_machine=ns32k-utek\n\t\tos=-sysv\n\t\t;;\n\tmicroblaze*)\n\t\tbasic_machine=microblaze-xilinx\n\t\t;;\n\tmingw64)\n\t\tbasic_machine=x86_64-pc\n\t\tos=-mingw64\n\t\t;;\n\tmingw32)\n\t\tbasic_machine=i686-pc\n\t\tos=-mingw32\n\t\t;;\n\tmingw32ce)\n\t\tbasic_machine=arm-unknown\n\t\tos=-mingw32ce\n\t\t;;\n\tminiframe)\n\t\tbasic_machine=m68000-convergent\n\t\t;;\n\t*mint | -mint[0-9]* | *MiNT | *MiNT[0-9]*)\n\t\tbasic_machine=m68k-atari\n\t\tos=-mint\n\t\t;;\n\tmips3*-*)\n\t\tbasic_machine=`echo $basic_machine | sed -e 's/mips3/mips64/'`\n\t\t;;\n\tmips3*)\n\t\tbasic_machine=`echo 
$basic_machine | sed -e 's/mips3/mips64/'`-unknown\n\t\t;;\n\tmonitor)\n\t\tbasic_machine=m68k-rom68k\n\t\tos=-coff\n\t\t;;\n\tmorphos)\n\t\tbasic_machine=powerpc-unknown\n\t\tos=-morphos\n\t\t;;\n\tmoxiebox)\n\t\tbasic_machine=moxie-unknown\n\t\tos=-moxiebox\n\t\t;;\n\tmsdos)\n\t\tbasic_machine=i386-pc\n\t\tos=-msdos\n\t\t;;\n\tms1-*)\n\t\tbasic_machine=`echo $basic_machine | sed -e 's/ms1-/mt-/'`\n\t\t;;\n\tmsys)\n\t\tbasic_machine=i686-pc\n\t\tos=-msys\n\t\t;;\n\tmvs)\n\t\tbasic_machine=i370-ibm\n\t\tos=-mvs\n\t\t;;\n\tnacl)\n\t\tbasic_machine=le32-unknown\n\t\tos=-nacl\n\t\t;;\n\tncr3000)\n\t\tbasic_machine=i486-ncr\n\t\tos=-sysv4\n\t\t;;\n\tnetbsd386)\n\t\tbasic_machine=i386-unknown\n\t\tos=-netbsd\n\t\t;;\n\tnetwinder)\n\t\tbasic_machine=armv4l-rebel\n\t\tos=-linux\n\t\t;;\n\tnews | news700 | news800 | news900)\n\t\tbasic_machine=m68k-sony\n\t\tos=-newsos\n\t\t;;\n\tnews1000)\n\t\tbasic_machine=m68030-sony\n\t\tos=-newsos\n\t\t;;\n\tnews-3600 | risc-news)\n\t\tbasic_machine=mips-sony\n\t\tos=-newsos\n\t\t;;\n\tnecv70)\n\t\tbasic_machine=v70-nec\n\t\tos=-sysv\n\t\t;;\n\tnext | m*-next )\n\t\tbasic_machine=m68k-next\n\t\tcase $os in\n\t\t    -nextstep* )\n\t\t\t;;\n\t\t    -ns2*)\n\t\t      os=-nextstep2\n\t\t\t;;\n\t\t    *)\n\t\t      os=-nextstep3\n\t\t\t;;\n\t\tesac\n\t\t;;\n\tnh3000)\n\t\tbasic_machine=m68k-harris\n\t\tos=-cxux\n\t\t;;\n\tnh[45]000)\n\t\tbasic_machine=m88k-harris\n\t\tos=-cxux\n\t\t;;\n\tnindy960)\n\t\tbasic_machine=i960-intel\n\t\tos=-nindy\n\t\t;;\n\tmon960)\n\t\tbasic_machine=i960-intel\n\t\tos=-mon960\n\t\t;;\n\tnonstopux)\n\t\tbasic_machine=mips-compaq\n\t\tos=-nonstopux\n\t\t;;\n\tnp1)\n\t\tbasic_machine=np1-gould\n\t\t;;\n\tneo-tandem)\n\t\tbasic_machine=neo-tandem\n\t\t;;\n\tnse-tandem)\n\t\tbasic_machine=nse-tandem\n\t\t;;\n\tnsr-tandem)\n\t\tbasic_machine=nsr-tandem\n\t\t;;\n\tnsx-tandem)\n\t\tbasic_machine=nsx-tandem\n\t\t;;\n\top50n-* | op60c-*)\n\t\tbasic_machine=hppa1.1-oki\n\t\tos=-proelf\n\t\t;;\n\topenrisc | 
openrisc-*)\n\t\tbasic_machine=or32-unknown\n\t\t;;\n\tos400)\n\t\tbasic_machine=powerpc-ibm\n\t\tos=-os400\n\t\t;;\n\tOSE68000 | ose68000)\n\t\tbasic_machine=m68000-ericsson\n\t\tos=-ose\n\t\t;;\n\tos68k)\n\t\tbasic_machine=m68k-none\n\t\tos=-os68k\n\t\t;;\n\tpa-hitachi)\n\t\tbasic_machine=hppa1.1-hitachi\n\t\tos=-hiuxwe2\n\t\t;;\n\tparagon)\n\t\tbasic_machine=i860-intel\n\t\tos=-osf\n\t\t;;\n\tparisc)\n\t\tbasic_machine=hppa-unknown\n\t\tos=-linux\n\t\t;;\n\tparisc-*)\n\t\tbasic_machine=hppa-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\tos=-linux\n\t\t;;\n\tpbd)\n\t\tbasic_machine=sparc-tti\n\t\t;;\n\tpbb)\n\t\tbasic_machine=m68k-tti\n\t\t;;\n\tpc532 | pc532-*)\n\t\tbasic_machine=ns32k-pc532\n\t\t;;\n\tpc98)\n\t\tbasic_machine=i386-pc\n\t\t;;\n\tpc98-*)\n\t\tbasic_machine=i386-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tpentium | p5 | k5 | k6 | nexgen | viac3)\n\t\tbasic_machine=i586-pc\n\t\t;;\n\tpentiumpro | p6 | 6x86 | athlon | athlon_*)\n\t\tbasic_machine=i686-pc\n\t\t;;\n\tpentiumii | pentium2 | pentiumiii | pentium3)\n\t\tbasic_machine=i686-pc\n\t\t;;\n\tpentium4)\n\t\tbasic_machine=i786-pc\n\t\t;;\n\tpentium-* | p5-* | k5-* | k6-* | nexgen-* | viac3-*)\n\t\tbasic_machine=i586-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tpentiumpro-* | p6-* | 6x86-* | athlon-*)\n\t\tbasic_machine=i686-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tpentiumii-* | pentium2-* | pentiumiii-* | pentium3-*)\n\t\tbasic_machine=i686-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tpentium4-*)\n\t\tbasic_machine=i786-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tpn)\n\t\tbasic_machine=pn-gould\n\t\t;;\n\tpower)\tbasic_machine=power-ibm\n\t\t;;\n\tppc | ppcbe)\tbasic_machine=powerpc-unknown\n\t\t;;\n\tppc-* | ppcbe-*)\n\t\tbasic_machine=powerpc-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tppcle | powerpclittle)\n\t\tbasic_machine=powerpcle-unknown\n\t\t;;\n\tppcle-* | powerpclittle-*)\n\t\tbasic_machine=powerpcle-`echo $basic_machine | sed 
's/^[^-]*-//'`\n\t\t;;\n\tppc64)\tbasic_machine=powerpc64-unknown\n\t\t;;\n\tppc64-*) basic_machine=powerpc64-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tppc64le | powerpc64little)\n\t\tbasic_machine=powerpc64le-unknown\n\t\t;;\n\tppc64le-* | powerpc64little-*)\n\t\tbasic_machine=powerpc64le-`echo $basic_machine | sed 's/^[^-]*-//'`\n\t\t;;\n\tps2)\n\t\tbasic_machine=i386-ibm\n\t\t;;\n\tpw32)\n\t\tbasic_machine=i586-unknown\n\t\tos=-pw32\n\t\t;;\n\trdos | rdos64)\n\t\tbasic_machine=x86_64-pc\n\t\tos=-rdos\n\t\t;;\n\trdos32)\n\t\tbasic_machine=i386-pc\n\t\tos=-rdos\n\t\t;;\n\trom68k)\n\t\tbasic_machine=m68k-rom68k\n\t\tos=-coff\n\t\t;;\n\trm[46]00)\n\t\tbasic_machine=mips-siemens\n\t\t;;\n\trtpc | rtpc-*)\n\t\tbasic_machine=romp-ibm\n\t\t;;\n\ts390 | s390-*)\n\t\tbasic_machine=s390-ibm\n\t\t;;\n\ts390x | s390x-*)\n\t\tbasic_machine=s390x-ibm\n\t\t;;\n\tsa29200)\n\t\tbasic_machine=a29k-amd\n\t\tos=-udi\n\t\t;;\n\tsb1)\n\t\tbasic_machine=mipsisa64sb1-unknown\n\t\t;;\n\tsb1el)\n\t\tbasic_machine=mipsisa64sb1el-unknown\n\t\t;;\n\tsde)\n\t\tbasic_machine=mipsisa32-sde\n\t\tos=-elf\n\t\t;;\n\tsei)\n\t\tbasic_machine=mips-sei\n\t\tos=-seiux\n\t\t;;\n\tsequent)\n\t\tbasic_machine=i386-sequent\n\t\t;;\n\tsh)\n\t\tbasic_machine=sh-hitachi\n\t\tos=-hms\n\t\t;;\n\tsh5el)\n\t\tbasic_machine=sh5le-unknown\n\t\t;;\n\tsh64)\n\t\tbasic_machine=sh64-unknown\n\t\t;;\n\tsparclite-wrs | simso-wrs)\n\t\tbasic_machine=sparclite-wrs\n\t\tos=-vxworks\n\t\t;;\n\tsps7)\n\t\tbasic_machine=m68k-bull\n\t\tos=-sysv2\n\t\t;;\n\tspur)\n\t\tbasic_machine=spur-unknown\n\t\t;;\n\tst2000)\n\t\tbasic_machine=m68k-tandem\n\t\t;;\n\tstratus)\n\t\tbasic_machine=i860-stratus\n\t\tos=-sysv4\n\t\t;;\n\tstrongarm-* | thumb-*)\n\t\tbasic_machine=arm-`echo $basic_machine | sed 
's/^[^-]*-//'`\n\t\t;;\n\tsun2)\n\t\tbasic_machine=m68000-sun\n\t\t;;\n\tsun2os3)\n\t\tbasic_machine=m68000-sun\n\t\tos=-sunos3\n\t\t;;\n\tsun2os4)\n\t\tbasic_machine=m68000-sun\n\t\tos=-sunos4\n\t\t;;\n\tsun3os3)\n\t\tbasic_machine=m68k-sun\n\t\tos=-sunos3\n\t\t;;\n\tsun3os4)\n\t\tbasic_machine=m68k-sun\n\t\tos=-sunos4\n\t\t;;\n\tsun4os3)\n\t\tbasic_machine=sparc-sun\n\t\tos=-sunos3\n\t\t;;\n\tsun4os4)\n\t\tbasic_machine=sparc-sun\n\t\tos=-sunos4\n\t\t;;\n\tsun4sol2)\n\t\tbasic_machine=sparc-sun\n\t\tos=-solaris2\n\t\t;;\n\tsun3 | sun3-*)\n\t\tbasic_machine=m68k-sun\n\t\t;;\n\tsun4)\n\t\tbasic_machine=sparc-sun\n\t\t;;\n\tsun386 | sun386i | roadrunner)\n\t\tbasic_machine=i386-sun\n\t\t;;\n\tsv1)\n\t\tbasic_machine=sv1-cray\n\t\tos=-unicos\n\t\t;;\n\tsymmetry)\n\t\tbasic_machine=i386-sequent\n\t\tos=-dynix\n\t\t;;\n\tt3e)\n\t\tbasic_machine=alphaev5-cray\n\t\tos=-unicos\n\t\t;;\n\tt90)\n\t\tbasic_machine=t90-cray\n\t\tos=-unicos\n\t\t;;\n\ttile*)\n\t\tbasic_machine=$basic_machine-unknown\n\t\tos=-linux-gnu\n\t\t;;\n\ttx39)\n\t\tbasic_machine=mipstx39-unknown\n\t\t;;\n\ttx39el)\n\t\tbasic_machine=mipstx39el-unknown\n\t\t;;\n\ttoad1)\n\t\tbasic_machine=pdp10-xkl\n\t\tos=-tops20\n\t\t;;\n\ttower | tower-32)\n\t\tbasic_machine=m68k-ncr\n\t\t;;\n\ttpf)\n\t\tbasic_machine=s390x-ibm\n\t\tos=-tpf\n\t\t;;\n\tudi29k)\n\t\tbasic_machine=a29k-amd\n\t\tos=-udi\n\t\t;;\n\tultra3)\n\t\tbasic_machine=a29k-nyu\n\t\tos=-sym1\n\t\t;;\n\tv810 | 
necv810)\n\t\tbasic_machine=v810-nec\n\t\tos=-none\n\t\t;;\n\tvaxv)\n\t\tbasic_machine=vax-dec\n\t\tos=-sysv\n\t\t;;\n\tvms)\n\t\tbasic_machine=vax-dec\n\t\tos=-vms\n\t\t;;\n\tvpp*|vx|vx-*)\n\t\tbasic_machine=f301-fujitsu\n\t\t;;\n\tvxworks960)\n\t\tbasic_machine=i960-wrs\n\t\tos=-vxworks\n\t\t;;\n\tvxworks68)\n\t\tbasic_machine=m68k-wrs\n\t\tos=-vxworks\n\t\t;;\n\tvxworks29k)\n\t\tbasic_machine=a29k-wrs\n\t\tos=-vxworks\n\t\t;;\n\twasm32)\n\t\tbasic_machine=wasm32-unknown\n\t\t;;\n\tw65*)\n\t\tbasic_machine=w65-wdc\n\t\tos=-none\n\t\t;;\n\tw89k-*)\n\t\tbasic_machine=hppa1.1-winbond\n\t\tos=-proelf\n\t\t;;\n\txbox)\n\t\tbasic_machine=i686-pc\n\t\tos=-mingw32\n\t\t;;\n\txps | xps100)\n\t\tbasic_machine=xps100-honeywell\n\t\t;;\n\txscale-* | xscalee[bl]-*)\n\t\tbasic_machine=`echo $basic_machine | sed 's/^xscale/arm/'`\n\t\t;;\n\tymp)\n\t\tbasic_machine=ymp-cray\n\t\tos=-unicos\n\t\t;;\n\tz8k-*-coff)\n\t\tbasic_machine=z8k-unknown\n\t\tos=-sim\n\t\t;;\n\tz80-*-coff)\n\t\tbasic_machine=z80-unknown\n\t\tos=-sim\n\t\t;;\n\tnone)\n\t\tbasic_machine=none-none\n\t\tos=-none\n\t\t;;\n\n# Here we handle the default manufacturer of certain CPU types.  
It is in\n# some cases the only manufacturer, in others, it is the most popular.\n\tw89k)\n\t\tbasic_machine=hppa1.1-winbond\n\t\t;;\n\top50n)\n\t\tbasic_machine=hppa1.1-oki\n\t\t;;\n\top60c)\n\t\tbasic_machine=hppa1.1-oki\n\t\t;;\n\tromp)\n\t\tbasic_machine=romp-ibm\n\t\t;;\n\tmmix)\n\t\tbasic_machine=mmix-knuth\n\t\t;;\n\trs6000)\n\t\tbasic_machine=rs6000-ibm\n\t\t;;\n\tvax)\n\t\tbasic_machine=vax-dec\n\t\t;;\n\tpdp10)\n\t\t# there are many clones, so DEC is not a safe bet\n\t\tbasic_machine=pdp10-unknown\n\t\t;;\n\tpdp11)\n\t\tbasic_machine=pdp11-dec\n\t\t;;\n\twe32k)\n\t\tbasic_machine=we32k-att\n\t\t;;\n\tsh[1234] | sh[24]a | sh[24]aeb | sh[34]eb | sh[1234]le | sh[23]ele)\n\t\tbasic_machine=sh-unknown\n\t\t;;\n\tsparc | sparcv8 | sparcv9 | sparcv9b | sparcv9v)\n\t\tbasic_machine=sparc-sun\n\t\t;;\n\tcydra)\n\t\tbasic_machine=cydra-cydrome\n\t\t;;\n\torion)\n\t\tbasic_machine=orion-highlevel\n\t\t;;\n\torion105)\n\t\tbasic_machine=clipper-highlevel\n\t\t;;\n\tmac | mpw | mac-mpw)\n\t\tbasic_machine=m68k-apple\n\t\t;;\n\tpmac | pmac-mpw)\n\t\tbasic_machine=powerpc-apple\n\t\t;;\n\t*-unknown)\n\t\t# Make sure to match an already-canonicalized machine name.\n\t\t;;\n\t*)\n\t\techo Invalid configuration \\`$1\\': machine \\`$basic_machine\\' not recognized 1>&2\n\t\texit 1\n\t\t;;\nesac\n\n# Here we canonicalize certain aliases for manufacturers.\ncase $basic_machine in\n\t*-digital*)\n\t\tbasic_machine=`echo $basic_machine | sed 's/digital.*/dec/'`\n\t\t;;\n\t*-commodore*)\n\t\tbasic_machine=`echo $basic_machine | sed 's/commodore.*/cbm/'`\n\t\t;;\n\t*)\n\t\t;;\nesac\n\n# Decode manufacturer-specific aliases for certain operating systems.\n\nif [ x\"$os\" != x\"\" ]\nthen\ncase $os in\n\t# First match some system type aliases\n\t# that might get confused with valid system types.\n\t# -solaris* is a basic system type, with this one exception.\n\t-auroraux)\n\t\tos=-auroraux\n\t\t;;\n\t-solaris1 | -solaris1.*)\n\t\tos=`echo $os | sed -e 
's|solaris1|sunos4|'`\n\t\t;;\n\t-solaris)\n\t\tos=-solaris2\n\t\t;;\n\t-svr4*)\n\t\tos=-sysv4\n\t\t;;\n\t-unixware*)\n\t\tos=-sysv4.2uw\n\t\t;;\n\t-gnu/linux*)\n\t\tos=`echo $os | sed -e 's|gnu/linux|linux-gnu|'`\n\t\t;;\n\t# First accept the basic system types.\n\t# The portable systems comes first.\n\t# Each alternative MUST END IN A *, to match a version number.\n\t# -sysv* is not here because it comes later, after sysvr4.\n\t-gnu* | -bsd* | -mach* | -minix* | -genix* | -ultrix* | -irix* \\\n\t      | -*vms* | -sco* | -esix* | -isc* | -aix* | -cnk* | -sunos | -sunos[34]*\\\n\t      | -hpux* | -unos* | -osf* | -luna* | -dgux* | -auroraux* | -solaris* \\\n\t      | -sym* | -kopensolaris* | -plan9* \\\n\t      | -amigaos* | -amigados* | -msdos* | -newsos* | -unicos* | -aof* \\\n\t      | -aos* | -aros* | -cloudabi* | -sortix* \\\n\t      | -nindy* | -vxsim* | -vxworks* | -ebmon* | -hms* | -mvs* \\\n\t      | -clix* | -riscos* | -uniplus* | -iris* | -rtu* | -xenix* \\\n\t      | -hiux* | -386bsd* | -knetbsd* | -mirbsd* | -netbsd* \\\n\t      | -bitrig* | -openbsd* | -solidbsd* | -libertybsd* \\\n\t      | -ekkobsd* | -kfreebsd* | -freebsd* | -riscix* | -lynxos* \\\n\t      | -bosx* | -nextstep* | -cxux* | -aout* | -elf* | -oabi* \\\n\t      | -ptx* | -coff* | -ecoff* | -winnt* | -domain* | -vsta* \\\n\t      | -udi* | -eabi* | -lites* | -ieee* | -go32* | -aux* \\\n\t      | -chorusos* | -chorusrdb* | -cegcc* | -glidix* \\\n\t      | -cygwin* | -msys* | -pe* | -psos* | -moss* | -proelf* | -rtems* \\\n\t      | -midipix* | -mingw32* | -mingw64* | -linux-gnu* | -linux-android* \\\n\t      | -linux-newlib* | -linux-musl* | -linux-uclibc* \\\n\t      | -uxpv* | -beos* | -mpeix* | -udk* | -moxiebox* \\\n\t      | -interix* | -uwin* | -mks* | -rhapsody* | -darwin* | -opened* \\\n\t      | -openstep* | -oskit* | -conix* | -pw32* | -nonstopux* \\\n\t      | -storm-chaos* | -tops10* | -tenex* | -tops20* | -its* \\\n\t      | -os2* | -vos* | -palmos* | -uclinux* | -nucleus* 
\\\n\t      | -morphos* | -superux* | -rtmk* | -rtmk-nova* | -windiss* \\\n\t      | -powermax* | -dnix* | -nx6 | -nx7 | -sei* | -dragonfly* \\\n\t      | -skyos* | -haiku* | -rdos* | -toppers* | -drops* | -es* \\\n\t      | -onefs* | -tirtos* | -phoenix* | -fuchsia* | -redox*)\n\t# Remember, each alternative MUST END IN *, to match a version number.\n\t\t;;\n\t-qnx*)\n\t\tcase $basic_machine in\n\t\t    x86-* | i*86-*)\n\t\t\t;;\n\t\t    *)\n\t\t\tos=-nto$os\n\t\t\t;;\n\t\tesac\n\t\t;;\n\t-nto-qnx*)\n\t\t;;\n\t-nto*)\n\t\tos=`echo $os | sed -e 's|nto|nto-qnx|'`\n\t\t;;\n\t-sim | -es1800* | -hms* | -xray | -os68k* | -none* | -v88r* \\\n\t      | -windows* | -osx | -abug | -netware* | -os9* | -beos* | -haiku* \\\n\t      | -macos* | -mpw* | -magic* | -mmixware* | -mon960* | -lnews*)\n\t\t;;\n\t-mac*)\n\t\tos=`echo $os | sed -e 's|mac|macos|'`\n\t\t;;\n\t-linux-dietlibc)\n\t\tos=-linux-dietlibc\n\t\t;;\n\t-linux*)\n\t\tos=`echo $os | sed -e 's|linux|linux-gnu|'`\n\t\t;;\n\t-sunos5*)\n\t\tos=`echo $os | sed -e 's|sunos5|solaris2|'`\n\t\t;;\n\t-sunos6*)\n\t\tos=`echo $os | sed -e 's|sunos6|solaris3|'`\n\t\t;;\n\t-opened*)\n\t\tos=-openedition\n\t\t;;\n\t-os400*)\n\t\tos=-os400\n\t\t;;\n\t-wince*)\n\t\tos=-wince\n\t\t;;\n\t-osfrose*)\n\t\tos=-osfrose\n\t\t;;\n\t-osf*)\n\t\tos=-osf\n\t\t;;\n\t-utek*)\n\t\tos=-bsd\n\t\t;;\n\t-dynix*)\n\t\tos=-bsd\n\t\t;;\n\t-acis*)\n\t\tos=-aos\n\t\t;;\n\t-atheos*)\n\t\tos=-atheos\n\t\t;;\n\t-syllable*)\n\t\tos=-syllable\n\t\t;;\n\t-386bsd)\n\t\tos=-bsd\n\t\t;;\n\t-ctix* | -uts*)\n\t\tos=-sysv\n\t\t;;\n\t-nova*)\n\t\tos=-rtmk-nova\n\t\t;;\n\t-ns2 )\n\t\tos=-nextstep2\n\t\t;;\n\t-nsk*)\n\t\tos=-nsk\n\t\t;;\n\t# Preserve the version number of sinix5.\n\t-sinix5.*)\n\t\tos=`echo $os | sed -e 
's|sinix|sysv|'`\n\t\t;;\n\t-sinix*)\n\t\tos=-sysv4\n\t\t;;\n\t-tpf*)\n\t\tos=-tpf\n\t\t;;\n\t-triton*)\n\t\tos=-sysv3\n\t\t;;\n\t-oss*)\n\t\tos=-sysv3\n\t\t;;\n\t-svr4)\n\t\tos=-sysv4\n\t\t;;\n\t-svr3)\n\t\tos=-sysv3\n\t\t;;\n\t-sysvr4)\n\t\tos=-sysv4\n\t\t;;\n\t# This must come after -sysvr4.\n\t-sysv*)\n\t\t;;\n\t-ose*)\n\t\tos=-ose\n\t\t;;\n\t-es1800*)\n\t\tos=-ose\n\t\t;;\n\t-xenix)\n\t\tos=-xenix\n\t\t;;\n\t-*mint | -mint[0-9]* | -*MiNT | -MiNT[0-9]*)\n\t\tos=-mint\n\t\t;;\n\t-aros*)\n\t\tos=-aros\n\t\t;;\n\t-zvmoe)\n\t\tos=-zvmoe\n\t\t;;\n\t-dicos*)\n\t\tos=-dicos\n\t\t;;\n\t-nacl*)\n\t\t;;\n\t-ios)\n\t\t;;\n\t-none)\n\t\t;;\n\t*)\n\t\t# Get rid of the `-' at the beginning of $os.\n\t\tos=`echo $os | sed 's/[^-]*-//'`\n\t\techo Invalid configuration \\`$1\\': system \\`$os\\' not recognized 1>&2\n\t\texit 1\n\t\t;;\nesac\nelse\n\n# Here we handle the default operating systems that come with various machines.\n# The value should be what the vendor currently ships out the door with their\n# machine or put another way, the most popular os provided with the machine.\n\n# Note that if you're going to try to match \"-MANUFACTURER\" here (say,\n# \"-sun\"), then you have to tell the case statement up towards the top\n# that MANUFACTURER isn't an operating system.  
Otherwise, code above\n# will signal an error saying that MANUFACTURER isn't an operating\n# system, and we'll never get to this point.\n\ncase $basic_machine in\n\tscore-*)\n\t\tos=-elf\n\t\t;;\n\tspu-*)\n\t\tos=-elf\n\t\t;;\n\t*-acorn)\n\t\tos=-riscix1.2\n\t\t;;\n\tarm*-rebel)\n\t\tos=-linux\n\t\t;;\n\tarm*-semi)\n\t\tos=-aout\n\t\t;;\n\tc4x-* | tic4x-*)\n\t\tos=-coff\n\t\t;;\n\tc8051-*)\n\t\tos=-elf\n\t\t;;\n\thexagon-*)\n\t\tos=-elf\n\t\t;;\n\ttic54x-*)\n\t\tos=-coff\n\t\t;;\n\ttic55x-*)\n\t\tos=-coff\n\t\t;;\n\ttic6x-*)\n\t\tos=-coff\n\t\t;;\n\t# This must come before the *-dec entry.\n\tpdp10-*)\n\t\tos=-tops20\n\t\t;;\n\tpdp11-*)\n\t\tos=-none\n\t\t;;\n\t*-dec | vax-*)\n\t\tos=-ultrix4.2\n\t\t;;\n\tm68*-apollo)\n\t\tos=-domain\n\t\t;;\n\ti386-sun)\n\t\tos=-sunos4.0.2\n\t\t;;\n\tm68000-sun)\n\t\tos=-sunos3\n\t\t;;\n\tm68*-cisco)\n\t\tos=-aout\n\t\t;;\n\tmep-*)\n\t\tos=-elf\n\t\t;;\n\tmips*-cisco)\n\t\tos=-elf\n\t\t;;\n\tmips*-*)\n\t\tos=-elf\n\t\t;;\n\tor32-*)\n\t\tos=-coff\n\t\t;;\n\t*-tti)\t# must be before sparc entry or we get the wrong os.\n\t\tos=-sysv3\n\t\t;;\n\tsparc-* | *-sun)\n\t\tos=-sunos4.1.1\n\t\t;;\n\tpru-*)\n\t\tos=-elf\n\t\t;;\n\t*-be)\n\t\tos=-beos\n\t\t;;\n\t*-haiku)\n\t\tos=-haiku\n\t\t;;\n\t*-ibm)\n\t\tos=-aix\n\t\t;;\n\t*-knuth)\n\t\tos=-mmixware\n\t\t;;\n\t*-wec)\n\t\tos=-proelf\n\t\t;;\n\t*-winbond)\n\t\tos=-proelf\n\t\t;;\n\t*-oki)\n\t\tos=-proelf\n\t\t;;\n\t*-hp)\n\t\tos=-hpux\n\t\t;;\n\t*-hitachi)\n\t\tos=-hiux\n\t\t;;\n\ti860-* | *-att | *-ncr | *-altos | *-motorola | *-convergent)\n\t\tos=-sysv\n\t\t;;\n\t*-cbm)\n\t\tos=-amigaos\n\t\t;;\n\t*-dg)\n\t\tos=-dgux\n\t\t;;\n\t*-dolphin)\n\t\tos=-sysv3\n\t\t;;\n\tm68k-ccur)\n\t\tos=-rtu\n\t\t;;\n\tm88k-omron*)\n\t\tos=-luna\n\t\t;;\n\t*-next 
)\n\t\tos=-nextstep\n\t\t;;\n\t*-sequent)\n\t\tos=-ptx\n\t\t;;\n\t*-crds)\n\t\tos=-unos\n\t\t;;\n\t*-ns)\n\t\tos=-genix\n\t\t;;\n\ti370-*)\n\t\tos=-mvs\n\t\t;;\n\t*-next)\n\t\tos=-nextstep3\n\t\t;;\n\t*-gould)\n\t\tos=-sysv\n\t\t;;\n\t*-highlevel)\n\t\tos=-bsd\n\t\t;;\n\t*-encore)\n\t\tos=-bsd\n\t\t;;\n\t*-sgi)\n\t\tos=-irix\n\t\t;;\n\t*-siemens)\n\t\tos=-sysv4\n\t\t;;\n\t*-masscomp)\n\t\tos=-rtu\n\t\t;;\n\tf30[01]-fujitsu | f700-fujitsu)\n\t\tos=-uxpv\n\t\t;;\n\t*-rom68k)\n\t\tos=-coff\n\t\t;;\n\t*-*bug)\n\t\tos=-coff\n\t\t;;\n\t*-apple)\n\t\tos=-macos\n\t\t;;\n\t*-atari*)\n\t\tos=-mint\n\t\t;;\n\t*)\n\t\tos=-none\n\t\t;;\nesac\nfi\n\n# Here we handle the case where we know the os, and the CPU type, but not the\n# manufacturer.  We pick the logical manufacturer.\nvendor=unknown\ncase $basic_machine in\n\t*-unknown)\n\t\tcase $os in\n\t\t\t-riscix*)\n\t\t\t\tvendor=acorn\n\t\t\t\t;;\n\t\t\t-sunos*)\n\t\t\t\tvendor=sun\n\t\t\t\t;;\n\t\t\t-cnk*|-aix*)\n\t\t\t\tvendor=ibm\n\t\t\t\t;;\n\t\t\t-beos*)\n\t\t\t\tvendor=be\n\t\t\t\t;;\n\t\t\t-hpux*)\n\t\t\t\tvendor=hp\n\t\t\t\t;;\n\t\t\t-mpeix*)\n\t\t\t\tvendor=hp\n\t\t\t\t;;\n\t\t\t-hiux*)\n\t\t\t\tvendor=hitachi\n\t\t\t\t;;\n\t\t\t-unos*)\n\t\t\t\tvendor=crds\n\t\t\t\t;;\n\t\t\t-dgux*)\n\t\t\t\tvendor=dg\n\t\t\t\t;;\n\t\t\t-luna*)\n\t\t\t\tvendor=omron\n\t\t\t\t;;\n\t\t\t-genix*)\n\t\t\t\tvendor=ns\n\t\t\t\t;;\n\t\t\t-mvs* | -opened*)\n\t\t\t\tvendor=ibm\n\t\t\t\t;;\n\t\t\t-os400*)\n\t\t\t\tvendor=ibm\n\t\t\t\t;;\n\t\t\t-ptx*)\n\t\t\t\tvendor=sequent\n\t\t\t\t;;\n\t\t\t-tpf*)\n\t\t\t\tvendor=ibm\n\t\t\t\t;;\n\t\t\t-vxsim* | -vxworks* | -windiss*)\n\t\t\t\tvendor=wrs\n\t\t\t\t;;\n\t\t\t-aux*)\n\t\t\t\tvendor=apple\n\t\t\t\t;;\n\t\t\t-hms*)\n\t\t\t\tvendor=hitachi\n\t\t\t\t;;\n\t\t\t-mpw* | -macos*)\n\t\t\t\tvendor=apple\n\t\t\t\t;;\n\t\t\t-*mint | -mint[0-9]* | -*MiNT | -MiNT[0-9]*)\n\t\t\t\tvendor=atari\n\t\t\t\t;;\n\t\t\t-vos*)\n\t\t\t\tvendor=stratus\n\t\t\t\t;;\n\t\tesac\n\t\tbasic_machine=`echo $basic_machine | sed 
\"s/unknown/$vendor/\"`\n\t\t;;\nesac\n\necho $basic_machine$os\nexit\n\n# Local variables:\n# eval: (add-hook 'write-file-hooks 'time-stamp)\n# time-stamp-start: \"timestamp='\"\n# time-stamp-format: \"%:y-%02m-%02d\"\n# time-stamp-end: \"'\"\n# End:\n"
  },
  {
    "path": "build/linux/configure",
    "content": "#!/bin/bash\n\nif test x\"$1\" = x\"-h\" -o x\"$1\" = x\"--help\" ; then\ncat <<EOF\nUsage: ./configure [options]\n\nHelp:\n  -h, --help               print this message\n\nStandard options:\n  --prefix=PREFIX          install architecture-independent files in PREFIX\n                           [/usr/local]\n  --exec-prefix=EPREFIX    install architecture-dependent files in EPREFIX\n                           [PREFIX]\n  --bindir=DIR             install binaries in DIR [EPREFIX/bin]\n  --libdir=DIR             install libs in DIR [EPREFIX/lib]\n  --includedir=DIR         install includes in DIR [PREFIX/include]\n  --extra-asflags=EASFLAGS add EASFLAGS to ASFLAGS\n  --extra-cflags=ECFLAGS   add ECFLAGS to CFLAGS\n  --extra-ldflags=ELDFLAGS add ELDFLAGS to LDFLAGS\n  --extra-rcflags=ERCFLAGS add ERCFLAGS to RCFLAGS\n\nConfiguration options:\n  --disable-cli            disable cli\n  --system-libdavs2        use system libdavs2 instead of internal\n  --enable-shared          build shared library\n  --disable-static         disable building static library\n  --disable-opencl         disable OpenCL features\n  --disable-gpl            disable GPL-only features\n  --disable-thread         disable multithreaded encoding\n  --disable-win32thread    disable win32threads (windows only)\n  --disable-interlaced     disable interlaced encoding support\n  --bit-depth=BIT_DEPTH    set output bit depth (8-10) [8]\n  --chroma-format=FORMAT   output chroma format (420, 422, 444, all) [all]\n\nAdvanced options:\n  --disable-asm            disable platform-specific assembly optimizations\n  --enable-lto             enable link-time optimization\n  --enable-debug           add -g\n  --enable-gprof           add -pg\n  --enable-strip           add -s\n  --enable-pic             build position-independent code\n\nCross-compilation:\n  --host=HOST              build programs to run on HOST\n  --cross-prefix=PREFIX    use PREFIX for compilation tools\n  --sysroot=SYSROOT    
    root of cross-build tree\n\nEOF\nexit 1\nfi\n\nlog_check() {\n    echo -n \"checking $1... \" >> config.log\n}\n\nlog_ok() {\n    echo \"yes\" >> config.log\n}\n\nlog_fail() {\n    echo \"no\" >> config.log\n}\n\nlog_msg() {\n    echo \"$1\" >> config.log\n}\n\ncc_cflags() {\n    # several non g++ compilers issue an incredibly large number of warnings on high warning levels,\n    # suppress them by reducing the warning level rather than having to use #pragmas\n    for arg in $*; do\n        [[ \"$arg\" = -falign-loops* ]] && arg=\n        [ \"$arg\" = -fno-tree-vectorize ] && arg=\n        [ \"$arg\" = -Wshadow ] && arg=\n        [ \"$arg\" = -Wno-maybe-uninitialized ] && arg=\n        [[ \"$arg\" = -mpreferred-stack-boundary* ]] && arg=\n        [[ \"$arg\" = -l* ]] && arg=\n        [[ \"$arg\" = -L* ]] && arg=\n        if [ $compiler_style = MS ]; then\n            [ \"$arg\" = -ffast-math ] && arg=\"-fp:fast\"\n            [ \"$arg\" = -Wall ] && arg=\n            [ \"$arg\" = -Werror ] && arg=\"-W3 -WX\"\n            [ \"$arg\" = -g ] && arg=-Z7\n            [ \"$arg\" = -fomit-frame-pointer ] && arg=\n            [ \"$arg\" = -s ] && arg=\n            [ \"$arg\" = -fPIC ] && arg=\n        else\n            [ \"$arg\" = -ffast-math ] && arg=\n            [ \"$arg\" = -Wall ] && arg=\n            [ \"$arg\" = -Werror ] && arg=\"-w3 -Werror\"\n        fi\n        [ $compiler = CL -a \"$arg\" = -O3 ] && arg=-O2\n\n        [ -n \"$arg\" ] && echo -n \"$arg \"\n    done\n}\n\ncl_ldflags() {\n    for arg in $*; do\n        arg=${arg/LIBPATH/libpath}\n        [ \"${arg#-libpath:}\" == \"$arg\" -a \"${arg#-l}\" != \"$arg\" ] && arg=${arg#-l}.lib\n        [ \"${arg#-L}\" != \"$arg\" ] && arg=-libpath:${arg#-L}\n        [ \"$arg\" = -Wl,--large-address-aware ] && arg=-largeaddressaware\n        [ \"$arg\" = -s ] && arg=\n        [ \"$arg\" = -Wl,-Bsymbolic ] && arg=\n        [ \"$arg\" = -fno-tree-vectorize ] && arg=\n        [ \"$arg\" = -Werror ] && arg=\n        [ 
\"$arg\" = -Wshadow ] && arg=\n        [ \"$arg\" = -Wmaybe-uninitialized ] && arg=\n        [[ \"$arg\" = -Qdiag-error* ]] && arg=\n\n        arg=${arg/pthreadGC/pthreadVC}\n        [ \"$arg\" = avifil32.lib ] && arg=vfw32.lib\n        [ \"$arg\" = gpac_static.lib ] && arg=libgpac_static.lib\n        [ \"$arg\" = davs2.lib ] && arg=libdavs2.lib\n\n        [ -n \"$arg\" ] && echo -n \"$arg \"\n    done\n}\n\ncc_check() {\n    if [ -z \"$3\" ]; then\n        if [ -z \"$1$2\" ]; then\n            log_check \"whether $CC works\"\n        elif [ -z \"$1\" ]; then\n            log_check \"for $2\"\n        else\n            log_check \"for $1\"\n        fi\n    elif [ -z \"$1\" ]; then\n        if [ -z \"$2\" ]; then\n            log_check \"whether $CC supports $3\"\n        else\n            log_check \"whether $CC supports $3 with $2\"\n        fi\n    else\n        log_check \"for $3 in $1\";\n    fi\n    rm -f conftest.c\n    for arg in $1; do\n        echo \"#include <$arg>\" >> conftest.c\n    done\n    echo \"int main (void) { $3 return 0; }\" >> conftest.c\n    if [ $compiler_style = MS ]; then\n        cc_cmd=\"$CC conftest.c $(cc_cflags $CFLAGS $CHECK_CFLAGS $2) -link $(cl_ldflags $2 $LDFLAGSCLI $LDFLAGS)\"\n    else\n        cc_cmd=\"$CC conftest.c $CFLAGS $CHECK_CFLAGS $2 $LDFLAGSCLI $LDFLAGS -o conftest\"\n    fi\n    if $cc_cmd >conftest.log 2>&1; then\n        res=$?\n        log_ok\n    else\n        res=$?\n        log_fail\n        log_msg \"Failed commandline was:\"\n        log_msg \"--------------------------------------------------\"\n        log_msg \"$cc_cmd\"\n        cat conftest.log >> config.log\n        log_msg \"--------------------------------------------------\"\n        log_msg \"Failed program was:\"\n        log_msg \"--------------------------------------------------\"\n        cat conftest.c >> config.log\n        log_msg \"--------------------------------------------------\"\n    fi\n    return $res\n}\n\ncpp_check() {\n    
log_check \"whether $3 is true\"\n    rm -f conftest.c\n    for arg in $1; do\n        echo \"#include <$arg>\" >> conftest.c\n    done\n    echo -e \"#if !($3) \\n#error $4 \\n#endif \" >> conftest.c\n    if [ $compiler_style = MS ]; then\n        cpp_cmd=\"$CC conftest.c $(cc_cflags $CFLAGS $2) -P\"\n    else\n        cpp_cmd=\"$CC conftest.c $CFLAGS $2 -E -o conftest\"\n    fi\n    if $cpp_cmd >conftest.log 2>&1; then\n        res=$?\n        log_ok\n    else\n        res=$?\n        log_fail\n        log_msg \"--------------------------------------------------\"\n        cat conftest.log >> config.log\n        log_msg \"--------------------------------------------------\"\n        log_msg \"Failed program was:\"\n        log_msg \"--------------------------------------------------\"\n        cat conftest.c >> config.log\n        log_msg \"--------------------------------------------------\"\n    fi\n    return $res\n}\n\nas_check() {\n    log_check \"whether $AS supports $1\"\n    echo \"$1\" > conftest$AS_EXT\n    as_cmd=\"$AS conftest$AS_EXT $ASFLAGS $2 -o conftest.o\"\n    if $as_cmd >conftest.log 2>&1; then\n        res=$?\n        log_ok\n    else\n        res=$?\n        log_fail\n        log_msg \"Failed commandline was:\"\n        log_msg \"--------------------------------------------------\"\n        log_msg \"$as_cmd\"\n        cat conftest.log >> config.log\n        log_msg \"--------------------------------------------------\"\n        log_msg \"Failed program was:\"\n        log_msg \"--------------------------------------------------\"\n        cat conftest$AS_EXT >> config.log\n        log_msg \"--------------------------------------------------\"\n    fi\n    return $res\n}\n\nrc_check() {\n    log_check \"whether $RC works\"\n    echo \"$1\" > conftest.rc\n    if [ $compiler = GNU ]; then\n        rc_cmd=\"$RC $RCFLAGS -o conftest.o conftest.rc\"\n    else\n        rc_cmd=\"$RC $RCFLAGS -foconftest.o conftest.rc\"\n    fi\n    if $rc_cmd 
>conftest.log 2>&1; then\n        res=$?\n        log_ok\n    else\n        res=$?\n        log_fail\n        log_msg \"Failed commandline was:\"\n        log_msg \"--------------------------------------------------\"\n        log_msg \"$rc_cmd\"\n        cat conftest.log >> config.log\n        log_msg \"--------------------------------------------------\"\n        log_msg \"Failed program was:\"\n        log_msg \"--------------------------------------------------\"\n        cat conftest.rc >> config.log\n        log_msg \"--------------------------------------------------\"\n    fi\n    return $res\n}\n\ndefine() {\n    echo \"#define $1$([ -n \"$2\" ] && echo \" $2\" || echo \" 1\")\" >> config.h\n}\n\ndie() {\n    log_msg \"DIED: $@\"\n    echo \"$@\"\n    exit 1\n}\n\nconfigure_system_override() {\n    log_check \"system libdavs2 configuration\"\n    davs2_config_path=\"$1/davs2_config.h\"\n    if [ -e \"$davs2_config_path\" ]; then\n        res=$?\n        log_ok\n        arg=\"$(grep '#define DAVS2_GPL ' $davs2_config_path | sed -e 's/#define DAVS2_GPL *//; s/ *$//')\"\n        if [ -n \"$arg\" ]; then\n            [ \"$arg\" = 0 ] && arg=\"no\" || arg=\"yes\"\n            [ \"$arg\" != \"$gpl\" ] && die \"Incompatible license with system libdavs2\"\n        fi\n        arg=\"$(grep '#define DAVS2_BIT_DEPTH ' $davs2_config_path | sed -e 's/#define DAVS2_BIT_DEPTH *//; s/ *$//')\"\n        if [ -n \"$arg\" ]; then\n            if [ \"$arg\" != \"$bit_depth\" ]; then\n                echo \"Override output bit depth with system libdavs2 configuration\"\n                bit_depth=\"$arg\"\n            fi\n        fi\n        arg=\"$(grep '#define DAVS2_CHROMA_FORMAT ' $davs2_config_path | sed -e 's/#define DAVS2_CHROMA_FORMAT *//; s/ *$//')\"\n        if [ -n \"$arg\" ]; then\n            [ \"$arg\" = 0 ] && arg=\"all\" || arg=\"${arg#DAVS2_CSP_I}\"\n            if [ \"$arg\" != \"$chroma_format\" ]; then\n                echo \"Override output chroma format 
with system libdavs2 configuration\"\n                chroma_format=\"$arg\"\n            fi\n        fi\n        arg=\"$(grep '#define DAVS2_INTERLACED ' $davs2_config_path | sed -e 's/#define DAVS2_INTERLACED *//; s/ *$//')\"\n        if [ -n \"$arg\" ]; then\n            [ \"$arg\" = 0 ] && arg=\"no\" || arg=\"yes\"\n            if [ \"$arg\" != \"$interlaced\" ]; then\n                echo \"Override interlaced encoding support with system libdavs2 configuration\"\n                interlaced=\"$arg\"\n            fi\n        fi\n    else\n        res=$?\n        log_fail\n        log_msg \"Failed search path was: $davs2_config_path\"\n    fi\n    return $res\n}\n\nrm -f davs2_config.h config.h config.mak config.log davs2.pc davs2.def conftest*\n\n# Construct a path to the specified directory relative to the working directory\nrelative_path() {\n    local base=\"${PWD%/}\"\n    local path=\"$(cd \"$1\" >/dev/null; printf '%s/.' \"${PWD%/}\")\"\n    local up=''\n\n    while [[ $path != \"$base/\"* ]]; do\n        base=\"${base%/*}\"\n        up=\"../$up\"\n    done\n\n    dirname \"$up${path#\"$base/\"}\"\n}\n\nSRCPATH=\"$(cd ../../source ; pwd)\"\n[ \"$SRCPATH\" = \"$(pwd)\" ] && SRCPATH=.\n[ -n \"$(echo $SRCPATH | grep ' ')\" ] && die \"Out of tree builds are impossible with whitespace in source path.\"\n\nBUILDPATH=\"$(cd . 
; pwd)\"\n\necho \"$SRCPATH\" | grep -q ' ' && die \"Out of tree builds are impossible with whitespace in source path.\"\necho \"$BUILDPATH\" | grep -q ' ' && die \"Out of tree builds are impossible with whitespace in build path.\"\n[ -e \"$BUILDPATH/config.h\" -o -e \"$BUILDPATH/davs2_config.h\" ] && die \"Out of tree builds are impossible with config.h/davs2_config.h in source dir.\"\n\nprefix='/usr/local'\nexec_prefix='${prefix}'\nbindir='${exec_prefix}/bin'\nlibdir='${exec_prefix}/lib'\nincludedir='${prefix}/include'\nDEVNULL='/dev/null'\n\ncli=\"yes\"\ncli_libdavs2=\"internal\"\nshared=\"no\"\nstatic=\"yes\"\ngpl=\"yes\"\nthread=\"auto\"\nasm=\"auto\"\ninterlaced=\"yes\"\nlto=\"no\"\ndebug=\"no\"\ngprof=\"no\"\nstrip=\"no\"\npic=\"no\"\nbit_depth=\"8\"\nchroma_format=\"all\"\ncompiler=\"GNU\"\ncompiler_style=\"GNU\"\nopencl=\"no\"\nvsx=\"auto\"\n\nCFLAGS=\"$CFLAGS -Wall -I. -I\\$(SRCPATH)\"\nLDFLAGS=\"$LDFLAGS\"\nLDFLAGSCLI=\"$LDFLAGSCLI\"\nASFLAGS=\"$ASFLAGS -I. -I\\$(SRCPATH)\"\nRCFLAGS=\"$RCFLAGS\"\nCHECK_CFLAGS=\"\"\nHAVE_GETOPT_LONG=1\ncross_prefix=\"\"\n\nEXE=\"\"\nAS_EXT=\".S\"\nNL=\"\n\"\n\n# list of all preprocessor HAVE values we can define\nCONFIG_HAVE=\"MALLOC_H ALTIVEC ALTIVEC_H MMX ARMV6 ARMV6T2 NEON BEOSTHREAD POSIXTHREAD WIN32THREAD THREAD LOG2F \\\n             GPL VECTOREXT INTERLACED CPU_COUNT OPENCL THP X86_INLINE_ASM AS_FUNC INTEL_DISPATCHER \\\n             MSA MMAP WINRT VSX\"\n\n# parse options\n\nfor opt do\n    optarg=\"${opt#*=}\"\n    case \"$opt\" in\n        --prefix=*)\n            prefix=\"$optarg\"\n            ;;\n        --exec-prefix=*)\n            exec_prefix=\"$optarg\"\n            ;;\n        --bindir=*)\n            bindir=\"$optarg\"\n            ;;\n        --libdir=*)\n            libdir=\"$optarg\"\n            ;;\n        --includedir=*)\n            includedir=\"$optarg\"\n            ;;\n        --disable-cli)\n            cli=\"no\"\n            ;;\n        --system-libdavs2)\n            
cli_libdavs2=\"system\"\n            ;;\n        --enable-shared)\n            shared=\"yes\"\n            ;;\n        --disable-static)\n            static=\"no\"\n            ;;\n        --disable-asm)\n            asm=\"no\"\n            ;;\n        --disable-interlaced)\n            interlaced=\"no\"\n            ;;\n        --disable-gpl)\n            gpl=\"no\"\n            ;;\n        --extra-asflags=*)\n            ASFLAGS=\"$ASFLAGS $optarg\"\n            ;;\n        --extra-cflags=*)\n            CFLAGS=\"$CFLAGS $optarg\"\n            ;;\n        --extra-ldflags=*)\n            LDFLAGS=\"$LDFLAGS $optarg\"\n            ;;\n        --extra-rcflags=*)\n            RCFLAGS=\"$RCFLAGS $optarg\"\n            ;;\n        --disable-thread)\n            thread=\"no\"\n            ;;\n        --disable-win32thread)\n            [ \"$thread\" != \"no\" ] && thread=\"posix\"\n            ;;\n        --enable-lto)\n            lto=\"auto\"\n            ;;\n        --enable-debug)\n            debug=\"yes\"\n            ;;\n        --enable-gprof)\n            CFLAGS=\"$CFLAGS -pg\"\n            LDFLAGS=\"$LDFLAGS -pg\"\n            gprof=\"yes\"\n            ;;\n        --enable-strip)\n            strip=\"yes\"\n            ;;\n        --enable-pic)\n            pic=\"yes\"\n            ;;\n        --host=*)\n            host=\"$optarg\"\n            ;;\n        --disable-vsx)\n            vsx=\"no\"\n            ;;\n        --disable-opencl)\n            opencl=\"no\"\n            ;;\n        --cross-prefix=*)\n            cross_prefix=\"$optarg\"\n            ;;\n        --sysroot=*)\n            CFLAGS=\"$CFLAGS --sysroot=$optarg\"\n            LDFLAGS=\"$LDFLAGS --sysroot=$optarg\"\n            ;;\n        --bit-depth=*)\n            bit_depth=\"$optarg\"\n            if [ \"$bit_depth\" -lt \"8\" -o \"$bit_depth\" -gt \"10\" ]; then\n                echo \"Supplied bit depth must be in range [8,10].\"\n                exit 1\n            elif [[ \"$bit_depth\" 
= \"9\" || \"$bit_depth\" = \"10\" ]]; then\n                echo \"BitDepth $bit_depth not supported currently.\"\n                exit 1\n            fi\n            bit_depth=`expr $bit_depth + 0`\n            ;;\n        --chroma-format=*)\n            chroma_format=\"$optarg\"\n            if [ $chroma_format != \"420\" -a $chroma_format != \"422\" -a $chroma_format != \"444\" -a $chroma_format != \"all\" ]; then\n                echo \"Supplied chroma format must be 420, 422, 444 or all.\"\n                exit 1\n            fi\n            ;;\n        *)\n            echo \"Unknown option $opt, ignored\"\n            ;;\n    esac\ndone\n\n[ \"$cli\" = \"no\" -a \"$shared\" = \"no\" -a \"$static\" = \"no\" ] && die \"Nothing to build. Enable cli, shared or static.\"\n\nCC=\"${CC-${cross_prefix}g++}\"\nSTRIP=\"${STRIP-${cross_prefix}strip}\"\nINSTALL=\"${INSTALL-install}\"\nPKGCONFIG=\"${PKGCONFIG-${cross_prefix}pkg-config}\"\n\n# ar and ranlib don't load the LTO plugin by default; prefer the g++-prefixed wrappers, which do.\nif ${cross_prefix}g++-ar --version >/dev/null 2>&1; then\n    AR=\"${AR-${cross_prefix}g++-ar}\"\nelse\n    AR=\"${AR-${cross_prefix}ar}\"\nfi\nif ${cross_prefix}g++-ranlib --version >/dev/null 2>&1; then\n    RANLIB=\"${RANLIB-${cross_prefix}g++-ranlib}\"\nelse\n    RANLIB=\"${RANLIB-${cross_prefix}ranlib}\"\nfi\n\nif [ \"x$host\" = x ]; then\n    host=`${BUILDPATH}/config.guess`\nfi\n# normalize a triplet into a quadruplet\nhost=`${BUILDPATH}/config.sub $host`\n\n# split $host\nhost_cpu=\"${host%%-*}\"\nhost=\"${host#*-}\"\nhost_vendor=\"${host%%-*}\"\nhost_os=\"${host#*-}\"\n\ntrap 'rm -f conftest*' EXIT\n\n# test for use of compilers that require specific handling\ncc_base=`basename \"$CC\"`\nQPRE=\"-\"\nif [[ $host_os = mingw* || $host_os = cygwin* ]]; then\n    if [[ \"$cc_base\" = icl || \"$cc_base\" = icl[\\ .]* ]]; then\n        # Windows Intel Compiler creates dependency generation with absolute Windows paths, Cygwin's make 
does not support Windows paths.\n        [[ $host_os = cygwin* ]] && die \"Windows Intel Compiler support requires MSYS\"\n        compiler=ICL\n        compiler_style=MS\n        CFLAGS=\"$CFLAGS -Qstd=c99 -nologo -Qms0 -DHAVE_STRING_H -I\\$(SRCPATH)/extras\"\n        QPRE=\"-Q\"\n        `$CC 2>&1 | grep -q IA-32` && host_cpu=i486\n        `$CC 2>&1 | grep -q \"Intel(R) 64\"` && host_cpu=x86_64\n        cpp_check \"\" \"\" \"_MSC_VER >= 1400\" || die \"Windows Intel Compiler support requires Visual Studio 2005 or newer\"\n        if cc_check '' -Qdiag-error:10006,10157 ; then\n            CHECK_CFLAGS=\"$CHECK_CFLAGS -Qdiag-error:10006,10157\"\n        fi\n    elif [[ \"$cc_base\" = cl || \"$cc_base\" = cl[\\ .]* ]]; then\n        # Standard Microsoft Visual Studio\n        compiler=CL\n        compiler_style=MS\n        CFLAGS=\"$CFLAGS -nologo -GS- -DHAVE_STRING_H -I\\$(SRCPATH)/extras\"\n        `$CC 2>&1 | grep -q 'x86'` && host_cpu=i486\n        `$CC 2>&1 | grep -q 'x64'` && host_cpu=x86_64\n        cpp_check '' '' '_MSC_VER > 1800 || (_MSC_VER == 1800 && _MSC_FULL_VER >= 180030324)' || die \"Microsoft Visual Studio support requires Visual Studio 2013 Update 2 or newer\"\n    else\n        # MinGW uses broken pre-VS2015 Microsoft printf functions unless it's told to use the POSIX ones.\n        CFLAGS=\"$CFLAGS -D_POSIX_C_SOURCE=200112L\"\n    fi\nelse\n    if [[ \"$cc_base\" = icc || \"$cc_base\" = icc[\\ .]* ]]; then\n        AR=\"xiar\"\n        compiler=ICC\n    fi\nfi\n\nif [[ \"$cc_base\" = clang* ]]; then\n    if cc_check '' -Werror=unknown-warning-option ; then\n        CHECK_CFLAGS=\"$CHECK_CFLAGS -Werror=unknown-warning-option\"\n    fi\nfi\n\nlibm=\"\"\ncase $host_os in\n    beos*)\n        SYS=\"BEOS\"\n        define HAVE_MALLOC_H\n        ;;\n    darwin*)\n        SYS=\"MACOSX\"\n        libm=\"-lm\"\n        if [ \"$pic\" = \"no\" ]; then\n            cc_check \"\" -mdynamic-no-pic && CFLAGS=\"$CFLAGS -mdynamic-no-pic\"\n        fi\n        # 
TODO: Fix compiling error under mac osx (force disabled now)\n        asm=\"no\"\n        ;;\n    freebsd*)\n        SYS=\"FREEBSD\"\n        libm=\"-lm\"\n        ;;\n    kfreebsd*-gnu)\n        SYS=\"FREEBSD\"\n        define HAVE_MALLOC_H\n        libm=\"-lm\"\n        ;;\n    netbsd*)\n        SYS=\"NETBSD\"\n        libm=\"-lm\"\n        ;;\n    openbsd*)\n        SYS=\"OPENBSD\"\n        libm=\"-lm\"\n        ;;\n    *linux*)\n        SYS=\"LINUX\"\n        define HAVE_MALLOC_H\n        libm=\"-lm\"\n        ;;\n    gnu*)\n        SYS=\"HURD\"\n        define HAVE_MALLOC_H\n        libm=\"-lm\"\n        ;;\n    cygwin*|mingw*|msys*)\n        EXE=\".exe\"\n        if [[ $host_os = cygwin* ]] && cpp_check \"\" \"\" \"defined(__CYGWIN__)\" ; then\n            SYS=\"CYGWIN\"\n            define HAVE_MALLOC_H\n        else\n            SYS=\"WINDOWS\"\n            DEVNULL=\"NUL\"\n            LDFLAGSCLI=\"$LDFLAGSCLI -lshell32\"\n            [ $compiler = GNU ] && RC=\"${RC-${cross_prefix}windres}\" || RC=\"${RC-rc}\"\n        fi\n        ;;\n    sunos*|solaris*)\n        SYS=\"SunOS\"\n        define HAVE_MALLOC_H\n        libm=\"-lm\"\n        if cc_check \"\" /usr/lib/64/values-xpg6.o; then\n            LDFLAGS=\"$LDFLAGS /usr/lib/64/values-xpg6.o\"\n        else\n            LDFLAGS=\"$LDFLAGS /usr/lib/values-xpg6.o\"\n        fi\n        if test -x /usr/ucb/install ; then\n            INSTALL=/usr/ucb/install\n        elif test -x /usr/bin/ginstall ; then\n            # OpenSolaris\n            INSTALL=/usr/bin/ginstall\n        elif test -x /usr/gnu/bin/install ; then\n            # OpenSolaris\n            INSTALL=/usr/gnu/bin/install\n        fi\n        HAVE_GETOPT_LONG=0\n        ;;\n    *qnx*)\n        SYS=\"QNX\"\n        define HAVE_MALLOC_H\n        libm=\"-lm\"\n        HAVE_GETOPT_LONG=0\n        CFLAGS=\"$CFLAGS -I\\$(SRCPATH)/extras\"\n        ;;\n    *haiku*)\n        SYS=\"HAIKU\"\n        ;;\n    *)\n        die \"Unknown system $host, edit 
the configure\"\n        ;;\nesac\n\nLDFLAGS=\"$LDFLAGS $libm\"\n\nstack_alignment=4\ncase $host_cpu in\n    i*86)\n        ARCH=\"X86\"\n        AS=\"${AS-nasm}\"\n        AS_EXT=\".asm\"\n        CFLAGS=\"$CFLAGS -DARCH_X86_64=0\"\n        ASFLAGS=\"$ASFLAGS -DARCH_X86_64=0 -I\\$(SRCPATH)/common/x86/\"\n        if [ $compiler = GNU ]; then\n            if [[ \"$asm\" == auto && \"$CFLAGS\" != *-march* ]]; then\n                CFLAGS=\"$CFLAGS -march=i686\"\n            fi\n            if [[ \"$asm\" == auto && \"$CFLAGS\" != *-mfpmath* ]]; then\n                CFLAGS=\"$CFLAGS -mfpmath=sse -msse -msse2\"\n            fi\n            CFLAGS=\"-m32 $CFLAGS\"\n            LDFLAGS=\"-m32 $LDFLAGS\"\n        fi\n        if [ \"$SYS\" = MACOSX ]; then\n            ASFLAGS=\"$ASFLAGS -f macho32 -DPREFIX\"\n        elif [ \"$SYS\" = WINDOWS -o \"$SYS\" = CYGWIN ]; then\n            ASFLAGS=\"$ASFLAGS -f win32 -DPREFIX\"\n            LDFLAGS=\"$LDFLAGS -Wl,--large-address-aware\"\n            [ $compiler = GNU ] && LDFLAGS=\"$LDFLAGS -Wl,--dynamicbase,--nxcompat,--tsaware\"\n            [ $compiler = GNU ] && RCFLAGS=\"--target=pe-i386 $RCFLAGS\"\n        else\n            ASFLAGS=\"$ASFLAGS -f elf32\"\n        fi\n        ;;\n    x86_64)\n        ARCH=\"X86_64\"\n        AS=\"${AS-nasm}\"\n        AS_EXT=\".asm\"\n        CFLAGS=\"$CFLAGS -DARCH_X86_64=1\"\n        ASFLAGS=\"$ASFLAGS -DARCH_X86_64=1 -I\\$(SRCPATH)/common/x86/\"\n        stack_alignment=16\n        [ $compiler = GNU ] && CFLAGS=\"-m64 $CFLAGS\" && LDFLAGS=\"-m64 $LDFLAGS\"\n        if [ \"$SYS\" = MACOSX ]; then\n            ASFLAGS=\"$ASFLAGS -f macho64 -DPIC -DPREFIX\"\n            if cc_check '' \"-arch x86_64\"; then\n                CFLAGS=\"$CFLAGS -arch x86_64\"\n                LDFLAGS=\"$LDFLAGS -arch x86_64\"\n            fi\n        elif [ \"$SYS\" = WINDOWS -o \"$SYS\" = CYGWIN ]; then\n            ASFLAGS=\"$ASFLAGS -f win64\"\n            if [ $compiler = GNU ]; then\n                # 
only the GNU toolchain is inconsistent in prefixing function names with _\n                cc_check \"\" \"-S\" && grep -q \"_main:\" conftest && ASFLAGS=\"$ASFLAGS -DPREFIX\"\n                cc_check \"\" \"-Wl,--high-entropy-va\" && LDFLAGS=\"$LDFLAGS -Wl,--high-entropy-va\"\n                LDFLAGS=\"$LDFLAGS -Wl,--dynamicbase,--nxcompat,--tsaware\"\n                LDFLAGSCLI=\"$LDFLAGSCLI -Wl,--image-base,0x140000000\"\n                SOFLAGS=\"$SOFLAGS -Wl,--image-base,0x180000000\"\n                RCFLAGS=\"--target=pe-x86-64 $RCFLAGS\"\n            fi\n        else\n            ASFLAGS=\"$ASFLAGS -f elf64\"\n        fi\n        ;;\n    powerpc*)\n        ARCH=\"PPC\"\n        if [ $asm = auto ] ; then\n            define HAVE_ALTIVEC\n            AS=\"${AS-${CC}}\"\n            AS_EXT=\".c\"\n            if [ $SYS = MACOSX ] ; then\n                CFLAGS=\"$CFLAGS -faltivec -fastf -mcpu=G4\"\n            else\n                CFLAGS=\"$CFLAGS -maltivec -mabi=altivec\"\n                define HAVE_ALTIVEC_H\n            fi\n            if [ \"$vsx\" != \"no\" ] ; then\n                vsx=\"no\"\n                if cc_check \"\" \"-mvsx\" ; then\n                    CFLAGS=\"$CFLAGS -mvsx\"\n                    define HAVE_VSX\n                    vsx=\"yes\"\n                fi\n            fi\n        fi\n        ;;\n    sparc)\n        ARCH=\"SPARC\"\n        ;;\n    mips*)\n        ARCH=\"MIPS\"\n        AS=\"${AS-${CC}}\"\n        AS_EXT=\".c\"\n        ;;\n    arm*)\n        ARCH=\"ARM\"\n        if [ \"$SYS\" = MACOSX ] ; then\n            AS=\"${AS-${SRCPATH}/tools/gas-preprocessor.pl -arch arm -- ${CC}}\"\n            ASFLAGS=\"$ASFLAGS -DPREFIX -DPIC\"  # apple's ld doesn't support movw/movt relocations at all\n            # build for armv7 by default\n            if ! 
echo $CFLAGS | grep -Eq '\\-arch' ; then\n                CFLAGS=\"$CFLAGS -arch armv7\"\n                LDFLAGS=\"$LDFLAGS -arch armv7\"\n            fi\n        else\n            AS=\"${AS-${CC}}\"\n        fi\n        ;;\n    aarch64)\n        ARCH=\"AARCH64\"\n        stack_alignment=16\n        if [ \"$SYS\" = MACOSX ] ; then\n            AS=\"${AS-${SRCPATH}/tools/gas-preprocessor.pl -arch aarch64 -- ${CC}}\"\n            ASFLAGS=\"$ASFLAGS -DPREFIX\"\n        else\n            AS=\"${AS-${CC}}\"\n        fi\n        ;;\n    s390|s390x)\n        ARCH=\"S390\"\n        ;;\n    hppa*|parisc*)\n        ARCH=\"PARISC\"\n        ;;\n    ia64)\n        ARCH=\"IA64\"\n        ;;\n    alpha*)\n        ARCH=\"ALPHA\"\n        ;;\n    *)\n        ARCH=\"$(echo $host_cpu | tr a-z A-Z)\"\n        ;;\nesac\n\n[ \"$vsx\" != \"yes\" ] && vsx=\"no\"\n\nif [ $SYS = WINDOWS ]; then\n    if ! rc_check \"0 RCDATA {0}\" ; then\n        RC=\"\"\n    fi\n\n    if cpp_check \"winapifamily.h\" \"\" \"!WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_DESKTOP)\" ; then\n        [ $compiler = CL ] || die \"WinRT requires MSVC\"\n        define HAVE_WINRT\n        CFLAGS=\"$CFLAGS -MD\"\n        LDFLAGS=\"$LDFLAGS -appcontainer\"\n        if ! 
cpp_check \"\" \"\" \"defined(_WIN32_WINNT) && _WIN32_WINNT >= 0x0603\" ; then\n            die \"_WIN32_WINNT must be defined to at least 0x0603 (Windows 8.1) for WinRT\"\n        elif cpp_check \"\" \"\" \"_WIN32_WINNT >= 0x0A00\" ; then\n            # Universal Windows Platform (Windows 10)\n            LDFLAGS=\"$LDFLAGS -lWindowsApp\"\n        fi\n        cli=\"no\"\n        opencl=\"no\"\n    fi\nfi\n\nlog_msg \"davs2 configure script\"\nif [ -n \"$*\" ]; then\n    msg=\"Command line options:\"\n    for i in $@; do\n        msg=\"$msg \\\"$i\\\"\"\n    done\n    log_msg \"$msg\"\nfi\nlog_msg \"\"\n\n# check requirements\n\ncc_check || die \"No working C compiler found.\"\n\nif [ $compiler_style = GNU ]; then\n    if cc_check '' -std=gnu++11 'for( int i = 0; i < 9; i++ );' ; then\n        CFLAGS=\"$CFLAGS -std=gnu++11 -D_GNU_SOURCE\"\n    elif ! cc_check '' '' 'for( int i = 0; i < 9; i++ );' ; then\n        die \"GNU++11 compiler is needed for compilation.\"\n    fi\nfi\n\nif [ $shared = yes -a \\( $ARCH = \"X86_64\" -o $ARCH = \"PPC\" -o $ARCH = \"ALPHA\" -o $ARCH = \"ARM\" -o $ARCH = \"IA64\" -o $ARCH = \"PARISC\" -o $ARCH = \"MIPS\" -o $ARCH = \"AARCH64\" \\) ] ; then\n    pic=\"yes\"\nfi\n\nif [ $compiler = GNU -a \\( $ARCH = X86 -o $ARCH = X86_64 \\) ] ; then\n    if cc_check '' -mpreferred-stack-boundary=5 ; then\n        CFLAGS=\"$CFLAGS -mpreferred-stack-boundary=5\"\n        stack_alignment=32\n    elif [ $stack_alignment -lt 16 ] && cc_check '' -mpreferred-stack-boundary=4 ; then\n        CFLAGS=\"$CFLAGS -mpreferred-stack-boundary=4\"\n        stack_alignment=16\n    fi\nelif [ $compiler = ICC -a $ARCH = X86 ]; then\n    # icc on linux has various degrees of mod16 stack support\n    if [ $SYS = LINUX ]; then\n        # >= 12 defaults to a mod16 stack\n        if cpp_check \"\" \"\" \"__INTEL_COMPILER >= 1200\" ; then\n            stack_alignment=16\n        # 11 <= x < 12 is capable of keeping a mod16 stack, but defaults to not doing so.\n        
elif cpp_check \"\" \"\" \"__INTEL_COMPILER >= 1100\" ; then\n            CFLAGS=\"$CFLAGS -falign-stack=assume-16-byte\"\n            stack_alignment=16\n        fi\n        # < 11 is completely incapable of keeping a mod16 stack\n    fi\nfi\n\nif [ $asm = auto -a \\( $ARCH = X86 -o $ARCH = X86_64 \\) ] ; then\n    if ! as_check \"vmovdqa32 [eax]{k1}{z}, zmm0\" ; then\n        VER=\"$(($AS --version || echo no assembler) 2>/dev/null | head -n 1)\"\n        echo \"Found $VER\"\n        echo \"Minimum version is nasm-2.13\"\n        echo \"If you really want to compile without asm, configure with --disable-asm.\"\n        exit 1\n    fi\n    cc_check '' '' '__asm__(\"pabsw %xmm0, %xmm0\");' && define HAVE_X86_INLINE_ASM\n    ASFLAGS=\"$ASFLAGS -Worphan-labels\"\n    define HAVE_MMX\nfi\n\nif [ $asm = auto -a $ARCH = ARM ] ; then\n    # set flags so neon is built by default\n    echo $CFLAGS | grep -Eq '(-mcpu|-march|-mfpu)' || CFLAGS=\"$CFLAGS -mcpu=cortex-a8 -mfpu=neon\"\n\n    if  cc_check '' '' '__asm__(\"rev ip, ip\");' ; then      define HAVE_ARMV6\n        cc_check '' '' '__asm__(\"movt r0, #0\");'         && define HAVE_ARMV6T2\n        cc_check '' '' '__asm__(\"vadd.i16 q0, q0, q0\");' && define HAVE_NEON\n        ASFLAGS=\"$ASFLAGS -c\"\n    else\n        echo \"You specified a pre-ARMv6 or Thumb-1 CPU in your CFLAGS.\"\n        echo \"If you really want to run on such a CPU, configure with --disable-asm.\"\n        exit 1\n    fi\nfi\n\nif [ $asm = auto -a $ARCH = AARCH64 ] ; then\n    if  cc_check '' '' '__asm__(\"cmeq v0.8h, v0.8h, #0\");' ; then define HAVE_NEON\n        ASFLAGS=\"$ASFLAGS -c\"\n    else\n        echo \"no NEON support, try adding -mfpu=neon to CFLAGS\"\n        echo \"If you really want to run on such a CPU, configure with --disable-asm.\"\n        exit 1\n    fi\nfi\n\nif [ $asm = auto -a \\( $ARCH = ARM -o $ARCH = AARCH64 \\) ] ; then\n    # check if the assembler supports '.func' (clang 3.5 does not)\n    as_check \".func 
test${NL}.endfunc\" && define HAVE_AS_FUNC 1\nfi\n\nif [ $asm = auto -a $ARCH = MIPS ] ; then\n    if ! echo $CFLAGS | grep -Eq '(-march|-mmsa|-mno-msa)' ; then\n        cc_check '' '-mmsa -mfp64 -mhard-float' && CFLAGS=\"-mmsa -mfp64 -mhard-float $CFLAGS\"\n    fi\n\n    if cc_check '' '' '__asm__(\"addvi.b $w0, $w1, 1\");' ; then\n        define HAVE_MSA\n    else\n        echo \"You specified a pre-MSA CPU in your CFLAGS.\"\n        echo \"If you really want to run on such a CPU, configure with --disable-asm.\"\n        exit 1\n    fi\nfi\n\n[ $asm = no ] && AS=\"\"\n[ \"x$AS\" = x ] && asm=\"no\" || asm=\"yes\"\n\ndefine ARCH_$ARCH\ndefine SYS_$SYS\n\ndefine STACK_ALIGNMENT $stack_alignment\nASFLAGS=\"$ASFLAGS -DSTACK_ALIGNMENT=$stack_alignment\"\n\n# skip the endianness check for the Intel Compiler and MSVS, as all supported platforms are little-endian; each also has flags that would cause the check to fail\nCPU_ENDIAN=\"little-endian\"\nif [ $compiler = GNU ]; then\n    echo \"int i[2] = {0x42494745,0}; double f[2] = {0x1.0656e6469616ep+102,0};\" > conftest.c\n    $CC $CFLAGS conftest.c -c -o conftest.o 2>/dev/null || die \"endian test failed\"\n    if (${cross_prefix}strings -a conftest.o | grep -q BIGE) && (${cross_prefix}strings -a conftest.o | grep -q FPendian) ; then\n        define WORDS_BIGENDIAN\n        CPU_ENDIAN=\"big-endian\"\n    elif !(${cross_prefix}strings -a conftest.o | grep -q EGIB && ${cross_prefix}strings -a conftest.o | grep -q naidnePF) ; then\n        die \"endian test failed\"\n    fi\nfi\n\nif [ \"$cli_libdavs2\" = \"system\" -a \"$shared\" != \"yes\" ] ; then\n    [ \"$static\" = \"yes\" ] && die \"Option --system-libdavs2 can not be used together with --enable-static\"\n    if $PKGCONFIG --exists davs2 2>/dev/null; then\n        DAVS2_LIBS=\"$($PKGCONFIG --libs davs2)\"\n        DAVS2_INCLUDE_DIR=\"${DAVS2_INCLUDE_DIR-$($PKGCONFIG --variable=includedir davs2)}\"\n        configure_system_override \"$DAVS2_INCLUDE_DIR\" || die \"Detection of 
system libdavs2 configuration failed\"\n    else\n        die \"Can not find system libdavs2\"\n    fi\nfi\n\n# autodetect options that weren't forced nor disabled\n\nlibpthread=\"\"\nif [ \"$SYS\" = \"WINDOWS\" -a \"$thread\" = \"posix\" ] ; then\n    if [ \"$gpl\" = \"no\" ] ; then\n        echo \"Warning: pthread-win32 is LGPL and is therefore not supported with --disable-gpl\"\n        thread=\"no\"\n    elif cc_check pthread.h -lpthread \"pthread_create(0,0,0,0);\" ; then\n        libpthread=\"-lpthread\"\n    elif cc_check pthread.h -lpthreadGC2 \"pthread_create(0,0,0,0);\" ; then\n        libpthread=\"-lpthreadGC2\"\n    elif cc_check pthread.h \"-lpthreadGC2 -lwsock32 -DPTW32_STATIC_LIB\" \"pthread_create(0,0,0,0);\" ; then\n        libpthread=\"-lpthreadGC2 -lwsock32\"\n        define PTW32_STATIC_LIB\n    elif cc_check pthread.h \"-lpthreadGC2 -lws2_32 -DPTW32_STATIC_LIB\" \"pthread_create(0,0,0,0);\" ; then\n        libpthread=\"-lpthreadGC2 -lws2_32\"\n        define PTW32_STATIC_LIB\n    else\n        thread=\"no\"\n    fi\nelif [ \"$thread\" != \"no\" ] ; then\n    thread=\"no\"\n    case $SYS in\n        BEOS)\n            thread=\"beos\"\n            define HAVE_BEOSTHREAD\n            ;;\n        WINDOWS)\n            thread=\"win32\"\n            define HAVE_WIN32THREAD\n            ;;\n        QNX)\n            cc_check pthread.h -lc \"pthread_create(0,0,0,0);\" && thread=\"posix\" && libpthread=\"-lc\"\n            ;;\n        *)\n            if cc_check pthread.h -lpthread \"pthread_create(0,0,0,0);\" ; then\n               thread=\"posix\"\n               libpthread=\"-lpthread\"\n            else\n                cc_check pthread.h \"\" \"pthread_create(0,0,0,0);\" && thread=\"posix\" && libpthread=\"\"\n            fi\n            ;;\n    esac\nfi\nif [ \"$thread\" = \"posix\" ]; then\n    LDFLAGS=\"$LDFLAGS $libpthread\"\n    define HAVE_POSIXTHREAD\n    if [ \"$SYS\" = \"LINUX\" ] && cc_check sched.h \"-D_GNU_SOURCE -Werror\" \"cpu_set_t 
p_aff; return CPU_COUNT(&p_aff);\" ; then\n        define HAVE_CPU_COUNT\n    fi\nfi\n[ \"$thread\" != \"no\" ] && define HAVE_THREAD\n\nif cc_check \"math.h\" \"-Werror\" \"return log2f(2);\" ; then\n    define HAVE_LOG2F\nfi\n\nif [ \"$SYS\" != \"WINDOWS\" ] && cpp_check \"sys/mman.h unistd.h\" \"\" \"defined(MAP_PRIVATE)\"; then\n    define HAVE_MMAP\nfi\n\nif [ \"$SYS\" = \"LINUX\" -a \\( \"$ARCH\" = \"X86\" -o \"$ARCH\" = \"X86_64\" \\) ] && cc_check \"sys/mman.h\" \"\" \"MADV_HUGEPAGE;\" ; then\n    define HAVE_THP\nfi\n\ncc_check \"stdint.h\" \"\" \"uint32_t test_vec __attribute__ ((vector_size (16))) = {0,1,2,3};\" && define HAVE_VECTOREXT\n\nif [ \"$pic\" = \"yes\" ] ; then\n    [ \"$SYS\" != WINDOWS -a \"$SYS\" != CYGWIN ] && CFLAGS=\"$CFLAGS -fPIC\"\n    ASFLAGS=\"$ASFLAGS -DPIC\"\n    # resolve textrels in the x86 asm\n    cc_check stdio.h \"-shared -Wl,-Bsymbolic\" && SOFLAGS=\"$SOFLAGS -Wl,-Bsymbolic\"\n    [ $SYS = SunOS -a \"$ARCH\" = \"X86\" ] && SOFLAGS=\"$SOFLAGS -mimpure-text\"\nfi\n\nif [ \"$debug\" != \"yes\" -a \"$gprof\" != \"yes\" ]; then\n    CFLAGS=\"$CFLAGS -fomit-frame-pointer\"\nfi\n\nif [ \"$strip\" = \"yes\" ]; then\n    LDFLAGS=\"$LDFLAGS -s\"\nfi\n\nif [ \"$debug\" = \"yes\" ]; then\n    CFLAGS=\"-O1 -g $CFLAGS\"\n    RCFLAGS=\"$RCFLAGS -DDEBUG\"\nelse\n    CFLAGS=\"-O3 -ffast-math $CFLAGS\"\n    if [ \"$lto\" = \"auto\" ] && [ $compiler = GNU ] && cc_check \"\" \"-flto\" ; then\n        lto=\"yes\"\n        CFLAGS=\"$CFLAGS -flto\"\n        LDFLAGS=\"$LDFLAGS -O3 -flto\"\n    fi\nfi\n[ \"$lto\" = \"auto\" ] && lto=\"no\"\n\nif cc_check '' -fno-tree-vectorize ; then\n    CFLAGS=\"$CFLAGS -fno-tree-vectorize\"\nfi\n\nif [ $SYS = WINDOWS -a $ARCH = X86 -a $compiler = GNU ] ; then\n    # workaround g++/ld bug with alignment of static variables/arrays that are initialized to zero\n    cc_check '' -fno-zero-initialized-in-bss && CFLAGS=\"$CFLAGS -fno-zero-initialized-in-bss\"\nfi\n\nif cc_check '' -Wshadow ; then\n    CFLAGS=\"-Wshadow 
$CFLAGS\"\nfi\n\nif cc_check '' -Wmaybe-uninitialized ; then\n    if [ $SYS = MACOSX ] ; then\n        CFLAGS=\"-Wno-uninitialized $CFLAGS\"\n    else\n        CFLAGS=\"-Wno-maybe-uninitialized $CFLAGS\"\n    fi\nfi\n\nif [ $compiler = ICC -o $compiler = ICL ] ; then\n    if cc_check 'extras/intel_dispatcher.h' '' 'davs2_intel_dispatcher_override();' ; then\n        define HAVE_INTEL_DISPATCHER\n    fi\nfi\n\nif [ \"$bit_depth\" -gt \"8\" ]; then\n    define HIGH_BIT_DEPTH\n    ASFLAGS=\"$ASFLAGS -DHIGH_BIT_DEPTH=1\"\n    CFLAGS+=\" -DHIGH_BIT_DEPTH=1\"\n    opencl=\"no\"\nelse\n    ASFLAGS=\"$ASFLAGS -DHIGH_BIT_DEPTH=0\"\n    CFLAGS+=\" -DHIGH_BIT_DEPTH=0\"\nfi\n\nif [ \"$chroma_format\" != \"all\" ]; then\n    define CHROMA_FORMAT CHROMA_$chroma_format\nfi\n\nASFLAGS=\"$ASFLAGS -DBIT_DEPTH=$bit_depth\"\nCFLAGS+=\" -DBIT_DEPTH=$bit_depth\"\n\n[ $gpl = yes ] && define HAVE_GPL && davs2_gpl=1 || davs2_gpl=0\n\n[ $interlaced = yes ] && define HAVE_INTERLACED && davs2_interlaced=1 || davs2_interlaced=0\n\nlibdl=\"\"\nif [ \"$opencl\" = \"yes\" ]; then\n    opencl=\"no\"\n    # cygwin can use opencl if it can use LoadLibrary\n    if [ $SYS = WINDOWS ] || ([ $SYS = CYGWIN ] && cc_check windows.h \"\" \"LoadLibraryW(0);\") ; then\n        opencl=\"yes\"\n        define HAVE_OPENCL\n    elif [ \"$SYS\" = \"LINUX\" -o \"$SYS\" = \"MACOSX\" ] ; then\n        opencl=\"yes\"\n        define HAVE_OPENCL\n        libdl=\"-ldl\"\n    fi\n    LDFLAGS=\"$LDFLAGS $libdl\"\nfi\n\n#define undefined vars as 0\nfor var in $CONFIG_HAVE; do\n    grep -q \"HAVE_$var 1\" config.h || define HAVE_$var 0\ndone\n\n# generate exported config file\n\nconfig_chroma_format=\"DAVS2_CSP_I$chroma_format\"\n[ \"$config_chroma_format\" == \"DAVS2_CSP_Iall\" ] && config_chroma_format=\"0\"\ncat > davs2_config.h << EOF\n#define DAVS2_BIT_DEPTH     $bit_depth\n#define DAVS2_GPL           $davs2_gpl\n#define DAVS2_INTERLACED    $davs2_interlaced\n#define DAVS2_CHROMA_FORMAT $config_chroma_format\nEOF\n\n# 
generate version.h\ncd ${SRCPATH}/..\n./version.sh >> ${BUILDPATH}/davs2_config.h\ncd ${BUILDPATH}\n\nif [ \"$cli_libdavs2\" = \"system\" ] ; then\n    if [ \"$shared\" = \"yes\" ]; then\n        CLI_LIBDAVS2='$(SONAME)'\n    else\n        CLI_LIBDAVS2=\n        LDFLAGSCLI=\"$DAVS2_LIBS $LDFLAGSCLI\"\n        cc_check 'stdint.h davs2.h' '' 'davs2_encoder_open(0);' || die \"System libdavs2 can't be used for compilation of this version\"\n    fi\nelse\n    CLI_LIBDAVS2='$(LIBDAVS2)'\nfi\n\nDEPMM=\"${QPRE}MM\"\nDEPMT=\"${QPRE}MT\"\nif [ $compiler_style = MS ]; then\n    AR=\"lib -nologo -out:\"\n    LD=\"link -out:\"\n    if [ $compiler = ICL ]; then\n        AR=\"xi$AR\"\n        LD=\"xi$LD\"\n    else\n        mslink=\"$(dirname \"$(command -v cl 2>/dev/null)\")/link\"\n        [ -x \"$mslink\" ] && LD=\"\\\"$mslink\\\" -out:\"\n    fi\n    HAVE_GETOPT_LONG=0\n    LDFLAGS=\"-nologo -incremental:no $(cl_ldflags $LDFLAGS)\"\n    LDFLAGSCLI=\"$(cl_ldflags $LDFLAGSCLI)\"\n    LIBDAVS2=libdavs2.lib\n    RANLIB=\n    [ -n \"$RC\" ] && RCFLAGS=\"$RCFLAGS -nologo -I. -I\\$(SRCPATH)/extras -fo\"\n    STRIP=\n    if [ $debug = yes ]; then\n        LDFLAGS=\"-debug $LDFLAGS\"\n        CFLAGS=\"-D_DEBUG $CFLAGS\"\n    else\n        CFLAGS=\"-DNDEBUG $CFLAGS\"\n    fi\nelse # g++/icc\n    DEPMM=\"$DEPMM -g0\"\n    AR=\"$AR rc \"\n    LD=\"$CC -o \"\n    LIBDAVS2=libdavs2.a\n    [ -n \"$RC\" ] && RCFLAGS=\"$RCFLAGS -I. -o \"\nfi\n[ $compiler != GNU ] && CFLAGS=\"$(cc_cflags $CFLAGS)\"\nif [ $compiler = ICC -o $compiler = ICL ]; then\n    # icc does not define __SSE__ until SSE2 optimization and icl never defines it or _M_IX86_FP\n    [ \\( $ARCH = X86_64 -o $ARCH = X86 \\) -a $asm = yes ] && ! 
cpp_check \"\" \"\" \"defined(__SSE__)\" && define __SSE__\n    PROF_GEN_CC=\"${QPRE}prof-gen ${QPRE}prof-dir.\"\n    PROF_GEN_LD=\n    PROF_USE_CC=\"${QPRE}prof-use ${QPRE}prof-dir.\"\n    PROF_USE_LD=\nelif [ $compiler = CL ]; then\n    # Visual Studio\n    # _M_IX86_FP is only defined on x86\n    [ $ARCH = X86 ] && cpp_check '' '' '_M_IX86_FP >= 1' && define __SSE__\n    [ $ARCH = X86_64 ] && define __SSE__\n    # As long as the cli application can't link against the dll, the dll can not be pgo'd.\n    # pgds are link flag specific and the -dll flag for creating the dll makes it unshareable with the cli\n    PROF_GEN_CC=\"-GL\"\n    PROF_GEN_LD=\"-LTCG:PGINSTRUMENT\"\n    PROF_USE_CC=\"-GL\"\n    PROF_USE_LD=\"-LTCG:PGOPTIMIZE\"\nelse\n    PROF_GEN_CC=\"-fprofile-generate\"\n    PROF_GEN_LD=\"-fprofile-generate\"\n    PROF_USE_CC=\"-fprofile-use\"\n    PROF_USE_LD=\"-fprofile-use\"\nfi\n\n# generate config files\n\ncat > config.mak << EOF\nSRCPATH=$SRCPATH\nprefix=$prefix\nexec_prefix=$exec_prefix\nbindir=$bindir\nlibdir=$libdir\nincludedir=$includedir\nSYS_ARCH=$ARCH\nSYS=$SYS\nCC=$CC\nCFLAGS=$CFLAGS\nCOMPILER=$compiler\nCOMPILER_STYLE=$compiler_style\nDEPMM=$DEPMM\nDEPMT=$DEPMT\nLD=$LD\nLDFLAGS=$LDFLAGS\nLIBDAVS2=$LIBDAVS2\nAR=$AR\nRANLIB=$RANLIB\nSTRIP=$STRIP\nINSTALL=$INSTALL\nAS=$AS\nASFLAGS=$ASFLAGS\nRC=$RC\nRCFLAGS=$RCFLAGS\nEXE=$EXE\nHAVE_GETOPT_LONG=$HAVE_GETOPT_LONG\nDEVNULL=$DEVNULL\nPROF_GEN_CC=$PROF_GEN_CC\nPROF_GEN_LD=$PROF_GEN_LD\nPROF_USE_CC=$PROF_USE_CC\nPROF_USE_LD=$PROF_USE_LD\nHAVE_OPENCL=$opencl\nEOF\n\nif [ $compiler_style = MS ]; then\n    echo '%.o: %.c' >> config.mak\n    echo '\t$(CC) $(CFLAGS) -c -Fo$@ $<' >> config.mak\nfi\n\nif [ \"$cli\" = \"yes\" ]; then\n    echo 'default: cli' >> config.mak\n    echo 'install: install-cli' >> config.mak\nfi\n\nif [ \"$shared\" = \"yes\" ]; then\n    API=$(grep '#define DAVS2_BUILD' < ${BUILDPATH}/davs2_config.h | sed 's/^.* \\([1-9][0-9]*\\).*$/\\1/')\n    if [ \"$SYS\" = \"WINDOWS\" -o \"$SYS\" 
= \"CYGWIN\" ]; then\n        echo \"SONAME=libdavs2-$API.dll\" >> config.mak\n        if [ $compiler_style = MS ]; then\n            echo 'IMPLIBNAME=libdavs2.dll.lib' >> config.mak\n            # GNU ld on windows defaults to exporting all global functions if there are no explicit __declspec(dllexport) declarations\n            # MSVC link does not act similarly, so it is required to make an export definition out of davs2.h and use it at link time\n            echo \"SOFLAGS=-dll -def:davs2.def -implib:\\$(IMPLIBNAME) $SOFLAGS\" >> config.mak\n            echo \"EXPORTS\" > davs2.def\n            # export API functions\n            grep \"^\\(int\\|void\\|davs2_t\\).*davs2\" ${SRCPATH}/davs2.h | sed -e \"s/.*\\(davs2.*\\)(.*/\\1/;s/open/open_$API/g\" >> davs2.def\n            # export API variables/data. must be flagged with the DATA keyword\n            grep \"extern.*davs2\" ${SRCPATH}/davs2.h | sed -e \"s/.*\\(davs2\\w*\\)\\W.*/\\1 DATA/;\" >> davs2.def\n        else\n            echo 'IMPLIBNAME=libdavs2.dll.a' >> config.mak\n            echo \"SOFLAGS=-shared -Wl,--out-implib,\\$(IMPLIBNAME) $SOFLAGS\" >> config.mak\n        fi\n    elif [ \"$SYS\" = \"MACOSX\" ]; then\n        echo \"SOSUFFIX=dylib\" >> config.mak\n        echo \"SONAME=libdavs2.$API.dylib\" >> config.mak\n        echo \"SOFLAGS=-shared -dynamiclib -Wl,-single_module -Wl,-read_only_relocs,suppress -install_name \\$(DESTDIR)\\$(libdir)/\\$(SONAME) $SOFLAGS\" >> config.mak\n    elif [ \"$SYS\" = \"SunOS\" ]; then\n        echo \"SOSUFFIX=so\" >> config.mak\n        echo \"SONAME=libdavs2.so.$API\" >> config.mak\n        echo \"SOFLAGS=-shared -Wl,-h,\\$(SONAME) $SOFLAGS\" >> config.mak\n    else\n        echo \"SOSUFFIX=so\" >> config.mak\n        echo \"SONAME=libdavs2.so.$API\" >> config.mak\n        echo \"SOFLAGS=-shared -Wl,-soname,\\$(SONAME) $SOFLAGS\" >> config.mak\n    fi\n    echo 'default: lib-shared' >> config.mak\n    echo 'install: install-lib-shared' >> config.mak\nfi\n\nif [ 
\"$static\" = \"yes\" ]; then\n    echo 'default: lib-static' >> config.mak\n    echo 'install: install-lib-static' >> config.mak\nfi\n\necho \"LDFLAGSCLI = $LDFLAGSCLI\" >> config.mak\necho \"CLI_LIBDAVS2 = $CLI_LIBDAVS2\" >> config.mak\n\ncat > davs2.pc << EOF\nprefix=$prefix\nexec_prefix=$exec_prefix\nlibdir=$libdir\nincludedir=$includedir\n\nName: davs2\nDescription: AVS2 (IEEE 1857.4) decoder library\nVersion: $(grep POINTVER < davs2_config.h | sed -e 's/.* \"//; s/\".*//')\nLibs: -L$libdir -ldavs2 $([ \"$shared\" = \"yes\" ] || echo $libpthread $libm $libdl)\nLibs.private: $([ \"$shared\" = \"yes\" ] && echo $libpthread $libm $libdl)\nCflags: -I$includedir\nEOF\n\nfilters=\"crop select_every\"\ngpl_filters=\"\"\n[ $gpl = yes ] && filters=\"$filters $gpl_filters\"\n\ncat > conftest.log <<EOF\nplatform:      $ARCH\nbyte order:    $CPU_ENDIAN\nsystem:        $SYS\ncli:           $cli\nlibdavs2:      $cli_libdavs2\nshared:        $shared\nstatic:        $static\nasm:           $asm\ninterlaced:    $interlaced\ngpl:           $gpl\nthread:        $thread\nopencl:        $opencl\nfilters:       $filters\nlto:           $lto\ndebug:         $debug\ngprof:         $gprof\nstrip:         $strip\nPIC:           $pic\nbit depth:     $bit_depth\nchroma format: $chroma_format\nEOF\n\necho >> config.log\ncat conftest.log >> config.log\ncat conftest.log\n\n# [ \"$SRCPATH\" != \".\" ] && ln -sf ${SRCPATH}/Makefile ./Makefile\nmkdir -p common/{aarch64,arm,ppc,x86,vec} test\n\necho\necho \"You can run 'make' or 'make fprofiled' now.\"\n\n"
  },
  {
    "path": "build/vs2013/DAVS2.sln",
    "content": "﻿\r\nMicrosoft Visual Studio Solution File, Format Version 12.00\r\n# Visual Studio 2013\r\nVisualStudioVersion = 12.0.40629.0\r\nMinimumVisualStudioVersion = 10.0.40219.1\r\nProject(\"{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}\") = \"davs2\", \"davs2.vcxproj\", \"{852EFB9B-4E73-4E80-AA57-711ADCB132AE}\"\r\n\tProjectSection(ProjectDependencies) = postProject\r\n\t\t{34C0840A-BDE6-446B-B0DF-A8281A42825B} = {34C0840A-BDE6-446B-B0DF-A8281A42825B}\r\n\tEndProjectSection\r\nEndProject\r\nProject(\"{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}\") = \"libdavs2\", \"libdavs2.vcxproj\", \"{34C0840A-BDE6-446B-B0DF-A8281A42825B}\"\r\n\tProjectSection(ProjectDependencies) = postProject\r\n\t\t{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1} = {A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}\r\n\t\t{558555B9-A7B2-42D6-A298-BB5CC248541F} = {558555B9-A7B2-42D6-A298-BB5CC248541F}\r\n\t\t{2E7A6EE4-927F-470A-A012-3B29EDB87906} = {2E7A6EE4-927F-470A-A012-3B29EDB87906}\r\n\tEndProjectSection\r\nEndProject\r\nProject(\"{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}\") = \"libdavs2_asm\", \"libdavs2_asm.vcxproj\", \"{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}\"\r\nEndProject\r\nProject(\"{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}\") = \"libdavs2_intrin_avx\", \"libdavs2_intrin_avx.vcxproj\", \"{558555B9-A7B2-42D6-A298-BB5CC248541F}\"\r\nEndProject\r\nProject(\"{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}\") = \"libdavs2_intrin_sse\", \"libdavs2_intrin_sse.vcxproj\", \"{2E7A6EE4-927F-470A-A012-3B29EDB87906}\"\r\nEndProject\r\nGlobal\r\n\tGlobalSection(SolutionConfigurationPlatforms) = preSolution\r\n\t\tDebug|Win32 = Debug|Win32\r\n\t\tDebug|x64 = Debug|x64\r\n\t\tRelease|Win32 = Release|Win32\r\n\t\tRelease|x64 = Release|x64\r\n\tEndGlobalSection\r\n\tGlobalSection(ProjectConfigurationPlatforms) = postSolution\r\n\t\t{852EFB9B-4E73-4E80-AA57-711ADCB132AE}.Debug|Win32.ActiveCfg = Debug|Win32\r\n\t\t{852EFB9B-4E73-4E80-AA57-711ADCB132AE}.Debug|Win32.Build.0 = 
Debug|Win32\r\n\t\t{852EFB9B-4E73-4E80-AA57-711ADCB132AE}.Debug|x64.ActiveCfg = Debug|x64\r\n\t\t{852EFB9B-4E73-4E80-AA57-711ADCB132AE}.Debug|x64.Build.0 = Debug|x64\r\n\t\t{852EFB9B-4E73-4E80-AA57-711ADCB132AE}.Release|Win32.ActiveCfg = Release|Win32\r\n\t\t{852EFB9B-4E73-4E80-AA57-711ADCB132AE}.Release|Win32.Build.0 = Release|Win32\r\n\t\t{852EFB9B-4E73-4E80-AA57-711ADCB132AE}.Release|x64.ActiveCfg = Release|x64\r\n\t\t{852EFB9B-4E73-4E80-AA57-711ADCB132AE}.Release|x64.Build.0 = Release|x64\r\n\t\t{34C0840A-BDE6-446B-B0DF-A8281A42825B}.Debug|Win32.ActiveCfg = Debug|Win32\r\n\t\t{34C0840A-BDE6-446B-B0DF-A8281A42825B}.Debug|Win32.Build.0 = Debug|Win32\r\n\t\t{34C0840A-BDE6-446B-B0DF-A8281A42825B}.Debug|x64.ActiveCfg = Debug|x64\r\n\t\t{34C0840A-BDE6-446B-B0DF-A8281A42825B}.Debug|x64.Build.0 = Debug|x64\r\n\t\t{34C0840A-BDE6-446B-B0DF-A8281A42825B}.Release|Win32.ActiveCfg = Release|Win32\r\n\t\t{34C0840A-BDE6-446B-B0DF-A8281A42825B}.Release|Win32.Build.0 = Release|Win32\r\n\t\t{34C0840A-BDE6-446B-B0DF-A8281A42825B}.Release|x64.ActiveCfg = Release|x64\r\n\t\t{34C0840A-BDE6-446B-B0DF-A8281A42825B}.Release|x64.Build.0 = Release|x64\r\n\t\t{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}.Debug|Win32.ActiveCfg = Debug|Win32\r\n\t\t{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}.Debug|Win32.Build.0 = Debug|Win32\r\n\t\t{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}.Debug|x64.ActiveCfg = Debug|x64\r\n\t\t{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}.Debug|x64.Build.0 = Debug|x64\r\n\t\t{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}.Release|Win32.ActiveCfg = Release|Win32\r\n\t\t{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}.Release|Win32.Build.0 = Release|Win32\r\n\t\t{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}.Release|x64.ActiveCfg = Release|x64\r\n\t\t{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}.Release|x64.Build.0 = Release|x64\r\n\t\t{558555B9-A7B2-42D6-A298-BB5CC248541F}.Debug|Win32.ActiveCfg = Debug|Win32\r\n\t\t{558555B9-A7B2-42D6-A298-BB5CC248541F}.Debug|Win32.Build.0 = 
Debug|Win32\r\n\t\t{558555B9-A7B2-42D6-A298-BB5CC248541F}.Debug|x64.ActiveCfg = Debug|x64\r\n\t\t{558555B9-A7B2-42D6-A298-BB5CC248541F}.Debug|x64.Build.0 = Debug|x64\r\n\t\t{558555B9-A7B2-42D6-A298-BB5CC248541F}.Release|Win32.ActiveCfg = Release|Win32\r\n\t\t{558555B9-A7B2-42D6-A298-BB5CC248541F}.Release|Win32.Build.0 = Release|Win32\r\n\t\t{558555B9-A7B2-42D6-A298-BB5CC248541F}.Release|x64.ActiveCfg = Release|x64\r\n\t\t{558555B9-A7B2-42D6-A298-BB5CC248541F}.Release|x64.Build.0 = Release|x64\r\n\t\t{2E7A6EE4-927F-470A-A012-3B29EDB87906}.Debug|Win32.ActiveCfg = Debug|Win32\r\n\t\t{2E7A6EE4-927F-470A-A012-3B29EDB87906}.Debug|Win32.Build.0 = Debug|Win32\r\n\t\t{2E7A6EE4-927F-470A-A012-3B29EDB87906}.Debug|x64.ActiveCfg = Debug|x64\r\n\t\t{2E7A6EE4-927F-470A-A012-3B29EDB87906}.Debug|x64.Build.0 = Debug|x64\r\n\t\t{2E7A6EE4-927F-470A-A012-3B29EDB87906}.Release|Win32.ActiveCfg = Release|Win32\r\n\t\t{2E7A6EE4-927F-470A-A012-3B29EDB87906}.Release|Win32.Build.0 = Release|Win32\r\n\t\t{2E7A6EE4-927F-470A-A012-3B29EDB87906}.Release|x64.ActiveCfg = Release|x64\r\n\t\t{2E7A6EE4-927F-470A-A012-3B29EDB87906}.Release|x64.Build.0 = Release|x64\r\n\tEndGlobalSection\r\n\tGlobalSection(SolutionProperties) = preSolution\r\n\t\tHideSolutionNode = FALSE\r\n\tEndGlobalSection\r\nEndGlobal\r\n"
  },
  {
    "path": "build/vs2013/davs2.vcxproj",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project DefaultTargets=\"Build\" ToolsVersion=\"12.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup Label=\"ProjectConfigurations\">\r\n    <ProjectConfiguration Include=\"Debug|Win32\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Debug|x64\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|Win32\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|x64\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClCompile Include=\"..\\..\\source\\test\\getopt\\getopt.c\" />\r\n    <ClCompile Include=\"..\\..\\source\\test\\test.c\" />\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClInclude Include=\"..\\..\\source\\davs2.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\test\\inputstream.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\test\\getopt\\getopt.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\test\\md5.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\test\\parse_args.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\test\\psnr.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\test\\utils.h\" />\r\n  </ItemGroup>\r\n  <PropertyGroup Label=\"Globals\">\r\n    <ProjectGuid>{852EFB9B-4E73-4E80-AA57-711ADCB132AE}</ProjectGuid>\r\n    <Keyword>Win32Proj</Keyword>\r\n  </PropertyGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.Default.props\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\" Label=\"Configuration\">\r\n    <ConfigurationType>Application</ConfigurationType>\r\n    
<UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>Application</ConfigurationType>\r\n    <UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\" Label=\"Configuration\">\r\n    <ConfigurationType>Application</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>Application</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.props\" />\r\n  <ImportGroup Label=\"ExtensionSettings\">\r\n  </ImportGroup>\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" 
Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <PropertyGroup Label=\"UserMacros\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <LinkIncremental>true</LinkIncremental>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    <LinkIncremental>true</LinkIncremental>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <LinkIncremental>false</LinkIncremental>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <LinkIncremental>false</LinkIncremental>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n    
<OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n  </PropertyGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <ClCompile>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;WIN32;ARCH_X86_64=0;_DEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <PrecompiledHeaderFile />\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;..\\..\\source\\test\\getopt;</AdditionalIncludeDirectories>\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Console</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <AdditionalDependencies>kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>\r\n      <AdditionalLibraryDirectories>$(SolutionDir)$(Platform)\\</AdditionalLibraryDirectories>\r\n      <LargeAddressAware>true</LargeAddressAware>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    <ClCompile>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;WIN32;ARCH_X86_64=1;_DEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <PrecompiledHeaderFile>\r\n      </PrecompiledHeaderFile>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;..\\..\\source\\test\\getopt;</AdditionalIncludeDirectories>\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n    </ClCompile>\r\n    <Link>\r\n    
  <SubSystem>Console</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <AdditionalDependencies>kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>\r\n      <AdditionalLibraryDirectories>$(SolutionDir)$(Platform)\\</AdditionalLibraryDirectories>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <ClCompile>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;WIN32;ARCH_X86_64=0;NDEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <PrecompiledHeaderFile />\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;..\\..\\source\\test\\getopt;</AdditionalIncludeDirectories>\r\n      <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Console</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n      <AdditionalDependencies>kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>\r\n      <AdditionalLibraryDirectories>$(SolutionDir)$(Platform)\\</AdditionalLibraryDirectories>\r\n      <LargeAddressAware>true</LargeAddressAware>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup 
Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <ClCompile>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;WIN32;ARCH_X86_64=1;NDEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <PrecompiledHeaderFile>\r\n      </PrecompiledHeaderFile>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;..\\..\\source\\test\\getopt;</AdditionalIncludeDirectories>\r\n      <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Console</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n      <AdditionalDependencies>kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>\r\n      <AdditionalLibraryDirectories>$(SolutionDir)$(Platform)\\</AdditionalLibraryDirectories>\r\n      <UACExecutionLevel>AsInvoker</UACExecutionLevel>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemGroup>\r\n    <ProjectReference Include=\"libdavs2.vcxproj\">\r\n      <Project>{34C0840A-BDE6-446B-B0DF-A8281A42825B}</Project>\r\n      <ReferenceOutputAssembly>false</ReferenceOutputAssembly>\r\n    </ProjectReference>\r\n  </ItemGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.targets\" />\r\n  <ImportGroup Label=\"ExtensionTargets\">\r\n  </ImportGroup>\r\n</Project>"
  },
  {
    "path": "build/vs2013/davs2.vcxproj.filters",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project ToolsVersion=\"4.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup>\r\n    <Filter Include=\"inc\">\r\n      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>\r\n      <Extensions>h;hh;hpp;hxx;hm;inl;inc;xsd</Extensions>\r\n    </Filter>\r\n    <Filter Include=\"src\">\r\n      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>\r\n      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>\r\n    </Filter>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClInclude Include=\"..\\..\\source\\test\\psnr.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\test\\utils.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\davs2.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\test\\parse_args.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\test\\getopt\\getopt.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\test\\inputstream.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\test\\md5.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClCompile Include=\"..\\..\\source\\test\\test.c\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\test\\getopt\\getopt.c\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n  </ItemGroup>\r\n</Project>"
  },
  {
    "path": "build/vs2013/libdavs2.vcxproj",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project DefaultTargets=\"Build\" ToolsVersion=\"12.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup Label=\"ProjectConfigurations\">\r\n    <ProjectConfiguration Include=\"Debug|Win32\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Debug|x64\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|Win32\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|x64\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClCompile Include=\"..\\..\\source\\common\\aec.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\alf.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\bitstream.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\block_info.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\common.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\davs2.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\cpu.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\cu.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\deblock.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\decoder.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\frame.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\header.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\intra.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\mc.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\memory.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\pixel.cc\" />\r\n   
 <ClCompile Include=\"..\\..\\source\\common\\predict.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\primitives.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\quant.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\sao.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\threadpool.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\transform.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\win32thread.cc\" />\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClInclude Include=\"..\\..\\source\\common\\win32thread.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\configw.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\davs2.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\aec.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\alf.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\bitstream.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\block_info.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\common.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\cpu.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\cu.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\deblock.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\decoder.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\defines.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\frame.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\header.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\intra.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\mc.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\osdep.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\predict.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\primitives.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\quant.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\sao.h\" />\r\n    <ClInclude 
Include=\"..\\..\\source\\common\\scantab.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\threadpool.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\transform.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\vlc.h\" />\r\n    <ClInclude Include=\"resource.h\" />\r\n  </ItemGroup>\r\n  <PropertyGroup Label=\"Globals\">\r\n    <ProjectGuid>{34C0840A-BDE6-446B-B0DF-A8281A42825B}</ProjectGuid>\r\n    <Keyword>Win32Proj</Keyword>\r\n  </PropertyGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.Default.props\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\" Label=\"Configuration\">\r\n    <ConfigurationType>DynamicLibrary</ConfigurationType>\r\n    <UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>DynamicLibrary</ConfigurationType>\r\n    <UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\" Label=\"Configuration\">\r\n    <ConfigurationType>DynamicLibrary</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>DynamicLibrary</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    
<WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.props\" />\r\n  <ImportGroup Label=\"ExtensionSettings\">\r\n  </ImportGroup>\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <PropertyGroup Label=\"UserMacros\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <LinkIncremental>true</LinkIncremental>\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    
<IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n    <LinkIncremental>true</LinkIncremental>\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <LinkIncremental>false</LinkIncremental>\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n    <LinkIncremental>false</LinkIncremental>\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n  </PropertyGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <ClCompile>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      <PreprocessorDefinitions>DAVS2_EXPORTS;HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=0;_DEBUG;_WINDOWS;_USRDLL;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread</AdditionalIncludeDirectories>\r\n      <PrecompiledHeaderFile />\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <AdditionalLibraryDirectories>$(OutDir)\\</AdditionalLibraryDirectories>\r\n      
<AdditionalDependencies>libdavs2_asm.lib;libdavs2_intrin_sse.lib;libdavs2_intrin_avx.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>\r\n      <LargeAddressAware>true</LargeAddressAware>\r\n    </Link>\r\n    <PreBuildEvent>\r\n      <Command>cd /d \"$(SolutionDir)..\\..\" &amp;&amp; sh version.sh</Command>\r\n      <Message>UpdateSourceVersionInfo</Message>\r\n    </PreBuildEvent>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    <ClCompile>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      <PreprocessorDefinitions>DAVS2_EXPORTS;HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=1;_DEBUG;_WINDOWS;_USRDLL;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;$(CUDA_PATH)\\include;$(AMD_APPSDK_PATH)\\include;$(INTEL_OPENCL_SDK)\\include;</AdditionalIncludeDirectories>\r\n      <PrecompiledHeaderFile>\r\n      </PrecompiledHeaderFile>\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <AdditionalLibraryDirectories>$(OutDir)\\;$(CUDA_PATH)\\lib\\$(Platform);$(INTEL_OPENCL_SDK)\\lib\\$(Platform);$(AMD_APPSDK_PATH)\\lib\\x64;%(AdditionalLibraryDirectories);</AdditionalLibraryDirectories>\r\n      <AdditionalDependencies>libdavs2_asm.lib;libdavs2_intrin_sse.lib;libdavs2_intrin_avx.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>\r\n    </Link>\r\n    <PreBuildEvent>\r\n      
<Command>cd /d \"$(SolutionDir)..\\..\" &amp;&amp; sh version.sh</Command>\r\n      <Message>UpdateSourceVersionInfo</Message>\r\n    </PreBuildEvent>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <ClCompile>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>DAVS2_EXPORTS;HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=0;NDEBUG;_WINDOWS;_USRDLL;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread</AdditionalIncludeDirectories>\r\n      <PrecompiledHeaderFile />\r\n      <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n      <AdditionalLibraryDirectories>$(OutDir)\\</AdditionalLibraryDirectories>\r\n      <AdditionalDependencies>libdavs2_asm.lib;libdavs2_intrin_sse.lib;libdavs2_intrin_avx.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>\r\n      <LargeAddressAware>true</LargeAddressAware>\r\n    </Link>\r\n    <PreBuildEvent>\r\n      <Command>cd /d \"$(SolutionDir)..\\..\" &amp;&amp; sh version.sh</Command>\r\n      <Message>UpdateSourceVersionInfo</Message>\r\n    </PreBuildEvent>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <ClCompile>\r\n      
<WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>DAVS2_EXPORTS;HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=1;NDEBUG;_WINDOWS;_USRDLL;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;$(CUDA_PATH)\\include;$(AMD_APPSDK_PATH)\\include;$(INTEL_OPENCL_SDK)\\include;</AdditionalIncludeDirectories>\r\n      <PrecompiledHeaderFile>\r\n      </PrecompiledHeaderFile>\r\n      <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n      <AdditionalLibraryDirectories>$(OutDir)\\;$(CUDA_PATH)\\lib\\$(Platform);$(INTEL_OPENCL_SDK)\\lib\\$(Platform);$(AMD_APPSDK_PATH)\\lib\\x64;%(AdditionalLibraryDirectories);</AdditionalLibraryDirectories>\r\n      <AdditionalDependencies>libdavs2_asm.lib;libdavs2_intrin_sse.lib;libdavs2_intrin_avx.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;%(AdditionalDependencies)</AdditionalDependencies>\r\n    </Link>\r\n    <PreBuildEvent>\r\n      <Command>cd /d \"$(SolutionDir)..\\..\" &amp;&amp; sh version.sh</Command>\r\n      <Message>UpdateSourceVersionInfo</Message>\r\n    </PreBuildEvent>\r\n  </ItemDefinitionGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.targets\" />\r\n  <ImportGroup Label=\"ExtensionTargets\">\r\n  </ImportGroup>\r\n</Project>"
  },
  {
    "path": "build/vs2013/libdavs2.vcxproj.filters",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project ToolsVersion=\"4.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup>\r\n    <Filter Include=\"inc\">\r\n      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>\r\n      <Extensions>h;hh;hpp;hxx;hm;inl;inc;xsd</Extensions>\r\n    </Filter>\r\n    <Filter Include=\"src\">\r\n      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>\r\n      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>\r\n    </Filter>\r\n    <Filter Include=\"res\">\r\n      <UniqueIdentifier>{67DA6AB6-F800-4c08-8B7A-83BB121AAD01}</UniqueIdentifier>\r\n      <Extensions>rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms</Extensions>\r\n    </Filter>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClCompile Include=\"..\\..\\source\\common\\aec.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\alf.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\bitstream.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\block_info.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\common.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\cu.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\deblock.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\header.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\intra.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\mc.cc\">\r\n      <Filter>src</Filter>\r\n    
</ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\sao.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\transform.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\decoder.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\frame.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\cpu.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\predict.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\quant.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\pixel.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\threadpool.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\davs2.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\primitives.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\memory.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\win32thread.cc\">\r\n      <Filter>src</Filter>\r\n    </ClCompile>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClInclude Include=\"..\\..\\source\\common\\aec.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\alf.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\bitstream.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\block_info.h\">\r\n      <Filter>inc</Filter>\r\n    
</ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\common.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\cu.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\deblock.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\defines.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\header.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\intra.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\mc.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\osdep.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\sao.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\transform.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\vlc.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\decoder.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"resource.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\scantab.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\frame.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\cpu.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\predict.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\quant.h\">\r\n      
<Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\threadpool.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\davs2.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\primitives.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\configw.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\win32thread.h\">\r\n      <Filter>inc</Filter>\r\n    </ClInclude>\r\n  </ItemGroup>\r\n</Project>"
  },
  {
    "path": "build/vs2013/libdavs2_asm.vcxproj",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project DefaultTargets=\"Build\" ToolsVersion=\"12.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup Label=\"ProjectConfigurations\">\r\n    <ProjectConfiguration Include=\"Debug|Win32\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Debug|x64\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|Win32\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|x64\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\blockcopy8.asm\" />\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\const-a.asm\" />\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\cpu-a.asm\" />\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\dct8.asm\" />\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\ipfilter8.asm\">\r\n      <ExcludedFromBuild Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">true</ExcludedFromBuild>\r\n      <ExcludedFromBuild Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">true</ExcludedFromBuild>\r\n      <ExcludedFromBuild Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">true</ExcludedFromBuild>\r\n      <ExcludedFromBuild Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">true</ExcludedFromBuild>\r\n    </NASM>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\mc-a2.asm\" />\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\pixeladd8.asm\" />\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\quant8.asm\" />\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    
<ClInclude Include=\"..\\..\\source\\common\\x86\\dct8.h\" />\r\n    <ClInclude Include=\"..\\..\\source\\common\\x86\\ipfilter8.h\" />\r\n  </ItemGroup>\r\n  <PropertyGroup Label=\"Globals\">\r\n    <ProjectGuid>{A9B37E3C-A8C7-4E24-BC2D-AB4D0804DAC1}</ProjectGuid>\r\n    <Keyword>Win32Proj</Keyword>\r\n    <RootNamespace>asmopt</RootNamespace>\r\n  </PropertyGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.Default.props\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <Import 
Project=\"$(VCTargetsPath)\\Microsoft.Cpp.props\" />\r\n  <ImportGroup Label=\"ExtensionSettings\">\r\n    <Import Project=\"$(SolutionDir)nasm.props\" />\r\n  </ImportGroup>\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <PropertyGroup Label=\"UserMacros\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  
</PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <ClCompile>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=0;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread</AdditionalIncludeDirectories>\r\n      <DebugInformationFormat>ProgramDatabase</DebugInformationFormat>\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>4752;</DisableSpecificWarnings>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n    </Link>\r\n    <NASM>\r\n      <PreprocessorDefinitions>STACK_ALIGNMENT=32;HIGH_BIT_DEPTH=0;BIT_DEPTH=8;ARCH_X86_64=0;</PreprocessorDefinitions>\r\n      <IncludePaths>..\\..\\source\\common\\x86;</IncludePaths>\r\n    </NASM>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    <ClCompile>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      
<PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=1;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;$(CUDA_PATH)\\include;$(AMD_APPSDK_PATH)\\include;$(INTEL_OPENCL_SDK)\\include;</AdditionalIncludeDirectories>\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>4752;</DisableSpecificWarnings>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n    </Link>\r\n    <NASM>\r\n      <PreprocessorDefinitions>STACK_ALIGNMENT=32;HIGH_BIT_DEPTH=0;BIT_DEPTH=8;ARCH_X86_64=1;</PreprocessorDefinitions>\r\n      <IncludePaths>..\\..\\source\\common\\x86;</IncludePaths>\r\n    </NASM>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <ClCompile>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=0;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread</AdditionalIncludeDirectories>\r\n      <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>4752;</DisableSpecificWarnings>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n    
</Link>\r\n    <NASM>\r\n      <PreprocessorDefinitions>STACK_ALIGNMENT=32;HIGH_BIT_DEPTH=0;BIT_DEPTH=8;ARCH_X86_64=0;</PreprocessorDefinitions>\r\n      <IncludePaths>..\\..\\source\\common\\x86;</IncludePaths>\r\n    </NASM>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <ClCompile>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=1;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;$(CUDA_PATH)\\include;$(AMD_APPSDK_PATH)\\include;$(INTEL_OPENCL_SDK)\\include;</AdditionalIncludeDirectories>\r\n      <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>4752;</DisableSpecificWarnings>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n    </Link>\r\n    <NASM>\r\n      <PreprocessorDefinitions>STACK_ALIGNMENT=32;HIGH_BIT_DEPTH=0;BIT_DEPTH=8;ARCH_X86_64=1;</PreprocessorDefinitions>\r\n      <IncludePaths>..\\..\\source\\common\\x86;</IncludePaths>\r\n    </NASM>\r\n  </ItemDefinitionGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.targets\" />\r\n  <ImportGroup Label=\"ExtensionTargets\">\r\n    <Import Project=\"$(SolutionDir)nasm.targets\" />\r\n  </ImportGroup>\r\n</Project>"
  },
  {
    "path": "build/vs2013/libdavs2_asm.vcxproj.filters",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project ToolsVersion=\"4.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup>\r\n    <Filter Include=\"asm-x86\">\r\n      <UniqueIdentifier>{4FC737F1-C7A5-4376-A066-2A32D752A2FF}</UniqueIdentifier>\r\n      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>\r\n    </Filter>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClInclude Include=\"..\\..\\source\\common\\x86\\dct8.h\">\r\n      <Filter>asm-x86</Filter>\r\n    </ClInclude>\r\n    <ClInclude Include=\"..\\..\\source\\common\\x86\\ipfilter8.h\">\r\n      <Filter>asm-x86</Filter>\r\n    </ClInclude>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\blockcopy8.asm\">\r\n      <Filter>asm-x86</Filter>\r\n    </NASM>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\const-a.asm\">\r\n      <Filter>asm-x86</Filter>\r\n    </NASM>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\cpu-a.asm\">\r\n      <Filter>asm-x86</Filter>\r\n    </NASM>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\dct8.asm\">\r\n      <Filter>asm-x86</Filter>\r\n    </NASM>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\ipfilter8.asm\">\r\n      <Filter>asm-x86</Filter>\r\n    </NASM>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\mc-a2.asm\">\r\n      <Filter>asm-x86</Filter>\r\n    </NASM>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\pixeladd8.asm\">\r\n      <Filter>asm-x86</Filter>\r\n    </NASM>\r\n    <NASM Include=\"..\\..\\source\\common\\x86\\quant8.asm\">\r\n      <Filter>asm-x86</Filter>\r\n    </NASM>\r\n  </ItemGroup>\r\n</Project>"
  },
  {
    "path": "build/vs2013/libdavs2_intrin_avx.vcxproj",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project DefaultTargets=\"Build\" ToolsVersion=\"12.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup Label=\"ProjectConfigurations\">\r\n    <ProjectConfiguration Include=\"Debug|Win32\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Debug|x64\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|Win32\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|x64\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_deblock_avx2.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_idct_avx2.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_inter_pred_avx2.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_intra-pred_avx2.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_pixel_avx.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_sao_avx2.cc\" />\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClInclude Include=\"..\\..\\source\\common\\vec\\intrinsic.h\" />\r\n  </ItemGroup>\r\n  <PropertyGroup Label=\"Globals\">\r\n    <ProjectGuid>{558555B9-A7B2-42D6-A298-BB5CC248541F}</ProjectGuid>\r\n    <Keyword>Win32Proj</Keyword>\r\n    <RootNamespace>asmopt</RootNamespace>\r\n  </PropertyGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.Default.props\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\" Label=\"Configuration\">\r\n    
<ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.props\" />\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" 
Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <PropertyGroup Label=\"UserMacros\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <ClCompile>\r\n      
<PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=0;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread</AdditionalIncludeDirectories>\r\n      <DebugInformationFormat>ProgramDatabase</DebugInformationFormat>\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>\r\n      </DisableSpecificWarnings>\r\n      <EnableEnhancedInstructionSet>AdvancedVectorExtensions2</EnableEnhancedInstructionSet>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    <ClCompile>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=1;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;$(CUDA_PATH)\\include;$(AMD_APPSDK_PATH)\\include;$(INTEL_OPENCL_SDK)\\include;</AdditionalIncludeDirectories>\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>\r\n      </DisableSpecificWarnings>\r\n      <EnableEnhancedInstructionSet>AdvancedVectorExtensions2</EnableEnhancedInstructionSet>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup 
Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <ClCompile>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=0;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread</AdditionalIncludeDirectories>\r\n      <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>\r\n      </DisableSpecificWarnings>\r\n      <EnableEnhancedInstructionSet>AdvancedVectorExtensions2</EnableEnhancedInstructionSet>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <ClCompile>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=1;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;$(CUDA_PATH)\\include;$(AMD_APPSDK_PATH)\\include;$(INTEL_OPENCL_SDK)\\include;</AdditionalIncludeDirectories>\r\n      <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      
<RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>\r\n      </DisableSpecificWarnings>\r\n      <EnableEnhancedInstructionSet>AdvancedVectorExtensions2</EnableEnhancedInstructionSet>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.targets\" />\r\n</Project>"
  },
  {
    "path": "build/vs2013/libdavs2_intrin_avx.vcxproj.filters",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project ToolsVersion=\"4.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup>\r\n    <Filter Include=\"vec\">\r\n      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>\r\n      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>\r\n    </Filter>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_idct_avx2.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_deblock_avx2.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_intra-pred_avx2.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_inter_pred_avx2.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_sao_avx2.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_pixel_avx.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClInclude Include=\"..\\..\\source\\common\\vec\\intrinsic.h\">\r\n      <Filter>vec</Filter>\r\n    </ClInclude>\r\n  </ItemGroup>\r\n</Project>"
  },
  {
    "path": "build/vs2013/libdavs2_intrin_sse.vcxproj",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project DefaultTargets=\"Build\" ToolsVersion=\"12.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup Label=\"ProjectConfigurations\">\r\n    <ProjectConfiguration Include=\"Debug|Win32\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Debug|x64\">\r\n      <Configuration>Debug</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|Win32\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>Win32</Platform>\r\n    </ProjectConfiguration>\r\n    <ProjectConfiguration Include=\"Release|x64\">\r\n      <Configuration>Release</Configuration>\r\n      <Platform>x64</Platform>\r\n    </ProjectConfiguration>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_alf.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_deblock.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_idct.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_idct_avx2.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_inter_pred.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_intra-filledge.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_intra-pred.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_pixel.cc\" />\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_sao.cc\" />\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClInclude Include=\"..\\..\\source\\common\\vec\\intrinsic.h\" />\r\n  </ItemGroup>\r\n  <PropertyGroup Label=\"Globals\">\r\n    
<ProjectGuid>{2E7A6EE4-927F-470A-A012-3B29EDB87906}</ProjectGuid>\r\n    <Keyword>Win32Proj</Keyword>\r\n    <RootNamespace>asmopt</RootNamespace>\r\n  </PropertyGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.Default.props\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>true</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"Configuration\">\r\n    <ConfigurationType>StaticLibrary</ConfigurationType>\r\n    <UseDebugLibraries>false</UseDebugLibraries>\r\n    <PlatformToolset>$(DefaultPlatformToolset)</PlatformToolset>\r\n    <WholeProgramOptimization>true</WholeProgramOptimization>\r\n    <CharacterSet>MultiByte</CharacterSet>\r\n  </PropertyGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.props\" />\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <Import 
Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Label=\"PropertySheets\" Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <ImportGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\" Label=\"PropertySheets\">\r\n    <Import Project=\"$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props\" Condition=\"exists('$(UserRootDir)\\Microsoft.Cpp.$(Platform).user.props')\" Label=\"LocalAppDataPlatform\" />\r\n  </ImportGroup>\r\n  <PropertyGroup Label=\"UserMacros\" />\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  
<PropertyGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <OutDir>$(SolutionDir)..\\bin\\$(Platform)_$(Configuration)\\</OutDir>\r\n    <IntDir>$(SolutionDir)$(Platform)_$(Configuration)\\$(ProjectName)\\</IntDir>\r\n  </PropertyGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|Win32'\">\r\n    <ClCompile>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=0;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread</AdditionalIncludeDirectories>\r\n      <DebugInformationFormat>ProgramDatabase</DebugInformationFormat>\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>4752;</DisableSpecificWarnings>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Debug|x64'\">\r\n    <ClCompile>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <Optimization>Disabled</Optimization>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=1;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;$(CUDA_PATH)\\include;$(AMD_APPSDK_PATH)\\include;$(INTEL_OPENCL_SDK)\\include;</AdditionalIncludeDirectories>\r\n      <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>4752;</DisableSpecificWarnings>\r\n    </ClCompile>\r\n    <Link>\r\n      
<SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|Win32'\">\r\n    <ClCompile>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=0;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread</AdditionalIncludeDirectories>\r\n      <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>4752;</DisableSpecificWarnings>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <ItemDefinitionGroup Condition=\"'$(Configuration)|$(Platform)'=='Release|x64'\">\r\n    <ClCompile>\r\n      <WarningLevel>Level4</WarningLevel>\r\n      <PrecompiledHeader>\r\n      </PrecompiledHeader>\r\n      <Optimization>MaxSpeed</Optimization>\r\n      <FunctionLevelLinking>true</FunctionLevelLinking>\r\n      <IntrinsicFunctions>true</IntrinsicFunctions>\r\n      <PreprocessorDefinitions>HIGH_BIT_DEPTH=0;BIT_DEPTH=8;WIN32;ARCH_X86_64=1;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>\r\n      <AdditionalIncludeDirectories>..\\..\\;..\\..\\source;..\\..\\pthread;$(CUDA_PATH)\\include;$(AMD_APPSDK_PATH)\\include;$(INTEL_OPENCL_SDK)\\include;</AdditionalIncludeDirectories>\r\n      
<FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>\r\n      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>\r\n      <PrecompiledHeaderFile />\r\n      <DisableSpecificWarnings>4752;</DisableSpecificWarnings>\r\n    </ClCompile>\r\n    <Link>\r\n      <SubSystem>Windows</SubSystem>\r\n      <GenerateDebugInformation>true</GenerateDebugInformation>\r\n      <EnableCOMDATFolding>true</EnableCOMDATFolding>\r\n      <OptimizeReferences>true</OptimizeReferences>\r\n    </Link>\r\n  </ItemDefinitionGroup>\r\n  <Import Project=\"$(VCTargetsPath)\\Microsoft.Cpp.targets\" />\r\n</Project>"
  },
  {
    "path": "build/vs2013/libdavs2_intrin_sse.vcxproj.filters",
    "content": "﻿<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project ToolsVersion=\"4.0\" xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <ItemGroup>\r\n    <Filter Include=\"vec\">\r\n      <UniqueIdentifier>{93995380-89BD-4b04-88EB-625FBE52EBFB}</UniqueIdentifier>\r\n      <Extensions>cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx</Extensions>\r\n    </Filter>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_alf.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_deblock.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_inter_pred.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_intra-pred.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_pixel.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_sao.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_idct.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_idct_avx2.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n    <ClCompile Include=\"..\\..\\source\\common\\vec\\intrinsic_intra-filledge.cc\">\r\n      <Filter>vec</Filter>\r\n    </ClCompile>\r\n  </ItemGroup>\r\n  <ItemGroup>\r\n    <ClInclude Include=\"..\\..\\source\\common\\vec\\intrinsic.h\">\r\n      <Filter>vec</Filter>\r\n    </ClInclude>\r\n  </ItemGroup>\r\n</Project>"
  },
  {
    "path": "build/vs2013/nasm.props",
    "content": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n<Project xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\r\n  <PropertyGroup\r\n    Condition=\"'$(NASMBeforeTargets)' == '' and '$(NASMAfterTargets)' == '' and '$(ConfigurationType)' != 'Makefile'\">\r\n    <NASMBeforeTargets>Midl</NASMBeforeTargets>\r\n    <NASMAfterTargets>CustomBuild</NASMAfterTargets>\r\n  </PropertyGroup>\r\n  <PropertyGroup>\r\n    <NasmPath Condition= \"'$(NASMPATH)' == ''\">$(VCInstallDir)</NasmPath>\r\n  </PropertyGroup>\r\n  <ItemDefinitionGroup>\r\n    <NASM>\r\n      <ObjectFileName>$(IntDir)%(FileName).obj</ObjectFileName>\r\n      <CommandLineTemplate Condition=\"'$(Platform)' == 'Win32'\">nasm.exe -Xvc -f win32 [AllOptions] [AdditionalOptions] \"%(FullPath)\"</CommandLineTemplate>\r\n      <CommandLineTemplate Condition=\"'$(Platform)' == 'x64'\">nasm.exe -Xvc -f win64 [AllOptions] [AdditionalOptions] \"%(FullPath)\"</CommandLineTemplate>\r\n      <CommandLineTemplate Condition=\"'$(Platform)' != 'Win32' and '$(Platform)' != 'x64'\">echo NASM not supported on this platform\r\nexit 1</CommandLineTemplate>\r\n      <ExecutionDescription>%(Identity)</ExecutionDescription>\r\n    </NASM>\r\n  </ItemDefinitionGroup>\r\n</Project>"
  },
  {
    "path": "build/vs2013/nasm.targets",
    "content": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<Project xmlns=\"http://schemas.microsoft.com/developer/msbuild/2003\">\n  <ItemGroup>\n    <PropertyPageSchema\n      Include=\"$(MSBuildThisFileDirectory)$(MSBuildThisFileName).xml\" />\n    <AvailableItemName Include=\"NASM\">\n      <Targets>_NASM</Targets>\n    </AvailableItemName>\n  </ItemGroup>\n  <PropertyGroup>\n    <ComputeLinkInputsTargets>\n      $(ComputeLinkInputsTargets);\n      ComputeNASMOutput;\n    </ComputeLinkInputsTargets>\n    <ComputeLibInputsTargets>\n      $(ComputeLibInputsTargets);\n      ComputeNASMOutput;\n    </ComputeLibInputsTargets>\n  </PropertyGroup>\n  <UsingTask\n    TaskName=\"NASM\"\n    TaskFactory=\"XamlTaskFactory\"\n    AssemblyName=\"Microsoft.Build.Tasks.v4.0\">\n    <Task>$(MSBuildThisFileDirectory)$(MSBuildThisFileName).xml</Task>\n  </UsingTask>\n  <Target\n    Name=\"_NASM\"\n    BeforeTargets=\"$(NASMBeforeTargets)\"\n    AfterTargets=\"$(NASMAfterTargets)\"\n    Condition=\"'@(NASM)' != ''\"\n    Outputs=\"%(NASM.ObjectFileName)\"\n    Inputs=\"%(NASM.Identity);%(NASM.AdditionalDependencies);$(MSBuildProjectFile)\"\n    DependsOnTargets=\"_SelectedFiles\">\n    <ItemGroup Condition=\"'@(SelectedFiles)' != ''\">\n      <NASM Remove=\"@(NASM)\" Condition=\"'%(Identity)' != '@(SelectedFiles)'\" />\n    </ItemGroup>\n    <ItemGroup>\n      <NASM_tlog Include=\"%(NASM.ObjectFileName)\" Condition=\"'%(NASM.ObjectFileName)' != '' and '%(NASM.ExcludedFromBuild)' != 'true'\">\n        <Source>@(NASM, '|')</Source>\n      </NASM_tlog>\n    </ItemGroup>\n    <Message\n      Condition=\"'@(NASM)' != '' and '%(NASM.ExcludedFromBuild)' != 'true'\"\n      Importance=\"High\"\n      Text=\"%(NASM.ExecutionDescription)\" />\n    <WriteLinesToFile\n      Condition=\"'@(NASM_tlog)' != '' and '%(NASM_tlog.ExcludedFromBuild)' != 'true'\"\n      File=\"$(IntDir)$(ProjectName).write.1.tlog\"\n      Lines=\"^%(NASM_tlog.Source);@(NASM_tlog-&gt;'%(Fullpath)')\"/>\n    <NASM\n    
  Condition=\"'@(NASM)' != '' and '%(NASM.ExcludedFromBuild)' != 'true'\"\n      Inputs=\"%(NASM.Inputs)\"\n      ObjectFileName=\"%(NASM.ObjectFileName)\"\n      SymbolsPrefix=\"%(NASM.SymbolsPrefix)\"\n      SymbolsPostfix=\"%(NASM.SymbolsPostfix)\"\n      GenerateDebugInformation=\"%(NASM.GenerateDebugInformation)\"\n      IncludePaths=\"%(NASM.IncludePaths)\"\n      PreIncludeFiles=\"%(NASM.PreIncludeFiles)\"\n      PreprocessorDefinitions=\"%(NASM.PreprocessorDefinitions)\"\n      UndefinePreprocessorDefinitions=\"%(NASM.UndefinePreprocessorDefinitions)\"\n      TreatWarningsAsErrors=\"%(NASM.TreatWarningsAsErrors)\"\n      CommandLineTemplate=\"%(NASM.CommandLineTemplate)\"\n      AdditionalOptions=\"%(NASM.AdditionalOptions)\"\n    />\n  </Target>\n  <Target\n    Name=\"ComputeNASMOutput\"\n    Condition=\"'@(NASM)' != ''\">\n    <ItemGroup>\n      <Link Include=\"@(NASM->Metadata('ObjectFileName')->Distinct()->ClearMetadata())\" Condition=\"'%(NASM.ExcludedFromBuild)' != 'true'\"/>\n      <Lib Include=\"@(NASM->Metadata('ObjectFileName')->Distinct()->ClearMetadata())\" Condition=\"'%(NASM.ExcludedFromBuild)' != 'true'\"/>\n    </ItemGroup>\n  </Target>\n</Project>\n"
  },
  {
    "path": "build/vs2013/nasm.xml",
    "content": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<ProjectSchemaDefinitions xmlns=\"http://schemas.microsoft.com/build/2009/properties\" xmlns:x=\"http://schemas.microsoft.com/winfx/2006/xaml\" xmlns:sys=\"clr-namespace:System;assembly=mscorlib\">\n  <Rule\n    Name=\"NASM\"\n    PageTemplate=\"tool\"\n    DisplayName=\"Netwide Assembler\"\n    Order=\"200\">\n    <Rule.DataSource>\n      <DataSource\n        Persistence=\"ProjectFile\"\n        ItemType=\"NASM\" />\n    </Rule.DataSource>\n    <Rule.Categories>\n      <Category\n        Name=\"General\">\n        <Category.DisplayName>\n          <sys:String>General</sys:String>\n        </Category.DisplayName>\n      </Category>\n      <Category\n        Name=\"Preprocessor\">\n        <Category.DisplayName>\n          <sys:String>Preprocessing Options</sys:String>\n        </Category.DisplayName>\n      </Category>\n      <Category\n        Name=\"Assembler Options\">\n        <Category.DisplayName>\n          <sys:String>Assembler Options</sys:String>\n        </Category.DisplayName>\n      </Category>\n      <Category\n        Name=\"Command Line\"\n        Subtype=\"CommandLine\">\n        <Category.DisplayName>\n          <sys:String>Command Line</sys:String>\n        </Category.DisplayName>\n      </Category>\n    </Rule.Categories>\n    <StringProperty\n      Name=\"Inputs\"\n      Category=\"Command Line\"\n      IsRequired=\"true\">\n      <StringProperty.DataSource>\n        <DataSource\n          Persistence=\"ProjectFile\"\n          ItemType=\"NASM\"\n          SourceType=\"Item\" />\n      </StringProperty.DataSource>\n    </StringProperty>\n    <StringProperty\n      Name=\"ObjectFileName\"\n      Category=\"Assembler Options\"\n      DisplayName=\"Output File Name\"\n      Description=\"Specify Output Filename.\"\n      HelpUrl=\"http://www.nasm.us/doc/nasmdoc2.html#section-2.1.1\"  \n      Switch=\"-o &quot;[value]&quot;\" />\n    <BoolProperty\n      Name=\"GenerateDebugInformation\"\n 
     Category=\"Assembler Options\"\n      DisplayName=\"Generate Debug Information\"\n      Description=\"Generates Debug Information.\"\n      HelpUrl=\"http://www.nasm.us/doc/nasmdoc2.html#section-2.1.12\"\n      Switch=\"-g\" />\n    <StringProperty\n      Name=\"SymbolsPrefix\"\n      Category=\"Assembler Options\"\n      DisplayName=\"Symbols Prefix\"\n      Description=\"Prepend the given argument to all global or extern variables.\"\n      HelpUrl=\"http://www.nasm.us/doc/nasmdoc2.html#section-2.1.27\"\n      Switch=\"--prefix [value]\" />\n    <StringProperty\n      Name=\"SymbolsPostfix\"\n      Category=\"Assembler Options\"\n      DisplayName=\"Symbols Postfix\"\n      Description=\"Append the given argument to all global or extern variables.\"\n      HelpUrl=\"http://www.nasm.us/doc/nasmdoc2.html#section-2.1.27\"\n      Switch=\"--postfix [value]\" />\n    <StringListProperty\n      Name=\"IncludePaths\"\n      Category=\"General\"\n      DisplayName=\"Include File Search Directories\"\n      Description=\"Sets path for include files.\"\n      HelpUrl=\"http://www.nasm.us/doc/nasmdoc2.html#section-2.1.16\"\n      Switch=\"-I&quot;[value]/&quot;\" />\n    <StringListProperty\n      Name=\"PreIncludeFiles\"\n      Category=\"General\"\n      DisplayName=\"Pre-Include a File\"\n      Description=\"Force files to be pre-included into source file.\"\n      HelpUrl=\"http://www.nasm.us/doc/nasmdoc2.html#section-2.1.17\"\n      Switch=\"-P&quot;[value]&quot;\" />\n    <StringListProperty\n      Name=\"PreprocessorDefinitions\"\n      Category=\"Preprocessor\"\n      DisplayName=\"Preprocessor Definitions\"\n      Description=\"Defines a text macro with the given name.\"\n      HelpUrl=\"http://www.nasm.us/doc/nasmdoc2.html#section-2.1.18\"\n      Switch=\"-D[value]\" />\n    <StringListProperty\n      Name=\"UndefinePreprocessorDefinitions\"\n      Category=\"Preprocessor\"\n      DisplayName=\"Undefine Preprocessor Definitions\"\n      
Description=\"Undefines a text macro with the given name.\"\n      HelpUrl=\"http://www.nasm.us/doc/nasmdoc2.html#section-2.1.19\"\n      Switch=\"-U[value]\" />\n    <BoolProperty\n      Name=\"TreatWarningsAsErrors\"\n      Category=\"Assembler Options\"\n      DisplayName=\"Treat Warnings As Errors\"\n      Description=\"Returns an error code if warnings are generated.\"\n      HelpUrl=\"http://www.nasm.us/doc/nasmdoc2.html#section-2.1.24\"\n      Switch=\"-Werror\" />\n    <StringProperty\n      Name=\"CommandLineTemplate\"\n      DisplayName=\"Command Line\"\n      Visible=\"False\"\n      IncludeInCommandLine=\"False\" />\n    <DynamicEnumProperty\n        Name=\"NASMBeforeTargets\"\n        Category=\"General\"\n        EnumProvider=\"Targets\"\n        IncludeInCommandLine=\"False\">\n      <DynamicEnumProperty.DisplayName>\n        <sys:String>Execute Before</sys:String>\n      </DynamicEnumProperty.DisplayName>\n      <DynamicEnumProperty.Description>\n        <sys:String>Specifies the targets for the build customization to run before.</sys:String>\n      </DynamicEnumProperty.Description>\n      <DynamicEnumProperty.ProviderSettings>\n        <NameValuePair\n          Name=\"Exclude\"\n          Value=\"^NASMBeforeTargets|^Compute\" />\n      </DynamicEnumProperty.ProviderSettings>\n      <DynamicEnumProperty.DataSource>\n        <DataSource\n          Persistence=\"ProjectFile\"\n          ItemType=\"\"\n          HasConfigurationCondition=\"true\" />\n      </DynamicEnumProperty.DataSource>\n    </DynamicEnumProperty>\n    <DynamicEnumProperty\n      Name=\"NASMAfterTargets\"\n      Category=\"General\"\n      EnumProvider=\"Targets\"\n      IncludeInCommandLine=\"False\">\n      <DynamicEnumProperty.DisplayName>\n        <sys:String>Execute After</sys:String>\n      </DynamicEnumProperty.DisplayName>\n      <DynamicEnumProperty.Description>\n        <sys:String>Specifies the targets for the build customization to run after.</sys:String>\n      
</DynamicEnumProperty.Description>\n      <DynamicEnumProperty.ProviderSettings>\n        <NameValuePair\n          Name=\"Exclude\"\n          Value=\"^NASMAfterTargets|^Compute\" />\n      </DynamicEnumProperty.ProviderSettings>\n      <DynamicEnumProperty.DataSource>\n        <DataSource\n          Persistence=\"ProjectFile\"\n          ItemType=\"\"\n          HasConfigurationCondition=\"true\" />\n      </DynamicEnumProperty.DataSource>\n    </DynamicEnumProperty>\n    <StringProperty\n      Name=\"ExecutionDescription\"\n      DisplayName=\"Execution Description\"\n      IncludeInCommandLine=\"False\"\n      Visible=\"False\" />\n    <StringListProperty\n      Name=\"AdditionalDependencies\"\n      DisplayName=\"Additional Dependencies\"\n      IncludeInCommandLine=\"False\"\n      Visible=\"False\" />\n    <StringProperty\n      Subtype=\"AdditionalOptions\"\n      Name=\"AdditionalOptions\"\n      Category=\"Command Line\">\n      <StringProperty.DisplayName>\n        <sys:String>Additional Options</sys:String>\n      </StringProperty.DisplayName>\n      <StringProperty.Description>\n        <sys:String>Additional Options</sys:String>\n      </StringProperty.Description>\n    </StringProperty>\n  </Rule>\n  <ItemType\n    Name=\"NASM\"\n    DisplayName=\"Netwide Assembler\" />\n  <FileExtension\n    Name=\"*.asm\"\n    ContentType=\"NASM\" />\n  <ContentType\n    Name=\"NASM\"\n    DisplayName=\"Netwide Assembler\"\n    ItemType=\"NASM\" />\n</ProjectSchemaDefinitions>\n"
  },
  {
    "path": "source/common/aec.cc",
    "content": "/*\r\n *  aec.cc\r\n *\r\n * Description of this file:\r\n *    AEC functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"block_info.h\"\r\n#include \"alf.h\"\r\n#include \"aec.h\"\r\n#include \"vlc.h\"\r\n#include \"sao.h\"\r\n#include \"scantab.h\"\r\n\r\n/**\r\n * ===========================================================================\r\n * macros\r\n * ===========================================================================\r\n */\r\n#define CTRL_OPT_AEC                       1  /* ǷûڲAEC״̬ */\r\n#define MAKE_CONTEXT(lg_pmps, mps, cycno)  (((uint16_t)(cycno) << 0) | ((uint16_t)(mps) << 2) | (uint16_t)(lg_pmps << 3))\r\n\r\n/**\r\n * 
===========================================================================\r\n * global & local variables\r\n * ===========================================================================\r\n */\r\n\r\n#if AVS2_TRACE\r\nint      symbolCount = 0;\r\n#endif\r\n\r\n#if CTRL_OPT_AEC\r\n/* [8 * lg_pmps + 4 * mps + cycno] */\r\nstatic context_t g_tab_ctx_mps[2048 * 4 * 2];\r\nstatic context_t g_tab_ctx_lps[2048 * 4 * 2];\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * 0: INTRA_PRED_VER\r\n * 1: INTRA_PRED_HOR\r\n * 2: INTRA_PRED_DC_DIAG\r\n */\r\nconst int tab_intra_mode_scan_type[NUM_INTRA_MODE] = {\r\n    2, 2, 2, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,\r\n    2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0\r\n};\r\n\r\nstatic const int EO_OFFSET_INV__MAP[] = { 1, 0, 2, -1, 3, 4, 5, 6 };\r\nstatic const int T_Chr[5] = { 0, 1, 2, 4, 3000 };\r\nstatic const int8_t tab_rank[6] = { 0, 1, 2, 3, 3, 4/*, 4 ...*/ };\r\n\r\nstatic const uint8_t raster2ZZ_4x4[] = {\r\n    0,  1,  5,  6,\r\n    2,  4,  7, 12,\r\n    3,  8, 11, 13,\r\n    9, 10, 14, 15\r\n};\r\n\r\nstatic const uint8_t raster2ZZ_8x8[] = {\r\n     0,  1,  5,  6, 14, 15, 27, 28,\r\n     2,  4,  7, 13, 16, 26, 29, 42,\r\n     3,  8, 12, 17, 25, 30, 41, 43,\r\n     9, 11, 18, 24, 31, 40, 44, 53,\r\n    10, 19, 23, 32, 39, 45, 52, 54,\r\n    20, 22, 33, 38, 46, 51, 55, 60,\r\n    21, 34, 37, 47, 50, 56, 59, 61,\r\n    35, 36, 48, 49, 57, 58, 62, 63\r\n};\r\n\r\nstatic const uint8_t raster2ZZ_2x8[] = {\r\n    0, 1, 4, 5,  8,  9, 12, 13,\r\n    2, 3, 6, 7, 10, 11, 14, 15\r\n};\r\n\r\n\r\nstatic const uint8_t raster2ZZ_8x2[] = {\r\n    0,  1,\r\n    2,  4,\r\n    3,  5,\r\n    6,  8,\r\n    7,  9,\r\n    10, 12,\r\n    11, 13,\r\n    14, 15\r\n};\r\n\r\nstatic const uint8_t tab_scan_coeff_pos_in_cg[4][4] = {\r\n    { 0,  1,  5,  6 },\r\n    { 2,  4,  7, 12 },\r\n    { 3,  8, 11, 13 },\r\n    { 9, 10, 14, 15 }\r\n};\r\n\r\nstatic const uint8_t tab_cwr[] = {\r\n    3, 
3, 4, 5, 5, 5, 5 /* 5, 5, 5, 5 */\r\n};\r\n\r\nstatic const uint16_t tab_lg_pmps_offset[] = {\r\n    0, 0, 0, 197, 95, 46 /* 5, 5, 5, 5 */\r\n};\r\n\r\nstatic const int tab_pdir_bskip[DS_MAX_NUM] = {\r\n    PDIR_SYM, PDIR_BID, PDIR_BWD, PDIR_SYM, PDIR_FWD\r\n};\r\n\r\n/**\r\n * ===========================================================================\r\n * defines\r\n * ===========================================================================\r\n */\r\n\r\nenum aec_const_e {\r\n    LG_PMPS_SHIFTNO    = 2,\r\n    B_BITS             = 10,\r\n    QUARTER_SHIFT      = (B_BITS-2),\r\n    HALF               = (1 << (B_BITS-1)),\r\n    QUARTER            = (1 << (B_BITS-2)),\r\n    AEC_VALUE_BOUND    = 254,     /* make sure rs1 will not overflow for 8-bit uint8_t */\r\n};\r\n\r\n\r\nstatic const int8_t tab_intra_mode_luma2chroma[NUM_INTRA_MODE] = {\r\n    DC_PRED_C,   -1, BI_PRED_C, -1, -1, -1, -1, -1, -1, -1, -1, -1,\r\n    VERT_PRED_C, -1,        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\r\n    HOR_PRED_C,  -1,        -1, -1, -1, -1, -1, -1, -1\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nint aec_get_next_bit(aec_t *p_aec)\r\n{\r\n    uint32_t next_bit;\r\n\r\n    if (--p_aec->i_bits_to_go < 0) {\r\n        int diff = p_aec->i_bytes - p_aec->i_byte_pos;\r\n        uint8_t *p_buffer = p_aec->p_buffer + p_aec->i_byte_pos;\r\n\r\n#if 1\r\n        if (diff > 7) {\r\n            p_aec->i_byte_buf = ((uint64_t)p_buffer[0] << 56) | ((uint64_t)p_buffer[1] << 48) | ((uint64_t)p_buffer[2] << 40) | ((uint64_t)p_buffer[3] << 32) |\r\n                                ((uint64_t)p_buffer[4] << 24) | ((uint64_t)p_buffer[5] << 16) | ((uint64_t)p_buffer[6] <<  8) |  (uint64_t)p_buffer[7];\r\n            p_aec->i_bits_to_go = 63;\r\n            p_aec->i_byte_pos += 8;\r\n        } else if (diff > 0) {\r\n            /* fewer than 8 bytes remain in the frame; this happens only once per frame picture */\r\n            int i;\r\n            p_aec->i_bits_to_go += 
(int8_t)(diff << 3);\r\n            p_aec->i_byte_pos += (p_aec->i_bits_to_go + 1) >> 3;\r\n\r\n            p_aec->i_byte_buf = 0;\r\n            for (i = 0; i < diff; i++) {\r\n                p_aec->i_byte_buf = (p_aec->i_byte_buf << 8) | p_buffer[i];\r\n            }\r\n        } else {\r\n            p_aec->b_bit_error = 1;\r\n            return 1;\r\n        }\r\n#else\r\n        int i;\r\n        if (diff > 8) {\r\n            diff = 8;\r\n        } else if (diff <= 0) {\r\n            p_aec->b_bit_error = 1;\r\n            return 1;\r\n        }\r\n        p_aec->i_bits_to_go += (diff << 3);\r\n        p_aec->i_byte_pos += (p_aec->i_bits_to_go + 1) >> 3;\r\n\r\n        p_aec->i_byte_buf = 0;\r\n        for (i = 0; i < diff; i++) {\r\n            p_aec->i_byte_buf = (p_aec->i_byte_buf << 8) | p_buffer[i];\r\n        }\r\n#endif\r\n    }\r\n\r\n    /* get next bit */\r\n    next_bit = ((p_aec->i_byte_buf >> p_aec->i_bits_to_go) & 0x01);\r\n\r\n    p_aec->i_value_t = (p_aec->i_value_t << 1) | next_bit;\r\n\r\n    return 0;\r\n}\r\n\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nint aec_get_next_n_bit(aec_t *p_aec, int num_bits)\r\n{\r\n    if (p_aec->i_bits_to_go >= num_bits) {\r\n        uint32_t next_bits;\r\n        p_aec->i_bits_to_go -= (int8_t)num_bits;\r\n        next_bits = (p_aec->i_byte_buf >> p_aec->i_bits_to_go) & ((1 << num_bits) - 1);\r\n\r\n        p_aec->i_value_t = (p_aec->i_value_t << num_bits) | next_bits;\r\n\r\n        return 0;\r\n    } else {\r\n        for (; num_bits != 0; num_bits--) {\r\n            aec_get_next_bit(p_aec);\r\n        }\r\n        return p_aec->b_bit_error;\r\n    }\r\n}\r\n    \r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid update_ctx_mps(context_t *ctx)\r\n{\r\n#if CTRL_OPT_AEC\r\n    ctx->v = g_tab_ctx_mps[ctx->v].v;\r\n#else\r\n    uint32_t lg_pmps = ctx->LG_PMPS;\r\n    uint8_t  cycno   = (uint8_t)ctx->cycno; \r\n    uint32_t cwr     = tab_cwr[cycno];\r\n\r\n    // update probability estimation and other parameters\r\n    if (cycno == 0) {\r\n        ctx->cycno = 1;\r\n    }\r\n    lg_pmps -= (lg_pmps >> cwr) + (lg_pmps >> (cwr + 2));\r\n\r\n    ctx->LG_PMPS = (uint16_t)lg_pmps;\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid update_ctx_lps(context_t *ctx)\r\n{\r\n#if CTRL_OPT_AEC\r\n    ctx->v = g_tab_ctx_lps[ctx->v].v;\r\n#else\r\n    uint32_t cycno   = ctx->cycno;\r\n    uint32_t cwr     = tab_cwr[cycno];\r\n    uint32_t lg_pmps = ctx->LG_PMPS + tab_lg_pmps_offset[cwr];\r\n    uint32_t mps     = ctx->MPS;\r\n\r\n    // update probability estimation and other parameters\r\n    if (cycno != 3) {\r\n        ++cycno;\r\n    }\r\n\r\n    if (lg_pmps >= (256 << LG_PMPS_SHIFTNO)) {\r\n        lg_pmps  = (512 << LG_PMPS_SHIFTNO) - 1 - lg_pmps;\r\n        mps = !mps;\r\n    }\r\n\r\n    ctx->v = MAKE_CONTEXT(lg_pmps, mps, cycno);\r\n#endif\r\n}\r\n\r\n#if CTRL_OPT_AEC\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid init_aec_context_tab(void)\r\n{\r\n    static bool_t b_inited = 0;\r\n    context_t ctx_i;\r\n    context_t ctx_o;\r\n    int cycno;\r\n    int mps;\r\n\r\n    if (b_inited != 0) {\r\n        return;\r\n    }\r\n    /* init context table */\r\n    b_inited = 1;\r\n    ctx_i.v = 0;\r\n    ctx_o.v = 0;\r\n    memset(g_tab_ctx_mps, 0, sizeof(g_tab_ctx_mps));\r\n    memset(g_tab_ctx_lps, 0, sizeof(g_tab_ctx_lps));\r\n\r\n    /* mps */\r\n    for (cycno = 0; cycno < 4; cycno++) {\r\n        uint32_t cwr = tab_cwr[cycno];\r\n      
  ctx_i.cycno = cycno;\r\n        ctx_o.cycno = (uint8_t)DAVS2_MAX(cycno, 1);\r\n\r\n        for (mps = 0; mps < 2; mps++) {\r\n            ctx_i.MPS = (uint8_t)mps;\r\n            ctx_o.MPS = (uint8_t)mps;\r\n            for (ctx_i.LG_PMPS = 0; ctx_i.LG_PMPS <= 1024; ctx_i.LG_PMPS++) {\r\n                uint32_t lg_pmps = ctx_i.LG_PMPS;\r\n                lg_pmps -= (lg_pmps >> cwr) + (lg_pmps >> (cwr + 2));\r\n                ctx_o.LG_PMPS = (uint16_t)lg_pmps;\r\n                g_tab_ctx_mps[ctx_i.v].v = ctx_o.v;\r\n            }\r\n        }\r\n    }\r\n\r\n    /* lps */\r\n    for (cycno = 0; cycno < 4; cycno++) {\r\n        uint32_t cwr = tab_cwr[cycno];\r\n        ctx_i.cycno = cycno;\r\n        ctx_o.cycno = (uint8_t)DAVS2_MIN(cycno + 1, 3);\r\n\r\n        for (mps = 0; mps < 2; mps++) {\r\n            ctx_i.MPS = (uint8_t)mps;\r\n            ctx_o.MPS = (uint8_t)mps;\r\n            for (ctx_i.LG_PMPS = 0; ctx_i.LG_PMPS <= 1024; ctx_i.LG_PMPS++) {\r\n                uint32_t lg_pmps = ctx_i.LG_PMPS + tab_lg_pmps_offset[cwr];\r\n                if (lg_pmps >= (256 << LG_PMPS_SHIFTNO)) {\r\n                    lg_pmps = (512 << LG_PMPS_SHIFTNO) - 1 - lg_pmps;\r\n                    ctx_o.MPS = !mps;\r\n                }\r\n                ctx_o.LG_PMPS = (uint16_t)lg_pmps;\r\n                g_tab_ctx_lps[ctx_i.v].v = ctx_o.v;\r\n            }\r\n        }\r\n    }\r\n}\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * initializes the aec_t for the arithmetic decoder\r\n */\r\nint aec_start_decoding(aec_t *p_aec, uint8_t *p_start, int i_byte_pos, int i_bytes)\r\n{\r\n#if CTRL_OPT_AEC\r\n    init_aec_context_tab();\r\n#endif\r\n    p_aec->p_buffer         = p_start;\r\n    p_aec->i_byte_pos       = i_byte_pos;\r\n    p_aec->i_bytes          = i_bytes;\r\n    p_aec->i_bits_to_go     = 0;\r\n    p_aec->b_bit_error      = 0;\r\n    p_aec->b_val_domain     = 1;\r\n    p_aec->i_s1             = 0;\r\n    
p_aec->i_t1             = QUARTER - 1; // 0xff\r\n    p_aec->i_value_s        = 0;\r\n    p_aec->i_value_t        = 0;\r\n\r\n    if (p_aec->i_bits_to_go < B_BITS - 1) {\r\n        if (aec_get_next_n_bit(p_aec, B_BITS - 1)) {\r\n            return 0;\r\n        }\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint aec_bits_read(aec_t *p_aec)\r\n{\r\n    return (p_aec->i_byte_pos << 3) - p_aec->i_bits_to_go;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nint biari_decode_symbol(aec_t *p_aec, context_t *ctx)\r\n{\r\n    uint32_t lg_pmps = ctx->LG_PMPS >> LG_PMPS_SHIFTNO;\r\n    uint32_t t2;\r\n    uint32_t s2;\r\n    uint32_t s_flag;\r\n    uint32_t i_value_s = p_aec->i_value_s;\r\n    int bit = ctx->MPS;\r\n    int is_LPS;\r\n\r\n    // p_aec->i_value_t is in R domain  p_aec->i_s1=0 or p_aec->i_s1 == AEC_VALUE_BOUND\r\n    if (p_aec->b_val_domain != 0 || (p_aec->i_s1 == AEC_VALUE_BOUND && p_aec->b_val_bound != 0)) {\r\n        i_value_s   = 0;\r\n        p_aec->i_s1 = 0;\r\n\r\n        while (p_aec->i_value_t < QUARTER && i_value_s < AEC_VALUE_BOUND) {\r\n            if (aec_get_next_bit(p_aec)) {\r\n                return 0;\r\n            }\r\n            i_value_s++;\r\n        }\r\n\r\n        p_aec->b_val_bound = p_aec->i_value_t < QUARTER;\r\n        p_aec->i_value_t   = p_aec->i_value_t & 0xff;\r\n    }\r\n\r\n    if (p_aec->i_value_s > AEC_VALUE_BOUND) {\r\n        /// davs2_log(NULL, DAVS2_LOG_ERROR, \"p_aec->i_value_s (>254).\");\r\n        p_aec->b_bit_error = 1;\r\n        p_aec->i_value_s   = i_value_s;\r\n        return 0;\r\n    }\r\n\r\n    s_flag = p_aec->i_t1 < lg_pmps;\r\n    s2     = p_aec->i_s1 + s_flag;\r\n    t2     = p_aec->i_t1 - lg_pmps + (s_flag << 8);            // 8bits\r\n    is_LPS = (s2 > i_value_s || (s2 == i_value_s && p_aec->i_value_t >= t2)) && p_aec->b_val_bound == 0;\r\n\r\n    
p_aec->b_val_domain = (bool_t)is_LPS;\r\n\r\n    if (is_LPS) {     // LPS\r\n        uint32_t t_rlps = (s_flag == 0) ? (lg_pmps) : (p_aec->i_t1 + lg_pmps);\r\n        int n_bits = 0;\r\n        bit = !bit;\r\n\r\n        if (s2 == i_value_s) {\r\n            p_aec->i_value_t -= t2;\r\n        } else {\r\n            if (aec_get_next_bit(p_aec)) {\r\n                return 0;\r\n            }\r\n            p_aec->i_value_t += 256 - t2;\r\n        }\r\n\r\n        // restore range\r\n        while (t_rlps < QUARTER) {\r\n            t_rlps <<= 1;\r\n            n_bits++;\r\n        }\r\n        if (n_bits) {\r\n            if (aec_get_next_n_bit(p_aec, n_bits)) {\r\n                return 0;\r\n            }\r\n        }\r\n\r\n        p_aec->i_s1 = 0;\r\n        p_aec->i_t1 = t_rlps & 0xff;\r\n        update_ctx_lps(ctx);\r\n    } else {        // MPS\r\n        p_aec->i_s1 = s2;\r\n        p_aec->i_t1 = t2;\r\n        update_ctx_mps(ctx);\r\n    }\r\n\r\n    p_aec->i_value_s = i_value_s;\r\n\r\n    return bit;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * return the decoded symbol\r\n */\r\nstatic INLINE\r\nint biari_decode_symbol_eq_prob(aec_t *p_aec)\r\n{\r\n    if (p_aec->b_val_domain != 0 || (p_aec->i_s1 == AEC_VALUE_BOUND && p_aec->b_val_bound != 0)) {\r\n        p_aec->i_s1 = 0;\r\n\r\n        if (aec_get_next_bit(p_aec)) {\r\n            return 0;\r\n        }\r\n\r\n        if (p_aec->i_value_t >= (256 + p_aec->i_t1)) {  // LPS\r\n            p_aec->i_value_t -= (256 + p_aec->i_t1);\r\n            return 1;\r\n        } else {\r\n            return 0;\r\n        }\r\n    } else {\r\n        uint32_t s2 = p_aec->i_s1 + 1;\r\n        uint32_t t2 = p_aec->i_t1;\r\n        int is_LPS = s2 > p_aec->i_value_s || ((s2 == p_aec->i_value_s && p_aec->i_value_t >= t2) && p_aec->b_val_bound == 0);\r\n\r\n        p_aec->b_val_domain = (bool_t)is_LPS;\r\n\r\n        if (is_LPS) {    //LPS\r\n            if (s2 == 
p_aec->i_value_s) {\r\n                p_aec->i_value_t -= t2;\r\n            } else {\r\n                if (aec_get_next_bit(p_aec)) {\r\n                    return 0;\r\n                }\r\n                p_aec->i_value_t += 256 - t2;\r\n            }\r\n            return 1;\r\n        } else {\r\n            p_aec->i_s1 = s2;\r\n            p_aec->i_t1 = t2;\r\n            return 0;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nint biari_decode_final(aec_t *p_aec)\r\n{\r\n    // static context_t ctx = { (1 << LG_PMPS_SHIFTNO), 0, 0 };\r\n    const uint32_t lg_pmps = 1; // ctx.LG_PMPS >> LG_PMPS_SHIFTNO;\r\n    uint32_t t2;\r\n    uint32_t s2;\r\n    uint32_t s_flag;\r\n    int is_LPS;\r\n\r\n    // p_aec->i_value_t is in R domain  p_aec->i_s1=0 or p_aec->i_s1 == AEC_VALUE_BOUND\r\n    if (p_aec->b_val_domain != 0 || (p_aec->i_s1 == AEC_VALUE_BOUND && p_aec->b_val_bound != 0)) {\r\n        p_aec->i_s1 = 0;\r\n        p_aec->i_value_s = 0;\r\n\r\n        while (p_aec->i_value_t < QUARTER && p_aec->i_value_s < AEC_VALUE_BOUND) {\r\n            if (aec_get_next_bit(p_aec)) {\r\n                return 0;\r\n            }\r\n            p_aec->i_value_s++;\r\n        }\r\n\r\n        p_aec->b_val_bound = p_aec->i_value_t < QUARTER;\r\n        p_aec->i_value_t = p_aec->i_value_t & 0xff;\r\n    }\r\n\r\n    s_flag = p_aec->i_t1 < lg_pmps;\r\n    s2 = p_aec->i_s1 + s_flag;\r\n    t2 = p_aec->i_t1 - lg_pmps + (s_flag << 8);            // 8bits\r\n\r\n    /* ֵ */\r\n    is_LPS = (s2 > p_aec->i_value_s || (s2 == p_aec->i_value_s && p_aec->i_value_t >= t2)) && p_aec->b_val_bound == 0;\r\n    p_aec->b_val_domain = (bool_t)is_LPS;\r\n\r\n    if (is_LPS) {     // LPS\r\n        uint32_t t_rlps = 1;\r\n        int n_bits = 0;\r\n\r\n        if (s2 == p_aec->i_value_s) {\r\n            p_aec->i_value_t -= t2;\r\n        } else {\r\n            if (aec_get_next_bit(p_aec)) {\r\n       
         return 0;\r\n            }\r\n            p_aec->i_value_t += 256 - t2;\r\n        }\r\n\r\n        // restore range\r\n        while (t_rlps < QUARTER) {\r\n            t_rlps <<= 1;\r\n            n_bits++;\r\n        }\r\n        if (n_bits) {\r\n            if (aec_get_next_n_bit(p_aec, n_bits)) {\r\n                return 0;\r\n            }\r\n        }\r\n\r\n        p_aec->i_s1 = 0;\r\n        p_aec->i_t1 = 0;\r\n        // return 1;  // !ctx.MPS\r\n    } else {        // MPS\r\n        p_aec->i_s1 = s2;\r\n        p_aec->i_t1 = t2;\r\n        // return 0;  // ctx.MPS\r\n    }\r\n\r\n    return is_LPS;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * decode symbols with the same context until a zero bit is obtained or\r\n * max_num symbols have been decoded\r\n */\r\nstatic INLINE\r\nint biari_decode_symbol_continue0(aec_t *p_aec, context_t *ctx, int max_num)\r\n{\r\n    uint32_t i_value_s = p_aec->i_value_s;\r\n    int bit = 0;\r\n    int i;\r\n\r\n    for (i = 0; i < max_num && !bit; i++) {\r\n        uint32_t lg_pmps = ctx->LG_PMPS >> LG_PMPS_SHIFTNO;\r\n        uint32_t t2;\r\n        uint32_t s2;\r\n        uint32_t s_flag;\r\n        int is_LPS;\r\n\r\n        bit = ctx->MPS;\r\n\r\n        if (p_aec->b_val_domain != 0 || (p_aec->i_s1 == AEC_VALUE_BOUND && p_aec->b_val_bound != 0)) {\r\n            p_aec->i_s1 = 0;\r\n            i_value_s = 0;\r\n\r\n            while (p_aec->i_value_t < QUARTER && i_value_s < AEC_VALUE_BOUND) {\r\n                if (aec_get_next_bit(p_aec)) {\r\n                    return 0;\r\n                }\r\n                i_value_s++;\r\n            }\r\n\r\n            p_aec->b_val_bound = p_aec->i_value_t < QUARTER;\r\n            p_aec->i_value_t = p_aec->i_value_t & 0xff;\r\n        }\r\n\r\n        s_flag = p_aec->i_t1 < lg_pmps;\r\n        s2 = p_aec->i_s1 + s_flag;\r\n        t2 = p_aec->i_t1 - lg_pmps + (s_flag << 8);            // 8bits\r\n\r\n        if (i_value_s > 
AEC_VALUE_BOUND) {\r\n            /// davs2_log(NULL, DAVS2_LOG_ERROR, \"i_value_s (>254).\");\r\n            p_aec->b_bit_error = 1;\r\n            return 0;\r\n        }\r\n\r\n        is_LPS = (s2 > i_value_s || (s2 == i_value_s && p_aec->i_value_t >= t2)) && p_aec->b_val_bound == 0;\r\n        p_aec->b_val_domain = (bool_t)is_LPS;\r\n\r\n        if (is_LPS) {     // LPS\r\n            uint32_t t_rlps = (s_flag == 0) ? (lg_pmps) : (p_aec->i_t1 + lg_pmps);\r\n            int n_bits = 0;\r\n            bit = !bit;\r\n\r\n            if (s2 == i_value_s) {\r\n                p_aec->i_value_t -= t2;\r\n            } else {\r\n                if (aec_get_next_bit(p_aec)) {\r\n                    return 0;\r\n                }\r\n                p_aec->i_value_t += 256 - t2;\r\n            }\r\n\r\n            // restore range\r\n            while (t_rlps < QUARTER) {\r\n                t_rlps <<= 1;\r\n                n_bits++;\r\n            }\r\n            if (n_bits) {\r\n                if (aec_get_next_n_bit(p_aec, n_bits)) {\r\n                    return 0;\r\n                }\r\n            }\r\n\r\n            p_aec->i_s1 = 0;\r\n            p_aec->i_t1 = t_rlps & 0xff;\r\n            update_ctx_lps(ctx);\r\n        } else {        // MPS\r\n            p_aec->i_s1 = s2;\r\n            p_aec->i_t1 = t2;\r\n            update_ctx_mps(ctx);\r\n        }\r\n    }\r\n\r\n    p_aec->i_value_s = i_value_s;\r\n    return i - bit;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nint biari_decode_symbol_continu0_ext(aec_t *p_aec, context_t *ctx, int max_ctx_inc, int max_num)\r\n{\r\n    int bit = 0;\r\n    int i;\r\n\r\n    for (i = 0; i < max_num && !bit; i++) {\r\n        int ctx_add = DAVS2_MIN(i, max_ctx_inc);\r\n        context_t *p_ctx = ctx + ctx_add;\r\n        uint32_t lg_pmps = p_ctx->LG_PMPS >> LG_PMPS_SHIFTNO;\r\n        uint32_t t2;\r\n        uint32_t s2;\r\n        int is_LPS;\r\n        
int s_flag;\r\n\r\n        bit = p_ctx->MPS;\r\n\r\n        if (p_aec->b_val_domain != 0 || (p_aec->i_s1 == AEC_VALUE_BOUND && p_aec->b_val_bound != 0)) {\r\n            p_aec->i_s1 = 0;\r\n            p_aec->i_value_s = 0;\r\n\r\n            while (p_aec->i_value_t < QUARTER && p_aec->i_value_s < AEC_VALUE_BOUND) {\r\n                if (aec_get_next_bit(p_aec)) {\r\n                    return 0;\r\n                }\r\n                p_aec->i_value_s++;\r\n            }\r\n\r\n            p_aec->b_val_bound = p_aec->i_value_t < QUARTER;\r\n            p_aec->i_value_t = p_aec->i_value_t & 0xff;\r\n        }\r\n\r\n        s_flag = p_aec->i_t1 < lg_pmps;\r\n        s2 = p_aec->i_s1 + s_flag;\r\n        t2 = p_aec->i_t1 - lg_pmps + (s_flag << 8);            // 8bits\r\n\r\n        if (p_aec->i_value_s > AEC_VALUE_BOUND) {\r\n            /// davs2_log(NULL, DAVS2_LOG_ERROR, \"p_aec->i_value_s (>254).\");\r\n            /// exit(1);\r\n            p_aec->b_bit_error = 1;\r\n            return 0;\r\n        }\r\n\r\n        is_LPS = (s2 > p_aec->i_value_s || (s2 == p_aec->i_value_s && p_aec->i_value_t >= t2)) && p_aec->b_val_bound == 0;\r\n        p_aec->b_val_domain = (bool_t)is_LPS;\r\n\r\n        if (is_LPS) {     // LPS\r\n            uint32_t t_rlps = (s_flag == 0) ? 
(lg_pmps) : (p_aec->i_t1 + lg_pmps);\r\n            bit = !bit;\r\n\r\n            if (s2 == p_aec->i_value_s) {\r\n                p_aec->i_value_t -= t2;\r\n            } else {\r\n                if (aec_get_next_bit(p_aec)) {\r\n                    return 0;\r\n                }\r\n                p_aec->i_value_t += 256 - t2;\r\n            }\r\n\r\n            // restore range\r\n            while (t_rlps < QUARTER) {\r\n                t_rlps <<= 1;\r\n\r\n                if (aec_get_next_bit(p_aec)) {\r\n                    return 0;\r\n                }\r\n            }\r\n\r\n            p_aec->i_s1 = 0;\r\n            p_aec->i_t1 = t_rlps & 0xff;\r\n            update_ctx_lps(p_ctx);\r\n        } else {        // MPS\r\n            p_aec->i_s1 = s2;\r\n            p_aec->i_t1 = t2;\r\n            update_ctx_mps(p_ctx);\r\n        }\r\n    }\r\n\r\n    return i - bit;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * decoding of unary binarization using one or 2 distinct models for the first\r\n * and all remaining bins; no terminating \"0\" for max_symbol\r\n */\r\nstatic int unary_bin_max_decode(aec_t *p_aec, context_t *ctx, int ctx_offset, int max_symbol)\r\n{\r\n    int symbol = biari_decode_symbol(p_aec, ctx);\r\n\r\n    if (symbol == 1) {\r\n        return 0;\r\n    } else {\r\n        if (max_symbol == 1) {\r\n            return symbol;\r\n        } else {\r\n            context_t *p_ctx = ctx + ctx_offset;\r\n\r\n            symbol = 1 + biari_decode_symbol_continue0(p_aec, p_ctx, max_symbol - 1);\r\n\r\n            return symbol;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid aec_init_contexts(aec_t *p_aec)\r\n{\r\n    const uint16_t lg_pmps = ((QUARTER << LG_PMPS_SHIFTNO) - 1);\r\n    uint16_t  v = MAKE_CONTEXT(lg_pmps, 0, 0);\r\n    uint16_t *d = (uint16_t *)&p_aec->syn_ctx;\r\n    int ctx_cnt = sizeof(context_set_t) / 
sizeof(uint16_t);\r\n\r\n    while (ctx_cnt-- != 0) {\r\n        *d++ = v;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid aec_new_slice(davs2_t *h)\r\n{\r\n    h->i_last_dquant = 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint aec_read_dmh_mode(aec_t *p_aec, int i_cu_level)\r\n{\r\n    context_t *p_ctx = p_aec->syn_ctx.pu_type_index + (i_cu_level - 3) * 3 + NUM_INTER_DIR_DHP_CTX;\r\n\r\n    assert(NUM_INTER_DIR_DHP_CTX + NUM_DMH_MODE_CTX == NUM_INTER_DIR_CTX);\r\n\r\n    if (biari_decode_symbol(p_aec, p_ctx) == 0) {\r\n        return 0;\r\n    } else {\r\n        if (biari_decode_symbol(p_aec, p_ctx + 1) == 0) {\r\n            return 3 + biari_decode_symbol_eq_prob(p_aec);    // 3, 4: bin string 10x\r\n        } else {\r\n            if (biari_decode_symbol(p_aec, p_ctx + 2) == 0) {\r\n                return 7 + biari_decode_symbol_eq_prob(p_aec);    // 7, 8: bin string 110x\r\n            } else {\r\n                /* 1, 2: bin string 1110x\r\n                 * 5, 6: bin string 1111x\r\n                 */\r\n                int b3 = biari_decode_symbol_eq_prob(p_aec);\r\n                int b4 = biari_decode_symbol_eq_prob(p_aec);\r\n                return 1 + (b3 << 2) + b4;\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode the motion vector difference\r\n */\r\nstatic INLINE\r\nint aec_read_mvd(aec_t *p_aec, context_t *p_ctx)\r\n{\r\n    int binary_symbol = 0;\r\n    int golomb_order = 0;\r\n    int act_sym;\r\n\r\n    if (!biari_decode_symbol(p_aec, p_ctx + 0)) {\r\n        act_sym = 0;\r\n    } else if (!biari_decode_symbol(p_aec, p_ctx + 1)) {\r\n        act_sym = 1;\r\n    } else if (!biari_decode_symbol(p_aec, p_ctx + 2)) {\r\n        act_sym = 2;\r\n    } else {   // 1110\r\n        int add_sym = biari_decode_symbol_eq_prob(p_aec);\r\n        act_sym = 0;\r\n\r\n 
       for (;;) {\r\n            int l = biari_decode_symbol_eq_prob(p_aec);\r\n\r\n            AEC_RETURN_ON_ERROR(0);\r\n\r\n            if (l == 0) {\r\n                act_sym += (1 << golomb_order);\r\n                golomb_order++;\r\n            } else {\r\n                break;\r\n            }\r\n        }\r\n\r\n        while (golomb_order--) {\r\n            // next binary part\r\n            if (biari_decode_symbol_eq_prob(p_aec)) {\r\n                binary_symbol |= (1 << golomb_order);\r\n            }\r\n        }\r\n\r\n        act_sym += binary_symbol;\r\n        act_sym = (act_sym << 1) + 3 + add_sym;\r\n    }\r\n\r\n    if (act_sym != 0) {\r\n        if (biari_decode_symbol_eq_prob(p_aec)) {\r\n            act_sym = -act_sym;\r\n        }\r\n    }\r\n\r\n    return act_sym;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode both components (x, y) of the motion vector difference\r\n */\r\nvoid aec_read_mvds(aec_t *p_aec, mv_t *p_mvd)\r\n{\r\n    p_mvd->x = (int16_t)aec_read_mvd(p_aec, p_aec->syn_ctx.mvd_contexts[0]);\r\n    p_mvd->y = (int16_t)aec_read_mvd(p_aec, p_aec->syn_ctx.mvd_contexts[1]);\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode the weighted skip mode index\r\n */\r\nstatic INLINE\r\nint aec_read_wpm(aec_t *p_aec, int num_of_references)\r\n{\r\n    context_t *p_ctx = p_aec->syn_ctx.weighted_skip_mode;\r\n    return biari_decode_symbol_continu0_ext(p_aec, p_ctx, 2, num_of_references - 1);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nint aec_read_dir_skip_mode(aec_t *p_aec)\r\n{\r\n    context_t *p_ctx = p_aec->syn_ctx.cu_subtype_index;\r\n    int act_sym = biari_decode_symbol_continu0_ext(p_aec, p_ctx, 32768, 3);\r\n    if (act_sym == 3) {\r\n        act_sym += (!biari_decode_symbol(p_aec, p_ctx + 3));\r\n    }\r\n    return act_sym;\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n * TU split type when TU split is enabled for current CU */\r\nstatic ALWAYS_INLINE int cu_set_tu_split_type(davs2_t *h, cu_t *p_cu, int transform_split_flag)\r\n{\r\n    // split types\r\n    // [mode][(NSQT enable or SDIP enables) and cu_level > B8X8_IN_BIT]\r\n    //  split_type for block non-SDIP/NSQT:[0] and SDIP/NSQT:[1]\r\n    static const int8_t TU_SPLIT_TYPE[MAX_PRED_MODES][2] = {\r\n        { TU_SPLIT_CROSS,   TU_SPLIT_CROSS   }, // 0: 8x8, ---, ---, --- (PRED_SKIP   )\r\n        { TU_SPLIT_CROSS,   TU_SPLIT_CROSS   }, // 1: 8x8, ---, ---, --- (PRED_2Nx2N  )\r\n        { TU_SPLIT_CROSS,   TU_SPLIT_HOR     }, // 2: 8x4, 8x4, ---, --- (PRED_2NxN   )\r\n        { TU_SPLIT_CROSS,   TU_SPLIT_VER     }, // 3: 4x8, 4x8, ---, --- (PRED_Nx2N   )\r\n        { TU_SPLIT_CROSS,   TU_SPLIT_HOR     }, // 4: 8x2, 8x6, ---, --- (PRED_2NxnU  )\r\n        { TU_SPLIT_CROSS,   TU_SPLIT_HOR     }, // 5: 8x6, 8x2, ---, --- (PRED_2NxnD  )\r\n        { TU_SPLIT_CROSS,   TU_SPLIT_VER     }, // 6: 2x8, 6x8, ---, --- (PRED_nLx2N  )\r\n        { TU_SPLIT_CROSS,   TU_SPLIT_VER     }, // 7: 6x8, 2x8, ---, --- (PRED_nRx2N  )\r\n        { TU_SPLIT_NON,     TU_SPLIT_INVALID }, // 8: 8x8, ---, ---, --- (PRED_I_2Nx2N)\r\n        { TU_SPLIT_CROSS,   TU_SPLIT_CROSS   }, // 9: 4x4, 4x4, 4x4, 4x4 (PRED_I_NxN  )\r\n        { TU_SPLIT_INVALID, TU_SPLIT_HOR     }, //10: 8x2, 8x2, 8x2, 8x2 (PRED_I_2Nxn )\r\n        { TU_SPLIT_INVALID, TU_SPLIT_VER     }  //11: 2x8, 2x8, 2x8, 2x8 (PRED_I_nx2N )\r\n    };\r\n    int mode = p_cu->i_cu_type;\r\n    int level = p_cu->i_cu_level;\r\n    int enable_nsqt_sdip = IS_INTRA_MODE(mode) ? h->seq_info.enable_sdip : h->seq_info.enable_nsqt;\r\n\r\n    enable_nsqt_sdip = enable_nsqt_sdip && level > B8X8_IN_BIT;\r\n    p_cu->i_trans_size = transform_split_flag ? 
TU_SPLIT_TYPE[mode][enable_nsqt_sdip] : TU_SPLIT_NON;\r\n    assert(p_cu->i_trans_size != TU_SPLIT_INVALID);\r\n\r\n    return p_cu->i_trans_size;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint aec_read_intra_cu_type(aec_t *p_aec, cu_t *p_cu, int b_sdip, davs2_t *h)\r\n{\r\n    int cu_type = PRED_I_NxN;\r\n    int b_tu_split = 0;\r\n    b_sdip = (p_cu->i_cu_level == B32X32_IN_BIT || p_cu->i_cu_level == B16X16_IN_BIT) && b_sdip;\r\n\r\n    /* 1, read intra cu split flag */\r\n    if (p_cu->i_cu_level == B8X8_IN_BIT || b_sdip) {\r\n        context_t * p_ctx = p_aec->syn_ctx.transform_split_flag;\r\n\r\n        b_tu_split = biari_decode_symbol(p_aec, p_ctx + 1 + b_sdip);\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    avs2_trace(\"Transform_Size = %3d \\n\", b_tu_split);\r\n#endif\r\n\r\n    /* 2, read intra CU partition type */\r\n    if (!b_tu_split) {\r\n        cu_type = PRED_I_2Nx2N;\r\n    } else if (b_sdip) {\r\n        context_t * p_ctx = p_aec->syn_ctx.intra_pu_type_contexts;\r\n        int symbol1 = biari_decode_symbol(p_aec, p_ctx);\r\n        cu_type = symbol1 ? 
PRED_I_2Nxn : PRED_I_nx2N;\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    avs2_trace_string(\"cuType\", cu_type, 1);\r\n#endif\r\n\r\n    p_cu->i_cu_type = (int8_t)cu_type;\r\n    cu_set_tu_split_type(h, p_cu, b_tu_split);\r\n\r\n    return cu_type;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode the coding unit type info of a given CU\r\n */\r\nint aec_read_cu_type(aec_t *p_aec, cu_t *p_cu, int img_type, int b_amp, int b_mhp, int b_wsm, int num_references)\r\n{\r\n    // 0: SKIP, 1: 2Nx2N, 2: 2NxN / 2NxnU / 2NxnD, 3: Nx2N / nLx2N / nRx2N, 9: INTRA\r\n    static const int MAP_CU_TYPE[2][7] = {\r\n        {-1, 0, 1, 2, 3, -1/*PRED_NxN*/, PRED_I_NxN},\r\n        {-1, 0, 1, 2, 3, PRED_I_NxN}\r\n    };\r\n\r\n    int real_cu_type;\r\n\r\n    if (img_type != AVS2_I_SLICE) {\r\n        context_t *p_ctx = p_aec->syn_ctx.cu_type_contexts;\r\n        int bin_idx = 0;\r\n        int act_ctx = 0;\r\n        int act_sym = 0;\r\n        int max_bit = 6 - (p_cu->i_cu_level == B8X8_IN_BIT);\r\n        int symbol;\r\n\r\n        while (act_sym < max_bit) {\r\n            if ((bin_idx == 5) && (p_cu->i_cu_level != MIN_CU_SIZE_IN_BIT)) {\r\n                symbol = biari_decode_final(p_aec);\r\n            } else {\r\n                symbol = biari_decode_symbol(p_aec, p_ctx + act_ctx);\r\n            }\r\n\r\n            AEC_RETURN_ON_ERROR(-1);\r\n            bin_idx++;\r\n\r\n            if (symbol == 0) {\r\n                act_sym++;\r\n                act_ctx = DAVS2_MIN(5, act_ctx + 1);\r\n            } else {\r\n                break;\r\n            }\r\n        }\r\n\r\n        real_cu_type = MAP_CU_TYPE[p_cu->i_cu_level == B8X8_IN_BIT][act_sym];\r\n\r\n        // for AMP\r\n        if (p_cu->i_cu_level >= B16X16_IN_BIT && b_amp && (real_cu_type == 2 || real_cu_type == 3)) {\r\n            context_t *p_ctx_amp = p_aec->syn_ctx.shape_of_partition_index;\r\n            if (!biari_decode_symbol(p_aec, p_ctx_amp + 
0)) {\r\n                real_cu_type = real_cu_type * 2 + (!biari_decode_symbol(p_aec, p_ctx_amp + 1));\r\n            }\r\n        }\r\n    } else {\r\n        real_cu_type = PRED_I_NxN;     /* intra mode */\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    {\r\n        int trace_cu_type = real_cu_type;\r\n        if (trace_cu_type == PRED_I_2Nxn || trace_cu_type == PRED_I_nx2N) {\r\n            trace_cu_type += 2;             /* in order to trace same text as RM */\r\n        }\r\n\r\n        trace_cu_type += (img_type == AVS2_B_SLICE);    /* also here */\r\n        avs2_trace_string(\"cuType\", trace_cu_type, 1);\r\n    }\r\n#endif\r\n\r\n    if (real_cu_type <= 0) {    /* Skip Mode */\r\n        int weighted_skipmode_fix = 0;\r\n        int md_directskip_mode    = DS_NONE;\r\n\r\n        if (img_type == AVS2_F_SLICE && b_wsm && num_references > 1) {\r\n            weighted_skipmode_fix = aec_read_wpm(p_aec, num_references);\r\n#if AVS2_TRACE\r\n            avs2_trace(\"weighted_skipmode1 = %3d \\n\", weighted_skipmode_fix);\r\n#endif\r\n        }\r\n        p_cu->i_weighted_skipmode = (int8_t)weighted_skipmode_fix;\r\n\r\n        if ((weighted_skipmode_fix == 0) &&\r\n            ((b_mhp && img_type == AVS2_F_SLICE) || img_type == AVS2_B_SLICE)) {\r\n            md_directskip_mode = aec_read_dir_skip_mode(p_aec);\r\n#if AVS2_TRACE\r\n            avs2_trace(\"p_directskip_mode = %3d \\n\", md_directskip_mode);\r\n#endif\r\n        } else {\r\n            md_directskip_mode = DS_NONE;\r\n        }\r\n\r\n        p_cu->i_md_directskip_mode = (int8_t)md_directskip_mode;\r\n    }\r\n\r\n    return real_cu_type;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint aec_read_cu_type_sframe(aec_t *p_aec)\r\n{\r\n    static const int MapSCUType[7] = {-1, PRED_SKIP, PRED_I_NxN};\r\n    context_t * p_ctx = p_aec->syn_ctx.cu_type_contexts;\r\n    int act_ctx = 0;\r\n    int cu_type = 0;\r\n\r\n    for (;;) {\r\n        if 
(biari_decode_symbol(p_aec, p_ctx + act_ctx) == 0) {\r\n            cu_type++;\r\n            act_ctx++;\r\n        } else {\r\n            break;\r\n        }\r\n\r\n        if (cu_type >= 2) {\r\n            break;\r\n        }\r\n    }\r\n\r\n    cu_type = MapSCUType[cu_type];    /* cu type */\r\n#if AVS2_TRACE\r\n    avs2_trace_string(\"cuType\", cu_type, 1);\r\n#endif\r\n\r\n    return cu_type;       /* return cu type */\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nint aec_read_b_pdir(aec_t * p_aec, cu_t * p_cu)\r\n{\r\n    static const int dir2offset[4][4] = {\r\n        {  0,  2,  4,  9 },\r\n        {  3,  1,  5, 10 },\r\n        {  6,  7,  8, 11 },\r\n        { 12, 13, 14, 15 }\r\n    };\r\n\r\n    int new_pdir[4] = { 3, 1, 0, 2 };\r\n    context_t *p_ctx = p_aec->syn_ctx.pu_type_index;\r\n    int act_ctx = 0;\r\n    int act_sym = 0;\r\n    int pdir    = PDIR_FWD;\r\n    int pdir0 = 0, pdir1 = 0;\r\n    int symbol;\r\n\r\n    if (p_cu->i_cu_type == PRED_2Nx2N) {\r\n        /* act_ctx: 0, 1, 2 */\r\n        act_sym = biari_decode_symbol_continu0_ext(p_aec, p_ctx, 32768, 2);\r\n        if (act_sym == 2) {\r\n            act_sym += (!biari_decode_symbol(p_aec, p_ctx + 2));\r\n        }\r\n        pdir = act_sym;\r\n    } else if ((p_cu->i_cu_type >= PRED_2NxN && p_cu->i_cu_type <= PRED_nRx2N) && p_cu->i_cu_level == B8X8_IN_BIT) {\r\n        p_ctx = p_aec->syn_ctx.b_pu_type_min_index;\r\n        pdir0 = !biari_decode_symbol(p_aec, p_ctx + act_ctx);  // BW\r\n\r\n        if (biari_decode_symbol(p_aec, p_ctx + act_ctx + 1)) {\r\n            pdir1 = pdir0;\r\n        } else {\r\n            pdir1 = !pdir0;\r\n        }\r\n\r\n        pdir = dir2offset[pdir0][pdir1];\r\n    } else if (p_cu->i_cu_type >= PRED_2NxN && p_cu->i_cu_type <= PRED_nRx2N) {\r\n        /* act_ctx: 3, 4 */\r\n        act_sym = biari_decode_symbol_continu0_ext(p_aec, p_ctx + 3, 32768, 2);\r\n\r\n        /* act_ctx: 5 
*/\r\n        if (act_sym == 2) {\r\n            act_sym += (!biari_decode_symbol(p_aec, p_ctx + 5));\r\n        }\r\n        pdir0 = act_sym;\r\n\r\n        if (biari_decode_symbol(p_aec, p_ctx + 6)) {\r\n            pdir1 = pdir0;\r\n        } else {\r\n            switch (pdir0) {\r\n            case 0:\r\n                if (biari_decode_symbol(p_aec, p_ctx + 7)) {\r\n                    pdir1 = 1;\r\n                } else {\r\n                    symbol = biari_decode_symbol(p_aec, p_ctx + 8);\r\n                    pdir1 = symbol ? 2 : 3;\r\n                }\r\n\r\n                break;\r\n            case 1:\r\n                if (biari_decode_symbol(p_aec, p_ctx + 9)) {\r\n                    pdir1 = 0;\r\n                } else {\r\n                    symbol = biari_decode_symbol(p_aec, p_ctx + 10);\r\n                    pdir1 = symbol ? 2 : 3;\r\n                }\r\n\r\n                break;\r\n            case 2:\r\n                if (biari_decode_symbol(p_aec, p_ctx + 11)) {\r\n                    pdir1 = 0;\r\n                } else {\r\n                    symbol = biari_decode_symbol(p_aec, p_ctx + 12);\r\n                    pdir1 = symbol ? 1 : 3;\r\n                }\r\n\r\n                break;\r\n            case 3:\r\n                if (biari_decode_symbol(p_aec, p_ctx + 13)) {\r\n                    pdir1 = 0;\r\n                } else {\r\n                    symbol = biari_decode_symbol(p_aec, p_ctx + 14);\r\n                    pdir1 = symbol ? 
1 : 2;\r\n                }\r\n\r\n                break;\r\n            }\r\n        }\r\n\r\n        pdir0 = new_pdir[pdir0];\r\n        pdir1 = new_pdir[pdir1];\r\n        pdir  = dir2offset[pdir0][pdir1];\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    if (p_cu->i_cu_type >= PRED_2NxN && p_cu->i_cu_type <= PRED_nRx2N) {\r\n        avs2_trace_string(\"B_Pred_Dir0 \", pdir0, 1);\r\n        avs2_trace_string(\"B_Pred_Dir1 \", pdir1, 1);\r\n    } else if (p_cu->i_cu_type == PRED_2Nx2N) {\r\n        avs2_trace_string(\"B_Pred_Dir \", pdir0, 1);\r\n    }\r\n#endif\r\n\r\n    return pdir;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode the PU type\r\n */\r\nstatic INLINE\r\nint aec_read_pdir_dhp(aec_t * p_aec, cu_t * p_cu)\r\n{\r\n    static const int dir2offset[2][2] = {\r\n        { 0, 1 },\r\n        { 2, 3 }\r\n    };\r\n\r\n    context_t *p_ctx = p_aec->syn_ctx.pu_type_index;\r\n    int pdir = PDIR_FWD;\r\n    int pdir0, pdir1;\r\n    int symbol;\r\n\r\n    if (p_cu->i_cu_type == PRED_2Nx2N) {\r\n        pdir = pdir0 = biari_decode_symbol(p_aec, p_ctx);\r\n    } else if (p_cu->i_cu_type >= PRED_2NxN && p_cu->i_cu_type <= PRED_nRx2N) {\r\n        pdir0 = biari_decode_symbol(p_aec, p_ctx + 1);\r\n\r\n        symbol = biari_decode_symbol(p_aec, p_ctx + 2);\r\n        if (symbol) {\r\n            pdir1 = pdir0;\r\n        } else {\r\n            pdir1 = 1 - pdir0;\r\n        }\r\n\r\n        pdir = dir2offset[pdir0][pdir1];\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    if (p_cu->i_cu_type >= PRED_2NxN && p_cu->i_cu_type <= PRED_nRx2N) {\r\n        avs2_trace_string(\"P_Pred_Dir0 \", pdir0, 1);\r\n        avs2_trace_string(\"P_Pred_Dir1 \", pdir1, 1);\r\n    } else if (p_cu->i_cu_type == PRED_2Nx2N) {\r\n        avs2_trace_string(\"P_Pred_Dir \", pdir0, 1);\r\n    }\r\n#endif\r\n\r\n    return pdir;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * set CU prediction 
direction for P/F-Frames\r\n */\r\nstatic INLINE void cu_set_pdir_PFframe(cu_t *p_cu, int pdir)\r\n{\r\n    static const int8_t pdir0[4] = { PDIR_FWD, PDIR_FWD, PDIR_DUAL, PDIR_DUAL };\r\n    static const int8_t pdir1[4] = { PDIR_FWD, PDIR_DUAL, PDIR_FWD, PDIR_DUAL };\r\n    int i_cu_type = p_cu->i_cu_type;\r\n    int i;\r\n\r\n    if (i_cu_type == PRED_2Nx2N) { // 16x16\r\n        /* no PU split: 1 PU; entries [2/3] are assigned too, for the DMH mode check */\r\n        pdir = (pdir == PDIR_FWD ? PDIR_FWD : PDIR_DUAL);\r\n        for (i = 0; i < 4; i++) {\r\n            p_cu->b8pdir[i] = (int8_t)pdir;\r\n        }\r\n    } else if (IS_HOR_PU_PART(i_cu_type)) { // horizontal: 16x8, 16x4, 16x12\r\n        /* horizontal PU split: 2 PUs; entries [2/3] are assigned too, for the DMH mode check */\r\n        p_cu->b8pdir[0] = p_cu->b8pdir[2] = pdir0[pdir];\r\n        p_cu->b8pdir[1] = p_cu->b8pdir[3] = pdir1[pdir];\r\n    } else if (IS_VER_PU_PART(i_cu_type)) { // vertical:\r\n        /* vertical PU split: 2 PUs; entries [2/3] are assigned too, for the DMH mode check */\r\n        p_cu->b8pdir[0] = p_cu->b8pdir[2] = pdir0[pdir];\r\n        p_cu->b8pdir[1] = p_cu->b8pdir[3] = pdir1[pdir];\r\n    } else {  /* intra mode */\r\n        for (i = 0; i < 4; i++) {\r\n            p_cu->b8pdir[i] = PDIR_INVALID;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * set CU prediction direction for B-Frames\r\n */\r\nstatic INLINE void cu_set_pdir_Bframe(cu_t *p_cu, int pdir)\r\n{\r\n    static const int8_t pdir0[16] = { PDIR_FWD, PDIR_BWD, PDIR_FWD, PDIR_BWD, PDIR_FWD, PDIR_BWD, PDIR_SYM, PDIR_SYM, PDIR_SYM, PDIR_FWD, PDIR_BWD, PDIR_SYM, PDIR_BID, PDIR_BID, PDIR_BID, PDIR_BID };\r\n    static const int8_t pdir1[16] = { PDIR_FWD, PDIR_BWD, PDIR_BWD, PDIR_FWD, PDIR_SYM, PDIR_SYM, PDIR_FWD, PDIR_BWD, PDIR_SYM, PDIR_BID, PDIR_BID, PDIR_BID, PDIR_FWD, PDIR_BWD, PDIR_SYM, PDIR_BID };\r\n    static const int8_t pdir2refidx[4][2] = {\r\n        { B_FWD, INVALID_REF },  // PDIR_FWD\r\n        { INVALID_REF, B_BWD },  // PDIR_BWD\r\n        { B_FWD, B_BWD },\r\n        { B_FWD, B_BWD 
}\r\n    };\r\n    int i_cu_type = p_cu->i_cu_type;\r\n    int8_t *b8pdir = p_cu->b8pdir;\r\n    int i;\r\n\r\n    //--- set b8type, and b8pdir ---\r\n    if (i_cu_type == PRED_SKIP) {   // direct\r\n        /* Skip mode: 1 or 4 PUs, all sharing one direction */\r\n        pdir = tab_pdir_bskip[p_cu->i_md_directskip_mode];\r\n        for (i = 0; i < 4; i++) {\r\n            b8pdir[i] = (int8_t)pdir;\r\n        }\r\n    } else if (i_cu_type == PRED_2Nx2N) { // 16x16\r\n        /* no PU split: 1 PU */\r\n        for (i = 0; i < 4; i++) {\r\n            b8pdir[i] = (int8_t)pdir;\r\n        }\r\n    } else if (IS_HOR_PU_PART(i_cu_type)) { // 16x8, 16x4, 16x12\r\n        /* horizontal PU split: 2 PUs */\r\n        b8pdir[0] = b8pdir[2] = pdir0[pdir];\r\n        b8pdir[1] = b8pdir[3] = pdir1[pdir];\r\n    } else if (IS_VER_PU_PART(i_cu_type)) {\r\n        /* vertical PU split: 2 PUs */\r\n        b8pdir[0] = b8pdir[2] = pdir0[pdir];\r\n        b8pdir[1] = b8pdir[3] = pdir1[pdir];\r\n    } else {  // intra mode\r\n        for (i = 0; i < 4; i++) {\r\n            b8pdir[i] = PDIR_INVALID;\r\n        }\r\n    }\r\n\r\n    for (i = 0; i < 4; i++) {\r\n        const int8_t *p_idx = pdir2refidx[b8pdir[i]];\r\n        p_cu->ref_idx[i].r[0] = p_idx[0];\r\n        p_cu->ref_idx[i].r[1] = p_idx[1];\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode the reference parameter of a given MB\r\n */\r\nstatic INLINE\r\nint aec_read_ref_frame(aec_t *p_aec, int num_of_references)\r\n{\r\n    context_t *p_ctx = p_aec->syn_ctx.pu_reference_index;\r\n    int act_sym;\r\n\r\n    if (biari_decode_symbol(p_aec, p_ctx)) {\r\n        act_sym = 0;\r\n    } else {\r\n        int act_ctx = 1;\r\n        act_sym = 1;\r\n\r\n        // TODO: this loop could be optimized further\r\n        while ((act_sym != num_of_references - 1) && (!biari_decode_symbol(p_aec, p_ctx + act_ctx))) {\r\n            act_sym++;\r\n            act_ctx = DAVS2_MIN(2, act_ctx + 1);\r\n        }\r\n    }\r\n\r\n    return act_sym;\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nint cu_read_references(davs2_t *h, aec_t *p_aec, cu_t *p_cu)\r\n{\r\n    int idx_pu;\r\n    int num_pu = p_cu->i_cu_type == PRED_2Nx2N ? 1 : 2;\r\n\r\n    //  If multiple ref. frames, read reference frame for the MB *********************************\r\n    for (idx_pu = 0; idx_pu < num_pu; idx_pu++) {\r\n        int8_t ref_1st, ref_2nd;\r\n        // non skip (direct)\r\n        assert(p_cu->b8pdir[idx_pu] == PDIR_FWD || p_cu->b8pdir[idx_pu] == PDIR_DUAL);\r\n        if (h->num_of_references > 1) {\r\n            ref_1st = (int8_t)aec_read_ref_frame(p_aec, h->num_of_references);\r\n            AEC_RETURN_ON_ERROR(-1);\r\n#if AVS2_TRACE\r\n            avs2_trace(\"Fwd Ref frame no  = %3d \\n\", ref_1st);\r\n#endif\r\n        } else {\r\n            ref_1st = 0;\r\n        }\r\n\r\n        if (p_cu->b8pdir[idx_pu] == PDIR_DUAL) {\r\n            ref_2nd = !ref_1st;\r\n        } else {\r\n            ref_2nd = INVALID_REF;\r\n        }\r\n\r\n        p_cu->ref_idx[idx_pu].r[0] = ref_1st;\r\n        p_cu->ref_idx[idx_pu].r[1] = ref_2nd;\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid aec_read_inter_pred_dir(aec_t * p_aec, cu_t *p_cu, davs2_t *h)\r\n{\r\n    int pdir = PDIR_FWD;\r\n    int real_cu_type = p_cu->i_cu_type;\r\n\r\n    if ((h->i_frame_type == AVS2_B_SLICE)) {  // B frame\r\n        if (real_cu_type >= PRED_2Nx2N && real_cu_type <= PRED_nRx2N) {\r\n            pdir = aec_read_b_pdir(p_aec, p_cu);\r\n        }\r\n        cu_set_pdir_Bframe(p_cu, pdir);\r\n    } else {  // other Inter frame\r\n        if (IS_SKIP_MODE(real_cu_type)) {\r\n            int i;\r\n            if (p_cu->i_weighted_skipmode || \r\n                p_cu->i_md_directskip_mode == DS_DUAL_1ST || \r\n                p_cu->i_md_directskip_mode == DS_DUAL_2ND) {\r\n                pdir = PDIR_DUAL;\r\n    
        }\r\n            for (i = 0; i < 4; i++) {\r\n                p_cu->b8pdir[i] = (int8_t)pdir;\r\n            }\r\n        } else {\r\n            if (h->i_frame_type == AVS2_F_SLICE && h->num_of_references > 1 && h->seq_info.enable_dhp) {\r\n                if (!(p_cu->i_cu_level == B8X8_IN_BIT && real_cu_type >= PRED_2NxN && real_cu_type <= PRED_nRx2N)) {\r\n                    pdir = aec_read_pdir_dhp(p_aec, p_cu);\r\n                }\r\n            }\r\n            cu_set_pdir_PFframe(p_cu, pdir);\r\n        }\r\n\r\n        if (h->i_frame_type != AVS2_S_SLICE && p_cu->i_cu_type != PRED_SKIP) {\r\n            cu_read_references(h, p_aec, p_cu);\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode a pair of intra prediction modes of a given MB\r\n */\r\nint aec_read_intra_pmode(aec_t * p_aec)\r\n{\r\n    context_t * p_ctx = p_aec->syn_ctx.intra_luma_pred_mode;\r\n    int symbol;\r\n\r\n    if (biari_decode_symbol(p_aec, p_ctx) == 1) {\r\n        symbol = biari_decode_symbol(p_aec, p_ctx + 6) - 2;\r\n    } else {\r\n        symbol  = biari_decode_symbol(p_aec, p_ctx + 1) << 4;\r\n        symbol += biari_decode_symbol(p_aec, p_ctx + 2) << 3;\r\n        symbol += biari_decode_symbol(p_aec, p_ctx + 3) << 2;\r\n        symbol += biari_decode_symbol(p_aec, p_ctx + 4) << 1;\r\n        symbol += biari_decode_symbol(p_aec, p_ctx + 5);\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    avs2_trace(\"@%d %s\\t\\t\\t%d\\n\", symbolCount++, p_aec->tracestring, symbol);\r\n#endif\r\n\r\n    return symbol;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode the delta qp of a given CU\r\n */\r\nstatic INLINE\r\nint aec_read_cu_delta_qp(aec_t * p_aec, int i_last_dequant)\r\n{\r\n    context_t * p_ctx = p_aec->syn_ctx.delta_qp_contexts;\r\n    int act_sym;\r\n    int dquant;\r\n\r\n    act_sym = 1 - biari_decode_symbol(p_aec, p_ctx + 
(!!i_last_dequant));\r\n    if (act_sym != 0) {\r\n        act_sym = unary_bin_max_decode(p_aec, p_ctx + 2, 1, 256) + 1;\r\n    }\r\n\r\n    /* cu_qp_delta range: [-32 - 4 * (BitDepth - 8), 32 + 4 * (BitDepth - 8)] */\r\n    dquant = (act_sym + 1) >> 1;\r\n    if ((act_sym & 0x01) == 0) {    // LSB is the sign bit\r\n        dquant = -dquant;\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    avs2_trace(\"@%d %s\\t\\t\\t%d\\n\", symbolCount++, p_aec->tracestring, dquant);\r\n#endif\r\n\r\n    return dquant;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode the ctp_y[i] of a given cu\r\n */\r\nstatic int aec_read_ctp_y(davs2_t *h, aec_t *p_aec, int b8, cu_t *p_cu, int scu_x, int scu_y)\r\n{\r\n    context_t *p_ctx;\r\n    int b_hor   = p_cu->i_trans_size == TU_SPLIT_HOR;  // is current CU hor TU partition\r\n    int b_ver   = p_cu->i_trans_size == TU_SPLIT_VER;  // is current CU ver TU partition\r\n    int i_level = p_cu->i_cu_level;\r\n    int cu_size = 1 << i_level;\r\n    int a = 0, b = 0;                   // ctp_y[i] of neighboring blocks\r\n    int x, y;\r\n\r\n    /* position of the current TB within the CU */\r\n    if (b_hor) {\r\n        x = 0;\r\n        y = ((cu_size * b8) >> 2);\r\n    } else if (b_ver) {\r\n        x = ((cu_size * b8) >> 2);\r\n        y = 0;\r\n    } else {\r\n        x = ((cu_size * (b8  & 1)) >> 1);\r\n        y = ((cu_size * (b8 >> 1)) >> 1);\r\n    }\r\n\r\n    /* position of the TB within the picture */\r\n    x += (scu_x << MIN_CU_SIZE_IN_BIT);\r\n    y += (scu_y << MIN_CU_SIZE_IN_BIT);\r\n    /* convert to 4x4 block units */\r\n    x >>= MIN_PU_SIZE_IN_BIT;\r\n    y >>= MIN_PU_SIZE_IN_BIT;\r\n\r\n    /* get the CTP at the corresponding position of the left neighboring block */\r\n    if (b_ver && b8 > 0) {\r\n        a = (p_cu->i_cbp >> (b8 - 1)) & 1;\r\n    } else {\r\n        a = get_neighbor_cbp_y(h, x - 1, y, scu_x, scu_y, p_cu);\r\n    }\r\n\r\n    /* get the CTP at the corresponding position of the top neighboring block */\r\n    if (b_hor && b8 > 0) {\r\n        b = (p_cu->i_cbp >> (b8 - 1)) & 1;\r\n    } else {\r\n        b = get_neighbor_cbp_y(h, x, y - 1, scu_x, scu_y, p_cu);\r\n  
  }\r\n\r\n    p_ctx = p_aec->syn_ctx.cbp_contexts + a + 2 * b;\r\n\r\n    return biari_decode_symbol(p_aec, p_ctx);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nint aec_read_cbp(aec_t *p_aec, davs2_t *h, cu_t *p_cu, int scu_x, int scu_y)\r\n{\r\n    int cbp = 0;\r\n    int cbp_bit = 0;\r\n\r\n    if (IS_INTER(p_cu)) {\r\n        if (IS_NOSKIP_INTER_MODE(p_cu->i_cu_type)) {\r\n            cbp_bit = biari_decode_symbol(p_aec, p_aec->syn_ctx.cbp_contexts + 8);  // \"ctp_zero_flag\"\r\n        }\r\n        if (cbp_bit == 0) {\r\n            // transform size\r\n            int b_tu_split = biari_decode_symbol(p_aec, p_aec->syn_ctx.transform_split_flag);\r\n            cu_set_tu_split_type(h, p_cu, b_tu_split);\r\n\r\n            // chroma\r\n            if (h->i_chroma_format != CHROMA_400) {\r\n                cbp_bit = biari_decode_symbol(p_aec, p_aec->syn_ctx.cbp_contexts + 4);\r\n                if (cbp_bit) {\r\n                    cbp_bit = biari_decode_symbol(p_aec, p_aec->syn_ctx.cbp_contexts + 5);\r\n\r\n                    if (cbp_bit) {\r\n                        cbp += 48;\r\n                    } else {\r\n                        cbp_bit = biari_decode_symbol(p_aec, p_aec->syn_ctx.cbp_contexts + 5);\r\n                        cbp += (cbp_bit == 1) ? 
32 : 16;\r\n                    }\r\n                }\r\n            }\r\n\r\n            // luma\r\n            if (b_tu_split == 0) {\r\n                if (cbp == 0) {\r\n                    cbp = 1;   // chroma blocks are all zero while ctp_zero_flag signals nonzero coefficients, so luma must be coded\r\n                } else {\r\n                    cbp_bit = aec_read_ctp_y(h, p_aec, 0, p_cu, scu_x, scu_y);\r\n                    cbp    += cbp_bit;\r\n                }\r\n            } else {\r\n                cbp_bit = aec_read_ctp_y(h, p_aec, 0, p_cu, scu_x, scu_y);\r\n                cbp    += cbp_bit;\r\n                p_cu->i_cbp = (int8_t)cbp;\r\n\r\n                cbp_bit = aec_read_ctp_y(h, p_aec, 1, p_cu, scu_x, scu_y);\r\n                cbp    += (cbp_bit << 1);\r\n                p_cu->i_cbp = (int8_t)cbp;\r\n\r\n                cbp_bit = aec_read_ctp_y(h, p_aec, 2, p_cu, scu_x, scu_y);\r\n                cbp    += (cbp_bit << 2);\r\n                p_cu->i_cbp = (int8_t)cbp;\r\n\r\n                cbp_bit = aec_read_ctp_y(h, p_aec, 3, p_cu, scu_x, scu_y);\r\n                cbp    += (cbp_bit << 3);\r\n                p_cu->i_cbp = (int8_t)cbp;\r\n            }\r\n        } else {\r\n            cu_set_tu_split_type(h, p_cu, 1);\r\n            p_cu->i_cbp = 0;\r\n            cbp = 0;\r\n        }\r\n    } else {\r\n        // intra luma\r\n        if (p_cu->i_cu_type == PRED_I_2Nx2N) {\r\n            cbp     = aec_read_ctp_y(h, p_aec, 0, p_cu, scu_x, scu_y);\r\n        } else {\r\n            cbp_bit = aec_read_ctp_y(h, p_aec, 0, p_cu, scu_x, scu_y);\r\n            cbp    += cbp_bit;\r\n            p_cu->i_cbp = (int8_t)cbp;\r\n\r\n            cbp_bit = aec_read_ctp_y(h, p_aec, 1, p_cu, scu_x, scu_y);\r\n            cbp    += (cbp_bit << 1);\r\n            p_cu->i_cbp = (int8_t)cbp;\r\n\r\n            cbp_bit = aec_read_ctp_y(h, p_aec, 2, p_cu, scu_x, scu_y);\r\n            cbp    += (cbp_bit << 2);\r\n            p_cu->i_cbp = (int8_t)cbp;\r\n\r\n            cbp_bit = aec_read_ctp_y(h, p_aec, 3, p_cu, scu_x, 
scu_y);\r\n            cbp    += (cbp_bit << 3);\r\n            p_cu->i_cbp = (int8_t)cbp;\r\n        }\r\n\r\n        // chroma decoding\r\n        if (h->i_chroma_format != CHROMA_400) {\r\n            cbp_bit = biari_decode_symbol(p_aec, p_aec->syn_ctx.cbp_contexts + 6);\r\n            if (cbp_bit) {\r\n                cbp_bit = biari_decode_symbol(p_aec, p_aec->syn_ctx.cbp_contexts + 7);\r\n\r\n                if (cbp_bit) {\r\n                    cbp += 48;\r\n                } else {\r\n                    cbp_bit = biari_decode_symbol(p_aec, p_aec->syn_ctx.cbp_contexts + 7);\r\n                    cbp += 16 << cbp_bit;\r\n                }\r\n            }\r\n        }   // chroma CBP\r\n    }\r\n\r\n    if (!cbp) {\r\n        h->i_last_dquant = 0;\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    avs2_trace(\"@%d %s\\t\\t\\t\\t%d\\n\", symbolCount++, p_aec->tracestring, cbp);\r\n#endif\r\n\r\n    return cbp;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint cu_read_cbp(davs2_t *h, aec_t *p_aec, cu_t *p_cu, int scu_x, int scu_y)\r\n{\r\n#if AVS2_TRACE\r\n    snprintf(p_aec->tracestring, TRACESTRING_SIZE, \"CBP\");\r\n#endif\r\n    p_cu->i_cbp = (int8_t)aec_read_cbp(p_aec, h, p_cu, scu_x, scu_y);    // check: first_mb_nr\r\n\r\n    // delta quant only if nonzero coeffs\r\n    if (h->b_DQP) {\r\n        int i_delta_qp = 0;\r\n        if (p_cu->i_cbp) {\r\n            const int max_delta_qp = 32 + 4 * (h->sample_bit_depth - 8);\r\n            const int min_delta_qp = -max_delta_qp;\r\n#if AVS2_TRACE\r\n            snprintf(p_aec->tracestring, TRACESTRING_SIZE, \"delta quant\");\r\n#endif\r\n            i_delta_qp = (int8_t)aec_read_cu_delta_qp(p_aec, h->i_last_dquant);\r\n            if (i_delta_qp < min_delta_qp ||\r\n                i_delta_qp > max_delta_qp) {\r\n                i_delta_qp = DAVS2_CLIP3(min_delta_qp, max_delta_qp, i_delta_qp);\r\n                davs2_log(h, DAVS2_LOG_ERROR, \"Invalid cu_qp_delta: %d.\", 
i_delta_qp);\r\n            }\r\n        }\r\n\r\n        h->i_last_dquant = i_delta_qp;\r\n        p_cu->i_qp = (int8_t)i_delta_qp + h->lcu.i_left_cu_qp;\r\n    } else {\r\n        p_cu->i_qp = (int8_t)h->i_qp;\r\n    }\r\n\r\n    AEC_RETURN_ON_ERROR(-1);\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * arithmetically decode the chroma intra prediction mode of a given CU\r\n */\r\nint aec_read_intra_pmode_c(aec_t *p_aec, davs2_t *h, int luma_mode)\r\n{\r\n    context_t *p_ctx = p_aec->syn_ctx.intra_chroma_pred_mode;\r\n    int act_ctx      = h->lcu.c_ipred_mode_ctx;\r\n    int lmode        = tab_intra_mode_luma2chroma[luma_mode];\r\n    int is_redundant = lmode >= 0;\r\n    int act_sym;\r\n\r\n    act_sym = !biari_decode_symbol(p_aec, p_ctx + act_ctx);\r\n    if (act_sym != 0) {\r\n        act_sym = unary_bin_max_decode(p_aec, p_ctx + 2, 0, 3) + 1;\r\n        if (is_redundant && act_sym >= lmode) {\r\n            if (act_sym == 4) {\r\n                davs2_log(h, DAVS2_LOG_ERROR, \"Error in intra_chroma_pred_mode. 
(%d, %d) (%d, %d)\", h->lcu.i_pix_x, h->lcu.i_pix_y, h->lcu.i_scu_x, h->lcu.i_scu_y);\r\n                return 4;\r\n            }\r\n\r\n            act_sym++;\r\n        }\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    avs2_trace(\"@%d %s\\t\\t%d\\n\", symbolCount++, p_aec->tracestring, act_sym);\r\n#endif\r\n\r\n    return act_sym;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nstatic INLINE\r\nint aec_read_last_cg_pos(aec_t *p_aec, context_t *p_ctx, cu_t *p_cu,\r\n                         int *CGx, int *CGy, int b_luma, int num_cg, int is_dc_diag,\r\n                         int num_cg_x_minus1, int num_cg_y_minus1)\r\n{\r\n    int last_cg_x = 0;\r\n    int last_cg_y = 0;\r\n    int last_cg_idx = 0;\r\n\r\n    if (b_luma && is_dc_diag) {\r\n        DAVS2_SWAP(num_cg_x_minus1, num_cg_y_minus1);\r\n    }\r\n\r\n    if (num_cg == 4) {  // 8x8\r\n        last_cg_idx = 0;\r\n        last_cg_idx += biari_decode_symbol_continu0_ext(p_aec, p_ctx, 2, 3);\r\n\r\n        if (b_luma && p_cu->i_trans_size == TU_SPLIT_HOR) {\r\n            last_cg_x = last_cg_idx;\r\n            last_cg_y = 0;\r\n        } else if (b_luma && p_cu->i_trans_size == TU_SPLIT_VER) {\r\n            last_cg_x = 0;\r\n            last_cg_y = last_cg_idx;\r\n        } else {\r\n            last_cg_x = last_cg_idx &  1;\r\n            last_cg_y = last_cg_idx >> 1;\r\n        }\r\n    } else { // 16x16 and 32x32\r\n        int last_cg_bit;\r\n\r\n        p_ctx += 3;\r\n        last_cg_bit = biari_decode_symbol(p_aec, p_ctx);\r\n\r\n        if (last_cg_bit == 0) {\r\n            last_cg_x = 0;\r\n            last_cg_y = 0;\r\n            last_cg_idx  = 0;\r\n        } else {\r\n            p_ctx++;\r\n            last_cg_x = biari_decode_symbol_continue0(p_aec, p_ctx, num_cg_x_minus1);\r\n\r\n            p_ctx++;\r\n            if (last_cg_x == 0) {\r\n                if (num_cg_y_minus1 != 1) {\r\n                    last_cg_y = 
biari_decode_symbol_continue0(p_aec, p_ctx, num_cg_y_minus1 - 1);\r\n                }\r\n\r\n                last_cg_y++;\r\n            } else {\r\n                last_cg_y = biari_decode_symbol_continue0(p_aec, p_ctx, num_cg_y_minus1);\r\n            }\r\n        }\r\n\r\n        if (b_luma && is_dc_diag) {\r\n            DAVS2_SWAP(last_cg_x, last_cg_y);\r\n        }\r\n\r\n        if (b_luma && p_cu->i_trans_size == TU_SPLIT_HOR) {\r\n            last_cg_idx = raster2ZZ_2x8[last_cg_y * 8 + last_cg_x];\r\n        } else if (b_luma && p_cu->i_trans_size == TU_SPLIT_VER) {\r\n            last_cg_idx = raster2ZZ_8x2[last_cg_y * 2 + last_cg_x];\r\n        } else if (num_cg == 16) {\r\n            last_cg_idx = raster2ZZ_4x4[last_cg_y * 4 + last_cg_x];\r\n        } else {\r\n            last_cg_idx = raster2ZZ_8x8[last_cg_y * 8 + last_cg_x];\r\n        }\r\n    }\r\n\r\n    *CGx = last_cg_x;\r\n    *CGy = last_cg_y;\r\n    return last_cg_idx;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE \r\nint aec_read_last_coeff_pos_in_cg(aec_t *p_aec, context_t *p_ctx,\r\n                                  int rank, int cg_x, int cg_y, int b_luma, \r\n                                  int b_one_cg, int is_dc_diag)\r\n{\r\n    int xx, yy;\r\n    int offset;\r\n\r\n    /* AVS2-P2: 8.3.3.2.14   determine ctxIdxInc for last_coeff_pos_x and last_coeff_pos_y */\r\n    if (b_luma == 0) {                    // chroma: occupies 12 contexts\r\n        offset = b_one_cg ? 
0 : 4 + (rank == 0) * 4;\r\n    } else if (b_one_cg) {                // Log2TransformSize is 2: occupies 8 contexts\r\n        offset = 40 + is_dc_diag * 4;\r\n    } else if (cg_x != 0 && cg_y != 0) {  // cg_x and cg_y both nonzero: occupies 8 contexts\r\n        offset = 32 + (rank == 0) * 4;\r\n    } else {                              // other positions: occupies 40 contexts\r\n        offset = (4 * (rank == 0) + 2 * (cg_x == 0 && cg_y == 0) + is_dc_diag) * 4;\r\n    }\r\n\r\n    p_ctx += offset;\r\n    xx = biari_decode_symbol_continu0_ext(p_aec, p_ctx, 1, 3);\r\n\r\n    p_ctx += 2;\r\n    yy = biari_decode_symbol_continu0_ext(p_aec, p_ctx, 1, 3);\r\n\r\n    if (cg_x == 0 && cg_y > 0 && is_dc_diag) {\r\n        DAVS2_SWAP(xx, yy);\r\n    }\r\n    if (rank != 0) {\r\n        xx = 3 - xx;\r\n        if (is_dc_diag) {\r\n            yy = 3 - yy;\r\n        }\r\n    }\r\n\r\n    return tab_scan_coeff_pos_in_cg[yy][xx];\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nint get_abssum_of_n_last_coeffs(runlevel_pair_t *p_runlevel, int end_pair_pos, int start_pair_pos)\r\n{\r\n    int absSum5 = 0;\r\n    int n = 0;\r\n    int k;\r\n\r\n    for (k = end_pair_pos - 1; k >= start_pair_pos; k--) {\r\n        n += p_runlevel[k].run;\r\n        if (n >= 6) {\r\n            break;\r\n        }\r\n        absSum5 += DAVS2_ABS(p_runlevel[k].level);\r\n        n++;\r\n    }\r\n\r\n    return absSum5;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\ntypedef int (*aec_read_run_f)(aec_t *p_aec, context_t *p_ctx, int pos, int b_only_one_cg, int b_1st_cg);\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nint aec_read_run_luma1(aec_t *p_aec, context_t *p_ctx, int pos, int b_only_one_cg, int b_1st_cg)\r\n{\r\n    int ctxpos;\r\n    int Run = 0;\r\n    int offset = 0;\r\n\r\n    b_only_one_cg = b_only_one_cg ? 
0 : 4;\r\n\r\n    for (ctxpos = 0; Run != pos; ctxpos++) {\r\n        if (ctxpos < pos) {\r\n            int moddiv; // 012\r\n            moddiv = (tab_scan_4x4[pos - 1 - ctxpos][1] + 1) >> 1;\r\n            offset = (b_1st_cg ? (pos == ctxpos + 1 ? 0 : (1 + moddiv)) : (4 + moddiv)) + b_only_one_cg;  // 0,...,10\r\n        }\r\n\r\n        assert(offset >= 0 && offset < NUM_MAP_CTX);\r\n        if (biari_decode_symbol(p_aec, p_ctx + offset)) {\r\n            break;\r\n        }\r\n\r\n        Run++;\r\n    }\r\n\r\n    return Run;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nint aec_read_run_luma2(aec_t *p_aec, context_t *p_ctx, int pos, int b_only_one_cg, int b_1st_cg)\r\n{\r\n    int ctxpos;\r\n    int Run = 0;\r\n    int offset = 0;\r\n\r\n    b_only_one_cg = b_only_one_cg ? 0 : 4;\r\n    \r\n    for (ctxpos = 0; Run != pos; ctxpos++) {\r\n        if (ctxpos < pos) {\r\n            int moddiv; // 012\r\n            moddiv = ((pos < ctxpos + 4) ? 0 : (pos < ctxpos + 11 ? 1 : 2));\r\n            offset = (b_1st_cg ? (pos == ctxpos + 1 ? 0 : (1 + moddiv)) : (4 + moddiv)) + b_only_one_cg;  // 0,...,10\r\n        }\r\n\r\n        assert(offset >= 0 && offset < NUM_MAP_CTX);\r\n        if (biari_decode_symbol(p_aec, p_ctx + offset)) {\r\n            break;\r\n        }\r\n\r\n        Run++;\r\n    }\r\n\r\n    return Run;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nint aec_read_run_chroma(aec_t *p_aec, context_t *p_ctx, int pos, int b_only_one_cg, int b_1st_cg)\r\n{\r\n    int ctxpos;\r\n    int Run = 0;\r\n    int offset = 0;\r\n\r\n    b_only_one_cg = b_only_one_cg ? 0 : 3;\r\n\r\n    for (ctxpos = 0; Run != pos; ctxpos++) {\r\n        if (ctxpos < pos) {\r\n            int moddiv = (pos >= 6 + ctxpos);\r\n            offset = (b_1st_cg ? (pos == ctxpos + 1 ? 
0 : (1 + moddiv)) : (3 + moddiv)) + b_only_one_cg;\r\n        }\r\n\r\n        assert(offset >= 0 && offset < NUM_MAP_CTX);\r\n        if (biari_decode_symbol(p_aec, p_ctx + offset)) {\r\n            break;\r\n        }\r\n\r\n        Run++;\r\n    }\r\n\r\n    return Run;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nint aec_read_run_level(aec_t *p_aec, cu_t *p_cu, int num_cg, int b_luma, int is_dc_diag,\r\n                       runlevel_t *runlevel, int scale, int shift)\r\n{\r\n    static const int numOfCoeffInCG = 16;\r\n    const int add = (1 << (shift - 1));\r\n\r\n    //--- read coefficients for whole block ---\r\n    const int16_t(*tab_cg_scan)[2]         = runlevel->cg_scan;\r\n    context_t(*ctxa_run)[NUM_MAP_CTX]      = runlevel->p_ctx_run;\r\n    context_t *p_ctx_level                 = runlevel->p_ctx_level;\r\n    context_t *p_ctx_nonzero_cg_flag       = runlevel->p_ctx_sig_cg;\r\n    context_t *p_ctx_last_cg_pos           = runlevel->p_ctx_last_cg;\r\n    context_t *p_ctx_last_pos_in_cg        = runlevel->p_ctx_last_pos_in_cg;\r\n    runlevel_pair_t *p_runlevel            = runlevel->run_level;\r\n    int idx_cg;\r\n    int cg_pos = 0;\r\n    int CGx = 0;\r\n    int CGy = 0;\r\n    int b_only_one_cg = (num_cg == 1);\r\n    int8_t dct_pattern = DCT_QUAD;\r\n    int w_tr_half, w_tr_quad; // CG position limitation\r\n    int h_tr_half, h_tr_quad; // CG position limitation\r\n    int w_tr = runlevel->w_tr;\r\n    int h_tr = runlevel->h_tr;\r\n#if AVS2_TRACE\r\n    int idx_runlevel = 0;\r\n#endif\r\n    int rank = 0;\r\n    aec_read_run_f f_read_run = b_luma ? (!is_dc_diag ? 
aec_read_run_luma1 : aec_read_run_luma2) : aec_read_run_chroma;\r\n\r\n    /* dct_pattern_e */\r\n    if (w_tr == h_tr) {\r\n        w_tr_half = w_tr >> 1;\r\n        h_tr_half = h_tr >> 1;\r\n        w_tr_quad = w_tr >> 2;\r\n        h_tr_quad = h_tr >> 2;\r\n    } else if (w_tr > h_tr) {\r\n        w_tr_half = w_tr >> 1;\r\n        h_tr_half = h_tr >> 0;\r\n        w_tr_quad = w_tr >> 2;\r\n        h_tr_quad = h_tr >> 0;\r\n    } else {\r\n        w_tr_half = w_tr >> 0;\r\n        h_tr_half = h_tr >> 1;\r\n        w_tr_quad = w_tr >> 0;\r\n        h_tr_quad = h_tr >> 2;\r\n    }\r\n    /* convert the boundary positions to CG units */\r\n    w_tr_half >>= 2;\r\n    h_tr_half >>= 2;\r\n    w_tr_quad >>= 2;\r\n    h_tr_quad >>= 2;\r\n\r\n    /* 1, read last CG position */\r\n    if (num_cg > 1) {\r\n        int num_cg_x_minus1 = tab_cg_scan[num_cg - 1][0];\r\n        int num_cg_y_minus1 = tab_cg_scan[num_cg - 1][1];\r\n        cg_pos = aec_read_last_cg_pos(p_aec, p_ctx_last_cg_pos, p_cu, &CGx, &CGy, b_luma, num_cg, is_dc_diag, num_cg_x_minus1, num_cg_y_minus1);\r\n    }\r\n\r\n    num_cg = cg_pos + 1;\r\n    runlevel->num_nonzero_cg = num_cg;\r\n\r\n    /* 2, read coefficients in each CG */\r\n    for (idx_cg = 0; idx_cg < num_cg; idx_cg++) {\r\n        int b_1st_cg = (cg_pos == 0);\r\n        int nonzero_cg_flag = 1;\r\n\r\n        /* 2.1, sig CG flag */\r\n        if (rank > 0) {\r\n            /* update CG position */\r\n            int ctx_sig_cg = (b_luma && cg_pos != 0);\r\n            CGx = tab_cg_scan[cg_pos][0];\r\n            CGy = tab_cg_scan[cg_pos][1];\r\n            nonzero_cg_flag = biari_decode_symbol(p_aec, p_ctx_nonzero_cg_flag + ctx_sig_cg);\r\n        }\r\n\r\n        /* 2.2, coefficients in CG */\r\n        if (nonzero_cg_flag) {\r\n            int num_pairs_in_cg = 0;\r\n            int i;\r\n\r\n            // last in CG\r\n            int pos = aec_read_last_coeff_pos_in_cg(p_aec, p_ctx_last_pos_in_cg, rank, CGx, CGy, b_luma, b_only_one_cg, is_dc_diag);\r\n\r\n            for 
(i = -numOfCoeffInCG; i != 0; i++) {\r\n                // level\r\n                int Run = 0;\r\n                int Level = 1;\r\n                int absSum5;\r\n                context_t *p_ctx;\r\n\r\n                /* coeff_level_minus1_band[j] */\r\n                if (biari_decode_final(p_aec)) {\r\n                    int golomb_order  = 0;\r\n                    int binary_symbol = 0;\r\n\r\n                    for (;;) {\r\n                        int l = biari_decode_symbol_eq_prob(p_aec);\r\n                        AEC_RETURN_ON_ERROR(-1);\r\n                        if (l) {\r\n                            break;\r\n                        }\r\n                        Level += (1 << golomb_order);\r\n                        golomb_order++;\r\n                    }\r\n\r\n                    while (golomb_order--) {\r\n                        // next binary part\r\n                        int sig = biari_decode_symbol_eq_prob(p_aec);\r\n                        binary_symbol |= (sig << golomb_order);\r\n                    }\r\n\r\n                    Level += binary_symbol;\r\n                    Level += 32;\r\n                } else {\r\n                    int pairsInCGIdx = (num_pairs_in_cg + 1) >> 1;\r\n                    pairsInCGIdx = DAVS2_MIN(2, pairsInCGIdx);\r\n                    p_ctx = p_ctx_level;\r\n                    p_ctx += 10 * (b_1st_cg && pos < 3) + DAVS2_MIN(rank, pairsInCGIdx + 2) + ((5 * pairsInCGIdx) >> 1);\r\n                    Level += biari_decode_symbol_continue0(p_aec, p_ctx, 31);\r\n                }\r\n\r\n                AEC_RETURN_ON_ERROR(-1);\r\n                absSum5 = get_abssum_of_n_last_coeffs(p_runlevel, num_pairs_in_cg, 0);\r\n                absSum5 = (absSum5 + Level) >> 1;\r\n                p_ctx = ctxa_run[DAVS2_MIN(absSum5, 2)];\r\n\r\n                // run\r\n                Run = 0;\r\n                if (pos > 0) {\r\n                    Run = f_read_run(p_aec, p_ctx, pos, b_only_one_cg, 
b_1st_cg);\r\n                }\r\n                AEC_RETURN_ON_ERROR(-1);\r\n\r\n#if AVS2_TRACE\r\n                if (b_luma) {\r\n                    avs2_trace(\"  Luma8x8 sng\");\r\n                    avs2_trace(\"(%2d) level =%3d run =%2d\\n\", idx_runlevel, Level, Run);\r\n                } else {\r\n                    avs2_trace(\"  AC chroma 8X8 \");\r\n                    avs2_trace(\"%2d: level =%3d run =%2d\\n\", idx_runlevel, Level, Run);\r\n                }\r\n                idx_runlevel++;\r\n#endif\r\n                p_runlevel[num_pairs_in_cg].level = (int16_t)Level;\r\n                p_runlevel[num_pairs_in_cg].run   = (int16_t)Run;\r\n\r\n                num_pairs_in_cg++;\r\n                if (Level > T_Chr[rank]) {\r\n                    rank = tab_rank[DAVS2_MIN(5, Level)];\r\n                }\r\n                if (Run == pos) {\r\n                    break;\r\n                }\r\n\r\n                pos -= (Run + 1);\r\n            } // for (i = -numOfCoeffInCG; i != 0; i++)\r\n\r\n            // sign of level\r\n            for (i = 0; i < num_pairs_in_cg; i++) {\r\n                if (biari_decode_symbol_eq_prob(p_aec)) {\r\n                    p_runlevel[i].level = -p_runlevel[i].level;\r\n                }\r\n            }\r\n\r\n            /* convert run-level to coefficients */\r\n            {\r\n                const int b_swap_xy  = runlevel->b_swap_xy;\r\n                const int i_coeff    = runlevel->i_res;\r\n                coeff_t *p_res = runlevel->p_res;\r\n                int num_pairs  = num_pairs_in_cg;\r\n                int coef_ctr   = -1;\r\n                if (b_swap_xy) {\r\n                    DAVS2_SWAP(CGx, CGy);\r\n                }\r\n                p_res += i_coeff * (CGy << 2) + (CGx << 2);\r\n                // convert the run-level pairs into the coefficients inside this CG\r\n                while (num_pairs > 0) {  /* leave if len=1 */\r\n                    int x_in_cg, y_in_cg;\r\n\r\n                    int level = p_runlevel[num_pairs - 
1].level;\r\n                    int run   = p_runlevel[num_pairs - 1].run;\r\n                    num_pairs--;\r\n                    if (run < 0 || run >= 16) {\r\n                        // davs2_log(h, DAVS2_LOG_ERROR, \"wrong run level.\");\r\n                        return -1;\r\n                    }\r\n                    coef_ctr += run + 1;\r\n\r\n                    x_in_cg = tab_scan_4x4[coef_ctr][ b_swap_xy];\r\n                    y_in_cg = tab_scan_4x4[coef_ctr][!b_swap_xy];\r\n\r\n                    level = (level * scale + add) >> shift;\r\n                    p_res[y_in_cg * i_coeff + x_in_cg] = (coeff_t)DAVS2_CLIP3(-32768, 32767, level);\r\n                }\r\n\r\n                if (CGy >= h_tr_half || CGx >= w_tr_half) {\r\n                    dct_pattern = DCT_DEAULT;\r\n                } else if ((CGy >= h_tr_quad || CGx >= w_tr_quad) && dct_pattern != DCT_DEAULT) {\r\n                    dct_pattern = DCT_HALF;\r\n                }\r\n            }\r\n        }  // end of reading one CG\r\n        cg_pos--;\r\n    }  // end of reading all CGs\r\n\r\n    return dct_pattern;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get coefficients of one block\r\n */\r\nint8_t cu_get_block_coeffs(aec_t *p_aec, runlevel_t *runlevel,\r\n                           cu_t *p_cu, coeff_t *p_res, int w_tr, int h_tr,\r\n                           int i_tu_level, int b_luma,\r\n                           int intra_pred_class, int b_swap_xy,\r\n                           int scale, int shift, int wq_size_id)\r\n{\r\n    int num_coeffs = w_tr * h_tr;\r\n    int num_cg = num_coeffs >> 4;\r\n\r\n    runlevel->p_res         = p_res;\r\n    runlevel->i_res         = w_tr;\r\n    runlevel->b_swap_xy     = b_swap_xy;\r\n    runlevel->i_tu_level    = i_tu_level;\r\n    runlevel->w_tr          = w_tr;\r\n    runlevel->h_tr          = h_tr;\r\n    UNUSED_PARAMETER(wq_size_id);\r\n\r\n    return 
(int8_t)aec_read_run_level(p_aec, p_cu, num_cg, b_luma, intra_pred_class == INTRA_PRED_DC_DIAG, runlevel, scale, shift);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * finding end of a slice in case this is not the end of a frame\r\n *\r\n * unsure whether the \"correction\" below actually solves an off-by-one\r\n * problem or whether it introduces one in some cases :-(  Anyway,\r\n * with this change the bit stream format works with AEC again.\r\n */\r\nint aec_startcode_follows(aec_t *p_aec, int eos_bit)\r\n{\r\n    int bit = 0;\r\n\r\n    if (eos_bit) {\r\n        bit = biari_decode_final(p_aec);\r\n\r\n#if AVS2_TRACE\r\n        avs2_trace(\"@%d %s\\t\\t%d\\n\", symbolCount++, \"Decode Sliceterm\", bit);\r\n#endif\r\n    }\r\n\r\n    /* the best way to be sure that the current slice is end,\r\n     * is to check if a start code is followed */\r\n\r\n    return bit;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint aec_read_split_flag(aec_t *p_aec, int i_level)\r\n{\r\n    context_t *p_ctx = p_aec->syn_ctx.cu_split_flag + (i_level - MIN_CU_SIZE_IN_BIT - 1);\r\n    int split_flag = biari_decode_symbol(p_aec, p_ctx);\r\n\r\n#if AVS2_TRACE\r\n    avs2_trace(\"SplitFlag = %3d\\n\", split_flag);\r\n#endif\r\n\r\n    return split_flag;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE int read_sao_mergeflag(aec_t *p_aec, int act_ctx)\r\n{\r\n    int act_sym = 0;\r\n\r\n    if (act_ctx == 1) {\r\n        act_sym = biari_decode_symbol(p_aec, &p_aec->syn_ctx.sao_mergeflag_context[0]);\r\n    } else if (act_ctx == 2) {\r\n        act_sym = biari_decode_symbol(p_aec, &p_aec->syn_ctx.sao_mergeflag_context[1]);\r\n        if (act_sym != 1) {\r\n            act_sym += (biari_decode_symbol(p_aec, &p_aec->syn_ctx.sao_mergeflag_context[2]) << 1);\r\n        }\r\n    }\r\n\r\n    return 
act_sym;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint aec_read_sao_mergeflag(aec_t *p_aec, int mergeleft_avail, int mergeup_avail)\r\n{\r\n    int merge_left  = 0;\r\n    int merge_top   = 0;\r\n    int merge_index = read_sao_mergeflag(p_aec, mergeleft_avail + mergeup_avail);\r\n\r\n    assert(merge_index <= 2);\r\n\r\n    if (mergeleft_avail) {\r\n        merge_left  = merge_index & 0x01;\r\n        merge_index = merge_index >> 1;\r\n    }\r\n    if (mergeup_avail && !merge_left) {\r\n        merge_top = merge_index & 0x01;\r\n    }\r\n\r\n    return (merge_left << 1) + merge_top;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint aec_read_sao_mode(aec_t *p_aec)\r\n{\r\n    int t2 = !biari_decode_symbol(p_aec, p_aec->syn_ctx.sao_mode_context);\r\n    int act_sym;\r\n\r\n    if (t2) {\r\n        int t1 = !biari_decode_symbol_eq_prob(p_aec);\r\n        act_sym = t2 + (t1 << 1);\r\n    } else {\r\n        act_sym = 0;\r\n    }\r\n\r\n    return act_sym;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE int read_sao_offset(aec_t *p_aec, int offset_type)\r\n{\r\n    int maxvalue = saoclip[offset_type][2];\r\n    int cnt = 0;\r\n    int act_sym, sym;\r\n\r\n    if (offset_type == SAO_CLASS_BO) {\r\n        sym = !biari_decode_symbol(p_aec, &p_aec->syn_ctx.sao_offset_context[0]);\r\n    } else {\r\n        sym = !biari_decode_symbol_eq_prob(p_aec);\r\n    }\r\n\r\n    while (sym) {\r\n        cnt++;\r\n        if (cnt == maxvalue) {\r\n            break;\r\n        }\r\n        sym = !biari_decode_symbol_eq_prob(p_aec);\r\n    }\r\n\r\n    if (offset_type == SAO_CLASS_EO_FULL_VALLEY) {\r\n        act_sym = EO_OFFSET_INV__MAP[cnt];\r\n    } else if (offset_type == SAO_CLASS_EO_FULL_PEAK) {\r\n        act_sym = -EO_OFFSET_INV__MAP[cnt];\r\n    } else if (offset_type == 
SAO_CLASS_EO_HALF_PEAK) {\r\n        act_sym = -cnt;\r\n    } else {\r\n        act_sym = cnt;\r\n    }\r\n\r\n    if (offset_type == SAO_CLASS_BO && act_sym) {\r\n        if (biari_decode_symbol_eq_prob(p_aec)) { // sign symbol\r\n            act_sym = -act_sym;\r\n        }\r\n    }\r\n\r\n    return act_sym;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid aec_read_sao_offsets(aec_t *p_aec, sao_param_t *p_sao_param, int *offset)\r\n{\r\n    int i;\r\n\r\n    assert(p_sao_param->modeIdc == SAO_MODE_NEW);\r\n\r\n    for (i = 0; i < 4; i++) {\r\n        int offset_type;\r\n        if (p_sao_param->typeIdc == SAO_TYPE_BO) {\r\n            offset_type = SAO_CLASS_BO;\r\n        } else {\r\n            offset_type = (i >= 2) ? (i + 1) : i;\r\n        }\r\n        offset[i] = read_sao_offset(p_aec, offset_type);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE int read_sao_type(aec_t *p_aec, int act_ctx)\r\n{\r\n    int act_sym = 0;\r\n    int golomb_order = 1;\r\n    int length;\r\n\r\n    if (act_ctx == 0) {\r\n        length = NUM_SAO_EO_TYPES_LOG2;\r\n    } else if (act_ctx == 1) {\r\n        length = NUM_SAO_BO_CLASSES_LOG2;\r\n    } else {\r\n        assert(act_ctx == 2);\r\n        length = NUM_SAO_BO_CLASSES_LOG2 - 1;\r\n    }\r\n\r\n    if (act_ctx == 2) {\r\n        int temp;\r\n        int rest;\r\n\r\n        do {\r\n            temp = biari_decode_symbol_eq_prob(p_aec);\r\n            AEC_RETURN_ON_ERROR(-1);\r\n\r\n            if (temp == 0) {\r\n                act_sym += (1 << golomb_order);\r\n                golomb_order++;\r\n            }\r\n\r\n            if (golomb_order == 4) {\r\n                golomb_order = 0;\r\n                temp = 1;\r\n            }\r\n\r\n        } while (temp != 1);\r\n\r\n        rest = 0;\r\n        while (golomb_order--) {\r\n            // next binary part\r\n            temp 
= biari_decode_symbol_eq_prob(p_aec);\r\n            if (temp == 1) {\r\n                rest |= (temp << golomb_order);\r\n            }\r\n        }\r\n\r\n        act_sym += rest;\r\n    } else {\r\n        int i;\r\n\r\n        for (i = 0; i < length; i++) {\r\n            act_sym = act_sym + (biari_decode_symbol_eq_prob(p_aec) << i);\r\n        }\r\n    }\r\n\r\n    return act_sym;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint aec_read_sao_type(aec_t *p_aec, sao_param_t *p_sao_param)\r\n{\r\n    int stBnd[2];\r\n    \r\n    assert(p_sao_param->modeIdc == SAO_MODE_NEW);\r\n    if (p_sao_param->typeIdc == SAO_TYPE_BO) {\r\n        stBnd[0] = read_sao_type(p_aec, 1);\r\n\r\n        // read delta start band for BO\r\n        stBnd[1] = read_sao_type(p_aec, 2) + 2;\r\n        return (stBnd[0] + (stBnd[1] << NUM_SAO_BO_CLASSES_LOG2));\r\n    } else {\r\n        assert(p_sao_param->typeIdc == SAO_TYPE_EO_0);\r\n        return read_sao_type(p_aec, 0);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint aec_read_alf_lcu_ctrl(aec_t *p_aec)\r\n{\r\n    context_t *ctx = p_aec->syn_ctx.alf_lcu_enable_scmodel;\r\n\r\n    return biari_decode_symbol(p_aec, ctx);\r\n}\r\n"
  },
  {
    "path": "source/common/aec.h",
    "content": "/*\r\n *  aec.h\r\n *\r\n * Description of this file:\r\n *    AEC functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_AEC_H\r\n#define DAVS2_AEC_H\r\n\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n    \r\n/* ---------------------------------------------------------------------------\r\n * global variables */\r\n#define saoclip FPFX(saoclip)\r\nextern const int        saoclip[NUM_SAO_OFFSET][3];\r\n#define tab_intra_mode_scan_type FPFX(tab_intra_mode_scan_type)\r\nextern const int        tab_intra_mode_scan_type[NUM_INTRA_MODE];\r\n\r\n/* ---------------------------------------------------------------------------\r\n * aec basic operations */\r\n#define aec_init_contexts FPFX(aec_init_contexts)\r\nvoid 
aec_init_contexts      (aec_t *p_aec);\r\n#define aec_new_slice FPFX(aec_new_slice)\r\nvoid aec_new_slice          (davs2_t *h);\r\n\r\n#define aec_start_decoding FPFX(aec_start_decoding)\r\nint  aec_start_decoding     (aec_t *p_aec, uint8_t *p_start, int i_byte_pos, int i_bytes);\r\n#define aec_bits_read FPFX(aec_bits_read)\r\nint  aec_bits_read          (aec_t *p_aec);\r\n#define aec_startcode_follows FPFX(aec_startcode_follows)\r\nint  aec_startcode_follows  (aec_t *p_aec, int eos_bit);\r\n\r\n/* ---------------------------------------------------------------------------\r\n * ctu structure information */\r\n#define aec_read_split_flag FPFX(aec_read_split_flag)\r\nint  aec_read_split_flag    (aec_t *p_aec, int i_level);\r\n\r\n/* ---------------------------------------------------------------------------\r\n * cu type information */\r\n#define aec_read_cu_type FPFX(aec_read_cu_type)\r\nint  aec_read_cu_type       (aec_t *p_aec, cu_t *p_cu, int img_type, int b_amp, int b_mhp, int b_wsm, int num_references);\r\n#define aec_read_cu_type_sframe FPFX(aec_read_cu_type_sframe)\r\nint  aec_read_cu_type_sframe(aec_t *p_aec);\r\n#define aec_read_intra_cu_type FPFX(aec_read_intra_cu_type)\r\nint  aec_read_intra_cu_type (aec_t *p_aec, cu_t *p_cu, int b_sdip, davs2_t *h);\r\n\r\n/* ---------------------------------------------------------------------------\r\n * inter prediction information */\r\n#define aec_read_dmh_mode FPFX(aec_read_dmh_mode)\r\nint  aec_read_dmh_mode      (aec_t *p_aec, int i_cu_level);\r\n#define aec_read_mvds FPFX(aec_read_mvds)\r\nvoid aec_read_mvds          (aec_t *p_aec, mv_t *p_mvd);\r\n#define aec_read_inter_pred_dir FPFX(aec_read_inter_pred_dir)\r\nvoid aec_read_inter_pred_dir(aec_t * p_aec, cu_t *p_cu, davs2_t *h);\r\n\r\n/* ---------------------------------------------------------------------------\r\n * intra prediction information */\r\n#define aec_read_intra_pmode FPFX(aec_read_intra_pmode)\r\nint  aec_read_intra_pmode   (aec_t 
*p_aec);\r\n#define aec_read_intra_pmode_c FPFX(aec_read_intra_pmode_c)\r\nint  aec_read_intra_pmode_c (aec_t *p_aec, davs2_t *h, int luma_mode);\r\n\r\n/* ---------------------------------------------------------------------------\r\n * transform unit (residual) information */\r\n#define cu_read_cbp FPFX(cu_read_cbp)\r\nint  cu_read_cbp            (davs2_t *h, aec_t *p_aec, cu_t *p_cu, int scu_x, int scu_y);\r\n#define cu_get_block_coeffs FPFX(cu_get_block_coeffs)\r\nint8_t cu_get_block_coeffs  (aec_t *p_aec, runlevel_t *runlevel,\r\n                             cu_t *p_cu, coeff_t *p_res, int w_tr, int h_tr,\r\n                             int i_tu_level, int b_luma,\r\n                             int intra_pred_class, int b_swap_xy,\r\n                             int scale, int shift, int wq_size_id);\r\n\r\n/* ---------------------------------------------------------------------------\r\n * loop filter information */\r\n#define aec_read_sao_mergeflag FPFX(aec_read_sao_mergeflag)\r\nint  aec_read_sao_mergeflag (aec_t *p_aec, int mergeleft_avail, int mergeup_avail);\r\n#define aec_read_sao_mode FPFX(aec_read_sao_mode)\r\nint  aec_read_sao_mode      (aec_t *p_aec);\r\n#define aec_read_sao_offsets FPFX(aec_read_sao_offsets)\r\nvoid aec_read_sao_offsets   (aec_t *p_aec, sao_param_t *p_sao_param, int *offset);\r\n#define aec_read_sao_type FPFX(aec_read_sao_type)\r\nint  aec_read_sao_type      (aec_t *p_aec, sao_param_t *p_sao_param);\r\n\r\n#define aec_read_alf_lcu_ctrl FPFX(aec_read_alf_lcu_ctrl)\r\nint  aec_read_alf_lcu_ctrl  (aec_t *p_aec);\r\n\r\n#ifndef AEC_RETURN_ON_ERROR\r\n#define AEC_RETURN_ON_ERROR(ret_code) \\\r\n        if (p_aec->b_bit_error) {\\\r\n            p_aec->b_bit_error = FALSE; /* reset error flag */\\\r\n            /* davs2_log(h, DAVS2_LOG_ERROR, \"aec decoding error.\"); */\\\r\n            return (ret_code);\\\r\n        }\r\n#endif\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_AEC_H\r\n"
  },
  {
    "path": "source/common/alf.cc",
    "content": "/*\r\n * alf.cc\r\n *\r\n * Description of this file:\r\n *    ALF functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"alf.h\"\r\n#include \"aec.h\"\r\n#include \"vlc.h\"\r\n#include \"frame.h\"\r\n\r\n#if HAVE_MMX\r\n#include \"vec/intrinsic.h\"\r\n#endif\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * local function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void alf_recon_coefficients(alf_param_t *p_alf_param, int p_filter_coeff[][ALF_MAX_NUM_COEF])\r\n{\r\n    int num_coeff = 
p_alf_param->num_coeff - 1;\r\n    int alf_num   = 1 << ALF_NUM_BIT_SHIFT;\r\n    int sum;\r\n    int i, j;\r\n\r\n    for (j = 0; j < p_alf_param->filters_per_group; j++) {\r\n        sum = 0;\r\n        for (i = 0; i < num_coeff; i++) {\r\n            sum += (2 * p_alf_param->coeffmulti[j][i]);\r\n            p_filter_coeff[j][i] = p_alf_param->coeffmulti[j][i];\r\n        }\r\n        p_filter_coeff[j][num_coeff] = (alf_num - sum) + p_alf_param->coeffmulti[j][num_coeff];\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void alf_init_var_table(alf_param_t *p_alf_param, int *p_var_tab)\r\n{\r\n    if (p_alf_param->filters_per_group > 1) {\r\n        int i;\r\n\r\n        p_var_tab[0] = 0;\r\n        for (i = 1; i < ALF_NUM_VARS; ++i) {\r\n            p_var_tab[i] = (p_alf_param->filterPattern[i]) ? (p_var_tab[i - 1] + 1) : p_var_tab[i - 1];\r\n        }\r\n    } else {\r\n        memset(p_var_tab, 0, ALF_NUM_VARS * sizeof(int));\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid alf_filter_block1(pel_t *p_dst, const pel_t *p_src, int stride,\r\n                       int lcu_pix_x, int lcu_pix_y, int lcu_width, int lcu_height,\r\n                       int *alf_coeff, int b_top_avail, int b_down_avail)\r\n{\r\n    const int pel_add  = 1 << (ALF_NUM_BIT_SHIFT - 1);\r\n    const int pel_max  = max_pel_value;\r\n    const int min_x    = -3;\r\n    const int max_x    = lcu_width - 1 + 3;\r\n    int x, y;\r\n    const pel_t *imgPad1, *imgPad2, *imgPad3, *imgPad4, *imgPad5, *imgPad6;\r\n\r\n    {\r\n        int startPos = b_top_avail ? (lcu_pix_y - 4) : lcu_pix_y;\r\n        int endPos = b_down_avail ? 
(lcu_pix_y + lcu_height - 4) : (lcu_pix_y + lcu_height);\r\n        p_src += (startPos * stride) + lcu_pix_x;\r\n        p_dst += (startPos * stride) + lcu_pix_x;\r\n        lcu_height = endPos - startPos;\r\n        lcu_height--;\r\n    }\r\n\r\n    for (y = 0; y <= lcu_height; y++) {\r\n        int yUp, yBottom;\r\n        yUp     = DAVS2_CLIP3(0, lcu_height, y - 1);\r\n        yBottom = DAVS2_CLIP3(0, lcu_height, y + 1);\r\n        imgPad1 = p_src + (yBottom - y) * stride;\r\n        imgPad2 = p_src + (yUp     - y) * stride;\r\n\r\n        yUp     = DAVS2_CLIP3(0, lcu_height, y - 2);\r\n        yBottom = DAVS2_CLIP3(0, lcu_height, y + 2);\r\n        imgPad3 = p_src + (yBottom - y) * stride;\r\n        imgPad4 = p_src + (yUp     - y) * stride;\r\n\r\n        yUp     = DAVS2_CLIP3(0, lcu_height, y - 3);\r\n        yBottom = DAVS2_CLIP3(0, lcu_height, y + 3);\r\n        imgPad5 = p_src + (yBottom - y) * stride;\r\n        imgPad6 = p_src + (yUp     - y) * stride;\r\n\r\n        for (x = 0; x < lcu_width; x++) {\r\n            int xLeft, xRight;\r\n            int pel_val;\r\n            pel_val  = alf_coeff[0] * (imgPad5[x] + imgPad6[x]);\r\n            pel_val += alf_coeff[1] * (imgPad3[x] + imgPad4[x]);\r\n\r\n            xLeft    = DAVS2_CLIP3(min_x, max_x, x - 1);\r\n            xRight   = DAVS2_CLIP3(min_x, max_x, x + 1);\r\n            pel_val += alf_coeff[2] * (imgPad1[xRight] + imgPad2[xLeft ]);\r\n            pel_val += alf_coeff[3] * (imgPad1[x     ] + imgPad2[x     ]);\r\n            pel_val += alf_coeff[4] * (imgPad1[xLeft ] + imgPad2[xRight]);\r\n            pel_val += alf_coeff[7] * (p_src [xRight] + p_src [xLeft ]);\r\n\r\n            xLeft    = DAVS2_CLIP3(min_x, max_x, x - 2);\r\n            xRight   = DAVS2_CLIP3(min_x, max_x, x + 2);\r\n            pel_val += alf_coeff[6] * (p_src [xRight] + p_src [xLeft ]);\r\n\r\n            xLeft    = DAVS2_CLIP3(min_x, max_x, x - 3);\r\n            xRight   = DAVS2_CLIP3(min_x, max_x, x + 3);\r\n            
pel_val += alf_coeff[5] * (p_src [xRight] + p_src [xLeft ]);\r\n            pel_val += alf_coeff[8] * (p_src [x     ]);\r\n\r\n            pel_val   = (pel_val + pel_add) >> ALF_NUM_BIT_SHIFT;\r\n            p_dst[x] = (pel_t)DAVS2_CLIP3(0, pel_max, pel_val);\r\n        }\r\n\r\n        p_src += stride;\r\n        p_dst += stride;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid alf_filter_block2(pel_t *p_dst, const pel_t *p_src, int i_src,\r\n                       int lcu_pix_x, int lcu_pix_y, int lcu_width, int lcu_height,\r\n                       int *alf_coeff, int b_top_avail, int b_down_avail)\r\n{\r\n    const pel_t *p_src1, *p_src2, *p_src3, *p_src4, *p_src5, *p_src6;\r\n    int i_dst = i_src;\r\n    int pixelInt;\r\n    int startPos = b_top_avail ? (lcu_pix_y - 4) : lcu_pix_y;\r\n    int endPos = b_down_avail ? (lcu_pix_y + lcu_height - 4) : (lcu_pix_y + lcu_height);\r\n\r\n    /* first line */\r\n    p_src += (startPos * i_src) + lcu_pix_x;\r\n    p_dst += (startPos * i_dst) + lcu_pix_x;\r\n\r\n    if (p_src[0] != p_src[-1]) {\r\n        p_src1 = p_src + 1 * i_src;\r\n        p_src2 = p_src;\r\n        p_src3 = p_src + 2 * i_src;\r\n        p_src4 = p_src;\r\n        p_src5 = p_src + 3 * i_src;\r\n        p_src6 = p_src;\r\n\r\n        pixelInt  = alf_coeff[0] * (p_src5[ 0] + p_src6[ 0]);\r\n        pixelInt += alf_coeff[1] * (p_src3[ 0] + p_src4[ 0]);\r\n        pixelInt += alf_coeff[2] * (p_src1[ 1] + p_src2[ 0]);\r\n        pixelInt += alf_coeff[3] * (p_src1[ 0] + p_src2[ 0]);\r\n        pixelInt += alf_coeff[4] * (p_src1[-1] + p_src2[ 1]);\r\n        pixelInt += alf_coeff[7] * (p_src [ 1] + p_src [-1]);\r\n        pixelInt += alf_coeff[6] * (p_src [ 2] + p_src [-2]);\r\n        pixelInt += alf_coeff[5] * (p_src [ 3] + p_src [-3]);\r\n        pixelInt += alf_coeff[8] * (p_src [ 0]);\r\n\r\n        pixelInt = (int)((pixelInt + 32) >> 6);\r\n        p_dst[0] = 
(pel_t)DAVS2_CLIP1(pixelInt);\r\n    }\r\n\r\n    p_src += lcu_width - 1;\r\n    p_dst += lcu_width - 1;\r\n\r\n    if (p_src[0] != p_src[1]) {\r\n        p_src1 = p_src + 1 * i_src;\r\n        p_src2 = p_src;\r\n        p_src3 = p_src + 2 * i_src;\r\n        p_src4 = p_src;\r\n        p_src5 = p_src + 3 * i_src;\r\n        p_src6 = p_src;\r\n\r\n        pixelInt  = alf_coeff[0] * (p_src5[ 0] + p_src6[ 0]);\r\n        pixelInt += alf_coeff[1] * (p_src3[ 0] + p_src4[ 0]);\r\n        pixelInt += alf_coeff[2] * (p_src1[ 1] + p_src2[-1]);\r\n        pixelInt += alf_coeff[3] * (p_src1[ 0] + p_src2[ 0]);\r\n        pixelInt += alf_coeff[4] * (p_src1[-1] + p_src2[ 0]);\r\n        pixelInt += alf_coeff[7] * (p_src [ 1] + p_src [-1]);\r\n        pixelInt += alf_coeff[6] * (p_src [ 2] + p_src [-2]);\r\n        pixelInt += alf_coeff[5] * (p_src [ 3] + p_src [-3]);\r\n        pixelInt += alf_coeff[8] * (p_src [ 0]);\r\n\r\n        pixelInt = (int)((pixelInt + 32) >> 6);\r\n        p_dst[0] = (pel_t)DAVS2_CLIP1(pixelInt);\r\n    }\r\n\r\n    /* last line */\r\n    p_src -= lcu_width - 1;\r\n    p_dst -= lcu_width - 1;\r\n    p_src += ((endPos - startPos - 1) * i_src);\r\n    p_dst += ((endPos - startPos - 1) * i_dst);\r\n\r\n    if (p_src[0] != p_src[-1]) {\r\n        p_src1 = p_src;\r\n        p_src2 = p_src - 1 * i_src;\r\n        p_src3 = p_src;\r\n        p_src4 = p_src - 2 * i_src;\r\n        p_src5 = p_src;\r\n        p_src6 = p_src - 3 * i_src;\r\n\r\n        pixelInt  = alf_coeff[0] * (p_src5[ 0] + p_src6[ 0]);\r\n        pixelInt += alf_coeff[1] * (p_src3[ 0] + p_src4[ 0]);\r\n        pixelInt += alf_coeff[2] * (p_src1[ 1] + p_src2[-1]);\r\n        pixelInt += alf_coeff[3] * (p_src1[ 0] + p_src2[ 0]);\r\n        pixelInt += alf_coeff[4] * (p_src1[ 0] + p_src2[ 1]);\r\n        pixelInt += alf_coeff[7] * (p_src [ 1] + p_src [-1]);\r\n        pixelInt += alf_coeff[6] * (p_src [ 2] + p_src [-2]);\r\n        pixelInt += alf_coeff[5] * (p_src [ 3] + p_src [-3]);\r\n        
pixelInt += alf_coeff[8] * (p_src [ 0]);\r\n\r\n        pixelInt = (int)((pixelInt + 32) >> 6);\r\n        p_dst[0] = (pel_t)DAVS2_CLIP1(pixelInt);\r\n    }\r\n\r\n    p_src += lcu_width - 1;\r\n    p_dst += lcu_width - 1;\r\n\r\n    if (p_src[0] != p_src[1]) {\r\n        p_src1 = p_src;\r\n        p_src2 = p_src - 1 * i_src;\r\n        p_src3 = p_src;\r\n        p_src4 = p_src - 2 * i_src;\r\n        p_src5 = p_src;\r\n        p_src6 = p_src - 3 * i_src;\r\n\r\n        pixelInt  = alf_coeff[0] * (p_src5[ 0] + p_src6[ 0]);\r\n        pixelInt += alf_coeff[1] * (p_src3[ 0] + p_src4[ 0]);\r\n        pixelInt += alf_coeff[2] * (p_src1[ 0] + p_src2[-1]);\r\n        pixelInt += alf_coeff[3] * (p_src1[ 0] + p_src2[ 0]);\r\n        pixelInt += alf_coeff[4] * (p_src1[-1] + p_src2[ 1]);\r\n        pixelInt += alf_coeff[7] * (p_src [ 1] + p_src [-1]);\r\n        pixelInt += alf_coeff[6] * (p_src [ 2] + p_src [-2]);\r\n        pixelInt += alf_coeff[5] * (p_src [ 3] + p_src [-3]);\r\n        pixelInt += alf_coeff[8] * (p_src [ 0]);\r\n\r\n        pixelInt = (int)((pixelInt + 32) >> 6);\r\n        p_dst[0] = (pel_t)DAVS2_CLIP1(pixelInt);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void deriveBoundaryAvail(davs2_t *h, int lcu_xy, int width_in_lcu, int height_in_lcu,\r\n    int *b_top_avail, int *b_down_avail)\r\n{\r\n    *b_top_avail  = (lcu_xy >= width_in_lcu);\r\n    *b_down_avail = (lcu_xy < (height_in_lcu - 1) * width_in_lcu);\r\n\r\n    if (!h->seq_info.cross_loop_filter_flag) {\r\n        int width_in_scu = h->i_width_in_scu;\r\n        int lcu_pic_x = (lcu_xy % width_in_lcu) << h->i_lcu_level;\r\n        int lcu_pic_y = (lcu_xy / width_in_lcu) << h->i_lcu_level;\r\n        int scu_xy    = (lcu_pic_y >> MIN_CU_SIZE_IN_BIT) * width_in_scu + (lcu_pic_x >> MIN_CU_SIZE_IN_BIT);\r\n        // int scu_xy_next_row = scu_xy + (1 << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT)) * width_in_scu;\r\n        int 
slice_idx_top = *b_top_avail ? h->scu_data[scu_xy - width_in_scu].i_slice_nr : -1;\r\n        // int slice_idx_down = *b_down_avail ? h->scu_data[scu_xy_next_row].i_slice_nr : -1;\r\n        int slcie_idx_cur  = h->scu_data[scu_xy].i_slice_nr;\r\n\r\n        *b_top_avail  = (slcie_idx_cur == slice_idx_top) ? TRUE : FALSE;\r\n        // *b_down_avail = (slcie_idx_cur == slice_idx_down) ? TRUE : FALSE;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void alf_param_init(alf_param_t *alf_par, int cID)\r\n{\r\n    alf_par->num_coeff             = ALF_MAX_NUM_COEF;\r\n    alf_par->filters_per_group     = 1;\r\n    alf_par->componentID           = cID;\r\n    memset(alf_par->filterPattern, 0, sizeof(alf_par->filterPattern));\r\n    memset(alf_par->coeffmulti,    0, sizeof(alf_par->coeffmulti));\r\n}\r\n\r\n/**\r\n * ===========================================================================\r\n * interface function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nsize_t alf_get_buffer_size(davs2_t *h)\r\n{\r\n    size_t width_in_lcu  = h->i_width_in_lcu;\r\n    size_t height_in_lcu = h->i_height_in_lcu;\r\n\r\n    return  sizeof(alf_var_t) + height_in_lcu * width_in_lcu * sizeof(uint8_t);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid alf_init_buffer(davs2_t *h)\r\n{\r\n    static const uint8_t regionTable[ALF_NUM_VARS] = { 0, 1, 4, 5, 15, 2, 3, 6, 14, 11, 10, 7, 13, 12, 9, 8 };\r\n    int width_in_lcu  = h->i_width_in_lcu;\r\n    int height_in_lcu = h->i_height_in_lcu;\r\n    int quad_w_in_lcu = ((width_in_lcu  + 1) >> 2);\r\n    int quad_h_in_lcu = ((height_in_lcu + 1) >> 2);\r\n    int region_idx_x;\r\n    int region_idx_y;\r\n    int i, j;\r\n    uint8_t *mem_ptr  = (uint8_t *)h->p_alf;\r\n\r\n    
h->p_alf->tab_lcu_region = (uint8_t *)(mem_ptr + sizeof(alf_var_t));\r\n\r\n    memset(h->p_alf->filterCoeffSym, 0, sizeof(h->p_alf->filterCoeffSym));\r\n\r\n    for (j = 0; j < height_in_lcu; j++) {\r\n        region_idx_y = (quad_h_in_lcu == 0) ? 3 : DAVS2_MIN(j / quad_h_in_lcu, 3);\r\n        for (i = 0; i < width_in_lcu; i++) {\r\n            region_idx_x = (quad_w_in_lcu == 0) ? 3 : DAVS2_MIN(i / quad_w_in_lcu, 3);\r\n            h->p_alf->tab_lcu_region[j * width_in_lcu + i] = regionTable[region_idx_y * 4 + region_idx_x];\r\n        }\r\n    }\r\n\r\n    for (i = 0; i < IMG_COMPONENTS; i++) {\r\n        alf_param_init(&h->p_alf->img_param[i], i);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void vlc_read_alf_coeff(davs2_bs_t *bs, alf_param_t *alf_param)\r\n{\r\n    const int numCoeff = ALF_MAX_NUM_COEF;\r\n    int f, symbol, pre_symbole;\r\n    int pos;\r\n\r\n    switch (alf_param->componentID) {\r\n    case IMG_U:\r\n    case IMG_V:\r\n        for (pos = 0; pos < numCoeff; pos++) {\r\n            alf_param->coeffmulti[0][pos] = se_v(bs, \"Chroma ALF coefficients\");\r\n        }\r\n        break;\r\n    case IMG_Y:\r\n        alf_param->filters_per_group = ue_v(bs, \"ALF filter number\");\r\n        alf_param->filters_per_group = alf_param->filters_per_group + 1;\r\n\r\n        memset(alf_param->filterPattern, 0, ALF_NUM_VARS * sizeof(int));\r\n        pre_symbole = 0;\r\n        symbol = 0;\r\n        for (f = 0; f < alf_param->filters_per_group; f++) {\r\n            if (f > 0) {\r\n                if (alf_param->filters_per_group != 16) {\r\n                    symbol = ue_v(bs, \"Region distance\");\r\n                } else {\r\n                    symbol = 1;\r\n                }\r\n                alf_param->filterPattern[symbol + pre_symbole] = 1;\r\n                pre_symbole += symbol;\r\n            }\r\n\r\n            for (pos = 0; pos < numCoeff; pos++) {\r\n              
  alf_param->coeffmulti[f][pos] = se_v(bs, \"Luma ALF coefficients\");\r\n            }\r\n        }\r\n        break;\r\n    default:\r\n        /// Not a legal component ID\r\n        assert(0);\r\n        exit(-1);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid alf_read_param(davs2_t *h, davs2_bs_t *bs)\r\n{\r\n    if (h->b_alf) {\r\n        h->pic_alf_on[IMG_Y] = u_flag(bs, \"alf_pic_flag_Y\");\r\n        h->pic_alf_on[IMG_U] = u_flag(bs, \"alf_pic_flag_Cb\");\r\n        h->pic_alf_on[IMG_V] = u_flag(bs, \"alf_pic_flag_Cr\");\r\n\r\n        if (h->pic_alf_on[0] || h->pic_alf_on[1] || h->pic_alf_on[2]) {\r\n            int component_idx;\r\n\r\n            for (component_idx = 0; component_idx < IMG_COMPONENTS; component_idx++) {\r\n                if (h->pic_alf_on[component_idx]) {\r\n                    vlc_read_alf_coeff(bs, &h->p_alf->img_param[component_idx]);\r\n                }\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * ALF one LCU block\r\n */\r\nstatic void alf_lcu_block(davs2_t *h, alf_param_t *p_alf_param, davs2_frame_t *p_tmp_frm, davs2_frame_t *p_dec_frm, int i_lcu_x, int i_lcu_y)\r\n{\r\n    int lcu_size      = h->i_lcu_size;\r\n    int img_height    = h->i_height;\r\n    int img_width     = h->i_width;\r\n    int width_in_lcu  = h->i_width_in_lcu;\r\n    int height_in_lcu = h->i_height_in_lcu;\r\n    int lcu_pix_x     = i_lcu_x << h->i_lcu_level;\r\n    int lcu_pix_y     = i_lcu_y << h->i_lcu_level;\r\n    int lcu_width     = (lcu_pix_x + lcu_size > img_width ) ? (img_width  - lcu_pix_x) : lcu_size;\r\n    int lcu_height    = (lcu_pix_y + lcu_size > img_height) ? 
(img_height - lcu_pix_y) : lcu_size;\r\n    int lcu_xy        = i_lcu_y * width_in_lcu + i_lcu_x;\r\n    int b_top_avail, b_down_avail;\r\n    int lcu_region_idx = h->p_alf->tab_lcu_region[lcu_xy];\r\n    int *alf_coef;\r\n\r\n    // derive CTU boundary availabilities\r\n    deriveBoundaryAvail(h, lcu_xy, width_in_lcu, height_in_lcu, &b_top_avail, &b_down_avail);\r\n\r\n    if (h->lcu_infos[lcu_xy].enable_alf[0]) {\r\n        alf_init_var_table(&p_alf_param[0], h->p_alf->tab_region_coeff_idx);\r\n\r\n        // reconstruct ALF coefficients & related parameters\r\n        alf_recon_coefficients(&p_alf_param[0], h->p_alf->filterCoeffSym);\r\n        alf_coef = h->p_alf->filterCoeffSym[h->p_alf->tab_region_coeff_idx[lcu_region_idx]];\r\n\r\n        gf_davs2.alf_block[0](p_dec_frm->planes[0], p_tmp_frm->planes[0], p_dec_frm->i_stride[0],\r\n            lcu_pix_x, lcu_pix_y, lcu_width, lcu_height,\r\n            alf_coef, b_top_avail, b_down_avail);\r\n        gf_davs2.alf_block[1](p_dec_frm->planes[0], p_tmp_frm->planes[0], p_dec_frm->i_stride[0],\r\n            lcu_pix_x, lcu_pix_y, lcu_width, lcu_height,\r\n            alf_coef, b_top_avail, b_down_avail);\r\n    }\r\n\r\n    lcu_pix_x  >>= 1;\r\n    lcu_pix_y  >>= 1;\r\n    lcu_width  >>= 1;\r\n    lcu_height >>= 1;\r\n    if (h->lcu_infos[lcu_xy].enable_alf[1]) {\r\n        // reconstruct ALF coefficients & related parameters\r\n        alf_recon_coefficients(&p_alf_param[1], h->p_alf->filterCoeffSym);\r\n        alf_coef = h->p_alf->filterCoeffSym[0];\r\n\r\n        gf_davs2.alf_block[0](p_dec_frm->planes[1], p_tmp_frm->planes[1], p_dec_frm->i_stride[1],\r\n            lcu_pix_x, lcu_pix_y, lcu_width, lcu_height,\r\n            alf_coef, b_top_avail, b_down_avail);\r\n        gf_davs2.alf_block[1](p_dec_frm->planes[1], p_tmp_frm->planes[1], p_dec_frm->i_stride[1],\r\n            lcu_pix_x, lcu_pix_y, lcu_width, lcu_height,\r\n            alf_coef, b_top_avail, b_down_avail);\r\n    }\r\n\r\n    if 
(h->lcu_infos[lcu_xy].enable_alf[2]) {\r\n        // reconstruct ALF coefficients & related parameters\r\n        alf_recon_coefficients(&p_alf_param[2], h->p_alf->filterCoeffSym);\r\n        alf_coef = h->p_alf->filterCoeffSym[0];\r\n\r\n        gf_davs2.alf_block[0](p_dec_frm->planes[2], p_tmp_frm->planes[2], p_dec_frm->i_stride[2],\r\n            lcu_pix_x, lcu_pix_y, lcu_width, lcu_height,\r\n            alf_coef, b_top_avail, b_down_avail);\r\n        gf_davs2.alf_block[1](p_dec_frm->planes[2], p_tmp_frm->planes[2], p_dec_frm->i_stride[2],\r\n            lcu_pix_x, lcu_pix_y, lcu_width, lcu_height,\r\n            alf_coef, b_top_avail, b_down_avail);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid alf_lcurow(davs2_t *h, alf_param_t *p_alf_param, davs2_frame_t *p_tmp_frm, davs2_frame_t *p_dec_frm, int i_lcu_y)\r\n{\r\n    const int w_in_lcu = h->i_width_in_lcu;\r\n    int i_lcu_x;\r\n\r\n    /* copy one decoded LCU-row (with padding left and right edges) */\r\n    davs2_frame_copy_lcurow(h, p_tmp_frm, p_dec_frm, i_lcu_y, -4, 8);\r\n\r\n    /* ALF one LCU-row */\r\n    for (i_lcu_x = 0; i_lcu_x < w_in_lcu; i_lcu_x++) {\r\n        alf_lcu_block(h, p_alf_param, p_tmp_frm, p_dec_frm, i_lcu_x, i_lcu_y);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_alf_init(uint32_t cpuid, ao_funcs_t *fh)\r\n{\r\n    UNUSED_PARAMETER(cpuid);\r\n\r\n    /* init c function handles */\r\n    fh->alf_block[0] = alf_filter_block1;\r\n    fh->alf_block[1] = alf_filter_block2;\r\n\r\n    /* init asm function handles */\r\n#if HAVE_MMX\r\n#if HIGH_BIT_DEPTH\r\n#else\r\n    if (cpuid & DAVS2_CPU_SSE4) {\r\n        fh->alf_block[0] = alf_filter_block_sse128;\r\n    }\r\n#endif\r\n#endif\r\n}\r\n"
  },
  {
    "path": "source/common/alf.h",
    "content": "/*\r\n * alf.h\r\n *\r\n * Description of this file:\r\n *    ALF functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_ALF_H\r\n#define DAVS2_ALF_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#define alf_get_buffer_size FPFX(alf_get_buffer_size)\r\nsize_t alf_get_buffer_size(davs2_t *h);\r\n#define alf_init_buffer FPFX(alf_init_buffer)\r\nvoid alf_init_buffer    (davs2_t *h);\r\n\r\n#define alf_lcurow FPFX(alf_lcurow)\r\nvoid alf_lcurow(davs2_t *h, alf_param_t *p_alf_param, davs2_frame_t *p_tmp_frm, davs2_frame_t *p_dec_frm, int i_lcu_y);\r\n\r\n#define alf_read_param FPFX(alf_read_param)\r\nvoid alf_read_param(davs2_t *h, davs2_bs_t *bs);\r\n\r\n#define davs2_alf_init FPFX(alf_init)\r\nvoid 
davs2_alf_init(uint32_t cpuid, ao_funcs_t *fh);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_ALF_H\r\n"
  },
  {
    "path": "source/common/bitstream.cc",
    "content": "/*\r\n * bitstream.cc\r\n *\r\n * Description of this file:\r\n *    Bitstream functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"aec.h\"\r\n#include \"bitstream.h\"\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * start code (in 32-bit)\r\n */\r\n#define SEQENCE_START_CODE      0xB0010000\r\n#define I_FRAME_START_CODE      0xB3010000\r\n#define PB_FRAME_START_CODE     0xB6010000\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nvoid bs_init(davs2_bs_t *bs, uint8_t *p_data, int i_data)\r\n{\r\n    bs->p_stream  = p_data;\r\n    bs->i_stream  = i_data;\r\n    bs->i_bit_pos = 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * align position in bitstream */\r\nvoid bs_align(davs2_bs_t *bs)\r\n{\r\n    bs->i_bit_pos = ((bs->i_bit_pos + 7) >> 3) << 3;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint bs_left_bytes(davs2_bs_t *bs)\r\n{\r\n    return (bs->i_stream - (bs->i_bit_pos >> 3));\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Function   : try to find slice header in next forward bytes\r\n * Parameters :\r\n *      [in ] : bs_data   - pointer to the bit-stream data buffer\r\n * Return     : TRUE for slice header, otherwise FALSE\r\n * ---------------------------------------------------------------------------\r\n */\r\nint found_slice_header(davs2_bs_t *bs)\r\n{\r\n    int num_bytes = 4;\r\n\r\n    for (; num_bytes; num_bytes--) {\r\n        uint8_t *data = bs->p_stream + ((bs->i_bit_pos + 7) >> 3);\r\n        uint32_t code = *(uint32_t *)data;\r\n        if ((code & 0x00FFFFFF) == 0x00010000 && ((code >> 24) <= SC_SLICE_CODE_MAX)) {\r\n            return 1;\r\n        }\r\n        bs->i_bit_pos += 8;\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint bs_get_start_code(davs2_bs_t *bs)\r\n{\r\n    uint8_t *p_data  = bs->p_stream + ((bs->i_bit_pos + 7) >> 3);\r\n    int i_left_bytes = bs_left_bytes(bs);\r\n    int i_used_bytes = 0;\r\n\r\n    /* find the start code '00 00 01 xx' */\r\n    while (i_left_bytes >= 4 && (*(uint32_t *)p_data & 0x00FFFFFF) != 0x00010000) {\r\n        p_data++;\r\n        i_left_bytes--;\r\n        i_used_bytes++;\r\n    }\r\n\r\n    if 
(i_left_bytes >= 4) {\r\n        bs->i_bit_pos += (i_used_bytes << 3);\r\n        return p_data[3];\r\n    } else {\r\n        return -1;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Function   : check bitstream & dispose the pseudo start code\r\n * Parameters :\r\n *       [in] : dst   - pointer to dst byte buffer\r\n *   [in/out] : src   - pointer to source byte buffer\r\n *   [in/out] : i_src - byte number of src\r\n * Return     : byte number of dst\r\n * ---------------------------------------------------------------------------\r\n */\r\nint bs_dispose_pseudo_code(uint8_t *dst, uint8_t *src, int i_src)\r\n{\r\n    static const int BITMASK[] = { 0x00, 0x00, 0xc0, 0x00, 0xf0, 0x00, 0xfc, 0x00 };\r\n    int b_found_start_code = 0;\r\n    int leading_zeros  = 0;\r\n    int last_bit_count = 0;\r\n    int curr_bit_count = 0;\r\n    int b_dispose = 0;\r\n    int i_pos = 0;\r\n    int i_dst = 0;\r\n    uint8_t last_byte = 0;\r\n    uint8_t curr_byte = 0;\r\n\r\n    /* checking... 
*/\r\n    while (i_pos < i_src) {\r\n        curr_byte = src[i_pos++];\r\n        curr_bit_count = 8;\r\n        switch (curr_byte) {\r\n        case 0:\r\n            if (b_found_start_code) {\r\n                b_dispose          = 1; /* start code of first slice: [00 00 01 00] */\r\n                b_found_start_code = 0;\r\n            }\r\n            leading_zeros++;\r\n            break;\r\n        case 1:\r\n            if (leading_zeros >= 2) {\r\n                /* find start code: [00 00 01] */\r\n                b_found_start_code = 1;\r\n                if (last_bit_count) {\r\n                    /* terminate the fixing work before new start code */\r\n                    last_bit_count = 0;\r\n                    dst[i_dst++]   = 0; /* insert the dispose byte */\r\n                }\r\n            }\r\n            leading_zeros = 0;\r\n            break;\r\n        case 2:\r\n            if (b_dispose && leading_zeros == 2) {\r\n                /* dispose the pseudo code, two bits */\r\n                curr_bit_count = 6;\r\n            }\r\n            leading_zeros = 0;\r\n            break;\r\n        default:\r\n            if (b_found_start_code) {\r\n                if (curr_byte == SC_SEQUENCE_HEADER || curr_byte == SC_USER_DATA || curr_byte == SC_EXTENSION) {\r\n                    b_dispose = 0;\r\n                } else {\r\n                    b_dispose = 1;\r\n                }\r\n                b_found_start_code = 0;\r\n            }\r\n            leading_zeros = 0;\r\n            break;\r\n        }\r\n\r\n        if (curr_bit_count == 8) {\r\n            if (last_bit_count == 0) {\r\n                dst[i_dst++] = curr_byte;\r\n            } else {\r\n                dst[i_dst++] = ((last_byte & BITMASK[last_bit_count]) | ((curr_byte & BITMASK[8 - last_bit_count]) >> last_bit_count));\r\n                last_byte    = (curr_byte << (8 - last_bit_count)) & BITMASK[last_bit_count];\r\n            }\r\n        } else {\r\n            
if (last_bit_count == 0) {\r\n                last_byte      = curr_byte;\r\n                last_bit_count = curr_bit_count;\r\n            } else {\r\n                dst[i_dst++]   = ((last_byte & BITMASK[last_bit_count]) | ((curr_byte & BITMASK[8 - last_bit_count]) >> last_bit_count));\r\n                last_byte      = (curr_byte << (8 - last_bit_count)) & BITMASK[last_bit_count - 2];\r\n                last_bit_count = last_bit_count - 2;\r\n            }\r\n        }\r\n    }\r\n\r\n    if (last_bit_count != 0 && last_byte != 0) {\r\n        dst[i_dst++] = last_byte;\r\n    }\r\n\r\n    return i_dst;\r\n}\r\n\r\n// ---------------------------------------------------------------------------\r\n// find the first start code in byte stream\r\n// return the byte address if found, or NULL on failure\r\nconst uint8_t *\r\nfind_start_code(const uint8_t *data, int len)\r\n{\r\n    while (len >= 4 && (*(uint32_t *)data & 0x00FFFFFF) != 0x00010000) {\r\n        data++;\r\n        len--;\r\n    }\r\n\r\n    return len >= 4 ? 
data : NULL;\r\n}\r\n\r\n// ---------------------------------------------------------------------------\r\n// find the first picture or sequence start code in byte stream\r\nint32_t\r\nfind_pic_start_code(uint8_t prevbyte3, uint8_t prevbyte2, uint8_t prevbyte1, const uint8_t *data, int32_t len)\r\n{\r\n#define ISPIC(x) ((x) == 0xB0 || (x) == 0xB1 || (x) == 0xB3 || (x) == 0xB6 || (x) == 0xB7)\r\n\r\n    const uint8_t *p = NULL;\r\n    const uint8_t *data0 = data;\r\n    const int32_t  len0  = len;\r\n\r\n    /* check start code: 00 00 01 xx */\r\n    if (/*..*/ len >= 1 && (prevbyte3 == 0) && (prevbyte2 == 0) && (prevbyte1 == 1)) {\r\n        if (ISPIC(data[0])) {\r\n            return -3;          // found start code (position: -3)\r\n        }\r\n    } else if (len >= 2 && (prevbyte2 == 0) && (prevbyte1 == 0) && (data[0] == 1)) {\r\n        if (ISPIC(data[1])) {\r\n            return -2;          // found start code (position: -2)\r\n        }\r\n    } else if (len >= 3 && (prevbyte1 == 0) && (data[0] == 0) && (data[1] == 1)) {\r\n        if (ISPIC(data[2])) {\r\n            return -1;          // found start code (position: -1)\r\n        }\r\n    }\r\n\r\n    /* check start code: 00 00 01 xx, ONLY in data buffer */\r\n    while (((p = (uint8_t *)find_start_code(data, len)) != NULL) && !ISPIC(p[3])) {\r\n        len -= (int32_t)(p - data + 4);\r\n        data = p + 4;\r\n    }\r\n\r\n    return (int32_t)(p != NULL ? p - data0 : len0 + 1);\r\n\r\n#undef ISPIC\r\n}\r\n"
  },
  {
    "path": "source/common/bitstream.h",
    "content": "/*\r\n * bitstream.h\r\n *\r\n * Description of this file:\r\n *    Bitstream functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_BITSTREAM_H\r\n#define DAVS2_BITSTREAM_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#include \"common.h\"\r\n\r\n#define bs_init FPFX(bs_init)\r\nvoid bs_init(davs2_bs_t *bs, uint8_t *p_data, int i_data);\r\n#define bs_align FPFX(bs_align)\r\nvoid bs_align(davs2_bs_t *bs);\r\n#define bs_left_bytes FPFX(bs_left_bytes)\r\nint  bs_left_bytes(davs2_bs_t *bs);\r\n#define found_slice_header FPFX(found_slice_header)\r\nint  found_slice_header(davs2_bs_t *bs);\r\n#define bs_get_start_code FPFX(bs_get_start_code)\r\nint  bs_get_start_code(davs2_bs_t *bs);\r\n#define 
bs_dispose_pseudo_code FPFX(bs_dispose_pseudo_code)\r\nint  bs_dispose_pseudo_code(uint8_t *dst, uint8_t *src, int i_src);\r\n#define find_start_code FPFX(find_start_code)\r\nconst uint8_t * find_start_code(const uint8_t *data, int len);\r\n#define find_pic_start_code FPFX(find_pic_start_code)\r\nint32_t find_pic_start_code(uint8_t prevbyte3, uint8_t prevbyte2, uint8_t prevbyte1, const uint8_t *data, int32_t len);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_BITSTREAM_H\r\n"
  },
  {
    "path": "source/common/block_info.cc",
    "content": "/*\r\n * block_info.cc\r\n *\r\n * Description of this file:\r\n *    Block-infomation functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"block_info.h\"\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * local & global variables (const tables)\r\n * ===========================================================================\r\n */\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic 
ALWAYS_INLINE \r\ncu_t *get_neighbor_cu_in_slice(davs2_t *h, cu_t *p_cur, int scu_x, int scu_y, int x4x4, int y4x4)\r\n{\r\n    const int shift_4x4 = MIN_CU_SIZE_IN_BIT - MIN_PU_SIZE_IN_BIT;\r\n\r\n    if (x4x4 < 0 || y4x4 < 0 || x4x4 >= h->i_width_in_spu || y4x4 >= h->i_height_in_spu) {\r\n        return NULL;\r\n    } else if ((scu_x << shift_4x4) <= x4x4 && (scu_y << shift_4x4) <= y4x4) {\r\n        return p_cur;\r\n    } else {\r\n        cu_t *p_neighbor = &h->scu_data[(y4x4 >> 1) * h->i_width_in_scu + (x4x4 >> 1)];\r\n        return p_neighbor->i_slice_nr == p_cur->i_slice_nr ? p_neighbor : NULL;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * (x_4x4, y_4x4) - address of the neighboring 4x4 transform block in the picture\r\n * (scu_x, scu_y) - SCU address of the current CU in the picture\r\n */\r\nint get_neighbor_cbp_y(davs2_t *h, int x_4x4, int y_4x4, int scu_x, int scu_y, cu_t *p_cu)\r\n{\r\n    cu_t *p_neighbor = get_neighbor_cu_in_slice(h, p_cu, scu_x, scu_y, x_4x4, y_4x4);\r\n\r\n    if (p_neighbor == NULL) {\r\n        return 0;\r\n    } else if (p_neighbor->i_trans_size == TU_SPLIT_NON) {\r\n        return p_neighbor->i_cbp & 1;   // TU not split: return the CBP of the luma block directly\r\n    } else {\r\n        int cbp     = p_neighbor->i_cbp;\r\n        int level   = p_neighbor->i_cu_level - MIN_PU_SIZE_IN_BIT;\r\n        int cu_mask = (1 << level) - 1;\r\n\r\n        x_4x4 &= cu_mask;\r\n        y_4x4 &= cu_mask;\r\n\r\n        if (p_neighbor->i_trans_size == TU_SPLIT_VER) {           // vertical split\r\n            x_4x4 >>= (level - 2);\r\n            return (cbp >> x_4x4) & 1;\r\n        } else if (p_neighbor->i_trans_size == TU_SPLIT_HOR) {    // horizontal split\r\n            y_4x4 >>= (level - 2);\r\n            return (cbp >> y_4x4) & 1;\r\n        } else {                                                  // quad (cross) split\r\n            x_4x4 >>= (level - 1);\r\n            y_4x4 >>= (level - 1);\r\n            return (cbp >> (x_4x4 + (y_4x4 << 1))) & 1;\r\n        }\r\n    }\r\n}\r\n"
  },
  {
    "path": "source/common/block_info.h",
    "content": "/*\r\n * block_info.h\r\n *\r\n * Description of this file:\r\n *    Block Infomation functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_BLOCK_INFO_H\r\n#define DAVS2_BLOCK_INFO_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#define get_neighbor_cbp_y FPFX(get_neighbor_cbp_y)\r\nint  get_neighbor_cbp_y(davs2_t *h, int xN, int yN, int scu_x, int scu_y, cu_t *p_cu);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif // DAVS2_BLOCK_INFO_H\r\n"
  },
  {
    "path": "source/common/common.cc",
    "content": "/*\r\n * common.cc\r\n *\r\n * Description of this file:\r\n *    misc common functionsdefinition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include <stdarg.h>\r\n\r\n#if __ARM_ARCH_7__\r\n#include <android/log.h>\r\n#define LOGI(format,...) 
__android_log_print(ANDROID_LOG_INFO, \"davs2\",format,##__VA_ARGS__)\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * macros\r\n * ===========================================================================\r\n */\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * global variables\r\n * ===========================================================================\r\n */\r\n#if HIGH_BIT_DEPTH\r\nint max_pel_value = 255;\r\nint g_bit_depth   = 8;\r\nint g_dc_value    = 128;\r\n#endif\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * trace\r\n * ===========================================================================\r\n */\r\n\r\n#if AVS2_TRACE\r\n\r\n/**\r\n * ===========================================================================\r\n * trace file\r\n * ===========================================================================\r\n */\r\n\r\nFILE *h_trace = NULL;           /* global file handle for trace file */\r\nint g_bit_count = 0;            /* global bit    count for trace */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint avs2_trace_init(davs2_t *h, char *psz_trace_file)\r\n{\r\n    if (strlen(psz_trace_file) > 0) {\r\n        /* create or truncate the trace file */\r\n        h_trace = fopen(psz_trace_file, \"wt\");\r\n        if (!h_trace) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"trace: can't write to trace file\");\r\n            return -1;\r\n        } else if (!davs2_is_regular_file(fileno(h_trace))) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"trace: incompatible with non-regular file\");\r\n            return -1;\r\n        }\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid avs2_trace_destroy(void)\r\n{\r\n    if (h_trace) {\r\n        
fclose(h_trace);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint avs2_trace(const char *psz_fmt, ...)\r\n{\r\n    int len = 0;\r\n\r\n    /* append to the trace file */\r\n    if (h_trace) {\r\n        va_list arg;\r\n        va_start(arg, psz_fmt);\r\n\r\n        len = vfprintf(h_trace, psz_fmt, arg);\r\n        fflush(h_trace);\r\n        va_end(arg);\r\n    }\r\n\r\n    return len;\r\n}\r\n\r\nvoid avs2_trace_string(char *trace_string, int value, int len)\r\n{\r\n    int i, chars;\r\n\r\n    avs2_trace(\"@\");\r\n    chars = avs2_trace(\"%i\", g_bit_count);\r\n\r\n    while (chars++ < 6) {\r\n        avs2_trace(\" \");\r\n    }\r\n\r\n    chars += avs2_trace(\"%s\", trace_string);\r\n\r\n    while (chars++ < 55) {\r\n        avs2_trace(\" \");\r\n    }\r\n\r\n    // align bit-pattern\r\n    if (len < 15) {\r\n        for (i = 0; i < 15 - len; i++) {\r\n            avs2_trace(\" \");\r\n        }\r\n    }\r\n\r\n    g_bit_count += len;\r\n    while (len >= 32) {\r\n        for (i = 0; i < 8; i++) {\r\n            avs2_trace(\"0\");\r\n        }\r\n\r\n        len -= 8;\r\n    }\r\n\r\n    // print bit-pattern\r\n    for (i = 0; i < len; i++) {\r\n        if (0x01 & (value >> (len - i - 1))) {\r\n            avs2_trace(\"1\");\r\n        } else {\r\n            avs2_trace(\"0\");\r\n        }\r\n    }\r\n\r\n    avs2_trace(\"  (%3d)\\n\", value);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * write out a trace string to the trace file\r\n */\r\nvoid avs2_trace_string2(char *trace_string, int bit_pattern, int value, int len)\r\n{\r\n    int i, chars;\r\n\r\n    avs2_trace(\"@\");\r\n    chars = avs2_trace(\"%i\", g_bit_count);\r\n\r\n    while (chars++ < 6) {\r\n        avs2_trace(\" \");\r\n    }\r\n\r\n    chars += avs2_trace(\"%s\", trace_string);\r\n\r\n    while (chars++ < 55) {\r\n        avs2_trace(\" \");\r\n    }\r\n\r\n    // align 
bit-pattern\r\n    if (len < 15) {\r\n        for (i = 0; i < 15 - len; i++) {\r\n            avs2_trace(\" \");\r\n        }\r\n    }\r\n\r\n    // print bit-pattern\r\n    g_bit_count += len;\r\n    for (i = 1; i <= len; i++) {\r\n        if ((bit_pattern >> (len - i)) & 0x1) {\r\n            avs2_trace(\"1\");\r\n        } else {\r\n            avs2_trace(\"0\");\r\n        }\r\n    }\r\n\r\n    avs2_trace(\"  (%3d)\\n\", value);\r\n}\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint xl_init(xlist_t *const xlist)\r\n{\r\n    if (xlist == NULL) {\r\n        return -1;\r\n    }\r\n\r\n    /* set list empty */\r\n    xlist->p_list_head = NULL;\r\n    xlist->p_list_tail = NULL;\r\n\r\n    /* set node number */\r\n    xlist->i_node_num = 0;\r\n\r\n    /* create lock and conditions */\r\n    if (davs2_thread_mutex_init(&xlist->list_mutex, NULL) < 0 ||\r\n        davs2_thread_cond_init(&xlist->list_cond, NULL) < 0) {\r\n        davs2_log(NULL, DAVS2_LOG_ERROR, \"Failed to init lock for xl_init()\");\r\n        return -1;\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid xl_destroy(xlist_t *const xlist)\r\n{\r\n    if (xlist == NULL) {\r\n        return;\r\n    }\r\n\r\n    /* destroy lock and conditions */\r\n    davs2_thread_mutex_destroy(&xlist->list_mutex);\r\n    davs2_thread_cond_destroy(&xlist->list_cond);\r\n\r\n    /* clear */\r\n    memset(xlist, 0, sizeof(xlist_t));\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid xl_append(xlist_t *const xlist, void *node)\r\n{\r\n    node_t *new_node = (node_t *)node;\r\n\r\n    if (xlist == NULL) {\r\n        return;                       /* error */\r\n    }\r\n\r\n    new_node->next = NULL;            /* set NULL */\r\n\r\n    davs2_thread_mutex_lock(&xlist->list_mutex);   /* lock */\r\n\r\n    /* append this node */\r\n  
  if (xlist->p_list_tail != NULL) {\r\n        /* append this node at tail */\r\n        xlist->p_list_tail->next = new_node;\r\n    } else {\r\n        xlist->p_list_head = new_node;\r\n    }\r\n\r\n    xlist->p_list_tail = new_node;    /* point to the tail node */\r\n    xlist->i_node_num++;              /* increase the node number */\r\n\r\n    davs2_thread_mutex_unlock(&xlist->list_mutex);  /* unlock */\r\n\r\n    /* all is done, notify one waiting thread to work */\r\n    davs2_thread_cond_signal(&xlist->list_cond);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid *xl_remove_head(xlist_t *const xlist, const int wait)\r\n{\r\n    node_t *node = NULL;\r\n\r\n    if (xlist == NULL) {\r\n        return NULL;                  /* error */\r\n    }\r\n\r\n    davs2_thread_mutex_lock(&xlist->list_mutex);\r\n\r\n    if (wait && !xlist->i_node_num) {\r\n        davs2_thread_cond_wait(&xlist->list_cond, &xlist->list_mutex);\r\n    }\r\n\r\n    /* remove the header node */\r\n    if (xlist->i_node_num > 0) {\r\n        node = xlist->p_list_head;    /* point to the header node */\r\n\r\n        /* modify the list */\r\n        xlist->p_list_head = node->next;\r\n\r\n        if (xlist->p_list_head == NULL) {\r\n            /* there are no any node in this list, reset the tail pointer */\r\n            xlist->p_list_tail = NULL;\r\n        }\r\n\r\n        xlist->i_node_num--;          /* decrease the number */\r\n    }\r\n\r\n    davs2_thread_mutex_unlock(&xlist->list_mutex);\r\n\r\n    return node;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid *xl_remove_head_ex(xlist_t *const xlist)\r\n{\r\n    node_t *node = NULL;\r\n\r\n    if (xlist == NULL) {\r\n        return NULL;                  /* error */\r\n    }\r\n\r\n    /* remove the header node */\r\n    if (xlist->i_node_num > 0) {\r\n        node = xlist->p_list_head;    /* point to the header node 
*/\r\n\r\n        /* modify the list */\r\n        xlist->p_list_head = node->next;\r\n\r\n        if (xlist->p_list_head == NULL) {\r\n            /* no nodes remain in this list, reset the tail pointer */\r\n            xlist->p_list_tail = NULL;\r\n        }\r\n\r\n        xlist->i_node_num--;          /* decrease the number */\r\n    }\r\n\r\n    return node;\r\n}\r\n\r\n/**\r\n * ===========================================================================\r\n * davs2_log\r\n * ===========================================================================\r\n */\r\n\r\n#ifdef _MSC_VER\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid davs2_set_font_color(int color)\r\n{\r\n    static const WORD colors[] = {\r\n        FOREGROUND_INTENSITY | FOREGROUND_GREEN,                   // green\r\n        FOREGROUND_INTENSITY | FOREGROUND_GREEN | FOREGROUND_BLUE, // cyan\r\n        FOREGROUND_INTENSITY | FOREGROUND_RED | FOREGROUND_GREEN,  // yellow\r\n        FOREGROUND_INTENSITY | FOREGROUND_RED,                     // red\r\n        FOREGROUND_INTENSITY | FOREGROUND_RED | FOREGROUND_BLUE,   // magenta\r\n    };\r\n    SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE), colors[color]);\r\n}\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\ndavs2_log_default(int i_log_level, const char *psz_fmt)\r\n{\r\n#if !defined(_MSC_VER)\r\n    static const char str_color_clear[] = \"\\033[0m\";  // \"\\033[0m\"\r\n    static const char str_color[][16] = {\r\n    /*    green         cyan         yellow             red     */\r\n        \"\\033[1;32m\", \"\\033[1;36m\", \"\\033[1;33m\",   \"\\033[1;31m\"\r\n    };\r\n    const char *cur_color = str_color[i_log_level];\r\n#endif\r\n    static const char *null_prefix = \"\";\r\n    const char *psz_prefix = null_prefix;\r\n\r\n    switch (i_log_level) {\r\n    case DAVS2_LOG_ERROR:\r\n        psz_prefix = 
\"[davs2 error]: \";\r\n        break;\r\n    case DAVS2_LOG_WARNING:\r\n        psz_prefix = \"[davs2 warn]: \";\r\n        break;\r\n    case DAVS2_LOG_INFO:\r\n        psz_prefix = \"[davs2 info]: \";\r\n        break;\r\n    case DAVS2_LOG_DEBUG:\r\n        psz_prefix = \"[davs2 debug]: \";\r\n        break;\r\n    default:\r\n        psz_prefix = \"[davs2 *]: \";\r\n#if !defined(_MSC_VER)\r\n        cur_color  = str_color[0];\r\n#endif\r\n        break;\r\n    }\r\n#if defined(_MSC_VER)\r\n    davs2_set_font_color(i_log_level); /* set color */\r\n    fprintf(stderr, \"%s%s\\n\", psz_prefix, psz_fmt);\r\n    davs2_set_font_color(0);     /* restore to white color */\r\n#else\r\n    fprintf(stderr, \"%s%s%s%s\\n\", cur_color, psz_prefix, psz_fmt, str_color_clear);\r\n#endif\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_log(void *handle, int level, const char *format, ...)\r\n{\r\n    davs2_log_t *h = (davs2_log_t *)handle;\r\n    int i_enable_level = 0;\r\n\r\n    if (h != NULL) {\r\n        i_enable_level = h->i_log_level;\r\n    }\r\n\r\n    DAVS2_ASSERT(level >= 0 && level < DAVS2_LOG_MAX, \"Invalid log level %d\", level);\r\n\r\n    if (level >= i_enable_level) {\r\n        char message[2048] = { 0 };\r\n        \r\n        if (h != NULL) {\r\n            sprintf(message, \"%s: \", h->module_name);\r\n        }\r\n\r\n        va_list arg_ptr;\r\n        va_start(arg_ptr, format);\r\n        vsprintf(message + strlen(message), format, arg_ptr);\r\n        va_end(arg_ptr);\r\n\r\n        davs2_log_default(level, message);\r\n    }\r\n}\r\n"
  },
  {
    "path": "source/common/common.h",
    "content": "/*\r\n * common.h\r\n *\r\n * Description of this file:\r\n *    misc common functionsdefinition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_COMMON_H\r\n#define DAVS2_COMMON_H\r\n\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n/**\r\n * ===========================================================================\r\n * common include files\r\n * ===========================================================================\r\n */\r\n\r\n#include \"defines.h\"\r\n#include \"osdep.h\"\r\n#include \"davs2.h\"\r\n\r\n#include <stdlib.h>\r\n#include <string.h>\r\n#include <assert.h>\r\n#if (ARCH_X86 || ARCH_X86_64)\r\n#include <xmmintrin.h>\r\n#endif\r\n\r\n/**\r\n * 
===========================================================================\r\n * basic type defines\r\n * ===========================================================================\r\n */\r\n\r\n#if HIGH_BIT_DEPTH\r\ntypedef uint16_t                pel_t;      /* type for pixel value */\r\ntypedef uint64_t                pel4_t;     /* type for 4-pixels value */\r\ntypedef int32_t                 itr_t;      /* intra prediction temp */\r\n#else\r\ntypedef uint8_t                 pel_t;      /* type for pixel value */\r\ntypedef uint32_t                pel4_t;     /* type for 4-pixels value */\r\ntypedef int16_t                 itr_t;      /* intra prediction temp */\r\n#endif\r\n\r\ntypedef int16_t                 coeff_t;    /* type for transform coefficient */\r\ntypedef int16_t                 mct_t;       /* motion compensation temp*/\r\ntypedef uint8_t                 bool_t;     /* type for flag */\r\n\r\ntypedef struct cu_t             cu_t;\r\ntypedef struct davs2_log_t      davs2_log_t;\r\ntypedef struct davs2_t          davs2_t;\r\ntypedef struct davs2_mgr_t      davs2_mgr_t;\r\ntypedef struct davs2_outpic_t   davs2_outpic_t;\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * macros\r\n * ===========================================================================\r\n */\r\n#define IS_HOR_PU_PART(mode)          (((1 << (mode)) & MASK_HOR_PU_MODES) != 0)\r\n#define IS_VER_PU_PART(mode)          (((1 << (mode)) & MASK_VER_PU_MODES) != 0)\r\n#define IS_INTRA_MODE(mode)           (((1 << (mode)) & MASK_INTRA_MODES ) != 0)\r\n#define IS_INTER_MODE(mode)           (((1 << (mode)) & MASK_INTER_MODES ) != 0)\r\n#define IS_NOSKIP_INTER_MODE(mode)    (((1 << (mode)) & MASK_INTER_NOSKIP) != 0)\r\n#define IS_SKIP_MODE(mode)            ((mode) == PRED_SKIP)\r\n\r\n#define IS_INTRA(cu)             IS_INTRA_MODE((cu)->i_cu_type)\r\n#define IS_INTER(cu)             IS_INTER_MODE((cu)->i_cu_type)\r\n#define 
IS_NOSKIP_INTER(cu)      IS_NOSKIP_INTER_MODE((cu)->i_cu_type)\r\n#define IS_SKIP(cu)              IS_SKIP_MODE((cu)->i_cu_type)\r\n\r\nstatic ALWAYS_INLINE int DAVS2_MAX(int a, int b)\r\n{\r\n    return ((a) > (b) ? (a) : (b));\r\n}\r\nstatic ALWAYS_INLINE int DAVS2_MIN(int a, int b)\r\n{\r\n    return ((a) < (b) ? (a) : (b));\r\n}\r\n#define DAVS2_ABS(a)             ((a) < 0 ? (-(a)) : (a))\r\n#define DAVS2_CLIP1(a)           (pel_t)((a) > max_pel_value ? max_pel_value : ((a) < 0 ? 0 : (a)))\r\n\r\nstatic ALWAYS_INLINE int DAVS2_CLIP3(int L, int H, int v)\r\n{\r\n    return (((v) < (L)) ? (L) : (((v) > (H)) ? (H) : (v)));\r\n}\r\n\r\n#define DAVS2_SWAP(x,y)          { (y)=(y)^(x); (x)=(y)^(x); (y)=(x)^(y); }\r\n#define DAVS2_ALIGN(x, a)        (((x) + ((a) - 1)) & (~((a) - 1)))\r\n\r\n#define LCU_STRIDE               (MAX_CU_SIZE)\r\n#define LCU_BUF_SIZE             (LCU_STRIDE * MAX_CU_SIZE)          /* size of LCU buffer size */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * multi line macros\r\n */\r\n#if defined(_MSC_VER) || defined(__INTEL_COMPILER)\r\n#define MULTI_LINE_MACRO_BEGIN  do {\r\n#define MULTI_LINE_MACRO_END \\\r\n    __pragma(warning(push))\\\r\n    __pragma(warning(disable:4127))\\\r\n    } while (0)\\\r\n    __pragma(warning(pop))\r\n#else\r\n#define MULTI_LINE_MACRO_BEGIN   {\r\n#define MULTI_LINE_MACRO_END     }\r\n#endif\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * memory malloc\r\n */\r\n#define CHECKED_MALLOC(var, type, size) \\\r\n    MULTI_LINE_MACRO_BEGIN\\\r\n    (var) = (type)davs2_malloc(size);\\\r\n    if ((var) == NULL) {\\\r\n        goto fail;\\\r\n        }\\\r\n    MULTI_LINE_MACRO_END\r\n\r\n#define CHECKED_MALLOCZERO(var, type, size) \\\r\n    MULTI_LINE_MACRO_BEGIN\\\r\n    CHECKED_MALLOC(var, type, size);\\\r\n    memset(var, 0, size);\\\r\n    MULTI_LINE_MACRO_END\r\n\r\n\r\n/**\r\n * 
===========================================================================\r\n * enum defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * task status */\r\nenum task_status_t {\r\n    TASK_FREE    = 0,           /* task is free, could be used */\r\n    TASK_BUSY    = 1            /* task busy */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * coding types */\r\nenum coding_type_e {\r\n    FRAME_CODING = 0,           /* frame coding */\r\n    FIELD_CODING = 1            /* field coding */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * picture struct */\r\nenum pic_struct_e {\r\n    FIELD = 0,                  /* field picture struct */\r\n    FRAME = 1                   /* frame picture struct */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * slice type */\r\nenum {\r\n    AVS2_I_SLICE = 0,           /* slice type: I frame */\r\n    AVS2_P_SLICE = 1,           /* slice type: P frame */\r\n    AVS2_B_SLICE = 2,           /* slice type: B frame */\r\n    AVS2_G_SLICE = 3,           /* AVSS2 type: G frame, should be output (as I frame) */\r\n    AVS2_F_SLICE = 4,           /* slice type: F frame */\r\n    AVS2_S_SLICE = 5,           /* AVSS2 type: S frame */\r\n    AVS2_GB_SLICE = 6,          /* AVSS2 type: GB frame, should not be output */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * start codes */\r\nenum start_code_e {\r\n    SC_SEQUENCE_HEADER = 0xB0,  /* sequence header start code */\r\n    SC_SEQUENCE_END    = 0xB1,  /* sequence end    start code */\r\n    SC_USER_DATA       = 0xB2,  /* user data       start code */\r\n    SC_INTRA_PICTURE   = 0xB3,  /* intra picture   start code */\r\n    SC_EXTENSION       = 
0xB5,  /* extension       start code */\r\n    SC_INTER_PICTURE   = 0xB6,  /* inter picture   start code */\r\n    SC_VIDEO_EDIT_CODE = 0xB7,  /* video edit      start code */\r\n    SC_SLICE_CODE_MIN  = 0x00,  /* min slice       start code */\r\n    SC_SLICE_CODE_MAX  = 0x8F   /* max slice       start code */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * all prediction modes (n = N/2) */\r\nenum cu_pred_mode_e {\r\n    /* all inter modes: 8                                           */\r\n    PRED_SKIP  = 0,      /*  skip/direct           block: 1  */\r\n    PRED_2Nx2N = 1,      /*  2N x 2N               block: 1  */\r\n    PRED_2NxN  = 2,      /*  2N x  N               block: 2  */\r\n    PRED_Nx2N  = 3,      /*   N x 2N               block: 2  */\r\n    PRED_2NxnU = 4,      /*  2N x  n  +  2N x 3n   block: 2  */\r\n    PRED_2NxnD = 5,      /*  2N x 3n  +  2N x  n   block: 2  */\r\n    PRED_nLx2N = 6,      /*   n x 2N  +  3n x 2N   block: 2  */\r\n    PRED_nRx2N = 7,      /*  3n x 2N  +   n x 2N   block: 2  */\r\n    /* all intra modes: 4                                           */\r\n    PRED_I_2Nx2N = 8,      /*  2N x 2N               block: 1  */\r\n    PRED_I_NxN   = 9,      /*   N x  N               block: 4  */\r\n    PRED_I_2Nxn  = 10,     /*  2N x  n  (32x8, 16x4) block: 4  */\r\n    PRED_I_nx2N  = 11,     /*   n x 2N  (8x32, 4x16) block: 4  */\r\n    /* mode numbers                                                 */\r\n    MAX_PRED_MODES  = 12,     /* total 12 pred modes, include:    */\r\n    MAX_INTER_MODES = 8,      /*       8 inter modes              */\r\n    MAX_INTRA_MODES = 4,      /*       4 intra modes              */\r\n    /* masks                                                        */\r\n    MASK_HOR_TU_MODES = 0x0430, /* mask for horizontal TU partition */\r\n    MASK_VER_TU_MODES = 0x08C0, /* mask for vertical   TU partition */\r\n    MASK_HOR_PU_MODES = 0x0434, /* mask for 
horizontal PU partition */\r\n    MASK_VER_PU_MODES = 0x08C8, /* mask for vertical   PU partition */\r\n    MASK_INTER_MODES  = 0x00FF, /* mask for inter modes             */\r\n    MASK_INTER_NOSKIP = 0x00FE, /* mask for inter modes except skip */\r\n    MASK_INTRA_MODES  = 0x0F00  /* mask for intra modes             */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * splitting type of transform unit */\r\nenum tu_split_type_e {\r\n    TU_SPLIT_INVALID  = -1,     /*      invalid split type          */\r\n    TU_SPLIT_NON      = 0,      /*          not split               */\r\n    TU_SPLIT_HOR      = 1,      /* horizontally split into 4 blocks */\r\n    TU_SPLIT_VER      = 2,      /*   vertically split into 4 blocks */\r\n    TU_SPLIT_CROSS    = 3,      /*    cross     split into 4 blocks */\r\n    NUM_TU_SPLIT_TYPE = 4       /* number of transform split types  */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * pu partition */\r\nenum PU_PART {\r\n    /* square */\r\n    PART_4x4, PART_8x8, PART_16x16, PART_32x32, PART_64x64,\r\n    /* rectangular */\r\n    PART_8x4, PART_4x8,\r\n    PART_16x8, PART_8x16,\r\n    PART_32x16, PART_16x32,\r\n    PART_64x32, PART_32x64,\r\n    /* asymmetrical (0.75, 0.25) */\r\n    PART_16x12, PART_12x16, PART_16x4, PART_4x16,\r\n    PART_32x24, PART_24x32, PART_32x8, PART_8x32,\r\n    PART_64x48, PART_48x64, PART_64x16, PART_16x64,\r\n    /* max number of partitions */\r\n    MAX_PART_NUM\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * DCT pattern */\r\nenum dct_pattern_e {\r\n    DCT_DEAULT,      /* default */\r\n    DCT_HALF,        /* square block: top-left 1/2 of width and height (1/4 of the area); non-square block: top-left 1/2 x 1/2 */\r\n    DCT_QUAD,        /* square block: top-left 1/4 of width and height (1/16 of the area); non-square block: top-left 1/4 x 1/4 */\r\n    /* max number of DCT pattern */\r\n    DCT_PATTERN_NUM\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * 
context mode */\r\nenum context_mode_e {\r\n    INTRA_PRED_VER     = 0,     /* intra vertical prediction */\r\n    INTRA_PRED_HOR     = 1,     /* intra horizontal prediction */\r\n    INTRA_PRED_DC_DIAG = 2      /* intra DC prediction */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * image component index */\r\nenum img_component_index_e {\r\n    IMG_Y = 0,         /* image component: Y */\r\n    IMG_U = 1,         /* image component: Cb */\r\n    IMG_V = 2,         /* image component: Cr */\r\n    IMG_COMPONENTS = 3          /* number of image components */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * prediction direction for inter frame */\r\nenum inter_pred_direction_e {\r\n    INVALID_REF = -1,           /* invalid */\r\n    B_BWD = 0,                  /* backward */\r\n    B_FWD = 1                   /* forward */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * neighboring position used in inter coding (MVP) or intra prediction */\r\nenum neighbor_block_pos_e {\r\n    BLK_TOPLEFT     = 0,        /* D: top-left   block: (x     - 1, y     - 1) */\r\n    BLK_TOP         = 1,        /* B: top        block: (x        , y     - 1) */\r\n    BLK_LEFT        = 2,        /* A: left       block: (x     - 1, y        ) */\r\n    BLK_TOPRIGHT    = 3,        /* C: top-right  block: (x + W    , y     - 1) */\r\n    BLK_TOP2        = 4,        /* G: top        block: (x + W - 1, y     - 1) */\r\n    BLK_LEFT2       = 5,        /* F: left       block: (x     - 1, y + H - 1) */\r\n    BLK_COLLOCATED  = 6,         /* Col: mode of temporal neighbor */\r\n    NUM_INTER_NEIGHBOR = BLK_COLLOCATED + 1\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * spatial direct/skip modes for B and F frames */\r\nenum direct_skip_mode_e {\r\n    DS_NONE  = 
0,       /* no spatial direct/skip mode */\r\n\r\n    /* spatial direct/skip mode for B frame */\r\n    DS_B_BID = 1,        /* skip/direct mode: bi-direction */\r\n    DS_B_BWD = 2,        /*                 : backward direction */\r\n    DS_B_SYM = 3,        /*                 : symmetrical direction */\r\n    DS_B_FWD = 4,        /*                 : forward direction */\r\n\r\n    /* spatial direct/skip mode for F frame */\r\n    DS_DUAL_1ST   = 1,        /* skip/direct mode: dual 1st */\r\n    DS_DUAL_2ND   = 2,        /*                 : dual 2nd */\r\n    DS_SINGLE_1ST = 3,        /*                 : single 1st */\r\n    DS_SINGLE_2ND = 4,        /*                 : single 2nd */\r\n\r\n    /* max number */\r\n    DS_MAX_NUM    = 5         /* max spatial direct/skip mode number of B or F frames */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * intra neighbor availability (bit index used by IS_NEIGHBOR_AVAIL) */\r\nenum intra_avail_e {\r\n    MD_I_LEFT      = 0,\r\n    MD_I_TOP       = 1,\r\n    MD_I_LEFT_DOWN = 2,\r\n    MD_I_TOP_RIGHT = 3,\r\n    MD_I_TOP_LEFT  = 4,\r\n    MD_I_NUM       = 5,\r\n#define IS_NEIGHBOR_AVAIL(i_avai, md)    ((i_avai) & (1 << (md)))\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * sao modes */\r\nenum sao_mode_e {\r\n    SAO_MODE_OFF   = 0,         /* sao mode: off */\r\n    SAO_MODE_MERGE = 1,         /* sao mode: merge */\r\n    SAO_MODE_NEW   = 2          /* sao mode: new */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * sao mode merge types */\r\nenum sao_mode_merge_type_e {\r\n    SAO_MERGE_LEFT      = 0,    /* sao merge type: left */\r\n    SAO_MERGE_ABOVE     = 1,    /* sao merge type: above */\r\n    NUM_SAO_MERGE_TYPES = 2     /* number of sao merge types */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * sao mode types */\r\nenum sao_mode_type_e {\r\n    SAO_TYPE_EO_0   = 
0,        /* sao mode type: EO - 0   */\r\n    SAO_TYPE_EO_90  = 1,        /* sao mode type: EO - 90  */\r\n    SAO_TYPE_EO_135 = 2,        /* sao mode type: EO - 135 */\r\n    SAO_TYPE_EO_45  = 3,        /* sao mode type: EO - 45  */\r\n    SAO_TYPE_BO     = 4         /* sao mode type: BO       */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * sao EO classes\r\n * the assignments depend on how the edgeType calculation is implemented */\r\nenum sao_EO_classes_e {\r\n    SAO_CLASS_EO_FULL_VALLEY = 0,\r\n    SAO_CLASS_EO_HALF_VALLEY = 1,\r\n    SAO_CLASS_EO_PLAIN       = 2,\r\n    SAO_CLASS_EO_HALF_PEAK   = 3,\r\n    SAO_CLASS_EO_FULL_PEAK   = 4,\r\n    SAO_CLASS_BO             = 5,\r\n    NUM_SAO_OFFSET           = 6\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * contexts for syntax elements */\r\n#define NUM_CUTYPE_CTX          6\r\n#define NUM_SPLIT_CTX           3     // CU depth\r\n#define NUM_INTRA_PU_TYPE_CTX   1\r\n/* prediction */\r\n#define NUM_MVD_CTX             3\r\n#define NUM_REF_NO_CTX          3\r\n#define NUM_DELTA_QP_CTX        4\r\n#define NUM_INTER_DIR_CTX       15\r\n#define NUM_INTER_DIR_DHP_CTX   3\r\n#define NUM_DMH_MODE_CTX        12\r\n#define NUM_AMP_CTX             2\r\n#define NUM_C_INTRA_MODE_CTX    3\r\n#define NUM_CTP_CTX             9\r\n#define NUM_INTRA_MODE_CTX      7\r\n#define NUM_TU_SPLIT_CTX        3\r\n#define WPM_NUM                 3\r\n#define NUM_DIR_SKIP_CTX        4     /* B Skip mode, F Skip mode */\r\n/* transform coefficients */\r\n#define NUM_BLOCK_TYPES         3\r\n#define NUM_MAP_CTX             11\r\n#define NUM_LAST_CG_CTX_LUMA    6\r\n#define NUM_LAST_CG_CTX_CHROMA  6\r\n#define NUM_SIGCG_CTX_LUMA      2\r\n#define NUM_SIGCG_CTX_CHROMA    1\r\n#define NUM_LAST_POS_CTX_LUMA   48\r\n#define NUM_LAST_POS_CTX_CHROMA 12\r\n#define NUM_COEFF_LEVEL_CTX     40\r\n#define NUM_LAST_CG_CTX         
(NUM_LAST_CG_CTX_LUMA+NUM_LAST_CG_CTX_CHROMA)\r\n#define NUM_SIGCG_CTX           (NUM_SIGCG_CTX_LUMA+NUM_SIGCG_CTX_CHROMA)\r\n#define NUM_LAST_POS_CTX        (NUM_LAST_POS_CTX_LUMA+NUM_LAST_POS_CTX_CHROMA)\r\n/* post-processing (SAO, ALF) */\r\n#define NUM_SAO_MERGE_FLAG_CTX  3\r\n#define NUM_SAO_MODE_CTX        1\r\n#define NUM_SAO_OFFSET_CTX      2\r\n#define NUM_INTER_DIR_MIN_CTX   2\r\n#define NUM_ALF_LCU_CTX         4     /* adaptive loop filter */\r\n\r\n/**\r\n * ===========================================================================\r\n * struct type defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * node\r\n */\r\ntypedef struct node_t   node_t;\r\nstruct node_t {\r\n    node_t      *next;                /* pointer to next node */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * xlist_t\r\n */\r\ntypedef struct xlist_t {\r\n    node_t              *p_list_head;     /* pointer to head of node list */\r\n    node_t              *p_list_tail;     /* pointer to tail of node list */\r\n    davs2_thread_cond_t  list_cond;       /* list condition variable */\r\n    davs2_thread_mutex_t list_mutex;      /* list mutex lock */\r\n    int                  i_node_num;      /* node number in the list */\r\n} xlist_t;\r\n\r\n\r\n#if defined(_MSC_VER) || defined(__ICL)\r\n#pragma warning(disable: 4201)        // non-standard extension used (nameless struct/union)\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * syntax context type */\r\ntypedef union context_t {\r\n    struct {\r\n        unsigned    cycno   : 2;      // 2  bits\r\n        unsigned    MPS     : 1;      // 1  bit\r\n        unsigned    LG_PMPS : 11;     // 11 bits\r\n    };\r\n    uint16_t        v;\r\n} context_t;\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n * syntax context management */\r\ntypedef struct context_set_t {\r\n    /* CU */\r\n    context_t cu_type_contexts         [NUM_CUTYPE_CTX];\r\n    context_t intra_pu_type_contexts   [NUM_INTRA_PU_TYPE_CTX];\r\n    context_t cu_split_flag            [NUM_SPLIT_CTX];\r\n    context_t transform_split_flag     [NUM_TU_SPLIT_CTX];\r\n    context_t shape_of_partition_index [NUM_AMP_CTX];\r\n    context_t pu_reference_index       [NUM_REF_NO_CTX];\r\n    context_t cbp_contexts             [NUM_CTP_CTX];\r\n    context_t mvd_contexts          [2][NUM_MVD_CTX];\r\n    /* inter prediction */\r\n    context_t pu_type_index            [NUM_INTER_DIR_CTX];    // b_pu_type_index[15] = f_pu_type_index[3] + dir_multi_hypothesis_mode[12]\r\n    context_t b_pu_type_min_index      [NUM_INTER_DIR_MIN_CTX];\r\n    context_t cu_subtype_index         [NUM_DIR_SKIP_CTX];  // B_Skip/B_Direct, F_Skip/F_Direct \r\n    context_t weighted_skip_mode       [WPM_NUM];\r\n    context_t delta_qp_contexts        [NUM_DELTA_QP_CTX];\r\n    /* intra prediction */\r\n    context_t intra_luma_pred_mode     [NUM_INTRA_MODE_CTX];\r\n    context_t intra_chroma_pred_mode   [NUM_C_INTRA_MODE_CTX];\r\n    /* transform coefficients */\r\n    context_t coeff_run             [2][NUM_BLOCK_TYPES][NUM_MAP_CTX];\r\n    context_t coeff_level              [NUM_COEFF_LEVEL_CTX];\r\n    context_t last_cg_contexts         [NUM_LAST_CG_CTX];\r\n    context_t sig_cg_contexts          [NUM_SIGCG_CTX];\r\n    context_t last_coeff_pos           [NUM_LAST_POS_CTX];\r\n    /* post-processing (SAO, ALF) */\r\n    context_t sao_mergeflag_context    [NUM_SAO_MERGE_FLAG_CTX];\r\n    context_t sao_mode_context         [NUM_SAO_MODE_CTX];\r\n    context_t sao_offset_context       [NUM_SAO_OFFSET_CTX];\r\n    context_t alf_lcu_enable_scmodel   [NUM_ALF_LCU_CTX * 3];\r\n} context_set_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * bitstream */\r\ntypedef struct davs2_bs_t {\r\n 
   uint8_t    *p_stream;             /* pointer to the code-buffer */\r\n    int         i_stream;             /* over code-buffer length, byte-oriented */\r\n    int         i_bit_pos;            /* actual position in the code-buffer, bit-oriented */\r\n#if !ARCH_X86_64\r\n    int         reserved;             /* reserved */\r\n#endif\r\n} davs2_bs_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * SAO parameters for component block */\r\ntypedef struct sao_param_t {\r\n    int         modeIdc;              // NEW, MERGE, OFF\r\n    int         typeIdc;              // NEW: EO_0, EO_90, EO_135, EO_45, BO. MERGE: left, above\r\n    int         startBand;            // BO: starting band index\r\n    int         startBand2;\r\n    int         offset[MAX_NUM_SAO_CLASSES];\r\n} sao_param_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * SAO parameters for LCU */\r\ntypedef struct sao_t {\r\n    sao_param_t planes[IMG_COMPONENTS];\r\n} sao_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * ALF parameters */\r\ntypedef struct alf_param_t {\r\n    int         num_coeff;\r\n    int         filters_per_group;\r\n    int         componentID;\r\n    int         filterPattern[ALF_NUM_VARS];\r\n    int         coeffmulti[ALF_NUM_VARS][ALF_MAX_NUM_COEF];  // luma: 16 sets, chroma: 1 set\r\n} alf_param_t;\r\n\r\n\r\ntypedef struct alf_var_t {\r\n    alf_param_t   img_param[IMG_COMPONENTS];\r\n    int           filterCoeffSym[ALF_NUM_VARS][ALF_MAX_NUM_COEF];\r\n    int           tab_region_coeff_idx[ALF_NUM_VARS];   /* coefficient look-up table for 16 regions */\r\n    uint8_t      *tab_lcu_region;                       /* region index look-up table for LCUs */\r\n} alf_var_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * reference index */\r\ntypedef union ref_idx_t {\r\n    struct {                          // 
nameless struct\r\n        int8_t  r[2];                 // ref 1st and 2nd, 4 bit (sign integer)\r\n    };\r\n    uint16_t    v;                    // v = ((r2 << 8) | (r1 & 0xFF)), 16-bit\r\n} ref_idx_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * motion vector */\r\ntypedef union mv_t {\r\n    struct {                          // nameless struct\r\n        int16_t x;                    // x, low  16-bit\r\n        int16_t y;                    // y, high 16-bit\r\n    };\r\n    uint32_t    v;                    // v = ((y << 16) | (x & 0xFFFF)), 32-bit\r\n} mv_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * coding block\r\n */\r\ntypedef union cb_t {\r\n    struct {                          /* nameless struct */\r\n        int8_t  x;                    /* start position (x, in pixel) within current CU */\r\n        int8_t  y;                    /* start position (y, in pixel) within current CU */\r\n        int8_t  w;                    /* block width  (in pixel) */\r\n        int8_t  h;                    /* block height (in pixel) */\r\n    };\r\n    uint32_t    v;                    /* used for fast operation for all components */\r\n} cb_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * motion vector */\r\ntypedef struct neighbor_inter_t {\r\n    mv_t        mv[2];                /* motion vectors */\r\n    int8_t      is_available;         /* is block available */\r\n    int8_t      i_dir_pred;           /* predict direction */\r\n    ref_idx_t   ref_idx;              /* reference indexes of 1st and 2nd frame */\r\n} neighbor_inter_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\ntypedef struct aec_t {\r\n    ALIGN32(uint8_t *p_buffer);\r\n    uint64_t    i_byte_buf;\r\n    int         i_byte_pos;\r\n    int         i_bytes;\r\n    int8_t      
i_bits_to_go;\r\n    bool_t      b_bit_error;          /* bit error in stream */\r\n    bool_t      b_val_bound;\r\n    bool_t      b_val_domain;         // is value in R domain 1 is R domain 0 is LG domain\r\n    uint32_t    i_s1;\r\n    uint32_t    i_t1;\r\n    uint32_t    i_value_s;\r\n    uint32_t    i_value_t;\r\n\r\n    /* context */\r\n    context_set_t   syn_ctx;              // pointer to struct of context models\r\n#if AVS2_TRACE\r\n    /* ---------------------------------------------------------------------------\r\n     * syntax element */\r\n#define         TRACESTRING_SIZE 128  // size of trace string\r\n    char        tracestring[TRACESTRING_SIZE]; // trace string\r\n#endif // AVS2_TRACE\r\n} aec_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * reference picture set (RPS) */\r\ntypedef struct rps_t {\r\n    int     ref_pic[AVS2_MAX_REFS]; /* delta COI of ref pic */\r\n    int     remove_pic[8];          /* delta COI of removed pic */\r\n    int     num_of_ref;             /* number of reference picture */\r\n    int     num_to_remove;          /* number of removed picture */\r\n    int     refered_by_others;      /* referenced by others */\r\n    int     reserved;               /* reserved 4 bytes */\r\n} rps_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * sequence set information */\r\ntypedef struct davs2_seq_t {\r\n    int     valid_flag;               /* is this sequence header valid ? 
*/\r\n    davs2_seq_info_t head;         /* sequence header information (output) */\r\n    int     sample_precision;         /* sample precision */\r\n    int     encoding_precision;       /* encoding precision */\r\n    int     bit_rate_lower;           /* bitrate (lower) */\r\n    int     bit_rate_upper;           /* bitrate (upper) */\r\n    int     i_enc_width;              /* sequence encoding width */\r\n    int     i_enc_height;             /* sequence encoding height */\r\n    int     log2_lcu_size;            /* largest coding block size */\r\n    bool_t  b_field_coding;           /* field coded sequence? */\r\n    bool_t  b_temporal_id_exist;      /* temporal id exist flag */\r\n    bool_t  enable_weighted_quant;    /* weight quant enable */\r\n    bool_t  enable_background_picture;/* background picture enabled? */\r\n    bool_t  enable_mhp_skip;          /* mhpskip enabled? */\r\n    bool_t  enable_dhp;               /* dhp enabled? */\r\n    bool_t  enable_wsm;               /* wsm enabled? */\r\n    bool_t  enable_amp;               /* AMP(asymmetric motion partitions) enabled? */\r\n    bool_t  enable_nsqt;              /* use NSQT? */\r\n    bool_t  enable_sdip;              /* use SDIP? */\r\n    bool_t  enable_2nd_transform;     /* secondary transform enabled? */\r\n    bool_t  enable_sao;               /* SAO enabled? */\r\n    bool_t  enable_alf;               /* ALF enabled? */\r\n    bool_t  enable_pmvr;              /* PMVR enabled? 
*/\r\n    bool_t  cross_loop_filter_flag;   /* cross loop filter flag */\r\n    int     picture_reorder_delay;    /* picture reorder delay */\r\n    int     num_of_rps;               /* rps set number */\r\n    rps_t   seq_rps[AVS2_GOP_NUM];    /* RPS at sequence level */\r\n    int16_t seq_wq_matrix[2][64];     /* sequence base weighting quantization matrix */\r\n} davs2_seq_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * davs2_frame_t */\r\ntypedef struct davs2_frame_t {\r\n    /* properties */\r\n    int64_t     i_pts;                /* user pts (presentation time stamp) */\r\n    int64_t     i_dts;                /* user dts (decoding time stamp) */\r\n\r\n    int         i_type;               /* frame type */\r\n    int         i_qp;\r\n\r\n    int         i_chroma_format;      /* chroma format    (for function davs2_write_a_frame) */\r\n    int         i_output_bit_depth;   /* output bit depth (for function davs2_write_a_frame) */\r\n    int         i_sample_bit_depth;   /* sample bit depth (for function davs2_write_a_frame) */\r\n    int         frm_decode_error;     /* is there any decoding error in this frame? 
*/\r\n\r\n    int         dist_refs[AVS2_MAX_REFS];  /* distance of reference frames, used for MV scaling */\r\n    int         dist_scale_refs[AVS2_MAX_REFS];  /* = (MULTI / dist_refs) */\r\n    int         i_poc;                /* POC (picture order count), used for MV scaling */\r\n    int         i_coi;                /* COI (coding order index) */\r\n    int         b_refered_by_others;  /* referenced by others */\r\n\r\n    /* planes */\r\n    int         i_plane;              /* number of planes */\r\n    int         i_width[3];           /* width  for Y/U/V */\r\n    int         i_lines[3];           /* height for Y/U/V */\r\n    int         i_stride[3];          /* stride for Y/U/V */\r\n\r\n    /* parallel */\r\n    uint32_t    i_ref_count;          /* the reference count, DO NOT move its position in this struct */\r\n\r\n    int         i_disposable;         /* what to do with the frame when the reference count is decreased to 0? */\r\n    /* 0: do nothing, 1: clean the frame, 2: free the frame */\r\n    /* frames with 'i_disposable' greater than 0 should NOT be referenced. 
*/\r\n\r\n    int          is_self_malloc;      /* is the buffer allocated by itself */\r\n    volatile int i_decoded_line;      /* latest lcu line that finished reconstruction */\r\n    volatile int i_parsed_lcu_xy;     /* parsed number of LCU */\r\n    int          i_conds;             /* number of conds */\r\n    davs2_thread_cond_t   cond_aec;   /* signal of AEC decoding */\r\n    davs2_thread_cond_t  *conds_lcu_row;  /* [LCU lines] */\r\n    int *num_decoded_lcu_in_row;      /* number of LCUs decoded in a row */ \r\n    davs2_thread_mutex_t mutex_frm;   /* the mutex */\r\n    davs2_thread_mutex_t mutex_recon; /* mutex of reconstruction threads */\r\n\r\n    /* buffers */\r\n    pel_t      *planes[3];            /* pointers to Y/U/V data buffer */\r\n    int8_t     *refbuf;               /* pointer to reference index buffer */\r\n    mv_t       *mvbuf;                /* pointer to motion vector buffer */\r\n} davs2_frame_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * weighting quantization */\r\ntypedef struct weighted_quant_t {\r\n    int         pic_wq_data_index;\r\n    int         wq_param;\r\n    int         wq_model;\r\n    int16_t     quant_param_undetail[6];\r\n    int16_t     quant_param_detail[6];\r\n    int16_t     cur_wq_matrix[4][64]; // [matrix_id][coef]\r\n    int16_t     wq_matrix[2][2][64];  // [matrix_id][detail/undetail][coef]\r\n    int16_t     seq_wq_matrix[2][64];\r\n    int16_t     pic_user_wq_matrix[2][64];\r\n    int16_t     wquant_param[2][6];\r\n} weighted_quant_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Run-Level pair */\r\ntypedef struct runlevel_pair_t {\r\n    int16_t run;\r\n    int16_t level;\r\n} runlevel_pair_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Run-Level info */\r\ntypedef struct runlevel_t {\r\n    ALIGN32(runlevel_pair_t  run_level[16]);     /* transform coefficients, max 32x32 */\r\n    
int         num_nonzero_cg;  // number of CGs with non-zero coefficients\r\n    uint32_t    reserved;\r\n    /* contexts pointer */\r\n    context_t(*p_ctx_run)[NUM_MAP_CTX];\r\n    context_t *p_ctx_level;\r\n    context_t *p_ctx_sig_cg;\r\n    context_t *p_ctx_last_cg;\r\n    context_t *p_ctx_last_pos_in_cg;\r\n\r\n    const int16_t(*avs_scan)[2];\r\n    const int16_t(*cg_scan)[2];\r\n    coeff_t       *p_res;\r\n    int            i_res;\r\n    int            b_swap_xy;\r\n    int            num_cg;\r\n    int            i_tu_level;\r\n    int            w_tr;\r\n    int            h_tr;\r\n} runlevel_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * LCU reconstruction info */\r\ntypedef struct lcu_rec_info_t {\r\n    ALIGN32(coeff_t     coeff_buf_y[LCU_BUF_SIZE]);\r\n    ALIGN32(coeff_t     coeff_buf_uv[2][LCU_BUF_SIZE >> 2]);\r\n} lcu_rec_info_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * LCU info */\r\ntypedef struct lcu_info_t {\r\n#if CTRL_AEC_THREAD\r\n    lcu_rec_info_t rec_info;\r\n#endif\r\n    sao_t      sao_param;                        /* SAO param for each LCU */\r\n    uint8_t    enable_alf[IMG_COMPONENTS];       /* ALF enabled for each LCU */\r\n} lcu_info_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * coding unit */\r\nstruct cu_t {\r\n    /* -------------------------------------------------------------\r\n     * variables needed for neighboring CU decoding */\r\n    int8_t      i_cu_level;\r\n    int8_t      i_cu_type;\r\n\r\n    int8_t      i_slice_nr;\r\n\r\n    int8_t      i_qp;\r\n    int8_t      i_cbp;\r\n    int8_t      i_trans_size;         /* tu_split_type_e */\r\n\r\n    /* -------------------------------------------------------------\r\n     */\r\n    int8_t      i_weighted_skipmode;\r\n    int8_t      i_md_directskip_mode;\r\n    int8_t      c_ipred_mode;         /* chroma intra prediction 
mode */\r\n    int8_t      i_dmh_mode;           /* dir_multi_hypothesis_mode */\r\n    int8_t      num_pu;               /* number of prediction units */\r\n\r\n    /* -------------------------------------------------------------\r\n     * buffers */\r\n    int8_t      b8pdir[4];\r\n    int8_t      intra_pred_modes[4];\r\n    int8_t      dct_pattern[6];       /* DCT pattern of each block, dct_pattern_e, 4 luma + 2 chroma blocks */\r\n    mv_t        mv[4][2];             /* [block_idx][1st/2nd] */\r\n    ref_idx_t   ref_idx[4];           /* [block_idx].r[1st/2nd] */\r\n\r\n    cb_t        pu[4];                /* used to reserve the size of PUs */\r\n};\r\n\r\n\r\n#include \"primitives.h\"\r\n\r\n/* get partition index for the given size */\r\nextern const uint8_t g_partition_map_tab[];\r\n#define PART_INDEX(w, h)    (g_partition_map_tab[((((w) >> 2) - 1) << 4) + ((h) >> 2) - 1])\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * output picture\r\n */\r\nstruct davs2_outpic_t {\r\n    ALIGN16(void        *magic);      /* must be the 1st member variable. 
do not change it */\r\n\r\n    davs2_frame_t      *frame;       /* the source frame */\r\n    davs2_seq_info_t *head;        /* sequence head used to decode the frame */\r\n    davs2_picture_t  *pic;         /* the output picture */\r\n\r\n    davs2_outpic_t     *next;        /* next node */\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * output picture list\r\n */\r\ntypedef struct davs2_output_t {\r\n    int               output;         /* output index of the next frame */\r\n    int               busy;           /* whether possibly one frame is being delivered */\r\n    int               num_output_pic; /* number of pictures to be output */\r\n    davs2_outpic_t  *pics;           /* output pictures */\r\n} davs2_output_t;\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * assemble elementary stream to a complete decodable unit (e.g., one frame),\r\n * the complete decodable unit is called ES unit\r\n */\r\ntypedef struct es_unit_t {\r\n    ALIGN16(void *magic);             /* must be the 1st member variable. 
do not change it */\r\n    davs2_bs_t    bs;                 /* bit-stream reader of this es_unit */\r\n    int64_t       pts;                /* presentation time stamp */\r\n    int64_t       dts;                /* decoding time stamp */\r\n    int           len;                /* length of valid data in byte stream buffer */\r\n    int           size;               /* buffer size */\r\n    uint8_t       data[1];            /* byte stream buffer */\r\n} es_unit_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * decoder task\r\n */\r\ntypedef struct davs2_task_t {\r\n    ALIGN32(int     task_id);         /* task id */\r\n    int             task_status;      /* 0: free; 1, busy */\r\n    davs2_mgr_t   *taskmgr;          /* the taskmgr */\r\n    es_unit_t      *curr_es_unit;     /* decoding ES unit */\r\n    davs2_thread_t  thread_decode;    /* handle of the decoding thread */\r\n} davs2_task_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstruct davs2_log_t {\r\n    int         i_log_level;          /* log level */\r\n    char        module_name[60];      /* module name */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * decoder manager\r\n */\r\nstruct davs2_mgr_t {\r\n    davs2_log_t         module_log;   /* log module */\r\n\r\n    volatile int        b_exit;       /* app signal to exit */\r\n    volatile int        b_flushing;   /* is being flushing */\r\n\r\n    davs2_param_t       param;        /* decoder param */\r\n    es_unit_t          *es_unit;      /* next input ES unit pointer */\r\n    davs2_seq_t         seq_info;     /* latest sequence head */\r\n\r\n    int                 i_tr_wrap_cnt;/* COI wrap count */\r\n    int                 i_prev_coi;   /* previous COI */\r\n\r\n    /* --- decoder output --------- */\r\n    int                 new_sps;      /* is SPS(sequence property set) changed? 
*/\r\n    int                 num_frames_to_output;\r\n\r\n    /* --- decoded picture buffer (DPB) --------- */\r\n    davs2_frame_t     **dpb;          /* decoded picture buffer array */\r\n    int                 dpbsize;      /* size of the dpb array */\r\n\r\n    /* --- frames to be removed before next frame decoding --------- */\r\n    int     num_frames_to_remove;     /* number of frames to be removed */\r\n    int     coi_remove_frame[8];      /* COI of frames to be removed */\r\n\r\n    /* --- lists (input & output) ---------------------------------- */\r\n    xlist_t             packets_idle; /* bit-stream: free buffers for input packets */\r\n\r\n    xlist_t             pic_recycle;  /* output_picture: free pictures recycle bin */\r\n    davs2_output_t      outpics;      /* output pictures */\r\n\r\n    /* --- task ---------------------------------------------------- */\r\n    int                 num_decoders;        /* number of decoders in total */\r\n    int                 num_active_decoders; /* number of active decoders currently */\r\n    davs2_t            *decoders;            /* frame decoder contexts */\r\n    davs2_t            *h_dec;               /* decoder context for current input bitstream */\r\n    int                 num_frames_in;       /* number of frames: input */\r\n    int                 num_frames_out;      /* number of frames: output */\r\n\r\n    /* --- thread control ------------------------------------------ */\r\n    int                     num_total_thread;  /* number of decoding threads in total */\r\n    int                     num_aec_thread;    /* number of threads for AEC coding (the others are for reconstruction) */\r\n    int                     num_rec_thread;    /* use thread pool or not */\r\n    davs2_thread_t          thread_output;     /* handle of the frame output thread */\r\n    davs2_thread_mutex_t    mutex_mgr;         /* a non-recursive mutex */\r\n    davs2_thread_mutex_t    mutex_aec;         /* a 
non-recursive mutex for AEC */\r\n    void                   *thread_pool;       /* AEC encoding thread */ \r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\ntypedef struct davs2_row_rec_t {\r\n    davs2_t   *h;                /* frame decoder handler */\r\n    lcu_info_t *lcu_info;         /* LCU info for REC */\r\n    lcu_rec_info_t *p_rec_info;   /* LCU reconstruction info */\r\n    int         idx_cu_zscan;     /* current CU scan order */\r\n    bool_t      b_block_avail_top;    /* availability of top  block, used in second transform */\r\n    bool_t      b_block_avail_left;   /* availability of left block, used in second transform */\r\n\r\n    /* LCU position */\r\n    struct ctu_recon_t {\r\n        int     i_pix_x;\r\n        int     i_pix_y;\r\n        int     i_pix_x_c;\r\n        int     i_pix_y_c;\r\n        int     i_scu_x;\r\n        int     i_scu_y;\r\n        int     i_scu_xy;\r\n        int     i_spu_x;\r\n        int     i_spu_y;\r\n        int     i_ctu_w;              /* width  of CTU in luma */\r\n        int     i_ctu_h;              /* height of CTU in luma */\r\n        int     i_ctu_w_c;            /* width  of CTU in chroma */\r\n        int     i_ctu_h_c;            /* height of CTU in chroma */\r\n\r\n        /* buffer pointers to picture */\r\n        int     i_frec[3];            /* stride of reconstruction buffer (reconstruction picture) */\r\n        pel_t  *p_frec[3];            /* reconstruction buffer pointer (reconstruction picture) */\r\n\r\n        /* buffer pointers to CTU cache */\r\n        int     i_fdec[3];            /* stride of reconstruction buffer (current LCU) */\r\n        pel_t  *p_fdec[3];            /* reconstruction buffer pointer (current LCU) */\r\n    } ctu;   // CTU info\r\n\r\n    /* buffers */\r\n    ALIGN32(pel_t       buf_edge_pixels[MAX_CU_SIZE << 3]); /* intra prediction buffer */\r\n    ALIGN32(pel_t       pred_blk[LCU_BUF_SIZE]);            /* 
temporary buffer used for prediction */\r\n    // ALIGN32(pel_t       fdec_buf[MAX_CU_SIZE * (MAX_CU_SIZE + (MAX_CU_SIZE >> 1))]);\r\n    struct lcu_intra_border_t {\r\n        ALIGN32(pel_t rec_left[MAX_CU_SIZE]);          /* Left border of current LCU */\r\n        ALIGN32(pel_t rec_top[MAX_CU_SIZE * 2 + 32]);  /* top-left, top and top-right samples (Reconstruction) of current LCU */\r\n    } ctu_border[IMG_COMPONENTS];                      /* Y, U, V components */\r\n} davs2_row_rec_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstruct davs2_t {\r\n    davs2_log_t  module_log;         /* log module */\r\n\r\n    /* -------------------------------------------------------------\r\n     * task information */\r\n    davs2_task_t task_info;          /* task information */\r\n\r\n    /* -------------------------------------------------------------\r\n     * sequence */\r\n    davs2_seq_t  seq_info;           /* sequence head of this task */\r\n\r\n    /* -------------------------------------------------------------\r\n     * log */\r\n    int         i_log_level;          /* log level */\r\n\r\n    int         i_image_width;        /* decoded image width */\r\n    int         i_image_height;       /* decoded image height */\r\n    int         i_chroma_format;      /* chroma format(1: 4:2:0, 2: 4:2:2) */\r\n    int         i_lcu_level;          /* LCU size in bit */\r\n    int         i_lcu_size;           /* LCU size = 1 << i_lcu_level */\r\n    int         i_lcu_size_sub1;      /* LCU size = (1 << i_lcu_level) - 1 */\r\n    int         i_display_delay;      /* picture display delay */\r\n    int         sample_bit_depth;     /* sample bit depth */\r\n    int         output_bit_depth;     /* output bit depth (assuming: output_bit_depth <= sample_bit_depth) */\r\n    bool_t      b_bkgnd_picture;      /* background picture enabled? 
*/\r\n    bool_t      b_ra_decodable;       /* random access decodable flag */\r\n    bool_t      b_video_edit_code;    /* video edit code */\r\n\r\n    /* -------------------------------------------------------------\r\n     * coding tools enabled */\r\n    bool_t      b_roi;\r\n    bool_t      b_DQP;                /* using DQP? */\r\n    bool_t      b_sao;\r\n    bool_t      b_alf;\r\n    // int         b_dmh;\r\n\r\n    /* -------------------------------------------------------------\r\n     * decoding */\r\n    davs2_bs_t   *p_bs;               /* input bitstream pointer */\r\n    aec_t         aec;                /* arithmetic entropy decoder */\r\n    int           decoding_error;     /* value indicating a decoding error */\r\n\r\n    /* -------------------------------------------------------------\r\n     * field */\r\n    bool_t      b_top_field_first;\r\n    bool_t      b_repeat_first_field;\r\n    bool_t      b_top_field;\r\n\r\n    /* -------------------------------------------------------------\r\n     * picture coding type */\r\n    int8_t      i_frame_type;\r\n    int8_t      i_pic_coding_type;\r\n    int8_t      i_pic_struct;         /* frame or field coding */\r\n\r\n    /* -------------------------------------------------------------\r\n     * picture properties */\r\n    int         i_width;              /* picture width  in pixel (luma) */\r\n    int         i_height;             /* picture height in pixel (luma) */\r\n    int         i_width_in_scu;       /* width  in SCU */\r\n    int         i_height_in_scu;      /* height in SCU */\r\n    int         i_size_in_scu;        /* number of SCU */\r\n    int         i_width_in_spu;       /* width  in SPU */\r\n    int         i_height_in_spu;      /* height in SPU */\r\n    int         i_width_in_lcu;       /* width  in LCU */\r\n    int         i_height_in_lcu;      /* height in LCU */\r\n\r\n    int         i_picture_qp;\r\n    int         i_qp;                 /* quant for the current frame */\r\n\r\n    int         i_poc;  
              /* POC (picture order count) of current frame, 8 bit */\r\n    int         i_coi;                /* COI (coding order index) */\r\n\r\n    int         i_cur_layer;\r\n\r\n    int         chroma_quant_param_delta_u;\r\n    int         chroma_quant_param_delta_v;\r\n    bool_t      b_fixed_picture_qp;\r\n    bool_t      b_bkgnd_reference;    /* AVS2-S: background reference enabled? */\r\n    bool_t      enable_chroma_quant_param;\r\n\r\n\r\n    /* -------------------------------------------------------------\r\n     * slice */\r\n    bool_t      b_slice_checked;      /* is slice checked? */\r\n    bool_t      b_fixed_slice_qp;\r\n    int         i_slice_index;        /* current slice index */\r\n    int         i_slice_qp;\r\n    int         i_last_dquant;\r\n    pel_t      *intra_border[3];      /* buffer for store decoded bottom pixels of the top lcu row (before filter) */\r\n\r\n    /* -------------------------------------------------------------\r\n     * reference frame */\r\n    int         num_of_references;\r\n    rps_t       rps;\r\n\r\n    davs2_frame_t *fref[AVS2_MAX_REFS];\r\n    davs2_frame_t *fdec;\r\n    davs2_frame_t *f_background_cur; /* background reference frame, used for reconstruction */\r\n    davs2_frame_t *f_background_ref; /* background_frame, used for reference */\r\n    davs2_frame_t *p_frame_sao;      /* used for SAO */\r\n    davs2_frame_t *p_frame_alf;      /* used for ALF */\r\n    lcu_info_t *lcu_infos;            /* LCU level info */\r\n\r\n    /* -------------------------------------------------------------\r\n     * post processing */\r\n\r\n    /* deblock */\r\n    int         b_loop_filter;        /* loop filter enabled? 
*/\r\n    int         i_alpha_offset;\r\n    int         i_beta_offset;\r\n    int         alpha;\r\n    int         alpha_c;\r\n    int         beta;\r\n    int         beta_c;\r\n\r\n    /* ALF */\r\n    alf_var_t  *p_alf;\r\n    bool_t      pic_alf_on[IMG_COMPONENTS];\r\n\r\n    /* SAO */\r\n    bool_t      slice_sao_on[IMG_COMPONENTS];\r\n\r\n    /* -------------------------------------------------------------\r\n     * buffers */\r\n    uint8_t    *p_integral;           /* holder: base pointer for all allocated memory */\r\n\r\n    /* intra mode */\r\n    int         i_ipredmode;          /* stride */\r\n    int8_t     *p_ipredmode;          /* intra prediction mode buffer */\r\n\r\n    /* scu */\r\n    cu_t       *scu_data;\r\n\r\n    /* ref & mv & inter prediction direction */\r\n    int8_t     *p_dirpred;            /* inter prediction direction */\r\n    ref_idx_t  *p_ref_idx;            /* reference index */\r\n    mv_t       *p_tmv_1st;            /* motion vector of 4x4 block (1st reference) */\r\n    mv_t       *p_tmv_2nd;            /* motion vector of 4x4 block (2nd reference) */\r\n\r\n    /* loop filter */\r\n    uint8_t    *p_deblock_flag[2];    /* [v/h][b8_x, b8_y] */\r\n\r\n    /* -------------------------------------------------------------\r\n     * block availability */\r\n    const int8_t *p_tab_TR_avail;\r\n    const int8_t *p_tab_DL_avail;\r\n\r\n    /* -------------------------------------------------------------\r\n     * LCU-based cache */\r\n    struct lcu_t {\r\n        /* geometrical properties */\r\n        ALIGN32(int i_pix_width);     /* actual width  (in pixel) for current lcu */\r\n        int     i_pix_height;         /* actual height (in pixel) for current lcu */\r\n        int     i_scu_x;              /* horizontal position for the first SCU in lcu */\r\n        int     i_scu_y;              /* vertical   position for the first SCU in lcu */\r\n        int     i_scu_xy;             /*            position for the first SCU in 
lcu */\r\n        int     i_spu_x;              /* horizontal position for the first SPU in lcu */\r\n        int     i_spu_y;              /* vertical   position for the first SPU in lcu */\r\n        int     i_pix_x;              /* horizontal position (in pixel) of lcu (luma) */\r\n        int     i_pix_y;              /* vertical   position (in pixel) of lcu (luma) */\r\n        int     i_pix_c_x;            /* horizontal position (in pixel) of lcu (chroma) */\r\n        int     i_pix_c_y;            /* vertical   position (in pixel) of lcu (chroma) */\r\n        int     idx_cu_zscan_aec;     /* Z-scan index of current AEC CU within LCU (in 8x8 unit) */\r\n\r\n        /* buffer pointers */\r\n        lcu_info_t *lcu_aec;          /* LCU info for AEC */\r\n\r\n        int8_t  i_left_cu_qp;         /* QP of left CU (for current CU decoding) */\r\n        int8_t  c_ipred_mode_ctx;     /* context of chroma intra prediction mode (for current CU decoding) */\r\n\r\n        neighbor_inter_t neighbor_inter[NUM_INTER_NEIGHBOR];        /* neighboring inter modes of 4x4 blocks*/\r\n\r\n        int8_t  ref_skip_1st[DS_MAX_NUM];\r\n        int8_t  ref_skip_2nd[DS_MAX_NUM];\r\n        mv_t    mv_tskip_1st[DS_MAX_NUM];\r\n        mv_t    mv_tskip_2nd[DS_MAX_NUM];\r\n\r\n#if !CTRL_AEC_THREAD\r\n        lcu_rec_info_t      rec_info;\r\n#endif\r\n        ALIGN32(runlevel_t  cg_info);\r\n\r\n    } lcu;\r\n\r\n    /* -------------------------------------------------------------\r\n     * adaptive frequency weighting quantization */\r\n    weighted_quant_t   wq;                // weight quant parameters\r\n};\r\n\r\n/**\r\n * ===========================================================================\r\n * global variables\r\n * ===========================================================================\r\n */\r\n#if HIGH_BIT_DEPTH\r\nextern int max_pel_value;\r\nextern int g_bit_depth;\r\nextern int g_dc_value;\r\n#else\r\nstatic const int g_bit_depth   = BIT_DEPTH;\r\nstatic 
const int max_pel_value = (1 << BIT_DEPTH) - 1;\r\nstatic const pel_t g_dc_value    = 128;\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * common function declares\r\n * ===========================================================================\r\n */\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : output information\r\n * Parameters :\r\n *       [in] : decoder - decoder handle\r\n * Return     : none\r\n * ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_log(void *h, int level, const char *format, ...);\r\n\r\n/* ---------------------------------------------------------------------------\r\n * trace */\r\n#if AVS2_TRACE\r\nint  avs2_trace_init(char *psz_trace_file);\r\nvoid avs2_trace_destroy(void);\r\nint  avs2_trace(const char *psz_fmt, ...);\r\nvoid avs2_trace_string(char *trace_string, int value, int len);\r\nvoid avs2_trace_string2(char *trace_string, int bit_pattern, int value, int len);\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * memory alloc\r\n */\r\n\r\nstatic ALWAYS_INLINE void *davs2_malloc(size_t i_size)\r\n{\r\n    intptr_t mask = CACHE_LINE_SIZE - 1;\r\n    uint8_t *align_buf = NULL;\r\n    uint8_t *buf = (uint8_t *)malloc(i_size + mask + sizeof(void **));\r\n\r\n    if (buf != NULL) {\r\n        align_buf = buf + mask + sizeof(void **);\r\n        align_buf -= (intptr_t)align_buf & mask;\r\n        *(((void **)align_buf) - 1) = buf;\r\n    } else {\r\n#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L\r\n        davs2_log(NULL, DAVS2_LOG_ERROR, \"malloc of size %zu failed\\n\", i_size);\r\n#else\r\n        davs2_log(NULL, DAVS2_LOG_ERROR, \"malloc of size %lu failed\\n\", i_size);\r\n#endif\r\n    }\r\n\r\n    return align_buf;\r\n}\r\n\r\nstatic ALWAYS_INLINE void *davs2_calloc(size_t count, size_t 
 size)\r\n{\r\n    void *p = davs2_malloc(count * size);\r\n    if (p != NULL) {\r\n        memset(p, 0, count * size);\r\n    }\r\n    return p;\r\n}\r\n\r\nstatic ALWAYS_INLINE void davs2_free(void *ptr)\r\n{\r\n    if (ptr != NULL) {\r\n        free(*(((void **)ptr) - 1));\r\n    }\r\n}\r\n\r\n#if SYS_WINDOWS\r\n#define WIN32_LEAN_AND_MEAN\r\n#include <windows.h>\r\n#endif\r\n#include <time.h>\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get timestamp in us\r\n */\r\nstatic ALWAYS_INLINE\r\nint64_t davs2_get_us(void)\r\n{\r\n#if SYS_WINDOWS\r\n    LARGE_INTEGER nFreq;\r\n    if (QueryPerformanceFrequency(&nFreq)) { // non-zero return: hardware supports a high-resolution counter\r\n        LARGE_INTEGER t1;\r\n        QueryPerformanceCounter(&t1);\r\n        return (int64_t)(1000000 * t1.QuadPart / (double)nFreq.QuadPart);\r\n    } else {  // not supported by hardware: fall back to the millisecond-level system clock\r\n        int64_t tm = clock();\r\n        return (tm * (1000000 / CLOCKS_PER_SEC));\r\n    }\r\n#else\r\n    int64_t tm = clock();\r\n    return (tm * (1000000 / CLOCKS_PER_SEC));\r\n#endif\r\n}\r\n\r\n/**\r\n * ===========================================================================\r\n * inline function defines\r\n * ===========================================================================\r\n */\r\n\r\n#if defined(__GNUC__) && (__GNUC__ > 3 || __GNUC__ == 3 && __GNUC_MINOR__ > 3)\r\n#define davs2_clz(x)      __builtin_clz(x)\r\n#define davs2_ctz(x)      __builtin_ctz(x)\r\n#elif defined(_MSC_VER) && defined(_WIN32)\r\nstatic int ALWAYS_INLINE davs2_clz(const uint32_t x)\r\n{\r\n    DWORD r;\r\n    _BitScanReverse(&r, (DWORD)x);\r\n    return (r ^ 31);\r\n}\r\n\r\nstatic int ALWAYS_INLINE davs2_ctz(const uint32_t x)\r\n{\r\n    DWORD r;\r\n    _BitScanForward(&r, (DWORD)x);\r\n    return r;\r\n}\r\n\r\n#else\r\nstatic int ALWAYS_INLINE davs2_clz(uint32_t x)\r\n{\r\n    static uint8_t lut[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };\r\n    int y, z = (((x >> 16) - 1) >> 27) & 16;\r\n    x 
>>= z ^ 16;\r\n    z += y = ((x - 0x100) >> 28) & 8;\r\n    x >>= y ^ 8;\r\n    z += y = ((x - 0x10) >> 29) & 4;\r\n    x >>= y ^ 4;\r\n    return z + lut[x];\r\n}\r\n\r\nstatic int ALWAYS_INLINE davs2_ctz(uint32_t x)\r\n{\r\n    static uint8_t lut[16] = { 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 };\r\n    int y, z = (((x & 0xffff) - 1) >> 27) & 16;\r\n    x >>= z;\r\n    z += y = (((x & 0xff) - 1) >> 28) & 8;\r\n    x >>= y;\r\n    z += y = (((x & 0xf) - 1) >> 29) & 4;\r\n    x >>= y;\r\n    return z + lut[x & 0xf];\r\n}\r\n#endif\r\n\r\nstatic ALWAYS_INLINE pel_t davs2_clip_pixel(int x)\r\n{\r\n    return (pel_t)((x & ~max_pel_value) ? (-x) >> 31 & max_pel_value : x);\r\n}\r\n\r\nstatic ALWAYS_INLINE int davs2_clip3(int v, int i_min, int i_max)\r\n{\r\n    return ((v < i_min) ? i_min : (v > i_max) ? i_max : v);\r\n}\r\n\r\nstatic ALWAYS_INLINE int davs2_median(int a, int b, int c)\r\n{\r\n    int t = (a - b) & ((a - b) >> 31);\r\n\r\n    a -= t;\r\n    b += t;\r\n    b -= (b - c) & ((b - c) >> 31);\r\n    b += (a - b) & ((a - b) >> 31);\r\n\r\n    return b;\r\n}\r\n\r\n// returns the sign of val: -1 if negative, otherwise 1\r\nstatic ALWAYS_INLINE int davs2_sign2(int val)\r\n{\r\n    return ((val >> 31) << 1) + 1;\r\n}\r\n\r\n// returns the sign of val: -1 if negative, 0 if zero, 1 if positive\r\nstatic ALWAYS_INLINE int davs2_sign3(int val)\r\n{\r\n    return (val >> 31) | (int)(((uint32_t)-val) >> 31u);\r\n}\r\n\r\n// log2 of val via count-trailing-zeros (exact only when val is a power of two)\r\n#define davs2_log2u(val)  davs2_ctz(val)\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * unions for type-punning.\r\n * Mn: load or store n bits, aligned, native-endian\r\n * CPn: copy n bits, aligned, native-endian\r\n * we don't use memcpy for CPn because memcpy's args aren't assumed\r\n * to be aligned */\r\ntypedef union {\r\n    uint16_t    i;\r\n    uint8_t     c[2];\r\n} MAY_ALIAS davs2_union16_t;\r\n\r\ntypedef union {\r\n    uint32_t    i;\r\n    uint16_t    b[2];\r\n    uint8_t     c[4];\r\n} MAY_ALIAS davs2_union32_t;\r\n\r\ntypedef union {\r\n    uint64_t    
i;\r\n    uint32_t    a[2];\r\n    uint16_t    b[4];\r\n    uint8_t     c[8];\r\n} MAY_ALIAS davs2_union64_t;\r\n\r\n#define M16(src)                (((davs2_union16_t *)(src))->i)\r\n#define M32(src)                (((davs2_union32_t *)(src))->i)\r\n#define M64(src)                (((davs2_union64_t *)(src))->i)\r\n#define CP16(dst,src)           M16(dst)  = M16(src)\r\n#define CP32(dst,src)           M32(dst)  = M32(src)\r\n#define CP64(dst,src)           M64(dst)  = M64(src)\r\n\r\n/* ---------------------------------------------------------------------------\r\n * assert\r\n */\r\n#define DAVS2_ASSERT(expression, ...)   if (!(expression)) { davs2_log(NULL, DAVS2_LOG_ERROR, __VA_ARGS__); }\r\n\r\n/* ---------------------------------------------------------------------------\r\n * list\r\n */\r\n#define xl_init         FPFX(xl_init)\r\nint   xl_init          (xlist_t *const xlist);\r\n#define xl_destroy      FPFX(xl_destroy)\r\nvoid  xl_destroy       (xlist_t *const xlist);\r\n#define xl_append       FPFX(xl_append)\r\nvoid  xl_append        (xlist_t *const xlist, void *node);\r\n#define xl_remove_head  FPFX(xl_remove_head)\r\nvoid *xl_remove_head   (xlist_t *const xlist, const int wait);\r\n#define xl_remove_head_ex FPFX(xl_remove_head_ex)\r\nvoid *xl_remove_head_ex(xlist_t *const xlist);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif // DAVS2_COMMON_H\r\n"
  },
  {
    "path": "source/common/cpu.cc",
    "content": "/*\r\n * cpu.cc\r\n *\r\n * Description of this file:\r\n *    CPU-Processing functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n * Authors: Falei LUO     <falei.luo@gmail.com>\r\n *\r\n * --------------------------------------------------------------------------\r\n * Copyright (C) 2013-2017 MulticoreWare, Inc\r\n *\r\n * Authors: Loren Merritt <lorenm@u.washington.edu>\r\n *          Laurent Aimar <fenrir@via.ecp.fr>\r\n *          Fiona Glaser  <fiona@x264.com>\r\n *          Steve Borho   <steve@borho.org>\r\n *\r\n * This program is free software; you can redistribute it and/or modify\r\n * it under the terms of the GNU General Public License as published by\r\n * the Free Software Foundation; either version 2 of the License, or\r\n * (at your option) any later version.\r\n *\r\n * This program is distributed in the hope that it will be useful,\r\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  
See the\r\n * GNU General Public License for more details.\r\n *\r\n * You should have received a copy of the GNU General Public License\r\n * along with this program; if not, write to the Free Software\r\n * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n * This program is also available under a commercial proprietary license.\r\n * For more information, contact us at license @ x265.com.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"cpu.h\"\r\n\r\n#if SYS_MACOSX || SYS_FREEBSD\r\n#include <sys/types.h>\r\n#include <sys/sysctl.h>\r\n#endif\r\n#if SYS_OPENBSD\r\n#include <sys/param.h>\r\n#include <sys/sysctl.h>\r\n#include <machine/cpu.h>\r\n#endif\r\n\r\n#if ARCH_ARM\r\n#include <signal.h>\r\n#include <setjmp.h>\r\nstatic sigjmp_buf jmpbuf;\r\nstatic volatile sig_atomic_t canjump = 0;\r\n\r\nstatic void sigill_handler(int sig)\r\n{\r\n    if (!canjump)\r\n    {\r\n        signal(sig, SIG_DFL);\r\n        raise(sig);\r\n    }\r\n\r\n    canjump = 0;\r\n    siglongjmp(jmpbuf, 1);\r\n}\r\n\r\n#endif // if ARCH_ARM\r\n\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\ntypedef struct {\r\n    const char *name;\r\n    int flags;\r\n} davs2_cpu_name_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const davs2_cpu_name_t davs2_cpu_names[] = {\r\n#if ARCH_X86 || ARCH_X86_64\r\n#define MMX2            DAVS2_CPU_MMX | DAVS2_CPU_MMX2 | DAVS2_CPU_CMOV\r\n    { \"MMX2\",           MMX2 },\r\n    { \"MMXEXT\",         MMX2 },\r\n    { \"SSE\",            MMX2 | DAVS2_CPU_SSE },\r\n#define SSE2            MMX2 | DAVS2_CPU_SSE | DAVS2_CPU_SSE2\r\n    { \"SSE2Slow\",       SSE2 | DAVS2_CPU_SSE2_IS_SLOW },\r\n    { \"SSE2\",           SSE2 },\r\n    { \"SSE2Fast\",       SSE2 | DAVS2_CPU_SSE2_IS_FAST },\r\n    { \"SSE3\",           SSE2 | DAVS2_CPU_SSE3 },\r\n    { \"SSSE3\",          SSE2 | 
DAVS2_CPU_SSE3 | DAVS2_CPU_SSSE3 },\r\n    { \"SSE4.1\",         SSE2 | DAVS2_CPU_SSE3 | DAVS2_CPU_SSSE3 | DAVS2_CPU_SSE4 },\r\n    { \"SSE4\",           SSE2 | DAVS2_CPU_SSE3 | DAVS2_CPU_SSSE3 | DAVS2_CPU_SSE4 },\r\n    { \"SSE4.2\",         SSE2 | DAVS2_CPU_SSE3 | DAVS2_CPU_SSSE3 | DAVS2_CPU_SSE4 | DAVS2_CPU_SSE42 },\r\n#define AVX             SSE2 | DAVS2_CPU_SSE3 | DAVS2_CPU_SSSE3 | DAVS2_CPU_SSE4 | DAVS2_CPU_SSE42 | DAVS2_CPU_AVX\r\n    { \"AVX\",            AVX },\r\n    { \"XOP\",            AVX | DAVS2_CPU_XOP },\r\n    { \"FMA4\",           AVX | DAVS2_CPU_FMA4 },\r\n    { \"AVX2\",           AVX | DAVS2_CPU_AVX2 },\r\n    { \"FMA3\",           AVX | DAVS2_CPU_FMA3 },\r\n#undef AVX\r\n#undef SSE2\r\n#undef MMX2\r\n    { \"Cache32\",        DAVS2_CPU_CACHELINE_32 },\r\n    { \"Cache64\",        DAVS2_CPU_CACHELINE_64 },\r\n    { \"LZCNT\",          DAVS2_CPU_LZCNT },\r\n    { \"BMI1\",           DAVS2_CPU_BMI1 },\r\n    { \"BMI2\",           DAVS2_CPU_BMI1 | DAVS2_CPU_BMI2 },\r\n    { \"SlowCTZ\",        DAVS2_CPU_SLOW_CTZ },\r\n    { \"SlowAtom\",       DAVS2_CPU_SLOW_ATOM },\r\n    { \"SlowPshufb\",     DAVS2_CPU_SLOW_PSHUFB },\r\n    { \"SlowPalignr\",    DAVS2_CPU_SLOW_PALIGNR },\r\n    { \"SlowShuffle\",    DAVS2_CPU_SLOW_SHUFFLE },\r\n    { \"UnalignedStack\", DAVS2_CPU_STACK_MOD4 },\r\n\r\n#elif ARCH_ARM\r\n    { \"ARMv6\",          DAVS2_CPU_ARMV6 },\r\n    { \"NEON\",           DAVS2_CPU_NEON },\r\n    { \"FastNeonMRC\",    DAVS2_CPU_FAST_NEON_MRC },\r\n#endif // if DAVS2_ARCH_X86\r\n    { \"\", 0 }\r\n};\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nchar *davs2_get_simd_capabilities(char *buf, uint32_t cpuid)\r\n{\r\n    char *p = buf;\r\n\r\n    for (int i = 0; davs2_cpu_names[i].flags; i++) {\r\n        if (!strcmp(davs2_cpu_names[i].name, \"SSE\")\r\n            && (cpuid & DAVS2_CPU_SSE2))\r\n            continue;\r\n        if 
(!strcmp(davs2_cpu_names[i].name, \"SSE2\")\r\n            && (cpuid & (DAVS2_CPU_SSE2_IS_FAST | DAVS2_CPU_SSE2_IS_SLOW)))\r\n            continue;\r\n        if (!strcmp(davs2_cpu_names[i].name, \"SSE3\")\r\n            && (cpuid & DAVS2_CPU_SSSE3 || !(cpuid & DAVS2_CPU_CACHELINE_64)))\r\n            continue;\r\n        if (!strcmp(davs2_cpu_names[i].name, \"SSE4.1\")\r\n            && (cpuid & DAVS2_CPU_SSE42))\r\n            continue;\r\n        if (!strcmp(davs2_cpu_names[i].name, \"BMI1\")\r\n            && (cpuid & DAVS2_CPU_BMI2))\r\n            continue;\r\n        if ((cpuid & davs2_cpu_names[i].flags) == (uint32_t)davs2_cpu_names[i].flags\r\n            && (!i || davs2_cpu_names[i].flags != davs2_cpu_names[i - 1].flags))\r\n            p += sprintf(p, \" %s\", davs2_cpu_names[i].name);\r\n    }\r\n\r\n    if (p == buf) {\r\n        sprintf(p, \" none! 0x%x\", cpuid);\r\n    }\r\n    return buf;\r\n}\r\n\r\n#if !ARCH_X86_64\r\n/*  */\r\nint  davs2_cpu_cpuid_test(void);\r\n#endif\r\n\r\n#if HAVE_MMX\r\n/* ---------------------------------------------------------------------------\r\n */\r\nuint32_t davs2_cpu_detect(void)\r\n{\r\n    uint32_t cpuid = 0;\r\n\r\n    uint32_t eax, ebx, ecx, edx;\r\n    uint32_t vendor[4] = { 0 };\r\n    uint32_t max_extended_cap, max_basic_cap;\r\n\r\n#if !ARCH_X86_64\r\n    if (!davs2_cpu_cpuid_test()) {\r\n        return 0;\r\n    }\r\n#endif\r\n\r\n    davs2_cpu_cpuid(0, &eax, vendor + 0, vendor + 2, vendor + 1);\r\n    max_basic_cap = eax;\r\n    if (max_basic_cap == 0) {\r\n        return 0;\r\n    }\r\n\r\n    davs2_cpu_cpuid(1, &eax, &ebx, &ecx, &edx);\r\n    if (edx & 0x00800000) {\r\n        cpuid |= DAVS2_CPU_MMX;\r\n    } else {\r\n        return cpuid;\r\n    }\r\n\r\n    if (edx & 0x02000000) {\r\n        cpuid |= DAVS2_CPU_MMX2 | DAVS2_CPU_SSE;\r\n    }\r\n    if (edx & 0x00008000) {\r\n        cpuid |= DAVS2_CPU_CMOV;\r\n    } else {\r\n        return cpuid;\r\n    }\r\n\r\n    if (edx & 0x04000000) {\r\n        
cpuid |= DAVS2_CPU_SSE2;\r\n    }\r\n    if (ecx & 0x00000001) {\r\n        cpuid |= DAVS2_CPU_SSE3;\r\n    }\r\n    if (ecx & 0x00000200) {\r\n        cpuid |= DAVS2_CPU_SSSE3;\r\n    }\r\n    if (ecx & 0x00080000) {\r\n        cpuid |= DAVS2_CPU_SSE4;\r\n    }\r\n    if (ecx & 0x00100000) {\r\n        cpuid |= DAVS2_CPU_SSE42;\r\n    }\r\n\r\n    /* Check OXSAVE and AVX bits */\r\n    if ((ecx & 0x18000000) == 0x18000000) {\r\n        /* Check for OS support */\r\n        davs2_cpu_xgetbv(0, &eax, &edx);\r\n        if ((eax & 0x6) == 0x6) {\r\n            cpuid |= DAVS2_CPU_AVX;\r\n            if (ecx & 0x00001000) {\r\n                cpuid |= DAVS2_CPU_FMA3;\r\n            }\r\n        }\r\n    }\r\n\r\n    if (max_basic_cap >= 7) {\r\n        davs2_cpu_cpuid(7, &eax, &ebx, &ecx, &edx);\r\n        /* AVX2 requires OS support, but BMI1/2 don't. */\r\n        if ((cpuid & DAVS2_CPU_AVX) && (ebx & 0x00000020)) {\r\n            cpuid |= DAVS2_CPU_AVX2;\r\n        }\r\n        if (ebx & 0x00000008) {\r\n            cpuid |= DAVS2_CPU_BMI1;\r\n            if (ebx & 0x00000100) {\r\n                cpuid |= DAVS2_CPU_BMI2;\r\n            }\r\n        }\r\n    }\r\n\r\n    if (cpuid & DAVS2_CPU_SSSE3) {\r\n        cpuid |= DAVS2_CPU_SSE2_IS_FAST;\r\n    }\r\n\r\n    davs2_cpu_cpuid(0x80000000, &eax, &ebx, &ecx, &edx);\r\n    max_extended_cap = eax;\r\n\r\n    if (max_extended_cap >= 0x80000001) {\r\n        davs2_cpu_cpuid(0x80000001, &eax, &ebx, &ecx, &edx);\r\n\r\n        if (ecx & 0x00000020)\r\n            cpuid |= DAVS2_CPU_LZCNT;               /* Supported by Intel chips starting with Haswell */\r\n        if (ecx & 0x00000040) {                     /* SSE4a, AMD only */\r\n            int family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);\r\n            cpuid |= DAVS2_CPU_SSE2_IS_FAST;        /* Phenom and later CPUs have fast SSE units */\r\n            if (family == 0x14) {\r\n                cpuid &= ~DAVS2_CPU_SSE2_IS_FAST;   /* SSSE3 doesn't imply fast SSE 
anymore... */\r\n                cpuid |= DAVS2_CPU_SSE2_IS_SLOW;    /* Bobcat has 64-bit SIMD units */\r\n                cpuid |= DAVS2_CPU_SLOW_PALIGNR;    /* palignr is insanely slow on Bobcat */\r\n            }\r\n            if (family == 0x16) {\r\n                cpuid |= DAVS2_CPU_SLOW_PSHUFB;     /* Jaguar's pshufb isn't that slow, but it's slow enough\r\n                                                     * compared to alternate instruction sequences that this\r\n                                                     * is equal or faster on almost all such functions. */\r\n            }\r\n        }\r\n\r\n        if (cpuid & DAVS2_CPU_AVX)\r\n        {\r\n            if (ecx & 0x00000800) {   /* XOP */\r\n                cpuid |= DAVS2_CPU_XOP;\r\n            }\r\n            if (ecx & 0x00010000) {   /* FMA4 */\r\n                cpuid |= DAVS2_CPU_FMA4;\r\n            }\r\n        }\r\n\r\n        if (!strcmp((char*)vendor, \"AuthenticAMD\")) {\r\n            if (edx & 0x00400000) {\r\n                cpuid |= DAVS2_CPU_MMX2;\r\n            }\r\n            if (!(cpuid & DAVS2_CPU_LZCNT)) {\r\n                cpuid |= DAVS2_CPU_SLOW_CTZ;\r\n            }\r\n            if ((cpuid & DAVS2_CPU_SSE2) && !(cpuid & DAVS2_CPU_SSE2_IS_FAST)) {\r\n                cpuid |= DAVS2_CPU_SSE2_IS_SLOW; /* AMD CPUs come in two types: terrible at SSE and great at it */\r\n            }\r\n        }\r\n    }\r\n\r\n    if (!strcmp((char*)vendor, \"GenuineIntel\")) {\r\n        int family, model;\r\n        davs2_cpu_cpuid(1, &eax, &ebx, &ecx, &edx);\r\n        family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);\r\n        model = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);\r\n        if (family == 6) {\r\n            /* 6/9 (pentium-m \"banias\"), 6/13 (pentium-m \"dothan\"), and 6/14 (core1 \"yonah\")\r\n             * theoretically support sse2, but it's significantly slower than mmx for\r\n             * almost all of x264's functions, so let's just pretend they 
don't. */\r\n            if (model == 9 || model == 13 || model == 14) {\r\n                cpuid &= ~(DAVS2_CPU_SSE2 | DAVS2_CPU_SSE3);\r\n                //DAVS2_CHECK(!(cpuid & (DAVS2_CPU_SSSE3 | DAVS2_CPU_SSE4)), \"unexpected CPU ID %d\\n\", cpuid);\r\n            } else if (model == 28) {\r\n                /* Detect Atom CPU */\r\n                cpuid |= DAVS2_CPU_SLOW_ATOM;\r\n                cpuid |= DAVS2_CPU_SLOW_CTZ;\r\n                cpuid |= DAVS2_CPU_SLOW_PSHUFB;\r\n            } else if ((cpuid & DAVS2_CPU_SSSE3) && !(cpuid & DAVS2_CPU_SSE4) && model < 23) {\r\n                /* Conroe has a slow shuffle unit. Check the model number to make sure not\r\n                 * to include crippled low-end Penryns and Nehalems that don't have SSE4. */\r\n                cpuid |= DAVS2_CPU_SLOW_SHUFFLE;\r\n            }\r\n        }\r\n    }\r\n\r\n    if ((!strcmp((char*)vendor, \"GenuineIntel\") || !strcmp((char*)vendor, \"CyrixInstead\")) && !(cpuid & DAVS2_CPU_SSE42)) {\r\n        /* cacheline size is specified in 3 places, any of which may be missing */\r\n        int cache;\r\n        davs2_cpu_cpuid(1, &eax, &ebx, &ecx, &edx);\r\n        cache = (ebx & 0xff00) >> 5; // cflush size\r\n        if (!cache && max_extended_cap >= 0x80000006) {\r\n            davs2_cpu_cpuid(0x80000006, &eax, &ebx, &ecx, &edx);\r\n            cache = ecx & 0xff; // cacheline size\r\n        }\r\n        if (!cache && max_basic_cap >= 2) {\r\n            // Cache and TLB Information\r\n            static const uint8_t cache32_ids[] = { 0x0a, 0x0c, 0x41, 0x42, 0x43, 0x44, 0x45, 0x82, 0x83, 0x84, 0x85, 0 };\r\n            static const uint8_t cache64_ids[] = { 0x22, 0x23, 0x25, 0x29, 0x2c, 0x46, 0x47, 0x49, 0x60, 0x66, 0x67,\r\n                0x68, 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7c, 0x7f, 0x86, 0x87, 0 };\r\n            uint32_t buf[4];\r\n            int max, i = 0, j;\r\n            do {\r\n                davs2_cpu_cpuid(2, buf + 0, buf + 1, buf + 2, buf + 3);\r\n       
         max = buf[0] & 0xff;\r\n                buf[0] &= ~0xff;\r\n                for (j = 0; j < 4; j++) {\r\n                    if (!(buf[j] >> 31)) {\r\n                        while (buf[j]) {\r\n                            if (strchr((const char *)cache32_ids, buf[j] & 0xff)) {\r\n                                cache = 32;\r\n                            }\r\n                            if (strchr((const char *)cache64_ids, buf[j] & 0xff)) {\r\n                                cache = 64;\r\n                            }\r\n                            buf[j] >>= 8;\r\n                        }\r\n                    }\r\n                }\r\n            } while (++i < max);\r\n        }\r\n\r\n        if (cache == 32) {\r\n            cpuid |= DAVS2_CPU_CACHELINE_32;\r\n        } else if (cache == 64) {\r\n            cpuid |= DAVS2_CPU_CACHELINE_64;\r\n        } else {\r\n            davs2_log(NULL, DAVS2_LOG_WARNING, \"unable to determine cacheline size\\n\");\r\n        }\r\n    }\r\n\r\n#ifdef BROKEN_STACK_ALIGNMENT\r\n    cpuid |= DAVS2_CPU_STACK_MOD4;\r\n#endif\r\n\r\n    return cpuid;\r\n}\r\n#endif // if DAVS2_ARCH_X86\r\n\r\n#if SYS_LINUX && !(defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7__))\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint davs2_cpu_num_processors(void)\r\n{\r\n#if !HAVE_THREAD\r\n    return 1;\r\n#elif defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7__)\r\n    return 2;\r\n#elif SYS_WINDOWS\r\n    return davs2_thread_num_processors_np();\r\n#elif SYS_LINUX\r\n    unsigned int bit;\r\n    int np = 0;\r\n    cpu_set_t p_aff;\r\n\r\n    memset(&p_aff, 0, sizeof(p_aff));\r\n    sched_getaffinity(0, sizeof(p_aff), &p_aff);\r\n    for (bit = 0; bit < 8 * sizeof(p_aff); bit++) {   /* iterate over every bit of the affinity mask */\r\n        np += (((uint8_t *)& 
p_aff)[bit / 8] >> (bit % 8)) & 1;\r\n    }\r\n    return np;\r\n\r\n#elif SYS_BEOS\r\n    system_info info;\r\n\r\n    get_system_info(&info);\r\n    return info.cpu_count;\r\n\r\n#elif SYS_MACOSX || SYS_FREEBSD || SYS_OPENBSD\r\n    int numberOfCPUs;\r\n    size_t length = sizeof (numberOfCPUs);\r\n#if SYS_OPENBSD\r\n    int mib[2] = { CTL_HW, HW_NCPU };\r\n    if(sysctl(mib, 2, &numberOfCPUs, &length, NULL, 0))\r\n#else\r\n    if(sysctlbyname(\"hw.ncpu\", &numberOfCPUs, &length, NULL, 0))\r\n#endif\r\n    {\r\n        numberOfCPUs = 1;\r\n    }\r\n    return numberOfCPUs;\r\n\r\n#else\r\n    return 1;\r\n#endif\r\n}\r\n"
  },
  {
    "path": "source/common/cpu.h",
    "content": "/*\r\n * cpu.h\r\n *\r\n * Description of this file:\r\n *    CPU-Processing functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n\r\n#ifndef DAVS2_CPU_H\r\n#define DAVS2_CPU_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n\r\n#define davs2_cpu_detect FPFX(cpu_detect)\r\nuint32_t davs2_cpu_detect(void);\r\n#define davs2_cpu_num_processors FPFX(cpu_num_processors)\r\nint  davs2_cpu_num_processors(void);\r\n#define avs_cpu_emms FPFX(avs_cpu_emms)\r\nvoid avs_cpu_emms(void);\r\n#define avs_cpu_mask_misalign_sse FPFX(avs_cpu_mask_misalign_sse)\r\nvoid avs_cpu_mask_misalign_sse(void);\r\n#define avs_cpu_sfence FPFX(avs_cpu_sfence)\r\nvoid avs_cpu_sfence(void);\r\n\r\n#define davs2_get_simd_capabilities 
FPFX(get_simd_capabilities)\r\nchar *davs2_get_simd_capabilities(char *buf, uint32_t cpuid);\r\n\r\n#if HAVE_MMX\r\n#define davs2_cpu_cpuid FPFX(cpu_cpuid)\r\nuint32_t davs2_cpu_cpuid(uint32_t op, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);\r\n#define davs2_cpu_xgetbv FPFX(cpu_xgetbv)\r\nvoid davs2_cpu_xgetbv(uint32_t op, uint32_t *eax, uint32_t *edx);\r\n#define avs_emms() avs_cpu_emms()\r\n#else\r\n#define avs_emms()\r\n#endif\r\n\r\n#define avs_sfence avs_cpu_sfence\r\n\r\n/* kluge:\r\n * gcc can't give variables any greater alignment than the stack frame has.\r\n * We need 16 byte alignment for SSE2, so here we make sure that the stack is\r\n * aligned to 16 bytes.\r\n * gcc 4.2 introduced __attribute__((force_align_arg_pointer)) to fix this\r\n * problem, but I don't want to require such a new version.\r\n * This applies only to x86_32, since other architectures that need alignment\r\n * also have ABIs that ensure aligned stack. */\r\n#if ARCH_X86 && HAVE_MMX\r\n//int xavs_stack_align(void(*func) (xavs_t *), xavs_t * arg);\r\n//#define avs_stack_align(func,arg) avs_stack_align((void (*)(xavs_t*))func,arg)\r\n#else\r\n#define avs_stack_align(func,...) func(__VA_ARGS__)\r\n#endif\r\n\r\n#define avs_cpu_restore FPFX(avs_cpu_restore)\r\nvoid avs_cpu_restore(uint32_t cpuid);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_CPU_H\r\n"
  },
  {
    "path": "source/common/cu.cc",
    "content": "﻿/*\r\n * cu.cc\r\n *\r\n * Description of this file:\r\n *    CU Processing functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"cu.h\"\r\n#include \"vlc.h\"\r\n#include \"transform.h\"\r\n#include \"intra.h\"\r\n#include \"predict.h\"\r\n#include \"block_info.h\"\r\n#include \"aec.h\"\r\n#include \"mc.h\"\r\n#include \"sao.h\"\r\n#include \"quant.h\"\r\n#include \"scantab.h\"\r\n\r\n/**\r\n * ===========================================================================\r\n * local & global variables (const tables)\r\n * ===========================================================================\r\n */\r\n\r\nstatic const int tab_b8xy_to_zigzag[8][8] = {\r\n    {  0,  1,  4,  5, 16, 17, 
20, 21 },\r\n    {  2,  3,  6,  7, 18, 19, 22, 23 },\r\n    {  8,  9, 12, 13, 24, 25, 28, 29 },\r\n    { 10, 11, 14, 15, 26, 27, 30, 31 },\r\n    { 32, 33, 36, 37, 48, 49, 52, 53 },\r\n    { 34, 35, 38, 39, 50, 51, 54, 55 },\r\n    { 40, 41, 44, 45, 56, 57, 60, 61 },\r\n    { 42, 43, 46, 47, 58, 59, 62, 63 }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst uint8_t QP_SCALE_CR[64] = {\r\n    0,  1,  2,  3,  4,  5,  6,  7,  8,  9,\r\n    10, 11, 12, 13, 14, 15, 16, 17, 18, 19,\r\n    20, 21, 22, 23, 24, 25, 26, 27, 28, 29,\r\n    30, 31, 32, 33, 34, 35, 36, 37, 38, 39,\r\n    40, 41, 42, 42, 43, 43, 44, 44, 45, 45,\r\n    46, 46, 47, 47, 48, 48, 48, 49, 49, 49,\r\n    50, 50, 50, 51,\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int8_t dmh_pos[DMH_MODE_NUM + DMH_MODE_NUM - 1][2][2] = {\r\n    { {  0,  0 }, { 0,  0 } },\r\n    { { -1,  0 }, { 1,  0 } },\r\n    { {  0, -1 }, { 0,  1 } },\r\n    { { -1,  1 }, { 1, -1 } },\r\n    { { -1, -1 }, { 1,  1 } },\r\n    { { -2,  0 }, { 2,  0 } },\r\n    { {  0, -2 }, { 0,  2 } },\r\n    { { -2,  2 }, { 2, -2 } },\r\n    { { -2, -2 }, { 2,  2 } }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int16_t IQ_SHIFT[80] = {\r\n    15, 15, 15, 15, 15, 15, 15, 15,\r\n    14, 14, 14, 14, 14, 14, 14, 14,\r\n    14, 13, 13, 13, 13, 13, 13, 13,\r\n    12, 12, 12, 12, 12, 12, 12, 12,\r\n    12, 11, 11, 11, 11, 11, 11, 11,\r\n    11, 10, 10, 10, 10, 10, 10, 10,\r\n    10,  9,  9,  9,  9,  9,  9,  9,\r\n     8,  8,  8,  8,  8,  8,  8,  8,\r\n     7,  7,  7,  7,  7,  7,  7,  7,\r\n     6,  6,  6,  6,  6,  6,  6,  6\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst uint16_t IQ_TAB[80] = {\r\n    32768, 36061, 38968, 42495, 46341, 50535, 55437, 60424,\r\n    32932, 35734, 38968, 42495, 46177, 50535, 
55109, 59933,\r\n    65535, 35734, 38968, 42577, 46341, 50617, 55027, 60097,\r\n    32809, 35734, 38968, 42454, 46382, 50576, 55109, 60056,\r\n    65535, 35734, 38968, 42495, 46320, 50515, 55109, 60076,\r\n    65535, 35744, 38968, 42495, 46341, 50535, 55099, 60087,\r\n    65535, 35734, 38973, 42500, 46341, 50535, 55109, 60097,\r\n    32771, 35734, 38965, 42497, 46341, 50535, 55109, 60099,\r\n    32768, 36061, 38968, 42495, 46341, 50535, 55437, 60424,\r\n    32932, 35734, 38968, 42495, 46177, 50535, 55109, 59933\r\n};\r\n\r\n#if AVS2_TRACE\r\nextern int symbolCount;\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * used for debug\r\n */\r\nstatic INLINE\r\nbool_t is_inside_cu(int cu_pix_x, int cu_pix_y, int i_cu_level, int i_pix_x, int i_pix_y)\r\n{\r\n    int cu_size = 1 << i_cu_level;\r\n    return cu_pix_x <= i_pix_x && (cu_pix_x + cu_size) > i_pix_x &&\r\n        cu_pix_y <= i_pix_y && (cu_pix_y + cu_size) > i_pix_y;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * obtain the pos and size of prediction units (PUs)\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid cu_init_prediction_units(davs2_t *h, cu_t *p_cu)\r\n{\r\n    /* ---------------------------------------------------------------------------\r\n    */\r\n    static const int NUM_PREDICTION_UNIT[MAX_PRED_MODES] = {// [mode]\r\n        1, // 0: 8x8, ---, ---, --- (PRED_SKIP   )\r\n        1, // 1: 8x8, ---, ---, --- (PRED_2Nx2N  )\r\n        2, // 2: 8x4, 8x4, ---, --- (PRED_2NxN   )\r\n        2, // 3: 4x8, 4x8, ---, --- (PRED_Nx2N   )\r\n        2, // 4: 8x2, 8x6, ---, --- (PRED_2NxnU  )\r\n        2, // 5: 8x6, 8x2, ---, --- (PRED_2NxnD  )\r\n        2, // 6: 2x8, 6x8, ---, --- (PRED_nLx2N  )\r\n        2, // 7: 
6x8, 2x8, ---, --- (PRED_nRx2N  )\r\n        1, // 8: 8x8, ---, ---, --- (PRED_I_2Nx2N)\r\n        4, // 9: 4x4, 4x4, 4x4, 4x4 (PRED_I_NxN  )\r\n        4, //10: 8x2, 8x2, 8x2, 8x2 (PRED_I_2Nxn )\r\n        4  //11: 2x8, 2x8, 2x8, 2x8 (PRED_I_nx2N )\r\n    };\r\n\r\n    static const cb_t CODING_BLOCK_INFO[MAX_PRED_MODES + 1][4] = {// [mode][block]\r\n        //  x, y, w, h      x, y, w, h      x, y, w, h      x, y, w, h for block 0, 1, 2 and 3\r\n        {{{0, 0, 8, 8}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // 0: 8x8, ---, ---, --- (PRED_SKIP   )\r\n        {{{0, 0, 8, 8}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // 1: 8x8, ---, ---, --- (PRED_2Nx2N  )\r\n        {{{0, 0, 8, 4}}, {{0, 4, 8, 4}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // 2: 8x4, 8x4, ---, --- (PRED_2NxN   )\r\n        {{{0, 0, 4, 8}}, {{4, 0, 4, 8}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // 3: 4x8, 4x8, ---, --- (PRED_Nx2N   )\r\n        {{{0, 0, 8, 2}}, {{0, 2, 8, 6}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // 4: 8x2, 8x6, ---, --- (PRED_2NxnU  )\r\n        {{{0, 0, 8, 6}}, {{0, 6, 8, 2}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // 5: 8x6, 8x2, ---, --- (PRED_2NxnD  )\r\n        {{{0, 0, 2, 8}}, {{2, 0, 6, 8}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // 6: 2x8, 6x8, ---, --- (PRED_nLx2N  )\r\n        {{{0, 0, 6, 8}}, {{6, 0, 2, 8}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // 7: 6x8, 2x8, ---, --- (PRED_nRx2N  )\r\n        {{{0, 0, 8, 8}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // 8: 8x8, ---, ---, --- (PRED_I_2Nx2N)\r\n        {{{0, 0, 4, 4}}, {{4, 0, 4, 4}}, {{0, 4, 4, 4}}, {{4, 4, 4, 4}}}, // 9: 4x4, 4x4, 4x4, 4x4 (PRED_I_NxN  )\r\n        {{{0, 0, 8, 2}}, {{0, 2, 8, 2}}, {{0, 4, 8, 2}}, {{0, 6, 8, 2}}}, //10: 8x2, 8x2, 8x2, 8x2 (PRED_I_2Nxn )\r\n        {{{0, 0, 2, 8}}, {{2, 0, 2, 8}}, {{4, 0, 2, 8}}, {{6, 0, 2, 8}}}, //11: 2x8, 2x8, 2x8, 2x8 (PRED_I_nx2N )\r\n        {{{0, 0, 4, 4}}, {{4, 0, 4, 4}}, {{0, 4, 4, 4}}, {{4, 4, 4, 4}}}, // X: 4x4, 4x4, 4x4, 4x4\r\n    };\r\n\r\n    const int i_level = 
p_cu->i_cu_level;\r\n    const int i_mode = p_cu->i_cu_type;\r\n    const int shift_bits = i_level - MIN_CU_SIZE_IN_BIT;\r\n    const int block_num = NUM_PREDICTION_UNIT[i_mode];\r\n    int ds_mode = p_cu->i_md_directskip_mode;\r\n    int i;\r\n    cb_t *p_cb = p_cu->pu;\r\n\r\n    // memset(p_cb, 0, 4 * sizeof(cb_t));\r\n\r\n    // set for each block\r\n    if (i_mode == PRED_SKIP) {\r\n        ///! in some special Skip/Direct modes, if the CU is larger than 8x8, the PU is split into 4\r\n        if (i_level > 3 &&\r\n            (h->i_frame_type == AVS2_P_SLICE\r\n            || (h->i_frame_type == AVS2_F_SLICE && ds_mode == DS_NONE)\r\n            || (h->i_frame_type == AVS2_B_SLICE && ds_mode == DS_NONE))) {\r\n            p_cu->num_pu = 4;\r\n            for (i = 0; i < 4; i++) {\r\n                p_cb[i].v = CODING_BLOCK_INFO[PRED_I_nx2N + 1][i].v << shift_bits;\r\n            }\r\n        } else {\r\n            p_cu->num_pu = 1;\r\n            p_cb[0].v = CODING_BLOCK_INFO[PRED_SKIP][0].v << shift_bits;\r\n        }\r\n    } else {\r\n        p_cu->num_pu = (int8_t)block_num;\r\n        for (i = 0; i < block_num; i++) {\r\n            p_cb[i].v = CODING_BLOCK_INFO[i_mode][i].v << shift_bits;\r\n        }\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * obtain the pos and size of transform units (TUs)\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid cu_init_transform_units(cu_t *p_cu, cb_t *p_tu)\r\n{\r\n    static const cb_t TU_SPLIT_INFO[TU_SPLIT_CROSS+1][4] = {// [mode][block]\r\n        //  x, y, w, h      x, y, w, h      x, y, w, h      x, y, w, h for block 0, 1, 2 and 3\r\n        {{{0, 0, 8, 8}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}, {{0, 0, 0, 0}}}, // TU_SPLIT_NON\r\n        {{{0, 0, 8, 2}}, {{0, 2, 8, 2}}, {{0, 4, 8, 2}}, {{0, 6, 8, 2}}}, // TU_SPLIT_HOR\r\n        {{{0, 0, 2, 8}}, {{2, 0, 2, 8}}, {{4, 0, 2, 8}}, {{6, 0, 2, 8}}}, // TU_SPLIT_VER\r\n        {{{0, 0, 4, 4}}, {{4, 0, 4, 4}}, {{0, 4, 4, 4}}, {{4, 4, 4, 4}}}, // TU_SPLIT_CROSS\r\n    
};\r\n\r\n    const int shift_bits = p_cu->i_cu_level - MIN_CU_SIZE_IN_BIT;\r\n    const int i_tu_type = p_cu->i_trans_size;\r\n\r\n    p_tu[0].v = TU_SPLIT_INFO[i_tu_type][0].v << shift_bits;\r\n    p_tu[1].v = TU_SPLIT_INFO[i_tu_type][1].v << shift_bits;\r\n    p_tu[2].v = TU_SPLIT_INFO[i_tu_type][2].v << shift_bits;\r\n    p_tu[3].v = TU_SPLIT_INFO[i_tu_type][3].v << shift_bits;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get neighboring MVs for MVP\r\n */\r\nstatic void cu_get_neighbors(davs2_t *h, cu_t *p_cu, int pix_x, int pix_y, int bsx, int bsy)\r\n{\r\n    neighbor_inter_t *neighbors = h->lcu.neighbor_inter;\r\n    int cur_slice_idx = p_cu->i_slice_nr;\r\n    int x0 = (pix_x >> MIN_PU_SIZE_IN_BIT);\r\n    int y0 = (pix_y >> MIN_PU_SIZE_IN_BIT);\r\n    int x1 = (bsx   >> MIN_PU_SIZE_IN_BIT) + x0 - 1;\r\n    int y1 = (bsy   >> MIN_PU_SIZE_IN_BIT) + y0 - 1;\r\n\r\n    /* 1. check whether the top-right 4x4 block is reconstructed */\r\n    int x_top_right_4x4_in_lcu = x1 - h->lcu.i_spu_x;\r\n    int y_top_right_4x4_in_lcu = y0 - h->lcu.i_spu_y;\r\n    int block_available_TR = h->p_tab_TR_avail[(y_top_right_4x4_in_lcu << (h->i_lcu_level - B4X4_IN_BIT)) + x_top_right_4x4_in_lcu];\r\n\r\n    /* 2. get neighboring blocks */\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_LEFT    ], x0 - 1, y0    );\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_TOP     ], x0    , y0 - 1);\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_TOP2    ], x1    , y0 - 1);\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_TOPLEFT ], x0 - 1, y0 - 1);\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_LEFT2   ], x0 - 1, y1    );\r\n\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_TOPRIGHT], block_available_TR ? 
x1 + 1 : -1, y0 - 1);\r\n\r\n    cu_get_neighbor_temporal(h, &neighbors[BLK_COLLOCATED], x0, y0);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nvoid cu_init(davs2_t *h, cu_t *p_cu, int i_level, int scu_xy, int pix_x)\r\n{\r\n    assert(scu_xy >= 0 && scu_xy < h->i_size_in_scu);\r\n\r\n    // reset syntax element entries in cu_t\r\n    p_cu->i_cu_level    = (int8_t)i_level;\r\n    p_cu->i_qp          = (int8_t)h->i_qp;\r\n    p_cu->i_cu_type     = PRED_SKIP;\r\n    p_cu->i_cbp         = 0;\r\n    p_cu->c_ipred_mode  = DC_PRED_C;\r\n    p_cu->i_dmh_mode    = 0;\r\n    memset(p_cu->dct_pattern, 0, sizeof(p_cu->dct_pattern));\r\n\r\n    // check left CU\r\n    h->lcu.i_left_cu_qp     = (int8_t)h->i_qp;\r\n    h->lcu.c_ipred_mode_ctx = 0;\r\n\r\n    if (pix_x > 0) {\r\n        cu_t *p_left_cu = &h->scu_data[scu_xy - 1];\r\n\r\n        if (p_left_cu->i_slice_nr == p_cu->i_slice_nr) {\r\n            h->lcu.c_ipred_mode_ctx = p_left_cu->c_ipred_mode != 0;\r\n            h->lcu.i_left_cu_qp     = p_left_cu->i_qp;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nvoid cu_read_end(davs2_t *h, cu_t *p_cu, int i_level, int scu_xy)\r\n{\r\n    cu_t *p_cu_iter = &h->scu_data[scu_xy];\r\n    int size_in_scu = 1 << (i_level - MIN_CU_SIZE_IN_BIT);\r\n    int i;\r\n\r\n    if (size_in_scu <= 1) {\r\n        return;\r\n    }\r\n\r\n    /* the first row */\r\n    for (i = 1; i < size_in_scu; i++) {\r\n        memcpy(p_cu_iter + i, p_cu, sizeof(cu_t));\r\n    }\r\n\r\n    /* the remaining rows */\r\n    for (i = 1; i < size_in_scu; i++) {\r\n        p_cu_iter += h->i_width_in_scu;\r\n        memcpy(p_cu_iter, p_cu, size_in_scu * sizeof(cu_t));\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic int cu_read_intrapred_mode_luma(davs2_t *h, aec_t *p_aec, cu_t *p_cu, int 
b8, int bi, int bj)\r\n{\r\n    int size_in_scu = 1 << (p_cu->i_cu_level - MIN_CU_SIZE_IN_BIT);\r\n    int i_intramode = h->i_ipredmode;\r\n    int8_t *p_intramode = h->p_ipredmode + bj * i_intramode + bi;\r\n    int intra_mode_top  = p_intramode[-i_intramode];\r\n    int intra_mode_left = p_intramode[-1];\r\n    int luma_mode = aec_read_intra_pmode(p_aec);\r\n    int mpm[2];\r\n    int8_t real_luma_mode;\r\n\r\n#if AVS2_TRACE\r\n    strncpy(p_aec->tracestring, \"Ipred Mode\", TRACESTRING_SIZE);\r\n#endif\r\n    assert(IS_INTRA(p_cu) && b8 < 4 && b8 >= 0);\r\n\r\n    AEC_RETURN_ON_ERROR(-1);\r\n\r\n    mpm[0] = DAVS2_MIN(intra_mode_top, intra_mode_left);\r\n    mpm[1] = DAVS2_MAX(intra_mode_top, intra_mode_left);\r\n\r\n    if (mpm[0] == mpm[1]) {\r\n        mpm[0] = DC_PRED;\r\n        mpm[1] = (mpm[1] == DC_PRED) ? BI_PRED : mpm[1];\r\n    }\r\n\r\n    real_luma_mode = (int8_t)((luma_mode < 0) ? mpm[luma_mode + 2] : luma_mode + (luma_mode >= mpm[0]) + (luma_mode + 1 >= mpm[1]));\r\n\r\n    if (real_luma_mode < 0 || real_luma_mode >= NUM_INTRA_MODE) {\r\n        davs2_log(h, DAVS2_LOG_ERROR, \"invalid pred mode %2d. 
POC %3d, pixel (%3d, %3d), %2dx%2d\",\r\n                 real_luma_mode, h->i_poc, bi << MIN_PU_SIZE_IN_BIT, bj << MIN_PU_SIZE_IN_BIT,\r\n                 size_in_scu << MIN_CU_SIZE_IN_BIT, size_in_scu << MIN_CU_SIZE_IN_BIT);\r\n        real_luma_mode = (int8_t)davs2_clip3(real_luma_mode, 0, NUM_INTRA_MODE - 1);\r\n    }\r\n    p_cu->intra_pred_modes[b8] = real_luma_mode;\r\n\r\n    // store intra prediction mode, for MPM of next blocks\r\n    {\r\n        int w_4x4 = size_in_scu << 1;\r\n        int h_4x4 = size_in_scu << 1;\r\n        int j;\r\n\r\n        switch (p_cu->i_trans_size) {\r\n        case TU_SPLIT_HOR:\r\n            h_4x4 >>= 2;\r\n            break;\r\n        case TU_SPLIT_VER:\r\n            w_4x4 >>= 2;\r\n            break;\r\n        case TU_SPLIT_CROSS:\r\n            w_4x4 >>= 1;\r\n            h_4x4 >>= 1;\r\n            break;\r\n        }\r\n\r\n        for (j = 0; j < h_4x4; j++) {\r\n            int i = (j == h_4x4 - 1) ? 0 : w_4x4 - 1;\r\n            for (; i < w_4x4; i++) {\r\n                p_intramode[i] = real_luma_mode;\r\n            }\r\n            p_intramode += i_intramode;\r\n        }\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid cu_store_references(davs2_t *h, cu_t *p_cu, int pix_x, int pix_y)\r\n{\r\n    int width_in_spu = h->i_width_in_spu;\r\n    int block8_y = pix_y >> MIN_PU_SIZE_IN_BIT;\r\n    int block8_x = pix_x >> MIN_PU_SIZE_IN_BIT;\r\n    int idx_pu;\r\n\r\n    for (idx_pu = 0; idx_pu < p_cu->num_pu; idx_pu++) {\r\n        ref_idx_t *p_ref_1st;\r\n        int8_t    *p_dirpred;\r\n        int8_t     i_dir_pred;\r\n        ref_idx_t ref_idx;\r\n        int b8_x, b8_y;\r\n        int r, c;\r\n        cb_t pu;\r\n\r\n        pu.v = p_cu->pu[idx_pu].v >> 2;\r\n\r\n        b8_x = block8_x + pu.x;\r\n        b8_y = block8_y + pu.y;\r\n\r\n        i_dir_pred = (int8_t)p_cu->b8pdir[idx_pu];\r\n        ref_idx    = 
p_cu->ref_idx[idx_pu];\r\n        p_dirpred  = h->p_dirpred + b8_y * width_in_spu + b8_x;\r\n        p_ref_1st  = h->p_ref_idx + b8_y * width_in_spu + b8_x;\r\n\r\n        for (r = pu.h; r != 0; r--) {\r\n            for (c = 0; c < pu.w; c++) {\r\n                p_ref_1st[c] = ref_idx;\r\n                p_dirpred[c] = i_dir_pred;\r\n            }\r\n            p_ref_1st += width_in_spu;\r\n            p_dirpred += width_in_spu;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic int cu_read_mv(davs2_t *h, aec_t *p_aec, int i_level, int scu_xy, int pix_x, int pix_y)\r\n{\r\n    cu_t *p_cu = &h->scu_data[scu_xy];\r\n    int bframe = (h->i_frame_type == AVS2_B_SLICE);\r\n    int idx_pu;\r\n    int block8_y = pix_y >> MIN_PU_SIZE_IN_BIT;\r\n    int block8_x = pix_x >> MIN_PU_SIZE_IN_BIT;\r\n    int width_in_spu = h->i_width_in_spu;\r\n    int distance_fwd;\r\n    int distance_fwd_src;\r\n    int distance_bwd;  // TODO: initial value for the non-B-frame case?\r\n\r\n    assert(p_cu->i_cu_type != PRED_SKIP);\r\n\r\n    if (h->i_frame_type == AVS2_F_SLICE && /*h->b_dmh &&*/\r\n        p_cu->b8pdir[0] == PDIR_FWD && p_cu->b8pdir[1] == PDIR_FWD &&\r\n        p_cu->b8pdir[2] == PDIR_FWD && p_cu->b8pdir[3] == PDIR_FWD) {\r\n        //has forward vector\r\n        if (!(i_level == B8X8_IN_BIT && p_cu->i_cu_type >= PRED_2NxN && p_cu->i_cu_type <= PRED_nRx2N)) {\r\n            p_cu->i_dmh_mode = (int8_t)aec_read_dmh_mode(p_aec, p_cu->i_cu_level);\r\n            AEC_RETURN_ON_ERROR(-1);\r\n#if AVS2_TRACE\r\n            avs2_trace(\"dmh_mode = %3d\\n\", p_cu->i_dmh_mode);\r\n#endif\r\n        } else {\r\n            p_cu->i_dmh_mode = 0;\r\n        }\r\n    }\r\n\r\n    //=====  READ PDIR_FWD MOTION VECTORS =====\r\n    for (idx_pu = 0; idx_pu < p_cu->num_pu; idx_pu++) {\r\n        if (p_cu->b8pdir[idx_pu] != PDIR_BWD) {\r\n            int pu_pix_x = p_cu->pu[idx_pu].x;\r\n            int pu_pix_y = 
p_cu->pu[idx_pu].y;\r\n            int bsx      = p_cu->pu[idx_pu].w;\r\n            int bsy      = p_cu->pu[idx_pu].h;\r\n            int i8       = block8_x + (pu_pix_x >> 2);\r\n            int j8       = block8_y + (pu_pix_y >> 2);\r\n            int refframe = h->p_ref_idx[j8 * width_in_spu + i8].r[0];\r\n            mv_t mv, mvp;\r\n            int ii, jj;\r\n\r\n            // first make mv-prediction\r\n            int pu_mvp_type = get_pu_type_for_mvp(bsx, bsy, pu_pix_x, pu_pix_y);\r\n            get_mvp_default(h, p_cu, pix_x + pu_pix_x, pix_y + pu_pix_y, &mvp, 0, refframe, bsx, pu_mvp_type);\r\n\r\n            bsx >>= MIN_PU_SIZE_IN_BIT;\r\n            bsy >>= MIN_PU_SIZE_IN_BIT;\r\n            if (h->i_frame_type != AVS2_S_SLICE) {  //no mvd for S frame, just set it to 0\r\n                mv_t mvd;\r\n                aec_read_mvds(p_aec, &mvd);\r\n                pmvr_mv_derivation(h, &mv, &mvd, &mvp);\r\n#if AVS2_TRACE\r\n                avs2_trace(\"@%d FMVD (pred %3d)\\t\\t\\t%d \\n\", symbolCount++, mvp.x, mvd.x);\r\n                avs2_trace(\"@%d FMVD (pred %3d)\\t\\t\\t%d \\n\", symbolCount++, mvp.y, mvd.y);\r\n#endif\r\n                AEC_RETURN_ON_ERROR(-1);\r\n            } else {\r\n                mv.v = mvp.v;\r\n            }\r\n\r\n            if (bframe) {\r\n                mv_t *p_mv_1st = h->p_tmv_1st + j8 * width_in_spu + i8;\r\n                for (jj = 0; jj < bsy; jj++) {\r\n                    for (ii = 0; ii < bsx; ii++) {\r\n                        p_mv_1st[ii] = mv;\r\n                    }\r\n                    p_mv_1st += width_in_spu;\r\n                }\r\n                p_cu->mv[idx_pu][0] = mv;\r\n            } else {\r\n                mv_t *p_mv_1st = h->p_tmv_1st + j8 * width_in_spu + i8;\r\n                mv_t *p_mv_2nd = h->p_tmv_2nd + j8 * width_in_spu + i8;\r\n                mv_t mv_2nd;\r\n                if (p_cu->b8pdir[idx_pu] == PDIR_DUAL) {\r\n                    int distance_1st     = 
get_distance_index_p(h, refframe);\r\n                    int distance_1st_src = get_distance_index_p_scale(h, refframe);\r\n                    int distance_2nd     = get_distance_index_p(h, !refframe);\r\n\r\n                    mv_2nd.x = scale_mv_skip(h, mv.x, distance_2nd, distance_1st_src);\r\n                    mv_2nd.y = scale_mv_skip_y(h, mv.y, distance_2nd, distance_1st, distance_1st_src);\r\n                } else {\r\n                    mv_2nd.v = 0;\r\n                }\r\n\r\n                p_cu->mv[idx_pu][0] = mv;\r\n                p_cu->mv[idx_pu][1] = mv_2nd;\r\n\r\n                for (jj = 0; jj < bsy; jj++) {\r\n                    for (ii = 0; ii < bsx; ii++) {\r\n                        p_mv_1st[ii] = mv;\r\n                        p_mv_2nd[ii] = mv_2nd;\r\n                    }\r\n                    p_mv_1st += width_in_spu;\r\n                    p_mv_2nd += width_in_spu;\r\n                }\r\n            }\r\n        }\r\n    }\r\n\r\n    if (!bframe) {\r\n        return 0;\r\n    }\r\n\r\n    assert(h->i_pic_coding_type == FRAME);\r\n    {\r\n        distance_fwd     = get_distance_index_b(h, B_FWD);  // fwd\r\n        distance_fwd_src = get_distance_index_b_scale(h, B_FWD);\r\n        distance_bwd     = get_distance_index_b(h, B_BWD);  // bwd\r\n    }\r\n\r\n    //=====  READ PDIR_BWD MOTION VECTORS =====\r\n    for (idx_pu = 0; idx_pu< p_cu->num_pu; idx_pu++) {\r\n        if (p_cu->b8pdir[idx_pu] != PDIR_FWD) {     //has backward vector\r\n            int pu_pix_x = p_cu->pu[idx_pu].x;\r\n            int pu_pix_y = p_cu->pu[idx_pu].y;\r\n            int bsx      = p_cu->pu[idx_pu].w;\r\n            int bsy      = p_cu->pu[idx_pu].h;\r\n            int i8       = block8_x + (pu_pix_x >> 2);\r\n            int j8       = block8_y + (pu_pix_y >> 2);\r\n            int refframe = h->p_ref_idx[j8 * width_in_spu + i8].r[1];\r\n            mv_t *p_mv_2nd = h->p_tmv_2nd + j8 * width_in_spu + i8;\r\n            mv_t mv, mvp;\r\n           
 int ii, jj;\r\n\r\n            int pu_mvp_type = get_pu_type_for_mvp(bsx, bsy, pu_pix_x, pu_pix_y);\r\n            get_mvp_default(h, p_cu, pix_x + pu_pix_x, pix_y + pu_pix_y, &mvp, 1, refframe, bsx, pu_mvp_type);\r\n\r\n            bsx >>= MIN_PU_SIZE_IN_BIT;\r\n            bsy >>= MIN_PU_SIZE_IN_BIT;\r\n\r\n            if (p_cu->b8pdir[idx_pu] == PDIR_SYM) {\r\n                mv_t mv_1st;\r\n\r\n                mv_1st = h->p_tmv_1st[j8 * width_in_spu + i8];\r\n\r\n                mv.x = -scale_mv_skip  (h, mv_1st.x, distance_bwd, distance_fwd_src);\r\n                mv.y = -scale_mv_skip_y(h, mv_1st.y, distance_bwd, distance_fwd, distance_fwd_src);\r\n            } else {\r\n                mv_t mvd;\r\n                aec_read_mvds(p_aec, &mvd);\r\n                pmvr_mv_derivation(h, &mv, &mvd, &mvp);\r\n#if AVS2_TRACE\r\n                avs2_trace(\"@%d BMVD (pred %3d)\\t\\t\\t%d \\n\", symbolCount++, mvp.x, mvd.x);\r\n                avs2_trace(\"@%d BMVD (pred %3d)\\t\\t\\t%d \\n\", symbolCount++, mvp.y, mvd.y);\r\n#endif\r\n                AEC_RETURN_ON_ERROR(-1);\r\n            }\r\n\r\n            p_cu->mv[idx_pu][1] = mv;\r\n            for (jj = 0; jj < bsy; jj++) {\r\n                for (ii = 0; ii < bsx; ii++) {\r\n                    p_mv_2nd[ii] = mv;\r\n                }\r\n\r\n                p_mv_2nd += width_in_spu;\r\n            }\r\n        }\r\n    }\r\n\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get all coefficients of one CU\r\n */\r\nstatic int cu_read_all_coeffs(davs2_t *h, aec_t *p_aec, cu_t *p_cu)\r\n{\r\n    runlevel_t *runlevel = &h->lcu.cg_info;\r\n    int idx_cu_zscan = h->lcu.idx_cu_zscan_aec;\r\n#if CTRL_AEC_THREAD\r\n    coeff_t *coeff_y = &h->lcu.lcu_aec->rec_info.coeff_buf_y    [idx_cu_zscan << 6];\r\n    coeff_t *coeff_u = &h->lcu.lcu_aec->rec_info.coeff_buf_uv[0][idx_cu_zscan << 4];\r\n    coeff_t *coeff_v = 
&h->lcu.lcu_aec->rec_info.coeff_buf_uv[1][idx_cu_zscan << 4];\r\n#else\r\n    coeff_t *coeff_y = &h->lcu.rec_info.coeff_buf_y    [idx_cu_zscan << 6];\r\n    coeff_t *coeff_u = &h->lcu.rec_info.coeff_buf_uv[0][idx_cu_zscan << 4];\r\n    coeff_t *coeff_v = &h->lcu.rec_info.coeff_buf_uv[1][idx_cu_zscan << 4];\r\n#endif\r\n    int bit_size   = p_cu->i_cu_level;\r\n    int i_tu_level = p_cu->i_cu_level;  // related to the coefficients contained in the transform block\r\n    int b8;\r\n    int uv;\r\n\r\n    /*if (h->i_pic_coding_type == FRAME)*/ {\r\n        runlevel->p_ctx_run            = p_aec->syn_ctx.coeff_run[0];\r\n        runlevel->p_ctx_level          = p_aec->syn_ctx.coeff_level;\r\n        runlevel->p_ctx_sig_cg         = p_aec->syn_ctx.sig_cg_contexts;\r\n        runlevel->p_ctx_last_cg        = p_aec->syn_ctx.last_cg_contexts;\r\n        runlevel->p_ctx_last_pos_in_cg = p_aec->syn_ctx.last_coeff_pos;\r\n    }\r\n\r\n    // luma coefficients\r\n    if (p_cu->i_trans_size == TU_SPLIT_NON) {\r\n        i_tu_level         = DAVS2_MIN(3, i_tu_level - B4X4_IN_BIT);\r\n        runlevel->avs_scan = tab_scan_coeff[i_tu_level][TU_SPLIT_NON];\r\n        runlevel->cg_scan  = tab_scan_cg[i_tu_level][TU_SPLIT_NON];\r\n\r\n        if (p_cu->i_cbp & 0x0F) {\r\n            int intra_pred_class = IS_INTRA(p_cu) ? 
tab_intra_mode_scan_type[p_cu->intra_pred_modes[0]] : INTRA_PRED_DC_DIAG;\r\n            int b_swap_xy = (IS_INTRA(p_cu) && intra_pred_class == INTRA_PRED_HOR);\r\n            int blocksize = 1 << (i_tu_level + B4X4_IN_BIT);\r\n            int shift, scale;\r\n            int wq_size_id = DAVS2_MIN(3, bit_size - B4X4_IN_BIT);\r\n\r\n            cu_get_quant_params(h, p_cu->i_qp, bit_size - (p_cu->i_trans_size != TU_SPLIT_NON), &shift, &scale);\r\n#if !CTRL_AEC_THREAD\r\n            gf_davs2.fast_memzero(coeff_y, sizeof(coeff_t) * blocksize * blocksize);\r\n#endif\r\n\r\n            p_cu->dct_pattern[0] = cu_get_block_coeffs(p_aec, runlevel, p_cu, coeff_y,\r\n                                                       blocksize, blocksize, i_tu_level,\r\n                                                       1, intra_pred_class, b_swap_xy,\r\n                                                       scale, shift, wq_size_id);\r\n        }\r\n    } else {\r\n        int b_wavelet_conducted = (bit_size == B64X64_IN_BIT && p_cu->i_trans_size != TU_SPLIT_CROSS);\r\n        cb_t tus[4];\r\n        int shift, scale;\r\n        int wq_size_id = DAVS2_MIN(3, bit_size - B4X4_IN_BIT);\r\n\r\n        cu_init_transform_units(p_cu, tus);\r\n        tus[0].v >>= b_wavelet_conducted;\r\n        tus[1].v >>= b_wavelet_conducted;\r\n        tus[2].v >>= b_wavelet_conducted;\r\n        tus[3].v >>= b_wavelet_conducted;\r\n\r\n        i_tu_level -= B8X8_IN_BIT;\r\n        i_tu_level -= b_wavelet_conducted;\r\n        cu_get_quant_params(h, p_cu->i_qp, p_cu->i_cu_level - (p_cu->i_trans_size != TU_SPLIT_NON), &shift, &scale);\r\n        if (p_cu->i_trans_size == TU_SPLIT_CROSS) {\r\n            wq_size_id = DAVS2_MIN(3, bit_size - B8X8_IN_BIT);\r\n        } else {\r\n            wq_size_id = bit_size - B8X8_IN_BIT;\r\n            wq_size_id -= (p_cu->i_cu_level == B64X64_IN_BIT);\r\n        }\r\n\r\n        runlevel->avs_scan = tab_scan_coeff[i_tu_level][p_cu->i_trans_size];\r\n        
runlevel->cg_scan  = tab_scan_cg[i_tu_level][p_cu->i_trans_size];\r\n\r\n        for (b8 = 0; b8 < 4; b8++) { /* all 4 blocks */\r\n            if (p_cu->i_cbp & (1 << b8)) {\r\n                int bsx = tus[b8].w;\r\n                int bsy = tus[b8].h;\r\n                int intra_pred_class = IS_INTRA(p_cu) ? tab_intra_mode_scan_type[p_cu->intra_pred_modes[b8]] : INTRA_PRED_DC_DIAG;\r\n                int b_swap_xy = (IS_INTRA(p_cu) && intra_pred_class == INTRA_PRED_HOR && p_cu->i_cu_type != PRED_I_2Nxn && p_cu->i_cu_type != PRED_I_nx2N);\r\n                coeff_t *p_res = coeff_y + (b8 << ((bit_size - 1) << 1));\r\n#if !CTRL_AEC_THREAD\r\n                gf_davs2.fast_memzero(p_res, sizeof(coeff_t) * bsx * bsy);\r\n#endif\r\n                p_cu->dct_pattern[b8] = cu_get_block_coeffs(p_aec, runlevel, p_cu, p_res,\r\n                                                            bsx, bsy, i_tu_level,\r\n                                                            1, intra_pred_class, b_swap_xy,\r\n                                                            scale, shift, wq_size_id);\r\n                if (p_cu->dct_pattern[b8] < 0) {\r\n                    return -1;\r\n                }\r\n            }\r\n        }\r\n    }\r\n\r\n    // adaptive frequency weighting quantization\r\n    i_tu_level = p_cu->i_cu_level - B8X8_IN_BIT;\r\n    runlevel->avs_scan = tab_scan_coeff[i_tu_level][TU_SPLIT_NON];\r\n    runlevel->cg_scan  = tab_scan_cg[i_tu_level][TU_SPLIT_NON];\r\n\r\n    /*if (h->i_pic_coding_type == FRAME)*/ {\r\n        runlevel->p_ctx_run             = p_aec->syn_ctx.coeff_run[1];\r\n        runlevel->p_ctx_level           = p_aec->syn_ctx.coeff_level + 20;\r\n        runlevel->p_ctx_sig_cg          = p_aec->syn_ctx.sig_cg_contexts + NUM_SIGCG_CTX_LUMA;\r\n        runlevel->p_ctx_last_cg         = p_aec->syn_ctx.last_cg_contexts + NUM_LAST_CG_CTX_LUMA;\r\n        runlevel->p_ctx_last_pos_in_cg  = p_aec->syn_ctx.last_coeff_pos + NUM_LAST_POS_CTX_LUMA;\r\n   
 }\r\n\r\n    if (h->i_chroma_format != CHROMA_400) {\r\n        int wq_size_id = p_cu->i_cu_level - 1;\r\n        for (uv = 0; uv < 2; uv++) {\r\n            if ((p_cu->i_cbp >> (uv + 4)) & 0x1) {\r\n                int blocksize = 1 << wq_size_id;\r\n                coeff_t *p_res = uv ? coeff_v : coeff_u;\r\n                int shift, scale;\r\n#if !CTRL_AEC_THREAD\r\n                gf_davs2.fast_memzero(p_res, sizeof(coeff_t) * blocksize * blocksize);\r\n#endif\r\n                cu_get_quant_params(h, cu_get_chroma_qp(h, p_cu->i_qp, uv), wq_size_id, &shift, &scale);\r\n\r\n                p_cu->dct_pattern[4 + uv] = cu_get_block_coeffs(p_aec, runlevel, p_cu, p_res,\r\n                                                                blocksize, blocksize, i_tu_level,\r\n                                                                0, INTRA_PRED_DC_DIAG, 0,\r\n                                                                scale, shift, wq_size_id);\r\n                if (p_cu->dct_pattern[4 + uv] < 0) {\r\n                    return -1;\r\n                }\r\n            }\r\n        }\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get the syntax elements from the NAL, return cu_type\r\n */\r\nstatic\r\nint cu_read_header(davs2_t *h, aec_t *p_aec, cu_t *p_cu, int pix_x, int pix_y, int *p_real_cu_type)\r\n{\r\n    int real_cu_type;\r\n\r\n    p_cu->i_md_directskip_mode = 0;\r\n    if (h->i_frame_type == AVS2_S_SLICE) {\r\n        real_cu_type = aec_read_cu_type_sframe(p_aec);\r\n    } else {\r\n        real_cu_type = aec_read_cu_type(p_aec, p_cu, h->i_frame_type, h->seq_info.enable_amp,\r\n            h->seq_info.enable_mhp_skip, h->seq_info.enable_wsm, h->num_of_references);\r\n    }\r\n\r\n    AEC_RETURN_ON_ERROR(-1);\r\n\r\n    *p_real_cu_type = real_cu_type;\r\n    real_cu_type    = DAVS2_MAX(0, real_cu_type);\r\n    p_cu->i_cu_type = (int8_t)real_cu_type;\r\n\r\n    /* 
parse the inter prediction direction */\r\n    if (h->i_frame_type != AVS2_I_SLICE && IS_INTER_MODE(real_cu_type)) {\r\n        aec_read_inter_pred_dir(p_aec, p_cu, h);\r\n        AEC_RETURN_ON_ERROR(-1);\r\n    }\r\n\r\n    if (IS_INTRA(p_cu)) {\r\n        int size_8x8   = 1 << (p_cu->i_cu_level - B8X8_IN_BIT);\r\n        int size_16x16 = 1 << (p_cu->i_cu_level - B16X16_IN_BIT);\r\n        int y_4x4      = pix_y >> MIN_PU_SIZE_IN_BIT;\r\n        int x_4x4      = pix_x >> MIN_PU_SIZE_IN_BIT;\r\n\r\n        real_cu_type = aec_read_intra_cu_type(p_aec, p_cu, h->seq_info.enable_sdip, h);\r\n        p_cu->i_cu_type = (int8_t)real_cu_type;\r\n        AEC_RETURN_ON_ERROR(-1);\r\n\r\n        /* Read luma block prediction modes */\r\n        if (cu_read_intrapred_mode_luma(h, p_aec, p_cu, 0, x_4x4, y_4x4) < 0) {\r\n            return -1;\r\n        }\r\n\r\n        switch (real_cu_type) {\r\n        case PRED_I_2Nxn:\r\n            if (cu_read_intrapred_mode_luma(h, p_aec, p_cu, 1, x_4x4, y_4x4 + 1 * size_16x16) < 0 ||\r\n                cu_read_intrapred_mode_luma(h, p_aec, p_cu, 2, x_4x4, y_4x4 + 2 * size_16x16) < 0 ||\r\n                cu_read_intrapred_mode_luma(h, p_aec, p_cu, 3, x_4x4, y_4x4 + 3 * size_16x16) < 0) {\r\n                return -1;\r\n            }\r\n\r\n            break;\r\n        case PRED_I_nx2N:\r\n            if (cu_read_intrapred_mode_luma(h, p_aec, p_cu, 1, x_4x4 + 1 * size_16x16, y_4x4) < 0 ||\r\n                cu_read_intrapred_mode_luma(h, p_aec, p_cu, 2, x_4x4 + 2 * size_16x16, y_4x4) < 0 ||\r\n                cu_read_intrapred_mode_luma(h, p_aec, p_cu, 3, x_4x4 + 3 * size_16x16, y_4x4) < 0) {\r\n                return -1;\r\n            }\r\n\r\n            break;\r\n        case PRED_I_NxN:\r\n            if (cu_read_intrapred_mode_luma(h, p_aec, p_cu, 1, x_4x4 + size_8x8, y_4x4 + 0) < 0 ||\r\n                cu_read_intrapred_mode_luma(h, p_aec, p_cu, 2, x_4x4 + 0, y_4x4 + size_8x8) < 0 ||\r\n                cu_read_intrapred_mode_luma(h, p_aec, p_cu, 3, x_4x4 
+ size_8x8, y_4x4 + size_8x8) < 0) {\r\n                return -1;\r\n            }\r\n\r\n            break;\r\n        default:\r\n            break;\r\n        }\r\n\r\n#if AVS2_TRACE\r\n        strncpy(p_aec->tracestring, \"Chroma intra pred mode\", TRACESTRING_SIZE);\r\n#endif\r\n        if (h->i_chroma_format != CHROMA_400) {\r\n            p_cu->c_ipred_mode = (int8_t)aec_read_intra_pmode_c(p_aec, h, p_cu->intra_pred_modes[0]);\r\n        } else {\r\n            p_cu->c_ipred_mode = 0;\r\n        }\r\n        AEC_RETURN_ON_ERROR(-1);\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * read CU information from bitstream\r\n */\r\nstatic int cu_read_info(davs2_t *h, cu_t *p_cu, int i_level, int scu_xy, int pix_x, int pix_y)\r\n{\r\n    aec_t *p_aec = &h->aec;\r\n    int size_in_scu = 1 << (i_level - MIN_CU_SIZE_IN_BIT);\r\n    int real_cu_type;\r\n\r\n    /* 0, initialize cu data */\r\n    cu_init(h, p_cu, i_level, scu_xy, pix_x);\r\n\r\n    /* 1, read cu type and delta_QP\r\n     * including PU partition, intra prediction mode, reference indexes\r\n     */\r\n    if (cu_read_header(h, p_aec, p_cu, pix_x, pix_y, &real_cu_type) < 0) {\r\n        return -1;\r\n    }\r\n\r\n    // get the size and pos of prediction units\r\n    cu_init_prediction_units(h, p_cu);\r\n\r\n    /* 2, read motion vectors and reference indexes */\r\n    if (IS_INTRA(p_cu)) {\r\n        int i = 0;\r\n        for (i = 0; i < 4; i++) {\r\n            p_cu->ref_idx[i].r[0] = INVALID_REF;\r\n            p_cu->ref_idx[i].r[1] = INVALID_REF;\r\n            p_cu->b8pdir[i]       = PDIR_INVALID;\r\n        }\r\n        // TODO: this is already initialized at frame level, so there is no need to set it again here with cu_store_references()\r\n        cu_store_references(h, p_cu, pix_x, pix_y);\r\n    } else if (p_cu->i_cu_type == PRED_SKIP) {\r\n        cu_get_neighbors(h, p_cu, pix_x, pix_y, 1 << i_level, 1 << i_level);\r\n        fill_mv_and_ref_for_skip(h, p_cu, pix_x, pix_y, size_in_scu);\r\n  
  } else {\r\n        cu_store_references(h, p_cu, pix_x, pix_y);\r\n        if (cu_read_mv(h, p_aec, p_cu->i_cu_level, scu_xy, pix_x, pix_y) < 0) {\r\n            return -1;\r\n        }\r\n    }\r\n\r\n    /* 3, read CBP and coefficients */\r\n    if (real_cu_type < 0) {  /* skip mode, no residual */\r\n        p_cu->i_qp = h->lcu.i_left_cu_qp;\r\n        p_cu->i_trans_size = TU_SPLIT_NON;      // cbp has been initialed as zero\r\n    } else {    // non-skip mode\r\n        // read CBP\r\n        if (cu_read_cbp(h, p_aec, p_cu, pix_x >> MIN_CU_SIZE_IN_BIT, pix_y >> MIN_CU_SIZE_IN_BIT) < 0) {\r\n            return -1;\r\n        }\r\n\r\n        if (p_cu->i_cbp != 0) {\r\n            if (cu_read_all_coeffs(h, p_aec, p_cu) < 0) {   // read all coefficients\r\n                return -1;\r\n            }\r\n        }\r\n    }\r\n\r\n    AEC_RETURN_ON_ERROR(-1);\r\n    /* 4, finish decoding the cu data */\r\n    cu_read_end(h, p_cu, i_level, scu_xy);\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid decoder_wait_lcu_row(davs2_t *h, davs2_frame_t *frame, int line)\r\n{\r\n    line = DAVS2_MAX(line, 0);\r\n    line = DAVS2_MIN(line, h->i_height_in_lcu - 1);\r\n\r\n    if (frame->i_decoded_line < line && frame->num_decoded_lcu_in_row[line] < h->i_width_in_lcu + 1) {\r\n        davs2_thread_mutex_lock(&frame->mutex_recon);\r\n\r\n        while (frame->i_decoded_line < line && frame->num_decoded_lcu_in_row[line] < h->i_width_in_lcu + 1) {\r\n            davs2_thread_cond_wait(&frame->conds_lcu_row[line], &frame->mutex_recon);\r\n        }\r\n\r\n        davs2_thread_mutex_unlock(&frame->mutex_recon);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid decoder_wait_row(davs2_t *h, davs2_frame_t *frame, int max_y_in_pic)\r\n{\r\n    int line = (max_y_in_pic + 8) >> h->i_lcu_level;\r\n    line = DAVS2_MAX(line, 0);\r\n    line = 
DAVS2_MIN(line, h->i_height_in_lcu - 1);\r\n    decoder_wait_lcu_row(h, frame, line);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * img_size: width or height of the picture           (integer-pel precision)\r\n * blk_size: width or height of the current PU        (integer-pel precision)\r\n * blk_pos:  x/y position of the block in the picture (integer-pel precision)\r\n * mv     :  x/y component of the MV                  (quarter-pel precision)\r\n */\r\nstatic INLINE\r\nint cu_get_mc_pos(int img_size, int blk_size, int blk_pos, int mv)\r\n{\r\n    int imv = mv >> 2;  // integer-pel part of the MV\r\n    int fmv = mv & 7;   // fractional-pel part of the MV, kept to 1/8-pel precision\r\n\r\n    if (blk_pos + imv < -blk_size - 8) {\r\n        return ((-blk_size - 8) << 2) + (fmv);\r\n    } else if (blk_pos + imv > img_size + 4) {\r\n        return ((img_size + 4) << 2) + (fmv);\r\n    } else {\r\n        return (blk_pos << 2) + mv;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * clip mv\r\n */\r\nstatic INLINE\r\nvoid cu_get_mc_pos_mv(davs2_t *h, mv_t *mv, int pic_pix_x, int pic_pix_y, int blk_w, int blk_h)\r\n{\r\n    mv->x = (int16_t)cu_get_mc_pos(h->i_width,  blk_w, pic_pix_x, mv->x);\r\n    mv->y = (int16_t)cu_get_mc_pos(h->i_height, blk_h, pic_pix_y, mv->y);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * inter prediction for one coding unit\r\n */\r\nstatic int davs2_get_inter_pred(davs2_t *h, davs2_row_rec_t *row_rec, cu_t *p_cu, int ctu_x, int ctu_y)\r\n{\r\n    static const int mv_shift = 2;\r\n    int pu_idx;\r\n\r\n    for (pu_idx = 0; pu_idx < p_cu->num_pu; pu_idx++) {\r\n        int pix_x, pix_y, width, height;\r\n        int vec1_x, vec1_y, vec2_x, vec2_y;\r\n        int pred_dir;\r\n        int ref_1st, ref_2nd;\r\n        cb_t *pu = &p_cu->pu[pu_idx];\r\n        mv_t mv_1st, mv_2nd;\r\n        davs2_frame_t *p_fref1, *p_fref2;\r\n\r\n        p_fref1  = p_fref2  = NULL;\r\n        ref_1st  = ref_2nd  = 0;\r\n        mv_1st.v = mv_2nd.v = 0;\r\n\r\n        pix_x  = ctu_x + pu->x;\r\n        pix_y  = ctu_y + 
pu->y;\r\n        width  = pu->w;\r\n        height = pu->h;\r\n\r\n        pred_dir  = p_cu->b8pdir[pu_idx];\r\n\r\n        if (pred_dir == PDIR_BWD) {\r\n            ref_1st = B_BWD;\r\n            mv_1st  = p_cu->mv[pu_idx][1];\r\n            p_fref1 = h->fref[B_BWD];\r\n        } else if (pred_dir == PDIR_SYM || pred_dir == PDIR_BID) {\r\n            mv_1st.v = p_cu->mv[pu_idx][0].v;\r\n            mv_2nd.v = p_cu->mv[pu_idx][1].v;\r\n\r\n            p_fref1 = h->fref[B_FWD];\r\n            p_fref2 = h->fref[B_BWD];\r\n        } else {\r\n            /* FWD or DUAL */\r\n            int dmh_mode = p_cu->i_dmh_mode;\r\n\r\n            ref_1st = p_cu->ref_idx[pu_idx].r[0];\r\n            mv_1st  = p_cu->mv[pu_idx][0];\r\n\r\n            if (h->i_frame_type == AVS2_B_SLICE) {\r\n                /* for B frame */\r\n                ref_1st = 0;\r\n                p_fref1 = h->fref[B_FWD];\r\n            } else {\r\n                if (pred_dir == PDIR_DUAL) {\r\n                    mv_2nd  = p_cu->mv[pu_idx][1];\r\n                    ref_2nd = p_cu->ref_idx[pu_idx].r[1];\r\n                    p_fref1 = h->fref[ref_1st];\r\n                    p_fref2 = h->fref[ref_2nd];\r\n                } else if (dmh_mode) {\r\n                    mv_2nd.x = mv_1st.x + dmh_pos[dmh_mode][1][0];\r\n                    mv_2nd.y = mv_1st.y + dmh_pos[dmh_mode][1][1];\r\n\r\n                    mv_1st.x += dmh_pos[dmh_mode][0][0];\r\n                    mv_1st.y += dmh_pos[dmh_mode][0][1];\r\n\r\n                    ref_2nd = ref_1st;\r\n                    p_fref1 = p_fref2 = h->fref[ref_1st];\r\n                } else {\r\n                    p_fref1 = h->fref[ref_1st];\r\n                }\r\n            }\r\n        }\r\n\r\n        cu_get_mc_pos_mv(h, &mv_1st, pix_x + row_rec->ctu.i_pix_x, pix_y + row_rec->ctu.i_pix_y, width, height);\r\n        vec1_x = mv_1st.x;\r\n        vec1_y = mv_1st.y;\r\n\r\n        cu_get_mc_pos_mv(h, &mv_2nd, pix_x + row_rec->ctu.i_pix_x, pix_y + 
row_rec->ctu.i_pix_y, width, height);\r\n        vec2_x = mv_2nd.x;\r\n        vec2_y = mv_2nd.y;\r\n\r\n        // TODO: reference frame management for the case where a background frame is referenced needs to be handled in the RPS part\r\n        // if (h->b_bkgnd_reference && h->num_of_references >= 2 && ref_1st == h->num_of_references - 1 && (h->i_frame_type == AVS2_P_SLICE || h->i_frame_type == AVS2_F_SLICE) && h->i_typeb != AVS2_S_SLICE) {\r\n        //     p_fref1 = h->f_background_ref;\r\n        // } else if (h->i_typeb == AVS2_S_SLICE) {\r\n        //     p_fref1 = h->f_background_ref;\r\n        // }\r\n\r\n        /* luma prediction */\r\n        if (p_fref1 != NULL) {\r\n            int i_pred = row_rec->ctu.i_fdec[IMG_Y];\r\n            int i_fref = h->fref[0]->i_stride[IMG_Y];\r\n\r\n            pel_t *p_pred = row_rec->ctu.p_fdec[IMG_Y] + pix_y * i_pred + pix_x;\r\n\r\n            decoder_wait_row(h, p_fref1, (vec1_y >> mv_shift) + height + 8 + 4);\r\n\r\n            mc_luma(h, p_pred, i_pred, vec1_x, vec1_y, width, height, p_fref1->planes[IMG_Y], i_fref);\r\n\r\n            if (p_fref2 != NULL) {\r\n                pel_t *p_temp = row_rec->pred_blk;\r\n\r\n                decoder_wait_row(h, p_fref2, (vec2_y >> mv_shift) + height + 8 + 4);\r\n\r\n                mc_luma(h, p_temp, LCU_STRIDE, vec2_x, vec2_y, width, height, p_fref2->planes[IMG_Y], i_fref);\r\n\r\n                gf_davs2.block_avg(p_pred, i_pred, p_pred, i_pred, p_temp, LCU_STRIDE, width, height);\r\n            }\r\n        } else {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"non-existing reference frame. 
PB (%d, %d)\", pix_x, pix_y);\r\n            return -1;\r\n        }\r\n\r\n        /* chroma prediction */\r\n        if (h->i_chroma_format == CHROMA_420) {\r\n            pix_x >>= 1;\r\n            pix_y >>= 1;\r\n            width >>= 1;\r\n            height >>= 1;\r\n\r\n            if (p_fref2 == NULL) {\r\n                int i_fref = p_fref1->i_stride[IMG_U];\r\n                int i_pred = row_rec->ctu.i_fdec[IMG_U];\r\n\r\n                pel_t *p_pred = row_rec->ctu.p_fdec[IMG_U] + pix_y * i_pred + pix_x;\r\n\r\n                mc_chroma(h, p_pred, i_pred, vec1_x, vec1_y, width, height, p_fref1->planes[IMG_U], i_fref);\r\n\r\n                i_fref = p_fref1->i_stride[IMG_V];\r\n                i_pred = row_rec->ctu.i_fdec[IMG_V];\r\n                p_pred = row_rec->ctu.p_fdec[IMG_V] + pix_y * i_pred + pix_x;\r\n\r\n                mc_chroma(h, p_pred, i_pred, vec1_x, vec1_y, width, height, p_fref1->planes[IMG_V], i_fref);\r\n            } else {\r\n                /* u component */\r\n                int i_fref = p_fref1->i_stride[IMG_U];\r\n                int i_pred = row_rec->ctu.i_fdec[IMG_U];\r\n\r\n                pel_t *p_pred = row_rec->ctu.p_fdec[IMG_U] + pix_y * i_pred + pix_x;\r\n                pel_t *p_temp = row_rec->pred_blk;\r\n\r\n                mc_chroma(h, p_pred, i_pred, vec1_x, vec1_y, width, height, p_fref1->planes[IMG_U], i_fref);\r\n                mc_chroma(h, p_temp, LCU_STRIDE >> 1, vec2_x, vec2_y, width, height, p_fref2->planes[IMG_U], i_fref);\r\n\r\n                gf_davs2.block_avg(p_pred, i_pred, p_pred, i_pred, p_temp, LCU_STRIDE >> 1, width, height);\r\n\r\n                /* v component */\r\n                i_fref = p_fref1->i_stride[IMG_V];\r\n                i_pred = row_rec->ctu.i_fdec[IMG_V];\r\n                p_pred = row_rec->ctu.p_fdec[IMG_V] + pix_y * i_pred + pix_x;\r\n\r\n                mc_chroma(h, p_pred, i_pred, vec1_x, vec1_y, width, height, p_fref1->planes[IMG_V], i_fref);\r\n                
mc_chroma(h, p_temp, LCU_STRIDE >> 1, vec2_x, vec2_y, width, height, p_fref2->planes[IMG_V], i_fref);\r\n\r\n                gf_davs2.block_avg(p_pred, i_pred, p_pred, i_pred, p_temp, LCU_STRIDE >> 1, width, height);\r\n            }\r\n        }   // chroma format YUV420\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * reconstruct a CU\r\n */\r\nstatic int cu_recon(davs2_t *h, davs2_row_rec_t *row_rec, cu_t *p_cu, int pix_x, int pix_y)\r\n{\r\n    int ctu_x = pix_x - row_rec->ctu.i_pix_x;\r\n    int ctu_y = pix_y - row_rec->ctu.i_pix_y;\r\n    int ctu_c_x = ctu_x >> 1;\r\n    int ctu_c_y = ctu_y >> 1;\r\n    int blockidx;\r\n    cb_t tus[4];\r\n\r\n    cu_init_transform_units(p_cu, tus);\r\n\r\n    if (IS_INTRA(p_cu)) {  /* intra cu */\r\n        /* 1, luma component, prediction and residual coding */\r\n        if (p_cu->i_trans_size == TU_SPLIT_NON) {\r\n            davs2_get_intra_pred(row_rec, p_cu, p_cu->intra_pred_modes[0], ctu_x, ctu_y, tus[0].w, tus[0].h);\r\n            if (p_cu->i_cbp & 0x0F) {\r\n                davs2_get_recons(row_rec, p_cu, 0, &tus[0], ctu_x, ctu_y);\r\n            }\r\n        } else {\r\n            for (blockidx = 0; blockidx < 4; blockidx++) {\r\n                davs2_get_intra_pred(row_rec, p_cu, p_cu->intra_pred_modes[blockidx],\r\n                    ctu_x + tus[blockidx].x, ctu_y + tus[blockidx].y,\r\n                    tus[blockidx].w, tus[blockidx].h);\r\n                if (p_cu->i_cbp & (1 << blockidx)) {\r\n                    davs2_get_recons(row_rec, p_cu, blockidx, &tus[blockidx], ctu_x, ctu_y);\r\n                }\r\n            }\r\n        }\r\n\r\n        /* 2, chroma component prediction */\r\n        if (h->i_chroma_format == CHROMA_420) {\r\n            davs2_get_intra_pred_chroma(row_rec, p_cu, ctu_c_x, ctu_c_y);\r\n        }\r\n    } else {  /* inter cu */\r\n        /* 1, prediction (including luma and chroma) */\r\n        if 
(davs2_get_inter_pred(h, row_rec, p_cu, ctu_x, ctu_y) < 0) {\r\n            return -1;\r\n        }\r\n\r\n        /* 2, luma residual decoding */\r\n        if (p_cu->i_trans_size == TU_SPLIT_NON) {\r\n            if (p_cu->i_cbp & 0x0F) {\r\n                davs2_get_recons(row_rec, p_cu, 0, &tus[0], ctu_x, ctu_y);\r\n            }\r\n        } else {\r\n            for (blockidx = 0; blockidx < 4; blockidx++) {\r\n                if (p_cu->i_cbp & (1 << blockidx)) {\r\n                    davs2_get_recons(row_rec, p_cu, blockidx, &tus[blockidx], ctu_x, ctu_y);\r\n                }\r\n            }\r\n        }\r\n    }\r\n\r\n    /* 3, chroma residual decoding */\r\n    if (h->i_chroma_format == CHROMA_420) {\r\n        cb_t cur_cb;\r\n\r\n        cur_cb.w = cur_cb.h = 1 << (p_cu->i_cu_level - 1);\r\n        cur_cb.y = 1 << p_cu->i_cu_level;\r\n        cur_cb.x = 0;\r\n        if (p_cu->i_cbp & (1 << 4)) {\r\n            davs2_get_recons(row_rec, p_cu, 4, &cur_cb, ctu_x, ctu_y);\r\n        }\r\n\r\n        cur_cb.x = (int8_t)cur_cb.h;\r\n        if (p_cu->i_cbp & (1 << 5)) {\r\n            davs2_get_recons(row_rec, p_cu, 5, &cur_cb, ctu_x, ctu_y);\r\n        }\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE void\r\ncopy_lcu_col1(pel_t *dst, pel_t *src, const int height, const int stride)\r\n{\r\n    int i, k;\r\n\r\n    for (i = 0, k = 0; i < height; i++, k += stride) {\r\n        dst[k] = src[k];\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid decode_lcu_init(davs2_t *h, int i_lcu_x, int i_lcu_y)\r\n{\r\n    const int num_in_scu   = 1 << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n    const int width_in_scu = h->i_width_in_scu;\r\n    int lcu_w_in_scu, lcu_h_in_scu;\r\n    int i, j;\r\n\r\n    assert(h->lcu.i_scu_xy >= 0 && h->lcu.i_scu_xy < h->i_size_in_scu);\r\n\r\n    // update coordinates of 
the current coding unit\r\n    h->lcu.i_scu_x  = i_lcu_x << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n    h->lcu.i_scu_y  = i_lcu_y << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n    h->lcu.i_scu_xy = h->lcu.i_scu_y * width_in_scu + h->lcu.i_scu_x;\r\n\r\n    h->lcu.i_spu_x  = h->lcu.i_scu_x * BLOCK_MULTIPLE;                  // luma block position\r\n    h->lcu.i_spu_y  = h->lcu.i_scu_y * BLOCK_MULTIPLE;                  // luma block position\r\n    h->lcu.i_pix_x  = h->lcu.i_scu_x << MIN_CU_SIZE_IN_BIT;             // luma pixel position\r\n    h->lcu.i_pix_y  = h->lcu.i_scu_y << MIN_CU_SIZE_IN_BIT;             // luma coding unit position\r\n\r\n    h->lcu.i_pix_c_x = h->lcu.i_scu_x << (MIN_CU_SIZE_IN_BIT - 1);      // chroma pixel position\r\n    if (h->i_chroma_format == CHROMA_420) {\r\n        h->lcu.i_pix_c_y = h->lcu.i_scu_y << (MIN_CU_SIZE_IN_BIT - 1);  // chroma coding unit position\r\n    }\r\n\r\n    // actual width and height (in pixel) for current lcu\r\n    lcu_w_in_scu = DAVS2_MIN((h->i_width  - h->lcu.i_pix_x) >> MIN_CU_SIZE_IN_BIT, num_in_scu);\r\n    lcu_h_in_scu = DAVS2_MIN((h->i_height - h->lcu.i_pix_y) >> MIN_CU_SIZE_IN_BIT, num_in_scu);\r\n    h->lcu.i_pix_width  = lcu_w_in_scu << MIN_CU_SIZE_IN_BIT;\r\n    h->lcu.i_pix_height = lcu_h_in_scu << MIN_CU_SIZE_IN_BIT;\r\n\r\n    // init slice index of current LCU\r\n    for (i = 0; i < lcu_h_in_scu; i++) {\r\n        cu_t *p_cu_iter = &h->scu_data[h->lcu.i_scu_xy + i * width_in_scu];\r\n\r\n        for (j = 0; j < lcu_w_in_scu; j++) {\r\n            p_cu_iter->i_slice_nr = (int8_t)h->i_slice_index;\r\n            p_cu_iter++;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid rowrec_lcu_init(davs2_t *h, davs2_row_rec_t *row_rec, int i_lcu_x, int i_lcu_y)\r\n{\r\n#if CTRL_AEC_THREAD\r\n    row_rec->p_rec_info = &row_rec->lcu_info->rec_info;\r\n#else\r\n    row_rec->p_rec_info = &h->lcu.rec_info;\r\n#endif\r\n    
row_rec->idx_cu_zscan = 0;\r\n    /* CTU position */\r\n    row_rec->ctu.i_pix_x = i_lcu_x << h->i_lcu_level;\r\n    row_rec->ctu.i_pix_y = i_lcu_y << h->i_lcu_level;\r\n    row_rec->ctu.i_pix_x_c = i_lcu_x << (h->i_lcu_level - 1);\r\n    row_rec->ctu.i_pix_y_c = i_lcu_y << (h->i_lcu_level - 1);\r\n\r\n    row_rec->ctu.i_ctu_w = DAVS2_MIN(h->i_width  - row_rec->ctu.i_pix_x, 1 << h->i_lcu_level);\r\n    row_rec->ctu.i_ctu_h = DAVS2_MIN(h->i_height - row_rec->ctu.i_pix_y, 1 << h->i_lcu_level);\r\n    row_rec->ctu.i_ctu_w_c = row_rec->ctu.i_ctu_w >> 1;\r\n    row_rec->ctu.i_ctu_h_c = row_rec->ctu.i_ctu_h >> 1;\r\n\r\n    row_rec->ctu.i_scu_x = i_lcu_x << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n    row_rec->ctu.i_scu_y = i_lcu_y << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n    row_rec->ctu.i_scu_xy = row_rec->ctu.i_scu_y * h->i_width_in_scu + row_rec->ctu.i_scu_x;\r\n\r\n    row_rec->ctu.i_spu_x = row_rec->ctu.i_scu_x * BLOCK_MULTIPLE;                  // luma block position\r\n    row_rec->ctu.i_spu_y = row_rec->ctu.i_scu_y * BLOCK_MULTIPLE;                  // luma block position\r\n    \r\n    /* init pointers */\r\n    row_rec->h = h;\r\n\r\n    row_rec->ctu.i_frec[0] = h->fdec->i_stride[0];\r\n    row_rec->ctu.i_frec[1] = h->fdec->i_stride[1];\r\n    row_rec->ctu.i_frec[2] = h->fdec->i_stride[2];\r\n\r\n    row_rec->ctu.p_frec[0] = h->fdec->planes[0] + row_rec->ctu.i_pix_y   * row_rec->ctu.i_frec[0] + row_rec->ctu.i_pix_x;\r\n    row_rec->ctu.p_frec[1] = h->fdec->planes[1] + row_rec->ctu.i_pix_y_c * row_rec->ctu.i_frec[1] + row_rec->ctu.i_pix_x_c;\r\n    row_rec->ctu.p_frec[2] = h->fdec->planes[2] + row_rec->ctu.i_pix_y_c * row_rec->ctu.i_frec[2] + row_rec->ctu.i_pix_x_c;\r\n\r\n#if 1\r\n    row_rec->ctu.i_fdec[0] = h->fdec->i_stride[0];\r\n    row_rec->ctu.i_fdec[1] = h->fdec->i_stride[1];\r\n    row_rec->ctu.i_fdec[2] = h->fdec->i_stride[2];\r\n\r\n    row_rec->ctu.p_fdec[0] = h->fdec->planes[0] + row_rec->ctu.i_pix_y   * row_rec->ctu.i_fdec[0] + 
row_rec->ctu.i_pix_x;\r\n    row_rec->ctu.p_fdec[1] = h->fdec->planes[1] + row_rec->ctu.i_pix_y_c * row_rec->ctu.i_fdec[1] + row_rec->ctu.i_pix_x_c;\r\n    row_rec->ctu.p_fdec[2] = h->fdec->planes[2] + row_rec->ctu.i_pix_y_c * row_rec->ctu.i_fdec[2] + row_rec->ctu.i_pix_x_c;\r\n#else\r\n    row_rec->ctu.i_fdec[0] = MAX_CU_SIZE;\r\n    row_rec->ctu.i_fdec[1] = MAX_CU_SIZE;\r\n    row_rec->ctu.i_fdec[2] = MAX_CU_SIZE;\r\n\r\n    row_rec->ctu.p_fdec[0] = row_rec->fdec_buf;\r\n    row_rec->ctu.p_fdec[1] = row_rec->fdec_buf + MAX_CU_SIZE * MAX_CU_SIZE;\r\n    row_rec->ctu.p_fdec[2] = row_rec->fdec_buf + MAX_CU_SIZE * MAX_CU_SIZE + (MAX_CU_SIZE / 2);\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint decode_lcu_parse(davs2_t *h, int i_level, int pix_x, int pix_y)\r\n{\r\n    const int width_in_scu = h->i_width_in_scu;\r\n    const int pix_x_end = pix_x + (1 << i_level);\r\n    const int pix_y_end = pix_y + (1 << i_level);\r\n    int b_cu_inside_pic = (pix_x_end <= h->i_width) && (pix_y_end <= h->i_height);\r\n    int split_flag = (i_level != MIN_CU_SIZE_IN_BIT);\r\n\r\n    assert((pix_x < h->i_width) && (pix_y < h->i_height));\r\n    if (i_level > MIN_CU_SIZE_IN_BIT && b_cu_inside_pic) {\r\n        split_flag = aec_read_split_flag(&h->aec, i_level);\r\n    }\r\n\r\n    if (split_flag) {\r\n        int i_level_next = i_level - 1;\r\n        int i;\r\n\r\n        for (i = 0; i < 4; i++) {\r\n            int sub_pix_x = pix_x + ((i & 1) << i_level_next);\r\n            int sub_pix_y = pix_y + ((i >> 1) << i_level_next);\r\n\r\n            if (sub_pix_x < h->i_width && sub_pix_y < h->i_height) {\r\n                decode_lcu_parse(h, i_level_next, sub_pix_x, sub_pix_y);\r\n            }\r\n        }\r\n    } else {\r\n        int i_cu_x  = (pix_x >> MIN_CU_SIZE_IN_BIT);\r\n        int i_cu_y  = (pix_y >> MIN_CU_SIZE_IN_BIT);\r\n        int i_cu_xy = i_cu_y * width_in_scu + i_cu_x;\r\n        cu_t *p_cu  = 
&h->scu_data[i_cu_xy];\r\n\r\n        h->lcu.idx_cu_zscan_aec = tab_b8xy_to_zigzag[i_cu_y - h->lcu.i_scu_y][i_cu_x - h->lcu.i_scu_x];\r\n\r\n        if (cu_read_info(h, p_cu, i_level, i_cu_xy, pix_x, pix_y) < 0) {\r\n            p_cu->i_slice_nr = -1;  // set an invalid value to terminate the reconstruction\r\n            return -1;\r\n        }\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint decode_lcu_recon(davs2_t *h, davs2_row_rec_t *row_rec, int i_level, int pix_x, int pix_y)\r\n{\r\n    const int width_in_scu = h->i_width_in_scu;\r\n    int i_cu_x     = (pix_x >> MIN_CU_SIZE_IN_BIT);\r\n    int i_cu_y     = (pix_y >> MIN_CU_SIZE_IN_BIT);\r\n    int i_cu_xy    = i_cu_y * width_in_scu + i_cu_x;\r\n    cu_t *p_cu     = &h->scu_data[i_cu_xy];\r\n    int split_flag = (p_cu->i_cu_level < i_level);\r\n\r\n    assert((pix_x < h->i_width) && (pix_y < h->i_height));\r\n\r\n    if (split_flag) {\r\n        int i_level_next = i_level - 1;\r\n        int i;\r\n\r\n        for (i = 0; i < 4; i++) {\r\n            int sub_pix_x = pix_x + ((i &  1) << i_level_next);\r\n            int sub_pix_y = pix_y + ((i >> 1) << i_level_next);\r\n\r\n            if (sub_pix_x < h->i_width && sub_pix_y < h->i_height) {\r\n                decode_lcu_recon(h, row_rec, i_level_next, sub_pix_x, sub_pix_y);\r\n            }\r\n        }\r\n    } else {\r\n        int i_cu_mask = h->i_lcu_size_sub1 >> MIN_CU_SIZE_IN_BIT;\r\n        row_rec->idx_cu_zscan = tab_b8xy_to_zigzag[i_cu_y & i_cu_mask][i_cu_x & i_cu_mask];\r\n\r\n        if (p_cu->i_slice_nr == -1) {\r\n            h->decoding_error = 1;\r\n            davs2_log(h, DAVS2_LOG_WARNING, \"invalid CU (%3d, %3d), POC %3d\",\r\n                     pix_x, pix_y, h->i_poc);\r\n            return 0;\r\n        }\r\n        cu_recon(h, row_rec, p_cu, pix_x, pix_y);\r\n    }\r\n\r\n    return 0;\r\n}\r\n"
  },
  {
    "path": "source/common/cu.h",
    "content": "/*\r\n * cu.h\r\n *\r\n * Description of this file:\r\n *    CU Processing functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_CU_H\r\n#define DAVS2_CU_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * init LCU decoding\r\n * \\input param\r\n *     h    : decoder handler\r\n *  i_lcu_x : LCU position index\r\n *  i_lcu_y : LCU position index\r\n */\r\n#define decode_lcu_init FPFX(decode_lcu_init)\r\nvoid decode_lcu_init (davs2_t *h, int i_lcu_x, int i_lcu_y);\r\n\r\n#define rowrec_lcu_init FPFX(rowrec_lcu_init)\r\nvoid rowrec_lcu_init (davs2_t *h, davs2_row_rec_t *row_rec, int i_lcu_x, int i_lcu_y);\r\n\r\n/* 
---------------------------------------------------------------------------\r\n * process LCU entropy decoding (recursively)\r\n * \\input param\r\n *     h    : decoder handler\r\n *  i_level : log2(CU size)\r\n *   pix_x  : pixel position of the decoding CU in the frame in Luma component\r\n *   pix_y  : pixel position of the decoding CU in the frame in Luma component\r\n */\r\n#define decode_lcu_parse FPFX(decode_lcu_parse)\r\nint  decode_lcu_parse(davs2_t *h, int i_level, int pix_x, int pix_y);\r\n\r\n/* ---------------------------------------------------------------------------\r\n * process LCU reconstruction (recursively)\r\n * \\input param\r\n *     h    : decoder handler\r\n *  i_level : log2(CU size)\r\n *   pix_x  : pixel position of the decoding CU in the frame in Luma component\r\n *   pix_y  : pixel position of the decoding CU in the frame in Luma component\r\n */\r\n#define decode_lcu_recon FPFX(decode_lcu_recon)\r\nint  decode_lcu_recon(davs2_t *h, davs2_row_rec_t *row_rec, int i_level, int pix_x, int pix_y);\r\n\r\n#define decoder_wait_lcu_row FPFX(decoder_wait_lcu_row)\r\nvoid decoder_wait_lcu_row(davs2_t *h, davs2_frame_t *frame, int line);\r\n#define decoder_wait_row FPFX(decoder_wait_row)\r\nvoid decoder_wait_row(davs2_t *h, davs2_frame_t *frame, int max_y_in_pic);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_CU_H\r\n"
  },
  {
    "path": "source/common/davs2.cc",
    "content": "/*\r\n * davs2.cc\r\n *\r\n * Description of this file:\r\n *    API functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"davs2.h\"\r\n#include \"primitives.h\"\r\n#include \"decoder.h\"\r\n#include \"bitstream.h\"\r\n#include \"header.h\"\r\n#include \"version.h\"\r\n#include \"decoder.h\"\r\n#include \"frame.h\"\r\n#include \"cpu.h\"\r\n#include \"threadpool.h\"\r\n#include \"version.h\"\r\n\r\n/**\r\n * ===========================================================================\r\n * macro defines\r\n * ===========================================================================\r\n */\r\n\r\n#if DAVS2_TRACE_API\r\nFILE *fp_trace_bs = NULL;\r\nFILE *fp_trace_in = 
NULL;\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* --------------------------------------------------------------------------\r\n */\r\nstatic es_unit_t *\r\nes_unit_alloc(int buf_size)\r\n{\r\n    es_unit_t *es_unit = NULL;\r\n    int bufsize = sizeof(es_unit_t) + buf_size;\r\n    \r\n    bufsize = ((bufsize + 31) >> 5 ) << 5;\r\n    es_unit = (es_unit_t *)davs2_malloc(bufsize);\r\n\r\n    if (es_unit == NULL) {\r\n        davs2_log(NULL, DAVS2_LOG_ERROR, \"failed to malloc memory in es_unit_alloc.\\n\");\r\n        return NULL;\r\n    }\r\n\r\n    es_unit->size = buf_size;\r\n    es_unit->len  = 0;\r\n    es_unit->pts  = 0;\r\n    es_unit->dts  = 0;\r\n\r\n    return es_unit;\r\n}\r\n\r\n/* --------------------------------------------------------------------------\r\n */\r\nstatic void\r\nes_unit_free(es_unit_t *es_unit)\r\n{\r\n    if (es_unit) {\r\n        davs2_free(es_unit);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * push byte stream data of one frame to input list\r\n */\r\nstatic\r\nes_unit_t *davs2_pack_es_unit(davs2_mgr_t *mgr, const uint8_t *data, int len, int64_t pts, int64_t dts)\r\n{\r\n#define DAVS2_ISUNIT(x) ((x) == 0xB0 || (x) == 0xB1 || (x) == 0xB7 || (x) == 0xB3 || (x) == 0xB6)\r\n    es_unit_t *es_unit     = NULL;\r\n    es_unit_t *ret_es_unit = NULL;\r\n    int start_code = data[3];\r\n\r\n    if (mgr->es_unit == NULL) {\r\n        mgr->es_unit = (es_unit_t *)xl_remove_head(&mgr->packets_idle, 1);\r\n    }\r\n\r\n    es_unit = mgr->es_unit;\r\n\r\n    if (len > 0) {\r\n        if (es_unit->size < es_unit->len + len) {\r\n            /* reallocate frame buffer */\r\n            int new_size = es_unit->len + len + MAX_ES_FRAME_SIZE * 2;\r\n            es_unit_t *new_es_unit;\r\n\r\n            if 
((new_es_unit = es_unit_alloc(new_size)) == NULL) {\r\n                return NULL;\r\n            }\r\n\r\n            memcpy(new_es_unit, es_unit, sizeof(es_unit_t));   /* copy ES Unit information */\r\n            memcpy(new_es_unit->data, es_unit->data, es_unit->len * sizeof(uint8_t));\r\n\r\n            es_unit_free(es_unit);\r\n\r\n            mgr->es_unit = es_unit = new_es_unit;\r\n        }\r\n\r\n        /* copy stream data */\r\n        if (DAVS2_ISUNIT(start_code) && es_unit->len > 0) {\r\n            ret_es_unit = es_unit;\r\n            /* fetch a node again from idle list */\r\n            es_unit = (es_unit_t *)xl_remove_head(&mgr->packets_idle, 1);\r\n            mgr->es_unit = es_unit;\r\n        }\r\n        memcpy(es_unit->data + es_unit->len, data, len * sizeof(uint8_t));\r\n        es_unit->len += len;\r\n        es_unit->pts  = pts;\r\n        es_unit->dts  = dts;\r\n    }\r\n\r\n    /* check the pseudo start code */\r\n    if (ret_es_unit != NULL) {\r\n        ret_es_unit->len = bs_dispose_pseudo_code(ret_es_unit->data, ret_es_unit->data, ret_es_unit->len);\r\n    }\r\n\r\n#undef DAVS2_ISUNIT\r\n    return ret_es_unit;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void \r\ndestroy_all_lists(davs2_mgr_t *mgr)\r\n{\r\n    es_unit_t *es_unit = NULL;\r\n    davs2_picture_t *pic = NULL;\r\n\r\n    /* idle list */\r\n    for (;;) {\r\n        if ((es_unit = (es_unit_t *)xl_remove_head_ex(&mgr->packets_idle)) == NULL) {\r\n            break;\r\n        }\r\n\r\n        es_unit_free(es_unit);\r\n    }\r\n\r\n    /* recycle list */\r\n    for (;;) {\r\n        if ((pic = (davs2_picture_t *)xl_remove_head_ex(&mgr->pic_recycle)) == NULL) {\r\n            break;\r\n        }\r\n\r\n        davs2_free(pic);\r\n    }\r\n\r\n    if (mgr->es_unit) {\r\n        es_unit_free(mgr->es_unit);\r\n        mgr->es_unit = NULL;\r\n    }\r\n\r\n    xl_destroy(&mgr->packets_idle);\r\n    
xl_destroy(&mgr->pic_recycle);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic int\r\ncreate_all_lists(davs2_mgr_t *mgr)\r\n{\r\n    es_unit_t *es_unit = NULL;\r\n    int i;\r\n\r\n    if (xl_init(&mgr->packets_idle ) != 0 || \r\n        xl_init(&mgr->pic_recycle  ) != 0) {\r\n        goto fail;\r\n    }\r\n\r\n    for (i = 0; i < MAX_ES_FRAME_NUM + mgr->param.threads; i++) {\r\n        es_unit = es_unit_alloc(MAX_ES_FRAME_SIZE);\r\n\r\n        if (es_unit) {\r\n            xl_append(&mgr->packets_idle, es_unit);\r\n        } else {\r\n            goto fail;\r\n        }\r\n    }\r\n\r\n    return 0;\r\n\r\nfail:\r\n    destroy_all_lists(mgr);\r\n\r\n    return -1;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid output_list_recycle_picture(davs2_mgr_t *mgr, davs2_outpic_t *pic)\r\n{\r\n    pic->frame = NULL;\r\n    /* picture may be obsolete(for new sequence with different resolution), we will release it later */\r\n    xl_append(&mgr->pic_recycle, pic);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic \r\nint has_new_output_frame(davs2_mgr_t *mgr, davs2_t *h)\r\n{\r\n    // TODO: refine this; decide whether to wait for output once the current picture is decoded\r\n    UNUSED_PARAMETER(mgr);\r\n    UNUSED_PARAMETER(h);\r\n\r\n    return 1;  // nonzero if a picture is available for output, 0 otherwise\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\ndavs2_outpic_t *output_list_get_one_output_picture(davs2_mgr_t *mgr)\r\n{\r\n    davs2_outpic_t *pic   = NULL;\r\n\r\n    davs2_thread_mutex_lock(&mgr->mutex_mgr);\r\n\r\n    while (mgr->outpics.pics) {\r\n        davs2_frame_t *frame = mgr->outpics.pics->frame;\r\n        assert(frame);\r\n\r\n        if (frame->i_poc == mgr->outpics.output) {\r\n            /* the next frame : output */\r\n            pic = mgr->outpics.pics;\r\n            mgr->outpics.pics = pic->next;\r\n\r\n            /* move 
on to the next frame */\r\n            mgr->outpics.output++;\r\n            mgr->outpics.num_output_pic--;\r\n            break;\r\n        } else {\r\n            /* TODO: decide how to revise this logic:\r\n             * how to keep the output order valid without introducing a delay of several frames\r\n             */\r\n            if (frame->i_poc > mgr->outpics.output) {\r\n                /* the end of the stream occurs */\r\n                if (mgr->b_flushing &&\r\n                    mgr->num_frames_in == mgr->num_frames_out + mgr->outpics.num_output_pic) {\r\n                    mgr->outpics.output++;\r\n                    continue;\r\n                }\r\n\r\n                /* a future frame */\r\n                int num_delayed_frames = 1;\r\n\r\n                pic = mgr->outpics.pics;\r\n                while (pic->next != NULL) {\r\n                    num_delayed_frames++;\r\n                    pic = pic->next;\r\n                }\r\n\r\n                if (num_delayed_frames < 8) {\r\n                    /* keep waiting */\r\n                    davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n                    davs2_sleep_ms(1);\r\n                    davs2_thread_mutex_lock(&mgr->mutex_mgr);\r\n                    continue;\r\n                }\r\n            }\r\n\r\n            /* the smallest POC in the list is far from the output POC: advance the output POC to the current smallest POC */\r\n            davs2_log(mgr, DAVS2_LOG_WARNING, \"Advance to discontinuous POC: %d\\n\", frame->i_poc);\r\n            mgr->outpics.output = frame->i_poc;\r\n        }\r\n    }\r\n\r\n    mgr->outpics.busy = (pic != NULL);\r\n\r\n    davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n\r\n    return pic;\r\n}\r\n\r\n/* --------------------------------------------------------------------------\r\n * Thread of decoder output (decoded raw data)\r\n */\r\nint decoder_get_output(davs2_mgr_t *mgr, davs2_seq_info_t *headerset, davs2_picture_t *out_frame, int is_flush)\r\n{\r\n    davs2_outpic_t *pic   = NULL;\r\n    int b_wait_new_frame = mgr->num_frames_in + mgr->num_decoders - mgr->num_frames_out > 8 + mgr->num_aec_thread;\r\n\r\n    
while (mgr->num_frames_in > mgr->num_frames_out && /* no more output */\r\n           (b_wait_new_frame || is_flush)) {\r\n        if (mgr->new_sps) {\r\n            memcpy(headerset, &mgr->seq_info.head, sizeof(davs2_seq_info_t));\r\n            mgr->new_sps = FALSE; /* set flag */\r\n            out_frame->magic = NULL;\r\n            return DAVS2_GOT_HEADER;\r\n        }\r\n\r\n        /* check for the next frame */\r\n        pic = output_list_get_one_output_picture(mgr);\r\n\r\n        if (pic == NULL) {\r\n            davs2_sleep_ms(1);\r\n        } else {\r\n            break;\r\n        }\r\n    }\r\n\r\n    if (pic == NULL) {\r\n        if (mgr->new_sps) {\r\n            memcpy(headerset, &mgr->seq_info.head, sizeof(davs2_seq_info_t));\r\n            mgr->new_sps = FALSE; /* set flag */\r\n            out_frame->magic = NULL;\r\n            return DAVS2_GOT_HEADER;\r\n        }\r\n        return DAVS2_DEFAULT;\r\n    }\r\n\r\n    mgr->num_frames_out++;\r\n\r\n    /* copy out */\r\n    davs2_write_a_frame(pic->pic, pic->frame);\r\n\r\n    /* release reference when it would no more be needed */\r\n    if (pic->pic->dec_frame == NULL) {\r\n        release_one_frame(pic->frame);\r\n    }\r\n\r\n    /* deliver this frame */\r\n    memcpy(out_frame, pic->pic, sizeof(davs2_picture_t));\r\n    out_frame->magic       = pic;\r\n    return DAVS2_GOT_FRAME;\r\n}\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : release one output frame\r\n * Parameters :\r\n *       [in] : decoder   - decoder handle\r\n *            : out_frame - frame to recycle\r\n * Return     : none\r\n * ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API void\r\ndavs2_decoder_frame_unref(void *decoder, davs2_picture_t *out_frame)\r\n{\r\n    davs2_mgr_t *mgr = (davs2_mgr_t *)decoder;\r\n    if (mgr == NULL || out_frame == NULL) {\r\n        return;\r\n    }\r\n\r\n    /* release the output 
*/\r\n    if (out_frame->magic != NULL) {\r\n        davs2_outpic_t *pic = (davs2_outpic_t *)out_frame->magic;\r\n\r\n        /* release reference when it would no more be needed */\r\n        if (pic->pic->dec_frame != NULL) {\r\n            release_one_frame(pic->frame);   // pic->pic->dec_frame == pic->frame\r\n            pic->pic->dec_frame = NULL;\r\n        }\r\n\r\n        output_list_recycle_picture(mgr, pic);\r\n    }\r\n}\r\n\r\n/* --------------------------------------------------------------------------\r\n */\r\nstatic davs2_t *task_get_free_task(davs2_mgr_t *mgr)\r\n{\r\n    int i;\r\n\r\n    for (; mgr->b_exit == 0;) {\r\n        for (i = 0; i < mgr->num_decoders; i++) {\r\n            davs2_t *h = &mgr->decoders[i];\r\n            davs2_thread_mutex_lock(&mgr->mutex_mgr);\r\n            if (h->task_info.task_status == TASK_FREE) {\r\n                h->task_info.task_status = TASK_BUSY;\r\n                davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n                return h;\r\n            }\r\n            davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n        }\r\n    }\r\n\r\n    return NULL;\r\n}\r\n\r\n/* --------------------------------------------------------------------------\r\n */\r\nvoid task_unload_packet(davs2_t *h, es_unit_t *es_unit)\r\n{\r\n    davs2_mgr_t *mgr = h->task_info.taskmgr;\r\n\r\n    if (es_unit) {\r\n        /* packet is free */\r\n        es_unit->len = 0;\r\n        xl_append(&mgr->packets_idle, es_unit);\r\n    }\r\n\r\n    davs2_thread_mutex_lock(&mgr->mutex_mgr);\r\n    h->task_info.task_status = TASK_FREE;\r\n    davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n}\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * interface function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API void 
*\r\ndavs2_decoder_open(davs2_param_t *param)\r\n{\r\n    const int max_num_thread = CTRL_AEC_THREAD ? AVS2_THREAD_MAX : AVS2_THREAD_MAX / 2;\r\n    char buf_cpu[120] = \"\";\r\n    davs2_mgr_t *mgr = NULL;\r\n    uint8_t *mem_ptr;\r\n    size_t mem_size;\r\n    uint32_t cpuid = 0;\r\n    int i;\r\n\r\n    /* check parameters before param is first dereferenced */\r\n    if (param == NULL) {\r\n        davs2_log(NULL, DAVS2_LOG_ERROR, \"Invalid input parameters: Null parameters\\n\");\r\n        return NULL;\r\n    }\r\n\r\n    /* output version information */\r\n    if (param->info_level <= DAVS2_LOG_INFO) {\r\n        davs2_log(NULL, DAVS2_LOG_INFO, \"davs2: %s.%d, %s\",\r\n                  XVERSION_STR, BIT_DEPTH, XBUILD_TIME);\r\n    }\r\n\r\n#if DAVS2_TRACE_API\r\n    fp_trace_bs = fopen(\"trace_bitstream.avs\", \"wb\");\r\n    fp_trace_in = fopen(\"trace_input.txt\", \"w\");\r\n#endif\r\n\r\n    /* init all function handlers */\r\n#if HAVE_MMX\r\n    cpuid = davs2_cpu_detect();\r\n    if (param->disable_avx) {\r\n         cpuid &= ~(DAVS2_CPU_AVX | DAVS2_CPU_AVX2);\r\n    }\r\n#endif\r\n    init_all_primitives(cpuid);\r\n\r\n    /* CPU capabilities */\r\n    davs2_get_simd_capabilities(buf_cpu, cpuid);\r\n    if (param->info_level <= DAVS2_LOG_INFO) {\r\n        davs2_log(mgr, DAVS2_LOG_INFO, \"CPU Capabilities: %s\", buf_cpu);\r\n    }\r\n\r\n    mem_size = sizeof(davs2_mgr_t) + CACHE_LINE_SIZE\r\n        + AVS2_THREAD_MAX * (sizeof(davs2_t) + CACHE_LINE_SIZE);\r\n    CHECKED_MALLOCZERO(mem_ptr, uint8_t *, mem_size);\r\n\r\n    mgr = (davs2_mgr_t *)mem_ptr;\r\n    mem_ptr += sizeof(davs2_mgr_t);\r\n    ALIGN_POINTER(mem_ptr);\r\n    memcpy(&mgr->param, param, sizeof(davs2_param_t));\r\n\r\n    /* init log module */\r\n    mgr->module_log.i_log_level = param->info_level;\r\n    sprintf(mgr->module_log.module_name, \"Manager %06llx\", (long long unsigned int)(mgr));\r\n\r\n    if (mgr->param.threads <= 0) {\r\n        mgr->param.threads = davs2_cpu_num_processors();\r\n    }\r\n    if 
(mgr->param.threads > max_num_thread) {\r\n        mgr->param.threads = max_num_thread;\r\n        davs2_log(mgr, DAVS2_LOG_WARNING, \"Max number of thread reached, forcing to be %d\\n\", max_num_thread);\r\n    }\r\n\r\n    /* init members whose initial value must not be zero */\r\n    mgr->i_prev_coi       = -1;\r\n\r\n    /* output pictures */\r\n    mgr->outpics.output   = -1;\r\n    mgr->outpics.pics     = NULL;\r\n    mgr->outpics.num_output_pic = 0;\r\n\r\n    mgr->num_decoders     = mgr->param.threads;\r\n    mgr->num_total_thread = mgr->param.threads;\r\n    mgr->num_aec_thread   = mgr->param.threads;\r\n#if CTRL_AEC_THREAD\r\n    if (mgr->num_total_thread > 3) {\r\n        mgr->num_aec_thread = (mgr->param.threads >> 1) + 1;\r\n        mgr->num_rec_thread = mgr->num_total_thread - mgr->num_aec_thread;\r\n    } else {\r\n        mgr->num_rec_thread = 0;\r\n    }\r\n    mgr->num_decoders += 1 + mgr->num_aec_thread;\r\n#else\r\n    mgr->num_rec_thread = 0;\r\n#endif\r\n\r\n    mgr->num_decoders++;\r\n\r\n    mgr->decoders = (davs2_t *)mem_ptr;\r\n    mem_ptr      += AVS2_THREAD_MAX * sizeof(davs2_t);\r\n    ALIGN_POINTER(mem_ptr);\r\n    davs2_thread_mutex_init(&mgr->mutex_mgr, NULL);\r\n    davs2_thread_mutex_init(&mgr->mutex_aec, NULL);\r\n\r\n    /* init input&output lists */\r\n    if (create_all_lists(mgr) < 0) {\r\n        goto fail;\r\n    }\r\n\r\n    /* check the thread configuration */\r\n    if (mgr->num_total_thread < 1 || mgr->num_decoders < mgr->num_aec_thread ||\r\n        mgr->num_rec_thread < 0 ||\r\n        mgr->num_aec_thread < 1 || mgr->num_aec_thread > mgr->num_total_thread) {\r\n        davs2_log(mgr, DAVS2_LOG_ERROR,\r\n                  \"Invalid thread number configuration: num_task[%d], num_threads[%d], num_aec_thread[%d], num_pool[%d]\\n\",\r\n                  mgr->num_decoders, mgr->num_total_thread, mgr->num_aec_thread, mgr->num_rec_thread);\r\n        goto fail;\r\n    }\r\n\r\n    /* reset the input/output frame counters */\r\n    mgr->num_frames_in  = 0;\r\n    mgr->num_frames_out 
= 0;\r\n\r\n    /* init all the tasks */\r\n    for (i = 0; i < mgr->num_decoders; i++) {\r\n        davs2_t *h = &mgr->decoders[i];\r\n\r\n        /* init the decode context */\r\n        decoder_open(mgr, h, i);\r\n        // davs2_log(h, DAVS2_LOG_WARNING, \"Decoder [%2d]: %p\", i, h);\r\n\r\n        h->task_info.task_id     = i;\r\n        h->task_info.task_status = TASK_FREE;\r\n        h->task_info.taskmgr     = mgr;\r\n    }\r\n\r\n    /* initialize thread pool for AEC decoding and reconstruction */\r\n    davs2_threadpool_init((davs2_threadpool_t **)&mgr->thread_pool, mgr->num_total_thread, NULL, NULL, 0);\r\n\r\n    davs2_log(mgr, DAVS2_LOG_INFO, \"using %d thread(s): %d(frame/AEC)+%d(pool/REC), %d tasks\", \r\n        mgr->num_total_thread, mgr->num_aec_thread, mgr->num_rec_thread, mgr->num_decoders);\r\n\r\n    return mgr;\r\n\r\nfail:\r\n    davs2_log(NULL, DAVS2_LOG_ERROR, \"failed to open decoder\\n\");\r\n    davs2_decoder_close(mgr);\r\n\r\n    return NULL;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint decoder_decode_es_unit(davs2_mgr_t *mgr, es_unit_t *es_unit)\r\n{\r\n    davs2_t *h = NULL;\r\n    int b_wait_output = 0;\r\n\r\n    /* decode this frame\r\n     * (1) init bs */\r\n    bs_init(&es_unit->bs, es_unit->data, es_unit->len);\r\n\r\n    h = task_get_free_task(mgr);\r\n    mgr->h_dec = h;\r\n\r\n    davs2_thread_mutex_lock(&mgr->mutex_aec);\r\n\r\n    h->task_info.curr_es_unit = es_unit;     /* record the ES_unit to be decoded */\r\n\r\n    /* (2) parse header */\r\n    if (parse_header(h, &es_unit->bs) == 0) {\r\n        h->p_bs = &es_unit->bs;\r\n        /* prepare the reference list and the reconstruction buffer */\r\n        if (task_get_references(h, es_unit->pts, es_unit->dts) == 0) {\r\n            b_wait_output = has_new_output_frame(mgr, h);\r\n            mgr->num_frames_in++;\r\n\r\n            davs2_thread_mutex_unlock(&mgr->mutex_aec);\r\n            /* decode picture 
data */\r\n            davs2_threadpool_run((davs2_threadpool_t *)mgr->thread_pool, decoder_decode_picture_data, h, 0, 0);\r\n        } else { \r\n            davs2_thread_mutex_unlock(&mgr->mutex_aec);\r\n            /* task is free */\r\n            task_unload_packet(h, es_unit);\r\n        }\r\n    } else {\r\n        davs2_thread_mutex_unlock(&mgr->mutex_aec);\r\n        /* task is free */\r\n        task_unload_packet(h, es_unit);\r\n    }\r\n\r\n    return b_wait_output;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API int\r\ndavs2_decoder_send_packet(void *decoder, davs2_packet_t *packet)\r\n{\r\n    davs2_mgr_t *mgr = (davs2_mgr_t *)decoder;\r\n    es_unit_t *es_unit = NULL;\r\n    int ret_type = DAVS2_DEFAULT;\r\n\r\n#if DAVS2_TRACE_API\r\n    if (fp_trace_bs != NULL && packet->len > 0) {\r\n        fwrite(packet->data, packet->len, 1, fp_trace_bs);\r\n        fflush(fp_trace_bs);\r\n    }\r\n    if (fp_trace_in) {\r\n        fprintf(fp_trace_in, \"%4d\\t%d\", packet->len, packet->marker);\r\n        fflush(fp_trace_in);\r\n    }\r\n#endif\r\n\r\n    /* check the input parameter: packet */\r\n    if (packet == NULL || packet->data == NULL || packet->len <= 0) {\r\n        davs2_log(mgr->decoders, DAVS2_LOG_DEBUG, \"Null input packet\");\r\n        return DAVS2_ERROR;              /* error */\r\n    }\r\n\r\n    /* check packet length */\r\n    if (packet->len < 4) {\r\n        davs2_log(mgr, DAVS2_LOG_DEBUG, \"Invalid packet, 4 bytes are needed for one packet (including start_code). 
Len = %d\",\r\n                  packet->len);\r\n        return DAVS2_ERROR;              /* error */\r\n    }\r\n    /* check the first 3 bytes are START_CODE */\r\n    if (packet->data[0] != 0x00 || packet->data[1] != 0x00 || packet->data[2] != 0x01) {\r\n        davs2_log(mgr, DAVS2_LOG_ERROR, \"Invalid input Byte-Stream, not start code: %02x%02x%02x\",\r\n                  packet->data[0], packet->data[1], packet->data[2]);\r\n        return DAVS2_ERROR;\r\n    }\r\n\r\n    /* generate one es_unit for current byte-stream buffer */\r\n    es_unit = davs2_pack_es_unit(mgr, packet->data, packet->len, packet->pts, packet->dts);\r\n    if (es_unit == NULL && mgr->es_unit == NULL) {\r\n        davs2_log(mgr, DAVS2_LOG_ERROR, \"Failed to create an ES_UNIT, input Byte-Stream length %d\",\r\n                  packet->len);\r\n        return DAVS2_ERROR;\r\n    } else if (es_unit == NULL) {\r\n        // davs2_log(mgr, DAVS2_LOG_DEBUG, \"Buffered byte-stream length: %d\",\r\n        //           packet->len);\r\n        return DAVS2_DEFAULT;\r\n    }\r\n\r\n    /* decode one frame */\r\n    mgr->num_frames_to_output += decoder_decode_es_unit(mgr, es_unit);\r\n\r\n#if DAVS2_TRACE_API\r\n    if (fp_trace_in) {\r\n        fprintf(fp_trace_in, \"\\t%8d\\t%2d\\t%4d\\t%3d\\t%3d\\n\", \r\n                packet->len, ret_type, out_frame->pic_order_count,\r\n                mgr->num_frames_in, mgr->num_frames_out);\r\n        fflush(fp_trace_in);\r\n    }\r\n#endif\r\n    return ret_type;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API int\r\ndavs2_decoder_recv_frame(void *decoder, davs2_seq_info_t *headerset, davs2_picture_t *out_frame)\r\n{\r\n    davs2_mgr_t *mgr = (davs2_mgr_t *)decoder;\r\n    int ret_type = DAVS2_DEFAULT;\r\n\r\n    /* clear output frame data */\r\n    out_frame->magic = NULL;\r\n\r\n    /* get one frame or sequence header */\r\n    if (mgr->num_frames_to_output || mgr->new_sps) {\r\n      
  ret_type = decoder_get_output(mgr, headerset, out_frame, 0);\r\n        if (ret_type == DAVS2_GOT_FRAME) {\r\n            mgr->num_frames_to_output--;\r\n        }\r\n    }\r\n\r\n    return ret_type;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API int\r\ndavs2_decoder_flush(void *decoder, davs2_seq_info_t *headerset, davs2_picture_t *out_frame)\r\n{\r\n    davs2_mgr_t *mgr = (davs2_mgr_t *)decoder;\r\n    int ret;\r\n\r\n#if DAVS2_TRACE_API\r\n    if (fp_trace_in) {\r\n        fprintf(fp_trace_in, \"Flush 0x%p \", decoder);\r\n        fflush(fp_trace_in);\r\n    }\r\n#endif\r\n\r\n    if (decoder == NULL) {\r\n        return DAVS2_ERROR;\r\n    }\r\n\r\n    mgr->b_flushing     = 1; // label the decoder being flushing\r\n    out_frame->magic    = NULL;\r\n    ret = DAVS2_DEFAULT;\r\n\r\n#if DAVS2_TRACE_API\r\n    if (fp_trace_in) {\r\n        fprintf(fp_trace_in, \"Fetch \");\r\n        fflush(fp_trace_in);\r\n    }\r\n#endif\r\n\r\n    // flush buffered bit-stream\r\n    if (mgr->es_unit != NULL && mgr->es_unit->len >= 4) {\r\n        es_unit_t *es_unit = mgr->es_unit;\r\n        mgr->es_unit = NULL;\r\n        decoder_decode_es_unit(mgr, es_unit);\r\n    }\r\n\r\n    ret = decoder_get_output(mgr, headerset, out_frame, 1);\r\n\r\n#if DAVS2_TRACE_API\r\n    if (fp_trace_in) {\r\n        fprintf(fp_trace_in, \"Ret %d, %3d\\t%3d\\n\", ret, mgr->num_frames_in, mgr->num_frames_out);\r\n        fflush(fp_trace_in);\r\n    }\r\n#endif\r\n\r\n    if (ret != DAVS2_DEFAULT) {\r\n        return ret;\r\n    } else {\r\n        return DAVS2_END;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API void\r\ndavs2_decoder_close(void *decoder)\r\n{\r\n    davs2_mgr_t  *mgr = (davs2_mgr_t *)decoder;\r\n    int i;\r\n\r\n#if DAVS2_TRACE_API\r\n    if (fp_trace_in != NULL) {\r\n        fprintf(fp_trace_in, \"Close 0x%p\\n\", decoder);\r\n        
fflush(fp_trace_in);\r\n    }\r\n#endif\r\n    if (mgr == NULL) {\r\n        return;\r\n    }\r\n\r\n    /* signal all decoding threads and the output thread to exit */\r\n    mgr->b_exit = 1;\r\n\r\n    /* destroy thread pool */\r\n    if (mgr->num_total_thread != 0) {\r\n        davs2_threadpool_delete((davs2_threadpool_t *)mgr->thread_pool);\r\n    }\r\n\r\n    /* close every task */\r\n    for (i = 0; i < mgr->num_decoders; i++) {\r\n        davs2_t *h = &mgr->decoders[i];\r\n\r\n        /* free all resources of the decoder */\r\n        decoder_close(h);\r\n    }\r\n\r\n    destroy_all_lists(mgr);     /* free all lists */\r\n    destroy_dpb(mgr);           /* free dpb */\r\n\r\n    /* destroy the mutex */\r\n    davs2_thread_mutex_destroy(&mgr->mutex_mgr);\r\n    davs2_thread_mutex_destroy(&mgr->mutex_aec);\r\n\r\n    /* free memory */\r\n    davs2_free(mgr);          /* free the mgr */\r\n\r\n#if DAVS2_TRACE_API\r\n    if (fp_trace_bs != NULL) {\r\n        fclose(fp_trace_bs);\r\n        fp_trace_bs = NULL;\r\n    }\r\n    if (fp_trace_in != NULL) {\r\n        fclose(fp_trace_in);\r\n        fp_trace_in = NULL;\r\n    }\r\n#endif\r\n}\r\n"
  },
  {
    "path": "source/common/deblock.cc",
    "content": "/*\r\n * deblock.cc\r\n *\r\n * Description of this file:\r\n *    Deblock functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"deblock.h\"\r\n#include \"quant.h\"\r\n\r\n#if HAVE_MMX\r\n#include \"vec/intrinsic.h\"\r\n#endif\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const uint8_t ALPHA_TABLE[64] = {\r\n     0,  0,  0,  0,  0,  0,  1,  1,\r\n     1,  1,  1,  2,  2,  2,  3,  3,\r\n     4,  4,  5,  5,  6,  7,  8,  9,\r\n    10, 11, 12, 13, 15, 16, 18, 20,\r\n    22, 24, 26, 28, 30, 33, 33, 35,\r\n    35, 36, 37, 37, 39, 39, 42, 44,\r\n    46, 48, 50, 52, 53, 54, 55, 56,\r\n    57, 58, 59, 60, 61, 62, 63, 64\r\n};\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic const uint8_t BETA_TABLE[64] = {\r\n     0,  0,  0,  0,  0,  0,  1,  1,\r\n     1,  1,  1,  1,  1,  2,  2,  2,\r\n     2,  2,  3,  3,  3,  3,  4,  4,\r\n     4,  4,  5,  5,  5,  5,  6,  6,\r\n     6,  7,  7,  7,  8,  8,  8,  9,\r\n     9, 10, 10, 11, 11, 12, 13, 14,\r\n    15, 16, 17, 18, 19, 20, 21, 22,\r\n    23, 23, 24, 24, 25, 25, 26, 27\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nextern const uint8_t QP_SCALE_CR[64];\r\n\r\n/* ---------------------------------------------------------------------------\r\n * edge direction for deblock\r\n */\r\nenum edge_direction_e {\r\n    EDGE_HOR = 1,           /* horizontal */\r\n    EDGE_VER = 0            /* vertical */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * edge type for filter control\r\n */\r\nenum edge_type_e {\r\n    EDGE_TYPE_NOFILTER  = 0,  /* no deblock filter */\r\n    EDGE_TYPE_ONLY_LUMA = 1,  /* TU boundary in CU (chroma block does not have such boundaries) */\r\n    EDGE_TYPE_BOTH      = 2   /* CU boundary and PU boundary */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void lf_set_edge_filter_param(davs2_t *h, int i_level, int scu_x, int scu_y, int dir, int edge_type)\r\n{\r\n    const int w_in_scu = h->i_width_in_scu;\r\n    // const int h_in_scu = h->i_height_in_mincu;\r\n    int scu_num  = 1 << (i_level - MIN_CU_SIZE_IN_BIT);\r\n    int scu_xy = scu_y * w_in_scu + scu_x;\r\n    int i;\r\n\r\n    if (dir == EDGE_VER) {\r\n        /* set flag of vertical edges */\r\n        if (scu_x == 0) {\r\n            return;\r\n        }\r\n\r\n        /* Is left border Slice border?\r\n         * check edge condition, can not filter beyond frame/slice boundaries */\r\n        if (!h->seq_info.cross_loop_filter_flag &&\r\n            
h->scu_data[scu_xy].i_slice_nr != h->scu_data[scu_xy - 1].i_slice_nr) {\r\n            return;\r\n        }\r\n\r\n        /* set filter type */\r\n        // scu_num = DAVS2_MIN(scu_num, h_in_scu - scu_y);\r\n        for (i = 0; i < scu_num; i++) {\r\n            if (h->p_deblock_flag[EDGE_VER][(scu_y + i) * w_in_scu + scu_x] != EDGE_TYPE_NOFILTER) {\r\n                break;\r\n            }\r\n            h->p_deblock_flag[EDGE_VER][(scu_y + i) * w_in_scu + scu_x] = (uint8_t)edge_type;\r\n        }\r\n    } else {\r\n        /* set flag of horizontal edges */\r\n        if (scu_y == 0) {\r\n            return;\r\n        }\r\n\r\n        /* Is top border Slice border?\r\n         * check edge condition, can not filter beyond frame/slice boundaries */\r\n        if (!h->seq_info.cross_loop_filter_flag && \r\n            h->scu_data[scu_xy].i_slice_nr != h->scu_data[scu_xy - h->i_width_in_scu].i_slice_nr) {\r\n            return;\r\n        }\r\n\r\n        /* set filter type */\r\n        // scu_num = DAVS2_MIN(scu_num, w_in_scu - scu_x);\r\n        for (i = 0; i < scu_num; i++) {\r\n            if (h->p_deblock_flag[EDGE_HOR][scu_y * w_in_scu + scu_x + i] != EDGE_TYPE_NOFILTER) {\r\n                break;\r\n            }\r\n            h->p_deblock_flag[EDGE_HOR][scu_y * w_in_scu + scu_x + i] = (uint8_t)edge_type;\r\n        }\r\n    }\r\n}\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void lf_lcu_set_edge_filter(davs2_t *h, int i_level, int scu_x, int scu_y)\r\n{\r\n    const int w_in_scu = h->i_width_in_scu;\r\n    cu_t *p_scu_data = &h->scu_data[scu_y * w_in_scu + scu_x];\r\n    int i;\r\n\r\n    if (p_scu_data->i_cu_level < i_level) {\r\n        const int h_in_scu = h->i_height_in_scu;\r\n\r\n        // 4 sub-cu\r\n        for (i = 0; i < 4; i++) {\r\n            int sub_cu_x = scu_x + ((i  & 1) << (i_level - MIN_CU_SIZE_IN_BIT - 1));\r\n            int sub_cu_y = scu_y + ((i >> 1) << (i_level - 
MIN_CU_SIZE_IN_BIT - 1));\r\n\r\n            if (sub_cu_x >= w_in_scu || sub_cu_y >= h_in_scu) {\r\n                continue;       // is outside of the frame\r\n            }\r\n\r\n            lf_lcu_set_edge_filter(h, i_level - 1, sub_cu_x, sub_cu_y);\r\n        }\r\n    } else {\r\n        // set the first left and top edge filter parameters\r\n        lf_set_edge_filter_param(h, i_level, scu_x, scu_y, EDGE_VER, EDGE_TYPE_BOTH);  // left edge\r\n        lf_set_edge_filter_param(h, i_level, scu_x, scu_y, EDGE_HOR, EDGE_TYPE_BOTH);  // top  edge\r\n\r\n        // set other edge filter parameters\r\n        if (p_scu_data->i_cu_level > B8X8_IN_BIT) {\r\n            /* set prediction boundary */\r\n            i = i_level - MIN_CU_SIZE_IN_BIT - 1;\r\n\r\n            switch (p_scu_data->i_cu_type) {\r\n                case PRED_2NxN:\r\n                    lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << i), EDGE_HOR, EDGE_TYPE_BOTH);\r\n                    break;\r\n                case PRED_Nx2N:\r\n                    lf_set_edge_filter_param(h, i_level, scu_x + (1 << i), scu_y, EDGE_VER, EDGE_TYPE_BOTH);\r\n                    break;\r\n                case PRED_I_NxN:\r\n                    lf_set_edge_filter_param(h, i_level, scu_x + (1 << i), scu_y, EDGE_VER, EDGE_TYPE_BOTH);\r\n                    lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << i), EDGE_HOR, EDGE_TYPE_BOTH);\r\n                    break;\r\n                case PRED_I_2Nxn:\r\n                    if (i > 0) {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i - 1)),     EDGE_HOR, EDGE_TYPE_ONLY_LUMA);\r\n                        lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i - 1)) * 2, EDGE_HOR, EDGE_TYPE_ONLY_LUMA);\r\n                        lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i - 1)) * 3, EDGE_HOR, EDGE_TYPE_ONLY_LUMA);\r\n                    } else {\r\n                        
lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i    )),     EDGE_HOR, EDGE_TYPE_ONLY_LUMA);\r\n                    }\r\n                    break;\r\n                case PRED_I_nx2N:\r\n                    if (i > 0) {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i - 1)),     scu_y, EDGE_VER, EDGE_TYPE_ONLY_LUMA);\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i - 1)) * 2, scu_y, EDGE_VER, EDGE_TYPE_ONLY_LUMA);\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i - 1)) * 3, scu_y, EDGE_VER, EDGE_TYPE_ONLY_LUMA);\r\n                    } else {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i    )),     scu_y, EDGE_VER, EDGE_TYPE_ONLY_LUMA);\r\n                    }\r\n                    break;\r\n                case PRED_2NxnU:\r\n                    if (i > 0) {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i - 1)), EDGE_HOR, EDGE_TYPE_BOTH);\r\n                    }\r\n                    break;\r\n                case PRED_2NxnD:\r\n                    if (i > 0) {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i - 1)) * 3, EDGE_HOR, EDGE_TYPE_BOTH);\r\n                    }\r\n                    break;\r\n                case PRED_nLx2N:\r\n                    if (i > 0) {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i - 1)), scu_y, EDGE_VER, EDGE_TYPE_BOTH);\r\n                    }\r\n                    break;\r\n                case PRED_nRx2N:\r\n                    if (i > 0) {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i - 1)) * 3, scu_y, EDGE_VER, EDGE_TYPE_BOTH);\r\n                    }\r\n                    break;\r\n                default:\r\n                    // for other modes: direct/skip, 2Nx2N inter, 2Nx2N intra, no need to set\r\n                    
break;\r\n            }\r\n\r\n            /* set transform block boundary */\r\n            if (p_scu_data->i_cu_type != PRED_I_NxN && p_scu_data->i_trans_size != TU_SPLIT_NON && p_scu_data->i_cbp != 0) {\r\n                if (h->seq_info.enable_nsqt && IS_HOR_PU_PART(p_scu_data->i_cu_type)) {\r\n                    if (p_scu_data->i_cu_level == B16X16_IN_BIT) {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i    )),                  EDGE_HOR, EDGE_TYPE_ONLY_LUMA);\r\n                    } else {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i - 1)),                  EDGE_HOR, EDGE_TYPE_ONLY_LUMA);\r\n                        lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i    )),                  EDGE_HOR, EDGE_TYPE_ONLY_LUMA);\r\n                        lf_set_edge_filter_param(h, i_level, scu_x, scu_y + (1 << (i    )) + (1 << (i - 1)), EDGE_HOR, EDGE_TYPE_ONLY_LUMA);\r\n                    }\r\n                } else if (h->seq_info.enable_nsqt && IS_VER_PU_PART(p_scu_data->i_cu_type)) {\r\n                    if (p_scu_data->i_cu_level == B16X16_IN_BIT) {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i    )),                  scu_y, EDGE_VER, EDGE_TYPE_ONLY_LUMA);\r\n                    } else {\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i - 1)),                  scu_y, EDGE_VER, EDGE_TYPE_ONLY_LUMA);\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i    )),                  scu_y, EDGE_VER, EDGE_TYPE_ONLY_LUMA);\r\n                        lf_set_edge_filter_param(h, i_level, scu_x + (1 << (i    )) + (1 << (i - 1)), scu_y, EDGE_VER, EDGE_TYPE_ONLY_LUMA);\r\n                    }\r\n                } else {\r\n                    lf_set_edge_filter_param(h, i_level, scu_x + (1 << i), scu_y, EDGE_VER, EDGE_TYPE_ONLY_LUMA);\r\n                    lf_set_edge_filter_param(h, 
i_level, scu_x, scu_y + (1 << i), EDGE_HOR, EDGE_TYPE_ONLY_LUMA);\r\n                }\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * return 0 if filtering can be skipped for this edge segment, 1 if it must be filtered\r\n */\r\nstatic uint8_t lf_skip_filter(davs2_t *h, cu_t *scuP, cu_t *scuQ, int dir, int block_x, int block_y)\r\n{\r\n    if (h->i_frame_type == AVS2_P_SLICE || h->i_frame_type == AVS2_F_SLICE) {\r\n        const int width_in_spu = h->i_width_in_spu;\r\n        int pos1 = block_y         * width_in_spu + block_x;\r\n        int pos2 = (block_y - dir) * width_in_spu + (block_x - !dir);\r\n        int ref1 = h->p_ref_idx[pos1].r[0];\r\n        int ref2 = h->p_ref_idx[pos2].r[0];\r\n        mv_t mv_1, mv_2;\r\n\r\n        mv_1.v = h->p_tmv_1st[pos1].v;\r\n        mv_2.v = h->p_tmv_1st[pos2].v;\r\n\r\n        if ((scuP->i_cbp == 0) && (scuQ->i_cbp == 0) &&\r\n            (DAVS2_ABS(mv_1.x - mv_2.x) < 4) &&\r\n            (DAVS2_ABS(mv_1.y - mv_2.y) < 4) &&\r\n            (ref1 != INVALID_REF && ref1 == ref2)) {\r\n            return 0;\r\n        }\r\n    }\r\n\r\n    return 1;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void lf_edge_core(pel_t *src, int b_chroma, int ptr_inc, int inc1, int alpha, int beta, uint8_t *flt_flag)\r\n{\r\n    int inc2 = inc1 << 1;\r\n    int inc3 = inc1 + inc2;\r\n    int abs_delta;\r\n    int L2, L1, L0, R0, R1, R2;\r\n    int fs; // fs stands for filtering strength.  
The larger fs is, the stronger filter is applied.\r\n    int FlatnessL, FlatnessR;   // FlatnessL and FlatnessR describe how flat the curve is of one coding unit\r\n    int flag;\r\n    int pel;\r\n\r\n    for (pel = 0; pel < MIN_CU_SIZE; pel++) {\r\n        L2 = src[-inc3];\r\n        L1 = src[-inc2];\r\n        L0 = src[-inc1];\r\n        R0 = src[    0];\r\n        R1 = src[ inc1];\r\n        R2 = src[ inc2];\r\n\r\n        abs_delta = DAVS2_ABS(R0 - L0);\r\n        flag = (pel < 4) ? flt_flag[0] : flt_flag[1];\r\n\r\n        if (flag && (abs_delta < alpha) && (abs_delta > 1)) {\r\n            FlatnessL  = (DAVS2_ABS(L1 - L0) < beta) ? 2 : 0;\r\n            FlatnessL += (DAVS2_ABS(L2 - L0) < beta);\r\n\r\n            FlatnessR  = (DAVS2_ABS(R0 - R1) < beta) ? 2 : 0;\r\n            FlatnessR += (DAVS2_ABS(R0 - R2) < beta);\r\n\r\n            switch (FlatnessL + FlatnessR) {\r\n            case 6:\r\n                fs = 3 + ((R1 == R0) && (L0 == L1));  // ((R1 == R0) && (L0 == L1)) ? 4 : 3;\r\n                break;\r\n            case 5:\r\n                fs = 2 + ((R1 == R0) && (L0 == L1));  // ((R1 == R0) && (L0 == L1)) ? 3 : 2;\r\n                break;\r\n            case 4:\r\n                fs = 1 + (FlatnessL == 2);            // (FlatnessL == 2) ? 
2 : 1;\r\n                break;\r\n            case 3:\r\n                fs = (DAVS2_ABS(L1 - R1) < beta);\r\n                break;\r\n            default:\r\n                fs = 0;\r\n            }\r\n\r\n            fs -= (b_chroma && fs > 0);\r\n\r\n            switch (fs) {\r\n            case 4:\r\n                src[-inc1] = (pel_t)((L0 + ((L0 + L2) << 3) + L2 + (R0 << 3) + (R2 << 2) + (R2 << 1) + 16) >> 5); // L0\r\n                src[-inc2] = (pel_t)(((L0 << 3) - L0 + (L2 << 2) + (L2 << 1) + R0 + (R0 << 1) + 8) >> 4);         // L1\r\n                src[-inc3] = (pel_t)(((L0 << 2) + L2 + (L2 << 1) + R0 + 4) >> 3);                                 // L2\r\n                src[    0] = (pel_t)((R0 + ((R0 + R2) << 3) + R2 + (L0 << 3) + (L2 << 2) + (L2 << 1) + 16) >> 5); // R0\r\n                src[ inc1] = (pel_t)(((R0 << 3) - R0 + (R2 << 2) + (R2 << 1) + L0 + (L0 << 1) + 8) >> 4);         // R1\r\n                src[ inc2] = (pel_t)(((R0 << 2) + R2 + (R2 << 1) + L0 + 4) >> 3);                                 // R2\r\n                break;\r\n            case 3:\r\n                src[-inc1] = (pel_t)((L2 + (L1 << 2) + (L0 << 2) + (L0 << 1) + (R0 << 2) + R1 + 8) >> 4);         // L0\r\n                src[    0] = (pel_t)((L1 + (L0 << 2) + (R0 << 2) + (R0 << 1) + (R1 << 2) + R2 + 8) >> 4);         // R0\r\n                src[-inc2] = (pel_t)((L2 * 3 + L1 * 8 + L0 * 4 + R0 + 8) >> 4);\r\n                src[ inc1] = (pel_t)((R2 * 3 + R1 * 8 + R0 * 4 + L0 + 8) >> 4);\r\n                break;\r\n            case 2:\r\n                src[-inc1] = (pel_t)(((L1 << 1) + L1 + (L0 << 3) + (L0 << 1) + (R0 << 1) + R0 + 8) >> 4);\r\n                src[    0] = (pel_t)(((L0 << 1) + L0 + (R0 << 3) + (R0 << 1) + (R1 << 1) + R1 + 8) >> 4);\r\n                break;\r\n            case 1:\r\n                src[-inc1] = (pel_t)((L0 * 3 + R0 + 2) >> 2);\r\n                src[    0] = (pel_t)((R0 * 3 + L0 + 2) >> 2);\r\n                break;\r\n            
default:\r\n                break;\r\n            }\r\n        }\r\n\r\n        src += ptr_inc;     // next row or column\r\n        pel += b_chroma;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void deblock_edge_hor(pel_t *src, int stride, int alpha, int beta, uint8_t *flt_flag)\r\n{\r\n    lf_edge_core(src, 0, 1, stride, alpha, beta, flt_flag);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void deblock_edge_ver(pel_t *src, int stride, int alpha, int beta, uint8_t *flt_flag)\r\n{\r\n    lf_edge_core(src, 0, stride, 1, alpha, beta, flt_flag);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#if HDR_CHROMA_DELTA_QP\r\nstatic void deblock_edge_ver_c(pel_t *src_u, pel_t *src_v, int stride, int *alpha, int *beta, uint8_t *flt_flag)\r\n#else\r\nstatic void deblock_edge_ver_c(pel_t *src_u, pel_t *src_v, int stride, int alpha, int beta, uint8_t *flt_flag)\r\n#endif\r\n{\r\n#if HDR_CHROMA_DELTA_QP\r\n    lf_edge_core(src_u, 1, stride, 1, alpha[0], beta[0], flt_flag);\r\n    lf_edge_core(src_v, 1, stride, 1, alpha[1], beta[1], flt_flag);\r\n#else\r\n    lf_edge_core(src_u, 1, stride, 1, alpha, beta, flt_flag);\r\n    lf_edge_core(src_v, 1, stride, 1, alpha, beta, flt_flag);\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#if HDR_CHROMA_DELTA_QP\r\nstatic void deblock_edge_hor_c(pel_t *src_u, pel_t *src_v, int stride, int *alpha, int *beta, uint8_t *flt_flag)\r\n#else\r\nstatic void deblock_edge_hor_c(pel_t *src_u, pel_t *src_v, int stride, int alpha, int beta, uint8_t *flt_flag)\r\n#endif\r\n{\r\n#if HDR_CHROMA_DELTA_QP\r\n    lf_edge_core(src_u, 1, 1, stride, alpha[0], beta[0], flt_flag);\r\n    lf_edge_core(src_v, 1, 1, stride, alpha[1], beta[1], flt_flag);\r\n#else\r\n    lf_edge_core(src_u, 1, 1, stride, alpha, beta, 
flt_flag);\r\n    lf_edge_core(src_v, 1, 1, stride, alpha, beta, flt_flag);\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * deblock one coding unit\r\n */\r\nstatic void lf_scu_deblock(davs2_t *h, pel_t *p_dec[3], int stride, int stride_c, int scu_x, int scu_y, int dir)\r\n{\r\n    static const int max_qp_deblock = 63;\r\n    const int scu_xy   = scu_y * h->i_width_in_scu + scu_x;\r\n    cu_t     *scuQ     = &h->scu_data[scu_xy];\r\n    int edge_condition = h->p_deblock_flag[dir][scu_xy];\r\n\r\n    /* deblock edges */\r\n    if (edge_condition != EDGE_TYPE_NOFILTER) {\r\n        const int shift = h->sample_bit_depth - 8;\r\n        cu_t  *scuP  = (dir) ? (scuQ - h->i_width_in_scu) : (scuQ - 1);\r\n        uint8_t b_filter_flag[2];\r\n        int QP;\r\n\r\n        b_filter_flag[0] = lf_skip_filter(h, scuP, scuQ, dir, (scu_x << 1),       (scu_y << 1)       );\r\n        b_filter_flag[1] = lf_skip_filter(h, scuP, scuQ, dir, (scu_x << 1) + dir, (scu_y << 1) + !dir);\r\n\r\n        if (!b_filter_flag[0] && !b_filter_flag[1]) {\r\n            return;  // neither half of the 8-sample edge needs filtering\r\n        }\r\n\r\n        /* deblock luma edge */\r\n        {\r\n            pel_t *src_y = p_dec[0] + (scu_y << MIN_CU_SIZE_IN_BIT) * stride + (scu_x << MIN_CU_SIZE_IN_BIT);\r\n            int alpha, beta;\r\n            QP = ((scuP->i_qp + scuQ->i_qp + 1) >> 1);  // average QP of the two blocks\r\n\r\n            /* coded as 10/12 bit, QP is added by (8 * (h->param.sample_bit_depth - 8)) in config file */\r\n            alpha = ALPHA_TABLE[DAVS2_CLIP3(0, max_qp_deblock, QP - (shift << 3) + h->i_alpha_offset)] << shift;\r\n            beta  = BETA_TABLE [DAVS2_CLIP3(0, max_qp_deblock, QP - (shift << 3) + h->i_beta_offset )] << shift;\r\n\r\n            gf_davs2.deblock_luma[dir](src_y, stride, alpha, beta, b_filter_flag);\r\n        }\r\n\r\n        /* deblock chroma edge */\r\n        if (edge_condition == EDGE_TYPE_BOTH && h->i_chroma_format != 
CHROMA_400)\r\n        if (((scu_y & 1) == 0 && dir) || (((scu_x & 1) == 0) && (!dir))) {\r\n            int uv_offset = (scu_y << (MIN_CU_SIZE_IN_BIT - 1)) * stride_c + (scu_x << (MIN_CU_SIZE_IN_BIT - 1));\r\n            pel_t *src_u = p_dec[1] + uv_offset;\r\n            pel_t *src_v = p_dec[2] + uv_offset;\r\n#if HDR_CHROMA_DELTA_QP\r\n            int alpha[2], beta[2];\r\n            int luma_qp = QP;\r\n            int offset = shift << 3;\r\n            /* coded as 10/12 bit, QP is added by (8 * (h->param.sample_bit_depth - 8)) in config file */\r\n            QP = cu_get_chroma_qp(h, luma_qp, 0) - offset;\r\n            alpha[0] = ALPHA_TABLE[DAVS2_CLIP3(0, max_qp_deblock, QP + h->i_alpha_offset)] << shift;\r\n            beta[0]  = BETA_TABLE [DAVS2_CLIP3(0, max_qp_deblock, QP + h->i_beta_offset )] << shift;\r\n\r\n            QP = cu_get_chroma_qp(h, luma_qp, 1) - offset;\r\n            alpha[1] = ALPHA_TABLE[DAVS2_CLIP3(0, max_qp_deblock, QP + h->i_alpha_offset)] << shift;\r\n            beta[1]  = BETA_TABLE [DAVS2_CLIP3(0, max_qp_deblock, QP + h->i_beta_offset )] << shift;\r\n\r\n            gf_davs2.deblock_chroma[dir](src_u, src_v, stride_c, alpha, beta, b_filter_flag);\r\n#else\r\n            int alpha, beta;\r\n\r\n            /* coded as 10/12 bit, QP is added by (8 * (h->param.sample_bit_depth - 8)) in config file */\r\n            QP = cu_get_chroma_qp(h, QP, 0) - (shift << 3);\r\n            alpha = ALPHA_TABLE[DAVS2_CLIP3(0, max_qp_deblock, QP + h->i_alpha_offset)] << shift;\r\n            beta = BETA_TABLE[DAVS2_CLIP3(0, max_qp_deblock, QP + h->i_beta_offset)] << shift;\r\n\r\n            gf_davs2.deblock_chroma[dir](src_u, src_v, stride_c, alpha, beta, b_filter_flag);\r\n#endif\r\n        }\r\n    }\r\n}\r\n\r\n/**\r\n * ===========================================================================\r\n * interface function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* 
---------------------------------------------------------------------------\r\n * NOTE: only support I420 now\r\n */\r\nvoid davs2_lcu_deblock(davs2_t *h, davs2_frame_t *frm, int i_lcu_x, int i_lcu_y)\r\n{\r\n    const int i_stride   = frm->i_stride[0];\r\n    const int i_stride_c = frm->i_stride[1];\r\n    const int w_in_scu   = h->i_width_in_scu;\r\n    const int h_in_scu   = h->i_height_in_scu;\r\n    const int num_in_scu = 1 << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n    int scu_x            = i_lcu_x << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n    int scu_y            = i_lcu_y << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n    int num_of_scu_hor   = DAVS2_MIN(w_in_scu - scu_x, num_in_scu);\r\n    int num_of_scu_ver   = DAVS2_MIN(h_in_scu - scu_y, num_in_scu);\r\n    int i, j;\r\n\r\n    /* -------------------------------------------------------------\r\n     * init\r\n     */\r\n\r\n    /* set edge flags in one LCU */\r\n    lf_lcu_set_edge_filter(h, h->i_lcu_level, scu_x, scu_y);\r\n\r\n    /* -------------------------------------------------------------\r\n     * vertical\r\n     */\r\n\r\n    /* deblock all vertical edges in one LCU */\r\n    for (j = 0; j < num_of_scu_ver; j++) {\r\n        for (i = 0; i < num_of_scu_hor; i++) {\r\n            lf_scu_deblock(h, frm->planes, i_stride, i_stride_c, scu_x + i, scu_y + j, EDGE_VER);\r\n        }\r\n    }\r\n\r\n    /* -------------------------------------------------------------\r\n     * horizontal\r\n     */\r\n\r\n    /* adjust the value of scu_x and num_of_scu_hor */\r\n    if (scu_x == 0) {\r\n        /* the current LCU is the first LCU in a LCU row */\r\n        num_of_scu_hor--; /* leave the last horizontal edge */\r\n    } else {\r\n        /* the current LCU is one of the rest LCUs in a row */\r\n        if (scu_x + num_of_scu_hor == w_in_scu) {\r\n            /* the current LCU is the last LCUs in a row,\r\n             * need deblock one horizontal edge more */\r\n            num_of_scu_hor++;\r\n  
      }\r\n        scu_x--;        /* begin from the last horizontal edge of previous LCU */\r\n    }\r\n\r\n    /* deblock all horizontal edges in one LCU */\r\n    for (j = 0; j < num_of_scu_ver; j++) {\r\n        for (i = 0; i < num_of_scu_hor; i++) {\r\n            lf_scu_deblock(h, frm->planes, i_stride, i_stride_c, scu_x + i, scu_y + j, EDGE_HOR);\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * init deblock function handles\r\n */\r\nvoid davs2_deblock_init(uint32_t cpuid, ao_funcs_t* fh)\r\n{\r\n    UNUSED_PARAMETER(cpuid);\r\n\r\n    fh->deblock_luma  [0] = deblock_edge_ver;\r\n    fh->deblock_luma  [1] = deblock_edge_hor;\r\n    fh->deblock_chroma[0] = deblock_edge_ver_c;\r\n    fh->deblock_chroma[1] = deblock_edge_hor_c;\r\n\r\n    fh->set_deblock_const = NULL;\r\n\r\n    /* init asm function handles */\r\n#if HAVE_MMX\r\n    if ((cpuid & DAVS2_CPU_SSE4) && !HDR_CHROMA_DELTA_QP) {\r\n#if !HIGH_BIT_DEPTH\r\n        fh->deblock_luma  [0] = deblock_edge_ver_sse128;\r\n        fh->deblock_luma  [1] = deblock_edge_hor_sse128;\r\n        fh->deblock_chroma[0] = deblock_edge_ver_c_sse128;\r\n        fh->deblock_chroma[1] = deblock_edge_hor_c_sse128;\r\n#endif\r\n    }\r\n    if ((cpuid & DAVS2_CPU_AVX2) && !HDR_CHROMA_DELTA_QP) {\r\n#if !HIGH_BIT_DEPTH\r\n        // fh->deblock_luma[0] = deblock_edge_ver_avx2;  // @luofl: not faster than sse128 on i7-6700K\r\n        // fh->deblock_luma[1] = deblock_edge_hor_avx2;\r\n        // fh->deblock_chroma[0] = deblock_edge_ver_c_avx2;\r\n        // fh->deblock_chroma[1] = deblock_edge_hor_c_avx2;\r\n\r\n#endif\r\n    }\r\n#endif  // HAVE_MMX\r\n}\r\n"
  },
  {
    "path": "source/common/deblock.h",
    "content": "/*\r\n * deblock.h\r\n *\r\n * Description of this file:\r\n *    Deblock functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_DEBLOCK_H\r\n#define DAVS2_DEBLOCK_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#define davs2_deblock_init FPFX(deblock_init)\r\nvoid davs2_deblock_init(uint32_t cpuid, ao_funcs_t* fh);\r\n#define davs2_lcu_deblock FPFX(lcu_deblock)\r\nvoid davs2_lcu_deblock(davs2_t *h, davs2_frame_t *frm, int i_lcu_x, int i_lcu_y);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_DEBLOCK_H\r\n"
  },
  {
    "path": "source/common/decoder.cc",
    "content": "/*\r\n * decoder.cc\r\n *\r\n * Description of this file:\r\n *    Decoder functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"davs2.h\"\r\n#include \"decoder.h\"\r\n#include \"aec.h\"\r\n#include \"header.h\"\r\n#include \"bitstream.h\"\r\n#include \"deblock.h\"\r\n#include \"cu.h\"\r\n#include \"sao.h\"\r\n#include \"alf.h\"\r\n#include \"quant.h\"\r\n#include \"frame.h\"\r\n#include \"intra.h\"\r\n#include \"mc.h\"\r\n#include \"transform.h\"\r\n#include \"cpu.h\"\r\n#include \"threadpool.h\"\r\n\r\n#define TRACEFILE \"trace_dec_HD.txt\"  /* trace file in current directory */\r\n\r\n/* disable warning C4127: ʽǳ */\r\n#pragma warning(disable:4127)\r\n\r\n\r\n/**\r\n * 
===========================================================================\r\n * local function defines\r\n * ===========================================================================\r\n */\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * initializes the parameters for a new frame\r\n */\r\nstatic void init_frame(davs2_t *h)\r\n{\r\n    int num_spu = h->i_width_in_spu * h->i_height_in_spu;\r\n    //int i;\r\n\r\n    h->lcu.i_scu_xy        = 0;\r\n    h->i_slice_index       = -1;\r\n    h->b_slice_checked     = 0;\r\n    h->fdec->i_parsed_lcu_xy = -1;\r\n    h->decoding_error      = 0;    // clear the decoding error flag\r\n\r\n    /* 1, clear intra_mode buffer, set to default value (DC_PRED) */\r\n    memset(h->p_ipredmode - h->i_ipredmode - 16, DC_PRED, h->i_ipredmode * (h->i_height_in_spu + 1) * sizeof(int8_t));\r\n    memset(h->p_dirpred, PDIR_INVALID, num_spu * sizeof(int8_t));\r\n\r\n    /* 2, clear mv buffer (set all MVs to zero) */\r\n    gf_davs2.fast_memzero(h->p_ref_idx, num_spu * sizeof(ref_idx_t));\r\n    // gf_davs2.fast_memzero(h->p_tmv_1st, num_spu * sizeof(mv_t));\r\n    // gf_davs2.fast_memzero(h->p_tmv_2nd, num_spu * sizeof(mv_t));\r\n\r\n    /* 3, clear slice number for all SCU */\r\n    //repeat for init slice for current LCU\r\n    //for (i = 0; i < h->i_size_in_scu; i++) {\r\n    //    h->scu_data[i].i_slice_nr = -1;\r\n    //}\r\n\r\n    /* 4, init adaptive frequency weighting quantization */\r\n    if (h->seq_info.enable_weighted_quant) {\r\n        wq_init_frame_quant_param(h);\r\n        wq_update_frame_matrix(h);\r\n    }\r\n\r\n    /* 5, copy frame properties for SAO & ALF */\r\n    if (h->b_sao) {\r\n        davs2_frame_copy_properties(h->p_frame_sao, h->fdec);\r\n    }\r\n    if (h->b_alf) {\r\n        int alf_enable = h->pic_alf_on[IMG_Y] != 0 || h->pic_alf_on[IMG_U] != 0 || h->pic_alf_on[IMG_V] != 0;\r\n        if (alf_enable) {\r\n            davs2_frame_copy_properties(h->p_frame_alf, h->fdec);\r\n        }\r\n    
}\r\n\r\n    /* 6, clear the p_deblock_flag buffer */\r\n    gf_davs2.fast_memzero(h->p_deblock_flag[0], h->i_width_in_scu * h->i_height_in_scu * 2 * sizeof(uint8_t));\r\n\r\n    /* 7, clear LCU info buffer */\r\n#if CTRL_AEC_THREAD\r\n    gf_davs2.fast_memzero(h->lcu_infos, sizeof(lcu_info_t) * h->i_width_in_lcu * h->i_height_in_lcu);\r\n#endif\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n* cache CTU border\r\n*/\r\nstatic INLINE\r\nvoid davs2_cache_lcu_border(pel_t *p_dst, const pel_t *p_top,\r\nconst pel_t *p_left, int i_left,\r\nint lcu_width, int lcu_height)\r\n{\r\n    int i;\r\n    /* top, top-right */\r\n    memcpy(p_dst, p_top, (2 * lcu_width + 1) * sizeof(pel_t));\r\n    /* left */\r\n    for (i = 1; i <= lcu_height; i++) {\r\n        p_dst[-i] = p_left[0];\r\n        p_left += i_left;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n* cache CTU border (UV components together)\r\n*/\r\nstatic INLINE\r\nvoid davs2_cache_lcu_border_uv(pel_t *p_dst_u, const pel_t *p_top_u, const pel_t *p_left_u,\r\npel_t *p_dst_v, const pel_t *p_top_v, const pel_t *p_left_v,\r\nint i_left, int lcu_width, int lcu_height)\r\n{\r\n    int i;\r\n    /* top, top-right */\r\n    memcpy(p_dst_u, p_top_u, (2 * lcu_width + 1) * sizeof(pel_t));\r\n    memcpy(p_dst_v, p_top_v, (2 * lcu_width + 1) * sizeof(pel_t));\r\n    /* left */\r\n    for (i = 1; i <= lcu_height; i++) {\r\n        p_dst_u[-i] = p_left_u[0];\r\n        p_dst_v[-i] = p_left_v[0];\r\n        p_left_u += i_left;\r\n        p_left_v += i_left;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void save_mv_ref_info(davs2_t *h, int row)\r\n{\r\n    const int w_in_spu     = h->i_width_in_spu;\r\n    const int h_in_spu     = h->i_height_in_spu;\r\n    const int spu_y        = row << (h->i_lcu_level - MIN_PU_SIZE_IN_BIT);\r\n    const int lcu_h_in_spu = 
1 << (h->i_lcu_level - MIN_PU_SIZE_IN_BIT);\r\n    mv_t   *p_dst_mv       = &h->fdec->mvbuf[spu_y * w_in_spu];\r\n    int8_t *p_dst_ref      = &h->fdec->refbuf[spu_y * w_in_spu];\r\n    mv_t   *p_src_mv;\r\n    ref_idx_t *p_src_ref;\r\n    int i, j, x, y;\r\n\r\n    for (j = spu_y; j < DAVS2_MIN(spu_y + lcu_h_in_spu, h_in_spu); j++) {\r\n        y = ((j >> MV_FACTOR_IN_BIT) << MV_FACTOR_IN_BIT) + 2;\r\n        if (y >= h_in_spu) {\r\n            y = (((j >> MV_FACTOR_IN_BIT) << MV_FACTOR_IN_BIT) + h_in_spu) >> 1;\r\n        }\r\n\r\n        p_src_mv  = h->p_tmv_1st + y * w_in_spu;\r\n        p_src_ref = h->p_ref_idx + y * w_in_spu;\r\n\r\n        for (i = 0; i < w_in_spu; i++) {\r\n            x = ((i >> MV_FACTOR_IN_BIT) << MV_FACTOR_IN_BIT) + 2;\r\n            if (x >= w_in_spu) {\r\n                x = (((i >> MV_FACTOR_IN_BIT) << MV_FACTOR_IN_BIT) + w_in_spu) >> 1;\r\n            }\r\n\r\n            p_dst_mv [i] = p_src_mv [x];\r\n            p_dst_ref[i] = p_src_ref[x].r[0];\r\n        }\r\n\r\n        p_dst_mv  += w_in_spu;\r\n        p_dst_ref += w_in_spu;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic davs2_outpic_t *get_one_free_picture(davs2_mgr_t *mgr, int w, int h)\r\n{\r\n    davs2_outpic_t *pic = NULL;\r\n\r\n    for (;;) {\r\n        /* get one from recycle bin */\r\n        pic = (davs2_outpic_t *)xl_remove_head(&mgr->pic_recycle, 0);\r\n        if ((pic == NULL) ||\r\n            (pic->pic->widths[0] == w && pic->pic->lines[0] == h)) {\r\n            break;\r\n        }\r\n\r\n        /* obsolete picture */\r\n        free_picture(pic);\r\n        pic = NULL;\r\n    }\r\n\r\n    if (pic == NULL) {\r\n        /* no free picture. no wait, just new one. 
*/\r\n        pic = alloc_picture(w, h);\r\n    }\r\n\r\n    return pic;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * wait until the LCU at index lcu_xy has been parsed\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid wait_lcu_row_parsed(davs2_t *h, davs2_frame_t *frm, int lcu_xy)\r\n{\r\n    UNUSED_PARAMETER(h);\r\n\r\n    if (lcu_xy > frm->i_parsed_lcu_xy) {\r\n        davs2_thread_mutex_lock(&frm->mutex_frm);   /* lock */\r\n        while (lcu_xy > frm->i_parsed_lcu_xy) {\r\n            davs2_thread_cond_wait(&frm->cond_aec, &frm->mutex_frm);\r\n        }\r\n        davs2_thread_mutex_unlock(&frm->mutex_frm); /* unlock */\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * wait until at least wait_lcu_coded LCUs of LCU row wait_lcu_y have been reconstructed\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid wait_lcu_row_reconed(davs2_t *h, davs2_frame_t *frm, int wait_lcu_y, int wait_lcu_coded)\r\n{\r\n    UNUSED_PARAMETER(h);\r\n    // wait_lcu_coded = DAVS2_MIN(h->i_width_in_lcu, wait_lcu_coded);\r\n\r\n    if (frm->num_decoded_lcu_in_row[wait_lcu_y] < wait_lcu_coded) {\r\n        davs2_thread_mutex_lock(&frm->mutex_recon);   /* lock */\r\n        while (frm->num_decoded_lcu_in_row[wait_lcu_y] < wait_lcu_coded) {\r\n            davs2_thread_cond_wait(&frm->conds_lcu_row[wait_lcu_y], &frm->mutex_recon);\r\n        }\r\n        davs2_thread_mutex_unlock(&frm->mutex_recon); /* unlock */\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void decoder_signal(davs2_t *h, davs2_frame_t *frame, int line)\r\n{\r\n    if (line > 0) {\r\n        wait_lcu_row_reconed(h, frame, line - 1, h->i_width_in_lcu + 1);\r\n    }\r\n\r\n    davs2_thread_mutex_lock(&frame->mutex_recon);\r\n    frame->i_decoded_line++;\r\n    frame->num_decoded_lcu_in_row[line] = h->i_width_in_lcu + 3;\r\n    davs2_thread_mutex_unlock(&frame->mutex_recon);\r\n\r\n    davs2_thread_cond_broadcast(&frame->conds_lcu_row[line]);\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid task_send_picture_to_output_list(davs2_t *h, davs2_outpic_t *pic)\r\n{\r\n    davs2_mgr_t    *mgr  = h->task_info.taskmgr;\r\n    davs2_outpic_t *curr = NULL;\r\n    davs2_outpic_t *prev = NULL;\r\n\r\n    davs2_thread_mutex_lock(&mgr->mutex_mgr);\r\n\r\n    curr = mgr->outpics.pics;\r\n\r\n    while (curr && curr->frame->i_poc < pic->frame->i_poc) {\r\n        prev = curr;\r\n        curr = curr->next;\r\n    }\r\n\r\n    /* duplicate frame? */\r\n    if (curr != NULL && curr->frame->i_poc == pic->frame->i_poc) {\r\n        davs2_log(h, DAVS2_LOG_WARNING, \"detected duplicate POC %d\", curr->frame->i_poc);\r\n    }\r\n\r\n    /* insert this frame before 'curr' */\r\n    pic->next = curr;\r\n\r\n    if (prev) {\r\n        prev->next = pic;\r\n    } else {\r\n        mgr->outpics.pics = pic;\r\n    }\r\n    mgr->outpics.num_output_pic++;\r\n\r\n    DAVS2_ASSERT(h->task_info.task_status == TASK_BUSY,\r\n        \"Invalid task status %d\",\r\n        h->task_info.task_status);\r\n    davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid task_output_decoding_frame(davs2_t *h)\r\n{\r\n    davs2_mgr_t       *mgr     = h->task_info.taskmgr;\r\n    davs2_frame_t     *frame   = h->fdec;\r\n    davs2_seq_t       *seqhead = &h->seq_info;\r\n    davs2_outpic_t    *pic     = NULL;\r\n\r\n    assert(frame);\r\n\r\n    pic = get_one_free_picture(mgr, h->i_image_width, h->i_image_height);\r\n    assert(pic);\r\n\r\n    memcpy(pic->head, &seqhead->head, sizeof(davs2_seq_info_t));\r\n\r\n    if (frame->i_type == AVS2_GB_SLICE) {\r\n        pic->frame = h->f_background_ref; ///!!! 
FIXME: actually NOT working (we do not support S frames now).\r\n    } else {\r\n        pic->frame = frame;\r\n    }\r\n\r\n    frame->i_chroma_format    = h->i_chroma_format;\r\n    frame->i_output_bit_depth = h->output_bit_depth;\r\n    frame->i_sample_bit_depth = h->sample_bit_depth;\r\n    frame->frm_decode_error   = h->decoding_error;\r\n    h->decoding_error         = 0;  // clear decoding error status\r\n\r\n    pic->frame = frame;\r\n\r\n    task_send_picture_to_output_list(h, pic);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nint check_slice_header(davs2_t *h, davs2_bs_t *bs, int lcu_y)\r\n{\r\n    aec_t *p_aec = &h->aec;\r\n\r\n    if (h->b_slice_checked && found_slice_header(bs)) {\r\n        /* slice starts at next byte */\r\n        bs->i_bit_pos = (((bs->i_bit_pos + 7) >> 3) << 3);\r\n        h->i_slice_index++;\r\n\r\n        parse_slice_header(h, bs);\r\n        aec_init_contexts(p_aec);\r\n        aec_new_slice(h);\r\n        aec_start_decoding(p_aec, bs->p_stream, ((bs->i_bit_pos + 7) / 8), bs->i_stream);\r\n        AEC_RETURN_ON_ERROR(-1);\r\n\r\n        /* reset the intra prediction modes of the row above the current slice to DC_PRED */\r\n        lcu_y <<= (h->i_lcu_level - MIN_PU_SIZE_IN_BIT);\r\n        memset(h->p_ipredmode + (lcu_y - 1) * h->i_ipredmode - 16, DC_PRED, h->i_ipredmode * sizeof(int8_t));\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid rowrec_store_lcu_recon_samples(davs2_row_rec_t *row_rec)\r\n{\r\n#if 1\r\n    UNUSED_PARAMETER(row_rec);\r\n#else\r\n    gf_davs2.plane_copy(row_rec->ctu.p_frec[0], row_rec->ctu.i_frec[0], \r\n                         row_rec->ctu.p_fdec[0], row_rec->ctu.i_fdec[0], \r\n                         row_rec->ctu.i_ctu_w, row_rec->ctu.i_ctu_h);\r\n    gf_davs2.plane_copy(row_rec->ctu.p_frec[1], row_rec->ctu.i_frec[1],\r\n                         row_rec->ctu.p_fdec[1], row_rec->ctu.i_fdec[1],\r\n                         row_rec->ctu.i_ctu_w_c, row_rec->ctu.i_ctu_h_c);\r\n    gf_davs2.plane_copy(row_rec->ctu.p_frec[2], row_rec->ctu.i_frec[2],\r\n                         row_rec->ctu.p_fdec[2], 
row_rec->ctu.i_fdec[1],\r\n                         row_rec->ctu.i_ctu_w_c, row_rec->ctu.i_ctu_h_c);\r\n    gf_davs2.plane_copy(row_rec->ctu.p_frec[2], row_rec->ctu.i_frec[2],\r\n                         row_rec->ctu.p_fdec[2], row_rec->ctu.i_fdec[2],\r\n                         row_rec->ctu.i_ctu_w_c, row_rec->ctu.i_ctu_h_c);\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * decodes one LCU row\r\n */\r\nstatic int decode_one_lcu_row(davs2_t *h, davs2_bs_t *bs, int i_lcu_y)\r\n{\r\n    const int height_in_lcu = h->i_height_in_lcu;\r\n    const int width_in_lcu  = h->i_width_in_lcu;\r\n    int alf_enable          = h->pic_alf_on[0] | h->pic_alf_on[1] | h->pic_alf_on[2];\r\n    int lcu_xy              = i_lcu_y * width_in_lcu;\r\n    int i_lcu_x;\r\n    int i;\r\n    davs2_row_rec_t row_rec;\r\n\r\n    /* loop over all LCUs in current LCU row ------------------------\r\n     */\r\n    for (i_lcu_x = 0; i_lcu_x < width_in_lcu && h->decoding_error == 0; i_lcu_x++, lcu_xy++) {\r\n        if (check_slice_header(h, bs, i_lcu_y) < 0) {\r\n            return -1;\r\n        }\r\n\r\n#if AVS2_TRACE\r\n        avs2_trace(\"\\n*********** Pic: %i (I/P) MB: %i Slice: %i Type %d **********\\n\", h->i_poc, h->lcu.i_scu_xy, h->i_slice_index, h->i_frame_type);\r\n#endif\r\n        h->lcu.lcu_aec = row_rec.lcu_info = &h->lcu_infos[lcu_xy];\r\n\r\n        rowrec_lcu_init(h, &row_rec, i_lcu_x, i_lcu_y);\r\n        decode_lcu_init(h, i_lcu_x, i_lcu_y);\r\n\r\n        /* decode LCU level data before one LCU */\r\n        if (h->b_sao) {\r\n            sao_read_lcu_param(h, lcu_xy, h->slice_sao_on, &h->lcu.lcu_aec->sao_param);\r\n        }\r\n\r\n        if (h->b_alf) {\r\n            for (i = 0; i < IMG_COMPONENTS; i++) {\r\n                if (h->pic_alf_on[i]) {\r\n                    h->lcu.lcu_aec->enable_alf[i] = (uint8_t)aec_read_alf_lcu_ctrl(&h->aec);\r\n                } else {\r\n                    
h->lcu.lcu_aec->enable_alf[i] = FALSE;\r\n                }\r\n            }\r\n        }\r\n\r\n        /* decode one lcu */\r\n        decode_lcu_parse(h, h->i_lcu_level, h->lcu.i_pix_x, h->lcu.i_pix_y);\r\n\r\n        /* cache CTU top border for intra prediction */\r\n        if (i_lcu_x == 0) {\r\n            memcpy(row_rec.ctu_border[0].rec_top + 1, h->intra_border[0], row_rec.ctu.i_ctu_w * 2 * sizeof(pel_t));\r\n            memcpy(row_rec.ctu_border[1].rec_top + 1, h->intra_border[1], row_rec.ctu.i_ctu_w * sizeof(pel_t));\r\n            memcpy(row_rec.ctu_border[2].rec_top + 1, h->intra_border[2], row_rec.ctu.i_ctu_w * sizeof(pel_t));\r\n        }\r\n\r\n        decode_lcu_recon(h, &row_rec, h->i_lcu_level, h->lcu.i_pix_x, h->lcu.i_pix_y);\r\n        \r\n        rowrec_store_lcu_recon_samples(&row_rec);\r\n        /* cache top and left samples for intra prediction of next CTU */\r\n        davs2_cache_lcu_border(row_rec.ctu_border[0].rec_top, h->intra_border[0] + row_rec.ctu.i_pix_x + row_rec.ctu.i_ctu_w - 1,\r\n                               row_rec.ctu.p_frec[0] + row_rec.ctu.i_ctu_w - 1,\r\n                               row_rec.ctu.i_frec[0], row_rec.ctu.i_ctu_w, row_rec.ctu.i_ctu_h);\r\n        davs2_cache_lcu_border_uv(row_rec.ctu_border[1].rec_top, h->intra_border[1] + row_rec.ctu.i_pix_x_c + row_rec.ctu.i_ctu_w_c - 1, row_rec.ctu.p_frec[1] + row_rec.ctu.i_ctu_w_c - 1,\r\n                                  row_rec.ctu_border[2].rec_top, h->intra_border[2] + row_rec.ctu.i_pix_x_c + row_rec.ctu.i_ctu_w_c - 1, row_rec.ctu.p_frec[2] + row_rec.ctu.i_ctu_w_c - 1,\r\n                                  row_rec.ctu.i_frec[1], row_rec.ctu.i_ctu_w_c, row_rec.ctu.i_ctu_h_c);\r\n\r\n        /* backup bottom row pixels */\r\n        if (i_lcu_y < h->i_height_in_lcu - 1) {\r\n            memcpy(h->intra_border[0] + row_rec.ctu.i_pix_x  , row_rec.ctu.p_frec[0] + (row_rec.ctu.i_ctu_h   - 1) * h->fdec->i_stride[0], row_rec.ctu.i_ctu_w   * sizeof(pel_t));\r\n            
memcpy(h->intra_border[1] + row_rec.ctu.i_pix_x_c, row_rec.ctu.p_frec[1] + (row_rec.ctu.i_ctu_h_c - 1) * h->fdec->i_stride[1], row_rec.ctu.i_ctu_w_c * sizeof(pel_t));\r\n            memcpy(h->intra_border[2] + row_rec.ctu.i_pix_x_c, row_rec.ctu.p_frec[2] + (row_rec.ctu.i_ctu_h_c - 1) * h->fdec->i_stride[1], row_rec.ctu.i_ctu_w_c * sizeof(pel_t));\r\n        }\r\n\r\n        /* decode LCU level data after one LCU\r\n         * update the bit position */\r\n        h->b_slice_checked = (bool_t)aec_startcode_follows(&h->aec, 1);\r\n        bs->i_bit_pos      = aec_bits_read(&h->aec);\r\n\r\n        /* deblock one lcu */\r\n        if (h->b_loop_filter) {\r\n            davs2_lcu_deblock(h, h->fdec, i_lcu_x, i_lcu_y);\r\n        }\r\n    }\r\n\r\n    if (h->decoding_error != 0) {\r\n        \r\n    } else {\r\n        /* SAO current lcu-row */\r\n        if (h->b_sao) {\r\n            sao_lcurow(h, h->p_frame_sao, h->fdec, i_lcu_y);\r\n        }\r\n\r\n        /* ALF current lcu-row */\r\n        if (alf_enable) {\r\n            alf_lcurow(h, h->p_alf->img_param, h->p_frame_alf, h->fdec, i_lcu_y);\r\n        }\r\n    }\r\n\r\n    /* save motion vectors for reference frame */\r\n    if (h->rps.refered_by_others && h->i_frame_type != AVS2_I_SLICE) {\r\n        save_mv_ref_info(h, i_lcu_y);\r\n    }\r\n\r\n    /* frame padding : line by line */\r\n    if (h->rps.refered_by_others) {\r\n        pad_line_lcu(h, i_lcu_y);\r\n\r\n        /* wake up all waiting threads */\r\n        decoder_signal(h, h->fdec, i_lcu_y);\r\n    }\r\n\r\n    if (i_lcu_y == height_in_lcu - 1) {\r\n\r\n        /* init for AVS-S */\r\n        if ((h->i_frame_type == AVS2_P_SLICE || h->i_frame_type == AVS2_F_SLICE) && h->b_bkgnd_picture && h->b_bkgnd_reference) {\r\n            const int w_in_spu = h->i_width_in_spu;\r\n            const int h_in_spu = h->i_height_in_spu;\r\n            int x, y;\r\n\r\n            for (y = 0; y < h_in_spu; y++) {\r\n                for (x = 0; x < w_in_spu; x++) 
{\r\n                    int refframe = h->p_ref_idx[y * w_in_spu + x].r[0];\r\n                    if (refframe == h->num_of_references - 1) {\r\n                        h->p_ref_idx[y * w_in_spu + x].r[0] = INVALID_REF;\r\n                    }\r\n                }\r\n            }\r\n        }\r\n\r\n        task_output_decoding_frame(h);\r\n        task_release_frames(h);\r\n        /* task is free */\r\n        task_unload_packet(h, h->task_info.curr_es_unit);\r\n\r\n        // davs2_thread_mutex_lock(&h->task_info.taskmgr->mutex_aec);\r\n        // h->task_info.taskmgr->num_active_decoders--;\r\n        // davs2_thread_mutex_unlock(&h->task_info.taskmgr->mutex_aec);\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n// #if CTRL_AEC_THREAD\r\n/* ---------------------------------------------------------------------------\r\n * decodes one LCU row\r\n */\r\nstatic int decode_one_lcu_row_parse(davs2_t *h, davs2_bs_t *bs, int i_lcu_y)\r\n{\r\n    const int width_in_lcu = h->i_width_in_lcu;\r\n    int lcu_xy             = i_lcu_y * width_in_lcu;\r\n    int i_lcu_x;\r\n    int i;\r\n\r\n    /* loop over all LCUs in current LCU row ------------------------\r\n     */\r\n    for (i_lcu_x = 0; i_lcu_x < width_in_lcu; i_lcu_x++, lcu_xy++) {\r\n        if (check_slice_header(h, bs, i_lcu_y) < 0) {\r\n            return -1;\r\n        }\r\n\r\n#if AVS2_TRACE\r\n        avs2_trace(\"\\n*********** Pic: %i (I/P) MB: %i Slice: %i Type %d **********\\n\", h->i_poc, h->lcu.i_scu_xy, h->i_slice_index, h->i_frame_type);\r\n#endif\r\n        h->lcu.lcu_aec = &h->lcu_infos[lcu_xy];\r\n        decode_lcu_init(h, i_lcu_x, i_lcu_y);\r\n\r\n        /* decode LCU level data before one LCU */\r\n        if (h->b_sao) {\r\n            sao_read_lcu_param(h, lcu_xy, h->slice_sao_on, &h->lcu.lcu_aec->sao_param);\r\n        }\r\n\r\n        if (h->b_alf) {\r\n            for (i = 0; i < IMG_COMPONENTS; i++) {\r\n                if (h->pic_alf_on[i]) {\r\n                    
h->lcu.lcu_aec->enable_alf[i] = (uint8_t)aec_read_alf_lcu_ctrl(&h->aec);\r\n                } else {\r\n                    h->lcu.lcu_aec->enable_alf[i] = FALSE;\r\n                }\r\n            }\r\n        }\r\n\r\n        /* decode one lcu */\r\n        decode_lcu_parse(h, h->i_lcu_level, h->lcu.i_pix_x, h->lcu.i_pix_y);\r\n\r\n        /* decode LCU level data after one LCU\r\n         * update the bit position */\r\n        h->b_slice_checked = (bool_t)aec_startcode_follows(&h->aec, 1);\r\n        bs->i_bit_pos      = aec_bits_read(&h->aec);\r\n\r\n        h->fdec->i_parsed_lcu_xy = lcu_xy;\r\n        davs2_thread_cond_broadcast(&h->fdec->cond_aec);\r\n    }\r\n\r\n    /* save motion vectors for reference frame */\r\n    if (h->rps.refered_by_others && h->i_frame_type != AVS2_I_SLICE) {\r\n        save_mv_ref_info(h, i_lcu_y);\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * decodes one LCU row\r\n */\r\nstatic int decode_lcu_row_recon(davs2_t *h, int i_lcu_y)\r\n{\r\n    const int width_in_lcu  = h->i_width_in_lcu;\r\n    const int height_in_lcu = h->i_height_in_lcu;\r\n    int alf_enable          = h->pic_alf_on[0] | h->pic_alf_on[1] | h->pic_alf_on[2];\r\n    int i_lcu_level         = h->i_lcu_level;\r\n    int lcu_xy              = i_lcu_y * h->i_width_in_lcu;\r\n    int b_recon_finish      = 0;\r\n    int b_next_row_launched = 0;\r\n    davs2_row_rec_t row_rec;\r\n\r\n    while (i_lcu_y < height_in_lcu) {\r\n        /* loop over all LCUs in current LCU row ------------------------\r\n         */\r\n        int i_lcu_x;\r\n        for (i_lcu_x = 0; i_lcu_x < width_in_lcu; i_lcu_x++, lcu_xy++) {\r\n            /* wait until the parsing process of current LCU having finished */\r\n            wait_lcu_row_parsed(h, h->fdec, lcu_xy);\r\n\r\n            if (i_lcu_y > 0) {\r\n                wait_lcu_row_reconed(h, h->fdec, i_lcu_y - 1, DAVS2_MIN(i_lcu_x + 2, 
h->i_width_in_lcu));\r\n            }\r\n            row_rec.lcu_info = &h->lcu_infos[lcu_xy];\r\n#if CTRL_AEC_THREAD\r\n            row_rec.p_rec_info = &row_rec.lcu_info->rec_info;\r\n#endif\r\n            rowrec_lcu_init(h, &row_rec, i_lcu_x, i_lcu_y);\r\n\r\n            /* cache CTU top border for intra prediction */\r\n            if (i_lcu_x == 0) {\r\n                memcpy(row_rec.ctu_border[0].rec_top + 1, h->intra_border[0], row_rec.ctu.i_ctu_w * 2 * sizeof(pel_t));\r\n                memcpy(row_rec.ctu_border[1].rec_top + 1, h->intra_border[1], row_rec.ctu.i_ctu_w * sizeof(pel_t));\r\n                memcpy(row_rec.ctu_border[2].rec_top + 1, h->intra_border[2], row_rec.ctu.i_ctu_w * sizeof(pel_t));\r\n            }\r\n\r\n            decode_lcu_recon(h, &row_rec, i_lcu_level, i_lcu_x << i_lcu_level, i_lcu_y << i_lcu_level);\r\n\r\n            rowrec_store_lcu_recon_samples(&row_rec);\r\n            /* cache top and left samples for intra prediction of next CTU */\r\n            davs2_cache_lcu_border(row_rec.ctu_border[0].rec_top, h->intra_border[0] + row_rec.ctu.i_pix_x + row_rec.ctu.i_ctu_w - 1,\r\n                                   row_rec.ctu.p_frec[0] + row_rec.ctu.i_ctu_w - 1,\r\n                                   row_rec.ctu.i_frec[0], row_rec.ctu.i_ctu_w, row_rec.ctu.i_ctu_h);\r\n            davs2_cache_lcu_border_uv(row_rec.ctu_border[1].rec_top, h->intra_border[1] + row_rec.ctu.i_pix_x_c + row_rec.ctu.i_ctu_w_c - 1, row_rec.ctu.p_frec[1] + row_rec.ctu.i_ctu_w_c - 1,\r\n                                      row_rec.ctu_border[2].rec_top, h->intra_border[2] + row_rec.ctu.i_pix_x_c + row_rec.ctu.i_ctu_w_c - 1, row_rec.ctu.p_frec[2] + row_rec.ctu.i_ctu_w_c - 1,\r\n                                      row_rec.ctu.i_frec[1], row_rec.ctu.i_ctu_w_c, row_rec.ctu.i_ctu_h_c);\r\n\r\n            /* backup bottom row pixels */\r\n            if (i_lcu_y < h->i_height_in_lcu - 1) {\r\n                memcpy(h->intra_border[0] + row_rec.ctu.i_pix_x, 
row_rec.ctu.p_frec[0] + (row_rec.ctu.i_ctu_h - 1) * h->fdec->i_stride[0], row_rec.ctu.i_ctu_w   * sizeof(pel_t));\r\n                memcpy(h->intra_border[1] + row_rec.ctu.i_pix_x_c, row_rec.ctu.p_frec[1] + (row_rec.ctu.i_ctu_h_c - 1) * h->fdec->i_stride[1], row_rec.ctu.i_ctu_w_c * sizeof(pel_t));\r\n                memcpy(h->intra_border[2] + row_rec.ctu.i_pix_x_c, row_rec.ctu.p_frec[2] + (row_rec.ctu.i_ctu_h_c - 1) * h->fdec->i_stride[1], row_rec.ctu.i_ctu_w_c * sizeof(pel_t));\r\n            }\r\n\r\n            /* deblock one lcu */\r\n            if (h->b_loop_filter) {\r\n                davs2_lcu_deblock(h, h->fdec, i_lcu_x, i_lcu_y);\r\n            }\r\n\r\n            h->fdec->num_decoded_lcu_in_row[i_lcu_y]++;\r\n        }\r\n\r\n\r\n        /* SAO above lcu-row */\r\n        if (h->b_sao && i_lcu_y) {\r\n            sao_lcurow(h, h->p_frame_sao, h->fdec, i_lcu_y - 1);  // above row\r\n\r\n            if (i_lcu_y == height_in_lcu - 1) {\r\n                sao_lcurow(h, h->p_frame_sao, h->fdec, i_lcu_y);  // last row\r\n            }\r\n        }\r\n\r\n        /* ALF above lcu-row */\r\n        if (alf_enable && i_lcu_y) {\r\n            alf_lcurow(h, h->p_alf->img_param, h->p_frame_alf, h->fdec, i_lcu_y - 1);  // above row\r\n            if (i_lcu_y == height_in_lcu - 1) {\r\n                alf_lcurow(h, h->p_alf->img_param, h->p_frame_alf, h->fdec, i_lcu_y);  // last row\r\n            }\r\n        }\r\n\r\n        if (i_lcu_y > 0) {\r\n            /* frame padding : line by line */\r\n            if (h->rps.refered_by_others) {\r\n                pad_line_lcu(h, i_lcu_y - 1);\r\n            }\r\n            /* wake up all waiting threads */\r\n            decoder_signal(h, h->fdec, i_lcu_y - 1);\r\n        }\r\n\r\n        /* The last row in one frame */\r\n        if (i_lcu_y == height_in_lcu - 1) {\r\n            b_recon_finish = 1;\r\n        }\r\n\r\n        /* TODO: loop to next LCU row */\r\n        if (b_next_row_launched) {\r\n            
break;\r\n        }\r\n        i_lcu_y++;\r\n    }\r\n\r\n    /* the bottom LCU row in a frame */\r\n    if (b_recon_finish) {\r\n        if (h->rps.refered_by_others) {\r\n            pad_line_lcu(h, h->i_height_in_lcu - 1);\r\n        }\r\n\r\n        decoder_signal(h, h->fdec, h->i_height_in_lcu - 1);\r\n        /* init for AVS-S */\r\n        if ((h->i_frame_type == AVS2_P_SLICE || h->i_frame_type == AVS2_F_SLICE) && h->b_bkgnd_picture && h->b_bkgnd_reference) {\r\n            const int w_in_spu = h->i_width_in_spu;\r\n            const int h_in_spu = h->i_height_in_spu;\r\n            int x, y;\r\n\r\n            for (y = 0; y < h_in_spu; y++) {\r\n                for (x = 0; x < w_in_spu; x++) {\r\n                    int refframe = h->p_ref_idx[y * w_in_spu + x].r[0];\r\n                    if (refframe == h->num_of_references - 1) {\r\n                        h->p_ref_idx[y * w_in_spu + x].r[0] = INVALID_REF;\r\n                    }\r\n                }\r\n            }\r\n        }\r\n\r\n        // davs2_log(h, DAVS2_LOG_INFO, \"POC %3d reconstruction finished.\", h->i_poc);\r\n        if (h->i_frame_type == AVS2_G_SLICE) {\r\n            davs2_frame_copy_planes(h->f_background_ref, h->fdec);\r\n        }\r\n\r\n        task_output_decoding_frame(h);\r\n        task_release_frames(h);\r\n        /* task is free */\r\n        task_unload_packet(h, h->task_info.curr_es_unit);\r\n    }\r\n\r\n    return 0;\r\n}\r\n// #endif  // #if CTRL_AEC_THREAD\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void decode_user_data(davs2_t *h, davs2_bs_t *bs)\r\n{\r\n    int      bytes = bs->i_bit_pos >> 3;\r\n    int      left  = bs->i_stream - bytes;\r\n    uint8_t *data  = bs->p_stream + bytes;\r\n\r\n    while (left >= 4) {\r\n        if (data[0] == 0 && data[1] == 0 && data[2] == 1) {\r\n            if (data[3] == SC_USER_DATA) {\r\n                /* user data */\r\n            } else if (data[3] <= 
SC_SLICE_CODE_MAX) {\r\n                /* slice */\r\n                h->b_slice_checked = 1;\r\n                break;\r\n            }\r\n\r\n            data += 4;\r\n            left -= 4;\r\n        } else {\r\n            ++data;\r\n            --left;\r\n        }\r\n    }\r\n\r\n    if (left >= 4) {\r\n        bs->i_bit_pos = (int)((data - bs->p_stream) << 3);\r\n    }\r\n}\r\n\r\n/**\r\n * ===========================================================================\r\n * interface function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid decoder_free_extra_buffer(davs2_t *h)\r\n{\r\n    if (h->f_background_ref) {\r\n        davs2_frame_destroy(h->f_background_ref);\r\n        h->f_background_ref = NULL;\r\n    }\r\n\r\n    if (h->f_background_cur) {\r\n        davs2_frame_destroy(h->f_background_cur);\r\n        h->f_background_cur = NULL;\r\n    }\r\n\r\n    if (h->p_frame_alf) {\r\n        davs2_frame_destroy(h->p_frame_alf);\r\n        h->p_frame_alf = NULL;\r\n    }\r\n\r\n    if (h->p_frame_sao) {\r\n        davs2_frame_destroy(h->p_frame_sao);\r\n        h->p_frame_sao = NULL;\r\n    }\r\n\r\n    if (h->p_integral) {\r\n        davs2_free(h->p_integral);\r\n        h->p_integral = NULL;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * alloc extra buffers for the decoder according to the image width & height\r\n */\r\nint decoder_alloc_extra_buffer(davs2_t *h)\r\n{\r\n    size_t w_in_spu = h->i_width_in_spu;\r\n    size_t h_in_spu = h->i_height_in_spu;\r\n    size_t w_in_scu = h->i_width_in_scu;\r\n    size_t h_in_scu = h->i_height_in_scu;\r\n    size_t size_in_spu = w_in_spu * h_in_spu;\r\n    size_t size_in_lcu = ((h->i_width + h->i_lcu_size_sub1) >> h->i_lcu_level) * ((h->i_height + h->i_lcu_size_sub1) >> h->i_lcu_level);\r\n    size_t size_alf = 
alf_get_buffer_size(h);\r\n    size_t size_extra_frame = 0;\r\n    size_t mem_size;\r\n\r\n    uint8_t *mem_base;\r\n\r\n    assert((h->i_width  & 7) == 0);\r\n    assert((h->i_height & 7) == 0);\r\n    size_extra_frame = 2 * davs2_frame_get_size(h->i_width, h->i_height, h->i_chroma_format, 1);\r\n    size_extra_frame += (h->b_alf + h->b_sao) * davs2_frame_get_size(h->i_width, h->i_height, h->i_chroma_format, 0);\r\n\r\n    mem_size = sizeof(int8_t)     * (w_in_spu + 16) * (h_in_spu + 1) + /* M1, size of intra prediction mode buffer */\r\n               sizeof(int8_t)     * size_in_spu                      + /* M3, size of prediction direction buffer */\r\n               sizeof(ref_idx_t)  * size_in_spu                      + /* M3, size of reference index (1st+2nd) buffer */\r\n               sizeof(mv_t)       * size_in_spu                      + /* M5, size of motion vector of 4x4 block (1st reference) buffer */\r\n               sizeof(mv_t)       * size_in_spu                      + /* M6, size of motion vector of 4x4 block (2nd reference) buffer */\r\n               sizeof(uint8_t)    * w_in_scu * h_in_scu * 2          + /* M7, size of loop filter flag buffer */\r\n               sizeof(lcu_info_t) * size_in_lcu                      + /* M8, size of SAO block parameter buffer */\r\n               sizeof(cu_t)       * h->i_size_in_scu                 + /* M10, size of cu_t */\r\n               sizeof(pel_t)      * h->i_width * 3                   + /* M13, size of last LCU row bottom border */\r\n               size_alf                                              + /* M11, size of ALF */\r\n               size_extra_frame                                      + /* M12, size of extra frame */\r\n               CACHE_LINE_SIZE * 20;\r\n\r\n    /* allocate memory for a decoder */\r\n    CHECKED_MALLOC(mem_base, uint8_t *, mem_size);\r\n    h->p_integral = mem_base;   /* pointer which holds the extra buffer */\r\n\r\n    /* M1, intra prediction mode buffer */\r\n  
  h->p_ipredmode  = (int8_t *)mem_base;\r\n    mem_base       += sizeof(int8_t) * (w_in_spu + 16) * (h_in_spu + 1);\r\n    h->p_ipredmode += (w_in_spu + 16) + 16;\r\n    h->i_ipredmode  = ((int)w_in_spu + 16);\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    /* M3, prediction direction buffer */\r\n    h->p_dirpred = (int8_t *)mem_base;\r\n    mem_base += sizeof(int8_t) * size_in_spu;\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    /* M3, reference index (1st+2nd) buffer */\r\n    h->p_ref_idx = (ref_idx_t *)mem_base;\r\n    mem_base    += sizeof(ref_idx_t) * size_in_spu;\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    /* M5, motion vector of 4x4 block (1st reference) buffer */\r\n    h->p_tmv_1st = (mv_t *)mem_base;\r\n    mem_base    += sizeof(mv_t) * size_in_spu;\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    /* M6, motion vector of 4x4 block (2nd reference) buffer */\r\n    h->p_tmv_2nd = (mv_t *)mem_base;\r\n    mem_base    += sizeof(mv_t) * size_in_spu;\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    /* M7, loop filter flag buffer */\r\n    h->p_deblock_flag[0] = (uint8_t *)mem_base;\r\n    mem_base            += sizeof(uint8_t) * w_in_scu * h_in_scu;\r\n    h->p_deblock_flag[1] = (uint8_t *)mem_base;\r\n    mem_base            += sizeof(uint8_t) * w_in_scu * h_in_scu;\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    /* M8, LCU level parameter buffer */\r\n    h->lcu_infos    = (lcu_info_t *)mem_base;\r\n    mem_base       += sizeof(lcu_info_t) * size_in_lcu;\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    /* allocate memory for scu_data */\r\n    h->scu_data     = (cu_t *)mem_base;\r\n    mem_base       += h->i_size_in_scu * sizeof(cu_t);\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    /* LCU bottom border */\r\n    h->intra_border[0] = (pel_t *)mem_base;\r\n    mem_base += h->i_width * sizeof(pel_t);\r\n    ALIGN_POINTER(mem_base);\r\n    h->intra_border[1] = (pel_t *)mem_base;\r\n    mem_base += h->i_width * sizeof(pel_t);\r\n    ALIGN_POINTER(mem_base);\r\n    h->intra_border[2] = (pel_t 
*)mem_base;\r\n    mem_base += h->i_width * sizeof(pel_t);\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    /* ALF */\r\n    h->p_alf        = (alf_var_t *)mem_base;\r\n    mem_base       += size_alf;\r\n    ALIGN_POINTER(mem_base);\r\n    alf_init_buffer(h);\r\n\r\n    /* -------------------------------------------------------------\r\n     * allocate frame buffers */\r\n\r\n    // AVS-S\r\n    h->f_background_ref = davs2_frame_new(h->i_width, h->i_height, h->i_chroma_format, &mem_base, 1);\r\n    ALIGN_POINTER(mem_base);\r\n    h->f_background_cur = davs2_frame_new(h->i_width, h->i_height, h->i_chroma_format, &mem_base, 1);\r\n    ALIGN_POINTER(mem_base);\r\n\r\n    // ALF\r\n    if (h->b_alf) {\r\n        h->p_frame_alf = davs2_frame_new(h->i_width, h->i_height, h->i_chroma_format, &mem_base, 0);\r\n        ALIGN_POINTER(mem_base);\r\n    }\r\n\r\n    // SAO\r\n    if (h->b_sao) {\r\n        h->p_frame_sao = davs2_frame_new(h->i_width, h->i_height, h->i_chroma_format, &mem_base, 0);\r\n        ALIGN_POINTER(mem_base);\r\n    }\r\n\r\n    if ((int)mem_size < (mem_base - h->p_integral)) {\r\n        davs2_log(h, DAVS2_LOG_ERROR, \"Not enough memory allocated. mem_size %llu < %llu\\n\",\r\n                   (unsigned long long)mem_size, (unsigned long long)(mem_base - h->p_integral));\r\n        goto fail;\r\n    }\r\n    return 0;\r\n\r\nfail:\r\n\r\n    decoder_free_extra_buffer(h);\r\n\r\n    return -1;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * write a frame to output picture\r\n */\r\nvoid davs2_write_a_frame(davs2_picture_t *pic, davs2_frame_t *frame)\r\n{\r\n    int img_width    = pic->widths[0];\r\n    int img_height   = pic->lines[0];\r\n    int img_width_c  = (img_width / 2);\r\n    int img_height_c = (img_height / (frame->i_chroma_format == CHROMA_420 ? 2 : 1));\r\n    int num_bytes_per_sample = (frame->i_output_bit_depth == 8 ? 
1 : 2);\r\n    int shift1       = frame->i_sample_bit_depth - frame->i_output_bit_depth; // assuming \"sample_bit_depth\" is greater or equal to \"output_bit_depth\"\r\n    pel_t *p_src;\r\n    uint8_t *p_dst;\r\n    int k, j, i_src, i_dst;\r\n\r\n    pic->num_planes       = (frame->i_chroma_format != CHROMA_400) ? 3 : 1;\r\n    pic->bytes_per_sample = num_bytes_per_sample;\r\n    pic->bit_depth        = frame->i_output_bit_depth;\r\n    pic->b_decode_error   = frame->frm_decode_error;\r\n    pic->dec_frame        = NULL;\r\n    pic->strides[0] = pic->widths[0] * num_bytes_per_sample;\r\n    pic->strides[1] = pic->widths[1] * num_bytes_per_sample;\r\n    pic->strides[2] = pic->widths[2] * num_bytes_per_sample;\r\n\r\n    if (!shift1 && sizeof(pel_t) == num_bytes_per_sample) {\r\n        pic->dec_frame = frame;\r\n        // TODO: ¸ֵǰָҪʵʱ򣨽֧ʱָ\r\n        pic->planes[0]  = frame->planes[0];\r\n        pic->planes[1]  = frame->planes[1];\r\n        pic->planes[2]  = frame->planes[2];\r\n        pic->strides[0] = frame->i_stride[0] * num_bytes_per_sample;\r\n        pic->strides[1] = frame->i_stride[1] * num_bytes_per_sample;\r\n        pic->strides[2] = frame->i_stride[2] * num_bytes_per_sample;\r\n    } else if (!shift1 && frame->i_output_bit_depth == 8) { // 8bit encode -> 8bit output\r\n        p_dst = pic->planes[0];\r\n        i_dst = pic->strides[0];\r\n        p_src = frame->planes[0];\r\n        i_src = frame->i_stride[0];\r\n\r\n        for (j = 0; j < img_height; j++) {\r\n            for (k = 0; k < img_width; k++) {\r\n                p_dst[k] = (uint8_t)p_src[k];\r\n            }\r\n\r\n            p_src += i_src;\r\n            p_dst += i_dst;\r\n        }\r\n\r\n        if (pic->num_planes == 3) {\r\n            p_dst = pic->planes[1];\r\n            i_dst = pic->strides[1];\r\n            p_src = frame->planes[1];\r\n            i_src = frame->i_stride[1];\r\n\r\n            for (j = 0; j < img_height_c; j++) {\r\n                for (k = 0; k < 
img_width_c; k++) {\r\n                    p_dst[k] = (uint8_t)p_src[k];\r\n                }\r\n\r\n                p_src += i_src;\r\n                p_dst += i_dst;\r\n            }\r\n\r\n            p_dst = pic->planes[2];\r\n            i_dst = pic->strides[2];\r\n            p_src = frame->planes[2];\r\n            i_src = frame->i_stride[2];\r\n\r\n            for (j = 0; j < img_height_c; j++) {\r\n                for (k = 0; k < img_width_c; k++) {\r\n                    p_dst[k] = (uint8_t)p_src[k];\r\n                }\r\n\r\n                p_src += i_src;\r\n                p_dst += i_dst;\r\n            }\r\n        }\r\n    } else if (shift1 && frame->i_output_bit_depth == 8) { // 10bit encode -> 8bit output\r\n        p_dst = pic->planes[0];\r\n        i_dst = pic->strides[0];\r\n        p_src = frame->planes[0];\r\n        i_src = frame->i_stride[0];\r\n\r\n        for (j = 0; j < img_height; j++) {\r\n            for (k = 0; k < img_width; k++) {\r\n                p_dst[k] = (uint8_t)DAVS2_CLIP1((p_src[k] + (1 << (shift1 - 1))) >> shift1);\r\n            }\r\n\r\n            p_src += i_src;\r\n            p_dst += i_dst;\r\n        }\r\n\r\n        if (pic->num_planes == 3) {\r\n            p_dst = pic->planes[1];\r\n            i_dst = pic->strides[1];\r\n            p_src = frame->planes[1];\r\n            i_src = frame->i_stride[1];\r\n\r\n            for (j = 0; j < img_height_c; j++) {\r\n                for (k = 0; k < img_width_c; k++) {\r\n                    p_dst[k] = (uint8_t)DAVS2_CLIP1((p_src[k] + (1 << (shift1 - 1))) >> shift1);\r\n                }\r\n\r\n                p_src += i_src;\r\n                p_dst += i_dst;\r\n            }\r\n\r\n            p_dst = pic->planes[2];\r\n            i_dst = pic->strides[2];\r\n            p_src = frame->planes[2];\r\n            i_src = frame->i_stride[2];\r\n\r\n            for (j = 0; j < img_height_c; j++) {\r\n                for (k = 0; k < img_width_c; k++) {\r\n                  
  p_dst[k] = (uint8_t)DAVS2_CLIP1((p_src[k] + (1 << (shift1 - 1))) >> shift1);\r\n                }\r\n\r\n                p_src += i_src;\r\n                p_dst += i_dst;\r\n            }\r\n        }\r\n    }\r\n\r\n    pic->type            = frame->i_type;\r\n    pic->qp              = frame->i_qp;\r\n    pic->pts             = frame->i_pts;\r\n    pic->dts             = frame->i_dts;\r\n    pic->pic_order_count = frame->i_poc;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\ndavs2_t *decoder_open(davs2_mgr_t *mgr, davs2_t *h, int idx_decoder)\r\n{\r\n    /* allocate memory for a decoder */\r\n    memset(h, 0, sizeof(davs2_t));\r\n\r\n    /* init log module */\r\n    h->module_log.i_log_level = mgr->param.info_level;\r\n    sprintf(h->module_log.module_name, \"Dec[%2d] %06llx\", idx_decoder,(long long unsigned int) h);\r\n\r\n    /* only initialize some variables, not ready to work */\r\n    h->task_info.taskmgr = mgr;\r\n    h->i_width           = -1;\r\n    h->i_height          = -1;\r\n    h->i_frame_type      = AVS2_I_SLICE;\r\n    h->num_of_references = 0;\r\n    h->b_video_edit_code = 0;\r\n\r\n#if AVS2_TRACE\r\n    if (avs2_trace_init(h, TRACEFILE) == -1) {  // append new statistic at the end\r\n        davs2_log(h, DAVS2_LOG_ERROR, \"Error open trace file!\");\r\n    }\r\n#endif\r\n\r\n    return h;\r\n}\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : decode one frame\r\n * Parameters :\r\n *       [in] : h       - pointer to struct davs2_t (decoder handler)\r\n *            : es_unit - pointer to bit-stream buffer (including the following parameters)\r\n *            :    data - pointer to bitstream buffer\r\n *            :    len  - data length in bitstream buffer\r\n *            :    pts  - user pts\r\n *            :    dts  - user dts\r\n * Return     : none\r\n * 
---------------------------------------------------------------------------\r\n */\r\nvoid *decoder_decode_picture_data(void *arg1, int arg2)\r\n{\r\n    davs2_t *h     = (davs2_t *)arg1;\r\n    davs2_bs_t *bs = h->p_bs;\r\n\r\n    UNUSED_PARAMETER(arg2);\r\n    /* decode one frame */\r\n    init_frame(h);\r\n    /* user data and slice header */\r\n    decode_user_data(h, bs);\r\n\r\n    /* decode picture data */\r\n    if (h->b_slice_checked != 0) {\r\n        davs2_frame_t *frame = h->fref[0];\r\n        davs2_mgr_t *mgr = h->task_info.taskmgr;\r\n        const int height_in_lcu = h->i_height_in_lcu;\r\n        int lcu_y;\r\n\r\n        /* reset LCU decoding status */\r\n        memset(h->fdec->num_decoded_lcu_in_row, 0, sizeof(int) * h->i_height_in_lcu);\r\n\r\n        // davs2_thread_mutex_lock(&mgr->mutex_aec);\r\n        // mgr->num_active_decoders++;\r\n        // davs2_thread_mutex_unlock(&mgr->mutex_aec);\r\n\r\n        if (mgr->num_rec_thread && davs2_threadpool_is_free((davs2_threadpool_t *)mgr->thread_pool)) {\r\n            /* make sure all its dependency frames have started reconstruction */\r\n            int i;\r\n            for (i = 0; i < h->num_of_references; i++) {\r\n                davs2_frame_t *frm = h->fref[i];\r\n                decoder_wait_lcu_row(h, frm, 0);\r\n            }\r\n\r\n\r\n            /* run reconstruction thread */\r\n            davs2_threadpool_run((davs2_threadpool_t *)mgr->thread_pool,\r\n                                 (davs2_threadpool_func_t)decode_lcu_row_recon, h, 0,\r\n                                 0);\r\n            /* -------------------------------------------------------------\r\n             * parse all LCU rows\r\n             */\r\n            for (lcu_y = 0; lcu_y < height_in_lcu; lcu_y++) {\r\n                /* TODO: remove the dependency in this thread */\r\n                if (frame != NULL) {\r\n                    decoder_wait_lcu_row(h, frame, lcu_y);\r\n                }\r\n\r\n               
 /* parsing the LCU data */\r\n                decode_one_lcu_row_parse(h, bs, lcu_y);\r\n            }\r\n        } else {\r\n            /* -------------------------------------------------------------\r\n             * decode all LCU rows\r\n             */\r\n            for (lcu_y = 0; lcu_y < height_in_lcu; lcu_y++) {\r\n                if (frame != NULL) {\r\n                    decoder_wait_lcu_row(h, frame, lcu_y);\r\n                }\r\n\r\n                /* decode one lcu row */\r\n                decode_one_lcu_row(h, bs, lcu_y);\r\n            }\r\n        }\r\n    } else {\r\n        ///!!! make sure that all row signals of frames with 'b_refered_by_others == 1' have been set before return.\r\n        /// use 'goto fail' instead of 'return' in the half way.\r\n        if (h->rps.refered_by_others) {\r\n            // set all row signals before returning.\r\n            int lcu_y;\r\n            for (lcu_y = 0; lcu_y < h->i_height_in_lcu; ++lcu_y) {\r\n                decoder_signal(h, h->fdec, lcu_y);\r\n            }\r\n        }\r\n\r\n        if (h->i_frame_type == AVS2_G_SLICE) {\r\n            davs2_frame_copy_planes(h->f_background_ref, h->fdec);\r\n        }\r\n\r\n        /* task is free */\r\n        task_unload_packet(h, h->task_info.curr_es_unit);\r\n    }\r\n\r\n    return NULL;\r\n}\r\n\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : close the AVS2 decoder\r\n * Parameters :\r\n *       [in] : h - pointer to struct davs2_t, the decoder handle\r\n * Return     : none\r\n * ---------------------------------------------------------------------------\r\n */\r\nvoid decoder_close(davs2_t *h)\r\n{\r\n    /* free extra buffer */\r\n    decoder_free_extra_buffer(h);\r\n\r\n#if AVS2_TRACE\r\n    /* destroy the trace */\r\n    avs2_trace_destroy();\r\n#endif\r\n}\r\n"
  },
  {
    "path": "source/common/decoder.h",
    "content": "/*\r\n * decoder.h\r\n *\r\n * Description of this file:\r\n *    Decoder functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_DECODER_H\r\n#define DAVS2_DECODER_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#include \"common.h\"\r\n\r\n#define decoder_open FPFX(decoder_decoder_open)\r\ndavs2_t *decoder_open(davs2_mgr_t *mgr, davs2_t *h, int idx_decoder);\r\n#define decoder_decode_picture_data FPFX(decoder_decode_picture_data)\r\nvoid *decoder_decode_picture_data(void *arg1, int arg2);\r\n#define decoder_close FPFX(decoder_decoder_close)\r\nvoid decoder_close(davs2_t *h);\r\n#define create_freepictures FPFX(create_freepictures)\r\nint  create_freepictures(davs2_mgr_t *mgr, int w, int h, int 
size);\r\n#define destroy_freepictures FPFX(destroy_freepictures)\r\nvoid destroy_freepictures(davs2_mgr_t *mgr);\r\n#define decoder_alloc_extra_buffer FPFX(decoder_alloc_extra_buffer)\r\nint  decoder_alloc_extra_buffer(davs2_t *h);\r\n#define decoder_free_extra_buffer FPFX(decoder_free_extra_buffer)\r\nvoid decoder_free_extra_buffer(davs2_t *h);\r\n#define davs2_write_a_frame FPFX(write_a_frame)\r\nvoid davs2_write_a_frame(davs2_picture_t *pic, davs2_frame_t *frame);\r\n\r\n#define task_get_references FPFX(task_get_references)\r\nint  task_get_references(davs2_t *h, int64_t pts, int64_t dts);\r\n\r\n#define task_unload_packet FPFX(task_unload_packet)\r\nvoid task_unload_packet(davs2_t *h, es_unit_t *es_unit);\r\n#define decoder_get_output FPFX(decoder_get_output)\r\nint decoder_get_output(davs2_mgr_t *mgr, davs2_seq_info_t *headerset, davs2_picture_t *out_frame, int is_flush);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_DECODER_H\r\n"
  },
  {
    "path": "source/common/defines.h",
    "content": "/*\r\n * defines.h\r\n *\r\n * Description of this file:\r\n *    const variable definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_DEFINES_H\r\n#define DAVS2_DEFINES_H\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * build switch\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * build */\r\n#define RELEASE_BUILD           1     /* 1: release build */\r\n\r\n#define CTRL_AEC_THREAD         0     /* AEC and reconstruct conducted in different threads */\r\n#define CTRL_AEC_CONVERSION     0     /* AEC result conversion */\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n * debug */\r\n#if RELEASE_BUILD\r\n#define AVS2_TRACE              0     /* write trace file,    1: ON, 0: OFF */\r\n#else\r\n#define AVS2_TRACE              0     /* write trace file,    1: ON, 0: OFF */\r\n#endif\r\n\r\n#define DAVS2_TRACE_API        0     /* API calling trace */\r\n\r\n#define USE_NEW_INTPL           0     /* use new interpolation functions */\r\n\r\n#define BUGFIX_PREDICTION_INTRA 1     /* align to latest intra prediction */\r\n\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * define of const variables\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * profile */\r\n#define MAIN_PICTURE_PROFILE    0x12\r\n#define MAIN_PROFILE            0x20\r\n#define MAIN10_PROFILE          0x22\r\n\r\nenum chroma_format_e {\r\n    CHROMA_400 = 0,\r\n    CHROMA_420 = 1,\r\n    CHROMA_422 = 2,\r\n    CHROMA_444 = 3\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * prediction techniques */\r\n#define DMH_MODE_NUM            5     /* number of DMH mode */\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * SAO */\r\n#define MAX_NUM_SAO_CLASSES         32\r\n#define NUM_SAO_BO_CLASSES_LOG2     5\r\n#define NUM_SAO_BO_CLASSES_IN_BIT   5\r\n#define NUM_SAO_EO_TYPES_LOG2       2\r\n#define SAO_SHIFT_PIX_NUM           4\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * ALF parameters */\r\n#define ALF_NUM_VARS            16\r\n#define ALF_MAX_NUM_COEF        9\r\n#define LOG2_VAR_SIZE_H         2\r\n#define LOG2_VAR_SIZE_W         2\r\n#define ALF_FOOTPRINT_SIZE      7\r\n#define DF_CHANGED_SIZE         3\r\n#define ALF_NUM_BIT_SHIFT       6\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n * Quantization parameter range */\r\n#define MIN_QP                  0\r\n#if HIGH_BIT_DEPTH\r\n#define MAX_QP                  79    /* max QP */\r\n#else\r\n#define MAX_QP                  63    /* max QP */\r\n#endif\r\n#define SHIFT_QP                11\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * block sizes */\r\n#define MAX_CU_SIZE             64    /* 64x64 */\r\n#define MAX_CU_SIZE_IN_BIT      6\r\n#define MIN_CU_SIZE             8     /* 8x8 */\r\n#define MIN_CU_SIZE_IN_BIT      3\r\n#define MIN_PU_SIZE             4     /* 4x4 */\r\n#define MIN_PU_SIZE_IN_BIT      2\r\n#define BLOCK_MULTIPLE          (MIN_CU_SIZE/MIN_PU_SIZE)\r\n\r\n#define B4X4_IN_BIT             2\r\n#define B8X8_IN_BIT             3\r\n#define B16X16_IN_BIT           4\r\n#define B32X32_IN_BIT           5\r\n#define B64X64_IN_BIT           6\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * luma intra prediction modes\r\n */\r\nenum intra_pred_mode_e {\r\n    /* non-angular mode */\r\n    DC_PRED         = 0 ,                /* prediction mode: DC */\r\n    PLANE_PRED      = 1 ,                /* prediction mode: PLANE */\r\n    BI_PRED         = 2 ,                /* prediction mode: BI */\r\n\r\n    /* vertical angular mode */\r\n    INTRA_ANG_X_3   =  3, INTRA_ANG_X_4   =  4, INTRA_ANG_X_5   =  5,\r\n    INTRA_ANG_X_6   =  6, INTRA_ANG_X_7   =  7, INTRA_ANG_X_8   =  8,\r\n    INTRA_ANG_X_9   =  9, INTRA_ANG_X_10  = 10, INTRA_ANG_X_11  = 11,\r\n    INTRA_ANG_X_12  = 12,\r\n    VERT_PRED       = INTRA_ANG_X_12,    /* prediction mode: VERT */\r\n\r\n    /* vertical + horizontal angular mode */\r\n    INTRA_ANG_XY_13 = 13, INTRA_ANG_XY_14 = 14, INTRA_ANG_XY_15 = 15,\r\n    INTRA_ANG_XY_16 = 16, INTRA_ANG_XY_17 = 17, INTRA_ANG_XY_18 = 18,\r\n    INTRA_ANG_XY_19 = 19, INTRA_ANG_XY_20 = 20, INTRA_ANG_XY_21 = 
21,\r\n    INTRA_ANG_XY_22 = 22, INTRA_ANG_XY_23 = 23,\r\n\r\n    /* horizontal angular mode */\r\n    INTRA_ANG_Y_24  = 24, INTRA_ANG_Y_25  = 25, INTRA_ANG_Y_26 = 26,\r\n    INTRA_ANG_Y_27  = 27, INTRA_ANG_Y_28  = 28, INTRA_ANG_Y_29 = 29,\r\n    INTRA_ANG_Y_30  = 30, INTRA_ANG_Y_31  = 31, INTRA_ANG_Y_32 = 32,\r\n    HOR_PRED        = INTRA_ANG_Y_24,    /* prediction mode: HOR */\r\n    NUM_INTRA_MODE  = 33,                /* number of luma intra prediction modes */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * chroma intra prediction modes\r\n */\r\nenum intra_chroma_pred_mode_e {\r\n    /* chroma intra prediction modes */\r\n    DM_PRED_C             = 0,     /* prediction mode: DM */\r\n    DC_PRED_C             = 1,     /* prediction mode: DC */\r\n    HOR_PRED_C            = 2,     /* prediction mode: HOR */\r\n    VERT_PRED_C           = 3,     /* prediction mode: VERT */\r\n    BI_PRED_C             = 4,     /* prediction mode: BI */\r\n    NUM_INTRA_MODE_CHROMA = 5,     /* number of chroma intra prediction modes */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * mv prediction */\r\n#define MVPRED_xy_MIN           0\r\n#define MVPRED_L                1\r\n#define MVPRED_U                2\r\n#define MVPRED_UR               3\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * mv prediction direction */\r\n#define PDIR_FWD                0\r\n#define PDIR_BWD                1\r\n#define PDIR_SYM                2\r\n#define PDIR_BID                3\r\n#define PDIR_DUAL               4\r\n#define PDIR_INVALID           -1     /* invalid prediction direction */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * unification of MV scaling */\r\n#define MULTI                   16384\r\n#define HALF_MULTI              8192\r\n#define OFFSET                  14\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n * motion information storage compression */\r\n#define MV_DECIMATION_FACTOR    4     /* store the middle pixel's mv in a motion information unit */\r\n#define MV_FACTOR_IN_BIT        2\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * for 16-bit transform */\r\n#define LIMIT_BIT               16\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * max value */\r\n#define AVS2_THREAD_MAX        16     /* max number of threads */\r\n#define DAVS2_WORK_MAX        128     /* max number of works (thread queue) */\r\n#define AVS2_MAX_REFS           4     /* max reference frame number */\r\n#define AVS2_GOP_NUM           32     /* max GOP number */\r\n#define AVS2_COI_CYCLE        256     /* COI ranges from [0, 255] */\r\n\r\n#define MAX_POC_DISTANCE      128     /* max POC distance */\r\n#define INVALID_FRAME          -1     /* invalid value for COI & POC */\r\n\r\n#define CG_SIZE                16     /* size of a coefficient group, 4x4 */\r\n\r\n#define TEMPORAL_MAXLEVEL_BIT   3     /* bit number of temporal_id */\r\n#define THRESHOLD_PMVR          2     /* threshold for pmvr */\r\n\r\n#define MAX_ES_FRAME_SIZE 4000000     /* default max es frame size: 4MB */\r\n#define MAX_ES_FRAME_NUM       64     /* default number of es frames */\r\n\r\n#define AVS2_PAD        (64 + 16)     /* number of pixels padded around the reference frame */\r\n\r\n#define DAVS2_MAX_LCU_ROWS   256      /* maximum number of LCU rows of one frame */\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * aec\r\n */\r\n#define SE_CHROMA               1     /* context for read (run, level) */\r\n#define SE_LUMA_8x8             2     /* context for read (run, level) */\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * transform */\r\n#define 
SEC_TR_SIZE             4     /* block size of 2nd transform */\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * CPU flags\r\n */\r\n\r\n/* x86 */\r\n#define DAVS2_CPU_CMOV            0x0000001\r\n#define DAVS2_CPU_MMX             0x0000002\r\n#define DAVS2_CPU_MMX2            0x0000004   /* MMX2 aka MMXEXT aka ISSE */\r\n#define DAVS2_CPU_MMXEXT          DAVS2_CPU_MMX2\r\n#define DAVS2_CPU_SSE             0x0000008\r\n#define DAVS2_CPU_SSE2            0x0000010\r\n#define DAVS2_CPU_SSE3            0x0000020\r\n#define DAVS2_CPU_SSSE3           0x0000040\r\n#define DAVS2_CPU_SSE4            0x0000080   /* SSE4.1 */\r\n#define DAVS2_CPU_SSE42           0x0000100   /* SSE4.2 */\r\n#define DAVS2_CPU_LZCNT           0x0000200   /* Phenom support for \"leading zero count\" instruction. */\r\n#define DAVS2_CPU_AVX             0x0000400   /* AVX support: requires OS support even if YMM registers aren't used. */\r\n#define DAVS2_CPU_XOP             0x0000800   /* AMD XOP */\r\n#define DAVS2_CPU_FMA4            0x0001000   /* AMD FMA4 */\r\n#define DAVS2_CPU_AVX2            0x0002000   /* AVX2 */\r\n#define DAVS2_CPU_FMA3            0x0004000   /* Intel FMA3 */\r\n#define DAVS2_CPU_BMI1            0x0008000   /* BMI1 */\r\n#define DAVS2_CPU_BMI2            0x0010000   /* BMI2 */\r\n/* x86 modifiers */\r\n#define DAVS2_CPU_CACHELINE_32    0x0020000   /* avoid memory loads that span the border between two cachelines */\r\n#define DAVS2_CPU_CACHELINE_64    0x0040000   /* 32/64 is the size of a cacheline in bytes */\r\n#define DAVS2_CPU_SSE2_IS_SLOW    0x0080000   /* avoid most SSE2 functions on Athlon64 */\r\n#define DAVS2_CPU_SSE2_IS_FAST    0x0100000   /* a few functions are only faster on Core2 and Phenom */\r\n#define DAVS2_CPU_SLOW_SHUFFLE    0x0200000   /* The Conroe has a slow shuffle unit (relative to overall SSE performance) */\r\n#define DAVS2_CPU_STACK_MOD4      0x0400000   /* if stack is only mod4 and not mod16 
*/\r\n#define DAVS2_CPU_SLOW_CTZ        0x0800000   /* BSR/BSF x86 instructions are really slow on some CPUs */\r\n#define DAVS2_CPU_SLOW_ATOM       0x1000000   /* The Atom is terrible: slow SSE unaligned loads, slow\r\n                                                 * SIMD multiplies, slow SIMD variable shifts, slow pshufb,\r\n                                                 * cacheline split penalties -- gather everything here that\r\n                                                 * isn't shared by other CPUs to avoid making half a dozen\r\n                                                 * new SLOW flags. */\r\n#define DAVS2_CPU_SLOW_PSHUFB     0x2000000   /* such as on the Intel Atom */\r\n#define DAVS2_CPU_SLOW_PALIGNR    0x4000000   /* such as on the AMD Bobcat */\r\n\r\n/* ARM */\r\n#define DAVS2_CPU_ARMV6           0x0000001\r\n#define DAVS2_CPU_NEON            0x0000002   /* ARM NEON */\r\n#define DAVS2_CPU_FAST_NEON_MRC   0x0000004   /* Transfer from NEON to ARM register is fast (Cortex-A9) */\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * others */\r\n#ifndef FALSE\r\n#define FALSE                   0\r\n#endif\r\n#ifndef TRUE\r\n#define TRUE                    1\r\n#endif\r\n\r\n#define FAST_GET_SPS            1     /* get SPS as soon as possible */\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * all assembly and related C functions are prefixed with 'davs2_' by default\r\n */\r\n#define PFXB(prefix, name)  prefix ## _ ## name\r\n#define PFXA(prefix, name)  PFXB(prefix,   name)\r\n#define FPFX(name)          PFXA(davs2,  name)\r\n\r\n/* ---------------------------------------------------------------------------\r\n * flag\r\n */\r\n#define AVS2_EXIT_THREAD     (-1)  /* flag to terminate thread */\r\n\r\n/* ---------------------------------------------------------------------------\r\n* whether HDR chroma delta QP is enabled\r\n*/\r\n#define HDR_CHROMA_DELTA_QP     
0\r\n\r\n#endif  // DAVS2_DEFINES_H\r\n"
  },
  {
    "path": "source/common/frame.cc",
    "content": "/*\r\n * frame.cc\r\n *\r\n * Description of this file:\r\n *    Frame handling functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"frame.h\"\r\n#include \"header.h\"\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * border expanding\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid pad_line_pixel(pel_t *pix, int width, int num_pad)\r\n{\r\n    pel4_t *p_l4 = (pel4_t *)(pix - num_pad);\r\n    pel4_t *p_r4 = (pel4_t *)(pix + width);\r\n    pel4_t l4 = pix[0];\r\n    pel4_t r4 
= pix[width - 1];\r\n#if ARCH_X86_64 && !HIGH_BIT_DEPTH\r\n    uint64_t *p_l64 = (uint64_t *)p_l4;\r\n    uint64_t *p_r64 = (uint64_t *)p_r4;\r\n    uint64_t l64;\r\n    uint64_t r64;\r\n#endif\r\n\r\n#if HIGH_BIT_DEPTH\r\n    l4 = (l4 << 48) | (l4 << 32) | (l4 << 16) | l4;\r\n    r4 = (r4 << 48) | (r4 << 32) | (r4 << 16) | r4;\r\n#else\r\n    l4 = (l4 << 24) | (l4 << 16) | (l4 << 8) | l4;\r\n    r4 = (r4 << 24) | (r4 << 16) | (r4 << 8) | r4;\r\n#if ARCH_X86_64\r\n    l64 = ((uint64_t)(l4) << 32) | l4;\r\n    r64 = ((uint64_t)(r4) << 32) | r4;\r\n#endif\r\n#endif\r\n\r\n#if ARCH_X86_64 && !HIGH_BIT_DEPTH\r\n    assert((num_pad & 7) == 0);\r\n    num_pad >>= 3;\r\n\r\n    for (; num_pad != 0; num_pad--) {\r\n        *p_l64++ = l64;              /* pad left */\r\n        *p_r64++ = r64;              /* pad right */\r\n    }\r\n#else\r\n    assert((num_pad & 3) == 0);\r\n    num_pad >>= 2;\r\n\r\n    for (; num_pad != 0; num_pad--) {\r\n        *p_l4++ = l4;              /* pad left */\r\n        *p_r4++ = r4;              /* pad right */\r\n    }\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid pad_line_lcu(davs2_t *h, int lcu_y)\r\n{\r\n    davs2_frame_t *frame = h->fdec;\r\n    int i, j;\r\n\r\n    for (i = 0; i < 3; i++) {\r\n        int chroma_shift = !!i;\r\n        int start = ((lcu_y + 0) << h->i_lcu_level) >> chroma_shift; ///< -4 for ALF\r\n        int end   = ((lcu_y + 1) << h->i_lcu_level) >> chroma_shift;\r\n        int i_stride = frame->i_stride[i];\r\n        int i_width  = frame->i_width[i];\r\n        const int num_pad = AVS2_PAD >> chroma_shift;\r\n        pel_t *pix;\r\n\r\n        if (lcu_y > 0) {\r\n            start -= 4;\r\n        }\r\n        if (lcu_y < h->i_height_in_lcu - 1) {\r\n            end -= 4;\r\n        }\r\n\r\n        /* padding these rows */\r\n        for (j = start; j < end; j++) {\r\n            pix = frame->planes[i] + j * i_stride;\r\n            
pad_line_pixel(pix, i_width, num_pad);\r\n        }\r\n\r\n        /* for the first row, padding the rows above the picture edges */\r\n        if (lcu_y == 0) {\r\n            pix = frame->planes[i] - (num_pad);\r\n\r\n            for (j = 0; j < (num_pad); j++) {\r\n                gf_davs2.memcpy_aligned(pix - i_stride, pix, i_stride * sizeof(pel_t));\r\n                pix -= i_stride;\r\n            }\r\n        }\r\n\r\n        /* for the last row, padding the rows under of the picture edges */\r\n        if (lcu_y == h->i_height_in_lcu - 1) {\r\n            pix = frame->planes[i] + (frame->i_lines[i] - 1) * i_stride - (num_pad);\r\n\r\n            for (j = 0; j < (num_pad); j++) {\r\n                gf_davs2.memcpy_aligned(pix + i_stride, pix, i_stride * sizeof(pel_t));\r\n                pix += i_stride;\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n/**\r\n * ===========================================================================\r\n * memory handling\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE int\r\nalign_stride(int x, int align, int disalign)\r\n{\r\n    x = DAVS2_ALIGN(x, align);\r\n    if (!(x & (disalign - 1))) {\r\n        x += align;\r\n    }\r\n\r\n    return x;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE int\r\nalign_plane_size(int x, int disalign)\r\n{\r\n    if (!(x & (disalign - 1))) {\r\n        x += 128;\r\n    }\r\n\r\n    return x;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nsize_t davs2_frame_get_size(int width, int height, int chroma_format, int b_extra)\r\n{\r\n    const int width_c        = width >> 1;\r\n    const int height_c       = height >> (chroma_format == CHROMA_420 ? 
1 : 0);\r\n    const int width_in_spu   = width  >> MIN_PU_SIZE_IN_BIT;\r\n    const int height_in_spu  = height >> MIN_PU_SIZE_IN_BIT;\r\n    const int max_lcu_height = (height + (1 << 4) - 1) >> 4; /* frame height in 16x16 LCU */\r\n    const int align    = 32;\r\n    const int disalign = 1 << 16;\r\n    int extra_buf_size = 0;     /* extra buffer size */\r\n    int stride_l, stride_c;\r\n    int size_l, size_c;         /* size of luma and chroma plane */\r\n    size_t mem_size;            /* total memory size */\r\n\r\n    /* need extra buffer? */\r\n    if (b_extra) {\r\n        /* reference information buffer size (in SPU) */\r\n        extra_buf_size = width_in_spu * height_in_spu;\r\n    }\r\n\r\n    /* compute stride and the plane size\r\n     * +PAD for extra data for MC */\r\n    stride_l = align_stride(width + AVS2_PAD * 2, align, disalign);\r\n    stride_c = align_stride(width_c + AVS2_PAD, align, disalign);\r\n    size_l   = align_plane_size(stride_l * (height + AVS2_PAD * 2) + CACHE_LINE_SIZE, disalign);\r\n    size_c   = align_plane_size(stride_c * (height_c + AVS2_PAD) + CACHE_LINE_SIZE,   disalign);\r\n\r\n    /* compute space size and alloc memory */\r\n    mem_size = sizeof(davs2_frame_t)                      + /* M0, size of frame handle */\r\n               sizeof(pel_t)  * (size_l + size_c * 2)       + /* M1, size of planes buffer: Y+U+V */\r\n               sizeof(int8_t) * extra_buf_size              + /* M2, size of SPU reference index buffer */\r\n               sizeof(mv_t)   * extra_buf_size              + /* M3, size of SPU motion vector buffer */\r\n               sizeof(davs2_thread_cond_t) * max_lcu_height + /* M4, condition variables for each LCU line */\r\n               sizeof(int) * max_lcu_height                 + /* M5, LCU decoding status */\r\n               CACHE_LINE_SIZE * 6;\r\n\r\n    return mem_size;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\ndavs2_frame_t 
*davs2_frame_new(int width, int height, int chroma_format, uint8_t **mem_base, int b_extra)\r\n{\r\n    const int width_c        = width >> 1;\r\n    const int height_c       = height >> (chroma_format == CHROMA_420 ? 1 : 0);\r\n    const int width_in_spu   = width  >> MIN_PU_SIZE_IN_BIT;\r\n    const int height_in_spu  = height >> MIN_PU_SIZE_IN_BIT;\r\n    const int max_lcu_height = (height + (1 << 4) - 1) / (1 << 4); /* frame height in 16x16 LCU */\r\n    const int align    = 32;\r\n    const int disalign = 1 << 16;\r\n    int extra_buf_size = 0;     /* extra buffer size */\r\n    int stride_l, stride_c;\r\n    int size_l, size_c;         /* size of luma and chroma plane */\r\n    int i, mem_size;            /* total memory size */\r\n    davs2_frame_t *frame;\r\n    uint8_t *mem_ptr;\r\n\r\n    /* need extra buffer? */\r\n    if (b_extra) {\r\n        /* reference information buffer size (in SPU) */\r\n        extra_buf_size = width_in_spu * height_in_spu;\r\n    }\r\n\r\n    /* compute stride and the plane size\r\n     * +PAD for extra data for MC */\r\n    stride_l = align_stride(width + AVS2_PAD * 2, align, disalign);\r\n    stride_c = align_stride(width_c + AVS2_PAD, align, disalign);\r\n    size_l   = align_plane_size(stride_l * (height + AVS2_PAD * 2) + CACHE_LINE_SIZE, disalign);\r\n    size_c   = align_plane_size(stride_c * (height_c + AVS2_PAD) + CACHE_LINE_SIZE,   disalign);\r\n\r\n    /* compute space size and alloc memory */\r\n    mem_size = sizeof(davs2_frame_t)                       + /* M0, size of frame handle */\r\n               sizeof(pel_t)  * (size_l + size_c * 2)       + /* M1, size of planes buffer: Y+U+V */\r\n               sizeof(int8_t) * extra_buf_size              + /* M2, size of SPU reference index buffer */\r\n               sizeof(mv_t)   * extra_buf_size              + /* M3, size of SPU motion vector buffer */\r\n               sizeof(davs2_thread_cond_t) * max_lcu_height + /* M4, condition variables for each LCU line */\r\n  
             sizeof(int) * max_lcu_height                 + /* M5, LCU decoding status */\r\n               CACHE_LINE_SIZE * 8;\r\n\r\n    if (mem_base == NULL) {\r\n        CHECKED_MALLOC(mem_ptr, uint8_t *, mem_size);\r\n    } else {\r\n        mem_ptr = *mem_base;\r\n    }\r\n\r\n    /* M0, frame handle */\r\n    frame    = (davs2_frame_t *)mem_ptr;\r\n    memset(frame, 0, sizeof(davs2_frame_t));\r\n    mem_ptr += sizeof(davs2_frame_t);\r\n    ALIGN_POINTER(mem_ptr);\r\n\r\n    /* set frame properties */\r\n    frame->i_plane     = 3;           /* planes: Y+U+V */\r\n    frame->i_width [0] = width;\r\n    frame->i_lines [0] = height;\r\n    frame->i_stride[0] = stride_l;\r\n    frame->i_width [1] = frame->i_width [2] = width_c;\r\n    frame->i_lines [1] = frame->i_lines [2] = height_c;\r\n    frame->i_stride[1] = frame->i_stride[2] = stride_c;\r\n\r\n    frame->i_type      = -1;\r\n    frame->i_pts       = -1;\r\n    frame->i_coi       = INVALID_FRAME;\r\n    frame->i_poc       = INVALID_FRAME;\r\n    frame->b_refered_by_others = 0;\r\n\r\n    /* M1, buffer for planes: Y+U+V */\r\n    frame->planes[0] = (pel_t *)mem_ptr;\r\n    frame->planes[1] = frame->planes[0] + size_l;\r\n    frame->planes[2] = frame->planes[1] + size_c;\r\n    mem_ptr         += sizeof(pel_t) * (size_l + size_c * 2);\r\n\r\n    /* point to plane data area */\r\n    frame->planes[0] += frame->i_stride[0] * (AVS2_PAD    ) + (AVS2_PAD    );\r\n    frame->planes[1] += frame->i_stride[1] * (AVS2_PAD / 2) + (AVS2_PAD / 2);\r\n    frame->planes[2] += frame->i_stride[2] * (AVS2_PAD / 2) + (AVS2_PAD / 2);\r\n    ALIGN_POINTER(frame->planes[0]);\r\n    ALIGN_POINTER(frame->planes[1]);\r\n    ALIGN_POINTER(frame->planes[2]);\r\n\r\n    if (b_extra) {\r\n        /* M2, reference index buffer (in SPU) */\r\n        frame->refbuf = (int8_t *)mem_ptr;\r\n        mem_ptr      += sizeof(int8_t) * extra_buf_size;\r\n        ALIGN_POINTER(mem_ptr);\r\n\r\n        /* M3, motion vector buffer (in SPU) */\r\n   
     frame->mvbuf = (mv_t *)mem_ptr;\r\n        mem_ptr     += sizeof(mv_t) * extra_buf_size;\r\n        ALIGN_POINTER(mem_ptr);\r\n    }\r\n\r\n    /* M4 */\r\n    frame->conds_lcu_row = (davs2_thread_cond_t *)mem_ptr;\r\n    mem_ptr     += sizeof(davs2_thread_cond_t) * max_lcu_height;\r\n    ALIGN_POINTER(mem_ptr);\r\n\r\n    /* M5 */\r\n    frame->num_decoded_lcu_in_row = (int *)mem_ptr;\r\n    mem_ptr += sizeof(int) * max_lcu_height;\r\n    ALIGN_POINTER(mem_ptr);\r\n\r\n    assert(mem_ptr - (uint8_t *)frame <= mem_size);\r\n\r\n    /* update mem_base */\r\n    if (mem_base != NULL) {\r\n        *mem_base = mem_ptr;\r\n        frame->is_self_malloc = 0;\r\n    } else {\r\n        frame->is_self_malloc = 1;\r\n    }\r\n\r\n    frame->i_conds         = max_lcu_height;\r\n    frame->i_decoded_line  = -1;\r\n    frame->i_ref_count     = 0;\r\n    frame->i_disposable    = 0;\r\n\r\n    for (i = 0; i < frame->i_conds; i++) {\r\n        if (davs2_thread_cond_init(&frame->conds_lcu_row[i], NULL)) {\r\n            goto fail;\r\n        }\r\n    }\r\n\r\n    davs2_thread_cond_init(&frame->cond_aec, NULL);\r\n    davs2_thread_mutex_init(&frame->mutex_frm, NULL);\r\n    davs2_thread_mutex_init(&frame->mutex_recon, NULL);\r\n\r\n    return frame;\r\n\r\nfail:\r\n    if (mem_ptr) {\r\n        davs2_free(mem_ptr);\r\n    }\r\n\r\n    return NULL;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_frame_destroy(davs2_frame_t *frame)\r\n{\r\n    int i;\r\n\r\n    if (frame == NULL) {\r\n        return;\r\n    }\r\n\r\n    davs2_thread_mutex_destroy(&frame->mutex_frm);\r\n    davs2_thread_mutex_destroy(&frame->mutex_recon);\r\n\r\n    for (i = 0; i < frame->i_conds; i++) {\r\n        davs2_thread_cond_destroy(&frame->conds_lcu_row[i]);\r\n    }\r\n\r\n    /* free the frame itself */\r\n    if (frame->is_self_malloc) {\r\n        davs2_free(frame);\r\n    }\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nvoid davs2_frame_copy_planes(davs2_frame_t *p_dst, davs2_frame_t *p_src)\r\n{\r\n    /* copy frame properties */\r\n    memcpy(p_dst, p_src, (uint8_t *)&p_src->i_ref_count - (uint8_t *)p_src);\r\n\r\n    /* copy all plane data */\r\n#if 1\r\n    /* strides are asserted equal, so each plane is copied with a single aligned memcpy */\r\n    assert(p_src->i_stride[0] == p_dst->i_stride[0]);\r\n    assert(p_src->i_stride[1] == p_dst->i_stride[1]);\r\n    assert(p_src->i_stride[2] == p_dst->i_stride[2]);\r\n    gf_davs2.memcpy_aligned(p_dst->planes[0], p_src->planes[0], p_src->i_stride[0] * p_src->i_lines[0] * sizeof(pel_t));\r\n    gf_davs2.memcpy_aligned(p_dst->planes[1], p_src->planes[1], p_src->i_stride[1] * p_src->i_lines[1] * sizeof(pel_t));\r\n    gf_davs2.memcpy_aligned(p_dst->planes[2], p_src->planes[2], p_src->i_stride[2] * p_src->i_lines[2] * sizeof(pel_t));\r\n#else\r\n    gf_davs2.plane_copy(p_dst->planes[0], p_dst->i_stride[0], p_src->planes[0], p_src->i_stride[0], p_src->i_width[0], p_src->i_lines[0]);\r\n    gf_davs2.plane_copy(p_dst->planes[1], p_dst->i_stride[1], p_src->planes[1], p_src->i_stride[1], p_src->i_width[1], p_src->i_lines[1]);\r\n    gf_davs2.plane_copy(p_dst->planes[2], p_dst->i_stride[2], p_src->planes[2], p_src->i_stride[2], p_src->i_width[2], p_src->i_lines[2]);\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * copy frame properties */\r\nvoid davs2_frame_copy_properties(davs2_frame_t *p_dst, davs2_frame_t *p_src)\r\n{\r\n    memcpy(p_dst, p_src, (uint8_t *)&p_src->i_ref_count - (uint8_t *)p_src);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_frame_copy_lcu(davs2_t *h, davs2_frame_t *p_dst, davs2_frame_t *p_src, int i_lcu_x, int i_lcu_y, int pix_offset, int padding_size)\r\n{\r\n    int pix_y = (i_lcu_y << h->i_lcu_level) + pix_offset;\r\n    int pix_x = (i_lcu_x << h->i_lcu_level) + pix_offset;\r\n    
int lcu_width  = DAVS2_MIN(h->i_lcu_size, h->i_width - pix_x);\r\n    int lcu_height = DAVS2_MIN(h->i_lcu_size, h->i_height - pix_y);\r\n    int y, len, stride;\r\n    pel_t *src, *dst;\r\n\r\n    /* Y */\r\n    stride = p_src->i_stride[0];\r\n    src    = p_src->planes[0] + pix_y * stride + pix_x;\r\n    dst    = p_dst->planes[0] + pix_y * stride + pix_x;\r\n    len    = lcu_width * sizeof(pel_t);\r\n    for (y = 0; y < lcu_height; y++) {\r\n        gf_davs2.fast_memcpy(dst, src, len);\r\n        if (padding_size) {\r\n            pad_line_pixel(dst, p_dst->i_width[0], padding_size);\r\n        }\r\n        src += stride;\r\n        dst += stride;\r\n    }\r\n\r\n    pix_y = (i_lcu_y << (h->i_lcu_level - 1)) + pix_offset;\r\n    pix_x = (i_lcu_x << (h->i_lcu_level - 1)) + pix_offset;\r\n    lcu_height >>= 1;\r\n\r\n    /* U */\r\n    stride = p_src->i_stride[1];\r\n    src    = p_src->planes[1] + pix_y * stride + pix_x;\r\n    dst    = p_dst->planes[1] + pix_y * stride + pix_x;\r\n    len    = lcu_width * sizeof(pel_t);\r\n    for (y = 0; y < lcu_height; y++) {\r\n        gf_davs2.fast_memcpy(dst, src, len);\r\n        if (padding_size) {\r\n            pad_line_pixel(dst, p_dst->i_width[1], padding_size);\r\n        }\r\n        src += stride;\r\n        dst += stride;\r\n    }\r\n\r\n    /* V */\r\n    stride = p_src->i_stride[2];\r\n    src    = p_src->planes[2] + pix_y * stride + pix_x;\r\n    dst    = p_dst->planes[2] + pix_y * stride + pix_x;\r\n    len    = lcu_width * sizeof(pel_t);\r\n    for (y = 0; y < lcu_height; y++) {\r\n        gf_davs2.fast_memcpy(dst, src, len);\r\n        if (padding_size) {\r\n            pad_line_pixel(dst, p_dst->i_width[2], padding_size);\r\n        }\r\n        src += stride;\r\n        dst += stride;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * padding_size - padding size for left and right edges\r\n */\r\nvoid davs2_frame_copy_lcurow(davs2_t *h, davs2_frame_t 
*p_dst, davs2_frame_t *p_src, int i_lcu_y, int pix_offset, int padding_size)\r\n{\r\n    int pix_y = (i_lcu_y << h->i_lcu_level) + pix_offset;\r\n    int lcu_h = DAVS2_MIN(h->i_height, ((i_lcu_y + 1) << h->i_lcu_level)) - pix_y;\r\n    int y, len, stride;\r\n    pel_t *src, *dst;\r\n\r\n    /* Y */\r\n    stride = p_src->i_stride[0];\r\n    src    = p_src->planes[0] + pix_y * stride;\r\n    dst    = p_dst->planes[0] + pix_y * stride;\r\n    len    = p_src->i_width[0] * sizeof(pel_t);\r\n    for (y = 0; y < lcu_h; y++) {\r\n        gf_davs2.fast_memcpy(dst, src, len);\r\n        if (padding_size) {\r\n            pad_line_pixel(dst, p_dst->i_width[0], padding_size);\r\n        }\r\n        src += stride;\r\n        dst += stride;\r\n    }\r\n\r\n    pix_y = (i_lcu_y << (h->i_lcu_level - 1)) + pix_offset;\r\n    lcu_h = DAVS2_MIN(h->i_height >> 1, ((i_lcu_y + 1) << (h->i_lcu_level - 1))) - pix_y;\r\n\r\n    /* U */\r\n    stride = p_src->i_stride[1];\r\n    src    = p_src->planes[1] + pix_y * stride;\r\n    dst    = p_dst->planes[1] + pix_y * stride;\r\n    len    = p_src->i_width[1] * sizeof(pel_t);\r\n    for (y = 0; y < lcu_h; y++) {\r\n        gf_davs2.fast_memcpy(dst, src, len);\r\n        if (padding_size) {\r\n            pad_line_pixel(dst, p_dst->i_width[1], padding_size);\r\n        }\r\n        src += stride;\r\n        dst += stride;\r\n    }\r\n\r\n    /* V */\r\n    stride = p_src->i_stride[2];\r\n    src    = p_src->planes[2] + pix_y * stride;\r\n    dst    = p_dst->planes[2] + pix_y * stride;\r\n    len    = p_src->i_width[2] * sizeof(pel_t);\r\n    for (y = 0; y < lcu_h; y++) {\r\n        gf_davs2.fast_memcpy(dst, src, len);\r\n        if (padding_size) {\r\n            pad_line_pixel(dst, p_dst->i_width[2], padding_size);\r\n        }\r\n        src += stride;\r\n        dst += stride;\r\n    }\r\n}\r\n"
  },
  {
    "path": "source/common/frame.h",
    "content": "/*\r\n * frame.h\r\n *\r\n * Description of this file:\r\n *    Frame handling functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_FRAME_H\r\n#define DAVS2_FRAME_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * function declares\r\n * ===========================================================================\r\n */\r\n#define davs2_frame_get_size FPFX(frame_get_size)\r\nsize_t davs2_frame_get_size(int width, int height, int chroma_format, int b_extra);\r\n#define davs2_frame_new FPFX(frame_new)\r\ndavs2_frame_t *davs2_frame_new(int width, int height, int chroma_format, uint8_t **mem_base, int 
b_extra);\r\n\r\n#define davs2_frame_destroy FPFX(frame_destroy)\r\nvoid davs2_frame_destroy(davs2_frame_t *frame);\r\n\r\n#define davs2_frame_copy_planes FPFX(frame_copy_planes)\r\nvoid davs2_frame_copy_planes(davs2_frame_t *p_dst, davs2_frame_t *p_src);\r\n#define davs2_frame_copy_properties FPFX(frame_copy_properties)\r\nvoid davs2_frame_copy_properties(davs2_frame_t *p_dst, davs2_frame_t *p_src);\r\n#define davs2_frame_copy_lcu FPFX(frame_copy_lcu)\r\nvoid davs2_frame_copy_lcu(davs2_t *h, davs2_frame_t *p_dst, davs2_frame_t *p_src, int i_lcu_x, int i_lcu_y, int pix_offset, int padding_size);\r\n#define davs2_frame_copy_lcurow FPFX(frame_copy_lcurow)\r\nvoid davs2_frame_copy_lcurow(davs2_t *h, davs2_frame_t *p_dst, davs2_frame_t *p_src, int i_lcu_y, int pix_offset, int padding_size);\r\n\r\n#define davs2_frame_expand_border FPFX(frame_expand_border)\r\nvoid davs2_frame_expand_border(davs2_frame_t *frame);\r\n\r\n#define pad_line_lcu FPFX(pad_line_lcu)\r\nvoid pad_line_lcu(davs2_t *h, int lcu_y);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  /* DAVS2_FRAME_H */\r\n"
  },
  {
    "path": "source/common/header.cc",
    "content": "/*\r\n * header.cc\r\n *\r\n * Description of this file:\r\n *    Header functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"davs2.h\"\r\n#include \"transform.h\"\r\n#include \"vlc.h\"\r\n#include \"header.h\"\r\n#include \"aec.h\"\r\n#include \"alf.h\"\r\n#include \"quant.h\"\r\n#include \"bitstream.h\"\r\n#include \"decoder.h\"\r\n#include \"frame.h\"\r\n#include \"predict.h\"\r\n#include \"quant.h\"\r\n#include \"cpu.h\"\r\n\r\n/**\r\n * ===========================================================================\r\n * const variable defines\r\n * ===========================================================================\r\n */\r\nextern const int8_t *tab_DL_Avails[MAX_CU_SIZE_IN_BIT 
+ 1];\r\nextern const int8_t *tab_TR_Avails[MAX_CU_SIZE_IN_BIT + 1];\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define DAVS2_MAX_FRAME_RATE_CODE 13\r\nstatic const float FRAME_RATE[DAVS2_MAX_FRAME_RATE_CODE] = {\r\n    24000.0f / 1001.0f, 24.0f, 25.0f, 30000.0f / 1001.0f, 30.0f, 50.0f, 60000.0f / 1001.0f, 60.0f,\r\n    100.0f, 120.0f, 200.0f, 240.0f, 300.0f\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const uint8_t ALPHA_TABLE[64] = {\r\n     0,  0,  0,  0,  0,  0,  1,  1,\r\n     1,  1,  1,  2,  2,  2,  3,  3,\r\n     4,  4,  5,  5,  6,  7,  8,  9,\r\n    10, 11, 12, 13, 15, 16, 18, 20,\r\n    22, 24, 26, 28, 30, 33, 33, 35,\r\n    35, 36, 37, 37, 39, 39, 42, 44,\r\n    46, 48, 50, 52, 53, 54, 55, 56,\r\n    57, 58, 59, 60, 61, 62, 63, 64\r\n};\r\n\r\nstatic const uint8_t BETA_TABLE[64] = {\r\n     0,  0,  0,  0,  0,  0,  1,  1,\r\n     1,  1,  1,  1,  1,  2,  2,  2,\r\n     2,  2,  3,  3,  3,  3,  4,  4,\r\n     4,  4,  5,  5,  5,  5,  6,  6,\r\n     6,  7,  7,  7,  8,  8,  8,  9,\r\n     9, 10, 10, 11, 11, 12, 13, 14,\r\n    15, 16, 17, 18, 19, 20, 21, 22,\r\n    23, 23, 24, 24, 25, 25, 26, 27\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * extension id */\r\nenum extension_id_e {\r\n    SEQUENCE_DISPLAY_EXTENSION_ID   = 2,\r\n    TEMPORAL_SCALABLE_EXTENSION_ID  = 3,\r\n    COPYRIGHT_EXTENSION_ID          = 4,\r\n    PICTURE_DISPLAY_EXTENSION_ID    = 7,\r\n    CAMERAPARAMETERS_EXTENSION_ID   = 11,\r\n    LOCATION_DATA_EXTENSION_ID      = 15\r\n};\r\n\r\n#define ROI_DATA_FILE   \"roi.dat\"     // ROI location data output\r\n\r\nstatic bool_t open_dbp_buffer_warning = 1;\r\n\r\n/**\r\n * ===========================================================================\r\n * local function defines\r\n * ===========================================================================\r\n */\r\n\r\nstatic INLINE 
int is_valid_qp(davs2_t *h, int i_qp)\r\n{\r\n    return i_qp >= 0 && i_qp <= (63 + 8 * (h->sample_bit_depth - 8));\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid davs2_reconfigure_decoder(davs2_mgr_t *h)\r\n{\r\n    UNUSED_PARAMETER(h);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * sequence header\r\n */\r\nstatic\r\nint parse_sequence_header(davs2_mgr_t *mgr, davs2_seq_t *seq, davs2_bs_t *bs)\r\n{\r\n    rps_t *p_rps = NULL;\r\n\r\n    int i, j;\r\n    int num_of_rps;\r\n\r\n    bs->i_bit_pos += 32; /* skip start code */\r\n\r\n    memset(seq, 0, sizeof(davs2_seq_t));  // reset all value\r\n\r\n    seq->head.profile_id       = u_v(bs, 8, \"profile_id\");\r\n    seq->head.level_id         = u_v(bs, 8, \"level_id\");\r\n    seq->head.progressive      = u_v(bs, 1, \"progressive_sequence\");\r\n    seq->b_field_coding        = u_flag(bs, \"field_coded_sequence\");\r\n\r\n    seq->head.width     = u_v(bs, 14, \"horizontal_size\");\r\n    seq->head.height    = u_v(bs, 14, \"vertical_size\");\r\n\r\n    if (seq->head.width < 16 || seq->head.height < 16) {\r\n        return -1;\r\n    }\r\n\r\n    seq->head.chroma_format = u_v(bs, 2, \"chroma_format\");\r\n\r\n    if (seq->head.chroma_format != CHROMA_420 && seq->head.chroma_format != CHROMA_400) {\r\n        return -1;\r\n    }\r\n    if (seq->head.chroma_format == CHROMA_400) {\r\n        davs2_log(mgr, DAVS2_LOG_WARNING, \"Un-supported Chroma Format YUV400 as 0 for GB/T.\\n\");\r\n    }\r\n\r\n    /* sample bit depth */\r\n    if (seq->head.profile_id == MAIN10_PROFILE) {\r\n        seq->sample_precision      = u_v(bs, 3, \"sample_precision\");\r\n        seq->encoding_precision    = u_v(bs, 3, \"encoding_precision\");\r\n    } else {\r\n        seq->sample_precision      = u_v(bs, 3, \"sample_precision\");\r\n        seq->encoding_precision    = 1;\r\n    }\r\n    if (seq->sample_precision < 1 
|| seq->sample_precision > 3 ||\r\n        seq->encoding_precision < 1 || seq->encoding_precision > 3) {\r\n        return -1;\r\n    }\r\n\r\n    seq->head.internal_bit_depth   = 6 + (seq->encoding_precision << 1);\r\n    seq->head.output_bit_depth     = 6 + (seq->encoding_precision << 1);\r\n    seq->head.bytes_per_sample     = seq->head.output_bit_depth > 8 ? 2 : 1;\r\n\r\n    /*  */\r\n    seq->head.aspect_ratio         = u_v(bs, 4, \"aspect_ratio_information\");\r\n    seq->head.frame_rate_id        = u_v(bs, 4, \"frame_rate_id\");\r\n    seq->bit_rate_lower            = u_v(bs, 18, \"bit_rate_lower\");\r\n    u_v(bs, 1,  \"marker bit\");\r\n    seq->bit_rate_upper            = u_v(bs, 12, \"bit_rate_upper\");\r\n    seq->head.low_delay            = u_v(bs, 1, \"low_delay\");\r\n    u_v(bs, 1,  \"marker bit\");\r\n    seq->b_temporal_id_exist       = u_flag(bs, \"temporal_id exist flag\"); // get Extension Flag\r\n    u_v(bs, 18, \"bbv buffer size\");\r\n\r\n    seq->log2_lcu_size             = u_v(bs, 3, \"Largest Coding Block Size\");\r\n\r\n    if (seq->log2_lcu_size < 4 || seq->log2_lcu_size > 6) {\r\n        davs2_log(mgr, DAVS2_LOG_ERROR, \"Invalid LCU size: %d\\n\", seq->log2_lcu_size);\r\n        return -1;\r\n    }\r\n\r\n    seq->enable_weighted_quant     = u_flag(bs, \"enable_weighted_quant\");\r\n\r\n    if (seq->enable_weighted_quant) {\r\n        int load_seq_wquant_data_flag;\r\n        int x, y, sizeId, uiWqMSize;\r\n        const int *Seq_WQM;\r\n\r\n        load_seq_wquant_data_flag = u_flag(bs,  \"load_seq_weight_quant_data_flag\");\r\n\r\n        for (sizeId = 0; sizeId < 2; sizeId++) {\r\n            uiWqMSize = DAVS2_MIN(1 << (sizeId + 2), 8);\r\n            if (load_seq_wquant_data_flag == 1) {\r\n                for (y = 0; y < uiWqMSize; y++) {\r\n                    for (x = 0; x < uiWqMSize; x++) {\r\n                        seq->seq_wq_matrix[sizeId][y * uiWqMSize + x] = (int16_t)ue_v(bs, \"weight_quant_coeff\");\r\n                 
   }\r\n                }\r\n            } else if (load_seq_wquant_data_flag == 0) {\r\n                Seq_WQM = wq_get_default_matrix(sizeId);\r\n                for (i = 0; i < (uiWqMSize * uiWqMSize); i++) {\r\n                    seq->seq_wq_matrix[sizeId][i] = (int16_t)Seq_WQM[i];\r\n                }\r\n            }\r\n        }\r\n    }\r\n\r\n    seq->enable_background_picture = u_flag(bs, \"background_picture_disable\") ^ 0x01;\r\n    seq->enable_mhp_skip           = u_flag(bs, \"mhpskip enabled\");\r\n    seq->enable_dhp                = u_flag(bs, \"dhp enabled\");\r\n    seq->enable_wsm                = u_flag(bs, \"wsm enabled\");\r\n    seq->enable_amp                = u_flag(bs, \"Asymmetric Motion Partitions\");\r\n    seq->enable_nsqt               = u_flag(bs, \"use NSQT\");\r\n    seq->enable_sdip               = u_flag(bs, \"use NSIP\");\r\n    seq->enable_2nd_transform      = u_flag(bs, \"secT enabled\");\r\n    seq->enable_sao                = u_flag(bs, \"SAO Enable Flag\");\r\n    seq->enable_alf                = u_flag(bs, \"ALF Enable Flag\");\r\n    seq->enable_pmvr               = u_flag(bs, \"pmvr enabled\");\r\n\r\n    if (1 != u_v(bs, 1, \"marker bit\"))  {\r\n        davs2_log(mgr, DAVS2_LOG_ERROR, \"expected marker_bit 1 while received 0, FILE %s, Row %d\\n\", __FILE__, __LINE__);\r\n    }\r\n\r\n    num_of_rps                      = u_v(bs, 6, \"num_of_RPS\");\r\n    if (num_of_rps > AVS2_GOP_NUM) {\r\n        return -1;\r\n    }\r\n\r\n    seq->num_of_rps = num_of_rps;\r\n\r\n    for (i = 0; i < num_of_rps; i++) {\r\n        p_rps = &seq->seq_rps[i];\r\n\r\n        p_rps->refered_by_others        = u_v(bs, 1,  \"refered by others\");\r\n        p_rps->num_of_ref               = u_v(bs, 3,  \"num of reference picture\");\r\n\r\n        for (j = 0; j < p_rps->num_of_ref; j++) {\r\n            p_rps->ref_pic[j]           = u_v(bs, 6,  \"delta COI of ref pic\");\r\n        }\r\n\r\n        p_rps->num_to_remove            = u_v(bs, 
3,  \"num of removed picture\");\r\n\r\n        for (j = 0; j < p_rps->num_to_remove; j++) {\r\n            p_rps->remove_pic[j]        = u_v(bs, 6,  \"delta COI of removed pic\");\r\n        }\r\n\r\n        if (1 != u_v(bs, 1, \"marker bit\"))  {\r\n            davs2_log(mgr, DAVS2_LOG_ERROR, \"expected marker_bit 1 while received 0, FILE %s, Row %d\\n\", __FILE__, __LINE__);\r\n        }\r\n    }\r\n\r\n    if (seq->head.low_delay == 0) {\r\n        seq->picture_reorder_delay = u_v(bs, 5, \"picture_reorder_delay\");\r\n    }\r\n\r\n    seq->cross_loop_filter_flag    = u_flag(bs, \"Cross Loop Filter Flag\");\r\n    u_v(bs, 2,  \"reserved bits\");\r\n\r\n    bs_align(bs); /* align position */\r\n\r\n    if (seq->head.frame_rate_id < 1 || seq->head.frame_rate_id > DAVS2_MAX_FRAME_RATE_CODE) {\r\n        davs2_log(mgr, DAVS2_LOG_ERROR, \"Invalid frame_rate_code %d, valid range [1, %d].\\n\",\r\n            seq->head.frame_rate_id, DAVS2_MAX_FRAME_RATE_CODE);\r\n        seq->head.frame_rate_id = DAVS2_CLIP3(1, DAVS2_MAX_FRAME_RATE_CODE, seq->head.frame_rate_id);\r\n    }\r\n\r\n    seq->head.bitrate    = ((seq->bit_rate_upper << 18) + seq->bit_rate_lower) * 400;\r\n    seq->head.frame_rate = FRAME_RATE[seq->head.frame_rate_id - 1];\r\n\r\n    seq->i_enc_width     = ((seq->head.width + MIN_CU_SIZE - 1) >> MIN_CU_SIZE_IN_BIT) << MIN_CU_SIZE_IN_BIT;\r\n    seq->i_enc_height    = ((seq->head.height   + MIN_CU_SIZE - 1) >> MIN_CU_SIZE_IN_BIT) << MIN_CU_SIZE_IN_BIT;\r\n    seq->valid_flag = 1;\r\n\r\n    return 0;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * init deblock parame of one frame\r\n */\r\nstatic INLINE\r\nvoid deblock_init_frame_parames(davs2_t *h)\r\n{\r\n    int shift = h->sample_bit_depth - 8;\r\n    int QP   = h->i_picture_qp - (shift << 3);\r\n    int QP_c = cu_get_chroma_qp(h, h->i_picture_qp, 0) - (shift << 3);\r\n\r\n    h->alpha   = ALPHA_TABLE[DAVS2_CLIP3(0, 63, QP + h->i_alpha_offset)] << 
shift;\r\n    h->beta    = BETA_TABLE[DAVS2_CLIP3(0, 63, QP + h->i_beta_offset)] << shift;\r\n\r\n    h->alpha_c = ALPHA_TABLE[DAVS2_CLIP3(0, 63, QP_c + h->i_alpha_offset)] << shift;\r\n    h->beta_c  = BETA_TABLE[DAVS2_CLIP3(0, 63, QP_c + h->i_beta_offset)] << shift;\r\n\r\n\r\n    if (gf_davs2.set_deblock_const != NULL) {\r\n        gf_davs2.set_deblock_const();\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Intra picture header\r\n */\r\nstatic int parse_picture_header_intra(davs2_t *h, davs2_bs_t *bs)\r\n{\r\n    int time_code_flag;\r\n    int progressive_frame;\r\n    int predict;\r\n    int i;\r\n\r\n    h->i_frame_type = AVS2_I_SLICE;\r\n\r\n    /* skip start code */\r\n    bs->i_bit_pos += 32;\r\n\r\n    u_v(bs, 32, \"bbv_delay\");\r\n    time_code_flag                      = u_v(bs, 1, \"time_code_flag\");\r\n\r\n    if (time_code_flag) {\r\n        /* time_code                 = */ u_v(bs, 24, \"time_code\");\r\n    }\r\n\r\n    if (h->b_bkgnd_picture) {\r\n        int background_picture_flag     = u_v(bs, 1, \"background_picture_flag\");\r\n\r\n        if (background_picture_flag) {\r\n            int b_output                = u_v(bs, 1, \"background_picture_output_flag\");\r\n            if (b_output) {\r\n                h->i_frame_type = AVS2_G_SLICE;\r\n            } else {\r\n                h->i_frame_type = AVS2_GB_SLICE;\r\n            }\r\n        }\r\n    }\r\n\r\n    h->i_coi                            = u_v(bs, 8, \"coding_order\");\r\n\r\n    if (h->seq_info.b_temporal_id_exist == 1) {\r\n        h->i_cur_layer                  = u_v(bs, TEMPORAL_MAXLEVEL_BIT, \"temporal_id\");\r\n    }\r\n\r\n    if (h->seq_info.head.low_delay == 0) {\r\n        h->i_display_delay              = ue_v(bs, \"picture_output_delay\");\r\n        if (h->i_display_delay >= 64) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"invalid picture output delay intra.\");\r\n            return -1;\r\n        
}\r\n    }\r\n\r\n    predict                             = u_v(bs, 1, \"use RPS in SPS\");\r\n    if (predict) {\r\n        int index                       = u_v(bs, 5, \"predict for RPS\");\r\n        if (index >= h->seq_info.num_of_rps) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"invalid rps index.\");\r\n            return -1;\r\n        }\r\n\r\n        h->rps                          = h->seq_info.seq_rps[index];\r\n    } else {\r\n        h->rps.refered_by_others        = u_v(bs, 1, \"refered by others\");\r\n        h->rps.num_of_ref               = u_v(bs, 3, \"num of reference picture\");\r\n        if (h->rps.num_of_ref > AVS2_MAX_REFS) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"invalid number of references.\");\r\n            return -1;\r\n        }\r\n\r\n        for (i = 0; i < h->rps.num_of_ref; i++) {\r\n            h->rps.ref_pic[i]           = u_v(bs, 6, \"delta COI of ref pic\");\r\n        }\r\n\r\n        h->rps.num_to_remove            = u_v(bs, 3, \"num of removed picture\");\r\n        assert((unsigned int)h->rps.num_to_remove <= sizeof(h->rps.remove_pic) / sizeof(h->rps.remove_pic[0]));\r\n\r\n        for (i = 0; i < h->rps.num_to_remove; i++) {\r\n            h->rps.remove_pic[i]        = u_v(bs, 6, \"delta COI of removed pic\");\r\n        }\r\n        u_v(bs, 1, \"marker bit\");\r\n    }\r\n\r\n    if (h->seq_info.head.low_delay) {\r\n        /* bbv_check_times           = */ ue_v(bs, \"bbv check times\");\r\n    }\r\n\r\n    progressive_frame                   = u_v(bs, 1, \"progressive_frame\");\r\n\r\n    if (!progressive_frame) {\r\n        h->i_pic_coding_type            = (int8_t)u_v(bs, 1, \"picture_structure\");\r\n    } else {\r\n        h->i_pic_coding_type            = FRAME;\r\n    }\r\n\r\n    h->b_top_field_first                = u_flag(bs, \"top_field_first\");\r\n    h->b_repeat_first_field             = u_flag(bs, \"repeat_first_field\");\r\n\r\n    if (h->seq_info.b_field_coding) {\r\n        h->b_top_field   
               = u_flag(bs, \"is_top_field\");\r\n        /* reserved                  = */ u_v(bs, 1, \"reserved bit for interlace coding\");\r\n    }\r\n\r\n    h->b_fixed_picture_qp               = u_flag(bs, \"fixed_picture_qp\");\r\n    h->i_picture_qp                     = u_v(bs, 7, \"picture_qp\");\r\n\r\n    h->b_loop_filter                    = u_v(bs, 1, \"loop_filter_disable\") ^ 0x01;\r\n\r\n    if (h->b_loop_filter) {\r\n        int loop_filter_parameter_flag  = u_v(bs, 1, \"loop_filter_parameter_flag\");\r\n\r\n        if (loop_filter_parameter_flag) {\r\n            h->i_alpha_offset           = se_v(bs, \"alpha_offset\");\r\n            h->i_beta_offset            = se_v(bs, \"beta_offset\");\r\n        } else {\r\n            h->i_alpha_offset           = 0;\r\n            h->i_beta_offset            = 0;\r\n        }\r\n\r\n        deblock_init_frame_parames(h);\r\n    }\r\n\r\n    h->enable_chroma_quant_param        = !u_flag(bs, \"chroma_quant_param_disable\");\r\n    if (h->enable_chroma_quant_param) {\r\n        h->chroma_quant_param_delta_u = se_v(bs, \"chroma_quant_param_delta_cb\");\r\n        h->chroma_quant_param_delta_v = se_v(bs, \"chroma_quant_param_delta_cr\");\r\n    } else {\r\n        h->chroma_quant_param_delta_u = 0;\r\n        h->chroma_quant_param_delta_v = 0;\r\n    }\r\n\r\n    // adaptive frequency weighting quantization\r\n    h->seq_info.enable_weighted_quant = 0;\r\n\r\n    if (h->seq_info.enable_weighted_quant) {\r\n        int pic_weight_quant_enable     = u_v(bs, 1, \"pic_weight_quant_enable\");\r\n        if (pic_weight_quant_enable) {\r\n            weighted_quant_t *p = &h->wq;\r\n            p->pic_wq_data_index        = u_v(bs, 2, \"pic_wq_data_index\");\r\n\r\n            if (p->pic_wq_data_index == 1) {\r\n                /* int mb_adapt_wq_disable = */       u_v(bs, 1, \"reserved_bits\");\r\n\r\n                p->wq_param             = u_v(bs, 2, \"weighting_quant_param_index\");\r\n                
p->wq_model             = u_v(bs, 2, \"wq_model\");\r\n\r\n                if (p->wq_param == 1) {\r\n                    for (i = 0; i < 6; i++) {\r\n                        p->quant_param_undetail[i] = (int16_t)se_v(bs, \"quant_param_delta_u\") + wq_param_default[UNDETAILED][i];\r\n                    }\r\n                }\r\n\r\n                if (p->wq_param == 2) {\r\n                    for (i = 0; i < 6; i++) {\r\n                        p->quant_param_detail[i] = (int16_t)se_v(bs, \"quant_param_delta_d\") + wq_param_default[DETAILED][i];\r\n                    }\r\n                }\r\n            } else if (p->pic_wq_data_index == 2) {\r\n                int x, y, sizeId, uiWqMSize;\r\n\r\n                for (sizeId = 0; sizeId < 2; sizeId++) {\r\n                    i = 0;\r\n                    uiWqMSize = DAVS2_MIN(1 << (sizeId + 2), 8);\r\n\r\n                    for (y = 0; y < uiWqMSize; y++) {\r\n                        for (x = 0; x < uiWqMSize; x++) {\r\n                            p->pic_user_wq_matrix[sizeId][i++] = (int16_t)ue_v(bs, \"weight_quant_coeff\");\r\n                        }\r\n                    }\r\n                }\r\n            }\r\n\r\n            h->seq_info.enable_weighted_quant = 1;\r\n        }\r\n    }\r\n\r\n    alf_read_param(h, bs);\r\n\r\n    h->i_qp = h->i_picture_qp;\r\n    if (!is_valid_qp(h, h->i_qp)) {\r\n        davs2_log(h, DAVS2_LOG_ERROR, \"Invalid I Picture QP: %d\\n\", h->i_qp);\r\n    }\r\n\r\n    /* align position in bitstream buffer */\r\n    bs_align(bs);\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Inter picture header\r\n */\r\nstatic int parse_picture_header_inter(davs2_t *h, davs2_bs_t *bs)\r\n{\r\n    int background_pred_flag;\r\n    int progressive_frame;\r\n    int predict;\r\n    int i;\r\n\r\n    /* skip start code */\r\n    bs->i_bit_pos += 32;\r\n\r\n    u_v(bs, 32, \"bbv delay\");\r\n\r\n    h->i_pic_struct          
           = (int8_t)u_v(bs, 2, \"picture_coding_type\");\r\n    if (h->b_bkgnd_picture && (h->i_pic_struct == 1 || h->i_pic_struct == 3)) {\r\n        if (h->i_pic_struct == 1) {\r\n            background_pred_flag        = u_v(bs, 1, \"background_pred_flag\");\r\n        } else {\r\n            background_pred_flag        = 0;\r\n        }\r\n\r\n        if (background_pred_flag == 0) {\r\n            h->b_bkgnd_reference        = u_flag(bs, \"background_reference_enable\");\r\n        } else {\r\n            h->b_bkgnd_reference        = 0;\r\n        }\r\n    } else {\r\n        background_pred_flag            = 0;\r\n        h->b_bkgnd_reference            = 0;\r\n    }\r\n\r\n    if (h->i_pic_struct == 1 && background_pred_flag) {\r\n        h->i_frame_type = AVS2_S_SLICE;\r\n    } else if (h->i_pic_struct == 1) {\r\n        h->i_frame_type = AVS2_P_SLICE;\r\n    } else if (h->i_pic_struct == 3) {\r\n        h->i_frame_type = AVS2_F_SLICE;\r\n    } else {\r\n        h->i_frame_type = AVS2_B_SLICE;\r\n    }\r\n\r\n    h->i_coi                            = u_v(bs, 8, \"coding_order\");\r\n    if (h->seq_info.b_temporal_id_exist == 1) {\r\n        h->i_cur_layer                  = u_v(bs, TEMPORAL_MAXLEVEL_BIT, \"temporal_id\");\r\n    }\r\n\r\n    if (h->seq_info.head.low_delay == 0) {\r\n        h->i_display_delay              = ue_v(bs, \"displaydelay\");\r\n        if (h->i_display_delay >= 64) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"invalid picture output delay inter.\");\r\n            return -1;\r\n        }\r\n    }\r\n\r\n    /* */\r\n    predict                             = u_v(bs, 1, \"use RPS in SPS\");\r\n    if (predict) {\r\n        int index                       = u_v(bs, 5, \"predict for RPS\");\r\n        if (index >= h->seq_info.num_of_rps) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"invalid rps index.\");\r\n            return -1;\r\n        }\r\n\r\n        h->rps                          = h->seq_info.seq_rps[index];\r\n    
} else {\r\n        // GOP size\r\n        h->rps.refered_by_others        = u_v(bs, 1, \"refered by others\");\r\n        h->rps.num_of_ref               = u_v(bs, 3, \"num of reference picture\");\r\n\r\n        for (i = 0; i < h->rps.num_of_ref; i++) {\r\n            h->rps.ref_pic[i]           = u_v(bs, 6, \"delta COI of ref pic\");\r\n        }\r\n\r\n        h->rps.num_to_remove            = u_v(bs, 3, \"num of removed picture\");\r\n        assert((unsigned int)h->rps.num_to_remove <= sizeof(h->rps.remove_pic) / sizeof(h->rps.remove_pic[0]));\r\n\r\n        for (i = 0; i < h->rps.num_to_remove; i++) {\r\n            h->rps.remove_pic[i]        = u_v(bs, 6, \"delta COI of removed pic\");\r\n        }\r\n        u_v(bs, 1, \"marker bit\");\r\n    }\r\n\r\n    if (h->seq_info.head.low_delay) {\r\n        ue_v(bs, \"bbv check times\");\r\n    }\r\n\r\n    progressive_frame                   = u_v(bs, 1, \"progressive_frame\");\r\n\r\n    if (!progressive_frame) {\r\n        h->i_pic_coding_type            = (int8_t)u_v(bs, 1, \"picture_structure\");\r\n    } else {\r\n        h->i_pic_coding_type            = FRAME;\r\n    }\r\n\r\n    h->b_top_field_first                = u_flag(bs, \"top_field_first\");\r\n    h->b_repeat_first_field             = u_flag(bs, \"repeat_first_field\");\r\n\r\n    if (h->seq_info.b_field_coding) {\r\n        h->b_top_field                  =u_flag(bs, \"is_top_field\");\r\n        u_v(bs, 1, \"reserved bit for interlace coding\");\r\n    }\r\n\r\n    h->b_fixed_picture_qp               = u_flag(bs, \"fixed_picture_qp\");\r\n    h->i_picture_qp                     = u_v(bs, 7, \"picture_qp\");\r\n\r\n    if (!(h->i_pic_struct == 2 && h->i_pic_coding_type == FRAME)) {\r\n        u_v(bs, 1, \"reserved_bit\");\r\n    }\r\n\r\n    h->b_ra_decodable                   = u_flag(bs, \"random_access_decodable_flag\");\r\n\r\n    h->b_loop_filter                    = u_v(bs, 1, \"loop_filter_disable\") ^ 0x01;\r\n\r\n    if 
(h->b_loop_filter) {\r\n        int loop_filter_parameter_flag  = u_v(bs, 1, \"loop_filter_parameter_flag\");\r\n\r\n        if (loop_filter_parameter_flag) {\r\n            h->i_alpha_offset           = se_v(bs, \"alpha_offset\");\r\n            h->i_beta_offset            = se_v(bs, \"beta_offset\");\r\n        } else {\r\n            h->i_alpha_offset           = 0;\r\n            h->i_beta_offset            = 0;\r\n        }\r\n\r\n        deblock_init_frame_parames(h);\r\n    }\r\n\r\n    h->enable_chroma_quant_param = !u_flag(bs, \"chroma_quant_param_disable\");\r\n\r\n    if (h->enable_chroma_quant_param) {\r\n        h->chroma_quant_param_delta_u = se_v(bs, \"chroma_quant_param_delta_cb\");\r\n        h->chroma_quant_param_delta_v = se_v(bs, \"chroma_quant_param_delta_cr\");\r\n    } else {\r\n        h->chroma_quant_param_delta_u = 0;\r\n        h->chroma_quant_param_delta_v = 0;\r\n    }\r\n\r\n    // adaptive frequency weighting quantization\r\n    h->seq_info.enable_weighted_quant = 0;\r\n\r\n    if (h->seq_info.enable_weighted_quant) {\r\n        int pic_weight_quant_enable     = u_v(bs, 1, \"pic_weight_quant_enable\");\r\n\r\n        if (pic_weight_quant_enable) {\r\n            weighted_quant_t *p = &h->wq;\r\n            p->pic_wq_data_index        = u_v(bs, 2, \"pic_wq_data_index\");\r\n\r\n            if (p->pic_wq_data_index == 1) {\r\n                /* int mb_adapt_wq_disable = */     u_v(bs, 1, \"reserved_bits\");\r\n\r\n                p->wq_param             = u_v(bs, 2, \"weighting_quant_param_index\");\r\n                p->wq_model             = u_v(bs, 2, \"wq_model\");\r\n\r\n                if (p->wq_param == 1) {\r\n                    for (i = 0; i < 6; i++) {\r\n                        p->quant_param_undetail[i] = (int16_t)se_v(bs, \"quant_param_delta_u\") + wq_param_default[UNDETAILED][i];\r\n                    }\r\n                }\r\n\r\n                if (p->wq_param == 2) {\r\n                    for (i = 0; i < 6; i++) 
{\r\n                        p->quant_param_detail[i] = (int16_t)se_v(bs, \"quant_param_delta_d\") + wq_param_default[DETAILED][i];\r\n                    }\r\n                }\r\n            } else if (p->pic_wq_data_index == 2) {\r\n                int x, y, sizeId, uiWqMSize;\r\n\r\n                for (sizeId = 0; sizeId < 2; sizeId++) {\r\n                    i = 0;\r\n                    uiWqMSize = DAVS2_MIN(1 << (sizeId + 2), 8);\r\n\r\n                    for (y = 0; y < uiWqMSize; y++) {\r\n                        for (x = 0; x < uiWqMSize; x++) {\r\n                            p->pic_user_wq_matrix[sizeId][i++] = (int16_t)ue_v(bs, \"weight_quant_coeff\");\r\n                        }\r\n                    }\r\n                }\r\n            }\r\n\r\n            h->seq_info.enable_weighted_quant = 1;\r\n        }\r\n    }\r\n\r\n    alf_read_param(h, bs);\r\n\r\n    h->i_qp = h->i_picture_qp;\r\n    if (!is_valid_qp(h, h->i_qp)) {\r\n        davs2_log(h, DAVS2_LOG_ERROR, \"Invalid PB Picture QP: %d\\n\", h->i_qp);\r\n    }\r\n\r\n    /* align position in bitstream buffer */\r\n    bs_align(bs);\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nint parse_picture_header(davs2_t *h, davs2_bs_t *bs, uint32_t start_code)\r\n{\r\n    davs2_mgr_t *mgr = h->task_info.taskmgr;\r\n\r\n    assert(start_code == SC_INTRA_PICTURE || start_code == SC_INTER_PICTURE);\r\n\r\n    if (start_code == SC_INTRA_PICTURE) {\r\n        if (parse_picture_header_intra(h, bs) < 0) {\r\n            return -1;\r\n        }\r\n    } else {\r\n        if (mgr->outpics.output == -1) {\r\n            /* An I frame is expected for the first frame or after the decoder is flushed. 
*/\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"sequence should start with an I frame.\");\r\n            return -1;\r\n        }\r\n\r\n        if (parse_picture_header_inter(h, bs) < 0) {\r\n            return -1;\r\n        }\r\n    }\r\n\r\n    /* field picture ? */\r\n    if (h->i_pic_coding_type != FRAME) {\r\n        davs2_log(h, DAVS2_LOG_ERROR, \"field is not supported.\");\r\n        return -1;\r\n    }\r\n\r\n    /* COI should be a periodically-repeated value from 0 to 255 */\r\n    if (mgr->outpics.output != -1 &&\r\n        h->i_coi != (mgr->i_prev_coi + 1) % AVS2_COI_CYCLE) {\r\n        davs2_log(h, DAVS2_LOG_DEBUG, \"discontinuous COI (prev: %d --> curr: %d).\", mgr->i_prev_coi, h->i_coi);\r\n    }\r\n\r\n    /* update COI */\r\n    if (h->i_coi < mgr->i_prev_coi) { /// !!! '='\r\n        mgr->i_tr_wrap_cnt++;\r\n    }\r\n\r\n    mgr->i_prev_coi = h->i_coi;\r\n\r\n    h->i_coi += mgr->i_tr_wrap_cnt * AVS2_COI_CYCLE;\r\n\r\n    if (h->seq_info.head.low_delay == 0) {\r\n        h->i_poc = h->i_coi + h->i_display_delay - h->seq_info.picture_reorder_delay;\r\n    } else {\r\n        h->i_poc = h->i_coi;\r\n    }\r\n\r\n    assert(h->i_coi >= 0 && h->i_poc >= 0); /// 'int' (2147483647) should be large enough for 'i_coi' & 'i_poc'.\r\n\r\n    if (mgr->outpics.output == -1 && start_code == SC_INTRA_PICTURE) {\r\n        if (h->i_coi != 0) {\r\n            davs2_log(h, DAVS2_LOG_INFO, \"COI of the first frame is %d.\", h->i_coi);\r\n        }\r\n\r\n        mgr->outpics.output = h->i_poc;\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * interface function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid parse_slice_header(davs2_t *h, davs2_bs_t *bs)\r\n{\r\n    int slice_vertical_position;\r\n    int 
slice_vertical_position_extension = 0;\r\n    int slice_horizontal_positon;\r\n    int slice_horizontal_positon_extension;\r\n    int mb_row;\r\n\r\n    /* skip start code: 00 00 01 */\r\n    bs->i_bit_pos += 24;\r\n\r\n    slice_vertical_position = u_v(bs, 8, \"slice vertical position\");\r\n\r\n    if (h->i_image_height > (144 * h->i_lcu_size)) {\r\n        slice_vertical_position_extension = u_v(bs, 3, \"slice vertical position extension\");\r\n    }\r\n\r\n    if (h->i_image_height > (144 * h->i_lcu_size)) {\r\n        mb_row = (slice_vertical_position_extension << 7) + slice_vertical_position;\r\n    } else {\r\n        mb_row = slice_vertical_position;\r\n    }\r\n\r\n    slice_horizontal_positon = u_v(bs, 8, \"slice horizontal position\");\r\n    if (h->i_width > (255 * h->i_lcu_size)) {\r\n        slice_horizontal_positon_extension = u_v(bs, 2, \"slice horizontal position extension\");\r\n    }\r\n\r\n    if (!h->b_fixed_picture_qp) {\r\n        h->b_fixed_slice_qp = u_flag(bs, \"fixed_slice_qp\");\r\n        h->i_slice_qp       = u_v(bs, 7, \"slice_qp\");\r\n\r\n        h->b_DQP            = !h->b_fixed_slice_qp;\r\n    } else {\r\n        h->i_slice_qp       = h->i_picture_qp;\r\n        h->b_DQP            = 0;\r\n    }\r\n    h->i_qp = h->i_slice_qp;\r\n\r\n    if (!is_valid_qp(h, h->i_qp)) {\r\n        davs2_log(h, DAVS2_LOG_ERROR, \"Invalid Slice QP: %d\\n\", h->i_qp);\r\n    }\r\n\r\n    if (h->b_sao) {\r\n        h->slice_sao_on[0] = u_flag(bs, \"sao_slice_flag_Y\");\r\n        h->slice_sao_on[1] = u_flag(bs, \"sao_slice_flag_Cb\");\r\n        h->slice_sao_on[2] = u_flag(bs, \"sao_slice_flag_Cr\");\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\ndavs2_outpic_t *alloc_picture(int w, int h)\r\n{\r\n    davs2_outpic_t *pic = NULL;\r\n    uint8_t *buf;\r\n\r\n    buf = (uint8_t *)davs2_malloc(sizeof(davs2_outpic_t)     +\r\n                                   sizeof(davs2_seq_info_t) +\r\n     
                              sizeof(davs2_picture_t)  + sizeof(pel_t) * w * h * 3 / 2);\r\n    if (buf == NULL) {\r\n        return NULL;\r\n    }\r\n\r\n    pic = (davs2_outpic_t *)buf;\r\n\r\n    buf += sizeof(davs2_outpic_t); /* davs2_outpic_t */\r\n\r\n    pic->frame = NULL;\r\n    pic->next  = NULL;\r\n\r\n    pic->head = (davs2_seq_info_t *)buf;\r\n    buf      += sizeof(davs2_seq_info_t);\r\n\r\n    pic->pic = (davs2_picture_t *)buf;\r\n    buf     += sizeof(davs2_picture_t);\r\n\r\n    pic->pic->num_planes = 3;\r\n    pic->pic->planes[0] = buf;\r\n    pic->pic->planes[1] = pic->pic->planes[0] + w * h * sizeof(pel_t);\r\n    pic->pic->planes[2] = pic->pic->planes[1] + w * h / 4 * sizeof(pel_t);\r\n    pic->pic->widths[0] = w;\r\n    pic->pic->widths[1] = w / 2;\r\n    pic->pic->widths[2] = w / 2;\r\n    pic->pic->lines [0] = h;\r\n    pic->pic->lines [1] = h / 2;\r\n    pic->pic->lines [2] = h / 2;\r\n    pic->pic->dec_frame = NULL;\r\n\r\n    return pic;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid free_picture(davs2_outpic_t *pic)\r\n{\r\n    if (pic) {\r\n        davs2_free(pic);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * destroy decoding picture buffer(DPB)\r\n */\r\nvoid destroy_dpb(davs2_mgr_t *mgr)\r\n{\r\n    davs2_frame_t *frame = NULL;\r\n    int i;\r\n\r\n    for (i = 0; i < mgr->dpbsize; i++) {\r\n        frame = mgr->dpb[i];\r\n        assert(frame);\r\n\r\n        mgr->dpb[i] = NULL;\r\n\r\n        davs2_thread_mutex_lock(&frame->mutex_frm);\r\n\r\n        if (frame->i_ref_count == 0) {\r\n            davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n            davs2_frame_destroy(frame);\r\n        } else {\r\n            frame->i_disposable = 2; /* free when not referenced */\r\n            davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n        }\r\n    }\r\n\r\n    davs2_free(mgr->dpb);\r\n    mgr->dpb = 
NULL;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * create decoding picture buffer(DPB)\r\n */\r\nstatic INLINE\r\nint create_dpb(davs2_mgr_t *mgr)\r\n{\r\n    davs2_seq_t *seq = &mgr->seq_info;\r\n    uint8_t      *mem_ptr = NULL;\r\n    size_t        mem_size = 0;\r\n    int i;\r\n\r\n    mgr->dpbsize = mgr->num_decoders + seq->picture_reorder_delay + 16;  /// !!! FIXME: decide dpb buffer size ?\r\n    mgr->dpbsize += 8;  // FIXME\r\n\r\n    mem_size = mgr->dpbsize * sizeof(davs2_frame_t *)\r\n        + davs2_frame_get_size(seq->i_enc_width, seq->i_enc_height, seq->head.chroma_format, 1) * mgr->dpbsize\r\n        + davs2_frame_get_size(seq->i_enc_width, seq->i_enc_height, seq->head.chroma_format, 0)\r\n        + CACHE_LINE_SIZE * (mgr->dpbsize + 2);\r\n\r\n    mem_ptr = (uint8_t *)davs2_malloc(mem_size);\r\n    if (mem_ptr == NULL) {\r\n        return -1;\r\n    }\r\n\r\n    mgr->dpb = (davs2_frame_t **)mem_ptr;\r\n    mem_ptr += mgr->dpbsize * sizeof(davs2_frame_t *);\r\n    ALIGN_POINTER(mem_ptr);\r\n\r\n    for (i = 0; i < mgr->dpbsize; i++) {\r\n        mgr->dpb[i] = davs2_frame_new(seq->i_enc_width, seq->i_enc_height, seq->head.chroma_format, &mem_ptr, 1);\r\n        ALIGN_POINTER(mem_ptr);\r\n\r\n        if (mgr->dpb[i] == NULL) {\r\n            return -1;\r\n        }\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void init_fdec(davs2_t *h, int64_t pts, int64_t dts)\r\n{\r\n    int num_in_spu = h->i_width_in_spu * h->i_height_in_spu;\r\n    int i;\r\n\r\n    h->fdec->i_type              = h->i_frame_type;\r\n    h->fdec->i_qp                = h->i_qp;\r\n    h->fdec->i_poc               = h->i_poc;\r\n    h->fdec->i_coi               = h->i_coi;\r\n    h->fdec->b_refered_by_others = h->rps.refered_by_others;\r\n    h->fdec->i_decoded_line      = -1;\r\n    h->fdec->i_pts               = pts;\r\n    
h->fdec->i_dts               = dts;\r\n\r\n    for (i = 0; i < AVS2_MAX_REFS; i++) {\r\n        h->fdec->dist_refs[i] = -1;\r\n        h->fdec->dist_scale_refs[i] = -1;\r\n    }\r\n\r\n    if (h->i_frame_type != AVS2_B_SLICE) {\r\n        for (i = 0; i < h->num_of_references; i++) {\r\n            h->fdec->dist_refs[i] = AVS2_DISTANCE_INDEX(2 * (h->fdec->i_poc - h->fref[i]->i_poc));\r\n            if (h->fdec->dist_refs[i] <= 0) {\r\n                davs2_log(h, DAVS2_LOG_ERROR, \"invalid reference frame distance.\");\r\n                h->fdec->dist_refs[i] = 1;\r\n            }\r\n            h->fdec->dist_scale_refs[i] = (MULTI / h->fdec->dist_refs[i]);\r\n        }\r\n    } else {\r\n        h->fdec->dist_refs[B_FWD] = AVS2_DISTANCE_INDEX(2 * (h->fdec->i_poc - h->fref[B_FWD]->i_poc));\r\n        h->fdec->dist_refs[B_BWD] = AVS2_DISTANCE_INDEX(2 * (h->fref[B_BWD]->i_poc - h->fdec->i_poc));\r\n        if (h->fdec->dist_refs[B_FWD] <= 0) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"invalid reference frame distance. B_FWD\");\r\n            h->fdec->dist_refs[B_FWD] = 1;\r\n        }\r\n        if (h->fdec->dist_refs[B_BWD] <= 0) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"invalid reference frame distance. 
B_BWD\");\r\n            h->fdec->dist_refs[B_BWD] = 1;\r\n        }\r\n        h->fdec->dist_scale_refs[B_FWD] = (MULTI / h->fdec->dist_refs[B_FWD]);\r\n        h->fdec->dist_scale_refs[B_BWD] = (MULTI / h->fdec->dist_refs[B_BWD]);\r\n    }\r\n\r\n    /* clear mvbuf and refbuf */\r\n    memset(h->fdec->mvbuf, 0, num_in_spu * sizeof(mv_t));\r\n    memset(h->fdec->refbuf, INVALID_REF, num_in_spu * sizeof(int8_t));\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint task_decoder_update(davs2_t *h)\r\n{\r\n    davs2_mgr_t *mgr  = h->task_info.taskmgr;\r\n    davs2_seq_t *seq  = &mgr->seq_info;\r\n\r\n    if (seq->valid_flag == 0) {\r\n        davs2_log(h, DAVS2_LOG_ERROR, \"failed to update decoder (invalid sequence header).\");\r\n        return -1;\r\n    }\r\n\r\n    if (h->b_sao != seq->enable_sao || h->b_alf != seq->enable_alf ||\r\n        h->i_chroma_format != (int)seq->head.chroma_format || h->i_lcu_level != seq->log2_lcu_size ||\r\n        h->i_image_width != (int)seq->head.width || h->i_image_height != (int)seq->head.height ||\r\n        h->p_integral == NULL) {\r\n        /* resolution changed */\r\n        decoder_free_extra_buffer(h);\r\n\r\n        /* key properties of the video sequence: size and color format */\r\n        h->i_lcu_level      = seq->log2_lcu_size;\r\n        h->i_lcu_size       = 1 << h->i_lcu_level;\r\n        h->i_lcu_size_sub1  = (1 << h->i_lcu_level) - 1;\r\n        h->i_chroma_format  = seq->head.chroma_format;\r\n        h->i_image_width    = seq->head.width;\r\n        h->i_image_height   = seq->head.height;\r\n        h->i_width          = seq->i_enc_width;\r\n        h->i_height         = seq->i_enc_height;\r\n\r\n        h->i_width_in_scu   = h->i_width  >> MIN_CU_SIZE_IN_BIT;\r\n        h->i_height_in_scu  = h->i_height >> MIN_CU_SIZE_IN_BIT;\r\n        h->i_size_in_scu    = h->i_width_in_scu * h->i_height_in_scu;\r\n        h->i_width_in_spu   = h->i_width  >> 
MIN_PU_SIZE_IN_BIT;\r\n        h->i_height_in_spu  = h->i_height >> MIN_PU_SIZE_IN_BIT;\r\n        h->i_width_in_lcu   = (h->i_width + h->i_lcu_size_sub1) >> h->i_lcu_level;\r\n        h->i_height_in_lcu  = (h->i_height + h->i_lcu_size_sub1) >> h->i_lcu_level;\r\n\r\n        /* encoding tools configuration */\r\n        h->b_sao            = seq->enable_sao;\r\n        h->b_alf            = seq->enable_alf;\r\n\r\n        if (decoder_alloc_extra_buffer(h) < 0) {\r\n            h->i_lcu_level     = 0;\r\n            h->i_chroma_format = 0;\r\n            h->i_image_width   = 0;\r\n            h->i_image_height  = 0;\r\n\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"failed to update the decoder(failed to alloc space).\");\r\n\r\n            return -1;\r\n        }\r\n    }\r\n\r\n    /* update sequence header */\r\n    h->i_chroma_format  = seq->head.chroma_format;\r\n    h->i_lcu_level      = seq->log2_lcu_size;\r\n    h->b_bkgnd_picture  = seq->enable_background_picture;\r\n\r\n    // h->b_dmh            = 1;\r\n    h->output_bit_depth = 8;\r\n    h->sample_bit_depth = 8;\r\n\r\n    h->p_tab_DL_avail   = tab_DL_Avails[h->i_lcu_level];\r\n    h->p_tab_TR_avail   = tab_TR_Avails[h->i_lcu_level];\r\n\r\n    if (seq->head.profile_id == MAIN10_PROFILE) {\r\n        h->output_bit_depth = 6 + (seq->sample_precision << 1);\r\n        h->sample_bit_depth = 6 + (seq->encoding_precision << 1);\r\n    }\r\n\r\n#if HIGH_BIT_DEPTH\r\n    g_bit_depth   = h->sample_bit_depth;\r\n    max_pel_value = (1 << g_bit_depth) - 1;\r\n    g_dc_value    = 1 << (g_bit_depth - 1);\r\n#else\r\n    if (g_bit_depth != h->sample_bit_depth) {\r\n        davs2_log(h, DAVS2_LOG_ERROR, \"Un-supported bit-depth %d in this version.\\n\", h->sample_bit_depth);\r\n        return -1;\r\n    }\r\n#endif\r\n\r\n    memcpy(h->wq.seq_wq_matrix, seq->seq_wq_matrix, 2 * 64 * sizeof(int16_t)); /* weighting quantization matrix */\r\n    memcpy(&h->seq_info, seq, sizeof(davs2_seq_t));\r\n\r\n    return 
0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nint task_set_sequence_head(davs2_mgr_t *mgr, davs2_seq_t *seq)\r\n{\r\n    int ret = 0;\r\n\r\n    davs2_thread_mutex_lock(&mgr->mutex_mgr);\r\n\r\n    davs2_reconfigure_decoder(mgr);\r\n\r\n    if (seq->valid_flag) {\r\n        int newres = (mgr->seq_info.head.height != seq->head.height || mgr->seq_info.head.width != seq->head.width);\r\n\r\n        memcpy(&mgr->seq_info, seq, sizeof(davs2_seq_t));\r\n\r\n        if (newres) {\r\n            /* resolution changed : new sequence */\r\n            davs2_log(mgr, DAVS2_LOG_INFO, \"Sequence Resolution: %dx%d.\", seq->head.width, seq->head.height);\r\n            if ((seq->head.width & 1) != 0 || (seq->head.height & 1) != 0) {\r\n                davs2_log(mgr, DAVS2_LOG_ERROR, \"Sequence Resolution %dx%d is not even\\n\",\r\n                    seq->head.width, seq->head.height);\r\n            }\r\n\r\n            /* COI for the new sequence should be reset */\r\n            mgr->i_tr_wrap_cnt = 0;\r\n            mgr->i_prev_coi    = -1;\r\n\r\n            destroy_dpb(mgr);\r\n\r\n            if (create_dpb(mgr) < 0) {\r\n                /* error */\r\n                ret = -1;\r\n                memset(&mgr->seq_info, 0, sizeof(davs2_seq_t));\r\n                davs2_log(mgr, DAVS2_LOG_ERROR, \"failed to create dpb buffers. 
%dx%d.\", seq->head.width, seq->head.height);\r\n            }\r\n            mgr->new_sps = TRUE;\r\n        }\r\n    } else {\r\n        /* invalid header */\r\n        memset(&mgr->seq_info, 0, sizeof(davs2_seq_t));\r\n        davs2_log(mgr, DAVS2_LOG_ERROR, \"decoded an invalid sequence header: %dx%d.\", seq->head.width, seq->head.height);\r\n    }\r\n\r\n    davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n    return ret;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid clean_one_frame(davs2_frame_t *frame)\r\n{\r\n    frame->i_poc                = INVALID_FRAME;\r\n    frame->i_coi                = INVALID_FRAME;\r\n    frame->i_disposable         = 0;\r\n    frame->b_refered_by_others  = 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid release_one_frame(davs2_frame_t *frame)\r\n{\r\n    int obsolete = 0;\r\n\r\n    if (frame == NULL) {\r\n        return;\r\n    }\r\n\r\n    davs2_thread_mutex_lock(&frame->mutex_frm);\r\n\r\n    assert(frame->i_ref_count > 0);\r\n\r\n    frame->i_ref_count--;\r\n\r\n    if (frame->i_ref_count == 0) {\r\n        if (frame->i_disposable == 1) {\r\n            clean_one_frame(frame);\r\n        }\r\n\r\n        obsolete = frame->i_disposable == 2;\r\n    }\r\n\r\n    davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n\r\n    if (obsolete != 0) {\r\n        davs2_frame_destroy(frame);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid task_release_frames(davs2_t *h)\r\n{\r\n    int i;\r\n\r\n    /* release reference to all reference frames */\r\n    for (i = 0; i < h->num_of_references; i++) {\r\n        release_one_frame(h->fref[i]);\r\n        h->fref[i] = NULL;\r\n    }\r\n\r\n    h->num_of_references = 0;\r\n\r\n    /* release reference to the reconstructed frame */\r\n    release_one_frame(h->fdec);\r\n    h->fdec = NULL;\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nint has_blocking(davs2_mgr_t *mgr)\r\n{\r\n    davs2_output_t *pics  = &mgr->outpics;\r\n    davs2_outpic_t *pic   = NULL;\r\n    davs2_frame_t  *frame = NULL;\r\n\r\n    int decodingframes = 0, outputframes = 0;\r\n    int i;\r\n\r\n    /* is the expected frame already in the output list ? */\r\n    for (pic = pics->pics; pic; pic = pic->next) {\r\n        frame = pic->frame;\r\n\r\n        if (frame->i_poc == pics->output) {\r\n            /* the expected frame */\r\n            return 0;\r\n        } else if (frame->i_poc < pics->output) {\r\n            /* a late frame: the output thread will dump it.*/\r\n            return 0;\r\n        }\r\n\r\n        outputframes++;\r\n    }\r\n\r\n    /* is the expected frame still under decoding ? */\r\n    for (i = 0; i < mgr->num_decoders; i++) {\r\n        davs2_t *h = &mgr->decoders[i];\r\n\r\n        if (h->task_info.task_status != TASK_FREE) {\r\n            frame = h->fdec;\r\n\r\n            if (frame != NULL) {\r\n                if (frame->i_poc == pics->output) {\r\n                    /* the expected frame will be put into the output list soon */\r\n                    return 0;\r\n                }\r\n\r\n                if (frame->i_poc >= 0) {\r\n                    decodingframes++;\r\n                }\r\n            }\r\n        }\r\n    }\r\n\r\n    assert(outputframes + decodingframes <= mgr->dpbsize);\r\n\r\n    /* the expected frame is neither in the output list nor under decoding */\r\n    if (mgr->outpics.busy != 0) {\r\n        /* one frame being delivered, soon it maybe free ? 
*/\r\n        return 0;\r\n    }\r\n\r\n    return 1;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint task_get_references(davs2_t *h, int64_t pts, int64_t dts)\r\n{\r\n    davs2_mgr_t    *mgr   = h->task_info.taskmgr;\r\n    davs2_frame_t **dpb   = mgr->dpb;\r\n    davs2_frame_t  *frame = NULL;\r\n    int i, j;\r\n\r\n#define IS_VALID_FRAME(frame) ((frame)->i_coi != INVALID_FRAME && (frame)->i_poc != INVALID_FRAME)\r\n\r\n    davs2_thread_mutex_lock(&mgr->mutex_mgr);\r\n\r\n    h->fdec = NULL;\r\n    h->num_of_references = 0;\r\n    for (i = 0; i < AVS2_MAX_REFS; i++) {\r\n        h->fref[i] = NULL;\r\n    }\r\n\r\n    for (i = 0; i < mgr->num_frames_to_remove; i++) {\r\n        int coi_frame_to_remove = mgr->coi_remove_frame[i];\r\n\r\n        for (j = 0; j < mgr->dpbsize; j++) {\r\n            frame = dpb[j];\r\n\r\n            if (!IS_VALID_FRAME(frame)) {\r\n                continue;\r\n            }\r\n\r\n            if (frame->i_coi == coi_frame_to_remove) {\r\n                break;\r\n            }\r\n        }\r\n\r\n        if (j < mgr->dpbsize) {\r\n            davs2_thread_mutex_lock(&frame->mutex_frm);\r\n            // assert(frame->i_disposable == 0);\r\n\r\n            if (frame->i_ref_count == 0) {\r\n                clean_one_frame(frame);\r\n            } else {\r\n                frame->i_disposable = 1;\r\n            }\r\n\r\n            davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n        }\r\n    }\r\n\r\n\r\n    if (h->i_frame_type == AVS2_GB_SLICE) {\r\n        h->fdec = h->f_background_cur;\r\n    } else {\r\n        for (i = 0; i < h->rps.num_of_ref; i++) {\r\n            int ref_frame_coi = h->i_coi - h->rps.ref_pic[i];\r\n            for (j = 0; j < mgr->dpbsize; j++) {\r\n                frame = dpb[j];\r\n\r\n                if (!IS_VALID_FRAME(frame)) {\r\n                    continue;\r\n                }\r\n\r\n                
davs2_thread_mutex_lock(&frame->mutex_frm);\r\n\r\n                if (frame->i_coi >= 0 && ref_frame_coi == frame->i_coi) {\r\n                    assert(frame->i_disposable == 0);\r\n                    assert(frame->b_refered_by_others != 0);\r\n\r\n                    if (frame->i_disposable == 0 && frame->b_refered_by_others != 0) {\r\n                        frame->i_ref_count++;\r\n                        davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n\r\n                        h->fref[i] = frame;\r\n                        h->num_of_references++;\r\n\r\n                        break;\r\n                    }\r\n                }\r\n\r\n                davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n            }\r\n\r\n            if (j == mgr->dpbsize) {\r\n                davs2_log(h, DAVS2_LOG_ERROR, \"reference frame of [coi: %d, poc: %d]: <COI: %d> not found.\",\r\n                    h->i_coi, h->i_poc, ref_frame_coi);\r\n                goto fail;\r\n            }\r\n        }\r\n\r\n        if (h->i_frame_type == AVS2_B_SLICE &&\r\n            (h->num_of_references != 2 || h->fref[0]->i_poc <= h->i_poc || h->fref[1]->i_poc >= h->i_poc)) {\r\n            davs2_log(h, DAVS2_LOG_ERROR, \"reference frames for B frame [coi: %d, poc: %d] are wrong: %d frames found\",\r\n                h->i_coi, h->i_poc, h->num_of_references);\r\n            goto fail;\r\n        }\r\n\r\n        /* delete the frame that will never be used */\r\n        mgr->num_frames_to_remove = h->rps.num_to_remove;\r\n\r\n        for (i = 0; i < h->rps.num_to_remove; i++) {\r\n            mgr->coi_remove_frame[i] = h->i_coi - h->rps.remove_pic[i];\r\n        }\r\n\r\n        /* clean old frames */\r\n        for (i = 0; i < mgr->dpbsize; i++) {\r\n            frame = dpb[i];\r\n\r\n            if (!IS_VALID_FRAME(frame)) {\r\n                continue;\r\n            }\r\n\r\n            davs2_thread_mutex_lock(&frame->mutex_frm);\r\n\r\n            if (DAVS2_ABS(frame->i_poc - 
h->i_poc) >= MAX_POC_DISTANCE) {\r\n                if (frame->i_ref_count == 0) {\r\n                    davs2_log(h, DAVS2_LOG_WARNING, \"force to remove obsolete frame <poc: %d>.\", frame->i_poc);\r\n                    /* no one is holding reference to this frame: clean it ! */\r\n                    clean_one_frame(frame);\r\n                } else {\r\n                    /* weird ? */\r\n                    /* some task has forgot to release it ? */\r\n                    if (frame->i_disposable == 0) {\r\n                        frame->i_disposable = 1;\r\n                        davs2_log(h, DAVS2_LOG_WARNING, \"force to mark obsolete frame <poc: %d> as to be removed.\", frame->i_poc);\r\n                    }\r\n                }\r\n            }\r\n\r\n            davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n        }\r\n\r\n        /* find fdec */\r\n        for (;;) {\r\n            for (i = 0; i < mgr->dpbsize; i++) {\r\n                frame = dpb[i];\r\n\r\n                davs2_thread_mutex_lock(&frame->mutex_frm);\r\n\r\n                if (frame->i_ref_count == 0 && frame->b_refered_by_others == 0) {\r\n                    assert(frame->i_disposable == 0);\r\n\r\n                    frame->i_ref_count++;   /* for the decoding thread */\r\n                    frame->i_ref_count++;   /* for the output thread */\r\n\r\n                    frame->i_disposable = h->rps.refered_by_others == 0 ? 1 : 0;\r\n\r\n                    h->fdec = frame;\r\n\r\n                    davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n\r\n                    break;\r\n                }\r\n\r\n                davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n            }\r\n\r\n            if (h->fdec != NULL) {\r\n                /* got it */\r\n                break;\r\n            }\r\n\r\n            /* DPB full ? 
*/\r\n            if (open_dbp_buffer_warning) {\r\n                davs2_log(h, DAVS2_LOG_WARNING, \"running out of DPB buffers, performance may suffer.\");\r\n                open_dbp_buffer_warning = 0;      /* avoid too many warnings */\r\n            }\r\n\r\n            /* detect possible blocks */\r\n            if (has_blocking(mgr) != 0) {\r\n                if (mgr->outpics.pics == NULL) {\r\n                    /*!!! try to use an earliest frame ??? */\r\n                    /* find the frame with the least POC value */\r\n                    for (i = 0; i < mgr->dpbsize; i++) {\r\n                        frame = dpb[i];\r\n                        davs2_thread_mutex_lock(&frame->mutex_frm);\r\n\r\n                        if (frame->i_ref_count == 0 && (h->fdec == NULL || h->fdec->i_poc > frame->i_poc)) {\r\n                            if (h->fdec) {\r\n                                davs2_thread_mutex_lock(&h->fdec->mutex_frm);\r\n                                h->fdec->i_ref_count--;\r\n                                h->fdec->i_ref_count--;\r\n                                davs2_thread_mutex_unlock(&h->fdec->mutex_frm);\r\n                            }\r\n\r\n                            frame->i_ref_count++;   /* for the decoding thread */\r\n                            frame->i_ref_count++;   /* for the output thread */\r\n\r\n                            h->fdec = frame;\r\n                        }\r\n\r\n                        davs2_thread_mutex_unlock(&frame->mutex_frm);\r\n                    }\r\n\r\n                    if (NULL == h->fdec) {\r\n                        davs2_log(h, DAVS2_LOG_ERROR, \"no frame for new task, DPB size (%d) too small(reorder delay: %d) ?\", mgr->dpbsize, mgr->seq_info.picture_reorder_delay);\r\n                        goto fail;\r\n                    }\r\n\r\n                    h->fdec->i_disposable = h->rps.refered_by_others == 0 ? 
1 : 0;\r\n\r\n                    davs2_log(h, DAVS2_LOG_WARNING, \"force one frame as the reconstruction frame.\");\r\n\r\n                    break;\r\n                } else {\r\n                    /* next frame will not be available, skip it */\r\n                    assert(mgr->outpics.output < mgr->outpics.pics->frame->i_poc);\r\n                    /* emit an error */\r\n                    davs2_log(h, DAVS2_LOG_ERROR, \"the expected frame %d unavailable, proceed to frame %d.\", mgr->outpics.output, mgr->outpics.pics->frame->i_poc);\r\n                    /* output the next available frame */\r\n                    mgr->outpics.output = mgr->outpics.pics->frame->i_poc;\r\n                }\r\n            }\r\n\r\n            davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n\r\n            /* wait for the output thread to release some frames */\r\n            davs2_sleep_ms(1);\r\n\r\n            /* check it again */\r\n            davs2_thread_mutex_lock(&mgr->mutex_mgr);\r\n        }\r\n\r\n        init_fdec(h, pts, dts);\r\n\r\n        if (h->i_frame_type == AVS2_S_SLICE) {\r\n            int num_in_spu = h->i_width_in_spu * h->i_height_in_spu;\r\n\r\n            for (i = 0; i < mgr->dpbsize; i++) {\r\n                memset(dpb[i]->mvbuf, 0, num_in_spu * sizeof(mv_t));\r\n                memset(dpb[i]->refbuf, 0, num_in_spu * sizeof(int8_t));\r\n            }\r\n        }\r\n    }\r\n\r\n    davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n\r\n    return 0;\r\n\r\nfail:\r\n\r\n    davs2_log(NULL, DAVS2_LOG_ERROR, \"Failed to decode frame <COI: %d, POC: %d>\\n\", h->i_coi, h->i_poc);\r\n    davs2_thread_mutex_unlock(&mgr->mutex_mgr);\r\n\r\n    task_release_frames(h);\r\n\r\n    return -1;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint parse_header(davs2_t *h, davs2_bs_t *p_bs)\r\n{\r\n    const uint8_t *data   = p_bs->p_stream;\r\n    int           *bitpos = &p_bs->i_bit_pos;\r\n    int            
len    = p_bs->i_stream;\r\n    const uint8_t *p_start_code = 0;\r\n\r\n    if (len <= 4) {\r\n        return -1;  // at least 4 bytes are needed for decoding\r\n    }\r\n\r\n    while ((p_start_code = find_start_code(data + (*bitpos >> 3), len - (*bitpos >> 3))) != 0) {\r\n        uint32_t start_code;\r\n        *bitpos = (int)((p_start_code - data) << 3);\r\n\r\n        if ((*bitpos >> 3) + 4 > len) {\r\n            break;\r\n        }\r\n\r\n        start_code = data[(*bitpos >> 3) + 3];\r\n        switch (start_code) {\r\n        case SC_INTRA_PICTURE:\r\n        case SC_INTER_PICTURE:\r\n            /* update the decoder */\r\n            if (task_decoder_update(h) < 0) {\r\n                return -1;\r\n            }\r\n\r\n            /* decode the picture header */\r\n            if (parse_picture_header(h, p_bs, start_code) < 0) {\r\n                return -1;\r\n            }\r\n\r\n            return 0; /// !!! we only decode one frame for a single call.\r\n\r\n        case SC_SEQUENCE_HEADER:\r\n            davs2_seq_t new_seq;\r\n            /* decode the sequence head */\r\n            if (parse_sequence_header(h->task_info.taskmgr, &new_seq, p_bs) < 0) {\r\n                davs2_log(h, 0, \"Invalid sequence header.\");\r\n                return -1;\r\n            }\r\n            /* update the task manager */\r\n            if (task_set_sequence_head(h->task_info.taskmgr, &new_seq) < 0) {\r\n                return -1;\r\n            }\r\n\r\n            break;\r\n\r\n        case SC_EXTENSION:\r\n        case SC_USER_DATA:\r\n        case SC_SEQUENCE_END:\r\n        case SC_VIDEO_EDIT_CODE:\r\n        default:\r\n            /* skip this unit */\r\n\r\n            /* NOTE: if you want to decode these units, you should avoid */\r\n            /* using a davs2_t structure which will not be updated until a picture header is decoded. */\r\n\r\n            *bitpos += 32;\r\n            break;\r\n        }\r\n    }\r\n\r\n    return 1;\r\n}\r\n"
  },
  {
    "path": "source/common/header.h",
    "content": "/*\r\n * header.h\r\n *\r\n * Description of this file:\r\n *    Header functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_HEADER_H\r\n#define DAVS2_HEADER_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#define parse_slice_header FPFX(parse_slice_header)\r\nvoid parse_slice_header(davs2_t *h, davs2_bs_t *bs);\r\n#define parse_header FPFX(parse_header)\r\nint  parse_header(davs2_t *h, davs2_bs_t *p_bs);\r\n\r\n#define release_one_frame FPFX(release_one_frame)\r\nvoid release_one_frame(davs2_frame_t *frame);\r\n#define task_release_frames FPFX(task_release_frames)\r\nvoid task_release_frames(davs2_t *h);\r\n\r\n#define alloc_picture FPFX(alloc_picture)\r\ndavs2_outpic_t *alloc_picture(int w, int 
h);\r\n#define free_picture FPFX(free_picture)\r\nvoid free_picture(davs2_outpic_t *pic);\r\n\r\n#define destroy_dpb FPFX(destroy_dpb)\r\nvoid destroy_dpb(davs2_mgr_t *mgr);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_HEADER_H\r\n"
  },
  {
    "path": "source/common/intra.cc",
    "content": "/*\r\n * intra.cc\r\n *\r\n * Description of this file:\r\n *    Intra prediction functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"block_info.h\"\r\n#include \"intra.h\"\r\n#include \"vec/intrinsic.h\"\r\n\r\n// ---------------------------------------------------------------------------\r\n// disable warning\r\n#if defined(_MSC_VER) || defined(__ICL)\r\n#pragma warning(disable: 4100)  // unreferenced formal parameter\r\n#endif\r\n\r\n/*\r\n * ===========================================================================\r\n * global & local variable defines\r\n * ===========================================================================\r\n */\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic const int8_t g_aucXYflg[NUM_INTRA_MODE] = {\r\n    0, 0, 0, 0, 0,\r\n    0, 0, 0, 0, 0,\r\n    0, 0, 0, 0, 0,\r\n    0, 0, 0, 0, 0,\r\n    0, 0, 0, 0, 0,\r\n    1, 1, 1, 1, 1,\r\n    1, 1, 1\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int8_t tab_auc_dir_dx[NUM_INTRA_MODE] = {\r\n     0,  0,  0, 11,  2,\r\n    11,  1,  8,  1,  4,\r\n     1,  1,  0,  1,  1,\r\n     4,  1,  8,  1, 11,\r\n     2, 11,  4,  8,  0,\r\n     8,  4, 11,  2, 11,\r\n     1,  8,  1\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int8_t tab_auc_dir_dy[NUM_INTRA_MODE] = {\r\n     0,   0,   0, -4,  -1,\r\n    -8,  -1, -11, -2, -11,\r\n    -4,  -8,   0,  8,   4,\r\n    11,   2,  11,  1,   8,\r\n     1,   4,   1,  1,   0,\r\n    -1,  -1,  -4, -1,  -8,\r\n    -1, -11,  -2\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int8_t g_aucSign[NUM_INTRA_MODE] = {\r\n     0,  0,  0, -1, -1,\r\n    -1, -1, -1, -1, -1,\r\n    -1, -1,  0,  1,  1,\r\n     1,  1,  1,  1,  1,\r\n     1,  1,  1,  1,  0,\r\n    -1, -1, -1, -1, -1,\r\n    -1, -1, -1\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int8_t tab_auc_dir_dxdy[2][NUM_INTRA_MODE][2] = {\r\n    {\r\n        // dx/dy\r\n        {  0, 0 }, {  0, 0 }, {  0, 0 }, { 11, 2 }, {  2, 0 },\r\n        { 11, 3 }, {  1, 0 }, { 93, 7 }, {  1, 1 }, { 93, 8 },\r\n        {  1, 2 }, {  1, 3 }, {  0, 0 }, {  1, 3 }, {  1, 2 },\r\n        { 93, 8 }, {  1, 1 }, { 93, 7 }, {  1, 0 }, { 11, 3 },\r\n        {  2, 0 }, { 11, 2 }, {  4, 0 }, {  8, 0 }, {  0, 0 },\r\n        {  8, 0 }, {  4, 0 }, { 11, 2 }, {  2, 0 }, { 11, 3 },\r\n        {  1, 0 }, { 93, 7 }, {  1, 1 },\r\n    }, {\r\n        // dy/dx\r\n        {  0, 0 }, { 
 0, 0 }, {  0, 0 }, { 93, 8 }, {  1, 1 },\r\n        { 93, 7 }, {  1, 0 }, { 11, 3 }, {  2, 0 }, { 11, 2 },\r\n        {  4, 0 }, {  8, 0 }, {  0, 0 }, {  8, 0 }, {  4, 0 },\r\n        { 11, 2 }, {  2, 0 }, { 11, 3 }, {  1, 0 }, { 93, 7 },\r\n        {  1, 1 }, { 93, 8 }, {  1, 2 }, {  1, 3 }, {  0, 0 },\r\n        {  1, 3 }, {  1, 2 }, { 93, 8 }, {  1, 1 }, { 93, 7 },\r\n        {  1, 0 }, { 11, 3 }, {  2, 0 }\r\n    }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int8_t tab_log2size[MAX_CU_SIZE + 1] = {\r\n    -1, -1, -1, -1,  2, -1, -1, -1,\r\n     3, -1, -1, -1, -1, -1, -1, -1,\r\n     4, -1, -1, -1, -1, -1, -1, -1,\r\n    -1, -1, -1, -1, -1, -1, -1, -1,\r\n     5, -1, -1, -1, -1, -1, -1, -1,\r\n    -1, -1, -1, -1, -1, -1, -1, -1,\r\n    -1, -1, -1, -1, -1, -1, -1, -1,\r\n    -1, -1, -1, -1, -1, -1, -1, -1,\r\n    6\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int8_t tab_DL_Avail64[16 * 16] = {\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0\r\n};\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nconst int8_t tab_DL_Avail32[8 * 8] = {\r\n    1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 1, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 0, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 0, 0, 0, 1, 0, 0, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0,\r\n    0, 0, 0, 0, 0, 0, 0, 0\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int8_t tab_DL_Avail16[4 * 4] = {\r\n    1, 0, 1, 0,\r\n    1, 0, 0, 0,\r\n    1, 0, 1, 0,\r\n    0, 0, 0, 0\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int8_t tab_DL_Avail8[2 * 2] = {\r\n    1, 0,\r\n    0, 0\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int8_t tab_TR_Avail64[16 * 16] = {\r\n    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int8_t tab_TR_Avail32[8 * 8] = {\r\n    // 0: 8 1:16 2: 32  pu size\r\n    1, 1, 1, 1, 1, 1, 1, 1,\r\n    1, 0, 
1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 0, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 1, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0,\r\n    1, 1, 1, 0, 1, 1, 1, 0,\r\n    1, 0, 1, 0, 1, 0, 1, 0\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int8_t tab_TR_Avail16[4 * 4] = {\r\n    1, 1, 1, 1,\r\n    1, 0, 1, 0,\r\n    1, 1, 1, 0,\r\n    1, 0, 1, 0\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int8_t tab_TR_Avail8[2 * 2] = {\r\n    1, 1,\r\n    1, 0\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int8_t *tab_DL_Avails[MAX_CU_SIZE_IN_BIT + 1] = {\r\n    NULL, NULL, NULL, tab_DL_Avail8, tab_DL_Avail16, tab_DL_Avail32, tab_DL_Avail64\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int8_t *tab_TR_Avails[MAX_CU_SIZE_IN_BIT + 1] = {\r\n    NULL, NULL, NULL, tab_TR_Avail8, tab_TR_Avail16, tab_TR_Avail32, tab_TR_Avail64\r\n};\r\n\r\n\r\n/* records the sample bit depth for intra prediction\r\n */\r\n\r\n/**\r\n * ===========================================================================\r\n * local function definition\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE int is_block_available(davs2_t *h, int x_4x4, int y_4x4, int dx_4x4, int dy_4x4, int cur_slice_idx)\r\n{\r\n    int x2_4x4 = x_4x4 + dx_4x4;\r\n    int y2_4x4 = y_4x4 + dy_4x4;\r\n\r\n    if (x2_4x4 < 0 || y2_4x4 < 0 || x2_4x4 >= h->i_width_in_spu || y2_4x4 >= h->i_height_in_spu) {\r\n        return 0;\r\n    } else {\r\n        return cur_slice_idx == h->scu_data[(y2_4x4 >> 1) * h->i_width_in_scu + (x2_4x4 >> 1)].i_slice_nr;\r\n    }\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nuint32_t get_intra_neighbors(davs2_t *h, int x_4x4, int y_4x4, int bsx, int bsy, int cur_slice_idx)\r\n{\r\n    /* 1. check whether the neighboring blocks belong to the same slice */\r\n    int b_LEFT      = is_block_available(h, x_4x4, y_4x4, -1, 0, cur_slice_idx);\r\n    int b_TOP       = is_block_available(h, x_4x4, y_4x4,  0, -1, cur_slice_idx);\r\n    int b_TOP_LEFT  = is_block_available(h, x_4x4, y_4x4, -1, -1, cur_slice_idx);\r\n    int b_LEFT_DOWN = is_block_available(h, x_4x4, y_4x4, -1, (bsy >> 1) - 1, cur_slice_idx);  // (bsy >> MIN_PU_SIZE_IN_BIT << 1)\r\n    int b_TOP_RIGHT = is_block_available(h, x_4x4, y_4x4, (bsx >> 1) - 1, -1, cur_slice_idx);  // (bsx >> MIN_PU_SIZE_IN_BIT << 1)\r\n\r\n    int leftdown;\r\n    int upright;\r\n    int log2_lcu_size_in_spu = (h->i_lcu_level - B4X4_IN_BIT);\r\n    int i_lcu_mask = (1 << log2_lcu_size_in_spu) - 1;\r\n\r\n    /* 2. check whether the neighboring blocks are reconstructed before the current block */\r\n    x_4x4 = x_4x4 & i_lcu_mask;\r\n    y_4x4 = y_4x4 & i_lcu_mask;\r\n\r\n    leftdown = h->p_tab_DL_avail[((y_4x4 + (bsy >> 2) - 1) << log2_lcu_size_in_spu) + (x_4x4)];\r\n    upright  = h->p_tab_TR_avail[((y_4x4) << log2_lcu_size_in_spu) + (x_4x4 + (bsx >> 2) - 1)];\r\n\r\n    b_LEFT_DOWN = b_LEFT_DOWN && leftdown;\r\n    b_TOP_RIGHT = b_TOP_RIGHT && upright;\r\n\r\n    return (b_LEFT << MD_I_LEFT) | (b_TOP << MD_I_TOP) | (b_TOP_LEFT << MD_I_TOP_LEFT) |\r\n        (b_TOP_RIGHT << MD_I_TOP_RIGHT) | (b_LEFT_DOWN << MD_I_LEFT_DOWN);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void ALWAYS_INLINE mem_repeat_p(pel_t *dst, pel_t val, size_t num)\r\n{\r\n    while (num--) {\r\n        *dst++ = val;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void ALWAYS_INLINE memcpy_vh_pp_c(pel_t *dst, pel_t *src, int i_src, size_t num)\r\n{\r\n    while (num--) {\r\n        *dst++ = *src;\r\n        src += 
i_src;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ver_c(pel_t *src, pel_t *dst, int i_dst, int mode, int width, int height)\r\n{\r\n    pel_t *p_src = src + 1;\r\n    int y;\r\n\r\n    UNUSED_PARAMETER(mode);\r\n\r\n    for (y = height; y != 0; y--) {\r\n        memcpy(dst, p_src, width * sizeof(pel_t));\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_hor_c(pel_t *src, pel_t *dst, int i_dst, int mode, int width, int height)\r\n{\r\n    pel_t *p_src = src - 1;\r\n    int x, y;\r\n\r\n    UNUSED_PARAMETER(mode);\r\n\r\n    for (y = 0; y < height; y++) {\r\n        for (x = 0; x < width; x++) {\r\n            dst[x] = p_src[-y];\r\n        }\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_dc_c(pel_t *src, pel_t *dst, int i_dst, int mode, int width, int height)\r\n{\r\n    int b_top  = mode >> 8;\r\n    int b_left = mode & 0xFF;\r\n    pel_t *p_src = src - 1;\r\n    int dc_value = 0;\r\n    int x, y;\r\n\r\n    /* get DC value */\r\n    if (b_left) {\r\n        for (y = 0; y < height; y++) {\r\n            dc_value += p_src[-y];\r\n        }\r\n\r\n        p_src = src + 1;\r\n        if (b_top) {\r\n            for (x = 0; x < width; x++) {\r\n                dc_value += p_src[x];\r\n            }\r\n\r\n            dc_value += ((width + height) >> 1);\r\n            dc_value = (dc_value * (512 / (width + height))) >> 9;\r\n        } else {\r\n            dc_value += height / 2;\r\n            dc_value /= height;\r\n        }\r\n    } else {\r\n        p_src = src + 1;\r\n        if (b_top) {\r\n            for (x = 0; x < width; x++) {\r\n                dc_value += p_src[x];\r\n            }\r\n\r\n            dc_value += width / 2;\r\n            dc_value /= 
width;\r\n        } else {\r\n            dc_value = 1 << (g_bit_depth - 1);\r\n        }\r\n    }\r\n\r\n    /* fill the block */\r\n    x        = (1 << g_bit_depth) - 1;     /* max pixel value */\r\n    dc_value = DAVS2_CLIP3(0, x, dc_value);\r\n    for (y = 0; y < height; y++) {\r\n        for (x = 0; x < width; x++) {\r\n            dst[x] = (pel_t)dc_value;\r\n        }\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_plane_c(pel_t *src, pel_t *dst, int i_dst, int mode, int width, int height)\r\n{\r\n    int ib_mult[5] = { 13, 17, 5, 11, 23 };\r\n    int ib_shift[5] = { 7, 10, 11, 15, 19 };\r\n    int im_h = ib_mult [tab_log2size[width ] - 2];\r\n    int im_v = ib_mult [tab_log2size[height] - 2];\r\n    int is_h = ib_shift[tab_log2size[width ] - 2];\r\n    int is_v = ib_shift[tab_log2size[height] - 2];\r\n    int iW2 = width >> 1;\r\n    int iH2 = height >> 1;\r\n    int iH = 0;\r\n    int iV = 0;\r\n    int iA, iB, iC;\r\n    int x, y;\r\n    int iTmp, iTmp2;\r\n    int max_val = (1 << g_bit_depth) - 1;\r\n    pel_t *p_src;\r\n\r\n    UNUSED_PARAMETER(mode);\r\n\r\n    p_src = src + 1;\r\n    p_src += (iW2 - 1);\r\n    for (x = 1; x < iW2 + 1; x++) {\r\n        iH += x * (p_src[x] - p_src[-x]);\r\n    }\r\n\r\n    p_src = src - 1;\r\n    p_src -= (iH2 - 1);\r\n    for (y = 1; y < iH2 + 1; y++) {\r\n        iV += y * (p_src[-y] - p_src[y]);\r\n    }\r\n\r\n    p_src = src;\r\n    iA = (p_src[-1 - (height - 1)] + p_src[1 + width - 1]) << 4;\r\n    iB = ((iH << 5) * im_h + (1 << (is_h - 1))) >> is_h;\r\n    iC = ((iV << 5) * im_v + (1 << (is_v - 1))) >> is_v;\r\n\r\n    iTmp = iA - (iH2 - 1) * iC - (iW2 - 1) * iB + 16;\r\n    for (y = 0; y < height; y++) {\r\n        iTmp2 = iTmp;\r\n        for (x = 0; x < width; x++) {\r\n            dst[x] = (pel_t)DAVS2_CLIP3(0, max_val, iTmp2 >> 5);\r\n            iTmp2 += iB;\r\n        }\r\n\r\n        dst += 
i_dst;\r\n        iTmp += iC;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_bilinear_c(pel_t *src, pel_t *dst, int i_dst, int mode, int width, int height)\r\n{\r\n    itr_t pTop[MAX_CU_SIZE];\r\n    itr_t pLeft[MAX_CU_SIZE];\r\n    itr_t pT[MAX_CU_SIZE];\r\n    itr_t pL[MAX_CU_SIZE];\r\n    itr_t wy[MAX_CU_SIZE];\r\n    int ishift_x  = tab_log2size[width];\r\n    int ishift_y  = tab_log2size[height];\r\n    int ishift    = DAVS2_MIN(ishift_x, ishift_y);\r\n    int ishift_xy = ishift_x + ishift_y + 1;\r\n    int offset    = 1 << (ishift_x + ishift_y);\r\n    int a, b, c, w, wxy, t;\r\n    int predx;\r\n    int x, y;\r\n    int max_value = (1 << g_bit_depth) - 1;\r\n\r\n    UNUSED_PARAMETER(mode);\r\n\r\n    for (x = 0; x < width; x++) {\r\n        pTop[x] = src[1 + x];\r\n    }\r\n\r\n    for (y = 0; y < height; y++) {\r\n        pLeft[y] = src[-1 - y];\r\n    }\r\n\r\n    a = pTop[width - 1];\r\n    b = pLeft[height - 1];\r\n    c = (width == height) ? 
(a + b + 1) >> 1 : (((a << ishift_x) + (b << ishift_y)) * 13 + (1 << (ishift + 5))) >> (ishift + 6);\r\n    w = (c << 1) - a - b;\r\n\r\n    for (x = 0; x < width; x++) {\r\n        pT[x] = (itr_t)(b - pTop[x]);\r\n        pTop[x] <<= ishift_y;\r\n    }\r\n    t = 0;\r\n    for (y = 0; y < height; y++) {\r\n        pL[y] = (itr_t)(a - pLeft[y]);\r\n        pLeft[y] <<= ishift_x;\r\n        wy[y] = (itr_t)t;\r\n        t += w;\r\n    }\r\n\r\n    for (y = 0; y < height; y++) {\r\n        predx = pLeft[y];\r\n        wxy = -wy[y];\r\n\r\n        for (x = 0; x < width; x++) {\r\n            predx += pL[y];\r\n            wxy += wy[y];\r\n            pTop[x] += pT[x];\r\n            dst[x] = (pel_t)DAVS2_CLIP3(0, max_value, (((predx << ishift_y) + (pTop[x] << ishift_x) + wxy + offset) >> ishift_xy));\r\n        }\r\n\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic int get_context_pixel(int mode, int uiXYflag, int iTempD, int *offset)\r\n{\r\n    int imult = tab_auc_dir_dxdy[uiXYflag][mode][0];\r\n    int ishift = tab_auc_dir_dxdy[uiXYflag][mode][1];\r\n    int iTempDn = iTempD * imult >> ishift;\r\n\r\n    *offset = ((iTempD * imult * 32) >> ishift) - iTempDn * 32;\r\n\r\n    return iTempDn;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int iDx = tab_auc_dir_dx[dir_mode];\r\n    int iDy = tab_auc_dir_dy[dir_mode];\r\n#if BUGFIX_PREDICTION_INTRA\r\n    int iX;\r\n#else\r\n    int top_width = bsx - iDx;\r\n    int iW2 = (bsx << 1) - 1;\r\n    int iX, idx;\r\n#endif\r\n    int c1, c2, c3, c4;\r\n    int i, j;\r\n    pel_t *dst_base = dst + iDy * i_dst + iDx;\r\n\r\n    for (j = 0; j < bsy; j++, iDy++) {\r\n        iX = get_context_pixel(dir_mode, 0, j + 1, &c4);\r\n        c1 = 32 - c4;\r\n        c2 = 64 - c4;\r\n    
    c3 = 32 + c4;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n        i = 0;\r\n#else\r\n        if (iDy >= 0 && top_width > 0) {\r\n            memcpy(dst, dst_base, top_width * sizeof(pel_t));\r\n            i = top_width;\r\n            iX += top_width;\r\n        } else {\r\n            i = 0;\r\n        }\r\n#endif\r\n\r\n        for (; i < bsx; i++) {\r\n#if BUGFIX_PREDICTION_INTRA\r\n            dst[i] = (pel_t)((src[iX] * c1 + src[iX + 1] * c2 + src[iX + 2] * c3 + src[iX + 3] * c4 + 64) >> 7);\r\n#else\r\n            idx = DAVS2_MIN(iW2, iX);\r\n            dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n#endif\r\n            iX++;\r\n        }\r\n\r\n        dst_base += i_dst;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_y_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int offsets[64];\r\n    int xsteps[64];\r\n    int iDx = tab_auc_dir_dx[dir_mode];\r\n    int iDy = tab_auc_dir_dy[dir_mode];\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int iHeight2 = 1 - (bsy << 1);\r\n    int top_width = bsx - iDx;\r\n#endif\r\n    int i, j;\r\n    int offset;\r\n    int iY;\r\n    pel_t *dst_base = dst + iDy * i_dst + iDx;\r\n\r\n    for (i = 0; i < bsx; i++) {\r\n        xsteps[i] = get_context_pixel(dir_mode, 1, i + 1, &offsets[i]);\r\n    }\r\n\r\n    for (j = 0; j < bsy; j++) {\r\n        for (i = 0; i < bsx; i++) {\r\n#if !BUGFIX_PREDICTION_INTRA\r\n            if (j >= -iDy && i < top_width) {\r\n                dst[i] = dst_base[i];\r\n            } else {\r\n#endif\r\n                int idx;\r\n                iY = j + xsteps[i];\r\n#if BUGFIX_PREDICTION_INTRA\r\n                idx = -iY;\r\n#else\r\n                idx = DAVS2_MAX(iHeight2, -iY);\r\n#endif\r\n                offset = offsets[i];\r\n                dst[i] = (pel_t)((src[idx] * (32 - offset) + 
src[idx - 1] * (64 - offset) + src[idx - 2] * (32 + offset) + src[idx - 3] * offset + 64) >> 7);\r\n#if !BUGFIX_PREDICTION_INTRA\r\n            }\r\n#endif\r\n        }\r\n        dst_base += i_dst;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_xy_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(int xoffsets[64]);\r\n    ALIGN16(int xsteps[64]);\r\n    const int iDx = tab_auc_dir_dx[dir_mode];\r\n    const int iDy = tab_auc_dir_dy[dir_mode];\r\n    pel_t *dst_base = dst - iDy * i_dst - iDx;\r\n    int i, j, iXx, iYy;\r\n    int offsetx, offsety;\r\n\r\n    for (i = 0; i < bsx; i++) {\r\n        xsteps[i] = get_context_pixel(dir_mode, 1, i + 1, &xoffsets[i]);\r\n    }\r\n\r\n    for (j = 0; j < bsy; j++) {\r\n        iXx = -get_context_pixel(dir_mode, 0, j + 1, &offsetx);\r\n\r\n        for (i = 0; i < bsx; i++) {\r\n#if !BUGFIX_PREDICTION_INTRA\r\n            if (j >= iDy && i >= iDx) {\r\n                dst[i] = dst_base[i];\r\n            } else {\r\n#endif\r\n                iYy = j - xsteps[i];\r\n\r\n                if (iYy <= -1) {\r\n                    dst[i] = (pel_t)((src[ iXx + 2] * (32 - offsetx) + src[ iXx + 1] * (64 - offsetx) + src[ iXx] * (32 + offsetx) + src[ iXx - 1] * offsetx + 64) >> 7);\r\n                } else {\r\n                    offsety = xoffsets[i];\r\n                    dst[i] = (pel_t)((src[-iYy - 2] * (32 - offsety) + src[-iYy - 1] * (64 - offsety) + src[-iYy] * (32 + offsety) + src[-iYy + 1] * offsety + 64) >> 7);\r\n                }\r\n#if !BUGFIX_PREDICTION_INTRA\r\n            }\r\n#endif\r\n            iXx++;\r\n        }\r\n        dst_base += i_dst;\r\n        dst      += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_3_c(pel_t *src, pel_t *dst, int i_dst, int 
dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[(64 + 176) << 2]);\r\n    int line_size = bsx + (bsy >> 2) * 11 - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = STARAVS_MIN(line_size, bsx * 2);\r\n#endif\r\n\r\n    int aligned_line_size = 64 + 176;\r\n    int i_dst4 = i_dst << 2;\r\n    int i;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    pel_t pad1, pad2, pad3, pad4;\r\n#endif\r\n    pel_t *pfirst[4];\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = pfirst[0] + aligned_line_size;\r\n    pfirst[2] = pfirst[1] + aligned_line_size;\r\n    pfirst[3] = pfirst[2] + aligned_line_size;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size; i++, src++) {\r\n#else\r\n    for (i = 0; i < real_size; i++, src++) {\r\n#endif\r\n        pfirst[0][i] = (pel_t)((    src[2] + 5 * src[3] + 7 * src[4] + 3 * src[5] + 8) >> 4);\r\n        pfirst[1][i] = (pel_t)((    src[5] + 3 * src[6] + 3 * src[7] +     src[8] + 4) >> 3);\r\n        pfirst[2][i] = (pel_t)((3 * src[8] + 7 * src[9] + 5 * src[10] +     src[11] + 8) >> 4);\r\n        pfirst[3][i] = (pel_t)((    src[11] + 2 * src[12] +   src[13] + 0 * src[14] + 2) >> 2);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    if (real_size < line_size) {\r\n        int iW2 = (bsx << 1) - 1;\r\n        int j;\r\n\r\n        src -= real_size;\r\n\r\n        pad1 = (pel_t)((    src[iW2] + 5 * src[iW2 + 1] + 7 * src[iW2 + 2] + 3 * src[iW2 + 3] + 8) >> 4);\r\n        pad2 = (pel_t)((    src[iW2] + 3 * src[iW2 + 1] + 3 * src[iW2 + 2] +     src[iW2 + 3] + 4) >> 3);\r\n        pad3 = (pel_t)((3 * src[iW2] + 7 * src[iW2 + 1] + 5 * src[iW2 + 2] +     src[iW2 + 3] + 8) >> 4);\r\n        pad4 = (pel_t)((    src[iW2] + 2 * src[iW2 + 1] +     src[iW2 + 2] + 0 * src[iW2 + 3] + 2) >> 2);\r\n\r\n        for (j = real_size - 1; j > iW2 - 2; j--) {\r\n            pfirst[3][j] = pad4;\r\n            pfirst[2][j] = pad3;\r\n            pfirst[1][j] = pad2;\r\n            pfirst[0][j] = pad1;\r\n        
}\r\n        for (; j > iW2 - 5; j--) {\r\n            pfirst[3][j] = pad4;\r\n            pfirst[2][j] = pad3;\r\n            pfirst[1][j] = pad2;\r\n        }\r\n        for (; j > iW2 - 8; j--) {\r\n            pfirst[3][j] = pad4;\r\n            pfirst[2][j] = pad3;\r\n        }\r\n        for (; j > iW2 - 11; j--) {\r\n            pfirst[3][j] = pad4;\r\n        }\r\n\r\n        for (; i < line_size; i++) {\r\n            pfirst[0][i] = pad1;\r\n            pfirst[1][i] = pad2;\r\n            pfirst[2][i] = pad3;\r\n            pfirst[3][i] = pad4;\r\n        }\r\n    }\r\n#endif\r\n\r\n    bsy >>= 2;\r\n    for (i = 0; i < bsy; i++) {\r\n        memcpy(dst            , pfirst[0] + i * 11, bsx * sizeof(pel_t));\r\n        memcpy(dst +     i_dst, pfirst[1] + i * 11, bsx * sizeof(pel_t));\r\n        memcpy(dst + 2 * i_dst, pfirst[2] + i * 11, bsx * sizeof(pel_t));\r\n        memcpy(dst + 3 * i_dst, pfirst[3] + i * 11, bsx * sizeof(pel_t));\r\n        dst += i_dst4;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_4_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 128]);\r\n    int line_size = bsx + ((bsy - 1) << 1);\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = STARAVS_MIN(line_size, (bsx << 1) - 2);\r\n#endif\r\n    int iHeight2 = bsy << 1;\r\n    int i;\r\n\r\n    src += 3;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size; i++, src++) {\r\n#else\r\n    for (i = 0; i < real_size; i++, src++) {\r\n#endif\r\n        first_line[i] = (pel_t)((src[-1] + src[0] * 2 + src[1] + 2) >> 2);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    for (; i < line_size; i++) {\r\n        first_line[i] = first_line[real_size - 1];\r\n    }\r\n#endif\r\n\r\n    for (i = 0; i < iHeight2; i += 2) {\r\n        memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n        dst += i_dst;\r\n    
}\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_5_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    if (((bsy > 4) && (bsx > 8))) {\r\n        ALIGN16(pel_t first_line[(64 + 80) << 3]);\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        int iW2 = bsx * 2 - 1;\r\n#endif\r\n        int line_size = bsx + (((bsy - 8) * 11) >> 3);\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        int real_size = STARAVS_MIN(line_size, iW2 + 1);\r\n#endif\r\n        int aligned_line_size = ((line_size + 15) >> 4) << 4;\r\n        pel_t *pfirst[8];\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        pel_t *src_org = src;\r\n#endif\r\n\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n        pfirst[4] = pfirst[3] + aligned_line_size;\r\n        pfirst[5] = pfirst[4] + aligned_line_size;\r\n        pfirst[6] = pfirst[5] + aligned_line_size;\r\n        pfirst[7] = pfirst[6] + aligned_line_size;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n        for (i = 0; i < line_size; src++, i++) {\r\n#else\r\n        for (i = 0; i < real_size; src++, i++) {\r\n#endif\r\n            pfirst[0][i] = (pel_t)((5 * src[1] + 13 * src[2] + 11 * src[3] + 3 * src[4] + 16) >> 5);\r\n            pfirst[1][i] = (pel_t)((    src[2] +  5 * src[3] +  7 * src[4] + 3 * src[5] + 8) >> 4);\r\n            pfirst[2][i] = (pel_t)((7 * src[4] + 15 * src[5] +  9 * src[6] +     src[7] + 16) >> 5);\r\n            pfirst[3][i] = (pel_t)((    src[5] +  3 * src[6] +  3 * src[7] +    
 src[8] + 4) >> 3);\r\n\r\n            pfirst[4][i] = (pel_t)((     src[6] +  9 * src[7]  + 15 * src[8]  +  7 * src[9]  + 16) >> 5);\r\n            pfirst[5][i] = (pel_t)(( 3 * src[8] +  7 * src[9]  +  5 * src[10] +      src[11] +  8) >> 4);\r\n            pfirst[6][i] = (pel_t)(( 3 * src[9] + 11 * src[10] + 13 * src[11] +  5 * src[12] + 16) >> 5);\r\n            pfirst[7][i] = (pel_t)((    src[11] +  2 * src[12] +      src[13]                 + 2) >> 2);\r\n        }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        //padding\r\n        if (((real_size - 1) + 11) > iW2) {\r\n            src = src_org + iW2;\r\n            pel_t pad1 = pfirst[0][iW2 - 1];\r\n            pel_t pad2 = pfirst[1][iW2 - 2];\r\n            pel_t pad3 = pfirst[2][iW2 - 4];\r\n            pel_t pad4 = pfirst[3][iW2 - 5];\r\n\r\n            pel_t pad5 = pfirst[4][iW2 - 6];\r\n            pel_t pad6 = pfirst[5][iW2 - 8];\r\n            pel_t pad7 = pfirst[6][iW2 - 9];\r\n            pel_t pad8 = pfirst[7][iW2 - 11];\r\n\r\n            int start1 = iW2;\r\n            int start2 = iW2 - 1;\r\n            int start3 = iW2 - 3;\r\n            int start4 = iW2 - 4;\r\n            int start5 = iW2 - 5;\r\n            int start6 = iW2 - 7;\r\n            int start7 = iW2 - 8;\r\n            int start8 = iW2 - 10;\r\n\r\n            for (i = start1; i < line_size; i++) {\r\n                pfirst[0][i] = pad1;\r\n            }\r\n            for (i = start2; i < line_size; i++) {\r\n                pfirst[1][i] = pad2;\r\n            }\r\n            for (i = start3; i < line_size; i++) {\r\n                pfirst[2][i] = pad3;\r\n            }\r\n            for (i = start4; i < line_size; i++) {\r\n                pfirst[3][i] = pad4;\r\n            }\r\n            for (i = start5; i < line_size; i++) {\r\n                pfirst[4][i] = pad5;\r\n            }\r\n            for (i = start6; i < line_size; i++) {\r\n                pfirst[5][i] = pad6;\r\n            }\r\n            for (i = 
start7; i < line_size; i++) {\r\n                pfirst[6][i] = pad7;\r\n            }\r\n            for (i = start8; i < line_size; i++) {\r\n                pfirst[7][i] = pad8;\r\n            }\r\n        }\r\n#endif\r\n\r\n        bsy  >>= 3;\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst1, pfirst[0] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst2, pfirst[1] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst3, pfirst[2] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst4, pfirst[3] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst5, pfirst[4] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst6, pfirst[5] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst7, pfirst[6] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst8, pfirst[7] + i * 11, bsx * sizeof(pel_t));\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n        }\r\n    } else if (bsx == 16) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n\r\n        for (i = 0; i < bsx; i++, src++) {\r\n            dst1[i]  = (pel_t)((5 * src[1] + 13 * src[2] + 11 * src[3] + 3 * src[4] + 16) >> 5);\r\n            dst2[i]  = (pel_t)((    src[2] +  5 * src[3] +  7 * src[4] + 3 * src[5] + 8) >> 4);\r\n            dst3[i]  = (pel_t)((7 * src[4] + 15 * src[5] +  9 * src[6] +     src[7] + 16) >> 5);\r\n            dst4[i]  = (pel_t)((    src[5] +  3 * src[6] +  3 * src[7] +     src[8] + 4) >> 3);\r\n        }\r\n    } else if (bsx == 8) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + 
i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n\r\n        for (i = 0; i < 8; src++, i++) {\r\n            dst1[i]  = (pel_t)((5 * src[1] + 13 * src[2] + 11 * src[3] + 3 * src[4] + 16) >> 5);\r\n            dst2[i]  = (pel_t)((    src[2] +  5 * src[3] +  7 * src[4] + 3 * src[5] + 8) >> 4);\r\n            dst3[i]  = (pel_t)((7 * src[4] + 15 * src[5] +  9 * src[6] +     src[7] + 16) >> 5);\r\n            dst4[i]  = (pel_t)((    src[5] +  3 * src[6] +  3 * src[7] +     src[8] + 4) >> 3);\r\n\r\n            dst5[i] = (pel_t)((     src[6] +  9 * src[7]  + 15 * src[8]  +  7 * src[9]  + 16) >> 5);\r\n            dst6[i] = (pel_t)(( 3 * src[8] +  7 * src[9]  +  5 * src[10] +      src[11] + 8) >> 4);\r\n            dst7[i] = (pel_t)(( 3 * src[9] + 11 * src[10] + 13 * src[11] +  5 * src[12] + 16) >> 5);\r\n            dst8[i] = (pel_t)((    src[11] +  2 * src[12] +      src[13]                 + 2) >> 2);\r\n        }\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        dst7[7] = dst7[6];\r\n        dst8[7] = dst8[4];\r\n        dst8[6] = dst8[4];\r\n        dst8[5] = dst8[4];\r\n#endif\r\n        if (bsy == 32) {\r\n            //src -> 8,src[8] -> 16\r\n#if BUGFIX_PREDICTION_INTRA\r\n            pel_t pad1 = src[8];\r\n            dst1 = dst8 + i_dst;\r\n            int j;\r\n            for (j = 0; j < 24; j++) {\r\n                for (i = 0; i < 8; i++) {\r\n                    dst1[i] = pad1;\r\n                }\r\n                dst1 += i_dst;\r\n            }\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n\r\n            src += 4;\r\n            dst1[0] = (pel_t)((5 * src[0] + 13 * src[1] + 11 * src[2] + 3 * src[3] + 16) >> 5);\r\n            dst1[1] = (pel_t)((5 * src[1] + 13 * src[2] + 11 * src[3] + 3 * src[4] + 16) >> 5);\r\n            dst1[2] = (pel_t)((5 * src[2] + 13 * src[3] + 11 * src[4] + 3 * src[5] + 16) >> 5);\r\n     
       dst1[3] = (pel_t)((5 * src[3] + 13 * src[4] + 11 * src[5] + 3 * src[6] + 16) >> 5);\r\n            dst2[0] = (pel_t)((src[1] + 5 * src[2] + 7 * src[3] + 3 * src[4] + 8) >> 4);\r\n            dst2[1] = (pel_t)((src[2] + 5 * src[3] + 7 * src[4] + 3 * src[5] + 8) >> 4);\r\n            dst2[2] = (pel_t)((src[3] + 5 * src[4] + 7 * src[5] + 3 * src[6] + 8) >> 4);\r\n            dst3[0] = (pel_t)((7 * src[3] + 15 * src[4] +  9 * src[5] +     src[6] + 16) >> 5);\r\n#else\r\n            //src -> 8,src[7] -> 15\r\n            pel_t pad1 = (pel_t)((5 * src[7] + 13 * src[8] + 11 * src[9] + 3 * src[10] + 16) >> 5);\r\n            pel_t pad2 = (pel_t)((src[7] + 5 * src[8] + 7 * src[9] + 3 * src[10] + 8) >> 4);\r\n            pel_t pad3 = (pel_t)((7 * src[7] + 15 * src[8] + 9 * src[9] + src[10] + 16) >> 5);\r\n            pel_t pad4 = (pel_t)((src[7] + 3 * src[8] + 3 * src[9] + src[10] + 4) >> 3);\r\n\r\n            pel_t pad5 = (pel_t)((src[7] + 9 * src[8] + 15 * src[9] + 7 * src[10] + 16) >> 5);\r\n            pel_t pad6 = dst6[7];\r\n            pel_t pad7 = dst7[7];\r\n            pel_t pad8 = dst8[7];\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n\r\n            for (i = 0; i < 8; i++) {\r\n                dst1[i] = pad1;\r\n                dst2[i] = pad2;\r\n                dst3[i] = pad3;\r\n                dst4[i] = pad4;\r\n\r\n                dst5[i] = pad5;\r\n                dst6[i] = pad6;\r\n                dst7[i] = pad7;\r\n                dst8[i] = pad8;\r\n            }\r\n            src += 4;\r\n            dst1[0] = (pel_t)((5 * src[0] + 13 * src[1] + 11 * src[2] + 3 * src[3] + 16) >> 5);\r\n            dst1[1] = (pel_t)((5 * src[1] + 13 * src[2] + 11 * src[3] + 3 * src[4] + 16) >> 5);\r\n            dst1[2] 
= (pel_t)((5 * src[2] + 13 * src[3] + 11 * src[4] + 3 * src[5] + 16) >> 5);\r\n            dst2[0] = (pel_t)((    src[1] +  5 * src[2] +  7 * src[3] + 3 * src[4] +  8) >> 4);\r\n            dst2[1] = (pel_t)((    src[2] +  5 * src[3] +  7 * src[4] + 3 * src[5] +  8) >> 4);\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n\r\n            for (i = 0; i < 8; i++) {\r\n                dst1[i] = pad1;\r\n                dst2[i] = pad2;\r\n                dst3[i] = pad3;\r\n                dst4[i] = pad4;\r\n\r\n                dst5[i] = pad5;\r\n                dst6[i] = pad6;\r\n                dst7[i] = pad7;\r\n                dst8[i] = pad8;\r\n            }\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n\r\n            for (i = 0; i < 8; i++) {\r\n                dst1[i] = pad1;\r\n                dst2[i] = pad2;\r\n                dst3[i] = pad3;\r\n                dst4[i] = pad4;\r\n\r\n                dst5[i] = pad5;\r\n                dst6[i] = pad6;\r\n                dst7[i] = pad7;\r\n                dst8[i] = pad8;\r\n            }\r\n#endif\r\n        }\r\n    } else {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n\r\n        for (i = 0; i < 4; i++, src++) {\r\n            dst1[i]  = (pel_t)((5 * src[1] + 13 * src[2] + 11 * src[3] + 3 * src[4] + 16) >> 5);\r\n            dst2[i]  = (pel_t)((    src[2] +  5 * src[3] +  7 * src[4] + 3 * src[5] + 8) >> 4);\r\n            
dst3[i]  = (pel_t)((7 * src[4] + 15 * src[5] +  9 * src[6] +     src[7] + 16) >> 5);\r\n            dst4[i]  = (pel_t)((    src[5] +  3 * src[6] +  3 * src[7] +     src[8] + 4) >> 3);\r\n        }\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        dst4[3] = dst4[2];\r\n#endif\r\n        if (bsy == 16) {\r\n#if BUGFIX_PREDICTION_INTRA\r\n            pel_t *dst5 = dst4 + i_dst;\r\n\r\n            src += 4;\r\n            pel_t pad1 = src[0];\r\n\r\n            int j;\r\n            for (j = 0; j < 12; j++) {\r\n                for (i = 0; i < 4; i++) {\r\n                    dst5[i] = pad1;\r\n                }\r\n                dst5 += i_dst;\r\n            }\r\n            dst5 = dst4 + i_dst;\r\n            dst5[0] = (pel_t)((src[-2] + 9 * src[-1] + 15 * src[0] + 7 * src[1] + 16) >> 5);\r\n            dst5[1] = (pel_t)((src[-1] + 9 * src[ 0] + 15 * src[1] + 7 * src[2] + 16) >> 5);\r\n#else\r\n            pel_t *dst5 = dst4 + i_dst;\r\n            pel_t *dst6 = dst5 + i_dst;\r\n            pel_t *dst7 = dst6 + i_dst;\r\n            pel_t *dst8 = dst7 + i_dst;\r\n\r\n            src += 3;\r\n            pel_t pad1 = (pel_t)((5 * src[0] + 13 * src[1] + 11 * src[2] + 3 * src[3] + 16) >> 5);\r\n            pel_t pad2 = (pel_t)((    src[0] +  5 * src[1] +  7 * src[2] + 3 * src[3] + 8) >> 4);\r\n            pel_t pad3 = (pel_t)((7 * src[0] + 15 * src[1] +  9 * src[2] + 1 * src[3] + 16) >> 5);\r\n            pel_t pad4 = dst4[3];\r\n\r\n            pel_t pad5 = (pel_t)((    src[0] +  9 * src[1] + 15 * src[2] + 7 * src[3] + 16) >> 5);\r\n            pel_t pad6 = (pel_t)((3 * src[0] +  7 * src[1] +  5 * src[2] +     src[3] + 8) >> 4);\r\n            pel_t pad7 = (pel_t)((3 * src[0] + 11 * src[1] + 13 * src[2] + 5 * src[3] + 16) >> 5);\r\n            pel_t pad8 = (pel_t)((    src[0] +  2 * src[1] +      src[2]              + 2) >> 2);\r\n\r\n            for (i = 0; i < 4; i++) {\r\n                dst5[i] = pad5;\r\n                dst6[i] = pad6;\r\n                dst7[i] = 
pad7;\r\n                dst8[i] = pad8;\r\n            }\r\n            dst5[0] = (pel_t)((src[-1] + 9 * src[0] + 15 * src[1] + 7 * src[2] + 16) >> 5);\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n\r\n            for (i = 0; i < 4; i++) {\r\n                dst1[i] = pad1;\r\n                dst2[i] = pad2;\r\n                dst3[i] = pad3;\r\n                dst4[i] = pad4;\r\n                dst5[i] = pad5;\r\n                dst6[i] = pad6;\r\n                dst7[i] = pad7;\r\n                dst8[i] = pad8;\r\n            }\r\n#endif\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_6_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 64]);\r\n    int line_size = bsx + bsy - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = STARAVS_MIN(line_size, (bsx << 1) - 1);\r\n#endif\r\n    int i;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    pel_t pad;\r\n#endif\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size; i++, src++) {\r\n#else\r\n    for (i = 0; i < real_size; i++, src++) {\r\n#endif\r\n        first_line[i] = (pel_t)((src[1] + (src[2] << 1) + src[3] + 2) >> 2);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    pad = first_line[real_size - 1];\r\n    for (; i < line_size; i++) {\r\n        first_line[i] = pad;\r\n    }\r\n#endif\r\n\r\n    for (i = 0; i < bsy; i++) {\r\n        memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_7_c(pel_t *src, pel_t *dst, 
int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    pel_t *dst1 = dst;\r\n    pel_t *dst2 = dst1 + i_dst;\r\n    pel_t *dst3 = dst2 + i_dst;\r\n    pel_t *dst4 = dst3 + i_dst;\r\n    if (bsy == 4) {\r\n        for (i = 0; i < bsx; src++, i++){\r\n            dst1[i] = (pel_t)((src[0] *  9 + src[1] * 41 + src[2] * 55 + src[3] * 23 + 64) >> 7);\r\n            dst2[i] = (pel_t)((src[1] *  9 + src[2] * 25 + src[3] * 23 + src[4] *  7 + 32) >> 6);\r\n            dst3[i] = (pel_t)((src[2] * 27 + src[3] * 59 + src[4] * 37 + src[5] *  5 + 64) >> 7);\r\n            dst4[i] = (pel_t)((src[2] *  3 + src[3] * 35 + src[4] * 61 + src[5] * 29 + 64) >> 7);\r\n        }\r\n    } else if (bsy == 8) {\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n        for (i = 0; i < bsx; src++, i++){\r\n            dst1[i] = (pel_t)((src[0] *  9 + src[1] * 41 + src[2] * 55 + src[3] * 23 + 64) >> 7);\r\n            dst2[i] = (pel_t)((src[1] *  9 + src[2] * 25 + src[3] * 23 + src[4] *  7 + 32) >> 6);\r\n            dst3[i] = (pel_t)((src[2] * 27 + src[3] * 59 + src[4] * 37 + src[5] *  5 + 64) >> 7);\r\n            dst4[i] = (pel_t)((src[2] *  3 + src[3] * 35 + src[4] * 61 + src[5] * 29 + 64) >> 7);\r\n            dst5[i] = (pel_t)((src[3] *  3 + src[4] * 11 + src[5] * 13 + src[6] *  5 + 16) >> 5);\r\n            dst6[i] = (pel_t)((src[4] * 21 + src[5] * 53 + src[6] * 43 + src[7] * 11 + 64) >> 7);\r\n            dst7[i] = (pel_t)((src[5] * 15 + src[6] * 31 + src[7] * 17 + src[8] + 32)      >> 6);\r\n            dst8[i] = (pel_t)((src[5] *  3 + src[6] * 19 + src[7] * 29 + src[8] * 13 + 32) >> 6);\r\n        }\r\n    } else {\r\n        intra_pred_ang_x_c(src, dst, i_dst, dir_mode, bsx, bsy);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_8_c(pel_t *src, pel_t *dst, int i_dst, int 
dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[2 * (64 + 32)]);\r\n    int line_size = bsx + (bsy >> 1) - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = STARAVS_MIN(line_size, bsx * 2);\r\n#endif\r\n    int aligned_line_size = ((line_size + 15) >> 4) << 4;\r\n    int i_dst2 = i_dst << 1;\r\n    int i;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    pel_t pad1, pad2;\r\n#endif\r\n    pel_t *pfirst[2];\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = first_line + aligned_line_size;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size; i++, src++) {\r\n#else\r\n    for (i = 0; i < real_size; i++, src++) {\r\n#endif\r\n        pfirst[0][i] = (pel_t)((src[0] + (src[1] + src[2]) * 3 + src[3] + 4) >> 3);\r\n        pfirst[1][i] = (pel_t)((src[1] + (src[2] << 1)         + src[3] + 2) >> 2);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    if (real_size < line_size) {\r\n        pfirst[1][real_size - 1] = pfirst[1][real_size - 2];\r\n\r\n        pad1 = pfirst[0][real_size - 1];\r\n        pad2 = pfirst[1][real_size - 1];\r\n        for (; i < line_size; i++) {\r\n            pfirst[0][i] = pad1;\r\n            pfirst[1][i] = pad2;\r\n        }\r\n    }\r\n#endif\r\n\r\n    bsy >>= 1;\r\n    for (i = 0; i < bsy; i++) {\r\n        memcpy(dst        , pfirst[0] + i, bsx * sizeof(pel_t));\r\n        memcpy(dst + i_dst, pfirst[1] + i, bsx * sizeof(pel_t));\r\n        dst += i_dst2;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_9_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    if (bsy > 8){\r\n        intra_pred_ang_x_c(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        /*\r\n        ALIGN16(pel_t first_line[(64 + 32) * 11]);\r\n        int line_size = bsx + (bsy * 93 >> 8) - 1;\r\n        int real_size = STARAVS_MIN(line_size, bsx * 2);\r\n        int aligned_line_size = ((line_size + 31) >> 5) << 
5;\r\n        int i_dst11 = i_dst * 11;\r\n        int i;\r\n        pel_t pad1, pad2, pad3, pad4, pad5, pad6, pad7, pad8, pad9, pad10, pad11;\r\n        pel_t *pfirst[11];\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n        pfirst[4] = pfirst[3] + aligned_line_size;\r\n        pfirst[5] = pfirst[4] + aligned_line_size;\r\n        pfirst[6] = pfirst[5] + aligned_line_size;\r\n        pfirst[7] = pfirst[6] + aligned_line_size;\r\n        pfirst[8] = pfirst[7] + aligned_line_size;\r\n        pfirst[9] = pfirst[8] + aligned_line_size;\r\n        pfirst[10] = pfirst[9] + aligned_line_size;\r\n        for (i = 0; i < real_size; i++, src++) {\r\n            pfirst[0][i] = (pel_t)((21 * src[0] + 53 * src[1] + 43 * src[2] + 11 * src[3] + 64) >> 7);\r\n            pfirst[1][i] = (pel_t)((9 * src[0] + 41 * src[1] + 55 * src[2] + 23 * src[3] + 64) >> 7);\r\n            pfirst[2][i] = (pel_t)((15 * src[1] + 31 * src[2] + 17 * src[3] + 1 * src[4] + 32) >> 6);\r\n            pfirst[3][i] = (pel_t)((9 * src[1] + 25 * src[2] + 23 * src[3] + 7 * src[4] + 32) >> 6);\r\n            pfirst[4][i] = (pel_t)((3 * src[1] + 19 * src[2] + 29 * src[3] + 13 * src[4] + 32) >> 6);\r\n            pfirst[5][i] = (pel_t)((27 * src[2] + 59 * src[3] + 37 * src[4] + 5 * src[5] + 64) >> 7);\r\n            pfirst[6][i] = (pel_t)((15 * src[2] + 47 * src[3] + 49 * src[4] + 17 * src[5] + 64) >> 7);\r\n            pfirst[7][i] = (pel_t)((3 * src[2] + 35 * src[3] + 61 * src[4] + 29 * src[5] + 64) >> 7);\r\n            pfirst[8][i] = (pel_t)((3 * src[3] + 7 * src[4] + 5 * src[5] + 1 * src[6] + 8) >> 4);\r\n            pfirst[9][i] = (pel_t)((3 * src[3] + 11 * src[4] + 13 * src[5] + 5 * src[6] + 16) >> 5);\r\n            pfirst[10][i] = (pel_t)((1 * src[3] + 33 * src[4] + 63 * src[5] + 31 * src[6] + 64) >> 7);\r\n        }\r\n\r\n        // padding\r\n 
       if (real_size < line_size) {\r\n            pfirst[8][real_size - 3] = pfirst[8][real_size - 4];\r\n            pfirst[9][real_size - 3] = pfirst[9][real_size - 4];\r\n            pfirst[10][real_size - 3] = pfirst[10][real_size - 4];\r\n            pfirst[8][real_size - 2] = pfirst[8][real_size - 3];\r\n            pfirst[9][real_size - 2] = pfirst[9][real_size - 3];\r\n            pfirst[10][real_size - 2] = pfirst[10][real_size - 3];\r\n            pfirst[8][real_size - 1] = pfirst[8][real_size - 2];\r\n            pfirst[9][real_size - 1] = pfirst[9][real_size - 2];\r\n            pfirst[10][real_size - 1] = pfirst[10][real_size - 2];\r\n\r\n            pfirst[5][real_size - 2] = pfirst[5][real_size - 3];\r\n            pfirst[6][real_size - 2] = pfirst[6][real_size - 3];\r\n            pfirst[7][real_size - 2] = pfirst[7][real_size - 3];\r\n            pfirst[5][real_size - 1] = pfirst[5][real_size - 2];\r\n            pfirst[6][real_size - 1] = pfirst[6][real_size - 2];\r\n            pfirst[7][real_size - 1] = pfirst[7][real_size - 2];\r\n\r\n            pfirst[2][real_size - 1] = pfirst[2][real_size - 2];\r\n            pfirst[3][real_size - 1] = pfirst[3][real_size - 2];\r\n            pfirst[4][real_size - 1] = pfirst[4][real_size - 2];\r\n\r\n\r\n            pad1 = pfirst[0][real_size - 1];\r\n            pad2 = pfirst[1][real_size - 1];\r\n            pad3 = pfirst[2][real_size - 1];\r\n            pad4 = pfirst[3][real_size - 1];\r\n            pad5 = pfirst[4][real_size - 1];\r\n            pad6 = pfirst[5][real_size - 1];\r\n            pad7 = pfirst[6][real_size - 1];\r\n            pad8 = pfirst[7][real_size - 1];\r\n            pad9 = pfirst[8][real_size - 1];\r\n            pad10 = pfirst[9][real_size - 1];\r\n            pad11 = pfirst[10][real_size - 1];\r\n            for (; i < line_size; i++) {\r\n                pfirst[0][i] = pad1;\r\n                pfirst[1][i] = pad2;\r\n                pfirst[2][i] = pad3;\r\n                
pfirst[3][i] = pad4;\r\n                pfirst[4][i] = pad5;\r\n                pfirst[5][i] = pad6;\r\n                pfirst[6][i] = pad7;\r\n                pfirst[7][i] = pad8;\r\n                pfirst[8][i] = pad9;\r\n                pfirst[9][i] = pad10;\r\n                pfirst[10][i] = pad11;\r\n            }\r\n        }\r\n\r\n        int bsy_b = bsy / 11;\r\n        for (i = 0; i < bsy_b; i++) {\r\n            memcpy(dst, pfirst[0] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + i_dst, pfirst[1] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 2 * i_dst, pfirst[2] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 3 * i_dst, pfirst[3] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 4 * i_dst, pfirst[4] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 5 * i_dst, pfirst[5] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 6 * i_dst, pfirst[6] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 7 * i_dst, pfirst[7] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 8 * i_dst, pfirst[8] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 9 * i_dst, pfirst[9] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 10 * i_dst, pfirst[10] + i, bsx * sizeof(pel_t));\r\n            dst += i_dst11;\r\n        }\r\n        int bsy_r = bsy - bsy_b * 11;\r\n        for (i = 0; i < bsy_r; i++) {\r\n            memcpy(dst, pfirst[i] + bsy_b, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n        */\r\n    } else if (bsy == 8) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n        for (int i = 0; i < bsx; i++, src++) {\r\n            dst1[i] = (pel_t)((21 * src[0] + 53 * src[1] + 43 * src[2] + 11 * src[3] + 64) >> 7);\r\n            dst2[i] = 
(pel_t)((9  * src[0] + 41 * src[1] + 55 * src[2] + 23 * src[3] + 64) >> 7);\r\n            dst3[i] = (pel_t)((15 * src[1] + 31 * src[2] + 17 * src[3] +      src[4] + 32) >> 6);\r\n            dst4[i] = (pel_t)((9  * src[1] + 25 * src[2] + 23 * src[3] + 7  * src[4] + 32) >> 6);\r\n\r\n            dst5[i] = (pel_t)((3  * src[1] + 19 * src[2] + 29 * src[3] + 13 * src[4] + 32) >> 6);\r\n            dst6[i] = (pel_t)((27 * src[2] + 59 * src[3] + 37 * src[4] + 5  * src[5] + 64) >> 7);\r\n            dst7[i] = (pel_t)((15 * src[2] + 47 * src[3] + 49 * src[4] + 17 * src[5] + 64) >> 7);\r\n            dst8[i] = (pel_t)((3  * src[2] + 35 * src[3] + 61 * src[4] + 29 * src[5] + 64) >> 7);\r\n        }\r\n    } else /*if (bsy == 4)*/ {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        for (int i = 0; i < bsx; i++, src++) {\r\n            dst1[i] = (pel_t)((21 * src[0] + 53 * src[1] + 43 * src[2] + 11 * src[3] + 64) >> 7);\r\n            dst2[i] = (pel_t)((9  * src[0] + 41 * src[1] + 55 * src[2] + 23 * src[3] + 64) >> 7);\r\n            dst3[i] = (pel_t)((15 * src[1] + 31 * src[2] + 17 * src[3] +      src[4] + 32) >> 6);\r\n            dst4[i] = (pel_t)((9  * src[1] + 25 * src[2] + 23 * src[3] + 7  * src[4] + 32) >> 6);\r\n        }\r\n    }\r\n\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_10_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    pel_t *dst1 = dst;\r\n    pel_t *dst2 = dst1 + i_dst;\r\n    pel_t *dst3 = dst2 + i_dst;\r\n    pel_t *dst4 = dst3 + i_dst;\r\n    int i;\r\n\r\n    if (bsy != 4) {\r\n        ALIGN16(pel_t first_line[4 * (64 + 16)]);\r\n        int line_size = bsx + bsy / 4 - 1;\r\n        int aligned_line_size = ((line_size + 15) >> 4) << 4;\r\n        pel_t *pfirst[4];\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = 
first_line + aligned_line_size;\r\n        pfirst[2] = first_line + aligned_line_size * 2;\r\n        pfirst[3] = first_line + aligned_line_size * 3;\r\n\r\n        for (i = 0; i < line_size; i++, src++) {\r\n            pfirst[0][i] = (pel_t)((src[0] * 3 +  src[1] * 7 + src[2]  * 5 + src[3]     + 8) >> 4);\r\n            pfirst[1][i] = (pel_t)((src[0]     + (src[1]     + src[2]) * 3 + src[3]     + 4) >> 3);\r\n            pfirst[2][i] = (pel_t)((src[0]     +  src[1] * 5 + src[2]  * 7 + src[3] * 3 + 8) >> 4);\r\n            pfirst[3][i] = (pel_t)((src[1]     +  src[2] * 2 + src[3]                   + 2) >> 2);\r\n        }\r\n\r\n        bsy   >>= 2;\r\n        i_dst <<= 2;\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst1, pfirst[0] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst2, pfirst[1] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst3, pfirst[2] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst4, pfirst[3] + i, bsx * sizeof(pel_t));\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            dst3 += i_dst;\r\n            dst4 += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < bsx; i++, src++) {\r\n            dst1[i] = (pel_t)((src[0] * 3 +  src[1] * 7 + src[2]  * 5 + src[3]     + 8) >> 4);\r\n            dst2[i] = (pel_t)((src[0]     + (src[1]     + src[2]) * 3 + src[3]     + 4) >> 3);\r\n            dst3[i] = (pel_t)((src[0]     +  src[1] * 5 + src[2]  * 7 + src[3] * 3 + 8) >> 4);\r\n            dst4[i] = (pel_t)((src[1]     +  src[2] * 2 + src[3]                   + 2) >> 2);\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_x_11_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    if (bsy > 8) {\r\n        ALIGN16(pel_t first_line[(64 + 16) << 3]);\r\n        int line_size = bsx + (bsy >> 3) - 1;\r\n        int aligned_line_size = ((line_size + 15) >> 4) << 
4;\r\n        int i_dst8 = i_dst << 3;\r\n        pel_t *pfirst[8];\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n        pfirst[4] = pfirst[3] + aligned_line_size;\r\n        pfirst[5] = pfirst[4] + aligned_line_size;\r\n        pfirst[6] = pfirst[5] + aligned_line_size;\r\n        pfirst[7] = pfirst[6] + aligned_line_size;\r\n        for (i = 0; i < line_size; i++, src++) {\r\n            pfirst[0][i] = (pel_t)((7 * src[0] + 15 * src[1] +  9 * src[2] +     src[3] + 16) >> 5);\r\n            pfirst[1][i] = (pel_t)((3 * src[0] +  7 * src[1] +  5 * src[2] +     src[3] +  8) >> 4);\r\n            pfirst[2][i] = (pel_t)((5 * src[0] + 13 * src[1] + 11 * src[2] + 3 * src[3] + 16) >> 5);\r\n            pfirst[3][i] = (pel_t)((    src[0] +  3 * src[1] +  3 * src[2] +     src[3] +  4) >> 3);\r\n\r\n            pfirst[4][i] = (pel_t)((3 * src[0] + 11 * src[1] + 13 * src[2] + 5 * src[3] + 16) >> 5);\r\n            pfirst[5][i] = (pel_t)((    src[0] +  5 * src[1] +  7 * src[2] + 3 * src[3] +  8) >> 4);\r\n            pfirst[6][i] = (pel_t)((    src[0] +  9 * src[1] + 15 * src[2] + 7 * src[3] + 16) >> 5);\r\n            pfirst[7][i] = (pel_t)((    src[1] +  2 * src[2] +      src[3] + 0 * src[4] +  2) >> 2);\r\n        }\r\n\r\n        bsy >>= 3;\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst            , pfirst[0] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst +     i_dst, pfirst[1] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 2 * i_dst, pfirst[2] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 3 * i_dst, pfirst[3] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 4 * i_dst, pfirst[4] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 5 * i_dst, pfirst[5] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + 6 * i_dst, pfirst[6] + i, bsx * sizeof(pel_t));\r\n            
memcpy(dst + 7 * i_dst, pfirst[7] + i, bsx * sizeof(pel_t));\r\n            dst += i_dst8;\r\n        }\r\n    } else if (bsy == 8) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n        for (i = 0; i < bsx; i++, src++) {\r\n            dst1[i] = (pel_t)((7 * src[0] + 15 * src[1] +  9 * src[2] +     src[3] + 16) >> 5);\r\n            dst2[i] = (pel_t)((3 * src[0] +  7 * src[1] +  5 * src[2] +     src[3] + 8) >> 4);\r\n            dst3[i] = (pel_t)((5 * src[0] + 13 * src[1] + 11 * src[2] + 3 * src[3] + 16) >> 5);\r\n            dst4[i] = (pel_t)((    src[0] +  3 * src[1] +  3 * src[2] +     src[3] + 4) >> 3);\r\n\r\n            dst5[i] = (pel_t)((3 * src[0] + 11 * src[1] + 13 * src[2] + 5 * src[3] + 16) >> 5);\r\n            dst6[i] = (pel_t)((    src[0] +  5 * src[1] +  7 * src[2] + 3 * src[3] +  8) >> 4);\r\n            dst7[i] = (pel_t)((    src[0] +  9 * src[1] + 15 * src[2] + 7 * src[3] + 16) >> 5);\r\n            dst8[i] = (pel_t)((    src[1] +  2 * src[2] +      src[3]              +  2) >> 2);\r\n        }\r\n    } else {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        for (i = 0; i < bsx; i++, src++) {\r\n            dst1[i] = (pel_t)(( 7 * src[0] + 15 * src[1] +  9 * src[2] +      src[3] + 16) >> 5);\r\n            dst2[i] = (pel_t)(( 3 * src[0] +  7 * src[1] +  5 * src[2] +      src[3] +  8) >> 4);\r\n            dst3[i] = (pel_t)(( 5 * src[0] + 13 * src[1] + 11 * src[2] +  3 * src[3] + 16) >> 5);\r\n            dst4[i] = (pel_t)((     src[0] +  3 * src[1] +  3 * src[2] +      src[3] +  4) >> 3);\r\n        }\r\n    }\r\n}\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_y_25_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    if (bsx > 8) {\r\n            ALIGN16(pel_t first_line[64 + (64 << 3)]);\r\n            int line_size = bsx + ((bsy - 1) << 3);\r\n            int iHeight8 = bsy << 3;\r\n            for (i = 0; i < line_size; i += 8, src--) {\r\n                first_line[0 + i] = (pel_t)((src[0] * 7 + src[-1] * 15 + src[-2] *  9 + src[-3] * 1 + 16) >> 5);\r\n                first_line[1 + i] = (pel_t)((src[0] * 3 + src[-1] * 7  + src[-2] *  5 + src[-3] * 1 + 8) >> 4);\r\n                first_line[2 + i] = (pel_t)((src[0] * 5 + src[-1] * 13 + src[-2] * 11 + src[-3] * 3 + 16) >> 5);\r\n                first_line[3 + i] = (pel_t)((src[0] * 1 + src[-1] * 3  + src[-2] *  3 + src[-3] * 1 + 4) >> 3);\r\n\r\n                first_line[4 + i] = (pel_t)((src[0] * 3 + src[-1] * 11 + src[-2] * 13 + src[-3] * 5 + 16) >> 5);\r\n                first_line[5 + i] = (pel_t)((src[0] * 1 + src[-1] *  5 + src[-2] *  7 + src[-3] * 3 + 8) >> 4);\r\n                first_line[6 + i] = (pel_t)((src[0] * 1 + src[-1] *  9 + src[-2] * 15 + src[-3] * 7 + 16) >> 5);\r\n                first_line[7 + i] = (pel_t)((             src[-1] *  1 + src[-2] *  2 + src[-3] * 1 + 2) >> 2);\r\n            }\r\n            for (i = 0; i < iHeight8; i += 8) {\r\n                memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n                dst += i_dst;\r\n            }\r\n    } else if (bsx == 8) {\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((src[0] * 7 + src[-1] * 15 + src[-2] *  9 + src[-3] * 1 + 16) >> 5);\r\n            dst[1] = (pel_t)((src[0] * 3 + src[-1] *  7 + src[-2] *  5 + src[-3] * 1 + 8) >> 4);\r\n            dst[2] = (pel_t)((src[0] * 5 + src[-1] * 13 + src[-2] * 11 + src[-3] * 3 + 16) >> 5);\r\n            dst[3] = (pel_t)((src[0] * 1 + src[-1] *  3 + 
src[-2] *  3 + src[-3] * 1 + 4) >> 3);\r\n\r\n            dst[4] = (pel_t)((src[0] * 3 + src[-1] * 11 + src[-2] * 13 + src[-3] * 5 + 16) >> 5);\r\n            dst[5] = (pel_t)((src[0] * 1 + src[-1] *  5 + src[-2] *  7 + src[-3] * 3 + 8) >> 4);\r\n            dst[6] = (pel_t)((src[0] * 1 + src[-1] *  9 + src[-2] * 15 + src[-3] * 7 + 16) >> 5);\r\n            dst[7] = (pel_t)((             src[-1] *  1 + src[-2] *  2 + src[-3] * 1 + 2) >> 2);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((src[0] * 7 + src[-1] * 15 + src[-2] *  9 + src[-3] * 1 + 16) >> 5);\r\n            dst[1] = (pel_t)((src[0] * 3 + src[-1] *  7 + src[-2] *  5 + src[-3] * 1 + 8) >> 4);\r\n            dst[2] = (pel_t)((src[0] * 5 + src[-1] * 13 + src[-2] * 11 + src[-3] * 3 + 16) >> 5);\r\n            dst[3] = (pel_t)((src[0] * 1 + src[-1] *  3 + src[-2] *  3 + src[-3] * 1 + 4) >> 3);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_y_26_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    if (bsx != 4) {\r\n        ALIGN16(pel_t first_line[64 + 256]);\r\n        int line_size = bsx + ((bsy - 1) << 2);\r\n        int iHeight4 = bsy << 2;\r\n\r\n        for (i = 0; i < line_size; i += 4, src--) {\r\n            first_line[i    ] = (pel_t)((src[ 0] * 3 +  src[-1] * 7 + src[-2]  * 5 + src[-3]     + 8) >> 4);\r\n            first_line[i + 1] = (pel_t)((src[ 0]     + (src[-1]     + src[-2]) * 3 + src[-3]     + 4) >> 3);\r\n            first_line[i + 2] = (pel_t)((src[ 0]     +  src[-1] * 5 + src[-2]  * 7 + src[-3] * 3 + 8) >> 4);\r\n            first_line[i + 3] = (pel_t)((src[-1]     +  src[-2] * 2 + src[-3]                    + 2) >> 2);\r\n        }\r\n\r\n        for (i = 0; i < iHeight4; i += 4) {\r\n            memcpy(dst, first_line + i, 
bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((src[ 0] * 3 +  src[-1] * 7 + src[-2]  * 5 + src[-3]     + 8) >> 4);\r\n            dst[1] = (pel_t)((src[ 0]     + (src[-1]     + src[-2]) * 3 + src[-3]     + 4) >> 3);\r\n            dst[2] = (pel_t)((src[ 0]     +  src[-1] * 5 + src[-2]  * 7 + src[-3] * 3 + 8) >> 4);\r\n            dst[3] = (pel_t)((src[-1]     +  src[-2] * 2 + src[-3]                    + 2) >> 2);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_y_27_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    if (bsx > 8){\r\n        intra_pred_ang_y_c(src, dst, i_dst, dir_mode, bsx, bsy);\r\n    } else if (bsx == 8){\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((21 * src[0] +  53 * src[-1] + 43 * src[-2] + 11 * src[-3] + 64) >> 7);\r\n            dst[1] = (pel_t)(( 9 * src[0] +  41 * src[-1] + 55 * src[-2] + 23 * src[-3] + 64) >> 7);\r\n            dst[2] = (pel_t)((15 * src[-1] + 31 * src[-2] + 17 * src[-3] +  1 * src[-4] + 32) >> 6);\r\n            dst[3] = (pel_t)(( 9 * src[-1] + 25 * src[-2] + 23 * src[-3] +  7 * src[-4] + 32) >> 6);\r\n\r\n            dst[4] = (pel_t)(( 3 * src[-1] + 19 * src[-2] + 29 * src[-3] + 13 * src[-4] + 32) >> 6);\r\n            dst[5] = (pel_t)((27 * src[-2] + 59 * src[-3] + 37 * src[-4] +  5 * src[-5] + 64) >> 7);\r\n            dst[6] = (pel_t)((15 * src[-2] + 47 * src[-3] + 49 * src[-4] + 17 * src[-5] + 64) >> 7);\r\n            dst[7] = (pel_t)(( 3 * src[-2] + 35 * src[-3] + 61 * src[-4] + 29 * src[-5] + 64) >> 7);\r\n            dst += i_dst;\r\n        }\r\n    } else{\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((21 * src[0]  + 53 * src[-1] + 43 * src[-2] + 11 * src[-3] + 
64) >> 7);\r\n            dst[1] = (pel_t)(( 9 * src[0]  + 41 * src[-1] + 55 * src[-2] + 23 * src[-3] + 64) >> 7);\r\n            dst[2] = (pel_t)((15 * src[-1] + 31 * src[-2] + 17 * src[-3] +  1 * src[-4] + 32) >> 6);\r\n            dst[3] = (pel_t)(( 9 * src[-1] + 25 * src[-2] + 23 * src[-3] +  7 * src[-4] + 32) >> 6);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_y_28_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 128]);\r\n    int line_size = bsx + ((bsy - 1) << 1);\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = STARAVS_MIN(line_size, (bsy << 2));\r\n#endif\r\n    int iHeight2 = bsy << 1;\r\n    int i;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    pel_t pad1, pad2;\r\n#endif\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size; i += 2, src--) {\r\n#else\r\n    for (i = 0; i < real_size; i += 2, src--) {\r\n#endif\r\n        first_line[i    ] = (pel_t)((src[ 0] + (src[-1] + src[-2]) * 3 + src[-3] + 4) >> 3);\r\n        first_line[i + 1] = (pel_t)((src[-1] + (src[-2] << 1)          + src[-3] + 2) >> 2);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    if (real_size < line_size) {\r\n        first_line[i - 1] = first_line[i - 3];\r\n\r\n        pad1 = first_line[i - 2];\r\n        pad2 = first_line[i - 1];\r\n        for (; i < line_size; i += 2) {\r\n            first_line[i    ] = pad1;\r\n            first_line[i + 1] = pad2;\r\n        }\r\n    }\r\n#endif\r\n\r\n    for (i = 0; i < iHeight2; i += 2) {\r\n        memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_y_29_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    if (bsx > 
8) {\r\n        intra_pred_ang_y_c(src, dst, i_dst, dir_mode, bsx, bsy);\r\n    } else if (bsx == 8) {\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((src[0] * 9 + src[-1] * 41 + src[-2] * 55 + src[-3] * 23 + 64) >> 7);\r\n            dst[1] = (pel_t)((src[-1] * 9 + src[-2] * 25 + src[-3] * 23 + src[-4] * 7 + 32) >> 6);\r\n            dst[2] = (pel_t)((src[-2] * 27 + src[-3] * 59 + src[-4] * 37 + src[-5] * 5 + 64) >> 7);\r\n            dst[3] = (pel_t)((src[-2] * 3 + src[-3] * 35 + src[-4] * 61 + src[-5] * 29 + 64) >> 7);\r\n\r\n            dst[4] = (pel_t)((src[-3] * 3 + src[-4] * 11 + src[-5] * 13 + src[-6] * 5 + 16) >> 5);\r\n            dst[5] = (pel_t)((src[-4] * 21 + src[-5] * 53 + src[-6] * 43 + src[-7] * 11 + 64) >> 7);\r\n            dst[6] = (pel_t)((src[-5] * 15 + src[-6] * 31 + src[-7] * 17 + src[-8] + 32) >> 6);\r\n            dst[7] = (pel_t)((src[-5] * 3 + src[-6] * 19 + src[-7] * 29 + src[-8] * 13 + 32) >> 6);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((src[0] * 9 + src[-1] * 41 + src[-2] * 55 + src[-3] * 23 + 64) >> 7);\r\n            dst[1] = (pel_t)((src[-1] * 9 + src[-2] * 25 + src[-3] * 23 + src[-4] * 7 + 32) >> 6);\r\n            dst[2] = (pel_t)((src[-2] * 27 + src[-3] * 59 + src[-4] * 37 + src[-5] * 5 + 64) >> 7);\r\n            dst[3] = (pel_t)((src[-2] * 3 + src[-3] * 35 + src[-4] * 61 + src[-5] * 29 + 64) >> 7);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_y_30_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 64]);\r\n    int line_size = bsx + bsy - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = STARAVS_MIN(line_size, (bsy << 1) - 1);\r\n#endif\r\n    int i;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    pel_t 
pad;\r\n#endif\r\n\r\n    src -= 2;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size; i++, src--) {\r\n#else\r\n    for (i = 0; i < real_size; i++, src--) {\r\n#endif\r\n        first_line[i] = (pel_t)((src[-1] + (src[0] << 1) + src[1] + 2) >> 2);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    pad = first_line[real_size - 1];\r\n    for (; i < line_size; i++) {\r\n        first_line[i] = pad;\r\n    }\r\n#endif\r\n\r\n    for (i = 0; i < bsy; i++) {\r\n        memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_y_31_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t dst_tran[MAX_CU_SIZE * MAX_CU_SIZE]);\r\n    ALIGN16(pel_t src_tran[MAX_CU_SIZE << 3]);\r\n    int i;\r\n    if (bsx >= bsy){\r\n        // transposition\r\n#if BUGFIX_PREDICTION_INTRA\r\n        //i < (bsx * 19 / 8 + 3)\r\n        for (i = 0; i < (bsy + bsx * 11 / 8 + 3); i++){\r\n#else\r\n        for (i = 0; i < (2 * bsy + 3); i++){\r\n#endif\r\n            src_tran[i] = src[-i];\r\n        }\r\n        intra_pred_ang_x_5_c(src_tran, dst_tran, bsy, 5, bsy, bsx);\r\n        for (i = 0; i < bsy; i++){\r\n            for (int j = 0; j < bsx; j++){\r\n                dst[j + i_dst * i] = dst_tran[i + bsy * j];\r\n            }\r\n        }\r\n    } else if (bsx == 8){\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((5 * src[-1] + 13 * src[-2] + 11 * src[-3] + 3 * src[-4] + 16) >> 5);\r\n            dst[1] = (pel_t)((1 * src[-2] + 5 * src[-3] + 7 * src[-4] + 3 * src[-5] + 8) >> 4);\r\n            dst[2] = (pel_t)((7 * src[-4] + 15 * src[-5] + 9 * src[-6] + 1 * src[-7] + 16) >> 5);\r\n            dst[3] = (pel_t)((1 * src[-5] + 3 * src[-6] + 3 * src[-7] + 1 * src[-8] + 4) >> 3);\r\n\r\n            dst[4] = (pel_t)((1 * src[-6] + 9 * src[-7] 
+ 15 * src[-8] + 7 * src[-9] + 16) >> 5);\r\n            dst[5] = (pel_t)((3 * src[-8] + 7 * src[-9] + 5 * src[-10] + 1 * src[-11] + 8) >> 4);\r\n            dst[6] = (pel_t)((3 * src[-9] + 11 * src[-10] + 13 * src[-11] + 5 * src[-12] + 16) >> 5);\r\n            dst[7] = (pel_t)((1 * src[-11] + 2 * src[-12] + 1 * src[-13] + 0 * src[-14] + 2) >> 2);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((5 * src[-1] + 13 * src[-2] + 11 * src[-3] + 3 * src[-4] + 16) >> 5);\r\n            dst[1] = (pel_t)((1 * src[-2] + 5 * src[-3] + 7 * src[-4] + 3 * src[-5] + 8) >> 4);\r\n            dst[2] = (pel_t)((7 * src[-4] + 15 * src[-5] + 9 * src[-6] + 1 * src[-7] + 16) >> 5);\r\n            dst[3] = (pel_t)((1 * src[-5] + 3 * src[-6] + 3 * src[-7] + 1 * src[-8] + 4) >> 3);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_y_32_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[2 * (32 + 64)]);\r\n    int line_size = (bsy >> 1) + bsx - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = STARAVS_MIN(line_size, bsy - 1);\r\n#endif\r\n    int aligned_line_size = ((line_size + 15) >> 4) << 4;\r\n    int i_dst2 = i_dst << 1;\r\n    int i;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    pel_t pad;\r\n#endif\r\n    pel_t *pfirst[2];\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = first_line + aligned_line_size;\r\n\r\n    src -= 3;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size; i++, src -= 2) {\r\n#else\r\n    for (i = 0; i < real_size; i++, src -= 2) {\r\n#endif\r\n        pfirst[0][i] = (pel_t)((src[1] + (src[ 0] << 1) + src[-1] + 2) >> 2);\r\n        pfirst[1][i] = (pel_t)((src[0] + (src[-1] << 1) + src[-2] + 2) >> 2);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    pad = 
pfirst[1][i - 1];\r\n    for (; i < line_size; i++) {\r\n        pfirst[0][i] = pad;\r\n        pfirst[1][i] = pad;\r\n    }\r\n#endif\r\n\r\n    bsy >>= 1;\r\n    for (i = 0; i < bsy; i++) {\r\n        memcpy(dst        , pfirst[0] + i, bsx * sizeof(pel_t));\r\n        memcpy(dst + i_dst, pfirst[1] + i, bsx * sizeof(pel_t));\r\n        dst += i_dst2;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_xy_13_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    if (bsy > 8) {\r\n        ALIGN16(pel_t first_line[(64 + 16) << 3]);\r\n        int line_size = bsx + (bsy >> 3) - 1;\r\n        int left_size = line_size - bsx;\r\n        int aligned_line_size = ((line_size + 15) >> 4) << 4;\r\n        pel_t *pfirst[8];\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n        pfirst[4] = pfirst[3] + aligned_line_size;\r\n        pfirst[5] = pfirst[4] + aligned_line_size;\r\n        pfirst[6] = pfirst[5] + aligned_line_size;\r\n        pfirst[7] = pfirst[6] + aligned_line_size;\r\n\r\n        src -= bsy - 8;\r\n        for (i = 0; i < left_size; i++, src += 8) {\r\n            pfirst[0][i] = (pel_t)((src[6] + (src[7] << 1) + src[8] + 2) >> 2);\r\n            pfirst[1][i] = (pel_t)((src[5] + (src[6] << 1) + src[7] + 2) >> 2);\r\n            pfirst[2][i] = (pel_t)((src[4] + (src[5] << 1) + src[6] + 2) >> 2);\r\n            pfirst[3][i] = (pel_t)((src[3] + (src[4] << 1) + src[5] + 2) >> 2);\r\n\r\n            pfirst[4][i] = (pel_t)((src[2] + (src[3] << 1) + src[4] + 2) >> 2);\r\n            pfirst[5][i] = (pel_t)((src[1] + (src[2] << 1) + src[3] + 2) >> 2);\r\n            pfirst[6][i] = (pel_t)((src[0] + (src[1] << 1) + src[2] + 2) >> 2);\r\n            pfirst[7][i] = (pel_t)((src[-1] + (src[0] << 1) 
+ src[1] + 2) >> 2);\r\n        }\r\n\r\n        for (; i < line_size; i++, src++) {\r\n            pfirst[0][i] = (pel_t)((7 * src[2] + 15 * src[1] + 9 * src[0] + src[-1] + 16) >> 5);\r\n            pfirst[1][i] = (pel_t)((3 * src[2] + 7 * src[1] + 5 * src[0] + src[-1] + 8) >> 4);\r\n            pfirst[2][i] = (pel_t)((5 * src[2] + 13 * src[1] + 11 * src[0] + 3 * src[-1] + 16) >> 5);\r\n            pfirst[3][i] = (pel_t)((src[2] + 3 * src[1] + 3 * src[0] + src[-1] + 4) >> 3);\r\n\r\n            pfirst[4][i] = (pel_t)((3 * src[2] + 11 * src[1] + 13 * src[0] + 5 * src[-1] + 16) >> 5);\r\n            pfirst[5][i] = (pel_t)((src[2] + 5 * src[1] + 7 * src[0] + 3 * src[-1] + 8) >> 4);\r\n            pfirst[6][i] = (pel_t)((src[2] + 9 * src[1] + 15 * src[0] + 7 * src[-1] + 16) >> 5);\r\n            pfirst[7][i] = (pel_t)((src[1] + 2 * src[0] + src[-1] + 2) >> 2);\r\n        }\r\n\r\n        pfirst[0] += left_size;\r\n        pfirst[1] += left_size;\r\n        pfirst[2] += left_size;\r\n        pfirst[3] += left_size;\r\n        pfirst[4] += left_size;\r\n        pfirst[5] += left_size;\r\n        pfirst[6] += left_size;\r\n        pfirst[7] += left_size;\r\n\r\n        bsy >>= 3;\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] - i, bsx * sizeof(pel_t));  dst += i_dst;\r\n            memcpy(dst, pfirst[1] - i, bsx * sizeof(pel_t));  dst += i_dst;\r\n            memcpy(dst, pfirst[2] - i, bsx * sizeof(pel_t));  dst += i_dst;\r\n            memcpy(dst, pfirst[3] - i, bsx * sizeof(pel_t));  dst += i_dst;\r\n            memcpy(dst, pfirst[4] - i, bsx * sizeof(pel_t));  dst += i_dst;\r\n            memcpy(dst, pfirst[5] - i, bsx * sizeof(pel_t));  dst += i_dst;\r\n            memcpy(dst, pfirst[6] - i, bsx * sizeof(pel_t));  dst += i_dst;\r\n            memcpy(dst, pfirst[7] - i, bsx * sizeof(pel_t));  dst += i_dst;\r\n        }\r\n    } else if (bsy == 8) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 
= dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n        for (i = 0; i < bsx; i++, src++) {\r\n            dst1[i] = (pel_t)((7 * src[2] + 15 * src[1] + 9 * src[0] + src[-1] + 16) >> 5);\r\n            dst2[i] = (pel_t)((3 * src[2] + 7 * src[1] + 5 * src[0] + src[-1] + 8) >> 4);\r\n            dst3[i] = (pel_t)((5 * src[2] + 13 * src[1] + 11 * src[0] + 3 * src[-1] + 16) >> 5);\r\n            dst4[i] = (pel_t)((src[2] + 3 * src[1] + 3 * src[0] + src[-1] + 4) >> 3);\r\n\r\n            dst5[i] = (pel_t)((3 * src[2] + 11 * src[1] + 13 * src[0] + 5 * src[-1] + 16) >> 5);\r\n            dst6[i] = (pel_t)((src[2] + 5 * src[1] + 7 * src[0] + 3 * src[-1] + 8) >> 4);\r\n            dst7[i] = (pel_t)((src[2] + 9 * src[1] + 15 * src[0] + 7 * src[-1] + 16) >> 5);\r\n            dst8[i] = (pel_t)((src[1] + 2 * src[0] + src[-1]  + 2) >> 2);\r\n        }\r\n    } else {\r\n        for (i = 0; i < bsx; i++, src++) {\r\n            pel_t *dst1 = dst;\r\n            pel_t *dst2 = dst1 + i_dst;\r\n            pel_t *dst3 = dst2 + i_dst;\r\n            pel_t *dst4 = dst3 + i_dst;\r\n            dst1[i] = (pel_t)((7 * src[2] + 15 * src[1] +  9 * src[0] +     src[-1] + 16) >> 5);\r\n            dst2[i] = (pel_t)((3 * src[2] +  7 * src[1] +  5 * src[0] +     src[-1] + 8) >> 4);\r\n            dst3[i] = (pel_t)((5 * src[2] + 13 * src[1] + 11 * src[0] + 3 * src[-1] + 16) >> 5);\r\n            dst4[i] = (pel_t)((    src[2] +  3 * src[1] +  3 * src[0] +     src[-1] + 4) >> 3);\r\n        }\r\n    }\r\n}\r\nstatic void intra_pred_ang_xy_14_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    if (bsy != 4) {\r\n        ALIGN16(pel_t first_line[4 * (64 + 16)]);\r\n        int line_size = bsx + (bsy >> 2) - 1;\r\n        int left_size = line_size - bsx;\r\n        int aligned_line_size = 
((line_size + 15) >> 4) << 4;\r\n        pel_t *pfirst[4];\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = first_line + aligned_line_size;\r\n        pfirst[2] = first_line + aligned_line_size * 2;\r\n        pfirst[3] = first_line + aligned_line_size * 3;\r\n\r\n        src -= bsy - 4;\r\n        for (i = 0; i < left_size; i++, src += 4) {\r\n            pfirst[0][i] = (pel_t)((src[ 2] + (src[3] << 1) + src[4] + 2) >> 2);\r\n            pfirst[1][i] = (pel_t)((src[ 1] + (src[2] << 1) + src[3] + 2) >> 2);\r\n            pfirst[2][i] = (pel_t)((src[ 0] + (src[1] << 1) + src[2] + 2) >> 2);\r\n            pfirst[3][i] = (pel_t)((src[-1] + (src[0] << 1) + src[1] + 2) >> 2);\r\n        }\r\n\r\n        for (; i < line_size; i++, src++) {\r\n            pfirst[0][i] = (pel_t)((src[-1]     +  src[0] * 5 + src[1]  * 7 + src[2] * 3 + 8) >> 4);\r\n            pfirst[1][i] = (pel_t)((src[-1]     + (src[0]     + src[1]) * 3 + src[2]     + 4) >> 3);\r\n            pfirst[2][i] = (pel_t)((src[-1] * 3 +  src[0] * 7 + src[1]  * 5 + src[2]     + 8) >> 4);\r\n            pfirst[3][i] = (pel_t)((src[-1]     +  src[0] * 2 + src[1]                   + 2) >> 2);\r\n        }\r\n\r\n        pfirst[0] += left_size;\r\n        pfirst[1] += left_size;\r\n        pfirst[2] += left_size;\r\n        pfirst[3] += left_size;\r\n\r\n        bsy >>= 2;\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[1] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[2] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[3] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n\r\n        for (i = 0; i < bsx; i++, src++) {\r\n    
        dst1[i] = (pel_t)((src[-1]     +  src[0] * 5 + src[1]  * 7 + src[2] * 3 + 8) >> 4);\r\n            dst2[i] = (pel_t)((src[-1]     + (src[0]     + src[1]) * 3 + src[2]     + 4) >> 3);\r\n            dst3[i] = (pel_t)((src[-1] * 3 +  src[0] * 7 + src[1]  * 5 + src[2]     + 8) >> 4);\r\n            dst4[i] = (pel_t)((src[-1]     +  src[0] * 2 + src[1]                   + 2) >> 2);\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_xy_16_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[2 * (64 + 32)]);\r\n    int line_size = bsx + (bsy >> 1) - 1;\r\n    int left_size = line_size - bsx;\r\n    int aligned_line_size = ((line_size + 15) >> 4) << 4;\r\n    int i_dst2 = i_dst << 1;\r\n    pel_t *pfirst[2];\r\n    int i;\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = first_line + aligned_line_size;\r\n\r\n    src -= bsy - 2;\r\n    for (i = 0; i < left_size; i++, src += 2) {\r\n        pfirst[0][i] = (pel_t)((src[ 0] + (src[1] << 1) + src[2] + 2) >> 2);\r\n        pfirst[1][i] = (pel_t)((src[-1] + (src[0] << 1) + src[1] + 2) >> 2);\r\n    }\r\n\r\n    for (; i < line_size; i++, src++) {\r\n        pfirst[0][i] = (pel_t)((src[-1] + (src[0]       + src[1]) * 3 + src[2] + 4) >> 3);\r\n        pfirst[1][i] = (pel_t)((src[-1] + (src[0] << 1) + src[1]               + 2) >> 2);\r\n    }\r\n\r\n    pfirst[0] += left_size;\r\n    pfirst[1] += left_size;\r\n\r\n    bsy >>= 1;\r\n    for (i = 0; i < bsy; i++) {\r\n        memcpy(dst        , pfirst[0] - i, bsx * sizeof(pel_t));\r\n        memcpy(dst + i_dst, pfirst[1] - i, bsx * sizeof(pel_t));\r\n        dst += i_dst2;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_xy_18_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 
+ 64]);\r\n    int line_size = bsx + bsy - 1;\r\n    int i;\r\n    pel_t *pfirst = first_line + bsy - 1;\r\n\r\n    src -= bsy - 1;\r\n    for (i = 0; i < line_size; i++, src++) {\r\n        first_line[i] = (pel_t)((src[-1] + (src[0] << 1) + src[1] + 2) >> 2);\r\n    }\r\n\r\n    for (i = 0; i < bsy; i++) {\r\n        memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n        pfirst--;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_xy_20_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 128]);\r\n    int left_size = ((bsy - 1) << 1) + 1;\r\n    int top_size = bsx - 1;\r\n    int line_size = left_size + top_size;\r\n    int i;\r\n    pel_t *pfirst = first_line + left_size - 1;\r\n\r\n    src -= bsy;\r\n    for (i = 0; i < left_size; i += 2, src++) {\r\n        first_line[i    ] = (pel_t)((src[-1] + (src[0] +  src[1]) * 3  + src[2] + 4) >> 3);\r\n        first_line[i + 1] = (pel_t)((           src[0] + (src[1] << 1) + src[2] + 2) >> 2);\r\n    }\r\n    i--;\r\n\r\n    for (; i < line_size; i++, src++) {\r\n        first_line[i] = (pel_t)((src[-1] + (src[0] << 1) + src[1] + 2) >> 2);\r\n    }\r\n\r\n    for (i = 0; i < bsy; i++) {\r\n        memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n        pfirst -= 2;\r\n        dst    += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void intra_pred_ang_xy_22_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    if (bsx != 4) {\r\n        src -= bsy;\r\n        ALIGN16(pel_t first_line[64 + 256]);\r\n        int left_size = ((bsy - 1) << 2) + 3;\r\n        int top_size  = bsx - 3;\r\n        int line_size = left_size + top_size;\r\n        pel_t *pfirst = first_line + left_size - 3;\r\n\r\n        for (i = 0; i < left_size; i += 4, src++) 
{\r\n            first_line[i    ] = (pel_t)((src[-1] * 3 +  src[0] * 7 + src[1]  * 5 + src[2]     + 8) >> 4);\r\n            first_line[i + 1] = (pel_t)((src[-1]     + (src[0]     + src[1]) * 3 + src[2]     + 4) >> 3);\r\n            first_line[i + 2] = (pel_t)((src[-1]     +  src[0] * 5 + src[1]  * 7 + src[2] * 3 + 8) >> 4);\r\n            first_line[i + 3] = (pel_t)((               src[0]     + src[1]  * 2 + src[2]     + 2) >> 2);\r\n        }\r\n        i--;\r\n\r\n        for (; i < line_size; i++, src++) {\r\n            first_line[i] = (pel_t)((src[-1] + (src[0] << 1) + src[1] + 2) >> 2);\r\n        }\r\n\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n            dst    += i_dst;\r\n            pfirst -= 4;\r\n        }\r\n    } else {\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((src[-2] * 3 +  src[-1] * 7 + src[0]  * 5 + src[1]     + 8) >> 4);\r\n            dst[1] = (pel_t)((src[-2]     + (src[-1]     + src[0]) * 3 + src[1]     + 4) >> 3);\r\n            dst[2] = (pel_t)((src[-2]     +  src[-1] * 5 + src[0]  * 7 + src[1] * 3 + 8) >> 4);\r\n            dst[3] = (pel_t)((               src[-1]     + src[0]  * 2 + src[1]     + 2) >> 2);\r\n            dst += i_dst;\r\n        }\r\n        // needn't pad, (3,0) is equal for ang_x and ang_y\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nstatic void intra_pred_ang_xy_23_c(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    if (bsx > 8) {\r\n        ALIGN16(pel_t first_line[64 + 512]);\r\n        int left_size = (bsy << 3) - 1;\r\n        int top_size = bsx - 7;\r\n        int line_size = left_size + top_size;\r\n        pel_t *pfirst = first_line + left_size - 7;\r\n\r\n        src -= bsy;\r\n        for (i = 0; i < left_size; i += 8, src++) {\r\n            first_line[i    ] = (pel_t)((7 * src[-1] + 15 * src[0] +  9 * 
src[1] +     src[2] + 16) >> 5);\r\n            first_line[i + 1] = (pel_t)((3 * src[-1] +  7 * src[0] +  5 * src[1] +     src[2] +  8) >> 4);\r\n            first_line[i + 2] = (pel_t)((5 * src[-1] + 13 * src[0] + 11 * src[1] + 3 * src[2] + 16) >> 5);\r\n            first_line[i + 3] = (pel_t)((    src[-1] +  3 * src[0] +  3 * src[1] +     src[2] +  4) >> 3);\r\n\r\n            first_line[i + 4] = (pel_t)((3 * src[-1] + 11 * src[0] + 13 * src[1] + 5 * src[2] + 16) >> 5);\r\n            first_line[i + 5] = (pel_t)((    src[-1] +  5 * src[0] +  7 * src[1] + 3 * src[2] +  8) >> 4);\r\n            first_line[i + 6] = (pel_t)((    src[-1] +  9 * src[0] + 15 * src[1] + 7 * src[2] + 16) >> 5);\r\n            first_line[i + 7] = (pel_t)((    src[ 0] +  2 * src[1] +      src[2] + 0 * src[3] +  2) >> 2);\r\n        }\r\n        i--;\r\n\r\n        for (; i < line_size; i++, src++) {\r\n            first_line[i] = (pel_t)((src[1] + (src[0] << 1) + src[-1] + 2) >> 2);\r\n        }\r\n\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            pfirst -= 8;\r\n        }\r\n    } else if (bsx == 8) {\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((7 * src[-2] + 15 * src[-1] +  9 * src[0] +     src[1] + 16) >> 5);\r\n            dst[1] = (pel_t)((3 * src[-2] +  7 * src[-1] +  5 * src[0] +     src[1] +  8) >> 4);\r\n            dst[2] = (pel_t)((5 * src[-2] + 13 * src[-1] + 11 * src[0] + 3 * src[1] + 16) >> 5);\r\n            dst[3] = (pel_t)((    src[-2] +  3 * src[-1] +  3 * src[0] +     src[1] +  4) >> 3);\r\n\r\n            dst[4] = (pel_t)((3 * src[-2] + 11 * src[-1] + 13 * src[0] + 5 * src[1] + 16) >> 5);\r\n            dst[5] = (pel_t)((    src[-2] +  5 * src[-1] +  7 * src[0] + 3 * src[1] +  8) >> 4);\r\n            dst[6] = (pel_t)((    src[-2] +  9 * src[-1] + 15 * src[0] + 7 * src[1] + 16) >> 5);\r\n            dst[7] = (pel_t)((    src[-1] +  2 * src[ 0] +    
  src[1] + 0 * src[2] +  2) >> 2);\r\n            dst += i_dst;\r\n        }\r\n        // needn't pad, (7,0) is equal for ang_x and ang_y\r\n    } else {\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((7 * src[-2] + 15 * src[-1] + 9 * src[0] + src[1] + 16) >> 5);\r\n            dst[1] = (pel_t)((3 * src[-2] + 7 * src[-1] + 5 * src[0] + src[1] + 8) >> 4);\r\n            dst[2] = (pel_t)((5 * src[-2] + 13 * src[-1] + 11 * src[0] + 3 * src[1] + 16) >> 5);\r\n            dst[3] = (pel_t)((src[-2] + 3 * src[-1] + 3 * src[0] + src[1] + 4) >> 3);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * fill reference samples for intra prediction\r\n * for a PU at the top-left corner of the LCU; all neighbors are read from pLcuEP\r\n */\r\nstatic \r\nvoid fill_reference_samples_0_c(const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy)\r\n{\r\n    int num_padding = 0;\r\n\r\n    /* fill default value */\r\n    mem_repeat_p(&EP[-(bsy << 1)], g_dc_value, ((bsy + bsx) << 1) + 1);\r\n\r\n    /* get prediction pixels ---------------------------------------\r\n     * extra pixels          | left-down pixels   | left pixels   | top-left | top pixels  | top-right pixels  | extra pixels\r\n     * -2*bsy-4 ... -2*bsy-1 | -bsy-bsy ... -bsy-1| -bsy -3 -2 -1 |     0    | 1 2 ... bsx | bsx+1 ... bsx+bsx | 2*bsx+1 ... 
2*bsx+4\r\n     */\r\n\r\n    /* fill top & top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        /* fill top pixels */\r\n        gf_davs2.fast_memcpy(&EP[1], &pLcuEP[1], bsx * sizeof(pel_t));\r\n    }\r\n\r\n    /* fill top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_RIGHT)) {\r\n        gf_davs2.fast_memcpy(&EP[bsx + 1], &pLcuEP[bsx + 1], bsx * sizeof(pel_t));\r\n    } else {\r\n        mem_repeat_p(&EP[bsx + 1], EP[bsx], bsx);   // repeat the last pixel\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsy * 11 / 4 - bsx + 4;\r\n    if (num_padding > 0) {\r\n        mem_repeat_p(&EP[2 * bsx + 1], EP[2 * bsx], num_padding); // from (2*bsx) to (iX + 3) = (bsy *11/4 + bsx - 1) + 3\r\n    }\r\n\r\n    /* fill left & left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        /* fill left pixels */\r\n        memcpy(&EP[-bsy], &pLcuEP[-bsy], bsy * sizeof(pel_t));\r\n    }\r\n\r\n    /* fill left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT_DOWN)) {\r\n        memcpy(&EP[-2 * bsy], &pLcuEP[-2 * bsy], bsy * sizeof(pel_t));\r\n    } else {\r\n        mem_repeat_p(&EP[-(bsy << 1)], EP[-bsy], bsy);\r\n    }\r\n\r\n    /* fill top-left pixel */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_LEFT)) {\r\n        EP[0] = pLcuEP[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        EP[0] = pLcuEP[1];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        EP[0] = pLcuEP[-1];\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsx * 11 / 4 - bsy + 4;\r\n    if (num_padding > 0) {\r\n        mem_repeat_p(&EP[-2 * bsy - num_padding], EP[-2 * bsy], num_padding); // from (-2*bsy) to (-iY - 3) = -(bsx *11/4 + bsy - 1) - 3\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * fill reference samples for intra prediction\r\n * for a PU on the top boundary of the LCU; top neighbors are read from pLcuEP, left neighbors via pTL\r\n */\r\nstatic \r\nvoid fill_reference_samples_x_c(const 
pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy)\r\n{\r\n    const pel_t *pL = pTL + i_TL;\r\n    int num_padding = 0;\r\n\r\n    /* fill default value */\r\n    mem_repeat_p(&EP[-(bsy << 1)], g_dc_value, ((bsy + bsx) << 1) + 1);\r\n\r\n    /* get prediction pixels ---------------------------------------\r\n     * extra pixels          | left-down pixels   | left pixels   | top-left | top pixels  | top-right pixels  | extra pixels\r\n     * -2*bsy-4 ... -2*bsy-1 | -bsy-bsy ... -bsy-1| -bsy -3 -2 -1 |     0    | 1 2 ... bsx | bsx+1 ... bsx+bsx | 2*bsx+1 ... 2*bsx+4\r\n     */\r\n\r\n    /* fill top & top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        /* fill top pixels */\r\n        gf_davs2.fast_memcpy(&EP[1], &pLcuEP[1], bsx * sizeof(pel_t));\r\n    }\r\n\r\n    /* fill top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_RIGHT)) {\r\n        gf_davs2.fast_memcpy(&EP[bsx + 1], &pLcuEP[bsx + 1], bsx * sizeof(pel_t));\r\n    } else {\r\n        mem_repeat_p(&EP[bsx + 1], EP[bsx], bsx);   // repeat the last pixel\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsy * 11 / 4 - bsx + 4;\r\n    if (num_padding > 0) {\r\n        mem_repeat_p(&EP[2 * bsx + 1], EP[2 * bsx], num_padding); // from (2*bsx) to (iX + 3) = (bsy *11/4 + bsx - 1) + 3\r\n    }\r\n\r\n    /* fill left & left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        const pel_t *p_l = pL;\r\n        int y;\r\n        /* fill left pixels */\r\n        for (y = 0; y < bsy; y++) {\r\n            EP[-1 - y] = *p_l;\r\n            p_l += i_TL;\r\n        }\r\n    }\r\n\r\n    /* fill left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT_DOWN)) {\r\n        int y;\r\n        const pel_t *p_l = pL + bsy * i_TL;\r\n\r\n        for (y = 0; y < bsy; y++) {\r\n            EP[-bsy - 1 - y] = *p_l;\r\n            p_l += i_TL;\r\n        }\r\n    } else {\r\n        
mem_repeat_p(&EP[-(bsy << 1)], EP[-bsy], bsy);\r\n    }\r\n\r\n    /* fill top-left pixel */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_LEFT)) {\r\n        EP[0] = pLcuEP[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        EP[0] = pLcuEP[1];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        EP[0] = pL[0];\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsx * 11 / 4 - bsy + 4;\r\n    if (num_padding > 0) {\r\n        mem_repeat_p(&EP[-2 * bsy - num_padding], EP[-2 * bsy], num_padding); // from (-2*bsy) to (-iY - 3) = -(bsx *11/4 + bsy - 1) - 3\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * fill reference samples for intra prediction\r\n * for a PU on the left boundary of the LCU; left neighbors are read from pLcuEP, top neighbors via pTL\r\n */\r\nstatic \r\nvoid fill_reference_samples_y_c(const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy)\r\n{\r\n    const pel_t *pT = pTL + 1;\r\n    int num_padding = 0;\r\n\r\n    /* fill default value */\r\n    mem_repeat_p(&EP[-(bsy << 1)], g_dc_value, ((bsy + bsx) << 1) + 1);\r\n\r\n    /* get prediction pixels ---------------------------------------\r\n     * extra pixels          | left-down pixels   | left pixels   | top-left | top pixels  | top-right pixels  | extra pixels\r\n     * -2*bsy-4 ... -2*bsy-1 | -bsy-bsy ... -bsy-1| -bsy -3 -2 -1 |     0    | 1 2 ... bsx | bsx+1 ... bsx+bsx | 2*bsx+1 ... 
2*bsx+4\r\n     */\r\n\r\n    /* fill top & top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        /* fill top pixels */\r\n        gf_davs2.fast_memcpy(&EP[1], pT, bsx * sizeof(pel_t));\r\n    }\r\n\r\n    /* fill top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_RIGHT)) {\r\n        gf_davs2.fast_memcpy(&EP[bsx + 1], &pT[bsx], bsx * sizeof(pel_t));\r\n    } else {\r\n        mem_repeat_p(&EP[bsx + 1], EP[bsx], bsx);   // repeat the last pixel\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsy * 11 / 4 - bsx + 4;\r\n    if (num_padding > 0) {\r\n        mem_repeat_p(&EP[2 * bsx + 1], EP[2 * bsx], num_padding); // from (2*bsx) to (iX + 3) = (bsy *11/4 + bsx - 1) + 3\r\n    }\r\n\r\n    /* fill left & left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        /* fill left pixels */\r\n        memcpy(&EP[-bsy], &pLcuEP[-bsy], bsy * sizeof(pel_t));\r\n    }\r\n\r\n    /* fill left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT_DOWN)) {\r\n        memcpy(&EP[-2 * bsy], &pLcuEP[-2 * bsy], bsy * sizeof(pel_t));\r\n    } else {\r\n        mem_repeat_p(&EP[-(bsy << 1)], EP[-bsy], bsy);\r\n    }\r\n\r\n    /* fill top-left pixel */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_LEFT)) {\r\n        EP[0] = pLcuEP[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        EP[0] = pT[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        EP[0] = pLcuEP[-1];\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsx * 11 / 4 - bsy + 4;\r\n    if (num_padding > 0) {\r\n        mem_repeat_p(&EP[-2 * bsy - num_padding], EP[-2 * bsy], num_padding); // from (-2*bsy) to (-iY - 3) = -(bsx *11/4 + bsy - 1) - 3\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * fill reference samples for intra prediction\r\n * for a PU inside the LCU, not on its boundary; neighbors are read via pTL\r\n */\r\nstatic \r\nvoid fill_reference_samples_xy_c(const pel_t *pTL, int 
i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy)\r\n{\r\n    const pel_t *pT = pTL + 1;\r\n    const pel_t *pL = pTL + i_TL;\r\n    int num_padding = 0;\r\n\r\n    /* fill default value */\r\n    mem_repeat_p(&EP[-(bsy << 1)], g_dc_value, ((bsy + bsx) << 1) + 1);\r\n\r\n    /* get prediction pixels ---------------------------------------\r\n     * extra pixels          | left-down pixels   | left pixels   | top-left | top pixels  | top-right pixels  | extra pixels\r\n     * -2*bsy-4 ... -2*bsy-1 | -bsy-bsy ... -bsy-1| -bsy -3 -2 -1 |     0    | 1 2 ... bsx | bsx+1 ... bsx+bsx | 2*bsx+1 ... 2*bsx+4\r\n     */\r\n\r\n    /* fill top & top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        /* fill top pixels */\r\n        gf_davs2.fast_memcpy(&EP[1], pT, bsx * sizeof(pel_t));\r\n    }\r\n\r\n    /* fill top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_RIGHT)) {\r\n        gf_davs2.fast_memcpy(&EP[bsx + 1], &pT[bsx], bsx * sizeof(pel_t));\r\n    } else {\r\n        mem_repeat_p(&EP[bsx + 1], EP[bsx], bsx);   // repeat the last pixel\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsy * 11 / 4 - bsx + 4;\r\n    if (num_padding > 0) {\r\n        mem_repeat_p(&EP[2 * bsx + 1], EP[2 * bsx], num_padding); // from (2*bsx) to (iX + 3) = (bsy *11/4 + bsx - 1) + 3\r\n    }\r\n\r\n    /* fill left & left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        const pel_t *p_l = pL;\r\n        int y;\r\n        /* fill left pixels */\r\n        for (y = 0; y < bsy; y++) {\r\n            EP[-1 - y] = *p_l;\r\n            p_l += i_TL;\r\n        }\r\n    }\r\n\r\n    /* fill left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT_DOWN)) {\r\n        int y;\r\n        const pel_t *p_l = pL + bsy * i_TL;\r\n\r\n        for (y = 0; y < bsy; y++) {\r\n            EP[-bsy - 1 - y] = *p_l;\r\n            p_l += i_TL;\r\n        }\r\n    } else {\r\n        
mem_repeat_p(&EP[-(bsy << 1)], EP[-bsy], bsy);\r\n    }\r\n\r\n    /* fill top-left pixel */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_LEFT)) {\r\n        EP[0] = pTL[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        EP[0] = pT[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        EP[0] = pL[0];\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsx * 11 / 4 - bsy + 4;\r\n    if (num_padding > 0) {\r\n        mem_repeat_p(&EP[-2 * bsy - num_padding], EP[-2 * bsy], num_padding); // from (-2*bsy) to (-iY - 3) = -(bsx *11/4 + bsy - 1) - 3\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * make intra prediction for luma block\r\n */\r\nvoid davs2_get_intra_pred(davs2_row_rec_t *row_rec, cu_t *p_cu, int predmode, int ctu_x, int ctu_y, int bsx, int bsy)\r\n{\r\n    const int xy = ((ctu_y != 0) << 1) + (ctu_x != 0);\r\n    pel_t *EP = row_rec->buf_edge_pixels + (MAX_CU_SIZE << 2) - 1;\r\n    int b8_x = (ctu_x >> MIN_PU_SIZE_IN_BIT) + row_rec->ctu.i_spu_x;\r\n    int b8_y = (ctu_y >> MIN_PU_SIZE_IN_BIT) + row_rec->ctu.i_spu_y;\r\n    int i_pred = row_rec->ctu.i_fdec[0];\r\n    pel_t *p_pred = row_rec->ctu.p_fdec[0] + ctu_y * i_pred + ctu_x;\r\n    pel_t *pTL;\r\n    int i_src;\r\n    uint32_t avail;\r\n\r\n    assert(predmode >= 0 && predmode < NUM_INTRA_MODE);\r\n    avail = get_intra_neighbors(row_rec->h, b8_x, b8_y, bsx, bsy, p_cu->i_slice_nr);\r\n\r\n    row_rec->b_block_avail_top  = (bool_t)IS_NEIGHBOR_AVAIL(avail, MD_I_TOP ); // used for second transform\r\n    row_rec->b_block_avail_left = (bool_t)IS_NEIGHBOR_AVAIL(avail, MD_I_LEFT); // used for second transform\r\n\r\n    i_src = i_pred;\r\n    pTL   = p_pred - i_src - 1;\r\n\r\n    gf_davs2.fill_edge_f[xy](pTL, i_src, row_rec->ctu_border[0].rec_top + ctu_x - ctu_y, EP, avail, bsx, bsy);\r\n\r\n    intra_pred(EP, p_pred, i_pred, predmode, bsy, bsx, avail);\r\n}\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n * make intra prediction for chroma block\r\n */\r\nvoid davs2_get_intra_pred_chroma(davs2_row_rec_t *row_rec, cu_t *p_cu, int ctu_c_x, int ctu_c_y)\r\n{\r\n    static const int TAB_CHROMA_MODE_TO_REAL_MODE[NUM_INTRA_MODE_CHROMA] = {\r\n        DC_PRED, DC_PRED, HOR_PRED, VERT_PRED, BI_PRED\r\n    };\r\n    const int xy = ((ctu_c_y != 0) << 1) + (ctu_c_x != 0);\r\n    pel_t *EP_u     = row_rec->buf_edge_pixels + (MAX_CU_SIZE << 1) - 1;\r\n    pel_t *EP_v     = EP_u + (MAX_CU_SIZE << 2);\r\n    int bsize_c     = 1 << (p_cu->i_cu_level - 1);\r\n    int b8_x        = ((ctu_c_x << 1) >> MIN_PU_SIZE_IN_BIT) + row_rec->ctu.i_spu_x;\r\n    int b8_y        = ((ctu_c_y << 1) >> MIN_PU_SIZE_IN_BIT) + row_rec->ctu.i_spu_y;\r\n    int luma_mode   = p_cu->intra_pred_modes[0];\r\n    int chroma_mode = p_cu->c_ipred_mode;\r\n    int real_mode   = (chroma_mode == DM_PRED_C) ? luma_mode : TAB_CHROMA_MODE_TO_REAL_MODE[chroma_mode];\r\n    uint32_t avail;\r\n\r\n    /* position of the prediction blocks */\r\n    int i_pred      = row_rec->ctu.i_fdec[1];\r\n    pel_t *p_pred_u = row_rec->ctu.p_fdec[1] + ctu_c_y * i_pred + ctu_c_x;\r\n    pel_t *p_pred_v = row_rec->ctu.p_fdec[2] + ctu_c_y * i_pred + ctu_c_x;\r\n\r\n    /* position of the top-left neighboring pixel of the U and V blocks */\r\n    int i_src       = i_pred;\r\n    pel_t *pTL_u    = p_pred_u - i_src - 1;\r\n    pel_t *pTL_v    = p_pred_v - i_src - 1;\r\n\r\n    /* check neighbor availability and fill the reference boundary samples */\r\n    avail = get_intra_neighbors(row_rec->h, b8_x, b8_y, bsize_c << 1, bsize_c << 1, p_cu->i_slice_nr);\r\n\r\n    gf_davs2.fill_edge_f[xy](pTL_u, i_src, row_rec->ctu_border[1].rec_top + ctu_c_x - ctu_c_y, EP_u, avail, bsize_c, bsize_c);\r\n    gf_davs2.fill_edge_f[xy](pTL_v, i_src, row_rec->ctu_border[2].rec_top + ctu_c_x - ctu_c_y, EP_v, avail, bsize_c, bsize_c);\r\n\r\n    /* perform the prediction */\r\n    intra_pred(EP_u, p_pred_u, i_pred, real_mode, bsize_c, bsize_c, avail);\r\n    intra_pred(EP_v, p_pred_v, i_pred, real_mode, bsize_c, bsize_c, avail);\r\n}\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nvoid davs2_intra_pred_init(uint32_t cpuid, ao_funcs_t *pf)\r\n{\r\n#define ANG_X_OFFSET    3\r\n#define ANG_XY_OFFSET   13\r\n#define ANG_Y_OFFSET    25\r\n    int i;\r\n    intra_pred_t *ipred = pf->intraf;\r\n\r\n    pf->fill_edge_f[0]      = fill_reference_samples_0_c;\r\n    pf->fill_edge_f[1]      = fill_reference_samples_x_c;\r\n    pf->fill_edge_f[2]      = fill_reference_samples_y_c;\r\n    pf->fill_edge_f[3]      = fill_reference_samples_xy_c;\r\n    ipred[DC_PRED   ] = intra_pred_dc_c;                // 0\r\n    ipred[PLANE_PRED] = intra_pred_plane_c;             // 1\r\n    ipred[BI_PRED   ] = intra_pred_bilinear_c;          // 2\r\n\r\n    for (i = ANG_X_OFFSET; i < VERT_PRED; i++) {\r\n        ipred[i     ] = intra_pred_ang_x_c;             // 3 ~ 11\r\n    }\r\n    ipred[VERT_PRED ] = intra_pred_ver_c;               // 12\r\n\r\n    for (i = ANG_XY_OFFSET; i < HOR_PRED; i++) {\r\n        ipred[i     ] = intra_pred_ang_xy_c;            // 13 ~ 23\r\n    }\r\n\r\n    ipred[HOR_PRED  ] = intra_pred_hor_c;               // 24\r\n    for (i = ANG_Y_OFFSET; i < NUM_INTRA_MODE; i++) {\r\n        ipred[i     ] = intra_pred_ang_y_c;             // 25 ~ 32\r\n    }\r\n\r\n    ipred[INTRA_ANG_X_3 ]  = intra_pred_ang_x_3_c;\r\n    ipred[INTRA_ANG_X_4 ]  = intra_pred_ang_x_4_c;\r\n    ipred[INTRA_ANG_X_5 ]  = intra_pred_ang_x_5_c;\r\n    ipred[INTRA_ANG_X_6 ]  = intra_pred_ang_x_6_c;\r\n    ipred[INTRA_ANG_X_7 ]  = intra_pred_ang_x_7_c;\r\n    ipred[INTRA_ANG_X_8 ]  = intra_pred_ang_x_8_c;\r\n    ipred[INTRA_ANG_X_9 ]  = intra_pred_ang_x_9_c;\r\n    ipred[INTRA_ANG_X_10]  = intra_pred_ang_x_10_c;\r\n    ipred[INTRA_ANG_X_11]  = intra_pred_ang_x_11_c;\r\n\r\n    ipred[INTRA_ANG_XY_13] = intra_pred_ang_xy_13_c;\r\n    ipred[INTRA_ANG_XY_14] = intra_pred_ang_xy_14_c;\r\n    ipred[INTRA_ANG_XY_16] = intra_pred_ang_xy_16_c;\r\n    ipred[INTRA_ANG_XY_18] = intra_pred_ang_xy_18_c;\r\n   
 ipred[INTRA_ANG_XY_20] = intra_pred_ang_xy_20_c;\r\n    ipred[INTRA_ANG_XY_22] = intra_pred_ang_xy_22_c;\r\n    ipred[INTRA_ANG_XY_23] = intra_pred_ang_xy_23_c;\r\n\r\n    ipred[INTRA_ANG_Y_25]  = intra_pred_ang_y_25_c;\r\n    ipred[INTRA_ANG_Y_26]  = intra_pred_ang_y_26_c;\r\n    ipred[INTRA_ANG_Y_27]  = intra_pred_ang_y_27_c;\r\n    ipred[INTRA_ANG_Y_28]  = intra_pred_ang_y_28_c;\r\n    ipred[INTRA_ANG_Y_29]  = intra_pred_ang_y_29_c;\r\n    ipred[INTRA_ANG_Y_30]  = intra_pred_ang_y_30_c;\r\n    ipred[INTRA_ANG_Y_31]  = intra_pred_ang_y_31_c;\r\n    ipred[INTRA_ANG_Y_32]  = intra_pred_ang_y_32_c;\r\n\r\n#if HAVE_MMX\r\n    if (cpuid & DAVS2_CPU_SSE4) {\r\n#if !HIGH_BIT_DEPTH\r\n        ipred[DC_PRED   ] = intra_pred_dc_sse128;\r\n        ipred[PLANE_PRED] = intra_pred_plane_sse128;\r\n        ipred[BI_PRED   ] = intra_pred_bilinear_sse128;\r\n        ipred[HOR_PRED  ] = intra_pred_hor_sse128;\r\n        ipred[VERT_PRED ] = intra_pred_ver_sse128;\r\n\r\n        ipred[INTRA_ANG_X_3  ] = intra_pred_ang_x_3_sse128;\r\n        ipred[INTRA_ANG_X_4  ] = intra_pred_ang_x_4_sse128;\r\n        ipred[INTRA_ANG_X_6  ] = intra_pred_ang_x_6_sse128;\r\n        ipred[INTRA_ANG_X_8  ] = intra_pred_ang_x_8_sse128;\r\n        ipred[INTRA_ANG_X_10 ] = intra_pred_ang_x_10_sse128;\r\n\r\n        ipred[INTRA_ANG_XY_14] = intra_pred_ang_xy_14_sse128;\r\n        ipred[INTRA_ANG_XY_16] = intra_pred_ang_xy_16_sse128;\r\n        ipred[INTRA_ANG_XY_18] = intra_pred_ang_xy_18_sse128;\r\n        ipred[INTRA_ANG_XY_20] = intra_pred_ang_xy_20_sse128;\r\n\r\n        ipred[INTRA_ANG_X_5  ] = intra_pred_ang_x_5_sse128;\r\n        //ipred[INTRA_ANG_X_7  ] = intra_pred_ang_x_7_sse128;\r\n        //ipred[INTRA_ANG_X_9  ] = intra_pred_ang_x_9_sse128;\r\n        //ipred[INTRA_ANG_X_11 ] = intra_pred_ang_x_11_sse128;\r\n\r\n        ipred[INTRA_ANG_XY_13] = intra_pred_ang_xy_13_sse128;\r\n        ipred[INTRA_ANG_XY_22] = intra_pred_ang_xy_22_sse128;\r\n        ipred[INTRA_ANG_XY_23] = 
intra_pred_ang_xy_23_sse128;\r\n\r\n        ipred[INTRA_ANG_Y_25 ] = intra_pred_ang_y_25_sse128;\r\n        ipred[INTRA_ANG_Y_26 ] = intra_pred_ang_y_26_sse128;\r\n        ipred[INTRA_ANG_Y_28 ] = intra_pred_ang_y_28_sse128;\r\n        ipred[INTRA_ANG_Y_30 ] = intra_pred_ang_y_30_sse128;\r\n        ipred[INTRA_ANG_Y_31 ] = intra_pred_ang_y_31_sse128;\r\n        ipred[INTRA_ANG_Y_32 ] = intra_pred_ang_y_32_sse128;\r\n\r\n        pf->fill_edge_f[0]      = fill_edge_samples_0_sse128;\r\n        pf->fill_edge_f[1]      = fill_edge_samples_x_sse128;\r\n        pf->fill_edge_f[2]      = fill_edge_samples_y_sse128;\r\n        pf->fill_edge_f[3]      = fill_edge_samples_xy_sse128;\r\n#endif\r\n    }\r\n\r\n    /* 8/10bit assembly */\r\n    if (cpuid & DAVS2_CPU_AVX2 ) {\r\n#if !HIGH_BIT_DEPTH\r\n        ipred[DC_PRED        ] = intra_pred_dc_avx;\r\n        ipred[HOR_PRED       ] = intra_pred_hor_avx;\r\n        ipred[VERT_PRED      ] = intra_pred_ver_avx;\r\n\r\n        ipred[PLANE_PRED     ] = intra_pred_plane_avx;\r\n        ipred[BI_PRED        ] = intra_pred_bilinear_avx;\r\n\r\n        ipred[INTRA_ANG_X_3  ] = intra_pred_ang_x_3_avx;\r\n        ipred[INTRA_ANG_X_4  ] = intra_pred_ang_x_4_avx;\r\n        ipred[INTRA_ANG_X_5  ] = intra_pred_ang_x_5_avx;\r\n        ipred[INTRA_ANG_X_6  ] = intra_pred_ang_x_6_avx;\r\n        //ipred[INTRA_ANG_X_7  ] = intra_pred_ang_x_7_avx;\r\n        ipred[INTRA_ANG_X_8  ] = intra_pred_ang_x_8_avx;\r\n        //ipred[INTRA_ANG_X_9  ] = intra_pred_ang_x_9_avx;\r\n        ipred[INTRA_ANG_X_10 ] = intra_pred_ang_x_10_avx;\r\n        //ipred[INTRA_ANG_X_11 ] = intra_pred_ang_x_11_avx;\r\n\r\n        ipred[INTRA_ANG_XY_13] = intra_pred_ang_xy_13_avx;\r\n        ipred[INTRA_ANG_XY_14] = intra_pred_ang_xy_14_avx;\r\n        ipred[INTRA_ANG_XY_16] = intra_pred_ang_xy_16_avx;\r\n        ipred[INTRA_ANG_XY_18] = intra_pred_ang_xy_18_avx;\r\n        ipred[INTRA_ANG_XY_20] = intra_pred_ang_xy_20_avx;\r\n#if _MSC_VER  // TODO: 20180206 cause 
unexpected exit on Linux\r\n        ipred[INTRA_ANG_XY_22] = intra_pred_ang_xy_22_avx;\r\n#endif\r\n        ipred[INTRA_ANG_XY_23] = intra_pred_ang_xy_23_avx;\r\n\r\n        ipred[INTRA_ANG_Y_25 ] = intra_pred_ang_y_25_avx;\r\n        ipred[INTRA_ANG_Y_26 ] = intra_pred_ang_y_26_avx;\r\n        ipred[INTRA_ANG_Y_28 ] = intra_pred_ang_y_28_avx;\r\n        ipred[INTRA_ANG_Y_30 ] = intra_pred_ang_y_30_avx;\r\n        ipred[INTRA_ANG_Y_31 ] = intra_pred_ang_y_31_avx;\r\n        ipred[INTRA_ANG_Y_32 ] = intra_pred_ang_y_32_avx;\r\n#endif\r\n    }\r\n#else\r\n    UNUSED_PARAMETER(cpuid);\r\n#endif //if HAVE_MMX\r\n\r\n#undef ANG_X_OFFSET\r\n#undef ANG_XY_OFFSET\r\n#undef ANG_Y_OFFSET\r\n}\r\n"
  },
  {
    "path": "source/common/intra.h",
    "content": "/*\r\n * intra.h\r\n *\r\n * Description of this file:\r\n *    Intra prediction functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_INTRA_H\r\n#define DAVS2_INTRA_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n    \r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid intra_pred(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsy, int bsx, int i_avail)\r\n{\r\n    if (dir_mode != DC_PRED) {\r\n        gf_davs2.intraf[dir_mode](src, dst, i_dst, dir_mode, bsx, bsy);\r\n    } else {\r\n        int b_top  = !!IS_NEIGHBOR_AVAIL(i_avail, MD_I_TOP);\r\n        int b_left = !!IS_NEIGHBOR_AVAIL(i_avail, MD_I_LEFT);\r\n        int mode_ex 
= ((b_top << 8) + b_left);\r\n\r\n        gf_davs2.intraf[dir_mode](src, dst, i_dst, mode_ex, bsx, bsy);\r\n    }\r\n}\r\n\r\n#define davs2_intra_pred_init FPFX(intra_pred_init)\r\nvoid davs2_intra_pred_init(uint32_t cpuid, ao_funcs_t *pf);\r\n#define davs2_get_intra_pred FPFX(get_intra_pred)\r\nvoid davs2_get_intra_pred(davs2_row_rec_t *row_rec, cu_t *p_cu, int predmode, int ctu_x, int ctu_y, int bsx, int bsy);\r\n#define davs2_get_intra_pred_chroma FPFX(get_intra_pred_chroma)\r\nvoid davs2_get_intra_pred_chroma(davs2_row_rec_t *h, cu_t *p_cu, int ctu_c_x, int ctu_c_y);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_INTRA_H\r\n"
  },
  {
    "path": "source/common/mc.cc",
    "content": "/*\r\n * mc.cc\r\n *\r\n * Description of this file:\r\n *    MC functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include <stdlib.h>\r\n#include <string.h>\r\n#include \"common.h\"\r\n#include \"mc.h\"\r\n\r\n#if HAVE_MMX\r\n#include \"vec/intrinsic.h\"\r\n#include \"x86/ipfilter8.h\"\r\n#endif\r\n\r\n\r\n#if defined(_MSC_VER) || defined(__INTEL_COMPILER)\r\n/* ---------------------------------------------------------------------------\r\n * disable warning C4127: conditional expression is constant.\r\n */\r\n#pragma warning(disable: 4127)\r\n#endif\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * local & global variables\r\n * 
===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * interpolate filter (luma) */\r\nALIGN16(static const int8_t INTPL_FILTERS[4][8]) = {\r\n    {  0, 0,   0, 64,  0,  0,  0,  0 }, /* for full-pixel, no use */\r\n    { -1, 4, -10, 57, 19, -7,  3, -1 },\r\n    { -1, 4, -11, 40, 40, -11, 4, -1 },\r\n    { -1, 3,  -7, 19, 57, -10, 4, -1 }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * interpolate filter (chroma) */\r\nALIGN16(static const int8_t INTPL_FILTERS_C[8][4]) = {\r\n    {  0, 64,  0,  0 },                 /* for full-pixel, no use */\r\n    { -4, 62,  6,  0 },\r\n    { -6, 56, 15, -1 },\r\n    { -5, 47, 25, -3 },\r\n    { -4, 36, 36, -4 },\r\n    { -3, 25, 47, -5 },\r\n    { -1, 15, 56, -6 },\r\n    {  0,  6, 62, -4 }\r\n};\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * macros\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * for luma interpolating (horizontal) */\r\n#define FLT_8TAP_HOR(src, i, coef) (\\\r\n    (src)[i - 3] * (coef)[0] + \\\r\n    (src)[i - 2] * (coef)[1] + \\\r\n    (src)[i - 1] * (coef)[2] + \\\r\n    (src)[i    ] * (coef)[3] + \\\r\n    (src)[i + 1] * (coef)[4] + \\\r\n    (src)[i + 2] * (coef)[5] + \\\r\n    (src)[i + 3] * (coef)[6] + \\\r\n    (src)[i + 4] * (coef)[7])\r\n\r\n/* ---------------------------------------------------------------------------\r\n * for luma interpolating (vertical) */\r\n#define FLT_8TAP_VER(src, i, i_src, coef) (\\\r\n    (src)[i - 3 * i_src] * (coef)[0] + \\\r\n    (src)[i - 2 * i_src] * (coef)[1] + \\\r\n    (src)[i -     i_src] * (coef)[2] + \\\r\n    (src)[i            ] * (coef)[3] + \\\r\n    (src)[i +     i_src] * (coef)[4] + \\\r\n    (src)[i + 2 
* i_src] * (coef)[5] + \\\r\n    (src)[i + 3 * i_src] * (coef)[6] + \\\r\n    (src)[i + 4 * i_src] * (coef)[7])\r\n\r\n/* ---------------------------------------------------------------------------\r\n * for chroma interpolating (horizontal) */\r\n#define FLT_4TAP_HOR(src, i, coef) (\\\r\n    (src)[i - 1] * (coef)[0] + \\\r\n    (src)[i    ] * (coef)[1] + \\\r\n    (src)[i + 1] * (coef)[2] + \\\r\n    (src)[i + 2] * (coef)[3])\r\n\r\n/* ---------------------------------------------------------------------------\r\n * for chroma interpolating (vertical) */\r\n#define FLT_4TAP_VER(src, i, i_src, coef) (\\\r\n    (src)[i -     i_src] * (coef)[0] + \\\r\n    (src)[i            ] * (coef)[1] + \\\r\n    (src)[i +     i_src] * (coef)[2] + \\\r\n    (src)[i + 2 * i_src] * (coef)[3])\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * interpolate\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nmc_block_copy_c(pel_t *dst, intptr_t i_dst, pel_t *src, intptr_t i_src, int w, int h)\r\n{\r\n    while (h--) {\r\n        memcpy(dst, src, w * sizeof(pel_t));\r\n        dst += i_dst;\r\n        src += i_src;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nmc_block_copy_sc_c(coeff_t *dst, intptr_t i_dst, int16_t *src, intptr_t i_src, int w, int h)\r\n{\r\n    int i;\r\n\r\n    if (sizeof(coeff_t) == sizeof(int16_t)) {\r\n        while (h--) {\r\n            memcpy(dst, src, w * sizeof(coeff_t));\r\n            dst += i_dst;\r\n            src += i_src;\r\n        }\r\n    } else {\r\n        while (h--) {\r\n            for (i = 0; i < w; i++) {\r\n                dst[i] = src[i];\r\n            }\r\n            dst += i_dst;\r\n            src += i_src;\r\n        }\r\n    }\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nintpl_chroma_block_hor_c(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, int8_t const *coeff)\r\n{\r\n    int x, y, v;\r\n\r\n    for (y = 0; y < height; y++) {\r\n        for (x = 0; x < width; x++) {\r\n            v = (FLT_4TAP_HOR(src, x, coeff) + 32) >> 6;\r\n            dst[x] = (pel_t)DAVS2_CLIP1(v);\r\n        }\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nintpl_chroma_block_ver_c(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, int8_t const *coeff)\r\n{\r\n    int x, y, v;\r\n\r\n    for (y = 0; y < height; y++) {\r\n        for (x = 0; x < width; x++) {\r\n            v = (FLT_4TAP_VER(src, x, i_src, coeff) + 32) >> 6;\r\n            dst[x] = (pel_t)DAVS2_CLIP1(v);\r\n        }\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nintpl_chroma_block_ext_c(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff_h, const int8_t *coeff_v)\r\n{\r\n    // TODO: unify with the luma implementation\r\n    ALIGN16(int32_t tmp_res[(32 + 3) * 32]);\r\n    int32_t *tmp = tmp_res;\r\n    const int shift1 = g_bit_depth - 8;\r\n    const int add1   = (1 << shift1) >> 1;\r\n    const int shift2 = 20 - g_bit_depth;\r\n    const int add2   = 1 << (shift2 - 1); // 1<<(19-g_bit_depth)\r\n    int x, y, v;\r\n\r\n    src -= i_src;\r\n    for (y = -1; y < height + 2; y++) {\r\n        for (x = 0; x < width; x++) {\r\n            v = FLT_4TAP_HOR(src, x, coeff_h);\r\n            tmp[x] = (v + add1) >> shift1;\r\n        }\r\n        src += i_src;\r\n        tmp += 32;\r\n    }\r\n    tmp = tmp_res + 32;\r\n    for (y = 0; y < height; y++) {\r\n        for (x = 0; x < width; x++) 
{\r\n            v = (FLT_4TAP_VER(tmp, x, 32, coeff_v) + add2) >> shift2;\r\n            dst[x] = (pel_t)DAVS2_CLIP1(v);\r\n        }\r\n        dst += i_dst;\r\n        tmp += 32;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nintpl_luma_block_hor_c(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, int8_t const *coeff)\r\n{\r\n    int x, y, v;\r\n\r\n    for (y = 0; y < height; y++) {\r\n        for (x = 0; x < width; x++) {\r\n            v = (FLT_8TAP_HOR(src, x, coeff) + 32) >> 6;\r\n            dst[x] = (pel_t)DAVS2_CLIP1(v);\r\n        }\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nintpl_luma_block_ver_c(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, int8_t const *coeff)\r\n{\r\n    int x, y, v;\r\n\r\n    for (y = 0; y < height; y++) {\r\n        for (x = 0; x < width; x++) {\r\n            v = (FLT_8TAP_VER(src, x, i_src, coeff) + 32) >> 6;\r\n            dst[x] = (pel_t)DAVS2_CLIP1(v);\r\n        }\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nintpl_luma_block_ext_c(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff_h, const int8_t *coeff_v)\r\n{\r\n#define TMP_STRIDE      64\r\n    \r\n    const int shift1 = g_bit_depth - 8;\r\n    const int add1   = (1 << shift1) >> 1;\r\n    const int shift2 = 20 - g_bit_depth;\r\n    const int add2   = 1 << (shift2 - 1);//1<<(19-bit_depth)\r\n\r\n    ALIGN16(mct_t tmp_buf[(64 + 7) * TMP_STRIDE]);\r\n    mct_t *tmp = tmp_buf;\r\n    int x, y, v;\r\n\r\n    src -= 3 * i_src;\r\n    for (y = -3; y < height + 4; y++) {\r\n        for (x = 0; x < width; x++) {\r\n            v = FLT_8TAP_HOR(src, x, 
coeff_h);\r\n            tmp[x] = (mct_t)((v + add1) >> shift1);\r\n        }\r\n        src += i_src;\r\n        tmp += TMP_STRIDE;\r\n    }\r\n\r\n    tmp = tmp_buf + 3 * TMP_STRIDE;\r\n    for (y = 0; y < height; y++) {\r\n        for (x = 0; x < width; x++) {\r\n            v = (FLT_8TAP_VER(tmp, x, TMP_STRIDE, coeff_v) + add2) >> shift2;\r\n            dst[x] = (pel_t)DAVS2_CLIP1(v);\r\n        }\r\n\r\n        dst += i_dst;\r\n        tmp += TMP_STRIDE;\r\n    }\r\n\r\n#undef TMP_STRIDE\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define INTERP_HOR_C(width, height) \\\r\nstatic void interp_horiz_pp_##width##x##height##_c(const pel_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int coeffIdx) \\\r\n{ \\\r\n    const int N = 8;  /* 8-tap Luma interpolation */ \\\r\n    const int8_t* coeff = (N == 4) ? INTPL_FILTERS_C[coeffIdx] : INTPL_FILTERS[coeffIdx]; \\\r\n    int headRoom = 6; /* Log2 of sum of filter taps */ \\\r\n    int offset = (1 << (headRoom - 1)); \\\r\n    uint16_t maxVal = (1 << BIT_DEPTH) - 1; \\\r\n    int cStride = 1; \\\r\n    src -= (N / 2 - 1) * cStride; \\\r\n    int row, col; \\\r\n    for (row = 0; row < height; row++) {     \\\r\n        for (col = 0; col < width; col++) {  \\\r\n            int sum = src[col + 0 * cStride] * coeff[0];   \\\r\n            sum += src[col + 1 * cStride] * coeff[1];      \\\r\n            sum += src[col + 2 * cStride] * coeff[2];      \\\r\n            sum += src[col + 3 * cStride] * coeff[3];      \\\r\n            if (N == 8) {                                  \\\r\n                sum += src[col + 4 * cStride] * coeff[4];  \\\r\n                sum += src[col + 5 * cStride] * coeff[5];  \\\r\n                sum += src[col + 6 * cStride] * coeff[6];  \\\r\n                sum += src[col + 7 * cStride] * coeff[7];  \\\r\n            }                                              \\\r\n            int16_t val = (int16_t)((sum + offset) >> 
headRoom); \\\r\n            val = DAVS2_CLIP3(0, maxVal, val);                    \\\r\n            dst[col] = (pel_t)val;                               \\\r\n        } \\\r\n        src += srcStride; \\\r\n        dst += dstStride; \\\r\n    } \\\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define INTERP_PS_HOR_C(width, height) \\\r\nstatic void interp_horiz_ps_##width##x##height##_c(const pel_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) \\\r\n{ \\\r\n    const int N = 8;  /* 8-tap Luma interpolation */ \\\r\n    const int8_t* coeff = (N == 4) ? INTPL_FILTERS_C[coeffIdx] : INTPL_FILTERS[coeffIdx]; \\\r\n    int headRoom = 6; /* Log2 of sum of filter taps */ \\\r\n    int offset = (1 << (headRoom - 1)); \\\r\n    uint16_t maxVal = (1 << BIT_DEPTH) - 1; \\\r\n    int cStride = 1; \\\r\n    src -= (N / 2 - 1) * cStride; \\\r\n    int row, col; \\\r\n    for (row = 0; row < height; row++) {     \\\r\n        for (col = 0; col < width; col++) {  \\\r\n            int sum = src[col + 0 * cStride] * coeff[0];   \\\r\n            sum += src[col + 1 * cStride] * coeff[1];      \\\r\n            sum += src[col + 2 * cStride] * coeff[2];      \\\r\n            sum += src[col + 3 * cStride] * coeff[3];      \\\r\n            if (N == 8) {                                  \\\r\n                sum += src[col + 4 * cStride] * coeff[4];  \\\r\n                sum += src[col + 5 * cStride] * coeff[5];  \\\r\n                sum += src[col + 6 * cStride] * coeff[6];  \\\r\n                sum += src[col + 7 * cStride] * coeff[7];  \\\r\n            }                                              \\\r\n            int16_t val = (int16_t)((sum + offset) >> headRoom); \\\r\n            val = DAVS2_CLIP3(0, maxVal, val);                    \\\r\n            dst[col] = (pel_t)val;                               \\\r\n        } \\\r\n        src += srcStride; \\\r\n        dst += dstStride; \\\r\n    
} \\\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define INTERP_VER_C(width, height) \\\r\nstatic void interp_vert_pp_##width##x##height##_c(const pel_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int coeffIdx) \\\r\n{ \\\r\n    const int N = 8;  /* 8-tap Luma interpolation */ \\\r\n    const int8_t* c = (N == 4) ? INTPL_FILTERS_C[coeffIdx] : INTPL_FILTERS[coeffIdx]; \\\r\n    int shift = 6; \\\r\n    int offset = 1 << (shift - 1); \\\r\n    uint16_t maxVal = (1 << BIT_DEPTH) - 1; \\\r\n    src -= (N / 2 - 1) * srcStride; \\\r\n    int row, col; \\\r\n    for (row = 0; row < height; row++) {    \\\r\n        for (col = 0; col < width; col++) { \\\r\n            int sum = src[col + 0 * srcStride] * c[0];  \\\r\n            sum += src[col + 1 * srcStride] * c[1];     \\\r\n            sum += src[col + 2 * srcStride] * c[2];     \\\r\n            sum += src[col + 3 * srcStride] * c[3];     \\\r\n            if (N == 8) {                               \\\r\n                sum += src[col + 4 * srcStride] * c[4]; \\\r\n                sum += src[col + 5 * srcStride] * c[5]; \\\r\n                sum += src[col + 6 * srcStride] * c[6]; \\\r\n                sum += src[col + 7 * srcStride] * c[7]; \\\r\n            }                                           \\\r\n            int16_t val = (int16_t)((sum + offset) >> shift); \\\r\n            val = DAVS2_CLIP3(0, maxVal, val);                 \\\r\n            dst[col] = (pel_t)val;                            \\\r\n        } \\\r\n        src += srcStride;    \\\r\n        dst += dstStride;    \\\r\n    }                        \\\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define INTERP_SP_VER_C(w, h) \\\r\nstatic void filterVertical_sp_##w##x##h##_c(const int16_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int coeffIdx) \\\r\n{ \\\r\n    const int N = 8;  /* 8-tap Luma 
interpolation */ \\\r\n    int headRoom = 14 - BIT_DEPTH; \\\r\n    int shift = 6 + headRoom; \\\r\n    int offset = (1 << (shift - 1)) + ((1 << 13) << 6); \\\r\n    const int8_t* c = (N == 4 ? INTPL_FILTERS_C[coeffIdx] : INTPL_FILTERS[coeffIdx]); \\\r\n    int16_t maxVal = (1 << BIT_DEPTH) - 1;  \\\r\n    src -= (N / 2 - 1) * srcStride; \\\r\n    int row, col; \\\r\n    for (row = 0; row < h; row++) {     \\\r\n        for (col = 0; col < w; col++) { \\\r\n            int sum = src[col + 0 * srcStride] * c[0];  \\\r\n            sum += src[col + 1 * srcStride] * c[1];     \\\r\n            sum += src[col + 2 * srcStride] * c[2];     \\\r\n            sum += src[col + 3 * srcStride] * c[3];     \\\r\n            if (N == 8) {                               \\\r\n                sum += src[col + 4 * srcStride] * c[4]; \\\r\n                sum += src[col + 5 * srcStride] * c[5]; \\\r\n                sum += src[col + 6 * srcStride] * c[6]; \\\r\n                sum += src[col + 7 * srcStride] * c[7]; \\\r\n            }                                           \\\r\n            int16_t val = (int16_t)((sum + offset) >> shift); \\\r\n            val = DAVS2_CLIP3(0, maxVal, val);                 \\\r\n            dst[col] = (pel_t)val;                            \\\r\n        } \\\r\n        src += srcStride;    \\\r\n        dst += dstStride;    \\\r\n    }                        \\\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define INTERP_EXT_C(width, height) \\\r\nstatic void interp_hv_pp_##width##x##height##_c(const pel_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int idxX, int idxY) \\\r\n{ \\\r\n    int16_t immedVals[(64 + 8) * (64 + 8)]; \\\r\n    interp_horiz_ps_##width##x##height##_c(src, srcStride, immedVals, width, idxX); \\\r\n    filterVertical_sp_##width##x##height##_c(immedVals + 3 * width, width, dst, dstStride, idxY); \\\r\n}\r\n\r\n#define INTPL_OP_C(w, h) \\\r\n    
INTERP_HOR_C(w, h) \\\r\n    INTERP_PS_HOR_C(w, h) \\\r\n    INTERP_VER_C(w, h)    \\\r\n    INTERP_SP_VER_C(w, h) \\\r\n    INTERP_EXT_C(w, h)\r\n\r\n//INTPL_OP_C(64, 64)  /* 64x64 */\r\n//INTPL_OP_C(64, 32)\r\n//INTPL_OP_C(32, 64)\r\n//INTPL_OP_C(64, 16)\r\n//INTPL_OP_C(64, 48)\r\n//INTPL_OP_C(16, 64)\r\n//INTPL_OP_C(48, 64)\r\n//INTPL_OP_C(32, 32)  /* 32x32 */\r\n//INTPL_OP_C(32, 16)\r\n//INTPL_OP_C(16, 32)\r\n//INTPL_OP_C(32, 8)\r\n//INTPL_OP_C(32, 24)\r\n//INTPL_OP_C(8, 32)\r\n//INTPL_OP_C(24, 32)\r\n//INTPL_OP_C(16, 16)  /* 16x16 */\r\n//INTPL_OP_C(16, 8)\r\n//INTPL_OP_C(8, 16)\r\n//INTPL_OP_C(16, 4)\r\n//INTPL_OP_C(16, 12)\r\n//INTPL_OP_C(4, 16)\r\n//INTPL_OP_C(12, 16)\r\n//INTPL_OP_C(8, 8)  /* 8x8 */\r\n//INTPL_OP_C(8, 4)\r\n//INTPL_OP_C(4, 8)\r\n//INTPL_OP_C(4, 4)  /* 4x4 */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * interpolation of 1/4 subpixel\r\n *      A  dst  1  src  B\r\n *      c  d  e  f\r\n *      2  h  3  i\r\n *      j  k  l  m\r\n *      C           D\r\n */\r\nvoid mc_luma(davs2_t *h, pel_t *dst, int i_dst, int posx, int posy, int width, int height, pel_t *p_fref, int i_fref)\r\n{\r\n    const int dx = posx & 3;\r\n    const int dy = posy & 3;\r\n    const int mc_part_index = MC_PART_INDEX(width, height);\r\n\r\n\r\n    UNUSED_PARAMETER(h);\r\n    posx >>= 2;\r\n    posy >>= 2;\r\n    \r\n\r\n    p_fref += posy * i_fref + posx;\r\n    \r\n    if (dx == 0 && dy == 0) {\r\n        gf_davs2.copy_pp[PART_INDEX(width, height)](dst, i_dst, p_fref, i_fref);\r\n    } else if (dx == 0) {\r\n#if USE_NEW_INTPL\r\n        gf_davs2.block_intpl_luma_ver[PART_INDEX(width, height)](p_fref, i_fref, dst, i_dst, dy);\r\n#else\r\n        gf_davs2.intpl_luma_ver[mc_part_index][dy - 1](dst, i_dst, p_fref, i_fref, width, height, INTPL_FILTERS[dy]);\r\n#endif\r\n    } else if (dy == 0) {\r\n#if USE_NEW_INTPL\r\n        gf_davs2.block_intpl_luma_hor[PART_INDEX(width, height)](p_fref, i_fref, dst, i_dst, 
dx);\r\n#else\r\n        gf_davs2.intpl_luma_hor[mc_part_index][dx - 1](dst, i_dst, p_fref, i_fref, width, height, INTPL_FILTERS[dx]);\r\n#endif\r\n    } else {\r\n        gf_davs2.intpl_luma_ext[mc_part_index](dst, i_dst, p_fref, i_fref, width, height, INTPL_FILTERS[dx], INTPL_FILTERS[dy]);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid mc_chroma(davs2_t *h, pel_t *dst, int i_dst, int posx, int posy, int width, int height, pel_t *p_fref, int i_fref)\r\n{\r\n    const int dx = posx & 7;\r\n    const int dy = posy & 7;\r\n    const int mc_part_index = MC_PART_INDEX(width, height);\r\n\r\n    UNUSED_PARAMETER(h);\r\n    posx >>= 3;\r\n    posy >>= 3;\r\n\r\n    p_fref += posy * i_fref + posx;\r\n\r\n    if (dx == 0 && dy == 0) {\r\n        if (width != 2 && width != 6 && height != 2 && height != 6) {\r\n            gf_davs2.copy_pp[PART_INDEX(width, height)](dst, i_dst, p_fref, i_fref);\r\n        } else {\r\n            gf_davs2.block_copy(dst, i_dst, p_fref, i_fref, width, height);\r\n        }\r\n    } else if (dx == 0) {\r\n        gf_davs2.intpl_chroma_ver[mc_part_index](dst, i_dst, p_fref, i_fref, width, height, INTPL_FILTERS_C[dy]);\r\n    } else if (dy == 0) {\r\n        gf_davs2.intpl_chroma_hor[mc_part_index](dst, i_dst, p_fref, i_fref, width, height, INTPL_FILTERS_C[dx]);\r\n    } else {\r\n        gf_davs2.intpl_chroma_ext[mc_part_index](dst, i_dst, p_fref, i_fref, width, height, INTPL_FILTERS_C[dx], INTPL_FILTERS_C[dy]);\r\n    }\r\n}\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * pixel block average\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void davs2_pixel_average_c(pel_t *dst, int i_dst, const pel_t *src0, int i_src0, const pel_t *src1, int i_src1, int width, int height)\r\n{\r\n    
int i, j;\r\n\r\n    for (i = 0; i < height; i++) {\r\n        for (j = 0; j < width; j++) {\r\n            dst[j] = (pel_t)((src0[j] + src1[j] + 1) >> 1);\r\n        }\r\n        dst  += i_dst;\r\n        src0 += i_src0;\r\n        src1 += i_src1;\r\n    }\r\n}\r\n\r\n/**\r\n * ===========================================================================\r\n * plane copy\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define plane_copy_c          mc_block_copy_c\r\n\r\n#define ALL_LUMA_PU(name1, name2, cpu) \\\r\n    pf->name1[PART_64x64] = name2 ## _64x64 ##_## cpu;  /* 64x64 */ \\\r\n    pf->name1[PART_64x32] = name2 ## _64x32 ##_## cpu;\\\r\n    pf->name1[PART_32x64] = name2 ## _32x64 ##_## cpu;\\\r\n    pf->name1[PART_64x16] = name2 ## _64x16 ##_## cpu;\\\r\n    pf->name1[PART_64x48] = name2 ## _64x48 ##_## cpu;\\\r\n    pf->name1[PART_16x64] = name2 ## _16x64 ##_## cpu;\\\r\n    pf->name1[PART_48x64] = name2 ## _48x64 ##_## cpu;\\\r\n    pf->name1[PART_32x32] = name2 ## _32x32 ##_## cpu;  /* 32x32 */ \\\r\n    pf->name1[PART_32x16] = name2 ## _32x16 ##_## cpu;\\\r\n    pf->name1[PART_16x32] = name2 ## _16x32 ##_## cpu;\\\r\n    pf->name1[PART_32x8 ] = name2 ## _32x8  ##_## cpu;\\\r\n    pf->name1[PART_32x24] = name2 ## _32x24 ##_## cpu;\\\r\n    pf->name1[PART_8x32 ] = name2 ## _8x32  ##_## cpu;\\\r\n    pf->name1[PART_24x32] = name2 ## _24x32 ##_## cpu;\\\r\n    pf->name1[PART_16x16] = name2 ## _16x16 ##_## cpu;  /* 16x16 */ \\\r\n    pf->name1[PART_16x8 ] = name2 ## _16x8  ##_## cpu;\\\r\n    pf->name1[PART_8x16 ] = name2 ## _8x16  ##_## cpu;\\\r\n    pf->name1[PART_16x4 ] = name2 ## _16x4  ##_## cpu;\\\r\n    pf->name1[PART_16x12] = name2 ## _16x12 ##_## cpu;\\\r\n    pf->name1[PART_4x16 ] = name2 ## _4x16  ##_## cpu;\\\r\n    pf->name1[PART_12x16] = name2 ## _12x16 ##_## cpu;\\\r\n    pf->name1[PART_8x8  ] = name2 ## _8x8 
  ##_## cpu;  /* 8x8 */ \\\r\n    pf->name1[PART_8x4  ] = name2 ## _8x4   ##_## cpu;\\\r\n    pf->name1[PART_4x8  ] = name2 ## _4x8   ##_## cpu;\\\r\n    pf->name1[PART_4x4  ] = name2 ## _4x4   ##_## cpu  /* 4x4 */\r\n\r\n/**\r\n * ===========================================================================\r\n * interface function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_mc_init(uint32_t cpuid, ao_funcs_t *pf)\r\n{\r\n    UNUSED_PARAMETER(cpuid);\r\n\r\n    /* plane copy */\r\n    pf->plane_copy       = plane_copy_c;\r\n\r\n    pf->block_copy       = mc_block_copy_c;\r\n    pf->block_coeff_copy = mc_block_copy_sc_c;\r\n\r\n    /* block average */\r\n    pf->block_avg        = davs2_pixel_average_c;\r\n\r\n    /* interpolate */\r\n#if USE_NEW_INTPL\r\n    ALL_LUMA_PU(block_intpl_luma_hor, interp_horiz_pp, c);\r\n    ALL_LUMA_PU(block_intpl_luma_ver, interp_vert_pp, c);\r\n    ALL_LUMA_PU(block_intpl_luma_ext, interp_hv_pp, c);\r\n#endif\r\n    pf->intpl_luma_ver[0][0] = intpl_luma_block_ver_c;\r\n    pf->intpl_luma_ver[0][1] = intpl_luma_block_ver_c;\r\n    pf->intpl_luma_ver[0][2] = intpl_luma_block_ver_c;\r\n    pf->intpl_luma_hor[0][0] = intpl_luma_block_hor_c;\r\n    pf->intpl_luma_hor[0][1] = intpl_luma_block_hor_c;\r\n    pf->intpl_luma_hor[0][2] = intpl_luma_block_hor_c;\r\n    pf->intpl_luma_ext[0] = intpl_luma_block_ext_c;\r\n\r\n    pf->intpl_chroma_ver[0] = intpl_chroma_block_ver_c;\r\n    pf->intpl_chroma_hor[0] = intpl_chroma_block_hor_c;\r\n    pf->intpl_chroma_ext[0] = intpl_chroma_block_ext_c;\r\n\r\n    pf->intpl_luma_ver[1][0] = intpl_luma_block_ver_c;\r\n    pf->intpl_luma_ver[1][1] = intpl_luma_block_ver_c;\r\n    pf->intpl_luma_ver[1][2] = intpl_luma_block_ver_c;\r\n    pf->intpl_luma_hor[1][0] = intpl_luma_block_hor_c;\r\n    pf->intpl_luma_hor[1][1] = intpl_luma_block_hor_c;\r\n    
pf->intpl_luma_hor[1][2] = intpl_luma_block_hor_c;\r\n    pf->intpl_luma_ext[1] = intpl_luma_block_ext_c;\r\n\r\n    pf->intpl_chroma_ver[1] = intpl_chroma_block_ver_c;\r\n    pf->intpl_chroma_hor[1] = intpl_chroma_block_hor_c;\r\n    pf->intpl_chroma_ext[1] = intpl_chroma_block_ext_c;\r\n\r\n    /* init asm function handles */\r\n#if HAVE_MMX\r\n    if (cpuid & DAVS2_CPU_SSE42) {\r\n#if HIGH_BIT_DEPTH\r\n        // 10-bit assembly\r\n#else\r\n#if USE_NEW_INTPL\r\n        ALL_LUMA_PU(block_intpl_luma_hor, davs2_interp_8tap_horiz_pp, sse4);\r\n        pf->block_intpl_luma_hor[PART_4x4] = davs2_interp_8tap_horiz_pp_4x4_sse4;\r\n        ALL_LUMA_PU(block_intpl_luma_ver, davs2_interp_8tap_vert_pp, sse4);\r\n        pf->block_intpl_luma_ver[PART_4x4] = davs2_interp_8tap_vert_pp_4x4_sse4;\r\n        /* linking error */\r\n        // ALL_LUMA_PU(block_intpl_luma_ext, davs2_interp_8tap_hv_pp, sse4);\r\n        // pf->block_intpl_luma_ext[PART_4x4] = davs2_interp_8tap_hv_pp_4x4_sse4;\r\n#endif\r\n#endif //if HIGH_BIT_DEPTH\r\n    }\r\n\r\n    if (cpuid & DAVS2_CPU_SSE2) {\r\n        /* memory copy */\r\n        pf->plane_copy = plane_copy_c_sse2;\r\n    }\r\n\r\n    if (cpuid & DAVS2_CPU_SSE4) {\r\n        /* block average */\r\n        pf->block_avg        = avs_pixel_average_sse128;\r\n\r\n#if !HIGH_BIT_DEPTH\r\n        /* interpolate */\r\n        pf->intpl_luma_hor[0][0] = intpl_luma_block_hor_sse128;\r\n        pf->intpl_luma_hor[0][1] = intpl_luma_block_hor_sse128;\r\n        pf->intpl_luma_hor[0][2] = intpl_luma_block_hor_sse128;\r\n        pf->intpl_luma_ext[0] = intpl_luma_block_ext_sse128;\r\n\r\n        /* The values do not match.\r\n           When modifying this, remember to disable the avx2 functions as well.\r\n         */\r\n        //pf->intpl_chroma_ver[0] = intpl_chroma_block_ver_sse128;\r\n        pf->intpl_chroma_hor[0] = intpl_chroma_block_hor_sse128;\r\n        pf->intpl_chroma_ext[0] = intpl_chroma_block_ext_sse128;\r\n        \r\n        pf->intpl_luma_hor[1][0] = intpl_luma_block_hor_sse128;\r\n        pf->intpl_luma_hor[1][1] = 
intpl_luma_block_hor_sse128;\r\n        pf->intpl_luma_hor[1][2] = intpl_luma_block_hor_sse128;\r\n        pf->intpl_luma_ext[1] = intpl_luma_block_ext_sse128;\r\n        \r\n        //pf->intpl_chroma_ver[1] = intpl_chroma_block_ver_sse128;\r\n        pf->intpl_chroma_hor[1] = intpl_chroma_block_hor_sse128;\r\n        pf->intpl_chroma_ext[1] = intpl_chroma_block_ext_sse128;\r\n\r\n        pf->intpl_luma_ver[0][0] = intpl_luma_block_ver_sse128;\r\n        pf->intpl_luma_ver[0][1] = intpl_luma_block_ver_sse128;\r\n        pf->intpl_luma_ver[0][2] = intpl_luma_block_ver_sse128;\r\n        pf->intpl_luma_ver[1][0] = intpl_luma_block_ver_sse128;\r\n        pf->intpl_luma_ver[1][1] = intpl_luma_block_ver_sse128;\r\n        pf->intpl_luma_ver[1][2] = intpl_luma_block_ver_sse128;\r\n\r\n        pf->intpl_luma_ver[0][0] = intpl_luma_block_ver0_sse128;\r\n        pf->intpl_luma_ver[0][1] = intpl_luma_block_ver1_sse128;\r\n        pf->intpl_luma_ver[0][2] = intpl_luma_block_ver2_sse128;\r\n        pf->intpl_luma_ver[1][0] = intpl_luma_block_ver0_sse128;\r\n        pf->intpl_luma_ver[1][1] = intpl_luma_block_ver1_sse128;\r\n        pf->intpl_luma_ver[1][2] = intpl_luma_block_ver2_sse128;\r\n#endif\r\n    }\r\n    \r\n    if (cpuid & DAVS2_CPU_AVX2) {\r\n#if !HIGH_BIT_DEPTH\r\n        pf->intpl_luma_hor[1][0] = intpl_luma_block_hor_avx2;\r\n        pf->intpl_luma_hor[1][1] = intpl_luma_block_hor_avx2;\r\n        pf->intpl_luma_hor[1][2] = intpl_luma_block_hor_avx2;\r\n        pf->intpl_luma_ext[1] = intpl_luma_block_ext_avx2;\r\n\r\n        pf->intpl_chroma_ver[1] = intpl_chroma_block_ver_avx2;\r\n        pf->intpl_chroma_hor[1] = intpl_chroma_block_hor_avx2;\r\n        pf->intpl_chroma_ext[1] = intpl_chroma_block_ext_avx2;\r\n\r\n        pf->intpl_luma_ver[1][0] = intpl_luma_block_ver_avx2;\r\n        pf->intpl_luma_ver[1][1] = intpl_luma_block_ver_avx2;\r\n        pf->intpl_luma_ver[1][2] = intpl_luma_block_ver_avx2;\r\n\r\n        pf->intpl_luma_ver[1][0] = 
intpl_luma_block_ver0_avx2;\r\n        pf->intpl_luma_ver[1][1] = intpl_luma_block_ver1_avx2;\r\n        pf->intpl_luma_ver[1][2] = intpl_luma_block_ver2_avx2;\r\n#endif\r\n    }\r\n#endif  //if HAVE_MMX\r\n}\r\n"
  },
  {
    "path": "source/common/mc.h",
    "content": "/*\r\n * mc.h\r\n *\r\n * Description of this file:\r\n *    MC functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_MC_H\r\n#define DAVS2_MC_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#define mc_luma FPFX(mc_luma)\r\nvoid mc_luma  (davs2_t *h, pel_t *dst, int i_dst, int posx, int posy, int width, int height, pel_t *p_fref, int i_fref);\r\n#define mc_chroma FPFX(mc_chroma)\r\nvoid mc_chroma(davs2_t *h, pel_t *dst, int i_dst, int posx, int posy, int width, int height, pel_t *p_fref, int i_fref);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_MC_H\r\n"
  },
  {
    "path": "source/common/memory.cc",
    "content": "/*\r\n * memory.cc\r\n *\r\n * Description of this file:\r\n *    Memory functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"primitives.h\"\r\n\r\n#if HAVE_MMX\r\n#include \"vec/intrinsic.h\"\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid *memzero_aligned_c(void *dst, size_t n)\r\n{\r\n    return memset(dst, 0, n);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_memory_init(uint32_t cpuid, ao_funcs_t* pf)\r\n{\r\n    /* memory copy */\r\n    pf->fast_memcpy      = memcpy;\r\n    pf->fast_memset      = memset;\r\n    pf->memcpy_aligned   = memcpy;\r\n    
pf->fast_memzero     = memzero_aligned_c;\r\n    pf->memzero_aligned  = memzero_aligned_c;\r\n\r\n    /* init asm function handles */\r\n#if HAVE_MMX\r\n    if (cpuid & DAVS2_CPU_MMX) {\r\n        pf->fast_memcpy     = davs2_fast_memcpy_mmx;\r\n        pf->memcpy_aligned  = davs2_memcpy_aligned_mmx;\r\n        pf->fast_memset     = davs2_fast_memset_mmx;\r\n        pf->fast_memzero    = davs2_fast_memzero_mmx;\r\n        pf->memzero_aligned = davs2_fast_memzero_mmx;\r\n    }\r\n\r\n    if (cpuid & DAVS2_CPU_SSE) {\r\n        // pf->memcpy_aligned  = davs2_memcpy_aligned_sse;\r\n        // pf->memzero_aligned = davs2_memzero_aligned_sse;\r\n    }\r\n\r\n    if (cpuid & DAVS2_CPU_SSE2) {\r\n        pf->memzero_aligned = davs2_memzero_aligned_c_sse2;\r\n        // gf_davs2.memcpy_aligned  = davs2_memcpy_aligned_c_sse2;\r\n    }\r\n\r\n    if (cpuid & DAVS2_CPU_AVX2) {\r\n        pf->memzero_aligned = davs2_memzero_aligned_c_avx;\r\n    }\r\n#endif // HAVE_MMX\r\n}\r\n"
  },
  {
    "path": "source/common/osdep.h",
    "content": "/*\r\n * osdep.h\r\n *\r\n * Description of this file:\r\n *    platform-specific code functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_OSDEP_H\r\n#define DAVS2_OSDEP_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * disable warning C4996: functions or variables may be unsafe.\r\n */\r\n#if defined(_MSC_VER)\r\n#define WIN32_LEAN_AND_MEAN\r\n#define _CRT_NONSTDC_NO_DEPRECATE\r\n#define _CRT_SECURE_NO_DEPRECATE\r\n#define _CRT_SECURE_NO_WARNINGS\r\n#pragma warning(disable:4324)     /* disable warning C4324: structure was padded due to __declspec(align()) */\r\n#endif\r\n\r\n/**\r\n * 
===========================================================================\r\n * includes\r\n * ===========================================================================\r\n */\r\n\r\n#define _LARGEFILE_SOURCE 1\r\n#define _FILE_OFFSET_BITS 64\r\n#include <stdio.h>\r\n#include <sys/stat.h>\r\n#include <stdarg.h>\r\n\r\n/* ---------------------------------------------------------------------------\r\n * disable warning C4996: functions or variables may be unsafe.\r\n */\r\n#if defined(_MSC_VER)\r\n#include <intrin.h>\r\n#include <windows.h>\r\n#endif\r\n\r\n#if defined(__ICL) || defined(_MSC_VER)\r\n#include \"configw.h\"\r\n#else\r\n#include \"config.h\"\r\n#endif\r\n\r\n#if HAVE_STDINT_H\r\n#include <stdint.h>\r\n#else\r\n#include <inttypes.h>\r\n#endif\r\n\r\n#if defined(__INTEL_COMPILER)\r\n#include <mathimf.h>\r\n#else\r\n#include <math.h>\r\n#endif\r\n\r\n#if defined(_MSC_VER) || defined(__INTEL_COMPILER)\r\n#include <float.h>\r\n#endif\r\n\r\n/* disable warning C4100: : unreferenced formal parameter */\r\n#if defined(_MSC_VER) || defined(__INTEL_COMPILER)\r\n#define UNUSED_PARAMETER(v) (v) /* same as UNREFERENCED_PARAMETER */\r\n#else\r\n#define UNUSED_PARAMETER(v) (void)(v)\r\n#endif\r\n\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * const defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Specifies the number of bits per pixel that DAVS2 uses\r\n */\r\n#define AVS2_BIT_DEPTH          BIT_DEPTH\r\n\r\n#define WORD_SIZE               sizeof(void*)\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * const defines\r\n * ===========================================================================\r\n */\r\n\r\n#if defined(__GNUC__) && (__GNUC__ > 3 || __GNUC__ == 3 && __GNUC_MINOR__ > 0)\r\n#define UNINIT(x)             
  x = x\r\n#define UNUSED                  __attribute__((unused))\r\n#define ALWAYS_INLINE           __attribute__((always_inline)) inline\r\n#define NOINLINE                __attribute__((noinline))\r\n#define MAY_ALIAS               __attribute__((may_alias))\r\n#define davs2_constant_p(x)   __builtin_constant_p(x)\r\n#define davs2_nonconstant_p(x)    (!__builtin_constant_p(x))\r\n#define INLINE                  __inline\r\n#else\r\n#define UNINIT(x)               x\r\n#if defined(__ICL)\r\n#define ALWAYS_INLINE           __forceinline\r\n#define NOINLINE                __declspec(noinline)\r\n#else\r\n#define ALWAYS_INLINE           INLINE\r\n#define NOINLINE\r\n#endif\r\n#define UNUSED\r\n#define MAY_ALIAS\r\n#define davs2_constant_p(x)       0\r\n#define davs2_nonconstant_p(x)    0\r\n#endif\r\n\r\n/* string operations */\r\n#if defined(__ICL) || defined(_MSC_VER)\r\n#define INLINE                  __inline\r\n#define strcasecmp              _stricmp\r\n#define strncasecmp             _strnicmp\r\n#define snprintf                _snprintf\r\n#define S_ISREG(x)              (((x) & S_IFMT) == S_IFREG)\r\n#if !HAVE_POSIXTHREAD\r\n#define strtok_r                strtok_s\r\n#endif\r\n#else\r\n#include <strings.h>\r\n#endif\r\n\r\n#if (defined(__GNUC__) || defined(__INTEL_COMPILER)) && (ARCH_X86 || ARCH_X86_64)\r\n#ifndef HAVE_X86_INLINE_ASM\r\n#define HAVE_X86_INLINE_ASM     1\r\n#endif\r\n#endif\r\n\r\n// #if defined(_WIN32)\r\n// /* POSIX says that rename() removes the destination, but win32 doesn't. 
*/\r\n// #define rename(src,dst)         (unlink(dst), rename(src,dst))\r\n// #if !HAVE_POSIXTHREAD\r\n// #ifndef strtok_r\r\n// #define strtok_r(str,delim,save)    strtok(str, delim)\r\n// #endif\r\n// #endif\r\n// #endif\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * align\r\n */\r\n/* align a pointer */\r\n#  define CACHE_LINE_SIZE       32    /* for x86-64 and x86 */\r\n#  define ALIGN_POINTER(p)      (p) = (uint8_t *)((intptr_t)((p) + (CACHE_LINE_SIZE - 1)) & (~(intptr_t)(CACHE_LINE_SIZE - 1)))\r\n#  define CACHE_LINE_256B       32    /* for x86-64 and x86 */\r\n#  define ALIGN_256_PTR(p)      (p) = (uint8_t *)((intptr_t)((p) + (CACHE_LINE_256B - 1)) & (~(intptr_t)(CACHE_LINE_256B - 1)))\r\n\r\n#if defined(_MSC_VER)\r\n#define DECLARE_ALIGNED(var, n) __declspec(align(n)) var\r\n// #elif defined(__INTEL_COMPILER)\r\n// #define DECLARE_ALIGNED(var, n) var __declspec(align(n))\r\n#else\r\n#define DECLARE_ALIGNED(var, n) var __attribute__((aligned(n)))\r\n#endif\r\n#define ALIGN32(var)        DECLARE_ALIGNED(var, 32)\r\n#define ALIGN16(var)        DECLARE_ALIGNED(var, 16)\r\n#define ALIGN8(var)         DECLARE_ALIGNED(var, 8)\r\n\r\n\r\n// ARM compilers don't reliably align stack variables\r\n// - EABI requires only 8 byte stack alignment to be maintained\r\n// - gcc can't align stack variables to more even if the stack were to be correctly aligned outside the function\r\n// - armcc can't either, but is nice enough to actually tell you so\r\n// - Apple gcc only maintains 4 byte alignment\r\n// - llvm can align the stack, but only in svn and (unrelated) it exposes bugs in all released GNU binutils...\r\n\r\n#define ALIGNED_ARRAY_EMU( mask, type, name, sub1, ... )\\\r\n    uint8_t name##_u [sizeof(type sub1 __VA_ARGS__) + mask]; \\\r\n    type (*name) __VA_ARGS__ = (void*)((intptr_t)(name##_u+mask) & ~mask)\r\n\r\n#if ARCH_ARM && SYS_MACOSX\r\n#define ALIGNED_ARRAY_8( ... 
) ALIGNED_ARRAY_EMU( 7, __VA_ARGS__ )\r\n#else\r\n#define ALIGNED_ARRAY_8( type, name, sub1, ... )\\\r\n    ALIGN8( type name sub1 __VA_ARGS__ )\r\n#endif\r\n\r\n#if ARCH_ARM\r\n#define ALIGNED_ARRAY_16( ... ) ALIGNED_ARRAY_EMU( 15, __VA_ARGS__ )\r\n#else\r\n#define ALIGNED_ARRAY_16( type, name, sub1, ... )\\\r\n    ALIGN16( type name sub1 __VA_ARGS__ )\r\n#endif\r\n\r\n#define EXPAND(x) x\r\n\r\n#if defined(STACK_ALIGNMENT) && STACK_ALIGNMENT >= 32\r\n#define ALIGNED_ARRAY_32( type, name, sub1, ... )\\\r\n    ALIGN32( type name sub1 __VA_ARGS__ )\r\n#else\r\n#define ALIGNED_ARRAY_32( ... ) EXPAND( ALIGNED_ARRAY_EMU( 31, __VA_ARGS__ ) )\r\n#endif\r\n\r\n#define ALIGNED_ARRAY_64( ... ) EXPAND( ALIGNED_ARRAY_EMU( 63, __VA_ARGS__ ) )\r\n\r\n/* For AVX2 */\r\n#if ARCH_X86 || ARCH_X86_64\r\n#define NATIVE_ALIGN 32\r\n#define ALIGNED_N ALIGN32\r\n#define ALIGNED_ARRAY_N ALIGNED_ARRAY_32\r\n#else\r\n#define NATIVE_ALIGN 16\r\n#define ALIGNED_N ALIGN16\r\n#define ALIGNED_ARRAY_N ALIGNED_ARRAY_16\r\n#endif\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * threads\r\n */\r\n#if HAVE_BEOSTHREAD\r\n#include <kernel/OS.h>\r\n#define davs2_thread_t       thread_id\r\nstatic int ALWAYS_INLINE\r\navs2dec_pthread_create(davs2_thread_t *t, void *a, void *(*f)(void *), void *d)\r\n{\r\n    *t = spawn_thread(f, \"\", 10, d);\r\n    if (*t < B_NO_ERROR) {\r\n        return -1;\r\n    }\r\n    resume_thread(*t);\r\n    return 0;\r\n}\r\n\r\n#define davs2_thread_join(t,s)\\\r\n    {\\\r\n        long tmp; \\\r\n        wait_for_thread(t,(s)?(long*)(*(s)):&tmp);\\\r\n    }\r\n\r\n#elif HAVE_POSIXTHREAD\r\n#if defined(_MSC_VER) || defined(__ICL)\r\n#if _MSC_VER >= 1900\r\n#define HAVE_STRUCT_TIMESPEC    1       /* for struct timespec */\r\n#endif\r\n#pragma comment(lib, \"pthread_lib.lib\")\r\n#endif\r\n#include <pthread.h>\r\n#define davs2_thread_t                   pthread_t\r\n#define davs2_thread_create              
pthread_create\r\n#define davs2_thread_join                pthread_join\r\n#define davs2_thread_exit                pthread_exit\r\n#define davs2_thread_mutex_t             pthread_mutex_t\r\n#define davs2_thread_mutex_init          pthread_mutex_init\r\n#define davs2_thread_mutex_destroy       pthread_mutex_destroy\r\n#define davs2_thread_mutex_lock          pthread_mutex_lock\r\n#define davs2_thread_mutex_unlock        pthread_mutex_unlock\r\n#define davs2_thread_cond_t              pthread_cond_t\r\n#define davs2_thread_cond_init           pthread_cond_init\r\n#define davs2_thread_cond_destroy        pthread_cond_destroy\r\n#define davs2_thread_cond_signal         pthread_cond_signal\r\n#define davs2_thread_cond_broadcast      pthread_cond_broadcast\r\n#define davs2_thread_cond_wait           pthread_cond_wait\r\n#define davs2_thread_attr_t              pthread_attr_t\r\n#define davs2_thread_attr_init           pthread_attr_init\r\n#define davs2_thread_attr_destroy        pthread_attr_destroy\r\n#if defined(__ARM_ARCH_7A__) || SYS_LINUX\r\n#define davs2_thread_num_processors_np   get_nprocs\r\n#else\r\n#define davs2_thread_num_processors_np   pthread_num_processors_np\r\n#endif\r\n#define AVS2_PTHREAD_MUTEX_INITIALIZER   PTHREAD_MUTEX_INITIALIZER\r\n\r\n#elif HAVE_WIN32THREAD\r\n#include \"win32thread.h\"\r\n\r\n#else\r\n#define davs2_thread_t                   int\r\n#define davs2_thread_create(t,u,f,d)     0\r\n#define davs2_thread_join(t,s)\r\n#endif //HAVE_*THREAD\r\n\r\n#if !HAVE_POSIXTHREAD && !HAVE_WIN32THREAD\r\n#define davs2_thread_mutex_t             int\r\n#define davs2_thread_mutex_init(m,f)     0\r\n#define davs2_thread_mutex_destroy(m)\r\n#define davs2_thread_mutex_lock(m)\r\n#define davs2_thread_mutex_unlock(m)\r\n#define davs2_thread_cond_t              int\r\n#define davs2_thread_cond_init(c,f)      0\r\n#define davs2_thread_cond_destroy(c)\r\n#define davs2_thread_cond_broadcast(c)\r\n#define davs2_thread_cond_wait(c,m)\r\n#define 
davs2_thread_attr_t              int\r\n#define davs2_thread_attr_init(a)        0\r\n#define davs2_thread_attr_destroy(a)\r\n#define AVS2_PTHREAD_MUTEX_INITIALIZER   0\r\n#endif\r\n\r\n#if HAVE_WIN32THREAD || PTW32_STATIC_LIB\r\nint davs2_threading_init(void);\r\n#else\r\n#define davs2_threading_init()   0\r\n#endif\r\n\r\n#if HAVE_POSIXTHREAD\r\n#if SYS_WINDOWS\r\n#define davs2_lower_thread_priority(p)\\\r\n    {\\\r\n        davs2_thread_t handle = pthread_self();\\\r\n        struct sched_param sp;\\\r\n        int policy = SCHED_OTHER;\\\r\n        pthread_getschedparam(handle, &policy, &sp);\\\r\n        sp.sched_priority -= (p);\\\r\n        pthread_setschedparam(handle, policy, &sp);\\\r\n    }\r\n#else\r\n#include <unistd.h>\r\n#define davs2_lower_thread_priority(p) { int nice_ret = nice(p); }\r\n#define davs2_thread_spin_init(plock,pshare) pthread_spin_init(plock, pshare)\r\n#endif /* SYS_WINDOWS */\r\n#elif HAVE_WIN32THREAD\r\n#define davs2_lower_thread_priority(p) SetThreadPriority(GetCurrentThread(), DAVS2_MAX(-2, -(p)))\r\n#else\r\n#define davs2_lower_thread_priority(p)\r\n#endif\r\n\r\n#if SYS_WINDOWS\r\n#define davs2_sleep_ms(x)              Sleep(x)\r\n#else\r\n#define davs2_sleep_ms(x)              usleep((x) * 1000)\r\n#endif\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * inline functions\r\n * ===========================================================================\r\n */\r\nstatic int ALWAYS_INLINE davs2_is_regular_file(int filehandle)\r\n{\r\n    struct stat file_stat;\r\n    if (fstat(filehandle, &file_stat)) {\r\n        return -1;\r\n    }\r\n    return S_ISREG(file_stat.st_mode);\r\n}\r\n\r\nstatic int ALWAYS_INLINE davs2_is_regular_file_path(const char *filename)\r\n{\r\n    struct stat file_stat;\r\n    if (stat(filename, &file_stat)) {\r\n        return -1;\r\n    }\r\n    return S_ISREG(file_stat.st_mode);\r\n}\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif /* DAVS2_OSDEP_H 
*/\r\n"
  },
  {
    "path": "source/common/pixel.cc",
    "content": "/*\r\n * pixel.cc\r\n *\r\n * Description of this file:\r\n *    Pixel processing functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"vec/intrinsic.h\"\r\n\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * local & global variables (const tables)\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * partition map table\r\n */\r\nconst uint8_t g_partition_map_tab[] = {\r\n    //  4      8          12          16          20   24          28   32          36   40   44  
 48           52   56   60   64\r\n    PART_4x4,  PART_4x8,  255,        PART_4x16,  255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 4\r\n    PART_8x4,  PART_8x8,  255,        PART_8x16,  255, 255,        255, PART_8x32,  255, 255, 255, 255,        255, 255, 255, 255,          // 8\r\n    255,       255,       255,        PART_12x16, 255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 12\r\n    PART_16x4, PART_16x8, PART_16x12, PART_16x16, 255, 255,        255, PART_16x32, 255, 255, 255, 255,        255, 255, 255, PART_16x64,   // 16\r\n    255,       255,       255,        255,        255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 20\r\n    255,       255,       255,        255,        255, 255,        255, PART_24x32, 255, 255, 255, 255,        255, 255, 255, 255,          // 24\r\n    255,       255,       255,        255,        255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 28\r\n    255,       PART_32x8, 255,        PART_32x16, 255, PART_32x24, 255, PART_32x32, 255, 255, 255, 255,        255, 255, 255, PART_32x64,   // 32\r\n    255,       255,       255,        255,        255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 36\r\n    255,       255,       255,        255,        255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 40\r\n    255,       255,       255,        255,        255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 44\r\n    255,       255,       255,        255,        255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, PART_48x64,   // 48\r\n    255,       255,       255,        255,        255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 52\r\n    255,       255,       
255,        255,        255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 56\r\n    255,       255,       255,        255,        255, 255,        255, 255,        255, 255, 255, 255,        255, 255, 255, 255,          // 60\r\n    255,       255,       255,        PART_64x16, 255, 255,        255, PART_64x32, 255, 255, 255, PART_64x48, 255, 255, 255, PART_64x64    // 64\r\n};\r\n\r\n\r\n#define PIXEL_ADD_PS_C(w, h) \\\r\nstatic void davs2_pixel_add_ps_##w##x##h(pel_t *a, intptr_t dstride, const pel_t *b0, const coeff_t* b1, intptr_t sstride0, intptr_t sstride1)\\\r\n{\\\r\n    int x, y;\\\r\n    for (y = 0; y < h; y++) {\\\r\n        for (x = 0; x < w; x++) {\\\r\n            a[x] = (pel_t)DAVS2_CLIP1(b0[x] + b1[x]);\\\r\n        }\\\r\n        b0 += sstride0;\\\r\n        b1 += sstride1;\\\r\n        a  += dstride;\\\r\n    }\\\r\n}\r\n\r\n#define BLOCKCOPY_PP_C(w, h) \\\r\nstatic void davs2_blockcopy_pp_##w##x##h(pel_t *a, intptr_t stridea, const pel_t *b, intptr_t strideb)\\\r\n{\\\r\n    int x, y;\\\r\n    for (y = 0; y < h; y++) {\\\r\n        for (x = 0; x < w; x++) {\\\r\n            a[x] = b[x];\\\r\n        }\\\r\n        a += stridea;\\\r\n        b += strideb;\\\r\n    }\\\r\n}\r\n\r\n#define BLOCKCOPY_SS_C(w, h) \\\r\nstatic void davs2_blockcopy_ss_##w##x##h(coeff_t* a, intptr_t stridea, const coeff_t* b, intptr_t strideb)\\\r\n{\\\r\n    int x, y;\\\r\n    for (y = 0; y < h; y++) {\\\r\n        for (x = 0; x < w; x++) {\\\r\n            a[x] = b[x];\\\r\n        }\\\r\n        a += stridea;\\\r\n        b += strideb;\\\r\n    }\\\r\n}\r\n\r\n#define BLOCK_OP_C(w, h) \\\r\n    PIXEL_ADD_PS_C(w, h); \\\r\n    BLOCKCOPY_PP_C(w, h); \\\r\n    BLOCKCOPY_SS_C(w, h);\r\n\r\nBLOCK_OP_C(64, 64)  /* 64x64 */\r\nBLOCK_OP_C(64, 32)\r\nBLOCK_OP_C(32, 64)\r\nBLOCK_OP_C(64, 16)\r\nBLOCK_OP_C(64, 48)\r\nBLOCK_OP_C(16, 64)\r\nBLOCK_OP_C(48, 64)\r\nBLOCK_OP_C(32, 32)  /* 32x32 */\r\nBLOCK_OP_C(32, 16)\r\nBLOCK_OP_C(16, 
32)\r\nBLOCK_OP_C(32,  8)\r\nBLOCK_OP_C(32, 24)\r\nBLOCK_OP_C( 8, 32)\r\nBLOCK_OP_C(24, 32)\r\nBLOCK_OP_C(16, 16)  /* 16x16 */\r\nBLOCK_OP_C(16,  8)\r\nBLOCK_OP_C( 8, 16)\r\nBLOCK_OP_C(16,  4)\r\nBLOCK_OP_C(16, 12)\r\nBLOCK_OP_C( 4, 16)\r\nBLOCK_OP_C(12, 16)\r\nBLOCK_OP_C( 8,  8)  /* 8x8 */\r\nBLOCK_OP_C( 8,  4)\r\nBLOCK_OP_C( 4,  8)\r\nBLOCK_OP_C( 4,  4)  /* 4x4 */\r\n\r\n#define DECL_PIXELS(cpu) \\\r\n    FUNCDEF_PU(void,        pixel_avg,    cpu, pel_t* dst, intptr_t dstride, const pel_t* src0, intptr_t sstride0, const pel_t* src1, intptr_t sstride1, int);\\\r\n    FUNCDEF_PU(void,        pixel_add_ps, cpu, pel_t* a,   intptr_t dstride, const pel_t* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);\\\r\n    FUNCDEF_PU(void,        blockcopy_pp, cpu, pel_t *a, intptr_t stridea, const pel_t *b, intptr_t strideb);\\\r\n    FUNCDEF_PU(void,        blockcopy_ss, cpu, int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb);\\\r\n    FUNCDEF_CHROMA_PU(void, addAvg,       cpu, const int16_t*, const int16_t*, pel_t*, intptr_t, intptr_t, intptr_t)\r\n\r\nDECL_PIXELS(mmx);\r\nDECL_PIXELS(mmx2);\r\nDECL_PIXELS(sse2);\r\nDECL_PIXELS(sse3);\r\nDECL_PIXELS(sse4);\r\nDECL_PIXELS(ssse3);\r\nDECL_PIXELS(avx);\r\nDECL_PIXELS(xop);\r\nDECL_PIXELS(avx2);\r\n\r\n#undef DECL_PIXELS\r\n\r\n\r\n#define ALL_LUMA_PU(name1, name2, cpu) \\\r\n    pixf->name1[PART_64x64] = davs2_ ## name2 ## _64x64 ## cpu;  /* 64x64 */ \\\r\n    pixf->name1[PART_64x32] = davs2_ ## name2 ## _64x32 ## cpu;\\\r\n    pixf->name1[PART_32x64] = davs2_ ## name2 ## _32x64 ## cpu;\\\r\n    pixf->name1[PART_64x16] = davs2_ ## name2 ## _64x16 ## cpu;\\\r\n    pixf->name1[PART_64x48] = davs2_ ## name2 ## _64x48 ## cpu;\\\r\n    pixf->name1[PART_16x64] = davs2_ ## name2 ## _16x64 ## cpu;\\\r\n    pixf->name1[PART_48x64] = davs2_ ## name2 ## _48x64 ## cpu;\\\r\n    pixf->name1[PART_32x32] = davs2_ ## name2 ## _32x32 ## cpu;  /* 32x32 */ \\\r\n    pixf->name1[PART_32x16] = davs2_ ## name2 ## _32x16 ## 
cpu;\\\r\n    pixf->name1[PART_16x32] = davs2_ ## name2 ## _16x32 ## cpu;\\\r\n    pixf->name1[PART_32x8 ] = davs2_ ## name2 ## _32x8  ## cpu;\\\r\n    pixf->name1[PART_32x24] = davs2_ ## name2 ## _32x24 ## cpu;\\\r\n    pixf->name1[PART_8x32 ] = davs2_ ## name2 ## _8x32  ## cpu;\\\r\n    pixf->name1[PART_24x32] = davs2_ ## name2 ## _24x32 ## cpu;\\\r\n    pixf->name1[PART_16x16] = davs2_ ## name2 ## _16x16 ## cpu;  /* 16x16 */ \\\r\n    pixf->name1[PART_16x8 ] = davs2_ ## name2 ## _16x8  ## cpu;\\\r\n    pixf->name1[PART_8x16 ] = davs2_ ## name2 ## _8x16  ## cpu;\\\r\n    pixf->name1[PART_16x4 ] = davs2_ ## name2 ## _16x4  ## cpu;\\\r\n    pixf->name1[PART_16x12] = davs2_ ## name2 ## _16x12 ## cpu;\\\r\n    pixf->name1[PART_4x16 ] = davs2_ ## name2 ## _4x16  ## cpu;\\\r\n    pixf->name1[PART_12x16] = davs2_ ## name2 ## _12x16 ## cpu;\\\r\n    pixf->name1[PART_8x8  ] = davs2_ ## name2 ## _8x8   ## cpu;  /* 8x8 */ \\\r\n    pixf->name1[PART_8x4  ] = davs2_ ## name2 ## _8x4   ## cpu;\\\r\n    pixf->name1[PART_4x8  ] = davs2_ ## name2 ## _4x8   ## cpu;\\\r\n    pixf->name1[PART_4x4  ] = davs2_ ## name2 ## _4x4   ## cpu  /* 4x4 */\r\n\r\nvoid davs2_pixel_init(uint32_t cpuid, ao_funcs_t* pixf)\r\n{\r\n    ALL_LUMA_PU(add_ps, pixel_add_ps,  );\r\n    ALL_LUMA_PU(copy_pp, blockcopy_pp, );\r\n    ALL_LUMA_PU(copy_ss, blockcopy_ss, );\r\n\r\n#if HAVE_MMX\r\n    if (cpuid & DAVS2_CPU_SSE2) {\r\n#if HIGH_BIT_DEPTH\r\n        //10bit assemble\r\n        if (sizeof(pel_t) == sizeof(int16_t) && cpuid) {\r\n            pixf->copy_pp[PART_64x64] = (copy_pp_t)davs2_blockcopy_ss_64x64_sse2;  /* 64x64 */\r\n            pixf->copy_pp[PART_64x32] = (copy_pp_t)davs2_blockcopy_ss_64x32_sse2;\r\n            pixf->copy_pp[PART_32x64] = (copy_pp_t)davs2_blockcopy_ss_32x64_sse2;\r\n            pixf->copy_pp[PART_64x16] = (copy_pp_t)davs2_blockcopy_ss_64x16_sse2;\r\n            pixf->copy_pp[PART_64x48] = (copy_pp_t)davs2_blockcopy_ss_64x48_sse2;\r\n            pixf->copy_pp[PART_16x64] = 
(copy_pp_t)davs2_blockcopy_ss_16x64_sse2;\r\n            pixf->copy_pp[PART_48x64] = (copy_pp_t)davs2_blockcopy_ss_48x64_sse2;\r\n            pixf->copy_pp[PART_32x32] = (copy_pp_t)davs2_blockcopy_ss_32x32_sse2; /* 32x32 */\r\n            pixf->copy_pp[PART_32x16] = (copy_pp_t)davs2_blockcopy_ss_32x16_sse2;\r\n            pixf->copy_pp[PART_16x32] = (copy_pp_t)davs2_blockcopy_ss_16x32_sse2;\r\n            pixf->copy_pp[PART_32x8 ] = (copy_pp_t)davs2_blockcopy_ss_32x8_sse2;\r\n            pixf->copy_pp[PART_32x24] = (copy_pp_t)davs2_blockcopy_ss_32x24_sse2;\r\n            pixf->copy_pp[PART_8x32 ] = (copy_pp_t)davs2_blockcopy_ss_8x32_sse2;\r\n            pixf->copy_pp[PART_24x32] = (copy_pp_t)davs2_blockcopy_ss_24x32_sse2;\r\n            pixf->copy_pp[PART_16x16] = (copy_pp_t)davs2_blockcopy_ss_16x16_sse2; /* 16x16 */\r\n            pixf->copy_pp[PART_16x8 ] = (copy_pp_t)davs2_blockcopy_ss_16x8_sse2;\r\n            pixf->copy_pp[PART_8x16 ] = (copy_pp_t)davs2_blockcopy_ss_8x16_sse2;\r\n            pixf->copy_pp[PART_16x4 ] = (copy_pp_t)davs2_blockcopy_ss_16x4_sse2;\r\n            pixf->copy_pp[PART_16x12] = (copy_pp_t)davs2_blockcopy_ss_16x12_sse2;\r\n            pixf->copy_pp[PART_4x16 ] = (copy_pp_t)davs2_blockcopy_ss_4x16_sse2;\r\n            pixf->copy_pp[PART_12x16] = (copy_pp_t)davs2_blockcopy_ss_12x16_sse2;\r\n            pixf->copy_pp[PART_8x8  ] = (copy_pp_t)davs2_blockcopy_ss_8x8_sse2; /* 8x8 */\r\n            pixf->copy_pp[PART_8x4  ] = (copy_pp_t)davs2_blockcopy_ss_8x4_sse2;\r\n            pixf->copy_pp[PART_4x8  ] = (copy_pp_t)davs2_blockcopy_ss_4x8_sse2;\r\n            pixf->copy_pp[PART_4x4  ] = (copy_pp_t)davs2_blockcopy_ss_4x4_sse2;  /* 4x4 */\r\n        }\r\n        if (sizeof(coeff_t) == sizeof(int16_t) && cpuid) {\r\n            pixf->copy_ss[PART_64x64] = (copy_ss_t)davs2_blockcopy_ss_64x64_sse2;  /* 64x64 */\r\n            pixf->copy_ss[PART_64x32] = (copy_ss_t)davs2_blockcopy_ss_64x32_sse2;\r\n            pixf->copy_ss[PART_32x64] = 
(copy_ss_t)davs2_blockcopy_ss_32x64_sse2;\r\n            pixf->copy_ss[PART_64x16] = (copy_ss_t)davs2_blockcopy_ss_64x16_sse2;\r\n            pixf->copy_ss[PART_64x48] = (copy_ss_t)davs2_blockcopy_ss_64x48_sse2;\r\n            pixf->copy_ss[PART_16x64] = (copy_ss_t)davs2_blockcopy_ss_16x64_sse2;\r\n            pixf->copy_ss[PART_48x64] = (copy_ss_t)davs2_blockcopy_ss_48x64_sse2;\r\n            pixf->copy_ss[PART_32x32] = (copy_ss_t)davs2_blockcopy_ss_32x32_sse2; /* 32x32 */\r\n            pixf->copy_ss[PART_32x16] = (copy_ss_t)davs2_blockcopy_ss_32x16_sse2;\r\n            pixf->copy_ss[PART_16x32] = (copy_ss_t)davs2_blockcopy_ss_16x32_sse2;\r\n            pixf->copy_ss[PART_32x8 ] = (copy_ss_t)davs2_blockcopy_ss_32x8_sse2;\r\n            pixf->copy_ss[PART_32x24] = (copy_ss_t)davs2_blockcopy_ss_32x24_sse2;\r\n            pixf->copy_ss[PART_8x32 ] = (copy_ss_t)davs2_blockcopy_ss_8x32_sse2;\r\n            pixf->copy_ss[PART_24x32] = (copy_ss_t)davs2_blockcopy_ss_24x32_sse2;\r\n            pixf->copy_ss[PART_16x16] = (copy_ss_t)davs2_blockcopy_ss_16x16_sse2; /* 16x16 */\r\n            pixf->copy_ss[PART_16x8 ] = (copy_ss_t)davs2_blockcopy_ss_16x8_sse2;\r\n            pixf->copy_ss[PART_8x16 ] = (copy_ss_t)davs2_blockcopy_ss_8x16_sse2;\r\n            pixf->copy_ss[PART_16x4 ] = (copy_ss_t)davs2_blockcopy_ss_16x4_sse2;\r\n            pixf->copy_ss[PART_16x12] = (copy_ss_t)davs2_blockcopy_ss_16x12_sse2;\r\n            pixf->copy_ss[PART_4x16 ] = (copy_ss_t)davs2_blockcopy_ss_4x16_sse2;\r\n            pixf->copy_ss[PART_12x16] = (copy_ss_t)davs2_blockcopy_ss_12x16_sse2;\r\n            pixf->copy_ss[PART_8x8  ] = (copy_ss_t)davs2_blockcopy_ss_8x8_sse2; /* 8x8 */\r\n            pixf->copy_ss[PART_8x4  ] = (copy_ss_t)davs2_blockcopy_ss_8x4_sse2;\r\n            pixf->copy_ss[PART_4x8  ] = (copy_ss_t)davs2_blockcopy_ss_4x8_sse2;\r\n            pixf->copy_ss[PART_4x4  ] = (copy_ss_t)davs2_blockcopy_ss_4x4_sse2;  /* 4x4 */\r\n        }\r\n#else\r\n        ALL_LUMA_PU(copy_pp, 
blockcopy_pp, _sse2);\r\n        ALL_LUMA_PU(copy_ss, blockcopy_ss, _sse2);\r\n#endif\r\n    }\r\n\r\n    if (cpuid & DAVS2_CPU_SSE4) {\r\n#if HIGH_BIT_DEPTH\r\n        //10bit assemble\r\n#else\r\n        pixf->add_ps[PART_4x4  ] = davs2_pixel_add_ps_4x4_sse4;\r\n        pixf->add_ps[PART_4x8  ] = davs2_pixel_add_ps_4x8_sse4;\r\n        pixf->add_ps[PART_4x16 ] = davs2_pixel_add_ps_4x16_sse4;\r\n        pixf->add_ps[PART_8x8  ] = davs2_pixel_add_ps_8x8_sse4;\r\n        pixf->add_ps[PART_8x16 ] = davs2_pixel_add_ps_8x16_sse4;\r\n        pixf->add_ps[PART_8x32 ] = davs2_pixel_add_ps_8x32_sse4;\r\n        pixf->add_ps[PART_16x4 ] = davs2_pixel_add_ps_16x4_sse4;\r\n        pixf->add_ps[PART_16x8 ] = davs2_pixel_add_ps_16x8_sse4;\r\n        pixf->add_ps[PART_16x12] = davs2_pixel_add_ps_16x12_sse4;\r\n        pixf->add_ps[PART_16x16] = davs2_pixel_add_ps_16x16_sse4;\r\n        pixf->add_ps[PART_16x64] = davs2_pixel_add_ps_16x64_sse4;\r\n        pixf->add_ps[PART_32x8 ] = davs2_pixel_add_ps_32x8_sse4;\r\n        pixf->add_ps[PART_32x16] = davs2_pixel_add_ps_32x16_sse4;\r\n        pixf->add_ps[PART_32x24] = davs2_pixel_add_ps_32x24_sse4;\r\n        pixf->add_ps[PART_32x32] = davs2_pixel_add_ps_32x32_sse4;\r\n        pixf->add_ps[PART_32x64] = davs2_pixel_add_ps_32x64_sse4;\r\n        pixf->add_ps[PART_64x16] = davs2_pixel_add_ps_64x16_sse4;\r\n        pixf->add_ps[PART_64x32] = davs2_pixel_add_ps_64x32_sse4;\r\n        pixf->add_ps[PART_64x48] = davs2_pixel_add_ps_64x48_sse4;\r\n        pixf->add_ps[PART_64x64] = davs2_pixel_add_ps_64x64_sse4;\r\n#endif\r\n    }\r\n    \r\n    if (cpuid & DAVS2_CPU_AVX) {\r\n#if HIGH_BIT_DEPTH\r\n        //10bit assemble\r\n        if (sizeof(pel_t) == sizeof(int16_t) && cpuid) {\r\n            pixf->copy_pp[PART_64x64] = (copy_pp_t)davs2_blockcopy_ss_64x64_avx;\r\n            pixf->copy_pp[PART_64x32] = (copy_pp_t)davs2_blockcopy_ss_64x32_avx;\r\n            pixf->copy_pp[PART_32x64] = (copy_pp_t)davs2_blockcopy_ss_32x64_avx;\r\n         
   pixf->copy_pp[PART_64x16] = (copy_pp_t)davs2_blockcopy_ss_64x16_avx;\r\n            pixf->copy_pp[PART_64x48] = (copy_pp_t)davs2_blockcopy_ss_64x48_avx;\r\n            pixf->copy_pp[PART_16x64] = (copy_pp_t)davs2_blockcopy_ss_16x64_avx;\r\n            pixf->copy_pp[PART_48x64] = (copy_pp_t)davs2_blockcopy_ss_48x64_avx;\r\n            pixf->copy_pp[PART_32x32] = (copy_pp_t)davs2_blockcopy_ss_32x32_avx;\r\n            pixf->copy_pp[PART_32x16] = (copy_pp_t)davs2_blockcopy_ss_32x16_avx;\r\n            pixf->copy_pp[PART_16x32] = (copy_pp_t)davs2_blockcopy_ss_16x32_avx;\r\n            pixf->copy_pp[PART_32x8 ] = (copy_pp_t)davs2_blockcopy_ss_32x8_avx;\r\n            pixf->copy_pp[PART_32x24] = (copy_pp_t)davs2_blockcopy_ss_32x24_avx;\r\n            pixf->copy_pp[PART_24x32] = (copy_pp_t)davs2_blockcopy_ss_24x32_avx;\r\n            pixf->copy_pp[PART_16x16] = (copy_pp_t)davs2_blockcopy_ss_16x16_avx;\r\n            pixf->copy_pp[PART_16x8 ] = (copy_pp_t)davs2_blockcopy_ss_16x8_avx;\r\n            pixf->copy_pp[PART_16x4 ] = (copy_pp_t)davs2_blockcopy_ss_16x4_avx;\r\n            pixf->copy_pp[PART_16x12] = (copy_pp_t)davs2_blockcopy_ss_16x12_avx;\r\n        }\r\n        if (sizeof(coeff_t) == sizeof(int16_t) && cpuid) {\r\n            pixf->copy_ss[PART_64x64] = (copy_ss_t)davs2_blockcopy_ss_64x64_avx;\r\n            pixf->copy_ss[PART_64x32] = (copy_ss_t)davs2_blockcopy_ss_64x32_avx;\r\n            pixf->copy_ss[PART_32x64] = (copy_ss_t)davs2_blockcopy_ss_32x64_avx;\r\n            pixf->copy_ss[PART_64x16] = (copy_ss_t)davs2_blockcopy_ss_64x16_avx;\r\n            pixf->copy_ss[PART_64x48] = (copy_ss_t)davs2_blockcopy_ss_64x48_avx;\r\n            pixf->copy_ss[PART_16x64] = (copy_ss_t)davs2_blockcopy_ss_16x64_avx;\r\n            pixf->copy_ss[PART_48x64] = (copy_ss_t)davs2_blockcopy_ss_48x64_avx;\r\n            pixf->copy_ss[PART_32x32] = (copy_ss_t)davs2_blockcopy_ss_32x32_avx;\r\n            pixf->copy_ss[PART_32x16] = (copy_ss_t)davs2_blockcopy_ss_32x16_avx;\r\n     
       pixf->copy_ss[PART_16x32] = (copy_ss_t)davs2_blockcopy_ss_16x32_avx;\r\n            pixf->copy_ss[PART_32x8 ] = (copy_ss_t)davs2_blockcopy_ss_32x8_avx;\r\n            pixf->copy_ss[PART_32x24] = (copy_ss_t)davs2_blockcopy_ss_32x24_avx;\r\n            pixf->copy_ss[PART_24x32] = (copy_ss_t)davs2_blockcopy_ss_24x32_avx;\r\n            pixf->copy_ss[PART_16x16] = (copy_ss_t)davs2_blockcopy_ss_16x16_avx;\r\n            pixf->copy_ss[PART_16x8 ] = (copy_ss_t)davs2_blockcopy_ss_16x8_avx;\r\n            pixf->copy_ss[PART_16x4 ] = (copy_ss_t)davs2_blockcopy_ss_16x4_avx;\r\n            pixf->copy_ss[PART_16x12] = (copy_ss_t)davs2_blockcopy_ss_16x12_avx;\r\n        }\r\n#else\r\n        pixf->copy_pp[PART_64x64] = davs2_blockcopy_pp_64x64_avx;\r\n        pixf->copy_pp[PART_64x32] = davs2_blockcopy_pp_64x32_avx;\r\n        pixf->copy_pp[PART_32x64] = davs2_blockcopy_pp_32x64_avx;\r\n        pixf->copy_pp[PART_64x16] = davs2_blockcopy_pp_64x16_avx;\r\n        pixf->copy_pp[PART_64x48] = davs2_blockcopy_pp_64x48_avx;\r\n        pixf->copy_pp[PART_48x64] = davs2_blockcopy_pp_48x64_avx;\r\n        pixf->copy_pp[PART_32x32] = davs2_blockcopy_pp_32x32_avx;\r\n        pixf->copy_pp[PART_32x16] = davs2_blockcopy_pp_32x16_avx;\r\n        pixf->copy_pp[PART_32x8 ] = davs2_blockcopy_pp_32x8_avx;\r\n        pixf->copy_pp[PART_32x24] = davs2_blockcopy_pp_32x24_avx;\r\n        \r\n        pixf->copy_ss[PART_64x64] = davs2_blockcopy_ss_64x64_avx;\r\n        pixf->copy_ss[PART_64x32] = davs2_blockcopy_ss_64x32_avx;\r\n        pixf->copy_ss[PART_32x64] = davs2_blockcopy_ss_32x64_avx;\r\n        pixf->copy_ss[PART_64x16] = davs2_blockcopy_ss_64x16_avx;\r\n        pixf->copy_ss[PART_64x48] = davs2_blockcopy_ss_64x48_avx;\r\n        pixf->copy_ss[PART_16x64] = davs2_blockcopy_ss_16x64_avx;\r\n        pixf->copy_ss[PART_48x64] = davs2_blockcopy_ss_48x64_avx;\r\n        pixf->copy_ss[PART_32x32] = davs2_blockcopy_ss_32x32_avx;\r\n        pixf->copy_ss[PART_32x16] = 
davs2_blockcopy_ss_32x16_avx;\r\n        pixf->copy_ss[PART_16x32] = davs2_blockcopy_ss_16x32_avx;\r\n        pixf->copy_ss[PART_32x8 ] = davs2_blockcopy_ss_32x8_avx;\r\n        pixf->copy_ss[PART_32x24] = davs2_blockcopy_ss_32x24_avx;\r\n        pixf->copy_ss[PART_24x32] = davs2_blockcopy_ss_24x32_avx;\r\n        pixf->copy_ss[PART_16x16] = davs2_blockcopy_ss_16x16_avx;\r\n        pixf->copy_ss[PART_16x8 ] = davs2_blockcopy_ss_16x8_avx;\r\n        pixf->copy_ss[PART_16x4 ] = davs2_blockcopy_ss_16x4_avx;\r\n        pixf->copy_ss[PART_16x12] = davs2_blockcopy_ss_16x12_avx;\r\n#endif\r\n    }\r\n\r\n    if (cpuid & DAVS2_CPU_AVX2) {\r\n#if HIGH_BIT_DEPTH\r\n        //10bit assemble\r\n#else\r\n        pixf->add_ps[PART_16x4 ] = davs2_pixel_add_ps_16x4_avx2;\r\n        pixf->add_ps[PART_16x8 ] = davs2_pixel_add_ps_16x8_avx2;\r\n        pixf->add_ps[PART_16x12] = davs2_pixel_add_ps_16x12_avx2;\r\n        pixf->add_ps[PART_16x16] = davs2_pixel_add_ps_16x16_avx2;\r\n        pixf->add_ps[PART_16x64] = davs2_pixel_add_ps_16x64_avx2;\r\n#if ARCH_X86_64\r\n        pixf->add_ps[PART_32x8 ] = davs2_pixel_add_ps_32x8_avx2;\r\n        pixf->add_ps[PART_32x16] = davs2_pixel_add_ps_32x16_avx2;\r\n        pixf->add_ps[PART_32x24] = davs2_pixel_add_ps_32x24_avx2;\r\n        pixf->add_ps[PART_32x32] = davs2_pixel_add_ps_32x32_avx2;\r\n        pixf->add_ps[PART_32x64] = davs2_pixel_add_ps_32x64_avx2;\r\n#endif\r\n        pixf->add_ps[PART_64x16] = davs2_pixel_add_ps_64x16_avx2;\r\n        pixf->add_ps[PART_64x32] = davs2_pixel_add_ps_64x32_avx2;\r\n        pixf->add_ps[PART_64x48] = davs2_pixel_add_ps_64x48_avx2;\r\n        pixf->add_ps[PART_64x64] = davs2_pixel_add_ps_64x64_avx2;\r\n#endif\r\n    }\r\n#endif  // HAVE_MMX\r\n}\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n"
  },
  {
    "path": "source/common/predict.cc",
    "content": "/*\r\n * predict.cc\r\n *\r\n * Description of this file:\r\n *    Prediction functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"predict.h\"\r\n#include \"block_info.h\"\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * local & global variables (const tables)\r\n * ===========================================================================\r\n */\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid check_scaling_neighbor_mv(davs2_t *h, mv_t *mv, int mult_distance, int ref_neighbor)\r\n{\r\n    if (ref_neighbor >= 0) {\r\n        int devide_distance = get_distance_index_p(h, ref_neighbor);\r\n        int devide_distance_src = get_distance_index_p_scale(h, ref_neighbor);\r\n\r\n        mv->y = scale_mv_default_y(h, mv->y, mult_distance, devide_distance, devide_distance_src);\r\n        mv->x = scale_mv_default  (h, mv->x, mult_distance, devide_distance_src);\r\n    } else {\r\n        mv->v = 0;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid check_scaling_neighbor_mv_b(davs2_t *h, mv_t *mv, int mult_distance, int  mult_distance_src, int ref_neighbor)\r\n{\r\n    if (ref_neighbor >= 0) {\r\n        mv->y = scale_mv_default_y(h, mv->y, mult_distance, mult_distance, mult_distance_src);\r\n        mv->x = scale_mv_default(h, mv->x, mult_distance, mult_distance_src);\r\n    } else {\r\n        mv->v = 0;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nint recheck_neighbor_ref_avail(davs2_t *h, int ref_frame, int neighbor_ref)\r\n{\r\n    if (neighbor_ref != -1) {\r\n        if (((ref_frame == h->num_of_references - 1 && neighbor_ref != h->num_of_references - 1) ||\r\n             (ref_frame != h->num_of_references - 1 && neighbor_ref == h->num_of_references - 1)) &&\r\n             (h->i_frame_type == AVS2_P_SLICE || h->i_frame_type == AVS2_F_SLICE) && h->b_bkgnd_picture) {\r\n            neighbor_ref = -1;\r\n        }\r\n\r\n        if (h->i_frame_type == AVS2_S_SLICE) {\r\n            neighbor_ref = -1;\r\n        }\r\n    }\r\n\r\n    return neighbor_ref;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic 
ALWAYS_INLINE\r\nint derive_mv_pred_type(int ref_frame, int rFrameL, int rFrameU, int rFrameUR, \r\n                        int pu_type_for_mvp)\r\n{\r\n    int mvp_type = MVPRED_xy_MIN;\r\n\r\n    if ((rFrameL != INVALID_REF) && (rFrameU == INVALID_REF) && (rFrameUR == INVALID_REF)) {\r\n        mvp_type = MVPRED_L;\r\n    } else if ((rFrameL == INVALID_REF) && (rFrameU != INVALID_REF) && (rFrameUR == INVALID_REF)) {\r\n        mvp_type = MVPRED_U;\r\n    } else if ((rFrameL == INVALID_REF) && (rFrameU == INVALID_REF) && (rFrameUR != INVALID_REF)) {\r\n        mvp_type = MVPRED_UR;\r\n    } else {\r\n        switch (pu_type_for_mvp) {\r\n        case 1:\r\n        case 4:\r\n            if (rFrameL == ref_frame) {\r\n                mvp_type = MVPRED_L;\r\n            }\r\n            break;\r\n        case 2:\r\n            if (rFrameUR == ref_frame) {\r\n                mvp_type = MVPRED_UR;\r\n            }\r\n            break;\r\n        case 3:\r\n            if (rFrameU == ref_frame) {\r\n                mvp_type = MVPRED_U;\r\n            }\r\n            break;\r\n        default:\r\n            break;\r\n        }\r\n    }\r\n\r\n    return mvp_type;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE int16_t derive_median_mv(int mva, int mvb, int mvc)\r\n{\r\n    int mvp;\r\n\r\n    if (((mva < 0) && (mvb > 0) && (mvc > 0)) || ((mva > 0) && (mvb < 0) && (mvc < 0))) {\r\n        mvp = (mvb + mvc) / 2;  // b\r\n    } else if (((mvb < 0) && (mva > 0) && (mvc > 0)) || ((mvb > 0) && (mva < 0) && (mvc < 0))) {\r\n        mvp = (mvc + mva) / 2;  // c\r\n    } else if (((mvc < 0) && (mva > 0) && (mvb > 0)) || ((mvc > 0) && (mva < 0) && (mvb < 0))) {\r\n        mvp = (mva + mvb) / 2;  // a\r\n    } else {\r\n        const int dAB = DAVS2_ABS(mva - mvb);  // for Ax\r\n        const int dBC = DAVS2_ABS(mvb - mvc);  // for Bx\r\n        const int dCA = DAVS2_ABS(mvc - mva);  // for Cx\r\n        
const int min_diff = DAVS2_MIN(dAB, DAVS2_MIN(dBC, dCA));\r\n\r\n        if (min_diff == dAB) {\r\n            mvp = (mva + mvb) / 2;  // a;\r\n        } else if (min_diff == dBC) {\r\n            mvp = (mvb + mvc) / 2;  // b;\r\n        } else {\r\n            mvp = (mvc + mva) / 2;  // c;\r\n        }\r\n    }\r\n\r\n    return (int16_t)mvp;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get neighboring MVs for MVP\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid cu_get_neighbors_default_mvp(davs2_t *h, cu_t *p_cu, int pix_cu_x, int pix_cu_y, int bsx)\r\n{\r\n    neighbor_inter_t *neighbors = h->lcu.neighbor_inter;\r\n    int cur_slice_idx = p_cu->i_slice_nr;\r\n    int x0 = pix_cu_x >> MIN_PU_SIZE_IN_BIT;\r\n    int y0 = pix_cu_y >> MIN_PU_SIZE_IN_BIT;\r\n    int x1 = (bsx     >> MIN_PU_SIZE_IN_BIT) + x0 - 1;\r\n\r\n    /* 1. check whether the top-right 4x4 block is reconstructed */\r\n    int x4_TR    = x1 - h->lcu.i_spu_x;\r\n    int y4_TR    = y0 - h->lcu.i_spu_y;\r\n    int avail_TR = h->p_tab_TR_avail[(y4_TR << (h->i_lcu_level - B4X4_IN_BIT)) + x4_TR];\r\n\r\n    /* 2. get neighboring blocks */\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_LEFT    ], x0 - 1, y0    );\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_TOP     ], x0    , y0 - 1);\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_TOPLEFT ], x0 - 1, y0 - 1);\r\n\r\n    cu_get_neighbor_spatial(h, cur_slice_idx, &neighbors[BLK_TOPRIGHT], avail_TR ? 
x1 + 1 : -1, y0 - 1);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * set motion vector predictor\r\n */\r\nvoid get_mvp_default(davs2_t *h, cu_t *p_cu, int pix_x, int pix_y, mv_t *pmv, int bwd_2nd, \r\n                     int ref_frame, int bsx, int pu_type_for_mvp)\r\n{\r\n    int mvPredType, rFrameL, rFrameU, rFrameUL, rFrameUR;\r\n    mv_t mva, mvb, mvc, mvd;\r\n    int is_available_UR;\r\n\r\n    cu_get_neighbors_default_mvp(h, p_cu, pix_x, pix_y, bsx);\r\n\r\n    is_available_UR = h->lcu.neighbor_inter[BLK_TOPRIGHT].is_available;\r\n\r\n    rFrameL = h->lcu.neighbor_inter[BLK_LEFT       ].ref_idx.r[bwd_2nd];\r\n    rFrameU = h->lcu.neighbor_inter[BLK_TOP      ].ref_idx.r[bwd_2nd];\r\n    rFrameUL = h->lcu.neighbor_inter[BLK_TOPLEFT].ref_idx.r[bwd_2nd];\r\n    rFrameUR = is_available_UR ? h->lcu.neighbor_inter[BLK_TOPRIGHT].ref_idx.r[bwd_2nd] : rFrameUL;\r\n\r\n    mva = h->lcu.neighbor_inter[BLK_LEFT   ].mv[bwd_2nd];\r\n    mvb = h->lcu.neighbor_inter[BLK_TOP    ].mv[bwd_2nd];\r\n    mvd = h->lcu.neighbor_inter[BLK_TOPLEFT].mv[bwd_2nd];\r\n    mvc = is_available_UR ? h->lcu.neighbor_inter[BLK_TOPRIGHT].mv[bwd_2nd] : mvd;\r\n\r\n    rFrameL  = recheck_neighbor_ref_avail(h, ref_frame, rFrameL);\r\n    rFrameU  = recheck_neighbor_ref_avail(h, ref_frame, rFrameU);\r\n    rFrameUR = recheck_neighbor_ref_avail(h, ref_frame, rFrameUR);\r\n\r\n    mvPredType = derive_mv_pred_type(ref_frame, rFrameL, rFrameU, rFrameUR, pu_type_for_mvp);\r\n\r\n    if (h->i_frame_type == AVS2_B_SLICE) {\r\n        int mult_distance     = get_distance_index_b(h, bwd_2nd ? B_BWD : B_FWD);\r\n        int mult_distance_src = get_distance_index_b_scale(h, bwd_2nd ? 
B_BWD : B_FWD);\r\n        check_scaling_neighbor_mv_b(h, &mva, mult_distance, mult_distance_src, rFrameL);\r\n        check_scaling_neighbor_mv_b(h, &mvb, mult_distance, mult_distance_src, rFrameU);\r\n        check_scaling_neighbor_mv_b(h, &mvc, mult_distance, mult_distance_src, rFrameUR);\r\n    } else {\r\n        int mult_distance = get_distance_index_p(h, ref_frame);\r\n        check_scaling_neighbor_mv(h, &mva, mult_distance, rFrameL);\r\n        check_scaling_neighbor_mv(h, &mvb, mult_distance, rFrameU);\r\n        check_scaling_neighbor_mv(h, &mvc, mult_distance, rFrameUR);\r\n    }\r\n\r\n    switch (mvPredType) {\r\n    case MVPRED_xy_MIN:\r\n        pmv->x = derive_median_mv(mva.x, mvb.x, mvc.x);  // x\r\n        pmv->y = derive_median_mv(mva.y, mvb.y, mvc.y);  // y\r\n        break;\r\n    case MVPRED_L:\r\n        pmv->v = mva.v;\r\n        break;\r\n    case MVPRED_U:\r\n        pmv->v = mvb.v;\r\n        break;\r\n    default:    // case MVPRED_UR:\r\n        pmv->v = mvc.v;\r\n        break;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid get_mv_bskip_spatial(davs2_t *h, mv_t *fw_pmv, mv_t *bw_pmv, int num_skip_dir)\r\n{\r\n    neighbor_inter_t *p_neighbors = h->lcu.neighbor_inter;\r\n    mv_t *p_mv_1st = h->lcu.mv_tskip_1st;\r\n    mv_t *p_mv_2nd = h->lcu.mv_tskip_2nd;\r\n    int j;\r\n    int bid_flag = 0, bw_flag = 0, fw_flag = 0, sym_flag = 0, bid2 = 0;\r\n\r\n    memset(h->lcu.mv_tskip_1st, 0, sizeof(h->lcu.mv_tskip_1st) + sizeof(h->lcu.mv_tskip_2nd));\r\n\r\n    for (j = 0; j < 6; j++) {\r\n        if (p_neighbors[j].i_dir_pred == PDIR_BID) {\r\n            p_mv_2nd[DS_B_BID] = p_neighbors[j].mv[1];\r\n            p_mv_1st[DS_B_BID] = p_neighbors[j].mv[0];\r\n            bid_flag++;\r\n            if (bid_flag == 1) {\r\n                bid2 = j;\r\n            }\r\n        } else if (p_neighbors[j].i_dir_pred == PDIR_SYM) {\r\n            p_mv_2nd[DS_B_SYM] = 
p_neighbors[j].mv[1];\r\n            p_mv_1st[DS_B_SYM] = p_neighbors[j].mv[0];\r\n            sym_flag++;\r\n        } else if (p_neighbors[j].i_dir_pred == PDIR_BWD) {\r\n            p_mv_2nd[DS_B_BWD] = p_neighbors[j].mv[1];\r\n            bw_flag++;\r\n        } else if (p_neighbors[j].i_dir_pred == PDIR_FWD) {\r\n            p_mv_1st[DS_B_FWD] = p_neighbors[j].mv[0];\r\n            fw_flag++;\r\n        }\r\n    }\r\n\r\n    if (bid_flag == 0 && fw_flag != 0 && bw_flag != 0) {\r\n        p_mv_2nd[DS_B_BID] = p_mv_2nd[DS_B_BWD];\r\n        p_mv_1st[DS_B_BID] = p_mv_1st[DS_B_FWD ];\r\n    }\r\n\r\n    if (sym_flag == 0 && bid_flag > 1) {\r\n        p_mv_2nd[DS_B_SYM] = p_neighbors[bid2].mv[1];\r\n        p_mv_1st[DS_B_SYM] = p_neighbors[bid2].mv[0];\r\n    } else if (sym_flag == 0 && bw_flag != 0) {\r\n        p_mv_2nd[DS_B_SYM].v =  p_mv_2nd[DS_B_BWD].v;\r\n        p_mv_1st[DS_B_SYM].x = -p_mv_2nd[DS_B_BWD].x;\r\n        p_mv_1st[DS_B_SYM].y = -p_mv_2nd[DS_B_BWD].y;\r\n    } else if (sym_flag == 0 && fw_flag != 0) {\r\n        p_mv_2nd[DS_B_SYM].x = -p_mv_1st[DS_B_FWD].x;\r\n        p_mv_2nd[DS_B_SYM].y = -p_mv_1st[DS_B_FWD].y;\r\n        p_mv_1st[DS_B_SYM].v =  p_mv_1st[DS_B_FWD].v;\r\n    }\r\n\r\n    if (bw_flag == 0 && bid_flag > 1) {\r\n        p_mv_2nd[DS_B_BWD] = p_neighbors[bid2].mv[1];\r\n    } else if (bw_flag == 0 && bid_flag != 0) {\r\n        p_mv_2nd[DS_B_BWD] = p_mv_2nd[DS_B_BID];\r\n    }\r\n\r\n    if (fw_flag == 0 && bid_flag > 1) {\r\n        p_mv_1st[DS_B_FWD] = p_neighbors[bid2].mv[0];\r\n    } else if (fw_flag == 0 && bid_flag != 0) {\r\n        p_mv_1st[DS_B_FWD] = p_mv_1st[DS_B_BID];\r\n    }\r\n\r\n    fw_pmv->v = p_mv_1st[num_skip_dir].v;\r\n    bw_pmv->v = p_mv_2nd[num_skip_dir].v;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid get_mv_pf_skip_temporal(davs2_t *h, mv_t *p_mv, int block_offset, int cur_dist)\r\n{\r\n    int refframe = 
h->fref[0]->refbuf[block_offset];\r\n\r\n    if (refframe >= 0) {\r\n        mv_t tmv     = h->fref[0]->mvbuf[block_offset];\r\n        int col_dist = h->fref[0]->dist_scale_refs[refframe];\r\n\r\n        p_mv->x = scale_mv_skip(h, tmv.x, cur_dist, col_dist);\r\n        p_mv->y = scale_mv_skip(h, tmv.y, cur_dist, col_dist);\r\n    } else {\r\n        p_mv->v = 0;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void fill_mv_pf_skip_temporal(davs2_t *h, cu_t *p_cu, int pix_x, int pix_y, int cu_size)\r\n{\r\n    int spu_x = pix_x >> MIN_PU_SIZE_IN_BIT;\r\n    int spu_y = pix_y >> MIN_PU_SIZE_IN_BIT;\r\n    int size_in_spu  = cu_size >> MIN_PU_SIZE_IN_BIT;\r\n    int width_in_spu = h->i_width_in_spu;\r\n    int i, l, m;\r\n    mv_t mv_1st, mv_2nd;\r\n    ref_idx_t ref_idx;\r\n    int delta[AVS2_MAX_REFS];\r\n    int delta_src[AVS2_MAX_REFS];\r\n\r\n    ref_idx.r[0] = 0;\r\n    ref_idx.r[1] = (int8_t)(p_cu->i_weighted_skipmode != 0 ? 
p_cu->i_weighted_skipmode : INVALID_REF);\r\n\r\n    for (i = 0; i < h->num_of_references; i++) {\r\n        delta[i] = get_distance_index_p(h, i);\r\n        delta_src[i] = get_distance_index_p_scale(h, i);\r\n    }\r\n\r\n    if (cu_size != MIN_CU_SIZE) {\r\n        size_in_spu >>= 1;\r\n        assert(p_cu->num_pu == 4);\r\n    } else {\r\n        assert(p_cu->num_pu == 1);\r\n    }\r\n\r\n    for (i = 0; i < p_cu->num_pu; i++) {\r\n        int block_x       = spu_x + size_in_spu * (i  & 1);\r\n        int block_y       = spu_y + size_in_spu * (i >> 1);\r\n        int block_offset  = block_y * width_in_spu + block_x;\r\n        mv_t *p_mv_1st    = h->p_tmv_1st + block_offset;\r\n        mv_t *p_mv_2nd    = h->p_tmv_2nd + block_offset;\r\n        ref_idx_t *p_ref_1st = h->p_ref_idx + block_offset;\r\n\r\n        get_mv_pf_skip_temporal(h, &mv_1st, block_offset, delta[0]);\r\n\r\n        if (ref_idx.r[1] != INVALID_REF) {\r\n            mv_2nd.x = scale_mv_skip  (h, mv_1st.x, delta[ref_idx.r[1]], delta_src[0]);\r\n            mv_2nd.y = scale_mv_skip_y(h, mv_1st.y, delta[ref_idx.r[1]], delta[0], delta_src[0]);\r\n        } else {\r\n            mv_2nd.v = 0;\r\n        }\r\n\r\n        p_cu->mv[i][0].v = mv_1st.v;\r\n        p_cu->mv[i][1].v = mv_2nd.v;\r\n        p_cu->ref_idx[i] = ref_idx;\r\n\r\n        for (m = 0; m < size_in_spu; m++) {\r\n            for (l = 0; l < size_in_spu; l++) {\r\n                p_mv_1st[l] = mv_1st;\r\n                p_mv_2nd[l] = mv_2nd;\r\n                p_ref_1st[l] = ref_idx;\r\n            }\r\n            p_mv_1st += width_in_spu;\r\n            p_mv_2nd += width_in_spu;\r\n            p_ref_1st += width_in_spu;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nvoid get_mv_fskip_spatial(davs2_t *h)\r\n{\r\n    neighbor_inter_t *p_neighbors = h->lcu.neighbor_inter;\r\n    int bid_flag = 0, fw_flag = 0, bid2 = 0, fw2 = 0;\r\n    int 
j;\r\n\r\n    memset(h->lcu.ref_skip_1st, 0, sizeof(h->lcu.ref_skip_1st)\r\n           + sizeof(h->lcu.ref_skip_2nd)\r\n           + sizeof(h->lcu.mv_tskip_1st)\r\n           + sizeof(h->lcu.mv_tskip_2nd));\r\n\r\n    for (j = 0; j < 6; j++) {\r\n        if (p_neighbors[j].ref_idx.r[0] != -1 && p_neighbors[j].ref_idx.r[1] != -1) {   // bid\r\n            h->lcu.ref_skip_1st[DS_DUAL_1ST] = p_neighbors[j].ref_idx.r[0];\r\n            h->lcu.ref_skip_2nd[DS_DUAL_1ST] = p_neighbors[j].ref_idx.r[1];\r\n            h->lcu.mv_tskip_1st[DS_DUAL_1ST] = p_neighbors[j].mv[0];\r\n            h->lcu.mv_tskip_2nd[DS_DUAL_1ST] = p_neighbors[j].mv[1];\r\n            bid_flag++;\r\n            if (bid_flag == 1) {\r\n                bid2 = j;\r\n            }\r\n        } else if (p_neighbors[j].ref_idx.r[0] != -1 && p_neighbors[j].ref_idx.r[1] == -1) {  // fw\r\n            h->lcu.ref_skip_1st[DS_SINGLE_1ST] = p_neighbors[j].ref_idx.r[0];\r\n            h->lcu.mv_tskip_1st[DS_SINGLE_1ST] = p_neighbors[j].mv[0];\r\n            fw_flag++;\r\n            if (fw_flag == 1) {\r\n                fw2 = j;\r\n            }\r\n        }\r\n    }\r\n\r\n    // first bid\r\n    if (bid_flag == 0 && fw_flag > 1) {\r\n        h->lcu.ref_skip_1st[DS_DUAL_1ST] = h->lcu.ref_skip_1st[DS_SINGLE_1ST];\r\n        h->lcu.ref_skip_2nd[DS_DUAL_1ST] = p_neighbors[fw2].ref_idx.r[0];\r\n        h->lcu.mv_tskip_1st[DS_DUAL_1ST] = h->lcu.mv_tskip_1st[DS_SINGLE_1ST];\r\n        h->lcu.mv_tskip_2nd[DS_DUAL_1ST] = p_neighbors[fw2].mv[0];\r\n    }\r\n\r\n    // second bid\r\n    if (bid_flag > 1) {\r\n        h->lcu.ref_skip_1st[DS_DUAL_2ND] = p_neighbors[bid2].ref_idx.r[0];\r\n        h->lcu.ref_skip_2nd[DS_DUAL_2ND] = p_neighbors[bid2].ref_idx.r[1];\r\n        h->lcu.mv_tskip_1st[DS_DUAL_2ND] = p_neighbors[bid2].mv[0];\r\n        h->lcu.mv_tskip_2nd[DS_DUAL_2ND] = p_neighbors[bid2].mv[1];\r\n    } else if (bid_flag == 1 && fw_flag > 1) {\r\n        h->lcu.ref_skip_1st[DS_DUAL_2ND] = 
h->lcu.ref_skip_1st[DS_SINGLE_1ST];\r\n        h->lcu.ref_skip_2nd[DS_DUAL_2ND] = p_neighbors[fw2].ref_idx.r[0];\r\n        h->lcu.mv_tskip_1st[DS_DUAL_2ND] = h->lcu.mv_tskip_1st[DS_SINGLE_1ST];\r\n        h->lcu.mv_tskip_2nd[DS_DUAL_2ND] = p_neighbors[fw2].mv[0];\r\n    }\r\n\r\n    // first fwd\r\n    h->lcu.ref_skip_2nd[DS_SINGLE_1ST] = INVALID_REF;\r\n    h->lcu.mv_tskip_2nd [DS_SINGLE_1ST].v = 0;\r\n    if (fw_flag == 0 && bid_flag > 1) {\r\n        h->lcu.ref_skip_1st[DS_SINGLE_1ST] = p_neighbors[bid2].ref_idx.r[0];\r\n        h->lcu.mv_tskip_1st[DS_SINGLE_1ST] = p_neighbors[bid2].mv[0];\r\n    } else if (fw_flag == 0 && bid_flag == 1) {\r\n        h->lcu.ref_skip_1st[DS_SINGLE_1ST] = h->lcu.ref_skip_1st[DS_DUAL_1ST];\r\n        h->lcu.mv_tskip_1st[DS_SINGLE_1ST] = h->lcu.mv_tskip_1st[DS_DUAL_1ST];\r\n    }\r\n\r\n    // second fwd\r\n    h->lcu.ref_skip_2nd[DS_SINGLE_2ND] = INVALID_REF;\r\n    h->lcu.mv_tskip_2nd [DS_SINGLE_2ND].v = 0;\r\n    if (fw_flag > 1) {\r\n        h->lcu.ref_skip_1st[DS_SINGLE_2ND] = p_neighbors[fw2].ref_idx.r[0];\r\n        h->lcu.mv_tskip_1st[DS_SINGLE_2ND] = p_neighbors[fw2].mv[0];\r\n    } else if (bid_flag > 1) {\r\n        h->lcu.ref_skip_1st[DS_SINGLE_2ND] = p_neighbors[bid2].ref_idx.r[1];\r\n        h->lcu.mv_tskip_1st[DS_SINGLE_2ND] = p_neighbors[bid2].mv[1];\r\n    } else if (bid_flag == 1) {\r\n        h->lcu.ref_skip_1st[DS_SINGLE_2ND] = h->lcu.ref_skip_2nd[DS_DUAL_1ST];\r\n        h->lcu.mv_tskip_1st[DS_SINGLE_2ND] = h->lcu.mv_tskip_2nd[DS_DUAL_1ST];\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void fill_mv_bskip(davs2_t *h, cu_t *p_cu, int pix_x, int pix_y, int size_in_scu)\r\n{\r\n    int width_in_spu = h->i_width_in_spu;\r\n    int i8_1st = pix_x >> MIN_PU_SIZE_IN_BIT;\r\n    int j8_1st = pix_y >> MIN_PU_SIZE_IN_BIT;\r\n    int i;\r\n    int8_t *p_dirpred;\r\n    ref_idx_t *p_ref_1st;\r\n    mv_t *p_mv_1st;\r\n    mv_t *p_mv_2nd;\r\n    mv_t 
mv_1st, mv_2nd;\r\n    int ds_mode = p_cu->i_md_directskip_mode;\r\n\r\n    assert(h->i_frame_type == AVS2_B_SLICE);\r\n\r\n    if (ds_mode != DS_NONE) {\r\n        int offset_spu = j8_1st * width_in_spu + i8_1st;\r\n        int r, c;\r\n        int cu_size_in_spu = size_in_scu << (MIN_CU_SIZE_IN_BIT - MIN_PU_SIZE_IN_BIT);\r\n        ref_idx_t ref_idx;\r\n        int8_t  i_dir_pred;\r\n\r\n        p_mv_1st  = h->p_tmv_1st + offset_spu;\r\n        p_mv_2nd  = h->p_tmv_2nd + offset_spu;\r\n        p_ref_1st = h->p_ref_idx + offset_spu;\r\n        p_dirpred = h->p_dirpred + offset_spu;\r\n        i_dir_pred = (int8_t)p_cu->b8pdir[0];\r\n\r\n        switch (ds_mode) {\r\n        case DS_B_SYM:\r\n        case DS_B_BID:\r\n            ref_idx.r[0] = B_FWD;\r\n            ref_idx.r[1] = B_BWD;\r\n            break;\r\n        case DS_B_BWD:\r\n            ref_idx.r[0] = INVALID_REF;\r\n            ref_idx.r[1] = B_BWD;\r\n            break;\r\n        // case DS_B_FWD:\r\n        default:\r\n            ref_idx.r[0] = B_FWD;\r\n            ref_idx.r[1] = INVALID_REF;\r\n            break;\r\n        }\r\n\r\n        get_mv_bskip_spatial(h, &mv_1st, &mv_2nd, p_cu->i_md_directskip_mode);\r\n\r\n        p_cu->mv[0][0].v = mv_1st.v;\r\n        p_cu->mv[0][1].v = mv_2nd.v;\r\n        p_cu->ref_idx[0] = ref_idx;\r\n\r\n        for (r = 0; r < cu_size_in_spu; r++) {\r\n            for (c = 0; c < cu_size_in_spu; c++) {\r\n                p_ref_1st[c] = ref_idx;\r\n\r\n                p_mv_1st [c] = mv_1st;\r\n                p_mv_2nd [c] = mv_2nd;\r\n                p_dirpred[c] = i_dir_pred;\r\n            }\r\n            p_ref_1st += width_in_spu;\r\n            p_mv_1st  += width_in_spu;\r\n            p_mv_2nd  += width_in_spu;\r\n            p_dirpred += width_in_spu;\r\n        }\r\n    } else {   //    B_Skip_Sym  B_Direct_Sym\r\n        int size_cu = size_in_scu << MIN_CU_SIZE_IN_BIT;\r\n        int size_pu = size_cu >> (int)(p_cu->num_pu == 4);\r\n        int 
size_pu_in_spu = size_pu >> MIN_PU_SIZE_IN_BIT;\r\n        ref_idx_t ref_idx;\r\n\r\n        ref_idx.r[0] = B_FWD;\r\n        ref_idx.r[1] = B_BWD;\r\n\r\n        for (i = 0; i < p_cu->num_pu; i++) {\r\n            int i8 = i8_1st + (i &  1) * size_in_scu;\r\n            int j8 = j8_1st + (i >> 1) * size_in_scu;\r\n            int r, c;\r\n            int offset_spu = j8 * width_in_spu + i8;\r\n            const int8_t *refbuf = h->fref[0]->refbuf;\r\n            int refframe = refbuf[j8 * width_in_spu + i8];\r\n\r\n            p_mv_1st  = h->p_tmv_1st + offset_spu;\r\n            p_mv_2nd  = h->p_tmv_2nd + offset_spu;\r\n            p_ref_1st = h->p_ref_idx + offset_spu;\r\n            p_dirpred  = h->p_dirpred + offset_spu;\r\n\r\n            if (refframe == -1) {\r\n                get_mvp_default(h, p_cu, pix_x, pix_y, &mv_1st, 0, 0, size_cu, 0);\r\n                get_mvp_default(h, p_cu, pix_x, pix_y, &mv_2nd, 1, 0, size_cu, 0);\r\n            } else { // next P is skip or inter mode\r\n                int iTRp     = h->fref[0]->dist_refs[refframe];\r\n                int iTRp_src = h->fref[0]->dist_scale_refs[refframe];\r\n                int iTRd     = get_distance_index_b(h, B_BWD);  // bwd\r\n                int iTRb     = get_distance_index_b(h, B_FWD);  // fwd\r\n                mv_t tmv = h->fref[0]->mvbuf[j8 * width_in_spu + i8];\r\n\r\n                mv_1st.x =  scale_mv_biskip(h, tmv.x, iTRb, iTRp_src);\r\n                mv_2nd.x = -scale_mv_biskip(h, tmv.x, iTRd, iTRp_src);\r\n\r\n                mv_1st.y =  scale_mv_biskip_y(h, tmv.y, iTRb, iTRp, iTRp_src);\r\n                mv_2nd.y = -scale_mv_biskip_y(h, tmv.y, iTRd, iTRp, iTRp_src);\r\n            }\r\n\r\n            p_cu->mv[i][0].v = mv_1st.v;\r\n            p_cu->mv[i][1].v = mv_2nd.v;\r\n            p_cu->ref_idx[i].v = ref_idx.v;\r\n\r\n            for (r = 0; r < size_pu_in_spu; r++) {\r\n                for (c = 0; c < size_pu_in_spu; c++) {\r\n                    p_mv_1st [c] = 
mv_1st;\r\n                    p_mv_2nd [c] = mv_2nd;\r\n\r\n                    p_ref_1st[c].v = ref_idx.v;\r\n                    p_dirpred[c] = PDIR_SYM;\r\n                }\r\n                p_ref_1st += width_in_spu;\r\n                p_mv_1st  += width_in_spu;\r\n                p_mv_2nd  += width_in_spu;\r\n                p_dirpred += width_in_spu;\r\n            }\r\n        }     // for loop all PUs\r\n    }   //    B_Skip_Sym  B_Direct_Sym\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Skip/Direct mode: derive the motion information of every block in the current CU\r\n * and set the reference frames and motion vectors\r\n */\r\nvoid fill_mv_and_ref_for_skip(davs2_t *h, cu_t *p_cu, int pix_x, int pix_y, int size_in_scu)\r\n{\r\n    assert(p_cu->i_cu_type == PRED_SKIP);\r\n\r\n    if (h->i_frame_type == AVS2_B_SLICE) {\r\n        fill_mv_bskip(h, p_cu, pix_x, pix_y, size_in_scu);\r\n    } else if ((h->i_frame_type == AVS2_F_SLICE) || (h->i_frame_type == AVS2_P_SLICE)) {\r\n        if (p_cu->i_md_directskip_mode == 0) {\r\n            fill_mv_pf_skip_temporal(h, p_cu, pix_x, pix_y, size_in_scu << MIN_CU_SIZE_IN_BIT);\r\n        } else {\r\n            int width_in_spu = h->i_width_in_spu;\r\n            int block_offset = (pix_y >> MIN_PU_SIZE_IN_BIT) * width_in_spu + (pix_x >> MIN_PU_SIZE_IN_BIT);\r\n            ref_idx_t *p_ref_1st = h->p_ref_idx + block_offset;\r\n            mv_t   *p_tmv_1st = h->p_tmv_1st + block_offset;\r\n            mv_t   *p_tmv_2nd = h->p_tmv_2nd + block_offset;\r\n            int i, j;\r\n            mv_t mv_1st, mv_2nd;\r\n            ref_idx_t ref_idx;\r\n            int ds_mode = p_cu->i_md_directskip_mode;\r\n\r\n            get_mv_fskip_spatial(h);\r\n\r\n            mv_1st = h->lcu.mv_tskip_1st[ds_mode];\r\n            mv_2nd = h->lcu.mv_tskip_2nd[ds_mode];\r\n\r\n            ref_idx.r[0] = h->lcu.ref_skip_1st[ds_mode];\r\n            ref_idx.r[1] = h->lcu.ref_skip_2nd[ds_mode];\r\n\r\n            for (i = 0; i < 4; i++) {\r\n                p_cu->mv[i][0].v = 
mv_1st.v;\r\n                p_cu->mv[i][1].v = mv_2nd.v;\r\n                p_cu->ref_idx[i] = ref_idx;\r\n            }\r\n\r\n            size_in_scu <<= (MIN_CU_SIZE_IN_BIT - MIN_PU_SIZE_IN_BIT);\r\n\r\n            for (j = 0; j < size_in_scu; j++) {\r\n                for (i = 0; i < size_in_scu; i++) {\r\n                    p_ref_1st[i] = ref_idx;\r\n                    p_tmv_1st[i] = mv_1st;\r\n                    p_tmv_2nd[i] = mv_2nd;\r\n                }\r\n                p_ref_1st += width_in_spu;\r\n                p_tmv_1st += width_in_spu;\r\n                p_tmv_2nd += width_in_spu;\r\n            }\r\n        }\r\n    }\r\n}\r\n"
  },
  {
    "path": "source/common/predict.h",
    "content": "/*\r\n * predict.h\r\n *\r\n * Description of this file:\r\n *    Prediction functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n \r\n#ifndef DAVS2_PRED_H\r\n#define DAVS2_PRED_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * ο֡ʱ鲢ЧΧ */\r\n#define AVS2_DISTANCE_INDEX(distance)    (((distance) + 512) & 511)\r\n\r\n/* ---------------------------------------------------------------------------\r\n * P/F֡Ĳο֡뵱ǰ֮֡ľ */\r\nstatic ALWAYS_INLINE\r\nint get_distance_index_p(davs2_t *h, int refidx)\r\n{\r\n    return h->fdec->dist_refs[refidx];\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n* 
distance between a reference frame of a P/F frame and the current frame */\r\nstatic ALWAYS_INLINE\r\nint get_distance_index_p_scale(davs2_t *h, int refidx)\r\n{\r\n    return h->fdec->dist_scale_refs[refidx];\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * distance between a reference frame of a B frame and the current frame */\r\nstatic ALWAYS_INLINE\r\nint get_distance_index_b(davs2_t *h, int b_fwd)\r\n{\r\n    return h->fdec->dist_refs[b_fwd];\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n* distance between a reference frame of a B frame and the current frame */\r\nstatic ALWAYS_INLINE\r\nint get_distance_index_b_scale(davs2_t *h, int b_fwd)\r\n{\r\n    return h->fdec->dist_scale_refs[b_fwd];\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n* offsets applied when scaling the MV y component in field coding */\r\nstatic ALWAYS_INLINE\r\nint getDeltas(davs2_t *h, int *delt, int *delt2, int OriPOC, int OriRefPOC, int ScaledPOC, int ScaledRefPOC)\r\n{\r\n    int factor = 2;\r\n\r\n    *delt = 0;\r\n    *delt2 = 0;\r\n\r\n    assert(h->seq_info.b_field_coding);\r\n    assert(h->i_pic_coding_type == FRAME);\r\n\r\n    OriPOC       = AVS2_DISTANCE_INDEX(OriPOC);\r\n    OriRefPOC    = AVS2_DISTANCE_INDEX(OriRefPOC);\r\n    ScaledPOC    = AVS2_DISTANCE_INDEX(ScaledPOC);\r\n    ScaledRefPOC = AVS2_DISTANCE_INDEX(ScaledRefPOC);\r\n\r\n    assert((OriPOC % factor) + (OriRefPOC % factor) + (ScaledPOC % factor) + (ScaledRefPOC % factor) == 0);\r\n\r\n    OriPOC /= factor;\r\n    OriRefPOC /= factor;\r\n    ScaledPOC /= factor;\r\n    ScaledRefPOC /= factor;\r\n\r\n    if (h->b_top_field) {  // scaled is top field\r\n        *delt2 = (ScaledRefPOC % 2) != (ScaledPOC % 2) ? 2 : 0;\r\n\r\n        if ((ScaledPOC % 2) == (OriPOC % 2)) { // ori is top\r\n            *delt = (OriRefPOC % 2) != (OriPOC % 2) ? 2 : 0;\r\n        } else {\r\n            *delt = (OriRefPOC % 2) != (OriPOC % 2) ? -2 : 0;\r\n        }\r\n    } else { // scaled is bottom field\r\n        *delt2 = (ScaledRefPOC % 2) != (ScaledPOC % 2) ? 
-2 : 0;\r\n        if ((ScaledPOC % 2) == (OriPOC % 2)) { // ori is bottom\r\n            *delt = (OriRefPOC % 2) != (OriPOC % 2) ? -2 : 0;\r\n        } else {\r\n            *delt = (OriRefPOC % 2) != (OriPOC % 2) ? 2 : 0;\r\n        }\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * MV scaling for Normal Inter Mode (MVP + MVD) */\r\nstatic ALWAYS_INLINE\r\nint16_t scale_mv_default(davs2_t *h, int mv, int dist_dst, int dist_src)\r\n{\r\n    UNUSED_PARAMETER(h);\r\n    mv = davs2_sign3(mv) * ((DAVS2_ABS(mv) * dist_dst * dist_src + HALF_MULTI) >> OFFSET);\r\n    return (int16_t)(DAVS2_CLIP3(-32768, 32767, mv));\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nint16_t scale_mv_default_y(davs2_t *h, int mvy, int dist_dst, int dist_src, int dist_src_mul)\r\n{\r\n    if (h->seq_info.b_field_coding) {\r\n        int oriPOC    = h->fdec->i_poc;\r\n        int oriRefPOC = oriPOC - dist_src;\r\n        int scaledPOC = h->fdec->i_poc;\r\n        int scaledRefPOC = scaledPOC - dist_dst;\r\n        int delta, delta2;\r\n\r\n        getDeltas(h, &delta, &delta2, oriPOC, oriRefPOC, scaledPOC, scaledRefPOC);\r\n        return (int16_t)(scale_mv_default(h, mvy + delta, dist_dst, dist_src_mul) - delta2);\r\n    } else {\r\n        return scale_mv_default(h, mvy, dist_dst, dist_src_mul);\r\n    }\r\n}\r\n\r\n// ----------------------------------------------------------\r\n// MV scaling for Skip/Direct Mode\r\nstatic ALWAYS_INLINE\r\nint16_t scale_mv_skip(davs2_t *h, int mv, int dist_dst, int dist_src)\r\n{\r\n    UNUSED_PARAMETER(h);\r\n    mv = (int16_t)((mv * dist_dst * dist_src + HALF_MULTI) >> OFFSET);\r\n    return (int16_t)(DAVS2_CLIP3(-32768, 32767, mv));\r\n}\r\n\r\nstatic ALWAYS_INLINE\r\nint16_t scale_mv_skip_y(davs2_t *h, int mvy, int dist_dst, int dist_src ,int dist_src_mul)\r\n{\r\n    if 
(h->seq_info.b_field_coding) {\r\n        int oriPOC    = h->fdec->i_poc;\r\n        int oriRefPOC = oriPOC - dist_src;\r\n        int scaledPOC = h->fdec->i_poc;\r\n        int scaledRefPOC = scaledPOC - dist_dst;\r\n        int delta, delta2;\r\n\r\n        getDeltas(h, &delta, &delta2, oriPOC, oriRefPOC, scaledPOC, scaledRefPOC);\r\n        return (int16_t)(scale_mv_skip(h, mvy + delta, dist_dst, dist_src_mul) - delta2);\r\n    } else {\r\n        return scale_mv_skip(h, mvy, dist_dst, dist_src_mul);\r\n    }\r\n}\r\n\r\n// ----------------------------------------------------------\r\n// MV scaling for Bi-Skip/Direct Mode\r\nstatic ALWAYS_INLINE\r\nint16_t scale_mv_biskip(davs2_t *h, int mv, int dist_dst, int dist_src)\r\n{\r\n    UNUSED_PARAMETER(h);\r\n    mv = (int16_t)(davs2_sign3(mv) * ((dist_src * (1 + DAVS2_ABS(mv) * dist_dst) - 1) >> OFFSET));\r\n    return (int16_t)(DAVS2_CLIP3(-32768, 32767, mv));\r\n}\r\n\r\nstatic ALWAYS_INLINE\r\nint16_t scale_mv_biskip_y(davs2_t *h, int mvy, int dist_dst, int dist_src, int dist_src_mul)\r\n{\r\n    if (h->seq_info.b_field_coding) {\r\n        int oriPOC    = h->fdec->i_poc;\r\n        int oriRefPOC = oriPOC - dist_src;\r\n        int scaledPOC = h->fdec->i_poc;\r\n        int scaledRefPOC = scaledPOC - dist_dst;\r\n        int delta, delta2;\r\n\r\n        getDeltas(h, &delta, &delta2, oriPOC, oriRefPOC, scaledPOC, scaledRefPOC);\r\n        return (int16_t)(scale_mv_biskip(h, mvy + delta, dist_dst, dist_src_mul) - delta2);\r\n    } else {\r\n        return scale_mv_biskip(h, mvy, dist_dst, dist_src_mul);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid pmvr_mv_derivation(davs2_t *h, mv_t *mv, mv_t *mvd, mv_t *mvp)\r\n{\r\n    int mvx, mvy;\r\n\r\n    if (h->seq_info.enable_pmvr) {\r\n        int ctr_x, ctr_y;\r\n\r\n        ctr_x = ((mvp->x >> 1) << 1) - mvp->x;\r\n        ctr_y = ((mvp->y >> 1) << 1) - mvp->y;\r\n\r\n        
if (DAVS2_ABS(mvd->x - ctr_x) > THRESHOLD_PMVR) {\r\n            mvx = mvp->x + (mvd->x << 1) - ctr_x - davs2_sign2(mvd->x - ctr_x) * THRESHOLD_PMVR;\r\n            mvy = mvp->y + (mvd->y << 1) + ctr_y;\r\n        } else if (DAVS2_ABS(mvd->y - ctr_y) > THRESHOLD_PMVR) {\r\n            mvx = mvp->x + (mvd->x << 1) + ctr_x;\r\n            mvy = mvp->y + (mvd->y << 1) - ctr_y - davs2_sign2(mvd->y - ctr_y) * THRESHOLD_PMVR;\r\n        } else {\r\n            mvx = mvd->x + mvp->x;\r\n            mvy = mvd->y + mvp->y;\r\n        }\r\n    } else {\r\n        mvx = mvd->x + mvp->x;\r\n        mvy = mvd->y + mvp->y;\r\n    }\r\n\r\n    mv->x = (int16_t)DAVS2_CLIP3(-32768, 32767, mvx);\r\n    mv->y = (int16_t)DAVS2_CLIP3(-32768, 32767, mvy);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get spatial neighboring MV\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid cu_get_neighbor_spatial(davs2_t *h, int cur_slice_idx, neighbor_inter_t *p_neighbor, int x4, int y4)\r\n{\r\n    int b_outside_pic = y4 < 0 || y4 >= h->i_height_in_spu || x4 < 0 || x4 >= h->i_width_in_spu;\r\n    int scu_xy = (y4 >> 1) * h->i_width_in_scu + (x4 >> 1);\r\n\r\n    if (b_outside_pic || h->scu_data[scu_xy].i_slice_nr != cur_slice_idx) {\r\n        p_neighbor->is_available = 0;\r\n        p_neighbor->i_dir_pred = PDIR_INVALID;\r\n        p_neighbor->ref_idx.r[0] = INVALID_REF;\r\n        p_neighbor->ref_idx.r[1] = INVALID_REF;\r\n        p_neighbor->mv[0].v = 0;\r\n        p_neighbor->mv[1].v = 0;\r\n    } else {\r\n        const int w_in_4x4 = h->i_width_in_spu;\r\n        const int pos = y4 * w_in_4x4 + x4;\r\n        p_neighbor->is_available = 1;\r\n        p_neighbor->i_dir_pred = h->p_dirpred[pos];\r\n        p_neighbor->ref_idx = h->p_ref_idx[pos];\r\n        p_neighbor->mv[0] = h->p_tmv_1st[pos];\r\n        p_neighbor->mv[1] = h->p_tmv_2nd[pos];\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get 
temporal MV predictor\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid cu_get_neighbor_temporal(davs2_t *h, neighbor_inter_t *p_neighbor, int x4, int y4)\r\n{\r\n    int w_in_16x16 = (h->i_width_in_spu + 3) >> 2;\r\n    int pos = (y4 /*>> 2*/) * w_in_16x16 + (x4 /*>> 2*/);\r\n\r\n    p_neighbor->is_available = 1;\r\n    p_neighbor->i_dir_pred = PDIR_FWD;\r\n    p_neighbor->ref_idx.r[0] = h->fref[0]->refbuf[pos];\r\n    p_neighbor->mv[0] = h->fref[0]->mvbuf[pos];\r\n\r\n    p_neighbor->ref_idx.r[1] = INVALID_REF;\r\n    p_neighbor->mv[1].v = 0;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE\r\nint get_pu_type_for_mvp(int bsx, int bsy, int cu_pix_x, int cu_pix_y)\r\n{\r\n    if (bsx < bsy) {\r\n        if (cu_pix_x == 0) {\r\n            return 1;\r\n        } else {\r\n            return 2;\r\n        }\r\n    } else if (bsx > bsy) {\r\n        if (cu_pix_y == 0) {\r\n            return 3;\r\n        } else {\r\n            return 4;\r\n        }\r\n    }\r\n\r\n    return 0;  // default\r\n}\r\n\r\n#define get_mvp_default FPFX(get_mvp_default)\r\nvoid get_mvp_default(davs2_t *h, cu_t *p_cu, int pix_x, int pix_y, mv_t *pmv, int bwd_2nd, \r\n                     int ref_frame, int bsx, int pu_type_for_mvp);\r\n\r\n#define fill_mv_and_ref_for_skip FPFX(fill_mv_and_ref_for_skip)\r\nvoid fill_mv_and_ref_for_skip(davs2_t *h, cu_t *p_cu, int pix_x, int pix_y, int size_in_scu);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_PRED_H\r\n"
  },
  {
    "path": "source/common/primitives.cc",
    "content": "/*\r\n * primitives.cc\r\n *\r\n * Description of this file:\r\n *    function handles initialize functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n\r\n#include \"common.h\"\r\n#include \"primitives.h\"\r\n#include \"cpu.h\"\r\n#include \"intra.h\"\r\n#include \"mc.h\"\r\n#include \"transform.h\"\r\n#include \"quant.h\"\r\n#include \"deblock.h\"\r\n#include \"sao.h\"\r\n#include \"alf.h\"\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nao_funcs_t gf_davs2 = {0};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid init_all_primitives(uint32_t cpuid)\r\n{\r\n    if (gf_davs2.initial_count != 0) {\r\n        // already 
initialized\r\n        gf_davs2.initial_count++;\r\n        return;\r\n    }\r\n\r\n    gf_davs2.initial_count = 1;\r\n    gf_davs2.cpuid         = cpuid;\r\n\r\n    /* init function handles */\r\n    davs2_memory_init    (cpuid, &gf_davs2);\r\n    davs2_intra_pred_init(cpuid, &gf_davs2);\r\n    davs2_pixel_init     (cpuid, &gf_davs2);\r\n    davs2_mc_init        (cpuid, &gf_davs2);\r\n    davs2_quant_init     (cpuid, &gf_davs2);\r\n    davs2_dct_init       (cpuid, &gf_davs2);\r\n    davs2_deblock_init   (cpuid, &gf_davs2);\r\n    davs2_sao_init       (cpuid, &gf_davs2);\r\n    davs2_alf_init       (cpuid, &gf_davs2);\r\n\r\n}\r\n"
  },
  {
    "path": "source/common/primitives.h",
    "content": "/*\r\n * primitives.h\r\n *\r\n * Description of this file:\r\n *    function handles initialize functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_PRIMITIVES_H\r\n#define DAVS2_PRIMITIVES_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * macros\r\n * ===========================================================================\r\n */\r\n\r\n#if HIGH_BIT_DEPTH\r\n#define MC_PART_INDEX(width, height)  (width >= 8)\r\n#else\r\n#define MC_PART_INDEX(width, height)  (width > 8)\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * function definitions and 
structures\r\n * ===========================================================================\r\n */\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * type defines\r\n * ===========================================================================\r\n */\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * function handle types\r\n */\r\ntypedef void(*block_copy_pp_t)(pel_t *dst, intptr_t i_dst, pel_t *src, intptr_t i_src, int w, int h);\r\ntypedef void(*block_copy_sc_t)(coeff_t *dst, intptr_t i_dst, int16_t *src, intptr_t i_src, int w, int h);\r\ntypedef void(*block_intpl_t)(const pel_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int coeffIdx);\r\ntypedef void(*block_intpl_ext_t)(const pel_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int coeffIdxX, int coeffIdxY);\r\ntypedef void(*intpl_t)    (pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\ntypedef void(*intpl_ext_t)(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff_x, const int8_t *coeff_y);\r\ntypedef void(*pixel_avg_pp_t)(pel_t *dst, int i_dst, const pel_t *src0, int i_src0, const pel_t *src1, int i_src1, int width, int height);\r\ntypedef void(*dct_t)(const coeff_t *src, coeff_t *dst, int i_src);\r\n\r\ntypedef void(*intra_pred_t)(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\ntypedef void(*fill_edge_t)(const pel_t *p_topleft, int i_topleft, const pel_t *p_lcu_ep, pel_t *EP, uint32_t i_avail, int bsx, int bsy);\r\n\r\ntypedef void *(*memcpy_t)(void *dst, const void *src, size_t n);\r\ntypedef void(*copy_pp_t)(pel_t* dst, intptr_t dstStride, const pel_t* src, intptr_t srcStride); // dst is aligned\r\ntypedef void(*copy_ss_t)(coeff_t* dst, intptr_t dstStride, const coeff_t* src, intptr_t srcStride);\r\n\r\ntypedef void(*pixel_add_ps_t)(pel_t* dst, intptr_t dstride, const pel_t* b0, 
const coeff_t* b1, intptr_t sstride0, intptr_t sstride1);\r\n\r\ntypedef void(*lcu_deblock_t)(davs2_t *h, davs2_frame_t *frm, int i_lcu_x, int i_lcu_y);\r\n\r\ntypedef void(*sao_flt_bo_t)(pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const sao_param_t *sao_param);\r\ntypedef void(*sao_flt_eo_t)(pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const int *lcu_avail, const int *sao_offset);\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * assembly optimization functions\r\n */\r\ntypedef struct ao_funcs_t {\r\n    ALIGN32(uint32_t    initial_count);\r\n    uint32_t            cpuid;\r\n    /* memory copy */\r\n    memcpy_t            fast_memcpy;\r\n    memcpy_t            memcpy_aligned;\r\n    void*(*fast_memzero)   (void *dst, size_t n);\r\n    void*(*memzero_aligned)(void *dst, size_t n);\r\n    void*(*fast_memset)    (void *dst, int val, size_t n);\r\n\r\n    /* plane copy */\r\n    void(*plane_copy)(pel_t *dst, intptr_t i_dst, pel_t *src, intptr_t i_src, int w, int h);\r\n    block_copy_pp_t block_copy;\r\n    block_copy_sc_t block_coeff_copy;\r\n\r\n    copy_pp_t       copy_pp[MAX_PART_NUM];\r\n    copy_ss_t       copy_ss[MAX_PART_NUM];\r\n    pixel_add_ps_t  add_ps[MAX_PART_NUM];\r\n\r\n    /* block average */\r\n    pixel_avg_pp_t  block_avg;\r\n\r\n    /* interpolate */\r\n#if USE_NEW_INTPL\r\n    block_intpl_t       block_intpl_luma_hor[MAX_PART_NUM];\r\n    block_intpl_t       block_intpl_luma_ver[MAX_PART_NUM];\r\n    block_intpl_ext_t   block_intpl_luma_ext[MAX_PART_NUM];\r\n#endif\r\n    intpl_t         intpl_luma_ver[2][3]; // [2]: indexed by block size, 0: size<=8, 1: size>=16; [3]: filter coefficient set\r\n    intpl_t         intpl_luma_hor[2][3];\r\n    intpl_ext_t     intpl_luma_ext[2];\r\n\r\n    intpl_t         intpl_chroma_ver[2];\r\n    intpl_t         intpl_chroma_hor[2];\r\n    intpl_ext_t     intpl_chroma_ext[2];\r\n\r\n    /* 
intra prediction */\r\n    intra_pred_t    intraf[NUM_INTRA_MODE];\r\n    fill_edge_t     fill_edge_f[4];\r\n\r\n    /* loop filter */\r\n    void(*set_deblock_const)(void);\r\n\r\n    void(*deblock_luma[2])  (pel_t *src, int stride, int alpha, int beta, uint8_t *flt_flag);\r\n#if HDR_CHROMA_DELTA_QP\r\n    void(*deblock_chroma[2])(pel_t *src_u, pel_t *src_v, int stride, int *alpha, int *beta, uint8_t *flt_flag);\r\n#else\r\n    void(*deblock_chroma[2])(pel_t *src_u, pel_t *src_v, int stride, int alpha, int beta, uint8_t *flt_flag);\r\n#endif\r\n\r\n    /* SAO filter */\r\n    sao_flt_bo_t     sao_block_bo;          /* filter for bo type */\r\n    sao_flt_eo_t     sao_filter_eo[4];      /* SAO filter for eo types */\r\n\r\n    /* alf */\r\n    void(*alf_block[2])(pel_t *p_dst, const pel_t *p_src, int stride,\r\n        int lcu_pix_x, int lcu_pix_y, int lcu_width, int lcu_height,\r\n        int *alf_coeff, int b_top_avail, int b_down_avail);\r\n\r\n    /* dct */\r\n    dct_t        idct[MAX_PART_NUM][DCT_PATTERN_NUM];  /* sqrt dct */\r\n\r\n    /* 2nd transform */\r\n    void(*inv_transform_4x4_2nd)(coeff_t *coeff, int i_coeff);\r\n    void(*inv_transform_2nd)    (coeff_t *coeff, int i_coeff, int i_mode, int b_top, int b_left);\r\n\r\n    /* quant */\r\n    void(*dequant)(coeff_t *coef, const int i_coef, const int scale, const int shift);\r\n} ao_funcs_t;\r\n\r\nextern ao_funcs_t gf_davs2;\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * interface function declares\r\n * ===========================================================================\r\n */\r\n#define init_all_primitives FPFX(init_all_primitives)\r\nvoid init_all_primitives(uint32_t cpuid);\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * extern functions\r\n */\r\n#define davs2_mc_init FPFX(mc_init)\r\nvoid davs2_mc_init    (uint32_t cpuid, ao_funcs_t *pf);\r\n#define davs2_pixel_init 
FPFX(pixel_init)\r\nvoid davs2_pixel_init (uint32_t cpuid, ao_funcs_t* pixf);\r\n#define davs2_memory_init FPFX(memory_init)\r\nvoid davs2_memory_init(uint32_t cpuid, ao_funcs_t* pixf);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_PRIMITIVES_H\r\n"
  },
  {
    "path": "source/common/quant.cc",
    "content": "/*\r\n * quant.c\r\n *\r\n * Description of this file:\r\n *    Quant functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"quant.h\"\r\n#include \"vec/intrinsic.h\"\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int16_t wq_param_default[2][6] = {\r\n    { 67, 71, 71, 80, 80, 106},\r\n    { 64, 49, 53, 58, 58, 64 }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int g_WqMDefault4x4[16] = {\r\n    64, 64, 64, 68,\r\n    64, 64, 68, 72,\r\n    64, 68, 76, 80,\r\n    72, 76, 84, 96\r\n};\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic const int g_WqMDefault8x8[64] = {\r\n    64,  64,  64,  64,  68,  68,  72,  76,\r\n    64,  64,  64,  68,  72,  76,  84,  92,\r\n    64,  64,  68,  72,  76,  80,  88,  100,\r\n    64,  68,  72,  80,  84,  92,  100, 28,\r\n    68,  72,  80,  84,  92,  104, 112, 128,\r\n    76,  80,  84,  92,  104, 116, 132, 152,\r\n    96,  100, 104, 116, 124, 140, 164, 188,\r\n    104, 108, 116, 128, 152, 172, 192, 216\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const uint8_t WeightQuantModel[4][64] = {\r\n    //   l a b c d h\r\n    //   0 1 2 3 4 5\r\n    {\r\n        // Mode 0\r\n        0, 0, 0, 4, 4, 4, 5, 5,\r\n        0, 0, 3, 3, 3, 3, 5, 5,\r\n        0, 3, 2, 2, 1, 1, 5, 5,\r\n        4, 3, 2, 2, 1, 5, 5, 5,\r\n        4, 3, 1, 1, 5, 5, 5, 5,\r\n        4, 3, 1, 5, 5, 5, 5, 5,\r\n        5, 5, 5, 5, 5, 5, 5, 5,\r\n        5, 5, 5, 5, 5, 5, 5, 5\r\n    }, {\r\n        // Mode 1\r\n        0, 0, 0, 4, 4, 4, 5, 5,\r\n        0, 0, 4, 4, 4, 4, 5, 5,\r\n        0, 3, 2, 2, 2, 1, 5, 5,\r\n        3, 3, 2, 2, 1, 5, 5, 5,\r\n        3, 3, 2, 1, 5, 5, 5, 5,\r\n        3, 3, 1, 5, 5, 5, 5, 5,\r\n        5, 5, 5, 5, 5, 5, 5, 5,\r\n        5, 5, 5, 5, 5, 5, 5, 5\r\n    }, {\r\n        // Mode 2\r\n        0, 0, 0, 4, 4, 3, 5, 5,\r\n        0, 0, 4, 4, 3, 2, 5, 5,\r\n        0, 4, 4, 3, 2, 1, 5, 5,\r\n        4, 4, 3, 2, 1, 5, 5, 5,\r\n        4, 3, 2, 1, 5, 5, 5, 5,\r\n        3, 2, 1, 5, 5, 5, 5, 5,\r\n        5, 5, 5, 5, 5, 5, 5, 5,\r\n        5, 5, 5, 5, 5, 5, 5, 5\r\n    }, {\r\n        // Mode 3\r\n        0, 0, 0, 3, 2, 1, 5, 5,\r\n        0, 0, 4, 3, 2, 1, 5, 5,\r\n        0, 4, 4, 3, 2, 1, 5, 5,\r\n        3, 3, 3, 3, 2, 5, 5, 5,\r\n        2, 2, 2, 2, 5, 5, 5, 5,\r\n        1, 1, 1, 5, 5, 5, 5, 5,\r\n        5, 5, 5, 5, 5, 5, 5, 5,\r\n        5, 5, 5, 5, 5, 5, 5, 5\r\n    }\r\n};\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic const uint8_t WeightQuantModel4x4[4][16] = {\r\n    //   l a b c d h\r\n    //   0 1 2 3 4 5\r\n    {\r\n        // Mode 0\r\n        0, 4, 3, 5,\r\n        4, 2, 1, 5,\r\n        3, 1, 1, 5,\r\n        5, 5, 5, 5\r\n    }, {\r\n        // Mode 1\r\n        0, 4, 4, 5,\r\n        3, 2, 2, 5,\r\n        3, 2, 1, 5,\r\n        5, 5, 5, 5\r\n    }, {\r\n        // Mode 2\r\n        0, 4, 3, 5,\r\n        4, 3, 2, 5,\r\n        3, 2, 1, 5,\r\n        5, 5, 5, 5\r\n    }, {\r\n        // Mode 3\r\n        0, 3, 1, 5,\r\n        3, 4, 2, 5,\r\n        1, 2, 2, 5,\r\n        5, 5, 5, 5\r\n    }\r\n};\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nconst int *wq_get_default_matrix(int sizeId)\r\n{\r\n    return (sizeId == 0) ? 
g_WqMDefault4x4 : g_WqMDefault8x8;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid wq_init_frame_quant_param(davs2_t *h)\r\n{\r\n    weighted_quant_t *p = &h->wq;\r\n    int uiWQMSizeId;\r\n    int i, j, k;\r\n\r\n    assert(h->seq_info.enable_weighted_quant);\r\n\r\n    for (uiWQMSizeId = 0; uiWQMSizeId < 4; uiWQMSizeId++) {\r\n        for (i = 0; i < 64; i++) {\r\n            p->cur_wq_matrix[uiWQMSizeId][i] = 1 << 7;\r\n        }\r\n    }\r\n\r\n    for (i = 0; i < 2; i++) {\r\n        for (j = 0; j < 6; j++) {\r\n            p->wquant_param[i][j] = 128;\r\n        }\r\n    }\r\n\r\n    if (p->wq_param == 0) {\r\n        for (i = 0; i < 6; i++) {\r\n            p->wquant_param[DETAILED][i] = wq_param_default[DETAILED][i];\r\n        }\r\n    } else if (p->wq_param == 1) {\r\n        for (i = 0; i < 6; i++) {\r\n            p->wquant_param[UNDETAILED][i] = p->quant_param_undetail[i];\r\n        }\r\n    }\r\n\r\n    if (p->wq_param == 2) {\r\n        for (i = 0; i < 6; i++) {\r\n            p->wquant_param[DETAILED][i] = p->quant_param_detail[i];\r\n        }\r\n    }\r\n\r\n    // reconstruct the weighting matrix\r\n    for (k = 0; k < 2; k++) {\r\n        for (j = 0; j < 8; j++) {\r\n            for (i = 0; i < 8; i++) {\r\n                p->wq_matrix[1][k][j * 8 + i] = p->wquant_param[k][WeightQuantModel[p->wq_model][j * 8 + i]];\r\n            }\r\n        }\r\n    }\r\n\r\n    for (k = 0; k < 2; k++) {\r\n        for (j = 0; j < 4; j++) {\r\n            for (i = 0; i < 4; i++) {\r\n                p->wq_matrix[0][k][j * 4 + i] = p->wquant_param[k][WeightQuantModel4x4[p->wq_model][j * 4 + i]];\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid wq_update_frame_matrix(davs2_t *h)\r\n{\r\n    weighted_quant_t *p = &h->wq;\r\n    int uiWQMSizeId, uiWMQId;\r\n    int uiBlockSize;\r\n    int i;\r\n\r\n    
assert(h->seq_info.enable_weighted_quant);\r\n\r\n    for (uiWQMSizeId = 0; uiWQMSizeId < 4; uiWQMSizeId++) {\r\n        uiBlockSize = DAVS2_MIN(1 << (uiWQMSizeId + 2), 8);\r\n        uiWMQId = (uiWQMSizeId < 2) ? uiWQMSizeId : 1;\r\n\r\n        if (p->pic_wq_data_index == 0) {\r\n            for (i = 0; i < (uiBlockSize * uiBlockSize); i++) {\r\n                p->cur_wq_matrix[uiWQMSizeId][i] = p->seq_wq_matrix[uiWMQId][i];\r\n            }\r\n        } else if (p->pic_wq_data_index == 1) {\r\n            if (p->wq_param == 0) {\r\n                for (i = 0; i < (uiBlockSize * uiBlockSize); i++) {\r\n                    p->cur_wq_matrix[uiWQMSizeId][i] = p->wq_matrix[uiWMQId][DETAILED][i];// detailed weighted matrix\r\n                }\r\n            } else if (p->wq_param == 1) {\r\n                for (i = 0; i < (uiBlockSize * uiBlockSize); i++) {\r\n                    p->cur_wq_matrix[uiWQMSizeId][i] = p->wq_matrix[uiWMQId][0][i];       // undetailed weighted matrix\r\n                }\r\n            }\r\n\r\n            if (p->wq_param == 2) {\r\n                for (i = 0; i < (uiBlockSize * uiBlockSize); i++) {\r\n                    p->cur_wq_matrix[uiWQMSizeId][i] = p->wq_matrix[uiWMQId][1][i];       // detailed weighted matrix\r\n                }\r\n            }\r\n        } else if (p->pic_wq_data_index == 2) {\r\n            for (i = 0; i < (uiBlockSize * uiBlockSize); i++) {\r\n                p->cur_wq_matrix[uiWQMSizeId][i] = p->pic_user_wq_matrix[uiWMQId][i];\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void dequant_c(coeff_t *p_coeff, const int i_coef, const int scale, const int shift)\r\n{\r\n    const int add = (1 << (shift - 1));\r\n    int i;\r\n\r\n    for (i = 0; i < i_coef; i++) {\r\n        if (p_coeff[i]) {\r\n            p_coeff[i] = (coeff_t)DAVS2_CLIP3(-32768, 32767, (p_coeff[i] * scale + add) >> shift);\r\n        }\r\n    
}\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void dequant_weighted_c(coeff_t *p_coeff, int i_coeff, int bsx, int bsy, int scale, int shift, int16_t *wq_matrix, int wqm_shift, int wqm_size_id)\r\n{\r\n    const int add          = 1 << (shift - 1);\r\n    const int wqm_size     = 1 << (wqm_size_id + 2);\r\n    const int stride_shift = DAVS2_CLIP3(0, 2, wqm_size_id - 1);\r\n    const int stride       = wqm_size >> stride_shift;\r\n    int i, j;\r\n\r\n    for (j = 0; j < bsy; j++) {\r\n        for (i = 0; i < bsx; i++) {\r\n            int wqm_coef = wq_matrix[((j >> stride_shift) & (stride - 1)) * stride + ((i >> stride_shift) & (stride - 1))];\r\n            if (p_coeff[i]) {\r\n                int cur_coeff = (((((p_coeff[i] * wqm_coef) >> wqm_shift) * scale) >> 4) + add) >> shift;\r\n                p_coeff[i] = (coeff_t)DAVS2_CLIP3(-32768, 32767, cur_coeff);\r\n            }\r\n        }\r\n        p_coeff += i_coeff;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * dequant the coefficients\r\n */\r\nvoid dequant_coeffs(davs2_t *h, coeff_t *p_coeff, int bsx, int bsy, int scale, int shift, int WQMSizeId)\r\n{\r\n    if (h->seq_info.enable_weighted_quant) {\r\n        int wqm_shift = (h->wq.pic_wq_data_index == 1) ? 3 : 0;\r\n        dequant_weighted_c(p_coeff, bsx, bsx, bsy, scale, shift, h->wq.cur_wq_matrix[WQMSizeId], wqm_shift, WQMSizeId);\r\n    } else {\r\n        gf_davs2.dequant(p_coeff, bsx * bsy, scale, shift);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_quant_init(uint32_t cpuid, ao_funcs_t *fh)\r\n{\r\n    /* init c function handles */\r\n    fh->dequant   = dequant_c;\r\n\r\n    /* init asm function handles */\r\n#if HAVE_MMX\r\n    if (cpuid & DAVS2_CPU_SSE4) {\r\n        fh->dequant = davs2_dequant_sse4;\r\n    }\r\n#endif  // if HAVE_MMX\r\n}\r\n"
  },
  {
    "path": "source/common/quant.h",
    "content": "/*\r\n * quant.h\r\n *\r\n * Description of this file:\r\n *    Quant functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_QUANT_H\r\n#define DAVS2_QUANT_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#define QP_SCALE_CR FPFX(QP_SCALE_CR)\r\nextern const uint8_t  QP_SCALE_CR[];\r\n#define IQ_SHIFT FPFX(IQ_SHIFT)\r\nextern const int16_t  IQ_SHIFT[];\r\n#define IQ_TAB FPFX(IQ_TAB)\r\nextern const uint16_t IQ_TAB[];\r\n#define wq_param_default FPFX(wq_param_default)\r\nextern const int16_t wq_param_default[2][6];\r\n\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Weight Quant\r\n * - Adaptive Frequency Weighting Quantization, include:\r\n *      a). 
Frequency weighting model, quantization\r\n *      b). Picture level user-defined frequency weighting\r\n *      c). LCU level adaptive frequency weighting mode decision\r\n *   According to adopted proposals: m1878, m2148, m2331\r\n * ---------------------------------------------------------------------------\r\n */\r\n#define PARAM_NUM  6\r\n#define WQ_MODEL_NUM 3\r\n\r\n#define UNDETAILED 0\r\n#define DETAILED   1\r\n\r\n#define WQ_MODE_F  0\r\n#define WQ_MODE_U  1\r\n#define WQ_MODE_D  2\r\n\r\n#define wq_get_default_matrix FPFX(wq_get_default_matrix)\r\nconst int *wq_get_default_matrix(int sizeId);\r\n\r\n#define wq_init_frame_quant_param FPFX(wq_init_frame_quant_param)\r\nvoid wq_init_frame_quant_param(davs2_t *h);\r\n#define wq_update_frame_matrix FPFX(wq_update_frame_matrix)\r\nvoid wq_update_frame_matrix(davs2_t *h);\r\n\r\n\r\n/* dequant */\r\n#define dequant_coeffs FPFX(dequant_coeffs)\r\nvoid dequant_coeffs(davs2_t *h, coeff_t *p_coeff, int bsx, int bsy, int scale, int shift, int WQMSizeId);\r\n#define davs2_quant_init FPFX(quant_init)\r\nvoid davs2_quant_init(uint32_t cpuid, ao_funcs_t *fh);\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get qp in chroma component\r\n */\r\nstatic ALWAYS_INLINE\r\nint cu_get_chroma_qp(davs2_t * h, int luma_qp, int uv)\r\n{\r\n    int qp = luma_qp + (uv == 0 ? h->chroma_quant_param_delta_u : h->chroma_quant_param_delta_v);\r\n\r\n#if HIGH_BIT_DEPTH\r\n    const int bit_depth_offset = ((h->sample_bit_depth - 8) << 3);\r\n    qp -= bit_depth_offset;\r\n    qp = qp < 0 ? 
qp : QP_SCALE_CR[qp];\r\n    qp = DAVS2_CLIP3(0, 63 + bit_depth_offset, qp + bit_depth_offset);\r\n\r\n#else\r\n    qp = QP_SCALE_CR[DAVS2_CLIP3(0, 63, qp)];\r\n#endif\r\n\r\n    return qp;\r\n}\r\n\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get quant parameters\r\n */\r\nstatic ALWAYS_INLINE\r\nvoid cu_get_quant_params(davs2_t * h, int qp, int bit_size, \r\n                         int *shift, int *scale)\r\n{\r\n    *shift = IQ_SHIFT[qp] + (h->sample_bit_depth + 1) + bit_size - LIMIT_BIT;\r\n    *scale = IQ_TAB[qp];\r\n}\r\n\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_QUANT_H\r\n"
  },
  {
    "path": "source/common/sao.cc",
    "content": "/*\r\n * sao.cc\r\n *\r\n * Description of this file:\r\n *    SAO functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"sao.h\"\r\n#include \"aec.h\"\r\n#include \"frame.h\"\r\n#include \"vec/intrinsic.h\"\r\n\r\n#if defined(_MSC_VER) || defined(__ICL)\r\n#pragma warning(disable: 4204)  // nonstandard extension used: non-constant aggregate initializer\r\n#endif\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * local & global variables (const tables)\r\n * ===========================================================================\r\n */\r\n\r\nconst int saoclip[NUM_SAO_OFFSET][3] = {\r\n    //EO\r\n    { -1, 6, 7 }, // low bound, upper 
bound, threshold\r\n    {  0, 1, 1 },\r\n    {  0, 0, 0 },\r\n    { -1, 0, 1 },\r\n    { -6, 1, 7 },\r\n    { -7, 7, 7 } // BO\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * lcu neighbor\r\n */\r\nenum lcu_neighbor_e {\r\n    SAO_T   = 0,    /* top        */\r\n    SAO_D   = 1,    /* down       */\r\n    SAO_L   = 2,    /* left       */\r\n    SAO_R   = 3,    /* right      */\r\n    SAO_TL  = 4,    /* top-left   */\r\n    SAO_TR  = 5,    /* top-right  */\r\n    SAO_DL  = 6,    /* down-left  */\r\n    SAO_DR  = 7     /* down-right */\r\n};\r\n\r\ntypedef struct sao_region_t {\r\n    int    pix_x[IMG_COMPONENTS];       /* start pixel position in x */\r\n    int    pix_y[IMG_COMPONENTS];       /* start pixel position in y */\r\n    int    width[IMG_COMPONENTS];       /*  */\r\n    int    height[IMG_COMPONENTS];      /*  */\r\n\r\n    /* availabilities of neighboring blocks */\r\n    int8_t b_left;\r\n    int8_t b_top_left;\r\n    int8_t b_top;\r\n    int8_t b_top_right;\r\n    int8_t b_right;\r\n    int8_t b_right_down;\r\n    int8_t b_down;\r\n    int8_t b_down_left;\r\n} sao_region_t;\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE void sao_init_param(sao_t *lcu_sao)\r\n{\r\n    int i;\r\n\r\n    for (i = 0; i < IMG_COMPONENTS; i++) {\r\n        lcu_sao->planes[i].modeIdc    = SAO_MODE_OFF;\r\n        lcu_sao->planes[i].typeIdc    = -1;\r\n        lcu_sao->planes[i].startBand  = -1;\r\n        lcu_sao->planes[i].startBand2 = -1;\r\n        memset(lcu_sao->planes[i].offset, 0, MAX_NUM_SAO_CLASSES * sizeof(int));\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic ALWAYS_INLINE 
void sao_copy_param(sao_t *dst, sao_t *src)\r\n{\r\n    memcpy(dst, src, sizeof(sao_t));\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid sao_block_eo_0_c(pel_t *p_dst, int i_dst,\r\n                      const pel_t *p_src, int i_src,\r\n                      int i_block_w, int i_block_h,\r\n                      int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    const int max_pel_val = (1 << bit_depth) - 1;\r\n    int left_sign, right_sign;\r\n    int edge_type;\r\n    int x, y;\r\n    int pel_diff;\r\n\r\n    int sx = lcu_avail[SAO_L] ? 0 : 1;\r\n    int ex = lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1);\r\n    for (y = 0; y < i_block_h; y++) {\r\n        pel_diff = p_src[sx] - p_src[sx - 1];\r\n        left_sign = pel_diff > 0? 1 : (pel_diff < 0? -1 : 0);\r\n        for (x = sx; x < ex; x++) {\r\n            pel_diff = p_src[x] - p_src[x + 1];\r\n            right_sign = pel_diff > 0? 1 : (pel_diff < 0? -1 : 0);\r\n            edge_type = left_sign + right_sign + 2;\r\n            left_sign = -right_sign;\r\n            p_dst[x] = (pel_t)DAVS2_CLIP3(0, max_pel_val, p_src[x] + sao_offset[edge_type]);\r\n        }\r\n        p_src += i_src;\r\n        p_dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid sao_block_eo_90_c(pel_t *p_dst, int i_dst,\r\n                       const pel_t *p_src, int i_src,\r\n                       int i_block_w, int i_block_h,\r\n                       int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    const int max_pel_val = (1 << bit_depth) - 1;\r\n    int edge_type;\r\n    int x, y;\r\n\r\n    int sy = lcu_avail[SAO_T] ? 0 : 1;\r\n    int ey = lcu_avail[SAO_D] ? 
i_block_h : (i_block_h - 1);\r\n    for (x = 0; x < i_block_w; x++) {\r\n        int pel_diff = p_src[sy * i_src + x] - p_src[(sy - 1) * i_src + x];\r\n        int top_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n        for (y = sy; y < ey; y++) {\r\n            int pelDiff = p_src[y * i_src + x] - p_src[(y + 1) * i_src + x];\r\n            int down_sign = pelDiff > 0 ? 1 : (pelDiff < 0 ? -1 : 0);\r\n            edge_type = down_sign + top_sign + 2;\r\n            top_sign = -down_sign;\r\n            p_dst[y * i_dst + x] = (pel_t)DAVS2_CLIP3(0, max_pel_val, p_src[y * i_src + x] + sao_offset[edge_type]);\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid sao_block_eo_135_c(pel_t *p_dst, int i_dst,\r\n                        const pel_t *p_src, int i_src,\r\n                        int i_block_w, int i_block_h,\r\n                        int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    int8_t SIGN_BUF[MAX_CU_SIZE + 32];  // sign of top line\r\n    int8_t *UPROW_S = SIGN_BUF + 16;\r\n    const int max_pel_val = (1 << bit_depth) - 1;\r\n    int reg = 0;\r\n    int sx, ex;               // start/end (x, y)\r\n    int sx_0, ex_0, sx_n, ex_n;       // start/end x for first and last row\r\n    int top_sign, down_sign;\r\n    int edge_type;\r\n    int pel_diff;\r\n    int x, y;\r\n\r\n    sx = lcu_avail[SAO_L] ? 0 : 1;\r\n    ex = lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1);\r\n\r\n    // init the line buffer\r\n    for (x = sx; x < ex; x++) {\r\n        pel_diff = p_src[i_src + x + 1] - p_src[x];\r\n        top_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n        UPROW_S[x + 1] = (int8_t)top_sign;\r\n    }\r\n\r\n    // first row\r\n    sx_0 = lcu_avail[SAO_TL] ? 0 : 1;\r\n    ex_0 = lcu_avail[SAO_T] ? (lcu_avail[SAO_R] ? 
i_block_w : (i_block_w - 1)) : 1;\r\n    for (x = sx_0; x < ex_0; x++) {\r\n        pel_diff = p_src[x] - p_src[-i_src + x - 1];\r\n        top_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n        edge_type = top_sign - UPROW_S[x + 1] + 2;\r\n        p_dst[x] = (pel_t)DAVS2_CLIP3(0, max_pel_val, p_src[x] + sao_offset[edge_type]);\r\n    }\r\n\r\n    // middle rows\r\n    for (y = 1; y < i_block_h - 1; y++) {\r\n        p_src += i_src;\r\n        p_dst += i_dst;\r\n        for (x = sx; x < ex; x++) {\r\n            if (x == sx) {\r\n                pel_diff = p_src[x] - p_src[-i_src + x - 1];\r\n                top_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n                UPROW_S[x] = (int8_t)top_sign;\r\n            }\r\n            pel_diff = p_src[x] - p_src[i_src + x + 1];\r\n            down_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n            edge_type = down_sign + UPROW_S[x] + 2;\r\n            p_dst[x] = (pel_t)DAVS2_CLIP3(0, max_pel_val, p_src[x] + sao_offset[edge_type]);\r\n            UPROW_S[x] = (int8_t)reg;\r\n            reg = -down_sign;\r\n        }\r\n    }\r\n\r\n    // last row\r\n    sx_n = lcu_avail[SAO_D] ? (lcu_avail[SAO_L] ? 0 : 1) : (i_block_w - 1);\r\n    ex_n = lcu_avail[SAO_DR] ? i_block_w : (i_block_w - 1);\r\n    p_src += i_src;\r\n    p_dst += i_dst;\r\n    for (x = sx_n; x < ex_n; x++) {\r\n        if (x == sx) {\r\n            pel_diff = p_src[x] - p_src[-i_src + x - 1];\r\n            top_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n            UPROW_S[x] = (int8_t)top_sign;\r\n        }\r\n        pel_diff = p_src[x] - p_src[i_src + x + 1];\r\n        down_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? 
-1 : 0);\r\n        edge_type = down_sign + UPROW_S[x] + 2;\r\n        p_dst[x] = (pel_t)DAVS2_CLIP3(0, max_pel_val, p_src[x] + sao_offset[edge_type]);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid sao_block_eo_45_c(pel_t *p_dst, int i_dst,\r\n                       const pel_t *p_src, int i_src,\r\n                       int i_block_w, int i_block_h,\r\n                       int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    int8_t SIGN_BUF[MAX_CU_SIZE + 32];  // sign of top line\r\n    int8_t *UPROW_S = SIGN_BUF + 16;\r\n    const int max_pel_val = (1 << bit_depth) - 1;\r\n    int sx, ex;               // start/end (x, y)\r\n    int sx_0, ex_0, sx_n, ex_n;       // start/end x for first and last row\r\n    int top_sign, down_sign;\r\n    int edge_type;\r\n    int pel_diff;\r\n    int x, y;\r\n\r\n    sx = lcu_avail[SAO_L] ? 0 : 1;\r\n    ex = lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1);\r\n\r\n    // init the line buffer\r\n    for (x = sx; x < ex; x++) {\r\n        pel_diff = p_src[i_src + x - 1] - p_src[x];\r\n        top_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n        UPROW_S[x - 1] = (int8_t)top_sign;\r\n    }\r\n\r\n    // first row\r\n    sx_0 = lcu_avail[SAO_T] ? (lcu_avail[SAO_L] ? 0 : 1) : (i_block_w - 1);\r\n    ex_0 = lcu_avail[SAO_TR] ? i_block_w : (i_block_w - 1);\r\n    for (x = sx_0; x < ex_0; x++) {\r\n        pel_diff = p_src[x] - p_src[-i_src + x + 1];\r\n        top_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? 
-1 : 0);\r\n        edge_type = top_sign - UPROW_S[x - 1] + 2;\r\n        p_dst[x] = (pel_t)DAVS2_CLIP3(0, max_pel_val, p_src[x] + sao_offset[edge_type]);\r\n    }\r\n\r\n    // middle rows\r\n    for (y = 1; y < i_block_h - 1; y++) {\r\n        p_src += i_src;\r\n        p_dst += i_dst;\r\n        for (x = sx; x < ex; x++) {\r\n            if (x == ex - 1) {\r\n                pel_diff = p_src[x] - p_src[-i_src + x + 1];\r\n                top_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n                UPROW_S[x] = (int8_t)top_sign;\r\n            }\r\n            pel_diff = p_src[x] - p_src[i_src + x - 1];\r\n            down_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n            edge_type = down_sign + UPROW_S[x] + 2;\r\n            p_dst[x] = (pel_t)DAVS2_CLIP3(0, max_pel_val, p_src[x] + sao_offset[edge_type]);\r\n            UPROW_S[x - 1] = (int8_t)(-down_sign);\r\n        }\r\n    }\r\n\r\n    // last row\r\n    sx_n = lcu_avail[SAO_DL] ? 0 : 1;\r\n    ex_n = lcu_avail[SAO_D] ? (lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1)) : 1;\r\n    p_src += i_src;\r\n    p_dst += i_dst;\r\n    for (x = sx_n; x < ex_n; x++) {\r\n        if (x == ex - 1) {\r\n            pel_diff = p_src[x] - p_src[-i_src + x + 1];\r\n            top_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? -1 : 0);\r\n            UPROW_S[x] = (int8_t)top_sign;\r\n        }\r\n        pel_diff = p_src[x] - p_src[i_src + x - 1];\r\n        down_sign = pel_diff > 0 ? 1 : (pel_diff < 0 ? 
-1 : 0);\r\n        edge_type = down_sign + UPROW_S[x] + 2;\r\n        p_dst[x] = (pel_t)DAVS2_CLIP3(0, max_pel_val, p_src[x] + sao_offset[edge_type]);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid sao_block_bo_c(pel_t *p_dst, int i_dst,\r\n                    const pel_t *p_src, int i_src,\r\n                    int i_block_w, int i_block_h,\r\n                    int bit_depth, const sao_param_t *sao_param)\r\n{\r\n    const int max_pel_val = (1 << bit_depth) - 1;\r\n    const int *sao_offset = sao_param->offset;\r\n    int edge_type;\r\n    int x, y;\r\n\r\n    const int band_shift = g_bit_depth - NUM_SAO_BO_CLASSES_IN_BIT;\r\n\r\n    for (y = 0; y < i_block_h; y++) {\r\n        for (x = 0; x < i_block_w; x++) {\r\n            edge_type = p_src[x] >> band_shift;\r\n            p_dst[x] = (pel_t)DAVS2_CLIP3(0, max_pel_val, p_src[x] + sao_offset[edge_type]);\r\n        }\r\n        p_src += i_src;\r\n        p_dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void sao_read_lcu(davs2_t *h, int lcu_xy, bool_t *slice_sao_on, sao_t *cur_sao_param)\r\n{\r\n    const int w_in_scu = h->i_width_in_scu;\r\n    const int scu_x    = h->lcu.i_scu_x;\r\n    const int scu_y    = h->lcu.i_scu_y;\r\n    const int scu_xy   = h->lcu.i_scu_xy;\r\n    int merge_mode     = 0;\r\n    int merge_top_avail, merge_left_avail;\r\n\r\n    /* neighbor available? */\r\n    merge_top_avail  = (scu_y == 0) ? 0 : (h->scu_data[scu_xy].i_slice_nr == h->scu_data[scu_xy - w_in_scu].i_slice_nr);\r\n    merge_left_avail = (scu_x == 0) ? 
0 : (h->scu_data[scu_xy].i_slice_nr == h->scu_data[scu_xy -        1].i_slice_nr);\r\n\r\n    if (merge_left_avail || merge_top_avail) {\r\n        merge_mode = aec_read_sao_mergeflag(&h->aec, merge_left_avail, merge_top_avail);\r\n    }\r\n\r\n    if (merge_mode) {\r\n        if (merge_mode == 2) {\r\n            sao_copy_param(cur_sao_param, &h->lcu_infos[lcu_xy - 1].sao_param);  // copy left\r\n        } else {\r\n            assert(merge_mode == 1);\r\n            sao_copy_param(cur_sao_param, &h->lcu_infos[lcu_xy - h->i_width_in_lcu].sao_param);  // copy above\r\n        }\r\n    } else {\r\n        int offset[4];\r\n        int stBnd[2];\r\n        int db_temp;\r\n        int sao_mode, sao_type;\r\n        int i;\r\n\r\n        for (i = 0; i < IMG_COMPONENTS; i++) {\r\n            if (!slice_sao_on[i]) {\r\n                cur_sao_param->planes[i].modeIdc = SAO_MODE_OFF;\r\n            } else {\r\n                sao_mode = aec_read_sao_mode(&h->aec);\r\n                switch (sao_mode) {\r\n                case 0:\r\n                    cur_sao_param->planes[i].modeIdc = SAO_MODE_OFF;\r\n                    break;\r\n                case 1:\r\n                    cur_sao_param->planes[i].modeIdc = SAO_MODE_NEW;\r\n                    cur_sao_param->planes[i].typeIdc = SAO_TYPE_BO;\r\n                    break;\r\n                case 3:\r\n                    cur_sao_param->planes[i].modeIdc = SAO_MODE_NEW;\r\n                    cur_sao_param->planes[i].typeIdc = SAO_TYPE_EO_0;\r\n                    break;\r\n                default:\r\n                    assert(0);  /* unexpected sao_mode; assert(1) could never fire */\r\n                    break;\r\n                }\r\n\r\n                if (cur_sao_param->planes[i].modeIdc == SAO_MODE_NEW) {\r\n                    aec_read_sao_offsets(&h->aec, &cur_sao_param->planes[i], offset);\r\n                    sao_type = aec_read_sao_type(&h->aec, &cur_sao_param->planes[i]);\r\n\r\n                    if (cur_sao_param->planes[i].typeIdc == SAO_TYPE_BO) {\r\n   
                     memset(cur_sao_param->planes[i].offset, 0, MAX_NUM_SAO_CLASSES * sizeof(int));\r\n                        db_temp  = sao_type >> NUM_SAO_BO_CLASSES_LOG2;\r\n                        stBnd[0] = sao_type - (db_temp << NUM_SAO_BO_CLASSES_LOG2);\r\n                        stBnd[1] = (stBnd[0] + db_temp) % 32;\r\n                        cur_sao_param->planes[i].startBand = stBnd[0];\r\n                        cur_sao_param->planes[i].startBand2 = stBnd[1];\r\n                        cur_sao_param->planes[i].offset[(stBnd[0]    )     ] = offset[0];\r\n                        cur_sao_param->planes[i].offset[(stBnd[0] + 1) % 32] = offset[1];\r\n                        cur_sao_param->planes[i].offset[(stBnd[1]    )     ] = offset[2];\r\n                        cur_sao_param->planes[i].offset[(stBnd[1] + 1) % 32] = offset[3];\r\n                        //memcpy(cur_sao_param->planes[i].offset, offset, 4 * sizeof(int));\r\n                    } else {\r\n                        assert(cur_sao_param->planes[i].typeIdc == SAO_TYPE_EO_0);\r\n                        cur_sao_param->planes[i].typeIdc                          = sao_type;\r\n                        cur_sao_param->planes[i].offset[SAO_CLASS_EO_FULL_VALLEY] = offset[0];\r\n                        cur_sao_param->planes[i].offset[SAO_CLASS_EO_HALF_VALLEY] = offset[1];\r\n                        cur_sao_param->planes[i].offset[SAO_CLASS_EO_PLAIN      ] = 0;\r\n                        cur_sao_param->planes[i].offset[SAO_CLASS_EO_HALF_PEAK  ] = offset[2];\r\n                        cur_sao_param->planes[i].offset[SAO_CLASS_EO_FULL_PEAK  ] = offset[3];\r\n                    }\r\n                }\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid sao_read_lcu_param(davs2_t *h, int lcu_xy, bool_t *slice_sao_on, sao_t *sao_param)\r\n{\r\n    if (slice_sao_on[0] || slice_sao_on[1] || slice_sao_on[2]) {\r\n        
sao_read_lcu(h, lcu_xy, slice_sao_on, sao_param);\r\n    } else {\r\n        sao_init_param(sao_param);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nstatic\r\nvoid sao_get_neighbor_avail(davs2_t *h, sao_region_t *p_avail, int i_lcu_x, int i_lcu_y)\r\n{\r\n    int i_lcu_level = h->i_lcu_level;\r\n    int pix_x = i_lcu_x << i_lcu_level;\r\n    int pix_y = i_lcu_y << i_lcu_level;\r\n    int width = DAVS2_MIN(1 << i_lcu_level, h->i_width - pix_x);\r\n    int height = DAVS2_MIN(1 << i_lcu_level, h->i_height - pix_y);\r\n    int pix_x_c = pix_x >> 1;\r\n    int chroma_v_shift = (h->i_chroma_format == CHROMA_420);\r\n    int pix_y_c = pix_y >> chroma_v_shift;\r\n    int width_c = width >> 1;\r\n    int height_c = height >> 1;\r\n\r\n    /* neighbor availability */\r\n    p_avail->b_left = i_lcu_x != 0;\r\n    p_avail->b_top  = i_lcu_y != 0;\r\n    p_avail->b_right = (i_lcu_x < h->i_width_in_lcu - 1);\r\n    p_avail->b_down  = (i_lcu_y < h->i_height_in_lcu - 1);\r\n\r\n    if (h->seq_info.cross_loop_filter_flag == FALSE) {\r\n        int scu_x = i_lcu_x << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n        int scu_y = i_lcu_y << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n        if (p_avail->b_top) {\r\n            p_avail->b_top = h->scu_data[scu_y * h->i_width_in_scu + scu_x].i_slice_nr == h->scu_data[(scu_y - 1) * h->i_width_in_scu + scu_x].i_slice_nr;\r\n        }\r\n        if (p_avail->b_down) {\r\n            scu_y += 1 << (h->i_lcu_level - MIN_CU_SIZE_IN_BIT);\r\n            p_avail->b_down = h->scu_data[scu_y * h->i_width_in_scu + scu_x].i_slice_nr == h->scu_data[(scu_y - 1) * h->i_width_in_scu + scu_x].i_slice_nr;\r\n        }\r\n    }\r\n\r\n    p_avail->b_top_left = p_avail->b_top && p_avail->b_left;\r\n    p_avail->b_top_right = p_avail->b_top && p_avail->b_right;\r\n    p_avail->b_down_left = p_avail->b_down && p_avail->b_left;\r\n    p_avail->b_right_down = p_avail->b_down && p_avail->b_right;\r\n\r\n    /* adjust the filtered region by SAO_SHIFT_PIX_NUM according to neighbor availability */
\r\n    if (!p_avail->b_right) {\r\n        width += SAO_SHIFT_PIX_NUM;\r\n        width_c += SAO_SHIFT_PIX_NUM;\r\n    }\r\n\r\n    if (!p_avail->b_down) {\r\n        height += SAO_SHIFT_PIX_NUM;\r\n        height_c += SAO_SHIFT_PIX_NUM;\r\n    }\r\n\r\n    if (p_avail->b_left) {\r\n        pix_x -= SAO_SHIFT_PIX_NUM;\r\n        pix_x_c -= SAO_SHIFT_PIX_NUM;\r\n    }\r\n    else {\r\n        width -= SAO_SHIFT_PIX_NUM;\r\n        width_c -= SAO_SHIFT_PIX_NUM;\r\n    }\r\n\r\n    if (p_avail->b_top) {\r\n        pix_y -= SAO_SHIFT_PIX_NUM;\r\n        pix_y_c -= SAO_SHIFT_PIX_NUM;\r\n    }\r\n    else {\r\n        height -= SAO_SHIFT_PIX_NUM;\r\n        height_c -= SAO_SHIFT_PIX_NUM;\r\n    }\r\n\r\n    /* clamp the width and height so the region stays inside the picture */\r\n    width = DAVS2_MIN(width, h->i_width - pix_x);\r\n    width_c = DAVS2_MIN(width_c, (h->i_width >> 1) - pix_x_c);\r\n    height = DAVS2_MIN(height, h->i_height - pix_y);\r\n    height_c = DAVS2_MIN(height_c, (h->i_height >> 1) - pix_y_c);\r\n\r\n    /* luma component */\r\n    p_avail->pix_x[0] = pix_x;\r\n    p_avail->pix_y[0] = pix_y;\r\n    p_avail->width[0] = width;\r\n    p_avail->height[0] = height;\r\n\r\n    /* chroma components */\r\n    p_avail->pix_x[1] = p_avail->pix_x[2] = pix_x_c;\r\n    p_avail->pix_y[1] = p_avail->pix_y[2] = pix_y_c;\r\n    p_avail->width[1] = p_avail->width[2] = width_c;\r\n    p_avail->height[1] = p_avail->height[2] = height_c;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid sao_lcu(davs2_t *h, davs2_frame_t *p_tmp_frm, davs2_frame_t *p_dec_frm, int i_lcu_x, int i_lcu_y)\r\n{\r\n    const int width_in_lcu = h->i_width_in_lcu;\r\n    sao_t *lcu_param = &h->lcu_infos[i_lcu_y * width_in_lcu + i_lcu_x].sao_param;\r\n\r\n    /* copy one decoded LCU */\r\n    davs2_frame_copy_lcu(h, p_tmp_frm, p_dec_frm, i_lcu_x, i_lcu_y, 0, 0);\r\n\r\n    /* SAO one LCU */\r\n    sao_region_t region;\r\n    int comp_idx;\r\n    
sao_get_neighbor_avail(h, &region, i_lcu_x, i_lcu_y);\r\n    for (comp_idx = 0; comp_idx < IMG_COMPONENTS; comp_idx++) {\r\n        if (h->slice_sao_on[comp_idx] == 0 || lcu_param->planes[comp_idx].modeIdc == SAO_MODE_OFF) {\r\n            continue;\r\n        }\r\n\r\n        int filter_type = lcu_param->planes[comp_idx].typeIdc;\r\n        assert(filter_type >= SAO_TYPE_EO_0 && filter_type <= SAO_TYPE_BO);\r\n\r\n        int pix_y = region.pix_y[comp_idx];\r\n        int pix_x = region.pix_x[comp_idx];\r\n        const int bit_depth = h->sample_bit_depth;\r\n        int blkoffset = pix_y * p_dec_frm->i_stride[comp_idx] + pix_x;\r\n        pel_t *dst = p_dec_frm->planes[comp_idx] + blkoffset;\r\n        pel_t *src = p_tmp_frm->planes[comp_idx] + blkoffset;\r\n        if (filter_type == SAO_TYPE_BO) {\r\n            sao_block_bo_c(dst, p_dec_frm->i_stride[comp_idx], src, p_dec_frm->i_stride[comp_idx],\r\n                           region.width[comp_idx], region.height[comp_idx], bit_depth, &lcu_param->planes[comp_idx]);\r\n        } else {\r\n            int avail[8];\r\n            avail[0] = region.b_top;\r\n            avail[1] = region.b_down;\r\n            avail[2] = region.b_left;\r\n            avail[3] = region.b_right;\r\n            avail[4] = region.b_top_left;\r\n            avail[5] = region.b_top_right;\r\n            avail[6] = region.b_down_left;\r\n            avail[7] = region.b_right_down;\r\n            gf_davs2.sao_filter_eo[filter_type](dst, p_dec_frm->i_stride[comp_idx], src, p_dec_frm->i_stride[comp_idx],\r\n                                                region.width[comp_idx], region.height[comp_idx],\r\n                                                bit_depth, avail, lcu_param->planes[comp_idx].offset);\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid sao_lcurow(davs2_t *h, davs2_frame_t *p_tmp_frm, davs2_frame_t *p_dec_frm, int i_lcu_y)\r\n{\r\n    const int 
width_in_lcu = h->i_width_in_lcu;\r\n    int lcu_xy             = i_lcu_y * width_in_lcu;\r\n    int lcu_x;\r\n\r\n    /* copy one decoded LCU-row */\r\n    davs2_frame_copy_lcurow(h, p_tmp_frm, p_dec_frm, i_lcu_y, -4, 0);\r\n\r\n    /* SAO one LCU-row */\r\n    for (lcu_x = 0; lcu_x < h->i_width_in_lcu; lcu_x++) {\r\n        sao_region_t region;\r\n        sao_t *lcu_param = &h->lcu_infos[lcu_xy++].sao_param;\r\n        int comp_idx;\r\n        sao_get_neighbor_avail(h, &region, lcu_x, i_lcu_y);\r\n        for (comp_idx = 0; comp_idx < IMG_COMPONENTS; comp_idx++) {\r\n            if (h->slice_sao_on[comp_idx] == 0 || lcu_param->planes[comp_idx].modeIdc == SAO_MODE_OFF){\r\n                continue;\r\n            }\r\n            int filter_type = lcu_param->planes[comp_idx].typeIdc;\r\n            assert(filter_type >= SAO_TYPE_EO_0 && filter_type <= SAO_TYPE_BO);\r\n\r\n            int pix_y = region.pix_y[comp_idx];\r\n            int pix_x = region.pix_x[comp_idx];\r\n            const int bit_depth = h->sample_bit_depth;\r\n            int blkoffset = pix_y * p_dec_frm->i_stride[comp_idx] + pix_x;\r\n            pel_t *dst = p_dec_frm->planes[comp_idx] + blkoffset;\r\n            pel_t *src = p_tmp_frm->planes[comp_idx] + blkoffset;\r\n            if (filter_type == SAO_TYPE_BO) {\r\n                gf_davs2.sao_block_bo(dst, p_dec_frm->i_stride[comp_idx], src, p_dec_frm->i_stride[comp_idx],\r\n                                      region.width[comp_idx], region.height[comp_idx], bit_depth, &lcu_param->planes[comp_idx]);\r\n            } else {\r\n                int avail[8];\r\n                avail[0] = region.b_top;\r\n                avail[1] = region.b_down;\r\n                avail[2] = region.b_left;\r\n                avail[3] = region.b_right;\r\n                avail[4] = region.b_top_left;\r\n                avail[5] = region.b_top_right;\r\n                avail[6] = region.b_down_left;\r\n                avail[7] = region.b_right_down;\r\n       
         gf_davs2.sao_filter_eo[filter_type](dst, p_dec_frm->i_stride[comp_idx], src, p_dec_frm->i_stride[comp_idx],\r\n                                                    region.width[comp_idx], region.height[comp_idx],\r\n                                                    bit_depth, avail, lcu_param->planes[comp_idx].offset);\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_sao_init(uint32_t cpuid, ao_funcs_t *fh)\r\n{\r\n    /* init c function handles */\r\n    fh->sao_block_bo                   = sao_block_bo_c;\r\n    fh->sao_filter_eo[SAO_TYPE_EO_0]   = sao_block_eo_0_c;\r\n    fh->sao_filter_eo[SAO_TYPE_EO_45]  = sao_block_eo_45_c;\r\n    fh->sao_filter_eo[SAO_TYPE_EO_90]  = sao_block_eo_90_c;\r\n    fh->sao_filter_eo[SAO_TYPE_EO_135] = sao_block_eo_135_c;\r\n\r\n    /* init asm function handles */\r\n#if HAVE_MMX\r\n    if (cpuid & DAVS2_CPU_SSE4) {\r\n        fh->sao_block_bo                   = SAO_on_block_bo_sse128;\r\n        fh->sao_filter_eo[SAO_TYPE_EO_0]   = SAO_on_block_eo_0_sse128;\r\n        fh->sao_filter_eo[SAO_TYPE_EO_45]  = SAO_on_block_eo_45_sse128;\r\n        fh->sao_filter_eo[SAO_TYPE_EO_90]  = SAO_on_block_eo_90_sse128;\r\n        fh->sao_filter_eo[SAO_TYPE_EO_135] = SAO_on_block_eo_135_sse128;\r\n    }\r\n    if (cpuid & DAVS2_CPU_AVX2) {\r\n        fh->sao_block_bo                   = SAO_on_block_bo_avx2;\r\n        fh->sao_filter_eo[SAO_TYPE_EO_0]   = SAO_on_block_eo_0_avx2;\r\n        fh->sao_filter_eo[SAO_TYPE_EO_45]  = SAO_on_block_eo_45_avx2;\r\n        fh->sao_filter_eo[SAO_TYPE_EO_90]  = SAO_on_block_eo_90_avx2;\r\n        fh->sao_filter_eo[SAO_TYPE_EO_135] = SAO_on_block_eo_135_avx2;\r\n    }\r\n#endif\r\n}\r\n\r\n"
  },
  {
    "path": "source/common/sao.h",
    "content": "/*\r\n * sao.h\r\n *\r\n * Description of this file:\r\n *    SAO functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_SAO_H\r\n#define DAVS2_SAO_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#define sao_read_lcu_param FPFX(sao_read_lcu_param)\r\nvoid sao_read_lcu_param(davs2_t *h, int lcu_xy, bool_t *slice_sao_on, sao_t *sao_param);\r\n\r\n#define sao_lcu FPFX(sao_lcu)\r\nvoid sao_lcu(davs2_t *h, davs2_frame_t *p_tmp_frm, davs2_frame_t *p_dec_frm, int i_lcu_x, int i_lcu_y);\r\n#define sao_lcurow FPFX(sao_lcurow)\r\nvoid sao_lcurow(davs2_t *h, davs2_frame_t *p_tmp_frm, davs2_frame_t *p_dec_frm, int i_lcu_y);\r\n\r\n#define davs2_sao_init FPFX(sao_init)\r\nvoid davs2_sao_init(uint32_t cpuid, 
ao_funcs_t *fh);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_SAO_H\r\n"
  },
  {
    "path": "source/common/scantab.h",
    "content": "/*\r\n * scantab.h\r\n *\r\n * Description of this file:\r\n *    AVS2 scan tables of the davs2 library (this file is ONLY included by aec.c)\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n\r\n#ifndef DAVS2_SCAN_TAB_H\r\n#define DAVS2_SCAN_TAB_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * global variables (const tables)\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int16_t tab_scan_2x2[4][2] = {\r\n    { 0, 0 }, { 1, 0 }, { 0, 1 }, { 1, 1 }\r\n};\r\n\r\nstatic const int16_t tab_scan_4x4[16][2] = {\r\n    { 
0, 0 }, { 1, 0 }, { 0, 1 }, { 0, 2 },\r\n    { 1, 1 }, { 2, 0 }, { 3, 0 }, { 2, 1 },\r\n    { 1, 2 }, { 0, 3 }, { 1, 3 }, { 2, 2 },\r\n    { 3, 1 }, { 3, 2 }, { 2, 3 }, { 3, 3 }\r\n};\r\n\r\nstatic const int16_t tab_scan_8x8[64][2] = {\r\n    { 0, 0 }, { 1, 0 }, { 0, 1 }, { 0, 2 }, { 1, 1 }, { 2, 0 }, { 3, 0 }, { 2, 1 },\r\n    { 1, 2 }, { 0, 3 }, { 0, 4 }, { 1, 3 }, { 2, 2 }, { 3, 1 }, { 4, 0 }, { 5, 0 },\r\n    { 4, 1 }, { 3, 2 }, { 2, 3 }, { 1, 4 }, { 0, 5 }, { 0, 6 }, { 1, 5 }, { 2, 4 },\r\n    { 3, 3 }, { 4, 2 }, { 5, 1 }, { 6, 0 }, { 7, 0 }, { 6, 1 }, { 5, 2 }, { 4, 3 },\r\n    { 3, 4 }, { 2, 5 }, { 1, 6 }, { 0, 7 }, { 1, 7 }, { 2, 6 }, { 3, 5 }, { 4, 4 },\r\n    { 5, 3 }, { 6, 2 }, { 7, 1 }, { 7, 2 }, { 6, 3 }, { 5, 4 }, { 4, 5 }, { 3, 6 },\r\n    { 2, 7 }, { 3, 7 }, { 4, 6 }, { 5, 5 }, { 6, 4 }, { 7, 3 }, { 7, 4 }, { 6, 5 },\r\n    { 5, 6 }, { 4, 7 }, { 5, 7 }, { 6, 6 }, { 7, 5 }, { 7, 6 }, { 6, 7 }, { 7, 7 }\r\n};\r\n\r\nstatic const int16_t tab_scan_16x16[256][2] = {\r\n    {  0,  0}, {  1,  0}, {  0,  1}, {  0,  2}, {  1,  1}, {  2,  0}, {  3,  0}, {  2,  1},\r\n    {  1,  2}, {  0,  3}, {  0,  4}, {  1,  3}, {  2,  2}, {  3,  1}, {  4,  0}, {  5,  0},\r\n    {  4,  1}, {  3,  2}, {  2,  3}, {  1,  4}, {  0,  5}, {  0,  6}, {  1,  5}, {  2,  4},\r\n    {  3,  3}, {  4,  2}, {  5,  1}, {  6,  0}, {  7,  0}, {  6,  1}, {  5,  2}, {  4,  3},\r\n    {  3,  4}, {  2,  5}, {  1,  6}, {  0,  7}, {  0,  8}, {  1,  7}, {  2,  6}, {  3,  5},\r\n    {  4,  4}, {  5,  3}, {  6,  2}, {  7,  1}, {  8,  0}, {  9,  0}, {  8,  1}, {  7,  2},\r\n    {  6,  3}, {  5,  4}, {  4,  5}, {  3,  6}, {  2,  7}, {  1,  8}, {  0,  9}, {  0, 10},\r\n    {  1,  9}, {  2,  8}, {  3,  7}, {  4,  6}, {  5,  5}, {  6,  4}, {  7,  3}, {  8,  2},\r\n    {  9,  1}, { 10,  0}, { 11,  0}, { 10,  1}, {  9,  2}, {  8,  3}, {  7,  4}, {  6,  5},\r\n    {  5,  6}, {  4,  7}, {  3,  8}, {  2,  9}, {  1, 10}, {  0, 11}, {  0, 12}, {  1, 11},\r\n    {  2, 10}, {  3,  9}, {  4,  8}, {  5,  7}, {  6,  
6}, {  7,  5}, {  8,  4}, {  9,  3},\r\n    { 10,  2}, { 11,  1}, { 12,  0}, { 13,  0}, { 12,  1}, { 11,  2}, { 10,  3}, {  9,  4},\r\n    {  8,  5}, {  7,  6}, {  6,  7}, {  5,  8}, {  4,  9}, {  3, 10}, {  2, 11}, {  1, 12},\r\n    {  0, 13}, {  0, 14}, {  1, 13}, {  2, 12}, {  3, 11}, {  4, 10}, {  5,  9}, {  6,  8},\r\n    {  7,  7}, {  8,  6}, {  9,  5}, { 10,  4}, { 11,  3}, { 12,  2}, { 13,  1}, { 14,  0},\r\n    { 15,  0}, { 14,  1}, { 13,  2}, { 12,  3}, { 11,  4}, { 10,  5}, {  9,  6}, {  8,  7},\r\n    {  7,  8}, {  6,  9}, {  5, 10}, {  4, 11}, {  3, 12}, {  2, 13}, {  1, 14}, {  0, 15},\r\n    {  1, 15}, {  2, 14}, {  3, 13}, {  4, 12}, {  5, 11}, {  6, 10}, {  7,  9}, {  8,  8},\r\n    {  9,  7}, { 10,  6}, { 11,  5}, { 12,  4}, { 13,  3}, { 14,  2}, { 15,  1}, { 15,  2},\r\n    { 14,  3}, { 13,  4}, { 12,  5}, { 11,  6}, { 10,  7}, {  9,  8}, {  8,  9}, {  7, 10},\r\n    {  6, 11}, {  5, 12}, {  4, 13}, {  3, 14}, {  2, 15}, {  3, 15}, {  4, 14}, {  5, 13},\r\n    {  6, 12}, {  7, 11}, {  8, 10}, {  9,  9}, { 10,  8}, { 11,  7}, { 12,  6}, { 13,  5},\r\n    { 14,  4}, { 15,  3}, { 15,  4}, { 14,  5}, { 13,  6}, { 12,  7}, { 11,  8}, { 10,  9},\r\n    {  9, 10}, {  8, 11}, {  7, 12}, {  6, 13}, {  5, 14}, {  4, 15}, {  5, 15}, {  6, 14},\r\n    {  7, 13}, {  8, 12}, {  9, 11}, { 10, 10}, { 11,  9}, { 12,  8}, { 13,  7}, { 14,  6},\r\n    { 15,  5}, { 15,  6}, { 14,  7}, { 13,  8}, { 12,  9}, { 11, 10}, { 10, 11}, {  9, 12},\r\n    {  8, 13}, {  7, 14}, {  6, 15}, {  7, 15}, {  8, 14}, {  9, 13}, { 10, 12}, { 11, 11},\r\n    { 12, 10}, { 13,  9}, { 14,  8}, { 15,  7}, { 15,  8}, { 14,  9}, { 13, 10}, { 12, 11},\r\n    { 11, 12}, { 10, 13}, {  9, 14}, {  8, 15}, {  9, 15}, { 10, 14}, { 11, 13}, { 12, 12},\r\n    { 13, 11}, { 14, 10}, { 15,  9}, { 15, 10}, { 14, 11}, { 13, 12}, { 12, 13}, { 11, 14},\r\n    { 10, 15}, { 11, 15}, { 12, 14}, { 13, 13}, { 14, 12}, { 15, 11}, { 15, 12}, { 14, 13},\r\n    { 13, 14}, { 12, 15}, { 13, 15}, { 14, 14}, { 15, 13}, 
{ 15, 14}, { 14, 15}, { 15, 15}\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int16_t tab_scan_1x4[4][2] = {\r\n    { 0, 0 }, { 1, 0 }, { 2, 0 }, { 3, 0 },\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int16_t tab_scan_4x1[4][2] = {\r\n    { 0, 0 }, { 0, 1 }, { 0, 2 }, { 0, 3 },\r\n};\r\n\r\nstatic const int16_t tab_scan_2x8[16][2] = {\r\n    { 0, 0 }, { 1, 0 }, { 0, 1 }, { 1, 1 }, { 2, 0 }, { 3, 0 }, { 2, 1 }, { 3, 1 },\r\n    { 4, 0 }, { 5, 0 }, { 4, 1 }, { 5, 1 }, { 6, 0 }, { 7, 0 }, { 6, 1 }, { 7, 1 }\r\n};\r\n\r\nstatic const int16_t tab_scan_8x2[16][2] = {\r\n    { 0, 0 }, { 1, 0 },\r\n    { 0, 1 }, { 0, 2 },\r\n    { 1, 1 }, { 1, 2 },\r\n    { 0, 3 }, { 0, 4 },\r\n    { 1, 3 }, { 1, 4 },\r\n    { 0, 5 }, { 0, 6 },\r\n    { 1, 5 }, { 1, 6 },\r\n    { 0, 7 }, { 1, 7 }\r\n};\r\n\r\nstatic const int16_t tab_scan_4x16[64][2] = {\r\n    {  0,  0}, {  1,  0}, {  0,  1}, {  0,  2},\r\n    {  1,  1}, {  2,  0}, {  3,  0}, {  2,  1},\r\n    {  1,  2}, {  0,  3}, {  1,  3}, {  2,  2},\r\n    {  3,  1}, {  3,  2}, {  2,  3}, {  3,  3},\r\n    {  4,  0}, {  5,  0}, {  4,  1}, {  4,  2},\r\n    {  5,  1}, {  6,  0}, {  7,  0}, {  6,  1},\r\n    {  5,  2}, {  4,  3}, {  5,  3}, {  6,  2},\r\n    {  7,  1}, {  7,  2}, {  6,  3}, {  7,  3},\r\n    {  8,  0}, {  9,  0}, {  8,  1}, {  8,  2},\r\n    {  9,  1}, { 10,  0}, { 11,  0}, { 10,  1},\r\n    {  9,  2}, {  8,  3}, {  9,  3}, { 10,  2},\r\n    { 11,  1}, { 11,  2}, { 10,  3}, { 11,  3},\r\n    { 12,  0}, { 13,  0}, { 12,  1}, { 12,  2},\r\n    { 13,  1}, { 14,  0}, { 15,  0}, { 14,  1},\r\n    { 13,  2}, { 12,  3}, { 13,  3}, { 14,  2},\r\n    { 15,  1}, { 15,  2}, { 14,  3}, { 15,  3}\r\n};\r\n\r\nstatic const int16_t tab_scan_16x4[64][2] = {\r\n    {  0,  0}, {  1,  0}, {  0,  1}, {  0,  2},\r\n    {  1,  1}, {  2,  0}, {  3,  0}, {  2,  1},\r\n    {  1,  2}, {  0,  3}, {  1,  3}, {  
2,  2},\r\n    {  3,  1}, {  3,  2}, {  2,  3}, {  3,  3},\r\n    {  0,  4}, {  1,  4}, {  0,  5}, {  0,  6},\r\n    {  1,  5}, {  2,  4}, {  3,  4}, {  2,  5},\r\n    {  1,  6}, {  0,  7}, {  1,  7}, {  2,  6},\r\n    {  3,  5}, {  3,  6}, {  2,  7}, {  3,  7},\r\n    {  0,  8}, {  1,  8}, {  0,  9}, {  0, 10},\r\n    {  1,  9}, {  2,  8}, {  3,  8}, {  2,  9},\r\n    {  1, 10}, {  0, 11}, {  1, 11}, {  2, 10},\r\n    {  3,  9}, {  3, 10}, {  2, 11}, {  3, 11},\r\n    {  0, 12}, {  1, 12}, {  0, 13}, {  0, 14},\r\n    {  1, 13}, {  2, 12}, {  3, 12}, {  2, 13},\r\n    {  1, 14}, {  0, 15}, {  1, 15}, {  2, 14},\r\n    {  3, 13}, {  3, 14}, {  2, 15}, {  3, 15}\r\n};\r\n\r\nstatic const int16_t tab_scan_8x32[256][2] = {\r\n    {  0,  0}, {  1,  0}, {  0,  1}, {  0,  2}, {  1,  1}, {  2,  0}, {  3,  0}, {  2,  1},\r\n    {  1,  2}, {  0,  3}, {  1,  3}, {  2,  2}, {  3,  1}, {  3,  2}, {  2,  3}, {  3,  3},\r\n    {  4,  0}, {  5,  0}, {  4,  1}, {  4,  2}, {  5,  1}, {  6,  0}, {  7,  0}, {  6,  1},\r\n    {  5,  2}, {  4,  3}, {  5,  3}, {  6,  2}, {  7,  1}, {  7,  2}, {  6,  3}, {  7,  3},\r\n    {  0,  4}, {  1,  4}, {  0,  5}, {  0,  6}, {  1,  5}, {  2,  4}, {  3,  4}, {  2,  5},\r\n    {  1,  6}, {  0,  7}, {  1,  7}, {  2,  6}, {  3,  5}, {  3,  6}, {  2,  7}, {  3,  7},\r\n    {  4,  4}, {  5,  4}, {  4,  5}, {  4,  6}, {  5,  5}, {  6,  4}, {  7,  4}, {  6,  5},\r\n    {  5,  6}, {  4,  7}, {  5,  7}, {  6,  6}, {  7,  5}, {  7,  6}, {  6,  7}, {  7,  7},\r\n    {  8,  0}, {  9,  0}, {  8,  1}, {  8,  2}, {  9,  1}, { 10,  0}, { 11,  0}, { 10,  1},\r\n    {  9,  2}, {  8,  3}, {  9,  3}, { 10,  2}, { 11,  1}, { 11,  2}, { 10,  3}, { 11,  3},\r\n    { 12,  0}, { 13,  0}, { 12,  1}, { 12,  2}, { 13,  1}, { 14,  0}, { 15,  0}, { 14,  1},\r\n    { 13,  2}, { 12,  3}, { 13,  3}, { 14,  2}, { 15,  1}, { 15,  2}, { 14,  3}, { 15,  3},\r\n    {  8,  4}, {  9,  4}, {  8,  5}, {  8,  6}, {  9,  5}, { 10,  4}, { 11,  4}, { 10,  5},\r\n    {  9,  6}, {  8,  7}, {  9, 
 7}, { 10,  6}, { 11,  5}, { 11,  6}, { 10,  7}, { 11,  7},\r\n    { 12,  4}, { 13,  4}, { 12,  5}, { 12,  6}, { 13,  5}, { 14,  4}, { 15,  4}, { 14,  5},\r\n    { 13,  6}, { 12,  7}, { 13,  7}, { 14,  6}, { 15,  5}, { 15,  6}, { 14,  7}, { 15,  7},\r\n    { 16,  0}, { 17,  0}, { 16,  1}, { 16,  2}, { 17,  1}, { 18,  0}, { 19,  0}, { 18,  1},\r\n    { 17,  2}, { 16,  3}, { 17,  3}, { 18,  2}, { 19,  1}, { 19,  2}, { 18,  3}, { 19,  3},\r\n    { 20,  0}, { 21,  0}, { 20,  1}, { 20,  2}, { 21,  1}, { 22,  0}, { 23,  0}, { 22,  1},\r\n    { 21,  2}, { 20,  3}, { 21,  3}, { 22,  2}, { 23,  1}, { 23,  2}, { 22,  3}, { 23,  3},\r\n    { 16,  4}, { 17,  4}, { 16,  5}, { 16,  6}, { 17,  5}, { 18,  4}, { 19,  4}, { 18,  5},\r\n    { 17,  6}, { 16,  7}, { 17,  7}, { 18,  6}, { 19,  5}, { 19,  6}, { 18,  7}, { 19,  7},\r\n    { 20,  4}, { 21,  4}, { 20,  5}, { 20,  6}, { 21,  5}, { 22,  4}, { 23,  4}, { 22,  5},\r\n    { 21,  6}, { 20,  7}, { 21,  7}, { 22,  6}, { 23,  5}, { 23,  6}, { 22,  7}, { 23,  7},\r\n    { 24,  0}, { 25,  0}, { 24,  1}, { 24,  2}, { 25,  1}, { 26,  0}, { 27,  0}, { 26,  1},\r\n    { 25,  2}, { 24,  3}, { 25,  3}, { 26,  2}, { 27,  1}, { 27,  2}, { 26,  3}, { 27,  3},\r\n    { 28,  0}, { 29,  0}, { 28,  1}, { 28,  2}, { 29,  1}, { 30,  0}, { 31,  0}, { 30,  1},\r\n    { 29,  2}, { 28,  3}, { 29,  3}, { 30,  2}, { 31,  1}, { 31,  2}, { 30,  3}, { 31,  3},\r\n    { 24,  4}, { 25,  4}, { 24,  5}, { 24,  6}, { 25,  5}, { 26,  4}, { 27,  4}, { 26,  5},\r\n    { 25,  6}, { 24,  7}, { 25,  7}, { 26,  6}, { 27,  5}, { 27,  6}, { 26,  7}, { 27,  7},\r\n    { 28,  4}, { 29,  4}, { 28,  5}, { 28,  6}, { 29,  5}, { 30,  4}, { 31,  4}, { 30,  5},\r\n    { 29,  6}, { 28,  7}, { 29,  7}, { 30,  6}, { 31,  5}, { 31,  6}, { 30,  7}, { 31,  7}\r\n};\r\n\r\nstatic const int16_t tab_scan_32x8[256][2] = {\r\n    {  0,  0}, {  1,  0}, {  0,  1}, {  0,  2}, {  1,  1}, {  2,  0}, {  3,  0}, {  2,  1},\r\n    {  1,  2}, {  0,  3}, {  1,  3}, {  2,  2}, {  3,  1}, {  3,  2}, {  
2,  3}, {  3,  3},\r\n    {  4,  0}, {  5,  0}, {  4,  1}, {  4,  2}, {  5,  1}, {  6,  0}, {  7,  0}, {  6,  1},\r\n    {  5,  2}, {  4,  3}, {  5,  3}, {  6,  2}, {  7,  1}, {  7,  2}, {  6,  3}, {  7,  3},\r\n    {  0,  4}, {  1,  4}, {  0,  5}, {  0,  6}, {  1,  5}, {  2,  4}, {  3,  4}, {  2,  5},\r\n    {  1,  6}, {  0,  7}, {  1,  7}, {  2,  6}, {  3,  5}, {  3,  6}, {  2,  7}, {  3,  7},\r\n    {  0,  8}, {  1,  8}, {  0,  9}, {  0, 10}, {  1,  9}, {  2,  8}, {  3,  8}, {  2,  9},\r\n    {  1, 10}, {  0, 11}, {  1, 11}, {  2, 10}, {  3,  9}, {  3, 10}, {  2, 11}, {  3, 11},\r\n    {  4,  4}, {  5,  4}, {  4,  5}, {  4,  6}, {  5,  5}, {  6,  4}, {  7,  4}, {  6,  5},\r\n    {  5,  6}, {  4,  7}, {  5,  7}, {  6,  6}, {  7,  5}, {  7,  6}, {  6,  7}, {  7,  7},\r\n    {  4,  8}, {  5,  8}, {  4,  9}, {  4, 10}, {  5,  9}, {  6,  8}, {  7,  8}, {  6,  9},\r\n    {  5, 10}, {  4, 11}, {  5, 11}, {  6, 10}, {  7,  9}, {  7, 10}, {  6, 11}, {  7, 11},\r\n    {  0, 12}, {  1, 12}, {  0, 13}, {  0, 14}, {  1, 13}, {  2, 12}, {  3, 12}, {  2, 13},\r\n    {  1, 14}, {  0, 15}, {  1, 15}, {  2, 14}, {  3, 13}, {  3, 14}, {  2, 15}, {  3, 15},\r\n    {  0, 16}, {  1, 16}, {  0, 17}, {  0, 18}, {  1, 17}, {  2, 16}, {  3, 16}, {  2, 17},\r\n    {  1, 18}, {  0, 19}, {  1, 19}, {  2, 18}, {  3, 17}, {  3, 18}, {  2, 19}, {  3, 19},\r\n    {  4, 12}, {  5, 12}, {  4, 13}, {  4, 14}, {  5, 13}, {  6, 12}, {  7, 12}, {  6, 13},\r\n    {  5, 14}, {  4, 15}, {  5, 15}, {  6, 14}, {  7, 13}, {  7, 14}, {  6, 15}, {  7, 15},\r\n    {  4, 16}, {  5, 16}, {  4, 17}, {  4, 18}, {  5, 17}, {  6, 16}, {  7, 16}, {  6, 17},\r\n    {  5, 18}, {  4, 19}, {  5, 19}, {  6, 18}, {  7, 17}, {  7, 18}, {  6, 19}, {  7, 19},\r\n    {  0, 20}, {  1, 20}, {  0, 21}, {  0, 22}, {  1, 21}, {  2, 20}, {  3, 20}, {  2, 21},\r\n    {  1, 22}, {  0, 23}, {  1, 23}, {  2, 22}, {  3, 21}, {  3, 22}, {  2, 23}, {  3, 23},\r\n    {  0, 24}, {  1, 24}, {  0, 25}, {  0, 26}, {  1, 25}, {  2, 24}, {  3, 
24}, {  2, 25},\r\n    {  1, 26}, {  0, 27}, {  1, 27}, {  2, 26}, {  3, 25}, {  3, 26}, {  2, 27}, {  3, 27},\r\n    {  4, 20}, {  5, 20}, {  4, 21}, {  4, 22}, {  5, 21}, {  6, 20}, {  7, 20}, {  6, 21},\r\n    {  5, 22}, {  4, 23}, {  5, 23}, {  6, 22}, {  7, 21}, {  7, 22}, {  6, 23}, {  7, 23},\r\n    {  4, 24}, {  5, 24}, {  4, 25}, {  4, 26}, {  5, 25}, {  6, 24}, {  7, 24}, {  6, 25},\r\n    {  5, 26}, {  4, 27}, {  5, 27}, {  6, 26}, {  7, 25}, {  7, 26}, {  6, 27}, {  7, 27},\r\n    {  0, 28}, {  1, 28}, {  0, 29}, {  0, 30}, {  1, 29}, {  2, 28}, {  3, 28}, {  2, 29},\r\n    {  1, 30}, {  0, 31}, {  1, 31}, {  2, 30}, {  3, 29}, {  3, 30}, {  2, 31}, {  3, 31},\r\n    {  4, 28}, {  5, 28}, {  4, 29}, {  4, 30}, {  5, 29}, {  6, 28}, {  7, 28}, {  6, 29},\r\n    {  5, 30}, {  4, 31}, {  5, 31}, {  6, 30}, {  7, 29}, {  7, 30}, {  6, 31}, {  7, 31}\r\n};\r\n\r\nstatic const int16_t tab_scan_cg_8x8[64][2] = {\r\n    { 0, 0 }, { 1, 0 }, { 0, 1 }, { 0, 2 }, { 1, 1 }, { 2, 0 }, { 3, 0 }, { 2, 1 },\r\n    { 1, 2 }, { 0, 3 }, { 1, 3 }, { 2, 2 }, { 3, 1 }, { 3, 2 }, { 2, 3 }, { 3, 3 },\r\n    { 4, 0 }, { 5, 0 }, { 4, 1 }, { 4, 2 }, { 5, 1 }, { 6, 0 }, { 7, 0 }, { 6, 1 },\r\n    { 5, 2 }, { 4, 3 }, { 5, 3 }, { 6, 2 }, { 7, 1 }, { 7, 2 }, { 6, 3 }, { 7, 3 },\r\n    { 0, 4 }, { 1, 4 }, { 0, 5 }, { 0, 6 }, { 1, 5 }, { 2, 4 }, { 3, 4 }, { 2, 5 },\r\n    { 1, 6 }, { 0, 7 }, { 1, 7 }, { 2, 6 }, { 3, 5 }, { 3, 6 }, { 2, 7 }, { 3, 7 },\r\n    { 4, 4 }, { 5, 4 }, { 4, 5 }, { 4, 6 }, { 5, 5 }, { 6, 4 }, { 7, 4 }, { 6, 5 },\r\n    { 5, 6 }, { 4, 7 }, { 5, 7 }, { 6, 6 }, { 7, 5 }, { 7, 6 }, { 6, 7 }, { 7, 7 }\r\n};\r\n\r\nstatic const int16_t tab_scan_cg_16x16[256][2] = {\r\n    {  0,  0}, {  1,  0}, {  0,  1}, {  0,  2}, {  1,  1}, {  2,  0}, {  3,  0}, {  2,  1},\r\n    {  1,  2}, {  0,  3}, {  1,  3}, {  2,  2}, {  3,  1}, {  3,  2}, {  2,  3}, {  3,  3},\r\n    {  4,  0}, {  5,  0}, {  4,  1}, {  4,  2}, {  5,  1}, {  6,  0}, {  7,  0}, {  6,  1},\r\n    {  5,  2}, {  4,  
3}, {  5,  3}, {  6,  2}, {  7,  1}, {  7,  2}, {  6,  3}, {  7,  3},\r\n    {  0,  4}, {  1,  4}, {  0,  5}, {  0,  6}, {  1,  5}, {  2,  4}, {  3,  4}, {  2,  5},\r\n    {  1,  6}, {  0,  7}, {  1,  7}, {  2,  6}, {  3,  5}, {  3,  6}, {  2,  7}, {  3,  7},\r\n    {  0,  8}, {  1,  8}, {  0,  9}, {  0, 10}, {  1,  9}, {  2,  8}, {  3,  8}, {  2,  9},\r\n    {  1, 10}, {  0, 11}, {  1, 11}, {  2, 10}, {  3,  9}, {  3, 10}, {  2, 11}, {  3, 11},\r\n    {  4,  4}, {  5,  4}, {  4,  5}, {  4,  6}, {  5,  5}, {  6,  4}, {  7,  4}, {  6,  5},\r\n    {  5,  6}, {  4,  7}, {  5,  7}, {  6,  6}, {  7,  5}, {  7,  6}, {  6,  7}, {  7,  7},\r\n    {  8,  0}, {  9,  0}, {  8,  1}, {  8,  2}, {  9,  1}, { 10,  0}, { 11,  0}, { 10,  1},\r\n    {  9,  2}, {  8,  3}, {  9,  3}, { 10,  2}, { 11,  1}, { 11,  2}, { 10,  3}, { 11,  3},\r\n    { 12,  0}, { 13,  0}, { 12,  1}, { 12,  2}, { 13,  1}, { 14,  0}, { 15,  0}, { 14,  1},\r\n    { 13,  2}, { 12,  3}, { 13,  3}, { 14,  2}, { 15,  1}, { 15,  2}, { 14,  3}, { 15,  3},\r\n    {  8,  4}, {  9,  4}, {  8,  5}, {  8,  6}, {  9,  5}, { 10,  4}, { 11,  4}, { 10,  5},\r\n    {  9,  6}, {  8,  7}, {  9,  7}, { 10,  6}, { 11,  5}, { 11,  6}, { 10,  7}, { 11,  7},\r\n    {  4,  8}, {  5,  8}, {  4,  9}, {  4, 10}, {  5,  9}, {  6,  8}, {  7,  8}, {  6,  9},\r\n    {  5, 10}, {  4, 11}, {  5, 11}, {  6, 10}, {  7,  9}, {  7, 10}, {  6, 11}, {  7, 11},\r\n    {  0, 12}, {  1, 12}, {  0, 13}, {  0, 14}, {  1, 13}, {  2, 12}, {  3, 12}, {  2, 13},\r\n    {  1, 14}, {  0, 15}, {  1, 15}, {  2, 14}, {  3, 13}, {  3, 14}, {  2, 15}, {  3, 15},\r\n    {  4, 12}, {  5, 12}, {  4, 13}, {  4, 14}, {  5, 13}, {  6, 12}, {  7, 12}, {  6, 13},\r\n    {  5, 14}, {  4, 15}, {  5, 15}, {  6, 14}, {  7, 13}, {  7, 14}, {  6, 15}, {  7, 15},\r\n    {  8,  8}, {  9,  8}, {  8,  9}, {  8, 10}, {  9,  9}, { 10,  8}, { 11,  8}, { 10,  9},\r\n    {  9, 10}, {  8, 11}, {  9, 11}, { 10, 10}, { 11,  9}, { 11, 10}, { 10, 11}, { 11, 11},\r\n    { 12,  4}, { 13,  4}, 
{ 12,  5}, { 12,  6}, { 13,  5}, { 14,  4}, { 15,  4}, { 14,  5},\r\n    { 13,  6}, { 12,  7}, { 13,  7}, { 14,  6}, { 15,  5}, { 15,  6}, { 14,  7}, { 15,  7},\r\n    { 12,  8}, { 13,  8}, { 12,  9}, { 12, 10}, { 13,  9}, { 14,  8}, { 15,  8}, { 14,  9},\r\n    { 13, 10}, { 12, 11}, { 13, 11}, { 14, 10}, { 15,  9}, { 15, 10}, { 14, 11}, { 15, 11},\r\n    {  8, 12}, {  9, 12}, {  8, 13}, {  8, 14}, {  9, 13}, { 10, 12}, { 11, 12}, { 10, 13},\r\n    {  9, 14}, {  8, 15}, {  9, 15}, { 10, 14}, { 11, 13}, { 11, 14}, { 10, 15}, { 11, 15},\r\n    { 12, 12}, { 13, 12}, { 12, 13}, { 12, 14}, { 13, 13}, { 14, 12}, { 15, 12}, { 14, 13},\r\n    { 13, 14}, { 12, 15}, { 13, 15}, { 14, 14}, { 15, 13}, { 15, 14}, { 14, 15}, { 15, 15}\r\n};\r\n\r\nstatic const int16_t tab_scan_cg_32x32[1024][2] = {\r\n    {  0,  0}, {  1,  0}, {  0,  1}, {  0,  2}, {  1,  1}, {  2,  0}, {  3,  0}, {  2,  1},\r\n    {  1,  2}, {  0,  3}, {  1,  3}, {  2,  2}, {  3,  1}, {  3,  2}, {  2,  3}, {  3,  3},\r\n    {  4,  0}, {  5,  0}, {  4,  1}, {  4,  2}, {  5,  1}, {  6,  0}, {  7,  0}, {  6,  1},\r\n    {  5,  2}, {  4,  3}, {  5,  3}, {  6,  2}, {  7,  1}, {  7,  2}, {  6,  3}, {  7,  3},\r\n    {  0,  4}, {  1,  4}, {  0,  5}, {  0,  6}, {  1,  5}, {  2,  4}, {  3,  4}, {  2,  5},\r\n    {  1,  6}, {  0,  7}, {  1,  7}, {  2,  6}, {  3,  5}, {  3,  6}, {  2,  7}, {  3,  7},\r\n    {  0,  8}, {  1,  8}, {  0,  9}, {  0, 10}, {  1,  9}, {  2,  8}, {  3,  8}, {  2,  9},\r\n    {  1, 10}, {  0, 11}, {  1, 11}, {  2, 10}, {  3,  9}, {  3, 10}, {  2, 11}, {  3, 11},\r\n    {  4,  4}, {  5,  4}, {  4,  5}, {  4,  6}, {  5,  5}, {  6,  4}, {  7,  4}, {  6,  5},\r\n    {  5,  6}, {  4,  7}, {  5,  7}, {  6,  6}, {  7,  5}, {  7,  6}, {  6,  7}, {  7,  7},\r\n    {  8,  0}, {  9,  0}, {  8,  1}, {  8,  2}, {  9,  1}, { 10,  0}, { 11,  0}, { 10,  1},\r\n    {  9,  2}, {  8,  3}, {  9,  3}, { 10,  2}, { 11,  1}, { 11,  2}, { 10,  3}, { 11,  3},\r\n    { 12,  0}, { 13,  0}, { 12,  1}, { 12,  2}, { 13,  1}, { 
14,  0}, { 15,  0}, { 14,  1},\r\n    { 13,  2}, { 12,  3}, { 13,  3}, { 14,  2}, { 15,  1}, { 15,  2}, { 14,  3}, { 15,  3},\r\n    {  8,  4}, {  9,  4}, {  8,  5}, {  8,  6}, {  9,  5}, { 10,  4}, { 11,  4}, { 10,  5},\r\n    {  9,  6}, {  8,  7}, {  9,  7}, { 10,  6}, { 11,  5}, { 11,  6}, { 10,  7}, { 11,  7},\r\n    {  4,  8}, {  5,  8}, {  4,  9}, {  4, 10}, {  5,  9}, {  6,  8}, {  7,  8}, {  6,  9},\r\n    {  5, 10}, {  4, 11}, {  5, 11}, {  6, 10}, {  7,  9}, {  7, 10}, {  6, 11}, {  7, 11},\r\n    {  0, 12}, {  1, 12}, {  0, 13}, {  0, 14}, {  1, 13}, {  2, 12}, {  3, 12}, {  2, 13},\r\n    {  1, 14}, {  0, 15}, {  1, 15}, {  2, 14}, {  3, 13}, {  3, 14}, {  2, 15}, {  3, 15},\r\n    {  0, 16}, {  1, 16}, {  0, 17}, {  0, 18}, {  1, 17}, {  2, 16}, {  3, 16}, {  2, 17},\r\n    {  1, 18}, {  0, 19}, {  1, 19}, {  2, 18}, {  3, 17}, {  3, 18}, {  2, 19}, {  3, 19},\r\n    {  4, 12}, {  5, 12}, {  4, 13}, {  4, 14}, {  5, 13}, {  6, 12}, {  7, 12}, {  6, 13},\r\n    {  5, 14}, {  4, 15}, {  5, 15}, {  6, 14}, {  7, 13}, {  7, 14}, {  6, 15}, {  7, 15},\r\n    {  8,  8}, {  9,  8}, {  8,  9}, {  8, 10}, {  9,  9}, { 10,  8}, { 11,  8}, { 10,  9},\r\n    {  9, 10}, {  8, 11}, {  9, 11}, { 10, 10}, { 11,  9}, { 11, 10}, { 10, 11}, { 11, 11},\r\n    { 12,  4}, { 13,  4}, { 12,  5}, { 12,  6}, { 13,  5}, { 14,  4}, { 15,  4}, { 14,  5},\r\n    { 13,  6}, { 12,  7}, { 13,  7}, { 14,  6}, { 15,  5}, { 15,  6}, { 14,  7}, { 15,  7},\r\n    { 16,  0}, { 17,  0}, { 16,  1}, { 16,  2}, { 17,  1}, { 18,  0}, { 19,  0}, { 18,  1},\r\n    { 17,  2}, { 16,  3}, { 17,  3}, { 18,  2}, { 19,  1}, { 19,  2}, { 18,  3}, { 19,  3},\r\n    { 20,  0}, { 21,  0}, { 20,  1}, { 20,  2}, { 21,  1}, { 22,  0}, { 23,  0}, { 22,  1},\r\n    { 21,  2}, { 20,  3}, { 21,  3}, { 22,  2}, { 23,  1}, { 23,  2}, { 22,  3}, { 23,  3},\r\n    { 16,  4}, { 17,  4}, { 16,  5}, { 16,  6}, { 17,  5}, { 18,  4}, { 19,  4}, { 18,  5},\r\n    { 17,  6}, { 16,  7}, { 17,  7}, { 18,  6}, { 19,  5}, { 19,  
6}, { 18,  7}, { 19,  7},\r\n    { 12,  8}, { 13,  8}, { 12,  9}, { 12, 10}, { 13,  9}, { 14,  8}, { 15,  8}, { 14,  9},\r\n    { 13, 10}, { 12, 11}, { 13, 11}, { 14, 10}, { 15,  9}, { 15, 10}, { 14, 11}, { 15, 11},\r\n    {  8, 12}, {  9, 12}, {  8, 13}, {  8, 14}, {  9, 13}, { 10, 12}, { 11, 12}, { 10, 13},\r\n    {  9, 14}, {  8, 15}, {  9, 15}, { 10, 14}, { 11, 13}, { 11, 14}, { 10, 15}, { 11, 15},\r\n    {  4, 16}, {  5, 16}, {  4, 17}, {  4, 18}, {  5, 17}, {  6, 16}, {  7, 16}, {  6, 17},\r\n    {  5, 18}, {  4, 19}, {  5, 19}, {  6, 18}, {  7, 17}, {  7, 18}, {  6, 19}, {  7, 19},\r\n    {  0, 20}, {  1, 20}, {  0, 21}, {  0, 22}, {  1, 21}, {  2, 20}, {  3, 20}, {  2, 21},\r\n    {  1, 22}, {  0, 23}, {  1, 23}, {  2, 22}, {  3, 21}, {  3, 22}, {  2, 23}, {  3, 23},\r\n    {  0, 24}, {  1, 24}, {  0, 25}, {  0, 26}, {  1, 25}, {  2, 24}, {  3, 24}, {  2, 25},\r\n    {  1, 26}, {  0, 27}, {  1, 27}, {  2, 26}, {  3, 25}, {  3, 26}, {  2, 27}, {  3, 27},\r\n    {  4, 20}, {  5, 20}, {  4, 21}, {  4, 22}, {  5, 21}, {  6, 20}, {  7, 20}, {  6, 21},\r\n    {  5, 22}, {  4, 23}, {  5, 23}, {  6, 22}, {  7, 21}, {  7, 22}, {  6, 23}, {  7, 23},\r\n    {  8, 16}, {  9, 16}, {  8, 17}, {  8, 18}, {  9, 17}, { 10, 16}, { 11, 16}, { 10, 17},\r\n    {  9, 18}, {  8, 19}, {  9, 19}, { 10, 18}, { 11, 17}, { 11, 18}, { 10, 19}, { 11, 19},\r\n    { 12, 12}, { 13, 12}, { 12, 13}, { 12, 14}, { 13, 13}, { 14, 12}, { 15, 12}, { 14, 13},\r\n    { 13, 14}, { 12, 15}, { 13, 15}, { 14, 14}, { 15, 13}, { 15, 14}, { 14, 15}, { 15, 15},\r\n    { 16,  8}, { 17,  8}, { 16,  9}, { 16, 10}, { 17,  9}, { 18,  8}, { 19,  8}, { 18,  9},\r\n    { 17, 10}, { 16, 11}, { 17, 11}, { 18, 10}, { 19,  9}, { 19, 10}, { 18, 11}, { 19, 11},\r\n    { 20,  4}, { 21,  4}, { 20,  5}, { 20,  6}, { 21,  5}, { 22,  4}, { 23,  4}, { 22,  5},\r\n    { 21,  6}, { 20,  7}, { 21,  7}, { 22,  6}, { 23,  5}, { 23,  6}, { 22,  7}, { 23,  7},\r\n    { 24,  0}, { 25,  0}, { 24,  1}, { 24,  2}, { 25,  1}, { 26,  0}, 
{ 27,  0}, { 26,  1},\r\n    { 25,  2}, { 24,  3}, { 25,  3}, { 26,  2}, { 27,  1}, { 27,  2}, { 26,  3}, { 27,  3},\r\n    { 28,  0}, { 29,  0}, { 28,  1}, { 28,  2}, { 29,  1}, { 30,  0}, { 31,  0}, { 30,  1},\r\n    { 29,  2}, { 28,  3}, { 29,  3}, { 30,  2}, { 31,  1}, { 31,  2}, { 30,  3}, { 31,  3},\r\n    { 24,  4}, { 25,  4}, { 24,  5}, { 24,  6}, { 25,  5}, { 26,  4}, { 27,  4}, { 26,  5},\r\n    { 25,  6}, { 24,  7}, { 25,  7}, { 26,  6}, { 27,  5}, { 27,  6}, { 26,  7}, { 27,  7},\r\n    { 20,  8}, { 21,  8}, { 20,  9}, { 20, 10}, { 21,  9}, { 22,  8}, { 23,  8}, { 22,  9},\r\n    { 21, 10}, { 20, 11}, { 21, 11}, { 22, 10}, { 23,  9}, { 23, 10}, { 22, 11}, { 23, 11},\r\n    { 16, 12}, { 17, 12}, { 16, 13}, { 16, 14}, { 17, 13}, { 18, 12}, { 19, 12}, { 18, 13},\r\n    { 17, 14}, { 16, 15}, { 17, 15}, { 18, 14}, { 19, 13}, { 19, 14}, { 18, 15}, { 19, 15},\r\n    { 12, 16}, { 13, 16}, { 12, 17}, { 12, 18}, { 13, 17}, { 14, 16}, { 15, 16}, { 14, 17},\r\n    { 13, 18}, { 12, 19}, { 13, 19}, { 14, 18}, { 15, 17}, { 15, 18}, { 14, 19}, { 15, 19},\r\n    {  8, 20}, {  9, 20}, {  8, 21}, {  8, 22}, {  9, 21}, { 10, 20}, { 11, 20}, { 10, 21},\r\n    {  9, 22}, {  8, 23}, {  9, 23}, { 10, 22}, { 11, 21}, { 11, 22}, { 10, 23}, { 11, 23},\r\n    {  4, 24}, {  5, 24}, {  4, 25}, {  4, 26}, {  5, 25}, {  6, 24}, {  7, 24}, {  6, 25},\r\n    {  5, 26}, {  4, 27}, {  5, 27}, {  6, 26}, {  7, 25}, {  7, 26}, {  6, 27}, {  7, 27},\r\n    {  0, 28}, {  1, 28}, {  0, 29}, {  0, 30}, {  1, 29}, {  2, 28}, {  3, 28}, {  2, 29},\r\n    {  1, 30}, {  0, 31}, {  1, 31}, {  2, 30}, {  3, 29}, {  3, 30}, {  2, 31}, {  3, 31},\r\n    {  4, 28}, {  5, 28}, {  4, 29}, {  4, 30}, {  5, 29}, {  6, 28}, {  7, 28}, {  6, 29},\r\n    {  5, 30}, {  4, 31}, {  5, 31}, {  6, 30}, {  7, 29}, {  7, 30}, {  6, 31}, {  7, 31},\r\n    {  8, 24}, {  9, 24}, {  8, 25}, {  8, 26}, {  9, 25}, { 10, 24}, { 11, 24}, { 10, 25},\r\n    {  9, 26}, {  8, 27}, {  9, 27}, { 10, 26}, { 11, 25}, { 11, 26}, { 
10, 27}, { 11, 27},\r\n    { 12, 20}, { 13, 20}, { 12, 21}, { 12, 22}, { 13, 21}, { 14, 20}, { 15, 20}, { 14, 21},\r\n    { 13, 22}, { 12, 23}, { 13, 23}, { 14, 22}, { 15, 21}, { 15, 22}, { 14, 23}, { 15, 23},\r\n    { 16, 16}, { 17, 16}, { 16, 17}, { 16, 18}, { 17, 17}, { 18, 16}, { 19, 16}, { 18, 17},\r\n    { 17, 18}, { 16, 19}, { 17, 19}, { 18, 18}, { 19, 17}, { 19, 18}, { 18, 19}, { 19, 19},\r\n    { 20, 12}, { 21, 12}, { 20, 13}, { 20, 14}, { 21, 13}, { 22, 12}, { 23, 12}, { 22, 13},\r\n    { 21, 14}, { 20, 15}, { 21, 15}, { 22, 14}, { 23, 13}, { 23, 14}, { 22, 15}, { 23, 15},\r\n    { 24,  8}, { 25,  8}, { 24,  9}, { 24, 10}, { 25,  9}, { 26,  8}, { 27,  8}, { 26,  9},\r\n    { 25, 10}, { 24, 11}, { 25, 11}, { 26, 10}, { 27,  9}, { 27, 10}, { 26, 11}, { 27, 11},\r\n    { 28,  4}, { 29,  4}, { 28,  5}, { 28,  6}, { 29,  5}, { 30,  4}, { 31,  4}, { 30,  5},\r\n    { 29,  6}, { 28,  7}, { 29,  7}, { 30,  6}, { 31,  5}, { 31,  6}, { 30,  7}, { 31,  7},\r\n    { 28,  8}, { 29,  8}, { 28,  9}, { 28, 10}, { 29,  9}, { 30,  8}, { 31,  8}, { 30,  9},\r\n    { 29, 10}, { 28, 11}, { 29, 11}, { 30, 10}, { 31,  9}, { 31, 10}, { 30, 11}, { 31, 11},\r\n    { 24, 12}, { 25, 12}, { 24, 13}, { 24, 14}, { 25, 13}, { 26, 12}, { 27, 12}, { 26, 13},\r\n    { 25, 14}, { 24, 15}, { 25, 15}, { 26, 14}, { 27, 13}, { 27, 14}, { 26, 15}, { 27, 15},\r\n    { 20, 16}, { 21, 16}, { 20, 17}, { 20, 18}, { 21, 17}, { 22, 16}, { 23, 16}, { 22, 17},\r\n    { 21, 18}, { 20, 19}, { 21, 19}, { 22, 18}, { 23, 17}, { 23, 18}, { 22, 19}, { 23, 19},\r\n    { 16, 20}, { 17, 20}, { 16, 21}, { 16, 22}, { 17, 21}, { 18, 20}, { 19, 20}, { 18, 21},\r\n    { 17, 22}, { 16, 23}, { 17, 23}, { 18, 22}, { 19, 21}, { 19, 22}, { 18, 23}, { 19, 23},\r\n    { 12, 24}, { 13, 24}, { 12, 25}, { 12, 26}, { 13, 25}, { 14, 24}, { 15, 24}, { 14, 25},\r\n    { 13, 26}, { 12, 27}, { 13, 27}, { 14, 26}, { 15, 25}, { 15, 26}, { 14, 27}, { 15, 27},\r\n    {  8, 28}, {  9, 28}, {  8, 29}, {  8, 30}, {  9, 29}, { 10, 28}, { 11, 
28}, { 10, 29},\r\n    {  9, 30}, {  8, 31}, {  9, 31}, { 10, 30}, { 11, 29}, { 11, 30}, { 10, 31}, { 11, 31},\r\n    { 12, 28}, { 13, 28}, { 12, 29}, { 12, 30}, { 13, 29}, { 14, 28}, { 15, 28}, { 14, 29},\r\n    { 13, 30}, { 12, 31}, { 13, 31}, { 14, 30}, { 15, 29}, { 15, 30}, { 14, 31}, { 15, 31},\r\n    { 16, 24}, { 17, 24}, { 16, 25}, { 16, 26}, { 17, 25}, { 18, 24}, { 19, 24}, { 18, 25},\r\n    { 17, 26}, { 16, 27}, { 17, 27}, { 18, 26}, { 19, 25}, { 19, 26}, { 18, 27}, { 19, 27},\r\n    { 20, 20}, { 21, 20}, { 20, 21}, { 20, 22}, { 21, 21}, { 22, 20}, { 23, 20}, { 22, 21},\r\n    { 21, 22}, { 20, 23}, { 21, 23}, { 22, 22}, { 23, 21}, { 23, 22}, { 22, 23}, { 23, 23},\r\n    { 24, 16}, { 25, 16}, { 24, 17}, { 24, 18}, { 25, 17}, { 26, 16}, { 27, 16}, { 26, 17},\r\n    { 25, 18}, { 24, 19}, { 25, 19}, { 26, 18}, { 27, 17}, { 27, 18}, { 26, 19}, { 27, 19},\r\n    { 28, 12}, { 29, 12}, { 28, 13}, { 28, 14}, { 29, 13}, { 30, 12}, { 31, 12}, { 30, 13},\r\n    { 29, 14}, { 28, 15}, { 29, 15}, { 30, 14}, { 31, 13}, { 31, 14}, { 30, 15}, { 31, 15},\r\n    { 28, 16}, { 29, 16}, { 28, 17}, { 28, 18}, { 29, 17}, { 30, 16}, { 31, 16}, { 30, 17},\r\n    { 29, 18}, { 28, 19}, { 29, 19}, { 30, 18}, { 31, 17}, { 31, 18}, { 30, 19}, { 31, 19},\r\n    { 24, 20}, { 25, 20}, { 24, 21}, { 24, 22}, { 25, 21}, { 26, 20}, { 27, 20}, { 26, 21},\r\n    { 25, 22}, { 24, 23}, { 25, 23}, { 26, 22}, { 27, 21}, { 27, 22}, { 26, 23}, { 27, 23},\r\n    { 20, 24}, { 21, 24}, { 20, 25}, { 20, 26}, { 21, 25}, { 22, 24}, { 23, 24}, { 22, 25},\r\n    { 21, 26}, { 20, 27}, { 21, 27}, { 22, 26}, { 23, 25}, { 23, 26}, { 22, 27}, { 23, 27},\r\n    { 16, 28}, { 17, 28}, { 16, 29}, { 16, 30}, { 17, 29}, { 18, 28}, { 19, 28}, { 18, 29},\r\n    { 17, 30}, { 16, 31}, { 17, 31}, { 18, 30}, { 19, 29}, { 19, 30}, { 18, 31}, { 19, 31},\r\n    { 20, 28}, { 21, 28}, { 20, 29}, { 20, 30}, { 21, 29}, { 22, 28}, { 23, 28}, { 22, 29},\r\n    { 21, 30}, { 20, 31}, { 21, 31}, { 22, 30}, { 23, 29}, { 23, 30}, { 22, 31}, 
{ 23, 31},\r\n    { 24, 24}, { 25, 24}, { 24, 25}, { 24, 26}, { 25, 25}, { 26, 24}, { 27, 24}, { 26, 25},\r\n    { 25, 26}, { 24, 27}, { 25, 27}, { 26, 26}, { 27, 25}, { 27, 26}, { 26, 27}, { 27, 27},\r\n    { 28, 20}, { 29, 20}, { 28, 21}, { 28, 22}, { 29, 21}, { 30, 20}, { 31, 20}, { 30, 21},\r\n    { 29, 22}, { 28, 23}, { 29, 23}, { 30, 22}, { 31, 21}, { 31, 22}, { 30, 23}, { 31, 23},\r\n    { 28, 24}, { 29, 24}, { 28, 25}, { 28, 26}, { 29, 25}, { 30, 24}, { 31, 24}, { 30, 25},\r\n    { 29, 26}, { 28, 27}, { 29, 27}, { 30, 26}, { 31, 25}, { 31, 26}, { 30, 27}, { 31, 27},\r\n    { 24, 28}, { 25, 28}, { 24, 29}, { 24, 30}, { 25, 29}, { 26, 28}, { 27, 28}, { 26, 29},\r\n    { 25, 30}, { 24, 31}, { 25, 31}, { 26, 30}, { 27, 29}, { 27, 30}, { 26, 31}, { 27, 31},\r\n    { 28, 28}, { 29, 28}, { 28, 29}, { 28, 30}, { 29, 29}, { 30, 28}, { 31, 28}, { 30, 29},\r\n    { 29, 30}, { 28, 31}, { 29, 31}, { 30, 30}, { 31, 29}, { 31, 30}, { 30, 31}, { 31, 31}\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * coefficient scan order\r\n */\r\nstatic const int16_t(*tab_scan_coeff[4][4])[2] = {\r\n    /* 4x4 */\r\n    {tab_scan_4x4, NULL, NULL, tab_scan_4x4},\r\n    /* 8x8,  16x4,  4x16 */\r\n    {tab_scan_cg_8x8, tab_scan_4x16, tab_scan_16x4, tab_scan_cg_8x8},\r\n    /* 16x16, 32x8, 8x32 */\r\n    {tab_scan_cg_16x16, tab_scan_8x32, tab_scan_32x8, tab_scan_cg_16x16},\r\n    /* 32x32, 64x16, 16x64 */\r\n    {tab_scan_cg_32x32, NULL, NULL, tab_scan_cg_32x32},\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * CG scan order\r\n */\r\nstatic const int16_t(*tab_scan_cg[4][4])[2] = {\r\n    /* 4x4 */\r\n    {tab_scan_2x2, NULL, NULL, tab_scan_2x2},\r\n    /* 8x8,  16x4,  4x16 */\r\n    {tab_scan_2x2, tab_scan_1x4, tab_scan_4x1, tab_scan_2x2},\r\n    /* 16x16, 32x8, 8x32 */\r\n    {tab_scan_4x4, tab_scan_2x8, tab_scan_8x2, tab_scan_4x4},\r\n    /* 32x32, 64x16, 16x64 */\r\n    {tab_scan_8x8, NULL, NULL, 
tab_scan_8x8},\r\n};\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_SCAN_TAB_H\r\n"
  },
  {
    "path": "source/common/threadpool.cc",
    "content": "/*\r\n * threadpool.cc\r\n *\r\n * Description of this file:\r\n *    thread pooling functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"threadpool.h\"\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * type defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * job\r\n */\r\ntypedef struct threadpool_job_t {\r\n    davs2_threadpool_func_t func;\r\n    void                   *arg1;\r\n    int                     arg2;\r\n    void                   *ret;\r\n    int                     wait;\r\n} 
threadpool_job_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * synchronized job list\r\n */\r\ntypedef struct davs2_sync_job_list_t {\r\n    int                     i_max_size;\r\n    int                     i_size;\r\n    davs2_thread_mutex_t    mutex;\r\n    davs2_thread_cond_t     cv_fill;  /* event signaling that the list became fuller */\r\n    davs2_thread_cond_t     cv_empty; /* event signaling that the list became emptier */\r\n    threadpool_job_t       *list[DAVS2_WORK_MAX + 2];\r\n} davs2_sync_job_list_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * thread pool\r\n */\r\nstruct davs2_threadpool_t {\r\n    int                 i_exit;               /* exit flag */\r\n    int                 num_total_threads;    /* thread number in pool */\r\n    int                 num_run_threads;      /* thread number running */\r\n    davs2_threadpool_func_t init_func;\r\n    void               *init_arg;\r\n    int                 init_arg2;\r\n\r\n    /* requires a synchronized list structure and associated methods,\r\n       so use what is already implemented for jobs */\r\n    davs2_sync_job_list_t uninit;   /* list of jobs that are awaiting use */\r\n    davs2_sync_job_list_t run;      /* list of jobs that are queued for processing by the pool */\r\n    davs2_sync_job_list_t done;     /* list of jobs that have finished processing */\r\n\r\n    /* handler of threads in the pool */\r\n    davs2_thread_t        thread_handle[AVS2_THREAD_MAX];\r\n};\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * list operators\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic threadpool_job_t *davs2_job_shift(threadpool_job_t **list)\r\n{\r\n    threadpool_job_t *job = list[0];\r\n    int 
i;\r\n\r\n    for (i = 0; list[i]; i++) {\r\n        list[i] = list[i + 1];\r\n    }\r\n    assert(job);\r\n\r\n    return job;\r\n}\r\n\r\n/**\r\n * ===========================================================================\r\n * list operators\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic int davs2_sync_job_list_init(davs2_sync_job_list_t *slist, int i_max_size)\r\n{\r\n    if (i_max_size < 0) {\r\n        return -1;\r\n    }\r\n\r\n    slist->i_max_size = i_max_size;\r\n    slist->i_size     = 0;\r\n    memset(slist->list, 0, sizeof(slist->list));\r\n\r\n    if (davs2_thread_mutex_init(&slist->mutex, NULL) ||\r\n        davs2_thread_cond_init(&slist->cv_fill, NULL) ||\r\n        davs2_thread_cond_init(&slist->cv_empty, NULL)) {\r\n        return -1;\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void davs2_threadpool_list_delete(davs2_sync_job_list_t *slist)\r\n{\r\n    davs2_thread_mutex_destroy(&slist->mutex);\r\n    davs2_thread_cond_destroy(&slist->cv_fill);\r\n    davs2_thread_cond_destroy(&slist->cv_empty);\r\n    slist->i_size = 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void davs2_sync_job_list_push(davs2_sync_job_list_t *slist, threadpool_job_t *job)\r\n{\r\n    davs2_thread_mutex_lock(&slist->mutex);      /* lock */\r\n    while (slist->i_size == slist->i_max_size) {\r\n        davs2_thread_cond_wait(&slist->cv_empty, &slist->mutex);\r\n    }\r\n    slist->list[slist->i_size++] = job;\r\n    davs2_thread_mutex_unlock(&slist->mutex);    /* unlock */\r\n\r\n    davs2_thread_cond_broadcast(&slist->cv_fill);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic threadpool_job_t 
*davs2_sync_job_list_pop(davs2_sync_job_list_t *slist)\r\n{\r\n    threadpool_job_t *job;\r\n\r\n    davs2_thread_mutex_lock(&slist->mutex);      /* lock */\r\n    while (!slist->i_size) {\r\n        davs2_thread_cond_wait(&slist->cv_fill, &slist->mutex);\r\n    }\r\n    job = slist->list[--slist->i_size];\r\n    slist->list[slist->i_size] = NULL;\r\n    davs2_thread_cond_broadcast(&slist->cv_empty);\r\n    davs2_thread_mutex_unlock(&slist->mutex);    /* unlock */\r\n\r\n    return job;\r\n}\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * thread pool operators\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid *davs2_threadpool_thread(void *arg)\r\n{\r\n    davs2_threadpool_t *pool = (davs2_threadpool_t *)arg;\r\n\r\n    /* init */\r\n    if (pool->init_func) {\r\n        pool->init_func(pool->init_arg, pool->init_arg2);\r\n    }\r\n\r\n    /* loop until exit flag is set */\r\n    while (pool->i_exit != AVS2_EXIT_THREAD) {\r\n        threadpool_job_t *job = NULL;\r\n\r\n        /* fetch a job */\r\n        davs2_thread_mutex_lock(&pool->run.mutex);   /* lock */\r\n        while (pool->i_exit != AVS2_EXIT_THREAD && !pool->run.i_size) {\r\n            davs2_thread_cond_wait(&pool->run.cv_fill, &pool->run.mutex);\r\n        }\r\n        if (pool->run.i_size) {\r\n            job = davs2_job_shift(pool->run.list);\r\n            pool->run.i_size--;\r\n        }\r\n        davs2_thread_mutex_unlock(&pool->run.mutex); /* unlock */\r\n\r\n        /* do the job */\r\n        if (!job) {\r\n            continue;\r\n        }\r\n        job->ret = job->func(job->arg1, job->arg2); /* execute the function */\r\n\r\n        /* the job is done */\r\n        if (job->wait) {\r\n            davs2_sync_job_list_push(&pool->done, job);\r\n        } else {\r\n            
davs2_sync_job_list_push(&pool->uninit, job);\r\n        }\r\n    }\r\n\r\n    return NULL;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint davs2_threadpool_init(davs2_threadpool_t **p_pool, int threads, davs2_threadpool_func_t init_func, void *init_arg1, int init_arg2)\r\n{\r\n    davs2_threadpool_t *pool;\r\n    uint32_t mem_size;\r\n    uint8_t *mem_ptr;\r\n    int i;\r\n\r\n    if (threads <= 0) {\r\n        return -1;\r\n    }\r\n\r\n    mem_size = sizeof(davs2_threadpool_t)\r\n        + DAVS2_WORK_MAX * sizeof(threadpool_job_t)\r\n        + CACHE_LINE_SIZE * (DAVS2_WORK_MAX + 2);\r\n\r\n    CHECKED_MALLOCZERO(mem_ptr, uint8_t *, mem_size);\r\n    *p_pool = pool = (davs2_threadpool_t *)mem_ptr;\r\n    mem_ptr += sizeof(davs2_threadpool_t);\r\n    ALIGN_POINTER(mem_ptr);\r\n\r\n    pool->init_func = init_func;\r\n    pool->init_arg  = init_arg1;\r\n    pool->init_arg2 = init_arg2;\r\n    pool->num_total_threads = DAVS2_MIN(threads, AVS2_THREAD_MAX);\r\n    pool->num_run_threads   = 0;\r\n\r\n    if (davs2_sync_job_list_init(&pool->uninit, DAVS2_WORK_MAX) ||\r\n        davs2_sync_job_list_init(&pool->run,    DAVS2_WORK_MAX) ||\r\n        davs2_sync_job_list_init(&pool->done,   DAVS2_WORK_MAX)) {\r\n        goto fail;\r\n    }\r\n\r\n    for (i = 0; i < DAVS2_WORK_MAX; i++) {\r\n        threadpool_job_t *job = (threadpool_job_t *)mem_ptr;\r\n        mem_ptr += sizeof(threadpool_job_t);\r\n        ALIGN_POINTER(mem_ptr);\r\n        davs2_sync_job_list_push(&pool->uninit, job);\r\n    }\r\n\r\n    for (i = 0; i < pool->num_total_threads; i++) {\r\n        if (davs2_thread_create(pool->thread_handle + i, NULL, davs2_threadpool_thread, pool)) {\r\n            goto fail;\r\n        }\r\n    }\r\n\r\n    return 0;\r\n\r\nfail:\r\n    return -1;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_threadpool_run(davs2_threadpool_t *pool, 
davs2_threadpool_func_t func, void *arg1, int arg2, int wait_sign)\r\n{\r\n    threadpool_job_t *job = davs2_sync_job_list_pop(&pool->uninit);\r\n\r\n    job->func = func;\r\n    job->arg1 = arg1;\r\n    job->arg2 = arg2;\r\n    job->wait = wait_sign;\r\n    davs2_sync_job_list_push(&pool->run, job);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * query whether the thread pool is idle (no jobs queued for processing)\r\n */\r\nint davs2_threadpool_is_free(davs2_threadpool_t *pool)\r\n{\r\n    return pool->run.i_size <= 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid *davs2_threadpool_wait(davs2_threadpool_t *pool, void *arg1, int arg2)\r\n{\r\n    threadpool_job_t *job = NULL;\r\n    void *ret;\r\n    int i;\r\n\r\n    davs2_thread_mutex_lock(&pool->done.mutex);      /* lock */\r\n    while (!job) {\r\n        for (i = 0; i < pool->done.i_size; i++) {\r\n            threadpool_job_t *t = pool->done.list[i];\r\n            if (t->arg1 == arg1 && t->arg2 == arg2) {\r\n                job = davs2_job_shift(pool->done.list + i);\r\n                pool->done.i_size--;\r\n                break;          /* found the job according to arg */\r\n            }\r\n        }\r\n        if (!job) {\r\n            davs2_thread_cond_wait(&pool->done.cv_fill, &pool->done.mutex);\r\n        }\r\n    }\r\n    davs2_thread_mutex_unlock(&pool->done.mutex);    /* unlock */\r\n\r\n    ret = job->ret;\r\n    davs2_sync_job_list_push(&pool->uninit, job);\r\n\r\n    return ret;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_threadpool_delete(davs2_threadpool_t *pool)\r\n{\r\n    int i;\r\n\r\n    davs2_thread_mutex_lock(&pool->run.mutex);   /* lock */\r\n    pool->i_exit = AVS2_EXIT_THREAD;\r\n    davs2_thread_cond_broadcast(&pool->run.cv_fill);\r\n    davs2_thread_mutex_unlock(&pool->run.mutex); /* unlock */\r\n\r\n    for (i = 0; i < pool->num_total_threads; i++) {\r\n        
davs2_thread_join(pool->thread_handle[i], NULL);\r\n    }\r\n\r\n    davs2_threadpool_list_delete(&pool->uninit);\r\n    davs2_threadpool_list_delete(&pool->run);\r\n    davs2_threadpool_list_delete(&pool->done);\r\n    davs2_free(pool);\r\n}\r\n"
  },
  {
    "path": "source/common/threadpool.h",
    "content": "/*\r\n * threadpool.h\r\n *\r\n * Description of this file:\r\n *    thread pooling functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_THREADPOOL_H\r\n#define DAVS2_THREADPOOL_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\ntypedef struct davs2_threadpool_t davs2_threadpool_t;\r\ntypedef void *(*davs2_threadpool_func_t)(void *arg1, int arg2);\r\n\r\n#define davs2_threadpool_init FPFX(threadpool_init)\r\nint   davs2_threadpool_init  (davs2_threadpool_t **p_pool, int threads,\r\n                              davs2_threadpool_func_t init_func, void *init_arg1, int init_arg2);\r\n#define davs2_threadpool_run FPFX(threadpool_run)\r\nvoid  davs2_threadpool_run   (davs2_threadpool_t *pool, 
davs2_threadpool_func_t func, void *arg1, int arg2, int wait_sign);\r\n#define davs2_threadpool_is_free FPFX(threadpool_is_free)\r\nint   davs2_threadpool_is_free(davs2_threadpool_t *pool);\r\n#define davs2_threadpool_wait FPFX(threadpool_wait)\r\nvoid *davs2_threadpool_wait  (davs2_threadpool_t *pool, void *arg1, int arg2);\r\n#define davs2_threadpool_delete FPFX(threadpool_delete)\r\nvoid  davs2_threadpool_delete(davs2_threadpool_t *pool);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_THREADPOOL_H\r\n"
  },
  {
    "path": "source/common/transform.cc",
    "content": "/*\r\n *  transform.cc\r\n *\r\n * Description of this file:\r\n *    Transform functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"common.h\"\r\n#include \"quant.h\"\r\n#include \"transform.h\"\r\n#include \"block_info.h\"\r\n\r\n#if HAVE_MMX\r\n#include \"vec/intrinsic.h\"\r\n#include \"x86/dct8.h\"\r\n#endif\r\n\r\n/**\r\n * ===========================================================================\r\n * global & local variables\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * transform */\r\n#define LOT_MAX_WLT_TAP         2     // number of wavelet transform tap, 
(5-3)\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int16_t g_T4[4][4] = {\r\n    { 32,  32,  32,  32 },\r\n    { 42,  17, -17, -42 },\r\n    { 32, -32, -32,  32 },\r\n    { 17, -42,  42, -17 }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int16_t g_T8[8][8] = {\r\n    { 32,  32,  32,  32,  32,  32,  32,  32 },\r\n    { 44,  38,  25,   9,  -9, -25, -38, -44 },\r\n    { 42,  17, -17, -42, -42, -17,  17,  42 },\r\n    { 38,  -9, -44, -25,  25,  44,   9, -38 },\r\n    { 32, -32, -32,  32,  32, -32, -32,  32 },\r\n    { 25, -44,   9,  38, -38,  -9,  44, -25 },\r\n    { 17, -42,  42, -17, -17,  42, -42,  17 },\r\n    {  9, -25,  38, -44,  44, -38,  25,  -9 }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int16_t g_T16[16][16] = {\r\n    { 32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32,  32 },\r\n    { 45,  43,  40,  35,  29,  21,  13,   4,  -4, -13, -21, -29, -35, -40, -43, -45 },\r\n    { 44,  38,  25,   9,  -9, -25, -38, -44, -44, -38, -25,  -9,   9,  25,  38,  44 },\r\n    { 43,  29,   4, -21, -40, -45, -35, -13,  13,  35,  45,  40,  21,  -4, -29, -43 },\r\n    { 42,  17, -17, -42, -42, -17,  17,  42,  42,  17, -17, -42, -42, -17,  17,  42 },\r\n    { 40,   4, -35, -43, -13,  29,  45,  21, -21, -45, -29,  13,  43,  35,  -4, -40 },\r\n    { 38,  -9, -44, -25,  25,  44,   9, -38, -38,   9,  44,  25, -25, -44,  -9,  38 },\r\n    { 35, -21, -43,   4,  45,  13, -40, -29,  29,  40, -13, -45,  -4,  43,  21, -35 },\r\n    { 32, -32, -32,  32,  32, -32, -32,  32,  32, -32, -32,  32,  32, -32, -32,  32 },\r\n    { 29, -40, -13,  45,  -4, -43,  21,  35, -35, -21,  43,   4, -45,  13,  40, -29 },\r\n    { 25, -44,   9,  38, -38,  -9,  44, -25, -25,  44,  -9, -38,  38,   9, -44,  25 },\r\n    { 21, -45,  29,  13, -43,  35,   4, -40,  40,  -4, -35,  43, 
-13, -29,  45, -21 },\r\n    { 17, -42,  42, -17, -17,  42, -42,  17,  17, -42,  42, -17, -17,  42, -42,  17 },\r\n    { 13, -35,  45, -40,  21,   4, -29,  43, -43,  29,  -4, -21,  40, -45,  35, -13 },\r\n    {  9, -25,  38, -44,  44, -38,  25,  -9,  -9,  25, -38,  44, -44,  38, -25,   9 },\r\n    {  4, -13,  21, -29,  35, -40,  43, -45,  45, -43,  40, -35,  29, -21,  13,  -4 }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic const int16_t g_T32[32][32] = {\r\n    { 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32 },\r\n    { 45, 45, 44, 43, 41, 39, 36, 34, 30, 27, 23, 19, 15, 11,  7,  2, -2, -7,-11,-15,-19,-23,-27,-30,-34,-36,-39,-41,-43,-44,-45,-45 },\r\n    { 45, 43, 40, 35, 29, 21, 13,  4, -4,-13,-21,-29,-35,-40,-43,-45,-45,-43,-40,-35,-29,-21,-13, -4,  4, 13, 21, 29, 35, 40, 43, 45 },\r\n    { 45, 41, 34, 23, 11, -2,-15,-27,-36,-43,-45,-44,-39,-30,-19, -7,  7, 19, 30, 39, 44, 45, 43, 36, 27, 15,  2,-11,-23,-34,-41,-45 },\r\n    { 44, 38, 25,  9, -9,-25,-38,-44,-44,-38,-25, -9,  9, 25, 38, 44, 44, 38, 25,  9, -9,-25,-38,-44,-44,-38,-25, -9,  9, 25, 38, 44 },\r\n    { 44, 34, 15, -7,-27,-41,-45,-39,-23, -2, 19, 36, 45, 43, 30, 11,-11,-30,-43,-45,-36,-19,  2, 23, 39, 45, 41, 27,  7,-15,-34,-44 },\r\n    { 43, 29,  4,-21,-40,-45,-35,-13, 13, 35, 45, 40, 21, -4,-29,-43,-43,-29, -4, 21, 40, 45, 35, 13,-13,-35,-45,-40,-21,  4, 29, 43 },\r\n    { 43, 23, -7,-34,-45,-36,-11, 19, 41, 44, 27, -2,-30,-45,-39,-15, 15, 39, 45, 30,  2,-27,-44,-41,-19, 11, 36, 45, 34, 7, -23,-43 },\r\n    { 42, 17,-17,-42,-42,-17, 17, 42, 42, 17,-17,-42,-42,-17, 17, 42, 42, 17,-17,-42,-42,-17, 17, 42, 42, 17,-17,-42,-42,-17, 17, 42 },\r\n    { 41, 11,-27,-45,-30,  7, 39, 43, 15,-23,-45,-34,  2, 36, 44, 19,-19,-44,-36, -2, 34, 45, 23,-15,-43,-39, -7, 30, 45, 27,-11,-41 },\r\n    { 40,  4,-35,-43,-13, 29, 45, 21,-21,-45,-29, 13, 43, 35, -4,-40,-40, -4, 35, 
43, 13,-29,-45,-21, 21, 45, 29,-13,-43,-35,  4, 40 },\r\n    { 39, -2,-41,-36,  7, 43, 34,-11,-44,-30, 15, 45, 27,-19,-45,-23, 23, 45, 19,-27,-45,-15, 30, 44, 11,-34,-43, -7, 36, 41,  2,-39 },\r\n    { 38, -9,-44,-25, 25, 44,  9,-38,-38,  9, 44, 25,-25,-44, -9, 38, 38, -9,-44,-25, 25, 44,  9,-38,-38,  9, 44, 25,-25,-44, -9, 38 },\r\n    { 36,-15,-45,-11, 39, 34,-19,-45, -7, 41, 30,-23,-44, -2, 43, 27,-27,-43,  2, 44, 23,-30,-41,  7, 45, 19,-34,-39, 11, 45, 15,-36 },\r\n    { 35,-21,-43,  4, 45, 13,-40,-29, 29, 40,-13,-45, -4, 43, 21,-35,-35, 21, 43, -4,-45,-13, 40, 29,-29,-40, 13, 45,  4,-43,-21, 35 },\r\n    { 34,-27,-39, 19, 43,-11,-45,  2, 45,  7,-44,-15, 41, 23,-36,-30, 30, 36,-23,-41, 15, 44, -7,-45, -2, 45, 11,-43,-19, 39, 27,-34 },\r\n    { 32,-32,-32, 32, 32,-32,-32, 32, 32,-32,-32, 32, 32,-32,-32, 32, 32,-32,-32, 32, 32,-32,-32, 32, 32,-32,-32, 32, 32,-32,-32, 32 },\r\n    { 30,-36,-23, 41, 15,-44, -7, 45, -2,-45, 11, 43,-19,-39, 27, 34,-34,-27, 39, 19,-43,-11, 45,  2,-45,  7, 44,-15,-41, 23, 36,-30 },\r\n    { 29,-40,-13, 45, -4,-43, 21, 35,-35,-21, 43,  4,-45, 13, 40,-29,-29, 40, 13,-45,  4, 43,-21,-35, 35, 21,-43, -4, 45,-13,-40, 29 },\r\n    { 27,-43, -2, 44,-23,-30, 41,  7,-45, 19, 34,-39,-11, 45,-15,-36, 36, 15,-45, 11, 39,-34,-19, 45, -7,-41, 30, 23,-44,  2, 43,-27 },\r\n    { 25,-44,  9, 38,-38, -9, 44,-25,-25, 44, -9,-38, 38,  9,-44, 25, 25,-44,  9, 38,-38, -9, 44,-25,-25, 44, -9,-38, 38,  9,-44, 25 },\r\n    { 23,-45, 19, 27,-45, 15, 30,-44, 11, 34,-43,  7, 36,-41,  2, 39,-39, -2, 41,-36, -7, 43,-34,-11, 44,-30,-15, 45,-27,-19, 45,-23 },\r\n    { 21,-45, 29, 13,-43, 35,  4,-40, 40, -4,-35, 43,-13,-29, 45,-21,-21, 45,-29,-13, 43,-35, -4, 40,-40,  4, 35,-43, 13, 29,-45, 21 },\r\n    { 19,-44, 36, -2,-34, 45,-23,-15, 43,-39,  7, 30,-45, 27, 11,-41, 41,-11,-27, 45,-30, -7, 39,-43, 15, 23,-45, 34,  2,-36, 44,-19 },\r\n    { 17,-42, 42,-17,-17, 42,-42, 17, 17,-42, 42,-17,-17, 42,-42, 17, 17,-42, 42,-17,-17, 42,-42, 17, 17,-42, 42,-17,-17, 42,-42, 17 
},\r\n    { 15,-39, 45,-30,  2, 27,-44, 41,-19,-11, 36,-45, 34, -7,-23, 43,-43, 23,  7,-34, 45,-36, 11, 19,-41, 44,-27, -2, 30,-45, 39,-15 },\r\n    { 13,-35, 45,-40, 21,  4,-29, 43,-43, 29, -4,-21, 40,-45, 35,-13,-13, 35,-45, 40,-21, -4, 29,-43, 43,-29,  4, 21,-40, 45,-35, 13 },\r\n    { 11,-30, 43,-45, 36,-19, -2, 23,-39, 45,-41, 27, -7,-15, 34,-44, 44,-34, 15,  7,-27, 41,-45, 39,-23,  2, 19,-36, 45,-43, 30,-11 },\r\n    {  9,-25, 38,-44, 44,-38, 25, -9, -9, 25,-38, 44,-44, 38,-25,  9,  9,-25, 38,-44, 44,-38, 25, -9, -9, 25,-38, 44,-44, 38,-25,  9 },\r\n    {  7,-19, 30,-39, 44,-45, 43,-36, 27,-15,  2, 11,-23, 34,-41, 45,-45, 41,-34, 23,-11, -2, 15,-27, 36,-43, 45,-44, 39,-30, 19, -7 },\r\n    {  4,-13, 21,-29, 35,-40, 43,-45, 45,-43, 40,-35, 29,-21, 13, -4, -4, 13,-21, 29,-35, 40,-43, 45,-45, 43,-40, 35,-29, 21,-13,  4 },\r\n    {  2, -7, 11,-15, 19,-23, 27,-30, 34,-36, 39,-41, 43,-44, 45,-45, 45,-45, 44,-43, 41,-39, 36,-34, 30,-27, 23,-19, 15,-11,  7, -2 }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nALIGN16(static const int16_t g_2T[SEC_TR_SIZE * SEC_TR_SIZE]) = {\r\n    123,  -35,  -8,  -3,\r\n    -32, -120,  30,  10,\r\n     14,   25, 123, -22,\r\n      8,   13,  19, 126\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nALIGN16(static const int16_t g_2T_C[SEC_TR_SIZE * SEC_TR_SIZE]) = {\r\n    34,  58,  72,  81,\r\n    77,  69,  -7, -75,\r\n    79, -33, -75,  58,\r\n    55, -84,  73, -28\r\n};\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void partialButterflyInverse4_c(const coeff_t *src, coeff_t *dst, int shift, int line, int clip_depth)\r\n{\r\n    int E[2], O[2];\r\n    const int 
max_val = (1 << (clip_depth - 1)) - 1;\r\n    const int min_val = -max_val - 1;\r\n    const int add     = 1 << (shift - 1);\r\n    int j;\r\n\r\n    for (j = 0; j < line; j++) {\r\n        /* utilizing symmetry properties to the maximum to\r\n         * minimize the number of multiplications */\r\n        O[0] = g_T4[1][0] * src[line] + g_T4[3][0] * src[3 * line];\r\n        O[1] = g_T4[1][1] * src[line] + g_T4[3][1] * src[3 * line];\r\n        E[0] = g_T4[0][0] * src[0   ] + g_T4[2][0] * src[2 * line];\r\n        E[1] = g_T4[0][1] * src[0   ] + g_T4[2][1] * src[2 * line];\r\n\r\n        /* combining even and odd terms at each hierarchy levels to\r\n         * calculate the final spatial domain vector */\r\n        dst[0] = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[0] + O[0] + add) >> shift));\r\n        dst[1] = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[1] + O[1] + add) >> shift));\r\n        dst[2] = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[1] - O[1] + add) >> shift));\r\n        dst[3] = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[0] - O[0] + add) >> shift));\r\n\r\n        src++;\r\n        dst += 4;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void idct_4x4_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n#define BSIZE   4\r\n    ALIGN32(coeff_t coeff[BSIZE * BSIZE]);\r\n    ALIGN32(coeff_t block[BSIZE * BSIZE]);\r\n    int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth;\r\n    int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1;\r\n    int i;\r\n\r\n    partialButterflyInverse4_c(  src, coeff, shift1, BSIZE, clip_depth1);\r\n    partialButterflyInverse4_c(coeff, block, shift2, BSIZE, clip_depth2);\r\n\r\n    for (i = 0; i < BSIZE; i++) {\r\n        memcpy(dst, &block[i * BSIZE], BSIZE * sizeof(coeff_t));\r\n        dst += i_dst;\r\n    }\r\n#undef BSIZE\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic 
void partialButterflyInverse8_c(const coeff_t *src, coeff_t *dst, int shift, int line, int clip_depth)\r\n{\r\n    int E[4], O[4];\r\n    int EE[2], EO[2];\r\n    const int max_val = (1 << (clip_depth - 1)) - 1;\r\n    const int min_val = -max_val - 1;\r\n    const int add     = 1 << (shift - 1);\r\n    int j, k;\r\n\r\n    for (j = 0; j < line; j++) {\r\n        /* utilizing symmetry properties to the maximum to\r\n         * minimize the number of multiplications */\r\n        for (k = 0; k < 4; k++) {\r\n            O[k] = g_T8[1][k] * src[    line] +\r\n                   g_T8[3][k] * src[3 * line] +\r\n                   g_T8[5][k] * src[5 * line] +\r\n                   g_T8[7][k] * src[7 * line];\r\n        }\r\n\r\n        EO[0] = g_T8[2][0] * src[2 * line] + g_T8[6][0] * src[6 * line];\r\n        EO[1] = g_T8[2][1] * src[2 * line] + g_T8[6][1] * src[6 * line];\r\n        EE[0] = g_T8[0][0] * src[0       ] + g_T8[4][0] * src[4 * line];\r\n        EE[1] = g_T8[0][1] * src[0       ] + g_T8[4][1] * src[4 * line];\r\n\r\n        /* combining even and odd terms at each hierarchy levels to\r\n         * calculate the final spatial domain vector */\r\n        E[0] = EE[0] + EO[0];\r\n        E[3] = EE[0] - EO[0];\r\n        E[1] = EE[1] + EO[1];\r\n        E[2] = EE[1] - EO[1];\r\n\r\n        for (k = 0; k < 4; k++) {\r\n            dst[k]     = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[k] + O[k] + add) >> shift));\r\n            dst[k + 4] = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[3 - k] - O[3 - k] + add) >> shift));\r\n        }\r\n\r\n        src++;\r\n        dst += 8;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void idct_8x8_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n#define BSIZE   8\r\n    ALIGN32(coeff_t coeff[BSIZE * BSIZE]);\r\n    ALIGN32(coeff_t block[BSIZE * BSIZE]);\r\n    int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth;\r\n    int clip_depth1 = 
LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1;\r\n    int i;\r\n\r\n    partialButterflyInverse8_c(  src, coeff, shift1, BSIZE, clip_depth1);\r\n    partialButterflyInverse8_c(coeff, block, shift2, BSIZE, clip_depth2);\r\n\r\n    for (i = 0; i < BSIZE; i++) {\r\n        memcpy(&dst[0], &block[i * BSIZE], BSIZE * sizeof(coeff_t));\r\n        dst += i_dst;\r\n    }\r\n#undef BSIZE\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void partialButterflyInverse16_c(const coeff_t *src, coeff_t *dst, int shift, int line, int clip_depth)\r\n{\r\n    int E[8], O[8];\r\n    int EE[4], EO[4];\r\n    int EEE[2], EEO[2];\r\n    const int max_val = (1 << (clip_depth - 1)) - 1;\r\n    const int min_val = -max_val - 1;\r\n    const int add     = 1 << (shift - 1);\r\n    int j, k;\r\n\r\n    for (j = 0; j < line; j++) {\r\n        /* utilizing symmetry properties to the maximum to\r\n         * minimize the number of multiplications */\r\n        for (k = 0; k < 8; k++) {\r\n            O[k] = g_T16[ 1][k] * src[     line] +\r\n                   g_T16[ 3][k] * src[ 3 * line] +\r\n                   g_T16[ 5][k] * src[ 5 * line] +\r\n                   g_T16[ 7][k] * src[ 7 * line] +\r\n                   g_T16[ 9][k] * src[ 9 * line] +\r\n                   g_T16[11][k] * src[11 * line] +\r\n                   g_T16[13][k] * src[13 * line] +\r\n                   g_T16[15][k] * src[15 * line];\r\n        }\r\n\r\n        for (k = 0; k < 4; k++) {\r\n            EO[k] = g_T16[ 2][k] * src[ 2 * line] +\r\n                    g_T16[ 6][k] * src[ 6 * line] +\r\n                    g_T16[10][k] * src[10 * line] +\r\n                    g_T16[14][k] * src[14 * line];\r\n        }\r\n\r\n        EEO[0] = g_T16[4][0] * src[4 * line] + g_T16[12][0] * src[12 * line];\r\n        EEE[0] = g_T16[0][0] * src[0       ] + g_T16[ 8][0] * src[ 8 * line];\r\n        EEO[1] = g_T16[4][1] * src[4 * line] + g_T16[12][1] * src[12 * 
line];\r\n        EEE[1] = g_T16[0][1] * src[0       ] + g_T16[ 8][1] * src[ 8 * line];\r\n\r\n        /* combining even and odd terms at each hierarchy levels to\r\n         * calculate the final spatial domain vector */\r\n        for (k = 0; k < 2; k++) {\r\n            EE[k    ] = EEE[k    ] + EEO[k    ];\r\n            EE[k + 2] = EEE[1 - k] - EEO[1 - k];\r\n        }\r\n\r\n        for (k = 0; k < 4; k++) {\r\n            E[k    ] = EE[k    ] + EO[k    ];\r\n            E[k + 4] = EE[3 - k] - EO[3 - k];\r\n        }\r\n\r\n        for (k = 0; k < 8; k++) {\r\n            dst[k]     = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[k] + O[k] + add) >> shift));\r\n            dst[k + 8] = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[7 - k] - O[7 - k] + add) >> shift));\r\n        }\r\n\r\n        src++;\r\n        dst += 16;\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void idct_16x16_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n#define BSIZE   16\r\n    ALIGN32(coeff_t coeff[BSIZE * BSIZE]);\r\n    ALIGN32(coeff_t block[BSIZE * BSIZE]);\r\n    int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth;\r\n    int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1;\r\n    int i;\r\n\r\n    partialButterflyInverse16_c(  src, coeff, shift1, BSIZE, clip_depth1);\r\n    partialButterflyInverse16_c(coeff, block, shift2, BSIZE, clip_depth2);\r\n\r\n    for (i = 0; i < BSIZE; i++) {\r\n        memcpy(&dst[0], &block[i * BSIZE], BSIZE * sizeof(coeff_t));\r\n        dst += i_dst;\r\n    }\r\n#undef BSIZE\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void partialButterflyInverse32_c(const coeff_t *src, coeff_t *dst, int shift, int line, int clip_depth)\r\n{\r\n    int E[16], O[16];\r\n    int EE[8], EO[8];\r\n    int EEE[4], EEO[4];\r\n    int EEEE[2], EEEO[2];\r\n    const int max_val = (1 << (clip_depth - 1)) - 1;\r\n    
const int min_val = -max_val - 1;\r\n    const int add     = 1 << (shift - 1);\r\n    int j, k;\r\n\r\n    for (j = 0; j < line; j++) {\r\n        /* utilizing symmetry properties to the maximum to\r\n         * minimize the number of multiplications */\r\n        for (k = 0; k < 16; k++) {\r\n            O[k] = g_T32[ 1][k] * src[     line] +\r\n                   g_T32[ 3][k] * src[ 3 * line] +\r\n                   g_T32[ 5][k] * src[ 5 * line] +\r\n                   g_T32[ 7][k] * src[ 7 * line] +\r\n                   g_T32[ 9][k] * src[ 9 * line] +\r\n                   g_T32[11][k] * src[11 * line] +\r\n                   g_T32[13][k] * src[13 * line] +\r\n                   g_T32[15][k] * src[15 * line] +\r\n                   g_T32[17][k] * src[17 * line] +\r\n                   g_T32[19][k] * src[19 * line] +\r\n                   g_T32[21][k] * src[21 * line] +\r\n                   g_T32[23][k] * src[23 * line] +\r\n                   g_T32[25][k] * src[25 * line] +\r\n                   g_T32[27][k] * src[27 * line] +\r\n                   g_T32[29][k] * src[29 * line] +\r\n                   g_T32[31][k] * src[31 * line];\r\n        }\r\n\r\n        for (k = 0; k < 8; k++) {\r\n            EO[k] = g_T32[ 2][k] * src[ 2 * line] +\r\n                    g_T32[ 6][k] * src[ 6 * line] +\r\n                    g_T32[10][k] * src[10 * line] +\r\n                    g_T32[14][k] * src[14 * line] +\r\n                    g_T32[18][k] * src[18 * line] +\r\n                    g_T32[22][k] * src[22 * line] +\r\n                    g_T32[26][k] * src[26 * line] +\r\n                    g_T32[30][k] * src[30 * line];\r\n        }\r\n\r\n        for (k = 0; k < 4; k++) {\r\n            EEO[k] = g_T32[ 4][k] * src[ 4 * line] +\r\n                     g_T32[12][k] * src[12 * line] +\r\n                     g_T32[20][k] * src[20 * line] +\r\n                     g_T32[28][k] * src[28 * line];\r\n        }\r\n\r\n        EEEO[0] = g_T32[8][0] * src[8 * line] + 
g_T32[24][0] * src[24 * line];\r\n        EEEO[1] = g_T32[8][1] * src[8 * line] + g_T32[24][1] * src[24 * line];\r\n        EEEE[0] = g_T32[0][0] * src[0       ] + g_T32[16][0] * src[16 * line];\r\n        EEEE[1] = g_T32[0][1] * src[0       ] + g_T32[16][1] * src[16 * line];\r\n\r\n        /* combining even and odd terms at each hierarchy levels to\r\n         * calculate the final spatial domain vector */\r\n        EEE[0] = EEEE[0] + EEEO[0];\r\n        EEE[3] = EEEE[0] - EEEO[0];\r\n        EEE[1] = EEEE[1] + EEEO[1];\r\n        EEE[2] = EEEE[1] - EEEO[1];\r\n        for (k = 0; k < 4; k++) {\r\n            EE[k    ] = EEE[k    ] + EEO[k    ];\r\n            EE[k + 4] = EEE[3 - k] - EEO[3 - k];\r\n        }\r\n\r\n        for (k = 0; k < 8; k++) {\r\n            E[k    ] = EE[k    ] + EO[k    ];\r\n            E[k + 8] = EE[7 - k] - EO[7 - k];\r\n        }\r\n\r\n        for (k = 0; k < 16; k++) {\r\n            dst[k]      = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[k] + O[k] + add) >> shift));\r\n            dst[k + 16] = (coeff_t)DAVS2_CLIP3(min_val, max_val, ((E[15 - k] - O[15 - k] + add) >> shift));\r\n        }\r\n\r\n        src++;\r\n        dst += 32;\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * NOTE:\r\n * i_dst - the stride of dst (the lowest bit is additional wavelet flag)\r\n */\r\nstatic void idct_32x32_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n#define BSIZE   32\r\n    ALIGN32(coeff_t coeff[BSIZE * BSIZE]);\r\n    ALIGN32(coeff_t block[BSIZE * BSIZE]);\r\n    int a_flag = i_dst & 0x01;\r\n    int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth - a_flag;\r\n    int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1 + a_flag;\r\n    int i;\r\n\r\n    i_dst &= 0xFE;    /* remember to remove the flag bit */\r\n\r\n    partialButterflyInverse32_c(  src, coeff, shift1, BSIZE, clip_depth1);\r\n    partialButterflyInverse32_c(coeff, block, shift2, BSIZE, 
clip_depth2);\r\n\r\n    for (i = 0; i < BSIZE; i++) {\r\n        memcpy(&dst[0], &block[i * BSIZE], BSIZE * sizeof(coeff_t));\r\n        dst += i_dst;\r\n    }\r\n#undef BSIZE\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void idct_64x64_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    ALIGN32(coeff_t row_buf[64 + LOT_MAX_WLT_TAP * 2]);\r\n    coeff_t *pExt = row_buf + LOT_MAX_WLT_TAP;\r\n    const int N0  = 64;\r\n    const int N1  = 64 >> 1;\r\n    int x, y, offset;\r\n\r\n    /* step 0: idct 32x32 transform */\r\n    idct_32x32_c(src, dst, i_dst | 1);\r\n\r\n    /* step 1: vertical transform */\r\n    for (x = 0; x < N0; x++) {\r\n        /* copy */\r\n        for (y = 0, offset = 0; y < N1; y++, offset += 32) {\r\n            pExt[y << 1] = dst[x + offset];\r\n        }\r\n\r\n        /* reflection */\r\n        pExt[N0] = pExt[N0 - 2];\r\n\r\n        /* filtering (even pixel) */\r\n        for (y = 0; y <= N0; y += 2) {\r\n            pExt[y] >>= 1;\r\n        }\r\n\r\n        /* filtering (odd pixel) */\r\n        for (y = 1; y < N0; y += 2) {\r\n            pExt[y] = (pExt[y - 1] + pExt[y + 1]) >> 1;\r\n        }\r\n\r\n        /* copy */\r\n        for (y = 0, offset = 0; y < N0; y++, offset += N0) {\r\n            dst[x + offset] = pExt[y];\r\n        }\r\n    }\r\n\r\n    /* step 2: horizontal transform */\r\n    for (y = 0, offset = 0; y < N0; y++, offset += N0) {\r\n        /* copy */\r\n        for (x = 0; x < N1; x++) {\r\n            pExt[x << 1] = dst[offset + x];\r\n        }\r\n\r\n        /* reflection */\r\n        pExt[N0] = pExt[N0 - 2];\r\n\r\n        /* filtering (odd pixel) */\r\n        for (x = 1; x < N0; x += 2) {\r\n            pExt[x] = (pExt[x - 1] + pExt[x + 1]) >> 1;\r\n        }\r\n\r\n        /* copy */\r\n        memcpy(dst + offset, pExt, N0 * sizeof(coeff_t));\r\n    }\r\n}\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic void idct_16x4_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n#define BSIZE_H   16\r\n#define BSIZE_V   4\r\n    ALIGN32(coeff_t coeff[BSIZE_H * BSIZE_V]);\r\n    ALIGN32(coeff_t block[BSIZE_H * BSIZE_V]);\r\n    int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth;\r\n    int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1;\r\n    int i;\r\n\r\n    partialButterflyInverse4_c (src,   coeff, shift1, BSIZE_H, clip_depth1);\r\n    partialButterflyInverse16_c(coeff, block, shift2, BSIZE_V, clip_depth2);\r\n\r\n    for (i = 0; i < BSIZE_V; i++) {\r\n        memcpy(&dst[i * i_dst], &block[i * BSIZE_H], BSIZE_H * sizeof(coeff_t));\r\n    }\r\n#undef BSIZE_H\r\n#undef BSIZE_V\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void idct_4x16_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n#define BSIZE_H   4\r\n#define BSIZE_V   16\r\n    ALIGN32(coeff_t coeff[BSIZE_H * BSIZE_V]);\r\n    ALIGN32(coeff_t block[BSIZE_H * BSIZE_V]);\r\n    int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth;\r\n    int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1;\r\n    int i;\r\n\r\n    partialButterflyInverse16_c(src,   coeff, shift1, BSIZE_H, clip_depth1);\r\n    partialButterflyInverse4_c (coeff, block, shift2, BSIZE_V, clip_depth2);\r\n\r\n    for (i = 0; i < BSIZE_V; i++) {\r\n        memcpy(&dst[i * i_dst], &block[i * BSIZE_H], BSIZE_H * sizeof(coeff_t));\r\n    }\r\n#undef BSIZE_H\r\n#undef BSIZE_V\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * NOTE:\r\n * i_dst - the stride of dst (the lowest bit is additional wavelet flag)\r\n */\r\nstatic void idct_32x8_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n#define BSIZE_H   32\r\n#define BSIZE_V   8\r\n    ALIGN32(coeff_t coeff[BSIZE_H * BSIZE_V]);\r\n    ALIGN32(coeff_t 
block[BSIZE_H * BSIZE_V]);\r\n    int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth - (i_dst & 0x01);\r\n    int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1 + (i_dst & 0x01);\r\n    int i;\r\n\r\n    partialButterflyInverse8_c (src,   coeff, shift1, BSIZE_H, clip_depth1);\r\n    partialButterflyInverse32_c(coeff, block, shift2, BSIZE_V, clip_depth2);\r\n\r\n    i_dst &= 0xFE;\r\n    for (i = 0; i < BSIZE_V; i++) {\r\n        memcpy(&dst[i * i_dst], &block[i * BSIZE_H], BSIZE_H * sizeof(coeff_t));\r\n    }\r\n#undef BSIZE_H\r\n#undef BSIZE_V\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * NOTE:\r\n * i_dst - the stride of dst (the lowest bit is additional wavelet flag)\r\n */\r\nstatic void idct_8x32_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n#define BSIZE_H   8\r\n#define BSIZE_V   32\r\n    ALIGN32(coeff_t coeff[BSIZE_H * BSIZE_V]);\r\n    ALIGN32(coeff_t block[BSIZE_H * BSIZE_V]);\r\n    int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth - (i_dst & 0x01);\r\n    int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1 + (i_dst & 0x01);\r\n    int i;\r\n\r\n    partialButterflyInverse32_c(src,   coeff, shift1, BSIZE_H, clip_depth1);\r\n    partialButterflyInverse8_c (coeff, block, shift2, BSIZE_V, clip_depth2);\r\n\r\n    i_dst &= 0xFE;\r\n    for (i = 0; i < BSIZE_V; i++) {\r\n        memcpy(&dst[i * i_dst], &block[i * BSIZE_H], BSIZE_H * sizeof(coeff_t));\r\n    }\r\n#undef BSIZE_H\r\n#undef BSIZE_V\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void idct_64x16_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    ALIGN32(coeff_t row_buf[64 + LOT_MAX_WLT_TAP * 2]);\r\n    coeff_t *pExt = row_buf + LOT_MAX_WLT_TAP;\r\n    const int N0  = 64;\r\n    const int N1  = 16;\r\n    int x, y, offset;\r\n\r\n    /* step 0: idct 32x8 transform */\r\n    idct_32x8_c(src, dst, i_dst | 1);\r\n\r\n    /* 
step 1: vertical transform */\r\n    for (x = 0; x < (N0 >> 1); x++) {\r\n        /* copy */\r\n        for (y = 0, offset = 0; y < N1 >> 1; y++, offset += (N0 >> 1)) {\r\n            pExt[y << 1] = dst[x + offset];\r\n        }\r\n\r\n        /* reflection */\r\n        pExt[N1] = pExt[N1 - 2];\r\n\r\n        /* filtering (even pixel) */\r\n        for (y = 0; y <= N1; y += 2) {\r\n            pExt[y] >>= 1;\r\n        }\r\n\r\n        /* filtering (odd pixel) */\r\n        for (y = 1; y < N1; y += 2) {\r\n            pExt[y] = (pExt[y - 1] + pExt[y + 1]) >> 1;\r\n        }\r\n\r\n        /* copy */\r\n        for (y = 0, offset = 0; y < N1; y++, offset += N0) {\r\n            dst[x + offset] = pExt[y];\r\n        }\r\n    }\r\n\r\n    /* step 2: horizontal transform */\r\n    for (y = 0, offset = 0; y < N1; y++, offset += N0) {\r\n        /* copy */\r\n        for (x = 0; x < N0 >> 1; x++) {\r\n            pExt[x << 1] = dst[offset + x];\r\n        }\r\n\r\n        /* reflection */\r\n        pExt[N0] = pExt[N0 - 2];\r\n\r\n        /* filtering (odd pixel) */\r\n        for (x = 1; x < N0; x += 2) {\r\n            pExt[x] = (pExt[x - 1] + pExt[x + 1]) >> 1;\r\n        }\r\n\r\n        /* copy */\r\n        memcpy(dst + offset, pExt, N0 * sizeof(coeff_t));\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void idct_16x64_c(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    ALIGN32(coeff_t row_buf[64 + LOT_MAX_WLT_TAP * 2]);\r\n    coeff_t *pExt = row_buf + LOT_MAX_WLT_TAP;\r\n    const int N0 = 16;\r\n    const int N1 = 64;\r\n    int x, y, offset;\r\n\r\n    /* step 0: idct 8x32 transform */\r\n    idct_8x32_c(src, dst, i_dst | 1);\r\n\r\n    /* step 1: vertical transform */\r\n    for (x = 0; x < (N0 >> 1); x++) {\r\n        /* copy */\r\n        for (y = 0, offset = 0; y < N1 >> 1; y++, offset += (N0 >> 1)) {\r\n            pExt[y << 1] = dst[x + offset];\r\n        }\r\n\r\n        /* 
reflection */\r\n        pExt[N1] = pExt[N1 - 2];\r\n\r\n        /* filtering (even pixel) */\r\n        for (y = 0; y <= N1; y += 2) {\r\n            pExt[y] >>= 1;\r\n        }\r\n\r\n        /* filtering (odd pixel) */\r\n        for (y = 1; y < N1; y += 2) {\r\n            pExt[y] = (pExt[y - 1] + pExt[y + 1]) >> 1;\r\n        }\r\n\r\n        /* copy */\r\n        for (y = 0, offset = 0; y < N1; y++, offset += N0) {\r\n            dst[x + offset] = pExt[y];\r\n        }\r\n    }\r\n\r\n    /* step 2: horizontal transform */\r\n    for (y = 0, offset = 0; y < N1; y++, offset += N0) {\r\n        /* copy */\r\n        for (x = 0; x < N0 >> 1; x++) {\r\n            pExt[x << 1] = dst[offset + x];\r\n        }\r\n\r\n        /* reflection */\r\n        pExt[N0] = pExt[N0 - 2];\r\n\r\n        /* filtering (odd pixel) */\r\n        for (x = 1; x < N0; x += 2) {\r\n            pExt[x] = (pExt[x - 1] + pExt[x + 1]) >> 1;\r\n        }\r\n\r\n        /* copy */\r\n        memcpy(dst + offset, pExt, N0 * sizeof(coeff_t));\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void xTr2nd_4_1d_Inv_Ver(coeff_t *coeff, int i_coeff, int i_shift, const int16_t *tc)\r\n{\r\n    int tmp_dct[SEC_TR_SIZE * SEC_TR_SIZE];\r\n    const int add = 1 << (i_shift - 1);\r\n    int i, j, k, sum;\r\n\r\n    for (i = 0; i < SEC_TR_SIZE; i++) {\r\n        for (j = 0; j < SEC_TR_SIZE; j++) {\r\n            tmp_dct[i * SEC_TR_SIZE + j] = coeff[i * i_coeff + j];\r\n        }\r\n    }\r\n\r\n    for (i = 0; i < SEC_TR_SIZE; i++) {\r\n        for (j = 0; j < SEC_TR_SIZE; j++) {\r\n            sum = add;\r\n            for (k = 0; k < SEC_TR_SIZE; k++) {\r\n                sum += tc[k * SEC_TR_SIZE + i] * tmp_dct[k * SEC_TR_SIZE + j];\r\n            }\r\n            coeff[i * i_coeff + j] = (coeff_t)DAVS2_CLIP3(-32768, 32767, sum >> i_shift);\r\n        }\r\n    }\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nstatic void xTr2nd_4_1d_Inv_Hor(coeff_t *coeff, int i_coeff, int i_shift, int clip_depth, const int16_t *tc)\r\n{\r\n    int tmp_dct[SEC_TR_SIZE * SEC_TR_SIZE];\r\n    const int max_val = (1 << (clip_depth - 1)) - 1;\r\n    const int min_val = -max_val - 1;\r\n    const int add = 1 << (i_shift - 1);\r\n    int i, j, k, sum;\r\n\r\n    for (i = 0; i < SEC_TR_SIZE; i++) {\r\n        for (j = 0; j < SEC_TR_SIZE; j++) {\r\n            tmp_dct[i * SEC_TR_SIZE + j] = coeff[i * i_coeff + j];\r\n        }\r\n    }\r\n\r\n    for (i = 0; i < SEC_TR_SIZE; i++) {\r\n        for (j = 0; j < SEC_TR_SIZE; j++) {\r\n            sum = add;\r\n            for (k = 0; k < SEC_TR_SIZE; k++) {\r\n                sum += tc[k * SEC_TR_SIZE + i] * tmp_dct[j * SEC_TR_SIZE + k];\r\n            }\r\n            coeff[j * i_coeff + i] = (coeff_t)DAVS2_CLIP3(min_val, max_val, sum >> i_shift);\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nstatic void inv_transform_4x4_2nd_c(coeff_t *coeff, int i_coeff)\r\n{\r\n    const int shift1 = 5;\r\n    const int shift2 = 20 - g_bit_depth + 2;\r\n    const int clip_depth2 = g_bit_depth + 1;\r\n\r\n    xTr2nd_4_1d_Inv_Ver(coeff, i_coeff, shift1, g_2T_C);\r\n    xTr2nd_4_1d_Inv_Hor(coeff, i_coeff, shift2, clip_depth2, g_2T_C);\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * i_mode - real intra mode (luma)\r\n * b_top  - block top available?\r\n * b_left - block left available?\r\n */\r\nstatic void inv_transform_2nd_c(coeff_t *coeff, int i_coeff, int i_mode, int b_top, int b_left)\r\n{\r\n    int vt = (i_mode >=  0 && i_mode <= 23);\r\n    int ht = (i_mode >= 13 && i_mode <= 32) || (i_mode >= 0 && i_mode <= 2);\r\n\r\n    if (ht && b_left) {\r\n        xTr2nd_4_1d_Inv_Hor(coeff, i_coeff, 7, 16, g_2T);\r\n    }\r\n    if (vt && b_top) {\r\n       
 xTr2nd_4_1d_Inv_Ver(coeff, i_coeff, 7, g_2T);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic INLINE\r\nvoid inv_transform(davs2_row_rec_t *row_rec, coeff_t *p_coeff, cu_t *p_cu, int i_coeff, int bsx, int bsy,\r\n                   int b_secT, int blockidx, int i_luma_intra_mode)\r\n{\r\n    int part_idx = PART_INDEX(bsx, bsy);\r\n    dct_t idct = gf_davs2.idct[part_idx][p_cu->dct_pattern[blockidx]];\r\n\r\n    b_secT = b_secT && IS_INTRA(p_cu) && blockidx < 4;\r\n\r\n    if (part_idx == PART_4x4) {\r\n        if (b_secT) {\r\n            gf_davs2.inv_transform_4x4_2nd(p_coeff, i_coeff);\r\n        } else {\r\n            idct(p_coeff, p_coeff, i_coeff);\r\n        }\r\n    } else {\r\n        if (b_secT) {\r\n            gf_davs2.inv_transform_2nd(p_coeff, i_coeff, i_luma_intra_mode, row_rec->b_block_avail_top, row_rec->b_block_avail_left);\r\n        }\r\n        idct(p_coeff, p_coeff, i_coeff);\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * copy region of h->lcu.residual[] corresponding to blockidx to p_dst\r\n */\r\nstatic ALWAYS_INLINE\r\ncoeff_t *get_quanted_coeffs(davs2_row_rec_t *row_rec, cu_t *p_cu, int blockidx)\r\n{\r\n    int idx_cu_zscan = row_rec->idx_cu_zscan;\r\n    coeff_t *p_res;\r\n\r\n    if (blockidx < 4) {\r\n        int block_offset = blockidx << ((p_cu->i_cu_level - 1) << 1);\r\n        p_res = &row_rec->p_rec_info->coeff_buf_y[idx_cu_zscan << 6];\r\n        p_res += block_offset;\r\n    } else {\r\n        p_res = &row_rec->p_rec_info->coeff_buf_uv[blockidx - 4][idx_cu_zscan << 4];\r\n    }\r\n\r\n    return p_res;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * get reconstruction pixels for blocks (include luma and chroma component)\r\n */\r\nvoid davs2_get_recons(davs2_row_rec_t *row_rec, cu_t *p_cu, int blockidx, cb_t *p_tu, int ctu_x, int ctu_y)\r\n{\r\n 
   int bsx     = p_tu->w;\r\n    int bsy     = p_tu->h;\r\n    int x_start = p_tu->x;\r\n    int y_start = p_tu->y;\r\n    int b_luma  = blockidx < 4;\r\n    int b_wavelet_conducted = (b_luma && p_cu->i_cu_level == B64X64_IN_BIT && p_cu->i_trans_size != TU_SPLIT_CROSS);\r\n    coeff_t *p_coeff;\r\n    pel_t *p_dst;\r\n    int i_coeff;\r\n    int i_dst;\r\n    davs2_t *h = row_rec->h;\r\n\r\n    assert(((p_cu->i_cbp >> blockidx) & 1) != 0);\r\n\r\n    // inverse transform\r\n    p_tu->v >>= b_wavelet_conducted;\r\n    i_coeff = p_tu->w;\r\n    p_coeff = get_quanted_coeffs(row_rec, p_cu, blockidx);\r\n    inv_transform(row_rec, p_coeff, p_cu, i_coeff, bsx, bsy, h->seq_info.enable_2nd_transform, blockidx, p_cu->intra_pred_modes[blockidx]);\r\n    i_coeff <<= b_wavelet_conducted;\r\n\r\n    if (b_luma) {\r\n        x_start += ctu_x;\r\n        y_start += ctu_y;\r\n\r\n        i_dst    = row_rec->ctu.i_fdec[0];\r\n        p_dst    = row_rec->ctu.p_fdec[0] + y_start * i_dst + x_start;\r\n    } else {\r\n        x_start  = (ctu_x >> 1);\r\n        y_start  = (ctu_y >> 1);\r\n\r\n        i_dst    = row_rec->ctu.i_fdec[blockidx - 3];\r\n        p_dst    = row_rec->ctu.p_fdec[blockidx - 3] + y_start * i_dst + x_start;\r\n    }\r\n\r\n    // normalize\r\n    gf_davs2.add_ps[PART_INDEX(bsx, bsy)](p_dst, i_dst, p_dst, p_coeff, i_dst, i_coeff);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid davs2_dct_init(uint32_t cpuid, ao_funcs_t *fh)\r\n{\r\n    int i;\r\n    UNUSED_PARAMETER(cpuid);\r\n\r\n    /* init c function handles */\r\n    fh->inv_transform_4x4_2nd = inv_transform_4x4_2nd_c;\r\n    fh->inv_transform_2nd     = inv_transform_2nd_c;\r\n\r\n    for (i = 0; i < DCT_PATTERN_NUM; i++) {\r\n        fh->idct[PART_4x4  ][i] = idct_4x4_c;\r\n        fh->idct[PART_8x8  ][i] = idct_8x8_c;\r\n        fh->idct[PART_16x16][i] = idct_16x16_c;\r\n        fh->idct[PART_32x32][i] = idct_32x32_c;\r\n        fh->idct[PART_64x64][i] 
= idct_64x64_c;\r\n\r\n        fh->idct[PART_4x16 ][i] = idct_4x16_c;\r\n        fh->idct[PART_8x32 ][i] = idct_8x32_c;\r\n        fh->idct[PART_16x4 ][i] = idct_16x4_c;\r\n        fh->idct[PART_32x8 ][i] = idct_32x8_c;\r\n        fh->idct[PART_64x16][i] = idct_64x16_c;\r\n        fh->idct[PART_16x64][i] = idct_16x64_c;\r\n    }\r\n\r\n    /* init asm function handles */\r\n#if HAVE_MMX\r\n    /* functions defined in file intrinsic_dct.c */\r\n    if (cpuid & DAVS2_CPU_SSE2) {\r\n        fh->inv_transform_4x4_2nd = inv_transform_4x4_2nd_sse128;\r\n        fh->inv_transform_2nd     = inv_transform_2nd_sse128;\r\n\r\n        for (i = 0; i < DCT_PATTERN_NUM; i++) {\r\n            fh->idct[PART_4x4  ][i] = idct_4x4_sse128;\r\n            fh->idct[PART_8x8  ][i] = idct_8x8_sse128;\r\n            fh->idct[PART_16x16][i] = idct_16x16_sse128;\r\n            fh->idct[PART_32x32][i] = idct_32x32_sse128;\r\n            fh->idct[PART_64x64][i] = idct_64x64_sse128;\r\n            fh->idct[PART_64x16][i] = idct_64x16_sse128;\r\n            fh->idct[PART_16x64][i] = idct_16x64_sse128;\r\n\r\n            fh->idct[PART_4x16][i] = idct_4x16_sse128;\r\n            fh->idct[PART_8x32][i] = idct_8x32_sse128;\r\n            fh->idct[PART_16x4][i] = idct_16x4_sse128;\r\n            fh->idct[PART_32x8][i] = idct_32x8_sse128;\r\n\r\n#if !HIGH_BIT_DEPTH\r\n            fh->idct[PART_4x4 ][i] = FPFX(idct_4x4_sse2);\r\n#if ARCH_X86_64\r\n            fh->idct[PART_8x8 ][i] = FPFX(idct_8x8_sse2);\r\n#endif\r\n#endif\r\n        }\r\n    }\r\n\r\n    if (cpuid & DAVS2_CPU_SSSE3) {\r\n        for (i = 0; i < DCT_PATTERN_NUM; i++) {\r\n#if HIGH_BIT_DEPTH\r\n            // 10-bit assembly\r\n#else\r\n            fh->idct[PART_8x8 ][i] = davs2_idct_8x8_ssse3;\r\n#endif\r\n        }\r\n    }\r\n\r\n    /* TODO: initialize the non-default DCT patterns */\r\n    if (cpuid & DAVS2_CPU_SSE2) {\r\n        /* square */\r\n        fh->idct[PART_8x8  ][DCT_HALF] = idct_8x8_half_sse128;\r\n        fh->idct[PART_8x8  ][DCT_QUAD] = 
idct_8x8_quad_sse128;\r\n        fh->idct[PART_16x16][DCT_HALF] = idct_16x16_half_sse128;\r\n        fh->idct[PART_16x16][DCT_QUAD] = idct_16x16_quad_sse128;\r\n        fh->idct[PART_32x32][DCT_HALF] = idct_32x32_half_sse128;\r\n        fh->idct[PART_32x32][DCT_QUAD] = idct_32x32_quad_sse128;\r\n        fh->idct[PART_64x64][DCT_HALF] = idct_64x64_half_sse128;\r\n        fh->idct[PART_64x64][DCT_QUAD] = idct_64x64_quad_sse128;\r\n\r\n        /* non-square */\r\n        fh->idct[PART_4x16 ][DCT_HALF] = idct_4x16_half_sse128;\r\n        fh->idct[PART_4x16 ][DCT_QUAD] = idct_4x16_quad_sse128;\r\n        fh->idct[PART_16x4 ][DCT_HALF] = idct_16x4_half_sse128;\r\n        fh->idct[PART_16x4 ][DCT_QUAD] = idct_16x4_quad_sse128;\r\n        fh->idct[PART_8x32 ][DCT_QUAD] = idct_8x32_quad_sse128;\r\n        fh->idct[PART_8x32 ][DCT_HALF] = idct_8x32_half_sse128;\r\n        fh->idct[PART_32x8 ][DCT_HALF] = idct_32x8_half_sse128;\r\n        fh->idct[PART_32x8 ][DCT_QUAD] = idct_32x8_quad_sse128;\r\n        fh->idct[PART_16x64][DCT_HALF] = idct_16x64_half_sse128;\r\n        fh->idct[PART_16x64][DCT_QUAD] = idct_16x64_quad_sse128;\r\n        fh->idct[PART_64x16][DCT_HALF] = idct_64x16_half_sse128;\r\n        fh->idct[PART_64x16][DCT_QUAD] = idct_64x16_quad_sse128;\r\n    }\r\n\r\n#if ARCH_X86_64\r\n    if (cpuid & DAVS2_CPU_AVX2) {\r\n        fh->idct[PART_8x8  ][DCT_DEAULT]   = idct_8x8_avx2;\r\n        fh->idct[PART_16x16][DCT_DEAULT] = idct_16x16_avx2;\r\n        fh->idct[PART_64x64][DCT_DEAULT] = idct_64x64_avx2;\r\n        fh->idct[PART_64x16][DCT_DEAULT] = idct_64x16_avx2;\r\n        fh->idct[PART_16x64][DCT_DEAULT] = idct_16x64_avx2;\r\n        fh->idct[PART_32x32][DCT_DEAULT] = idct_32x32_avx2;    // @luofl i7-6700k: speed compared with sse128\r\n\r\n        /* square */\r\n        // fh->idct[PART_8x8  ][DCT_HALF] = idct_8x8_half_avx2;\r\n        // fh->idct[PART_8x8  ][DCT_QUAD] = idct_8x8_quad_avx2;\r\n        // fh->idct[PART_16x16][DCT_HALF] = idct_16x16_half_avx2;\r\n        // 
fh->idct[PART_16x16][DCT_QUAD] = idct_16x16_quad_avx2;\r\n        // fh->idct[PART_32x32][DCT_HALF] = idct_32x32_half_avx2;\r\n        // fh->idct[PART_32x32][DCT_QUAD] = idct_32x32_quad_avx2;\r\n        // fh->idct[PART_64x64][DCT_HALF] = idct_64x64_half_avx2;\r\n        // fh->idct[PART_64x64][DCT_QUAD] = idct_64x64_quad_avx2;\r\n\r\n        /* non-square */\r\n        // fh->idct[PART_4x16 ][DCT_HALF] = idct_4x16_half_avx2;\r\n        // fh->idct[PART_4x16 ][DCT_QUAD] = idct_4x16_quad_avx2;\r\n        // fh->idct[PART_16x4 ][DCT_HALF] = idct_16x4_half_avx2;\r\n        // fh->idct[PART_16x4 ][DCT_QUAD] = idct_16x4_quad_avx2;\r\n        // fh->idct[PART_8x32 ][DCT_QUAD] = idct_8x32_quad_avx2;\r\n        // fh->idct[PART_8x32 ][DCT_HALF] = idct_8x32_half_avx2;\r\n        // fh->idct[PART_32x8 ][DCT_HALF] = idct_32x8_half_avx2;\r\n        // fh->idct[PART_32x8 ][DCT_QUAD] = idct_32x8_quad_avx2;\r\n        // fh->idct[PART_16x64][DCT_HALF] = idct_16x64_half_avx2;\r\n        // fh->idct[PART_16x64][DCT_QUAD] = idct_16x64_quad_avx2;\r\n        // fh->idct[PART_64x16][DCT_HALF] = idct_64x16_half_avx2;\r\n        // fh->idct[PART_64x16][DCT_QUAD] = idct_64x16_quad_avx2;\r\n    }\r\n#endif  // if ARCH_X86_64\r\n#endif  // if HAVE_MMX\r\n}\r\n"
  },
  {
    "path": "source/common/transform.h",
    "content": "/*\r\n *  transform.h\r\n *\r\n * Description of this file:\r\n *    Transform functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_TRANSFORM_H\r\n#define DAVS2_TRANSFORM_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#define davs2_dct_init FPFX(dct_init)\r\nvoid davs2_dct_init(uint32_t cpuid, ao_funcs_t *fh);\r\n\r\n#define davs2_get_recons FPFX(get_recons)\r\nvoid davs2_get_recons(davs2_row_rec_t *row_rec, cu_t *p_cu, int blockidx, cb_t *p_tu, int ctu_x, int ctu_y);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_TRANSFORM_H\r\n"
  },
  {
    "path": "source/common/vec/intrinsic.cc",
    "content": "/*\r\n * intrinsic.cc\r\n *\r\n * Description of this file:\r\n *    tables used in SIMD assembly functions of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n\r\n#if HIGH_BIT_DEPTH\r\n\r\nALIGN32(const int16_t intrinsic_mask_10bit[15][16]) = {\r\n    { -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1,  0,  0,  0,  0, 
 0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0 }\r\n};\r\n#else\r\nALIGN32(const int8_t intrinsic_mask[15][16]) = {\r\n    { -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0 }\r\n};\r\n\r\n#endif // #if 
!HIGH_BIT_DEPTH\r\n\r\n\r\nALIGN32(const int8_t intrinsic_mask_256_8bit[16][32]) = {\r\n    { -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  
0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 }\r\n\r\n};\r\n\r\n\r\nALIGN32(const int8_t intrinsic_mask32[32][32]) = {\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 },\r\n    { -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0, 
 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0 },\r\n 
   { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0 },\r\n    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0 }\r\n};\r\n\r\nALIGN32(const int8_t tab_log2[65]) = {\r\n    -1,\r\n    0, 1, -1, 2, -1, -1, -1, 3,\r\n    -1, -1, -1, -1, -1, -1, -1, 4,\r\n    -1, -1, -1, -1, -1, -1, -1, -1,\r\n    -1, -1, -1, -1, -1, -1, -1, 5,\r\n    -1, -1, -1, -1, -1, -1, -1, -1,\r\n    -1, -1, -1, -1, -1, -1, -1, -1,\r\n    -1, -1, -1, -1, -1, -1, -1, -1,\r\n    -1, -1, -1, -1, -1, -1, -1, 6\r\n};\r\n\r\nconst uint8_t tab_idx_mode_7[64] = {\r\n    0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 7, 8, 9, 10, 10, 11, 12, 13, 13, 14, 15, 15, 16,\r\n    17, 18, 18, 19, 20, 21, 21, 22, 23, 23, 24, 25, 26, 26, 27, 28, 29, 29, 30, 31, 31,\r\n    32, 33, 34, 34, 35, 36, 37, 37, 38, 39, 39, 40, 41, 42, 42, 43, 44, 45, 45, 46\r\n};\r\n\r\nALIGN16(const pel_t tab_coeff_mode_7[64][16]) = {\r\n    { 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23 },//0\r\n    { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },\r\n    { 3, 35, 61, 29, 3, 35, 61, 29, 3, 35, 61, 29, 3, 35, 61, 29 },\r\n    { 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20 },\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 
53, 43, 11 },\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 6, 38, 58, 26, 6, 38, 58, 26, 6, 38, 58, 26, 6, 38, 58, 26 },\r\n    { 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17 },//8\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8 },\r\n    { 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31 },\r\n    { 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23 },\r\n    { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20 },\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11 },//16\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25 },\r\n    { 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8 },\r\n    { 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31 },\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22 },\r\n    { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },//24\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19 },\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11 },\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25 },\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8 },\r\n    { 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31 },//32\r\n    { 10, 
42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22 },\r\n    { 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19 },\r\n    { 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10 },\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25 },//40\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16 },\r\n    { 25, 57, 39, 7, 25, 57, 39, 7, 25, 57, 39, 7, 25, 57, 39, 7 },\r\n    { 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31 },\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22 },\r\n    { 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13 },\r\n    { 28, 60, 36, 4, 28, 60, 36, 4, 28, 60, 36, 4, 28, 60, 36, 4 },\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19 },//48\r\n    { 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10 },\r\n    { 31, 63, 33, 1, 31, 63, 33, 1, 31, 63, 33, 1, 31, 63, 33, 1 },\r\n    { 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25 },\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16 },\r\n    { 25, 57, 39, 7, 25, 57, 39, 7, 25, 57, 39, 7, 25, 57, 39, 7 },\r\n    { 2, 34, 62, 30, 2, 34, 62, 30, 2, 34, 62, 30, 2, 34, 62, 30 },\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22 },\r\n    { 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13 },//56\r\n    { 28, 60, 36, 4, 28, 60, 36, 4, 28, 60, 36, 4, 28, 60, 36, 4 },\r\n    { 5, 37, 59, 27, 5, 37, 59, 27, 5, 37, 59, 27, 5, 37, 59, 27 },\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19 },\r\n    { 22, 54, 42, 10, 22, 
54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10 },\r\n    { 31, 63, 33, 1, 31, 63, 33, 1, 31, 63, 33, 1, 31, 63, 33, 1 },\r\n    { 8, 40, 56, 24, 8, 40, 56, 24, 8, 40, 56, 24, 8, 40, 56, 24 },\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16 }//63\r\n};\r\n\r\nALIGN32(const pel_t tab_coeff_mode_7_avx[64][32]) = {\r\n    {  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23},//0\r\n    { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14},\r\n    { 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5},\r\n    {  3, 35, 61, 29,  3, 35, 61, 29,  3, 35, 61, 29,  3, 35, 61, 29,  3, 35, 61, 29,  3, 35, 61, 29,  3, 35, 61, 29,  3, 35, 61, 29},\r\n    { 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20},\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11},\r\n    { 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2},\r\n    {  6, 38, 58, 26,  6, 38, 58, 26,  6, 38, 58, 26,  6, 38, 58, 26,  6, 38, 58, 26,  6, 38, 58, 26,  6, 38, 58, 26,  6, 38, 58, 26},\r\n    { 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17},//8\r\n    { 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8},\r\n    {  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31},\r\n    {  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23,  9, 41, 55, 23},\r\n  
  { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14},\r\n    { 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5},\r\n    {  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28},\r\n    { 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20},\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11},//16\r\n    { 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2},\r\n    {  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25},\r\n    { 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17},\r\n    { 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8},\r\n    {  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31},\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22},\r\n    { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14},\r\n    { 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5},//24\r\n    {  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28},\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 
51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19},\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11},\r\n    { 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2},\r\n    {  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25},\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16},\r\n    { 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8, 24, 56, 40,  8},\r\n    {  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31},//32\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22},\r\n    { 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13},\r\n    { 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5, 27, 59, 37,  5},\r\n    {  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28},\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19},\r\n    { 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10},\r\n    { 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2, 30, 62, 34,  2},\r\n    {  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 
57, 25},//40\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16},\r\n    { 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7},\r\n    {  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31,  1, 33, 63, 31},\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22},\r\n    { 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13},\r\n    { 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4},\r\n    {  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28,  4, 36, 60, 28},\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19},//48\r\n    { 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10},\r\n    { 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1},\r\n    {  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25,  7, 39, 57, 25},\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16},\r\n    { 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7, 25, 57, 39,  7},\r\n    {  2, 34, 62, 30,  2, 34, 62, 30,  2, 34, 62, 30,  2, 34, 62, 30,  2, 34, 62, 30,  2, 34, 62, 30,  2, 34, 62, 30,  2, 34, 62, 30},\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 
54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22},\r\n    { 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13},//56\r\n    { 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4, 28, 60, 36,  4},\r\n    {  5, 37, 59, 27,  5, 37, 59, 27,  5, 37, 59, 27,  5, 37, 59, 27,  5, 37, 59, 27,  5, 37, 59, 27,  5, 37, 59, 27,  5, 37, 59, 27},\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19},\r\n    { 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10, 22, 54, 42, 10},\r\n    { 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1, 31, 63, 33,  1},\r\n    {  8, 40, 56, 24,  8, 40, 56, 24,  8, 40, 56, 24,  8, 40, 56, 24,  8, 40, 56, 24,  8, 40, 56, 24,  8, 40, 56, 24,  8, 40, 56, 24},\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16}//63\r\n};\r\n\r\n#if HIGH_BIT_DEPTH\r\nALIGN16(const int16_t tab_coeff_mode_9[64][16]) = {\r\n#else\r\nALIGN16(const int8_t tab_coeff_mode_9[64][16]) = {\r\n#endif\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11 },\r\n    { 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23 },\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14 },\r\n    { 6, 38, 58, 26, 6, 38, 58, 26, 6, 38, 58, 26, 6, 38, 58, 26 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },\r\n    { 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17 },\r\n    { 3, 35, 61, 29, 3, 35, 61, 29, 3, 35, 61, 29, 3, 35, 61, 29 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 
56, 40, 8 },\r\n    { 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20 },\r\n    { 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31 },\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11 },\r\n    { 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23 },\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14 },\r\n    { 6, 38, 58, 26, 6, 38, 58, 26, 6, 38, 58, 26, 6, 38, 58, 26 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },\r\n    { 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17 },\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8 },\r\n    { 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20 },\r\n    { 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31 },\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11 },\r\n    { 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23, 9, 41, 55, 23 },\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14 },\r\n    { 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },\r\n    { 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17 },\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8 },\r\n    { 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20 },\r\n    { 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31 },\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11 },\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22 },\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 18, 50, 46, 14, 
18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14 },\r\n    { 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },\r\n    { 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17, 15, 47, 49, 17 },\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8 },\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19 },\r\n    { 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31 },\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11 },\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22 },\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14, 18, 50, 46, 14 },\r\n    { 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16 },\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8 },\r\n    { 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19, 13, 45, 51, 19 },\r\n    { 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31, 1, 33, 63, 31 },\r\n    { 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11, 21, 53, 43, 11 },\r\n    { 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22, 10, 42, 54, 22 },\r\n    { 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2, 30, 62, 34, 2 },\r\n    { 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13, 19, 51, 45, 13 },\r\n    { 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25, 7, 39, 57, 25 },\r\n    { 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5, 27, 59, 37, 5 },\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16 },\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 
56, 40, 8 }\r\n};\r\nconst uint8_t tab_idx_mode_9[64] = {\r\n    0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9, 9,\r\n    9, 10, 10, 10, 11, 11, 11, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15, 15, 16,\r\n    16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 21, 21, 21, 22, 22, 22, 23\r\n};\r\n\r\n#if HIGH_BIT_DEPTH\r\nconst ALIGN16(int16_t tab_coeff_mode_11[64][16]) = {\r\n#else\r\nconst ALIGN16(int8_t tab_coeff_mode_11[64][16]) = {\r\n#endif\r\n    { 28, 60, 36, 4, 28, 60, 36, 4, 28, 60, 36, 4, 28, 60, 36, 4 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8 },\r\n    { 20, 52, 44, 12, 20, 52, 44, 12, 20, 52, 44, 12, 20, 52, 44, 12 },\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16 },\r\n    { 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20 },\r\n    { 8, 40, 56, 24, 8, 40, 56, 24, 8, 40, 56, 24, 8, 40, 56, 24 },\r\n    { 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 32, 64, 32, 0, 32, 64, 32, 0, 32, 64, 32, 0, 32, 64, 32, 0 }\r\n};\r\n"
  },
  {
    "path": "source/common/vec/intrinsic.h",
    "content": "/*\r\n * intrinsic.h\r\n *\r\n * Description of this file:\r\n *    SIMD assembly functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_INTRINSIC_H\r\n#define DAVS2_INTRINSIC_H\r\n\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n#if !defined(_MSC_VER) && !defined(__INTEL_COMPILER)\r\n#define __int64     long long\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * global variables\r\n */\r\n#define intrinsic_mask FPFX(intrinsic_mask)\r\nALIGN32(extern const int8_t  intrinsic_mask[15][16]);\r\n#define intrinsic_mask_256_8bit FPFX(intrinsic_mask_256_8bit)\r\nALIGN32(extern const int8_t  intrinsic_mask_256_8bit[16][32]);\r\n#define intrinsic_mask32 
FPFX(intrinsic_mask32)\r\nALIGN32(extern const int8_t  intrinsic_mask32[32][32]);\r\n#define intrinsic_mask_10bit FPFX(intrinsic_mask_10bit)\r\nALIGN32(extern const int16_t intrinsic_mask_10bit[15][16]);\r\n#define tab_log2 FPFX(tab_log2)\r\nALIGN32(extern const int8_t tab_log2[65]);\r\n#define tab_coeff_mode_7 FPFX(tab_coeff_mode_7)\r\nALIGN16(extern const pel_t tab_coeff_mode_7[64][16]);\r\n#define tab_idx_mode_7 FPFX(tab_idx_mode_7)\r\nALIGN32(extern const uint8_t tab_idx_mode_7[64]);\r\n#define tab_coeff_mode_7_avx FPFX(tab_coeff_mode_7_avx)\r\nALIGN32(extern const pel_t tab_coeff_mode_7_avx[64][32]);\r\n\r\n#if HIGH_BIT_DEPTH\r\n#define tab_coeff_mode_9 FPFX(tab_coeff_mode_9)\r\nALIGN16(extern const int16_t tab_coeff_mode_9[64][16]);\r\n#else\r\n#define tab_coeff_mode_9 FPFX(tab_coeff_mode_9)\r\nALIGN16(extern const int8_t tab_coeff_mode_9[64][16]);\r\n#endif\r\n\r\n#define tab_idx_mode_9 FPFX(tab_idx_mode_9)\r\nextern const uint8_t tab_idx_mode_9[64];\r\n#if HIGH_BIT_DEPTH\r\n#define tab_coeff_mode_11 FPFX(tab_coeff_mode_11)\r\nALIGN16(extern const int16_t tab_coeff_mode_11[64][16]);\r\n#else\r\n#define tab_coeff_mode_11 FPFX(tab_coeff_mode_11)\r\nALIGN16(extern const int8_t tab_coeff_mode_11[64][16]);\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * macros used for quick access of __m128i\r\n */\r\n#define M128_U64(mx, idx)  _mm_extract_epi64(mx, idx)\r\n#define M128_U32(mx, idx)  _mm_extract_epi32(mx, idx)\r\n#define M128_I32(mx, idx)  _mm_extract_epi32(mx, idx)\r\n#define M128_U16(mx, idx)  _mm_extract_epi16(mx, idx)\r\n#define M128_I16(mx, idx)  _mm_extract_epi16(mx, idx)\r\n\r\n\r\n#if _MSC_VER\r\n// Added macro definitions; the current immintrin.h does not define these.      zhangjiaqi 2016-12-02\r\n#define _mm256_extract_epi64(a, i) (a.m256i_i64[i])\r\n#define _mm256_extract_epi32(a, i) (a.m256i_i32[i])\r\n#define _mm256_extract_epi16(a, i) (a.m256i_i16[i])\r\n#define _mm256_extract_epi8(a, i)  (a.m256i_i8 [i])\r\n#define _mm256_insert_epi64(a, v, i) 
(a.m256i_i64[i] = v)\r\n#define _mm_extract_epi64(r, i) r.m128i_i64[i]\r\n#else\r\n// Add the AVX intrinsics that gcc lacks\r\n#define _mm256_set_m128i(/* __m128i */ hi, /* __m128i */ lo) \\\r\n            _mm256_insertf128_si256(_mm256_castsi128_si256(lo), (hi), 0x1)\r\n#define _mm256_loadu2_m128i(/* __m128i const* */ hiaddr, \\\r\n                            /* __m128i const* */ loaddr) \\\r\n            _mm256_set_m128i(_mm_loadu_si128(hiaddr), _mm_loadu_si128(loaddr))\r\n#define _mm256_storeu2_m128i(/* __m128i* */ hiaddr, /* __m128i* */ loaddr, \\\r\n                             /* __m256i */ a) \\\r\n    do { \\\r\n        __m256i _a = (a); /* reference a only once in macro body */ \\\r\n        _mm_storeu_si128((loaddr), _mm256_castsi256_si128(_a)); \\\r\n        _mm_storeu_si128((hiaddr), _mm256_extractf128_si256(_a, 0x1)); \\\r\n    } while (0)\r\n#endif\r\n\r\n#define davs2_memzero_aligned_c_sse2 FPFX(memzero_aligned_c_sse2)\r\nvoid *davs2_memzero_aligned_c_sse2(void *dst, size_t n);\r\n#define davs2_memzero_aligned_c_avx FPFX(memzero_aligned_c_avx)\r\nvoid *davs2_memzero_aligned_c_avx (void *dst, size_t n);\r\n#define davs2_memcpy_aligned_c_sse2 FPFX(memcpy_aligned_c_sse2)\r\nvoid *davs2_memcpy_aligned_c_sse2 (void *dst, const void *src, size_t n);\r\n\r\n#define davs2_memcpy_aligned_mmx FPFX(memcpy_aligned_mmx)\r\nvoid *davs2_memcpy_aligned_mmx(void *dst, const void *src, size_t n);\r\n#define davs2_memcpy_aligned_sse FPFX(memcpy_aligned_sse)\r\nvoid *davs2_memcpy_aligned_sse(void *dst, const void *src, size_t n);\r\n\r\n#define davs2_fast_memcpy_mmx FPFX(fast_memcpy_mmx)\r\nvoid *davs2_fast_memcpy_mmx(void *dst, const void *src, size_t n);\r\n#define davs2_fast_memset_mmx FPFX(fast_memset_mmx)\r\nvoid *davs2_fast_memset_mmx(void *dst, int val, size_t n);\r\n\r\n#define davs2_memzero_aligned_mmx FPFX(memzero_aligned_mmx)\r\nvoid *davs2_memzero_aligned_mmx  (void *dst, size_t n);\r\n#define davs2_memzero_aligned_sse FPFX(memzero_aligned_sse)\r\nvoid *davs2_memzero_aligned_sse  (void 
*dst, size_t n);\r\n#define davs2_memzero_aligned_avx FPFX(memzero_aligned_avx)\r\nvoid *davs2_memzero_aligned_avx  (void *dst, size_t n);\r\n\r\n#define davs2_fast_memzero_mmx FPFX(fast_memzero_mmx)\r\nvoid *davs2_fast_memzero_mmx     (void *dst, size_t n);\r\n\r\n#define plane_copy_c_sse2 FPFX(plane_copy_c_sse2)\r\nvoid plane_copy_c_sse2          (pel_t *dst, intptr_t i_dst, pel_t *src, intptr_t i_src, int w, int h);\r\n\r\n#define intpl_copy_block_sse128 FPFX(intpl_copy_block_sse128)\r\nvoid intpl_copy_block_sse128    (pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height);\r\n\r\n#define intpl_luma_block_hor_sse128 FPFX(intpl_luma_block_hor_sse128)\r\nvoid intpl_luma_block_hor_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ver_sse128 FPFX(intpl_luma_block_ver_sse128)\r\nvoid intpl_luma_block_ver_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ver0_sse128 FPFX(intpl_luma_block_ver0_sse128)\r\nvoid intpl_luma_block_ver0_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ver1_sse128 FPFX(intpl_luma_block_ver1_sse128)\r\nvoid intpl_luma_block_ver1_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ver2_sse128 FPFX(intpl_luma_block_ver2_sse128)\r\nvoid intpl_luma_block_ver2_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ext_sse128 FPFX(intpl_luma_block_ext_sse128)\r\nvoid intpl_luma_block_ext_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff_h, const int8_t *coeff_v);\r\n\r\n#define intpl_chroma_block_hor_sse128 FPFX(intpl_chroma_block_hor_sse128)\r\nvoid intpl_chroma_block_hor_sse128(pel_t *dst, int i_dst, pel_t *src, int 
i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_chroma_block_ver_sse128 FPFX(intpl_chroma_block_ver_sse128)\r\nvoid intpl_chroma_block_ver_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_chroma_block_ext_sse128 FPFX(intpl_chroma_block_ext_sse128)\r\nvoid intpl_chroma_block_ext_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff_h, const int8_t *coeff_v);\r\n\r\n#define intpl_luma_block_hor_avx2 FPFX(intpl_luma_block_hor_avx2)\r\nvoid intpl_luma_block_hor_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ver_avx2 FPFX(intpl_luma_block_ver_avx2)\r\nvoid intpl_luma_block_ver_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ver0_avx2 FPFX(intpl_luma_block_ver0_avx2)\r\nvoid intpl_luma_block_ver0_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ver1_avx2 FPFX(intpl_luma_block_ver1_avx2)\r\nvoid intpl_luma_block_ver1_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ver2_avx2 FPFX(intpl_luma_block_ver2_avx2)\r\nvoid intpl_luma_block_ver2_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_luma_block_ext_avx2 FPFX(intpl_luma_block_ext_avx2)\r\nvoid intpl_luma_block_ext_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff_h, const int8_t *coeff_v);\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define intpl_luma_hor_sse128 FPFX(intpl_luma_hor_sse128)\r\nvoid intpl_luma_hor_sse128(pel_t *dst, int i_dst, mct_t *tmp, int i_tmp, pel_t *src, int i_src, int width, int height, int8_t const 
*coeff);\r\n#define intpl_luma_hor_x3_sse128 FPFX(intpl_luma_hor_x3_sse128)\r\nvoid intpl_luma_hor_x3_sse128(pel_t *const dst[3], int i_dst, mct_t *const tmp[3], int i_tmp, pel_t *src, int i_src, int width, int height, const int8_t **coeff);\r\n#define intpl_luma_ver_x3_sse128 FPFX(intpl_luma_ver_x3_sse128)\r\nvoid intpl_luma_ver_x3_sse128(pel_t *const dst[3], int i_dst, pel_t *src, int i_src, int width, int height, int8_t const **coeff);\r\n#define intpl_luma_ext_x3_sse128 FPFX(intpl_luma_ext_x3_sse128)\r\nvoid intpl_luma_ext_x3_sse128(pel_t *const dst[3], int i_dst, mct_t *tmp, int i_tmp, int width, int height, const int8_t **coeff);\r\n#define intpl_luma_ext_sse128 FPFX(intpl_luma_ext_sse128)\r\nvoid intpl_luma_ext_sse128(pel_t *dst, int i_dst, mct_t *tmp, int i_tmp, int width, int height, const int8_t *coeff);\r\n\r\n#define avs_pixel_average_sse128 FPFX(avs_pixel_average_sse128)\r\nvoid avs_pixel_average_sse128 (pel_t *dst, int i_dst, const pel_t *src0, int i_src0, const pel_t *src1, int i_src1, int width, int height);\r\n#define davs2_pixel_average_avx FPFX(pixel_average_avx)\r\nvoid davs2_pixel_average_avx  (pel_t *dst, int i_dst, const pel_t *src1, int i_src1, const pel_t *src2, int i_src2, int width, int height);\r\n#define padding_rows_sse128 FPFX(padding_rows_sse128)\r\nvoid padding_rows_sse128      (pel_t *src, int i_src, int width, int height, int start, int rows, int pad);\r\n#define padding_rows_lr_sse128 FPFX(padding_rows_lr_sse128)\r\nvoid padding_rows_lr_sse128   (pel_t *src, int i_src, int width, int height, int start, int rows, int pad);\r\n\r\n#define intpl_chroma_block_hor_avx2 FPFX(intpl_chroma_block_hor_avx2)\r\nvoid intpl_chroma_block_hor_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define intpl_chroma_block_ver_avx2 FPFX(intpl_chroma_block_ver_avx2)\r\nvoid intpl_chroma_block_ver_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff);\r\n#define 
intpl_chroma_block_ext_avx2 FPFX(intpl_chroma_block_ext_avx2)\r\nvoid intpl_chroma_block_ext_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff_h, const int8_t *coeff_v);\r\n#define deblock_edge_ver_sse128 FPFX(deblock_edge_ver_sse128)\r\nvoid deblock_edge_ver_sse128  (pel_t *SrcPtr, int stride, int Alpha, int Beta, uint8_t *flt_flag);\r\n#define deblock_edge_hor_sse128 FPFX(deblock_edge_hor_sse128)\r\nvoid deblock_edge_hor_sse128  (pel_t *SrcPtr, int stride, int Alpha, int Beta, uint8_t *flt_flag);\r\n#if HDR_CHROMA_DELTA_QP\r\n#define deblock_edge_ver_c_sse128 FPFX(deblock_edge_ver_c_sse128)\r\nvoid deblock_edge_ver_c_sse128(pel_t *SrcPtrU, pel_t *SrcPtrV, int stride, int *Alpha, int *Beta, uint8_t *flt_flag);\r\n#define deblock_edge_hor_c_sse128 FPFX(deblock_edge_hor_c_sse128)\r\nvoid deblock_edge_hor_c_sse128(pel_t *SrcPtrU, pel_t *SrcPtrV, int stride, int *Alpha, int *Beta, uint8_t *flt_flag);\r\n#else\r\n#define deblock_edge_ver_c_sse128 FPFX(deblock_edge_ver_c_sse128)\r\nvoid deblock_edge_ver_c_sse128(pel_t *SrcPtrU, pel_t *SrcPtrV, int stride, int Alpha, int Beta, uint8_t *flt_flag);\r\n#define deblock_edge_hor_c_sse128 FPFX(deblock_edge_hor_c_sse128)\r\nvoid deblock_edge_hor_c_sse128(pel_t *SrcPtrU, pel_t *SrcPtrV, int stride, int Alpha, int Beta, uint8_t *flt_flag);\r\n#endif\r\n//--------avx2--------\r\n#define deblock_edge_hor_avx2 FPFX(deblock_edge_hor_avx2)\r\nvoid deblock_edge_hor_avx2(pel_t *SrcPtr, int stride, int Alpha, int Beta, uint8_t *flt_flag);\r\n#define deblock_edge_ver_avx2 FPFX(deblock_edge_ver_avx2)\r\nvoid deblock_edge_ver_avx2(pel_t *SrcPtr, int stride, int Alpha, int Beta, uint8_t *flt_flag);\r\n#define deblock_edge_hor_c_avx2 FPFX(deblock_edge_hor_c_avx2)\r\nvoid deblock_edge_hor_c_avx2(pel_t *SrcPtrU, pel_t *SrcPtrV, int stride, int Alpha, int Beta, uint8_t *flt_flag);\r\n#define deblock_edge_ver_c_avx2 FPFX(deblock_edge_ver_c_avx2)\r\nvoid deblock_edge_ver_c_avx2(pel_t *SrcPtrU, pel_t 
*SrcPtrV, int stride, int Alpha, int Beta, uint8_t *flt_flag);\r\n\r\n\r\n#define davs2_dequant_sse4 FPFX(dequant_sse4)\r\nvoid davs2_dequant_sse4(coeff_t *coef, const int i_coef, const int scale, const int shift);\r\n\r\n#define idct_4x4_sse128 FPFX(idct_4x4_sse128)\r\nvoid idct_4x4_sse128  (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_8x8_sse128 FPFX(idct_8x8_sse128)\r\nvoid idct_8x8_sse128  (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x16_sse128 FPFX(idct_16x16_sse128)\r\nvoid idct_16x16_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x32_sse128 FPFX(idct_32x32_sse128)\r\nvoid idct_32x32_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x64_sse128 FPFX(idct_64x64_sse128)\r\nvoid idct_64x64_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x4_sse128 FPFX(idct_16x4_sse128)\r\nvoid idct_16x4_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x8_sse128 FPFX(idct_32x8_sse128)\r\nvoid idct_32x8_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x16_sse128 FPFX(idct_64x16_sse128)\r\nvoid idct_64x16_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_4x16_sse128 FPFX(idct_4x16_sse128)\r\nvoid idct_4x16_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_8x32_sse128 FPFX(idct_8x32_sse128)\r\nvoid idct_8x32_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x64_sse128 FPFX(idct_16x64_sse128)\r\nvoid idct_16x64_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define inv_transform_4x4_2nd_sse128 FPFX(inv_transform_4x4_2nd_sse128)\r\nvoid inv_transform_4x4_2nd_sse128(coeff_t *coeff, int i_coeff);\r\n#define inv_transform_2nd_sse128 FPFX(inv_transform_2nd_sse128)\r\nvoid inv_transform_2nd_sse128    (coeff_t *coeff, int i_coeff, int i_mode, int b_top, int b_left);\r\n#define inv_wavelet_64x16_sse128 FPFX(inv_wavelet_64x16_sse128)\r\nvoid inv_wavelet_64x16_sse128(coeff_t 
*coeff);\r\n#define inv_wavelet_16x64_sse128 FPFX(inv_wavelet_16x64_sse128)\r\nvoid inv_wavelet_16x64_sse128(coeff_t *coeff);\r\n\r\n//futl add 2016.11.30    avx2\r\n#define idct_8x8_avx2 FPFX(vec_idct_8x8_avx2)\r\nvoid idct_8x8_avx2  (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x16_avx2 FPFX(vec_idct_16x16_avx2)\r\nvoid idct_16x16_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x32_avx2 FPFX(vec_idct_32x32_avx2)\r\nvoid idct_32x32_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x64_avx2 FPFX(vec_idct_64x64_avx2)\r\nvoid idct_64x64_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x16_avx2 FPFX(vec_idct_64x16_avx2)\r\nvoid idct_64x16_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x64_avx2 FPFX(vec_idct_16x64_avx2)\r\nvoid idct_16x64_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define inv_wavelet_64x16_avx2 FPFX(inv_wavelet_64x16_avx2)\r\nvoid inv_wavelet_64x16_avx2(coeff_t *coeff);\r\n#define inv_wavelet_16x64_avx2 FPFX(inv_wavelet_16x64_avx2)\r\nvoid inv_wavelet_16x64_avx2(coeff_t *coeff);\r\n#define inv_wavelet_64x64_avx2 FPFX(inv_wavelet_64x64_avx2)\r\nvoid inv_wavelet_64x64_avx2(coeff_t *coeff);\r\n\r\n/* DCT half and quad */\r\n#define idct_4x4_half_sse128 FPFX(idct_4x4_half_sse128)\r\nvoid idct_4x4_half_sse128  (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_8x8_half_sse128 FPFX(idct_8x8_half_sse128)\r\nvoid idct_8x8_half_sse128  (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x16_half_sse128 FPFX(idct_16x16_half_sse128)\r\nvoid idct_16x16_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x32_half_sse128 FPFX(idct_32x32_half_sse128)\r\nvoid idct_32x32_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x64_half_sse128 FPFX(idct_64x64_half_sse128)\r\nvoid idct_64x64_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x4_half_sse128 
FPFX(idct_16x4_half_sse128)\r\nvoid idct_16x4_half_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x8_half_sse128 FPFX(idct_32x8_half_sse128)\r\nvoid idct_32x8_half_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_4x16_half_sse128 FPFX(idct_4x16_half_sse128)\r\nvoid idct_4x16_half_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_8x32_half_sse128 FPFX(idct_8x32_half_sse128)\r\nvoid idct_8x32_half_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x64_half_sse128 FPFX(idct_16x64_half_sse128)\r\nvoid idct_16x64_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x16_half_sse128 FPFX(idct_64x16_half_sse128)\r\nvoid idct_64x16_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n\r\n#define idct_4x4_quad_sse128 FPFX(idct_4x4_quad_sse128)\r\nvoid idct_4x4_quad_sse128  (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_8x8_quad_sse128 FPFX(idct_8x8_quad_sse128)\r\nvoid idct_8x8_quad_sse128  (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x16_quad_sse128 FPFX(idct_16x16_quad_sse128)\r\nvoid idct_16x16_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x32_quad_sse128 FPFX(idct_32x32_quad_sse128)\r\nvoid idct_32x32_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x64_quad_sse128 FPFX(idct_64x64_quad_sse128)\r\nvoid idct_64x64_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x4_quad_sse128 FPFX(idct_16x4_quad_sse128)\r\nvoid idct_16x4_quad_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x8_quad_sse128 FPFX(idct_32x8_quad_sse128)\r\nvoid idct_32x8_quad_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_4x16_quad_sse128 FPFX(idct_4x16_quad_sse128)\r\nvoid idct_4x16_quad_sse128 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_8x32_quad_sse128 FPFX(idct_8x32_quad_sse128)\r\nvoid idct_8x32_quad_sse128 
(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x64_quad_sse128 FPFX(idct_16x64_quad_sse128)\r\nvoid idct_16x64_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x16_quad_sse128 FPFX(idct_64x16_quad_sse128)\r\nvoid idct_64x16_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst);\r\n\r\n#define idct_8x8_half_avx2 FPFX(idct_8x8_half_avx2)\r\nvoid idct_8x8_half_avx2  (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x16_half_avx2 FPFX(idct_16x16_half_avx2)\r\nvoid idct_16x16_half_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x32_half_avx2 FPFX(idct_32x32_half_avx2)\r\nvoid idct_32x32_half_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x64_half_avx2 FPFX(idct_64x64_half_avx2)\r\nvoid idct_64x64_half_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x4_half_avx2 FPFX(idct_16x4_half_avx2)\r\nvoid idct_16x4_half_avx2 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x8_half_avx2 FPFX(idct_32x8_half_avx2)\r\nvoid idct_32x8_half_avx2 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_4x16_half_avx2 FPFX(idct_4x16_half_avx2)\r\nvoid idct_4x16_half_avx2 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_8x32_half_avx2 FPFX(idct_8x32_half_avx2)\r\nvoid idct_8x32_half_avx2 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x64_half_avx2 FPFX(idct_16x64_half_avx2)\r\nvoid idct_16x64_half_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x16_half_avx2 FPFX(idct_64x16_half_avx2)\r\nvoid idct_64x16_half_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n\r\n#define idct_8x8_quad_avx2 FPFX(idct_8x8_quad_avx2)\r\nvoid idct_8x8_quad_avx2  (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x16_quad_avx2 FPFX(idct_16x16_quad_avx2)\r\nvoid idct_16x16_quad_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x32_quad_avx2 FPFX(idct_32x32_quad_avx2)\r\nvoid 
idct_32x32_quad_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x64_quad_avx2 FPFX(idct_64x64_quad_avx2)\r\nvoid idct_64x64_quad_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x4_quad_avx2 FPFX(idct_16x4_quad_avx2)\r\nvoid idct_16x4_quad_avx2 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_32x8_quad_avx2 FPFX(idct_32x8_quad_avx2)\r\nvoid idct_32x8_quad_avx2 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_4x16_quad_avx2 FPFX(idct_4x16_quad_avx2)\r\nvoid idct_4x16_quad_avx2 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_8x32_quad_avx2 FPFX(idct_8x32_quad_avx2)\r\nvoid idct_8x32_quad_avx2 (const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_16x64_quad_avx2 FPFX(idct_16x64_quad_avx2)\r\nvoid idct_16x64_quad_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#define idct_64x16_quad_avx2 FPFX(idct_64x16_quad_avx2)\r\nvoid idct_64x16_quad_avx2(const coeff_t *src, coeff_t *dst, int i_dst);\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n * SAO\r\n */\r\n#define SAO_on_block_bo_sse128 FPFX(SAO_on_block_bo_sse128)\r\nvoid SAO_on_block_bo_sse128    (pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const sao_param_t *sao_param);\r\n#define SAO_on_block_eo_0_sse128 FPFX(SAO_on_block_eo_0_sse128)\r\nvoid SAO_on_block_eo_0_sse128  (pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const int *lcu_avail, const int *sao_offset);\r\n#define SAO_on_block_eo_45_sse128 FPFX(SAO_on_block_eo_45_sse128)\r\nvoid SAO_on_block_eo_45_sse128 (pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const int *lcu_avail, const int *sao_offset);\r\n#define SAO_on_block_eo_90_sse128 FPFX(SAO_on_block_eo_90_sse128)\r\nvoid SAO_on_block_eo_90_sse128 (pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int 
i_block_w, int i_block_h, int bit_depth, const int *lcu_avail, const int *sao_offset);\r\n#define SAO_on_block_eo_135_sse128 FPFX(SAO_on_block_eo_135_sse128)\r\nvoid SAO_on_block_eo_135_sse128(pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const int *lcu_avail, const int *sao_offset);\r\n#define SAO_on_block_bo_avx2 FPFX(SAO_on_block_bo_avx2)\r\nvoid SAO_on_block_bo_avx2    (pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const sao_param_t *sao_param);\r\n#define SAO_on_block_eo_0_avx2 FPFX(SAO_on_block_eo_0_avx2)\r\nvoid SAO_on_block_eo_0_avx2  (pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const int *lcu_avail, const int *sao_offset);\r\n#define SAO_on_block_eo_45_avx2 FPFX(SAO_on_block_eo_45_avx2)\r\nvoid SAO_on_block_eo_45_avx2 (pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const int *lcu_avail, const int *sao_offset);\r\n#define SAO_on_block_eo_90_avx2 FPFX(SAO_on_block_eo_90_avx2)\r\nvoid SAO_on_block_eo_90_avx2 (pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const int *lcu_avail, const int *sao_offset);\r\n#define SAO_on_block_eo_135_avx2 FPFX(SAO_on_block_eo_135_avx2)\r\nvoid SAO_on_block_eo_135_avx2(pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h, int bit_depth, const int *lcu_avail, const int *sao_offset);\r\n\r\n/* ---------------------------------------------------------------------------\r\n * ALF\r\n */\r\n#define alf_filter_block_sse128 FPFX(alf_filter_block_sse128)\r\nvoid alf_filter_block_sse128(pel_t *p_dst, const pel_t *p_src, int stride,\r\n    int lcu_pix_x, int lcu_pix_y, int lcu_width, int lcu_height,\r\n    int *alf_coef, int b_top_avail, int b_down_avail);\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n * Intra Prediction\r\n */\r\n#define fill_edge_samples_0_sse128 FPFX(fill_edge_samples_0_sse128)\r\nvoid fill_edge_samples_0_sse128 (const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy);\r\n#define fill_edge_samples_x_sse128 FPFX(fill_edge_samples_x_sse128)\r\nvoid fill_edge_samples_x_sse128 (const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy);\r\n#define fill_edge_samples_y_sse128 FPFX(fill_edge_samples_y_sse128)\r\nvoid fill_edge_samples_y_sse128 (const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy);\r\n#define fill_edge_samples_xy_sse128 FPFX(fill_edge_samples_xy_sse128)\r\nvoid fill_edge_samples_xy_sse128(const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy);\r\n\r\n#define intra_pred_dc_sse128 FPFX(intra_pred_dc_sse128)\r\nvoid intra_pred_dc_sse128       (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_plane_sse128 FPFX(intra_pred_plane_sse128)\r\nvoid intra_pred_plane_sse128    (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_bilinear_sse128 FPFX(intra_pred_bilinear_sse128)\r\nvoid intra_pred_bilinear_sse128 (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_hor_sse128 FPFX(intra_pred_hor_sse128)\r\nvoid intra_pred_hor_sse128      (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ver_sse128 FPFX(intra_pred_ver_sse128)\r\nvoid intra_pred_ver_sse128      (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n\r\n#define intra_pred_ang_x_3_sse128 FPFX(intra_pred_ang_x_3_sse128)\r\nvoid intra_pred_ang_x_3_sse128  (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_4_sse128 
FPFX(intra_pred_ang_x_4_sse128)\r\nvoid intra_pred_ang_x_4_sse128  (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_5_sse128 FPFX(intra_pred_ang_x_5_sse128)\r\nvoid intra_pred_ang_x_5_sse128  (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_6_sse128 FPFX(intra_pred_ang_x_6_sse128)\r\nvoid intra_pred_ang_x_6_sse128  (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_7_sse128 FPFX(intra_pred_ang_x_7_sse128)\r\nvoid intra_pred_ang_x_7_sse128  (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_8_sse128 FPFX(intra_pred_ang_x_8_sse128)\r\nvoid intra_pred_ang_x_8_sse128  (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_9_sse128 FPFX(intra_pred_ang_x_9_sse128)\r\nvoid intra_pred_ang_x_9_sse128  (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_10_sse128 FPFX(intra_pred_ang_x_10_sse128)\r\nvoid intra_pred_ang_x_10_sse128 (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_11_sse128 FPFX(intra_pred_ang_x_11_sse128)\r\nvoid intra_pred_ang_x_11_sse128 (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n\r\n#define intra_pred_ang_y_25_sse128 FPFX(intra_pred_ang_y_25_sse128)\r\nvoid intra_pred_ang_y_25_sse128 (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_26_sse128 FPFX(intra_pred_ang_y_26_sse128)\r\nvoid intra_pred_ang_y_26_sse128 (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_27_sse128 FPFX(intra_pred_ang_y_27_sse128)\r\nvoid intra_pred_ang_y_27_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_28_sse128 FPFX(intra_pred_ang_y_28_sse128)\r\nvoid intra_pred_ang_y_28_sse128(pel_t *src, 
pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_29_sse128 FPFX(intra_pred_ang_y_29_sse128)\r\nvoid intra_pred_ang_y_29_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_30_sse128 FPFX(intra_pred_ang_y_30_sse128)\r\nvoid intra_pred_ang_y_30_sse128 (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_31_sse128 FPFX(intra_pred_ang_y_31_sse128)\r\nvoid intra_pred_ang_y_31_sse128 (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_32_sse128 FPFX(intra_pred_ang_y_32_sse128)\r\nvoid intra_pred_ang_y_32_sse128 (pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n\r\n#define intra_pred_ang_xy_13_sse128 FPFX(intra_pred_ang_xy_13_sse128)\r\nvoid intra_pred_ang_xy_13_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_14_sse128 FPFX(intra_pred_ang_xy_14_sse128)\r\nvoid intra_pred_ang_xy_14_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_16_sse128 FPFX(intra_pred_ang_xy_16_sse128)\r\nvoid intra_pred_ang_xy_16_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_18_sse128 FPFX(intra_pred_ang_xy_18_sse128)\r\nvoid intra_pred_ang_xy_18_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_20_sse128 FPFX(intra_pred_ang_xy_20_sse128)\r\nvoid intra_pred_ang_xy_20_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_22_sse128 FPFX(intra_pred_ang_xy_22_sse128)\r\nvoid intra_pred_ang_xy_22_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_23_sse128 FPFX(intra_pred_ang_xy_23_sse128)\r\nvoid intra_pred_ang_xy_23_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int 
bsy);\r\n\r\n//intra prediction avx functions\r\n#define intra_pred_ver_avx FPFX(intra_pred_ver_avx)\r\nvoid intra_pred_ver_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_hor_avx FPFX(intra_pred_hor_avx)\r\nvoid intra_pred_hor_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_dc_avx FPFX(intra_pred_dc_avx)\r\nvoid intra_pred_dc_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_plane_avx FPFX(intra_pred_plane_avx)\r\nvoid intra_pred_plane_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_bilinear_avx FPFX(intra_pred_bilinear_avx)\r\nvoid intra_pred_bilinear_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_3_avx FPFX(intra_pred_ang_x_3_avx)\r\nvoid intra_pred_ang_x_3_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_4_avx FPFX(intra_pred_ang_x_4_avx)\r\nvoid intra_pred_ang_x_4_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_5_avx FPFX(intra_pred_ang_x_5_avx)\r\nvoid intra_pred_ang_x_5_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_6_avx FPFX(intra_pred_ang_x_6_avx)\r\nvoid intra_pred_ang_x_6_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_7_avx FPFX(intra_pred_ang_x_7_avx)\r\nvoid intra_pred_ang_x_7_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_8_avx FPFX(intra_pred_ang_x_8_avx)\r\nvoid intra_pred_ang_x_8_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_9_avx FPFX(intra_pred_ang_x_9_avx)\r\nvoid intra_pred_ang_x_9_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_10_avx 
FPFX(intra_pred_ang_x_10_avx)\r\nvoid intra_pred_ang_x_10_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_x_11_avx FPFX(intra_pred_ang_x_11_avx)\r\nvoid intra_pred_ang_x_11_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n\r\n#define intra_pred_ang_xy_13_avx FPFX(intra_pred_ang_xy_13_avx)\r\nvoid intra_pred_ang_xy_13_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_14_avx FPFX(intra_pred_ang_xy_14_avx)\r\nvoid intra_pred_ang_xy_14_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_16_avx FPFX(intra_pred_ang_xy_16_avx)\r\nvoid intra_pred_ang_xy_16_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_18_avx FPFX(intra_pred_ang_xy_18_avx)\r\nvoid intra_pred_ang_xy_18_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_20_avx FPFX(intra_pred_ang_xy_20_avx)\r\nvoid intra_pred_ang_xy_20_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_22_avx FPFX(intra_pred_ang_xy_22_avx)\r\nvoid intra_pred_ang_xy_22_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_xy_23_avx FPFX(intra_pred_ang_xy_23_avx)\r\nvoid intra_pred_ang_xy_23_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n\r\n#define intra_pred_ang_y_25_avx FPFX(intra_pred_ang_y_25_avx)\r\nvoid intra_pred_ang_y_25_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_26_avx FPFX(intra_pred_ang_y_26_avx)\r\nvoid intra_pred_ang_y_26_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_28_avx FPFX(intra_pred_ang_y_28_avx)\r\nvoid intra_pred_ang_y_28_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define 
intra_pred_ang_y_30_avx FPFX(intra_pred_ang_y_30_avx)\r\nvoid intra_pred_ang_y_30_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_31_avx FPFX(intra_pred_ang_y_31_avx)\r\nvoid intra_pred_ang_y_31_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n#define intra_pred_ang_y_32_avx FPFX(intra_pred_ang_y_32_avx)\r\nvoid intra_pred_ang_y_32_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy);\r\n/* Function declaration defines */\r\n\r\n#define FUNCDEF_TU(ret, name, cpu, ...) \\\r\n    ret FPFX(name ## _4x4_   ## cpu(__VA_ARGS__));\\\r\n    ret FPFX(name ## _8x8_   ## cpu(__VA_ARGS__));\\\r\n    ret FPFX(name ## _16x16_ ## cpu(__VA_ARGS__));\\\r\n    ret FPFX(name ## _32x32_ ## cpu(__VA_ARGS__));\\\r\n    ret FPFX(name ## _64x64_ ## cpu(__VA_ARGS__))\r\n\r\n#define FUNCDEF_TU_S(ret, name, cpu, ...) \\\r\n    ret FPFX(name ## _4_  ## cpu(__VA_ARGS__));\\\r\n    ret FPFX(name ## _8_  ## cpu(__VA_ARGS__));\\\r\n    ret FPFX(name ## _16_ ## cpu(__VA_ARGS__));\\\r\n    ret FPFX(name ## _32_ ## cpu(__VA_ARGS__));\\\r\n    ret FPFX(name ## _64_ ## cpu(__VA_ARGS__))\r\n\r\n#define FUNCDEF_PU(ret, name, cpu, ...) 
\\\r\n    ret FPFX(name ## _4x4_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _8x8_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _16x16_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _32x32_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _64x64_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _8x4_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _4x8_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _16x8_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _8x16_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _16x32_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _32x16_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _64x32_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _32x64_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _16x12_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _12x16_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _16x4_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _4x16_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _32x24_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _24x32_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _32x8_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _8x32_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _64x48_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _48x64_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _64x16_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _16x64_ ## cpu)(__VA_ARGS__)\r\n\r\n#define FUNCDEF_CHROMA_PU(ret, name, cpu, ...) 
\\\r\n    FUNCDEF_PU(ret, name, cpu, __VA_ARGS__);\\\r\n    ret FPFX(name ## _4x2_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _2x4_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _8x2_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _2x8_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _8x6_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _6x8_   ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _8x12_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _12x8_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _6x16_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _16x6_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _2x16_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _16x2_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _4x12_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _12x4_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _32x12_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _12x32_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _32x4_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _4x32_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _32x48_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _48x32_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _16x24_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _24x16_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _8x64_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _64x8_  ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _64x24_ ## cpu)(__VA_ARGS__);\\\r\n    ret FPFX(name ## _24x64_ ## cpu)(__VA_ARGS__);\r\n\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif // #ifndef DAVS2_INTRINSIC_H\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_alf.cc",
    "content": "/*\r\n * intrinsic_alf.cc\r\n *\r\n * Description of this file:\r\n *    SSE assembly functions of ALF module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n\r\n\r\n#if !HIGH_BIT_DEPTH\r\n\r\nvoid alf_filter_block_sse128(pel_t *p_dst, const pel_t *p_src, int stride,\r\n                             int lcu_pix_x, int lcu_pix_y, int lcu_width, int lcu_height,\r\n                             int *alf_coeff, int b_top_avail, int b_down_avail)\r\n{\r\n    const pel_t *imgPad1, *imgPad2, *imgPad3, *imgPad4, *imgPad5, *imgPad6;\r\n\r\n    __m128i T00, T01, T10, T11, T20, T21, T30, T31, 
T40, T41, T50, T51;\r\n    __m128i T1, T2, T3, T4, T5, T6, T7, T8;\r\n    __m128i E00, E01, E10, E11, E20, E21, E30, E31, E40, E41;\r\n    __m128i C0, C1, C2, C3, C4, C30, C31, C32, C33;\r\n    __m128i S0, S00, S01, S1, S10, S11, S2, S20, S21, S3, S30, S31, S4, S40, S41, S5, S50, S51, S6, S60, S61, S7, S8, SS1, SS2, S;\r\n    __m128i mSwitch1, mSwitch2, mSwitch3, mSwitch4, mSwitch5;\r\n    __m128i mAddOffset;\r\n    __m128i mZero = _mm_set1_epi16(0);\r\n    __m128i mMax = _mm_set1_epi16((short)((1 << g_bit_depth) - 1));\r\n    __m128i mask;\r\n\r\n    int startPos  = b_top_avail  ? (lcu_pix_y - 4) : lcu_pix_y;\r\n    int endPos    = b_down_avail ? (lcu_pix_y + lcu_height - 4) : (lcu_pix_y + lcu_height);\r\n    int xPosEnd   = lcu_pix_x + lcu_width;\r\n    int xPosEnd16 = xPosEnd - (lcu_width & 0x0f);\r\n\r\n    int yUp, yBottom;\r\n    int x, y;\r\n\r\n    mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(lcu_width & 15) - 1]));\r\n\r\n    p_src += (startPos * stride) + lcu_pix_x;\r\n    p_dst += (startPos * stride) + lcu_pix_x;\r\n    lcu_height = endPos - startPos;\r\n    lcu_height--;\r\n\r\n    C0         = _mm_set1_epi8((char)alf_coeff[0]);\r\n    C1         = _mm_set1_epi8((char)alf_coeff[1]);\r\n    C2         = _mm_set1_epi8((char)alf_coeff[2]);\r\n    C3         = _mm_set1_epi8((char)alf_coeff[3]);\r\n    C4         = _mm_set1_epi8((char)alf_coeff[4]);\r\n\r\n    mSwitch1   = _mm_setr_epi8(0, 1, 2, 3, 2, 1, 0, 3, 0, 1, 2, 3, 2, 1, 0, 3);\r\n    C30        = _mm_loadu_si128((__m128i*)&alf_coeff[5]);\r\n    C31        = _mm_packs_epi32(C30, C30);\r\n    C32        = _mm_packs_epi16(C31, C31);\r\n    C33        = _mm_shuffle_epi8(C32, mSwitch1);\r\n    mSwitch2   = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1, 1, 2, 3, 4, 5, 6, 7, -1);\r\n    mSwitch3   = _mm_setr_epi8(2, 3, 4, 5, 6, 7, 8, -1, 3, 4, 5, 6, 7, 8, 9, -1);\r\n    mSwitch4   = _mm_setr_epi8(4, 5, 6, 7, 8, 9, 10, -1, 5, 6, 7, 8, 9, 10, 11, -1);\r\n    mSwitch5   = _mm_setr_epi8(6, 7, 8, 9, 10, 11, 12, -1, 
7, 8, 9, 10, 11, 12, 13, -1);\r\n    mAddOffset = _mm_set1_epi16(32);\r\n\r\n    for (y = 0; y <= lcu_height; y++) {\r\n        yUp     = DAVS2_CLIP3(0, lcu_height, y - 1);\r\n        yBottom = DAVS2_CLIP3(0, lcu_height, y + 1);\r\n        imgPad1 = p_src + (yBottom - y) * stride;\r\n        imgPad2 = p_src + (yUp     - y) * stride;\r\n\r\n        yUp     = DAVS2_CLIP3(0, lcu_height, y - 2);\r\n        yBottom = DAVS2_CLIP3(0, lcu_height, y + 2);\r\n        imgPad3 = p_src + (yBottom - y) * stride;\r\n        imgPad4 = p_src + (yUp     - y) * stride;\r\n\r\n        yUp     = DAVS2_CLIP3(0, lcu_height, y - 3);\r\n        yBottom = DAVS2_CLIP3(0, lcu_height, y + 3);\r\n        imgPad5 = p_src + (yBottom - y) * stride;\r\n        imgPad6 = p_src + (yUp     - y) * stride;\r\n\r\n        // At 176x144 the V component does not match with the loop commented out below; the current loop matches\r\n        //for (x = lcu_pix_x; x < xPosEnd - 15; x += 16) {\r\n        for (x = 0; x < lcu_width; x += 16) {\r\n            T00 = _mm_loadu_si128((__m128i*)&imgPad6[x]);\r\n            T01 = _mm_loadu_si128((__m128i*)&imgPad5[x]);\r\n            E00 = _mm_unpacklo_epi8(T00, T01);\r\n            E01 = _mm_unpackhi_epi8(T00, T01);\r\n            S00 = _mm_maddubs_epi16(E00, C0); // C0*P0 results for the first 8 pixels\r\n            S01 = _mm_maddubs_epi16(E01, C0); // C0*P0 results for the last 8 pixels\r\n\r\n            T10 = _mm_loadu_si128((__m128i*)&imgPad4[x]);\r\n            T11 = _mm_loadu_si128((__m128i*)&imgPad3[x]);\r\n            E10 = _mm_unpacklo_epi8(T10, T11);\r\n            E11 = _mm_unpackhi_epi8(T10, T11);\r\n            S10 = _mm_maddubs_epi16(E10, C1); // C1*P1 results for the first 8 pixels\r\n            S11 = _mm_maddubs_epi16(E11, C1); // C1*P1 results for the last 8 pixels\r\n\r\n            T20 = _mm_loadu_si128((__m128i*)&imgPad2[x - 1]);\r\n            T21 = _mm_loadu_si128((__m128i*)&imgPad1[x + 1]);\r\n            E20 = _mm_unpacklo_epi8(T20, T21);\r\n            E21 = _mm_unpackhi_epi8(T20, T21);\r\n            S20 = _mm_maddubs_epi16(E20, C2);\r\n            S21 = _mm_maddubs_epi16(E21, C2);\r\n\r\n            T30 = 
_mm_loadu_si128((__m128i*)&imgPad2[x]);\r\n            T31 = _mm_loadu_si128((__m128i*)&imgPad1[x]);\r\n            E30 = _mm_unpacklo_epi8(T30, T31);\r\n            E31 = _mm_unpackhi_epi8(T30, T31);\r\n            S30 = _mm_maddubs_epi16(E30, C3);\r\n            S31 = _mm_maddubs_epi16(E31, C3);\r\n\r\n            T40 = _mm_loadu_si128((__m128i*)&imgPad2[x + 1]);\r\n            T41 = _mm_loadu_si128((__m128i*)&imgPad1[x - 1]);\r\n            E40 = _mm_unpacklo_epi8(T40, T41);\r\n            E41 = _mm_unpackhi_epi8(T40, T41);\r\n            S40 = _mm_maddubs_epi16(E40, C4);\r\n            S41 = _mm_maddubs_epi16(E41, C4);\r\n\r\n            T50 = _mm_loadu_si128((__m128i*)&p_src[x - 3]);\r\n            T51 = _mm_loadu_si128((__m128i*)&p_src[x + 5]);\r\n            T1  = _mm_shuffle_epi8(T50, mSwitch2);\r\n            T2  = _mm_shuffle_epi8(T50, mSwitch3);\r\n            T3  = _mm_shuffle_epi8(T50, mSwitch4);\r\n            T4  = _mm_shuffle_epi8(T50, mSwitch5);\r\n            T5  = _mm_shuffle_epi8(T51, mSwitch2);\r\n            T6  = _mm_shuffle_epi8(T51, mSwitch3);\r\n            T7  = _mm_shuffle_epi8(T51, mSwitch4);\r\n            T8  = _mm_shuffle_epi8(T51, mSwitch5);\r\n\r\n            S5  = _mm_maddubs_epi16(T1, C33);\r\n            S6  = _mm_maddubs_epi16(T2, C33);\r\n            S7  = _mm_maddubs_epi16(T3, C33);\r\n            S8  = _mm_maddubs_epi16(T4, C33);\r\n            S50 = _mm_hadds_epi16(S5, S6);\r\n            S51 = _mm_hadds_epi16(S7, S8);\r\n            S5  = _mm_hadds_epi16(S50, S51); // first 8 pixels\r\n            S4  = _mm_maddubs_epi16(T5, C33);\r\n            S6  = _mm_maddubs_epi16(T6, C33);\r\n            S7  = _mm_maddubs_epi16(T7, C33);\r\n            S8  = _mm_maddubs_epi16(T8, C33);\r\n            S60 = _mm_hadds_epi16(S4, S6);\r\n            S61 = _mm_hadds_epi16(S7, S8);\r\n            S6  = _mm_hadds_epi16(S60, S61); // last 8 pixels\r\n\r\n            S0  = _mm_adds_epi16(S00, S10);\r\n            S1  = _mm_adds_epi16(S30, S20);\r\n            S2  = 
_mm_adds_epi16(S40, S5);\r\n            S3  = _mm_adds_epi16(S1, S0);\r\n            SS1 = _mm_adds_epi16(S2, S3); // first 8 pixels\r\n\r\n            S0  = _mm_adds_epi16(S01, S11);\r\n            S1  = _mm_adds_epi16(S31, S21);\r\n            S2  = _mm_adds_epi16(S41, S6);\r\n            S3  = _mm_adds_epi16(S1, S0);\r\n            SS2 = _mm_adds_epi16(S2, S3); // last 8 pixels\r\n\r\n\r\n            SS1 = _mm_adds_epi16(SS1, mAddOffset);\r\n            SS1 = _mm_srai_epi16(SS1, 6);\r\n            SS1 = _mm_min_epi16(SS1, mMax);\r\n            SS1 = _mm_max_epi16(SS1, mZero);\r\n\r\n            SS2 = _mm_adds_epi16(SS2, mAddOffset);\r\n            SS2 = _mm_srai_epi16(SS2, 6);\r\n            SS2 = _mm_min_epi16(SS2, mMax);\r\n            SS2 = _mm_max_epi16(SS2, mZero);\r\n\r\n            S   = _mm_packus_epi16(SS1, SS2);\r\n            if (x != xPosEnd16) {\r\n                _mm_storeu_si128((__m128i*)(p_dst + x), S);\r\n            } else {\r\n                _mm_maskmoveu_si128(S, mask, (char *)(p_dst + x));\r\n                break;\r\n            }\r\n        }\r\n\r\n        p_src += stride;\r\n        p_dst += stride;\r\n    }\r\n}\r\n\r\n#endif  // #if !HIGH_BIT_DEPTH\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_deblock.cc",
    "content": "/*\r\n * intrinsic_deblock.cc\r\n *\r\n * Description of this file:\r\n *    SSE assembly functions of Deblock module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n\r\n#if !HIGH_BIT_DEPTH\r\n\r\nvoid deblock_edge_ver_sse128(pel_t *SrcPtr, int stride, int Alpha, int Beta, uint8_t *flt_flag)\r\n{\r\n    pel_t *pTmp = SrcPtr - 4;\r\n    int flag0 = flt_flag[0] ? -1 : 0;\r\n    int flag1 = flt_flag[1] ? 
-1 : 0;\r\n    __m128i TL0, TL1, TL2, TL3;\r\n    __m128i TR0, TR1, TR2, TR3;\r\n    __m128i TL0l, TL1l;\r\n    __m128i TR0l, TR1l;\r\n    __m128i V0, V1, V2, V3, V4, V5;\r\n    __m128i T0, T1, T2, T3, T4, T5, T6, T7;\r\n    __m128i M0, M1, M2;\r\n    __m128i FLT_L, FLT_R, FLT, FS;\r\n    __m128i FS3, FS4, FS56;\r\n\r\n    __m128i ALPHA = _mm_set1_epi16((pel_t)Alpha);\r\n    __m128i BETA = _mm_set1_epi16((pel_t)Beta);\r\n    __m128i c_0 = _mm_set1_epi16(0);\r\n    __m128i c_1 = _mm_set1_epi16(1);\r\n    __m128i c_2 = _mm_set1_epi16(2);\r\n    __m128i c_3 = _mm_set1_epi16(3);\r\n    __m128i c_4 = _mm_set1_epi16(4);\r\n    __m128i c_8 = _mm_set1_epi16(8);\r\n    __m128i c_16 = _mm_set1_epi16(16);\r\n\r\n    T0 = _mm_loadl_epi64((__m128i*)(pTmp));\r\n    T1 = _mm_loadl_epi64((__m128i*)(pTmp + stride));\r\n    T2 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 2));\r\n    T3 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 3));\r\n    T4 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 4));\r\n    T5 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 5));\r\n    T6 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 6));\r\n    T7 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 7));\r\n\r\n    T0 = _mm_unpacklo_epi8(T0, T1);\r\n    T1 = _mm_unpacklo_epi8(T2, T3);\r\n    T2 = _mm_unpacklo_epi8(T4, T5);\r\n    T3 = _mm_unpacklo_epi8(T6, T7);\r\n\r\n    T4 = _mm_unpacklo_epi16(T0, T1);\r\n    T5 = _mm_unpacklo_epi16(T2, T3);\r\n    T6 = _mm_unpackhi_epi16(T0, T1);\r\n    T7 = _mm_unpackhi_epi16(T2, T3);\r\n\r\n    T0 = _mm_unpacklo_epi32(T4, T5);\r\n    T1 = _mm_unpackhi_epi32(T4, T5);\r\n    T2 = _mm_unpacklo_epi32(T6, T7);\r\n    T3 = _mm_unpackhi_epi32(T6, T7);\r\n\r\n    TL3 = _mm_unpacklo_epi8(T0, c_0);\r\n    TL2 = _mm_unpackhi_epi8(T0, c_0);\r\n    TL1 = _mm_unpacklo_epi8(T1, c_0);\r\n    TL0 = _mm_unpackhi_epi8(T1, c_0);\r\n\r\n    TR0 = _mm_unpacklo_epi8(T2, c_0);\r\n    TR1 = _mm_unpackhi_epi8(T2, c_0);\r\n    TR2 = _mm_unpacklo_epi8(T3, c_0);\r\n    TR3 = _mm_unpackhi_epi8(T3, 
c_0);\r\n\r\n#define _mm_subabs_epu16(a, b) _mm_abs_epi16(_mm_subs_epi16(a, b))\r\n\r\n    T0 = _mm_subabs_epu16(TL0, TR0);\r\n    T1 = _mm_cmpgt_epi16(T0, c_1);\r\n    T2 = _mm_cmpgt_epi16(ALPHA, T0);\r\n\r\n    M0 = _mm_set_epi32(flag1, flag1, flag0, flag0);\r\n    M0 = _mm_and_si128(M0, _mm_and_si128(T1, T2)); // mask1\r\n\r\n    T0 = _mm_subabs_epu16(TL1, TL0);\r\n    T1 = _mm_subabs_epu16(TR1, TR0);\r\n    FLT_L = _mm_and_si128(_mm_cmpgt_epi16(BETA, T0), c_2);\r\n    FLT_R = _mm_and_si128(_mm_cmpgt_epi16(BETA, T1), c_2);\r\n\r\n    T0 = _mm_subabs_epu16(TL2, TL0);\r\n    T1 = _mm_subabs_epu16(TR2, TR0);\r\n    M1 = _mm_cmpgt_epi16(BETA, T0);\r\n    M2 = _mm_cmpgt_epi16(BETA, T1);\r\n    FLT_L = _mm_add_epi16(_mm_and_si128(M1, c_1), FLT_L);\r\n    FLT_R = _mm_add_epi16(_mm_and_si128(M2, c_1), FLT_R);\r\n    FLT = _mm_add_epi16(FLT_L, FLT_R);\r\n\r\n    M1 = _mm_and_si128(_mm_cmpeq_epi16(TR0, TR1), _mm_cmpeq_epi16(TL0, TL1));\r\n    T0 = _mm_sub_epi16(FLT, c_2);\r\n    T1 = _mm_sub_epi16(FLT, c_3);\r\n    T2 = _mm_subabs_epu16(TL1, TR1);\r\n\r\n    FS56 = _mm_blendv_epi8(T1, T0, M1);\r\n    FS4 = _mm_blendv_epi8(c_1, c_2, _mm_cmpeq_epi16(FLT_L, c_2));\r\n    FS3 = _mm_blendv_epi8(c_0, c_1, _mm_cmpgt_epi16(BETA, T2));\r\n\r\n    FS = _mm_blendv_epi8(c_0, FS56, _mm_cmpgt_epi16(FLT, c_4));\r\n    FS = _mm_blendv_epi8(FS, FS4, _mm_cmpeq_epi16(FLT, c_4));\r\n    FS = _mm_blendv_epi8(FS, FS3, _mm_cmpeq_epi16(FLT, c_3));\r\n\r\n    FS = _mm_and_si128(FS, M0);\r\n\r\n#undef _mm_subabs_epu16\r\n\r\n\r\n    TL0l = TL0;\r\n    TL1l = TL1;\r\n    TR0l = TR0;\r\n    TR1l = TR1;\r\n\r\n    /* fs == 1 */\r\n    T2 = _mm_add_epi16(_mm_add_epi16(TL0l, TR0l), c_2); // L0 + R0 + 2\r\n\r\n    V0 = _mm_srli_epi16(_mm_add_epi16(_mm_slli_epi16(TL0l, 1), T2), 2);\r\n\r\n    V1 = _mm_srli_epi16(_mm_add_epi16(_mm_slli_epi16(TR0l, 1), T2), 2);\r\n\r\n    TL0 = _mm_blendv_epi8(TL0, V0, _mm_cmpeq_epi16(FS, c_1));\r\n    TR0 = _mm_blendv_epi8(TR0, V1, _mm_cmpeq_epi16(FS, c_1));\r\n\r\n    /* 
fs == 2 */\r\n    T2 = _mm_slli_epi16(T2, 1); // (L0 << 1) + (R0 << 1) + 4\r\n    T3 = _mm_slli_epi16(T3, 1);\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL1l, 1), _mm_add_epi16(TL1l, TR0l));\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL0l, 3), _mm_add_epi16(T0, T2));\r\n\r\n    V0 = _mm_srli_epi16(_mm_add_epi16(T0, c_4), 4);\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR1l, 1), _mm_add_epi16(TR1l, TL0l));\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR0l, 3), _mm_add_epi16(T0, T2));\r\n\r\n    V1 = _mm_srli_epi16(_mm_add_epi16(T0, c_4), 4);\r\n\r\n    TL0 = _mm_blendv_epi8(TL0, V0, _mm_cmpeq_epi16(FS, c_2));\r\n    TR0 = _mm_blendv_epi8(TR0, V1, _mm_cmpeq_epi16(FS, c_2));\r\n\r\n    /* fs == 3 */\r\n    T2 = _mm_slli_epi16(T2, 1); // (L0 << 2) + (R0 << 2) + 8\r\n    T3 = _mm_slli_epi16(T3, 1);\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL1l, 2), _mm_add_epi16(TL2, TR1l));\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL0l, 1), _mm_add_epi16(T0, T2));\r\n\r\n    V0 = _mm_srli_epi16(T0, 4);\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR1l, 2), _mm_add_epi16(TR2, TL1l));\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR0l, 1), _mm_add_epi16(T0, T2));\r\n\r\n    V1 = _mm_srli_epi16(T0, 4);\r\n\r\n    TL0 = _mm_blendv_epi8(TL0, V0, _mm_cmpeq_epi16(FS, c_3));\r\n    TR0 = _mm_blendv_epi8(TR0, V1, _mm_cmpeq_epi16(FS, c_3));\r\n\r\n    T0 = _mm_add_epi16(_mm_add_epi16(TL2, TR0l), _mm_slli_epi16(TL2, 1));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TL1l, 3));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TL0l, 2));\r\n    V2 = _mm_srli_epi16(_mm_add_epi16(T0, c_8), 4);\r\n\r\n    T0 = _mm_add_epi16(_mm_add_epi16(TR2, TL0l), _mm_slli_epi16(TR2, 1));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TR1l, 3));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TR0l, 2));\r\n    V3 = _mm_srli_epi16(_mm_add_epi16(T0, c_8), 4);\r\n\r\n    TL1 = _mm_blendv_epi8(TL1, V2, _mm_cmpeq_epi16(FS, c_3));\r\n    TR1 = _mm_blendv_epi8(TR1, V3, _mm_cmpeq_epi16(FS, c_3));\r\n\r\n    FS = 
_mm_cmpeq_epi16(FS, c_4);\r\n\r\n    if (!_mm_testz_si128(FS, _mm_set1_epi16(-1))) { /* fs == 4 */\r\n        /* cal L0/R0 */\r\n        T0 = _mm_slli_epi16(_mm_add_epi16(_mm_add_epi16(TL0l, TL2), TR0l), 3);\r\n        T0 = _mm_add_epi16(_mm_add_epi16(T0, c_16), _mm_add_epi16(TL0l, TL2));\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TR2, 1), _mm_slli_epi16(TR2, 2));\r\n        V0 = _mm_srli_epi16(_mm_add_epi16(T0, T2), 5);\r\n\r\n        T0 = _mm_slli_epi16(_mm_add_epi16(_mm_add_epi16(TR0l, TR2), TL0l), 3);\r\n        T0 = _mm_add_epi16(_mm_add_epi16(T0, c_16), _mm_add_epi16(TR0l, TR2));\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TL2, 1), _mm_slli_epi16(TL2, 2));\r\n        V1 = _mm_srli_epi16(_mm_add_epi16(T0, T2), 5);\r\n\r\n        TL0 = _mm_blendv_epi8(TL0, V0, FS);\r\n        TR0 = _mm_blendv_epi8(TR0, V1, FS);\r\n\r\n        /* cal L1/R1 */\r\n        T0 = _mm_slli_epi16(_mm_add_epi16(TL2, TR0l), 1);\r\n        T0 = _mm_add_epi16(T0, _mm_sub_epi16(_mm_slli_epi16(TL0l, 3), TL0l));\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TL2, 2), _mm_add_epi16(TR0l, c_8));\r\n        V2 = _mm_srli_epi16(_mm_add_epi16(T0, T2), 4);\r\n\r\n        T0 = _mm_slli_epi16(_mm_add_epi16(TR2, TL0l), 1);\r\n        T0 = _mm_add_epi16(T0, _mm_sub_epi16(_mm_slli_epi16(TR0l, 3), TR0l));\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TR2, 2), _mm_add_epi16(TL0l, c_8));\r\n        V3 = _mm_srli_epi16(_mm_add_epi16(T0, T2), 4);\r\n\r\n        TL1 = _mm_blendv_epi8(TL1, V2, FS);\r\n        TR1 = _mm_blendv_epi8(TR1, V3, FS);\r\n\r\n        /* cal L2/R2 */\r\n        T0 = _mm_add_epi16(_mm_slli_epi16(TL2, 1), TL2);\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TL0l, 2), TR0l);\r\n        V4 = _mm_srli_epi16(_mm_add_epi16(T0, _mm_add_epi16(T2, c_4)), 3);\r\n\r\n        T0 = _mm_add_epi16(_mm_slli_epi16(TR2, 1), TR2);\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TR0l, 2), TL0l);\r\n        V5 = _mm_srli_epi16(_mm_add_epi16(T0, _mm_add_epi16(T2, c_4)), 3);\r\n\r\n        TL2 = 
_mm_blendv_epi8(TL2, V4, FS);\r\n        TR2 = _mm_blendv_epi8(TR2, V5, FS);\r\n    }\r\n\r\n    /* store result */\r\n    T0 = _mm_packus_epi16(TL3, TR0);\r\n    T1 = _mm_packus_epi16(TL2, TR1);\r\n    T2 = _mm_packus_epi16(TL1, TR2);\r\n    T3 = _mm_packus_epi16(TL0, TR3);\r\n\r\n    T4 = _mm_unpacklo_epi8(T0, T1);\r\n    T5 = _mm_unpacklo_epi8(T2, T3);\r\n    T6 = _mm_unpackhi_epi8(T0, T1);\r\n    T7 = _mm_unpackhi_epi8(T2, T3);\r\n\r\n    V0 = _mm_unpacklo_epi16(T4, T5);\r\n    V1 = _mm_unpacklo_epi16(T6, T7);\r\n    V2 = _mm_unpackhi_epi16(T4, T5);\r\n    V3 = _mm_unpackhi_epi16(T6, T7);\r\n\r\n    T0 = _mm_unpacklo_epi32(V0, V1);\r\n    T1 = _mm_unpackhi_epi32(V0, V1);\r\n    T2 = _mm_unpacklo_epi32(V2, V3);\r\n    T3 = _mm_unpackhi_epi32(V2, V3);\r\n\r\n    pTmp = SrcPtr - 4;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T0);\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), _mm_srli_si128(T0, 8));\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T1);\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), _mm_srli_si128(T1, 8));\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T2);\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), _mm_srli_si128(T2, 8));\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T3);\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), _mm_srli_si128(T3, 8));\r\n}\r\n\r\nvoid deblock_edge_ver_c_sse128(pel_t *SrcPtrU, pel_t *SrcPtrV, int stride, int Alpha, int Beta, uint8_t *flt_flag)\r\n{\r\n    pel_t *pTmp;\r\n    int flag0 = flt_flag[0] ? -1 : 0;\r\n    int flag1 = flt_flag[1] ? 
-1 : 0;\r\n\r\n    __m128i UVL0, UVL1, UVR0, UVR1;\r\n    __m128i TL0, TL1, TL2, TL3;\r\n    __m128i TR0, TR1, TR2, TR3;\r\n    __m128i T0, T1, T2, T3, T4, T5, T6, T7;\r\n    __m128i P0, P1, P2, P3, P4, P5, P6, P7;\r\n    __m128i V0, V1, V2, V3;\r\n    __m128i M0, M1, M2;\r\n    __m128i FLT_L, FLT_R, FLT, FS;\r\n    __m128i FS4, FS56;\r\n\r\n    __m128i ALPHA = _mm_set1_epi16((pel_t)Alpha);\r\n    __m128i BETA = _mm_set1_epi16((pel_t)Beta);\r\n    __m128i c_0 = _mm_set1_epi16(0);\r\n    __m128i c_1 = _mm_set1_epi16(1);\r\n    __m128i c_2 = _mm_set1_epi16(2);\r\n    __m128i c_3 = _mm_set1_epi16(3);\r\n    __m128i c_4 = _mm_set1_epi16(4);\r\n    __m128i c_8 = _mm_set1_epi16(8);\r\n\r\n    pTmp = SrcPtrU - 4;\r\n    T0 = _mm_loadl_epi64((__m128i*)(pTmp));\r\n    T1 = _mm_loadl_epi64((__m128i*)(pTmp + stride));\r\n    T2 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 2));\r\n    T3 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 3));\r\n\r\n    pTmp = SrcPtrV - 4;\r\n    T4 = _mm_loadl_epi64((__m128i*)(pTmp));\r\n    T5 = _mm_loadl_epi64((__m128i*)(pTmp + stride));\r\n    T6 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 2));\r\n    T7 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 3));\r\n\r\n    P0 = _mm_unpacklo_epi8(T0, T1);\r\n    P1 = _mm_unpacklo_epi8(T2, T3);\r\n    P2 = _mm_unpacklo_epi8(T4, T5);\r\n    P3 = _mm_unpacklo_epi8(T6, T7);\r\n\r\n    P4 = _mm_unpacklo_epi16(P0, P1);\r\n    P5 = _mm_unpacklo_epi16(P2, P3);\r\n    P6 = _mm_unpackhi_epi16(P0, P1);\r\n    P7 = _mm_unpackhi_epi16(P2, P3);\r\n\r\n    T0 = _mm_unpacklo_epi32(P4, P5);\r\n    T1 = _mm_unpackhi_epi32(P4, P5);\r\n    T2 = _mm_unpacklo_epi32(P6, P7);\r\n    T3 = _mm_unpackhi_epi32(P6, P7);\r\n\r\n    TL3 = _mm_unpacklo_epi8(T0, c_0);\r\n    TL2 = _mm_unpackhi_epi8(T0, c_0);\r\n    TL1 = _mm_unpacklo_epi8(T1, c_0);\r\n    TL0 = _mm_unpackhi_epi8(T1, c_0);\r\n\r\n    TR0 = _mm_unpacklo_epi8(T2, c_0);\r\n    TR1 = _mm_unpackhi_epi8(T2, c_0);\r\n    TR2 = _mm_unpacklo_epi8(T3, c_0);\r\n    TR3 = 
_mm_unpackhi_epi8(T3, c_0);\r\n\r\n#define _mm_subabs_epu16(a, b) _mm_abs_epi16(_mm_subs_epi16(a, b))\r\n\r\n    T0 = _mm_subabs_epu16(TL0, TR0);\r\n    T1 = _mm_cmpgt_epi16(T0, c_1);\r\n    T2 = _mm_cmpgt_epi16(ALPHA, T0);\r\n    M0 = _mm_set_epi32(flag1, flag0, flag1, flag0);\r\n    M0 = _mm_and_si128(M0, _mm_and_si128(T1, T2)); // mask1\r\n\r\n    T0 = _mm_subabs_epu16(TL1, TL0);\r\n    T1 = _mm_subabs_epu16(TR1, TR0);\r\n    FLT_L = _mm_and_si128(_mm_cmpgt_epi16(BETA, T0), c_2);\r\n    FLT_R = _mm_and_si128(_mm_cmpgt_epi16(BETA, T1), c_2);\r\n\r\n    T0 = _mm_subabs_epu16(TL2, TL0);\r\n    T1 = _mm_subabs_epu16(TR2, TR0);\r\n    M1 = _mm_cmpgt_epi16(BETA, T0);\r\n    M2 = _mm_cmpgt_epi16(BETA, T1);\r\n    FLT_L = _mm_add_epi16(_mm_and_si128(M1, c_1), FLT_L);\r\n    FLT_R = _mm_add_epi16(_mm_and_si128(M2, c_1), FLT_R);\r\n    FLT = _mm_add_epi16(FLT_L, FLT_R);\r\n\r\n    M1 = _mm_and_si128(_mm_cmpeq_epi16(TR0, TR1), _mm_cmpeq_epi16(TL0, TL1));\r\n    T0 = _mm_sub_epi16(FLT, c_3);\r\n    T1 = _mm_sub_epi16(FLT, c_4);\r\n    T2 = _mm_subabs_epu16(TL1, TR1);\r\n\r\n    FS56 = _mm_blendv_epi8(T1, T0, M1);\r\n    FS4 = _mm_blendv_epi8(c_0, c_1, _mm_cmpeq_epi16(FLT_L, c_2));\r\n\r\n    FS = _mm_blendv_epi8(c_0, FS56, _mm_cmpgt_epi16(FLT, c_4));\r\n    FS = _mm_blendv_epi8(FS, FS4, _mm_cmpeq_epi16(FLT, c_4));\r\n\r\n    FS = _mm_and_si128(FS, M0);\r\n\r\n#undef _mm_subabs_epu16\r\n\r\n    UVL0 = TL0;\r\n    UVL1 = TL1;\r\n    UVR0 = TR0;\r\n    UVR1 = TR1;\r\n\r\n    /* fs == 1 */\r\n    T2 = _mm_add_epi16(_mm_add_epi16(UVL0, UVR0), c_2); // L0 + R0 + 2\r\n\r\n    V0 = _mm_srli_epi16(_mm_add_epi16(_mm_slli_epi16(UVL0, 1), T2), 2);\r\n\r\n    V1 = _mm_srli_epi16(_mm_add_epi16(_mm_slli_epi16(UVR0, 1), T2), 2);\r\n\r\n    TL0 = _mm_blendv_epi8(TL0, V0, _mm_cmpeq_epi16(FS, c_1));\r\n    TR0 = _mm_blendv_epi8(TR0, V1, _mm_cmpeq_epi16(FS, c_1));\r\n\r\n    /* fs == 2 */\r\n    T2 = _mm_slli_epi16(T2, 1); // (L0 << 1) + (R0 << 1) + 4\r\n    T3 = _mm_slli_epi16(T3, 1);\r\n\r\n 
   T0 = _mm_add_epi16(_mm_slli_epi16(UVL1, 1), _mm_add_epi16(UVL1, UVR0));\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(UVL0, 3), _mm_add_epi16(T0, T2));\r\n    V0 = _mm_srli_epi16(_mm_add_epi16(T0, c_4), 4);\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(UVR1, 1), _mm_add_epi16(UVR1, UVL0));\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(UVR0, 3), _mm_add_epi16(T0, T2));\r\n    V1 = _mm_srli_epi16(_mm_add_epi16(T0, c_4), 4);\r\n\r\n    TL0 = _mm_blendv_epi8(TL0, V0, _mm_cmpeq_epi16(FS, c_2));\r\n    TR0 = _mm_blendv_epi8(TR0, V1, _mm_cmpeq_epi16(FS, c_2));\r\n\r\n    /* fs == 3 */\r\n    T2 = _mm_slli_epi16(T2, 1); // (L0 << 2) + (R0 << 2) + 8\r\n    T3 = _mm_slli_epi16(T3, 1);\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(UVL1, 2), _mm_add_epi16(TL2, UVR1));\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(UVL0, 1), _mm_add_epi16(T0, T2));\r\n    V0 = _mm_srli_epi16(T0, 4);\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(UVR1, 2), _mm_add_epi16(TR2, UVL1));\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(UVR0, 1), _mm_add_epi16(T0, T2));\r\n    V1 = _mm_srli_epi16(T0, 4);\r\n\r\n    TL0 = _mm_blendv_epi8(TL0, V0, _mm_cmpeq_epi16(FS, c_3));\r\n    TR0 = _mm_blendv_epi8(TR0, V1, _mm_cmpeq_epi16(FS, c_3));\r\n\r\n    T0 = _mm_add_epi16(_mm_add_epi16(TL2, UVR0), _mm_slli_epi16(TL2, 1));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(UVL1, 3));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(UVL0, 2));\r\n    V2 = _mm_srli_epi16(_mm_add_epi16(T0, c_8), 4);\r\n\r\n    T0 = _mm_add_epi16(_mm_add_epi16(TR2, UVL0), _mm_slli_epi16(TR2, 1));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(UVR1, 3));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(UVR0, 2));\r\n    V3 = _mm_srli_epi16(_mm_add_epi16(T0, c_8), 4);\r\n\r\n    TL1 = _mm_blendv_epi8(TL1, V2, _mm_cmpeq_epi16(FS, c_3));\r\n    TR1 = _mm_blendv_epi8(TR1, V3, _mm_cmpeq_epi16(FS, c_3));\r\n\r\n    /* store result */\r\n    T0 = _mm_packus_epi16(TL3, TR0);\r\n    T1 = _mm_packus_epi16(TL2, TR1);\r\n    T2 = _mm_packus_epi16(TL1, TR2);\r\n    T3 = _mm_packus_epi16(TL0, 
TR3);\r\n\r\n    P0 = _mm_unpacklo_epi8(T0, T1);\r\n    P1 = _mm_unpacklo_epi8(T2, T3);\r\n    P2 = _mm_unpackhi_epi8(T0, T1);\r\n    P3 = _mm_unpackhi_epi8(T2, T3);\r\n\r\n    P4 = _mm_unpacklo_epi16(P0, P1);\r\n    P5 = _mm_unpacklo_epi16(P2, P3);\r\n    P6 = _mm_unpackhi_epi16(P0, P1);\r\n    P7 = _mm_unpackhi_epi16(P2, P3);\r\n\r\n    T0 = _mm_unpacklo_epi32(P4, P5);\r\n    T1 = _mm_unpackhi_epi32(P4, P5);\r\n    T2 = _mm_unpacklo_epi32(P6, P7);\r\n    T3 = _mm_unpackhi_epi32(P6, P7);\r\n\r\n    pTmp = SrcPtrU - 4;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T0);\r\n    _mm_storel_epi64((__m128i*)(pTmp + stride), _mm_srli_si128(T0, 8));\r\n    _mm_storel_epi64((__m128i*)(pTmp + (stride << 1)), T1);\r\n    _mm_storel_epi64((__m128i*)(pTmp + stride * 3), _mm_srli_si128(T1, 8));\r\n\r\n    pTmp = SrcPtrV - 4;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T2);\r\n    _mm_storel_epi64((__m128i*)(pTmp + stride), _mm_srli_si128(T2, 8));\r\n    _mm_storel_epi64((__m128i*)(pTmp + (stride << 1)), T3);\r\n    _mm_storel_epi64((__m128i*)(pTmp + stride * 3), _mm_srli_si128(T3, 8));\r\n\r\n}\r\n\r\nvoid deblock_edge_hor_sse128(pel_t *SrcPtr, int stride, int Alpha, int Beta, uint8_t *flt_flag)\r\n{\r\n    int inc = stride;\r\n    int inc2 = inc << 1;\r\n    int inc3 = inc + inc2;\r\n    int flag0 = flt_flag[0] ? -1 : 0;\r\n    int flag1 = flt_flag[1] ? 
-1 : 0;\r\n\r\n    __m128i TL0, TL1, TL2;\r\n    __m128i TR0, TR1, TR2;\r\n    __m128i TL0w, TL1w, TL2w, TR0w, TR1w, TR2w; //for write\r\n    __m128i V0, V1, V2, V3, V4, V5;\r\n    __m128i T0, T1, T2;\r\n    __m128i M0, M1, M2;\r\n    __m128i FLT_L, FLT_R, FLT, FS;\r\n    __m128i FS3, FS4, FS56;\r\n\r\n    __m128i ALPHA = _mm_set1_epi16((short)Alpha);\r\n    __m128i BETA = _mm_set1_epi16((short)Beta);\r\n    __m128i c_0 = _mm_set1_epi16(0);\r\n    __m128i c_1 = _mm_set1_epi16(1);\r\n    __m128i c_2 = _mm_set1_epi16(2);\r\n    __m128i c_3 = _mm_set1_epi16(3);\r\n    __m128i c_4 = _mm_set1_epi16(4);\r\n    __m128i c_8 = _mm_set1_epi16(8);\r\n    __m128i c_16 = _mm_set1_epi16(16);\r\n\r\n    TL2 = _mm_loadl_epi64((__m128i*)(SrcPtr - inc3));\r\n    TL1 = _mm_loadl_epi64((__m128i*)(SrcPtr - inc2));\r\n    TL0 = _mm_loadl_epi64((__m128i*)(SrcPtr - inc));\r\n    TR0 = _mm_loadl_epi64((__m128i*)(SrcPtr + 0));\r\n    TR1 = _mm_loadl_epi64((__m128i*)(SrcPtr + inc));\r\n    TR2 = _mm_loadl_epi64((__m128i*)(SrcPtr + inc2));\r\n\r\n    TL2 = _mm_unpacklo_epi8(TL2, c_0);\r\n    TL1 = _mm_unpacklo_epi8(TL1, c_0);\r\n    TL0 = _mm_unpacklo_epi8(TL0, c_0);\r\n    TR0 = _mm_unpacklo_epi8(TR0, c_0);\r\n    TR1 = _mm_unpacklo_epi8(TR1, c_0);\r\n    TR2 = _mm_unpacklo_epi8(TR2, c_0);\r\n\r\n#define _mm_subabs_epu16(a, b) _mm_abs_epi16(_mm_subs_epi16(a, b))\r\n\r\n    T0 = _mm_subabs_epu16(TL0, TR0);\r\n    T1 = _mm_cmpgt_epi16(T0, c_1);\r\n    T2 = _mm_cmpgt_epi16(ALPHA, T0);\r\n    M0 = _mm_set_epi32(flag1, flag1, flag0, flag0);\r\n    M0 = _mm_and_si128(M0, _mm_and_si128(T1, T2)); // mask1\r\n\r\n    T0 = _mm_subabs_epu16(TL1, TL0);\r\n    T1 = _mm_subabs_epu16(TR1, TR0);\r\n    FLT_L = _mm_and_si128(_mm_cmpgt_epi16(BETA, T0), c_2);\r\n    FLT_R = _mm_and_si128(_mm_cmpgt_epi16(BETA, T1), c_2);\r\n\r\n    T0 = _mm_subabs_epu16(TL2, TL0);\r\n    T1 = _mm_subabs_epu16(TR2, TR0);\r\n    M1 = _mm_cmpgt_epi16(BETA, T0);\r\n    M2 = _mm_cmpgt_epi16(BETA, T1);\r\n    FLT_L = 
_mm_add_epi16(_mm_and_si128(M1, c_1), FLT_L);\r\n    FLT_R = _mm_add_epi16(_mm_and_si128(M2, c_1), FLT_R);\r\n    FLT = _mm_add_epi16(FLT_L, FLT_R);\r\n\r\n    M1 = _mm_and_si128(_mm_cmpeq_epi16(TR0, TR1), _mm_cmpeq_epi16(TL0, TL1));\r\n    T0 = _mm_subs_epi16(FLT, c_2);\r\n    T1 = _mm_subs_epi16(FLT, c_3);\r\n    T2 = _mm_subabs_epu16(TL1, TR1);\r\n\r\n    FS56 = _mm_blendv_epi8(T1, T0, M1);\r\n    FS4 = _mm_blendv_epi8(c_1, c_2, _mm_cmpeq_epi16(FLT_L, c_2));\r\n    FS3 = _mm_blendv_epi8(c_0, c_1, _mm_cmpgt_epi16(BETA, T2));\r\n\r\n    FS = _mm_blendv_epi8(c_0, FS56, _mm_cmpgt_epi16(FLT, c_4));\r\n    FS = _mm_blendv_epi8(FS, FS4, _mm_cmpeq_epi16(FLT, c_4));\r\n    FS = _mm_blendv_epi8(FS, FS3, _mm_cmpeq_epi16(FLT, c_3));\r\n\r\n    FS = _mm_and_si128(FS, M0);\r\n\r\n#undef _mm_subabs_epu16\r\n\r\n    TR0w = TR0;\r\n    TR1w = TR1;\r\n    TL0w = TL0;\r\n    TL1w = TL1;\r\n\r\n    /* fs == 1 */\r\n    T2 = _mm_add_epi16(_mm_add_epi16(TL0, TR0), c_2); // L0 + R0 + 2\r\n\r\n    V0 = _mm_srli_epi16(_mm_add_epi16(_mm_slli_epi16(TL0, 1), T2), 2);\r\n\r\n    V1 = _mm_srli_epi16(_mm_add_epi16(_mm_slli_epi16(TR0, 1), T2), 2);\r\n\r\n    TL0w = _mm_blendv_epi8(TL0, V0, _mm_cmpeq_epi16(FS, c_1));\r\n    TR0w = _mm_blendv_epi8(TR0, V1, _mm_cmpeq_epi16(FS, c_1));\r\n\r\n    /* fs == 2 */\r\n    T2 = _mm_slli_epi16(T2, 1); // (L0 << 1) + (R0 << 1) + 4\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL1, 1), _mm_add_epi16(TL1, TR0));\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL0, 3), _mm_add_epi16(T0, T2));\r\n\r\n    V0 = _mm_srli_epi16(_mm_add_epi16(T0, c_4), 4);\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR1, 1), _mm_add_epi16(TR1, TL0));\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR0, 3), _mm_add_epi16(T0, T2));\r\n\r\n    V1 = _mm_srli_epi16(_mm_add_epi16(T0, c_4), 4);\r\n\r\n    TL0w = _mm_blendv_epi8(TL0w, V0, _mm_cmpeq_epi16(FS, c_2));\r\n    TR0w = _mm_blendv_epi8(TR0w, V1, _mm_cmpeq_epi16(FS, c_2));\r\n\r\n    /* fs == 3 */\r\n    T2 = _mm_slli_epi16(T2, 1); // 
(L0 << 2) + (R0 << 2) + 8\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL1, 2), _mm_add_epi16(TL2, TR1));\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL0, 1), _mm_add_epi16(T0, T2));\r\n\r\n    V0 = _mm_srli_epi16(T0, 4);\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR1, 2), _mm_add_epi16(TR2, TL1));\r\n\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR0, 1), _mm_add_epi16(T0, T2));\r\n\r\n    V1 = _mm_srli_epi16(T0, 4);\r\n\r\n    TL0w = _mm_blendv_epi8(TL0w, V0, _mm_cmpeq_epi16(FS, c_3));\r\n    TR0w = _mm_blendv_epi8(TR0w, V1, _mm_cmpeq_epi16(FS, c_3));\r\n\r\n    T0 = _mm_add_epi16(_mm_add_epi16(TL2, TR0), _mm_slli_epi16(TL2, 1));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TL1, 3));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TL0, 2));\r\n    V2 = _mm_srli_epi16(_mm_add_epi16(T0, c_8), 4);\r\n\r\n    T0 = _mm_add_epi16(_mm_add_epi16(TR2, TL0), _mm_slli_epi16(TR2, 1));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TR1, 3));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TR0, 2));\r\n    V3 = _mm_srli_epi16(_mm_add_epi16(T0, c_8), 4);\r\n\r\n    TL1w = _mm_blendv_epi8(TL1w, V2, _mm_cmpeq_epi16(FS, c_3));\r\n    TR1w = _mm_blendv_epi8(TR1w, V3, _mm_cmpeq_epi16(FS, c_3));\r\n\r\n    FS = _mm_cmpeq_epi16(FS, c_4);\r\n\r\n    if (!_mm_testz_si128(FS, _mm_set1_epi16(-1))) { /* fs == 4 */\r\n        /* cal L0/R0 */\r\n        T0 = _mm_slli_epi16(_mm_add_epi16(_mm_add_epi16(TL0, TL2), TR0), 3);\r\n        T0 = _mm_add_epi16(_mm_add_epi16(T0, c_16), _mm_add_epi16(TL0, TL2));\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TR2, 1), _mm_slli_epi16(TR2, 2));\r\n        V0 = _mm_srli_epi16(_mm_add_epi16(T0, T2), 5);\r\n\r\n        T0 = _mm_slli_epi16(_mm_add_epi16(_mm_add_epi16(TR0, TR2), TL0), 3);\r\n        T0 = _mm_add_epi16(_mm_add_epi16(T0, c_16), _mm_add_epi16(TR0, TR2));\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TL2, 1), _mm_slli_epi16(TL2, 2));\r\n        V1 = _mm_srli_epi16(_mm_add_epi16(T0, T2), 5);\r\n\r\n        TL0w = _mm_blendv_epi8(TL0w, V0, FS);\r\n        TR0w = 
_mm_blendv_epi8(TR0w, V1, FS);\r\n\r\n        /* cal L1/R1 */\r\n        T0 = _mm_slli_epi16(_mm_add_epi16(TL2, TR0), 1);\r\n        T0 = _mm_add_epi16(T0, _mm_sub_epi16(_mm_slli_epi16(TL0, 3), TL0));\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TL2, 2), _mm_add_epi16(TR0, c_8));\r\n        V2 = _mm_srli_epi16(_mm_add_epi16(T0, T2), 4);\r\n\r\n        T0 = _mm_slli_epi16(_mm_add_epi16(TR2, TL0), 1);\r\n        T0 = _mm_add_epi16(T0, _mm_sub_epi16(_mm_slli_epi16(TR0, 3), TR0));\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TR2, 2), _mm_add_epi16(TL0, c_8));\r\n        V3 = _mm_srli_epi16(_mm_add_epi16(T0, T2), 4);\r\n\r\n        TL1w = _mm_blendv_epi8(TL1w, V2, FS);\r\n        TR1w = _mm_blendv_epi8(TR1w, V3, FS);\r\n\r\n        /* cal L2/R2 */\r\n        T0 = _mm_add_epi16(_mm_slli_epi16(TL2, 1), TL2);\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TL0, 2), TR0);\r\n        V4 = _mm_srli_epi16(_mm_add_epi16(T0, _mm_add_epi16(T2, c_4)), 3);\r\n\r\n        T0 = _mm_add_epi16(_mm_slli_epi16(TR2, 1), TR2);\r\n        T2 = _mm_add_epi16(_mm_slli_epi16(TR0, 2), TL0);\r\n        V5 = _mm_srli_epi16(_mm_add_epi16(T0, _mm_add_epi16(T2, c_4)), 3);\r\n\r\n        TL2w = _mm_blendv_epi8(TL2, V4, FS);\r\n        TR2w = _mm_blendv_epi8(TR2, V5, FS);\r\n\r\n        /* store result */\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc), _mm_packus_epi16(TL0w, c_0));\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - 0), _mm_packus_epi16(TR0w, c_0));\r\n\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc2), _mm_packus_epi16(TL1w, c_0));\r\n        _mm_storel_epi64((__m128i*)(SrcPtr + inc), _mm_packus_epi16(TR1w, c_0));\r\n\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc3), _mm_packus_epi16(TL2w, c_0));\r\n        _mm_storel_epi64((__m128i*)(SrcPtr + inc2), _mm_packus_epi16(TR2w, c_0));\r\n    } else {\r\n        /* store result */\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc), _mm_packus_epi16(TL0w, c_0));\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - 0), 
_mm_packus_epi16(TR0w, c_0));\r\n\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc2), _mm_packus_epi16(TL1w, c_0));\r\n        _mm_storel_epi64((__m128i*)(SrcPtr + inc), _mm_packus_epi16(TR1w, c_0));\r\n    }\r\n\r\n}\r\n\r\nvoid deblock_edge_hor_c_sse128(pel_t *SrcPtrU, pel_t *SrcPtrV, int stride, int Alpha, int Beta, uint8_t *flt_flag)\r\n{\r\n    int inc = stride;\r\n    int inc2 = inc << 1;\r\n    int inc3 = inc + inc2;\r\n    int flag0 = flt_flag[0] ? -1 : 0;\r\n    int flag1 = flt_flag[1] ? -1 : 0;\r\n\r\n    __m128i UL0, UL1, UR0, UR1;\r\n    __m128i TL0, TL1, TL2;\r\n    __m128i TR0, TR1, TR2;\r\n    __m128i T0, T1, T2;\r\n    __m128i V0, V1, V2, V3;\r\n    __m128i M0, M1, M2;\r\n    __m128i FLT_L, FLT_R, FLT, FS;\r\n    __m128i FS4, FS56;\r\n\r\n    __m128i ALPHA = _mm_set1_epi16((pel_t)Alpha);\r\n    __m128i BETA = _mm_set1_epi16((pel_t)Beta);\r\n    __m128i c_0 = _mm_set1_epi16(0);\r\n    __m128i c_1 = _mm_set1_epi16(1);\r\n    __m128i c_2 = _mm_set1_epi16(2);\r\n    __m128i c_3 = _mm_set1_epi16(3);\r\n    __m128i c_4 = _mm_set1_epi16(4);\r\n    __m128i c_8 = _mm_set1_epi16(8);\r\n\r\n    TL0 = _mm_set_epi32(0, 0, ((int32_t*)(SrcPtrV - inc))[0], ((int32_t*)(SrcPtrU - inc))[0]);\r\n    TL1 = _mm_set_epi32(0, 0, ((int32_t*)(SrcPtrV - inc2))[0], ((int32_t*)(SrcPtrU - inc2))[0]);\r\n    TL2 = _mm_set_epi32(0, 0, ((int32_t*)(SrcPtrV - inc3))[0], ((int32_t*)(SrcPtrU - inc3))[0]);\r\n    TR0 = _mm_set_epi32(0, 0, ((int32_t*)(SrcPtrV))[0], ((int32_t*)(SrcPtrU))[0]);\r\n    TR1 = _mm_set_epi32(0, 0, ((int32_t*)(SrcPtrV + inc))[0], ((int32_t*)(SrcPtrU + inc))[0]);\r\n    TR2 = _mm_set_epi32(0, 0, ((int32_t*)(SrcPtrV + inc2))[0], ((int32_t*)(SrcPtrU + inc2))[0]);\r\n\r\n    TL0 = _mm_unpacklo_epi8(TL0, c_0);\r\n    TL1 = _mm_unpacklo_epi8(TL1, c_0);\r\n    TL2 = _mm_unpacklo_epi8(TL2, c_0);\r\n    TR0 = _mm_unpacklo_epi8(TR0, c_0);\r\n    TR1 = _mm_unpacklo_epi8(TR1, c_0);\r\n    TR2 = _mm_unpacklo_epi8(TR2, c_0);\r\n\r\n#define _mm_subabs_epu16(a, b) 
_mm_abs_epi16(_mm_subs_epi16(a, b))\r\n\r\n    T0 = _mm_subabs_epu16(TL0, TR0);\r\n    T1 = _mm_cmpgt_epi16(T0, c_1);\r\n    T2 = _mm_cmpgt_epi16(ALPHA, T0);\r\n\r\n    M0 = _mm_set_epi32(flag1, flag0, flag1, flag0);\r\n    M0 = _mm_and_si128(M0, _mm_and_si128(T1, T2)); // mask1\r\n\r\n    T0 = _mm_subabs_epu16(TL1, TL0);\r\n    T1 = _mm_subabs_epu16(TR1, TR0);\r\n    FLT_L = _mm_and_si128(_mm_cmpgt_epi16(BETA, T0), c_2);\r\n    FLT_R = _mm_and_si128(_mm_cmpgt_epi16(BETA, T1), c_2);\r\n\r\n    T0 = _mm_subabs_epu16(TL2, TL0);\r\n    T1 = _mm_subabs_epu16(TR2, TR0);\r\n    M1 = _mm_cmpgt_epi16(BETA, T0);\r\n    M2 = _mm_cmpgt_epi16(BETA, T1);\r\n    FLT_L = _mm_add_epi16(_mm_and_si128(M1, c_1), FLT_L);\r\n    FLT_R = _mm_add_epi16(_mm_and_si128(M2, c_1), FLT_R);\r\n    FLT = _mm_add_epi16(FLT_L, FLT_R);\r\n\r\n    M1 = _mm_and_si128(_mm_cmpeq_epi16(TR0, TR1), _mm_cmpeq_epi16(TL0, TL1));\r\n    T0 = _mm_subs_epi16(FLT, c_3);\r\n    T1 = _mm_subs_epi16(FLT, c_4);\r\n\r\n    FS56 = _mm_blendv_epi8(T1, T0, M1);\r\n    FS4 = _mm_blendv_epi8(c_0, c_1, _mm_cmpeq_epi16(FLT_L, c_2));\r\n\r\n    FS = _mm_blendv_epi8(c_0, FS56, _mm_cmpgt_epi16(FLT, c_4));\r\n    FS = _mm_blendv_epi8(FS, FS4, _mm_cmpeq_epi16(FLT, c_4));\r\n\r\n    FS = _mm_and_si128(FS, M0);\r\n\r\n#undef _mm_subabs_epu16\r\n\r\n    UR0 = TR0;  //UR0 TR0 to store\r\n    UR1 = TR1;\r\n    UL0 = TL0;\r\n    UL1 = TL1;\r\n\r\n    /* fs == 1 */\r\n    T2 = _mm_add_epi16(_mm_add_epi16(TL0, TR0), c_2); // L0 + R0 + 2\r\n    V0 = _mm_srli_epi16(_mm_add_epi16(_mm_slli_epi16(TL0, 1), T2), 2);\r\n    V1 = _mm_srli_epi16(_mm_add_epi16(_mm_slli_epi16(TR0, 1), T2), 2);\r\n\r\n    UL0 = _mm_blendv_epi8(TL0, V0, _mm_cmpeq_epi16(FS, c_1));\r\n    UR0 = _mm_blendv_epi8(TR0, V1, _mm_cmpeq_epi16(FS, c_1));\r\n\r\n    /* fs == 2 */\r\n    T2 = _mm_slli_epi16(T2, 1); // (L0 << 1) + (R0 << 1) + 4\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL1, 1), _mm_add_epi16(TL1, TR0));\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL0, 3), 
_mm_add_epi16(T0, T2));\r\n    V0 = _mm_srli_epi16(_mm_add_epi16(T0, c_4), 4);\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR1, 1), _mm_add_epi16(TR1, TL0));\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR0, 3), _mm_add_epi16(T0, T2));\r\n    V1 = _mm_srli_epi16(_mm_add_epi16(T0, c_4), 4);\r\n\r\n    UL0 = _mm_blendv_epi8(UL0, V0, _mm_cmpeq_epi16(FS, c_2));\r\n    UR0 = _mm_blendv_epi8(UR0, V1, _mm_cmpeq_epi16(FS, c_2));\r\n\r\n    /* fs == 3 */\r\n    T2 = _mm_slli_epi16(T2, 1); // (L0 << 2) + (R0 << 2) + 8\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL1, 2), _mm_add_epi16(TL2, TR1));\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TL0, 1), _mm_add_epi16(T0, T2));\r\n    V0 = _mm_srli_epi16(T0, 4);\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR1, 2), _mm_add_epi16(TR2, TL1));\r\n    T0 = _mm_add_epi16(_mm_slli_epi16(TR0, 1), _mm_add_epi16(T0, T2));\r\n    V1 = _mm_srli_epi16(T0, 4);\r\n\r\n    UL0 = _mm_blendv_epi8(UL0, V0, _mm_cmpeq_epi16(FS, c_3));\r\n    UR0 = _mm_blendv_epi8(UR0, V1, _mm_cmpeq_epi16(FS, c_3));\r\n\r\n    T0 = _mm_add_epi16(_mm_add_epi16(TL2, TR0), _mm_slli_epi16(TL2, 1));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TL1, 3));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TL0, 2));\r\n    V2 = _mm_srli_epi16(_mm_add_epi16(T0, c_8), 4);\r\n\r\n    T0 = _mm_add_epi16(_mm_add_epi16(TR2, TL0), _mm_slli_epi16(TR2, 1));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TR1, 3));\r\n    T0 = _mm_add_epi16(T0, _mm_slli_epi16(TR0, 2));\r\n    V3 = _mm_srli_epi16(_mm_add_epi16(T0, c_8), 4);\r\n\r\n    UL1 = _mm_blendv_epi8(UL1, V2, _mm_cmpeq_epi16(FS, c_3));\r\n    UR1 = _mm_blendv_epi8(UR1, V3, _mm_cmpeq_epi16(FS, c_3));\r\n\r\n    /* store result */\r\n    UL0 = _mm_packus_epi16(UL0, c_0);\r\n    UL1 = _mm_packus_epi16(UL1, c_0);\r\n    UR0 = _mm_packus_epi16(UR0, c_0);\r\n    UR1 = _mm_packus_epi16(UR1, c_0);\r\n\r\n    ((int32_t*)(SrcPtrU - inc ))[0] = M128_I32(UL0, 0);\r\n    ((int32_t*)(SrcPtrU       ))[0] = M128_I32(UR0, 0);\r\n    ((int32_t*)(SrcPtrU - inc2))[0] = 
M128_I32(UL1, 0);\r\n    ((int32_t*)(SrcPtrU + inc ))[0] = M128_I32(UR1, 0);\r\n    ((int32_t*)(SrcPtrV - inc ))[0] = M128_I32(UL0, 1);\r\n    ((int32_t*)(SrcPtrV       ))[0] = M128_I32(UR0, 1);\r\n    ((int32_t*)(SrcPtrV - inc2))[0] = M128_I32(UL1, 1);\r\n    ((int32_t*)(SrcPtrV + inc ))[0] = M128_I32(UR1, 1);\r\n}\r\n\r\n#endif // #if !HIGH_BIT_DEPTH\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_deblock_avx2.cc",
    "content": "/*\r\n * intrinsic_deblock_avx2.cc\r\n *\r\n * Description of this file:\r\n *    AVX2 assembly functions of Deblock module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n#include <immintrin.h>\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#if !HIGH_BIT_DEPTH\r\n__m128i c_0_128;\r\n\r\n__m256i c_f;\r\n__m256i c_0;\r\n__m256i c_1;\r\n__m256i c_2;\r\n__m256i c_3;\r\n__m256i c_4;\r\n__m256i c_8;\r\n__m256i c_16;\r\n\r\n\r\n/*----------------------avx2-----------------------------------*/\r\n\r\nvoid deblock_edge_ver_avx2(pel_t *SrcPtr, int stride, int Alpha, int Beta, uint8_t *flt_flag)\r\n{\r\n    pel_t *pTmp = SrcPtr 
- 4;\r\n    int flag0 = flt_flag[0] ? -1 : 0;\r\n    int flag1 = flt_flag[1] ? -1 : 0;\r\n    __m128i TL0, TL1, TL2, TL3;\r\n    __m128i TR0, TR1, TR2, TR3;\r\n    __m128i T0, T1, T2, T3, T4, T5, T6, T7;\r\n    __m128i M0, M1;\r\n    __m128i FLT, FS;\r\n    __m128i FS3, FS4, FS56;\r\n    __m256i TLR0, TLR1, TLR2; // store TL* and TR*\r\n    __m256i TRL0, TRL1, TRL2; // store TR* and TL*\r\n    __m256i T0_256, T1_256, T2_256;\r\n    __m256i FLT_LR;\r\n    __m256i TLR0w, TLR1w;\r\n    __m256i FS_256;\r\n\r\n    __m128i ALPHA = _mm_set1_epi16((pel_t)Alpha);\r\n    __m128i BETA = _mm_set1_epi16((pel_t)Beta);\r\n    __m128i c0 = _mm_set1_epi16(0);\r\n    __m256i c_1_256 = _mm256_set1_epi16(1);\r\n    __m256i c_2_256 = _mm256_set1_epi16(2);\r\n    __m256i c_3_256 = _mm256_set1_epi16(3);\r\n    __m256i c_4_256 = _mm256_set1_epi16(4);\r\n    __m256i c_8_256 = _mm256_set1_epi16(8);\r\n    __m256i c_16_256 = _mm256_set1_epi16(16);\r\n    __m256i BETA_256 = _mm256_set1_epi16((short)Beta);\r\n\r\n    T0 = _mm_loadl_epi64((__m128i*)(pTmp));\r\n    T1 = _mm_loadl_epi64((__m128i*)(pTmp + stride));\r\n    T2 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 2));\r\n    T3 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 3));\r\n    T4 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 4));\r\n    T5 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 5));\r\n    T6 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 6));\r\n    T7 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 7));\r\n\r\n    //--------------- transpose -------------------------------\r\n    T0 = _mm_unpacklo_epi8(T0, T1);\r\n    T1 = _mm_unpacklo_epi8(T2, T3);\r\n    T2 = _mm_unpacklo_epi8(T4, T5);\r\n    T3 = _mm_unpacklo_epi8(T6, T7);\r\n\r\n    T4 = _mm_unpacklo_epi16(T0, T1);\r\n    T5 = _mm_unpacklo_epi16(T2, T3);\r\n    T6 = _mm_unpackhi_epi16(T0, T1);\r\n    T7 = _mm_unpackhi_epi16(T2, T3);\r\n\r\n    /*\r\n    TLR0 = _mm256_inserti128_si256(_mm256_castsi128_si256(T4), T6, 1);\r\n    TLR1 = 
_mm256_inserti128_si256(_mm256_castsi128_si256(T5), T7, 1);\r\n\r\n    TLR0w = _mm256_unpacklo_epi32(TLR0, TLR1);\t\t//T0 T2\r\n    TLR1w = _mm256_unpackhi_epi32(TLR0, TLR1);\t\t//T1 T3\r\n\r\n    TLR3 = _mm256_unpacklo_epi8(TLR0w, c_0_256);\t//TL3 TR0\r\n    TLR2 = _mm256_unpackhi_epi8(TLR0w, c_0_256);\t//TL2 TR1\r\n    TLR1 = _mm256_unpacklo_epi8(TLR1w, c_0_256);\t//TL1 TR2\r\n    TLR0 = _mm256_unpackhi_epi8(TLR1w, c_0_256);\t//TL0 TR3\r\n\r\n    TR0 = _mm256_extracti128_si256(TLR3, 0x01);\r\n    TR1 = _mm256_extracti128_si256(TLR2, 0x01);\r\n    TR2 = _mm256_extracti128_si256(TLR1, 0x01);\r\n    TR3 = _mm256_extracti128_si256(TLR0, 0x01);\r\n\r\n    TLR0 = _mm256_inserti128_si256(TLR0, TR0, 1);\r\n    TLR1 = _mm256_inserti128_si256(TLR1, TR1, 1);\r\n    TLR2 = _mm256_inserti128_si256(TLR2, TR2, 1);\r\n    TRL0 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR0), _mm256_castsi256_si128(TLR0), 1);\r\n    TRL1 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR1), _mm256_castsi256_si128(TLR1), 1);\r\n    */\r\n\r\n    T0 = _mm_unpacklo_epi32(T4, T5);\r\n    T1 = _mm_unpackhi_epi32(T4, T5);\r\n    T2 = _mm_unpacklo_epi32(T6, T7);\r\n    T3 = _mm_unpackhi_epi32(T6, T7);\r\n\r\n    TL3 = _mm_unpacklo_epi8(T0, c0);\r\n    TL2 = _mm_unpackhi_epi8(T0, c0);\r\n    TL1 = _mm_unpacklo_epi8(T1, c0);\r\n    TL0 = _mm_unpackhi_epi8(T1, c0);\r\n\r\n    TR0 = _mm_unpacklo_epi8(T2, c0);\r\n    TR1 = _mm_unpackhi_epi8(T2, c0);\r\n    TR2 = _mm_unpacklo_epi8(T3, c0);\r\n    TR3 = _mm_unpackhi_epi8(T3, c0);\r\n\r\n    TLR0 = _mm256_inserti128_si256(_mm256_castsi128_si256(TL0), TR0, 1);\r\n    TLR1 = _mm256_inserti128_si256(_mm256_castsi128_si256(TL1), TR1, 1);\r\n    TLR2 = _mm256_inserti128_si256(_mm256_castsi128_si256(TL2), TR2, 1);\r\n    TRL0 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR0), TL0, 1);\r\n    TRL1 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR1), TL1, 1);\r\n\r\n    T0 = _mm_abs_epi16(_mm_subs_epi16(TL0, TR0));\r\n    T1 = _mm_cmpgt_epi16(T0, 
_mm256_castsi256_si128(c_1_256));\r\n    T2 = _mm_cmpgt_epi16(ALPHA, T0);\r\n\r\n    M0 = _mm_set_epi32(flag1, flag1, flag0, flag0);\r\n    M0 = _mm_and_si128(M0, _mm_and_si128(T1, T2)); // mask1\r\n\r\n    T0_256 = _mm256_abs_epi16(_mm256_subs_epi16(TLR1, TLR0));\r\n    FLT_LR = _mm256_and_si256(_mm256_cmpgt_epi16(BETA_256, T0_256), c_2_256);\r\n\r\n    T1_256 = _mm256_abs_epi16(_mm256_subs_epi16(TLR2, TLR0));\r\n    T2_256 = _mm256_cmpgt_epi16(BETA_256, T1_256);\r\n\r\n    FLT_LR = _mm256_add_epi16(_mm256_and_si256(T2_256, c_1_256), FLT_LR);\r\n    FLT = _mm_add_epi16(_mm256_castsi256_si128(FLT_LR), _mm256_extracti128_si256(FLT_LR, 0x01));\r\n\r\n    T0_256 = _mm256_cmpeq_epi16(TLR1, TLR0);\r\n    M1 = _mm_and_si128(_mm256_castsi256_si128(T0_256), _mm256_extracti128_si256(T0_256, 0x01));\r\n    T0 = _mm_subs_epi16(FLT, _mm256_castsi256_si128(c_2_256));\r\n    T1 = _mm_subs_epi16(FLT, _mm256_castsi256_si128(c_3_256));\r\n\r\n    T2 = _mm_abs_epi16(_mm_subs_epi16(TL1, TR1));\r\n\r\n    FS56 = _mm_blendv_epi8(T1, T0, M1);\r\n    FS4 = _mm_blendv_epi8(_mm256_castsi256_si128(c_1_256), _mm256_castsi256_si128(c_2_256), _mm_cmpeq_epi16(_mm256_castsi256_si128(FLT_LR), _mm256_castsi256_si128(c_2_256)));\r\n    FS3 = _mm_blendv_epi8(c0, _mm256_castsi256_si128(c_1_256), _mm_cmpgt_epi16(BETA, T2));\r\n\r\n    FS = _mm_blendv_epi8(c0, FS56, _mm_cmpgt_epi16(FLT, _mm256_castsi256_si128(c_4_256)));\r\n    FS = _mm_blendv_epi8(FS, FS4, _mm_cmpeq_epi16(FLT, _mm256_castsi256_si128(c_4_256)));\r\n    FS = _mm_blendv_epi8(FS, FS3, _mm_cmpeq_epi16(FLT, _mm256_castsi256_si128(c_3_256)));\r\n\r\n    FS = _mm_and_si128(FS, M0);\r\n    FS_256 = _mm256_inserti128_si256(_mm256_castsi128_si256(FS), FS, 1);\r\n\r\n\r\n    TLR0w = TLR0;\r\n    TLR1w = TLR1;\r\n    /* fs == 1 */\r\n    T2 = _mm_add_epi16(_mm_add_epi16(TL0, TR0), _mm256_castsi256_si128(c_2_256)); // L0 + R0 + 2\r\n    T2_256 = _mm256_castsi128_si256(T2);\r\n    T2_256 = _mm256_inserti128_si256(T2_256, T2, 1); // save\r\n    
T1_256 = _mm256_srli_epi16(_mm256_add_epi16(_mm256_slli_epi16(TLR0, 1), T2_256), 2);\r\n    TLR0w = _mm256_blendv_epi8(TLR0, T1_256, _mm256_cmpeq_epi16(FS_256, c_1_256));\r\n\r\n    /* fs == 2 */\r\n    T2_256 = _mm256_slli_epi16(T2_256, 1);\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR1, 1), _mm256_add_epi16(TLR1, TRL0));\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR0, 3), _mm256_add_epi16(T0_256, T2_256));\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, c_4_256), 4);\r\n    TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, _mm256_cmpeq_epi16(FS_256, c_2_256));\r\n\r\n    /* fs == 3 */\r\n    T2_256 = _mm256_slli_epi16(T2_256, 1); // (L0 << 2) + (R0 << 2) + 8\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR1, 2), _mm256_add_epi16(TLR2, TRL1));\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR0, 1), _mm256_add_epi16(T0_256, T2_256));\r\n    T1_256 = _mm256_srli_epi16(T0_256, 4);\r\n    TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, _mm256_cmpeq_epi16(FS_256, c_3_256));\r\n\r\n    T0_256 = _mm256_add_epi16(_mm256_add_epi16(TLR2, TRL0), _mm256_slli_epi16(TLR2, 1));\r\n    T0_256 = _mm256_add_epi16(T0_256, _mm256_slli_epi16(TLR1, 3));\r\n    T0_256 = _mm256_add_epi16(T0_256, _mm256_slli_epi16(TLR0, 2));\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, c_8_256), 4);\r\n\r\n    TLR1w = _mm256_blendv_epi8(TLR1w, T1_256, _mm256_cmpeq_epi16(FS_256, c_3_256));\r\n\r\n    FS = _mm_cmpeq_epi16(FS, _mm256_castsi256_si128(c_4_256));\r\n\r\n    if (_mm_extract_epi64(FS, 0) || _mm_extract_epi64(FS, 1)) { /* fs == 4 */\r\n        TRL2 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR2), TL2, 1);\r\n        FS_256 = _mm256_inserti128_si256(_mm256_castsi128_si256(FS), FS, 1);\r\n\r\n        /* cal L0/R0 */\r\n        T0_256 = _mm256_slli_epi16(_mm256_add_epi16(_mm256_add_epi16(TLR0, TLR2), TRL0), 3);\r\n        T0_256 = _mm256_add_epi16(_mm256_add_epi16(T0_256, c_16_256), _mm256_add_epi16(TLR0, TLR2));\r\n        T2_256 = 
_mm256_add_epi16(_mm256_slli_epi16(TRL2, 1), _mm256_slli_epi16(TRL2, 2));\r\n        T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, T2_256), 5);\r\n\r\n        TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, FS_256);\r\n\r\n        /* cal L1/R1 */\r\n        T0_256 = _mm256_slli_epi16(_mm256_add_epi16(TLR2, TRL0), 1);\r\n        T0_256 = _mm256_add_epi16(T0_256, _mm256_sub_epi16(_mm256_slli_epi16(TLR0, 3), TLR0));\r\n        T2_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR2, 2), _mm256_add_epi16(TRL0, c_8_256));\r\n        T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, T2_256), 4);\r\n\r\n        TLR1w = _mm256_blendv_epi8(TLR1w, T1_256, FS_256);\r\n\r\n        /* cal L2/R2 */\r\n        T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR2, 1), TLR2);\r\n        T2_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR0, 2), TRL0);\r\n        T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, _mm256_add_epi16(T2_256, c_4_256)), 3);\r\n\r\n        TLR2 = _mm256_blendv_epi8(TLR2, T1_256, FS_256);\r\n\r\n    }\r\n\r\n    /* store result */\r\n    T4 = _mm_packus_epi16(TL3, _mm256_extracti128_si256(TLR0w, 0x01));\r\n    T5 = _mm_packus_epi16(_mm256_castsi256_si128(TLR2), _mm256_extracti128_si256(TLR1w, 0x01));\r\n    T6 = _mm_packus_epi16(_mm256_castsi256_si128(TLR1w), _mm256_extracti128_si256(TLR2, 0x01));\r\n    T7 = _mm_packus_epi16(_mm256_castsi256_si128(TLR0w), TR3);\r\n\r\n    T0 = _mm_unpacklo_epi8(T4, T5);\r\n    T1 = _mm_unpacklo_epi8(T6, T7);\r\n    T2 = _mm_unpackhi_epi8(T4, T5);\r\n    T3 = _mm_unpackhi_epi8(T6, T7);\r\n\r\n    T4 = _mm_unpacklo_epi16(T0, T1);\r\n    T5 = _mm_unpacklo_epi16(T2, T3);\r\n    T6 = _mm_unpackhi_epi16(T0, T1);\r\n    T7 = _mm_unpackhi_epi16(T2, T3);\r\n\r\n    T0 = _mm_unpacklo_epi32(T4, T5);\r\n    T1 = _mm_unpackhi_epi32(T4, T5);\r\n    T2 = _mm_unpacklo_epi32(T6, T7);\r\n    T3 = _mm_unpackhi_epi32(T6, T7);\r\n\r\n    pTmp = SrcPtr - 4;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T0);\r\n    pTmp += stride;\r\n    
_mm_storel_epi64((__m128i*)(pTmp), _mm_srli_si128(T0, 8));\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T1);\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), _mm_srli_si128(T1, 8));\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T2);\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), _mm_srli_si128(T2, 8));\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T3);\r\n    pTmp += stride;\r\n    _mm_storel_epi64((__m128i*)(pTmp), _mm_srli_si128(T3, 8));\r\n}\r\n\r\nvoid deblock_edge_ver_c_avx2(pel_t *SrcPtrU, pel_t *SrcPtrV, int stride, int Alpha, int Beta, uint8_t *flt_flag)\r\n{\r\n    pel_t *pTmp;\r\n    int flag0 = flt_flag[0] ? -1 : 0;\r\n    int flag1 = flt_flag[1] ? -1 : 0;\r\n\r\n    __m128i TL0, TL1, TL2, TL3;\r\n    __m128i TR0, TR1, TR2, TR3;\r\n    __m128i T0, T1, T2, T3, T4, T5, T6, T7;\r\n    __m128i M0, M1;\r\n    __m128i FLT, FS;\r\n    __m128i FS4, FS56;\r\n    __m256i TLR0, TLR1, TLR2; // store TL* and TR*\r\n    __m256i TRL0, TRL1; // store TR* and TL*\r\n    __m256i T0_256, T1_256, T2_256;\r\n    __m256i FLT_X;\r\n    __m256i TLR0w, TLR1w;\r\n    __m256i FS_256;\r\n\r\n    __m128i ALPHA = _mm_set1_epi16((pel_t)Alpha);\r\n    __m128i c0 = _mm_set1_epi16(0);\r\n    __m256i c_1_256 = _mm256_set1_epi16(1);\r\n    __m256i c_2_256 = _mm256_set1_epi16(2);\r\n    __m256i c_3_256 = _mm256_set1_epi16(3);\r\n    __m256i c_4_256 = _mm256_set1_epi16(4);\r\n    __m256i c_8_256 = _mm256_set1_epi16(8);\r\n    __m256i BETA_256 = _mm256_set1_epi16((short)Beta);\r\n\r\n    pTmp = SrcPtrU - 4;\r\n    T0 = _mm_loadl_epi64((__m128i*)(pTmp));\r\n    T1 = _mm_loadl_epi64((__m128i*)(pTmp + stride));\r\n    T2 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 2));\r\n    T3 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 3));\r\n\r\n    pTmp = SrcPtrV - 4;\r\n    T4 = _mm_loadl_epi64((__m128i*)(pTmp));\r\n    T5 = _mm_loadl_epi64((__m128i*)(pTmp + stride));\r\n    T6 = _mm_loadl_epi64((__m128i*)(pTmp + 
stride * 2));\r\n    T7 = _mm_loadl_epi64((__m128i*)(pTmp + stride * 3));\r\n\r\n    T0 = _mm_unpacklo_epi8(T0, T1);\r\n    T1 = _mm_unpacklo_epi8(T2, T3);\r\n    T2 = _mm_unpacklo_epi8(T4, T5);\r\n    T3 = _mm_unpacklo_epi8(T6, T7);\r\n\r\n    T4 = _mm_unpacklo_epi16(T0, T1);\r\n    T5 = _mm_unpacklo_epi16(T2, T3);\r\n    T6 = _mm_unpackhi_epi16(T0, T1);\r\n    T7 = _mm_unpackhi_epi16(T2, T3);\r\n\r\n    T0 = _mm_unpacklo_epi32(T4, T5);\r\n    T1 = _mm_unpackhi_epi32(T4, T5);\r\n    T2 = _mm_unpacklo_epi32(T6, T7);\r\n    T3 = _mm_unpackhi_epi32(T6, T7);\r\n\r\n    TL3 = _mm_unpacklo_epi8(T0, c0);\r\n    TL2 = _mm_unpackhi_epi8(T0, c0);\r\n    TL1 = _mm_unpacklo_epi8(T1, c0);\r\n    TL0 = _mm_unpackhi_epi8(T1, c0);\r\n\r\n    TR0 = _mm_unpacklo_epi8(T2, c0);\r\n    TR1 = _mm_unpackhi_epi8(T2, c0);\r\n    TR2 = _mm_unpacklo_epi8(T3, c0);\r\n    TR3 = _mm_unpackhi_epi8(T3, c0);\r\n\r\n    TLR0 = _mm256_inserti128_si256(_mm256_castsi128_si256(TL0), TR0, 1);\r\n    TLR1 = _mm256_inserti128_si256(_mm256_castsi128_si256(TL1), TR1, 1);\r\n    TLR2 = _mm256_inserti128_si256(_mm256_castsi128_si256(TL2), TR2, 1);\r\n    TRL0 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR0), TL0, 1);\r\n    TRL1 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR1), TL1, 1);\r\n\r\n    T0 = _mm_abs_epi16(_mm_subs_epi16(_mm256_castsi256_si128(TLR0), _mm256_castsi256_si128(TRL0)));\r\n    T1 = _mm_cmpgt_epi16(T0, _mm256_castsi256_si128(c_1_256));\r\n    T2 = _mm_cmpgt_epi16(ALPHA, T0);\r\n\r\n    M0 = _mm_set_epi32(flag1, flag0, flag1, flag0);\r\n    M0 = _mm_and_si128(M0, _mm_and_si128(T1, T2)); // mask1\r\n\r\n    T0_256 = _mm256_abs_epi16(_mm256_subs_epi16(TLR1, TLR0));\r\n\r\n    FLT_X = _mm256_and_si256(_mm256_cmpgt_epi16(BETA_256, T0_256), c_2_256);\r\n\r\n    T0_256 = _mm256_abs_epi16(_mm256_subs_epi16(TLR2, TLR0));\r\n    T1_256 = _mm256_and_si256(_mm256_cmpgt_epi16(BETA_256, T0_256), c_1_256);\r\n\r\n    FLT_X = _mm256_add_epi16(T1_256, FLT_X);\r\n    FLT = 
_mm_add_epi16(_mm256_castsi256_si128(FLT_X), _mm256_extracti128_si256(FLT_X, 0x01));\r\n\r\n    T0_256 = _mm256_cmpeq_epi16(TLR1, TLR0);\r\n    M1 = _mm_and_si128(_mm256_castsi256_si128(T0_256), _mm256_extracti128_si256(T0_256, 0x01));\r\n    T0 = _mm_subs_epi16(FLT, _mm256_castsi256_si128(c_3_256));\r\n    T1 = _mm_subs_epi16(FLT, _mm256_castsi256_si128(c_4_256));\r\n\r\n    FS56 = _mm_blendv_epi8(T1, T0, M1);\r\n    FS4 = _mm_blendv_epi8(c0, _mm256_castsi256_si128(c_1_256), _mm_cmpeq_epi16(_mm256_castsi256_si128(FLT_X), _mm256_castsi256_si128(c_2_256)));\r\n\r\n    FS = _mm_blendv_epi8(c0, FS56, _mm_cmpgt_epi16(FLT, _mm256_castsi256_si128(c_4_256)));\r\n    FS = _mm_blendv_epi8(FS, FS4, _mm_cmpeq_epi16(FLT, _mm256_castsi256_si128(c_4_256)));\r\n\r\n    FS = _mm_and_si128(FS, M0);\r\n    FS_256 = _mm256_inserti128_si256(_mm256_castsi128_si256(FS), FS, 1);\r\n\r\n    TLR0w = TLR0;\r\n    TLR1w = TLR1;\r\n    /* fs == 1 */\r\n    T2 = _mm_add_epi16(_mm_add_epi16(_mm256_castsi256_si128(TLR0), _mm256_castsi256_si128(TRL0)), _mm256_castsi256_si128(c_2_256)); // L0 + R0 + 2\r\n    T2_256 = _mm256_castsi128_si256(T2);\r\n    T2_256 = _mm256_inserti128_si256(T2_256, T2, 1); // save\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(_mm256_slli_epi16(TLR0, 1), T2_256), 2);\r\n    TLR0w = _mm256_blendv_epi8(TLR0, T1_256, _mm256_cmpeq_epi16(FS_256, c_1_256));\r\n\r\n    /* fs == 2 */\r\n    T2_256 = _mm256_slli_epi16(T2_256, 1);\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR1, 1), _mm256_add_epi16(TLR1, TRL0));\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR0, 3), _mm256_add_epi16(T0_256, T2_256));\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, c_4_256), 4);\r\n    TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, _mm256_cmpeq_epi16(FS_256, c_2_256));\r\n\r\n    /* fs == 3 */\r\n    T2_256 = _mm256_slli_epi16(T2_256, 1); // (L0 << 2) + (R0 << 2) + 8\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR1, 2), _mm256_add_epi16(TLR2, TRL1));\r\n    T0_256 = 
_mm256_add_epi16(_mm256_slli_epi16(TLR0, 1), _mm256_add_epi16(T0_256, T2_256));\r\n    T1_256 = _mm256_srli_epi16(T0_256, 4);\r\n    TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, _mm256_cmpeq_epi16(FS_256, c_3_256));\r\n\r\n    T0_256 = _mm256_add_epi16(_mm256_add_epi16(TLR2, TRL0), _mm256_slli_epi16(TLR2, 1));\r\n    T0_256 = _mm256_add_epi16(T0_256, _mm256_slli_epi16(TLR1, 3));\r\n    T0_256 = _mm256_add_epi16(T0_256, _mm256_slli_epi16(TLR0, 2));\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, c_8_256), 4);\r\n\r\n    TLR1w = _mm256_blendv_epi8(TLR1w, T1_256, _mm256_cmpeq_epi16(FS_256, c_3_256));\r\n\r\n    /* store result */\r\n    T4 = _mm_packus_epi16(TL3, _mm256_extracti128_si256(TLR0w, 0x01));\r\n    T5 = _mm_packus_epi16(TL2, _mm256_extracti128_si256(TLR1w, 0x01));\r\n    T6 = _mm_packus_epi16(_mm256_castsi256_si128(TLR1w), TR2);\r\n    T7 = _mm_packus_epi16(_mm256_castsi256_si128(TLR0w), TR3);\r\n\r\n    T0 = _mm_unpacklo_epi8(T4, T5);\r\n    T1 = _mm_unpacklo_epi8(T6, T7);\r\n    T2 = _mm_unpackhi_epi8(T4, T5);\r\n    T3 = _mm_unpackhi_epi8(T6, T7);\r\n\r\n    T4 = _mm_unpacklo_epi16(T0, T1);\r\n    T5 = _mm_unpacklo_epi16(T2, T3);\r\n    T6 = _mm_unpackhi_epi16(T0, T1);\r\n    T7 = _mm_unpackhi_epi16(T2, T3);\r\n\r\n    T0 = _mm_unpacklo_epi32(T4, T5);\r\n    T1 = _mm_unpackhi_epi32(T4, T5);\r\n    T2 = _mm_unpacklo_epi32(T6, T7);\r\n    T3 = _mm_unpackhi_epi32(T6, T7);\r\n\r\n    pTmp = SrcPtrU - 4;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T0);\r\n    _mm_storel_epi64((__m128i*)(pTmp + stride), _mm_srli_si128(T0, 8));\r\n    _mm_storel_epi64((__m128i*)(pTmp + (stride << 1)), T1);\r\n    _mm_storel_epi64((__m128i*)(pTmp + stride * 3), _mm_srli_si128(T1, 8));\r\n\r\n    pTmp = SrcPtrV - 4;\r\n    _mm_storel_epi64((__m128i*)(pTmp), T2);\r\n    _mm_storel_epi64((__m128i*)(pTmp + stride), _mm_srli_si128(T2, 8));\r\n    _mm_storel_epi64((__m128i*)(pTmp + (stride << 1)), T3);\r\n    _mm_storel_epi64((__m128i*)(pTmp + stride * 3), _mm_srli_si128(T3, 
8));\r\n\r\n}\r\n\r\nvoid deblock_edge_hor_avx2(pel_t *SrcPtr, int stride, int Alpha, int Beta, uint8_t *flt_flag)\r\n{\r\n    \r\n    int inc = stride;\r\n    int inc2 = inc << 1;\r\n    int inc3 = inc + inc2;\r\n    int flag0 = flt_flag[0] ? -1 : 0;\r\n    int flag1 = flt_flag[1] ? -1 : 0;\r\n\r\n    __m128i TL0, TL1, TL2;\r\n    __m128i TR0, TR1, TR2;\r\n    __m128i T0, T1, T2;\r\n    __m128i M0, M1;\r\n    __m128i FLT, FS;\r\n    __m128i FS3, FS4, FS56;\r\n    __m256i TLR0, TLR1, TLR2; // store TL* and TR*\r\n    __m256i TRL0, TRL1, TRL2; // store TR* and TL*\r\n    __m256i T0_256, T1_256, T2_256;\r\n    __m256i FLT_X;\r\n    __m256i TLR0w, TLR1w;\r\n    __m256i FS_256;\r\n\r\n    __m128i ALPHA = _mm_set1_epi16((short)Alpha);\r\n    __m128i BETA = _mm_set1_epi16((short)Beta);\r\n    __m128i c0 = _mm_set1_epi16(0);\r\n    __m256i c_0_256 = _mm256_setzero_si256();\r\n    __m256i c_1_256 = _mm256_set1_epi16(1);\r\n    __m256i c_2_256 = _mm256_set1_epi16(2);\r\n    __m256i c_3_256 = _mm256_set1_epi16(3);\r\n    __m256i c_4_256 = _mm256_set1_epi16(4);\r\n    __m256i c_8_256 = _mm256_set1_epi16(8);\r\n    __m256i c_16_256 = _mm256_set1_epi16(16);\r\n    __m256i BETA_256 = _mm256_set1_epi16((short)Beta);\r\n\r\n    TL2 = _mm_loadl_epi64((__m128i*)(SrcPtr - inc3));\r\n    TL1 = _mm_loadl_epi64((__m128i*)(SrcPtr - inc2));\r\n    TL0 = _mm_loadl_epi64((__m128i*)(SrcPtr - inc));\r\n    TR0 = _mm_loadl_epi64((__m128i*)(SrcPtr + 0));\r\n    TR1 = _mm_loadl_epi64((__m128i*)(SrcPtr + inc));\r\n    TR2 = _mm_loadl_epi64((__m128i*)(SrcPtr + inc2));\r\n\r\n    TL2 = _mm_unpacklo_epi8(TL2, c0);\r\n    TL1 = _mm_unpacklo_epi8(TL1, c0);\r\n    TL0 = _mm_unpacklo_epi8(TL0, c0);\r\n    TR0 = _mm_unpacklo_epi8(TR0, c0);\r\n    TR1 = _mm_unpacklo_epi8(TR1, c0);\r\n    TR2 = _mm_unpacklo_epi8(TR2, c0);\r\n\r\n    TLR0 = _mm256_inserti128_si256(_mm256_castsi128_si256(TL0), TR0, 1);\r\n    TLR1 = _mm256_inserti128_si256(_mm256_castsi128_si256(TL1), TR1, 1);\r\n    TLR2 = 
_mm256_inserti128_si256(_mm256_castsi128_si256(TL2), TR2, 1);\r\n    TRL0 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR0), TL0, 1);\r\n    TRL1 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR1), TL1, 1);\r\n\r\n    T0 = _mm_abs_epi16(_mm_subs_epi16(TL0, TR0));\r\n    T1 = _mm_cmpgt_epi16(T0, _mm256_castsi256_si128(c_1_256));\r\n    T2 = _mm_cmpgt_epi16(ALPHA, T0);\r\n\r\n    M0 = _mm_set_epi32(flag1, flag1, flag0, flag0);\r\n    M0 = _mm_and_si128(M0, _mm_and_si128(T1, T2)); // mask1\r\n\r\n    T0_256 = _mm256_abs_epi16(_mm256_subs_epi16(TLR1, TLR0));\r\n\r\n    FLT_X = _mm256_and_si256(_mm256_cmpgt_epi16(BETA_256, T0_256), c_2_256);\r\n\r\n    T0_256 = _mm256_abs_epi16(_mm256_subs_epi16(TLR2, TLR0));\r\n    T1_256 = _mm256_and_si256(_mm256_cmpgt_epi16(BETA_256, T0_256), c_1_256);\r\n\r\n    FLT_X = _mm256_add_epi16(T1_256, FLT_X);\r\n    FLT = _mm_add_epi16(_mm256_castsi256_si128(FLT_X), _mm256_extracti128_si256(FLT_X, 0x01));\r\n\r\n    T0_256 = _mm256_cmpeq_epi16(TLR1, TLR0);\r\n    M1 = _mm_and_si128(_mm256_castsi256_si128(T0_256), _mm256_extracti128_si256(T0_256, 0x01));\r\n    T0 = _mm_subs_epi16(FLT, _mm256_castsi256_si128(c_2_256));\r\n    T1 = _mm_subs_epi16(FLT, _mm256_castsi256_si128(c_3_256));\r\n\r\n    T2 = _mm_abs_epi16(_mm_subs_epi16(TL1, TR1));\r\n\r\n    FS56 = _mm_blendv_epi8(T1, T0, M1);\r\n    FS4 = _mm_blendv_epi8(_mm256_castsi256_si128(c_1_256), _mm256_castsi256_si128(c_2_256), _mm_cmpeq_epi16(_mm256_castsi256_si128(FLT_X), _mm256_castsi256_si128(c_2_256)));\r\n    FS3 = _mm_blendv_epi8(c0, _mm256_castsi256_si128(c_1_256), _mm_cmpgt_epi16(BETA, T2));\r\n\r\n    FS = _mm_blendv_epi8(c0, FS56, _mm_cmpgt_epi16(FLT, _mm256_castsi256_si128(c_4_256)));\r\n    FS = _mm_blendv_epi8(FS, FS4, _mm_cmpeq_epi16(FLT, _mm256_castsi256_si128(c_4_256)));\r\n    FS = _mm_blendv_epi8(FS, FS3, _mm_cmpeq_epi16(FLT, _mm256_castsi256_si128(c_3_256)));\r\n\r\n    FS = _mm_and_si128(FS, M0);\r\n    FS_256 = 
_mm256_inserti128_si256(_mm256_castsi128_si256(FS), FS, 1);\r\n\r\n    TLR0w = TLR0;\r\n    TLR1w = TLR1;\r\n    /* fs == 1 */\r\n    T2 = _mm_add_epi16(_mm_add_epi16(TL0, TR0), _mm256_castsi256_si128(c_2_256)); // L0 + R0 + 2\r\n    T2_256 = _mm256_castsi128_si256(T2);\r\n    T2_256 = _mm256_inserti128_si256(T2_256, T2, 1); // save\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(_mm256_slli_epi16(TLR0, 1), T2_256), 2);\r\n    TLR0w = _mm256_blendv_epi8(TLR0, T1_256, _mm256_cmpeq_epi16(FS_256, c_1_256));\r\n\r\n    /* fs == 2 */\r\n    T2_256 = _mm256_slli_epi16(T2_256, 1);\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR1, 1), _mm256_add_epi16(TLR1, TRL0));\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR0, 3), _mm256_add_epi16(T0_256, T2_256));\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, c_4_256), 4);\r\n    TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, _mm256_cmpeq_epi16(FS_256, c_2_256));\r\n\r\n    /* fs == 3 */\r\n    T2_256 = _mm256_slli_epi16(T2_256, 1); // (L0 << 2) + (R0 << 2) + 8\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR1, 2), _mm256_add_epi16(TLR2, TRL1));\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR0, 1), _mm256_add_epi16(T0_256, T2_256));\r\n    T1_256 = _mm256_srli_epi16(T0_256, 4);\r\n    TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, _mm256_cmpeq_epi16(FS_256, c_3_256));\r\n\r\n    T0_256 = _mm256_add_epi16(_mm256_add_epi16(TLR2, TRL0), _mm256_slli_epi16(TLR2, 1));\r\n    T0_256 = _mm256_add_epi16(T0_256, _mm256_slli_epi16(TLR1, 3));\r\n    T0_256 = _mm256_add_epi16(T0_256, _mm256_slli_epi16(TLR0, 2));\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, c_8_256), 4);\r\n\r\n    TLR1w = _mm256_blendv_epi8(TLR1w, T1_256, _mm256_cmpeq_epi16(FS_256, c_3_256));\r\n\r\n    FS = _mm_cmpeq_epi16(FS, _mm256_castsi256_si128(c_4_256));\r\n\r\n    if (_mm_extract_epi64(FS, 0) || _mm_extract_epi64(FS, 1)) { /* fs == 4 */\r\n        TRL2 = _mm256_inserti128_si256(_mm256_castsi128_si256(TR2), TL2, 1);\r\n        
FS_256 = _mm256_inserti128_si256(_mm256_castsi128_si256(FS), FS, 1);\r\n\r\n        /* cal L0/R0 */\r\n        T0_256 = _mm256_slli_epi16(_mm256_add_epi16(_mm256_add_epi16(TLR0, TLR2), TRL0), 3);\r\n        T0_256 = _mm256_add_epi16(_mm256_add_epi16(T0_256, c_16_256), _mm256_add_epi16(TLR0, TLR2));\r\n        T2_256 = _mm256_add_epi16(_mm256_slli_epi16(TRL2, 1), _mm256_slli_epi16(TRL2, 2));\r\n        T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, T2_256), 5);\r\n\r\n        TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, FS_256);\r\n\r\n        /* cal L1/R1 */\r\n        T0_256 = _mm256_slli_epi16(_mm256_add_epi16(TLR2, TRL0), 1);\r\n        T0_256 = _mm256_add_epi16(T0_256, _mm256_sub_epi16(_mm256_slli_epi16(TLR0, 3), TLR0));\r\n        T2_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR2, 2), _mm256_add_epi16(TRL0, c_8_256));\r\n        T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, T2_256), 4);\r\n\r\n        TLR1w = _mm256_blendv_epi8(TLR1w, T1_256, FS_256);\r\n\r\n        /* cal L2/R2 */\r\n        T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR2, 1), TLR2);\r\n        T2_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR0, 2), TRL0);\r\n        T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, _mm256_add_epi16(T2_256, c_4_256)), 3);\r\n\r\n        TLR2 = _mm256_blendv_epi8(TLR2, T1_256, FS_256);\r\n\r\n        TLR0w = _mm256_packus_epi16(TLR0w, c_0_256);\r\n        TLR1w = _mm256_packus_epi16(TLR1w, c_0_256);\r\n        TLR2 = _mm256_packus_epi16(TLR2, c_0_256);\r\n        /* store result */\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc), _mm256_castsi256_si128(TLR0w));\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - 0), _mm256_extracti128_si256(TLR0w, 0x01));\r\n\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc2), _mm256_castsi256_si128(TLR1w));\r\n        _mm_storel_epi64((__m128i*)(SrcPtr + inc), _mm256_extracti128_si256(TLR1w, 0x01));\r\n\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc3), _mm256_castsi256_si128(TLR2));\r\n        
_mm_storel_epi64((__m128i*)(SrcPtr + inc2), _mm256_extracti128_si256(TLR2, 0x01));\r\n    } else {\r\n        /* store result */\r\n        TLR0w = _mm256_packus_epi16(TLR0w, c_0_256);\r\n        TLR1w = _mm256_packus_epi16(TLR1w, c_0_256);\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc), _mm256_castsi256_si128(TLR0w));\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - 0), _mm256_extracti128_si256(TLR0w, 0x01));\r\n\r\n        _mm_storel_epi64((__m128i*)(SrcPtr - inc2), _mm256_castsi256_si128(TLR1w));\r\n        _mm_storel_epi64((__m128i*)(SrcPtr + inc), _mm256_extracti128_si256(TLR1w, 0x01));\r\n    }\r\n\r\n}\r\n\r\n// i32s_t changed to int32_t (signed int)\r\nvoid deblock_edge_hor_c_avx2(pel_t *SrcPtrU, pel_t *SrcPtrV, int stride, int Alpha, int Beta, uint8_t *flt_flag)\r\n{\r\n    int inc = stride;\r\n    int inc2 = inc << 1;\r\n    int inc3 = inc + inc2;\r\n    int flag0 = flt_flag[0] ? -1 : 0;\r\n    int flag1 = flt_flag[1] ? -1 : 0;\r\n\r\n    __m128i T0, T1, T2;\r\n    __m128i M0, M1;\r\n    __m128i FLT, FS;\r\n    __m128i FS4, FS56;\r\n\r\n    __m256i TLR0, TLR1, TLR2; // store TL* and TR*\r\n    __m256i TRL0, TRL1; // store TR* and TL*\r\n    __m256i T0_256, T1_256, T2_256;\r\n    __m256i FLT_X;\r\n    __m256i TLR0w, TLR1w;\r\n    __m256i FS_256;\r\n\r\n    __m128i ALPHA = _mm_set1_epi16((short)Alpha);\r\n    __m128i c0 = _mm_set1_epi16(0);\r\n    __m256i c_0_256 = _mm256_setzero_si256();\r\n    __m256i c_1_256 = _mm256_set1_epi16(1);\r\n    __m256i c_2_256 = _mm256_set1_epi16(2);\r\n    __m256i c_3_256 = _mm256_set1_epi16(3);\r\n    __m256i c_4_256 = _mm256_set1_epi16(4);\r\n    __m256i c_8_256 = _mm256_set1_epi16(8);\r\n    __m256i BETA_256 = _mm256_set1_epi16((short)Beta);\r\n    __m256i mask0 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, -1);\r\n    __m256i mask1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, 0);\r\n    __m256i mask4 = _mm256_set_epi32(0, 0, 0, -1, 0, 0, 0, 0);\r\n    __m256i mask5 = _mm256_set_epi32(0, 0, -1, 0, 0, 0, 0, 0);\r\n\r\n    TLR0 = 
_mm256_set_epi32(0, 0, ((int32_t*)(SrcPtrV))[0], ((int32_t*)(SrcPtrU))[0], 0, 0, ((int32_t*)(SrcPtrV - inc))[0], ((int32_t*)(SrcPtrU - inc))[0]);\r\n    TLR1 = _mm256_set_epi32(0, 0, ((int32_t*)(SrcPtrV + inc))[0], ((int32_t*)(SrcPtrU + inc))[0], 0, 0, ((int32_t*)(SrcPtrV - inc2))[0], ((int32_t*)(SrcPtrU - inc2))[0]);\r\n    TLR2 = _mm256_set_epi32(0, 0, ((int32_t*)(SrcPtrV + inc2))[0], ((int32_t*)(SrcPtrU + inc2))[0], 0, 0, ((int32_t*)(SrcPtrV - inc3))[0], ((int32_t*)(SrcPtrU - inc3))[0]);\r\n\r\n    TLR0 = _mm256_unpacklo_epi8(TLR0, c_0_256);\r\n    TLR1 = _mm256_unpacklo_epi8(TLR1, c_0_256);\r\n    TLR2 = _mm256_unpacklo_epi8(TLR2, c_0_256);\r\n\r\n    TRL0 = _mm256_inserti128_si256(_mm256_castsi128_si256(_mm256_extracti128_si256(TLR0, 0x01)), _mm256_castsi256_si128(TLR0), 1);\r\n    TRL1 = _mm256_inserti128_si256(_mm256_castsi128_si256(_mm256_extracti128_si256(TLR1, 0x01)), _mm256_castsi256_si128(TLR1), 1);\r\n\r\n    T0 = _mm_abs_epi16(_mm_subs_epi16(_mm256_castsi256_si128(TLR0), _mm256_castsi256_si128(TRL0)));\r\n    T1 = _mm_cmpgt_epi16(T0, _mm256_castsi256_si128(c_1_256));\r\n    T2 = _mm_cmpgt_epi16(ALPHA, T0);\r\n\r\n    M0 = _mm_set_epi32(flag1, flag0, flag1, flag0);\r\n    M0 = _mm_and_si128(M0, _mm_and_si128(T1, T2)); // mask1\r\n\r\n    T0_256 = _mm256_abs_epi16(_mm256_subs_epi16(TLR1, TLR0));\r\n\r\n    FLT_X = _mm256_and_si256(_mm256_cmpgt_epi16(BETA_256, T0_256), c_2_256);\r\n\r\n    T0_256 = _mm256_abs_epi16(_mm256_subs_epi16(TLR2, TLR0));\r\n    T1_256 = _mm256_and_si256(_mm256_cmpgt_epi16(BETA_256, T0_256), c_1_256);\r\n\r\n    FLT_X = _mm256_add_epi16(T1_256, FLT_X);\r\n    FLT = _mm_add_epi16(_mm256_castsi256_si128(FLT_X), _mm256_extracti128_si256(FLT_X, 0x01));\r\n\r\n    T0_256 = _mm256_cmpeq_epi16(TLR1, TLR0);\r\n    M1 = _mm_and_si128(_mm256_castsi256_si128(T0_256), _mm256_extracti128_si256(T0_256, 0x01));\r\n    T0 = _mm_subs_epi16(FLT, _mm256_castsi256_si128(c_3_256));\r\n    T1 = _mm_subs_epi16(FLT, 
_mm256_castsi256_si128(c_4_256));\r\n\r\n    FS56 = _mm_blendv_epi8(T1, T0, M1);\r\n    FS4 = _mm_blendv_epi8(c0, _mm256_castsi256_si128(c_1_256), _mm_cmpeq_epi16(_mm256_castsi256_si128(FLT_X), _mm256_castsi256_si128(c_2_256)));\r\n\r\n    FS = _mm_blendv_epi8(c0, FS56, _mm_cmpgt_epi16(FLT, _mm256_castsi256_si128(c_4_256)));\r\n    FS = _mm_blendv_epi8(FS, FS4, _mm_cmpeq_epi16(FLT, _mm256_castsi256_si128(c_4_256)));\r\n\r\n    FS = _mm_and_si128(FS, M0);\r\n    FS_256 = _mm256_inserti128_si256(_mm256_castsi128_si256(FS), FS, 1);\r\n\r\n    TLR0w = TLR0;\r\n    TLR1w = TLR1;\r\n    /* fs == 1 */\r\n    T2 = _mm_add_epi16(_mm_add_epi16(_mm256_castsi256_si128(TLR0), _mm256_castsi256_si128(TRL0)), _mm256_castsi256_si128(c_2_256)); // L0 + R0 + 2\r\n    T2_256 = _mm256_castsi128_si256(T2);\r\n    T2_256 = _mm256_inserti128_si256(T2_256, T2, 1); // save\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(_mm256_slli_epi16(TLR0, 1), T2_256), 2);\r\n    TLR0w = _mm256_blendv_epi8(TLR0, T1_256, _mm256_cmpeq_epi16(FS_256, c_1_256));\r\n\r\n    /* fs == 2 */\r\n    T2_256 = _mm256_slli_epi16(T2_256, 1);\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR1, 1), _mm256_add_epi16(TLR1, TRL0));\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR0, 3), _mm256_add_epi16(T0_256, T2_256));\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, c_4_256), 4);\r\n    TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, _mm256_cmpeq_epi16(FS_256, c_2_256));\r\n\r\n    /* fs == 3 */\r\n    T2_256 = _mm256_slli_epi16(T2_256, 1); // (L0 << 2) + (R0 << 2) + 8\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR1, 2), _mm256_add_epi16(TLR2, TRL1));\r\n    T0_256 = _mm256_add_epi16(_mm256_slli_epi16(TLR0, 1), _mm256_add_epi16(T0_256, T2_256));\r\n    T1_256 = _mm256_srli_epi16(T0_256, 4);\r\n    TLR0w = _mm256_blendv_epi8(TLR0w, T1_256, _mm256_cmpeq_epi16(FS_256, c_3_256));\r\n\r\n    T0_256 = _mm256_add_epi16(_mm256_add_epi16(TLR2, TRL0), _mm256_slli_epi16(TLR2, 1));\r\n    T0_256 = 
_mm256_add_epi16(T0_256, _mm256_slli_epi16(TLR1, 3));\r\n    T0_256 = _mm256_add_epi16(T0_256, _mm256_slli_epi16(TLR0, 2));\r\n    T1_256 = _mm256_srli_epi16(_mm256_add_epi16(T0_256, c_8_256), 4);\r\n\r\n    TLR1w = _mm256_blendv_epi8(TLR1w, T1_256, _mm256_cmpeq_epi16(FS_256, c_3_256));\r\n\r\n    /* store result */\r\n    TLR0w = _mm256_packus_epi16(TLR0w, c_0_256);\r\n    TLR1w = _mm256_packus_epi16(TLR1w, c_0_256);\r\n\r\n    _mm256_maskstore_epi32(((int32_t*)(SrcPtrU - inc )), mask0, TLR0w);\r\n    _mm256_maskstore_epi32(((int32_t*)(SrcPtrU - 16)), mask4, TLR0w);\r\n    _mm256_maskstore_epi32(((int32_t*)(SrcPtrU - inc2)), mask0, TLR1w);\r\n    _mm256_maskstore_epi32(((int32_t*)(SrcPtrU + inc - 16)), mask4, TLR1w);\r\n\r\n    _mm256_maskstore_epi32(((int32_t*)(SrcPtrV - inc - 4)), mask1, TLR0w);\r\n    _mm256_maskstore_epi32(((int32_t*)(SrcPtrV - 20)), mask5, TLR0w);\r\n    _mm256_maskstore_epi32(((int32_t*)(SrcPtrV - inc2 - 4)), mask1, TLR1w);\r\n    _mm256_maskstore_epi32(((int32_t*)(SrcPtrV + inc - 20)), mask5, TLR1w);\r\n}\r\n\r\n#endif\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_idct.cc",
    "content": "/*\r\n * intrinsic_idct.cc\r\n *\r\n * Description of this file:\r\n *    SSE assembly functions of IDCT module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n\r\n\r\nALIGN32(static const coeff_t tab_idct_8x8[12][8]) = {\r\n    {  44,  38,  44,  38,  44,  38,  44,  38 },\r\n    {  25,   9,  25,   9,  25,   9,  25,   9 },\r\n    {  38,  -9,  38,  -9,  38,  -9,  38,  -9 },\r\n    { -44, -25, -44, -25, -44, -25, -44, -25 },\r\n    {  25, -44,  25, -44,  25, -44,  25, -44 },\r\n    {   9,  38,   9,  38,   9,  38,   9,  38 },\r\n    {   9, -25,   9, -25,   9, -25,   9, -25 },\r\n    
{  38, -44,  38, -44,  38, -44,  38, -44 },\r\n    {  32,  32,  32,  32,  32,  32,  32,  32 },\r\n    {  32, -32,  32, -32,  32, -32,  32, -32 },\r\n    {  42,  17,  42,  17,  42,  17,  42,  17 },\r\n    {  17, -42,  17, -42,  17, -42,  17, -42 }\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nALIGN16(static const int16_t g_2T[SEC_TR_SIZE * SEC_TR_SIZE]) = {\r\n    123,  -35,  -8,  -3,\r\n    -32, -120,  30,  10,\r\n     14,   25, 123, -22,\r\n      8,   13,  19, 126\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nALIGN16(static const int16_t g_2T_C[SEC_TR_SIZE * SEC_TR_SIZE]) = {\r\n    34,  58,  72,  81,\r\n    77,  69,  -7, -75,\r\n    79, -33, -75,  58,\r\n    55, -84,  73, -28\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_4x4_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    const int shift1 = 5;\r\n    const int shift2 = 20 - g_bit_depth;\r\n    // const int clip_depth1 = LIMIT_BIT;\r\n    const int clip_depth2 = g_bit_depth + 1;\r\n\r\n    const __m128i c16_p17_p42 = _mm_set1_epi32(0x0011002A);\r\n    const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);\r\n\r\n    __m128i c32_rnd = _mm_set1_epi32(1 << (shift1 - 1));    // add1\r\n    __m128i S0, S1;\r\n    __m128i T0, T1;\r\n    __m128i E0, E1, O0, O1;\r\n\r\n    S0  = _mm_loadu_si128((__m128i*)(src   ));\r\n    S1  = _mm_loadu_si128((__m128i*)(src+ 8));\r\n\r\n    T0 = _mm_unpacklo_epi16(S0, S1);\r\n    E0 = _mm_add_epi32(_mm_madd_epi16(T0, c16_p32_p32), c32_rnd);\r\n    E1 = _mm_add_epi32(_mm_madd_epi16(T0, c16_n32_p32), c32_rnd);\r\n\r\n    T1 = _mm_unpackhi_epi16(S0, S1);\r\n    O0 = _mm_madd_epi16(T1, c16_p17_p42);\r\n    O1 = _mm_madd_epi16(T1, c16_n42_p17);\r\n\r\n    S0 = 
_mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E0, O0), shift1), _mm_srai_epi32(_mm_sub_epi32(E1, O1), shift1));\r\n    S1 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E1, O1), shift1), _mm_srai_epi32(_mm_sub_epi32(E0, O0), shift1));\r\n\r\n    /* inverse */\r\n    T0 = _mm_unpacklo_epi16(S0, S1);\r\n    T1 = _mm_unpackhi_epi16(S0, S1);\r\n    S0 = _mm_unpacklo_epi32(T0, T1);\r\n    S1 = _mm_unpackhi_epi32(T0, T1);\r\n\r\n    /* second pass -------------------------------------------------\r\n     */\r\n    c32_rnd  = _mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n\r\n    T0 = _mm_unpacklo_epi16(S0, S1);\r\n    E0 = _mm_add_epi32(_mm_madd_epi16(T0, c16_p32_p32), c32_rnd);\r\n    E1 = _mm_add_epi32(_mm_madd_epi16(T0, c16_n32_p32), c32_rnd);\r\n\r\n    T1 = _mm_unpackhi_epi16(S0, S1);\r\n    O0 = _mm_madd_epi16(T1, c16_p17_p42);\r\n    O1 = _mm_madd_epi16(T1, c16_n42_p17);\r\n\r\n    S0  = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E0, O0), shift2), _mm_srai_epi32(_mm_sub_epi32(E1, O1), shift2));\r\n    S1  = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E1, O1), shift2), _mm_srai_epi32(_mm_sub_epi32(E0, O0), shift2));\r\n\r\n    T0 = _mm_unpacklo_epi16(S0, S1);\r\n    T1 = _mm_unpackhi_epi16(S0, S1);\r\n    S0 = _mm_unpacklo_epi32(T0, T1);\r\n    S1 = _mm_unpackhi_epi32(T0, T1);\r\n\r\n    // clip\r\n    {\r\n        const __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        const __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n\r\n        S0 = _mm_max_epi16(_mm_min_epi16(S0, max_val), min_val);\r\n        S1 = _mm_max_epi16(_mm_min_epi16(S1, max_val), min_val);\r\n    }\r\n\r\n    // store\r\n    if (i_dst == 4) {\r\n        _mm_store_si128((__m128i*)(dst + 0), S0);\r\n        _mm_store_si128((__m128i*)(dst + 8), S1);\r\n    } else {\r\n        _mm_storel_epi64((__m128i*)(dst + 0 * i_dst), S0);\r\n        _mm_storeh_pi((__m64  *)(dst + 1 * i_dst), _mm_castsi128_ps(S0));\r\n        _mm_storel_epi64((__m128i*)(dst + 2 * 
i_dst), S1);\r\n        _mm_storeh_pi((__m64  *)(dst + 3 * i_dst), _mm_castsi128_ps(S1));\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_4x16_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    const int shift1 = 5;\r\n    const int shift2 = 20 - g_bit_depth;\r\n    // const int clip_depth1 = LIMIT_BIT;\r\n    const int clip_depth2 = g_bit_depth + 1;\r\n\r\n    const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);   //row0 87high - 90low address\r\n    const __m128i c16_p35_p40 = _mm_set1_epi32(0x00230028);\r\n    const __m128i c16_p21_p29 = _mm_set1_epi32(0x0015001D);\r\n    const __m128i c16_p04_p13 = _mm_set1_epi32(0x0004000D);\r\n    const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);   //row1\r\n    const __m128i c16_n21_p04 = _mm_set1_epi32(0xFFEB0004);\r\n    const __m128i c16_n45_n40 = _mm_set1_epi32(0xFFD3FFD8);\r\n    const __m128i c16_n13_n35 = _mm_set1_epi32(0xFFF3FFDD);\r\n    const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);   //row2\r\n    const __m128i c16_n43_n35 = _mm_set1_epi32(0xFFD5FFDD);\r\n    const __m128i c16_p29_n13 = _mm_set1_epi32(0x001DFFF3);\r\n    const __m128i c16_p21_p45 = _mm_set1_epi32(0x0015002D);\r\n    const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);   //row3\r\n    const __m128i c16_p04_n43 = _mm_set1_epi32(0x0004FFD5);\r\n    const __m128i c16_p13_p45 = _mm_set1_epi32(0x000D002D);\r\n    const __m128i c16_n29_n40 = _mm_set1_epi32(0xFFE3FFD8);\r\n    const __m128i c16_n40_p29 = _mm_set1_epi32(0xFFD8001D);   //row4\r\n    const __m128i c16_p45_n13 = _mm_set1_epi32(0x002DFFF3);\r\n    const __m128i c16_n43_n04 = _mm_set1_epi32(0xFFD5FFFC);\r\n    const __m128i c16_p35_p21 = _mm_set1_epi32(0x00230015);\r\n    const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);   //row5\r\n    const __m128i c16_p13_p29 = _mm_set1_epi32(0x000D001D);\r\n    const __m128i c16_p35_n43 = _mm_set1_epi32(0x0023FFD5);\r\n    const __m128i 
c16_n40_p04 = _mm_set1_epi32(0xFFD80004);\r\n    const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);   //row6\r\n    const __m128i c16_n40_p45 = _mm_set1_epi32(0xFFD8002D);\r\n    const __m128i c16_p04_p21 = _mm_set1_epi32(0x00040015);\r\n    const __m128i c16_p43_n29 = _mm_set1_epi32(0x002BFFE3);\r\n    const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);   //row7\r\n    const __m128i c16_n29_p21 = _mm_set1_epi32(0xFFE30015);\r\n    const __m128i c16_n40_p35 = _mm_set1_epi32(0xFFD80023);\r\n    const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n\r\n    const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);\r\n    const __m128i c16_p09_p25 = _mm_set1_epi32(0x00090019);\r\n    const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);\r\n    const __m128i c16_n25_n44 = _mm_set1_epi32(0xFFE7FFD4);\r\n    const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);\r\n    const __m128i c16_p38_p09 = _mm_set1_epi32(0x00260009);\r\n    const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);\r\n    const __m128i c16_n44_p38 = _mm_set1_epi32(0xFFD40026);\r\n\r\n    const __m128i c16_p17_p42 = _mm_set1_epi32(0x0011002A);\r\n    const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);\r\n\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);\r\n\r\n    __m128i c32_rnd = _mm_set1_epi32(1 << (shift1 - 1));            // add1\r\n\r\n    // DCT1\r\n    __m128i in00, in01, in02, in03, in04, in05, in06, in07;\r\n    __m128i res00, res01, res02, res03, res04, res05, res06, res07;\r\n\r\n    in00 = _mm_loadu_si128((const __m128i*)&src[ 0 * 4]);           // [07 06 05 04 03 02 01 00]\r\n    in01 = _mm_loadu_si128((const __m128i*)&src[ 2 * 4]);           // [27 26 25 24 23 22 21 20]\r\n    in02 = _mm_loadu_si128((const __m128i*)&src[ 4 * 4]);           // [47 46 45 44 43 42 41 40]\r\n    in03 = _mm_loadu_si128((const __m128i*)&src[ 6 * 4]);           // [67 66 65 64 63 62 61 60]\r\n    in04 = 
_mm_loadu_si128((const __m128i*)&src[ 8 * 4]);\r\n    in05 = _mm_loadu_si128((const __m128i*)&src[10 * 4]);\r\n    in06 = _mm_loadu_si128((const __m128i*)&src[12 * 4]);\r\n    in07 = _mm_loadu_si128((const __m128i*)&src[14 * 4]);\r\n\r\n    {\r\n        const __m128i T_00_00A = _mm_unpackhi_epi16(in00, in01);    // [33 13 32 12 31 11 30 10]\r\n        const __m128i T_00_01A = _mm_unpackhi_epi16(in02, in03);    // [ ]\r\n        const __m128i T_00_02A = _mm_unpackhi_epi16(in04, in05);    // [ ]\r\n        const __m128i T_00_03A = _mm_unpackhi_epi16(in06, in07);    // [ ]\r\n        const __m128i T_00_04A = _mm_unpacklo_epi16(in01, in03);    // [ ]\r\n        const __m128i T_00_05A = _mm_unpacklo_epi16(in05, in07);    // [ ]\r\n        const __m128i T_00_06A = _mm_unpacklo_epi16(in02, in06);    // [ ]row\r\n        const __m128i T_00_07A = _mm_unpacklo_epi16(in00, in04);    // [83 03 82 02 81 01 81 00] row08 row00\r\n\r\n        __m128i O0A, O1A, O2A, O3A, O4A, O5A, O6A, O7A;\r\n        __m128i EO0A, EO1A, EO2A, EO3A;\r\n        __m128i EEO0A, EEO1A;\r\n        __m128i EEE0A, EEE1A;\r\n\r\n#define COMPUTE_ROW(row0103, row0507, row0911, row1315, c0103, c0507, c0911, c1315, row) \\\r\n    row = _mm_add_epi32(_mm_add_epi32(_mm_madd_epi16(row0103, c0103), _mm_madd_epi16(row0507, c0507)), \\\r\n                        _mm_add_epi32(_mm_madd_epi16(row0911, c0911), _mm_madd_epi16(row1315, c1315)));\r\n\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, O0A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, O1A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, O2A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, O3A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n40_p29, c16_p45_n13, c16_n43_n04, 
c16_p35_p21, O4A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, O5A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, O6A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, O7A)\r\n#undef COMPUTE_ROW\r\n\r\n        EO0A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_p38_p44), _mm_madd_epi16(T_00_05A, c16_p09_p25)); // EO0\r\n        EO1A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_n09_p38), _mm_madd_epi16(T_00_05A, c16_n25_n44)); // EO1\r\n        EO2A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_n44_p25), _mm_madd_epi16(T_00_05A, c16_p38_p09)); // EO2\r\n        EO3A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_n25_p09), _mm_madd_epi16(T_00_05A, c16_n44_p38)); // EO3\r\n\r\n        EEO0A = _mm_madd_epi16(T_00_06A, c16_p17_p42);\r\n        EEO1A = _mm_madd_epi16(T_00_06A, c16_n42_p17);\r\n\r\n        EEE0A = _mm_madd_epi16(T_00_07A, c16_p32_p32);\r\n        EEE1A = _mm_madd_epi16(T_00_07A, c16_n32_p32);\r\n        {\r\n            const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);   // EE0 = EEE0 + EEO0\r\n            const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);   // EE1 = EEE1 + EEO1\r\n            const __m128i EE3A = _mm_sub_epi32(EEE0A, EEO0A);   // EE3 = EEE0 - EEO0\r\n            const __m128i EE2A = _mm_sub_epi32(EEE1A, EEO1A);   // EE2 = EEE1 - EEO1\r\n\r\n            const __m128i T10A = _mm_add_epi32(_mm_add_epi32(EE0A, EO0A), c32_rnd);   // E0 (= EE0 + EO0) + rnd\r\n            const __m128i T11A = _mm_add_epi32(_mm_add_epi32(EE1A, EO1A), c32_rnd);   // E1 (= EE1 + EO1) + rnd\r\n            const __m128i T12A = _mm_add_epi32(_mm_add_epi32(EE2A, EO2A), c32_rnd);   // E2 (= EE2 + EO2) + rnd\r\n            const __m128i T13A = _mm_add_epi32(_mm_add_epi32(EE3A, EO3A), c32_rnd);   // E3 (= EE3 + EO3) + rnd\r\n            const __m128i T14A = 
_mm_add_epi32(_mm_sub_epi32(EE3A, EO3A), c32_rnd);   // E4 (= EE3 - EO3) + rnd\r\n            const __m128i T15A = _mm_add_epi32(_mm_sub_epi32(EE2A, EO2A), c32_rnd);   // E5 (= EE2 - EO2) + rnd\r\n            const __m128i T16A = _mm_add_epi32(_mm_sub_epi32(EE1A, EO1A), c32_rnd);   // E6 (= EE1 - EO1) + rnd\r\n            const __m128i T17A = _mm_add_epi32(_mm_sub_epi32(EE0A, EO0A), c32_rnd);   // E7 (= EE0 - EO0) + rnd\r\n\r\n\r\n            const __m128i T30A = _mm_srai_epi32(_mm_add_epi32(T10A, O0A), shift1);  // E0 + O0 + rnd   [30 20 10 00]\r\n            const __m128i T31A = _mm_srai_epi32(_mm_add_epi32(T11A, O1A), shift1);  // E1 + O1 + rnd   [31 21 11 01]\r\n            const __m128i T32A = _mm_srai_epi32(_mm_add_epi32(T12A, O2A), shift1);  // E2 + O2 + rnd   [32 22 12 02]\r\n            const __m128i T33A = _mm_srai_epi32(_mm_add_epi32(T13A, O3A), shift1);  // E3 + O3 + rnd   [33 23 13 03]\r\n            const __m128i T34A = _mm_srai_epi32(_mm_add_epi32(T14A, O4A), shift1);  // E4              [33 24 14 04]\r\n            const __m128i T35A = _mm_srai_epi32(_mm_add_epi32(T15A, O5A), shift1);  // E5              [35 25 15 05]\r\n            const __m128i T36A = _mm_srai_epi32(_mm_add_epi32(T16A, O6A), shift1);  // E6              [36 26 16 06]\r\n            const __m128i T37A = _mm_srai_epi32(_mm_add_epi32(T17A, O7A), shift1);  // E7              [37 27 17 07]\r\n\r\n            const __m128i T38A = _mm_srai_epi32(_mm_sub_epi32(T17A, O7A), shift1);  // E7             [30 20 10 00] x8\r\n            const __m128i T39A = _mm_srai_epi32(_mm_sub_epi32(T16A, O6A), shift1);  // E6             [31 21 11 01] x9\r\n            const __m128i T3AA = _mm_srai_epi32(_mm_sub_epi32(T15A, O5A), shift1);  // E5             [32 22 12 02] xA\r\n            const __m128i T3BA = _mm_srai_epi32(_mm_sub_epi32(T14A, O4A), shift1);  // E4             [33 23 13 03] xB\r\n            const __m128i T3CA = _mm_srai_epi32(_mm_sub_epi32(T13A, O3A), shift1);  // E3 - O3 + rnd  [33 24 14 
04] xC\r\n            const __m128i T3DA = _mm_srai_epi32(_mm_sub_epi32(T12A, O2A), shift1);  // E2 - O2 + rnd  [35 25 15 05] xD\r\n            const __m128i T3EA = _mm_srai_epi32(_mm_sub_epi32(T11A, O1A), shift1);  // E1 - O1 + rnd  [36 26 16 06] xE\r\n            const __m128i T3FA = _mm_srai_epi32(_mm_sub_epi32(T10A, O0A), shift1);  // E0 - O0 + rnd  [37 27 17 07] xF\r\n\r\n            res00 = _mm_packs_epi32(T30A, T38A);\r\n            res01 = _mm_packs_epi32(T31A, T39A);\r\n            res02 = _mm_packs_epi32(T32A, T3AA);\r\n            res03 = _mm_packs_epi32(T33A, T3BA);\r\n\r\n            res04 = _mm_packs_epi32(T34A, T3CA);\r\n            res05 = _mm_packs_epi32(T35A, T3DA);\r\n            res06 = _mm_packs_epi32(T36A, T3EA);\r\n            res07 = _mm_packs_epi32(T37A, T3FA);\r\n        }\r\n    }\r\n\r\n    // transpose matrix\r\n    {\r\n        __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n        __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n        __m128i E01, E02, E03, E04, E11, E12, E13, E14;\r\n        __m128i O01, O02, O03, O04, O11, O12, O13, O14;\r\n        __m128i T0, T1, T2, T3;\r\n\r\n        tr0_0 = _mm_unpacklo_epi16(res00, res01);\r\n        tr0_1 = _mm_unpackhi_epi16(res00, res01);\r\n        tr0_2 = _mm_unpacklo_epi16(res02, res03);\r\n        tr0_3 = _mm_unpackhi_epi16(res02, res03);\r\n        tr0_4 = _mm_unpacklo_epi16(res04, res05);\r\n        tr0_5 = _mm_unpackhi_epi16(res04, res05);\r\n        tr0_6 = _mm_unpacklo_epi16(res06, res07);\r\n        tr0_7 = _mm_unpackhi_epi16(res06, res07);\r\n\r\n        tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_2);\r\n        tr1_1 = _mm_unpackhi_epi32(tr0_0, tr0_2);\r\n        tr1_2 = _mm_unpacklo_epi32(tr0_1, tr0_3);\r\n        tr1_3 = _mm_unpackhi_epi32(tr0_1, tr0_3);\r\n        tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_6);\r\n        tr1_5 = _mm_unpackhi_epi32(tr0_4, tr0_6);\r\n        tr1_6 = _mm_unpacklo_epi32(tr0_5, tr0_7);\r\n        tr1_7 = 
_mm_unpackhi_epi32(tr0_5, tr0_7);\r\n\r\n        res00 = _mm_unpacklo_epi64(tr1_0, tr1_4);\r\n        res02 = _mm_unpackhi_epi64(tr1_0, tr1_4);\r\n        res04 = _mm_unpacklo_epi64(tr1_1, tr1_5);\r\n        res06 = _mm_unpackhi_epi64(tr1_1, tr1_5);\r\n        res01 = _mm_unpacklo_epi64(tr1_2, tr1_6);\r\n        res03 = _mm_unpackhi_epi64(tr1_2, tr1_6);\r\n        res05 = _mm_unpacklo_epi64(tr1_3, tr1_7);\r\n        res07 = _mm_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n        c32_rnd = _mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n\r\n        T0 = _mm_unpacklo_epi16(res00, res04);\r\n        E01 = _mm_add_epi32(_mm_madd_epi16(T0, c16_p32_p32), c32_rnd);\r\n        E11 = _mm_add_epi32(_mm_madd_epi16(T0, c16_n32_p32), c32_rnd);\r\n\r\n        T1 = _mm_unpackhi_epi16(res00, res04);\r\n        E02 = _mm_add_epi32(_mm_madd_epi16(T1, c16_p32_p32), c32_rnd);\r\n        E12 = _mm_add_epi32(_mm_madd_epi16(T1, c16_n32_p32), c32_rnd);\r\n\r\n        T0 = _mm_unpacklo_epi16(res01, res05);\r\n        E03 = _mm_add_epi32(_mm_madd_epi16(T0, c16_p32_p32), c32_rnd);\r\n        E13 = _mm_add_epi32(_mm_madd_epi16(T0, c16_n32_p32), c32_rnd);\r\n\r\n        T1 = _mm_unpackhi_epi16(res01, res05);\r\n        E04 = _mm_add_epi32(_mm_madd_epi16(T1, c16_p32_p32), c32_rnd);\r\n        E14 = _mm_add_epi32(_mm_madd_epi16(T1, c16_n32_p32), c32_rnd);\r\n\r\n        T0 = _mm_unpacklo_epi16(res02, res06);\r\n        O01 = _mm_madd_epi16(T0, c16_p17_p42);\r\n        O11 = _mm_madd_epi16(T0, c16_n42_p17);\r\n\r\n        T1 = _mm_unpackhi_epi16(res02, res06);\r\n        O02 = _mm_madd_epi16(T1, c16_p17_p42);\r\n        O12 = _mm_madd_epi16(T1, c16_n42_p17);\r\n\r\n        T0 = _mm_unpacklo_epi16(res03, res07);\r\n        O03 = _mm_madd_epi16(T0, c16_p17_p42);\r\n        O13 = _mm_madd_epi16(T0, c16_n42_p17);\r\n\r\n        T1 = _mm_unpackhi_epi16(res03, res07);\r\n        O04 = _mm_madd_epi16(T1, c16_p17_p42);\r\n        O14 = _mm_madd_epi16(T1, c16_n42_p17);\r\n\r\n        res00 = 
_mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E01, O01), shift2), _mm_srai_epi32(_mm_add_epi32(E02, O02), shift2));\r\n        res01 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E03, O03), shift2), _mm_srai_epi32(_mm_add_epi32(E04, O04), shift2));\r\n\r\n        res06 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E01, O01), shift2), _mm_srai_epi32(_mm_sub_epi32(E02, O02), shift2));\r\n        res07 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E03, O03), shift2), _mm_srai_epi32(_mm_sub_epi32(E04, O04), shift2));\r\n\r\n        res02 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E11, O11), shift2), _mm_srai_epi32(_mm_add_epi32(E12, O12), shift2));\r\n        res03 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E13, O13), shift2), _mm_srai_epi32(_mm_add_epi32(E14, O14), shift2));\r\n\r\n        res04 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E11, O11), shift2), _mm_srai_epi32(_mm_sub_epi32(E12, O12), shift2));\r\n        res05 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E13, O13), shift2), _mm_srai_epi32(_mm_sub_epi32(E14, O14), shift2));\r\n\r\n        T0 = _mm_unpacklo_epi16(res00, res02);\r\n        T1 = _mm_unpackhi_epi16(res00, res02);\r\n        T2 = _mm_unpacklo_epi16(res04, res06);\r\n        T3 = _mm_unpackhi_epi16(res04, res06);\r\n\r\n        res00 = _mm_unpacklo_epi32(T0, T2);\r\n        res02 = _mm_unpackhi_epi32(T0, T2);\r\n        res04 = _mm_unpacklo_epi32(T1, T3);\r\n        res06 = _mm_unpackhi_epi32(T1, T3);\r\n\r\n        T0 = _mm_unpacklo_epi16(res01, res03);\r\n        T1 = _mm_unpackhi_epi16(res01, res03);\r\n        T2 = _mm_unpacklo_epi16(res05, res07);\r\n        T3 = _mm_unpackhi_epi16(res05, res07);\r\n\r\n        res01 = _mm_unpacklo_epi32(T0, T2);\r\n        res03 = _mm_unpackhi_epi32(T0, T2);\r\n        res05 = _mm_unpacklo_epi32(T1, T3);\r\n        res07 = _mm_unpackhi_epi32(T1, T3);\r\n    }\r\n\r\n    // clip\r\n    {\r\n        const __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        const __m128i 
min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n\r\n        res00 = _mm_max_epi16(_mm_min_epi16(res00, max_val), min_val);\r\n        res02 = _mm_max_epi16(_mm_min_epi16(res02, max_val), min_val);\r\n        res04 = _mm_max_epi16(_mm_min_epi16(res04, max_val), min_val);\r\n        res06 = _mm_max_epi16(_mm_min_epi16(res06, max_val), min_val);\r\n        res01 = _mm_max_epi16(_mm_min_epi16(res01, max_val), min_val);\r\n        res03 = _mm_max_epi16(_mm_min_epi16(res03, max_val), min_val);\r\n        res05 = _mm_max_epi16(_mm_min_epi16(res05, max_val), min_val);\r\n        res07 = _mm_max_epi16(_mm_min_epi16(res07, max_val), min_val);\r\n    }\r\n\r\n    // store\r\n    if (i_dst == 4) {\r\n        _mm_store_si128((__m128i*)(dst +  0 * 4), res00);\r\n        _mm_store_si128((__m128i*)(dst +  2 * 4), res02);\r\n        _mm_store_si128((__m128i*)(dst +  4 * 4), res04);\r\n        _mm_store_si128((__m128i*)(dst +  6 * 4), res06);\r\n        _mm_store_si128((__m128i*)(dst +  8 * 4), res01);\r\n        _mm_store_si128((__m128i*)(dst + 10 * 4), res03);\r\n        _mm_store_si128((__m128i*)(dst + 12 * 4), res05);\r\n        _mm_store_si128((__m128i*)(dst + 14 * 4), res07);\r\n    } else {\r\n        _mm_storel_epi64((__m128i*)(dst +  0 * i_dst), res00);\r\n        _mm_storeh_pi   ((__m64  *)(dst +  1 * i_dst), _mm_castsi128_ps(res00));\r\n        _mm_storel_epi64((__m128i*)(dst +  2 * i_dst), res02);\r\n        _mm_storeh_pi   ((__m64  *)(dst +  3 * i_dst), _mm_castsi128_ps(res02));\r\n        _mm_storel_epi64((__m128i*)(dst +  4 * i_dst), res04);\r\n        _mm_storeh_pi   ((__m64  *)(dst +  5 * i_dst), _mm_castsi128_ps(res04));\r\n        _mm_storel_epi64((__m128i*)(dst +  6 * i_dst), res06);\r\n        _mm_storeh_pi   ((__m64  *)(dst +  7 * i_dst), _mm_castsi128_ps(res06));\r\n        _mm_storel_epi64((__m128i*)(dst +  8 * i_dst), res01);\r\n        _mm_storeh_pi   ((__m64  *)(dst +  9 * i_dst), _mm_castsi128_ps(res01));\r\n        _mm_storel_epi64((__m128i*)(dst 
+ 10 * i_dst), res03);\r\n        _mm_storeh_pi   ((__m64  *)(dst + 11 * i_dst), _mm_castsi128_ps(res03));\r\n        _mm_storel_epi64((__m128i*)(dst + 12 * i_dst), res05);\r\n        _mm_storeh_pi   ((__m64  *)(dst + 13 * i_dst), _mm_castsi128_ps(res05));\r\n        _mm_storel_epi64((__m128i*)(dst + 14 * i_dst), res07);\r\n        _mm_storeh_pi   ((__m64  *)(dst + 15 * i_dst), _mm_castsi128_ps(res07));\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_4x16_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/2 case: only the top-left 4x8 block has non-zero coefficients\r\n    idct_4x16_sse128(src, dst, i_dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_4x16_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/4 case: only the top-left 4x4 block has non-zero coefficients\r\n    idct_4x16_half_sse128(src, dst, i_dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_16x4_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    const int shift1 = 5;\r\n    const int shift2 = 20 - g_bit_depth;\r\n    // const int clip_depth1 = LIMIT_BIT;\r\n    const int clip_depth2 = g_bit_depth + 1;\r\n\r\n    const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);   //row0: 43 in the high 16 bits, 45 in the low 16 bits\r\n    const __m128i c16_p35_p40 = _mm_set1_epi32(0x00230028);\r\n    const __m128i c16_p21_p29 = _mm_set1_epi32(0x0015001D);\r\n    const __m128i c16_p04_p13 = _mm_set1_epi32(0x0004000D);\r\n    const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);   //row1\r\n    const __m128i c16_n21_p04 = _mm_set1_epi32(0xFFEB0004);\r\n    const __m128i c16_n45_n40 = _mm_set1_epi32(0xFFD3FFD8);\r\n    const __m128i c16_n13_n35 = _mm_set1_epi32(0xFFF3FFDD);\r\n    const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);   //row2\r\n    const __m128i c16_n43_n35 = _mm_set1_epi32(0xFFD5FFDD);\r\n    
const __m128i c16_p29_n13 = _mm_set1_epi32(0x001DFFF3);\r\n    const __m128i c16_p21_p45 = _mm_set1_epi32(0x0015002D);\r\n    const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);   //row3\r\n    const __m128i c16_p04_n43 = _mm_set1_epi32(0x0004FFD5);\r\n    const __m128i c16_p13_p45 = _mm_set1_epi32(0x000D002D);\r\n    const __m128i c16_n29_n40 = _mm_set1_epi32(0xFFE3FFD8);\r\n    const __m128i c16_n40_p29 = _mm_set1_epi32(0xFFD8001D);   //row4\r\n    const __m128i c16_p45_n13 = _mm_set1_epi32(0x002DFFF3);\r\n    const __m128i c16_n43_n04 = _mm_set1_epi32(0xFFD5FFFC);\r\n    const __m128i c16_p35_p21 = _mm_set1_epi32(0x00230015);\r\n    const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);   //row5\r\n    const __m128i c16_p13_p29 = _mm_set1_epi32(0x000D001D);\r\n    const __m128i c16_p35_n43 = _mm_set1_epi32(0x0023FFD5);\r\n    const __m128i c16_n40_p04 = _mm_set1_epi32(0xFFD80004);\r\n    const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);   //row6\r\n    const __m128i c16_n40_p45 = _mm_set1_epi32(0xFFD8002D);\r\n    const __m128i c16_p04_p21 = _mm_set1_epi32(0x00040015);\r\n    const __m128i c16_p43_n29 = _mm_set1_epi32(0x002BFFE3);\r\n    const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);   //row7\r\n    const __m128i c16_n29_p21 = _mm_set1_epi32(0xFFE30015);\r\n    const __m128i c16_n40_p35 = _mm_set1_epi32(0xFFD80023);\r\n    const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n\r\n    const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);\r\n    const __m128i c16_p09_p25 = _mm_set1_epi32(0x00090019);\r\n    const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);\r\n    const __m128i c16_n25_n44 = _mm_set1_epi32(0xFFE7FFD4);\r\n    const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);\r\n    const __m128i c16_p38_p09 = _mm_set1_epi32(0x00260009);\r\n    const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);\r\n    const __m128i c16_n44_p38 = _mm_set1_epi32(0xFFD40026);\r\n\r\n    const __m128i c16_p17_p42 = 
_mm_set1_epi32(0x0011002A);\r\n    const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);\r\n\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);\r\n\r\n    __m128i c32_rnd = _mm_set1_epi32(1 << (shift1 - 1));        // add1\r\n\r\n    // DCT1\r\n    __m128i in00[2], in01[2], in02[2], in03[2];\r\n    __m128i res00[2], res01[2], res02[2], res03[2];\r\n    int i, part;\r\n\r\n    for (i = 0; i < 2; i++) {\r\n        const int offset = (i << 3);\r\n        in00[i] = _mm_loadu_si128((const __m128i*)&src[0 * 16 + offset]);   // [07 06 05 04 03 02 01 00]\r\n        in01[i] = _mm_loadu_si128((const __m128i*)&src[1 * 16 + offset]);   // [17 16 15 14 13 12 11 10]\r\n        in02[i] = _mm_loadu_si128((const __m128i*)&src[2 * 16 + offset]);   // [27 26 25 24 23 22 21 20]\r\n        in03[i] = _mm_loadu_si128((const __m128i*)&src[3 * 16 + offset]);   // [37 36 35 34 33 32 31 30]\r\n    }\r\n\r\n    for (part = 0; part < 2; part++) {\r\n        const __m128i T_00_00A = _mm_unpacklo_epi16(in01[part], in03[part]);\r\n        const __m128i T_00_00B = _mm_unpackhi_epi16(in01[part], in03[part]);\r\n        const __m128i T_00_01A = _mm_unpacklo_epi16(in00[part], in02[part]);\r\n        const __m128i T_00_01B = _mm_unpackhi_epi16(in00[part], in02[part]);\r\n\r\n        __m128i E0A, E0B, E1A, E1B, O0A, O0B, O1A, O1B;\r\n\r\n        E0A = _mm_add_epi32(_mm_madd_epi16(T_00_01A, c16_p32_p32), c32_rnd);\r\n        E1A = _mm_add_epi32(_mm_madd_epi16(T_00_01A, c16_n32_p32), c32_rnd);\r\n\r\n        E0B = _mm_add_epi32(_mm_madd_epi16(T_00_01B, c16_p32_p32), c32_rnd);\r\n        E1B = _mm_add_epi32(_mm_madd_epi16(T_00_01B, c16_n32_p32), c32_rnd);\r\n\r\n        O0A = _mm_madd_epi16(T_00_00A, c16_p17_p42);\r\n        O1A = _mm_madd_epi16(T_00_00A, c16_n42_p17);\r\n\r\n        O0B = _mm_madd_epi16(T_00_00B, c16_p17_p42);\r\n        O1B = _mm_madd_epi16(T_00_00B, c16_n42_p17);\r\n\r\n        res00[part] = 
_mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E0A, O0A), 5), _mm_srai_epi32(_mm_add_epi32(E0B, O0B), 5));\r\n        res03[part] = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E0A, O0A), 5), _mm_srai_epi32(_mm_sub_epi32(E0B, O0B), 5));\r\n        res01[part] = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E1A, O1A), 5), _mm_srai_epi32(_mm_add_epi32(E1B, O1B), 5));\r\n        res02[part] = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E1A, O1A), 5), _mm_srai_epi32(_mm_sub_epi32(E1B, O1B), 5));\r\n    }\r\n\r\n    // transpose matrix\r\n    {\r\n        __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n        __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n\r\n        tr0_0 = _mm_unpacklo_epi16(res00[0], res01[0]);\r\n        tr0_1 = _mm_unpacklo_epi16(res02[0], res03[0]);\r\n\r\n        tr0_2 = _mm_unpackhi_epi16(res00[0], res01[0]);\r\n        tr0_3 = _mm_unpackhi_epi16(res02[0], res03[0]);\r\n\r\n        tr0_4 = _mm_unpacklo_epi16(res00[1], res01[1]);\r\n        tr0_5 = _mm_unpacklo_epi16(res02[1], res03[1]);\r\n\r\n        tr0_6 = _mm_unpackhi_epi16(res00[1], res01[1]);\r\n        tr0_7 = _mm_unpackhi_epi16(res02[1], res03[1]);\r\n\r\n        tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1);\r\n        tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3);\r\n\r\n        tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1);\r\n        tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3);\r\n\r\n        tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5);\r\n        tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7);\r\n\r\n        tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5);\r\n        tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7);\r\n\r\n        // second transform pass (16-point inverse DCT)\r\n        c32_rnd = _mm_set1_epi32(1 << (shift2 - 1));                    // add2\r\n        {\r\n            const __m128i T_00_00A = _mm_unpackhi_epi16(tr1_0, tr1_2);  // [33 13 32 12 31 11 30 10]\r\n            const __m128i T_00_01A = _mm_unpackhi_epi16(tr1_1, tr1_3);  // [ ]\r\n            const __m128i T_00_02A = _mm_unpackhi_epi16(tr1_4, 
tr1_6);  // [ ]\r\n            const __m128i T_00_03A = _mm_unpackhi_epi16(tr1_5, tr1_7);  // [ ]\r\n            const __m128i T_00_04A = _mm_unpacklo_epi16(tr1_2, tr1_3);  // [ ]\r\n            const __m128i T_00_05A = _mm_unpacklo_epi16(tr1_6, tr1_7);  // [ ]\r\n            const __m128i T_00_06A = _mm_unpacklo_epi16(tr1_1, tr1_5);  // [ ]row\r\n            const __m128i T_00_07A = _mm_unpacklo_epi16(tr1_0, tr1_4);  // [83 03 82 02 81 01 81 00] row08 row00\r\n\r\n            __m128i O0A, O1A, O2A, O3A, O4A, O5A, O6A, O7A;\r\n\r\n            __m128i EO0A, EO1A, EO2A, EO3A;\r\n            __m128i EEO0A, EEO1A;\r\n            __m128i EEE0A, EEE1A;\r\n#define COMPUTE_ROW(row0103, row0507, row0911, row1315, c0103, c0507, c0911, c1315, row) \\\r\n    row = _mm_add_epi32(_mm_add_epi32(_mm_madd_epi16(row0103, c0103), _mm_madd_epi16(row0507, c0507)), \\\r\n                        _mm_add_epi32(_mm_madd_epi16(row0911, c0911), _mm_madd_epi16(row1315, c1315)));\r\n\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, O0A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, O1A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, O2A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, O3A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, O4A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, O5A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, O6A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, O7A)\r\n\r\n#undef COMPUTE_ROW\r\n\r\n     
       EO0A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_p38_p44), _mm_madd_epi16(T_00_05A, c16_p09_p25)); // EO0\r\n            EO1A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_n09_p38), _mm_madd_epi16(T_00_05A, c16_n25_n44)); // EO1\r\n            EO2A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_n44_p25), _mm_madd_epi16(T_00_05A, c16_p38_p09)); // EO2\r\n            EO3A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_n25_p09), _mm_madd_epi16(T_00_05A, c16_n44_p38)); // EO3\r\n\r\n            EEO0A = _mm_madd_epi16(T_00_06A, c16_p17_p42);\r\n            EEO1A = _mm_madd_epi16(T_00_06A, c16_n42_p17);\r\n\r\n            EEE0A = _mm_madd_epi16(T_00_07A, c16_p32_p32);\r\n            EEE1A = _mm_madd_epi16(T_00_07A, c16_n32_p32);\r\n            {\r\n                const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);   // EE0 = EEE0 + EEO0\r\n                const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);   // EE1 = EEE1 + EEO1\r\n                const __m128i EE3A = _mm_sub_epi32(EEE0A, EEO0A);   // EE3 = EEE0 - EEO0\r\n                const __m128i EE2A = _mm_sub_epi32(EEE1A, EEO1A);   // EE2 = EEE1 - EEO1\r\n\r\n                const __m128i T10A = _mm_add_epi32(_mm_add_epi32(EE0A, EO0A), c32_rnd);   // E0 (= EE0 + EO0) + rnd\r\n                const __m128i T11A = _mm_add_epi32(_mm_add_epi32(EE1A, EO1A), c32_rnd);   // E1 (= EE1 + EO1) + rnd\r\n                const __m128i T12A = _mm_add_epi32(_mm_add_epi32(EE2A, EO2A), c32_rnd);   // E2 (= EE2 + EO2) + rnd\r\n                const __m128i T13A = _mm_add_epi32(_mm_add_epi32(EE3A, EO3A), c32_rnd);   // E3 (= EE3 + EO3) + rnd\r\n                const __m128i T14A = _mm_add_epi32(_mm_sub_epi32(EE3A, EO3A), c32_rnd);   // E4 (= EE3 - EO3) + rnd\r\n                const __m128i T15A = _mm_add_epi32(_mm_sub_epi32(EE2A, EO2A), c32_rnd);   // E5 (= EE2 - EO2) + rnd\r\n                const __m128i T16A = _mm_add_epi32(_mm_sub_epi32(EE1A, EO1A), c32_rnd);   // E6 (= EE1 - EO1) + rnd\r\n                const __m128i T17A = 
_mm_add_epi32(_mm_sub_epi32(EE0A, EO0A), c32_rnd);   // E7 (= EE0 - EO0) + rnd\r\n\r\n                const __m128i T30A = _mm_srai_epi32(_mm_add_epi32(T10A, O0A), shift2);  // E0 + O0 + rnd [30 20 10 00]\r\n                const __m128i T31A = _mm_srai_epi32(_mm_add_epi32(T11A, O1A), shift2);  // E1 + O1 + rnd [31 21 11 01]\r\n                const __m128i T32A = _mm_srai_epi32(_mm_add_epi32(T12A, O2A), shift2);  // E2 + O2 + rnd [32 22 12 02]\r\n                const __m128i T33A = _mm_srai_epi32(_mm_add_epi32(T13A, O3A), shift2);  // E3 + O3 + rnd [33 23 13 03]\r\n                const __m128i T34A = _mm_srai_epi32(_mm_add_epi32(T14A, O4A), shift2);  // E4            [33 24 14 04]\r\n                const __m128i T35A = _mm_srai_epi32(_mm_add_epi32(T15A, O5A), shift2);  // E5            [35 25 15 05]\r\n                const __m128i T36A = _mm_srai_epi32(_mm_add_epi32(T16A, O6A), shift2);  // E6            [36 26 16 06]\r\n                const __m128i T37A = _mm_srai_epi32(_mm_add_epi32(T17A, O7A), shift2);  // E7            [37 27 17 07]\r\n                const __m128i T38A = _mm_srai_epi32(_mm_sub_epi32(T17A, O7A), shift2);  // E7            [30 20 10 00] x8\r\n                const __m128i T39A = _mm_srai_epi32(_mm_sub_epi32(T16A, O6A), shift2);  // E6            [31 21 11 01] x9\r\n                const __m128i T3AA = _mm_srai_epi32(_mm_sub_epi32(T15A, O5A), shift2);  // E5            [32 22 12 02] xA\r\n                const __m128i T3BA = _mm_srai_epi32(_mm_sub_epi32(T14A, O4A), shift2);  // E4            [33 23 13 03] xB\r\n                const __m128i T3CA = _mm_srai_epi32(_mm_sub_epi32(T13A, O3A), shift2);  // E3 - O3 + rnd [33 24 14 04] xC\r\n                const __m128i T3DA = _mm_srai_epi32(_mm_sub_epi32(T12A, O2A), shift2);  // E2 - O2 + rnd [35 25 15 05] xD\r\n                const __m128i T3EA = _mm_srai_epi32(_mm_sub_epi32(T11A, O1A), shift2);  // E1 - O1 + rnd [36 26 16 06] xE\r\n                const __m128i T3FA = 
_mm_srai_epi32(_mm_sub_epi32(T10A, O0A), shift2);  // E0 - O0 + rnd [37 27 17 07] xF\r\n\r\n                res00[0] = _mm_packs_epi32(T30A, T38A);\r\n                res01[0] = _mm_packs_epi32(T31A, T39A);\r\n                res02[0] = _mm_packs_epi32(T32A, T3AA);\r\n                res03[0] = _mm_packs_epi32(T33A, T3BA);\r\n                res00[1] = _mm_packs_epi32(T34A, T3CA);\r\n                res01[1] = _mm_packs_epi32(T35A, T3DA);\r\n                res02[1] = _mm_packs_epi32(T36A, T3EA);\r\n                res03[1] = _mm_packs_epi32(T37A, T3FA);\r\n            }\r\n        }\r\n\r\n        // transpose matrix\r\n        tr0_0 = _mm_unpacklo_epi16(res00[0], res01[0]);\r\n        tr0_1 = _mm_unpacklo_epi16(res02[0], res03[0]);\r\n        tr0_2 = _mm_unpackhi_epi16(res00[0], res01[0]);\r\n        tr0_3 = _mm_unpackhi_epi16(res02[0], res03[0]);\r\n        tr0_4 = _mm_unpacklo_epi16(res00[1], res01[1]);\r\n        tr0_5 = _mm_unpacklo_epi16(res02[1], res03[1]);\r\n        tr0_6 = _mm_unpackhi_epi16(res00[1], res01[1]);\r\n        tr0_7 = _mm_unpackhi_epi16(res02[1], res03[1]);\r\n\r\n        tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1);\r\n        tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3);\r\n        tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1);\r\n        tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3);\r\n        tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5);\r\n        tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7);\r\n        tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5);\r\n        tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7);\r\n\r\n        res00[0] = _mm_unpacklo_epi64(tr1_0, tr1_4);\r\n        res01[0] = _mm_unpackhi_epi64(tr1_0, tr1_4);\r\n        res02[0] = _mm_unpacklo_epi64(tr1_2, tr1_6);\r\n        res03[0] = _mm_unpackhi_epi64(tr1_2, tr1_6);\r\n        res00[1] = _mm_unpacklo_epi64(tr1_1, tr1_5);\r\n        res01[1] = _mm_unpackhi_epi64(tr1_1, tr1_5);\r\n        res02[1] = _mm_unpacklo_epi64(tr1_3, tr1_7);\r\n        res03[1] = _mm_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n        // 
clip\r\n        {\r\n            const __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n            const __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n\r\n            res00[0] = _mm_max_epi16(_mm_min_epi16(res00[0], max_val), min_val);\r\n            res01[0] = _mm_max_epi16(_mm_min_epi16(res01[0], max_val), min_val);\r\n            res02[0] = _mm_max_epi16(_mm_min_epi16(res02[0], max_val), min_val);\r\n            res03[0] = _mm_max_epi16(_mm_min_epi16(res03[0], max_val), min_val);\r\n\r\n            res00[1] = _mm_max_epi16(_mm_min_epi16(res00[1], max_val), min_val);\r\n            res01[1] = _mm_max_epi16(_mm_min_epi16(res01[1], max_val), min_val);\r\n            res02[1] = _mm_max_epi16(_mm_min_epi16(res02[1], max_val), min_val);\r\n            res03[1] = _mm_max_epi16(_mm_min_epi16(res03[1], max_val), min_val);\r\n        }\r\n    }\r\n\r\n    _mm_storeu_si128((__m128i*)(dst + 0 * i_dst    ), res00[0]);\r\n    _mm_storeu_si128((__m128i*)(dst + 0 * i_dst + 8), res00[1]);\r\n    _mm_storeu_si128((__m128i*)(dst + 1 * i_dst    ), res01[0]);\r\n    _mm_storeu_si128((__m128i*)(dst + 1 * i_dst + 8), res01[1]);\r\n    _mm_storeu_si128((__m128i*)(dst + 2 * i_dst    ), res02[0]);\r\n    _mm_storeu_si128((__m128i*)(dst + 2 * i_dst + 8), res02[1]);\r\n    _mm_storeu_si128((__m128i*)(dst + 3 * i_dst    ), res03[0]);\r\n    _mm_storeu_si128((__m128i*)(dst + 3 * i_dst + 8), res03[1]);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_16x4_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/2 case: only the top-left 8x4 block has non-zero coefficients\r\n    idct_16x4_sse128(src, dst, i_dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_16x4_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/4 case: only the top-left 4x4 block has non-zero coefficients\r\n    idct_16x4_half_sse128(src, dst, 
i_dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_8x8_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // const int shift1 = 5;\r\n    const int shift2 = 20 - g_bit_depth;\r\n    // const int clip_depth1 = LIMIT_BIT;\r\n    const int clip_depth2 = g_bit_depth + 1;\r\n\r\n    __m128i S0, S1, S2, S3, S4, S5, S6, S7;\r\n    __m128i mAdd, T0, T1, T2, T3;\r\n    __m128i E0h, E1h, E2h, E3h, E0l, E1l, E2l, E3l;\r\n    __m128i O0h, O1h, O2h, O3h, O0l, O1l, O2l, O3l;\r\n    __m128i EE0l, EE1l, E00l, E01l, EE0h, EE1h, E00h, E01h;\r\n    __m128i T00, T01, T02, T03, T04, T05, T06, T07;\r\n\r\n    mAdd = _mm_set1_epi32(16);                // add1\r\n\r\n    S1 = _mm_load_si128((__m128i*)&src[8]);\r\n    S3 = _mm_load_si128((__m128i*)&src[24]);\r\n\r\n    T0  = _mm_unpacklo_epi16(S1, S3);\r\n    E1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[0])));\r\n    T1  = _mm_unpackhi_epi16(S1, S3);\r\n    E1h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[0])));\r\n\r\n    S5  = _mm_load_si128((__m128i*)&src[40]);\r\n    S7  = _mm_load_si128((__m128i*)&src[56]);\r\n\r\n    T2  = _mm_unpacklo_epi16(S5, S7);\r\n    E2l = _mm_madd_epi16(T2, _mm_load_si128((__m128i*)(tab_idct_8x8[1])));\r\n    T3  = _mm_unpackhi_epi16(S5, S7);\r\n    E2h = _mm_madd_epi16(T3, _mm_load_si128((__m128i*)(tab_idct_8x8[1])));\r\n\r\n    O0l = _mm_add_epi32(E1l, E2l);\r\n    O0h = _mm_add_epi32(E1h, E2h);\r\n\r\n    E1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[2])));\r\n    E1h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[2])));\r\n    E2l = _mm_madd_epi16(T2, _mm_load_si128((__m128i*)(tab_idct_8x8[3])));\r\n    E2h = _mm_madd_epi16(T3, _mm_load_si128((__m128i*)(tab_idct_8x8[3])));\r\n\r\n    O1l = _mm_add_epi32(E1l, E2l);\r\n    O1h = _mm_add_epi32(E1h, E2h);\r\n\r\n    E1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[4])));\r\n    E1h = 
_mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[4])));\r\n    E2l = _mm_madd_epi16(T2, _mm_load_si128((__m128i*)(tab_idct_8x8[5])));\r\n    E2h = _mm_madd_epi16(T3, _mm_load_si128((__m128i*)(tab_idct_8x8[5])));\r\n    O2l = _mm_add_epi32(E1l, E2l);\r\n    O2h = _mm_add_epi32(E1h, E2h);\r\n\r\n    E1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[6])));\r\n    E1h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[6])));\r\n    E2l = _mm_madd_epi16(T2, _mm_load_si128((__m128i*)(tab_idct_8x8[7])));\r\n    E2h = _mm_madd_epi16(T3, _mm_load_si128((__m128i*)(tab_idct_8x8[7])));\r\n    O3h = _mm_add_epi32(E1h, E2h);\r\n    O3l = _mm_add_epi32(E1l, E2l);\r\n\r\n    /*    -------     */\r\n\r\n    S0 = _mm_load_si128((__m128i*)&src[0]);\r\n    S4 = _mm_load_si128((__m128i*)&src[32]);\r\n\r\n    T0   = _mm_unpacklo_epi16(S0, S4);\r\n    EE0l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[8])));\r\n    T1   = _mm_unpackhi_epi16(S0, S4);\r\n    EE0h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[8])));\r\n\r\n    EE1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[9])));\r\n    EE1h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[9])));\r\n\r\n    /*    -------     */\r\n\r\n    S2 = _mm_load_si128((__m128i*)&src[16]);\r\n    S6 = _mm_load_si128((__m128i*)&src[48]);\r\n\r\n    T0   = _mm_unpacklo_epi16(S2, S6);\r\n    E00l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[10])));\r\n    T1   = _mm_unpackhi_epi16(S2, S6);\r\n    E00h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[10])));\r\n    E01l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[11])));\r\n    E01h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[11])));\r\n    E0l = _mm_add_epi32(EE0l, E00l);\r\n    E0l = _mm_add_epi32(E0l, mAdd);\r\n    E0h = _mm_add_epi32(EE0h, E00h);\r\n    E0h = _mm_add_epi32(E0h, mAdd);\r\n    E3l = _mm_sub_epi32(EE0l, E00l);\r\n    E3l = _mm_add_epi32(E3l, 
mAdd);\r\n    E3h = _mm_sub_epi32(EE0h, E00h);\r\n    E3h = _mm_add_epi32(E3h, mAdd);\r\n\r\n    E1l = _mm_add_epi32(EE1l, E01l);\r\n    E1l = _mm_add_epi32(E1l, mAdd);\r\n    E1h = _mm_add_epi32(EE1h, E01h);\r\n    E1h = _mm_add_epi32(E1h, mAdd);\r\n    E2l = _mm_sub_epi32(EE1l, E01l);\r\n    E2l = _mm_add_epi32(E2l, mAdd);\r\n    E2h = _mm_sub_epi32(EE1h, E01h);\r\n    E2h = _mm_add_epi32(E2h, mAdd);\r\n    S0 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E0l, O0l), 5), _mm_srai_epi32(_mm_add_epi32(E0h, O0h), 5));  // first-pass shift (shift1 = 5)\r\n    S7 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E0l, O0l), 5), _mm_srai_epi32(_mm_sub_epi32(E0h, O0h), 5));\r\n    S1 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E1l, O1l), 5), _mm_srai_epi32(_mm_add_epi32(E1h, O1h), 5));\r\n    S6 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E1l, O1l), 5), _mm_srai_epi32(_mm_sub_epi32(E1h, O1h), 5));\r\n    S2 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E2l, O2l), 5), _mm_srai_epi32(_mm_add_epi32(E2h, O2h), 5));\r\n    S5 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E2l, O2l), 5), _mm_srai_epi32(_mm_sub_epi32(E2h, O2h), 5));\r\n    S3 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E3l, O3l), 5), _mm_srai_epi32(_mm_add_epi32(E3h, O3h), 5));\r\n    S4 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E3l, O3l), 5), _mm_srai_epi32(_mm_sub_epi32(E3h, O3h), 5));\r\n\r\n    /*  Transpose matrix   */\r\n\r\n    E0l = _mm_unpacklo_epi16(S0, S4);\r\n    E1l = _mm_unpacklo_epi16(S1, S5);\r\n    E2l = _mm_unpacklo_epi16(S2, S6);\r\n    E3l = _mm_unpacklo_epi16(S3, S7);\r\n    O0l = _mm_unpackhi_epi16(S0, S4);\r\n    O1l = _mm_unpackhi_epi16(S1, S5);\r\n    O2l = _mm_unpackhi_epi16(S2, S6);\r\n    O3l = _mm_unpackhi_epi16(S3, S7);\r\n\r\n    T0  = _mm_unpacklo_epi16(E0l, E2l);\r\n    T1  = _mm_unpacklo_epi16(E1l, E3l);\r\n    S0  = _mm_unpacklo_epi16(T0, T1);\r\n    S1  = _mm_unpackhi_epi16(T0, T1);\r\n\r\n    T2  = _mm_unpackhi_epi16(E0l, E2l);\r\n    T3  = _mm_unpackhi_epi16(E1l, E3l);\r\n    S2  = 
_mm_unpacklo_epi16(T2, T3);\r\n    S3  = _mm_unpackhi_epi16(T2, T3);\r\n\r\n    T0  = _mm_unpacklo_epi16(O0l, O2l);\r\n    T1  = _mm_unpacklo_epi16(O1l, O3l);\r\n    S4  = _mm_unpacklo_epi16(T0, T1);\r\n    S5  = _mm_unpackhi_epi16(T0, T1);\r\n\r\n    T2  = _mm_unpackhi_epi16(O0l, O2l);\r\n    T3  = _mm_unpackhi_epi16(O1l, O3l);\r\n    S6  = _mm_unpacklo_epi16(T2, T3);\r\n    S7  = _mm_unpackhi_epi16(T2, T3);\r\n\r\n    mAdd = _mm_set1_epi32(1 << (shift2 - 1));   // add2\r\n\r\n    T0  = _mm_unpacklo_epi16(S1, S3);\r\n    E1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[0])));\r\n    T1  = _mm_unpackhi_epi16(S1, S3);\r\n    E1h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[0])));\r\n    T2  = _mm_unpacklo_epi16(S5, S7);\r\n    E2l = _mm_madd_epi16(T2, _mm_load_si128((__m128i*)(tab_idct_8x8[1])));\r\n    T3  = _mm_unpackhi_epi16(S5, S7);\r\n    E2h = _mm_madd_epi16(T3, _mm_load_si128((__m128i*)(tab_idct_8x8[1])));\r\n\r\n    O0l = _mm_add_epi32(E1l, E2l);\r\n    O0h = _mm_add_epi32(E1h, E2h);\r\n    E1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[2])));\r\n    E1h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[2])));\r\n    E2l = _mm_madd_epi16(T2, _mm_load_si128((__m128i*)(tab_idct_8x8[3])));\r\n    E2h = _mm_madd_epi16(T3, _mm_load_si128((__m128i*)(tab_idct_8x8[3])));\r\n    O1l = _mm_add_epi32(E1l, E2l);\r\n    O1h = _mm_add_epi32(E1h, E2h);\r\n    E1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[4])));\r\n    E1h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[4])));\r\n    E2l = _mm_madd_epi16(T2, _mm_load_si128((__m128i*)(tab_idct_8x8[5])));\r\n    E2h = _mm_madd_epi16(T3, _mm_load_si128((__m128i*)(tab_idct_8x8[5])));\r\n    O2l = _mm_add_epi32(E1l, E2l);\r\n    O2h = _mm_add_epi32(E1h, E2h);\r\n    E1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[6])));\r\n    E1h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[6])));\r\n    E2l = _mm_madd_epi16(T2, 
_mm_load_si128((__m128i*)(tab_idct_8x8[7])));\r\n    E2h = _mm_madd_epi16(T3, _mm_load_si128((__m128i*)(tab_idct_8x8[7])));\r\n    O3h = _mm_add_epi32(E1h, E2h);\r\n    O3l = _mm_add_epi32(E1l, E2l);\r\n\r\n    T0   = _mm_unpacklo_epi16(S0, S4);\r\n    T1   = _mm_unpackhi_epi16(S0, S4);\r\n    EE0l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[8])));\r\n    EE0h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[8])));\r\n    EE1l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[9])));\r\n    EE1h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[9])));\r\n\r\n    T0   = _mm_unpacklo_epi16(S2, S6);\r\n    T1   = _mm_unpackhi_epi16(S2, S6);\r\n    E00l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[10])));\r\n    E00h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[10])));\r\n    E01l = _mm_madd_epi16(T0, _mm_load_si128((__m128i*)(tab_idct_8x8[11])));\r\n    E01h = _mm_madd_epi16(T1, _mm_load_si128((__m128i*)(tab_idct_8x8[11])));\r\n    E0l = _mm_add_epi32(EE0l, E00l);\r\n    E0l = _mm_add_epi32(E0l, mAdd);\r\n    E0h = _mm_add_epi32(EE0h, E00h);\r\n    E0h = _mm_add_epi32(E0h, mAdd);\r\n    E3l = _mm_sub_epi32(EE0l, E00l);\r\n    E3l = _mm_add_epi32(E3l, mAdd);\r\n    E3h = _mm_sub_epi32(EE0h, E00h);\r\n    E3h = _mm_add_epi32(E3h, mAdd);\r\n    E1l = _mm_add_epi32(EE1l, E01l);\r\n    E1l = _mm_add_epi32(E1l, mAdd);\r\n    E1h = _mm_add_epi32(EE1h, E01h);\r\n    E1h = _mm_add_epi32(E1h, mAdd);\r\n    E2l = _mm_sub_epi32(EE1l, E01l);\r\n    E2l = _mm_add_epi32(E2l, mAdd);\r\n    E2h = _mm_sub_epi32(EE1h, E01h);\r\n    E2h = _mm_add_epi32(E2h, mAdd);\r\n\r\n    S0 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E0l, O0l), shift2), _mm_srai_epi32(_mm_add_epi32(E0h, O0h), shift2));\r\n    S7 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E0l, O0l), shift2), _mm_srai_epi32(_mm_sub_epi32(E0h, O0h), shift2));\r\n    S1 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E1l, O1l), shift2), 
_mm_srai_epi32(_mm_add_epi32(E1h, O1h), shift2));\r\n    S6 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E1l, O1l), shift2), _mm_srai_epi32(_mm_sub_epi32(E1h, O1h), shift2));\r\n    S2 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E2l, O2l), shift2), _mm_srai_epi32(_mm_add_epi32(E2h, O2h), shift2));\r\n    S5 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E2l, O2l), shift2), _mm_srai_epi32(_mm_sub_epi32(E2h, O2h), shift2));\r\n    S3 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E3l, O3l), shift2), _mm_srai_epi32(_mm_add_epi32(E3h, O3h), shift2));\r\n    S4 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E3l, O3l), shift2), _mm_srai_epi32(_mm_sub_epi32(E3h, O3h), shift2));\r\n\r\n    // [07 06 05 04 03 02 01 00]\r\n    // [17 16 15 14 13 12 11 10]\r\n    // [27 26 25 24 23 22 21 20]\r\n    // [37 36 35 34 33 32 31 30]\r\n    // [47 46 45 44 43 42 41 40]\r\n    // [57 56 55 54 53 52 51 50]\r\n    // [67 66 65 64 63 62 61 60]\r\n    // [77 76 75 74 73 72 71 70]\r\n\r\n    T00 = _mm_unpacklo_epi16(S0, S1);     // [13 03 12 02 11 01 10 00]\r\n    T01 = _mm_unpackhi_epi16(S0, S1);     // [17 07 16 06 15 05 14 04]\r\n    T02 = _mm_unpacklo_epi16(S2, S3);     // [33 23 32 22 31 21 30 20]\r\n    T03 = _mm_unpackhi_epi16(S2, S3);     // [37 27 36 26 35 25 34 24]\r\n    T04 = _mm_unpacklo_epi16(S4, S5);     // [53 43 52 42 51 41 50 40]\r\n    T05 = _mm_unpackhi_epi16(S4, S5);     // [57 47 56 46 55 45 54 44]\r\n    T06 = _mm_unpacklo_epi16(S6, S7);     // [73 63 72 62 71 61 70 60]\r\n    T07 = _mm_unpackhi_epi16(S6, S7);     // [77 67 76 66 75 65 74 64]\r\n\r\n    // clip\r\n    {\r\n        const __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        const __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n\r\n        T00 = _mm_max_epi16(_mm_min_epi16(T00, max_val), min_val);\r\n        T01 = _mm_max_epi16(_mm_min_epi16(T01, max_val), min_val);\r\n        T02 = _mm_max_epi16(_mm_min_epi16(T02, max_val), min_val);\r\n        T03 = 
_mm_max_epi16(_mm_min_epi16(T03, max_val), min_val);\r\n        T04 = _mm_max_epi16(_mm_min_epi16(T04, max_val), min_val);\r\n        T05 = _mm_max_epi16(_mm_min_epi16(T05, max_val), min_val);\r\n        T06 = _mm_max_epi16(_mm_min_epi16(T06, max_val), min_val);\r\n        T07 = _mm_max_epi16(_mm_min_epi16(T07, max_val), min_val);\r\n    }\r\n\r\n    {\r\n        __m128i T10, T11, T12, T13;\r\n\r\n        T10 = _mm_unpacklo_epi32(T00, T02);     // [31 21 11 01 30 20 10 00]\r\n        T11 = _mm_unpackhi_epi32(T00, T02);     // [33 23 13 03 32 22 12 02]\r\n        T12 = _mm_unpacklo_epi32(T04, T06);     // [71 61 51 41 70 60 50 40]\r\n        T13 = _mm_unpackhi_epi32(T04, T06);     // [73 63 53 43 72 62 52 42]\r\n\r\n        _mm_store_si128((__m128i*)(dst + 0 * i_dst), _mm_unpacklo_epi64(T10, T12));  // [70 60 50 40 30 20 10 00]\r\n        _mm_store_si128((__m128i*)(dst + 1 * i_dst), _mm_unpackhi_epi64(T10, T12));  // [71 61 51 41 31 21 11 01]\r\n        _mm_store_si128((__m128i*)(dst + 2 * i_dst), _mm_unpacklo_epi64(T11, T13));  // [72 62 52 42 32 22 12 02]\r\n        _mm_store_si128((__m128i*)(dst + 3 * i_dst), _mm_unpackhi_epi64(T11, T13));  // [73 63 53 43 33 23 13 03]\r\n\r\n        T10 = _mm_unpacklo_epi32(T01, T03);     // [35 25 15 05 34 24 14 04]\r\n        T12 = _mm_unpacklo_epi32(T05, T07);     // [75 65 55 45 74 64 54 44]\r\n        T11 = _mm_unpackhi_epi32(T01, T03);     // [37 27 17 07 36 26 16 06]\r\n        T13 = _mm_unpackhi_epi32(T05, T07);     // [77 67 57 47 76 66 56 46]\r\n\r\n        _mm_store_si128((__m128i*)(dst + 4 * i_dst), _mm_unpacklo_epi64(T10, T12));  // [74 64 54 44 34 24 14 04]\r\n        _mm_store_si128((__m128i*)(dst + 5 * i_dst), _mm_unpackhi_epi64(T10, T12));  // [75 65 55 45 35 25 15 05]\r\n        _mm_store_si128((__m128i*)(dst + 6 * i_dst), _mm_unpacklo_epi64(T11, T13));  // [76 66 56 46 36 26 16 06]\r\n        _mm_store_si128((__m128i*)(dst + 7 * i_dst), _mm_unpackhi_epi64(T11, T13));  // [77 67 57 47 37 27 17 07]\r\n    
}\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_8x8_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/2 size: only the top-left 4x4 block has non-zero coefficients\r\n    idct_8x8_sse128(src, dst, i_dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_8x8_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/4 size: only the top-left 2x2 block has non-zero coefficients\r\n    idct_8x8_half_sse128(src, dst, i_dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_16x16_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    const int shift1 = 5;\r\n    const int shift2 = 20 - g_bit_depth;\r\n    //const int clip_depth1 = LIMIT_BIT;\r\n    const int clip_depth2 = g_bit_depth + 1;\r\n\r\n    const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);   //row0: 43 in the high 16 bits, 45 in the low 16 bits\r\n    const __m128i c16_p35_p40 = _mm_set1_epi32(0x00230028);\r\n    const __m128i c16_p21_p29 = _mm_set1_epi32(0x0015001D);\r\n    const __m128i c16_p04_p13 = _mm_set1_epi32(0x0004000D);\r\n    const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);   //row1\r\n    const __m128i c16_n21_p04 = _mm_set1_epi32(0xFFEB0004);\r\n    const __m128i c16_n45_n40 = _mm_set1_epi32(0xFFD3FFD8);\r\n    const __m128i c16_n13_n35 = _mm_set1_epi32(0xFFF3FFDD);\r\n    const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);   //row2\r\n    const __m128i c16_n43_n35 = _mm_set1_epi32(0xFFD5FFDD);\r\n    const __m128i c16_p29_n13 = _mm_set1_epi32(0x001DFFF3);\r\n    const __m128i c16_p21_p45 = _mm_set1_epi32(0x0015002D);\r\n    const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);   //row3\r\n    const __m128i c16_p04_n43 = _mm_set1_epi32(0x0004FFD5);\r\n    const __m128i c16_p13_p45 = _mm_set1_epi32(0x000D002D);\r\n    const __m128i c16_n29_n40 = _mm_set1_epi32(0xFFE3FFD8);\r\n    const __m128i c16_n40_p29 = 
_mm_set1_epi32(0xFFD8001D);   //row4\r\n    const __m128i c16_p45_n13 = _mm_set1_epi32(0x002DFFF3);\r\n    const __m128i c16_n43_n04 = _mm_set1_epi32(0xFFD5FFFC);\r\n    const __m128i c16_p35_p21 = _mm_set1_epi32(0x00230015);\r\n    const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);   //row5\r\n    const __m128i c16_p13_p29 = _mm_set1_epi32(0x000D001D);\r\n    const __m128i c16_p35_n43 = _mm_set1_epi32(0x0023FFD5);\r\n    const __m128i c16_n40_p04 = _mm_set1_epi32(0xFFD80004);\r\n    const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);   //row6\r\n    const __m128i c16_n40_p45 = _mm_set1_epi32(0xFFD8002D);\r\n    const __m128i c16_p04_p21 = _mm_set1_epi32(0x00040015);\r\n    const __m128i c16_p43_n29 = _mm_set1_epi32(0x002BFFE3);\r\n    const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);   //row7\r\n    const __m128i c16_n29_p21 = _mm_set1_epi32(0xFFE30015);\r\n    const __m128i c16_n40_p35 = _mm_set1_epi32(0xFFD80023);\r\n    const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n\r\n    const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);\r\n    const __m128i c16_p09_p25 = _mm_set1_epi32(0x00090019);\r\n    const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);\r\n    const __m128i c16_n25_n44 = _mm_set1_epi32(0xFFE7FFD4);\r\n    const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);\r\n    const __m128i c16_p38_p09 = _mm_set1_epi32(0x00260009);\r\n    const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);\r\n    const __m128i c16_n44_p38 = _mm_set1_epi32(0xFFD40026);\r\n\r\n    const __m128i c16_p17_p42 = _mm_set1_epi32(0x0011002A);\r\n    const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);\r\n\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);\r\n\r\n    int i, pass, part;\r\n\r\n    int nShift = shift1;\r\n    __m128i c32_rnd = _mm_set1_epi32((1 << shift1) >> 1);               // add1\r\n\r\n    // DCT1\r\n    __m128i in00[2], in01[2], in02[2], in03[2], 
in04[2], in05[2], in06[2], in07[2];\r\n    __m128i in08[2], in09[2], in10[2], in11[2], in12[2], in13[2], in14[2], in15[2];\r\n    __m128i res00[2], res01[2], res02[2], res03[2], res04[2], res05[2], res06[2], res07[2];\r\n    __m128i res08[2], res09[2], res10[2], res11[2], res12[2], res13[2], res14[2], res15[2];\r\n\r\n    for (i = 0; i < 2; i++) {\r\n        const int offset = (i << 3);\r\n\r\n        in00[i] = _mm_load_si128((const __m128i*)&src[ 0 * 16 + offset]);   // [07 06 05 04 03 02 01 00]\r\n        in01[i] = _mm_load_si128((const __m128i*)&src[ 1 * 16 + offset]);   // [17 16 15 14 13 12 11 10]\r\n        in02[i] = _mm_load_si128((const __m128i*)&src[ 2 * 16 + offset]);   // [27 26 25 24 23 22 21 20]\r\n        in03[i] = _mm_load_si128((const __m128i*)&src[ 3 * 16 + offset]);   // [37 36 35 34 33 32 31 30]\r\n        in04[i] = _mm_load_si128((const __m128i*)&src[ 4 * 16 + offset]);   // [47 46 45 44 43 42 41 40]\r\n        in05[i] = _mm_load_si128((const __m128i*)&src[ 5 * 16 + offset]);   // [57 56 55 54 53 52 51 50]\r\n        in06[i] = _mm_load_si128((const __m128i*)&src[ 6 * 16 + offset]);   // [67 66 65 64 63 62 61 60]\r\n        in07[i] = _mm_load_si128((const __m128i*)&src[ 7 * 16 + offset]);   // [77 76 75 74 73 72 71 70]\r\n        in08[i] = _mm_load_si128((const __m128i*)&src[ 8 * 16 + offset]);\r\n        in09[i] = _mm_load_si128((const __m128i*)&src[ 9 * 16 + offset]);\r\n        in10[i] = _mm_load_si128((const __m128i*)&src[10 * 16 + offset]);\r\n        in11[i] = _mm_load_si128((const __m128i*)&src[11 * 16 + offset]);\r\n        in12[i] = _mm_load_si128((const __m128i*)&src[12 * 16 + offset]);\r\n        in13[i] = _mm_load_si128((const __m128i*)&src[13 * 16 + offset]);\r\n        in14[i] = _mm_load_si128((const __m128i*)&src[14 * 16 + offset]);\r\n        in15[i] = _mm_load_si128((const __m128i*)&src[15 * 16 + offset]);\r\n    }\r\n\r\n    for (pass = 0; pass < 2; pass++) {\r\n        for (part = 0; part < 2; part++) {\r\n            const 
__m128i T_00_00A = _mm_unpacklo_epi16(in01[part], in03[part]);    // [33 13 32 12 31 11 30 10]\r\n            const __m128i T_00_00B = _mm_unpackhi_epi16(in01[part], in03[part]);    // [37 17 36 16 35 15 34 14]\r\n            const __m128i T_00_01A = _mm_unpacklo_epi16(in05[part], in07[part]);    // [ ]\r\n            const __m128i T_00_01B = _mm_unpackhi_epi16(in05[part], in07[part]);    // [ ]\r\n            const __m128i T_00_02A = _mm_unpacklo_epi16(in09[part], in11[part]);    // [ ]\r\n            const __m128i T_00_02B = _mm_unpackhi_epi16(in09[part], in11[part]);    // [ ]\r\n            const __m128i T_00_03A = _mm_unpacklo_epi16(in13[part], in15[part]);    // [ ]\r\n            const __m128i T_00_03B = _mm_unpackhi_epi16(in13[part], in15[part]);    // [ ]\r\n            const __m128i T_00_04A = _mm_unpacklo_epi16(in02[part], in06[part]);    // [ ]\r\n            const __m128i T_00_04B = _mm_unpackhi_epi16(in02[part], in06[part]);    // [ ]\r\n            const __m128i T_00_05A = _mm_unpacklo_epi16(in10[part], in14[part]);    // [ ]\r\n            const __m128i T_00_05B = _mm_unpackhi_epi16(in10[part], in14[part]);    // [ ]\r\n            const __m128i T_00_06A = _mm_unpacklo_epi16(in04[part], in12[part]);    // [ ]row\r\n            const __m128i T_00_06B = _mm_unpackhi_epi16(in04[part], in12[part]);    // [ ]\r\n            const __m128i T_00_07A = _mm_unpacklo_epi16(in00[part], in08[part]);    // [83 03 82 02 81 01 80 00] row08 row00\r\n            const __m128i T_00_07B = _mm_unpackhi_epi16(in00[part], in08[part]);    // [87 07 86 06 85 05 84 04]\r\n\r\n            __m128i O0A, O1A, O2A, O3A, O4A, O5A, O6A, O7A;\r\n            __m128i O0B, O1B, O2B, O3B, O4B, O5B, O6B, O7B;\r\n            __m128i EO0A, EO1A, EO2A, EO3A;\r\n            __m128i EO0B, EO1B, EO2B, EO3B;\r\n            __m128i EEO0A, EEO1A;\r\n            __m128i EEO0B, EEO1B;\r\n            __m128i EEE0A, EEE1A;\r\n            __m128i EEE0B, EEE1B;\r\n            __m128i T00, 
T01;\r\n\r\n#define COMPUTE_ROW(row0103, row0507, row0911, row1315, c0103, c0507, c0911, c1315, row) \\\r\n    T00 = _mm_add_epi32(_mm_madd_epi16(row0103, c0103), _mm_madd_epi16(row0507, c0507)); \\\r\n    T01 = _mm_add_epi32(_mm_madd_epi16(row0911, c0911), _mm_madd_epi16(row1315, c1315)); \\\r\n    row = _mm_add_epi32(T00, T01);\r\n\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, O0A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, O1A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, O2A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, O3A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, O4A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, O5A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, O6A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, O7A)\r\n\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, O0B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, O1B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, O2B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, O3B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, O4B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, 
T_00_02B, T_00_03B, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, O5B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, O6B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, O7B)\r\n#undef COMPUTE_ROW\r\n\r\n\r\n            EO0A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_p38_p44), _mm_madd_epi16(T_00_05A, c16_p09_p25)); // EO0\r\n            EO0B = _mm_add_epi32(_mm_madd_epi16(T_00_04B, c16_p38_p44), _mm_madd_epi16(T_00_05B, c16_p09_p25));\r\n            EO1A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_n09_p38), _mm_madd_epi16(T_00_05A, c16_n25_n44)); // EO1\r\n            EO1B = _mm_add_epi32(_mm_madd_epi16(T_00_04B, c16_n09_p38), _mm_madd_epi16(T_00_05B, c16_n25_n44));\r\n            EO2A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_n44_p25), _mm_madd_epi16(T_00_05A, c16_p38_p09)); // EO2\r\n            EO2B = _mm_add_epi32(_mm_madd_epi16(T_00_04B, c16_n44_p25), _mm_madd_epi16(T_00_05B, c16_p38_p09));\r\n            EO3A = _mm_add_epi32(_mm_madd_epi16(T_00_04A, c16_n25_p09), _mm_madd_epi16(T_00_05A, c16_n44_p38)); // EO3\r\n            EO3B = _mm_add_epi32(_mm_madd_epi16(T_00_04B, c16_n25_p09), _mm_madd_epi16(T_00_05B, c16_n44_p38));\r\n\r\n\r\n            EEO0A = _mm_madd_epi16(T_00_06A, c16_p17_p42);\r\n            EEO0B = _mm_madd_epi16(T_00_06B, c16_p17_p42);\r\n            EEO1A = _mm_madd_epi16(T_00_06A, c16_n42_p17);\r\n            EEO1B = _mm_madd_epi16(T_00_06B, c16_n42_p17);\r\n\r\n\r\n            EEE0A = _mm_madd_epi16(T_00_07A, c16_p32_p32);\r\n            EEE0B = _mm_madd_epi16(T_00_07B, c16_p32_p32);\r\n            EEE1A = _mm_madd_epi16(T_00_07A, c16_n32_p32);\r\n            EEE1B = _mm_madd_epi16(T_00_07B, c16_n32_p32);\r\n\r\n            {\r\n                const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n                const __m128i EE0B = _mm_add_epi32(EEE0B, 
EEO0B);\r\n                const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n                const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n                const __m128i EE3A = _mm_sub_epi32(EEE0A, EEO0A);       // EE3 = EEE0 - EEO0\r\n                const __m128i EE3B = _mm_sub_epi32(EEE0B, EEO0B);\r\n                const __m128i EE2A = _mm_sub_epi32(EEE1A, EEO1A);       // EE2 = EEE1 - EEO1\r\n                const __m128i EE2B = _mm_sub_epi32(EEE1B, EEO1B);\r\n\r\n                const __m128i T10A = _mm_add_epi32(_mm_add_epi32(EE0A, EO0A), c32_rnd);     // E0 (= EE0 + EO0) + rnd\r\n                const __m128i T10B = _mm_add_epi32(_mm_add_epi32(EE0B, EO0B), c32_rnd);\r\n                const __m128i T11A = _mm_add_epi32(_mm_add_epi32(EE1A, EO1A), c32_rnd);     // E1 (= EE1 + EO1) + rnd\r\n                const __m128i T11B = _mm_add_epi32(_mm_add_epi32(EE1B, EO1B), c32_rnd);\r\n                const __m128i T12A = _mm_add_epi32(_mm_add_epi32(EE2A, EO2A), c32_rnd);     // E2 (= EE2 + EO2) + rnd\r\n                const __m128i T12B = _mm_add_epi32(_mm_add_epi32(EE2B, EO2B), c32_rnd);\r\n                const __m128i T13A = _mm_add_epi32(_mm_add_epi32(EE3A, EO3A), c32_rnd);     // E3 (= EE3 + EO3) + rnd\r\n                const __m128i T13B = _mm_add_epi32(_mm_add_epi32(EE3B, EO3B), c32_rnd);\r\n                const __m128i T14A = _mm_add_epi32(_mm_sub_epi32(EE3A, EO3A), c32_rnd);     // E4 (= EE3 - EO3) + rnd\r\n                const __m128i T14B = _mm_add_epi32(_mm_sub_epi32(EE3B, EO3B), c32_rnd);\r\n                const __m128i T15A = _mm_add_epi32(_mm_sub_epi32(EE2A, EO2A), c32_rnd);     // E5 (= EE2 - EO2) + rnd\r\n                const __m128i T15B = _mm_add_epi32(_mm_sub_epi32(EE2B, EO2B), c32_rnd);\r\n                const __m128i T16A = _mm_add_epi32(_mm_sub_epi32(EE1A, EO1A), c32_rnd);     // E6 (= EE1 - EO1) + rnd\r\n                const __m128i T16B = _mm_add_epi32(_mm_sub_epi32(EE1B, EO1B), c32_rnd);\r\n      
          const __m128i T17A = _mm_add_epi32(_mm_sub_epi32(EE0A, EO0A), c32_rnd);     // E7 (= EE0 - EO0) + rnd\r\n                const __m128i T17B = _mm_add_epi32(_mm_sub_epi32(EE0B, EO0B), c32_rnd);\r\n\r\n                const __m128i T30A = _mm_srai_epi32(_mm_add_epi32(T10A, O0A), nShift);      // E0 + O0 + rnd [30 20 10 00]\r\n                const __m128i T30B = _mm_srai_epi32(_mm_add_epi32(T10B, O0B), nShift);      //               [70 60 50 40]\r\n                const __m128i T31A = _mm_srai_epi32(_mm_add_epi32(T11A, O1A), nShift);      // E1 + O1 + rnd [31 21 11 01]\r\n                const __m128i T31B = _mm_srai_epi32(_mm_add_epi32(T11B, O1B), nShift);      //               [71 61 51 41]\r\n                const __m128i T32A = _mm_srai_epi32(_mm_add_epi32(T12A, O2A), nShift);      // E2 + O2 + rnd [32 22 12 02]\r\n                const __m128i T32B = _mm_srai_epi32(_mm_add_epi32(T12B, O2B), nShift);      //               [72 62 52 42]\r\n                const __m128i T33A = _mm_srai_epi32(_mm_add_epi32(T13A, O3A), nShift);      // E3 + O3 + rnd [33 23 13 03]\r\n                const __m128i T33B = _mm_srai_epi32(_mm_add_epi32(T13B, O3B), nShift);      //               [73 63 53 43]\r\n                const __m128i T34A = _mm_srai_epi32(_mm_add_epi32(T14A, O4A), nShift);      // E4            [34 24 14 04]\r\n                const __m128i T34B = _mm_srai_epi32(_mm_add_epi32(T14B, O4B), nShift);      //               [74 64 54 44]\r\n                const __m128i T35A = _mm_srai_epi32(_mm_add_epi32(T15A, O5A), nShift);      // E5            [35 25 15 05]\r\n                const __m128i T35B = _mm_srai_epi32(_mm_add_epi32(T15B, O5B), nShift);      //               [75 65 55 45]\r\n                const __m128i T36A = _mm_srai_epi32(_mm_add_epi32(T16A, O6A), nShift);      // E6            [36 26 16 06]\r\n                const __m128i T36B = _mm_srai_epi32(_mm_add_epi32(T16B, O6B), nShift);      //               [76 66 56 46]\r\n                const 
__m128i T37A = _mm_srai_epi32(_mm_add_epi32(T17A, O7A), nShift);      // E7            [37 27 17 07]\r\n                const __m128i T37B = _mm_srai_epi32(_mm_add_epi32(T17B, O7B), nShift);      //               [77 67 57 47]\r\n                \r\n                const __m128i T38A = _mm_srai_epi32(_mm_sub_epi32(T17A, O7A), nShift);      // E7 [30 20 10 00] x8\r\n                const __m128i T38B = _mm_srai_epi32(_mm_sub_epi32(T17B, O7B), nShift);      //    [70 60 50 40]\r\n                const __m128i T39A = _mm_srai_epi32(_mm_sub_epi32(T16A, O6A), nShift);      // E6 [31 21 11 01] x9\r\n                const __m128i T39B = _mm_srai_epi32(_mm_sub_epi32(T16B, O6B), nShift);      //    [71 61 51 41]\r\n                const __m128i T3AA = _mm_srai_epi32(_mm_sub_epi32(T15A, O5A), nShift);      // E5 [32 22 12 02] xA\r\n                const __m128i T3AB = _mm_srai_epi32(_mm_sub_epi32(T15B, O5B), nShift);      //    [72 62 52 42]\r\n                const __m128i T3BA = _mm_srai_epi32(_mm_sub_epi32(T14A, O4A), nShift);      // E4 [33 23 13 03] xB\r\n                const __m128i T3BB = _mm_srai_epi32(_mm_sub_epi32(T14B, O4B), nShift);      //    [73 63 53 43]\r\n                const __m128i T3CA = _mm_srai_epi32(_mm_sub_epi32(T13A, O3A), nShift);      // E3 - O3 + rnd [34 24 14 04] xC\r\n                const __m128i T3CB = _mm_srai_epi32(_mm_sub_epi32(T13B, O3B), nShift);      //               [74 64 54 44]\r\n                const __m128i T3DA = _mm_srai_epi32(_mm_sub_epi32(T12A, O2A), nShift);      // E2 - O2 + rnd [35 25 15 05] xD\r\n                const __m128i T3DB = _mm_srai_epi32(_mm_sub_epi32(T12B, O2B), nShift);      //               [75 65 55 45]\r\n                const __m128i T3EA = _mm_srai_epi32(_mm_sub_epi32(T11A, O1A), nShift);      // E1 - O1 + rnd [36 26 16 06] xE\r\n                const __m128i T3EB = _mm_srai_epi32(_mm_sub_epi32(T11B, O1B), nShift);      //               [76 66 56 46]\r\n                const __m128i T3FA = 
_mm_srai_epi32(_mm_sub_epi32(T10A, O0A), nShift);      // E0 - O0 + rnd [37 27 17 07] xF\r\n                const __m128i T3FB = _mm_srai_epi32(_mm_sub_epi32(T10B, O0B), nShift);      //               [77 67 57 47]\r\n\r\n                res00[part] = _mm_packs_epi32(T30A, T30B);              // [70 60 50 40 30 20 10 00]\r\n                res01[part] = _mm_packs_epi32(T31A, T31B);              // [71 61 51 41 31 21 11 01]\r\n                res02[part] = _mm_packs_epi32(T32A, T32B);              // [72 62 52 42 32 22 12 02]\r\n                res03[part] = _mm_packs_epi32(T33A, T33B);              // [73 63 53 43 33 23 13 03]\r\n                res04[part] = _mm_packs_epi32(T34A, T34B);              // [74 64 54 44 34 24 14 04]\r\n                res05[part] = _mm_packs_epi32(T35A, T35B);              // [75 65 55 45 35 25 15 05]\r\n                res06[part] = _mm_packs_epi32(T36A, T36B);              // [76 66 56 46 36 26 16 06]\r\n                res07[part] = _mm_packs_epi32(T37A, T37B);              // [77 67 57 47 37 27 17 07]\r\n\r\n                res08[part] = _mm_packs_epi32(T38A, T38B);              // [A0 ... 80]\r\n                res09[part] = _mm_packs_epi32(T39A, T39B);              // [A1 ... 81]\r\n                res10[part] = _mm_packs_epi32(T3AA, T3AB);              // [A2 ... 82]\r\n                res11[part] = _mm_packs_epi32(T3BA, T3BB);              // [A3 ... 83]\r\n                res12[part] = _mm_packs_epi32(T3CA, T3CB);              // [A4 ... 84]\r\n                res13[part] = _mm_packs_epi32(T3DA, T3DB);              // [A5 ... 85]\r\n                res14[part] = _mm_packs_epi32(T3EA, T3EB);              // [A6 ... 86]\r\n                res15[part] = _mm_packs_epi32(T3FA, T3FB);              // [A7 ... 
87]\r\n            }\r\n        }\r\n\r\n        // transpose matrix 8x8 16bit\r\n        {\r\n            __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n            __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0 = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1 = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2 = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3 = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4 = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5 = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6 = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7 = _mm_unpackhi_epi64(tr1_3, tr1_7); \\\r\n\r\n            TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n            TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1])\r\n            TRANSPOSE_8x8_16BIT(res00[1], res01[1], res02[1], res03[1], res04[1], 
res05[1], res06[1], res07[1], in08[0], in09[0], in10[0], in11[0], in12[0], in13[0], in14[0], in15[0])\r\n            TRANSPOSE_8x8_16BIT(res08[1], res09[1], res10[1], res11[1], res12[1], res13[1], res14[1], res15[1], in08[1], in09[1], in10[1], in11[1], in12[1], in13[1], in14[1], in15[1])\r\n\r\n#undef TRANSPOSE_8x8_16BIT\r\n        }\r\n\r\n        nShift = shift2;\r\n        c32_rnd = _mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n    }\r\n\r\n    // clip\r\n    {\r\n        const __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        const __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n\r\n        in00[0] = _mm_max_epi16(_mm_min_epi16(in00[0], max_val), min_val);\r\n        in00[1] = _mm_max_epi16(_mm_min_epi16(in00[1], max_val), min_val);\r\n\r\n        in01[0] = _mm_max_epi16(_mm_min_epi16(in01[0], max_val), min_val);\r\n        in01[1] = _mm_max_epi16(_mm_min_epi16(in01[1], max_val), min_val);\r\n\r\n        in02[0] = _mm_max_epi16(_mm_min_epi16(in02[0], max_val), min_val);\r\n        in02[1] = _mm_max_epi16(_mm_min_epi16(in02[1], max_val), min_val);\r\n\r\n        in03[0] = _mm_max_epi16(_mm_min_epi16(in03[0], max_val), min_val);\r\n        in03[1] = _mm_max_epi16(_mm_min_epi16(in03[1], max_val), min_val);\r\n\r\n        in04[0] = _mm_max_epi16(_mm_min_epi16(in04[0], max_val), min_val);\r\n        in04[1] = _mm_max_epi16(_mm_min_epi16(in04[1], max_val), min_val);\r\n\r\n        in05[0] = _mm_max_epi16(_mm_min_epi16(in05[0], max_val), min_val);\r\n        in05[1] = _mm_max_epi16(_mm_min_epi16(in05[1], max_val), min_val);\r\n\r\n        in06[0] = _mm_max_epi16(_mm_min_epi16(in06[0], max_val), min_val);\r\n        in06[1] = _mm_max_epi16(_mm_min_epi16(in06[1], max_val), min_val);\r\n\r\n        in07[0] = _mm_max_epi16(_mm_min_epi16(in07[0], max_val), min_val);\r\n        in07[1] = _mm_max_epi16(_mm_min_epi16(in07[1], max_val), min_val);\r\n\r\n        in08[0] = _mm_max_epi16(_mm_min_epi16(in08[0], max_val), min_val);\r\n    
    in08[1] = _mm_max_epi16(_mm_min_epi16(in08[1], max_val), min_val);\r\n\r\n        in09[0] = _mm_max_epi16(_mm_min_epi16(in09[0], max_val), min_val);\r\n        in09[1] = _mm_max_epi16(_mm_min_epi16(in09[1], max_val), min_val);\r\n\r\n        in10[0] = _mm_max_epi16(_mm_min_epi16(in10[0], max_val), min_val);\r\n        in10[1] = _mm_max_epi16(_mm_min_epi16(in10[1], max_val), min_val);\r\n\r\n        in11[0] = _mm_max_epi16(_mm_min_epi16(in11[0], max_val), min_val);\r\n        in11[1] = _mm_max_epi16(_mm_min_epi16(in11[1], max_val), min_val);\r\n\r\n        in12[0] = _mm_max_epi16(_mm_min_epi16(in12[0], max_val), min_val);\r\n        in12[1] = _mm_max_epi16(_mm_min_epi16(in12[1], max_val), min_val);\r\n\r\n        in13[0] = _mm_max_epi16(_mm_min_epi16(in13[0], max_val), min_val);\r\n        in13[1] = _mm_max_epi16(_mm_min_epi16(in13[1], max_val), min_val);\r\n\r\n        in14[0] = _mm_max_epi16(_mm_min_epi16(in14[0], max_val), min_val);\r\n        in14[1] = _mm_max_epi16(_mm_min_epi16(in14[1], max_val), min_val);\r\n\r\n        in15[0] = _mm_max_epi16(_mm_min_epi16(in15[0], max_val), min_val);\r\n        in15[1] = _mm_max_epi16(_mm_min_epi16(in15[1], max_val), min_val);\r\n    }\r\n\r\n    // store\r\n    _mm_store_si128((__m128i*)(dst +  0 * i_dst + 0), in00[0]);\r\n    _mm_store_si128((__m128i*)(dst +  0 * i_dst + 8), in00[1]);\r\n    _mm_store_si128((__m128i*)(dst +  1 * i_dst + 0), in01[0]);\r\n    _mm_store_si128((__m128i*)(dst +  1 * i_dst + 8), in01[1]);\r\n    _mm_store_si128((__m128i*)(dst +  2 * i_dst + 0), in02[0]);\r\n    _mm_store_si128((__m128i*)(dst +  2 * i_dst + 8), in02[1]);\r\n    _mm_store_si128((__m128i*)(dst +  3 * i_dst + 0), in03[0]);\r\n    _mm_store_si128((__m128i*)(dst +  3 * i_dst + 8), in03[1]);\r\n    _mm_store_si128((__m128i*)(dst +  4 * i_dst + 0), in04[0]);\r\n    _mm_store_si128((__m128i*)(dst +  4 * i_dst + 8), in04[1]);\r\n    _mm_store_si128((__m128i*)(dst +  5 * i_dst + 0), in05[0]);\r\n    _mm_store_si128((__m128i*)(dst +  5 
* i_dst + 8), in05[1]);\r\n    _mm_store_si128((__m128i*)(dst +  6 * i_dst + 0), in06[0]);\r\n    _mm_store_si128((__m128i*)(dst +  6 * i_dst + 8), in06[1]);\r\n    _mm_store_si128((__m128i*)(dst +  7 * i_dst + 0), in07[0]);\r\n    _mm_store_si128((__m128i*)(dst +  7 * i_dst + 8), in07[1]);\r\n    _mm_store_si128((__m128i*)(dst +  8 * i_dst + 0), in08[0]);\r\n    _mm_store_si128((__m128i*)(dst +  8 * i_dst + 8), in08[1]);\r\n    _mm_store_si128((__m128i*)(dst +  9 * i_dst + 0), in09[0]);\r\n    _mm_store_si128((__m128i*)(dst +  9 * i_dst + 8), in09[1]);\r\n    _mm_store_si128((__m128i*)(dst + 10 * i_dst + 0), in10[0]);\r\n    _mm_store_si128((__m128i*)(dst + 10 * i_dst + 8), in10[1]);\r\n    _mm_store_si128((__m128i*)(dst + 11 * i_dst + 0), in11[0]);\r\n    _mm_store_si128((__m128i*)(dst + 11 * i_dst + 8), in11[1]);\r\n    _mm_store_si128((__m128i*)(dst + 12 * i_dst + 0), in12[0]);\r\n    _mm_store_si128((__m128i*)(dst + 12 * i_dst + 8), in12[1]);\r\n    _mm_store_si128((__m128i*)(dst + 13 * i_dst + 0), in13[0]);\r\n    _mm_store_si128((__m128i*)(dst + 13 * i_dst + 8), in13[1]);\r\n    _mm_store_si128((__m128i*)(dst + 14 * i_dst + 0), in14[0]);\r\n    _mm_store_si128((__m128i*)(dst + 14 * i_dst + 8), in14[1]);\r\n    _mm_store_si128((__m128i*)(dst + 15 * i_dst + 0), in15[0]);\r\n    _mm_store_si128((__m128i*)(dst + 15 * i_dst + 8), in15[1]);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_16x16_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // half-size case: only the top-left 8x8 block of the 16x16 input holds non-zero coefficients\r\n    //idct_16x16_sse128(src, dst, i_dst);\r\n\r\n\r\n\r\n    const int shift1 = 5;\r\n    const int shift2 = 20 - g_bit_depth;\r\n    //const int clip_depth1 = LIMIT_BIT;\r\n    const int clip_depth2 = g_bit_depth + 1;\r\n\r\n    const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);   //row0 43high - 45low address   1   3\r\n    const __m128i c16_p35_p40 = _mm_set1_epi32(0x00230028);   // 5 
7\r\n\r\n    const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);   //row1\r\n    const __m128i c16_n21_p04 = _mm_set1_epi32(0xFFEB0004);\r\n\r\n    const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);   //row2\r\n    const __m128i c16_n43_n35 = _mm_set1_epi32(0xFFD5FFDD);\r\n\r\n    const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);   //row3\r\n    const __m128i c16_p04_n43 = _mm_set1_epi32(0x0004FFD5);\r\n\r\n    const __m128i c16_n40_p29 = _mm_set1_epi32(0xFFD8001D);   //row4\r\n    const __m128i c16_p45_n13 = _mm_set1_epi32(0x002DFFF3);\r\n\r\n    const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);   //row5\r\n    const __m128i c16_p13_p29 = _mm_set1_epi32(0x000D001D);\r\n\r\n    const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);   //row6\r\n    const __m128i c16_n40_p45 = _mm_set1_epi32(0xFFD8002D);\r\n\r\n    const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);   //row7\r\n    const __m128i c16_n29_p21 = _mm_set1_epi32(0xFFE30015);\r\n\r\n\r\n    const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);    //row0 2 6\r\n    const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);    //row1 2 6    \r\n    const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);    //row2\r\n    const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);    //row3\r\n\r\n\r\n\r\n    const __m128i c16_p17_p42 = _mm_set1_epi32(0x0011002A);    //row0 4 12\r\n    const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);    //row1 4 12\r\n\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);    //row1 0 8\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);    //row0 0 8\r\n\r\n    int part;\r\n\r\n    int nShift = shift1;\r\n    __m128i c32_rnd = _mm_set1_epi32((1 << shift1) >> 1);               // add1\r\n    __m128i Zero_8 = _mm_set1_epi16(0);\r\n\r\n    // DCT1\r\n    __m128i in00[2], in01[2], in02[2], in03[2], in04[2], in05[2], in06[2], in07[2];\r\n    __m128i in08[2], in09[2], in10[2], in11[2], in12[2], in13[2], in14[2], in15[2];\r\n    __m128i 
res00[2], res01[2], res02[2], res03[2], res04[2], res05[2], res06[2], res07[2];\r\n    __m128i res08[2], res09[2], res10[2], res11[2], res12[2], res13[2], res14[2], res15[2];\r\n\r\n\r\n    in00[0] = _mm_load_si128((const __m128i*)&src[0 * 16]);   // [07 06 05 04 03 02 01 00]\r\n    in01[0] = _mm_load_si128((const __m128i*)&src[1 * 16]);   // [17 16 15 14 13 12 11 10]\r\n    in02[0] = _mm_load_si128((const __m128i*)&src[2 * 16]);   // [27 26 25 24 23 22 21 20]\r\n    in03[0] = _mm_load_si128((const __m128i*)&src[3 * 16]);   // [37 36 35 34 33 32 31 30]\r\n    in04[0] = _mm_load_si128((const __m128i*)&src[4 * 16]);   // [47 46 45 44 43 42 41 40]\r\n    in05[0] = _mm_load_si128((const __m128i*)&src[5 * 16]);   // [57 56 55 54 53 52 51 50]\r\n    in06[0] = _mm_load_si128((const __m128i*)&src[6 * 16]);   // [67 66 65 64 63 62 61 60]\r\n    in07[0] = _mm_load_si128((const __m128i*)&src[7 * 16]);   // [77 76 75 74 73 72 71 70]\r\n\r\n\r\n    //pass=1\r\n    {\r\n        const __m128i T_00_00A = _mm_unpacklo_epi16(in01[0], in03[0]);    // [33 13 32 12 31 11 30 10]\r\n        const __m128i T_00_00B = _mm_unpackhi_epi16(in01[0], in03[0]);    // [37 17 36 16 35 15 34 14]\r\n        const __m128i T_00_01A = _mm_unpacklo_epi16(in05[0], in07[0]);    // [73 53 72 52 71 51 70 50]\r\n        const __m128i T_00_01B = _mm_unpackhi_epi16(in05[0], in07[0]);    // [77 57 76 56 75 55 74 54]\r\n\r\n        const __m128i T_00_04A = _mm_unpacklo_epi16(in02[0], in06[0]);    // [63 23 62 22 61 21 60 20]\r\n        const __m128i T_00_04B = _mm_unpackhi_epi16(in02[0], in06[0]);    // [67 27 66 26 65 25 64 24]\r\n        //4 12\r\n        const __m128i T_00_06A = _mm_unpacklo_epi16(in04[0], Zero_8);    // [ 0 43  0 42  0 41  0 40] row12 is zero, row04\r\n        const __m128i T_00_06B = _mm_unpackhi_epi16(in04[0], Zero_8);    // [ 0 47  0 46  0 45  0 44]\r\n        //0 8\r\n        const __m128i T_00_07A = _mm_unpacklo_epi16(in00[0], Zero_8);    // [ 0 03  0 02  0 01  0 00] row08 is zero, row00\r\n        const __m128i T_00_07B = _mm_unpackhi_epi16(in00[0], Zero_8);    // [ 0 07  0 06  0 05  0 04]\r\n\r\n        __m128i O0A, O1A, O2A, O3A, O4A, O5A, 
O6A, O7A;\r\n        __m128i O0B, O1B, O2B, O3B, O4B, O5B, O6B, O7B;\r\n        __m128i EO0A, EO1A, EO2A, EO3A;\r\n        __m128i EO0B, EO1B, EO2B, EO3B;\r\n        __m128i EEO0A, EEO1A;\r\n        __m128i EEO0B, EEO1B;\r\n        __m128i EEE0A, EEE1A;\r\n        __m128i EEE0B, EEE1B;\r\n        \r\n        \r\n        //1 3 5 7\r\n#define COMPUTE_ROW(row0103, row0507, c0103, c0507, row) \\\r\n    row = _mm_add_epi32(_mm_madd_epi16(row0103, c0103), _mm_madd_epi16(row0507, c0507));\r\n\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, c16_p43_p45, c16_p35_p40, O0A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, c16_p29_p43, c16_n21_p04, O1A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, c16_p04_p40, c16_n43_n35, O2A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, c16_n21_p35, c16_p04_n43, O3A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, c16_n40_p29, c16_p45_n13, O4A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, c16_n45_p21, c16_p13_p29, O5A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, c16_n35_p13, c16_n40_p45, O6A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, c16_n13_p04, c16_n29_p21, O7A)\r\n\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, c16_p43_p45, c16_p35_p40, O0B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, c16_p29_p43, c16_n21_p04, O1B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, c16_p04_p40, c16_n43_n35, O2B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, c16_n21_p35, c16_p04_n43, O3B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, c16_n40_p29, c16_p45_n13, O4B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, c16_n45_p21, c16_p13_p29, O5B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, c16_n35_p13, c16_n40_p45, O6B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, c16_n13_p04, c16_n29_p21, O7B)\r\n#undef COMPUTE_ROW\r\n\r\n        //2 6\r\n        EO0A = _mm_madd_epi16(T_00_04A, c16_p38_p44); // EO0\r\n        EO0B = _mm_madd_epi16(T_00_04B, c16_p38_p44);\r\n        EO1A = _mm_madd_epi16(T_00_04A, c16_n09_p38); // EO1\r\n        EO1B = 
_mm_madd_epi16(T_00_04B, c16_n09_p38);\r\n        EO2A = _mm_madd_epi16(T_00_04A, c16_n44_p25); // EO2\r\n        EO2B = _mm_madd_epi16(T_00_04B, c16_n44_p25);\r\n        EO3A = _mm_madd_epi16(T_00_04A, c16_n25_p09); // EO3\r\n        EO3B = _mm_madd_epi16(T_00_04B, c16_n25_p09);\r\n\r\n        //4 12\r\n        EEO0A = _mm_madd_epi16(T_00_06A, c16_p17_p42);\r\n        EEO0B = _mm_madd_epi16(T_00_06B, c16_p17_p42);\r\n        EEO1A = _mm_madd_epi16(T_00_06A, c16_n42_p17);\r\n        EEO1B = _mm_madd_epi16(T_00_06B, c16_n42_p17);\r\n        //0 8\r\n        EEE0A = _mm_madd_epi16(T_00_07A, c16_p32_p32);\r\n        EEE0B = _mm_madd_epi16(T_00_07B, c16_p32_p32);\r\n        EEE1A = _mm_madd_epi16(T_00_07A, c16_n32_p32);\r\n        EEE1B = _mm_madd_epi16(T_00_07B, c16_n32_p32);\r\n\r\n        {\r\n            const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n            const __m128i EE0B = _mm_add_epi32(EEE0B, EEO0B);\r\n            const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n            const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n            const __m128i EE3A = _mm_sub_epi32(EEE0A, EEO0A);       // EE3 = EEE0 - EEO0\r\n            const __m128i EE3B = _mm_sub_epi32(EEE0B, EEO0B);\r\n            const __m128i EE2A = _mm_sub_epi32(EEE1A, EEO1A);       // EE2 = EEE1 - EEO1\r\n            const __m128i EE2B = _mm_sub_epi32(EEE1B, EEO1B);\r\n\r\n            const __m128i T10A = _mm_add_epi32(_mm_add_epi32(EE0A, EO0A), c32_rnd);     // E0 (= EE0 + EO0) + rnd\r\n            const __m128i T10B = _mm_add_epi32(_mm_add_epi32(EE0B, EO0B), c32_rnd);\r\n            const __m128i T11A = _mm_add_epi32(_mm_add_epi32(EE1A, EO1A), c32_rnd);     // E1 (= EE1 + EO1) + rnd\r\n            const __m128i T11B = _mm_add_epi32(_mm_add_epi32(EE1B, EO1B), c32_rnd);\r\n            const __m128i T12A = _mm_add_epi32(_mm_add_epi32(EE2A, EO2A), c32_rnd);     // E2 (= EE2 + EO2) + rnd\r\n            const __m128i T12B = 
_mm_add_epi32(_mm_add_epi32(EE2B, EO2B), c32_rnd);\r\n            const __m128i T13A = _mm_add_epi32(_mm_add_epi32(EE3A, EO3A), c32_rnd);     // E3 (= EE3 + EO3) + rnd\r\n            const __m128i T13B = _mm_add_epi32(_mm_add_epi32(EE3B, EO3B), c32_rnd);\r\n            const __m128i T14A = _mm_add_epi32(_mm_sub_epi32(EE3A, EO3A), c32_rnd);     // E4 (= EE3 - EO3) + rnd\r\n            const __m128i T14B = _mm_add_epi32(_mm_sub_epi32(EE3B, EO3B), c32_rnd);\r\n            const __m128i T15A = _mm_add_epi32(_mm_sub_epi32(EE2A, EO2A), c32_rnd);     // E5 (= EE2 - EO2) + rnd\r\n            const __m128i T15B = _mm_add_epi32(_mm_sub_epi32(EE2B, EO2B), c32_rnd);\r\n            const __m128i T16A = _mm_add_epi32(_mm_sub_epi32(EE1A, EO1A), c32_rnd);     // E6 (= EE1 - EO1) + rnd\r\n            const __m128i T16B = _mm_add_epi32(_mm_sub_epi32(EE1B, EO1B), c32_rnd);\r\n            const __m128i T17A = _mm_add_epi32(_mm_sub_epi32(EE0A, EO0A), c32_rnd);     // E7 (= EE0 - EO0) + rnd\r\n            const __m128i T17B = _mm_add_epi32(_mm_sub_epi32(EE0B, EO0B), c32_rnd);\r\n\r\n            const __m128i T30A = _mm_srai_epi32(_mm_add_epi32(T10A, O0A), nShift);      // E0 + O0 + rnd [30 20 10 00]\r\n            const __m128i T30B = _mm_srai_epi32(_mm_add_epi32(T10B, O0B), nShift);      //               [70 60 50 40]\r\n            const __m128i T31A = _mm_srai_epi32(_mm_add_epi32(T11A, O1A), nShift);      // E1 + O1 + rnd [31 21 11 01]\r\n            const __m128i T31B = _mm_srai_epi32(_mm_add_epi32(T11B, O1B), nShift);      //               [71 61 51 41]\r\n            const __m128i T32A = _mm_srai_epi32(_mm_add_epi32(T12A, O2A), nShift);      // E2 + O2 + rnd [32 22 12 02]\r\n            const __m128i T32B = _mm_srai_epi32(_mm_add_epi32(T12B, O2B), nShift);      //               [72 62 52 42]\r\n            const __m128i T33A = _mm_srai_epi32(_mm_add_epi32(T13A, O3A), nShift);      // E3 + O3 + rnd [33 23 13 03]\r\n            const __m128i T33B = _mm_srai_epi32(_mm_add_epi32(T13B, 
O3B), nShift);      //               [73 63 53 43]\r\n            const __m128i T34A = _mm_srai_epi32(_mm_add_epi32(T14A, O4A), nShift);      // E4            [33 24 14 04]\r\n            const __m128i T34B = _mm_srai_epi32(_mm_add_epi32(T14B, O4B), nShift);      //               [74 64 54 44]\r\n            const __m128i T35A = _mm_srai_epi32(_mm_add_epi32(T15A, O5A), nShift);      // E5            [35 25 15 05]\r\n            const __m128i T35B = _mm_srai_epi32(_mm_add_epi32(T15B, O5B), nShift);      //               [75 65 55 45]\r\n            const __m128i T36A = _mm_srai_epi32(_mm_add_epi32(T16A, O6A), nShift);      // E6            [36 26 16 06]\r\n            const __m128i T36B = _mm_srai_epi32(_mm_add_epi32(T16B, O6B), nShift);      //               [76 66 56 46]\r\n            const __m128i T37A = _mm_srai_epi32(_mm_add_epi32(T17A, O7A), nShift);      // E7            [37 27 17 07]\r\n            const __m128i T37B = _mm_srai_epi32(_mm_add_epi32(T17B, O7B), nShift);      //               [77 67 57 47]\r\n\r\n            const __m128i T38A = _mm_srai_epi32(_mm_sub_epi32(T17A, O7A), nShift);      // E7 [30 20 10 00] x8\r\n            const __m128i T38B = _mm_srai_epi32(_mm_sub_epi32(T17B, O7B), nShift);      //    [70 60 50 40]\r\n            const __m128i T39A = _mm_srai_epi32(_mm_sub_epi32(T16A, O6A), nShift);      // E6 [31 21 11 01] x9\r\n            const __m128i T39B = _mm_srai_epi32(_mm_sub_epi32(T16B, O6B), nShift);      //    [71 61 51 41]\r\n            const __m128i T3AA = _mm_srai_epi32(_mm_sub_epi32(T15A, O5A), nShift);      // E5 [32 22 12 02] xA\r\n            const __m128i T3AB = _mm_srai_epi32(_mm_sub_epi32(T15B, O5B), nShift);      //    [72 62 52 42]\r\n            const __m128i T3BA = _mm_srai_epi32(_mm_sub_epi32(T14A, O4A), nShift);      // E4 [33 23 13 03] xB\r\n            const __m128i T3BB = _mm_srai_epi32(_mm_sub_epi32(T14B, O4B), nShift);      //    [73 63 53 43]\r\n            const __m128i T3CA = 
_mm_srai_epi32(_mm_sub_epi32(T13A, O3A), nShift);      // E3 - O3 + rnd [33 24 14 04] xC\r\n            const __m128i T3CB = _mm_srai_epi32(_mm_sub_epi32(T13B, O3B), nShift);      //               [74 64 54 44]\r\n            const __m128i T3DA = _mm_srai_epi32(_mm_sub_epi32(T12A, O2A), nShift);      // E2 - O2 + rnd [35 25 15 05] xD\r\n            const __m128i T3DB = _mm_srai_epi32(_mm_sub_epi32(T12B, O2B), nShift);      //               [75 65 55 45]\r\n            const __m128i T3EA = _mm_srai_epi32(_mm_sub_epi32(T11A, O1A), nShift);      // E1 - O1 + rnd [36 26 16 06] xE\r\n            const __m128i T3EB = _mm_srai_epi32(_mm_sub_epi32(T11B, O1B), nShift);      //               [76 66 56 46]\r\n            const __m128i T3FA = _mm_srai_epi32(_mm_sub_epi32(T10A, O0A), nShift);      // E0 - O0 + rnd [37 27 17 07] xF\r\n            const __m128i T3FB = _mm_srai_epi32(_mm_sub_epi32(T10B, O0B), nShift);      //               [77 67 57 47]\r\n\r\n            res00[0] = _mm_packs_epi32(T30A, T30B);              // [70 60 50 40 30 20 10 00]\r\n            res01[0] = _mm_packs_epi32(T31A, T31B);              // [71 61 51 41 31 21 11 01]\r\n            res02[0] = _mm_packs_epi32(T32A, T32B);              // [72 62 52 42 32 22 12 02]\r\n            res03[0] = _mm_packs_epi32(T33A, T33B);              // [73 63 53 43 33 23 13 03]\r\n            res04[0] = _mm_packs_epi32(T34A, T34B);              // [74 64 54 44 34 24 14 04]\r\n            res05[0] = _mm_packs_epi32(T35A, T35B);              // [75 65 55 45 35 25 15 05]\r\n            res06[0] = _mm_packs_epi32(T36A, T36B);              // [76 66 56 46 36 26 16 06]\r\n            res07[0] = _mm_packs_epi32(T37A, T37B);              // [77 67 57 47 37 27 17 07]\r\n\r\n            res08[0] = _mm_packs_epi32(T38A, T38B);              // [A0 ... 80]\r\n            res09[0] = _mm_packs_epi32(T39A, T39B);              // [A1 ... 81]\r\n            res10[0] = _mm_packs_epi32(T3AA, T3AB);              // [A2 ... 
82]\r\n            res11[0] = _mm_packs_epi32(T3BA, T3BB);              // [A3 ... 83]\r\n            res12[0] = _mm_packs_epi32(T3CA, T3CB);              // [A4 ... 84]\r\n            res13[0] = _mm_packs_epi32(T3DA, T3DB);              // [A5 ... 85]\r\n            res14[0] = _mm_packs_epi32(T3EA, T3EB);              // [A6 ... 86]\r\n            res15[0] = _mm_packs_epi32(T3FA, T3FB);              // [A7 ... 87]\r\n        }\r\n    \r\n\r\n    // transpose matrix 8x8 16bit\r\n    {\r\n        __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n        __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0    = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1    = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2    = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3    = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4    = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5    = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6    = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7    = _mm_unpackhi_epi64(tr1_3, tr1_7); \\\r\n\r\n                TRANSPOSE_8x8_16BIT(res00[0], 
res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n                TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1])\r\n\r\n#undef TRANSPOSE_8x8_16BIT\r\n        }\r\n\r\n        nShift = shift2;\r\n        c32_rnd = _mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n    }\r\n\r\n    //pass=2\r\n    {\r\n        for (part = 0; part < 2; part++) {\r\n            const __m128i T_00_00A = _mm_unpacklo_epi16(in01[part], in03[part]);    // [33 13 32 12 31 11 30 10]\r\n            const __m128i T_00_00B = _mm_unpackhi_epi16(in01[part], in03[part]);    // [37 17 36 16 35 15 34 14]\r\n            const __m128i T_00_01A = _mm_unpacklo_epi16(in05[part], in07[part]);    // [ ]\r\n            const __m128i T_00_01B = _mm_unpackhi_epi16(in05[part], in07[part]);    // [ ]\r\n\r\n            const __m128i T_00_04A = _mm_unpacklo_epi16(in02[part], in06[part]);    // [ ]\r\n            const __m128i T_00_04B = _mm_unpackhi_epi16(in02[part], in06[part]);    // [ ]\r\n            //4 12\r\n            const __m128i T_00_06A = _mm_unpacklo_epi16(in04[part], Zero_8);    // [ ]row\r\n            const __m128i T_00_06B = _mm_unpackhi_epi16(in04[part], Zero_8);    // [ ]\r\n            //0 8\r\n            const __m128i T_00_07A = _mm_unpacklo_epi16(in00[part], Zero_8);    // [83 03 82 02 81 01 81 00] row08 row00\r\n            const __m128i T_00_07B = _mm_unpackhi_epi16(in00[part], Zero_8);    // [87 07 86 06 85 05 84 04]\r\n\r\n            __m128i O0A, O1A, O2A, O3A, O4A, O5A, O6A, O7A;\r\n            __m128i O0B, O1B, O2B, O3B, O4B, O5B, O6B, O7B;\r\n            __m128i EO0A, EO1A, EO2A, EO3A;\r\n            __m128i EO0B, EO1B, EO2B, EO3B;\r\n            __m128i EEO0A, EEO1A;\r\n            __m128i EEO0B, EEO1B;\r\n            __m128i EEE0A, EEE1A;\r\n            __m128i 
EEE0B, EEE1B;\r\n\r\n\r\n            //1 3 5 7\r\n#define COMPUTE_ROW(row0103, row0507, c0103, c0507, row) \\\r\n    row = _mm_add_epi32(_mm_madd_epi16(row0103, c0103), _mm_madd_epi16(row0507, c0507));\r\n\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, c16_p43_p45, c16_p35_p40, O0A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, c16_p29_p43, c16_n21_p04, O1A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, c16_p04_p40, c16_n43_n35, O2A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, c16_n21_p35, c16_p04_n43, O3A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, c16_n40_p29, c16_p45_n13, O4A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, c16_n45_p21, c16_p13_p29, O5A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, c16_n35_p13, c16_n40_p45, O6A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, c16_n13_p04, c16_n29_p21, O7A)\r\n\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, c16_p43_p45, c16_p35_p40, O0B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, c16_p29_p43, c16_n21_p04, O1B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, c16_p04_p40, c16_n43_n35, O2B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, c16_n21_p35, c16_p04_n43, O3B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, c16_n40_p29, c16_p45_n13, O4B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, c16_n45_p21, c16_p13_p29, O5B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, c16_n35_p13, c16_n40_p45, O6B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, c16_n13_p04, c16_n29_p21, O7B)\r\n#undef COMPUTE_ROW\r\n\r\n                //2 6\r\n            EO0A = _mm_madd_epi16(T_00_04A, c16_p38_p44); // EO0\r\n            EO0B = _mm_madd_epi16(T_00_04B, c16_p38_p44);\r\n            EO1A = _mm_madd_epi16(T_00_04A, c16_n09_p38); // EO1\r\n            EO1B = _mm_madd_epi16(T_00_04B, c16_n09_p38);\r\n            EO2A = _mm_madd_epi16(T_00_04A, c16_n44_p25); // EO2\r\n            EO2B = _mm_madd_epi16(T_00_04B, c16_n44_p25);\r\n            EO3A = 
_mm_madd_epi16(T_00_04A, c16_n25_p09); // EO3\r\n            EO3B = _mm_madd_epi16(T_00_04B, c16_n25_p09);\r\n\r\n            //4 12\r\n            EEO0A = _mm_madd_epi16(T_00_06A, c16_p17_p42);\r\n            EEO0B = _mm_madd_epi16(T_00_06B, c16_p17_p42);\r\n            EEO1A = _mm_madd_epi16(T_00_06A, c16_n42_p17);\r\n            EEO1B = _mm_madd_epi16(T_00_06B, c16_n42_p17);\r\n            //0 8\r\n            EEE0A = _mm_madd_epi16(T_00_07A, c16_p32_p32);\r\n            EEE0B = _mm_madd_epi16(T_00_07B, c16_p32_p32);\r\n            EEE1A = _mm_madd_epi16(T_00_07A, c16_n32_p32);\r\n            EEE1B = _mm_madd_epi16(T_00_07B, c16_n32_p32);\r\n\r\n            {\r\n                const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n                const __m128i EE0B = _mm_add_epi32(EEE0B, EEO0B);\r\n                const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n                const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n                const __m128i EE3A = _mm_sub_epi32(EEE0A, EEO0A);       // EE3 = EEE0 - EEO0\r\n                const __m128i EE3B = _mm_sub_epi32(EEE0B, EEO0B);\r\n                const __m128i EE2A = _mm_sub_epi32(EEE1A, EEO1A);       // EE2 = EEE1 - EEO1\r\n                const __m128i EE2B = _mm_sub_epi32(EEE1B, EEO1B);\r\n\r\n                const __m128i T10A = _mm_add_epi32(_mm_add_epi32(EE0A, EO0A), c32_rnd);     // E0 (= EE0 + EO0) + rnd\r\n                const __m128i T10B = _mm_add_epi32(_mm_add_epi32(EE0B, EO0B), c32_rnd);\r\n                const __m128i T11A = _mm_add_epi32(_mm_add_epi32(EE1A, EO1A), c32_rnd);     // E1 (= EE1 + EO1) + rnd\r\n                const __m128i T11B = _mm_add_epi32(_mm_add_epi32(EE1B, EO1B), c32_rnd);\r\n                const __m128i T12A = _mm_add_epi32(_mm_add_epi32(EE2A, EO2A), c32_rnd);     // E2 (= EE2 + EO2) + rnd\r\n                const __m128i T12B = _mm_add_epi32(_mm_add_epi32(EE2B, EO2B), c32_rnd);\r\n                const 
__m128i T13A = _mm_add_epi32(_mm_add_epi32(EE3A, EO3A), c32_rnd);     // E3 (= EE3 + EO3) + rnd\r\n                const __m128i T13B = _mm_add_epi32(_mm_add_epi32(EE3B, EO3B), c32_rnd);\r\n                const __m128i T14A = _mm_add_epi32(_mm_sub_epi32(EE3A, EO3A), c32_rnd);     // E4 (= EE3 - EO3) + rnd\r\n                const __m128i T14B = _mm_add_epi32(_mm_sub_epi32(EE3B, EO3B), c32_rnd);\r\n                const __m128i T15A = _mm_add_epi32(_mm_sub_epi32(EE2A, EO2A), c32_rnd);     // E5 (= EE2 - EO2) + rnd\r\n                const __m128i T15B = _mm_add_epi32(_mm_sub_epi32(EE2B, EO2B), c32_rnd);\r\n                const __m128i T16A = _mm_add_epi32(_mm_sub_epi32(EE1A, EO1A), c32_rnd);     // E6 (= EE1 - EO1) + rnd\r\n                const __m128i T16B = _mm_add_epi32(_mm_sub_epi32(EE1B, EO1B), c32_rnd);\r\n                const __m128i T17A = _mm_add_epi32(_mm_sub_epi32(EE0A, EO0A), c32_rnd);     // E7 (= EE0 - EO0) + rnd\r\n                const __m128i T17B = _mm_add_epi32(_mm_sub_epi32(EE0B, EO0B), c32_rnd);\r\n\r\n                const __m128i T30A = _mm_srai_epi32(_mm_add_epi32(T10A, O0A), nShift);      // E0 + O0 + rnd [30 20 10 00]\r\n                const __m128i T30B = _mm_srai_epi32(_mm_add_epi32(T10B, O0B), nShift);      //               [70 60 50 40]\r\n                const __m128i T31A = _mm_srai_epi32(_mm_add_epi32(T11A, O1A), nShift);      // E1 + O1 + rnd [31 21 11 01]\r\n                const __m128i T31B = _mm_srai_epi32(_mm_add_epi32(T11B, O1B), nShift);      //               [71 61 51 41]\r\n                const __m128i T32A = _mm_srai_epi32(_mm_add_epi32(T12A, O2A), nShift);      // E2 + O2 + rnd [32 22 12 02]\r\n                const __m128i T32B = _mm_srai_epi32(_mm_add_epi32(T12B, O2B), nShift);      //               [72 62 52 42]\r\n                const __m128i T33A = _mm_srai_epi32(_mm_add_epi32(T13A, O3A), nShift);      // E3 + O3 + rnd [33 23 13 03]\r\n                const __m128i T33B = _mm_srai_epi32(_mm_add_epi32(T13B, 
O3B), nShift);      //               [73 63 53 43]\r\n                const __m128i T34A = _mm_srai_epi32(_mm_add_epi32(T14A, O4A), nShift);      // E4            [33 24 14 04]\r\n                const __m128i T34B = _mm_srai_epi32(_mm_add_epi32(T14B, O4B), nShift);      //               [74 64 54 44]\r\n                const __m128i T35A = _mm_srai_epi32(_mm_add_epi32(T15A, O5A), nShift);      // E5            [35 25 15 05]\r\n                const __m128i T35B = _mm_srai_epi32(_mm_add_epi32(T15B, O5B), nShift);      //               [75 65 55 45]\r\n                const __m128i T36A = _mm_srai_epi32(_mm_add_epi32(T16A, O6A), nShift);      // E6            [36 26 16 06]\r\n                const __m128i T36B = _mm_srai_epi32(_mm_add_epi32(T16B, O6B), nShift);      //               [76 66 56 46]\r\n                const __m128i T37A = _mm_srai_epi32(_mm_add_epi32(T17A, O7A), nShift);      // E7            [37 27 17 07]\r\n                const __m128i T37B = _mm_srai_epi32(_mm_add_epi32(T17B, O7B), nShift);      //               [77 67 57 47]\r\n\r\n                const __m128i T38A = _mm_srai_epi32(_mm_sub_epi32(T17A, O7A), nShift);      // E7 [30 20 10 00] x8\r\n                const __m128i T38B = _mm_srai_epi32(_mm_sub_epi32(T17B, O7B), nShift);      //    [70 60 50 40]\r\n                const __m128i T39A = _mm_srai_epi32(_mm_sub_epi32(T16A, O6A), nShift);      // E6 [31 21 11 01] x9\r\n                const __m128i T39B = _mm_srai_epi32(_mm_sub_epi32(T16B, O6B), nShift);      //    [71 61 51 41]\r\n                const __m128i T3AA = _mm_srai_epi32(_mm_sub_epi32(T15A, O5A), nShift);      // E5 [32 22 12 02] xA\r\n                const __m128i T3AB = _mm_srai_epi32(_mm_sub_epi32(T15B, O5B), nShift);      //    [72 62 52 42]\r\n                const __m128i T3BA = _mm_srai_epi32(_mm_sub_epi32(T14A, O4A), nShift);      // E4 [33 23 13 03] xB\r\n                const __m128i T3BB = _mm_srai_epi32(_mm_sub_epi32(T14B, O4B), nShift);      //    [73 63 53 43]\r\n  
              const __m128i T3CA = _mm_srai_epi32(_mm_sub_epi32(T13A, O3A), nShift);      // E3 - O3 + rnd [33 24 14 04] xC\r\n                const __m128i T3CB = _mm_srai_epi32(_mm_sub_epi32(T13B, O3B), nShift);      //               [74 64 54 44]\r\n                const __m128i T3DA = _mm_srai_epi32(_mm_sub_epi32(T12A, O2A), nShift);      // E2 - O2 + rnd [35 25 15 05] xD\r\n                const __m128i T3DB = _mm_srai_epi32(_mm_sub_epi32(T12B, O2B), nShift);      //               [75 65 55 45]\r\n                const __m128i T3EA = _mm_srai_epi32(_mm_sub_epi32(T11A, O1A), nShift);      // E1 - O1 + rnd [36 26 16 06] xE\r\n                const __m128i T3EB = _mm_srai_epi32(_mm_sub_epi32(T11B, O1B), nShift);      //               [76 66 56 46]\r\n                const __m128i T3FA = _mm_srai_epi32(_mm_sub_epi32(T10A, O0A), nShift);      // E0 - O0 + rnd [37 27 17 07] xF\r\n                const __m128i T3FB = _mm_srai_epi32(_mm_sub_epi32(T10B, O0B), nShift);      //               [77 67 57 47]\r\n\r\n                res00[part] = _mm_packs_epi32(T30A, T30B);              // [70 60 50 40 30 20 10 00]\r\n                res01[part] = _mm_packs_epi32(T31A, T31B);              // [71 61 51 41 31 21 11 01]\r\n                res02[part] = _mm_packs_epi32(T32A, T32B);              // [72 62 52 42 32 22 12 02]\r\n                res03[part] = _mm_packs_epi32(T33A, T33B);              // [73 63 53 43 33 23 13 03]\r\n                res04[part] = _mm_packs_epi32(T34A, T34B);              // [74 64 54 44 34 24 14 04]\r\n                res05[part] = _mm_packs_epi32(T35A, T35B);              // [75 65 55 45 35 25 15 05]\r\n                res06[part] = _mm_packs_epi32(T36A, T36B);              // [76 66 56 46 36 26 16 06]\r\n                res07[part] = _mm_packs_epi32(T37A, T37B);              // [77 67 57 47 37 27 17 07]\r\n\r\n                res08[part] = _mm_packs_epi32(T38A, T38B);              // [A0 ... 
80]\r\n                res09[part] = _mm_packs_epi32(T39A, T39B);              // [A1 ... 81]\r\n                res10[part] = _mm_packs_epi32(T3AA, T3AB);              // [A2 ... 82]\r\n                res11[part] = _mm_packs_epi32(T3BA, T3BB);              // [A3 ... 83]\r\n                res12[part] = _mm_packs_epi32(T3CA, T3CB);              // [A4 ... 84]\r\n                res13[part] = _mm_packs_epi32(T3DA, T3DB);              // [A5 ... 85]\r\n                res14[part] = _mm_packs_epi32(T3EA, T3EB);              // [A6 ... 86]\r\n                res15[part] = _mm_packs_epi32(T3FA, T3FB);              // [A7 ... 87]\r\n            }\r\n        }\r\n\r\n        // transpose matrix 8x8 16bit\r\n        {\r\n            __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n            __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0    = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1    = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2    = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3    = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4    = 
_mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5    = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6    = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7    = _mm_unpackhi_epi64(tr1_3, tr1_7); \\\r\n\r\n                    TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n                    TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1])\r\n                    TRANSPOSE_8x8_16BIT(res00[1], res01[1], res02[1], res03[1], res04[1], res05[1], res06[1], res07[1], in08[0], in09[0], in10[0], in11[0], in12[0], in13[0], in14[0], in15[0])\r\n                    TRANSPOSE_8x8_16BIT(res08[1], res09[1], res10[1], res11[1], res12[1], res13[1], res14[1], res15[1], in08[1], in09[1], in10[1], in11[1], in12[1], in13[1], in14[1], in15[1])\r\n\r\n#undef TRANSPOSE_8x8_16BIT\r\n            }\r\n\r\n    }\r\n\r\n\r\n\r\n\r\n\r\n    // clip\r\n    {\r\n        const __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        const __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n\r\n        in00[0] = _mm_max_epi16(_mm_min_epi16(in00[0], max_val), min_val);\r\n        in00[1] = _mm_max_epi16(_mm_min_epi16(in00[1], max_val), min_val);\r\n\r\n        in01[0] = _mm_max_epi16(_mm_min_epi16(in01[0], max_val), min_val);\r\n        in01[1] = _mm_max_epi16(_mm_min_epi16(in01[1], max_val), min_val);\r\n\r\n        in02[0] = _mm_max_epi16(_mm_min_epi16(in02[0], max_val), min_val);\r\n        in02[1] = _mm_max_epi16(_mm_min_epi16(in02[1], max_val), min_val);\r\n\r\n        in03[0] = _mm_max_epi16(_mm_min_epi16(in03[0], max_val), min_val);\r\n        in03[1] = _mm_max_epi16(_mm_min_epi16(in03[1], max_val), min_val);\r\n\r\n        in04[0] = _mm_max_epi16(_mm_min_epi16(in04[0], max_val), min_val);\r\n        in04[1] = 
_mm_max_epi16(_mm_min_epi16(in04[1], max_val), min_val);\r\n\r\n        in05[0] = _mm_max_epi16(_mm_min_epi16(in05[0], max_val), min_val);\r\n        in05[1] = _mm_max_epi16(_mm_min_epi16(in05[1], max_val), min_val);\r\n\r\n        in06[0] = _mm_max_epi16(_mm_min_epi16(in06[0], max_val), min_val);\r\n        in06[1] = _mm_max_epi16(_mm_min_epi16(in06[1], max_val), min_val);\r\n\r\n        in07[0] = _mm_max_epi16(_mm_min_epi16(in07[0], max_val), min_val);\r\n        in07[1] = _mm_max_epi16(_mm_min_epi16(in07[1], max_val), min_val);\r\n\r\n        in08[0] = _mm_max_epi16(_mm_min_epi16(in08[0], max_val), min_val);\r\n        in08[1] = _mm_max_epi16(_mm_min_epi16(in08[1], max_val), min_val);\r\n\r\n        in09[0] = _mm_max_epi16(_mm_min_epi16(in09[0], max_val), min_val);\r\n        in09[1] = _mm_max_epi16(_mm_min_epi16(in09[1], max_val), min_val);\r\n\r\n        in10[0] = _mm_max_epi16(_mm_min_epi16(in10[0], max_val), min_val);\r\n        in10[1] = _mm_max_epi16(_mm_min_epi16(in10[1], max_val), min_val);\r\n\r\n        in11[0] = _mm_max_epi16(_mm_min_epi16(in11[0], max_val), min_val);\r\n        in11[1] = _mm_max_epi16(_mm_min_epi16(in11[1], max_val), min_val);\r\n\r\n        in12[0] = _mm_max_epi16(_mm_min_epi16(in12[0], max_val), min_val);\r\n        in12[1] = _mm_max_epi16(_mm_min_epi16(in12[1], max_val), min_val);\r\n\r\n        in13[0] = _mm_max_epi16(_mm_min_epi16(in13[0], max_val), min_val);\r\n        in13[1] = _mm_max_epi16(_mm_min_epi16(in13[1], max_val), min_val);\r\n\r\n        in14[0] = _mm_max_epi16(_mm_min_epi16(in14[0], max_val), min_val);\r\n        in14[1] = _mm_max_epi16(_mm_min_epi16(in14[1], max_val), min_val);\r\n\r\n        in15[0] = _mm_max_epi16(_mm_min_epi16(in15[0], max_val), min_val);\r\n        in15[1] = _mm_max_epi16(_mm_min_epi16(in15[1], max_val), min_val);\r\n    }\r\n\r\n    // store\r\n    _mm_store_si128((__m128i*)(dst + 0 * i_dst + 0), in00[0]);\r\n    _mm_store_si128((__m128i*)(dst + 0 * i_dst + 8), in00[1]);\r\n    
_mm_store_si128((__m128i*)(dst + 1 * i_dst + 0), in01[0]);\r\n    _mm_store_si128((__m128i*)(dst + 1 * i_dst + 8), in01[1]);\r\n    _mm_store_si128((__m128i*)(dst + 2 * i_dst + 0), in02[0]);\r\n    _mm_store_si128((__m128i*)(dst + 2 * i_dst + 8), in02[1]);\r\n    _mm_store_si128((__m128i*)(dst + 3 * i_dst + 0), in03[0]);\r\n    _mm_store_si128((__m128i*)(dst + 3 * i_dst + 8), in03[1]);\r\n    _mm_store_si128((__m128i*)(dst + 4 * i_dst + 0), in04[0]);\r\n    _mm_store_si128((__m128i*)(dst + 4 * i_dst + 8), in04[1]);\r\n    _mm_store_si128((__m128i*)(dst + 5 * i_dst + 0), in05[0]);\r\n    _mm_store_si128((__m128i*)(dst + 5 * i_dst + 8), in05[1]);\r\n    _mm_store_si128((__m128i*)(dst + 6 * i_dst + 0), in06[0]);\r\n    _mm_store_si128((__m128i*)(dst + 6 * i_dst + 8), in06[1]);\r\n    _mm_store_si128((__m128i*)(dst + 7 * i_dst + 0), in07[0]);\r\n    _mm_store_si128((__m128i*)(dst + 7 * i_dst + 8), in07[1]);\r\n    _mm_store_si128((__m128i*)(dst + 8 * i_dst + 0), in08[0]);\r\n    _mm_store_si128((__m128i*)(dst + 8 * i_dst + 8), in08[1]);\r\n    _mm_store_si128((__m128i*)(dst + 9 * i_dst + 0), in09[0]);\r\n    _mm_store_si128((__m128i*)(dst + 9 * i_dst + 8), in09[1]);\r\n    _mm_store_si128((__m128i*)(dst + 10 * i_dst + 0), in10[0]);\r\n    _mm_store_si128((__m128i*)(dst + 10 * i_dst + 8), in10[1]);\r\n    _mm_store_si128((__m128i*)(dst + 11 * i_dst + 0), in11[0]);\r\n    _mm_store_si128((__m128i*)(dst + 11 * i_dst + 8), in11[1]);\r\n    _mm_store_si128((__m128i*)(dst + 12 * i_dst + 0), in12[0]);\r\n    _mm_store_si128((__m128i*)(dst + 12 * i_dst + 8), in12[1]);\r\n    _mm_store_si128((__m128i*)(dst + 13 * i_dst + 0), in13[0]);\r\n    _mm_store_si128((__m128i*)(dst + 13 * i_dst + 8), in13[1]);\r\n    _mm_store_si128((__m128i*)(dst + 14 * i_dst + 0), in14[0]);\r\n    _mm_store_si128((__m128i*)(dst + 14 * i_dst + 8), in14[1]);\r\n    _mm_store_si128((__m128i*)(dst + 15 * i_dst + 0), in15[0]);\r\n    _mm_store_si128((__m128i*)(dst + 15 * i_dst + 8), in15[1]);\r\n    
\r\n\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_16x16_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // quad case: only the top-left 4x4 coefficients are read (rows 0..3, low four lanes of each row)\r\n    //idct_16x16_half_sse128(src, dst, i_dst);\r\n\r\n    const int shift1 = 5;\r\n    const int shift2 = 20 - g_bit_depth;\r\n    //const int clip_depth1 = LIMIT_BIT;\r\n    const int clip_depth2 = g_bit_depth + 1;\r\n\r\n    const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);   //row0: 43 in the high word, 45 in the low word; applied to input rows 1 3\r\n\r\n    const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);   //row1\r\n\r\n    const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);   //row2\r\n\r\n    const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);   //row3\r\n\r\n    const __m128i c16_n40_p29 = _mm_set1_epi32(0xFFD8001D);   //row4\r\n\r\n    const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);   //row5\r\n\r\n    const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);   //row6\r\n\r\n    const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);   //row7\r\n\r\n\r\n    const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);    //row0 2 6\r\n    const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);    //row1 2 6    \r\n    const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);    //row2\r\n    const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);    //row3\r\n\r\n\r\n\r\n    // const __m128i c16_p17_p42 = _mm_set1_epi32(0x0011002A);    //row0 4 12\r\n    // const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);    //row1 4 12\r\n\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);    //row1 0 8\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);    //row0 0 8\r\n\r\n    int   part;\r\n\r\n    int nShift = shift1;\r\n    __m128i c32_rnd = _mm_set1_epi32((1 << shift1) >> 1);               // add1\r\n    __m128i Zero_8 = _mm_set1_epi16(0);\r\n    __m128i add_zero = _mm_set1_epi32(0);\r\n\r\n    // DCT1\r\n    __m128i 
in00[2], in01[2], in02[2], in03[2], in04[2], in05[2], in06[2], in07[2];\r\n    __m128i in08[2], in09[2], in10[2], in11[2], in12[2], in13[2], in14[2], in15[2];\r\n    __m128i res00[2], res01[2], res02[2], res03[2], res04[2], res05[2], res06[2], res07[2];\r\n    __m128i res08[2], res09[2], res10[2], res11[2], res12[2], res13[2], res14[2], res15[2];\r\n\r\n\r\n    in00[0] = _mm_load_si128((const __m128i*)&src[0 * 16]);   // [07 06 05 04 03 02 01 00]\r\n    in01[0] = _mm_load_si128((const __m128i*)&src[1 * 16]);   // [17 16 15 14 13 12 11 10]\r\n    in02[0] = _mm_load_si128((const __m128i*)&src[2 * 16]);   // [27 26 25 24 23 22 21 20]\r\n    in03[0] = _mm_load_si128((const __m128i*)&src[3 * 16]);   // [37 36 35 34 33 32 31 30]\r\n\r\n    //pass=1\r\n    {\r\n        const __m128i T_00_00A = _mm_unpacklo_epi16(in01[0], in03[0]);    // [33 13 32 12 31 11 30 10]\r\n     // const __m128i T_00_00B = _mm_unpackhi_epi16(in01[0], in03[0]);    // [37 17 36 16 35 15 34 14]\r\n\r\n        const __m128i T_00_04A = _mm_unpacklo_epi16(in02[0], Zero_8);    // [0 23 0 22 0 21 0 20] row02 interleaved with zero\r\n     // const __m128i T_00_04B = _mm_unpackhi_epi16(in02[0], Zero_8);    // [0 27 0 26 0 25 0 24]\r\n\r\n        //0 (row 8 is zero)\r\n        const __m128i T_00_07A = _mm_unpacklo_epi16(in00[0], Zero_8);    // [0 03 0 02 0 01 0 00] row00 interleaved with zero\r\n    //  const __m128i T_00_07B = _mm_unpackhi_epi16(in00[0], Zero_8);    // [0 07 0 06 0 05 0 04]\r\n\r\n        __m128i O0A, O1A, O2A, O3A, O4A, O5A, O6A, O7A;\r\n        __m128i EO0A, EO1A, EO2A, EO3A;\r\n        __m128i EEE0A, EEE1A;\r\n        \r\n        //1 3\r\n        O0A = _mm_madd_epi16(T_00_00A, c16_p43_p45);\r\n        O1A = _mm_madd_epi16(T_00_00A, c16_p29_p43);\r\n        O2A = _mm_madd_epi16(T_00_00A, c16_p04_p40);\r\n        O3A = _mm_madd_epi16(T_00_00A, c16_n21_p35);\r\n        O4A = _mm_madd_epi16(T_00_00A, c16_n40_p29);\r\n        O5A = _mm_madd_epi16(T_00_00A, c16_n45_p21);\r\n        O6A = _mm_madd_epi16(T_00_00A, c16_n35_p13);\r\n        O7A = _mm_madd_epi16(T_00_00A, 
c16_n13_p04);\r\n        \r\n        //2 (row 6 is zero)\r\n        EO0A = _mm_madd_epi16(T_00_04A, c16_p38_p44); // EO0\r\n        EO1A = _mm_madd_epi16(T_00_04A, c16_n09_p38); // EO1\r\n        EO2A = _mm_madd_epi16(T_00_04A, c16_n44_p25); // EO2\r\n        EO3A = _mm_madd_epi16(T_00_04A, c16_n25_p09); // EO3\r\n        //0 (row 8 is zero)\r\n        EEE0A = _mm_madd_epi16(T_00_07A, c16_p32_p32);\r\n        EEE1A = _mm_madd_epi16(T_00_07A, c16_n32_p32);\r\n\r\n        {\r\n            const __m128i EE0A = _mm_add_epi32(EEE0A, add_zero);       // EE0 = EEE0 + EEO0 (EEO is zero here)\r\n            const __m128i EE1A = _mm_add_epi32(EEE1A, add_zero);       // EE1 = EEE1 + EEO1\r\n            const __m128i EE3A = _mm_sub_epi32(EEE0A, add_zero);       // EE3 = EEE0 - EEO0\r\n            const __m128i EE2A = _mm_sub_epi32(EEE1A, add_zero);       // EE2 = EEE1 - EEO1\r\n\r\n            const __m128i T10A = _mm_add_epi32(_mm_add_epi32(EE0A, EO0A), c32_rnd);     // E0 (= EE0 + EO0) + rnd\r\n            const __m128i T11A = _mm_add_epi32(_mm_add_epi32(EE1A, EO1A), c32_rnd);     // E1 (= EE1 + EO1) + rnd\r\n            const __m128i T12A = _mm_add_epi32(_mm_add_epi32(EE2A, EO2A), c32_rnd);     // E2 (= EE2 + EO2) + rnd\r\n            const __m128i T13A = _mm_add_epi32(_mm_add_epi32(EE3A, EO3A), c32_rnd);     // E3 (= EE3 + EO3) + rnd\r\n            const __m128i T14A = _mm_add_epi32(_mm_sub_epi32(EE3A, EO3A), c32_rnd);     // E4 (= EE3 - EO3) + rnd\r\n            const __m128i T15A = _mm_add_epi32(_mm_sub_epi32(EE2A, EO2A), c32_rnd);     // E5 (= EE2 - EO2) + rnd\r\n            const __m128i T16A = _mm_add_epi32(_mm_sub_epi32(EE1A, EO1A), c32_rnd);     // E6 (= EE1 - EO1) + rnd\r\n            const __m128i T17A = _mm_add_epi32(_mm_sub_epi32(EE0A, EO0A), c32_rnd);     // E7 (= EE0 - EO0) + rnd\r\n\r\n            const __m128i T30A = _mm_srai_epi32(_mm_add_epi32(T10A, O0A), nShift);      // E0 + O0 + rnd [30 20 10 00]\r\n            const __m128i T31A = _mm_srai_epi32(_mm_add_epi32(T11A, O1A), nShift);      // E1 + O1 + 
rnd [31 21 11 01]\r\n            const __m128i T32A = _mm_srai_epi32(_mm_add_epi32(T12A, O2A), nShift);      // E2 + O2 + rnd [32 22 12 02]\r\n            const __m128i T33A = _mm_srai_epi32(_mm_add_epi32(T13A, O3A), nShift);      // E3 + O3 + rnd [33 23 13 03]\r\n            const __m128i T34A = _mm_srai_epi32(_mm_add_epi32(T14A, O4A), nShift);      // E4            [33 24 14 04]\r\n            const __m128i T35A = _mm_srai_epi32(_mm_add_epi32(T15A, O5A), nShift);      // E5            [35 25 15 05]\r\n            const __m128i T36A = _mm_srai_epi32(_mm_add_epi32(T16A, O6A), nShift);      // E6            [36 26 16 06]\r\n            const __m128i T37A = _mm_srai_epi32(_mm_add_epi32(T17A, O7A), nShift);      // E7            [37 27 17 07]\r\n\r\n            const __m128i T38A = _mm_srai_epi32(_mm_sub_epi32(T17A, O7A), nShift);      // E7 [30 20 10 00] x8\r\n            const __m128i T39A = _mm_srai_epi32(_mm_sub_epi32(T16A, O6A), nShift);      // E6 [31 21 11 01] x9\r\n            const __m128i T3AA = _mm_srai_epi32(_mm_sub_epi32(T15A, O5A), nShift);      // E5 [32 22 12 02] xA\r\n            const __m128i T3BA = _mm_srai_epi32(_mm_sub_epi32(T14A, O4A), nShift);      // E4 [33 23 13 03] xB\r\n            const __m128i T3CA = _mm_srai_epi32(_mm_sub_epi32(T13A, O3A), nShift);      // E3 - O3 + rnd [33 24 14 04] xC\r\n            const __m128i T3DA = _mm_srai_epi32(_mm_sub_epi32(T12A, O2A), nShift);      // E2 - O2 + rnd [35 25 15 05] xD\r\n            const __m128i T3EA = _mm_srai_epi32(_mm_sub_epi32(T11A, O1A), nShift);      // E1 - O1 + rnd [36 26 16 06] xE\r\n            const __m128i T3FA = _mm_srai_epi32(_mm_sub_epi32(T10A, O0A), nShift);      // E0 - O0 + rnd [37 27 17 07] xF\r\n\r\n            res00[0] = _mm_packs_epi32(T30A, add_zero);              // [70 60 50 40 30 20 10 00]\r\n            res01[0] = _mm_packs_epi32(T31A, add_zero);              // [71 61 51 41 31 21 11 01]\r\n            res02[0] = _mm_packs_epi32(T32A, add_zero);              // [72 62 
52 42 32 22 12 02]\r\n            res03[0] = _mm_packs_epi32(T33A, add_zero);              // [73 63 53 43 33 23 13 03]\r\n            res04[0] = _mm_packs_epi32(T34A, add_zero);              // [74 64 54 44 34 24 14 04]\r\n            res05[0] = _mm_packs_epi32(T35A, add_zero);              // [75 65 55 45 35 25 15 05]\r\n            res06[0] = _mm_packs_epi32(T36A, add_zero);              // [76 66 56 46 36 26 16 06]\r\n            res07[0] = _mm_packs_epi32(T37A, add_zero);              // [77 67 57 47 37 27 17 07]\r\n\r\n            res08[0] = _mm_packs_epi32(T38A, add_zero);              // [A0 ... 80]\r\n            res09[0] = _mm_packs_epi32(T39A, add_zero);              // [A1 ... 81]\r\n            res10[0] = _mm_packs_epi32(T3AA, add_zero);              // [A2 ... 82]\r\n            res11[0] = _mm_packs_epi32(T3BA, add_zero);              // [A3 ... 83]\r\n            res12[0] = _mm_packs_epi32(T3CA, add_zero);              // [A4 ... 84]\r\n            res13[0] = _mm_packs_epi32(T3DA, add_zero);              // [A5 ... 85]\r\n            res14[0] = _mm_packs_epi32(T3EA, add_zero);              // [A6 ... 86]\r\n            res15[0] = _mm_packs_epi32(T3FA, add_zero);              // [A7 ... 
87]\r\n        }\r\n\r\n\r\n        // transpose matrix 8x8 16bit\r\n    {\r\n        __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n        __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0    = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1    = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2    = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3    = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4    = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5    = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6    = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7    = _mm_unpackhi_epi64(tr1_3, tr1_7); \\\r\n\r\n            TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n            TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1])\r\n\r\n#undef TRANSPOSE_8x8_16BIT\r\n    }\r\n\r\n    nShift = shift2;\r\n    c32_rnd = 
_mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n    }\r\n\r\n    //pass=2\r\n    {\r\n        for (part = 0; part < 2; part++) {\r\n            const __m128i T_00_00A = _mm_unpacklo_epi16(in01[part], in03[part]);    // [33 13 32 12 31 11 30 10]\r\n            const __m128i T_00_00B = _mm_unpackhi_epi16(in01[part], in03[part]);    // [37 17 36 16 35 15 34 14]\r\n\r\n            const __m128i T_00_04A = _mm_unpacklo_epi16(in02[part], Zero_8);    // [0 23 0 22 0 21 0 20] row02 interleaved with zero\r\n            const __m128i T_00_04B = _mm_unpackhi_epi16(in02[part], Zero_8);    // [0 27 0 26 0 25 0 24]\r\n\r\n            //0 (row 8 is zero)\r\n            const __m128i T_00_07A = _mm_unpacklo_epi16(in00[part], Zero_8);    // [0 03 0 02 0 01 0 00] row00 interleaved with zero\r\n            const __m128i T_00_07B = _mm_unpackhi_epi16(in00[part], Zero_8);    // [0 07 0 06 0 05 0 04]\r\n\r\n            __m128i O0A, O1A, O2A, O3A, O4A, O5A, O6A, O7A;\r\n            __m128i O0B, O1B, O2B, O3B, O4B, O5B, O6B, O7B;\r\n            __m128i EO0A, EO1A, EO2A, EO3A;\r\n            __m128i EO0B, EO1B, EO2B, EO3B;\r\n            __m128i EEE0A, EEE1A;\r\n            __m128i EEE0B, EEE1B;\r\n            \r\n            //1 3 (rows 5 and 7 are zero)\r\n            O0A = _mm_madd_epi16(T_00_00A, c16_p43_p45);\r\n            O1A = _mm_madd_epi16(T_00_00A, c16_p29_p43);\r\n            O2A = _mm_madd_epi16(T_00_00A, c16_p04_p40);\r\n            O3A = _mm_madd_epi16(T_00_00A, c16_n21_p35);\r\n            O4A = _mm_madd_epi16(T_00_00A, c16_n40_p29);\r\n            O5A = _mm_madd_epi16(T_00_00A, c16_n45_p21);\r\n            O6A = _mm_madd_epi16(T_00_00A, c16_n35_p13);\r\n            O7A = _mm_madd_epi16(T_00_00A, c16_n13_p04);\r\n\r\n            O0B = _mm_madd_epi16(T_00_00B, c16_p43_p45);\r\n            O1B = _mm_madd_epi16(T_00_00B, c16_p29_p43);\r\n            O2B = _mm_madd_epi16(T_00_00B, c16_p04_p40);\r\n            O3B = _mm_madd_epi16(T_00_00B, c16_n21_p35);\r\n            O4B = _mm_madd_epi16(T_00_00B, c16_n40_p29);\r\n            O5B = _mm_madd_epi16(T_00_00B, c16_n45_p21);\r\n       
     O6B = _mm_madd_epi16(T_00_00B, c16_n35_p13);\r\n            O7B = _mm_madd_epi16(T_00_00B, c16_n13_p04);\r\n\r\n\r\n            //2 (row 6 is zero)\r\n            EO0A = _mm_madd_epi16(T_00_04A, c16_p38_p44); // EO0\r\n            EO0B = _mm_madd_epi16(T_00_04B, c16_p38_p44);\r\n            EO1A = _mm_madd_epi16(T_00_04A, c16_n09_p38); // EO1\r\n            EO1B = _mm_madd_epi16(T_00_04B, c16_n09_p38);\r\n            EO2A = _mm_madd_epi16(T_00_04A, c16_n44_p25); // EO2\r\n            EO2B = _mm_madd_epi16(T_00_04B, c16_n44_p25);\r\n            EO3A = _mm_madd_epi16(T_00_04A, c16_n25_p09); // EO3\r\n            EO3B = _mm_madd_epi16(T_00_04B, c16_n25_p09);\r\n\r\n            //0 (row 8 is zero)\r\n            EEE0A = _mm_madd_epi16(T_00_07A, c16_p32_p32);\r\n            EEE0B = _mm_madd_epi16(T_00_07B, c16_p32_p32);\r\n            EEE1A = _mm_madd_epi16(T_00_07A, c16_n32_p32);\r\n            EEE1B = _mm_madd_epi16(T_00_07B, c16_n32_p32);\r\n\r\n            {\r\n                const __m128i EE0A = _mm_add_epi32(EEE0A, add_zero);       // EE0 = EEE0 + EEO0 (EEO is zero here)\r\n                const __m128i EE0B = _mm_add_epi32(EEE0B, add_zero);\r\n                const __m128i EE1A = _mm_add_epi32(EEE1A, add_zero);       // EE1 = EEE1 + EEO1\r\n                const __m128i EE1B = _mm_add_epi32(EEE1B, add_zero);\r\n                const __m128i EE3A = _mm_sub_epi32(EEE0A, add_zero);       // EE3 = EEE0 - EEO0\r\n                const __m128i EE3B = _mm_sub_epi32(EEE0B, add_zero);\r\n                const __m128i EE2A = _mm_sub_epi32(EEE1A, add_zero);       // EE2 = EEE1 - EEO1\r\n                const __m128i EE2B = _mm_sub_epi32(EEE1B, add_zero);\r\n\r\n                const __m128i T10A = _mm_add_epi32(_mm_add_epi32(EE0A, EO0A), c32_rnd);     // E0 (= EE0 + EO0) + rnd\r\n                const __m128i T10B = _mm_add_epi32(_mm_add_epi32(EE0B, EO0B), c32_rnd);\r\n                const __m128i T11A = _mm_add_epi32(_mm_add_epi32(EE1A, EO1A), c32_rnd);     // E1 (= EE1 + EO1) + rnd\r\n                const 
__m128i T11B = _mm_add_epi32(_mm_add_epi32(EE1B, EO1B), c32_rnd);\r\n                const __m128i T12A = _mm_add_epi32(_mm_add_epi32(EE2A, EO2A), c32_rnd);     // E2 (= EE2 + EO2) + rnd\r\n                const __m128i T12B = _mm_add_epi32(_mm_add_epi32(EE2B, EO2B), c32_rnd);\r\n                const __m128i T13A = _mm_add_epi32(_mm_add_epi32(EE3A, EO3A), c32_rnd);     // E3 (= EE3 + EO3) + rnd\r\n                const __m128i T13B = _mm_add_epi32(_mm_add_epi32(EE3B, EO3B), c32_rnd);\r\n                const __m128i T14A = _mm_add_epi32(_mm_sub_epi32(EE3A, EO3A), c32_rnd);     // E4 (= EE3 - EO3) + rnd\r\n                const __m128i T14B = _mm_add_epi32(_mm_sub_epi32(EE3B, EO3B), c32_rnd);\r\n                const __m128i T15A = _mm_add_epi32(_mm_sub_epi32(EE2A, EO2A), c32_rnd);     // E5 (= EE2 - EO2) + rnd\r\n                const __m128i T15B = _mm_add_epi32(_mm_sub_epi32(EE2B, EO2B), c32_rnd);\r\n                const __m128i T16A = _mm_add_epi32(_mm_sub_epi32(EE1A, EO1A), c32_rnd);     // E6 (= EE1 - EO1) + rnd\r\n                const __m128i T16B = _mm_add_epi32(_mm_sub_epi32(EE1B, EO1B), c32_rnd);\r\n                const __m128i T17A = _mm_add_epi32(_mm_sub_epi32(EE0A, EO0A), c32_rnd);     // E7 (= EE0 - EO0) + rnd\r\n                const __m128i T17B = _mm_add_epi32(_mm_sub_epi32(EE0B, EO0B), c32_rnd);\r\n\r\n                const __m128i T30A = _mm_srai_epi32(_mm_add_epi32(T10A, O0A), nShift);      // E0 + O0 + rnd [30 20 10 00]\r\n                const __m128i T30B = _mm_srai_epi32(_mm_add_epi32(T10B, O0B), nShift);      //               [70 60 50 40]\r\n                const __m128i T31A = _mm_srai_epi32(_mm_add_epi32(T11A, O1A), nShift);      // E1 + O1 + rnd [31 21 11 01]\r\n                const __m128i T31B = _mm_srai_epi32(_mm_add_epi32(T11B, O1B), nShift);      //               [71 61 51 41]\r\n                const __m128i T32A = _mm_srai_epi32(_mm_add_epi32(T12A, O2A), nShift);      // E2 + O2 + rnd [32 22 12 02]\r\n                const 
__m128i T32B = _mm_srai_epi32(_mm_add_epi32(T12B, O2B), nShift);      //               [72 62 52 42]\r\n                const __m128i T33A = _mm_srai_epi32(_mm_add_epi32(T13A, O3A), nShift);      // E3 + O3 + rnd [33 23 13 03]\r\n                const __m128i T33B = _mm_srai_epi32(_mm_add_epi32(T13B, O3B), nShift);      //               [73 63 53 43]\r\n                const __m128i T34A = _mm_srai_epi32(_mm_add_epi32(T14A, O4A), nShift);      // E4            [33 24 14 04]\r\n                const __m128i T34B = _mm_srai_epi32(_mm_add_epi32(T14B, O4B), nShift);      //               [74 64 54 44]\r\n                const __m128i T35A = _mm_srai_epi32(_mm_add_epi32(T15A, O5A), nShift);      // E5            [35 25 15 05]\r\n                const __m128i T35B = _mm_srai_epi32(_mm_add_epi32(T15B, O5B), nShift);      //               [75 65 55 45]\r\n                const __m128i T36A = _mm_srai_epi32(_mm_add_epi32(T16A, O6A), nShift);      // E6            [36 26 16 06]\r\n                const __m128i T36B = _mm_srai_epi32(_mm_add_epi32(T16B, O6B), nShift);      //               [76 66 56 46]\r\n                const __m128i T37A = _mm_srai_epi32(_mm_add_epi32(T17A, O7A), nShift);      // E7            [37 27 17 07]\r\n                const __m128i T37B = _mm_srai_epi32(_mm_add_epi32(T17B, O7B), nShift);      //               [77 67 57 47]\r\n\r\n                const __m128i T38A = _mm_srai_epi32(_mm_sub_epi32(T17A, O7A), nShift);      // E7 [30 20 10 00] x8\r\n                const __m128i T38B = _mm_srai_epi32(_mm_sub_epi32(T17B, O7B), nShift);      //    [70 60 50 40]\r\n                const __m128i T39A = _mm_srai_epi32(_mm_sub_epi32(T16A, O6A), nShift);      // E6 [31 21 11 01] x9\r\n                const __m128i T39B = _mm_srai_epi32(_mm_sub_epi32(T16B, O6B), nShift);      //    [71 61 51 41]\r\n                const __m128i T3AA = _mm_srai_epi32(_mm_sub_epi32(T15A, O5A), nShift);      // E5 [32 22 12 02] xA\r\n                const __m128i T3AB = 
_mm_srai_epi32(_mm_sub_epi32(T15B, O5B), nShift);      //    [72 62 52 42]\r\n                const __m128i T3BA = _mm_srai_epi32(_mm_sub_epi32(T14A, O4A), nShift);      // E4 [33 23 13 03] xB\r\n                const __m128i T3BB = _mm_srai_epi32(_mm_sub_epi32(T14B, O4B), nShift);      //    [73 63 53 43]\r\n                const __m128i T3CA = _mm_srai_epi32(_mm_sub_epi32(T13A, O3A), nShift);      // E3 - O3 + rnd [33 24 14 04] xC\r\n                const __m128i T3CB = _mm_srai_epi32(_mm_sub_epi32(T13B, O3B), nShift);      //               [74 64 54 44]\r\n                const __m128i T3DA = _mm_srai_epi32(_mm_sub_epi32(T12A, O2A), nShift);      // E2 - O2 + rnd [35 25 15 05] xD\r\n                const __m128i T3DB = _mm_srai_epi32(_mm_sub_epi32(T12B, O2B), nShift);      //               [75 65 55 45]\r\n                const __m128i T3EA = _mm_srai_epi32(_mm_sub_epi32(T11A, O1A), nShift);      // E1 - O1 + rnd [36 26 16 06] xE\r\n                const __m128i T3EB = _mm_srai_epi32(_mm_sub_epi32(T11B, O1B), nShift);      //               [76 66 56 46]\r\n                const __m128i T3FA = _mm_srai_epi32(_mm_sub_epi32(T10A, O0A), nShift);      // E0 - O0 + rnd [37 27 17 07] xF\r\n                const __m128i T3FB = _mm_srai_epi32(_mm_sub_epi32(T10B, O0B), nShift);      //               [77 67 57 47]\r\n\r\n                res00[part] = _mm_packs_epi32(T30A, T30B);              // [70 60 50 40 30 20 10 00]\r\n                res01[part] = _mm_packs_epi32(T31A, T31B);              // [71 61 51 41 31 21 11 01]\r\n                res02[part] = _mm_packs_epi32(T32A, T32B);              // [72 62 52 42 32 22 12 02]\r\n                res03[part] = _mm_packs_epi32(T33A, T33B);              // [73 63 53 43 33 23 13 03]\r\n                res04[part] = _mm_packs_epi32(T34A, T34B);              // [74 64 54 44 34 24 14 04]\r\n                res05[part] = _mm_packs_epi32(T35A, T35B);              // [75 65 55 45 35 25 15 05]\r\n                res06[part] = 
_mm_packs_epi32(T36A, T36B);              // [76 66 56 46 36 26 16 06]\r\n                res07[part] = _mm_packs_epi32(T37A, T37B);              // [77 67 57 47 37 27 17 07]\r\n\r\n                res08[part] = _mm_packs_epi32(T38A, T38B);              // [A0 ... 80]\r\n                res09[part] = _mm_packs_epi32(T39A, T39B);              // [A1 ... 81]\r\n                res10[part] = _mm_packs_epi32(T3AA, T3AB);              // [A2 ... 82]\r\n                res11[part] = _mm_packs_epi32(T3BA, T3BB);              // [A3 ... 83]\r\n                res12[part] = _mm_packs_epi32(T3CA, T3CB);              // [A4 ... 84]\r\n                res13[part] = _mm_packs_epi32(T3DA, T3DB);              // [A5 ... 85]\r\n                res14[part] = _mm_packs_epi32(T3EA, T3EB);              // [A6 ... 86]\r\n                res15[part] = _mm_packs_epi32(T3FA, T3FB);              // [A7 ... 87]\r\n            }\r\n        }\r\n\r\n        // transpose matrix 8x8 16bit\r\n        {\r\n            __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n            __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = 
_mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0    = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1    = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2    = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3    = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4    = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5    = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6    = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7    = _mm_unpackhi_epi64(tr1_3, tr1_7); \\\r\n\r\n                TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n                TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1])\r\n                TRANSPOSE_8x8_16BIT(res00[1], res01[1], res02[1], res03[1], res04[1], res05[1], res06[1], res07[1], in08[0], in09[0], in10[0], in11[0], in12[0], in13[0], in14[0], in15[0])\r\n                TRANSPOSE_8x8_16BIT(res08[1], res09[1], res10[1], res11[1], res12[1], res13[1], res14[1], res15[1], in08[1], in09[1], in10[1], in11[1], in12[1], in13[1], in14[1], in15[1])\r\n\r\n#undef TRANSPOSE_8x8_16BIT\r\n        }\r\n\r\n    }\r\n\r\n\r\n    // clip\r\n    {\r\n        const __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        const __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n\r\n        in00[0] = _mm_max_epi16(_mm_min_epi16(in00[0], max_val), min_val);\r\n        in00[1] = _mm_max_epi16(_mm_min_epi16(in00[1], max_val), min_val);\r\n\r\n        in01[0] = _mm_max_epi16(_mm_min_epi16(in01[0], max_val), min_val);\r\n        in01[1] = _mm_max_epi16(_mm_min_epi16(in01[1], max_val), min_val);\r\n\r\n        in02[0] = _mm_max_epi16(_mm_min_epi16(in02[0], max_val), min_val);\r\n        in02[1] = _mm_max_epi16(_mm_min_epi16(in02[1], max_val), min_val);\r\n\r\n        in03[0] = 
_mm_max_epi16(_mm_min_epi16(in03[0], max_val), min_val);\r\n        in03[1] = _mm_max_epi16(_mm_min_epi16(in03[1], max_val), min_val);\r\n\r\n        in04[0] = _mm_max_epi16(_mm_min_epi16(in04[0], max_val), min_val);\r\n        in04[1] = _mm_max_epi16(_mm_min_epi16(in04[1], max_val), min_val);\r\n\r\n        in05[0] = _mm_max_epi16(_mm_min_epi16(in05[0], max_val), min_val);\r\n        in05[1] = _mm_max_epi16(_mm_min_epi16(in05[1], max_val), min_val);\r\n\r\n        in06[0] = _mm_max_epi16(_mm_min_epi16(in06[0], max_val), min_val);\r\n        in06[1] = _mm_max_epi16(_mm_min_epi16(in06[1], max_val), min_val);\r\n\r\n        in07[0] = _mm_max_epi16(_mm_min_epi16(in07[0], max_val), min_val);\r\n        in07[1] = _mm_max_epi16(_mm_min_epi16(in07[1], max_val), min_val);\r\n\r\n        in08[0] = _mm_max_epi16(_mm_min_epi16(in08[0], max_val), min_val);\r\n        in08[1] = _mm_max_epi16(_mm_min_epi16(in08[1], max_val), min_val);\r\n\r\n        in09[0] = _mm_max_epi16(_mm_min_epi16(in09[0], max_val), min_val);\r\n        in09[1] = _mm_max_epi16(_mm_min_epi16(in09[1], max_val), min_val);\r\n\r\n        in10[0] = _mm_max_epi16(_mm_min_epi16(in10[0], max_val), min_val);\r\n        in10[1] = _mm_max_epi16(_mm_min_epi16(in10[1], max_val), min_val);\r\n\r\n        in11[0] = _mm_max_epi16(_mm_min_epi16(in11[0], max_val), min_val);\r\n        in11[1] = _mm_max_epi16(_mm_min_epi16(in11[1], max_val), min_val);\r\n\r\n        in12[0] = _mm_max_epi16(_mm_min_epi16(in12[0], max_val), min_val);\r\n        in12[1] = _mm_max_epi16(_mm_min_epi16(in12[1], max_val), min_val);\r\n\r\n        in13[0] = _mm_max_epi16(_mm_min_epi16(in13[0], max_val), min_val);\r\n        in13[1] = _mm_max_epi16(_mm_min_epi16(in13[1], max_val), min_val);\r\n\r\n        in14[0] = _mm_max_epi16(_mm_min_epi16(in14[0], max_val), min_val);\r\n        in14[1] = _mm_max_epi16(_mm_min_epi16(in14[1], max_val), min_val);\r\n\r\n        in15[0] = _mm_max_epi16(_mm_min_epi16(in15[0], max_val), min_val);\r\n        in15[1] = 
_mm_max_epi16(_mm_min_epi16(in15[1], max_val), min_val);\r\n    }\r\n\r\n    // store\r\n    _mm_store_si128((__m128i*)(dst + 0 * i_dst + 0), in00[0]);\r\n    _mm_store_si128((__m128i*)(dst + 0 * i_dst + 8), in00[1]);\r\n    _mm_store_si128((__m128i*)(dst + 1 * i_dst + 0), in01[0]);\r\n    _mm_store_si128((__m128i*)(dst + 1 * i_dst + 8), in01[1]);\r\n    _mm_store_si128((__m128i*)(dst + 2 * i_dst + 0), in02[0]);\r\n    _mm_store_si128((__m128i*)(dst + 2 * i_dst + 8), in02[1]);\r\n    _mm_store_si128((__m128i*)(dst + 3 * i_dst + 0), in03[0]);\r\n    _mm_store_si128((__m128i*)(dst + 3 * i_dst + 8), in03[1]);\r\n    _mm_store_si128((__m128i*)(dst + 4 * i_dst + 0), in04[0]);\r\n    _mm_store_si128((__m128i*)(dst + 4 * i_dst + 8), in04[1]);\r\n    _mm_store_si128((__m128i*)(dst + 5 * i_dst + 0), in05[0]);\r\n    _mm_store_si128((__m128i*)(dst + 5 * i_dst + 8), in05[1]);\r\n    _mm_store_si128((__m128i*)(dst + 6 * i_dst + 0), in06[0]);\r\n    _mm_store_si128((__m128i*)(dst + 6 * i_dst + 8), in06[1]);\r\n    _mm_store_si128((__m128i*)(dst + 7 * i_dst + 0), in07[0]);\r\n    _mm_store_si128((__m128i*)(dst + 7 * i_dst + 8), in07[1]);\r\n    _mm_store_si128((__m128i*)(dst + 8 * i_dst + 0), in08[0]);\r\n    _mm_store_si128((__m128i*)(dst + 8 * i_dst + 8), in08[1]);\r\n    _mm_store_si128((__m128i*)(dst + 9 * i_dst + 0), in09[0]);\r\n    _mm_store_si128((__m128i*)(dst + 9 * i_dst + 8), in09[1]);\r\n    _mm_store_si128((__m128i*)(dst + 10 * i_dst + 0), in10[0]);\r\n    _mm_store_si128((__m128i*)(dst + 10 * i_dst + 8), in10[1]);\r\n    _mm_store_si128((__m128i*)(dst + 11 * i_dst + 0), in11[0]);\r\n    _mm_store_si128((__m128i*)(dst + 11 * i_dst + 8), in11[1]);\r\n    _mm_store_si128((__m128i*)(dst + 12 * i_dst + 0), in12[0]);\r\n    _mm_store_si128((__m128i*)(dst + 12 * i_dst + 8), in12[1]);\r\n    _mm_store_si128((__m128i*)(dst + 13 * i_dst + 0), in13[0]);\r\n    _mm_store_si128((__m128i*)(dst + 13 * i_dst + 8), in13[1]);\r\n    _mm_store_si128((__m128i*)(dst + 14 * i_dst + 0), 
in14[0]);\r\n    _mm_store_si128((__m128i*)(dst + 14 * i_dst + 8), in14[1]);\r\n    _mm_store_si128((__m128i*)(dst + 15 * i_dst + 0), in15[0]);\r\n    _mm_store_si128((__m128i*)(dst + 15 * i_dst + 8), in15[1]);\r\n\r\n\r\n\r\n\r\n}\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_32x32_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    int a_flag = i_dst & 0x01;\r\n    //int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth - a_flag;\r\n    //int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1 + a_flag;\r\n\r\n    const __m128i c16_p45_p45 = _mm_set1_epi32(0x002D002D);\r\n    const __m128i c16_p43_p44 = _mm_set1_epi32(0x002B002C);\r\n    const __m128i c16_p39_p41 = _mm_set1_epi32(0x00270029);\r\n    const __m128i c16_p34_p36 = _mm_set1_epi32(0x00220024);\r\n    const __m128i c16_p27_p30 = _mm_set1_epi32(0x001B001E);\r\n    const __m128i c16_p19_p23 = _mm_set1_epi32(0x00130017);\r\n    const __m128i c16_p11_p15 = _mm_set1_epi32(0x000B000F);\r\n    const __m128i c16_p02_p07 = _mm_set1_epi32(0x00020007);\r\n    const __m128i c16_p41_p45 = _mm_set1_epi32(0x0029002D);\r\n    const __m128i c16_p23_p34 = _mm_set1_epi32(0x00170022);\r\n    const __m128i c16_n02_p11 = _mm_set1_epi32(0xFFFE000B);\r\n    const __m128i c16_n27_n15 = _mm_set1_epi32(0xFFE5FFF1);\r\n    const __m128i c16_n43_n36 = _mm_set1_epi32(0xFFD5FFDC);\r\n    const __m128i c16_n44_n45 = _mm_set1_epi32(0xFFD4FFD3);\r\n    const __m128i c16_n30_n39 = _mm_set1_epi32(0xFFE2FFD9);\r\n    const __m128i c16_n07_n19 = _mm_set1_epi32(0xFFF9FFED);\r\n    const __m128i c16_p34_p44 = _mm_set1_epi32(0x0022002C);\r\n    const __m128i c16_n07_p15 = _mm_set1_epi32(0xFFF9000F);\r\n    const __m128i c16_n41_n27 = _mm_set1_epi32(0xFFD7FFE5);\r\n    const __m128i c16_n39_n45 = _mm_set1_epi32(0xFFD9FFD3);\r\n    const __m128i c16_n02_n23 = _mm_set1_epi32(0xFFFEFFE9);\r\n    const __m128i c16_p36_p19 = _mm_set1_epi32(0x00240013);\r\n    const 
__m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);\r\n    const __m128i c16_p11_p30 = _mm_set1_epi32(0x000B001E);\r\n    const __m128i c16_p23_p43 = _mm_set1_epi32(0x0017002B);\r\n    const __m128i c16_n34_n07 = _mm_set1_epi32(0xFFDEFFF9);\r\n    const __m128i c16_n36_n45 = _mm_set1_epi32(0xFFDCFFD3);\r\n    const __m128i c16_p19_n11 = _mm_set1_epi32(0x0013FFF5);\r\n    const __m128i c16_p44_p41 = _mm_set1_epi32(0x002C0029);\r\n    const __m128i c16_n02_p27 = _mm_set1_epi32(0xFFFE001B);\r\n    const __m128i c16_n45_n30 = _mm_set1_epi32(0xFFD3FFE2);\r\n    const __m128i c16_n15_n39 = _mm_set1_epi32(0xFFF1FFD9);\r\n    const __m128i c16_p11_p41 = _mm_set1_epi32(0x000B0029);\r\n    const __m128i c16_n45_n27 = _mm_set1_epi32(0xFFD3FFE5);\r\n    const __m128i c16_p07_n30 = _mm_set1_epi32(0x0007FFE2);\r\n    const __m128i c16_p43_p39 = _mm_set1_epi32(0x002B0027);\r\n    const __m128i c16_n23_p15 = _mm_set1_epi32(0xFFE9000F);\r\n    const __m128i c16_n34_n45 = _mm_set1_epi32(0xFFDEFFD3);\r\n    const __m128i c16_p36_p02 = _mm_set1_epi32(0x00240002);\r\n    const __m128i c16_p19_p44 = _mm_set1_epi32(0x0013002C);\r\n    const __m128i c16_n02_p39 = _mm_set1_epi32(0xFFFE0027);\r\n    const __m128i c16_n36_n41 = _mm_set1_epi32(0xFFDCFFD7);\r\n    const __m128i c16_p43_p07 = _mm_set1_epi32(0x002B0007);\r\n    const __m128i c16_n11_p34 = _mm_set1_epi32(0xFFF50022);\r\n    const __m128i c16_n30_n44 = _mm_set1_epi32(0xFFE2FFD4);\r\n    const __m128i c16_p45_p15 = _mm_set1_epi32(0x002D000F);\r\n    const __m128i c16_n19_p27 = _mm_set1_epi32(0xFFED001B);\r\n    const __m128i c16_n23_n45 = _mm_set1_epi32(0xFFE9FFD3);\r\n    const __m128i c16_n15_p36 = _mm_set1_epi32(0xFFF10024);\r\n    const __m128i c16_n11_n45 = _mm_set1_epi32(0xFFF5FFD3);\r\n    const __m128i c16_p34_p39 = _mm_set1_epi32(0x00220027);\r\n    const __m128i c16_n45_n19 = _mm_set1_epi32(0xFFD3FFED);\r\n    const __m128i c16_p41_n07 = _mm_set1_epi32(0x0029FFF9);\r\n    const __m128i c16_n23_p30 = 
_mm_set1_epi32(0xFFE9001E);\r\n    const __m128i c16_n02_n44 = _mm_set1_epi32(0xFFFEFFD4);\r\n    const __m128i c16_p27_p43 = _mm_set1_epi32(0x001B002B);\r\n    const __m128i c16_n27_p34 = _mm_set1_epi32(0xFFE50022);\r\n    const __m128i c16_p19_n39 = _mm_set1_epi32(0x0013FFD9);\r\n    const __m128i c16_n11_p43 = _mm_set1_epi32(0xFFF5002B);\r\n    const __m128i c16_p02_n45 = _mm_set1_epi32(0x0002FFD3);\r\n    const __m128i c16_p07_p45 = _mm_set1_epi32(0x0007002D);\r\n    const __m128i c16_n15_n44 = _mm_set1_epi32(0xFFF1FFD4);\r\n    const __m128i c16_p23_p41 = _mm_set1_epi32(0x00170029);\r\n    const __m128i c16_n30_n36 = _mm_set1_epi32(0xFFE2FFDC);\r\n    const __m128i c16_n36_p30 = _mm_set1_epi32(0xFFDC001E);\r\n    const __m128i c16_p41_n23 = _mm_set1_epi32(0x0029FFE9);\r\n    const __m128i c16_n44_p15 = _mm_set1_epi32(0xFFD4000F);\r\n    const __m128i c16_p45_n07 = _mm_set1_epi32(0x002DFFF9);\r\n    const __m128i c16_n45_n02 = _mm_set1_epi32(0xFFD3FFFE);\r\n    const __m128i c16_p43_p11 = _mm_set1_epi32(0x002B000B);\r\n    const __m128i c16_n39_n19 = _mm_set1_epi32(0xFFD9FFED);\r\n    const __m128i c16_p34_p27 = _mm_set1_epi32(0x0022001B);\r\n    const __m128i c16_n43_p27 = _mm_set1_epi32(0xFFD5001B);\r\n    const __m128i c16_p44_n02 = _mm_set1_epi32(0x002CFFFE);\r\n    const __m128i c16_n30_n23 = _mm_set1_epi32(0xFFE2FFE9);\r\n    const __m128i c16_p07_p41 = _mm_set1_epi32(0x00070029);\r\n    const __m128i c16_p19_n45 = _mm_set1_epi32(0x0013FFD3);\r\n    const __m128i c16_n39_p34 = _mm_set1_epi32(0xFFD90022);\r\n    const __m128i c16_p45_n11 = _mm_set1_epi32(0x002DFFF5);\r\n    const __m128i c16_n36_n15 = _mm_set1_epi32(0xFFDCFFF1);\r\n    const __m128i c16_n45_p23 = _mm_set1_epi32(0xFFD30017);\r\n    const __m128i c16_p27_p19 = _mm_set1_epi32(0x001B0013);\r\n    const __m128i c16_p15_n45 = _mm_set1_epi32(0x000FFFD3);\r\n    const __m128i c16_n44_p30 = _mm_set1_epi32(0xFFD4001E);\r\n    const __m128i c16_p34_p11 = _mm_set1_epi32(0x0022000B);\r\n    const 
__m128i c16_p07_n43 = _mm_set1_epi32(0x0007FFD5);\r\n    const __m128i c16_n41_p36 = _mm_set1_epi32(0xFFD70024);\r\n    const __m128i c16_p39_p02 = _mm_set1_epi32(0x00270002);\r\n    const __m128i c16_n44_p19 = _mm_set1_epi32(0xFFD40013);\r\n    const __m128i c16_n02_p36 = _mm_set1_epi32(0xFFFE0024);\r\n    const __m128i c16_p45_n34 = _mm_set1_epi32(0x002DFFDE);\r\n    const __m128i c16_n15_n23 = _mm_set1_epi32(0xFFF1FFE9);\r\n    const __m128i c16_n39_p43 = _mm_set1_epi32(0xFFD9002B);\r\n    const __m128i c16_p30_p07 = _mm_set1_epi32(0x001E0007);\r\n    const __m128i c16_p27_n45 = _mm_set1_epi32(0x001BFFD3);\r\n    const __m128i c16_n41_p11 = _mm_set1_epi32(0xFFD7000B);\r\n    const __m128i c16_n39_p15 = _mm_set1_epi32(0xFFD9000F);\r\n    const __m128i c16_n30_p45 = _mm_set1_epi32(0xFFE2002D);\r\n    const __m128i c16_p27_p02 = _mm_set1_epi32(0x001B0002);\r\n    const __m128i c16_p41_n44 = _mm_set1_epi32(0x0029FFD4);\r\n    const __m128i c16_n11_n19 = _mm_set1_epi32(0xFFF5FFED);\r\n    const __m128i c16_n45_p36 = _mm_set1_epi32(0xFFD30024);\r\n    const __m128i c16_n07_p34 = _mm_set1_epi32(0xFFF90022);\r\n    const __m128i c16_p43_n23 = _mm_set1_epi32(0x002BFFE9);\r\n    const __m128i c16_n30_p11 = _mm_set1_epi32(0xFFE2000B);\r\n    const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n    const __m128i c16_n19_p36 = _mm_set1_epi32(0xFFED0024);\r\n    const __m128i c16_p23_n02 = _mm_set1_epi32(0x0017FFFE);\r\n    const __m128i c16_p45_n39 = _mm_set1_epi32(0x002DFFD9);\r\n    const __m128i c16_p27_n41 = _mm_set1_epi32(0x001BFFD7);\r\n    const __m128i c16_n15_n07 = _mm_set1_epi32(0xFFF1FFF9);\r\n    const __m128i c16_n44_p34 = _mm_set1_epi32(0xFFD40022);\r\n    const __m128i c16_n19_p07 = _mm_set1_epi32(0xFFED0007);\r\n    const __m128i c16_n39_p30 = _mm_set1_epi32(0xFFD9001E);\r\n    const __m128i c16_n45_p44 = _mm_set1_epi32(0xFFD3002C);\r\n    const __m128i c16_n36_p43 = _mm_set1_epi32(0xFFDC002B);\r\n    const __m128i c16_n15_p27 = 
_mm_set1_epi32(0xFFF1001B);\r\n    const __m128i c16_p11_p02 = _mm_set1_epi32(0x000B0002);\r\n    const __m128i c16_p34_n23 = _mm_set1_epi32(0x0022FFE9);\r\n    const __m128i c16_p45_n41 = _mm_set1_epi32(0x002DFFD7);\r\n    const __m128i c16_n07_p02 = _mm_set1_epi32(0xFFF90002);\r\n    const __m128i c16_n15_p11 = _mm_set1_epi32(0xFFF1000B);\r\n    const __m128i c16_n23_p19 = _mm_set1_epi32(0xFFE90013);\r\n    const __m128i c16_n30_p27 = _mm_set1_epi32(0xFFE2001B);\r\n    const __m128i c16_n36_p34 = _mm_set1_epi32(0xFFDC0022);\r\n    const __m128i c16_n41_p39 = _mm_set1_epi32(0xFFD70027);\r\n    const __m128i c16_n44_p43 = _mm_set1_epi32(0xFFD4002B);\r\n    const __m128i c16_n45_p45 = _mm_set1_epi32(0xFFD3002D);\r\n\r\n    //  const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);\r\n    const __m128i c16_p35_p40 = _mm_set1_epi32(0x00230028);\r\n    const __m128i c16_p21_p29 = _mm_set1_epi32(0x0015001D);\r\n    const __m128i c16_p04_p13 = _mm_set1_epi32(0x0004000D);\r\n    const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);\r\n    const __m128i c16_n21_p04 = _mm_set1_epi32(0xFFEB0004);\r\n    const __m128i c16_n45_n40 = _mm_set1_epi32(0xFFD3FFD8);\r\n    const __m128i c16_n13_n35 = _mm_set1_epi32(0xFFF3FFDD);\r\n    const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);\r\n    const __m128i c16_n43_n35 = _mm_set1_epi32(0xFFD5FFDD);\r\n    const __m128i c16_p29_n13 = _mm_set1_epi32(0x001DFFF3);\r\n    const __m128i c16_p21_p45 = _mm_set1_epi32(0x0015002D);\r\n    const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);\r\n    const __m128i c16_p04_n43 = _mm_set1_epi32(0x0004FFD5);\r\n    const __m128i c16_p13_p45 = _mm_set1_epi32(0x000D002D);\r\n    const __m128i c16_n29_n40 = _mm_set1_epi32(0xFFE3FFD8);\r\n    const __m128i c16_n40_p29 = _mm_set1_epi32(0xFFD8001D);\r\n    const __m128i c16_p45_n13 = _mm_set1_epi32(0x002DFFF3);\r\n    const __m128i c16_n43_n04 = _mm_set1_epi32(0xFFD5FFFC);\r\n    const __m128i c16_p35_p21 = _mm_set1_epi32(0x00230015);\r\n    
const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);\r\n    const __m128i c16_p13_p29 = _mm_set1_epi32(0x000D001D);\r\n    const __m128i c16_p35_n43 = _mm_set1_epi32(0x0023FFD5);\r\n    const __m128i c16_n40_p04 = _mm_set1_epi32(0xFFD80004);\r\n    const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);\r\n    const __m128i c16_n40_p45 = _mm_set1_epi32(0xFFD8002D);\r\n    const __m128i c16_p04_p21 = _mm_set1_epi32(0x00040015);\r\n    const __m128i c16_p43_n29 = _mm_set1_epi32(0x002BFFE3);\r\n    const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);\r\n    const __m128i c16_n29_p21 = _mm_set1_epi32(0xFFE30015);\r\n    const __m128i c16_n40_p35 = _mm_set1_epi32(0xFFD80023);\r\n    //  const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n\r\n    const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);\r\n    const __m128i c16_p09_p25 = _mm_set1_epi32(0x00090019);\r\n    const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);\r\n    const __m128i c16_n25_n44 = _mm_set1_epi32(0xFFE7FFD4);\r\n\r\n    const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);\r\n    const __m128i c16_p38_p09 = _mm_set1_epi32(0x00260009);\r\n    const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);\r\n    const __m128i c16_n44_p38 = _mm_set1_epi32(0xFFD40026);\r\n\r\n    const __m128i c16_p17_p42 = _mm_set1_epi32(0x0011002A);\r\n    const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);\r\n\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);\r\n\r\n    __m128i c32_rnd = _mm_set1_epi32(16);   // add1\r\n\r\n    int nShift = 5;\r\n    int i, pass, part;\r\n\r\n    // DCT1\r\n    __m128i in00[4], in01[4], in02[4], in03[4], in04[4], in05[4], in06[4], in07[4], in08[4], in09[4], in10[4], in11[4], in12[4], in13[4], in14[4], in15[4];\r\n    __m128i in16[4], in17[4], in18[4], in19[4], in20[4], in21[4], in22[4], in23[4], in24[4], in25[4], in26[4], in27[4], in28[4], in29[4], in30[4], in31[4];\r\n    __m128i res00[4], 
res01[4], res02[4], res03[4], res04[4], res05[4], res06[4], res07[4], res08[4], res09[4], res10[4], res11[4], res12[4], res13[4], res14[4], res15[4];\r\n    __m128i res16[4], res17[4], res18[4], res19[4], res20[4], res21[4], res22[4], res23[4], res24[4], res25[4], res26[4], res27[4], res28[4], res29[4], res30[4], res31[4];\r\n\r\n    i_dst &= 0xFE;    /* remember to remove the flag bit */\r\n\r\n    for (i = 0; i < 4; i++) {\r\n        const int offset = (i << 3);\r\n\r\n        in00[i] = _mm_loadu_si128((const __m128i*)&src[ 0 * 32 + offset]);\r\n        in01[i] = _mm_loadu_si128((const __m128i*)&src[ 1 * 32 + offset]);\r\n        in02[i] = _mm_loadu_si128((const __m128i*)&src[ 2 * 32 + offset]);\r\n        in03[i] = _mm_loadu_si128((const __m128i*)&src[ 3 * 32 + offset]);\r\n        in04[i] = _mm_loadu_si128((const __m128i*)&src[ 4 * 32 + offset]);\r\n        in05[i] = _mm_loadu_si128((const __m128i*)&src[ 5 * 32 + offset]);\r\n        in06[i] = _mm_loadu_si128((const __m128i*)&src[ 6 * 32 + offset]);\r\n        in07[i] = _mm_loadu_si128((const __m128i*)&src[ 7 * 32 + offset]);\r\n        in08[i] = _mm_loadu_si128((const __m128i*)&src[ 8 * 32 + offset]);\r\n        in09[i] = _mm_loadu_si128((const __m128i*)&src[ 9 * 32 + offset]);\r\n        in10[i] = _mm_loadu_si128((const __m128i*)&src[10 * 32 + offset]);\r\n        in11[i] = _mm_loadu_si128((const __m128i*)&src[11 * 32 + offset]);\r\n        in12[i] = _mm_loadu_si128((const __m128i*)&src[12 * 32 + offset]);\r\n        in13[i] = _mm_loadu_si128((const __m128i*)&src[13 * 32 + offset]);\r\n        in14[i] = _mm_loadu_si128((const __m128i*)&src[14 * 32 + offset]);\r\n        in15[i] = _mm_loadu_si128((const __m128i*)&src[15 * 32 + offset]);\r\n        in16[i] = _mm_loadu_si128((const __m128i*)&src[16 * 32 + offset]);\r\n        in17[i] = _mm_loadu_si128((const __m128i*)&src[17 * 32 + offset]);\r\n        in18[i] = _mm_loadu_si128((const __m128i*)&src[18 * 32 + offset]);\r\n        in19[i] = _mm_loadu_si128((const 
__m128i*)&src[19 * 32 + offset]);\r\n        in20[i] = _mm_loadu_si128((const __m128i*)&src[20 * 32 + offset]);\r\n        in21[i] = _mm_loadu_si128((const __m128i*)&src[21 * 32 + offset]);\r\n        in22[i] = _mm_loadu_si128((const __m128i*)&src[22 * 32 + offset]);\r\n        in23[i] = _mm_loadu_si128((const __m128i*)&src[23 * 32 + offset]);\r\n        in24[i] = _mm_loadu_si128((const __m128i*)&src[24 * 32 + offset]);\r\n        in25[i] = _mm_loadu_si128((const __m128i*)&src[25 * 32 + offset]);\r\n        in26[i] = _mm_loadu_si128((const __m128i*)&src[26 * 32 + offset]);\r\n        in27[i] = _mm_loadu_si128((const __m128i*)&src[27 * 32 + offset]);\r\n        in28[i] = _mm_loadu_si128((const __m128i*)&src[28 * 32 + offset]);\r\n        in29[i] = _mm_loadu_si128((const __m128i*)&src[29 * 32 + offset]);\r\n        in30[i] = _mm_loadu_si128((const __m128i*)&src[30 * 32 + offset]);\r\n        in31[i] = _mm_loadu_si128((const __m128i*)&src[31 * 32 + offset]);\r\n    }\r\n\r\n    for (pass = 0; pass < 2; pass++) {\r\n        if (pass == 1) {\r\n            c32_rnd = _mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n            nShift = shift2;\r\n        }\r\n\r\n        for (part = 0; part < 4; part++) {\r\n            const __m128i T_00_00A = _mm_unpacklo_epi16(in01[part], in03[part]);    // [33 13 32 12 31 11 30 10]\r\n            const __m128i T_00_00B = _mm_unpackhi_epi16(in01[part], in03[part]);    // [37 17 36 16 35 15 34 14]\r\n            const __m128i T_00_01A = _mm_unpacklo_epi16(in05[part], in07[part]);    // [ ]\r\n            const __m128i T_00_01B = _mm_unpackhi_epi16(in05[part], in07[part]);    // [ ]\r\n            const __m128i T_00_02A = _mm_unpacklo_epi16(in09[part], in11[part]);    // [ ]\r\n            const __m128i T_00_02B = _mm_unpackhi_epi16(in09[part], in11[part]);    // [ ]\r\n            const __m128i T_00_03A = _mm_unpacklo_epi16(in13[part], in15[part]);    // [ ]\r\n            const __m128i T_00_03B = _mm_unpackhi_epi16(in13[part], 
in15[part]);    // [ ]\r\n            const __m128i T_00_04A = _mm_unpacklo_epi16(in17[part], in19[part]);    // [ ]\r\n            const __m128i T_00_04B = _mm_unpackhi_epi16(in17[part], in19[part]);    // [ ]\r\n            const __m128i T_00_05A = _mm_unpacklo_epi16(in21[part], in23[part]);    // [ ]\r\n            const __m128i T_00_05B = _mm_unpackhi_epi16(in21[part], in23[part]);    // [ ]\r\n            const __m128i T_00_06A = _mm_unpacklo_epi16(in25[part], in27[part]);    // [ ]\r\n            const __m128i T_00_06B = _mm_unpackhi_epi16(in25[part], in27[part]);    // [ ]\r\n            const __m128i T_00_07A = _mm_unpacklo_epi16(in29[part], in31[part]);    //\r\n            const __m128i T_00_07B = _mm_unpackhi_epi16(in29[part], in31[part]);    // [ ]\r\n\r\n            const __m128i T_00_08A = _mm_unpacklo_epi16(in02[part], in06[part]);    // [ ]\r\n            const __m128i T_00_08B = _mm_unpackhi_epi16(in02[part], in06[part]);    // [ ]\r\n            const __m128i T_00_09A = _mm_unpacklo_epi16(in10[part], in14[part]);    // [ ]\r\n            const __m128i T_00_09B = _mm_unpackhi_epi16(in10[part], in14[part]);    // [ ]\r\n            const __m128i T_00_10A = _mm_unpacklo_epi16(in18[part], in22[part]);    // [ ]\r\n            const __m128i T_00_10B = _mm_unpackhi_epi16(in18[part], in22[part]);    // [ ]\r\n            const __m128i T_00_11A = _mm_unpacklo_epi16(in26[part], in30[part]);    // [ ]\r\n            const __m128i T_00_11B = _mm_unpackhi_epi16(in26[part], in30[part]);    // [ ]\r\n\r\n            const __m128i T_00_12A = _mm_unpacklo_epi16(in04[part], in12[part]);    // [ ]\r\n            const __m128i T_00_12B = _mm_unpackhi_epi16(in04[part], in12[part]);    // [ ]\r\n            const __m128i T_00_13A = _mm_unpacklo_epi16(in20[part], in28[part]);    // [ ]\r\n            const __m128i T_00_13B = _mm_unpackhi_epi16(in20[part], in28[part]);    // [ ]\r\n\r\n            const __m128i T_00_14A = _mm_unpacklo_epi16(in08[part], in24[part]);    
//\r\n            const __m128i T_00_14B = _mm_unpackhi_epi16(in08[part], in24[part]);    // [ ]\r\n            const __m128i T_00_15A = _mm_unpacklo_epi16(in00[part], in16[part]);    //\r\n            const __m128i T_00_15B = _mm_unpackhi_epi16(in00[part], in16[part]);    // [ ]\r\n\r\n            __m128i O00A, O01A, O02A, O03A, O04A, O05A, O06A, O07A, O08A, O09A, O10A, O11A, O12A, O13A, O14A, O15A;\r\n            __m128i O00B, O01B, O02B, O03B, O04B, O05B, O06B, O07B, O08B, O09B, O10B, O11B, O12B, O13B, O14B, O15B;\r\n            __m128i EO0A, EO1A, EO2A, EO3A, EO4A, EO5A, EO6A, EO7A;\r\n            __m128i EO0B, EO1B, EO2B, EO3B, EO4B, EO5B, EO6B, EO7B;\r\n            {\r\n                __m128i T00, T01, T02, T03;\r\n#define COMPUTE_ROW(r0103, r0507, r0911, r1315, r1719, r2123, r2527, r2931, c0103, c0507, c0911, c1315, c1719, c2123, c2527, c2931, row) \\\r\n    T00 = _mm_add_epi32(_mm_madd_epi16(r0103, c0103), _mm_madd_epi16(r0507, c0507)); \\\r\n    T01 = _mm_add_epi32(_mm_madd_epi16(r0911, c0911), _mm_madd_epi16(r1315, c1315)); \\\r\n    T02 = _mm_add_epi32(_mm_madd_epi16(r1719, c1719), _mm_madd_epi16(r2123, c2123)); \\\r\n    T03 = _mm_add_epi32(_mm_madd_epi16(r2527, c2527), _mm_madd_epi16(r2931, c2931)); \\\r\n    row = _mm_add_epi32(_mm_add_epi32(T00, T01), _mm_add_epi32(T02, T03));\r\n\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, c16_p27_p30, c16_p19_p23, c16_p11_p15, c16_p02_p07, O00A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, c16_n43_n36, c16_n44_n45, c16_n30_n39, c16_n07_n19, O01A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_p34_p44, c16_n07_p15, 
c16_n41_n27, c16_n39_n45, c16_n02_n23, c16_p36_p19, c16_p43_p45, c16_p11_p30, O02A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, c16_p44_p41, c16_n02_p27, c16_n45_n30, c16_n15_n39, O03A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, c16_n23_p15, c16_n34_n45, c16_p36_p02, c16_p19_p44, O04A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, c16_n30_n44, c16_p45_p15, c16_n19_p27, c16_n23_n45, O05A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, c16_p41_n07, c16_n23_p30, c16_n02_n44, c16_p27_p43, O06A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, c16_p07_p45, c16_n15_n44, c16_p23_p41, c16_n30_n36, O07A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, c16_n45_n02, c16_p43_p11, c16_n39_n19, c16_p34_p27, O08A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, c16_p19_n45, c16_n39_p34, c16_p45_n11, c16_n36_n15, O09A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n              
              c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, c16_p34_p11, c16_p07_n43, c16_n41_p36, c16_p39_p02, O10A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, c16_n39_p43, c16_p30_p07, c16_p27_n45, c16_n41_p11, O11A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, c16_n11_n19, c16_n45_p36, c16_n07_p34, c16_p43_n23, O12A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, c16_p45_n39, c16_p27_n41, c16_n15_n07, c16_n44_p34, O13A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, c16_n15_p27, c16_p11_p02, c16_p34_n23, c16_p45_n41, O14A)\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                            c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, c16_n36_p34, c16_n41_p39, c16_n44_p43, c16_n45_p45, O15A)\r\n\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, c16_p27_p30, c16_p19_p23, c16_p11_p15, c16_p02_p07, O00B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, c16_n43_n36, c16_n44_n45, c16_n30_n39, c16_n07_n19, O01B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, 
T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, c16_n02_n23, c16_p36_p19, c16_p43_p45, c16_p11_p30, O02B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, c16_p44_p41, c16_n02_p27, c16_n45_n30, c16_n15_n39, O03B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, c16_n23_p15, c16_n34_n45, c16_p36_p02, c16_p19_p44, O04B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, c16_n30_n44, c16_p45_p15, c16_n19_p27, c16_n23_n45, O05B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, c16_p41_n07, c16_n23_p30, c16_n02_n44, c16_p27_p43, O06B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, c16_p07_p45, c16_n15_n44, c16_p23_p41, c16_n30_n36, O07B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, c16_n45_n02, c16_p43_p11, c16_n39_n19, c16_p34_p27, O08B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, c16_p19_n45, c16_n39_p34, c16_p45_n11, c16_n36_n15, O09B)\r\n                COMPUTE_ROW(T_00_00B, 
T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, c16_p34_p11, c16_p07_n43, c16_n41_p36, c16_p39_p02, O10B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, c16_n39_p43, c16_p30_p07, c16_p27_n45, c16_n41_p11, O11B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, c16_n11_n19, c16_n45_p36, c16_n07_p34, c16_p43_n23, O12B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, c16_p45_n39, c16_p27_n41, c16_n15_n07, c16_n44_p34, O13B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, c16_n15_p27, c16_p11_p02, c16_p34_n23, c16_p45_n41, O14B)\r\n                COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                            c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, c16_n36_p34, c16_n41_p39, c16_n44_p43, c16_n45_p45, O15B)\r\n#undef COMPUTE_ROW\r\n            }\r\n\r\n            {\r\n                __m128i T00, T01;\r\n#define COMPUTE_ROW(row0206, row1014, row1822, row2630, c0206, c1014, c1822, c2630, row) \\\r\n    T00 = _mm_add_epi32(_mm_madd_epi16(row0206, c0206), _mm_madd_epi16(row1014, c1014)); \\\r\n    T01 = _mm_add_epi32(_mm_madd_epi16(row1822, c1822), _mm_madd_epi16(row2630, c2630)); \\\r\n    row = _mm_add_epi32(T00, T01);\r\n\r\n                COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p43_p45, 
c16_p35_p40, c16_p21_p29, c16_p04_p13, EO0A)\r\n                COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, EO1A)\r\n                COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, EO2A)\r\n                COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, EO3A)\r\n                COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, EO4A)\r\n                COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, EO5A)\r\n                COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, EO6A)\r\n                COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, EO7A)\r\n\r\n                COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, EO0B)\r\n                COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, EO1B)\r\n                COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, EO2B)\r\n                COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, EO3B)\r\n                COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, EO4B)\r\n                COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, EO5B)\r\n                COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, EO6B)\r\n                COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, EO7B)\r\n#undef 
COMPUTE_ROW\r\n            }\r\n            {\r\n                const __m128i EEO0A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_p38_p44), _mm_madd_epi16(T_00_13A, c16_p09_p25));\r\n                const __m128i EEO1A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_n09_p38), _mm_madd_epi16(T_00_13A, c16_n25_n44));\r\n                const __m128i EEO2A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_n44_p25), _mm_madd_epi16(T_00_13A, c16_p38_p09));\r\n                const __m128i EEO3A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_n25_p09), _mm_madd_epi16(T_00_13A, c16_n44_p38));\r\n                const __m128i EEO0B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_p38_p44), _mm_madd_epi16(T_00_13B, c16_p09_p25));\r\n                const __m128i EEO1B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_n09_p38), _mm_madd_epi16(T_00_13B, c16_n25_n44));\r\n                const __m128i EEO2B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_n44_p25), _mm_madd_epi16(T_00_13B, c16_p38_p09));\r\n                const __m128i EEO3B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_n25_p09), _mm_madd_epi16(T_00_13B, c16_n44_p38));\r\n\r\n                const __m128i EEEO0A = _mm_madd_epi16(T_00_14A, c16_p17_p42);\r\n                const __m128i EEEO0B = _mm_madd_epi16(T_00_14B, c16_p17_p42);\r\n                const __m128i EEEO1A = _mm_madd_epi16(T_00_14A, c16_n42_p17);\r\n                const __m128i EEEO1B = _mm_madd_epi16(T_00_14B, c16_n42_p17);\r\n\r\n                const __m128i EEEE0A = _mm_madd_epi16(T_00_15A, c16_p32_p32);\r\n                const __m128i EEEE0B = _mm_madd_epi16(T_00_15B, c16_p32_p32);\r\n                const __m128i EEEE1A = _mm_madd_epi16(T_00_15A, c16_n32_p32);\r\n                const __m128i EEEE1B = _mm_madd_epi16(T_00_15B, c16_n32_p32);\r\n\r\n                const __m128i EEE0A = _mm_add_epi32(EEEE0A, EEEO0A);    // EEE0 = EEEE0 + EEEO0\r\n                const __m128i EEE0B = _mm_add_epi32(EEEE0B, EEEO0B);\r\n                const __m128i EEE1A = 
_mm_add_epi32(EEEE1A, EEEO1A);    // EEE1 = EEEE1 + EEEO1\r\n                const __m128i EEE1B = _mm_add_epi32(EEEE1B, EEEO1B);\r\n                const __m128i EEE3A = _mm_sub_epi32(EEEE0A, EEEO0A);    // EEE3 = EEEE0 - EEEO0\r\n                const __m128i EEE3B = _mm_sub_epi32(EEEE0B, EEEO0B);\r\n                const __m128i EEE2A = _mm_sub_epi32(EEEE1A, EEEO1A);    // EEE2 = EEEE1 - EEEO1\r\n                const __m128i EEE2B = _mm_sub_epi32(EEEE1B, EEEO1B);\r\n\r\n                const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n                const __m128i EE0B = _mm_add_epi32(EEE0B, EEO0B);\r\n                const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n                const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n                const __m128i EE2A = _mm_add_epi32(EEE2A, EEO2A);       // EE2 = EEE2 + EEO2\r\n                const __m128i EE2B = _mm_add_epi32(EEE2B, EEO2B);\r\n                const __m128i EE3A = _mm_add_epi32(EEE3A, EEO3A);       // EE3 = EEE3 + EEO3\r\n                const __m128i EE3B = _mm_add_epi32(EEE3B, EEO3B);\r\n                const __m128i EE7A = _mm_sub_epi32(EEE0A, EEO0A);       // EE7 = EEE0 - EEO0\r\n                const __m128i EE7B = _mm_sub_epi32(EEE0B, EEO0B);\r\n                const __m128i EE6A = _mm_sub_epi32(EEE1A, EEO1A);       // EE6 = EEE1 - EEO1\r\n                const __m128i EE6B = _mm_sub_epi32(EEE1B, EEO1B);\r\n                const __m128i EE5A = _mm_sub_epi32(EEE2A, EEO2A);       // EE5 = EEE2 - EEO2\r\n                const __m128i EE5B = _mm_sub_epi32(EEE2B, EEO2B);\r\n                const __m128i EE4A = _mm_sub_epi32(EEE3A, EEO3A);       // EE4 = EEE3 - EEO3\r\n                const __m128i EE4B = _mm_sub_epi32(EEE3B, EEO3B);\r\n\r\n                const __m128i E0A = _mm_add_epi32(EE0A, EO0A);          // E0 = EE0 + EO0\r\n                const __m128i E0B = _mm_add_epi32(EE0B, EO0B);\r\n                const __m128i E1A = 
_mm_add_epi32(EE1A, EO1A);          // E1 = EE1 + EO1\r\n                const __m128i E1B = _mm_add_epi32(EE1B, EO1B);\r\n                const __m128i E2A = _mm_add_epi32(EE2A, EO2A);          // E2 = EE2 + EO2\r\n                const __m128i E2B = _mm_add_epi32(EE2B, EO2B);\r\n                const __m128i E3A = _mm_add_epi32(EE3A, EO3A);          // E3 = EE3 + EO3\r\n                const __m128i E3B = _mm_add_epi32(EE3B, EO3B);\r\n                const __m128i E4A = _mm_add_epi32(EE4A, EO4A);          // E4 =\r\n                const __m128i E4B = _mm_add_epi32(EE4B, EO4B);\r\n                const __m128i E5A = _mm_add_epi32(EE5A, EO5A);          // E5 =\r\n                const __m128i E5B = _mm_add_epi32(EE5B, EO5B);\r\n                const __m128i E6A = _mm_add_epi32(EE6A, EO6A);          // E6 =\r\n                const __m128i E6B = _mm_add_epi32(EE6B, EO6B);\r\n                const __m128i E7A = _mm_add_epi32(EE7A, EO7A);          // E7 =\r\n                const __m128i E7B = _mm_add_epi32(EE7B, EO7B);\r\n                const __m128i EFA = _mm_sub_epi32(EE0A, EO0A);          // EF = EE0 - EO0\r\n                const __m128i EFB = _mm_sub_epi32(EE0B, EO0B);\r\n                const __m128i EEA = _mm_sub_epi32(EE1A, EO1A);          // EE = EE1 - EO1\r\n                const __m128i EEB = _mm_sub_epi32(EE1B, EO1B);\r\n                const __m128i EDA = _mm_sub_epi32(EE2A, EO2A);          // ED = EE2 - EO2\r\n                const __m128i EDB = _mm_sub_epi32(EE2B, EO2B);\r\n                const __m128i ECA = _mm_sub_epi32(EE3A, EO3A);          // EC = EE3 - EO3\r\n                const __m128i ECB = _mm_sub_epi32(EE3B, EO3B);\r\n                const __m128i EBA = _mm_sub_epi32(EE4A, EO4A);          // EB =\r\n                const __m128i EBB = _mm_sub_epi32(EE4B, EO4B);\r\n                const __m128i EAA = _mm_sub_epi32(EE5A, EO5A);          // EA =\r\n                const __m128i EAB = _mm_sub_epi32(EE5B, EO5B);\r\n                const 
__m128i E9A = _mm_sub_epi32(EE6A, EO6A);          // E9 =\r\n                const __m128i E9B = _mm_sub_epi32(EE6B, EO6B);\r\n                const __m128i E8A = _mm_sub_epi32(EE7A, EO7A);          // E8 =\r\n                const __m128i E8B = _mm_sub_epi32(EE7B, EO7B);\r\n\r\n                const __m128i T10A = _mm_add_epi32(E0A, c32_rnd);       // E0 + rnd\r\n                const __m128i T10B = _mm_add_epi32(E0B, c32_rnd);\r\n                const __m128i T11A = _mm_add_epi32(E1A, c32_rnd);       // E1 + rnd\r\n                const __m128i T11B = _mm_add_epi32(E1B, c32_rnd);\r\n                const __m128i T12A = _mm_add_epi32(E2A, c32_rnd);       // E2 + rnd\r\n                const __m128i T12B = _mm_add_epi32(E2B, c32_rnd);\r\n                const __m128i T13A = _mm_add_epi32(E3A, c32_rnd);       // E3 + rnd\r\n                const __m128i T13B = _mm_add_epi32(E3B, c32_rnd);\r\n                const __m128i T14A = _mm_add_epi32(E4A, c32_rnd);       // E4 + rnd\r\n                const __m128i T14B = _mm_add_epi32(E4B, c32_rnd);\r\n                const __m128i T15A = _mm_add_epi32(E5A, c32_rnd);       // E5 + rnd\r\n                const __m128i T15B = _mm_add_epi32(E5B, c32_rnd);\r\n                const __m128i T16A = _mm_add_epi32(E6A, c32_rnd);       // E6 + rnd\r\n                const __m128i T16B = _mm_add_epi32(E6B, c32_rnd);\r\n                const __m128i T17A = _mm_add_epi32(E7A, c32_rnd);       // E7 + rnd\r\n                const __m128i T17B = _mm_add_epi32(E7B, c32_rnd);\r\n                const __m128i T18A = _mm_add_epi32(E8A, c32_rnd);       // E8 + rnd\r\n                const __m128i T18B = _mm_add_epi32(E8B, c32_rnd);\r\n                const __m128i T19A = _mm_add_epi32(E9A, c32_rnd);       // E9 + rnd\r\n                const __m128i T19B = _mm_add_epi32(E9B, c32_rnd);\r\n                const __m128i T1AA = _mm_add_epi32(EAA, c32_rnd);       // E10 + rnd\r\n                const __m128i T1AB = _mm_add_epi32(EAB, c32_rnd);\r\n   
             const __m128i T1BA = _mm_add_epi32(EBA, c32_rnd);       // E11 + rnd\r\n                const __m128i T1BB = _mm_add_epi32(EBB, c32_rnd);\r\n                const __m128i T1CA = _mm_add_epi32(ECA, c32_rnd);       // E12 + rnd\r\n                const __m128i T1CB = _mm_add_epi32(ECB, c32_rnd);\r\n                const __m128i T1DA = _mm_add_epi32(EDA, c32_rnd);       // E13 + rnd\r\n                const __m128i T1DB = _mm_add_epi32(EDB, c32_rnd);\r\n                const __m128i T1EA = _mm_add_epi32(EEA, c32_rnd);       // E14 + rnd\r\n                const __m128i T1EB = _mm_add_epi32(EEB, c32_rnd);\r\n                const __m128i T1FA = _mm_add_epi32(EFA, c32_rnd);       // E15 + rnd\r\n                const __m128i T1FB = _mm_add_epi32(EFB, c32_rnd);\r\n\r\n                const __m128i T2_00A = _mm_add_epi32(T10A, O00A);       // E0 + O0 + rnd\r\n                const __m128i T2_00B = _mm_add_epi32(T10B, O00B);\r\n                const __m128i T2_01A = _mm_add_epi32(T11A, O01A);       // E1 + O1 + rnd\r\n                const __m128i T2_01B = _mm_add_epi32(T11B, O01B);\r\n                const __m128i T2_02A = _mm_add_epi32(T12A, O02A);       // E2 + O2 + rnd\r\n                const __m128i T2_02B = _mm_add_epi32(T12B, O02B);\r\n                const __m128i T2_03A = _mm_add_epi32(T13A, O03A);       // E3 + O3 + rnd\r\n                const __m128i T2_03B = _mm_add_epi32(T13B, O03B);\r\n                const __m128i T2_04A = _mm_add_epi32(T14A, O04A);       // E4\r\n                const __m128i T2_04B = _mm_add_epi32(T14B, O04B);\r\n                const __m128i T2_05A = _mm_add_epi32(T15A, O05A);       // E5\r\n                const __m128i T2_05B = _mm_add_epi32(T15B, O05B);\r\n                const __m128i T2_06A = _mm_add_epi32(T16A, O06A);       // E6\r\n                const __m128i T2_06B = _mm_add_epi32(T16B, O06B);\r\n                const __m128i T2_07A = _mm_add_epi32(T17A, O07A);       // E7\r\n                const __m128i T2_07B = 
_mm_add_epi32(T17B, O07B);\r\n                const __m128i T2_08A = _mm_add_epi32(T18A, O08A);       // E8\r\n                const __m128i T2_08B = _mm_add_epi32(T18B, O08B);\r\n                const __m128i T2_09A = _mm_add_epi32(T19A, O09A);       // E9\r\n                const __m128i T2_09B = _mm_add_epi32(T19B, O09B);\r\n                const __m128i T2_10A = _mm_add_epi32(T1AA, O10A);       // E10\r\n                const __m128i T2_10B = _mm_add_epi32(T1AB, O10B);\r\n                const __m128i T2_11A = _mm_add_epi32(T1BA, O11A);       // E11\r\n                const __m128i T2_11B = _mm_add_epi32(T1BB, O11B);\r\n                const __m128i T2_12A = _mm_add_epi32(T1CA, O12A);       // E12\r\n                const __m128i T2_12B = _mm_add_epi32(T1CB, O12B);\r\n                const __m128i T2_13A = _mm_add_epi32(T1DA, O13A);       // E13\r\n                const __m128i T2_13B = _mm_add_epi32(T1DB, O13B);\r\n                const __m128i T2_14A = _mm_add_epi32(T1EA, O14A);       // E14\r\n                const __m128i T2_14B = _mm_add_epi32(T1EB, O14B);\r\n                const __m128i T2_15A = _mm_add_epi32(T1FA, O15A);       // E15\r\n                const __m128i T2_15B = _mm_add_epi32(T1FB, O15B);\r\n                const __m128i T2_31A = _mm_sub_epi32(T10A, O00A);       // E0 - O0 + rnd\r\n                const __m128i T2_31B = _mm_sub_epi32(T10B, O00B);\r\n                const __m128i T2_30A = _mm_sub_epi32(T11A, O01A);       // E1 - O1 + rnd\r\n                const __m128i T2_30B = _mm_sub_epi32(T11B, O01B);\r\n                const __m128i T2_29A = _mm_sub_epi32(T12A, O02A);       // E2 - O2 + rnd\r\n                const __m128i T2_29B = _mm_sub_epi32(T12B, O02B);\r\n                const __m128i T2_28A = _mm_sub_epi32(T13A, O03A);       // E3 - O3 + rnd\r\n                const __m128i T2_28B = _mm_sub_epi32(T13B, O03B);\r\n                const __m128i T2_27A = _mm_sub_epi32(T14A, O04A);       // E4\r\n                const __m128i T2_27B = 
_mm_sub_epi32(T14B, O04B);\r\n                const __m128i T2_26A = _mm_sub_epi32(T15A, O05A);       // E5\r\n                const __m128i T2_26B = _mm_sub_epi32(T15B, O05B);\r\n                const __m128i T2_25A = _mm_sub_epi32(T16A, O06A);       // E6\r\n                const __m128i T2_25B = _mm_sub_epi32(T16B, O06B);\r\n                const __m128i T2_24A = _mm_sub_epi32(T17A, O07A);       // E7\r\n                const __m128i T2_24B = _mm_sub_epi32(T17B, O07B);\r\n                const __m128i T2_23A = _mm_sub_epi32(T18A, O08A);       //\r\n                const __m128i T2_23B = _mm_sub_epi32(T18B, O08B);\r\n                const __m128i T2_22A = _mm_sub_epi32(T19A, O09A);       //\r\n                const __m128i T2_22B = _mm_sub_epi32(T19B, O09B);\r\n                const __m128i T2_21A = _mm_sub_epi32(T1AA, O10A);       //\r\n                const __m128i T2_21B = _mm_sub_epi32(T1AB, O10B);\r\n                const __m128i T2_20A = _mm_sub_epi32(T1BA, O11A);       //\r\n                const __m128i T2_20B = _mm_sub_epi32(T1BB, O11B);\r\n                const __m128i T2_19A = _mm_sub_epi32(T1CA, O12A);       //\r\n                const __m128i T2_19B = _mm_sub_epi32(T1CB, O12B);\r\n                const __m128i T2_18A = _mm_sub_epi32(T1DA, O13A);       //\r\n                const __m128i T2_18B = _mm_sub_epi32(T1DB, O13B);\r\n                const __m128i T2_17A = _mm_sub_epi32(T1EA, O14A);       //\r\n                const __m128i T2_17B = _mm_sub_epi32(T1EB, O14B);\r\n                const __m128i T2_16A = _mm_sub_epi32(T1FA, O15A);       //\r\n                const __m128i T2_16B = _mm_sub_epi32(T1FB, O15B);\r\n\r\n                const __m128i T3_00A = _mm_srai_epi32(T2_00A, nShift);  // [30 20 10 00]\r\n                const __m128i T3_00B = _mm_srai_epi32(T2_00B, nShift);  // [70 60 50 40]\r\n                const __m128i T3_01A = _mm_srai_epi32(T2_01A, nShift);  // [31 21 11 01]\r\n                const __m128i T3_01B = _mm_srai_epi32(T2_01B, 
nShift);  // [71 61 51 41]\r\n                const __m128i T3_02A = _mm_srai_epi32(T2_02A, nShift);  // [32 22 12 02]\r\n                const __m128i T3_02B = _mm_srai_epi32(T2_02B, nShift);  // [72 62 52 42]\r\n                const __m128i T3_03A = _mm_srai_epi32(T2_03A, nShift);  // [33 23 13 03]\r\n                const __m128i T3_03B = _mm_srai_epi32(T2_03B, nShift);  // [73 63 53 43]\r\n                const __m128i T3_04A = _mm_srai_epi32(T2_04A, nShift);  // [34 24 14 04]\r\n                const __m128i T3_04B = _mm_srai_epi32(T2_04B, nShift);  // [74 64 54 44]\r\n                const __m128i T3_05A = _mm_srai_epi32(T2_05A, nShift);  // [35 25 15 05]\r\n                const __m128i T3_05B = _mm_srai_epi32(T2_05B, nShift);  // [75 65 55 45]\r\n                const __m128i T3_06A = _mm_srai_epi32(T2_06A, nShift);  // [36 26 16 06]\r\n                const __m128i T3_06B = _mm_srai_epi32(T2_06B, nShift);  // [76 66 56 46]\r\n                const __m128i T3_07A = _mm_srai_epi32(T2_07A, nShift);  // [37 27 17 07]\r\n                const __m128i T3_07B = _mm_srai_epi32(T2_07B, nShift);  // [77 67 57 47]\r\n                const __m128i T3_08A = _mm_srai_epi32(T2_08A, nShift);  // [30 20 10 00] x8\r\n                const __m128i T3_08B = _mm_srai_epi32(T2_08B, nShift);  // [70 60 50 40]\r\n                const __m128i T3_09A = _mm_srai_epi32(T2_09A, nShift);  // [31 21 11 01] x9\r\n                const __m128i T3_09B = _mm_srai_epi32(T2_09B, nShift);  // [71 61 51 41]\r\n                const __m128i T3_10A = _mm_srai_epi32(T2_10A, nShift);  // [32 22 12 02] xA\r\n                const __m128i T3_10B = _mm_srai_epi32(T2_10B, nShift);  // [72 62 52 42]\r\n                const __m128i T3_11A = _mm_srai_epi32(T2_11A, nShift);  // [33 23 13 03] xB\r\n                const __m128i T3_11B = _mm_srai_epi32(T2_11B, nShift);  // [73 63 53 43]\r\n                const __m128i T3_12A = _mm_srai_epi32(T2_12A, nShift);  // [34 24 14 04] xC\r\n                const 
__m128i T3_12B = _mm_srai_epi32(T2_12B, nShift);  // [74 64 54 44]\r\n                const __m128i T3_13A = _mm_srai_epi32(T2_13A, nShift);  // [35 25 15 05] xD\r\n                const __m128i T3_13B = _mm_srai_epi32(T2_13B, nShift);  // [75 65 55 45]\r\n                const __m128i T3_14A = _mm_srai_epi32(T2_14A, nShift);  // [36 26 16 06] xE\r\n                const __m128i T3_14B = _mm_srai_epi32(T2_14B, nShift);  // [76 66 56 46]\r\n                const __m128i T3_15A = _mm_srai_epi32(T2_15A, nShift);  // [37 27 17 07] xF\r\n                const __m128i T3_15B = _mm_srai_epi32(T2_15B, nShift);  // [77 67 57 47]\r\n\r\n                const __m128i T3_16A = _mm_srai_epi32(T2_16A, nShift);  // [30 20 10 00]\r\n                const __m128i T3_16B = _mm_srai_epi32(T2_16B, nShift);  // [70 60 50 40]\r\n                const __m128i T3_17A = _mm_srai_epi32(T2_17A, nShift);  // [31 21 11 01]\r\n                const __m128i T3_17B = _mm_srai_epi32(T2_17B, nShift);  // [71 61 51 41]\r\n                const __m128i T3_18A = _mm_srai_epi32(T2_18A, nShift);  // [32 22 12 02]\r\n                const __m128i T3_18B = _mm_srai_epi32(T2_18B, nShift);  // [72 62 52 42]\r\n                const __m128i T3_19A = _mm_srai_epi32(T2_19A, nShift);  // [33 23 13 03]\r\n                const __m128i T3_19B = _mm_srai_epi32(T2_19B, nShift);  // [73 63 53 43]\r\n                const __m128i T3_20A = _mm_srai_epi32(T2_20A, nShift);  // [33 24 14 04]\r\n                const __m128i T3_20B = _mm_srai_epi32(T2_20B, nShift);  // [74 64 54 44]\r\n                const __m128i T3_21A = _mm_srai_epi32(T2_21A, nShift);  // [35 25 15 05]\r\n                const __m128i T3_21B = _mm_srai_epi32(T2_21B, nShift);  // [75 65 55 45]\r\n                const __m128i T3_22A = _mm_srai_epi32(T2_22A, nShift);  // [36 26 16 06]\r\n                const __m128i T3_22B = _mm_srai_epi32(T2_22B, nShift);  // [76 66 56 46]\r\n                const __m128i T3_23A = _mm_srai_epi32(T2_23A, nShift);  // 
[37 27 17 07]\r\n                const __m128i T3_23B = _mm_srai_epi32(T2_23B, nShift);  // [77 67 57 47]\r\n                const __m128i T3_24A = _mm_srai_epi32(T2_24A, nShift);  // [30 20 10 00] x8\r\n                const __m128i T3_24B = _mm_srai_epi32(T2_24B, nShift);  // [70 60 50 40]\r\n                const __m128i T3_25A = _mm_srai_epi32(T2_25A, nShift);  // [31 21 11 01] x9\r\n                const __m128i T3_25B = _mm_srai_epi32(T2_25B, nShift);  // [71 61 51 41]\r\n                const __m128i T3_26A = _mm_srai_epi32(T2_26A, nShift);  // [32 22 12 02] xA\r\n                const __m128i T3_26B = _mm_srai_epi32(T2_26B, nShift);  // [72 62 52 42]\r\n                const __m128i T3_27A = _mm_srai_epi32(T2_27A, nShift);  // [33 23 13 03] xB\r\n                const __m128i T3_27B = _mm_srai_epi32(T2_27B, nShift);  // [73 63 53 43]\r\n                const __m128i T3_28A = _mm_srai_epi32(T2_28A, nShift);  // [33 24 14 04] xC\r\n                const __m128i T3_28B = _mm_srai_epi32(T2_28B, nShift);  // [74 64 54 44]\r\n                const __m128i T3_29A = _mm_srai_epi32(T2_29A, nShift);  // [35 25 15 05] xD\r\n                const __m128i T3_29B = _mm_srai_epi32(T2_29B, nShift);  // [75 65 55 45]\r\n                const __m128i T3_30A = _mm_srai_epi32(T2_30A, nShift);  // [36 26 16 06] xE\r\n                const __m128i T3_30B = _mm_srai_epi32(T2_30B, nShift);  // [76 66 56 46]\r\n                const __m128i T3_31A = _mm_srai_epi32(T2_31A, nShift);  // [37 27 17 07] xF\r\n                const __m128i T3_31B = _mm_srai_epi32(T2_31B, nShift);  // [77 67 57 47]\r\n\r\n                res00[part] = _mm_packs_epi32(T3_00A, T3_00B);          // [70 60 50 40 30 20 10 00]\r\n                res01[part] = _mm_packs_epi32(T3_01A, T3_01B);          // [71 61 51 41 31 21 11 01]\r\n                res02[part] = _mm_packs_epi32(T3_02A, T3_02B);          // [72 62 52 42 32 22 12 02]\r\n                res03[part] = _mm_packs_epi32(T3_03A, T3_03B);          // [73 
63 53 43 33 23 13 03]\r\n                res04[part] = _mm_packs_epi32(T3_04A, T3_04B);          // [74 64 54 44 34 24 14 04]\r\n                res05[part] = _mm_packs_epi32(T3_05A, T3_05B);          // [75 65 55 45 35 25 15 05]\r\n                res06[part] = _mm_packs_epi32(T3_06A, T3_06B);          // [76 66 56 46 36 26 16 06]\r\n                res07[part] = _mm_packs_epi32(T3_07A, T3_07B);          // [77 67 57 47 37 27 17 07]\r\n                res08[part] = _mm_packs_epi32(T3_08A, T3_08B);          // [A0 ... 80]\r\n                res09[part] = _mm_packs_epi32(T3_09A, T3_09B);          // [A1 ... 81]\r\n                res10[part] = _mm_packs_epi32(T3_10A, T3_10B);          // [A2 ... 82]\r\n                res11[part] = _mm_packs_epi32(T3_11A, T3_11B);          // [A3 ... 83]\r\n                res12[part] = _mm_packs_epi32(T3_12A, T3_12B);          // [A4 ... 84]\r\n                res13[part] = _mm_packs_epi32(T3_13A, T3_13B);          // [A5 ... 85]\r\n                res14[part] = _mm_packs_epi32(T3_14A, T3_14B);          // [A6 ... 86]\r\n                res15[part] = _mm_packs_epi32(T3_15A, T3_15B);          // [A7 ... 
87]\r\n                res16[part] = _mm_packs_epi32(T3_16A, T3_16B);\r\n                res17[part] = _mm_packs_epi32(T3_17A, T3_17B);\r\n                res18[part] = _mm_packs_epi32(T3_18A, T3_18B);\r\n                res19[part] = _mm_packs_epi32(T3_19A, T3_19B);\r\n                res20[part] = _mm_packs_epi32(T3_20A, T3_20B);\r\n                res21[part] = _mm_packs_epi32(T3_21A, T3_21B);\r\n                res22[part] = _mm_packs_epi32(T3_22A, T3_22B);\r\n                res23[part] = _mm_packs_epi32(T3_23A, T3_23B);\r\n                res24[part] = _mm_packs_epi32(T3_24A, T3_24B);\r\n                res25[part] = _mm_packs_epi32(T3_25A, T3_25B);\r\n                res26[part] = _mm_packs_epi32(T3_26A, T3_26B);\r\n                res27[part] = _mm_packs_epi32(T3_27A, T3_27B);\r\n                res28[part] = _mm_packs_epi32(T3_28A, T3_28B);\r\n                res29[part] = _mm_packs_epi32(T3_29A, T3_29B);\r\n                res30[part] = _mm_packs_epi32(T3_30A, T3_30B);\r\n                res31[part] = _mm_packs_epi32(T3_31A, T3_31B);\r\n            }\r\n        }\r\n\r\n        //transpose matrix 8x8 16bit.\r\n        {\r\n            __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n            __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = 
_mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0 = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1 = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2 = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3 = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4 = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5 = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6 = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7 = _mm_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n            TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n            TRANSPOSE_8x8_16BIT(res00[1], res01[1], res02[1], res03[1], res04[1], res05[1], res06[1], res07[1], in08[0], in09[0], in10[0], in11[0], in12[0], in13[0], in14[0], in15[0])\r\n            TRANSPOSE_8x8_16BIT(res00[2], res01[2], res02[2], res03[2], res04[2], res05[2], res06[2], res07[2], in16[0], in17[0], in18[0], in19[0], in20[0], in21[0], in22[0], in23[0])\r\n            TRANSPOSE_8x8_16BIT(res00[3], res01[3], res02[3], res03[3], res04[3], res05[3], res06[3], res07[3], in24[0], in25[0], in26[0], in27[0], in28[0], in29[0], in30[0], in31[0])\r\n\r\n            TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1])\r\n            TRANSPOSE_8x8_16BIT(res08[1], res09[1], res10[1], res11[1], res12[1], res13[1], res14[1], res15[1], in08[1], in09[1], in10[1], in11[1], in12[1], in13[1], in14[1], in15[1])\r\n            TRANSPOSE_8x8_16BIT(res08[2], res09[2], res10[2], res11[2], res12[2], res13[2], res14[2], res15[2], in16[1], in17[1], in18[1], in19[1], in20[1], in21[1], in22[1], in23[1])\r\n            TRANSPOSE_8x8_16BIT(res08[3], res09[3], res10[3], res11[3], res12[3], res13[3], 
res14[3], res15[3], in24[1], in25[1], in26[1], in27[1], in28[1], in29[1], in30[1], in31[1])\r\n\r\n            TRANSPOSE_8x8_16BIT(res16[0], res17[0], res18[0], res19[0], res20[0], res21[0], res22[0], res23[0], in00[2], in01[2], in02[2], in03[2], in04[2], in05[2], in06[2], in07[2])\r\n            TRANSPOSE_8x8_16BIT(res16[1], res17[1], res18[1], res19[1], res20[1], res21[1], res22[1], res23[1], in08[2], in09[2], in10[2], in11[2], in12[2], in13[2], in14[2], in15[2])\r\n            TRANSPOSE_8x8_16BIT(res16[2], res17[2], res18[2], res19[2], res20[2], res21[2], res22[2], res23[2], in16[2], in17[2], in18[2], in19[2], in20[2], in21[2], in22[2], in23[2])\r\n            TRANSPOSE_8x8_16BIT(res16[3], res17[3], res18[3], res19[3], res20[3], res21[3], res22[3], res23[3], in24[2], in25[2], in26[2], in27[2], in28[2], in29[2], in30[2], in31[2])\r\n\r\n            TRANSPOSE_8x8_16BIT(res24[0], res25[0], res26[0], res27[0], res28[0], res29[0], res30[0], res31[0], in00[3], in01[3], in02[3], in03[3], in04[3], in05[3], in06[3], in07[3])\r\n            TRANSPOSE_8x8_16BIT(res24[1], res25[1], res26[1], res27[1], res28[1], res29[1], res30[1], res31[1], in08[3], in09[3], in10[3], in11[3], in12[3], in13[3], in14[3], in15[3])\r\n            TRANSPOSE_8x8_16BIT(res24[2], res25[2], res26[2], res27[2], res28[2], res29[2], res30[2], res31[2], in16[3], in17[3], in18[3], in19[3], in20[3], in21[3], in22[3], in23[3])\r\n            TRANSPOSE_8x8_16BIT(res24[3], res25[3], res26[3], res27[3], res28[3], res29[3], res30[3], res31[3], in24[3], in25[3], in26[3], in27[3], in28[3], in29[3], in30[3], in31[3])\r\n#undef TRANSPOSE_8x8_16BIT\r\n        }\r\n    }\r\n\r\n    //clip\r\n    {\r\n        __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n        int k;\r\n\r\n        for (k = 0; k < 4; k++) {\r\n            in00[k] = _mm_max_epi16(_mm_min_epi16(in00[k], max_val), min_val);\r\n            in01[k] = 
_mm_max_epi16(_mm_min_epi16(in01[k], max_val), min_val);\r\n            in02[k] = _mm_max_epi16(_mm_min_epi16(in02[k], max_val), min_val);\r\n            in03[k] = _mm_max_epi16(_mm_min_epi16(in03[k], max_val), min_val);\r\n            in04[k] = _mm_max_epi16(_mm_min_epi16(in04[k], max_val), min_val);\r\n            in05[k] = _mm_max_epi16(_mm_min_epi16(in05[k], max_val), min_val);\r\n            in06[k] = _mm_max_epi16(_mm_min_epi16(in06[k], max_val), min_val);\r\n            in07[k] = _mm_max_epi16(_mm_min_epi16(in07[k], max_val), min_val);\r\n            in08[k] = _mm_max_epi16(_mm_min_epi16(in08[k], max_val), min_val);\r\n            in09[k] = _mm_max_epi16(_mm_min_epi16(in09[k], max_val), min_val);            \r\n            in10[k] = _mm_max_epi16(_mm_min_epi16(in10[k], max_val), min_val);\r\n            in11[k] = _mm_max_epi16(_mm_min_epi16(in11[k], max_val), min_val);\r\n            in12[k] = _mm_max_epi16(_mm_min_epi16(in12[k], max_val), min_val);\r\n            in13[k] = _mm_max_epi16(_mm_min_epi16(in13[k], max_val), min_val);\r\n            in14[k] = _mm_max_epi16(_mm_min_epi16(in14[k], max_val), min_val);\r\n            in15[k] = _mm_max_epi16(_mm_min_epi16(in15[k], max_val), min_val);\r\n            in16[k] = _mm_max_epi16(_mm_min_epi16(in16[k], max_val), min_val);\r\n            in17[k] = _mm_max_epi16(_mm_min_epi16(in17[k], max_val), min_val);\r\n            in18[k] = _mm_max_epi16(_mm_min_epi16(in18[k], max_val), min_val);            \r\n            in19[k] = _mm_max_epi16(_mm_min_epi16(in19[k], max_val), min_val);\r\n            in20[k] = _mm_max_epi16(_mm_min_epi16(in20[k], max_val), min_val);\r\n            in21[k] = _mm_max_epi16(_mm_min_epi16(in21[k], max_val), min_val);\r\n            in22[k] = _mm_max_epi16(_mm_min_epi16(in22[k], max_val), min_val);\r\n            in23[k] = _mm_max_epi16(_mm_min_epi16(in23[k], max_val), min_val);\r\n            in24[k] = _mm_max_epi16(_mm_min_epi16(in24[k], max_val), min_val);\r\n            in25[k] = 
_mm_max_epi16(_mm_min_epi16(in25[k], max_val), min_val);\r\n            in26[k] = _mm_max_epi16(_mm_min_epi16(in26[k], max_val), min_val);\r\n            in27[k] = _mm_max_epi16(_mm_min_epi16(in27[k], max_val), min_val);\r\n            in28[k] = _mm_max_epi16(_mm_min_epi16(in28[k], max_val), min_val);\r\n            in29[k] = _mm_max_epi16(_mm_min_epi16(in29[k], max_val), min_val);\r\n            in30[k] = _mm_max_epi16(_mm_min_epi16(in30[k], max_val), min_val);\r\n            in31[k] = _mm_max_epi16(_mm_min_epi16(in31[k], max_val), min_val);\r\n        }\r\n    }\r\n\r\n    // Add\r\n    for (i = 0; i < 2; i++) {\r\n#define STORE_LINE(L0, L1, L2, L3, L4, L5, L6, L7, H0, H1, H2, H3, H4, H5, H6, H7, offsetV, offsetH) \\\r\n    _mm_storeu_si128((__m128i*)(dst + (0 + (offsetV)) * i_dst + (offsetH)+0), L0); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (0 + (offsetV)) * i_dst + (offsetH)+8), H0); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (1 + (offsetV)) * i_dst + (offsetH)+0), L1); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (1 + (offsetV)) * i_dst + (offsetH)+8), H1); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (2 + (offsetV)) * i_dst + (offsetH)+0), L2); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (2 + (offsetV)) * i_dst + (offsetH)+8), H2); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (3 + (offsetV)) * i_dst + (offsetH)+0), L3); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (3 + (offsetV)) * i_dst + (offsetH)+8), H3); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (4 + (offsetV)) * i_dst + (offsetH)+0), L4); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (4 + (offsetV)) * i_dst + (offsetH)+8), H4); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (5 + (offsetV)) * i_dst + (offsetH)+0), L5); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (5 + (offsetV)) * i_dst + (offsetH)+8), H5); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (6 + (offsetV)) * i_dst + (offsetH)+0), L6); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (6 + (offsetV)) * i_dst + (offsetH)+8), H6); \\\r\n    
_mm_storeu_si128((__m128i*)(dst + (7 + (offsetV)) * i_dst + (offsetH)+0), L7); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (7 + (offsetV)) * i_dst + (offsetH)+8), H7);\r\n\r\n        const int k = i * 2;\r\n        STORE_LINE(in00[k], in01[k], in02[k], in03[k], in04[k], in05[k], in06[k], in07[k], in00[k + 1], in01[k + 1], in02[k + 1], in03[k + 1], in04[k + 1], in05[k + 1], in06[k + 1], in07[k + 1], 0,  i * 16)\r\n        STORE_LINE(in08[k], in09[k], in10[k], in11[k], in12[k], in13[k], in14[k], in15[k], in08[k + 1], in09[k + 1], in10[k + 1], in11[k + 1], in12[k + 1], in13[k + 1], in14[k + 1], in15[k + 1], 8,  i * 16)\r\n        STORE_LINE(in16[k], in17[k], in18[k], in19[k], in20[k], in21[k], in22[k], in23[k], in16[k + 1], in17[k + 1], in18[k + 1], in19[k + 1], in20[k + 1], in21[k + 1], in22[k + 1], in23[k + 1], 16, i * 16)\r\n        STORE_LINE(in24[k], in25[k], in26[k], in27[k], in28[k], in29[k], in30[k], in31[k], in24[k + 1], in25[k + 1], in26[k + 1], in27[k + 1], in28[k + 1], in29[k + 1], in30[k + 1], in31[k + 1], 24, i * 16)\r\n#undef STORE_LINE\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_32x32_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/2: only the top-left 16x16 block contains non-zero coefficients\r\n    int a_flag = i_dst & 0x01;\r\n    int shift2 = 20 - g_bit_depth - a_flag;\r\n    int clip_depth2 = g_bit_depth + 1 + a_flag;\r\n\r\n    const __m128i c16_p45_p45 = _mm_set1_epi32(0x002D002D);\r\n    const __m128i c16_p43_p44 = _mm_set1_epi32(0x002B002C);\r\n    const __m128i c16_p39_p41 = _mm_set1_epi32(0x00270029);\r\n    const __m128i c16_p34_p36 = _mm_set1_epi32(0x00220024);\r\n    const __m128i c16_p41_p45 = _mm_set1_epi32(0x0029002D);\r\n    const __m128i c16_p23_p34 = _mm_set1_epi32(0x00170022);\r\n    const __m128i c16_n02_p11 = _mm_set1_epi32(0xFFFE000B);\r\n    const __m128i c16_n27_n15 = _mm_set1_epi32(0xFFE5FFF1);\r\n    const __m128i c16_p34_p44 = 
_mm_set1_epi32(0x0022002C);\r\n    const __m128i c16_n07_p15 = _mm_set1_epi32(0xFFF9000F);\r\n    const __m128i c16_n41_n27 = _mm_set1_epi32(0xFFD7FFE5);\r\n    const __m128i c16_n39_n45 = _mm_set1_epi32(0xFFD9FFD3);\r\n    const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);\r\n    const __m128i c16_p23_p43 = _mm_set1_epi32(0x0017002B);\r\n    const __m128i c16_n34_n07 = _mm_set1_epi32(0xFFDEFFF9);\r\n    const __m128i c16_n36_n45 = _mm_set1_epi32(0xFFDCFFD3);\r\n    const __m128i c16_p19_n11 = _mm_set1_epi32(0x0013FFF5);\r\n    const __m128i c16_p11_p41 = _mm_set1_epi32(0x000B0029);\r\n    const __m128i c16_n45_n27 = _mm_set1_epi32(0xFFD3FFE5);\r\n    const __m128i c16_p07_n30 = _mm_set1_epi32(0x0007FFE2);\r\n    const __m128i c16_p43_p39 = _mm_set1_epi32(0x002B0027);\r\n    const __m128i c16_n02_p39 = _mm_set1_epi32(0xFFFE0027);\r\n    const __m128i c16_n36_n41 = _mm_set1_epi32(0xFFDCFFD7);\r\n    const __m128i c16_p43_p07 = _mm_set1_epi32(0x002B0007);\r\n    const __m128i c16_n11_p34 = _mm_set1_epi32(0xFFF50022);\r\n    const __m128i c16_n15_p36 = _mm_set1_epi32(0xFFF10024);\r\n    const __m128i c16_n11_n45 = _mm_set1_epi32(0xFFF5FFD3);\r\n    const __m128i c16_p34_p39 = _mm_set1_epi32(0x00220027);\r\n    const __m128i c16_n45_n19 = _mm_set1_epi32(0xFFD3FFED);\r\n    const __m128i c16_n27_p34 = _mm_set1_epi32(0xFFE50022);\r\n    const __m128i c16_p19_n39 = _mm_set1_epi32(0x0013FFD9);\r\n    const __m128i c16_n11_p43 = _mm_set1_epi32(0xFFF5002B);\r\n    const __m128i c16_p02_n45 = _mm_set1_epi32(0x0002FFD3);\r\n    const __m128i c16_n36_p30 = _mm_set1_epi32(0xFFDC001E);\r\n    const __m128i c16_p41_n23 = _mm_set1_epi32(0x0029FFE9);\r\n    const __m128i c16_n44_p15 = _mm_set1_epi32(0xFFD4000F);\r\n    const __m128i c16_p45_n07 = _mm_set1_epi32(0x002DFFF9);\r\n    const __m128i c16_n43_p27 = _mm_set1_epi32(0xFFD5001B);\r\n    const __m128i c16_p44_n02 = _mm_set1_epi32(0x002CFFFE);\r\n    const __m128i c16_n30_n23 = _mm_set1_epi32(0xFFE2FFE9);\r\n    const 
__m128i c16_p07_p41 = _mm_set1_epi32(0x00070029);\r\n    const __m128i c16_n45_p23 = _mm_set1_epi32(0xFFD30017);\r\n    const __m128i c16_p27_p19 = _mm_set1_epi32(0x001B0013);\r\n    const __m128i c16_p15_n45 = _mm_set1_epi32(0x000FFFD3);\r\n    const __m128i c16_n44_p30 = _mm_set1_epi32(0xFFD4001E);\r\n    const __m128i c16_n44_p19 = _mm_set1_epi32(0xFFD40013);\r\n    const __m128i c16_n02_p36 = _mm_set1_epi32(0xFFFE0024);\r\n    const __m128i c16_p45_n34 = _mm_set1_epi32(0x002DFFDE);\r\n    const __m128i c16_n15_n23 = _mm_set1_epi32(0xFFF1FFE9);\r\n    const __m128i c16_n39_p15 = _mm_set1_epi32(0xFFD9000F);\r\n    const __m128i c16_n30_p45 = _mm_set1_epi32(0xFFE2002D);\r\n    const __m128i c16_p27_p02 = _mm_set1_epi32(0x001B0002);\r\n    const __m128i c16_p41_n44 = _mm_set1_epi32(0x0029FFD4);\r\n    const __m128i c16_n30_p11 = _mm_set1_epi32(0xFFE2000B);\r\n    const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n    const __m128i c16_n19_p36 = _mm_set1_epi32(0xFFED0024);\r\n    const __m128i c16_p23_n02 = _mm_set1_epi32(0x0017FFFE);\r\n    const __m128i c16_n19_p07 = _mm_set1_epi32(0xFFED0007);\r\n    const __m128i c16_n39_p30 = _mm_set1_epi32(0xFFD9001E);\r\n    const __m128i c16_n45_p44 = _mm_set1_epi32(0xFFD3002C);\r\n    const __m128i c16_n36_p43 = _mm_set1_epi32(0xFFDC002B);\r\n    const __m128i c16_n07_p02 = _mm_set1_epi32(0xFFF90002);\r\n    const __m128i c16_n15_p11 = _mm_set1_epi32(0xFFF1000B);\r\n    const __m128i c16_n23_p19 = _mm_set1_epi32(0xFFE90013);\r\n    const __m128i c16_n30_p27 = _mm_set1_epi32(0xFFE2001B);\r\n    const __m128i c16_p35_p40 = _mm_set1_epi32(0x00230028);\r\n    const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);\r\n    const __m128i c16_n21_p04 = _mm_set1_epi32(0xFFEB0004);\r\n    const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);\r\n    const __m128i c16_n43_n35 = _mm_set1_epi32(0xFFD5FFDD);\r\n    const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);\r\n    const __m128i c16_p04_n43 = 
_mm_set1_epi32(0x0004FFD5);\r\n    const __m128i c16_n40_p29 = _mm_set1_epi32(0xFFD8001D);\r\n    const __m128i c16_p45_n13 = _mm_set1_epi32(0x002DFFF3);\r\n    const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);\r\n    const __m128i c16_p13_p29 = _mm_set1_epi32(0x000D001D);\r\n    const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);\r\n    const __m128i c16_n40_p45 = _mm_set1_epi32(0xFFD8002D);\r\n    const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);\r\n    const __m128i c16_n29_p21 = _mm_set1_epi32(0xFFE30015);\r\n    const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);\r\n    const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);\r\n    const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);\r\n    const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);\r\n    const __m128i c16_p17_p42 = _mm_set1_epi32(0x0011002A);\r\n    const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);\r\n\r\n    __m128i c32_rnd = _mm_set1_epi32(16);   // add1\r\n    __m128i Zero_16 = _mm_set1_epi16(0);\r\n\r\n    int nShift = 5;\r\n    int i, part;\r\n\r\n    // DCT1\r\n    __m128i in00[4], in01[4], in02[4], in03[4], in04[4], in05[4], in06[4], in07[4], in08[4], in09[4], in10[4], in11[4], in12[4], in13[4], in14[4], in15[4];\r\n    __m128i in16[4], in17[4], in18[4], in19[4], in20[4], in21[4], in22[4], in23[4], in24[4], in25[4], in26[4], in27[4], in28[4], in29[4], in30[4], in31[4];\r\n    __m128i res00[4], res01[4], res02[4], res03[4], res04[4], res05[4], res06[4], res07[4], res08[4], res09[4], res10[4], res11[4], res12[4], res13[4], res14[4], res15[4];\r\n    __m128i res16[4], res17[4], res18[4], res19[4], res20[4], res21[4], res22[4], res23[4], res24[4], res25[4], res26[4], res27[4], res28[4], res29[4], res30[4], res31[4];\r\n\r\n    i_dst &= 0xFE;    /* remember to remove the flag bit */\r\n\r\n    for (i = 0; i < 2; i++) {\r\n        const int 
offset = (i << 3);\r\n        in00[i] = _mm_loadu_si128((const __m128i*)&src[ 0 * 32 + offset]);\r\n        in01[i] = _mm_loadu_si128((const __m128i*)&src[ 1 * 32 + offset]);\r\n        in02[i] = _mm_loadu_si128((const __m128i*)&src[ 2 * 32 + offset]);\r\n        in03[i] = _mm_loadu_si128((const __m128i*)&src[ 3 * 32 + offset]);\r\n        in04[i] = _mm_loadu_si128((const __m128i*)&src[ 4 * 32 + offset]);\r\n        in05[i] = _mm_loadu_si128((const __m128i*)&src[ 5 * 32 + offset]);\r\n        in06[i] = _mm_loadu_si128((const __m128i*)&src[ 6 * 32 + offset]);\r\n        in07[i] = _mm_loadu_si128((const __m128i*)&src[ 7 * 32 + offset]);\r\n        in08[i] = _mm_loadu_si128((const __m128i*)&src[ 8 * 32 + offset]);\r\n        in09[i] = _mm_loadu_si128((const __m128i*)&src[ 9 * 32 + offset]);\r\n        in10[i] = _mm_loadu_si128((const __m128i*)&src[10 * 32 + offset]);\r\n        in11[i] = _mm_loadu_si128((const __m128i*)&src[11 * 32 + offset]);\r\n        in12[i] = _mm_loadu_si128((const __m128i*)&src[12 * 32 + offset]);\r\n        in13[i] = _mm_loadu_si128((const __m128i*)&src[13 * 32 + offset]);\r\n        in14[i] = _mm_loadu_si128((const __m128i*)&src[14 * 32 + offset]);\r\n        in15[i] = _mm_loadu_si128((const __m128i*)&src[15 * 32 + offset]);\r\n    }\r\n\r\n    //pass=1\r\n    for (part = 0; part < 2; part++) {\r\n        const __m128i T_00_00A = _mm_unpacklo_epi16(in01[part], in03[part]);    // [33 13 32 12 31 11 30 10]\r\n        const __m128i T_00_00B = _mm_unpackhi_epi16(in01[part], in03[part]);    // [37 17 36 16 35 15 34 14]\r\n        const __m128i T_00_01A = _mm_unpacklo_epi16(in05[part], in07[part]);    // [ ]\r\n        const __m128i T_00_01B = _mm_unpackhi_epi16(in05[part], in07[part]);    // [ ]\r\n        const __m128i T_00_02A = _mm_unpacklo_epi16(in09[part], in11[part]);    // [ ]\r\n        const __m128i T_00_02B = _mm_unpackhi_epi16(in09[part], in11[part]);    // [ ]\r\n        const __m128i T_00_03A = _mm_unpacklo_epi16(in13[part], 
in15[part]);    // [ ]\r\n        const __m128i T_00_03B = _mm_unpackhi_epi16(in13[part], in15[part]);    // [ ]\r\n\r\n        const __m128i T_00_08A = _mm_unpacklo_epi16(in02[part], in06[part]);    // [ ]\r\n        const __m128i T_00_08B = _mm_unpackhi_epi16(in02[part], in06[part]);    // [ ]\r\n        const __m128i T_00_09A = _mm_unpacklo_epi16(in10[part], in14[part]);    // [ ]\r\n        const __m128i T_00_09B = _mm_unpackhi_epi16(in10[part], in14[part]);    // [ ]\r\n\r\n        const __m128i T_00_12A = _mm_unpacklo_epi16(in04[part], in12[part]);    // [ ]\r\n        const __m128i T_00_12B = _mm_unpackhi_epi16(in04[part], in12[part]);    // [ ]\r\n\r\n        const __m128i T_00_14A = _mm_unpacklo_epi16(in08[part], Zero_16);    //\r\n        const __m128i T_00_14B = _mm_unpackhi_epi16(in08[part], Zero_16);    // [ ]\r\n        const __m128i T_00_15A = _mm_unpacklo_epi16(in00[part], Zero_16);    //\r\n        const __m128i T_00_15B = _mm_unpackhi_epi16(in00[part], Zero_16);    // [ ]\r\n\r\n        __m128i O00A, O01A, O02A, O03A, O04A, O05A, O06A, O07A, O08A, O09A, O10A, O11A, O12A, O13A, O14A, O15A;\r\n        __m128i O00B, O01B, O02B, O03B, O04B, O05B, O06B, O07B, O08B, O09B, O10B, O11B, O12B, O13B, O14B, O15B;\r\n        __m128i EO0A, EO1A, EO2A, EO3A, EO4A, EO5A, EO6A, EO7A;\r\n        __m128i EO0B, EO1B, EO2B, EO3B, EO4B, EO5B, EO6B, EO7B;\r\n        {\r\n            __m128i T00, T01;\r\n#define COMPUTE_ROW(r0103, r0507, r0911, r1315, c0103, c0507, c0911, c1315, row) \\\r\n        T00 = _mm_add_epi32(_mm_madd_epi16(r0103, c0103), _mm_madd_epi16(r0507, c0507)); \\\r\n        T01 = _mm_add_epi32(_mm_madd_epi16(r0911, c0911), _mm_madd_epi16(r1315, c1315)); \\\r\n        row = _mm_add_epi32(T00, T01);\r\n\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, O00A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, O01A)\r\n     
       COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, O02A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, O03A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, O04A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, O05A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, O06A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, O07A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, O08A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, O09A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, O10A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, O11A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, O12A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, O13A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, O14A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, O15A)\r\n\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, O00B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, O01B)\r\n 
           COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, O02B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, O03B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, O04B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, O05B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, O06B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, O07B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, O08B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, O09B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, O10B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, O11B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, O12B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, O13B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, O14B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, O15B)\r\n#undef COMPUTE_ROW\r\n        }\r\n\r\n        EO0A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_p43_p45), _mm_madd_epi16(T_00_09A, c16_p35_p40));\r\n        EO1A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_p29_p43), _mm_madd_epi16(T_00_09A, 
c16_n21_p04));\r\n        EO2A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_p04_p40), _mm_madd_epi16(T_00_09A, c16_n43_n35));\r\n        EO3A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n21_p35), _mm_madd_epi16(T_00_09A, c16_p04_n43));\r\n        EO4A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n40_p29), _mm_madd_epi16(T_00_09A, c16_p45_n13));\r\n        EO5A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n45_p21), _mm_madd_epi16(T_00_09A, c16_p13_p29));\r\n        EO6A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n35_p13), _mm_madd_epi16(T_00_09A, c16_n40_p45));\r\n        EO7A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n13_p04), _mm_madd_epi16(T_00_09A, c16_n29_p21));\r\n\r\n        EO0B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_p43_p45), _mm_madd_epi16(T_00_09B, c16_p35_p40));\r\n        EO1B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_p29_p43), _mm_madd_epi16(T_00_09B, c16_n21_p04));\r\n        EO2B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_p04_p40), _mm_madd_epi16(T_00_09B, c16_n43_n35));\r\n        EO3B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n21_p35), _mm_madd_epi16(T_00_09B, c16_p04_n43));\r\n        EO4B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n40_p29), _mm_madd_epi16(T_00_09B, c16_p45_n13));\r\n        EO5B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n45_p21), _mm_madd_epi16(T_00_09B, c16_p13_p29));\r\n        EO6B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n35_p13), _mm_madd_epi16(T_00_09B, c16_n40_p45));\r\n        EO7B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n13_p04), _mm_madd_epi16(T_00_09B, c16_n29_p21));\r\n            \r\n        {\r\n            const __m128i EEO0A = _mm_madd_epi16(T_00_12A, c16_p38_p44);\r\n            const __m128i EEO1A = _mm_madd_epi16(T_00_12A, c16_n09_p38);\r\n            const __m128i EEO2A = _mm_madd_epi16(T_00_12A, c16_n44_p25);\r\n            const __m128i EEO3A = _mm_madd_epi16(T_00_12A, c16_n25_p09);\r\n            const __m128i EEO0B = _mm_madd_epi16(T_00_12B, c16_p38_p44);\r\n            
const __m128i EEO1B = _mm_madd_epi16(T_00_12B, c16_n09_p38);\r\n            const __m128i EEO2B = _mm_madd_epi16(T_00_12B, c16_n44_p25);\r\n            const __m128i EEO3B = _mm_madd_epi16(T_00_12B, c16_n25_p09);\r\n\r\n            const __m128i EEEO0A = _mm_madd_epi16(T_00_14A, c16_p17_p42);\r\n            const __m128i EEEO0B = _mm_madd_epi16(T_00_14B, c16_p17_p42);\r\n            const __m128i EEEO1A = _mm_madd_epi16(T_00_14A, c16_n42_p17);\r\n            const __m128i EEEO1B = _mm_madd_epi16(T_00_14B, c16_n42_p17);\r\n\r\n            const __m128i EEEE0A = _mm_madd_epi16(T_00_15A, c16_p32_p32);\r\n            const __m128i EEEE0B = _mm_madd_epi16(T_00_15B, c16_p32_p32);\r\n            const __m128i EEEE1A = _mm_madd_epi16(T_00_15A, c16_n32_p32);\r\n            const __m128i EEEE1B = _mm_madd_epi16(T_00_15B, c16_n32_p32);\r\n\r\n            const __m128i EEE0A = _mm_add_epi32(EEEE0A, EEEO0A);    // EEE0 = EEEE0 + EEEO0\r\n            const __m128i EEE0B = _mm_add_epi32(EEEE0B, EEEO0B);\r\n            const __m128i EEE1A = _mm_add_epi32(EEEE1A, EEEO1A);    // EEE1 = EEEE1 + EEEO1\r\n            const __m128i EEE1B = _mm_add_epi32(EEEE1B, EEEO1B);\r\n            const __m128i EEE3A = _mm_sub_epi32(EEEE0A, EEEO0A);    // EEE3 = EEEE0 - EEEO0\r\n            const __m128i EEE3B = _mm_sub_epi32(EEEE0B, EEEO0B);\r\n            const __m128i EEE2A = _mm_sub_epi32(EEEE1A, EEEO1A);    // EEE2 = EEEE1 - EEEO1\r\n            const __m128i EEE2B = _mm_sub_epi32(EEEE1B, EEEO1B);\r\n\r\n            const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n            const __m128i EE0B = _mm_add_epi32(EEE0B, EEO0B);\r\n            const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n            const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n            const __m128i EE2A = _mm_add_epi32(EEE2A, EEO2A);       // EE2 = EEE2 + EEO2\r\n            const __m128i EE2B = _mm_add_epi32(EEE2B, EEO2B);\r\n            const __m128i 
EE3A = _mm_add_epi32(EEE3A, EEO3A);       // EE3 = EEE3 + EEO3\r\n            const __m128i EE3B = _mm_add_epi32(EEE3B, EEO3B);\r\n            const __m128i EE7A = _mm_sub_epi32(EEE0A, EEO0A);       // EE7 = EEE0 - EEO0\r\n            const __m128i EE7B = _mm_sub_epi32(EEE0B, EEO0B);\r\n            const __m128i EE6A = _mm_sub_epi32(EEE1A, EEO1A);       // EE6 = EEE1 - EEO1\r\n            const __m128i EE6B = _mm_sub_epi32(EEE1B, EEO1B);\r\n            const __m128i EE5A = _mm_sub_epi32(EEE2A, EEO2A);       // EE5 = EEE2 - EEO2\r\n            const __m128i EE5B = _mm_sub_epi32(EEE2B, EEO2B);\r\n            const __m128i EE4A = _mm_sub_epi32(EEE3A, EEO3A);       // EE4 = EEE3 - EEO3\r\n            const __m128i EE4B = _mm_sub_epi32(EEE3B, EEO3B);\r\n\r\n            const __m128i E0A = _mm_add_epi32(EE0A, EO0A);          // E0 = EE0 + EO0\r\n            const __m128i E0B = _mm_add_epi32(EE0B, EO0B);\r\n            const __m128i E1A = _mm_add_epi32(EE1A, EO1A);          // E1 = EE1 + EO1\r\n            const __m128i E1B = _mm_add_epi32(EE1B, EO1B);\r\n            const __m128i E2A = _mm_add_epi32(EE2A, EO2A);          // E2 = EE2 + EO2\r\n            const __m128i E2B = _mm_add_epi32(EE2B, EO2B);\r\n            const __m128i E3A = _mm_add_epi32(EE3A, EO3A);          // E3 = EE3 + EO3\r\n            const __m128i E3B = _mm_add_epi32(EE3B, EO3B);\r\n            const __m128i E4A = _mm_add_epi32(EE4A, EO4A);          // E4 = EE4 + EO4\r\n            const __m128i E4B = _mm_add_epi32(EE4B, EO4B);\r\n            const __m128i E5A = _mm_add_epi32(EE5A, EO5A);          // E5 = EE5 + EO5\r\n            const __m128i E5B = _mm_add_epi32(EE5B, EO5B);\r\n            const __m128i E6A = _mm_add_epi32(EE6A, EO6A);          // E6 = EE6 + EO6\r\n            const __m128i E6B = _mm_add_epi32(EE6B, EO6B);\r\n            const __m128i E7A = _mm_add_epi32(EE7A, EO7A);          // E7 = EE7 + EO7\r\n            const __m128i E7B = _mm_add_epi32(EE7B, EO7B);\r\n            const __m128i EFA = _mm_sub_epi32(EE0A, EO0A);          
// EF = EE0 - EO0\r\n            const __m128i EFB = _mm_sub_epi32(EE0B, EO0B);\r\n            const __m128i EEA = _mm_sub_epi32(EE1A, EO1A);          // EE = EE1 - EO1\r\n            const __m128i EEB = _mm_sub_epi32(EE1B, EO1B);\r\n            const __m128i EDA = _mm_sub_epi32(EE2A, EO2A);          // ED = EE2 - EO2\r\n            const __m128i EDB = _mm_sub_epi32(EE2B, EO2B);\r\n            const __m128i ECA = _mm_sub_epi32(EE3A, EO3A);          // EC = EE3 - EO3\r\n            const __m128i ECB = _mm_sub_epi32(EE3B, EO3B);\r\n            const __m128i EBA = _mm_sub_epi32(EE4A, EO4A);          // EB = EE4 - EO4\r\n            const __m128i EBB = _mm_sub_epi32(EE4B, EO4B);\r\n            const __m128i EAA = _mm_sub_epi32(EE5A, EO5A);          // EA = EE5 - EO5\r\n            const __m128i EAB = _mm_sub_epi32(EE5B, EO5B);\r\n            const __m128i E9A = _mm_sub_epi32(EE6A, EO6A);          // E9 = EE6 - EO6\r\n            const __m128i E9B = _mm_sub_epi32(EE6B, EO6B);\r\n            const __m128i E8A = _mm_sub_epi32(EE7A, EO7A);          // E8 = EE7 - EO7\r\n            const __m128i E8B = _mm_sub_epi32(EE7B, EO7B);\r\n\r\n            const __m128i T10A = _mm_add_epi32(E0A, c32_rnd);       // E0 + rnd\r\n            const __m128i T10B = _mm_add_epi32(E0B, c32_rnd);\r\n            const __m128i T11A = _mm_add_epi32(E1A, c32_rnd);       // E1 + rnd\r\n            const __m128i T11B = _mm_add_epi32(E1B, c32_rnd);\r\n            const __m128i T12A = _mm_add_epi32(E2A, c32_rnd);       // E2 + rnd\r\n            const __m128i T12B = _mm_add_epi32(E2B, c32_rnd);\r\n            const __m128i T13A = _mm_add_epi32(E3A, c32_rnd);       // E3 + rnd\r\n            const __m128i T13B = _mm_add_epi32(E3B, c32_rnd);\r\n            const __m128i T14A = _mm_add_epi32(E4A, c32_rnd);       // E4 + rnd\r\n            const __m128i T14B = _mm_add_epi32(E4B, c32_rnd);\r\n            const __m128i T15A = _mm_add_epi32(E5A, c32_rnd);       // E5 + rnd\r\n            const __m128i T15B = _mm_add_epi32(E5B, c32_rnd);\r\n        
    const __m128i T16A = _mm_add_epi32(E6A, c32_rnd);       // E6 + rnd\r\n            const __m128i T16B = _mm_add_epi32(E6B, c32_rnd);\r\n            const __m128i T17A = _mm_add_epi32(E7A, c32_rnd);       // E7 + rnd\r\n            const __m128i T17B = _mm_add_epi32(E7B, c32_rnd);\r\n            const __m128i T18A = _mm_add_epi32(E8A, c32_rnd);       // E8 + rnd\r\n            const __m128i T18B = _mm_add_epi32(E8B, c32_rnd);\r\n            const __m128i T19A = _mm_add_epi32(E9A, c32_rnd);       // E9 + rnd\r\n            const __m128i T19B = _mm_add_epi32(E9B, c32_rnd);\r\n            const __m128i T1AA = _mm_add_epi32(EAA, c32_rnd);       // E10 + rnd\r\n            const __m128i T1AB = _mm_add_epi32(EAB, c32_rnd);\r\n            const __m128i T1BA = _mm_add_epi32(EBA, c32_rnd);       // E11 + rnd\r\n            const __m128i T1BB = _mm_add_epi32(EBB, c32_rnd);\r\n            const __m128i T1CA = _mm_add_epi32(ECA, c32_rnd);       // E12 + rnd\r\n            const __m128i T1CB = _mm_add_epi32(ECB, c32_rnd);\r\n            const __m128i T1DA = _mm_add_epi32(EDA, c32_rnd);       // E13 + rnd\r\n            const __m128i T1DB = _mm_add_epi32(EDB, c32_rnd);\r\n            const __m128i T1EA = _mm_add_epi32(EEA, c32_rnd);       // E14 + rnd\r\n            const __m128i T1EB = _mm_add_epi32(EEB, c32_rnd);\r\n            const __m128i T1FA = _mm_add_epi32(EFA, c32_rnd);       // E15 + rnd\r\n            const __m128i T1FB = _mm_add_epi32(EFB, c32_rnd);\r\n\r\n            const __m128i T2_00A = _mm_add_epi32(T10A, O00A);       // E0 + O0 + rnd\r\n            const __m128i T2_00B = _mm_add_epi32(T10B, O00B);\r\n            const __m128i T2_01A = _mm_add_epi32(T11A, O01A);       // E1 + O1 + rnd\r\n            const __m128i T2_01B = _mm_add_epi32(T11B, O01B);\r\n            const __m128i T2_02A = _mm_add_epi32(T12A, O02A);       // E2 + O2 + rnd\r\n            const __m128i T2_02B = _mm_add_epi32(T12B, O02B);\r\n            const __m128i T2_03A = _mm_add_epi32(T13A, 
O03A);       // E3 + O3 + rnd\r\n            const __m128i T2_03B = _mm_add_epi32(T13B, O03B);\r\n            const __m128i T2_04A = _mm_add_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_04B = _mm_add_epi32(T14B, O04B);\r\n            const __m128i T2_05A = _mm_add_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_05B = _mm_add_epi32(T15B, O05B);\r\n            const __m128i T2_06A = _mm_add_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_06B = _mm_add_epi32(T16B, O06B);\r\n            const __m128i T2_07A = _mm_add_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_07B = _mm_add_epi32(T17B, O07B);\r\n            const __m128i T2_08A = _mm_add_epi32(T18A, O08A);       // E8\r\n            const __m128i T2_08B = _mm_add_epi32(T18B, O08B);\r\n            const __m128i T2_09A = _mm_add_epi32(T19A, O09A);       // E9\r\n            const __m128i T2_09B = _mm_add_epi32(T19B, O09B);\r\n            const __m128i T2_10A = _mm_add_epi32(T1AA, O10A);       // E10\r\n            const __m128i T2_10B = _mm_add_epi32(T1AB, O10B);\r\n            const __m128i T2_11A = _mm_add_epi32(T1BA, O11A);       // E11\r\n            const __m128i T2_11B = _mm_add_epi32(T1BB, O11B);\r\n            const __m128i T2_12A = _mm_add_epi32(T1CA, O12A);       // E12\r\n            const __m128i T2_12B = _mm_add_epi32(T1CB, O12B);\r\n            const __m128i T2_13A = _mm_add_epi32(T1DA, O13A);       // E13\r\n            const __m128i T2_13B = _mm_add_epi32(T1DB, O13B);\r\n            const __m128i T2_14A = _mm_add_epi32(T1EA, O14A);       // E14\r\n            const __m128i T2_14B = _mm_add_epi32(T1EB, O14B);\r\n            const __m128i T2_15A = _mm_add_epi32(T1FA, O15A);       // E15\r\n            const __m128i T2_15B = _mm_add_epi32(T1FB, O15B);\r\n            const __m128i T2_31A = _mm_sub_epi32(T10A, O00A);       // E0 - O0 + rnd\r\n            const __m128i T2_31B = _mm_sub_epi32(T10B, O00B);\r\n            const __m128i T2_30A = 
_mm_sub_epi32(T11A, O01A);       // E1 - O1 + rnd\r\n            const __m128i T2_30B = _mm_sub_epi32(T11B, O01B);\r\n            const __m128i T2_29A = _mm_sub_epi32(T12A, O02A);       // E2 - O2 + rnd\r\n            const __m128i T2_29B = _mm_sub_epi32(T12B, O02B);\r\n            const __m128i T2_28A = _mm_sub_epi32(T13A, O03A);       // E3 - O3 + rnd\r\n            const __m128i T2_28B = _mm_sub_epi32(T13B, O03B);\r\n            const __m128i T2_27A = _mm_sub_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_27B = _mm_sub_epi32(T14B, O04B);\r\n            const __m128i T2_26A = _mm_sub_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_26B = _mm_sub_epi32(T15B, O05B);\r\n            const __m128i T2_25A = _mm_sub_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_25B = _mm_sub_epi32(T16B, O06B);\r\n            const __m128i T2_24A = _mm_sub_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_24B = _mm_sub_epi32(T17B, O07B);\r\n            const __m128i T2_23A = _mm_sub_epi32(T18A, O08A);       //\r\n            const __m128i T2_23B = _mm_sub_epi32(T18B, O08B);\r\n            const __m128i T2_22A = _mm_sub_epi32(T19A, O09A);       //\r\n            const __m128i T2_22B = _mm_sub_epi32(T19B, O09B);\r\n            const __m128i T2_21A = _mm_sub_epi32(T1AA, O10A);       //\r\n            const __m128i T2_21B = _mm_sub_epi32(T1AB, O10B);\r\n            const __m128i T2_20A = _mm_sub_epi32(T1BA, O11A);       //\r\n            const __m128i T2_20B = _mm_sub_epi32(T1BB, O11B);\r\n            const __m128i T2_19A = _mm_sub_epi32(T1CA, O12A);       //\r\n            const __m128i T2_19B = _mm_sub_epi32(T1CB, O12B);\r\n            const __m128i T2_18A = _mm_sub_epi32(T1DA, O13A);       //\r\n            const __m128i T2_18B = _mm_sub_epi32(T1DB, O13B);\r\n            const __m128i T2_17A = _mm_sub_epi32(T1EA, O14A);       //\r\n            const __m128i T2_17B = _mm_sub_epi32(T1EB, O14B);\r\n            const __m128i T2_16A = 
_mm_sub_epi32(T1FA, O15A);       //\r\n            const __m128i T2_16B = _mm_sub_epi32(T1FB, O15B);\r\n\r\n            const __m128i T3_00A = _mm_srai_epi32(T2_00A, nShift);  // [30 20 10 00]\r\n            const __m128i T3_00B = _mm_srai_epi32(T2_00B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_01A = _mm_srai_epi32(T2_01A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_01B = _mm_srai_epi32(T2_01B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_02A = _mm_srai_epi32(T2_02A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_02B = _mm_srai_epi32(T2_02B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_03A = _mm_srai_epi32(T2_03A, nShift);  // [33 23 13 03]\r\n            const __m128i T3_03B = _mm_srai_epi32(T2_03B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_04A = _mm_srai_epi32(T2_04A, nShift);  // [34 24 14 04]\r\n            const __m128i T3_04B = _mm_srai_epi32(T2_04B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_05A = _mm_srai_epi32(T2_05A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_05B = _mm_srai_epi32(T2_05B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_06A = _mm_srai_epi32(T2_06A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_06B = _mm_srai_epi32(T2_06B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_07A = _mm_srai_epi32(T2_07A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_07B = _mm_srai_epi32(T2_07B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_08A = _mm_srai_epi32(T2_08A, nShift);  // [30 20 10 00] x8\r\n            const __m128i T3_08B = _mm_srai_epi32(T2_08B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_09A = _mm_srai_epi32(T2_09A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_09B = _mm_srai_epi32(T2_09B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_10A = _mm_srai_epi32(T2_10A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_10B = 
_mm_srai_epi32(T2_10B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_11A = _mm_srai_epi32(T2_11A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_11B = _mm_srai_epi32(T2_11B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_12A = _mm_srai_epi32(T2_12A, nShift);  // [34 24 14 04] xC\r\n            const __m128i T3_12B = _mm_srai_epi32(T2_12B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_13A = _mm_srai_epi32(T2_13A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_13B = _mm_srai_epi32(T2_13B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_14A = _mm_srai_epi32(T2_14A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_14B = _mm_srai_epi32(T2_14B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_15A = _mm_srai_epi32(T2_15A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_15B = _mm_srai_epi32(T2_15B, nShift);  // [77 67 57 47]\r\n\r\n            const __m128i T3_16A = _mm_srai_epi32(T2_16A, nShift);  // [30 20 10 00]\r\n            const __m128i T3_16B = _mm_srai_epi32(T2_16B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_17A = _mm_srai_epi32(T2_17A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_17B = _mm_srai_epi32(T2_17B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_18A = _mm_srai_epi32(T2_18A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_18B = _mm_srai_epi32(T2_18B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_19A = _mm_srai_epi32(T2_19A, nShift);  // [33 23 13 03]\r\n            const __m128i T3_19B = _mm_srai_epi32(T2_19B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_20A = _mm_srai_epi32(T2_20A, nShift);  // [34 24 14 04]\r\n            const __m128i T3_20B = _mm_srai_epi32(T2_20B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_21A = _mm_srai_epi32(T2_21A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_21B = _mm_srai_epi32(T2_21B, nShift);  // [75 65 
55 45]\r\n            const __m128i T3_22A = _mm_srai_epi32(T2_22A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_22B = _mm_srai_epi32(T2_22B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_23A = _mm_srai_epi32(T2_23A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_23B = _mm_srai_epi32(T2_23B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_24A = _mm_srai_epi32(T2_24A, nShift);  // [30 20 10 00] x8\r\n            const __m128i T3_24B = _mm_srai_epi32(T2_24B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_25A = _mm_srai_epi32(T2_25A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_25B = _mm_srai_epi32(T2_25B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_26A = _mm_srai_epi32(T2_26A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_26B = _mm_srai_epi32(T2_26B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_27A = _mm_srai_epi32(T2_27A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_27B = _mm_srai_epi32(T2_27B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_28A = _mm_srai_epi32(T2_28A, nShift);  // [34 24 14 04] xC\r\n            const __m128i T3_28B = _mm_srai_epi32(T2_28B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_29A = _mm_srai_epi32(T2_29A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_29B = _mm_srai_epi32(T2_29B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_30A = _mm_srai_epi32(T2_30A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_30B = _mm_srai_epi32(T2_30B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_31A = _mm_srai_epi32(T2_31A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_31B = _mm_srai_epi32(T2_31B, nShift);  // [77 67 57 47]\r\n\r\n            res00[part] = _mm_packs_epi32(T3_00A, T3_00B);          // [70 60 50 40 30 20 10 00]\r\n            res01[part] = _mm_packs_epi32(T3_01A, T3_01B);          // [71 61 51 41 31 21 11 01]\r\n  
          res02[part] = _mm_packs_epi32(T3_02A, T3_02B);          // [72 62 52 42 32 22 12 02]\r\n            res03[part] = _mm_packs_epi32(T3_03A, T3_03B);          // [73 63 53 43 33 23 13 03]\r\n            res04[part] = _mm_packs_epi32(T3_04A, T3_04B);          // [74 64 54 44 34 24 14 04]\r\n            res05[part] = _mm_packs_epi32(T3_05A, T3_05B);          // [75 65 55 45 35 25 15 05]\r\n            res06[part] = _mm_packs_epi32(T3_06A, T3_06B);          // [76 66 56 46 36 26 16 06]\r\n            res07[part] = _mm_packs_epi32(T3_07A, T3_07B);          // [77 67 57 47 37 27 17 07]\r\n            res08[part] = _mm_packs_epi32(T3_08A, T3_08B);          // [A0 ... 80]\r\n            res09[part] = _mm_packs_epi32(T3_09A, T3_09B);          // [A1 ... 81]\r\n            res10[part] = _mm_packs_epi32(T3_10A, T3_10B);          // [A2 ... 82]\r\n            res11[part] = _mm_packs_epi32(T3_11A, T3_11B);          // [A3 ... 83]\r\n            res12[part] = _mm_packs_epi32(T3_12A, T3_12B);          // [A4 ... 84]\r\n            res13[part] = _mm_packs_epi32(T3_13A, T3_13B);          // [A5 ... 85]\r\n            res14[part] = _mm_packs_epi32(T3_14A, T3_14B);          // [A6 ... 86]\r\n            res15[part] = _mm_packs_epi32(T3_15A, T3_15B);          // [A7 ... 
87]\r\n            res16[part] = _mm_packs_epi32(T3_16A, T3_16B);\r\n            res17[part] = _mm_packs_epi32(T3_17A, T3_17B);\r\n            res18[part] = _mm_packs_epi32(T3_18A, T3_18B);\r\n            res19[part] = _mm_packs_epi32(T3_19A, T3_19B);\r\n            res20[part] = _mm_packs_epi32(T3_20A, T3_20B);\r\n            res21[part] = _mm_packs_epi32(T3_21A, T3_21B);\r\n            res22[part] = _mm_packs_epi32(T3_22A, T3_22B);\r\n            res23[part] = _mm_packs_epi32(T3_23A, T3_23B);\r\n            res24[part] = _mm_packs_epi32(T3_24A, T3_24B);\r\n            res25[part] = _mm_packs_epi32(T3_25A, T3_25B);\r\n            res26[part] = _mm_packs_epi32(T3_26A, T3_26B);\r\n            res27[part] = _mm_packs_epi32(T3_27A, T3_27B);\r\n            res28[part] = _mm_packs_epi32(T3_28A, T3_28B);\r\n            res29[part] = _mm_packs_epi32(T3_29A, T3_29B);\r\n            res30[part] = _mm_packs_epi32(T3_30A, T3_30B);\r\n            res31[part] = _mm_packs_epi32(T3_31A, T3_31B);\r\n        }\r\n    }\r\n\r\n    //transpose matrix 8x8 16bit.\r\n    {\r\n        __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n        __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n        tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n        tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n        tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n        tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n        tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n        tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n        tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n        tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n        tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n        tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n        tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n        tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n        tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n 
       tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n        tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n        tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n        O0 = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n        O1 = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n        O2 = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n        O3 = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n        O4 = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n        O5 = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n        O6 = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n        O7 = _mm_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n        TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n        TRANSPOSE_8x8_16BIT(res00[1], res01[1], res02[1], res03[1], res04[1], res05[1], res06[1], res07[1], in08[0], in09[0], in10[0], in11[0], in12[0], in13[0], in14[0], in15[0])\r\n        \r\n        TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1])\r\n        TRANSPOSE_8x8_16BIT(res08[1], res09[1], res10[1], res11[1], res12[1], res13[1], res14[1], res15[1], in08[1], in09[1], in10[1], in11[1], in12[1], in13[1], in14[1], in15[1])\r\n        \r\n        TRANSPOSE_8x8_16BIT(res16[0], res17[0], res18[0], res19[0], res20[0], res21[0], res22[0], res23[0], in00[2], in01[2], in02[2], in03[2], in04[2], in05[2], in06[2], in07[2])\r\n        TRANSPOSE_8x8_16BIT(res16[1], res17[1], res18[1], res19[1], res20[1], res21[1], res22[1], res23[1], in08[2], in09[2], in10[2], in11[2], in12[2], in13[2], in14[2], in15[2])\r\n        \r\n        TRANSPOSE_8x8_16BIT(res24[0], res25[0], res26[0], res27[0], res28[0], res29[0], res30[0], res31[0], in00[3], in01[3], in02[3], in03[3], in04[3], in05[3], in06[3], in07[3])\r\n        TRANSPOSE_8x8_16BIT(res24[1], res25[1], res26[1], res27[1], res28[1], res29[1], 
res30[1], res31[1], in08[3], in09[3], in10[3], in11[3], in12[3], in13[3], in14[3], in15[3])\r\n#undef TRANSPOSE_8x8_16BIT\r\n    }\r\n\r\n    //pass=2\r\n    c32_rnd = _mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n    nShift = shift2;\r\n    for (part = 0; part < 4; part++) {\r\n        const __m128i T_00_00A = _mm_unpacklo_epi16(in01[part], in03[part]);    // [33 13 32 12 31 11 30 10]\r\n        const __m128i T_00_00B = _mm_unpackhi_epi16(in01[part], in03[part]);    // [37 17 36 16 35 15 34 14]\r\n        const __m128i T_00_01A = _mm_unpacklo_epi16(in05[part], in07[part]);    // [ ]\r\n        const __m128i T_00_01B = _mm_unpackhi_epi16(in05[part], in07[part]);    // [ ]\r\n        const __m128i T_00_02A = _mm_unpacklo_epi16(in09[part], in11[part]);    // [ ]\r\n        const __m128i T_00_02B = _mm_unpackhi_epi16(in09[part], in11[part]);    // [ ]\r\n        const __m128i T_00_03A = _mm_unpacklo_epi16(in13[part], in15[part]);    // [ ]\r\n        const __m128i T_00_03B = _mm_unpackhi_epi16(in13[part], in15[part]);    // [ ]\r\n\r\n        const __m128i T_00_08A = _mm_unpacklo_epi16(in02[part], in06[part]);    // [ ]\r\n        const __m128i T_00_08B = _mm_unpackhi_epi16(in02[part], in06[part]);    // [ ]\r\n        const __m128i T_00_09A = _mm_unpacklo_epi16(in10[part], in14[part]);    // [ ]\r\n        const __m128i T_00_09B = _mm_unpackhi_epi16(in10[part], in14[part]);    // [ ]\r\n\r\n        const __m128i T_00_12A = _mm_unpacklo_epi16(in04[part], in12[part]);    // [ ]\r\n        const __m128i T_00_12B = _mm_unpackhi_epi16(in04[part], in12[part]);    // [ ]\r\n\r\n        const __m128i T_00_14A = _mm_unpacklo_epi16(in08[part], Zero_16);    //\r\n        const __m128i T_00_14B = _mm_unpackhi_epi16(in08[part], Zero_16);    // [ ]\r\n        const __m128i T_00_15A = _mm_unpacklo_epi16(in00[part], Zero_16);    //\r\n        const __m128i T_00_15B = _mm_unpackhi_epi16(in00[part], Zero_16);    // [ ]\r\n\r\n        __m128i O00A, O01A, O02A, O03A, O04A, O05A, 
O06A, O07A, O08A, O09A, O10A, O11A, O12A, O13A, O14A, O15A;\r\n        __m128i O00B, O01B, O02B, O03B, O04B, O05B, O06B, O07B, O08B, O09B, O10B, O11B, O12B, O13B, O14B, O15B;\r\n        __m128i EO0A, EO1A, EO2A, EO3A, EO4A, EO5A, EO6A, EO7A;\r\n        __m128i EO0B, EO1B, EO2B, EO3B, EO4B, EO5B, EO6B, EO7B;\r\n        {\r\n            __m128i T00, T01;\r\n#define COMPUTE_ROW(r0103, r0507, r0911, r1315, c0103, c0507, c0911, c1315, row) \\\r\n    T00 = _mm_add_epi32(_mm_madd_epi16(r0103, c0103), _mm_madd_epi16(r0507, c0507)); \\\r\n    T01 = _mm_add_epi32(_mm_madd_epi16(r0911, c0911), _mm_madd_epi16(r1315, c1315)); \\\r\n    row = _mm_add_epi32(T00, T01);\r\n\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, O00A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, O01A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, O02A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, O03A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, O04A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, O05A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, O06A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, O07A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, O08A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, O09A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n45_p23, 
c16_p27_p19, c16_p15_n45, c16_n44_p30, O10A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, O11A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, O12A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, O13A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, O14A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, O15A)\r\n\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, O00B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, O01B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, O02B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, O03B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, O04B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, O05B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, O06B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, O07B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, O08B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, O09B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, 
c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, O10B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, O11B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, O12B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, O13B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, O14B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, O15B)\r\n#undef COMPUTE_ROW\r\n        }\r\n\r\n        EO0A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_p43_p45), _mm_madd_epi16(T_00_09A, c16_p35_p40));\r\n        EO1A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_p29_p43), _mm_madd_epi16(T_00_09A, c16_n21_p04));\r\n        EO2A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_p04_p40), _mm_madd_epi16(T_00_09A, c16_n43_n35));\r\n        EO3A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n21_p35), _mm_madd_epi16(T_00_09A, c16_p04_n43));\r\n        EO4A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n40_p29), _mm_madd_epi16(T_00_09A, c16_p45_n13));\r\n        EO5A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n45_p21), _mm_madd_epi16(T_00_09A, c16_p13_p29));\r\n        EO6A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n35_p13), _mm_madd_epi16(T_00_09A, c16_n40_p45));\r\n        EO7A = _mm_add_epi32(_mm_madd_epi16(T_00_08A, c16_n13_p04), _mm_madd_epi16(T_00_09A, c16_n29_p21));\r\n\r\n        EO0B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_p43_p45), _mm_madd_epi16(T_00_09B, c16_p35_p40));\r\n        EO1B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_p29_p43), _mm_madd_epi16(T_00_09B, c16_n21_p04));\r\n        EO2B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_p04_p40), _mm_madd_epi16(T_00_09B, c16_n43_n35));\r\n        EO3B = 
_mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n21_p35), _mm_madd_epi16(T_00_09B, c16_p04_n43));\r\n        EO4B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n40_p29), _mm_madd_epi16(T_00_09B, c16_p45_n13));\r\n        EO5B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n45_p21), _mm_madd_epi16(T_00_09B, c16_p13_p29));\r\n        EO6B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n35_p13), _mm_madd_epi16(T_00_09B, c16_n40_p45));\r\n        EO7B = _mm_add_epi32(_mm_madd_epi16(T_00_08B, c16_n13_p04), _mm_madd_epi16(T_00_09B, c16_n29_p21));\r\n\r\n        {\r\n            const __m128i EEO0A = _mm_madd_epi16(T_00_12A, c16_p38_p44);\r\n            const __m128i EEO1A = _mm_madd_epi16(T_00_12A, c16_n09_p38);\r\n            const __m128i EEO2A = _mm_madd_epi16(T_00_12A, c16_n44_p25);\r\n            const __m128i EEO3A = _mm_madd_epi16(T_00_12A, c16_n25_p09);\r\n            const __m128i EEO0B = _mm_madd_epi16(T_00_12B, c16_p38_p44);\r\n            const __m128i EEO1B = _mm_madd_epi16(T_00_12B, c16_n09_p38);\r\n            const __m128i EEO2B = _mm_madd_epi16(T_00_12B, c16_n44_p25);\r\n            const __m128i EEO3B = _mm_madd_epi16(T_00_12B, c16_n25_p09);\r\n\r\n            const __m128i EEEO0A = _mm_madd_epi16(T_00_14A, c16_p17_p42);\r\n            const __m128i EEEO0B = _mm_madd_epi16(T_00_14B, c16_p17_p42);\r\n            const __m128i EEEO1A = _mm_madd_epi16(T_00_14A, c16_n42_p17);\r\n            const __m128i EEEO1B = _mm_madd_epi16(T_00_14B, c16_n42_p17);\r\n\r\n            const __m128i EEEE0A = _mm_madd_epi16(T_00_15A, c16_p32_p32);\r\n            const __m128i EEEE0B = _mm_madd_epi16(T_00_15B, c16_p32_p32);\r\n            const __m128i EEEE1A = _mm_madd_epi16(T_00_15A, c16_n32_p32);\r\n            const __m128i EEEE1B = _mm_madd_epi16(T_00_15B, c16_n32_p32);\r\n\r\n            const __m128i EEE0A = _mm_add_epi32(EEEE0A, EEEO0A);    // EEE0 = EEEE0 + EEEO0\r\n            const __m128i EEE0B = _mm_add_epi32(EEEE0B, EEEO0B);\r\n            const __m128i EEE1A = 
_mm_add_epi32(EEEE1A, EEEO1A);    // EEE1 = EEEE1 + EEEO1\r\n            const __m128i EEE1B = _mm_add_epi32(EEEE1B, EEEO1B);\r\n            const __m128i EEE3A = _mm_sub_epi32(EEEE0A, EEEO0A);    // EEE3 = EEEE0 - EEEO0\r\n            const __m128i EEE3B = _mm_sub_epi32(EEEE0B, EEEO0B);\r\n            const __m128i EEE2A = _mm_sub_epi32(EEEE1A, EEEO1A);    // EEE2 = EEEE1 - EEEO1\r\n            const __m128i EEE2B = _mm_sub_epi32(EEEE1B, EEEO1B);\r\n\r\n            const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n            const __m128i EE0B = _mm_add_epi32(EEE0B, EEO0B);\r\n            const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n            const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n            const __m128i EE2A = _mm_add_epi32(EEE2A, EEO2A);       // EE2 = EEE2 + EEO2\r\n            const __m128i EE2B = _mm_add_epi32(EEE2B, EEO2B);\r\n            const __m128i EE3A = _mm_add_epi32(EEE3A, EEO3A);       // EE3 = EEE3 + EEO3\r\n            const __m128i EE3B = _mm_add_epi32(EEE3B, EEO3B);\r\n            const __m128i EE7A = _mm_sub_epi32(EEE0A, EEO0A);       // EE7 = EEE0 - EEO0\r\n            const __m128i EE7B = _mm_sub_epi32(EEE0B, EEO0B);\r\n            const __m128i EE6A = _mm_sub_epi32(EEE1A, EEO1A);       // EE6 = EEE1 - EEO1\r\n            const __m128i EE6B = _mm_sub_epi32(EEE1B, EEO1B);\r\n            const __m128i EE5A = _mm_sub_epi32(EEE2A, EEO2A);       // EE5 = EEE2 - EEO2\r\n            const __m128i EE5B = _mm_sub_epi32(EEE2B, EEO2B);\r\n            const __m128i EE4A = _mm_sub_epi32(EEE3A, EEO3A);       // EE4 = EEE3 - EEO3\r\n            const __m128i EE4B = _mm_sub_epi32(EEE3B, EEO3B);\r\n\r\n            const __m128i E0A = _mm_add_epi32(EE0A, EO0A);          // E0 = EE0 + EO0\r\n            const __m128i E0B = _mm_add_epi32(EE0B, EO0B);\r\n            const __m128i E1A = _mm_add_epi32(EE1A, EO1A);          // E1 = EE1 + EO1\r\n            const __m128i E1B = 
_mm_add_epi32(EE1B, EO1B);\r\n            const __m128i E2A = _mm_add_epi32(EE2A, EO2A);          // E2 = EE2 + EO2\r\n            const __m128i E2B = _mm_add_epi32(EE2B, EO2B);\r\n            const __m128i E3A = _mm_add_epi32(EE3A, EO3A);          // E3 = EE3 + EO3\r\n            const __m128i E3B = _mm_add_epi32(EE3B, EO3B);\r\n            const __m128i E4A = _mm_add_epi32(EE4A, EO4A);          // E4 =\r\n            const __m128i E4B = _mm_add_epi32(EE4B, EO4B);\r\n            const __m128i E5A = _mm_add_epi32(EE5A, EO5A);          // E5 =\r\n            const __m128i E5B = _mm_add_epi32(EE5B, EO5B);\r\n            const __m128i E6A = _mm_add_epi32(EE6A, EO6A);          // E6 =\r\n            const __m128i E6B = _mm_add_epi32(EE6B, EO6B);\r\n            const __m128i E7A = _mm_add_epi32(EE7A, EO7A);          // E7 =\r\n            const __m128i E7B = _mm_add_epi32(EE7B, EO7B);\r\n            const __m128i EFA = _mm_sub_epi32(EE0A, EO0A);          // EF = EE0 - EO0\r\n            const __m128i EFB = _mm_sub_epi32(EE0B, EO0B);\r\n            const __m128i EEA = _mm_sub_epi32(EE1A, EO1A);          // EE = EE1 - EO1\r\n            const __m128i EEB = _mm_sub_epi32(EE1B, EO1B);\r\n            const __m128i EDA = _mm_sub_epi32(EE2A, EO2A);          // ED = EE2 - EO2\r\n            const __m128i EDB = _mm_sub_epi32(EE2B, EO2B);\r\n            const __m128i ECA = _mm_sub_epi32(EE3A, EO3A);          // EC = EE3 - EO3\r\n            const __m128i ECB = _mm_sub_epi32(EE3B, EO3B);\r\n            const __m128i EBA = _mm_sub_epi32(EE4A, EO4A);          // EB =\r\n            const __m128i EBB = _mm_sub_epi32(EE4B, EO4B);\r\n            const __m128i EAA = _mm_sub_epi32(EE5A, EO5A);          // EA =\r\n            const __m128i EAB = _mm_sub_epi32(EE5B, EO5B);\r\n            const __m128i E9A = _mm_sub_epi32(EE6A, EO6A);          // E9 =\r\n            const __m128i E9B = _mm_sub_epi32(EE6B, EO6B);\r\n            const __m128i E8A = _mm_sub_epi32(EE7A, EO7A);          // E8 
=\r\n            const __m128i E8B = _mm_sub_epi32(EE7B, EO7B);\r\n\r\n            const __m128i T10A = _mm_add_epi32(E0A, c32_rnd);       // E0 + rnd\r\n            const __m128i T10B = _mm_add_epi32(E0B, c32_rnd);\r\n            const __m128i T11A = _mm_add_epi32(E1A, c32_rnd);       // E1 + rnd\r\n            const __m128i T11B = _mm_add_epi32(E1B, c32_rnd);\r\n            const __m128i T12A = _mm_add_epi32(E2A, c32_rnd);       // E2 + rnd\r\n            const __m128i T12B = _mm_add_epi32(E2B, c32_rnd);\r\n            const __m128i T13A = _mm_add_epi32(E3A, c32_rnd);       // E3 + rnd\r\n            const __m128i T13B = _mm_add_epi32(E3B, c32_rnd);\r\n            const __m128i T14A = _mm_add_epi32(E4A, c32_rnd);       // E4 + rnd\r\n            const __m128i T14B = _mm_add_epi32(E4B, c32_rnd);\r\n            const __m128i T15A = _mm_add_epi32(E5A, c32_rnd);       // E5 + rnd\r\n            const __m128i T15B = _mm_add_epi32(E5B, c32_rnd);\r\n            const __m128i T16A = _mm_add_epi32(E6A, c32_rnd);       // E6 + rnd\r\n            const __m128i T16B = _mm_add_epi32(E6B, c32_rnd);\r\n            const __m128i T17A = _mm_add_epi32(E7A, c32_rnd);       // E7 + rnd\r\n            const __m128i T17B = _mm_add_epi32(E7B, c32_rnd);\r\n            const __m128i T18A = _mm_add_epi32(E8A, c32_rnd);       // E8 + rnd\r\n            const __m128i T18B = _mm_add_epi32(E8B, c32_rnd);\r\n            const __m128i T19A = _mm_add_epi32(E9A, c32_rnd);       // E9 + rnd\r\n            const __m128i T19B = _mm_add_epi32(E9B, c32_rnd);\r\n            const __m128i T1AA = _mm_add_epi32(EAA, c32_rnd);       // E10 + rnd\r\n            const __m128i T1AB = _mm_add_epi32(EAB, c32_rnd);\r\n            const __m128i T1BA = _mm_add_epi32(EBA, c32_rnd);       // E11 + rnd\r\n            const __m128i T1BB = _mm_add_epi32(EBB, c32_rnd);\r\n            const __m128i T1CA = _mm_add_epi32(ECA, c32_rnd);       // E12 + rnd\r\n            const __m128i T1CB = _mm_add_epi32(ECB, c32_rnd);\r\n  
          const __m128i T1DA = _mm_add_epi32(EDA, c32_rnd);       // E13 + rnd\r\n            const __m128i T1DB = _mm_add_epi32(EDB, c32_rnd);\r\n            const __m128i T1EA = _mm_add_epi32(EEA, c32_rnd);       // E14 + rnd\r\n            const __m128i T1EB = _mm_add_epi32(EEB, c32_rnd);\r\n            const __m128i T1FA = _mm_add_epi32(EFA, c32_rnd);       // E15 + rnd\r\n            const __m128i T1FB = _mm_add_epi32(EFB, c32_rnd);\r\n\r\n            const __m128i T2_00A = _mm_add_epi32(T10A, O00A);       // E0 + O0 + rnd\r\n            const __m128i T2_00B = _mm_add_epi32(T10B, O00B);\r\n            const __m128i T2_01A = _mm_add_epi32(T11A, O01A);       // E1 + O1 + rnd\r\n            const __m128i T2_01B = _mm_add_epi32(T11B, O01B);\r\n            const __m128i T2_02A = _mm_add_epi32(T12A, O02A);       // E2 + O2 + rnd\r\n            const __m128i T2_02B = _mm_add_epi32(T12B, O02B);\r\n            const __m128i T2_03A = _mm_add_epi32(T13A, O03A);       // E3 + O3 + rnd\r\n            const __m128i T2_03B = _mm_add_epi32(T13B, O03B);\r\n            const __m128i T2_04A = _mm_add_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_04B = _mm_add_epi32(T14B, O04B);\r\n            const __m128i T2_05A = _mm_add_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_05B = _mm_add_epi32(T15B, O05B);\r\n            const __m128i T2_06A = _mm_add_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_06B = _mm_add_epi32(T16B, O06B);\r\n            const __m128i T2_07A = _mm_add_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_07B = _mm_add_epi32(T17B, O07B);\r\n            const __m128i T2_08A = _mm_add_epi32(T18A, O08A);       // E8\r\n            const __m128i T2_08B = _mm_add_epi32(T18B, O08B);\r\n            const __m128i T2_09A = _mm_add_epi32(T19A, O09A);       // E9\r\n            const __m128i T2_09B = _mm_add_epi32(T19B, O09B);\r\n            const __m128i T2_10A = _mm_add_epi32(T1AA, O10A);       // E10\r\n         
   const __m128i T2_10B = _mm_add_epi32(T1AB, O10B);\r\n            const __m128i T2_11A = _mm_add_epi32(T1BA, O11A);       // E11\r\n            const __m128i T2_11B = _mm_add_epi32(T1BB, O11B);\r\n            const __m128i T2_12A = _mm_add_epi32(T1CA, O12A);       // E12\r\n            const __m128i T2_12B = _mm_add_epi32(T1CB, O12B);\r\n            const __m128i T2_13A = _mm_add_epi32(T1DA, O13A);       // E13\r\n            const __m128i T2_13B = _mm_add_epi32(T1DB, O13B);\r\n            const __m128i T2_14A = _mm_add_epi32(T1EA, O14A);       // E14\r\n            const __m128i T2_14B = _mm_add_epi32(T1EB, O14B);\r\n            const __m128i T2_15A = _mm_add_epi32(T1FA, O15A);       // E15\r\n            const __m128i T2_15B = _mm_add_epi32(T1FB, O15B);\r\n            const __m128i T2_31A = _mm_sub_epi32(T10A, O00A);       // E0 - O0 + rnd\r\n            const __m128i T2_31B = _mm_sub_epi32(T10B, O00B);\r\n            const __m128i T2_30A = _mm_sub_epi32(T11A, O01A);       // E1 - O1 + rnd\r\n            const __m128i T2_30B = _mm_sub_epi32(T11B, O01B);\r\n            const __m128i T2_29A = _mm_sub_epi32(T12A, O02A);       // E2 - O2 + rnd\r\n            const __m128i T2_29B = _mm_sub_epi32(T12B, O02B);\r\n            const __m128i T2_28A = _mm_sub_epi32(T13A, O03A);       // E3 - O3 + rnd\r\n            const __m128i T2_28B = _mm_sub_epi32(T13B, O03B);\r\n            const __m128i T2_27A = _mm_sub_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_27B = _mm_sub_epi32(T14B, O04B);\r\n            const __m128i T2_26A = _mm_sub_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_26B = _mm_sub_epi32(T15B, O05B);\r\n            const __m128i T2_25A = _mm_sub_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_25B = _mm_sub_epi32(T16B, O06B);\r\n            const __m128i T2_24A = _mm_sub_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_24B = _mm_sub_epi32(T17B, O07B);\r\n            const __m128i T2_23A = 
_mm_sub_epi32(T18A, O08A);       // E8 - O8 + rnd\r\n            const __m128i T2_23B = _mm_sub_epi32(T18B, O08B);\r\n            const __m128i T2_22A = _mm_sub_epi32(T19A, O09A);       // E9 - O9 + rnd\r\n            const __m128i T2_22B = _mm_sub_epi32(T19B, O09B);\r\n            const __m128i T2_21A = _mm_sub_epi32(T1AA, O10A);       // E10 - O10 + rnd\r\n            const __m128i T2_21B = _mm_sub_epi32(T1AB, O10B);\r\n            const __m128i T2_20A = _mm_sub_epi32(T1BA, O11A);       // E11 - O11 + rnd\r\n            const __m128i T2_20B = _mm_sub_epi32(T1BB, O11B);\r\n            const __m128i T2_19A = _mm_sub_epi32(T1CA, O12A);       // E12 - O12 + rnd\r\n            const __m128i T2_19B = _mm_sub_epi32(T1CB, O12B);\r\n            const __m128i T2_18A = _mm_sub_epi32(T1DA, O13A);       // E13 - O13 + rnd\r\n            const __m128i T2_18B = _mm_sub_epi32(T1DB, O13B);\r\n            const __m128i T2_17A = _mm_sub_epi32(T1EA, O14A);       // E14 - O14 + rnd\r\n            const __m128i T2_17B = _mm_sub_epi32(T1EB, O14B);\r\n            const __m128i T2_16A = _mm_sub_epi32(T1FA, O15A);       // E15 - O15 + rnd\r\n            const __m128i T2_16B = _mm_sub_epi32(T1FB, O15B);\r\n\r\n            const __m128i T3_00A = _mm_srai_epi32(T2_00A, nShift);  // [30 20 10 00]\r\n            const __m128i T3_00B = _mm_srai_epi32(T2_00B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_01A = _mm_srai_epi32(T2_01A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_01B = _mm_srai_epi32(T2_01B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_02A = _mm_srai_epi32(T2_02A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_02B = _mm_srai_epi32(T2_02B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_03A = _mm_srai_epi32(T2_03A, nShift);  // [33 23 13 03]\r\n            const __m128i T3_03B = _mm_srai_epi32(T2_03B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_04A = _mm_srai_epi32(T2_04A, nShift);  // [34 24 14 04]\r\n            const __m128i T3_04B = _mm_srai_epi32(T2_04B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_05A = 
_mm_srai_epi32(T2_05A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_05B = _mm_srai_epi32(T2_05B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_06A = _mm_srai_epi32(T2_06A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_06B = _mm_srai_epi32(T2_06B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_07A = _mm_srai_epi32(T2_07A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_07B = _mm_srai_epi32(T2_07B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_08A = _mm_srai_epi32(T2_08A, nShift);  // [30 20 10 00] x8\r\n            const __m128i T3_08B = _mm_srai_epi32(T2_08B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_09A = _mm_srai_epi32(T2_09A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_09B = _mm_srai_epi32(T2_09B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_10A = _mm_srai_epi32(T2_10A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_10B = _mm_srai_epi32(T2_10B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_11A = _mm_srai_epi32(T2_11A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_11B = _mm_srai_epi32(T2_11B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_12A = _mm_srai_epi32(T2_12A, nShift);  // [34 24 14 04] xC\r\n            const __m128i T3_12B = _mm_srai_epi32(T2_12B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_13A = _mm_srai_epi32(T2_13A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_13B = _mm_srai_epi32(T2_13B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_14A = _mm_srai_epi32(T2_14A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_14B = _mm_srai_epi32(T2_14B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_15A = _mm_srai_epi32(T2_15A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_15B = _mm_srai_epi32(T2_15B, nShift);  // [77 67 57 47]\r\n\r\n            const __m128i T3_16A = _mm_srai_epi32(T2_16A, nShift);  // 
[30 20 10 00]\r\n            const __m128i T3_16B = _mm_srai_epi32(T2_16B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_17A = _mm_srai_epi32(T2_17A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_17B = _mm_srai_epi32(T2_17B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_18A = _mm_srai_epi32(T2_18A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_18B = _mm_srai_epi32(T2_18B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_19A = _mm_srai_epi32(T2_19A, nShift);  // [33 23 13 03]\r\n            const __m128i T3_19B = _mm_srai_epi32(T2_19B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_20A = _mm_srai_epi32(T2_20A, nShift);  // [33 24 14 04]\r\n            const __m128i T3_20B = _mm_srai_epi32(T2_20B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_21A = _mm_srai_epi32(T2_21A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_21B = _mm_srai_epi32(T2_21B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_22A = _mm_srai_epi32(T2_22A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_22B = _mm_srai_epi32(T2_22B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_23A = _mm_srai_epi32(T2_23A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_23B = _mm_srai_epi32(T2_23B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_24A = _mm_srai_epi32(T2_24A, nShift);  // [30 20 10 00] x8\r\n            const __m128i T3_24B = _mm_srai_epi32(T2_24B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_25A = _mm_srai_epi32(T2_25A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_25B = _mm_srai_epi32(T2_25B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_26A = _mm_srai_epi32(T2_26A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_26B = _mm_srai_epi32(T2_26B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_27A = _mm_srai_epi32(T2_27A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_27B = 
_mm_srai_epi32(T2_27B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_28A = _mm_srai_epi32(T2_28A, nShift);  // [33 24 14 04] xC\r\n            const __m128i T3_28B = _mm_srai_epi32(T2_28B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_29A = _mm_srai_epi32(T2_29A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_29B = _mm_srai_epi32(T2_29B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_30A = _mm_srai_epi32(T2_30A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_30B = _mm_srai_epi32(T2_30B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_31A = _mm_srai_epi32(T2_31A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_31B = _mm_srai_epi32(T2_31B, nShift);  // [77 67 57 47]\r\n\r\n            res00[part] = _mm_packs_epi32(T3_00A, T3_00B);          // [70 60 50 40 30 20 10 00]\r\n            res01[part] = _mm_packs_epi32(T3_01A, T3_01B);          // [71 61 51 41 31 21 11 01]\r\n            res02[part] = _mm_packs_epi32(T3_02A, T3_02B);          // [72 62 52 42 32 22 12 02]\r\n            res03[part] = _mm_packs_epi32(T3_03A, T3_03B);          // [73 63 53 43 33 23 13 03]\r\n            res04[part] = _mm_packs_epi32(T3_04A, T3_04B);          // [74 64 54 44 34 24 14 04]\r\n            res05[part] = _mm_packs_epi32(T3_05A, T3_05B);          // [75 65 55 45 35 25 15 05]\r\n            res06[part] = _mm_packs_epi32(T3_06A, T3_06B);          // [76 66 56 46 36 26 16 06]\r\n            res07[part] = _mm_packs_epi32(T3_07A, T3_07B);          // [77 67 57 47 37 27 17 07]\r\n            res08[part] = _mm_packs_epi32(T3_08A, T3_08B);          // [A0 ... 80]\r\n            res09[part] = _mm_packs_epi32(T3_09A, T3_09B);          // [A1 ... 81]\r\n            res10[part] = _mm_packs_epi32(T3_10A, T3_10B);          // [A2 ... 82]\r\n            res11[part] = _mm_packs_epi32(T3_11A, T3_11B);          // [A3 ... 83]\r\n            res12[part] = _mm_packs_epi32(T3_12A, T3_12B);          // [A4 ... 
84]\r\n            res13[part] = _mm_packs_epi32(T3_13A, T3_13B);          // [A5 ... 85]\r\n            res14[part] = _mm_packs_epi32(T3_14A, T3_14B);          // [A6 ... 86]\r\n            res15[part] = _mm_packs_epi32(T3_15A, T3_15B);          // [A7 ... 87]\r\n            res16[part] = _mm_packs_epi32(T3_16A, T3_16B);\r\n            res17[part] = _mm_packs_epi32(T3_17A, T3_17B);\r\n            res18[part] = _mm_packs_epi32(T3_18A, T3_18B);\r\n            res19[part] = _mm_packs_epi32(T3_19A, T3_19B);\r\n            res20[part] = _mm_packs_epi32(T3_20A, T3_20B);\r\n            res21[part] = _mm_packs_epi32(T3_21A, T3_21B);\r\n            res22[part] = _mm_packs_epi32(T3_22A, T3_22B);\r\n            res23[part] = _mm_packs_epi32(T3_23A, T3_23B);\r\n            res24[part] = _mm_packs_epi32(T3_24A, T3_24B);\r\n            res25[part] = _mm_packs_epi32(T3_25A, T3_25B);\r\n            res26[part] = _mm_packs_epi32(T3_26A, T3_26B);\r\n            res27[part] = _mm_packs_epi32(T3_27A, T3_27B);\r\n            res28[part] = _mm_packs_epi32(T3_28A, T3_28B);\r\n            res29[part] = _mm_packs_epi32(T3_29A, T3_29B);\r\n            res30[part] = _mm_packs_epi32(T3_30A, T3_30B);\r\n            res31[part] = _mm_packs_epi32(T3_31A, T3_31B);\r\n        }\r\n    }\r\n\r\n    //transpose matrix 8x8 16bit.\r\n    {\r\n        __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n        __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n   
 tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0 = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1 = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2 = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3 = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4 = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5 = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6 = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7 = _mm_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n        TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n        TRANSPOSE_8x8_16BIT(res00[1], res01[1], res02[1], res03[1], res04[1], res05[1], res06[1], res07[1], in08[0], in09[0], in10[0], in11[0], in12[0], in13[0], in14[0], in15[0])\r\n        TRANSPOSE_8x8_16BIT(res00[2], res01[2], res02[2], res03[2], res04[2], res05[2], res06[2], res07[2], in16[0], in17[0], in18[0], in19[0], in20[0], in21[0], in22[0], in23[0])\r\n        TRANSPOSE_8x8_16BIT(res00[3], res01[3], res02[3], res03[3], res04[3], res05[3], res06[3], res07[3], in24[0], in25[0], in26[0], in27[0], in28[0], in29[0], in30[0], in31[0])\r\n\r\n        TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1])\r\n        TRANSPOSE_8x8_16BIT(res08[1], res09[1], res10[1], res11[1], res12[1], res13[1], res14[1], res15[1], in08[1], in09[1], in10[1], in11[1], in12[1], in13[1], in14[1], in15[1])\r\n        TRANSPOSE_8x8_16BIT(res08[2], res09[2], res10[2], res11[2], res12[2], res13[2], res14[2], res15[2], in16[1], in17[1], in18[1], in19[1], 
in20[1], in21[1], in22[1], in23[1])\r\n        TRANSPOSE_8x8_16BIT(res08[3], res09[3], res10[3], res11[3], res12[3], res13[3], res14[3], res15[3], in24[1], in25[1], in26[1], in27[1], in28[1], in29[1], in30[1], in31[1])\r\n\r\n        TRANSPOSE_8x8_16BIT(res16[0], res17[0], res18[0], res19[0], res20[0], res21[0], res22[0], res23[0], in00[2], in01[2], in02[2], in03[2], in04[2], in05[2], in06[2], in07[2])\r\n        TRANSPOSE_8x8_16BIT(res16[1], res17[1], res18[1], res19[1], res20[1], res21[1], res22[1], res23[1], in08[2], in09[2], in10[2], in11[2], in12[2], in13[2], in14[2], in15[2])\r\n        TRANSPOSE_8x8_16BIT(res16[2], res17[2], res18[2], res19[2], res20[2], res21[2], res22[2], res23[2], in16[2], in17[2], in18[2], in19[2], in20[2], in21[2], in22[2], in23[2])\r\n        TRANSPOSE_8x8_16BIT(res16[3], res17[3], res18[3], res19[3], res20[3], res21[3], res22[3], res23[3], in24[2], in25[2], in26[2], in27[2], in28[2], in29[2], in30[2], in31[2])\r\n\r\n        TRANSPOSE_8x8_16BIT(res24[0], res25[0], res26[0], res27[0], res28[0], res29[0], res30[0], res31[0], in00[3], in01[3], in02[3], in03[3], in04[3], in05[3], in06[3], in07[3])\r\n        TRANSPOSE_8x8_16BIT(res24[1], res25[1], res26[1], res27[1], res28[1], res29[1], res30[1], res31[1], in08[3], in09[3], in10[3], in11[3], in12[3], in13[3], in14[3], in15[3])\r\n        TRANSPOSE_8x8_16BIT(res24[2], res25[2], res26[2], res27[2], res28[2], res29[2], res30[2], res31[2], in16[3], in17[3], in18[3], in19[3], in20[3], in21[3], in22[3], in23[3])\r\n        TRANSPOSE_8x8_16BIT(res24[3], res25[3], res26[3], res27[3], res28[3], res29[3], res30[3], res31[3], in24[3], in25[3], in26[3], in27[3], in28[3], in29[3], in30[3], in31[3])\r\n#undef TRANSPOSE_8x8_16BIT\r\n    }\r\n\r\n\r\n    //clip\r\n    {\r\n        __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n        int k;\r\n\r\n        for (k = 0; k < 4; k++) {\r\n            in00[k] = 
_mm_max_epi16(_mm_min_epi16(in00[k], max_val), min_val);\r\n            in01[k] = _mm_max_epi16(_mm_min_epi16(in01[k], max_val), min_val);\r\n            in02[k] = _mm_max_epi16(_mm_min_epi16(in02[k], max_val), min_val);\r\n            in03[k] = _mm_max_epi16(_mm_min_epi16(in03[k], max_val), min_val);\r\n            in04[k] = _mm_max_epi16(_mm_min_epi16(in04[k], max_val), min_val);\r\n            in05[k] = _mm_max_epi16(_mm_min_epi16(in05[k], max_val), min_val);\r\n            in06[k] = _mm_max_epi16(_mm_min_epi16(in06[k], max_val), min_val);\r\n            in07[k] = _mm_max_epi16(_mm_min_epi16(in07[k], max_val), min_val);\r\n            in08[k] = _mm_max_epi16(_mm_min_epi16(in08[k], max_val), min_val);\r\n            in09[k] = _mm_max_epi16(_mm_min_epi16(in09[k], max_val), min_val);\r\n            in10[k] = _mm_max_epi16(_mm_min_epi16(in10[k], max_val), min_val);\r\n            in11[k] = _mm_max_epi16(_mm_min_epi16(in11[k], max_val), min_val);\r\n            in12[k] = _mm_max_epi16(_mm_min_epi16(in12[k], max_val), min_val);\r\n            in13[k] = _mm_max_epi16(_mm_min_epi16(in13[k], max_val), min_val);\r\n            in14[k] = _mm_max_epi16(_mm_min_epi16(in14[k], max_val), min_val);\r\n            in15[k] = _mm_max_epi16(_mm_min_epi16(in15[k], max_val), min_val);\r\n            in16[k] = _mm_max_epi16(_mm_min_epi16(in16[k], max_val), min_val);\r\n            in17[k] = _mm_max_epi16(_mm_min_epi16(in17[k], max_val), min_val);\r\n            in18[k] = _mm_max_epi16(_mm_min_epi16(in18[k], max_val), min_val);\r\n            in19[k] = _mm_max_epi16(_mm_min_epi16(in19[k], max_val), min_val);\r\n            in20[k] = _mm_max_epi16(_mm_min_epi16(in20[k], max_val), min_val);\r\n            in21[k] = _mm_max_epi16(_mm_min_epi16(in21[k], max_val), min_val);\r\n            in22[k] = _mm_max_epi16(_mm_min_epi16(in22[k], max_val), min_val);\r\n            in23[k] = _mm_max_epi16(_mm_min_epi16(in23[k], max_val), min_val);\r\n            in24[k] = 
_mm_max_epi16(_mm_min_epi16(in24[k], max_val), min_val);\r\n            in25[k] = _mm_max_epi16(_mm_min_epi16(in25[k], max_val), min_val);\r\n            in26[k] = _mm_max_epi16(_mm_min_epi16(in26[k], max_val), min_val);\r\n            in27[k] = _mm_max_epi16(_mm_min_epi16(in27[k], max_val), min_val);\r\n            in28[k] = _mm_max_epi16(_mm_min_epi16(in28[k], max_val), min_val);\r\n            in29[k] = _mm_max_epi16(_mm_min_epi16(in29[k], max_val), min_val);\r\n            in30[k] = _mm_max_epi16(_mm_min_epi16(in30[k], max_val), min_val);\r\n            in31[k] = _mm_max_epi16(_mm_min_epi16(in31[k], max_val), min_val);\r\n        }\r\n    }\r\n\r\n    // Add\r\n    for (i = 0; i < 2; i++) {\r\n#define STORE_LINE(L0, L1, L2, L3, L4, L5, L6, L7, H0, H1, H2, H3, H4, H5, H6, H7, offsetV, offsetH) \\\r\n    _mm_storeu_si128((__m128i*)(dst + (0 + (offsetV)) * i_dst + (offsetH)+0), L0); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (0 + (offsetV)) * i_dst + (offsetH)+8), H0); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (1 + (offsetV)) * i_dst + (offsetH)+0), L1); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (1 + (offsetV)) * i_dst + (offsetH)+8), H1); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (2 + (offsetV)) * i_dst + (offsetH)+0), L2); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (2 + (offsetV)) * i_dst + (offsetH)+8), H2); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (3 + (offsetV)) * i_dst + (offsetH)+0), L3); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (3 + (offsetV)) * i_dst + (offsetH)+8), H3); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (4 + (offsetV)) * i_dst + (offsetH)+0), L4); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (4 + (offsetV)) * i_dst + (offsetH)+8), H4); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (5 + (offsetV)) * i_dst + (offsetH)+0), L5); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (5 + (offsetV)) * i_dst + (offsetH)+8), H5); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (6 + (offsetV)) * i_dst + (offsetH)+0), L6); \\\r\n    
_mm_storeu_si128((__m128i*)(dst + (6 + (offsetV)) * i_dst + (offsetH)+8), H6); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (7 + (offsetV)) * i_dst + (offsetH)+0), L7); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (7 + (offsetV)) * i_dst + (offsetH)+8), H7);\r\n\r\n        const int k = i * 2;\r\n        STORE_LINE(in00[k], in01[k], in02[k], in03[k], in04[k], in05[k], in06[k], in07[k], in00[k + 1], in01[k + 1], in02[k + 1], in03[k + 1], in04[k + 1], in05[k + 1], in06[k + 1], in07[k + 1], 0, i * 16)\r\n        STORE_LINE(in08[k], in09[k], in10[k], in11[k], in12[k], in13[k], in14[k], in15[k], in08[k + 1], in09[k + 1], in10[k + 1], in11[k + 1], in12[k + 1], in13[k + 1], in14[k + 1], in15[k + 1], 8, i * 16)\r\n        STORE_LINE(in16[k], in17[k], in18[k], in19[k], in20[k], in21[k], in22[k], in23[k], in16[k + 1], in17[k + 1], in18[k + 1], in19[k + 1], in20[k + 1], in21[k + 1], in22[k + 1], in23[k + 1], 16, i * 16)\r\n        STORE_LINE(in24[k], in25[k], in26[k], in27[k], in28[k], in29[k], in30[k], in31[k], in24[k + 1], in25[k + 1], in26[k + 1], in27[k + 1], in28[k + 1], in29[k + 1], in30[k + 1], in31[k + 1], 24, i * 16)\r\n#undef STORE_LINE\r\n    }\r\n\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_32x32_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // quad variant: only the top-left 8x8 block of coefficients is loaded as input\r\n    int a_flag = i_dst & 0x01;\r\n    int shift2 = 20 - g_bit_depth - a_flag;\r\n    int clip_depth2 = g_bit_depth + 1 + a_flag;\r\n\r\n    const __m128i c16_p45_p45 = _mm_set1_epi32(0x002D002D);\r\n    const __m128i c16_p43_p44 = _mm_set1_epi32(0x002B002C);\r\n    const __m128i c16_p41_p45 = _mm_set1_epi32(0x0029002D);\r\n    const __m128i c16_p23_p34 = _mm_set1_epi32(0x00170022);\r\n    const __m128i c16_p34_p44 = _mm_set1_epi32(0x0022002C);\r\n    const __m128i c16_n07_p15 = _mm_set1_epi32(0xFFF9000F);\r\n    const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);\r\n    const __m128i 
c16_p23_p43 = _mm_set1_epi32(0x0017002B);\r\n    const __m128i c16_n34_n07 = _mm_set1_epi32(0xFFDEFFF9);\r\n    const __m128i c16_p11_p41 = _mm_set1_epi32(0x000B0029);\r\n    const __m128i c16_n45_n27 = _mm_set1_epi32(0xFFD3FFE5);\r\n    const __m128i c16_n02_p39 = _mm_set1_epi32(0xFFFE0027);\r\n    const __m128i c16_n36_n41 = _mm_set1_epi32(0xFFDCFFD7);\r\n    const __m128i c16_n15_p36 = _mm_set1_epi32(0xFFF10024);\r\n    const __m128i c16_n11_n45 = _mm_set1_epi32(0xFFF5FFD3);\r\n    const __m128i c16_n27_p34 = _mm_set1_epi32(0xFFE50022);\r\n    const __m128i c16_p19_n39 = _mm_set1_epi32(0x0013FFD9);\r\n    const __m128i c16_n36_p30 = _mm_set1_epi32(0xFFDC001E);\r\n    const __m128i c16_p41_n23 = _mm_set1_epi32(0x0029FFE9);\r\n    const __m128i c16_n43_p27 = _mm_set1_epi32(0xFFD5001B);\r\n    const __m128i c16_p44_n02 = _mm_set1_epi32(0x002CFFFE);\r\n    const __m128i c16_n45_p23 = _mm_set1_epi32(0xFFD30017);\r\n    const __m128i c16_p27_p19 = _mm_set1_epi32(0x001B0013);\r\n    const __m128i c16_n44_p19 = _mm_set1_epi32(0xFFD40013);\r\n    const __m128i c16_n02_p36 = _mm_set1_epi32(0xFFFE0024);\r\n    const __m128i c16_n39_p15 = _mm_set1_epi32(0xFFD9000F);\r\n    const __m128i c16_n30_p45 = _mm_set1_epi32(0xFFE2002D);\r\n    const __m128i c16_n30_p11 = _mm_set1_epi32(0xFFE2000B);\r\n    const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n    const __m128i c16_n19_p07 = _mm_set1_epi32(0xFFED0007);\r\n    const __m128i c16_n39_p30 = _mm_set1_epi32(0xFFD9001E);\r\n    const __m128i c16_n07_p02 = _mm_set1_epi32(0xFFF90002);\r\n    const __m128i c16_n15_p11 = _mm_set1_epi32(0xFFF1000B);\r\n    const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);\r\n    const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);\r\n    const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);\r\n    const __m128i c16_n40_p29 = _mm_set1_epi32(0xFFD8001D);\r\n    const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);\r\n    const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);\r\n  
  const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);\r\n    const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);\r\n    const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);\r\n    const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);\r\n    const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);\r\n\r\n    __m128i c32_rnd = _mm_set1_epi32(16);   // add1\r\n    __m128i Zero_16 = _mm_set1_epi16(0);\r\n\r\n    int nShift = 5;\r\n    int i, part;\r\n\r\n    // DCT1\r\n    __m128i in00[4], in01[4], in02[4], in03[4], in04[4], in05[4], in06[4], in07[4], in08[4], in09[4], in10[4], in11[4], in12[4], in13[4], in14[4], in15[4];\r\n    __m128i in16[4], in17[4], in18[4], in19[4], in20[4], in21[4], in22[4], in23[4], in24[4], in25[4], in26[4], in27[4], in28[4], in29[4], in30[4], in31[4];\r\n    __m128i res00[4], res01[4], res02[4], res03[4], res04[4], res05[4], res06[4], res07[4], res08[4], res09[4], res10[4], res11[4], res12[4], res13[4], res14[4], res15[4];\r\n    __m128i res16[4], res17[4], res18[4], res19[4], res20[4], res21[4], res22[4], res23[4], res24[4], res25[4], res26[4], res27[4], res28[4], res29[4], res30[4], res31[4];\r\n\r\n    i_dst &= 0xFE;    /* remember to remove the flag bit */\r\n\r\n    in00[0] = _mm_loadu_si128((const __m128i*)&src[0 * 32]);\r\n    in01[0] = _mm_loadu_si128((const __m128i*)&src[1 * 32]);\r\n    in02[0] = _mm_loadu_si128((const __m128i*)&src[2 * 32]);\r\n    in03[0] = _mm_loadu_si128((const __m128i*)&src[3 * 32]);\r\n    in04[0] = _mm_loadu_si128((const __m128i*)&src[4 * 32]);\r\n    in05[0] = _mm_loadu_si128((const __m128i*)&src[5 * 32]);\r\n    in06[0] = _mm_loadu_si128((const __m128i*)&src[6 * 32]);\r\n    in07[0] = _mm_loadu_si128((const __m128i*)&src[7 * 32]);\r\n\r\n    //pass=1\r\n    const __m128i T_00_00A = _mm_unpacklo_epi16(in01[0], in03[0]);    // [33 13 32 12 31 11 30 10]\r\n    const __m128i 
T_00_00B = _mm_unpackhi_epi16(in01[0], in03[0]);    // [37 17 36 16 35 15 34 14]\r\n    const __m128i T_00_01A = _mm_unpacklo_epi16(in05[0], in07[0]);    // [ ]\r\n    const __m128i T_00_01B = _mm_unpackhi_epi16(in05[0], in07[0]);    // [ ]\r\n\r\n    const __m128i T_00_08A = _mm_unpacklo_epi16(in02[0], in06[0]);    // [ ]\r\n    const __m128i T_00_08B = _mm_unpackhi_epi16(in02[0], in06[0]);    // [ ]\r\n\r\n    const __m128i T_00_12A = _mm_unpacklo_epi16(in04[0], Zero_16);    // [ ]\r\n    const __m128i T_00_12B = _mm_unpackhi_epi16(in04[0], Zero_16);    // [ ]\r\n\r\n    const __m128i T_00_15A = _mm_unpacklo_epi16(in00[0], Zero_16);    //\r\n    const __m128i T_00_15B = _mm_unpackhi_epi16(in00[0], Zero_16);    // [ ]\r\n\r\n    __m128i O00A, O01A, O02A, O03A, O04A, O05A, O06A, O07A, O08A, O09A, O10A, O11A, O12A, O13A, O14A, O15A;\r\n    __m128i O00B, O01B, O02B, O03B, O04B, O05B, O06B, O07B, O08B, O09B, O10B, O11B, O12B, O13B, O14B, O15B;\r\n    __m128i EO0A, EO1A, EO2A, EO3A, EO4A, EO5A, EO6A, EO7A;\r\n    __m128i EO0B, EO1B, EO2B, EO3B, EO4B, EO5B, EO6B, EO7B;\r\n\r\n    O00A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_p45_p45), _mm_madd_epi16(T_00_01A, c16_p43_p44));\r\n    O01A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_p41_p45), _mm_madd_epi16(T_00_01A, c16_p23_p34));\r\n    O02A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_p34_p44), _mm_madd_epi16(T_00_01A, c16_n07_p15));\r\n    O03A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_p23_p43), _mm_madd_epi16(T_00_01A, c16_n34_n07));\r\n    O04A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_p11_p41), _mm_madd_epi16(T_00_01A, c16_n45_n27));\r\n    O05A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n02_p39), _mm_madd_epi16(T_00_01A, c16_n36_n41));\r\n    O06A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n15_p36), _mm_madd_epi16(T_00_01A, c16_n11_n45));\r\n    O07A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n27_p34), _mm_madd_epi16(T_00_01A, c16_p19_n39));\r\n    O08A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, 
c16_n36_p30), _mm_madd_epi16(T_00_01A, c16_p41_n23));\r\n    O09A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n43_p27), _mm_madd_epi16(T_00_01A, c16_p44_n02));\r\n    O10A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n45_p23), _mm_madd_epi16(T_00_01A, c16_p27_p19));\r\n    O11A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n44_p19), _mm_madd_epi16(T_00_01A, c16_n02_p36));\r\n    O12A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n39_p15), _mm_madd_epi16(T_00_01A, c16_n30_p45));\r\n    O13A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n30_p11), _mm_madd_epi16(T_00_01A, c16_n45_p43));\r\n    O14A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n19_p07), _mm_madd_epi16(T_00_01A, c16_n39_p30));\r\n    O15A = _mm_add_epi32(_mm_madd_epi16(T_00_00A, c16_n07_p02), _mm_madd_epi16(T_00_01A, c16_n15_p11));\r\n        \r\n    O00B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_p45_p45), _mm_madd_epi16(T_00_01B, c16_p43_p44));\r\n    O01B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_p41_p45), _mm_madd_epi16(T_00_01B, c16_p23_p34));\r\n    O02B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_p34_p44), _mm_madd_epi16(T_00_01B, c16_n07_p15));\r\n    O03B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_p23_p43), _mm_madd_epi16(T_00_01B, c16_n34_n07));\r\n    O04B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_p11_p41), _mm_madd_epi16(T_00_01B, c16_n45_n27));\r\n    O05B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n02_p39), _mm_madd_epi16(T_00_01B, c16_n36_n41));\r\n    O06B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n15_p36), _mm_madd_epi16(T_00_01B, c16_n11_n45));\r\n    O07B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n27_p34), _mm_madd_epi16(T_00_01B, c16_p19_n39));\r\n    O08B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n36_p30), _mm_madd_epi16(T_00_01B, c16_p41_n23));\r\n    O09B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n43_p27), _mm_madd_epi16(T_00_01B, c16_p44_n02));\r\n    O10B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n45_p23), _mm_madd_epi16(T_00_01B, c16_p27_p19));\r\n    
O11B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n44_p19), _mm_madd_epi16(T_00_01B, c16_n02_p36));\r\n    O12B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n39_p15), _mm_madd_epi16(T_00_01B, c16_n30_p45));\r\n    O13B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n30_p11), _mm_madd_epi16(T_00_01B, c16_n45_p43));\r\n    O14B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n19_p07), _mm_madd_epi16(T_00_01B, c16_n39_p30));\r\n    O15B = _mm_add_epi32(_mm_madd_epi16(T_00_00B, c16_n07_p02), _mm_madd_epi16(T_00_01B, c16_n15_p11));\r\n\r\n\r\n    EO0A = _mm_madd_epi16(T_00_08A, c16_p43_p45);\r\n    EO1A = _mm_madd_epi16(T_00_08A, c16_p29_p43);\r\n    EO2A = _mm_madd_epi16(T_00_08A, c16_p04_p40);\r\n    EO3A = _mm_madd_epi16(T_00_08A, c16_n21_p35);\r\n    EO4A = _mm_madd_epi16(T_00_08A, c16_n40_p29);\r\n    EO5A = _mm_madd_epi16(T_00_08A, c16_n45_p21);\r\n    EO6A = _mm_madd_epi16(T_00_08A, c16_n35_p13);\r\n    EO7A = _mm_madd_epi16(T_00_08A, c16_n13_p04);\r\n\r\n    EO0B = _mm_madd_epi16(T_00_08B, c16_p43_p45);\r\n    EO1B = _mm_madd_epi16(T_00_08B, c16_p29_p43);\r\n    EO2B = _mm_madd_epi16(T_00_08B, c16_p04_p40);\r\n    EO3B = _mm_madd_epi16(T_00_08B, c16_n21_p35);\r\n    EO4B = _mm_madd_epi16(T_00_08B, c16_n40_p29);\r\n    EO5B = _mm_madd_epi16(T_00_08B, c16_n45_p21);\r\n    EO6B = _mm_madd_epi16(T_00_08B, c16_n35_p13);\r\n    EO7B = _mm_madd_epi16(T_00_08B, c16_n13_p04);\r\n\r\n    {\r\n        const __m128i EEO0A = _mm_madd_epi16(T_00_12A, c16_p38_p44);\r\n        const __m128i EEO1A = _mm_madd_epi16(T_00_12A, c16_n09_p38);\r\n        const __m128i EEO2A = _mm_madd_epi16(T_00_12A, c16_n44_p25);\r\n        const __m128i EEO3A = _mm_madd_epi16(T_00_12A, c16_n25_p09);\r\n        const __m128i EEO0B = _mm_madd_epi16(T_00_12B, c16_p38_p44);\r\n        const __m128i EEO1B = _mm_madd_epi16(T_00_12B, c16_n09_p38);\r\n        const __m128i EEO2B = _mm_madd_epi16(T_00_12B, c16_n44_p25);\r\n        const __m128i EEO3B = _mm_madd_epi16(T_00_12B, c16_n25_p09);\r\n\r\n        const 
__m128i EEEE0A = _mm_madd_epi16(T_00_15A, c16_p32_p32);\r\n        const __m128i EEEE0B = _mm_madd_epi16(T_00_15B, c16_p32_p32);\r\n        const __m128i EEEE1A = _mm_madd_epi16(T_00_15A, c16_n32_p32);\r\n        const __m128i EEEE1B = _mm_madd_epi16(T_00_15B, c16_n32_p32);\r\n\r\n        const __m128i EEE0A = EEEE0A;    // EEE0 = EEEE0 + EEEO0\r\n        const __m128i EEE0B = EEEE0B;\r\n        const __m128i EEE1A = EEEE1A;    // EEE1 = EEEE1 + EEEO1\r\n        const __m128i EEE1B = EEEE1B;\r\n        const __m128i EEE3A = EEEE0A;    // EEE3 = EEEE0 - EEEO0\r\n        const __m128i EEE3B = EEEE0B;\r\n        const __m128i EEE2A = EEEE1A;    // EEE2 = EEEE1 - EEEO1\r\n        const __m128i EEE2B = EEEE1B;\r\n\r\n        const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n        const __m128i EE0B = _mm_add_epi32(EEE0B, EEO0B);\r\n        const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n        const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n        const __m128i EE2A = _mm_add_epi32(EEE2A, EEO2A);       // EE2 = EEE2 + EEO2\r\n        const __m128i EE2B = _mm_add_epi32(EEE2B, EEO2B);\r\n        const __m128i EE3A = _mm_add_epi32(EEE3A, EEO3A);       // EE3 = EEE3 + EEO3\r\n        const __m128i EE3B = _mm_add_epi32(EEE3B, EEO3B);\r\n        const __m128i EE7A = _mm_sub_epi32(EEE0A, EEO0A);       // EE7 = EEE0 - EEO0\r\n        const __m128i EE7B = _mm_sub_epi32(EEE0B, EEO0B);\r\n        const __m128i EE6A = _mm_sub_epi32(EEE1A, EEO1A);       // EE6 = EEE1 - EEO1\r\n        const __m128i EE6B = _mm_sub_epi32(EEE1B, EEO1B);\r\n        const __m128i EE5A = _mm_sub_epi32(EEE2A, EEO2A);       // EE5 = EEE2 - EEO2\r\n        const __m128i EE5B = _mm_sub_epi32(EEE2B, EEO2B);\r\n        const __m128i EE4A = _mm_sub_epi32(EEE3A, EEO3A);       // EE4 = EEE3 - EEO3\r\n        const __m128i EE4B = _mm_sub_epi32(EEE3B, EEO3B);\r\n\r\n        const __m128i E0A = _mm_add_epi32(EE0A, EO0A);          // E0 = EE0 + 
EO0\r\n        const __m128i E0B = _mm_add_epi32(EE0B, EO0B);\r\n        const __m128i E1A = _mm_add_epi32(EE1A, EO1A);          // E1 = EE1 + EO1\r\n        const __m128i E1B = _mm_add_epi32(EE1B, EO1B);\r\n        const __m128i E2A = _mm_add_epi32(EE2A, EO2A);          // E2 = EE2 + EO2\r\n        const __m128i E2B = _mm_add_epi32(EE2B, EO2B);\r\n        const __m128i E3A = _mm_add_epi32(EE3A, EO3A);          // E3 = EE3 + EO3\r\n        const __m128i E3B = _mm_add_epi32(EE3B, EO3B);\r\n        const __m128i E4A = _mm_add_epi32(EE4A, EO4A);          // E4 =\r\n        const __m128i E4B = _mm_add_epi32(EE4B, EO4B);\r\n        const __m128i E5A = _mm_add_epi32(EE5A, EO5A);          // E5 =\r\n        const __m128i E5B = _mm_add_epi32(EE5B, EO5B);\r\n        const __m128i E6A = _mm_add_epi32(EE6A, EO6A);          // E6 =\r\n        const __m128i E6B = _mm_add_epi32(EE6B, EO6B);\r\n        const __m128i E7A = _mm_add_epi32(EE7A, EO7A);          // E7 =\r\n        const __m128i E7B = _mm_add_epi32(EE7B, EO7B);\r\n        const __m128i EFA = _mm_sub_epi32(EE0A, EO0A);          // EF = EE0 - EO0\r\n        const __m128i EFB = _mm_sub_epi32(EE0B, EO0B);\r\n        const __m128i EEA = _mm_sub_epi32(EE1A, EO1A);          // EE = EE1 - EO1\r\n        const __m128i EEB = _mm_sub_epi32(EE1B, EO1B);\r\n        const __m128i EDA = _mm_sub_epi32(EE2A, EO2A);          // ED = EE2 - EO2\r\n        const __m128i EDB = _mm_sub_epi32(EE2B, EO2B);\r\n        const __m128i ECA = _mm_sub_epi32(EE3A, EO3A);          // EC = EE3 - EO3\r\n        const __m128i ECB = _mm_sub_epi32(EE3B, EO3B);\r\n        const __m128i EBA = _mm_sub_epi32(EE4A, EO4A);          // EB =\r\n        const __m128i EBB = _mm_sub_epi32(EE4B, EO4B);\r\n        const __m128i EAA = _mm_sub_epi32(EE5A, EO5A);          // EA =\r\n        const __m128i EAB = _mm_sub_epi32(EE5B, EO5B);\r\n        const __m128i E9A = _mm_sub_epi32(EE6A, EO6A);          // E9 =\r\n        const __m128i E9B = _mm_sub_epi32(EE6B, EO6B);\r\n   
     const __m128i E8A = _mm_sub_epi32(EE7A, EO7A);          // E8 =\r\n        const __m128i E8B = _mm_sub_epi32(EE7B, EO7B);\r\n\r\n        const __m128i T10A = _mm_add_epi32(E0A, c32_rnd);       // E0 + rnd\r\n        const __m128i T10B = _mm_add_epi32(E0B, c32_rnd);\r\n        const __m128i T11A = _mm_add_epi32(E1A, c32_rnd);       // E1 + rnd\r\n        const __m128i T11B = _mm_add_epi32(E1B, c32_rnd);\r\n        const __m128i T12A = _mm_add_epi32(E2A, c32_rnd);       // E2 + rnd\r\n        const __m128i T12B = _mm_add_epi32(E2B, c32_rnd);\r\n        const __m128i T13A = _mm_add_epi32(E3A, c32_rnd);       // E3 + rnd\r\n        const __m128i T13B = _mm_add_epi32(E3B, c32_rnd);\r\n        const __m128i T14A = _mm_add_epi32(E4A, c32_rnd);       // E4 + rnd\r\n        const __m128i T14B = _mm_add_epi32(E4B, c32_rnd);\r\n        const __m128i T15A = _mm_add_epi32(E5A, c32_rnd);       // E5 + rnd\r\n        const __m128i T15B = _mm_add_epi32(E5B, c32_rnd);\r\n        const __m128i T16A = _mm_add_epi32(E6A, c32_rnd);       // E6 + rnd\r\n        const __m128i T16B = _mm_add_epi32(E6B, c32_rnd);\r\n        const __m128i T17A = _mm_add_epi32(E7A, c32_rnd);       // E7 + rnd\r\n        const __m128i T17B = _mm_add_epi32(E7B, c32_rnd);\r\n        const __m128i T18A = _mm_add_epi32(E8A, c32_rnd);       // E8 + rnd\r\n        const __m128i T18B = _mm_add_epi32(E8B, c32_rnd);\r\n        const __m128i T19A = _mm_add_epi32(E9A, c32_rnd);       // E9 + rnd\r\n        const __m128i T19B = _mm_add_epi32(E9B, c32_rnd);\r\n        const __m128i T1AA = _mm_add_epi32(EAA, c32_rnd);       // E10 + rnd\r\n        const __m128i T1AB = _mm_add_epi32(EAB, c32_rnd);\r\n        const __m128i T1BA = _mm_add_epi32(EBA, c32_rnd);       // E11 + rnd\r\n        const __m128i T1BB = _mm_add_epi32(EBB, c32_rnd);\r\n        const __m128i T1CA = _mm_add_epi32(ECA, c32_rnd);       // E12 + rnd\r\n        const __m128i T1CB = _mm_add_epi32(ECB, c32_rnd);\r\n        const __m128i T1DA = 
_mm_add_epi32(EDA, c32_rnd);       // E13 + rnd\r\n        const __m128i T1DB = _mm_add_epi32(EDB, c32_rnd);\r\n        const __m128i T1EA = _mm_add_epi32(EEA, c32_rnd);       // E14 + rnd\r\n        const __m128i T1EB = _mm_add_epi32(EEB, c32_rnd);\r\n        const __m128i T1FA = _mm_add_epi32(EFA, c32_rnd);       // E15 + rnd\r\n        const __m128i T1FB = _mm_add_epi32(EFB, c32_rnd);\r\n\r\n        const __m128i T2_00A = _mm_add_epi32(T10A, O00A);       // E0 + O0 + rnd\r\n        const __m128i T2_00B = _mm_add_epi32(T10B, O00B);\r\n        const __m128i T2_01A = _mm_add_epi32(T11A, O01A);       // E1 + O1 + rnd\r\n        const __m128i T2_01B = _mm_add_epi32(T11B, O01B);\r\n        const __m128i T2_02A = _mm_add_epi32(T12A, O02A);       // E2 + O2 + rnd\r\n        const __m128i T2_02B = _mm_add_epi32(T12B, O02B);\r\n        const __m128i T2_03A = _mm_add_epi32(T13A, O03A);       // E3 + O3 + rnd\r\n        const __m128i T2_03B = _mm_add_epi32(T13B, O03B);\r\n        const __m128i T2_04A = _mm_add_epi32(T14A, O04A);       // E4\r\n        const __m128i T2_04B = _mm_add_epi32(T14B, O04B);\r\n        const __m128i T2_05A = _mm_add_epi32(T15A, O05A);       // E5\r\n        const __m128i T2_05B = _mm_add_epi32(T15B, O05B);\r\n        const __m128i T2_06A = _mm_add_epi32(T16A, O06A);       // E6\r\n        const __m128i T2_06B = _mm_add_epi32(T16B, O06B);\r\n        const __m128i T2_07A = _mm_add_epi32(T17A, O07A);       // E7\r\n        const __m128i T2_07B = _mm_add_epi32(T17B, O07B);\r\n        const __m128i T2_08A = _mm_add_epi32(T18A, O08A);       // E8\r\n        const __m128i T2_08B = _mm_add_epi32(T18B, O08B);\r\n        const __m128i T2_09A = _mm_add_epi32(T19A, O09A);       // E9\r\n        const __m128i T2_09B = _mm_add_epi32(T19B, O09B);\r\n        const __m128i T2_10A = _mm_add_epi32(T1AA, O10A);       // E10\r\n        const __m128i T2_10B = _mm_add_epi32(T1AB, O10B);\r\n        const __m128i T2_11A = _mm_add_epi32(T1BA, O11A);       // E11\r\n        
const __m128i T2_11B = _mm_add_epi32(T1BB, O11B);\r\n        const __m128i T2_12A = _mm_add_epi32(T1CA, O12A);       // E12\r\n        const __m128i T2_12B = _mm_add_epi32(T1CB, O12B);\r\n        const __m128i T2_13A = _mm_add_epi32(T1DA, O13A);       // E13\r\n        const __m128i T2_13B = _mm_add_epi32(T1DB, O13B);\r\n        const __m128i T2_14A = _mm_add_epi32(T1EA, O14A);       // E14\r\n        const __m128i T2_14B = _mm_add_epi32(T1EB, O14B);\r\n        const __m128i T2_15A = _mm_add_epi32(T1FA, O15A);       // E15\r\n        const __m128i T2_15B = _mm_add_epi32(T1FB, O15B);\r\n        const __m128i T2_31A = _mm_sub_epi32(T10A, O00A);       // E0 - O0 + rnd\r\n        const __m128i T2_31B = _mm_sub_epi32(T10B, O00B);\r\n        const __m128i T2_30A = _mm_sub_epi32(T11A, O01A);       // E1 - O1 + rnd\r\n        const __m128i T2_30B = _mm_sub_epi32(T11B, O01B);\r\n        const __m128i T2_29A = _mm_sub_epi32(T12A, O02A);       // E2 - O2 + rnd\r\n        const __m128i T2_29B = _mm_sub_epi32(T12B, O02B);\r\n        const __m128i T2_28A = _mm_sub_epi32(T13A, O03A);       // E3 - O3 + rnd\r\n        const __m128i T2_28B = _mm_sub_epi32(T13B, O03B);\r\n        const __m128i T2_27A = _mm_sub_epi32(T14A, O04A);       // E4\r\n        const __m128i T2_27B = _mm_sub_epi32(T14B, O04B);\r\n        const __m128i T2_26A = _mm_sub_epi32(T15A, O05A);       // E5\r\n        const __m128i T2_26B = _mm_sub_epi32(T15B, O05B);\r\n        const __m128i T2_25A = _mm_sub_epi32(T16A, O06A);       // E6\r\n        const __m128i T2_25B = _mm_sub_epi32(T16B, O06B);\r\n        const __m128i T2_24A = _mm_sub_epi32(T17A, O07A);       // E7\r\n        const __m128i T2_24B = _mm_sub_epi32(T17B, O07B);\r\n        const __m128i T2_23A = _mm_sub_epi32(T18A, O08A);       //\r\n        const __m128i T2_23B = _mm_sub_epi32(T18B, O08B);\r\n        const __m128i T2_22A = _mm_sub_epi32(T19A, O09A);       //\r\n        const __m128i T2_22B = _mm_sub_epi32(T19B, O09B);\r\n        const __m128i T2_21A 
= _mm_sub_epi32(T1AA, O10A);       //\r\n        const __m128i T2_21B = _mm_sub_epi32(T1AB, O10B);\r\n        const __m128i T2_20A = _mm_sub_epi32(T1BA, O11A);       //\r\n        const __m128i T2_20B = _mm_sub_epi32(T1BB, O11B);\r\n        const __m128i T2_19A = _mm_sub_epi32(T1CA, O12A);       //\r\n        const __m128i T2_19B = _mm_sub_epi32(T1CB, O12B);\r\n        const __m128i T2_18A = _mm_sub_epi32(T1DA, O13A);       //\r\n        const __m128i T2_18B = _mm_sub_epi32(T1DB, O13B);\r\n        const __m128i T2_17A = _mm_sub_epi32(T1EA, O14A);       //\r\n        const __m128i T2_17B = _mm_sub_epi32(T1EB, O14B);\r\n        const __m128i T2_16A = _mm_sub_epi32(T1FA, O15A);       //\r\n        const __m128i T2_16B = _mm_sub_epi32(T1FB, O15B);\r\n\r\n        const __m128i T3_00A = _mm_srai_epi32(T2_00A, nShift);  // [30 20 10 00]\r\n        const __m128i T3_00B = _mm_srai_epi32(T2_00B, nShift);  // [70 60 50 40]\r\n        const __m128i T3_01A = _mm_srai_epi32(T2_01A, nShift);  // [31 21 11 01]\r\n        const __m128i T3_01B = _mm_srai_epi32(T2_01B, nShift);  // [71 61 51 41]\r\n        const __m128i T3_02A = _mm_srai_epi32(T2_02A, nShift);  // [32 22 12 02]\r\n        const __m128i T3_02B = _mm_srai_epi32(T2_02B, nShift);  // [72 62 52 42]\r\n        const __m128i T3_03A = _mm_srai_epi32(T2_03A, nShift);  // [33 23 13 03]\r\n        const __m128i T3_03B = _mm_srai_epi32(T2_03B, nShift);  // [73 63 53 43]\r\n        const __m128i T3_04A = _mm_srai_epi32(T2_04A, nShift);  // [34 24 14 04]\r\n        const __m128i T3_04B = _mm_srai_epi32(T2_04B, nShift);  // [74 64 54 44]\r\n        const __m128i T3_05A = _mm_srai_epi32(T2_05A, nShift);  // [35 25 15 05]\r\n        const __m128i T3_05B = _mm_srai_epi32(T2_05B, nShift);  // [75 65 55 45]\r\n        const __m128i T3_06A = _mm_srai_epi32(T2_06A, nShift);  // [36 26 16 06]\r\n        const __m128i T3_06B = _mm_srai_epi32(T2_06B, nShift);  // [76 66 56 46]\r\n        const __m128i T3_07A = _mm_srai_epi32(T2_07A, nShift); 
 // [37 27 17 07]\r\n        const __m128i T3_07B = _mm_srai_epi32(T2_07B, nShift);  // [77 67 57 47]\r\n        const __m128i T3_08A = _mm_srai_epi32(T2_08A, nShift);  // [30 20 10 00] x8\r\n        const __m128i T3_08B = _mm_srai_epi32(T2_08B, nShift);  // [70 60 50 40]\r\n        const __m128i T3_09A = _mm_srai_epi32(T2_09A, nShift);  // [31 21 11 01] x9\r\n        const __m128i T3_09B = _mm_srai_epi32(T2_09B, nShift);  // [71 61 51 41]\r\n        const __m128i T3_10A = _mm_srai_epi32(T2_10A, nShift);  // [32 22 12 02] xA\r\n        const __m128i T3_10B = _mm_srai_epi32(T2_10B, nShift);  // [72 62 52 42]\r\n        const __m128i T3_11A = _mm_srai_epi32(T2_11A, nShift);  // [33 23 13 03] xB\r\n        const __m128i T3_11B = _mm_srai_epi32(T2_11B, nShift);  // [73 63 53 43]\r\n        const __m128i T3_12A = _mm_srai_epi32(T2_12A, nShift);  // [33 24 14 04] xC\r\n        const __m128i T3_12B = _mm_srai_epi32(T2_12B, nShift);  // [74 64 54 44]\r\n        const __m128i T3_13A = _mm_srai_epi32(T2_13A, nShift);  // [35 25 15 05] xD\r\n        const __m128i T3_13B = _mm_srai_epi32(T2_13B, nShift);  // [75 65 55 45]\r\n        const __m128i T3_14A = _mm_srai_epi32(T2_14A, nShift);  // [36 26 16 06] xE\r\n        const __m128i T3_14B = _mm_srai_epi32(T2_14B, nShift);  // [76 66 56 46]\r\n        const __m128i T3_15A = _mm_srai_epi32(T2_15A, nShift);  // [37 27 17 07] xF\r\n        const __m128i T3_15B = _mm_srai_epi32(T2_15B, nShift);  // [77 67 57 47]\r\n\r\n        const __m128i T3_16A = _mm_srai_epi32(T2_16A, nShift);  // [30 20 10 00]\r\n        const __m128i T3_16B = _mm_srai_epi32(T2_16B, nShift);  // [70 60 50 40]\r\n        const __m128i T3_17A = _mm_srai_epi32(T2_17A, nShift);  // [31 21 11 01]\r\n        const __m128i T3_17B = _mm_srai_epi32(T2_17B, nShift);  // [71 61 51 41]\r\n        const __m128i T3_18A = _mm_srai_epi32(T2_18A, nShift);  // [32 22 12 02]\r\n        const __m128i T3_18B = _mm_srai_epi32(T2_18B, nShift);  // [72 62 52 42]\r\n        const 
__m128i T3_19A = _mm_srai_epi32(T2_19A, nShift);  // [33 23 13 03]\r\n        const __m128i T3_19B = _mm_srai_epi32(T2_19B, nShift);  // [73 63 53 43]\r\n        const __m128i T3_20A = _mm_srai_epi32(T2_20A, nShift);  // [34 24 14 04]\r\n        const __m128i T3_20B = _mm_srai_epi32(T2_20B, nShift);  // [74 64 54 44]\r\n        const __m128i T3_21A = _mm_srai_epi32(T2_21A, nShift);  // [35 25 15 05]\r\n        const __m128i T3_21B = _mm_srai_epi32(T2_21B, nShift);  // [75 65 55 45]\r\n        const __m128i T3_22A = _mm_srai_epi32(T2_22A, nShift);  // [36 26 16 06]\r\n        const __m128i T3_22B = _mm_srai_epi32(T2_22B, nShift);  // [76 66 56 46]\r\n        const __m128i T3_23A = _mm_srai_epi32(T2_23A, nShift);  // [37 27 17 07]\r\n        const __m128i T3_23B = _mm_srai_epi32(T2_23B, nShift);  // [77 67 57 47]\r\n        const __m128i T3_24A = _mm_srai_epi32(T2_24A, nShift);  // [30 20 10 00] x8\r\n        const __m128i T3_24B = _mm_srai_epi32(T2_24B, nShift);  // [70 60 50 40]\r\n        const __m128i T3_25A = _mm_srai_epi32(T2_25A, nShift);  // [31 21 11 01] x9\r\n        const __m128i T3_25B = _mm_srai_epi32(T2_25B, nShift);  // [71 61 51 41]\r\n        const __m128i T3_26A = _mm_srai_epi32(T2_26A, nShift);  // [32 22 12 02] xA\r\n        const __m128i T3_26B = _mm_srai_epi32(T2_26B, nShift);  // [72 62 52 42]\r\n        const __m128i T3_27A = _mm_srai_epi32(T2_27A, nShift);  // [33 23 13 03] xB\r\n        const __m128i T3_27B = _mm_srai_epi32(T2_27B, nShift);  // [73 63 53 43]\r\n        const __m128i T3_28A = _mm_srai_epi32(T2_28A, nShift);  // [34 24 14 04] xC\r\n        const __m128i T3_28B = _mm_srai_epi32(T2_28B, nShift);  // [74 64 54 44]\r\n        const __m128i T3_29A = _mm_srai_epi32(T2_29A, nShift);  // [35 25 15 05] xD\r\n        const __m128i T3_29B = _mm_srai_epi32(T2_29B, nShift);  // [75 65 55 45]\r\n        const __m128i T3_30A = _mm_srai_epi32(T2_30A, nShift);  // [36 26 16 06] xE\r\n        const __m128i T3_30B = _mm_srai_epi32(T2_30B, 
nShift);  // [76 66 56 46]\r\n        const __m128i T3_31A = _mm_srai_epi32(T2_31A, nShift);  // [37 27 17 07] xF\r\n        const __m128i T3_31B = _mm_srai_epi32(T2_31B, nShift);  // [77 67 57 47]\r\n\r\n        res00[0] = _mm_packs_epi32(T3_00A, T3_00B);          // [70 60 50 40 30 20 10 00]\r\n        res01[0] = _mm_packs_epi32(T3_01A, T3_01B);          // [71 61 51 41 31 21 11 01]\r\n        res02[0] = _mm_packs_epi32(T3_02A, T3_02B);          // [72 62 52 42 32 22 12 02]\r\n        res03[0] = _mm_packs_epi32(T3_03A, T3_03B);          // [73 63 53 43 33 23 13 03]\r\n        res04[0] = _mm_packs_epi32(T3_04A, T3_04B);          // [74 64 54 44 34 24 14 04]\r\n        res05[0] = _mm_packs_epi32(T3_05A, T3_05B);          // [75 65 55 45 35 25 15 05]\r\n        res06[0] = _mm_packs_epi32(T3_06A, T3_06B);          // [76 66 56 46 36 26 16 06]\r\n        res07[0] = _mm_packs_epi32(T3_07A, T3_07B);          // [77 67 57 47 37 27 17 07]\r\n        res08[0] = _mm_packs_epi32(T3_08A, T3_08B);          // [A0 ... 80]\r\n        res09[0] = _mm_packs_epi32(T3_09A, T3_09B);          // [A1 ... 81]\r\n        res10[0] = _mm_packs_epi32(T3_10A, T3_10B);          // [A2 ... 82]\r\n        res11[0] = _mm_packs_epi32(T3_11A, T3_11B);          // [A3 ... 83]\r\n        res12[0] = _mm_packs_epi32(T3_12A, T3_12B);          // [A4 ... 84]\r\n        res13[0] = _mm_packs_epi32(T3_13A, T3_13B);          // [A5 ... 85]\r\n        res14[0] = _mm_packs_epi32(T3_14A, T3_14B);          // [A6 ... 86]\r\n        res15[0] = _mm_packs_epi32(T3_15A, T3_15B);          // [A7 ... 
87]\r\n        res16[0] = _mm_packs_epi32(T3_16A, T3_16B);\r\n        res17[0] = _mm_packs_epi32(T3_17A, T3_17B);\r\n        res18[0] = _mm_packs_epi32(T3_18A, T3_18B);\r\n        res19[0] = _mm_packs_epi32(T3_19A, T3_19B);\r\n        res20[0] = _mm_packs_epi32(T3_20A, T3_20B);\r\n        res21[0] = _mm_packs_epi32(T3_21A, T3_21B);\r\n        res22[0] = _mm_packs_epi32(T3_22A, T3_22B);\r\n        res23[0] = _mm_packs_epi32(T3_23A, T3_23B);\r\n        res24[0] = _mm_packs_epi32(T3_24A, T3_24B);\r\n        res25[0] = _mm_packs_epi32(T3_25A, T3_25B);\r\n        res26[0] = _mm_packs_epi32(T3_26A, T3_26B);\r\n        res27[0] = _mm_packs_epi32(T3_27A, T3_27B);\r\n        res28[0] = _mm_packs_epi32(T3_28A, T3_28B);\r\n        res29[0] = _mm_packs_epi32(T3_29A, T3_29B);\r\n        res30[0] = _mm_packs_epi32(T3_30A, T3_30B);\r\n        res31[0] = _mm_packs_epi32(T3_31A, T3_31B);\r\n    }\r\n\r\n    //transpose matrix 8x8 16bit.\r\n    {\r\n        __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n        __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0 = 
_mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1 = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2 = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3 = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4 = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5 = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6 = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7 = _mm_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n        TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n        \r\n        TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1])\r\n        \r\n        TRANSPOSE_8x8_16BIT(res16[0], res17[0], res18[0], res19[0], res20[0], res21[0], res22[0], res23[0], in00[2], in01[2], in02[2], in03[2], in04[2], in05[2], in06[2], in07[2])\r\n        \r\n        TRANSPOSE_8x8_16BIT(res24[0], res25[0], res26[0], res27[0], res28[0], res29[0], res30[0], res31[0], in00[3], in01[3], in02[3], in03[3], in04[3], in05[3], in06[3], in07[3])\r\n#undef TRANSPOSE_8x8_16BIT\r\n    }\r\n\r\n\r\n    //pass=2\r\n    c32_rnd = _mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n    nShift = shift2;\r\n    for (part = 0; part < 4; part++) {\r\n        const __m128i T_00_00_A = _mm_unpacklo_epi16(in01[part], in03[part]);    // [33 13 32 12 31 11 30 10]\r\n        const __m128i T_00_00_B = _mm_unpackhi_epi16(in01[part], in03[part]);    // [37 17 36 16 35 15 34 14]\r\n        const __m128i T_00_01_A = _mm_unpacklo_epi16(in05[part], in07[part]);    // [ ]\r\n        const __m128i T_00_01_B = _mm_unpackhi_epi16(in05[part], in07[part]);    // [ ]\r\n\r\n        const __m128i T_00_08_A = _mm_unpacklo_epi16(in02[part], in06[part]);    // [ ]\r\n        const __m128i T_00_08_B = _mm_unpackhi_epi16(in02[part], in06[part]);    // [ ]\r\n\r\n        const __m128i T_00_12_A = 
_mm_unpacklo_epi16(in04[part], Zero_16);    // [ ]\r\n        const __m128i T_00_12_B = _mm_unpackhi_epi16(in04[part], Zero_16);    // [ ]\r\n\r\n        const __m128i T_00_15_A = _mm_unpacklo_epi16(in00[part], Zero_16);    //\r\n        const __m128i T_00_15_B = _mm_unpackhi_epi16(in00[part], Zero_16);    // [ ]\r\n\r\n        //__m128i O00A, O01A, O02A, O03A, O04A, O05A, O06A, O07A, O08A, O09A, O10A, O11A, O12A, O13A, O14A, O15A;\r\n        //__m128i O00B, O01B, O02B, O03B, O04B, O05B, O06B, O07B, O08B, O09B, O10B, O11B, O12B, O13B, O14B, O15B;\r\n        //__m128i EO0A, EO1A, EO2A, EO3A, EO4A, EO5A, EO6A, EO7A;\r\n        //__m128i EO0B, EO1B, EO2B, EO3B, EO4B, EO5B, EO6B, EO7B;\r\n        \r\n        O00A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_p45_p45), _mm_madd_epi16(T_00_01_A, c16_p43_p44));\r\n        O01A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_p41_p45), _mm_madd_epi16(T_00_01_A, c16_p23_p34));\r\n        O02A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_p34_p44), _mm_madd_epi16(T_00_01_A, c16_n07_p15));\r\n        O03A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_p23_p43), _mm_madd_epi16(T_00_01_A, c16_n34_n07));\r\n        O04A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_p11_p41), _mm_madd_epi16(T_00_01_A, c16_n45_n27));\r\n        O05A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n02_p39), _mm_madd_epi16(T_00_01_A, c16_n36_n41));\r\n        O06A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n15_p36), _mm_madd_epi16(T_00_01_A, c16_n11_n45));\r\n        O07A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n27_p34), _mm_madd_epi16(T_00_01_A, c16_p19_n39));\r\n        O08A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n36_p30), _mm_madd_epi16(T_00_01_A, c16_p41_n23));\r\n        O09A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n43_p27), _mm_madd_epi16(T_00_01_A, c16_p44_n02));\r\n        O10A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n45_p23), _mm_madd_epi16(T_00_01_A, c16_p27_p19));\r\n        O11A = 
_mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n44_p19), _mm_madd_epi16(T_00_01_A, c16_n02_p36));\r\n        O12A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n39_p15), _mm_madd_epi16(T_00_01_A, c16_n30_p45));\r\n        O13A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n30_p11), _mm_madd_epi16(T_00_01_A, c16_n45_p43));\r\n        O14A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n19_p07), _mm_madd_epi16(T_00_01_A, c16_n39_p30));\r\n        O15A = _mm_add_epi32(_mm_madd_epi16(T_00_00_A, c16_n07_p02), _mm_madd_epi16(T_00_01_A, c16_n15_p11));\r\n\r\n        O00B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_p45_p45), _mm_madd_epi16(T_00_01_B, c16_p43_p44));\r\n        O01B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_p41_p45), _mm_madd_epi16(T_00_01_B, c16_p23_p34));\r\n        O02B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_p34_p44), _mm_madd_epi16(T_00_01_B, c16_n07_p15));\r\n        O03B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_p23_p43), _mm_madd_epi16(T_00_01_B, c16_n34_n07));\r\n        O04B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_p11_p41), _mm_madd_epi16(T_00_01_B, c16_n45_n27));\r\n        O05B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n02_p39), _mm_madd_epi16(T_00_01_B, c16_n36_n41));\r\n        O06B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n15_p36), _mm_madd_epi16(T_00_01_B, c16_n11_n45));\r\n        O07B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n27_p34), _mm_madd_epi16(T_00_01_B, c16_p19_n39));\r\n        O08B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n36_p30), _mm_madd_epi16(T_00_01_B, c16_p41_n23));\r\n        O09B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n43_p27), _mm_madd_epi16(T_00_01_B, c16_p44_n02));\r\n        O10B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n45_p23), _mm_madd_epi16(T_00_01_B, c16_p27_p19));\r\n        O11B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n44_p19), _mm_madd_epi16(T_00_01_B, c16_n02_p36));\r\n        O12B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n39_p15), 
_mm_madd_epi16(T_00_01_B, c16_n30_p45));\r\n        O13B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n30_p11), _mm_madd_epi16(T_00_01_B, c16_n45_p43));\r\n        O14B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n19_p07), _mm_madd_epi16(T_00_01_B, c16_n39_p30));\r\n        O15B = _mm_add_epi32(_mm_madd_epi16(T_00_00_B, c16_n07_p02), _mm_madd_epi16(T_00_01_B, c16_n15_p11));\r\n\r\n        EO0A = _mm_madd_epi16(T_00_08_A, c16_p43_p45);\r\n        EO1A = _mm_madd_epi16(T_00_08_A, c16_p29_p43);\r\n        EO2A = _mm_madd_epi16(T_00_08_A, c16_p04_p40);\r\n        EO3A = _mm_madd_epi16(T_00_08_A, c16_n21_p35);\r\n        EO4A = _mm_madd_epi16(T_00_08_A, c16_n40_p29);\r\n        EO5A = _mm_madd_epi16(T_00_08_A, c16_n45_p21);\r\n        EO6A = _mm_madd_epi16(T_00_08_A, c16_n35_p13);\r\n        EO7A = _mm_madd_epi16(T_00_08_A, c16_n13_p04);\r\n\r\n        EO0B = _mm_madd_epi16(T_00_08_B, c16_p43_p45);\r\n        EO1B = _mm_madd_epi16(T_00_08_B, c16_p29_p43);\r\n        EO2B = _mm_madd_epi16(T_00_08_B, c16_p04_p40);\r\n        EO3B = _mm_madd_epi16(T_00_08_B, c16_n21_p35);\r\n        EO4B = _mm_madd_epi16(T_00_08_B, c16_n40_p29);\r\n        EO5B = _mm_madd_epi16(T_00_08_B, c16_n45_p21);\r\n        EO6B = _mm_madd_epi16(T_00_08_B, c16_n35_p13);\r\n        EO7B = _mm_madd_epi16(T_00_08_B, c16_n13_p04);\r\n\r\n        {\r\n            const __m128i EEO0A = _mm_madd_epi16(T_00_12_A, c16_p38_p44);\r\n            const __m128i EEO1A = _mm_madd_epi16(T_00_12_A, c16_n09_p38);\r\n            const __m128i EEO2A = _mm_madd_epi16(T_00_12_A, c16_n44_p25);\r\n            const __m128i EEO3A = _mm_madd_epi16(T_00_12_A, c16_n25_p09);\r\n            const __m128i EEO0B = _mm_madd_epi16(T_00_12_B, c16_p38_p44);\r\n            const __m128i EEO1B = _mm_madd_epi16(T_00_12_B, c16_n09_p38);\r\n            const __m128i EEO2B = _mm_madd_epi16(T_00_12_B, c16_n44_p25);\r\n            const __m128i EEO3B = _mm_madd_epi16(T_00_12_B, c16_n25_p09);\r\n\r\n            const __m128i EEEE0A = 
_mm_madd_epi16(T_00_15_A, c16_p32_p32);\r\n            const __m128i EEEE0B = _mm_madd_epi16(T_00_15_B, c16_p32_p32);\r\n            const __m128i EEEE1A = _mm_madd_epi16(T_00_15_A, c16_n32_p32);\r\n            const __m128i EEEE1B = _mm_madd_epi16(T_00_15_B, c16_n32_p32);\r\n\r\n            const __m128i EEE0A = EEEE0A;    // EEE0 = EEEE0 + EEEO0\r\n            const __m128i EEE0B = EEEE0B;\r\n            const __m128i EEE1A = EEEE1A;    // EEE1 = EEEE1 + EEEO1\r\n            const __m128i EEE1B = EEEE1B;\r\n            const __m128i EEE3A = EEEE0A;    // EEE3 = EEEE0 - EEEO0\r\n            const __m128i EEE3B = EEEE0B;\r\n            const __m128i EEE2A = EEEE1A;    // EEE2 = EEEE1 - EEEO1\r\n            const __m128i EEE2B = EEEE1B;\r\n\r\n            const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n            const __m128i EE0B = _mm_add_epi32(EEE0B, EEO0B);\r\n            const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n            const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n            const __m128i EE2A = _mm_add_epi32(EEE2A, EEO2A);       // EE2 = EEE2 + EEO2\r\n            const __m128i EE2B = _mm_add_epi32(EEE2B, EEO2B);\r\n            const __m128i EE3A = _mm_add_epi32(EEE3A, EEO3A);       // EE3 = EEE3 + EEO3\r\n            const __m128i EE3B = _mm_add_epi32(EEE3B, EEO3B);\r\n            const __m128i EE7A = _mm_sub_epi32(EEE0A, EEO0A);       // EE7 = EEE0 - EEO0\r\n            const __m128i EE7B = _mm_sub_epi32(EEE0B, EEO0B);\r\n            const __m128i EE6A = _mm_sub_epi32(EEE1A, EEO1A);       // EE6 = EEE1 - EEO1\r\n            const __m128i EE6B = _mm_sub_epi32(EEE1B, EEO1B);\r\n            const __m128i EE5A = _mm_sub_epi32(EEE2A, EEO2A);       // EE5 = EEE2 - EEO2\r\n            const __m128i EE5B = _mm_sub_epi32(EEE2B, EEO2B);\r\n            const __m128i EE4A = _mm_sub_epi32(EEE3A, EEO3A);       // EE4 = EEE3 - EEO3\r\n            const __m128i EE4B = _mm_sub_epi32(EEE3B, 
EEO3B);\r\n\r\n            const __m128i E0A = _mm_add_epi32(EE0A, EO0A);          // E0 = EE0 + EO0\r\n            const __m128i E0B = _mm_add_epi32(EE0B, EO0B);\r\n            const __m128i E1A = _mm_add_epi32(EE1A, EO1A);          // E1 = EE1 + EO1\r\n            const __m128i E1B = _mm_add_epi32(EE1B, EO1B);\r\n            const __m128i E2A = _mm_add_epi32(EE2A, EO2A);          // E2 = EE2 + EO2\r\n            const __m128i E2B = _mm_add_epi32(EE2B, EO2B);\r\n            const __m128i E3A = _mm_add_epi32(EE3A, EO3A);          // E3 = EE3 + EO3\r\n            const __m128i E3B = _mm_add_epi32(EE3B, EO3B);\r\n            const __m128i E4A = _mm_add_epi32(EE4A, EO4A);          // E4 = EE4 + EO4\r\n            const __m128i E4B = _mm_add_epi32(EE4B, EO4B);\r\n            const __m128i E5A = _mm_add_epi32(EE5A, EO5A);          // E5 = EE5 + EO5\r\n            const __m128i E5B = _mm_add_epi32(EE5B, EO5B);\r\n            const __m128i E6A = _mm_add_epi32(EE6A, EO6A);          // E6 = EE6 + EO6\r\n            const __m128i E6B = _mm_add_epi32(EE6B, EO6B);\r\n            const __m128i E7A = _mm_add_epi32(EE7A, EO7A);          // E7 = EE7 + EO7\r\n            const __m128i E7B = _mm_add_epi32(EE7B, EO7B);\r\n            const __m128i EFA = _mm_sub_epi32(EE0A, EO0A);          // EF = EE0 - EO0\r\n            const __m128i EFB = _mm_sub_epi32(EE0B, EO0B);\r\n            const __m128i EEA = _mm_sub_epi32(EE1A, EO1A);          // EE = EE1 - EO1\r\n            const __m128i EEB = _mm_sub_epi32(EE1B, EO1B);\r\n            const __m128i EDA = _mm_sub_epi32(EE2A, EO2A);          // ED = EE2 - EO2\r\n            const __m128i EDB = _mm_sub_epi32(EE2B, EO2B);\r\n            const __m128i ECA = _mm_sub_epi32(EE3A, EO3A);          // EC = EE3 - EO3\r\n            const __m128i ECB = _mm_sub_epi32(EE3B, EO3B);\r\n            const __m128i EBA = _mm_sub_epi32(EE4A, EO4A);          // EB = EE4 - EO4\r\n            const __m128i EBB = _mm_sub_epi32(EE4B, EO4B);\r\n            const __m128i EAA = _mm_sub_epi32(EE5A, EO5A);          // 
EA =\r\n            const __m128i EAB = _mm_sub_epi32(EE5B, EO5B);\r\n            const __m128i E9A = _mm_sub_epi32(EE6A, EO6A);          // E9 =\r\n            const __m128i E9B = _mm_sub_epi32(EE6B, EO6B);\r\n            const __m128i E8A = _mm_sub_epi32(EE7A, EO7A);          // E8 =\r\n            const __m128i E8B = _mm_sub_epi32(EE7B, EO7B);\r\n\r\n            const __m128i T10A = _mm_add_epi32(E0A, c32_rnd);       // E0 + rnd\r\n            const __m128i T10B = _mm_add_epi32(E0B, c32_rnd);\r\n            const __m128i T11A = _mm_add_epi32(E1A, c32_rnd);       // E1 + rnd\r\n            const __m128i T11B = _mm_add_epi32(E1B, c32_rnd);\r\n            const __m128i T12A = _mm_add_epi32(E2A, c32_rnd);       // E2 + rnd\r\n            const __m128i T12B = _mm_add_epi32(E2B, c32_rnd);\r\n            const __m128i T13A = _mm_add_epi32(E3A, c32_rnd);       // E3 + rnd\r\n            const __m128i T13B = _mm_add_epi32(E3B, c32_rnd);\r\n            const __m128i T14A = _mm_add_epi32(E4A, c32_rnd);       // E4 + rnd\r\n            const __m128i T14B = _mm_add_epi32(E4B, c32_rnd);\r\n            const __m128i T15A = _mm_add_epi32(E5A, c32_rnd);       // E5 + rnd\r\n            const __m128i T15B = _mm_add_epi32(E5B, c32_rnd);\r\n            const __m128i T16A = _mm_add_epi32(E6A, c32_rnd);       // E6 + rnd\r\n            const __m128i T16B = _mm_add_epi32(E6B, c32_rnd);\r\n            const __m128i T17A = _mm_add_epi32(E7A, c32_rnd);       // E7 + rnd\r\n            const __m128i T17B = _mm_add_epi32(E7B, c32_rnd);\r\n            const __m128i T18A = _mm_add_epi32(E8A, c32_rnd);       // E8 + rnd\r\n            const __m128i T18B = _mm_add_epi32(E8B, c32_rnd);\r\n            const __m128i T19A = _mm_add_epi32(E9A, c32_rnd);       // E9 + rnd\r\n            const __m128i T19B = _mm_add_epi32(E9B, c32_rnd);\r\n            const __m128i T1AA = _mm_add_epi32(EAA, c32_rnd);       // E10 + rnd\r\n            const __m128i T1AB = _mm_add_epi32(EAB, c32_rnd);\r\n            
const __m128i T1BA = _mm_add_epi32(EBA, c32_rnd);       // E11 + rnd\r\n            const __m128i T1BB = _mm_add_epi32(EBB, c32_rnd);\r\n            const __m128i T1CA = _mm_add_epi32(ECA, c32_rnd);       // E12 + rnd\r\n            const __m128i T1CB = _mm_add_epi32(ECB, c32_rnd);\r\n            const __m128i T1DA = _mm_add_epi32(EDA, c32_rnd);       // E13 + rnd\r\n            const __m128i T1DB = _mm_add_epi32(EDB, c32_rnd);\r\n            const __m128i T1EA = _mm_add_epi32(EEA, c32_rnd);       // E14 + rnd\r\n            const __m128i T1EB = _mm_add_epi32(EEB, c32_rnd);\r\n            const __m128i T1FA = _mm_add_epi32(EFA, c32_rnd);       // E15 + rnd\r\n            const __m128i T1FB = _mm_add_epi32(EFB, c32_rnd);\r\n\r\n            const __m128i T2_00A = _mm_add_epi32(T10A, O00A);       // E0 + O0 + rnd\r\n            const __m128i T2_00B = _mm_add_epi32(T10B, O00B);\r\n            const __m128i T2_01A = _mm_add_epi32(T11A, O01A);       // E1 + O1 + rnd\r\n            const __m128i T2_01B = _mm_add_epi32(T11B, O01B);\r\n            const __m128i T2_02A = _mm_add_epi32(T12A, O02A);       // E2 + O2 + rnd\r\n            const __m128i T2_02B = _mm_add_epi32(T12B, O02B);\r\n            const __m128i T2_03A = _mm_add_epi32(T13A, O03A);       // E3 + O3 + rnd\r\n            const __m128i T2_03B = _mm_add_epi32(T13B, O03B);\r\n            const __m128i T2_04A = _mm_add_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_04B = _mm_add_epi32(T14B, O04B);\r\n            const __m128i T2_05A = _mm_add_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_05B = _mm_add_epi32(T15B, O05B);\r\n            const __m128i T2_06A = _mm_add_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_06B = _mm_add_epi32(T16B, O06B);\r\n            const __m128i T2_07A = _mm_add_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_07B = _mm_add_epi32(T17B, O07B);\r\n            const __m128i T2_08A = _mm_add_epi32(T18A, O08A);       // E8\r\n      
      const __m128i T2_08B = _mm_add_epi32(T18B, O08B);\r\n            const __m128i T2_09A = _mm_add_epi32(T19A, O09A);       // E9\r\n            const __m128i T2_09B = _mm_add_epi32(T19B, O09B);\r\n            const __m128i T2_10A = _mm_add_epi32(T1AA, O10A);       // E10\r\n            const __m128i T2_10B = _mm_add_epi32(T1AB, O10B);\r\n            const __m128i T2_11A = _mm_add_epi32(T1BA, O11A);       // E11\r\n            const __m128i T2_11B = _mm_add_epi32(T1BB, O11B);\r\n            const __m128i T2_12A = _mm_add_epi32(T1CA, O12A);       // E12\r\n            const __m128i T2_12B = _mm_add_epi32(T1CB, O12B);\r\n            const __m128i T2_13A = _mm_add_epi32(T1DA, O13A);       // E13\r\n            const __m128i T2_13B = _mm_add_epi32(T1DB, O13B);\r\n            const __m128i T2_14A = _mm_add_epi32(T1EA, O14A);       // E14\r\n            const __m128i T2_14B = _mm_add_epi32(T1EB, O14B);\r\n            const __m128i T2_15A = _mm_add_epi32(T1FA, O15A);       // E15\r\n            const __m128i T2_15B = _mm_add_epi32(T1FB, O15B);\r\n            const __m128i T2_31A = _mm_sub_epi32(T10A, O00A);       // E0 - O0 + rnd\r\n            const __m128i T2_31B = _mm_sub_epi32(T10B, O00B);\r\n            const __m128i T2_30A = _mm_sub_epi32(T11A, O01A);       // E1 - O1 + rnd\r\n            const __m128i T2_30B = _mm_sub_epi32(T11B, O01B);\r\n            const __m128i T2_29A = _mm_sub_epi32(T12A, O02A);       // E2 - O2 + rnd\r\n            const __m128i T2_29B = _mm_sub_epi32(T12B, O02B);\r\n            const __m128i T2_28A = _mm_sub_epi32(T13A, O03A);       // E3 - O3 + rnd\r\n            const __m128i T2_28B = _mm_sub_epi32(T13B, O03B);\r\n            const __m128i T2_27A = _mm_sub_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_27B = _mm_sub_epi32(T14B, O04B);\r\n            const __m128i T2_26A = _mm_sub_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_26B = _mm_sub_epi32(T15B, O05B);\r\n            const __m128i T2_25A = 
_mm_sub_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_25B = _mm_sub_epi32(T16B, O06B);\r\n            const __m128i T2_24A = _mm_sub_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_24B = _mm_sub_epi32(T17B, O07B);\r\n            const __m128i T2_23A = _mm_sub_epi32(T18A, O08A);       //\r\n            const __m128i T2_23B = _mm_sub_epi32(T18B, O08B);\r\n            const __m128i T2_22A = _mm_sub_epi32(T19A, O09A);       //\r\n            const __m128i T2_22B = _mm_sub_epi32(T19B, O09B);\r\n            const __m128i T2_21A = _mm_sub_epi32(T1AA, O10A);       //\r\n            const __m128i T2_21B = _mm_sub_epi32(T1AB, O10B);\r\n            const __m128i T2_20A = _mm_sub_epi32(T1BA, O11A);       //\r\n            const __m128i T2_20B = _mm_sub_epi32(T1BB, O11B);\r\n            const __m128i T2_19A = _mm_sub_epi32(T1CA, O12A);       //\r\n            const __m128i T2_19B = _mm_sub_epi32(T1CB, O12B);\r\n            const __m128i T2_18A = _mm_sub_epi32(T1DA, O13A);       //\r\n            const __m128i T2_18B = _mm_sub_epi32(T1DB, O13B);\r\n            const __m128i T2_17A = _mm_sub_epi32(T1EA, O14A);       //\r\n            const __m128i T2_17B = _mm_sub_epi32(T1EB, O14B);\r\n            const __m128i T2_16A = _mm_sub_epi32(T1FA, O15A);       //\r\n            const __m128i T2_16B = _mm_sub_epi32(T1FB, O15B);\r\n\r\n            const __m128i T3_00A = _mm_srai_epi32(T2_00A, nShift);  // [30 20 10 00]\r\n            const __m128i T3_00B = _mm_srai_epi32(T2_00B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_01A = _mm_srai_epi32(T2_01A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_01B = _mm_srai_epi32(T2_01B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_02A = _mm_srai_epi32(T2_02A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_02B = _mm_srai_epi32(T2_02B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_03A = _mm_srai_epi32(T2_03A, nShift);  // [33 23 13 03]\r\n            const 
__m128i T3_03B = _mm_srai_epi32(T2_03B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_04A = _mm_srai_epi32(T2_04A, nShift);  // [34 24 14 04]\r\n            const __m128i T3_04B = _mm_srai_epi32(T2_04B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_05A = _mm_srai_epi32(T2_05A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_05B = _mm_srai_epi32(T2_05B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_06A = _mm_srai_epi32(T2_06A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_06B = _mm_srai_epi32(T2_06B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_07A = _mm_srai_epi32(T2_07A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_07B = _mm_srai_epi32(T2_07B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_08A = _mm_srai_epi32(T2_08A, nShift);  // [30 20 10 00] x8\r\n            const __m128i T3_08B = _mm_srai_epi32(T2_08B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_09A = _mm_srai_epi32(T2_09A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_09B = _mm_srai_epi32(T2_09B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_10A = _mm_srai_epi32(T2_10A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_10B = _mm_srai_epi32(T2_10B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_11A = _mm_srai_epi32(T2_11A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_11B = _mm_srai_epi32(T2_11B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_12A = _mm_srai_epi32(T2_12A, nShift);  // [34 24 14 04] xC\r\n            const __m128i T3_12B = _mm_srai_epi32(T2_12B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_13A = _mm_srai_epi32(T2_13A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_13B = _mm_srai_epi32(T2_13B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_14A = _mm_srai_epi32(T2_14A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_14B = _mm_srai_epi32(T2_14B, 
nShift);  // [76 66 56 46]\r\n            const __m128i T3_15A = _mm_srai_epi32(T2_15A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_15B = _mm_srai_epi32(T2_15B, nShift);  // [77 67 57 47]\r\n\r\n            const __m128i T3_16A = _mm_srai_epi32(T2_16A, nShift);  // [30 20 10 00]\r\n            const __m128i T3_16B = _mm_srai_epi32(T2_16B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_17A = _mm_srai_epi32(T2_17A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_17B = _mm_srai_epi32(T2_17B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_18A = _mm_srai_epi32(T2_18A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_18B = _mm_srai_epi32(T2_18B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_19A = _mm_srai_epi32(T2_19A, nShift);  // [33 23 13 03]\r\n            const __m128i T3_19B = _mm_srai_epi32(T2_19B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_20A = _mm_srai_epi32(T2_20A, nShift);  // [34 24 14 04]\r\n            const __m128i T3_20B = _mm_srai_epi32(T2_20B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_21A = _mm_srai_epi32(T2_21A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_21B = _mm_srai_epi32(T2_21B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_22A = _mm_srai_epi32(T2_22A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_22B = _mm_srai_epi32(T2_22B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_23A = _mm_srai_epi32(T2_23A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_23B = _mm_srai_epi32(T2_23B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_24A = _mm_srai_epi32(T2_24A, nShift);  // [30 20 10 00] x8\r\n            const __m128i T3_24B = _mm_srai_epi32(T2_24B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_25A = _mm_srai_epi32(T2_25A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_25B = _mm_srai_epi32(T2_25B, nShift);  // [71 61 51 41]\r\n            const 
__m128i T3_26A = _mm_srai_epi32(T2_26A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_26B = _mm_srai_epi32(T2_26B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_27A = _mm_srai_epi32(T2_27A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_27B = _mm_srai_epi32(T2_27B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_28A = _mm_srai_epi32(T2_28A, nShift);  // [34 24 14 04] xC\r\n            const __m128i T3_28B = _mm_srai_epi32(T2_28B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_29A = _mm_srai_epi32(T2_29A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_29B = _mm_srai_epi32(T2_29B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_30A = _mm_srai_epi32(T2_30A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_30B = _mm_srai_epi32(T2_30B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_31A = _mm_srai_epi32(T2_31A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_31B = _mm_srai_epi32(T2_31B, nShift);  // [77 67 57 47]\r\n\r\n            res00[part] = _mm_packs_epi32(T3_00A, T3_00B);          // [70 60 50 40 30 20 10 00]\r\n            res01[part] = _mm_packs_epi32(T3_01A, T3_01B);          // [71 61 51 41 31 21 11 01]\r\n            res02[part] = _mm_packs_epi32(T3_02A, T3_02B);          // [72 62 52 42 32 22 12 02]\r\n            res03[part] = _mm_packs_epi32(T3_03A, T3_03B);          // [73 63 53 43 33 23 13 03]\r\n            res04[part] = _mm_packs_epi32(T3_04A, T3_04B);          // [74 64 54 44 34 24 14 04]\r\n            res05[part] = _mm_packs_epi32(T3_05A, T3_05B);          // [75 65 55 45 35 25 15 05]\r\n            res06[part] = _mm_packs_epi32(T3_06A, T3_06B);          // [76 66 56 46 36 26 16 06]\r\n            res07[part] = _mm_packs_epi32(T3_07A, T3_07B);          // [77 67 57 47 37 27 17 07]\r\n            res08[part] = _mm_packs_epi32(T3_08A, T3_08B);          // [A0 ... 
80]\r\n            res09[part] = _mm_packs_epi32(T3_09A, T3_09B);          // [A1 ... 81]\r\n            res10[part] = _mm_packs_epi32(T3_10A, T3_10B);          // [A2 ... 82]\r\n            res11[part] = _mm_packs_epi32(T3_11A, T3_11B);          // [A3 ... 83]\r\n            res12[part] = _mm_packs_epi32(T3_12A, T3_12B);          // [A4 ... 84]\r\n            res13[part] = _mm_packs_epi32(T3_13A, T3_13B);          // [A5 ... 85]\r\n            res14[part] = _mm_packs_epi32(T3_14A, T3_14B);          // [A6 ... 86]\r\n            res15[part] = _mm_packs_epi32(T3_15A, T3_15B);          // [A7 ... 87]\r\n            res16[part] = _mm_packs_epi32(T3_16A, T3_16B);\r\n            res17[part] = _mm_packs_epi32(T3_17A, T3_17B);\r\n            res18[part] = _mm_packs_epi32(T3_18A, T3_18B);\r\n            res19[part] = _mm_packs_epi32(T3_19A, T3_19B);\r\n            res20[part] = _mm_packs_epi32(T3_20A, T3_20B);\r\n            res21[part] = _mm_packs_epi32(T3_21A, T3_21B);\r\n            res22[part] = _mm_packs_epi32(T3_22A, T3_22B);\r\n            res23[part] = _mm_packs_epi32(T3_23A, T3_23B);\r\n            res24[part] = _mm_packs_epi32(T3_24A, T3_24B);\r\n            res25[part] = _mm_packs_epi32(T3_25A, T3_25B);\r\n            res26[part] = _mm_packs_epi32(T3_26A, T3_26B);\r\n            res27[part] = _mm_packs_epi32(T3_27A, T3_27B);\r\n            res28[part] = _mm_packs_epi32(T3_28A, T3_28B);\r\n            res29[part] = _mm_packs_epi32(T3_29A, T3_29B);\r\n            res30[part] = _mm_packs_epi32(T3_30A, T3_30B);\r\n            res31[part] = _mm_packs_epi32(T3_31A, T3_31B);\r\n        }\r\n    }\r\n\r\n    //transpose matrix 8x8 16bit.\r\n    {\r\n        __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n        __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = 
_mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0 = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1 = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2 = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3 = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4 = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5 = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6 = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7 = _mm_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n        TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0])\r\n        TRANSPOSE_8x8_16BIT(res00[1], res01[1], res02[1], res03[1], res04[1], res05[1], res06[1], res07[1], in08[0], in09[0], in10[0], in11[0], in12[0], in13[0], in14[0], in15[0])\r\n        TRANSPOSE_8x8_16BIT(res00[2], res01[2], res02[2], res03[2], res04[2], res05[2], res06[2], res07[2], in16[0], in17[0], in18[0], in19[0], in20[0], in21[0], in22[0], in23[0])\r\n        TRANSPOSE_8x8_16BIT(res00[3], res01[3], res02[3], res03[3], res04[3], res05[3], res06[3], res07[3], in24[0], in25[0], in26[0], in27[0], in28[0], in29[0], in30[0], in31[0])\r\n\r\n        TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[1], in01[1], in02[1], in03[1], 
in04[1], in05[1], in06[1], in07[1])\r\n        TRANSPOSE_8x8_16BIT(res08[1], res09[1], res10[1], res11[1], res12[1], res13[1], res14[1], res15[1], in08[1], in09[1], in10[1], in11[1], in12[1], in13[1], in14[1], in15[1])\r\n        TRANSPOSE_8x8_16BIT(res08[2], res09[2], res10[2], res11[2], res12[2], res13[2], res14[2], res15[2], in16[1], in17[1], in18[1], in19[1], in20[1], in21[1], in22[1], in23[1])\r\n        TRANSPOSE_8x8_16BIT(res08[3], res09[3], res10[3], res11[3], res12[3], res13[3], res14[3], res15[3], in24[1], in25[1], in26[1], in27[1], in28[1], in29[1], in30[1], in31[1])\r\n\r\n        TRANSPOSE_8x8_16BIT(res16[0], res17[0], res18[0], res19[0], res20[0], res21[0], res22[0], res23[0], in00[2], in01[2], in02[2], in03[2], in04[2], in05[2], in06[2], in07[2])\r\n        TRANSPOSE_8x8_16BIT(res16[1], res17[1], res18[1], res19[1], res20[1], res21[1], res22[1], res23[1], in08[2], in09[2], in10[2], in11[2], in12[2], in13[2], in14[2], in15[2])\r\n        TRANSPOSE_8x8_16BIT(res16[2], res17[2], res18[2], res19[2], res20[2], res21[2], res22[2], res23[2], in16[2], in17[2], in18[2], in19[2], in20[2], in21[2], in22[2], in23[2])\r\n        TRANSPOSE_8x8_16BIT(res16[3], res17[3], res18[3], res19[3], res20[3], res21[3], res22[3], res23[3], in24[2], in25[2], in26[2], in27[2], in28[2], in29[2], in30[2], in31[2])\r\n\r\n        TRANSPOSE_8x8_16BIT(res24[0], res25[0], res26[0], res27[0], res28[0], res29[0], res30[0], res31[0], in00[3], in01[3], in02[3], in03[3], in04[3], in05[3], in06[3], in07[3])\r\n        TRANSPOSE_8x8_16BIT(res24[1], res25[1], res26[1], res27[1], res28[1], res29[1], res30[1], res31[1], in08[3], in09[3], in10[3], in11[3], in12[3], in13[3], in14[3], in15[3])\r\n        TRANSPOSE_8x8_16BIT(res24[2], res25[2], res26[2], res27[2], res28[2], res29[2], res30[2], res31[2], in16[3], in17[3], in18[3], in19[3], in20[3], in21[3], in22[3], in23[3])\r\n        TRANSPOSE_8x8_16BIT(res24[3], res25[3], res26[3], res27[3], res28[3], res29[3], res30[3], res31[3], in24[3], 
in25[3], in26[3], in27[3], in28[3], in29[3], in30[3], in31[3])\r\n#undef TRANSPOSE_8x8_16BIT\r\n    }\r\n\r\n\r\n    //clip\r\n    {\r\n        __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n        int k;\r\n\r\n        for (k = 0; k < 4; k++) {\r\n            in00[k] = _mm_max_epi16(_mm_min_epi16(in00[k], max_val), min_val);\r\n            in01[k] = _mm_max_epi16(_mm_min_epi16(in01[k], max_val), min_val);\r\n            in02[k] = _mm_max_epi16(_mm_min_epi16(in02[k], max_val), min_val);\r\n            in03[k] = _mm_max_epi16(_mm_min_epi16(in03[k], max_val), min_val);\r\n            in04[k] = _mm_max_epi16(_mm_min_epi16(in04[k], max_val), min_val);\r\n            in05[k] = _mm_max_epi16(_mm_min_epi16(in05[k], max_val), min_val);\r\n            in06[k] = _mm_max_epi16(_mm_min_epi16(in06[k], max_val), min_val);\r\n            in07[k] = _mm_max_epi16(_mm_min_epi16(in07[k], max_val), min_val);\r\n            in08[k] = _mm_max_epi16(_mm_min_epi16(in08[k], max_val), min_val);\r\n            in09[k] = _mm_max_epi16(_mm_min_epi16(in09[k], max_val), min_val);\r\n            in10[k] = _mm_max_epi16(_mm_min_epi16(in10[k], max_val), min_val);\r\n            in11[k] = _mm_max_epi16(_mm_min_epi16(in11[k], max_val), min_val);\r\n            in12[k] = _mm_max_epi16(_mm_min_epi16(in12[k], max_val), min_val);\r\n            in13[k] = _mm_max_epi16(_mm_min_epi16(in13[k], max_val), min_val);\r\n            in14[k] = _mm_max_epi16(_mm_min_epi16(in14[k], max_val), min_val);\r\n            in15[k] = _mm_max_epi16(_mm_min_epi16(in15[k], max_val), min_val);\r\n            in16[k] = _mm_max_epi16(_mm_min_epi16(in16[k], max_val), min_val);\r\n            in17[k] = _mm_max_epi16(_mm_min_epi16(in17[k], max_val), min_val);\r\n            in18[k] = _mm_max_epi16(_mm_min_epi16(in18[k], max_val), min_val);\r\n            in19[k] = _mm_max_epi16(_mm_min_epi16(in19[k], max_val), min_val);\r\n            
in20[k] = _mm_max_epi16(_mm_min_epi16(in20[k], max_val), min_val);\r\n            in21[k] = _mm_max_epi16(_mm_min_epi16(in21[k], max_val), min_val);\r\n            in22[k] = _mm_max_epi16(_mm_min_epi16(in22[k], max_val), min_val);\r\n            in23[k] = _mm_max_epi16(_mm_min_epi16(in23[k], max_val), min_val);\r\n            in24[k] = _mm_max_epi16(_mm_min_epi16(in24[k], max_val), min_val);\r\n            in25[k] = _mm_max_epi16(_mm_min_epi16(in25[k], max_val), min_val);\r\n            in26[k] = _mm_max_epi16(_mm_min_epi16(in26[k], max_val), min_val);\r\n            in27[k] = _mm_max_epi16(_mm_min_epi16(in27[k], max_val), min_val);\r\n            in28[k] = _mm_max_epi16(_mm_min_epi16(in28[k], max_val), min_val);\r\n            in29[k] = _mm_max_epi16(_mm_min_epi16(in29[k], max_val), min_val);\r\n            in30[k] = _mm_max_epi16(_mm_min_epi16(in30[k], max_val), min_val);\r\n            in31[k] = _mm_max_epi16(_mm_min_epi16(in31[k], max_val), min_val);\r\n        }\r\n    }\r\n\r\n    // Add\r\n    for (i = 0; i < 2; i++) {\r\n#define STORE_LINE(L0, L1, L2, L3, L4, L5, L6, L7, H0, H1, H2, H3, H4, H5, H6, H7, offsetV, offsetH) \\\r\n    _mm_storeu_si128((__m128i*)(dst + (0 + (offsetV)) * i_dst + (offsetH)+0), L0); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (0 + (offsetV)) * i_dst + (offsetH)+8), H0); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (1 + (offsetV)) * i_dst + (offsetH)+0), L1); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (1 + (offsetV)) * i_dst + (offsetH)+8), H1); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (2 + (offsetV)) * i_dst + (offsetH)+0), L2); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (2 + (offsetV)) * i_dst + (offsetH)+8), H2); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (3 + (offsetV)) * i_dst + (offsetH)+0), L3); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (3 + (offsetV)) * i_dst + (offsetH)+8), H3); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (4 + (offsetV)) * i_dst + (offsetH)+0), L4); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (4 + 
(offsetV)) * i_dst + (offsetH)+8), H4); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (5 + (offsetV)) * i_dst + (offsetH)+0), L5); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (5 + (offsetV)) * i_dst + (offsetH)+8), H5); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (6 + (offsetV)) * i_dst + (offsetH)+0), L6); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (6 + (offsetV)) * i_dst + (offsetH)+8), H6); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (7 + (offsetV)) * i_dst + (offsetH)+0), L7); \\\r\n    _mm_storeu_si128((__m128i*)(dst + (7 + (offsetV)) * i_dst + (offsetH)+8), H7);\r\n\r\n        const int k = i * 2;\r\n        STORE_LINE(in00[k], in01[k], in02[k], in03[k], in04[k], in05[k], in06[k], in07[k], in00[k + 1], in01[k + 1], in02[k + 1], in03[k + 1], in04[k + 1], in05[k + 1], in06[k + 1], in07[k + 1], 0, i * 16)\r\n        STORE_LINE(in08[k], in09[k], in10[k], in11[k], in12[k], in13[k], in14[k], in15[k], in08[k + 1], in09[k + 1], in10[k + 1], in11[k + 1], in12[k + 1], in13[k + 1], in14[k + 1], in15[k + 1], 8, i * 16)\r\n        STORE_LINE(in16[k], in17[k], in18[k], in19[k], in20[k], in21[k], in22[k], in23[k], in16[k + 1], in17[k + 1], in18[k + 1], in19[k + 1], in20[k + 1], in21[k + 1], in22[k + 1], in23[k + 1], 16, i * 16)\r\n        STORE_LINE(in24[k], in25[k], in26[k], in27[k], in28[k], in29[k], in30[k], in31[k], in24[k + 1], in25[k + 1], in26[k + 1], in27[k + 1], in28[k + 1], in29[k + 1], in30[k + 1], in31[k + 1], 24, i * 16)\r\n#undef STORE_LINE\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_32x8_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    __m128i m128iS0[4], m128iS1[4], m128iS2[4], m128iS3[4], m128iS4[4], m128iS5[4], m128iS6[4], m128iS7[4];\r\n    __m128i m128iAdd, m128Tmp0, m128Tmp1, m128Tmp2, m128Tmp3;\r\n    __m128i E0h, E1h, E2h, E3h, E0l, E1l, E2l, E3l;\r\n    __m128i O0h, O1h, O2h, O3h, O0l, O1l, O2l, O3l;\r\n    __m128i EE0l, EE1l, E00l, E01l, EE0h, EE1h, 
E00h, E01h;\r\n    //int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth - (i_dst & 0x01);\r\n    //int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1 + (i_dst & 0x01);\r\n    int i, pass;\r\n\r\n    i_dst &= 0xFE;    /* remember to remove the flag bit */\r\n    m128iAdd = _mm_set1_epi32(16);      // add1\r\n\r\n    for (pass = 0; pass < 4; pass++) {\r\n        m128iS1[pass] = _mm_load_si128((__m128i*)&src[pass * 8 + 1 * 32]);\r\n        m128iS3[pass] = _mm_load_si128((__m128i*)&src[pass * 8 + 3 * 32]);\r\n\r\n        m128Tmp0 = _mm_unpacklo_epi16(m128iS1[pass], m128iS3[pass]);\r\n        E1l      = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[0])));\r\n        m128Tmp1 = _mm_unpackhi_epi16(m128iS1[pass], m128iS3[pass]);\r\n        E1h      = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[0])));\r\n\r\n        m128iS5[pass] = _mm_load_si128((__m128i*)&src[pass * 8 + 5 * 32]);\r\n        m128iS7[pass] = _mm_load_si128((__m128i*)&src[pass * 8 + 7 * 32]);\r\n\r\n        m128Tmp2 = _mm_unpacklo_epi16(m128iS5[pass], m128iS7[pass]);\r\n        E2l      = _mm_madd_epi16(m128Tmp2, _mm_load_si128((__m128i*)(tab_idct_8x8[1])));\r\n        m128Tmp3 = _mm_unpackhi_epi16(m128iS5[pass], m128iS7[pass]);\r\n        E2h = _mm_madd_epi16(m128Tmp3, _mm_load_si128((__m128i*)(tab_idct_8x8[1])));\r\n        O0l = _mm_add_epi32(E1l, E2l);\r\n        O0h = _mm_add_epi32(E1h, E2h);\r\n\r\n        E1l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[2])));\r\n        E1h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[2])));\r\n        E2l = _mm_madd_epi16(m128Tmp2, _mm_load_si128((__m128i*)(tab_idct_8x8[3])));\r\n        E2h = _mm_madd_epi16(m128Tmp3, _mm_load_si128((__m128i*)(tab_idct_8x8[3])));\r\n\r\n        O1l = _mm_add_epi32(E1l, E2l);\r\n        O1h = _mm_add_epi32(E1h, E2h);\r\n\r\n        E1l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[4])));\r\n        E1h = 
_mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[4])));\r\n        E2l = _mm_madd_epi16(m128Tmp2, _mm_load_si128((__m128i*)(tab_idct_8x8[5])));\r\n        E2h = _mm_madd_epi16(m128Tmp3, _mm_load_si128((__m128i*)(tab_idct_8x8[5])));\r\n        O2l = _mm_add_epi32(E1l, E2l);\r\n        O2h = _mm_add_epi32(E1h, E2h);\r\n\r\n        E1l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[6])));\r\n        E1h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[6])));\r\n        E2l = _mm_madd_epi16(m128Tmp2, _mm_load_si128((__m128i*)(tab_idct_8x8[7])));\r\n        E2h = _mm_madd_epi16(m128Tmp3, _mm_load_si128((__m128i*)(tab_idct_8x8[7])));\r\n        O3h = _mm_add_epi32(E1h, E2h);\r\n        O3l = _mm_add_epi32(E1l, E2l);\r\n\r\n        /*    -------     */\r\n\r\n        m128iS0[pass] = _mm_load_si128((__m128i*)&src[pass * 8 + 0 * 32]);\r\n        m128iS4[pass] = _mm_load_si128((__m128i*)&src[pass * 8 + 4 * 32]);\r\n\r\n        m128Tmp0 = _mm_unpacklo_epi16(m128iS0[pass], m128iS4[pass]);\r\n        EE0l     = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[8])));\r\n        m128Tmp1 = _mm_unpackhi_epi16(m128iS0[pass], m128iS4[pass]);\r\n        EE0h     = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[8])));\r\n\r\n        EE1l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[9])));\r\n        EE1h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[9])));\r\n\r\n        /*    -------     */\r\n\r\n        m128iS2[pass] = _mm_load_si128((__m128i*)&src[pass * 8 + 2 * 32]);\r\n        m128iS6[pass] = _mm_load_si128((__m128i*)&src[pass * 8 + 6 * 32]);\r\n\r\n        m128Tmp0 = _mm_unpacklo_epi16(m128iS2[pass], m128iS6[pass]);\r\n        E00l     = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[10])));\r\n        m128Tmp1 = _mm_unpackhi_epi16(m128iS2[pass], m128iS6[pass]);\r\n        E00h     = _mm_madd_epi16(m128Tmp1, 
_mm_load_si128((__m128i*)(tab_idct_8x8[10])));\r\n        E01l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[11])));\r\n        E01h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[11])));\r\n        E0l = _mm_add_epi32(EE0l, E00l);\r\n        E0l = _mm_add_epi32(E0l,  m128iAdd);\r\n        E0h = _mm_add_epi32(EE0h, E00h);\r\n        E0h = _mm_add_epi32(E0h,  m128iAdd);\r\n        E3l = _mm_sub_epi32(EE0l, E00l);\r\n        E3l = _mm_add_epi32(E3l,  m128iAdd);\r\n        E3h = _mm_sub_epi32(EE0h, E00h);\r\n        E3h = _mm_add_epi32(E3h,  m128iAdd);\r\n\r\n        E1l = _mm_add_epi32(EE1l, E01l);\r\n        E1l = _mm_add_epi32(E1l,  m128iAdd);\r\n        E1h = _mm_add_epi32(EE1h, E01h);\r\n        E1h = _mm_add_epi32(E1h,  m128iAdd);\r\n        E2l = _mm_sub_epi32(EE1l, E01l);\r\n        E2l = _mm_add_epi32(E2l,  m128iAdd);\r\n        E2h = _mm_sub_epi32(EE1h, E01h);\r\n        E2h = _mm_add_epi32(E2h,  m128iAdd);\r\n\r\n        m128iS0[pass] = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E0l, O0l), 5), _mm_srai_epi32(_mm_add_epi32(E0h, O0h), 5));    // shift of the first inverse-transform pass\r\n        m128iS7[pass] = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E0l, O0l), 5), _mm_srai_epi32(_mm_sub_epi32(E0h, O0h), 5));\r\n        m128iS1[pass] = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E1l, O1l), 5), _mm_srai_epi32(_mm_add_epi32(E1h, O1h), 5));\r\n        m128iS6[pass] = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E1l, O1l), 5), _mm_srai_epi32(_mm_sub_epi32(E1h, O1h), 5));\r\n        m128iS2[pass] = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E2l, O2l), 5), _mm_srai_epi32(_mm_add_epi32(E2h, O2h), 5));\r\n        m128iS5[pass] = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E2l, O2l), 5), _mm_srai_epi32(_mm_sub_epi32(E2h, O2h), 5));\r\n        m128iS3[pass] = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E3l, O3l), 5), _mm_srai_epi32(_mm_add_epi32(E3h, O3h), 5));\r\n        m128iS4[pass] = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E3l, O3l), 5), 
_mm_srai_epi32(_mm_sub_epi32(E3h, O3h), 5));\r\n\r\n        /*  Transpose matrix  */\r\n        E0l = _mm_unpacklo_epi16(m128iS0[pass], m128iS4[pass]);\r\n        E1l = _mm_unpacklo_epi16(m128iS1[pass], m128iS5[pass]);\r\n        E2l = _mm_unpacklo_epi16(m128iS2[pass], m128iS6[pass]);\r\n        E3l = _mm_unpacklo_epi16(m128iS3[pass], m128iS7[pass]);\r\n        O0l = _mm_unpackhi_epi16(m128iS0[pass], m128iS4[pass]);\r\n        O1l = _mm_unpackhi_epi16(m128iS1[pass], m128iS5[pass]);\r\n        O2l = _mm_unpackhi_epi16(m128iS2[pass], m128iS6[pass]);\r\n        O3l = _mm_unpackhi_epi16(m128iS3[pass], m128iS7[pass]);\r\n        m128Tmp0      = _mm_unpacklo_epi16(E0l, E2l);\r\n        m128Tmp1      = _mm_unpacklo_epi16(E1l, E3l);\r\n        m128iS0[pass] = _mm_unpacklo_epi16(m128Tmp0, m128Tmp1);\r\n        m128iS1[pass] = _mm_unpackhi_epi16(m128Tmp0, m128Tmp1);\r\n        m128Tmp2      = _mm_unpackhi_epi16(E0l, E2l);\r\n        m128Tmp3      = _mm_unpackhi_epi16(E1l, E3l);\r\n        m128iS2[pass] = _mm_unpacklo_epi16(m128Tmp2, m128Tmp3);\r\n        m128iS3[pass] = _mm_unpackhi_epi16(m128Tmp2, m128Tmp3);\r\n        m128Tmp0      = _mm_unpacklo_epi16(O0l, O2l);\r\n        m128Tmp1      = _mm_unpacklo_epi16(O1l, O3l);\r\n        m128iS4[pass] = _mm_unpacklo_epi16(m128Tmp0, m128Tmp1);\r\n        m128iS5[pass] = _mm_unpackhi_epi16(m128Tmp0, m128Tmp1);\r\n        m128Tmp2      = _mm_unpackhi_epi16(O0l, O2l);\r\n        m128Tmp3      = _mm_unpackhi_epi16(O1l, O3l);\r\n        m128iS6[pass] = _mm_unpacklo_epi16(m128Tmp2, m128Tmp3);\r\n        m128iS7[pass] = _mm_unpackhi_epi16(m128Tmp2, m128Tmp3);\r\n    }\r\n\r\n    {\r\n        const __m128i c16_p45_p45 = _mm_set1_epi32(0x002D002D);\r\n        const __m128i c16_p43_p44 = _mm_set1_epi32(0x002B002C);\r\n        const __m128i c16_p39_p41 = _mm_set1_epi32(0x00270029);\r\n        const __m128i c16_p34_p36 = _mm_set1_epi32(0x00220024);\r\n        const __m128i c16_p27_p30 = _mm_set1_epi32(0x001B001E);\r\n        const __m128i 
c16_p19_p23 = _mm_set1_epi32(0x00130017);\r\n        const __m128i c16_p11_p15 = _mm_set1_epi32(0x000B000F);\r\n        const __m128i c16_p02_p07 = _mm_set1_epi32(0x00020007);\r\n        const __m128i c16_p41_p45 = _mm_set1_epi32(0x0029002D);\r\n        const __m128i c16_p23_p34 = _mm_set1_epi32(0x00170022);\r\n        const __m128i c16_n02_p11 = _mm_set1_epi32(0xFFFE000B);\r\n        const __m128i c16_n27_n15 = _mm_set1_epi32(0xFFE5FFF1);\r\n        const __m128i c16_n43_n36 = _mm_set1_epi32(0xFFD5FFDC);\r\n        const __m128i c16_n44_n45 = _mm_set1_epi32(0xFFD4FFD3);\r\n        const __m128i c16_n30_n39 = _mm_set1_epi32(0xFFE2FFD9);\r\n        const __m128i c16_n07_n19 = _mm_set1_epi32(0xFFF9FFED);\r\n        const __m128i c16_p34_p44 = _mm_set1_epi32(0x0022002C);\r\n        const __m128i c16_n07_p15 = _mm_set1_epi32(0xFFF9000F);\r\n        const __m128i c16_n41_n27 = _mm_set1_epi32(0xFFD7FFE5);\r\n        const __m128i c16_n39_n45 = _mm_set1_epi32(0xFFD9FFD3);\r\n        const __m128i c16_n02_n23 = _mm_set1_epi32(0xFFFEFFE9);\r\n        const __m128i c16_p36_p19 = _mm_set1_epi32(0x00240013);\r\n        const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);\r\n        const __m128i c16_p11_p30 = _mm_set1_epi32(0x000B001E);\r\n        const __m128i c16_p23_p43 = _mm_set1_epi32(0x0017002B);\r\n        const __m128i c16_n34_n07 = _mm_set1_epi32(0xFFDEFFF9);\r\n        const __m128i c16_n36_n45 = _mm_set1_epi32(0xFFDCFFD3);\r\n        const __m128i c16_p19_n11 = _mm_set1_epi32(0x0013FFF5);\r\n        const __m128i c16_p44_p41 = _mm_set1_epi32(0x002C0029);\r\n        const __m128i c16_n02_p27 = _mm_set1_epi32(0xFFFE001B);\r\n        const __m128i c16_n45_n30 = _mm_set1_epi32(0xFFD3FFE2);\r\n        const __m128i c16_n15_n39 = _mm_set1_epi32(0xFFF1FFD9);\r\n        const __m128i c16_p11_p41 = _mm_set1_epi32(0x000B0029);\r\n        const __m128i c16_n45_n27 = _mm_set1_epi32(0xFFD3FFE5);\r\n        const __m128i c16_p07_n30 = _mm_set1_epi32(0x0007FFE2);\r\n        
const __m128i c16_p43_p39 = _mm_set1_epi32(0x002B0027);\r\n        const __m128i c16_n23_p15 = _mm_set1_epi32(0xFFE9000F);\r\n        const __m128i c16_n34_n45 = _mm_set1_epi32(0xFFDEFFD3);\r\n        const __m128i c16_p36_p02 = _mm_set1_epi32(0x00240002);\r\n        const __m128i c16_p19_p44 = _mm_set1_epi32(0x0013002C);\r\n        const __m128i c16_n02_p39 = _mm_set1_epi32(0xFFFE0027);\r\n        const __m128i c16_n36_n41 = _mm_set1_epi32(0xFFDCFFD7);\r\n        const __m128i c16_p43_p07 = _mm_set1_epi32(0x002B0007);\r\n        const __m128i c16_n11_p34 = _mm_set1_epi32(0xFFF50022);\r\n        const __m128i c16_n30_n44 = _mm_set1_epi32(0xFFE2FFD4);\r\n        const __m128i c16_p45_p15 = _mm_set1_epi32(0x002D000F);\r\n        const __m128i c16_n19_p27 = _mm_set1_epi32(0xFFED001B);\r\n        const __m128i c16_n23_n45 = _mm_set1_epi32(0xFFE9FFD3);\r\n        const __m128i c16_n15_p36 = _mm_set1_epi32(0xFFF10024);\r\n        const __m128i c16_n11_n45 = _mm_set1_epi32(0xFFF5FFD3);\r\n        const __m128i c16_p34_p39 = _mm_set1_epi32(0x00220027);\r\n        const __m128i c16_n45_n19 = _mm_set1_epi32(0xFFD3FFED);\r\n        const __m128i c16_p41_n07 = _mm_set1_epi32(0x0029FFF9);\r\n        const __m128i c16_n23_p30 = _mm_set1_epi32(0xFFE9001E);\r\n        const __m128i c16_n02_n44 = _mm_set1_epi32(0xFFFEFFD4);\r\n        const __m128i c16_p27_p43 = _mm_set1_epi32(0x001B002B);\r\n        const __m128i c16_n27_p34 = _mm_set1_epi32(0xFFE50022);\r\n        const __m128i c16_p19_n39 = _mm_set1_epi32(0x0013FFD9);\r\n        const __m128i c16_n11_p43 = _mm_set1_epi32(0xFFF5002B);\r\n        const __m128i c16_p02_n45 = _mm_set1_epi32(0x0002FFD3);\r\n        const __m128i c16_p07_p45 = _mm_set1_epi32(0x0007002D);\r\n        const __m128i c16_n15_n44 = _mm_set1_epi32(0xFFF1FFD4);\r\n        const __m128i c16_p23_p41 = _mm_set1_epi32(0x00170029);\r\n        const __m128i c16_n30_n36 = _mm_set1_epi32(0xFFE2FFDC);\r\n        const __m128i c16_n36_p30 = 
_mm_set1_epi32(0xFFDC001E);\r\n        const __m128i c16_p41_n23 = _mm_set1_epi32(0x0029FFE9);\r\n        const __m128i c16_n44_p15 = _mm_set1_epi32(0xFFD4000F);\r\n        const __m128i c16_p45_n07 = _mm_set1_epi32(0x002DFFF9);\r\n        const __m128i c16_n45_n02 = _mm_set1_epi32(0xFFD3FFFE);\r\n        const __m128i c16_p43_p11 = _mm_set1_epi32(0x002B000B);\r\n        const __m128i c16_n39_n19 = _mm_set1_epi32(0xFFD9FFED);\r\n        const __m128i c16_p34_p27 = _mm_set1_epi32(0x0022001B);\r\n        const __m128i c16_n43_p27 = _mm_set1_epi32(0xFFD5001B);\r\n        const __m128i c16_p44_n02 = _mm_set1_epi32(0x002CFFFE);\r\n        const __m128i c16_n30_n23 = _mm_set1_epi32(0xFFE2FFE9);\r\n        const __m128i c16_p07_p41 = _mm_set1_epi32(0x00070029);\r\n        const __m128i c16_p19_n45 = _mm_set1_epi32(0x0013FFD3);\r\n        const __m128i c16_n39_p34 = _mm_set1_epi32(0xFFD90022);\r\n        const __m128i c16_p45_n11 = _mm_set1_epi32(0x002DFFF5);\r\n        const __m128i c16_n36_n15 = _mm_set1_epi32(0xFFDCFFF1);\r\n        const __m128i c16_n45_p23 = _mm_set1_epi32(0xFFD30017);\r\n        const __m128i c16_p27_p19 = _mm_set1_epi32(0x001B0013);\r\n        const __m128i c16_p15_n45 = _mm_set1_epi32(0x000FFFD3);\r\n        const __m128i c16_n44_p30 = _mm_set1_epi32(0xFFD4001E);\r\n        const __m128i c16_p34_p11 = _mm_set1_epi32(0x0022000B);\r\n        const __m128i c16_p07_n43 = _mm_set1_epi32(0x0007FFD5);\r\n        const __m128i c16_n41_p36 = _mm_set1_epi32(0xFFD70024);\r\n        const __m128i c16_p39_p02 = _mm_set1_epi32(0x00270002);\r\n        const __m128i c16_n44_p19 = _mm_set1_epi32(0xFFD40013);\r\n        const __m128i c16_n02_p36 = _mm_set1_epi32(0xFFFE0024);\r\n        const __m128i c16_p45_n34 = _mm_set1_epi32(0x002DFFDE);\r\n        const __m128i c16_n15_n23 = _mm_set1_epi32(0xFFF1FFE9);\r\n        const __m128i c16_n39_p43 = _mm_set1_epi32(0xFFD9002B);\r\n        const __m128i c16_p30_p07 = _mm_set1_epi32(0x001E0007);\r\n        const __m128i 
c16_p27_n45 = _mm_set1_epi32(0x001BFFD3);\r\n        const __m128i c16_n41_p11 = _mm_set1_epi32(0xFFD7000B);\r\n        const __m128i c16_n39_p15 = _mm_set1_epi32(0xFFD9000F);\r\n        const __m128i c16_n30_p45 = _mm_set1_epi32(0xFFE2002D);\r\n        const __m128i c16_p27_p02 = _mm_set1_epi32(0x001B0002);\r\n        const __m128i c16_p41_n44 = _mm_set1_epi32(0x0029FFD4);\r\n        const __m128i c16_n11_n19 = _mm_set1_epi32(0xFFF5FFED);\r\n        const __m128i c16_n45_p36 = _mm_set1_epi32(0xFFD30024);\r\n        const __m128i c16_n07_p34 = _mm_set1_epi32(0xFFF90022);\r\n        const __m128i c16_p43_n23 = _mm_set1_epi32(0x002BFFE9);\r\n        const __m128i c16_n30_p11 = _mm_set1_epi32(0xFFE2000B);\r\n        const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n        const __m128i c16_n19_p36 = _mm_set1_epi32(0xFFED0024);\r\n        const __m128i c16_p23_n02 = _mm_set1_epi32(0x0017FFFE);\r\n        const __m128i c16_p45_n39 = _mm_set1_epi32(0x002DFFD9);\r\n        const __m128i c16_p27_n41 = _mm_set1_epi32(0x001BFFD7);\r\n        const __m128i c16_n15_n07 = _mm_set1_epi32(0xFFF1FFF9);\r\n        const __m128i c16_n44_p34 = _mm_set1_epi32(0xFFD40022);\r\n        const __m128i c16_n19_p07 = _mm_set1_epi32(0xFFED0007);\r\n        const __m128i c16_n39_p30 = _mm_set1_epi32(0xFFD9001E);\r\n        const __m128i c16_n45_p44 = _mm_set1_epi32(0xFFD3002C);\r\n        const __m128i c16_n36_p43 = _mm_set1_epi32(0xFFDC002B);\r\n        const __m128i c16_n15_p27 = _mm_set1_epi32(0xFFF1001B);\r\n        const __m128i c16_p11_p02 = _mm_set1_epi32(0x000B0002);\r\n        const __m128i c16_p34_n23 = _mm_set1_epi32(0x0022FFE9);\r\n        const __m128i c16_p45_n41 = _mm_set1_epi32(0x002DFFD7);\r\n        const __m128i c16_n07_p02 = _mm_set1_epi32(0xFFF90002);\r\n        const __m128i c16_n15_p11 = _mm_set1_epi32(0xFFF1000B);\r\n        const __m128i c16_n23_p19 = _mm_set1_epi32(0xFFE90013);\r\n        const __m128i c16_n30_p27 = _mm_set1_epi32(0xFFE2001B);\r\n        
const __m128i c16_n36_p34 = _mm_set1_epi32(0xFFDC0022);\r\n        const __m128i c16_n41_p39 = _mm_set1_epi32(0xFFD70027);\r\n        const __m128i c16_n44_p43 = _mm_set1_epi32(0xFFD4002B);\r\n        const __m128i c16_n45_p45 = _mm_set1_epi32(0xFFD3002D);\r\n\r\n        //  const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);\r\n        const __m128i c16_p35_p40 = _mm_set1_epi32(0x00230028);\r\n        const __m128i c16_p21_p29 = _mm_set1_epi32(0x0015001D);\r\n        const __m128i c16_p04_p13 = _mm_set1_epi32(0x0004000D);\r\n        const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);\r\n        const __m128i c16_n21_p04 = _mm_set1_epi32(0xFFEB0004);\r\n        const __m128i c16_n45_n40 = _mm_set1_epi32(0xFFD3FFD8);\r\n        const __m128i c16_n13_n35 = _mm_set1_epi32(0xFFF3FFDD);\r\n        const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);\r\n        const __m128i c16_n43_n35 = _mm_set1_epi32(0xFFD5FFDD);\r\n        const __m128i c16_p29_n13 = _mm_set1_epi32(0x001DFFF3);\r\n        const __m128i c16_p21_p45 = _mm_set1_epi32(0x0015002D);\r\n        const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);\r\n        const __m128i c16_p04_n43 = _mm_set1_epi32(0x0004FFD5);\r\n        const __m128i c16_p13_p45 = _mm_set1_epi32(0x000D002D);\r\n        const __m128i c16_n29_n40 = _mm_set1_epi32(0xFFE3FFD8);\r\n        const __m128i c16_n40_p29 = _mm_set1_epi32(0xFFD8001D);\r\n        const __m128i c16_p45_n13 = _mm_set1_epi32(0x002DFFF3);\r\n        const __m128i c16_n43_n04 = _mm_set1_epi32(0xFFD5FFFC);\r\n        const __m128i c16_p35_p21 = _mm_set1_epi32(0x00230015);\r\n        const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);\r\n        const __m128i c16_p13_p29 = _mm_set1_epi32(0x000D001D);\r\n        const __m128i c16_p35_n43 = _mm_set1_epi32(0x0023FFD5);\r\n        const __m128i c16_n40_p04 = _mm_set1_epi32(0xFFD80004);\r\n        const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);\r\n        const __m128i c16_n40_p45 = 
_mm_set1_epi32(0xFFD8002D);\r\n        const __m128i c16_p04_p21 = _mm_set1_epi32(0x00040015);\r\n        const __m128i c16_p43_n29 = _mm_set1_epi32(0x002BFFE3);\r\n        const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);\r\n        const __m128i c16_n29_p21 = _mm_set1_epi32(0xFFE30015);\r\n        const __m128i c16_n40_p35 = _mm_set1_epi32(0xFFD80023);\r\n        //  const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n\r\n        const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);\r\n        const __m128i c16_p09_p25 = _mm_set1_epi32(0x00090019);\r\n        const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);\r\n        const __m128i c16_n25_n44 = _mm_set1_epi32(0xFFE7FFD4);\r\n\r\n        const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);\r\n        const __m128i c16_p38_p09 = _mm_set1_epi32(0x00260009);\r\n        const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);\r\n        const __m128i c16_n44_p38 = _mm_set1_epi32(0xFFD40026);\r\n\r\n        const __m128i c16_p17_p42 = _mm_set1_epi32(0x0011002A);\r\n        const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);\r\n\r\n        const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);\r\n        const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);\r\n\r\n\r\n        __m128i c32_rnd = _mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n        int nShift = shift2;\r\n\r\n        // DCT1\r\n\r\n        __m128i res00[4], res01[4], res02[4], res03[4], res04[4], res05[4], res06[4], res07[4], res08[4], res09[4], res10[4], res11[4], res12[4], res13[4], res14[4], res15[4];\r\n        __m128i res16[4], res17[4], res18[4], res19[4], res20[4], res21[4], res22[4], res23[4], res24[4], res25[4], res26[4], res27[4], res28[4], res29[4], res30[4], res31[4];\r\n\r\n        const __m128i T_00_00A = _mm_unpacklo_epi16(m128iS1[0], m128iS3[0]);    // [33 13 32 12 31 11 30 10]\r\n        const __m128i T_00_00B = _mm_unpackhi_epi16(m128iS1[0], m128iS3[0]);    // [37 17 36 16 35 15 34 14]\r\n        const 
__m128i T_00_01A = _mm_unpacklo_epi16(m128iS5[0], m128iS7[0]);    // [ ]\r\n        const __m128i T_00_01B = _mm_unpackhi_epi16(m128iS5[0], m128iS7[0]);    // [ ]\r\n        const __m128i T_00_02A = _mm_unpacklo_epi16(m128iS1[1], m128iS3[1]);    // [ ]\r\n        const __m128i T_00_02B = _mm_unpackhi_epi16(m128iS1[1], m128iS3[1]);    // [ ]\r\n        const __m128i T_00_03A = _mm_unpacklo_epi16(m128iS5[1], m128iS7[1]);    // [ ]\r\n        const __m128i T_00_03B = _mm_unpackhi_epi16(m128iS5[1], m128iS7[1]);    // [ ]\r\n        const __m128i T_00_04A = _mm_unpacklo_epi16(m128iS1[2], m128iS3[2]);    // [ ]\r\n        const __m128i T_00_04B = _mm_unpackhi_epi16(m128iS1[2], m128iS3[2]);    // [ ]\r\n        const __m128i T_00_05A = _mm_unpacklo_epi16(m128iS5[2], m128iS7[2]);    // [ ]\r\n        const __m128i T_00_05B = _mm_unpackhi_epi16(m128iS5[2], m128iS7[2]);    // [ ]\r\n        const __m128i T_00_06A = _mm_unpacklo_epi16(m128iS1[3], m128iS3[3]);    // [ ]\r\n        const __m128i T_00_06B = _mm_unpackhi_epi16(m128iS1[3], m128iS3[3]);    // [ ]\r\n        const __m128i T_00_07A = _mm_unpacklo_epi16(m128iS5[3], m128iS7[3]);    //\r\n        const __m128i T_00_07B = _mm_unpackhi_epi16(m128iS5[3], m128iS7[3]);    // [ ]\r\n\r\n        const __m128i T_00_08A = _mm_unpacklo_epi16(m128iS2[0], m128iS6[0]);    // [ ]\r\n        const __m128i T_00_08B = _mm_unpackhi_epi16(m128iS2[0], m128iS6[0]);    // [ ]\r\n        const __m128i T_00_09A = _mm_unpacklo_epi16(m128iS2[1], m128iS6[1]);    // [ ]\r\n        const __m128i T_00_09B = _mm_unpackhi_epi16(m128iS2[1], m128iS6[1]);    // [ ]\r\n        const __m128i T_00_10A = _mm_unpacklo_epi16(m128iS2[2], m128iS6[2]);    // [ ]\r\n        const __m128i T_00_10B = _mm_unpackhi_epi16(m128iS2[2], m128iS6[2]);    // [ ]\r\n        const __m128i T_00_11A = _mm_unpacklo_epi16(m128iS2[3], m128iS6[3]);    // [ ]\r\n        const __m128i T_00_11B = _mm_unpackhi_epi16(m128iS2[3], m128iS6[3]);    // [ ]\r\n\r\n        const __m128i 
T_00_12A = _mm_unpacklo_epi16(m128iS4[0], m128iS4[1]);    // [ ]\r\n        const __m128i T_00_12B = _mm_unpackhi_epi16(m128iS4[0], m128iS4[1]);    // [ ]\r\n        const __m128i T_00_13A = _mm_unpacklo_epi16(m128iS4[2], m128iS4[3]);    // [ ]\r\n        const __m128i T_00_13B = _mm_unpackhi_epi16(m128iS4[2], m128iS4[3]);    // [ ]\r\n\r\n        const __m128i T_00_14A = _mm_unpacklo_epi16(m128iS0[1], m128iS0[3]);    //\r\n        const __m128i T_00_14B = _mm_unpackhi_epi16(m128iS0[1], m128iS0[3]);    // [ ]\r\n        const __m128i T_00_15A = _mm_unpacklo_epi16(m128iS0[0], m128iS0[2]);    //\r\n        const __m128i T_00_15B = _mm_unpackhi_epi16(m128iS0[0], m128iS0[2]);    // [ ]\r\n\r\n        __m128i O00A, O01A, O02A, O03A, O04A, O05A, O06A, O07A, O08A, O09A, O10A, O11A, O12A, O13A, O14A, O15A;\r\n        __m128i O00B, O01B, O02B, O03B, O04B, O05B, O06B, O07B, O08B, O09B, O10B, O11B, O12B, O13B, O14B, O15B;\r\n        __m128i EO0A, EO1A, EO2A, EO3A, EO4A, EO5A, EO6A, EO7A;\r\n        __m128i EO0B, EO1B, EO2B, EO3B, EO4B, EO5B, EO6B, EO7B;\r\n        __m128i T00, T01, T02, T03;\r\n\r\n#define COMPUTE_ROW(r0103, r0507, r0911, r1315, r1719, r2123, r2527, r2931, c0103, c0507, c0911, c1315, c1719, c2123, c2527, c2931, row) \\\r\n    T00 = _mm_add_epi32(_mm_madd_epi16(r0103, c0103), _mm_madd_epi16(r0507, c0507)); \\\r\n    T01 = _mm_add_epi32(_mm_madd_epi16(r0911, c0911), _mm_madd_epi16(r1315, c1315)); \\\r\n    T02 = _mm_add_epi32(_mm_madd_epi16(r1719, c1719), _mm_madd_epi16(r2123, c2123)); \\\r\n    T03 = _mm_add_epi32(_mm_madd_epi16(r2527, c2527), _mm_madd_epi16(r2931, c2931)); \\\r\n    row = _mm_add_epi32(_mm_add_epi32(T00, T01), _mm_add_epi32(T02, T03));\r\n\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, c16_p27_p30, c16_p19_p23, c16_p11_p15, c16_p02_p07, O00A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, 
T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, c16_n43_n36, c16_n44_n45, c16_n30_n39, c16_n07_n19, O01A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, c16_n02_n23, c16_p36_p19, c16_p43_p45, c16_p11_p30, O02A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, c16_p44_p41, c16_n02_p27, c16_n45_n30, c16_n15_n39, O03A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, c16_n23_p15, c16_n34_n45, c16_p36_p02, c16_p19_p44, O04A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, c16_n30_n44, c16_p45_p15, c16_n19_p27, c16_n23_n45, O05A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, c16_p41_n07, c16_n23_p30, c16_n02_n44, c16_p27_p43, O06A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, c16_p07_p45, c16_n15_n44, c16_p23_p41, c16_n30_n36, O07A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, c16_n45_n02, c16_p43_p11, c16_n39_n19, c16_p34_p27, O08A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, c16_p19_n45, c16_n39_p34, c16_p45_n11, c16_n36_n15, 
O09A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, c16_p34_p11, c16_p07_n43, c16_n41_p36, c16_p39_p02, O10A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, c16_n39_p43, c16_p30_p07, c16_p27_n45, c16_n41_p11, O11A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, c16_n11_n19, c16_n45_p36, c16_n07_p34, c16_p43_n23, O12A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, c16_p45_n39, c16_p27_n41, c16_n15_n07, c16_n44_p34, O13A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, c16_n15_p27, c16_p11_p02, c16_p34_n23, c16_p45_n41, O14A)\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, c16_n36_p34, c16_n41_p39, c16_n44_p43, c16_n45_p45, O15A)\r\n\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, c16_p27_p30, c16_p19_p23, c16_p11_p15, c16_p02_p07, O00B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, c16_n43_n36, c16_n44_n45, c16_n30_n39, c16_n07_n19, O01B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p34_p44, 
c16_n07_p15, c16_n41_n27, c16_n39_n45, c16_n02_n23, c16_p36_p19, c16_p43_p45, c16_p11_p30, O02B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, c16_p44_p41, c16_n02_p27, c16_n45_n30, c16_n15_n39, O03B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, c16_n23_p15, c16_n34_n45, c16_p36_p02, c16_p19_p44, O04B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, c16_n30_n44, c16_p45_p15, c16_n19_p27, c16_n23_n45, O05B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, c16_p41_n07, c16_n23_p30, c16_n02_n44, c16_p27_p43, O06B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, c16_p07_p45, c16_n15_n44, c16_p23_p41, c16_n30_n36, O07B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, c16_n45_n02, c16_p43_p11, c16_n39_n19, c16_p34_p27, O08B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, c16_p19_n45, c16_n39_p34, c16_p45_n11, c16_n36_n15, O09B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, c16_p34_p11, c16_p07_n43, c16_n41_p36, c16_p39_p02, O10B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, 
T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, c16_n39_p43, c16_p30_p07, c16_p27_n45, c16_n41_p11, O11B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, c16_n11_n19, c16_n45_p36, c16_n07_p34, c16_p43_n23, O12B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, c16_p45_n39, c16_p27_n41, c16_n15_n07, c16_n44_p34, O13B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, c16_n15_p27, c16_p11_p02, c16_p34_n23, c16_p45_n41, O14B)\r\n        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, c16_n36_p34, c16_n41_p39, c16_n44_p43, c16_n45_p45, O15B)\r\n\r\n#undef COMPUTE_ROW\r\n\r\n        {\r\n#define COMPUTE_ROW(row0206, row1014, row1822, row2630, c0206, c1014, c1822, c2630, row) \\\r\n    T00 = _mm_add_epi32(_mm_madd_epi16(row0206, c0206), _mm_madd_epi16(row1014, c1014)); \\\r\n    T01 = _mm_add_epi32(_mm_madd_epi16(row1822, c1822), _mm_madd_epi16(row2630, c2630)); \\\r\n    row = _mm_add_epi32(T00, T01);\r\n\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, EO0A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, EO1A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, EO2A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, EO3A)\r\n            
COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, EO4A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, EO5A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, EO6A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, EO7A)\r\n\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, EO0B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, EO1B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, EO2B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, EO3B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, EO4B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, EO5B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, EO6B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, EO7B)\r\n#undef COMPUTE_ROW\r\n        }\r\n\r\n        {\r\n            const __m128i EEO0A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_p38_p44), _mm_madd_epi16(T_00_13A, c16_p09_p25));\r\n            const __m128i EEO1A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_n09_p38), _mm_madd_epi16(T_00_13A, c16_n25_n44));\r\n            const __m128i EEO2A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_n44_p25), _mm_madd_epi16(T_00_13A, c16_p38_p09));\r\n            const __m128i EEO3A = 
_mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_n25_p09), _mm_madd_epi16(T_00_13A, c16_n44_p38));\r\n            const __m128i EEO0B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_p38_p44), _mm_madd_epi16(T_00_13B, c16_p09_p25));\r\n            const __m128i EEO1B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_n09_p38), _mm_madd_epi16(T_00_13B, c16_n25_n44));\r\n            const __m128i EEO2B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_n44_p25), _mm_madd_epi16(T_00_13B, c16_p38_p09));\r\n            const __m128i EEO3B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_n25_p09), _mm_madd_epi16(T_00_13B, c16_n44_p38));\r\n\r\n            const __m128i EEEO0A = _mm_madd_epi16(T_00_14A, c16_p17_p42);\r\n            const __m128i EEEO0B = _mm_madd_epi16(T_00_14B, c16_p17_p42);\r\n            const __m128i EEEO1A = _mm_madd_epi16(T_00_14A, c16_n42_p17);\r\n            const __m128i EEEO1B = _mm_madd_epi16(T_00_14B, c16_n42_p17);\r\n\r\n            const __m128i EEEE0A = _mm_madd_epi16(T_00_15A, c16_p32_p32);\r\n            const __m128i EEEE0B = _mm_madd_epi16(T_00_15B, c16_p32_p32);\r\n            const __m128i EEEE1A = _mm_madd_epi16(T_00_15A, c16_n32_p32);\r\n            const __m128i EEEE1B = _mm_madd_epi16(T_00_15B, c16_n32_p32);\r\n\r\n            const __m128i EEE0A = _mm_add_epi32(EEEE0A, EEEO0A);    // EEE0 = EEEE0 + EEEO0\r\n            const __m128i EEE0B = _mm_add_epi32(EEEE0B, EEEO0B);\r\n            const __m128i EEE1A = _mm_add_epi32(EEEE1A, EEEO1A);    // EEE1 = EEEE1 + EEEO1\r\n            const __m128i EEE1B = _mm_add_epi32(EEEE1B, EEEO1B);\r\n            const __m128i EEE3A = _mm_sub_epi32(EEEE0A, EEEO0A);    // EEE3 = EEEE0 - EEEO0\r\n            const __m128i EEE3B = _mm_sub_epi32(EEEE0B, EEEO0B);\r\n            const __m128i EEE2A = _mm_sub_epi32(EEEE1A, EEEO1A);    // EEE2 = EEEE1 - EEEO1\r\n            const __m128i EEE2B = _mm_sub_epi32(EEEE1B, EEEO1B);\r\n\r\n            const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n        
    const __m128i EE0B = _mm_add_epi32(EEE0B, EEO0B);\r\n            const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n            const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n            const __m128i EE2A = _mm_add_epi32(EEE2A, EEO2A);       // EE2 = EEE2 + EEO2\r\n            const __m128i EE2B = _mm_add_epi32(EEE2B, EEO2B);\r\n            const __m128i EE3A = _mm_add_epi32(EEE3A, EEO3A);       // EE3 = EEE3 + EEO3\r\n            const __m128i EE3B = _mm_add_epi32(EEE3B, EEO3B);\r\n            const __m128i EE7A = _mm_sub_epi32(EEE0A, EEO0A);       // EE7 = EEE0 - EEO0\r\n            const __m128i EE7B = _mm_sub_epi32(EEE0B, EEO0B);\r\n            const __m128i EE6A = _mm_sub_epi32(EEE1A, EEO1A);       // EE6 = EEE1 - EEO1\r\n            const __m128i EE6B = _mm_sub_epi32(EEE1B, EEO1B);\r\n            const __m128i EE5A = _mm_sub_epi32(EEE2A, EEO2A);       // EE5 = EEE2 - EEO2\r\n            const __m128i EE5B = _mm_sub_epi32(EEE2B, EEO2B);\r\n            const __m128i EE4A = _mm_sub_epi32(EEE3A, EEO3A);       // EE4 = EEE3 - EEO3\r\n            const __m128i EE4B = _mm_sub_epi32(EEE3B, EEO3B);\r\n\r\n            const __m128i E0A = _mm_add_epi32(EE0A, EO0A);          // E0 = EE0 + EO0\r\n            const __m128i E0B = _mm_add_epi32(EE0B, EO0B);\r\n            const __m128i E1A = _mm_add_epi32(EE1A, EO1A);          // E1 = EE1 + EO1\r\n            const __m128i E1B = _mm_add_epi32(EE1B, EO1B);\r\n            const __m128i E2A = _mm_add_epi32(EE2A, EO2A);          // E2 = EE2 + EO2\r\n            const __m128i E2B = _mm_add_epi32(EE2B, EO2B);\r\n            const __m128i E3A = _mm_add_epi32(EE3A, EO3A);          // E3 = EE3 + EO3\r\n            const __m128i E3B = _mm_add_epi32(EE3B, EO3B);\r\n            const __m128i E4A = _mm_add_epi32(EE4A, EO4A);          // E4 = EE4 + EO4\r\n            const __m128i E4B = _mm_add_epi32(EE4B, EO4B);\r\n            const __m128i E5A = _mm_add_epi32(EE5A, EO5A);          // E5 = EE5 + EO5\r\n            
const __m128i E5B = _mm_add_epi32(EE5B, EO5B);\r\n            const __m128i E6A = _mm_add_epi32(EE6A, EO6A);          // E6 = EE6 + EO6\r\n            const __m128i E6B = _mm_add_epi32(EE6B, EO6B);\r\n            const __m128i E7A = _mm_add_epi32(EE7A, EO7A);          // E7 = EE7 + EO7\r\n            const __m128i E7B = _mm_add_epi32(EE7B, EO7B);\r\n            const __m128i EFA = _mm_sub_epi32(EE0A, EO0A);          // EF = EE0 - EO0\r\n            const __m128i EFB = _mm_sub_epi32(EE0B, EO0B);\r\n            const __m128i EEA = _mm_sub_epi32(EE1A, EO1A);          // EE = EE1 - EO1\r\n            const __m128i EEB = _mm_sub_epi32(EE1B, EO1B);\r\n            const __m128i EDA = _mm_sub_epi32(EE2A, EO2A);          // ED = EE2 - EO2\r\n            const __m128i EDB = _mm_sub_epi32(EE2B, EO2B);\r\n            const __m128i ECA = _mm_sub_epi32(EE3A, EO3A);          // EC = EE3 - EO3\r\n            const __m128i ECB = _mm_sub_epi32(EE3B, EO3B);\r\n            const __m128i EBA = _mm_sub_epi32(EE4A, EO4A);          // EB = EE4 - EO4\r\n            const __m128i EBB = _mm_sub_epi32(EE4B, EO4B);\r\n            const __m128i EAA = _mm_sub_epi32(EE5A, EO5A);          // EA = EE5 - EO5\r\n            const __m128i EAB = _mm_sub_epi32(EE5B, EO5B);\r\n            const __m128i E9A = _mm_sub_epi32(EE6A, EO6A);          // E9 = EE6 - EO6\r\n            const __m128i E9B = _mm_sub_epi32(EE6B, EO6B);\r\n            const __m128i E8A = _mm_sub_epi32(EE7A, EO7A);          // E8 = EE7 - EO7\r\n            const __m128i E8B = _mm_sub_epi32(EE7B, EO7B);\r\n\r\n            const __m128i T10A = _mm_add_epi32(E0A, c32_rnd);       // E0 + rnd\r\n            const __m128i T10B = _mm_add_epi32(E0B, c32_rnd);\r\n            const __m128i T11A = _mm_add_epi32(E1A, c32_rnd);       // E1 + rnd\r\n            const __m128i T11B = _mm_add_epi32(E1B, c32_rnd);\r\n            const __m128i T12A = _mm_add_epi32(E2A, c32_rnd);       // E2 + rnd\r\n            const __m128i T12B = _mm_add_epi32(E2B, c32_rnd);\r\n            const __m128i T13A = _mm_add_epi32(E3A, 
c32_rnd);       // E3 + rnd\r\n            const __m128i T13B = _mm_add_epi32(E3B, c32_rnd);\r\n            const __m128i T14A = _mm_add_epi32(E4A, c32_rnd);       // E4 + rnd\r\n            const __m128i T14B = _mm_add_epi32(E4B, c32_rnd);\r\n            const __m128i T15A = _mm_add_epi32(E5A, c32_rnd);       // E5 + rnd\r\n            const __m128i T15B = _mm_add_epi32(E5B, c32_rnd);\r\n            const __m128i T16A = _mm_add_epi32(E6A, c32_rnd);       // E6 + rnd\r\n            const __m128i T16B = _mm_add_epi32(E6B, c32_rnd);\r\n            const __m128i T17A = _mm_add_epi32(E7A, c32_rnd);       // E7 + rnd\r\n            const __m128i T17B = _mm_add_epi32(E7B, c32_rnd);\r\n            const __m128i T18A = _mm_add_epi32(E8A, c32_rnd);       // E8 + rnd\r\n            const __m128i T18B = _mm_add_epi32(E8B, c32_rnd);\r\n            const __m128i T19A = _mm_add_epi32(E9A, c32_rnd);       // E9 + rnd\r\n            const __m128i T19B = _mm_add_epi32(E9B, c32_rnd);\r\n            const __m128i T1AA = _mm_add_epi32(EAA, c32_rnd);       // E10 + rnd\r\n            const __m128i T1AB = _mm_add_epi32(EAB, c32_rnd);\r\n            const __m128i T1BA = _mm_add_epi32(EBA, c32_rnd);       // E11 + rnd\r\n            const __m128i T1BB = _mm_add_epi32(EBB, c32_rnd);\r\n            const __m128i T1CA = _mm_add_epi32(ECA, c32_rnd);       // E12 + rnd\r\n            const __m128i T1CB = _mm_add_epi32(ECB, c32_rnd);\r\n            const __m128i T1DA = _mm_add_epi32(EDA, c32_rnd);       // E13 + rnd\r\n            const __m128i T1DB = _mm_add_epi32(EDB, c32_rnd);\r\n            const __m128i T1EA = _mm_add_epi32(EEA, c32_rnd);       // E14 + rnd\r\n            const __m128i T1EB = _mm_add_epi32(EEB, c32_rnd);\r\n            const __m128i T1FA = _mm_add_epi32(EFA, c32_rnd);       // E15 + rnd\r\n            const __m128i T1FB = _mm_add_epi32(EFB, c32_rnd);\r\n\r\n            const __m128i T2_00A = _mm_add_epi32(T10A, O00A);       // E0 + O0 + rnd\r\n            const __m128i 
T2_00B = _mm_add_epi32(T10B, O00B);\r\n            const __m128i T2_01A = _mm_add_epi32(T11A, O01A);       // E1 + O1 + rnd\r\n            const __m128i T2_01B = _mm_add_epi32(T11B, O01B);\r\n            const __m128i T2_02A = _mm_add_epi32(T12A, O02A);       // E2 + O2 + rnd\r\n            const __m128i T2_02B = _mm_add_epi32(T12B, O02B);\r\n            const __m128i T2_03A = _mm_add_epi32(T13A, O03A);       // E3 + O3 + rnd\r\n            const __m128i T2_03B = _mm_add_epi32(T13B, O03B);\r\n            const __m128i T2_04A = _mm_add_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_04B = _mm_add_epi32(T14B, O04B);\r\n            const __m128i T2_05A = _mm_add_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_05B = _mm_add_epi32(T15B, O05B);\r\n            const __m128i T2_06A = _mm_add_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_06B = _mm_add_epi32(T16B, O06B);\r\n            const __m128i T2_07A = _mm_add_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_07B = _mm_add_epi32(T17B, O07B);\r\n            const __m128i T2_08A = _mm_add_epi32(T18A, O08A);       // E8\r\n            const __m128i T2_08B = _mm_add_epi32(T18B, O08B);\r\n            const __m128i T2_09A = _mm_add_epi32(T19A, O09A);       // E9\r\n            const __m128i T2_09B = _mm_add_epi32(T19B, O09B);\r\n            const __m128i T2_10A = _mm_add_epi32(T1AA, O10A);       // E10\r\n            const __m128i T2_10B = _mm_add_epi32(T1AB, O10B);\r\n            const __m128i T2_11A = _mm_add_epi32(T1BA, O11A);       // E11\r\n            const __m128i T2_11B = _mm_add_epi32(T1BB, O11B);\r\n            const __m128i T2_12A = _mm_add_epi32(T1CA, O12A);       // E12\r\n            const __m128i T2_12B = _mm_add_epi32(T1CB, O12B);\r\n            const __m128i T2_13A = _mm_add_epi32(T1DA, O13A);       // E13\r\n            const __m128i T2_13B = _mm_add_epi32(T1DB, O13B);\r\n            const __m128i T2_14A = _mm_add_epi32(T1EA, O14A);       // 
E14\r\n            const __m128i T2_14B = _mm_add_epi32(T1EB, O14B);\r\n            const __m128i T2_15A = _mm_add_epi32(T1FA, O15A);       // E15\r\n            const __m128i T2_15B = _mm_add_epi32(T1FB, O15B);\r\n            const __m128i T2_31A = _mm_sub_epi32(T10A, O00A);       // E0 - O0 + rnd\r\n            const __m128i T2_31B = _mm_sub_epi32(T10B, O00B);\r\n            const __m128i T2_30A = _mm_sub_epi32(T11A, O01A);       // E1 - O1 + rnd\r\n            const __m128i T2_30B = _mm_sub_epi32(T11B, O01B);\r\n            const __m128i T2_29A = _mm_sub_epi32(T12A, O02A);       // E2 - O2 + rnd\r\n            const __m128i T2_29B = _mm_sub_epi32(T12B, O02B);\r\n            const __m128i T2_28A = _mm_sub_epi32(T13A, O03A);       // E3 - O3 + rnd\r\n            const __m128i T2_28B = _mm_sub_epi32(T13B, O03B);\r\n            const __m128i T2_27A = _mm_sub_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_27B = _mm_sub_epi32(T14B, O04B);\r\n            const __m128i T2_26A = _mm_sub_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_26B = _mm_sub_epi32(T15B, O05B);\r\n            const __m128i T2_25A = _mm_sub_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_25B = _mm_sub_epi32(T16B, O06B);\r\n            const __m128i T2_24A = _mm_sub_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_24B = _mm_sub_epi32(T17B, O07B);\r\n            const __m128i T2_23A = _mm_sub_epi32(T18A, O08A);       //\r\n            const __m128i T2_23B = _mm_sub_epi32(T18B, O08B);\r\n            const __m128i T2_22A = _mm_sub_epi32(T19A, O09A);       //\r\n            const __m128i T2_22B = _mm_sub_epi32(T19B, O09B);\r\n            const __m128i T2_21A = _mm_sub_epi32(T1AA, O10A);       //\r\n            const __m128i T2_21B = _mm_sub_epi32(T1AB, O10B);\r\n            const __m128i T2_20A = _mm_sub_epi32(T1BA, O11A);       //\r\n            const __m128i T2_20B = _mm_sub_epi32(T1BB, O11B);\r\n            const __m128i T2_19A = 
_mm_sub_epi32(T1CA, O12A);       //\r\n            const __m128i T2_19B = _mm_sub_epi32(T1CB, O12B);\r\n            const __m128i T2_18A = _mm_sub_epi32(T1DA, O13A);       //\r\n            const __m128i T2_18B = _mm_sub_epi32(T1DB, O13B);\r\n            const __m128i T2_17A = _mm_sub_epi32(T1EA, O14A);       //\r\n            const __m128i T2_17B = _mm_sub_epi32(T1EB, O14B);\r\n            const __m128i T2_16A = _mm_sub_epi32(T1FA, O15A);       //\r\n            const __m128i T2_16B = _mm_sub_epi32(T1FB, O15B);\r\n\r\n            const __m128i T3_00A = _mm_srai_epi32(T2_00A, nShift);  // [30 20 10 00]\r\n            const __m128i T3_00B = _mm_srai_epi32(T2_00B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_01A = _mm_srai_epi32(T2_01A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_01B = _mm_srai_epi32(T2_01B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_02A = _mm_srai_epi32(T2_02A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_02B = _mm_srai_epi32(T2_02B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_03A = _mm_srai_epi32(T2_03A, nShift);  // [33 23 13 03]\r\n            const __m128i T3_03B = _mm_srai_epi32(T2_03B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_04A = _mm_srai_epi32(T2_04A, nShift);  // [34 24 14 04]\r\n            const __m128i T3_04B = _mm_srai_epi32(T2_04B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_05A = _mm_srai_epi32(T2_05A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_05B = _mm_srai_epi32(T2_05B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_06A = _mm_srai_epi32(T2_06A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_06B = _mm_srai_epi32(T2_06B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_07A = _mm_srai_epi32(T2_07A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_07B = _mm_srai_epi32(T2_07B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_08A = _mm_srai_epi32(T2_08A, nShift); 
 // [30 20 10 00] x8\r\n            const __m128i T3_08B = _mm_srai_epi32(T2_08B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_09A = _mm_srai_epi32(T2_09A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_09B = _mm_srai_epi32(T2_09B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_10A = _mm_srai_epi32(T2_10A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_10B = _mm_srai_epi32(T2_10B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_11A = _mm_srai_epi32(T2_11A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_11B = _mm_srai_epi32(T2_11B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_12A = _mm_srai_epi32(T2_12A, nShift);  // [34 24 14 04] xC\r\n            const __m128i T3_12B = _mm_srai_epi32(T2_12B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_13A = _mm_srai_epi32(T2_13A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_13B = _mm_srai_epi32(T2_13B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_14A = _mm_srai_epi32(T2_14A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_14B = _mm_srai_epi32(T2_14B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_15A = _mm_srai_epi32(T2_15A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_15B = _mm_srai_epi32(T2_15B, nShift);  // [77 67 57 47]\r\n\r\n            const __m128i T3_16A = _mm_srai_epi32(T2_16A, nShift);  // [30 20 10 00]\r\n            const __m128i T3_16B = _mm_srai_epi32(T2_16B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_17A = _mm_srai_epi32(T2_17A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_17B = _mm_srai_epi32(T2_17B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_18A = _mm_srai_epi32(T2_18A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_18B = _mm_srai_epi32(T2_18B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_19A = _mm_srai_epi32(T2_19A, nShift);  // [33 23 13 03]\r\n            
const __m128i T3_19B = _mm_srai_epi32(T2_19B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_20A = _mm_srai_epi32(T2_20A, nShift);  // [34 24 14 04]\r\n            const __m128i T3_20B = _mm_srai_epi32(T2_20B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_21A = _mm_srai_epi32(T2_21A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_21B = _mm_srai_epi32(T2_21B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_22A = _mm_srai_epi32(T2_22A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_22B = _mm_srai_epi32(T2_22B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_23A = _mm_srai_epi32(T2_23A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_23B = _mm_srai_epi32(T2_23B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_24A = _mm_srai_epi32(T2_24A, nShift);  // [30 20 10 00] x8\r\n            const __m128i T3_24B = _mm_srai_epi32(T2_24B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_25A = _mm_srai_epi32(T2_25A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_25B = _mm_srai_epi32(T2_25B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_26A = _mm_srai_epi32(T2_26A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_26B = _mm_srai_epi32(T2_26B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_27A = _mm_srai_epi32(T2_27A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_27B = _mm_srai_epi32(T2_27B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_28A = _mm_srai_epi32(T2_28A, nShift);  // [34 24 14 04] xC\r\n            const __m128i T3_28B = _mm_srai_epi32(T2_28B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_29A = _mm_srai_epi32(T2_29A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_29B = _mm_srai_epi32(T2_29B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_30A = _mm_srai_epi32(T2_30A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_30B = 
_mm_srai_epi32(T2_30B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_31A = _mm_srai_epi32(T2_31A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_31B = _mm_srai_epi32(T2_31B, nShift);  // [77 67 57 47]\r\n\r\n            res00[0] = _mm_packs_epi32(T3_00A, T3_00B);             // [70 60 50 40 30 20 10 00]\r\n            res01[0] = _mm_packs_epi32(T3_01A, T3_01B);             // [71 61 51 41 31 21 11 01]\r\n            res02[0] = _mm_packs_epi32(T3_02A, T3_02B);             // [72 62 52 42 32 22 12 02]\r\n            res03[0] = _mm_packs_epi32(T3_03A, T3_03B);             // [73 63 53 43 33 23 13 03]\r\n            res04[0] = _mm_packs_epi32(T3_04A, T3_04B);             // [74 64 54 44 34 24 14 04]\r\n            res05[0] = _mm_packs_epi32(T3_05A, T3_05B);             // [75 65 55 45 35 25 15 05]\r\n            res06[0] = _mm_packs_epi32(T3_06A, T3_06B);             // [76 66 56 46 36 26 16 06]\r\n            res07[0] = _mm_packs_epi32(T3_07A, T3_07B);             // [77 67 57 47 37 27 17 07]\r\n            res08[0] = _mm_packs_epi32(T3_08A, T3_08B);             // [A0 ... 80]\r\n            res09[0] = _mm_packs_epi32(T3_09A, T3_09B);             // [A1 ... 81]\r\n            res10[0] = _mm_packs_epi32(T3_10A, T3_10B);             // [A2 ... 82]\r\n            res11[0] = _mm_packs_epi32(T3_11A, T3_11B);             // [A3 ... 83]\r\n            res12[0] = _mm_packs_epi32(T3_12A, T3_12B);             // [A4 ... 84]\r\n            res13[0] = _mm_packs_epi32(T3_13A, T3_13B);             // [A5 ... 85]\r\n            res14[0] = _mm_packs_epi32(T3_14A, T3_14B);             // [A6 ... 86]\r\n            res15[0] = _mm_packs_epi32(T3_15A, T3_15B);             // [A7 ... 
87]\r\n            res16[0] = _mm_packs_epi32(T3_16A, T3_16B);\r\n            res17[0] = _mm_packs_epi32(T3_17A, T3_17B);\r\n            res18[0] = _mm_packs_epi32(T3_18A, T3_18B);\r\n            res19[0] = _mm_packs_epi32(T3_19A, T3_19B);\r\n            res20[0] = _mm_packs_epi32(T3_20A, T3_20B);\r\n            res21[0] = _mm_packs_epi32(T3_21A, T3_21B);\r\n            res22[0] = _mm_packs_epi32(T3_22A, T3_22B);\r\n            res23[0] = _mm_packs_epi32(T3_23A, T3_23B);\r\n            res24[0] = _mm_packs_epi32(T3_24A, T3_24B);\r\n            res25[0] = _mm_packs_epi32(T3_25A, T3_25B);\r\n            res26[0] = _mm_packs_epi32(T3_26A, T3_26B);\r\n            res27[0] = _mm_packs_epi32(T3_27A, T3_27B);\r\n            res28[0] = _mm_packs_epi32(T3_28A, T3_28B);\r\n            res29[0] = _mm_packs_epi32(T3_29A, T3_29B);\r\n            res30[0] = _mm_packs_epi32(T3_30A, T3_30B);\r\n            res31[0] = _mm_packs_epi32(T3_31A, T3_31B);\r\n        }\r\n\r\n        //transpose matrix 8x8 16bit.\r\n        {\r\n            __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n            __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, 
tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0 = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1 = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2 = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3 = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4 = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5 = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6 = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7 = _mm_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n            TRANSPOSE_8x8_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], m128iS0[0], m128iS1[0], m128iS2[0], m128iS3[0], m128iS4[0], m128iS5[0], m128iS6[0], m128iS7[0])\r\n            TRANSPOSE_8x8_16BIT(res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], m128iS0[1], m128iS1[1], m128iS2[1], m128iS3[1], m128iS4[1], m128iS5[1], m128iS6[1], m128iS7[1])\r\n            TRANSPOSE_8x8_16BIT(res16[0], res17[0], res18[0], res19[0], res20[0], res21[0], res22[0], res23[0], m128iS0[2], m128iS1[2], m128iS2[2], m128iS3[2], m128iS4[2], m128iS5[2], m128iS6[2], m128iS7[2])\r\n            TRANSPOSE_8x8_16BIT(res24[0], res25[0], res26[0], res27[0], res28[0], res29[0], res30[0], res31[0], m128iS0[3], m128iS1[3], m128iS2[3], m128iS3[3], m128iS4[3], m128iS5[3], m128iS6[3], m128iS7[3])\r\n\r\n#undef TRANSPOSE_8x8_16BIT\r\n        }\r\n    }\r\n\r\n    //clip\r\n    {\r\n        __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n\r\n        for (i = 0; i < 4; i++) {\r\n            m128iS0[i] = _mm_min_epi16(m128iS0[i], max_val);\r\n            m128iS0[i] = _mm_max_epi16(m128iS0[i], min_val);\r\n\r\n            m128iS1[i] = _mm_min_epi16(m128iS1[i], max_val);\r\n            m128iS1[i] = _mm_max_epi16(m128iS1[i], min_val);\r\n\r\n            m128iS2[i] = _mm_min_epi16(m128iS2[i], max_val);\r\n            m128iS2[i] = _mm_max_epi16(m128iS2[i], min_val);\r\n\r\n            m128iS3[i] = 
_mm_min_epi16(m128iS3[i], max_val);\r\n            m128iS3[i] = _mm_max_epi16(m128iS3[i], min_val);\r\n\r\n            m128iS4[i] = _mm_min_epi16(m128iS4[i], max_val);\r\n            m128iS4[i] = _mm_max_epi16(m128iS4[i], min_val);\r\n\r\n            m128iS5[i] = _mm_min_epi16(m128iS5[i], max_val);\r\n            m128iS5[i] = _mm_max_epi16(m128iS5[i], min_val);\r\n\r\n            m128iS6[i] = _mm_min_epi16(m128iS6[i], max_val);\r\n            m128iS6[i] = _mm_max_epi16(m128iS6[i], min_val);\r\n\r\n            m128iS7[i] = _mm_min_epi16(m128iS7[i], max_val);\r\n            m128iS7[i] = _mm_max_epi16(m128iS7[i], min_val);\r\n        }\r\n    }\r\n    //  coeff_t blk2[32 * 8];\r\n\r\n    // Add\r\n    for (i = 0; i < 2; i++) {\r\n#define STORE_LINE(L0, L1, L2, L3, offsetV) \\\r\n    _mm_store_si128((__m128i*)(dst + offsetV * i_dst +  0), L0); \\\r\n    _mm_store_si128((__m128i*)(dst + offsetV * i_dst +  8), L1); \\\r\n    _mm_store_si128((__m128i*)(dst + offsetV * i_dst + 16), L2); \\\r\n    _mm_store_si128((__m128i*)(dst + offsetV * i_dst + 24), L3);\r\n\r\n        STORE_LINE(m128iS0[0], m128iS0[1], m128iS0[2], m128iS0[3], 0)\r\n        STORE_LINE(m128iS1[0], m128iS1[1], m128iS1[2], m128iS1[3], 1)\r\n        STORE_LINE(m128iS2[0], m128iS2[1], m128iS2[2], m128iS2[3], 2)\r\n        STORE_LINE(m128iS3[0], m128iS3[1], m128iS3[2], m128iS3[3], 3)\r\n        STORE_LINE(m128iS4[0], m128iS4[1], m128iS4[2], m128iS4[3], 4)\r\n        STORE_LINE(m128iS5[0], m128iS5[1], m128iS5[2], m128iS5[3], 5)\r\n        STORE_LINE(m128iS6[0], m128iS6[1], m128iS6[2], m128iS6[3], 6)\r\n        STORE_LINE(m128iS7[0], m128iS7[1], m128iS7[2], m128iS7[3], 7)\r\n\r\n#undef STORE_LINE\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_32x8_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/2-size block: only the top-left 16x8 region has nonzero coefficients\r\n    idct_32x8_sse128(src, dst, i_dst);\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nvoid idct_32x8_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/4-size block: only the top-left 8x8 region has nonzero coefficients\r\n    idct_32x8_half_sse128(src, dst, i_dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_8x32_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    const __m128i c16_p45_p45 = _mm_set1_epi32(0x002D002D);\r\n    const __m128i c16_p43_p44 = _mm_set1_epi32(0x002B002C);\r\n    const __m128i c16_p39_p41 = _mm_set1_epi32(0x00270029);\r\n    const __m128i c16_p34_p36 = _mm_set1_epi32(0x00220024);\r\n    const __m128i c16_p27_p30 = _mm_set1_epi32(0x001B001E);\r\n    const __m128i c16_p19_p23 = _mm_set1_epi32(0x00130017);\r\n    const __m128i c16_p11_p15 = _mm_set1_epi32(0x000B000F);\r\n    const __m128i c16_p02_p07 = _mm_set1_epi32(0x00020007);\r\n    const __m128i c16_p41_p45 = _mm_set1_epi32(0x0029002D);\r\n    const __m128i c16_p23_p34 = _mm_set1_epi32(0x00170022);\r\n    const __m128i c16_n02_p11 = _mm_set1_epi32(0xFFFE000B);\r\n    const __m128i c16_n27_n15 = _mm_set1_epi32(0xFFE5FFF1);\r\n    const __m128i c16_n43_n36 = _mm_set1_epi32(0xFFD5FFDC);\r\n    const __m128i c16_n44_n45 = _mm_set1_epi32(0xFFD4FFD3);\r\n    const __m128i c16_n30_n39 = _mm_set1_epi32(0xFFE2FFD9);\r\n    const __m128i c16_n07_n19 = _mm_set1_epi32(0xFFF9FFED);\r\n    const __m128i c16_p34_p44 = _mm_set1_epi32(0x0022002C);\r\n    const __m128i c16_n07_p15 = _mm_set1_epi32(0xFFF9000F);\r\n    const __m128i c16_n41_n27 = _mm_set1_epi32(0xFFD7FFE5);\r\n    const __m128i c16_n39_n45 = _mm_set1_epi32(0xFFD9FFD3);\r\n    const __m128i c16_n02_n23 = _mm_set1_epi32(0xFFFEFFE9);\r\n    const __m128i c16_p36_p19 = _mm_set1_epi32(0x00240013);\r\n    const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);\r\n    const __m128i c16_p11_p30 = _mm_set1_epi32(0x000B001E);\r\n    const __m128i c16_p23_p43 = 
_mm_set1_epi32(0x0017002B);\r\n    const __m128i c16_n34_n07 = _mm_set1_epi32(0xFFDEFFF9);\r\n    const __m128i c16_n36_n45 = _mm_set1_epi32(0xFFDCFFD3);\r\n    const __m128i c16_p19_n11 = _mm_set1_epi32(0x0013FFF5);\r\n    const __m128i c16_p44_p41 = _mm_set1_epi32(0x002C0029);\r\n    const __m128i c16_n02_p27 = _mm_set1_epi32(0xFFFE001B);\r\n    const __m128i c16_n45_n30 = _mm_set1_epi32(0xFFD3FFE2);\r\n    const __m128i c16_n15_n39 = _mm_set1_epi32(0xFFF1FFD9);\r\n    const __m128i c16_p11_p41 = _mm_set1_epi32(0x000B0029);\r\n    const __m128i c16_n45_n27 = _mm_set1_epi32(0xFFD3FFE5);\r\n    const __m128i c16_p07_n30 = _mm_set1_epi32(0x0007FFE2);\r\n    const __m128i c16_p43_p39 = _mm_set1_epi32(0x002B0027);\r\n    const __m128i c16_n23_p15 = _mm_set1_epi32(0xFFE9000F);\r\n    const __m128i c16_n34_n45 = _mm_set1_epi32(0xFFDEFFD3);\r\n    const __m128i c16_p36_p02 = _mm_set1_epi32(0x00240002);\r\n    const __m128i c16_p19_p44 = _mm_set1_epi32(0x0013002C);\r\n    const __m128i c16_n02_p39 = _mm_set1_epi32(0xFFFE0027);\r\n    const __m128i c16_n36_n41 = _mm_set1_epi32(0xFFDCFFD7);\r\n    const __m128i c16_p43_p07 = _mm_set1_epi32(0x002B0007);\r\n    const __m128i c16_n11_p34 = _mm_set1_epi32(0xFFF50022);\r\n    const __m128i c16_n30_n44 = _mm_set1_epi32(0xFFE2FFD4);\r\n    const __m128i c16_p45_p15 = _mm_set1_epi32(0x002D000F);\r\n    const __m128i c16_n19_p27 = _mm_set1_epi32(0xFFED001B);\r\n    const __m128i c16_n23_n45 = _mm_set1_epi32(0xFFE9FFD3);\r\n    const __m128i c16_n15_p36 = _mm_set1_epi32(0xFFF10024);\r\n    const __m128i c16_n11_n45 = _mm_set1_epi32(0xFFF5FFD3);\r\n    const __m128i c16_p34_p39 = _mm_set1_epi32(0x00220027);\r\n    const __m128i c16_n45_n19 = _mm_set1_epi32(0xFFD3FFED);\r\n    const __m128i c16_p41_n07 = _mm_set1_epi32(0x0029FFF9);\r\n    const __m128i c16_n23_p30 = _mm_set1_epi32(0xFFE9001E);\r\n    const __m128i c16_n02_n44 = _mm_set1_epi32(0xFFFEFFD4);\r\n    const __m128i c16_p27_p43 = _mm_set1_epi32(0x001B002B);\r\n    const 
__m128i c16_n27_p34 = _mm_set1_epi32(0xFFE50022);\r\n    const __m128i c16_p19_n39 = _mm_set1_epi32(0x0013FFD9);\r\n    const __m128i c16_n11_p43 = _mm_set1_epi32(0xFFF5002B);\r\n    const __m128i c16_p02_n45 = _mm_set1_epi32(0x0002FFD3);\r\n    const __m128i c16_p07_p45 = _mm_set1_epi32(0x0007002D);\r\n    const __m128i c16_n15_n44 = _mm_set1_epi32(0xFFF1FFD4);\r\n    const __m128i c16_p23_p41 = _mm_set1_epi32(0x00170029);\r\n    const __m128i c16_n30_n36 = _mm_set1_epi32(0xFFE2FFDC);\r\n    const __m128i c16_n36_p30 = _mm_set1_epi32(0xFFDC001E);\r\n    const __m128i c16_p41_n23 = _mm_set1_epi32(0x0029FFE9);\r\n    const __m128i c16_n44_p15 = _mm_set1_epi32(0xFFD4000F);\r\n    const __m128i c16_p45_n07 = _mm_set1_epi32(0x002DFFF9);\r\n    const __m128i c16_n45_n02 = _mm_set1_epi32(0xFFD3FFFE);\r\n    const __m128i c16_p43_p11 = _mm_set1_epi32(0x002B000B);\r\n    const __m128i c16_n39_n19 = _mm_set1_epi32(0xFFD9FFED);\r\n    const __m128i c16_p34_p27 = _mm_set1_epi32(0x0022001B);\r\n    const __m128i c16_n43_p27 = _mm_set1_epi32(0xFFD5001B);\r\n    const __m128i c16_p44_n02 = _mm_set1_epi32(0x002CFFFE);\r\n    const __m128i c16_n30_n23 = _mm_set1_epi32(0xFFE2FFE9);\r\n    const __m128i c16_p07_p41 = _mm_set1_epi32(0x00070029);\r\n    const __m128i c16_p19_n45 = _mm_set1_epi32(0x0013FFD3);\r\n    const __m128i c16_n39_p34 = _mm_set1_epi32(0xFFD90022);\r\n    const __m128i c16_p45_n11 = _mm_set1_epi32(0x002DFFF5);\r\n    const __m128i c16_n36_n15 = _mm_set1_epi32(0xFFDCFFF1);\r\n    const __m128i c16_n45_p23 = _mm_set1_epi32(0xFFD30017);\r\n    const __m128i c16_p27_p19 = _mm_set1_epi32(0x001B0013);\r\n    const __m128i c16_p15_n45 = _mm_set1_epi32(0x000FFFD3);\r\n    const __m128i c16_n44_p30 = _mm_set1_epi32(0xFFD4001E);\r\n    const __m128i c16_p34_p11 = _mm_set1_epi32(0x0022000B);\r\n    const __m128i c16_p07_n43 = _mm_set1_epi32(0x0007FFD5);\r\n    const __m128i c16_n41_p36 = _mm_set1_epi32(0xFFD70024);\r\n    const __m128i c16_p39_p02 = 
_mm_set1_epi32(0x00270002);\r\n    const __m128i c16_n44_p19 = _mm_set1_epi32(0xFFD40013);\r\n    const __m128i c16_n02_p36 = _mm_set1_epi32(0xFFFE0024);\r\n    const __m128i c16_p45_n34 = _mm_set1_epi32(0x002DFFDE);\r\n    const __m128i c16_n15_n23 = _mm_set1_epi32(0xFFF1FFE9);\r\n    const __m128i c16_n39_p43 = _mm_set1_epi32(0xFFD9002B);\r\n    const __m128i c16_p30_p07 = _mm_set1_epi32(0x001E0007);\r\n    const __m128i c16_p27_n45 = _mm_set1_epi32(0x001BFFD3);\r\n    const __m128i c16_n41_p11 = _mm_set1_epi32(0xFFD7000B);\r\n    const __m128i c16_n39_p15 = _mm_set1_epi32(0xFFD9000F);\r\n    const __m128i c16_n30_p45 = _mm_set1_epi32(0xFFE2002D);\r\n    const __m128i c16_p27_p02 = _mm_set1_epi32(0x001B0002);\r\n    const __m128i c16_p41_n44 = _mm_set1_epi32(0x0029FFD4);\r\n    const __m128i c16_n11_n19 = _mm_set1_epi32(0xFFF5FFED);\r\n    const __m128i c16_n45_p36 = _mm_set1_epi32(0xFFD30024);\r\n    const __m128i c16_n07_p34 = _mm_set1_epi32(0xFFF90022);\r\n    const __m128i c16_p43_n23 = _mm_set1_epi32(0x002BFFE9);\r\n    const __m128i c16_n30_p11 = _mm_set1_epi32(0xFFE2000B);\r\n    const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n    const __m128i c16_n19_p36 = _mm_set1_epi32(0xFFED0024);\r\n    const __m128i c16_p23_n02 = _mm_set1_epi32(0x0017FFFE);\r\n    const __m128i c16_p45_n39 = _mm_set1_epi32(0x002DFFD9);\r\n    const __m128i c16_p27_n41 = _mm_set1_epi32(0x001BFFD7);\r\n    const __m128i c16_n15_n07 = _mm_set1_epi32(0xFFF1FFF9);\r\n    const __m128i c16_n44_p34 = _mm_set1_epi32(0xFFD40022);\r\n    const __m128i c16_n19_p07 = _mm_set1_epi32(0xFFED0007);\r\n    const __m128i c16_n39_p30 = _mm_set1_epi32(0xFFD9001E);\r\n    const __m128i c16_n45_p44 = _mm_set1_epi32(0xFFD3002C);\r\n    const __m128i c16_n36_p43 = _mm_set1_epi32(0xFFDC002B);\r\n    const __m128i c16_n15_p27 = _mm_set1_epi32(0xFFF1001B);\r\n    const __m128i c16_p11_p02 = _mm_set1_epi32(0x000B0002);\r\n    const __m128i c16_p34_n23 = _mm_set1_epi32(0x0022FFE9);\r\n    const 
__m128i c16_p45_n41 = _mm_set1_epi32(0x002DFFD7);\r\n    const __m128i c16_n07_p02 = _mm_set1_epi32(0xFFF90002);\r\n    const __m128i c16_n15_p11 = _mm_set1_epi32(0xFFF1000B);\r\n    const __m128i c16_n23_p19 = _mm_set1_epi32(0xFFE90013);\r\n    const __m128i c16_n30_p27 = _mm_set1_epi32(0xFFE2001B);\r\n    const __m128i c16_n36_p34 = _mm_set1_epi32(0xFFDC0022);\r\n    const __m128i c16_n41_p39 = _mm_set1_epi32(0xFFD70027);\r\n    const __m128i c16_n44_p43 = _mm_set1_epi32(0xFFD4002B);\r\n    const __m128i c16_n45_p45 = _mm_set1_epi32(0xFFD3002D);\r\n\r\n    //  const __m128i c16_p43_p45 = _mm_set1_epi32(0x002B002D);\r\n    const __m128i c16_p35_p40 = _mm_set1_epi32(0x00230028);\r\n    const __m128i c16_p21_p29 = _mm_set1_epi32(0x0015001D);\r\n    const __m128i c16_p04_p13 = _mm_set1_epi32(0x0004000D);\r\n    const __m128i c16_p29_p43 = _mm_set1_epi32(0x001D002B);\r\n    const __m128i c16_n21_p04 = _mm_set1_epi32(0xFFEB0004);\r\n    const __m128i c16_n45_n40 = _mm_set1_epi32(0xFFD3FFD8);\r\n    const __m128i c16_n13_n35 = _mm_set1_epi32(0xFFF3FFDD);\r\n    const __m128i c16_p04_p40 = _mm_set1_epi32(0x00040028);\r\n    const __m128i c16_n43_n35 = _mm_set1_epi32(0xFFD5FFDD);\r\n    const __m128i c16_p29_n13 = _mm_set1_epi32(0x001DFFF3);\r\n    const __m128i c16_p21_p45 = _mm_set1_epi32(0x0015002D);\r\n    const __m128i c16_n21_p35 = _mm_set1_epi32(0xFFEB0023);\r\n    const __m128i c16_p04_n43 = _mm_set1_epi32(0x0004FFD5);\r\n    const __m128i c16_p13_p45 = _mm_set1_epi32(0x000D002D);\r\n    const __m128i c16_n29_n40 = _mm_set1_epi32(0xFFE3FFD8);\r\n    const __m128i c16_n40_p29 = _mm_set1_epi32(0xFFD8001D);\r\n    const __m128i c16_p45_n13 = _mm_set1_epi32(0x002DFFF3);\r\n    const __m128i c16_n43_n04 = _mm_set1_epi32(0xFFD5FFFC);\r\n    const __m128i c16_p35_p21 = _mm_set1_epi32(0x00230015);\r\n    const __m128i c16_n45_p21 = _mm_set1_epi32(0xFFD30015);\r\n    const __m128i c16_p13_p29 = _mm_set1_epi32(0x000D001D);\r\n    const __m128i c16_p35_n43 = 
_mm_set1_epi32(0x0023FFD5);\r\n    const __m128i c16_n40_p04 = _mm_set1_epi32(0xFFD80004);\r\n    const __m128i c16_n35_p13 = _mm_set1_epi32(0xFFDD000D);\r\n    const __m128i c16_n40_p45 = _mm_set1_epi32(0xFFD8002D);\r\n    const __m128i c16_p04_p21 = _mm_set1_epi32(0x00040015);\r\n    const __m128i c16_p43_n29 = _mm_set1_epi32(0x002BFFE3);\r\n    const __m128i c16_n13_p04 = _mm_set1_epi32(0xFFF30004);\r\n    const __m128i c16_n29_p21 = _mm_set1_epi32(0xFFE30015);\r\n    const __m128i c16_n40_p35 = _mm_set1_epi32(0xFFD80023);\r\n    //  const __m128i c16_n45_p43 = _mm_set1_epi32(0xFFD3002B);\r\n\r\n    const __m128i c16_p38_p44 = _mm_set1_epi32(0x0026002C);\r\n    const __m128i c16_p09_p25 = _mm_set1_epi32(0x00090019);\r\n    const __m128i c16_n09_p38 = _mm_set1_epi32(0xFFF70026);\r\n    const __m128i c16_n25_n44 = _mm_set1_epi32(0xFFE7FFD4);\r\n\r\n    const __m128i c16_n44_p25 = _mm_set1_epi32(0xFFD40019);\r\n    const __m128i c16_p38_p09 = _mm_set1_epi32(0x00260009);\r\n    const __m128i c16_n25_p09 = _mm_set1_epi32(0xFFE70009);\r\n    const __m128i c16_n44_p38 = _mm_set1_epi32(0xFFD40026);\r\n\r\n    const __m128i c16_p17_p42 = _mm_set1_epi32(0x0011002A);\r\n    const __m128i c16_n42_p17 = _mm_set1_epi32(0xFFD60011);\r\n\r\n    const __m128i c16_p32_p32 = _mm_set1_epi32(0x00200020);\r\n    const __m128i c16_n32_p32 = _mm_set1_epi32(0xFFE00020);\r\n\r\n    __m128i c32_rnd = _mm_set1_epi32(16);\r\n\r\n    int nShift = 5, pass;\r\n    //int shift1 = 5;\r\n    int shift2 = 20 - g_bit_depth - (i_dst & 0x01);\r\n    //int clip_depth1 = LIMIT_BIT;\r\n    int clip_depth2 = g_bit_depth + 1 + (i_dst & 0x01);\r\n\r\n    // DCT1\r\n    __m128i in00, in01, in02, in03, in04, in05, in06, in07, in08, in09, in10, in11, in12, in13, in14, in15;\r\n    __m128i in16, in17, in18, in19, in20, in21, in22, in23, in24, in25, in26, in27, in28, in29, in30, in31;\r\n    __m128i res00[4], res01[4], res02[4], res03[4], res04[4], res05[4], res06[4], res07[4];\r\n\r\n    i_dst &= 0xFE;\r\n\r\n 
   in00 = _mm_load_si128((const __m128i*)&src[0 * 8]);\r\n    in01 = _mm_load_si128((const __m128i*)&src[ 1 * 8]);\r\n    in02 = _mm_load_si128((const __m128i*)&src[ 2 * 8]);\r\n    in03 = _mm_load_si128((const __m128i*)&src[ 3 * 8]);\r\n    in04 = _mm_load_si128((const __m128i*)&src[ 4 * 8]);\r\n    in05 = _mm_load_si128((const __m128i*)&src[ 5 * 8]);\r\n    in06 = _mm_load_si128((const __m128i*)&src[ 6 * 8]);\r\n    in07 = _mm_load_si128((const __m128i*)&src[ 7 * 8]);\r\n    in08 = _mm_load_si128((const __m128i*)&src[ 8 * 8]);\r\n    in09 = _mm_load_si128((const __m128i*)&src[ 9 * 8]);\r\n    in10 = _mm_load_si128((const __m128i*)&src[10 * 8]);\r\n    in11 = _mm_load_si128((const __m128i*)&src[11 * 8]);\r\n    in12 = _mm_load_si128((const __m128i*)&src[12 * 8]);\r\n    in13 = _mm_load_si128((const __m128i*)&src[13 * 8]);\r\n    in14 = _mm_load_si128((const __m128i*)&src[14 * 8]);\r\n    in15 = _mm_load_si128((const __m128i*)&src[15 * 8]);\r\n    in16 = _mm_load_si128((const __m128i*)&src[16 * 8]);\r\n    in17 = _mm_load_si128((const __m128i*)&src[17 * 8]);\r\n    in18 = _mm_load_si128((const __m128i*)&src[18 * 8]);\r\n    in19 = _mm_load_si128((const __m128i*)&src[19 * 8]);\r\n    in20 = _mm_load_si128((const __m128i*)&src[20 * 8]);\r\n    in21 = _mm_load_si128((const __m128i*)&src[21 * 8]);\r\n    in22 = _mm_load_si128((const __m128i*)&src[22 * 8]);\r\n    in23 = _mm_load_si128((const __m128i*)&src[23 * 8]);\r\n    in24 = _mm_load_si128((const __m128i*)&src[24 * 8]);\r\n    in25 = _mm_load_si128((const __m128i*)&src[25 * 8]);\r\n    in26 = _mm_load_si128((const __m128i*)&src[26 * 8]);\r\n    in27 = _mm_load_si128((const __m128i*)&src[27 * 8]);\r\n    in28 = _mm_load_si128((const __m128i*)&src[28 * 8]);\r\n    in29 = _mm_load_si128((const __m128i*)&src[29 * 8]);\r\n    in30 = _mm_load_si128((const __m128i*)&src[30 * 8]);\r\n    in31 = _mm_load_si128((const __m128i*)&src[31 * 8]);\r\n\r\n    {\r\n        const __m128i T_00_00A = _mm_unpacklo_epi16(in01, in03);    
// [33 13 32 12 31 11 30 10]\r\n        const __m128i T_00_00B = _mm_unpackhi_epi16(in01, in03);    // [37 17 36 16 35 15 34 14]\r\n        const __m128i T_00_01A = _mm_unpacklo_epi16(in05, in07);    // [ ]\r\n        const __m128i T_00_01B = _mm_unpackhi_epi16(in05, in07);    // [ ]\r\n        const __m128i T_00_02A = _mm_unpacklo_epi16(in09, in11);    // [ ]\r\n        const __m128i T_00_02B = _mm_unpackhi_epi16(in09, in11);    // [ ]\r\n        const __m128i T_00_03A = _mm_unpacklo_epi16(in13, in15);    // [ ]\r\n        const __m128i T_00_03B = _mm_unpackhi_epi16(in13, in15);    // [ ]\r\n        const __m128i T_00_04A = _mm_unpacklo_epi16(in17, in19);    // [ ]\r\n        const __m128i T_00_04B = _mm_unpackhi_epi16(in17, in19);    // [ ]\r\n        const __m128i T_00_05A = _mm_unpacklo_epi16(in21, in23);    // [ ]\r\n        const __m128i T_00_05B = _mm_unpackhi_epi16(in21, in23);    // [ ]\r\n        const __m128i T_00_06A = _mm_unpacklo_epi16(in25, in27);    // [ ]\r\n        const __m128i T_00_06B = _mm_unpackhi_epi16(in25, in27);    // [ ]\r\n        const __m128i T_00_07A = _mm_unpacklo_epi16(in29, in31);    //\r\n        const __m128i T_00_07B = _mm_unpackhi_epi16(in29, in31);    // [ ]\r\n\r\n        const __m128i T_00_08A = _mm_unpacklo_epi16(in02, in06);    // [ ]\r\n        const __m128i T_00_08B = _mm_unpackhi_epi16(in02, in06);    // [ ]\r\n        const __m128i T_00_09A = _mm_unpacklo_epi16(in10, in14);    // [ ]\r\n        const __m128i T_00_09B = _mm_unpackhi_epi16(in10, in14);    // [ ]\r\n        const __m128i T_00_10A = _mm_unpacklo_epi16(in18, in22);    // [ ]\r\n        const __m128i T_00_10B = _mm_unpackhi_epi16(in18, in22);    // [ ]\r\n        const __m128i T_00_11A = _mm_unpacklo_epi16(in26, in30);    // [ ]\r\n        const __m128i T_00_11B = _mm_unpackhi_epi16(in26, in30);    // [ ]\r\n\r\n        const __m128i T_00_12A = _mm_unpacklo_epi16(in04, in12);    // [ ]\r\n        const __m128i T_00_12B = _mm_unpackhi_epi16(in04, in12);    
// [ ]\r\n        const __m128i T_00_13A = _mm_unpacklo_epi16(in20, in28);    // [ ]\r\n        const __m128i T_00_13B = _mm_unpackhi_epi16(in20, in28);    // [ ]\r\n\r\n        const __m128i T_00_14A = _mm_unpacklo_epi16(in08, in24);    //\r\n        const __m128i T_00_14B = _mm_unpackhi_epi16(in08, in24);    // [ ]\r\n        const __m128i T_00_15A = _mm_unpacklo_epi16(in00, in16);    //\r\n        const __m128i T_00_15B = _mm_unpackhi_epi16(in00, in16);    // [ ]\r\n\r\n        __m128i O00A, O01A, O02A, O03A, O04A, O05A, O06A, O07A, O08A, O09A, O10A, O11A, O12A, O13A, O14A, O15A;\r\n        __m128i O00B, O01B, O02B, O03B, O04B, O05B, O06B, O07B, O08B, O09B, O10B, O11B, O12B, O13B, O14B, O15B;\r\n        __m128i EO0A, EO1A, EO2A, EO3A, EO4A, EO5A, EO6A, EO7A;\r\n        __m128i EO0B, EO1B, EO2B, EO3B, EO4B, EO5B, EO6B, EO7B;\r\n        {\r\n            __m128i T00, T01, T02, T03;\r\n#define COMPUTE_ROW(r0103, r0507, r0911, r1315, r1719, r2123, r2527, r2931, c0103, c0507, c0911, c1315, c1719, c2123, c2527, c2931, row) \\\r\n    T00 = _mm_add_epi32(_mm_madd_epi16(r0103, c0103), _mm_madd_epi16(r0507, c0507)); \\\r\n    T01 = _mm_add_epi32(_mm_madd_epi16(r0911, c0911), _mm_madd_epi16(r1315, c1315)); \\\r\n    T02 = _mm_add_epi32(_mm_madd_epi16(r1719, c1719), _mm_madd_epi16(r2123, c2123)); \\\r\n    T03 = _mm_add_epi32(_mm_madd_epi16(r2527, c2527), _mm_madd_epi16(r2931, c2931)); \\\r\n    row = _mm_add_epi32(_mm_add_epi32(T00, T01), _mm_add_epi32(T02, T03));\r\n\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, c16_p27_p30, c16_p19_p23, c16_p11_p15, c16_p02_p07, O00A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, c16_n43_n36, c16_n44_n45, c16_n30_n39, c16_n07_n19, O01A)\r\n            COMPUTE_ROW(T_00_00A, 
T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, c16_n02_n23, c16_p36_p19, c16_p43_p45, c16_p11_p30, O02A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, c16_p44_p41, c16_n02_p27, c16_n45_n30, c16_n15_n39, O03A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, c16_n23_p15, c16_n34_n45, c16_p36_p02, c16_p19_p44, O04A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, c16_n30_n44, c16_p45_p15, c16_n19_p27, c16_n23_n45, O05A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, c16_p41_n07, c16_n23_p30, c16_n02_n44, c16_p27_p43, O06A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, c16_p07_p45, c16_n15_n44, c16_p23_p41, c16_n30_n36, O07A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, c16_n45_n02, c16_p43_p11, c16_n39_n19, c16_p34_p27, O08A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, c16_p19_n45, c16_n39_p34, c16_p45_n11, c16_n36_n15, O09A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n45_p23, c16_p27_p19, 
c16_p15_n45, c16_n44_p30, c16_p34_p11, c16_p07_n43, c16_n41_p36, c16_p39_p02, O10A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, c16_n39_p43, c16_p30_p07, c16_p27_n45, c16_n41_p11, O11A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, c16_n11_n19, c16_n45_p36, c16_n07_p34, c16_p43_n23, O12A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, c16_p45_n39, c16_p27_n41, c16_n15_n07, c16_n44_p34, O13A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, c16_n15_p27, c16_p11_p02, c16_p34_n23, c16_p45_n41, O14A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n            c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, c16_n36_p34, c16_n41_p39, c16_n44_p43, c16_n45_p45, O15A)\r\n\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, c16_p27_p30, c16_p19_p23, c16_p11_p15, c16_p02_p07, O00B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, c16_n43_n36, c16_n44_n45, c16_n30_n39, c16_n07_n19, O01B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, c16_n02_n23, c16_p36_p19, c16_p43_p45, c16_p11_p30, O02B)\r\n            
COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, c16_p44_p41, c16_n02_p27, c16_n45_n30, c16_n15_n39, O03B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, c16_n23_p15, c16_n34_n45, c16_p36_p02, c16_p19_p44, O04B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, c16_n30_n44, c16_p45_p15, c16_n19_p27, c16_n23_n45, O05B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, c16_p41_n07, c16_n23_p30, c16_n02_n44, c16_p27_p43, O06B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, c16_p07_p45, c16_n15_n44, c16_p23_p41, c16_n30_n36, O07B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, c16_n45_n02, c16_p43_p11, c16_n39_n19, c16_p34_p27, O08B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, c16_p19_n45, c16_n39_p34, c16_p45_n11, c16_n36_n15, O09B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, c16_p34_p11, c16_p07_n43, c16_n41_p36, c16_p39_p02, O10B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            
c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, c16_n39_p43, c16_p30_p07, c16_p27_n45, c16_n41_p11, O11B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, c16_n11_n19, c16_n45_p36, c16_n07_p34, c16_p43_n23, O12B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, c16_p45_n39, c16_p27_n41, c16_n15_n07, c16_n44_p34, O13B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, c16_n15_p27, c16_p11_p02, c16_p34_n23, c16_p45_n41, O14B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n            c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, c16_n36_p34, c16_n41_p39, c16_n44_p43, c16_n45_p45, O15B)\r\n\r\n#undef COMPUTE_ROW\r\n        }\r\n\r\n        {\r\n            __m128i T00, T01;\r\n#define COMPUTE_ROW(row0206, row1014, row1822, row2630, c0206, c1014, c1822, c2630, row) \\\r\n    T00 = _mm_add_epi32(_mm_madd_epi16(row0206, c0206), _mm_madd_epi16(row1014, c1014)); \\\r\n    T01 = _mm_add_epi32(_mm_madd_epi16(row1822, c1822), _mm_madd_epi16(row2630, c2630)); \\\r\n    row = _mm_add_epi32(T00, T01);\r\n\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, EO0A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, EO1A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, EO2A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, EO3A)\r\n            
COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, EO4A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, EO5A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, EO6A)\r\n            COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, EO7A)\r\n\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, EO0B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, EO1B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, EO2B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, EO3B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, EO4B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, EO5B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, EO6B)\r\n            COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, EO7B)\r\n#undef COMPUTE_ROW\r\n        }\r\n\r\n        {\r\n            const __m128i EEO0A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_p38_p44), _mm_madd_epi16(T_00_13A, c16_p09_p25));\r\n            const __m128i EEO1A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_n09_p38), _mm_madd_epi16(T_00_13A, c16_n25_n44));\r\n            const __m128i EEO2A = _mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_n44_p25), _mm_madd_epi16(T_00_13A, c16_p38_p09));\r\n            const __m128i EEO3A = 
_mm_add_epi32(_mm_madd_epi16(T_00_12A, c16_n25_p09), _mm_madd_epi16(T_00_13A, c16_n44_p38));\r\n            const __m128i EEO0B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_p38_p44), _mm_madd_epi16(T_00_13B, c16_p09_p25));\r\n            const __m128i EEO1B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_n09_p38), _mm_madd_epi16(T_00_13B, c16_n25_n44));\r\n            const __m128i EEO2B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_n44_p25), _mm_madd_epi16(T_00_13B, c16_p38_p09));\r\n            const __m128i EEO3B = _mm_add_epi32(_mm_madd_epi16(T_00_12B, c16_n25_p09), _mm_madd_epi16(T_00_13B, c16_n44_p38));\r\n\r\n            const __m128i EEEO0A = _mm_madd_epi16(T_00_14A, c16_p17_p42);\r\n            const __m128i EEEO0B = _mm_madd_epi16(T_00_14B, c16_p17_p42);\r\n            const __m128i EEEO1A = _mm_madd_epi16(T_00_14A, c16_n42_p17);\r\n            const __m128i EEEO1B = _mm_madd_epi16(T_00_14B, c16_n42_p17);\r\n\r\n            const __m128i EEEE0A = _mm_madd_epi16(T_00_15A, c16_p32_p32);\r\n            const __m128i EEEE0B = _mm_madd_epi16(T_00_15B, c16_p32_p32);\r\n            const __m128i EEEE1A = _mm_madd_epi16(T_00_15A, c16_n32_p32);\r\n            const __m128i EEEE1B = _mm_madd_epi16(T_00_15B, c16_n32_p32);\r\n\r\n            const __m128i EEE0A = _mm_add_epi32(EEEE0A, EEEO0A);    // EEE0 = EEEE0 + EEEO0\r\n            const __m128i EEE0B = _mm_add_epi32(EEEE0B, EEEO0B);\r\n            const __m128i EEE1A = _mm_add_epi32(EEEE1A, EEEO1A);    // EEE1 = EEEE1 + EEEO1\r\n            const __m128i EEE1B = _mm_add_epi32(EEEE1B, EEEO1B);\r\n            const __m128i EEE3A = _mm_sub_epi32(EEEE0A, EEEO0A);    // EEE3 = EEEE0 - EEEO0\r\n            const __m128i EEE3B = _mm_sub_epi32(EEEE0B, EEEO0B);\r\n            const __m128i EEE2A = _mm_sub_epi32(EEEE1A, EEEO1A);    // EEE2 = EEEE1 - EEEO1\r\n            const __m128i EEE2B = _mm_sub_epi32(EEEE1B, EEEO1B);\r\n\r\n            const __m128i EE0A = _mm_add_epi32(EEE0A, EEO0A);       // EE0 = EEE0 + EEO0\r\n        
    const __m128i EE0B = _mm_add_epi32(EEE0B, EEO0B);\r\n            const __m128i EE1A = _mm_add_epi32(EEE1A, EEO1A);       // EE1 = EEE1 + EEO1\r\n            const __m128i EE1B = _mm_add_epi32(EEE1B, EEO1B);\r\n            const __m128i EE2A = _mm_add_epi32(EEE2A, EEO2A);       // EE2 = EEE2 + EEO2\r\n            const __m128i EE2B = _mm_add_epi32(EEE2B, EEO2B);\r\n            const __m128i EE3A = _mm_add_epi32(EEE3A, EEO3A);       // EE3 = EEE3 + EEO3\r\n            const __m128i EE3B = _mm_add_epi32(EEE3B, EEO3B);\r\n            const __m128i EE7A = _mm_sub_epi32(EEE0A, EEO0A);       // EE7 = EEE0 - EEO0\r\n            const __m128i EE7B = _mm_sub_epi32(EEE0B, EEO0B);\r\n            const __m128i EE6A = _mm_sub_epi32(EEE1A, EEO1A);       // EE6 = EEE1 - EEO1\r\n            const __m128i EE6B = _mm_sub_epi32(EEE1B, EEO1B);\r\n            const __m128i EE5A = _mm_sub_epi32(EEE2A, EEO2A);       // EE5 = EEE2 - EEO2\r\n            const __m128i EE5B = _mm_sub_epi32(EEE2B, EEO2B);\r\n            const __m128i EE4A = _mm_sub_epi32(EEE3A, EEO3A);       // EE4 = EEE3 - EEO3\r\n            const __m128i EE4B = _mm_sub_epi32(EEE3B, EEO3B);\r\n\r\n            const __m128i E0A = _mm_add_epi32(EE0A, EO0A);          // E0 = EE0 + EO0\r\n            const __m128i E0B = _mm_add_epi32(EE0B, EO0B);\r\n            const __m128i E1A = _mm_add_epi32(EE1A, EO1A);          // E1 = EE1 + EO1\r\n            const __m128i E1B = _mm_add_epi32(EE1B, EO1B);\r\n            const __m128i E2A = _mm_add_epi32(EE2A, EO2A);          // E2 = EE2 + EO2\r\n            const __m128i E2B = _mm_add_epi32(EE2B, EO2B);\r\n            const __m128i E3A = _mm_add_epi32(EE3A, EO3A);          // E3 = EE3 + EO3\r\n            const __m128i E3B = _mm_add_epi32(EE3B, EO3B);\r\n            const __m128i E4A = _mm_add_epi32(EE4A, EO4A);          // E4 = EE4 + EO4\r\n            const __m128i E4B = _mm_add_epi32(EE4B, EO4B);\r\n            const __m128i E5A = _mm_add_epi32(EE5A, EO5A);          // E5 = EE5 + EO5\r\n            
const __m128i E5B = _mm_add_epi32(EE5B, EO5B);\r\n            const __m128i E6A = _mm_add_epi32(EE6A, EO6A);          // E6 =\r\n            const __m128i E6B = _mm_add_epi32(EE6B, EO6B);\r\n            const __m128i E7A = _mm_add_epi32(EE7A, EO7A);          // E7 =\r\n            const __m128i E7B = _mm_add_epi32(EE7B, EO7B);\r\n            const __m128i EFA = _mm_sub_epi32(EE0A, EO0A);          // EF = EE0 - EO0\r\n            const __m128i EFB = _mm_sub_epi32(EE0B, EO0B);\r\n            const __m128i EEA = _mm_sub_epi32(EE1A, EO1A);          // EE = EE1 - EO1\r\n            const __m128i EEB = _mm_sub_epi32(EE1B, EO1B);\r\n            const __m128i EDA = _mm_sub_epi32(EE2A, EO2A);          // ED = EE2 - EO2\r\n            const __m128i EDB = _mm_sub_epi32(EE2B, EO2B);\r\n            const __m128i ECA = _mm_sub_epi32(EE3A, EO3A);          // EC = EE3 - EO3\r\n            const __m128i ECB = _mm_sub_epi32(EE3B, EO3B);\r\n            const __m128i EBA = _mm_sub_epi32(EE4A, EO4A);          // EB =\r\n            const __m128i EBB = _mm_sub_epi32(EE4B, EO4B);\r\n            const __m128i EAA = _mm_sub_epi32(EE5A, EO5A);          // EA =\r\n            const __m128i EAB = _mm_sub_epi32(EE5B, EO5B);\r\n            const __m128i E9A = _mm_sub_epi32(EE6A, EO6A);          // E9 =\r\n            const __m128i E9B = _mm_sub_epi32(EE6B, EO6B);\r\n            const __m128i E8A = _mm_sub_epi32(EE7A, EO7A);          // E8 =\r\n            const __m128i E8B = _mm_sub_epi32(EE7B, EO7B);\r\n\r\n            const __m128i T10A = _mm_add_epi32(E0A, c32_rnd);       // E0 + rnd\r\n            const __m128i T10B = _mm_add_epi32(E0B, c32_rnd);\r\n            const __m128i T11A = _mm_add_epi32(E1A, c32_rnd);       // E1 + rnd\r\n            const __m128i T11B = _mm_add_epi32(E1B, c32_rnd);\r\n            const __m128i T12A = _mm_add_epi32(E2A, c32_rnd);       // E2 + rnd\r\n            const __m128i T12B = _mm_add_epi32(E2B, c32_rnd);\r\n            const __m128i T13A = _mm_add_epi32(E3A, 
c32_rnd);       // E3 + rnd\r\n            const __m128i T13B = _mm_add_epi32(E3B, c32_rnd);\r\n            const __m128i T14A = _mm_add_epi32(E4A, c32_rnd);       // E4 + rnd\r\n            const __m128i T14B = _mm_add_epi32(E4B, c32_rnd);\r\n            const __m128i T15A = _mm_add_epi32(E5A, c32_rnd);       // E5 + rnd\r\n            const __m128i T15B = _mm_add_epi32(E5B, c32_rnd);\r\n            const __m128i T16A = _mm_add_epi32(E6A, c32_rnd);       // E6 + rnd\r\n            const __m128i T16B = _mm_add_epi32(E6B, c32_rnd);\r\n            const __m128i T17A = _mm_add_epi32(E7A, c32_rnd);       // E7 + rnd\r\n            const __m128i T17B = _mm_add_epi32(E7B, c32_rnd);\r\n            const __m128i T18A = _mm_add_epi32(E8A, c32_rnd);       // E8 + rnd\r\n            const __m128i T18B = _mm_add_epi32(E8B, c32_rnd);\r\n            const __m128i T19A = _mm_add_epi32(E9A, c32_rnd);       // E9 + rnd\r\n            const __m128i T19B = _mm_add_epi32(E9B, c32_rnd);\r\n            const __m128i T1AA = _mm_add_epi32(EAA, c32_rnd);       // E10 + rnd\r\n            const __m128i T1AB = _mm_add_epi32(EAB, c32_rnd);\r\n            const __m128i T1BA = _mm_add_epi32(EBA, c32_rnd);       // E11 + rnd\r\n            const __m128i T1BB = _mm_add_epi32(EBB, c32_rnd);\r\n            const __m128i T1CA = _mm_add_epi32(ECA, c32_rnd);       // E12 + rnd\r\n            const __m128i T1CB = _mm_add_epi32(ECB, c32_rnd);\r\n            const __m128i T1DA = _mm_add_epi32(EDA, c32_rnd);       // E13 + rnd\r\n            const __m128i T1DB = _mm_add_epi32(EDB, c32_rnd);\r\n            const __m128i T1EA = _mm_add_epi32(EEA, c32_rnd);       // E14 + rnd\r\n            const __m128i T1EB = _mm_add_epi32(EEB, c32_rnd);\r\n            const __m128i T1FA = _mm_add_epi32(EFA, c32_rnd);       // E15 + rnd\r\n            const __m128i T1FB = _mm_add_epi32(EFB, c32_rnd);\r\n\r\n            const __m128i T2_00A = _mm_add_epi32(T10A, O00A);       // E0 + O0 + rnd\r\n            const __m128i 
T2_00B = _mm_add_epi32(T10B, O00B);\r\n            const __m128i T2_01A = _mm_add_epi32(T11A, O01A);       // E1 + O1 + rnd\r\n            const __m128i T2_01B = _mm_add_epi32(T11B, O01B);\r\n            const __m128i T2_02A = _mm_add_epi32(T12A, O02A);       // E2 + O2 + rnd\r\n            const __m128i T2_02B = _mm_add_epi32(T12B, O02B);\r\n            const __m128i T2_03A = _mm_add_epi32(T13A, O03A);       // E3 + O3 + rnd\r\n            const __m128i T2_03B = _mm_add_epi32(T13B, O03B);\r\n            const __m128i T2_04A = _mm_add_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_04B = _mm_add_epi32(T14B, O04B);\r\n            const __m128i T2_05A = _mm_add_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_05B = _mm_add_epi32(T15B, O05B);\r\n            const __m128i T2_06A = _mm_add_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_06B = _mm_add_epi32(T16B, O06B);\r\n            const __m128i T2_07A = _mm_add_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_07B = _mm_add_epi32(T17B, O07B);\r\n            const __m128i T2_08A = _mm_add_epi32(T18A, O08A);       // E8\r\n            const __m128i T2_08B = _mm_add_epi32(T18B, O08B);\r\n            const __m128i T2_09A = _mm_add_epi32(T19A, O09A);       // E9\r\n            const __m128i T2_09B = _mm_add_epi32(T19B, O09B);\r\n            const __m128i T2_10A = _mm_add_epi32(T1AA, O10A);       // E10\r\n            const __m128i T2_10B = _mm_add_epi32(T1AB, O10B);\r\n            const __m128i T2_11A = _mm_add_epi32(T1BA, O11A);       // E11\r\n            const __m128i T2_11B = _mm_add_epi32(T1BB, O11B);\r\n            const __m128i T2_12A = _mm_add_epi32(T1CA, O12A);       // E12\r\n            const __m128i T2_12B = _mm_add_epi32(T1CB, O12B);\r\n            const __m128i T2_13A = _mm_add_epi32(T1DA, O13A);       // E13\r\n            const __m128i T2_13B = _mm_add_epi32(T1DB, O13B);\r\n            const __m128i T2_14A = _mm_add_epi32(T1EA, O14A);       // 
E14\r\n            const __m128i T2_14B = _mm_add_epi32(T1EB, O14B);\r\n            const __m128i T2_15A = _mm_add_epi32(T1FA, O15A);       // E15\r\n            const __m128i T2_15B = _mm_add_epi32(T1FB, O15B);\r\n            const __m128i T2_31A = _mm_sub_epi32(T10A, O00A);       // E0 - O0 + rnd\r\n            const __m128i T2_31B = _mm_sub_epi32(T10B, O00B);\r\n            const __m128i T2_30A = _mm_sub_epi32(T11A, O01A);       // E1 - O1 + rnd\r\n            const __m128i T2_30B = _mm_sub_epi32(T11B, O01B);\r\n            const __m128i T2_29A = _mm_sub_epi32(T12A, O02A);       // E2 - O2 + rnd\r\n            const __m128i T2_29B = _mm_sub_epi32(T12B, O02B);\r\n            const __m128i T2_28A = _mm_sub_epi32(T13A, O03A);       // E3 - O3 + rnd\r\n            const __m128i T2_28B = _mm_sub_epi32(T13B, O03B);\r\n            const __m128i T2_27A = _mm_sub_epi32(T14A, O04A);       // E4\r\n            const __m128i T2_27B = _mm_sub_epi32(T14B, O04B);\r\n            const __m128i T2_26A = _mm_sub_epi32(T15A, O05A);       // E5\r\n            const __m128i T2_26B = _mm_sub_epi32(T15B, O05B);\r\n            const __m128i T2_25A = _mm_sub_epi32(T16A, O06A);       // E6\r\n            const __m128i T2_25B = _mm_sub_epi32(T16B, O06B);\r\n            const __m128i T2_24A = _mm_sub_epi32(T17A, O07A);       // E7\r\n            const __m128i T2_24B = _mm_sub_epi32(T17B, O07B);\r\n            const __m128i T2_23A = _mm_sub_epi32(T18A, O08A);       //\r\n            const __m128i T2_23B = _mm_sub_epi32(T18B, O08B);\r\n            const __m128i T2_22A = _mm_sub_epi32(T19A, O09A);       //\r\n            const __m128i T2_22B = _mm_sub_epi32(T19B, O09B);\r\n            const __m128i T2_21A = _mm_sub_epi32(T1AA, O10A);       //\r\n            const __m128i T2_21B = _mm_sub_epi32(T1AB, O10B);\r\n            const __m128i T2_20A = _mm_sub_epi32(T1BA, O11A);       //\r\n            const __m128i T2_20B = _mm_sub_epi32(T1BB, O11B);\r\n            const __m128i T2_19A = 
_mm_sub_epi32(T1CA, O12A);       //\r\n            const __m128i T2_19B = _mm_sub_epi32(T1CB, O12B);\r\n            const __m128i T2_18A = _mm_sub_epi32(T1DA, O13A);       //\r\n            const __m128i T2_18B = _mm_sub_epi32(T1DB, O13B);\r\n            const __m128i T2_17A = _mm_sub_epi32(T1EA, O14A);       //\r\n            const __m128i T2_17B = _mm_sub_epi32(T1EB, O14B);\r\n            const __m128i T2_16A = _mm_sub_epi32(T1FA, O15A);       //\r\n            const __m128i T2_16B = _mm_sub_epi32(T1FB, O15B);\r\n\r\n            const __m128i T3_00A = _mm_srai_epi32(T2_00A, nShift);  // [30 20 10 00]\r\n            const __m128i T3_00B = _mm_srai_epi32(T2_00B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_01A = _mm_srai_epi32(T2_01A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_01B = _mm_srai_epi32(T2_01B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_02A = _mm_srai_epi32(T2_02A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_02B = _mm_srai_epi32(T2_02B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_03A = _mm_srai_epi32(T2_03A, nShift);  // [33 23 13 03]\r\n            const __m128i T3_03B = _mm_srai_epi32(T2_03B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_04A = _mm_srai_epi32(T2_04A, nShift);  // [34 24 14 04]\r\n            const __m128i T3_04B = _mm_srai_epi32(T2_04B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_05A = _mm_srai_epi32(T2_05A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_05B = _mm_srai_epi32(T2_05B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_06A = _mm_srai_epi32(T2_06A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_06B = _mm_srai_epi32(T2_06B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_07A = _mm_srai_epi32(T2_07A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_07B = _mm_srai_epi32(T2_07B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_08A = _mm_srai_epi32(T2_08A, nShift); 
 // [30 20 10 00] x8\r\n            const __m128i T3_08B = _mm_srai_epi32(T2_08B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_09A = _mm_srai_epi32(T2_09A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_09B = _mm_srai_epi32(T2_09B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_10A = _mm_srai_epi32(T2_10A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_10B = _mm_srai_epi32(T2_10B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_11A = _mm_srai_epi32(T2_11A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_11B = _mm_srai_epi32(T2_11B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_12A = _mm_srai_epi32(T2_12A, nShift);  // [33 24 14 04] xC\r\n            const __m128i T3_12B = _mm_srai_epi32(T2_12B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_13A = _mm_srai_epi32(T2_13A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_13B = _mm_srai_epi32(T2_13B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_14A = _mm_srai_epi32(T2_14A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_14B = _mm_srai_epi32(T2_14B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_15A = _mm_srai_epi32(T2_15A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_15B = _mm_srai_epi32(T2_15B, nShift);  // [77 67 57 47]\r\n\r\n            const __m128i T3_16A = _mm_srai_epi32(T2_16A, nShift);  // [30 20 10 00]\r\n            const __m128i T3_16B = _mm_srai_epi32(T2_16B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_17A = _mm_srai_epi32(T2_17A, nShift);  // [31 21 11 01]\r\n            const __m128i T3_17B = _mm_srai_epi32(T2_17B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_18A = _mm_srai_epi32(T2_18A, nShift);  // [32 22 12 02]\r\n            const __m128i T3_18B = _mm_srai_epi32(T2_18B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_19A = _mm_srai_epi32(T2_19A, nShift);  // [33 23 13 03]\r\n            
const __m128i T3_19B = _mm_srai_epi32(T2_19B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_20A = _mm_srai_epi32(T2_20A, nShift);  // [33 24 14 04]\r\n            const __m128i T3_20B = _mm_srai_epi32(T2_20B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_21A = _mm_srai_epi32(T2_21A, nShift);  // [35 25 15 05]\r\n            const __m128i T3_21B = _mm_srai_epi32(T2_21B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_22A = _mm_srai_epi32(T2_22A, nShift);  // [36 26 16 06]\r\n            const __m128i T3_22B = _mm_srai_epi32(T2_22B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_23A = _mm_srai_epi32(T2_23A, nShift);  // [37 27 17 07]\r\n            const __m128i T3_23B = _mm_srai_epi32(T2_23B, nShift);  // [77 67 57 47]\r\n            const __m128i T3_24A = _mm_srai_epi32(T2_24A, nShift);  // [30 20 10 00] x8\r\n            const __m128i T3_24B = _mm_srai_epi32(T2_24B, nShift);  // [70 60 50 40]\r\n            const __m128i T3_25A = _mm_srai_epi32(T2_25A, nShift);  // [31 21 11 01] x9\r\n            const __m128i T3_25B = _mm_srai_epi32(T2_25B, nShift);  // [71 61 51 41]\r\n            const __m128i T3_26A = _mm_srai_epi32(T2_26A, nShift);  // [32 22 12 02] xA\r\n            const __m128i T3_26B = _mm_srai_epi32(T2_26B, nShift);  // [72 62 52 42]\r\n            const __m128i T3_27A = _mm_srai_epi32(T2_27A, nShift);  // [33 23 13 03] xB\r\n            const __m128i T3_27B = _mm_srai_epi32(T2_27B, nShift);  // [73 63 53 43]\r\n            const __m128i T3_28A = _mm_srai_epi32(T2_28A, nShift);  // [33 24 14 04] xC\r\n            const __m128i T3_28B = _mm_srai_epi32(T2_28B, nShift);  // [74 64 54 44]\r\n            const __m128i T3_29A = _mm_srai_epi32(T2_29A, nShift);  // [35 25 15 05] xD\r\n            const __m128i T3_29B = _mm_srai_epi32(T2_29B, nShift);  // [75 65 55 45]\r\n            const __m128i T3_30A = _mm_srai_epi32(T2_30A, nShift);  // [36 26 16 06] xE\r\n            const __m128i T3_30B = 
_mm_srai_epi32(T2_30B, nShift);  // [76 66 56 46]\r\n            const __m128i T3_31A = _mm_srai_epi32(T2_31A, nShift);  // [37 27 17 07] xF\r\n            const __m128i T3_31B = _mm_srai_epi32(T2_31B, nShift);  // [77 67 57 47]\r\n\r\n            res00[0] = _mm_packs_epi32(T3_00A, T3_00B);             // [70 60 50 40 30 20 10 00]\r\n            res01[0] = _mm_packs_epi32(T3_01A, T3_01B);             // [71 61 51 41 31 21 11 01]\r\n            res02[0] = _mm_packs_epi32(T3_02A, T3_02B);             // [72 62 52 42 32 22 12 02]\r\n            res03[0] = _mm_packs_epi32(T3_03A, T3_03B);             // [73 63 53 43 33 23 13 03]\r\n            res04[0] = _mm_packs_epi32(T3_04A, T3_04B);             // [74 64 54 44 34 24 14 04]\r\n            res05[0] = _mm_packs_epi32(T3_05A, T3_05B);             // [75 65 55 45 35 25 15 05]\r\n            res06[0] = _mm_packs_epi32(T3_06A, T3_06B);             // [76 66 56 46 36 26 16 06]\r\n            res07[0] = _mm_packs_epi32(T3_07A, T3_07B);             // [77 67 57 47 37 27 17 07]\r\n            res00[1] = _mm_packs_epi32(T3_08A, T3_08B);             // [A0 ... 80]\r\n            res01[1] = _mm_packs_epi32(T3_09A, T3_09B);             // [A1 ... 81]\r\n            res02[1] = _mm_packs_epi32(T3_10A, T3_10B);             // [A2 ... 82]\r\n            res03[1] = _mm_packs_epi32(T3_11A, T3_11B);             // [A3 ... 83]\r\n            res04[1] = _mm_packs_epi32(T3_12A, T3_12B);             // [A4 ... 84]\r\n            res05[1] = _mm_packs_epi32(T3_13A, T3_13B);             // [A5 ... 85]\r\n            res06[1] = _mm_packs_epi32(T3_14A, T3_14B);             // [A6 ... 86]\r\n            res07[1] = _mm_packs_epi32(T3_15A, T3_15B);             // [A7 ... 
87]\r\n            res00[2] = _mm_packs_epi32(T3_16A, T3_16B);\r\n            res01[2] = _mm_packs_epi32(T3_17A, T3_17B);\r\n            res02[2] = _mm_packs_epi32(T3_18A, T3_18B);\r\n            res03[2] = _mm_packs_epi32(T3_19A, T3_19B);\r\n            res04[2] = _mm_packs_epi32(T3_20A, T3_20B);\r\n            res05[2] = _mm_packs_epi32(T3_21A, T3_21B);\r\n            res06[2] = _mm_packs_epi32(T3_22A, T3_22B);\r\n            res07[2] = _mm_packs_epi32(T3_23A, T3_23B);\r\n            res00[3] = _mm_packs_epi32(T3_24A, T3_24B);\r\n            res01[3] = _mm_packs_epi32(T3_25A, T3_25B);\r\n            res02[3] = _mm_packs_epi32(T3_26A, T3_26B);\r\n            res03[3] = _mm_packs_epi32(T3_27A, T3_27B);\r\n            res04[3] = _mm_packs_epi32(T3_28A, T3_28B);\r\n            res05[3] = _mm_packs_epi32(T3_29A, T3_29B);\r\n            res06[3] = _mm_packs_epi32(T3_30A, T3_30B);\r\n            res07[3] = _mm_packs_epi32(T3_31A, T3_31B);\r\n        }\r\n\r\n    }\r\n\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0 = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1 = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2 = _mm_unpacklo_epi64(tr1_2, 
tr1_6); \\\r\n    O3 = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4 = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5 = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6 = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7 = _mm_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n    //clip\r\n    {\r\n        __m128i max_val = _mm_set1_epi16((1 << (clip_depth2 - 1)) - 1);\r\n        __m128i min_val = _mm_set1_epi16(-(1 << (clip_depth2 - 1)));\r\n\r\n        c32_rnd = _mm_set1_epi32(1 << (shift2 - 1));    // add2\r\n        nShift = shift2;\r\n\r\n        for (pass = 0; pass < 4; pass++) {\r\n            __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n            __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n            __m128i m128Tmp0, m128Tmp1, m128Tmp2, m128Tmp3, E0h, E1h, E2h, E3h, E0l, E1l, E2l, E3l, O0h, O1h, O2h, O3h, O0l, O1l, O2l, O3l, EE0l, EE1l, E00l, E01l, EE0h, EE1h, E00h, E01h;\r\n\r\n            TRANSPOSE_8x8_16BIT(res00[pass], res01[pass], res02[pass], res03[pass], res04[pass], res05[pass], res06[pass], res07[pass], in00, in01, in02, in03, in04, in05, in06, in07)\r\n\r\n            m128Tmp0 = _mm_unpacklo_epi16(in01, in03);\r\n            E1l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[0])));\r\n            m128Tmp1 = _mm_unpackhi_epi16(in01, in03);\r\n            E1h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[0])));\r\n\r\n            m128Tmp2 = _mm_unpacklo_epi16(in05, in07);\r\n            E2l = _mm_madd_epi16(m128Tmp2, _mm_load_si128((__m128i*)(tab_idct_8x8[1])));\r\n            m128Tmp3 = _mm_unpackhi_epi16(in05, in07);\r\n            E2h = _mm_madd_epi16(m128Tmp3, _mm_load_si128((__m128i*)(tab_idct_8x8[1])));\r\n            O0l = _mm_add_epi32(E1l, E2l);\r\n            O0h = _mm_add_epi32(E1h, E2h);\r\n\r\n            E1l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[2])));\r\n            E1h = _mm_madd_epi16(m128Tmp1, 
_mm_load_si128((__m128i*)(tab_idct_8x8[2])));\r\n            E2l = _mm_madd_epi16(m128Tmp2, _mm_load_si128((__m128i*)(tab_idct_8x8[3])));\r\n            E2h = _mm_madd_epi16(m128Tmp3, _mm_load_si128((__m128i*)(tab_idct_8x8[3])));\r\n\r\n            O1l = _mm_add_epi32(E1l, E2l);\r\n            O1h = _mm_add_epi32(E1h, E2h);\r\n\r\n            E1l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[4])));\r\n            E1h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[4])));\r\n            E2l = _mm_madd_epi16(m128Tmp2, _mm_load_si128((__m128i*)(tab_idct_8x8[5])));\r\n            E2h = _mm_madd_epi16(m128Tmp3, _mm_load_si128((__m128i*)(tab_idct_8x8[5])));\r\n            O2l = _mm_add_epi32(E1l, E2l);\r\n            O2h = _mm_add_epi32(E1h, E2h);\r\n\r\n            E1l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[6])));\r\n            E1h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[6])));\r\n            E2l = _mm_madd_epi16(m128Tmp2, _mm_load_si128((__m128i*)(tab_idct_8x8[7])));\r\n            E2h = _mm_madd_epi16(m128Tmp3, _mm_load_si128((__m128i*)(tab_idct_8x8[7])));\r\n            O3h = _mm_add_epi32(E1h, E2h);\r\n            O3l = _mm_add_epi32(E1l, E2l);\r\n\r\n            /*    -------     */\r\n            m128Tmp0 = _mm_unpacklo_epi16(in00, in04);\r\n            EE0l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[8])));\r\n            m128Tmp1 = _mm_unpackhi_epi16(in00, in04);\r\n            EE0h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[8])));\r\n\r\n            EE1l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[9])));\r\n            EE1h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[9])));\r\n\r\n            /*    -------     */\r\n            m128Tmp0 = _mm_unpacklo_epi16(in02, in06);\r\n            E00l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[10])));\r\n            
m128Tmp1 = _mm_unpackhi_epi16(in02, in06);\r\n            E00h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[10])));\r\n            E01l = _mm_madd_epi16(m128Tmp0, _mm_load_si128((__m128i*)(tab_idct_8x8[11])));\r\n            E01h = _mm_madd_epi16(m128Tmp1, _mm_load_si128((__m128i*)(tab_idct_8x8[11])));\r\n            E0l = _mm_add_epi32(EE0l, E00l);\r\n            E0l = _mm_add_epi32(E0l, c32_rnd);\r\n            E0h = _mm_add_epi32(EE0h, E00h);\r\n            E0h = _mm_add_epi32(E0h, c32_rnd);\r\n            E3l = _mm_sub_epi32(EE0l, E00l);\r\n            E3l = _mm_add_epi32(E3l, c32_rnd);\r\n            E3h = _mm_sub_epi32(EE0h, E00h);\r\n            E3h = _mm_add_epi32(E3h, c32_rnd);\r\n\r\n            E1l = _mm_add_epi32(EE1l, E01l);\r\n            E1l = _mm_add_epi32(E1l, c32_rnd);\r\n            E1h = _mm_add_epi32(EE1h, E01h);\r\n            E1h = _mm_add_epi32(E1h, c32_rnd);\r\n            E2l = _mm_sub_epi32(EE1l, E01l);\r\n            E2l = _mm_add_epi32(E2l, c32_rnd);\r\n            E2h = _mm_sub_epi32(EE1h, E01h);\r\n            E2h = _mm_add_epi32(E2h, c32_rnd);\r\n            in00 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E0l, O0l), nShift), _mm_srai_epi32(_mm_add_epi32(E0h, O0h), nShift));     // E +/- O (rounding already added to E), shift by nShift, saturate-pack to 16 bit\r\n            in07 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E0l, O0l), nShift), _mm_srai_epi32(_mm_sub_epi32(E0h, O0h), nShift));\r\n            in01 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E1l, O1l), nShift), _mm_srai_epi32(_mm_add_epi32(E1h, O1h), nShift));\r\n            in06 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E1l, O1l), nShift), _mm_srai_epi32(_mm_sub_epi32(E1h, O1h), nShift));\r\n            in02 = _mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E2l, O2l), nShift), _mm_srai_epi32(_mm_add_epi32(E2h, O2h), nShift));\r\n            in05 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E2l, O2l), nShift), _mm_srai_epi32(_mm_sub_epi32(E2h, O2h), nShift));\r\n            in03 = 
_mm_packs_epi32(_mm_srai_epi32(_mm_add_epi32(E3l, O3l), nShift), _mm_srai_epi32(_mm_add_epi32(E3h, O3h), nShift));\r\n            in04 = _mm_packs_epi32(_mm_srai_epi32(_mm_sub_epi32(E3l, O3l), nShift), _mm_srai_epi32(_mm_sub_epi32(E3h, O3h), nShift));\r\n\r\n            /*  Inverse matrix  */\r\n            E0l = _mm_unpacklo_epi16(in00, in04);\r\n            E1l = _mm_unpacklo_epi16(in01, in05);\r\n            E2l = _mm_unpacklo_epi16(in02, in06);\r\n            E3l = _mm_unpacklo_epi16(in03, in07);\r\n            O0l = _mm_unpackhi_epi16(in00, in04);\r\n            O1l = _mm_unpackhi_epi16(in01, in05);\r\n            O2l = _mm_unpackhi_epi16(in02, in06);\r\n            O3l = _mm_unpackhi_epi16(in03, in07);\r\n\r\n            m128Tmp0 = _mm_unpacklo_epi16(E0l, E2l);\r\n            m128Tmp1 = _mm_unpacklo_epi16(E1l, E3l);\r\n            in00 = _mm_unpacklo_epi16(m128Tmp0, m128Tmp1);\r\n            in00 = _mm_min_epi16(in00, max_val);\r\n            in00 = _mm_max_epi16(in00, min_val);\r\n            _mm_store_si128((__m128i*)&dst[pass * 8 * i_dst + 0 * 8], in00);\r\n            in01 = _mm_unpackhi_epi16(m128Tmp0, m128Tmp1);\r\n            in01 = _mm_min_epi16(in01, max_val);\r\n            in01 = _mm_max_epi16(in01, min_val);\r\n            _mm_store_si128((__m128i*)&dst[pass * 8 * i_dst + 1 * 8], in01);\r\n\r\n            m128Tmp2 = _mm_unpackhi_epi16(E0l, E2l);\r\n            m128Tmp3 = _mm_unpackhi_epi16(E1l, E3l);\r\n            in02 = _mm_unpacklo_epi16(m128Tmp2, m128Tmp3);\r\n            in02 = _mm_min_epi16(in02, max_val);\r\n            in02 = _mm_max_epi16(in02, min_val);\r\n            _mm_store_si128((__m128i*)&dst[pass * 8 * i_dst + 2 * 8], in02);\r\n            in03 = _mm_unpackhi_epi16(m128Tmp2, m128Tmp3);\r\n            in03 = _mm_min_epi16(in03, max_val);\r\n            in03 = _mm_max_epi16(in03, min_val);\r\n            _mm_store_si128((__m128i*)&dst[pass * 8 * i_dst + 3 * 8], in03);\r\n\r\n            m128Tmp0 = _mm_unpacklo_epi16(O0l, O2l);\r\n   
         m128Tmp1 = _mm_unpacklo_epi16(O1l, O3l);\r\n            in04 = _mm_unpacklo_epi16(m128Tmp0, m128Tmp1);\r\n            in04 = _mm_min_epi16(in04, max_val);\r\n            in04 = _mm_max_epi16(in04, min_val);\r\n            _mm_store_si128((__m128i*)&dst[pass * 8 * i_dst + 4 * 8], in04);\r\n            in05 = _mm_unpackhi_epi16(m128Tmp0, m128Tmp1);\r\n            in05 = _mm_min_epi16(in05, max_val);\r\n            in05 = _mm_max_epi16(in05, min_val);\r\n            _mm_store_si128((__m128i*)&dst[pass * 8 * i_dst + 5 * 8], in05);\r\n\r\n            m128Tmp2 = _mm_unpackhi_epi16(O0l, O2l);\r\n            m128Tmp3 = _mm_unpackhi_epi16(O1l, O3l);\r\n            in06 = _mm_unpacklo_epi16(m128Tmp2, m128Tmp3);\r\n            in06 = _mm_min_epi16(in06, max_val);\r\n            in06 = _mm_max_epi16(in06, min_val);\r\n            _mm_store_si128((__m128i*)&dst[pass * 8 * i_dst + 6 * 8], in06);\r\n            in07 = _mm_unpackhi_epi16(m128Tmp2, m128Tmp3);\r\n            in07 = _mm_min_epi16(in07, max_val);\r\n            in07 = _mm_max_epi16(in07, min_val);\r\n            _mm_store_si128((__m128i*)&dst[pass * 8 * i_dst + 7 * 8], in07);\r\n        }\r\n    }\r\n#undef TRANSPOSE_8x8_16BIT\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_8x32_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/2 size: only the top-left 8x16 block has non-zero coefficients\r\n    idct_8x32_sse128(src, dst, i_dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_8x32_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    // TODO: implement this\r\n    // 1/4 size: only the top-left 8x8 block has non-zero coefficients\r\n    idct_8x32_half_sse128(src, dst, i_dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void inv_2nd_trans_hor_sse128(coeff_t *coeff, int i_coeff, int i_shift, const int16_t *tc)\r\n{\r\n    int rnd_factor = 1 << (i_shift - 
1);\r\n    int j;\r\n\r\n    __m128i factor = _mm_set1_epi32(rnd_factor);\r\n    __m128i tmpZero = _mm_setzero_si128();                      // 0 elements\r\n\r\n    // load tc data, a matrix of 4x4\r\n    __m128i tmpLoad0 = _mm_loadu_si128((__m128i*)&tc[0 * SEC_TR_SIZE + 0]);  // tc[0][] & tc[1][]\r\n    __m128i tmpLoad1 = _mm_loadu_si128((__m128i*)&tc[2 * SEC_TR_SIZE + 0]);  // tc[2][] & tc[3][]\r\n    __m128i tmpCoef0 = _mm_unpacklo_epi16(tmpLoad0, tmpZero);   // tc[0][]\r\n    __m128i tmpCoef1 = _mm_unpackhi_epi16(tmpLoad0, tmpZero);   // tc[1][]\r\n    __m128i tmpCoef2 = _mm_unpacklo_epi16(tmpLoad1, tmpZero);   // tc[2][]\r\n    __m128i tmpCoef3 = _mm_unpackhi_epi16(tmpLoad1, tmpZero);   // tc[3][]\r\n\r\n    for (j = 0; j < 4; j++) {\r\n        // multiply & add\r\n        __m128i tmpProduct0 = _mm_madd_epi16(tmpCoef0, _mm_set1_epi32(coeff[0]));\r\n        __m128i tmpProduct1 = _mm_madd_epi16(tmpCoef1, _mm_set1_epi32(coeff[1]));\r\n        __m128i tmpProduct2 = _mm_madd_epi16(tmpCoef2, _mm_set1_epi32(coeff[2]));\r\n        __m128i tmpProduct3 = _mm_madd_epi16(tmpCoef3, _mm_set1_epi32(coeff[3]));\r\n\r\n        // add operation\r\n        __m128i tmpDst0 = _mm_add_epi32(_mm_add_epi32(tmpProduct0, tmpProduct1), _mm_add_epi32(tmpProduct2, tmpProduct3));\r\n\r\n        // shift operation\r\n        tmpDst0 = _mm_srai_epi32(_mm_add_epi32(tmpDst0, factor), i_shift);\r\n        // clip3 operation\r\n        tmpDst0 = _mm_packs_epi32(tmpDst0, tmpZero);    // only low 64bits (4xSHORT) are valid!\r\n\r\n        _mm_storel_epi64((__m128i*)coeff, tmpDst0); // store from &coeff[0]\r\n        coeff += i_coeff;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void inv_2nd_trans_ver_sse128(coeff_t *coeff, int i_coeff, int i_shift, const int16_t *tc)\r\n{\r\n    const int rnd_factor = 1 << (i_shift - 1);\r\n    __m128i factor = _mm_set1_epi32(rnd_factor);\r\n    __m128i tmpZero = _mm_setzero_si128();          
      // 0 elements\r\n\r\n    // load coeff data\r\n    __m128i tmpLoad0 = _mm_loadu_si128((__m128i*)&coeff[0        ]);\r\n    __m128i tmpLoad1 = _mm_loadu_si128((__m128i*)&coeff[1 * i_coeff]);\r\n    __m128i tmpLoad2 = _mm_loadu_si128((__m128i*)&coeff[2 * i_coeff]);\r\n    __m128i tmpLoad3 = _mm_loadu_si128((__m128i*)&coeff[3 * i_coeff]);\r\n    __m128i tmpSrc0 = _mm_unpacklo_epi16(tmpLoad0, tmpZero);    // tmpSrc[0][]\r\n    __m128i tmpSrc1 = _mm_unpacklo_epi16(tmpLoad1, tmpZero);    // tmpSrc[1][]\r\n    __m128i tmpSrc2 = _mm_unpacklo_epi16(tmpLoad2, tmpZero);    // tmpSrc[2][]\r\n    __m128i tmpSrc3 = _mm_unpacklo_epi16(tmpLoad3, tmpZero);    // tmpSrc[3][]\r\n    int i;\r\n\r\n    for (i = 0; i < 4; i++) {\r\n        // multiply & add\r\n        __m128i tmpProduct0 = _mm_madd_epi16(_mm_set1_epi32(tc[0 * SEC_TR_SIZE + i]), tmpSrc0);\r\n        __m128i tmpProduct1 = _mm_madd_epi16(_mm_set1_epi32(tc[1 * SEC_TR_SIZE + i]), tmpSrc1);\r\n        __m128i tmpProduct2 = _mm_madd_epi16(_mm_set1_epi32(tc[2 * SEC_TR_SIZE + i]), tmpSrc2);\r\n        __m128i tmpProduct3 = _mm_madd_epi16(_mm_set1_epi32(tc[3 * SEC_TR_SIZE + i]), tmpSrc3);\r\n        // add operation\r\n        __m128i tmpDst0 = _mm_add_epi32(_mm_add_epi32(tmpProduct0, tmpProduct1), _mm_add_epi32(tmpProduct2, tmpProduct3));\r\n        // shift operation\r\n        tmpDst0 = _mm_srai_epi32(_mm_add_epi32(tmpDst0, factor), i_shift);\r\n        // clip3 operation\r\n        tmpDst0 = _mm_packs_epi32(tmpDst0, tmpZero);        // only low 64bits (4xSHORT) are valid!\r\n\r\n        // store from &coeff[0]\r\n        _mm_storel_epi64((__m128i*)&coeff[0 * i_coeff + 0], tmpDst0);\r\n        coeff += i_coeff;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid inv_transform_2nd_sse128(coeff_t *coeff, int i_coeff, int i_mode, int b_top, int b_left)\r\n{\r\n    int vt = (i_mode >=  0 && i_mode <= 23);\r\n    int ht = (i_mode >= 13 && i_mode <= 32) || (i_mode 
>= 0 && i_mode <= 2);\r\n\r\n    if (ht && b_left) {\r\n        inv_2nd_trans_hor_sse128(coeff, i_coeff, 7, g_2T);\r\n    }\r\n    if (vt && b_top) {\r\n        inv_2nd_trans_ver_sse128(coeff, i_coeff, 7, g_2T);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid inv_transform_4x4_2nd_sse128(coeff_t *coeff, int i_coeff)\r\n{\r\n    const int shift1 = 5;\r\n    const int shift2 = 20 - g_bit_depth + 2;\r\n    const int clip_depth2 = g_bit_depth + 1;\r\n\r\n    /*---vertical transform first---*/\r\n    __m128i factor = _mm_set1_epi32(1 << (shift1 - 1));         // add1\r\n    __m128i tmpZero = _mm_setzero_si128();                      // 0 elements\r\n\r\n    // load coeff data\r\n    __m128i tmpLoad0 = _mm_loadu_si128((__m128i*)&coeff[0          ]);\r\n    __m128i tmpLoad1 = _mm_loadu_si128((__m128i*)&coeff[1 * i_coeff]);\r\n    __m128i tmpLoad2 = _mm_loadu_si128((__m128i*)&coeff[2 * i_coeff]);\r\n    __m128i tmpLoad3 = _mm_loadu_si128((__m128i*)&coeff[3 * i_coeff]);\r\n    __m128i tmpSrc0 = _mm_unpacklo_epi16(tmpLoad0, tmpZero);    // tmpSrc[0][]\r\n    __m128i tmpSrc1 = _mm_unpacklo_epi16(tmpLoad1, tmpZero);    // tmpSrc[1][]\r\n    __m128i tmpSrc2 = _mm_unpacklo_epi16(tmpLoad2, tmpZero);    // tmpSrc[2][]\r\n    __m128i tmpSrc3 = _mm_unpacklo_epi16(tmpLoad3, tmpZero);    // tmpSrc[3][]\r\n    int i;\r\n\r\n    for (i = 0; i < 4; i++) {\r\n        // multiply & add\r\n        __m128i tmpProduct0 = _mm_madd_epi16(_mm_set1_epi32(g_2T_C[0 * SEC_TR_SIZE + i]), tmpSrc0);\r\n        __m128i tmpProduct1 = _mm_madd_epi16(_mm_set1_epi32(g_2T_C[1 * SEC_TR_SIZE + i]), tmpSrc1);\r\n        __m128i tmpProduct2 = _mm_madd_epi16(_mm_set1_epi32(g_2T_C[2 * SEC_TR_SIZE + i]), tmpSrc2);\r\n        __m128i tmpProduct3 = _mm_madd_epi16(_mm_set1_epi32(g_2T_C[3 * SEC_TR_SIZE + i]), tmpSrc3);\r\n        // add operation\r\n        __m128i tmpDst0 = _mm_add_epi32(_mm_add_epi32(tmpProduct0, tmpProduct1), 
_mm_add_epi32(tmpProduct2, tmpProduct3));\r\n        // shift operation\r\n        tmpDst0 = _mm_srai_epi32(_mm_add_epi32(tmpDst0, factor), shift1);\r\n        // clip3 operation\r\n        tmpDst0 = _mm_packs_epi32(tmpDst0, tmpZero);        // only low 64bits (4xSHORT) are valid!\r\n\r\n        _mm_storel_epi64((__m128i*)&coeff[i * i_coeff + 0], tmpDst0); // store from &coeff[0]\r\n    }\r\n\r\n    /*---hor transform---*/\r\n    factor = _mm_set1_epi32(1 << (shift2 - 1));\r\n    const __m128i vmax_val = _mm_set1_epi32((1 << (clip_depth2 - 1)) - 1);\r\n    const __m128i vmin_val = _mm_set1_epi32(-(1 << (clip_depth2 - 1)));\r\n\r\n    //load coef data, a matrix of 4x4\r\n    tmpLoad0 = _mm_loadu_si128((__m128i*)&g_2T_C[0 * SEC_TR_SIZE + 0]);  // coef[0][] & coef[1][]\r\n    tmpLoad1 = _mm_loadu_si128((__m128i*)&g_2T_C[2 * SEC_TR_SIZE + 0]);  // coef[2][] & coef[3][]\r\n    const __m128i tmpCoef0 = _mm_unpacklo_epi16(tmpLoad0, tmpZero);   // coef[0][]\r\n    const __m128i tmpCoef1 = _mm_unpackhi_epi16(tmpLoad0, tmpZero);   // coef[1][]\r\n    const __m128i tmpCoef2 = _mm_unpacklo_epi16(tmpLoad1, tmpZero);   // coef[2][]\r\n    const __m128i tmpCoef3 = _mm_unpackhi_epi16(tmpLoad1, tmpZero);   // coef[3][]\r\n\r\n    for (i = 0; i < 4; i++) {\r\n        // multiply & add\r\n        __m128i tmpProduct0 = _mm_madd_epi16(tmpCoef0, _mm_set1_epi32(coeff[0]));\r\n        __m128i tmpProduct1 = _mm_madd_epi16(tmpCoef1, _mm_set1_epi32(coeff[1]));\r\n        __m128i tmpProduct2 = _mm_madd_epi16(tmpCoef2, _mm_set1_epi32(coeff[2]));\r\n        __m128i tmpProduct3 = _mm_madd_epi16(tmpCoef3, _mm_set1_epi32(coeff[3]));\r\n        // add operation\r\n        __m128i tmpDst0 = _mm_add_epi32(_mm_add_epi32(tmpProduct0, tmpProduct1), _mm_add_epi32(tmpProduct2, tmpProduct3));\r\n        // shift operation\r\n        tmpDst0 = _mm_srai_epi32(_mm_add_epi32(tmpDst0, factor), shift2);\r\n        // clip3 operation\r\n        tmpDst0 = _mm_max_epi32(_mm_min_epi32(tmpDst0, vmax_val), 
vmin_val);\r\n\r\n        tmpDst0 = _mm_packs_epi32(tmpDst0, tmpZero);        // only low 64bits (4xSHORT) are valid!\r\n        _mm_storel_epi64((__m128i*)coeff, tmpDst0); // store from &coeff[0]\r\n        coeff += i_coeff;\r\n    }\r\n}\r\n\r\n\r\n// transpose 8x8 & transpose 16x16\r\n#define TRANSPOSE_8x8_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n    tr0_0 = _mm_unpacklo_epi16(I0, I1); \\\r\n    tr0_1 = _mm_unpacklo_epi16(I2, I3); \\\r\n    tr0_2 = _mm_unpackhi_epi16(I0, I1); \\\r\n    tr0_3 = _mm_unpackhi_epi16(I2, I3); \\\r\n    tr0_4 = _mm_unpacklo_epi16(I4, I5); \\\r\n    tr0_5 = _mm_unpacklo_epi16(I6, I7); \\\r\n    tr0_6 = _mm_unpackhi_epi16(I4, I5); \\\r\n    tr0_7 = _mm_unpackhi_epi16(I6, I7); \\\r\n    tr1_0 = _mm_unpacklo_epi32(tr0_0, tr0_1); \\\r\n    tr1_1 = _mm_unpacklo_epi32(tr0_2, tr0_3); \\\r\n    tr1_2 = _mm_unpackhi_epi32(tr0_0, tr0_1); \\\r\n    tr1_3 = _mm_unpackhi_epi32(tr0_2, tr0_3); \\\r\n    tr1_4 = _mm_unpacklo_epi32(tr0_4, tr0_5); \\\r\n    tr1_5 = _mm_unpacklo_epi32(tr0_6, tr0_7); \\\r\n    tr1_6 = _mm_unpackhi_epi32(tr0_4, tr0_5); \\\r\n    tr1_7 = _mm_unpackhi_epi32(tr0_6, tr0_7); \\\r\n    O0 = _mm_unpacklo_epi64(tr1_0, tr1_4); \\\r\n    O1 = _mm_unpackhi_epi64(tr1_0, tr1_4); \\\r\n    O2 = _mm_unpacklo_epi64(tr1_2, tr1_6); \\\r\n    O3 = _mm_unpackhi_epi64(tr1_2, tr1_6); \\\r\n    O4 = _mm_unpacklo_epi64(tr1_1, tr1_5); \\\r\n    O5 = _mm_unpackhi_epi64(tr1_1, tr1_5); \\\r\n    O6 = _mm_unpacklo_epi64(tr1_3, tr1_7); \\\r\n    O7 = _mm_unpackhi_epi64(tr1_3, tr1_7); \\\r\n\r\n#define TRANSPOSE_16x16_16BIT(A0_0, A1_0, A2_0, A3_0, A4_0, A5_0, A6_0, A7_0, A8_0, A9_0, A10_0, A11_0, A12_0, A13_0, A14_0, A15_0, A0_1, A1_1, A2_1, A3_1, A4_1, A5_1, A6_1, A7_1, A8_1, A9_1, A10_1, A11_1, A12_1, A13_1, A14_1, A15_1, B0_0, B1_0, B2_0, B3_0, B4_0, B5_0, B6_0, B7_0, B8_0, B9_0, B10_0, B11_0, B12_0, B13_0, B14_0, B15_0, B0_1, B1_1, B2_1, B3_1, B4_1, B5_1, B6_1, B7_1, B8_1, B9_1, B10_1, B11_1, B12_1, B13_1, B14_1, 
B15_1) \\\r\n    TRANSPOSE_8x8_16BIT(A0_0, A1_0, A2_0, A3_0, A4_0, A5_0, A6_0, A7_0, B0_0, B1_0, B2_0, B3_0, B4_0, B5_0, B6_0, B7_0); \\\r\n    TRANSPOSE_8x8_16BIT(A8_0, A9_0, A10_0, A11_0, A12_0, A13_0, A14_0, A15_0, B0_1, B1_1, B2_1, B3_1, B4_1, B5_1, B6_1, B7_1); \\\r\n    TRANSPOSE_8x8_16BIT(A0_1, A1_1, A2_1, A3_1, A4_1, A5_1, A6_1, A7_1, B8_0, B9_0, B10_0, B11_0, B12_0, B13_0, B14_0, B15_0); \\\r\n    TRANSPOSE_8x8_16BIT(A8_1, A9_1, A10_1, A11_1, A12_1, A13_1, A14_1, A15_1, B8_1, B9_1, B10_1, B11_1, B12_1, B13_1, B14_1, B15_1); \\\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid inv_wavelet_64x64_sse128(coeff_t *coeff)\r\n{\r\n    int i;\r\n    // 64*64\r\n    __m128i T00[8], T01[8], T02[8], T03[8], T04[8], T05[8], T06[8], T07[8], T08[8], T09[8], T10[8], T11[8], T12[8], T13[8], T14[8], T15[8], T16[8], T17[8], T18[8], T19[8], T20[8], T21[8], T22[8], T23[8], T24[8], T25[8], T26[8], T27[8], T28[8], T29[8], T30[8], T31[8], T32[8], T33[8], T34[8], T35[8], T36[8], T37[8], T38[8], T39[8], T40[8], T41[8], T42[8], T43[8], T44[8], T45[8], T46[8], T47[8], T48[8], T49[8], T50[8], T51[8], T52[8], T53[8], T54[8], T55[8], T56[8], T57[8], T58[8], T59[8], T60[8], T61[8], T62[8], T63[8];\r\n\r\n    // 16*64\r\n    __m128i V00[8], V01[8], V02[8], V03[8], V04[8], V05[8], V06[8], V07[8], V08[8], V09[8], V10[8], V11[8], V12[8], V13[8], V14[8], V15[8], V16[8], V17[8], V18[8], V19[8], V20[8], V21[8], V22[8], V23[8], V24[8], V25[8], V26[8], V27[8], V28[8], V29[8], V30[8], V31[8], V32[8], V33[8], V34[8], V35[8], V36[8], V37[8], V38[8], V39[8], V40[8], V41[8], V42[8], V43[8], V44[8], V45[8], V46[8], V47[8], V48[8], V49[8], V50[8], V51[8], V52[8], V53[8], V54[8], V55[8], V56[8], V57[8], V58[8], V59[8], V60[8], V61[8], V62[8], V63[8];\r\n\r\n    __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n    __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n\r\n    /*--vertical transform--*/\r\n    //32*32, 
LOAD AND SHIFT\r\n    for (i = 0; i < 4; i++) {\r\n        T00[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  0]), 1);\r\n        T01[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  1]), 1);\r\n        T02[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  2]), 1);\r\n        T03[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  3]), 1);\r\n        T04[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  4]), 1);\r\n        T05[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  5]), 1);\r\n        T06[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  6]), 1);\r\n        T07[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  7]), 1);\r\n\r\n        T08[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  8]), 1);\r\n        T09[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 *  9]), 1);\r\n        T10[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 10]), 1);\r\n        T11[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 11]), 1);\r\n        T12[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 12]), 1);\r\n        T13[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 13]), 1);\r\n        T14[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 14]), 1);\r\n        T15[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 15]), 1);\r\n\r\n        T16[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 16]), 1);\r\n        T17[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 17]), 1);\r\n        T18[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 18]), 1);\r\n        T19[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 19]), 1);\r\n        T20[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 20]), 1);\r\n        T21[i] = 
_mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 21]), 1);\r\n        T22[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 22]), 1);\r\n        T23[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 23]), 1);\r\n\r\n        T24[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 24]), 1);\r\n        T25[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 25]), 1);\r\n        T26[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 26]), 1);\r\n        T27[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 27]), 1);\r\n        T28[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 28]), 1);\r\n        T29[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 29]), 1);\r\n        T30[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 30]), 1);\r\n        T31[i] = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * i + 32 * 31]), 1);\r\n    }\r\n\r\n    //filter (odd pixel/row)\r\n    for (i = 0; i < 4; i++) {\r\n        T32[i] = _mm_srai_epi16(_mm_add_epi16(T00[i], T01[i]), 1);\r\n        T33[i] = _mm_srai_epi16(_mm_add_epi16(T01[i], T02[i]), 1);\r\n        T34[i] = _mm_srai_epi16(_mm_add_epi16(T02[i], T03[i]), 1);\r\n        T35[i] = _mm_srai_epi16(_mm_add_epi16(T03[i], T04[i]), 1);\r\n        T36[i] = _mm_srai_epi16(_mm_add_epi16(T04[i], T05[i]), 1);\r\n        T37[i] = _mm_srai_epi16(_mm_add_epi16(T05[i], T06[i]), 1);\r\n        T38[i] = _mm_srai_epi16(_mm_add_epi16(T06[i], T07[i]), 1);\r\n        T39[i] = _mm_srai_epi16(_mm_add_epi16(T07[i], T08[i]), 1);\r\n\r\n        T40[i] = _mm_srai_epi16(_mm_add_epi16(T08[i], T09[i]), 1);\r\n        T41[i] = _mm_srai_epi16(_mm_add_epi16(T09[i], T10[i]), 1);\r\n        T42[i] = _mm_srai_epi16(_mm_add_epi16(T10[i], T11[i]), 1);\r\n        T43[i] = _mm_srai_epi16(_mm_add_epi16(T11[i], T12[i]), 1);\r\n        T44[i] = _mm_srai_epi16(_mm_add_epi16(T12[i], T13[i]), 1);\r\n        T45[i] = 
_mm_srai_epi16(_mm_add_epi16(T13[i], T14[i]), 1);\r\n        T46[i] = _mm_srai_epi16(_mm_add_epi16(T14[i], T15[i]), 1);\r\n        T47[i] = _mm_srai_epi16(_mm_add_epi16(T15[i], T16[i]), 1);\r\n\r\n        T48[i] = _mm_srai_epi16(_mm_add_epi16(T16[i], T17[i]), 1);\r\n        T49[i] = _mm_srai_epi16(_mm_add_epi16(T17[i], T18[i]), 1);\r\n        T50[i] = _mm_srai_epi16(_mm_add_epi16(T18[i], T19[i]), 1);\r\n        T51[i] = _mm_srai_epi16(_mm_add_epi16(T19[i], T20[i]), 1);\r\n        T52[i] = _mm_srai_epi16(_mm_add_epi16(T20[i], T21[i]), 1);\r\n        T53[i] = _mm_srai_epi16(_mm_add_epi16(T21[i], T22[i]), 1);\r\n        T54[i] = _mm_srai_epi16(_mm_add_epi16(T22[i], T23[i]), 1);\r\n        T55[i] = _mm_srai_epi16(_mm_add_epi16(T23[i], T24[i]), 1);\r\n\r\n        T56[i] = _mm_srai_epi16(_mm_add_epi16(T24[i], T25[i]), 1);\r\n        T57[i] = _mm_srai_epi16(_mm_add_epi16(T25[i], T26[i]), 1);\r\n        T58[i] = _mm_srai_epi16(_mm_add_epi16(T26[i], T27[i]), 1);\r\n        T59[i] = _mm_srai_epi16(_mm_add_epi16(T27[i], T28[i]), 1);\r\n        T60[i] = _mm_srai_epi16(_mm_add_epi16(T28[i], T29[i]), 1);\r\n        T61[i] = _mm_srai_epi16(_mm_add_epi16(T29[i], T30[i]), 1);\r\n        T62[i] = _mm_srai_epi16(_mm_add_epi16(T30[i], T31[i]), 1);\r\n        T63[i] = _mm_srai_epi16(_mm_add_epi16(T31[i], T31[i]), 1);\r\n    }\r\n\r\n    /*--transposition--*/\r\n    //32x64 -> 64x32\r\n    TRANSPOSE_16x16_16BIT(\r\n        T00[0], T32[0], T01[0], T33[0], T02[0], T34[0], T03[0], T35[0], T04[0], T36[0], T05[0], T37[0], T06[0], T38[0], T07[0], T39[0], T00[1], T32[1], T01[1], T33[1], T02[1], T34[1], T03[1], T35[1], T04[1], T36[1], T05[1], T37[1], T06[1], T38[1], T07[1], T39[1],\r\n        V00[0], V01[0], V02[0], V03[0], V04[0], V05[0], V06[0], V07[0], V08[0], V09[0], V10[0], V11[0], V12[0], V13[0], V14[0], V15[0], V00[1], V01[1], V02[1], V03[1], V04[1], V05[1], V06[1], V07[1], V08[1], V09[1], V10[1], V11[1], V12[1], V13[1], V14[1], V15[1]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        T00[2], 
T32[2], T01[2], T33[2], T02[2], T34[2], T03[2], T35[2], T04[2], T36[2], T05[2], T37[2], T06[2], T38[2], T07[2], T39[2], T00[3], T32[3], T01[3], T33[3], T02[3], T34[3], T03[3], T35[3], T04[3], T36[3], T05[3], T37[3], T06[3], T38[3], T07[3], T39[3],\r\n        V16[0], V17[0], V18[0], V19[0], V20[0], V21[0], V22[0], V23[0], V24[0], V25[0], V26[0], V27[0], V28[0], V29[0], V30[0], V31[0], V16[1], V17[1], V18[1], V19[1], V20[1], V21[1], V22[1], V23[1], V24[1], V25[1], V26[1], V27[1], V28[1], V29[1], V30[1], V31[1]);\r\n\r\n    TRANSPOSE_16x16_16BIT(\r\n        T08[0], T40[0], T09[0], T41[0], T10[0], T42[0], T11[0], T43[0], T12[0], T44[0], T13[0], T45[0], T14[0], T46[0], T15[0], T47[0], T08[1], T40[1], T09[1], T41[1], T10[1], T42[1], T11[1], T43[1], T12[1], T44[1], T13[1], T45[1], T14[1], T46[1], T15[1], T47[1],\r\n        V00[2], V01[2], V02[2], V03[2], V04[2], V05[2], V06[2], V07[2], V08[2], V09[2], V10[2], V11[2], V12[2], V13[2], V14[2], V15[2], V00[3], V01[3], V02[3], V03[3], V04[3], V05[3], V06[3], V07[3], V08[3], V09[3], V10[3], V11[3], V12[3], V13[3], V14[3], V15[3]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        T08[2], T40[2], T09[2], T41[2], T10[2], T42[2], T11[2], T43[2], T12[2], T44[2], T13[2], T45[2], T14[2], T46[2], T15[2], T47[2], T08[3], T40[3], T09[3], T41[3], T10[3], T42[3], T11[3], T43[3], T12[3], T44[3], T13[3], T45[3], T14[3], T46[3], T15[3], T47[3],\r\n        V16[2], V17[2], V18[2], V19[2], V20[2], V21[2], V22[2], V23[2], V24[2], V25[2], V26[2], V27[2], V28[2], V29[2], V30[2], V31[2], V16[3], V17[3], V18[3], V19[3], V20[3], V21[3], V22[3], V23[3], V24[3], V25[3], V26[3], V27[3], V28[3], V29[3], V30[3], V31[3]);\r\n\r\n    TRANSPOSE_16x16_16BIT(\r\n        T16[0], T48[0], T17[0], T49[0], T18[0], T50[0], T19[0], T51[0], T20[0], T52[0], T21[0], T53[0], T22[0], T54[0], T23[0], T55[0], T16[1], T48[1], T17[1], T49[1], T18[1], T50[1], T19[1], T51[1], T20[1], T52[1], T21[1], T53[1], T22[1], T54[1], T23[1], T55[1],\r\n        V00[4], V01[4], V02[4], V03[4], 
V04[4], V05[4], V06[4], V07[4], V08[4], V09[4], V10[4], V11[4], V12[4], V13[4], V14[4], V15[4], V00[5], V01[5], V02[5], V03[5], V04[5], V05[5], V06[5], V07[5], V08[5], V09[5], V10[5], V11[5], V12[5], V13[5], V14[5], V15[5]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        T16[2], T48[2], T17[2], T49[2], T18[2], T50[2], T19[2], T51[2], T20[2], T52[2], T21[2], T53[2], T22[2], T54[2], T23[2], T55[2], T16[3], T48[3], T17[3], T49[3], T18[3], T50[3], T19[3], T51[3], T20[3], T52[3], T21[3], T53[3], T22[3], T54[3], T23[3], T55[3],\r\n        V16[4], V17[4], V18[4], V19[4], V20[4], V21[4], V22[4], V23[4], V24[4], V25[4], V26[4], V27[4], V28[4], V29[4], V30[4], V31[4], V16[5], V17[5], V18[5], V19[5], V20[5], V21[5], V22[5], V23[5], V24[5], V25[5], V26[5], V27[5], V28[5], V29[5], V30[5], V31[5]);\r\n\r\n    TRANSPOSE_16x16_16BIT(\r\n        T24[0], T56[0], T25[0], T57[0], T26[0], T58[0], T27[0], T59[0], T28[0], T60[0], T29[0], T61[0], T30[0], T62[0], T31[0], T63[0], T24[1], T56[1], T25[1], T57[1], T26[1], T58[1], T27[1], T59[1], T28[1], T60[1], T29[1], T61[1], T30[1], T62[1], T31[1], T63[1],\r\n        V00[6], V01[6], V02[6], V03[6], V04[6], V05[6], V06[6], V07[6], V08[6], V09[6], V10[6], V11[6], V12[6], V13[6], V14[6], V15[6], V00[7], V01[7], V02[7], V03[7], V04[7], V05[7], V06[7], V07[7], V08[7], V09[7], V10[7], V11[7], V12[7], V13[7], V14[7], V15[7]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        T24[2], T56[2], T25[2], T57[2], T26[2], T58[2], T27[2], T59[2], T28[2], T60[2], T29[2], T61[2], T30[2], T62[2], T31[2], T63[2], T24[3], T56[3], T25[3], T57[3], T26[3], T58[3], T27[3], T59[3], T28[3], T60[3], T29[3], T61[3], T30[3], T62[3], T31[3], T63[3],\r\n        V16[6], V17[6], V18[6], V19[6], V20[6], V21[6], V22[6], V23[6], V24[6], V25[6], V26[6], V27[6], V28[6], V29[6], V30[6], V31[6], V16[7], V17[7], V18[7], V19[7], V20[7], V21[7], V22[7], V23[7], V24[7], V25[7], V26[7], V27[7], V28[7], V29[7], V30[7], V31[7]);\r\n\r\n    /*--horizontal transform--*/\r\n    //filter (odd 
pixel/column)\r\n    for (i = 0; i < 8; i++) {\r\n        V32[i] = _mm_srai_epi16(_mm_add_epi16(V00[i], V01[i]), 1);\r\n        V33[i] = _mm_srai_epi16(_mm_add_epi16(V01[i], V02[i]), 1);\r\n        V34[i] = _mm_srai_epi16(_mm_add_epi16(V02[i], V03[i]), 1);\r\n        V35[i] = _mm_srai_epi16(_mm_add_epi16(V03[i], V04[i]), 1);\r\n        V36[i] = _mm_srai_epi16(_mm_add_epi16(V04[i], V05[i]), 1);\r\n        V37[i] = _mm_srai_epi16(_mm_add_epi16(V05[i], V06[i]), 1);\r\n        V38[i] = _mm_srai_epi16(_mm_add_epi16(V06[i], V07[i]), 1);\r\n        V39[i] = _mm_srai_epi16(_mm_add_epi16(V07[i], V08[i]), 1);\r\n        V40[i] = _mm_srai_epi16(_mm_add_epi16(V08[i], V09[i]), 1);\r\n        V41[i] = _mm_srai_epi16(_mm_add_epi16(V09[i], V10[i]), 1);\r\n        V42[i] = _mm_srai_epi16(_mm_add_epi16(V10[i], V11[i]), 1);\r\n        V43[i] = _mm_srai_epi16(_mm_add_epi16(V11[i], V12[i]), 1);\r\n        V44[i] = _mm_srai_epi16(_mm_add_epi16(V12[i], V13[i]), 1);\r\n        V45[i] = _mm_srai_epi16(_mm_add_epi16(V13[i], V14[i]), 1);\r\n        V46[i] = _mm_srai_epi16(_mm_add_epi16(V14[i], V15[i]), 1);\r\n        V47[i] = _mm_srai_epi16(_mm_add_epi16(V15[i], V16[i]), 1);\r\n\r\n        V48[i] = _mm_srai_epi16(_mm_add_epi16(V16[i], V17[i]), 1);\r\n        V49[i] = _mm_srai_epi16(_mm_add_epi16(V17[i], V18[i]), 1);\r\n        V50[i] = _mm_srai_epi16(_mm_add_epi16(V18[i], V19[i]), 1);\r\n        V51[i] = _mm_srai_epi16(_mm_add_epi16(V19[i], V20[i]), 1);\r\n        V52[i] = _mm_srai_epi16(_mm_add_epi16(V20[i], V21[i]), 1);\r\n        V53[i] = _mm_srai_epi16(_mm_add_epi16(V21[i], V22[i]), 1);\r\n        V54[i] = _mm_srai_epi16(_mm_add_epi16(V22[i], V23[i]), 1);\r\n        V55[i] = _mm_srai_epi16(_mm_add_epi16(V23[i], V24[i]), 1);\r\n        V56[i] = _mm_srai_epi16(_mm_add_epi16(V24[i], V25[i]), 1);\r\n        V57[i] = _mm_srai_epi16(_mm_add_epi16(V25[i], V26[i]), 1);\r\n        V58[i] = _mm_srai_epi16(_mm_add_epi16(V26[i], V27[i]), 1);\r\n        V59[i] = _mm_srai_epi16(_mm_add_epi16(V27[i], 
V28[i]), 1);\r\n        V60[i] = _mm_srai_epi16(_mm_add_epi16(V28[i], V29[i]), 1);\r\n        V61[i] = _mm_srai_epi16(_mm_add_epi16(V29[i], V30[i]), 1);\r\n        V62[i] = _mm_srai_epi16(_mm_add_epi16(V30[i], V31[i]), 1);\r\n        V63[i] = _mm_srai_epi16(_mm_add_epi16(V31[i], V31[i]), 1);\r\n    }\r\n\r\n    /*--transposition & Store--*/\r\n    //64x64\r\n    TRANSPOSE_16x16_16BIT(\r\n        V00[0], V32[0], V01[0], V33[0], V02[0], V34[0], V03[0], V35[0], V04[0], V36[0], V05[0], V37[0], V06[0], V38[0], V07[0], V39[0], V00[1], V32[1], V01[1], V33[1], V02[1], V34[1], V03[1], V35[1], V04[1], V36[1], V05[1], V37[1], V06[1], V38[1], V07[1], V39[1],\r\n        T00[0], T01[0], T02[0], T03[0], T04[0], T05[0], T06[0], T07[0], T08[0], T09[0], T10[0], T11[0], T12[0], T13[0], T14[0], T15[0], T00[1], T01[1], T02[1], T03[1], T04[1], T05[1], T06[1], T07[1], T08[1], T09[1], T10[1], T11[1], T12[1], T13[1], T14[1], T15[1]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        V00[2], V32[2], V01[2], V33[2], V02[2], V34[2], V03[2], V35[2], V04[2], V36[2], V05[2], V37[2], V06[2], V38[2], V07[2], V39[2], V00[3], V32[3], V01[3], V33[3], V02[3], V34[3], V03[3], V35[3], V04[3], V36[3], V05[3], V37[3], V06[3], V38[3], V07[3], V39[3],\r\n        T16[0], T17[0], T18[0], T19[0], T20[0], T21[0], T22[0], T23[0], T24[0], T25[0], T26[0], T27[0], T28[0], T29[0], T30[0], T31[0], T16[1], T17[1], T18[1], T19[1], T20[1], T21[1], T22[1], T23[1], T24[1], T25[1], T26[1], T27[1], T28[1], T29[1], T30[1], T31[1]);\r\n    TRANSPOSE_16x16_16BIT(V00[4], V32[4], V01[4], V33[4], V02[4], V34[4], V03[4], V35[4], V04[4], V36[4], V05[4], V37[4], V06[4], V38[4], V07[4], V39[4], V00[5], V32[5], V01[5], V33[5], V02[5], V34[5], V03[5], V35[5], V04[5], V36[5], V05[5], V37[5], V06[5], V38[5], V07[5], V39[5], T32[0], T33[0], T34[0], T35[0], T36[0], T37[0], T38[0], T39[0], T40[0], T41[0], T42[0], T43[0], T44[0], T45[0], T46[0], T47[0], T32[1], T33[1], T34[1], T35[1], T36[1], T37[1], T38[1], T39[1], T40[1], T41[1], T42[1], T43[1], 
T44[1], T45[1], T46[1], T47[1]);\r\n    TRANSPOSE_16x16_16BIT(V00[6], V32[6], V01[6], V33[6], V02[6], V34[6], V03[6], V35[6], V04[6], V36[6], V05[6], V37[6], V06[6], V38[6], V07[6], V39[6], V00[7], V32[7], V01[7], V33[7], V02[7], V34[7], V03[7], V35[7], V04[7], V36[7], V05[7], V37[7], V06[7], V38[7], V07[7], V39[7], T48[0], T49[0], T50[0], T51[0], T52[0], T53[0], T54[0], T55[0], T56[0], T57[0], T58[0], T59[0], T60[0], T61[0], T62[0], T63[0], T48[1], T49[1], T50[1], T51[1], T52[1], T53[1], T54[1], T55[1], T56[1], T57[1], T58[1], T59[1], T60[1], T61[1], T62[1], T63[1]);\r\n\r\n    TRANSPOSE_16x16_16BIT(\r\n        V08[0], V40[0], V09[0], V41[0], V10[0], V42[0], V11[0], V43[0], V12[0], V44[0], V13[0], V45[0], V14[0], V46[0], V15[0], V47[0], V08[1], V40[1], V09[1], V41[1], V10[1], V42[1], V11[1], V43[1], V12[1], V44[1], V13[1], V45[1], V14[1], V46[1], V15[1], V47[1],\r\n        T00[2], T01[2], T02[2], T03[2], T04[2], T05[2], T06[2], T07[2], T08[2], T09[2], T10[2], T11[2], T12[2], T13[2], T14[2], T15[2], T00[3], T01[3], T02[3], T03[3], T04[3], T05[3], T06[3], T07[3], T08[3], T09[3], T10[3], T11[3], T12[3], T13[3], T14[3], T15[3]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        V08[2], V40[2], V09[2], V41[2], V10[2], V42[2], V11[2], V43[2], V12[2], V44[2], V13[2], V45[2], V14[2], V46[2], V15[2], V47[2], V08[3], V40[3], V09[3], V41[3], V10[3], V42[3], V11[3], V43[3], V12[3], V44[3], V13[3], V45[3], V14[3], V46[3], V15[3], V47[3],\r\n        T16[2], T17[2], T18[2], T19[2], T20[2], T21[2], T22[2], T23[2], T24[2], T25[2], T26[2], T27[2], T28[2], T29[2], T30[2], T31[2], T16[3], T17[3], T18[3], T19[3], T20[3], T21[3], T22[3], T23[3], T24[3], T25[3], T26[3], T27[3], T28[3], T29[3], T30[3], T31[3]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        V08[4], V40[4], V09[4], V41[4], V10[4], V42[4], V11[4], V43[4], V12[4], V44[4], V13[4], V45[4], V14[4], V46[4], V15[4], V47[4], V08[5], V40[5], V09[5], V41[5], V10[5], V42[5], V11[5], V43[5], V12[5], V44[5], V13[5], V45[5], V14[5], V46[5], V15[5], 
V47[5],\r\n        T32[2], T33[2], T34[2], T35[2], T36[2], T37[2], T38[2], T39[2], T40[2], T41[2], T42[2], T43[2], T44[2], T45[2], T46[2], T47[2], T32[3], T33[3], T34[3], T35[3], T36[3], T37[3], T38[3], T39[3], T40[3], T41[3], T42[3], T43[3], T44[3], T45[3], T46[3], T47[3]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        V08[6], V40[6], V09[6], V41[6], V10[6], V42[6], V11[6], V43[6], V12[6], V44[6], V13[6], V45[6], V14[6], V46[6], V15[6], V47[6], V08[7], V40[7], V09[7], V41[7], V10[7], V42[7], V11[7], V43[7], V12[7], V44[7], V13[7], V45[7], V14[7], V46[7], V15[7], V47[7],\r\n        T48[2], T49[2], T50[2], T51[2], T52[2], T53[2], T54[2], T55[2], T56[2], T57[2], T58[2], T59[2], T60[2], T61[2], T62[2], T63[2], T48[3], T49[3], T50[3], T51[3], T52[3], T53[3], T54[3], T55[3], T56[3], T57[3], T58[3], T59[3], T60[3], T61[3], T62[3], T63[3]);\r\n\r\n    TRANSPOSE_16x16_16BIT(\r\n        V16[0], V48[0], V17[0], V49[0], V18[0], V50[0], V19[0], V51[0], V20[0], V52[0], V21[0], V53[0], V22[0], V54[0], V23[0], V55[0], V16[1], V48[1], V17[1], V49[1], V18[1], V50[1], V19[1], V51[1], V20[1], V52[1], V21[1], V53[1], V22[1], V54[1], V23[1], V55[1],\r\n        T00[4], T01[4], T02[4], T03[4], T04[4], T05[4], T06[4], T07[4], T08[4], T09[4], T10[4], T11[4], T12[4], T13[4], T14[4], T15[4], T00[5], T01[5], T02[5], T03[5], T04[5], T05[5], T06[5], T07[5], T08[5], T09[5], T10[5], T11[5], T12[5], T13[5], T14[5], T15[5]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        V16[2], V48[2], V17[2], V49[2], V18[2], V50[2], V19[2], V51[2], V20[2], V52[2], V21[2], V53[2], V22[2], V54[2], V23[2], V55[2], V16[3], V48[3], V17[3], V49[3], V18[3], V50[3], V19[3], V51[3], V20[3], V52[3], V21[3], V53[3], V22[3], V54[3], V23[3], V55[3],\r\n        T16[4], T17[4], T18[4], T19[4], T20[4], T21[4], T22[4], T23[4], T24[4], T25[4], T26[4], T27[4], T28[4], T29[4], T30[4], T31[4], T16[5], T17[5], T18[5], T19[5], T20[5], T21[5], T22[5], T23[5], T24[5], T25[5], T26[5], T27[5], T28[5], T29[5], T30[5], T31[5]);\r\n    
TRANSPOSE_16x16_16BIT(\r\n        V16[4], V48[4], V17[4], V49[4], V18[4], V50[4], V19[4], V51[4], V20[4], V52[4], V21[4], V53[4], V22[4], V54[4], V23[4], V55[4], V16[5], V48[5], V17[5], V49[5], V18[5], V50[5], V19[5], V51[5], V20[5], V52[5], V21[5], V53[5], V22[5], V54[5], V23[5], V55[5],\r\n        T32[4], T33[4], T34[4], T35[4], T36[4], T37[4], T38[4], T39[4], T40[4], T41[4], T42[4], T43[4], T44[4], T45[4], T46[4], T47[4], T32[5], T33[5], T34[5], T35[5], T36[5], T37[5], T38[5], T39[5], T40[5], T41[5], T42[5], T43[5], T44[5], T45[5], T46[5], T47[5]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        V16[6], V48[6], V17[6], V49[6], V18[6], V50[6], V19[6], V51[6], V20[6], V52[6], V21[6], V53[6], V22[6], V54[6], V23[6], V55[6], V16[7], V48[7], V17[7], V49[7], V18[7], V50[7], V19[7], V51[7], V20[7], V52[7], V21[7], V53[7], V22[7], V54[7], V23[7], V55[7],\r\n        T48[4], T49[4], T50[4], T51[4], T52[4], T53[4], T54[4], T55[4], T56[4], T57[4], T58[4], T59[4], T60[4], T61[4], T62[4], T63[4], T48[5], T49[5], T50[5], T51[5], T52[5], T53[5], T54[5], T55[5], T56[5], T57[5], T58[5], T59[5], T60[5], T61[5], T62[5], T63[5]);\r\n\r\n    TRANSPOSE_16x16_16BIT(\r\n        V24[0], V56[0], V25[0], V57[0], V26[0], V58[0], V27[0], V59[0], V28[0], V60[0], V29[0], V61[0], V30[0], V62[0], V31[0], V63[0], V24[1], V56[1], V25[1], V57[1], V26[1], V58[1], V27[1], V59[1], V28[1], V60[1], V29[1], V61[1], V30[1], V62[1], V31[1], V63[1],\r\n        T00[6], T01[6], T02[6], T03[6], T04[6], T05[6], T06[6], T07[6], T08[6], T09[6], T10[6], T11[6], T12[6], T13[6], T14[6], T15[6], T00[7], T01[7], T02[7], T03[7], T04[7], T05[7], T06[7], T07[7], T08[7], T09[7], T10[7], T11[7], T12[7], T13[7], T14[7], T15[7]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        V24[2], V56[2], V25[2], V57[2], V26[2], V58[2], V27[2], V59[2], V28[2], V60[2], V29[2], V61[2], V30[2], V62[2], V31[2], V63[2], V24[3], V56[3], V25[3], V57[3], V26[3], V58[3], V27[3], V59[3], V28[3], V60[3], V29[3], V61[3], V30[3], V62[3], V31[3], V63[3],\r\n        
T16[6], T17[6], T18[6], T19[6], T20[6], T21[6], T22[6], T23[6], T24[6], T25[6], T26[6], T27[6], T28[6], T29[6], T30[6], T31[6], T16[7], T17[7], T18[7], T19[7], T20[7], T21[7], T22[7], T23[7], T24[7], T25[7], T26[7], T27[7], T28[7], T29[7], T30[7], T31[7]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        V24[4], V56[4], V25[4], V57[4], V26[4], V58[4], V27[4], V59[4], V28[4], V60[4], V29[4], V61[4], V30[4], V62[4], V31[4], V63[4], V24[5], V56[5], V25[5], V57[5], V26[5], V58[5], V27[5], V59[5], V28[5], V60[5], V29[5], V61[5], V30[5], V62[5], V31[5], V63[5],\r\n        T32[6], T33[6], T34[6], T35[6], T36[6], T37[6], T38[6], T39[6], T40[6], T41[6], T42[6], T43[6], T44[6], T45[6], T46[6], T47[6], T32[7], T33[7], T34[7], T35[7], T36[7], T37[7], T38[7], T39[7], T40[7], T41[7], T42[7], T43[7], T44[7], T45[7], T46[7], T47[7]);\r\n    TRANSPOSE_16x16_16BIT(\r\n        V24[6], V56[6], V25[6], V57[6], V26[6], V58[6], V27[6], V59[6], V28[6], V60[6], V29[6], V61[6], V30[6], V62[6], V31[6], V63[6], V24[7], V56[7], V25[7], V57[7], V26[7], V58[7], V27[7], V59[7], V28[7], V60[7], V29[7], V61[7], V30[7], V62[7], V31[7], V63[7],\r\n        T48[6], T49[6], T50[6], T51[6], T52[6], T53[6], T54[6], T55[6], T56[6], T57[6], T58[6], T59[6], T60[6], T61[6], T62[6], T63[6], T48[7], T49[7], T50[7], T51[7], T52[7], T53[7], T54[7], T55[7], T56[7], T57[7], T58[7], T59[7], T60[7], T61[7], T62[7], T63[7]);\r\n\r\n    //store\r\n    for (i = 0; i < 8; i++) {\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i          ], T00[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64     ], T01[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 *  2], T02[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 *  3], T03[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 *  4], T04[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 *  5], T05[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 *  6], T06[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 
*  7], T07[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 *  8], T08[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 *  9], T09[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 10], T10[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 11], T11[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 12], T12[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 13], T13[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 14], T14[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 15], T15[i]);\r\n\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 16], T16[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 17], T17[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 18], T18[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 19], T19[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 20], T20[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 21], T21[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 22], T22[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 23], T23[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 24], T24[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 25], T25[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 26], T26[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 27], T27[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 28], T28[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 29], T29[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 30], T30[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 31], T31[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 32], T32[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 33], T33[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 34], T34[i]);\r\n        
_mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 35], T35[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 36], T36[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 37], T37[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 38], T38[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 39], T39[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 40], T40[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 41], T41[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 42], T42[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 43], T43[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 44], T44[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 45], T45[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 46], T46[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 47], T47[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 48], T48[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 49], T49[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 50], T50[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 51], T51[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 52], T52[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 53], T53[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 54], T54[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 55], T55[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 56], T56[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 57], T57[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 58], T58[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 59], T59[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 60], T60[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 61], T61[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 62], 
T62[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 64 * 63], T63[i]);\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid inv_wavelet_64x16_sse128(coeff_t *coeff)\r\n{\r\n    int i;\r\n    // 64*16\r\n    __m128i T00[8], T01[8], T02[8], T03[8], T04[8], T05[8], T06[8], T07[8], T08[8], T09[8], T10[8], T11[8], T12[8], T13[8], T14[8], T15[8];\r\n\r\n    // 16*64\r\n    __m128i V00[2], V01[2], V02[2], V03[2], V04[2], V05[2], V06[2], V07[2], V08[2], V09[2], V10[2], V11[2], V12[2], V13[2], V14[2], V15[2], V16[2], V17[2], V18[2], V19[2], V20[2], V21[2], V22[2], V23[2], V24[2], V25[2], V26[2], V27[2], V28[2], V29[2], V30[2], V31[2], V32[2], V33[2], V34[2], V35[2], V36[2], V37[2], V38[2], V39[2], V40[2], V41[2], V42[2], V43[2], V44[2], V45[2], V46[2], V47[2], V48[2], V49[2], V50[2], V51[2], V52[2], V53[2], V54[2], V55[2], V56[2], V57[2], V58[2], V59[2], V60[2], V61[2], V62[2], V63[2];\r\n\r\n    __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n    __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n\r\n    /*--vertical transform--*/\r\n    //32*8, LOAD AND SHIFT\r\n    T00[0] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 0 + 32 * 0]), 1);\r\n    T01[0] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 0 + 32 * 1]), 1);\r\n    T02[0] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 0 + 32 * 2]), 1);\r\n    T03[0] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 0 + 32 * 3]), 1);\r\n    T04[0] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 0 + 32 * 4]), 1);\r\n    T05[0] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 0 + 32 * 5]), 1);\r\n    T06[0] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 0 + 32 * 6]), 1);\r\n    T07[0] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 0 + 32 * 7]), 1);\r\n\r\n    T00[1] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 8 + 32 * 0]), 1);\r\n    T01[1] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 8 + 32 * 1]), 
1);\r\n    T02[1] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 8 + 32 * 2]), 1);\r\n    T03[1] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 8 + 32 * 3]), 1);\r\n    T04[1] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 8 + 32 * 4]), 1);\r\n    T05[1] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 8 + 32 * 5]), 1);\r\n    T06[1] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 8 + 32 * 6]), 1);\r\n    T07[1] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[ 8 + 32 * 7]), 1);\r\n\r\n    T00[2] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[16 + 32 * 0]), 1);\r\n    T01[2] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[16 + 32 * 1]), 1);\r\n    T02[2] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[16 + 32 * 2]), 1);\r\n    T03[2] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[16 + 32 * 3]), 1);\r\n    T04[2] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[16 + 32 * 4]), 1);\r\n    T05[2] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[16 + 32 * 5]), 1);\r\n    T06[2] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[16 + 32 * 6]), 1);\r\n    T07[2] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[16 + 32 * 7]), 1);\r\n\r\n    T00[3] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[24 + 32 * 0]), 1);\r\n    T01[3] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[24 + 32 * 1]), 1);\r\n    T02[3] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[24 + 32 * 2]), 1);\r\n    T03[3] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[24 + 32 * 3]), 1);\r\n    T04[3] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[24 + 32 * 4]), 1);\r\n    T05[3] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[24 + 32 * 5]), 1);\r\n    T06[3] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[24 + 32 * 6]), 1);\r\n    T07[3] = _mm_srai_epi16(_mm_load_si128((__m128i*)&coeff[24 + 32 * 7]), 1);\r\n\r\n    //filter (odd pixel/row)\r\n    T08[0] = _mm_srai_epi16(_mm_add_epi16(T00[0], T01[0]), 1);\r\n    T09[0] = _mm_srai_epi16(_mm_add_epi16(T01[0], T02[0]), 1);\r\n    T10[0] = 
_mm_srai_epi16(_mm_add_epi16(T02[0], T03[0]), 1);\r\n    T11[0] = _mm_srai_epi16(_mm_add_epi16(T03[0], T04[0]), 1);\r\n    T12[0] = _mm_srai_epi16(_mm_add_epi16(T04[0], T05[0]), 1);\r\n    T13[0] = _mm_srai_epi16(_mm_add_epi16(T05[0], T06[0]), 1);\r\n    T14[0] = _mm_srai_epi16(_mm_add_epi16(T06[0], T07[0]), 1);\r\n    T15[0] = _mm_srai_epi16(_mm_add_epi16(T07[0], T07[0]), 1);\r\n\r\n    T08[1] = _mm_srai_epi16(_mm_add_epi16(T00[1], T01[1]), 1);\r\n    T09[1] = _mm_srai_epi16(_mm_add_epi16(T01[1], T02[1]), 1);\r\n    T10[1] = _mm_srai_epi16(_mm_add_epi16(T02[1], T03[1]), 1);\r\n    T11[1] = _mm_srai_epi16(_mm_add_epi16(T03[1], T04[1]), 1);\r\n    T12[1] = _mm_srai_epi16(_mm_add_epi16(T04[1], T05[1]), 1);\r\n    T13[1] = _mm_srai_epi16(_mm_add_epi16(T05[1], T06[1]), 1);\r\n    T14[1] = _mm_srai_epi16(_mm_add_epi16(T06[1], T07[1]), 1);\r\n    T15[1] = _mm_srai_epi16(_mm_add_epi16(T07[1], T07[1]), 1);\r\n\r\n    T08[2] = _mm_srai_epi16(_mm_add_epi16(T00[2], T01[2]), 1);\r\n    T09[2] = _mm_srai_epi16(_mm_add_epi16(T01[2], T02[2]), 1);\r\n    T10[2] = _mm_srai_epi16(_mm_add_epi16(T02[2], T03[2]), 1);\r\n    T11[2] = _mm_srai_epi16(_mm_add_epi16(T03[2], T04[2]), 1);\r\n    T12[2] = _mm_srai_epi16(_mm_add_epi16(T04[2], T05[2]), 1);\r\n    T13[2] = _mm_srai_epi16(_mm_add_epi16(T05[2], T06[2]), 1);\r\n    T14[2] = _mm_srai_epi16(_mm_add_epi16(T06[2], T07[2]), 1);\r\n    T15[2] = _mm_srai_epi16(_mm_add_epi16(T07[2], T07[2]), 1);\r\n\r\n    T08[3] = _mm_srai_epi16(_mm_add_epi16(T00[3], T01[3]), 1);\r\n    T09[3] = _mm_srai_epi16(_mm_add_epi16(T01[3], T02[3]), 1);\r\n    T10[3] = _mm_srai_epi16(_mm_add_epi16(T02[3], T03[3]), 1);\r\n    T11[3] = _mm_srai_epi16(_mm_add_epi16(T03[3], T04[3]), 1);\r\n    T12[3] = _mm_srai_epi16(_mm_add_epi16(T04[3], T05[3]), 1);\r\n    T13[3] = _mm_srai_epi16(_mm_add_epi16(T05[3], T06[3]), 1);\r\n    T14[3] = _mm_srai_epi16(_mm_add_epi16(T06[3], T07[3]), 1);\r\n    T15[3] = _mm_srai_epi16(_mm_add_epi16(T07[3], T07[3]), 1);\r\n\r\n    
/*--transposition--*/\r\n    //32x16 -> 16x32\r\n    TRANSPOSE_8x8_16BIT(T00[0], T08[0], T01[0], T09[0], T02[0], T10[0], T03[0], T11[0], V00[0], V01[0], V02[0], V03[0], V04[0], V05[0], V06[0], V07[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[1], T08[1], T01[1], T09[1], T02[1], T10[1], T03[1], T11[1], V08[0], V09[0], V10[0], V11[0], V12[0], V13[0], V14[0], V15[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[2], T08[2], T01[2], T09[2], T02[2], T10[2], T03[2], T11[2], V16[0], V17[0], V18[0], V19[0], V20[0], V21[0], V22[0], V23[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[3], T08[3], T01[3], T09[3], T02[3], T10[3], T03[3], T11[3], V24[0], V25[0], V26[0], V27[0], V28[0], V29[0], V30[0], V31[0]);\r\n\r\n    TRANSPOSE_8x8_16BIT(T04[0], T12[0], T05[0], T13[0], T06[0], T14[0], T07[0], T15[0], V00[1], V01[1], V02[1], V03[1], V04[1], V05[1], V06[1], V07[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[1], T12[1], T05[1], T13[1], T06[1], T14[1], T07[1], T15[1], V08[1], V09[1], V10[1], V11[1], V12[1], V13[1], V14[1], V15[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[2], T12[2], T05[2], T13[2], T06[2], T14[2], T07[2], T15[2], V16[1], V17[1], V18[1], V19[1], V20[1], V21[1], V22[1], V23[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[3], T12[3], T05[3], T13[3], T06[3], T14[3], T07[3], T15[3], V24[1], V25[1], V26[1], V27[1], V28[1], V29[1], V30[1], V31[1]);\r\n\r\n    /*--horizontal transform--*/\r\n    //filter (odd pixel/column)\r\n    V32[0] = _mm_srai_epi16(_mm_add_epi16(V00[0], V01[0]), 1);\r\n    V33[0] = _mm_srai_epi16(_mm_add_epi16(V01[0], V02[0]), 1);\r\n    V34[0] = _mm_srai_epi16(_mm_add_epi16(V02[0], V03[0]), 1);\r\n    V35[0] = _mm_srai_epi16(_mm_add_epi16(V03[0], V04[0]), 1);\r\n    V36[0] = _mm_srai_epi16(_mm_add_epi16(V04[0], V05[0]), 1);\r\n    V37[0] = _mm_srai_epi16(_mm_add_epi16(V05[0], V06[0]), 1);\r\n    V38[0] = _mm_srai_epi16(_mm_add_epi16(V06[0], V07[0]), 1);\r\n    V39[0] = _mm_srai_epi16(_mm_add_epi16(V07[0], V08[0]), 1);\r\n    V40[0] = _mm_srai_epi16(_mm_add_epi16(V08[0], V09[0]), 1);\r\n    V41[0] = 
_mm_srai_epi16(_mm_add_epi16(V09[0], V10[0]), 1);\r\n    V42[0] = _mm_srai_epi16(_mm_add_epi16(V10[0], V11[0]), 1);\r\n    V43[0] = _mm_srai_epi16(_mm_add_epi16(V11[0], V12[0]), 1);\r\n    V44[0] = _mm_srai_epi16(_mm_add_epi16(V12[0], V13[0]), 1);\r\n    V45[0] = _mm_srai_epi16(_mm_add_epi16(V13[0], V14[0]), 1);\r\n    V46[0] = _mm_srai_epi16(_mm_add_epi16(V14[0], V15[0]), 1);\r\n    V47[0] = _mm_srai_epi16(_mm_add_epi16(V15[0], V16[0]), 1);\r\n\r\n    V48[0] = _mm_srai_epi16(_mm_add_epi16(V16[0], V17[0]), 1);\r\n    V49[0] = _mm_srai_epi16(_mm_add_epi16(V17[0], V18[0]), 1);\r\n    V50[0] = _mm_srai_epi16(_mm_add_epi16(V18[0], V19[0]), 1);\r\n    V51[0] = _mm_srai_epi16(_mm_add_epi16(V19[0], V20[0]), 1);\r\n    V52[0] = _mm_srai_epi16(_mm_add_epi16(V20[0], V21[0]), 1);\r\n    V53[0] = _mm_srai_epi16(_mm_add_epi16(V21[0], V22[0]), 1);\r\n    V54[0] = _mm_srai_epi16(_mm_add_epi16(V22[0], V23[0]), 1);\r\n    V55[0] = _mm_srai_epi16(_mm_add_epi16(V23[0], V24[0]), 1);\r\n    V56[0] = _mm_srai_epi16(_mm_add_epi16(V24[0], V25[0]), 1);\r\n    V57[0] = _mm_srai_epi16(_mm_add_epi16(V25[0], V26[0]), 1);\r\n    V58[0] = _mm_srai_epi16(_mm_add_epi16(V26[0], V27[0]), 1);\r\n    V59[0] = _mm_srai_epi16(_mm_add_epi16(V27[0], V28[0]), 1);\r\n    V60[0] = _mm_srai_epi16(_mm_add_epi16(V28[0], V29[0]), 1);\r\n    V61[0] = _mm_srai_epi16(_mm_add_epi16(V29[0], V30[0]), 1);\r\n    V62[0] = _mm_srai_epi16(_mm_add_epi16(V30[0], V31[0]), 1);\r\n    V63[0] = _mm_srai_epi16(_mm_add_epi16(V31[0], V31[0]), 1);\r\n\r\n    V32[1] = _mm_srai_epi16(_mm_add_epi16(V00[1], V01[1]), 1);\r\n    V33[1] = _mm_srai_epi16(_mm_add_epi16(V01[1], V02[1]), 1);\r\n    V34[1] = _mm_srai_epi16(_mm_add_epi16(V02[1], V03[1]), 1);\r\n    V35[1] = _mm_srai_epi16(_mm_add_epi16(V03[1], V04[1]), 1);\r\n    V36[1] = _mm_srai_epi16(_mm_add_epi16(V04[1], V05[1]), 1);\r\n    V37[1] = _mm_srai_epi16(_mm_add_epi16(V05[1], V06[1]), 1);\r\n    V38[1] = _mm_srai_epi16(_mm_add_epi16(V06[1], V07[1]), 1);\r\n    V39[1] = 
_mm_srai_epi16(_mm_add_epi16(V07[1], V08[1]), 1);\r\n    V40[1] = _mm_srai_epi16(_mm_add_epi16(V08[1], V09[1]), 1);\r\n    V41[1] = _mm_srai_epi16(_mm_add_epi16(V09[1], V10[1]), 1);\r\n    V42[1] = _mm_srai_epi16(_mm_add_epi16(V10[1], V11[1]), 1);\r\n    V43[1] = _mm_srai_epi16(_mm_add_epi16(V11[1], V12[1]), 1);\r\n    V44[1] = _mm_srai_epi16(_mm_add_epi16(V12[1], V13[1]), 1);\r\n    V45[1] = _mm_srai_epi16(_mm_add_epi16(V13[1], V14[1]), 1);\r\n    V46[1] = _mm_srai_epi16(_mm_add_epi16(V14[1], V15[1]), 1);\r\n    V47[1] = _mm_srai_epi16(_mm_add_epi16(V15[1], V16[1]), 1);\r\n\r\n    V48[1] = _mm_srai_epi16(_mm_add_epi16(V16[1], V17[1]), 1);\r\n    V49[1] = _mm_srai_epi16(_mm_add_epi16(V17[1], V18[1]), 1);\r\n    V50[1] = _mm_srai_epi16(_mm_add_epi16(V18[1], V19[1]), 1);\r\n    V51[1] = _mm_srai_epi16(_mm_add_epi16(V19[1], V20[1]), 1);\r\n    V52[1] = _mm_srai_epi16(_mm_add_epi16(V20[1], V21[1]), 1);\r\n    V53[1] = _mm_srai_epi16(_mm_add_epi16(V21[1], V22[1]), 1);\r\n    V54[1] = _mm_srai_epi16(_mm_add_epi16(V22[1], V23[1]), 1);\r\n    V55[1] = _mm_srai_epi16(_mm_add_epi16(V23[1], V24[1]), 1);\r\n    V56[1] = _mm_srai_epi16(_mm_add_epi16(V24[1], V25[1]), 1);\r\n    V57[1] = _mm_srai_epi16(_mm_add_epi16(V25[1], V26[1]), 1);\r\n    V58[1] = _mm_srai_epi16(_mm_add_epi16(V26[1], V27[1]), 1);\r\n    V59[1] = _mm_srai_epi16(_mm_add_epi16(V27[1], V28[1]), 1);\r\n    V60[1] = _mm_srai_epi16(_mm_add_epi16(V28[1], V29[1]), 1);\r\n    V61[1] = _mm_srai_epi16(_mm_add_epi16(V29[1], V30[1]), 1);\r\n    V62[1] = _mm_srai_epi16(_mm_add_epi16(V30[1], V31[1]), 1);\r\n    V63[1] = _mm_srai_epi16(_mm_add_epi16(V31[1], V31[1]), 1);\r\n\r\n    /*--transposition & Store--*/\r\n    //16x64 -> 64x16\r\n    TRANSPOSE_8x8_16BIT(V00[0], V32[0], V01[0], V33[0], V02[0], V34[0], V03[0], V35[0], T00[0], T01[0], T02[0], T03[0], T04[0], T05[0], T06[0], T07[0]);\r\n    TRANSPOSE_8x8_16BIT(V04[0], V36[0], V05[0], V37[0], V06[0], V38[0], V07[0], V39[0], T00[1], T01[1], T02[1], T03[1], T04[1], T05[1], 
T06[1], T07[1]);\r\n    TRANSPOSE_8x8_16BIT(V08[0], V40[0], V09[0], V41[0], V10[0], V42[0], V11[0], V43[0], T00[2], T01[2], T02[2], T03[2], T04[2], T05[2], T06[2], T07[2]);\r\n    TRANSPOSE_8x8_16BIT(V12[0], V44[0], V13[0], V45[0], V14[0], V46[0], V15[0], V47[0], T00[3], T01[3], T02[3], T03[3], T04[3], T05[3], T06[3], T07[3]);\r\n    TRANSPOSE_8x8_16BIT(V16[0], V48[0], V17[0], V49[0], V18[0], V50[0], V19[0], V51[0], T00[4], T01[4], T02[4], T03[4], T04[4], T05[4], T06[4], T07[4]);\r\n    TRANSPOSE_8x8_16BIT(V20[0], V52[0], V21[0], V53[0], V22[0], V54[0], V23[0], V55[0], T00[5], T01[5], T02[5], T03[5], T04[5], T05[5], T06[5], T07[5]);\r\n    TRANSPOSE_8x8_16BIT(V24[0], V56[0], V25[0], V57[0], V26[0], V58[0], V27[0], V59[0], T00[6], T01[6], T02[6], T03[6], T04[6], T05[6], T06[6], T07[6]);\r\n    TRANSPOSE_8x8_16BIT(V28[0], V60[0], V29[0], V61[0], V30[0], V62[0], V31[0], V63[0], T00[7], T01[7], T02[7], T03[7], T04[7], T05[7], T06[7], T07[7]);\r\n\r\n    TRANSPOSE_8x8_16BIT(V00[1], V32[1], V01[1], V33[1], V02[1], V34[1], V03[1], V35[1], T08[0], T09[0], T10[0], T11[0], T12[0], T13[0], T14[0], T15[0]);\r\n    TRANSPOSE_8x8_16BIT(V04[1], V36[1], V05[1], V37[1], V06[1], V38[1], V07[1], V39[1], T08[1], T09[1], T10[1], T11[1], T12[1], T13[1], T14[1], T15[1]);\r\n    TRANSPOSE_8x8_16BIT(V08[1], V40[1], V09[1], V41[1], V10[1], V42[1], V11[1], V43[1], T08[2], T09[2], T10[2], T11[2], T12[2], T13[2], T14[2], T15[2]);\r\n    TRANSPOSE_8x8_16BIT(V12[1], V44[1], V13[1], V45[1], V14[1], V46[1], V15[1], V47[1], T08[3], T09[3], T10[3], T11[3], T12[3], T13[3], T14[3], T15[3]);\r\n    TRANSPOSE_8x8_16BIT(V16[1], V48[1], V17[1], V49[1], V18[1], V50[1], V19[1], V51[1], T08[4], T09[4], T10[4], T11[4], T12[4], T13[4], T14[4], T15[4]);\r\n    TRANSPOSE_8x8_16BIT(V20[1], V52[1], V21[1], V53[1], V22[1], V54[1], V23[1], V55[1], T08[5], T09[5], T10[5], T11[5], T12[5], T13[5], T14[5], T15[5]);\r\n    TRANSPOSE_8x8_16BIT(V24[1], V56[1], V25[1], V57[1], V26[1], V58[1], V27[1], V59[1], T08[6], T09[6], 
T10[6], T11[6], T12[6], T13[6], T14[6], T15[6]);\r\n    TRANSPOSE_8x8_16BIT(V28[1], V60[1], V29[1], V61[1], V30[1], V62[1], V31[1], V63[1], T08[7], T09[7], T10[7], T11[7], T12[7], T13[7], T14[7], T15[7]);\r\n\r\n    //store\r\n    for (i = 0; i < 8; i++) {\r\n        _mm_store_si128((__m128i*)&coeff[8 * i          ], T00[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64     ], T01[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 *  2], T02[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 *  3], T03[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 *  4], T04[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 *  5], T05[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 *  6], T06[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 *  7], T07[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 *  8], T08[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 *  9], T09[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 * 10], T10[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 * 11], T11[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 * 12], T12[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 * 13], T13[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 * 14], T14[i]);\r\n        _mm_store_si128((__m128i*)&coeff[8 * i + 64 * 15], T15[i]);\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid inv_wavelet_16x64_sse128(coeff_t *coeff)\r\n{\r\n    //src coeff 8*32\r\n    __m128i S00, S01, S02, S03, S04, S05, S06, S07, S08, S09, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S21, S22, S23, S24, S25, S26, S27, S28, S29, S30, S31;\r\n    __m128i S32, S33, S34, S35, S36, S37, S38, S39, S40, S41, S42, S43, S44, S45, S46, S47, S48, S49, S50, S51, S52, S53, S54, S55, S56, S57, S58, S59, S60, S61, S62, S63;\r\n\r\n    // 64*16\r\n    __m128i T00[8], T01[8], T02[8], T03[8], 
T04[8], T05[8], T06[8], T07[8], T08[8], T09[8], T10[8], T11[8], T12[8], T13[8], T14[8], T15[8];\r\n\r\n    // 16*64\r\n    __m128i V00[2], V01[2], V02[2], V03[2], V04[2], V05[2], V06[2], V07[2], V08[2], V09[2], V10[2], V11[2], V12[2], V13[2], V14[2], V15[2], V16[2], V17[2], V18[2], V19[2], V20[2], V21[2], V22[2], V23[2], V24[2], V25[2], V26[2], V27[2], V28[2], V29[2], V30[2], V31[2], V32[2], V33[2], V34[2], V35[2], V36[2], V37[2], V38[2], V39[2], V40[2], V41[2], V42[2], V43[2], V44[2], V45[2], V46[2], V47[2], V48[2], V49[2], V50[2], V51[2], V52[2], V53[2], V54[2], V55[2], V56[2], V57[2], V58[2], V59[2], V60[2], V61[2], V62[2], V63[2];\r\n\r\n    __m128i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n    __m128i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n\r\n    int i;\r\n    /*--load & shift--*/\r\n    //8*32\r\n    S00 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  0]), 1);\r\n    S01 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  1]), 1);\r\n    S02 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  2]), 1);\r\n    S03 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  3]), 1);\r\n    S04 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  4]), 1);\r\n    S05 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  5]), 1);\r\n    S06 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  6]), 1);\r\n    S07 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  7]), 1);\r\n    S08 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  8]), 1);\r\n    S09 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 *  9]), 1);\r\n    S10 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 10]), 1);\r\n    S11 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 11]), 1);\r\n    S12 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 12]), 1);\r\n    S13 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 13]), 1);\r\n    S14 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 14]), 1);\r\n    S15 = 
_mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 15]), 1);\r\n    S16 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 16]), 1);\r\n    S17 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 17]), 1);\r\n    S18 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 18]), 1);\r\n    S19 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 19]), 1);\r\n    S20 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 20]), 1);\r\n    S21 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 21]), 1);\r\n    S22 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 22]), 1);\r\n    S23 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 23]), 1);\r\n    S24 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 24]), 1);\r\n    S25 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 25]), 1);\r\n    S26 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 26]), 1);\r\n    S27 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 27]), 1);\r\n    S28 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 28]), 1);\r\n    S29 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 29]), 1);\r\n    S30 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 30]), 1);\r\n    S31 = _mm_srai_epi16(_mm_loadu_si128((__m128i*)&coeff[8 * 31]), 1);\r\n\r\n    /*--vertical transform--*/\r\n    S32 = _mm_srai_epi16(_mm_add_epi16(S00, S01), 1);\r\n    S33 = _mm_srai_epi16(_mm_add_epi16(S01, S02), 1);\r\n    S34 = _mm_srai_epi16(_mm_add_epi16(S02, S03), 1);\r\n    S35 = _mm_srai_epi16(_mm_add_epi16(S03, S04), 1);\r\n    S36 = _mm_srai_epi16(_mm_add_epi16(S04, S05), 1);\r\n    S37 = _mm_srai_epi16(_mm_add_epi16(S05, S06), 1);\r\n    S38 = _mm_srai_epi16(_mm_add_epi16(S06, S07), 1);\r\n    S39 = _mm_srai_epi16(_mm_add_epi16(S07, S08), 1);\r\n    S40 = _mm_srai_epi16(_mm_add_epi16(S08, S09), 1);\r\n    S41 = _mm_srai_epi16(_mm_add_epi16(S09, S10), 1);\r\n    S42 = _mm_srai_epi16(_mm_add_epi16(S10, S11), 1);\r\n    S43 = _mm_srai_epi16(_mm_add_epi16(S11, S12), 1);\r\n    S44 = 
_mm_srai_epi16(_mm_add_epi16(S12, S13), 1);\r\n    S45 = _mm_srai_epi16(_mm_add_epi16(S13, S14), 1);\r\n    S46 = _mm_srai_epi16(_mm_add_epi16(S14, S15), 1);\r\n    S47 = _mm_srai_epi16(_mm_add_epi16(S15, S16), 1);\r\n    S48 = _mm_srai_epi16(_mm_add_epi16(S16, S17), 1);\r\n    S49 = _mm_srai_epi16(_mm_add_epi16(S17, S18), 1);\r\n    S50 = _mm_srai_epi16(_mm_add_epi16(S18, S19), 1);\r\n    S51 = _mm_srai_epi16(_mm_add_epi16(S19, S20), 1);\r\n    S52 = _mm_srai_epi16(_mm_add_epi16(S20, S21), 1);\r\n    S53 = _mm_srai_epi16(_mm_add_epi16(S21, S22), 1);\r\n    S54 = _mm_srai_epi16(_mm_add_epi16(S22, S23), 1);\r\n    S55 = _mm_srai_epi16(_mm_add_epi16(S23, S24), 1);\r\n    S56 = _mm_srai_epi16(_mm_add_epi16(S24, S25), 1);\r\n    S57 = _mm_srai_epi16(_mm_add_epi16(S25, S26), 1);\r\n    S58 = _mm_srai_epi16(_mm_add_epi16(S26, S27), 1);\r\n    S59 = _mm_srai_epi16(_mm_add_epi16(S27, S28), 1);\r\n    S60 = _mm_srai_epi16(_mm_add_epi16(S28, S29), 1);\r\n    S61 = _mm_srai_epi16(_mm_add_epi16(S29, S30), 1);\r\n    S62 = _mm_srai_epi16(_mm_add_epi16(S30, S31), 1);\r\n    S63 = _mm_srai_epi16(_mm_add_epi16(S31, S31), 1);\r\n\r\n    /*--transposition--*/\r\n    //8x64 -> 64x8\r\n    TRANSPOSE_8x8_16BIT(S00, S32, S01, S33, S02, S34, S03, S35, T00[0], T01[0], T02[0], T03[0], T04[0], T05[0], T06[0], T07[0]);\r\n    TRANSPOSE_8x8_16BIT(S04, S36, S05, S37, S06, S38, S07, S39, T00[1], T01[1], T02[1], T03[1], T04[1], T05[1], T06[1], T07[1]);\r\n    TRANSPOSE_8x8_16BIT(S08, S40, S09, S41, S10, S42, S11, S43, T00[2], T01[2], T02[2], T03[2], T04[2], T05[2], T06[2], T07[2]);\r\n    TRANSPOSE_8x8_16BIT(S12, S44, S13, S45, S14, S46, S15, S47, T00[3], T01[3], T02[3], T03[3], T04[3], T05[3], T06[3], T07[3]);\r\n    TRANSPOSE_8x8_16BIT(S16, S48, S17, S49, S18, S50, S19, S51, T00[4], T01[4], T02[4], T03[4], T04[4], T05[4], T06[4], T07[4]);\r\n    TRANSPOSE_8x8_16BIT(S20, S52, S21, S53, S22, S54, S23, S55, T00[5], T01[5], T02[5], T03[5], T04[5], T05[5], T06[5], T07[5]);\r\n    
TRANSPOSE_8x8_16BIT(S24, S56, S25, S57, S26, S58, S27, S59, T00[6], T01[6], T02[6], T03[6], T04[6], T05[6], T06[6], T07[6]);\r\n    TRANSPOSE_8x8_16BIT(S28, S60, S29, S61, S30, S62, S31, S63, T00[7], T01[7], T02[7], T03[7], T04[7], T05[7], T06[7], T07[7]);\r\n\r\n    /*--horizontal transform--*/\r\n    for (i = 0; i < 8; i++) {\r\n        T08[i] = _mm_srai_epi16(_mm_add_epi16(T00[i], T01[i]), 1);\r\n        T09[i] = _mm_srai_epi16(_mm_add_epi16(T01[i], T02[i]), 1);\r\n        T10[i] = _mm_srai_epi16(_mm_add_epi16(T02[i], T03[i]), 1);\r\n        T11[i] = _mm_srai_epi16(_mm_add_epi16(T03[i], T04[i]), 1);\r\n        T12[i] = _mm_srai_epi16(_mm_add_epi16(T04[i], T05[i]), 1);\r\n        T13[i] = _mm_srai_epi16(_mm_add_epi16(T05[i], T06[i]), 1);\r\n        T14[i] = _mm_srai_epi16(_mm_add_epi16(T06[i], T07[i]), 1);\r\n        T15[i] = _mm_srai_epi16(_mm_add_epi16(T07[i], T07[i]), 1);\r\n    }\r\n\r\n    /*--transposition--*/\r\n    //64x16 -> 16x64\r\n    TRANSPOSE_8x8_16BIT(T00[0], T08[0], T01[0], T09[0], T02[0], T10[0], T03[0], T11[0], V00[0], V01[0], V02[0], V03[0], V04[0], V05[0], V06[0], V07[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[1], T08[1], T01[1], T09[1], T02[1], T10[1], T03[1], T11[1], V08[0], V09[0], V10[0], V11[0], V12[0], V13[0], V14[0], V15[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[2], T08[2], T01[2], T09[2], T02[2], T10[2], T03[2], T11[2], V16[0], V17[0], V18[0], V19[0], V20[0], V21[0], V22[0], V23[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[3], T08[3], T01[3], T09[3], T02[3], T10[3], T03[3], T11[3], V24[0], V25[0], V26[0], V27[0], V28[0], V29[0], V30[0], V31[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[4], T08[4], T01[4], T09[4], T02[4], T10[4], T03[4], T11[4], V32[0], V33[0], V34[0], V35[0], V36[0], V37[0], V38[0], V39[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[5], T08[5], T01[5], T09[5], T02[5], T10[5], T03[5], T11[5], V40[0], V41[0], V42[0], V43[0], V44[0], V45[0], V46[0], V47[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[6], T08[6], T01[6], T09[6], T02[6], T10[6], T03[6], T11[6], V48[0], V49[0], 
V50[0], V51[0], V52[0], V53[0], V54[0], V55[0]);\r\n    TRANSPOSE_8x8_16BIT(T00[7], T08[7], T01[7], T09[7], T02[7], T10[7], T03[7], T11[7], V56[0], V57[0], V58[0], V59[0], V60[0], V61[0], V62[0], V63[0]);\r\n\r\n    TRANSPOSE_8x8_16BIT(T04[0], T12[0], T05[0], T13[0], T06[0], T14[0], T07[0], T15[0], V00[1], V01[1], V02[1], V03[1], V04[1], V05[1], V06[1], V07[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[1], T12[1], T05[1], T13[1], T06[1], T14[1], T07[1], T15[1], V08[1], V09[1], V10[1], V11[1], V12[1], V13[1], V14[1], V15[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[2], T12[2], T05[2], T13[2], T06[2], T14[2], T07[2], T15[2], V16[1], V17[1], V18[1], V19[1], V20[1], V21[1], V22[1], V23[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[3], T12[3], T05[3], T13[3], T06[3], T14[3], T07[3], T15[3], V24[1], V25[1], V26[1], V27[1], V28[1], V29[1], V30[1], V31[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[4], T12[4], T05[4], T13[4], T06[4], T14[4], T07[4], T15[4], V32[1], V33[1], V34[1], V35[1], V36[1], V37[1], V38[1], V39[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[5], T12[5], T05[5], T13[5], T06[5], T14[5], T07[5], T15[5], V40[1], V41[1], V42[1], V43[1], V44[1], V45[1], V46[1], V47[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[6], T12[6], T05[6], T13[6], T06[6], T14[6], T07[6], T15[6], V48[1], V49[1], V50[1], V51[1], V52[1], V53[1], V54[1], V55[1]);\r\n    TRANSPOSE_8x8_16BIT(T04[7], T12[7], T05[7], T13[7], T06[7], T14[7], T07[7], T15[7], V56[1], V57[1], V58[1], V59[1], V60[1], V61[1], V62[1], V63[1]);\r\n\r\n    /*--Store--*/\r\n    //16x64\r\n    for (i = 0; i < 2; i++) {\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  0], V00[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  1], V01[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  2], V02[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  3], V03[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  4], V04[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  5], V05[i]);\r\n        
_mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  6], V06[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  7], V07[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  8], V08[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 *  9], V09[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 10], V10[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 11], V11[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 12], V12[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 13], V13[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 14], V14[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 15], V15[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 16], V16[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 17], V17[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 18], V18[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 19], V19[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 20], V20[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 21], V21[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 22], V22[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 23], V23[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 24], V24[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 25], V25[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 26], V26[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 27], V27[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 28], V28[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 29], V29[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 30], V30[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 31], V31[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 32], V32[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 33], 
V33[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 34], V34[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 35], V35[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 36], V36[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 37], V37[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 38], V38[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 39], V39[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 40], V40[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 41], V41[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 42], V42[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 43], V43[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 44], V44[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 45], V45[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 46], V46[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 47], V47[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 48], V48[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 49], V49[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 50], V50[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 51], V51[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 52], V52[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 53], V53[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 54], V54[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 55], V55[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 56], V56[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 57], V57[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 58], V58[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 59], V59[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 60], V60[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 
* i + 16 * 61], V61[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 62], V62[i]);\r\n        _mm_storeu_si128((__m128i*)&coeff[8 * i + 16 * 63], V63[i]);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_64x64_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_32x32_sse128(src, dst, 32 | 0x01); /* 32x32 idct */\r\n    inv_wavelet_64x64_sse128(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_64x64_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_32x32_half_sse128(src, dst, 32 | 0x01); /* 32x32 idct */\r\n    inv_wavelet_64x64_sse128(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_64x64_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_32x32_quad_sse128(src, dst, 32 | 0x01); /* 32x32 idct */\r\n    inv_wavelet_64x64_sse128(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_64x16_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_32x8_sse128(src, dst, 32 | 0x01);\r\n    inv_wavelet_64x16_sse128(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_64x16_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_32x8_half_sse128(src, dst, 32 | 0x01);\r\n    inv_wavelet_64x16_sse128(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_64x16_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_32x8_quad_sse128(src, dst, 32 | 0x01);\r\n    
inv_wavelet_64x16_sse128(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_16x64_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_8x32_sse128(src, dst, 8 | 0x01);\r\n    inv_wavelet_16x64_sse128(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_16x64_half_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_8x32_half_sse128(src, dst, 8 | 0x01);\r\n    inv_wavelet_16x64_sse128(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid idct_16x64_quad_sse128(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_8x32_quad_sse128(src, dst, 8 | 0x01);\r\n    inv_wavelet_16x64_sse128(dst);\r\n}\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_idct_avx2.cc",
    "content": "/*\r\n * intrinsic_idct_avx2.cc\r\n *\r\n * Description of this file:\r\n *    AVX2 assembly functions of IDCT module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n#include <immintrin.h>\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n/* disable warnings */\r\n#pragma warning(disable:4127)  // warning C4127: conditional expression is constant\r\n\r\nALIGN32(static const coeff_t tab_idct_8x8_256[12][16]) =\r\n{\r\n    { 44, 38, 44, 38, 44, 38, 44, 38, 44, 38, 44, 38, 44, 38, 44, 38 },\r\n    { 25, 9, 25, 9, 25, 9, 25, 9, 25, 9, 25, 9, 25, 9, 25, 9 },\r\n    { 38, -9, 38, -9, 38, -9, 38, -9, 38, -9, 38, -9, 38, -9, 38, -9 },\r\n    { -44, -25, -44, 
-25, -44, -25, -44, -25, -44, -25, -44, -25, -44, -25, -44, -25 },\r\n    { 25, -44, 25, -44, 25, -44, 25, -44, 25, -44, 25, -44, 25, -44, 25, -44 },\r\n    { 9, 38, 9, 38, 9, 38, 9, 38, 9, 38, 9, 38, 9, 38, 9, 38 },\r\n    { 9, -25, 9, -25, 9, -25, 9, -25, 9, -25, 9, -25, 9, -25, 9, -25 },\r\n    { 38, -44, 38, -44, 38, -44, 38, -44, 38, -44, 38, -44, 38, -44, 38, -44 },\r\n    { 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32 },\r\n    { 32, -32, 32, -32, 32, -32, 32, -32, 32, -32, 32, -32, 32, -32, 32, -32 },\r\n    { 42, 17, 42, 17, 42, 17, 42, 17, 42, 17, 42, 17, 42, 17, 42, 17 },\r\n    { 17, -42, 17, -42, 17, -42, 17, -42, 17, -42, 17, -42, 17, -42, 17, -42 }\r\n};\r\n\r\n\r\n\r\n\r\nvoid idct_8x8_avx2(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    const int SHIFT1 = 5;\r\n    // const int CLIP1 = LIMIT_BIT;\r\n    const int SHIFT2 = 20 - g_bit_depth;\r\n    const int CLIP2 = g_bit_depth + 1;\r\n \r\n    __m256i mAdd;\r\n    __m256i S1S5, S3S7;\r\n    __m256i T0, T1, T2, T3;\r\n    __m256i E0, E1, E2, E3, O0, O1, O2, O3;\r\n    __m256i EE0, EE1, EO0, EO1;\r\n    __m256i S0, S1, S2, S3, S4, S5, S6, S7;\r\n    __m256i C00, C01, C02, C03, C04, C05, C06, C07;\r\n    __m256i max_val, min_val;\r\n\r\n    UNUSED_PARAMETER(i_dst);\r\n    S1S5 = _mm256_loadu2_m128i((__m128i*)&src[40], (__m128i*)&src[ 8]);\r\n    S3S7 = _mm256_loadu2_m128i((__m128i*)&src[56], (__m128i*)&src[24]);\r\n\r\n    T0 = _mm256_unpacklo_epi16(S1S5, S3S7);\r\n    T1 = _mm256_unpackhi_epi16(S1S5, S3S7);\r\n\r\n    T2 = _mm256_permute2x128_si256(T0, T1, 0x20);\r\n    T3 = _mm256_permute2x128_si256(T0, T1, 0x31);\r\n\r\n    O0 = _mm256_add_epi32(_mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[0]))),\r\n                          _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[1]))));\r\n    O1 = _mm256_add_epi32(_mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[2]))),\r\n                          
_mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[3]))));\r\n    O2 = _mm256_add_epi32(_mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[4]))),\r\n                          _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[5]))));\r\n    O3 = _mm256_add_epi32(_mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[6]))),\r\n                          _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[7]))));\r\n\r\n    /*    -------     */\r\n    S1S5 = _mm256_loadu2_m128i((__m128i*)&src[16], (__m128i*)&src[0]);\r\n    S3S7 = _mm256_loadu2_m128i((__m128i*)&src[48], (__m128i*)&src[32]);\r\n\r\n    T0 = _mm256_unpacklo_epi16(S1S5, S3S7);\r\n    T1 = _mm256_unpackhi_epi16(S1S5, S3S7);\r\n\r\n    T2 = _mm256_permute2x128_si256(T0, T1, 0x20);\r\n    T3 = _mm256_permute2x128_si256(T0, T1, 0x31);\r\n\r\n    EE0 = _mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[8])));\r\n    EE1 = _mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[9])));\r\n    EO0 = _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[10])));\r\n    EO1 = _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[11])));\r\n\r\n    /*    -------     */\r\n    mAdd = _mm256_set1_epi32((1 << (SHIFT1 - 1)));                               // rounding offset for the first pass\r\n\r\n    E0 = _mm256_add_epi32(EE0, EO0);\r\n    E1 = _mm256_add_epi32(EE1, EO1);\r\n    E3 = _mm256_sub_epi32(EE0, EO0);\r\n    E2 = _mm256_sub_epi32(EE1, EO1);\r\n    E0 = _mm256_add_epi32(E0, mAdd);\r\n    E1 = _mm256_add_epi32(E1, mAdd);\r\n    E2 = _mm256_add_epi32(E2, mAdd);\r\n    E3 = _mm256_add_epi32(E3, mAdd);\r\n\r\n    S0 = _mm256_srai_epi32(_mm256_add_epi32(E0, O0), SHIFT1);\r\n    S7 = _mm256_srai_epi32(_mm256_sub_epi32(E0, O0), SHIFT1);\r\n    S1 = _mm256_srai_epi32(_mm256_add_epi32(E1, O1), SHIFT1);\r\n    S6 = _mm256_srai_epi32(_mm256_sub_epi32(E1, O1), SHIFT1);\r\n    S2 = 
_mm256_srai_epi32(_mm256_add_epi32(E2, O2), SHIFT1);\r\n    S5 = _mm256_srai_epi32(_mm256_sub_epi32(E2, O2), SHIFT1);\r\n    S3 = _mm256_srai_epi32(_mm256_add_epi32(E3, O3), SHIFT1);\r\n    S4 = _mm256_srai_epi32(_mm256_sub_epi32(E3, O3), SHIFT1);\r\n\r\n    C00 = _mm256_permute2x128_si256(S0, S4, 0x20);\r\n    C01 = _mm256_permute2x128_si256(S0, S4, 0x31);\r\n\r\n    C02 = _mm256_permute2x128_si256(S1, S5, 0x20);\r\n    C03 = _mm256_permute2x128_si256(S1, S5, 0x31);\r\n\r\n    C04 = _mm256_permute2x128_si256(S2, S6, 0x20);\r\n    C05 = _mm256_permute2x128_si256(S2, S6, 0x31);\r\n\r\n    C06 = _mm256_permute2x128_si256(S3, S7, 0x20);\r\n    C07 = _mm256_permute2x128_si256(S3, S7, 0x31);\r\n\r\n    S0 = _mm256_packs_epi32(C00, C01);\r\n    S1 = _mm256_packs_epi32(C02, C03);\r\n    S2 = _mm256_packs_epi32(C04, C05);\r\n    S3 = _mm256_packs_epi32(C06, C07);\r\n\r\n    S4 = _mm256_unpacklo_epi16(S0, S1);\r\n    S5 = _mm256_unpacklo_epi16(S2, S3);\r\n    S6 = _mm256_unpackhi_epi16(S0, S1);\r\n    S7 = _mm256_unpackhi_epi16(S2, S3);\r\n\r\n    C00 = _mm256_unpacklo_epi32(S4, S5);\r\n    C01 = _mm256_unpacklo_epi32(S6, S7);\r\n    C02 = _mm256_unpackhi_epi32(S4, S5);\r\n    C03 = _mm256_unpackhi_epi32(S6, S7);\r\n\r\n    C04 = _mm256_permute2x128_si256(C00, C02, 0x20);\r\n    C05 = _mm256_permute2x128_si256(C00, C02, 0x31);\r\n    C06 = _mm256_permute2x128_si256(C01, C03, 0x20);\r\n    C07 = _mm256_permute2x128_si256(C01, C03, 0x31);\r\n\r\n    S0 = _mm256_unpacklo_epi64(C04, C05);\r\n    S1 = _mm256_unpacklo_epi64(C06, C07);\r\n\r\n    S2 = _mm256_unpackhi_epi64(C04, C05);\r\n    S3 = _mm256_unpackhi_epi64(C06, C07);\r\n\r\n    S4 = _mm256_permute2x128_si256(S2, S3, 0x20);\r\n    S5 = _mm256_permute2x128_si256(S2, S3, 0x31);\r\n\r\n\r\n    T0 = _mm256_unpacklo_epi16(S4, S5);\r\n    T1 = _mm256_unpackhi_epi16(S4, S5);\r\n\r\n    T2 = _mm256_permute2x128_si256(T0, T1, 0x20);\r\n    T3 = _mm256_permute2x128_si256(T0, T1, 0x31);\r\n\r\n    O0 = 
_mm256_add_epi32(_mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[0]))),\r\n                          _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[1]))));\r\n    O1 = _mm256_add_epi32(_mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[2]))),\r\n                          _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[3]))));\r\n    O2 = _mm256_add_epi32(_mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[4]))),\r\n                          _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[5]))));\r\n    O3 = _mm256_add_epi32(_mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[6]))),\r\n                          _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[7]))));\r\n\r\n    /*    -------     */\r\n    T0 = _mm256_unpacklo_epi16(S0, S1);\r\n    T1 = _mm256_unpackhi_epi16(S0, S1);\r\n\r\n    T2 = _mm256_permute2x128_si256(T0, T1, 0x20);\r\n    T3 = _mm256_permute2x128_si256(T0, T1, 0x31);\r\n\r\n    EE0 = _mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[8])));\r\n    EE1 = _mm256_madd_epi16(T2, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[9])));\r\n    EO0 = _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[10])));\r\n    EO1 = _mm256_madd_epi16(T3, _mm256_load_si256((__m256i*)(tab_idct_8x8_256[11])));\r\n\r\n    /*    -------     */\r\n    mAdd = _mm256_set1_epi32(SHIFT2 ? 
(1 << (SHIFT2 - 1)) : 0);                       //\r\n\r\n    E0 = _mm256_add_epi32(EE0, EO0);\r\n    E1 = _mm256_add_epi32(EE1, EO1);\r\n    E3 = _mm256_sub_epi32(EE0, EO0);\r\n    E2 = _mm256_sub_epi32(EE1, EO1);\r\n    E0 = _mm256_add_epi32(E0, mAdd);\r\n    E1 = _mm256_add_epi32(E1, mAdd);\r\n    E2 = _mm256_add_epi32(E2, mAdd);\r\n    E3 = _mm256_add_epi32(E3, mAdd);\r\n\r\n    S0 = _mm256_srai_epi32(_mm256_add_epi32(E0, O0), SHIFT2);\r\n    S7 = _mm256_srai_epi32(_mm256_sub_epi32(E0, O0), SHIFT2);\r\n    S1 = _mm256_srai_epi32(_mm256_add_epi32(E1, O1), SHIFT2);\r\n    S6 = _mm256_srai_epi32(_mm256_sub_epi32(E1, O1), SHIFT2);\r\n    S2 = _mm256_srai_epi32(_mm256_add_epi32(E2, O2), SHIFT2);\r\n    S5 = _mm256_srai_epi32(_mm256_sub_epi32(E2, O2), SHIFT2);\r\n    S3 = _mm256_srai_epi32(_mm256_add_epi32(E3, O3), SHIFT2);\r\n    S4 = _mm256_srai_epi32(_mm256_sub_epi32(E3, O3), SHIFT2);\r\n\r\n    C00 = _mm256_permute2x128_si256(S0, S4, 0x20);\r\n    C01 = _mm256_permute2x128_si256(S0, S4, 0x31);\r\n\r\n    C02 = _mm256_permute2x128_si256(S1, S5, 0x20);\r\n    C03 = _mm256_permute2x128_si256(S1, S5, 0x31);\r\n\r\n    C04 = _mm256_permute2x128_si256(S2, S6, 0x20);\r\n    C05 = _mm256_permute2x128_si256(S2, S6, 0x31);\r\n\r\n    C06 = _mm256_permute2x128_si256(S3, S7, 0x20);\r\n    C07 = _mm256_permute2x128_si256(S3, S7, 0x31);\r\n\r\n    S0 = _mm256_packs_epi32(C00, C01);\r\n    S1 = _mm256_packs_epi32(C02, C03);\r\n    S2 = _mm256_packs_epi32(C04, C05);\r\n    S3 = _mm256_packs_epi32(C06, C07);\r\n\r\n    S4 = _mm256_unpacklo_epi16(S0, S1);\r\n    S5 = _mm256_unpacklo_epi16(S2, S3);\r\n    S6 = _mm256_unpackhi_epi16(S0, S1);\r\n    S7 = _mm256_unpackhi_epi16(S2, S3);\r\n\r\n    C00 = _mm256_unpacklo_epi32(S4, S5);\r\n    C01 = _mm256_unpacklo_epi32(S6, S7);\r\n    C02 = _mm256_unpackhi_epi32(S4, S5);\r\n    C03 = _mm256_unpackhi_epi32(S6, S7);\r\n\r\n    C04 = _mm256_permute2x128_si256(C00, C02, 0x20);\r\n    C05 = _mm256_permute2x128_si256(C00, C02, 0x31);\r\n    
C06 = _mm256_permute2x128_si256(C01, C03, 0x20);\r\n    C07 = _mm256_permute2x128_si256(C01, C03, 0x31);\r\n\r\n    S0 = _mm256_unpacklo_epi64(C04, C05);\r\n    S1 = _mm256_unpacklo_epi64(C06, C07);\r\n    S2 = _mm256_unpackhi_epi64(C04, C05);\r\n    S3 = _mm256_unpackhi_epi64(C06, C07);\r\n\r\n    // CLIP2\r\n    max_val = _mm256_set1_epi16((1 << (CLIP2 - 1)) - 1);\r\n    min_val = _mm256_set1_epi16(-(1 << (CLIP2 - 1)));\r\n\r\n    S0 = _mm256_max_epi16(_mm256_min_epi16(S0, max_val), min_val);\r\n    S1 = _mm256_max_epi16(_mm256_min_epi16(S1, max_val), min_val);\r\n    S2 = _mm256_max_epi16(_mm256_min_epi16(S2, max_val), min_val);\r\n    S3 = _mm256_max_epi16(_mm256_min_epi16(S3, max_val), min_val);\r\n\r\n    // store\r\n    _mm256_storeu2_m128i((__m128i*)&dst[16], (__m128i*)&dst[ 0], S0);\r\n    _mm256_storeu2_m128i((__m128i*)&dst[48], (__m128i*)&dst[32], S1);\r\n    _mm256_storeu2_m128i((__m128i*)&dst[24], (__m128i*)&dst[ 8], S2);\r\n    _mm256_storeu2_m128i((__m128i*)&dst[56], (__m128i*)&dst[40], S3);\r\n}\r\n\r\n\r\nvoid idct_16x16_avx2(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    const int shift = 20-g_bit_depth;\r\n    const int clip = g_bit_depth + 1;\r\n\r\n    const __m256i c16_p43_p45 = _mm256_set1_epi32(0x002B002D);      //row0 87high - 90low address\r\n    const __m256i c16_p35_p40 = _mm256_set1_epi32(0x00230028);\r\n    const __m256i c16_p21_p29 = _mm256_set1_epi32(0x0015001D);\r\n    const __m256i c16_p04_p13 = _mm256_set1_epi32(0x0004000D);\r\n    const __m256i c16_p29_p43 = _mm256_set1_epi32(0x001D002B);      //row1\r\n    const __m256i c16_n21_p04 = _mm256_set1_epi32(0xFFEB0004);\r\n    const __m256i c16_n45_n40 = _mm256_set1_epi32(0xFFD3FFD8);\r\n    const __m256i c16_n13_n35 = _mm256_set1_epi32(0xFFF3FFDD);\r\n    const __m256i c16_p04_p40 = _mm256_set1_epi32(0x00040028);      //row2\r\n    const __m256i c16_n43_n35 = _mm256_set1_epi32(0xFFD5FFDD);\r\n    const __m256i c16_p29_n13 = _mm256_set1_epi32(0x001DFFF3);\r\n    const 
__m256i c16_p21_p45 = _mm256_set1_epi32(0x0015002D);\r\n    const __m256i c16_n21_p35 = _mm256_set1_epi32(0xFFEB0023);      //row3\r\n    const __m256i c16_p04_n43 = _mm256_set1_epi32(0x0004FFD5);\r\n    const __m256i c16_p13_p45 = _mm256_set1_epi32(0x000D002D);\r\n    const __m256i c16_n29_n40 = _mm256_set1_epi32(0xFFE3FFD8);\r\n    const __m256i c16_n40_p29 = _mm256_set1_epi32(0xFFD8001D);      //row4\r\n    const __m256i c16_p45_n13 = _mm256_set1_epi32(0x002DFFF3);\r\n    const __m256i c16_n43_n04 = _mm256_set1_epi32(0xFFD5FFFC);\r\n    const __m256i c16_p35_p21 = _mm256_set1_epi32(0x00230015);\r\n    const __m256i c16_n45_p21 = _mm256_set1_epi32(0xFFD30015);      //row5\r\n    const __m256i c16_p13_p29 = _mm256_set1_epi32(0x000D001D);\r\n    const __m256i c16_p35_n43 = _mm256_set1_epi32(0x0023FFD5);\r\n    const __m256i c16_n40_p04 = _mm256_set1_epi32(0xFFD80004);\r\n    const __m256i c16_n35_p13 = _mm256_set1_epi32(0xFFDD000D);      //row6\r\n    const __m256i c16_n40_p45 = _mm256_set1_epi32(0xFFD8002D);\r\n    const __m256i c16_p04_p21 = _mm256_set1_epi32(0x00040015);\r\n    const __m256i c16_p43_n29 = _mm256_set1_epi32(0x002BFFE3);\r\n    const __m256i c16_n13_p04 = _mm256_set1_epi32(0xFFF30004);      //row7\r\n    const __m256i c16_n29_p21 = _mm256_set1_epi32(0xFFE30015);\r\n    const __m256i c16_n40_p35 = _mm256_set1_epi32(0xFFD80023);\r\n    const __m256i c16_n45_p43 = _mm256_set1_epi32(0xFFD3002B);\r\n\r\n    const __m256i c16_p38_p44 = _mm256_set1_epi32(0x0026002C);\r\n    const __m256i c16_p09_p25 = _mm256_set1_epi32(0x00090019);\r\n    const __m256i c16_n09_p38 = _mm256_set1_epi32(0xFFF70026);\r\n    const __m256i c16_n25_n44 = _mm256_set1_epi32(0xFFE7FFD4);\r\n    const __m256i c16_n44_p25 = _mm256_set1_epi32(0xFFD40019);\r\n    const __m256i c16_p38_p09 = _mm256_set1_epi32(0x00260009);\r\n    const __m256i c16_n25_p09 = _mm256_set1_epi32(0xFFE70009);\r\n    const __m256i c16_n44_p38 = _mm256_set1_epi32(0xFFD40026);\r\n\r\n    const __m256i 
c16_p17_p42 = _mm256_set1_epi32(0x0011002A);\r\n    const __m256i c16_n42_p17 = _mm256_set1_epi32(0xFFD60011);\r\n\r\n    const __m256i c16_n32_p32 = _mm256_set1_epi32(0xFFE00020);\r\n    const __m256i c16_p32_p32 = _mm256_set1_epi32(0x00200020);\r\n\r\n    __m256i max_val, min_val;\r\n    __m256i c32_rnd = _mm256_set1_epi32(16);                                    // rounding offset for the first pass: 1 << (nShift - 1)\r\n\r\n    int nShift = 5;\r\n    int pass;\r\n\r\n    __m256i in00, in01, in02, in03, in04, in05, in06, in07;\r\n    __m256i in08, in09, in10, in11, in12, in13, in14, in15;\r\n    __m256i res00, res01, res02, res03, res04, res05, res06, res07;\r\n    __m256i res08, res09, res10, res11, res12, res13, res14, res15;\r\n\r\n\r\n    UNUSED_PARAMETER(i_dst);\r\n\r\n    in00 = _mm256_lddqu_si256((const __m256i*)&src[0 * 16]);    // [07 06 05 04 03 02 01 00]\r\n    in01 = _mm256_lddqu_si256((const __m256i*)&src[1 * 16]);    // [17 16 15 14 13 12 11 10]\r\n    in02 = _mm256_lddqu_si256((const __m256i*)&src[2 * 16]);    // [27 26 25 24 23 22 21 20]\r\n    in03 = _mm256_lddqu_si256((const __m256i*)&src[3 * 16]);    // [37 36 35 34 33 32 31 30]\r\n    in04 = _mm256_lddqu_si256((const __m256i*)&src[4 * 16]);    // [47 46 45 44 43 42 41 40]\r\n    in05 = _mm256_lddqu_si256((const __m256i*)&src[5 * 16]);    // [57 56 55 54 53 52 51 50]\r\n    in06 = _mm256_lddqu_si256((const __m256i*)&src[6 * 16]);    // [67 66 65 64 63 62 61 60]\r\n    in07 = _mm256_lddqu_si256((const __m256i*)&src[7 * 16]);    // [77 76 75 74 73 72 71 70]\r\n    in08 = _mm256_lddqu_si256((const __m256i*)&src[8 * 16]);\r\n    in09 = _mm256_lddqu_si256((const __m256i*)&src[9 * 16]);\r\n    in10 = _mm256_lddqu_si256((const __m256i*)&src[10 * 16]);\r\n    in11 = _mm256_lddqu_si256((const __m256i*)&src[11 * 16]);\r\n    in12 = _mm256_lddqu_si256((const __m256i*)&src[12 * 16]);\r\n    in13 = _mm256_lddqu_si256((const __m256i*)&src[13 * 16]);\r\n    in14 = _mm256_lddqu_si256((const __m256i*)&src[14 * 16]);\r\n    in15 = _mm256_lddqu_si256((const 
__m256i*)&src[15 * 16]);\r\n\r\n\r\n    for (pass = 0; pass < 2; pass++) {\r\n        const __m256i T_00_00A = _mm256_unpacklo_epi16(in01, in03);       // [33 13 32 12 31 11 30 10]\r\n        const __m256i T_00_00B = _mm256_unpackhi_epi16(in01, in03);       // [37 17 36 16 35 15 34 14]\r\n        const __m256i T_00_01A = _mm256_unpacklo_epi16(in05, in07);       // [ ]\r\n        const __m256i T_00_01B = _mm256_unpackhi_epi16(in05, in07);       // [ ]\r\n        const __m256i T_00_02A = _mm256_unpacklo_epi16(in09, in11);       // [ ]\r\n        const __m256i T_00_02B = _mm256_unpackhi_epi16(in09, in11);       // [ ]\r\n        const __m256i T_00_03A = _mm256_unpacklo_epi16(in13, in15);       // [ ]\r\n        const __m256i T_00_03B = _mm256_unpackhi_epi16(in13, in15);       // [ ]\r\n        const __m256i T_00_04A = _mm256_unpacklo_epi16(in02, in06);       // [ ]\r\n        const __m256i T_00_04B = _mm256_unpackhi_epi16(in02, in06);       // [ ]\r\n        const __m256i T_00_05A = _mm256_unpacklo_epi16(in10, in14);       // [ ]\r\n        const __m256i T_00_05B = _mm256_unpackhi_epi16(in10, in14);       // [ ]\r\n        const __m256i T_00_06A = _mm256_unpacklo_epi16(in04, in12);       // [ ]\r\n        const __m256i T_00_06B = _mm256_unpackhi_epi16(in04, in12);       // [ ]\r\n        const __m256i T_00_07A = _mm256_unpacklo_epi16(in00, in08);       // [83 03 82 02 81 01 80 00] row08 row00\r\n        const __m256i T_00_07B = _mm256_unpackhi_epi16(in00, in08);       // [87 07 86 06 85 05 84 04]\r\n\r\n        __m256i O0A, O1A, O2A, O3A, O4A, O5A, O6A, O7A;\r\n        __m256i O0B, O1B, O2B, O3B, O4B, O5B, O6B, O7B;\r\n        __m256i EO0A, EO1A, EO2A, EO3A;\r\n        __m256i EO0B, EO1B, EO2B, EO3B;\r\n        __m256i EEO0A, EEO1A;\r\n        __m256i EEO0B, EEO1B;\r\n        __m256i EEE0A, EEE1A;\r\n        __m256i EEE0B, EEE1B;\r\n\r\n    {\r\n        __m256i T00, T01;\r\n#define COMPUTE_ROW(row0103, row0507, row0911, row1315, c0103, c0507, c0911, c1315, row) 
\\\r\n    T00 = _mm256_add_epi32(_mm256_madd_epi16(row0103, c0103), _mm256_madd_epi16(row0507, c0507)); \\\r\n    T01 = _mm256_add_epi32(_mm256_madd_epi16(row0911, c0911), _mm256_madd_epi16(row1315, c1315)); \\\r\n    row = _mm256_add_epi32(T00, T01);\r\n\r\n        COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, O0A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, O1A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, O2A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, O3A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, O4A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, O5A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, O6A)\r\n            COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, O7A)\r\n\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, O0B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, O1B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, O2B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, O3B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, O4B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, O5B)\r\n     
       COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, O6B)\r\n            COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, O7B)\r\n#undef COMPUTE_ROW\r\n    }\r\n\r\n    EO0A = _mm256_add_epi32(_mm256_madd_epi16(T_00_04A, c16_p38_p44), _mm256_madd_epi16(T_00_05A, c16_p09_p25)); // EO0\r\n    EO0B = _mm256_add_epi32(_mm256_madd_epi16(T_00_04B, c16_p38_p44), _mm256_madd_epi16(T_00_05B, c16_p09_p25));\r\n    EO1A = _mm256_add_epi32(_mm256_madd_epi16(T_00_04A, c16_n09_p38), _mm256_madd_epi16(T_00_05A, c16_n25_n44)); // EO1\r\n    EO1B = _mm256_add_epi32(_mm256_madd_epi16(T_00_04B, c16_n09_p38), _mm256_madd_epi16(T_00_05B, c16_n25_n44));\r\n    EO2A = _mm256_add_epi32(_mm256_madd_epi16(T_00_04A, c16_n44_p25), _mm256_madd_epi16(T_00_05A, c16_p38_p09)); // EO2\r\n    EO2B = _mm256_add_epi32(_mm256_madd_epi16(T_00_04B, c16_n44_p25), _mm256_madd_epi16(T_00_05B, c16_p38_p09));\r\n    EO3A = _mm256_add_epi32(_mm256_madd_epi16(T_00_04A, c16_n25_p09), _mm256_madd_epi16(T_00_05A, c16_n44_p38)); // EO3\r\n    EO3B = _mm256_add_epi32(_mm256_madd_epi16(T_00_04B, c16_n25_p09), _mm256_madd_epi16(T_00_05B, c16_n44_p38));\r\n\r\n    EEO0A = _mm256_madd_epi16(T_00_06A, c16_p17_p42);\r\n    EEO0B = _mm256_madd_epi16(T_00_06B, c16_p17_p42);\r\n    EEO1A = _mm256_madd_epi16(T_00_06A, c16_n42_p17);\r\n    EEO1B = _mm256_madd_epi16(T_00_06B, c16_n42_p17);\r\n\r\n    EEE0A = _mm256_madd_epi16(T_00_07A, c16_p32_p32);\r\n    EEE0B = _mm256_madd_epi16(T_00_07B, c16_p32_p32);\r\n    EEE1A = _mm256_madd_epi16(T_00_07A, c16_n32_p32);\r\n    EEE1B = _mm256_madd_epi16(T_00_07B, c16_n32_p32);\r\n    {\r\n        const __m256i EE0A = _mm256_add_epi32(EEE0A, EEO0A);          // EE0 = EEE0 + EEO0\r\n        const __m256i EE0B = _mm256_add_epi32(EEE0B, EEO0B);\r\n        const __m256i EE1A = _mm256_add_epi32(EEE1A, EEO1A);          // EE1 = EEE1 + EEO1\r\n        const __m256i EE1B = 
_mm256_add_epi32(EEE1B, EEO1B);\r\n        const __m256i EE3A = _mm256_sub_epi32(EEE0A, EEO0A);          // EE3 = EEE0 - EEO0\r\n        const __m256i EE3B = _mm256_sub_epi32(EEE0B, EEO0B);\r\n        const __m256i EE2A = _mm256_sub_epi32(EEE1A, EEO1A);          // EE2 = EEE1 - EEO1\r\n        const __m256i EE2B = _mm256_sub_epi32(EEE1B, EEO1B);\r\n\r\n        const __m256i E0A = _mm256_add_epi32(EE0A, EO0A);          // E0 = EE0 + EO0\r\n        const __m256i E0B = _mm256_add_epi32(EE0B, EO0B);\r\n        const __m256i E1A = _mm256_add_epi32(EE1A, EO1A);          // E1 = EE1 + EO1\r\n        const __m256i E1B = _mm256_add_epi32(EE1B, EO1B);\r\n        const __m256i E2A = _mm256_add_epi32(EE2A, EO2A);          // E2 = EE2 + EO2\r\n        const __m256i E2B = _mm256_add_epi32(EE2B, EO2B);\r\n        const __m256i E3A = _mm256_add_epi32(EE3A, EO3A);          // E3 = EE3 + EO3\r\n        const __m256i E3B = _mm256_add_epi32(EE3B, EO3B);\r\n        const __m256i E7A = _mm256_sub_epi32(EE0A, EO0A);          // E7 = EE0 - EO0\r\n        const __m256i E7B = _mm256_sub_epi32(EE0B, EO0B);\r\n        const __m256i E6A = _mm256_sub_epi32(EE1A, EO1A);          // E6 = EE1 - EO1\r\n        const __m256i E6B = _mm256_sub_epi32(EE1B, EO1B);\r\n        const __m256i E5A = _mm256_sub_epi32(EE2A, EO2A);          // E5 = EE2 - EO2\r\n        const __m256i E5B = _mm256_sub_epi32(EE2B, EO2B);\r\n        const __m256i E4A = _mm256_sub_epi32(EE3A, EO3A);          // E4 = EE3 - EO3\r\n        const __m256i E4B = _mm256_sub_epi32(EE3B, EO3B);\r\n\r\n        const __m256i T10A = _mm256_add_epi32(E0A, c32_rnd);         // E0 + rnd\r\n        const __m256i T10B = _mm256_add_epi32(E0B, c32_rnd);\r\n        const __m256i T11A = _mm256_add_epi32(E1A, c32_rnd);         // E1 + rnd\r\n        const __m256i T11B = _mm256_add_epi32(E1B, c32_rnd);\r\n        const __m256i T12A = _mm256_add_epi32(E2A, c32_rnd);         // E2 + rnd\r\n        const __m256i T12B = _mm256_add_epi32(E2B, c32_rnd);\r\n     
   const __m256i T13A = _mm256_add_epi32(E3A, c32_rnd);         // E3 + rnd\r\n        const __m256i T13B = _mm256_add_epi32(E3B, c32_rnd);\r\n        const __m256i T14A = _mm256_add_epi32(E4A, c32_rnd);         // E4 + rnd\r\n        const __m256i T14B = _mm256_add_epi32(E4B, c32_rnd);\r\n        const __m256i T15A = _mm256_add_epi32(E5A, c32_rnd);         // E5 + rnd\r\n        const __m256i T15B = _mm256_add_epi32(E5B, c32_rnd);\r\n        const __m256i T16A = _mm256_add_epi32(E6A, c32_rnd);         // E6 + rnd\r\n        const __m256i T16B = _mm256_add_epi32(E6B, c32_rnd);\r\n        const __m256i T17A = _mm256_add_epi32(E7A, c32_rnd);         // E7 + rnd\r\n        const __m256i T17B = _mm256_add_epi32(E7B, c32_rnd);\r\n\r\n        const __m256i T20A = _mm256_add_epi32(T10A, O0A);          // E0 + O0 + rnd\r\n        const __m256i T20B = _mm256_add_epi32(T10B, O0B);\r\n        const __m256i T21A = _mm256_add_epi32(T11A, O1A);          // E1 + O1 + rnd\r\n        const __m256i T21B = _mm256_add_epi32(T11B, O1B);\r\n        const __m256i T22A = _mm256_add_epi32(T12A, O2A);          // E2 + O2 + rnd\r\n        const __m256i T22B = _mm256_add_epi32(T12B, O2B);\r\n        const __m256i T23A = _mm256_add_epi32(T13A, O3A);          // E3 + O3 + rnd\r\n        const __m256i T23B = _mm256_add_epi32(T13B, O3B);\r\n        const __m256i T24A = _mm256_add_epi32(T14A, O4A);          // E4\r\n        const __m256i T24B = _mm256_add_epi32(T14B, O4B);\r\n        const __m256i T25A = _mm256_add_epi32(T15A, O5A);          // E5\r\n        const __m256i T25B = _mm256_add_epi32(T15B, O5B);\r\n        const __m256i T26A = _mm256_add_epi32(T16A, O6A);          // E6\r\n        const __m256i T26B = _mm256_add_epi32(T16B, O6B);\r\n        const __m256i T27A = _mm256_add_epi32(T17A, O7A);          // E7\r\n        const __m256i T27B = _mm256_add_epi32(T17B, O7B);\r\n        const __m256i T2FA = _mm256_sub_epi32(T10A, O0A);          // E0 - O0 + rnd\r\n        const __m256i T2FB = 
_mm256_sub_epi32(T10B, O0B);\r\n        const __m256i T2EA = _mm256_sub_epi32(T11A, O1A);          // E1 - O1 + rnd\r\n        const __m256i T2EB = _mm256_sub_epi32(T11B, O1B);\r\n        const __m256i T2DA = _mm256_sub_epi32(T12A, O2A);          // E2 - O2 + rnd\r\n        const __m256i T2DB = _mm256_sub_epi32(T12B, O2B);\r\n        const __m256i T2CA = _mm256_sub_epi32(T13A, O3A);          // E3 - O3 + rnd\r\n        const __m256i T2CB = _mm256_sub_epi32(T13B, O3B);\r\n        const __m256i T2BA = _mm256_sub_epi32(T14A, O4A);          // E4\r\n        const __m256i T2BB = _mm256_sub_epi32(T14B, O4B);\r\n        const __m256i T2AA = _mm256_sub_epi32(T15A, O5A);          // E5\r\n        const __m256i T2AB = _mm256_sub_epi32(T15B, O5B);\r\n        const __m256i T29A = _mm256_sub_epi32(T16A, O6A);          // E6\r\n        const __m256i T29B = _mm256_sub_epi32(T16B, O6B);\r\n        const __m256i T28A = _mm256_sub_epi32(T17A, O7A);          // E7\r\n        const __m256i T28B = _mm256_sub_epi32(T17B, O7B);\r\n\r\n        const __m256i T30A = _mm256_srai_epi32(T20A, nShift);             // [30 20 10 00] // This operation makes it much slower than 128\r\n        const __m256i T30B = _mm256_srai_epi32(T20B, nShift);             // [70 60 50 40] // This operation makes it much slower than 128\r\n        const __m256i T31A = _mm256_srai_epi32(T21A, nShift);             // [31 21 11 01] // This operation makes it much slower than 128\r\n        const __m256i T31B = _mm256_srai_epi32(T21B, nShift);             // [71 61 51 41] // This operation makes it much slower than 128\r\n        const __m256i T32A = _mm256_srai_epi32(T22A, nShift);             // [32 22 12 02] // This operation makes it much slower than 128\r\n        const __m256i T32B = _mm256_srai_epi32(T22B, nShift);             // [72 62 52 42] // This operation makes it much slower than 128\r\n        const __m256i T33A = _mm256_srai_epi32(T23A, nShift);             // [33 23 13 03] // This operation makes it much 
slower than 128\r\n        const __m256i T33B = _mm256_srai_epi32(T23B, nShift);             // [73 63 53 43] // This operation makes it much slower than 128\r\n        const __m256i T34A = _mm256_srai_epi32(T24A, nShift);             // [34 24 14 04] // This operation makes it much slower than 128\r\n        const __m256i T34B = _mm256_srai_epi32(T24B, nShift);             // [74 64 54 44] // This operation makes it much slower than 128\r\n        const __m256i T35A = _mm256_srai_epi32(T25A, nShift);             // [35 25 15 05] // This operation makes it much slower than 128\r\n        const __m256i T35B = _mm256_srai_epi32(T25B, nShift);             // [75 65 55 45] // This operation makes it much slower than 128\r\n        const __m256i T36A = _mm256_srai_epi32(T26A, nShift);             // [36 26 16 06] // This operation makes it much slower than 128\r\n        const __m256i T36B = _mm256_srai_epi32(T26B, nShift);             // [76 66 56 46] // This operation makes it much slower than 128\r\n        const __m256i T37A = _mm256_srai_epi32(T27A, nShift);             // [37 27 17 07] // This operation makes it much slower than 128\r\n        const __m256i T37B = _mm256_srai_epi32(T27B, nShift);             // [77 67 57 47] // This operation makes it much slower than 128\r\n\r\n        const __m256i T38A = _mm256_srai_epi32(T28A, nShift);             // [30 20 10 00] x8 // This operation makes it much slower than 128\r\n        const __m256i T38B = _mm256_srai_epi32(T28B, nShift);             // [70 60 50 40]\r\n        const __m256i T39A = _mm256_srai_epi32(T29A, nShift);             // [31 21 11 01] x9 // This operation makes it much slower than 128\r\n        const __m256i T39B = _mm256_srai_epi32(T29B, nShift);             // [71 61 51 41]\r\n        const __m256i T3AA = _mm256_srai_epi32(T2AA, nShift);             // [32 22 12 02] xA // This operation makes it much slower than 128\r\n        const __m256i T3AB = _mm256_srai_epi32(T2AB, nShift);             // [72 62 52 
42]\r\n        const __m256i T3BA = _mm256_srai_epi32(T2BA, nShift);             // [33 23 13 03] xB // This operation makes it much slower than 128\r\n        const __m256i T3BB = _mm256_srai_epi32(T2BB, nShift);             // [73 63 53 43]\r\n        const __m256i T3CA = _mm256_srai_epi32(T2CA, nShift);             // [34 24 14 04] xC // This operation makes it much slower than 128\r\n        const __m256i T3CB = _mm256_srai_epi32(T2CB, nShift);             // [74 64 54 44]\r\n        const __m256i T3DA = _mm256_srai_epi32(T2DA, nShift);             // [35 25 15 05] xD // This operation makes it much slower than 128\r\n        const __m256i T3DB = _mm256_srai_epi32(T2DB, nShift);             // [75 65 55 45]\r\n        const __m256i T3EA = _mm256_srai_epi32(T2EA, nShift);             // [36 26 16 06] xE // This operation makes it much slower than 128\r\n        const __m256i T3EB = _mm256_srai_epi32(T2EB, nShift);             // [76 66 56 46]\r\n        const __m256i T3FA = _mm256_srai_epi32(T2FA, nShift);             // [37 27 17 07] xF // This operation makes it much slower than 128\r\n        const __m256i T3FB = _mm256_srai_epi32(T2FB, nShift);             // [77 67 57 47]\r\n\r\n        res00 = _mm256_packs_epi32(T30A, T30B);        // [70 60 50 40 30 20 10 00]\r\n        res01 = _mm256_packs_epi32(T31A, T31B);        // [71 61 51 41 31 21 11 01]\r\n        res02 = _mm256_packs_epi32(T32A, T32B);        // [72 62 52 42 32 22 12 02]\r\n        res03 = _mm256_packs_epi32(T33A, T33B);        // [73 63 53 43 33 23 13 03]\r\n        res04 = _mm256_packs_epi32(T34A, T34B);        // [74 64 54 44 34 24 14 04]\r\n        res05 = _mm256_packs_epi32(T35A, T35B);        // [75 65 55 45 35 25 15 05]\r\n        res06 = _mm256_packs_epi32(T36A, T36B);        // [76 66 56 46 36 26 16 06]\r\n        res07 = _mm256_packs_epi32(T37A, T37B);        // [77 67 57 47 37 27 17 07]\r\n\r\n        res08 = _mm256_packs_epi32(T38A, T38B);        // [A0 ... 
80]\r\n        res09 = _mm256_packs_epi32(T39A, T39B);        // [A1 ... 81]\r\n        res10 = _mm256_packs_epi32(T3AA, T3AB);        // [A2 ... 82]\r\n        res11 = _mm256_packs_epi32(T3BA, T3BB);        // [A3 ... 83]\r\n        res12 = _mm256_packs_epi32(T3CA, T3CB);        // [A4 ... 84]\r\n        res13 = _mm256_packs_epi32(T3DA, T3DB);        // [A5 ... 85]\r\n        res14 = _mm256_packs_epi32(T3EA, T3EB);        // [A6 ... 86]\r\n        res15 = _mm256_packs_epi32(T3FA, T3FB);        // [A7 ... 87]\r\n    }\r\n\r\n        //transpose matrix 16x16 16bit.\r\n        {\r\n            __m256i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7, tr0_8, tr0_9, tr0_10, tr0_11, tr0_12, tr0_13, tr0_14, tr0_15;\r\n    #define TRANSPOSE_16x16_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, I11, I12, I13, I14, I15, O0, O1, O2, O3, O4, O5, O6, O7, O8, O9, O10, O11, O12, O13, O14, O15) \\\r\n        tr0_0 = _mm256_unpacklo_epi16(I0, I1); \\\r\n        tr0_1 = _mm256_unpacklo_epi16(I2, I3); \\\r\n        tr0_2 = _mm256_unpacklo_epi16(I4, I5); \\\r\n        tr0_3 = _mm256_unpacklo_epi16(I6, I7); \\\r\n        tr0_4 = _mm256_unpacklo_epi16(I8, I9); \\\r\n        tr0_5 = _mm256_unpacklo_epi16(I10, I11); \\\r\n        tr0_6 = _mm256_unpacklo_epi16(I12, I13); \\\r\n        tr0_7 = _mm256_unpacklo_epi16(I14, I15); \\\r\n        tr0_8 = _mm256_unpackhi_epi16(I0, I1); \\\r\n        tr0_9 = _mm256_unpackhi_epi16(I2, I3); \\\r\n        tr0_10 = _mm256_unpackhi_epi16(I4, I5); \\\r\n        tr0_11 = _mm256_unpackhi_epi16(I6, I7); \\\r\n        tr0_12 = _mm256_unpackhi_epi16(I8, I9); \\\r\n        tr0_13 = _mm256_unpackhi_epi16(I10, I11); \\\r\n        tr0_14 = _mm256_unpackhi_epi16(I12, I13); \\\r\n        tr0_15 = _mm256_unpackhi_epi16(I14, I15); \\\r\n        O0 = _mm256_unpacklo_epi32(tr0_0, tr0_1); \\\r\n        O1 = _mm256_unpacklo_epi32(tr0_2, tr0_3); \\\r\n        O2 = _mm256_unpacklo_epi32(tr0_4, tr0_5); \\\r\n        O3 = _mm256_unpacklo_epi32(tr0_6, tr0_7); \\\r\n  
      O4 = _mm256_unpackhi_epi32(tr0_0, tr0_1); \\\r\n        O5 = _mm256_unpackhi_epi32(tr0_2, tr0_3); \\\r\n        O6 = _mm256_unpackhi_epi32(tr0_4, tr0_5); \\\r\n        O7 = _mm256_unpackhi_epi32(tr0_6, tr0_7); \\\r\n        O8 = _mm256_unpacklo_epi32(tr0_8, tr0_9); \\\r\n        O9 = _mm256_unpacklo_epi32(tr0_10, tr0_11); \\\r\n        O10 = _mm256_unpacklo_epi32(tr0_12, tr0_13); \\\r\n        O11 = _mm256_unpacklo_epi32(tr0_14, tr0_15); \\\r\n        O12 = _mm256_unpackhi_epi32(tr0_8, tr0_9); \\\r\n        O13 = _mm256_unpackhi_epi32(tr0_10, tr0_11); \\\r\n        O14 = _mm256_unpackhi_epi32(tr0_12, tr0_13); \\\r\n        O15 = _mm256_unpackhi_epi32(tr0_14, tr0_15); \\\r\n        tr0_0 = _mm256_unpacklo_epi64(O0, O1); \\\r\n        tr0_1 = _mm256_unpacklo_epi64(O2, O3); \\\r\n        tr0_2 = _mm256_unpackhi_epi64(O0, O1); \\\r\n        tr0_3 = _mm256_unpackhi_epi64(O2, O3); \\\r\n        tr0_4 = _mm256_unpacklo_epi64(O4, O5); \\\r\n        tr0_5 = _mm256_unpacklo_epi64(O6, O7); \\\r\n        tr0_6 = _mm256_unpackhi_epi64(O4, O5); \\\r\n        tr0_7 = _mm256_unpackhi_epi64(O6, O7); \\\r\n        tr0_8 = _mm256_unpacklo_epi64(O8, O9); \\\r\n        tr0_9 = _mm256_unpacklo_epi64(O10, O11); \\\r\n        tr0_10 = _mm256_unpackhi_epi64(O8, O9); \\\r\n        tr0_11 = _mm256_unpackhi_epi64(O10, O11); \\\r\n        tr0_12 = _mm256_unpacklo_epi64(O12, O13); \\\r\n        tr0_13 = _mm256_unpacklo_epi64(O14, O15); \\\r\n        tr0_14 = _mm256_unpackhi_epi64(O12, O13); \\\r\n        tr0_15 = _mm256_unpackhi_epi64(O14, O15); \\\r\n        O0 = _mm256_permute2x128_si256(tr0_0, tr0_1, 0x20); \\\r\n        O1 = _mm256_permute2x128_si256(tr0_2, tr0_3, 0x20); \\\r\n        O2 = _mm256_permute2x128_si256(tr0_4, tr0_5, 0x20); \\\r\n        O3 = _mm256_permute2x128_si256(tr0_6, tr0_7, 0x20); \\\r\n        O4 = _mm256_permute2x128_si256(tr0_8, tr0_9, 0x20); \\\r\n        O5 = _mm256_permute2x128_si256(tr0_10, tr0_11, 0x20); \\\r\n        O6 = _mm256_permute2x128_si256(tr0_12, 
tr0_13, 0x20); \\\r\n        O7 = _mm256_permute2x128_si256(tr0_14, tr0_15, 0x20); \\\r\n        O8 = _mm256_permute2x128_si256(tr0_0, tr0_1, 0x31); \\\r\n        O9 = _mm256_permute2x128_si256(tr0_2, tr0_3, 0x31); \\\r\n        O10 = _mm256_permute2x128_si256(tr0_4, tr0_5, 0x31); \\\r\n        O11 = _mm256_permute2x128_si256(tr0_6, tr0_7, 0x31); \\\r\n        O12 = _mm256_permute2x128_si256(tr0_8, tr0_9, 0x31); \\\r\n        O13 = _mm256_permute2x128_si256(tr0_10, tr0_11, 0x31); \\\r\n        O14 = _mm256_permute2x128_si256(tr0_12, tr0_13, 0x31); \\\r\n        O15 = _mm256_permute2x128_si256(tr0_14, tr0_15, 0x31); \\\r\n\r\n            TRANSPOSE_16x16_16BIT(res00, res01, res02, res03, res04, res05, res06, res07, res08, res09, res10, res11, res12, res13, res14, res15, in00, in01, in02, in03, in04, in05, in06, in07, in08, in09, in10, in11, in12, in13, in14, in15)\r\n    #undef TRANSPOSE_16x16_16BIT\r\n        }\r\n\r\n        nShift = shift;\r\n        c32_rnd = _mm256_set1_epi32(shift ? 
(1 << (shift - 1)) : 0);                // pass == 1: second pass\r\n    }\r\n\r\n    // clip\r\n    max_val = _mm256_set1_epi16((1 << (clip - 1)) - 1);\r\n    min_val = _mm256_set1_epi16(-(1 << (clip - 1)));\r\n\r\n    in00 = _mm256_max_epi16(_mm256_min_epi16(in00, max_val), min_val);\r\n    in01 = _mm256_max_epi16(_mm256_min_epi16(in01, max_val), min_val);\r\n    in02 = _mm256_max_epi16(_mm256_min_epi16(in02, max_val), min_val);\r\n    in03 = _mm256_max_epi16(_mm256_min_epi16(in03, max_val), min_val);\r\n    in04 = _mm256_max_epi16(_mm256_min_epi16(in04, max_val), min_val);\r\n    in05 = _mm256_max_epi16(_mm256_min_epi16(in05, max_val), min_val);\r\n    in06 = _mm256_max_epi16(_mm256_min_epi16(in06, max_val), min_val);\r\n    in07 = _mm256_max_epi16(_mm256_min_epi16(in07, max_val), min_val);\r\n    in08 = _mm256_max_epi16(_mm256_min_epi16(in08, max_val), min_val);\r\n    in09 = _mm256_max_epi16(_mm256_min_epi16(in09, max_val), min_val);\r\n    in10 = _mm256_max_epi16(_mm256_min_epi16(in10, max_val), min_val);\r\n    in11 = _mm256_max_epi16(_mm256_min_epi16(in11, max_val), min_val);\r\n    in12 = _mm256_max_epi16(_mm256_min_epi16(in12, max_val), min_val);\r\n    in13 = _mm256_max_epi16(_mm256_min_epi16(in13, max_val), min_val);\r\n    in14 = _mm256_max_epi16(_mm256_min_epi16(in14, max_val), min_val);\r\n    in15 = _mm256_max_epi16(_mm256_min_epi16(in15, max_val), min_val);\r\n\r\n    // store\r\n    _mm256_storeu_si256((__m256i*)&dst[0 * 16 + 0], in00);\r\n    _mm256_storeu_si256((__m256i*)&dst[1 * 16 + 0], in01);\r\n    _mm256_storeu_si256((__m256i*)&dst[2 * 16 + 0], in02);\r\n    _mm256_storeu_si256((__m256i*)&dst[3 * 16 + 0], in03);\r\n    _mm256_storeu_si256((__m256i*)&dst[4 * 16 + 0], in04);\r\n    _mm256_storeu_si256((__m256i*)&dst[5 * 16 + 0], in05);\r\n    _mm256_storeu_si256((__m256i*)&dst[6 * 16 + 0], in06);\r\n    _mm256_storeu_si256((__m256i*)&dst[7 * 16 + 0], in07);\r\n    _mm256_storeu_si256((__m256i*)&dst[8 * 16 + 0], in08);\r\n    
_mm256_storeu_si256((__m256i*)&dst[9 * 16 + 0], in09);\r\n    _mm256_storeu_si256((__m256i*)&dst[10 * 16 + 0], in10);\r\n    _mm256_storeu_si256((__m256i*)&dst[11 * 16 + 0], in11);\r\n    _mm256_storeu_si256((__m256i*)&dst[12 * 16 + 0], in12);\r\n    _mm256_storeu_si256((__m256i*)&dst[13 * 16 + 0], in13);\r\n    _mm256_storeu_si256((__m256i*)&dst[14 * 16 + 0], in14);\r\n    _mm256_storeu_si256((__m256i*)&dst[15 * 16 + 0], in15);\r\n}\r\n\r\n\r\nvoid idct_32x32_avx2(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    int shift = 20 - g_bit_depth - (i_dst & 0x01);\r\n    int clip = g_bit_depth + 1 + (i_dst & 0x01);\r\n    int k, i;\r\n    __m256i max_val, min_val;\r\n    __m256i EEO0A, EEO1A, EEO2A, EEO3A, EEO0B, EEO1B, EEO2B, EEO3B;\r\n    __m256i EEEO0A, EEEO0B, EEEO1A, EEEO1B;\r\n    __m256i EEEE0A, EEEE0B, EEEE1A, EEEE1B;\r\n    __m256i EEE0A, EEE0B, EEE1A, EEE1B, EEE3A, EEE3B, EEE2A, EEE2B;\r\n    __m256i EE0A, EE0B, EE1A, EE1B, EE2A, EE2B, EE3A, EE3B, EE7A, EE7B, EE6A, EE6B, EE5A, EE5B, EE4A, EE4B;\r\n    __m256i E0A, E0B, E1A, E1B, E2A, E2B, E3A, E3B, E4A, E4B, E5A, E5B, E6A, E6B, E7A, E7B, EFA, EFB, EEA, EEB, EDA, EDB, ECA, ECB, EBA, EBB, EAA, EAB, E9A, E9B, E8A, E8B;\r\n    __m256i T10A, T10B, T11A, T11B, T12A, T12B, T13A, T13B, T14A, T14B, T15A, T15B, T16A, T16B, T17A, T17B, T18A, T18B, T19A, T19B, T1AA, T1AB, T1BA, T1BB, T1CA, T1CB, T1DA, T1DB, T1EA, T1EB, T1FA, T1FB;\r\n    __m256i T2_00A, T2_00B, T2_01A, T2_01B, T2_02A, T2_02B, T2_03A, T2_03B, T2_04A, T2_04B, T2_05A, T2_05B, T2_06A, T2_06B, T2_07A, T2_07B, T2_08A, T2_08B, T2_09A, T2_09B, T2_10A, T2_10B, T2_11A, T2_11B, T2_12A, T2_12B, T2_13A, T2_13B, T2_14A, T2_14B, T2_15A, T2_15B, T2_31A, T2_31B, T2_30A, T2_30B, T2_29A, T2_29B, T2_28A, T2_28B, T2_27A, T2_27B, T2_26A, T2_26B, T2_25A, T2_25B, T2_24A, T2_24B, T2_23A, T2_23B, T2_22A, T2_22B, T2_21A, T2_21B, T2_20A, T2_20B, T2_19A, T2_19B, T2_18A, T2_18B, T2_17A, T2_17B, T2_16A, T2_16B;\r\n    __m256i T3_00A, T3_00B, T3_01A, T3_01B, T3_02A, T3_02B, 
T3_03A, T3_03B, T3_04A, T3_04B, T3_05A, T3_05B, T3_06A, T3_06B, T3_07A, T3_07B, T3_08A, T3_08B, T3_09A, T3_09B, T3_10A, T3_10B, T3_11A, T3_11B, T3_12A, T3_12B, T3_13A, T3_13B, T3_14A, T3_14B, T3_15A, T3_15B;\r\n    __m256i T3_16A, T3_16B, T3_17A, T3_17B, T3_18A, T3_18B, T3_19A, T3_19B, T3_20A, T3_20B, T3_21A, T3_21B, T3_22A, T3_22B, T3_23A, T3_23B, T3_24A, T3_24B, T3_25A, T3_25B, T3_26A, T3_26B, T3_27A, T3_27B, T3_28A, T3_28B, T3_29A, T3_29B, T3_30A, T3_30B, T3_31A, T3_31B;\r\n    const __m256i c16_p45_p45 = _mm256_set1_epi32(0x002D002D);\r\n    const __m256i c16_p43_p44 = _mm256_set1_epi32(0x002B002C);\r\n    const __m256i c16_p39_p41 = _mm256_set1_epi32(0x00270029);\r\n    const __m256i c16_p34_p36 = _mm256_set1_epi32(0x00220024);\r\n    const __m256i c16_p27_p30 = _mm256_set1_epi32(0x001B001E);\r\n    const __m256i c16_p19_p23 = _mm256_set1_epi32(0x00130017);\r\n    const __m256i c16_p11_p15 = _mm256_set1_epi32(0x000B000F);\r\n    const __m256i c16_p02_p07 = _mm256_set1_epi32(0x00020007);\r\n    const __m256i c16_p41_p45 = _mm256_set1_epi32(0x0029002D);\r\n    const __m256i c16_p23_p34 = _mm256_set1_epi32(0x00170022);\r\n    const __m256i c16_n02_p11 = _mm256_set1_epi32(0xFFFE000B);\r\n    const __m256i c16_n27_n15 = _mm256_set1_epi32(0xFFE5FFF1);\r\n    const __m256i c16_n43_n36 = _mm256_set1_epi32(0xFFD5FFDC);\r\n    const __m256i c16_n44_n45 = _mm256_set1_epi32(0xFFD4FFD3);\r\n    const __m256i c16_n30_n39 = _mm256_set1_epi32(0xFFE2FFD9);\r\n    const __m256i c16_n07_n19 = _mm256_set1_epi32(0xFFF9FFED);\r\n    const __m256i c16_p34_p44 = _mm256_set1_epi32(0x0022002C);\r\n    const __m256i c16_n07_p15 = _mm256_set1_epi32(0xFFF9000F);\r\n    const __m256i c16_n41_n27 = _mm256_set1_epi32(0xFFD7FFE5);\r\n    const __m256i c16_n39_n45 = _mm256_set1_epi32(0xFFD9FFD3);\r\n    const __m256i c16_n02_n23 = _mm256_set1_epi32(0xFFFEFFE9);\r\n    const __m256i c16_p36_p19 = _mm256_set1_epi32(0x00240013);\r\n    const __m256i c16_p43_p45 = 
_mm256_set1_epi32(0x002B002D);\r\n    const __m256i c16_p11_p30 = _mm256_set1_epi32(0x000B001E);\r\n    const __m256i c16_p23_p43 = _mm256_set1_epi32(0x0017002B);\r\n    const __m256i c16_n34_n07 = _mm256_set1_epi32(0xFFDEFFF9);\r\n    const __m256i c16_n36_n45 = _mm256_set1_epi32(0xFFDCFFD3);\r\n    const __m256i c16_p19_n11 = _mm256_set1_epi32(0x0013FFF5);\r\n    const __m256i c16_p44_p41 = _mm256_set1_epi32(0x002C0029);\r\n    const __m256i c16_n02_p27 = _mm256_set1_epi32(0xFFFE001B);\r\n    const __m256i c16_n45_n30 = _mm256_set1_epi32(0xFFD3FFE2);\r\n    const __m256i c16_n15_n39 = _mm256_set1_epi32(0xFFF1FFD9);\r\n    const __m256i c16_p11_p41 = _mm256_set1_epi32(0x000B0029);\r\n    const __m256i c16_n45_n27 = _mm256_set1_epi32(0xFFD3FFE5);\r\n    const __m256i c16_p07_n30 = _mm256_set1_epi32(0x0007FFE2);\r\n    const __m256i c16_p43_p39 = _mm256_set1_epi32(0x002B0027);\r\n    const __m256i c16_n23_p15 = _mm256_set1_epi32(0xFFE9000F);\r\n    const __m256i c16_n34_n45 = _mm256_set1_epi32(0xFFDEFFD3);\r\n    const __m256i c16_p36_p02 = _mm256_set1_epi32(0x00240002);\r\n    const __m256i c16_p19_p44 = _mm256_set1_epi32(0x0013002C);\r\n    const __m256i c16_n02_p39 = _mm256_set1_epi32(0xFFFE0027);\r\n    const __m256i c16_n36_n41 = _mm256_set1_epi32(0xFFDCFFD7);\r\n    const __m256i c16_p43_p07 = _mm256_set1_epi32(0x002B0007);\r\n    const __m256i c16_n11_p34 = _mm256_set1_epi32(0xFFF50022);\r\n    const __m256i c16_n30_n44 = _mm256_set1_epi32(0xFFE2FFD4);\r\n    const __m256i c16_p45_p15 = _mm256_set1_epi32(0x002D000F);\r\n    const __m256i c16_n19_p27 = _mm256_set1_epi32(0xFFED001B);\r\n    const __m256i c16_n23_n45 = _mm256_set1_epi32(0xFFE9FFD3);\r\n    const __m256i c16_n15_p36 = _mm256_set1_epi32(0xFFF10024);\r\n    const __m256i c16_n11_n45 = _mm256_set1_epi32(0xFFF5FFD3);\r\n    const __m256i c16_p34_p39 = _mm256_set1_epi32(0x00220027);\r\n    const __m256i c16_n45_n19 = _mm256_set1_epi32(0xFFD3FFED);\r\n    const __m256i c16_p41_n07 = 
_mm256_set1_epi32(0x0029FFF9);\r\n    const __m256i c16_n23_p30 = _mm256_set1_epi32(0xFFE9001E);\r\n    const __m256i c16_n02_n44 = _mm256_set1_epi32(0xFFFEFFD4);\r\n    const __m256i c16_p27_p43 = _mm256_set1_epi32(0x001B002B);\r\n    const __m256i c16_n27_p34 = _mm256_set1_epi32(0xFFE50022);\r\n    const __m256i c16_p19_n39 = _mm256_set1_epi32(0x0013FFD9);\r\n    const __m256i c16_n11_p43 = _mm256_set1_epi32(0xFFF5002B);\r\n    const __m256i c16_p02_n45 = _mm256_set1_epi32(0x0002FFD3);\r\n    const __m256i c16_p07_p45 = _mm256_set1_epi32(0x0007002D);\r\n    const __m256i c16_n15_n44 = _mm256_set1_epi32(0xFFF1FFD4);\r\n    const __m256i c16_p23_p41 = _mm256_set1_epi32(0x00170029);\r\n    const __m256i c16_n30_n36 = _mm256_set1_epi32(0xFFE2FFDC);\r\n    const __m256i c16_n36_p30 = _mm256_set1_epi32(0xFFDC001E);\r\n    const __m256i c16_p41_n23 = _mm256_set1_epi32(0x0029FFE9);\r\n    const __m256i c16_n44_p15 = _mm256_set1_epi32(0xFFD4000F);\r\n    const __m256i c16_p45_n07 = _mm256_set1_epi32(0x002DFFF9);\r\n    const __m256i c16_n45_n02 = _mm256_set1_epi32(0xFFD3FFFE);\r\n    const __m256i c16_p43_p11 = _mm256_set1_epi32(0x002B000B);\r\n    const __m256i c16_n39_n19 = _mm256_set1_epi32(0xFFD9FFED);\r\n    const __m256i c16_p34_p27 = _mm256_set1_epi32(0x0022001B);\r\n    const __m256i c16_n43_p27 = _mm256_set1_epi32(0xFFD5001B);\r\n    const __m256i c16_p44_n02 = _mm256_set1_epi32(0x002CFFFE);\r\n    const __m256i c16_n30_n23 = _mm256_set1_epi32(0xFFE2FFE9);\r\n    const __m256i c16_p07_p41 = _mm256_set1_epi32(0x00070029);\r\n    const __m256i c16_p19_n45 = _mm256_set1_epi32(0x0013FFD3);\r\n    const __m256i c16_n39_p34 = _mm256_set1_epi32(0xFFD90022);\r\n    const __m256i c16_p45_n11 = _mm256_set1_epi32(0x002DFFF5);\r\n    const __m256i c16_n36_n15 = _mm256_set1_epi32(0xFFDCFFF1);\r\n    const __m256i c16_n45_p23 = _mm256_set1_epi32(0xFFD30017);\r\n    const __m256i c16_p27_p19 = _mm256_set1_epi32(0x001B0013);\r\n    const __m256i c16_p15_n45 = 
_mm256_set1_epi32(0x000FFFD3);\r\n    const __m256i c16_n44_p30 = _mm256_set1_epi32(0xFFD4001E);\r\n    const __m256i c16_p34_p11 = _mm256_set1_epi32(0x0022000B);\r\n    const __m256i c16_p07_n43 = _mm256_set1_epi32(0x0007FFD5);\r\n    const __m256i c16_n41_p36 = _mm256_set1_epi32(0xFFD70024);\r\n    const __m256i c16_p39_p02 = _mm256_set1_epi32(0x00270002);\r\n    const __m256i c16_n44_p19 = _mm256_set1_epi32(0xFFD40013);\r\n    const __m256i c16_n02_p36 = _mm256_set1_epi32(0xFFFE0024);\r\n    const __m256i c16_p45_n34 = _mm256_set1_epi32(0x002DFFDE);\r\n    const __m256i c16_n15_n23 = _mm256_set1_epi32(0xFFF1FFE9);\r\n    const __m256i c16_n39_p43 = _mm256_set1_epi32(0xFFD9002B);\r\n    const __m256i c16_p30_p07 = _mm256_set1_epi32(0x001E0007);\r\n    const __m256i c16_p27_n45 = _mm256_set1_epi32(0x001BFFD3);\r\n    const __m256i c16_n41_p11 = _mm256_set1_epi32(0xFFD7000B);\r\n    const __m256i c16_n39_p15 = _mm256_set1_epi32(0xFFD9000F);\r\n    const __m256i c16_n30_p45 = _mm256_set1_epi32(0xFFE2002D);\r\n    const __m256i c16_p27_p02 = _mm256_set1_epi32(0x001B0002);\r\n    const __m256i c16_p41_n44 = _mm256_set1_epi32(0x0029FFD4);\r\n    const __m256i c16_n11_n19 = _mm256_set1_epi32(0xFFF5FFED);\r\n    const __m256i c16_n45_p36 = _mm256_set1_epi32(0xFFD30024);\r\n    const __m256i c16_n07_p34 = _mm256_set1_epi32(0xFFF90022);\r\n    const __m256i c16_p43_n23 = _mm256_set1_epi32(0x002BFFE9);\r\n    const __m256i c16_n30_p11 = _mm256_set1_epi32(0xFFE2000B);\r\n    const __m256i c16_n45_p43 = _mm256_set1_epi32(0xFFD3002B);\r\n    const __m256i c16_n19_p36 = _mm256_set1_epi32(0xFFED0024);\r\n    const __m256i c16_p23_n02 = _mm256_set1_epi32(0x0017FFFE);\r\n    const __m256i c16_p45_n39 = _mm256_set1_epi32(0x002DFFD9);\r\n    const __m256i c16_p27_n41 = _mm256_set1_epi32(0x001BFFD7);\r\n    const __m256i c16_n15_n07 = _mm256_set1_epi32(0xFFF1FFF9);\r\n    const __m256i c16_n44_p34 = _mm256_set1_epi32(0xFFD40022);\r\n    const __m256i c16_n19_p07 = 
_mm256_set1_epi32(0xFFED0007);\r\n    const __m256i c16_n39_p30 = _mm256_set1_epi32(0xFFD9001E);\r\n    const __m256i c16_n45_p44 = _mm256_set1_epi32(0xFFD3002C);\r\n    const __m256i c16_n36_p43 = _mm256_set1_epi32(0xFFDC002B);\r\n    const __m256i c16_n15_p27 = _mm256_set1_epi32(0xFFF1001B);\r\n    const __m256i c16_p11_p02 = _mm256_set1_epi32(0x000B0002);\r\n    const __m256i c16_p34_n23 = _mm256_set1_epi32(0x0022FFE9);\r\n    const __m256i c16_p45_n41 = _mm256_set1_epi32(0x002DFFD7);\r\n    const __m256i c16_n07_p02 = _mm256_set1_epi32(0xFFF90002);\r\n    const __m256i c16_n15_p11 = _mm256_set1_epi32(0xFFF1000B);\r\n    const __m256i c16_n23_p19 = _mm256_set1_epi32(0xFFE90013);\r\n    const __m256i c16_n30_p27 = _mm256_set1_epi32(0xFFE2001B);\r\n    const __m256i c16_n36_p34 = _mm256_set1_epi32(0xFFDC0022);\r\n    const __m256i c16_n41_p39 = _mm256_set1_epi32(0xFFD70027);\r\n    const __m256i c16_n44_p43 = _mm256_set1_epi32(0xFFD4002B);\r\n    const __m256i c16_n45_p45 = _mm256_set1_epi32(0xFFD3002D);\r\n\r\n    //  const __m256i c16_p43_p45 = _mm256_set1_epi32(0x002B002D);\r\n    const __m256i c16_p35_p40 = _mm256_set1_epi32(0x00230028);\r\n    const __m256i c16_p21_p29 = _mm256_set1_epi32(0x0015001D);\r\n    const __m256i c16_p04_p13 = _mm256_set1_epi32(0x0004000D);\r\n    const __m256i c16_p29_p43 = _mm256_set1_epi32(0x001D002B);\r\n    const __m256i c16_n21_p04 = _mm256_set1_epi32(0xFFEB0004);\r\n    const __m256i c16_n45_n40 = _mm256_set1_epi32(0xFFD3FFD8);\r\n    const __m256i c16_n13_n35 = _mm256_set1_epi32(0xFFF3FFDD);\r\n    const __m256i c16_p04_p40 = _mm256_set1_epi32(0x00040028);\r\n    const __m256i c16_n43_n35 = _mm256_set1_epi32(0xFFD5FFDD);\r\n    const __m256i c16_p29_n13 = _mm256_set1_epi32(0x001DFFF3);\r\n    const __m256i c16_p21_p45 = _mm256_set1_epi32(0x0015002D);\r\n    const __m256i c16_n21_p35 = _mm256_set1_epi32(0xFFEB0023);\r\n    const __m256i c16_p04_n43 = _mm256_set1_epi32(0x0004FFD5);\r\n    const __m256i c16_p13_p45 = 
_mm256_set1_epi32(0x000D002D);\r\n    const __m256i c16_n29_n40 = _mm256_set1_epi32(0xFFE3FFD8);\r\n    const __m256i c16_n40_p29 = _mm256_set1_epi32(0xFFD8001D);\r\n    const __m256i c16_p45_n13 = _mm256_set1_epi32(0x002DFFF3);\r\n    const __m256i c16_n43_n04 = _mm256_set1_epi32(0xFFD5FFFC);\r\n    const __m256i c16_p35_p21 = _mm256_set1_epi32(0x00230015);\r\n    const __m256i c16_n45_p21 = _mm256_set1_epi32(0xFFD30015);\r\n    const __m256i c16_p13_p29 = _mm256_set1_epi32(0x000D001D);\r\n    const __m256i c16_p35_n43 = _mm256_set1_epi32(0x0023FFD5);\r\n    const __m256i c16_n40_p04 = _mm256_set1_epi32(0xFFD80004);\r\n    const __m256i c16_n35_p13 = _mm256_set1_epi32(0xFFDD000D);\r\n    const __m256i c16_n40_p45 = _mm256_set1_epi32(0xFFD8002D);\r\n    const __m256i c16_p04_p21 = _mm256_set1_epi32(0x00040015);\r\n    const __m256i c16_p43_n29 = _mm256_set1_epi32(0x002BFFE3);\r\n    const __m256i c16_n13_p04 = _mm256_set1_epi32(0xFFF30004);\r\n    const __m256i c16_n29_p21 = _mm256_set1_epi32(0xFFE30015);\r\n    const __m256i c16_n40_p35 = _mm256_set1_epi32(0xFFD80023);\r\n    //const __m256i c16_n45_p43 = _mm256_set1_epi32(0xFFD3002B);\r\n\r\n    const __m256i c16_p38_p44 = _mm256_set1_epi32(0x0026002C);\r\n    const __m256i c16_p09_p25 = _mm256_set1_epi32(0x00090019);\r\n    const __m256i c16_n09_p38 = _mm256_set1_epi32(0xFFF70026);\r\n    const __m256i c16_n25_n44 = _mm256_set1_epi32(0xFFE7FFD4);\r\n\r\n    const __m256i c16_n44_p25 = _mm256_set1_epi32(0xFFD40019);\r\n    const __m256i c16_p38_p09 = _mm256_set1_epi32(0x00260009);\r\n    const __m256i c16_n25_p09 = _mm256_set1_epi32(0xFFE70009);\r\n    const __m256i c16_n44_p38 = _mm256_set1_epi32(0xFFD40026);\r\n\r\n    const __m256i c16_p17_p42 = _mm256_set1_epi32(0x0011002A);\r\n    const __m256i c16_n42_p17 = _mm256_set1_epi32(0xFFD60011);\r\n\r\n    const __m256i c16_p32_p32 = _mm256_set1_epi32(0x00200020);\r\n    const __m256i c16_n32_p32 = _mm256_set1_epi32(0xFFE00020);\r\n\r\n    __m256i c32_rnd = 
_mm256_set1_epi32(16);\r\n    int nShift = 5;\r\n\r\n    // DCT1\r\n    __m256i in00[2], in01[2], in02[2], in03[2], in04[2], in05[2], in06[2], in07[2], in08[2], in09[2], in10[2], in11[2], in12[2], in13[2], in14[2], in15[2];\r\n    __m256i in16[2], in17[2], in18[2], in19[2], in20[2], in21[2], in22[2], in23[2], in24[2], in25[2], in26[2], in27[2], in28[2], in29[2], in30[2], in31[2];\r\n    __m256i res00[2], res01[2], res02[2], res03[2], res04[2], res05[2], res06[2], res07[2], res08[2], res09[2], res10[2], res11[2], res12[2], res13[2], res14[2], res15[2];\r\n    __m256i res16[2], res17[2], res18[2], res19[2], res20[2], res21[2], res22[2], res23[2], res24[2], res25[2], res26[2], res27[2], res28[2], res29[2], res30[2], res31[2];\r\n\r\n    int pass, part;\r\n\r\n    UNUSED_PARAMETER(i_dst);\r\n\r\n    for (i = 0; i < 2; i++) {\r\n        const int offset = (i << 4);\r\n        in00[i] = _mm256_lddqu_si256((const __m256i*)&src[0 * 32 + offset]);\r\n        in01[i] = _mm256_lddqu_si256((const __m256i*)&src[1 * 32 + offset]);\r\n        in02[i] = _mm256_lddqu_si256((const __m256i*)&src[2 * 32 + offset]);\r\n        in03[i] = _mm256_lddqu_si256((const __m256i*)&src[3 * 32 + offset]);\r\n        in04[i] = _mm256_lddqu_si256((const __m256i*)&src[4 * 32 + offset]);\r\n        in05[i] = _mm256_lddqu_si256((const __m256i*)&src[5 * 32 + offset]);\r\n        in06[i] = _mm256_lddqu_si256((const __m256i*)&src[6 * 32 + offset]);\r\n        in07[i] = _mm256_lddqu_si256((const __m256i*)&src[7 * 32 + offset]);\r\n        in08[i] = _mm256_lddqu_si256((const __m256i*)&src[8 * 32 + offset]);\r\n        in09[i] = _mm256_lddqu_si256((const __m256i*)&src[9 * 32 + offset]);\r\n        in10[i] = _mm256_lddqu_si256((const __m256i*)&src[10 * 32 + offset]);\r\n        in11[i] = _mm256_lddqu_si256((const __m256i*)&src[11 * 32 + offset]);\r\n        in12[i] = _mm256_lddqu_si256((const __m256i*)&src[12 * 32 + offset]);\r\n        in13[i] = _mm256_lddqu_si256((const __m256i*)&src[13 * 32 + 
offset]);\r\n        in14[i] = _mm256_lddqu_si256((const __m256i*)&src[14 * 32 + offset]);\r\n        in15[i] = _mm256_lddqu_si256((const __m256i*)&src[15 * 32 + offset]);\r\n        in16[i] = _mm256_lddqu_si256((const __m256i*)&src[16 * 32 + offset]);\r\n        in17[i] = _mm256_lddqu_si256((const __m256i*)&src[17 * 32 + offset]);\r\n        in18[i] = _mm256_lddqu_si256((const __m256i*)&src[18 * 32 + offset]);\r\n        in19[i] = _mm256_lddqu_si256((const __m256i*)&src[19 * 32 + offset]);\r\n        in20[i] = _mm256_lddqu_si256((const __m256i*)&src[20 * 32 + offset]);\r\n        in21[i] = _mm256_lddqu_si256((const __m256i*)&src[21 * 32 + offset]);\r\n        in22[i] = _mm256_lddqu_si256((const __m256i*)&src[22 * 32 + offset]);\r\n        in23[i] = _mm256_lddqu_si256((const __m256i*)&src[23 * 32 + offset]);\r\n        in24[i] = _mm256_lddqu_si256((const __m256i*)&src[24 * 32 + offset]);\r\n        in25[i] = _mm256_lddqu_si256((const __m256i*)&src[25 * 32 + offset]);\r\n        in26[i] = _mm256_lddqu_si256((const __m256i*)&src[26 * 32 + offset]);\r\n        in27[i] = _mm256_lddqu_si256((const __m256i*)&src[27 * 32 + offset]);\r\n        in28[i] = _mm256_lddqu_si256((const __m256i*)&src[28 * 32 + offset]);\r\n        in29[i] = _mm256_lddqu_si256((const __m256i*)&src[29 * 32 + offset]);\r\n        in30[i] = _mm256_lddqu_si256((const __m256i*)&src[30 * 32 + offset]);\r\n        in31[i] = _mm256_lddqu_si256((const __m256i*)&src[31 * 32 + offset]);\r\n    }\r\n\r\n    for (pass = 0; pass < 2; pass++) {\r\n        for (part = 0; part < 2; part++) {\r\n            const __m256i T_00_00A = _mm256_unpacklo_epi16(in01[part], in03[part]);       // [33 13 32 12 31 11 30 10]\r\n            const __m256i T_00_00B = _mm256_unpackhi_epi16(in01[part], in03[part]);       // [37 17 36 16 35 15 34 14]\r\n            const __m256i T_00_01A = _mm256_unpacklo_epi16(in05[part], in07[part]);       // [ ]\r\n            const __m256i T_00_01B = _mm256_unpackhi_epi16(in05[part], in07[part]); 
      // [ ]\r\n            const __m256i T_00_02A = _mm256_unpacklo_epi16(in09[part], in11[part]);       // [ ]\r\n            const __m256i T_00_02B = _mm256_unpackhi_epi16(in09[part], in11[part]);       // [ ]\r\n            const __m256i T_00_03A = _mm256_unpacklo_epi16(in13[part], in15[part]);       // [ ]\r\n            const __m256i T_00_03B = _mm256_unpackhi_epi16(in13[part], in15[part]);       // [ ]\r\n            const __m256i T_00_04A = _mm256_unpacklo_epi16(in17[part], in19[part]);       // [ ]\r\n            const __m256i T_00_04B = _mm256_unpackhi_epi16(in17[part], in19[part]);       // [ ]\r\n            const __m256i T_00_05A = _mm256_unpacklo_epi16(in21[part], in23[part]);       // [ ]\r\n            const __m256i T_00_05B = _mm256_unpackhi_epi16(in21[part], in23[part]);       // [ ]\r\n            const __m256i T_00_06A = _mm256_unpacklo_epi16(in25[part], in27[part]);       // [ ]\r\n            const __m256i T_00_06B = _mm256_unpackhi_epi16(in25[part], in27[part]);       // [ ]\r\n            const __m256i T_00_07A = _mm256_unpacklo_epi16(in29[part], in31[part]);       //\r\n            const __m256i T_00_07B = _mm256_unpackhi_epi16(in29[part], in31[part]);       // [ ]\r\n\r\n            const __m256i T_00_08A = _mm256_unpacklo_epi16(in02[part], in06[part]);       // [ ]\r\n            const __m256i T_00_08B = _mm256_unpackhi_epi16(in02[part], in06[part]);       // [ ]\r\n            const __m256i T_00_09A = _mm256_unpacklo_epi16(in10[part], in14[part]);       // [ ]\r\n            const __m256i T_00_09B = _mm256_unpackhi_epi16(in10[part], in14[part]);       // [ ]\r\n            const __m256i T_00_10A = _mm256_unpacklo_epi16(in18[part], in22[part]);       // [ ]\r\n            const __m256i T_00_10B = _mm256_unpackhi_epi16(in18[part], in22[part]);       // [ ]\r\n            const __m256i T_00_11A = _mm256_unpacklo_epi16(in26[part], in30[part]);       // [ ]\r\n            const __m256i T_00_11B = _mm256_unpackhi_epi16(in26[part], in30[part]); 
      // [ ]\r\n\r\n            const __m256i T_00_12A = _mm256_unpacklo_epi16(in04[part], in12[part]);       // [ ]\r\n            const __m256i T_00_12B = _mm256_unpackhi_epi16(in04[part], in12[part]);       // [ ]\r\n            const __m256i T_00_13A = _mm256_unpacklo_epi16(in20[part], in28[part]);       // [ ]\r\n            const __m256i T_00_13B = _mm256_unpackhi_epi16(in20[part], in28[part]);       // [ ]\r\n\r\n            const __m256i T_00_14A = _mm256_unpacklo_epi16(in08[part], in24[part]);       //\r\n            const __m256i T_00_14B = _mm256_unpackhi_epi16(in08[part], in24[part]);       // [ ]\r\n            const __m256i T_00_15A = _mm256_unpacklo_epi16(in00[part], in16[part]);       //\r\n            const __m256i T_00_15B = _mm256_unpackhi_epi16(in00[part], in16[part]);       // [ ]\r\n\r\n            __m256i O00A, O01A, O02A, O03A, O04A, O05A, O06A, O07A, O08A, O09A, O10A, O11A, O12A, O13A, O14A, O15A;\r\n            __m256i O00B, O01B, O02B, O03B, O04B, O05B, O06B, O07B, O08B, O09B, O10B, O11B, O12B, O13B, O14B, O15B;\r\n            __m256i EO0A, EO1A, EO2A, EO3A, EO4A, EO5A, EO6A, EO7A;\r\n            __m256i EO0B, EO1B, EO2B, EO3B, EO4B, EO5B, EO6B, EO7B;\r\n            {\r\n                __m256i T00, T01, T02, T03;\r\n#define     COMPUTE_ROW(r0103, r0507, r0911, r1315, r1719, r2123, r2527, r2931, c0103, c0507, c0911, c1315, c1719, c2123, c2527, c2931, row) \\\r\n            T00 = _mm256_add_epi32(_mm256_madd_epi16(r0103, c0103), _mm256_madd_epi16(r0507, c0507)); \\\r\n            T01 = _mm256_add_epi32(_mm256_madd_epi16(r0911, c0911), _mm256_madd_epi16(r1315, c1315)); \\\r\n            T02 = _mm256_add_epi32(_mm256_madd_epi16(r1719, c1719), _mm256_madd_epi16(r2123, c2123)); \\\r\n            T03 = _mm256_add_epi32(_mm256_madd_epi16(r2527, c2527), _mm256_madd_epi16(r2931, c2931)); \\\r\n            row = _mm256_add_epi32(_mm256_add_epi32(T00, T01), _mm256_add_epi32(T02, T03));\r\n\r\n                COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, 
T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, c16_p27_p30, c16_p19_p23, c16_p11_p15, c16_p02_p07, O00A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, c16_n43_n36, c16_n44_n45, c16_n30_n39, c16_n07_n19, O01A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, c16_n02_n23, c16_p36_p19, c16_p43_p45, c16_p11_p30, O02A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, c16_p44_p41, c16_n02_p27, c16_n45_n30, c16_n15_n39, O03A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, c16_n23_p15, c16_n34_n45, c16_p36_p02, c16_p19_p44, O04A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, c16_n30_n44, c16_p45_p15, c16_n19_p27, c16_n23_n45, O05A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, c16_p41_n07, c16_n23_p30, c16_n02_n44, c16_p27_p43, O06A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, c16_p07_p45, c16_n15_n44, c16_p23_p41, c16_n30_n36, O07A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, 
T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, c16_n45_n02, c16_p43_p11, c16_n39_n19, c16_p34_p27, O08A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, c16_p19_n45, c16_n39_p34, c16_p45_n11, c16_n36_n15, O09A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, c16_p34_p11, c16_p07_n43, c16_n41_p36, c16_p39_p02, O10A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, c16_n39_p43, c16_p30_p07, c16_p27_n45, c16_n41_p11, O11A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, c16_n11_n19, c16_n45_p36, c16_n07_p34, c16_p43_n23, O12A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, c16_p45_n39, c16_p27_n41, c16_n15_n07, c16_n44_p34, O13A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, c16_n15_p27, c16_p11_p02, c16_p34_n23, c16_p45_n41, O14A)\r\n                    COMPUTE_ROW(T_00_00A, T_00_01A, T_00_02A, T_00_03A, T_00_04A, T_00_05A, T_00_06A, T_00_07A, \\\r\n                    c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, c16_n36_p34, c16_n41_p39, c16_n44_p43, c16_n45_p45, O15A)\r\n\r\n                    
COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_p45_p45, c16_p43_p44, c16_p39_p41, c16_p34_p36, c16_p27_p30, c16_p19_p23, c16_p11_p15, c16_p02_p07, O00B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_p41_p45, c16_p23_p34, c16_n02_p11, c16_n27_n15, c16_n43_n36, c16_n44_n45, c16_n30_n39, c16_n07_n19, O01B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_p34_p44, c16_n07_p15, c16_n41_n27, c16_n39_n45, c16_n02_n23, c16_p36_p19, c16_p43_p45, c16_p11_p30, O02B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_p23_p43, c16_n34_n07, c16_n36_n45, c16_p19_n11, c16_p44_p41, c16_n02_p27, c16_n45_n30, c16_n15_n39, O03B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_p11_p41, c16_n45_n27, c16_p07_n30, c16_p43_p39, c16_n23_p15, c16_n34_n45, c16_p36_p02, c16_p19_p44, O04B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n02_p39, c16_n36_n41, c16_p43_p07, c16_n11_p34, c16_n30_n44, c16_p45_p15, c16_n19_p27, c16_n23_n45, O05B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n15_p36, c16_n11_n45, c16_p34_p39, c16_n45_n19, c16_p41_n07, c16_n23_p30, c16_n02_n44, c16_p27_p43, O06B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n27_p34, c16_p19_n39, c16_n11_p43, c16_p02_n45, c16_p07_p45, c16_n15_n44, c16_p23_p41, c16_n30_n36, O07B)\r\n            
        COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n36_p30, c16_p41_n23, c16_n44_p15, c16_p45_n07, c16_n45_n02, c16_p43_p11, c16_n39_n19, c16_p34_p27, O08B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n43_p27, c16_p44_n02, c16_n30_n23, c16_p07_p41, c16_p19_n45, c16_n39_p34, c16_p45_n11, c16_n36_n15, O09B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n45_p23, c16_p27_p19, c16_p15_n45, c16_n44_p30, c16_p34_p11, c16_p07_n43, c16_n41_p36, c16_p39_p02, O10B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n44_p19, c16_n02_p36, c16_p45_n34, c16_n15_n23, c16_n39_p43, c16_p30_p07, c16_p27_n45, c16_n41_p11, O11B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n39_p15, c16_n30_p45, c16_p27_p02, c16_p41_n44, c16_n11_n19, c16_n45_p36, c16_n07_p34, c16_p43_n23, O12B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n30_p11, c16_n45_p43, c16_n19_p36, c16_p23_n02, c16_p45_n39, c16_p27_n41, c16_n15_n07, c16_n44_p34, O13B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n19_p07, c16_n39_p30, c16_n45_p44, c16_n36_p43, c16_n15_p27, c16_p11_p02, c16_p34_n23, c16_p45_n41, O14B)\r\n                    COMPUTE_ROW(T_00_00B, T_00_01B, T_00_02B, T_00_03B, T_00_04B, T_00_05B, T_00_06B, T_00_07B, \\\r\n                    c16_n07_p02, c16_n15_p11, c16_n23_p19, c16_n30_p27, c16_n36_p34, c16_n41_p39, c16_n44_p43, c16_n45_p45, 
O15B)\r\n\r\n#undef      COMPUTE_ROW\r\n            }\r\n\r\n\r\n            {\r\n                __m256i T00, T01;\r\n#define     COMPUTE_ROW(row0206, row1014, row1822, row2630, c0206, c1014, c1822, c2630, row) \\\r\n            T00 = _mm256_add_epi32(_mm256_madd_epi16(row0206, c0206), _mm256_madd_epi16(row1014, c1014)); \\\r\n            T01 = _mm256_add_epi32(_mm256_madd_epi16(row1822, c1822), _mm256_madd_epi16(row2630, c2630)); \\\r\n            row = _mm256_add_epi32(T00, T01);\r\n\r\n                COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, EO0A)\r\n                    COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, EO1A)\r\n                    COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, EO2A)\r\n                    COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, EO3A)\r\n                    COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, EO4A)\r\n                    COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, EO5A)\r\n                    COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, EO6A)\r\n                    COMPUTE_ROW(T_00_08A, T_00_09A, T_00_10A, T_00_11A, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, EO7A)\r\n\r\n                    COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p43_p45, c16_p35_p40, c16_p21_p29, c16_p04_p13, EO0B)\r\n                    COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p29_p43, c16_n21_p04, c16_n45_n40, c16_n13_n35, EO1B)\r\n                    COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_p04_p40, c16_n43_n35, c16_p29_n13, c16_p21_p45, EO2B)\r\n                    COMPUTE_ROW(T_00_08B, 
T_00_09B, T_00_10B, T_00_11B, c16_n21_p35, c16_p04_n43, c16_p13_p45, c16_n29_n40, EO3B)\r\n                    COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n40_p29, c16_p45_n13, c16_n43_n04, c16_p35_p21, EO4B)\r\n                    COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n45_p21, c16_p13_p29, c16_p35_n43, c16_n40_p04, EO5B)\r\n                    COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n35_p13, c16_n40_p45, c16_p04_p21, c16_p43_n29, EO6B)\r\n                    COMPUTE_ROW(T_00_08B, T_00_09B, T_00_10B, T_00_11B, c16_n13_p04, c16_n29_p21, c16_n40_p35, c16_n45_p43, EO7B)\r\n#undef      COMPUTE_ROW\r\n            }\r\n\r\n            EEO0A = _mm256_add_epi32(_mm256_madd_epi16(T_00_12A, c16_p38_p44), _mm256_madd_epi16(T_00_13A, c16_p09_p25));\r\n            EEO1A = _mm256_add_epi32(_mm256_madd_epi16(T_00_12A, c16_n09_p38), _mm256_madd_epi16(T_00_13A, c16_n25_n44));\r\n            EEO2A = _mm256_add_epi32(_mm256_madd_epi16(T_00_12A, c16_n44_p25), _mm256_madd_epi16(T_00_13A, c16_p38_p09));\r\n            EEO3A = _mm256_add_epi32(_mm256_madd_epi16(T_00_12A, c16_n25_p09), _mm256_madd_epi16(T_00_13A, c16_n44_p38));\r\n            EEO0B = _mm256_add_epi32(_mm256_madd_epi16(T_00_12B, c16_p38_p44), _mm256_madd_epi16(T_00_13B, c16_p09_p25));\r\n            EEO1B = _mm256_add_epi32(_mm256_madd_epi16(T_00_12B, c16_n09_p38), _mm256_madd_epi16(T_00_13B, c16_n25_n44));\r\n            EEO2B = _mm256_add_epi32(_mm256_madd_epi16(T_00_12B, c16_n44_p25), _mm256_madd_epi16(T_00_13B, c16_p38_p09));\r\n            EEO3B = _mm256_add_epi32(_mm256_madd_epi16(T_00_12B, c16_n25_p09), _mm256_madd_epi16(T_00_13B, c16_n44_p38));\r\n\r\n            EEEO0A = _mm256_madd_epi16(T_00_14A, c16_p17_p42);\r\n            EEEO0B = _mm256_madd_epi16(T_00_14B, c16_p17_p42);\r\n            EEEO1A = _mm256_madd_epi16(T_00_14A, c16_n42_p17);\r\n            EEEO1B = _mm256_madd_epi16(T_00_14B, c16_n42_p17);\r\n\r\n            EEEE0A = _mm256_madd_epi16(T_00_15A, 
c16_p32_p32);\r\n            EEEE0B = _mm256_madd_epi16(T_00_15B, c16_p32_p32);\r\n            EEEE1A = _mm256_madd_epi16(T_00_15A, c16_n32_p32);\r\n            EEEE1B = _mm256_madd_epi16(T_00_15B, c16_n32_p32);\r\n\r\n            EEE0A = _mm256_add_epi32(EEEE0A, EEEO0A);          // EEE0 = EEEE0 + EEEO0\r\n            EEE0B = _mm256_add_epi32(EEEE0B, EEEO0B);\r\n            EEE1A = _mm256_add_epi32(EEEE1A, EEEO1A);          // EEE1 = EEEE1 + EEEO1\r\n            EEE1B = _mm256_add_epi32(EEEE1B, EEEO1B);\r\n            EEE3A = _mm256_sub_epi32(EEEE0A, EEEO0A);          // EEE3 = EEEE0 - EEEO0\r\n            EEE3B = _mm256_sub_epi32(EEEE0B, EEEO0B);\r\n            EEE2A = _mm256_sub_epi32(EEEE1A, EEEO1A);          // EEE2 = EEEE1 - EEEO1\r\n            EEE2B = _mm256_sub_epi32(EEEE1B, EEEO1B);\r\n\r\n            EE0A = _mm256_add_epi32(EEE0A, EEO0A);          // EE0 = EEE0 + EEO0\r\n            EE0B = _mm256_add_epi32(EEE0B, EEO0B);\r\n            EE1A = _mm256_add_epi32(EEE1A, EEO1A);          // EE1 = EEE1 + EEO1\r\n            EE1B = _mm256_add_epi32(EEE1B, EEO1B);\r\n            EE2A = _mm256_add_epi32(EEE2A, EEO2A);          // EE2 = EEE2 + EEO2\r\n            EE2B = _mm256_add_epi32(EEE2B, EEO2B);\r\n            EE3A = _mm256_add_epi32(EEE3A, EEO3A);          // EE3 = EEE3 + EEO3\r\n            EE3B = _mm256_add_epi32(EEE3B, EEO3B);\r\n            EE7A = _mm256_sub_epi32(EEE0A, EEO0A);          // EE7 = EEE0 - EEO0\r\n            EE7B = _mm256_sub_epi32(EEE0B, EEO0B);\r\n            EE6A = _mm256_sub_epi32(EEE1A, EEO1A);          // EE6 = EEE1 - EEO1\r\n            EE6B = _mm256_sub_epi32(EEE1B, EEO1B);\r\n            EE5A = _mm256_sub_epi32(EEE2A, EEO2A);          // EE5 = EEE2 - EEO2\r\n            EE5B = _mm256_sub_epi32(EEE2B, EEO2B);\r\n            EE4A = _mm256_sub_epi32(EEE3A, EEO3A);          // EE4 = EEE3 - EEO3\r\n            EE4B = _mm256_sub_epi32(EEE3B, EEO3B);\r\n\r\n            E0A = _mm256_add_epi32(EE0A, EO0A);          // E0 = EE0 + EO0\r\n   
         E0B = _mm256_add_epi32(EE0B, EO0B);\r\n            E1A = _mm256_add_epi32(EE1A, EO1A);          // E1 = EE1 + EO1\r\n            E1B = _mm256_add_epi32(EE1B, EO1B);\r\n            E2A = _mm256_add_epi32(EE2A, EO2A);          // E2 = EE2 + EO2\r\n            E2B = _mm256_add_epi32(EE2B, EO2B);\r\n            E3A = _mm256_add_epi32(EE3A, EO3A);          // E3 = EE3 + EO3\r\n            E3B = _mm256_add_epi32(EE3B, EO3B);\r\n            E4A = _mm256_add_epi32(EE4A, EO4A);          // E4 = EE4 + EO4\r\n            E4B = _mm256_add_epi32(EE4B, EO4B);\r\n            E5A = _mm256_add_epi32(EE5A, EO5A);          // E5 = EE5 + EO5\r\n            E5B = _mm256_add_epi32(EE5B, EO5B);\r\n            E6A = _mm256_add_epi32(EE6A, EO6A);          // E6 = EE6 + EO6\r\n            E6B = _mm256_add_epi32(EE6B, EO6B);\r\n            E7A = _mm256_add_epi32(EE7A, EO7A);          // E7 = EE7 + EO7\r\n            E7B = _mm256_add_epi32(EE7B, EO7B);\r\n            EFA = _mm256_sub_epi32(EE0A, EO0A);          // EF = EE0 - EO0\r\n            EFB = _mm256_sub_epi32(EE0B, EO0B);\r\n            EEA = _mm256_sub_epi32(EE1A, EO1A);          // EE = EE1 - EO1\r\n            EEB = _mm256_sub_epi32(EE1B, EO1B);\r\n            EDA = _mm256_sub_epi32(EE2A, EO2A);          // ED = EE2 - EO2\r\n            EDB = _mm256_sub_epi32(EE2B, EO2B);\r\n            ECA = _mm256_sub_epi32(EE3A, EO3A);          // EC = EE3 - EO3\r\n            ECB = _mm256_sub_epi32(EE3B, EO3B);\r\n            EBA = _mm256_sub_epi32(EE4A, EO4A);          // EB = EE4 - EO4\r\n            EBB = _mm256_sub_epi32(EE4B, EO4B);\r\n            EAA = _mm256_sub_epi32(EE5A, EO5A);          // EA = EE5 - EO5\r\n            EAB = _mm256_sub_epi32(EE5B, EO5B);\r\n            E9A = _mm256_sub_epi32(EE6A, EO6A);          // E9 = EE6 - EO6\r\n            E9B = _mm256_sub_epi32(EE6B, EO6B);\r\n            E8A = _mm256_sub_epi32(EE7A, EO7A);          // E8 = EE7 - EO7\r\n            E8B = _mm256_sub_epi32(EE7B, EO7B);\r\n\r\n            T10A = _mm256_add_epi32(E0A, c32_rnd);         // E0 + rnd\r\n            T10B = 
_mm256_add_epi32(E0B, c32_rnd);\r\n            T11A = _mm256_add_epi32(E1A, c32_rnd);         // E1 + rnd\r\n            T11B = _mm256_add_epi32(E1B, c32_rnd);\r\n            T12A = _mm256_add_epi32(E2A, c32_rnd);         // E2 + rnd\r\n            T12B = _mm256_add_epi32(E2B, c32_rnd);\r\n            T13A = _mm256_add_epi32(E3A, c32_rnd);         // E3 + rnd\r\n            T13B = _mm256_add_epi32(E3B, c32_rnd);\r\n            T14A = _mm256_add_epi32(E4A, c32_rnd);         // E4 + rnd\r\n            T14B = _mm256_add_epi32(E4B, c32_rnd);\r\n            T15A = _mm256_add_epi32(E5A, c32_rnd);         // E5 + rnd\r\n            T15B = _mm256_add_epi32(E5B, c32_rnd);\r\n            T16A = _mm256_add_epi32(E6A, c32_rnd);         // E6 + rnd\r\n            T16B = _mm256_add_epi32(E6B, c32_rnd);\r\n            T17A = _mm256_add_epi32(E7A, c32_rnd);         // E7 + rnd\r\n            T17B = _mm256_add_epi32(E7B, c32_rnd);\r\n            T18A = _mm256_add_epi32(E8A, c32_rnd);         // E8 + rnd\r\n            T18B = _mm256_add_epi32(E8B, c32_rnd);\r\n            T19A = _mm256_add_epi32(E9A, c32_rnd);         // E9 + rnd\r\n            T19B = _mm256_add_epi32(E9B, c32_rnd);\r\n            T1AA = _mm256_add_epi32(EAA, c32_rnd);         // E10 + rnd\r\n            T1AB = _mm256_add_epi32(EAB, c32_rnd);\r\n            T1BA = _mm256_add_epi32(EBA, c32_rnd);         // E11 + rnd\r\n            T1BB = _mm256_add_epi32(EBB, c32_rnd);\r\n            T1CA = _mm256_add_epi32(ECA, c32_rnd);         // E12 + rnd\r\n            T1CB = _mm256_add_epi32(ECB, c32_rnd);\r\n            T1DA = _mm256_add_epi32(EDA, c32_rnd);         // E13 + rnd\r\n            T1DB = _mm256_add_epi32(EDB, c32_rnd);\r\n            T1EA = _mm256_add_epi32(EEA, c32_rnd);         // E14 + rnd\r\n            T1EB = _mm256_add_epi32(EEB, c32_rnd);\r\n            T1FA = _mm256_add_epi32(EFA, c32_rnd);         // E15 + rnd\r\n            T1FB = _mm256_add_epi32(EFB, c32_rnd);\r\n\r\n            T2_00A = 
_mm256_add_epi32(T10A, O00A);          // E0 + O0 + rnd\r\n            T2_00B = _mm256_add_epi32(T10B, O00B);\r\n            T2_01A = _mm256_add_epi32(T11A, O01A);          // E1 + O1 + rnd\r\n            T2_01B = _mm256_add_epi32(T11B, O01B);\r\n            T2_02A = _mm256_add_epi32(T12A, O02A);          // E2 + O2 + rnd\r\n            T2_02B = _mm256_add_epi32(T12B, O02B);\r\n            T2_03A = _mm256_add_epi32(T13A, O03A);          // E3 + O3 + rnd\r\n            T2_03B = _mm256_add_epi32(T13B, O03B);\r\n            T2_04A = _mm256_add_epi32(T14A, O04A);          // E4\r\n            T2_04B = _mm256_add_epi32(T14B, O04B);\r\n            T2_05A = _mm256_add_epi32(T15A, O05A);          // E5\r\n            T2_05B = _mm256_add_epi32(T15B, O05B);\r\n            T2_06A = _mm256_add_epi32(T16A, O06A);          // E6\r\n            T2_06B = _mm256_add_epi32(T16B, O06B);\r\n            T2_07A = _mm256_add_epi32(T17A, O07A);          // E7\r\n            T2_07B = _mm256_add_epi32(T17B, O07B);\r\n            T2_08A = _mm256_add_epi32(T18A, O08A);          // E8\r\n            T2_08B = _mm256_add_epi32(T18B, O08B);\r\n            T2_09A = _mm256_add_epi32(T19A, O09A);          // E9\r\n            T2_09B = _mm256_add_epi32(T19B, O09B);\r\n            T2_10A = _mm256_add_epi32(T1AA, O10A);          // E10\r\n            T2_10B = _mm256_add_epi32(T1AB, O10B);\r\n            T2_11A = _mm256_add_epi32(T1BA, O11A);          // E11\r\n            T2_11B = _mm256_add_epi32(T1BB, O11B);\r\n            T2_12A = _mm256_add_epi32(T1CA, O12A);          // E12\r\n            T2_12B = _mm256_add_epi32(T1CB, O12B);\r\n            T2_13A = _mm256_add_epi32(T1DA, O13A);          // E13\r\n            T2_13B = _mm256_add_epi32(T1DB, O13B);\r\n            T2_14A = _mm256_add_epi32(T1EA, O14A);          // E14\r\n            T2_14B = _mm256_add_epi32(T1EB, O14B);\r\n            T2_15A = _mm256_add_epi32(T1FA, O15A);          // E15\r\n            T2_15B = _mm256_add_epi32(T1FB, O15B);\r\n   
         T2_31A = _mm256_sub_epi32(T10A, O00A);          // E0 - O0 + rnd\r\n            T2_31B = _mm256_sub_epi32(T10B, O00B);\r\n            T2_30A = _mm256_sub_epi32(T11A, O01A);          // E1 - O1 + rnd\r\n            T2_30B = _mm256_sub_epi32(T11B, O01B);\r\n            T2_29A = _mm256_sub_epi32(T12A, O02A);          // E2 - O2 + rnd\r\n            T2_29B = _mm256_sub_epi32(T12B, O02B);\r\n            T2_28A = _mm256_sub_epi32(T13A, O03A);          // E3 - O3 + rnd\r\n            T2_28B = _mm256_sub_epi32(T13B, O03B);\r\n            T2_27A = _mm256_sub_epi32(T14A, O04A);          // E4\r\n            T2_27B = _mm256_sub_epi32(T14B, O04B);\r\n            T2_26A = _mm256_sub_epi32(T15A, O05A);          // E5\r\n            T2_26B = _mm256_sub_epi32(T15B, O05B);\r\n            T2_25A = _mm256_sub_epi32(T16A, O06A);          // E6\r\n            T2_25B = _mm256_sub_epi32(T16B, O06B);\r\n            T2_24A = _mm256_sub_epi32(T17A, O07A);          // E7\r\n            T2_24B = _mm256_sub_epi32(T17B, O07B);\r\n            T2_23A = _mm256_sub_epi32(T18A, O08A);          //\r\n            T2_23B = _mm256_sub_epi32(T18B, O08B);\r\n            T2_22A = _mm256_sub_epi32(T19A, O09A);          //\r\n            T2_22B = _mm256_sub_epi32(T19B, O09B);\r\n            T2_21A = _mm256_sub_epi32(T1AA, O10A);          //\r\n            T2_21B = _mm256_sub_epi32(T1AB, O10B);\r\n            T2_20A = _mm256_sub_epi32(T1BA, O11A);          //\r\n            T2_20B = _mm256_sub_epi32(T1BB, O11B);\r\n            T2_19A = _mm256_sub_epi32(T1CA, O12A);          //\r\n            T2_19B = _mm256_sub_epi32(T1CB, O12B);\r\n            T2_18A = _mm256_sub_epi32(T1DA, O13A);          //\r\n            T2_18B = _mm256_sub_epi32(T1DB, O13B);\r\n            T2_17A = _mm256_sub_epi32(T1EA, O14A);          //\r\n            T2_17B = _mm256_sub_epi32(T1EB, O14B);\r\n            T2_16A = _mm256_sub_epi32(T1FA, O15A);          //\r\n            T2_16B = _mm256_sub_epi32(T1FB, O15B);\r\n\r\n           
 T3_00A = _mm256_srai_epi32(T2_00A, nShift);             // [30 20 10 00] // This operation makes it much slower than 128\r\n            T3_00B = _mm256_srai_epi32(T2_00B, nShift);             // [70 60 50 40] // This operation makes it much slower than 128\r\n            T3_01A = _mm256_srai_epi32(T2_01A, nShift);             // [31 21 11 01] // This operation makes it much slower than 128\r\n            T3_01B = _mm256_srai_epi32(T2_01B, nShift);             // [71 61 51 41] // This operation makes it much slower than 128\r\n            T3_02A = _mm256_srai_epi32(T2_02A, nShift);             // [32 22 12 02] // This operation makes it much slower than 128\r\n            T3_02B = _mm256_srai_epi32(T2_02B, nShift);             // [72 62 52 42]\r\n            T3_03A = _mm256_srai_epi32(T2_03A, nShift);             // [33 23 13 03]\r\n            T3_03B = _mm256_srai_epi32(T2_03B, nShift);             // [73 63 53 43]\r\n            T3_04A = _mm256_srai_epi32(T2_04A, nShift);             // [34 24 14 04]\r\n            T3_04B = _mm256_srai_epi32(T2_04B, nShift);             // [74 64 54 44]\r\n            T3_05A = _mm256_srai_epi32(T2_05A, nShift);             // [35 25 15 05]\r\n            T3_05B = _mm256_srai_epi32(T2_05B, nShift);             // [75 65 55 45]\r\n            T3_06A = _mm256_srai_epi32(T2_06A, nShift);             // [36 26 16 06]\r\n            T3_06B = _mm256_srai_epi32(T2_06B, nShift);             // [76 66 56 46]\r\n            T3_07A = _mm256_srai_epi32(T2_07A, nShift);             // [37 27 17 07]\r\n            T3_07B = _mm256_srai_epi32(T2_07B, nShift);             // [77 67 57 47]\r\n            T3_08A = _mm256_srai_epi32(T2_08A, nShift);             // [30 20 10 00] x8\r\n            T3_08B = _mm256_srai_epi32(T2_08B, nShift);             // [70 60 50 40]\r\n            T3_09A = _mm256_srai_epi32(T2_09A, nShift);             // [31 21 11 01] x9\r\n            T3_09B = _mm256_srai_epi32(T2_09B, nShift);             // [71 61 51 41]\r\n          
  T3_10A = _mm256_srai_epi32(T2_10A, nShift);             // [32 22 12 02] xA\r\n            T3_10B = _mm256_srai_epi32(T2_10B, nShift);             // [72 62 52 42]\r\n            T3_11A = _mm256_srai_epi32(T2_11A, nShift);             // [33 23 13 03] xB\r\n            T3_11B = _mm256_srai_epi32(T2_11B, nShift);             // [73 63 53 43]\r\n            T3_12A = _mm256_srai_epi32(T2_12A, nShift);             // [34 24 14 04] xC\r\n            T3_12B = _mm256_srai_epi32(T2_12B, nShift);             // [74 64 54 44]\r\n            T3_13A = _mm256_srai_epi32(T2_13A, nShift);             // [35 25 15 05] xD\r\n            T3_13B = _mm256_srai_epi32(T2_13B, nShift);             // [75 65 55 45]\r\n            T3_14A = _mm256_srai_epi32(T2_14A, nShift);             // [36 26 16 06] xE\r\n            T3_14B = _mm256_srai_epi32(T2_14B, nShift);             // [76 66 56 46]\r\n            T3_15A = _mm256_srai_epi32(T2_15A, nShift);             // [37 27 17 07] xF\r\n            T3_15B = _mm256_srai_epi32(T2_15B, nShift);             // [77 67 57 47]\r\n\r\n            T3_16A = _mm256_srai_epi32(T2_16A, nShift);             // [30 20 10 00] // This operation makes it much slower than 128\r\n            T3_16B = _mm256_srai_epi32(T2_16B, nShift);             // [70 60 50 40] // This operation makes it much slower than 128\r\n            T3_17A = _mm256_srai_epi32(T2_17A, nShift);             // [31 21 11 01] // This operation makes it much slower than 128\r\n            T3_17B = _mm256_srai_epi32(T2_17B, nShift);             // [71 61 51 41]\r\n            T3_18A = _mm256_srai_epi32(T2_18A, nShift);             // [32 22 12 02]\r\n            T3_18B = _mm256_srai_epi32(T2_18B, nShift);             // [72 62 52 42]\r\n            T3_19A = _mm256_srai_epi32(T2_19A, nShift);             // [33 23 13 03]\r\n            T3_19B = _mm256_srai_epi32(T2_19B, nShift);             // [73 63 53 43]\r\n            T3_20A = _mm256_srai_epi32(T2_20A, nShift);             // [34 24 14 
04]\r\n            T3_20B = _mm256_srai_epi32(T2_20B, nShift);             // [74 64 54 44]\r\n            T3_21A = _mm256_srai_epi32(T2_21A, nShift);             // [35 25 15 05]\r\n            T3_21B = _mm256_srai_epi32(T2_21B, nShift);             // [75 65 55 45]\r\n            T3_22A = _mm256_srai_epi32(T2_22A, nShift);             // [36 26 16 06]\r\n            T3_22B = _mm256_srai_epi32(T2_22B, nShift);             // [76 66 56 46]\r\n            T3_23A = _mm256_srai_epi32(T2_23A, nShift);             // [37 27 17 07]\r\n            T3_23B = _mm256_srai_epi32(T2_23B, nShift);             // [77 67 57 47]\r\n            T3_24A = _mm256_srai_epi32(T2_24A, nShift);             // [30 20 10 00] x8\r\n            T3_24B = _mm256_srai_epi32(T2_24B, nShift);             // [70 60 50 40]\r\n            T3_25A = _mm256_srai_epi32(T2_25A, nShift);             // [31 21 11 01] x9\r\n            T3_25B = _mm256_srai_epi32(T2_25B, nShift);             // [71 61 51 41]\r\n            T3_26A = _mm256_srai_epi32(T2_26A, nShift);             // [32 22 12 02] xA\r\n            T3_26B = _mm256_srai_epi32(T2_26B, nShift);             // [72 62 52 42]\r\n            T3_27A = _mm256_srai_epi32(T2_27A, nShift);             // [33 23 13 03] xB\r\n            T3_27B = _mm256_srai_epi32(T2_27B, nShift);             // [73 63 53 43]\r\n            T3_28A = _mm256_srai_epi32(T2_28A, nShift);             // [34 24 14 04] xC\r\n            T3_28B = _mm256_srai_epi32(T2_28B, nShift);             // [74 64 54 44]\r\n            T3_29A = _mm256_srai_epi32(T2_29A, nShift);             // [35 25 15 05] xD\r\n            T3_29B = _mm256_srai_epi32(T2_29B, nShift);             // [75 65 55 45]\r\n            T3_30A = _mm256_srai_epi32(T2_30A, nShift);             // [36 26 16 06] xE\r\n            T3_30B = _mm256_srai_epi32(T2_30B, nShift);             // [76 66 56 46]\r\n            T3_31A = _mm256_srai_epi32(T2_31A, nShift);             // [37 27 17 07] xF\r\n            T3_31B = 
_mm256_srai_epi32(T2_31B, nShift);             // [77 67 57 47]\r\n\r\n            res00[part] = _mm256_packs_epi32(T3_00A, T3_00B);        // [70 60 50 40 30 20 10 00]\r\n            res01[part] = _mm256_packs_epi32(T3_01A, T3_01B);        // [71 61 51 41 31 21 11 01]\r\n            res02[part] = _mm256_packs_epi32(T3_02A, T3_02B);        // [72 62 52 42 32 22 12 02]\r\n            res03[part] = _mm256_packs_epi32(T3_03A, T3_03B);        // [73 63 53 43 33 23 13 03]\r\n            res04[part] = _mm256_packs_epi32(T3_04A, T3_04B);        // [74 64 54 44 34 24 14 04]\r\n            res05[part] = _mm256_packs_epi32(T3_05A, T3_05B);        // [75 65 55 45 35 25 15 05]\r\n            res06[part] = _mm256_packs_epi32(T3_06A, T3_06B);        // [76 66 56 46 36 26 16 06]\r\n            res07[part] = _mm256_packs_epi32(T3_07A, T3_07B);        // [77 67 57 47 37 27 17 07]\r\n            res08[part] = _mm256_packs_epi32(T3_08A, T3_08B);        // [A0 ... 80]\r\n            res09[part] = _mm256_packs_epi32(T3_09A, T3_09B);        // [A1 ... 81]\r\n            res10[part] = _mm256_packs_epi32(T3_10A, T3_10B);        // [A2 ... 82]\r\n            res11[part] = _mm256_packs_epi32(T3_11A, T3_11B);        // [A3 ... 83]\r\n            res12[part] = _mm256_packs_epi32(T3_12A, T3_12B);        // [A4 ... 84]\r\n            res13[part] = _mm256_packs_epi32(T3_13A, T3_13B);        // [A5 ... 85]\r\n            res14[part] = _mm256_packs_epi32(T3_14A, T3_14B);        // [A6 ... 86]\r\n            res15[part] = _mm256_packs_epi32(T3_15A, T3_15B);        // [A7 ... 
87]\r\n            res16[part] = _mm256_packs_epi32(T3_16A, T3_16B);\r\n            res17[part] = _mm256_packs_epi32(T3_17A, T3_17B);\r\n            res18[part] = _mm256_packs_epi32(T3_18A, T3_18B);\r\n            res19[part] = _mm256_packs_epi32(T3_19A, T3_19B);\r\n            res20[part] = _mm256_packs_epi32(T3_20A, T3_20B);\r\n            res21[part] = _mm256_packs_epi32(T3_21A, T3_21B);\r\n            res22[part] = _mm256_packs_epi32(T3_22A, T3_22B);\r\n            res23[part] = _mm256_packs_epi32(T3_23A, T3_23B);\r\n            res24[part] = _mm256_packs_epi32(T3_24A, T3_24B);\r\n            res25[part] = _mm256_packs_epi32(T3_25A, T3_25B);\r\n            res26[part] = _mm256_packs_epi32(T3_26A, T3_26B);\r\n            res27[part] = _mm256_packs_epi32(T3_27A, T3_27B);\r\n            res28[part] = _mm256_packs_epi32(T3_28A, T3_28B);\r\n            res29[part] = _mm256_packs_epi32(T3_29A, T3_29B);\r\n            res30[part] = _mm256_packs_epi32(T3_30A, T3_30B);\r\n            res31[part] = _mm256_packs_epi32(T3_31A, T3_31B);\r\n\r\n        }\r\n\r\n        //transpose 32x32 matrix\r\n        {\r\n            __m256i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7, tr0_8, tr0_9, tr0_10, tr0_11, tr0_12, tr0_13, tr0_14, tr0_15;\r\n#define TRANSPOSE_16x16_16BIT(I0, I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, I11, I12, I13, I14, I15, O0, O1, O2, O3, O4, O5, O6, O7, O8, O9, O10, O11, O12, O13, O14, O15) \\\r\n        tr0_0 = _mm256_unpacklo_epi16(I0, I1); \\\r\n        tr0_1 = _mm256_unpacklo_epi16(I2, I3); \\\r\n        tr0_2 = _mm256_unpacklo_epi16(I4, I5); \\\r\n        tr0_3 = _mm256_unpacklo_epi16(I6, I7); \\\r\n        tr0_4 = _mm256_unpacklo_epi16(I8, I9); \\\r\n        tr0_5 = _mm256_unpacklo_epi16(I10, I11); \\\r\n        tr0_6 = _mm256_unpacklo_epi16(I12, I13); \\\r\n        tr0_7 = _mm256_unpacklo_epi16(I14, I15); \\\r\n        tr0_8 = _mm256_unpackhi_epi16(I0, I1); \\\r\n        tr0_9 = _mm256_unpackhi_epi16(I2, I3); \\\r\n        tr0_10 = 
_mm256_unpackhi_epi16(I4, I5); \\\r\n        tr0_11 = _mm256_unpackhi_epi16(I6, I7); \\\r\n        tr0_12 = _mm256_unpackhi_epi16(I8, I9); \\\r\n        tr0_13 = _mm256_unpackhi_epi16(I10, I11); \\\r\n        tr0_14 = _mm256_unpackhi_epi16(I12, I13); \\\r\n        tr0_15 = _mm256_unpackhi_epi16(I14, I15); \\\r\n        O0 = _mm256_unpacklo_epi32(tr0_0, tr0_1); \\\r\n        O1 = _mm256_unpacklo_epi32(tr0_2, tr0_3); \\\r\n        O2 = _mm256_unpacklo_epi32(tr0_4, tr0_5); \\\r\n        O3 = _mm256_unpacklo_epi32(tr0_6, tr0_7); \\\r\n        O4 = _mm256_unpackhi_epi32(tr0_0, tr0_1); \\\r\n        O5 = _mm256_unpackhi_epi32(tr0_2, tr0_3); \\\r\n        O6 = _mm256_unpackhi_epi32(tr0_4, tr0_5); \\\r\n        O7 = _mm256_unpackhi_epi32(tr0_6, tr0_7); \\\r\n        O8 = _mm256_unpacklo_epi32(tr0_8, tr0_9); \\\r\n        O9 = _mm256_unpacklo_epi32(tr0_10, tr0_11); \\\r\n        O10 = _mm256_unpacklo_epi32(tr0_12, tr0_13); \\\r\n        O11 = _mm256_unpacklo_epi32(tr0_14, tr0_15); \\\r\n        O12 = _mm256_unpackhi_epi32(tr0_8, tr0_9); \\\r\n        O13 = _mm256_unpackhi_epi32(tr0_10, tr0_11); \\\r\n        O14 = _mm256_unpackhi_epi32(tr0_12, tr0_13); \\\r\n        O15 = _mm256_unpackhi_epi32(tr0_14, tr0_15); \\\r\n        tr0_0 = _mm256_unpacklo_epi64(O0, O1); \\\r\n        tr0_1 = _mm256_unpacklo_epi64(O2, O3); \\\r\n        tr0_2 = _mm256_unpackhi_epi64(O0, O1); \\\r\n        tr0_3 = _mm256_unpackhi_epi64(O2, O3); \\\r\n        tr0_4 = _mm256_unpacklo_epi64(O4, O5); \\\r\n        tr0_5 = _mm256_unpacklo_epi64(O6, O7); \\\r\n        tr0_6 = _mm256_unpackhi_epi64(O4, O5); \\\r\n        tr0_7 = _mm256_unpackhi_epi64(O6, O7); \\\r\n        tr0_8 = _mm256_unpacklo_epi64(O8, O9); \\\r\n        tr0_9 = _mm256_unpacklo_epi64(O10, O11); \\\r\n        tr0_10 = _mm256_unpackhi_epi64(O8, O9); \\\r\n        tr0_11 = _mm256_unpackhi_epi64(O10, O11); \\\r\n        tr0_12 = _mm256_unpacklo_epi64(O12, O13); \\\r\n        tr0_13 = _mm256_unpacklo_epi64(O14, O15); \\\r\n        tr0_14 = 
_mm256_unpackhi_epi64(O12, O13); \\\r\n        tr0_15 = _mm256_unpackhi_epi64(O14, O15); \\\r\n        O0 = _mm256_permute2x128_si256(tr0_0, tr0_1, 0x20); \\\r\n        O1 = _mm256_permute2x128_si256(tr0_2, tr0_3, 0x20); \\\r\n        O2 = _mm256_permute2x128_si256(tr0_4, tr0_5, 0x20); \\\r\n        O3 = _mm256_permute2x128_si256(tr0_6, tr0_7, 0x20); \\\r\n        O4 = _mm256_permute2x128_si256(tr0_8, tr0_9, 0x20); \\\r\n        O5 = _mm256_permute2x128_si256(tr0_10, tr0_11, 0x20); \\\r\n        O6 = _mm256_permute2x128_si256(tr0_12, tr0_13, 0x20); \\\r\n        O7 = _mm256_permute2x128_si256(tr0_14, tr0_15, 0x20); \\\r\n        O8 = _mm256_permute2x128_si256(tr0_0, tr0_1, 0x31); \\\r\n        O9 = _mm256_permute2x128_si256(tr0_2, tr0_3, 0x31); \\\r\n        O10 = _mm256_permute2x128_si256(tr0_4, tr0_5, 0x31); \\\r\n        O11 = _mm256_permute2x128_si256(tr0_6, tr0_7, 0x31); \\\r\n        O12 = _mm256_permute2x128_si256(tr0_8, tr0_9, 0x31); \\\r\n        O13 = _mm256_permute2x128_si256(tr0_10, tr0_11, 0x31); \\\r\n        O14 = _mm256_permute2x128_si256(tr0_12, tr0_13, 0x31); \\\r\n        O15 = _mm256_permute2x128_si256(tr0_14, tr0_15, 0x31); \\\r\n\r\n            TRANSPOSE_16x16_16BIT(res00[0], res01[0], res02[0], res03[0], res04[0], res05[0], res06[0], res07[0], res08[0], res09[0], res10[0], res11[0], res12[0], res13[0], res14[0], res15[0], in00[0], in01[0], in02[0], in03[0], in04[0], in05[0], in06[0], in07[0], in08[0], in09[0], in10[0], in11[0], in12[0], in13[0], in14[0], in15[0])\r\n                TRANSPOSE_16x16_16BIT(res16[0], res17[0], res18[0], res19[0], res20[0], res21[0], res22[0], res23[0], res24[0], res25[0], res26[0], res27[0], res28[0], res29[0], res30[0], res31[0], in00[1], in01[1], in02[1], in03[1], in04[1], in05[1], in06[1], in07[1], in08[1], in09[1], in10[1], in11[1], in12[1], in13[1], in14[1], in15[1]);\r\n            TRANSPOSE_16x16_16BIT(res00[1], res01[1], res02[1], res03[1], res04[1], res05[1], res06[1], res07[1], res08[1], res09[1], 
res10[1], res11[1], res12[1], res13[1], res14[1], res15[1], in16[0], in17[0], in18[0], in19[0], in20[0], in21[0], in22[0], in23[0], in24[0], in25[0], in26[0], in27[0], in28[0], in29[0], in30[0], in31[0]);\r\n            TRANSPOSE_16x16_16BIT(res16[1], res17[1], res18[1], res19[1], res20[1], res21[1], res22[1], res23[1], res24[1], res25[1], res26[1], res27[1], res28[1], res29[1], res30[1], res31[1], in16[1], in17[1], in18[1], in19[1], in20[1], in21[1], in22[1], in23[1], in24[1], in25[1], in26[1], in27[1], in28[1], in29[1], in30[1], in31[1]);\r\n\r\n#undef  TRANSPOSE_16x16_16BIT\r\n\r\n        }\r\n\r\n        c32_rnd = _mm256_set1_epi32(shift ? (1 << (shift - 1)) : 0);                    // pass == 1: second pass\r\n        nShift = shift;\r\n    }\r\n\r\n    // clip\r\n    max_val = _mm256_set1_epi16((1 << (clip - 1)) - 1);\r\n    min_val = _mm256_set1_epi16(-(1 << (clip - 1)));\r\n\r\n    for (k = 0; k < 2; k++) {\r\n        in00[k] = _mm256_max_epi16(_mm256_min_epi16(in00[k], max_val), min_val);\r\n        in01[k] = _mm256_max_epi16(_mm256_min_epi16(in01[k], max_val), min_val);\r\n        in02[k] = _mm256_max_epi16(_mm256_min_epi16(in02[k], max_val), min_val);\r\n        in03[k] = _mm256_max_epi16(_mm256_min_epi16(in03[k], max_val), min_val);\r\n        in04[k] = _mm256_max_epi16(_mm256_min_epi16(in04[k], max_val), min_val);\r\n        in05[k] = _mm256_max_epi16(_mm256_min_epi16(in05[k], max_val), min_val);\r\n        in06[k] = _mm256_max_epi16(_mm256_min_epi16(in06[k], max_val), min_val);\r\n        in07[k] = _mm256_max_epi16(_mm256_min_epi16(in07[k], max_val), min_val);\r\n        in08[k] = _mm256_max_epi16(_mm256_min_epi16(in08[k], max_val), min_val);\r\n        in09[k] = _mm256_max_epi16(_mm256_min_epi16(in09[k], max_val), min_val);\r\n        in10[k] = _mm256_max_epi16(_mm256_min_epi16(in10[k], max_val), min_val);\r\n        in11[k] = _mm256_max_epi16(_mm256_min_epi16(in11[k], max_val), min_val);\r\n        in12[k] = _mm256_max_epi16(_mm256_min_epi16(in12[k], max_val), 
min_val);\r\n        in13[k] = _mm256_max_epi16(_mm256_min_epi16(in13[k], max_val), min_val);\r\n        in14[k] = _mm256_max_epi16(_mm256_min_epi16(in14[k], max_val), min_val);\r\n        in15[k] = _mm256_max_epi16(_mm256_min_epi16(in15[k], max_val), min_val);\r\n        in16[k] = _mm256_max_epi16(_mm256_min_epi16(in16[k], max_val), min_val);\r\n        in17[k] = _mm256_max_epi16(_mm256_min_epi16(in17[k], max_val), min_val);\r\n        in18[k] = _mm256_max_epi16(_mm256_min_epi16(in18[k], max_val), min_val);\r\n        in19[k] = _mm256_max_epi16(_mm256_min_epi16(in19[k], max_val), min_val);\r\n        in20[k] = _mm256_max_epi16(_mm256_min_epi16(in20[k], max_val), min_val);\r\n        in21[k] = _mm256_max_epi16(_mm256_min_epi16(in21[k], max_val), min_val);\r\n        in22[k] = _mm256_max_epi16(_mm256_min_epi16(in22[k], max_val), min_val);\r\n        in23[k] = _mm256_max_epi16(_mm256_min_epi16(in23[k], max_val), min_val);\r\n        in24[k] = _mm256_max_epi16(_mm256_min_epi16(in24[k], max_val), min_val);\r\n        in25[k] = _mm256_max_epi16(_mm256_min_epi16(in25[k], max_val), min_val);\r\n        in26[k] = _mm256_max_epi16(_mm256_min_epi16(in26[k], max_val), min_val);\r\n        in27[k] = _mm256_max_epi16(_mm256_min_epi16(in27[k], max_val), min_val);\r\n        in28[k] = _mm256_max_epi16(_mm256_min_epi16(in28[k], max_val), min_val);\r\n        in29[k] = _mm256_max_epi16(_mm256_min_epi16(in29[k], max_val), min_val);\r\n        in30[k] = _mm256_max_epi16(_mm256_min_epi16(in30[k], max_val), min_val);\r\n        in31[k] = _mm256_max_epi16(_mm256_min_epi16(in31[k], max_val), min_val);\r\n    }\r\n\r\n\r\n    // Store\r\n    for (i = 0; i < 2; i++) {\r\n        const int offset = (i << 4);\r\n        _mm256_storeu_si256((__m256i*)&dst[0 * 32 + offset], in00[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[1 * 32 + offset], in01[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[2 * 32 + offset], in02[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[3 * 32 + offset], 
in03[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[4 * 32 + offset], in04[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[5 * 32 + offset], in05[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[6 * 32 + offset], in06[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[7 * 32 + offset], in07[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[8 * 32 + offset], in08[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[9 * 32 + offset], in09[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[10 * 32 + offset], in10[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[11 * 32 + offset], in11[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[12 * 32 + offset], in12[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[13 * 32 + offset], in13[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[14 * 32 + offset], in14[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[15 * 32 + offset], in15[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[16 * 32 + offset], in16[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[17 * 32 + offset], in17[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[18 * 32 + offset], in18[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[19 * 32 + offset], in19[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[20 * 32 + offset], in20[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[21 * 32 + offset], in21[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[22 * 32 + offset], in22[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[23 * 32 + offset], in23[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[24 * 32 + offset], in24[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[25 * 32 + offset], in25[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[26 * 32 + offset], in26[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[27 * 32 + offset], in27[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[28 * 32 + offset], in28[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[29 * 32 + offset], in29[i]);\r\n        
_mm256_storeu_si256((__m256i*)&dst[30 * 32 + offset], in30[i]);\r\n        _mm256_storeu_si256((__m256i*)&dst[31 * 32 + offset], in31[i]);\r\n    }\r\n\r\n}\r\n\r\n\r\n\r\n#define TRANSPOSE_8x8_16BIT_m256i(I0, I1, I2, I3, I4, I5, I6, I7, O0, O1, O2, O3, O4, O5, O6, O7) \\\r\n        tr0_0 = _mm256_unpacklo_epi16(I0, I1); \\\r\n        tr0_1 = _mm256_unpacklo_epi16(I2, I3); \\\r\n        tr0_2 = _mm256_unpackhi_epi16(I0, I1); \\\r\n        tr0_3 = _mm256_unpackhi_epi16(I2, I3); \\\r\n        tr0_4 = _mm256_unpacklo_epi16(I4, I5); \\\r\n        tr0_5 = _mm256_unpacklo_epi16(I6, I7); \\\r\n        tr0_6 = _mm256_unpackhi_epi16(I4, I5); \\\r\n        tr0_7 = _mm256_unpackhi_epi16(I6, I7); \\\r\n        tr1_0 = _mm256_unpacklo_epi32(tr0_0, tr0_1); \\\r\n        tr1_1 = _mm256_unpacklo_epi32(tr0_2, tr0_3); \\\r\n        tr1_2 = _mm256_unpackhi_epi32(tr0_0, tr0_1); \\\r\n        tr1_3 = _mm256_unpackhi_epi32(tr0_2, tr0_3); \\\r\n        tr1_4 = _mm256_unpacklo_epi32(tr0_4, tr0_5); \\\r\n        tr1_5 = _mm256_unpacklo_epi32(tr0_6, tr0_7); \\\r\n        tr1_6 = _mm256_unpackhi_epi32(tr0_4, tr0_5); \\\r\n        tr1_7 = _mm256_unpackhi_epi32(tr0_6, tr0_7); \\\r\n        O0 = _mm256_unpacklo_epi64(tr1_0, tr1_4); \\\r\n        O1 = _mm256_unpackhi_epi64(tr1_0, tr1_4); \\\r\n        O2 = _mm256_unpacklo_epi64(tr1_2, tr1_6); \\\r\n        O3 = _mm256_unpackhi_epi64(tr1_2, tr1_6); \\\r\n        O4 = _mm256_unpacklo_epi64(tr1_1, tr1_5); \\\r\n        O5 = _mm256_unpackhi_epi64(tr1_1, tr1_5); \\\r\n        O6 = _mm256_unpacklo_epi64(tr1_3, tr1_7); \\\r\n        O7 = _mm256_unpackhi_epi64(tr1_3, tr1_7);\r\n\r\n#define TRANSPOSE_16x16_16BIT_m256i(I0,\tI1,\tI2,\tI3,\tI4,\tI5,\tI6,\tI7,\tI8,\tI9,\tI10, I11, I12, I13, I14, I15, O0, O1, O2, O3, O4, O5, O6, O7, O8, O9, O10, O11,\tO12, O13, O14, O15) \\\r\n        TRANSPOSE_8x8_16BIT_m256i(I0, I1, I2, I3, I4, I5, I6, I7, t0, t1, t2, t3, t4, t5, t6, t7); \\\r\n        TRANSPOSE_8x8_16BIT_m256i(I8, I9, I10, I11, I12, I13, I14, I15, t8, t9, 
t10, t11, t12, t13, t14, t15); \\\r\n        O0 = _mm256_permute2x128_si256(t0, t8, 0x20); \\\r\n        O1 = _mm256_permute2x128_si256(t1, t9, 0x20); \\\r\n        O2 = _mm256_permute2x128_si256(t2, t10, 0x20); \\\r\n        O3 = _mm256_permute2x128_si256(t3, t11, 0x20); \\\r\n        O4 = _mm256_permute2x128_si256(t4, t12, 0x20); \\\r\n        O5 = _mm256_permute2x128_si256(t5, t13, 0x20); \\\r\n        O6 = _mm256_permute2x128_si256(t6, t14, 0x20); \\\r\n        O7 = _mm256_permute2x128_si256(t7, t15, 0x20); \\\r\n        O8 = _mm256_permute2x128_si256(t0, t8, 0x31); \\\r\n        O9 = _mm256_permute2x128_si256(t1, t9, 0x31); \\\r\n        O10 = _mm256_permute2x128_si256(t2, t10, 0x31); \\\r\n        O11 = _mm256_permute2x128_si256(t3, t11, 0x31); \\\r\n        O12 = _mm256_permute2x128_si256(t4, t12, 0x31); \\\r\n        O13 = _mm256_permute2x128_si256(t5, t13, 0x31); \\\r\n        O14 = _mm256_permute2x128_si256(t6, t14, 0x31); \\\r\n        O15 = _mm256_permute2x128_si256(t7, t15, 0x31);\r\n\r\n//inv_wavelet_64x16_sse128\r\nvoid inv_wavelet_64x16_avx2(coeff_t *coeff)\r\n{\r\n    int i;\r\n\r\n    __m256i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n    __m256i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n    __m256i\tt0, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15;\r\n\r\n    // 64*16\r\n    __m256i T00[4], T01[4], T02[4], T03[4], T04[4], T05[4], T06[4], T07[4], T08[4], T09[4], T10[4], T11[4], T12[4], T13[4], T14[4], T15[4];\r\n\r\n    // 16*64\r\n    __m256i V00, V01, V02, V03, V04, V05, V06, V07, V08, V09, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, V61, V62, V63;\r\n\r\n    /*--vertical transform--*/\r\n    //32*8, LOAD AND SHIFT\r\n    T00[0] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[0 
+ 32 * 0]), 1);\r\n    T01[0] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[0 + 32 * 1]), 1);\r\n    T02[0] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[0 + 32 * 2]), 1);\r\n    T03[0] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[0 + 32 * 3]), 1);\r\n    T04[0] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[0 + 32 * 4]), 1);\r\n    T05[0] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[0 + 32 * 5]), 1);\r\n    T06[0] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[0 + 32 * 6]), 1);\r\n    T07[0] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[0 + 32 * 7]), 1);\r\n\r\n    T00[1] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 + 32 * 0]), 1);\r\n    T01[1] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 + 32 * 1]), 1);\r\n    T02[1] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 + 32 * 2]), 1);\r\n    T03[1] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 + 32 * 3]), 1);\r\n    T04[1] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 + 32 * 4]), 1);\r\n    T05[1] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 + 32 * 5]), 1);\r\n    T06[1] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 + 32 * 6]), 1);\r\n    T07[1] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 + 32 * 7]), 1);\r\n\r\n    //filter (odd pixel/row)\r\n    T08[0] = _mm256_srai_epi16(_mm256_add_epi16(T00[0], T01[0]), 1);\r\n    T09[0] = _mm256_srai_epi16(_mm256_add_epi16(T01[0], T02[0]), 1);\r\n    T10[0] = _mm256_srai_epi16(_mm256_add_epi16(T02[0], T03[0]), 1);\r\n    T11[0] = _mm256_srai_epi16(_mm256_add_epi16(T03[0], T04[0]), 1);\r\n    T12[0] = _mm256_srai_epi16(_mm256_add_epi16(T04[0], T05[0]), 1);\r\n    T13[0] = _mm256_srai_epi16(_mm256_add_epi16(T05[0], T06[0]), 1);\r\n    T14[0] = _mm256_srai_epi16(_mm256_add_epi16(T06[0], T07[0]), 1);\r\n    T15[0] = _mm256_srai_epi16(_mm256_add_epi16(T07[0], T07[0]), 1);\r\n\r\n    T08[1] = 
_mm256_srai_epi16(_mm256_add_epi16(T00[1], T01[1]), 1);\r\n    T09[1] = _mm256_srai_epi16(_mm256_add_epi16(T01[1], T02[1]), 1);\r\n    T10[1] = _mm256_srai_epi16(_mm256_add_epi16(T02[1], T03[1]), 1);\r\n    T11[1] = _mm256_srai_epi16(_mm256_add_epi16(T03[1], T04[1]), 1);\r\n    T12[1] = _mm256_srai_epi16(_mm256_add_epi16(T04[1], T05[1]), 1);\r\n    T13[1] = _mm256_srai_epi16(_mm256_add_epi16(T05[1], T06[1]), 1);\r\n    T14[1] = _mm256_srai_epi16(_mm256_add_epi16(T06[1], T07[1]), 1);\r\n    T15[1] = _mm256_srai_epi16(_mm256_add_epi16(T07[1], T07[1]), 1);\r\n\r\n    /*--transposition--*/\r\n    //32x16 -> 16x32\r\n    TRANSPOSE_16x16_16BIT_m256i(T00[0], T08[0], T01[0], T09[0], T02[0], T10[0], T03[0], T11[0], T04[0], T12[0], T05[0], T13[0], T06[0], T14[0], T07[0], T15[0], V00, V01, V02, V03, V04, V05, V06, V07, V08, V09, V10, V11, V12, V13, V14, V15);\r\n    TRANSPOSE_16x16_16BIT_m256i(T00[1], T08[1], T01[1], T09[1], T02[1], T10[1], T03[1], T11[1], T04[1], T12[1], T05[1], T13[1], T06[1], T14[1], T07[1], T15[1], V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31);\r\n\r\n    /*--horizontal transform--*/\r\n    //filter (odd pixel/column)\r\n    V32 = _mm256_srai_epi16(_mm256_add_epi16(V00, V01), 1);\r\n    V33 = _mm256_srai_epi16(_mm256_add_epi16(V01, V02), 1);\r\n    V34 = _mm256_srai_epi16(_mm256_add_epi16(V02, V03), 1);\r\n    V35 = _mm256_srai_epi16(_mm256_add_epi16(V03, V04), 1);\r\n    V36 = _mm256_srai_epi16(_mm256_add_epi16(V04, V05), 1);\r\n    V37 = _mm256_srai_epi16(_mm256_add_epi16(V05, V06), 1);\r\n    V38 = _mm256_srai_epi16(_mm256_add_epi16(V06, V07), 1);\r\n    V39 = _mm256_srai_epi16(_mm256_add_epi16(V07, V08), 1);\r\n    V40 = _mm256_srai_epi16(_mm256_add_epi16(V08, V09), 1);\r\n    V41 = _mm256_srai_epi16(_mm256_add_epi16(V09, V10), 1);\r\n    V42 = _mm256_srai_epi16(_mm256_add_epi16(V10, V11), 1);\r\n    V43 = _mm256_srai_epi16(_mm256_add_epi16(V11, V12), 1);\r\n    V44 = _mm256_srai_epi16(_mm256_add_epi16(V12, V13), 
1);\r\n    V45 = _mm256_srai_epi16(_mm256_add_epi16(V13, V14), 1);\r\n    V46 = _mm256_srai_epi16(_mm256_add_epi16(V14, V15), 1);\r\n    V47 = _mm256_srai_epi16(_mm256_add_epi16(V15, V16), 1);\r\n\r\n    V48 = _mm256_srai_epi16(_mm256_add_epi16(V16, V17), 1);\r\n    V49 = _mm256_srai_epi16(_mm256_add_epi16(V17, V18), 1);\r\n    V50 = _mm256_srai_epi16(_mm256_add_epi16(V18, V19), 1);\r\n    V51 = _mm256_srai_epi16(_mm256_add_epi16(V19, V20), 1);\r\n    V52 = _mm256_srai_epi16(_mm256_add_epi16(V20, V21), 1);\r\n    V53 = _mm256_srai_epi16(_mm256_add_epi16(V21, V22), 1);\r\n    V54 = _mm256_srai_epi16(_mm256_add_epi16(V22, V23), 1);\r\n    V55 = _mm256_srai_epi16(_mm256_add_epi16(V23, V24), 1);\r\n    V56 = _mm256_srai_epi16(_mm256_add_epi16(V24, V25), 1);\r\n    V57 = _mm256_srai_epi16(_mm256_add_epi16(V25, V26), 1);\r\n    V58 = _mm256_srai_epi16(_mm256_add_epi16(V26, V27), 1);\r\n    V59 = _mm256_srai_epi16(_mm256_add_epi16(V27, V28), 1);\r\n    V60 = _mm256_srai_epi16(_mm256_add_epi16(V28, V29), 1);\r\n    V61 = _mm256_srai_epi16(_mm256_add_epi16(V29, V30), 1);\r\n    V62 = _mm256_srai_epi16(_mm256_add_epi16(V30, V31), 1);\r\n    V63 = _mm256_srai_epi16(_mm256_add_epi16(V31, V31), 1);\r\n\r\n    /*--transposition & Store--*/\r\n    //16x64 -> 64x16\r\n    TRANSPOSE_16x16_16BIT_m256i(V00, V32, V01, V33, V02, V34, V03, V35, V04, V36, V05, V37, V06, V38, V07, V39, T00[0], T01[0], T02[0], T03[0], T04[0], T05[0], T06[0], T07[0], T08[0], T09[0], T10[0], T11[0], T12[0], T13[0], T14[0], T15[0]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V08, V40, V09, V41, V10, V42, V11, V43, V12, V44, V13, V45, V14, V46, V15, V47, T00[1], T01[1], T02[1], T03[1], T04[1], T05[1], T06[1], T07[1], T08[1], T09[1], T10[1], T11[1], T12[1], T13[1], T14[1], T15[1]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V16, V48, V17, V49, V18, V50, V19, V51, V20, V52, V21, V53, V22, V54, V23, V55, T00[2], T01[2], T02[2], T03[2], T04[2], T05[2], T06[2], T07[2], T08[2], T09[2], T10[2], T11[2], T12[2], T13[2], T14[2], 
T15[2]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V24, V56, V25, V57, V26, V58, V27, V59, V28, V60, V29, V61, V30, V62, V31, V63, T00[3], T01[3], T02[3], T03[3], T04[3], T05[3], T06[3], T07[3], T08[3], T09[3], T10[3], T11[3], T12[3], T13[3], T14[3], T15[3]);\r\n\r\n    //store\r\n    for (i = 0; i < 4; i++) {\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i], T00[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64], T01[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 2], T02[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 3], T03[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 4], T04[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 5], T05[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 6], T06[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 7], T07[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 8], T08[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 9], T09[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 10], T10[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 11], T11[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 12], T12[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 13], T13[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 14], T14[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 15], T15[i]);\r\n    }\r\n}\r\n\r\n\r\nvoid inv_wavelet_16x64_avx2(coeff_t *coeff)\r\n{\r\n    //src blk 8*32\r\n\r\n    __m256i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n    __m256i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n    __m256i\tt0, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15;\r\n\r\n    __m256i S00, S01, S02, S03, S04, S05, S06, S07, S08, S09, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S21, S22, S23, S24, S25, S26, S27, S28, S29, S30, 
S31;\r\n    __m256i S32, S33, S34, S35, S36, S37, S38, S39, S40, S41, S42, S43, S44, S45, S46, S47, S48, S49, S50, S51, S52, S53, S54, S55, S56, S57, S58, S59, S60, S61, S62, S63;\r\n\r\n    // 64*16\r\n    __m256i TT00[8], TT01[8], TT02[8], TT03[8], TT04[8], TT05[8], TT06[8], TT07[8];\r\n    __m256i T00[4], T01[4], T02[4], T03[4], T04[4], T05[4], T06[4], T07[4], T08[4], T09[4], T10[4], T11[4], T12[4], T13[4], T14[4], T15[4];\r\n\r\n    // 16*64\r\n    __m256i V00, V01, V02, V03, V04, V05, V06, V07, V08, V09, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, V61, V62, V63;\r\n\r\n    int i;\r\n    /*--load & shift--*/\r\n    //8*32\r\n    S00 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 0]), 1);\r\n    S01 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 1]), 1);\r\n    S02 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 2]), 1);\r\n    S03 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 3]), 1);\r\n    S04 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 4]), 1);\r\n    S05 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 5]), 1);\r\n    S06 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 6]), 1);\r\n    S07 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 7]), 1);\r\n    S08 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 8]), 1);\r\n    S09 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 9]), 1);\r\n    S10 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 10]), 1);\r\n    S11 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 11]), 1);\r\n    S12 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 12]), 1);\r\n    S13 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 13]), 1);\r\n    S14 = 
_mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 14]), 1);\r\n    S15 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 15]), 1);\r\n    S16 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 16]), 1);\r\n    S17 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 17]), 1);\r\n    S18 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 18]), 1);\r\n    S19 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 19]), 1);\r\n    S20 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 20]), 1);\r\n    S21 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 21]), 1);\r\n    S22 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 22]), 1);\r\n    S23 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 23]), 1);\r\n    S24 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 24]), 1);\r\n    S25 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 25]), 1);\r\n    S26 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 26]), 1);\r\n    S27 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 27]), 1);\r\n    S28 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 28]), 1);\r\n    S29 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 29]), 1);\r\n    S30 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 30]), 1);\r\n    S31 = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[8 * 31]), 1);\r\n\r\n    /*--vertical transform--*/\r\n    S32 = _mm256_srai_epi16(_mm256_add_epi16(S00, S01), 1);\r\n    S33 = _mm256_srai_epi16(_mm256_add_epi16(S01, S02), 1);\r\n    S34 = _mm256_srai_epi16(_mm256_add_epi16(S02, S03), 1);\r\n    S35 = _mm256_srai_epi16(_mm256_add_epi16(S03, S04), 1);\r\n    S36 = _mm256_srai_epi16(_mm256_add_epi16(S04, S05), 1);\r\n    S37 = _mm256_srai_epi16(_mm256_add_epi16(S05, S06), 1);\r\n    S38 = _mm256_srai_epi16(_mm256_add_epi16(S06, S07), 1);\r\n    S39 = _mm256_srai_epi16(_mm256_add_epi16(S07, S08), 1);\r\n    S40 = 
_mm256_srai_epi16(_mm256_add_epi16(S08, S09), 1);\r\n    S41 = _mm256_srai_epi16(_mm256_add_epi16(S09, S10), 1);\r\n    S42 = _mm256_srai_epi16(_mm256_add_epi16(S10, S11), 1);\r\n    S43 = _mm256_srai_epi16(_mm256_add_epi16(S11, S12), 1);\r\n    S44 = _mm256_srai_epi16(_mm256_add_epi16(S12, S13), 1);\r\n    S45 = _mm256_srai_epi16(_mm256_add_epi16(S13, S14), 1);\r\n    S46 = _mm256_srai_epi16(_mm256_add_epi16(S14, S15), 1);\r\n    S47 = _mm256_srai_epi16(_mm256_add_epi16(S15, S16), 1);\r\n    S48 = _mm256_srai_epi16(_mm256_add_epi16(S16, S17), 1);\r\n    S49 = _mm256_srai_epi16(_mm256_add_epi16(S17, S18), 1);\r\n    S50 = _mm256_srai_epi16(_mm256_add_epi16(S18, S19), 1);\r\n    S51 = _mm256_srai_epi16(_mm256_add_epi16(S19, S20), 1);\r\n    S52 = _mm256_srai_epi16(_mm256_add_epi16(S20, S21), 1);\r\n    S53 = _mm256_srai_epi16(_mm256_add_epi16(S21, S22), 1);\r\n    S54 = _mm256_srai_epi16(_mm256_add_epi16(S22, S23), 1);\r\n    S55 = _mm256_srai_epi16(_mm256_add_epi16(S23, S24), 1);\r\n    S56 = _mm256_srai_epi16(_mm256_add_epi16(S24, S25), 1);\r\n    S57 = _mm256_srai_epi16(_mm256_add_epi16(S25, S26), 1);\r\n    S58 = _mm256_srai_epi16(_mm256_add_epi16(S26, S27), 1);\r\n    S59 = _mm256_srai_epi16(_mm256_add_epi16(S27, S28), 1);\r\n    S60 = _mm256_srai_epi16(_mm256_add_epi16(S28, S29), 1);\r\n    S61 = _mm256_srai_epi16(_mm256_add_epi16(S29, S30), 1);\r\n    S62 = _mm256_srai_epi16(_mm256_add_epi16(S30, S31), 1);\r\n    S63 = _mm256_srai_epi16(_mm256_add_epi16(S31, S31), 1);\r\n\r\n    /*--transposition--*/\r\n    //8x64 -> 64x8\r\n    TRANSPOSE_8x8_16BIT_m256i(S00, S32, S01, S33, S02, S34, S03, S35, TT00[0], TT01[0], TT02[0], TT03[0], TT04[0], TT05[0], TT06[0], TT07[0]);\r\n    TRANSPOSE_8x8_16BIT_m256i(S04, S36, S05, S37, S06, S38, S07, S39, TT00[1], TT01[1], TT02[1], TT03[1], TT04[1], TT05[1], TT06[1], TT07[1]);\r\n    TRANSPOSE_8x8_16BIT_m256i(S08, S40, S09, S41, S10, S42, S11, S43, TT00[2], TT01[2], TT02[2], TT03[2], TT04[2], TT05[2], TT06[2], TT07[2]);\r\n    
TRANSPOSE_8x8_16BIT_m256i(S12, S44, S13, S45, S14, S46, S15, S47, TT00[3], TT01[3], TT02[3], TT03[3], TT04[3], TT05[3], TT06[3], TT07[3]);\r\n    TRANSPOSE_8x8_16BIT_m256i(S16, S48, S17, S49, S18, S50, S19, S51, TT00[4], TT01[4], TT02[4], TT03[4], TT04[4], TT05[4], TT06[4], TT07[4]);\r\n    TRANSPOSE_8x8_16BIT_m256i(S20, S52, S21, S53, S22, S54, S23, S55, TT00[5], TT01[5], TT02[5], TT03[5], TT04[5], TT05[5], TT06[5], TT07[5]);\r\n    TRANSPOSE_8x8_16BIT_m256i(S24, S56, S25, S57, S26, S58, S27, S59, TT00[6], TT01[6], TT02[6], TT03[6], TT04[6], TT05[6], TT06[6], TT07[6]);\r\n    TRANSPOSE_8x8_16BIT_m256i(S28, S60, S29, S61, S30, S62, S31, S63, TT00[7], TT01[7], TT02[7], TT03[7], TT04[7], TT05[7], TT06[7], TT07[7]);\r\n\r\n    T00[0] = _mm256_permute2x128_si256(TT00[0], TT00[1], 0x20);\r\n    T00[1] = _mm256_permute2x128_si256(TT00[2], TT00[3], 0x20);\r\n    T00[2] = _mm256_permute2x128_si256(TT00[4], TT00[5], 0x20);\r\n    T00[3] = _mm256_permute2x128_si256(TT00[6], TT00[7], 0x20);\r\n    T01[0] = _mm256_permute2x128_si256(TT01[0], TT01[1], 0x20);\r\n    T01[1] = _mm256_permute2x128_si256(TT01[2], TT01[3], 0x20);\r\n    T01[2] = _mm256_permute2x128_si256(TT01[4], TT01[5], 0x20);\r\n    T01[3] = _mm256_permute2x128_si256(TT01[6], TT01[7], 0x20);\r\n    T02[0] = _mm256_permute2x128_si256(TT02[0], TT02[1], 0x20);\r\n    T02[1] = _mm256_permute2x128_si256(TT02[2], TT02[3], 0x20);\r\n    T02[2] = _mm256_permute2x128_si256(TT02[4], TT02[5], 0x20);\r\n    T02[3] = _mm256_permute2x128_si256(TT02[6], TT02[7], 0x20);\r\n    T03[0] = _mm256_permute2x128_si256(TT03[0], TT03[1], 0x20);\r\n    T03[1] = _mm256_permute2x128_si256(TT03[2], TT03[3], 0x20);\r\n    T03[2] = _mm256_permute2x128_si256(TT03[4], TT03[5], 0x20);\r\n    T03[3] = _mm256_permute2x128_si256(TT03[6], TT03[7], 0x20);\r\n\r\n    T04[0] = _mm256_permute2x128_si256(TT04[0], TT04[1], 0x20);\r\n    T04[1] = _mm256_permute2x128_si256(TT04[2], TT04[3], 0x20);\r\n    T04[2] = _mm256_permute2x128_si256(TT04[4], TT04[5], 
0x20);\r\n    T04[3] = _mm256_permute2x128_si256(TT04[6], TT04[7], 0x20);\r\n    T05[0] = _mm256_permute2x128_si256(TT05[0], TT05[1], 0x20);\r\n    T05[1] = _mm256_permute2x128_si256(TT05[2], TT05[3], 0x20);\r\n    T05[2] = _mm256_permute2x128_si256(TT05[4], TT05[5], 0x20);\r\n    T05[3] = _mm256_permute2x128_si256(TT05[6], TT05[7], 0x20);\r\n    T06[0] = _mm256_permute2x128_si256(TT06[0], TT06[1], 0x20);\r\n    T06[1] = _mm256_permute2x128_si256(TT06[2], TT06[3], 0x20);\r\n    T06[2] = _mm256_permute2x128_si256(TT06[4], TT06[5], 0x20);\r\n    T06[3] = _mm256_permute2x128_si256(TT06[6], TT06[7], 0x20);\r\n    T07[0] = _mm256_permute2x128_si256(TT07[0], TT07[1], 0x20);\r\n    T07[1] = _mm256_permute2x128_si256(TT07[2], TT07[3], 0x20);\r\n    T07[2] = _mm256_permute2x128_si256(TT07[4], TT07[5], 0x20);\r\n    T07[3] = _mm256_permute2x128_si256(TT07[6], TT07[7], 0x20);\r\n\r\n    /*--horizontal transform--*/\r\n    for (i = 0; i < 4; i++) {\r\n        T08[i] = _mm256_srai_epi16(_mm256_add_epi16(T00[i], T01[i]), 1);\r\n        T09[i] = _mm256_srai_epi16(_mm256_add_epi16(T01[i], T02[i]), 1);\r\n        T10[i] = _mm256_srai_epi16(_mm256_add_epi16(T02[i], T03[i]), 1);\r\n        T11[i] = _mm256_srai_epi16(_mm256_add_epi16(T03[i], T04[i]), 1);\r\n        T12[i] = _mm256_srai_epi16(_mm256_add_epi16(T04[i], T05[i]), 1);\r\n        T13[i] = _mm256_srai_epi16(_mm256_add_epi16(T05[i], T06[i]), 1);\r\n        T14[i] = _mm256_srai_epi16(_mm256_add_epi16(T06[i], T07[i]), 1);\r\n        T15[i] = _mm256_srai_epi16(_mm256_add_epi16(T07[i], T07[i]), 1);\r\n    }\r\n\r\n    /*--transposition--*/\r\n    //64x16 -> 16x64\r\n    TRANSPOSE_16x16_16BIT_m256i(T00[0], T08[0], T01[0], T09[0], T02[0], T10[0], T03[0], T11[0], T04[0], T12[0], T05[0], T13[0], T06[0], T14[0], T07[0], T15[0], V00, V01, V02, V03, V04, V05, V06, V07, V08, V09, V10, V11, V12, V13, V14, V15);\r\n    TRANSPOSE_16x16_16BIT_m256i(T00[1], T08[1], T01[1], T09[1], T02[1], T10[1], T03[1], T11[1], T04[1], T12[1], T05[1], T13[1], 
T06[1], T14[1], T07[1], T15[1], V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31);\r\n    TRANSPOSE_16x16_16BIT_m256i(T00[2], T08[2], T01[2], T09[2], T02[2], T10[2], T03[2], T11[2], T04[2], T12[2], T05[2], T13[2], T06[2], T14[2], T07[2], T15[2], V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47);\r\n    TRANSPOSE_16x16_16BIT_m256i(T00[3], T08[3], T01[3], T09[3], T02[3], T10[3], T03[3], T11[3], T04[3], T12[3], T05[3], T13[3], T06[3], T14[3], T07[3], T15[3], V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, V61, V62, V63);\r\n\r\n    /*--Store--*/\r\n    //16x64\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 0], V00);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 1], V01);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 2], V02);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 3], V03);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 4], V04);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 5], V05);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 6], V06);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 7], V07);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 8], V08);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 9], V09);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 10], V10);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 11], V11);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 12], V12);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 13], V13);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 14], V14);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 15], V15);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 16], V16);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 17], V17);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 18], V18);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 19], V19);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 20], V20);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 21], V21);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 
22], V22);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 23], V23);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 24], V24);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 25], V25);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 26], V26);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 27], V27);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 28], V28);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 29], V29);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 30], V30);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 31], V31);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 32], V32);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 33], V33);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 34], V34);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 35], V35);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 36], V36);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 37], V37);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 38], V38);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 39], V39);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 40], V40);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 41], V41);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 42], V42);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 43], V43);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 44], V44);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 45], V45);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 46], V46);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 47], V47);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 48], V48);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 49], V49);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 50], V50);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 51], V51);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 52], V52);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 53], V53);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 54], V54);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 55], V55);\r\n    
_mm256_storeu_si256((__m256i*)&coeff[16 * 56], V56);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 57], V57);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 58], V58);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 59], V59);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 60], V60);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 61], V61);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 62], V62);\r\n    _mm256_storeu_si256((__m256i*)&coeff[16 * 63], V63);\r\n}\r\n\r\n\r\nvoid inv_wavelet_64x64_avx2(coeff_t *coeff)\r\n{\r\n    int i;\r\n\r\n    __m256i tr0_0, tr0_1, tr0_2, tr0_3, tr0_4, tr0_5, tr0_6, tr0_7;\r\n    __m256i tr1_0, tr1_1, tr1_2, tr1_3, tr1_4, tr1_5, tr1_6, tr1_7;\r\n    __m256i\tt0, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15;\r\n\r\n    // 64*64\r\n    __m256i T00[4], T01[4], T02[4], T03[4], T04[4], T05[4], T06[4], T07[4], T08[4], T09[4], T10[4], T11[4], T12[4], T13[4], T14[4], T15[4], T16[4], T17[4], T18[4], T19[4], T20[4], T21[4], T22[4], T23[4], T24[4], T25[4], T26[4], T27[4], T28[4], T29[4], T30[4], T31[4], T32[4], T33[4], T34[4], T35[4], T36[4], T37[4], T38[4], T39[4], T40[4], T41[4], T42[4], T43[4], T44[4], T45[4], T46[4], T47[4], T48[4], T49[4], T50[4], T51[4], T52[4], T53[4], T54[4], T55[4], T56[4], T57[4], T58[4], T59[4], T60[4], T61[4], T62[4], T63[4];\r\n\r\n    // 64*64\r\n    __m256i V00[4], V01[4], V02[4], V03[4], V04[4], V05[4], V06[4], V07[4], V08[4], V09[4], V10[4], V11[4], V12[4], V13[4], V14[4], V15[4], V16[4], V17[4], V18[4], V19[4], V20[4], V21[4], V22[4], V23[4], V24[4], V25[4], V26[4], V27[4], V28[4], V29[4], V30[4], V31[4], V32[4], V33[4], V34[4], V35[4], V36[4], V37[4], V38[4], V39[4], V40[4], V41[4], V42[4], V43[4], V44[4], V45[4], V46[4], V47[4], V48[4], V49[4], V50[4], V51[4], V52[4], V53[4], V54[4], V55[4], V56[4], V57[4], V58[4], V59[4], V60[4], V61[4], V62[4], V63[4];\r\n\r\n    /*--vertical transform--*/\r\n    //32*32, LOAD AND SHIFT\r\n    for (i = 0; i < 2; i++) {\r\n        T00[i] = 
_mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 0]), 1);\r\n        T01[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 1]), 1);\r\n        T02[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 2]), 1);\r\n        T03[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 3]), 1);\r\n        T04[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 4]), 1);\r\n        T05[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 5]), 1);\r\n        T06[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 6]), 1);\r\n        T07[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 7]), 1);\r\n\r\n        T08[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 8]), 1);\r\n        T09[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 9]), 1);\r\n        T10[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 10]), 1);\r\n        T11[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 11]), 1);\r\n        T12[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 12]), 1);\r\n        T13[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 13]), 1);\r\n        T14[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 14]), 1);\r\n        T15[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 15]), 1);\r\n\r\n        T16[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 16]), 1);\r\n        T17[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 17]), 1);\r\n        T18[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 18]), 1);\r\n        T19[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 19]), 1);\r\n        T20[i] = 
_mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 20]), 1);\r\n        T21[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 21]), 1);\r\n        T22[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 22]), 1);\r\n        T23[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 23]), 1);\r\n\r\n        T24[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 24]), 1);\r\n        T25[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 25]), 1);\r\n        T26[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 26]), 1);\r\n        T27[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 27]), 1);\r\n        T28[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 28]), 1);\r\n        T29[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 29]), 1);\r\n        T30[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 30]), 1);\r\n        T31[i] = _mm256_srai_epi16(_mm256_loadu_si256((__m256i*)&coeff[16 * i + 32 * 31]), 1);\r\n    }\r\n\r\n    //filter (odd pixel/row)\r\n    for (i = 0; i < 2; i++) {   // only elements [0] and [1] of T00..T31 are loaded above\r\n        T32[i] = _mm256_srai_epi16(_mm256_add_epi16(T00[i], T01[i]), 1);\r\n        T33[i] = _mm256_srai_epi16(_mm256_add_epi16(T01[i], T02[i]), 1);\r\n        T34[i] = _mm256_srai_epi16(_mm256_add_epi16(T02[i], T03[i]), 1);\r\n        T35[i] = _mm256_srai_epi16(_mm256_add_epi16(T03[i], T04[i]), 1);\r\n        T36[i] = _mm256_srai_epi16(_mm256_add_epi16(T04[i], T05[i]), 1);\r\n        T37[i] = _mm256_srai_epi16(_mm256_add_epi16(T05[i], T06[i]), 1);\r\n        T38[i] = _mm256_srai_epi16(_mm256_add_epi16(T06[i], T07[i]), 1);\r\n        T39[i] = _mm256_srai_epi16(_mm256_add_epi16(T07[i], T08[i]), 1);\r\n\r\n        T40[i] = _mm256_srai_epi16(_mm256_add_epi16(T08[i], T09[i]), 1);\r\n        T41[i] = _mm256_srai_epi16(_mm256_add_epi16(T09[i], T10[i]), 
1);\r\n        T42[i] = _mm256_srai_epi16(_mm256_add_epi16(T10[i], T11[i]), 1);\r\n        T43[i] = _mm256_srai_epi16(_mm256_add_epi16(T11[i], T12[i]), 1);\r\n        T44[i] = _mm256_srai_epi16(_mm256_add_epi16(T12[i], T13[i]), 1);\r\n        T45[i] = _mm256_srai_epi16(_mm256_add_epi16(T13[i], T14[i]), 1);\r\n        T46[i] = _mm256_srai_epi16(_mm256_add_epi16(T14[i], T15[i]), 1);\r\n        T47[i] = _mm256_srai_epi16(_mm256_add_epi16(T15[i], T16[i]), 1);\r\n\r\n        T48[i] = _mm256_srai_epi16(_mm256_add_epi16(T16[i], T17[i]), 1);\r\n        T49[i] = _mm256_srai_epi16(_mm256_add_epi16(T17[i], T18[i]), 1);\r\n        T50[i] = _mm256_srai_epi16(_mm256_add_epi16(T18[i], T19[i]), 1);\r\n        T51[i] = _mm256_srai_epi16(_mm256_add_epi16(T19[i], T20[i]), 1);\r\n        T52[i] = _mm256_srai_epi16(_mm256_add_epi16(T20[i], T21[i]), 1);\r\n        T53[i] = _mm256_srai_epi16(_mm256_add_epi16(T21[i], T22[i]), 1);\r\n        T54[i] = _mm256_srai_epi16(_mm256_add_epi16(T22[i], T23[i]), 1);\r\n        T55[i] = _mm256_srai_epi16(_mm256_add_epi16(T23[i], T24[i]), 1);\r\n\r\n        T56[i] = _mm256_srai_epi16(_mm256_add_epi16(T24[i], T25[i]), 1);\r\n        T57[i] = _mm256_srai_epi16(_mm256_add_epi16(T25[i], T26[i]), 1);\r\n        T58[i] = _mm256_srai_epi16(_mm256_add_epi16(T26[i], T27[i]), 1);\r\n        T59[i] = _mm256_srai_epi16(_mm256_add_epi16(T27[i], T28[i]), 1);\r\n        T60[i] = _mm256_srai_epi16(_mm256_add_epi16(T28[i], T29[i]), 1);\r\n        T61[i] = _mm256_srai_epi16(_mm256_add_epi16(T29[i], T30[i]), 1);\r\n        T62[i] = _mm256_srai_epi16(_mm256_add_epi16(T30[i], T31[i]), 1);\r\n        T63[i] = _mm256_srai_epi16(_mm256_add_epi16(T31[i], T31[i]), 1);\r\n    }\r\n\r\n    /*--transposition--*/\r\n    //32x64 -> 64x32\r\n    TRANSPOSE_16x16_16BIT_m256i(T00[0], T32[0], T01[0], T33[0], T02[0], T34[0], T03[0], T35[0], T04[0], T36[0], T05[0], T37[0], T06[0], T38[0], T07[0], T39[0], V00[0], V01[0], V02[0], V03[0], V04[0], V05[0], V06[0], V07[0], V08[0], V09[0], 
V10[0], V11[0], V12[0], V13[0], V14[0], V15[0]);\r\n    TRANSPOSE_16x16_16BIT_m256i(T08[0], T40[0], T09[0], T41[0], T10[0], T42[0], T11[0], T43[0], T12[0], T44[0], T13[0], T45[0], T14[0], T46[0], T15[0], T47[0], V00[1], V01[1], V02[1], V03[1], V04[1], V05[1], V06[1], V07[1], V08[1], V09[1], V10[1], V11[1], V12[1], V13[1], V14[1], V15[1]);\r\n    TRANSPOSE_16x16_16BIT_m256i(T16[0], T48[0], T17[0], T49[0], T18[0], T50[0], T19[0], T51[0], T20[0], T52[0], T21[0], T53[0], T22[0], T54[0], T23[0], T55[0], V00[2], V01[2], V02[2], V03[2], V04[2], V05[2], V06[2], V07[2], V08[2], V09[2], V10[2], V11[2], V12[2], V13[2], V14[2], V15[2]);\r\n    TRANSPOSE_16x16_16BIT_m256i(T24[0], T56[0], T25[0], T57[0], T26[0], T58[0], T27[0], T59[0], T28[0], T60[0], T29[0], T61[0], T30[0], T62[0], T31[0], T63[0], V00[3], V01[3], V02[3], V03[3], V04[3], V05[3], V06[3], V07[3], V08[3], V09[3], V10[3], V11[3], V12[3], V13[3], V14[3], V15[3]);\r\n\r\n    TRANSPOSE_16x16_16BIT_m256i(T00[1], T32[1], T01[1], T33[1], T02[1], T34[1], T03[1], T35[1], T04[1], T36[1], T05[1], T37[1], T06[1], T38[1], T07[1], T39[1], V16[0], V17[0], V18[0], V19[0], V20[0], V21[0], V22[0], V23[0], V24[0], V25[0], V26[0], V27[0], V28[0], V29[0], V30[0], V31[0]);\r\n    TRANSPOSE_16x16_16BIT_m256i(T08[1], T40[1], T09[1], T41[1], T10[1], T42[1], T11[1], T43[1], T12[1], T44[1], T13[1], T45[1], T14[1], T46[1], T15[1], T47[1], V16[1], V17[1], V18[1], V19[1], V20[1], V21[1], V22[1], V23[1], V24[1], V25[1], V26[1], V27[1], V28[1], V29[1], V30[1], V31[1]);\r\n    TRANSPOSE_16x16_16BIT_m256i(T16[1], T48[1], T17[1], T49[1], T18[1], T50[1], T19[1], T51[1], T20[1], T52[1], T21[1], T53[1], T22[1], T54[1], T23[1], T55[1], V16[2], V17[2], V18[2], V19[2], V20[2], V21[2], V22[2], V23[2], V24[2], V25[2], V26[2], V27[2], V28[2], V29[2], V30[2], V31[2]);\r\n    TRANSPOSE_16x16_16BIT_m256i(T24[1], T56[1], T25[1], T57[1], T26[1], T58[1], T27[1], T59[1], T28[1], T60[1], T29[1], T61[1], T30[1], T62[1], T31[1], T63[1], V16[3], V17[3], V18[3], V19[3], 
V20[3], V21[3], V22[3], V23[3], V24[3], V25[3], V26[3], V27[3], V28[3], V29[3], V30[3], V31[3]);\r\n\r\n    /*--horizontal transform--*/\r\n    //filter (odd pixel/column)\r\n    for (i = 0; i < 4; i++) {\r\n        V32[i] = _mm256_srai_epi16(_mm256_add_epi16(V00[i], V01[i]), 1);\r\n        V33[i] = _mm256_srai_epi16(_mm256_add_epi16(V01[i], V02[i]), 1);\r\n        V34[i] = _mm256_srai_epi16(_mm256_add_epi16(V02[i], V03[i]), 1);\r\n        V35[i] = _mm256_srai_epi16(_mm256_add_epi16(V03[i], V04[i]), 1);\r\n        V36[i] = _mm256_srai_epi16(_mm256_add_epi16(V04[i], V05[i]), 1);\r\n        V37[i] = _mm256_srai_epi16(_mm256_add_epi16(V05[i], V06[i]), 1);\r\n        V38[i] = _mm256_srai_epi16(_mm256_add_epi16(V06[i], V07[i]), 1);\r\n        V39[i] = _mm256_srai_epi16(_mm256_add_epi16(V07[i], V08[i]), 1);\r\n        V40[i] = _mm256_srai_epi16(_mm256_add_epi16(V08[i], V09[i]), 1);\r\n        V41[i] = _mm256_srai_epi16(_mm256_add_epi16(V09[i], V10[i]), 1);\r\n        V42[i] = _mm256_srai_epi16(_mm256_add_epi16(V10[i], V11[i]), 1);\r\n        V43[i] = _mm256_srai_epi16(_mm256_add_epi16(V11[i], V12[i]), 1);\r\n        V44[i] = _mm256_srai_epi16(_mm256_add_epi16(V12[i], V13[i]), 1);\r\n        V45[i] = _mm256_srai_epi16(_mm256_add_epi16(V13[i], V14[i]), 1);\r\n        V46[i] = _mm256_srai_epi16(_mm256_add_epi16(V14[i], V15[i]), 1);\r\n        V47[i] = _mm256_srai_epi16(_mm256_add_epi16(V15[i], V16[i]), 1);\r\n\r\n        V48[i] = _mm256_srai_epi16(_mm256_add_epi16(V16[i], V17[i]), 1);\r\n        V49[i] = _mm256_srai_epi16(_mm256_add_epi16(V17[i], V18[i]), 1);\r\n        V50[i] = _mm256_srai_epi16(_mm256_add_epi16(V18[i], V19[i]), 1);\r\n        V51[i] = _mm256_srai_epi16(_mm256_add_epi16(V19[i], V20[i]), 1);\r\n        V52[i] = _mm256_srai_epi16(_mm256_add_epi16(V20[i], V21[i]), 1);\r\n        V53[i] = _mm256_srai_epi16(_mm256_add_epi16(V21[i], V22[i]), 1);\r\n        V54[i] = _mm256_srai_epi16(_mm256_add_epi16(V22[i], V23[i]), 1);\r\n        V55[i] = 
_mm256_srai_epi16(_mm256_add_epi16(V23[i], V24[i]), 1);\r\n        V56[i] = _mm256_srai_epi16(_mm256_add_epi16(V24[i], V25[i]), 1);\r\n        V57[i] = _mm256_srai_epi16(_mm256_add_epi16(V25[i], V26[i]), 1);\r\n        V58[i] = _mm256_srai_epi16(_mm256_add_epi16(V26[i], V27[i]), 1);\r\n        V59[i] = _mm256_srai_epi16(_mm256_add_epi16(V27[i], V28[i]), 1);\r\n        V60[i] = _mm256_srai_epi16(_mm256_add_epi16(V28[i], V29[i]), 1);\r\n        V61[i] = _mm256_srai_epi16(_mm256_add_epi16(V29[i], V30[i]), 1);\r\n        V62[i] = _mm256_srai_epi16(_mm256_add_epi16(V30[i], V31[i]), 1);\r\n        V63[i] = _mm256_srai_epi16(_mm256_add_epi16(V31[i], V31[i]), 1);\r\n    }\r\n\r\n    /*--transposition & Store--*/\r\n    //64x64 \r\n    TRANSPOSE_16x16_16BIT_m256i(V00[0], V32[0], V01[0], V33[0], V02[0], V34[0], V03[0], V35[0], V04[0], V36[0], V05[0], V37[0], V06[0], V38[0], V07[0], V39[0], T00[0], T01[0], T02[0], T03[0], T04[0], T05[0], T06[0], T07[0], T08[0], T09[0], T10[0], T11[0], T12[0], T13[0], T14[0], T15[0]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V00[1], V32[1], V01[1], V33[1], V02[1], V34[1], V03[1], V35[1], V04[1], V36[1], V05[1], V37[1], V06[1], V38[1], V07[1], V39[1], T16[0], T17[0], T18[0], T19[0], T20[0], T21[0], T22[0], T23[0], T24[0], T25[0], T26[0], T27[0], T28[0], T29[0], T30[0], T31[0]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V00[2], V32[2], V01[2], V33[2], V02[2], V34[2], V03[2], V35[2], V04[2], V36[2], V05[2], V37[2], V06[2], V38[2], V07[2], V39[2], T32[0], T33[0], T34[0], T35[0], T36[0], T37[0], T38[0], T39[0], T40[0], T41[0], T42[0], T43[0], T44[0], T45[0], T46[0], T47[0]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V00[3], V32[3], V01[3], V33[3], V02[3], V34[3], V03[3], V35[3], V04[3], V36[3], V05[3], V37[3], V06[3], V38[3], V07[3], V39[3], T48[0], T49[0], T50[0], T51[0], T52[0], T53[0], T54[0], T55[0], T56[0], T57[0], T58[0], T59[0], T60[0], T61[0], T62[0], T63[0]);\r\n\r\n    TRANSPOSE_16x16_16BIT_m256i(V08[0], V40[0], V09[0], V41[0], V10[0], V42[0], V11[0], 
V43[0], V12[0], V44[0], V13[0], V45[0], V14[0], V46[0], V15[0], V47[0], T00[1], T01[1], T02[1], T03[1], T04[1], T05[1], T06[1], T07[1], T08[1], T09[1], T10[1], T11[1], T12[1], T13[1], T14[1], T15[1]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V08[1], V40[1], V09[1], V41[1], V10[1], V42[1], V11[1], V43[1], V12[1], V44[1], V13[1], V45[1], V14[1], V46[1], V15[1], V47[1], T16[1], T17[1], T18[1], T19[1], T20[1], T21[1], T22[1], T23[1], T24[1], T25[1], T26[1], T27[1], T28[1], T29[1], T30[1], T31[1]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V08[2], V40[2], V09[2], V41[2], V10[2], V42[2], V11[2], V43[2], V12[2], V44[2], V13[2], V45[2], V14[2], V46[2], V15[2], V47[2], T32[1], T33[1], T34[1], T35[1], T36[1], T37[1], T38[1], T39[1], T40[1], T41[1], T42[1], T43[1], T44[1], T45[1], T46[1], T47[1]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V08[3], V40[3], V09[3], V41[3], V10[3], V42[3], V11[3], V43[3], V12[3], V44[3], V13[3], V45[3], V14[3], V46[3], V15[3], V47[3], T48[1], T49[1], T50[1], T51[1], T52[1], T53[1], T54[1], T55[1], T56[1], T57[1], T58[1], T59[1], T60[1], T61[1], T62[1], T63[1]);\r\n\r\n    TRANSPOSE_16x16_16BIT_m256i(V16[0], V48[0], V17[0], V49[0], V18[0], V50[0], V19[0], V51[0], V20[0], V52[0], V21[0], V53[0], V22[0], V54[0], V23[0], V55[0], T00[2], T01[2], T02[2], T03[2], T04[2], T05[2], T06[2], T07[2], T08[2], T09[2], T10[2], T11[2], T12[2], T13[2], T14[2], T15[2]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V16[1], V48[1], V17[1], V49[1], V18[1], V50[1], V19[1], V51[1], V20[1], V52[1], V21[1], V53[1], V22[1], V54[1], V23[1], V55[1], T16[2], T17[2], T18[2], T19[2], T20[2], T21[2], T22[2], T23[2], T24[2], T25[2], T26[2], T27[2], T28[2], T29[2], T30[2], T31[2]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V16[2], V48[2], V17[2], V49[2], V18[2], V50[2], V19[2], V51[2], V20[2], V52[2], V21[2], V53[2], V22[2], V54[2], V23[2], V55[2], T32[2], T33[2], T34[2], T35[2], T36[2], T37[2], T38[2], T39[2], T40[2], T41[2], T42[2], T43[2], T44[2], T45[2], T46[2], T47[2]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V16[3], 
V48[3], V17[3], V49[3], V18[3], V50[3], V19[3], V51[3], V20[3], V52[3], V21[3], V53[3], V22[3], V54[3], V23[3], V55[3], T48[2], T49[2], T50[2], T51[2], T52[2], T53[2], T54[2], T55[2], T56[2], T57[2], T58[2], T59[2], T60[2], T61[2], T62[2], T63[2]);\r\n\r\n    TRANSPOSE_16x16_16BIT_m256i(V24[0], V56[0], V25[0], V57[0], V26[0], V58[0], V27[0], V59[0], V28[0], V60[0], V29[0], V61[0], V30[0], V62[0], V31[0], V63[0], T00[3], T01[3], T02[3], T03[3], T04[3], T05[3], T06[3], T07[3], T08[3], T09[3], T10[3], T11[3], T12[3], T13[3], T14[3], T15[3]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V24[1], V56[1], V25[1], V57[1], V26[1], V58[1], V27[1], V59[1], V28[1], V60[1], V29[1], V61[1], V30[1], V62[1], V31[1], V63[1], T16[3], T17[3], T18[3], T19[3], T20[3], T21[3], T22[3], T23[3], T24[3], T25[3], T26[3], T27[3], T28[3], T29[3], T30[3], T31[3]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V24[2], V56[2], V25[2], V57[2], V26[2], V58[2], V27[2], V59[2], V28[2], V60[2], V29[2], V61[2], V30[2], V62[2], V31[2], V63[2], T32[3], T33[3], T34[3], T35[3], T36[3], T37[3], T38[3], T39[3], T40[3], T41[3], T42[3], T43[3], T44[3], T45[3], T46[3], T47[3]);\r\n    TRANSPOSE_16x16_16BIT_m256i(V24[3], V56[3], V25[3], V57[3], V26[3], V58[3], V27[3], V59[3], V28[3], V60[3], V29[3], V61[3], V30[3], V62[3], V31[3], V63[3], T48[3], T49[3], T50[3], T51[3], T52[3], T53[3], T54[3], T55[3], T56[3], T57[3], T58[3], T59[3], T60[3], T61[3], T62[3], T63[3]);\r\n\r\n    //store\r\n    for (i = 0; i < 4; i++) {\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i], T00[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64], T01[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 2], T02[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 3], T03[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 4], T04[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 5], T05[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 6], T06[i]);\r\n        
_mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 7], T07[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 8], T08[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 9], T09[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 10], T10[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 11], T11[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 12], T12[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 13], T13[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 14], T14[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 15], T15[i]);\r\n\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 16], T16[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 17], T17[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 18], T18[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 19], T19[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 20], T20[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 21], T21[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 22], T22[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 23], T23[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 24], T24[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 25], T25[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 26], T26[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 27], T27[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 28], T28[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 29], T29[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 30], T30[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 31], T31[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 32], T32[i]);\r\n        
_mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 33], T33[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 34], T34[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 35], T35[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 36], T36[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 37], T37[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 38], T38[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 39], T39[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 40], T40[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 41], T41[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 42], T42[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 43], T43[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 44], T44[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 45], T45[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 46], T46[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 47], T47[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 48], T48[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 49], T49[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 50], T50[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 51], T51[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 52], T52[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 53], T53[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 54], T54[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 55], T55[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 56], T56[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 57], T57[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 58], T58[i]);\r\n        
_mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 59], T59[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 60], T60[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 61], T61[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 62], T62[i]);\r\n        _mm256_storeu_si256((__m256i*)&coeff[16 * i + 64 * 63], T63[i]);\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid idct_64x64_avx2(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_32x32_avx2(src, dst, 32 | 0x01); //TODO: change the code to avx2\r\n    inv_wavelet_64x64_avx2(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid idct_64x16_avx2(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_32x8_sse128(src, dst, 32 | 0x01);//TODO: change the code to avx2\r\n    inv_wavelet_64x16_avx2(dst);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid idct_16x64_avx2(const coeff_t *src, coeff_t *dst, int i_dst)\r\n{\r\n    UNUSED_PARAMETER(i_dst);\r\n    idct_8x32_sse128(src, dst, 8 | 0x01);//TODO: change the code to avx2\r\n    inv_wavelet_16x64_avx2(dst);\r\n}\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_inter_pred.cc",
    "content": "/*\r\n * intrinsic_inter-pred.cc\r\n *\r\n * Description of this file:\r\n *    SSE assembly functions of Inter-Prediction module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n#include <immintrin.h>\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#if !HIGH_BIT_DEPTH\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_hor_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    const int16_t offset = 32;\r\n    const int shift = 6;\r\n    int row, col;\r\n    const __m128i mAddOffset = 
_mm_set1_epi16(offset);\r\n    const __m128i mSwitch1   = _mm_setr_epi8(0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6);\r\n    const __m128i mSwitch2   = _mm_setr_epi8(4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10);\r\n    const __m128i mCoef      = _mm_set1_epi32(*(int*)coeff);\r\n    const __m128i mask       = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n\r\n    src -= 1;\r\n\r\n    for (row = 0; row < height; row++) {\r\n        __m128i mSrc, mT20, mT40, mVal;\r\n\r\n        for (col = 0; col < width - 7; col += 8) {\r\n            mSrc = _mm_loadu_si128((__m128i*)(src + col));\r\n\r\n            mT20 = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch1), mCoef);\r\n            mT40 = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch2), mCoef);\r\n\r\n            mVal = _mm_hadd_epi16(mT20, mT40);\r\n            mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n        }\r\n\r\n        if (col < width) {\r\n            mSrc = _mm_loadu_si128((__m128i*)(src + col));\r\n\r\n            mT20 = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch1), mCoef);\r\n            mT40 = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch2), mCoef);\r\n\r\n            mVal = _mm_hadd_epi16(mT20, mT40);\r\n            mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n        }\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_hor_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    int row, col = 0;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n\r\n    __m128i mAddOffset = 
_mm_set1_epi16(offset);\r\n\r\n    __m128i mSwitch1 = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8);\r\n    __m128i mSwitch2 = _mm_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10);\r\n    __m128i mSwitch3 = _mm_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12);\r\n    __m128i mSwitch4 = _mm_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n\r\n    __m128i mCoef = _mm_loadl_epi64((__m128i*)coeff);\r\n\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n    mCoef = _mm_unpacklo_epi64(mCoef, mCoef);\r\n\r\n    src -= 3;\r\n    for (row = 0; row < height; row++) {\r\n        __m128i srcCoeff, T20, T40, T60, T80, sum;\r\n\r\n        for (col = 0; col < width - 7; col += 8) {\r\n            srcCoeff = _mm_loadu_si128((__m128i*)(src + col));\r\n\r\n            T20 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff, mSwitch1), mCoef);\r\n            T40 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff, mSwitch2), mCoef);\r\n            T60 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff, mSwitch3), mCoef);\r\n            T80 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff, mSwitch4), mCoef);\r\n\r\n            sum = _mm_hadd_epi16(_mm_hadd_epi16(T20, T40), _mm_hadd_epi16(T60, T80));\r\n            sum = _mm_srai_epi16(_mm_add_epi16(sum, mAddOffset), shift);\r\n            sum = _mm_packus_epi16(sum, sum);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst[col], sum);\r\n        }\r\n\r\n        if (col < width) {\r\n            srcCoeff = _mm_loadu_si128((__m128i*)(src + col));\r\n\r\n            T20 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff, mSwitch1), mCoef);\r\n            T40 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff, mSwitch2), mCoef);\r\n            T60 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff, mSwitch3), mCoef);\r\n            T80 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff, mSwitch4), mCoef);\r\n\r\n            sum = _mm_hadd_epi16(_mm_hadd_epi16(T20, T40), 
_mm_hadd_epi16(T60, T80));\r\n            sum = _mm_srai_epi16(_mm_add_epi16(sum, mAddOffset), shift);\r\n            sum = _mm_packus_epi16(sum, sum);\r\n\r\n            _mm_maskmoveu_si128(sum, mask, (char *)&dst[col]);\r\n        }\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_hor_sse128(pel_t *dst, int i_dst, mct_t *tmp, int i_tmp, pel_t *src, int i_src, int width, int height, int8_t const *coeff)\r\n{\r\n    int row, col = 0;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n\r\n    __m128i mAddOffset = _mm_set1_epi16(offset);\r\n\r\n    __m128i mSwitch1 = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,     1, 2, 3, 4, 5, 6, 7, 8);\r\n    __m128i mSwitch2 = _mm_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9,     3, 4, 5, 6, 7, 8, 9, 10);\r\n    __m128i mSwitch3 = _mm_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11,   5, 6, 7, 8, 9, 10, 11, 12);\r\n    __m128i mSwitch4 = _mm_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n\r\n    __m128i mCoef = _mm_loadl_epi64((__m128i*)coeff);\r\n\r\n    mCoef = _mm_unpacklo_epi64(mCoef, mCoef);\r\n\r\n    __m128i T01, T23, T45, T67, T89, Tab, Tcd, Tef;\r\n    __m128i S1, S2, S3, S4;\r\n    __m128i U0, U1;\r\n    __m128i Val1, Val2, Val;\r\n\r\n    src -= 3;\r\n    for (row = 0; row < height; row++) {\r\n        for (col = 0; col < width - 8; col += 16) {\r\n            __m128i srcCoeff1 = _mm_loadu_si128((__m128i*)(src + col));\r\n            __m128i srcCoeff2 = _mm_loadu_si128((__m128i*)(src + col + 8));\r\n\r\n            T01 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff1, mSwitch1), mCoef);\r\n            T23 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff1, mSwitch2), mCoef);\r\n            T45 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff1, mSwitch3), mCoef);\r\n            T67 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff1, mSwitch4), mCoef);\r\n\r\n            S1 = _mm_hadd_epi16(T01, 
T23);\r\n            S2 = _mm_hadd_epi16(T45, T67);\r\n            U0 = _mm_hadd_epi16(S1, S2);\r\n\r\n            _mm_store_si128((__m128i*)&tmp[col], U0);\r\n\r\n            T89 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff2, mSwitch1), mCoef);\r\n            Tab = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff2, mSwitch2), mCoef);\r\n            Tcd = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff2, mSwitch3), mCoef);\r\n            Tef = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff2, mSwitch4), mCoef);\r\n\r\n            S3 = _mm_hadd_epi16(T89, Tab);\r\n            S4 = _mm_hadd_epi16(Tcd, Tef);\r\n            U1 = _mm_hadd_epi16(S3, S4);\r\n\r\n            _mm_store_si128((__m128i*)&tmp[col + 8], U1);\r\n\r\n\r\n            Val1 = _mm_add_epi16(U0, mAddOffset);\r\n            Val2 = _mm_add_epi16(U1, mAddOffset);\r\n\r\n            Val1 = _mm_srai_epi16(Val1, shift);\r\n            Val2 = _mm_srai_epi16(Val2, shift);\r\n\r\n            Val = _mm_packus_epi16(Val1, Val2);\r\n\r\n            _mm_storeu_si128((__m128i*)&dst[col], Val);\r\n        }\r\n\r\n        if (col < width) {\r\n            __m128i srcCoeff1 = _mm_loadu_si128((__m128i*)(src + col));\r\n\r\n            T01 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff1, mSwitch1), mCoef);\r\n            T23 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff1, mSwitch2), mCoef);\r\n            T45 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff1, mSwitch3), mCoef);\r\n            T67 = _mm_maddubs_epi16(_mm_shuffle_epi8(srcCoeff1, mSwitch4), mCoef);\r\n\r\n            S1 = _mm_hadd_epi16(T01, T23);\r\n            S2 = _mm_hadd_epi16(T45, T67);\r\n            U0 = _mm_hadd_epi16(S1, S2);\r\n\r\n            _mm_store_si128((__m128i*)&tmp[col], U0);\r\n\r\n            Val1 = _mm_add_epi16(U0, mAddOffset);\r\n            Val1 = _mm_srai_epi16(Val1, shift);\r\n\r\n            Val = _mm_packus_epi16(Val1, Val1);\r\n\r\n            _mm_store_si128((__m128i*)&dst[col], Val);\r\n        }\r\n\r\n        src += i_src;\r\n        
tmp += i_tmp;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_hor_x3_sse128(pel_t *const dst[3], int i_dst, mct_t *const tmp[3], int i_tmp, pel_t *src, int i_src, int width, int height, const int8_t **coeff)\r\n{\r\n    int row, col = 0;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n\r\n    __m128i mAddOffset = _mm_set1_epi16(offset);\r\n\r\n    __m128i mSwitch1 = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8);\r\n    __m128i mSwitch2 = _mm_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10);\r\n    __m128i mSwitch3 = _mm_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12);\r\n    __m128i mSwitch4 = _mm_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n\r\n    __m128i mCoef0 = _mm_loadl_epi64((__m128i*)coeff[0]);\r\n    __m128i mCoef1 = _mm_loadl_epi64((__m128i*)coeff[1]);\r\n    __m128i mCoef2 = _mm_loadl_epi64((__m128i*)coeff[2]);\r\n    mct_t *tmp0 = tmp[0];\r\n    mct_t *tmp1 = tmp[1];\r\n    mct_t *tmp2 = tmp[2];\r\n    pel_t *dst0 = dst[0];\r\n    pel_t *dst1 = dst[1];\r\n    pel_t *dst2 = dst[2];\r\n\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n    mCoef0 = _mm_unpacklo_epi64(mCoef0, mCoef0);\r\n    mCoef1 = _mm_unpacklo_epi64(mCoef1, mCoef1);\r\n    mCoef2 = _mm_unpacklo_epi64(mCoef2, mCoef2);\r\n\r\n    src -= 3;\r\n    for (row = 0; row < height; row++) {\r\n        __m128i TC1, TC2, TC3, TC4;\r\n        __m128i T20, T40, T60, T80, sum, val;\r\n        __m128i srcCoeff;\r\n        for (col = 0; col < width - 7; col += 8) {\r\n            srcCoeff = _mm_loadu_si128((__m128i*)(src + col));\r\n\r\n            TC1 = _mm_shuffle_epi8(srcCoeff, mSwitch1);\r\n            TC2 = _mm_shuffle_epi8(srcCoeff, mSwitch2);\r\n            TC3 = _mm_shuffle_epi8(srcCoeff, mSwitch3);\r\n            TC4 = _mm_shuffle_epi8(srcCoeff, mSwitch4);\r\n\r\n            
// First\r\n            T20 = _mm_maddubs_epi16(TC1, mCoef0);\r\n            T40 = _mm_maddubs_epi16(TC2, mCoef0);\r\n            T60 = _mm_maddubs_epi16(TC3, mCoef0);\r\n            T80 = _mm_maddubs_epi16(TC4, mCoef0);\r\n\r\n            sum = _mm_hadd_epi16(_mm_hadd_epi16(T20, T40), _mm_hadd_epi16(T60, T80));\r\n\r\n            _mm_store_si128((__m128i*)(&tmp0[col]), sum);\r\n\r\n            val = _mm_srai_epi16(_mm_add_epi16(sum, mAddOffset), shift);\r\n            val = _mm_packus_epi16(val, val);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst0[col], val);\r\n\r\n            // Second\r\n            T20 = _mm_maddubs_epi16(TC1, mCoef1);\r\n            T40 = _mm_maddubs_epi16(TC2, mCoef1);\r\n            T60 = _mm_maddubs_epi16(TC3, mCoef1);\r\n            T80 = _mm_maddubs_epi16(TC4, mCoef1);\r\n\r\n            sum = _mm_hadd_epi16(_mm_hadd_epi16(T20, T40), _mm_hadd_epi16(T60, T80));\r\n\r\n            _mm_store_si128((__m128i*)(&tmp1[col]), sum);\r\n\r\n            val = _mm_srai_epi16(_mm_add_epi16(sum, mAddOffset), shift);\r\n            val = _mm_packus_epi16(val, val);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst1[col], val);\r\n\r\n            // Third\r\n            T20 = _mm_maddubs_epi16(TC1, mCoef2);\r\n            T40 = _mm_maddubs_epi16(TC2, mCoef2);\r\n            T60 = _mm_maddubs_epi16(TC3, mCoef2);\r\n            T80 = _mm_maddubs_epi16(TC4, mCoef2);\r\n\r\n            sum = _mm_hadd_epi16(_mm_hadd_epi16(T20, T40), _mm_hadd_epi16(T60, T80));\r\n\r\n            _mm_store_si128((__m128i*)(&tmp2[col]), sum);\r\n\r\n            val = _mm_srai_epi16(_mm_add_epi16(sum, mAddOffset), shift);\r\n            val = _mm_packus_epi16(val, val);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst2[col], val);\r\n        }\r\n\r\n        if (col < width) {\r\n            srcCoeff = _mm_loadu_si128((__m128i*)(src + col));\r\n\r\n            TC1 = _mm_shuffle_epi8(srcCoeff, mSwitch1);\r\n            TC2 = _mm_shuffle_epi8(srcCoeff, mSwitch2);\r\n       
     TC3 = _mm_shuffle_epi8(srcCoeff, mSwitch3);\r\n            TC4 = _mm_shuffle_epi8(srcCoeff, mSwitch4);\r\n\r\n            // First\r\n            T20 = _mm_maddubs_epi16(TC1, mCoef0);\r\n            T40 = _mm_maddubs_epi16(TC2, mCoef0);\r\n            T60 = _mm_maddubs_epi16(TC3, mCoef0);\r\n            T80 = _mm_maddubs_epi16(TC4, mCoef0);\r\n\r\n            sum = _mm_hadd_epi16(_mm_hadd_epi16(T20, T40), _mm_hadd_epi16(T60, T80));\r\n\r\n            _mm_store_si128((__m128i*)(&tmp0[col]), sum);\r\n\r\n            val = _mm_srai_epi16(_mm_add_epi16(sum, mAddOffset), shift);\r\n            val = _mm_packus_epi16(val, val);\r\n\r\n            _mm_maskmoveu_si128(val, mask, (char *)&dst0[col]);\r\n\r\n            // Second\r\n            T20 = _mm_maddubs_epi16(TC1, mCoef1);\r\n            T40 = _mm_maddubs_epi16(TC2, mCoef1);\r\n            T60 = _mm_maddubs_epi16(TC3, mCoef1);\r\n            T80 = _mm_maddubs_epi16(TC4, mCoef1);\r\n\r\n            sum = _mm_hadd_epi16(_mm_hadd_epi16(T20, T40), _mm_hadd_epi16(T60, T80));\r\n\r\n            _mm_store_si128((__m128i*)(&tmp1[col]), sum);\r\n\r\n            val = _mm_srai_epi16(_mm_add_epi16(sum, mAddOffset), shift);\r\n            val = _mm_packus_epi16(val, val);\r\n\r\n            _mm_maskmoveu_si128(val, mask, (char *)&dst1[col]);\r\n\r\n            // Third\r\n            T20 = _mm_maddubs_epi16(TC1, mCoef2);\r\n            T40 = _mm_maddubs_epi16(TC2, mCoef2);\r\n            T60 = _mm_maddubs_epi16(TC3, mCoef2);\r\n            T80 = _mm_maddubs_epi16(TC4, mCoef2);\r\n\r\n            sum = _mm_hadd_epi16(_mm_hadd_epi16(T20, T40), _mm_hadd_epi16(T60, T80));\r\n\r\n            _mm_store_si128((__m128i*)(&tmp2[col]), sum);\r\n\r\n            val = _mm_srai_epi16(_mm_add_epi16(sum, mAddOffset), shift);\r\n            val = _mm_packus_epi16(val, val);\r\n\r\n            _mm_maskmoveu_si128(val, mask, (char *)&dst2[col]);\r\n        }\r\n\r\n        src += i_src;\r\n        tmp0 += i_tmp;\r\n        tmp1 += 
i_tmp;\r\n        tmp2 += i_tmp;\r\n        dst0 += i_dst;\r\n        dst1 += i_dst;\r\n        dst2 += i_dst;\r\n    }\r\n    \r\n}\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define INTPL_LUMA_VER_SSE128_COMPUT(W0,W1,W2,W3,W4,W5,W6,W7,result)      \\\r\n    T0 = _mm_maddubs_epi16(D0, W0);                                \\\r\n    T1 = _mm_maddubs_epi16(D1, W1);                                \\\r\n    T2 = _mm_maddubs_epi16(D2, W2);                                \\\r\n    T3 = _mm_maddubs_epi16(D3, W3);                                \\\r\n    T4 = _mm_maddubs_epi16(D4, W4);                                \\\r\n    T5 = _mm_maddubs_epi16(D5, W5);                                \\\r\n    T6 = _mm_maddubs_epi16(D6, W6);                                \\\r\n    T7 = _mm_maddubs_epi16(D7, W7);                                \\\r\n                                                                   \\\r\n    mVal1 = _mm_add_epi16(T0, T1);                                 \\\r\n    mVal1 = _mm_add_epi16(mVal1, T2);                              \\\r\n    mVal1 = _mm_add_epi16(mVal1, T3);                              \\\r\n                                                                   \\\r\n    mVal2 = _mm_add_epi16(T4, T5);                                 \\\r\n    mVal2 = _mm_add_epi16(mVal2, T6);                              \\\r\n    mVal2 = _mm_add_epi16(mVal2, T7);                              \\\r\n                                                                   \\\r\n    mVal1 = _mm_add_epi16(mVal1, mAddOffset);                      \\\r\n    mVal2 = _mm_add_epi16(mVal2, mAddOffset);                      \\\r\n    mVal1 = _mm_srai_epi16(mVal1, shift);                          \\\r\n    mVal2 = _mm_srai_epi16(mVal2, shift);                          \\\r\n    result = _mm_packus_epi16(mVal1, mVal2);\r\n\r\n#define INTPL_LUMA_VER_SSE128_STORE(result, store_dst)             \\\r\n    
_mm_storeu_si128((__m128i*)&(store_dst)[col], result);\r\n\r\n#define INTPL_LUMA_VER_SSE128_COMPUT_LO(W0,W1,W2,W3,result)        \\\r\n    T0 = _mm_maddubs_epi16(D0, W0);                                \\\r\n    T1 = _mm_maddubs_epi16(D1, W1);                                \\\r\n    T2 = _mm_maddubs_epi16(D2, W2);                                \\\r\n    T3 = _mm_maddubs_epi16(D3, W3);                                \\\r\n                                                                   \\\r\n    mVal1 = _mm_add_epi16(T0, T1);                                 \\\r\n    mVal1 = _mm_add_epi16(mVal1, T2);                              \\\r\n    mVal1 = _mm_add_epi16(mVal1, T3);                              \\\r\n                                                                   \\\r\n    mVal1 = _mm_add_epi16(mVal1, mAddOffset);                      \\\r\n    mVal1 = _mm_srai_epi16(mVal1, shift);                          \\\r\n    result = _mm_packus_epi16(mVal1, mVal1);\r\n\r\n\r\nvoid intpl_luma_ver_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, int8_t const *coeff)\r\n{\r\n    int row, col;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n\r\n    __m128i mAddOffset = _mm_set1_epi16(offset);\r\n\r\n    pel_t const *p;\r\n\r\n    src -= 3 * i_src;\r\n\r\n    int8_t coeff_tmp[2];\r\n    coeff_tmp[0] = coeff[7],coeff_tmp[1] = coeff[0];\r\n    __m128i coeff70 = _mm_set1_epi16(*(short*)coeff_tmp);\r\n    __m128i coeff12 = _mm_set1_epi16(*(short*)(coeff + 1));\r\n    __m128i coeff34 = _mm_set1_epi16(*(short*)(coeff + 3));\r\n    __m128i coeff56 = _mm_set1_epi16(*(short*)(coeff + 5));\r\n\r\n    __m128i coeff01 = _mm_set1_epi16(*(short*)coeff);\r\n    __m128i coeff23 = _mm_set1_epi16(*(short*)(coeff + 2));\r\n    __m128i coeff45 = _mm_set1_epi16(*(short*)(coeff + 4));\r\n    __m128i coeff67 = _mm_set1_epi16(*(short*)(coeff + 6));\r\n    __m128i mVal1, mVal2;\r\n\r\n    __m128i T00, T10, T20, T30, T40, T50, T60, T70, T80, T90, 
Ta0;\r\n    __m128i T0, T1, T2, T3, T4, T5, T6, T7;\r\n    __m128i D0, D1, D2, D3, D4, D5, D6, D7;\r\n    __m128i U0, U1, U2, U3;\r\n    for (row = 0; row < height; row = row + 4) {\r\n        p = src;\r\n        for (col = 0; col < width - 8; col += 16) {\r\n            T00 = _mm_loadu_si128((__m128i*)(p));\r\n            T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n            T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_src));\r\n            T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_src));\r\n            Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_src));\r\n\r\n            //0\r\n            D0 = _mm_unpacklo_epi8(T00, T10);\r\n            D1 = _mm_unpacklo_epi8(T20, T30);\r\n            D2 = _mm_unpacklo_epi8(T40, T50);\r\n            D3 = _mm_unpacklo_epi8(T60, T70);\r\n            D4 = _mm_unpackhi_epi8(T00, T10);\r\n            D5 = _mm_unpackhi_epi8(T20, T30);\r\n            D6 = _mm_unpackhi_epi8(T40, T50);\r\n            D7 = _mm_unpackhi_epi8(T60, T70);\r\n\r\n            INTPL_LUMA_VER_SSE128_COMPUT(coeff01, coeff23, coeff45, coeff67, coeff01, coeff23, coeff45, coeff67, U0);\r\n            INTPL_LUMA_VER_SSE128_STORE(U0, dst);\r\n\r\n            //1\r\n            D0 = _mm_unpacklo_epi8(T80, T10);\r\n            D4 = _mm_unpackhi_epi8(T80, T10);\r\n\r\n            INTPL_LUMA_VER_SSE128_COMPUT(coeff70, coeff12, coeff34, coeff56, coeff70, coeff12, coeff34, coeff56, U1);\r\n            INTPL_LUMA_VER_SSE128_STORE(U1, dst + i_dst);\r\n\r\n            //2\r\n            D0 = _mm_unpacklo_epi8(T80, T90);\r\n            D4 = _mm_unpackhi_epi8(T80, T90);\r\n\r\n            
INTPL_LUMA_VER_SSE128_COMPUT(coeff67, coeff01, coeff23, coeff45, coeff67, coeff01, coeff23, coeff45, U2);\r\n            INTPL_LUMA_VER_SSE128_STORE(U2, dst + 2 * i_dst);\r\n\r\n            //3\r\n            D1 = _mm_unpacklo_epi8(Ta0, T30);\r\n            D5 = _mm_unpackhi_epi8(Ta0, T30);\r\n\r\n            INTPL_LUMA_VER_SSE128_COMPUT(coeff56, coeff70, coeff12, coeff34, coeff56, coeff70, coeff12, coeff34, U3);\r\n            INTPL_LUMA_VER_SSE128_STORE(U3, dst + 3 * i_dst);\r\n\r\n            p += 16;\r\n        }\r\n\r\n        // tail: up to 8 remaining columns, using the low 8 bytes of each row\r\n        if (col < width) {\r\n            T00 = _mm_loadu_si128((__m128i*)(p));\r\n            T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n            T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_src));\r\n            T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_src));\r\n            Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_src));\r\n\r\n            //0\r\n            D0 = _mm_unpacklo_epi8(T00, T10);\r\n            D1 = _mm_unpacklo_epi8(T20, T30);\r\n            D2 = _mm_unpacklo_epi8(T40, T50);\r\n            D3 = _mm_unpacklo_epi8(T60, T70);\r\n\r\n            INTPL_LUMA_VER_SSE128_COMPUT_LO(coeff01, coeff23, coeff45, coeff67, U0);\r\n            INTPL_LUMA_VER_SSE128_STORE(U0, dst);\r\n\r\n            //1\r\n            D0 = _mm_unpacklo_epi8(T80, T10);\r\n\r\n            INTPL_LUMA_VER_SSE128_COMPUT_LO(coeff70, coeff12, coeff34, coeff56, U1);\r\n            INTPL_LUMA_VER_SSE128_STORE(U1, dst + i_dst);\r\n\r\n            //2\r\n            D0 = _mm_unpacklo_epi8(T80, T90);\r\n\r\n            INTPL_LUMA_VER_SSE128_COMPUT_LO(coeff67, 
coeff01, coeff23, coeff45, U2);\r\n            INTPL_LUMA_VER_SSE128_STORE(U2, dst + 2 * i_dst);\r\n\r\n            //3\r\n            D1 = _mm_unpacklo_epi8(Ta0, T30);\r\n\r\n            INTPL_LUMA_VER_SSE128_COMPUT_LO(coeff56, coeff70, coeff12, coeff34, U3);\r\n            INTPL_LUMA_VER_SSE128_STORE(U3, dst + 3 * i_dst);\r\n\r\n            p += 8;\r\n            col += 8;\r\n        }\r\n\r\n        src += i_src * 4;\r\n        dst += i_dst * 4;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n *\r\n */\r\nvoid intpl_luma_ver_x3_sse128(pel_t *const dst[3], int i_dst, pel_t *src, int i_src, int width, int height, int8_t const **coeff)\r\n{\r\n    /*\r\n    intpl_luma_ver_sse128(dst0, i_dst, src, i_src, width, height, coeff[0]);\r\n    intpl_luma_ver_sse128(dst1, i_dst, src, i_src, width, height, coeff[1]);\r\n    intpl_luma_ver_sse128(dst2, i_dst, src, i_src, width, height, coeff[2]);\r\n    */\r\n    int row, col;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n    int bsymFirst = (coeff[0][1] == coeff[0][6]);\r\n    int bsymSecond = (coeff[1][1] == coeff[1][6]);\r\n    int bsymThird = (coeff[2][1] == coeff[2][6]);\r\n\r\n    __m128i mAddOffset = _mm_set1_epi16(offset);\r\n\r\n    pel_t const *p;\r\n\r\n    /* (width & 7) == 0 leaves no tail columns; clamp so the index never becomes -1 */\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) ? (width & 7) - 1 : 0]));\r\n    \r\n    src -= 3 * i_src;\r\n\r\n    __m128i coeffFirst0, coeffFirst1, coeffFirst2, coeffFirst3;\r\n    __m128i coeffSecond0, coeffSecond1, coeffSecond2, coeffSecond3;\r\n    __m128i coeffThird0, coeffThird1, coeffThird2, coeffThird3;\r\n    __m128i tempT00, tempT10, tempT20, tempT30;\r\n    __m128i mVal;\r\n\r\n    pel_t *dst0 = dst[0];\r\n    pel_t *dst1 = dst[1];\r\n    pel_t *dst2 = dst[2];\r\n\r\n    //load Coefficient\r\n    if (bsymFirst) { \r\n        coeffFirst0 = _mm_set1_epi8(coeff[0][0]);\r\n        coeffFirst1 = _mm_set1_epi8(coeff[0][1]);\r\n        coeffFirst2 = _mm_set1_epi8(coeff[0][2]);\r\n  
      coeffFirst3 = _mm_set1_epi8(coeff[0][3]);\r\n    } else { \r\n        coeffFirst0 = _mm_set1_epi16(*(short*)coeff[0]);\r\n        coeffFirst1 = _mm_set1_epi16(*(short*)(coeff[0] + 2));\r\n        coeffFirst2 = _mm_set1_epi16(*(short*)(coeff[0] + 4));\r\n        coeffFirst3 = _mm_set1_epi16(*(short*)(coeff[0] + 6));\r\n    }\r\n    if (bsymSecond) { \r\n        coeffSecond0 = _mm_set1_epi8(coeff[1][0]);\r\n        coeffSecond1 = _mm_set1_epi8(coeff[1][1]);\r\n        coeffSecond2 = _mm_set1_epi8(coeff[1][2]);\r\n        coeffSecond3 = _mm_set1_epi8(coeff[1][3]);\r\n    } else { \r\n        coeffSecond0 = _mm_set1_epi16(*(short*)coeff[1]);\r\n        coeffSecond1 = _mm_set1_epi16(*(short*)(coeff[1] + 2));\r\n        coeffSecond2 = _mm_set1_epi16(*(short*)(coeff[1] + 4));\r\n        coeffSecond3 = _mm_set1_epi16(*(short*)(coeff[1] + 6));\r\n    }\r\n    if (bsymThird) { \r\n        coeffThird0 = _mm_set1_epi8(coeff[2][0]);\r\n        coeffThird1 = _mm_set1_epi8(coeff[2][1]);\r\n        coeffThird2 = _mm_set1_epi8(coeff[2][2]);\r\n        coeffThird3 = _mm_set1_epi8(coeff[2][3]);\r\n    } else { \r\n        coeffThird0 = _mm_set1_epi16(*(short*)coeff[2]);\r\n        coeffThird1 = _mm_set1_epi16(*(short*)(coeff[2] + 2));\r\n        coeffThird2 = _mm_set1_epi16(*(short*)(coeff[2] + 4));\r\n        coeffThird3 = _mm_set1_epi16(*(short*)(coeff[2] + 6));\r\n    }\r\n\r\n    //Double For\r\n    for (row = 0; row < height - 3; row += 4) {\r\n        p = src;\r\n        for (col = 0; col < width - 7; col += 8) {\r\n            __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n            __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            __m128i T60 = 
_mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n            __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_src));\r\n            __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_src));\r\n            __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_src));\r\n\r\n            //First\r\n            if (bsymFirst) { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T70);\r\n                tempT10 = _mm_unpacklo_epi8(T10, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T20, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T30, T40);\r\n            } else { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T10);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffFirst0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffFirst1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffFirst2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffFirst3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst0[col], mVal);\r\n\r\n            if (bsymFirst) {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T80);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T70);\r\n                tempT20 = _mm_unpacklo_epi8(T30, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T40, T50);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T20);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T40);\r\n                
tempT20 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T70, T80);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffFirst0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffFirst1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffFirst2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffFirst3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst0[col] + i_dst), mVal);\r\n\r\n            if (bsymFirst) {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T90);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T80);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T50, T60);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT20 = _mm_unpacklo_epi8(T60, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T80, T90);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffFirst0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffFirst1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffFirst2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffFirst3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst0[col] + 2 * i_dst), 
mVal);\r\n\r\n            if (bsymFirst) {\r\n                tempT00 = _mm_unpacklo_epi8(T30, Ta0);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T90);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT10 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T70, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T90, Ta0);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffFirst0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffFirst1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffFirst2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffFirst3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst0[col] + 3 * i_dst), mVal);\r\n\r\n            //Second\r\n            if (bsymSecond) { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T70);\r\n                tempT10 = _mm_unpacklo_epi8(T10, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T20, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T30, T40);\r\n            } else { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T10);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffSecond0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffSecond1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffSecond2);\r\n 
           tempT30 = _mm_maddubs_epi16(tempT30, coeffSecond3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst1[col], mVal);\r\n\r\n            if (bsymSecond) {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T80);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T70);\r\n                tempT20 = _mm_unpacklo_epi8(T30, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T40, T50);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T20);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T70, T80);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffSecond0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffSecond1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffSecond2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffSecond3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst1[col] + i_dst), mVal);\r\n\r\n            if (bsymSecond) {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T90);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T80);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T50, T60);\r\n            }\r\n            else {\r\n           
     tempT00 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT20 = _mm_unpacklo_epi8(T60, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T80, T90);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffSecond0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffSecond1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffSecond2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffSecond3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst1[col] + 2 * i_dst), mVal);\r\n\r\n            if (bsymSecond) {\r\n                tempT00 = _mm_unpacklo_epi8(T30, Ta0);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T90);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT10 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T70, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T90, Ta0);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffSecond0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffSecond1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffSecond2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffSecond3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, 
shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst1[col] + 3 * i_dst), mVal);\r\n\r\n            //Third\r\n            if (bsymThird) { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T70);\r\n                tempT10 = _mm_unpacklo_epi8(T10, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T20, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T30, T40);\r\n            } else { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T10);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffThird0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffThird1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffThird2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffThird3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst2[col], mVal);\r\n\r\n            if (bsymThird) {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T80);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T70);\r\n                tempT20 = _mm_unpacklo_epi8(T30, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T40, T50);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T20);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T70, T80);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffThird0);\r\n            
tempT10 = _mm_maddubs_epi16(tempT10, coeffThird1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffThird2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffThird3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst2[col] + i_dst), mVal);\r\n\r\n            if (bsymThird) {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T90);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T80);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T50, T60);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT20 = _mm_unpacklo_epi8(T60, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T80, T90);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffThird0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffThird1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffThird2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffThird3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst2[col] + 2 * i_dst), mVal);\r\n\r\n            if (bsymThird) {\r\n                tempT00 = _mm_unpacklo_epi8(T30, Ta0);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T90);\r\n                tempT20 = 
_mm_unpacklo_epi8(T50, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT10 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T70, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T90, Ta0);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffThird0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffThird1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffThird2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffThird3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst2[col] + 3 * i_dst), mVal);\r\n\r\n            p += 8;\r\n        }\r\n\r\n        if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n            __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n            __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n            __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_src));\r\n            __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_src));\r\n            __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_src));\r\n\r\n            //First\r\n     
       if (bsymFirst) { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T70);\r\n                tempT10 = _mm_unpacklo_epi8(T10, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T20, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T30, T40);\r\n            } else { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T10);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffFirst0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffFirst1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffFirst2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffFirst3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst0[col]);\r\n\r\n            if (bsymFirst) {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T80);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T70);\r\n                tempT20 = _mm_unpacklo_epi8(T30, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T40, T50);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T20);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T70, T80);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffFirst0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffFirst1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffFirst2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, 
coeffFirst3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst0[col] + i_dst));\r\n\r\n            if (bsymFirst) {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T90);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T80);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T50, T60);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT20 = _mm_unpacklo_epi8(T60, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T80, T90);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffFirst0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffFirst1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffFirst2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffFirst3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst0[col] + 2 * i_dst));\r\n\r\n            if (bsymFirst) {\r\n                tempT00 = _mm_unpacklo_epi8(T30, Ta0);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T90);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            else {\r\n                tempT00 = 
_mm_unpacklo_epi8(T30, T40);\r\n                tempT10 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T70, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T90, Ta0);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffFirst0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffFirst1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffFirst2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffFirst3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst0[col] + 3 * i_dst));\r\n\r\n\r\n            //Second\r\n            if (bsymSecond) { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T70);\r\n                tempT10 = _mm_unpacklo_epi8(T10, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T20, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T30, T40);\r\n            } else { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T10);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffSecond0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffSecond1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffSecond2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffSecond3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, 
shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst1[col]);\r\n\r\n            if (bsymSecond) {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T80);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T70);\r\n                tempT20 = _mm_unpacklo_epi8(T30, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T40, T50);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T20);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T70, T80);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffSecond0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffSecond1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffSecond2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffSecond3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst1[col] + i_dst));\r\n\r\n            if (bsymSecond) {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T90);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T80);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T50, T60);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT20 = _mm_unpacklo_epi8(T60, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T80, T90);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffSecond0);\r\n         
   tempT10 = _mm_maddubs_epi16(tempT10, coeffSecond1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffSecond2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffSecond3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst1[col] +  2 * i_dst));\r\n\r\n            if (bsymSecond) {\r\n                tempT00 = _mm_unpacklo_epi8(T30, Ta0);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T90);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT10 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T70, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T90, Ta0);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffSecond0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffSecond1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffSecond2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffSecond3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst1[col] + 3 * i_dst));\r\n\r\n            //Third\r\n            if (bsymThird) { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T70);\r\n                tempT10 = 
_mm_unpacklo_epi8(T10, T60);\r\n                tempT20 = _mm_unpacklo_epi8(T20, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T30, T40);\r\n            } else { \r\n                tempT00 = _mm_unpacklo_epi8(T00, T10);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffThird0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffThird1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffThird2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffThird3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst2[col]);\r\n\r\n            if (bsymThird) {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T80);\r\n                tempT10 = _mm_unpacklo_epi8(T20, T70);\r\n                tempT20 = _mm_unpacklo_epi8(T30, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T40, T50);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T10, T20);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T60);\r\n                tempT30 = _mm_unpacklo_epi8(T70, T80);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffThird0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffThird1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffThird2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffThird3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = 
_mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst2[col] + i_dst));\r\n\r\n            if (bsymThird) {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T90);\r\n                tempT10 = _mm_unpacklo_epi8(T30, T80);\r\n                tempT20 = _mm_unpacklo_epi8(T40, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T50, T60);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T20, T30);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T50);\r\n                tempT20 = _mm_unpacklo_epi8(T60, T70);\r\n                tempT30 = _mm_unpacklo_epi8(T80, T90);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffThird0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffThird1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffThird2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffThird3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst2[col] + 2 * i_dst));\r\n\r\n            if (bsymThird) {\r\n                tempT00 = _mm_unpacklo_epi8(T30, Ta0);\r\n                tempT10 = _mm_unpacklo_epi8(T40, T90);\r\n                tempT20 = _mm_unpacklo_epi8(T50, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T60, T70);\r\n            }\r\n            else {\r\n                tempT00 = _mm_unpacklo_epi8(T30, T40);\r\n                tempT10 = _mm_unpacklo_epi8(T50, T60);\r\n                
tempT20 = _mm_unpacklo_epi8(T70, T80);\r\n                tempT30 = _mm_unpacklo_epi8(T90, Ta0);\r\n            }\r\n            tempT00 = _mm_maddubs_epi16(tempT00, coeffThird0);\r\n            tempT10 = _mm_maddubs_epi16(tempT10, coeffThird1);\r\n            tempT20 = _mm_maddubs_epi16(tempT20, coeffThird2);\r\n            tempT30 = _mm_maddubs_epi16(tempT30, coeffThird3);\r\n\r\n            mVal = _mm_add_epi16(tempT00, tempT10);\r\n            mVal = _mm_add_epi16(mVal, tempT20);\r\n            mVal = _mm_add_epi16(mVal, tempT30);\r\n\r\n            mVal = _mm_add_epi16(mVal, mAddOffset);\r\n            mVal = _mm_srai_epi16(mVal, shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst2[col] + 3 * i_dst));\r\n        }\r\n\r\n        src += 4 * i_src;\r\n        dst0 += 4 * i_dst;\r\n        dst1 += 4 * i_dst;\r\n        dst2 += 4 * i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_ext_sse128(pel_t *dst, int i_dst, mct_t *tmp, int i_tmp, int width, int height, const int8_t *coeff)\r\n{\r\n    int row, col;\r\n    int shift;\r\n    int16_t const *p;\r\n    int bsymy = (coeff[1] == coeff[6]);\r\n\r\n    __m128i mAddOffset;\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n\r\n    // VER\r\n    shift = 12;\r\n    mAddOffset = _mm_set1_epi32(1 << (shift - 1));\r\n    tmp = tmp - 3 * i_tmp;\r\n    if (bsymy) {\r\n        __m128i mCoefy1 = _mm_set1_epi16(coeff[0]);\r\n        __m128i mCoefy2 = _mm_set1_epi16(coeff[1]);\r\n        __m128i mCoefy3 = _mm_set1_epi16(coeff[2]);\r\n        __m128i mCoefy4 = _mm_set1_epi16(coeff[3]);\r\n        __m128i mVal1, mVal2, mVal;\r\n\r\n        for (row = 0; row < height - 3; row += 4) {\r\n            p = tmp;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n            
    __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n                __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi16(T00, T70);\r\n                __m128i T1 = _mm_unpacklo_epi16(T10, T60);\r\n                __m128i T2 = _mm_unpacklo_epi16(T20, T50);\r\n                __m128i T3 = _mm_unpacklo_epi16(T30, T40);\r\n                __m128i T4 = _mm_unpackhi_epi16(T00, T70);\r\n                __m128i T5 = _mm_unpackhi_epi16(T10, T60);\r\n                __m128i T6 = _mm_unpackhi_epi16(T20, T50);\r\n                __m128i T7 = _mm_unpackhi_epi16(T30, T40);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, 
T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, T70);\r\n                T6 = _mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                
_mm_storel_epi64((__m128i*)(&dst[col] + i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = _mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, T60);\r\n                T4 = _mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = _mm_unpackhi_epi16(T50, T60);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 2 * i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = _mm_unpacklo_epi16(T40, T90);\r\n                T2 = _mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, 
T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 3 * i_dst), mVal);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n                __m128i T70 = 
_mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi16(T00, T70);\r\n                __m128i T1 = _mm_unpacklo_epi16(T10, T60);\r\n                __m128i T2 = _mm_unpacklo_epi16(T20, T50);\r\n                __m128i T3 = _mm_unpacklo_epi16(T30, T40);\r\n                __m128i T4 = _mm_unpackhi_epi16(T00, T70);\r\n                __m128i T5 = _mm_unpackhi_epi16(T10, T60);\r\n                __m128i T6 = _mm_unpackhi_epi16(T20, T50);\r\n                __m128i T7 = _mm_unpackhi_epi16(T30, T40);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n\r\n                T0 = 
_mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, T70);\r\n                T6 = _mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = _mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, T60);\r\n                T4 = _mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = 
_mm_unpackhi_epi16(T50, T60);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 2 * i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = _mm_unpacklo_epi16(T40, T90);\r\n                T2 = _mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 
= _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 3 * i_dst));\r\n            }\r\n            tmp += 4 * i_tmp;\r\n            dst += 4 * i_dst;\r\n        }\r\n    } else {\r\n        __m128i mCoefy1 = _mm_set1_epi16(*(int16_t*)(coeff + 0));\r\n        __m128i mCoefy2 = _mm_set1_epi16(*(int16_t*)(coeff + 2));\r\n        __m128i mCoefy3 = _mm_set1_epi16(*(int16_t*)(coeff + 4));\r\n        __m128i mCoefy4 = _mm_set1_epi16(*(int16_t*)(coeff + 6));\r\n        __m128i mVal1, mVal2, mVal;\r\n        mCoefy1 = _mm_cvtepi8_epi16(mCoefy1);\r\n        mCoefy2 = _mm_cvtepi8_epi16(mCoefy2);\r\n        mCoefy3 = _mm_cvtepi8_epi16(mCoefy3);\r\n        mCoefy4 = _mm_cvtepi8_epi16(mCoefy4);\r\n\r\n        for (row = 0; row < height; row++) {\r\n            p = tmp;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n                __m128i 
T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n                __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi16(T00, T10);\r\n                __m128i T1 = _mm_unpacklo_epi16(T20, T30);\r\n                __m128i T2 = _mm_unpacklo_epi16(T40, T50);\r\n                __m128i T3 = _mm_unpacklo_epi16(T60, T70);\r\n                __m128i T4 = _mm_unpackhi_epi16(T00, T10);\r\n                __m128i T5 = _mm_unpackhi_epi16(T20, T30);\r\n                __m128i T6 = _mm_unpackhi_epi16(T40, T50);\r\n                __m128i T7 = _mm_unpackhi_epi16(T60, T70);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                
mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = _mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = _mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n   
             T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 2 * i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = _mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n      
          T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 3 * i_dst), mVal);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n                __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi16(T00, T10);\r\n                __m128i T1 = _mm_unpacklo_epi16(T20, 
T30);\r\n                __m128i T2 = _mm_unpacklo_epi16(T40, T50);\r\n                __m128i T3 = _mm_unpacklo_epi16(T60, T70);\r\n                __m128i T4 = _mm_unpackhi_epi16(T00, T10);\r\n                __m128i T5 = _mm_unpackhi_epi16(T20, T30);\r\n                __m128i T6 = _mm_unpackhi_epi16(T40, T50);\r\n                __m128i T7 = _mm_unpackhi_epi16(T60, T70);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = _mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n\r\n   
             T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = _mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n                T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n          
      T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n                mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 2 * i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = _mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(T0, T1);\r\n                mVal1 = _mm_add_epi32(mVal1, T2);\r\n                mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n                mVal2 = _mm_add_epi32(T4, T5);\r\n                mVal2 = _mm_add_epi32(mVal2, T6);\r\n                mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n       
         mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n                mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n                mVal1 = _mm_srai_epi32(mVal1, shift);\r\n                mVal2 = _mm_srai_epi32(mVal2, shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 3 * i_dst));\r\n            }\r\n\r\n            tmp += 4 * i_tmp;\r\n            dst += 4 * i_dst;\r\n        }\r\n    }\r\n}\r\n\r\nvoid intpl_luma_ext_x3_sse128(pel_t *const dst[3], int i_dst, mct_t *tmp, int i_tmp, int width, int height, const int8_t **coeff)\r\n{\r\n    /* Fused equivalent of three separate calls, one per coefficient set:\r\n    intpl_luma_ext_sse128(dst0, i_dst, tmp, i_tmp, width, height, coeff[0]);\r\n    intpl_luma_ext_sse128(dst1, i_dst, tmp, i_tmp, width, height, coeff[1]);\r\n    intpl_luma_ext_sse128(dst2, i_dst, tmp, i_tmp, width, height, coeff[2]);\r\n    */\r\n    int row, col;\r\n    int shift;\r\n    int16_t const *p;\r\n    int bsymyFirst = (coeff[0][1] == coeff[0][6]);\r\n    int bsymySecond = (coeff[1][1] == coeff[1][6]);\r\n    int bsymyThird = (coeff[2][1] == coeff[2][6]);\r\n\r\n    __m128i mAddOffset;\r\n    // Tail-column store mask; it is only used when width is not a multiple of 8,\r\n    // so clamp the index to avoid reading intrinsic_mask[-1] when (width & 7) == 0.\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) ? (width & 7) - 1 : 0]));\r\n\r\n    // Vertical filter pass\r\n    shift = 12;\r\n    mAddOffset = _mm_set1_epi32(1 << (shift - 1));\r\n    tmp = tmp - 3 * i_tmp;\r\n\r\n    __m128i mCoefy1First,mCoefy2First,mCoefy3First,mCoefy4First;\r\n    __m128i mCoefy1Second,mCoefy2Second,mCoefy3Second,mCoefy4Second;\r\n    __m128i mCoefy1Third,mCoefy2Third,mCoefy3Third,mCoefy4Third;\r\n\r\n    pel_t *dst0 = dst[0];\r\n    pel_t *dst1 = dst[1];\r\n    pel_t *dst2 = dst[2];\r\n\r\n    if(bsymyFirst) { \r\n        mCoefy1First = _mm_set1_epi16(coeff[0][0]);\r\n        mCoefy2First = _mm_set1_epi16(coeff[0][1]);\r\n        mCoefy3First = _mm_set1_epi16(coeff[0][2]);\r\n        mCoefy4First = _mm_set1_epi16(coeff[0][3]);\r\n    } else {\r\n        mCoefy1First 
= _mm_set1_epi16(*(int16_t*)coeff[0]);\r\n        mCoefy2First = _mm_set1_epi16(*(int16_t*)(coeff[0] + 2));\r\n        mCoefy3First = _mm_set1_epi16(*(int16_t*)(coeff[0] + 4));\r\n        mCoefy4First = _mm_set1_epi16(*(int16_t*)(coeff[0] + 6));\r\n        mCoefy1First = _mm_cvtepi8_epi16(mCoefy1First);\r\n        mCoefy2First = _mm_cvtepi8_epi16(mCoefy2First);\r\n        mCoefy3First = _mm_cvtepi8_epi16(mCoefy3First);\r\n        mCoefy4First = _mm_cvtepi8_epi16(mCoefy4First);\r\n    }\r\n\r\n    if(bsymySecond) { \r\n        mCoefy1Second = _mm_set1_epi16(coeff[1][0]);\r\n        mCoefy2Second = _mm_set1_epi16(coeff[1][1]);\r\n        mCoefy3Second = _mm_set1_epi16(coeff[1][2]);\r\n        mCoefy4Second = _mm_set1_epi16(coeff[1][3]);\r\n    } else {\r\n        mCoefy1Second = _mm_set1_epi16(*(int16_t*)coeff[1]);\r\n        mCoefy2Second = _mm_set1_epi16(*(int16_t*)(coeff[1] + 2));\r\n        mCoefy3Second = _mm_set1_epi16(*(int16_t*)(coeff[1] + 4));\r\n        mCoefy4Second = _mm_set1_epi16(*(int16_t*)(coeff[1] + 6));\r\n        mCoefy1Second = _mm_cvtepi8_epi16(mCoefy1Second);\r\n        mCoefy2Second = _mm_cvtepi8_epi16(mCoefy2Second);\r\n        mCoefy3Second = _mm_cvtepi8_epi16(mCoefy3Second);\r\n        mCoefy4Second = _mm_cvtepi8_epi16(mCoefy4Second);\r\n    }\r\n\r\n    if(bsymyThird) { \r\n        mCoefy1Third = _mm_set1_epi16(coeff[2][0]);\r\n        mCoefy2Third = _mm_set1_epi16(coeff[2][1]);\r\n        mCoefy3Third = _mm_set1_epi16(coeff[2][2]);\r\n        mCoefy4Third = _mm_set1_epi16(coeff[2][3]);\r\n    } else {\r\n        mCoefy1Third = _mm_set1_epi16(*(int16_t*)coeff[2]);\r\n        mCoefy2Third = _mm_set1_epi16(*(int16_t*)(coeff[2] + 2));\r\n        mCoefy3Third = _mm_set1_epi16(*(int16_t*)(coeff[2] + 4));\r\n        mCoefy4Third = _mm_set1_epi16(*(int16_t*)(coeff[2] + 6));\r\n        mCoefy1Third = _mm_cvtepi8_epi16(mCoefy1Third);\r\n        mCoefy2Third = _mm_cvtepi8_epi16(mCoefy2Third);\r\n        mCoefy3Third = 
_mm_cvtepi8_epi16(mCoefy3Third);\r\n        mCoefy4Third = _mm_cvtepi8_epi16(mCoefy4Third);\r\n    }\r\n\r\n    __m128i T00, T10, T20, T30, T40, T50, T60, T70, T80, T90, Ta0;\r\n    __m128i T0, T1, T2, T3, T4, T5, T6, T7;\r\n    __m128i mVal1, mVal2, mVal;\r\n    //\r\n    for (row = 0; row < height - 3 ; row += 4) { \r\n        p = tmp;\r\n        for (col = 0; col < width - 7; col += 8) { \r\n            T00 = _mm_loadu_si128((__m128i*)(p));\r\n            T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n            T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n            T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n            T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n            T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n            T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n            T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n            T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n            T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n            Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n\r\n            //First\r\n            if (bsymyFirst) {\r\n                T0 = _mm_unpacklo_epi16(T00, T70);\r\n                T1 = _mm_unpacklo_epi16(T10, T60);\r\n                T2 = _mm_unpacklo_epi16(T20, T50);\r\n                T3 = _mm_unpacklo_epi16(T30, T40);\r\n                T4 = _mm_unpackhi_epi16(T00, T70);\r\n                T5 = _mm_unpackhi_epi16(T10, T60);\r\n                T6 = _mm_unpackhi_epi16(T20, T50);\r\n                T7 = _mm_unpackhi_epi16(T30, T40);\r\n            } else {\r\n                T0 = _mm_unpacklo_epi16(T00, T10);\r\n                T1 = _mm_unpacklo_epi16(T20, T30);\r\n                T2 = _mm_unpacklo_epi16(T40, T50);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T00, T10);\r\n                T5 = _mm_unpackhi_epi16(T20, T30);\r\n                T6 = _mm_unpackhi_epi16(T40, T50);\r\n        
        T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1First);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2First);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3First);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4First);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1First);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2First);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3First);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4First);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst0[col], mVal);\r\n\r\n            if (bsymyFirst) {\r\n                T0 = _mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, T70);\r\n                T6 = _mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = 
_mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1First);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2First);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3First);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4First);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1First);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2First);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3First);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4First);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst0[col] + i_dst), mVal);\r\n\r\n            if (bsymyFirst) {\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = _mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, T60);\r\n                T4 = _mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = _mm_unpackhi_epi16(T50, T60);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = 
_mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n                T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1First);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2First);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3First);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4First);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1First);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2First);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3First);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4First);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst0[col] + 2 * i_dst), mVal);\r\n\r\n            if (bsymyFirst) {\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = _mm_unpacklo_epi16(T40, T90);\r\n                T2 = _mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = 
_mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1First);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2First);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3First);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4First);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1First);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2First);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3First);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4First);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst0[col] + 3 * i_dst), mVal);\r\n\r\n            //Second\r\n            if (bsymySecond) {\r\n                T0 = _mm_unpacklo_epi16(T00, T70);\r\n                T1 = _mm_unpacklo_epi16(T10, T60);\r\n                T2 = _mm_unpacklo_epi16(T20, T50);\r\n                T3 = _mm_unpacklo_epi16(T30, T40);\r\n                T4 = _mm_unpackhi_epi16(T00, T70);\r\n                T5 = _mm_unpackhi_epi16(T10, T60);\r\n                T6 = _mm_unpackhi_epi16(T20, T50);\r\n                T7 = _mm_unpackhi_epi16(T30, 
T40);\r\n            } else {\r\n                T0 = _mm_unpacklo_epi16(T00, T10);\r\n                T1 = _mm_unpacklo_epi16(T20, T30);\r\n                T2 = _mm_unpacklo_epi16(T40, T50);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T00, T10);\r\n                T5 = _mm_unpackhi_epi16(T20, T30);\r\n                T6 = _mm_unpackhi_epi16(T40, T50);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Second);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Second);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Second);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Second);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Second);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Second);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Second);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Second);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst1[col], mVal);\r\n\r\n            if (bsymySecond) {\r\n                T0 = _mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, T70);\r\n                T6 = 
_mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = _mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Second);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Second);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Second);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Second);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Second);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Second);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Second);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Second);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst1[col] + i_dst), mVal);\r\n\r\n            if (bsymySecond) {\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = _mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, T60);\r\n                T4 = 
_mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = _mm_unpackhi_epi16(T50, T60);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = _mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n                T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Second);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Second);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Second);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Second);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Second);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Second);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Second);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Second);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst1[col] + 2 * i_dst), mVal);\r\n\r\n            if (bsymySecond) {\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = _mm_unpacklo_epi16(T40, T90);\r\n                T2 = 
_mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = _mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Second);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Second);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Second);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Second);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Second);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Second);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Second);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Second);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst1[col] + 3 * i_dst), mVal);\r\n\r\n            //Third\r\n            if (bsymyThird) {\r\n     
           T0 = _mm_unpacklo_epi16(T00, T70);\r\n                T1 = _mm_unpacklo_epi16(T10, T60);\r\n                T2 = _mm_unpacklo_epi16(T20, T50);\r\n                T3 = _mm_unpacklo_epi16(T30, T40);\r\n                T4 = _mm_unpackhi_epi16(T00, T70);\r\n                T5 = _mm_unpackhi_epi16(T10, T60);\r\n                T6 = _mm_unpackhi_epi16(T20, T50);\r\n                T7 = _mm_unpackhi_epi16(T30, T40);\r\n            } else {\r\n                T0 = _mm_unpacklo_epi16(T00, T10);\r\n                T1 = _mm_unpacklo_epi16(T20, T30);\r\n                T2 = _mm_unpacklo_epi16(T40, T50);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T00, T10);\r\n                T5 = _mm_unpackhi_epi16(T20, T30);\r\n                T6 = _mm_unpackhi_epi16(T40, T50);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Third);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Third);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Third);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Third);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Third);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Third);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Third);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Third);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            
_mm_storel_epi64((__m128i*)&dst2[col], mVal);\r\n\r\n            if (bsymyThird) {\r\n                T0 = _mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, T70);\r\n                T6 = _mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = _mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Third);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Third);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Third);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Third);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Third);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Third);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Third);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Third);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, 
mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst2[col] + i_dst), mVal);\r\n\r\n            if (bsymyThird) {\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = _mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, T60);\r\n                T4 = _mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = _mm_unpackhi_epi16(T50, T60);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = _mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n                T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Third);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Third);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Third);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Third);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Third);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Third);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Third);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Third);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n      
      mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst2[col] + 2 * i_dst), mVal);\r\n\r\n            if (bsymyThird) {\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = _mm_unpacklo_epi16(T40, T90);\r\n                T2 = _mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = _mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Third);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Third);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Third);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Third);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Third);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Third);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Third);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Third);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = 
_mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)(&dst2[col] + 3 * i_dst), mVal);\r\n\r\n            p += 8;\r\n        }\r\n\r\n        if (col < width) {\r\n            T00 = _mm_loadu_si128((__m128i*)(p));\r\n            T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n            T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n            T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n            T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n            T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n            T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n            T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n            T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n            T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n            Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n            //First\r\n            if (bsymyFirst) {\r\n                T0 = _mm_unpacklo_epi16(T00, T70);\r\n                T1 = _mm_unpacklo_epi16(T10, T60);\r\n                T2 = _mm_unpacklo_epi16(T20, T50);\r\n                T3 = _mm_unpacklo_epi16(T30, T40);\r\n                T4 = _mm_unpackhi_epi16(T00, T70);\r\n                T5 = _mm_unpackhi_epi16(T10, T60);\r\n                T6 = _mm_unpackhi_epi16(T20, T50);\r\n                T7 = _mm_unpackhi_epi16(T30, T40);\r\n            } else {\r\n                T0 = _mm_unpacklo_epi16(T00, T10);\r\n                T1 = _mm_unpacklo_epi16(T20, T30);\r\n                T2 = _mm_unpacklo_epi16(T40, T50);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T00, T10);\r\n                T5 = _mm_unpackhi_epi16(T20, T30);\r\n                T6 = _mm_unpackhi_epi16(T40, T50);\r\n         
       T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1First);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2First);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3First);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4First);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1First);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2First);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3First);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4First);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst0[col]);\r\n\r\n            if (bsymyFirst) {\r\n                T0 = _mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, T70);\r\n                T6 = _mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = 
_mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1First);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2First);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3First);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4First);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1First);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2First);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3First);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4First);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst0[col] + i_dst));\r\n\r\n            if (bsymyFirst) {\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = _mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, T60);\r\n                T4 = _mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = _mm_unpackhi_epi16(T50, T60);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = 
_mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n                T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1First);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2First);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3First);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4First);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1First);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2First);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3First);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4First);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst0[col] + 2 * i_dst));\r\n\r\n            if (bsymyFirst) {\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = _mm_unpacklo_epi16(T40, T90);\r\n                T2 = _mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = 
_mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1First);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2First);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3First);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4First);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1First);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2First);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3First);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4First);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst0[col] + 3 * i_dst));\r\n\r\n            //Second\r\n            if (bsymySecond) {\r\n                T0 = _mm_unpacklo_epi16(T00, T70);\r\n                T1 = _mm_unpacklo_epi16(T10, T60);\r\n                T2 = _mm_unpacklo_epi16(T20, T50);\r\n                T3 = _mm_unpacklo_epi16(T30, T40);\r\n                T4 = _mm_unpackhi_epi16(T00, T70);\r\n                T5 = _mm_unpackhi_epi16(T10, T60);\r\n                T6 = _mm_unpackhi_epi16(T20, T50);\r\n                T7 = 
_mm_unpackhi_epi16(T30, T40);\r\n            } else {\r\n                T0 = _mm_unpacklo_epi16(T00, T10);\r\n                T1 = _mm_unpacklo_epi16(T20, T30);\r\n                T2 = _mm_unpacklo_epi16(T40, T50);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T00, T10);\r\n                T5 = _mm_unpackhi_epi16(T20, T30);\r\n                T6 = _mm_unpackhi_epi16(T40, T50);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Second);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Second);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Second);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Second);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Second);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Second);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Second);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Second);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst1[col]);\r\n\r\n            if (bsymySecond) {\r\n                T0 = _mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, 
T70);\r\n                T6 = _mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = _mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Second);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Second);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Second);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Second);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Second);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Second);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Second);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Second);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst1[col] + i_dst));\r\n\r\n            if (bsymySecond) {\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = _mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, 
T60);\r\n                T4 = _mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = _mm_unpackhi_epi16(T50, T60);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = _mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n                T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Second);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Second);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Second);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Second);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Second);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Second);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Second);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Second);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst1[col] + 2 * i_dst));\r\n\r\n            if (bsymySecond) {\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = 
_mm_unpacklo_epi16(T40, T90);\r\n                T2 = _mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = _mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Second);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Second);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Second);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Second);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Second);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Second);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Second);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Second);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst1[col] + 3 * 
i_dst));\r\n\r\n            //Third\r\n            if (bsymyThird) {\r\n                T0 = _mm_unpacklo_epi16(T00, T70);\r\n                T1 = _mm_unpacklo_epi16(T10, T60);\r\n                T2 = _mm_unpacklo_epi16(T20, T50);\r\n                T3 = _mm_unpacklo_epi16(T30, T40);\r\n                T4 = _mm_unpackhi_epi16(T00, T70);\r\n                T5 = _mm_unpackhi_epi16(T10, T60);\r\n                T6 = _mm_unpackhi_epi16(T20, T50);\r\n                T7 = _mm_unpackhi_epi16(T30, T40);\r\n            } else {\r\n                T0 = _mm_unpacklo_epi16(T00, T10);\r\n                T1 = _mm_unpacklo_epi16(T20, T30);\r\n                T2 = _mm_unpacklo_epi16(T40, T50);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T00, T10);\r\n                T5 = _mm_unpackhi_epi16(T20, T30);\r\n                T6 = _mm_unpackhi_epi16(T40, T50);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Third);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Third);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Third);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Third);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Third);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Third);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Third);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Third);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal 
= _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst2[col]);\r\n\r\n            if (bsymyThird) {\r\n                T0 = _mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, T70);\r\n                T6 = _mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = _mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Third);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Third);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Third);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Third);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Third);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Third);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Third);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Third);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = 
_mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst2[col] + i_dst));\r\n\r\n            if (bsymyThird) {\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = _mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, T60);\r\n                T4 = _mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = _mm_unpackhi_epi16(T50, T60);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = _mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n                T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Third);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Third);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Third);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Third);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Third);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Third);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Third);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Third);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = _mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = 
_mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst2[col] + 2 * i_dst));\r\n\r\n            if (bsymyThird) {\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = _mm_unpacklo_epi16(T40, T90);\r\n                T2 = _mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n            }\r\n            else {\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = _mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n            }\r\n            T0 = _mm_madd_epi16(T0, mCoefy1Third);\r\n            T1 = _mm_madd_epi16(T1, mCoefy2Third);\r\n            T2 = _mm_madd_epi16(T2, mCoefy3Third);\r\n            T3 = _mm_madd_epi16(T3, mCoefy4Third);\r\n            T4 = _mm_madd_epi16(T4, mCoefy1Third);\r\n            T5 = _mm_madd_epi16(T5, mCoefy2Third);\r\n            T6 = _mm_madd_epi16(T6, mCoefy3Third);\r\n            T7 = _mm_madd_epi16(T7, mCoefy4Third);\r\n\r\n            mVal1 = _mm_add_epi32(T0, T1);\r\n            mVal1 = _mm_add_epi32(mVal1, T2);\r\n            mVal1 = _mm_add_epi32(mVal1, T3);\r\n\r\n            mVal2 = _mm_add_epi32(T4, T5);\r\n            mVal2 = _mm_add_epi32(mVal2, T6);\r\n            mVal2 = 
_mm_add_epi32(mVal2, T7);\r\n\r\n            mVal1 = _mm_add_epi32(mVal1, mAddOffset);\r\n            mVal2 = _mm_add_epi32(mVal2, mAddOffset);\r\n            mVal1 = _mm_srai_epi32(mVal1, shift);\r\n            mVal2 = _mm_srai_epi32(mVal2, shift);\r\n            mVal = _mm_packs_epi32(mVal1, mVal2);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)(&dst2[col] + 3 * i_dst));\r\n        }\r\n\r\n        tmp += 4 * i_tmp;\r\n        dst0 += 4 * i_dst;\r\n        dst1 += 4 * i_dst;\r\n        dst2 += 4 * i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_ver_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    int row, col;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n    int bsym = (coeff[1] == coeff[2]);\r\n    __m128i mAddOffset = _mm_set1_epi16(offset);\r\n    pel_t const *p;\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n\r\n    src -= i_src;\r\n    if (bsym) {\r\n        __m128i coeff0 = _mm_set1_epi8(coeff[0]);\r\n        __m128i coeff1 = _mm_set1_epi8(coeff[1]);\r\n        __m128i mVal;\r\n\r\n        for (row = 0; row < height - 3; row += 4) {\r\n            p = src;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi8(T00, 
T30);\r\n                __m128i T1 = _mm_unpacklo_epi8(T10, T20);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                T0 = _mm_unpacklo_epi8(T10, T40);\r\n                T1 = _mm_unpacklo_epi8(T20, T30);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi8(T20, T50);\r\n                T1 = _mm_unpacklo_epi8(T30, T40);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 2 * i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi8(T30, T60);\r\n                T1 = _mm_unpacklo_epi8(T40, T50);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n    
            _mm_storel_epi64((__m128i*)(&dst[col] + 3 * i_dst), mVal);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi8(T00, T30);\r\n                __m128i T1 = _mm_unpacklo_epi8(T10, T20);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n\r\n                T0 = _mm_unpacklo_epi8(T10, T40);\r\n                T1 = _mm_unpacklo_epi8(T20, T30);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi8(T20, T50);\r\n                T1 = _mm_unpacklo_epi8(T30, T40);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n          
      T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 2 * i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi8(T30, T60);\r\n                T1 = _mm_unpacklo_epi8(T40, T50);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 3 * i_dst));\r\n            }\r\n\r\n            src += 4 * i_src;\r\n            dst += 4 * i_dst;\r\n        }\r\n\r\n        for (; row < height; row++) {\r\n            p = src;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n\r\n                T00 = _mm_unpacklo_epi8(T00, T30);\r\n                T10 = _mm_unpacklo_epi8(T10, T20);\r\n\r\n                T00 = _mm_maddubs_epi16(T00, coeff0);\r\n                T10 = _mm_maddubs_epi16(T10, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T00, T10);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                p += 8;\r\n            }\r\n\r\n 
           if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n\r\n                T00 = _mm_unpacklo_epi8(T00, T30);\r\n                T10 = _mm_unpacklo_epi8(T10, T20);\r\n\r\n                T00 = _mm_maddubs_epi16(T00, coeff0);\r\n                T10 = _mm_maddubs_epi16(T10, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T00, T10);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n            }\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m128i coeff0 = _mm_set1_epi16(*(short*)coeff);\r\n        __m128i coeff1 = _mm_set1_epi16(*(short*)(coeff + 2));\r\n        __m128i mVal;\r\n        for (row = 0; row < height - 3; row += 4) {\r\n            p = src;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi8(T00, T10);\r\n                __m128i T1 = _mm_unpacklo_epi8(T20, T30);\r\n\r\n                T0 = 
_mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                T0 = _mm_unpacklo_epi8(T10, T20);\r\n                T1 = _mm_unpacklo_epi8(T30, T40);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi8(T20, T30);\r\n                T1 = _mm_unpacklo_epi8(T40, T50);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 2 * i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi8(T30, T40);\r\n                T1 = _mm_unpacklo_epi8(T50, T60);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 3 * i_dst), mVal);\r\n\r\n                
p += 8;\r\n            }\r\n\r\n            if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi8(T00, T10);\r\n                __m128i T1 = _mm_unpacklo_epi8(T20, T30);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n\r\n                T0 = _mm_unpacklo_epi8(T10, T20);\r\n                T1 = _mm_unpacklo_epi8(T30, T40);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi8(T20, T30);\r\n                T1 = _mm_unpacklo_epi8(T40, T50);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, 
T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 2 * i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi8(T30, T40);\r\n                T1 = _mm_unpacklo_epi8(T50, T60);\r\n\r\n                T0 = _mm_maddubs_epi16(T0, coeff0);\r\n                T1 = _mm_maddubs_epi16(T1, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T0, T1);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 3 * i_dst));\r\n            }\r\n\r\n            src += 4 * i_src;\r\n            dst += 4 * i_dst;\r\n        }\r\n\r\n        for (; row < height; row++) {\r\n            p = src;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n\r\n                T00 = _mm_unpacklo_epi8(T00, T10);\r\n                T10 = _mm_unpacklo_epi8(T20, T30);\r\n\r\n                T00 = _mm_maddubs_epi16(T00, coeff0);\r\n                T10 = _mm_maddubs_epi16(T10, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T00, T10);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in 
dst\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n\r\n                T00 = _mm_unpacklo_epi8(T00, T10);\r\n                T10 = _mm_unpacklo_epi8(T20, T30);\r\n\r\n                T00 = _mm_maddubs_epi16(T00, coeff0);\r\n                T10 = _mm_maddubs_epi16(T10, coeff1);\r\n\r\n                mVal = _mm_add_epi16(T00, T10);\r\n\r\n                mVal = _mm_add_epi16(mVal, mAddOffset);\r\n                mVal = _mm_srai_epi16(mVal, shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n            }\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ver_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n    int row, col;\r\n    int bsym = (coeff[1] == coeff[6]);\r\n\r\n    __m128i mAddOffset = _mm_set1_epi16(offset);\r\n\r\n    pel_t const *p;\r\n\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n\r\n    src -= 3 * i_src;\r\n\r\n    if (bsym) {\r\n        __m128i coeff0 = _mm_set1_epi8(coeff[0]);\r\n        __m128i coeff1 = _mm_set1_epi8(coeff[1]);\r\n        __m128i coeff2 = _mm_set1_epi8(coeff[2]);\r\n        __m128i coeff3 = _mm_set1_epi8(coeff[3]);\r\n\r\n        for (row = 0; row < height - 3; row += 4) {\r\n            __m128i mVal;\r\n            p = src;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = 
_mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n                __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_src));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_src));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_src));\r\n\r\n                __m128i T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T00, T70), coeff0);\r\n                __m128i T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T10, T60), coeff1);\r\n                __m128i T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T50), coeff2);\r\n                __m128i T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T40), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T10, T80), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T70), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T60), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                
_mm_storel_epi64((__m128i*)(&dst[col] + i_dst), mVal);\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T90), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T80), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T70), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T50, T60), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 2 * i_dst), mVal);\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, Ta0), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T90), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T50, T80), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T60, T70), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 3 * i_dst), mVal);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n   
             __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_src));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_src));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_src));\r\n\r\n                __m128i T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T00, T70), coeff0);\r\n                __m128i T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T10, T60), coeff1);\r\n                __m128i T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T50), coeff2);\r\n                __m128i T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T40), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T10, T80), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T70), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T60), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + i_dst));\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T90), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T80), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T70), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T50, T60), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), 
_mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 2 * i_dst));\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, Ta0), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T90), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T50, T80), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T60, T70), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 3 * i_dst));\r\n\r\n            }\r\n\r\n            src += 4 * i_src;\r\n            dst += 4 * i_dst;\r\n        }\r\n    }\r\n    else {\r\n        __m128i coeff0 = _mm_set1_epi16(*(short*)coeff);\r\n        __m128i coeff1 = _mm_set1_epi16(*(short*)(coeff + 2));\r\n        __m128i coeff2 = _mm_set1_epi16(*(short*)(coeff + 4));\r\n        __m128i coeff3 = _mm_set1_epi16(*(short*)(coeff + 6));\r\n        for (row = 0; row < height - 3; row += 4) {\r\n            __m128i mVal;\r\n            p = src;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n                __m128i T70 = 
_mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_src));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_src));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_src));\r\n\r\n                __m128i T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T00, T10), coeff0);\r\n                __m128i T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T30), coeff1);\r\n                __m128i T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff2);\r\n                __m128i T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T60, T70), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T10, T20), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T40), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T50, T60), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T70, T80), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + i_dst), mVal);\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T30), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T60, T70), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T80, T90), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal 
= _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 2 * i_dst), mVal);\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T40), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T50, T60), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T70, T80), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T90, Ta0), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 3 * i_dst), mVal);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n                __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_src));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_src));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_src));\r\n\r\n                __m128i T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T00, T10), coeff0);\r\n                __m128i T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T30), coeff1);\r\n        
        __m128i T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff2);\r\n                __m128i T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T60, T70), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T10, T20), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T40), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T50, T60), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T70, T80), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + i_dst));\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T30), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T60, T70), coeff2);\r\n                T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T80, T90), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 2 * i_dst));\r\n\r\n                T1 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T40), coeff0);\r\n                T2 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T50, T60), coeff1);\r\n                T3 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T70, T80), coeff2);\r\n    
            T4 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T90, Ta0), coeff3);\r\n\r\n                mVal = _mm_add_epi16(_mm_add_epi16(T1, T2), _mm_add_epi16(T3, T4));\r\n                mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 3 * i_dst));\r\n\r\n            }\r\n\r\n            src += 4 * i_src;\r\n            dst += 4 * i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intpl_luma_block_ver0_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{//-1, 4, -10, 57, 19, -7,  3, -1\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n    int row, col;\r\n\r\n    __m128i mAddOffset = _mm_set1_epi16(offset);\r\n    pel_t const *p;\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n\r\n    src -= 3 * i_src;\r\n    \r\n    //__m128i coeff0 = _mm_set1_epi16(*(short*)coeff);//-1 4\r\n    __m128i coeff1 = _mm_set1_epi16(*(short*)(coeff + 2));//-10 57\r\n    __m128i coeff2 = _mm_set1_epi16(*(short*)(coeff + 4));//19 -7\r\n    //__m128i coeff3 = _mm_set1_epi16(*(short*)(coeff + 6));//3 -1\r\n    for (row = 0; row < height; row++) {\r\n        __m128i mVal;\r\n        p = src;\r\n        for (col = 0; col < width - 7; col += 8) {\r\n            __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n            __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            __m128i T70 = 
_mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n\r\n            T00 = _mm_adds_epi16(_mm_cvtepu8_epi16(T00), _mm_cvtepu8_epi16(T70));\r\n            T10 = _mm_adds_epi16(_mm_cvtepu8_epi16(T10), _mm_cvtepu8_epi16(T60));\r\n            T10 = _mm_subs_epi16(_mm_slli_epi16(T10, 2), _mm_cvtepu8_epi16(T60));// 4*(row1+row6) - row6 = 4*row1 + 3*row6 (original note: was 12, now 9)\r\n            T20 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T30), coeff1);\r\n            T30 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff2);\r\n            \r\n\r\n            mVal = _mm_add_epi16(_mm_sub_epi16(T10, T00), _mm_add_epi16(T20, T30));\r\n            mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n            p += 8;\r\n        }\r\n\r\n        if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n            __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n            __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n\r\n            T00 = _mm_adds_epi16(_mm_cvtepu8_epi16(T00), _mm_cvtepu8_epi16(T70));\r\n            T10 = _mm_adds_epi16(_mm_cvtepu8_epi16(T10), _mm_cvtepu8_epi16(T60));\r\n            T10 = _mm_subs_epi16(_mm_slli_epi16(T10, 2), _mm_cvtepu8_epi16(T60));// 4*(row1+row6) - row6 = 4*row1 + 3*row6 (original note: was 12, now 9)\r\n            T20 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T30), coeff1);\r\n            T30 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff2);\r\n\r\n            mVal = _mm_add_epi16(_mm_sub_epi16(T10, T00), _mm_add_epi16(T20, T30));\r\n            mVal = 
_mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n        }\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n    \r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intpl_luma_block_ver1_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{//-1, 4, -11, 40, 40, -11, 4, -1\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n    int row, col;\r\n\r\n    __m128i mAddOffset = _mm_set1_epi16(offset);\r\n\r\n    pel_t const *p;\r\n\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n\r\n    src -= 3 * i_src;\r\n\r\n    __m128i coeff2 = _mm_set1_epi8(coeff[2]);//-11\r\n    __m128i coeff3 = _mm_set1_epi8(coeff[3]);//40\r\n\r\n    for (row = 0; row < height; row++) {\r\n        __m128i mVal;\r\n        p = src;\r\n        for (col = 0; col < width - 7; col += 8) {\r\n            __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n            __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n\r\n            T00 = _mm_adds_epi16(_mm_cvtepu8_epi16(T00), _mm_cvtepu8_epi16(T70));\r\n            T10 = _mm_adds_epi16(_mm_cvtepu8_epi16(T10), _mm_cvtepu8_epi16(T60));\r\n            T10 = _mm_slli_epi16(T10, 2);\r\n            T20 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T50), coeff2);\r\n            T30 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T40), 
coeff3);\r\n\r\n            mVal = _mm_add_epi16(_mm_sub_epi16(T10, T00), _mm_add_epi16(T20, T30));\r\n            mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n            p += 8;\r\n        }\r\n\r\n        if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n            __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n            __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n\r\n            T00 = _mm_adds_epi16(_mm_cvtepu8_epi16(T00), _mm_cvtepu8_epi16(T70));\r\n            T10 = _mm_adds_epi16(_mm_cvtepu8_epi16(T10), _mm_cvtepu8_epi16(T60));\r\n            T10 = _mm_slli_epi16(T10, 2);\r\n            T20 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T50), coeff2);\r\n            T30 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T30, T40), coeff3);\r\n\r\n            mVal = _mm_add_epi16(_mm_sub_epi16(T10, T00), _mm_add_epi16(T20, T30));\r\n            mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n        }\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n    \r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intpl_luma_block_ver2_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{//-1, 3,  -7, 19, 57, -10, 
4, -1\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n    int row, col;\r\n\r\n    __m128i mAddOffset = _mm_set1_epi16(offset);\r\n\r\n    pel_t const *p;\r\n\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n\r\n    src -= 3 * i_src;\r\n\r\n    \r\n    //__m128i coeff0 = _mm_set1_epi16(*(short*)coeff);\r\n    __m128i coeff1 = _mm_set1_epi16(*(short*)(coeff + 2));\r\n    __m128i coeff2 = _mm_set1_epi16(*(short*)(coeff + 4));\r\n    //__m128i coeff3 = _mm_set1_epi16(*(short*)(coeff + 6));\r\n    for (row = 0; row < height; row++) {\r\n        __m128i mVal;\r\n        p = src;\r\n        for (col = 0; col < width - 7; col += 8) {\r\n            __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n            __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n\r\n            T00 = _mm_adds_epi16(_mm_cvtepu8_epi16(T00), _mm_cvtepu8_epi16(T70));\r\n            T60 = _mm_adds_epi16(_mm_cvtepu8_epi16(T10), _mm_cvtepu8_epi16(T60));\r\n            T10 = _mm_subs_epi16(_mm_slli_epi16(T60, 2), _mm_cvtepu8_epi16(T10));\r\n            T20 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T30), coeff1);\r\n            T30 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff2);\r\n\r\n            mVal = _mm_add_epi16(_mm_sub_epi16(T10, T00), _mm_add_epi16(T20, T30));\r\n            mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n            p += 8;\r\n        
}\r\n\r\n        if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n            __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n            __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_src));\r\n            __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_src));\r\n            __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_src));\r\n            __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_src));\r\n            __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_src));\r\n            __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_src));\r\n            __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_src));\r\n\r\n            T00 = _mm_adds_epi16(_mm_cvtepu8_epi16(T00), _mm_cvtepu8_epi16(T70));\r\n            T60 = _mm_adds_epi16(_mm_cvtepu8_epi16(T10), _mm_cvtepu8_epi16(T60));\r\n            T10 = _mm_subs_epi16(_mm_slli_epi16(T60, 2), _mm_cvtepu8_epi16(T10));\r\n            T20 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T20, T30), coeff1);\r\n            T30 = _mm_maddubs_epi16(_mm_unpacklo_epi8(T40, T50), coeff2);\r\n\r\n            mVal = _mm_add_epi16(_mm_sub_epi16(T10, T00), _mm_add_epi16(T20, T30));\r\n            mVal = _mm_srai_epi16(_mm_add_epi16(mVal, mAddOffset), shift);\r\n            mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n            _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n        }\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n    \r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_ext_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coef_x, const int8_t *coef_y)\r\n{\r\n    ALIGN16(int16_t tmp_res[(32 + 3) * 32]);\r\n    int16_t *tmp = tmp_res;\r\n    const int i_tmp = 32;\r\n    int row, col;\r\n    int shift;\r\n    int16_t const *p;\r\n\r\n    int bsymy = (coef_y[1] == coef_y[6]);\r\n\r\n    __m128i mAddOffset;\r\n\r\n    __m128i mSwitch1 = 
_mm_setr_epi8(0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6);\r\n    __m128i mSwitch2 = _mm_setr_epi8(4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10);\r\n\r\n    __m128i mCoefx = _mm_set1_epi32(*(int*)coef_x);\r\n\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 1]));\r\n\r\n    // HOR\r\n    src = src - 1 * i_src - 1;\r\n\r\n    if (width > 4) {\r\n        for (row = -1; row < height + 2; row++) {\r\n            __m128i mT0, mT1, mV01;\r\n            for (col = 0; col < width; col += 8) {\r\n                __m128i mSrc = _mm_loadu_si128((__m128i*)(src + col));\r\n                mT0 = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch1), mCoefx);\r\n                mT1 = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch2), mCoefx);\r\n\r\n                mV01 = _mm_hadd_epi16(mT0, mT1);\r\n                _mm_store_si128((__m128i*)&tmp[col], mV01);\r\n            }\r\n            src += i_src;\r\n            tmp += i_tmp;\r\n        }\r\n    } else {\r\n        for (row = -1; row < height + 2; row++) {\r\n            __m128i mSrc = _mm_loadu_si128((__m128i*)src);\r\n            __m128i mT0 = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch1), mCoefx);\r\n            __m128i mV01 = _mm_hadd_epi16(mT0, mT0);\r\n            _mm_storel_epi64((__m128i*)tmp, mV01);\r\n            src += i_src;\r\n            tmp += i_tmp;\r\n        }\r\n    }\r\n\r\n\r\n    // VER\r\n    shift = 12;\r\n    mAddOffset = _mm_set1_epi32(1 << 11);\r\n\r\n    tmp = tmp_res;\r\n    if (bsymy) {\r\n        __m128i mCoefy1 = _mm_set1_epi16(coef_y[0]);\r\n        __m128i mCoefy2 = _mm_set1_epi16(coef_y[1]);\r\n\r\n        for (row = 0; row < height; row += 2) {\r\n            p = tmp;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i mV01, mV02;\r\n                __m128i mV11, mV12;\r\n                __m128i T0 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T1 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n            
    __m128i T2 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T3 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T4 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n\r\n                __m128i M00 = _mm_unpacklo_epi16(T0, T3);\r\n                __m128i M01 = _mm_unpacklo_epi16(T1, T2);\r\n                __m128i M02 = _mm_unpackhi_epi16(T0, T3);\r\n                __m128i M03 = _mm_unpackhi_epi16(T1, T2);\r\n\r\n                __m128i M10 = _mm_unpacklo_epi16(T1, T4);\r\n                __m128i M11 = _mm_unpacklo_epi16(T2, T3);\r\n                __m128i M12 = _mm_unpackhi_epi16(T1, T4);\r\n                __m128i M13 = _mm_unpackhi_epi16(T2, T3);\r\n\r\n                mV01 = _mm_add_epi32(_mm_madd_epi16(M00, mCoefy1), _mm_madd_epi16(M01, mCoefy2));\r\n                mV02 = _mm_add_epi32(_mm_madd_epi16(M02, mCoefy1), _mm_madd_epi16(M03, mCoefy2));                \r\n                mV11 = _mm_add_epi32(_mm_madd_epi16(M10, mCoefy1), _mm_madd_epi16(M11, mCoefy2));\r\n                mV12 = _mm_add_epi32(_mm_madd_epi16(M12, mCoefy1), _mm_madd_epi16(M13, mCoefy2));\r\n\r\n                mV01 = _mm_srai_epi32(_mm_add_epi32(mV01, mAddOffset), shift);\r\n                mV02 = _mm_srai_epi32(_mm_add_epi32(mV02, mAddOffset), shift);\r\n                mV11 = _mm_srai_epi32(_mm_add_epi32(mV11, mAddOffset), shift);\r\n                mV12 = _mm_srai_epi32(_mm_add_epi32(mV12, mAddOffset), shift);\r\n\r\n                mV01 = _mm_packs_epi32 (mV01, mV02);\r\n                mV01 = _mm_packus_epi16(mV01, mV01);\r\n                mV11 = _mm_packs_epi32 (mV11, mV12);\r\n                mV11 = _mm_packus_epi16(mV11, mV11);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col],         mV01);\r\n                _mm_storel_epi64((__m128i*)&dst[col + i_dst], mV11);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n          
      __m128i mV01, mV02;\r\n                __m128i mV11, mV12;\r\n                __m128i T0 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T1 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T2 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T3 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T4 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n\r\n                __m128i M00 = _mm_unpacklo_epi16(T0, T3);\r\n                __m128i M01 = _mm_unpacklo_epi16(T1, T2);\r\n                __m128i M02 = _mm_unpackhi_epi16(T0, T3);\r\n                __m128i M03 = _mm_unpackhi_epi16(T1, T2);\r\n\r\n                __m128i M10 = _mm_unpacklo_epi16(T1, T4);\r\n                __m128i M11 = _mm_unpacklo_epi16(T2, T3);\r\n                __m128i M12 = _mm_unpackhi_epi16(T1, T4);\r\n                __m128i M13 = _mm_unpackhi_epi16(T2, T3);\r\n\r\n                mV01 = _mm_add_epi32(_mm_madd_epi16(M00, mCoefy1), _mm_madd_epi16(M01, mCoefy2));\r\n                mV02 = _mm_add_epi32(_mm_madd_epi16(M02, mCoefy1), _mm_madd_epi16(M03, mCoefy2));\r\n                mV11 = _mm_add_epi32(_mm_madd_epi16(M10, mCoefy1), _mm_madd_epi16(M11, mCoefy2));\r\n                mV12 = _mm_add_epi32(_mm_madd_epi16(M12, mCoefy1), _mm_madd_epi16(M13, mCoefy2));\r\n\r\n                mV01 = _mm_srai_epi32(_mm_add_epi32(mV01, mAddOffset), shift);\r\n                mV02 = _mm_srai_epi32(_mm_add_epi32(mV02, mAddOffset), shift);\r\n                mV11 = _mm_srai_epi32(_mm_add_epi32(mV11, mAddOffset), shift);\r\n                mV12 = _mm_srai_epi32(_mm_add_epi32(mV12, mAddOffset), shift);\r\n\r\n                mV01 = _mm_packs_epi32 (mV01, mV02);\r\n                mV01 = _mm_packus_epi16(mV01, mV01);\r\n                mV11 = _mm_packs_epi32 (mV11, mV12);\r\n                mV11 = _mm_packus_epi16(mV11, mV11);\r\n\r\n                _mm_maskmoveu_si128(mV01, mask, (char *)&dst[col]);\r\n                
_mm_maskmoveu_si128(mV11, mask, (char *)&dst[col + i_dst]);\r\n            }\r\n\r\n            tmp += i_tmp * 2;\r\n            dst += i_dst * 2;\r\n        }\r\n    } else {\r\n        __m128i coeff0 = _mm_set1_epi16(*(short*)coef_y);\r\n        __m128i coeff1 = _mm_set1_epi16(*(short*)(coef_y + 2));\r\n        coeff0 = _mm_cvtepi8_epi16(coeff0);\r\n        coeff1 = _mm_cvtepi8_epi16(coeff1);\r\n\r\n        for (row = 0; row < height; row += 2) {\r\n            p = tmp;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i mV01, mV02;\r\n                __m128i mV11, mV12;\r\n                __m128i T0 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T1 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T2 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T3 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T4 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n\r\n                __m128i M00 = _mm_unpacklo_epi16(T0, T1);\r\n                __m128i M01 = _mm_unpacklo_epi16(T2, T3);\r\n                __m128i M02 = _mm_unpackhi_epi16(T0, T1);\r\n                __m128i M03 = _mm_unpackhi_epi16(T2, T3);\r\n\r\n                __m128i M10 = _mm_unpacklo_epi16(T1, T2);\r\n                __m128i M11 = _mm_unpacklo_epi16(T3, T4);\r\n                __m128i M12 = _mm_unpackhi_epi16(T1, T2);\r\n                __m128i M13 = _mm_unpackhi_epi16(T3, T4);\r\n\r\n                mV01 = _mm_add_epi32(_mm_madd_epi16(M00, coeff0), _mm_madd_epi16(M01, coeff1));\r\n                mV02 = _mm_add_epi32(_mm_madd_epi16(M02, coeff0), _mm_madd_epi16(M03, coeff1));\r\n                mV11 = _mm_add_epi32(_mm_madd_epi16(M10, coeff0), _mm_madd_epi16(M11, coeff1));\r\n                mV12 = _mm_add_epi32(_mm_madd_epi16(M12, coeff0), _mm_madd_epi16(M13, coeff1));\r\n\r\n                mV01 = _mm_srai_epi32(_mm_add_epi32(mV01, mAddOffset), shift);\r\n                mV02 = 
_mm_srai_epi32(_mm_add_epi32(mV02, mAddOffset), shift);\r\n                mV11 = _mm_srai_epi32(_mm_add_epi32(mV11, mAddOffset), shift);\r\n                mV12 = _mm_srai_epi32(_mm_add_epi32(mV12, mAddOffset), shift);\r\n\r\n                mV01 = _mm_packs_epi32 (mV01, mV02);\r\n                mV01 = _mm_packus_epi16(mV01, mV01);\r\n                mV11 = _mm_packs_epi32 (mV11, mV12);\r\n                mV11 = _mm_packus_epi16(mV11, mV11);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col],         mV01);\r\n                _mm_storel_epi64((__m128i*)&dst[col + i_dst], mV11);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) { // store either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n                __m128i mV01, mV02;\r\n                __m128i mV11, mV12;\r\n                __m128i T0 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T1 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T2 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T3 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T4 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n\r\n                __m128i M00 = _mm_unpacklo_epi16(T0, T1);\r\n                __m128i M01 = _mm_unpacklo_epi16(T2, T3);\r\n                __m128i M02 = _mm_unpackhi_epi16(T0, T1);\r\n                __m128i M03 = _mm_unpackhi_epi16(T2, T3);\r\n                \r\n                __m128i M10 = _mm_unpacklo_epi16(T1, T2);\r\n                __m128i M11 = _mm_unpacklo_epi16(T3, T4);\r\n                __m128i M12 = _mm_unpackhi_epi16(T1, T2);\r\n                __m128i M13 = _mm_unpackhi_epi16(T3, T4);\r\n\r\n                mV01 = _mm_add_epi32(_mm_madd_epi16(M00, coeff0), _mm_madd_epi16(M01, coeff1));\r\n                mV02 = _mm_add_epi32(_mm_madd_epi16(M02, coeff0), _mm_madd_epi16(M03, coeff1));\r\n                mV11 = _mm_add_epi32(_mm_madd_epi16(M10, coeff0), _mm_madd_epi16(M11, 
coeff1));\r\n                mV12 = _mm_add_epi32(_mm_madd_epi16(M12, coeff0), _mm_madd_epi16(M13, coeff1));\r\n\r\n                mV01 = _mm_srai_epi32(_mm_add_epi32(mV01, mAddOffset), shift);\r\n                mV02 = _mm_srai_epi32(_mm_add_epi32(mV02, mAddOffset), shift);\r\n                mV11 = _mm_srai_epi32(_mm_add_epi32(mV11, mAddOffset), shift);\r\n                mV12 = _mm_srai_epi32(_mm_add_epi32(mV12, mAddOffset), shift);\r\n\r\n                mV01 = _mm_packs_epi32 (mV01, mV02);\r\n                mV01 = _mm_packus_epi16(mV01, mV01);\r\n                mV11 = _mm_packs_epi32 (mV11, mV12);\r\n                mV11 = _mm_packus_epi16(mV11, mV11);\r\n\r\n                _mm_maskmoveu_si128(mV01, mask, (char *)&dst[col]);\r\n                _mm_maskmoveu_si128(mV11, mask, (char *)&dst[col + i_dst]);\r\n            }\r\n\r\n            tmp += i_tmp * 2;\r\n            dst += i_dst * 2;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ext_sse128(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coef_x, const int8_t *coef_y)\r\n{\r\n    ALIGN16(int16_t tmp_res[(64 + 7) * 64]);\r\n    int16_t *tmp = tmp_res;\r\n    const int i_tmp = 64;\r\n    int row, col;\r\n    int shift = 12;\r\n    int16_t const *p;\r\n\r\n    int bsymy = (coef_y[1] == coef_y[6]);\r\n\r\n    __m128i mAddOffset = _mm_set1_epi32(1 << (shift - 1));\r\n\r\n    __m128i mSwitch1 = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8);\r\n    __m128i mSwitch2 = _mm_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10);\r\n    __m128i mSwitch3 = _mm_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12);\r\n    __m128i mSwitch4 = _mm_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n\r\n    __m128i mCoefx = _mm_loadl_epi64((__m128i*)coef_x);\r\n    __m128i mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(width & 7) - 
1]));\r\n\r\n    mCoefx = _mm_unpacklo_epi64(mCoefx, mCoefx);\r\n\r\n    // HOR\r\n    src -= (3 * i_src + 3);\r\n\r\n    for (row = -3; row < height + 4; row++) {\r\n        for (col = 0; col < width; col += 8) {\r\n            __m128i mSrc = _mm_loadu_si128((__m128i*)(src + col));\r\n            __m128i mT0  = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch1), mCoefx);\r\n            __m128i mT1  = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch2), mCoefx);\r\n            __m128i mT2  = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch3), mCoefx);\r\n            __m128i mT3  = _mm_maddubs_epi16(_mm_shuffle_epi8(mSrc, mSwitch4), mCoefx);\r\n            __m128i mVal = _mm_hadd_epi16(_mm_hadd_epi16(mT0, mT1), _mm_hadd_epi16(mT2, mT3));\r\n\r\n            _mm_store_si128((__m128i*)&tmp[col], mVal);\r\n        }\r\n\r\n        src += i_src;\r\n        tmp += i_tmp;\r\n    }\r\n\r\n    // VER\r\n    tmp = tmp_res;\r\n\r\n    if (bsymy) {\r\n        __m128i mCoefy1 = _mm_set1_epi16(coef_y[0]);\r\n        __m128i mCoefy2 = _mm_set1_epi16(coef_y[1]);\r\n        __m128i mCoefy3 = _mm_set1_epi16(coef_y[2]);\r\n        __m128i mCoefy4 = _mm_set1_epi16(coef_y[3]);\r\n\r\n        for (row = 0; row < height - 3; row += 4) {\r\n            p = tmp;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n                __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n      
          __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi16(T00, T70);\r\n                __m128i T1 = _mm_unpacklo_epi16(T10, T60);\r\n                __m128i T2 = _mm_unpacklo_epi16(T20, T50);\r\n                __m128i T3 = _mm_unpacklo_epi16(T30, T40);\r\n                __m128i T4 = _mm_unpackhi_epi16(T00, T70);\r\n                __m128i T5 = _mm_unpackhi_epi16(T10, T60);\r\n                __m128i T6 = _mm_unpackhi_epi16(T20, T50);\r\n                __m128i T7 = _mm_unpackhi_epi16(T30, T40);\r\n                __m128i mVal1, mVal2, mVal;\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, T70);\r\n           
     T6 = _mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = _mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, T60);\r\n                T4 = _mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = _mm_unpackhi_epi16(T50, T60);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, 
T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 2 * i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = _mm_unpacklo_epi16(T40, T90);\r\n                T2 = _mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 3 * i_dst), mVal);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) { // store 
either 1, 2, 3, 4, 5, 6, or 7 8-bit results in dst\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n                __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi16(T00, T70);\r\n                __m128i T1 = _mm_unpacklo_epi16(T10, T60);\r\n                __m128i T2 = _mm_unpacklo_epi16(T20, T50);\r\n                __m128i T3 = _mm_unpacklo_epi16(T30, T40);\r\n                __m128i T4 = _mm_unpackhi_epi16(T00, T70);\r\n                __m128i T5 = _mm_unpackhi_epi16(T10, T60);\r\n                __m128i T6 = _mm_unpackhi_epi16(T20, T50);\r\n                __m128i T7 = _mm_unpackhi_epi16(T30, T40);\r\n                __m128i mVal1, mVal2, mVal;\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = 
_mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n\r\n                T0 = _mm_unpacklo_epi16(T10, T80);\r\n                T1 = _mm_unpacklo_epi16(T20, T70);\r\n                T2 = _mm_unpacklo_epi16(T30, T60);\r\n                T3 = _mm_unpacklo_epi16(T40, T50);\r\n                T4 = _mm_unpackhi_epi16(T10, T80);\r\n                T5 = _mm_unpackhi_epi16(T20, T70);\r\n                T6 = _mm_unpackhi_epi16(T30, T60);\r\n                T7 = _mm_unpackhi_epi16(T40, T50);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi16(T20, T90);\r\n                T1 = _mm_unpacklo_epi16(T30, T80);\r\n                T2 = 
_mm_unpacklo_epi16(T40, T70);\r\n                T3 = _mm_unpacklo_epi16(T50, T60);\r\n                T4 = _mm_unpackhi_epi16(T20, T90);\r\n                T5 = _mm_unpackhi_epi16(T30, T80);\r\n                T6 = _mm_unpackhi_epi16(T40, T70);\r\n                T7 = _mm_unpackhi_epi16(T50, T60);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 2 * i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi16(T30, Ta0);\r\n                T1 = _mm_unpacklo_epi16(T40, T90);\r\n                T2 = _mm_unpacklo_epi16(T50, T80);\r\n                T3 = _mm_unpacklo_epi16(T60, T70);\r\n                T4 = _mm_unpackhi_epi16(T30, Ta0);\r\n                T5 = _mm_unpackhi_epi16(T40, T90);\r\n                T6 = _mm_unpackhi_epi16(T50, T80);\r\n                T7 = _mm_unpackhi_epi16(T60, T70);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n      
          T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 3 * i_dst));\r\n\r\n            }\r\n            tmp += 4 * i_tmp;\r\n            dst += 4 * i_dst;\r\n        }\r\n    } else {\r\n        __m128i mCoefy1 = _mm_set1_epi16(*(int16_t*)coef_y);\r\n        __m128i mCoefy2 = _mm_set1_epi16(*(int16_t*)(coef_y + 2));\r\n        __m128i mCoefy3 = _mm_set1_epi16(*(int16_t*)(coef_y + 4));\r\n        __m128i mCoefy4 = _mm_set1_epi16(*(int16_t*)(coef_y + 6));\r\n        mCoefy1 = _mm_cvtepi8_epi16(mCoefy1);\r\n        mCoefy2 = _mm_cvtepi8_epi16(mCoefy2);\r\n        mCoefy3 = _mm_cvtepi8_epi16(mCoefy3);\r\n        mCoefy4 = _mm_cvtepi8_epi16(mCoefy4);\r\n\r\n        for (row = 0; row < height - 3; row += 4) {\r\n            p = tmp;\r\n            for (col = 0; col < width - 7; col += 8) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n                __m128i T70 = 
_mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi16(T00, T10);\r\n                __m128i T1 = _mm_unpacklo_epi16(T20, T30);\r\n                __m128i T2 = _mm_unpacklo_epi16(T40, T50);\r\n                __m128i T3 = _mm_unpacklo_epi16(T60, T70);\r\n                __m128i T4 = _mm_unpackhi_epi16(T00, T10);\r\n                __m128i T5 = _mm_unpackhi_epi16(T20, T30);\r\n                __m128i T6 = _mm_unpackhi_epi16(T40, T50);\r\n                __m128i T7 = _mm_unpackhi_epi16(T60, T70);\r\n                __m128i mVal1, mVal2, mVal;\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)&dst[col], mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, 
T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = _mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = _mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n                T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, 
mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] + 2 * i_dst), mVal);\r\n\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = _mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_storel_epi64((__m128i*)(&dst[col] 
+ 3 * i_dst), mVal);\r\n\r\n                p += 8;\r\n            }\r\n\r\n            if (col < width) {\r\n                __m128i T00 = _mm_loadu_si128((__m128i*)(p));\r\n                __m128i T10 = _mm_loadu_si128((__m128i*)(p + i_tmp));\r\n                __m128i T20 = _mm_loadu_si128((__m128i*)(p + 2 * i_tmp));\r\n                __m128i T30 = _mm_loadu_si128((__m128i*)(p + 3 * i_tmp));\r\n                __m128i T40 = _mm_loadu_si128((__m128i*)(p + 4 * i_tmp));\r\n                __m128i T50 = _mm_loadu_si128((__m128i*)(p + 5 * i_tmp));\r\n                __m128i T60 = _mm_loadu_si128((__m128i*)(p + 6 * i_tmp));\r\n                __m128i T70 = _mm_loadu_si128((__m128i*)(p + 7 * i_tmp));\r\n                __m128i T80 = _mm_loadu_si128((__m128i*)(p + 8 * i_tmp));\r\n                __m128i T90 = _mm_loadu_si128((__m128i*)(p + 9 * i_tmp));\r\n                __m128i Ta0 = _mm_loadu_si128((__m128i*)(p + 10 * i_tmp));\r\n\r\n                __m128i T0 = _mm_unpacklo_epi16(T00, T10);\r\n                __m128i T1 = _mm_unpacklo_epi16(T20, T30);\r\n                __m128i T2 = _mm_unpacklo_epi16(T40, T50);\r\n                __m128i T3 = _mm_unpacklo_epi16(T60, T70);\r\n                __m128i T4 = _mm_unpackhi_epi16(T00, T10);\r\n                __m128i T5 = _mm_unpackhi_epi16(T20, T30);\r\n                __m128i T6 = _mm_unpackhi_epi16(T40, T50);\r\n                __m128i T7 = _mm_unpackhi_epi16(T60, T70);\r\n                __m128i mVal1, mVal2, mVal;\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, 
T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)&dst[col]);\r\n\r\n                T0 = _mm_unpacklo_epi16(T10, T20);\r\n                T1 = _mm_unpacklo_epi16(T30, T40);\r\n                T2 = _mm_unpacklo_epi16(T50, T60);\r\n                T3 = _mm_unpacklo_epi16(T70, T80);\r\n                T4 = _mm_unpackhi_epi16(T10, T20);\r\n                T5 = _mm_unpackhi_epi16(T30, T40);\r\n                T6 = _mm_unpackhi_epi16(T50, T60);\r\n                T7 = _mm_unpackhi_epi16(T70, T80);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi16(T20, T30);\r\n                T1 = _mm_unpacklo_epi16(T40, T50);\r\n                
T2 = _mm_unpacklo_epi16(T60, T70);\r\n                T3 = _mm_unpacklo_epi16(T80, T90);\r\n                T4 = _mm_unpackhi_epi16(T20, T30);\r\n                T5 = _mm_unpackhi_epi16(T40, T50);\r\n                T6 = _mm_unpackhi_epi16(T60, T70);\r\n                T7 = _mm_unpackhi_epi16(T80, T90);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n                T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 2 * i_dst));\r\n\r\n                T0 = _mm_unpacklo_epi16(T30, T40);\r\n                T1 = _mm_unpacklo_epi16(T50, T60);\r\n                T2 = _mm_unpacklo_epi16(T70, T80);\r\n                T3 = _mm_unpacklo_epi16(T90, Ta0);\r\n                T4 = _mm_unpackhi_epi16(T30, T40);\r\n                T5 = _mm_unpackhi_epi16(T50, T60);\r\n                T6 = _mm_unpackhi_epi16(T70, T80);\r\n                T7 = _mm_unpackhi_epi16(T90, Ta0);\r\n\r\n                T0 = _mm_madd_epi16(T0, mCoefy1);\r\n                T1 = _mm_madd_epi16(T1, mCoefy2);\r\n                T2 = _mm_madd_epi16(T2, mCoefy3);\r\n                T3 = _mm_madd_epi16(T3, mCoefy4);\r\n                T4 = _mm_madd_epi16(T4, mCoefy1);\r\n     
           T5 = _mm_madd_epi16(T5, mCoefy2);\r\n                T6 = _mm_madd_epi16(T6, mCoefy3);\r\n                T7 = _mm_madd_epi16(T7, mCoefy4);\r\n\r\n                mVal1 = _mm_add_epi32(_mm_add_epi32(T0, T1), _mm_add_epi32(T2, T3));\r\n                mVal2 = _mm_add_epi32(_mm_add_epi32(T4, T5), _mm_add_epi32(T6, T7));\r\n\r\n                mVal1 = _mm_srai_epi32(_mm_add_epi32(mVal1, mAddOffset), shift);\r\n                mVal2 = _mm_srai_epi32(_mm_add_epi32(mVal2, mAddOffset), shift);\r\n                mVal = _mm_packs_epi32(mVal1, mVal2);\r\n                mVal = _mm_packus_epi16(mVal, mVal);\r\n\r\n                _mm_maskmoveu_si128(mVal, mask, (char *)(&dst[col] + 3 * i_dst));\r\n            }\r\n\r\n            tmp += 4 * i_tmp;\r\n            dst += 4 * i_dst;\r\n        }\r\n    }\r\n}\r\n#endif\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_inter_pred_avx2.cc",
    "content": "/*\r\n * intrinsic_inter_pred_avx2.cc\r\n *\r\n * Description of this file:\r\n *    AVX2 assembly functions of Inter-Prediction module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n#include <immintrin.h>\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#pragma warning(disable:4127)  // warning C4127: conditional expression is constant\r\n\r\n#if !HIGH_BIT_DEPTH\r\n/*---------------------------------------  ------------------------------------------------------*/\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_hor_w16_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, 
int height, const int8_t *coeff)\r\n{\r\n    int row, col;\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n    const __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n    const __m256i mask16 = _mm256_setr_epi32(-1, -1, -1, -1, 0, 0, 0, 0);\r\n    const __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8);\r\n    const __m256i mSwitch2 = _mm256_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10);\r\n    const __m256i mSwitch3 = _mm256_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12);\r\n    const __m256i mSwitch4 = _mm256_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14, 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n    __m256i mCoef;\r\n    src -= 3;\r\n\r\n#if ARCH_X86_64\r\n    mCoef = _mm256_set1_epi64x(*(long long*)coeff);\r\n#else\r\n    mCoef = _mm256_loadu_si256((__m256i*)coeff);\r\n    mCoef = _mm256_permute4x64_epi64(mCoef, 0x0);\r\n#endif\r\n\r\n    for (row = 0; row < height; row++) {\r\n        for (col = 0; col < width; col += 16) {\r\n            __m256i S = _mm256_loadu_si256((__m256i*)(src + col));\r\n            __m256i S0 = _mm256_permute4x64_epi64(S, 0x94);\r\n            __m256i T0, T1, T2, T3;\r\n            __m256i sum;\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch1), mCoef);\r\n            T1 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch2), mCoef);\r\n            T2 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch3), mCoef);\r\n            T3 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch4), mCoef);\r\n\r\n            T0 = _mm256_hadd_epi16(T0, T1);\r\n            T1 = _mm256_hadd_epi16(T2, T3);\r\n            sum = _mm256_hadd_epi16(T0, T1);\r\n\r\n            sum = _mm256_srai_epi16(_mm256_add_epi16(sum, mAddOffset), shift);\r\n\r\n     
       sum = _mm256_packus_epi16(sum, sum);\r\n            sum = _mm256_permute4x64_epi64(sum, 0xd8);\r\n\r\n            _mm256_maskstore_epi32((int*)(dst + col), mask16, sum);\r\n        }\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_hor_w24_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    int row;\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n    const __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n    const __m256i index = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);\r\n    const __m256i mask24 = _mm256_setr_epi16(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0);\r\n    const __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10);\r\n    const __m256i mSwitch2 = _mm256_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n    const __m256i mSwitch3 = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8);\r\n    const __m256i mSwitch4 = _mm256_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10);\r\n    const __m256i mSwitch5 = _mm256_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12);\r\n    const __m256i mSwitch6 = _mm256_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14, 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n    __m256i mCoef;\r\n\r\n    UNUSED_PARAMETER(width);\r\n\r\n    src -= 3;\r\n\r\n#if ARCH_X86_64\r\n    mCoef = _mm256_set1_epi64x(*(long long*)coeff);\r\n#else\r\n    mCoef = _mm256_loadu_si256((__m256i*)coeff);\r\n    mCoef = _mm256_permute4x64_epi64(mCoef, 
0x0);\r\n#endif\r\n\r\n    for (row = 0; row < height; row++) {\r\n        __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n        __m256i S1 = _mm256_permute4x64_epi64(S0, 0x99);\r\n        __m256i T0, T1, T2, T3, T4, T5;\r\n        __m256i sum1, sum2;\r\n\r\n        T0 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S1, mSwitch1), mCoef);\r\n        T1 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S1, mSwitch2), mCoef);\r\n        T2 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch3), mCoef);\r\n        T3 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch4), mCoef);\r\n        T4 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch5), mCoef);\r\n        T5 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch6), mCoef);\r\n\r\n        T0 = _mm256_hadd_epi16(T0, T1);\r\n        sum1 = _mm256_hadd_epi16(_mm256_hadd_epi16(T2, T3), _mm256_hadd_epi16(T4, T5));\r\n        sum2 = _mm256_hadd_epi16(T0, T0);\r\n\r\n        sum1 = _mm256_srai_epi16(_mm256_add_epi16(sum1, mAddOffset), shift);\r\n        sum2 = _mm256_srai_epi16(_mm256_add_epi16(sum2, mAddOffset), shift);\r\n\r\n        sum2 = _mm256_permutevar8x32_epi32(sum2, index);\r\n        sum1 = _mm256_packus_epi16(sum1, sum2);\r\n\r\n        _mm256_maskstore_epi32((int*)(dst), mask24, sum1);\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ver_w32_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    int row;\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n    int bsym = (coeff[1] == coeff[6]);\r\n    const int i_src2 = i_src * 2;\r\n    const int i_src3 = i_src * 3;\r\n    const int i_src4 = i_src * 4;\r\n    const int i_src5 = i_src * 5;\r\n    const int i_src6 = i_src * 6;\r\n    const int i_src7 = i_src * 7;\r\n    __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n\r\n    
UNUSED_PARAMETER(width);\r\n\r\n    src -= 3 * i_src;\r\n\r\n    if (bsym) {\r\n        __m256i coeff0 = _mm256_set1_epi8(coeff[0]);\r\n        __m256i coeff1 = _mm256_set1_epi8(coeff[1]);\r\n        __m256i coeff2 = _mm256_set1_epi8(coeff[2]);\r\n        __m256i coeff3 = _mm256_set1_epi8(coeff[3]);\r\n        __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n            __m256i S4 = _mm256_loadu_si256((__m256i*)(src + i_src4));\r\n            __m256i S5 = _mm256_loadu_si256((__m256i*)(src + i_src5));\r\n            __m256i S6 = _mm256_loadu_si256((__m256i*)(src + i_src6));\r\n            __m256i S7 = _mm256_loadu_si256((__m256i*)(src + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S7), coeff0);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S7), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S1, S6), coeff1);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S1, S6), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S5), coeff2);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S3, S4), coeff3);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S3, S4), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, 
mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu_si256((__m256i*)(dst), mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i coeff0 = _mm256_set1_epi16(*(short*)coeff);\r\n        __m256i coeff1 = _mm256_set1_epi16(*(short*)(coeff + 2));\r\n        __m256i coeff2 = _mm256_set1_epi16(*(short*)(coeff + 4));\r\n        __m256i coeff3 = _mm256_set1_epi16(*(short*)(coeff + 6));\r\n        __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n            __m256i S4 = _mm256_loadu_si256((__m256i*)(src + i_src4));\r\n            __m256i S5 = _mm256_loadu_si256((__m256i*)(src + i_src5));\r\n            __m256i S6 = _mm256_loadu_si256((__m256i*)(src + i_src6));\r\n            __m256i S7 = _mm256_loadu_si256((__m256i*)(src + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S1), coeff0);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S1), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S3), coeff1);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S3), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S4, S5), coeff2);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S4, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S6, S7), coeff3);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S6, S7), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = 
_mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu_si256((__m256i*)(dst), mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ver_w64_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    int row;\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n    int bsym = (coeff[1] == coeff[6]);\r\n    const int i_src2 = i_src * 2;\r\n    const int i_src3 = i_src * 3;\r\n    const int i_src4 = i_src * 4;\r\n    const int i_src5 = i_src * 5;\r\n    const int i_src6 = i_src * 6;\r\n    const int i_src7 = i_src * 7;\r\n    __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n\r\n    UNUSED_PARAMETER(width);\r\n\r\n    src -= 3 * i_src;\r\n\r\n    if (bsym) {\r\n        __m256i coeff0 = _mm256_set1_epi8(coeff[0]);\r\n        __m256i coeff1 = _mm256_set1_epi8(coeff[1]);\r\n        __m256i coeff2 = _mm256_set1_epi8(coeff[2]);\r\n        __m256i coeff3 = _mm256_set1_epi8(coeff[3]);\r\n\r\n        for (row = 0; row < height; row++) {\r\n            const pel_t *p = src + 32;\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n            __m256i S4 = _mm256_loadu_si256((__m256i*)(src + i_src4));\r\n            __m256i S5 = _mm256_loadu_si256((__m256i*)(src + i_src5));\r\n            __m256i S6 = _mm256_loadu_si256((__m256i*)(src + 
i_src6));\r\n            __m256i S7 = _mm256_loadu_si256((__m256i*)(src + i_src7));\r\n            __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S7), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S1, S6), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S3, S4), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S7), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S1, S6), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S3, S4), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu_si256((__m256i*)(dst), mVal1);\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(p));\r\n            S1 = _mm256_loadu_si256((__m256i*)(p + i_src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(p + i_src2));\r\n            S3 = _mm256_loadu_si256((__m256i*)(p + i_src3));\r\n            S4 = _mm256_loadu_si256((__m256i*)(p + i_src4));\r\n            S5 = _mm256_loadu_si256((__m256i*)(p + i_src5));\r\n            S6 = _mm256_loadu_si256((__m256i*)(p + i_src6));\r\n            S7 = _mm256_loadu_si256((__m256i*)(p + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S7), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S1, S6), coeff1);\r\n            T2 = 
_mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S3, S4), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S7), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S1, S6), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S3, S4), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i coeff0 = _mm256_set1_epi16(*(short*)coeff);\r\n        __m256i coeff1 = _mm256_set1_epi16(*(short*)(coeff + 2));\r\n        __m256i coeff2 = _mm256_set1_epi16(*(short*)(coeff + 4));\r\n        __m256i coeff3 = _mm256_set1_epi16(*(short*)(coeff + 6));\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n            const pel_t *p = src + 32;\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n            __m256i S4 = _mm256_loadu_si256((__m256i*)(src + i_src4));\r\n            __m256i S5 = _mm256_loadu_si256((__m256i*)(src + i_src5));\r\n            __m256i S6 = _mm256_loadu_si256((__m256i*)(src + i_src6));\r\n            
__m256i S7 = _mm256_loadu_si256((__m256i*)(src + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S1), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S3), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S4, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S6, S7), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S1), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S3), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S4, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S6, S7), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu_si256((__m256i*)(dst), mVal1);\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(p));\r\n            S1 = _mm256_loadu_si256((__m256i*)(p + i_src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(p + i_src2));\r\n            S3 = _mm256_loadu_si256((__m256i*)(p + i_src3));\r\n            S4 = _mm256_loadu_si256((__m256i*)(p + i_src4));\r\n            S5 = _mm256_loadu_si256((__m256i*)(p + i_src5));\r\n            S6 = _mm256_loadu_si256((__m256i*)(p + i_src6));\r\n            S7 = _mm256_loadu_si256((__m256i*)(p + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S1), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S3), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S4, S5), coeff2);\r\n            T3 = 
_mm256_maddubs_epi16(_mm256_unpacklo_epi8(S6, S7), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S1), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S3), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S4, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S6, S7), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ver_w16_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    int row;\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n    int bsym = (coeff[1] == coeff[6]);\r\n    const int i_src2 = i_src * 2;\r\n    const int i_src3 = i_src * 3;\r\n    const int i_src4 = i_src * 4;\r\n    const int i_src5 = i_src * 5;\r\n    const int i_src6 = i_src * 6;\r\n    const int i_src7 = i_src * 7;\r\n    const int i_src8 = i_src * 8;\r\n    __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n\r\n    src -= 3 * i_src;\r\n    UNUSED_PARAMETER(width);\r\n\r\n    if (bsym) {\r\n        __m256i coeff0 = _mm256_set1_epi8(coeff[0]);\r\n        __m256i coeff1 = _mm256_set1_epi8(coeff[1]);\r\n        __m256i coeff2 = _mm256_set1_epi8(coeff[2]);\r\n        __m256i coeff3 = _mm256_set1_epi8(coeff[3]);\r\n\r\n        for (row = 0; row < height; row += 2) {\r\n     
       __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + i_src));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + i_src2));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + i_src3));\r\n            __m128i S4 = _mm_loadu_si128((__m128i*)(src + i_src4));\r\n            __m128i S5 = _mm_loadu_si128((__m128i*)(src + i_src5));\r\n            __m128i S6 = _mm_loadu_si128((__m128i*)(src + i_src6));\r\n            __m128i S7 = _mm_loadu_si128((__m128i*)(src + i_src7));\r\n            __m128i S8 = _mm_loadu_si128((__m128i*)(src + i_src8));\r\n\r\n            __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n            __m256i R0, R1, R2, R3, R4, R5, R6, R7;\r\n\r\n            R0 = _mm256_set_m128i(S0, S1);\r\n            R1 = _mm256_set_m128i(S1, S2);\r\n            R2 = _mm256_set_m128i(S2, S3);\r\n            R3 = _mm256_set_m128i(S3, S4);\r\n            R4 = _mm256_set_m128i(S4, S5);\r\n            R5 = _mm256_set_m128i(S5, S6);\r\n            R6 = _mm256_set_m128i(S6, S7);\r\n            R7 = _mm256_set_m128i(S7, S8);\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R0, R7), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R0, R7), coeff0);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R1, R6), coeff1);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R1, R6), coeff1);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R2, R5), coeff2);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R2, R5), coeff2);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R3, R4), coeff3);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R3, R4), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T2), _mm256_add_epi16(T4, T6));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T1, T3), _mm256_add_epi16(T5, T7));\r\n\r\n            mVal1 = 
_mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu2_m128i((__m128i*)dst, (__m128i*)(dst + i_dst), mVal1);\r\n            src += 2 * i_src;\r\n            dst += 2 * i_dst;\r\n        }\r\n    } else {\r\n        __m256i coeff0 = _mm256_set1_epi16(*(int16_t*)(coeff + 0));\r\n        __m256i coeff1 = _mm256_set1_epi16(*(int16_t*)(coeff + 2));\r\n        __m256i coeff2 = _mm256_set1_epi16(*(int16_t*)(coeff + 4));\r\n        __m256i coeff3 = _mm256_set1_epi16(*(int16_t*)(coeff + 6));\r\n\r\n        for (row = 0; row < height; row += 2) {\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + i_src));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + i_src2));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + i_src3));\r\n            __m128i S4 = _mm_loadu_si128((__m128i*)(src + i_src4));\r\n            __m128i S5 = _mm_loadu_si128((__m128i*)(src + i_src5));\r\n            __m128i S6 = _mm_loadu_si128((__m128i*)(src + i_src6));\r\n            __m128i S7 = _mm_loadu_si128((__m128i*)(src + i_src7));\r\n            __m128i S8 = _mm_loadu_si128((__m128i*)(src + i_src8));\r\n\r\n            __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n            __m256i R0, R1, R2, R3, R4, R5, R6, R7;\r\n\r\n            R0 = _mm256_set_m128i(S0, S1);\r\n            R1 = _mm256_set_m128i(S1, S2);\r\n            R2 = _mm256_set_m128i(S2, S3);\r\n            R3 = _mm256_set_m128i(S3, S4);\r\n            R4 = _mm256_set_m128i(S4, S5);\r\n            R5 = _mm256_set_m128i(S5, S6);\r\n            R6 = _mm256_set_m128i(S6, S7);\r\n            R7 = _mm256_set_m128i(S7, S8);\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R0, R1), coeff0);\r\n            T1 = 
_mm256_maddubs_epi16(_mm256_unpackhi_epi8(R0, R1), coeff0);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R2, R3), coeff1);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R2, R3), coeff1);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R4, R5), coeff2);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R4, R5), coeff2);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R6, R7), coeff3);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R6, R7), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T2), _mm256_add_epi16(T4, T6));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T1, T3), _mm256_add_epi16(T5, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu2_m128i((__m128i*)dst, (__m128i*)(dst + i_dst), mVal1);\r\n            src += 2 * i_src;\r\n            dst += 2 * i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ver_w24_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    int row;\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n    int bsym = (coeff[1] == coeff[6]);\r\n    __m256i mask24 = _mm256_setr_epi16(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0);\r\n    const int i_src2 = i_src * 2;\r\n    const int i_src3 = i_src * 3;\r\n    const int i_src4 = i_src * 4;\r\n    const int i_src5 = i_src * 5;\r\n    const int i_src6 = i_src * 6;\r\n    const int i_src7 = i_src * 7;\r\n    __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n\r\n    UNUSED_PARAMETER(width);\r\n    src -= 3 * i_src;\r\n\r\n    if (bsym) {\r\n        __m256i coeff0 = 
_mm256_set1_epi8(coeff[0]);\r\n        __m256i coeff1 = _mm256_set1_epi8(coeff[1]);\r\n        __m256i coeff2 = _mm256_set1_epi8(coeff[2]);\r\n        __m256i coeff3 = _mm256_set1_epi8(coeff[3]);\r\n        __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n            __m256i S4 = _mm256_loadu_si256((__m256i*)(src + i_src4));\r\n            __m256i S5 = _mm256_loadu_si256((__m256i*)(src + i_src5));\r\n            __m256i S6 = _mm256_loadu_si256((__m256i*)(src + i_src6));\r\n            __m256i S7 = _mm256_loadu_si256((__m256i*)(src + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S7), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S1, S6), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S3, S4), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S7), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S1, S6), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S3, S4), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            
_mm256_maskstore_epi32((int*)(dst), mask24, mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i coeff0 = _mm256_set1_epi16(*(short*)coeff);\r\n        __m256i coeff1 = _mm256_set1_epi16(*(short*)(coeff + 2));\r\n        __m256i coeff2 = _mm256_set1_epi16(*(short*)(coeff + 4));\r\n        __m256i coeff3 = _mm256_set1_epi16(*(short*)(coeff + 6));\r\n        __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n            __m256i S4 = _mm256_loadu_si256((__m256i*)(src + i_src4));\r\n            __m256i S5 = _mm256_loadu_si256((__m256i*)(src + i_src5));\r\n            __m256i S6 = _mm256_loadu_si256((__m256i*)(src + i_src6));\r\n            __m256i S7 = _mm256_loadu_si256((__m256i*)(src + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S1), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S3), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S4, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S6, S7), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S1), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S3), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S4, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S6, S7), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = 
_mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_maskstore_epi32((int*)(dst), mask24, mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ver_w48_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    const int shift = 6;\r\n    const int offset = (1 << shift) >> 1;\r\n    const int i_src2 = i_src * 2;\r\n    const int i_src3 = i_src * 3;\r\n    const int i_src4 = i_src * 4;\r\n    const int i_src5 = i_src * 5;\r\n    const int i_src6 = i_src * 6;\r\n    const int i_src7 = i_src * 7;\r\n    const __m256i mask16 = _mm256_setr_epi32(-1, -1, -1, -1, 0, 0, 0, 0);\r\n    int bsym = (coeff[1] == coeff[6]);\r\n    int row;\r\n\r\n    src -= 3 * i_src;\r\n    UNUSED_PARAMETER(width);\r\n\r\n    if (bsym) {\r\n        __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n        __m256i coeff0 = _mm256_set1_epi8(coeff[0]);\r\n        __m256i coeff1 = _mm256_set1_epi8(coeff[1]);\r\n        __m256i coeff2 = _mm256_set1_epi8(coeff[2]);\r\n        __m256i coeff3 = _mm256_set1_epi8(coeff[3]);\r\n        __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n\r\n        for (row = 0; row < height; row++) {\r\n            const pel_t *p = src + 32;\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n            __m256i S4 = _mm256_loadu_si256((__m256i*)(src + i_src4));\r\n            __m256i S5 = _mm256_loadu_si256((__m256i*)(src + i_src5));\r\n       
     __m256i S6 = _mm256_loadu_si256((__m256i*)(src + i_src6));\r\n            __m256i S7 = _mm256_loadu_si256((__m256i*)(src + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S7), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S1, S6), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S3, S4), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S7), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S1, S6), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S3, S4), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu_si256((__m256i*)(dst), mVal1);\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(p));\r\n            S1 = _mm256_loadu_si256((__m256i*)(p + i_src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(p + i_src2));\r\n            S3 = _mm256_loadu_si256((__m256i*)(p + i_src3));\r\n            S4 = _mm256_loadu_si256((__m256i*)(p + i_src4));\r\n            S5 = _mm256_loadu_si256((__m256i*)(p + i_src5));\r\n            S6 = _mm256_loadu_si256((__m256i*)(p + i_src6));\r\n            S7 = _mm256_loadu_si256((__m256i*)(p + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S7), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S1, S6), coeff1);\r\n            T2 = 
_mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S3, S4), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S7), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S1, S6), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S3, S4), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_maskstore_epi32((int*)(dst + 32), mask16, mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n        __m256i coeff0 = _mm256_set1_epi16(*(short*)coeff);\r\n        __m256i coeff1 = _mm256_set1_epi16(*(short*)(coeff + 2));\r\n        __m256i coeff2 = _mm256_set1_epi16(*(short*)(coeff + 4));\r\n        __m256i coeff3 = _mm256_set1_epi16(*(short*)(coeff + 6));\r\n        __m256i T0, T1, T2, T3, T4, T5, T6, T7, mVal1, mVal2;\r\n\r\n        for (row = 0; row < height; row++) {\r\n            const pel_t *p = src + 32;\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n            __m256i S4 = _mm256_loadu_si256((__m256i*)(src + i_src4));\r\n            __m256i S5 = _mm256_loadu_si256((__m256i*)(src + i_src5));\r\n            
__m256i S6 = _mm256_loadu_si256((__m256i*)(src + i_src6));\r\n            __m256i S7 = _mm256_loadu_si256((__m256i*)(src + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S1), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S3), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S4, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S6, S7), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S1), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S3), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S4, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S6, S7), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu_si256((__m256i*)(dst), mVal1);\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(p));\r\n            S1 = _mm256_loadu_si256((__m256i*)(p + i_src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(p + i_src2));\r\n            S3 = _mm256_loadu_si256((__m256i*)(p + i_src3));\r\n            S4 = _mm256_loadu_si256((__m256i*)(p + i_src4));\r\n            S5 = _mm256_loadu_si256((__m256i*)(p + i_src5));\r\n            S6 = _mm256_loadu_si256((__m256i*)(p + i_src6));\r\n            S7 = _mm256_loadu_si256((__m256i*)(p + i_src7));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S1), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S3), coeff1);\r\n            T2 = 
_mm256_maddubs_epi16(_mm256_unpacklo_epi8(S4, S5), coeff2);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S6, S7), coeff3);\r\n            T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S1), coeff0);\r\n            T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S3), coeff1);\r\n            T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S4, S5), coeff2);\r\n            T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S6, S7), coeff3);\r\n\r\n            mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));\r\n            mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_maskstore_epi32((int*)(dst + 32), mask16, mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ext_w16_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coef_x, const int8_t *coef_y)\r\n{\r\n    ALIGN32(int16_t tmp_res[(64 + 7) * 64]);\r\n    int16_t *tmp = tmp_res;\r\n    const int i_tmp = 64;\r\n    const int i_tmp2 = 2 * i_tmp;\r\n    const int i_tmp3 = 3 * i_tmp;\r\n    const int i_tmp4 = 4 * i_tmp;\r\n    const int i_tmp5 = 5 * i_tmp;\r\n    const int i_tmp6 = 6 * i_tmp;\r\n    const int i_tmp7 = 7 * i_tmp;\r\n    const int shift = 12;\r\n    const __m256i mAddOffset = _mm256_set1_epi32((1 << shift) >> 1);\r\n\r\n    int row, col;\r\n    __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,       1, 2, 3, 4, 5, 6, 7, 8,  // first 8 pixels (low 128-bit lane)\r\n                                        0, 1, 2, 3, 4, 5, 6, 7,       1, 2, 3, 4, 5, 6, 7, 8); // last 8 pixels (high 128-bit lane)\r\n    __m256i mSwitch2 = 
_mm256_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9,       3, 4, 5, 6, 7, 8, 9, 10, \r\n                                        2, 3, 4, 5, 6, 7, 8, 9,       3, 4, 5, 6, 7, 8, 9, 10);\r\n    __m256i mSwitch3 = _mm256_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11,     5, 6, 7, 8, 9, 10, 11, 12, \r\n                                        4, 5, 6, 7, 8, 9, 10, 11,     5, 6, 7, 8, 9, 10, 11, 12);\r\n    __m256i mSwitch4 = _mm256_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13,   7, 8, 9, 10, 11, 12, 13, 14,\r\n                                        6, 7, 8, 9, 10, 11, 12, 13,   7, 8, 9, 10, 11, 12, 13, 14);\r\n    __m256i mCoef;\r\n\r\n    src = src - 3 * i_src - 3;\r\n\r\n    //HOR\r\n#if ARCH_X86_64\r\n    mCoef = _mm256_set1_epi64x(*(long long*)coef_x);\r\n#else\r\n    mCoef = _mm256_loadu_si256((__m256i*)coef_x);\r\n    mCoef = _mm256_permute4x64_epi64(mCoef, 0x0);\r\n#endif\r\n\r\n    for (row = -3; row < height + 4; row++) {\r\n        for (col = 0; col < width; col += 16) {\r\n            __m256i T0, T1, sum, T2, T3;\r\n            __m256i S = _mm256_loadu_si256((__m256i*)(src + col));\r\n            // place the source bytes for the first 8 and the last 8 interpolated pixels in the low and high 128-bit lanes, respectively\r\n            __m256i S0 = _mm256_permute4x64_epi64(S, 0x94);\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch1), mCoef);\r\n            T1 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch2), mCoef);\r\n            T2 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch3), mCoef);\r\n            T3 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch4), mCoef);\r\n\r\n            sum = _mm256_hadd_epi16(_mm256_hadd_epi16(T0, T1), _mm256_hadd_epi16(T2, T3));\r\n\r\n            _mm256_store_si256((__m256i*)(tmp + col), sum);\r\n        }\r\n        src += i_src;\r\n        tmp += i_tmp;\r\n    }\r\n\r\n    // VER\r\n    tmp = tmp_res;\r\n\r\n    __m256i mCoefy1 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)coef_y));\r\n    __m256i mCoefy2 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coef_y + 2)));\r\n    __m256i mCoefy3 = 
_mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coef_y + 4)));\r\n    __m256i mCoefy4 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coef_y + 6)));\r\n\r\n    // interpolating 2 or 4 rows at once would avoid repeated loads\r\n    for (row = 0; row < height; row++) {\r\n        for (col = 0; col < width; col += 16) {\r\n            __m256i T0, T1, T2, T3, T4, T5, T6, T7;\r\n            __m256i mVal1, mVal2;\r\n            __m256i S0 = _mm256_load_si256((__m256i*)(tmp + col));\r\n            __m256i S1 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp));\r\n            __m256i S2 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp2));\r\n            __m256i S3 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp3));\r\n            __m256i S4 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp4));\r\n            __m256i S5 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp5));\r\n            __m256i S6 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp6));\r\n            __m256i S7 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp7));\r\n\r\n            T0 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S0, S1), mCoefy1);\r\n            T1 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S2, S3), mCoefy2);\r\n            T2 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S4, S5), mCoefy3);\r\n            T3 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S6, S7), mCoefy4);\r\n            T4 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S0, S1), mCoefy1);\r\n            T5 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S2, S3), mCoefy2);\r\n            T6 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S4, S5), mCoefy3);\r\n            T7 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S6, S7), mCoefy4);\r\n\r\n            mVal1 = _mm256_add_epi32(_mm256_add_epi32(T0, T1), _mm256_add_epi32(T2, T3));\r\n            mVal2 = _mm256_add_epi32(_mm256_add_epi32(T4, T5), _mm256_add_epi32(T6, T7));\r\n\r\n            mVal1 = _mm256_srai_epi32(_mm256_add_epi32(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi32(_mm256_add_epi32(mVal2, mAddOffset), shift);\r\n\r\n    
        mVal1 = _mm256_packs_epi32(mVal1, mVal2);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal1);\r\n\r\n            mVal1 = _mm256_permute4x64_epi64(mVal1, 0xd8);\r\n            _mm_storeu_si128((__m128i*)(dst + col), _mm256_castsi256_si128(mVal1));\r\n        }\r\n        tmp += i_tmp;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ext_w24_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int height, const int8_t *coef_x, const int8_t *coef_y)\r\n{\r\n    ALIGN32(int16_t tmp_res[(64 + 7) * 64]);\r\n    int16_t *tmp = tmp_res;\r\n    const int i_tmp  = 32;\r\n    const int i_tmp2 = 2 * i_tmp;\r\n    const int i_tmp3 = 3 * i_tmp;\r\n    const int i_tmp4 = 4 * i_tmp;\r\n    const int i_tmp5 = 5 * i_tmp;\r\n    const int i_tmp6 = 6 * i_tmp;\r\n    const int i_tmp7 = 7 * i_tmp;\r\n\r\n    int row;\r\n    int bsymy = (coef_y[1] == coef_y[6]);\r\n    int shift = 12;\r\n    __m256i mAddOffset = _mm256_set1_epi32(1 << 11);\r\n    __m256i mCoef;\r\n    __m256i mask24 = _mm256_setr_epi16(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0);\r\n\r\n    // HOR\r\n    __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12);\r\n    __m256i mSwitch2 = _mm256_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10, 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n    __m256i mSwitch3 = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8);\r\n    __m256i mSwitch4 = _mm256_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10);\r\n    __m256i mSwitch5 = _mm256_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12);\r\n    __m256i mSwitch6 = _mm256_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 
9, 10, 11, 12, 13, 14, 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n\r\n    src -= (3 * i_src + 3);\r\n#if ARCH_X86_64\r\n    mCoef = _mm256_set1_epi64x(*(long long*)coef_x);\r\n#else\r\n    mCoef = _mm256_loadu_si256((__m256i*)coef_x);\r\n    mCoef = _mm256_permute4x64_epi64(mCoef, 0x0);\r\n#endif\r\n\r\n    for (row = -3; row < height + 4; row++) {\r\n        __m256i T0, T1, T2, T3, T4, T5, sum1, sum2;\r\n        __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n        __m256i S1 = _mm256_permute4x64_epi64(S0, 0x99);\r\n\r\n        T0 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S1, mSwitch1), mCoef);\r\n        T1 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S1, mSwitch2), mCoef);\r\n        T0 = _mm256_hadd_epi16(T0, T1);\r\n\r\n        T2 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch3), mCoef);\r\n        T3 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch4), mCoef);\r\n        T4 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch5), mCoef);\r\n        T5 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch6), mCoef);\r\n\r\n        sum1 = _mm256_hadd_epi16(_mm256_hadd_epi16(T2, T3), _mm256_hadd_epi16(T4, T5));\r\n        sum2 = _mm256_hadd_epi16(T0, T0);\r\n\r\n        sum2 = _mm256_permute4x64_epi64(sum2, 0xd8);\r\n        sum2 = _mm256_permute2x128_si256(sum1, sum2, 0x13);\r\n        _mm_storeu_si128((__m128i*)(tmp), _mm256_castsi256_si128(sum1));\r\n        _mm256_storeu_si256((__m256i*)(tmp + 8), sum2);\r\n\r\n        src += i_src;\r\n        tmp += i_tmp;\r\n    }\r\n\r\n    // VER\r\n    tmp = tmp_res;\r\n    if (bsymy) {\r\n        __m256i mCoefy1 = _mm256_set1_epi16(coef_y[0]);\r\n        __m256i mCoefy2 = _mm256_set1_epi16(coef_y[1]);\r\n        __m256i mCoefy3 = _mm256_set1_epi16(coef_y[2]);\r\n        __m256i mCoefy4 = _mm256_set1_epi16(coef_y[3]);\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i mVal1, mVal2, mVal, mVal3, mVal4;\r\n            __m256i T0, T1, T2, T3, S0, S1, S2, 
S3;\r\n            __m256i T4, T5, T6, T7, S4, S5, S6, S7;\r\n            __m256i T00, T11, T22, T33, S00, S11, S22, S33;\r\n            __m256i T44, T55, T66, T77, S44, S55, S66, S77;\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(tmp));\r\n            S1 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp));\r\n            S2 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp2));\r\n            S3 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp3));\r\n            S4 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp4));\r\n            S5 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp5));\r\n            S6 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp6));\r\n            S7 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp7));\r\n\r\n            S00 = _mm256_loadu_si256((__m256i*)(tmp + 16));\r\n            S11 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp));\r\n            S22 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp2));\r\n            S33 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp3));\r\n            S44 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp4));\r\n            S55 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp5));\r\n            S66 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp6));\r\n            S77 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp7));\r\n\r\n            T0 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S0, S7), mCoefy1);\r\n            T1 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S1, S6), mCoefy2);\r\n            T2 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S2, S5), mCoefy3);\r\n            T3 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S3, S4), mCoefy4);\r\n            T4 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S0, S7), mCoefy1);\r\n            T5 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S1, S6), mCoefy2);\r\n            T6 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S2, S5), mCoefy3);\r\n            T7 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S3, S4), mCoefy4);\r\n\r\n            T00 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S00, S77), 
mCoefy1);\r\n            T11 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S11, S66), mCoefy2);\r\n            T22 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S22, S55), mCoefy3);\r\n            T33 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S33, S44), mCoefy4);\r\n            T44 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S00, S77), mCoefy1);\r\n            T55 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S11, S66), mCoefy2);\r\n            T66 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S22, S55), mCoefy3);\r\n            T77 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S33, S44), mCoefy4);\r\n\r\n            mVal1 = _mm256_add_epi32(_mm256_add_epi32(T0, T1), _mm256_add_epi32(T2, T3));\r\n            mVal2 = _mm256_add_epi32(_mm256_add_epi32(T4, T5), _mm256_add_epi32(T6, T7));\r\n\r\n            mVal3 = _mm256_add_epi32(_mm256_add_epi32(T00, T11), _mm256_add_epi32(T22, T33));\r\n            mVal4 = _mm256_add_epi32(_mm256_add_epi32(T44, T55), _mm256_add_epi32(T66, T77));\r\n\r\n            mVal1 = _mm256_srai_epi32(_mm256_add_epi32(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi32(_mm256_add_epi32(mVal2, mAddOffset), shift);\r\n            mVal3 = _mm256_srai_epi32(_mm256_add_epi32(mVal3, mAddOffset), shift);\r\n            mVal4 = _mm256_srai_epi32(_mm256_add_epi32(mVal4, mAddOffset), shift);\r\n\r\n            mVal = _mm256_packus_epi16(_mm256_packs_epi32(mVal1, mVal2), _mm256_packs_epi32(mVal3, mVal4));\r\n\r\n            mVal = _mm256_permute4x64_epi64(mVal, 0xd8);\r\n            _mm256_maskstore_epi32((int*)(dst), mask24, mVal);\r\n\r\n            tmp += i_tmp;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i mCoefy1 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coef_y)));\r\n        __m256i mCoefy2 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coef_y + 2)));\r\n        __m256i mCoefy3 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coef_y + 4)));\r\n        __m256i mCoefy4 = 
_mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coef_y + 6)));\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i mVal1, mVal2, mVal, mVal3, mVal4;\r\n            __m256i T0, T1, T2, T3, S0, S1, S2, S3;\r\n            __m256i T4, T5, T6, T7, S4, S5, S6, S7;\r\n            __m256i T00, T11, T22, T33, S00, S11, S22, S33;\r\n            __m256i T44, T55, T66, T77, S44, S55, S66, S77;\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(tmp));\r\n            S1 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp));\r\n            S2 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp2));\r\n            S3 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp3));\r\n            S4 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp4));\r\n            S5 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp5));\r\n            S6 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp6));\r\n            S7 = _mm256_loadu_si256((__m256i*)(tmp + i_tmp7));\r\n\r\n            S00 = _mm256_loadu_si256((__m256i*)(tmp + 16));\r\n            S11 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp));\r\n            S22 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp2));\r\n            S33 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp3));\r\n            S44 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp4));\r\n            S55 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp5));\r\n            S66 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp6));\r\n            S77 = _mm256_loadu_si256((__m256i*)(tmp + 16 + i_tmp7));\r\n\r\n            T0 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S0, S1), mCoefy1);\r\n            T1 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S2, S3), mCoefy2);\r\n            T2 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S4, S5), mCoefy3);\r\n            T3 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S6, S7), mCoefy4);\r\n            T4 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S0, S1), mCoefy1);\r\n            T5 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S2, S3), mCoefy2);\r\n            T6 = 
_mm256_madd_epi16(_mm256_unpackhi_epi16(S4, S5), mCoefy3);\r\n            T7 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S6, S7), mCoefy4);\r\n\r\n            T00 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S00, S11), mCoefy1);\r\n            T11 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S22, S33), mCoefy2);\r\n            T22 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S44, S55), mCoefy3);\r\n            T33 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S66, S77), mCoefy4);\r\n            T44 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S00, S11), mCoefy1);\r\n            T55 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S22, S33), mCoefy2);\r\n            T66 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S44, S55), mCoefy3);\r\n            T77 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S66, S77), mCoefy4);\r\n\r\n            mVal1 = _mm256_add_epi32(_mm256_add_epi32(T0, T1), _mm256_add_epi32(T2, T3));\r\n            mVal2 = _mm256_add_epi32(_mm256_add_epi32(T4, T5), _mm256_add_epi32(T6, T7));\r\n\r\n            mVal3 = _mm256_add_epi32(_mm256_add_epi32(T00, T11), _mm256_add_epi32(T22, T33));\r\n            mVal4 = _mm256_add_epi32(_mm256_add_epi32(T44, T55), _mm256_add_epi32(T66, T77));\r\n\r\n            mVal1 = _mm256_srai_epi32(_mm256_add_epi32(mVal1, mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi32(_mm256_add_epi32(mVal2, mAddOffset), shift);\r\n            mVal3 = _mm256_srai_epi32(_mm256_add_epi32(mVal3, mAddOffset), shift);\r\n            mVal4 = _mm256_srai_epi32(_mm256_add_epi32(mVal4, mAddOffset), shift);\r\n\r\n            mVal = _mm256_packus_epi16(_mm256_packs_epi32(mVal1, mVal2), _mm256_packs_epi32(mVal3, mVal4));\r\n\r\n            mVal = _mm256_permute4x64_epi64(mVal, 0xd8);\r\n            _mm256_maskstore_epi32((int*)(dst), mask24, mVal);\r\n\r\n            tmp += i_tmp;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid 
intpl_chroma_block_hor_w16_avx2(pel_t *dst, int i_dst, const pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    int row, col;\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n\r\n    __m256i mCoef = _mm256_set1_epi32(*(int32_t*)coeff);\r\n    __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6);\r\n    __m256i mSwitch2 = _mm256_setr_epi8(4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10, 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10);\r\n    __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n    __m256i mask16 = _mm256_setr_epi32(-1, -1, -1, -1, 0, 0, 0, 0);\r\n    src -= 1;\r\n\r\n    for (row = 0; row < height; row++) {\r\n        for (col = 0; col < width; col += 16) {\r\n            __m256i T0, T1, sum;\r\n            __m256i S  = _mm256_loadu_si256((__m256i*)(src + col));\r\n            __m256i S0 = _mm256_permute4x64_epi64(S, 0x94);\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch1), mCoef);\r\n            T1 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch2), mCoef);\r\n\r\n            sum = _mm256_srai_epi16(_mm256_add_epi16(_mm256_hadd_epi16(T0, T1), mAddOffset), shift);\r\n            sum = _mm256_packus_epi16(sum, sum);\r\n            sum = _mm256_permute4x64_epi64(sum, 0xd8);\r\n\r\n            _mm256_maskstore_epi32((int*)(dst + col), mask16, sum);\r\n        }\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_hor_w24_avx2(pel_t *dst, int i_dst, const pel_t *src, int i_src, int height, const int8_t *coeff)\r\n{\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n\r\n    const __m256i mCoef = _mm256_set1_epi32(*(int32_t*)coeff);\r\n    const __m256i mSwitch = _mm256_setr_epi8(0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 
9, 7, 8, 9, 10);\r\n    const __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6);\r\n    const __m256i mSwitch2 = _mm256_setr_epi8(4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10, 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10);\r\n    const __m256i mask24 = _mm256_setr_epi16(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0);\r\n    const __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n    const __m256i index = _mm256_setr_epi32(0, 1, 2, 6, 4, 5, 3, 7);\r\n\r\n    int row;\r\n    src -= 1;\r\n\r\n    for (row = 0; row < height; row++) {\r\n        __m256i T0, T1, T2, sum1, sum2;\r\n        __m256i S  = _mm256_loadu_si256((__m256i*)(src));\r\n        __m256i S0 = _mm256_permute4x64_epi64(S, 0x99);\r\n\r\n        T0 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S, mSwitch1), mCoef);\r\n        T1 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S, mSwitch2), mCoef);\r\n        T2 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch), mCoef);\r\n\r\n        sum1 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_hadd_epi16(T0, T1), mAddOffset), shift);\r\n        sum2 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_hadd_epi16(T2, T2), mAddOffset), shift);\r\n\r\n        sum1 = _mm256_permutevar8x32_epi32(_mm256_packus_epi16(sum1, sum2), index);\r\n\r\n        _mm256_maskstore_epi32((int*)(dst), mask24, sum1);\r\n\r\n        src += i_src;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_ver_w32_avx2(pel_t *dst, int i_dst, const pel_t *src, int i_src, int height, const int8_t *coeff)\r\n{\r\n    int row;\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n    int bsym = (coeff[1] == coeff[2]);\r\n    __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n    const int i_src2 = i_src * 2;\r\n    const int i_src3 = i_src * 3;\r\n\r\n    src -= i_src;\r\n\r\n    if 
(bsym) {\r\n        __m256i coeff0 = _mm256_set1_epi8(coeff[0]);\r\n        __m256i coeff1 = _mm256_set1_epi8(coeff[1]);\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i S0, S1, S2, S3;\r\n            __m256i T0, T1, T2, T3, mVal1, mVal2;\r\n            S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S3), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S1, S2), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S3), coeff0);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S1, S2), coeff1);\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T0, T1), mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T2, T3), mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n            _mm256_storeu_si256((__m256i*)(dst), mVal1);\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i coeff0 = _mm256_set1_epi16(*(int16_t*)coeff);\r\n        __m256i coeff1 = _mm256_set1_epi16(*(int16_t*)(coeff + 2));\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i T0, T1, T2, T3, mVal1, mVal2;\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n            \r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S1), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S3), coeff1);\r\n            T2 = 
_mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S1), coeff0);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S3), coeff1);\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T0, T1), mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T2, T3), mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n            _mm256_storeu_si256((__m256i*)(dst), mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_ver_w24_avx2(pel_t *dst, int i_dst, const pel_t *src, int i_src, int height, const int8_t *coeff)\r\n{\r\n    int row;\r\n    const int offset = 32;\r\n    const int shift = 6;\r\n    int bsym = (coeff[1] == coeff[2]);\r\n    __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n    __m256i mask24 = _mm256_setr_epi16(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0);\r\n    const int i_src2 = i_src * 2;\r\n    const int i_src3 = i_src * 3;\r\n\r\n    src -= i_src;\r\n\r\n    if (bsym) {\r\n        __m256i coeff0 = _mm256_set1_epi8(coeff[0]);\r\n        __m256i coeff1 = _mm256_set1_epi8(coeff[1]);\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i T0, T1, T2, T3, mVal1, mVal2;\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S3), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S1, S2), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S3), coeff0);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S1, 
S2), coeff1);\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T0, T1), mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T2, T3), mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n            _mm256_maskstore_epi32((int*)(dst), mask24, mVal1);\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i coeff0 = _mm256_set1_epi16(*(int16_t*)coeff);\r\n        __m256i coeff1 = _mm256_set1_epi16(*(int16_t*)(coeff + 2));\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i T0, T1, T2, T3, mVal1, mVal2;\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + i_src));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + i_src2));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + i_src3));\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S0, S1), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(S2, S3), coeff1);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S0, S1), coeff0);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(S2, S3), coeff1);\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T0, T1), mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T2, T3), mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n            _mm256_maskstore_epi32((int*)(dst), mask24, mVal1);\r\n\r\n            src += i_src;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_ver_w16_avx2(pel_t *dst, int i_dst, const pel_t *src, int i_src, int height, const int8_t *coeff)\r\n{\r\n    int row;\r\n    const int offset = 32;\r\n    
const int shift = 6;\r\n    int bsym = (coeff[1] == coeff[2]);\r\n    __m256i mAddOffset = _mm256_set1_epi16((short)offset);\r\n    const int i_src2 = i_src * 2;\r\n    const int i_src3 = i_src * 3;\r\n    const int i_src4 = i_src * 4;\r\n\r\n    src -= i_src;\r\n\r\n    if (bsym) {\r\n        __m256i coeff0 = _mm256_set1_epi8(coeff[0]);\r\n        __m256i coeff1 = _mm256_set1_epi8(coeff[1]);\r\n\r\n        for (row = 0; row < height; row = row + 2) {\r\n            __m256i T0, T1, T2, T3, mVal1, mVal2;\r\n            __m256i R0, R1, R2, R3;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + i_src));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + i_src2));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + i_src3));\r\n            __m128i S4 = _mm_loadu_si128((__m128i*)(src + i_src4));\r\n\r\n            R0 = _mm256_set_m128i(S0, S1);\r\n            R1 = _mm256_set_m128i(S1, S2);\r\n            R2 = _mm256_set_m128i(S2, S3);\r\n            R3 = _mm256_set_m128i(S3, S4);\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R0, R3), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R0, R3), coeff0);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R1, R2), coeff1);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R1, R2), coeff1);\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T0, T2), mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T1, T3), mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu2_m128i((__m128i*)dst, (__m128i*)(dst + i_dst), mVal1);\r\n\r\n            src += 2 * i_src;\r\n            dst += 2 * i_dst;\r\n        }\r\n    } else {\r\n        __m256i coeff0 = _mm256_set1_epi16(*(int16_t*)coeff);\r\n        __m256i coeff1 = _mm256_set1_epi16(*(int16_t*)(coeff + 
2));\r\n\r\n        for (row = 0; row < height; row = row + 2) {\r\n            __m256i T0, T1, T2, T3, mVal1, mVal2;\r\n            __m256i R0, R1, R2, R3;\r\n\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + i_src));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + i_src2));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + i_src3));\r\n            __m128i S4 = _mm_loadu_si128((__m128i*)(src + i_src4));\r\n\r\n            R0 = _mm256_set_m128i(S0, S1);\r\n            R1 = _mm256_set_m128i(S1, S2);\r\n            R2 = _mm256_set_m128i(S2, S3);\r\n            R3 = _mm256_set_m128i(S3, S4);\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R0, R1), coeff0);\r\n            T1 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R0, R1), coeff0);\r\n            T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(R2, R3), coeff1);\r\n            T3 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(R2, R3), coeff1);\r\n\r\n            mVal1 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T0, T2), mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi16(_mm256_add_epi16(_mm256_add_epi16(T1, T3), mAddOffset), shift);\r\n            mVal1 = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n            _mm256_storeu2_m128i((__m128i*)dst, (__m128i*)(dst + i_dst), mVal1);\r\n\r\n            src += 2 * i_src;\r\n            dst += 2 * i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_ext_w16_avx2(pel_t *dst, int i_dst, const pel_t *src, int i_src, int width, int height, const int8_t *coef_x, const int8_t *coef_y)\r\n{\r\n    ALIGN32(int16_t tmp_res[(32 + 3) * 32]);\r\n    int16_t *tmp = tmp_res;\r\n    const int i_tmp = 32;\r\n    const int i_tmp2 = 2 * i_tmp;\r\n    const int i_tmp3 = 3 * i_tmp;\r\n    const int shift = 12;\r\n\r\n    int row, col;\r\n    int bsymy = (coef_y[1] 
== coef_y[6]);\r\n    __m256i mAddOffset = _mm256_set1_epi32(1 << (shift - 1));\r\n    __m256i mCoef = _mm256_set1_epi32(*(int32_t*)coef_x);\r\n    __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6);\r\n    __m256i mSwitch2 = _mm256_setr_epi8(4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10, 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10);\r\n\r\n    // HOR\r\n    src -= (i_src + 1);\r\n\r\n    for (row = -1; row < height + 2; row++) {\r\n        for (col = 0; col < width; col += 16) {\r\n            __m256i T0, T1, S, S0, sum;\r\n            S = _mm256_loadu_si256((__m256i*)(src + col));\r\n            S0 = _mm256_permute4x64_epi64(S, 0x94);\r\n\r\n            T0 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch1), mCoef);\r\n            T1 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch2), mCoef);\r\n            sum = _mm256_hadd_epi16(T0, T1);\r\n\r\n            _mm256_storeu_si256((__m256i*)(tmp + col), sum);\r\n        }\r\n        src += i_src;\r\n        tmp += i_tmp;\r\n    }\r\n\r\n    // VER\r\n    tmp = tmp_res;\r\n    if (bsymy) {\r\n        __m256i mCoefy1 = _mm256_set1_epi16(coef_y[0]);\r\n        __m256i mCoefy2 = _mm256_set1_epi16(coef_y[1]);\r\n\r\n        for (row = 0; row < height; row++) {\r\n            for (col = 0; col < width; col += 16) {\r\n                __m256i mVal1, mVal2, mVal;\r\n                __m256i T0, T1, T2, T3, S0, S1, S2, S3;\r\n                S0 = _mm256_load_si256((__m256i*)(tmp + col));\r\n                S1 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp));\r\n                S2 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp2));\r\n                S3 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp3));\r\n\r\n                T0 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S0, S3), mCoefy1);\r\n                T1 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S1, S2), mCoefy2);\r\n                T2 = 
_mm256_madd_epi16(_mm256_unpackhi_epi16(S0, S3), mCoefy1);\r\n                T3 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S1, S2), mCoefy2);\r\n\r\n                mVal1 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T0, T1), mAddOffset), shift);\r\n                mVal2 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T2, T3), mAddOffset), shift);\r\n\r\n                mVal = _mm256_packus_epi16(_mm256_packs_epi32(mVal1, mVal2), /*no-use*/mVal1);\r\n\r\n                mVal = _mm256_permute4x64_epi64(mVal, 0xd8);\r\n                _mm_storeu_si128((__m128i*)(dst + col), _mm256_castsi256_si128(mVal));\r\n            }\r\n            tmp += i_tmp;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i mCoefy1 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)coef_y));\r\n        __m256i mCoefy2 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coef_y + 2)));\r\n\r\n        for (row = 0; row < height; row++) {\r\n            for (col = 0; col < width; col += 16) {\r\n                __m256i mVal1, mVal2, mVal;\r\n                __m256i T0, T1, T2, T3, S0, S1, S2, S3;\r\n                S0 = _mm256_load_si256((__m256i*)(tmp + col));\r\n                S1 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp));\r\n                S2 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp2));\r\n                S3 = _mm256_load_si256((__m256i*)(tmp + col + i_tmp3));\r\n\r\n                T0 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S0, S1), mCoefy1);\r\n                T1 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S2, S3), mCoefy2);\r\n                T2 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S0, S1), mCoefy1);\r\n                T3 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S2, S3), mCoefy2);\r\n\r\n                mVal1 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T0, T1), mAddOffset), shift);\r\n                mVal2 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T2, T3), mAddOffset), shift);\r\n\r\n                mVal = 
_mm256_packus_epi16(_mm256_packs_epi32(mVal1, mVal2), /*no-use*/mVal1);\r\n\r\n                mVal = _mm256_permute4x64_epi64(mVal, 0xd8);\r\n                _mm_storeu_si128((__m128i*)(dst + col), _mm256_castsi256_si128(mVal));\r\n            }\r\n            tmp += i_tmp;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_ext_w24_avx2(pel_t *dst, int i_dst, const pel_t *src, int i_src, int width, int height, const int8_t *coef_x, const int8_t *coef_y)\r\n{\r\n    ALIGN32(int16_t tmp_res[(32 + 3) * 32]);\r\n    int16_t *tmp = tmp_res;\r\n    const int i_tmp = 32;\r\n    const int i_tmp2 = 2 * i_tmp;\r\n    const int i_tmp3 = 3 * i_tmp;\r\n\r\n    int row;\r\n    int bsymy = (coef_y[1] == coef_y[6]);\r\n    const int shift = 12;\r\n    __m256i mAddOffset = _mm256_set1_epi32(1 << (shift - 1));\r\n    __m256i mCoef = _mm256_set1_epi32(*(int32_t*)coef_x);\r\n    __m256i mSwitch = _mm256_setr_epi8(0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10);\r\n    __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6);\r\n    __m256i mSwitch2 = _mm256_setr_epi8(4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10, 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10);    \r\n    __m256i mask24 = _mm256_setr_epi16(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0);\r\n    //HOR\r\n    src = src - i_src - 1;\r\n    UNUSED_PARAMETER(width);\r\n\r\n    for (row = -1; row < height + 2; row++) {\r\n        __m256i T0, T1, T2, S, S0;\r\n        S = _mm256_loadu_si256((__m256i*)(src));\r\n        S0 = _mm256_permute4x64_epi64(S, 0x99);\r\n\r\n        T0 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S, mSwitch1), mCoef);\r\n        T1 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(S, mSwitch2), mCoef);\r\n        T2 = 
_mm256_maddubs_epi16(_mm256_shuffle_epi8(S0, mSwitch), mCoef);\r\n        T0 = _mm256_hadd_epi16(T0, T1);\r\n        T2 = _mm256_hadd_epi16(T2, T2);\r\n\r\n        T2 = _mm256_permute4x64_epi64(T2, 0xd8);\r\n        T2 = _mm256_permute2x128_si256(T0, T2, 0x13);\r\n        _mm_storeu_si128((__m128i*)(tmp), _mm256_castsi256_si128(T0));\r\n        _mm256_storeu_si256((__m256i*)(tmp + 8), T2);\r\n        src += i_src;\r\n        tmp += i_tmp;\r\n    }\r\n\r\n    // VER\r\n    tmp = tmp_res;\r\n    if (bsymy) {\r\n        __m256i mCoefy1 = _mm256_set1_epi16(coef_y[0]);\r\n        __m256i mCoefy2 = _mm256_set1_epi16(coef_y[1]);\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i mVal1, mVal2, mVal3, mVal4, mVal;\r\n            __m256i S0, S1, S2, S3, S4, S5, S6, S7;\r\n            __m256i T0, T1, T2, T3, T4, T5, T6, T7;\r\n\r\n            S0 = _mm256_load_si256((__m256i*)(tmp));\r\n            S1 = _mm256_load_si256((__m256i*)(tmp + i_tmp));\r\n            S2 = _mm256_load_si256((__m256i*)(tmp + i_tmp2));\r\n            S3 = _mm256_load_si256((__m256i*)(tmp + i_tmp3));\r\n\r\n            S4 = _mm256_load_si256((__m256i*)(tmp + 16));\r\n            S5 = _mm256_load_si256((__m256i*)(tmp + 16 + i_tmp));\r\n            S6 = _mm256_load_si256((__m256i*)(tmp + 16 + i_tmp2));\r\n            S7 = _mm256_load_si256((__m256i*)(tmp + 16 + i_tmp3));\r\n\r\n            T0 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S0, S3), mCoefy1);\r\n            T1 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S1, S2), mCoefy2);\r\n            T2 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S0, S3), mCoefy1);\r\n            T3 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S1, S2), mCoefy2);\r\n            T4 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S4, S7), mCoefy1);\r\n            T5 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S5, S6), mCoefy2);\r\n            T6 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S4, S7), mCoefy1);\r\n            T7 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S5, 
S6), mCoefy2);\r\n\r\n            mVal1 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T0, T1), mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T2, T3), mAddOffset), shift);\r\n            mVal3 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T4, T5), mAddOffset), shift);\r\n            mVal4 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T6, T7), mAddOffset), shift);\r\n\r\n            mVal = _mm256_packus_epi16(_mm256_packs_epi32(mVal1, mVal2), _mm256_packs_epi32(mVal3, mVal4));\r\n\r\n            mVal = _mm256_permute4x64_epi64(mVal, 0xd8);\r\n            _mm256_maskstore_epi32((int*)(dst), mask24, mVal);\r\n\r\n            tmp += i_tmp;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i mCoefy1 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)coef_y));\r\n        __m256i mCoefy2 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coef_y + 2)));\r\n\r\n        for (row = 0; row < height; row++) {\r\n            __m256i mVal1, mVal2, mVal3, mVal4, mVal;\r\n            __m256i S0, S1, S2, S3, S4, S5, S6, S7;\r\n            __m256i T0, T1, T2, T3, T4, T5, T6, T7;\r\n\r\n            S0 = _mm256_load_si256((__m256i*)(tmp));\r\n            S1 = _mm256_load_si256((__m256i*)(tmp + i_tmp));\r\n            S2 = _mm256_load_si256((__m256i*)(tmp + i_tmp2));\r\n            S3 = _mm256_load_si256((__m256i*)(tmp + i_tmp3));\r\n\r\n            S4 = _mm256_load_si256((__m256i*)(tmp + 16));\r\n            S5 = _mm256_load_si256((__m256i*)(tmp + 16 + i_tmp));\r\n            S6 = _mm256_load_si256((__m256i*)(tmp + 16 + i_tmp2));\r\n            S7 = _mm256_load_si256((__m256i*)(tmp + 16 + i_tmp3));\r\n\r\n            T0 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S0, S1), mCoefy1);\r\n            T1 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S2, S3), mCoefy2);\r\n            T2 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S0, S1), mCoefy1);\r\n            T3 = 
_mm256_madd_epi16(_mm256_unpackhi_epi16(S2, S3), mCoefy2);\r\n            T4 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S4, S5), mCoefy1);\r\n            T5 = _mm256_madd_epi16(_mm256_unpacklo_epi16(S6, S7), mCoefy2);\r\n            T6 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S4, S5), mCoefy1);\r\n            T7 = _mm256_madd_epi16(_mm256_unpackhi_epi16(S6, S7), mCoefy2);\r\n\r\n            mVal1 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T0, T1), mAddOffset), shift);\r\n            mVal2 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T2, T3), mAddOffset), shift);\r\n            mVal3 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T4, T5), mAddOffset), shift);\r\n            mVal4 = _mm256_srai_epi32(_mm256_add_epi32(_mm256_add_epi32(T6, T7), mAddOffset), shift);\r\n\r\n            mVal = _mm256_packus_epi16(_mm256_packs_epi32(mVal1, mVal2), _mm256_packs_epi32(mVal3, mVal4));\r\n\r\n            mVal = _mm256_permute4x64_epi64(mVal, 0xd8);\r\n            _mm256_maskstore_epi32((int*)(dst), mask24, mVal);\r\n\r\n            tmp += i_tmp;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n/*--------------------------------------- interpolation ------------------------------------------------------*/\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_hor_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    switch (width / 4 - 1) {\r\n        case 3:\r\n        case 7:\r\n        case 11:\r\n        case 15:\r\n            intpl_luma_block_hor_w16_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 5:\r\n            intpl_luma_block_hor_w24_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        default:\r\n            intpl_luma_block_hor_sse128(dst, i_dst, src, i_src, width, height, coeff);\r\n    }\r\n}\r\n\r\n/* 
---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ver_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    switch (width / 4 - 1) {\r\n        case 3:\r\n            intpl_luma_block_ver_w16_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 5:\r\n            intpl_luma_block_ver_w24_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 7:\r\n            intpl_luma_block_ver_w32_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 11:\r\n            intpl_luma_block_ver_w48_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 15:\r\n            intpl_luma_block_ver_w64_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        default:\r\n            intpl_luma_block_ver_sse128(dst, i_dst, src, i_src, width, height, coeff);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intpl_luma_block_ver0_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    switch (width / 4 - 1) {\r\n        case 3:\r\n            intpl_luma_block_ver_w16_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 5:\r\n            intpl_luma_block_ver_w24_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 7:\r\n            intpl_luma_block_ver_w32_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 11:\r\n            intpl_luma_block_ver_w48_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 15:\r\n            intpl_luma_block_ver_w64_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        default:\r\n            
intpl_luma_block_ver_sse128(dst, i_dst, src, i_src, width, height, coeff);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intpl_luma_block_ver1_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    switch (width / 4 - 1) {\r\n        case 3:\r\n            intpl_luma_block_ver_w16_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 5:\r\n            intpl_luma_block_ver_w24_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 7:\r\n            intpl_luma_block_ver_w32_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 11:\r\n            intpl_luma_block_ver_w48_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 15:\r\n            intpl_luma_block_ver_w64_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        default:\r\n            intpl_luma_block_ver_sse128(dst, i_dst, src, i_src, width, height, coeff);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intpl_luma_block_ver2_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    switch (width / 4 - 1) {\r\n        case 3:\r\n            intpl_luma_block_ver_w16_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 5:\r\n            intpl_luma_block_ver_w24_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 7:\r\n            intpl_luma_block_ver_w32_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 11:\r\n            intpl_luma_block_ver_w48_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 15:\r\n            intpl_luma_block_ver_w64_avx2(dst, i_dst, 
src, i_src, width, height, coeff);\r\n            break;\r\n        default:\r\n            intpl_luma_block_ver_sse128(dst, i_dst, src, i_src, width, height, coeff);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_block_ext_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coef_x, const int8_t *coef_y)\r\n{\r\n    switch (width / 4 - 1) {\r\n        case 3:\r\n        case 7:\r\n        case 11:\r\n        case 15:\r\n            intpl_luma_block_ext_w16_avx2(dst, i_dst, src, i_src, width, height, coef_x, coef_y);\r\n            break;\r\n        case 5:\r\n            intpl_luma_block_ext_w24_avx2(dst, i_dst, src, i_src, height, coef_x, coef_y);\r\n            break;\r\n        default:\r\n            intpl_luma_block_ext_sse128(dst, i_dst, src, i_src, width, height, coef_x, coef_y);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_hor_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    switch (width / 2 - 1) {\r\n        case 7:\r\n        case 15:\r\n            intpl_chroma_block_hor_w16_avx2(dst, i_dst, src, i_src, width, height, coeff);\r\n            break;\r\n        case 11:\r\n            intpl_chroma_block_hor_w24_avx2(dst, i_dst, src, i_src, height, coeff);\r\n            break;\r\n        default:\r\n            intpl_chroma_block_hor_sse128(dst, i_dst, src, i_src, width, height, coeff);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_ver_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coeff)\r\n{\r\n    switch (width / 2 - 1) {\r\n        case 7:\r\n            intpl_chroma_block_ver_w16_avx2(dst, i_dst, src, i_src, height, coeff);\r\n            break;\r\n        case 
11:\r\n            intpl_chroma_block_ver_w24_avx2(dst, i_dst, src, i_src, height, coeff);\r\n            break;\r\n        case 15:\r\n            intpl_chroma_block_ver_w32_avx2(dst, i_dst, src, i_src, height, coeff);\r\n            break;\r\n        default:\r\n            intpl_chroma_block_ver_sse128(dst, i_dst, src, i_src, width, height, coeff);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_chroma_block_ext_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, const int8_t *coef_x, const int8_t *coef_y)\r\n{\r\n    switch (width / 2 - 1) {\r\n        case 7:\r\n        case 15:\r\n            intpl_chroma_block_ext_w16_avx2(dst, i_dst, src, i_src, width, height, coef_x, coef_y);\r\n            break;\r\n        case 11:\r\n            intpl_chroma_block_ext_w24_avx2(dst, i_dst, src, i_src, width, height, coef_x, coef_y);\r\n            break;\r\n        default:\r\n            intpl_chroma_block_ext_sse128(dst, i_dst, src, i_src, width, height, coef_x, coef_y);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define INTPL_LUMA_EXT_COMPUT(W0,W1,W2,W3,W4,W5,W6,W7,result)          \\\r\n    T0 = _mm256_madd_epi16(_mm256_unpacklo_epi16(W0, W1), mCoefy01);   \\\r\n    T1 = _mm256_madd_epi16(_mm256_unpacklo_epi16(W2, W3), mCoefy23);   \\\r\n    T2 = _mm256_madd_epi16(_mm256_unpacklo_epi16(W4, W5), mCoefy45);   \\\r\n    T3 = _mm256_madd_epi16(_mm256_unpacklo_epi16(W6, W7), mCoefy67);   \\\r\n    T4 = _mm256_madd_epi16(_mm256_unpackhi_epi16(W0, W1), mCoefy01);   \\\r\n    T5 = _mm256_madd_epi16(_mm256_unpackhi_epi16(W2, W3), mCoefy23);   \\\r\n    T6 = _mm256_madd_epi16(_mm256_unpackhi_epi16(W4, W5), mCoefy45);   \\\r\n    T7 = _mm256_madd_epi16(_mm256_unpackhi_epi16(W6, W7), mCoefy67);   \\\r\n    \\\r\n    mVal1 = _mm256_add_epi32(_mm256_add_epi32(T0, T1), _mm256_add_epi32(T2, T3));  \\\r\n    mVal2 = 
_mm256_add_epi32(_mm256_add_epi32(T4, T5), _mm256_add_epi32(T6, T7));  \\\r\n    \\\r\n    mVal1 = _mm256_srai_epi32(_mm256_add_epi32(mVal1, mAddOffset), shift);         \\\r\n    mVal2 = _mm256_srai_epi32(_mm256_add_epi32(mVal2, mAddOffset), shift);         \\\r\n    result = _mm256_packs_epi32(mVal1, mVal2);\r\n\r\n#define INTPL_LUMA_EXT_STORE(a, b, c)                      \\\r\n    mVal = _mm256_permute4x64_epi64(_mm256_packus_epi16(a, b), 216);            \\\r\n    _mm256_storeu_si256((__m256i*)(c), mVal);\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_ext_avx2(pel_t *dst, int i_dst, int16_t *tmp, int i_tmp, int width, int height, const int8_t *coeff)\r\n{\r\n    const int shift = 12;\r\n    int row, col;\r\n    int16_t const *p;\r\n\r\n    __m256i mAddOffset = _mm256_set1_epi32(1 << (shift - 1));\r\n\r\n    __m256i mCoefy01 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coeff + 0)));\r\n    __m256i mCoefy23 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coeff + 2)));\r\n    __m256i mCoefy45 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coeff + 4)));\r\n    __m256i mCoefy67 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coeff + 6)));\r\n\r\n    tmp -= 3 * i_tmp;\r\n\r\n    for (row = 0; row < height; row = row + 4) {\r\n        __m256i T00, T10, T20, T30, T40, T50, T60, T70, T80, T90, Ta0;\r\n        __m256i T0, T1, T2, T3, T4, T5, T6, T7;\r\n        __m256i U0, U1, U2, U3;\r\n        __m256i V0, V1, V2, V3;\r\n        __m256i mVal1, mVal2, mVal;\r\n\r\n        p = tmp;\r\n        for (col = 0; col < width - 31; col += 32) {\r\n\r\n            T00 = _mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_tmp));\r\n            T50 = 
_mm256_loadu_si256((__m256i*)(p + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 8 * i_tmp));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n\r\n            //col + 16\r\n            T00 = _mm256_loadu_si256((__m256i*)(p + 16));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + 16 + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 16 + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 16 + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 16 + 4 * i_tmp));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 16 + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 16 + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 16 + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 16 + 8 * i_tmp));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 16 + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 16 + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, V0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, V1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, V2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, V3);\r\n\r\n            INTPL_LUMA_EXT_STORE(U0, V0, dst + col);\r\n            INTPL_LUMA_EXT_STORE(U1, V1, dst + i_dst + col);\r\n   
         INTPL_LUMA_EXT_STORE(U2, V2, dst + 2 * i_dst + col);\r\n            INTPL_LUMA_EXT_STORE(U3, V3, dst + 3 * i_dst + col);\r\n\r\n            p += 32;\r\n        }\r\n\r\n        if (col < width - 16) {\r\n            T00 = _mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_tmp));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 8 * i_tmp));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n\r\n            //col + 16\r\n            T00 = _mm256_loadu_si256((__m256i*)(p + 16));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + 16 + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 16 + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 16 + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 16 + 4 * i_tmp));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 16 + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 16 + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 16 + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 16 + 8 * i_tmp));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p 
+ 16 + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 16 + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, V0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, V1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, V2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, V3);\r\n\r\n            INTPL_LUMA_EXT_STORE(U0, V0, dst + col);\r\n            INTPL_LUMA_EXT_STORE(U1, V1, dst + i_dst + col);\r\n            INTPL_LUMA_EXT_STORE(U2, V2, dst + 2 * i_dst + col);\r\n            INTPL_LUMA_EXT_STORE(U3, V3, dst + 3 * i_dst + col);\r\n\r\n            p += 32;\r\n            col += 32;\r\n        }\r\n\r\n        if (col < width) {\r\n            T00 = _mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_tmp));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 8 * i_tmp));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n\r\n            INTPL_LUMA_EXT_STORE(U0, U0, dst + col);\r\n            INTPL_LUMA_EXT_STORE(U1, U1, dst + i_dst + col);\r\n            
INTPL_LUMA_EXT_STORE(U2, U2, dst + 2 * i_dst + col);\r\n            INTPL_LUMA_EXT_STORE(U3, U3, dst + 3 * i_dst + col);\r\n\r\n            p += 16;\r\n            col += 16;\r\n        }\r\n\r\n        tmp += i_tmp * 4;\r\n        dst += i_dst * 4;\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_ext_x3_avx2(pel_t *const dst[3], int i_dst, int16_t *tmp, int i_tmp, int width, int height, const int8_t **coeff)\r\n{\r\n#if 1\r\n    intpl_luma_ext_avx2(dst[0], i_dst, tmp, i_tmp, width, height, coeff[0]);\r\n    intpl_luma_ext_avx2(dst[1], i_dst, tmp, i_tmp, width, height, coeff[1]);\r\n    intpl_luma_ext_avx2(dst[2], i_dst, tmp, i_tmp, width, height, coeff[2]);\r\n#else\r\n    const int shift = 12;\r\n    int row, col;\r\n    int16_t const *p;\r\n\r\n    __m256i mAddOffset = _mm256_set1_epi32(1 << (shift - 1));\r\n\r\n    __m256i mCoefy01 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coeff[0] + 0)));\r\n    __m256i mCoefy23 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coeff[0] + 2)));\r\n    __m256i mCoefy45 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coeff[0] + 4)));\r\n    __m256i mCoefy67 = _mm256_cvtepi8_epi16(_mm_set1_epi16(*(int16_t*)(coeff[0] + 6)));\r\n\r\n    tmp -= 3 * i_tmp;\r\n\r\n    for (row = 0; row < height; row = row + 4) {\r\n        __m256i T00, T10, T20, T30, T40, T50, T60, T70, T80, T90, Ta0;\r\n        __m256i T0, T1, T2, T3, T4, T5, T6, T7;\r\n        __m256i U0, U1, U2, U3;\r\n        __m256i V0, V1, V2, V3;\r\n        __m256i mVal1, mVal2, mVal;\r\n\r\n        p = tmp;\r\n        for (col = 0; col < width - 31; col += 32) {\r\n\r\n            T00 = _mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_tmp));\r\n 
           T50 = _mm256_loadu_si256((__m256i*)(p + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 8 * i_tmp));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n\r\n            //col + 16\r\n            T00 = _mm256_loadu_si256((__m256i*)(p + 16));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + 16 + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 16 + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 16 + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 16 + 4 * i_tmp));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 16 + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 16 + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 16 + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 16 + 8 * i_tmp));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 16 + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 16 + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, V0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, V1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, V2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, V3);\r\n\r\n            INTPL_LUMA_EXT_STORE(U0, V0, dst + col);\r\n            INTPL_LUMA_EXT_STORE(U1, V1, dst + 
i_dst + col);\r\n            INTPL_LUMA_EXT_STORE(U2, V2, dst + 2 * i_dst + col);\r\n            INTPL_LUMA_EXT_STORE(U3, V3, dst + 3 * i_dst + col);\r\n\r\n            p += 32;\r\n        }\r\n\r\n        if (col < width - 16) {\r\n            T00 = _mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_tmp));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 8 * i_tmp));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n\r\n            //col + 16\r\n            T00 = _mm256_loadu_si256((__m256i*)(p + 16));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + 16 + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 16 + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 16 + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 16 + 4 * i_tmp));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 16 + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 16 + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 16 + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 16 + 8 * i_tmp));\r\n            T90 = 
_mm256_loadu_si256((__m256i*)(p + 16 + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 16 + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, V0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, V1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, V2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, V3);\r\n\r\n            INTPL_LUMA_EXT_STORE(U0, V0, dst + col);\r\n            INTPL_LUMA_EXT_STORE(U1, V1, dst + i_dst + col);\r\n            INTPL_LUMA_EXT_STORE(U2, V2, dst + 2 * i_dst + col);\r\n            INTPL_LUMA_EXT_STORE(U3, V3, dst + 3 * i_dst + col);\r\n\r\n            p += 32;\r\n            col += 32;\r\n        }\r\n\r\n        if (col < width) {\r\n            T00 = _mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_tmp));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_tmp));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_tmp));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_tmp));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 5 * i_tmp));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_tmp));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_tmp));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 8 * i_tmp));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 9 * i_tmp));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_tmp));\r\n\r\n            INTPL_LUMA_EXT_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_EXT_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_EXT_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_EXT_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n\r\n            INTPL_LUMA_EXT_STORE(U0, U0, dst + col);\r\n            INTPL_LUMA_EXT_STORE(U1, U1, dst + i_dst 
+ col);\r\n            INTPL_LUMA_EXT_STORE(U2, U2, dst + 2 * i_dst + col);\r\n            INTPL_LUMA_EXT_STORE(U3, U3, dst + 3 * i_dst + col);\r\n\r\n            p += 16;\r\n            col += 16;\r\n        }\r\n\r\n        tmp += i_tmp * 4;\r\n        dst += i_dst * 4;\r\n    }\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_hor_avx2(pel_t *dst, int i_dst, int16_t *tmp, int i_tmp, pel_t *src, int i_src, int width, int height, int8_t const *coeff)\r\n{\r\n    int row, col = 0;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n\r\n    __m256i mAddOffset = _mm256_set1_epi16(offset);\r\n\r\n    __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8);\r\n    __m256i mSwitch2 = _mm256_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10);\r\n    __m256i mSwitch3 = _mm256_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12);\r\n    __m256i mSwitch4 = _mm256_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14, 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n\r\n#if ARCH_X86_64\r\n    __m256i mCoef = _mm256_set1_epi64x(*(long long *)coeff);\r\n#else\r\n    __m256i mCoef = _mm256_loadu_si256((__m256i*)coeff);\r\n    mCoef = _mm256_permute4x64_epi64(mCoef, 0x0);\r\n#endif\r\n\r\n    src -= 3;\r\n    for (row = 0; row < height; row++) {\r\n        __m256i srcCoeff1, srcCoeff2;\r\n        __m256i T20, T40, T60, T80;\r\n        __m256i sum10, sum20;\r\n\r\n        for (col = 0; col < width - 16; col += 32) {\r\n            srcCoeff1 = _mm256_loadu_si256((__m256i*)(src + col));\r\n            srcCoeff2 = _mm256_loadu_si256((__m256i*)(src + col + 8));\r\n\r\n            T20 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch1), mCoef);\r\n            T40 = 
_mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch2), mCoef);\r\n            T60 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch3), mCoef);\r\n            T80 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch4), mCoef);\r\n\r\n            sum10 = _mm256_hadd_epi16(_mm256_hadd_epi16(T20, T40), _mm256_hadd_epi16(T60, T80));\r\n\r\n            T20 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff2, mSwitch1), mCoef);\r\n            T40 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff2, mSwitch2), mCoef);\r\n            T60 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff2, mSwitch3), mCoef);\r\n            T80 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff2, mSwitch4), mCoef);\r\n\r\n            sum20 = _mm256_hadd_epi16(_mm256_hadd_epi16(T20, T40), _mm256_hadd_epi16(T60, T80));\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp[col],      _mm256_permute2x128_si256(sum10, sum20, 32));\r\n            _mm256_storeu_si256((__m256i*)&tmp[col + 16], _mm256_permute2x128_si256(sum10, sum20, 49));\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mAddOffset), shift);\r\n            sum20 = _mm256_srai_epi16(_mm256_add_epi16(sum20, mAddOffset), shift);\r\n\r\n            _mm256_storeu_si256((__m256i*)&dst[col], _mm256_packus_epi16(sum10, sum20));\r\n        }\r\n\r\n        // width 16\r\n        if (col < width - 8) {\r\n            srcCoeff1 = _mm256_loadu_si256((__m256i*)(src + col));\r\n            srcCoeff2 = _mm256_loadu_si256((__m256i*)(src + col + 8));\r\n\r\n            srcCoeff1 = _mm256_permute2x128_si256(srcCoeff1, srcCoeff2, 32);\r\n\r\n            T20 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch1), mCoef);\r\n            T40 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch2), mCoef);\r\n            T60 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch3), mCoef);\r\n            T80 = 
_mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch4), mCoef);\r\n\r\n            sum10 = _mm256_hadd_epi16(_mm256_hadd_epi16(T20, T40), _mm256_hadd_epi16(T60, T80));\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp[col], sum10);\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mAddOffset), shift);\r\n            sum10 = _mm256_permute4x64_epi64(_mm256_packus_epi16(sum10, sum10), 8);\r\n            _mm256_storeu_si256((__m256i*)&dst[col], sum10);\r\n            col += 16;\r\n        }\r\n\r\n        // width 8\r\n        if (col < width) {\r\n            srcCoeff1 = _mm256_loadu_si256((__m256i*)(src + col));\r\n\r\n            T20 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch1), mCoef);\r\n            T40 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch2), mCoef);\r\n            T60 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch3), mCoef);\r\n            T80 = _mm256_maddubs_epi16(_mm256_shuffle_epi8(srcCoeff1, mSwitch4), mCoef);\r\n\r\n            sum10 = _mm256_hadd_epi16(_mm256_hadd_epi16(T20, T40), _mm256_hadd_epi16(T60, T80));\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp[col], sum10);\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mAddOffset), shift);\r\n            sum10 = _mm256_packus_epi16(sum10, sum10);\r\n\r\n            _mm256_storeu_si256((__m256i*)&dst[col], sum10);\r\n        }\r\n\r\n        src += i_src;\r\n        tmp += i_tmp;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_hor_x3_avx2(pel_t *const dst[3], int i_dst, mct_t *const tmp[3], int i_tmp, pel_t *src, int i_src, int width, int height, const int8_t **coeff)\r\n{\r\n    int row, col = 0;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n\r\n    
__m256i mOffset = _mm256_set1_epi16(offset);\r\n\r\n    __m256i mSwitch1 = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8);\r\n    __m256i mSwitch2 = _mm256_setr_epi8(2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10);\r\n    __m256i mSwitch3 = _mm256_setr_epi8(4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12);\r\n    __m256i mSwitch4 = _mm256_setr_epi8(6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14, 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14);\r\n    __m256i mCoef0, mCoef1, mCoef2;\r\n    mct_t *tmp0 = tmp[0];\r\n    mct_t *tmp1 = tmp[1];\r\n    mct_t *tmp2 = tmp[2];\r\n    pel_t *dst0 = dst[0];\r\n    pel_t *dst1 = dst[1];\r\n    pel_t *dst2 = dst[2];\r\n\r\n#if ARCH_X86_64\r\n    mCoef0 = _mm256_set1_epi64x(*(long long *)coeff[0]);\r\n    mCoef1 = _mm256_set1_epi64x(*(long long *)coeff[1]);\r\n    mCoef2 = _mm256_set1_epi64x(*(long long *)coeff[2]);\r\n#else\r\n    mCoef0 = _mm256_permute4x64_epi64(_mm256_loadu_si256((__m256i*)coeff[0]), 0x0);\r\n    mCoef1 = _mm256_permute4x64_epi64(_mm256_loadu_si256((__m256i*)coeff[1]), 0x0);\r\n    mCoef2 = _mm256_permute4x64_epi64(_mm256_loadu_si256((__m256i*)coeff[2]), 0x0);\r\n#endif\r\n\r\n    src -= 3;\r\n    for (row = 0; row < height; row++) {\r\n        __m256i srcCoeff1, srcCoeff2;\r\n        __m256i S11, S12, S13, S14;\r\n        __m256i S21, S22, S23, S24;\r\n        __m256i sum10, sum20;\r\n\r\n        for (col = 0; col < width - 16; col += 32) {\r\n            srcCoeff1 = _mm256_loadu_si256((__m256i*)(src + col));\r\n            srcCoeff2 = _mm256_loadu_si256((__m256i*)(src + col + 8));\r\n\r\n            S11 = _mm256_shuffle_epi8(srcCoeff1, mSwitch1);\r\n            S12 = _mm256_shuffle_epi8(srcCoeff1, mSwitch2);\r\n            S13 = _mm256_shuffle_epi8(srcCoeff1, mSwitch3);\r\n            S14 = 
_mm256_shuffle_epi8(srcCoeff1, mSwitch4);\r\n\r\n            S21 = _mm256_shuffle_epi8(srcCoeff2, mSwitch1);\r\n            S22 = _mm256_shuffle_epi8(srcCoeff2, mSwitch2);\r\n            S23 = _mm256_shuffle_epi8(srcCoeff2, mSwitch3);\r\n            S24 = _mm256_shuffle_epi8(srcCoeff2, mSwitch4);\r\n\r\n#define INTPL_HOR_FLT(Coef, S1, S2, S3, S4, Res)   do { \\\r\n                __m256i T0 = _mm256_maddubs_epi16(S1, Coef); \\\r\n                __m256i T1 = _mm256_maddubs_epi16(S2, Coef); \\\r\n                __m256i T2 = _mm256_maddubs_epi16(S3, Coef); \\\r\n                __m256i T3 = _mm256_maddubs_epi16(S4, Coef); \\\r\n                Res = _mm256_hadd_epi16(_mm256_hadd_epi16(T0, T1), _mm256_hadd_epi16(T2, T3)); \\\r\n            } while (0)\r\n\r\n            /* 1st */\r\n            INTPL_HOR_FLT(mCoef0, S11, S12, S13, S14, sum10);\r\n            INTPL_HOR_FLT(mCoef0, S21, S22, S23, S24, sum20);\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp0[col],      _mm256_permute2x128_si256(sum10, sum20, 32));\r\n            _mm256_storeu_si256((__m256i*)&tmp0[col + 16], _mm256_permute2x128_si256(sum10, sum20, 49));\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mOffset), shift);\r\n            sum20 = _mm256_srai_epi16(_mm256_add_epi16(sum20, mOffset), shift);\r\n\r\n            _mm256_storeu_si256((__m256i*)&dst0[col], _mm256_packus_epi16(sum10, sum20));\r\n\r\n            /* 2nd */\r\n            INTPL_HOR_FLT(mCoef1, S11, S12, S13, S14, sum10);\r\n            INTPL_HOR_FLT(mCoef1, S21, S22, S23, S24, sum20);\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp1[col], _mm256_permute2x128_si256(sum10, sum20, 32));\r\n            _mm256_storeu_si256((__m256i*)&tmp1[col + 16], _mm256_permute2x128_si256(sum10, sum20, 49));\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mOffset), shift);\r\n            
sum20 = _mm256_srai_epi16(_mm256_add_epi16(sum20, mOffset), shift);\r\n\r\n            _mm256_storeu_si256((__m256i*)&dst1[col], _mm256_packus_epi16(sum10, sum20));\r\n\r\n            /* 3rd */\r\n            INTPL_HOR_FLT(mCoef2, S11, S12, S13, S14, sum10);\r\n            INTPL_HOR_FLT(mCoef2, S21, S22, S23, S24, sum20);\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp2[col], _mm256_permute2x128_si256(sum10, sum20, 32));\r\n            _mm256_storeu_si256((__m256i*)&tmp2[col + 16], _mm256_permute2x128_si256(sum10, sum20, 49));\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mOffset), shift);\r\n            sum20 = _mm256_srai_epi16(_mm256_add_epi16(sum20, mOffset), shift);\r\n\r\n            _mm256_storeu_si256((__m256i*)&dst2[col], _mm256_packus_epi16(sum10, sum20));\r\n        }\r\n\r\n        // width 16\r\n        if (col < width - 8) {\r\n            srcCoeff1 = _mm256_loadu_si256((__m256i*)(src + col));\r\n            srcCoeff2 = _mm256_loadu_si256((__m256i*)(src + col + 8));\r\n\r\n            srcCoeff1 = _mm256_permute2x128_si256(srcCoeff1, srcCoeff2, 32);\r\n            S11 = _mm256_shuffle_epi8(srcCoeff1, mSwitch1);\r\n            S12 = _mm256_shuffle_epi8(srcCoeff1, mSwitch2);\r\n            S13 = _mm256_shuffle_epi8(srcCoeff1, mSwitch3);\r\n            S14 = _mm256_shuffle_epi8(srcCoeff1, mSwitch4);\r\n\r\n            /* 1st */\r\n            INTPL_HOR_FLT(mCoef0, S11, S12, S13, S14, sum10);\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp0[col], sum10);\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mOffset), shift);\r\n            sum10 = _mm256_permute4x64_epi64(_mm256_packus_epi16(sum10, sum10), 8);\r\n            _mm256_storeu_si256((__m256i*)&dst0[col], sum10);\r\n\r\n            /* 2nd */\r\n            INTPL_HOR_FLT(mCoef1, S11, S12, S13, S14, sum10);\r\n\r\n            // 
store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp1[col], sum10);\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mOffset), shift);\r\n            sum10 = _mm256_permute4x64_epi64(_mm256_packus_epi16(sum10, sum10), 8);\r\n            _mm256_storeu_si256((__m256i*)&dst1[col], sum10);\r\n\r\n            /* 3rd */\r\n            INTPL_HOR_FLT(mCoef2, S11, S12, S13, S14, sum10);\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp2[col], sum10);\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mOffset), shift);\r\n            sum10 = _mm256_permute4x64_epi64(_mm256_packus_epi16(sum10, sum10), 8);\r\n            _mm256_storeu_si256((__m256i*)&dst2[col], sum10);\r\n            col += 16;\r\n        }\r\n\r\n        // width 8\r\n        if (col < width) {\r\n            srcCoeff1 = _mm256_loadu_si256((__m256i*)(src + col));\r\n            S11 = _mm256_shuffle_epi8(srcCoeff1, mSwitch1);\r\n            S12 = _mm256_shuffle_epi8(srcCoeff1, mSwitch2);\r\n            S13 = _mm256_shuffle_epi8(srcCoeff1, mSwitch3);\r\n            S14 = _mm256_shuffle_epi8(srcCoeff1, mSwitch4);\r\n\r\n            /* 1st */\r\n            INTPL_HOR_FLT(mCoef0, S11, S12, S13, S14, sum10);\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp0[col], sum10);\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mOffset), shift);\r\n            sum10 = _mm256_packus_epi16(sum10, sum10);\r\n\r\n            _mm256_storeu_si256((__m256i*)&dst0[col], sum10);\r\n\r\n            /* 2nd */\r\n            INTPL_HOR_FLT(mCoef1, S11, S12, S13, S14, sum10);\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp1[col], sum10);\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mOffset), shift);\r\n            sum10 = 
_mm256_packus_epi16(sum10, sum10);\r\n\r\n            _mm256_storeu_si256((__m256i*)&dst1[col], sum10);\r\n\r\n            /* 3rd */\r\n            INTPL_HOR_FLT(mCoef2, S11, S12, S13, S14, sum10);\r\n\r\n            // store 16bit\r\n            _mm256_storeu_si256((__m256i*)&tmp2[col], sum10);\r\n\r\n            // store 8bit\r\n            sum10 = _mm256_srai_epi16(_mm256_add_epi16(sum10, mOffset), shift);\r\n            sum10 = _mm256_packus_epi16(sum10, sum10);\r\n\r\n            _mm256_storeu_si256((__m256i*)&dst2[col], sum10);\r\n        }\r\n\r\n        src    += i_src;\r\n        tmp0 += i_tmp;\r\n        tmp1 += i_tmp;\r\n        tmp2 += i_tmp;\r\n        dst0 += i_dst;\r\n        dst1 += i_dst;\r\n        dst2 += i_dst;\r\n    }\r\n#undef INTPL_HOR_FLT\r\n\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define INTPL_LUMA_VER_COMPUT(W0,W1,W2,W3,W4,W5,W6,W7,result)      \\\r\n    T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(W0, W1), mCoefy01);                  \\\r\n    T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(W2, W3), mCoefy23);                  \\\r\n    T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(W4, W5), mCoefy45);                  \\\r\n    T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(W6, W7), mCoefy67);                  \\\r\n    T4 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(W0, W1), mCoefy01);                  \\\r\n    T5 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(W2, W3), mCoefy23);                  \\\r\n    T6 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(W4, W5), mCoefy45);                  \\\r\n    T7 = _mm256_maddubs_epi16(_mm256_unpackhi_epi8(W6, W7), mCoefy67);                  \\\r\n    \\\r\n    mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));       \\\r\n    mVal2 = _mm256_add_epi16(_mm256_add_epi16(T4, T5), _mm256_add_epi16(T6, T7));       \\\r\n    \\\r\n    mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);              
\\\r\n    mVal2 = _mm256_srai_epi16(_mm256_add_epi16(mVal2, mAddOffset), shift);              \\\r\n    result = _mm256_packus_epi16(mVal1, mVal2);\r\n\r\n#define INTPL_LUMA_VER_STORE(a, b)                         \\\r\n    _mm256_storeu_si256((__m256i*)(b), a);\r\n\r\n#define INTPL_LUMA_VER_COMPUT_LOW(W0,W1,W2,W3,W4,W5,W6,W7,result)      \\\r\n    T0 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(W0, W1), mCoefy01);                  \\\r\n    T1 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(W2, W3), mCoefy23);                  \\\r\n    T2 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(W4, W5), mCoefy45);                  \\\r\n    T3 = _mm256_maddubs_epi16(_mm256_unpacklo_epi8(W6, W7), mCoefy67);                  \\\r\n    \\\r\n    mVal1 = _mm256_add_epi16(_mm256_add_epi16(T0, T1), _mm256_add_epi16(T2, T3));       \\\r\n    \\\r\n    mVal1 = _mm256_srai_epi16(_mm256_add_epi16(mVal1, mAddOffset), shift);              \\\r\n    result = _mm256_packus_epi16(mVal1, mVal1);\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_ver_avx2(pel_t *dst, int i_dst, pel_t *src, int i_src, int width, int height, int8_t const *coeff)\r\n{\r\n    int row, col;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n\r\n    __m256i mAddOffset = _mm256_set1_epi16(offset);\r\n\r\n    pel_t const *p;\r\n\r\n    src -= 3 * i_src;\r\n\r\n    __m256i mVal1, mVal2;\r\n\r\n    __m256i mCoefy01 = _mm256_set1_epi16(*(short*)coeff);\r\n    __m256i mCoefy23 = _mm256_set1_epi16(*(short*)(coeff + 2));\r\n    __m256i mCoefy45 = _mm256_set1_epi16(*(short*)(coeff + 4));\r\n    __m256i mCoefy67 = _mm256_set1_epi16(*(short*)(coeff + 6));\r\n\r\n    __m256i T00, T10, T20, T30, T40, T50, T60, T70, T80, T90, Ta0;\r\n    __m256i T0, T1, T2, T3, T4, T5, T6, T7;\r\n    __m256i U0, U1, U2, U3;\r\n    for (row = 0; row < height; row = row + 4) {\r\n        p = src;\r\n        for (col = 0; col < width - 8; col += 32) {\r\n            T00 = 
_mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_src));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_src));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_src));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_src));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 5 * i_src));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_src));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_src));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 8 * i_src));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 9 * i_src));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_src));\r\n\r\n            INTPL_LUMA_VER_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_VER_STORE(U0, dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_VER_STORE(U1, dst + i_dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_VER_STORE(U2, dst + 2 * i_dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n            INTPL_LUMA_VER_STORE(U3, dst + 3 * i_dst + col);\r\n\r\n            p += 32;\r\n        }\r\n        if (col < width) {\r\n            T00 = _mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_src));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_src));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_src));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_src));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 5 * i_src));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_src));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_src));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 8 * i_src));\r\n            T90 = 
_mm256_loadu_si256((__m256i*)(p + 9 * i_src));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_src));\r\n\r\n            INTPL_LUMA_VER_COMPUT_LOW(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_VER_STORE(U0, dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT_LOW(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_VER_STORE(U1, dst + i_dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT_LOW(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_VER_STORE(U2, dst + 2 * i_dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT_LOW(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n            INTPL_LUMA_VER_STORE(U3, dst + 3 * i_dst + col);\r\n        }\r\n        src += 4 * i_src;\r\n        dst += 4 * i_dst;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intpl_luma_ver_x3_avx2(pel_t *const dst[3], int i_dst, pel_t *src, int i_src, int width, int height, const int8_t **coeff)\r\n{\r\n#if 1\r\n    intpl_luma_ver_avx2(dst[0], i_dst, src, i_src, width, height, coeff[0]);\r\n    intpl_luma_ver_avx2(dst[1], i_dst, src, i_src, width, height, coeff[1]);\r\n    intpl_luma_ver_avx2(dst[2], i_dst, src, i_src, width, height, coeff[2]);\r\n#else\r\n    int row, col;\r\n    const short offset = 32;\r\n    const int shift = 6;\r\n\r\n    __m256i mAddOffset = _mm256_set1_epi16(offset);\r\n\r\n    pel_t const *p;\r\n\r\n    src -= 3 * i_src;\r\n\r\n    __m256i mVal1, mVal2;\r\n\r\n    __m256i mCoefy01 = _mm256_set1_epi16(*(short*)coeff);\r\n    __m256i mCoefy23 = _mm256_set1_epi16(*(short*)(coeff + 2));\r\n    __m256i mCoefy45 = _mm256_set1_epi16(*(short*)(coeff + 4));\r\n    __m256i mCoefy67 = _mm256_set1_epi16(*(short*)(coeff + 6));\r\n\r\n    __m256i T00, T10, T20, T30, T40, T50, T60, T70, T80, T90, Ta0;\r\n    __m256i T0, T1, T2, T3, T4, T5, T6, T7;\r\n    __m256i U0, U1, U2, U3;\r\n    for (row = 0; row < height; row = row + 4) {\r\n        p 
= src;\r\n        for (col = 0; col < width - 8; col += 32) {\r\n            T00 = _mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_src));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_src));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_src));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_src));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 5 * i_src));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_src));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_src));\r\n            T80 = _mm256_loadu_si256((__m256i*)(p + 8 * i_src));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 9 * i_src));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_src));\r\n\r\n            INTPL_LUMA_VER_COMPUT(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_VER_STORE(U0, dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_VER_STORE(U1, dst + i_dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_VER_STORE(U2, dst + 2 * i_dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n            INTPL_LUMA_VER_STORE(U3, dst + 3 * i_dst + col);\r\n\r\n            p += 32;\r\n        }\r\n        if (col < width) {\r\n            T00 = _mm256_loadu_si256((__m256i*)(p));\r\n            T10 = _mm256_loadu_si256((__m256i*)(p + i_src));\r\n            T20 = _mm256_loadu_si256((__m256i*)(p + 2 * i_src));\r\n            T30 = _mm256_loadu_si256((__m256i*)(p + 3 * i_src));\r\n            T40 = _mm256_loadu_si256((__m256i*)(p + 4 * i_src));\r\n            T50 = _mm256_loadu_si256((__m256i*)(p + 5 * i_src));\r\n            T60 = _mm256_loadu_si256((__m256i*)(p + 6 * i_src));\r\n            T70 = _mm256_loadu_si256((__m256i*)(p + 7 * i_src));\r\n            T80 = 
_mm256_loadu_si256((__m256i*)(p + 8 * i_src));\r\n            T90 = _mm256_loadu_si256((__m256i*)(p + 9 * i_src));\r\n            Ta0 = _mm256_loadu_si256((__m256i*)(p + 10 * i_src));\r\n\r\n            INTPL_LUMA_VER_COMPUT_LOW(T00, T10, T20, T30, T40, T50, T60, T70, U0);\r\n            INTPL_LUMA_VER_STORE(U0, dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT_LOW(T10, T20, T30, T40, T50, T60, T70, T80, U1);\r\n            INTPL_LUMA_VER_STORE(U1, dst + i_dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT_LOW(T20, T30, T40, T50, T60, T70, T80, T90, U2);\r\n            INTPL_LUMA_VER_STORE(U2, dst + 2 * i_dst + col);\r\n\r\n            INTPL_LUMA_VER_COMPUT_LOW(T30, T40, T50, T60, T70, T80, T90, Ta0, U3);\r\n            INTPL_LUMA_VER_STORE(U3, dst + 3 * i_dst + col);\r\n        }\r\n        src += 4 * i_src;\r\n        dst += 4 * i_dst;\r\n    }\r\n#endif\r\n}\r\n#endif\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_intra-filledge.cc",
    "content": "/*\r\n * intrinsic_intra-fiiledge.cc\r\n *\r\n * Description of this file:\r\n *   SSE assembly functions of Intra-Filledge module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n\r\n\r\n#if !HIGH_BIT_DEPTH\r\n/* ---------------------------------------------------------------------------\r\n * fill reference samples for intra prediction\r\n * LCU߽ϵPU\r\n */\r\nvoid fill_edge_samples_0_sse128(const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy)\r\n{\r\n    __m128i T0, T1;\r\n    int i, k, j;\r\n    int num_padding;\r\n\r\n    
UNUSED_PARAMETER(pTL);\r\n    UNUSED_PARAMETER(i_TL);\r\n    /* fill default value */\r\n    k = ((bsy + bsx) << 1) + 1;\r\n    j = (k >> 4) << 4;\r\n    T0 = _mm_set1_epi8((uint8_t)g_dc_value);\r\n    for (i = 0; i < j; i += 16) {\r\n        _mm_storeu_si128((__m128i *)(&EP[-(bsy << 1)] + i), T0);\r\n    }\r\n    memset(&EP[-(bsy << 1)] + j, g_dc_value, k - j + 1);\r\n    EP[2 * bsx] = (pel_t)g_dc_value;\r\n    \r\n    /* get prediction pixels ---------------------------------------\r\n     * extra pixels          | left-down pixels   | left pixels   | top-left | top pixels  | top-right pixels  | extra pixels\r\n     * -2*bsy-4 ... -2*bsy-1 | -bsy-bsy ... -bsy-1| -bsy -3 -2 -1 |     0    | 1 2 ... bsx | bsx+1 ... bsx+bsx | 2*bsx+1 ... 2*bsx+4\r\n     */\r\n\r\n    /* fill top & top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        /* fill top pixels */\r\n        if (bsx == 4) {\r\n            memcpy(&EP[1], &pLcuEP[1], bsx * sizeof(pel_t));\r\n        } else if (bsx == 8) {\r\n            T1 = _mm_loadu_si128((__m128i *)&pLcuEP[1]);\r\n            _mm_storel_epi64((__m128i *)&EP[1], T1);\r\n        } else {\r\n            for (i = 0; i < bsx; i += 16) {\r\n                T1 = _mm_loadu_si128((__m128i *)(pLcuEP + i + 1));\r\n                _mm_store_si128((__m128i *)(&EP[1] + i), T1);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_RIGHT)) {\r\n        if (bsx == 4) {\r\n            memcpy(&EP[bsx + 1], &pLcuEP[bsx + 1], bsx * sizeof(pel_t));\r\n        } else if (bsx == 8) {\r\n            T1 = _mm_loadu_si128((__m128i *)&pLcuEP[bsx + 1]);\r\n            _mm_storel_epi64((__m128i *)&EP[bsx + 1], T1);\r\n        } else {\r\n            for (i = 0; i < bsx; i += 16) {\r\n                T1 = _mm_loadu_si128((__m128i *)(&pLcuEP[bsx + i + 1]));\r\n                _mm_store_si128((__m128i *)(&EP[bsx + 1] + i), T1);\r\n            }\r\n        }\r\n    } else {\r\n  
      if (bsx == 4) {\r\n            memset(&EP[bsx + 1], EP[bsx], bsx);\r\n        } else if (bsx == 8) {\r\n            T0 = _mm_set1_epi8(EP[bsx]);    // repeat the last pixel\r\n            _mm_storel_epi64((__m128i *)&EP[bsx + 1], T0);\r\n        } else {\r\n            T0 = _mm_set1_epi8(EP[bsx]);    // repeat the last pixel\r\n            for (i = 0; i < bsx; i += 16) {\r\n                _mm_store_si128((__m128i *)(&EP[bsx + 1 + i]), T0);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsy * 11 / 4 - bsx + 4;\r\n    if (num_padding > 0) {\r\n        memset(&EP[2 * bsx + 1], EP[2 * bsx], num_padding); // from (2*bsx) to (iX + 3) = (bsy *11/4 + bsx - 1) + 3\r\n    }\r\n\r\n    /* fill left & left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        /* fill left pixels */\r\n        memcpy(&EP[-bsy], &pLcuEP[-bsy], bsy * sizeof(pel_t));\r\n    }\r\n\r\n    /* fill left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT_DOWN)) {\r\n        memcpy(&EP[-2 * bsy], &pLcuEP[-2 * bsy], bsy * sizeof(pel_t));\r\n    } else {\r\n        if (bsy == 4) {\r\n            memset(&EP[-(bsy << 1)], EP[-bsy], bsy);\r\n        } else if (bsy == 8) {\r\n            T0 = _mm_set1_epi8(EP[-bsy]);\r\n            _mm_storel_epi64((__m128i *)&EP[-(bsy << 1)], T0);\r\n        } else {\r\n            T0 = _mm_set1_epi8(EP[-bsy]);\r\n            for (i = 0; i < bsy; i += 16) {\r\n                _mm_storeu_si128((__m128i *)(&EP[-(bsy << 1)] + i), T0);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsx * 11 / 4 - bsy + 4;\r\n    if (num_padding > 0) {\r\n        memset(&EP[-2 * bsy - num_padding], EP[-2 * bsy], num_padding); // from (-2*bsy) to (-iY - 3) = -(bsx *11/4 + bsy - 1) - 3\r\n    }\r\n\r\n    /* fill EP[0] */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_LEFT)) {\r\n        EP[0] = pLcuEP[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) 
{\r\n        EP[0] = pLcuEP[1];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        EP[0] = pLcuEP[-1];\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * fill reference samples for intra prediction\r\n * (for PUs on the LCU boundary)\r\n */\r\nvoid fill_edge_samples_x_sse128(const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy)\r\n{\r\n    __m128i T0, T1;\r\n    int i, k, j;\r\n    int num_padding;\r\n\r\n    const pel_t *pL = pTL + i_TL;\r\n\r\n    /* fill default value */\r\n    k = ((bsy + bsx) << 1) + 1;\r\n    j = (k >> 4) << 4;\r\n    T0 = _mm_set1_epi8((uint8_t)g_dc_value);\r\n    for (i = 0; i < j; i += 16) {\r\n        _mm_storeu_si128((__m128i *)(&EP[-(bsy << 1)] + i), T0);\r\n    }\r\n    memset(&EP[-(bsy << 1)] + j, g_dc_value, k - j + 1);\r\n    EP[2 * bsx] = (pel_t)g_dc_value;\r\n    \r\n    /* get prediction pixels ---------------------------------------\r\n     * extra pixels          | left-down pixels   | left pixels   | top-left | top pixels  | top-right pixels  | extra pixels\r\n     * -2*bsy-4 ... -2*bsy-1 | -bsy-bsy ... -bsy-1| -bsy -3 -2 -1 |     0    | 1 2 ... bsx | bsx+1 ... bsx+bsx | 2*bsx+1 ... 
2*bsx+4\r\n     */\r\n\r\n    /* fill top & top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        /* fill top pixels */\r\n        if (bsx == 4) {\r\n            memcpy(&EP[1], &pLcuEP[1], bsx * sizeof(pel_t));\r\n        } else if (bsx == 8) {\r\n            T1 = _mm_loadu_si128((__m128i *)&pLcuEP[1]);\r\n            _mm_storel_epi64((__m128i *)&EP[1], T1);\r\n        } else {\r\n            for (i = 0; i < bsx; i += 16) {\r\n                T1 = _mm_loadu_si128((__m128i *)(pLcuEP + i + 1));\r\n                _mm_store_si128((__m128i *)(&EP[1] + i), T1);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_RIGHT)) {\r\n        if (bsx == 4) {\r\n            memcpy(&EP[bsx + 1], &pLcuEP[bsx + 1], bsx * sizeof(pel_t));\r\n        } else if (bsx == 8) {\r\n            T1 = _mm_loadu_si128((__m128i *)&pLcuEP[bsx + 1]);\r\n            _mm_storel_epi64((__m128i *)&EP[bsx + 1], T1);\r\n        } else {\r\n            for (i = 0; i < bsx; i += 16) {\r\n                T1 = _mm_loadu_si128((__m128i *)(&pLcuEP[bsx + i + 1]));\r\n                _mm_store_si128((__m128i *)(&EP[bsx + 1] + i), T1);\r\n            }\r\n        }\r\n    } else {\r\n        if (bsx == 4) {\r\n            memset(&EP[bsx + 1], EP[bsx], bsx);\r\n        } else if (bsx == 8) {\r\n            T0 = _mm_set1_epi8(EP[bsx]);    // repeat the last pixel\r\n            _mm_storel_epi64((__m128i *)&EP[bsx + 1], T0);\r\n        } else {\r\n            T0 = _mm_set1_epi8(EP[bsx]);    // repeat the last pixel\r\n            for (i = 0; i < bsx; i += 16) {\r\n                _mm_store_si128((__m128i *)(&EP[bsx + 1 + i]), T0);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsy * 11 / 4 - bsx + 4;\r\n    if (num_padding > 0) {\r\n        memset(&EP[2 * bsx + 1], EP[2 * bsx], num_padding); // from (2*bsx) to (iX + 3) = (bsy *11/4 + bsx - 1) + 3\r\n    }\r\n\r\n    /* 
fill left & left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        const pel_t *p_l = pL;\r\n        int y;\r\n        /* fill left pixels */\r\n        for (y = 0; y < bsy; y++) {\r\n            EP[-1 - y] = *p_l;\r\n            p_l += i_TL;\r\n        }\r\n    }\r\n\r\n    /* fill left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT_DOWN)) {\r\n        const pel_t *p_l = pL + bsy * i_TL;\r\n        int y;\r\n        for (y = 0; y < bsy; y++) {\r\n            EP[-bsy - 1 - y] = *p_l;\r\n            p_l += i_TL;\r\n        }\r\n    } else {\r\n        if (bsy == 4) {\r\n            memset(&EP[-(bsy << 1)], EP[-bsy], bsy);\r\n        } else if (bsy == 8) {\r\n            T0 = _mm_set1_epi8(EP[-bsy]);\r\n            _mm_storel_epi64((__m128i *)&EP[-(bsy << 1)], T0);\r\n        } else {\r\n            T0 = _mm_set1_epi8(EP[-bsy]);\r\n            for (i = 0; i < bsy; i += 16) {\r\n                _mm_storeu_si128((__m128i *)(&EP[-(bsy << 1)] + i), T0);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsx * 11 / 4 - bsy + 4;\r\n    if (num_padding > 0) {\r\n        memset(&EP[-2 * bsy - num_padding], EP[-2 * bsy], num_padding); // from (-2*bsy) to (-iY - 3) = -(bsx *11/4 + bsy - 1) - 3\r\n    }\r\n\r\n    /* fill EP[0] */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_LEFT)) {\r\n        EP[0] = pLcuEP[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        EP[0] = pLcuEP[1];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        EP[0] = pL[0];\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * fill reference samples for intra prediction\r\n * (for PUs on the LCU boundary)\r\n */\r\nvoid fill_edge_samples_y_sse128(const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy)\r\n{\r\n    __m128i T0, T1;\r\n    int i, k, j;\r\n    int num_padding;\r\n\r\n    const pel_t *pT = pTL + 1;\r\n    
UNUSED_PARAMETER(i_TL);\r\n\r\n    /* fill default value */\r\n    k = ((bsy + bsx) << 1) + 1;\r\n    j = (k >> 4) << 4;\r\n    T0 = _mm_set1_epi8((uint8_t)g_dc_value);\r\n    for (i = 0; i < j; i += 16) {\r\n        _mm_storeu_si128((__m128i *)(&EP[-(bsy << 1)] + i), T0);\r\n    }\r\n    memset(&EP[-(bsy << 1)] + j, g_dc_value, k - j + 1);\r\n    EP[2 * bsx] = (pel_t)g_dc_value;\r\n    \r\n    /* get prediction pixels ---------------------------------------\r\n     * extra pixels          | left-down pixels   | left pixels   | top-left | top pixels  | top-right pixels  | extra pixels\r\n     * -2*bsy-4 ... -2*bsy-1 | -bsy-bsy ... -bsy-1| -bsy -3 -2 -1 |     0    | 1 2 ... bsx | bsx+1 ... bsx+bsx | 2*bsx+1 ... 2*bsx+4\r\n     */\r\n\r\n    /* fill top & top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        /* fill top pixels */\r\n        if (bsx == 4) {\r\n            memcpy(&EP[1], pT, bsx * sizeof(pel_t));\r\n        } else if (bsx == 8) {\r\n            T1 = _mm_loadu_si128((__m128i *)pT);\r\n            _mm_storel_epi64((__m128i *)&EP[1], T1);\r\n        } else {\r\n            for (i = 0; i < bsx; i += 16) {\r\n                T1 = _mm_loadu_si128((__m128i *)(pT + i));\r\n                _mm_store_si128((__m128i *)(&EP[1] + i), T1);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_RIGHT)) {\r\n        if (bsx == 4) {\r\n            memcpy(&EP[bsx + 1], &pT[bsx], bsx * sizeof(pel_t));\r\n        } else if (bsx == 8) {\r\n            T1 = _mm_loadu_si128((__m128i *)&pT[bsx]);\r\n            _mm_storel_epi64((__m128i *)&EP[bsx + 1], T1);\r\n        } else {\r\n            for (i = 0; i < bsx; i += 16) {\r\n                T1 = _mm_loadu_si128((__m128i *)(&pT[bsx + i]));\r\n                _mm_store_si128((__m128i *)(&EP[bsx + 1] + i), T1);\r\n            }\r\n        }\r\n    } else {\r\n        if (bsx == 4) {\r\n            memset(&EP[bsx + 1], EP[bsx], 
bsx);\r\n        } else if (bsx == 8) {\r\n            T0 = _mm_set1_epi8(EP[bsx]);    // repeat the last pixel\r\n            _mm_storel_epi64((__m128i *)&EP[bsx + 1], T0);\r\n        } else {\r\n            T0 = _mm_set1_epi8(EP[bsx]);    // repeat the last pixel\r\n            for (i = 0; i < bsx; i += 16) {\r\n                _mm_store_si128((__m128i *)(&EP[bsx + 1 + i]), T0);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsy * 11 / 4 - bsx + 4;\r\n    if (num_padding > 0) {\r\n        memset(&EP[2 * bsx + 1], EP[2 * bsx], num_padding); // from (2*bsx) to (iX + 3) = (bsy *11/4 + bsx - 1) + 3\r\n    }\r\n\r\n    /* fill left & left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        /* fill left pixels */\r\n        memcpy(&EP[-bsy], &pLcuEP[-bsy], bsy * sizeof(pel_t));\r\n    }\r\n\r\n    /* fill left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT_DOWN)) {\r\n        memcpy(&EP[-2 * bsy], &pLcuEP[-2 * bsy], bsy * sizeof(pel_t));\r\n    } else {\r\n        if (bsy == 4) {\r\n            memset(&EP[-(bsy << 1)], EP[-bsy], bsy);\r\n        } else if (bsy == 8) {\r\n            T0 = _mm_set1_epi8(EP[-bsy]);\r\n            _mm_storel_epi64((__m128i *)&EP[-(bsy << 1)], T0);\r\n        } else {\r\n            T0 = _mm_set1_epi8(EP[-bsy]);\r\n            for (i = 0; i < bsy; i += 16) {\r\n                _mm_storeu_si128((__m128i *)(&EP[-(bsy << 1)] + i), T0);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsx * 11 / 4 - bsy + 4;\r\n    if (num_padding > 0) {\r\n        memset(&EP[-2 * bsy - num_padding], EP[-2 * bsy], num_padding); // from (-2*bsy) to (-iY - 3) = -(bsx *11/4 + bsy - 1) - 3\r\n    }\r\n\r\n    /* fill EP[0] */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_LEFT)) {\r\n        EP[0] = pLcuEP[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        EP[0] = pT[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, 
MD_I_LEFT)) {\r\n        EP[0] = pLcuEP[-1];\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * fill reference samples for intra prediction\r\n * (for PUs on the LCU boundary)\r\n */\r\nvoid fill_edge_samples_xy_sse128(const pel_t *pTL, int i_TL, const pel_t *pLcuEP, pel_t *EP, uint32_t i_avai, int bsx, int bsy)\r\n{\r\n    __m128i T0, T1;\r\n    int i, k, j;\r\n    int num_padding;\r\n\r\n    const pel_t *pT = pTL + 1;\r\n    const pel_t *pL = pTL + i_TL;\r\n\r\n    UNUSED_PARAMETER(pLcuEP);\r\n    /* fill default value */\r\n    k = ((bsy + bsx) << 1) + 1;\r\n    j = (k >> 4) << 4;\r\n    T0 = _mm_set1_epi8((uint8_t)g_dc_value);\r\n    for (i = 0; i < j; i += 16) {\r\n        _mm_storeu_si128((__m128i *)(&EP[-(bsy << 1)] + i), T0);\r\n    }\r\n    memset(&EP[-(bsy << 1)] + j, g_dc_value, k - j + 1);\r\n    EP[2 * bsx] = (pel_t)g_dc_value;\r\n    \r\n    /* get prediction pixels ---------------------------------------\r\n     * extra pixels          | left-down pixels   | left pixels   | top-left | top pixels  | top-right pixels  | extra pixels\r\n     * -2*bsy-4 ... -2*bsy-1 | -bsy-bsy ... -bsy-1| -bsy -3 -2 -1 |     0    | 1 2 ... bsx | bsx+1 ... bsx+bsx | 2*bsx+1 ... 
2*bsx+4\r\n     */\r\n\r\n    /* fill top & top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        /* fill top pixels */\r\n        if (bsx == 4) {\r\n            memcpy(&EP[1], pT, bsx * sizeof(pel_t));\r\n        } else if (bsx == 8) {\r\n            T1 = _mm_loadu_si128((__m128i *)pT);\r\n            _mm_storel_epi64((__m128i *)&EP[1], T1);\r\n        } else {\r\n            for (i = 0; i < bsx; i += 16) {\r\n                T1 = _mm_loadu_si128((__m128i *)(pT + i));\r\n                _mm_store_si128((__m128i *)(&EP[1] + i), T1);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill top-right pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_RIGHT)) {\r\n        if (bsx == 4) {\r\n            memcpy(&EP[bsx + 1], &pT[bsx], bsx * sizeof(pel_t));\r\n        } else if (bsx == 8) {\r\n            T1 = _mm_loadu_si128((__m128i *)&pT[bsx]);\r\n            _mm_storel_epi64((__m128i *)&EP[bsx + 1], T1);\r\n        } else {\r\n            for (i = 0; i < bsx; i += 16) {\r\n                T1 = _mm_loadu_si128((__m128i *)(&pT[bsx + i]));\r\n                _mm_store_si128((__m128i *)(&EP[bsx + 1] + i), T1);\r\n            }\r\n        }\r\n    } else {\r\n        if (bsx == 4) {\r\n            memset(&EP[bsx + 1], EP[bsx], bsx);\r\n        } else if (bsx == 8) {\r\n            T0 = _mm_set1_epi8(EP[bsx]);    // repeat the last pixel\r\n            _mm_storel_epi64((__m128i *)&EP[bsx + 1], T0);\r\n        } else {\r\n            T0 = _mm_set1_epi8(EP[bsx]);    // repeat the last pixel\r\n            for (i = 0; i < bsx; i += 16) {\r\n                _mm_store_si128((__m128i *)(&EP[bsx + 1 + i]), T0);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsy * 11 / 4 - bsx + 4;\r\n    if (num_padding > 0) {\r\n        memset(&EP[2 * bsx + 1], EP[2 * bsx], num_padding); // from (2*bsx) to (iX + 3) = (bsy *11/4 + bsx - 1) + 3\r\n    }\r\n\r\n    /* fill left & left-down pixels */\r\n    if 
(IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        const pel_t *p_l = pL;\r\n        int y;\r\n        /* fill left pixels */\r\n        for (y = 0; y < bsy; y++) {\r\n            EP[-1 - y] = *p_l;\r\n            p_l += i_TL;\r\n        }\r\n    }\r\n\r\n    /* fill left-down pixels */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT_DOWN)) {\r\n        const pel_t *p_l = pL + bsy * i_TL;\r\n        int y;\r\n        for (y = 0; y < bsy; y++) {\r\n            EP[-bsy - 1 - y] = *p_l;\r\n            p_l += i_TL;\r\n        }\r\n    } else {\r\n        if (bsy == 4) {\r\n            memset(&EP[-(bsy << 1)], EP[-bsy], bsy);\r\n        } else if (bsy == 8) {\r\n            T0 = _mm_set1_epi8(EP[-bsy]);\r\n            _mm_storel_epi64((__m128i *)&EP[-(bsy << 1)], T0);\r\n        } else {\r\n            T0 = _mm_set1_epi8(EP[-bsy]);\r\n            for (i = 0; i < bsy; i += 16) {\r\n                _mm_storeu_si128((__m128i *)(&EP[-(bsy << 1)] + i), T0);\r\n            }\r\n        }\r\n    }\r\n\r\n    /* fill extra pixels */\r\n    num_padding = bsx * 11 / 4 - bsy + 4;\r\n    if (num_padding > 0) {\r\n        memset(&EP[-2 * bsy - num_padding], EP[-2 * bsy], num_padding); // from (-2*bsy) to (-iY - 3) = -(bsx *11/4 + bsy - 1) - 3\r\n    }\r\n\r\n    /* fill EP[0] */\r\n    if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP_LEFT)) {\r\n        EP[0] = pTL[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_TOP)) {\r\n        EP[0] = pT[0];\r\n    } else if (IS_NEIGHBOR_AVAIL(i_avai, MD_I_LEFT)) {\r\n        EP[0] = pL[0];\r\n    }\r\n}\r\n\r\n#endif // #if !HIGH_BIT_DEPTH\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_intra-pred.cc",
    "content": "/*\r\n * intrinsic_intra-pred.cc\r\n *\r\n * Description of this file:\r\n *    SSE assembly functions of Intra-Prediction module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n\r\n\r\n#if !HIGH_BIT_DEPTH\r\n\r\nstatic ALIGN16(int8_t tab_coeff_mode_5[8][16]) = {\r\n    { 20, 52, 44, 12, 20, 52, 44, 12, 20, 52, 44, 12, 20, 52, 44, 12 },\r\n    { 8, 40, 56, 24, 8, 40, 56, 24, 8, 40, 56, 24, 8, 40, 56, 24 },\r\n    { 28, 60, 36, 4, 28, 60, 36, 4, 28, 60, 36, 4, 28, 60, 36, 4 },\r\n    { 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16, 16, 48, 48, 16 },\r\n    { 4, 36, 60, 28, 4, 36, 
60, 28, 4, 36, 60, 28, 4, 36, 60, 28 },\r\n    { 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8, 24, 56, 40, 8 },\r\n    { 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20, 12, 44, 52, 20 },\r\n    { 32, 64, 32, 0, 32, 64, 32, 0, 32, 64, 32, 0, 32, 64, 32, 0 }\r\n};\r\nstatic uint8_t tab_idx_mode_5[64] = {\r\n    1, 2, 4, 5, 6, 8, 9, 11, 12, 13, 15, 16, 17, 19, 20, 22, 23, 24, 26, 27, 28, 30, 31,\r\n    33, 34, 35, 37, 38, 39, 41, 42, 44, 45, 46, 48, 49, 50, 52, 53, 55, 56, 57, 59, 60,\r\n    61, 63, 64, 66, 67, 68, 70, 71, 72, 74, 75, 77, 78, 79, 81, 82, 83, 85, 86, 88\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intra_pred_ver_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int y;\r\n    pel_t *rpSrc = src + 1;\r\n    __m128i T1, T2, T3, T4;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    switch (bsx) {\r\n        case 4:\r\n            for (y = 0; y < bsy; y += 2) {\r\n                CP32(dst, rpSrc);\r\n                CP32(dst + i_dst, rpSrc);\r\n                dst += i_dst << 1;\r\n            }\r\n            break;\r\n        case 8:\r\n            for (y = 0; y < bsy; y += 2) {\r\n                CP64(dst, rpSrc);\r\n                CP64(dst + i_dst, rpSrc);\r\n                dst += i_dst << 1;\r\n            }\r\n            break;\r\n        case 16:\r\n            T1 = _mm_loadu_si128((__m128i*)rpSrc);\r\n            for (y = 0; y < bsy; y++) {\r\n                _mm_storeu_si128((__m128i*)(dst), T1);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        case 32:\r\n            T1 = _mm_loadu_si128((__m128i*)(rpSrc + 0));\r\n            T2 = _mm_loadu_si128((__m128i*)(rpSrc + 16));\r\n            for (y = 0; y < bsy; y++) {\r\n                _mm_storeu_si128((__m128i*)(dst + 0), T1);\r\n                _mm_storeu_si128((__m128i*)(dst + 16), T2);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n   
     case 64:\r\n            T1 = _mm_loadu_si128((__m128i*)(rpSrc + 0));\r\n            T2 = _mm_loadu_si128((__m128i*)(rpSrc + 16));\r\n            T3 = _mm_loadu_si128((__m128i*)(rpSrc + 32));\r\n            T4 = _mm_loadu_si128((__m128i*)(rpSrc + 48));\r\n            for (y = 0; y < bsy; y++) {\r\n                _mm_storeu_si128((__m128i*)(dst + 0), T1);\r\n                _mm_storeu_si128((__m128i*)(dst + 16), T2);\r\n                _mm_storeu_si128((__m128i*)(dst + 32), T3);\r\n                _mm_storeu_si128((__m128i*)(dst + 48), T4);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        default:\r\n            assert(0);\r\n            break;\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intra_pred_hor_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int y;\r\n    pel_t *rpSrc = src - 1;\r\n    __m128i T;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    switch (bsx) {\r\n        case 4:\r\n            for (y = 0; y < bsy; y++) {\r\n                M32(dst) = 0x01010101 * rpSrc[-y];\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        case 8:\r\n            for (y = 0; y < bsy; y++) {\r\n                M64(dst) = 0x0101010101010101 * rpSrc[-y];\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        case 16:\r\n            for (y = 0; y < bsy; y++) {\r\n                T = _mm_set1_epi8((char)rpSrc[-y]);\r\n                _mm_storeu_si128((__m128i*)(dst), T);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        case 32:\r\n            for (y = 0; y < bsy; y++) {\r\n                T = _mm_set1_epi8((char)rpSrc[-y]);\r\n                _mm_storeu_si128((__m128i*)(dst + 0), T);\r\n                _mm_storeu_si128((__m128i*)(dst + 16), T);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        case 64:\r\n 
           for (y = 0; y < bsy; y++) {\r\n                T = _mm_set1_epi8((char)rpSrc[-y]);\r\n                _mm_storeu_si128((__m128i*)(dst + 0), T);\r\n                _mm_storeu_si128((__m128i*)(dst + 16), T);\r\n                _mm_storeu_si128((__m128i*)(dst + 32), T);\r\n                _mm_storeu_si128((__m128i*)(dst + 48), T);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        default:\r\n            assert(0);\r\n            break;\r\n    }\r\n\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intra_pred_dc_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int avail_above = dir_mode >> 8;\r\n    int avail_left = dir_mode & 0xFF;\r\n    int dc_value;\r\n    int sum_above = 0;\r\n    int sum_left = 0;\r\n    int x, y;\r\n    pel_t *p_src;\r\n\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i S0;\r\n    __m128i p00, p10, p20, p30;\r\n\r\n    /* sum of left samples */\r\n    // for (y = 0; y < bsy; y++)  dc_value += p_src[-y];\r\n    p_src = src - bsy;\r\n    if (bsy == 4) {\r\n        sum_left += p_src[0] + p_src[1] + p_src[2] + p_src[3];\r\n    } else if (bsy == 8) {\r\n        S0 = _mm_loadu_si128((__m128i*)(p_src));\r\n        p00 = _mm_unpacklo_epi8(S0, zero);\r\n        p10 = _mm_srli_si128(p00, 8);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        sum_left += M128_U16(p00, 0) + M128_U16(p00, 1) + M128_U16(p00, 2) + M128_U16(p00, 3);\r\n    } else {\r\n        p30 = zero;\r\n        for (y = 0; y < bsy - 8; y += 16, p_src += 16) {\r\n            S0 = _mm_loadu_si128((__m128i*)(p_src));\r\n            p00 = _mm_unpacklo_epi8(S0, zero);\r\n            p10 = _mm_unpackhi_epi8(S0, zero);\r\n            p20 = _mm_add_epi16(p00, p10);\r\n            p30 = _mm_add_epi16(p30, p20);\r\n        }\r\n        p00 = _mm_srli_si128(p30, 8);\r\n        p00 = _mm_add_epi16(p30, p00);\r\n        sum_left += M128_U16(p00, 0) + 
M128_U16(p00, 1) + M128_U16(p00, 2) + M128_U16(p00, 3);\r\n    }\r\n\r\n    /* sum of above samples */\r\n    //for (x = 0; x < bsx; x++)  dc_value += p_src[x];\r\n    p_src = src + 1;\r\n    if (bsx == 4) {\r\n        sum_above += p_src[0] + p_src[1] + p_src[2] + p_src[3];\r\n    } else if (bsx == 8) {\r\n        S0 = _mm_loadu_si128((__m128i*)(p_src));\r\n        p00 = _mm_unpacklo_epi8(S0, zero);\r\n        p10 = _mm_srli_si128(p00, 8);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        sum_above += M128_U16(p00, 0) + M128_U16(p00, 1) + M128_U16(p00, 2) + M128_U16(p00, 3);\r\n    } else {\r\n        p30 = zero;\r\n        for (x = 0; x < bsx - 8; x += 16, p_src += 16) {\r\n            S0 = _mm_loadu_si128((__m128i*)(p_src));\r\n            p00 = _mm_unpacklo_epi8(S0, zero);\r\n            p10 = _mm_unpackhi_epi8(S0, zero);\r\n            p20 = _mm_add_epi16(p00, p10);\r\n            p30 = _mm_add_epi16(p30, p20);\r\n        }\r\n        p00 = _mm_srli_si128(p30, 8);\r\n        p00 = _mm_add_epi16(p30, p00);\r\n        sum_above += M128_U16(p00, 0) + M128_U16(p00, 1) + M128_U16(p00, 2) + M128_U16(p00, 3);\r\n    }\r\n\r\n    if (avail_left && avail_above) {\r\n        x = bsx + bsy;\r\n        dc_value = ((sum_above + sum_left + (x >> 1)) * (512 / x)) >> 9;\r\n    } else if (avail_left) {\r\n        dc_value = (sum_left + (bsy >> 1)) >> davs2_log2u(bsy);\r\n    } else if (avail_above) {\r\n        dc_value = (sum_above + (bsx >> 1)) >> davs2_log2u(bsx);\r\n    } else {\r\n        dc_value = g_dc_value;\r\n    }\r\n\r\n    p00 = _mm_set1_epi8((pel_t)dc_value);\r\n    for (y = 0; y < bsy; y++) {\r\n        if (bsx == 8) {\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n        } else if (bsx == 4) {\r\n            *(int*)(dst) = _mm_cvtsi128_si32(p00);\r\n        } else {\r\n            for (x = 0; x < bsx - 8; x += 16) {\r\n                _mm_storeu_si128((__m128i*)(dst + x), p00);\r\n            }\r\n        }\r\n        dst += i_dst;\r\n    
}\r\n\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_plane_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    pel_t  *rpSrc;\r\n    int iH = 0;\r\n    int iV = 0;\r\n    int iA, iB, iC;\r\n    int x, y;\r\n    int iW2 = bsx >> 1;\r\n    int iH2 = bsy >> 1;\r\n    int ib_mult[5] = { 13, 17, 5, 11, 23 };\r\n    int ib_shift[5] = { 7, 10, 11, 15, 19 };\r\n    int im_h = ib_mult[tab_log2[bsx] - 2];\r\n    int is_h = ib_shift[tab_log2[bsx] - 2];\r\n    int im_v = ib_mult[tab_log2[bsy] - 2];\r\n    int is_v = ib_shift[tab_log2[bsy] - 2];\r\n\r\n    int iTmp;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    rpSrc = src + iW2;\r\n    for (x = 1; x < iW2 + 1; x++) {\r\n        iH += x * (rpSrc[x] - rpSrc[-x]);\r\n    }\r\n\r\n    rpSrc = src - iH2;\r\n    for (y = 1; y < iH2 + 1; y++) {\r\n        iV += y * (rpSrc[-y] - rpSrc[y]);\r\n    }\r\n\r\n    iA = (src[-1 - (bsy - 1)] + src[1 + bsx - 1]) << 4;\r\n    iB = ((iH << 5) * im_h + (1 << (is_h - 1))) >> is_h;\r\n    iC = ((iV << 5) * im_v + (1 << (is_v - 1))) >> is_v;\r\n\r\n    iTmp = iA - (iH2 - 1) * iC - (iW2 - 1) * iB + 16;\r\n\r\n    __m128i TC, TB, TA, T_Start, T, D, D1;\r\n    TA = _mm_set1_epi16((int16_t)iTmp);\r\n    TB = _mm_set1_epi16((int16_t)iB);\r\n    TC = _mm_set1_epi16((int16_t)iC);\r\n\r\n    T_Start = _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0);\r\n    T_Start = _mm_mullo_epi16(TB, T_Start);\r\n    T_Start = _mm_add_epi16(T_Start, TA);\r\n\r\n    TB = _mm_mullo_epi16(TB, _mm_set1_epi16(8));\r\n\r\n    if (bsx == 4) {\r\n        for (y = 0; y < bsy; y++) {\r\n            D = _mm_srai_epi16(T_Start, 5);\r\n            D = _mm_packus_epi16(D, D);\r\n            // extract low 32 bits from the packed result , and put it into a integer . 
(Redundant operation?)\r\n            _mm_stream_si32((int *)dst, _mm_extract_epi32(D, 0));\r\n            T_Start = _mm_add_epi16(T_Start, TC);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 8) {\r\n        for (y = 0; y < bsy; y++) {\r\n            D = _mm_srai_epi16(T_Start, 5);\r\n            D = _mm_packus_epi16(D, D);\r\n            _mm_storel_epi64((__m128i*)dst, D);\r\n            T_Start = _mm_add_epi16(T_Start, TC);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (y = 0; y < bsy; y++) {\r\n            T = T_Start;\r\n            for (x = 0; x < bsx; x += 16) {\r\n                D = _mm_srai_epi16(T, 5);\r\n                T = _mm_add_epi16(T, TB);\r\n                D1 = _mm_srai_epi16(T, 5);\r\n                T = _mm_add_epi16(T, TB);\r\n                D = _mm_packus_epi16(D, D1);\r\n                _mm_storeu_si128((__m128i*)(dst + x), D);\r\n            }\r\n            T_Start = _mm_add_epi16(T_Start, TC);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_bilinear_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int x, y;\r\n    int ishift_x = tab_log2[bsx];\r\n    int ishift_y = tab_log2[bsy];\r\n    int ishift = DAVS2_MIN(ishift_x, ishift_y);\r\n    int ishift_xy = ishift_x + ishift_y + 1;\r\n    int offset = 1 << (ishift_x + ishift_y);\r\n    int a, b, c, w, val;\r\n    pel_t *p;\r\n    __m128i T, T1, T2, T3, C1, C2, ADD;\r\n    __m128i ZERO = _mm_setzero_si128();\r\n\r\n    /* TODO: why do these array sizes need the extra 32? Is it necessary? */\r\n    ALIGN32(itr_t pTop [MAX_CU_SIZE + 32]);\r\n    ALIGN32(itr_t pLeft[MAX_CU_SIZE + 32]);\r\n    ALIGN32(itr_t pT   [MAX_CU_SIZE + 32]);\r\n    ALIGN32(itr_t pL   [MAX_CU_SIZE + 32]);\r\n    ALIGN32(itr_t wy   [MAX_CU_SIZE + 32]);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    p = src + 1;\r\n    for (x = 0; x < bsx; x += 16) {\r\n        T = 
_mm_loadu_si128((__m128i*)(p + x));\r\n        T1 = _mm_unpacklo_epi8(T, ZERO);\r\n        T2 = _mm_unpackhi_epi8(T, ZERO);\r\n        _mm_store_si128((__m128i*)(pTop + x), T1);\r\n        _mm_store_si128((__m128i*)(pTop + x + 8), T2);\r\n    }\r\n    for (y = 0; y < bsy; y++) {\r\n        pLeft[y] = src[-1 - y];\r\n    }\r\n\r\n    a = pTop[bsx - 1];\r\n    b = pLeft[bsy - 1];\r\n\r\n    if (bsx == bsy) {\r\n        c = (a + b + 1) >> 1;\r\n    } else {\r\n        c = (((a << ishift_x) + (b << ishift_y)) * 13 + (1 << (ishift + 5))) >> (ishift + 6);\r\n    }\r\n\r\n    w = (c << 1) - a - b;\r\n\r\n    T = _mm_set1_epi16((int16_t)b);\r\n    for (x = 0; x < bsx; x += 8) {\r\n        T1 = _mm_load_si128((__m128i*)(pTop + x));\r\n        T2 = _mm_sub_epi16(T, T1);\r\n        T1 = _mm_slli_epi16(T1, ishift_y);\r\n        _mm_store_si128((__m128i*)(pT + x), T2);\r\n        _mm_store_si128((__m128i*)(pTop + x), T1);\r\n    }\r\n\r\n    T = _mm_set1_epi16((int16_t)a);\r\n    for (y = 0; y < bsy; y += 8) {\r\n        T1 = _mm_load_si128((__m128i*)(pLeft + y));\r\n        T2 = _mm_sub_epi16(T, T1);\r\n        T1 = _mm_slli_epi16(T1, ishift_x);\r\n        _mm_store_si128((__m128i*)(pL + y), T2);\r\n        _mm_store_si128((__m128i*)(pLeft + y), T1);\r\n    }\r\n\r\n    T = _mm_set1_epi16((int16_t)w);\r\n    T = _mm_mullo_epi16(T, _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0));\r\n    T1 = _mm_set1_epi16((int16_t)(8 * w));\r\n\r\n    for (y = 0; y < bsy; y += 8) {\r\n        _mm_store_si128((__m128i*)(wy + y), T);\r\n        T = _mm_add_epi16(T, T1);\r\n    }\r\n\r\n    C1 = _mm_set_epi32(3, 2, 1, 0);\r\n    C2 = _mm_set1_epi32(4);\r\n\r\n    if (bsx == 4) {\r\n        __m128i pTT = _mm_loadl_epi64((__m128i*)pT);\r\n        T = _mm_loadl_epi64((__m128i*)pTop);\r\n        for (y = 0; y < bsy; y++) {\r\n            int add = (pL[y] << ishift_y) + wy[y];\r\n            ADD = _mm_set1_epi32(add);\r\n            ADD = _mm_mullo_epi32(C1, ADD);\r\n\r\n            val = (pLeft[y] << 
ishift_y) + offset + (pL[y] << ishift_y);\r\n\r\n            ADD = _mm_add_epi32(ADD, _mm_set1_epi32(val));\r\n            T = _mm_add_epi16(T, pTT);\r\n\r\n            T1 = _mm_cvtepi16_epi32(T);\r\n            T1 = _mm_slli_epi32(T1, ishift_x);\r\n\r\n            T1 = _mm_add_epi32(T1, ADD);\r\n            T1 = _mm_srai_epi32(T1, ishift_xy);\r\n\r\n            T1 = _mm_packus_epi32(T1, T1);\r\n            T1 = _mm_packus_epi16(T1, T1);\r\n\r\n            M32(dst) = _mm_cvtsi128_si32(T1);\r\n\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 8) {\r\n        __m128i pTT = _mm_load_si128((__m128i*)pT);\r\n        T = _mm_load_si128((__m128i*)pTop);\r\n        for (y = 0; y < bsy; y++) {\r\n            int add = (pL[y] << ishift_y) + wy[y];\r\n            ADD = _mm_set1_epi32(add);\r\n            T3 = _mm_mullo_epi32(C2, ADD);\r\n            ADD = _mm_mullo_epi32(C1, ADD);\r\n\r\n            val = (pLeft[y] << ishift_y) + offset + (pL[y] << ishift_y);\r\n\r\n            ADD = _mm_add_epi32(ADD, _mm_set1_epi32(val));\r\n\r\n            T = _mm_add_epi16(T, pTT);\r\n\r\n            T1 = _mm_cvtepi16_epi32(T);\r\n            T2 = _mm_cvtepi16_epi32(_mm_srli_si128(T, 8));\r\n            T1 = _mm_slli_epi32(T1, ishift_x);\r\n            T2 = _mm_slli_epi32(T2, ishift_x);\r\n\r\n            T1 = _mm_add_epi32(T1, ADD);\r\n            T1 = _mm_srai_epi32(T1, ishift_xy);\r\n            ADD = _mm_add_epi32(ADD, T3);\r\n\r\n            T2 = _mm_add_epi32(T2, ADD);\r\n            T2 = _mm_srai_epi32(T2, ishift_xy);\r\n            ADD = _mm_add_epi32(ADD, T3);\r\n\r\n            T1 = _mm_packus_epi32(T1, T2);\r\n            T1 = _mm_packus_epi16(T1, T1);\r\n\r\n            _mm_storel_epi64((__m128i*)dst, T1);\r\n\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m128i TT[16];\r\n        __m128i PTT[16];\r\n        for (x = 0; x < bsx; x += 8) {\r\n            int idx = x >> 2;\r\n            __m128i M0 = _mm_load_si128((__m128i*)(pTop + 
x));\r\n            __m128i M1 = _mm_load_si128((__m128i*)(pT + x));\r\n            TT[idx] = _mm_unpacklo_epi16(M0, ZERO);\r\n            TT[idx + 1] = _mm_unpackhi_epi16(M0, ZERO);\r\n            PTT[idx] = _mm_cvtepi16_epi32(M1);\r\n            PTT[idx + 1] = _mm_cvtepi16_epi32(_mm_srli_si128(M1, 8));\r\n        }\r\n        for (y = 0; y < bsy; y++) {\r\n            int add = (pL[y] << ishift_y) + wy[y];\r\n            ADD = _mm_set1_epi32(add);\r\n            T3 = _mm_mullo_epi32(C2, ADD);\r\n            ADD = _mm_mullo_epi32(C1, ADD);\r\n\r\n            val = (pLeft[y] << ishift_y) + offset + (pL[y] << ishift_y);\r\n\r\n            ADD = _mm_add_epi32(ADD, _mm_set1_epi32(val));\r\n\r\n            for (x = 0; x < bsx; x += 8) {\r\n                int idx = x >> 2;\r\n                TT[idx] = _mm_add_epi32(TT[idx], PTT[idx]);\r\n                TT[idx + 1] = _mm_add_epi32(TT[idx + 1], PTT[idx + 1]);\r\n\r\n                T1 = _mm_slli_epi32(TT[idx], ishift_x);\r\n                T2 = _mm_slli_epi32(TT[idx + 1], ishift_x);\r\n\r\n                T1 = _mm_add_epi32(T1, ADD);\r\n                T1 = _mm_srai_epi32(T1, ishift_xy);\r\n                ADD = _mm_add_epi32(ADD, T3);\r\n\r\n                T2 = _mm_add_epi32(T2, ADD);\r\n                T2 = _mm_srai_epi32(T2, ishift_xy);\r\n                ADD = _mm_add_epi32(ADD, T3);\r\n\r\n                T1 = _mm_packus_epi32(T1, T2);\r\n                T1 = _mm_packus_epi16(T1, T1);\r\n\r\n                _mm_storel_epi64((__m128i*)(dst + x), T1);\r\n            }\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intra_pred_ang_x_3_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i coeff3 = _mm_set1_epi16(3);\r\n    __m128i coeff4 = _mm_set1_epi16(4);\r\n    __m128i coeff5 
= _mm_set1_epi16(5);\r\n    __m128i coeff7 = _mm_set1_epi16(7);\r\n    __m128i coeff8 = _mm_set1_epi16(8);\r\n\r\n    pel_t *dst1 = dst;\r\n    pel_t *dst2 = dst1 + i_dst;\r\n    pel_t *dst3 = dst2 + i_dst;\r\n    pel_t *dst4 = dst3 + i_dst;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if ((bsy > 4) && (bsx > 8)) {\r\n        ALIGN16(pel_t first_line[(64 + 176 + 16) << 2]);\r\n        int line_size = bsx + (((bsy - 4) * 11) >> 2);\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        int iW2 = bsx * 2 - 1;\r\n        int real_size = DAVS2_MIN(line_size, iW2 + 1);\r\n#endif\r\n        int aligned_line_size = 64 + 176 + 16;\r\n        int i;\r\n        pel_t *pfirst[4];\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        pel_t *src_org = src;\r\n#endif\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n        for (i = 0; i < line_size - 8; i += 16, src += 16) {\r\n#else\r\n        for (i = 0; i < real_size - 8; i += 16, src += 16) {\r\n#endif\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n\r\n            __m128i SS2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i L2 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L3 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L4 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L5 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L6 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L7 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L8 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = 
_mm_srli_si128(SS2, 1);\r\n            __m128i L9 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L10 = _mm_unpacklo_epi8(SS2, zero);\r\n            __m128i H2 = L10;\r\n\r\n            __m128i SS11 = _mm_loadu_si128((__m128i*)(src + 11));\r\n            __m128i L11 = _mm_unpacklo_epi8(SS11, zero);\r\n            __m128i H3 = L11;\r\n            SS11 = _mm_srli_si128(SS11, 1);\r\n            __m128i L12 = _mm_unpacklo_epi8(SS11, zero);\r\n            __m128i H4 = L12;\r\n            SS11 = _mm_srli_si128(SS11, 1);\r\n            __m128i L13 = _mm_unpacklo_epi8(SS11, zero);\r\n            __m128i H5 = L13;\r\n\r\n            SS11 = _mm_srli_si128(SS11, 1);\r\n            __m128i H6 = _mm_unpacklo_epi8(SS11, zero);\r\n            SS11 = _mm_srli_si128(SS11, 1);\r\n            __m128i H7 = _mm_unpacklo_epi8(SS11, zero);\r\n            SS11 = _mm_srli_si128(SS11, 1);\r\n            __m128i H8 = _mm_unpacklo_epi8(SS11, zero);\r\n            SS11 = _mm_srli_si128(SS11, 1);\r\n            __m128i H9 = _mm_unpacklo_epi8(SS11, zero);\r\n            SS11 = _mm_srli_si128(SS11, 1);\r\n            __m128i H10 = _mm_unpacklo_epi8(SS11, zero);\r\n            SS11 = _mm_srli_si128(SS11, 1);\r\n            __m128i H11 = _mm_unpacklo_epi8(SS11, zero);\r\n\r\n            __m128i SS20 = _mm_loadu_si128((__m128i*)(src + 20));\r\n            __m128i H12 = _mm_unpacklo_epi8(SS20, zero);\r\n            SS20 = _mm_srli_si128(SS20, 1);\r\n            __m128i H13 = _mm_unpacklo_epi8(SS20, zero);\r\n\r\n            p00 = _mm_add_epi16(L2, coeff8);\r\n            p10 = _mm_mullo_epi16(L3, coeff5);\r\n            p20 = _mm_mullo_epi16(L4, coeff7);\r\n            p30 = _mm_mullo_epi16(L5, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_add_epi16(H2, coeff8);\r\n            
p11 = _mm_mullo_epi16(H3, coeff5);\r\n            p21 = _mm_mullo_epi16(H4, coeff7);\r\n            p31 = _mm_mullo_epi16(H5, coeff3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L5, L8);\r\n            p10 = _mm_add_epi16(L6, L7);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H5, H8);\r\n            p11 = _mm_add_epi16(H6, H7);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[1][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L8, coeff3);\r\n            p10 = _mm_mullo_epi16(L9, coeff7);\r\n            p20 = _mm_mullo_epi16(L10, coeff5);\r\n            p30 = _mm_add_epi16(L11, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H8, coeff3);\r\n            p11 = _mm_mullo_epi16(H9, coeff7);\r\n            p21 = _mm_mullo_epi16(H10, coeff5);\r\n            p31 = _mm_add_epi16(H11, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[2][i], 
p00);\r\n\r\n            p00 = _mm_add_epi16(L11, L13);\r\n            p10 = _mm_mullo_epi16(L12, coeff2);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_add_epi16(H11, H13);\r\n            p11 = _mm_mullo_epi16(H12, coeff2);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[3][i], p00);\r\n        }\r\n#if BUGFIX_PREDICTION_INTRA\r\n        if (i < line_size) {\r\n#else\r\n        if (i < real_size) {\r\n#endif\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i SS2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i L2 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L3 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L4 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L5 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L6 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L7 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L8 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L9 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L10 = _mm_unpacklo_epi8(SS2, zero);\r\n\r\n            __m128i SS11 = _mm_loadu_si128((__m128i*)(src + 11));\r\n            __m128i L11 = _mm_unpacklo_epi8(SS11, zero);\r\n            SS11 = _mm_srli_si128(SS11, 1);\r\n            __m128i L12 = _mm_unpacklo_epi8(SS11, zero);\r\n            SS11 = 
_mm_srli_si128(SS11, 1);\r\n            __m128i L13 = _mm_unpacklo_epi8(SS11, zero);\r\n\r\n            p00 = _mm_add_epi16(L2, coeff8);\r\n            p10 = _mm_mullo_epi16(L3, coeff5);\r\n            p20 = _mm_mullo_epi16(L4, coeff7);\r\n            p30 = _mm_mullo_epi16(L5, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L5, L8);\r\n            p10 = _mm_add_epi16(L6, L7);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[1][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L8, coeff3);\r\n            p10 = _mm_mullo_epi16(L9, coeff7);\r\n            p20 = _mm_mullo_epi16(L10, coeff5);\r\n            p30 = _mm_add_epi16(L11, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L11, L13);\r\n            p10 = _mm_mullo_epi16(L12, coeff2);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[3][i], p00);\r\n        }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        // padding\r\n        if (((real_size - 1) + 11) > iW2) {\r\n            src = src_org + 
iW2;\r\n            // No need to compute the pad values from src: if a pad is invalid, the loops below (\"for (i = start1; i < line_size; i += 16)\") never use it; otherwise it is valid.\r\n            __m128i pad1 = _mm_set1_epi8(pfirst[0][iW2 - 2]);\r\n            __m128i pad2 = _mm_set1_epi8(pfirst[1][iW2 - 5]);\r\n            __m128i pad3 = _mm_set1_epi8(pfirst[2][iW2 - 8]);\r\n            __m128i pad4 = _mm_set1_epi8(pfirst[3][iW2 - 11]);\r\n\r\n            int start1 = iW2 - 1;\r\n            int start2 = iW2 - 4;\r\n            int start3 = iW2 - 7;\r\n            int start4 = iW2 - 10;\r\n            for (i = start1; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[0][i], pad1);\r\n            }\r\n            for (i = start2; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[1][i], pad2);\r\n            }\r\n            for (i = start3; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[2][i], pad3);\r\n            }\r\n            for (i = start4; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[3][i], pad4);\r\n            }\r\n        }\r\n#endif\r\n\r\n        bsy >>= 2;\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst1, pfirst[0] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst2, pfirst[1] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst3, pfirst[2] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst4, pfirst[3] + i * 11, bsx * sizeof(pel_t));\r\n            dst1 = dst4 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n        }\r\n    } else if (bsx == 16) {\r\n        __m128i p00, p10, p20, p30;\r\n        __m128i p01, p11, p21, p31;\r\n\r\n        __m128i SS2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n        __m128i L2 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L3 = _mm_unpacklo_epi8(SS2, 
zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L4 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L5 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L6 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L7 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L8 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L9 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L10 = _mm_unpacklo_epi8(SS2, zero);\r\n        __m128i H2 = L10;\r\n\r\n        __m128i SS11 = _mm_loadu_si128((__m128i*)(src + 11));\r\n        __m128i L11 = _mm_unpacklo_epi8(SS11, zero);\r\n        __m128i H3 = L11;\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i L12 = _mm_unpacklo_epi8(SS11, zero);\r\n        __m128i H4 = L12;\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i L13 = _mm_unpacklo_epi8(SS11, zero);\r\n        __m128i H5 = L13;\r\n\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i H6 = _mm_unpacklo_epi8(SS11, zero);\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i H7 = _mm_unpacklo_epi8(SS11, zero);\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i H8 = _mm_unpacklo_epi8(SS11, zero);\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i H9 = _mm_unpacklo_epi8(SS11, zero);\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i H10 = _mm_unpacklo_epi8(SS11, zero);\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i H11 = _mm_unpacklo_epi8(SS11, zero);\r\n\r\n        __m128i SS20 = _mm_loadu_si128((__m128i*)(src + 20));\r\n        __m128i H12 = _mm_unpacklo_epi8(SS20, zero);\r\n        SS20 = _mm_srli_si128(SS20, 1);\r\n        __m128i H13 = _mm_unpacklo_epi8(SS20, zero);\r\n\r\n        p00 = _mm_add_epi16(L2, coeff8);\r\n        p10 = 
_mm_mullo_epi16(L3, coeff5);\r\n        p20 = _mm_mullo_epi16(L4, coeff7);\r\n        p30 = _mm_mullo_epi16(L5, coeff3);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 4);\r\n\r\n        p01 = _mm_add_epi16(H2, coeff8);\r\n        p11 = _mm_mullo_epi16(H3, coeff5);\r\n        p21 = _mm_mullo_epi16(H4, coeff7);\r\n        p31 = _mm_mullo_epi16(H5, coeff3);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_add_epi16(p01, p21);\r\n        p01 = _mm_add_epi16(p01, p31);\r\n        p01 = _mm_srli_epi16(p01, 4);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        _mm_store_si128((__m128i*)dst1, p00);\r\n\r\n        p00 = _mm_add_epi16(L5, L8);\r\n        p10 = _mm_add_epi16(L6, L7);\r\n        p10 = _mm_mullo_epi16(p10, coeff3);\r\n        p00 = _mm_add_epi16(p00, coeff4);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n\r\n        p01 = _mm_add_epi16(H5, H8);\r\n        p11 = _mm_add_epi16(H6, H7);\r\n        p11 = _mm_mullo_epi16(p11, coeff3);\r\n        p01 = _mm_add_epi16(p01, coeff4);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_srli_epi16(p01, 3);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        _mm_store_si128((__m128i*)dst2, p00);\r\n\r\n        p00 = _mm_mullo_epi16(L8, coeff3);\r\n        p10 = _mm_mullo_epi16(L9, coeff7);\r\n        p20 = _mm_mullo_epi16(L10, coeff5);\r\n        p30 = _mm_add_epi16(L11, coeff8);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 4);\r\n\r\n        p01 = _mm_mullo_epi16(H8, coeff3);\r\n        p11 = _mm_mullo_epi16(H9, coeff7);\r\n        p21 = _mm_mullo_epi16(H10, coeff5);\r\n        p31 = _mm_add_epi16(H11, coeff8);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_add_epi16(p01, p21);\r\n        p01 = 
_mm_add_epi16(p01, p31);\r\n        p01 = _mm_srli_epi16(p01, 4);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        _mm_store_si128((__m128i*)dst3, p00);\r\n\r\n        p00 = _mm_add_epi16(L11, L13);\r\n        p10 = _mm_mullo_epi16(L12, coeff2);\r\n        p00 = _mm_add_epi16(p00, coeff2);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_srli_epi16(p00, 2);\r\n\r\n        p01 = _mm_add_epi16(H11, H13);\r\n        p11 = _mm_mullo_epi16(H12, coeff2);\r\n        p01 = _mm_add_epi16(p01, coeff2);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_srli_epi16(p01, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        _mm_store_si128((__m128i*)dst4, p00);\r\n    } else if (bsx == 8) {\r\n        __m128i p00, p10, p20, p30;\r\n\r\n        __m128i SS2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n        __m128i L2 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L3 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L4 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L5 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L6 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L7 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L8 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L9 = _mm_unpacklo_epi8(SS2, zero);\r\n        SS2 = _mm_srli_si128(SS2, 1);\r\n        __m128i L10 = _mm_unpacklo_epi8(SS2, zero);\r\n\r\n        __m128i SS11 = _mm_loadu_si128((__m128i*)(src + 11));\r\n        __m128i L11 = _mm_unpacklo_epi8(SS11, zero);\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i L12 = _mm_unpacklo_epi8(SS11, zero);\r\n        SS11 = _mm_srli_si128(SS11, 1);\r\n        __m128i L13 = _mm_unpacklo_epi8(SS11, zero);\r\n\r\n        p00 = 
_mm_add_epi16(L2, coeff8);\r\n        p10 = _mm_mullo_epi16(L3, coeff5);\r\n        p20 = _mm_mullo_epi16(L4, coeff7);\r\n        p30 = _mm_mullo_epi16(L5, coeff3);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 4);\r\n\r\n        p00 = _mm_packus_epi16(p00, p00);\r\n        _mm_storel_epi64((__m128i*)dst1, p00);\r\n\r\n        p00 = _mm_add_epi16(L5, L8);\r\n        p10 = _mm_add_epi16(L6, L7);\r\n        p10 = _mm_mullo_epi16(p10, coeff3);\r\n        p00 = _mm_add_epi16(p00, coeff4);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n\r\n        p00 = _mm_packus_epi16(p00, p00);\r\n        _mm_storel_epi64((__m128i*)dst2, p00);\r\n\r\n        p00 = _mm_mullo_epi16(L8, coeff3);\r\n        p10 = _mm_mullo_epi16(L9, coeff7);\r\n        p20 = _mm_mullo_epi16(L10, coeff5);\r\n        p30 = _mm_add_epi16(L11, coeff8);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 4);\r\n\r\n        p00 = _mm_packus_epi16(p00, p00);\r\n        _mm_storel_epi64((__m128i*)dst3, p00);\r\n\r\n        p00 = _mm_add_epi16(L11, L13);\r\n        p10 = _mm_mullo_epi16(L12, coeff2);\r\n        p00 = _mm_add_epi16(p00, coeff2);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_srli_epi16(p00, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p00);\r\n        _mm_storel_epi64((__m128i*)dst4, p00);\r\n#if BUGFIX_PREDICTION_INTRA\r\n        __m128i pad1 = _mm_set1_epi8(src[16]);\r\n#else\r\n        dst4[5] = dst4[4];\r\n        dst4[6] = dst4[4];\r\n        dst4[7] = dst4[4];\r\n\r\n        __m128i pad1 = _mm_set1_epi8((pel_t)((src[15] + 5 * src[16] + 7 * src[17] + 3 * src[18] + 8) >> 4));\r\n        __m128i pad2 = _mm_set1_epi8((pel_t)((src[15] + 3 * src[16] + 3 * src[17] + 1 * src[18] + 4) >> 3));\r\n        __m128i pad3 = 
_mm_set1_epi8(dst3[7]);\r\n        __m128i pad4 = _mm_set1_epi8(dst4[4]);\r\n#endif\r\n\r\n        dst1 = dst4 + i_dst;\r\n        dst2 = dst1 + i_dst;\r\n        dst3 = dst2 + i_dst;\r\n        dst4 = dst3 + i_dst;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n        _mm_storel_epi64((__m128i*)dst1, pad1);\r\n        _mm_storel_epi64((__m128i*)dst2, pad1);\r\n        _mm_storel_epi64((__m128i*)dst3, pad1);\r\n        _mm_storel_epi64((__m128i*)dst4, pad1);\r\n#else\r\n        _mm_storel_epi64((__m128i*)dst1, pad1);\r\n        _mm_storel_epi64((__m128i*)dst2, pad2);\r\n        _mm_storel_epi64((__m128i*)dst3, pad3);\r\n        _mm_storel_epi64((__m128i*)dst4, pad4);\r\n#endif\r\n        dst1[0] = (pel_t)((src[13] + 5 * src[14] + 7 * src[15] + 3 * src[16] + 8) >> 4);\r\n        dst1[1] = (pel_t)((src[14] + 5 * src[15] + 7 * src[16] + 3 * src[17] + 8) >> 4);\r\n        dst1[2] = (pel_t)((src[15] + 5 * src[16] + 7 * src[17] + 3 * src[18] + 8) >> 4);\r\n\r\n        if (bsy == 32) {\r\n            for (int i = 0; i < 6; i++) {\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n                _mm_storel_epi64((__m128i*)dst1, pad1);\r\n                _mm_storel_epi64((__m128i*)dst2, pad1);\r\n                _mm_storel_epi64((__m128i*)dst3, pad1);\r\n                _mm_storel_epi64((__m128i*)dst4, pad1);\r\n#else\r\n                _mm_storel_epi64((__m128i*)dst1, pad1);\r\n                _mm_storel_epi64((__m128i*)dst2, pad2);\r\n                _mm_storel_epi64((__m128i*)dst3, pad3);\r\n                _mm_storel_epi64((__m128i*)dst4, pad4);\r\n#endif\r\n            }\r\n        }\r\n    } else {\r\n        if (bsy == 16) {\r\n            __m128i p00, p10, p20, p30;\r\n\r\n            __m128i SS2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i L2 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = 
_mm_srli_si128(SS2, 1);\r\n            __m128i L3 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L4 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L5 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L6 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L7 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L8 = _mm_unpacklo_epi8(SS2, zero);\r\n\r\n            p00 = _mm_add_epi16(L2, coeff8);\r\n            p10 = _mm_mullo_epi16(L3, coeff5);\r\n            p20 = _mm_mullo_epi16(L4, coeff7);\r\n            p30 = _mm_mullo_epi16(L5, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            *((int*)(dst1)) = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_add_epi16(L5, L8);\r\n            p10 = _mm_add_epi16(L6, L7);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            *((int*)(dst2)) = _mm_cvtsi128_si32(p00);\r\n#if BUGFIX_PREDICTION_INTRA\r\n            __m128i pad1 = _mm_set1_epi8(src[8]);\r\n            *((int*)(dst3)) = _mm_cvtsi128_si32(pad1);\r\n            *((int*)(dst4)) = _mm_cvtsi128_si32(pad1);\r\n#else\r\n            dst2[3] = dst2[2];\r\n\r\n            __m128i pad1 = _mm_set1_epi8((pel_t)((src[7] + 5 * src[8] + 7 * src[9] + 3 * src[10] + 8) >> 4));\r\n            __m128i pad2 = _mm_set1_epi8(dst2[2]);\r\n            __m128i pad3 = _mm_set1_epi8((pel_t)((3 * src[7] + 7 * src[8] + 5 * src[9] + src[10] 
+ 8) >> 4));\r\n            __m128i pad4 = _mm_set1_epi8((pel_t)((src[7] + 2 * src[8] + src[9] + 2) >> 2));\r\n\r\n            *((int*)(dst3)) = _mm_cvtsi128_si32(pad3);\r\n            *((int*)(dst4)) = _mm_cvtsi128_si32(pad4);\r\n#endif\r\n\r\n            for (int i = 0; i < 3; i++) {\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n                *((int*)(dst1)) = _mm_cvtsi128_si32(pad1);\r\n                *((int*)(dst2)) = _mm_cvtsi128_si32(pad1);\r\n                *((int*)(dst3)) = _mm_cvtsi128_si32(pad1);\r\n                *((int*)(dst4)) = _mm_cvtsi128_si32(pad1);\r\n#else\r\n                *((int*)(dst1)) = _mm_cvtsi128_si32(pad1);\r\n                *((int*)(dst2)) = _mm_cvtsi128_si32(pad2);\r\n                *((int*)(dst3)) = _mm_cvtsi128_si32(pad3);\r\n                *((int*)(dst4)) = _mm_cvtsi128_si32(pad4);\r\n#endif\r\n            }\r\n        } else {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i SS2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i L2 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L3 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L4 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L5 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L6 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L7 = _mm_unpacklo_epi8(SS2, zero);\r\n            SS2 = _mm_srli_si128(SS2, 1);\r\n            __m128i L8 = _mm_unpacklo_epi8(SS2, zero);\r\n\r\n            p00 = _mm_add_epi16(L2, coeff8);\r\n            p10 = _mm_mullo_epi16(L3, coeff5);\r\n            p20 = _mm_mullo_epi16(L4, coeff7);\r\n            p30 = 
_mm_mullo_epi16(L5, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            *((int*)(dst1)) = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_add_epi16(L5, L8);\r\n            p10 = _mm_add_epi16(L6, L7);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            *((int*)(dst2)) = _mm_cvtsi128_si32(p00);\r\n#if BUGFIX_PREDICTION_INTRA\r\n            __m128i pad1 = _mm_set1_epi8(src[8]);\r\n            *((int*)(dst3)) = _mm_cvtsi128_si32(pad1);\r\n            *((int*)(dst4)) = _mm_cvtsi128_si32(pad1);\r\n#else\r\n            dst2[3] = dst2[2];\r\n\r\n            dst3[0] = (pel_t)((3 * src[7] + 7 * src[8] + 5 * src[9] + src[10] + 8) >> 4);\r\n            dst3[1] = dst3[0];\r\n            dst3[2] = dst3[0];\r\n            dst3[3] = dst3[0];\r\n\r\n            dst4[0] = (pel_t)((src[7] + 2 * src[8] + src[9] + 2) >> 2);\r\n            dst4[1] = dst4[0];\r\n            dst4[2] = dst4[0];\r\n            dst4[3] = dst4[0];\r\n#endif\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intra_pred_ang_x_4_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n\r\n    ALIGN16(pel_t first_line[64 + 128]);\r\n    int line_size = bsx + ((bsy - 1) << 1);\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = DAVS2_MIN(line_size, bsx * 2 - 2);\r\n#endif\r\n    int iHeight2 = bsy << 1;\r\n    int i;\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i offset = _mm_set1_epi16(2);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    src += 3;\r\n#if 
BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 8; i += 16, src += 16) {\r\n#else\r\n    for (i = 0; i < real_size - 8; i += 16, src += 16) {\r\n#endif\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n        __m128i sum1 = _mm_add_epi16(L0, L1);\r\n        __m128i sum2 = _mm_add_epi16(L1, L2);\r\n        __m128i sum3 = _mm_add_epi16(H0, H1);\r\n        __m128i sum4 = _mm_add_epi16(H1, H2);\r\n\r\n        sum1 = _mm_add_epi16(sum1, sum2);\r\n        sum3 = _mm_add_epi16(sum3, sum4);\r\n\r\n        sum1 = _mm_add_epi16(sum1, offset);\r\n        sum3 = _mm_add_epi16(sum3, offset);\r\n\r\n        sum1 = _mm_srli_epi16(sum1, 2);\r\n        sum3 = _mm_srli_epi16(sum3, 2);\r\n\r\n        sum1 = _mm_packus_epi16(sum1, sum3);\r\n\r\n        _mm_store_si128((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i sum1 = _mm_add_epi16(L0, L1);\r\n        __m128i sum2 = _mm_add_epi16(L1, L2);\r\n\r\n        sum1 = _mm_add_epi16(sum1, sum2);\r\n        sum1 = _mm_add_epi16(sum1, offset);\r\n        sum1 = _mm_srli_epi16(sum1, 2);\r\n\r\n        sum1 = 
_mm_packus_epi16(sum1, sum1);\r\n        _mm_storel_epi64((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    for (i = real_size; i < line_size; i += 16) {\r\n        __m128i pad = _mm_set1_epi8(first_line[real_size - 1]);\r\n        _mm_storeu_si128((__m128i*)&first_line[i], pad);\r\n    }\r\n#endif\r\n\r\n\r\n    if (bsx == bsy || bsx > 16) {\r\n        for (i = 0; i < iHeight2; i += 2) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16) {\r\n        pel_t *dst1 = dst;\r\n        __m128i M = _mm_loadu_si128((__m128i*)&first_line[0]);\r\n        _mm_storel_epi64((__m128i*)dst, M);\r\n        dst += i_dst;\r\n        M = _mm_srli_si128(M, 2);\r\n        _mm_storel_epi64((__m128i*)dst, M);\r\n        dst += i_dst;\r\n        M = _mm_srli_si128(M, 2);\r\n        _mm_storel_epi64((__m128i*)dst, M);\r\n        dst += i_dst;\r\n        M = _mm_srli_si128(M, 2);\r\n        _mm_storel_epi64((__m128i*)dst, M);\r\n        dst = dst1 + 8;\r\n        M = _mm_loadu_si128((__m128i*)&first_line[8]);\r\n        _mm_storel_epi64((__m128i*)dst, M);\r\n        dst += i_dst;\r\n        M = _mm_srli_si128(M, 2);\r\n        _mm_storel_epi64((__m128i*)dst, M);\r\n        dst += i_dst;\r\n        M = _mm_srli_si128(M, 2);\r\n        _mm_storel_epi64((__m128i*)dst, M);\r\n        dst += i_dst;\r\n        M = _mm_srli_si128(M, 2);\r\n        _mm_storel_epi64((__m128i*)dst, M);\r\n    } else if (bsx == 8) {\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = 
_mm_srli_si128(M, 2);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid intra_pred_ang_x_5_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i coeff3 = _mm_set1_epi16(3);\r\n    __m128i coeff4 = _mm_set1_epi16(4);\r\n    __m128i coeff5 = _mm_set1_epi16(5);\r\n    __m128i coeff7 = _mm_set1_epi16(7);\r\n    __m128i coeff8 = _mm_set1_epi16(8);\r\n    __m128i coeff9 = _mm_set1_epi16(9);\r\n    __m128i coeff11 = _mm_set1_epi16(11);\r\n    __m128i coeff13 = _mm_set1_epi16(13);\r\n    __m128i coeff15 = _mm_set1_epi16(15);\r\n    __m128i coeff16 = _mm_set1_epi16(16);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    int i;\r\n    if (((bsy > 4) && (bsx > 8))) {\r\n        ALIGN16(pel_t first_line[(64 + 80 + 16) << 3]);\r\n        int line_size = bsx + ((bsy - 8) >> 3) * 11;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        int iW2 = bsx * 2 - 1;\r\n        int real_size = DAVS2_MIN(line_size, iW2 + 1);\r\n#endif\r\n        int aligned_line_size = (((line_size + 15) >> 4) << 4) + 16;\r\n        pel_t *pfirst[8];\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        pel_t *src_org = src;\r\n#endif\r\n\r\n   
     pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n        pfirst[4] = pfirst[3] + aligned_line_size;\r\n        pfirst[5] = pfirst[4] + aligned_line_size;\r\n        pfirst[6] = pfirst[5] + aligned_line_size;\r\n        pfirst[7] = pfirst[6] + aligned_line_size;\r\n\r\n        __m128i p00, p10, p20, p30;\r\n        __m128i p01, p11, p21, p31;\r\n#if BUGFIX_PREDICTION_INTRA\r\n        for (i = 0; i < line_size - 8; i += 16, src += 16) {\r\n#else\r\n        for (i = 0; i < real_size - 8; i += 16, src += 16) {\r\n#endif\r\n            __m128i SS1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i L1 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L2 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L3 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L4 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L5 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L6 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L7 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L8 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L9 = _mm_unpacklo_epi8(SS1, zero);\r\n            __m128i H1 = L9;\r\n\r\n            __m128i SS10 = 
_mm_loadu_si128((__m128i*)(src + 10));\r\n            __m128i L10 = _mm_unpacklo_epi8(SS10, zero);\r\n            __m128i H2 = L10;\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i L11 = _mm_unpacklo_epi8(SS10, zero);\r\n            __m128i H3 = L11;\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i L12 = _mm_unpacklo_epi8(SS10, zero);\r\n            __m128i H4 = L12;\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i L13 = _mm_unpacklo_epi8(SS10, zero);\r\n            __m128i H5 = L13;\r\n\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i H6 = _mm_unpacklo_epi8(SS10, zero);\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i H7 = _mm_unpacklo_epi8(SS10, zero);\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i H8 = _mm_unpacklo_epi8(SS10, zero);\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i H9 = _mm_unpacklo_epi8(SS10, zero);\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i H10 = _mm_unpacklo_epi8(SS10, zero);\r\n\r\n            __m128i SS19 = _mm_loadu_si128((__m128i*)(src + 19));\r\n            __m128i H11 = _mm_unpacklo_epi8(SS19, zero);\r\n            SS19 = _mm_srli_si128(SS19, 1);\r\n            __m128i H12 = _mm_unpacklo_epi8(SS19, zero);\r\n            SS19 = _mm_srli_si128(SS19, 1);\r\n            __m128i H13 = _mm_unpacklo_epi8(SS19, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L1, coeff5);\r\n            p10 = _mm_mullo_epi16(L2, coeff13);\r\n            p20 = _mm_mullo_epi16(L3, coeff11);\r\n            p30 = _mm_mullo_epi16(L4, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H1, coeff5);\r\n            p11 = _mm_mullo_epi16(H2, coeff13);\r\n            p21 = 
_mm_mullo_epi16(H3, coeff11);\r\n            p31 = _mm_mullo_epi16(H4, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[0][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L3, coeff5);\r\n            p20 = _mm_mullo_epi16(L4, coeff7);\r\n            p30 = _mm_mullo_epi16(L5, coeff3);\r\n            p00 = _mm_add_epi16(L2, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H3, coeff5);\r\n            p21 = _mm_mullo_epi16(H4, coeff7);\r\n            p31 = _mm_mullo_epi16(H5, coeff3);\r\n            p01 = _mm_add_epi16(H2, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[1][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L4, coeff7);\r\n            p10 = _mm_mullo_epi16(L5, coeff15);\r\n            p20 = _mm_mullo_epi16(L6, coeff9);\r\n            p30 = _mm_add_epi16(L7, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H4, coeff7);\r\n            p11 = _mm_mullo_epi16(H5, coeff15);\r\n            p21 = _mm_mullo_epi16(H6, coeff9);\r\n            p31 = _mm_add_epi16(H7, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n 
           p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L5, L8);\r\n            p10 = _mm_add_epi16(L6, L7);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H5, H8);\r\n            p11 = _mm_add_epi16(H6, H7);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[3][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L6, coeff16);\r\n            p10 = _mm_mullo_epi16(L7, coeff9);\r\n            p20 = _mm_mullo_epi16(L8, coeff15);\r\n            p30 = _mm_mullo_epi16(L9, coeff7);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_add_epi16(H6, coeff16);\r\n            p11 = _mm_mullo_epi16(H7, coeff9);\r\n            p21 = _mm_mullo_epi16(H8, coeff15);\r\n            p31 = _mm_mullo_epi16(H9, coeff7);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[4][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L8, coeff3);\r\n            p10 = _mm_mullo_epi16(L9, coeff7);\r\n            p20 = _mm_mullo_epi16(L10, coeff5);\r\n            p30 = _mm_add_epi16(L11, coeff8);\r\n            p00 = 
_mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H8, coeff3);\r\n            p11 = _mm_mullo_epi16(H9, coeff7);\r\n            p21 = _mm_mullo_epi16(H10, coeff5);\r\n            p31 = _mm_add_epi16(H11, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[5][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L9, coeff3);\r\n            p10 = _mm_mullo_epi16(L10, coeff11);\r\n            p20 = _mm_mullo_epi16(L11, coeff13);\r\n            p30 = _mm_mullo_epi16(L12, coeff5);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H9, coeff3);\r\n            p11 = _mm_mullo_epi16(H10, coeff11);\r\n            p21 = _mm_mullo_epi16(H11, coeff13);\r\n            p31 = _mm_mullo_epi16(H12, coeff5);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[6][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L11, L13);\r\n            p10 = _mm_add_epi16(L12, L12);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_add_epi16(H11, H13);\r\n            p11 = _mm_add_epi16(H12, H12);\r\n  
          p01 = _mm_add_epi16(p01, coeff2);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[7][i], p00);\r\n        }\r\n#if BUGFIX_PREDICTION_INTRA\r\n        if (i < line_size) {\r\n#else\r\n        if (i < real_size) {\r\n#endif\r\n            __m128i SS1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i L1 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L2 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L3 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L4 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L5 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L6 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L7 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L8 = _mm_unpacklo_epi8(SS1, zero);\r\n            SS1 = _mm_srli_si128(SS1, 1);\r\n            __m128i L9 = _mm_unpacklo_epi8(SS1, zero);\r\n\r\n            __m128i SS10 = _mm_loadu_si128((__m128i*)(src + 10));\r\n            __m128i L10 = _mm_unpacklo_epi8(SS10, zero);\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i L11 = _mm_unpacklo_epi8(SS10, zero);\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i L12 = _mm_unpacklo_epi8(SS10, zero);\r\n            SS10 = _mm_srli_si128(SS10, 1);\r\n            __m128i L13 = _mm_unpacklo_epi8(SS10, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L1, coeff5);\r\n            p10 = _mm_mullo_epi16(L2, coeff13);\r\n            p20 = _mm_mullo_epi16(L3, coeff11);\r\n            p30 = _mm_mullo_epi16(L4, coeff3);\r\n           
 p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[0][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L3, coeff5);\r\n            p20 = _mm_mullo_epi16(L4, coeff7);\r\n            p30 = _mm_mullo_epi16(L5, coeff3);\r\n            p00 = _mm_add_epi16(L2, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[1][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L4, coeff7);\r\n            p10 = _mm_mullo_epi16(L5, coeff15);\r\n            p20 = _mm_mullo_epi16(L6, coeff9);\r\n            p30 = _mm_add_epi16(L7, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L5, L8);\r\n            p10 = _mm_add_epi16(L6, L7);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[3][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L6, coeff16);\r\n            p10 = _mm_mullo_epi16(L7, coeff9);\r\n            p20 = _mm_mullo_epi16(L8, coeff15);\r\n            p30 = _mm_mullo_epi16(L9, coeff7);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n         
   p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[4][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L8, coeff3);\r\n            p10 = _mm_mullo_epi16(L9, coeff7);\r\n            p20 = _mm_mullo_epi16(L10, coeff5);\r\n            p30 = _mm_add_epi16(L11, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[5][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L9, coeff3);\r\n            p10 = _mm_mullo_epi16(L10, coeff11);\r\n            p20 = _mm_mullo_epi16(L11, coeff13);\r\n            p30 = _mm_mullo_epi16(L12, coeff5);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[6][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L11, L13);\r\n            p10 = _mm_add_epi16(L12, L12);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[7][i], p00);\r\n        }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        //padding\r\n        if (real_size + 10 > iW2) {\r\n            src = src_org + iW2;\r\n            //needn't calculate pad using the value src. 
If pad is invalid, it is not used in \"for (i = start1; i < line_size; i += 16)\"; otherwise pad is valid.\r\n            __m128i pad1 = _mm_set1_epi8(pfirst[0][iW2 - 1]);\r\n            __m128i pad2 = _mm_set1_epi8(pfirst[1][iW2 - 2]);\r\n            __m128i pad3 = _mm_set1_epi8(pfirst[2][iW2 - 4]);\r\n            __m128i pad4 = _mm_set1_epi8(pfirst[3][iW2 - 5]);\r\n\r\n            __m128i pad5 = _mm_set1_epi8(pfirst[4][iW2 - 6]);\r\n            __m128i pad6 = _mm_set1_epi8(pfirst[5][iW2 - 8]);\r\n            __m128i pad7 = _mm_set1_epi8(pfirst[6][iW2 - 9]);\r\n            __m128i pad8 = _mm_set1_epi8(pfirst[7][iW2 - 11]);\r\n\r\n            int start1 = iW2;\r\n            int start2 = iW2 - 1;\r\n            int start3 = iW2 - 3;\r\n            int start4 = iW2 - 4;\r\n            int start5 = iW2 - 5;\r\n            int start6 = iW2 - 7;\r\n            int start7 = iW2 - 8;\r\n            int start8 = iW2 - 10;\r\n\r\n            for (i = start1; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[0][i], pad1);\r\n            }\r\n            for (i = start2; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[1][i], pad2);\r\n            }\r\n            for (i = start3; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[2][i], pad3);\r\n            }\r\n            for (i = start4; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[3][i], pad4);\r\n            }\r\n\r\n            for (i = start5; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[4][i], pad5);\r\n            }\r\n            for (i = start6; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[5][i], pad6);\r\n            }\r\n            for (i = start7; i < line_size; i += 16) {\r\n                _mm_storeu_si128((__m128i*)&pfirst[6][i], pad7);\r\n            }\r\n            for (i = start8; i < line_size; i += 16) {\r\n         
       _mm_storeu_si128((__m128i*)&pfirst[7][i], pad8);\r\n            }\r\n        }\r\n#endif\r\n\r\n        bsy >>= 3;\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst1, pfirst[0] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst2, pfirst[1] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst3, pfirst[2] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst4, pfirst[3] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst5, pfirst[4] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst6, pfirst[5] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst7, pfirst[6] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst8, pfirst[7] + i * 11, bsx * sizeof(pel_t));\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n        }\r\n    } else if (bsx == 16) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n\r\n        __m128i p00, p10, p20, p30;\r\n        __m128i p01, p11, p21, p31;\r\n\r\n        __m128i SS1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i L1 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L2 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L3 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L4 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L5 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L6 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L7 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = 
_mm_srli_si128(SS1, 1);\r\n        __m128i L8 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i H1 = _mm_unpacklo_epi8(SS1, zero);\r\n\r\n        __m128i SS10 = _mm_loadu_si128((__m128i*)(src + 10));\r\n        __m128i H2 = _mm_unpacklo_epi8(SS10, zero);\r\n        SS10 = _mm_srli_si128(SS10, 1);\r\n        __m128i H3 = _mm_unpacklo_epi8(SS10, zero);\r\n        SS10 = _mm_srli_si128(SS10, 1);\r\n        __m128i H4 = _mm_unpacklo_epi8(SS10, zero);\r\n        SS10 = _mm_srli_si128(SS10, 1);\r\n        __m128i H5 = _mm_unpacklo_epi8(SS10, zero);\r\n\r\n        SS10 = _mm_srli_si128(SS10, 1);\r\n        __m128i H6 = _mm_unpacklo_epi8(SS10, zero);\r\n        SS10 = _mm_srli_si128(SS10, 1);\r\n        __m128i H7 = _mm_unpacklo_epi8(SS10, zero);\r\n        SS10 = _mm_srli_si128(SS10, 1);\r\n        __m128i H8 = _mm_unpacklo_epi8(SS10, zero);\r\n\r\n        p00 = _mm_mullo_epi16(L1, coeff5);\r\n        p10 = _mm_mullo_epi16(L2, coeff13);\r\n        p20 = _mm_mullo_epi16(L3, coeff11);\r\n        p30 = _mm_mullo_epi16(L4, coeff3);\r\n        p00 = _mm_add_epi16(p00, coeff16);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 5);\r\n\r\n        p01 = _mm_mullo_epi16(H1, coeff5);\r\n        p11 = _mm_mullo_epi16(H2, coeff13);\r\n        p21 = _mm_mullo_epi16(H3, coeff11);\r\n        p31 = _mm_mullo_epi16(H4, coeff3);\r\n        p01 = _mm_add_epi16(p01, coeff16);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_add_epi16(p01, p21);\r\n        p01 = _mm_add_epi16(p01, p31);\r\n        p01 = _mm_srli_epi16(p01, 5);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        _mm_store_si128((__m128i*)dst1, p00);\r\n\r\n        p10 = _mm_mullo_epi16(L3, coeff5);\r\n        p20 = _mm_mullo_epi16(L4, coeff7);\r\n        p30 = _mm_mullo_epi16(L5, coeff3);\r\n        p00 = _mm_add_epi16(L2, coeff8);\r\n        p00 = 
_mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 4);\r\n\r\n        p11 = _mm_mullo_epi16(H3, coeff5);\r\n        p21 = _mm_mullo_epi16(H4, coeff7);\r\n        p31 = _mm_mullo_epi16(H5, coeff3);\r\n        p01 = _mm_add_epi16(H2, coeff8);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_add_epi16(p01, p21);\r\n        p01 = _mm_add_epi16(p01, p31);\r\n        p01 = _mm_srli_epi16(p01, 4);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        _mm_store_si128((__m128i*)dst2, p00);\r\n\r\n        p00 = _mm_mullo_epi16(L4, coeff7);\r\n        p10 = _mm_mullo_epi16(L5, coeff15);\r\n        p20 = _mm_mullo_epi16(L6, coeff9);\r\n        p30 = _mm_add_epi16(L7, coeff16);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 5);\r\n\r\n        p01 = _mm_mullo_epi16(H4, coeff7);\r\n        p11 = _mm_mullo_epi16(H5, coeff15);\r\n        p21 = _mm_mullo_epi16(H6, coeff9);\r\n        p31 = _mm_add_epi16(H7, coeff16);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_add_epi16(p01, p21);\r\n        p01 = _mm_add_epi16(p01, p31);\r\n        p01 = _mm_srli_epi16(p01, 5);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        _mm_store_si128((__m128i*)dst3, p00);\r\n\r\n        p00 = _mm_add_epi16(L5, L8);\r\n        p10 = _mm_add_epi16(L6, L7);\r\n        p10 = _mm_mullo_epi16(p10, coeff3);\r\n        p00 = _mm_add_epi16(p00, coeff4);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n\r\n        p01 = _mm_add_epi16(H5, H8);\r\n        p11 = _mm_add_epi16(H6, H7);\r\n        p11 = _mm_mullo_epi16(p11, coeff3);\r\n        p01 = _mm_add_epi16(p01, coeff4);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_srli_epi16(p01, 3);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        
_mm_store_si128((__m128i*)dst4, p00);\r\n    } else if (bsx == 8) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n\r\n        for (i = 0; i < 8; src++, i++) {\r\n            dst1[i] = (pel_t)((5 * src[1] + 13 * src[2] + 11 * src[3] + 3 * src[4] + 16) >> 5);\r\n            dst2[i] = (pel_t)((src[2] + 5 * src[3] + 7 * src[4] + 3 * src[5] + 8) >> 4);\r\n            dst3[i] = (pel_t)((7 * src[4] + 15 * src[5] + 9 * src[6] + 1 * src[7] + 16) >> 5);\r\n            dst4[i] = (pel_t)((src[5] + 3 * src[6] + 3 * src[7] + 1 * src[8] + 4) >> 3);\r\n\r\n            dst5[i] = (pel_t)((src[6] + 9 * src[7] + 15 * src[8] + 7 * src[9] + 16) >> 5);\r\n            dst6[i] = (pel_t)((3 * src[8] + 7 * src[9] + 5 * src[10] + src[11] + 8) >> 4);\r\n            dst7[i] = (pel_t)((3 * src[9] + 11 * src[10] + 13 * src[11] + 5 * src[12] + 16) >> 5);\r\n            dst8[i] = (pel_t)((src[11] + 2 * src[12] + src[13] + 2) >> 2);\r\n        }\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        dst7[7] = dst7[6];\r\n        dst8[7] = dst8[4];\r\n        dst8[6] = dst8[4];\r\n        dst8[5] = dst8[4];\r\n#endif\r\n        if (bsy == 32) {\r\n            //src -> 8,src[7] -> 15\r\n#if BUGFIX_PREDICTION_INTRA\r\n            __m128i pad1 = _mm_set1_epi8(src[8]);\r\n#else\r\n            __m128i pad1 = _mm_set1_epi8((pel_t)((5 * src[7] + 13 * src[8] + 11 * src[9] + 3 * src[10] + 16) >> 5));\r\n            __m128i pad2 = _mm_set1_epi8((pel_t)((src[7] + 5 * src[8] + 7 * src[9] + 3 * src[10] + 8) >> 4));\r\n            __m128i pad3 = _mm_set1_epi8((pel_t)((7 * src[7] + 15 * src[8] + 9 * src[9] + 1 * src[10] + 16) >> 5));\r\n            __m128i pad4 = _mm_set1_epi8((pel_t)((src[7] + 3 * src[8] + 3 * src[9] + 1 * src[10] + 4) >> 3));\r\n\r\n            __m128i 
pad5 = _mm_set1_epi8((pel_t)((src[7] + 9 * src[8] + 15 * src[9] + 7 * src[10] + 16) >> 5));\r\n            __m128i pad6 = _mm_set1_epi8(dst6[7]);\r\n            __m128i pad7 = _mm_set1_epi8(dst7[7]);\r\n            __m128i pad8 = _mm_set1_epi8(dst8[7]);\r\n#endif\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n            _mm_storel_epi64((__m128i*)dst1, pad1);\r\n            _mm_storel_epi64((__m128i*)dst2, pad1);\r\n            _mm_storel_epi64((__m128i*)dst3, pad1);\r\n            _mm_storel_epi64((__m128i*)dst4, pad1);\r\n\r\n            _mm_storel_epi64((__m128i*)dst5, pad1);\r\n            _mm_storel_epi64((__m128i*)dst6, pad1);\r\n            _mm_storel_epi64((__m128i*)dst7, pad1);\r\n            _mm_storel_epi64((__m128i*)dst8, pad1);\r\n#else\r\n            _mm_storel_epi64((__m128i*)dst1, pad1);\r\n            _mm_storel_epi64((__m128i*)dst2, pad2);\r\n            _mm_storel_epi64((__m128i*)dst3, pad3);\r\n            _mm_storel_epi64((__m128i*)dst4, pad4);\r\n\r\n            _mm_storel_epi64((__m128i*)dst5, pad5);\r\n            _mm_storel_epi64((__m128i*)dst6, pad6);\r\n            _mm_storel_epi64((__m128i*)dst7, pad7);\r\n            _mm_storel_epi64((__m128i*)dst8, pad8);\r\n#endif\r\n\r\n            src += 4;\r\n            dst1[0] = (pel_t)((5 * src[0] + 13 * src[1] + 11 * src[2] + 3 * src[3] + 16) >> 5);\r\n            dst1[1] = (pel_t)((5 * src[1] + 13 * src[2] + 11 * src[3] + 3 * src[4] + 16) >> 5);\r\n            dst1[2] = (pel_t)((5 * src[2] + 13 * src[3] + 11 * src[4] + 3 * src[5] + 16) >> 5);\r\n            dst1[3] = (pel_t)((5 * src[3] + 13 * src[4] + 11 * src[5] + 3 * src[6] + 16) >> 5);\r\n            dst2[0] = (pel_t)((src[1] + 5 * src[2] + 7 * src[3] + 3 * 
src[4] + 8) >> 4);\r\n            dst2[1] = (pel_t)((src[2] + 5 * src[3] + 7 * src[4] + 3 * src[5] + 8) >> 4);\r\n            dst2[2] = (pel_t)((src[3] + 5 * src[4] + 7 * src[5] + 3 * src[6] + 8) >> 4);\r\n            dst3[0] = (pel_t)((7 * src[3] + 15 * src[4] + 9 * src[5] + src[6] + 16) >> 5);\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n            _mm_storel_epi64((__m128i*)dst1, pad1);\r\n            _mm_storel_epi64((__m128i*)dst2, pad1);\r\n            _mm_storel_epi64((__m128i*)dst3, pad1);\r\n            _mm_storel_epi64((__m128i*)dst4, pad1);\r\n\r\n            _mm_storel_epi64((__m128i*)dst5, pad1);\r\n            _mm_storel_epi64((__m128i*)dst6, pad1);\r\n            _mm_storel_epi64((__m128i*)dst7, pad1);\r\n            _mm_storel_epi64((__m128i*)dst8, pad1);\r\n#else\r\n            _mm_storel_epi64((__m128i*)dst1, pad1);\r\n            _mm_storel_epi64((__m128i*)dst2, pad2);\r\n            _mm_storel_epi64((__m128i*)dst3, pad3);\r\n            _mm_storel_epi64((__m128i*)dst4, pad4);\r\n\r\n            _mm_storel_epi64((__m128i*)dst5, pad5);\r\n            _mm_storel_epi64((__m128i*)dst6, pad6);\r\n            _mm_storel_epi64((__m128i*)dst7, pad7);\r\n            _mm_storel_epi64((__m128i*)dst8, pad8);\r\n#endif\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n            _mm_storel_epi64((__m128i*)dst1, pad1);\r\n            _mm_storel_epi64((__m128i*)dst2, pad1);\r\n            
_mm_storel_epi64((__m128i*)dst3, pad1);\r\n            _mm_storel_epi64((__m128i*)dst4, pad1);\r\n\r\n            _mm_storel_epi64((__m128i*)dst5, pad1);\r\n            _mm_storel_epi64((__m128i*)dst6, pad1);\r\n            _mm_storel_epi64((__m128i*)dst7, pad1);\r\n            _mm_storel_epi64((__m128i*)dst8, pad1);\r\n#else\r\n            _mm_storel_epi64((__m128i*)dst1, pad1);\r\n            _mm_storel_epi64((__m128i*)dst2, pad2);\r\n            _mm_storel_epi64((__m128i*)dst3, pad3);\r\n            _mm_storel_epi64((__m128i*)dst4, pad4);\r\n\r\n            _mm_storel_epi64((__m128i*)dst5, pad5);\r\n            _mm_storel_epi64((__m128i*)dst6, pad6);\r\n            _mm_storel_epi64((__m128i*)dst7, pad7);\r\n            _mm_storel_epi64((__m128i*)dst8, pad8);\r\n#endif\r\n        }\r\n    } else {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n\r\n        __m128i p00, p10, p20, p30;\r\n\r\n        __m128i SS1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i L1 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L2 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L3 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L4 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L5 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L6 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L7 = _mm_unpacklo_epi8(SS1, zero);\r\n        SS1 = _mm_srli_si128(SS1, 1);\r\n        __m128i L8 = _mm_unpacklo_epi8(SS1, zero);\r\n\r\n        p00 = _mm_mullo_epi16(L1, coeff5);\r\n        p10 = _mm_mullo_epi16(L2, coeff13);\r\n        p20 = _mm_mullo_epi16(L3, coeff11);\r\n        p30 = _mm_mullo_epi16(L4, coeff3);\r\n        p00 = 
_mm_add_epi16(p00, coeff16);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 5);\r\n\r\n        p00 = _mm_packus_epi16(p00, p00);\r\n        *((int*)(dst1)) = _mm_cvtsi128_si32(p00);\r\n\r\n        p10 = _mm_mullo_epi16(L3, coeff5);\r\n        p20 = _mm_mullo_epi16(L4, coeff7);\r\n        p30 = _mm_mullo_epi16(L5, coeff3);\r\n        p00 = _mm_add_epi16(L2, coeff8);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 4);\r\n\r\n        p00 = _mm_packus_epi16(p00, p00);\r\n        *((int*)(dst2)) = _mm_cvtsi128_si32(p00);\r\n\r\n        p00 = _mm_mullo_epi16(L4, coeff7);\r\n        p10 = _mm_mullo_epi16(L5, coeff15);\r\n        p20 = _mm_mullo_epi16(L6, coeff9);\r\n        p30 = _mm_add_epi16(L7, coeff16);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_add_epi16(p00, p20);\r\n        p00 = _mm_add_epi16(p00, p30);\r\n        p00 = _mm_srli_epi16(p00, 5);\r\n\r\n        p00 = _mm_packus_epi16(p00, p00);\r\n        *((int*)(dst3)) = _mm_cvtsi128_si32(p00);\r\n\r\n        p00 = _mm_add_epi16(L5, L8);\r\n        p10 = _mm_add_epi16(L6, L7);\r\n        p10 = _mm_mullo_epi16(p10, coeff3);\r\n        p00 = _mm_add_epi16(p00, coeff4);\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n\r\n        p00 = _mm_packus_epi16(p00, p00);\r\n        *((int*)(dst4)) = _mm_cvtsi128_si32(p00);\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        dst4[3] = dst4[2];\r\n#endif\r\n\r\n        if (bsy == 16) {\r\n            pel_t *dst5 = dst4 + i_dst;\r\n            pel_t *dst6 = dst5 + i_dst;\r\n            pel_t *dst7 = dst6 + i_dst;\r\n            pel_t *dst8 = dst7 + i_dst;\r\n\r\n            src += 8;\r\n#if BUGFIX_PREDICTION_INTRA\r\n            __m128i pad1 = _mm_set1_epi8(src[0]);\r\n\r\n            *(int*)(dst5) = 
_mm_cvtsi128_si32(pad1);\r\n            *(int*)(dst6) = _mm_cvtsi128_si32(pad1);\r\n            *(int*)(dst7) = _mm_cvtsi128_si32(pad1);\r\n            *(int*)(dst8) = _mm_cvtsi128_si32(pad1);\r\n#else\r\n            __m128i pad1 = _mm_set1_epi8((pel_t)((5 * src[0] + 13 * src[1] + 11 * src[2] + 3 * src[3] + 16) >> 5));\r\n            __m128i pad2 = _mm_set1_epi8((pel_t)((src[0] + 5 * src[1] + 7 * src[2] + 3 * src[3] + 8) >> 4));\r\n            __m128i pad3 = _mm_set1_epi8((pel_t)((7 * src[0] + 15 * src[1] + 9 * src[2] + 1 * src[3] + 16) >> 5));\r\n            __m128i pad4 = _mm_set1_epi8(dst4[3]);\r\n\r\n            __m128i pad5 = _mm_set1_epi8((pel_t)((src[0] + 9 * src[1] + 15 * src[2] + 7 * src[3] + 16) >> 5));\r\n            __m128i pad6 = _mm_set1_epi8((pel_t)((3 * src[0] + 7 * src[1] + 5 * src[2] + src[3] + 8) >> 4));\r\n            __m128i pad7 = _mm_set1_epi8((pel_t)((3 * src[0] + 11 * src[1] + 13 * src[2] + 5 * src[3] + 16) >> 5));\r\n            __m128i pad8 = _mm_set1_epi8((pel_t)((src[0] + 2 * src[1] + src[2] + 2) >> 2));\r\n\r\n            *(int*)(dst5) = _mm_cvtsi128_si32(pad5);\r\n            *(int*)(dst6) = _mm_cvtsi128_si32(pad6);\r\n            *(int*)(dst7) = _mm_cvtsi128_si32(pad7);\r\n            *(int*)(dst8) = _mm_cvtsi128_si32(pad8);\r\n#endif\r\n\r\n            dst5[0] = (pel_t)((src[-2] + 9 * src[-1] + 15 * src[0] + 7 * src[1] + 16) >> 5);\r\n            dst5[1] = (pel_t)((src[-1] + 9 * src[0] + 15 * src[1] + 7 * src[2] + 16) >> 5);\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n            *(int*)(dst1) = _mm_cvtsi128_si32(pad1);\r\n            *(int*)(dst2) = _mm_cvtsi128_si32(pad1);\r\n            *(int*)(dst3) = _mm_cvtsi128_si32(pad1);\r\n            
*(int*)(dst4) = _mm_cvtsi128_si32(pad1);\r\n\r\n            *(int*)(dst5) = _mm_cvtsi128_si32(pad1);\r\n            *(int*)(dst6) = _mm_cvtsi128_si32(pad1);\r\n            *(int*)(dst7) = _mm_cvtsi128_si32(pad1);\r\n            *(int*)(dst8) = _mm_cvtsi128_si32(pad1);\r\n#else\r\n            *(int*)(dst1) = _mm_cvtsi128_si32(pad1);\r\n            *(int*)(dst2) = _mm_cvtsi128_si32(pad2);\r\n            *(int*)(dst3) = _mm_cvtsi128_si32(pad3);\r\n            *(int*)(dst4) = _mm_cvtsi128_si32(pad4);\r\n\r\n            *(int*)(dst5) = _mm_cvtsi128_si32(pad5);\r\n            *(int*)(dst6) = _mm_cvtsi128_si32(pad6);\r\n            *(int*)(dst7) = _mm_cvtsi128_si32(pad7);\r\n            *(int*)(dst8) = _mm_cvtsi128_si32(pad8);\r\n#endif\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_x_6_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 64]);\r\n    int line_size = bsx + bsy - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = DAVS2_MIN(line_size, bsx * 2 - 1);\r\n#endif\r\n    int i;\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i offset = _mm_set1_epi16(2);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    src += 2;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 8; i += 16, src += 16) {\r\n#else\r\n    for (i = 0; i < real_size - 8; i += 16, src += 16) {\r\n#endif\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n 
       __m128i sum1 = _mm_add_epi16(L0, L1);\r\n        __m128i sum2 = _mm_add_epi16(L1, L2);\r\n        __m128i sum3 = _mm_add_epi16(H0, H1);\r\n        __m128i sum4 = _mm_add_epi16(H1, H2);\r\n\r\n        sum1 = _mm_add_epi16(sum1, sum2);\r\n        sum3 = _mm_add_epi16(sum3, sum4);\r\n\r\n        sum1 = _mm_add_epi16(sum1, offset);\r\n        sum3 = _mm_add_epi16(sum3, offset);\r\n\r\n        sum1 = _mm_srli_epi16(sum1, 2);\r\n        sum3 = _mm_srli_epi16(sum3, 2);\r\n\r\n        sum1 = _mm_packus_epi16(sum1, sum3);\r\n\r\n        _mm_store_si128((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i sum1 = _mm_add_epi16(L0, L1);\r\n        __m128i sum2 = _mm_add_epi16(L1, L2);\r\n\r\n        sum1 = _mm_add_epi16(sum1, sum2);\r\n        sum1 = _mm_add_epi16(sum1, offset);\r\n        sum1 = _mm_srli_epi16(sum1, 2);\r\n\r\n        sum1 = _mm_packus_epi16(sum1, sum1);\r\n        _mm_storel_epi64((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    for (i = real_size; i < line_size; i += 16) {\r\n        __m128i pad = _mm_set1_epi8(first_line[real_size - 1]);\r\n        _mm_storeu_si128((__m128i*)&first_line[i], pad);\r\n    }\r\n#endif\r\n\r\n    if (bsx > 16 || bsx == 4) {\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2;\r\n        if (bsy == 4) {\r\n            __m128i M = 
_mm_loadu_si128((__m128i*)&first_line[0]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 = dst + 8;\r\n            M = _mm_loadu_si128((__m128i*)&first_line[8]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n        } else {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[0]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            
dst2 = dst1 + i_dst;\r\n            dst1 = dst + 8;\r\n            M = _mm_loadu_si128((__m128i*)&first_line[8]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            M = _mm_loadu_si128((__m128i*)&first_line[16]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n     
       _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n        }\r\n    } else {\r\n        for (i = 0; i < bsy; i += 8) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_x_7_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i, j;\r\n    int iWidth2 = bsx 
<< 1;\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i S0, S1, S2, S3;\r\n    __m128i t0, t1, t2, t3;\r\n    __m128i off = _mm_set1_epi16(64);\r\n    __m128i c0;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx >= bsy) {\r\n        if (bsx & 0x07) {\r\n            __m128i D0;\r\n            int i_dst2 = i_dst << 1;\r\n\r\n            for (j = 0; j < bsy; j += 2) {\r\n                int idx = tab_idx_mode_7[j];\r\n                c0 = _mm_load_si128((__m128i*)tab_coeff_mode_7[j]);\r\n\r\n                S0 = _mm_loadl_epi64((__m128i*)(src + idx));\r\n                S1 = _mm_srli_si128(S0, 1);\r\n                S2 = _mm_srli_si128(S0, 2);\r\n                S3 = _mm_srli_si128(S0, 3);\r\n\r\n                t0 = _mm_unpacklo_epi8(S0, S1);\r\n                t1 = _mm_unpacklo_epi8(S2, S3);\r\n                t2 = _mm_unpacklo_epi16(t0, t1);\r\n\r\n                t0 = _mm_maddubs_epi16(t2, c0);\r\n\r\n                idx = tab_idx_mode_7[j + 1];\r\n                c0 = _mm_load_si128((__m128i*)tab_coeff_mode_7[j + 1]);\r\n                S0 = _mm_loadl_epi64((__m128i*)(src + idx));\r\n                S1 = _mm_srli_si128(S0, 1);\r\n                S2 = _mm_srli_si128(S0, 2);\r\n                S3 = _mm_srli_si128(S0, 3);\r\n\r\n                t1 = _mm_unpacklo_epi8(S0, S1);\r\n                t2 = _mm_unpacklo_epi8(S2, S3);\r\n                t1 = _mm_unpacklo_epi16(t1, t2);\r\n\r\n                t1 = _mm_maddubs_epi16(t1, c0);\r\n\r\n                D0 = _mm_hadds_epi16(t0, t1);\r\n                D0 = _mm_add_epi16(D0, off);\r\n                D0 = _mm_srli_epi16(D0, 7);\r\n                D0 = _mm_packus_epi16(D0, zero);\r\n\r\n                ((uint32_t*)(dst))[0] = _mm_cvtsi128_si32(D0);\r\n                D0= _mm_srli_si128(D0, 4);\r\n                ((uint32_t*)(dst + i_dst))[0] = _mm_cvtsi128_si32(D0);\r\n                //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n          
      dst += i_dst2;\r\n            }\r\n        } else if (bsx & 0x0f) {\r\n            __m128i D0;\r\n\r\n            for (j = 0; j < bsy; j++) {\r\n                int idx = tab_idx_mode_7[j];\r\n                c0 = _mm_load_si128((__m128i*)tab_coeff_mode_7[j]);\r\n\r\n                S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n                S1 = _mm_srli_si128(S0, 1);\r\n                S2 = _mm_srli_si128(S0, 2);\r\n                S3 = _mm_srli_si128(S0, 3);\r\n\r\n                t0 = _mm_unpacklo_epi8(S0, S1);\r\n                t1 = _mm_unpacklo_epi8(S2, S3);\r\n                t2 = _mm_unpacklo_epi16(t0, t1);\r\n                t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                t0 = _mm_maddubs_epi16(t2, c0);\r\n                t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                D0 = _mm_hadds_epi16(t0, t1);\r\n                D0 = _mm_add_epi16(D0, off);\r\n                D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                D0 = _mm_packus_epi16(D0, _mm_setzero_si128());\r\n\r\n                _mm_storel_epi64((__m128i*)(dst), D0);\r\n                //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            for (j = 0; j < bsy; j++) {\r\n                __m128i D0, D1;\r\n\r\n                int idx = tab_idx_mode_7[j];\r\n                c0 = _mm_load_si128((__m128i*)tab_coeff_mode_7[j]);\r\n\r\n                for (i = 0; i < bsx; i += 16, idx += 16) {\r\n                    S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n                    S1 = _mm_loadu_si128((__m128i*)(src + idx + 1));\r\n                    S2 = _mm_loadu_si128((__m128i*)(src + idx + 2));\r\n                    S3 = _mm_loadu_si128((__m128i*)(src + idx + 3));\r\n\r\n                    t0 = _mm_unpacklo_epi8(S0, S1);\r\n                    t1 = _mm_unpacklo_epi8(S2, S3);\r\n                    t2 = _mm_unpacklo_epi16(t0, t1);\r\n   
                 t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                    t0 = _mm_maddubs_epi16(t2, c0);\r\n                    t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                    D0 = _mm_hadds_epi16(t0, t1);\r\n                    D0 = _mm_add_epi16(D0, off);\r\n                    D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                    t0 = _mm_unpackhi_epi8(S0, S1);\r\n                    t1 = _mm_unpackhi_epi8(S2, S3);\r\n                    t2 = _mm_unpacklo_epi16(t0, t1);\r\n                    t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                    t0 = _mm_maddubs_epi16(t2, c0);\r\n                    t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                    D1 = _mm_hadds_epi16(t0, t1);\r\n                    D1 = _mm_add_epi16(D1, off);\r\n                    D1 = _mm_srli_epi16(D1, 7);\r\n\r\n                    D0 = _mm_packus_epi16(D0, D1);\r\n\r\n                    _mm_storeu_si128((__m128i*)(dst + i), D0);\r\n                    //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n                }\r\n\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    } else {\r\n        if (bsx & 0x07) {\r\n            for (j = 0; j < bsy; j++) {\r\n                int real_width;\r\n                int idx = tab_idx_mode_7[j];\r\n\r\n                real_width = DAVS2_MIN(bsx, iWidth2 - idx + 1);\r\n\r\n                if (real_width <= 0) {\r\n                    pel_t val = (pel_t)((src[iWidth2] * tab_coeff_mode_7[j][0] + src[iWidth2 + 1] * tab_coeff_mode_7[j][1] + src[iWidth2 + 2] * tab_coeff_mode_7[j][2] + src[iWidth2 + 3] * tab_coeff_mode_7[j][3] + 64) >> 7);\r\n                    __m128i D0 = _mm_set1_epi8((char)val);\r\n                    _mm_storel_epi64((__m128i*)(dst), D0);\r\n                    dst += i_dst;\r\n                    j++;\r\n\r\n                    for (; j < bsy; j++) {\r\n                        val = (pel_t)((src[iWidth2] * 
tab_coeff_mode_7[j][0] + src[iWidth2 + 1] * tab_coeff_mode_7[j][1] + src[iWidth2 + 2] * tab_coeff_mode_7[j][2] + src[iWidth2 + 3] * tab_coeff_mode_7[j][3] + 64) >> 7);\r\n                        D0 = _mm_set1_epi8((char)val);\r\n                        _mm_storel_epi64((__m128i*)(dst), D0);\r\n                        dst += i_dst;\r\n                    }\r\n                    break;\r\n                } else {\r\n                    __m128i D0;\r\n                    c0 = _mm_load_si128((__m128i*)tab_coeff_mode_7[j]);\r\n\r\n                    S0 = _mm_loadl_epi64((__m128i*)(src + idx));\r\n                    S1 = _mm_srli_si128(S0, 1);\r\n                    S2 = _mm_srli_si128(S0, 2);\r\n                    S3 = _mm_srli_si128(S0, 3);\r\n\r\n                    t0 = _mm_unpacklo_epi8(S0, S1);\r\n                    t1 = _mm_unpacklo_epi8(S2, S3);\r\n                    t2 = _mm_unpacklo_epi16(t0, t1);\r\n\r\n                    t0 = _mm_maddubs_epi16(t2, c0);\r\n\r\n                    D0 = _mm_hadds_epi16(t0, zero);\r\n                    D0 = _mm_add_epi16(D0, off);\r\n                    D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                    D0 = _mm_packus_epi16(D0, zero);\r\n\r\n                    _mm_storel_epi64((__m128i*)(dst), D0);\r\n\r\n                    if (real_width < bsx) {\r\n                        D0 = _mm_set1_epi8((char)dst[real_width - 1]);\r\n                        _mm_storel_epi64((__m128i*)(dst + real_width), D0);\r\n                    }\r\n                }\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx & 0x0f) {\r\n            for (j = 0; j < bsy; j++) {\r\n                int real_width;\r\n                int idx = tab_idx_mode_7[j];\r\n\r\n                real_width = DAVS2_MIN(bsx, iWidth2 - idx + 1);\r\n\r\n                if (real_width <= 0) {\r\n                    pel_t val = (pel_t)((src[iWidth2] * tab_coeff_mode_7[j][0] + src[iWidth2 + 1] * tab_coeff_mode_7[j][1] + src[iWidth2 + 2] * 
tab_coeff_mode_7[j][2] + src[iWidth2 + 3] * tab_coeff_mode_7[j][3] + 64) >> 7);\r\n                    __m128i D0 = _mm_set1_epi8((char)val);\r\n                    _mm_storel_epi64((__m128i*)(dst), D0);\r\n                    dst += i_dst;\r\n                    j++;\r\n\r\n                    for (; j < bsy; j++) {\r\n                        val = (pel_t)((src[iWidth2] * tab_coeff_mode_7[j][0] + src[iWidth2 + 1] * tab_coeff_mode_7[j][1] + src[iWidth2 + 2] * tab_coeff_mode_7[j][2] + src[iWidth2 + 3] * tab_coeff_mode_7[j][3] + 64) >> 7);\r\n                        D0 = _mm_set1_epi8((char)val);\r\n                        _mm_storel_epi64((__m128i*)(dst), D0);\r\n                        dst += i_dst;\r\n                    }\r\n                    break;\r\n                } else {\r\n                    __m128i D0;\r\n                    c0 = _mm_load_si128((__m128i*)tab_coeff_mode_7[j]);\r\n\r\n                    S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n                    S1 = _mm_srli_si128(S0, 1);\r\n                    S2 = _mm_srli_si128(S0, 2);\r\n                    S3 = _mm_srli_si128(S0, 3);\r\n\r\n                    t0 = _mm_unpacklo_epi8(S0, S1);\r\n                    t1 = _mm_unpacklo_epi8(S2, S3);\r\n                    t2 = _mm_unpacklo_epi16(t0, t1);\r\n                    t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                    t0 = _mm_maddubs_epi16(t2, c0);\r\n                    t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                    D0 = _mm_hadds_epi16(t0, t1);\r\n                    D0 = _mm_add_epi16(D0, off);\r\n                    D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                    D0 = _mm_packus_epi16(D0, zero);\r\n\r\n                    _mm_storel_epi64((__m128i*)(dst), D0);\r\n                    //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n\r\n                    if (real_width < bsx) {\r\n                        D0 = 
_mm_set1_epi8((char)dst[real_width - 1]);\r\n                        _mm_storel_epi64((__m128i*)(dst + real_width), D0);\r\n                    }\r\n\r\n                }\r\n\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            for (j = 0; j < bsy; j++) {\r\n                int real_width;\r\n                int idx = tab_idx_mode_7[j];\r\n\r\n                real_width = DAVS2_MIN(bsx, iWidth2 - idx + 1);\r\n\r\n                if (real_width <= 0) {\r\n                    pel_t val = (pel_t)((src[iWidth2] * tab_coeff_mode_7[j][0] + src[iWidth2 + 1] * tab_coeff_mode_7[j][1] + src[iWidth2 + 2] * tab_coeff_mode_7[j][2] + src[iWidth2 + 3] * tab_coeff_mode_7[j][3] + 64) >> 7);\r\n                    __m128i D0 = _mm_set1_epi8((char)val);\r\n\r\n                    for (i = 0; i < bsx; i += 16) {\r\n                        _mm_storeu_si128((__m128i*)(dst + i), D0);\r\n                    }\r\n                    dst += i_dst;\r\n                    j++;\r\n\r\n                    for (; j < bsy; j++) {\r\n                        val = (pel_t)((src[iWidth2] * tab_coeff_mode_7[j][0] + src[iWidth2 + 1] * tab_coeff_mode_7[j][1] + src[iWidth2 + 2] * tab_coeff_mode_7[j][2] + src[iWidth2 + 3] * tab_coeff_mode_7[j][3] + 64) >> 7);\r\n                        D0 = _mm_set1_epi8((char)val);\r\n                        for (i = 0; i < bsx; i += 16) {\r\n                            _mm_storeu_si128((__m128i*)(dst + i), D0);\r\n                        }\r\n                        dst += i_dst;\r\n                    }\r\n                    break;\r\n                } else {\r\n                    __m128i D0, D1;\r\n\r\n                    c0 = _mm_load_si128((__m128i*)tab_coeff_mode_7[j]);\r\n                    for (i = 0; i < real_width; i += 16, idx += 16) {\r\n                        S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n                        S1 = _mm_loadu_si128((__m128i*)(src + idx + 1));\r\n                        S2 = 
_mm_loadu_si128((__m128i*)(src + idx + 2));\r\n                        S3 = _mm_loadu_si128((__m128i*)(src + idx + 3));\r\n\r\n                        t0 = _mm_unpacklo_epi8(S0, S1);\r\n                        t1 = _mm_unpacklo_epi8(S2, S3);\r\n                        t2 = _mm_unpacklo_epi16(t0, t1);\r\n                        t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                        t0 = _mm_maddubs_epi16(t2, c0);\r\n                        t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                        D0 = _mm_hadds_epi16(t0, t1);\r\n                        D0 = _mm_add_epi16(D0, off);\r\n                        D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                        t0 = _mm_unpackhi_epi8(S0, S1);\r\n                        t1 = _mm_unpackhi_epi8(S2, S3);\r\n                        t2 = _mm_unpacklo_epi16(t0, t1);\r\n                        t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                        t0 = _mm_maddubs_epi16(t2, c0);\r\n                        t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                        D1 = _mm_hadds_epi16(t0, t1);\r\n                        D1 = _mm_add_epi16(D1, off);\r\n                        D1 = _mm_srli_epi16(D1, 7);\r\n\r\n                        D0 = _mm_packus_epi16(D0, D1);\r\n\r\n                        _mm_store_si128((__m128i*)(dst + i), D0);\r\n                        //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n                    }\r\n\r\n                    if (real_width < bsx) {\r\n                        D0 = _mm_set1_epi8((char)dst[real_width - 1]);\r\n                        for (i = real_width; i < bsx; i += 16) {\r\n                            _mm_storeu_si128((__m128i*)(dst + i), D0);\r\n                            //dst[i] = dst[real_width - 1];\r\n                        }\r\n                    }\r\n\r\n                }\r\n\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_x_8_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[2 * (64 + 48)]);\r\n    int line_size = bsx + (bsy >> 1) - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = DAVS2_MIN(line_size, (bsx << 1));\r\n#endif\r\n    int i;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    __m128i pad1, pad2;\r\n#endif\r\n    int aligned_line_size = ((line_size + 31) >> 4) << 4;\r\n    pel_t *pfirst[2];\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i coeff = _mm_set1_epi16(3);\r\n    __m128i offset1 = _mm_set1_epi16(4);\r\n    __m128i offset2 = _mm_set1_epi16(2);\r\n    int i_dst2 = i_dst * 2;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = first_line + aligned_line_size;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 8; i += 16, src += 16) {\r\n#else\r\n    for (i = 0; i < real_size - 8; i += 16, src += 16) {\r\n#endif\r\n        __m128i p01, p02, p11, p12;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n        __m128i S3 = _mm_loadu_si128((__m128i*)(src + 3));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n        __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n        __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n        p01 = _mm_add_epi16(L1, L2);\r\n        p01 = _mm_mullo_epi16(p01, coeff);\r\n        p02 = _mm_add_epi16(L0, L3);\r\n        p02 = _mm_add_epi16(p02, offset1);\r\n        p01 = _mm_add_epi16(p01, p02);\r\n        
p01 = _mm_srli_epi16(p01, 3);\r\n\r\n        p11 = _mm_add_epi16(H1, H2);\r\n        p11 = _mm_mullo_epi16(p11, coeff);\r\n        p12 = _mm_add_epi16(H0, H3);\r\n        p12 = _mm_add_epi16(p12, offset1);\r\n        p11 = _mm_add_epi16(p11, p12);\r\n        p11 = _mm_srli_epi16(p11, 3);\r\n\r\n        p01 = _mm_packus_epi16(p01, p11);\r\n        _mm_store_si128((__m128i*)&pfirst[0][i], p01);\r\n\r\n        p01 = _mm_add_epi16(L1, L2);\r\n        p02 = _mm_add_epi16(L2, L3);\r\n        p11 = _mm_add_epi16(H1, H2);\r\n        p12 = _mm_add_epi16(H2, H3);\r\n\r\n        p01 = _mm_add_epi16(p01, p02);\r\n        p11 = _mm_add_epi16(p11, p12);\r\n\r\n        p01 = _mm_add_epi16(p01, offset2);\r\n        p11 = _mm_add_epi16(p11, offset2);\r\n\r\n        p01 = _mm_srli_epi16(p01, 2);\r\n        p11 = _mm_srli_epi16(p11, 2);\r\n\r\n        p01 = _mm_packus_epi16(p01, p11);\r\n        _mm_store_si128((__m128i*)&pfirst[1][i], p01);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        __m128i p01, p02;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n        __m128i S3 = _mm_loadu_si128((__m128i*)(src + 3));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n        __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n        p01 = _mm_add_epi16(L1, L2);\r\n        p01 = _mm_mullo_epi16(p01, coeff);\r\n        p02 = _mm_add_epi16(L0, L3);\r\n        p02 = _mm_add_epi16(p02, offset1);\r\n        p01 = _mm_add_epi16(p01, p02);\r\n        p01 = _mm_srli_epi16(p01, 3);\r\n\r\n        p01 = _mm_packus_epi16(p01, p01);\r\n        _mm_storel_epi64((__m128i*)&pfirst[0][i], p01);\r\n\r\n        p01 = _mm_add_epi16(L1, L2);\r\n        p02 = _mm_add_epi16(L2, 
L3);\r\n\r\n        p01 = _mm_add_epi16(p01, p02);\r\n        p01 = _mm_add_epi16(p01, offset2);\r\n        p01 = _mm_srli_epi16(p01, 2);\r\n\r\n        p01 = _mm_packus_epi16(p01, p01);\r\n        _mm_storel_epi64((__m128i*)&pfirst[1][i], p01);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    if (real_size < line_size) {\r\n        pfirst[1][real_size - 1] = pfirst[1][real_size - 2];\r\n\r\n        pad1 = _mm_set1_epi8(pfirst[0][real_size - 1]);\r\n        pad2 = _mm_set1_epi8(pfirst[1][real_size - 1]);\r\n        for (i = real_size; i < line_size; i += 16) {\r\n            _mm_storeu_si128((__m128i*)&pfirst[0][i], pad1);\r\n            _mm_storeu_si128((__m128i*)&pfirst[1][i], pad2);\r\n        }\r\n    }\r\n#endif\r\n\r\n    bsy >>= 1;\r\n\r\n    if (bsx != 8) {\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + i_dst, pfirst[1] + i, bsx * sizeof(pel_t));\r\n            dst += i_dst2;\r\n        }\r\n    } else if (bsy == 4) {\r\n        __m128i M1 = _mm_loadu_si128((__m128i*)&pfirst[0][0]);\r\n        __m128i M2 = _mm_loadu_si128((__m128i*)&pfirst[1][0]);\r\n        _mm_storel_epi64((__m128i*)dst, M1);\r\n        _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n        dst += i_dst2;\r\n        M1 = _mm_srli_si128(M1, 1);\r\n        M2 = _mm_srli_si128(M2, 1);\r\n        _mm_storel_epi64((__m128i*)dst, M1);\r\n        _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n        dst += i_dst2;\r\n        M1 = _mm_srli_si128(M1, 1);\r\n        M2 = _mm_srli_si128(M2, 1);\r\n        _mm_storel_epi64((__m128i*)dst, M1);\r\n        _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n        dst += i_dst2;\r\n        M1 = _mm_srli_si128(M1, 1);\r\n        M2 = _mm_srli_si128(M2, 1);\r\n        _mm_storel_epi64((__m128i*)dst, M1);\r\n        _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n    } else {\r\n        for (i = 0; i < 16; i = i + 8) {\r\n            __m128i M1 
= _mm_loadu_si128((__m128i*)&pfirst[0][i]);\r\n            __m128i M2 = _mm_loadu_si128((__m128i*)&pfirst[1][i]);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_x_9_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i, j;\r\n    int iWidth2 = bsx << 1;\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i S0, S1, S2, S3;\r\n    __m128i t0, t1, t2, t3;\r\n    __m128i off = _mm_set1_epi16(64);\r\n    __m128i c0;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx >= bsy) {\r\n        if (bsx & 0x07) {\r\n            __m128i D0;\r\n            int i_dst2 = i_dst << 1;\r\n\r\n            for (j = 0; j < bsy; j += 2) {\r\n                int idx = tab_idx_mode_9[j];\r\n                c0 = _mm_load_si128((__m128i*)tab_coeff_mode_9[j]);\r\n\r\n                S0 = _mm_loadl_epi64((__m128i*)(src + idx));\r\n                S1 = _mm_srli_si128(S0, 1);\r\n                S2 = _mm_srli_si128(S0, 2);\r\n                S3 = _mm_srli_si128(S0, 3);\r\n\r\n                t0 = _mm_unpacklo_epi8(S0, S1);\r\n                t1 = _mm_unpacklo_epi8(S2, S3);\r\n                t2 = _mm_unpacklo_epi16(t0, t1);\r\n\r\n                t0 = _mm_maddubs_epi16(t2, c0);\r\n\r\n                idx = tab_idx_mode_9[j + 1];\r\n                c0 = _mm_load_si128((__m128i*)tab_coeff_mode_9[j + 1]);\r\n                S0 = _mm_loadl_epi64((__m128i*)(src + idx));\r\n                S1 = _mm_srli_si128(S0, 1);\r\n                S2 = _mm_srli_si128(S0, 2);\r\n                S3 = _mm_srli_si128(S0, 3);\r\n\r\n                t1 = _mm_unpacklo_epi8(S0, S1);\r\n                t2 = _mm_unpacklo_epi8(S2, S3);\r\n                t1 = _mm_unpacklo_epi16(t1, t2);\r\n\r\n                t1 = _mm_maddubs_epi16(t1, c0);\r\n\r\n                D0 = _mm_hadds_epi16(t0, t1);\r\n                D0 = _mm_add_epi16(D0, off);\r\n                D0 = _mm_srli_epi16(D0, 7);\r\n                D0 = _mm_packus_epi16(D0, zero);\r\n\r\n                ((uint32_t*)(dst))[0] = _mm_cvtsi128_si32(D0);\r\n                D0 = 
_mm_srli_si128(D0, 4);\r\n                ((uint32_t*)(dst + i_dst))[0] = _mm_cvtsi128_si32(D0);\r\n                //_mm_maskmoveu_si128(D0, mask, (char*)(dst + i_dst));\r\n                //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n                dst += i_dst2;\r\n            }\r\n        } else if (bsx & 0x0f) {\r\n            __m128i D0;\r\n\r\n            for (j = 0; j < bsy; j++) {\r\n                int idx = tab_idx_mode_9[j];\r\n                c0 = _mm_load_si128((__m128i*)tab_coeff_mode_9[j]);\r\n\r\n                S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n                S1 = _mm_srli_si128(S0, 1);\r\n                S2 = _mm_srli_si128(S0, 2);\r\n                S3 = _mm_srli_si128(S0, 3);\r\n\r\n                t0 = _mm_unpacklo_epi8(S0, S1);\r\n                t1 = _mm_unpacklo_epi8(S2, S3);\r\n                t2 = _mm_unpacklo_epi16(t0, t1);\r\n                t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                t0 = _mm_maddubs_epi16(t2, c0);\r\n                t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                D0 = _mm_hadds_epi16(t0, t1);\r\n                D0 = _mm_add_epi16(D0, off);\r\n                D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                D0 = _mm_packus_epi16(D0, _mm_setzero_si128());\r\n\r\n                _mm_storel_epi64((__m128i*)(dst), D0);\r\n                //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            for (j = 0; j < bsy; j++) {\r\n                __m128i D0, D1;\r\n\r\n                int idx = tab_idx_mode_9[j];\r\n                c0 = _mm_load_si128((__m128i*)tab_coeff_mode_9[j]);\r\n\r\n                for (i = 0; i < bsx; i += 16, idx += 16) {\r\n                    S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n                    S1 = _mm_loadu_si128((__m128i*)(src + idx + 1));\r\n       
             S2 = _mm_loadu_si128((__m128i*)(src + idx + 2));\r\n                    S3 = _mm_loadu_si128((__m128i*)(src + idx + 3));\r\n\r\n                    t0 = _mm_unpacklo_epi8(S0, S1);\r\n                    t1 = _mm_unpacklo_epi8(S2, S3);\r\n                    t2 = _mm_unpacklo_epi16(t0, t1);\r\n                    t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                    t0 = _mm_maddubs_epi16(t2, c0);\r\n                    t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                    D0 = _mm_hadds_epi16(t0, t1);\r\n                    D0 = _mm_add_epi16(D0, off);\r\n                    D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                    t0 = _mm_unpackhi_epi8(S0, S1);\r\n                    t1 = _mm_unpackhi_epi8(S2, S3);\r\n                    t2 = _mm_unpacklo_epi16(t0, t1);\r\n                    t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                    t0 = _mm_maddubs_epi16(t2, c0);\r\n                    t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                    D1 = _mm_hadds_epi16(t0, t1);\r\n                    D1 = _mm_add_epi16(D1, off);\r\n                    D1 = _mm_srli_epi16(D1, 7);\r\n\r\n                    D0 = _mm_packus_epi16(D0, D1);\r\n\r\n                    _mm_storeu_si128((__m128i*)(dst + i), D0);\r\n                    //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n                }\r\n\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    } else {\r\n        if (bsx & 0x07) {\r\n            for (j = 0; j < bsy; j++) {\r\n                int real_width;\r\n                int idx = tab_idx_mode_9[j];\r\n\r\n                real_width = DAVS2_MIN(bsx, iWidth2 - idx + 1);\r\n\r\n                if (real_width <= 0) {\r\n                    pel_t val = (pel_t)((src[iWidth2] * tab_coeff_mode_9[j][0] + src[iWidth2 + 1] * tab_coeff_mode_9[j][1] + src[iWidth2 + 2] * tab_coeff_mode_9[j][2] + src[iWidth2 + 3] * tab_coeff_mode_9[j][3] + 64) >> 7);\r\n  
                  __m128i D0 = _mm_set1_epi8((char)val);\r\n                    _mm_storel_epi64((__m128i*)(dst), D0);\r\n                    dst += i_dst;\r\n                    j++;\r\n\r\n                    for (; j < bsy; j++) {\r\n                        val = (pel_t)((src[iWidth2] * tab_coeff_mode_9[j][0] + src[iWidth2 + 1] * tab_coeff_mode_9[j][1] + src[iWidth2 + 2] * tab_coeff_mode_9[j][2] + src[iWidth2 + 3] * tab_coeff_mode_9[j][3] + 64) >> 7);\r\n                        D0 = _mm_set1_epi8((char)val);\r\n                        _mm_storel_epi64((__m128i*)(dst), D0);\r\n                        dst += i_dst;\r\n                    }\r\n                    break;\r\n                } else {\r\n                    __m128i D0;\r\n                    c0 = _mm_load_si128((__m128i*)tab_coeff_mode_9[j]);\r\n\r\n                    S0 = _mm_loadl_epi64((__m128i*)(src + idx));\r\n                    S1 = _mm_srli_si128(S0, 1);\r\n                    S2 = _mm_srli_si128(S0, 2);\r\n                    S3 = _mm_srli_si128(S0, 3);\r\n\r\n                    t0 = _mm_unpacklo_epi8(S0, S1);\r\n                    t1 = _mm_unpacklo_epi8(S2, S3);\r\n                    t2 = _mm_unpacklo_epi16(t0, t1);\r\n\r\n                    t0 = _mm_maddubs_epi16(t2, c0);\r\n\r\n                    D0 = _mm_hadds_epi16(t0, zero);\r\n                    D0 = _mm_add_epi16(D0, off);\r\n                    D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                    D0 = _mm_packus_epi16(D0, zero);\r\n\r\n                    _mm_storel_epi64((__m128i*)(dst), D0);\r\n\r\n                    if (real_width < bsx) {\r\n                        D0 = _mm_set1_epi8((char)dst[real_width - 1]);\r\n                        _mm_storel_epi64((__m128i*)(dst + real_width), D0);\r\n                    }\r\n                }\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx & 0x0f) {\r\n            for (j = 0; j < bsy; j++) {\r\n                int real_width;\r\n                int idx 
= tab_idx_mode_9[j];\r\n\r\n                real_width = DAVS2_MIN(bsx, iWidth2 - idx + 1);\r\n\r\n                if (real_width <= 0) {\r\n                    pel_t val = (pel_t)((src[iWidth2] * tab_coeff_mode_9[j][0] + src[iWidth2 + 1] * tab_coeff_mode_9[j][1] + src[iWidth2 + 2] * tab_coeff_mode_9[j][2] + src[iWidth2 + 3] * tab_coeff_mode_9[j][3] + 64) >> 7);\r\n                    __m128i D0 = _mm_set1_epi8((char)val);\r\n                    _mm_storel_epi64((__m128i*)(dst), D0);\r\n                    dst += i_dst;\r\n                    j++;\r\n\r\n                    for (; j < bsy; j++) {\r\n                        val = (pel_t)((src[iWidth2] * tab_coeff_mode_9[j][0] + src[iWidth2 + 1] * tab_coeff_mode_9[j][1] + src[iWidth2 + 2] * tab_coeff_mode_9[j][2] + src[iWidth2 + 3] * tab_coeff_mode_9[j][3] + 64) >> 7);\r\n                        D0 = _mm_set1_epi8((char)val);\r\n                        _mm_storel_epi64((__m128i*)(dst), D0);\r\n                        dst += i_dst;\r\n                    }\r\n                    break;\r\n                } else {\r\n                    __m128i D0;\r\n                    c0 = _mm_load_si128((__m128i*)tab_coeff_mode_9[j]);\r\n\r\n                    S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n                    S1 = _mm_srli_si128(S0, 1);\r\n                    S2 = _mm_srli_si128(S0, 2);\r\n                    S3 = _mm_srli_si128(S0, 3);\r\n\r\n                    t0 = _mm_unpacklo_epi8(S0, S1);\r\n                    t1 = _mm_unpacklo_epi8(S2, S3);\r\n                    t2 = _mm_unpacklo_epi16(t0, t1);\r\n                    t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                    t0 = _mm_maddubs_epi16(t2, c0);\r\n                    t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                    D0 = _mm_hadds_epi16(t0, t1);\r\n                    D0 = _mm_add_epi16(D0, off);\r\n                    D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                    D0 = _mm_packus_epi16(D0, zero);\r\n\r\n                    
_mm_storel_epi64((__m128i*)(dst), D0);\r\n                    //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n\r\n                    if (real_width < bsx) {\r\n                        D0 = _mm_set1_epi8((char)dst[real_width - 1]);\r\n                        _mm_storel_epi64((__m128i*)(dst + real_width), D0);\r\n                    }\r\n\r\n                }\r\n\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            for (j = 0; j < bsy; j++) {\r\n                int real_width;\r\n                int idx = tab_idx_mode_9[j];\r\n\r\n                real_width = DAVS2_MIN(bsx, iWidth2 - idx + 1);\r\n\r\n                if (real_width <= 0) {\r\n                    pel_t val = (pel_t)((src[iWidth2] * tab_coeff_mode_9[j][0] + src[iWidth2 + 1] * tab_coeff_mode_9[j][1] + src[iWidth2 + 2] * tab_coeff_mode_9[j][2] + src[iWidth2 + 3] * tab_coeff_mode_9[j][3] + 64) >> 7);\r\n                    __m128i D0 = _mm_set1_epi8((char)val);\r\n\r\n                    for (i = 0; i < bsx; i += 16) {\r\n                        _mm_storeu_si128((__m128i*)(dst + i), D0);\r\n                    }\r\n                    dst += i_dst;\r\n                    j++;\r\n\r\n                    for (; j < bsy; j++) {\r\n                        val = (pel_t)((src[iWidth2] * tab_coeff_mode_9[j][0] + src[iWidth2 + 1] * tab_coeff_mode_9[j][1] + src[iWidth2 + 2] * tab_coeff_mode_9[j][2] + src[iWidth2 + 3] * tab_coeff_mode_9[j][3] + 64) >> 7);\r\n                        D0 = _mm_set1_epi8((char)val);\r\n                        for (i = 0; i < bsx; i += 16) {\r\n                            _mm_storeu_si128((__m128i*)(dst + i), D0);\r\n                        }\r\n                        dst += i_dst;\r\n                    }\r\n                    break;\r\n                } else {\r\n                    __m128i D0, D1;\r\n\r\n                    c0 = _mm_load_si128((__m128i*)tab_coeff_mode_9[j]);\r\n            
        for (i = 0; i < real_width; i += 16, idx += 16) {\r\n                        S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n                        S1 = _mm_loadu_si128((__m128i*)(src + idx + 1));\r\n                        S2 = _mm_loadu_si128((__m128i*)(src + idx + 2));\r\n                        S3 = _mm_loadu_si128((__m128i*)(src + idx + 3));\r\n\r\n                        t0 = _mm_unpacklo_epi8(S0, S1);\r\n                        t1 = _mm_unpacklo_epi8(S2, S3);\r\n                        t2 = _mm_unpacklo_epi16(t0, t1);\r\n                        t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                        t0 = _mm_maddubs_epi16(t2, c0);\r\n                        t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                        D0 = _mm_hadds_epi16(t0, t1);\r\n                        D0 = _mm_add_epi16(D0, off);\r\n                        D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                        t0 = _mm_unpackhi_epi8(S0, S1);\r\n                        t1 = _mm_unpackhi_epi8(S2, S3);\r\n                        t2 = _mm_unpacklo_epi16(t0, t1);\r\n                        t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                        t0 = _mm_maddubs_epi16(t2, c0);\r\n                        t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                        D1 = _mm_hadds_epi16(t0, t1);\r\n                        D1 = _mm_add_epi16(D1, off);\r\n                        D1 = _mm_srli_epi16(D1, 7);\r\n\r\n                        D0 = _mm_packus_epi16(D0, D1);\r\n\r\n                        _mm_store_si128((__m128i*)(dst + i), D0);\r\n                        //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n                    }\r\n\r\n                    if (real_width < bsx) {\r\n                        D0 = _mm_set1_epi8((char)dst[real_width - 1]);\r\n                        for (i = real_width; i < bsx; i += 16) {\r\n                            _mm_storeu_si128((__m128i*)(dst + i), D0);\r\n      
                      //dst[i] = dst[real_width - 1];\r\n                        }\r\n                    }\r\n\r\n                }\r\n\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_x_10_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    pel_t *dst1 = dst;\r\n    pel_t *dst2 = dst1 + i_dst;\r\n    pel_t *dst3 = dst2 + i_dst;\r\n    pel_t *dst4 = dst3 + i_dst;\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i coeff3 = _mm_set1_epi16(3);\r\n    __m128i coeff4 = _mm_set1_epi16(4);\r\n    __m128i coeff5 = _mm_set1_epi16(5);\r\n    __m128i coeff7 = _mm_set1_epi16(7);\r\n    __m128i coeff8 = _mm_set1_epi16(8);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsy != 4) {\r\n        ALIGN16(pel_t first_line[4 * (64 + 32)]);\r\n        int line_size = bsx + bsy / 4 - 1;\r\n        int aligned_line_size = ((line_size + 31) >> 4) << 4;\r\n        pel_t *pfirst[4];\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = first_line + aligned_line_size;\r\n        pfirst[2] = first_line + aligned_line_size * 2;\r\n        pfirst[3] = first_line + aligned_line_size * 3;\r\n\r\n        for (i = 0; i < line_size - 8; i += 16, src += 16) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n         
   __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_mullo_epi16(p01, coeff3);\r\n            p11 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(p11, coeff4);\r\n            p01 = _mm_add_epi16(p11, p01);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[1][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            
p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_add_epi16(L2, L3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_add_epi16(H2, H3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            p01 = _mm_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&pfirst[3][i], p00);\r\n        }\r\n\r\n        if (i < line_size) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            
p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[1][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_add_epi16(L2, L3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[3][i], p00);\r\n        }\r\n\r\n        bsy >>= 2;\r\n\r\n        if (bsx != 8) {\r\n            int i_dstx4 = i_dst << 2;\r\n            switch (bsx) {\r\n                case 4:\r\n                    for (i = 0; i < bsy; i++) {\r\n                        CP32(dst1, pfirst[0] + i); dst1 += i_dstx4;\r\n                        CP32(dst2, pfirst[1] + i); dst2 += i_dstx4;\r\n                        CP32(dst3, pfirst[2] + i); dst3 
+= i_dstx4;\r\n                        CP32(dst4, pfirst[3] + i); dst4 += i_dstx4;\r\n                    }\r\n                    break;\r\n                case 16:\r\n                    for (i = 0; i < bsy; i++) {\r\n                        memcpy(dst1, pfirst[0] + i, 16 * sizeof(pel_t)); dst1 += i_dstx4;\r\n                        memcpy(dst2, pfirst[1] + i, 16 * sizeof(pel_t)); dst2 += i_dstx4;\r\n                        memcpy(dst3, pfirst[2] + i, 16 * sizeof(pel_t)); dst3 += i_dstx4;\r\n                        memcpy(dst4, pfirst[3] + i, 16 * sizeof(pel_t)); dst4 += i_dstx4;\r\n                    }\r\n                    break;\r\n                case 32:\r\n                    for (i = 0; i < bsy; i++) {\r\n                        memcpy(dst1, pfirst[0] + i, 32 * sizeof(pel_t)); dst1 += i_dstx4;\r\n                        memcpy(dst2, pfirst[1] + i, 32 * sizeof(pel_t)); dst2 += i_dstx4;\r\n                        memcpy(dst3, pfirst[2] + i, 32 * sizeof(pel_t)); dst3 += i_dstx4;\r\n                        memcpy(dst4, pfirst[3] + i, 32 * sizeof(pel_t)); dst4 += i_dstx4;\r\n                    }\r\n                    break;\r\n                case 64:\r\n                    for (i = 0; i < bsy; i++) {\r\n                        memcpy(dst1, pfirst[0] + i, 64 * sizeof(pel_t)); dst1 += i_dstx4;\r\n                        memcpy(dst2, pfirst[1] + i, 64 * sizeof(pel_t)); dst2 += i_dstx4;\r\n                        memcpy(dst3, pfirst[2] + i, 64 * sizeof(pel_t)); dst3 += i_dstx4;\r\n                        memcpy(dst4, pfirst[3] + i, 64 * sizeof(pel_t)); dst4 += i_dstx4;\r\n                    }\r\n                    break;\r\n                default:\r\n                    assert(0);\r\n                    break;\r\n            }\r\n\r\n        } else {\r\n            if (bsy == 2) {\r\n                for (i = 0; i < bsy; i++) {\r\n                    CP64(dst1, pfirst[0] + i);\r\n                    CP64(dst2, pfirst[1] + i);\r\n                    
CP64(dst3, pfirst[2] + i);\r\n                    CP64(dst4, pfirst[3] + i);\r\n                    dst1 = dst4 + i_dst;\r\n                    dst2 = dst1 + i_dst;\r\n                    dst3 = dst2 + i_dst;\r\n                    dst4 = dst3 + i_dst;\r\n                }\r\n            } else {\r\n                __m128i M1 = _mm_loadu_si128((__m128i*)&pfirst[0][0]);\r\n                __m128i M2 = _mm_loadu_si128((__m128i*)&pfirst[1][0]);\r\n                __m128i M3 = _mm_loadu_si128((__m128i*)&pfirst[2][0]);\r\n                __m128i M4 = _mm_loadu_si128((__m128i*)&pfirst[3][0]);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = 
dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 
1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n            }\r\n        }\r\n    } else {\r\n        if (bsx == 16) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n          
  p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst1, p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_mullo_epi16(p01, coeff3);\r\n            p11 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(p11, coeff4);\r\n            p01 = _mm_add_epi16(p11, p01);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst2, p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = 
_mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst3, p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_add_epi16(L2, L3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_add_epi16(H2, H3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            p01 = _mm_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst4, p00);\r\n        } else {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 2));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)(dst1))[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n       
     p00 = _mm_add_epi16(p10, p00);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)(dst2))[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)(dst3))[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_add_epi16(L2, L3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)(dst4))[0] = _mm_cvtsi128_si32(p00);\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_x_11_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i, j, idx;\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i S0, S1, S2, S3;\r\n    __m128i t0, t1, t2, t3;\r\n    __m128i off = _mm_set1_epi16(64);\r\n    __m128i c0;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx & 0x07) {\r\n        __m128i D0;\r\n        int i_dst2 = i_dst << 1;\r\n\r\n        for (j = 0; j < bsy; j += 2) {\r\n            idx = (j + 1) >> 3;\r\n            c0 = _mm_load_si128((__m128i*)tab_coeff_mode_11[j & 0x07]);\r\n\r\n            S0 = _mm_loadl_epi64((__m128i*)(src + idx));\r\n            S1 = _mm_srli_si128(S0, 1);\r\n            S2 = _mm_srli_si128(S0, 2);\r\n            S3 = _mm_srli_si128(S0, 3);\r\n\r\n            t0 = _mm_unpacklo_epi8(S0, S1);\r\n            t1 
= _mm_unpacklo_epi8(S2, S3);\r\n            t2 = _mm_unpacklo_epi16(t0, t1);\r\n\r\n            t0 = _mm_maddubs_epi16(t2, c0);\r\n\r\n            idx = (j + 2) >> 3;\r\n            c0 = _mm_load_si128((__m128i*)tab_coeff_mode_11[(j + 1) & 0x07]);\r\n            S0 = _mm_loadl_epi64((__m128i*)(src + idx));\r\n            S1 = _mm_srli_si128(S0, 1);\r\n            S2 = _mm_srli_si128(S0, 2);\r\n            S3 = _mm_srli_si128(S0, 3);\r\n\r\n            t1 = _mm_unpacklo_epi8(S0, S1);\r\n            t2 = _mm_unpacklo_epi8(S2, S3);\r\n            t1 = _mm_unpacklo_epi16(t1, t2);\r\n\r\n            t1 = _mm_maddubs_epi16(t1, c0);\r\n\r\n            D0 = _mm_hadds_epi16(t0, t1);\r\n            D0 = _mm_add_epi16(D0, off);\r\n            D0 = _mm_srli_epi16(D0, 7);\r\n            D0 = _mm_packus_epi16(D0, zero);\r\n\r\n            ((uint32_t*)(dst))[0] = _mm_cvtsi128_si32(D0);\r\n            D0 = _mm_srli_si128(D0, 4);\r\n            ((uint32_t*)(dst + i_dst))[0] = _mm_cvtsi128_si32(D0);\r\n            //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n            dst += i_dst2;\r\n        }\r\n    } else if (bsx & 0x0f) {\r\n        __m128i D0;\r\n\r\n        for (j = 0; j < bsy; j++) {\r\n            idx = (j + 1) >> 3;\r\n            c0 = _mm_load_si128((__m128i*)tab_coeff_mode_11[j & 0x07]);\r\n\r\n            S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n            S1 = _mm_srli_si128(S0, 1);\r\n            S2 = _mm_srli_si128(S0, 2);\r\n            S3 = _mm_srli_si128(S0, 3);\r\n\r\n            t0 = _mm_unpacklo_epi8(S0, S1);\r\n            t1 = _mm_unpacklo_epi8(S2, S3);\r\n            t2 = _mm_unpacklo_epi16(t0, t1);\r\n            t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n            t0 = _mm_maddubs_epi16(t2, c0);\r\n            t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n            D0 = _mm_hadds_epi16(t0, t1);\r\n            D0 = _mm_add_epi16(D0, off);\r\n            D0 = _mm_srli_epi16(D0, 7);\r\n\r\n  
          D0 = _mm_packus_epi16(D0, _mm_setzero_si128());\r\n\r\n            _mm_storel_epi64((__m128i*)(dst), D0);\r\n            //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (j = 0; j < bsy; j++) {\r\n            __m128i D0, D1;\r\n\r\n            idx = (j + 1) >> 3;\r\n            c0 = _mm_load_si128((__m128i*)tab_coeff_mode_11[j & 0x07]);\r\n\r\n            for (i = 0; i < bsx; i += 16, idx += 16) {\r\n                S0 = _mm_loadu_si128((__m128i*)(src + idx));\r\n                S1 = _mm_loadu_si128((__m128i*)(src + idx + 1));\r\n                S2 = _mm_loadu_si128((__m128i*)(src + idx + 2));\r\n                S3 = _mm_loadu_si128((__m128i*)(src + idx + 3));\r\n\r\n                t0 = _mm_unpacklo_epi8(S0, S1);\r\n                t1 = _mm_unpacklo_epi8(S2, S3);\r\n                t2 = _mm_unpacklo_epi16(t0, t1);\r\n                t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                t0 = _mm_maddubs_epi16(t2, c0);\r\n                t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                D0 = _mm_hadds_epi16(t0, t1);\r\n                D0 = _mm_add_epi16(D0, off);\r\n                D0 = _mm_srli_epi16(D0, 7);\r\n\r\n                t0 = _mm_unpackhi_epi8(S0, S1);\r\n                t1 = _mm_unpackhi_epi8(S2, S3);\r\n                t2 = _mm_unpacklo_epi16(t0, t1);\r\n                t3 = _mm_unpackhi_epi16(t0, t1);\r\n\r\n                t0 = _mm_maddubs_epi16(t2, c0);\r\n                t1 = _mm_maddubs_epi16(t3, c0);\r\n\r\n                D1 = _mm_hadds_epi16(t0, t1);\r\n                D1 = _mm_add_epi16(D1, off);\r\n                D1 = _mm_srli_epi16(D1, 7);\r\n\r\n                D0 = _mm_packus_epi16(D0, D1);\r\n\r\n                _mm_storeu_si128((__m128i*)(dst + i), D0);\r\n                //dst[i] = (pel_t)((src[idx] * c1 + src[idx + 1] * c2 + src[idx + 2] * c3 + src[idx + 3] * c4 + 64) >> 7);\r\n         
   }\r\n\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_y_25_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx > 8) {\r\n        ALIGN16(pel_t first_line[64 + (64 << 3)]);\r\n        int line_size = bsx + ((bsy - 1) << 3);\r\n        int iHeight8 = bsy << 3;\r\n        pel_t *pfirst = first_line;\r\n\r\n        __m128i coeff0 = _mm_setr_epi16(7, 3, 5, 1, 3, 1, 1, 0);\r\n        __m128i coeff1 = _mm_setr_epi16(15, 7, 13, 3, 11, 5, 9, 1);\r\n        __m128i coeff2 = _mm_setr_epi16(9, 5, 11, 3, 13, 7, 15, 2);\r\n        __m128i coeff3 = _mm_setr_epi16(1, 1, 3, 1, 5, 3, 7, 1);\r\n        __m128i coeff4 = _mm_setr_epi16(16, 8, 16, 4, 16, 8, 16, 2);\r\n        __m128i coeff5 = _mm_setr_epi16(1, 2, 1, 4, 1, 2, 1, 8);\r\n\r\n        __m128i p00, p10, p20, p30;\r\n\r\n        __m128i L0 = _mm_set1_epi16(src[0]);\r\n        __m128i L1 = _mm_set1_epi16(src[-1]);\r\n        __m128i L2 = _mm_set1_epi16(src[-2]);\r\n        __m128i L3 = _mm_set1_epi16(src[-3]);\r\n\r\n        src -= 4;\r\n\r\n        for (i = 0; i < line_size - 24; i += 32, src -= 4) {\r\n            p00 = _mm_mullo_epi16(L0, coeff0);\r\n            p10 = _mm_mullo_epi16(L1, coeff1);\r\n            p20 = _mm_mullo_epi16(L2, coeff2);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst, p00);\r\n\r\n            pfirst += 8;\r\n            L0 = _mm_set1_epi16(src[0]);\r\n\r\n            p00 = 
_mm_mullo_epi16(L1, coeff0);\r\n            p10 = _mm_mullo_epi16(L2, coeff1);\r\n            p20 = _mm_mullo_epi16(L3, coeff2);\r\n            p30 = _mm_mullo_epi16(L0, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst, p00);\r\n\r\n            pfirst += 8;\r\n            L1 = _mm_set1_epi16(src[-1]);\r\n\r\n            p00 = _mm_mullo_epi16(L2, coeff0);\r\n            p10 = _mm_mullo_epi16(L3, coeff1);\r\n            p20 = _mm_mullo_epi16(L0, coeff2);\r\n            p30 = _mm_mullo_epi16(L1, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst, p00);\r\n\r\n            pfirst += 8;\r\n            L2 = _mm_set1_epi16(src[-2]);\r\n\r\n            p00 = _mm_mullo_epi16(L3, coeff0);\r\n            p10 = _mm_mullo_epi16(L0, coeff1);\r\n            p20 = _mm_mullo_epi16(L1, coeff2);\r\n            p30 = _mm_mullo_epi16(L2, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst, p00);\r\n\r\n            pfirst += 8;\r\n            L3 = _mm_set1_epi16(src[-3]);\r\n   
     }\r\n\r\n        if (bsx == 16) {\r\n            p00 = _mm_mullo_epi16(L0, coeff0);\r\n            p10 = _mm_mullo_epi16(L1, coeff1);\r\n            p20 = _mm_mullo_epi16(L2, coeff2);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst, p00);\r\n        } else {\r\n            p00 = _mm_mullo_epi16(L0, coeff0);\r\n            p10 = _mm_mullo_epi16(L1, coeff1);\r\n            p20 = _mm_mullo_epi16(L2, coeff2);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst, p00);\r\n\r\n            pfirst += 8;\r\n            L0 = _mm_set1_epi16(src[0]);\r\n\r\n            p00 = _mm_mullo_epi16(L1, coeff0);\r\n            p10 = _mm_mullo_epi16(L2, coeff1);\r\n            p20 = _mm_mullo_epi16(L3, coeff2);\r\n            p30 = _mm_mullo_epi16(L0, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst, p00);\r\n\r\n            pfirst += 8;\r\n            L1 = _mm_set1_epi16(src[-1]);\r\n\r\n 
           p00 = _mm_mullo_epi16(L2, coeff0);\r\n            p10 = _mm_mullo_epi16(L3, coeff1);\r\n            p20 = _mm_mullo_epi16(L0, coeff2);\r\n            p30 = _mm_mullo_epi16(L1, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst, p00);\r\n        }\r\n\r\n        for (i = 0; i < iHeight8; i += 8) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 8) {\r\n        __m128i coeff0 = _mm_setr_epi16(7, 3, 5, 1, 3, 1, 1, 0);\r\n        __m128i coeff1 = _mm_setr_epi16(15, 7, 13, 3, 11, 5, 9, 1);\r\n        __m128i coeff2 = _mm_setr_epi16(9, 5, 11, 3, 13, 7, 15, 2);\r\n        __m128i coeff3 = _mm_setr_epi16(1, 1, 3, 1, 5, 3, 7, 1);\r\n        __m128i coeff4 = _mm_setr_epi16(16, 8, 16, 4, 16, 8, 16, 2);\r\n        __m128i coeff5 = _mm_setr_epi16(1, 2, 1, 4, 1, 2, 1, 8);\r\n\r\n        __m128i p00, p10, p20, p30;\r\n\r\n        __m128i L0 = _mm_set1_epi16(src[0]);\r\n        __m128i L1 = _mm_set1_epi16(src[-1]);\r\n        __m128i L2 = _mm_set1_epi16(src[-2]);\r\n        __m128i L3 = _mm_set1_epi16(src[-3]);\r\n        src -= 4;\r\n\r\n        bsy >>= 2;\r\n        for (i = 0; i < bsy; i++, src -= 4) {\r\n            p00 = _mm_mullo_epi16(L0, coeff0);\r\n            p10 = _mm_mullo_epi16(L1, coeff1);\r\n            p20 = _mm_mullo_epi16(L2, coeff2);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            
p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n\r\n            dst += i_dst;\r\n            L0 = _mm_set1_epi16(src[0]);\r\n\r\n            p00 = _mm_mullo_epi16(L1, coeff0);\r\n            p10 = _mm_mullo_epi16(L2, coeff1);\r\n            p20 = _mm_mullo_epi16(L3, coeff2);\r\n            p30 = _mm_mullo_epi16(L0, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n\r\n            dst += i_dst;\r\n            L1 = _mm_set1_epi16(src[-1]);\r\n\r\n            p00 = _mm_mullo_epi16(L2, coeff0);\r\n            p10 = _mm_mullo_epi16(L3, coeff1);\r\n            p20 = _mm_mullo_epi16(L0, coeff2);\r\n            p30 = _mm_mullo_epi16(L1, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n\r\n            dst += i_dst;\r\n            L2 = _mm_set1_epi16(src[-2]);\r\n\r\n            p00 = _mm_mullo_epi16(L3, coeff0);\r\n            p10 = _mm_mullo_epi16(L0, coeff1);\r\n            p20 = _mm_mullo_epi16(L1, coeff2);\r\n            p30 = _mm_mullo_epi16(L2, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, 
coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n\r\n            dst += i_dst;\r\n            L3 = _mm_set1_epi16(src[-3]);\r\n        }\r\n    } else {\r\n        __m128i zero = _mm_setzero_si128();\r\n        __m128i coeff3 = _mm_set1_epi16(3);\r\n        __m128i coeff4 = _mm_set1_epi16(4);\r\n        __m128i coeff5 = _mm_set1_epi16(5);\r\n        __m128i coeff7 = _mm_set1_epi16(7);\r\n        __m128i coeff8 = _mm_set1_epi16(8);\r\n        __m128i coeff9 = _mm_set1_epi16(9);\r\n        __m128i coeff11 = _mm_set1_epi16(11);\r\n        __m128i coeff13 = _mm_set1_epi16(13);\r\n        __m128i coeff15 = _mm_set1_epi16(15);\r\n        __m128i coeff16 = _mm_set1_epi16(16);\r\n        __m128i shuffle = _mm_setr_epi8(7, 15, 6, 14, 5, 13, 4, 12, 3, 11, 2, 10, 1, 9, 0, 8);\r\n\r\n        if (bsy == 4) {\r\n            src -= 15;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i M2, M4, M6, M8;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src - 2));\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff7);\r\n            p11 = _mm_mullo_epi16(H1, coeff15);\r\n            p21 = _mm_mullo_epi16(H2, coeff9);\r\n            p31 = _mm_add_epi16(H3, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M2 = _mm_srli_epi16(p01, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = 
_mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M4 = _mm_srli_epi16(p01, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff5);\r\n            p11 = _mm_mullo_epi16(H1, coeff13);\r\n            p21 = _mm_mullo_epi16(H2, coeff11);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M6 = _mm_srli_epi16(p01, 5);\r\n\r\n            p01 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            M8 = _mm_srli_epi16(p01, 3);\r\n\r\n            M2 = _mm_packus_epi16(M2, M4);\r\n            M6 = _mm_packus_epi16(M6, M8);\r\n            M2 = _mm_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm_shuffle_epi8(M6, shuffle);\r\n\r\n            M4 = _mm_unpacklo_epi16(M2, M6);\r\n\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n        } else {\r\n            src -= 15;\r\n\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i M1, M2, M3, M4, M5, M6, M7, M8;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = 
_mm_loadu_si128((__m128i*)(src - 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src - 2));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff7);\r\n            p10 = _mm_mullo_epi16(L1, coeff15);\r\n            p20 = _mm_mullo_epi16(L2, coeff9);\r\n            p30 = _mm_add_epi16(L3, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M1 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff7);\r\n            p11 = _mm_mullo_epi16(H1, coeff15);\r\n            p21 = _mm_mullo_epi16(H2, coeff9);\r\n            p31 = _mm_add_epi16(H3, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M2 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M3 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n     
       p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M4 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff5);\r\n            p10 = _mm_mullo_epi16(L1, coeff13);\r\n            p20 = _mm_mullo_epi16(L2, coeff11);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M5 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff5);\r\n            p11 = _mm_mullo_epi16(H1, coeff13);\r\n            p21 = _mm_mullo_epi16(H2, coeff11);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M6 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            M7 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            M8 = _mm_srli_epi16(p01, 3);\r\n\r\n            M1 = _mm_packus_epi16(M1, M3);\r\n            M5 = _mm_packus_epi16(M5, M7);\r\n            M1 = _mm_shuffle_epi8(M1, shuffle);\r\n            M5 = _mm_shuffle_epi8(M5, shuffle);\r\n\r\n            M2 = _mm_packus_epi16(M2, M4);\r\n            M6 = _mm_packus_epi16(M6, M8);\r\n            M2 = _mm_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm_shuffle_epi8(M6, 
shuffle);\r\n\r\n            M3 = _mm_unpacklo_epi16(M1, M5);\r\n            M7 = _mm_unpackhi_epi16(M1, M5);\r\n            M4 = _mm_unpacklo_epi16(M2, M6);\r\n            M8 = _mm_unpackhi_epi16(M2, M6);\r\n\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            *((int*)dst) = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            M8 = _mm_srli_si128(M8, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            M8 = _mm_srli_si128(M8, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            M8 = _mm_srli_si128(M8, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            *((int*)dst) = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            M3 = _mm_srli_si128(M3, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            M3 = _mm_srli_si128(M3, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            M3 = _mm_srli_si128(M3, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            *((int*)dst) = _mm_cvtsi128_si32(M7);\r\n            dst += i_dst;\r\n            M7 = _mm_srli_si128(M7, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M7);\r\n            dst += i_dst;\r\n            M7 = _mm_srli_si128(M7, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M7);\r\n            dst += i_dst;\r\n            M7 = _mm_srli_si128(M7, 4);\r\n            *((int*)dst) = 
_mm_cvtsi128_si32(M7);\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_y_26_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n    if (bsx != 4) {\r\n        __m128i zero = _mm_setzero_si128();\r\n        __m128i coeff2 = _mm_set1_epi16(2);\r\n        __m128i coeff3 = _mm_set1_epi16(3);\r\n        __m128i coeff4 = _mm_set1_epi16(4);\r\n        __m128i coeff5 = _mm_set1_epi16(5);\r\n        __m128i coeff7 = _mm_set1_epi16(7);\r\n        __m128i coeff8 = _mm_set1_epi16(8);\r\n        __m128i shuffle = _mm_setr_epi8(7, 15, 6, 14, 5, 13, 4, 12, 3, 11, 2, 10, 1, 9, 0, 8);\r\n\r\n        ALIGN16(pel_t first_line[64 + 256]);\r\n        int line_size = bsx + (bsy - 1) * 4;\r\n        int iHeight4 = bsy << 2;\r\n\r\n        src -= 15;\r\n\r\n        for (i = 0; i < line_size - 32; i += 64, src -= 16) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i M1, M2, M3, M4, M5, M6, M7, M8;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src - 2));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = 
_mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            M1 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            M2 = _mm_srli_epi16(p01, 4);\r\n\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            M3 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_mullo_epi16(p01, coeff3);\r\n            p11 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(p11, coeff4);\r\n            p01 = _mm_add_epi16(p11, p01);\r\n            M4 = _mm_srli_epi16(p01, 3);\r\n\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M5 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M6 = _mm_srli_epi16(p01, 4);\r\n\r\n\r\n      
      p00 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_add_epi16(L2, L3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            M7 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_add_epi16(H2, H3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            M8 = _mm_srli_epi16(p01, 2);\r\n\r\n            M1 = _mm_packus_epi16(M1, M3);\r\n            M5 = _mm_packus_epi16(M5, M7);\r\n            M1 = _mm_shuffle_epi8(M1, shuffle);\r\n            M5 = _mm_shuffle_epi8(M5, shuffle);\r\n\r\n            M2 = _mm_packus_epi16(M2, M4);\r\n            M6 = _mm_packus_epi16(M6, M8);\r\n            M2 = _mm_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm_shuffle_epi8(M6, shuffle);\r\n\r\n            M3 = _mm_unpacklo_epi16(M1, M5);\r\n            M7 = _mm_unpackhi_epi16(M1, M5);\r\n            M4 = _mm_unpacklo_epi16(M2, M6);\r\n            M8 = _mm_unpackhi_epi16(M2, M6);\r\n\r\n            _mm_store_si128((__m128i*)&first_line[i], M4);\r\n            _mm_store_si128((__m128i*)&first_line[16 + i], M8);\r\n            _mm_store_si128((__m128i*)&first_line[32 + i], M3);\r\n            _mm_store_si128((__m128i*)&first_line[48 + i], M7);\r\n        }\r\n\r\n        if (i < line_size) {\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i M2, M4, M6, M8;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src - 2));\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            
p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            M2 = _mm_srli_epi16(p01, 4);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_mullo_epi16(p01, coeff3);\r\n            p11 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(p11, coeff4);\r\n            p01 = _mm_add_epi16(p11, p01);\r\n            M4 = _mm_srli_epi16(p01, 3);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M6 = _mm_srli_epi16(p01, 4);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_add_epi16(H2, H3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            M8 = _mm_srli_epi16(p01, 2);\r\n\r\n            M2 = _mm_packus_epi16(M2, M4);\r\n            M6 = _mm_packus_epi16(M6, M8);\r\n            M2 = _mm_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm_shuffle_epi8(M6, shuffle);\r\n\r\n            M4 = _mm_unpacklo_epi16(M2, M6);\r\n            M8 = _mm_unpackhi_epi16(M2, M6);\r\n\r\n            _mm_store_si128((__m128i*)&first_line[i], M4);\r\n            _mm_store_si128((__m128i*)&first_line[16 + i], M8);\r\n        }\r\n\r\n        switch (bsx) {\r\n            case 4:\r\n                for (i = 0; i < iHeight4; i += 4) {\r\n                    CP32(dst, first_line + i);\r\n                    dst += i_dst;\r\n                }\r\n                break;\r\n            case 8:\r\n                for (i = 0; i < iHeight4; i += 4) {\r\n                    
CP64(dst, first_line + i);\r\n                    dst += i_dst;\r\n                }\r\n                break;\r\n            default:\r\n                for (i = 0; i < iHeight4; i += 4) {\r\n                    memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n                    dst += i_dst;\r\n                }\r\n                break;\r\n        }\r\n    } else {\r\n        __m128i zero = _mm_setzero_si128();\r\n        __m128i coeff2 = _mm_set1_epi16(2);\r\n        __m128i coeff3 = _mm_set1_epi16(3);\r\n        __m128i coeff4 = _mm_set1_epi16(4);\r\n        __m128i coeff5 = _mm_set1_epi16(5);\r\n        __m128i coeff7 = _mm_set1_epi16(7);\r\n        __m128i coeff8 = _mm_set1_epi16(8);\r\n        __m128i shuffle = _mm_setr_epi8(7, 15, 6, 14, 5, 13, 4, 12, 3, 11, 2, 10, 1, 9, 0, 8);\r\n        src -= 15;\r\n\r\n        if (bsy == 4) {\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i M2, M4, M6, M8;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src - 2));\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            M2 = _mm_srli_epi16(p01, 4);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_mullo_epi16(p01, coeff3);\r\n            p11 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(p11, 
coeff4);\r\n            p01 = _mm_add_epi16(p11, p01);\r\n            M4 = _mm_srli_epi16(p01, 3);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M6 = _mm_srli_epi16(p01, 4);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_add_epi16(H2, H3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            M8 = _mm_srli_epi16(p01, 2);\r\n\r\n            M2 = _mm_packus_epi16(M2, M4);\r\n            M6 = _mm_packus_epi16(M6, M8);\r\n            M2 = _mm_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm_shuffle_epi8(M6, shuffle);\r\n\r\n            M4 = _mm_unpacklo_epi16(M2, M6);\r\n\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M4);\r\n        } else {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i M1, M2, M3, M4, M5, M6, M7, M8;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 3));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src - 2));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, 
zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            M1 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            M2 = _mm_srli_epi16(p01, 4);\r\n\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            M3 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_mullo_epi16(p01, coeff3);\r\n            p11 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(p11, coeff4);\r\n            p01 = _mm_add_epi16(p11, p01);\r\n            M4 = _mm_srli_epi16(p01, 3);\r\n\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M5 = 
_mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M6 = _mm_srli_epi16(p01, 4);\r\n\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_add_epi16(L2, L3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            M7 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_add_epi16(H2, H3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            M8 = _mm_srli_epi16(p01, 2);\r\n\r\n            M1 = _mm_packus_epi16(M1, M3);\r\n            M5 = _mm_packus_epi16(M5, M7);\r\n            M1 = _mm_shuffle_epi8(M1, shuffle);\r\n            M5 = _mm_shuffle_epi8(M5, shuffle);\r\n\r\n            M2 = _mm_packus_epi16(M2, M4);\r\n            M6 = _mm_packus_epi16(M6, M8);\r\n            M2 = _mm_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm_shuffle_epi8(M6, shuffle);\r\n\r\n            M3 = _mm_unpacklo_epi16(M1, M5);\r\n            M7 = _mm_unpackhi_epi16(M1, M5);\r\n            M4 = _mm_unpacklo_epi16(M2, M6);\r\n            M8 = _mm_unpackhi_epi16(M2, M6);\r\n\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M8);\r\n       
     dst += i_dst;\r\n            M8 = _mm_srli_si128(M8, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            M8 = _mm_srli_si128(M8, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            M8 = _mm_srli_si128(M8, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            M3 = _mm_srli_si128(M3, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            M3 = _mm_srli_si128(M3, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            M3 = _mm_srli_si128(M3, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M7);\r\n            dst += i_dst;\r\n            M7 = _mm_srli_si128(M7, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M7);\r\n            dst += i_dst;\r\n            M7 = _mm_srli_si128(M7, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M7);\r\n            dst += i_dst;\r\n            M7 = _mm_srli_si128(M7, 4);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(M7);\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_y_28_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 128]);\r\n    int line_size = bsx + (bsy - 1) * 2;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = DAVS2_MIN(line_size, bsy * 4);\r\n#endif\r\n    int i;\r\n    int iHeight2 = bsy << 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    __m128i pad;\r\n#endif\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i coeff3 = _mm_set1_epi16(3);\r\n    __m128i coeff4 = _mm_set1_epi16(4);\r\n    __m128i shuffle = _mm_setr_epi8(7, 
15, 6, 14, 5, 13, 4, 12, 3, 11, 2, 10, 1, 9, 0, 8);\r\n    __m128i zero = _mm_setzero_si128();\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    src -= 15;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 16; i += 32, src -= 16) {\r\n#else\r\n    for (i = 0; i < real_size - 16; i += 32, src -= 16) {\r\n#endif\r\n        __m128i p00, p10, p01, p11;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n        __m128i S3 = _mm_loadu_si128((__m128i*)(src - 3));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src - 2));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n        __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n        __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n        p00 = _mm_adds_epi16(L1, L2);\r\n        p01 = _mm_add_epi16(L1, L2);\r\n        p00 = _mm_mullo_epi16(p00, coeff3);\r\n        p10 = _mm_adds_epi16(L0, L3);\r\n        p11 = _mm_add_epi16(L2, L3);\r\n        p10 = _mm_adds_epi16(p10, coeff4);\r\n        p00 = _mm_adds_epi16(p00, p10);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_add_epi16(p01, coeff2);\r\n\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n        p01 = _mm_srli_epi16(p01, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        p00 = _mm_shuffle_epi8(p00, shuffle);\r\n\r\n        _mm_store_si128((__m128i*)&first_line[i + 16], p00);\r\n\r\n        p00 = _mm_adds_epi16(H1, H2);\r\n        p01 = _mm_add_epi16(H1, H2);\r\n        p00 = _mm_mullo_epi16(p00, coeff3);\r\n        p10 = _mm_adds_epi16(H0, H3);\r\n        p11 = _mm_add_epi16(H2, H3);\r\n        p10 = _mm_adds_epi16(p10, coeff4);\r\n        p00 = _mm_adds_epi16(p00, p10);\r\n        
p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_add_epi16(p01, coeff2);\r\n\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n        p01 = _mm_srli_epi16(p01, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        p00 = _mm_shuffle_epi8(p00, shuffle);\r\n\r\n        _mm_store_si128((__m128i*)&first_line[i], p00);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        __m128i p00, p10, p01, p11;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n        __m128i S3 = _mm_loadu_si128((__m128i*)(src - 3));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src - 2));\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n        __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n        p00 = _mm_adds_epi16(H1, H2);\r\n        p01 = _mm_add_epi16(H1, H2);\r\n        p00 = _mm_mullo_epi16(p00, coeff3);\r\n        p10 = _mm_adds_epi16(H0, H3);\r\n        p11 = _mm_add_epi16(H2, H3);\r\n        p10 = _mm_adds_epi16(p10, coeff4);\r\n        p00 = _mm_adds_epi16(p00, p10);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = _mm_add_epi16(p01, coeff2);\r\n\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n        p01 = _mm_srli_epi16(p01, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        p00 = _mm_shuffle_epi8(p00, shuffle);\r\n\r\n        _mm_store_si128((__m128i*)&first_line[i], p00);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    if (real_size < line_size) {\r\n        i = real_size;\r\n        first_line[i - 1] = first_line[i - 3];\r\n\r\n        pad = _mm_set1_epi16(((short*)&first_line[i - 2])[0]);\r\n\r\n        for (; i < line_size; i += 16) {\r\n            _mm_storeu_si128((__m128i*)&first_line[i], pad);\r\n        }\r\n    }\r\n#endif\r\n\r\n    if (bsx >= 16) {\r\n   
     for (i = 0; i < iHeight2; i += 2) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 8) {\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            _mm_storel_epi64((__m128i*)(dst), M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            _mm_storel_epi64((__m128i*)(dst), M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            _mm_storel_epi64((__m128i*)(dst), M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            _mm_storel_epi64((__m128i*)(dst), M);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_y_30_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 64]);\r\n    int line_size = bsx + bsy - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = DAVS2_MIN(line_size, bsy * 2 - 1);\r\n#endif\r\n    int i;\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i shuffle = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);\r\n    __m128i zero = _mm_setzero_si128();\r\n\r\n    
UNUSED_PARAMETER(dir_mode);\r\n\r\n    src -= 17;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 8; i += 16, src -= 16) {\r\n#else\r\n    for (i = 0; i < real_size - 8; i += 16, src -= 16) {\r\n#endif\r\n        __m128i p00, p10, p01, p11;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n        p00 = _mm_add_epi16(L0, L1);\r\n        p10 = _mm_add_epi16(L1, L2);\r\n        p01 = _mm_add_epi16(H0, H1);\r\n        p11 = _mm_add_epi16(H1, H2);\r\n\r\n        p00 = _mm_add_epi16(p00, p10);\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p00 = _mm_add_epi16(p00, coeff2);\r\n        p01 = _mm_add_epi16(p01, coeff2);\r\n\r\n        p00 = _mm_srli_epi16(p00, 2);\r\n        p01 = _mm_srli_epi16(p01, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p01);\r\n        p00 = _mm_shuffle_epi8(p00, shuffle);\r\n\r\n        _mm_store_si128((__m128i*)&first_line[i], p00);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        __m128i p01, p11;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n        p01 = _mm_add_epi16(H0, H1);\r\n        p11 = _mm_add_epi16(H1, H2);\r\n\r\n        p01 = _mm_add_epi16(p01, p11);\r\n        p01 = 
_mm_add_epi16(p01, coeff2);\r\n\r\n        p01 = _mm_srli_epi16(p01, 2);\r\n\r\n        p01 = _mm_packus_epi16(p01, p01);\r\n        p01 = _mm_shuffle_epi8(p01, shuffle);\r\n\r\n        _mm_store_si128((__m128i*)&first_line[i], p01);\r\n    }\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    for (i = real_size; i < line_size; i += 16) {\r\n        __m128i pad = _mm_set1_epi8(first_line[real_size - 1]);\r\n        _mm_storeu_si128((__m128i*)&first_line[i], pad);\r\n    }\r\n#endif\r\n\r\n    if (bsx > 16) {\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2;\r\n        if (bsy == 4) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[0]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 = dst + 8;\r\n            M = _mm_loadu_si128((__m128i*)&first_line[8]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n        } else {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[0]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n  
          _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst2 = dst1 + i_dst;\r\n            dst1 = dst + 8;\r\n            M = _mm_loadu_si128((__m128i*)&first_line[8]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n      
      _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            dst2 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            _mm_storel_epi64((__m128i*)dst2, M);\r\n            dst1 += i_dst;\r\n            M = _mm_loadu_si128((__m128i*)&first_line[16]);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n            dst1 += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst1, M);\r\n        }\r\n    } else if (bsx == 8) {\r\n        for (i = 0; i < bsy; i += 8) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            
_mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_y_31_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t dst_tran[64 * 80]);\r\n    ALIGN16(pel_t src_tran[64 * 8]);\r\n    int i_dst2 = (((bsy + 15) >> 4) << 4) + 16;\r\n    int i;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    // Mirror the left reference samples into src_tran so that the x-direction\r\n    // mode 5 predictor can be reused; its output (dst_tran) is transposed into dst below.\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < (bsy + bsx * 11 / 8 + 3); i++) {\r\n#else\r\n    for (i = 0; i < (2 * bsy + 3); i++) {\r\n#endif\r\n        src_tran[i] = src[-i];\r\n    }\r\n\r\n    intra_pred_ang_x_5_sse128(src_tran, dst_tran, i_dst2, 5, bsy, bsx);\r\n\r\n    if ((bsy > 4) && (bsx > 4)) {\r\n        pel_t *pDst_128[64];\r\n        pel_t *pTra_128[64];\r\n\r\n        int 
iSize_x = bsx >> 3;\r\n        int iSize_y = bsy >> 3;\r\n        int iSize = iSize_x * iSize_y;\r\n\r\n        for (int y = 0; y < iSize_y; y++) {\r\n            for (int x = 0; x < iSize_x; x++) {\r\n                pDst_128[x + y * iSize_x] = dst      + x * 8 + y * 8 * i_dst;\r\n                pTra_128[x + y * iSize_x] = dst_tran + y * 8 + x * 8 * i_dst2;\r\n            }\r\n        }\r\n\r\n        for (i = 0; i < iSize; i++) {\r\n            pel_t *dst_tran_org = pTra_128[i];\r\n\r\n            pel_t *dst1 = pDst_128[i];\r\n            pel_t *dst2 = dst1 + i_dst;\r\n            pel_t *dst3 = dst2 + i_dst;\r\n            pel_t *dst4 = dst3 + i_dst;\r\n            pel_t *dst5 = dst4 + i_dst;\r\n            pel_t *dst6 = dst5 + i_dst;\r\n            pel_t *dst7 = dst6 + i_dst;\r\n            pel_t *dst8 = dst7 + i_dst;\r\n            __m128i Org_8_0, Org_8_1, Org_8_2, Org_8_3, Org_8_4, Org_8_5, Org_8_6, Org_8_7;\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i t00, t10, t20, t30;\r\n            Org_8_0 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_1 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_2 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_3 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_4 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_5 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_6 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_7 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n\r\n            p00 = _mm_unpacklo_epi8(Org_8_0, Org_8_1);\r\n            p10 = _mm_unpacklo_epi8(Org_8_2, Org_8_3);\r\n            p20 = 
_mm_unpacklo_epi8(Org_8_4, Org_8_5);\r\n            p30 = _mm_unpacklo_epi8(Org_8_6, Org_8_7);\r\n\r\n            t00 = _mm_unpacklo_epi16(p00, p10);\r\n            t20 = _mm_unpacklo_epi16(p20, p30);\r\n            t10 = _mm_unpackhi_epi16(p00, p10);\r\n            t30 = _mm_unpackhi_epi16(p20, p30);\r\n\r\n            p00 = _mm_unpacklo_epi32(t00, t20);\r\n            p10 = _mm_unpackhi_epi32(t00, t20);\r\n            p20 = _mm_unpacklo_epi32(t10, t30);\r\n            p30 = _mm_unpackhi_epi32(t10, t30);\r\n\r\n            _mm_storel_epi64((__m128i*)dst1, p00);\r\n            p00 = _mm_srli_si128(p00, 8);\r\n            _mm_storel_epi64((__m128i*)dst2, p00);\r\n\r\n            _mm_storel_epi64((__m128i*)dst3, p10);\r\n            p10 = _mm_srli_si128(p10, 8);\r\n            _mm_storel_epi64((__m128i*)dst4, p10);\r\n\r\n            _mm_storel_epi64((__m128i*)dst5, p20);\r\n            p20 = _mm_srli_si128(p20, 8);\r\n            _mm_storel_epi64((__m128i*)dst6, p20);\r\n\r\n            _mm_storel_epi64((__m128i*)dst7, p30);\r\n            p30 = _mm_srli_si128(p30, 8);\r\n            _mm_storel_epi64((__m128i*)dst8, p30);\r\n        }\r\n    } else if (bsx == 16) {\r\n        for (i = 0; i < 2; i++) {\r\n            pel_t *dst_tran_org = dst_tran + i * 8 * i_dst2;\r\n\r\n            pel_t *dst1 = dst + i * 8;\r\n            pel_t *dst2 = dst1 + i_dst;\r\n            pel_t *dst3 = dst2 + i_dst;\r\n            pel_t *dst4 = dst3 + i_dst;\r\n\r\n            __m128i Org_8_0, Org_8_1, Org_8_2, Org_8_3, Org_8_4, Org_8_5, Org_8_6, Org_8_7;\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i t00, t20;\r\n            Org_8_0 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_1 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_2 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_3 = 
_mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_4 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_5 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_6 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n            Org_8_7 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n            dst_tran_org += i_dst2;\r\n\r\n            p00 = _mm_unpacklo_epi8(Org_8_0, Org_8_1);\r\n            p10 = _mm_unpacklo_epi8(Org_8_2, Org_8_3);\r\n            p20 = _mm_unpacklo_epi8(Org_8_4, Org_8_5);\r\n            p30 = _mm_unpacklo_epi8(Org_8_6, Org_8_7);\r\n\r\n            t00 = _mm_unpacklo_epi16(p00, p10);\r\n            t20 = _mm_unpacklo_epi16(p20, p30);\r\n\r\n            p00 = _mm_unpacklo_epi32(t00, t20);\r\n            p10 = _mm_unpackhi_epi32(t00, t20);\r\n\r\n            _mm_storel_epi64((__m128i*)dst1, p00);\r\n            p00 = _mm_srli_si128(p00, 8);\r\n            _mm_storel_epi64((__m128i*)dst2, p00);\r\n\r\n            _mm_storel_epi64((__m128i*)dst3, p10);\r\n            p10 = _mm_srli_si128(p10, 8);\r\n            _mm_storel_epi64((__m128i*)dst4, p10);\r\n        }\r\n    } else if (bsy == 16) {//bsx == 4\r\n        pel_t *dst_tran_org = dst_tran;\r\n\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n\r\n        __m128i Org_8_0, Org_8_1, Org_8_2, Org_8_3;\r\n        __m128i p00, p10;\r\n        __m128i t00, t10;\r\n        Org_8_0 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n        dst_tran_org += i_dst2;\r\n        Org_8_1 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n        dst_tran_org += i_dst2;\r\n        Org_8_2 = 
_mm_loadu_si128((__m128i*)dst_tran_org);\r\n        dst_tran_org += i_dst2;\r\n        Org_8_3 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n        dst_tran_org += i_dst2;\r\n\r\n        p00 = _mm_unpacklo_epi8(Org_8_0, Org_8_1);\r\n        p10 = _mm_unpacklo_epi8(Org_8_2, Org_8_3);\r\n\r\n        t00 = _mm_unpacklo_epi16(p00, p10);\r\n        t10 = _mm_unpackhi_epi16(p00, p10);\r\n\r\n        *((int*)(dst1)) = _mm_cvtsi128_si32(t00);\r\n        t00 = _mm_srli_si128(t00, 4);\r\n        *((int*)(dst2)) = _mm_cvtsi128_si32(t00);\r\n        t00 = _mm_srli_si128(t00, 4);\r\n        *((int*)(dst3)) = _mm_cvtsi128_si32(t00);\r\n        t00 = _mm_srli_si128(t00, 4);\r\n        *((int*)(dst4)) = _mm_cvtsi128_si32(t00);\r\n\r\n        *((int*)(dst5)) = _mm_cvtsi128_si32(t10);\r\n        t10 = _mm_srli_si128(t10, 4);\r\n        *((int*)(dst6)) = _mm_cvtsi128_si32(t10);\r\n        t10 = _mm_srli_si128(t10, 4);\r\n        *((int*)(dst7)) = _mm_cvtsi128_si32(t10);\r\n        t10 = _mm_srli_si128(t10, 4);\r\n        *((int*)(dst8)) = _mm_cvtsi128_si32(t10);\r\n\r\n        dst1 = dst8 + i_dst;\r\n        dst2 = dst1 + i_dst;\r\n        dst3 = dst2 + i_dst;\r\n        dst4 = dst3 + i_dst;\r\n        dst5 = dst4 + i_dst;\r\n        dst6 = dst5 + i_dst;\r\n        dst7 = dst6 + i_dst;\r\n        dst8 = dst7 + i_dst;\r\n\r\n        p00 = _mm_unpackhi_epi8(Org_8_0, Org_8_1);\r\n        p10 = _mm_unpackhi_epi8(Org_8_2, Org_8_3);\r\n\r\n        t00 = _mm_unpacklo_epi16(p00, p10);\r\n        t10 = _mm_unpackhi_epi16(p00, p10);\r\n\r\n        *((int*)(dst1)) = _mm_cvtsi128_si32(t00);\r\n        t00 = _mm_srli_si128(t00, 4);\r\n        *((int*)(dst2)) = _mm_cvtsi128_si32(t00);\r\n        t00 = _mm_srli_si128(t00, 4);\r\n        *((int*)(dst3)) = _mm_cvtsi128_si32(t00);\r\n        t00 = _mm_srli_si128(t00, 4);\r\n        *((int*)(dst4)) = _mm_cvtsi128_si32(t00);\r\n\r\n        *((int*)(dst5)) = _mm_cvtsi128_si32(t10);\r\n        t10 = _mm_srli_si128(t10, 4);\r\n        *((int*)(dst6)) = 
_mm_cvtsi128_si32(t10);\r\n        t10 = _mm_srli_si128(t10, 4);\r\n        *((int*)(dst7)) = _mm_cvtsi128_si32(t10);\r\n        t10 = _mm_srli_si128(t10, 4);\r\n        *((int*)(dst8)) = _mm_cvtsi128_si32(t10);\r\n    } else {// bsx == 4 bsy ==4\r\n        pel_t *dst_tran_org = dst_tran;\r\n\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n\r\n        __m128i Org_8_0, Org_8_1, Org_8_2, Org_8_3;\r\n        __m128i p00, p10;\r\n        __m128i t00;\r\n        Org_8_0 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n        dst_tran_org += i_dst2;\r\n        Org_8_1 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n        dst_tran_org += i_dst2;\r\n        Org_8_2 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n        dst_tran_org += i_dst2;\r\n        Org_8_3 = _mm_loadu_si128((__m128i*)dst_tran_org);\r\n        dst_tran_org += i_dst2;\r\n\r\n        p00 = _mm_unpacklo_epi8(Org_8_0, Org_8_1);\r\n        p10 = _mm_unpacklo_epi8(Org_8_2, Org_8_3);\r\n\r\n        t00 = _mm_unpacklo_epi16(p00, p10);\r\n\r\n        *((int*)(dst1)) = _mm_cvtsi128_si32(t00);\r\n        t00 = _mm_srli_si128(t00, 4);\r\n        *((int*)(dst2)) = _mm_cvtsi128_si32(t00);\r\n        t00 = _mm_srli_si128(t00, 4);\r\n        *((int*)(dst3)) = _mm_cvtsi128_si32(t00);\r\n        t00 = _mm_srli_si128(t00, 4);\r\n        *((int*)(dst4)) = _mm_cvtsi128_si32(t00);\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_y_32_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[2 * (64 + 64)]);\r\n    int line_size = (bsy >> 1) + bsx - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = DAVS2_MIN(line_size, bsy - 1);\r\n    __m128i pad_val;\r\n#endif\r\n    int i;\r\n    int aligned_line_size = ((line_size + 63) >> 4) << 4;\r\n    pel_t *pfirst[2];\r\n\r\n    __m128i 
coeff2 = _mm_set1_epi16(2);\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i shuffle1 = _mm_setr_epi8(15, 13, 11, 9, 7, 5, 3, 1, 14, 12, 10, 8, 6, 4, 2, 0);\r\n    __m128i shuffle2 = _mm_setr_epi8(14, 12, 10, 8, 6, 4, 2, 0, 15, 13, 11, 9, 7, 5, 3, 1);\r\n    int i_dst2 = i_dst * 2;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = first_line + aligned_line_size;\r\n\r\n    src -= 18;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 4; i += 8, src -= 16) {\r\n#else\r\n    for (i = 0; i < real_size - 4; i += 8, src -= 16) {\r\n#endif\r\n        __m128i p00, p01, p10, p11;\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n        p00 = _mm_add_epi16(L0, L1);\r\n        p01 = _mm_add_epi16(L1, L2);\r\n        p00 = _mm_add_epi16(p00, coeff2);\r\n        p00 = _mm_add_epi16(p00, p01);\r\n        p00 = _mm_srli_epi16(p00, 2);\r\n\r\n        p10 = _mm_add_epi16(H0, H1);\r\n        p11 = _mm_add_epi16(H1, H2);\r\n        p10 = _mm_add_epi16(p10, coeff2);\r\n        p10 = _mm_add_epi16(p10, p11);\r\n        p10 = _mm_srli_epi16(p10, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p10);\r\n        p10 = _mm_shuffle_epi8(p00, shuffle2);\r\n        p00 = _mm_shuffle_epi8(p00, shuffle1);\r\n        _mm_storel_epi64((__m128i*)&pfirst[0][i], p00);\r\n        _mm_storel_epi64((__m128i*)&pfirst[1][i], p10);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        __m128i p10, p11;\r\n    
    __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n        p10 = _mm_add_epi16(H0, H1);\r\n        p11 = _mm_add_epi16(H1, H2);\r\n        p10 = _mm_add_epi16(p10, coeff2);\r\n        p10 = _mm_add_epi16(p10, p11);\r\n        p10 = _mm_srli_epi16(p10, 2);\r\n\r\n        p11 = _mm_packus_epi16(p10, p10);\r\n        p10 = _mm_shuffle_epi8(p11, shuffle2);\r\n        p11 = _mm_shuffle_epi8(p11, shuffle1);\r\n        ((int*)&pfirst[0][i])[0] = _mm_cvtsi128_si32(p11);\r\n        ((int*)&pfirst[1][i])[0] = _mm_cvtsi128_si32(p10);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    if (real_size < line_size) {\r\n        pad_val = _mm_set1_epi8(pfirst[1][real_size - 1]);\r\n        for (i = real_size; i < line_size; i++) {\r\n            _mm_storeu_si128((__m128i*)&pfirst[0][i], pad_val);\r\n            _mm_storeu_si128((__m128i*)&pfirst[1][i], pad_val);\r\n        }\r\n    }\r\n#endif\r\n\r\n    bsy >>= 1;\r\n\r\n    if (bsx >= 16 || bsx == 4) {\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + i_dst, pfirst[1] + i, bsx * sizeof(pel_t));\r\n            dst += i_dst2;\r\n        }\r\n    } else {\r\n        if (bsy == 4) {\r\n            __m128i M1 = _mm_loadu_si128((__m128i*)&pfirst[0][0]);\r\n            __m128i M2 = _mm_loadu_si128((__m128i*)&pfirst[1][0]);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + 
i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n        } else {\r\n            for (i = 0; i < 16; i = i + 8) {\r\n                __m128i M1 = _mm_loadu_si128((__m128i*)&pfirst[0][i]);\r\n                __m128i M2 = _mm_loadu_si128((__m128i*)&pfirst[1][i]);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n  
              _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_xy_13_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i coeff3 = _mm_set1_epi16(3);\r\n    __m128i coeff4 = _mm_set1_epi16(4);\r\n    __m128i coeff5 = _mm_set1_epi16(5);\r\n    __m128i coeff7 = _mm_set1_epi16(7);\r\n    __m128i coeff8 = _mm_set1_epi16(8);\r\n    __m128i coeff9 = _mm_set1_epi16(9);\r\n    __m128i coeff11 = _mm_set1_epi16(11);\r\n    __m128i coeff13 = _mm_set1_epi16(13);\r\n    __m128i coeff15 = _mm_set1_epi16(15);\r\n    __m128i coeff16 = _mm_set1_epi16(16);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    int i;\r\n    if (bsy > 8) {\r\n        ALIGN16(pel_t first_line[(64 + 16) << 3]);\r\n        int line_size = bsx + (bsy >> 3) - 1;\r\n        int left_size = line_size - bsx;\r\n        int aligned_line_size = ((line_size + 15) >> 4) << 4;\r\n        pel_t *pfirst[8];\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n        pfirst[4] = pfirst[3] + aligned_line_size;\r\n        pfirst[5] = pfirst[4] + aligned_line_size;\r\n 
       pfirst[6] = pfirst[5] + aligned_line_size;\r\n        pfirst[7] = pfirst[6] + aligned_line_size;\r\n\r\n        src -= bsy - 8;\r\n        for (i = 0; i < left_size; i++, src += 8) {\r\n            pfirst[0][i] = (pel_t)((src[6] + (src[7] << 1) + src[8] + 2) >> 2);\r\n            pfirst[1][i] = (pel_t)((src[5] + (src[6] << 1) + src[7] + 2) >> 2);\r\n            pfirst[2][i] = (pel_t)((src[4] + (src[5] << 1) + src[6] + 2) >> 2);\r\n            pfirst[3][i] = (pel_t)((src[3] + (src[4] << 1) + src[5] + 2) >> 2);\r\n\r\n            pfirst[4][i] = (pel_t)((src[2] + (src[3] << 1) + src[4] + 2) >> 2);\r\n            pfirst[5][i] = (pel_t)((src[1] + (src[2] << 1) + src[3] + 2) >> 2);\r\n            pfirst[6][i] = (pel_t)((src[0] + (src[1] << 1) + src[2] + 2) >> 2);\r\n            pfirst[7][i] = (pel_t)((src[-1] + (src[0] << 1) + src[1] + 2) >> 2);\r\n        }\r\n\r\n        for (; i < line_size - 8; i += 16, src += 16) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff7);\r\n            p10 = _mm_mullo_epi16(L1, coeff15);\r\n            p20 = _mm_mullo_epi16(L2, coeff9);\r\n            p30 = _mm_add_epi16(L3, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n  
          p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff7);\r\n            p11 = _mm_mullo_epi16(H1, coeff15);\r\n            p21 = _mm_mullo_epi16(H2, coeff9);\r\n            p31 = _mm_add_epi16(H3, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[1][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff5);\r\n            p10 = _mm_mullo_epi16(L1, coeff13);\r\n            p20 = _mm_mullo_epi16(L2, coeff11);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = 
_mm_mullo_epi16(H0, coeff5);\r\n            p11 = _mm_mullo_epi16(H1, coeff13);\r\n            p21 = _mm_mullo_epi16(H2, coeff11);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[3][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff11);\r\n            p20 = _mm_mullo_epi16(L2, coeff13);\r\n            p30 = _mm_mullo_epi16(L3, coeff5);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff11);\r\n            p21 = _mm_mullo_epi16(H2, coeff13);\r\n            p31 = _mm_mullo_epi16(H3, coeff5);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, 
p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[4][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[5][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff9);\r\n            p20 = _mm_mullo_epi16(L2, coeff15);\r\n            p30 = _mm_mullo_epi16(L3, coeff7);\r\n            p00 = _mm_add_epi16(L0, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff9);\r\n            p21 = _mm_mullo_epi16(H2, coeff15);\r\n            p31 = _mm_mullo_epi16(H3, coeff7);\r\n            p01 = _mm_add_epi16(H0, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[6][i], 
p00);\r\n\r\n\r\n            p10 = _mm_mullo_epi16(L2, coeff2);\r\n            p00 = _mm_add_epi16(L1, L3);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p11 = _mm_mullo_epi16(H2, coeff2);\r\n            p01 = _mm_add_epi16(H1, H3);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[7][i], p00);\r\n        }\r\n\r\n        if (i < line_size) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff7);\r\n            p10 = _mm_mullo_epi16(L1, coeff15);\r\n            p20 = _mm_mullo_epi16(L2, coeff9);\r\n            p30 = _mm_add_epi16(L3, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            
p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[1][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff5);\r\n            p10 = _mm_mullo_epi16(L1, coeff13);\r\n            p20 = _mm_mullo_epi16(L2, coeff11);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[3][i], p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff11);\r\n            p20 = _mm_mullo_epi16(L2, coeff13);\r\n            p30 = _mm_mullo_epi16(L3, coeff5);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[4][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n  
          p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[5][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff9);\r\n            p20 = _mm_mullo_epi16(L2, coeff15);\r\n            p30 = _mm_mullo_epi16(L3, coeff7);\r\n            p00 = _mm_add_epi16(L0, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[6][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L2, coeff2);\r\n            p00 = _mm_add_epi16(L1, L3);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)&pfirst[7][i], p00);\r\n        }\r\n\r\n        pfirst[0] += left_size;\r\n        pfirst[1] += left_size;\r\n        pfirst[2] += left_size;\r\n        pfirst[3] += left_size;\r\n        pfirst[4] += left_size;\r\n        pfirst[5] += left_size;\r\n        pfirst[6] += left_size;\r\n        pfirst[7] += left_size;\r\n\r\n        bsy >>= 3;\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[1] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[2] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[3] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[4] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[5] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            
memcpy(dst, pfirst[6] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[7] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsy == 8) {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n        if (bsx == 32) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff7);\r\n            p10 = _mm_mullo_epi16(L1, coeff15);\r\n            p20 = _mm_mullo_epi16(L2, coeff9);\r\n            p30 = _mm_add_epi16(L3, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff7);\r\n            p11 = _mm_mullo_epi16(H1, coeff15);\r\n            p21 = _mm_mullo_epi16(H2, coeff9);\r\n            p31 = _mm_add_epi16(H3, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n     
       p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst1, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst2, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff5);\r\n            p10 = _mm_mullo_epi16(L1, coeff13);\r\n            p20 = _mm_mullo_epi16(L2, coeff11);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff5);\r\n            p11 = _mm_mullo_epi16(H1, coeff13);\r\n            p21 = _mm_mullo_epi16(H2, coeff11);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n    
        p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst3, p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst4, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff11);\r\n            p20 = _mm_mullo_epi16(L2, coeff13);\r\n            p30 = _mm_mullo_epi16(L3, coeff5);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff11);\r\n            p21 = _mm_mullo_epi16(H2, coeff13);\r\n            p31 = _mm_mullo_epi16(H3, coeff5);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst5, p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, 
p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst6, p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff9);\r\n            p20 = _mm_mullo_epi16(L2, coeff15);\r\n            p30 = _mm_mullo_epi16(L3, coeff7);\r\n            p00 = _mm_add_epi16(L0, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff9);\r\n            p21 = _mm_mullo_epi16(H2, coeff15);\r\n            p31 = _mm_mullo_epi16(H3, coeff7);\r\n            p01 = _mm_add_epi16(H0, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst7, p00);\r\n\r\n\r\n            p10 = _mm_mullo_epi16(L2, coeff2);\r\n            p00 = _mm_add_epi16(L1, L3);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p11 = _mm_mullo_epi16(H2, coeff2);\r\n            p01 = _mm_add_epi16(H1, H3);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_srli_epi16(p01, 
2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst8, p00);\r\n\r\n            src += 16;\r\n            dst1 += 16;\r\n            dst2 += 16;\r\n            dst3 += 16;\r\n            dst4 += 16;\r\n            dst5 += 16;\r\n            dst6 += 16;\r\n            dst7 += 16;\r\n            dst8 += 16;\r\n\r\n            S0 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            S2 = _mm_loadu_si128((__m128i*)(src));\r\n            S3 = _mm_loadu_si128((__m128i*)(src - 1));\r\n\r\n            L0 = _mm_unpacklo_epi8(S0, zero);\r\n            L1 = _mm_unpacklo_epi8(S1, zero);\r\n            L2 = _mm_unpacklo_epi8(S2, zero);\r\n            L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            H0 = _mm_unpackhi_epi8(S0, zero);\r\n            H1 = _mm_unpackhi_epi8(S1, zero);\r\n            H2 = _mm_unpackhi_epi8(S2, zero);\r\n            H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff7);\r\n            p10 = _mm_mullo_epi16(L1, coeff15);\r\n            p20 = _mm_mullo_epi16(L2, coeff9);\r\n            p30 = _mm_add_epi16(L3, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff7);\r\n            p11 = _mm_mullo_epi16(H1, coeff15);\r\n            p21 = _mm_mullo_epi16(H2, coeff9);\r\n            p31 = _mm_add_epi16(H3, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst1, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = 
_mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst2, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff5);\r\n            p10 = _mm_mullo_epi16(L1, coeff13);\r\n            p20 = _mm_mullo_epi16(L2, coeff11);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff5);\r\n            p11 = _mm_mullo_epi16(H1, coeff13);\r\n            p21 = _mm_mullo_epi16(H2, coeff11);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst3, p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = 
_mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst4, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff11);\r\n            p20 = _mm_mullo_epi16(L2, coeff13);\r\n            p30 = _mm_mullo_epi16(L3, coeff5);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff11);\r\n            p21 = _mm_mullo_epi16(H2, coeff13);\r\n            p31 = _mm_mullo_epi16(H3, coeff5);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst5, p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            
p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst6, p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff9);\r\n            p20 = _mm_mullo_epi16(L2, coeff15);\r\n            p30 = _mm_mullo_epi16(L3, coeff7);\r\n            p00 = _mm_add_epi16(L0, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff9);\r\n            p21 = _mm_mullo_epi16(H2, coeff15);\r\n            p31 = _mm_mullo_epi16(H3, coeff7);\r\n            p01 = _mm_add_epi16(H0, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst7, p00);\r\n\r\n\r\n            p10 = _mm_mullo_epi16(L2, coeff2);\r\n            p00 = _mm_add_epi16(L1, L3);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p11 = _mm_mullo_epi16(H2, coeff2);\r\n            p01 = _mm_add_epi16(H1, H3);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst8, p00);\r\n        } else {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = 
_mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff7);\r\n            p10 = _mm_mullo_epi16(L1, coeff15);\r\n            p20 = _mm_mullo_epi16(L2, coeff9);\r\n            p30 = _mm_add_epi16(L3, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst1, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst2, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff5);\r\n            p10 = _mm_mullo_epi16(L1, coeff13);\r\n            p20 = _mm_mullo_epi16(L2, coeff11);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst3, p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_mullo_epi16(p10, 
coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst4, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff11);\r\n            p20 = _mm_mullo_epi16(L2, coeff13);\r\n            p30 = _mm_mullo_epi16(L3, coeff5);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst5, p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst6, p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff9);\r\n            p20 = _mm_mullo_epi16(L2, coeff15);\r\n            p30 = _mm_mullo_epi16(L3, coeff7);\r\n            p00 = _mm_add_epi16(L0, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst7, p00);\r\n\r\n\r\n            p10 = _mm_mullo_epi16(L2, coeff2);\r\n            p00 = _mm_add_epi16(L1, L3);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            
p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst8, p00);\r\n        }\r\n    } else {\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n\r\n        if (bsx == 16) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff7);\r\n            p10 = _mm_mullo_epi16(L1, coeff15);\r\n            p20 = _mm_mullo_epi16(L2, coeff9);\r\n            p30 = _mm_add_epi16(L3, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff7);\r\n            p11 = _mm_mullo_epi16(H1, coeff15);\r\n            p21 = _mm_mullo_epi16(H2, coeff9);\r\n            p31 = _mm_add_epi16(H3, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            
_mm_store_si128((__m128i*)dst1, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst2, p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff5);\r\n            p10 = _mm_mullo_epi16(L1, coeff13);\r\n            p20 = _mm_mullo_epi16(L2, coeff11);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff5);\r\n            p11 = _mm_mullo_epi16(H1, coeff13);\r\n            p21 = _mm_mullo_epi16(H2, coeff11);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst3, p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n       
     p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)dst4, p00);\r\n        } else {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src - 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff7);\r\n            p10 = _mm_mullo_epi16(L1, coeff15);\r\n            p20 = _mm_mullo_epi16(L2, coeff9);\r\n            p30 = _mm_add_epi16(L3, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)(dst1))[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = 
_mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)(dst2))[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff5);\r\n            p10 = _mm_mullo_epi16(L1, coeff13);\r\n            p20 = _mm_mullo_epi16(L2, coeff11);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)(dst3))[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)(dst4))[0] = _mm_cvtsi128_si32(p00);\r\n        }\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_xy_14_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i coeff3 = _mm_set1_epi16(3);\r\n    __m128i coeff4 = _mm_set1_epi16(4);\r\n    __m128i coeff5 = _mm_set1_epi16(5);\r\n    __m128i coeff7 = _mm_set1_epi16(7);\r\n    __m128i coeff8 = _mm_set1_epi16(8);\r\n    __m128i zero = _mm_setzero_si128();\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsy != 4) {\r\n        ALIGN16(pel_t first_line[4 * (64 + 32)]);\r\n        int line_size = bsx + bsy / 4 - 1;\r\n        int left_size = line_size - bsx;\r\n        int aligned_line_size = ((line_size + 31) >> 4) << 4;\r\n        pel_t *pfirst[4];\r\n        __m128i shuffle1 = _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 
13, 2, 6, 10, 14, 3, 7, 11, 15);\r\n        __m128i shuffle2 = _mm_setr_epi8(1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, 0, 4, 8, 12);\r\n        __m128i shuffle3 = _mm_setr_epi8(2, 6, 10, 14, 3, 7, 11, 15, 0, 4, 8, 12, 1, 5, 9, 13);\r\n        __m128i shuffle4 = _mm_setr_epi8(3, 7, 11, 15, 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14);\r\n        pel_t *pSrc1 = src;\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = first_line + aligned_line_size;\r\n        pfirst[2] = first_line + aligned_line_size * 2;\r\n        pfirst[3] = first_line + aligned_line_size * 3;\r\n        src -= bsy - 4;\r\n        for (i = 0; i < left_size - 1; i += 4, src += 16) {\r\n            __m128i p00, p01, p10, p11;\r\n            __m128i p20, p30;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n            p00 = _mm_add_epi16(L0, L1);\r\n            p01 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_add_epi16(H0, H1);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p10 = _mm_add_epi16(p10, coeff2);\r\n            p00 = _mm_add_epi16(p00, p01);\r\n            p10 = _mm_add_epi16(p10, p11);\r\n\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n            p10 = _mm_srli_epi16(p10, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p10);\r\n            p10 = _mm_shuffle_epi8(p00, shuffle2);\r\n            p20 = _mm_shuffle_epi8(p00, shuffle3);\r\n            p30 = _mm_shuffle_epi8(p00, shuffle4);\r\n            p00 = 
_mm_shuffle_epi8(p00, shuffle1);\r\n\r\n            ((int*)&pfirst[0][i])[0] = _mm_cvtsi128_si32(p30);\r\n            ((int*)&pfirst[1][i])[0] = _mm_cvtsi128_si32(p20);\r\n            ((int*)&pfirst[2][i])[0] = _mm_cvtsi128_si32(p10);\r\n            ((int*)&pfirst[3][i])[0] = _mm_cvtsi128_si32(p00);\r\n        }\r\n\r\n        if (i < left_size) { // plain C may be faster here\r\n            __m128i p00, p01, p10;\r\n            __m128i p20, p30;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n            p00 = _mm_add_epi16(L0, L1);\r\n            p01 = _mm_add_epi16(L1, L2);\r\n\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p01);\r\n\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            p10 = _mm_shuffle_epi8(p00, shuffle2);\r\n            p20 = _mm_shuffle_epi8(p00, shuffle3);\r\n            p30 = _mm_shuffle_epi8(p00, shuffle4);\r\n            p00 = _mm_shuffle_epi8(p00, shuffle1);\r\n\r\n            ((int*)&pfirst[0][i])[0] = _mm_cvtsi128_si32(p30);\r\n            ((int*)&pfirst[1][i])[0] = _mm_cvtsi128_si32(p20);\r\n            ((int*)&pfirst[2][i])[0] = _mm_cvtsi128_si32(p10);\r\n            ((int*)&pfirst[3][i])[0] = _mm_cvtsi128_si32(p00);\r\n        }\r\n\r\n        src = pSrc1;\r\n\r\n        for (i = left_size; i < line_size; i++, src++) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S2 = 
_mm_loadu_si128((__m128i*)(src + 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_mullo_epi16(p01, coeff3);\r\n            p11 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(p11, coeff4);\r\n            p01 = _mm_add_epi16(p11, p01);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = 
_mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[1][i], p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L1);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_add_epi16(H0, H1);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            p01 = _mm_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)&pfirst[3][i], p00);\r\n        }\r\n\r\n        pfirst[0] += left_size;\r\n        pfirst[1] += left_size;\r\n        pfirst[2] += left_size;\r\n        pfirst[3] += left_size;\r\n\r\n        bsy >>= 2;\r\n\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[1] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, 
pfirst[2] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[3] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        if (bsx == 16) {\r\n            pel_t *dst2 = dst + i_dst;\r\n            pel_t *dst3 = dst2 + i_dst;\r\n            pel_t *dst4 = dst3 + i_dst;\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            
_mm_storeu_si128((__m128i*)dst3, p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_mullo_epi16(p01, coeff3);\r\n            p11 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(p11, coeff4);\r\n            p01 = _mm_add_epi16(p11, p01);\r\n            p01 = _mm_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)dst2, p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)dst, p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L1);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_add_epi16(H0, H1);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, 
coeff2);\r\n            p01 = _mm_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_storeu_si128((__m128i*)dst4, p00);\r\n        } else {\r\n            pel_t *dst2 = dst + i_dst;\r\n            pel_t *dst3 = dst2 + i_dst;\r\n            pel_t *dst4 = dst3 + i_dst;\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)dst3)[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)dst2)[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 
= _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L1);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)dst4)[0] = _mm_cvtsi128_si32(p00);\r\n        }\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_xy_16_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[2 * (64 + 48)]);\r\n    int line_size = bsx + bsy / 2 - 1;\r\n    int left_size = line_size - bsx;\r\n    int aligned_line_size = ((line_size + 31) >> 4) << 4;\r\n    pel_t *pfirst[2];\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i coeff3 = _mm_set1_epi16(3);\r\n    __m128i coeff4 = _mm_set1_epi16(4);\r\n    __m128i shuffle1 = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15);\r\n    __m128i shuffle2 = _mm_setr_epi8(1, 3, 5, 7, 9, 11, 13, 15, 0, 2, 4, 6, 8, 10, 12, 14);\r\n    int i;\r\n    pel_t *pSrc1;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = first_line + aligned_line_size;\r\n\r\n    src -= bsy - 2;\r\n\r\n    pSrc1 = src;\r\n\r\n    for (i = 0; i < left_size - 4; i += 8, src += 16) {\r\n        __m128i p00, p01, p10, p11;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        
__m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n        p00 = _mm_add_epi16(L0, L1);\r\n        p01 = _mm_add_epi16(L1, L2);\r\n        p10 = _mm_add_epi16(H0, H1);\r\n        p11 = _mm_add_epi16(H1, H2);\r\n\r\n        p00 = _mm_add_epi16(p00, coeff2);\r\n        p10 = _mm_add_epi16(p10, coeff2);\r\n\r\n        p00 = _mm_add_epi16(p00, p01);\r\n        p10 = _mm_add_epi16(p10, p11);\r\n\r\n        p00 = _mm_srli_epi16(p00, 2);\r\n        p10 = _mm_srli_epi16(p10, 2);\r\n        p00 = _mm_packus_epi16(p00, p10);\r\n\r\n        p10 = _mm_shuffle_epi8(p00, shuffle2);\r\n        p00 = _mm_shuffle_epi8(p00, shuffle1);\r\n        _mm_storel_epi64((__m128i*)&pfirst[1][i], p00);\r\n        _mm_storel_epi64((__m128i*)&pfirst[0][i], p10);\r\n    }\r\n\r\n    if (i < left_size) {\r\n        __m128i p00, p01;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        p00 = _mm_add_epi16(L0, L1);\r\n        p01 = _mm_add_epi16(L1, L2);\r\n        p00 = _mm_add_epi16(p00, coeff2);\r\n        p00 = _mm_add_epi16(p00, p01);\r\n        p00 = _mm_srli_epi16(p00, 2);\r\n        p00 = _mm_packus_epi16(p00, p00);\r\n\r\n        p01 = _mm_shuffle_epi8(p00, shuffle2);\r\n        p00 = _mm_shuffle_epi8(p00, shuffle1);\r\n        ((int*)&pfirst[1][i])[0] = _mm_cvtsi128_si32(p00);\r\n        ((int*)&pfirst[0][i])[0] = _mm_cvtsi128_si32(p01);\r\n    }\r\n\r\n    src = pSrc1 + left_size + left_size;\r\n\r\n    for (i = left_size; i < line_size; i += 16, src += 16) {\r\n        __m128i p00, p01, p10, p11;\r\n        __m128i S0 = 
_mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S3 = _mm_loadu_si128((__m128i*)(src + 2));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n        __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n        __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n        p00 = _mm_add_epi16(L1, L2);\r\n        p10 = _mm_add_epi16(H1, H2);\r\n        p00 = _mm_mullo_epi16(p00, coeff3);\r\n        p10 = _mm_mullo_epi16(p10, coeff3);\r\n\r\n        p01 = _mm_add_epi16(L0, L3);\r\n        p11 = _mm_add_epi16(H0, H3);\r\n        p00 = _mm_add_epi16(p00, coeff4);\r\n        p10 = _mm_add_epi16(p10, coeff4);\r\n        p00 = _mm_add_epi16(p00, p01);\r\n        p10 = _mm_add_epi16(p10, p11);\r\n\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n        p10 = _mm_srli_epi16(p10, 3);\r\n\r\n        p00 = _mm_packus_epi16(p00, p10);\r\n        _mm_storeu_si128((__m128i*)&pfirst[0][i], p00);\r\n\r\n        p00 = _mm_add_epi16(L0, L1);\r\n        p01 = _mm_add_epi16(L1, L2);\r\n        p10 = _mm_add_epi16(H0, H1);\r\n        p11 = _mm_add_epi16(H1, H2);\r\n\r\n        p00 = _mm_add_epi16(p00, coeff2);\r\n        p10 = _mm_add_epi16(p10, coeff2);\r\n\r\n        p00 = _mm_add_epi16(p00, p01);\r\n        p10 = _mm_add_epi16(p10, p11);\r\n\r\n        p00 = _mm_srli_epi16(p00, 2);\r\n        p10 = _mm_srli_epi16(p10, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p10);\r\n        _mm_storeu_si128((__m128i*)&pfirst[1][i], p00);\r\n    }\r\n\r\n    pfirst[0] += left_size;\r\n    pfirst[1] += left_size;\r\n\r\n    bsy >>= 1;\r\n\r\n    switch (bsx) {\r\n        case 4:\r\n            for (i = 0; i < bsy; 
i++) {\r\n                CP32(dst, pfirst[0] - i);\r\n                CP32(dst + i_dst, pfirst[1] - i);\r\n                dst += (i_dst << 1);\r\n            }\r\n            break;\r\n        case 8:\r\n            for (i = 0; i < bsy; i++) {\r\n                CP64(dst, pfirst[0] - i);\r\n                CP64(dst + i_dst, pfirst[1] - i);\r\n                dst += (i_dst << 1);\r\n            }\r\n            break;\r\n        default:\r\n            for (i = 0; i < bsy; i++) {\r\n                memcpy(dst, pfirst[0] - i, bsx * sizeof(pel_t));\r\n                memcpy(dst + i_dst, pfirst[1] - i, bsx * sizeof(pel_t));\r\n                dst += (i_dst << 1);\r\n            }\r\n            break;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_xy_18_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 64]);\r\n    int line_size = bsx + bsy - 1;\r\n    int i;\r\n    pel_t *pfirst = first_line + bsy - 1;\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i zero = _mm_setzero_si128();\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    src -= bsy - 1;\r\n\r\n    for (i = 0; i < line_size - 8; i += 16, src += 16) {\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n        __m128i sum1 = _mm_add_epi16(L0, L1);\r\n        __m128i sum2 = _mm_add_epi16(L1, L2);\r\n        __m128i sum3 = _mm_add_epi16(H0, H1);\r\n        __m128i sum4 = _mm_add_epi16(H1, H2);\r\n\r\n      
  sum1 = _mm_add_epi16(sum1, sum2);\r\n        sum3 = _mm_add_epi16(sum3, sum4);\r\n\r\n        sum1 = _mm_add_epi16(sum1, coeff2);\r\n        sum3 = _mm_add_epi16(sum3, coeff2);\r\n\r\n        sum1 = _mm_srli_epi16(sum1, 2);\r\n        sum3 = _mm_srli_epi16(sum3, 2);\r\n\r\n        sum1 = _mm_packus_epi16(sum1, sum3);\r\n\r\n        _mm_store_si128((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n    if (i < line_size) {\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i sum1 = _mm_add_epi16(L0, L1);\r\n        __m128i sum2 = _mm_add_epi16(L1, L2);\r\n\r\n        sum1 = _mm_add_epi16(sum1, sum2);\r\n        sum1 = _mm_add_epi16(sum1, coeff2);\r\n        sum1 = _mm_srli_epi16(sum1, 2);\r\n\r\n        sum1 = _mm_packus_epi16(sum1, sum1);\r\n        _mm_storel_epi64((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n    switch (bsx) {\r\n        case 4:\r\n            for (i = 0; i < bsy; i++) {\r\n                CP32(dst, pfirst--);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        case 8:\r\n            for (i = 0; i < bsy; i++) {\r\n                CP64(dst, pfirst--);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        default:\r\n            for (i = 0; i < bsy; i++) {\r\n                memcpy(dst, pfirst--, bsx * sizeof(pel_t));\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_xy_20_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN16(pel_t first_line[64 + 128]);\r\n    
int left_size = (bsy - 1) * 2 + 1;\r\n    int top_size = bsx - 1;\r\n    int line_size = left_size + top_size;\r\n    int i;\r\n    pel_t *pfirst = first_line + left_size - 1;\r\n    __m128i zero = _mm_setzero_si128();\r\n    __m128i coeff2 = _mm_set1_epi16(2);\r\n    __m128i coeff3 = _mm_set1_epi16(3);\r\n    __m128i coeff4 = _mm_set1_epi16(4);\r\n    __m128i shuffle = _mm_setr_epi8(0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15);\r\n    pel_t *pSrc1 = src;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    src -= bsy;\r\n\r\n    for (i = 0; i < left_size - 16; i += 32, src += 16) {\r\n        __m128i p00, p01, p10, p11;\r\n        __m128i p20, p21, p30, p31;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S3 = _mm_loadu_si128((__m128i*)(src + 2));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n        __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n        __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n        p00 = _mm_add_epi16(L1, L2);\r\n        p10 = _mm_add_epi16(H1, H2);\r\n        p00 = _mm_mullo_epi16(p00, coeff3);\r\n        p10 = _mm_mullo_epi16(p10, coeff3);\r\n\r\n        p01 = _mm_add_epi16(L0, L3);\r\n        p11 = _mm_add_epi16(H0, H3);\r\n        p00 = _mm_add_epi16(p00, coeff4);\r\n        p10 = _mm_add_epi16(p10, coeff4);\r\n        p00 = _mm_add_epi16(p00, p01);\r\n        p10 = _mm_add_epi16(p10, p11);\r\n\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n        p10 = _mm_srli_epi16(p10, 3);\r\n\r\n        p20 = _mm_add_epi16(L1, L2);\r\n        p30 = _mm_add_epi16(H1, H2);\r\n        p21 = _mm_add_epi16(L2, L3);\r\n 
       p31 = _mm_add_epi16(H2, H3);\r\n        p20 = _mm_add_epi16(p20, coeff2);\r\n        p30 = _mm_add_epi16(p30, coeff2);\r\n        p20 = _mm_add_epi16(p20, p21);\r\n        p30 = _mm_add_epi16(p30, p31);\r\n\r\n        p20 = _mm_srli_epi16(p20, 2);\r\n        p30 = _mm_srli_epi16(p30, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p20);\r\n        p10 = _mm_packus_epi16(p10, p30);\r\n\r\n        p00 = _mm_shuffle_epi8(p00, shuffle);\r\n        p10 = _mm_shuffle_epi8(p10, shuffle);\r\n        _mm_store_si128((__m128i*)&first_line[i], p00);\r\n        _mm_store_si128((__m128i*)&first_line[i + 16], p10);\r\n    }\r\n\r\n    if (i < left_size) {\r\n        __m128i p00, p01;\r\n        __m128i p20, p21;\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S3 = _mm_loadu_si128((__m128i*)(src + 2));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n        __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n        p00 = _mm_add_epi16(L1, L2);\r\n        p00 = _mm_mullo_epi16(p00, coeff3);\r\n\r\n        p01 = _mm_add_epi16(L0, L3);\r\n        p00 = _mm_add_epi16(p00, coeff4);\r\n        p00 = _mm_add_epi16(p00, p01);\r\n\r\n        p00 = _mm_srli_epi16(p00, 3);\r\n\r\n        p20 = _mm_add_epi16(L1, L2);\r\n        p21 = _mm_add_epi16(L2, L3);\r\n        p20 = _mm_add_epi16(p20, coeff2);\r\n        p20 = _mm_add_epi16(p20, p21);\r\n\r\n        p20 = _mm_srli_epi16(p20, 2);\r\n\r\n        p00 = _mm_packus_epi16(p00, p20);\r\n\r\n        p00 = _mm_shuffle_epi8(p00, shuffle);\r\n        _mm_store_si128((__m128i*)&first_line[i], p00);\r\n    }\r\n\r\n    src = pSrc1;\r\n\r\n    for (i = left_size; i < line_size - 8; i += 16, src += 16) {\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        
__m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n        __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n        __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n        __m128i sum1 = _mm_add_epi16(L0, L1);\r\n        __m128i sum2 = _mm_add_epi16(L1, L2);\r\n        __m128i sum3 = _mm_add_epi16(H0, H1);\r\n        __m128i sum4 = _mm_add_epi16(H1, H2);\r\n\r\n        sum1 = _mm_add_epi16(sum1, sum2);\r\n        sum3 = _mm_add_epi16(sum3, sum4);\r\n\r\n        sum1 = _mm_add_epi16(sum1, coeff2);\r\n        sum3 = _mm_add_epi16(sum3, coeff2);\r\n\r\n        sum1 = _mm_srli_epi16(sum1, 2);\r\n        sum3 = _mm_srli_epi16(sum3, 2);\r\n\r\n        sum1 = _mm_packus_epi16(sum1, sum3);\r\n\r\n        _mm_storeu_si128((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n    if (i < line_size) {\r\n        __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n        __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n        __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n        __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n        __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n        __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n        __m128i sum1 = _mm_add_epi16(L0, L1);\r\n        __m128i sum2 = _mm_add_epi16(L1, L2);\r\n\r\n        sum1 = _mm_add_epi16(sum1, sum2);\r\n        sum1 = _mm_add_epi16(sum1, coeff2);\r\n        sum1 = _mm_srli_epi16(sum1, 2);\r\n\r\n        sum1 = _mm_packus_epi16(sum1, sum1);\r\n        _mm_storel_epi64((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n    for (i = 0; i < bsy; i++) {\r\n        memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n        pfirst -= 2;\r\n        dst += i_dst;\r\n    }\r\n}\r\n\r\n\r\n/* 
---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_xy_22_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    src -= bsy;\r\n    if (bsx != 4) {\r\n        ALIGN16(pel_t first_line[64 + 256]);\r\n        int left_size = (bsy - 1) * 4 + 3;\r\n        int top_size = bsx - 3;\r\n        int line_size = left_size + top_size;\r\n        pel_t *pfirst = first_line + left_size - 3;\r\n        pel_t *pSrc1 = src;\r\n\r\n        __m128i zero = _mm_setzero_si128();\r\n        __m128i coeff2 = _mm_set1_epi16(2);\r\n        __m128i coeff3 = _mm_set1_epi16(3);\r\n        __m128i coeff4 = _mm_set1_epi16(4);\r\n        __m128i coeff5 = _mm_set1_epi16(5);\r\n        __m128i coeff7 = _mm_set1_epi16(7);\r\n        __m128i coeff8 = _mm_set1_epi16(8);\r\n        __m128i shuffle = _mm_setr_epi8(0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15);\r\n\r\n        for (i = 0; i < line_size - 32; i += 64, src += 16) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i M1, M2, M3, M4, M5, M6, M7, M8;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = 
_mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            M1 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            M2 = _mm_srli_epi16(p01, 4);\r\n\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            M3 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_mullo_epi16(p01, coeff3);\r\n            p11 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(p11, coeff4);\r\n            p01 = _mm_add_epi16(p11, p01);\r\n            M4 = _mm_srli_epi16(p01, 3);\r\n\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M5 = _mm_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm_mullo_epi16(H1, coeff5);\r\n            p21 = _mm_mullo_epi16(H2, coeff7);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(H0, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n        
    M6 = _mm_srli_epi16(p01, 4);\r\n\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_add_epi16(L2, L3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            M7 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_add_epi16(H2, H3);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            M8 = _mm_srli_epi16(p01, 2);\r\n\r\n            M1 = _mm_packus_epi16(M1, M3);\r\n            M5 = _mm_packus_epi16(M5, M7);\r\n            M1 = _mm_shuffle_epi8(M1, shuffle);\r\n            M5 = _mm_shuffle_epi8(M5, shuffle);\r\n\r\n            M2 = _mm_packus_epi16(M2, M4);\r\n            M6 = _mm_packus_epi16(M6, M8);\r\n            M2 = _mm_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm_shuffle_epi8(M6, shuffle);\r\n\r\n            M3 = _mm_unpacklo_epi16(M1, M5);\r\n            M7 = _mm_unpackhi_epi16(M1, M5);\r\n            M4 = _mm_unpacklo_epi16(M2, M6);\r\n            M8 = _mm_unpackhi_epi16(M2, M6);\r\n\r\n            _mm_store_si128((__m128i*)&first_line[i], M3);\r\n            _mm_store_si128((__m128i*)&first_line[16 + i], M7);\r\n            _mm_store_si128((__m128i*)&first_line[32 + i], M4);\r\n            _mm_store_si128((__m128i*)&first_line[48 + i], M8);\r\n        }\r\n\r\n        if (i < left_size) {\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i M1, M3, M5, M7;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            
p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            M1 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            M3 = _mm_srli_epi16(p00, 3);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff5);\r\n            p20 = _mm_mullo_epi16(L2, coeff7);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(L0, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M5 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_add_epi16(L2, L3);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            M7 = _mm_srli_epi16(p00, 2);\r\n\r\n            M1 = _mm_packus_epi16(M1, M3);\r\n            M5 = _mm_packus_epi16(M5, M7);\r\n            M1 = _mm_shuffle_epi8(M1, shuffle);\r\n            M5 = _mm_shuffle_epi8(M5, shuffle);\r\n\r\n            M3 = _mm_unpacklo_epi16(M1, M5);\r\n            M7 = _mm_unpackhi_epi16(M1, M5);\r\n\r\n            _mm_store_si128((__m128i*)&first_line[i], M3);\r\n            _mm_store_si128((__m128i*)&first_line[16 + i], M7);\r\n        }\r\n\r\n        src = pSrc1 + bsy;\r\n\r\n        for (i = left_size; i < line_size - 8; i += 16, src += 16) {\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S1 = 
_mm_loadu_si128((__m128i*)(src));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n            __m128i sum1 = _mm_add_epi16(L0, L1);\r\n            __m128i sum2 = _mm_add_epi16(L1, L2);\r\n            __m128i sum3 = _mm_add_epi16(H0, H1);\r\n            __m128i sum4 = _mm_add_epi16(H1, H2);\r\n\r\n            sum1 = _mm_add_epi16(sum1, sum2);\r\n            sum3 = _mm_add_epi16(sum3, sum4);\r\n\r\n            sum1 = _mm_add_epi16(sum1, coeff2);\r\n            sum3 = _mm_add_epi16(sum3, coeff2);\r\n\r\n            sum1 = _mm_srli_epi16(sum1, 2);\r\n            sum3 = _mm_srli_epi16(sum3, 2);\r\n\r\n            sum1 = _mm_packus_epi16(sum1, sum3);\r\n\r\n            _mm_storeu_si128((__m128i*)&first_line[i], sum1);\r\n        }\r\n\r\n        if (i < line_size) {\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n            __m128i sum1 = _mm_add_epi16(L0, L1);\r\n            __m128i sum2 = _mm_add_epi16(L1, L2);\r\n\r\n            sum1 = _mm_add_epi16(sum1, sum2);\r\n            sum1 = _mm_add_epi16(sum1, coeff2);\r\n            sum1 = _mm_srli_epi16(sum1, 2);\r\n\r\n            sum1 = _mm_packus_epi16(sum1, sum1);\r\n            _mm_storel_epi64((__m128i*)&first_line[i], sum1);\r\n        }\r\n\r\n        switch (bsx) {\r\n            case 8:\r\n                while (bsy--) {\r\n                    CP64(dst, pfirst);\r\n                    
dst += i_dst;\r\n                    pfirst -= 4;\r\n                }\r\n                break;\r\n            case 16:\r\n            case 32:\r\n            case 64:\r\n                while (bsy--) {\r\n                    memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n                    dst += i_dst;\r\n                    pfirst -= 4;\r\n                }\r\n                break;\r\n            default:\r\n                assert(0);\r\n                break;\r\n        }\r\n    } else {\r\n        dst += (bsy - 1) * i_dst;\r\n        for (i = 0; i < bsy; i++, src++) {\r\n            dst[0] = (src[-1] * 3 + src[0] * 7 + src[1] * 5 + src[2] + 8) >> 4;\r\n            dst[1] = (src[-1] + (src[0] + src[1]) * 3 + src[2] + 4) >> 3;\r\n            dst[2] = (src[-1] + src[0] * 5 + src[1] * 7 + src[2] * 3 + 8) >> 4;\r\n            dst[3] = (src[0] + src[1] * 2 + src[2] + 2) >> 2;\r\n            dst -= i_dst;\r\n        }\r\n    }\r\n\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid intra_pred_ang_xy_23_sse128(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n\r\n    int i;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx > 8) {\r\n        ALIGN16(pel_t first_line[64 + 512]);\r\n        int left_size = (bsy << 3) - 1;\r\n        int top_size = bsx - 7;\r\n        int line_size = left_size + top_size;\r\n        pel_t *pfirst = first_line + left_size - 7;\r\n        pel_t *pfirst1 = first_line;\r\n        pel_t *src_org = src;\r\n\r\n        src -= bsy;\r\n\r\n        __m128i zero = _mm_setzero_si128();\r\n        __m128i coeff0 = _mm_setr_epi16(7, 3, 5, 1, 3, 1, 1, 0);\r\n        __m128i coeff1 = _mm_setr_epi16(15, 7, 13, 3, 11, 5, 9, 1);\r\n        __m128i coeff2 = _mm_setr_epi16(9, 5, 11, 3, 13, 7, 15, 2);\r\n        __m128i coeff3 = _mm_setr_epi16(1, 1, 3, 1, 5, 3, 7, 1);\r\n        __m128i coeff4 = _mm_setr_epi16(16, 8, 16, 4, 16, 8, 16, 2);\r\n        __m128i coeff5 = 
_mm_setr_epi16(1, 2, 1, 4, 1, 2, 1, 8);\r\n\r\n        __m128i p00, p10, p20, p30;\r\n\r\n        __m128i L0 = _mm_set1_epi16(src[-1]);\r\n        __m128i L1 = _mm_set1_epi16(src[0]);\r\n        __m128i L2 = _mm_set1_epi16(src[1]);\r\n        __m128i L3 = _mm_set1_epi16(src[2]);\r\n\r\n        src += 4;\r\n\r\n        for (i = 0; i < left_size + 1; i += 32, src += 4) {\r\n            p00 = _mm_mullo_epi16(L0, coeff0);\r\n            p10 = _mm_mullo_epi16(L1, coeff1);\r\n            p20 = _mm_mullo_epi16(L2, coeff2);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst1, p00);\r\n\r\n            pfirst1 += 8;\r\n            L0 = _mm_set1_epi16(src[-1]);\r\n\r\n            p00 = _mm_mullo_epi16(L1, coeff0);\r\n            p10 = _mm_mullo_epi16(L2, coeff1);\r\n            p20 = _mm_mullo_epi16(L3, coeff2);\r\n            p30 = _mm_mullo_epi16(L0, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst1, p00);\r\n\r\n            pfirst1 += 8;\r\n            L1 = _mm_set1_epi16(src[0]);\r\n\r\n            p00 = _mm_mullo_epi16(L2, coeff0);\r\n            p10 = _mm_mullo_epi16(L3, coeff1);\r\n            p20 = _mm_mullo_epi16(L0, coeff2);\r\n            p30 = _mm_mullo_epi16(L1, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = 
_mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst1, p00);\r\n\r\n            pfirst1 += 8;\r\n            L2 = _mm_set1_epi16(src[1]);\r\n\r\n            p00 = _mm_mullo_epi16(L3, coeff0);\r\n            p10 = _mm_mullo_epi16(L0, coeff1);\r\n            p20 = _mm_mullo_epi16(L1, coeff2);\r\n            p30 = _mm_mullo_epi16(L2, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)pfirst1, p00);\r\n\r\n            pfirst1 += 8;\r\n            L3 = _mm_set1_epi16(src[2]);\r\n        }\r\n\r\n        src = src_org + 1;\r\n        for (; i < line_size; i += 16, src += 16) {\r\n            coeff2 = _mm_set1_epi16(2);\r\n\r\n            \r\n            __m128i p01, p11;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src - 1));\r\n\r\n            L0 = _mm_unpacklo_epi8(S0, zero);\r\n            L1 = _mm_unpacklo_epi8(S1, zero);\r\n            L2 = _mm_unpacklo_epi8(S2, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff2);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_add_epi16(p00, coeff2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            
p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff2);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p01 = _mm_add_epi16(p01, coeff2);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p01);\r\n            _mm_store_si128((__m128i*)&first_line[i], p00);\r\n        }\r\n\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            pfirst -= 8;\r\n        }\r\n    } else if (bsx == 8) {\r\n        __m128i coeff0 = _mm_setr_epi16(7, 3, 5, 1, 3, 1, 1, 0);\r\n        __m128i coeff1 = _mm_setr_epi16(15, 7, 13, 3, 11, 5, 9, 1);\r\n        __m128i coeff2 = _mm_setr_epi16(9, 5, 11, 3, 13, 7, 15, 2);\r\n        __m128i coeff3 = _mm_setr_epi16(1, 1, 3, 1, 5, 3, 7, 1);\r\n        __m128i coeff4 = _mm_setr_epi16(16, 8, 16, 4, 16, 8, 16, 2);\r\n        __m128i coeff5 = _mm_setr_epi16(1, 2, 1, 4, 1, 2, 1, 8);\r\n\r\n        __m128i p00, p10, p20, p30;\r\n\r\n        __m128i L0 = _mm_set1_epi16(src[-2]);\r\n        __m128i L1 = _mm_set1_epi16(src[-1]);\r\n        __m128i L2 = _mm_set1_epi16(src[0]);\r\n        __m128i L3 = _mm_set1_epi16(src[1]);\r\n        src -= 4;\r\n\r\n        bsy >>= 2;\r\n        for (i = 0; i < bsy; i++, src -= 4) {\r\n            p00 = _mm_mullo_epi16(L0, coeff0);\r\n            p10 = _mm_mullo_epi16(L1, coeff1);\r\n            p20 = _mm_mullo_epi16(L2, coeff2);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n\r\n            dst += i_dst;\r\n            L3 = 
_mm_set1_epi16(src[1]);\r\n\r\n            p00 = _mm_mullo_epi16(L3, coeff0);\r\n            p10 = _mm_mullo_epi16(L0, coeff1);\r\n            p20 = _mm_mullo_epi16(L1, coeff2);\r\n            p30 = _mm_mullo_epi16(L2, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n\r\n            dst += i_dst;\r\n            L2 = _mm_set1_epi16(src[0]);\r\n\r\n            p00 = _mm_mullo_epi16(L2, coeff0);\r\n            p10 = _mm_mullo_epi16(L3, coeff1);\r\n            p20 = _mm_mullo_epi16(L0, coeff2);\r\n            p30 = _mm_mullo_epi16(L1, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n\r\n            dst += i_dst;\r\n            L1 = _mm_set1_epi16(src[-1]);\r\n\r\n            p00 = _mm_mullo_epi16(L1, coeff0);\r\n            p10 = _mm_mullo_epi16(L2, coeff1);\r\n            p20 = _mm_mullo_epi16(L3, coeff2);\r\n            p30 = _mm_mullo_epi16(L0, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n\r\n            dst += i_dst;\r\n      
      L0 = _mm_set1_epi16(src[-2]);\r\n        }\r\n    } else {\r\n        __m128i zero = _mm_setzero_si128();\r\n        __m128i coeff3 = _mm_set1_epi16(3);\r\n        __m128i coeff4 = _mm_set1_epi16(4);\r\n        __m128i coeff5 = _mm_set1_epi16(5);\r\n        __m128i coeff7 = _mm_set1_epi16(7);\r\n        __m128i coeff8 = _mm_set1_epi16(8);\r\n        __m128i coeff9 = _mm_set1_epi16(9);\r\n        __m128i coeff11 = _mm_set1_epi16(11);\r\n        __m128i coeff13 = _mm_set1_epi16(13);\r\n        __m128i coeff15 = _mm_set1_epi16(15);\r\n        __m128i coeff16 = _mm_set1_epi16(16);\r\n        __m128i shuffle = _mm_setr_epi8(7, 15, 6, 14, 5, 13, 4, 12, 3, 11, 2, 10, 1, 9, 0, 8);\r\n        if (bsy == 4) {\r\n            src -= 15;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i M2, M4, M6, M8;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 2));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff7);\r\n            p11 = _mm_mullo_epi16(H1, coeff15);\r\n            p21 = _mm_mullo_epi16(H2, coeff9);\r\n            p31 = _mm_add_epi16(H3, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M2 = _mm_srli_epi16(p01, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n  
          p01 = _mm_add_epi16(p01, p31);\r\n            M4 = _mm_srli_epi16(p01, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff5);\r\n            p11 = _mm_mullo_epi16(H1, coeff13);\r\n            p21 = _mm_mullo_epi16(H2, coeff11);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M6 = _mm_srli_epi16(p01, 5);\r\n\r\n            p01 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            M8 = _mm_srli_epi16(p01, 3);\r\n\r\n            M2 = _mm_packus_epi16(M2, M4);\r\n            M6 = _mm_packus_epi16(M6, M8);\r\n            M2 = _mm_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm_shuffle_epi8(M6, shuffle);\r\n\r\n            M4 = _mm_unpacklo_epi16(M2, M6);\r\n\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n        } else {\r\n            src -= 15;\r\n\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i p01, p11, p21, p31;\r\n            __m128i M1, M2, M3, M4, M5, M6, M7, M8;\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 2));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, 
zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            __m128i H0 = _mm_unpackhi_epi8(S0, zero);\r\n            __m128i H1 = _mm_unpackhi_epi8(S1, zero);\r\n            __m128i H2 = _mm_unpackhi_epi8(S2, zero);\r\n            __m128i H3 = _mm_unpackhi_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff7);\r\n            p10 = _mm_mullo_epi16(L1, coeff15);\r\n            p20 = _mm_mullo_epi16(L2, coeff9);\r\n            p30 = _mm_add_epi16(L3, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M1 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff7);\r\n            p11 = _mm_mullo_epi16(H1, coeff15);\r\n            p21 = _mm_mullo_epi16(H2, coeff9);\r\n            p31 = _mm_add_epi16(H3, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M2 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff3);\r\n            p10 = _mm_mullo_epi16(L1, coeff7);\r\n            p20 = _mm_mullo_epi16(L2, coeff5);\r\n            p30 = _mm_add_epi16(L3, coeff8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M3 = _mm_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff3);\r\n            p11 = _mm_mullo_epi16(H1, coeff7);\r\n            p21 = _mm_mullo_epi16(H2, coeff5);\r\n            p31 = _mm_add_epi16(H3, coeff8);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M4 = _mm_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm_mullo_epi16(L0, 
coeff5);\r\n            p10 = _mm_mullo_epi16(L1, coeff13);\r\n            p20 = _mm_mullo_epi16(L2, coeff11);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff16);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            M5 = _mm_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm_mullo_epi16(H0, coeff5);\r\n            p11 = _mm_mullo_epi16(H1, coeff13);\r\n            p21 = _mm_mullo_epi16(H2, coeff11);\r\n            p31 = _mm_mullo_epi16(H3, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff16);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            p01 = _mm_add_epi16(p01, p21);\r\n            p01 = _mm_add_epi16(p01, p31);\r\n            M6 = _mm_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p10 = _mm_mullo_epi16(p10, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            M7 = _mm_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm_add_epi16(H0, H3);\r\n            p11 = _mm_add_epi16(H1, H2);\r\n            p11 = _mm_mullo_epi16(p11, coeff3);\r\n            p01 = _mm_add_epi16(p01, coeff4);\r\n            p01 = _mm_add_epi16(p01, p11);\r\n            M8 = _mm_srli_epi16(p01, 3);\r\n\r\n            M1 = _mm_packus_epi16(M1, M3);\r\n            M5 = _mm_packus_epi16(M5, M7);\r\n            M1 = _mm_shuffle_epi8(M1, shuffle);\r\n            M5 = _mm_shuffle_epi8(M5, shuffle);\r\n\r\n            M2 = _mm_packus_epi16(M2, M4);\r\n            M6 = _mm_packus_epi16(M6, M8);\r\n            M2 = _mm_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm_shuffle_epi8(M6, shuffle);\r\n\r\n            M3 = _mm_unpacklo_epi16(M1, M5);\r\n            M7 = _mm_unpackhi_epi16(M1, M5);\r\n            M4 = _mm_unpacklo_epi16(M2, M6);\r\n            M8 = _mm_unpackhi_epi16(M2, M6);\r\n\r\n      
      *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            M4 = _mm_srli_si128(M4, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M4);\r\n            dst += i_dst;\r\n            *((int*)dst) = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            M8 = _mm_srli_si128(M8, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            M8 = _mm_srli_si128(M8, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            M8 = _mm_srli_si128(M8, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M8);\r\n            dst += i_dst;\r\n            *((int*)dst) = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            M3 = _mm_srli_si128(M3, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            M3 = _mm_srli_si128(M3, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            M3 = _mm_srli_si128(M3, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M3);\r\n            dst += i_dst;\r\n            *((int*)dst) = _mm_cvtsi128_si32(M7);\r\n            dst += i_dst;\r\n            M7 = _mm_srli_si128(M7, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M7);\r\n            dst += i_dst;\r\n            M7 = _mm_srli_si128(M7, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M7);\r\n            dst += i_dst;\r\n            M7 = _mm_srli_si128(M7, 4);\r\n            *((int*)dst) = _mm_cvtsi128_si32(M7);\r\n        }\r\n    }\r\n\r\n}\r\n\r\n#endif // #if !HIGH_BIT_DEPTH\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_intra-pred_avx2.cc",
    "content": "/*\r\n * intrinsic_intra-pred_avx2.cc\r\n *\r\n * Description of this file:\r\n *    AVX2 assembly functions of Intra-Prediction module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n#include <immintrin.h>\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#if !HIGH_BIT_DEPTH\r\n\r\nvoid intra_pred_ver_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    if (bsx <= 8 && bsy <= 8) {\r\n        // block_sizeС8ʱavx2sse\r\n        intra_pred_ver_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        return;\r\n    }\r\n    pel_t *rsrc = src + 1;\r\n    int i;\r\n\r\n    __m256i S1;\r\n    if 
(bsx >= 32) {\r\n        for (i = 0; i < bsy; i++) {\r\n            S1 = _mm256_loadu_si256((const __m256i*)(rsrc));//32\r\n            _mm256_storeu_si256((__m256i*)(dst), S1);\r\n\r\n            if (32 < bsx) {\r\n                S1 = _mm256_loadu_si256((const __m256i*)(rsrc + 32));//64\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), S1);\r\n            }\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        int j;\r\n        __m128i S_1;\r\n        if (bsx & 15) {//4/8\r\n            __m128i mask = _mm_load_si128((const __m128i*)intrinsic_mask[(bsx & 15) - 1]);\r\n            for (i = 0; i < bsy; i++) {\r\n                for (j = 0; j < bsx - 15; j += 16) {\r\n                    S_1 = _mm_loadu_si128((const __m128i*)(rsrc + j));\r\n                    _mm_storeu_si128((__m128i*)(dst + j), S_1);\r\n                }\r\n                S_1 = _mm_loadu_si128((const __m128i*)(rsrc + j));\r\n                _mm_maskmoveu_si128(S_1, mask, (char *)&dst[j]);\r\n                dst += i_dst;\r\n            }\r\n        }  else {\r\n            for (i = 0; i < bsy; i++) {//16\r\n                S_1 = _mm_loadu_si128((const __m128i*)rsrc);\r\n                _mm_storeu_si128((__m128i*)dst, S_1);\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_hor_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    if (bsx <= 8 && bsy <= 8) {\r\n        // for blocks no larger than 8x8, use the SSE128 version rather than AVX2\r\n        intra_pred_hor_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        return;\r\n    }\r\n    int i;\r\n    pel_t *rsrc = src - 1;\r\n    __m256i S1;\r\n\r\n    if (bsx >= 32) {\r\n        for (i = 0; i < bsy; i++) {\r\n            S1 = _mm256_set1_epi8((char)rsrc[-i]);//32\r\n            _mm256_storeu_si256((__m256i*)(dst), S1);\r\n\r\n            if (32 < bsx) {//64\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), S1);\r\n            }\r\n            dst += i_dst;\r\n        
}\r\n    } else {\r\n        int j;\r\n        __m128i S_1;\r\n        if (bsx & 15) {//4/8\r\n            __m128i mask = _mm_load_si128((const __m128i*)intrinsic_mask[(bsx & 15) - 1]);\r\n            for (i = 0; i < bsy; i++) {\r\n                for (j = 0; j < bsx - 15; j += 16) {\r\n                    S_1 = _mm_set1_epi8((char)rsrc[-i]);\r\n                    _mm_storeu_si128((__m128i*)(dst + j), S_1);\r\n                }\r\n                S_1 = _mm_set1_epi8((char)rsrc[-i]);\r\n                _mm_maskmoveu_si128(S_1, mask, (char*)&dst[j]);\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            for (i = 0; i < bsy; i++) {//16\r\n                S_1 = _mm_set1_epi8((char)rsrc[-i]);\r\n                _mm_storeu_si128((__m128i*)dst, S_1);\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    }\r\n}\r\n\r\nvoid intra_pred_dc_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    if (bsx <= 8 && bsy <= 8) {\r\n        // for blocks no larger than 8x8, use the SSE128 version rather than AVX2\r\n        intra_pred_dc_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        return;\r\n    }\r\n    int bAboveAvail = dir_mode >> 8;\r\n    int bLeftAvail = dir_mode & 0xFF;\r\n    int   x, y;\r\n    int   iDCValue = 0;\r\n    pel_t  *rsrc = src - 1;\r\n    __m256i S1;\r\n    int i;\r\n    if (bLeftAvail) {\r\n        for (y = 0; y < bsy; y++) {\r\n            iDCValue += rsrc[-y];\r\n        }\r\n\r\n        rsrc = src + 1;\r\n        if (bAboveAvail) {\r\n            for (x = 0; x < bsx; x++) {\r\n                iDCValue += rsrc[x];\r\n            }\r\n\r\n            iDCValue += ((bsx + bsy) >> 1);\r\n            iDCValue = (iDCValue * (512 / (bsx + bsy))) >> 9;\r\n        } else {\r\n            iDCValue += bsy / 2;\r\n            iDCValue /= bsy;\r\n        }\r\n    } else {\r\n        rsrc = src + 1;\r\n        if (bAboveAvail) {\r\n            for (x = 0; x < bsx; x++) {\r\n                iDCValue += rsrc[x];\r\n            }\r\n\r\n    
        iDCValue += bsx / 2;\r\n            iDCValue /= bsx;\r\n        } else {\r\n            iDCValue = g_dc_value;\r\n        }\r\n    }\r\n    /*\r\n    for (y = 0; y < bsy; y++) {\r\n    for (x = 0; x < bsx; x++) {\r\n    dst[x] = iDCValue;\r\n    }\r\n    dst += i_dst;\r\n    }\r\n    */\r\n\r\n    S1 = _mm256_set1_epi8((char)iDCValue);\r\n    if (bsx >= 32) {\r\n        for (i = 0; i < bsy; i++) {\r\n            _mm256_storeu_si256((__m256i*)(dst), S1);//32\r\n            if (32 < bsx) {//64\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), S1);\r\n            }\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m128i S_1;\r\n        int j;\r\n        S_1 = _mm_set1_epi8((char)iDCValue);\r\n        if (bsx & 15) {//4/8\r\n            __m128i mask = _mm_load_si128((const __m128i*)intrinsic_mask[(bsx & 15) - 1]);\r\n            for (i = 0; i < bsy; i++) {\r\n                for (j = 0; j < bsx - 15; j += 16) {\r\n                    _mm_storeu_si128((__m128i*)(dst + j), S_1);\r\n                }\r\n                _mm_maskmoveu_si128(S_1, mask, (char*)&dst[j]);\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            for (i = 0; i < bsy; i++) {//16\r\n                _mm_storeu_si128((__m128i*)dst, S_1);\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_plane_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    pel_t  *rpSrc;\r\n    int iH = 0;\r\n    int iV = 0;\r\n    int iA, iB, iC;\r\n    int x, y;\r\n    int iW2 = bsx >> 1;\r\n    int iH2 = bsy >> 1;\r\n    int ib_mult[5] = { 13, 17, 5, 11, 23 };\r\n    int ib_shift[5] = { 7, 10, 11, 15, 19 };\r\n    int im_h = ib_mult [tab_log2[bsx] - 2];\r\n    int is_h = ib_shift[tab_log2[bsx] - 2];\r\n    int im_v = ib_mult [tab_log2[bsy] - 2];\r\n    int is_v = ib_shift[tab_log2[bsy] - 2];\r\n\r\n    int iTmp;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    rpSrc = src + 
iW2;\r\n    for (x = 1; x < iW2 + 1; x++) {\r\n        iH += x * (rpSrc[x] - rpSrc[-x]);\r\n    }\r\n\r\n    rpSrc = src - iH2;\r\n    for (y = 1; y < iH2 + 1; y++) {\r\n        iV += y * (rpSrc[-y] - rpSrc[y]);\r\n    }\r\n\r\n    iA = (src[-1 - (bsy - 1)] + src[1 + bsx - 1]) << 4;\r\n    iB = ((iH << 5) * im_h + (1 << (is_h - 1))) >> is_h;\r\n    iC = ((iV << 5) * im_v + (1 << (is_v - 1))) >> is_v;\r\n\r\n    iTmp = iA - (iH2 - 1) * iC - (iW2 - 1) * iB + 16;\r\n\r\n    __m256i TC, TB, TA, T_Start, T, D, D1;\r\n    __m256i mask ;\r\n    \r\n    TA = _mm256_set1_epi16((int16_t)iTmp);\r\n    TB = _mm256_set1_epi16((int16_t)iB);\r\n    TC = _mm256_set1_epi16((int16_t)iC);\r\n\r\n    T_Start = _mm256_set_epi16(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);\r\n    T_Start = _mm256_mullo_epi16(TB, T_Start);\r\n    T_Start = _mm256_add_epi16(T_Start, TA);\r\n\r\n    TB = _mm256_mullo_epi16(TB, _mm256_set1_epi16(16));\r\n\r\n    if (bsx == 4){\r\n        mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[3]);\r\n        for (y = 0; y < bsy; y++) {\r\n            D = _mm256_srai_epi16(T_Start, 5);\r\n            D = _mm256_packus_epi16(D, D);\r\n            _mm256_maskstore_epi32((int*)dst, mask, D);\r\n            T_Start = _mm256_add_epi16(T_Start, TC);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 8) {\r\n        mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[7]);\r\n        for (y = 0; y < bsy; y++) {\r\n            D = _mm256_srai_epi16(T_Start, 5);\r\n            D = _mm256_packus_epi16(D, D);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, D);\r\n            T_Start = _mm256_add_epi16(T_Start, TC);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16) {\r\n        mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[15]);\r\n        for (y = 0; y < bsy; y++) {\r\n            D = _mm256_srai_epi16(T_Start, 5);\r\n            D = _mm256_packus_epi16(D, D);\r\n      
      D = _mm256_permute4x64_epi64(D, 8);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, D);\r\n            T_Start = _mm256_add_epi16(T_Start, TC);\r\n            dst += i_dst;\r\n        }\r\n    } else { //32 64\r\n        for (y = 0; y < bsy; y++) {\r\n            T = T_Start;\r\n            for (x = 0; x < bsx; x += 32) {\r\n                D = _mm256_srai_epi16(T, 5);\r\n                T = _mm256_add_epi16(T, TB);\r\n                D1 = _mm256_srai_epi16(T, 5);\r\n                D = _mm256_packus_epi16(D, D1);\r\n                D = _mm256_permute4x64_epi64(D, 0x00D8);\r\n                _mm256_storeu_si256((__m256i*)(dst + x), D);\r\n\r\n                T = _mm256_add_epi16(T, TB);\r\n            }\r\n            T_Start = _mm256_add_epi16(T_Start, TC);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n\r\n}\r\n\r\nvoid intra_pred_bilinear_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int x, y;\r\n    int ishift_x = tab_log2[bsx];\r\n    int ishift_y = tab_log2[bsy];\r\n    int ishift = DAVS2_MIN(ishift_x, ishift_y);\r\n    int ishift_xy = ishift_x + ishift_y + 1;\r\n    int offset = 1 << (ishift_x + ishift_y);\r\n    int a, b, c, t, val;\r\n    pel_t *p;\r\n\r\n\r\n    __m256i T, T1, T2, T3, C1, C2, ADD;\r\n\r\n    /* TODO: why do the sizes of these arrays need an extra 32 elements, and is that necessary? */\r\n    ALIGN32(itr_t pTop[MAX_CU_SIZE + 32]);\r\n    ALIGN32(itr_t pLeft[MAX_CU_SIZE + 32]);\r\n    ALIGN32(itr_t pT[MAX_CU_SIZE + 32]);\r\n    ALIGN32(itr_t pL[MAX_CU_SIZE + 32]);\r\n    ALIGN32(itr_t wy[MAX_CU_SIZE + 32]);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    p = src + 1;\r\n    __m256i ZERO = _mm256_setzero_si256();\r\n    for (x = 0; x < bsx; x += 32) {\r\n        T = _mm256_loadu_si256((__m256i*)(p + x));//8bit 32\r\n        T1 = _mm256_unpacklo_epi8(T, ZERO); //0 2\r\n        T2 = _mm256_unpackhi_epi8(T, ZERO); //1 3\r\n        T = _mm256_permute2x128_si256(T1, T2, 0x0020);\r\n        _mm256_store_si256((__m256i*)(pTop + x), T);\r\n        T 
= _mm256_permute2x128_si256(T1, T2, 0x0031);\r\n        _mm256_store_si256((__m256i*)(pTop + x + 16), T);\r\n    }\r\n    for (y = 0; y < bsy; y++) {\r\n        pLeft[y] = src[-1 - y];\r\n    }\r\n\r\n\r\n    //p = src + 1;\r\n    //for (x = 0; x < bsx; x++) {\r\n    //    pTop[x] = p[x];\r\n    //}\r\n    //p = src - 1;\r\n    //for (y = 0; y < bsy; y++) {\r\n    //    pLeft[y] = p[-y];\r\n    //}\r\n    \r\n\r\n    a = pTop[bsx - 1];\r\n    b = pLeft[bsy - 1];\r\n\r\n    if (bsx == bsy) {\r\n        c = (a + b + 1) >> 1;\r\n    } else {\r\n        c = (((a << ishift_x) + (b << ishift_y)) * 13 + (1 << (ishift + 5))) >> (ishift + 6);\r\n    }\r\n\r\n    t = (c << 1) - a - b;\r\n\r\n    T = _mm256_set1_epi16((int16_t)b);\r\n    for (x = 0; x < bsx; x += 16) {\r\n        T1 = _mm256_loadu_si256((__m256i*)(pTop + x));\r\n        T2 = _mm256_sub_epi16(T, T1);\r\n        T1 = _mm256_slli_epi16(T1, ishift_y);\r\n        _mm256_store_si256((__m256i*)(pT + x), T2);\r\n        _mm256_store_si256((__m256i*)(pTop + x), T1);\r\n    }\r\n\r\n    T = _mm256_set1_epi16((int16_t)a);\r\n    for (y = 0; y < bsy; y += 16) {\r\n        T1 = _mm256_loadu_si256((__m256i*)(pLeft + y));\r\n        T2 = _mm256_sub_epi16(T, T1);\r\n        T1 = _mm256_slli_epi16(T1, ishift_x);\r\n        _mm256_store_si256((__m256i*)(pL + y), T2);\r\n        _mm256_store_si256((__m256i*)(pLeft + y), T1);\r\n    }\r\n\r\n    T = _mm256_set1_epi16((int16_t)t);\r\n    T = _mm256_mullo_epi16(T, _mm256_set_epi16(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));\r\n    T1 = _mm256_set1_epi16((int16_t)(16 * t));\r\n\r\n    for (y = 0; y < bsy; y += 16) {\r\n        _mm256_store_si256((__m256i*)(wy + y), T);\r\n        T = _mm256_add_epi16(T, T1);\r\n    }\r\n\r\n    C1 = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);\r\n    C2 = _mm256_set1_epi32(8);\r\n\r\n    if (bsx == 4) {\r\n        __m256i pTT = _mm256_loadu_si256((__m256i*)pT);\r\n        T = _mm256_loadu_si256((__m256i*)pTop);\r\n        __m256i mask = 
_mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[3]);\r\n            for (y = 0; y < bsy; y++) {\r\n                int add = (pL[y] << ishift_y) + wy[y];\r\n                ADD = _mm256_set1_epi32(add);\r\n                ADD = _mm256_mullo_epi32(C1, ADD);\r\n                val = (pLeft[y] << ishift_y) + offset + (pL[y] << ishift_y);\r\n                ADD = _mm256_add_epi32(ADD, _mm256_set1_epi32(val));\r\n\r\n                T = _mm256_add_epi16(T, pTT);\r\n                T1 = _mm256_cvtepi16_epi32(_mm256_extracti128_si256(T, 0));\r\n                T1 = _mm256_slli_epi32(T1, ishift_x);\r\n\r\n                T1 = _mm256_add_epi32(T1, ADD);\r\n                T1 = _mm256_srai_epi32(T1, ishift_xy);\r\n\r\n                T1 = _mm256_packus_epi32(T1, T1);\r\n                T1 = _mm256_packus_epi16(T1, T1);\r\n\r\n                _mm256_maskstore_epi32((int*)dst, mask, T1);\r\n\r\n                dst += i_dst;\r\n            }\r\n    } else if (bsx == 8) {\r\n        __m256i pTT = _mm256_load_si256((__m256i*)pT);\r\n        T = _mm256_load_si256((__m256i*)pTop);\r\n        __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n        for (y = 0; y < bsy; y++) {\r\n            int add = (pL[y] << ishift_y) + wy[y];\r\n            ADD = _mm256_set1_epi32(add);\r\n            ADD = _mm256_mullo_epi32(C1, ADD);\r\n            val = (pLeft[y] << ishift_y) + offset + (pL[y] << ishift_y);\r\n            ADD = _mm256_add_epi32(ADD, _mm256_set1_epi32(val));\r\n\r\n            T = _mm256_add_epi16(T, pTT);\r\n            T1 = _mm256_cvtepi16_epi32(_mm256_extracti128_si256(T, 0));\r\n            T1 = _mm256_slli_epi32(T1, ishift_x);\r\n\r\n            T1 = _mm256_add_epi32(T1, ADD);\r\n            T1 = _mm256_srai_epi32(T1, ishift_xy);\r\n\r\n            //mask\r\n\r\n            //T1 is the result\r\n            T1 = _mm256_packus_epi32(T1, T1); //1 2 3 4 1 2 3 4 5 6 7 8 5 6 7 8\r\n            T1 = _mm256_permute4x64_epi64(T1, 0x0008);\r\n        
    T1 = _mm256_packus_epi16(T1, T1);\r\n\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, T1);\r\n\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i TT[8];\r\n        __m256i PTT[8];\r\n        __m256i temp1, temp2;\r\n        __m256i mask1 = _mm256_set_epi32(3, 2, 1, 0, 5, 1, 4, 0);\r\n        __m256i mask2 = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        for (x = 0; x < bsx; x += 16) {\r\n            int idx = x >> 3;\r\n            __m256i M0 = _mm256_loadu_si256((__m256i*)(pTop + x)); //0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\r\n            __m256i M1 = _mm256_loadu_si256((__m256i*)(pT + x));\r\n            temp1 = _mm256_unpacklo_epi16(M0, ZERO); //0 1 2 3   8  9 10 11\r\n            temp2 = _mm256_unpackhi_epi16(M0, ZERO); //4 5 6 7  12 13 14 15\r\n            TT[idx]      = _mm256_permute2x128_si256(temp1, temp2, 0x0020); //0 1 2 3 4 5 6 7\r\n            TT[idx + 1]  = _mm256_permute2x128_si256(temp1, temp2, 0x0031); //8 9 10 11 12 13 14 15\r\n\r\n            PTT[idx]     = _mm256_cvtepi16_epi32(_mm256_extracti128_si256(M1, 0));\r\n            PTT[idx + 1] = _mm256_cvtepi16_epi32(_mm256_extracti128_si256(M1, 1));\r\n        }\r\n        for (y = 0; y < bsy; y++) {\r\n            int add = (pL[y] << ishift_y) + wy[y];\r\n            ADD = _mm256_set1_epi32(add);\r\n            T3 = _mm256_mullo_epi32(C2, ADD);\r\n            ADD = _mm256_mullo_epi32(C1, ADD);\r\n\r\n            val = (pLeft[y] << ishift_y) + offset + (pL[y] << ishift_y);\r\n\r\n            ADD = _mm256_add_epi32(ADD, _mm256_set1_epi32(val));\r\n\r\n            for (x = 0; x < bsx; x += 16) {\r\n                int idx = x >> 3;\r\n                TT[idx] = _mm256_add_epi32(TT[idx], PTT[idx]); //0 1 2 3 4 5 6 7\r\n                TT[idx + 1] = _mm256_add_epi32(TT[idx + 1], PTT[idx + 1]); //8 9 10 11 12 13 14 15\r\n\r\n                T1 = _mm256_slli_epi32(TT[idx], ishift_x);\r\n                T2 = _mm256_slli_epi32(TT[idx + 1], 
ishift_x);\r\n\r\n                T1 = _mm256_add_epi32(T1, ADD);\r\n                T1 = _mm256_srai_epi32(T1, ishift_xy);//0 1 2 3 4 5 6 7\r\n\r\n                ADD = _mm256_add_epi32(ADD, T3);\r\n                T2 = _mm256_add_epi32(T2, ADD);\r\n                T2 = _mm256_srai_epi32(T2, ishift_xy);//8 9 10 11 12 13 14 15\r\n                \r\n                //T1 T2 is the result\r\n                T1 = _mm256_packus_epi32(T1, T2); //0 1 2 3 8 9 10 11 4 5 6 7 12 13 14 15\r\n                T1 = _mm256_packus_epi16(T1, T1); //0 1 2 3 8 9 10 11 0 1 2 3 8 9 10 11     4 5 6 7 12 13 14 15 4 5 6 7 12 13 14 15\r\n                T1 = _mm256_permutevar8x32_epi32(T1, mask1);\r\n                \r\n                //store 128 bits\r\n                _mm256_maskstore_epi64((__int64 *)(dst + x), mask2, T1);\r\n\r\n                ADD = _mm256_add_epi32(ADD, T3);\r\n            }\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_x_3_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n\r\n    pel_t *dst1 = dst;\r\n    pel_t *dst2 = dst1 + i_dst;\r\n    pel_t *dst3 = dst2 + i_dst;\r\n    pel_t *dst4 = dst3 + i_dst;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if ((bsy > 4) && (bsx > 8)) {\r\n\r\n        __m256i coeff2 = _mm256_set1_epi16(2);\r\n        __m256i coeff3 = _mm256_set1_epi16(3);\r\n        __m256i coeff4 = _mm256_set1_epi16(4);\r\n        __m256i coeff5 = _mm256_set1_epi16(5);\r\n        __m256i coeff7 = _mm256_set1_epi16(7);\r\n        __m256i coeff8 = _mm256_set1_epi16(8);\r\n\r\n        ALIGN32(pel_t first_line[(64 + 176 + 16) << 2]);\r\n        int line_size = bsx + (((bsy - 4) * 11) >> 2);\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        int iW2 = bsx * 2 - 1;\r\n        int real_size = DAVS2_MIN(line_size, iW2 + 1);\r\n#endif\r\n        int aligned_line_size = 64 + 176 + 16;\r\n        int i;\r\n        pel_t *pfirst[4];\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        pel_t *src_org = 
src;\r\n#endif\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n\r\n        __m256i p00, p10, p20, p30;\r\n        __m256i p01, p11, p21, p31;\r\n\r\n        __m256i SS2, SS11;\r\n        __m256i L2, L3, L4, L5, L6, L7, L8, L9, L10, L11, L12, L13;\r\n        __m256i H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12, H13;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n        for (i = 0; i < line_size - 16; i += 32, src += 32) {\r\n#else\r\n        for (i = 0; i < real_size - 16; i += 32, src += 32) {\r\n#endif\r\n\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 2));//2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));//2...17\r\n            H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));//18...34\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 3));//3 4 5 6 7 8 9 10 11 12 13 14 15\r\n            L3  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));//3...18\r\n            H3  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));//19...35\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 4));//4 5 6 7 8 9 10 11 12 13 14 15\r\n            L4  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));//4\r\n            H4  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));//20\r\n            SS2  = _mm256_loadu_si256((__m256i*)(src + 5));//5 6 7 8 9 10 11 12 13 14 15\r\n            L5  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));//5\r\n            H5  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));//21\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 6));//6 7 8 9 10 11 12 13 14 15\r\n            L6  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));//6\r\n            H6  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));//22\r\n            SS2 = 
_mm256_loadu_si256((__m256i*)(src + 7));//7 8 9 10 11 12 13 14 15\r\n            L7  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            H7  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 8));//8 9 10 11 12 13 14 15\r\n            L8  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            H8  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 9));//9 10 11 12 13 14 15\r\n            L9  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            H9  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 10));//10 11 12 13 14 15\r\n            L10 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            H10 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 11));//11 12 13 14 15 16 17 18 19 20 21 22 23\r\n            L11 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            H11 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 12));//12 13 14 15 16 17 18 19 20...\r\n            L12 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            H12 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 13));//13 ...28 29...44\r\n            L13 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            H13 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 1));\r\n\r\n            p00 = _mm256_add_epi16(L2, coeff8);//2 ...17\r\n            p10 = _mm256_mullo_epi16(L3, coeff5);\r\n            p20 = _mm256_mullo_epi16(L4, coeff7);\r\n            p30 = _mm256_mullo_epi16(L5, coeff3);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n    
        p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm256_add_epi16(H2, coeff8);\r\n            p11 = _mm256_mullo_epi16(H3, coeff5);\r\n            p21 = _mm256_mullo_epi16(H4, coeff7);\r\n            p31 = _mm256_mullo_epi16(H5, coeff3);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(L5, L8);\r\n            p10 = _mm256_add_epi16(L6, L7);\r\n            p10 = _mm256_mullo_epi16(p10, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff4);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm256_add_epi16(H5, H8);\r\n            p11 = _mm256_add_epi16(H6, H7);\r\n            p11 = _mm256_mullo_epi16(p11, coeff3);\r\n            p01 = _mm256_add_epi16(p01, coeff4);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[1][i], p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L8, coeff3);\r\n            p10 = _mm256_mullo_epi16(L9, coeff7);\r\n            p20 = _mm256_mullo_epi16(L10, coeff5);\r\n            p30 = _mm256_add_epi16(L11, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm256_mullo_epi16(H8, coeff3);\r\n            p11 = _mm256_mullo_epi16(H9, coeff7);\r\n           
 p21 = _mm256_mullo_epi16(H10, coeff5);\r\n            p31 = _mm256_add_epi16(H11, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(L11, L13);\r\n            p10 = _mm256_mullo_epi16(L12, coeff2);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm256_add_epi16(H11, H13);\r\n            p11 = _mm256_mullo_epi16(H12, coeff2);\r\n            p01 = _mm256_add_epi16(p01, coeff2);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[3][i], p00);\r\n        }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n        if (i < line_size) {\r\n#else\r\n        if (i < real_size) {\r\n#endif\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 3));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 4));\r\n            L4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 5));\r\n            L5 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 6));\r\n            L6 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            
SS2 = _mm256_loadu_si256((__m256i*)(src + 7));\r\n            L7 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 8));\r\n            L8 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 9));\r\n            L9 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n            SS2 = _mm256_loadu_si256((__m256i*)(src + 10));\r\n            L10 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n\r\n            SS11 = _mm256_loadu_si256((__m256i*)(src + 11));\r\n            L11 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS11, 0));\r\n            SS11 = _mm256_loadu_si256((__m256i*)(src + 12));\r\n            L12 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS11, 0));\r\n            SS11 = _mm256_loadu_si256((__m256i*)(src + 13));\r\n            L13 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS11, 0));\r\n\r\n            p00 = _mm256_add_epi16(L2, coeff8);\r\n            p10 = _mm256_mullo_epi16(L3, coeff5);\r\n            p20 = _mm256_mullo_epi16(L4, coeff7);\r\n            p30 = _mm256_mullo_epi16(L5, coeff3);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L5, L8);\r\n            p10 = _mm256_add_epi16(L6, L7);\r\n            p10 = _mm256_mullo_epi16(p10, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff4);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p00 = 
_mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask, p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L8, coeff3);\r\n            p10 = _mm256_mullo_epi16(L9, coeff7);\r\n            p20 = _mm256_mullo_epi16(L10, coeff5);\r\n            p30 = _mm256_add_epi16(L11, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[2][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L11, L13);\r\n            p10 = _mm256_mullo_epi16(L12, coeff2);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[3][i], mask, p00);\r\n        }\r\n\r\n        bsy >>= 2;\r\n        __m256i M;\r\n        if (bsx == 64){\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst1, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst1 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst2, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst2 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i * 11));\r\n      
          _mm256_storeu_si256((__m256i*)dst3, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst3 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst4, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst4 + 32), M);\r\n\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n            }\r\n        } else if (bsx == 32) {\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst1, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst2, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst3, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst4, M);\r\n\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n            }\r\n        } else {\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst2, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i * 11));\r\n        
        _mm256_maskstore_epi64((__int64 *)dst3, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst4, mask, M);\r\n\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n            }\r\n        }\r\n\r\n        /*for (i = 0; i < bsy; i++) {\r\n            memcpy(dst1, pfirst[0] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst2, pfirst[1] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst3, pfirst[2] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst4, pfirst[3] + i * 11, bsx * sizeof(pel_t));\r\n            dst1 = dst4 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n        }*/\r\n    } else if (bsx == 16) {\r\n\r\n        __m256i coeff2 = _mm256_set1_epi16(2);\r\n        __m256i coeff3 = _mm256_set1_epi16(3);\r\n        __m256i coeff4 = _mm256_set1_epi16(4);\r\n        __m256i coeff5 = _mm256_set1_epi16(5);\r\n        __m256i coeff7 = _mm256_set1_epi16(7);\r\n        __m256i coeff8 = _mm256_set1_epi16(8);\r\n\r\n        __m256i p00, p10, p20, p30;\r\n        __m256i SS2 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n        __m256i L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n        SS2 = _mm256_loadu_si256((__m256i*)(src + 3));\r\n        __m256i L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n        SS2 = _mm256_loadu_si256((__m256i*)(src + 4));\r\n        __m256i L4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n        SS2 = _mm256_loadu_si256((__m256i*)(src + 5));\r\n        __m256i L5 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n        SS2 = _mm256_loadu_si256((__m256i*)(src + 6));\r\n        __m256i L6 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n        SS2 = 
_mm256_loadu_si256((__m256i*)(src + 7));\r\n        __m256i L7 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n        SS2 = _mm256_loadu_si256((__m256i*)(src + 8));\r\n        __m256i L8 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n        SS2 = _mm256_loadu_si256((__m256i*)(src + 9));\r\n        __m256i L9 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n        SS2 = _mm256_loadu_si256((__m256i*)(src + 10));\r\n        __m256i L10 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS2, 0));\r\n\r\n        __m256i SS11 = _mm256_loadu_si256((__m256i*)(src + 11));\r\n        __m256i L11 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS11, 0));\r\n        SS11 = _mm256_loadu_si256((__m256i*)(src + 12));\r\n        __m256i L12 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS11, 0));\r\n        SS11 = _mm256_loadu_si256((__m256i*)(src + 13));\r\n        __m256i L13 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS11, 0));\r\n\r\n        p00 = _mm256_add_epi16(L2, coeff8);\r\n        p10 = _mm256_mullo_epi16(L3, coeff5);\r\n        p20 = _mm256_mullo_epi16(L4, coeff7);\r\n        p30 = _mm256_mullo_epi16(L5, coeff3);\r\n        p00 = _mm256_add_epi16(p00, p10);\r\n        p00 = _mm256_add_epi16(p00, p20);\r\n        p00 = _mm256_add_epi16(p00, p30);\r\n        p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p00);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n        __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        _mm256_maskstore_epi64((__int64 *)dst1, mask, p00);\r\n\r\n        p00 = _mm256_add_epi16(L5, L8);\r\n        p10 = _mm256_add_epi16(L6, L7);\r\n        p10 = _mm256_mullo_epi16(p10, coeff3);\r\n        p00 = _mm256_add_epi16(p00, coeff4);\r\n        p00 = _mm256_add_epi16(p00, p10);\r\n        p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p00);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n    
    _mm256_maskstore_epi64((__int64 *)dst2, mask, p00);\r\n\r\n        p00 = _mm256_mullo_epi16(L8, coeff3);\r\n        p10 = _mm256_mullo_epi16(L9, coeff7);\r\n        p20 = _mm256_mullo_epi16(L10, coeff5);\r\n        p30 = _mm256_add_epi16(L11, coeff8);\r\n        p00 = _mm256_add_epi16(p00, p10);\r\n        p00 = _mm256_add_epi16(p00, p20);\r\n        p00 = _mm256_add_epi16(p00, p30);\r\n        p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p00);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n        _mm256_maskstore_epi64((__int64 *)dst3, mask, p00);\r\n\r\n        p00 = _mm256_add_epi16(L11, L13);\r\n        p10 = _mm256_mullo_epi16(L12, coeff2);\r\n        p00 = _mm256_add_epi16(p00, coeff2);\r\n        p00 = _mm256_add_epi16(p00, p10);\r\n        p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p00);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n        _mm256_maskstore_epi64((__int64 *)dst4, mask, p00);\r\n\r\n    } else { //8x8 8x32 4x16 4x4\r\n\r\n        intra_pred_ang_x_3_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_x_4_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    if (bsx != bsy && bsx < bsy){\r\n        intra_pred_ang_x_4_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        return;\r\n    }\r\n    ALIGN32(pel_t first_line[64 + 128]);\r\n    int line_size = bsx + ((bsy - 1) << 1);\r\n\r\n    int iHeight2 = bsy << 1;\r\n    int i;\r\n    __m256i zero = _mm256_setzero_si256();\r\n    __m256i offset = _mm256_set1_epi16(2);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n    src += 3;\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 16; i += 32, src += 32) {\r\n#else\r\n    for (i = 0; i < real_size - 16; i += 32, src += 32) {\r\n#endif\r\n        //0 1 2 3 .... 12 13 14 15    16 17 18 19 .... 
28 29 30 31\r\n        __m256i S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        __m256i S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        __m256i S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n        __m256i L0 = _mm256_unpacklo_epi8(S0, zero);//0 1 2 3 4 5 6 7     16 17 18 19 20 21 22 23\r\n        __m256i L1 = _mm256_unpacklo_epi8(S1, zero);\r\n        __m256i L2 = _mm256_unpacklo_epi8(S2, zero);\r\n\r\n        __m256i H0 = _mm256_unpackhi_epi8(S0, zero);//8 9 10 11 12 13 14 15     24 25 26 27 28 29 30 31\r\n        __m256i H1 = _mm256_unpackhi_epi8(S1, zero);\r\n        __m256i H2 = _mm256_unpackhi_epi8(S2, zero);\r\n\r\n        __m256i tmp0 = _mm256_permute2x128_si256(L0, H0, 0x0020);//0 1 2 3 4 5 6 7   8 9 10 11 12 13 14 15\r\n        __m256i tmp1 = _mm256_permute2x128_si256(L1, H1, 0x0020);\r\n        __m256i tmp2 = _mm256_permute2x128_si256(L2, H2, 0x0020);\r\n        __m256i sum1 = _mm256_add_epi16(tmp0, tmp1);\r\n        __m256i sum2 = _mm256_add_epi16(tmp1, tmp2);\r\n\r\n\r\n        tmp0 = _mm256_permute2x128_si256(L0, H0, 0x0031);//16 17...24 25...\r\n        tmp1 = _mm256_permute2x128_si256(L1, H1, 0x0031);\r\n        tmp2 = _mm256_permute2x128_si256(L2, H2, 0x0031);\r\n        __m256i sum3 = _mm256_add_epi16(tmp0, tmp1);\r\n        __m256i sum4 = _mm256_add_epi16(tmp1, tmp2);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, sum2);\r\n        sum3 = _mm256_add_epi16(sum3, sum4);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, offset);\r\n        sum3 = _mm256_add_epi16(sum3, offset);\r\n\r\n        sum1 = _mm256_srli_epi16(sum1, 2);\r\n        sum3 = _mm256_srli_epi16(sum3, 2);\r\n\r\n        sum1 = _mm256_packus_epi16(sum1, sum3);//0 2 1 3\r\n        sum1 = _mm256_permute4x64_epi64(sum1, 0x00D8);\r\n        _mm256_storeu_si256((__m256i*)&first_line[i], sum1);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        //0 1 2 3 .... 12 13 14 15    16 17 18 19 .... 
28 29 30 31\r\n        __m256i S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        __m256i S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        __m256i S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        S0 = _mm256_permute4x64_epi64(S0, 0x00D8);\r\n        S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n        S1 = _mm256_permute4x64_epi64(S1, 0x00D8);\r\n\r\n        __m256i L0 = _mm256_unpacklo_epi8(S0, zero);\r\n        __m256i L1 = _mm256_unpacklo_epi8(S1, zero);\r\n        __m256i L2 = _mm256_unpacklo_epi8(S2, zero);\r\n\r\n        __m256i sum1 = _mm256_add_epi16(L0, L1);\r\n        __m256i sum2 = _mm256_add_epi16(L1, L2);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, sum2);\r\n        sum1 = _mm256_add_epi16(sum1, offset);\r\n        sum1 = _mm256_srli_epi16(sum1, 2);\r\n\r\n        sum1 = _mm256_packus_epi16(sum1, sum1);\r\n        sum1 = _mm256_permute4x64_epi64(sum1, 0x0008);\r\n        //store 128 bit\r\n        __m256i mask2 = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        _mm256_maskstore_epi64((__int64 *)(first_line + i), mask2, sum1);\r\n        \r\n        //_mm_storel_epi64((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n    if (bsx == 64){\r\n\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i]+32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i + 2] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 4]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = 
_mm256_lddqu_si256((__m256i*)(&first_line[i + 4] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 6]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i + 6] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 32){\r\n        for (i = 0; i < iHeight2; i += 8){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 4]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 6]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16){\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m256i M = _mm256_loadu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[i + 4]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[i + 6]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    } 
else if (bsx == 8){\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m256i M = _mm256_loadu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 2);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 2);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 2);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    } else{\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m256i M = _mm256_loadu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 2);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 2);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 2);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n    /*if (bsx == bsy || bsx >= 16) {\r\n        for (i = 0; i < iHeight2; i += 2) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m256i M = _mm256_loadu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst 
+= i_dst;\r\n            M = _mm256_srli_si256(M, 2);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 2);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 2);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    }*/\r\n\r\n}\r\n\r\nvoid intra_pred_ang_x_5_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    __m256i coeff2  = _mm256_set1_epi16(2);\r\n    __m256i coeff3  = _mm256_set1_epi16(3);\r\n    __m256i coeff4  = _mm256_set1_epi16(4);\r\n    __m256i coeff5  = _mm256_set1_epi16(5);\r\n    __m256i coeff7  = _mm256_set1_epi16(7);\r\n    __m256i coeff8  = _mm256_set1_epi16(8);\r\n    __m256i coeff9  = _mm256_set1_epi16(9);\r\n    __m256i coeff11 = _mm256_set1_epi16(11);\r\n    __m256i coeff13 = _mm256_set1_epi16(13);\r\n    __m256i coeff15 = _mm256_set1_epi16(15);\r\n    __m256i coeff16 = _mm256_set1_epi16(16);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    int i;\r\n    if (((bsy > 4) && (bsx > 8))) {\r\n        ALIGN32(pel_t first_line[(64 + 80 + 16) << 3]);\r\n        int line_size = bsx + ((bsy - 8) >> 3) * 11;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        int iW2 = bsx * 2 - 1;\r\n        int real_size = DAVS2_MIN(line_size, iW2 + 1);\r\n#endif\r\n        int aligned_line_size = (((line_size + 15) >> 4) << 4) + 16;\r\n        pel_t *pfirst[8];\r\n#if !BUGFIX_PREDICTION_INTRA\r\n        pel_t *src_org = src;\r\n#endif\r\n\r\n        pel_t *dst1 = dst;\r\n        pel_t *dst2 = dst1 + i_dst;\r\n        pel_t *dst3 = dst2 + i_dst;\r\n        pel_t *dst4 = dst3 + i_dst;\r\n        pel_t *dst5 = dst4 + i_dst;\r\n        pel_t *dst6 = dst5 + i_dst;\r\n        pel_t *dst7 = dst6 + i_dst;\r\n        pel_t *dst8 = dst7 + i_dst;\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        
pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n        pfirst[4] = pfirst[3] + aligned_line_size;\r\n        pfirst[5] = pfirst[4] + aligned_line_size;\r\n        pfirst[6] = pfirst[5] + aligned_line_size;\r\n        pfirst[7] = pfirst[6] + aligned_line_size;\r\n\r\n        __m256i p00, p10, p20, p30;\r\n        __m256i p01, p11, p21, p31;\r\n\r\n        __m256i SS1;\r\n        __m256i L1, L2, L3, L4, L5, L6, L7, L8, L9, L10, L11, L12, L13;\r\n        __m256i H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, H11, H12, H13;\r\n\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n        for (i = 0; i < line_size - 16; i += 32, src += 32) {\r\n#else\r\n        for (i = 0; i < real_size - 16; i += 32, src += 32) {\r\n#endif\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 1));//1...8 9...16 17..24 25..32\r\n            L1  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//1\r\n            H1  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));//17\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n            L2  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//2\r\n            H2  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));//18\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 3));\r\n            L3  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//3\r\n            H3  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));//19\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 4));\r\n            L4  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//4\r\n            H4  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));//20\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 5));\r\n            L5  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            H5  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 6));\r\n            L6  = 
_mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            H6  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 7));\r\n            L7  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            H7  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 8));\r\n            L8  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            H8  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 9));\r\n            L9  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            H9  = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 10));\r\n            L10 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            H10 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 11));\r\n            L11 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            H11 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 12));\r\n            L12 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            H12 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 13));\r\n            L13 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            H13 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 1));\r\n\r\n            p00 = _mm256_mullo_epi16(L1, coeff5);\r\n            p10 = _mm256_mullo_epi16(L2, coeff13);\r\n            p20 = _mm256_mullo_epi16(L3, coeff11);\r\n            p30 = _mm256_mullo_epi16(L4, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = 
_mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm256_mullo_epi16(H1, coeff5);\r\n            p11 = _mm256_mullo_epi16(H2, coeff13);\r\n            p21 = _mm256_mullo_epi16(H3, coeff11);\r\n            p31 = _mm256_mullo_epi16(H4, coeff3);\r\n            p01 = _mm256_add_epi16(p01, coeff16);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[0][i], p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L3, coeff5);\r\n            p20 = _mm256_mullo_epi16(L4, coeff7);\r\n            p30 = _mm256_mullo_epi16(L5, coeff3);\r\n            p00 = _mm256_add_epi16(L2, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm256_mullo_epi16(H3, coeff5);\r\n            p21 = _mm256_mullo_epi16(H4, coeff7);\r\n            p31 = _mm256_mullo_epi16(H5, coeff3);\r\n            p01 = _mm256_add_epi16(H2, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[1][i], p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L4, coeff7);\r\n            p10 = _mm256_mullo_epi16(L5, coeff15);\r\n            p20 = _mm256_mullo_epi16(L6, coeff9);\r\n            p30 = _mm256_add_epi16(L7, coeff16);\r\n        
    p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm256_mullo_epi16(H4, coeff7);\r\n            p11 = _mm256_mullo_epi16(H5, coeff15);\r\n            p21 = _mm256_mullo_epi16(H6, coeff9);\r\n            p31 = _mm256_add_epi16(H7, coeff16);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(L5, L8);\r\n            p10 = _mm256_add_epi16(L6, L7);\r\n            p10 = _mm256_mullo_epi16(p10, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff4);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm256_add_epi16(H5, H8);\r\n            p11 = _mm256_add_epi16(H6, H7);\r\n            p11 = _mm256_mullo_epi16(p11, coeff3);\r\n            p01 = _mm256_add_epi16(p01, coeff4);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[3][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(L6, coeff16);\r\n            p10 = _mm256_mullo_epi16(L7, coeff9);\r\n            p20 = _mm256_mullo_epi16(L8, coeff15);\r\n            p30 = _mm256_mullo_epi16(L9, coeff7);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p01 = 
_mm256_add_epi16(H6, coeff16);\r\n            p11 = _mm256_mullo_epi16(H7, coeff9);\r\n            p21 = _mm256_mullo_epi16(H8, coeff15);\r\n            p31 = _mm256_mullo_epi16(H9, coeff7);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[4][i], p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L8, coeff3);\r\n            p10 = _mm256_mullo_epi16(L9, coeff7);\r\n            p20 = _mm256_mullo_epi16(L10, coeff5);\r\n            p30 = _mm256_add_epi16(L11, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm256_mullo_epi16(H8, coeff3);\r\n            p11 = _mm256_mullo_epi16(H9, coeff7);\r\n            p21 = _mm256_mullo_epi16(H10, coeff5);\r\n            p31 = _mm256_add_epi16(H11, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[5][i], p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L9, coeff3);\r\n            p10 = _mm256_mullo_epi16(L10, coeff11);\r\n            p20 = _mm256_mullo_epi16(L11, coeff13);\r\n            p30 = _mm256_mullo_epi16(L12, coeff5);\r\n            p00 = _mm256_add_epi16(p00, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n    
        p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm256_mullo_epi16(H9, coeff3);\r\n            p11 = _mm256_mullo_epi16(H10, coeff11);\r\n            p21 = _mm256_mullo_epi16(H11, coeff13);\r\n            p31 = _mm256_mullo_epi16(H12, coeff5);\r\n            p01 = _mm256_add_epi16(p01, coeff16);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[6][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(L11, L13);\r\n            p10 = _mm256_add_epi16(L12, L12);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm256_add_epi16(H11, H13);\r\n            p11 = _mm256_add_epi16(H12, H12);\r\n            p01 = _mm256_add_epi16(p01, coeff2);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[7][i], p00);\r\n        }\r\n#if BUGFIX_PREDICTION_INTRA\r\n        if (i < line_size) {\r\n#else\r\n        if (i < real_size) {\r\n#endif\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 1));//1...8 9...16 17..24 25..32\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//1\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//2\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 3));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//3\r\n            SS1 = 
_mm256_loadu_si256((__m256i*)(src + 4));\r\n            L4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//4\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 5));\r\n            L5 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 6));\r\n            L6 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 7));\r\n            L7 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 8));\r\n            L8 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 9));\r\n            L9 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 10));\r\n            L10 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 11));\r\n            L11 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 12));\r\n            L12 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 13));\r\n            L13 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n\r\n            p00 = _mm256_mullo_epi16(L1, coeff5);\r\n            p10 = _mm256_mullo_epi16(L2, coeff13);\r\n            p20 = _mm256_mullo_epi16(L3, coeff11);\r\n            p30 = _mm256_mullo_epi16(L4, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            __m256i mask = 
_mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask, p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L3, coeff5);\r\n            p20 = _mm256_mullo_epi16(L4, coeff7);\r\n            p30 = _mm256_mullo_epi16(L5, coeff3);\r\n            p00 = _mm256_add_epi16(L2, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask, p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L4, coeff7);\r\n            p10 = _mm256_mullo_epi16(L5, coeff15);\r\n            p20 = _mm256_mullo_epi16(L6, coeff9);\r\n            p30 = _mm256_add_epi16(L7, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[2][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L5, L8);\r\n            p10 = _mm256_add_epi16(L6, L7);\r\n            p10 = _mm256_mullo_epi16(p10, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff4);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[3][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L6, coeff16);\r\n            p10 = _mm256_mullo_epi16(L7, coeff9);\r\n            p20 = _mm256_mullo_epi16(L8, coeff15);\r\n            p30 = _mm256_mullo_epi16(L9, coeff7);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = 
_mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[4][i], mask, p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L8, coeff3);\r\n            p10 = _mm256_mullo_epi16(L9, coeff7);\r\n            p20 = _mm256_mullo_epi16(L10, coeff5);\r\n            p30 = _mm256_add_epi16(L11, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[5][i], mask, p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L9, coeff3);\r\n            p10 = _mm256_mullo_epi16(L10, coeff11);\r\n            p20 = _mm256_mullo_epi16(L11, coeff13);\r\n            p30 = _mm256_mullo_epi16(L12, coeff5);\r\n            p00 = _mm256_add_epi16(p00, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[6][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L11, L13);\r\n            p10 = _mm256_add_epi16(L12, L12);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[7][i], mask, p00);\r\n        }\r\n\r\n        bsy >>= 3;\r\n\r\n        __m256i M;\r\n        if (bsx == 64){\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i * 11));\r\n    
            _mm256_storeu_si256((__m256i*)dst1, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst1 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst2, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst2 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst3, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst3 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst4, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst4 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst5, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst5 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst6, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst6 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst7, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst7 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] + i * 11));\r\n                
_mm256_storeu_si256((__m256i*)dst8, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] + i * 11 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst8 + 32), M);\r\n\r\n                dst1 = dst8 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                dst5 = dst4 + i_dst;\r\n                dst6 = dst5 + i_dst;\r\n                dst7 = dst6 + i_dst;\r\n                dst8 = dst7 + i_dst;\r\n            }\r\n        } else if (bsx == 32) {\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst1, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst2, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst3, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst4, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst5, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst6, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst7, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] + i * 11));\r\n                _mm256_storeu_si256((__m256i*)dst8, M);\r\n\r\n                dst1 = dst8 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                dst5 = dst4 + i_dst;\r\n                dst6 = dst5 + i_dst;\r\n                dst7 = dst6 + i_dst;\r\n         
       dst8 = dst7 + i_dst;\r\n            }\r\n        } else {\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst2, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst3, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst4, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst5, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst6, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst7, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] + i * 11));\r\n                _mm256_maskstore_epi64((__int64 *)dst8, mask, M);\r\n\r\n                dst1 = dst8 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                dst5 = dst4 + i_dst;\r\n                dst6 = dst5 + i_dst;\r\n                dst7 = dst6 + i_dst;\r\n                dst8 = dst7 + i_dst;\r\n            }\r\n        }\r\n\r\n\r\n\r\n\r\n        /*for (i = 0; i < bsy; i++) {\r\n            memcpy(dst1, pfirst[0] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst2, pfirst[1] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst3, pfirst[2] + i * 11, bsx * 
sizeof(pel_t));\r\n            memcpy(dst4, pfirst[3] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst5, pfirst[4] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst6, pfirst[5] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst7, pfirst[6] + i * 11, bsx * sizeof(pel_t));\r\n            memcpy(dst8, pfirst[7] + i * 11, bsx * sizeof(pel_t));\r\n\r\n            dst1 = dst8 + i_dst;\r\n            dst2 = dst1 + i_dst;\r\n            dst3 = dst2 + i_dst;\r\n            dst4 = dst3 + i_dst;\r\n            dst5 = dst4 + i_dst;\r\n            dst6 = dst5 + i_dst;\r\n            dst7 = dst6 + i_dst;\r\n            dst8 = dst7 + i_dst;\r\n        }*/\r\n    } else if (bsx == 16) {\r\n\r\n            pel_t *dst1 = dst;\r\n            pel_t *dst2 = dst1 + i_dst;\r\n            pel_t *dst3 = dst2 + i_dst;\r\n            pel_t *dst4 = dst3 + i_dst;\r\n\r\n            __m256i p00, p10, p20, p30;\r\n\r\n            __m256i SS1;\r\n            __m256i L1, L2, L3, L4, L5, L6, L7, L8;\r\n\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 1));//1...8 9...16 17..24 25..32\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//1\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//2\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 3));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//3\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 4));\r\n            L4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));//4\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 5));\r\n            L5 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 6));\r\n            L6 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 7));\r\n            L7 = 
_mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n            SS1 = _mm256_loadu_si256((__m256i*)(src + 8));\r\n            L8 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(SS1, 0));\r\n\r\n            p00 = _mm256_mullo_epi16(L1, coeff5);\r\n            p10 = _mm256_mullo_epi16(L2, coeff13);\r\n            p20 = _mm256_mullo_epi16(L3, coeff11);\r\n            p30 = _mm256_mullo_epi16(L4, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L3, coeff5);\r\n            p20 = _mm256_mullo_epi16(L4, coeff7);\r\n            p30 = _mm256_mullo_epi16(L5, coeff3);\r\n            p00 = _mm256_add_epi16(L2, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)dst2, mask, p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L4, coeff7);\r\n            p10 = _mm256_mullo_epi16(L5, coeff15);\r\n            p20 = _mm256_mullo_epi16(L6, coeff9);\r\n            p30 = _mm256_add_epi16(L7, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n      
      p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)dst3, mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L5, L8);\r\n            p10 = _mm256_add_epi16(L6, L7);\r\n            p10 = _mm256_mullo_epi16(p10, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff4);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)dst4, mask, p00);\r\n\r\n\r\n        } else { //8x8 8x32 4x4 4x16\r\n\r\n            intra_pred_ang_x_5_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n\r\n        }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_x_6_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN32(pel_t first_line[64 + 64]);\r\n    int line_size = bsx + bsy - 1;\r\n\r\n    int i;\r\n    __m256i zero = _mm256_setzero_si256();\r\n    __m256i offset = _mm256_set1_epi16(2);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n    src += 2;\r\n    \r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 16; i += 32, src += 32) {\r\n#else\r\n    for (i = 0; i < real_size - 16; i += 32, src += 32) {\r\n#endif\r\n        //0 1 2 3 .... 12 13 14 15    16 17 18 19 .... 
28 29 30 31\r\n        __m256i S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        __m256i S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        __m256i S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n        __m256i L0 = _mm256_unpacklo_epi8(S0, zero);//0 1 2 3 4 5 6 7     16 17 18 19 20 21 22 23\r\n        __m256i L1 = _mm256_unpacklo_epi8(S1, zero);\r\n        __m256i L2 = _mm256_unpacklo_epi8(S2, zero);\r\n\r\n        __m256i H0 = _mm256_unpackhi_epi8(S0, zero);//8 9 10 11 12 13 14 15     24 25 26 27 28 29 30 31\r\n        __m256i H1 = _mm256_unpackhi_epi8(S1, zero);\r\n        __m256i H2 = _mm256_unpackhi_epi8(S2, zero);\r\n\r\n        __m256i tmp0 = _mm256_permute2x128_si256(L0, H0, 0x0020);//0 1 2 3 4 5 6 7   8 9 10 11 12 13 14 15\r\n        __m256i tmp1 = _mm256_permute2x128_si256(L1, H1, 0x0020);\r\n        __m256i tmp2 = _mm256_permute2x128_si256(L2, H2, 0x0020);\r\n        __m256i sum1 = _mm256_add_epi16(tmp0, tmp1);\r\n        __m256i sum2 = _mm256_add_epi16(tmp1, tmp2);\r\n\r\n\r\n        tmp0 = _mm256_permute2x128_si256(L0, H0, 0x0031);//16 17...24 25...\r\n        tmp1 = _mm256_permute2x128_si256(L1, H1, 0x0031);\r\n        tmp2 = _mm256_permute2x128_si256(L2, H2, 0x0031);\r\n        __m256i sum3 = _mm256_add_epi16(tmp0, tmp1);\r\n        __m256i sum4 = _mm256_add_epi16(tmp1, tmp2);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, sum2);\r\n        sum3 = _mm256_add_epi16(sum3, sum4);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, offset);\r\n        sum3 = _mm256_add_epi16(sum3, offset);\r\n\r\n        sum1 = _mm256_srli_epi16(sum1, 2);\r\n        sum3 = _mm256_srli_epi16(sum3, 2);\r\n\r\n        sum1 = _mm256_packus_epi16(sum1, sum3);//0 2 1 3\r\n        sum1 = _mm256_permute4x64_epi64(sum1, 0x00D8);\r\n        _mm256_storeu_si256((__m256i*)&first_line[i], sum1);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        //0 1 2 3 .... 12 13 14 15    16 17 18 19 .... 
28 29 30 31\r\n        __m256i S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        __m256i S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        __m256i S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        S0 = _mm256_permute4x64_epi64(S0, 0x00D8);\r\n        S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n        S1 = _mm256_permute4x64_epi64(S1, 0x00D8);\r\n\r\n        __m256i L0 = _mm256_unpacklo_epi8(S0, zero);\r\n        __m256i L1 = _mm256_unpacklo_epi8(S1, zero);\r\n        __m256i L2 = _mm256_unpacklo_epi8(S2, zero);\r\n\r\n        __m256i sum1 = _mm256_add_epi16(L0, L1);\r\n        __m256i sum2 = _mm256_add_epi16(L1, L2);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, sum2);\r\n        sum1 = _mm256_add_epi16(sum1, offset);\r\n        sum1 = _mm256_srli_epi16(sum1, 2);\r\n\r\n        sum1 = _mm256_packus_epi16(sum1, sum1);\r\n        sum1 = _mm256_permute4x64_epi64(sum1, 0x0008);\r\n        //store 128 bit\r\n        __m256i mask2 = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        _mm256_maskstore_epi64((__int64 *)(first_line + i), mask2, sum1);\r\n\r\n        //_mm_storel_epi64((__m128i*)&first_line[i], sum1);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    for (i = real_size; i < line_size; i += 32) {\r\n        __m256i pad = _mm256_set1_epi8(first_line[real_size - 1]);\r\n        _mm256_storeu_si256((__m256i*)&first_line[i], pad);\r\n    }\r\n#endif\r\n\r\n    if (bsx == 64){\r\n        for (i = 0; i < bsy; i += 4){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 1]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i + 1] + 32));\r\n          
  _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i + 2] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 3]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i + 3] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 32){\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 1]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 3]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16){\r\n        __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        for (i = 0; i < bsy; i += 4){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 1]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            
dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 3]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    } else if (bsx == 8){\r\n        __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n        for (i = 0; i < bsy; i += 4){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 1]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 3]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    } else {\r\n        __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[3]);\r\n        for (i = 0; i < bsy; i += 4){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 1]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 3]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    }\r\n\r\n\r\n\r\n    /*\r\n    if (bsx == bsy || bsx >= 16) {\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst 
+= i_dst;\r\n        }\r\n    } else {//8x32 4x16\r\n\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_loadu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 1);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 1);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_srli_si256(M, 1);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    }*/\r\n\r\n}\r\n\r\nvoid intra_pred_ang_x_7_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i, j;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx >= bsy) {\r\n        if (bsx <= 8) {//4x4 8x8\r\n\r\n            intra_pred_ang_x_7_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n\r\n        } else if (bsx & 16){//16\r\n\r\n            __m256i S0, S1, S2, S3;\r\n            __m256i t0, t1, t2, t3;\r\n            __m256i c0;\r\n            __m256i D0;\r\n            __m256i off = _mm256_set1_epi16(64);\r\n\r\n            __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n\r\n            for (j = 0; j < bsy; j++) {\r\n                int idx = tab_idx_mode_7[j];\r\n                c0 = _mm256_loadu_si256((__m256i*)tab_coeff_mode_7_avx[j]);\r\n\r\n                S0 = _mm256_loadu_si256((__m256i*)(src + idx));    //0...7 8...15 16...23 24...31\r\n                S1 = _mm256_loadu_si256((__m256i*)(src + idx + 1));//1.. 
8 9...16 17...24 25...32\r\n                S2 = _mm256_loadu_si256((__m256i*)(src + idx + 2));//2...9 10...17\r\n                S3 = _mm256_loadu_si256((__m256i*)(src + idx + 3));//3...10 11...18\r\n\r\n                S0 = _mm256_permute4x64_epi64(S0, 0x00D8);//0...7 16...23 8...15 24...31\r\n                S1 = _mm256_permute4x64_epi64(S1, 0x00D8);//1...8 17...24 9...16 25...32\r\n                S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n                S3 = _mm256_permute4x64_epi64(S3, 0x00D8);\r\n\r\n                t0 = _mm256_unpacklo_epi8(S0, S1);//0 1 1 2 2 3 3 4  4 5 5 6 6 7 7  8     8  9  9 10 10 11 11 12  12 13 13 14 14 15 15 16\r\n                t1 = _mm256_unpacklo_epi8(S2, S3);//2 3 3 4 4 5 5 6  6 7 7 8 8 9 9 10    10 11 11 12 12 13 13 14  14 15 15 16 16 17 17 18\r\n                t0 = _mm256_permute4x64_epi64(t0, 0x00D8);\r\n                t1 = _mm256_permute4x64_epi64(t1, 0x00D8);\r\n                t2 = _mm256_unpacklo_epi16(t0, t1);//0...7\r\n                t3 = _mm256_unpackhi_epi16(t0, t1);//8...15\r\n\r\n                t0 = _mm256_maddubs_epi16(t2, c0);\r\n                t1 = _mm256_maddubs_epi16(t3, c0);\r\n\r\n                D0 = _mm256_hadds_epi16(t0, t1);//0 1 2 3 8 9 10 11    4 5 6 7 12 13 14 15\r\n                D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                D0 = _mm256_add_epi16(D0, off);\r\n                D0 = _mm256_srli_epi16(D0, 7);\r\n\r\n                D0 = _mm256_packus_epi16(D0, D0);\r\n                D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, D0);\r\n\r\n                dst += i_dst;\r\n            }\r\n\r\n        } else {//32 64\r\n\r\n            __m256i S0, S1, S2, S3;\r\n            __m256i t0, t1, t2, t3;\r\n            __m256i c0;\r\n            __m256i D0, D1;\r\n            __m256i off = _mm256_set1_epi16(64);\r\n\r\n            for (j = 0; j < bsy; j++) {\r\n                int idx = tab_idx_mode_7[j];\r\n                c0 
= _mm256_loadu_si256((__m256i*)tab_coeff_mode_7_avx[j]);\r\n                for (i = 0; i < bsx; i += 32, idx += 32) {\r\n                    S0 = _mm256_loadu_si256((__m256i*)(src + idx));    //0...7 8...15 16...23 24...31\r\n                    S1 = _mm256_loadu_si256((__m256i*)(src + idx + 1));//1.. 8 9...16 17...24 25...32\r\n                    S2 = _mm256_loadu_si256((__m256i*)(src + idx + 2));//2...9 10...17 18\r\n                    S3 = _mm256_loadu_si256((__m256i*)(src + idx + 3));//3...10 11...18 19\r\n\r\n                    S0 = _mm256_permute4x64_epi64(S0, 0x00D8);//0...7 16...23 8...15 24...31\r\n                    S1 = _mm256_permute4x64_epi64(S1, 0x00D8);//1...8 17...24 9...16 25...32\r\n                    S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n                    S3 = _mm256_permute4x64_epi64(S3, 0x00D8);\r\n\r\n                    t0 = _mm256_unpacklo_epi8(S0, S1);//0 1 1 2 2 3 3 4  4 5 5 6 6 7 7  8     8  9  9 10 10 11 11 12  12 13 13 14 14 15 15 16\r\n                    t1 = _mm256_unpacklo_epi8(S2, S3);//2 3 3 4 4 5 5 6  6 7 7 8 8 9 9 10    10 11 11 12 12 13 13 14  14 15 15 16 16 17 17 18\r\n                    t0 = _mm256_permute4x64_epi64(t0, 0x00D8);\r\n                    t1 = _mm256_permute4x64_epi64(t1, 0x00D8);\r\n                    t2 = _mm256_unpacklo_epi16(t0, t1);//\r\n                    t3 = _mm256_unpackhi_epi16(t0, t1);//........15 16 17 18\r\n\r\n                    t0 = _mm256_maddubs_epi16(t2, c0);\r\n                    t1 = _mm256_maddubs_epi16(t3, c0);\r\n\r\n                    D0 = _mm256_hadds_epi16(t0, t1);//0 1 2 3 8 9 10 11    4 5 6 7 12 13 14 15\r\n                    D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                    D0 = _mm256_add_epi16(D0, off);\r\n                    D0 = _mm256_srli_epi16(D0, 7);\r\n\r\n                    t0 = _mm256_unpackhi_epi8(S0, S1);//16 17 17 18  18 19 19 20  20 21 21 22 22 23 23 24...24 25 25..\r\n                    t1 = _mm256_unpackhi_epi8(S2, S3);//18 19 
19 20  .....\r\n                    t0 = _mm256_permute4x64_epi64(t0, 0x00D8);\r\n                    t1 = _mm256_permute4x64_epi64(t1, 0x00D8);\r\n                    t2 = _mm256_unpacklo_epi16(t0, t1);//16 17 18 19...\r\n                    t3 = _mm256_unpackhi_epi16(t0, t1);//24 25 26 27...\r\n\r\n                    t0 = _mm256_maddubs_epi16(t2, c0);\r\n                    t1 = _mm256_maddubs_epi16(t3, c0);\r\n\r\n                    D1 = _mm256_hadds_epi16(t0, t1);//16 17 18 19 24 25 26 27    20 21 22 23 28 29 30 31\r\n                    D1 = _mm256_permute4x64_epi64(D1, 0x00D8);\r\n                    D1 = _mm256_add_epi16(D1, off);\r\n                    D1 = _mm256_srli_epi16(D1, 7);\r\n\r\n                    D0 = _mm256_packus_epi16(D0, D1);\r\n                    D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                    _mm256_storeu_si256((__m256i*)(dst + i), D0);\r\n\r\n                }\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    } else {\r\n            intra_pred_ang_x_7_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_x_8_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n\r\n    ALIGN32(pel_t first_line[2 * (64 + 48)]);\r\n    int line_size = bsx + (bsy >> 1) - 1;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    int real_size = DAVS2_MIN(line_size, (bsx << 1));\r\n#endif\r\n    int i;\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    __m256i pad1, pad2;\r\n#endif\r\n    int aligned_line_size = ((line_size + 31) >> 4) << 4;\r\n    pel_t *pfirst[2];\r\n    __m256i zero = _mm256_setzero_si256();\r\n\r\n    __m256i coeff   = _mm256_set1_epi16(3); //16\r\n    __m256i offset1 = _mm256_set1_epi16(4);\r\n    __m256i offset2 = _mm256_set1_epi16(2);\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = first_line + aligned_line_size;\r\n\r\n    __m256i p01, p02, p11, p12;\r\n    __m256i p21, p22, p31, p32;\r\n    __m256i tmp0, tmp1, tmp2, 
tmp3;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 16; i += 32, src += 32) {\r\n#else\r\n    for (i = 0; i < real_size - 16; i += 32, src += 32) {\r\n#endif\r\n        //0 1 2 3 .... 12 13 14 15    16 17 18 19 .... 28 29 30 31\r\n        __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n        __m256i S3 = _mm256_loadu_si256((__m256i*)(src + 3));\r\n        __m256i S1 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        __m256i S2 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n\r\n        __m256i L0 = _mm256_unpacklo_epi8(S0, zero);//0 1 2 3 4 5 6 7     16 17 18 19 20 21 22 23\r\n        __m256i L1 = _mm256_unpacklo_epi8(S1, zero);\r\n        __m256i L2 = _mm256_unpacklo_epi8(S2, zero);\r\n        __m256i L3 = _mm256_unpacklo_epi8(S3, zero);\r\n\r\n        __m256i H0 = _mm256_unpackhi_epi8(S0, zero);//8 9 10 11 12 13 14 15     24 25 26 27 28 29 30 31\r\n        __m256i H1 = _mm256_unpackhi_epi8(S1, zero);\r\n        __m256i H2 = _mm256_unpackhi_epi8(S2, zero);\r\n        __m256i H3 = _mm256_unpackhi_epi8(S3, zero);\r\n\r\n        tmp0 = _mm256_permute2x128_si256(L0, H0, 0x0020);//0 1 2 3 4 5 6 7   8 9 10 11 12 13 14 15\r\n        tmp1 = _mm256_permute2x128_si256(L1, H1, 0x0020);\r\n        tmp2 = _mm256_permute2x128_si256(L2, H2, 0x0020);\r\n        tmp3 = _mm256_permute2x128_si256(L3, H3, 0x0020);\r\n\r\n        p01 = _mm256_add_epi16(tmp1, tmp2);\r\n        p01 = _mm256_mullo_epi16(p01, coeff);\r\n        p02 = _mm256_add_epi16(tmp0, tmp3);\r\n        p02 = _mm256_add_epi16(p02, offset1);\r\n        p01 = _mm256_add_epi16(p01, p02);\r\n        p01 = _mm256_srli_epi16(p01, 3); //\r\n\r\n        //prepare for next line\r\n        p21 = _mm256_add_epi16(tmp1, tmp2);\r\n        p22 = _mm256_add_epi16(tmp2, tmp3);\r\n\r\n        tmp0 = _mm256_permute2x128_si256(L0, H0, 0x0031);//16 17....24 25....\r\n        tmp1 = _mm256_permute2x128_si256(L1, H1, 0x0031);\r\n        tmp2 = _mm256_permute2x128_si256(L2, H2, 0x0031);\r\n        tmp3 = 
_mm256_permute2x128_si256(L3, H3, 0x0031);\r\n\r\n        p11 = _mm256_add_epi16(tmp1, tmp2);\r\n        p11 = _mm256_mullo_epi16(p11, coeff);\r\n        p12 = _mm256_add_epi16(tmp0, tmp3);\r\n        p12 = _mm256_add_epi16(p12, offset1);\r\n        p11 = _mm256_add_epi16(p11, p12);\r\n        p11 = _mm256_srli_epi16(p11, 3);\r\n\r\n        //prepare for next line\r\n        p31 = _mm256_add_epi16(tmp1, tmp2);\r\n        p32 = _mm256_add_epi16(tmp2, tmp3);\r\n\r\n        p01 = _mm256_packus_epi16(p01, p11);\r\n        p01 = _mm256_permute4x64_epi64(p01, 0x00D8);\r\n        _mm256_storeu_si256((__m256i*)&pfirst[0][i], p01);\r\n\r\n        p21 = _mm256_add_epi16(p21, p22);\r\n        p31 = _mm256_add_epi16(p31, p32);\r\n\r\n        p21 = _mm256_add_epi16(p21, offset2);\r\n        p31 = _mm256_add_epi16(p31, offset2);\r\n\r\n        p21 = _mm256_srli_epi16(p21, 2);\r\n        p31 = _mm256_srli_epi16(p31, 2);\r\n\r\n        p21 = _mm256_packus_epi16(p21, p31);\r\n        p21 = _mm256_permute4x64_epi64(p21, 0x00D8);\r\n        _mm256_storeu_si256((__m256i*)&pfirst[1][i], p21);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n        __m256i S3 = _mm256_loadu_si256((__m256i*)(src + 3));\r\n        __m256i S1 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        __m256i S2 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n\r\n        S0 = _mm256_permute4x64_epi64(S0, 0x00D8);\r\n        S3 = _mm256_permute4x64_epi64(S3, 0x00D8);\r\n        S1 = _mm256_permute4x64_epi64(S1, 0x00D8);\r\n        S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n\r\n        __m256i L0 = _mm256_unpacklo_epi8(S0, zero);\r\n        __m256i L1 = _mm256_unpacklo_epi8(S1, zero);\r\n        __m256i L2 = _mm256_unpacklo_epi8(S2, zero);\r\n        __m256i L3 = _mm256_unpacklo_epi8(S3, zero);\r\n\r\n        p01 = _mm256_add_epi16(L1, L2);\r\n        p01 = _mm256_mullo_epi16(p01, 
coeff);\r\n        p02 = _mm256_add_epi16(L0, L3);\r\n        p02 = _mm256_add_epi16(p02, offset1);\r\n        p01 = _mm256_add_epi16(p01, p02);\r\n        p01 = _mm256_srli_epi16(p01, 3);\r\n\r\n        p01 = _mm256_packus_epi16(p01, p01);\r\n        p01 = _mm256_permute4x64_epi64(p01, 0x0008);\r\n        __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask, p01);\r\n\r\n        p01 = _mm256_add_epi16(L1, L2);\r\n        p02 = _mm256_add_epi16(L2, L3);\r\n\r\n        p01 = _mm256_add_epi16(p01, p02);\r\n        p01 = _mm256_add_epi16(p01, offset2);\r\n        p01 = _mm256_srli_epi16(p01, 2);\r\n\r\n        p01 = _mm256_packus_epi16(p01, p01);\r\n        p01=_mm256_permute4x64_epi64(p01,0x0008);\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask, p01);\r\n    }\r\n\r\n#if !BUGFIX_PREDICTION_INTRA\r\n    // padding\r\n    if (real_size < line_size) {\r\n        pfirst[1][real_size - 1] = pfirst[1][real_size - 2];\r\n\r\n        pad1 = _mm256_set1_epi8(pfirst[0][real_size - 1]);\r\n        pad2 = _mm256_set1_epi8(pfirst[1][real_size - 1]);\r\n        for (i = real_size; i < line_size; i += 32) {\r\n            _mm256_storeu_si256((__m256i*)&pfirst[0][i], pad1);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[1][i], pad2);\r\n        }\r\n    }\r\n#endif\r\n\r\n    bsy >>= 1;\r\n\r\n    if (bsx == 64){\r\n\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 32));\r\n            
_mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    } else if (bsx == 32){\r\n        for (i = 0; i < bsy; i += 4){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n  
          _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16){\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst 
+= i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 8){\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += 
i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    } else{\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n    /*if (bsx != 8) {\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + i_dst, pfirst[1] + i, bsx * sizeof(pel_t));\r\n            dst += i_dst2;\r\n        }\r\n    } else if (bsy == 4) {//8x8\r\n        __m256i mask = _mm256_loadu_si256((const 
__m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n\r\n        __m256i M1 = _mm256_loadu_si256((__m256i*)&pfirst[0][0]);\r\n        __m256i M2 = _mm256_loadu_si256((__m256i*)&pfirst[1][0]);\r\n        _mm256_maskstore_epi64((__int64 *)dst, mask, M1);\r\n        _mm256_maskstore_epi64((__int64 *)(dst + i_dst), mask, M2);\r\n        dst += i_dst2;\r\n        M1 = _mm256_srli_si256(M1, 1);\r\n        M2 = _mm256_srli_si256(M2, 1);\r\n        _mm256_maskstore_epi64((__int64 *)dst, mask, M1);\r\n        _mm256_maskstore_epi64((__int64 *)(dst + i_dst), mask, M2);\r\n        dst += i_dst2;\r\n        M1 = _mm256_srli_si256(M1, 1);\r\n        M2 = _mm256_srli_si256(M2, 1);\r\n        _mm256_maskstore_epi64((__int64 *)dst, mask, M1);\r\n        _mm256_maskstore_epi64((__int64 *)(dst + i_dst), mask, M2);\r\n        dst += i_dst2;\r\n        M1 = _mm256_srli_si256(M1, 1);\r\n        M2 = _mm256_srli_si256(M2, 1);\r\n        _mm256_maskstore_epi64((__int64 *)dst, mask, M1);\r\n        _mm256_maskstore_epi64((__int64 *)(dst + i_dst), mask, M2);\r\n    } else { //8x32\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < 16; i = i + 4) {\r\n            __m256i M1 = _mm256_loadu_si256((__m256i*)&pfirst[0][i]);\r\n            __m256i M2 = _mm256_loadu_si256((__m256i*)&pfirst[1][i]);\r\n\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M1);\r\n            _mm256_maskstore_epi64((__int64 *)(dst + i_dst), mask, M2);\r\n            dst += i_dst2;\r\n            M1 = _mm256_srli_si256(M1, 1);\r\n            M2 = _mm256_srli_si256(M2, 1);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M1);\r\n            _mm256_maskstore_epi64((__int64 *)(dst + i_dst), mask, M2);\r\n            dst += i_dst2;\r\n            M1 = _mm256_srli_si256(M1, 1);\r\n            M2 = _mm256_srli_si256(M2, 1);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M1);\r\n            _mm256_maskstore_epi64((__int64 *)(dst 
+ i_dst), mask, M2);\r\n            dst += i_dst2;\r\n            M1 = _mm256_srli_si256(M1, 1);\r\n            M2 = _mm256_srli_si256(M2, 1);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M1);\r\n            _mm256_maskstore_epi64((__int64 *)(dst + i_dst), mask, M2);\r\n            dst += i_dst2;\r\n            //M1 = _mm256_srli_si256(M1, 1);\r\n            //M2 = _mm256_srli_si256(M2, 1);\r\n            //_mm256_maskstore_epi64((__m256i*)dst, mask, M1);\r\n            //_mm256_maskstore_epi64((__m256i*)(dst + i_dst), mask, M2);\r\n            //dst += i_dst2;\r\n            //M1 = _mm256_srli_si256(M1, 1);\r\n            //M2 = _mm256_srli_si256(M2, 1);\r\n            //_mm256_maskstore_epi64((__m256i*)dst, mask, M1);\r\n            //_mm256_maskstore_epi64((__m256i*)(dst + i_dst), mask, M2);\r\n            //dst += i_dst2;\r\n            //M1 = _mm256_srli_si256(M1, 1);\r\n            //M2 = _mm256_srli_si256(M2, 1);\r\n            //_mm256_maskstore_epi64((__m256i*)dst, mask, M1);\r\n            //_mm256_maskstore_epi64((__m256i*)(dst + i_dst), mask, M2);\r\n            //dst += i_dst2;\r\n            //M1 = _mm256_srli_si256(M1, 1);\r\n            //M2 = _mm256_srli_si256(M2, 1);\r\n            //_mm256_maskstore_epi64((__m256i*)dst, mask, M1);\r\n            //_mm256_maskstore_epi64((__m256i*)(dst + i_dst), mask, M2);\r\n            //dst += i_dst2;\r\n        }\r\n    }*/\r\n\r\n}\r\n\r\nvoid intra_pred_ang_x_9_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i, j;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx >= bsy) {\r\n        if (bsx & 0x07) {//4\r\n            intra_pred_ang_x_9_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n\r\n        } else if (bsx & 0x0f) {//8\r\n            intra_pred_ang_x_9_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n\r\n        } else if (bsx & 16){//16\r\n\r\n            __m256i S0, S1, S2, S3;\r\n            __m256i t0, t1, t2, t3;\r\n            __m256i 
c0;\r\n            __m256i D0;\r\n            __m256i off = _mm256_set1_epi16(64);\r\n\r\n            __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n\r\n            for (j = 0; j < bsy; j++) {\r\n\r\n                int idx = tab_idx_mode_9[j];\r\n                c0 = _mm256_set1_epi32(((int*)(tab_coeff_mode_9[j]))[0]);\r\n\r\n                S0 = _mm256_loadu_si256((__m256i*)(src + idx));    //0...7 8...15 16...23 24...31\r\n                S1 = _mm256_loadu_si256((__m256i*)(src + idx + 1));//1.. 8 9...16 17...24 25...32\r\n                S2 = _mm256_loadu_si256((__m256i*)(src + idx + 2));//2...9 10...17\r\n                S3 = _mm256_loadu_si256((__m256i*)(src + idx + 3));//3...10 11...18\r\n\r\n                S0 = _mm256_permute4x64_epi64(S0, 0x00D8);//0...7 16...23 8...15 24...31\r\n                S1 = _mm256_permute4x64_epi64(S1, 0x00D8);//1...8 17...24 9...16 25...32\r\n                S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n                S3 = _mm256_permute4x64_epi64(S3, 0x00D8);\r\n\r\n                t0 = _mm256_unpacklo_epi8(S0, S1);//0 1 1 2 2 3 3 4  4 5 5 6 6 7 7  8     8  9  9 10 10 11 11 12  12 13 13 14 14 15 15 16\r\n                t1 = _mm256_unpacklo_epi8(S2, S3);//2 3 3 4 4 5 5 6  6 7 7 8 8 9 9 10    10 11 11 12 12 13 13 14  14 15 15 16 16 17 17 18\r\n                t0 = _mm256_permute4x64_epi64(t0, 0x00D8);\r\n                t1 = _mm256_permute4x64_epi64(t1, 0x00D8);\r\n                t2 = _mm256_unpacklo_epi16(t0, t1);//0...7\r\n                t3 = _mm256_unpackhi_epi16(t0, t1);//8...15\r\n\r\n                t0 = _mm256_maddubs_epi16(t2, c0);\r\n                t1 = _mm256_maddubs_epi16(t3, c0);\r\n\r\n                D0 = _mm256_hadds_epi16(t0, t1);//0 1 2 3 8 9 10 11    4 5 6 7 12 13 14 15\r\n                D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                D0 = _mm256_add_epi16(D0, off);\r\n                D0 = _mm256_srli_epi16(D0, 7);\r\n\r\n                D0 = 
_mm256_packus_epi16(D0, D0);\r\n                D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, D0);\r\n\r\n                dst += i_dst;\r\n            }\r\n\r\n        } else {//32 64\r\n\r\n            __m256i S0, S1, S2, S3;\r\n            __m256i t0, t1, t2, t3;\r\n            __m256i c0;\r\n            __m256i D0, D1;\r\n            __m256i off = _mm256_set1_epi16(64);\r\n\r\n            for (j = 0; j < bsy; j++) {\r\n                int idx = tab_idx_mode_9[j];\r\n                c0 = _mm256_set1_epi32(((int*)tab_coeff_mode_9[j])[0]);\r\n                for (i = 0; i < bsx; i += 32, idx += 32) {\r\n                    S0 = _mm256_loadu_si256((__m256i*)(src + idx));    //0...7 8...15 16...23 24...31\r\n                    S1 = _mm256_loadu_si256((__m256i*)(src + idx + 1));//1.. 8 9...16 17...24 25...32\r\n                    S2 = _mm256_loadu_si256((__m256i*)(src + idx + 2));//2...9 10...17 18\r\n                    S3 = _mm256_loadu_si256((__m256i*)(src + idx + 3));//3...10 11...18 19\r\n\r\n                    S0 = _mm256_permute4x64_epi64(S0, 0x00D8);//0...7 16...23 8...15 24...31\r\n                    S1 = _mm256_permute4x64_epi64(S1, 0x00D8);//1...8 17...24 9...16 25...32\r\n                    S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n                    S3 = _mm256_permute4x64_epi64(S3, 0x00D8);\r\n\r\n                    t0 = _mm256_unpacklo_epi8(S0, S1);//0 1 1 2 2 3 3 4  4 5 5 6 6 7 7  8     8  9  9 10 10 11 11 12  12 13 13 14 14 15 15 16\r\n                    t1 = _mm256_unpacklo_epi8(S2, S3);//2 3 3 4 4 5 5 6  6 7 7 8 8 9 9 10    10 11 11 12 12 13 13 14  14 15 15 16 16 17 17 18\r\n                    t0 = _mm256_permute4x64_epi64(t0, 0x00D8);\r\n                    t1 = _mm256_permute4x64_epi64(t1, 0x00D8);\r\n                    t2 = _mm256_unpacklo_epi16(t0, t1);//\r\n                    t3 = _mm256_unpackhi_epi16(t0, t1);//........15 16 17 18\r\n\r\n                    t0 = 
_mm256_maddubs_epi16(t2, c0);\r\n                    t1 = _mm256_maddubs_epi16(t3, c0);\r\n\r\n                    D0 = _mm256_hadds_epi16(t0, t1);//0 1 2 3 8 9 10 11    4 5 6 7 12 13 14 15\r\n                    D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                    D0 = _mm256_add_epi16(D0, off);\r\n                    D0 = _mm256_srli_epi16(D0, 7);\r\n\r\n                    t0 = _mm256_unpackhi_epi8(S0, S1);//16 17 17 18  18 19 19 20  20 21 21 22 22 23 23 24...24 25 25..\r\n                    t1 = _mm256_unpackhi_epi8(S2, S3);//18 19 19 20  .....\r\n                    t0 = _mm256_permute4x64_epi64(t0, 0x00D8);\r\n                    t1 = _mm256_permute4x64_epi64(t1, 0x00D8);\r\n                    t2 = _mm256_unpacklo_epi16(t0, t1);//16 17 18 19...\r\n                    t3 = _mm256_unpackhi_epi16(t0, t1);//24 25 26 27...\r\n\r\n                    t0 = _mm256_maddubs_epi16(t2, c0);\r\n                    t1 = _mm256_maddubs_epi16(t3, c0);\r\n\r\n                    D1 = _mm256_hadds_epi16(t0, t1);//16 17 18 19 24 25 26 27    20 21 22 23 28 29 30 31\r\n                    D1 = _mm256_permute4x64_epi64(D1, 0x00D8);\r\n                    D1 = _mm256_add_epi16(D1, off);\r\n                    D1 = _mm256_srli_epi16(D1, 7);\r\n\r\n                    D0 = _mm256_packus_epi16(D0, D1);\r\n                    D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                    _mm256_storeu_si256((__m256i*)(dst + i), D0);\r\n\r\n                }\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    } else {//4x16 8x32\r\n        intra_pred_ang_x_9_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_x_10_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    if (bsy == 4){\r\n        intra_pred_ang_x_10_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        return;\r\n    }\r\n    int i;\r\n    pel_t *dst1 = dst;\r\n    pel_t *dst2 = dst1 + i_dst;\r\n    pel_t *dst3 = dst2 + 
i_dst;\r\n    pel_t *dst4 = dst3 + i_dst;\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsy != 4) {\r\n\r\n        __m256i zero = _mm256_setzero_si256();\r\n\r\n        __m256i coeff2 = _mm256_set1_epi16(2);\r\n        __m256i coeff3 = _mm256_set1_epi16(3);\r\n        __m256i coeff4 = _mm256_set1_epi16(4);\r\n        __m256i coeff5 = _mm256_set1_epi16(5);\r\n        __m256i coeff7 = _mm256_set1_epi16(7);\r\n        __m256i coeff8 = _mm256_set1_epi16(8);\r\n\r\n        ALIGN32(pel_t first_line[4 * (64 + 32)]);\r\n        int line_size = bsx + bsy / 4 - 1;\r\n        int aligned_line_size = ((line_size + 31) >> 4) << 4;\r\n        pel_t *pfirst[4];\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = first_line + aligned_line_size;\r\n        pfirst[2] = first_line + aligned_line_size * 2;\r\n        pfirst[3] = first_line + aligned_line_size * 3;\r\n\r\n        for (i = 0; i < line_size - 16; i += 32, src += 32) {\r\n            __m256i p00, p10, p20, p30;\r\n            __m256i p01, p11, p21, p31;\r\n            //0 1 2 3 .... 12 13 14 15    16 17 18 19 .... 
28 29 30 31\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + 3));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n\r\n            __m256i L0 = _mm256_unpacklo_epi8(S0, zero);//0 1 2 3 4 5 6 7     16 17 18 19 20 21 22 23\r\n            __m256i L1 = _mm256_unpacklo_epi8(S1, zero);\r\n            __m256i L2 = _mm256_unpacklo_epi8(S2, zero);\r\n            __m256i L3 = _mm256_unpacklo_epi8(S3, zero);\r\n\r\n            __m256i H0 = _mm256_unpackhi_epi8(S0, zero);// 8 9 10 11 12 13 14 15     24 25 26 27 28 29 30 31\r\n            __m256i H1 = _mm256_unpackhi_epi8(S1, zero);\r\n            __m256i H2 = _mm256_unpackhi_epi8(S2, zero);\r\n            __m256i H3 = _mm256_unpackhi_epi8(S3, zero);\r\n\r\n            __m256i tmpL0 = _mm256_permute2x128_si256(L0, H0, 0x0020);//0 1 2 3 4 5 6 7   8 9 10 11 12 13 14 15\r\n            __m256i tmpL1 = _mm256_permute2x128_si256(L1, H1, 0x0020);\r\n            __m256i tmpL2 = _mm256_permute2x128_si256(L2, H2, 0x0020);\r\n            __m256i tmpL3 = _mm256_permute2x128_si256(L3, H3, 0x0020);\r\n            \r\n            __m256i tmpH0 = _mm256_permute2x128_si256(L0, H0, 0x0031);//16 17...24 25...\r\n            __m256i tmpH1 = _mm256_permute2x128_si256(L1, H1, 0x0031);\r\n            __m256i tmpH2 = _mm256_permute2x128_si256(L2, H2, 0x0031);\r\n            __m256i tmpH3 = _mm256_permute2x128_si256(L3, H3, 0x0031);\r\n            \r\n            p00 = _mm256_mullo_epi16(tmpL0, coeff3);//0 1 2 3 4 5 6 7   8 9 10 11 12 13 14 15\r\n            p10 = _mm256_mullo_epi16(tmpL1, coeff7);\r\n            p20 = _mm256_mullo_epi16(tmpL2, coeff5);\r\n            p30 = _mm256_add_epi16(tmpL3, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = 
_mm256_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm256_mullo_epi16(tmpH0, coeff3);//16 17...24 25...\r\n            p11 = _mm256_mullo_epi16(tmpH1, coeff7);\r\n            p21 = _mm256_mullo_epi16(tmpH2, coeff5);\r\n            p31 = _mm256_add_epi16(tmpH3, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(tmpL1, tmpL2);\r\n            p00 = _mm256_mullo_epi16(p00, coeff3);\r\n            p10 = _mm256_add_epi16(tmpL0, tmpL3);\r\n            p10 = _mm256_add_epi16(p10, coeff4);\r\n            p00 = _mm256_add_epi16(p10, p00);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm256_add_epi16(tmpH1, tmpH2);\r\n            p01 = _mm256_mullo_epi16(p01, coeff3);\r\n            p11 = _mm256_add_epi16(tmpH0, tmpH3);\r\n            p11 = _mm256_add_epi16(p11, coeff4);\r\n            p01 = _mm256_add_epi16(p11, p01);\r\n            p01 = _mm256_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[1][i], p00);\r\n\r\n            p10 = _mm256_mullo_epi16(tmpL1, coeff5);\r\n            p20 = _mm256_mullo_epi16(tmpL2, coeff7);\r\n            p30 = _mm256_mullo_epi16(tmpL3, coeff3);\r\n            p00 = _mm256_add_epi16(tmpL0, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm256_mullo_epi16(tmpH1, coeff5);\r\n            p21 = _mm256_mullo_epi16(tmpH2, coeff7);\r\n    
        p31 = _mm256_mullo_epi16(tmpH3, coeff3);\r\n            p01 = _mm256_add_epi16(tmpH0, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(tmpL1, tmpL2);\r\n            p10 = _mm256_add_epi16(tmpL2, tmpL3);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm256_add_epi16(tmpH1, tmpH2);\r\n            p11 = _mm256_add_epi16(tmpH2, tmpH3);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, coeff2);\r\n            p01 = _mm256_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[3][i], p00);\r\n        }\r\n\r\n        if (i < line_size) {\r\n            __m256i p00, p10, p20, p30;\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + 3));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n            __m256i S2 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n\r\n            S0 = _mm256_permute4x64_epi64(S0, 0x00D8);\r\n            S3 = _mm256_permute4x64_epi64(S3, 0x00D8);\r\n            S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n            S1 = _mm256_permute4x64_epi64(S1, 0x00D8);\r\n\r\n            __m256i L0 = _mm256_unpacklo_epi8(S0, zero);\r\n            __m256i L1 = _mm256_unpacklo_epi8(S1, zero);\r\n            __m256i L2 = _mm256_unpacklo_epi8(S2, zero);\r\n            __m256i L3 = 
_mm256_unpacklo_epi8(S3, zero);\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff7);\r\n            p20 = _mm256_mullo_epi16(L2, coeff5);\r\n            p30 = _mm256_add_epi16(L3, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_mullo_epi16(p00, coeff3);\r\n            p10 = _mm256_add_epi16(L0, L3);\r\n            p10 = _mm256_add_epi16(p10, coeff4);\r\n            p00 = _mm256_add_epi16(p10, p00);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask, p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff5);\r\n            p20 = _mm256_mullo_epi16(L2, coeff7);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(L0, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[2][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p10 = _mm256_add_epi16(L2, L3);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = 
_mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[3][i], mask, p00);\r\n        }\r\n\r\n        bsy >>= 2;\r\n        int i_dstx4 = i_dst << 2;\r\n        if (bsx == 64){\r\n\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n                _mm256_storeu_si256((__m256i*)dst1, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst1 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n                _mm256_storeu_si256((__m256i*)dst2, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst2 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i));\r\n                _mm256_storeu_si256((__m256i*)dst3, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst3 + 32), M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i));\r\n                _mm256_storeu_si256((__m256i*)dst4, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst4 + 32), M);\r\n\r\n                dst1 += i_dstx4;\r\n                dst2 += i_dstx4;\r\n                dst3 += i_dstx4;\r\n                dst4 += i_dstx4;\r\n            }\r\n\r\n        } else if (bsx == 32){\r\n\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n                _mm256_storeu_si256((__m256i*)dst1, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n                
_mm256_storeu_si256((__m256i*)dst2, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i));\r\n                _mm256_storeu_si256((__m256i*)dst3, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i));\r\n                _mm256_storeu_si256((__m256i*)dst4, M);\r\n\r\n                dst1 += i_dstx4;\r\n                dst2 += i_dstx4;\r\n                dst3 += i_dstx4;\r\n                dst4 += i_dstx4;\r\n            }\r\n\r\n        } else if (bsx == 16){\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst2, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst3, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst4, mask, M);\r\n\r\n                dst1 += i_dstx4;\r\n                dst2 += i_dstx4;\r\n                dst3 += i_dstx4;\r\n                dst4 += i_dstx4;\r\n            }\r\n        } else if (bsx == 8){\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst2, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst3, mask, M);\r\n\r\n                
M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst4, mask, M);\r\n\r\n                dst1 += i_dstx4;\r\n                dst2 += i_dstx4;\r\n                dst3 += i_dstx4;\r\n                dst4 += i_dstx4;\r\n            }\r\n        } else {\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[3]);\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n                _mm256_maskstore_epi32((int*)dst1, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n                _mm256_maskstore_epi32((int*)dst2, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] + i));\r\n                _mm256_maskstore_epi32((int*)dst3, mask, M);\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] + i));\r\n                _mm256_maskstore_epi32((int*)dst4, mask, M);\r\n\r\n                dst1 += i_dstx4;\r\n                dst2 += i_dstx4;\r\n                dst3 += i_dstx4;\r\n                dst4 += i_dstx4;\r\n            }\r\n            \r\n        }\r\n\r\n        /*\r\n        if (bsx != 8) {\r\n            switch (bsx) {\r\n            case 4:\r\n                for (i = 0; i < bsy; i++) {\r\n                    CP32(dst1, pfirst[0] + i); dst1 += i_dstx4;\r\n                    CP32(dst2, pfirst[1] + i); dst2 += i_dstx4;\r\n                    CP32(dst3, pfirst[2] + i); dst3 += i_dstx4;\r\n                    CP32(dst4, pfirst[3] + i); dst4 += i_dstx4;\r\n                }\r\n                break;\r\n            case 16:\r\n                for (i = 0; i < bsy; i++) {\r\n                    memcpy(dst1, pfirst[0] + i, 16 * sizeof(pel_t)); dst1 += i_dstx4;\r\n                    memcpy(dst2, pfirst[1] + i, 16 * sizeof(pel_t)); dst2 += i_dstx4;\r\n                    memcpy(dst3, pfirst[2] + i, 16 * sizeof(pel_t)); dst3 += i_dstx4;\r\n              
      memcpy(dst4, pfirst[3] + i, 16 * sizeof(pel_t)); dst4 += i_dstx4;\r\n                }\r\n                break;\r\n            case 32:\r\n                for (i = 0; i < bsy; i++) {\r\n                    memcpy(dst1, pfirst[0] + i, 32 * sizeof(pel_t)); dst1 += i_dstx4;\r\n                    memcpy(dst2, pfirst[1] + i, 32 * sizeof(pel_t)); dst2 += i_dstx4;\r\n                    memcpy(dst3, pfirst[2] + i, 32 * sizeof(pel_t)); dst3 += i_dstx4;\r\n                    memcpy(dst4, pfirst[3] + i, 32 * sizeof(pel_t)); dst4 += i_dstx4;\r\n                }\r\n                break;\r\n            case 64:\r\n                for (i = 0; i < bsy; i++) {\r\n                    memcpy(dst1, pfirst[0] + i, 64 * sizeof(pel_t)); dst1 += i_dstx4;\r\n                    memcpy(dst2, pfirst[1] + i, 64 * sizeof(pel_t)); dst2 += i_dstx4;\r\n                    memcpy(dst3, pfirst[2] + i, 64 * sizeof(pel_t)); dst3 += i_dstx4;\r\n                    memcpy(dst4, pfirst[3] + i, 64 * sizeof(pel_t)); dst4 += i_dstx4;\r\n                }\r\n                break;\r\n            default:\r\n                assert(0);\r\n                break;\r\n            }\r\n\r\n        } else {\r\n            if (bsy == 2) { //8x8\r\n                for (i = 0; i < bsy; i++) {\r\n                    CP64(dst1, pfirst[0] + i);\r\n                    CP64(dst2, pfirst[1] + i);\r\n                    CP64(dst3, pfirst[2] + i);\r\n                    CP64(dst4, pfirst[3] + i);\r\n                    dst1 = dst4 + i_dst;\r\n                    dst2 = dst1 + i_dst;\r\n                    dst3 = dst2 + i_dst;\r\n                    dst4 = dst3 + i_dst;\r\n                }\r\n            } else {//8x32\r\n                __m128i M1 = _mm_loadu_si128((__m128i*)&pfirst[0][0]);\r\n                __m128i M2 = _mm_loadu_si128((__m128i*)&pfirst[1][0]);\r\n                __m128i M3 = _mm_loadu_si128((__m128i*)&pfirst[2][0]);\r\n                __m128i M4 = _mm_loadu_si128((__m128i*)&pfirst[3][0]);\r\n  
              _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n      
          dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n                dst1 = dst4 + i_dst;\r\n                dst2 = dst1 + i_dst;\r\n                dst3 = dst2 + i_dst;\r\n                dst4 = dst3 + i_dst;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                M3 = _mm_srli_si128(M3, 1);\r\n                
M4 = _mm_srli_si128(M4, 1);\r\n                _mm_storel_epi64((__m128i*)dst1, M1);\r\n                _mm_storel_epi64((__m128i*)dst2, M2);\r\n                _mm_storel_epi64((__m128i*)dst3, M3);\r\n                _mm_storel_epi64((__m128i*)dst4, M4);\r\n            }\r\n        }*/\r\n    }\r\n}\r\n\r\nvoid intra_pred_ang_x_11_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i, j;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx & 0x07) {\r\n        intra_pred_ang_x_11_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n    } else if (bsx & 0x0f) {\r\n        intra_pred_ang_x_11_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n    } else if (bsx & 16){\r\n\r\n        __m256i S0, S1, S2, S3;\r\n        __m256i t0, t1, t2, t3;\r\n        __m256i c0;\r\n        __m256i D0;\r\n        __m256i off = _mm256_set1_epi16(64);\r\n\r\n        __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n\r\n        for (j = 0; j < bsy; j++) {\r\n\r\n            int idx = (j + 1) >> 3;\r\n            c0 = _mm256_set1_epi32(((int*)(tab_coeff_mode_11[j & 0x07]))[0]);\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(src + idx));    //0...7 8...15 16...23 24...31\r\n            S1 = _mm256_loadu_si256((__m256i*)(src + idx + 1));//1.. 
8 9...16 17...24 25...32\r\n            S2 = _mm256_loadu_si256((__m256i*)(src + idx + 2));//2...9 10...17\r\n            S3 = _mm256_loadu_si256((__m256i*)(src + idx + 3));//3...10 11...18\r\n\r\n            S0 = _mm256_permute4x64_epi64(S0, 0x00D8);//0...7 16...23 8...15 24...31\r\n            S1 = _mm256_permute4x64_epi64(S1, 0x00D8);//1...8 17...24 9...16 25...32\r\n            S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n            S3 = _mm256_permute4x64_epi64(S3, 0x00D8);\r\n\r\n            t0 = _mm256_unpacklo_epi8(S0, S1);//0 1 1 2 2 3 3 4  4 5 5 6 6 7 7  8     8  9  9 10 10 11 11 12  12 13 13 14 14 15 15 16\r\n            t1 = _mm256_unpacklo_epi8(S2, S3);//2 3 3 4 4 5 5 6  6 7 7 8 8 9 9 10    10 11 11 12 12 13 13 14  14 15 15 16 16 17 17 18\r\n            t0 = _mm256_permute4x64_epi64(t0, 0x00D8);\r\n            t1 = _mm256_permute4x64_epi64(t1, 0x00D8);\r\n            t2 = _mm256_unpacklo_epi16(t0, t1);//0...7\r\n            t3 = _mm256_unpackhi_epi16(t0, t1);//8...15\r\n\r\n            t0 = _mm256_maddubs_epi16(t2, c0);\r\n            t1 = _mm256_maddubs_epi16(t3, c0);\r\n\r\n            D0 = _mm256_hadds_epi16(t0, t1);//0 1 2 3 8 9 10 11    4 5 6 7 12 13 14 15\r\n            D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n            D0 = _mm256_add_epi16(D0, off);\r\n            D0 = _mm256_srli_epi16(D0, 7);\r\n\r\n            D0 = _mm256_packus_epi16(D0, D0);\r\n            D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, D0);\r\n\r\n            dst += i_dst;\r\n        }\r\n    \r\n    } else {\r\n\r\n        __m256i S0, S1, S2, S3;\r\n        __m256i t0, t1, t2, t3;\r\n        __m256i c0;\r\n        __m256i D0, D1;\r\n        __m256i off = _mm256_set1_epi16(64);\r\n\r\n        for (j = 0; j < bsy; j++) {\r\n            int idx = (j + 1) >> 3;\r\n            c0 = _mm256_set1_epi32(((int*)tab_coeff_mode_11[j & 0x07])[0]);\r\n            for (i = 0; i < bsx; i += 32, idx += 32) {\r\n              
  S0 = _mm256_loadu_si256((__m256i*)(src + idx));    //0...7 8...15 16...23 24...31\r\n                S1 = _mm256_loadu_si256((__m256i*)(src + idx + 1));//1.. 8 9...16 17...24 25...32\r\n                S2 = _mm256_loadu_si256((__m256i*)(src + idx + 2));//2...9 10...17 18\r\n                S3 = _mm256_loadu_si256((__m256i*)(src + idx + 3));//3...10 11...18 19\r\n\r\n                S0 = _mm256_permute4x64_epi64(S0, 0x00D8);//0...7 16...23 8...15 24...31\r\n                S1 = _mm256_permute4x64_epi64(S1, 0x00D8);//1...8 17...24 9...16 25...32\r\n                S2 = _mm256_permute4x64_epi64(S2, 0x00D8);\r\n                S3 = _mm256_permute4x64_epi64(S3, 0x00D8);\r\n\r\n                t0 = _mm256_unpacklo_epi8(S0, S1);//0 1 1 2 2 3 3 4  4 5 5 6 6 7 7  8     8  9  9 10 10 11 11 12  12 13 13 14 14 15 15 16\r\n                t1 = _mm256_unpacklo_epi8(S2, S3);//2 3 3 4 4 5 5 6  6 7 7 8 8 9 9 10    10 11 11 12 12 13 13 14  14 15 15 16 16 17 17 18\r\n                t0 = _mm256_permute4x64_epi64(t0, 0x00D8);\r\n                t1 = _mm256_permute4x64_epi64(t1, 0x00D8);\r\n                t2 = _mm256_unpacklo_epi16(t0, t1);//\r\n                t3 = _mm256_unpackhi_epi16(t0, t1);//........15 16 17 18\r\n\r\n                t0 = _mm256_maddubs_epi16(t2, c0);\r\n                t1 = _mm256_maddubs_epi16(t3, c0);\r\n\r\n                D0 = _mm256_hadds_epi16(t0, t1);//0 1 2 3 8 9 10 11    4 5 6 7 12 13 14 15\r\n                D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                D0 = _mm256_add_epi16(D0, off);\r\n                D0 = _mm256_srli_epi16(D0, 7);\r\n\r\n                t0 = _mm256_unpackhi_epi8(S0, S1);//16 17 17 18  18 19 19 20  20 21 21 22 22 23 23 24...24 25 25..\r\n                t1 = _mm256_unpackhi_epi8(S2, S3);//18 19 19 20  .....\r\n                t0 = _mm256_permute4x64_epi64(t0, 0x00D8);\r\n                t1 = _mm256_permute4x64_epi64(t1, 0x00D8);\r\n                t2 = _mm256_unpacklo_epi16(t0, t1);//16 17 18 19...\r\n              
  t3 = _mm256_unpackhi_epi16(t0, t1);//24 25 26 27...\r\n\r\n                t0 = _mm256_maddubs_epi16(t2, c0);\r\n                t1 = _mm256_maddubs_epi16(t3, c0);\r\n\r\n                D1 = _mm256_hadds_epi16(t0, t1);//16 17 18 19 24 25 26 27    20 21 22 23 28 29 30 31\r\n                D1 = _mm256_permute4x64_epi64(D1, 0x00D8);\r\n                D1 = _mm256_add_epi16(D1, off);\r\n                D1 = _mm256_srli_epi16(D1, 7);\r\n\r\n                D0 = _mm256_packus_epi16(D0, D1);\r\n                D0 = _mm256_permute4x64_epi64(D0, 0x00D8);\r\n                _mm256_storeu_si256((__m256i*)(dst + i), D0);\r\n\r\n            }\r\n            dst += i_dst;\r\n        }\r\n\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_y_25_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    UNUSED_PARAMETER(dir_mode);\r\n    int i;\r\n\r\n    if (bsx > 8) {\r\n\r\n        ALIGN32(pel_t first_line[64 + (64 << 3)]);\r\n        int line_size = bsx + ((bsy - 1) << 3);\r\n        int iHeight8 = bsy << 3;\r\n        pel_t *pfirst = first_line;\r\n\r\n        __m256i coeff0 = _mm256_setr_epi16( 7,  3,  5,  1,  3,  1,  1,  0,    7,  3,  5,  1,  3,  1,  1,  0);\r\n        __m256i coeff1 = _mm256_setr_epi16(15,  7, 13,  3, 11,  5,  9,  1,   15,  7, 13,  3, 11,  5,  9,  1);\r\n        __m256i coeff2 = _mm256_setr_epi16( 9,  5, 11,  3, 13,  7, 15,  2,    9,  5, 11,  3, 13,  7, 15,  2);\r\n        __m256i coeff3 = _mm256_setr_epi16( 1,  1,  3,  1,  5,  3,  7,  1,    1,  1,  3,  1,  5,  3,  7,  1);\r\n        __m256i coeff4 = _mm256_setr_epi16(16,  8, 16,  4, 16,  8, 16,  2,   16,  8, 16,  4, 16,  8, 16,  2);\r\n        __m256i coeff5 = _mm256_setr_epi16( 1,  2,  1,  4,  1,  2,  1,  8,    1,  2,  1,  4,  1,  2,  1,  8);\r\n\r\n        __m256i p00, p10, p20, p30;\r\n        __m256i p01, p11, p21, p31;\r\n        __m256i res1, res2;\r\n        __m256i L0 = _mm256_setr_epi16(src[0],  src[0],  src[0],  src[0],  src[0],  src[0],  src[0],  src[0], \r\n            
src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4]);\r\n\r\n        __m256i L1 = _mm256_setr_epi16(src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], \r\n            src[-5], src[-5], src[-5], src[-5], src[-5], src[-5], src[-5], src[-5]);\r\n\r\n        __m256i L2 = _mm256_setr_epi16(src[-2], src[-2], src[-2], src[-2], src[-2], src[-2], src[-2], src[-2], \r\n            src[-6], src[-6], src[-6], src[-6], src[-6], src[-6], src[-6], src[-6]);\r\n\r\n        __m256i L3 = _mm256_setr_epi16(src[-3], src[-3], src[-3], src[-3], src[-3], src[-3], src[-3], src[-3], \r\n            src[-7], src[-7], src[-7], src[-7], src[-7], src[-7], src[-7], src[-7]);\r\n\r\n        src -= 4;\r\n\r\n        for (i = 0; i < line_size; i += 64, src -= 4) {\r\n            p00 = _mm256_mullo_epi16(L0, coeff0);//0...4...\r\n            p10 = _mm256_mullo_epi16(L1, coeff1);//1...5...\r\n            p20 = _mm256_mullo_epi16(L2, coeff2);//2...6...\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);//3...7...\r\n            p00 = _mm256_add_epi16(p00, coeff4);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_mullo_epi16(p00, coeff5);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            L0 = _mm256_setr_epi16(src[0], src[0], src[0], src[0], src[0], src[0], src[0], src[0],\r\n                src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4]);//4 8\r\n\r\n            p01 = _mm256_mullo_epi16(L1, coeff0);//1...5...\r\n            p11 = _mm256_mullo_epi16(L2, coeff1);//2...6...\r\n            p21 = _mm256_mullo_epi16(L3, coeff2);//3...7...\r\n            p31 = _mm256_mullo_epi16(L0, coeff3);//4...8...\r\n            p01 = _mm256_add_epi16(p01, coeff4);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n    
        p01 = _mm256_mullo_epi16(p01, coeff5);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            res1 = _mm256_packus_epi16(p00, p01);\r\n\r\n\r\n            L1 = _mm256_setr_epi16(src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1],\r\n                src[-5], src[-5], src[-5], src[-5], src[-5], src[-5], src[-5], src[-5]);//5 9\r\n\r\n            p00 = _mm256_mullo_epi16(L2, coeff0);//2...6...\r\n            p10 = _mm256_mullo_epi16(L3, coeff1);//3...7...\r\n            p20 = _mm256_mullo_epi16(L0, coeff2);//4...8...\r\n            p30 = _mm256_mullo_epi16(L1, coeff3);//5...9...\r\n            p00 = _mm256_add_epi16(p00, coeff4);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_mullo_epi16(p00, coeff5);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            L2 = _mm256_setr_epi16(src[-2], src[-2], src[-2], src[-2], src[-2], src[-2], src[-2], src[-2],\r\n                src[-6], src[-6], src[-6], src[-6], src[-6], src[-6], src[-6], src[-6]);//6 10\r\n\r\n            p01 = _mm256_mullo_epi16(L3, coeff0);//3...7...\r\n            p11 = _mm256_mullo_epi16(L0, coeff1);//4...8...\r\n            p21 = _mm256_mullo_epi16(L1, coeff2);//5...9...\r\n            p31 = _mm256_mullo_epi16(L2, coeff3);//6...10...\r\n            p01 = _mm256_add_epi16(p01, coeff4);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_mullo_epi16(p01, coeff5);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            res2 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute2x128_si256(res1, res2, 0x0020);\r\n            _mm256_storeu_si256((__m256i*)pfirst, p00);\r\n            pfirst += 32;\r\n\r\n            p00 = _mm256_permute2x128_si256(res1, res2, 0x0031);\r\n            
_mm256_storeu_si256((__m256i*)pfirst, p00);\r\n            \r\n            pfirst += 32;\r\n\r\n            src -= 4;\r\n            L0 = _mm256_setr_epi16(src[0], src[0], src[0], src[0], src[0], src[0], src[0], src[0],\r\n                src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4]);//8 12\r\n\r\n            L1 = _mm256_setr_epi16(src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1],\r\n                src[-5], src[-5], src[-5], src[-5], src[-5], src[-5], src[-5], src[-5]);//9 13\r\n\r\n            L2 = _mm256_setr_epi16(src[-2], src[-2], src[-2], src[-2], src[-2], src[-2], src[-2], src[-2],\r\n                src[-6], src[-6], src[-6], src[-6], src[-6], src[-6], src[-6], src[-6]);//10 14\r\n\r\n            L3 = _mm256_setr_epi16(src[-3], src[-3], src[-3], src[-3], src[-3], src[-3], src[-3], src[-3],\r\n                src[-7], src[-7], src[-7], src[-7], src[-7], src[-7], src[-7], src[-7]);//11 15\r\n\r\n        }\r\n        \r\n        //if (bsx == 16) {// 8\r\n        //    __m256i mask = _mm256_loadu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n        //    p00 = _mm256_mullo_epi16(L0, coeff0);\r\n        //    p10 = _mm256_mullo_epi16(L1, coeff1);\r\n        //    p20 = _mm256_mullo_epi16(L2, coeff2);\r\n        //    p30 = _mm256_mullo_epi16(L3, coeff3);\r\n        //    p00 = _mm256_add_epi16(p00, coeff4);\r\n        //    p00 = _mm256_add_epi16(p00, p10);\r\n        //    p00 = _mm256_add_epi16(p00, p20);\r\n        //    p00 = _mm256_add_epi16(p00, p30);\r\n        //    p00 = _mm256_mullo_epi16(p00, coeff5);\r\n        //    p00 = _mm256_srli_epi16(p00, 5);\r\n        //\r\n        //    p00 = _mm256_packus_epi16(p00, p00);\r\n        //    p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n        //    _mm256_maskstore_epi64((__m256i*)pfirst, mask, p00);\r\n        //} else if(bsx == 32){\r\n        //    __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);\r\n        //    p00 = _mm256_mullo_epi16(L0, 
coeff0);\r\n        //    p10 = _mm256_mullo_epi16(L1, coeff1);\r\n        //    p20 = _mm256_mullo_epi16(L2, coeff2);\r\n        //    p30 = _mm256_mullo_epi16(L3, coeff3);\r\n        //    p00 = _mm256_add_epi16(p00, coeff4);\r\n        //    p00 = _mm256_add_epi16(p00, p10);\r\n        //    p00 = _mm256_add_epi16(p00, p20);\r\n        //    p00 = _mm256_add_epi16(p00, p30);\r\n        //    p00 = _mm256_mullo_epi16(p00, coeff5);\r\n        //    p00 = _mm256_srli_epi16(p00, 5);\r\n        //\r\n        //    L0 = _mm256_setr_epi16(src[0], src[0], src[0], src[0], src[0], src[0], src[0], src[0],\r\n        //        src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4]);\r\n        //\r\n        //    p01 = _mm256_mullo_epi16(L1, coeff0);\r\n        //    p11 = _mm256_mullo_epi16(L2, coeff1);\r\n        //    p21 = _mm256_mullo_epi16(L3, coeff2);\r\n        //    p31 = _mm256_mullo_epi16(L0, coeff3);\r\n        //    p01 = _mm256_add_epi16(p01, coeff4);\r\n        //    p01 = _mm256_add_epi16(p01, p11);\r\n        //    p01 = _mm256_add_epi16(p01, p21);\r\n        //    p01 = _mm256_add_epi16(p01, p31);\r\n        //    p01 = _mm256_mullo_epi16(p01, coeff5);\r\n        //    p01 = _mm256_srli_epi16(p01, 5);\r\n        //\r\n        //    p00 = _mm256_packus_epi16(p00, p01);\r\n        //    p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n        //    _mm256_maskstore_epi64((__int64 *)pfirst, mask, p00);\r\n        //\r\n        //} else {\r\n        //    __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);\r\n        //    p00 = _mm256_mullo_epi16(L0, coeff0);\r\n        //    p10 = _mm256_mullo_epi16(L1, coeff1);\r\n        //    p20 = _mm256_mullo_epi16(L2, coeff2);\r\n        //    p30 = _mm256_mullo_epi16(L3, coeff3);\r\n        //    p00 = _mm256_add_epi16(p00, coeff4);\r\n        //    p00 = _mm256_add_epi16(p00, p10);\r\n        //    p00 = _mm256_add_epi16(p00, p20);\r\n        //    p00 = _mm256_add_epi16(p00, p30);\r\n        //    p00 = 
_mm256_mullo_epi16(p00, coeff5);\r\n        //    p00 = _mm256_srli_epi16(p00, 5);\r\n        //\r\n        //    L0 = _mm256_setr_epi16(src[0], src[0], src[0], src[0], src[0], src[0], src[0], src[0],\r\n        //        src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4], src[-4]);\r\n        //\r\n        //    p01 = _mm256_mullo_epi16(L1, coeff0);\r\n        //    p11 = _mm256_mullo_epi16(L2, coeff1);\r\n        //    p21 = _mm256_mullo_epi16(L3, coeff2);\r\n        //    p31 = _mm256_mullo_epi16(L0, coeff3);\r\n        //    p01 = _mm256_add_epi16(p01, coeff4);\r\n        //    p01 = _mm256_add_epi16(p01, p11);\r\n        //    p01 = _mm256_add_epi16(p01, p21);\r\n        //    p01 = _mm256_add_epi16(p01, p31);\r\n        //    p01 = _mm256_mullo_epi16(p01, coeff5);\r\n        //    p01 = _mm256_srli_epi16(p01, 5);\r\n        //\r\n        //    p00 = _mm256_packus_epi16(p00, p01);\r\n        //    p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n        //    _mm256_storeu_si256((__m256*)pfirst, p00);\r\n        //\r\n        //    pfirst += 32;\r\n        //\r\n        //    L1 = _mm256_setr_epi16(src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1],\r\n        //        src[-5], src[-5], src[-5], src[-5], src[-5], src[-5], src[-5], src[-5]);\r\n        //\r\n        //    p00 = _mm256_mullo_epi16(L2, coeff0);\r\n        //    p10 = _mm256_mullo_epi16(L3, coeff1);\r\n        //    p20 = _mm256_mullo_epi16(L0, coeff2);\r\n        //    p30 = _mm256_mullo_epi16(L1, coeff3);\r\n        //    p00 = _mm256_add_epi16(p00, coeff4);\r\n        //    p00 = _mm256_add_epi16(p00, p10);\r\n        //    p00 = _mm256_add_epi16(p00, p20);\r\n        //    p00 = _mm256_add_epi16(p00, p30);\r\n        //    p00 = _mm256_mullo_epi16(p00, coeff5);\r\n        //    p00 = _mm256_srli_epi16(p00, 5);\r\n        //\r\n        //    p00 = _mm256_packus_epi16(p00, p00);\r\n        //    p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n        //    
_mm256_maskstore_epi64((__int64 *)pfirst, mask, p00);\r\n        //\r\n        //}\r\n\r\n        __m256i M;\r\n\r\n        if (bsx == 64) {\r\n            for (i = 0; i < iHeight8; i += 32){\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 8));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 8 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 16));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 16 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 24));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 24 + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx == 32){\r\n            for (i = 0; i < iHeight8; i += 32){\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 8));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 16));\r\n                
_mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 24));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx == 16){\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            for (i = 0; i < iHeight8; i += 32){\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 8));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 16));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 24));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n            }\r\n        }\r\n\r\n        /*for (i = 0; i < iHeight8; i += 8) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }*/\r\n    } else {//8x8 8x32 4x4 4x16\r\n        intra_pred_ang_y_25_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        return ;\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_y_26_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx != 4) {\r\n        __m256i coeff2 = _mm256_set1_epi16(2);\r\n        __m256i coeff3 = _mm256_set1_epi16(3);\r\n        __m256i coeff4 = _mm256_set1_epi16(4);\r\n        __m256i coeff5 = _mm256_set1_epi16(5);\r\n        __m256i coeff7 = _mm256_set1_epi16(7);\r\n        __m256i coeff8 = _mm256_set1_epi16(8);\r\n        __m256i 
shuffle = _mm256_setr_epi8(7, 15, 6, 14, 5, 13, 4, 12, 3, 11, 2, 10, 1, 9, 0, 8,\r\n            7, 15, 6, 14, 5, 13, 4, 12, 3, 11, 2, 10, 1, 9, 0, 8);\r\n\r\n        ALIGN32(pel_t first_line[64 + 256]);\r\n        int line_size = bsx + (bsy - 1) * 4;\r\n        int iHeight4 = bsy << 2;\r\n\r\n        src -= 31;\r\n        __m256i p00, p10, p20, p30;\r\n        __m256i p01, p11, p21, p31;\r\n        __m256i M1, M2, M3, M4, M5, M6, M7, M8;\r\n        __m256i S0, S1, S2, S3;\r\n        __m256i L0, L1, L2, L3;\r\n        __m256i H0, H1, H2, H3;\r\n\r\n\r\n        for (i = 0; i < line_size - 64; i += 128, src -= 32) {\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(src));    //15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0\r\n            S1 = _mm256_loadu_si256((__m256i*)(src - 1));//16 15 14...\r\n            S2 = _mm256_loadu_si256((__m256i*)(src - 2));//17 16 15...\r\n            S3 = _mm256_loadu_si256((__m256i*)(src - 3));//18 17 16...\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));//15 14 13 12 11 10 9 8\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));//16 15 14...\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));//17 16 15...\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));//18 17 16...\r\n\r\n            H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));//7 6 5 4 3 2 1 0\r\n            H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));//8 7 6..\r\n            H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));//9 8 7...\r\n            H3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 1));//10 9 8...\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff7);\r\n            p20 = _mm256_mullo_epi16(L2, coeff5);\r\n            p30 = _mm256_add_epi16(L3, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = 
_mm256_add_epi16(p00, p20);\r\n            M1  = _mm256_srli_epi16(p00, 4);//31...16\r\n\r\n            p01 = _mm256_mullo_epi16(H0, coeff3);\r\n            p11 = _mm256_mullo_epi16(H1, coeff7);\r\n            p21 = _mm256_mullo_epi16(H2, coeff5);\r\n            p31 = _mm256_add_epi16(H3, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            M2  = _mm256_srli_epi16(p01, 4);//15...0\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_mullo_epi16(p00, coeff3);\r\n            p10 = _mm256_add_epi16(L0, L3);\r\n            p10 = _mm256_add_epi16(p10, coeff4);\r\n            p00 = _mm256_add_epi16(p10, p00);\r\n            M3  = _mm256_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm256_add_epi16(H1, H2);\r\n            p01 = _mm256_mullo_epi16(p01, coeff3);\r\n            p11 = _mm256_add_epi16(H0, H3);\r\n            p11 = _mm256_add_epi16(p11, coeff4);\r\n            p01 = _mm256_add_epi16(p11, p01);\r\n            M4  = _mm256_srli_epi16(p01, 3);\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff5);\r\n            p20 = _mm256_mullo_epi16(L2, coeff7);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(L0, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            M5  = _mm256_srli_epi16(p00, 4);//31...16\r\n\r\n            p11 = _mm256_mullo_epi16(H1, coeff5);\r\n            p21 = _mm256_mullo_epi16(H2, coeff7);\r\n            p31 = _mm256_mullo_epi16(H3, coeff3);\r\n            p01 = _mm256_add_epi16(H0, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            M6  = _mm256_srli_epi16(p01, 4);//15...0\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n       
     p10 = _mm256_add_epi16(L2, L3);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            M7  = _mm256_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm256_add_epi16(H1, H2);\r\n            p11 = _mm256_add_epi16(H2, H3);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, coeff2);\r\n            M8  = _mm256_srli_epi16(p01, 2);\r\n\r\n            M1 = _mm256_packus_epi16(M1, M3);\r\n            M5 = _mm256_packus_epi16(M5, M7);\r\n            M1 = _mm256_shuffle_epi8(M1, shuffle);\r\n            M5 = _mm256_shuffle_epi8(M5, shuffle);\r\n\r\n            M2 = _mm256_packus_epi16(M2, M4);\r\n            M6 = _mm256_packus_epi16(M6, M8);\r\n            M2 = _mm256_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm256_shuffle_epi8(M6, shuffle);\r\n\r\n            //M1 = _mm256_permute4x64_epi64(M1, 0x4E);\r\n            //M5 = _mm256_permute4x64_epi64(M5, 0x4E);\r\n            //M2 = _mm256_permute4x64_epi64(M2, 0x4E);\r\n            //M6 = _mm256_permute4x64_epi64(M6, 0x4E);\r\n\r\n            M1 = _mm256_permute4x64_epi64(M1, 0x72);\r\n            M5 = _mm256_permute4x64_epi64(M5, 0x72);\r\n            M2 = _mm256_permute4x64_epi64(M2, 0x72);\r\n            M6 = _mm256_permute4x64_epi64(M6, 0x72);\r\n\r\n            M3 = _mm256_unpacklo_epi16(M1, M5);\r\n            M7 = _mm256_unpackhi_epi16(M1, M5);\r\n            M4 = _mm256_unpacklo_epi16(M2, M6);\r\n            M8 = _mm256_unpackhi_epi16(M2, M6);\r\n\r\n            _mm256_storeu_si256((__m256i*)&first_line[i], M4);\r\n            _mm256_storeu_si256((__m256i*)&first_line[32 + i], M8);\r\n            _mm256_storeu_si256((__m256i*)&first_line[64 + i], M3);\r\n            _mm256_storeu_si256((__m256i*)&first_line[96 + i], M7);\r\n        }\r\n\r\n        if (i < line_size) {\r\n            S0 = _mm256_loadu_si256((__m256i*)(src));    //15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0\r\n            S1 = 
_mm256_loadu_si256((__m256i*)(src - 1));//16 15 14...\r\n            S2 = _mm256_loadu_si256((__m256i*)(src - 2));//17 16 15...\r\n            S3 = _mm256_loadu_si256((__m256i*)(src - 3));//18 17 16...\r\n\r\n            H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));//7 6 5 4 3 2 1 0\r\n            H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));//8 7 6..\r\n            H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));//9 8 7...\r\n            H3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 1));//10 9 8...\r\n\r\n            p01 = _mm256_mullo_epi16(H0, coeff3);\r\n            p11 = _mm256_mullo_epi16(H1, coeff7);\r\n            p21 = _mm256_mullo_epi16(H2, coeff5);\r\n            p31 = _mm256_add_epi16(H3, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            M2 = _mm256_srli_epi16(p01, 4);//15...0\r\n\r\n            p01 = _mm256_add_epi16(H1, H2);\r\n            p01 = _mm256_mullo_epi16(p01, coeff3);\r\n            p11 = _mm256_add_epi16(H0, H3);\r\n            p11 = _mm256_add_epi16(p11, coeff4);\r\n            p01 = _mm256_add_epi16(p11, p01);\r\n            M4 = _mm256_srli_epi16(p01, 3);\r\n\r\n            p11 = _mm256_mullo_epi16(H1, coeff5);\r\n            p21 = _mm256_mullo_epi16(H2, coeff7);\r\n            p31 = _mm256_mullo_epi16(H3, coeff3);\r\n            p01 = _mm256_add_epi16(H0, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            M6 = _mm256_srli_epi16(p01, 4);//15...0\r\n\r\n            p01 = _mm256_add_epi16(H1, H2);\r\n            p11 = _mm256_add_epi16(H2, H3);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, coeff2);\r\n            M8 = _mm256_srli_epi16(p01, 2);\r\n\r\n            M2 = _mm256_packus_epi16(M2, M4);\r\n            
M6 = _mm256_packus_epi16(M6, M8);\r\n            M2 = _mm256_shuffle_epi8(M2, shuffle);\r\n            M6 = _mm256_shuffle_epi8(M6, shuffle);\r\n\r\n            //M2 = _mm256_permute4x64_epi64(M2, 0x4E);\r\n            //M6 = _mm256_permute4x64_epi64(M6, 0x4E);\r\n\r\n            M2 = _mm256_permute4x64_epi64(M2, 0x72);\r\n            M6 = _mm256_permute4x64_epi64(M6, 0x72);\r\n\r\n            M4 = _mm256_unpacklo_epi16(M2, M6);\r\n            M8 = _mm256_unpackhi_epi16(M2, M6);\r\n\r\n            _mm256_storeu_si256((__m256i*)&first_line[i], M4);\r\n            _mm256_storeu_si256((__m256i*)&first_line[32 + i], M8);\r\n        }\r\n\r\n        __m256i M;\r\n        if (bsx == 64) {\r\n            for (i = 0; i < iHeight4; i += 16){\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 4));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 32 + 4));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 8));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 32 + 8));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 12));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 32 + 12));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n    
        }\r\n        } else if (bsx == 32) {\r\n            for (i = 0; i < iHeight4; i += 16){\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 4));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 8));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 12));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx == 16){\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            for (i = 0; i < iHeight4; i += 16){\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 4));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 8));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 12));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n            }\r\n        } else {\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n            for (i = 0; i < iHeight4; i += 16){\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n 
               dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 4));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 8));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(first_line + i + 12));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n            }\r\n        }\r\n\r\n        /*switch (bsx) {\r\n        case 4:\r\n            for (i = 0; i < iHeight4; i += 4) {\r\n                CP32(dst, first_line + i);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        case 8:\r\n            for (i = 0; i < iHeight4; i += 4) {\r\n                CP64(dst, first_line + i);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        default:\r\n            for (i = 0; i < iHeight4; i += 4) {\r\n                memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        }*/\r\n    } else { //4x4 4x16\r\n        intra_pred_ang_y_26_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        return;\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_y_28_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN32(pel_t first_line[64 + 128]);\r\n    int line_size = bsx + (bsy - 1) * 2;\r\n\r\n    int i;\r\n    int iHeight2 = bsy << 1;\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    __m256i coeff2 = _mm256_set1_epi16(2);\r\n    __m256i coeff3 = _mm256_set1_epi16(3);\r\n    __m256i coeff4 = _mm256_set1_epi16(4);\r\n    __m256i shuffle = _mm256_setr_epi8(7, 15, 6, 14, 5, 13, 4, 12, 3, 11, 2, 10, 1, 9, 0, 8,\r\n        7, 15, 6, 14, 5, 13, 4, 12, 3, 11, 2, 10, 1, 9, 0, 8);\r\n\r\n    src -= 31;\r\n    __m256i 
p00, p10;\r\n    __m256i p01, p11;\r\n    __m256i S0, S1, S2, S3;\r\n    __m256i L0, L1, L2, L3;\r\n    __m256i H0, H1, H2, H3;\r\n#if BUGFIX_PREDICTION_INTRA\r\n    for (i = 0; i < line_size - 32; i += 64, src -= 32) {\r\n#else\r\n    for (i = 0; i < real_size - 32; i += 64, src -= 32) {\r\n#endif\r\n        S0 = _mm256_loadu_si256((__m256i*)(src));\r\n        S3 = _mm256_loadu_si256((__m256i*)(src - 3));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src - 2));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));//15 14 13 12 11 10 9 8\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));//16 15 14...\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));//17 16 15...\r\n        L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));//18 17 16...\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));//7 6 5 4 3 2 1 0\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));//8 7 6..\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));//9 8 7...\r\n        H3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 1));//10 9 8...\r\n\r\n        p00 = _mm256_adds_epi16(L1, L2);\r\n        p00 = _mm256_mullo_epi16(p00, coeff3);\r\n        p10 = _mm256_add_epi16(L0, L3);\r\n        p10 = _mm256_add_epi16(p10, coeff4);\r\n        p00 = _mm256_add_epi16(p00, p10);\r\n        p00 = _mm256_srli_epi16(p00, 3);//031...016\r\n\r\n        p01 = _mm256_add_epi16(L1, L2);\r\n        p11 = _mm256_add_epi16(L2, L3);\r\n        p01 = _mm256_add_epi16(p01, p11);\r\n        p01 = _mm256_add_epi16(p01, coeff2);\r\n        p01 = _mm256_srli_epi16(p01, 2);//131...116\r\n\r\n        p00 = _mm256_packus_epi16(p00, p01);//\r\n        p00 = _mm256_shuffle_epi8(p00, shuffle);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x4E);\r\n        _mm256_storeu_si256((__m256i*)&first_line[i + 32], p00);\r\n\r\n        p00 = 
_mm256_adds_epi16(H1, H2);\r\n        p00 = _mm256_mullo_epi16(p00, coeff3);\r\n        p10 = _mm256_adds_epi16(H0, H3);\r\n        p10 = _mm256_adds_epi16(p10, coeff4);\r\n        p00 = _mm256_adds_epi16(p00, p10);\r\n        p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n        p01 = _mm256_add_epi16(H1, H2);\r\n        p11 = _mm256_add_epi16(H2, H3);\r\n        p01 = _mm256_add_epi16(p01, p11);\r\n        p01 = _mm256_add_epi16(p01, coeff2);\r\n        p01 = _mm256_srli_epi16(p01, 2);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p01);\r\n        p00 = _mm256_shuffle_epi8(p00, shuffle);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x4E);\r\n        _mm256_storeu_si256((__m256i*)&first_line[i], p00);\r\n    }\r\n\r\n#if BUGFIX_PREDICTION_INTRA\r\n    if (i < line_size) {\r\n#else\r\n    if (i < real_size) {\r\n#endif\r\n        S0 = _mm256_loadu_si256((__m256i*)(src));\r\n        S3 = _mm256_loadu_si256((__m256i*)(src - 3));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src - 2));\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));//7 6 5 4 3 2 1 0\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));//8 7 6..\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));//9 8 7...\r\n        H3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 1));//10 9 8...\r\n\r\n        p00 = _mm256_adds_epi16(H1, H2);\r\n        p00 = _mm256_mullo_epi16(p00, coeff3);\r\n        p10 = _mm256_adds_epi16(H0, H3);\r\n        p10 = _mm256_adds_epi16(p10, coeff4);\r\n        p00 = _mm256_adds_epi16(p00, p10);\r\n        p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n        p01 = _mm256_add_epi16(H1, H2);\r\n        p11 = _mm256_add_epi16(H2, H3);\r\n        p01 = _mm256_add_epi16(p01, p11);\r\n        p01 = _mm256_add_epi16(p01, coeff2);\r\n        p01 = _mm256_srli_epi16(p01, 2);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p01);\r\n        p00 = _mm256_shuffle_epi8(p00, 
shuffle);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x4E);\r\n        _mm256_storeu_si256((__m256i*)&first_line[i], p00);\r\n    }\r\n\r\n    if (bsx == 64){\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i + 2] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 4]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i + 4] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 6]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(&first_line[i + 6] + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n        }\r\n\r\n    } else if (bsx == 32){\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 4]);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 6]);\r\n        
    _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16){\r\n        __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        for (i = 0; i < iHeight2; i += 8){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 4]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 6]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    } else if (bsx == 8){\r\n        __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n        for (i = 0; i < iHeight2; i += 8){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 4]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 6]);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    } else {\r\n        __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[3]);\r\n        for (i = 0; i < iHeight2; i += 8){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)&first_line[i]);\r\n            
_mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 2]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 4]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)&first_line[i + 6]);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    }\r\n\r\n\r\n\r\n    /*if (bsx >= 16) {\r\n\r\n        for (i = 0; i < iHeight2; i += 2) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 8) {\r\n\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            _mm_storel_epi64((__m128i*)(dst), M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            _mm_storel_epi64((__m128i*)(dst), M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            _mm_storel_epi64((__m128i*)(dst), M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            _mm_storel_epi64((__m128i*)(dst), M);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < iHeight2; i += 8) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 2);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n        }\r\n    
}*/\r\n}\r\n\r\nvoid intra_pred_ang_y_30_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN32(pel_t first_line[64 + 64]);\r\n    int line_size = bsx + bsy - 1;\r\n    UNUSED_PARAMETER(dir_mode);\r\n    int i;\r\n\r\n    __m256i coeff2 = _mm256_set1_epi16(2);\r\n    __m256i shuffle = _mm256_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,\r\n        15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);\r\n\r\n    __m256i p00, p10;\r\n    __m256i p01, p11;\r\n    __m256i S0, S1, S2;\r\n    __m256i L0, L1, L2;\r\n    __m256i H0, H1, H2;\r\n\r\n    src -= 33;\r\n\r\n    for (i = 0; i < line_size - 16; i += 32, src -= 32) {\r\n\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));//35 34 33...\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));//34 33 32...\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));//20 19 18...\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));//19 18 17...\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n\r\n        p00 = _mm256_add_epi16(L0, L1);\r\n        p10 = _mm256_add_epi16(L1, L2);\r\n        p00 = _mm256_add_epi16(p00, p10);\r\n        p00 = _mm256_add_epi16(p00, coeff2);\r\n        p00 = _mm256_srli_epi16(p00, 2);//31...24 23...16\r\n\r\n        p01 = _mm256_add_epi16(H0, H1);\r\n        p11 = _mm256_add_epi16(H1, H2);\r\n        p01 = _mm256_add_epi16(p01, p11);\r\n        p01 = _mm256_add_epi16(p01, coeff2);\r\n        p01 = _mm256_srli_epi16(p01, 2);//15..8 7...0\r\n\r\n        p00 = _mm256_packus_epi16(p00, p01);//32...24 15...8 23...16 7...0\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x8D);\r\n        p00 = _mm256_shuffle_epi8(p00, 
shuffle);\r\n\r\n        _mm256_storeu_si256((__m256i*)&first_line[i], p00);\r\n\r\n    }\r\n\r\n    __m256i mask = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n\r\n    if (i < line_size) {\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));//20 19 18...\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));//19 18 17...\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));//18\r\n\r\n        p01 = _mm256_add_epi16(H0, H1);\r\n        p11 = _mm256_add_epi16(H1, H2);\r\n        p01 = _mm256_add_epi16(p01, p11);\r\n        p01 = _mm256_add_epi16(p01, coeff2);\r\n        p01 = _mm256_srli_epi16(p01, 2);//15...8..7..0\r\n\r\n        p01 = _mm256_packus_epi16(p01, p01);//15...8 15...8 7...0 7...0\r\n        p01 = _mm256_permute4x64_epi64(p01, 0x0008);\r\n        p01 = _mm256_shuffle_epi8(p01, shuffle);\r\n\r\n        _mm256_maskstore_epi64((__int64 *)&first_line[i], mask, p01);\r\n    }\r\n\r\n    __m256i M;\r\n    if (bsx == 64) {\r\n        for (i = 0; i < bsy; i += 4){\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 32 + 1));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 32 + 
2));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 32 + 3));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 32) {\r\n        for (i = 0; i < bsy; i += 4){\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16){\r\n        mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        for (i = 0; i < bsy; i += 4){\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    } else if (bsx == 8){\r\n        mask = 
_mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n        for (i = 0; i < bsy; i += 4){\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[3]);\r\n        for (i = 0; i < bsy; i += 4){\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 1));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 2));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(first_line + i + 3));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n\r\n\r\n    /*if (bsx > 16) {\r\n\r\n        for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, first_line + i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16) {\r\n\r\n        pel_t *dst1 = dst;\r\n\r\n        if (bsy == 4) {\r\n            __m256i M = _mm256_loadu_si256((__m256i*)&first_line[0]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n 
           dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[1]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[2]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[3]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n\r\n        } else {\r\n            __m256i M = _mm256_loadu_si256((__m256i*)&first_line[0]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[1]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[2]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[3]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[4]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[5]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[6]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[7]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[8]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = 
_mm256_loadu_si256((__m256i*)&first_line[9]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[10]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[11]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[12]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[13]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[14]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n            dst1 += i_dst;\r\n            M = _mm256_loadu_si256((__m256i*)&first_line[15]);\r\n            _mm256_maskstore_epi64((__int64 *)dst1, mask, M);\r\n        }\r\n    } else if (bsx == 8) {\r\n\r\n        for (i = 0; i < bsy; i += 8) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            
_mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m128i M = _mm_loadu_si128((__m128i*)&first_line[i]);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n            M = _mm_srli_si128(M, 1);\r\n            ((int*)(dst))[0] = _mm_cvtsi128_si32(M);\r\n            dst += i_dst;\r\n        }\r\n    }*/\r\n\r\n}\r\n\r\nvoid intra_pred_ang_y_31_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx >= bsy){\r\n        ALIGN32(pel_t dst_tran[MAX_CU_SIZE * MAX_CU_SIZE]);\r\n        ALIGN32(pel_t src_tran[MAX_CU_SIZE << 3]);\r\n\r\n        for (i = 0; i < (bsy + bsx * 11 / 8 + 3); i++){\r\n            src_tran[i] = src[-i];\r\n        }\r\n        intra_pred_ang_x_5_avx(src_tran, dst_tran, bsy, 5, bsy, bsx);\r\n        for (i = 0; i < bsy; i++){\r\n            for (int j = 0; j < bsx; j++){\r\n                dst[j + i_dst * i] = dst_tran[i + bsy * j];\r\n            }\r\n        }\r\n    } else if (bsx == 8){\r\n\r\n        __m128i coeff0 = _mm_setr_epi16( 5, 1,  7, 1,  1, 3,  3, 1);\r\n        __m128i coeff1 = _mm_setr_epi16(13, 5, 15, 3,  9, 7, 11, 2);\r\n        __m128i coeff2 = _mm_setr_epi16(11, 7,  9, 3, 15, 5, 13, 1);\r\n        __m128i coeff3 = _mm_setr_epi16( 3, 3,  1, 1,  7, 1,  5, 0);\r\n        __m128i coeff4 = _mm_setr_epi16(16, 8, 16, 4, 16, 8, 16, 2);\r\n        __m128i coeff5 = _mm_setr_epi16( 1, 2,  1, 4,  1, 2,  1, 8);\r\n\r\n        __m128i L0, L1, L2, 
L3;\r\n        __m128i p00, p10, p20, p30;\r\n\r\n        for (i = 0; i < bsy; i++,src--){\r\n            L0 = _mm_setr_epi16(src[-1], src[-2], src[-4], src[-5], src[-6], src[ -8], src[ -9], src[-11]);\r\n            L1 = _mm_setr_epi16(src[-2], src[-3], src[-5], src[-6], src[-7], src[ -9], src[-10], src[-12]);\r\n            L2 = _mm_setr_epi16(src[-3], src[-4], src[-6], src[-7], src[-8], src[-10], src[-11], src[-13]);\r\n            L3 = _mm_setr_epi16(src[-4], src[-5], src[-7], src[-8], src[-9], src[-11], src[-12], src[-14]);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff0);\r\n            p10 = _mm_mullo_epi16(L1, coeff1);\r\n            p20 = _mm_mullo_epi16(L2, coeff2);\r\n            p30 = _mm_mullo_epi16(L3, coeff3);\r\n            p00 = _mm_add_epi16(p00, coeff4);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_mullo_epi16(p00, coeff5);\r\n            p00 = _mm_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            _mm_storel_epi64((__m128i*)dst, p00);\r\n            dst += i_dst;\r\n        }\r\n\r\n    } else {\r\n        intra_pred_ang_y_31_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n    }\r\n\r\n\r\n}\r\n\r\nvoid intra_pred_ang_y_32_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    ALIGN32(pel_t first_line[2 * (64 + 64)]);\r\n    int line_size = (bsy >> 1) + bsx - 1;\r\n\r\n    int i;\r\n    int aligned_line_size = ((line_size + 63) >> 4) << 4;\r\n    pel_t *pfirst[2];\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    __m256i coeff2 = _mm256_set1_epi16(2);\r\n    __m256i shuffle = _mm256_setr_epi8(15, 13, 11, 9, 7, 5, 3, 1, 14, 12, 10, 8, 6, 4, 2, 0,\r\n        15, 13, 11, 9, 7, 5, 3, 1, 14, 12, 10, 8, 6, 4, 2, 0);\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = first_line + aligned_line_size;\r\n\r\n    src -= 34;\r\n\r\n    __m256i S0, S1, S2;\r\n    __m256i L0, 
L1, L2;\r\n    __m256i H0, H1, H2;\r\n    __m256i p00, p01, p10, p11;\r\n\r\n    __m256i mask = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n    for (i = 0; i < line_size - 8; i += 16, src -= 32) {\r\n\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));//19 18 17 16 15 14 13 12  11 10 9 8 7 6 5 4\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));    //18 17 16 15 14 13 12 11  10  9 8 7 6 5 4 3\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));//17 16 15 14 13 12 11 10   9  8 7 6 5 4 3 2\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));//19 18 17 16 15 14 13 12\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));//18 17 16 15 14 13 12 11\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));//17 16 15 14 13 12 11 10\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));//11 10 9 8 7 6 5 4\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));//10  9 8 7 6 5 4 3\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));// 9  8 7 6 5 4 3 2\r\n\r\n        p00 = _mm256_add_epi16(L0, L1);\r\n        p01 = _mm256_add_epi16(L1, L2);\r\n        p00 = _mm256_add_epi16(p00, coeff2);\r\n        p00 = _mm256_add_epi16(p00, p01);\r\n        p00 = _mm256_srli_epi16(p00, 2);//19...12(31...16)\r\n\r\n        p10 = _mm256_add_epi16(H0, H1);\r\n        p11 = _mm256_add_epi16(H1, H2);\r\n        p10 = _mm256_add_epi16(p10, coeff2);\r\n        p10 = _mm256_add_epi16(p10, p11);\r\n        p10 = _mm256_srli_epi16(p10, 2);//11...4(15...0)\r\n\r\n        //31...24 15...8 23...16 7...0\r\n        p00 = _mm256_packus_epi16(p00, p10);     //19 18 17 16 15 14 13 12   11 10 9 8 7 6 5 4\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x8D);//31...16 15..0\r\n        //0 2 4 6 8 10 12 14  1 3 5 7 9 11 13 15  16....\r\n        p00 = _mm256_shuffle_epi8(p00, shuffle);\r\n        p10 = _mm256_permute4x64_epi64(p00, 0x0D);\r\n        p00 = 
_mm256_permute4x64_epi64(p00, 0x08);\r\n\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask, p00);\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask, p10);\r\n    }\r\n\r\n    mask = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[8]);\r\n    if (i < line_size) {\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));//19 18 17 16 15 14 13 12  11 10 9 8 7 6 5 4\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));    //18 17 16 15 14 13 12 11  10  9 8 7 6 5 4 3\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));//17 16 15 14 13 12 11 10   9  8 7 6 5 4 3 2\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));//11 10 9 8 7 6 5 4\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));//10  9 8 7 6 5 4 3\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));// 9  8 7 6 5 4 3 2\r\n\r\n        p10 = _mm256_add_epi16(H0, H1);\r\n        p11 = _mm256_add_epi16(H1, H2);\r\n        p10 = _mm256_add_epi16(p10, coeff2);\r\n        p10 = _mm256_add_epi16(p10, p11);\r\n        p10 = _mm256_srli_epi16(p10, 2);\r\n\r\n        //15...8 15...8 7...0 7...0\r\n        p00 = _mm256_packus_epi16(p10, p10);     //19 18 17 16 15 14 13 12   11 10 9 8 7 6 5 4\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x8D);//15...0 15...0\r\n        //0 2 4 6 8 10 12 14  1 3 5 7 1 3 5 7 8....\r\n        p00 = _mm256_shuffle_epi8(p00, shuffle);\r\n        p10 = _mm256_permute4x64_epi64(p00, 0x0D);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x08);\r\n\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask, p00);\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask, p10);\r\n    }\r\n    bsy >>= 1;\r\n\r\n    if (bsx == 64){\r\n\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 32));\r\n            
_mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = 
_mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    } else if (bsx == 32){\r\n\r\n        for (i = 0; i < bsy; i += 4){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16){\r\n        mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n    
        M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 8){\r\n        mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = 
_mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    } else{\r\n        mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 1));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 1));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 2));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 2));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] + i + 3));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] + i + 3));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n\r\n    /*if (bsx >= 16 || bsx == 4) {\r\n        for (i = 0; 
i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] + i, bsx * sizeof(pel_t));\r\n            memcpy(dst + i_dst, pfirst[1] + i, bsx * sizeof(pel_t));\r\n            dst += i_dst2;\r\n        }\r\n    } else {\r\n\r\n        if (bsy == 4) {//8x8\r\n            __m128i M1 = _mm_loadu_si128((__m128i*)&pfirst[0][0]);\r\n            __m128i M2 = _mm_loadu_si128((__m128i*)&pfirst[1][0]);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n            dst += i_dst2;\r\n            M1 = _mm_srli_si128(M1, 1);\r\n            M2 = _mm_srli_si128(M2, 1);\r\n            _mm_storel_epi64((__m128i*)dst, M1);\r\n            _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n        } else {//8x32\r\n            for (i = 0; i < 16; i = i + 8) {\r\n                __m128i M1 = _mm_loadu_si128((__m128i*)&pfirst[0][i]);\r\n                __m128i M2 = _mm_loadu_si128((__m128i*)&pfirst[1][i]);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                
_mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n                M1 = _mm_srli_si128(M1, 1);\r\n                M2 = _mm_srli_si128(M2, 1);\r\n                _mm_storel_epi64((__m128i*)dst, M1);\r\n                _mm_storel_epi64((__m128i*)(dst + i_dst), M2);\r\n                dst += i_dst2;\r\n            }\r\n        }\r\n    }*/\r\n}\r\n\r\nvoid intra_pred_ang_xy_13_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsy > 4) {\r\n\r\n        __m256i coeff2 = _mm256_set1_epi16(2);\r\n        __m256i coeff3 = _mm256_set1_epi16(3);\r\n        __m256i coeff4 = _mm256_set1_epi16(4);\r\n        __m256i coeff5 = _mm256_set1_epi16(5);\r\n        __m256i coeff7 = _mm256_set1_epi16(7);\r\n        __m256i coeff8 = _mm256_set1_epi16(8);\r\n        __m256i coeff9 = _mm256_set1_epi16(9);\r\n        __m256i coeff11 = _mm256_set1_epi16(11);\r\n        __m256i coeff13 = 
_mm256_set1_epi16(13);\r\n        __m256i coeff15 = _mm256_set1_epi16(15);\r\n        __m256i coeff16 = _mm256_set1_epi16(16);\r\n\r\n        ALIGN32(pel_t first_line[(64 + 16) << 3]);\r\n        int line_size = bsx + (bsy >> 3) - 1;\r\n        int left_size = line_size - bsx;\r\n        int aligned_line_size = ((line_size + 15) >> 4) << 4;\r\n        pel_t *pfirst[8];\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = pfirst[0] + aligned_line_size;\r\n        pfirst[2] = pfirst[1] + aligned_line_size;\r\n        pfirst[3] = pfirst[2] + aligned_line_size;\r\n        pfirst[4] = pfirst[3] + aligned_line_size;\r\n        pfirst[5] = pfirst[4] + aligned_line_size;\r\n        pfirst[6] = pfirst[5] + aligned_line_size;\r\n        pfirst[7] = pfirst[6] + aligned_line_size;\r\n\r\n        src -= bsy - 8;\r\n        for (i = 0; i < left_size; i++, src += 8) {// left_size is small, so there is no need to use SIMD intrinsics here\r\n            pfirst[0][i] = (pel_t)((src[6] + (src[7] << 1) + src[8] + 2) >> 2);\r\n            pfirst[1][i] = (pel_t)((src[5] + (src[6] << 1) + src[7] + 2) >> 2);\r\n            pfirst[2][i] = (pel_t)((src[4] + (src[5] << 1) + src[6] + 2) >> 2);\r\n            pfirst[3][i] = (pel_t)((src[3] + (src[4] << 1) + src[5] + 2) >> 2);\r\n\r\n            pfirst[4][i] = (pel_t)((src[2] + (src[3] << 1) + src[4] + 2) >> 2);\r\n            pfirst[5][i] = (pel_t)((src[1] + (src[2] << 1) + src[3] + 2) >> 2);\r\n            pfirst[6][i] = (pel_t)((src[0] + (src[1] << 1) + src[2] + 2) >> 2);\r\n            pfirst[7][i] = (pel_t)((src[-1] + (src[0] << 1) + src[1] + 2) >> 2);\r\n        }\r\n\r\n        __m256i p00, p10, p20, p30;\r\n        __m256i p01, p11, p21, p31;\r\n        __m256i S0, S1, S2, S3;\r\n        __m256i L0, L1, L2, L3;\r\n        __m256i H0, H1, H2, H3;\r\n\r\n        for (; i < line_size - 16; i += 32, src += 32) {\r\n            \r\n            S0 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n            S1 = 
_mm256_loadu_si256((__m256i*)(src + 1));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src));\r\n            S3 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n            H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));\r\n            H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n            H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n            H3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 1));\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff7);\r\n            p10 = _mm256_mullo_epi16(L1, coeff15);\r\n            p20 = _mm256_mullo_epi16(L2, coeff9);\r\n            p30 = _mm256_add_epi16(L3, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm256_mullo_epi16(H0, coeff7);\r\n            p11 = _mm256_mullo_epi16(H1, coeff15);\r\n            p21 = _mm256_mullo_epi16(H2, coeff9);\r\n            p31 = _mm256_add_epi16(H3, coeff16);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n\r\n            _mm256_storeu_si256((__m256i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff7);\r\n            p20 = _mm256_mullo_epi16(L2, coeff5);\r\n            p30 = _mm256_add_epi16(L3, coeff8);\r\n        
    p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm256_mullo_epi16(H0, coeff3);\r\n            p11 = _mm256_mullo_epi16(H1, coeff7);\r\n            p21 = _mm256_mullo_epi16(H2, coeff5);\r\n            p31 = _mm256_add_epi16(H3, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[1][i], p00);\r\n\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff5);\r\n            p10 = _mm256_mullo_epi16(L1, coeff13);\r\n            p20 = _mm256_mullo_epi16(L2, coeff11);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm256_mullo_epi16(H0, coeff5);\r\n            p11 = _mm256_mullo_epi16(H1, coeff13);\r\n            p21 = _mm256_mullo_epi16(H2, coeff11);\r\n            p31 = _mm256_mullo_epi16(H3, coeff3);\r\n            p01 = _mm256_add_epi16(p01, coeff16);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(L0, L3);\r\n            p10 = _mm256_add_epi16(L1, L2);\r\n      
      p10 = _mm256_mullo_epi16(p10, coeff3);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, coeff4);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm256_add_epi16(H0, H3);\r\n            p11 = _mm256_add_epi16(H1, H2);\r\n            p11 = _mm256_mullo_epi16(p11, coeff3);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, coeff4);\r\n            p01 = _mm256_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[3][i], p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff11);\r\n            p20 = _mm256_mullo_epi16(L2, coeff13);\r\n            p30 = _mm256_mullo_epi16(L3, coeff5);\r\n            p00 = _mm256_add_epi16(p00, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p01 = _mm256_mullo_epi16(H0, coeff3);\r\n            p11 = _mm256_mullo_epi16(H1, coeff11);\r\n            p21 = _mm256_mullo_epi16(H2, coeff13);\r\n            p31 = _mm256_mullo_epi16(H3, coeff5);\r\n            p01 = _mm256_add_epi16(p01, coeff16);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[4][i], p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff5);\r\n            p20 = _mm256_mullo_epi16(L2, coeff7);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(L0, 
coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm256_mullo_epi16(H1, coeff5);\r\n            p21 = _mm256_mullo_epi16(H2, coeff7);\r\n            p31 = _mm256_mullo_epi16(H3, coeff3);\r\n            p01 = _mm256_add_epi16(H0, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[5][i], p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff9);\r\n            p20 = _mm256_mullo_epi16(L2, coeff15);\r\n            p30 = _mm256_mullo_epi16(L3, coeff7);\r\n            p00 = _mm256_add_epi16(L0, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p11 = _mm256_mullo_epi16(H1, coeff9);\r\n            p21 = _mm256_mullo_epi16(H2, coeff15);\r\n            p31 = _mm256_mullo_epi16(H3, coeff7);\r\n            p01 = _mm256_add_epi16(H0, coeff16);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[6][i], p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L2, coeff2);\r\n            p00 = _mm256_add_epi16(L1, L3);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_add_epi16(p00, 
p10);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p11 = _mm256_mullo_epi16(H2, coeff2);\r\n            p01 = _mm256_add_epi16(H1, H3);\r\n            p01 = _mm256_add_epi16(p01, coeff2);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[7][i], p00);\r\n\r\n        }\r\n        __m256i mask = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n\r\n        if (i < line_size) {\r\n            \r\n            S0 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n            S1 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src));\r\n            S3 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff7);\r\n            p10 = _mm256_mullo_epi16(L1, coeff15);\r\n            p20 = _mm256_mullo_epi16(L2, coeff9);\r\n            p30 = _mm256_add_epi16(L3, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_maskstore_epi32((int*)&pfirst[0][i], mask, p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff7);\r\n            p20 = _mm256_mullo_epi16(L2, coeff5);\r\n            p30 = _mm256_add_epi16(L3, 
coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_maskstore_epi32((int*)&pfirst[1][i], mask, p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff5);\r\n            p10 = _mm256_mullo_epi16(L1, coeff13);\r\n            p20 = _mm256_mullo_epi16(L2, coeff11);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(p00, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_maskstore_epi32((int*)&pfirst[2][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L0, L3);\r\n            p10 = _mm256_add_epi16(L1, L2);\r\n            p10 = _mm256_mullo_epi16(p10, coeff3);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, coeff4);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_maskstore_epi32((int*)&pfirst[3][i], mask, p00);\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff11);\r\n            p20 = _mm256_mullo_epi16(L2, coeff13);\r\n            p30 = _mm256_mullo_epi16(L3, coeff5);\r\n            p00 = _mm256_add_epi16(p00, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 
5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_maskstore_epi32((int*)&pfirst[4][i], mask, p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff5);\r\n            p20 = _mm256_mullo_epi16(L2, coeff7);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(L0, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_maskstore_epi32((int*)&pfirst[5][i], mask, p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff9);\r\n            p20 = _mm256_mullo_epi16(L2, coeff15);\r\n            p30 = _mm256_mullo_epi16(L3, coeff7);\r\n            p00 = _mm256_add_epi16(L0, coeff16);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_maskstore_epi32((int*)&pfirst[6][i], mask, p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L2, coeff2);\r\n            p00 = _mm256_add_epi16(L1, L3);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_maskstore_epi32((int*)&pfirst[7][i], mask, p00);\r\n        }\r\n\r\n        pfirst[0] += left_size;\r\n        pfirst[1] += left_size;\r\n        pfirst[2] += left_size;\r\n        pfirst[3] += left_size;\r\n        
pfirst[4] += left_size;\r\n        pfirst[5] += left_size;\r\n        pfirst[6] += left_size;\r\n        pfirst[7] += left_size;\r\n\r\n        bsy >>= 3;\r\n        __m256i M;\r\n        if (bsx == 64){\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] - i + 32));\r\n                
_mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx == 32){\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] 
- i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx == 16) {\r\n            mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx == 8) {\r\n            mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                
dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[3]);\r\n            for (i = 0; i < bsy; i++){\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                
_mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[4] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[5] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[6] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[7] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n            }\r\n        }\r\n        \r\n        /*for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[1] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[2] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[3] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[4] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[5] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[6] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[7] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }*/\r\n    } else {\r\n        intra_pred_ang_xy_13_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        return;\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_xy_14_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    __m256i coeff2 = _mm256_set1_epi16(2);\r\n    __m256i coeff3 = _mm256_set1_epi16(3);\r\n    __m256i 
coeff4 = _mm256_set1_epi16(4);\r\n    __m256i coeff5 = _mm256_set1_epi16(5);\r\n    __m256i coeff7 = _mm256_set1_epi16(7);\r\n    __m256i coeff8 = _mm256_set1_epi16(8);\r\n\r\n    if (bsy != 4) {\r\n        ALIGN32(pel_t first_line[4 * (64 + 32)]);\r\n        int line_size = bsx + bsy / 4 - 1;\r\n        int left_size = line_size - bsx;\r\n        int aligned_line_size = ((line_size + 31) >> 4) << 4;\r\n        pel_t *pfirst[4];\r\n        __m256i shuffle = _mm256_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15,\r\n            0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15);\r\n\r\n        __m256i index = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);\r\n\r\n        pel_t *pSrc1 = src;\r\n\r\n        pfirst[0] = first_line;\r\n        pfirst[1] = first_line + aligned_line_size;\r\n        pfirst[2] = first_line + aligned_line_size * 2;\r\n        pfirst[3] = first_line + aligned_line_size * 3;\r\n        src -= bsy - 4;\r\n\r\n        __m256i p00, p01, p10, p11;\r\n        __m256i p20, p30, p21, p31;\r\n        __m256i S0, S1, S2, S3;\r\n        __m256i L0, L1, L2, L3;\r\n        __m256i H0, H1, H2, H3;\r\n\r\n        __m256i mask0 = _mm256_set_epi64x(0, 0, 0, -1);\r\n        __m256i mask1 = _mm256_set_epi64x(0, 0, -1, 0);\r\n        __m256i mask2 = _mm256_set_epi64x(0, -1, 0, 0);\r\n        __m256i mask3 = _mm256_set_epi64x(-1, 0, 0, 0);\r\n\r\n        for (i = 0; i < left_size - 1; i += 8, src += 32) {\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(src - 1));//0 1 2 3 4 5 6 7 8...15\r\n            S1 = _mm256_loadu_si256((__m256i*)(src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));//0...15\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n            H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));//16...31\r\n            H1 = 
_mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n            H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n\r\n            p00 = _mm256_add_epi16(L0, L1);\r\n            p01 = _mm256_add_epi16(L1, L2);\r\n            p10 = _mm256_add_epi16(H0, H1);\r\n            p11 = _mm256_add_epi16(H1, H2);\r\n\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p10 = _mm256_add_epi16(p10, coeff2);\r\n            p00 = _mm256_add_epi16(p00, p01);\r\n            p10 = _mm256_add_epi16(p10, p11);\r\n\r\n            p00 = _mm256_srli_epi16(p00, 2);//0...7 8...15\r\n            p10 = _mm256_srli_epi16(p10, 2);//16...23 24...31\r\n\r\n            p00 = _mm256_packus_epi16(p00, p10);//0...7 16...23 8...15 24...31\r\n            \r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            //0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15     16 20 24 28 17 21...\r\n            p10 = _mm256_shuffle_epi8(p00, shuffle);\r\n            //0 4 8 12 16 20 24 28    1 5 9 13 17 21 25 29    \r\n            p10 = _mm256_permutevar8x32_epi32(p10, index);\r\n\r\n            _mm256_maskstore_epi64(((__int64 *)(pfirst[0] + i - 24)), mask3, p10);\r\n            _mm256_maskstore_epi64(((__int64 *)(pfirst[1] + i - 16)), mask2, p10);\r\n            _mm256_maskstore_epi64(((__int64 *)(pfirst[2] + i - 8 )), mask1, p10);\r\n            _mm256_maskstore_epi64(((__int64 *)(pfirst[3] + i     )), mask0, p10);\r\n        }\r\n\r\n        if (i < left_size) { // leftover columns: handled with the 128-bit SSE path\r\n            __m128i shuffle1 = _mm_setr_epi8(0, 4, 1, 5, 2, 6, 3, 7, 0, 4, 1, 5, 2, 6, 3, 7);\r\n            __m128i coeff_2 = _mm_set1_epi16(2);\r\n            __m128i zero = _mm_setzero_si128();\r\n\r\n            __m128i S_0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S_2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n            __m128i S_1 = _mm_loadu_si128((__m128i*)(src));\r\n\r\n            __m128i L_0 = _mm_unpacklo_epi8(S_0, zero);//0 1 2 3 4 5 6 7\r\n            __m128i 
L_1 = _mm_unpacklo_epi8(S_1, zero);\r\n            __m128i L_2 = _mm_unpacklo_epi8(S_2, zero);\r\n\r\n            __m128i p_00 = _mm_add_epi16(L_0, L_1);\r\n            __m128i p_01 = _mm_add_epi16(L_1, L_2);\r\n\r\n            p_00 = _mm_add_epi16(p_00, coeff_2);\r\n            p_00 = _mm_add_epi16(p_00, p_01);\r\n\r\n            p_00 = _mm_srli_epi16(p_00, 2);\r\n\r\n            p_00 = _mm_packus_epi16(p_00, p_00);//0 1 2 3 4 5 6 7\r\n\r\n            p_00 = _mm_shuffle_epi8(p_00, shuffle1);//0 4 1 5 2 6 3 7\r\n\r\n            ((int*)&pfirst[0][i])[0] = _mm_extract_epi16(p_00, 3);\r\n            ((int*)&pfirst[1][i])[0] = _mm_extract_epi16(p_00, 2);\r\n            ((int*)&pfirst[2][i])[0] = _mm_extract_epi16(p_00, 1);\r\n            ((int*)&pfirst[3][i])[0] = _mm_extract_epi16(p_00, 0);\r\n        }\r\n\r\n        src = pSrc1;\r\n\r\n        for (i = left_size; i < line_size - 16; i += 32, src += 32) {\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n            S1 = _mm256_loadu_si256((__m256i*)(src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n            S3 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n            H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));\r\n            H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n            H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n            H3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 1));\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff7);\r\n            p20 = _mm256_mullo_epi16(L2, coeff5);\r\n            p30 = _mm256_add_epi16(L3, coeff8);\r\n            
p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p01 = _mm256_mullo_epi16(H0, coeff3);\r\n            p11 = _mm256_mullo_epi16(H1, coeff7);\r\n            p21 = _mm256_mullo_epi16(H2, coeff5);\r\n            p31 = _mm256_add_epi16(H3, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[2][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_mullo_epi16(p00, coeff3);\r\n            p10 = _mm256_add_epi16(L0, L3);\r\n            p10 = _mm256_add_epi16(p10, coeff4);\r\n            p00 = _mm256_add_epi16(p10, p00);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p01 = _mm256_add_epi16(H1, H2);\r\n            p01 = _mm256_mullo_epi16(p01, coeff3);\r\n            p11 = _mm256_add_epi16(H0, H3);\r\n            p11 = _mm256_add_epi16(p11, coeff4);\r\n            p01 = _mm256_add_epi16(p11, p01);\r\n            p01 = _mm256_srli_epi16(p01, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[1][i], p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff5);\r\n            p20 = _mm256_mullo_epi16(L2, coeff7);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(L0, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p11 = 
_mm256_mullo_epi16(H1, coeff5);\r\n            p21 = _mm256_mullo_epi16(H2, coeff7);\r\n            p31 = _mm256_mullo_epi16(H3, coeff3);\r\n            p01 = _mm256_add_epi16(H0, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_srli_epi16(p01, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[0][i], p00);\r\n\r\n            p00 = _mm256_add_epi16(L0, L1);\r\n            p10 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm256_add_epi16(H0, H1);\r\n            p11 = _mm256_add_epi16(H1, H2);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, coeff2);\r\n            p01 = _mm256_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&pfirst[3][i], p00);\r\n        }\r\n\r\n        if (i  < line_size) {\r\n            __m256i mask = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n            S1 = _mm256_loadu_si256((__m256i*)(src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n            S3 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            
p10 = _mm256_mullo_epi16(L1, coeff7);\r\n            p20 = _mm256_mullo_epi16(L2, coeff5);\r\n            p30 = _mm256_add_epi16(L3, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[2][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_mullo_epi16(p00, coeff3);\r\n            p10 = _mm256_add_epi16(L0, L3);\r\n            p10 = _mm256_add_epi16(p10, coeff4);\r\n            p00 = _mm256_add_epi16(p10, p00);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask, p00);\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff5);\r\n            p20 = _mm256_mullo_epi16(L2, coeff7);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(L0, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L0, L1);\r\n            p10 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n           
 _mm256_maskstore_epi64((__int64 *)&pfirst[3][i], mask, p00);\r\n        }\r\n\r\n        pfirst[0] += left_size;\r\n        pfirst[1] += left_size;\r\n        pfirst[2] += left_size;\r\n        pfirst[3] += left_size;\r\n\r\n        bsy >>= 2;\r\n\r\n\r\n        if (bsx == 64){\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx == 32) {\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M 
= _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n            }\r\n\r\n        } else if (bsx == 16) {\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n            }\r\n        } else if (bsx == 8) {\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                
_mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[3]);\r\n            for (i = 0; i < bsy; i++){\r\n                __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[2] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst[3] - i));\r\n                _mm256_maskstore_epi32((int*)dst, mask, M);\r\n                dst += i_dst;\r\n            }\r\n        }\r\n\r\n\r\n        /*for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst[0] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[1] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[2] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            memcpy(dst, pfirst[3] - i, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n        }*/\r\n    } else {\r\n        if (bsx == 16) {\r\n            __m256i mask = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            pel_t *dst2 = dst + i_dst;\r\n            pel_t *dst3 = dst2 + i_dst;\r\n            pel_t *dst4 = dst3 + i_dst;\r\n\r\n            __m256i p00, p10, p20, p30;\r\n            __m256i L0, L1, L2, L3;\r\n            __m256i S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n            __m256i S3 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n            __m256i S1 = _mm256_loadu_si256((__m256i*)(src));\r\n            __m256i S2 = 
_mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff7);\r\n            p20 = _mm256_mullo_epi16(L2, coeff5);\r\n            p30 = _mm256_add_epi16(L3, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)dst3, mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_mullo_epi16(p00, coeff3);\r\n            p10 = _mm256_add_epi16(L0, L3);\r\n            p10 = _mm256_add_epi16(p10, coeff4);\r\n            p00 = _mm256_add_epi16(p10, p00);\r\n            p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)dst2, mask, p00);\r\n\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff5);\r\n            p20 = _mm256_mullo_epi16(L2, coeff7);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(L0, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 
*)dst, mask, p00);\r\n\r\n            p00 = _mm256_add_epi16(L0, L1);\r\n            p10 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p00);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n            _mm256_maskstore_epi64((__int64 *)dst4, mask, p00);\r\n        } else {//4x4\r\n            pel_t *dst2 = dst + i_dst;\r\n            pel_t *dst3 = dst2 + i_dst;\r\n            pel_t *dst4 = dst3 + i_dst;\r\n            __m128i p00, p10, p20, p30;\r\n            __m128i coeff_2 = _mm_set1_epi16(2);\r\n            __m128i coeff_3 = _mm_set1_epi16(3);\r\n            __m128i coeff_4 = _mm_set1_epi16(4);\r\n            __m128i coeff_5 = _mm_set1_epi16(5);\r\n            __m128i coeff_7 = _mm_set1_epi16(7);\r\n            __m128i coeff_8 = _mm_set1_epi16(8);\r\n            __m128i zero = _mm_setzero_si128();\r\n\r\n            __m128i S0 = _mm_loadu_si128((__m128i*)(src - 1));\r\n            __m128i S3 = _mm_loadu_si128((__m128i*)(src + 2));\r\n            __m128i S1 = _mm_loadu_si128((__m128i*)(src));\r\n            __m128i S2 = _mm_loadu_si128((__m128i*)(src + 1));\r\n\r\n            __m128i L0 = _mm_unpacklo_epi8(S0, zero);\r\n            __m128i L1 = _mm_unpacklo_epi8(S1, zero);\r\n            __m128i L2 = _mm_unpacklo_epi8(S2, zero);\r\n            __m128i L3 = _mm_unpacklo_epi8(S3, zero);\r\n\r\n            p00 = _mm_mullo_epi16(L0, coeff_3);\r\n            p10 = _mm_mullo_epi16(L1, coeff_7);\r\n            p20 = _mm_mullo_epi16(L2, coeff_5);\r\n            p30 = _mm_add_epi16(L3, coeff_8);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)dst3)[0] = 
_mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_mullo_epi16(p00, coeff_3);\r\n            p10 = _mm_add_epi16(L0, L3);\r\n            p10 = _mm_add_epi16(p10, coeff_4);\r\n            p00 = _mm_add_epi16(p10, p00);\r\n            p00 = _mm_srli_epi16(p00, 3);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)dst2)[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p10 = _mm_mullo_epi16(L1, coeff_5);\r\n            p20 = _mm_mullo_epi16(L2, coeff_7);\r\n            p30 = _mm_mullo_epi16(L3, coeff_3);\r\n            p00 = _mm_add_epi16(L0, coeff_8);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, p20);\r\n            p00 = _mm_add_epi16(p00, p30);\r\n            p00 = _mm_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)dst)[0] = _mm_cvtsi128_si32(p00);\r\n\r\n            p00 = _mm_add_epi16(L0, L1);\r\n            p10 = _mm_add_epi16(L1, L2);\r\n            p00 = _mm_add_epi16(p00, p10);\r\n            p00 = _mm_add_epi16(p00, coeff_2);\r\n            p00 = _mm_srli_epi16(p00, 2);\r\n\r\n            p00 = _mm_packus_epi16(p00, p00);\r\n            ((int*)dst4)[0] = _mm_cvtsi128_si32(p00);\r\n        }\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_xy_16_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{  \r\n    ALIGN32(pel_t first_line[2 * (64 + 48)]);\r\n    int line_size = bsx + bsy / 2 - 1;\r\n    int left_size = line_size - bsx;\r\n    int aligned_line_size = ((line_size + 31) >> 4) << 4;\r\n    pel_t *pfirst[2];\r\n    UNUSED_PARAMETER(dir_mode);\r\n    __m256i coeff2   = _mm256_set1_epi16(2);\r\n    __m256i coeff3   = _mm256_set1_epi16(3);\r\n    __m256i coeff4   = _mm256_set1_epi16(4);\r\n    __m256i shuffle = _mm256_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15,\r\n        0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15);\r\n\r\n    int i;\r\n    pel_t 
*pSrc1;\r\n\r\n    pfirst[0] = first_line;\r\n    pfirst[1] = first_line + aligned_line_size;\r\n\r\n    src -= bsy - 2;\r\n    pSrc1 = src;\r\n\r\n    __m256i p00, p01, p10, p11;\r\n    __m256i S0, S1, S2, S3;\r\n    __m256i L0, L1, L2, L3;\r\n    __m256i H0, H1, H2, H3;\r\n\r\n    __m256i mask1 = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n    for (i = 0; i < left_size - 8; i += 16, src += 32) {\r\n\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));//\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));//\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));//\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n\r\n        p00 = _mm256_add_epi16(L0, L1);\r\n        p01 = _mm256_add_epi16(L1, L2);\r\n        p10 = _mm256_add_epi16(H0, H1);\r\n        p11 = _mm256_add_epi16(H1, H2);\r\n\r\n        p00 = _mm256_add_epi16(p00, coeff2);\r\n        p10 = _mm256_add_epi16(p10, coeff2);\r\n\r\n        p00 = _mm256_add_epi16(p00, p01);\r\n        p10 = _mm256_add_epi16(p10, p11);\r\n\r\n        p00 = _mm256_srli_epi16(p00, 2);//0 1 2 3 4 5 6 7....15\r\n        p10 = _mm256_srli_epi16(p10, 2);//16 17 18....31\r\n\r\n        //0...7 16...23 8...15 24...31\r\n        p00 = _mm256_packus_epi16(p00, p10);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x00D8);//31...16 15..0\r\n\r\n        //0 1 2 3\r\n        p00 = _mm256_shuffle_epi8(p00, shuffle);\r\n\r\n        p10 = _mm256_permute4x64_epi64(p00, 0x08);//0 2\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x0D);//1 3\r\n\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask1, p00);\r\n        
_mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask1, p10);\r\n\r\n    }\r\n\r\n    __m256i mask2 = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n    if (i < left_size) {\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n        p00 = _mm256_add_epi16(L0, L1);\r\n        p01 = _mm256_add_epi16(L1, L2);\r\n        p00 = _mm256_add_epi16(p00, coeff2);\r\n        p00 = _mm256_add_epi16(p00, p01);\r\n        p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n        //0...7 0...7 8...15 8...15\r\n        p00 = _mm256_packus_epi16(p00, p00);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x0008);//0...15 0...15\r\n\r\n        p01 = _mm256_shuffle_epi8(p00, shuffle);//0 2 4 6 8 10 12 14    1 3 5 7 9 11 13 15\r\n        \r\n        p10 = _mm256_permute4x64_epi64(p01, 0x01);\r\n        \r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask2, p10);\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask2, p01);\r\n    }\r\n\r\n    src = pSrc1 + left_size + left_size;\r\n\r\n    for (i = left_size; i < line_size - 16; i += 32, src += 32) {\r\n\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S3 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n        L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 
1));\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n        H3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 1));\r\n\r\n        p00 = _mm256_add_epi16(L1, L2);\r\n        p01 = _mm256_add_epi16(L0, L3);\r\n        p00 = _mm256_mullo_epi16(p00, coeff3);\r\n        p00 = _mm256_add_epi16(p00, coeff4);\r\n        p00 = _mm256_add_epi16(p00, p01);\r\n        p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n        p10 = _mm256_add_epi16(H1, H2);\r\n        p11 = _mm256_add_epi16(H0, H3);\r\n        p10 = _mm256_mullo_epi16(p10, coeff3);\r\n        p10 = _mm256_add_epi16(p10, coeff4);\r\n        p10 = _mm256_add_epi16(p10, p11);\r\n        p10 = _mm256_srli_epi16(p10, 3);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p10);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n        _mm256_storeu_si256((__m256i*)&pfirst[0][i], p00);\r\n\r\n        p00 = _mm256_add_epi16(L0, L1);\r\n        p01 = _mm256_add_epi16(L1, L2);\r\n        p10 = _mm256_add_epi16(H0, H1);\r\n        p11 = _mm256_add_epi16(H1, H2);\r\n\r\n        p00 = _mm256_add_epi16(p00, coeff2);\r\n        p10 = _mm256_add_epi16(p10, coeff2);\r\n\r\n        p00 = _mm256_add_epi16(p00, p01);\r\n        p10 = _mm256_add_epi16(p10, p11);\r\n\r\n        p00 = _mm256_srli_epi16(p00, 2);\r\n        p10 = _mm256_srli_epi16(p10, 2);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p10);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n        _mm256_storeu_si256((__m256i*)&pfirst[1][i], p00);\r\n        \r\n    }\r\n\r\n    if (i < line_size) {\r\n\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S3 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 
0));\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n        L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n        p00 = _mm256_add_epi16(L1, L2);\r\n        p01 = _mm256_add_epi16(L0, L3);\r\n        p00 = _mm256_mullo_epi16(p00, coeff3);\r\n        p00 = _mm256_add_epi16(p00, coeff4);\r\n        p00 = _mm256_add_epi16(p00, p01);\r\n        p00 = _mm256_srli_epi16(p00, 3);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p00);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[0][i], mask1, p00);\r\n\r\n        p00 = _mm256_add_epi16(L0, L1);\r\n        p01 = _mm256_add_epi16(L1, L2);\r\n        p00 = _mm256_add_epi16(p00, coeff2);\r\n        p00 = _mm256_add_epi16(p00, p01);\r\n        p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n        p00 = _mm256_packus_epi16(p00, p00);\r\n        p00 = _mm256_permute4x64_epi64(p00, 0x0008);\r\n        _mm256_maskstore_epi64((__int64 *)&pfirst[1][i], mask1, p00);\r\n\r\n    }\r\n\r\n    pfirst[0] += left_size;\r\n    pfirst[1] += left_size;\r\n\r\n    bsy >>= 1;\r\n\r\n    if (bsx == 64){\r\n\r\n        for (i = 0; i < bsy; i += 4) {\r\n\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 1 + 32));\r\n            
_mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 1 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 2 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 2 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 3 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 3 + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n\r\n        }\r\n    } else if (bsx == 32){\r\n        for (i = 0; i < bsy; i += 4){\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = 
_mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 1));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 2));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 3));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 16){\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 2));\r\n    
        _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    } else if (bsx == 8){\r\n        __m256i mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 1));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 2));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 3));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    } else{\r\n        __m256i mask = _mm256_loadu_si256((const 
__m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 1));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 1));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 2));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 2));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[0] - i - 3));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst[1] - i - 3));\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n    /*switch (bsx) {\r\n        case 4:\r\n            for (i = 0; i < bsy; i++) {\r\n                CP32(dst, pfirst[0] - i);\r\n                CP32(dst + i_dst, pfirst[1] - i);\r\n                dst += (i_dst << 1);\r\n            }\r\n            break;\r\n        case 8:\r\n            for (i = 0; i < bsy; i++) {\r\n                CP64(dst, pfirst[0] - i);\r\n                CP64(dst + i_dst, pfirst[1] - i);\r\n                dst += (i_dst << 1);\r\n            }\r\n            break;\r\n        default:\r\n            for (i = 0; i < bsy; i++) {\r\n                memcpy(dst, pfirst[0] - i, 
bsx * sizeof(pel_t));\r\n                memcpy(dst + i_dst, pfirst[1] - i, bsx * sizeof(pel_t));\r\n                dst += (i_dst << 1);\r\n            }\r\n            break;\r\n    }*/\r\n\r\n}\r\n\r\nvoid intra_pred_ang_xy_18_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{          \r\n    ALIGN32(pel_t first_line[64 + 64]);\r\n    int line_size = bsx + bsy - 1;\r\n    int i;\r\n    pel_t *pfirst = first_line + bsy - 1;\r\n    UNUSED_PARAMETER(dir_mode);\r\n    __m256i coeff2 = _mm256_set1_epi16(2);\r\n\r\n    src -= bsy - 1;\r\n\r\n    __m256i S0, S1, S2;\r\n    __m256i L0, L1, L2;\r\n    __m256i H0, H1, H2;\r\n    __m256i sum1, sum2, sum3, sum4;\r\n\r\n    for (i = 0; i < line_size - 16; i += 32, src += 32) {\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n\r\n        sum1 = _mm256_add_epi16(L0, L1);\r\n        sum2 = _mm256_add_epi16(L1, L2);\r\n        sum3 = _mm256_add_epi16(H0, H1);\r\n        sum4 = _mm256_add_epi16(H1, H2);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, sum2);\r\n        sum3 = _mm256_add_epi16(sum3, sum4);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, coeff2);\r\n        sum3 = _mm256_add_epi16(sum3, coeff2);\r\n\r\n        sum1 = _mm256_srli_epi16(sum1, 2);\r\n        sum3 = _mm256_srli_epi16(sum3, 2);\r\n\r\n        sum1 = _mm256_packus_epi16(sum1, sum3);\r\n        sum1 = _mm256_permute4x64_epi64(sum1, 0x00D8);\r\n        
_mm256_storeu_si256((__m256i*)&first_line[i], sum1);\r\n    }\r\n\r\n    if (i < line_size) {\r\n        __m256i mask = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n        sum1 = _mm256_add_epi16(L0, L1);\r\n        sum2 = _mm256_add_epi16(L1, L2);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, sum2);\r\n        sum1 = _mm256_add_epi16(sum1, coeff2);\r\n        sum1 = _mm256_srli_epi16(sum1, 2);\r\n\r\n        sum1 = _mm256_packus_epi16(sum1, sum1);\r\n        sum1 = _mm256_permute4x64_epi64(sum1, 0x00D8);\r\n\r\n        _mm256_maskstore_epi64((__int64 *)&first_line[i], mask, sum1);\r\n    }\r\n\r\n    __m256i M;\r\n    if (bsx == 64) {\r\n        for (i = 0; i < bsy; i += 4){\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n        \r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = 
_mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n        }\r\n    } else if (bsx == 32) {\r\n        for (i = 0; i < bsy; i += 4){\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n        }\r\n    } else if (bsx == 16){\r\n        __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n        for (i = 0; i < bsy; i += 4){\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n        }\r\n    } else if (bsx == 8) {\r\n        __m256i mask = 
_mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n        for (i = 0; i < bsy; i += 4){\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n        }\r\n    } else {\r\n        __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[3]);\r\n        for (i = 0; i < bsy; i += 4){\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst--;\r\n        }\r\n    }\r\n\r\n\r\n    /*switch (bsx) {\r\n        case 4:\r\n            for (i = 0; i < bsy; i++) {\r\n                CP32(dst, pfirst--);\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n        case 8:\r\n            for (i = 0; i < bsy; i++) {\r\n                CP64(dst, pfirst--);\r\n          
      dst += i_dst;\r\n            }\r\n            break;\r\n        default:\r\n            for (i = 0; i < bsy; i++) {\r\n                memcpy(dst, pfirst--, bsx * sizeof(pel_t));\r\n                dst += i_dst;\r\n            }\r\n            break;\r\n    }*/\r\n\r\n}\r\n\r\nvoid intra_pred_ang_xy_20_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{  \r\n    ALIGN32(pel_t first_line[64 + 128]);\r\n    int left_size = (bsy - 1) * 2 + 1;\r\n    int top_size = bsx - 1;\r\n    int line_size = left_size + top_size;\r\n    int i;\r\n    pel_t *pfirst = first_line + left_size - 1;\r\n\r\n    __m256i coeff2 = _mm256_set1_epi16(2);\r\n    __m256i coeff3 = _mm256_set1_epi16(3);\r\n    __m256i coeff4 = _mm256_set1_epi16(4);\r\n    __m256i shuffle = _mm256_setr_epi8(0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15,\r\n        0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15);\r\n    pel_t *pSrc1 = src;\r\n\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    src -= bsy;\r\n\r\n    __m256i p00, p01, p10, p11;\r\n    __m256i p20, p21, p30, p31;\r\n\r\n    __m256i S0, S1, S2, S3;\r\n    __m256i L0, L1, L2, L3;\r\n    __m256i H0, H1, H2, H3;\r\n\r\n    for (i = 0; i < left_size - 32; i += 64, src += 32) {\r\n\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));//0...7 8...15\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        S3 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));//0...7\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n        L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n        H2 = 
_mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n        H3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 1));\r\n\r\n        p00 = _mm256_add_epi16(L1, L2);\r\n        p01 = _mm256_add_epi16(L0, L3);\r\n        p00 = _mm256_mullo_epi16(p00, coeff3);\r\n        p00 = _mm256_add_epi16(p00, coeff4);\r\n        p00 = _mm256_add_epi16(p00, p01);\r\n        p00 = _mm256_srli_epi16(p00, 3);//0...15\r\n\r\n        p10 = _mm256_add_epi16(H1, H2);\r\n        p11 = _mm256_add_epi16(H0, H3);\r\n        p10 = _mm256_mullo_epi16(p10, coeff3);\r\n        p10 = _mm256_add_epi16(p10, coeff4);\r\n        p10 = _mm256_add_epi16(p10, p11);\r\n        p10 = _mm256_srli_epi16(p10, 3);//16..31\r\n\r\n        p20 = _mm256_add_epi16(L1, L2);\r\n        p21 = _mm256_add_epi16(L2, L3);\r\n        p20 = _mm256_add_epi16(p20, coeff2);\r\n        p20 = _mm256_add_epi16(p20, p21);\r\n        p20 = _mm256_srli_epi16(p20, 2);//0...15\r\n\r\n        p30 = _mm256_add_epi16(H1, H2);\r\n        p31 = _mm256_add_epi16(H2, H3);\r\n        p30 = _mm256_add_epi16(p30, coeff2);\r\n        p30 = _mm256_add_epi16(p30, p31);\r\n        p30 = _mm256_srli_epi16(p30, 2);//16...31\r\n\r\n        //00...07 10...17 08...015 18...115\r\n        p00 = _mm256_packus_epi16(p00, p20);\r\n        p10 = _mm256_packus_epi16(p10, p30);\r\n\r\n        p00 = _mm256_shuffle_epi8(p00, shuffle);\r\n        p10 = _mm256_shuffle_epi8(p10, shuffle);\r\n\r\n        _mm256_storeu_si256((__m256i*)&first_line[i], p00);\r\n        _mm256_storeu_si256((__m256i*)&first_line[i + 32], p10);\r\n    }\r\n\r\n    if (i < left_size) {\r\n\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));//0...7 8...15\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        S3 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));//0...7\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n    
    L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n        L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n        p00 = _mm256_add_epi16(L1, L2);\r\n        p00 = _mm256_mullo_epi16(p00, coeff3);\r\n        p01 = _mm256_add_epi16(L0, L3);\r\n        p00 = _mm256_add_epi16(p00, coeff4);\r\n        p00 = _mm256_add_epi16(p00, p01);\r\n        p00 = _mm256_srli_epi16(p00, 3);//0...15\r\n\r\n        p20 = _mm256_add_epi16(L1, L2);\r\n        p21 = _mm256_add_epi16(L2, L3);\r\n        p20 = _mm256_add_epi16(p20, coeff2);\r\n        p20 = _mm256_add_epi16(p20, p21);\r\n        p20 = _mm256_srli_epi16(p20, 2);//0...15\r\n\r\n        p00 = _mm256_packus_epi16(p00, p20);\r\n        p00 = _mm256_shuffle_epi8(p00, shuffle);\r\n        _mm256_storeu_si256((__m256i*)&first_line[i], p00);\r\n    }\r\n\r\n    src = pSrc1;\r\n    \r\n    __m256i sum1, sum2, sum3, sum4;\r\n    for (i = left_size; i < line_size - 16; i += 32, src += 32) {\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n        H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));\r\n        H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n        H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n\r\n        sum1 = _mm256_add_epi16(L0, L1);\r\n        sum2 = _mm256_add_epi16(L1, L2);\r\n        sum3 = _mm256_add_epi16(H0, H1);\r\n        sum4 = _mm256_add_epi16(H1, H2);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, sum2);\r\n        sum3 = _mm256_add_epi16(sum3, sum4);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, coeff2);\r\n        sum3 = _mm256_add_epi16(sum3, coeff2);\r\n\r\n        sum1 = 
_mm256_srli_epi16(sum1, 2);\r\n        sum3 = _mm256_srli_epi16(sum3, 2);\r\n\r\n        sum1 = _mm256_packus_epi16(sum1, sum3);\r\n        sum1 = _mm256_permute4x64_epi64(sum1, 0x00D8);\r\n        _mm256_storeu_si256((__m256i*)&first_line[i], sum1);\r\n    }\r\n    __m256i mask = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n    if (i < line_size) {\r\n        S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n        S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n        S1 = _mm256_loadu_si256((__m256i*)(src));\r\n\r\n        L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n        L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n        L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n        sum1 = _mm256_add_epi16(L0, L1);\r\n        sum2 = _mm256_add_epi16(L1, L2);\r\n\r\n        sum1 = _mm256_add_epi16(sum1, sum2);\r\n        sum1 = _mm256_add_epi16(sum1, coeff2);\r\n        sum1 = _mm256_srli_epi16(sum1, 2);\r\n\r\n        sum1 = _mm256_packus_epi16(sum1, sum1);\r\n        sum1 = _mm256_permute4x64_epi64(sum1, 0x00D8);\r\n        _mm256_maskstore_epi64((__int64 *)&first_line[i], mask, sum1);\r\n    }\r\n\r\n    if (bsx == 64){\r\n\r\n        for (i = 0; i < bsy; i += 8) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = 
_mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n            _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n        }\r\n    } else if (bsx == 32){\r\n        for (i = 0; i < bsy; i += 8) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            
_mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_storeu_si256((__m256i*)dst, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n        }\r\n    } else if (bsx == 16){\r\n\r\n        mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)(pfirst));\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n    
        pfirst -= 2;\r\n            \r\n        }\r\n    } else if (bsx == 8){\r\n\r\n        mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 8) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n        }\r\n    } else{\r\n\r\n        mask = _mm256_loadu_si256((const __m256i*)intrinsic_mask_256_8bit[bsx - 1]);\r\n        for (i = 0; i < bsy; i += 4) {\r\n            __m256i M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n         
   M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n            M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n            _mm256_maskstore_epi32((int*)dst, mask, M);\r\n            dst += i_dst;\r\n            pfirst -= 2;\r\n        }\r\n    }\r\n\r\n\r\n\r\n    /*for (i = 0; i < bsy; i++) {\r\n        memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n        pfirst -= 2;\r\n        dst += i_dst;\r\n    }*/\r\n\r\n}\r\n\r\nvoid intra_pred_ang_xy_22_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx != 4) {\r\n        src -= bsy;\r\n        ALIGN32(pel_t first_line[64 + 256]);\r\n        int left_size = (bsy - 1) * 4 + 3;\r\n        int top_size = bsx - 3;\r\n        int line_size = left_size + top_size;\r\n        pel_t *pfirst = first_line + left_size - 3;\r\n        pel_t *pSrc1 = src;\r\n\r\n        __m256i coeff2 = _mm256_set1_epi16(2);\r\n        __m256i coeff3 = _mm256_set1_epi16(3);\r\n        __m256i coeff4 = _mm256_set1_epi16(4);\r\n        __m256i coeff5 = _mm256_set1_epi16(5);\r\n        __m256i coeff7 = _mm256_set1_epi16(7);\r\n        __m256i coeff8 = _mm256_set1_epi16(8);\r\n        __m256i shuffle = _mm256_setr_epi8(0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15,\r\n            0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15);\r\n\r\n        __m256i p00, p10, p20, p30;\r\n        __m256i p01, p11, p21, p31;\r\n        __m256i M1, M2, M3, M4, M5, M6, M7, M8;\r\n        __m256i S0, S1, S2, S3;\r\n        __m256i L0, L1, L2, L3;\r\n        __m256i H0, H1, H2, H3;\r\n\r\n        for (i = 0; i < line_size - 64; i += 128, src += 32) {\r\n\r\n            S0 = 
_mm256_loadu_si256((__m256i*)(src - 1));\r\n            S3 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n            S1 = _mm256_loadu_si256((__m256i*)(src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n            H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));\r\n            H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n            H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n            H3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 1));\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff7);\r\n            p20 = _mm256_mullo_epi16(L2, coeff5);\r\n            p30 = _mm256_add_epi16(L3, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            M1  = _mm256_srli_epi16(p00, 4);//0...15\r\n\r\n            p01 = _mm256_mullo_epi16(H0, coeff3);\r\n            p11 = _mm256_mullo_epi16(H1, coeff7);\r\n            p21 = _mm256_mullo_epi16(H2, coeff5);\r\n            p31 = _mm256_add_epi16(H3, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            M2  = _mm256_srli_epi16(p01, 4);//16...31\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_mullo_epi16(p00, coeff3);\r\n            p10 = _mm256_add_epi16(L0, L3);\r\n            p10 = _mm256_add_epi16(p10, coeff4);\r\n            p00 = _mm256_add_epi16(p10, p00);\r\n            M3  = _mm256_srli_epi16(p00, 3);\r\n\r\n            p01 = 
_mm256_add_epi16(H1, H2);\r\n            p01 = _mm256_mullo_epi16(p01, coeff3);\r\n            p11 = _mm256_add_epi16(H0, H3);\r\n            p11 = _mm256_add_epi16(p11, coeff4);\r\n            p01 = _mm256_add_epi16(p11, p01);\r\n            M4  = _mm256_srli_epi16(p01, 3);\r\n\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff5);\r\n            p20 = _mm256_mullo_epi16(L2, coeff7);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(L0, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            M5  = _mm256_srli_epi16(p00, 4);\r\n\r\n            p11 = _mm256_mullo_epi16(H1, coeff5);\r\n            p21 = _mm256_mullo_epi16(H2, coeff7);\r\n            p31 = _mm256_mullo_epi16(H3, coeff3);\r\n            p01 = _mm256_add_epi16(H0, coeff8);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, p21);\r\n            p01 = _mm256_add_epi16(p01, p31);\r\n            M6  = _mm256_srli_epi16(p01, 4);\r\n\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p10 = _mm256_add_epi16(L2, L3);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            M7  = _mm256_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm256_add_epi16(H1, H2);\r\n            p11 = _mm256_add_epi16(H2, H3);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_add_epi16(p01, coeff2);\r\n            M8  = _mm256_srli_epi16(p01, 2);\r\n\r\n            M1 = _mm256_packus_epi16(M1, M3);//00...08 10...18\r\n            M5 = _mm256_packus_epi16(M5, M7);\r\n            M1 = _mm256_shuffle_epi8(M1, shuffle);//00 10 01 11 02 12...\r\n            M5 = _mm256_shuffle_epi8(M5, shuffle);\r\n\r\n            M2 = _mm256_packus_epi16(M2, M4);\r\n            M6 = _mm256_packus_epi16(M6, M8);\r\n            M2 = _mm256_shuffle_epi8(M2, 
shuffle);\r\n            M6 = _mm256_shuffle_epi8(M6, shuffle);\r\n\r\n            M1 = _mm256_permute4x64_epi64(M1, 0x00D8);\r\n            M5 = _mm256_permute4x64_epi64(M5, 0x00D8);\r\n            M2 = _mm256_permute4x64_epi64(M2, 0x00D8);\r\n            M6 = _mm256_permute4x64_epi64(M6, 0x00D8);\r\n\r\n            M3 = _mm256_unpacklo_epi16(M1, M5);\r\n            M7 = _mm256_unpackhi_epi16(M1, M5);\r\n            M4 = _mm256_unpacklo_epi16(M2, M6);\r\n            M8 = _mm256_unpackhi_epi16(M2, M6);\r\n\r\n            _mm256_storeu_si256((__m256i*)&first_line[i], M3);\r\n            _mm256_storeu_si256((__m256i*)&first_line[32 + i], M7);\r\n            _mm256_storeu_si256((__m256i*)&first_line[64 + i], M4);\r\n            _mm256_storeu_si256((__m256i*)&first_line[96 + i], M8);\r\n        }\r\n\r\n        if (i < left_size) {\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n            S3 = _mm256_loadu_si256((__m256i*)(src + 2));\r\n            S1 = _mm256_loadu_si256((__m256i*)(src));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n            L3 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S3, 0));\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff3);\r\n            p10 = _mm256_mullo_epi16(L1, coeff7);\r\n            p20 = _mm256_mullo_epi16(L2, coeff5);\r\n            p30 = _mm256_add_epi16(L3, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            M1  = _mm256_srli_epi16(p00, 4);\r\n\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_mullo_epi16(p00, coeff3);\r\n            p10 = _mm256_add_epi16(L0, L3);\r\n            p10 = _mm256_add_epi16(p10, 
coeff4);\r\n            p00 = _mm256_add_epi16(p10, p00);\r\n            M3  = _mm256_srli_epi16(p00, 3);\r\n\r\n            p10 = _mm256_mullo_epi16(L1, coeff5);\r\n            p20 = _mm256_mullo_epi16(L2, coeff7);\r\n            p30 = _mm256_mullo_epi16(L3, coeff3);\r\n            p00 = _mm256_add_epi16(L0, coeff8);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, p20);\r\n            p00 = _mm256_add_epi16(p00, p30);\r\n            M5  = _mm256_srli_epi16(p00, 4);\r\n\r\n            p10 = _mm256_add_epi16(L2, L3);\r\n            p00 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            M7  = _mm256_srli_epi16(p00, 2);\r\n\r\n            M1 = _mm256_packus_epi16(M1, M3);\r\n            M5 = _mm256_packus_epi16(M5, M7);\r\n            M1 = _mm256_shuffle_epi8(M1, shuffle);\r\n            M5 = _mm256_shuffle_epi8(M5, shuffle);\r\n\r\n            M1 = _mm256_permute4x64_epi64(M1, 0x00D8);\r\n            M5 = _mm256_permute4x64_epi64(M5, 0x00D8);\r\n\r\n            M3 = _mm256_unpacklo_epi16(M1, M5);\r\n            M7 = _mm256_unpackhi_epi16(M1, M5);\r\n\r\n            _mm256_store_si256((__m256i*)&first_line[i], M3);\r\n            _mm256_store_si256((__m256i*)&first_line[32 + i], M7);\r\n        }\r\n\r\n        src = pSrc1 + bsy;\r\n\r\n        __m256i sum1, sum2, sum3, sum4;\r\n        for (i = left_size; i < line_size - 16; i += 32, src += 32) {\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n            S1 = _mm256_loadu_si256((__m256i*)(src));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n            H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));\r\n  
          H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n            H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n\r\n            sum1 = _mm256_add_epi16(L0, L1);\r\n            sum2 = _mm256_add_epi16(L1, L2);\r\n            sum3 = _mm256_add_epi16(H0, H1);\r\n            sum4 = _mm256_add_epi16(H1, H2);\r\n\r\n            sum1 = _mm256_add_epi16(sum1, sum2);\r\n            sum3 = _mm256_add_epi16(sum3, sum4);\r\n\r\n            sum1 = _mm256_add_epi16(sum1, coeff2);\r\n            sum3 = _mm256_add_epi16(sum3, coeff2);\r\n\r\n            sum1 = _mm256_srli_epi16(sum1, 2);\r\n            sum3 = _mm256_srli_epi16(sum3, 2);\r\n\r\n            sum1 = _mm256_packus_epi16(sum1, sum3);\r\n            sum1 = _mm256_permute4x64_epi64(sum1, 0x00D8);\r\n\r\n            _mm256_storeu_si256((__m256i*)&first_line[i], sum1);\r\n        }\r\n\r\n        if (i < line_size) {\r\n\r\n            __m256i mask = _mm256_load_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            S0 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n            S1 = _mm256_loadu_si256((__m256i*)(src));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n            sum1 = _mm256_add_epi16(L0, L1);\r\n            sum2 = _mm256_add_epi16(L1, L2);\r\n\r\n            sum1 = _mm256_add_epi16(sum1, sum2);\r\n            sum1 = _mm256_add_epi16(sum1, coeff2);\r\n            sum1 = _mm256_srli_epi16(sum1, 2);\r\n\r\n            sum1 = _mm256_packus_epi16(sum1, sum1);\r\n            sum1 = _mm256_permute4x64_epi64(sum1, 0x00D8);\r\n\r\n            _mm256_maskstore_epi64((__int64 *)&first_line[i], mask, sum1);\r\n\r\n\r\n        }\r\n\r\n        __m256i M;\r\n        if (bsx == 64) {\r\n            for (i = 0; i < bsy; i += 4){\r\n       
         M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n            }\r\n        } else if (bsx == 32) {\r\n            for (i = 0; i < bsy; i += 4){\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n     
           _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n            }\r\n        } else if (bsx == 16){\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            for (i = 0; i < bsy; i += 4){\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n            }\r\n        } else {\r\n            __m256i mask = _mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[7]);\r\n            for (i = 0; i < bsy; i += 4){\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 4;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                
dst += i_dst;\r\n                pfirst -= 4;\r\n            }\r\n        }\r\n\r\n\r\n\r\n        /*\r\n        switch (bsx) {\r\n            case 8:\r\n                while (bsy--) {\r\n                    CP64(dst, pfirst);\r\n                    dst += i_dst;\r\n                    pfirst -= 4;\r\n                }\r\n                break;\r\n            case 16:\r\n            case 32:\r\n            case 64:\r\n                while (bsy--) {\r\n                    memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n                    dst += i_dst;\r\n                    pfirst -= 4;\r\n                }\r\n                break;\r\n            default:\r\n                assert(0);\r\n                break;\r\n        }*/\r\n    } else {//4x4 4x16\r\n        for (i = 0; i < bsy; i++, src--) {\r\n            dst[0] = (pel_t)((src[-2] * 3 +  src[-1] * 7 + src[0]  * 5 + src[1]     + 8) >> 4);\r\n            dst[1] = (pel_t)((src[-2]     + (src[-1]     + src[0]) * 3 + src[1]     + 4) >> 3);\r\n            dst[2] = (pel_t)((src[-2]     +  src[-1] * 5 + src[0]  * 7 + src[1] * 3 + 8) >> 4);\r\n            dst[3] = (pel_t)((               src[-1]     + src[0]  * 2 + src[1]     + 2) >> 2);\r\n            dst += i_dst;\r\n        }\r\n    }\r\n\r\n}\r\n\r\nvoid intra_pred_ang_xy_23_avx(pel_t *src, pel_t *dst, int i_dst, int dir_mode, int bsx, int bsy)\r\n{\r\n    int i;\r\n    UNUSED_PARAMETER(dir_mode);\r\n\r\n    if (bsx > 8) {\r\n        ALIGN32(pel_t first_line[64 + 512]);\r\n        int left_size = (bsy << 3) - 1;\r\n        int top_size = bsx - 7;\r\n        int line_size = left_size + top_size;\r\n        pel_t *pfirst = first_line + left_size - 7;\r\n        pel_t *pfirst1 = first_line;\r\n        pel_t *src_org = src;\r\n\r\n        src -= bsy;\r\n\r\n        __m256i coeff0 = _mm256_setr_epi16(7, 3, 5, 1, 3, 1, 1, 0, 7, 3, 5, 1, 3, 1, 1, 0);\r\n        __m256i coeff1 = _mm256_setr_epi16(15, 7, 13, 3, 11, 5, 9, 1, 15, 7, 13, 3, 11, 5, 9, 1);\r\n        __m256i 
coeff2 = _mm256_setr_epi16(9, 5, 11, 3, 13, 7, 15, 2, 9, 5, 11, 3, 13, 7, 15, 2);\r\n        __m256i coeff3 = _mm256_setr_epi16(1, 1, 3, 1, 5, 3, 7, 1, 1, 1, 3, 1, 5, 3, 7, 1);\r\n        __m256i coeff4 = _mm256_setr_epi16(16, 8, 16, 4, 16, 8, 16, 2, 16, 8, 16, 4, 16, 8, 16, 2);\r\n        __m256i coeff5 = _mm256_setr_epi16(1, 2, 1, 4, 1, 2, 1, 8, 1, 2, 1, 4, 1, 2, 1, 8);\r\n\r\n        __m256i p00, p10, p20, p30;\r\n        __m256i p01, p11, p21, p31;\r\n        __m256i res1, res2;\r\n        __m256i L0, L1, L2, L3;\r\n\r\n\r\n        __m256i H0, H1, H2;\r\n\r\n        if (bsy == 4){\r\n            L0 = _mm256_setr_epi16(src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1],\r\n                src[1], src[1], src[1], src[1], src[1], src[1], src[1], src[1]);//-1 3\r\n\r\n            L1 = _mm256_setr_epi16(src[0], src[0], src[0], src[0], src[0], src[0], src[0], src[0],\r\n                src[2], src[2], src[2], src[2], src[2], src[2], src[2], src[2]);//0 4\r\n\r\n            L2 = _mm256_setr_epi16(src[1], src[1], src[1], src[1], src[1], src[1], src[1], src[1],\r\n                src[3], src[3], src[3], src[3], src[3], src[3], src[3], src[3]);//1 5\r\n\r\n            L3 = _mm256_setr_epi16(src[2], src[2], src[2], src[2], src[2], src[2], src[2], src[2],\r\n                src[4], src[4], src[4], src[4], src[4], src[4], src[4], src[4]);//2 6\r\n\r\n            src += 4;\r\n\r\n            for (i = 0; i < left_size + 1; i += 32) {\r\n                p00 = _mm256_mullo_epi16(L0, coeff0);//-1\r\n                p10 = _mm256_mullo_epi16(L1, coeff1);//0\r\n                p20 = _mm256_mullo_epi16(L2, coeff2);//1\r\n                p30 = _mm256_mullo_epi16(L3, coeff3);//2\r\n                p00 = _mm256_add_epi16(p00, coeff4);\r\n                p00 = _mm256_add_epi16(p00, p10);\r\n                p00 = _mm256_add_epi16(p00, p20);\r\n                p00 = _mm256_add_epi16(p00, p30);\r\n                p00 = _mm256_mullo_epi16(p00, coeff5);\r\n                
p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n                L0 = _mm256_setr_epi16(src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1],\r\n                    src[1], src[1], src[1], src[1], src[1], src[1], src[1], src[1]);//-1 3\r\n\r\n                p01 = _mm256_mullo_epi16(L1, coeff0);//0\r\n                p11 = _mm256_mullo_epi16(L2, coeff1);//1\r\n                p21 = _mm256_mullo_epi16(L3, coeff2);//2\r\n                p31 = _mm256_mullo_epi16(L0, coeff3);//3\r\n                p01 = _mm256_add_epi16(p01, coeff4);\r\n                p01 = _mm256_add_epi16(p01, p11);\r\n                p01 = _mm256_add_epi16(p01, p21);\r\n                p01 = _mm256_add_epi16(p01, p31);\r\n                p01 = _mm256_mullo_epi16(p01, coeff5);\r\n                p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n                res1 = _mm256_packus_epi16(p00, p01);\r\n                _mm256_storeu_si256((__m256i*)pfirst1, res1);\r\n\r\n            }\r\n\r\n        } else {\r\n\r\n            L0 = _mm256_setr_epi16(src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1],\r\n                src[3], src[3], src[3], src[3], src[3], src[3], src[3], src[3]);//-1 3\r\n\r\n            L1 = _mm256_setr_epi16(src[0], src[0], src[0], src[0], src[0], src[0], src[0], src[0],\r\n                src[4], src[4], src[4], src[4], src[4], src[4], src[4], src[4]);//0 4\r\n\r\n            L2 = _mm256_setr_epi16(src[1], src[1], src[1], src[1], src[1], src[1], src[1], src[1],\r\n                src[5], src[5], src[5], src[5], src[5], src[5], src[5], src[5]);//1 5\r\n\r\n            L3 = _mm256_setr_epi16(src[2], src[2], src[2], src[2], src[2], src[2], src[2], src[2],\r\n                src[6], src[6], src[6], src[6], src[6], src[6], src[6], src[6]);//2 6\r\n\r\n            src += 4;\r\n\r\n            for (i = 0; i < left_size + 1; i += 64, src += 4) {\r\n                p00 = _mm256_mullo_epi16(L0, coeff0);//-1 3\r\n                p10 = _mm256_mullo_epi16(L1, coeff1);// 0 
4\r\n                p20 = _mm256_mullo_epi16(L2, coeff2);// 1 5\r\n                p30 = _mm256_mullo_epi16(L3, coeff3);// 2 6\r\n                p00 = _mm256_add_epi16(p00, coeff4);\r\n                p00 = _mm256_add_epi16(p00, p10);\r\n                p00 = _mm256_add_epi16(p00, p20);\r\n                p00 = _mm256_add_epi16(p00, p30);\r\n                p00 = _mm256_mullo_epi16(p00, coeff5);\r\n                p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n                L0 = _mm256_setr_epi16(src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1],\r\n                    src[3], src[3], src[3], src[3], src[3], src[3], src[3], src[3]);//3 7\r\n\r\n                p01 = _mm256_mullo_epi16(L1, coeff0);//0 4\r\n                p11 = _mm256_mullo_epi16(L2, coeff1);//1 5\r\n                p21 = _mm256_mullo_epi16(L3, coeff2);//2 6\r\n                p31 = _mm256_mullo_epi16(L0, coeff3);//3 7\r\n                p01 = _mm256_add_epi16(p01, coeff4);\r\n                p01 = _mm256_add_epi16(p01, p11);\r\n                p01 = _mm256_add_epi16(p01, p21);\r\n                p01 = _mm256_add_epi16(p01, p31);\r\n                p01 = _mm256_mullo_epi16(p01, coeff5);\r\n                p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n                res1 = _mm256_packus_epi16(p00, p01);\r\n\r\n                L1 = _mm256_setr_epi16(src[0], src[0], src[0], src[0], src[0], src[0], src[0], src[0],\r\n                    src[4], src[4], src[4], src[4], src[4], src[4], src[4], src[4]);//4 8\r\n\r\n                p00 = _mm256_mullo_epi16(L2, coeff0);//1 5\r\n                p10 = _mm256_mullo_epi16(L3, coeff1);//2 6\r\n                p20 = _mm256_mullo_epi16(L0, coeff2);//3 7\r\n                p30 = _mm256_mullo_epi16(L1, coeff3);//4 8\r\n                p00 = _mm256_add_epi16(p00, coeff4);\r\n                p00 = _mm256_add_epi16(p00, p10);\r\n                p00 = _mm256_add_epi16(p00, p20);\r\n                p00 = _mm256_add_epi16(p00, p30);\r\n                p00 = 
_mm256_mullo_epi16(p00, coeff5);\r\n                p00 = _mm256_srli_epi16(p00, 5);\r\n\r\n                L2 = _mm256_setr_epi16(src[1], src[1], src[1], src[1], src[1], src[1], src[1], src[1],\r\n                    src[5], src[5], src[5], src[5], src[5], src[5], src[5], src[5]);//5 9\r\n\r\n                p01 = _mm256_mullo_epi16(L3, coeff0);//2 6\r\n                p11 = _mm256_mullo_epi16(L0, coeff1);//3 7\r\n                p21 = _mm256_mullo_epi16(L1, coeff2);//4 8\r\n                p31 = _mm256_mullo_epi16(L2, coeff3);//5 9\r\n                p01 = _mm256_add_epi16(p01, coeff4);\r\n                p01 = _mm256_add_epi16(p01, p11);\r\n                p01 = _mm256_add_epi16(p01, p21);\r\n                p01 = _mm256_add_epi16(p01, p31);\r\n                p01 = _mm256_mullo_epi16(p01, coeff5);\r\n                p01 = _mm256_srli_epi16(p01, 5);\r\n\r\n                res2 = _mm256_packus_epi16(p00, p01);\r\n                p00 = _mm256_permute2x128_si256(res1, res2, 0x0020);\r\n                _mm256_storeu_si256((__m256i*)pfirst1, p00);\r\n                pfirst1 += 32;\r\n\r\n                p00 = _mm256_permute2x128_si256(res1, res2, 0x0031);\r\n                _mm256_storeu_si256((__m256i*)pfirst1, p00);\r\n\r\n                pfirst1 += 32;\r\n\r\n                src += 4;\r\n\r\n                L0 = _mm256_setr_epi16(src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1], src[-1],\r\n                    src[3], src[3], src[3], src[3], src[3], src[3], src[3], src[3]);\r\n\r\n                L1 = _mm256_setr_epi16(src[0], src[0], src[0], src[0], src[0], src[0], src[0], src[0],\r\n                    src[4], src[4], src[4], src[4], src[4], src[4], src[4], src[4]);\r\n\r\n                L2 = _mm256_setr_epi16(src[1], src[1], src[1], src[1], src[1], src[1], src[1], src[1],\r\n                    src[5], src[5], src[5], src[5], src[5], src[5], src[5], src[5]);\r\n\r\n                L3 = _mm256_setr_epi16(src[2], src[2], src[2], src[2], src[2], 
src[2], src[2], src[2],\r\n                    src[6], src[6], src[6], src[6], src[6], src[6], src[6], src[6]);\r\n            }\r\n        }\r\n\r\n        src = src_org + 1;\r\n        __m256i S0, S1, S2;\r\n        coeff2 = _mm256_set1_epi16(2);\r\n        for (; i < line_size; i += 32, src += 32) {\r\n\r\n            S0 = _mm256_loadu_si256((__m256i*)(src));\r\n            S1 = _mm256_loadu_si256((__m256i*)(src + 1));\r\n            S2 = _mm256_loadu_si256((__m256i*)(src - 1));\r\n\r\n            L0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 0));\r\n            L1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 0));\r\n            L2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 0));\r\n\r\n            H0 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S0, 1));\r\n            H1 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S1, 1));\r\n            H2 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(S2, 1));\r\n\r\n            p00 = _mm256_mullo_epi16(L0, coeff2);\r\n            p10 = _mm256_add_epi16(L1, L2);\r\n            p00 = _mm256_add_epi16(p00, coeff2);\r\n            p00 = _mm256_add_epi16(p00, p10);\r\n            p00 = _mm256_srli_epi16(p00, 2);\r\n\r\n            p01 = _mm256_mullo_epi16(H0, coeff2);\r\n            p11 = _mm256_add_epi16(H1, H2);\r\n            p01 = _mm256_add_epi16(p01, coeff2);\r\n            p01 = _mm256_add_epi16(p01, p11);\r\n            p01 = _mm256_srli_epi16(p01, 2);\r\n\r\n            p00 = _mm256_packus_epi16(p00, p01);\r\n            p00 = _mm256_permute4x64_epi64(p00, 0x00D8);\r\n            _mm256_storeu_si256((__m256i*)&first_line[i], p00);\r\n        }\r\n\r\n        __m256i M;\r\n        if (bsx == 64) {\r\n            for (i = 0; i < bsy; i += 4){\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), 
M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                M = _mm256_lddqu_si256((__m256i*)(pfirst + 32));\r\n                _mm256_storeu_si256((__m256i*)(dst + 32), M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n            }\r\n        } else if (bsx == 32){\r\n            for (i = 0; i < bsy; i += 4){\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_storeu_si256((__m256i*)dst, M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n            }\r\n        } else if (bsx == 16){\r\n            __m256i mask = 
_mm256_lddqu_si256((__m256i*)intrinsic_mask_256_8bit[15]);\r\n            for (i = 0; i < bsy; i += 4){\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n\r\n                M = _mm256_lddqu_si256((__m256i*)pfirst);\r\n                _mm256_maskstore_epi64((__int64 *)dst, mask, M);\r\n                dst += i_dst;\r\n                pfirst -= 8;\r\n            }\r\n        }\r\n\r\n        /*for (i = 0; i < bsy; i++) {\r\n            memcpy(dst, pfirst, bsx * sizeof(pel_t));\r\n            dst += i_dst;\r\n            pfirst -= 8;\r\n        }*/\r\n    } else {//8x8 8x32 4x4 4x16------128bit is enough\r\n        intra_pred_ang_xy_23_sse128(src, dst, i_dst, dir_mode, bsx, bsy);\r\n        return;\r\n    }\r\n}\r\n\r\n#endif\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_pixel.cc",
    "content": "/*\r\n * intrinsic_pixel.cc\r\n *\r\n * Description of this file:\r\n *    SSE assembly functions of Pixel-Processing module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n\r\n\r\n\r\nvoid avs_pixel_average_sse128(pel_t *dst, int i_dst, const pel_t *src0, int i_src0, const pel_t *src1, int i_src1, int width, int height)\r\n{\r\n#if HIGH_BIT_DEPTH\r\n    int j;\r\n    __m128i D;\r\n\r\n    if (width & 7) {\r\n        __m128i mask = _mm_load_si128((const __m128i *)intrinsic_mask_10bit[(width & 7) - 1]);\r\n\r\n        while (height--) {\r\n            for (j = 0; j < width - 7; j 
+= 8) {\r\n                D = _mm_avg_epu16(_mm_loadu_si128((const __m128i *)(src0 + j)), _mm_loadu_si128((const __m128i *)(src1 + j)));\r\n                _mm_storeu_si128((__m128i *)(dst + j), D);\r\n            }\r\n\r\n            D = _mm_avg_epu16(_mm_loadu_si128((const __m128i *)(src0 + j)), _mm_loadu_si128((const __m128i *)(src1 + j)));\r\n            _mm_maskmoveu_si128(D, mask, (char *)&dst[j]);\r\n\r\n            src0 += i_src0;\r\n            src1 += i_src1;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        while (height--) {\r\n            for (j = 0; j < width; j += 8) {\r\n                D = _mm_avg_epu16(_mm_loadu_si128((const __m128i *)(src0 + j)), _mm_loadu_si128((const __m128i *)(src1 + j)));\r\n                _mm_storeu_si128((__m128i *)(dst + j), D);\r\n            }\r\n            src0 += i_src0;\r\n            src1 += i_src1;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n#else\r\n    int i, j;\r\n    __m128i S1, S2, D;\r\n\r\n    if (width & 15) {\r\n        __m128i mask = _mm_load_si128((const __m128i*)intrinsic_mask[(width & 15) - 1]);\r\n\r\n        for (i = 0; i < height; i++) {\r\n            for (j = 0; j < width - 15; j += 16) {\r\n                S1 = _mm_loadu_si128((const __m128i*)(src0 + j));\r\n                S2 = _mm_loadu_si128((const __m128i*)(src1 + j));\r\n                D  = _mm_avg_epu8(S1, S2);\r\n                _mm_storeu_si128((__m128i*)(dst + j), D);\r\n            }\r\n\r\n            S1 = _mm_loadu_si128((const __m128i*)(src0 + j));\r\n            S2 = _mm_loadu_si128((const __m128i*)(src1 + j));\r\n            D  = _mm_avg_epu8(S1, S2);\r\n            _mm_maskmoveu_si128(D, mask, (char*)(dst + j));\r\n\r\n            src0 += i_src0;\r\n            src1 += i_src1;\r\n            dst  += i_dst;\r\n        }\r\n    } else {\r\n        for (i = 0; i < height; i++) {\r\n            for (j = 0; j < width; j += 16) {\r\n                S1 = _mm_loadu_si128((const __m128i*)(src0 + j));\r\n     
           S2 = _mm_loadu_si128((const __m128i*)(src1 + j));\r\n                D  = _mm_avg_epu8(S1, S2);\r\n                _mm_storeu_si128((__m128i*)(dst + j), D);\r\n            }\r\n            src0 += i_src0;\r\n            src1 += i_src1;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * zero-fill a buffer with aligned 128-bit stores;\r\n * dst must be 16-byte aligned, and only the first (n & ~15) bytes are written\r\n */\r\nvoid *davs2_memzero_aligned_c_sse2(void *dst, size_t n)\r\n{\r\n    __m128i *p_dst = (__m128i *)dst;\r\n    __m128i m0 = _mm_setzero_si128();\r\n    int i = (int)(n >> 4);\r\n\r\n    for (; i != 0; i--) {\r\n        _mm_store_si128(p_dst, m0);\r\n        p_dst++;\r\n    }\r\n\r\n    return dst;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * copy a buffer with aligned 128-bit loads/stores;\r\n * dst and src must both be 16-byte aligned, and only the first (n & ~15) bytes are copied\r\n */\r\nvoid *davs2_memcpy_aligned_c_sse2(void *dst, const void *src, size_t n)\r\n{\r\n    __m128i *p_dst = (__m128i *)dst;\r\n    const __m128i *p_src = (const __m128i *)src;\r\n    int i = (int)(n >> 4);\r\n\r\n    for (; i != 0; i--) {\r\n        _mm_store_si128(p_dst, _mm_load_si128(p_src));\r\n        p_src++;\r\n        p_dst++;\r\n    }\r\n\r\n    return dst;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * copy a w x h plane row by row with unaligned 128-bit loads/stores;\r\n * i_dst and i_src are strides in pel_t units, and a row tail shorter than 16 bytes is copied with memcpy\r\n */\r\nvoid plane_copy_c_sse2(pel_t *dst, intptr_t i_dst, pel_t *src, intptr_t i_src, int w, int h)\r\n{\r\n    const int n128 = (w * sizeof(pel_t)) >> 4;\r\n    int n_left = (w * sizeof(pel_t)) - (n128 << 4);\r\n\r\n    if (n_left) {\r\n        int n_offset = (n128 << 4);\r\n        while (h--) {\r\n            const __m128i *p_src = (const __m128i *)src;\r\n            __m128i *p_dst = (__m128i *)dst;\r\n            int n = n128;\r\n            for (; n != 0; n--) {\r\n                _mm_storeu_si128(p_dst, _mm_loadu_si128(p_src));\r\n                p_dst++;\r\n                p_src++;\r\n            }\r\n            memcpy((uint8_t *)(dst) + n_offset, (uint8_t *)(src) + n_offset, n_left);\r\n            dst += i_dst;\r\n   
         src += i_src;\r\n        }\r\n    } else {\r\n        while (h--) {\r\n            const __m128i *p_src = (const __m128i *)src;\r\n            __m128i *p_dst = (__m128i *)dst;\r\n            int n = n128;\r\n            for (; n != 0; n--) {\r\n                _mm_storeu_si128(p_dst, _mm_loadu_si128(p_src));\r\n                p_dst++;\r\n                p_src++;\r\n            }\r\n            dst += i_dst;\r\n            src += i_src;\r\n        }\r\n    }\r\n}\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_pixel_avx.cc",
    "content": "/*\r\n * intrinsic_pixel_avx.cc\r\n *\r\n * Description of this file:\r\n *    AVX2 assembly functions of Pixel-Processing module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n#include <immintrin.h>\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid *davs2_memzero_aligned_c_avx(void *dst, size_t n)\r\n{\r\n    __m256i *p_dst = (__m256i *)dst;\r\n    __m256i m0 = _mm256_setzero_si256();\r\n    int i = (int)(n >> 5);\r\n\r\n    for (; i != 0; i--) {\r\n        _mm256_store_si256(p_dst, m0);\r\n        p_dst++;\r\n    }\r\n\r\n    
return dst;\r\n}\r\n\r\n\r\n#if _MSC_VER\r\n#if !HIGH_BIT_DEPTH\r\nvoid padding_rows_sse256_10bit(pel_t *src, int i_src, int width, int height, int start, int rows, int pad)\r\n{\r\n    int i, j;\r\n    pel_t *p, *p1, *p2;\r\n    int pad_lr = pad + 16 - (pad & 0xF);\r\n    start = max(start, 0);\r\n\r\n    if (start + rows > height) {\r\n        rows = height - start;\r\n    }\r\n\r\n    p = src + start * i_src;\r\n\r\n    // left & right\r\n    for (i = 0; i < rows; i++) {\r\n        __m256i Val1 = _mm256_set1_epi16((int16_t)p[0]);\r\n        __m256i Val2 = _mm256_set1_epi16((int16_t)p[width - 1]);\r\n        p1 = p - pad_lr;\r\n        p2 = p + width;\r\n        for (j = 0; j < pad_lr; j += 16) {\r\n            _mm256_storeu_si256((__m256i *)(p1 + j), Val1);\r\n            _mm256_storeu_si256((__m256i *)(p2 + j), Val2);\r\n        }\r\n\r\n        p += i_src;\r\n    }\r\n\r\n    if (start == 0) {\r\n        p = src - pad;\r\n        for (i = 1; i <= pad; i++) {\r\n            memcpy(p - i_src * i, p, (width + 2 * pad) * sizeof(pel_t));\r\n        }\r\n    }\r\n\r\n    if (start + rows == height) {\r\n        p = src + i_src * (height - 1) - pad;\r\n        for (i = 1; i <= pad; i++) {\r\n            memcpy(p + i_src * i, p, (width + 2 * pad) * sizeof(pel_t));\r\n        }\r\n    }\r\n}\r\n\r\nvoid padding_rows_lr_sse256_10bit(pel_t *src, int i_src, int width, int height, int start, int rows, int pad)\r\n{\r\n    int i, j;\r\n    pel_t *p, *p1, *p2;\r\n    int pad_lr = pad + 16 - (pad & 0xF);\r\n    start = max(start, 0);\r\n\r\n    if (start + rows > height) {\r\n        rows = height - start;\r\n    }\r\n\r\n    p = src + start * i_src;\r\n\r\n    // left & right\r\n    for (i = 0; i < rows; i++) {\r\n        __m256i Val1 = _mm256_set1_epi16((int16_t)p[0]);\r\n        __m256i Val2 = _mm256_set1_epi16((int16_t)p[width - 1]);\r\n        p1 = p - pad_lr;\r\n        p2 = p + width;\r\n        for (j = 0; j < pad_lr; j += 16) {\r\n            
_mm256_storeu_si256((__m256i *)(p1 + j), Val1);\r\n            _mm256_storeu_si256((__m256i *)(p2 + j), Val2);\r\n        }\r\n\r\n        p += i_src;\r\n    }\r\n}\r\n#endif\r\n\r\nvoid add_pel_clip_sse256(const pel_t *src1, int i_src1, const coeff_t *src2, int i_src2, pel_t *dst, int i_dst,\r\n                         int width, int height)\r\n{\r\n#if !HIGH_BIT_DEPTH\r\n    int i, j;\r\n    __m256i mask;\r\n    __m128i mask1;\r\n\r\n    if (width >= 32) {\r\n        __m256i S, R1, R2, S1, S2, D;\r\n        __m256i zero = _mm256_setzero_si256();\r\n        mask = _mm256_load_si256((const __m256i *)intrinsic_mask32[(width & 31)]);\r\n        for (i = 0; i < height; i++) {\r\n            S = _mm256_loadu_si256((const __m256i *)(src1));\r\n            R1 = _mm256_loadu_si256((const __m256i *)(src2));\r\n            R2 = _mm256_loadu_si256((const __m256i *)(src2 + 16));\r\n            S = _mm256_permute4x64_epi64(S, 0xd8);\r\n            S1 = _mm256_unpacklo_epi8(S, zero);\r\n            S2 = _mm256_unpackhi_epi8(S, zero);\r\n            S1 = _mm256_add_epi16(R1, S1);\r\n            S2 = _mm256_add_epi16(R2, S2);\r\n            D = _mm256_packus_epi16(S1, S2);\r\n            D = _mm256_permute4x64_epi64(D, 0xd8);\r\n            _mm256_storeu_si256((__m256i *)(dst), D);\r\n\r\n            if (width > 32) {\r\n                S = _mm256_loadu_si256((const __m256i *)(src1 + 32));\r\n                R1 = _mm256_loadu_si256((const __m256i *)(src2 + 32));\r\n                R2 = _mm256_loadu_si256((const __m256i *)(src2 + 48));\r\n                S = _mm256_permute4x64_epi64(S, 0xd8);\r\n                S1 = _mm256_unpacklo_epi8(S, zero);\r\n                S2 = _mm256_unpackhi_epi8(S, zero);\r\n                S1 = _mm256_add_epi16(R1, S1);\r\n                S2 = _mm256_add_epi16(R2, S2);\r\n                D = _mm256_packus_epi16(S1, S2);\r\n                D = _mm256_permute4x64_epi64(D, 0xd8);\r\n                _mm256_maskstore_epi32((int *)(dst + 32), mask, D);\r\n  
          }\r\n            src1 += i_src1;\r\n            src2 += i_src2;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        __m128i zero = _mm_setzero_si128();\r\n        __m128i S, S1, S2, R1, R2, D;\r\n        if (width & 15) {\r\n            mask1 = _mm_load_si128((const __m128i *)intrinsic_mask[(width & 15) - 1]);\r\n\r\n            for (i = 0; i < height; i++) {\r\n                for (j = 0; j < width - 15; j += 16) {\r\n                    S = _mm_load_si128((const __m128i *)(src1 + j));\r\n                    R1 = _mm_load_si128((const __m128i *)(src2 + j));\r\n                    R2 = _mm_load_si128((const __m128i *)(src2 + j + 8));\r\n                    S1 = _mm_unpacklo_epi8(S, zero);\r\n                    S2 = _mm_unpackhi_epi8(S, zero);\r\n                    S1 = _mm_add_epi16(R1, S1);\r\n                    S2 = _mm_add_epi16(R2, S2);\r\n                    D = _mm_packus_epi16(S1, S2);\r\n                    _mm_store_si128((__m128i *)(dst + j), D);\r\n                }\r\n\r\n                S = _mm_loadu_si128((const __m128i *)(src1 + j));\r\n                R1 = _mm_loadu_si128((const __m128i *)(src2 + j));\r\n                R2 = _mm_loadu_si128((const __m128i *)(src2 + j + 8));\r\n                S1 = _mm_unpacklo_epi8(S, zero);\r\n                S2 = _mm_unpackhi_epi8(S, zero);\r\n                S1 = _mm_add_epi16(R1, S1);\r\n                S2 = _mm_add_epi16(R2, S2);\r\n                D = _mm_packus_epi16(S1, S2);\r\n                _mm_maskmoveu_si128(D, mask1, (char *)&dst[j]);\r\n\r\n                src1 += i_src1;\r\n                src2 += i_src2;\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            for (i = 0; i < height; i++) {\r\n                for (j = 0; j < width; j += 16) {\r\n                    S = _mm_load_si128((const __m128i *)(src1 + j));\r\n                    R1 = _mm_load_si128((const __m128i *)(src2 + j));\r\n                    R2 = _mm_load_si128((const __m128i 
*)(src2 + j + 8));\r\n                    S1 = _mm_unpacklo_epi8(S, zero);\r\n                    S2 = _mm_unpackhi_epi8(S, zero);\r\n                    S1 = _mm_add_epi16(R1, S1);\r\n                    S2 = _mm_add_epi16(R2, S2);\r\n                    D = _mm_packus_epi16(S1, S2);\r\n                    _mm_store_si128((__m128i *)(dst + j), D);\r\n                }\r\n                src1 += i_src1;\r\n                src2 += i_src2;\r\n                dst += i_dst;\r\n            }\r\n        }\r\n    }\r\n#else\r\n    int j;\r\n    __m256i zero = _mm256_setzero_si256();\r\n    __m256i D;\r\n    __m256i max_val = _mm256_set1_epi16((short)(max_pel_value));\r\n\r\n    if (width & 15) {\r\n        __m256i mask = _mm256_loadu_si256((const __m256i *)intrinsic_mask_10bit[(width & 15) - 1]);\r\n\r\n        while (height--) {\r\n            for (j = 0; j < width - 15; j += 16) {\r\n                D = _mm256_add_epi16(_mm256_loadu_si256((const __m256i *)(src1 + j)), _mm256_loadu_si256((const __m256i *)(src2 + j)));\r\n                D = _mm256_min_epi16(D, max_val);\r\n                D = _mm256_max_epi16(D, zero);\r\n                _mm256_storeu_si256((__m256i *)(dst + j), D);\r\n            }\r\n\r\n            D = _mm256_add_epi16(_mm256_loadu_si256((const __m256i *)(src1 + j)), _mm256_loadu_si256((const __m256i *)(src2 + j)));\r\n            D = _mm256_min_epi16(D, max_val);\r\n            D = _mm256_max_epi16(D, zero);\r\n            _mm256_maskstore_epi32((int *)&dst[j], mask, D);\r\n\r\n            src1 += i_src1;\r\n            src2 += i_src2;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        while (height--) {\r\n            for (j = 0; j < width - 15; j += 16) {\r\n                D = _mm256_add_epi16(_mm256_loadu_si256((const __m256i *)(src1 + j)), _mm256_loadu_si256((const __m256i *)(src2 + j)));\r\n                D = _mm256_min_epi16(D, max_val);\r\n                D = _mm256_max_epi16(D, zero);\r\n                
_mm256_storeu_si256((__m256i *)(dst + j), D);\r\n            }\r\n            src1 += i_src1;\r\n            src2 += i_src2;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n#endif\r\n}\r\n\r\nvoid davs2_pixel_average_avx(pel_t *dst, int i_dst, const pel_t *src1, int i_src1, const pel_t *src2, int i_src2, int width, int height)\r\n{\r\n#if HIGH_BIT_DEPTH\r\n    int j;\r\n\r\n    if (width & 15) {\r\n        __m256i mask = _mm256_loadu_si256((const __m256i *)intrinsic_mask_10bit[(width & 15) - 1]);\r\n\r\n        while (height--) {\r\n            __m256i D;\r\n            for (j = 0; j < width - 15; j += 16) {\r\n                D = _mm256_avg_epu16(_mm256_loadu_si256((const __m256i *)(src1 + j)), _mm256_loadu_si256((const __m256i *)(src2 + j)));\r\n                _mm256_storeu_si256((__m256i *)(dst + j), D);\r\n            }\r\n\r\n            D = _mm256_avg_epu16(_mm256_loadu_si256((const __m256i *)(src1 + j)), _mm256_loadu_si256((const __m256i *)(src2 + j)));\r\n            _mm256_maskstore_epi32((int *)&dst[j], mask, D);\r\n\r\n            src1 += i_src1;\r\n            src2 += i_src2;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        while (height--) {\r\n            for (j = 0; j < width - 15; j += 16) {\r\n                __m256i D = _mm256_avg_epu16(_mm256_loadu_si256((const __m256i *)(src1 + j)), _mm256_loadu_si256((const __m256i *)(src2 + j)));\r\n                _mm256_storeu_si256((__m256i *)(dst + j), D);\r\n            }\r\n            src1 += i_src1;\r\n            src2 += i_src2;\r\n            dst += i_dst;\r\n        }\r\n    }\r\n#else\r\n    int i;\r\n\r\n    if (width >= 32) {\r\n        __m256i mask = _mm256_load_si256((const __m256i *)intrinsic_mask32[(width & 31)]);\r\n        for (i = 0; i < height; i++) {\r\n            __m256i S1 = _mm256_loadu_si256((const __m256i *)(src1));\r\n            __m256i S2 = _mm256_load_si256((const __m256i *)(src2));\r\n            __m256i D = _mm256_avg_epu8(S1, S2);\r\n            
_mm256_storeu_si256((__m256i *)(dst), D);\r\n\r\n            if (32 < width) {\r\n                S1 = _mm256_loadu_si256((const __m256i *)(src1 + 32));\r\n                S2 = _mm256_load_si256((const __m256i *)(src2 + 32));\r\n                D = _mm256_avg_epu8(S1, S2);\r\n                _mm256_maskstore_epi32((int *)(dst + 32), mask, D);\r\n            }\r\n            src1 += i_src1;\r\n            src2 += i_src2;\r\n            dst += i_dst;\r\n        }\r\n    } else {\r\n        int j;\r\n\r\n        if (width & 15) {\r\n            __m128i mask = _mm_load_si128((const __m128i *)intrinsic_mask[(width & 15) - 1]);\r\n\r\n            for (i = 0; i < height; i++) {\r\n                __m128i S1, S2, D;\r\n\r\n                for (j = 0; j < width - 15; j += 16) {\r\n                    S1 = _mm_loadu_si128((const __m128i *)(src1 + j));\r\n                    S2 = _mm_load_si128((const __m128i *)(src2 + j));\r\n                    D = _mm_avg_epu8(S1, S2);\r\n                    _mm_storeu_si128((__m128i *)(dst + j), D);\r\n                }\r\n\r\n                S1 = _mm_loadu_si128((const __m128i *)(src1 + j));\r\n                S2 = _mm_load_si128((const __m128i *)(src2 + j));\r\n                D = _mm_avg_epu8(S1, S2);\r\n                _mm_maskmoveu_si128(D, mask, (char *)&dst[j]);\r\n\r\n                src1 += i_src1;\r\n                src2 += i_src2;\r\n                dst += i_dst;\r\n            }\r\n        } else {\r\n            for (i = 0; i < height; i++) {\r\n                for (j = 0; j < width; j += 16) {\r\n                    __m128i S1 = _mm_loadu_si128((const __m128i *)(src1 + j));\r\n                    __m128i S2 = _mm_load_si128((const __m128i *)(src2 + j));\r\n                    __m128i D = _mm_avg_epu8(S1, S2);\r\n                    _mm_storeu_si128((__m128i *)(dst + j), D);\r\n                }\r\n                src1 += i_src1;\r\n                src2 += i_src2;\r\n                dst += i_dst;\r\n            }\r\n        
}\r\n    }\r\n#endif\r\n}\r\n\r\n#if !HIGH_BIT_DEPTH\r\nvoid padding_rows_lr_sse256(pel_t *src, int i_src, int width, int height, int start, int rows, int pad)\r\n{\r\n    int i, j;\r\n    pel_t *p, *p1, *p2;\r\n\r\n    start = max(start, 0);\r\n\r\n    if (start + rows > height) {\r\n        rows = height - start;\r\n    }\r\n\r\n    p = src + start * i_src;\r\n\r\n    pad = pad + 16 - (pad & 0xF);\r\n    if (pad & 0x1f) {\r\n        __m256i mask = _mm256_setr_epi16(-1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0);\r\n        for (i = 0; i < rows; i++) {\r\n            __m256i Val1 = _mm256_set1_epi8((char)p[0]);\r\n            __m256i Val2 = _mm256_set1_epi8((char)p[width - 1]);\r\n            p1 = p - pad;\r\n            p2 = p + width;\r\n            for (j = 0; j < pad - 31; j += 32) {\r\n                _mm256_storeu_si256((__m256i *)(p1 + j), Val1);\r\n                _mm256_storeu_si256((__m256i *)(p2 + j), Val2);\r\n            }\r\n            _mm256_maskstore_epi32((int *)(p1 + j), mask, Val1);\r\n            _mm256_maskstore_epi32((int *)(p2 + j), mask, Val2);\r\n            p += i_src;\r\n        }\r\n    } else {\r\n        // pad every row, as in the masked branch above\r\n        for (i = 0; i < rows; i++) {\r\n            __m256i Val1 = _mm256_set1_epi8((char)p[0]);\r\n            __m256i Val2 = _mm256_set1_epi8((char)p[width - 1]);\r\n            p1 = p - pad;\r\n            p2 = p + width;\r\n            for (j = 0; j < pad; j += 32) {\r\n                _mm256_storeu_si256((__m256i *)(p1 + j), Val1);\r\n                _mm256_storeu_si256((__m256i *)(p2 + j), Val2);\r\n            }\r\n            p += i_src;\r\n        }\r\n    }\r\n}\r\n\r\nvoid padding_rows_sse256(pel_t *src, int i_src, int width, int height, int start, int rows, int pad)\r\n{\r\n    int i, j;\r\n    pel_t *p, *p1, *p2;\r\n\r\n    start = max(start, 0);\r\n\r\n    if (start + rows > height) {\r\n        rows = height - start;\r\n    }\r\n\r\n    p = src + start * i_src;\r\n\r\n    pad = pad + 16 - (pad & 0xF);\r\n    if (pad & 0x1f) {\r\n        __m256i mask = _mm256_setr_epi16(-1, -1, -1, -1, -1, -1, -1, -1, 0, 
0, 0, 0, 0, 0, 0, 0);\r\n        for (i = 0; i < rows; i++) {\r\n            __m256i Val1 = _mm256_set1_epi8((char)p[0]);\r\n            __m256i Val2 = _mm256_set1_epi8((char)p[width - 1]);\r\n            p1 = p - pad;\r\n            p2 = p + width;\r\n            for (j = 0; j < pad - 31; j += 32) {\r\n                _mm256_storeu_si256((__m256i *)(p1 + j), Val1);\r\n                _mm256_storeu_si256((__m256i *)(p2 + j), Val2);\r\n            }\r\n            _mm256_maskstore_epi32((int *)(p1 + j), mask, Val1);\r\n            _mm256_maskstore_epi32((int *)(p2 + j), mask, Val2);\r\n            p += i_src;\r\n        }\r\n    } else {\r\n        // pad every row, as in the masked branch above\r\n        for (i = 0; i < rows; i++) {\r\n            __m256i Val1 = _mm256_set1_epi8((char)p[0]);\r\n            __m256i Val2 = _mm256_set1_epi8((char)p[width - 1]);\r\n            p1 = p - pad;\r\n            p2 = p + width;\r\n            for (j = 0; j < pad; j += 32) {\r\n                _mm256_storeu_si256((__m256i *)(p1 + j), Val1);\r\n                _mm256_storeu_si256((__m256i *)(p2 + j), Val2);\r\n            }\r\n            p += i_src;\r\n        }\r\n    }\r\n\r\n    if (start == 0) {\r\n        p = src - pad;\r\n        for (i = 1; i <= pad; i++) {\r\n            memcpy(p - i_src * i, p, (width + 2 * pad) * sizeof(pel_t));\r\n        }\r\n    }\r\n\r\n    if (start + rows == height) {\r\n        p = src + i_src * (height - 1) - pad;\r\n        for (i = 1; i <= pad; i++) {\r\n            memcpy(p + i_src * i, p, (width + 2 * pad) * sizeof(pel_t));\r\n        }\r\n    }\r\n}\r\n#endif\r\n#endif // #if _MSC_VER\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_sao.cc",
    "content": "/*\r\n * intrinsic_sao.cc\r\n *\r\n * Description of this file:\r\n *    SSE assembly functions of SAO module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include \"../common.h\"\r\n\r\n\r\n#include \"intrinsic.h\"\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n\r\n#ifdef _MSC_VER\r\n#pragma warning(disable:4244)  // TODO: warning\r\n#endif\r\n\r\n#if !HIGH_BIT_DEPTH\r\n/* ---------------------------------------------------------------------------\r\n * lcu neighbor\r\n */\r\nenum lcu_neighbor_e {\r\n    SAO_T = 0,     /* top        */\r\n    SAO_D = 1,     /* down       */\r\n    SAO_L = 2,     /* left       */\r\n    SAO_R = 3,     /* right      */\r\n    SAO_TL = 4,    
/* top-left   */\r\n    SAO_TR = 5,    /* top-right  */\r\n    SAO_DL = 6,    /* down-left  */\r\n    SAO_DR = 7     /* down-right */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nvoid SAO_on_block_eo_0_sse128(pel_t *p_dst, int i_dst, const pel_t *p_src,int i_src, int i_block_w, int i_block_h,\r\n                              int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    __m128i off0, off1, off2, off3, off4;\r\n    __m128i s0, s1, s2;\r\n    __m128i t0, t1, t2, t3, t4, etype;\r\n    __m128i c0, c1, c2, c3, c4;\r\n    __m128i mask;\r\n    int x, y;\r\n\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    __m128i clipMin = _mm_setzero_si128();\r\n    int end_x_16;\r\n\r\n    c0 = _mm_set1_epi8(-2);\r\n    c1 = _mm_set1_epi8(-1);\r\n    c2 = _mm_set1_epi8(0);\r\n    c3 = _mm_set1_epi8(1);\r\n    c4 = _mm_set1_epi8(2);\r\n\r\n    off0 = _mm_set1_epi8((int8_t)sao_offset[0]);\r\n    off1 = _mm_set1_epi8((int8_t)sao_offset[1]);\r\n    off2 = _mm_set1_epi8((int8_t)sao_offset[2]);\r\n    off3 = _mm_set1_epi8((int8_t)sao_offset[3]);\r\n    off4 = _mm_set1_epi8((int8_t)sao_offset[4]);\r\n\r\n    int start_x = lcu_avail[SAO_L] ? 0 : 1;\r\n    int end_x = lcu_avail[SAO_R] ? 
i_block_w : (i_block_w - 1);\r\n    end_x_16 = end_x - ((end_x - start_x) & 0x0f);\r\n\r\n    for (y = 0; y < i_block_h; y++) {\r\n        for (x = start_x; x < end_x; x += 16) {\r\n            s0 = _mm_loadu_si128((__m128i*)&p_src[x - 1]);\r\n            s1 = _mm_loadu_si128((__m128i*)&p_src[x]);\r\n            s2 = _mm_loadu_si128((__m128i*)&p_src[x + 1]);\r\n\r\n            t3 = _mm_min_epu8(s0, s1);\r\n            t1 = _mm_cmpeq_epi8(t3, s0);\r\n            t2 = _mm_cmpeq_epi8(t3, s1);\r\n            t0 = _mm_subs_epi8(t2, t1);\r\n\r\n            t3 = _mm_min_epu8(s1, s2);\r\n            t1 = _mm_cmpeq_epi8(t3, s1);\r\n            t2 = _mm_cmpeq_epi8(t3, s2);\r\n            t3 = _mm_subs_epi8(t1, t2);    //rightsign\r\n\r\n            etype = _mm_adds_epi8(t0, t3); //edgetype=leftsign+rightsign\r\n\r\n            t0 = _mm_cmpeq_epi8(etype, c0);\r\n            t1 = _mm_cmpeq_epi8(etype, c1);\r\n            t2 = _mm_cmpeq_epi8(etype, c2);\r\n            t3 = _mm_cmpeq_epi8(etype, c3);\r\n            t4 = _mm_cmpeq_epi8(etype, c4);\r\n\r\n            t0 = _mm_and_si128(t0, off0);\r\n            t1 = _mm_and_si128(t1, off1);\r\n            t2 = _mm_and_si128(t2, off2);\r\n            t3 = _mm_and_si128(t3, off3);\r\n            t4 = _mm_and_si128(t4, off4);\r\n\r\n            t0 = _mm_adds_epi8(t0, t1);\r\n            t2 = _mm_adds_epi8(t2, t3);\r\n            t0 = _mm_adds_epi8(t0, t4);\r\n            t0 = _mm_adds_epi8(t0, t2);//get offset\r\n\r\n            //add 8 nums once for possible overflow\r\n            t1 = _mm_cvtepi8_epi16(t0);\r\n            t0 = _mm_srli_si128(t0, 8);\r\n            t2 = _mm_cvtepi8_epi16(t0);\r\n            t3 = _mm_unpacklo_epi8(s1, clipMin);\r\n            t4 = _mm_unpackhi_epi8(s1, clipMin);\r\n\r\n            t1 = _mm_adds_epi16(t1, t3);\r\n            t2 = _mm_adds_epi16(t2, t4);\r\n            t0 = _mm_packus_epi16(t1, t2);\r\n\r\n            if (x != end_x_16) {\r\n                _mm_storeu_si128((__m128i*)(p_dst + x), 
t0);\r\n            } else {\r\n                mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x - end_x_16 - 1]));\r\n                _mm_maskmoveu_si128(t0, mask, (char*)(p_dst + x));\r\n                break;\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid SAO_on_block_eo_90_sse128(pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h,\r\n                               int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    __m128i off0, off1, off2, off3, off4;\r\n    __m128i s0, s1, s2;\r\n    __m128i t0, t1, t2, t3, t4, etype;\r\n    __m128i c0, c1, c2, c3, c4;\r\n    __m128i mask;\r\n    int x, y;\r\n\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    __m128i clipMin = _mm_setzero_si128();\r\n    int end_x_16 = i_block_w - 15;\r\n\r\n    c0 = _mm_set1_epi8(-2);\r\n    c1 = _mm_set1_epi8(-1);\r\n    c2 = _mm_set1_epi8(0);\r\n    c3 = _mm_set1_epi8(1);\r\n    c4 = _mm_set1_epi8(2);\r\n\r\n    off0 = _mm_set1_epi8((int8_t)sao_offset[0]);\r\n    off1 = _mm_set1_epi8((int8_t)sao_offset[1]);\r\n    off2 = _mm_set1_epi8((int8_t)sao_offset[2]);\r\n    off3 = _mm_set1_epi8((int8_t)sao_offset[3]);\r\n    off4 = _mm_set1_epi8((int8_t)sao_offset[4]);\r\n\r\n    int start_y = lcu_avail[SAO_T] ? 0 : 1;\r\n    int end_y = lcu_avail[SAO_D] ? 
i_block_h : (i_block_h - 1);\r\n\r\n    p_dst += start_y * i_dst;\r\n    p_src += start_y * i_src;\r\n\r\n    for (y = start_y; y < end_y; y++) {\r\n        for (x = 0; x < i_block_w; x += 16) {\r\n            s0 = _mm_loadu_si128((__m128i*)&p_src[x - i_src]);\r\n            s1 = _mm_loadu_si128((__m128i*)&p_src[x]);\r\n            s2 = _mm_loadu_si128((__m128i*)&p_src[x + i_src]);\r\n\r\n            t3 = _mm_min_epu8(s0, s1);\r\n            t1 = _mm_cmpeq_epi8(t3, s0);\r\n            t2 = _mm_cmpeq_epi8(t3, s1);\r\n            t0 = _mm_subs_epi8(t2, t1); //upsign\r\n\r\n            t3 = _mm_min_epu8(s1, s2);\r\n            t1 = _mm_cmpeq_epi8(t3, s1);\r\n            t2 = _mm_cmpeq_epi8(t3, s2);\r\n            t3 = _mm_subs_epi8(t1, t2); //downsign\r\n\r\n            etype = _mm_adds_epi8(t0, t3); //edgetype\r\n\r\n            t0 = _mm_cmpeq_epi8(etype, c0);\r\n            t1 = _mm_cmpeq_epi8(etype, c1);\r\n            t2 = _mm_cmpeq_epi8(etype, c2);\r\n            t3 = _mm_cmpeq_epi8(etype, c3);\r\n            t4 = _mm_cmpeq_epi8(etype, c4);\r\n\r\n            t0 = _mm_and_si128(t0, off0);\r\n            t1 = _mm_and_si128(t1, off1);\r\n            t2 = _mm_and_si128(t2, off2);\r\n            t3 = _mm_and_si128(t3, off3);\r\n            t4 = _mm_and_si128(t4, off4);\r\n\r\n            t0 = _mm_adds_epi8(t0, t1);\r\n            t2 = _mm_adds_epi8(t2, t3);\r\n            t0 = _mm_adds_epi8(t0, t4);\r\n            t0 = _mm_adds_epi8(t0, t2);//get offset\r\n\r\n            //add 8 nums once for possible overflow\r\n            t1 = _mm_cvtepi8_epi16(t0);\r\n            t0 = _mm_srli_si128(t0, 8);\r\n            t2 = _mm_cvtepi8_epi16(t0);\r\n            t3 = _mm_unpacklo_epi8(s1, clipMin);\r\n            t4 = _mm_unpackhi_epi8(s1, clipMin);\r\n\r\n            t1 = _mm_adds_epi16(t1, t3);\r\n            t2 = _mm_adds_epi16(t2, t4);\r\n            t0 = _mm_packus_epi16(t1, t2); //saturated\r\n\r\n            if (x < end_x_16) {\r\n                
_mm_storeu_si128((__m128i*)(p_dst + x), t0);\r\n            } else {\r\n                mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[(i_block_w & 15) - 1]));\r\n                _mm_maskmoveu_si128(t0, mask, (char*)(p_dst + x));\r\n                break;\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid SAO_on_block_eo_135_sse128(pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h,\r\n                                int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    __m128i off0, off1, off2, off3, off4;\r\n    __m128i s0, s1, s2;\r\n    __m128i t0, t1, t2, t3, t4, etype;\r\n    __m128i c0, c1, c2, c3, c4;\r\n    __m128i mask_r0, mask_r, mask_rn;\r\n    int x, y;\r\n\r\n    __m128i clipMin = _mm_setzero_si128();\r\n    int end_x_r0_16, end_x_r_16, end_x_rn_16;\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    c0 = _mm_set1_epi8(-2);\r\n    c1 = _mm_set1_epi8(-1);\r\n    c2 = _mm_set1_epi8(0);\r\n    c3 = _mm_set1_epi8(1);\r\n    c4 = _mm_set1_epi8(2);\r\n\r\n    off0 = _mm_set1_epi8((int8_t)sao_offset[0]);\r\n    off1 = _mm_set1_epi8((int8_t)sao_offset[1]);\r\n    off2 = _mm_set1_epi8((int8_t)sao_offset[2]);\r\n    off3 = _mm_set1_epi8((int8_t)sao_offset[3]);\r\n    off4 = _mm_set1_epi8((int8_t)sao_offset[4]);\r\n\r\n    //first row\r\n    int start_x_r0 = lcu_avail[SAO_TL] ? 0 : 1;\r\n    int end_x_r0 = lcu_avail[SAO_T] ? (lcu_avail[SAO_R] ? 
i_block_w : (i_block_w - 1)) : 1;\r\n    end_x_r0_16 = end_x_r0 - ((end_x_r0 - start_x_r0) & 0x0f);\r\n    for (x = start_x_r0; x < end_x_r0; x += 16) {\r\n        s0 = _mm_loadu_si128((__m128i*)&p_src[x - i_src - 1]);\r\n        s1 = _mm_loadu_si128((__m128i*)&p_src[x]);\r\n        s2 = _mm_loadu_si128((__m128i*)&p_src[x + i_src + 1]);\r\n\r\n        t3 = _mm_min_epu8(s0, s1);\r\n        t1 = _mm_cmpeq_epi8(t3, s0);\r\n        t2 = _mm_cmpeq_epi8(t3, s1);\r\n        t0 = _mm_subs_epi8(t2, t1); //upsign\r\n\r\n        t3 = _mm_min_epu8(s1, s2);\r\n        t1 = _mm_cmpeq_epi8(t3, s1);\r\n        t2 = _mm_cmpeq_epi8(t3, s2);\r\n        t3 = _mm_subs_epi8(t1, t2); //downsign\r\n\r\n        etype = _mm_adds_epi8(t0, t3); //edgetype\r\n\r\n        t0 = _mm_cmpeq_epi8(etype, c0);\r\n        t1 = _mm_cmpeq_epi8(etype, c1);\r\n        t2 = _mm_cmpeq_epi8(etype, c2);\r\n        t3 = _mm_cmpeq_epi8(etype, c3);\r\n        t4 = _mm_cmpeq_epi8(etype, c4);\r\n\r\n        t0 = _mm_and_si128(t0, off0);\r\n        t1 = _mm_and_si128(t1, off1);\r\n        t2 = _mm_and_si128(t2, off2);\r\n        t3 = _mm_and_si128(t3, off3);\r\n        t4 = _mm_and_si128(t4, off4);\r\n\r\n        t0 = _mm_adds_epi8(t0, t1);\r\n        t2 = _mm_adds_epi8(t2, t3);\r\n        t0 = _mm_adds_epi8(t0, t4);\r\n        t0 = _mm_adds_epi8(t0, t2);//get offset\r\n\r\n        //add 8 nums once for possible overflow\r\n        t1 = _mm_cvtepi8_epi16(t0);\r\n        t0 = _mm_srli_si128(t0, 8);\r\n        t2 = _mm_cvtepi8_epi16(t0);\r\n        t3 = _mm_unpacklo_epi8(s1, clipMin);\r\n        t4 = _mm_unpackhi_epi8(s1, clipMin);\r\n\r\n        t1 = _mm_adds_epi16(t1, t3);\r\n        t2 = _mm_adds_epi16(t2, t4);\r\n        t0 = _mm_packus_epi16(t1, t2); //saturated\r\n\r\n        if (x != end_x_r0_16) {\r\n            _mm_storeu_si128((__m128i*)(p_dst + x), t0);\r\n        } else {\r\n            mask_r0 = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r0 - end_x_r0_16 - 1]));\r\n            _mm_maskmoveu_si128(t0, 
mask_r0, (char*)(p_dst + x));\r\n            break;\r\n        }\r\n    }\r\n    p_dst += i_dst;\r\n    p_src += i_src;\r\n\r\n    //middle rows\r\n    int start_x_r = lcu_avail[SAO_L] ? 0 : 1;\r\n    int end_x_r = lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1);\r\n    end_x_r_16 = end_x_r - ((end_x_r - start_x_r) & 0x0f);\r\n\r\n    for (y = 1; y < i_block_h - 1; y++) {\r\n        for (x = start_x_r; x < end_x_r; x += 16) {\r\n            s0 = _mm_loadu_si128((__m128i*)&p_src[x - i_src - 1]);\r\n            s1 = _mm_loadu_si128((__m128i*)&p_src[x]);\r\n            s2 = _mm_loadu_si128((__m128i*)&p_src[x + i_src + 1]);\r\n\r\n            t3 = _mm_min_epu8(s0, s1);\r\n            t1 = _mm_cmpeq_epi8(t3, s0);\r\n            t2 = _mm_cmpeq_epi8(t3, s1);\r\n            t0 = _mm_subs_epi8(t2, t1); //upsign\r\n\r\n            t3 = _mm_min_epu8(s1, s2);\r\n            t1 = _mm_cmpeq_epi8(t3, s1);\r\n            t2 = _mm_cmpeq_epi8(t3, s2);\r\n            t3 = _mm_subs_epi8(t1, t2); //downsign\r\n\r\n            etype = _mm_adds_epi8(t0, t3); //edgetype\r\n\r\n            t0 = _mm_cmpeq_epi8(etype, c0);\r\n            t1 = _mm_cmpeq_epi8(etype, c1);\r\n            t2 = _mm_cmpeq_epi8(etype, c2);\r\n            t3 = _mm_cmpeq_epi8(etype, c3);\r\n            t4 = _mm_cmpeq_epi8(etype, c4);\r\n\r\n            t0 = _mm_and_si128(t0, off0);\r\n            t1 = _mm_and_si128(t1, off1);\r\n            t2 = _mm_and_si128(t2, off2);\r\n            t3 = _mm_and_si128(t3, off3);\r\n            t4 = _mm_and_si128(t4, off4);\r\n\r\n            t0 = _mm_adds_epi8(t0, t1);\r\n            t2 = _mm_adds_epi8(t2, t3);\r\n            t0 = _mm_adds_epi8(t0, t4);\r\n            t0 = _mm_adds_epi8(t0, t2);//get offset\r\n\r\n            //add 8 nums once for possible overflow\r\n            t1 = _mm_cvtepi8_epi16(t0);\r\n            t0 = _mm_srli_si128(t0, 8);\r\n            t2 = _mm_cvtepi8_epi16(t0);\r\n            t3 = _mm_unpacklo_epi8(s1, clipMin);\r\n            t4 = _mm_unpackhi_epi8(s1, 
clipMin);\r\n\r\n            t1 = _mm_adds_epi16(t1, t3);\r\n            t2 = _mm_adds_epi16(t2, t4);\r\n            t0 = _mm_packus_epi16(t1, t2); //saturated\r\n\r\n            if (x != end_x_r_16) {\r\n                _mm_storeu_si128((__m128i*)(p_dst + x), t0);\r\n            } else {\r\n                mask_r = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r - end_x_r_16 - 1]));\r\n                _mm_maskmoveu_si128(t0, mask_r, (char*)(p_dst + x));\r\n                break;\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n    //last row\r\n    int start_x_rn = lcu_avail[SAO_D] ? (lcu_avail[SAO_L] ? 0 : 1) : (i_block_w - 1);\r\n    int end_x_rn = lcu_avail[SAO_DR] ? i_block_w : (i_block_w - 1);\r\n    end_x_rn_16 = end_x_rn - ((end_x_rn - start_x_rn) & 0x0f);\r\n    for (x = start_x_rn; x < end_x_rn; x += 16) {\r\n        s0 = _mm_loadu_si128((__m128i*)&p_src[x - i_src - 1]);\r\n        s1 = _mm_loadu_si128((__m128i*)&p_src[x]);\r\n        s2 = _mm_loadu_si128((__m128i*)&p_src[x + i_src + 1]);\r\n\r\n        t3 = _mm_min_epu8(s0, s1);\r\n        t1 = _mm_cmpeq_epi8(t3, s0);\r\n        t2 = _mm_cmpeq_epi8(t3, s1);\r\n        t0 = _mm_subs_epi8(t2, t1); //upsign\r\n\r\n        t3 = _mm_min_epu8(s1, s2);\r\n        t1 = _mm_cmpeq_epi8(t3, s1);\r\n        t2 = _mm_cmpeq_epi8(t3, s2);\r\n        t3 = _mm_subs_epi8(t1, t2); //downsign\r\n\r\n        etype = _mm_adds_epi8(t0, t3); //edgetype\r\n\r\n        t0 = _mm_cmpeq_epi8(etype, c0);\r\n        t1 = _mm_cmpeq_epi8(etype, c1);\r\n        t2 = _mm_cmpeq_epi8(etype, c2);\r\n        t3 = _mm_cmpeq_epi8(etype, c3);\r\n        t4 = _mm_cmpeq_epi8(etype, c4);\r\n\r\n        t0 = _mm_and_si128(t0, off0);\r\n        t1 = _mm_and_si128(t1, off1);\r\n        t2 = _mm_and_si128(t2, off2);\r\n        t3 = _mm_and_si128(t3, off3);\r\n        t4 = _mm_and_si128(t4, off4);\r\n\r\n        t0 = _mm_adds_epi8(t0, t1);\r\n        t2 = _mm_adds_epi8(t2, t3);\r\n        t0 = 
_mm_adds_epi8(t0, t4);\r\n        t0 = _mm_adds_epi8(t0, t2);//get offset\r\n\r\n        //add 8 nums once for possible overflow\r\n        t1 = _mm_cvtepi8_epi16(t0);\r\n        t0 = _mm_srli_si128(t0, 8);\r\n        t2 = _mm_cvtepi8_epi16(t0);\r\n        t3 = _mm_unpacklo_epi8(s1, clipMin);\r\n        t4 = _mm_unpackhi_epi8(s1, clipMin);\r\n\r\n        t1 = _mm_adds_epi16(t1, t3);\r\n        t2 = _mm_adds_epi16(t2, t4);\r\n        t0 = _mm_packus_epi16(t1, t2); //saturated\r\n\r\n        if (x != end_x_rn_16) {\r\n            _mm_storeu_si128((__m128i*)(p_dst + x), t0);\r\n        } else {\r\n            mask_rn = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_rn - end_x_rn_16 - 1]));\r\n            _mm_maskmoveu_si128(t0, mask_rn, (char*)(p_dst + x));\r\n            break;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid SAO_on_block_eo_45_sse128(pel_t *p_dst, int i_dst, const pel_t *p_src, int i_src, int i_block_w, int i_block_h,\r\n                               int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    __m128i off0, off1, off2, off3, off4;\r\n    __m128i s0, s1, s2;\r\n    __m128i t0, t1, t2, t3, t4, etype;\r\n    __m128i c0, c1, c2, c3, c4;\r\n    __m128i mask_r0, mask_r, mask_rn;\r\n    int x, y;\r\n\r\n    __m128i clipMin = _mm_setzero_si128();\r\n    int end_x_r0_16, end_x_r_16, end_x_rn_16;\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    c0 = _mm_set1_epi8(-2);\r\n    c1 = _mm_set1_epi8(-1);\r\n    c2 = _mm_set1_epi8(0);\r\n    c3 = _mm_set1_epi8(1);\r\n    c4 = _mm_set1_epi8(2);\r\n\r\n    off0 = _mm_set1_epi8((int8_t)sao_offset[0]);\r\n    off1 = _mm_set1_epi8((int8_t)sao_offset[1]);\r\n    off2 = _mm_set1_epi8((int8_t)sao_offset[2]);\r\n    off3 = _mm_set1_epi8((int8_t)sao_offset[3]);\r\n    off4 = _mm_set1_epi8((int8_t)sao_offset[4]);\r\n\r\n    //first row\r\n    int start_x_r0 = lcu_avail[SAO_T] ? (lcu_avail[SAO_L] ? 
0 : 1) : (i_block_w - 1);\r\n    int end_x_r0 = lcu_avail[SAO_TR] ? i_block_w : (i_block_w - 1);\r\n    end_x_r0_16 = end_x_r0 - ((end_x_r0 - start_x_r0) & 0x0f);\r\n    for (x = start_x_r0; x < end_x_r0; x += 16) {\r\n        s0 = _mm_loadu_si128((__m128i*)&p_src[x - i_src + 1]);\r\n        s1 = _mm_loadu_si128((__m128i*)&p_src[x]);\r\n        s2 = _mm_loadu_si128((__m128i*)&p_src[x + i_src - 1]);\r\n\r\n        t3 = _mm_min_epu8(s0, s1);\r\n        t1 = _mm_cmpeq_epi8(t3, s0);\r\n        t2 = _mm_cmpeq_epi8(t3, s1);\r\n        t0 = _mm_subs_epi8(t2, t1); //upsign\r\n\r\n        t3 = _mm_min_epu8(s1, s2);\r\n        t1 = _mm_cmpeq_epi8(t3, s1);\r\n        t2 = _mm_cmpeq_epi8(t3, s2);\r\n        t3 = _mm_subs_epi8(t1, t2); //downsign\r\n\r\n        etype = _mm_adds_epi8(t0, t3); //edgetype\r\n\r\n        t0 = _mm_cmpeq_epi8(etype, c0);\r\n        t1 = _mm_cmpeq_epi8(etype, c1);\r\n        t2 = _mm_cmpeq_epi8(etype, c2);\r\n        t3 = _mm_cmpeq_epi8(etype, c3);\r\n        t4 = _mm_cmpeq_epi8(etype, c4);\r\n\r\n        t0 = _mm_and_si128(t0, off0);\r\n        t1 = _mm_and_si128(t1, off1);\r\n        t2 = _mm_and_si128(t2, off2);\r\n        t3 = _mm_and_si128(t3, off3);\r\n        t4 = _mm_and_si128(t4, off4);\r\n\r\n        t0 = _mm_adds_epi8(t0, t1);\r\n        t2 = _mm_adds_epi8(t2, t3);\r\n        t0 = _mm_adds_epi8(t0, t4);\r\n        t0 = _mm_adds_epi8(t0, t2);//get offset\r\n\r\n        //add 8 nums once for possible overflow\r\n        t1 = _mm_cvtepi8_epi16(t0);\r\n        t0 = _mm_srli_si128(t0, 8);\r\n        t2 = _mm_cvtepi8_epi16(t0);\r\n        t3 = _mm_unpacklo_epi8(s1, clipMin);\r\n        t4 = _mm_unpackhi_epi8(s1, clipMin);\r\n\r\n        t1 = _mm_adds_epi16(t1, t3);\r\n        t2 = _mm_adds_epi16(t2, t4);\r\n        t0 = _mm_packus_epi16(t1, t2); //saturated\r\n\r\n        if (x != end_x_r0_16) {\r\n            _mm_storeu_si128((__m128i*)(p_dst + x), t0);\r\n        } else {\r\n            mask_r0 = 
_mm_load_si128((__m128i*)(intrinsic_mask[end_x_r0 - end_x_r0_16 - 1]));\r\n            _mm_maskmoveu_si128(t0, mask_r0, (char*)(p_dst + x));\r\n            break;\r\n        }\r\n    }\r\n    p_dst += i_dst;\r\n    p_src += i_src;\r\n    //middle rows\r\n    int start_x_r = lcu_avail[SAO_L] ? 0 : 1;\r\n    int end_x_r = lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1);\r\n    end_x_r_16 = end_x_r - ((end_x_r - start_x_r) & 0x0f);\r\n    for (y = 1; y < i_block_h - 1; y++) {\r\n        for (x = start_x_r; x < end_x_r; x += 16) {\r\n            s0 = _mm_loadu_si128((__m128i*)&p_src[x - i_src + 1]);\r\n            s1 = _mm_loadu_si128((__m128i*)&p_src[x]);\r\n            s2 = _mm_loadu_si128((__m128i*)&p_src[x + i_src - 1]);\r\n\r\n            t3 = _mm_min_epu8(s0, s1);\r\n            t1 = _mm_cmpeq_epi8(t3, s0);\r\n            t2 = _mm_cmpeq_epi8(t3, s1);\r\n            t0 = _mm_subs_epi8(t2, t1); //upsign\r\n\r\n            t3 = _mm_min_epu8(s1, s2);\r\n            t1 = _mm_cmpeq_epi8(t3, s1);\r\n            t2 = _mm_cmpeq_epi8(t3, s2);\r\n            t3 = _mm_subs_epi8(t1, t2); //downsign\r\n\r\n            etype = _mm_adds_epi8(t0, t3); //edgetype\r\n\r\n            t0 = _mm_cmpeq_epi8(etype, c0);\r\n            t1 = _mm_cmpeq_epi8(etype, c1);\r\n            t2 = _mm_cmpeq_epi8(etype, c2);\r\n            t3 = _mm_cmpeq_epi8(etype, c3);\r\n            t4 = _mm_cmpeq_epi8(etype, c4);\r\n\r\n            t0 = _mm_and_si128(t0, off0);\r\n            t1 = _mm_and_si128(t1, off1);\r\n            t2 = _mm_and_si128(t2, off2);\r\n            t3 = _mm_and_si128(t3, off3);\r\n            t4 = _mm_and_si128(t4, off4);\r\n\r\n            t0 = _mm_adds_epi8(t0, t1);\r\n            t2 = _mm_adds_epi8(t2, t3);\r\n            t0 = _mm_adds_epi8(t0, t4);\r\n            t0 = _mm_adds_epi8(t0, t2);//get offset\r\n\r\n            //add 8 nums once for possible overflow\r\n            t1 = _mm_cvtepi8_epi16(t0);\r\n            t0 = _mm_srli_si128(t0, 8);\r\n            t2 = 
_mm_cvtepi8_epi16(t0);\r\n            t3 = _mm_unpacklo_epi8(s1, clipMin);\r\n            t4 = _mm_unpackhi_epi8(s1, clipMin);\r\n\r\n            t1 = _mm_adds_epi16(t1, t3);\r\n            t2 = _mm_adds_epi16(t2, t4);\r\n            t0 = _mm_packus_epi16(t1, t2); //saturated\r\n\r\n            if (x != end_x_r_16) {\r\n                _mm_storeu_si128((__m128i*)(p_dst + x), t0);\r\n            } else {\r\n                mask_r = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r - end_x_r_16 - 1]));\r\n                _mm_maskmoveu_si128(t0, mask_r, (char*)(p_dst + x));\r\n                break;\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n    //last row\r\n    int start_x_rn = lcu_avail[SAO_DL] ? 0 : 1;\r\n    int end_x_rn = lcu_avail[SAO_D] ? (lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1)) : 1;\r\n    end_x_rn_16 = end_x_rn - ((end_x_rn - start_x_rn) & 0x0f);\r\n    for (x = start_x_rn; x < end_x_rn; x += 16) {\r\n        s0 = _mm_loadu_si128((__m128i*)&p_src[x - i_src + 1]);\r\n        s1 = _mm_loadu_si128((__m128i*)&p_src[x]);\r\n        s2 = _mm_loadu_si128((__m128i*)&p_src[x + i_src - 1]);\r\n\r\n        t3 = _mm_min_epu8(s0, s1);\r\n        t1 = _mm_cmpeq_epi8(t3, s0);\r\n        t2 = _mm_cmpeq_epi8(t3, s1);\r\n        t0 = _mm_subs_epi8(t2, t1); //upsign\r\n\r\n        t3 = _mm_min_epu8(s1, s2);\r\n        t1 = _mm_cmpeq_epi8(t3, s1);\r\n        t2 = _mm_cmpeq_epi8(t3, s2);\r\n        t3 = _mm_subs_epi8(t1, t2); //downsign\r\n\r\n        etype = _mm_adds_epi8(t0, t3); //edgetype\r\n\r\n        t0 = _mm_cmpeq_epi8(etype, c0);\r\n        t1 = _mm_cmpeq_epi8(etype, c1);\r\n        t2 = _mm_cmpeq_epi8(etype, c2);\r\n        t3 = _mm_cmpeq_epi8(etype, c3);\r\n        t4 = _mm_cmpeq_epi8(etype, c4);\r\n\r\n        t0 = _mm_and_si128(t0, off0);\r\n        t1 = _mm_and_si128(t1, off1);\r\n        t2 = _mm_and_si128(t2, off2);\r\n        t3 = _mm_and_si128(t3, off3);\r\n        t4 = _mm_and_si128(t4, off4);\r\n\r\n 
       t0 = _mm_adds_epi8(t0, t1);\r\n        t2 = _mm_adds_epi8(t2, t3);\r\n        t0 = _mm_adds_epi8(t0, t4);\r\n        t0 = _mm_adds_epi8(t0, t2);//get offset\r\n\r\n        //add 8 nums once for possible overflow\r\n        t1 = _mm_cvtepi8_epi16(t0);\r\n        t0 = _mm_srli_si128(t0, 8);\r\n        t2 = _mm_cvtepi8_epi16(t0);\r\n        t3 = _mm_unpacklo_epi8(s1, clipMin);\r\n        t4 = _mm_unpackhi_epi8(s1, clipMin);\r\n\r\n        t1 = _mm_adds_epi16(t1, t3);\r\n        t2 = _mm_adds_epi16(t2, t4);\r\n        t0 = _mm_packus_epi16(t1, t2); //saturated\r\n\r\n        if (x != end_x_rn_16) {\r\n            _mm_storeu_si128((__m128i*)(p_dst + x), t0);\r\n        } else {\r\n            mask_rn = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_rn - end_x_rn_16 - 1]));\r\n            _mm_maskmoveu_si128(t0, mask_rn, (char*)(p_dst + x));\r\n            break;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid SAO_on_block_bo_sse128(pel_t *p_dst, int i_dst,\r\n                            const pel_t *p_src, int i_src,\r\n                            int i_block_w, int i_block_h,\r\n                            int bit_depth, const sao_param_t *sao_param)\r\n{\r\n    __m128i r0, r1, r2, r3, off0, off1, off2, off3;\r\n    __m128i t0, t1, t2, t3;\r\n    __m128i mask;\r\n    int shift_bo = g_bit_depth - NUM_SAO_BO_CLASSES_IN_BIT;\r\n    int x, y;\r\n\r\n    __m128i src0, src1;\r\n    __m128i shift_mask = _mm_set1_epi8(31);\r\n    __m128i clipMin = _mm_setzero_si128();\r\n    int end_x_16 = i_block_w - 15;\r\n\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    r0 = _mm_set1_epi8((int8_t)(sao_param->startBand));\r\n    r1 = _mm_set1_epi8((int8_t)((sao_param->startBand + 1) & 31));\r\n    r2 = _mm_set1_epi8((int8_t)(sao_param->startBand2));\r\n    r3 = _mm_set1_epi8((int8_t)((sao_param->startBand2 + 1) & 31));\r\n\r\n    off0 = _mm_set1_epi8((int8_t)sao_param->offset[sao_param->startBand]);\r\n    
off1 = _mm_set1_epi8((int8_t)sao_param->offset[(sao_param->startBand + 1) & 31]);\r\n    off2 = _mm_set1_epi8((int8_t)sao_param->offset[sao_param->startBand2]);\r\n    off3 = _mm_set1_epi8((int8_t)sao_param->offset[(sao_param->startBand2 + 1) & 31]);\r\n\r\n    for (y = 0; y < i_block_h; y++) {\r\n        for (x = 0; x < i_block_w; x += 16) {\r\n            __m128i t4;\r\n            src0 = _mm_loadu_si128((__m128i*)&p_src[x]);\r\n            src1 = _mm_and_si128(_mm_srai_epi16(src0, shift_bo), shift_mask);\r\n\r\n            t0 = _mm_cmpeq_epi8(src1, r0);\r\n            t1 = _mm_cmpeq_epi8(src1, r1);\r\n            t2 = _mm_cmpeq_epi8(src1, r2);\r\n            t3 = _mm_cmpeq_epi8(src1, r3);\r\n\r\n            t0 = _mm_and_si128(t0, off0);\r\n            t1 = _mm_and_si128(t1, off1);\r\n            t2 = _mm_and_si128(t2, off2);\r\n            t3 = _mm_and_si128(t3, off3);\r\n\r\n            t0 = _mm_or_si128(t0, t1);\r\n            t2 = _mm_or_si128(t2, t3);\r\n            t0 = _mm_or_si128(t0, t2);//get offset\r\n\r\n            //add 8 nums once for possible overflow\r\n            t1 = _mm_cvtepi8_epi16(t0);\r\n            t0 = _mm_srli_si128(t0, 8);\r\n            t2 = _mm_cvtepi8_epi16(t0);\r\n            t3 = _mm_unpacklo_epi8(src0, clipMin);\r\n            t4 = _mm_unpackhi_epi8(src0, clipMin);\r\n\r\n            t1 = _mm_adds_epi16(t1, t3);\r\n            t2 = _mm_adds_epi16(t2, t4);\r\n            src0 = _mm_packus_epi16(t1, t2); //saturated\r\n\r\n            if (x < end_x_16) {\r\n                _mm_storeu_si128((__m128i*)&p_dst[x], src0);\r\n            } else {\r\n                mask = _mm_load_si128((const __m128i*)intrinsic_mask[(i_block_w & 15) - 1]);\r\n                _mm_maskmoveu_si128(src0, mask, (char*)(p_dst + x));\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n}\r\n\r\n#endif // !HIGH_BIT_DEPTH\r\n"
  },
  {
    "path": "source/common/vec/intrinsic_sao_avx2.cc",
    "content": "/*\r\n * intrinsic_sao_avx2.cc\r\n *\r\n * Description of this file:\r\n *    AVX2 assembly functions of SAO module of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#include <mmintrin.h>\r\n#include <emmintrin.h>\r\n#include <tmmintrin.h>\r\n#include <smmintrin.h>\r\n#include <immintrin.h>\r\n\r\n#include \"../common.h\"\r\n#include \"intrinsic.h\"\r\n\r\n#if !HIGH_BIT_DEPTH\r\n#ifdef _MSC_VER\r\n#pragma warning(disable:4244)  // TODO: warning\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n * lcu neighbor\r\n */\r\nenum lcu_neighbor_e {\r\n    SAO_T  = 0,    /* top        */\r\n    SAO_D  = 1,    /* down       */\r\n    SAO_L  = 2,    /* left       */\r\n    SAO_R  = 3,    /* right      
*/\r\n    SAO_TL = 4,    /* top-left   */\r\n    SAO_TR = 5,    /* top-right  */\r\n    SAO_DL = 6,    /* down-left  */\r\n    SAO_DR = 7     /* down-right */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid SAO_on_block_eo_0_avx2(pel_t *p_dst, int i_dst,\r\n                            const pel_t *p_src, int i_src,\r\n                            int i_block_w, int i_block_h,\r\n                            int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    int x, y;\r\n    __m256i off;\r\n    __m256i s0, s1, s2;\r\n    __m256i t0, t1, t2, t3, t4, etype;\r\n    __m128i mask, offtmp;\r\n    int start_x = lcu_avail[SAO_L] ? 0 : 1;\r\n    int end_x = lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1);\r\n    int end_x_32;\r\n\r\n    __m256i c2 = _mm256_set1_epi8(2);\r\n\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    offtmp = _mm_loadu_si128((__m128i*)sao_offset);\r\n    offtmp = _mm_packs_epi32(offtmp, _mm_set_epi32(0, 0, 0, sao_offset[4]));\r\n    offtmp = _mm_packs_epi16(offtmp, _mm_setzero_si128());\r\n\r\n    off = _mm256_castsi128_si256(offtmp);\r\n    off = _mm256_inserti128_si256(off, offtmp, 1);\r\n\r\n    end_x_32 = end_x - ((end_x - start_x) & 0x1f);\r\n\r\n    for (y = 0; y < i_block_h; y++) {\r\n        for (x = start_x; x < end_x; x += 32) {\r\n            s0 = _mm256_lddqu_si256((__m256i*)&p_src[x - 1]);\r\n            s1 = _mm256_loadu_si256((__m256i*)&p_src[x]);\r\n            s2 = _mm256_loadu_si256((__m256i*)&p_src[x + 1]);\r\n\r\n            t3 = _mm256_min_epu8(s0, s1);\r\n            t1 = _mm256_cmpeq_epi8(t3, s0);\r\n            t2 = _mm256_cmpeq_epi8(t3, s1);\r\n            t0 = _mm256_subs_epi8(t2, t1); //leftsign\r\n\r\n            t3 = _mm256_min_epu8(s1, s2);\r\n            t1 = _mm256_cmpeq_epi8(t3, s1);\r\n            t2 = _mm256_cmpeq_epi8(t3, s2);\r\n            t3 = _mm256_subs_epi8(t1, t2); //rightsign\r\n\r\n            etype = _mm256_adds_epi8(t0, t3);\r\n\r\n   
         etype = _mm256_adds_epi8(etype, c2);//edgetype=left + right +2\r\n\r\n            t0 = _mm256_shuffle_epi8(off, etype);//get offset\r\n\r\n            //convert byte to short for possible overflow\r\n            t1 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(t0));\r\n            t2 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(t0, 1));\r\n            t3 = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(s1));\r\n            t4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(s1, 1));\r\n\r\n            t1 = _mm256_adds_epi16(t1, t3);\r\n            t2 = _mm256_adds_epi16(t2, t4);\r\n            t0 = _mm256_packus_epi16(t1, t2); //saturated\r\n            t0 = _mm256_permute4x64_epi64(t0, 0xd8);\r\n\r\n            if (x != end_x_32) {\r\n                _mm256_storeu_si256((__m256i*)(p_dst + x), t0);\r\n            } else {\r\n                if (end_x - x >= 16) {\r\n                    _mm_storeu_si128((__m128i*)(p_dst + x), _mm256_castsi256_si128(t0));\r\n                    if (end_x - x > 16) {\r\n                        mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[end_x - end_x_32 - 17]));\r\n                        _mm_maskmoveu_si128(_mm256_extracti128_si256(t0, 1), mask, (char*)(p_dst + x + 16));\r\n                    }\r\n                } else {\r\n                    mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x - end_x_32 - 1]));\r\n                    _mm_maskmoveu_si128(_mm256_castsi256_si128(t0), mask, (char*)(p_dst + x));\r\n                }\r\n                break;\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid SAO_on_block_eo_90_avx2(pel_t *p_dst, int i_dst,\r\n                             const pel_t *p_src, int i_src,\r\n                             int i_block_w, int i_block_h,\r\n                             int bit_depth, const int *lcu_avail, const int 
*sao_offset)\r\n{\r\n    int start_y, end_y;\r\n    int x, y;\r\n    __m256i off;\r\n    __m256i s0, s1, s2;\r\n    __m256i t0, t1, t2, t3, t4, etype;\r\n    __m128i mask, offtmp;\r\n\r\n    __m256i c2 = _mm256_set1_epi8(2);\r\n    int end_x_32 = i_block_w - (i_block_w & 0x1f);\r\n\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    offtmp = _mm_loadu_si128((__m128i*)sao_offset);\r\n    offtmp = _mm_packs_epi32(offtmp, _mm_set_epi32(0, 0, 0, sao_offset[4]));\r\n    offtmp = _mm_packs_epi16(offtmp, _mm_setzero_si128());\r\n\r\n    off = _mm256_castsi128_si256(offtmp);\r\n    off = _mm256_inserti128_si256(off, offtmp, 1);\r\n\r\n    start_y = lcu_avail[SAO_T] ? 0 : 1;\r\n    end_y = lcu_avail[SAO_D] ? i_block_h : (i_block_h - 1);\r\n\r\n    p_dst += start_y * i_dst;\r\n    p_src += start_y * i_src;\r\n\r\n    for (y = start_y; y < end_y; y++) {\r\n        for (x = 0; x < i_block_w; x += 32) {\r\n            s0 = _mm256_lddqu_si256((__m256i*)&p_src[x - i_src]);\r\n            s1 = _mm256_lddqu_si256((__m256i*)&p_src[x]);\r\n            s2 = _mm256_lddqu_si256((__m256i*)&p_src[x + i_src]);\r\n\r\n            t3 = _mm256_min_epu8(s0, s1);\r\n            t1 = _mm256_cmpeq_epi8(t3, s0);\r\n            t2 = _mm256_cmpeq_epi8(t3, s1);\r\n            t0 = _mm256_subs_epi8(t2, t1); //upsign\r\n\r\n            t3 = _mm256_min_epu8(s1, s2);\r\n            t1 = _mm256_cmpeq_epi8(t3, s1);\r\n            t2 = _mm256_cmpeq_epi8(t3, s2);\r\n            t3 = _mm256_subs_epi8(t1, t2); //downsign\r\n\r\n            etype = _mm256_adds_epi8(t0, t3); //edgetype\r\n\r\n            etype = _mm256_adds_epi8(etype, c2);\r\n\r\n            t0 = _mm256_shuffle_epi8(off, etype);//get offset\r\n\r\n            //convert byte to short for possible overflow\r\n            t1 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(t0));\r\n            t2 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(t0, 1));\r\n            t3 = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(s1));\r\n            t4 = 
_mm256_cvtepu8_epi16(_mm256_extracti128_si256(s1, 1));\r\n\r\n            t1 = _mm256_adds_epi16(t1, t3);\r\n            t2 = _mm256_adds_epi16(t2, t4);\r\n            t0 = _mm256_packus_epi16(t1, t2); //saturated\r\n            t0 = _mm256_permute4x64_epi64(t0, 0xd8);\r\n\r\n            if (x != end_x_32) {\r\n                _mm256_storeu_si256((__m256i*)(p_dst + x), t0);\r\n            } else {\r\n                if (i_block_w - x >= 16) {\r\n                    _mm_storeu_si128((__m128i*)(p_dst + x), _mm256_castsi256_si128(t0));\r\n                    if (i_block_w - x > 16) {\r\n                        mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[i_block_w - end_x_32 - 17]));\r\n                        _mm_maskmoveu_si128(_mm256_extracti128_si256(t0, 1), mask, (char*)(p_dst + x + 16));\r\n                    }\r\n                } else {\r\n                    mask = _mm_load_si128((__m128i*)(intrinsic_mask[i_block_w - end_x_32 - 1]));\r\n                    _mm_maskmoveu_si128(_mm256_castsi256_si128(t0), mask, (char*)(p_dst + x));\r\n                }\r\n                break;\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid SAO_on_block_eo_135_avx2(pel_t *p_dst, int i_dst,\r\n                              const pel_t *p_src, int i_src,\r\n                              int i_block_w, int i_block_h,\r\n                              int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    int start_x_r0, end_x_r0, start_x_r, end_x_r, start_x_rn, end_x_rn;\r\n    int x, y;\r\n    __m256i off;\r\n    __m256i s0, s1, s2;\r\n    __m256i t0, t1, t2, t3, t4, etype;\r\n    __m128i mask, offtmp;\r\n    __m256i c2 = _mm256_set1_epi8(2);\r\n    int end_x_r0_32, end_x_r_32, end_x_rn_32;\r\n\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    offtmp = _mm_loadu_si128((__m128i*)sao_offset);\r\n    offtmp = 
_mm_packs_epi32(offtmp, _mm_set_epi32(0, 0, 0, sao_offset[4]));\r\n    offtmp = _mm_packs_epi16(offtmp, _mm_setzero_si128());\r\n\r\n    off = _mm256_castsi128_si256(offtmp);\r\n    off = _mm256_inserti128_si256(off, offtmp, 1);\r\n\r\n    //first row\r\n    start_x_r0 = lcu_avail[SAO_TL] ? 0 : 1;\r\n    end_x_r0 = lcu_avail[SAO_T] ? (lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1)) : 1;\r\n    end_x_r0_32 = end_x_r0 - ((end_x_r0 - start_x_r0) & 0x1f);\r\n    for (x = start_x_r0; x < end_x_r0; x += 32) {\r\n        s0 = _mm256_loadu_si256((__m256i*)&p_src[x - i_src - 1]);\r\n        s1 = _mm256_loadu_si256((__m256i*)&p_src[x]);\r\n        s2 = _mm256_loadu_si256((__m256i*)&p_src[x + i_src + 1]);\r\n\r\n        t3 = _mm256_min_epu8(s0, s1);\r\n        t1 = _mm256_cmpeq_epi8(t3, s0);\r\n        t2 = _mm256_cmpeq_epi8(t3, s1);\r\n        t0 = _mm256_subs_epi8(t2, t1); //upsign\r\n\r\n        t3 = _mm256_min_epu8(s1, s2);\r\n        t1 = _mm256_cmpeq_epi8(t3, s1);\r\n        t2 = _mm256_cmpeq_epi8(t3, s2);\r\n        t3 = _mm256_subs_epi8(t1, t2); //downsign\r\n\r\n        etype = _mm256_adds_epi8(t0, t3); //edgetype\r\n        etype = _mm256_adds_epi8(etype, c2);\r\n\r\n        t0 = _mm256_shuffle_epi8(off, etype);//get offset\r\n        //convert byte to short for possible overflow\r\n        t1 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(t0));\r\n        t2 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(t0, 1));\r\n        t3 = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(s1));\r\n        t4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(s1, 1));\r\n\r\n        t1 = _mm256_adds_epi16(t1, t3);\r\n        t2 = _mm256_adds_epi16(t2, t4);\r\n        t0 = _mm256_packus_epi16(t1, t2); //saturated\r\n        t0 = _mm256_permute4x64_epi64(t0, 0xd8);\r\n\r\n        if (x != end_x_r0_32) {\r\n            _mm256_storeu_si256((__m256i*)(p_dst + x), t0);\r\n        } else {\r\n            if (end_x_r0 - x >= 16) {\r\n                _mm_storeu_si128((__m128i*)(p_dst + x), 
_mm256_castsi256_si128(t0));\r\n                if (end_x_r0 - x > 16) {\r\n                    mask = _mm_loadu_si128((__m128i*)intrinsic_mask[end_x_r0 - end_x_r0_32 - 17]);\r\n                    _mm_maskmoveu_si128(_mm256_extracti128_si256(t0, 1), mask, (char*)(p_dst + x + 16));\r\n                }\r\n            } else {\r\n                mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r0 - end_x_r0_32 - 1]));\r\n                _mm_maskmoveu_si128(_mm256_castsi256_si128(t0), mask, (char*)(p_dst + x));\r\n            }\r\n            break;\r\n        }\r\n    }\r\n    p_dst += i_dst;\r\n    p_src += i_src;\r\n\r\n    //middle rows\r\n    start_x_r = lcu_avail[SAO_L] ? 0 : 1;\r\n    end_x_r = lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1);\r\n    end_x_r_32 = end_x_r - ((end_x_r - start_x_r) & 0x1f);\r\n    for (y = 1; y < i_block_h - 1; y++) {\r\n        for (x = start_x_r; x < end_x_r; x += 32) {\r\n            s0 = _mm256_loadu_si256((__m256i*)&p_src[x - i_src - 1]);\r\n            s1 = _mm256_loadu_si256((__m256i*)&p_src[x]);\r\n            s2 = _mm256_loadu_si256((__m256i*)&p_src[x + i_src + 1]);\r\n\r\n            t3 = _mm256_min_epu8(s0, s1);\r\n            t1 = _mm256_cmpeq_epi8(t3, s0);\r\n            t2 = _mm256_cmpeq_epi8(t3, s1);\r\n            t0 = _mm256_subs_epi8(t2, t1); //upsign\r\n\r\n            t3 = _mm256_min_epu8(s1, s2);\r\n            t1 = _mm256_cmpeq_epi8(t3, s1);\r\n            t2 = _mm256_cmpeq_epi8(t3, s2);\r\n            t3 = _mm256_subs_epi8(t1, t2); //downsign\r\n\r\n            etype = _mm256_adds_epi8(t0, t3); //edgetype\r\n\r\n            etype = _mm256_adds_epi8(etype, c2);\r\n\r\n            t0 = _mm256_shuffle_epi8(off, etype);//get offset\r\n\r\n            //convert byte to short for possible overflow\r\n            t1 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(t0));\r\n            t2 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(t0, 1));\r\n            t3 = 
_mm256_cvtepu8_epi16(_mm256_castsi256_si128(s1));\r\n            t4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(s1, 1));\r\n\r\n            t1 = _mm256_adds_epi16(t1, t3);\r\n            t2 = _mm256_adds_epi16(t2, t4);\r\n            t0 = _mm256_packus_epi16(t1, t2); //saturated\r\n            t0 = _mm256_permute4x64_epi64(t0, 0xd8);\r\n\r\n            if (x != end_x_r_32) {\r\n                _mm256_storeu_si256((__m256i*)(p_dst + x), t0);\r\n            } else {\r\n                if (end_x_r - x >= 16) {\r\n                    _mm_storeu_si128((__m128i*)(p_dst + x), _mm256_castsi256_si128(t0));\r\n                    if (end_x_r - x > 16) {\r\n                        mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r - end_x_r_32 - 17]));\r\n                        _mm_maskmoveu_si128(_mm256_extracti128_si256(t0, 1), mask, (char*)(p_dst + x + 16));\r\n                    }\r\n                } else {\r\n                    mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r - end_x_r_32 - 1]));\r\n                    _mm_maskmoveu_si128(_mm256_castsi256_si128(t0), mask, (char*)(p_dst + x));\r\n                }\r\n                break;\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n    //last row\r\n    start_x_rn = lcu_avail[SAO_D] ? (lcu_avail[SAO_L] ? 0 : 1) : (i_block_w - 1);\r\n    end_x_rn = lcu_avail[SAO_DR] ? 
i_block_w : (i_block_w - 1);\r\n    end_x_rn_32 = end_x_rn - ((end_x_rn - start_x_rn) & 0x1f);\r\n    for (x = start_x_rn; x < end_x_rn; x += 32) {\r\n        s0 = _mm256_loadu_si256((__m256i*)&p_src[x - i_src - 1]);\r\n        s1 = _mm256_loadu_si256((__m256i*)&p_src[x]);\r\n        s2 = _mm256_loadu_si256((__m256i*)&p_src[x + i_src + 1]);\r\n\r\n        t3 = _mm256_min_epu8(s0, s1);\r\n        t1 = _mm256_cmpeq_epi8(t3, s0);\r\n        t2 = _mm256_cmpeq_epi8(t3, s1);\r\n        t0 = _mm256_subs_epi8(t2, t1); //upsign\r\n\r\n        t3 = _mm256_min_epu8(s1, s2);\r\n        t1 = _mm256_cmpeq_epi8(t3, s1);\r\n        t2 = _mm256_cmpeq_epi8(t3, s2);\r\n        t3 = _mm256_subs_epi8(t1, t2); //downsign\r\n\r\n        etype = _mm256_adds_epi8(t0, t3); //edgetype\r\n\r\n        etype = _mm256_adds_epi8(etype, c2);\r\n\r\n        t0 = _mm256_shuffle_epi8(off, etype);//get offset\r\n\r\n        //convert byte to short for possible overflow\r\n        t1 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(t0));\r\n        t2 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(t0, 1));\r\n        t3 = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(s1));\r\n        t4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(s1, 1));\r\n\r\n        t1 = _mm256_adds_epi16(t1, t3);\r\n        t2 = _mm256_adds_epi16(t2, t4);\r\n        t0 = _mm256_packus_epi16(t1, t2); //saturated\r\n        t0 = _mm256_permute4x64_epi64(t0, 0xd8);\r\n\r\n        if (x != end_x_rn_32) {\r\n            _mm256_storeu_si256((__m256i*)(p_dst + x), t0);\r\n        } else {\r\n            if (end_x_rn - x >= 16) {\r\n                _mm_storeu_si128((__m128i*)(p_dst + x), _mm256_castsi256_si128(t0));\r\n                if (end_x_rn - x > 16) {\r\n                    mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_rn - end_x_rn_32 - 17]));\r\n                    _mm_maskmoveu_si128(_mm256_extracti128_si256(t0, 1), mask, (char*)(p_dst + x + 16));\r\n                }\r\n            } else {\r\n                mask = 
_mm_load_si128((__m128i*)(intrinsic_mask[end_x_rn - end_x_rn_32 - 1]));\r\n                _mm_maskmoveu_si128(_mm256_castsi256_si128(t0), mask, (char*)(p_dst + x));\r\n            }\r\n            break;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid SAO_on_block_eo_45_avx2(pel_t *p_dst, int i_dst,\r\n                             const pel_t *p_src, int i_src,\r\n                             int i_block_w, int i_block_h,\r\n                             int bit_depth, const int *lcu_avail, const int *sao_offset)\r\n{\r\n    int start_x_r0, end_x_r0, start_x_r, end_x_r, start_x_rn, end_x_rn;\r\n    int x, y;\r\n    __m256i off;\r\n    __m256i s0, s1, s2;\r\n    __m256i t0, t1, t2, t3, t4, etype;\r\n    __m128i mask, offtmp;\r\n    __m256i c2 = _mm256_set1_epi8(2);\r\n    int end_x_r0_32, end_x_r_32, end_x_rn_32;\r\n\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    offtmp = _mm_loadu_si128((__m128i*)sao_offset);\r\n    offtmp = _mm_packs_epi32(offtmp, _mm_set_epi32(0, 0, 0, sao_offset[4]));\r\n    offtmp = _mm_packs_epi16(offtmp, _mm_setzero_si128());\r\n\r\n    off = _mm256_castsi128_si256(offtmp);\r\n    off = _mm256_inserti128_si256(off, offtmp, 1);\r\n\r\n    start_x_r0 = lcu_avail[SAO_T] ? (lcu_avail[SAO_L] ? 0 : 1) : (i_block_w - 1);\r\n    end_x_r0 = lcu_avail[SAO_TR] ? 
i_block_w : (i_block_w - 1);\r\n    end_x_r0_32 = end_x_r0 - ((end_x_r0 - start_x_r0) & 0x1f);\r\n\r\n    //first row\r\n    for (x = start_x_r0; x < end_x_r0; x += 32) {\r\n        s0 = _mm256_loadu_si256((__m256i*)&p_src[x - i_src + 1]);\r\n        s1 = _mm256_loadu_si256((__m256i*)&p_src[x]);\r\n        s2 = _mm256_loadu_si256((__m256i*)&p_src[x + i_src - 1]);\r\n\r\n        t3 = _mm256_min_epu8(s0, s1);\r\n        t1 = _mm256_cmpeq_epi8(t3, s0);\r\n        t2 = _mm256_cmpeq_epi8(t3, s1);\r\n        t0 = _mm256_subs_epi8(t2, t1); //upsign\r\n\r\n        t3 = _mm256_min_epu8(s1, s2);\r\n        t1 = _mm256_cmpeq_epi8(t3, s1);\r\n        t2 = _mm256_cmpeq_epi8(t3, s2);\r\n        t3 = _mm256_subs_epi8(t1, t2); //downsign\r\n\r\n        etype = _mm256_adds_epi8(t0, t3); //edgetype\r\n\r\n        etype = _mm256_adds_epi8(etype, c2);\r\n\r\n        t0 = _mm256_shuffle_epi8(off, etype);//get offset\r\n\r\n        //convert byte to short for possible overflow\r\n        t1 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(t0));\r\n        t2 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(t0, 1));\r\n        t3 = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(s1));\r\n        t4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(s1, 1));\r\n\r\n        t1 = _mm256_adds_epi16(t1, t3);\r\n        t2 = _mm256_adds_epi16(t2, t4);\r\n        t0 = _mm256_packus_epi16(t1, t2); //saturated\r\n        t0 = _mm256_permute4x64_epi64(t0, 0xd8);\r\n\r\n        if (x != end_x_r0_32) {\r\n            _mm256_storeu_si256((__m256i*)(p_dst + x), t0);\r\n        } else {\r\n            if (end_x_r0 - x >= 16) {\r\n                _mm_storeu_si128((__m128i*)(p_dst + x), _mm256_castsi256_si128(t0));\r\n                if (end_x_r0 - x > 16) {\r\n                    mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r0 - end_x_r0_32 - 17]));\r\n                    _mm_maskmoveu_si128(_mm256_extracti128_si256(t0, 1), mask, (char*)(p_dst + x + 16));\r\n                }\r\n            } else {\r\n   
             mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r0 - end_x_r0_32 - 1]));\r\n                _mm_maskmoveu_si128(_mm256_castsi256_si128(t0), mask, (char*)(p_dst + x));\r\n            }\r\n            break;\r\n        }\r\n    }\r\n    p_dst += i_dst;\r\n    p_src += i_src;\r\n\r\n    //middle rows\r\n    start_x_r = lcu_avail[SAO_L] ? 0 : 1;\r\n    end_x_r = lcu_avail[SAO_R] ? i_block_w : (i_block_w - 1);\r\n    end_x_r_32 = end_x_r - ((end_x_r - start_x_r) & 0x1f);\r\n    for (y = 1; y < i_block_h - 1; y++) {\r\n        for (x = start_x_r; x < end_x_r; x += 32) {\r\n            s0 = _mm256_loadu_si256((__m256i*)&p_src[x - i_src + 1]);\r\n            s1 = _mm256_loadu_si256((__m256i*)&p_src[x]);\r\n            s2 = _mm256_loadu_si256((__m256i*)&p_src[x + i_src - 1]);\r\n\r\n            t3 = _mm256_min_epu8(s0, s1);\r\n            t1 = _mm256_cmpeq_epi8(t3, s0);\r\n            t2 = _mm256_cmpeq_epi8(t3, s1);\r\n            t0 = _mm256_subs_epi8(t2, t1); //upsign\r\n\r\n            t3 = _mm256_min_epu8(s1, s2);\r\n            t1 = _mm256_cmpeq_epi8(t3, s1);\r\n            t2 = _mm256_cmpeq_epi8(t3, s2);\r\n            t3 = _mm256_subs_epi8(t1, t2); //downsign\r\n\r\n            etype = _mm256_adds_epi8(t0, t3); //edgetype\r\n\r\n            etype = _mm256_adds_epi8(etype, c2);\r\n\r\n            t0 = _mm256_shuffle_epi8(off, etype);//get offset\r\n\r\n            //convert byte to short for possible overflow\r\n            t1 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(t0));\r\n            t2 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(t0, 1));\r\n            t3 = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(s1));\r\n            t4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(s1, 1));\r\n\r\n            t1 = _mm256_adds_epi16(t1, t3);\r\n            t2 = _mm256_adds_epi16(t2, t4);\r\n            t0 = _mm256_packus_epi16(t1, t2); //saturated\r\n            t0 = _mm256_permute4x64_epi64(t0, 0xd8);\r\n\r\n            if (x != end_x_r_32) {\r\n 
               _mm256_storeu_si256((__m256i*)(p_dst + x), t0);\r\n            } else {\r\n                if (end_x_r - x >= 16) {\r\n                    _mm_storeu_si128((__m128i*)(p_dst + x), _mm256_castsi256_si128(t0));\r\n                    if (end_x_r - x > 16) {\r\n                        mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r - end_x_r_32 - 17]));\r\n                        _mm_maskmoveu_si128(_mm256_extracti128_si256(t0, 1), mask, (char*)(p_dst + x + 16));\r\n                    }\r\n                } else {\r\n                    mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_r - end_x_r_32 - 1]));\r\n                    _mm_maskmoveu_si128(_mm256_castsi256_si128(t0), mask, (char*)(p_dst + x));\r\n                }\r\n                break;\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n\r\n    //last row\r\n    start_x_rn = lcu_avail[SAO_DL] ? 0 : 1;\r\n    end_x_rn = lcu_avail[SAO_D] ? (lcu_avail[SAO_R] ? 
i_block_w : (i_block_w - 1)) : 1;\r\n    end_x_rn_32 = end_x_rn - ((end_x_rn - start_x_rn) & 0x1f);\r\n    for (x = start_x_rn; x < end_x_rn; x += 32) {\r\n        s0 = _mm256_loadu_si256((__m256i*)&p_src[x - i_src + 1]);\r\n        s1 = _mm256_loadu_si256((__m256i*)&p_src[x]);\r\n        s2 = _mm256_loadu_si256((__m256i*)&p_src[x + i_src - 1]);\r\n\r\n        t3 = _mm256_min_epu8(s0, s1);\r\n        t1 = _mm256_cmpeq_epi8(t3, s0);\r\n        t2 = _mm256_cmpeq_epi8(t3, s1);\r\n        t0 = _mm256_subs_epi8(t2, t1); //upsign\r\n\r\n        t3 = _mm256_min_epu8(s1, s2);\r\n        t1 = _mm256_cmpeq_epi8(t3, s1);\r\n        t2 = _mm256_cmpeq_epi8(t3, s2);\r\n        t3 = _mm256_subs_epi8(t1, t2); //downsign\r\n\r\n        etype = _mm256_adds_epi8(t0, t3); //edgetype\r\n\r\n        etype = _mm256_adds_epi8(etype, c2);\r\n\r\n        t0 = _mm256_shuffle_epi8(off, etype);//get offset\r\n\r\n        //convert byte to short for possible overflow\r\n        t1 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(t0));\r\n        t2 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(t0, 1));\r\n        t3 = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(s1));\r\n        t4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(s1, 1));\r\n\r\n        t1 = _mm256_adds_epi16(t1, t3);\r\n        t2 = _mm256_adds_epi16(t2, t4);\r\n        t0 = _mm256_packus_epi16(t1, t2); //saturated\r\n        t0 = _mm256_permute4x64_epi64(t0, 0xd8);\r\n\r\n        if (x != end_x_rn_32) {\r\n            _mm256_storeu_si256((__m256i*)(p_dst + x), t0);\r\n        } else {\r\n            if (end_x_rn - x >= 16) {\r\n                _mm_storeu_si128((__m128i*)(p_dst + x), _mm256_castsi256_si128(t0));\r\n                if (end_x_rn - x > 16) {\r\n                    mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x_rn - end_x_rn_32 - 17]));\r\n                    _mm_maskmoveu_si128(_mm256_extracti128_si256(t0, 1), mask, (char*)(p_dst + x + 16));\r\n                }\r\n            } else {\r\n                mask 
= _mm_load_si128((__m128i*)(intrinsic_mask[end_x_rn - end_x_rn_32 - 1]));\r\n                _mm_maskmoveu_si128(_mm256_castsi256_si128(t0), mask, (char*)(p_dst + x));\r\n            }\r\n            break;\r\n        }\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nvoid SAO_on_block_bo_avx2(pel_t *p_dst, int i_dst,\r\n                          const pel_t *p_src, int i_src,\r\n                          int i_block_w, int i_block_h,\r\n                          int bit_depth, const sao_param_t *sao_param)\r\n{\r\n    __m256i r0, r1, r2, r3, off0, off1, off2, off3;\r\n    __m256i t0, t1, t2, t3, t4, src0, src1;\r\n    __m128i mask = _mm_setzero_si128();\r\n    int x, y;\r\n    int shift_bo = bit_depth - NUM_SAO_BO_CLASSES_IN_BIT;\r\n    __m256i shift_mask = _mm256_set1_epi8(31);\r\n    int end_x    = i_block_w;\r\n    int end_x_32 = end_x - ((end_x - 0) & 0x1f);\r\n\r\n    UNUSED_PARAMETER(bit_depth);\r\n\r\n    r0   = _mm256_set1_epi8((int8_t)(sao_param->startBand));\r\n    r1   = _mm256_set1_epi8((int8_t)((sao_param->startBand + 1) & 31));\r\n    r2   = _mm256_set1_epi8((int8_t)(sao_param->startBand2));\r\n    r3   = _mm256_set1_epi8((int8_t)((sao_param->startBand2 + 1) & 31));\r\n\r\n    off0 = _mm256_set1_epi8((int8_t)sao_param->offset[sao_param->startBand]);\r\n    off1 = _mm256_set1_epi8((int8_t)sao_param->offset[(sao_param->startBand + 1) & 31]);\r\n    off2 = _mm256_set1_epi8((int8_t)sao_param->offset[sao_param->startBand2]);\r\n    off3 = _mm256_set1_epi8((int8_t)sao_param->offset[(sao_param->startBand2 + 1) & 31]);\r\n\r\n    for (y = 0; y < i_block_h; y++) {\r\n        for (x = 0; x < i_block_w; x += 32){\r\n            src0 = _mm256_loadu_si256((__m256i*)&p_src[x]);\r\n            src1 = _mm256_srli_epi16(src0, shift_bo);\r\n            src1 = _mm256_and_si256(src1, shift_mask);\r\n\r\n            t0 = _mm256_cmpeq_epi8(src1, r0);\r\n            t1 = _mm256_cmpeq_epi8(src1, r1);\r\n            
t2 = _mm256_cmpeq_epi8(src1, r2);\r\n            t3 = _mm256_cmpeq_epi8(src1, r3);\r\n\r\n            t0 = _mm256_and_si256(t0, off0);\r\n            t1 = _mm256_and_si256(t1, off1);\r\n            t2 = _mm256_and_si256(t2, off2);\r\n            t3 = _mm256_and_si256(t3, off3);\r\n            t0 = _mm256_or_si256(t0, t1);\r\n            t2 = _mm256_or_si256(t2, t3);\r\n            t0 = _mm256_or_si256(t0, t2);\r\n\r\n            //convert byte to short for possible overflow\r\n            t1 = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(t0));\r\n            t2 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(t0, 1));\r\n            t3 = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(src0));\r\n            t4 = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(src0, 1));\r\n\r\n            t1 = _mm256_add_epi16(t1, t3);\r\n            t2 = _mm256_add_epi16(t2, t4);\r\n            t0 = _mm256_packus_epi16(t1, t2); //saturated\r\n            t0 = _mm256_permute4x64_epi64(t0, 0xd8);\r\n\r\n            if (x < end_x_32) {\r\n                _mm256_storeu_si256((__m256i*)(p_dst + x), t0);\r\n            } else {\r\n                if (end_x - x >= 16) {\r\n                    _mm_storeu_si128((__m128i*)(p_dst + x), _mm256_castsi256_si128(t0));\r\n                    if (end_x - x > 16) {\r\n                        mask = _mm_loadu_si128((__m128i*)(intrinsic_mask[end_x - end_x_32 - 17]));\r\n                        _mm_maskmoveu_si128(_mm256_extracti128_si256(t0, 1), mask, (char*)(p_dst + x + 16));\r\n                    }\r\n                } else {\r\n                    mask = _mm_load_si128((__m128i*)(intrinsic_mask[end_x - end_x_32 - 1]));\r\n                    _mm_maskmoveu_si128(_mm256_castsi256_si128(t0), mask, (char*)(p_dst + x));\r\n                }\r\n                break;\r\n            }\r\n        }\r\n        p_dst += i_dst;\r\n        p_src += i_src;\r\n    }\r\n}\r\n#endif // !HIGH_BIT_DEPTH\r\n"
  },
  {
    "path": "source/common/vlc.h",
    "content": "/*\r\n *  vlc.h\r\n *\r\n * Description of this file:\r\n *    VLC functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_VLC_H\r\n#define DAVS2_VLC_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n* reads bits from the bitstream buffer\r\n* Input:\r\n*      p_buf     - containing VLC-coded data bits\r\n*      i_bit_pos - bit offset from start of partition\r\n*      i_buf     - total bytes in bitstream\r\n*      i_bits    - number of bits to read\r\n* return 0 for success, otherwise failure\r\n*/\r\nstatic INLINE\r\nint read_bits(uint8_t *p_buf, int i_buf, int i_bit_pos, int *p_info, int i_bits)\r\n{\r\n    int 
byte_offset = i_bit_pos >> 3;        // byte from start of buffer\r\n    int bit_offset  = 7 - (i_bit_pos & 7);   // bit  from start of byte\r\n    int inf = 0;\r\n\r\n    while (i_bits--) {\r\n        inf <<= 1;\r\n        inf |= (p_buf[byte_offset] & (1 << bit_offset)) >> bit_offset;\r\n\r\n        bit_offset--;\r\n        if (bit_offset < 0) {\r\n            byte_offset++;\r\n            bit_offset += 8;\r\n\r\n            if (byte_offset > i_buf) {\r\n                return -1;      /* error */\r\n            }\r\n        }\r\n    }\r\n    *p_info = inf;\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n* RETURN: the length of symbol, or -1 on error\r\n*/\r\nstatic INLINE\r\nint get_vlc_symbol(uint8_t *p_buf, int i_bit_pos, int *info, int i_buf)\r\n{\r\n    int byte_offset = i_bit_pos >> 3;         // byte from start of buffer\r\n    int bit_offset  = 7 - (i_bit_pos & 7);    // bit from start of byte\r\n    int bit_counter = 1;\r\n    int len = 1;\r\n    int ctr_bit;         // control bit for current bit position\r\n    int info_bit;\r\n    int inf;\r\n\r\n    ctr_bit = (p_buf[byte_offset] & (1 << bit_offset));     // set up control bit\r\n    while (ctr_bit == 0) {\r\n        // find leading 1 bit\r\n        len++;\r\n        bit_offset -= 1;\r\n        bit_counter++;\r\n\r\n        if (bit_offset < 0) {\r\n            // finish with current byte ?\r\n            bit_offset = bit_offset + 8;\r\n            byte_offset++;\r\n        }\r\n\r\n        ctr_bit = (p_buf[byte_offset] & (1 << bit_offset));  // set up control bit\r\n    }\r\n\r\n    // make info-word\r\n    inf = 0;        // shortest possible code is 1, then info is always 0\r\n    for (info_bit = 0; (info_bit < (len - 1)); info_bit++) {\r\n        bit_counter++;\r\n        bit_offset--;\r\n\r\n        if (bit_offset < 0) {\r\n            // finished with current byte ?\r\n            bit_offset = bit_offset + 8;\r\n            
byte_offset++;\r\n        }\r\n\r\n        if (byte_offset > i_buf) {\r\n            return -1;          /* error */\r\n        }\r\n\r\n        inf = (inf << 1);\r\n        if (p_buf[byte_offset] & (0x01 << (bit_offset))) {\r\n            inf |= 1;\r\n        }\r\n    }\r\n    *info = inf;\r\n\r\n    // return absolute offset in bit from start of frame\r\n    return bit_counter;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n* reads an u(v) syntax element (FLC codeword) from UVLC-partition\r\n* RETURN: the value of the coded syntax element, or -1 on error\r\n*/\r\nstatic INLINE\r\nint vlc_u_v(davs2_bs_t *bs, int i_bits\r\n#if AVS2_TRACE\r\n            , char *tracestring\r\n#endif\r\n            )\r\n{\r\n    int ret_val = 0;\r\n\r\n    if (read_bits(bs->p_stream, bs->i_stream, bs->i_bit_pos, &ret_val, i_bits) == 0) {\r\n        bs->i_bit_pos += i_bits;    /* move bitstream pointer */\r\n\r\n#if AVS2_TRACE\r\n        avs2_trace_string(tracestring, ret_val, i_bits);\r\n#endif\r\n\r\n        return ret_val;\r\n    }\r\n\r\n    return -1;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n* reads an ue(v) syntax element\r\n* RETURN: the value of the coded syntax element, or -1 on error\r\n*/\r\nstatic INLINE\r\nint vlc_ue_v(davs2_bs_t *bs\r\n#if AVS2_TRACE\r\n             , char *tracestring\r\n#endif\r\n             )\r\n{\r\n    int len, info;\r\n    int ret_val;\r\n\r\n    len = get_vlc_symbol(bs->p_stream, bs->i_bit_pos, &info, bs->i_stream);\r\n    if (len == -1) {\r\n        return -1;              /* error */\r\n    }\r\n\r\n    bs->i_bit_pos += len;\r\n\r\n    // cal:   pow(2, (len / 2)) + info - 1;\r\n    ret_val = (1 << (len >> 1)) + info - 1;\r\n\r\n#if AVS2_TRACE\r\n    avs2_trace_string2(tracestring, ret_val + 1, ret_val, len);\r\n#endif\r\n\r\n    return ret_val;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n* reads 
an se(v) syntax element\r\n* RETURN: the value of the coded syntax element, or -1 on error\r\n*/\r\nstatic INLINE\r\nint vlc_se_v(davs2_bs_t *bs\r\n#if AVS2_TRACE\r\n             , char *tracestring\r\n#endif\r\n             )\r\n{\r\n    int len, info;\r\n    int ret_val;\r\n    int n;\r\n\r\n    len = get_vlc_symbol(bs->p_stream, bs->i_bit_pos, &info, bs->i_stream);\r\n    if (len == -1) {\r\n        return -1;              /* error */\r\n    }\r\n\r\n    bs->i_bit_pos += len;\r\n\r\n    // cal: (int)pow(2, (len / 2)) + info - 1;\r\n    n = (1 << (len >> 1)) + info - 1;\r\n    ret_val = (n + 1) >> 1;\r\n    if ((n & 1) == 0) {         /* lsb is signed bit */\r\n        ret_val = -ret_val;\r\n    }\r\n\r\n#if AVS2_TRACE\r\n    avs2_trace_string2(tracestring, n + 1, ret_val, len);\r\n#endif\r\n\r\n    return ret_val;\r\n}\r\n\r\n\r\n#if AVS2_TRACE\r\n#define u_flag(bs, tracestring)         (bool_t)vlc_u_v(bs, 1, tracestring)\r\n#define u_v(bs, i_bits, tracestring)    vlc_u_v(bs, i_bits, tracestring)\r\n#define ue_v(bs, tracestring)           vlc_ue_v(bs, tracestring)\r\n#define se_v(bs, tracestring)           vlc_se_v(bs, tracestring)\r\n#else\r\n#define u_flag(bs, tracestring)         (bool_t)vlc_u_v(bs, 1)\r\n#define u_v(bs, i_bits, tracestring)    vlc_u_v(bs, i_bits)\r\n#define ue_v(bs, tracestring)           vlc_ue_v(bs)\r\n#define se_v(bs, tracestring)           vlc_se_v(bs)\r\n#endif\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_VLC_H\r\n"
  },
  {
    "path": "source/common/win32thread.cc",
    "content": "/*****************************************************************************\r\n * win32thread.c: windows threading\r\n *****************************************************************************\r\n * Copyright (C) 2010-2017 x264 project\r\n *\r\n * Authors: Steven Walters <kemuri9@gmail.com>\r\n *          Pegasys Inc. <http://www.pegasys-inc.com>\r\n *          Henrik Gramner <henrik@gramner.com>\r\n *\r\n * This program is free software; you can redistribute it and/or modify\r\n * it under the terms of the GNU General Public License as published by\r\n * the Free Software Foundation; either version 2 of the License, or\r\n * (at your option) any later version.\r\n *\r\n * This program is distributed in the hope that it will be useful,\r\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n * GNU General Public License for more details.\r\n *\r\n * You should have received a copy of the GNU General Public License\r\n * along with this program; if not, write to the Free Software\r\n * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n * This program is also available under a commercial proprietary license.\r\n * For more information, contact us at licensing@x264.com.\r\n *****************************************************************************/\r\n\r\n/*\r\n * changes of this file:\r\n *    modified for davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n */\r\n\r\n/* Microsoft's way of supporting systems with >64 logical cpus can be found at\r\n * http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx */\r\n\r\n/* Based on the agreed standing 
that davs2 decoder does not need to utilize >64 logical cpus,\r\n * this API does not detect nor utilize more than 64 cpus for systems that have them. */\r\n\r\n#include \"common.h\"\r\n\r\n#if HAVE_WIN32THREAD\r\n#include <process.h>\r\n\r\n/**\r\n * ===========================================================================\r\n * type defines\r\n * ===========================================================================\r\n */\r\n\r\n/* number of times to spin a thread about to block on a locked mutex before retrying and sleeping if still locked */\r\n#define XAVS2_SPIN_COUNT 0\r\n\r\n/* GROUP_AFFINITY struct */\r\ntypedef struct {\r\n    ULONG_PTR   mask;           // KAFFINITY = ULONG_PTR\r\n    USHORT      group;\r\n    USHORT      reserved[3];\r\n} davs2_group_affinity_t;\r\n\r\ntypedef void (WINAPI *cond_func_t)(davs2_thread_cond_t *cond);\r\ntypedef BOOL (WINAPI *cond_wait_t)(davs2_thread_cond_t *cond, davs2_thread_mutex_t *mutex, DWORD milliseconds);\r\n\r\ntypedef struct {\r\n    /* global mutex for replacing MUTEX_INITIALIZER instances */\r\n    davs2_thread_mutex_t static_mutex;\r\n\r\n    /* function pointers to conditional variable API on windows 6.0+ kernels */\r\n    cond_func_t cond_broadcast;\r\n    cond_func_t cond_init;\r\n    cond_func_t cond_signal;\r\n    cond_wait_t cond_wait;\r\n} davs2_win32thread_control_t;\r\n\r\nstatic davs2_win32thread_control_t thread_control;\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* _beginthreadex requires that the start routine is __stdcall */\r\nstatic unsigned __stdcall davs2_win32thread_worker(void *arg)\r\n{\r\n    davs2_thread_t *h = (davs2_thread_t *)arg;\r\n\r\n    h->ret = h->func(h->arg);\r\n\r\n    return 0;\r\n}\r\n\r\nint davs2_thread_create(davs2_thread_t *thread, const davs2_thread_attr_t *attr,\r\n                        
void *(*start_routine)(void *), void *arg)\r\n{\r\n    UNUSED_PARAMETER(attr);\r\n\r\n    thread->func   = start_routine;\r\n    thread->arg    = arg;\r\n    thread->handle = (void *)_beginthreadex(NULL, 0, davs2_win32thread_worker, thread, 0, NULL);\r\n    return !thread->handle;\r\n}\r\n\r\nint davs2_thread_join(davs2_thread_t thread, void **value_ptr)\r\n{\r\n    DWORD ret = WaitForSingleObject(thread.handle, INFINITE);\r\n\r\n    if (ret != WAIT_OBJECT_0) {\r\n        return -1;\r\n    }\r\n    if (value_ptr) {\r\n        *value_ptr = thread.ret;\r\n    }\r\n    CloseHandle(thread.handle);\r\n\r\n    return 0;\r\n}\r\n\r\nint davs2_thread_mutex_init(davs2_thread_mutex_t *mutex, const davs2_thread_mutexattr_t *attr)\r\n{\r\n    UNUSED_PARAMETER(attr);\r\n    return !InitializeCriticalSectionAndSpinCount(mutex, XAVS2_SPIN_COUNT);\r\n}\r\n\r\nint davs2_thread_mutex_destroy(davs2_thread_mutex_t *mutex)\r\n{\r\n    DeleteCriticalSection(mutex);\r\n    return 0;\r\n}\r\n\r\nint davs2_thread_mutex_lock(davs2_thread_mutex_t *mutex)\r\n{\r\n    static davs2_thread_mutex_t init = DAVS2_THREAD_MUTEX_INITIALIZER;\r\n\r\n    if (!memcmp(mutex, &init, sizeof(davs2_thread_mutex_t))) {\r\n        *mutex = thread_control.static_mutex;\r\n    }\r\n    EnterCriticalSection(mutex);\r\n\r\n    return 0;\r\n}\r\n\r\nint davs2_thread_mutex_unlock(davs2_thread_mutex_t *mutex)\r\n{\r\n    LeaveCriticalSection(mutex);\r\n    return 0;\r\n}\r\n\r\n/* for pre-Windows 6.0 platforms we need to define and use our own condition variable and api */\r\ntypedef struct {\r\n    davs2_thread_mutex_t mtx_broadcast;\r\n    davs2_thread_mutex_t mtx_waiter_count;\r\n    int waiter_count;\r\n    HANDLE semaphore;\r\n    HANDLE waiters_done;\r\n    int is_broadcast;\r\n} davs2_win32_cond_t;\r\n\r\nint davs2_thread_cond_init(davs2_thread_cond_t *cond, const davs2_thread_condattr_t *attr)\r\n{\r\n    davs2_win32_cond_t *win32_cond;\r\n\r\n    UNUSED_PARAMETER(attr);\r\n\r\n    if 
(thread_control.cond_init) {\r\n        thread_control.cond_init(cond);\r\n        return 0;\r\n    }\r\n\r\n    /* non-native condition variables */\r\n    win32_cond = (davs2_win32_cond_t *)davs2_malloc(sizeof(davs2_win32_cond_t));\r\n    if (!win32_cond) {\r\n        return -1;\r\n    }\r\n    /* clear the structure only after the allocation is known to have succeeded */\r\n    memset(win32_cond, 0, sizeof(davs2_win32_cond_t));\r\n    cond->ptr = win32_cond;\r\n    win32_cond->semaphore = CreateSemaphore(NULL, 0, 0x7fffffff, NULL);\r\n    if (!win32_cond->semaphore) {\r\n        return -1;\r\n    }\r\n\r\n    if (davs2_thread_mutex_init(&win32_cond->mtx_waiter_count, NULL)) {\r\n        return -1;\r\n    }\r\n    if (davs2_thread_mutex_init(&win32_cond->mtx_broadcast, NULL)) {\r\n        return -1;\r\n    }\r\n\r\n    win32_cond->waiters_done = CreateEvent(NULL, FALSE, FALSE, NULL);\r\n    if (!win32_cond->waiters_done) {\r\n        return -1;\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\nint davs2_thread_cond_destroy(davs2_thread_cond_t *cond)\r\n{\r\n    davs2_win32_cond_t *win32_cond;\r\n\r\n    /* native condition variables do not destroy */\r\n    if (thread_control.cond_init) {\r\n        return 0;\r\n    }\r\n\r\n    /* non-native condition variables */\r\n    win32_cond = (davs2_win32_cond_t *)cond->ptr;\r\n    CloseHandle(win32_cond->semaphore);\r\n    CloseHandle(win32_cond->waiters_done);\r\n    davs2_thread_mutex_destroy(&win32_cond->mtx_broadcast);\r\n    davs2_thread_mutex_destroy(&win32_cond->mtx_waiter_count);\r\n    davs2_free(win32_cond);\r\n\r\n    return 0;\r\n}\r\n\r\nint davs2_thread_cond_broadcast(davs2_thread_cond_t *cond)\r\n{\r\n    davs2_win32_cond_t *win32_cond;\r\n    int have_waiter = 0;\r\n\r\n    if (thread_control.cond_broadcast) {\r\n        thread_control.cond_broadcast(cond);\r\n        return 0;\r\n    }\r\n\r\n    /* non-native condition variables */\r\n    win32_cond = (davs2_win32_cond_t *)cond->ptr;\r\n    davs2_thread_mutex_lock(&win32_cond->mtx_broadcast);\r\n    
davs2_thread_mutex_lock(&win32_cond->mtx_waiter_count);\r\n\r\n    if (win32_cond->waiter_count) {\r\n        win32_cond->is_broadcast = 1;\r\n        have_waiter = 1;\r\n    }\r\n\r\n    if (have_waiter) {\r\n        ReleaseSemaphore(win32_cond->semaphore, win32_cond->waiter_count, NULL);\r\n        davs2_thread_mutex_unlock(&win32_cond->mtx_waiter_count);\r\n        WaitForSingleObject(win32_cond->waiters_done, INFINITE);\r\n        win32_cond->is_broadcast = 0;\r\n    } else {\r\n        davs2_thread_mutex_unlock(&win32_cond->mtx_waiter_count);\r\n    }\r\n\r\n    return davs2_thread_mutex_unlock(&win32_cond->mtx_broadcast);\r\n}\r\n\r\nint davs2_thread_cond_signal(davs2_thread_cond_t *cond)\r\n{\r\n    davs2_win32_cond_t *win32_cond;\r\n    int have_waiter;\r\n\r\n    if (thread_control.cond_signal) {\r\n        thread_control.cond_signal(cond);\r\n        return 0;\r\n    }\r\n\r\n    /* non-native condition variables */\r\n    win32_cond = (davs2_win32_cond_t *)cond->ptr;\r\n    davs2_thread_mutex_lock(&win32_cond->mtx_waiter_count);\r\n    have_waiter = win32_cond->waiter_count;\r\n    davs2_thread_mutex_unlock(&win32_cond->mtx_waiter_count);\r\n\r\n    if (have_waiter) {\r\n        ReleaseSemaphore(win32_cond->semaphore, 1, NULL);\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\nint davs2_thread_cond_wait(davs2_thread_cond_t *cond, davs2_thread_mutex_t *mutex)\r\n{\r\n    davs2_win32_cond_t *win32_cond;\r\n    int last_waiter;\r\n\r\n    if (thread_control.cond_wait) {\r\n        return !thread_control.cond_wait(cond, mutex, INFINITE);\r\n    }\r\n\r\n    /* non native condition variables */\r\n    win32_cond = (davs2_win32_cond_t *)cond->ptr;\r\n\r\n    davs2_thread_mutex_lock(&win32_cond->mtx_broadcast);\r\n    davs2_thread_mutex_unlock(&win32_cond->mtx_broadcast);\r\n\r\n    davs2_thread_mutex_lock(&win32_cond->mtx_waiter_count);\r\n    win32_cond->waiter_count++;\r\n    davs2_thread_mutex_unlock(&win32_cond->mtx_waiter_count);\r\n\r\n    // unlock the external 
mutex\r\n    davs2_thread_mutex_unlock(mutex);\r\n    WaitForSingleObject(win32_cond->semaphore, INFINITE);\r\n\r\n    davs2_thread_mutex_lock(&win32_cond->mtx_waiter_count);\r\n    win32_cond->waiter_count--;\r\n    last_waiter = !win32_cond->waiter_count && win32_cond->is_broadcast;\r\n    davs2_thread_mutex_unlock(&win32_cond->mtx_waiter_count);\r\n\r\n    if (last_waiter) {\r\n        SetEvent(win32_cond->waiters_done);\r\n    }\r\n\r\n    // lock the external mutex\r\n    return davs2_thread_mutex_lock(mutex);\r\n}\r\n\r\nint davs2_win32_threading_init(void)\r\n{\r\n    /* find function pointers to API functions, if they exist */\r\n    HMODULE kernel_dll = GetModuleHandle(TEXT(\"kernel32\"));\r\n\r\n    thread_control.cond_init = (cond_func_t)GetProcAddress(kernel_dll, \"InitializeConditionVariable\");\r\n    if (thread_control.cond_init) {\r\n        /* we're on a windows 6.0+ kernel, acquire the rest of the functions */\r\n        thread_control.cond_broadcast = (cond_func_t)GetProcAddress(kernel_dll, \"WakeAllConditionVariable\");\r\n        thread_control.cond_signal = (cond_func_t)GetProcAddress(kernel_dll, \"WakeConditionVariable\");\r\n        thread_control.cond_wait = (cond_wait_t)GetProcAddress(kernel_dll, \"SleepConditionVariableCS\");\r\n    }\r\n    return davs2_thread_mutex_init(&thread_control.static_mutex, NULL);\r\n}\r\n\r\nvoid davs2_win32_threading_destroy(void)\r\n{\r\n    davs2_thread_mutex_destroy(&thread_control.static_mutex);\r\n    memset(&thread_control, 0, sizeof(davs2_win32thread_control_t));\r\n}\r\n\r\nint davs2_thread_num_processors_np()\r\n{\r\n    DWORD_PTR system_cpus, process_cpus = 0;\r\n    int cpus = 0;\r\n    DWORD_PTR bit;\r\n\r\n    /* GetProcessAffinityMask returns affinities of 0 when the process has threads in multiple processor groups.\r\n     * On platforms that support processor grouping, use GetThreadGroupAffinity to get the current thread's affinity instead. 
*/\r\n#if ARCH_X86_64\r\n    /* find function pointers to API functions specific to x86_64 platforms, if they exist.\r\n     * BOOL GetThreadGroupAffinity(_In_  HANDLE hThread, _Out_ PGROUP_AFFINITY GroupAffinity); */\r\n    typedef BOOL(*get_thread_affinity_t)(HANDLE thread, davs2_group_affinity_t *group_affinity);\r\n    HMODULE kernel_dll = GetModuleHandle(TEXT(\"kernel32.dll\"));\r\n    get_thread_affinity_t get_thread_affinity = (get_thread_affinity_t)GetProcAddress(kernel_dll, \"GetThreadGroupAffinity\");\r\n\r\n    if (get_thread_affinity) {\r\n        /* running on a platform that supports >64 logical cpus */\r\n        davs2_group_affinity_t thread_affinity;\r\n        if (get_thread_affinity(GetCurrentThread(), &thread_affinity)) {\r\n            process_cpus = thread_affinity.mask;\r\n        }\r\n    }\r\n#endif\r\n    if (!process_cpus) {\r\n        GetProcessAffinityMask(GetCurrentProcess(), &process_cpus, &system_cpus);\r\n    }\r\n\r\n    for (bit = 1; bit; bit <<= 1) {\r\n        cpus += !!(process_cpus & bit);\r\n    }\r\n\r\n    return cpus ? cpus : 1;\r\n}\r\n\r\n#endif // #if HAVE_WIN32THREAD\r\n"
  },
  {
    "path": "source/common/win32thread.h",
    "content": "/*****************************************************************************\r\n * win32thread.h: windows threading\r\n *****************************************************************************\r\n * Copyright (C) 2010-2017 x264 project\r\n *\r\n * Authors: Steven Walters <kemuri9@gmail.com>\r\n *\r\n * This program is free software; you can redistribute it and/or modify\r\n * it under the terms of the GNU General Public License as published by\r\n * the Free Software Foundation; either version 2 of the License, or\r\n * (at your option) any later version.\r\n *\r\n * This program is distributed in the hope that it will be useful,\r\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n * GNU General Public License for more details.\r\n *\r\n * You should have received a copy of the GNU General Public License\r\n * along with this program; if not, write to the Free Software\r\n * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n * This program is also available under a commercial proprietary license.\r\n * For more information, contact us at licensing@x264.com.\r\n *****************************************************************************/\r\n\r\n/*\r\n * changes of this file:\r\n *    modified for davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n */\r\n\r\n#ifndef DAVS2_WIN32THREAD_H\r\n#define DAVS2_WIN32THREAD_H\r\n\r\n#define WIN32_LEAN_AND_MEAN\r\n#include <windows.h>\r\n/* the following macro is used within xavs2 encoder */\r\n#undef ERROR\r\n\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\ntypedef struct {\r\n    void *handle;\r\n    void 
*(*func)(void *arg);\r\n    void *arg;\r\n    void *ret;\r\n} davs2_thread_t;\r\n#define davs2_thread_attr_t int\r\n\r\n/* the condition variable API for Windows 6.0+ uses critical sections and not mutexes */\r\ntypedef CRITICAL_SECTION davs2_thread_mutex_t;\r\n#define DAVS2_THREAD_MUTEX_INITIALIZER {0}\r\n#define davs2_thread_mutexattr_t int\r\n#define pthread_exit(a)\r\n/* This stands in for the CONDITION_VARIABLE typedef when using Windows' native condition variables on kernels 6.0+.\r\n * MinGW does not currently have this typedef. */\r\ntypedef struct {\r\n    void *ptr;\r\n} davs2_thread_cond_t;\r\n#define davs2_thread_condattr_t int\r\n\r\n#define davs2_thread_create FPFX(thread_create)\r\nint davs2_thread_create(davs2_thread_t *thread, const davs2_thread_attr_t *attr,\r\n                        void *(*start_routine)(void *), void *arg);\r\n#define davs2_thread_join FPFX(thread_join)\r\nint davs2_thread_join(davs2_thread_t thread, void **value_ptr);\r\n\r\n#define davs2_thread_mutex_init FPFX(thread_mutex_init)\r\nint davs2_thread_mutex_init(davs2_thread_mutex_t *mutex, const davs2_thread_mutexattr_t *attr);\r\n#define davs2_thread_mutex_destroy FPFX(thread_mutex_destroy)\r\nint davs2_thread_mutex_destroy(davs2_thread_mutex_t *mutex);\r\n#define davs2_thread_mutex_lock FPFX(thread_mutex_lock)\r\nint davs2_thread_mutex_lock(davs2_thread_mutex_t *mutex);\r\n#define davs2_thread_mutex_unlock FPFX(thread_mutex_unlock)\r\nint davs2_thread_mutex_unlock(davs2_thread_mutex_t *mutex);\r\n\r\n#define davs2_thread_cond_init FPFX(thread_cond_init)\r\nint davs2_thread_cond_init(davs2_thread_cond_t *cond, const davs2_thread_condattr_t *attr);\r\n#define davs2_thread_cond_destroy FPFX(thread_cond_destroy)\r\nint davs2_thread_cond_destroy(davs2_thread_cond_t *cond);\r\n#define davs2_thread_cond_broadcast FPFX(thread_cond_broadcast)\r\nint davs2_thread_cond_broadcast(davs2_thread_cond_t *cond);\r\n#define davs2_thread_cond_wait FPFX(thread_cond_wait)\r\nint 
davs2_thread_cond_wait(davs2_thread_cond_t *cond, davs2_thread_mutex_t *mutex);\r\n#define davs2_thread_cond_signal FPFX(thread_cond_signal)\r\nint davs2_thread_cond_signal(davs2_thread_cond_t *cond);\r\n\r\n#define davs2_thread_attr_init(a) 0\r\n#define davs2_thread_attr_destroy(a) 0\r\n\r\n#define davs2_win32_threading_init FPFX(win32_threading_init)\r\nint  davs2_win32_threading_init(void);\r\n#define davs2_win32_threading_destroy FPFX(win32_threading_destroy)\r\nvoid davs2_win32_threading_destroy(void);\r\n\r\n#define davs2_thread_num_processors_np FPFX(thread_num_processors_np)\r\nint davs2_thread_num_processors_np(void);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n#endif  // DAVS2_WIN32THREAD_H\r\n"
  },
  {
    "path": "source/common/x86/blockcopy8.asm",
    "content": ";*****************************************************************************\r\n;* Copyright (C) 2013-2017 MulticoreWare, Inc\r\n;*\r\n;* Authors: Praveen Kumar Tiwari <praveen@multicorewareinc.com>\r\n;*          Murugan Vairavel <murugan@multicorewareinc.com>\r\n;*          Min Chen <chenm003@163.com>\r\n;*\r\n;* This program is free software; you can redistribute it and/or modify\r\n;* it under the terms of the GNU General Public License as published by\r\n;* the Free Software Foundation; either version 2 of the License, or\r\n;* (at your option) any later version.\r\n;*\r\n;* This program is distributed in the hope that it will be useful,\r\n;* but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n;* GNU General Public License for more details.\r\n;*\r\n;* You should have received a copy of the GNU General Public License\r\n;* along with this program; if not, write to the Free Software\r\n;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n;*\r\n;* This program is also available under a commercial proprietary license.\r\n;* For more information, contact us at license @ x265.com.\r\n;*****************************************************************************/\r\n\r\n%include \"x86inc.asm\"\r\n%include \"x86util.asm\"\r\n\r\nSECTION_RODATA 32\r\n\r\ncextern pb_4\r\ncextern pb_1\r\ncextern pb_16\r\ncextern pb_64\r\ncextern pw_4\r\ncextern pb_8\r\ncextern pb_32\r\ncextern pb_128\r\n\r\nSECTION .text\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_2x4(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_2x4, 4, 7, 0\r\n    mov    r4w,    [r2]\r\n    mov    r5w,    [r2 + r3]\r\n    mov    r6w,    [r2 + 2 * r3]\r\n    lea    r3,     [r3 + 2 * 
r3]\r\n    mov    r3w,    [r2 + r3]\r\n\r\n    mov    [r0],          r4w\r\n    mov    [r0 + r1],     r5w\r\n    mov    [r0 + 2 * r1], r6w\r\n    lea    r1,            [r1 + 2 * r1]\r\n    mov    [r0 + r1],     r3w\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_2x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_2x8, 4, 7, 0\r\n    lea     r5,      [3 * r1]\r\n    lea     r6,      [3 * r3]\r\n\r\n    mov     r4w,           [r2]\r\n    mov     [r0],          r4w\r\n    mov     r4w,           [r2 + r3]\r\n    mov     [r0 + r1],     r4w\r\n    mov     r4w,           [r2 + 2 * r3]\r\n    mov     [r0 + 2 * r1], r4w\r\n    mov     r4w,           [r2 + r6]\r\n    mov     [r0 + r5],     r4w\r\n\r\n    lea     r2,            [r2 + 4 * r3]\r\n    mov     r4w,           [r2]\r\n    lea     r0,            [r0 + 4 * r1]\r\n    mov     [r0],          r4w\r\n\r\n    mov     r4w,           [r2 + r3]\r\n    mov     [r0 + r1],     r4w\r\n    mov     r4w,           [r2 + 2 * r3]\r\n    mov     [r0 + 2 * r1], r4w\r\n    mov     r4w,           [r2 + r6]\r\n    mov     [r0 + r5],     r4w\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_2x16(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_2x16, 4, 7, 0\r\n    lea     r5,      [3 * r1]\r\n    lea     r6,      [3 * r3]\r\n\r\n    mov     r4w,           [r2]\r\n    mov     [r0],          r4w\r\n    mov     r4w,           [r2 + r3]\r\n    mov     [r0 + r1],     r4w\r\n    mov     r4w,           [r2 + 2 * r3]\r\n    mov     [r0 + 2 * r1], r4w\r\n    mov     r4w,           [r2 + r6]\r\n    mov     [r0 + r5],     
r4w\r\n\r\n%rep 3\r\n    lea     r2,            [r2 + 4 * r3]\r\n    mov     r4w,           [r2]\r\n    lea     r0,            [r0 + 4 * r1]\r\n    mov     [r0],          r4w\r\n    mov     r4w,           [r2 + r3]\r\n    mov     [r0 + r1],     r4w\r\n    mov     r4w,           [r2 + 2 * r3]\r\n    mov     [r0 + 2 * r1], r4w\r\n    mov     r4w,           [r2 + r6]\r\n    mov     [r0 + r5],     r4w\r\n%endrep\r\n    RET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_4x2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_4x2, 4, 6, 0\r\n    mov     r4d,     [r2]\r\n    mov     r5d,     [r2 + r3]\r\n\r\n    mov     [r0],            r4d\r\n    mov     [r0 + r1],       r5d\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_4x4(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_4x4, 4, 4, 4\r\n    movd     m0,     [r2]\r\n    movd     m1,     [r2 + r3]\r\n    movd     m2,     [r2 + 2 * r3]\r\n    lea      r3,     [r3 + r3 * 2]\r\n    movd     m3,     [r2 + r3]\r\n\r\n    movd     [r0],            m0\r\n    movd     [r0 + r1],       m1\r\n    movd     [r0 + 2 * r1],   m2\r\n    lea      r1,              [r1 + 2 * r1]\r\n    movd     [r0 + r1],       m3\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_4x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_4x8, 4, 6, 4\r\n\r\n    lea     r4,    [3 * r1]\r\n    lea     r5,    [3 * r3]\r\n\r\n    movd     
m0,     [r2]\r\n    movd     m1,     [r2 + r3]\r\n    movd     m2,     [r2 + 2 * r3]\r\n    movd     m3,     [r2 + r5]\r\n\r\n    movd     [r0],          m0\r\n    movd     [r0 + r1],     m1\r\n    movd     [r0 + 2 * r1], m2\r\n    movd     [r0 + r4],     m3\r\n\r\n    lea      r2,     [r2 + 4 * r3]\r\n    movd     m0,     [r2]\r\n    movd     m1,     [r2 + r3]\r\n    movd     m2,     [r2 + 2 * r3]\r\n    movd     m3,     [r2 + r5]\r\n\r\n    lea      r0,            [r0 + 4 * r1]\r\n    movd     [r0],          m0\r\n    movd     [r0 + r1],     m1\r\n    movd     [r0 + 2 * r1], m2\r\n    movd     [r0 + r4],     m3\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W4_H8 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_%1x%2, 4, 7, 4\r\n    mov    r4d,    %2/8\r\n    lea    r5,     [3 * r1]\r\n    lea    r6,     [3 * r3]\r\n\r\n.loop:\r\n    movd     m0,     [r2]\r\n    movd     m1,     [r2 + r3]\r\n    movd     m2,     [r2 + 2 * r3]\r\n    movd     m3,     [r2 + r6]\r\n\r\n    movd     [r0],          m0\r\n    movd     [r0 + r1],     m1\r\n    movd     [r0 + 2 * r1], m2\r\n    movd     [r0 + r5],     m3\r\n\r\n    lea      r2,     [r2 + 4 * r3]\r\n    movd     m0,     [r2]\r\n    movd     m1,     [r2 + r3]\r\n    movd     m2,     [r2 + 2 * r3]\r\n    movd     m3,     [r2 + r6]\r\n\r\n    lea      r0,            [r0 + 4 * r1]\r\n    movd     [r0],          m0\r\n    movd     [r0 + r1],     m1\r\n    movd     [r0 + 2 * r1], m2\r\n    movd     [r0 + r5],     m3\r\n\r\n    lea       r0,                  [r0 + 4 * r1]\r\n    lea       r2,                  [r2 + 4 * r3]\r\n\r\n    dec       r4d\r\n    jnz       .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W4_H8 4, 16\r\nBLOCKCOPY_PP_W4_H8 4, 
32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_6x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_6x8, 4, 7, 3\r\n\r\n    movd     m0,  [r2]\r\n    mov      r4w, [r2 + 4]\r\n    movd     m1,  [r2 + r3]\r\n    mov      r5w, [r2 + r3 + 4]\r\n    movd     m2,  [r2 + 2 * r3]\r\n    mov      r6w, [r2 + 2 * r3 + 4]\r\n\r\n    movd     [r0],              m0\r\n    mov      [r0 + 4],          r4w\r\n    movd     [r0 + r1],         m1\r\n    mov      [r0 + r1 + 4],     r5w\r\n    movd     [r0 + 2 * r1],     m2\r\n    mov      [r0 + 2 * r1 + 4], r6w\r\n\r\n    lea      r2,  [r2 + 2 * r3]\r\n    movd     m0,  [r2 + r3]\r\n    mov      r4w, [r2 + r3 + 4]\r\n    movd     m1,  [r2 + 2 * r3]\r\n    mov      r5w, [r2 + 2 * r3 + 4]\r\n    lea      r2,  [r2 + 2 * r3]\r\n    movd     m2,  [r2 + r3]\r\n    mov      r6w, [r2 + r3 + 4]\r\n\r\n    lea      r0,                [r0 + 2 * r1]\r\n    movd     [r0 + r1],         m0\r\n    mov      [r0 + r1 + 4],     r4w\r\n    movd     [r0 + 2 * r1],     m1\r\n    mov      [r0 + 2 * r1 + 4], r5w\r\n    lea      r0,                [r0 + 2 * r1]\r\n    movd     [r0 + r1],         m2\r\n    mov      [r0 + r1 + 4],     r6w\r\n\r\n    lea      r2,                [r2 + 2 * r3]\r\n    movd     m0,                [r2]\r\n    mov      r4w,               [r2 + 4]\r\n    movd     m1,                [r2 + r3]\r\n    mov      r5w,               [r2 + r3 + 4]\r\n\r\n    lea      r0,            [r0 + 2 * r1]\r\n    movd     [r0],          m0\r\n    mov      [r0 + 4],      r4w\r\n    movd     [r0 + r1],     m1\r\n    mov      [r0 + r1 + 4], r5w\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_6x16(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t 
srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_6x16, 4, 7, 2\r\n    mov     r6d,    16/2\r\n.loop:\r\n    movd    m0,     [r2]\r\n    mov     r4w,    [r2 + 4]\r\n    movd    m1,     [r2 + r3]\r\n    mov     r5w,    [r2 + r3 + 4]\r\n    lea     r2,     [r2 + r3 * 2]\r\n    movd    [r0],           m0\r\n    mov     [r0 + 4],       r4w\r\n    movd    [r0 + r1],      m1\r\n    mov     [r0 + r1 + 4],  r5w\r\n    lea     r0,     [r0 + r1 * 2]\r\n    dec     r6d\r\n    jnz     .loop\r\n    RET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_8x2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_8x2, 4, 4, 2\r\n    movh     m0,        [r2]\r\n    movh     m1,        [r2 + r3]\r\n\r\n    movh     [r0],       m0\r\n    movh     [r0 + r1],  m1\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_8x4(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_8x4, 4, 4, 4\r\n    movh     m0,     [r2]\r\n    movh     m1,     [r2 + r3]\r\n    movh     m2,     [r2 + 2 * r3]\r\n    lea      r3,     [r3 + r3 * 2]\r\n    movh     m3,     [r2 + r3]\r\n\r\n    movh     [r0],            m0\r\n    movh     [r0 + r1],       m1\r\n    movh     [r0 + 2 * r1],   m2\r\n    lea      r1,              [r1 + 2 * r1]\r\n    movh     [r0 + r1],       m3\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_8x6(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t 
srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_8x6, 4, 4, 6\r\n    movh     m0,     [r2]\r\n    movh     m1,     [r2 + r3]\r\n    lea      r2,     [r2 + 2 * r3]\r\n    movh     m2,     [r2]\r\n    movh     m3,     [r2 + r3]\r\n    lea      r2,     [r2 + 2 * r3]\r\n    movh     m4,     [r2]\r\n    movh     m5,     [r2 + r3]\r\n\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    lea      r0,            [r0 + 2 * r1]\r\n    movh     [r0],          m2\r\n    movh     [r0 + r1],     m3\r\n    lea      r0,            [r0 + 2 * r1]\r\n    movh     [r0],          m4\r\n    movh     [r0 + r1],     m5\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_8x12(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_8x12, 4, 6, 4\r\n\r\n    lea      r4, [3 * r3]\r\n    lea      r5, [3 * r1]\r\n\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n\r\n    %rep 2\r\n    lea      r2, [r2 + 4 * r3]\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    lea      r0,            [r0 + 4 * r1]\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n    %endrep\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_8x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t 
srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_8x8, 4, 6, 4\r\n\r\n    lea      r4, [3 * r3]\r\n    lea      r5, [3 * r1]\r\n\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n\r\n    lea      r2, [r2 + 4 * r3]\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    lea      r0,            [r0 + 4 * r1]\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_8x16(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_8x16, 4, 6, 4\r\n\r\n    lea      r4, [3 * r3]\r\n    lea      r5, [3 * r1]\r\n\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n\r\n    %rep 3\r\n    lea      r2, [r2 + 4 * r3]\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    lea      r0,            [r0 + 4 * r1]\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n    %endrep\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_8x32(pixel* dst, intptr_t dstStride, 
const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_8x32, 4, 6, 4\r\n\r\n    lea      r4, [3 * r3]\r\n    lea      r5, [3 * r1]\r\n\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n\r\n    %rep 7\r\n    lea      r2, [r2 + 4 * r3]\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    lea      r0,            [r0 + 4 * r1]\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n    %endrep\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_8x64(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_8x64, 4, 6, 4\r\n\r\n    lea      r4, [3 * r3]\r\n    lea      r5, [3 * r1]\r\n\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n\r\n    %rep 15\r\n    lea      r2, [r2 + 4 * r3]\r\n    movh     m0, [r2]\r\n    movh     m1, [r2 + r3]\r\n    movh     m2, [r2 + 2 * r3]\r\n    movh     m3, [r2 + r4]\r\n\r\n    lea      r0,            [r0 + 4 * r1]\r\n    movh     [r0],          m0\r\n    movh     [r0 + r1],     m1\r\n    movh     [r0 + 2 * r1], m2\r\n    movh     [r0 + r5],     m3\r\n    %endrep\r\n    
RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W12_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_%1x%2, 4, 5, 4\r\n    mov         r4d,       %2/4\r\n\r\n.loop:\r\n    movh    m0,     [r2]\r\n    movd    m1,     [r2 + 8]\r\n    movh    m2,     [r2 + r3]\r\n    movd    m3,     [r2 + r3 + 8]\r\n    lea     r2,     [r2 + 2 * r3]\r\n\r\n    movh    [r0],             m0\r\n    movd    [r0 + 8],         m1\r\n    movh    [r0 + r1],        m2\r\n    movd    [r0 + r1 + 8],    m3\r\n    lea     r0,               [r0 + 2 * r1]\r\n\r\n    movh    m0,     [r2]\r\n    movd    m1,     [r2 + 8]\r\n    movh    m2,     [r2 + r3]\r\n    movd    m3,     [r2 + r3 + 8]\r\n\r\n    movh    [r0],             m0\r\n    movd    [r0 + 8],         m1\r\n    movh    [r0 + r1],        m2\r\n    movd    [r0 + r1 + 8],    m3\r\n\r\n    dec     r4d\r\n    lea     r0,               [r0 + 2 * r1]\r\n    lea     r2,               [r2 + 2 * r3]\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W12_H4 12, 16\r\n\r\nBLOCKCOPY_PP_W12_H4 12, 32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_16x4(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W16_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_%1x%2, 4, 5, 4\r\n    mov    r4d,    %2/4\r\n\r\n.loop:\r\n    movu    m0,    [r2]\r\n    movu    m1,    [r2 + r3]\r\n    lea     r2,    [r2 + 2 * r3]\r\n    movu    m2,    [r2]\r\n    movu    m3,    [r2 + r3]\r\n\r\n    movu    [r0],         m0\r\n    movu    [r0 + r1],    m1\r\n    lea     r0,           [r0 + 2 * r1]\r\n    movu    [r0],         m2\r\n    movu    [r0 + r1], 
   m3\r\n\r\n    dec     r4d\r\n    lea     r0,               [r0 + 2 * r1]\r\n    lea     r2,               [r2 + 2 * r3]\r\n    jnz     .loop\r\n\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W16_H4 16, 4\r\nBLOCKCOPY_PP_W16_H4 16, 12\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W16_H8 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_%1x%2, 4, 5, 6\r\n    mov    r4d,    %2/8\r\n\r\n.loop:\r\n    movu    m0,    [r2]\r\n    movu    m1,    [r2 + r3]\r\n    lea     r2,    [r2 + 2 * r3]\r\n    movu    m2,    [r2]\r\n    movu    m3,    [r2 + r3]\r\n    lea     r2,    [r2 + 2 * r3]\r\n    movu    m4,    [r2]\r\n    movu    m5,    [r2 + r3]\r\n    lea     r2,    [r2 + 2 * r3]\r\n\r\n    movu    [r0],         m0\r\n    movu    [r0 + r1],    m1\r\n    lea     r0,           [r0 + 2 * r1]\r\n    movu    [r0],         m2\r\n    movu    [r0 + r1],    m3\r\n    lea     r0,           [r0 + 2 * r1]\r\n    movu    [r0],         m4\r\n    movu    [r0 + r1],    m5\r\n    lea     r0,           [r0 + 2 * r1]\r\n\r\n    movu    m0,           [r2]\r\n    movu    m1,           [r2 + r3]\r\n    movu    [r0],         m0\r\n    movu    [r0 + r1],    m1\r\n\r\n    dec    r4d\r\n    lea    r0,    [r0 + 2 * r1]\r\n    lea    r2,    [r2 + 2 * r3]\r\n    jnz    .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W16_H8 16, 8\r\nBLOCKCOPY_PP_W16_H8 16, 16\r\nBLOCKCOPY_PP_W16_H8 16, 32\r\nBLOCKCOPY_PP_W16_H8 16, 64\r\n\r\nBLOCKCOPY_PP_W16_H8 16, 24\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W24_H4 2\r\nINIT_XMM 
sse2\r\ncglobal blockcopy_pp_%1x%2, 4, 5, 6\r\n    mov    r4d,    %2/4\r\n\r\n.loop:\r\n    movu    m0,    [r2]\r\n    movh    m1,    [r2 + 16]\r\n    movu    m2,    [r2 + r3]\r\n    movh    m3,    [r2 + r3 + 16]\r\n    lea     r2,    [r2 + 2 * r3]\r\n    movu    m4,    [r2]\r\n    movh    m5,    [r2 + 16]\r\n\r\n    movu    [r0],              m0\r\n    movh    [r0 + 16],         m1\r\n    movu    [r0 + r1],         m2\r\n    movh    [r0 + r1 + 16],    m3\r\n    lea     r0,                [r0 + 2 * r1]\r\n    movu    [r0],              m4\r\n    movh    [r0 + 16],         m5\r\n\r\n    movu    m0,                [r2 + r3]\r\n    movh    m1,                [r2 + r3 + 16]\r\n    movu    [r0 + r1],         m0\r\n    movh    [r0 + r1 + 16],    m1\r\n\r\n    dec    r4d\r\n    lea    r0,    [r0 + 2 * r1]\r\n    lea    r2,    [r2 + 2 * r3]\r\n    jnz    .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W24_H4 24, 32\r\n\r\nBLOCKCOPY_PP_W24_H4 24, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W32_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_%1x%2, 4, 5, 4\r\n    mov    r4d,    %2/4\r\n\r\n.loop:\r\n    movu    m0,    [r2]\r\n    movu    m1,    [r2 + 16]\r\n    movu    m2,    [r2 + r3]\r\n    movu    m3,    [r2 + r3 + 16]\r\n    lea     r2,    [r2 + 2 * r3]\r\n\r\n    movu    [r0],              m0\r\n    movu    [r0 + 16],         m1\r\n    movu    [r0 + r1],         m2\r\n    movu    [r0 + r1 + 16],    m3\r\n    lea     r0,                [r0 + 2 * r1]\r\n\r\n    movu    m0,    [r2]\r\n    movu    m1,    [r2 + 16]\r\n    movu    m2,    [r2 + r3]\r\n    movu    m3,    [r2 + r3 + 16]\r\n\r\n    movu    [r0],              m0\r\n    movu    [r0 + 16],         m1\r\n    movu    [r0 + r1],         m2\r\n    movu    [r0 + r1 + 16],    
m3\r\n\r\n    dec    r4d\r\n    lea    r0,    [r0 + 2 * r1]\r\n    lea    r2,    [r2 + 2 * r3]\r\n    jnz    .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W32_H4 32, 8\r\nBLOCKCOPY_PP_W32_H4 32, 16\r\nBLOCKCOPY_PP_W32_H4 32, 24\r\nBLOCKCOPY_PP_W32_H4 32, 32\r\nBLOCKCOPY_PP_W32_H4 32, 64\r\n\r\nBLOCKCOPY_PP_W32_H4 32, 48\r\n\r\nINIT_YMM avx\r\ncglobal blockcopy_pp_32x8, 4, 6, 6\r\n    lea    r4, [3 * r1]\r\n    lea    r5, [3 * r3]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    m2, [r2 + 2 * r3]\r\n    movu    m3, [r2 + r5]\r\n    lea     r2, [r2 + 4 * r3]\r\n    movu    m4, [r2]\r\n    movu    m5, [r2 + r3]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    movu    [r0 + 2 * r1], m2\r\n    movu    [r0 + r4], m3\r\n    lea     r0, [r0 + 4 * r1]\r\n    movu    [r0], m4\r\n    movu    [r0 + r1], m5\r\n\r\n    movu    m0, [r2 + 2 * r3]\r\n    movu    m1, [r2 + r5]\r\n\r\n    movu    [r0 + 2 * r1], m0\r\n    movu    [r0 + r4], m1\r\n    RET\r\n\r\nINIT_YMM avx\r\ncglobal blockcopy_pp_32x16, 4, 6, 6\r\n    lea    r4,  [3 * r1]\r\n    lea    r5,  [3 * r3]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    m2, [r2 + 2 * r3]\r\n    movu    m3, [r2 + r5]\r\n    lea     r2, [r2 + 4 * r3]\r\n    movu    m4, [r2]\r\n    movu    m5, [r2 + r3]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    movu    [r0 + 2 * r1], m2\r\n    movu    [r0 + r4], m3\r\n    lea     r0, [r0 + 4 * r1]\r\n    movu    [r0], m4\r\n    movu    [r0 + r1], m5\r\n\r\n    movu    m0, [r2 + 2 * r3]\r\n    movu    m1, [r2 + r5]\r\n    lea     r2, [r2 + 4 * r3]\r\n    movu    m2, [r2]\r\n    movu    m3, [r2 + r3]\r\n    movu    m4, [r2 + 2 * r3]\r\n    movu    m5, [r2 + r5]\r\n\r\n    movu    [r0 + 2 * r1], m0\r\n    movu    [r0 + r4], m1\r\n    lea     r0, [r0 + 4 * r1]\r\n    movu    [r0], m2\r\n    movu    [r0 + r1], m3\r\n    movu    [r0 + 2 * r1], m4\r\n    movu    [r0 + r4], m5\r\n\r\n    lea     r2, [r2 + 4 * r3]\r\n    movu    m0, 
[r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    m2, [r2 + 2 * r3]\r\n    movu    m3, [r2 + r5]\r\n\r\n    lea     r0, [r0 + 4 * r1]\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    movu    [r0 + 2 * r1], m2\r\n    movu    [r0 + r4], m3\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_32x24(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx\r\ncglobal blockcopy_pp_32x24, 4, 7, 6\r\nlea    r4,  [3 * r1]\r\nlea    r5,  [3 * r3]\r\nmov    r6d, 24/8\r\n\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    m2, [r2 + 2 * r3]\r\n    movu    m3, [r2 + r5]\r\n    lea     r2, [r2 + 4 * r3]\r\n    movu    m4, [r2]\r\n    movu    m5, [r2 + r3]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    movu    [r0 + 2 * r1], m2\r\n    movu    [r0 + r4], m3\r\n    lea     r0, [r0 + 4 * r1]\r\n    movu    [r0], m4\r\n    movu    [r0 + r1], m5\r\n\r\n    movu    m0, [r2 + 2 * r3]\r\n    movu    m1, [r2 + r5]\r\n\r\n    movu    [r0 + 2 * r1], m0\r\n    movu    [r0 + r4], m1\r\n\r\n    lea     r2, [r2 + 4 * r3]\r\n    lea     r0, [r0 + 4 * r1]\r\n    dec     r6d\r\n    jnz     .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W32_H16_avx 2\r\nINIT_YMM avx\r\ncglobal blockcopy_pp_%1x%2, 4, 7, 6\r\n    lea    r4,  [3 * r1]\r\n    lea    r5,  [3 * r3]\r\n    mov    r6d, %2/16\r\n\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    m2, [r2 + 2 * r3]\r\n    movu    m3, [r2 + r5]\r\n    lea     r2, [r2 + 4 * r3]\r\n    movu    m4, [r2]\r\n    movu    m5, [r2 + r3]\r\n  \r\n    movu    
[r0], m0\r\n    movu    [r0 + r1], m1\r\n    movu    [r0 + 2 * r1], m2\r\n    movu    [r0 + r4], m3\r\n    lea     r0, [r0 + 4 * r1]\r\n    movu    [r0], m4\r\n    movu    [r0 + r1], m5\r\n\r\n    movu    m0, [r2 + 2 * r3]\r\n    movu    m1, [r2 + r5]\r\n    lea     r2, [r2 + 4 * r3]\r\n    movu    m2, [r2]\r\n    movu    m3, [r2 + r3]\r\n    movu    m4, [r2 + 2 * r3]\r\n    movu    m5, [r2 + r5]\r\n\r\n    movu    [r0 + 2 * r1], m0\r\n    movu    [r0 + r4], m1\r\n    lea     r0, [r0 + 4 * r1]\r\n    movu    [r0], m2\r\n    movu    [r0 + r1], m3\r\n    movu    [r0 + 2 * r1], m4\r\n    movu    [r0 + r4], m5\r\n\r\n    lea     r2, [r2 + 4 * r3]\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    m2, [r2 + 2 * r3]\r\n    movu    m3, [r2 + r5]\r\n\r\n    lea     r0, [r0 + 4 * r1]\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    movu    [r0 + 2 * r1], m2\r\n    movu    [r0 + r4], m3\r\n\r\n    lea     r2, [r2 + 4 * r3]\r\n    lea     r0, [r0 + 4 * r1]\r\n    dec     r6d\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W32_H16_avx 32, 32\r\nBLOCKCOPY_PP_W32_H16_avx 32, 48\r\nBLOCKCOPY_PP_W32_H16_avx 32, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W48_H2 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_%1x%2, 4, 5, 6\r\n    mov    r4d,    %2/4\r\n\r\n.loop:\r\n    movu    m0,    [r2]\r\n    movu    m1,    [r2 + 16]\r\n    movu    m2,    [r2 + 32]\r\n    movu    m3,    [r2 + r3]\r\n    movu    m4,    [r2 + r3 + 16]\r\n    movu    m5,    [r2 + r3 + 32]\r\n    lea     r2,    [r2 + 2 * r3]\r\n\r\n    movu    [r0],              m0\r\n    movu    [r0 + 16],         m1\r\n    movu    [r0 + 32],         m2\r\n    movu    [r0 + r1],         m3\r\n    movu    [r0 + r1 + 16],    m4\r\n    movu    [r0 + 
r1 + 32],    m5\r\n    lea     r0,    [r0 + 2 * r1]\r\n\r\n    movu    m0,    [r2]\r\n    movu    m1,    [r2 + 16]\r\n    movu    m2,    [r2 + 32]\r\n    movu    m3,    [r2 + r3]\r\n    movu    m4,    [r2 + r3 + 16]\r\n    movu    m5,    [r2 + r3 + 32]\r\n\r\n    movu    [r0],              m0\r\n    movu    [r0 + 16],         m1\r\n    movu    [r0 + 32],         m2\r\n    movu    [r0 + r1],         m3\r\n    movu    [r0 + r1 + 16],    m4\r\n    movu    [r0 + r1 + 32],    m5\r\n\r\n    dec    r4d\r\n    lea    r0,    [r0 + 2 * r1]\r\n    lea    r2,    [r2 + 2 * r3]\r\n    jnz    .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W48_H2 48, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W48_H4_avx 2\r\nINIT_YMM avx\r\ncglobal blockcopy_pp_%1x%2, 4, 5, 4\r\n    mov    r4d,    %2/4\r\n\r\n.loop:\r\n    movu    m0,    [r2]\r\n    movu    xm1,   [r2 + 32]\r\n    movu    m2,    [r2 + r3]\r\n    movu    xm3,   [r2 + r3 + 32]\r\n    lea     r2,    [r2 + 2 * r3]\r\n\r\n    movu    [r0],              m0\r\n    movu    [r0 + 32],         xm1\r\n    movu    [r0 + r1],         m2\r\n    movu    [r0 + r1 + 32],    xm3\r\n    lea     r0,                [r0 + 2 * r1]\r\n\r\n    movu    m0,    [r2]\r\n    movu    xm1,   [r2 + 32]\r\n    movu    m2,    [r2 + r3]\r\n    movu    xm3,   [r2 + r3 + 32]\r\n\r\n    movu    [r0],              m0\r\n    movu    [r0 + 32],         xm1\r\n    movu    [r0 + r1],         m2\r\n    movu    [r0 + r1 + 32],    xm3\r\n\r\n    dec    r4d\r\n    lea    r0,    [r0 + 2 * r1]\r\n    lea    r2,    [r2 + 2 * r3]\r\n    jnz    .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W48_H4_avx 48, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* 
dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W64_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_pp_%1x%2, 4, 5, 6\r\n    mov    r4d,    %2/4\r\n\r\n.loop:\r\n    movu    m0,    [r2]\r\n    movu    m1,    [r2 + 16]\r\n    movu    m2,    [r2 + 32]\r\n    movu    m3,    [r2 + 48]\r\n    movu    m4,    [r2 + r3]\r\n    movu    m5,    [r2 + r3 + 16]\r\n\r\n    movu    [r0],              m0\r\n    movu    [r0 + 16],         m1\r\n    movu    [r0 + 32],         m2\r\n    movu    [r0 + 48],         m3\r\n    movu    [r0 + r1],         m4\r\n    movu    [r0 + r1 + 16],    m5\r\n\r\n    movu    m0,    [r2 + r3 + 32]\r\n    movu    m1,    [r2 + r3 + 48]\r\n    lea     r2,    [r2 + 2 * r3]\r\n    movu    m2,    [r2]\r\n    movu    m3,    [r2 + 16]\r\n    movu    m4,    [r2 + 32]\r\n    movu    m5,    [r2 + 48]\r\n\r\n    movu    [r0 + r1 + 32],    m0\r\n    movu    [r0 + r1 + 48],    m1\r\n    lea     r0,                [r0 + 2 * r1]\r\n    movu    [r0],              m2\r\n    movu    [r0 + 16],         m3\r\n    movu    [r0 + 32],         m4\r\n    movu    [r0 + 48],         m5\r\n\r\n    movu    m0,    [r2 + r3]\r\n    movu    m1,    [r2 + r3 + 16]\r\n    movu    m2,    [r2 + r3 + 32]\r\n    movu    m3,    [r2 + r3 + 48]\r\n\r\n    movu    [r0 + r1],         m0\r\n    movu    [r0 + r1 + 16],    m1\r\n    movu    [r0 + r1 + 32],    m2\r\n    movu    [r0 + r1 + 48],    m3\r\n\r\n    dec    r4d\r\n    lea    r0,    [r0 + 2 * r1]\r\n    lea    r2,    [r2 + 2 * r3]\r\n    jnz    .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W64_H4 64, 16\r\nBLOCKCOPY_PP_W64_H4 64, 32\r\nBLOCKCOPY_PP_W64_H4 64, 48\r\nBLOCKCOPY_PP_W64_H4 64, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t 
srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PP_W64_H4_avx 2\r\nINIT_YMM avx\r\ncglobal blockcopy_pp_%1x%2, 4, 7, 6\r\n    lea    r4,  [3 * r1]\r\n    lea    r5,  [3 * r3]\r\n    mov    r6d, %2/4\r\n\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 32]\r\n    movu    m2, [r2 + r3]\r\n    movu    m3, [r2 + r3 + 32]\r\n    movu    m4, [r2 + 2 * r3]\r\n    movu    m5, [r2 + 2 * r3 + 32]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 32], m1\r\n    movu    [r0 + r1], m2\r\n    movu    [r0 + r1 + 32], m3\r\n    movu    [r0 + 2 * r1], m4\r\n    movu    [r0 + 2 * r1 + 32], m5\r\n\r\n    movu    m0, [r2 + r5]\r\n    movu    m1, [r2 + r5 + 32]\r\n\r\n    movu    [r0 + r4], m0\r\n    movu    [r0 + r4 + 32], m1\r\n\r\n    lea     r2, [r2 + 4 * r3]\r\n    lea     r0, [r0 + 4 * r1]\r\n    dec     r6d\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PP_W64_H4_avx 64, 16\r\nBLOCKCOPY_PP_W64_H4_avx 64, 32\r\nBLOCKCOPY_PP_W64_H4_avx 64, 48\r\nBLOCKCOPY_PP_W64_H4_avx 64, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_2x4(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_sp_2x4, 4, 5, 2\r\n\r\nadd        r3, r3\r\n\r\n;Row 0-1\r\nmovd       m0, [r2]\r\nmovd       m1, [r2 + r3]\r\npackuswb   m0, m1\r\nmovd       r4d, m0\r\nmov        [r0], r4w\r\npextrw     [r0 + r1], m0, 4\r\n\r\n;Row 2-3\r\nmovd       m0, [r2 + 2 * r3]\r\nlea        r2, [r2 + 2 * r3]\r\nmovd       m1, [r2 + r3]\r\npackuswb   m0, m1\r\nmovd       r4d, m0\r\nmov        [r0 + 2 * r1], r4w\r\nlea        r0, [r0 + 2 * r1]\r\npextrw     [r0 + r1], m0, 4\r\n\r\nRET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_2x8(pixel* dst, intptr_t dstStride, const int16_t* 
src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_sp_2x8, 4, 5, 2\r\n\r\nadd        r3, r3\r\n\r\n;Row 0-1\r\nmovd       m0, [r2]\r\nmovd       m1, [r2 + r3]\r\npackuswb   m0, m1\r\nmovd       r4d, m0\r\nmov        [r0], r4w\r\npextrw     [r0 + r1], m0, 4\r\n\r\n;Row 2-3\r\nmovd       m0, [r2 + 2 * r3]\r\nlea        r2, [r2 + 2 * r3]\r\nmovd       m1, [r2 + r3]\r\npackuswb   m0, m1\r\nmovd       r4d, m0\r\nmov        [r0 + 2 * r1], r4w\r\nlea        r0, [r0 + 2 * r1]\r\npextrw     [r0 + r1], m0, 4\r\n\r\n;Row 4-5\r\nmovd       m0, [r2 + 2 * r3]\r\nlea        r2, [r2 + 2 * r3]\r\nmovd       m1, [r2 + r3]\r\npackuswb   m0, m1\r\nmovd       r4d, m0\r\nmov        [r0 + 2 * r1], r4w\r\nlea        r0, [r0 + 2 * r1]\r\npextrw     [r0 + r1], m0, 4\r\n\r\n;Row 6-7\r\nmovd       m0, [r2 + 2 * r3]\r\nlea        r2, [r2 + 2 * r3]\r\nmovd       m1, [r2 + r3]\r\npackuswb   m0, m1\r\nmovd       r4d, m0\r\nmov        [r0 + 2 * r1], r4w\r\nlea        r0, [r0 + 2 * r1]\r\npextrw     [r0 + r1], m0, 4\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W2_H2 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 7, 2, dst, dstStride, src, srcStride\r\n    add         r3,     r3\r\n    mov         r6d,    %2/2\r\n.loop:\r\n    movd        m0,     [r2]\r\n    movd        m1,     [r2 + r3]\r\n    dec         r6d\r\n    lea         r2,     [r2 + r3 * 2]\r\n    packuswb    m0,     m0\r\n    packuswb    m1,     m1\r\n    movd        r4d,        m0\r\n    movd        r5d,        m1\r\n    mov         [r0],       r4w\r\n    mov         [r0 + r1],  r5w\r\n    lea         r0,         [r0 + r1 * 2]\r\n    jnz         .loop\r\n    
RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W2_H2 2,  4\r\nBLOCKCOPY_SP_W2_H2 2,  8\r\n\r\nBLOCKCOPY_SP_W2_H2 2, 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_4x2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_4x2, 4, 4, 2, dst, dstStride, src, srcStride\r\n\r\nadd        r3,        r3\r\n\r\nmovh       m0,        [r2]\r\nmovh       m1,        [r2 + r3]\r\n\r\npackuswb   m0,        m1\r\n\r\nmovd       [r0],      m0\r\npshufd     m0,        m0,        2\r\nmovd       [r0 + r1], m0\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_4x4(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_4x4, 4, 4, 4, dst, dstStride, src, srcStride\r\n\r\nadd        r3,     r3\r\n\r\nmovh       m0,     [r2]\r\nmovh       m1,     [r2 + r3]\r\nmovh       m2,     [r2 + 2 * r3]\r\nlea        r2,     [r2 + 2 * r3]\r\nmovh       m3,     [r2 + r3]\r\n\r\npackuswb   m0,            m1\r\npackuswb   m2,            m3\r\n\r\nmovd       [r0],          m0\r\npshufd     m0,            m0,         2\r\nmovd       [r0 + r1],     m0\r\nmovd       [r0 + 2 * r1], m2\r\nlea        r0,            [r0 + 2 * r1]\r\npshufd     m2,            m2,         2\r\nmovd       [r0 + r1],     m2\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_4x8(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_4x8, 4, 4, 8, dst, dstStride, src, srcStride\r\n\r\nadd        r3,      r3\r\n\r\nmovh       m0,   
   [r2]\r\nmovh       m1,      [r2 + r3]\r\nmovh       m2,      [r2 + 2 * r3]\r\nlea        r2,      [r2 + 2 * r3]\r\nmovh       m3,      [r2 + r3]\r\nmovh       m4,      [r2 + 2 * r3]\r\nlea        r2,      [r2 + 2 * r3]\r\nmovh       m5,      [r2 + r3]\r\nmovh       m6,      [r2 + 2 * r3]\r\nlea        r2,      [r2 + 2 * r3]\r\nmovh       m7,      [r2 + r3]\r\n\r\npackuswb   m0,      m1\r\npackuswb   m2,      m3\r\npackuswb   m4,      m5\r\npackuswb   m6,      m7\r\n\r\nmovd       [r0],          m0\r\npshufd     m0,            m0,         2\r\nmovd       [r0 + r1],     m0\r\nmovd       [r0 + 2 * r1], m2\r\nlea        r0,            [r0 + 2 * r1]\r\npshufd     m2,            m2,         2\r\nmovd       [r0 + r1],     m2\r\nmovd       [r0 + 2 * r1], m4\r\nlea        r0,            [r0 + 2 * r1]\r\npshufd     m4,            m4,         2\r\nmovd       [r0 + r1],     m4\r\nmovd       [r0 + 2 * r1], m6\r\nlea        r0,            [r0 + 2 * r1]\r\npshufd     m6,            m6,         2\r\nmovd       [r0 + r1],     m6\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W4_H8 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 5, 8, dst, dstStride, src, srcStride\r\n\r\nmov         r4d,    %2/8\r\n\r\nadd         r3,     r3\r\n\r\n.loop:\r\n     movh       m0,      [r2]\r\n     movh       m1,      [r2 + r3]\r\n     movh       m2,      [r2 + 2 * r3]\r\n     lea        r2,      [r2 + 2 * r3]\r\n     movh       m3,      [r2 + r3]\r\n     movh       m4,      [r2 + 2 * r3]\r\n     lea        r2,      [r2 + 2 * r3]\r\n     movh       m5,      [r2 + r3]\r\n     movh       m6,      [r2 + 2 * r3]\r\n     lea        r2,      [r2 + 2 * r3]\r\n     movh       m7,      [r2 + r3]\r\n\r\n     packuswb   m0,      m1\r\n     packuswb   
m2,      m3\r\n     packuswb   m4,      m5\r\n     packuswb   m6,      m7\r\n\r\n     movd       [r0],          m0\r\n     pshufd     m0,            m0,         2\r\n     movd       [r0 + r1],     m0\r\n     movd       [r0 + 2 * r1], m2\r\n     lea        r0,            [r0 + 2 * r1]\r\n     pshufd     m2,            m2,         2\r\n     movd       [r0 + r1],     m2\r\n     movd       [r0 + 2 * r1], m4\r\n     lea        r0,            [r0 + 2 * r1]\r\n     pshufd     m4,            m4,         2\r\n     movd       [r0 + r1],     m4\r\n     movd       [r0 + 2 * r1], m6\r\n     lea        r0,            [r0 + 2 * r1]\r\n     pshufd     m6,            m6,         2\r\n     movd       [r0 + r1],     m6\r\n\r\n     lea        r0,            [r0 + 2 * r1]\r\n     lea        r2,            [r2 + 2 * r3]\r\n\r\n     dec        r4d\r\n     jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W4_H8 4, 16\r\n\r\nBLOCKCOPY_SP_W4_H8 4, 32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_6x8(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_sp_6x8, 4, 4, 2\r\n\r\n    add       r3, r3\r\n\r\n    movu      m0, [r2]\r\n    movu      m1, [r2 + r3]\r\n    packuswb  m0, m1\r\n\r\n    movd      [r0], m0\r\n    pextrw    [r0 + 4], m0, 2\r\n\r\n    movhlps   m0, m0\r\n    movd      [r0 + r1], m0\r\n    pextrw    [r0 + r1 + 4], m0, 2\r\n\r\n    lea       r0, [r0 + 2 * r1]\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    movu      m0, [r2]\r\n    movu      m1, [r2 + r3]\r\n    packuswb  m0, m1\r\n\r\n    movd      [r0], m0\r\n    pextrw    [r0 + 4], m0, 2\r\n\r\n    movhlps   m0, m0\r\n    movd      [r0 + r1], m0\r\n    pextrw    [r0 + r1 + 4], m0, 2\r\n\r\n    lea       r0, [r0 + 2 * r1]\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    movu      m0, [r2]\r\n    movu      m1, [r2 + 
r3]\r\n    packuswb  m0, m1\r\n\r\n    movd      [r0], m0\r\n    pextrw    [r0 + 4], m0, 2\r\n\r\n    movhlps   m0, m0\r\n    movd      [r0 + r1], m0\r\n    pextrw    [r0 + r1 + 4], m0, 2\r\n\r\n    lea       r0, [r0 + 2 * r1]\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    movu      m0, [r2]\r\n    movu      m1, [r2 + r3]\r\n    packuswb  m0, m1\r\n\r\n    movd      [r0], m0\r\n    pextrw    [r0 + 4], m0, 2\r\n\r\n    movhlps   m0, m0\r\n    movd      [r0 + r1], m0\r\n    pextrw    [r0 + r1 + 4], m0, 2\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W6_H2 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 7, 4, dst, dstStride, src, srcStride\r\n    add         r3,     r3\r\n    mov         r6d,    %2/2\r\n.loop:\r\n    movh        m0, [r2]\r\n    movd        m2, [r2 + 8]\r\n    movh        m1, [r2 + r3]\r\n    movd        m3, [r2 + r3 + 8]\r\n    dec         r6d\r\n    lea         r2, [r2 + r3 * 2]\r\n    packuswb    m0, m0\r\n    packuswb    m2, m2\r\n    packuswb    m1, m1\r\n    packuswb    m3, m3\r\n    movd        r4d,            m2\r\n    movd        r5d,            m3\r\n    movd        [r0],           m0\r\n    mov         [r0 + 4],       r4w\r\n    movd        [r0 + r1],      m1\r\n    mov         [r0 + r1 + 4],  r5w\r\n    lea         r0, [r0 + r1 * 2]\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W6_H2 6,  8\r\n\r\nBLOCKCOPY_SP_W6_H2 6, 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_8x2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_8x2, 4, 4, 2, dst, dstStride, src, 
srcStride\r\n\r\nadd        r3,         r3\r\n\r\nmovu       m0,         [r2]\r\nmovu       m1,         [r2 + r3]\r\n\r\npackuswb   m0,         m1\r\n\r\nmovlps     [r0],       m0\r\nmovhps     [r0 + r1],  m0\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_8x4(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_8x4, 4, 4, 4, dst, dstStride, src, srcStride\r\n\r\nadd        r3,     r3\r\n\r\nmovu       m0,     [r2]\r\nmovu       m1,     [r2 + r3]\r\nmovu       m2,     [r2 + 2 * r3]\r\nlea        r2,     [r2 + 2 * r3]\r\nmovu       m3,     [r2 + r3]\r\n\r\npackuswb   m0,            m1\r\npackuswb   m2,            m3\r\n\r\nmovlps     [r0],          m0\r\nmovhps     [r0 + r1],     m0\r\nmovlps     [r0 + 2 * r1], m2\r\nlea        r0,            [r0 + 2 * r1]\r\nmovhps     [r0 + r1],     m2\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_8x6(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_8x6, 4, 4, 6, dst, dstStride, src, srcStride\r\n\r\nadd        r3,      r3\r\n\r\nmovu       m0,      [r2]\r\nmovu       m1,      [r2 + r3]\r\nmovu       m2,      [r2 + 2 * r3]\r\nlea        r2,      [r2 + 2 * r3]\r\nmovu       m3,      [r2 + r3]\r\nmovu       m4,      [r2 + 2 * r3]\r\nlea        r2,      [r2 + 2 * r3]\r\nmovu       m5,      [r2 + r3]\r\n\r\npackuswb   m0,            m1\r\npackuswb   m2,            m3\r\npackuswb   m4,            m5\r\n\r\nmovlps     [r0],          m0\r\nmovhps     [r0 + r1],     m0\r\nmovlps     [r0 + 2 * r1], m2\r\nlea        r0,            [r0 + 2 * r1]\r\nmovhps     [r0 + r1],     m2\r\nmovlps     [r0 + 2 * r1], m4\r\nlea  
      r0,            [r0 + 2 * r1]\r\nmovhps     [r0 + r1],     m4\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_8x8(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_8x8, 4, 4, 8, dst, dstStride, src, srcStride\r\n\r\nadd        r3,      r3\r\n\r\nmovu       m0,      [r2]\r\nmovu       m1,      [r2 + r3]\r\nmovu       m2,      [r2 + 2 * r3]\r\nlea        r2,      [r2 + 2 * r3]\r\nmovu       m3,      [r2 + r3]\r\nmovu       m4,      [r2 + 2 * r3]\r\nlea        r2,      [r2 + 2 * r3]\r\nmovu       m5,      [r2 + r3]\r\nmovu       m6,      [r2 + 2 * r3]\r\nlea        r2,      [r2 + 2 * r3]\r\nmovu       m7,      [r2 + r3]\r\n\r\npackuswb   m0,      m1\r\npackuswb   m2,      m3\r\npackuswb   m4,      m5\r\npackuswb   m6,      m7\r\n\r\nmovlps     [r0],          m0\r\nmovhps     [r0 + r1],     m0\r\nmovlps     [r0 + 2 * r1], m2\r\nlea        r0,            [r0 + 2 * r1]\r\nmovhps     [r0 + r1],     m2\r\nmovlps     [r0 + 2 * r1], m4\r\nlea        r0,            [r0 + 2 * r1]\r\nmovhps     [r0 + r1],     m4\r\nmovlps     [r0 + 2 * r1], m6\r\nlea        r0,            [r0 + 2 * r1]\r\nmovhps     [r0 + r1],     m6\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W8_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 5, 4, dst, dstStride, src, srcStride\r\n    add         r3,     r3\r\n    mov         r4d,    %2/4\r\n.loop:\r\n    movu        m0,     [r2]\r\n    movu        m1,     [r2 + r3]\r\n    lea         r2,     [r2 + r3 * 2]\r\n    movu        m2,     [r2]\r\n    movu        m3,     [r2 + r3]\r\n   
 dec         r4d\r\n    lea         r2,     [r2 + r3 * 2]\r\n    packuswb    m0,     m1\r\n    packuswb    m2,     m3\r\n    movlps      [r0],       m0\r\n    movhps      [r0 + r1],  m0\r\n    lea         r0,         [r0 + r1 * 2]\r\n    movlps      [r0],       m2\r\n    movhps      [r0 + r1],  m2\r\n    lea         r0,         [r0 + r1 * 2]\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W8_H4 8, 12\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W8_H8 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 5, 8, dst, dstStride, src, srcStride\r\n\r\nmov         r4d,    %2/8\r\n\r\nadd         r3,     r3\r\n\r\n.loop:\r\n     movu       m0,      [r2]\r\n     movu       m1,      [r2 + r3]\r\n     movu       m2,      [r2 + 2 * r3]\r\n     lea        r2,      [r2 + 2 * r3]\r\n     movu       m3,      [r2 + r3]\r\n     movu       m4,      [r2 + 2 * r3]\r\n     lea        r2,      [r2 + 2 * r3]\r\n     movu       m5,      [r2 + r3]\r\n     movu       m6,      [r2 + 2 * r3]\r\n     lea        r2,      [r2 + 2 * r3]\r\n     movu       m7,      [r2 + r3]\r\n\r\n     packuswb   m0,      m1\r\n     packuswb   m2,      m3\r\n     packuswb   m4,      m5\r\n     packuswb   m6,      m7\r\n\r\n     movlps     [r0],          m0\r\n     movhps     [r0 + r1],     m0\r\n     movlps     [r0 + 2 * r1], m2\r\n     lea        r0,            [r0 + 2 * r1]\r\n     movhps     [r0 + r1],     m2\r\n     movlps     [r0 + 2 * r1], m4\r\n     lea        r0,            [r0 + 2 * r1]\r\n     movhps     [r0 + r1],     m4\r\n     movlps     [r0 + 2 * r1], m6\r\n     lea        r0,            [r0 + 2 * r1]\r\n     movhps     [r0 + r1],     m6\r\n\r\n    lea         r0,            [r0 + 2 * r1]\r\n    lea         r2,            [r2 + 2 * 
r3]\r\n\r\n    dec         r4d\r\n    jnz         .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W8_H8 8, 16\r\nBLOCKCOPY_SP_W8_H8 8, 32\r\n\r\nBLOCKCOPY_SP_W8_H8 8, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W12_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 5, 8, dst, dstStride, src, srcStride\r\n\r\nmov             r4d,     %2/4\r\n\r\nadd             r3,      r3\r\n\r\n.loop:\r\n     movu       m0,      [r2]\r\n     movu       m1,      [r2 + 16]\r\n     movu       m2,      [r2 + r3]\r\n     movu       m3,      [r2 + r3 + 16]\r\n     movu       m4,      [r2 + 2 * r3]\r\n     movu       m5,      [r2 + 2 * r3 + 16]\r\n     lea        r2,      [r2 + 2 * r3]\r\n     movu       m6,      [r2 + r3]\r\n     movu       m7,      [r2 + r3 + 16]\r\n\r\n     packuswb   m0,      m1\r\n     packuswb   m2,      m3\r\n     packuswb   m4,      m5\r\n     packuswb   m6,      m7\r\n\r\n     movh       [r0],              m0\r\n     pshufd     m0,                m0,    2\r\n     movd       [r0 + 8],          m0\r\n\r\n     movh       [r0 + r1],         m2\r\n     pshufd     m2,                m2,    2\r\n     movd       [r0 + r1 + 8],     m2\r\n\r\n     movh       [r0 + 2 * r1],     m4\r\n     pshufd     m4,                m4,    2\r\n     movd       [r0 + 2 * r1 + 8], m4\r\n\r\n     lea        r0,                [r0 + 2 * r1]\r\n     movh       [r0 + r1],         m6\r\n     pshufd     m6,                m6,    2\r\n     movd       [r0 + r1 + 8],     m6\r\n\r\n     lea        r0,                [r0 + 2 * r1]\r\n     lea        r2,                [r2 + 2 * r3]\r\n\r\n     dec        r4d\r\n     jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W12_H4 12, 16\r\n\r\nBLOCKCOPY_SP_W12_H4 12, 
32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W16_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 5, 8, dst, dstStride, src, srcStride\r\n\r\nmov             r4d,     %2/4\r\n\r\nadd             r3,      r3\r\n\r\n.loop:\r\n     movu       m0,      [r2]\r\n     movu       m1,      [r2 + 16]\r\n     movu       m2,      [r2 + r3]\r\n     movu       m3,      [r2 + r3 + 16]\r\n     movu       m4,      [r2 + 2 * r3]\r\n     movu       m5,      [r2 + 2 * r3 + 16]\r\n     lea        r2,      [r2 + 2 * r3]\r\n     movu       m6,      [r2 + r3]\r\n     movu       m7,      [r2 + r3 + 16]\r\n\r\n     packuswb   m0,      m1\r\n     packuswb   m2,      m3\r\n     packuswb   m4,      m5\r\n     packuswb   m6,      m7\r\n\r\n     movu       [r0],              m0\r\n     movu       [r0 + r1],         m2\r\n     movu       [r0 + 2 * r1],     m4\r\n     lea        r0,                [r0 + 2 * r1]\r\n     movu       [r0 + r1],         m6\r\n\r\n     lea        r0,                [r0 + 2 * r1]\r\n     lea        r2,                [r2 + 2 * r3]\r\n\r\n     dec        r4d\r\n     jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W16_H4 16,  4\r\nBLOCKCOPY_SP_W16_H4 16,  8\r\nBLOCKCOPY_SP_W16_H4 16, 12\r\nBLOCKCOPY_SP_W16_H4 16, 16\r\nBLOCKCOPY_SP_W16_H4 16, 32\r\nBLOCKCOPY_SP_W16_H4 16, 64\r\nBLOCKCOPY_SP_W16_H4 16, 24\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W16_H8_avx2 2\r\nINIT_YMM avx2\r\ncglobal blockcopy_sp_%1x%2, 4, 7, 4, dst, dstStride, src, srcStride\r\n    mov    r4d, 
%2/8\r\n    add    r3,  r3\r\n    lea    r5,  [3 * r3]\r\n    lea    r6,  [3 * r1]\r\n\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    m2, [r2 + 2 * r3]\r\n    movu    m3, [r2 + r5]\r\n\r\n    packuswb    m0, m1\r\n    packuswb    m2, m3\r\n\r\n    vpermq    m0, m0, 11011000b\r\n    vpermq    m2, m2, 11011000b\r\n\r\n    vextracti128 xm1, m0, 1\r\n    vextracti128 xm3, m2, 1\r\n\r\n    movu    [r0],          xm0\r\n    movu    [r0 + r1],     xm1\r\n    movu    [r0 + 2 * r1], xm2\r\n    movu    [r0 + r6],     xm3\r\n\r\n    lea     r2, [r2 + 4 * r3]\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    m2, [r2 + 2 * r3]\r\n    movu    m3, [r2 + r5]\r\n\r\n    packuswb    m0, m1\r\n    packuswb    m2, m3\r\n\r\n    vpermq    m0, m0, 11011000b\r\n    vpermq    m2, m2, 11011000b\r\n\r\n    vextracti128 xm1, m0, 1\r\n    vextracti128 xm3, m2, 1\r\n\r\n    lea     r0,            [r0 + 4 * r1]\r\n    movu    [r0],          xm0\r\n    movu    [r0 + r1],     xm1\r\n    movu    [r0 + 2 * r1], xm2\r\n    movu    [r0 + r6],     xm3\r\n\r\n    lea    r0, [r0 + 4 * r1]\r\n    lea    r2, [r2 + 4 * r3]\r\n\r\n    dec    r4d\r\n    jnz    .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W16_H8_avx2 16, 16\r\nBLOCKCOPY_SP_W16_H8_avx2 16, 32\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W24_H2 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 5, 6, dst, dstStride, src, srcStride\r\n\r\nmov             r4d,     %2/2\r\n\r\nadd             r3,      r3\r\n\r\n.loop:\r\n     movu       m0,      [r2]\r\n     movu       m1,      [r2 + 16]\r\n     movu       m2,      [r2 + 32]\r\n     movu       m3,      [r2 + r3]\r\n     movu       m4,      [r2 + r3 + 16]\r\n     movu       m5,      [r2 + r3 + 32]\r\n\r\n 
    packuswb   m0,      m1\r\n     packuswb   m2,      m3\r\n     packuswb   m4,      m5\r\n\r\n     movu       [r0],            m0\r\n     movlps     [r0 + 16],       m2\r\n     movhps     [r0 + r1],       m2\r\n     movu       [r0 + r1 + 8],   m4\r\n\r\n     lea        r0,              [r0 + 2 * r1]\r\n     lea        r2,              [r2 + 2 * r3]\r\n\r\n     dec        r4d\r\n     jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W24_H2 24, 32\r\n\r\nBLOCKCOPY_SP_W24_H2 24, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W32_H2 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 5, 8, dst, dstStride, src, srcStride\r\n\r\nmov             r4d,     %2/2\r\n\r\nadd             r3,      r3\r\n\r\n.loop:\r\n     movu       m0,      [r2]\r\n     movu       m1,      [r2 + 16]\r\n     movu       m2,      [r2 + 32]\r\n     movu       m3,      [r2 + 48]\r\n     movu       m4,      [r2 + r3]\r\n     movu       m5,      [r2 + r3 + 16]\r\n     movu       m6,      [r2 + r3 + 32]\r\n     movu       m7,      [r2 + r3 + 48]\r\n\r\n     packuswb   m0,      m1\r\n     packuswb   m2,      m3\r\n     packuswb   m4,      m5\r\n     packuswb   m6,      m7\r\n\r\n     movu       [r0],            m0\r\n     movu       [r0 + 16],       m2\r\n     movu       [r0 + r1],       m4\r\n     movu       [r0 + r1 + 16],  m6\r\n\r\n     lea        r0,              [r0 + 2 * r1]\r\n     lea        r2,              [r2 + 2 * r3]\r\n\r\n     dec        r4d\r\n     jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W32_H2 32,  8\r\nBLOCKCOPY_SP_W32_H2 32, 16\r\nBLOCKCOPY_SP_W32_H2 32, 24\r\nBLOCKCOPY_SP_W32_H2 32, 32\r\nBLOCKCOPY_SP_W32_H2 32, 64\r\n\r\nBLOCKCOPY_SP_W32_H2 32, 
48\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W32_H4_avx2 2\r\nINIT_YMM avx2\r\ncglobal blockcopy_sp_%1x%2, 4, 7, 4, dst, dstStride, src, srcStride\r\n    mov    r4d, %2/4\r\n    add    r3,  r3\r\n    lea    r5,  [3 * r3]\r\n    lea    r6,  [3 * r1]\r\n\r\n.loop:\r\n    movu       m0, [r2]\r\n    movu       m1, [r2 + 32]\r\n    movu       m2, [r2 + r3]\r\n    movu       m3, [r2 + r3 + 32]\r\n\r\n    packuswb   m0, m1\r\n    packuswb   m2, m3\r\n\r\n    vpermq    m0, m0, 11011000b\r\n    vpermq    m2, m2, 11011000b\r\n\r\n    movu       [r0],      m0\r\n    movu       [r0 + r1], m2\r\n\r\n    movu       m0, [r2 + 2 * r3]\r\n    movu       m1, [r2 + 2 * r3 + 32]\r\n    movu       m2, [r2 + r5]\r\n    movu       m3, [r2 + r5 + 32]\r\n\r\n    packuswb   m0, m1\r\n    packuswb   m2, m3\r\n\r\n    vpermq    m0, m0, 11011000b\r\n    vpermq    m2, m2, 11011000b\r\n\r\n    movu       [r0 + 2 * r1], m0\r\n    movu       [r0 + r6],     m2\r\n\r\n    lea        r0, [r0 + 4 * r1]\r\n    lea        r2, [r2 + 4 * r3]\r\n\r\n    dec        r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W32_H4_avx2 32, 32\r\nBLOCKCOPY_SP_W32_H4_avx2 32, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W48_H2 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 5, 6, dst, dstStride, src, srcStride\r\n\r\nmov             r4d,     %2\r\n\r\nadd             r3,      r3\r\n\r\n.loop:\r\n     movu       m0,        [r2]\r\n     movu       m1,        [r2 + 16]\r\n     movu       m2,        [r2 + 
32]\r\n     movu       m3,        [r2 + 48]\r\n     movu       m4,        [r2 + 64]\r\n     movu       m5,        [r2 + 80]\r\n\r\n     packuswb   m0,        m1\r\n     packuswb   m2,        m3\r\n     packuswb   m4,        m5\r\n\r\n     movu       [r0],      m0\r\n     movu       [r0 + 16], m2\r\n     movu       [r0 + 32], m4\r\n\r\n     lea        r0,        [r0 + r1]\r\n     lea        r2,        [r2 + r3]\r\n\r\n     dec        r4d\r\n     jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W48_H2 48, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W64_H1 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_sp_%1x%2, 4, 5, 8, dst, dstStride, src, srcStride\r\n\r\nmov             r4d,       %2\r\n\r\nadd             r3,         r3\r\n\r\n.loop:\r\n      movu      m0,        [r2]\r\n      movu      m1,        [r2 + 16]\r\n      movu      m2,        [r2 + 32]\r\n      movu      m3,        [r2 + 48]\r\n      movu      m4,        [r2 + 64]\r\n      movu      m5,        [r2 + 80]\r\n      movu      m6,        [r2 + 96]\r\n      movu      m7,        [r2 + 112]\r\n\r\n     packuswb   m0,        m1\r\n     packuswb   m2,        m3\r\n     packuswb   m4,        m5\r\n     packuswb   m6,        m7\r\n\r\n      movu      [r0],      m0\r\n      movu      [r0 + 16], m2\r\n      movu      [r0 + 32], m4\r\n      movu      [r0 + 48], m6\r\n\r\n      lea       r0,        [r0 + r1]\r\n      lea       r2,        [r2 + r3]\r\n\r\n      dec       r4d\r\n      jnz       .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W64_H1 64, 16\r\nBLOCKCOPY_SP_W64_H1 64, 32\r\nBLOCKCOPY_SP_W64_H1 64, 48\r\nBLOCKCOPY_SP_W64_H1 64, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_sp_%1x%2(pixel* 
dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SP_W64_H4_avx2 2\r\nINIT_YMM avx2\r\ncglobal blockcopy_sp_%1x%2, 4, 7, 4, dst, dstStride, src, srcStride\r\n    mov    r4d, %2/4\r\n    add    r3,  r3\r\n    lea    r5,  [3 * r3]\r\n    lea    r6,  [3 * r1]\r\n\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 32]\r\n    movu    m2, [r2 + 64]\r\n    movu    m3, [r2 + 96]\r\n\r\n    packuswb    m0, m1\r\n    packuswb    m2, m3\r\n\r\n    vpermq    m0, m0, 11011000b\r\n    vpermq    m2, m2, 11011000b\r\n\r\n    movu    [r0],      m0\r\n    movu    [r0 + 32], m2\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 32]\r\n    movu    m2, [r2 + r3 + 64]\r\n    movu    m3, [r2 + r3 + 96]\r\n\r\n    packuswb    m0, m1\r\n    packuswb    m2, m3\r\n\r\n    vpermq    m0, m0, 11011000b\r\n    vpermq    m2, m2, 11011000b\r\n\r\n    movu    [r0 + r1],      m0\r\n    movu    [r0 + r1 + 32], m2\r\n\r\n    movu    m0, [r2 + 2 * r3]\r\n    movu    m1, [r2 + 2 * r3 + 32]\r\n    movu    m2, [r2 + 2 * r3 + 64]\r\n    movu    m3, [r2 + 2 * r3 + 96]\r\n\r\n    packuswb    m0, m1\r\n    packuswb    m2, m3\r\n\r\n    vpermq    m0, m0, 11011000b\r\n    vpermq    m2, m2, 11011000b\r\n\r\n    movu    [r0 + 2 * r1],      m0\r\n    movu    [r0 + 2 * r1 + 32], m2\r\n\r\n    movu    m0, [r2 + r5]\r\n    movu    m1, [r2 + r5 + 32]\r\n    movu    m2, [r2 + r5 + 64]\r\n    movu    m3, [r2 + r5 + 96]\r\n\r\n    packuswb    m0, m1\r\n    packuswb    m2, m3\r\n\r\n    vpermq    m0, m0, 11011000b\r\n    vpermq    m2, m2, 11011000b\r\n\r\n    movu    [r0 + r6],      m0\r\n    movu    [r0 + r6 + 32], m2\r\n\r\n    lea    r0, [r0 + 4 * r1]\r\n    lea    r2, [r2 + 4 * r3]\r\n\r\n    dec    r4d\r\n    jnz    .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SP_W64_H4_avx2 64, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void 
blockfill_s_4x4(int16_t* dst, intptr_t dstride, int16_t val)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockfill_s_4x4, 3, 3, 1, dst, dstStride, val\r\n\r\nadd        r1,            r1\r\n\r\nmovd       m0,            r2d\r\npshuflw    m0,            m0,         0\r\n\r\nmovh       [r0],          m0\r\nmovh       [r0 + r1],     m0\r\nmovh       [r0 + 2 * r1], m0\r\nlea        r0,            [r0 + 2 * r1]\r\nmovh       [r0 + r1],     m0\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockfill_s_8x8(int16_t* dst, intptr_t dstride, int16_t val)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockfill_s_8x8, 3, 4, 1, dst, dstStride, val\r\n\r\nadd        r1,            r1\r\nlea        r3,            [3 * r1]\r\n\r\nmovd       m0,            r2d\r\npshuflw    m0,            m0,         0\r\npshufd     m0,            m0,         0\r\n\r\nmovu       [r0],          m0\r\nmovu       [r0 + r1],     m0\r\nmovu       [r0 + 2 * r1], m0\r\n\r\nmovu       [r0 + r3],     m0\r\n\r\nlea        r0,            [r0 + 4 * r1]\r\nmovu       [r0],          m0\r\nmovu       [r0 + r1],     m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + r3],     m0\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockfill_s_16x16(int16_t* dst, intptr_t dstride, int16_t val)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockfill_s_16x16, 3, 4, 1, dst, dstStride, val\r\n\r\nadd        r1,            r1\r\nlea        r3,            [3 * r1]\r\n\r\nmovd       m0,            r2d\r\npshuflw    m0,            m0,         0\r\npshufd     m0,            m0,         0\r\n\r\nmovu       [r0],           m0\r\nmovu       [r0 + 16],      m0\r\nmovu       [r0 + r1],      m0\r\nmovu       [r0 + r1 + 16], 
m0\r\nmovu       [r0 + 2 * r1],  m0\r\nmovu       [r0 + 2 * r1 + 16], m0\r\n\r\nmovu       [r0 + r3],          m0\r\nmovu       [r0 + r3 + 16],     m0\r\nmovu       [r0 + 4 * r1],      m0\r\nmovu       [r0 + 4 * r1 + 16], m0\r\n\r\nlea        r0,                 [r0 + 4 * r1]\r\nmovu       [r0 + r1],          m0\r\nmovu       [r0 + r1 + 16],     m0\r\nmovu       [r0 + 2 * r1],      m0\r\nmovu       [r0 + 2 * r1 + 16], m0\r\nmovu       [r0 + r3],          m0\r\nmovu       [r0 + r3 + 16],     m0\r\nmovu       [r0 + 4 * r1],      m0\r\nmovu       [r0 + 4 * r1 + 16], m0\r\n\r\nlea        r0,                 [r0 + 4 * r1]\r\nmovu       [r0 + r1],          m0\r\nmovu       [r0 + r1 + 16],     m0\r\nmovu       [r0 + 2 * r1],      m0\r\nmovu       [r0 + 2 * r1 + 16], m0\r\nmovu       [r0 + r3],          m0\r\nmovu       [r0 + r3 + 16],     m0\r\nmovu       [r0 + 4 * r1],      m0\r\nmovu       [r0 + 4 * r1 + 16], m0\r\n\r\nlea        r0,                 [r0 + 4 * r1]\r\nmovu       [r0 + r1],          m0\r\nmovu       [r0 + r1 + 16],     m0\r\nmovu       [r0 + 2 * r1],      m0\r\nmovu       [r0 + 2 * r1 + 16], m0\r\nmovu       [r0 + r3],          m0\r\nmovu       [r0 + r3 + 16],     m0\r\nRET\r\n\r\nINIT_YMM avx2\r\ncglobal blockfill_s_16x16, 3, 4, 1\r\nadd          r1, r1\r\nlea          r3, [3 * r1]\r\nmovd         xm0, r2d\r\nvpbroadcastw m0, xm0\r\n\r\nmovu       [r0], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + r3], m0\r\nlea        r0, [r0 + 4 * r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + r3], m0\r\nlea        r0, [r0 + 4 * r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + r3], m0\r\nlea        r0, [r0 + 4 * r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + r3], 
m0\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockfill_s_%1x%2(int16_t* dst, intptr_t dstStride, int16_t val)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKFILL_S_W32_H8 2\r\nINIT_XMM sse2\r\ncglobal blockfill_s_%1x%2, 3, 5, 1, dst, dstStride, val\r\n\r\nmov        r3d,           %2/8\r\n\r\nadd        r1,            r1\r\nlea        r4,            [3 * r1]\r\n\r\nmovd       m0,            r2d\r\npshuflw    m0,            m0,       0\r\npshufd     m0,            m0,       0\r\n\r\n.loop:\r\n     movu       [r0],               m0\r\n     movu       [r0 + 16],          m0\r\n     movu       [r0 + 32],          m0\r\n     movu       [r0 + 48],          m0\r\n\r\n     movu       [r0 + r1],          m0\r\n     movu       [r0 + r1 + 16],     m0\r\n     movu       [r0 + r1 + 32],     m0\r\n     movu       [r0 + r1 + 48],     m0\r\n\r\n     movu       [r0 + 2 * r1],      m0\r\n     movu       [r0 + 2 * r1 + 16], m0\r\n     movu       [r0 + 2 * r1 + 32], m0\r\n     movu       [r0 + 2 * r1 + 48], m0\r\n\r\n     movu       [r0 + r4],          m0\r\n     movu       [r0 + r4 + 16],     m0\r\n     movu       [r0 + r4 + 32],     m0\r\n     movu       [r0 + r4 + 48],     m0\r\n\r\n     movu       [r0 + 4 * r1],      m0\r\n     movu       [r0 + 4 * r1 + 16], m0\r\n     movu       [r0 + 4 * r1 + 32], m0\r\n     movu       [r0 + 4 * r1 + 48], m0\r\n\r\n     lea        r0,                 [r0 + 4 * r1]\r\n     movu       [r0 + r1],          m0\r\n     movu       [r0 + r1 + 16],     m0\r\n     movu       [r0 + r1 + 32],     m0\r\n     movu       [r0 + r1 + 48],     m0\r\n\r\n     movu       [r0 + 2 * r1],      m0\r\n     movu       [r0 + 2 * r1 + 16], m0\r\n     movu       [r0 + 2 * r1 + 32], m0\r\n     movu       [r0 + 2 * r1 + 48], m0\r\n\r\n     movu       [r0 + r4],          m0\r\n     movu       [r0 + r4 + 16],     m0\r\n     movu       [r0 + r4 + 32],     m0\r\n     movu  
     [r0 + r4 + 48],     m0\r\n\r\n     lea        r0,                 [r0 + 4 * r1]\r\n\r\n     dec        r3d\r\n     jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKFILL_S_W32_H8 32, 32\r\n\r\nINIT_YMM avx2\r\ncglobal blockfill_s_32x32, 3, 4, 1\r\nadd          r1, r1\r\nlea          r3, [3 * r1]\r\nmovd         xm0, r2d\r\nvpbroadcastw m0, xm0\r\n\r\nmovu       [r0], m0\r\nmovu       [r0 + 32], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + r1 + 32], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + 2 * r1 + 32], m0\r\nmovu       [r0 + r3], m0\r\nmovu       [r0 + r3 + 32], m0\r\nlea        r0, [r0 + 4 * r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + 32], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + r1 + 32], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + 2 * r1 + 32], m0\r\nmovu       [r0 + r3], m0\r\nmovu       [r0 + r3 + 32], m0\r\nlea        r0, [r0 + 4 * r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + 32], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + r1 + 32], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + 2 * r1 + 32], m0\r\nmovu       [r0 + r3], m0\r\nmovu       [r0 + r3 + 32], m0\r\nlea        r0, [r0 + 4 * r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + 32], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + r1 + 32], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + 2 * r1 + 32], m0\r\nmovu       [r0 + r3], m0\r\nmovu       [r0 + r3 + 32], m0\r\nlea        r0, [r0 + 4 * r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + 32], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + r1 + 32], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + 2 * r1 + 32], m0\r\nmovu       [r0 + r3], m0\r\nmovu       [r0 + r3 + 32], m0\r\nlea        r0, [r0 + 4 * r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + 32], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + r1 + 32], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + 2 * r1 + 32], m0\r\nmovu       [r0 + r3], m0\r\nmovu       [r0 + r3 + 32], m0\r\nlea        r0, [r0 + 4 * 
r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + 32], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + r1 + 32], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + 2 * r1 + 32], m0\r\nmovu       [r0 + r3], m0\r\nmovu       [r0 + r3 + 32], m0\r\nlea        r0, [r0 + 4 * r1]\r\nmovu       [r0], m0\r\nmovu       [r0 + 32], m0\r\nmovu       [r0 + r1], m0\r\nmovu       [r0 + r1 + 32], m0\r\nmovu       [r0 + 2 * r1], m0\r\nmovu       [r0 + 2 * r1 + 32], m0\r\nmovu       [r0 + r3], m0\r\nmovu       [r0 + r3 + 32], m0\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_2x4(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_2x4, 4, 4, 1, dst, dstStride, src, srcStride\r\n\r\nadd        r1,            r1\r\n\r\nmovd       m0,            [r2]\r\npmovzxbw   m0,            m0\r\nmovd       [r0],          m0\r\n\r\nmovd       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovd       [r0 + r1],     m0\r\n\r\nmovd       m0,            [r2 + 2 * r3]\r\npmovzxbw   m0,            m0\r\nmovd       [r0 + 2 * r1], m0\r\n\r\nlea        r2,            [r2 + 2 * r3]\r\nlea        r0,            [r0 + 2 * r1]\r\n\r\nmovd       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovd       [r0 + r1],     m0\r\n\r\nRET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_2x8(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_2x8, 4, 4, 1, dst, dstStride, src, srcStride\r\n\r\nadd        r1,            r1\r\n\r\nmovd       m0,            [r2]\r\npmovzxbw   m0,            m0\r\nmovd       [r0],          m0\r\n\r\nmovd       m0,            [r2 + r3]\r\npmovzxbw   
m0,            m0\r\nmovd       [r0 + r1],     m0\r\n\r\nmovd       m0,            [r2 + 2 * r3]\r\npmovzxbw   m0,            m0\r\nmovd       [r0 + 2 * r1], m0\r\n\r\nlea        r2,            [r2 + 2 * r3]\r\nlea        r0,            [r0 + 2 * r1]\r\n\r\nmovd       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovd       [r0 + r1],     m0\r\n\r\nmovd       m0,            [r2 + 2 * r3]\r\npmovzxbw   m0,            m0\r\nmovd       [r0 + 2 * r1], m0\r\n\r\nlea        r2,            [r2 + 2 * r3]\r\nlea        r0,            [r0 + 2 * r1]\r\n\r\nmovd       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovd       [r0 + r1],     m0\r\n\r\nmovd       m0,            [r2 + 2 * r3]\r\npmovzxbw   m0,            m0\r\nmovd       [r0 + 2 * r1], m0\r\n\r\nlea        r2,            [r2 + 2 * r3]\r\nlea        r0,            [r0 + 2 * r1]\r\n\r\nmovd       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovd       [r0 + r1],     m0\r\n\r\nRET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_2x16(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_2x16, 4, 5, 2, dst, dstStride, src, srcStride\r\n    add         r1,         r1\r\n    mov         r4d,        16/2\r\n.loop:\r\n    movd        m0,         [r2]\r\n    movd        m1,         [r2 + r3]\r\n    dec         r4d\r\n    lea         r2,         [r2 + r3 * 2]\r\n    pmovzxbw    m0,         m0\r\n    pmovzxbw    m1,         m1\r\n    movd        [r0],       m0\r\n    movd        [r0 + r1],  m1\r\n    lea         r0,         [r0 + r1 * 2]\r\n    jnz         .loop\r\n    RET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_4x2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t 
srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_4x2, 4, 4, 1, dst, dstStride, src, srcStride\r\n\r\nadd        r1,         r1\r\n\r\nmovd       m0,         [r2]\r\npmovzxbw   m0,         m0\r\nmovh       [r0],       m0\r\n\r\nmovd       m0,         [r2 + r3]\r\npmovzxbw   m0,         m0\r\nmovh       [r0 + r1],  m0\r\n\r\nRET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_4x4(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_4x4, 4, 4, 1, dst, dstStride, src, srcStride\r\n\r\nadd        r1,            r1\r\n\r\nmovd       m0,            [r2]\r\npmovzxbw   m0,            m0\r\nmovh       [r0],          m0\r\n\r\nmovd       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovh       [r0 + r1],     m0\r\n\r\nmovd       m0,            [r2 + 2 * r3]\r\npmovzxbw   m0,            m0\r\nmovh       [r0 + 2 * r1], m0\r\n\r\nlea        r2,            [r2 + 2 * r3]\r\nlea        r0,            [r0 + 2 * r1]\r\n\r\nmovd       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovh       [r0 + r1],     m0\r\n\r\nRET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W4_H4 2\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_%1x%2, 4, 5, 1, dst, dstStride, src, srcStride\r\n\r\nadd     r1,      r1\r\nmov    r4d,      %2/4\r\n\r\n.loop:\r\n      movd       m0,            [r2]\r\n      pmovzxbw   m0,            m0\r\n      movh       [r0],          m0\r\n\r\n      movd       m0,            [r2 + r3]\r\n      pmovzxbw   m0,            
m0\r\n      movh       [r0 + r1],     m0\r\n\r\n      movd       m0,            [r2 + 2 * r3]\r\n      pmovzxbw   m0,            m0\r\n      movh       [r0 + 2 * r1], m0\r\n\r\n      lea        r2,            [r2 + 2 * r3]\r\n      lea        r0,            [r0 + 2 * r1]\r\n\r\n      movd       m0,            [r2 + r3]\r\n      pmovzxbw   m0,            m0\r\n      movh       [r0 + r1],     m0\r\n\r\n      lea        r0,            [r0 + 2 * r1]\r\n      lea        r2,            [r2 + 2 * r3]\r\n\r\n      dec        r4d\r\n      jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W4_H4 4, 8\r\nBLOCKCOPY_PS_W4_H4 4, 16\r\n\r\nBLOCKCOPY_PS_W4_H4 4, 32\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W6_H4 2\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_%1x%2, 4, 5, 1, dst, dstStride, src, srcStride\r\n\r\nadd     r1,      r1\r\nmov    r4d,      %2/4\r\n\r\n.loop:\r\n      movh       m0,                [r2]\r\n      pmovzxbw   m0,                m0\r\n      movh       [r0],              m0\r\n      pextrd     [r0 + 8],          m0,            2\r\n\r\n      movh       m0,                [r2 + r3]\r\n      pmovzxbw   m0,                m0\r\n      movh       [r0 + r1],         m0\r\n      pextrd     [r0 + r1 + 8],     m0,            2\r\n\r\n      movh       m0,                [r2 + 2 * r3]\r\n      pmovzxbw   m0,                m0\r\n      movh       [r0 + 2 * r1],     m0\r\n      pextrd     [r0 + 2 * r1 + 8], m0,            2\r\n\r\n      lea        r2,                [r2 + 2 * r3]\r\n      lea        r0,                [r0 + 2 * r1]\r\n\r\n      movh       m0,                [r2 + r3]\r\n      pmovzxbw   m0,                m0\r\n      movh       [r0 + r1],         m0\r\n      pextrd     [r0 + r1 + 8],     m0,          
  2\r\n\r\n      lea        r0,                [r0 + 2 * r1]\r\n      lea        r2,                [r2 + 2 * r3]\r\n\r\n      dec        r4d\r\n      jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W6_H4 6, 8\r\n\r\nBLOCKCOPY_PS_W6_H4 6, 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_8x2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_8x2, 4, 4, 1, dst, dstStride, src, srcStride\r\n\r\nadd        r1,         r1\r\n\r\nmovh       m0,         [r2]\r\npmovzxbw   m0,         m0\r\nmovu       [r0],       m0\r\n\r\nmovh       m0,         [r2 + r3]\r\npmovzxbw   m0,         m0\r\nmovu       [r0 + r1],  m0\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_8x4(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_8x4, 4, 4, 1, dst, dstStride, src, srcStride\r\n\r\nadd        r1,            r1\r\n\r\nmovh       m0,            [r2]\r\npmovzxbw   m0,            m0\r\nmovu       [r0],          m0\r\n\r\nmovh       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovu       [r0 + r1],     m0\r\n\r\nmovh       m0,            [r2 + 2 * r3]\r\npmovzxbw   m0,            m0\r\nmovu       [r0 + 2 * r1], m0\r\n\r\nlea        r2,            [r2 + 2 * r3]\r\nlea        r0,            [r0 + 2 * r1]\r\n\r\nmovh       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovu       [r0 + r1],     m0\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_8x6(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t 
srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_8x6, 4, 4, 1, dst, dstStride, src, srcStride\r\n\r\nadd        r1,            r1\r\n\r\nmovh       m0,            [r2]\r\npmovzxbw   m0,            m0\r\nmovu       [r0],          m0\r\n\r\nmovh       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovu       [r0 + r1],     m0\r\n\r\nmovh       m0,            [r2 + 2 * r3]\r\npmovzxbw   m0,            m0\r\nmovu       [r0 + 2 * r1], m0\r\n\r\nlea        r2,            [r2 + 2 * r3]\r\nlea        r0,            [r0 + 2 * r1]\r\n\r\nmovh       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovu       [r0 + r1],     m0\r\n\r\nmovh       m0,            [r2 + 2 * r3]\r\npmovzxbw   m0,            m0\r\nmovu       [r0 + 2 * r1], m0\r\n\r\nlea        r2,            [r2 + 2 * r3]\r\nlea        r0,            [r0 + 2 * r1]\r\n\r\nmovh       m0,            [r2 + r3]\r\npmovzxbw   m0,            m0\r\nmovu       [r0 + r1],     m0\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W8_H4 2\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_%1x%2, 4, 5, 1, dst, dstStride, src, srcStride\r\n\r\nadd     r1,      r1\r\nmov    r4d,      %2/4\r\n\r\n.loop:\r\n      movh       m0,            [r2]\r\n      pmovzxbw   m0,            m0\r\n      movu       [r0],          m0\r\n\r\n      movh       m0,            [r2 + r3]\r\n      pmovzxbw   m0,            m0\r\n      movu       [r0 + r1],     m0\r\n\r\n      movh       m0,            [r2 + 2 * r3]\r\n      pmovzxbw   m0,            m0\r\n      movu       [r0 + 2 * r1], m0\r\n\r\n      lea        r2,            [r2 + 2 * r3]\r\n      lea        r0,            [r0 + 2 * r1]\r\n\r\n      movh    
   m0,            [r2 + r3]\r\n      pmovzxbw   m0,            m0\r\n      movu       [r0 + r1],     m0\r\n\r\n      lea        r0,            [r0 + 2 * r1]\r\n      lea        r2,            [r2 + 2 * r3]\r\n\r\n      dec        r4d\r\n      jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W8_H4  8,  8\r\nBLOCKCOPY_PS_W8_H4  8, 16\r\nBLOCKCOPY_PS_W8_H4  8, 32\r\n\r\nBLOCKCOPY_PS_W8_H4  8, 12\r\nBLOCKCOPY_PS_W8_H4  8, 64\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W12_H2 2\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_%1x%2, 4, 5, 3, dst, dstStride, src, srcStride\r\n\r\nadd        r1,      r1\r\nmov        r4d,     %2/2\r\npxor       m0,      m0\r\n\r\n.loop:\r\n      movu       m1,             [r2]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0],           m2\r\n      punpckhbw  m1,             m0\r\n      movh       [r0 + 16],      m1\r\n\r\n      movu       m1,             [r2 + r3]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1],      m2\r\n      punpckhbw  m1,             m0\r\n      movh       [r0 + r1 + 16], m1\r\n\r\n      lea        r0,             [r0 + 2 * r1]\r\n      lea        r2,             [r2 + 2 * r3]\r\n\r\n      dec        r4d\r\n      jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W12_H2 12, 16\r\n\r\nBLOCKCOPY_PS_W12_H2 12, 32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_16x4(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_16x4, 4, 4, 3, dst, dstStride, src, srcStride\r\n\r\nadd        r1,      r1\r\npxor       m0,      
m0\r\n\r\nmovu       m1,                 [r2]\r\npmovzxbw   m2,                 m1\r\nmovu       [r0],               m2\r\npunpckhbw  m1,                 m0\r\nmovu       [r0 + 16],          m1\r\n\r\nmovu       m1,                 [r2 + r3]\r\npmovzxbw   m2,                 m1\r\nmovu       [r0 + r1],          m2\r\npunpckhbw  m1,                 m0\r\nmovu       [r0 + r1 + 16],     m1\r\n\r\nmovu       m1,                 [r2 + 2 * r3]\r\npmovzxbw   m2,                 m1\r\nmovu       [r0 + 2 * r1],      m2\r\npunpckhbw  m1,                 m0\r\nmovu       [r0 + 2 * r1 + 16], m1\r\n\r\nlea        r0,                 [r0 + 2 * r1]\r\nlea        r2,                 [r2 + 2 * r3]\r\n\r\nmovu       m1,                 [r2 + r3]\r\npmovzxbw   m2,                 m1\r\nmovu       [r0 + r1],          m2\r\npunpckhbw  m1,                 m0\r\nmovu       [r0 + r1 + 16],     m1\r\n\r\nRET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W16_H4 2\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_%1x%2, 4, 5, 3, dst, dstStride, src, srcStride\r\n\r\nadd        r1,      r1\r\nmov        r4d,     %2/4\r\npxor       m0,      m0\r\n\r\n.loop:\r\n      movu       m1,                 [r2]\r\n      pmovzxbw   m2,                 m1\r\n      movu       [r0],               m2\r\n      punpckhbw  m1,                 m0\r\n      movu       [r0 + 16],          m1\r\n\r\n      movu       m1,                 [r2 + r3]\r\n      pmovzxbw   m2,                 m1\r\n      movu       [r0 + r1],          m2\r\n      punpckhbw  m1,                 m0\r\n      movu       [r0 + r1 + 16],     m1\r\n\r\n      movu       m1,                 [r2 + 2 * r3]\r\n      pmovzxbw   m2,                 m1\r\n      movu       [r0 + 2 * r1],      m2\r\n      punpckhbw  m1,     
            m0\r\n      movu       [r0 + 2 * r1 + 16], m1\r\n\r\n      lea        r0,                 [r0 + 2 * r1]\r\n      lea        r2,                 [r2 + 2 * r3]\r\n\r\n      movu       m1,                 [r2 + r3]\r\n      pmovzxbw   m2,                 m1\r\n      movu       [r0 + r1],          m2\r\n      punpckhbw  m1,                 m0\r\n      movu       [r0 + r1 + 16],     m1\r\n\r\n      lea        r0,                 [r0 + 2 * r1]\r\n      lea        r2,                 [r2 + 2 * r3]\r\n\r\n      dec        r4d\r\n      jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W16_H4 16,  8\r\nBLOCKCOPY_PS_W16_H4 16, 12\r\nBLOCKCOPY_PS_W16_H4 16, 16\r\nBLOCKCOPY_PS_W16_H4 16, 32\r\nBLOCKCOPY_PS_W16_H4 16, 64\r\nBLOCKCOPY_PS_W16_H4 16, 24\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W16_H4_avx2 2\r\nINIT_YMM avx2\r\ncglobal blockcopy_ps_%1x%2, 4, 7, 3\r\n\r\n    add     r1, r1\r\n    mov     r4d, %2/4\r\n    lea     r5, [3 * r3]\r\n    lea     r6, [3 * r1]\r\n    pxor    m0, m0\r\n\r\n.loop:\r\n    movu        xm1, [r2]\r\n    pmovzxbw    m2, xm1\r\n    movu        [r0], m2\r\n    movu        xm1, [r2 + r3]\r\n    pmovzxbw    m2, xm1\r\n    movu        [r0 + r1], m2\r\n    movu        xm1, [r2 + 2 * r3]\r\n    pmovzxbw    m2, xm1\r\n    movu        [r0 + 2 * r1], m2\r\n    movu        xm1, [r2 + r5]\r\n    pmovzxbw    m2, xm1\r\n    movu        [r0 + r6], m2\r\n\r\n    lea         r0, [r0 + 4 * r1]\r\n    lea         r2, [r2 + 4 * r3]\r\n\r\n    dec         r4d\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W16_H4_avx2 16, 16\r\nBLOCKCOPY_PS_W16_H4_avx2 16, 32\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, 
intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W24_H2 2\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_%1x%2, 4, 5, 3, dst, dstStride, src, srcStride\r\n\r\nadd        r1,      r1\r\nmov        r4d,     %2/2\r\npxor       m0,      m0\r\n\r\n.loop:\r\n      movu       m1,             [r2]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0],           m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 16],      m1\r\n\r\n      movh       m1,             [r2 + 16]\r\n      pmovzxbw   m1,             m1\r\n      movu       [r0 + 32],      m1\r\n\r\n      movu       m1,             [r2 + r3]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + r1 + 16], m1\r\n\r\n      movh       m1,             [r2 + r3 + 16]\r\n      pmovzxbw   m1,             m1\r\n      movu       [r0 + r1 + 32], m1\r\n\r\n      lea        r0,             [r0 + 2 * r1]\r\n      lea        r2,             [r2 + 2 * r3]\r\n\r\n      dec        r4d\r\n      jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W24_H2 24, 32\r\n\r\nBLOCKCOPY_PS_W24_H2 24, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W32_H2 2\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_%1x%2, 4, 5, 3, dst, dstStride, src, srcStride\r\n\r\nadd        r1,      r1\r\nmov        r4d,     %2/2\r\npxor       m0,      m0\r\n\r\n.loop:\r\n      movu       m1,             [r2]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0],           m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 16],      m1\r\n\r\n      movu       m1,             
[r2 + 16]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + 32],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 48],      m1\r\n\r\n      movu       m1,             [r2 + r3]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + r1 + 16], m1\r\n\r\n      movu       m1,             [r2 + r3 + 16]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1 + 32], m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + r1 + 48], m1\r\n\r\n      lea        r0,             [r0 + 2 * r1]\r\n      lea        r2,             [r2 + 2 * r3]\r\n\r\n      dec        r4d\r\n      jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W32_H2 32,  8\r\nBLOCKCOPY_PS_W32_H2 32, 16\r\nBLOCKCOPY_PS_W32_H2 32, 24\r\nBLOCKCOPY_PS_W32_H2 32, 32\r\nBLOCKCOPY_PS_W32_H2 32, 64\r\n\r\nBLOCKCOPY_PS_W32_H2 32, 48\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W32_H4_avx2 2\r\nINIT_YMM avx2\r\ncglobal blockcopy_ps_%1x%2, 4, 7, 2\r\n    add     r1, r1\r\n    mov     r4d, %2/4\r\n    lea     r5, [3 * r3]\r\n    lea     r6, [3 * r1]\r\n.loop:\r\n    pmovzxbw      m0, [r2 +  0]\r\n    pmovzxbw      m1, [r2 + 16]\r\n    movu          [r0 +  0], m0\r\n    movu          [r0 + 32], m1\r\n\r\n    pmovzxbw      m0, [r2 + r3 +  0]\r\n    pmovzxbw      m1, [r2 + r3 + 16]\r\n    movu          [r0 + r1 +  0], m0\r\n    movu          [r0 + r1 + 32], m1\r\n\r\n    pmovzxbw      m0, [r2 + r3 * 2 +  0]\r\n    pmovzxbw      m1, [r2 + r3 * 2 + 16]\r\n    movu          [r0 + r1 * 2 +  0], m0\r\n    movu          [r0 + r1 * 2 + 32], m1\r\n\r\n    pmovzxbw      m0, [r2 + r5 +  0]\r\n    pmovzxbw      m1, [r2 + r5 + 16]\r\n    
movu          [r0 + r6 +  0], m0\r\n    movu          [r0 + r6 + 32], m1\r\n    lea           r0, [r0 + 4 * r1]\r\n    lea           r2, [r2 + 4 * r3]\r\n    dec           r4d\r\n    jnz           .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W32_H4_avx2 32, 32\r\nBLOCKCOPY_PS_W32_H4_avx2 32, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W48_H2 2\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_%1x%2, 4, 5, 3, dst, dstStride, src, srcStride\r\n\r\nadd        r1,      r1\r\nmov        r4d,     %2/2\r\npxor       m0,      m0\r\n\r\n.loop:\r\n      movu       m1,             [r2]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0],           m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 16],      m1\r\n\r\n      movu       m1,             [r2 + 16]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + 32],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 48],      m1\r\n\r\n      movu       m1,             [r2 + 32]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + 64],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 80],      m1\r\n\r\n      movu       m1,             [r2 + r3]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + r1 + 16], m1\r\n\r\n      movu       m1,             [r2 + r3 + 16]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1 + 32], m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + r1 + 48], m1\r\n\r\n      movu       m1,             [r2 + r3 + 32]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1 + 64], m2\r\n      punpckhbw  m1,             m0\r\n      
movu       [r0 + r1 + 80], m1\r\n\r\n      lea        r0,             [r0 + 2 * r1]\r\n      lea        r2,             [r2 + 2 * r3]\r\n\r\n      dec        r4d\r\n      jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W48_H2 48, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_PS_W64_H2 2\r\nINIT_XMM sse4\r\ncglobal blockcopy_ps_%1x%2, 4, 5, 3, dst, dstStride, src, srcStride\r\n\r\nadd        r1,      r1\r\nmov        r4d,     %2/2\r\npxor       m0,      m0\r\n\r\n.loop:\r\n      movu       m1,             [r2]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0],           m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 16],      m1\r\n\r\n      movu       m1,             [r2 + 16]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + 32],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 48],      m1\r\n\r\n      movu       m1,             [r2 + 32]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + 64],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 80],      m1\r\n\r\n      movu       m1,             [r2 + 48]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + 96],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + 112],     m1\r\n\r\n      movu       m1,             [r2 + r3]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1],      m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + r1 + 16], m1\r\n\r\n      movu       m1,             [r2 + r3 + 16]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1 + 32], m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + r1 + 48], m1\r\n\r\n      movu       m1,       
      [r2 + r3 + 32]\r\n      pmovzxbw   m2,             m1\r\n      movu       [r0 + r1 + 64], m2\r\n      punpckhbw  m1,             m0\r\n      movu       [r0 + r1 + 80], m1\r\n\r\n      movu       m1,              [r2 + r3 + 48]\r\n      pmovzxbw   m2,              m1\r\n      movu       [r0 + r1 + 96],  m2\r\n      punpckhbw  m1,              m0\r\n      movu       [r0 + r1 + 112], m1\r\n\r\n      lea        r0,              [r0 + 2 * r1]\r\n      lea        r2,              [r2 + 2 * r3]\r\n\r\n      dec        r4d\r\n      jnz        .loop\r\n\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_PS_W64_H2 64, 16\r\nBLOCKCOPY_PS_W64_H2 64, 32\r\nBLOCKCOPY_PS_W64_H2 64, 48\r\nBLOCKCOPY_PS_W64_H2 64, 64\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ps_64x64(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal blockcopy_ps_64x64, 4, 7, 4\r\n    add     r1, r1\r\n    mov     r4d, 64/8\r\n    lea     r5, [3 * r3]\r\n    lea     r6, [3 * r1]\r\n.loop:\r\n%rep 2\r\n    pmovzxbw      m0, [r2 +  0]\r\n    pmovzxbw      m1, [r2 + 16]\r\n    pmovzxbw      m2, [r2 + 32]\r\n    pmovzxbw      m3, [r2 + 48]\r\n    movu          [r0 +  0], m0\r\n    movu          [r0 + 32], m1\r\n    movu          [r0 + 64], m2\r\n    movu          [r0 + 96], m3\r\n\r\n    pmovzxbw      m0, [r2 + r3 +  0]\r\n    pmovzxbw      m1, [r2 + r3 + 16]\r\n    pmovzxbw      m2, [r2 + r3 + 32]\r\n    pmovzxbw      m3, [r2 + r3 + 48]\r\n    movu          [r0 + r1 +  0], m0\r\n    movu          [r0 + r1 + 32], m1\r\n    movu          [r0 + r1 + 64], m2\r\n    movu          [r0 + r1 + 96], m3\r\n\r\n    pmovzxbw      m0, [r2 + r3 * 2 +  0]\r\n    pmovzxbw      m1, [r2 + r3 * 2 + 16]\r\n    pmovzxbw      m2, [r2 + r3 * 2 + 32]\r\n    pmovzxbw      m3, [r2 + r3 * 2 + 48]\r\n    movu          [r0 + r1 * 2 +  0], m0\r\n    movu          [r0 
+ r1 * 2 + 32], m1\r\n    movu          [r0 + r1 * 2 + 64], m2\r\n    movu          [r0 + r1 * 2 + 96], m3\r\n\r\n    pmovzxbw      m0, [r2 + r5 +  0]\r\n    pmovzxbw      m1, [r2 + r5 + 16]\r\n    pmovzxbw      m2, [r2 + r5 + 32]\r\n    pmovzxbw      m3, [r2 + r5 + 48]\r\n    movu          [r0 + r6 +  0], m0\r\n    movu          [r0 + r6 + 32], m1\r\n    movu          [r0 + r6 + 64], m2\r\n    movu          [r0 + r6 + 96], m3\r\n    lea           r0, [r0 + 4 * r1]\r\n    lea           r2, [r2 + 4 * r3]\r\n%endrep\r\n    dec           r4d\r\n    jnz           .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_2x4(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_2x4, 4, 6, 0\r\n    add    r1, r1\r\n    add    r3, r3\r\n\r\n    mov    r4d, [r2]\r\n    mov    r5d, [r2 + r3]\r\n    mov    [r0], r4d\r\n    mov    [r0 + r1], r5d\r\n\r\n    lea    r2, [r2 + r3 * 2]\r\n    lea    r0, [r0 + 2 * r1]\r\n\r\n    mov    r4d, [r2]\r\n    mov    r5d, [r2 + r3]\r\n    mov    [r0], r4d\r\n    mov    [r0 + r1], r5d\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_2x8(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_2x8, 4, 6, 0\r\n    add    r1, r1\r\n    add    r3, r3\r\n\r\n    mov    r4d, [r2]\r\n    mov    r5d, [r2 + r3]\r\n    mov    [r0], r4d\r\n    mov    [r0 + r1], r5d\r\n\r\n    lea    r2, [r2 + r3 * 2]\r\n    lea    r0, [r0 + 2 * r1]\r\n\r\n    mov    r4d, [r2]\r\n    mov    r5d, [r2 + r3]\r\n    mov    [r0], r4d\r\n    mov    [r0 + r1], r5d\r\n\r\n    lea    r2, [r2 + r3 * 2]\r\n    lea    r0, [r0 + 2 * r1]\r\n\r\n    mov    r4d, 
[r2]\r\n    mov    r5d, [r2 + r3]\r\n    mov    [r0], r4d\r\n    mov    [r0 + r1], r5d\r\n\r\n    lea    r2, [r2 + r3 * 2]\r\n    lea    r0, [r0 + 2 * r1]\r\n\r\n    mov    r4d, [r2]\r\n    mov    r5d, [r2 + r3]\r\n    mov    [r0], r4d\r\n    mov    [r0 + r1], r5d\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_2x16(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_2x16, 4, 7, 0\r\n    add     r1, r1\r\n    add     r3, r3\r\n    mov     r6d,    16/2\r\n.loop:\r\n    mov     r4d,    [r2]\r\n    mov     r5d,    [r2 + r3]\r\n    dec     r6d\r\n    lea     r2, [r2 + r3 * 2]\r\n    mov     [r0],       r4d\r\n    mov     [r0 + r1],  r5d\r\n    lea     r0, [r0 + r1 * 2]\r\n    jnz     .loop\r\n    RET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_4x2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_4x2, 4, 4, 2\r\n    add     r1, r1\r\n    add     r3, r3\r\n\r\n    movh    m0, [r2]\r\n    movh    m1, [r2 + r3]\r\n\r\n    movh    [r0], m0\r\n    movh    [r0 + r1], m1\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_4x4(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_4x4, 4, 4, 4\r\n    add     r1, r1\r\n    add     r3, r3\r\n    movh    m0, [r2]\r\n    movh    m1, [r2 + r3]\r\n    lea     r2, [r2 + r3 * 2]\r\n    movh    m2, [r2]\r\n    movh    m3, [r2 + r3]\r\n\r\n    movh    [r0], m0\r\n    movh    [r0 + r1], m1\r\n    
lea     r0, [r0 + 2 * r1]\r\n    movh    [r0], m2\r\n    movh    [r0 + r1], m3\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W4_H8 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_%1x%2, 4, 5, 4\r\n    mov     r4d, %2/8\r\n    add     r1, r1\r\n    add     r3, r3\r\n.loop:\r\n    movh    m0, [r2]\r\n    movh    m1, [r2 + r3]\r\n    lea     r2, [r2 + r3 * 2]\r\n    movh    m2, [r2]\r\n    movh    m3, [r2 + r3]\r\n\r\n    movh    [r0], m0\r\n    movh    [r0 + r1], m1\r\n    lea     r0, [r0 + 2 * r1]\r\n    movh    [r0], m2\r\n    movh    [r0 + r1], m3\r\n\r\n    lea     r0, [r0 + 2 * r1]\r\n    lea     r2, [r2 + 2 * r3]\r\n    movh    m0, [r2]\r\n    movh    m1, [r2 + r3]\r\n    lea     r2, [r2 + r3 * 2]\r\n    movh    m2, [r2]\r\n    movh    m3, [r2 + r3]\r\n\r\n    movh    [r0], m0\r\n    movh    [r0 + r1], m1\r\n    lea     r0, [r0 + 2 * r1]\r\n    movh    [r0], m2\r\n    movh    [r0 + r1], m3\r\n    lea     r0, [r0 + 2 * r1]\r\n    lea     r2, [r2 + 2 * r3]\r\n\r\n    dec     r4d\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W4_H8 4, 8\r\nBLOCKCOPY_SS_W4_H8 4, 16\r\n\r\nBLOCKCOPY_SS_W4_H8 4, 32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_6x8(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_6x8, 4, 4, 4\r\n    add       r1, r1\r\n    add       r3, r3\r\n\r\n    movu      m0, [r2]\r\n    movu      m1, [r2 + r3]\r\n    pshufd    m2, m0, 2\r\n    pshufd    m3, m1, 2\r\n    movh      [r0], m0\r\n    movd      [r0 + 8], m2\r\n    movh      [r0 + r1], m1\r\n    movd      [r0 + r1 + 8], 
m3\r\n\r\n    lea       r0, [r0 + 2 * r1]\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    movu      m0, [r2]\r\n    movu      m1, [r2 + r3]\r\n    pshufd    m2, m0, 2\r\n    pshufd    m3, m1, 2\r\n    movh      [r0], m0\r\n    movd      [r0 + 8], m2\r\n    movh      [r0 + r1], m1\r\n    movd      [r0 + r1 + 8], m3\r\n\r\n    lea       r0, [r0 + 2 * r1]\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    movu      m0, [r2]\r\n    movu      m1, [r2 + r3]\r\n    pshufd    m2, m0, 2\r\n    pshufd    m3, m1, 2\r\n    movh      [r0], m0\r\n    movd      [r0 + 8], m2\r\n    movh      [r0 + r1], m1\r\n    movd      [r0 + r1 + 8], m3\r\n\r\n    lea       r0, [r0 + 2 * r1]\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    movu      m0, [r2]\r\n    movu      m1, [r2 + r3]\r\n    pshufd    m2, m0, 2\r\n    pshufd    m3, m1, 2\r\n    movh      [r0], m0\r\n    movd      [r0 + 8], m2\r\n    movh      [r0 + r1], m1\r\n    movd      [r0 + r1 + 8], m3\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_6x16(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_6x16, 4, 5, 4\r\n    add     r1, r1\r\n    add     r3, r3\r\n    mov     r4d,    16/2\r\n.loop:\r\n    movh    m0, [r2]\r\n    movd    m2, [r2 + 8]\r\n    movh    m1, [r2 + r3]\r\n    movd    m3, [r2 + r3 + 8]\r\n    dec     r4d\r\n    lea     r2, [r2 + r3 * 2]\r\n    movh    [r0],           m0\r\n    movd    [r0 + 8],       m2\r\n    movh    [r0 + r1],      m1\r\n    movd    [r0 + r1 + 8],  m3\r\n    lea     r0, [r0 + r1 * 2]\r\n    jnz     .loop\r\n    RET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_8x2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t 
srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_8x2, 4, 4, 2\r\n    add     r1, r1\r\n    add     r3, r3\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_8x4(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_8x4, 4, 4, 4\r\n    add     r1, r1\r\n    add     r3, r3\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    lea     r2, [r2 + r3 * 2]\r\n    movu    m2, [r2]\r\n    movu    m3, [r2 + r3]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    lea     r0, [r0 + 2 * r1]\r\n    movu    [r0], m2\r\n    movu    [r0 + r1], m3\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_8x6(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_8x6, 4, 4, 4\r\n\r\n    add     r1, r1\r\n    add     r3, r3\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    lea     r2, [r2 + r3 * 2]\r\n    movu    m2, [r2]\r\n    movu    m3, [r2 + r3]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    lea     r0, [r0 + 2 * r1]\r\n    movu    [r0], m2\r\n    movu    [r0 + r1], m3\r\n\r\n    lea     r2, [r2 + r3 * 2]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_8x12(int16_t* dst, intptr_t dstStride, const int16_t* src, 
intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_8x12, 4, 5, 2\r\n    add     r1, r1\r\n    add     r3, r3\r\n    mov     r4d, 12/2\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    lea     r2, [r2 + 2 * r3]\r\n    dec     r4d\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    lea     r0, [r0 + 2 * r1]\r\n    jnz     .loop\r\n    RET\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W8_H8 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_%1x%2, 4, 5, 4\r\n    mov     r4d, %2/8\r\n    add     r1, r1\r\n    add     r3, r3\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    lea     r2, [r2 + r3 * 2]\r\n    movu    m2, [r2]\r\n    movu    m3, [r2 + r3]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    lea     r0, [r0 + 2 * r1]\r\n    movu    [r0], m2\r\n    movu    [r0 + r1], m3\r\n\r\n\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    lea     r2, [r2 + r3 * 2]\r\n    movu    m2, [r2]\r\n    movu    m3, [r2 + r3]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    lea     r0, [r0 + 2 * r1]\r\n    movu    [r0], m2\r\n    movu    [r0 + r1], m3\r\n\r\n    dec     r4d\r\n    lea     r0, [r0 + 2 * r1]\r\n    lea     r2, [r2 + 2 * r3]\r\n    jnz    .loop\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W8_H8 8, 8\r\nBLOCKCOPY_SS_W8_H8 8, 16\r\nBLOCKCOPY_SS_W8_H8 8, 32\r\n\r\nBLOCKCOPY_SS_W8_H8 8, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t 
srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W12_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_%1x%2, 4, 5, 4\r\n\r\n    mov     r4d, %2/4\r\n    add     r1, r1\r\n    add     r3, r3\r\n.loop:\r\n    movu    m0, [r2]\r\n    movh    m1, [r2 + 16]\r\n    movu    m2, [r2 + r3]\r\n    movh    m3, [r2 + r3 + 16]\r\n    lea     r2, [r2 + 2 * r3]\r\n\r\n    movu    [r0], m0\r\n    movh    [r0 + 16], m1\r\n    movu    [r0 + r1], m2\r\n    movh    [r0 + r1 + 16], m3\r\n\r\n    lea     r0, [r0 + 2 * r1]\r\n    movu    m0, [r2]\r\n    movh    m1, [r2 + 16]\r\n    movu    m2, [r2 + r3]\r\n    movh    m3, [r2 + r3 + 16]\r\n\r\n    movu    [r0], m0\r\n    movh    [r0 + 16], m1\r\n    movu    [r0 + r1], m2\r\n    movh    [r0 + r1 + 16], m3\r\n\r\n    dec     r4d\r\n    lea     r0, [r0 + 2 * r1]\r\n    lea     r2, [r2 + 2 * r3]\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W12_H4 12, 16\r\n\r\nBLOCKCOPY_SS_W12_H4 12, 32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W16_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_%1x%2, 4, 5, 4\r\n    mov     r4d, %2/4\r\n    add     r1, r1\r\n    add     r3, r3\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + r3]\r\n    movu    m3, [r2 + r3 + 16]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + r1], m2\r\n    movu    [r0 + r1 + 16], m3\r\n\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + r3]\r\n    movu    m3, [r2 + r3 + 16]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + r1], m2\r\n    movu    [r0 + r1 + 16], m3\r\n\r\n  
  dec     r4d\r\n    lea     r0, [r0 + 2 * r1]\r\n    lea     r2, [r2 + 2 * r3]\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W16_H4 16, 4\r\nBLOCKCOPY_SS_W16_H4 16, 12\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W16_H4_avx 2\r\nINIT_YMM avx\r\ncglobal blockcopy_ss_%1x%2, 4, 7, 4\r\n    mov     r4d, %2/4\r\n    add     r1, r1\r\n    add     r3, r3\r\n    lea     r5, [3 * r3]\r\n    lea     r6, [3 * r1]\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + r3]\r\n    movu    m2, [r2 + 2 * r3]\r\n    movu    m3, [r2 + r5]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + r1], m1\r\n    movu    [r0 + 2 * r1], m2\r\n    movu    [r0 + r6], m3\r\n\r\n    lea     r0, [r0 + 4 * r1]\r\n    lea     r2, [r2 + 4 * r3]\r\n    dec     r4d\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W16_H4_avx 16, 4\r\nBLOCKCOPY_SS_W16_H4_avx 16, 12\r\nBLOCKCOPY_SS_W16_H4_avx 16, 8\r\nBLOCKCOPY_SS_W16_H4_avx 16, 16\r\nBLOCKCOPY_SS_W16_H4_avx 16, 24\r\nBLOCKCOPY_SS_W16_H4_avx 16, 32\r\nBLOCKCOPY_SS_W16_H4_avx 16, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W16_H8 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_%1x%2, 4, 5, 4\r\n    mov     r4d, %2/8\r\n    add     r1, r1\r\n    add     r3, r3\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + r3]\r\n    movu    m3, [r2 + r3 + 16]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + r1], m2\r\n    movu    [r0 + r1 + 16], m3\r\n\r\n    lea     r2, [r2 + 2 * 
r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + r3]\r\n    movu    m3, [r2 + r3 + 16]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + r1], m2\r\n    movu    [r0 + r1 + 16], m3\r\n\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + r3]\r\n    movu    m3, [r2 + r3 + 16]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + r1], m2\r\n    movu    [r0 + r1 + 16], m3\r\n\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + r3]\r\n    movu    m3, [r2 + r3 + 16]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + r1], m2\r\n    movu    [r0 + r1 + 16], m3\r\n\r\n    dec     r4d\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W16_H8 16, 8\r\nBLOCKCOPY_SS_W16_H8 16, 16\r\nBLOCKCOPY_SS_W16_H8 16, 32\r\nBLOCKCOPY_SS_W16_H8 16, 64\r\n\r\nBLOCKCOPY_SS_W16_H8 16, 24\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W24_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_%1x%2, 4, 5, 6\r\n    mov     r4d, %2/4\r\n    add     r1, r1\r\n    add     r3, r3\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + 32]\r\n    movu    m3, [r2 + r3]\r\n    movu    m4, [r2 + r3 + 16]\r\n    movu    m5, [r2 + r3 + 32]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + 32], m2\r\n    movu    [r0 + r1], m3\r\n    movu    [r0 + r1 + 16], m4\r\n    movu    [r0 + r1 + 32], m5\r\n\r\n    
lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + 32]\r\n    movu    m3, [r2 + r3]\r\n    movu    m4, [r2 + r3 + 16]\r\n    movu    m5, [r2 + r3 + 32]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + 32], m2\r\n    movu    [r0 + r1], m3\r\n    movu    [r0 + r1 + 16], m4\r\n    movu    [r0 + r1 + 32], m5\r\n\r\n    dec     r4d\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W24_H4 24, 32\r\n\r\nBLOCKCOPY_SS_W24_H4 24, 64\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W24_H4_avx 2\r\nINIT_YMM avx\r\ncglobal blockcopy_ss_%1x%2, 4, 7, 2\r\n\r\n    mov    r4d, %2/4\r\n    add    r1, r1\r\n    add    r3, r3\r\n    lea    r5, [3 * r3]\r\n    lea    r6, [3 * r1]\r\n\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    xm1, [r2 + 32]\r\n    movu    [r0], m0\r\n    movu    [r0 + 32], xm1\r\n    movu    m0, [r2 + r3]\r\n    movu    xm1, [r2 + r3 + 32]\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 32], xm1\r\n    movu    m0, [r2 + 2 * r3]\r\n    movu    xm1, [r2 + 2 * r3 + 32]\r\n    movu    [r0 + 2 * r1], m0\r\n    movu    [r0 + 2 * r1 + 32], xm1\r\n    movu    m0, [r2 + r5]\r\n    movu    xm1, [r2 + r5 + 32]\r\n    movu    [r0 + r6], m0\r\n    movu    [r0 + r6 + 32], xm1\r\n    dec     r4d\r\n    lea     r2, [r2 + 4 * r3]\r\n    lea     r0, [r0 + 4 * r1]\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W24_H4_avx 24, 32\r\nBLOCKCOPY_SS_W24_H4_avx 24, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* 
src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W32_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_%1x%2, 4, 5, 4\r\n    mov     r4d, %2/4\r\n    add     r1, r1\r\n    add     r3, r3\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + 32]\r\n    movu    m3, [r2 + 48]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + 32], m2\r\n    movu    [r0 + 48], m3\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 16]\r\n    movu    m2, [r2 + r3 + 32]\r\n    movu    m3, [r2 + r3 + 48]\r\n\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 16], m1\r\n    movu    [r0 + r1 + 32], m2\r\n    movu    [r0 + r1 + 48], m3\r\n\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + 32]\r\n    movu    m3, [r2 + 48]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + 32], m2\r\n    movu    [r0 + 48], m3\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 16]\r\n    movu    m2, [r2 + r3 + 32]\r\n    movu    m3, [r2 + r3 + 48]\r\n\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 16], m1\r\n    movu    [r0 + r1 + 32], m2\r\n    movu    [r0 + r1 + 48], m3\r\n\r\n    dec     r4d\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W32_H4 32, 8\r\nBLOCKCOPY_SS_W32_H4 32, 16\r\nBLOCKCOPY_SS_W32_H4 32, 24\r\nBLOCKCOPY_SS_W32_H4 32, 32\r\nBLOCKCOPY_SS_W32_H4 32, 64\r\n\r\nBLOCKCOPY_SS_W32_H4 32, 48\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W32_H4_avx 2\r\nINIT_YMM avx\r\ncglobal 
blockcopy_ss_%1x%2, 4, 7, 4\r\n\r\n    mov    r4d, %2/4\r\n    add    r1, r1\r\n    add    r3, r3\r\n    lea    r5, [3 * r1]\r\n    lea    r6, [3 * r3]\r\n\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 32]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 32], m1\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 32]\r\n\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 32], m1\r\n\r\n    movu    m0, [r2 + 2 * r3]\r\n    movu    m1, [r2 + 2 * r3 + 32]\r\n\r\n    movu    [r0 + 2 * r1], m0\r\n    movu    [r0 + 2 * r1 + 32], m1\r\n\r\n    movu    m0, [r2 + r6]\r\n    movu    m1, [r2 + r6 + 32]\r\n\r\n    movu    [r0 + r5], m0\r\n    movu    [r0 + r5 + 32], m1\r\n\r\n    dec     r4d\r\n    lea     r2, [r2 + 4 * r3]\r\n    lea     r0, [r0 + 4 * r1]\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W32_H4_avx 32,  8\r\nBLOCKCOPY_SS_W32_H4_avx 32, 16\r\nBLOCKCOPY_SS_W32_H4_avx 32, 24\r\nBLOCKCOPY_SS_W32_H4_avx 32, 32\r\nBLOCKCOPY_SS_W32_H4_avx 32, 48\r\nBLOCKCOPY_SS_W32_H4_avx 32, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W48_H2 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_%1x%2, 4, 5, 6\r\n    mov     r4d, %2/4\r\n    add     r1, r1\r\n    add     r3, r3\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + 32]\r\n    movu    m3, [r2 + 48]\r\n    movu    m4, [r2 + 64]\r\n    movu    m5, [r2 + 80]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + 32], m2\r\n    movu    [r0 + 48], m3\r\n    movu    [r0 + 64], m4\r\n    movu    [r0 + 80], m5\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 16]\r\n    movu    m2, [r2 + r3 + 32]\r\n    movu    m3, [r2 + r3 + 48]\r\n    movu    m4, [r2 + r3 + 64]\r\n    movu    
m5, [r2 + r3 + 80]\r\n\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 16], m1\r\n    movu    [r0 + r1 + 32], m2\r\n    movu    [r0 + r1 + 48], m3\r\n    movu    [r0 + r1 + 64], m4\r\n    movu    [r0 + r1 + 80], m5\r\n\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + 32]\r\n    movu    m3, [r2 + 48]\r\n    movu    m4, [r2 + 64]\r\n    movu    m5, [r2 + 80]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + 32], m2\r\n    movu    [r0 + 48], m3\r\n    movu    [r0 + 64], m4\r\n    movu    [r0 + 80], m5\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 16]\r\n    movu    m2, [r2 + r3 + 32]\r\n    movu    m3, [r2 + r3 + 48]\r\n    movu    m4, [r2 + r3 + 64]\r\n    movu    m5, [r2 + r3 + 80]\r\n\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 16], m1\r\n    movu    [r0 + r1 + 32], m2\r\n    movu    [r0 + r1 + 48], m3\r\n    movu    [r0 + r1 + 64], m4\r\n    movu    [r0 + r1 + 80], m5\r\n\r\n    dec     r4d\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n    jnz     .loop\r\nRET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W48_H2 48, 64\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_48x64(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx\r\ncglobal blockcopy_ss_48x64, 4, 7, 6\r\n\r\n    mov    r4d, 64/4\r\n    add    r1, r1\r\n    add    r3, r3\r\n    lea    r5, [3 * r3]\r\n    lea    r6, [3 * r1]\r\n\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 32]\r\n    movu    m2, [r2 + 64]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 32], m1\r\n    movu    [r0 + 64], m2\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 32]\r\n    movu    m2, [r2 + r3 + 64]\r\n\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 
32], m1\r\n    movu    [r0 + r1 + 64], m2\r\n\r\n    movu    m0, [r2 + 2 * r3]\r\n    movu    m1, [r2 + 2 * r3 + 32]\r\n    movu    m2, [r2 + 2 * r3 + 64]\r\n\r\n    movu    [r0 + 2 * r1], m0\r\n    movu    [r0 + 2 * r1 + 32], m1\r\n    movu    [r0 + 2 * r1 + 64], m2\r\n\r\n    movu    m0, [r2 + r5]\r\n    movu    m1, [r2 + r5 + 32]\r\n    movu    m2, [r2 + r5 + 64]\r\n\r\n    movu    [r0 + r6], m0\r\n    movu    [r0 + r6 + 32], m1\r\n    movu    [r0 + r6 + 64], m2\r\n\r\n    dec     r4d\r\n    lea     r2, [r2 + 4 * r3]\r\n    lea     r0, [r0 + 4 * r1]\r\n    jnz     .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W64_H4 2\r\nINIT_XMM sse2\r\ncglobal blockcopy_ss_%1x%2, 4, 5, 6, dst, dstStride, src, srcStride\r\n    mov     r4d, %2/4\r\n    add     r1, r1\r\n    add     r3, r3\r\n.loop:\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + 32]\r\n    movu    m3, [r2 + 48]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + 32], m2\r\n    movu    [r0 + 48], m3\r\n\r\n    movu    m0,    [r2 + 64]\r\n    movu    m1,    [r2 + 80]\r\n    movu    m2,    [r2 + 96]\r\n    movu    m3,    [r2 + 112]\r\n\r\n    movu    [r0 + 64], m0\r\n    movu    [r0 + 80], m1\r\n    movu    [r0 + 96], m2\r\n    movu    [r0 + 112], m3\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 16]\r\n    movu    m2, [r2 + r3 + 32]\r\n    movu    m3, [r2 + r3 + 48]\r\n\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 16], m1\r\n    movu    [r0 + r1 + 32], m2\r\n    movu    [r0 + r1 + 48], m3\r\n\r\n    movu    m0, [r2 + r3 + 64]\r\n    movu    m1, [r2 + r3 + 80]\r\n    movu    m2, [r2 + r3 + 96]\r\n    movu    m3, [r2 + r3 + 112]\r\n\r\n    movu    [r0 + r1 + 64], m0\r\n    
movu    [r0 + r1 + 80], m1\r\n    movu    [r0 + r1 + 96], m2\r\n    movu    [r0 + r1 + 112], m3\r\n\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n\r\n    movu    m0, [r2]\r\n    movu    m1, [r2 + 16]\r\n    movu    m2, [r2 + 32]\r\n    movu    m3, [r2 + 48]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 16], m1\r\n    movu    [r0 + 32], m2\r\n    movu    [r0 + 48], m3\r\n\r\n    movu    m0,    [r2 + 64]\r\n    movu    m1,    [r2 + 80]\r\n    movu    m2,    [r2 + 96]\r\n    movu    m3,    [r2 + 112]\r\n\r\n    movu    [r0 + 64], m0\r\n    movu    [r0 + 80], m1\r\n    movu    [r0 + 96], m2\r\n    movu    [r0 + 112], m3\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 16]\r\n    movu    m2, [r2 + r3 + 32]\r\n    movu    m3, [r2 + r3 + 48]\r\n\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 16], m1\r\n    movu    [r0 + r1 + 32], m2\r\n    movu    [r0 + r1 + 48], m3\r\n\r\n    movu    m0, [r2 + r3 + 64]\r\n    movu    m1, [r2 + r3 + 80]\r\n    movu    m2, [r2 + r3 + 96]\r\n    movu    m3, [r2 + r3 + 112]\r\n\r\n    movu    [r0 + r1 + 64], m0\r\n    movu    [r0 + r1 + 80], m1\r\n    movu    [r0 + r1 + 96], m2\r\n    movu    [r0 + r1 + 112], m3\r\n\r\n    dec     r4d\r\n    lea     r2, [r2 + 2 * r3]\r\n    lea     r0, [r0 + 2 * r1]\r\n    jnz     .loop\r\n\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W64_H4 64, 16\r\nBLOCKCOPY_SS_W64_H4 64, 32\r\nBLOCKCOPY_SS_W64_H4 64, 48\r\nBLOCKCOPY_SS_W64_H4 64, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)\r\n;-----------------------------------------------------------------------------\r\n%macro BLOCKCOPY_SS_W64_H4_avx 2\r\nINIT_YMM avx\r\ncglobal blockcopy_ss_%1x%2, 4, 7, 4, dst, dstStride, src, srcStride\r\n    mov     r4d, %2/4\r\n    add     r1, r1\r\n    add     r3, r3\r\n    lea     r5, [3 * r1]\r\n    lea     r6, [3 * r3]\r\n.loop:\r\n    
movu    m0, [r2]\r\n    movu    m1, [r2 + 32]\r\n    movu    m2, [r2 + 64]\r\n    movu    m3, [r2 + 96]\r\n\r\n    movu    [r0], m0\r\n    movu    [r0 + 32], m1\r\n    movu    [r0 + 64], m2\r\n    movu    [r0 + 96], m3\r\n\r\n    movu    m0, [r2 + r3]\r\n    movu    m1, [r2 + r3 + 32]\r\n    movu    m2, [r2 + r3 + 64]\r\n    movu    m3, [r2 + r3 + 96]\r\n\r\n    movu    [r0 + r1], m0\r\n    movu    [r0 + r1 + 32], m1\r\n    movu    [r0 + r1 + 64], m2\r\n    movu    [r0 + r1 + 96], m3\r\n\r\n    movu    m0, [r2 + 2 * r3]\r\n    movu    m1, [r2 + 2 * r3 + 32]\r\n    movu    m2, [r2 + 2 * r3 + 64]\r\n    movu    m3, [r2 + 2 * r3 + 96]\r\n\r\n    movu    [r0 + 2 * r1], m0\r\n    movu    [r0 + 2 * r1 + 32], m1\r\n    movu    [r0 + 2 * r1 + 64], m2\r\n    movu    [r0 + 2 * r1 + 96], m3\r\n\r\n    movu    m0, [r2 + r6]\r\n    movu    m1, [r2 + r6 + 32]\r\n    movu    m2, [r2 + r6 + 64]\r\n    movu    m3, [r2 + r6 + 96]\r\n    lea     r2, [r2 + 4 * r3]\r\n\r\n    movu    [r0 + r5], m0\r\n    movu    [r0 + r5 + 32], m1\r\n    movu    [r0 + r5 + 64], m2\r\n    movu    [r0 + r5 + 96], m3\r\n    lea     r0, [r0 + 4 * r1]\r\n\r\n    dec     r4d\r\n    jnz     .loop\r\n    RET\r\n%endmacro\r\n\r\nBLOCKCOPY_SS_W64_H4_avx 64, 16\r\nBLOCKCOPY_SS_W64_H4_avx 64, 32\r\nBLOCKCOPY_SS_W64_H4_avx 64, 48\r\nBLOCKCOPY_SS_W64_H4_avx 64, 64\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy2Dto1D_shr_4, 3, 4, 4\r\n    add             r2d, r2d\r\n    movd            m0, r3m\r\n    pcmpeqw         m1, m1\r\n    psllw           m1, m0\r\n    psraw           m1, 1\r\n\r\n    ; register alloc\r\n    ; r0 - dst\r\n    ; r1 - src\r\n    ; r2 - srcStride\r\n    ; m0 - shift\r\n    ; m1 - word [-round]\r\n\r\n    ; Row 0-3\r\n    movh            m2, 
[r1]\r\n    movhps          m2, [r1 + r2]\r\n    lea             r1, [r1 + r2 * 2]\r\n    movh            m3, [r1]\r\n    movhps          m3, [r1 + r2]\r\n    psubw           m2, m1\r\n    psubw           m3, m1\r\n    psraw           m2, m0\r\n    psraw           m3, m0\r\n    mova            [r0 + 0 * mmsize], m2\r\n    mova            [r0 + 1 * mmsize], m3\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy2Dto1D_shr_8, 3, 5, 4\r\n    add             r2d, r2d\r\n    movd            m0, r3m\r\n    pcmpeqw         m1, m1\r\n    psllw           m1, m0\r\n    psraw           m1, 1\r\n    mov             r3d, 8/4\r\n    lea             r4, [r2 * 3]\r\n\r\n    ; register alloc\r\n    ; r0 - dst\r\n    ; r1 - src\r\n    ; r2 - srcStride\r\n    ; r3 - loop counter\r\n    ; r4 - stride * 3\r\n    ; m0 - shift\r\n    ; m1 - word [-round]\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    mova            m2, [r1]\r\n    mova            m3, [r1 + r2]\r\n    psubw           m2, m1\r\n    psubw           m3, m1\r\n    psraw           m2, m0\r\n    psraw           m3, m0\r\n    mova            [r0 + 0 * mmsize], m2\r\n    mova            [r0 + 1 * mmsize], m3\r\n\r\n    ; Row 2-3\r\n    mova            m2, [r1 + r2 * 2]\r\n    mova            m3, [r1 + r4]\r\n    psubw           m2, m1\r\n    psubw           m3, m1\r\n    psraw           m2, m0\r\n    psraw           m3, m0\r\n    mova            [r0 + 2 * mmsize], m2\r\n    mova            [r0 + 3 * mmsize], m3\r\n\r\n    add             r0, 4 * mmsize\r\n    lea             r1, [r1 + r2 * 4]\r\n    dec             r3d\r\n    jnz            .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal cpy2Dto1D_shr_8, 3, 4, 4\r\n    add        r2d, r2d\r\n    movd       xm0, r3m\r\n    
pcmpeqw    m1, m1\r\n    psllw      m1, xm0\r\n    psraw      m1, 1\r\n    lea        r3, [r2 * 3]\r\n\r\n    ; Row 0-3\r\n    movu           xm2, [r1]\r\n    vinserti128    m2, m2, [r1 + r2], 1\r\n    movu           xm3, [r1 + 2 * r2]\r\n    vinserti128    m3, m3, [r1 + r3], 1\r\n    psubw          m2, m1\r\n    psraw          m2, xm0\r\n    psubw          m3, m1\r\n    psraw          m3, xm0\r\n    movu           [r0], m2\r\n    movu           [r0 + 32], m3\r\n\r\n    ; Row 4-7\r\n    lea            r1, [r1 + 4 * r2]\r\n    movu           xm2, [r1]\r\n    vinserti128    m2, m2, [r1 + r2], 1\r\n    movu           xm3, [r1 + 2 * r2]\r\n    vinserti128    m3, m3, [r1 + r3], 1\r\n    psubw          m2, m1\r\n    psraw          m2, xm0\r\n    psubw          m3, m1\r\n    psraw          m3, xm0\r\n    movu           [r0 + 64], m2\r\n    movu           [r0 + 96], m3\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy2Dto1D_shr_16, 3, 4, 4\r\n    add             r2d, r2d\r\n    movd            m0, r3m\r\n    pcmpeqw         m1, m1\r\n    psllw           m1, m0\r\n    psraw           m1, 1\r\n    mov             r3d, 16/2\r\n\r\n    ; register alloc\r\n    ; r0 - dst\r\n    ; r1 - src\r\n    ; r2 - srcStride\r\n    ; r3 - loop counter\r\n    ; m0 - shift\r\n    ; m1 - word [-round]\r\n\r\n.loop:\r\n    ; Row 0\r\n    mova            m2, [r1 + 0 * mmsize]\r\n    mova            m3, [r1 + 1 * mmsize]\r\n    psubw           m2, m1\r\n    psubw           m3, m1\r\n    psraw           m2, m0\r\n    psraw           m3, m0\r\n    mova            [r0 + 0 * mmsize], m2\r\n    mova            [r0 + 1 * mmsize], m3\r\n\r\n    ; Row 1\r\n    mova            m2, [r1 + r2 + 0 * mmsize]\r\n    mova            m3, [r1 + 
r2 + 1 * mmsize]\r\n    psubw           m2, m1\r\n    psubw           m3, m1\r\n    psraw           m2, m0\r\n    psraw           m3, m0\r\n    mova            [r0 + 2 * mmsize], m2\r\n    mova            [r0 + 3 * mmsize], m3\r\n\r\n    add             r0, 4 * mmsize\r\n    lea             r1, [r1 + r2 * 2]\r\n    dec             r3d\r\n    jnz            .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal cpy2Dto1D_shr_16, 4, 5, 4\r\n    add        r2d, r2d\r\n    movd       xm0, r3d\r\n    pcmpeqw    m1, m1\r\n    psllw      m1, xm0\r\n    psraw      m1, 1\r\n    lea        r3, [r2 * 3]\r\n    mov        r4d, 16/8\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    movu       m2, [r1]\r\n    movu       m3, [r1 + r2]\r\n    psubw      m2, m1\r\n    psraw      m2, xm0\r\n    psubw      m3, m1\r\n    psraw      m3, xm0\r\n    movu       [r0 + 0 * mmsize], m2\r\n    movu       [r0 + 1 * mmsize], m3\r\n\r\n    ; Row 2-3\r\n    movu       m2, [r1 + 2 * r2]\r\n    movu       m3, [r1 + r3]\r\n    psubw      m2, m1\r\n    psraw      m2, xm0\r\n    psubw      m3, m1\r\n    psraw      m3, xm0\r\n    movu       [r0 + 2 * mmsize], m2\r\n    movu       [r0 + 3 * mmsize], m3\r\n\r\n    ; Row 4-5\r\n    lea        r1, [r1 + 4 * r2]\r\n    movu       m2, [r1]\r\n    movu       m3, [r1 + r2]\r\n    psubw      m2, m1\r\n    psraw      m2, xm0\r\n    psubw      m3, m1\r\n    psraw      m3, xm0\r\n    movu       [r0 + 4 * mmsize], m2\r\n    movu       [r0 + 5 * mmsize], m3\r\n\r\n    ; Row 6-7\r\n    movu       m2, [r1 + 2 * r2]\r\n    movu       m3, [r1 + r3]\r\n    psubw      m2, m1\r\n    psraw      m2, xm0\r\n    psubw      m3, m1\r\n    psraw      m3, xm0\r\n    movu       [r0 + 6 * mmsize], m2\r\n    movu       [r0 + 7 * mmsize], m3\r\n\r\n    add        r0, 8 * mmsize\r\n    lea        r1, [r1 + 4 * r2]\r\n    dec        r4d\r\n    jnz        .loop\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shr(int16_t* 
dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy2Dto1D_shr_32, 3, 4, 6\r\n    add             r2d, r2d\r\n    movd            m0, r3m\r\n    pcmpeqw         m1, m1\r\n    psllw           m1, m0\r\n    psraw           m1, 1\r\n    mov             r3d, 32/1\r\n\r\n    ; register alloc\r\n    ; r0 - dst\r\n    ; r1 - src\r\n    ; r2 - srcStride\r\n    ; r3 - loop counter\r\n    ; m0 - shift\r\n    ; m1 - word [-round]\r\n\r\n.loop:\r\n    ; Row 0\r\n    mova            m2, [r1 + 0 * mmsize]\r\n    mova            m3, [r1 + 1 * mmsize]\r\n    mova            m4, [r1 + 2 * mmsize]\r\n    mova            m5, [r1 + 3 * mmsize]\r\n    psubw           m2, m1\r\n    psubw           m3, m1\r\n    psubw           m4, m1\r\n    psubw           m5, m1\r\n    psraw           m2, m0\r\n    psraw           m3, m0\r\n    psraw           m4, m0\r\n    psraw           m5, m0\r\n    mova            [r0 + 0 * mmsize], m2\r\n    mova            [r0 + 1 * mmsize], m3\r\n    mova            [r0 + 2 * mmsize], m4\r\n    mova            [r0 + 3 * mmsize], m5\r\n\r\n    add             r0, 4 * mmsize\r\n    add             r1, r2\r\n    dec             r3d\r\n    jnz            .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal cpy2Dto1D_shr_32, 4, 5, 4\r\n    add        r2d, r2d\r\n    movd       xm0, r3d\r\n    pcmpeqw    m1, m1\r\n    psllw      m1, xm0\r\n    psraw      m1, 1\r\n    lea        r3, [r2 * 3]\r\n    mov        r4d, 32/4\r\n\r\n.loop:\r\n    ; Row 0\r\n    movu       m2, [r1]\r\n    movu       m3, [r1 + 32]\r\n    psubw      m2, m1\r\n    psraw      m2, xm0\r\n    psubw      m3, m1\r\n    psraw      m3, xm0\r\n    movu       [r0 + 0 * mmsize], m2\r\n    movu       [r0 + 1 * mmsize], m3\r\n\r\n    ; Row 1\r\n    movu       m2, [r1 + r2]\r\n    movu       m3, [r1 + r2 + 32]\r\n    psubw      m2, m1\r\n    psraw      m2, xm0\r\n    psubw      m3, 
m1\r\n    psraw      m3, xm0\r\n    movu       [r0 + 2 * mmsize], m2\r\n    movu       [r0 + 3 * mmsize], m3\r\n\r\n    ; Row 2\r\n    movu       m2, [r1 + 2 * r2]\r\n    movu       m3, [r1 + 2 * r2 + 32]\r\n    psubw      m2, m1\r\n    psraw      m2, xm0\r\n    psubw      m3, m1\r\n    psraw      m3, xm0\r\n    movu       [r0 + 4 * mmsize], m2\r\n    movu       [r0 + 5 * mmsize], m3\r\n\r\n    ; Row 3\r\n    movu       m2, [r1 + r3]\r\n    movu       m3, [r1 + r3 + 32]\r\n    psubw      m2, m1\r\n    psraw      m2, xm0\r\n    psubw      m3, m1\r\n    psraw      m3, xm0\r\n    movu       [r0 + 6 * mmsize], m2\r\n    movu       [r0 + 7 * mmsize], m3\r\n\r\n    add        r0, 8 * mmsize\r\n    lea        r1, [r1 + 4 * r2]\r\n    dec        r4d\r\n    jnz        .loop\r\n    RET\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy1Dto2D_shl_4, 3, 3, 3\r\n    add         r2d, r2d\r\n    movd        m0, r3m\r\n\r\n    ; Row 0-3\r\n    mova        m1, [r1 + 0 * mmsize]\r\n    mova        m2, [r1 + 1 * mmsize]\r\n    psllw       m1, m0\r\n    psllw       m2, m0\r\n    movh        [r0], m1\r\n    movhps      [r0 + r2], m1\r\n    movh        [r0 + r2 * 2], m2\r\n    lea         r2, [r2 * 3]\r\n    movhps      [r0 + r2], m2\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal cpy1Dto2D_shl_4, 3, 3, 2\r\n    add         r2d, r2d\r\n    movd        xm0, r3m\r\n\r\n    ; Row 0-3\r\n    movu        m1, [r1]\r\n    psllw       m1, xm0\r\n    vextracti128 xm0, m1, 1\r\n    movq        [r0], xm1\r\n    movhps      [r0 + r2], xm1\r\n    lea         r0, [r0 + r2 * 2]\r\n    movq        [r0], xm0\r\n    movhps      [r0 + r2], xm0\r\n    
RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy1Dto2D_shl_8, 3, 4, 5\r\n    add         r2d, r2d\r\n    movd        m0, r3m\r\n    lea         r3, [r2 * 3]\r\n\r\n    ; Row 0-3\r\n    mova        m1, [r1 + 0 * mmsize]\r\n    mova        m2, [r1 + 1 * mmsize]\r\n    mova        m3, [r1 + 2 * mmsize]\r\n    mova        m4, [r1 + 3 * mmsize]\r\n    psllw       m1, m0\r\n    psllw       m2, m0\r\n    psllw       m3, m0\r\n    psllw       m4, m0\r\n    mova        [r0], m1\r\n    mova        [r0 + r2], m2\r\n    mova        [r0 + r2 * 2], m3\r\n    mova        [r0 + r3], m4\r\n    lea         r0, [r0 + r2 * 4]\r\n\r\n    ; Row 4-7\r\n    mova        m1, [r1 + 4 * mmsize]\r\n    mova        m2, [r1 + 5 * mmsize]\r\n    mova        m3, [r1 + 6 * mmsize]\r\n    mova        m4, [r1 + 7 * mmsize]\r\n    psllw       m1, m0\r\n    psllw       m2, m0\r\n    psllw       m3, m0\r\n    psllw       m4, m0\r\n    mova        [r0], m1\r\n    mova        [r0 + r2], m2\r\n    mova        [r0 + r2 * 2], m3\r\n    mova        [r0 + r3], m4\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal cpy1Dto2D_shl_8, 3, 4, 3\r\n    add         r2d, r2d\r\n    movd        xm0, r3m\r\n    lea         r3, [r2 * 3]\r\n\r\n    ; Row 0-3\r\n    movu        m1, [r1 + 0 * mmsize]\r\n    movu        m2, [r1 + 1 * mmsize]\r\n    psllw       m1, xm0\r\n    psllw       m2, xm0\r\n    movu        [r0], xm1\r\n    vextracti128 [r0 + r2], m1, 1\r\n    movu        [r0 + r2 * 2], xm2\r\n    vextracti128 [r0 + r3], m2, 1\r\n\r\n    ; Row 4-7\r\n    movu        m1, [r1 + 2 * mmsize]\r\n    movu        m2, [r1 + 3 * mmsize]\r\n    lea         r0, [r0 + r2 * 4]\r\n    psllw       m1, xm0\r\n    psllw       m2, xm0\r\n    movu        [r0], xm1\r\n    
vextracti128 [r0 + r2], m1, 1\r\n    movu        [r0 + r2 * 2], xm2\r\n    vextracti128 [r0 + r3], m2, 1\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy1Dto2D_shl_16, 3, 4, 5\r\n    add         r2d, r2d\r\n    movd        m0, r3m\r\n    mov         r3d, 16/4\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    mova        m1, [r1 + 0 * mmsize]\r\n    mova        m2, [r1 + 1 * mmsize]\r\n    mova        m3, [r1 + 2 * mmsize]\r\n    mova        m4, [r1 + 3 * mmsize]\r\n    psllw       m1, m0\r\n    psllw       m2, m0\r\n    psllw       m3, m0\r\n    psllw       m4, m0\r\n    mova        [r0], m1\r\n    mova        [r0 + 16], m2\r\n    mova        [r0 + r2], m3\r\n    mova        [r0 + r2 + 16], m4\r\n\r\n    ; Row 2-3\r\n    mova        m1, [r1 + 4 * mmsize]\r\n    mova        m2, [r1 + 5 * mmsize]\r\n    mova        m3, [r1 + 6 * mmsize]\r\n    mova        m4, [r1 + 7 * mmsize]\r\n    lea         r0, [r0 + r2 * 2]\r\n    psllw       m1, m0\r\n    psllw       m2, m0\r\n    psllw       m3, m0\r\n    psllw       m4, m0\r\n    mova        [r0], m1\r\n    mova        [r0 + 16], m2\r\n    mova        [r0 + r2], m3\r\n    mova        [r0 + r2 + 16], m4\r\n\r\n    add         r1, 8 * mmsize\r\n    lea         r0, [r0 + r2 * 2]\r\n    dec         r3d\r\n    jnz        .loop\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal cpy1Dto2D_shl_16, 3, 5, 3\r\n    add         r2d, r2d\r\n    movd        xm0, r3m\r\n    mov         r3d, 16/4\r\n    lea         r4, [r2 * 3]\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    movu        m1, [r1 + 0 * mmsize]\r\n    movu        m2, [r1 + 1 * mmsize]\r\n    psllw       m1, xm0\r\n    psllw       m2, xm0\r\n    movu        [r0], m1\r\n    movu        [r0 + r2], m2\r\n\r\n    ; Row 2-3\r\n    movu      
  m1, [r1 + 2 * mmsize]\r\n    movu        m2, [r1 + 3 * mmsize]\r\n    psllw       m1, xm0\r\n    psllw       m2, xm0\r\n    movu        [r0 + r2 * 2], m1\r\n    movu        [r0 + r4], m2\r\n\r\n    add         r1, 4 * mmsize\r\n    lea         r0, [r0 + r2 * 4]\r\n    dec         r3d\r\n    jnz        .loop\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy1Dto2D_shl_32, 3, 4, 5\r\n    add         r2d, r2d\r\n    movd        m0, r3m\r\n    mov         r3d, 32/2\r\n\r\n.loop:\r\n    ; Row 0\r\n    mova        m1, [r1 + 0 * mmsize]\r\n    mova        m2, [r1 + 1 * mmsize]\r\n    mova        m3, [r1 + 2 * mmsize]\r\n    mova        m4, [r1 + 3 * mmsize]\r\n    psllw       m1, m0\r\n    psllw       m2, m0\r\n    psllw       m3, m0\r\n    psllw       m4, m0\r\n    mova        [r0 + 0 * mmsize], m1\r\n    mova        [r0 + 1 * mmsize], m2\r\n    mova        [r0 + 2 * mmsize], m3\r\n    mova        [r0 + 3 * mmsize], m4\r\n\r\n    ; Row 1\r\n    mova        m1, [r1 + 4 * mmsize]\r\n    mova        m2, [r1 + 5 * mmsize]\r\n    mova        m3, [r1 + 6 * mmsize]\r\n    mova        m4, [r1 + 7 * mmsize]\r\n    psllw       m1, m0\r\n    psllw       m2, m0\r\n    psllw       m3, m0\r\n    psllw       m4, m0\r\n    mova        [r0 + r2 + 0 * mmsize], m1\r\n    mova        [r0 + r2 + 1 * mmsize], m2\r\n    mova        [r0 + r2 + 2 * mmsize], m3\r\n    mova        [r0 + r2 + 3 * mmsize], m4\r\n\r\n    add         r1, 8 * mmsize\r\n    lea         r0, [r0 + r2 * 2]\r\n    dec         r3d\r\n    jnz        .loop\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal cpy1Dto2D_shl_32, 3, 4, 5\r\n    add         r2d, r2d\r\n    movd        xm0, r3m\r\n    mov         r3d, 32/2\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    movu       
 m1, [r1 + 0 * mmsize]\r\n    movu        m2, [r1 + 1 * mmsize]\r\n    movu        m3, [r1 + 2 * mmsize]\r\n    movu        m4, [r1 + 3 * mmsize]\r\n    psllw       m1, xm0\r\n    psllw       m2, xm0\r\n    psllw       m3, xm0\r\n    psllw       m4, xm0\r\n    movu        [r0], m1\r\n    movu        [r0 + mmsize], m2\r\n    movu        [r0 + r2], m3\r\n    movu        [r0 + r2 + mmsize], m4\r\n\r\n    add         r1, 4 * mmsize\r\n    lea         r0, [r0 + r2 * 2]\r\n    dec         r3d\r\n    jnz        .loop\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; uint32_t copy_cnt(int16_t* dst, const int16_t* src, intptr_t srcStride);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal copy_cnt_4, 3,3,3\r\n    add         r2d, r2d\r\n    pxor        m2, m2\r\n\r\n    ; row 0 & 1\r\n    movh        m0, [r1]\r\n    movhps      m0, [r1 + r2]\r\n    mova        [r0], m0\r\n\r\n    ; row 2 & 3\r\n    movh        m1, [r1 + r2 * 2]\r\n    lea         r2, [r2 * 3]\r\n    movhps      m1, [r1 + r2]\r\n    mova        [r0 + 16], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m2\r\n\r\n    ; get count\r\n    ; CHECK_ME: Intel documents POPCNT as an SSE4.2 instruction, but it is only implemented from Nehalem onward\r\n%if 0\r\n    pmovmskb    eax, m0\r\n    not         ax\r\n    popcnt      ax, ax\r\n%else\r\n    ; pcmpeqb left 0xFF in every zero byte; adding 1 maps zero -> 0 and nonzero -> 1, then psadbw sums the bytes\r\n    mova        m1, [pb_1]\r\n    paddb       m0, m1\r\n    psadbw      m0, m2\r\n    pshufd      m1, m0, 2\r\n    paddw       m0, m1\r\n    movd        eax, m0\r\n%endif\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; uint32_t copy_cnt(int16_t* dst, const int16_t* src, intptr_t srcStride);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal copy_cnt_8, 3,3,6\r\n    add         r2d, r2d\r\n    pxor        m4, m4\r\n    
pxor        m5, m5\r\n\r\n   ; row 0 & 1\r\n    movu         m0, [r1]\r\n    movu        m1, [r1 + r2]\r\n    movu        [r0], m0\r\n    movu        [r0 + 16], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    ; row 2 & 3\r\n    lea         r1, [r1 + 2 * r2]\r\n    movu        m0, [r1]\r\n    movu        m1, [r1 + r2]\r\n    movu        [r0 + 32], m0\r\n    movu        [r0 + 48], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    ; row 4 & 5\r\n    lea         r1, [r1 + 2 * r2]\r\n    movu        m0, [r1]\r\n    movu        m1, [r1 + r2]\r\n    movu        [r0 + 64], m0\r\n    movu        [r0 + 80], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    ; row 6 & 7\r\n    lea         r1, [r1 + 2 * r2]\r\n    movu        m0, [r1]\r\n    movu        m1, [r1 + r2]\r\n    movu        [r0 + 96], m0\r\n    movu        [r0 + 112], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    ; get count\r\n    mova        m0, [pb_4]\r\n    paddb       m5, m0\r\n    psadbw      m5, m4\r\n    pshufd      m0, m5, 2\r\n    paddw       m5, m0\r\n    movd        eax, m5\r\n     RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal copy_cnt_8, 3,4,5\r\n    add         r2d, r2d\r\n    lea         r3, [r2 * 3]\r\n\r\n    ; row 0 - 1\r\n    movu        xm0, [r1]\r\n    vinserti128 m0, m0, [r1 + r2], 1\r\n    movu        [r0], m0\r\n\r\n    ; row 2 - 3\r\n    movu        xm1, [r1 + r2 * 2]\r\n    vinserti128 m1, m1, [r1 + r3], 1\r\n    movu        [r0 + 32], m1\r\n    lea         r1,  [r1 + r2 * 4]\r\n\r\n    ; row 4 - 5\r\n    movu        xm2, [r1]\r\n    vinserti128 m2, m2, [r1 + r2], 1\r\n    movu        [r0 + 64], m2\r\n\r\n    ; row 6 - 7\r\n    movu        xm3, [r1 + r2 * 2]\r\n    vinserti128 m3, m3, [r1 + r3], 1\r\n    movu        [r0 + 96], m3\r\n\r\n    ; get count\r\n    xorpd        m4, m4\r\n    vpacksswb    m0, m1\r\n  
  vpacksswb    m2, m3\r\n    pminub       m0, [pb_1]\r\n    pminub       m2, [pb_1]\r\n    paddb        m0, m2\r\n    vextracti128 xm1, m0, 1\r\n    paddb        xm0, xm1\r\n    psadbw       xm0, xm4\r\n    movhlps      xm1, xm0\r\n    paddd        xm0, xm1\r\n    movd         eax, xm0\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; uint32_t copy_cnt(int16_t* dst, const int16_t* src, intptr_t srcStride);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal copy_cnt_16, 3,4,6\r\n     add         r2d, r2d\r\n     mov         r3d, 4\r\n     pxor        m4, m4\r\n     pxor        m5, m5\r\n\r\n.loop:\r\n    ; row 0\r\n    movu        m0, [r1]\r\n    movu        m1, [r1 + 16]\r\n    movu        [r0], m0\r\n    movu        [r0 + 16], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n     ; row 1\r\n    movu        m0, [r1 + r2]\r\n    movu        m1, [r1 + r2 + 16]\r\n    movu        [r0 + 32], m0\r\n    movu        [r0 + 48], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    ; row 2\r\n    movu        m0, [r1 + 2 * r2]\r\n    movu        m1, [r1 + 2 * r2 + 16]\r\n    movu        [r0 + 64], m0\r\n    movu        [r0 + 80], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    ; row 3\r\n    lea         r1, [r1 + 2 * r2]\r\n    movu        m0, [r1 + r2]\r\n    movu        m1, [r1 + r2 + 16]\r\n    movu        [r0 + 96], m0\r\n    movu        [r0 + 112], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    add         r0, 128\r\n    lea         r1, [r1 + 2 * r2]\r\n     dec         r3d\r\n     jnz        .loop\r\n\r\n    mova        m0, [pb_16]\r\n    paddb       m5, m0\r\n    psadbw      m5, m4\r\n    pshufd      m0, m5, 2\r\n    paddw       m5, m0\r\n    movd        eax, 
m5\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal copy_cnt_16, 3, 5, 5\r\n    add         r2d, r2d\r\n    lea         r3,  [r2 * 3]\r\n    mov         r4d, 16/4\r\n\r\n    mova        m3, [pb_1]\r\n    xorpd       m4, m4\r\n\r\n.loop:\r\n    ; row 0 - 1\r\n    movu        m0, [r1]\r\n    movu        [r0], m0\r\n    movu        m1, [r1 + r2]\r\n    movu        [r0 + 32], m1\r\n\r\n    packsswb    m0, m1\r\n    pminub      m0, m3\r\n\r\n    ; row 2 - 3\r\n    movu        m1, [r1 + r2 * 2]\r\n    movu        [r0 + 64], m1\r\n    movu        m2, [r1 + r3]\r\n    movu        [r0 + 96], m2\r\n\r\n    packsswb    m1, m2\r\n    pminub      m1, m3\r\n    paddb       m0, m1\r\n    paddb       m4, m0\r\n\r\n    add         r0, 128\r\n    lea         r1, [r1 + 4 * r2]\r\n    dec         r4d\r\n    jnz         .loop\r\n\r\n    ; get count\r\n    xorpd        m0,  m0\r\n    vextracti128 xm1, m4, 1\r\n    paddb        xm4, xm1\r\n    psadbw       xm4, xm0\r\n    movhlps      xm1, xm4\r\n    paddd        xm4, xm1\r\n    movd         eax, xm4\r\n    RET\r\n\r\n;--------------------------------------------------------------------------------------\r\n; uint32_t copy_cnt(int16_t* dst, const int16_t* src, intptr_t srcStride);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal copy_cnt_32, 3,4,6\r\n    add         r2d, r2d\r\n    mov         r3d, 16\r\n    pxor        m4, m4\r\n    pxor        m5, m5\r\n\r\n.loop:\r\n    ; row 0\r\n    movu        m0, [r1]\r\n    movu        m1, [r1 + 16]\r\n    movu        [r0], m0\r\n    movu        [r0 + 16], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    movu        m0, [r1 + 32]\r\n    movu        m1, [r1 + 48]\r\n    movu        [r0 + 32], m0\r\n    movu        [r0 + 48], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    ; row 1\r\n    movu        m0, [r1 + r2]\r\n    movu        m1, [r1 
+ r2 + 16]\r\n    movu        [r0 + 64], m0\r\n    movu        [r0 + 80], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    movu        m0, [r1 + r2 + 32]\r\n    movu        m1, [r1 + r2 + 48]\r\n    movu        [r0 + 96], m0\r\n    movu        [r0 + 112], m1\r\n\r\n    packsswb    m0, m1\r\n    pcmpeqb     m0, m4\r\n    paddb       m5, m0\r\n\r\n    add         r0, 128\r\n    lea         r1, [r1 + 2 * r2]\r\n     dec         r3d\r\n     jnz        .loop\r\n\r\n     ; get count\r\n    mova        m0, [pb_64]\r\n    paddb       m5, m0\r\n    psadbw      m5, m4\r\n    pshufd      m0, m5, 2\r\n    paddw       m5, m0\r\n    movd        eax, m5\r\n     RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal copy_cnt_32, 3, 5, 5\r\n    add         r2d, r2d\r\n    mov         r3d, 32/2\r\n\r\n    mova        m3, [pb_1]\r\n    xorpd       m4, m4\r\n\r\n.loop:\r\n    ; row 0\r\n    movu        m0, [r1]\r\n    movu        [r0], m0\r\n    movu        m1, [r1 + 32]\r\n    movu        [r0 + 32], m1\r\n\r\n    packsswb    m0, m1\r\n    pminub      m0, m3\r\n\r\n    ; row 1\r\n    movu        m1, [r1 + r2]\r\n    movu        [r0 + 64], m1\r\n    movu        m2, [r1 + r2 + 32]\r\n    movu        [r0 + 96], m2\r\n\r\n    packsswb    m1, m2\r\n    pminub      m1, m3\r\n    paddb       m0, m1\r\n    paddb       m4, m0\r\n\r\n    add         r0, 128\r\n    lea         r1, [r1 + 2 * r2]\r\n    dec         r3d\r\n    jnz         .loop\r\n\r\n    ; get count\r\n    xorpd        m0,  m0\r\n    vextracti128 xm1, m4, 1\r\n    paddb        xm4, xm1\r\n    psadbw       xm4, xm0\r\n    movhlps      xm1, xm4\r\n    paddd        xm4, xm1\r\n    movd         eax, xm4\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM 
sse2\r\ncglobal cpy2Dto1D_shl_4, 4, 4, 4\r\n    add             r2d, r2d\r\n    movd            m0, r3d\r\n\r\n    ; register alloc\r\n    ; r0 - dst\r\n    ; r1 - src\r\n    ; r2 - srcStride\r\n    ; m0 - shift\r\n\r\n    ; Row 0-3\r\n    movh            m2, [r1]\r\n    movhps          m2, [r1 + r2]\r\n    lea             r1, [r1 + r2 * 2]\r\n    movh            m3, [r1]\r\n    movhps          m3, [r1 + r2]\r\n    psllw           m2, m0\r\n    psllw           m3, m0\r\n    mova            [r0 + 0 * mmsize], m2\r\n    mova            [r0 + 1 * mmsize], m3\r\n\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy2Dto1D_shl_8, 4, 5, 4\r\n    add             r2d, r2d\r\n    movd            m0, r3d\r\n    mov             r3d, 8/4\r\n    lea             r4, [r2 * 3]\r\n\r\n    ; register alloc\r\n    ; r0 - dst\r\n    ; r1 - src\r\n    ; r2 - srcStride\r\n    ; r3 - loop counter\r\n    ; r4 - stride * 3\r\n    ; m0 - shift\r\n\r\n.loop:\r\n    ; Row 0, 1\r\n    mova            m2, [r1]\r\n    mova            m3, [r1 + r2]\r\n    psllw           m2, m0\r\n    psllw           m3, m0\r\n    mova            [r0 + 0 * mmsize], m2\r\n    mova            [r0 + 1 * mmsize], m3\r\n\r\n    ; Row 2, 3\r\n    mova            m2, [r1 + r2 * 2]\r\n    mova            m3, [r1 + r4]\r\n    psllw           m2, m0\r\n    psllw           m3, m0\r\n    mova            [r0 + 2 * mmsize], m2\r\n    mova            [r0 + 3 * mmsize], m3\r\n\r\n    add             r0, 4 * mmsize\r\n    lea             r1, [r1 + r2 * 4]\r\n    dec             r3d\r\n    jnz            .loop\r\n    RET\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shl_8(int16_t* dst, const 
int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal cpy2Dto1D_shl_8, 4, 5, 2\r\n    add     r2d, r2d\r\n    movd    xm0, r3d\r\n    lea     r4, [3 * r2]\r\n\r\n    ; Row 0, 1\r\n    movu           xm1, [r1]\r\n    vinserti128    m1, m1, [r1 + r2], 1\r\n    psllw          m1, xm0\r\n    movu           [r0], m1\r\n\r\n    ; Row 2, 3\r\n    movu           xm1, [r1 + 2 * r2]\r\n    vinserti128    m1, m1, [r1 + r4], 1\r\n    psllw          m1, xm0\r\n    movu           [r0 + 32], m1\r\n\r\n    lea            r1, [r1 + 4 * r2]\r\n\r\n    ; Row 4, 5\r\n    movu           xm1, [r1]\r\n    vinserti128    m1, m1, [r1 + r2], 1\r\n    psllw          m1, xm0\r\n    movu           [r0 + 64], m1\r\n\r\n    ; Row 6, 7\r\n    movu           xm1, [r1 + 2 * r2]\r\n    vinserti128    m1, m1, [r1 + r4], 1\r\n    psllw          m1, xm0\r\n    movu           [r0 + 96], m1\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy2Dto1D_shl_16, 4, 4, 4\r\n    add             r2d, r2d\r\n    movd            m0, r3d\r\n    mov             r3d, 16/2\r\n\r\n    ; register alloc\r\n    ; r0 - dst\r\n    ; r1 - src\r\n    ; r2 - srcStride\r\n    ; r3 - loop counter\r\n    ; m0 - shift\r\n\r\n.loop:\r\n    ; Row 0\r\n    mova            m2, [r1 + 0 * mmsize]\r\n    mova            m3, [r1 + 1 * mmsize]\r\n    psllw           m2, m0\r\n    psllw           m3, m0\r\n    mova            [r0 + 0 * mmsize], m2\r\n    mova            [r0 + 1 * mmsize], m3\r\n\r\n    ; Row 1\r\n    mova            m2, [r1 + r2 + 0 * mmsize]\r\n    mova            m3, [r1 + r2 + 1 * mmsize]\r\n    psllw           m2, m0\r\n    psllw           m3, 
m0\r\n    mova            [r0 + 2 * mmsize], m2\r\n    mova            [r0 + 3 * mmsize], m3\r\n\r\n    add             r0, 4 * mmsize\r\n    lea             r1, [r1 + r2 * 2]\r\n    dec             r3d\r\n    jnz            .loop\r\n    RET\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shl_16(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal cpy2Dto1D_shl_16, 3, 5, 3\r\n    add    r2d, r2d\r\n    movd   xm0, r3m\r\n    mov    r3d, 16/4\r\n    lea     r4, [r2 * 3]\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    movu     m1, [r1]\r\n    movu     m2, [r1 + r2]\r\n    psllw    m1, xm0\r\n    psllw    m2, xm0\r\n    movu     [r0 + 0 * mmsize], m1\r\n    movu     [r0 + 1 * mmsize], m2\r\n\r\n    ; Row 2-3\r\n    movu     m1, [r1 + 2 * r2]\r\n    movu     m2, [r1 + r4]\r\n    psllw    m1, xm0\r\n    psllw    m2, xm0\r\n    movu     [r0 + 2 * mmsize], m1\r\n    movu     [r0 + 3 * mmsize], m2\r\n\r\n    add      r0, 4 * mmsize\r\n    lea      r1, [r1 + r2 * 4]\r\n    dec      r3d\r\n    jnz      .loop\r\n    RET\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy2Dto1D_shl_32, 4, 4, 6\r\n    add             r2d, r2d\r\n    movd            m0, r3d\r\n    mov             r3d, 32/1\r\n\r\n    ; register alloc\r\n    ; r0 - dst\r\n    ; r1 - src\r\n    ; r2 - srcStride\r\n    ; r3 - loop counter\r\n    ; m0 - shift\r\n\r\n.loop:\r\n    ; Row 0\r\n    mova            m2, [r1 + 0 * mmsize]\r\n    mova            m3, [r1 + 1 * mmsize]\r\n    mova            m4, [r1 + 2 * mmsize]\r\n    mova            m5, [r1 + 3 * mmsize]\r\n    psllw    
       m2, m0\r\n    psllw           m3, m0\r\n    psllw           m4, m0\r\n    psllw           m5, m0\r\n    mova            [r0 + 0 * mmsize], m2\r\n    mova            [r0 + 1 * mmsize], m3\r\n    mova            [r0 + 2 * mmsize], m4\r\n    mova            [r0 + 3 * mmsize], m5\r\n\r\n    add             r0, 4 * mmsize\r\n    add             r1, r2\r\n    dec             r3d\r\n    jnz            .loop\r\n    RET\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy2Dto1D_shl_32(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);\r\n;--------------------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal cpy2Dto1D_shl_32, 3, 5, 5\r\n    add     r2d, r2d\r\n    movd    xm0, r3m\r\n    mov     r3d, 32/4\r\n    lea     r4, [3 * r2]\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    movu     m1, [r1]\r\n    movu     m2, [r1 + 32]\r\n    movu     m3, [r1 + r2]\r\n    movu     m4, [r1 + r2 + 32]\r\n\r\n    psllw    m1, xm0\r\n    psllw    m2, xm0\r\n    psllw    m3, xm0\r\n    psllw    m4, xm0\r\n    movu     [r0], m1\r\n    movu     [r0 + mmsize], m2\r\n    movu     [r0 + 2 * mmsize], m3\r\n    movu     [r0 + 3 * mmsize], m4\r\n\r\n    ; Row 2-3\r\n    movu     m1, [r1 + 2 * r2]\r\n    movu     m2, [r1 + 2 * r2 + 32]\r\n    movu     m3, [r1 + r4]\r\n    movu     m4, [r1 + r4 + 32]\r\n\r\n    psllw    m1, xm0\r\n    psllw    m2, xm0\r\n    psllw    m3, xm0\r\n    psllw    m4, xm0\r\n    movu     [r0 + 4 * mmsize], m1\r\n    movu     [r0 + 5 * mmsize], m2\r\n    movu     [r0 + 6 * mmsize], m3\r\n    movu     [r0 + 7 * mmsize], m4\r\n\r\n    add      r0, 8 * mmsize\r\n    lea      r1, [r1 + r2 * 4]\r\n    dec      r3d\r\n    jnz      .loop\r\n    RET\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int 
shift)\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy1Dto2D_shr_4, 3, 3, 4\r\n    add         r2d, r2d\r\n    movd        m0, r3m\r\n    pcmpeqw     m1, m1\r\n    psllw       m1, m0\r\n    psraw       m1, 1\r\n\r\n    ; Row 0-3\r\n    mova        m2, [r1 + 0 * mmsize]\r\n    mova        m3, [r1 + 1 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psraw       m2, m0\r\n    psraw       m3, m0\r\n    movh        [r0], m2\r\n    movhps      [r0 + r2], m2\r\n    movh        [r0 + r2 * 2], m3\r\n    lea         r2, [r2 * 3]\r\n    movhps      [r0 + r2], m3\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal cpy1Dto2D_shr_4, 3, 3, 3\r\n    add         r2d, r2d\r\n    movd        xm0, r3m\r\n    pcmpeqw     m1, m1\r\n    psllw       m1, xm0\r\n    psraw       m1, 1\r\n\r\n    ; Row 0-3\r\n    movu        m2, [r1]\r\n    psubw       m2, m1\r\n    psraw       m2, xm0\r\n    vextracti128 xm1, m2, 1\r\n    movq        [r0], xm2\r\n    movhps      [r0 + r2], xm2\r\n    lea         r0, [r0 + r2 * 2]\r\n    movq        [r0], xm1\r\n    movhps      [r0 + r2], xm1\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy1Dto2D_shr_8, 3, 4, 6\r\n    add         r2d, r2d\r\n    movd        m0, r3m\r\n    pcmpeqw     m1, m1\r\n    psllw       m1, m0\r\n    psraw       m1, 1\r\n    lea         r3, [r2 * 3]\r\n\r\n    ; Row 0-3\r\n    mova        m2, [r1 + 0 * mmsize]\r\n    mova        m3, [r1 + 1 * mmsize]\r\n    mova        m4, [r1 + 2 * mmsize]\r\n    mova        m5, [r1 + 3 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psubw       m4, m1\r\n    psubw       m5, m1\r\n    psraw       m2, m0\r\n    psraw       m3, 
m0\r\n    psraw       m4, m0\r\n    psraw       m5, m0\r\n    mova        [r0], m2\r\n    mova        [r0 + r2], m3\r\n    mova        [r0 + r2 * 2], m4\r\n    mova        [r0 + r3], m5\r\n\r\n    ; Row 4-7\r\n    mova        m2, [r1 + 4 * mmsize]\r\n    mova        m3, [r1 + 5 * mmsize]\r\n    mova        m4, [r1 + 6 * mmsize]\r\n    mova        m5, [r1 + 7 * mmsize]\r\n    lea         r0, [r0 + r2 * 4]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psubw       m4, m1\r\n    psubw       m5, m1\r\n    psraw       m2, m0\r\n    psraw       m3, m0\r\n    psraw       m4, m0\r\n    psraw       m5, m0\r\n    mova        [r0], m2\r\n    mova        [r0 + r2], m3\r\n    mova        [r0 + r2 * 2], m4\r\n    mova        [r0 + r3], m5\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal cpy1Dto2D_shr_8, 3, 4, 4\r\n    add         r2d, r2d\r\n    movd        xm0, r3m\r\n    pcmpeqw     m1, m1\r\n    psllw       m1, xm0\r\n    psraw       m1, 1\r\n    lea         r3, [r2 * 3]\r\n\r\n    ; Row 0-3\r\n    movu        m2, [r1 + 0 * mmsize]\r\n    movu        m3, [r1 + 1 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psraw       m2, xm0\r\n    psraw       m3, xm0\r\n    movu        [r0], xm2\r\n    vextracti128 [r0 + r2], m2, 1\r\n    movu        [r0 + r2 * 2], xm3\r\n    vextracti128 [r0 + r3], m3, 1\r\n\r\n    ; Row 4-7\r\n    movu        m2, [r1 + 2 * mmsize]\r\n    movu        m3, [r1 + 3 * mmsize]\r\n    lea         r0, [r0 + r2 * 4]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psraw       m2, xm0\r\n    psraw       m3, xm0\r\n    movu        [r0], xm2\r\n    vextracti128 [r0 + r2], m2, 1\r\n    movu        [r0 + r2 * 2], xm3\r\n    vextracti128 [r0 + r3], m3, 1\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int 
shift)\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy1Dto2D_shr_16, 3, 5, 6\r\n    add         r2d, r2d\r\n    movd        m0, r3m\r\n    pcmpeqw     m1, m1\r\n    psllw       m1, m0\r\n    psraw       m1, 1\r\n    mov         r3d, 16/4\r\n    lea         r4, [r2 * 3]\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    mova        m2, [r1 + 0 * mmsize]\r\n    mova        m3, [r1 + 1 * mmsize]\r\n    mova        m4, [r1 + 2 * mmsize]\r\n    mova        m5, [r1 + 3 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psubw       m4, m1\r\n    psubw       m5, m1\r\n    psraw       m2, m0\r\n    psraw       m3, m0\r\n    psraw       m4, m0\r\n    psraw       m5, m0\r\n    mova        [r0], m2\r\n    mova        [r0 + mmsize], m3\r\n    mova        [r0 + r2], m4\r\n    mova        [r0 + r2 + mmsize], m5\r\n\r\n    ; Row 2-3\r\n    mova        m2, [r1 + 4 * mmsize]\r\n    mova        m3, [r1 + 5 * mmsize]\r\n    mova        m4, [r1 + 6 * mmsize]\r\n    mova        m5, [r1 + 7 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psubw       m4, m1\r\n    psubw       m5, m1\r\n    psraw       m2, m0\r\n    psraw       m3, m0\r\n    psraw       m4, m0\r\n    psraw       m5, m0\r\n    mova        [r0 + r2 * 2], m2\r\n    mova        [r0 + r2 * 2 + mmsize], m3\r\n    mova        [r0 + r4], m4\r\n    mova        [r0 + r4 + mmsize], m5\r\n\r\n    add         r1, 8 * mmsize\r\n    lea         r0, [r0 + r2 * 4]\r\n    dec         r3d\r\n    jnz        .loop\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal cpy1Dto2D_shr_16, 3, 5, 4\r\n    add         r2d, r2d\r\n    movd        xm0, r3m\r\n    pcmpeqw     m1, m1\r\n    psllw       m1, xm0\r\n    psraw       m1, 1\r\n    mov         r3d, 16/4\r\n    lea         r4, [r2 * 3]\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    movu        m2, [r1 + 0 * mmsize]\r\n    movu        m3, [r1 + 1 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n   
 psraw       m2, xm0\r\n    psraw       m3, xm0\r\n    movu        [r0], m2\r\n    movu        [r0 + r2], m3\r\n\r\n    ; Row 2-3\r\n    movu        m2, [r1 + 2 * mmsize]\r\n    movu        m3, [r1 + 3 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psraw       m2, xm0\r\n    psraw       m3, xm0\r\n    movu        [r0 + r2 * 2], m2\r\n    movu        [r0 + r4], m3\r\n\r\n    add         r1, 4 * mmsize\r\n    lea         r0, [r0 + r2 * 4]\r\n    dec         r3d\r\n    jnz        .loop\r\n    RET\r\n\r\n\r\n;--------------------------------------------------------------------------------------\r\n; void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)\r\n;--------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal cpy1Dto2D_shr_32, 3, 4, 6\r\n    add         r2d, r2d\r\n    movd        m0, r3m\r\n    pcmpeqw     m1, m1\r\n    psllw       m1, m0\r\n    psraw       m1, 1\r\n    mov         r3d, 32/2\r\n\r\n.loop:\r\n    ; Row 0\r\n    mova        m2, [r1 + 0 * mmsize]\r\n    mova        m3, [r1 + 1 * mmsize]\r\n    mova        m4, [r1 + 2 * mmsize]\r\n    mova        m5, [r1 + 3 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psubw       m4, m1\r\n    psubw       m5, m1\r\n    psraw       m2, m0\r\n    psraw       m3, m0\r\n    psraw       m4, m0\r\n    psraw       m5, m0\r\n    mova        [r0 + 0 * mmsize], m2\r\n    mova        [r0 + 1 * mmsize], m3\r\n    mova        [r0 + 2 * mmsize], m4\r\n    mova        [r0 + 3 * mmsize], m5\r\n\r\n    ; Row 1\r\n    mova        m2, [r1 + 4 * mmsize]\r\n    mova        m3, [r1 + 5 * mmsize]\r\n    mova        m4, [r1 + 6 * mmsize]\r\n    mova        m5, [r1 + 7 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psubw       m4, m1\r\n    psubw       m5, m1\r\n    psraw       m2, m0\r\n    psraw       m3, m0\r\n    psraw       m4, m0\r\n    psraw       m5, m0\r\n    mova        [r0 + r2 
+ 0 * mmsize], m2\r\n    mova        [r0 + r2 + 1 * mmsize], m3\r\n    mova        [r0 + r2 + 2 * mmsize], m4\r\n    mova        [r0 + r2 + 3 * mmsize], m5\r\n\r\n    add         r1, 8 * mmsize\r\n    lea         r0, [r0 + r2 * 2]\r\n    dec         r3d\r\n    jnz        .loop\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal cpy1Dto2D_shr_32, 3, 4, 6\r\n    add         r2d, r2d\r\n    movd        xm0, r3m\r\n    pcmpeqw     m1, m1\r\n    psllw       m1, xm0\r\n    psraw       m1, 1\r\n    mov         r3d, 32/2\r\n\r\n.loop:\r\n    ; Row 0-1\r\n    movu        m2, [r1 + 0 * mmsize]\r\n    movu        m3, [r1 + 1 * mmsize]\r\n    movu        m4, [r1 + 2 * mmsize]\r\n    movu        m5, [r1 + 3 * mmsize]\r\n    psubw       m2, m1\r\n    psubw       m3, m1\r\n    psubw       m4, m1\r\n    psubw       m5, m1\r\n    psraw       m2, xm0\r\n    psraw       m3, xm0\r\n    psraw       m4, xm0\r\n    psraw       m5, xm0\r\n    movu        [r0], m2\r\n    movu        [r0 + mmsize], m3\r\n    movu        [r0 + r2], m4\r\n    movu        [r0 + r2 + mmsize], m5\r\n\r\n    add         r1, 4 * mmsize\r\n    lea         r0, [r0 + r2 * 2]\r\n    dec         r3d\r\n    jnz        .loop\r\n    RET\r\n"
  },
  {
    "path": "source/common/x86/const-a.asm",
    "content": ";*****************************************************************************\r\n;* const-a.asm: x86 global constants\r\n;*****************************************************************************\r\n;* Copyright (C) 2003-2013 x264 project\r\n;* Copyright (C) 2013-2017 MulticoreWare, Inc\r\n;* Copyright (C) 2018~ VCL, NELVT, Peking University\r\n;*\r\n;* Authors: Loren Merritt <lorenm@u.washington.edu>\r\n;*          Fiona Glaser <fiona@x264.com>\r\n;*          Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>\r\n;*          Praveen Kumar Tiwari <praveen@multicorewareinc.com>\r\n;*          Jiaqi Zhang <zhangjiaqi.cs@gmail.com>\r\n;*\r\n;* This program is free software; you can redistribute it and/or modify\r\n;* it under the terms of the GNU General Public License as published by\r\n;* the Free Software Foundation; either version 2 of the License, or\r\n;* (at your option) any later version.\r\n;*\r\n;* This program is distributed in the hope that it will be useful,\r\n;* but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  
See the\r\n;* GNU General Public License for more details.\r\n;*\r\n;* You should have received a copy of the GNU General Public License\r\n;* along with this program; if not, write to the Free Software\r\n;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n;*\r\n;* This program is also available under a commercial proprietary license.\r\n;* For more information, contact us at license @ x265.com.\r\n;*****************************************************************************\r\n\r\n%include \"x86inc.asm\"\r\n\r\nSECTION_RODATA 32\r\n\r\n;; 8-bit constants\r\n\r\nconst pb_0,                 times 32 db 0\r\nconst pb_1,                 times 32 db 1\r\nconst pb_2,                 times 32 db 2\r\nconst pb_3,                 times 32 db 3\r\nconst pb_4,                 times 32 db 4\r\nconst pb_8,                 times 32 db 8\r\nconst pb_15,                times 32 db 15\r\nconst pb_16,                times 32 db 16\r\nconst pb_31,                times 32 db 31\r\nconst pb_32,                times 32 db 32\r\nconst pb_64,                times 32 db 64\r\nconst pb_124,               times 32 db 124\r\nconst pb_128,               times 32 db 128\r\nconst pb_a1,                times 16 db 0xa1\r\n\r\nconst pb_01,                times  8 db   0,   1\r\nconst pb_0123,              times  4 db   0,   1\r\n                            times  4 db   2,   3\r\nconst hsub_mul,             times 16 db   1,  -1\r\nconst pw_swap,              times  2 db   6,   7,   4,   5,   2,   3,   0,   1\r\nconst pb_unpackbd1,         times  2 db   0,   0,   0,   0,   1,   1,   1,   1,   2,   2,   2,   2,   3,   3,   3,   3\r\nconst pb_unpackbd2,         times  2 db   4,   4,   4,   4,   5,   5,   5,   5,   6,   6,   6,   6,   7,   7,   7,   7\r\nconst pb_unpackwq1,         times  1 db   0,   1,   0,   1,   0,   1,   0,   1,   2,   3,   2,   3,   2,   3,   2,   3\r\nconst pb_unpackwq2,         times  1 db   4,   5,   4,   5,   4,   5,   4,   5,   6,   7,   6,   
7,   6,   7,   6,   7\r\nconst pb_shuf8x8c,          times  1 db   0,   0,   0,   0,   2,   2,   2,   2,   4,   4,   4,   4,   6,   6,   6,   6\r\nconst pb_movemask,          times 16 db 0x00\r\n                            times 16 db 0xFF\r\n\r\nconst pb_movemask_32,       times 32 db 0x00\r\n                            times 32 db 0xFF\r\n                            times 32 db 0x00\r\n\r\nconst pb_0000000000000F0F,  times  2 db 0xff, 0x00\r\n                            times 12 db 0x00\r\nconst pb_000000000000000F,           db 0xff\r\n                            times 15 db 0x00\r\nconst pb_shuf_off4,         times  2 db   0,   4,   1,   5,   2,   6,   3,   7\r\nconst pw_shuf_off4,         times  1 db   0,   1,   8,   9,   2,   3,  10,  11,   4,   5,  12,  13,   6,   7,  14,  15\r\n\r\n;; 16-bit constants\r\n\r\nconst pw_n1,                times 16 dw -1\r\nconst pw_1,                 times 16 dw 1\r\nconst pw_2,                 times 16 dw 2\r\nconst pw_3,                 times 16 dw 3\r\nconst pw_7,                 times 16 dw 7\r\nconst pw_m2,                times  8 dw -2\r\nconst pw_4,                 times  8 dw 4\r\nconst pw_8,                 times  8 dw 8\r\nconst pw_16,                times 16 dw 16\r\nconst pw_15,                times 16 dw 15\r\nconst pw_31,                times 16 dw 31\r\nconst pw_32,                times 16 dw 32\r\nconst pw_64,                times  8 dw 64\r\nconst pw_128,               times 16 dw 128\r\nconst pw_256,               times 16 dw 256\r\nconst pw_257,               times 16 dw 257\r\nconst pw_512,               times 16 dw 512\r\nconst pw_1023,              times 16 dw 1023\r\nconst pw_1024,              times 16 dw 1024\r\nconst pw_2048,              times 16 dw 2048\r\nconst pw_4096,              times 16 dw 4096\r\nconst pw_8192,              times  8 dw 8192\r\nconst pw_00ff,              times 16 dw 0x00ff\r\nconst pw_ff00,              times  8 dw 0xff00\r\nconst pw_2000,              times 16 dw 
0x2000\r\nconst pw_8000,              times  8 dw 0x8000\r\nconst pw_3fff,              times 16 dw 0x3fff\r\nconst pw_32_0,              times  4 dw 32,\r\n                            times  4 dw 0\r\nconst pw_pixel_max,         times 16 dw ((1 << BIT_DEPTH)-1)\r\n\r\nconst pw_0_7,               times  2 dw   0,   1,   2,   3,   4,   5,   6,   7\r\nconst pw_ppppmmmm,          times  1 dw   1,   1,   1,   1,  -1,  -1,  -1,  -1\r\nconst pw_ppmmppmm,          times  1 dw   1,   1,  -1,  -1,   1,   1,  -1,  -1\r\nconst pw_pmpmpmpm,          times 16 dw   1,  -1,   1,  -1,   1,  -1,   1,  -1\r\nconst pw_pmmpzzzz,          times  1 dw   1,  -1,  -1,   1,   0,   0,   0,   0\r\nconst multi_2Row,           times  1 dw   1,   2,   3,   4,   1,   2,   3,   4\r\nconst multiH,               times  1 dw   9,  10,  11,  12,  13,  14,  15,  16\r\nconst multiH3,              times  1 dw  25,  26,  27,  28,  29,  30,  31,  32\r\nconst multiL,               times  1 dw   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,  16\r\nconst multiH2,              times  1 dw  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,  32\r\nconst pw_planar16_mul,      times  1 dw  15,  14,  13,  12,  11,  10,   9,   8,   7,   6,   5,   4,   3,   2,   1,   0\r\nconst pw_planar32_mul,      times  1 dw  31,  30,  29,  28,  27,  26,  25,  24,  23,  22,  21,  20,  19,  18,  17,  16\r\nconst pw_FFFFFFFFFFFFFFF0,           dw 0x00\r\n                            times  7 dw 0xff\r\nconst hmul_16p,             times 16 db   1\r\n                            times  8 db   1,  -1\r\nconst pw_exp2_0_15,                  dw 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768\r\nconst pw_1_ffff,            times  4 dw 1\r\n                            times  4 dw 0xFFFF\r\n\r\n\r\n;; 32-bit constants\r\n\r\nconst pd_0,                 times  8 dd 0\r\nconst pd_1,                 times  8 dd 1\r\nconst pd_2,                 times  8 dd 2\r\nconst 
pd_3,                 times  8 dd 3\r\nconst pd_4,                 times  4 dd 4\r\nconst pd_8,                 times  4 dd 8\r\nconst pd_11,                times  4 dd 11\r\nconst pd_12,                times  4 dd 12\r\nconst pd_15,                times  8 dd 15\r\nconst pd_16,                times  8 dd 16\r\nconst pd_31,                times  8 dd 31\r\nconst pd_32,                times  8 dd 32\r\nconst pd_64,                times  4 dd 64\r\nconst pd_128,               times  4 dd 128\r\nconst pd_256,               times  4 dd 256\r\nconst pd_512,               times  4 dd 512\r\nconst pd_1024,              times  4 dd 1024\r\nconst pd_2048,              times  4 dd 2048\r\nconst pd_ffff,              times  4 dd 0xffff\r\nconst pd_32767,             times  4 dd 32767\r\nconst pd_n32768,            times  4 dd 0xffff8000\r\nconst pd_524416,            times  4 dd 524416\r\nconst pd_n32768,            times  8 dd 0xffff8000\r\nconst pd_n131072,           times  4 dd 0xfffe0000\r\nconst pd_0000ffff,          times  8 dd 0x0000FFFF\r\nconst pd_planar16_mul0,     times  1 dd  15,  14,  13,  12,  11,  10,   9,   8,    7,   6,   5,   4,   3,   2,   1,   0\r\nconst pd_planar16_mul1,     times  1 dd   1,   2,   3,   4,   5,   6,   7,   8,    9,  10,  11,  12,  13,  14,  15,  16\r\nconst pd_planar32_mul1,     times  1 dd  31,  30,  29,  28,  27,  26,  25,  24,   23,  22,  21,  20,  19,  18,  17,  16\r\nconst pd_planar32_mul2,     times  1 dd  17,  18,  19,  20,  21,  22,  23,  24,   25,  26,  27,  28,  29,  30,  31,  32\r\nconst pd_planar16_mul2,     times  1 dd  15,  14,  13,  12,  11,  10,   9,   8,    7,   6,   5,   4,   3,   2,   1,   0\r\nconst trans8_shuf,          times  1 dd   0,   4,   1,   5,   2,   6,   3,   7\r\n\r\n;; 64-bit constants\r\n\r\nconst pq_1,                 times 1 dq 1\r\n"
  },
  {
    "path": "source/common/x86/cpu-a.asm",
    "content": ";*****************************************************************************\r\n;* cpu-a.asm: x86 cpu utilities\r\n;*****************************************************************************\r\n;* Copyright (C) 2003-2013 x264 project\r\n;* Copyright (C) 2013-2017 MulticoreWare, Inc\r\n;* Copyright (C) 2018~ VCL, NELVT, Peking University\r\n;*\r\n;* Authors: Laurent Aimar <fenrir@via.ecp.fr>\r\n;*          Loren Merritt <lorenm@u.washington.edu>\r\n;*          Fiona Glaser <fiona@x264.com>\r\n;*          Jiaqi Zhang  <zhangjiaqi.cs@gmail.com>\r\n;*\r\n;* This program is free software; you can redistribute it and/or modify\r\n;* it under the terms of the GNU General Public License as published by\r\n;* the Free Software Foundation; either version 2 of the License, or\r\n;* (at your option) any later version.\r\n;*\r\n;* This program is distributed in the hope that it will be useful,\r\n;* but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  
See the\r\n;* GNU General Public License for more details.\r\n;*\r\n;* You should have received a copy of the GNU General Public License\r\n;* along with this program; if not, write to the Free Software\r\n;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n;*\r\n;* This program is also available under a commercial proprietary license.\r\n;* For more information, contact us at license @ x265.com.\r\n;*****************************************************************************\r\n\r\n\r\n%include \"x86inc.asm\"\r\n\r\nSECTION .text\r\n\r\n;-----------------------------------------------------------------------------\r\n; void cpu_cpuid( int op, int *eax, int *ebx, int *ecx, int *edx )\r\n;-----------------------------------------------------------------------------\r\ncglobal cpu_cpuid, 5,7\r\n    push rbx\r\n    push  r4\r\n    push  r3\r\n    push  r2\r\n    push  r1\r\n    mov  eax, r0d\r\n    xor  ecx, ecx\r\n    cpuid\r\n    pop   r4\r\n    mov [r4], eax\r\n    pop   r4\r\n    mov [r4], ebx\r\n    pop   r4\r\n    mov [r4], ecx\r\n    pop   r4\r\n    mov [r4], edx\r\n    pop  rbx\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void cpu_xgetbv( int op, int *eax, int *edx )\r\n;-----------------------------------------------------------------------------\r\ncglobal cpu_xgetbv, 3,7\r\n    push  r2\r\n    push  r1\r\n    mov  ecx, r0d\r\n    xgetbv\r\n    pop   r4\r\n    mov [r4], eax\r\n    pop   r4\r\n    mov [r4], edx\r\n    RET\r\n    \r\n;-----------------------------------------------------------------------------\r\n; void cpuid_get_serial_number( int op, int *eax, int *ebx, int *ecx, int *edx )\r\n; 2017-06-18 luofl\r\n;-----------------------------------------------------------------------------\r\ncglobal cpuid_get_serial_number, 5,7\r\n    push  rbx\r\n    push  r4\r\n    push  r3\r\n    push  r2\r\n    push  r1\r\n    ; first 64 bits\r\n    mov eax, 00h\r\n    xor edx, 
edx\r\n    cpuid\r\n    pop   r4\r\n    mov [r4], edx\r\n    pop   r4\r\n    mov [r4], eax\r\n    ; second 64 bits\r\n    mov eax, 01h\r\n    xor ecx, ecx\r\n    xor edx, edx\r\n    cpuid\r\n    pop   r4\r\n    mov [r4], edx\r\n    pop   r4\r\n    mov [r4], eax\r\n    \r\n    pop  rbx\r\n    RET\r\n\r\n%if ARCH_X86_64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void stack_align( void (*func)(void*), void *arg );\r\n;-----------------------------------------------------------------------------\r\ncglobal stack_align\r\n    push rbp\r\n    mov  rbp, rsp\r\n%if WIN64\r\n    sub  rsp, 32 ; shadow space\r\n%endif\r\n    and  rsp, ~31\r\n    mov  rax, r0\r\n    mov   r0, r1\r\n    mov   r1, r2\r\n    mov   r2, r3\r\n    call rax\r\n    leave\r\n    ret\r\n\r\n%else\r\n\r\n;-----------------------------------------------------------------------------\r\n; int cpu_cpuid_test( void )\r\n; return 0 if unsupported\r\n;-----------------------------------------------------------------------------\r\ncglobal cpu_cpuid_test\r\n    pushfd\r\n    push    ebx\r\n    push    ebp\r\n    push    esi\r\n    push    edi\r\n    pushfd\r\n    pop     eax\r\n    mov     ebx, eax\r\n    xor     eax, 0x200000\r\n    push    eax\r\n    popfd\r\n    pushfd\r\n    pop     eax\r\n    xor     eax, ebx\r\n    pop     edi\r\n    pop     esi\r\n    pop     ebp\r\n    pop     ebx\r\n    popfd\r\n    ret\r\n\r\ncglobal stack_align\r\n    push ebp\r\n    mov  ebp, esp\r\n    sub  esp, 12\r\n    and  esp, ~31\r\n    mov  ecx, [ebp+8]\r\n    mov  edx, [ebp+12]\r\n    mov  [esp], edx\r\n    mov  edx, [ebp+16]\r\n    mov  [esp+4], edx\r\n    mov  edx, [ebp+20]\r\n    mov  [esp+8], edx\r\n    call ecx\r\n    leave\r\n    ret\r\n\r\n%endif\r\n\r\n;-----------------------------------------------------------------------------\r\n; void cpu_emms( void )\r\n;-----------------------------------------------------------------------------\r\ncglobal cpu_emms\r\n    
emms\r\n    ret\r\n\r\n;-----------------------------------------------------------------------------\r\n; void cpu_sfence( void )\r\n;-----------------------------------------------------------------------------\r\ncglobal cpu_sfence\r\n    sfence\r\n    ret\r\n\r\n%if 0                                 ; REMOVED\r\ncextern intel_cpu_indicator_init\r\n\r\n;-----------------------------------------------------------------------------\r\n; void safe_intel_cpu_indicator_init( void );\r\n;-----------------------------------------------------------------------------\r\ncglobal safe_intel_cpu_indicator_init\r\n    push r0\r\n    push r1\r\n    push r2\r\n    push r3\r\n    push r4\r\n    push r5\r\n    push r6\r\n%if ARCH_X86_64\r\n    push r7\r\n    push r8\r\n    push r9\r\n    push r10\r\n    push r11\r\n    push r12\r\n    push r13\r\n    push r14\r\n%endif\r\n    push rbp\r\n    mov  rbp, rsp\r\n%if WIN64\r\n    sub  rsp, 32 ; shadow space\r\n%endif\r\n    and  rsp, ~31\r\n    call intel_cpu_indicator_init\r\n    leave\r\n%if ARCH_X86_64\r\n    pop r14\r\n    pop r13\r\n    pop r12\r\n    pop r11\r\n    pop r10\r\n    pop r9\r\n    pop r8\r\n    pop r7\r\n%endif\r\n    pop r6\r\n    pop r5\r\n    pop r4\r\n    pop r3\r\n    pop r2\r\n    pop r1\r\n    pop r0\r\n    ret\r\n\r\n%endif  ; if 0"
  },
  {
    "path": "source/common/x86/dct8.asm",
    "content": ";*****************************************************************************\r\n;* Copyright (C) 2013-2017 MulticoreWare, Inc\r\n;* Copyright (C) 2018~ VCL, NELVT, Peking University\r\n;*\r\n;* Authors: Nabajit Deka <nabajit@multicorewareinc.com>\r\n;*          Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>\r\n;*          Li Cao <li@multicorewareinc.com>\r\n;*          Praveen Kumar Tiwari <Praveen@multicorewareinc.com>\r\n;*          Jiaqi Zhang <zhangjiaqi.cs@gmail.com>\r\n;*\r\n;* This program is free software; you can redistribute it and/or modify\r\n;* it under the terms of the GNU General Public License as published by\r\n;* the Free Software Foundation; either version 2 of the License, or\r\n;* (at your option) any later version.\r\n;*\r\n;* This program is distributed in the hope that it will be useful,\r\n;* but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n;* GNU General Public License for more details.\r\n;*\r\n;* You should have received a copy of the GNU General Public License\r\n;* along with this program; if not, write to the Free Software\r\n;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n;*\r\n;* This program is also available under a commercial proprietary license.\r\n;* For more information, contact us at license @ x265.com.\r\n;*****************************************************************************/\r\n\r\n;TO-DO : Further optimize the routines.\r\n\r\n%include \"x86inc.asm\"\r\n%include \"x86util.asm\"\r\n\r\nSECTION_RODATA 32\r\n\r\n; ----------------------------------------------------------------------------\r\n; dct4\r\ntab_dct4:       times 4 dw  32,  32\r\n                times 4 dw  42,  17\r\n                times 4 dw  32, -32\r\n                times 4 dw  17, -42\r\n\r\navx2_idct4_1:   dw  32, 32, 32, 32, 32, 32, 32, 32, 32, -32, 32, -32, 32, -32, 32, -32\r\n                dw  42, 
17, 42, 17, 42, 17, 42, 17, 17, -42, 17, -42, 17, -42, 17, -42\r\n\r\navx2_idct4_2:   dw  32, 32, 32,-32, 42, 17, 17,-42\r\n\r\nidct4_shuf1:    times 2 db 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15\r\n\r\nidct4_shuf2:    times 2 db 4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 8 ,9 ,10, 11\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; dct8\r\nalign 32\r\n\r\npb_idct8even:   db 0, 1, 8, 9, 4, 5, 12, 13, 0, 1,  8,  9, 4, 5, 12, 13\r\n\r\ntab_idct8_1:    times 1 dw  32, -32,  17, -42,  32,  32,  42,  17\r\n\r\ntab_idct8_2:    times 1 dw  44,  38,  25,   9,  38,  -9, -44, -25\r\n                times 1 dw  25, -44,   9,  38,   9, -25,  38, -44\r\n\r\ntab_idct8_3:    times 4 dw  44,  38\r\n                times 4 dw  25,   9\r\n                times 4 dw  38,  -9\r\n                times 4 dw -44, -25\r\n                times 4 dw  25, -44\r\n                times 4 dw   9,  38\r\n                times 4 dw   9, -25\r\n                times 4 dw  38, -44\r\n\r\navx2_idct8_1:   times 4 dw  32,  42,  32,  17\r\n                times 4 dw  32,  17, -32, -42\r\n                times 4 dw  32, -17, -32,  42\r\n                times 4 dw  32, -42,  32, -17\r\n\r\navx2_idct8_2:   times 4 dw  44,  38,  25,   9\r\n                times 4 dw  38,  -9, -44, -25\r\n                times 4 dw  25, -44,   9,  38\r\n                times 4 dw   9, -25,  38, -44\r\n\r\nalign 32\r\nidct8_shuf1:    dd 0, 2, 4, 6, 1, 3, 5, 7\r\n\r\nidct8_shuf2:    times 2 db 0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15\r\n\r\nidct8_shuf3:    times 2 db 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3\r\n\r\npb_idct8odd:    db 2, 3, 6, 7, 10, 11, 14, 15, 2, 3, 6, 7, 10, 11, 14, 15\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; dct16\r\nalign 32\r\n\r\ndct16_shuf1:    times 2 db 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1\r\n\r\ntab_idct16_1:   dw  45,  43,  40,  35,  29,  21,  13,   
4\r\n                dw  43,  29,   4, -21, -40, -45, -35, -13\r\n                dw  40,   4, -35, -43, -13,  29,  45,  21\r\n                dw  35, -21, -43,   4,  45,  13, -40, -29\r\n                dw  29, -40, -13,  45,  -4, -43,  21,  35\r\n                dw  21, -45,  29,  13, -43,  35,   4, -40\r\n                dw  13, -35,  45, -40,  21,   4, -29,  43\r\n                dw   4, -13,  21, -29,  35, -40,  43, -45\r\n\r\ntab_idct16_2:   dw  32,  44,  42,  38,  32,  25,  17,   9\r\n                dw  32,  38,  17,  -9, -32, -44, -42, -25\r\n                dw  32,  25, -17, -44, -32,   9,  42,  38\r\n                dw  32,   9, -42, -25,  32,  38, -17, -44\r\n                dw  32,  -9, -42,  25,  32, -38, -17,  44\r\n                dw  32, -25, -17,  44, -32,  -9,  42, -38\r\n                dw  32, -38,  17,   9, -32,  44, -42,  25\r\n                dw  32, -44,  42, -38,  32, -25,  17,  -9\r\n\r\nidct16_shuff:   dd 0, 4, 2, 6, 1, 5, 3, 7\r\n\r\nidct16_shuff1:  dd 2, 6, 0, 4, 3, 7, 1, 5\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; dct32\r\nalign 32\r\n\r\ntab_idct32_1:   dw  45,  45,  44,  43,  41,  39,  36,  34,  30,  27,  23,  19,  15,  11,   7,   2\r\n                dw  45,  41,  34,  23,  11,  -2, -15, -27, -36, -43, -45, -44, -39, -30, -19,  -7\r\n                dw  44,  34,  15,  -7, -27, -41, -45, -39, -23,  -2,  19,  36,  45,  43,  30,  11\r\n                dw  43,  23,  -7, -34, -45, -36, -11,  19,  41,  44,  27,  -2, -30, -45, -39, -15\r\n                dw  41,  11, -27, -45, -30,   7,  39,  43,  15, -23, -45, -34,   2,  36,  44,  19\r\n                dw  39,  -2, -41, -36,   7,  43,  34, -11, -44, -30,  15,  45,  27, -19, -45, -23\r\n                dw  36, -15, -45, -11,  39,  34, -19, -45,  -7,  41,  30, -23, -44,  -2,  43,  27\r\n                dw  34, -27, -39,  19,  43, -11, -45,   2,  45,   7, -44, -15,  41,  23, -36, -30\r\n                dw  30, -36, -23,  41,  15, -44,  
-7,  45,  -2, -45,  11,  43, -19, -39,  27,  34\r\n                dw  27, -43,  -2,  44, -23, -30,  41,   7, -45,  19,  34, -39, -11,  45, -15, -36\r\n                dw  23, -45,  19,  27, -45,  15,  30, -44,  11,  34, -43,   7,  36, -41,   2,  39\r\n                dw  19, -44,  36,  -2, -34,  45, -23, -15,  43, -39,   7,  30, -45,  27,  11, -41\r\n                dw  15, -39,  45, -30,   2,  27, -44,  41, -19, -11,  36, -45,  34,  -7, -23,  43\r\n                dw  11, -30,  43, -45,  36, -19,  -2,  23, -39,  45, -41,  27,  -7, -15,  34, -44\r\n                dw   7, -19,  30, -39,  44, -45,  43, -36,  27, -15,   2,  11, -23,  34, -41,  45\r\n                dw   2,  -7,  11, -15,  19, -23,  27, -30,  34, -36,  39, -41,  43, -44,  45, -45\r\n\r\ntab_idct32_2:   dw  32,  44,  42,  38,  32,  25,  17,   9\r\n                dw  32,  38,  17,  -9, -32, -44, -42, -25\r\n                dw  32,  25, -17, -44, -32,   9,  42,  38\r\n                dw  32,   9, -42, -25,  32,  38, -17, -44\r\n                dw  32,  -9, -42,  25,  32, -38, -17,  44\r\n                dw  32, -25, -17,  44, -32,  -9,  42, -38\r\n                dw  32, -38,  17,   9, -32,  44, -42,  25\r\n                dw  32, -44,  42, -38,  32, -25,  17,  -9\r\n\r\ntab_idct32_3:   dw  45,  43,  40,  35,  29,  21,  13,   4\r\n                dw  43,  29,   4, -21, -40, -45, -35, -13\r\n                dw  40,   4, -35, -43, -13,  29,  45,  21\r\n                dw  35, -21, -43,   4,  45,  13, -40, -29\r\n                dw  29, -40, -13,  45,  -4, -43,  21,  35\r\n                dw  21, -45,  29,  13, -43,  35,   4, -40\r\n                dw  13, -35,  45, -40,  21,   4, -29,  43\r\n                dw   4, -13,  21, -29,  35, -40,  43, -45\r\n\r\ntab_idct32_4:   dw  32,  45,  44,  43,  42,  40,  38,  35,  32,  29,  25,  21,  17,  13,   9,   4\r\n                dw  32,  43,  38,  29,  17,   4,  -9, -21, -32, -40, -44, -45, -42, -35, -25, -13\r\n                dw  32,  40,  25,   4, -17, -35, 
-44, -43, -32, -13,   9,  29,  42,  45,  38,  21\r\n                dw  32,  35,   9, -21, -42, -43, -25,   4,  32,  45,  38,  13, -17, -40, -44, -29\r\n                dw  32,  29,  -9, -40, -42, -13,  25,  45,  32,  -4, -38, -43, -17,  21,  44,  35\r\n                dw  32,  21, -25, -45, -17,  29,  44,  13, -32, -43,  -9,  35,  42,   4, -38, -40\r\n                dw  32,  13, -38, -35,  17,  45,   9, -40, -32,  21,  44,   4, -42, -29,  25,  43\r\n                dw  32,   4, -44, -13,  42,  21, -38, -29,  32,  35, -25, -40,  17,  43,  -9, -45\r\n                dw  32,  -4, -44,  13,  42, -21, -38,  29,  32, -35, -25,  40,  17, -43,  -9,  45\r\n                dw  32, -13, -38,  35,  17, -45,   9,  40, -32, -21,  44,  -4, -42,  29,  25, -43\r\n                dw  32, -21, -25,  45, -17, -29,  44, -13, -32,  43,  -9, -35,  42,  -4, -38,  40\r\n                dw  32, -29,  -9,  40, -42,  13,  25, -45,  32,   4, -38,  43, -17, -21,  44, -35\r\n                dw  32, -35,   9,  21, -42,  43, -25,  -4,  32, -45,  38, -13, -17,  40, -44,  29\r\n                dw  32, -40,  25,  -4, -17,  35, -44,  43, -32,  13,   9, -29,  42, -45,  38, -21\r\n                dw  32, -43,  38, -29,  17,  -4,  -9,  21, -32,  40, -44,  45, -42,  35, -25,  13\r\n                dw  32, -45,  44, -43,  42, -40,  38, -35,  32, -29,  25, -21,  17, -13,   9,  -4\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\nSECTION .text\r\n\r\ncextern pd_11\r\ncextern pd_12\r\ncextern pd_16\r\ncextern pd_512\r\ncextern pd_2048\r\n\r\n\r\n; ============================================================================\r\n; void idct_4x4(const coeff_t *src, coeff_t *dst, int i_dst)\r\n; ============================================================================\r\n\r\n; ------------------------------------------------------------------\r\n; idct_4x4_sse2\r\nINIT_XMM sse2\r\ncglobal idct_4x4, 3, 4, 7\r\n%define IDCT4_SHIFT1        5                   ; shift1 = 
5\r\n%define IDCT4_OFFSET1       [pd_16]             ; add1   = 16\r\n%if BIT_DEPTH == 10                             ;\r\n    %define IDCT4_SHIFT2    10                  ;\r\n    %define IDCT4_OFFSET2   [pd_512]            ;\r\n%elif BIT_DEPTH == 8                            ; for BIT_DEPTH: 8\r\n    %define IDCT4_SHIFT2    12                  ; shift2 = 12\r\n    %define IDCT4_OFFSET2   [pd_2048]           ; add2   = 2048\r\n%else                                           ;\r\n    %error Unsupported BIT_DEPTH!               ;\r\n%endif                                          ;\r\n    add            r2d, r2d                     ; r2 <-- i_dst\r\n    lea             r3, [tab_dct4]              ;\r\n                                                ;\r\n    mova            m6, IDCT4_OFFSET1           ;\r\n                                                ;\r\n    movu            m0, [r0 + 0 * 16]           ; mova???\r\n    movu            m1, [r0 + 1 * 16]           ;\r\n                                                ;\r\n    punpcklwd       m2, m0, m1                  ;\r\n    pmaddwd         m3, m2, [r3 + 0 * 16]       ; m3 = E1\r\n    paddd           m3, m6                      ;\r\n                                                ;\r\n    pmaddwd         m2, [r3 + 2 * 16]           ; m2 = E2\r\n    paddd           m2, m6                      ;\r\n                                                ;\r\n    punpckhwd       m0, m1                      ;\r\n    pmaddwd         m1, m0, [r3 + 1 * 16]       ; m1 = O1\r\n    pmaddwd         m0, [r3 + 3 * 16]           ; m0 = O2\r\n                                                ;\r\n    paddd           m4, m3, m1                  ;\r\n    psrad           m4, IDCT4_SHIFT1            ; m4 = m128iA\r\n    paddd           m5, m2, m0                  ;\r\n    psrad           m5, IDCT4_SHIFT1            ;\r\n    packssdw        m4, m5                      ; m4 = m128iA\r\n                                                ;\r\n    
psubd           m2, m0                      ;\r\n    psrad           m2, IDCT4_SHIFT1            ;\r\n    psubd           m3, m1                      ;\r\n    psrad           m3, IDCT4_SHIFT1            ;\r\n    packssdw        m2, m3                      ; m2 = m128iD\r\n                                                ;\r\n    punpcklwd       m1, m4, m2                  ; m1 = S0\r\n    punpckhwd       m4, m2                      ; m4 = S8\r\n                                                ;\r\n    punpcklwd       m0, m1, m4                  ; m0 = m128iA\r\n    punpckhwd       m1, m4                      ; m1 = m128iD\r\n                                                ;\r\n    mova            m6, IDCT4_OFFSET2           ;\r\n                                                ;\r\n    punpcklwd       m2, m0, m1                  ;\r\n    pmaddwd         m3, m2, [r3 + 0 * 16]       ;\r\n    paddd           m3, m6                      ; m3 = E1\r\n                                                ;\r\n    pmaddwd         m2, [r3 + 2 * 16]           ;\r\n    paddd           m2, m6                      ; m2 = E2\r\n                                                ;\r\n    punpckhwd       m0, m1                      ;\r\n    pmaddwd         m1, m0, [r3 + 1 * 16]       ; m1 = O1\r\n    pmaddwd         m0, [r3 + 3 * 16]           ; m0 = O2\r\n                                                ;\r\n    paddd           m4, m3, m1                  ;\r\n    psrad           m4, IDCT4_SHIFT2            ; m4 = m128iA\r\n    paddd           m5, m2, m0                  ;\r\n    psrad           m5, IDCT4_SHIFT2            ;\r\n    packssdw        m4, m5                      ; m4 = m128iA\r\n                                                ;\r\n    psubd           m2, m0                      ;\r\n    psrad           m2, IDCT4_SHIFT2            ;\r\n    psubd           m3, m1                      ;\r\n    psrad           m3, IDCT4_SHIFT2            ;\r\n    packssdw        m2, m3               
       ; m2 = m128iD\r\n                                                ;\r\n    punpcklwd       m1, m4, m2                  ;\r\n    punpckhwd       m4, m2                      ;\r\n                                                ;\r\n    punpcklwd       m0, m1, m4                  ;\r\n    movlps         [r1 + 0 * r2], m0            ; store dst, line 0\r\n    movhps         [r1 + 1 * r2], m0            ;            line 1\r\n                                                ;\r\n    punpckhwd       m1, m4                      ;\r\n    movlps         [r1 + 2*r2], m1              ; store dst, line 2\r\n    lea             r1, [r1 + 2*r2]             ;\r\n    movhps         [r1 + r2], m1                ;            line 3\r\n                                                ;\r\n    RET                                         ;\r\n%undef IDCT4_SHIFT1\r\n%undef IDCT4_OFFSET1\r\n%undef IDCT4_SHIFT2\r\n%undef IDCT4_OFFSET2\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; void idct_8x8(const coeff_t *src, coeff_t *dst, int i_dst)\r\n; ----------------------------------------------------------------------------\r\nINIT_XMM ssse3\r\n\r\ncglobal patial_butterfly_inverse_internal_pass1\r\n    %define IDCT8_SHIFT1    5                   ; shift1 = 5\r\n    %define IDCT8_ADD1      [pd_16]             ; add1   = 16\r\n                                                ;\r\n    movh            m0, [r0         ]           ;\r\n    movhps          m0, [r0 + 2 * 16]           ;\r\n    movh            m1, [r0 + 4 * 16]           ;\r\n    movhps          m1, [r0 + 6 * 16]           ;\r\n                                                ;\r\n    punpckhwd       m2, m0, m1                  ; [2 6]\r\n    punpcklwd       m0, m1                      ; [0 4]\r\n    pmaddwd         m1, m0, [r6     ]           ; EE[0]\r\n    pmaddwd         m0,     [r6 + 32]           ; EE[1]\r\n    pmaddwd         m3, m2, [r6 + 16]           ; EO[0]\r\n    pmaddwd      
   m2,     [r6 + 48]           ; EO[1]\r\n                                                ;\r\n    paddd           m4, m1, m3                  ; E[0]\r\n    psubd           m1, m3                      ; E[3]\r\n    paddd           m3, m0, m2                  ; E[1]\r\n    psubd           m0, m2                      ; E[2]\r\n                                                ;\r\n    ; E[k] = E[k] + add                         ;\r\n    mova            m5, IDCT8_ADD1              ; add1   = 16\r\n    paddd           m0, m5                      ;\r\n    paddd           m1, m5                      ;\r\n    paddd           m3, m5                      ;\r\n    paddd           m4, m5                      ;\r\n                                                ;\r\n    movh            m2, [r0 +     16]           ;\r\n    movhps          m2, [r0 + 5 * 16]           ;\r\n    movh            m5, [r0 + 3 * 16]           ;\r\n    movhps          m5, [r0 + 7 * 16]           ;\r\n    punpcklwd       m6, m2, m5                  ; [1 3]\r\n    punpckhwd       m2, m5                      ; [5 7]\r\n                                                ;\r\n    pmaddwd         m5, m6, [r4     ]           ;\r\n    pmaddwd         m7, m2, [r4 + 16]           ;\r\n    paddd           m5, m7                      ; O[0]\r\n                                                ;\r\n    paddd           m7, m4, m5                  ;\r\n    psrad           m7, IDCT8_SHIFT1            ; shift1 = 5\r\n                                                ;\r\n    psubd           m4, m5                      ;\r\n    psrad           m4, IDCT8_SHIFT1            ; shift1 = 5\r\n                                                ;\r\n    packssdw        m7, m4                      ;\r\n    movh           [r5 + 0 * 16], m7            ;\r\n    movhps         [r5 + 7 * 16], m7            ;\r\n                                                ;\r\n    pmaddwd         m5, m6, [r4 + 32]           ;\r\n    pmaddwd         m4, m2, [r4 
+ 48]           ;\r\n    paddd           m5, m4                      ; O[1]\r\n                                                ;\r\n    paddd           m4, m3, m5                  ;\r\n    psrad           m4, IDCT8_SHIFT1            ; shift1 = 5\r\n                                                ;\r\n    psubd           m3, m5                      ;\r\n    psrad           m3, IDCT8_SHIFT1            ; shift1 = 5\r\n                                                ;\r\n    packssdw        m4, m3                      ;\r\n    movh           [r5 + 1 * 16], m4            ;\r\n    movhps         [r5 + 6 * 16], m4            ;\r\n                                                ;\r\n    pmaddwd         m5, m6, [r4 + 64]           ;\r\n    pmaddwd         m4, m2, [r4 + 80]           ;\r\n    paddd           m5, m4                      ; O[2]\r\n                                                ;\r\n    paddd           m4, m0, m5                  ;\r\n    psrad           m4, IDCT8_SHIFT1            ; shift1 = 5\r\n                                                ;\r\n    psubd           m0, m5                      ;\r\n    psrad           m0, IDCT8_SHIFT1            ; shift1 = 5\r\n                                                ;\r\n    packssdw        m4, m0                      ;\r\n    movh           [r5 + 2 * 16], m4            ;\r\n    movhps         [r5 + 5 * 16], m4            ;\r\n                                                ;\r\n    pmaddwd         m5, m6, [r4 +  96]          ;\r\n    pmaddwd         m4, m2, [r4 + 112]          ;\r\n    paddd           m5, m4                      ; O[3]\r\n                                                ;\r\n    paddd           m4, m1, m5                  ;\r\n    psrad           m4, IDCT8_SHIFT1            ; shift1 = 5\r\n                                                ;\r\n    psubd           m1, m5                      ;\r\n    psrad           m1, IDCT8_SHIFT1            ; shift1 = 5\r\n                                           
     ;\r\n    packssdw        m4, m1                      ;\r\n    movh           [r5 + 3 * 16], m4            ;\r\n    movhps         [r5 + 4 * 16], m4            ;\r\n                                                ;\r\n    %undef IDCT8_SHIFT1                         ;\r\n    %undef IDCT8_ADD1                           ;\r\n    ret                                         ;\r\n\r\n%macro PARTIAL_BUTTERFLY_PROCESS_ROW 1\r\n%if BIT_DEPTH == 10                             ;\r\n    %define IDCT8_SHIFT2  10                    ;\r\n%elif BIT_DEPTH == 8                            ; for BIT_DEPTH: 8\r\n    %define IDCT8_SHIFT2  12                    ; shift2 = 12\r\n%else                                           ;\r\n    %error Unsupported BIT_DEPTH!               ;\r\n%endif                                          ;\r\n    pshufb          m4, %1, [pb_idct8even]      ;\r\n    pmaddwd         m4, [tab_idct8_1]           ;\r\n    phsubd          m5, m4                      ;\r\n    pshufd          m4, m4, 0x4E                ;\r\n    phaddd          m4, m4                      ;\r\n    punpckhqdq      m4, m5                      ; m4 = dd e[ 0 1 2 3]\r\n    paddd           m4, m6                      ;\r\n                                                ;\r\n    pshufb          %1, %1, [r6]                ;\r\n    pmaddwd         m5, %1, [r4]                ;\r\n    pmaddwd         %1, [r4 + 16]               ;\r\n    phaddd          m5, %1                      ; m5 = dd O[0, 1, 2, 3]\r\n                                                ;\r\n    paddd           %1, m4, m5                  ;\r\n    psrad           %1, IDCT8_SHIFT2            ;\r\n                                                ;\r\n    psubd           m4, m5                      ;\r\n    psrad           m4, IDCT8_SHIFT2            ;\r\n    pshufd          m4, m4, 0x1B                ;\r\n                                                ;\r\n    packssdw        %1, m4                      ;\r\n%undef 
IDCT8_SHIFT2                             ;\r\n%endmacro\r\n\r\ncglobal patial_butterfly_inverse_internal_pass2\r\n    mova            m0, [r5     ]               ;\r\n    PARTIAL_BUTTERFLY_PROCESS_ROW m0            ;\r\n    movu   [r1       ], m0                      ;\r\n                                                ;\r\n    mova            m2, [r5 + 16]               ;\r\n    PARTIAL_BUTTERFLY_PROCESS_ROW m2            ;\r\n    movu   [r1 +   r2], m2                      ;\r\n                                                ;\r\n    mova            m1, [r5 + 32]               ;\r\n    PARTIAL_BUTTERFLY_PROCESS_ROW m1            ;\r\n    movu   [r1 + 2*r2], m1                      ;\r\n                                                ;\r\n    mova            m3, [r5 + 48]               ;\r\n    PARTIAL_BUTTERFLY_PROCESS_ROW m3            ;\r\n    movu   [r1 +   r3], m3                      ;\r\n                                                ;\r\n    ret                                         ;\r\n\r\n; ------------------------------------------------------------------\r\n; idct_8x8_ssse3\r\ncglobal idct_8x8, 3,7,8 ;,0-16*mmsize\r\n    ; align the stack to 64 bytes               ;\r\n    mov             r5, rsp                     ;\r\n    sub            rsp, 16*mmsize + gprsize     ;\r\n    and            rsp, ~(64-1)                 ;\r\n    mov           [rsp + 16*mmsize], r5         ;\r\n    mov             r5, rsp                     ;\r\n                                                ;\r\n    lea             r4, [tab_idct8_3]           ;\r\n    lea             r6, [tab_dct4]              ;\r\n                                                ;\r\n    call    patial_butterfly_inverse_internal_pass1\r\n                                                ;\r\n    add             r0, 8                       ;\r\n    add             r5, 8                       ;\r\n                                                ;\r\n    call    
patial_butterfly_inverse_internal_pass1\r\n                                                ;\r\n%if BIT_DEPTH == 10                             ;\r\n    mova            m6, [pd_512]                ;\r\n%elif BIT_DEPTH == 8                            ;\r\n    mova            m6, [pd_2048]               ;\r\n%else                                           ;\r\n  %error Unsupported BIT_DEPTH!                 ;\r\n%endif                                          ;\r\n    add             r2, r2                      ;\r\n    lea             r3, [r2 * 3]                ;\r\n    lea             r4, [tab_idct8_2]           ;\r\n    lea             r6, [pb_idct8odd]           ;\r\n    sub             r5, 8                       ;\r\n                                                ;\r\n    call    patial_butterfly_inverse_internal_pass2\r\n                                                ;\r\n    lea             r1, [r1 + 4 * r2]           ;\r\n    add             r5, 64                      ;\r\n                                                ;\r\n    call    patial_butterfly_inverse_internal_pass2\r\n                                                ;\r\n    ; restore original stack pointer            ;\r\n    mov            rsp, [rsp + 16*mmsize]       ;\r\n    RET                                         ;\r\n\r\n\r\n; ============================================================================\r\n; ARCH_X86_64 ONLY\r\n; ============================================================================\r\n\r\n%if ARCH_X86_64 == 1\r\n\r\n; ----------------------------------------------------------------------------\r\n; void idct_4x4(const coeff_t *src, coeff_t *dst, int i_dst)\r\n; ----------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal idct_4x4, 3, 4, 6\r\n%define IDCT4_SHIFT1    5                       ; shift1 = 5\r\n    vbroadcasti128  m4, [pd_16]                 ; add1   = 16\r\n%if BIT_DEPTH == 10                             
;\r\n    %define IDCT4_SHIFT2 10                     ;\r\n    vpbroadcastd    m5, [pd_512]                ;\r\n%elif BIT_DEPTH == 8                            ; for BIT_DEPTH: 8\r\n    %define IDCT4_SHIFT2 12                     ; shift2 = 12\r\n    vpbroadcastd    m5, [pd_2048]               ; add2   = 2048\r\n%else                                           ;\r\n    %error Unsupported BIT_DEPTH!               ;\r\n%endif                                          ;\r\n                                                ;\r\n    add             r2, r2                      ; r2 <-- i_dst in bytes (dst is 16-bit data)\r\n    lea             r3, [r2 * 3]                ; r3 <-- 3 * i_dst (in bytes)\r\n                                                ;\r\n    movu            m0, [r0]                    ; [00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33]\r\n                                                ;\r\n    pshufb          m0, [idct4_shuf1]           ; [00 02 01 03 10 12 11 13 20 22 21 23 30 32 31 33]\r\n    vextracti128   xm1, m0, 1                   ; [20 22 21 23 30 32 31 33]\r\n    punpcklwd      xm2, xm0, xm1                ; [00 20 02 22 01 21 03 23]\r\n    punpckhwd      xm0, xm1                     ; [10 30 12 32 11 31 13 33]\r\n    vinserti128     m2, m2, xm2, 1              ; [00 20 02 22 01 21 03 23 00 20 02 22 01 21 03 23]\r\n    vinserti128     m0, m0, xm0, 1              ; [10 30 12 32 11 31 13 33 10 30 12 32 11 31 13 33]\r\n                                                ;\r\n    mova            m1, [avx2_idct4_1     ]     ;\r\n    mova            m3, [avx2_idct4_1 + 32]     ;\r\n    pmaddwd         m1, m2                      ;\r\n    pmaddwd         m3, m0                      ;\r\n                                                ;\r\n    paddd           m0, m1, m3                  ;\r\n    paddd           m0, m4                      ;\r\n    psrad           m0, IDCT4_SHIFT1            ; [00 20 10 30 01 21 11 31]\r\n                                                ;\r\n    
psubd           m1, m3                      ;\r\n    paddd           m1, m4                      ;\r\n    psrad           m1, IDCT4_SHIFT1            ; [03 23 13 33 02 22 12 32]\r\n                                                ;\r\n    packssdw        m0, m1                      ; [00 20 10 30 03 23 13 33 01 21 11 31 02 22 12 32]\r\n    vmovshdup       m1, m0                      ; [10 30 10 30 13 33 13 33 11 31 11 31 12 32 12 32]\r\n    vmovsldup       m0, m0                      ; [00 20 00 20 03 23 03 23 01 21 01 21 02 22 02 22]\r\n                                                ;\r\n    vpbroadcastq    m2, [avx2_idct4_2    ]      ;\r\n    vpbroadcastq    m3, [avx2_idct4_2 + 8]      ;\r\n    pmaddwd         m0, m2                      ;\r\n    pmaddwd         m1, m3                      ;\r\n                                                ;\r\n    paddd           m2, m0, m1                  ;\r\n    paddd           m2, m5                      ;\r\n    psrad           m2, IDCT4_SHIFT2            ; [00 01 10 11 30 31 20 21]\r\n                                                ;\r\n    psubd           m0, m1                      ;\r\n    paddd           m0, m5                      ;\r\n    psrad           m0, IDCT4_SHIFT2            ; [03 02 13 12 33 32 23 22]\r\n                                                ;\r\n    pshufb          m0, [idct4_shuf2]           ; [02 03 12 13 32 33 22 23]\r\n    punpcklqdq      m1, m2, m0                  ; [00 01 02 03 10 11 12 13]\r\n    punpckhqdq      m2, m0                      ; [30 31 32 33 20 21 22 23]\r\n    packssdw        m1, m2                      ; [00 01 02 03 30 31 32 33 10 11 12 13 20 21 22 23]\r\n    vextracti128   xm0, m1, 1                   ;\r\n                                                ;\r\n    movq   [r1       ], xm1                     ; store result, line 0\r\n    movq   [r1 +   r2], xm0                     ; store result, line 1\r\n    movhps [r1 + 2*r2], xm0                     ; store result, line 
2\r\n    movhps [r1 +   r3], xm1                     ; store result, line 3\r\n    RET                                         ;\r\n%undef IDCT4_SHIFT1\r\n%undef IDCT4_SHIFT2\r\n\r\n\r\n%macro IDCT8_PASS_1 1\r\n    vpbroadcastd    m7, [r5 + %1    ]           ;\r\n    vpbroadcastd   m10, [r5 + %1 + 4]           ;\r\n    pmaddwd         m5, m4, m7                  ;\r\n    pmaddwd         m6, m0, m10                 ;\r\n    paddd           m5, m6                      ;\r\n                                                ;\r\n    vpbroadcastd    m7, [r6 + %1    ]           ;\r\n    vpbroadcastd   m10, [r6 + %1 + 4]           ;\r\n    pmaddwd         m6, m1, m7                  ;\r\n    pmaddwd         m3, m2, m10                 ;\r\n    paddd           m6, m3                      ;\r\n                                                ;\r\n    paddd           m3, m5, m6                  ;\r\n    paddd           m3, m11                     ;\r\n    psrad           m3, IDCT8_SHIFT1            ;\r\n                                                ;\r\n    psubd           m5, m6                      ;\r\n    paddd           m5, m11                     ;\r\n    psrad           m5, IDCT8_SHIFT1            ;\r\n                                                ;\r\n    vpbroadcastd    m7, [r5 + %1 + 32]          ;\r\n    vpbroadcastd   m10, [r5 + %1 + 36]          ;\r\n    pmaddwd         m6, m4, m7                  ;\r\n    pmaddwd         m8, m0, m10                 ;\r\n    paddd           m6, m8                      ;\r\n                                                ;\r\n    vpbroadcastd    m7, [r6 + %1 + 32]          ;\r\n    vpbroadcastd   m10, [r6 + %1 + 36]          ;\r\n    pmaddwd         m8, m1, m7                  ;\r\n    pmaddwd         m9, m2, m10                 ;\r\n    paddd           m8, m9                      ;\r\n                                                ;\r\n    paddd           m9, m6, m8                  ;\r\n    paddd           m9, m11             
        ;\r\n    psrad           m9, IDCT8_SHIFT1            ;\r\n                                                ;\r\n    psubd           m6, m8                      ;\r\n    paddd           m6, m11                     ;\r\n    psrad           m6, IDCT8_SHIFT1            ;\r\n                                                ;\r\n    packssdw        m3, m9                      ;\r\n    vpermq          m3, m3, 0xD8                ;\r\n                                                ;\r\n    packssdw        m6, m5                      ;\r\n    vpermq          m6, m6, 0xD8                ;\r\n%endmacro\r\n\r\n%macro IDCT8_PASS_2 0\r\n    punpcklqdq      m2, m0, m1                  ;\r\n    punpckhqdq      m0, m1                      ;\r\n                                                ;\r\n    pmaddwd         m3, m2, [r5     ]           ;\r\n    pmaddwd         m5, m2, [r5 + 32]           ;\r\n    pmaddwd         m6, m2, [r5 + 64]           ;\r\n    pmaddwd         m7, m2, [r5 + 96]           ;\r\n    phaddd          m3, m5                      ;\r\n    phaddd          m6, m7                      ;\r\n    pshufb          m3, [idct8_shuf2]           ;\r\n    pshufb          m6, [idct8_shuf2]           ;\r\n    punpcklqdq      m7, m3, m6                  ;\r\n    punpckhqdq      m3, m6                      ;\r\n                                                ;\r\n    pmaddwd         m5, m0, [r6     ]           ;\r\n    pmaddwd         m6, m0, [r6 + 32]           ;\r\n    pmaddwd         m8, m0, [r6 + 64]           ;\r\n    pmaddwd         m9, m0, [r6 + 96]           ;\r\n    phaddd          m5, m6                      ;\r\n    phaddd          m8, m9                      ;\r\n    pshufb          m5, [idct8_shuf2]           ;\r\n    pshufb          m8, [idct8_shuf2]           ;\r\n    punpcklqdq      m6, m5, m8                  ;\r\n    punpckhqdq      m5, m8                      ;\r\n                                                ;\r\n    paddd           m8, m7, m6       
           ;\r\n    paddd           m8, m12                     ;\r\n    psrad           m8, IDCT8_SHIFT2            ;\r\n                                                ;\r\n    psubd           m7, m6                      ;\r\n    paddd           m7, m12                     ;\r\n    psrad           m7, IDCT8_SHIFT2            ;\r\n                                                ;\r\n    pshufb          m7, [idct8_shuf3]           ;\r\n    packssdw        m8, m7                      ;\r\n                                                ;\r\n    paddd           m9, m3, m5                  ;\r\n    paddd           m9, m12                     ;\r\n    psrad           m9, IDCT8_SHIFT2            ;\r\n                                                ;\r\n    psubd           m3, m5                      ;\r\n    paddd           m3, m12                     ;\r\n    psrad           m3, IDCT8_SHIFT2            ;\r\n                                                ;\r\n    pshufb          m3, [idct8_shuf3]           ;\r\n    packssdw        m9, m3                      ;\r\n%endmacro\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; void idct_8x8(const coeff_t *src, coeff_t *dst, int i_dst)\r\n; ----------------------------------------------------------------------------\r\n\r\n; ------------------------------------------------------------------\r\n; idct_8x8_sse2\r\nINIT_XMM sse2\r\n\r\n    %define IDCT8_SHIFT1    5                   ; shift1 = 5\r\n    %define IDCT8_ADD1      [pd_16]             ; add1   = 16\r\n%if BIT_DEPTH == 10                             ;\r\n    %define IDCT8_SHIFT2    10                  ;\r\n    %define IDCT8_ADD2      [pd_512]            ;\r\n%elif BIT_DEPTH == 8                            ; for BIT_DEPTH: 8\r\n    %define IDCT8_SHIFT2    12                  ; shift2 = 12\r\n    %define IDCT8_ADD2      [pd_2048]           ; add2   = 2048\r\n%else                                           ;\r\n    %error 
Unsupported BIT_DEPTH!               ;\r\n%endif                                          ;\r\n\r\ncglobal idct_8x8, 3, 6, 16, 0-5*mmsize\r\n    mova            m9, [r0 + 1*mmsize]         ;\r\n    mova            m1, [r0 + 3*mmsize]         ;\r\n    mova            m7, m9                      ;\r\n    punpcklwd       m7, m1                      ;\r\n    punpckhwd       m9, m1                      ;\r\n    mova           m14, [tab_idct8_3]           ;\r\n    mova            m3, m14                     ;\r\n    pmaddwd        m14, m7                      ;\r\n    pmaddwd         m3, m9                      ;\r\n    mova            m0, [r0 + 5*mmsize]         ;\r\n    mova           m10, [r0 + 7*mmsize]         ;\r\n    mova            m2, m0                      ;\r\n    punpcklwd       m2, m10                     ;\r\n    punpckhwd       m0, m10                     ;\r\n    mova           m15, [tab_idct8_3+1*mmsize]  ;\r\n    mova           m11, [tab_idct8_3+1*mmsize]  ;\r\n    pmaddwd        m15, m2                      ;\r\n    mova            m4, [tab_idct8_3+2*mmsize]  ;\r\n    pmaddwd        m11, m0                      ;\r\n    mova            m1, [tab_idct8_3+2*mmsize]  ;\r\n    paddd          m15, m14                     ;\r\n    mova            m5, [tab_idct8_3+4*mmsize]  ;\r\n    mova           m12, [tab_idct8_3+4*mmsize]  ;\r\n    paddd          m11, m3                      ;\r\n    mova          [rsp + 0*mmsize], m11         ;\r\n    mova          [rsp + 1*mmsize], m15         ;\r\n    pmaddwd         m4, m7                      ;\r\n    pmaddwd         m1, m9                      ;\r\n    mova           m14, [tab_idct8_3+3*mmsize]  ;\r\n    mova            m3, [tab_idct8_3+3*mmsize]  ;\r\n    pmaddwd        m14, m2                      ;\r\n    pmaddwd         m3, m0                      ;\r\n    paddd          m14, m4                      ;\r\n    paddd           m3, m1                      ;\r\n    mova          [rsp + 2*mmsize], m3          ;\r\n    
pmaddwd         m5, m9                      ;\r\n    pmaddwd         m9, [tab_idct8_3+6*mmsize]  ;\r\n    mova            m6, [tab_idct8_3+5*mmsize]  ;\r\n    pmaddwd        m12, m7                      ;\r\n    pmaddwd         m7, [tab_idct8_3+6*mmsize]  ;\r\n    mova            m4, [tab_idct8_3+5*mmsize]  ;\r\n    pmaddwd         m6, m2                      ;\r\n    paddd           m6, m12                     ;\r\n    pmaddwd         m2, [tab_idct8_3+7*mmsize]  ;\r\n    paddd           m7, m2                      ;\r\n    mova          [rsp + 3*mmsize], m6          ;\r\n    pmaddwd         m4, m0                      ;\r\n    pmaddwd         m0, [tab_idct8_3+7*mmsize]  ;\r\n    paddd           m9, m0                      ;\r\n    paddd           m5, m4                      ;\r\n    mova            m6, [r0 + 0*mmsize]         ;\r\n    mova            m0, [r0 + 4*mmsize]         ;\r\n    mova            m4, m6                      ;\r\n    punpcklwd       m4, m0                      ;\r\n    punpckhwd       m6, m0                      ;\r\n    mova           m12, [r0 + 2*mmsize]         ;\r\n    mova            m0, [r0 + 6*mmsize]         ;\r\n    mova           m13, m12                     ;\r\n    mova            m8, [tab_dct4]              ;\r\n    punpcklwd      m13, m0                      ;\r\n    mova           m10, [tab_dct4]              ;\r\n    punpckhwd      m12, m0                      ;\r\n    pmaddwd         m8, m4                      ;\r\n    mova            m3, m8                      ;\r\n    pmaddwd         m4, [tab_dct4 + 2*mmsize]   ;\r\n    pmaddwd        m10, m6                      ;\r\n    mova            m2, [tab_dct4 + 1*mmsize]   ;\r\n    mova            m1, m10                     ;\r\n    pmaddwd         m6, [tab_dct4 + 2*mmsize]   ;\r\n    mova            m0, [tab_dct4 + 1*mmsize]   ;\r\n    pmaddwd         m2, m13                     ;\r\n    paddd           m3, m2                      ;\r\n    psubd           m8, m2                 
     ;\r\n    mova            m2, m6                      ;\r\n    pmaddwd        m13, [tab_dct4 + 3*mmsize]   ;\r\n    pmaddwd         m0, m12                     ;\r\n    paddd           m1, m0                      ;\r\n    psubd          m10, m0                      ;\r\n    mova            m0, m4                      ;\r\n    pmaddwd        m12, [tab_dct4 + 3*mmsize]   ;\r\n    paddd           m3, IDCT8_ADD1              ; add1   = 16\r\n    paddd           m1, IDCT8_ADD1              ; add1   = 16\r\n    paddd           m8, IDCT8_ADD1              ; add1   = 16\r\n    paddd          m10, IDCT8_ADD1              ; add1   = 16\r\n    paddd           m0, m13                     ;\r\n    paddd           m2, m12                     ;\r\n    paddd           m0, IDCT8_ADD1              ; add1   = 16\r\n    paddd           m2, IDCT8_ADD1              ; add1   = 16\r\n    psubd           m4, m13                     ;\r\n    psubd           m6, m12                     ;\r\n    paddd           m4, IDCT8_ADD1              ; add1   = 16\r\n    paddd           m6, IDCT8_ADD1              ; add1   = 16\r\n    mova           m12, m8                      ;\r\n    psubd           m8, m7                      ;\r\n    psrad           m8, IDCT8_SHIFT1            ; shift1 = 5\r\n    paddd          m15, m3                      ;\r\n    psubd           m3, [rsp + 1*mmsize]        ;\r\n    psrad          m15, IDCT8_SHIFT1            ; shift1 = 5\r\n    paddd          m12, m7                      ;\r\n    psrad          m12, IDCT8_SHIFT1            ; shift1 = 5\r\n    paddd          m11, m1                      ;\r\n    mova           m13, m14                     ;\r\n    psrad          m11, IDCT8_SHIFT1            ; shift1 = 5\r\n    packssdw       m15, m11                     ;\r\n    psubd           m1, [rsp + 0*mmsize]        ;\r\n    psrad           m1, IDCT8_SHIFT1            ; shift1 = 5\r\n    mova           m11, [rsp + 2*mmsize]        ;\r\n    paddd          m14, m0           
           ;\r\n    psrad          m14, IDCT8_SHIFT1            ; shift1 = 5\r\n    psubd           m0, m13                     ;\r\n    psrad           m0, IDCT8_SHIFT1            ; shift1 = 5\r\n    paddd          m11, m2                      ;\r\n    mova           m13, [rsp + 3*mmsize]        ;\r\n    psrad          m11, IDCT8_SHIFT1            ; shift1 = 5\r\n    packssdw       m14, m11                     ;\r\n    mova           m11, m6                      ;\r\n    psubd           m6, m5                      ;\r\n    paddd          m13, m4                      ;\r\n    psrad          m13, IDCT8_SHIFT1            ; shift1 = 5\r\n    psrad           m6, IDCT8_SHIFT1            ; shift1 = 5\r\n    paddd          m11, m5                      ;\r\n    psrad          m11, IDCT8_SHIFT1            ; shift1 = 5\r\n    packssdw       m13, m11                     ;\r\n    mova           m11, m10                     ;\r\n    psubd           m4, [rsp + 3*mmsize]        ;\r\n    psubd          m10, m9                      ;\r\n    psrad           m4, IDCT8_SHIFT1            ; shift1 = 5\r\n    psrad          m10, IDCT8_SHIFT1            ; shift1 = 5\r\n    packssdw        m4, m6                      ;\r\n    packssdw        m8, m10                     ;\r\n    paddd          m11, m9                      ;\r\n    psrad          m11, IDCT8_SHIFT1            ; shift1 = 5\r\n    packssdw       m12, m11                     ;\r\n    psubd           m2, [rsp + 2*mmsize]        ;\r\n    mova            m5, m15                     ;\r\n    psrad           m2, IDCT8_SHIFT1            ; shift1 = 5\r\n    packssdw        m0, m2                      ;\r\n    mova            m2, m14                     ;\r\n    psrad           m3, IDCT8_SHIFT1            ; shift1 = 5\r\n    packssdw        m3, m1                      ;\r\n    mova            m6, m13                     ;\r\n    punpcklwd       m5, m8                      ;\r\n    punpcklwd       m2, m4                      ;\r\n    
mova            m1, m12                     ;\r\n    punpcklwd       m6, m0                      ;\r\n    punpcklwd       m1, m3                      ;\r\n    mova            m9, m5                      ;\r\n    punpckhwd      m13, m0                      ;\r\n    mova            m0, m2                      ;\r\n    punpcklwd       m9, m6                      ;\r\n    punpckhwd       m5, m6                      ;\r\n    punpcklwd       m0, m1                      ;\r\n    punpckhwd       m2, m1                      ;\r\n    punpckhwd      m15, m8                      ;\r\n    mova            m1, m5                      ;\r\n    punpckhwd      m14, m4                      ;\r\n    punpckhwd      m12, m3                      ;\r\n    mova            m6, m9                      ;\r\n    punpckhwd       m9, m0                      ;\r\n    punpcklwd       m1, m2                      ;\r\n    mova            m4, [tab_idct8_3+0*mmsize]  ;\r\n    punpckhwd       m5, m2                      ;\r\n    punpcklwd       m6, m0                      ;\r\n    mova            m2, m15                     ;\r\n    mova            m0, m14                     ;\r\n    mova            m7, m9                      ;\r\n    punpcklwd       m2, m13                     ;\r\n    punpcklwd       m0, m12                     ;\r\n    punpcklwd       m7, m5                      ;\r\n    punpckhwd      m14, m12                     ;\r\n    mova           m10, m2                      ;\r\n    punpckhwd      m15, m13                     ;\r\n    punpckhwd       m9, m5                      ;\r\n    pmaddwd         m4, m7                      ;\r\n    mova           m13, m1                      ;\r\n    punpckhwd       m2, m0                      ;\r\n    punpcklwd      m10, m0                      ;\r\n    mova            m0, m15                     ;\r\n    punpckhwd      m15, m14                     ;\r\n    mova           m12, m1                      ;\r\n    mova            m3, 
[tab_idct8_3+0*mmsize]  ;\r\n    punpcklwd       m0, m14                     ;\r\n    pmaddwd         m3, m9                      ;\r\n    mova           m11, m2                      ;\r\n    punpckhwd       m2, m15                     ;\r\n    punpcklwd      m11, m15                     ;\r\n    mova            m8, [tab_idct8_3+1*mmsize]  ;\r\n    punpcklwd      m13, m0                      ;\r\n    punpckhwd      m12, m0                      ;\r\n    pmaddwd         m8, m11                     ;\r\n    paddd           m8, m4                      ;\r\n    mova          [rsp + 4*mmsize], m8          ;\r\n    mova            m4, [tab_idct8_3+2*mmsize]  ;\r\n    pmaddwd         m4, m7                      ;\r\n    mova           m15, [tab_idct8_3+2*mmsize]  ;\r\n    mova            m5, [tab_idct8_3+1*mmsize]  ;\r\n    pmaddwd        m15, m9                      ;\r\n    pmaddwd         m5, m2                      ;\r\n    paddd           m5, m3                      ;\r\n    mova           [rsp + 3*mmsize], m5         ;\r\n    mova           m14, [tab_idct8_3+3*mmsize]  ;\r\n    mova            m5, [tab_idct8_3+3*mmsize]  ;\r\n    pmaddwd        m14, m11                     ;\r\n    paddd          m14, m4                      ;\r\n    mova          [rsp + 2*mmsize], m14         ;\r\n    pmaddwd         m5, m2                      ;\r\n    paddd           m5, m15                     ;\r\n    mova          [rsp + 1*mmsize], m5          ;\r\n    mova           m15, [tab_idct8_3+4*mmsize]  ;\r\n    mova            m5, [tab_idct8_3+4*mmsize]  ;\r\n    pmaddwd        m15, m7                      ;\r\n    pmaddwd         m7, [tab_idct8_3+6*mmsize]  ;\r\n    pmaddwd         m5, m9                      ;\r\n    pmaddwd         m9, [tab_idct8_3+6*mmsize]  ;\r\n    mova            m4, [tab_idct8_3+5*mmsize]  ;\r\n    pmaddwd         m4, m2                      ;\r\n    paddd           m5, m4                      ;\r\n    mova            m4, m6                      ;\r\n    mova  
          m8, [tab_idct8_3+5*mmsize]  ;\r\n    punpckhwd       m6, m10                     ;\r\n    pmaddwd         m2, [tab_idct8_3+7*mmsize]  ;\r\n    punpcklwd       m4, m10                     ;\r\n    paddd           m9, m2                      ;\r\n    pmaddwd         m8, m11                     ;\r\n    mova           m10, [tab_dct4]              ;\r\n    paddd           m8, m15                     ;\r\n    pmaddwd        m11, [tab_idct8_3+7*mmsize]  ;\r\n    paddd           m7, m11                     ;\r\n    mova          [rsp + 0*mmsize], m8          ;\r\n    pmaddwd        m10, m6                      ;\r\n    pmaddwd         m6, [tab_dct4 + 2*mmsize]   ;\r\n    mova            m1, m10                     ;\r\n    mova            m8, [tab_dct4]              ;\r\n    mova            m3, [tab_dct4 + 1*mmsize]   ;\r\n    pmaddwd         m8, m4                      ;\r\n    pmaddwd         m4, [tab_dct4 + 2*mmsize]   ;\r\n    mova            m0, m8                      ;\r\n    mova            m2, [tab_dct4 + 1*mmsize]   ;\r\n    pmaddwd         m3, m13                     ;\r\n    psubd           m8, m3                      ;\r\n    paddd           m0, m3                      ;\r\n    mova            m3, m6                      ;\r\n    pmaddwd        m13, [tab_dct4 + 3*mmsize]   ;\r\n    pmaddwd         m2, m12                     ;\r\n    paddd           m1, m2                      ;\r\n    psubd          m10, m2                      ;\r\n    mova            m2, m4                      ;\r\n    pmaddwd        m12, [tab_dct4 + 3*mmsize]   ;\r\n    paddd           m0, IDCT8_ADD2              ; add2   = 2048\r\n    paddd           m1, IDCT8_ADD2              ; add2   = 2048\r\n    paddd           m8, IDCT8_ADD2              ; add2   = 2048\r\n    paddd          m10, IDCT8_ADD2              ; add2   = 2048\r\n    paddd           m2, m13                     ;\r\n    paddd           m3, m12                     ;\r\n    paddd           m2, IDCT8_ADD2            
  ; add2   = 2048\r\n    paddd           m3, IDCT8_ADD2              ; add2   = 2048\r\n    psubd           m4, m13                     ;\r\n    psubd           m6, m12                     ;\r\n    paddd           m4, IDCT8_ADD2              ; add2   = 2048\r\n    paddd           m6, IDCT8_ADD2              ; add2   = 2048\r\n    mova           m15, [rsp + 4*mmsize]        ;\r\n    mova           m12, m8                      ;\r\n    psubd           m8, m7                      ;\r\n    psrad           m8, IDCT8_SHIFT2            ; shift2 = 12\r\n    mova           m11, [rsp + 3*mmsize]        ;\r\n    paddd          m15, m0                      ;\r\n    psrad          m15, IDCT8_SHIFT2            ; shift2 = 12\r\n    psubd           m0, [rsp + 4*mmsize]        ;\r\n    psrad           m0, IDCT8_SHIFT2            ; shift2 = 12\r\n    paddd          m12, m7                      ;\r\n    paddd          m11, m1                      ;\r\n    mova           m14, [rsp + 2*mmsize]        ;\r\n    psrad          m11, IDCT8_SHIFT2            ; shift2 = 12\r\n    packssdw       m15, m11                     ;\r\n    psubd           m1, [rsp + 3*mmsize]        ;\r\n    psrad           m1, IDCT8_SHIFT2            ; shift2 = 12\r\n    mova           m11, [rsp + 1*mmsize]        ;\r\n    paddd          m14, m2                      ;\r\n    psrad          m14, IDCT8_SHIFT2            ; shift2 = 12\r\n    packssdw        m0, m1                      ;\r\n    psrad          m12, IDCT8_SHIFT2            ; shift2 = 12\r\n    psubd           m2, [rsp + 2*mmsize]        ;\r\n    paddd          m11, m3                      ;\r\n    mova           m13, [rsp + 0*mmsize]        ;\r\n    psrad          m11, IDCT8_SHIFT2            ; shift2 = 12\r\n    packssdw       m14, m11                     ;\r\n    mova           m11, m6                      ;\r\n    psubd           m6, m5                      ;\r\n    paddd          m13, m4                      ;\r\n    psrad          m13, IDCT8_SHIFT2   
         ; shift2 = 12\r\n    mova            m1, m15                     ;\r\n    paddd          m11, m5                      ;\r\n    psrad          m11, IDCT8_SHIFT2            ; shift2 = 12\r\n    packssdw       m13, m11                     ;\r\n    mova           m11, m10                     ;\r\n    psubd          m10, m9                      ;\r\n    psrad          m10, IDCT8_SHIFT2            ; shift2 = 12\r\n    packssdw        m8, m10                     ;\r\n    psrad           m6, IDCT8_SHIFT2            ; shift2 = 12\r\n    psubd           m4, [rsp + 0*mmsize]        ;\r\n    paddd          m11, m9                      ;\r\n    psrad          m11, IDCT8_SHIFT2            ; shift2 = 12\r\n    packssdw       m12, m11                     ;\r\n    punpcklwd       m1, m14                     ;\r\n    mova            m5, m13                     ;\r\n    psrad           m4, IDCT8_SHIFT2            ; shift2 = 12\r\n    packssdw        m4, m6                      ;\r\n    psubd           m3, [rsp + 1*mmsize]        ;\r\n    psrad           m2, IDCT8_SHIFT2            ; shift2 = 12\r\n    mova            m6, m8                      ;\r\n    psrad           m3, IDCT8_SHIFT2            ; shift2 = 12\r\n    punpcklwd       m5, m12                     ;\r\n    packssdw        m2, m3                      ;\r\n    punpcklwd       m6, m4                      ;\r\n    punpckhwd       m8, m4                      ;\r\n    mova            m4, m1                      ;\r\n    mova            m3, m2                      ;\r\n    punpckhdq       m1, m5                      ;\r\n    punpckldq       m4, m5                      ;\r\n    punpcklwd       m3, m0                      ;\r\n    punpckhwd       m2, m0                      ;\r\n    mova            m0, m6                      ;\r\n    lea             r2, [r2 +   r2]             ;\r\n    lea             r4, [r2 +   r2]             ;\r\n    lea             r3, [r4 +   r2]             ;\r\n    lea             r4, [r4 +   
r3]             ;\r\n    lea             r0, [r4 + 2*r2]             ;\r\n    movq          [r1], m4                      ;\r\n    punpckhwd      m15, m14                     ;\r\n    movhps [r1 +   r2], m4                      ;\r\n    punpckhdq       m0, m3                      ;\r\n    movq   [r1 + 2*r2], m1                      ;\r\n    punpckhwd      m13, m12                     ;\r\n    movhps [r1 +   r3], m1                      ;\r\n    mova            m1, m6                      ;\r\n    punpckldq       m1, m3                      ;\r\n    movq           [r1        + 8], m1          ;\r\n    movhps         [r1 +   r2 + 8], m1          ;\r\n    movq           [r1 + 2*r2 + 8], m0          ;\r\n    movhps         [r1 +   r3 + 8], m0          ;\r\n    mova            m0, m15                     ;\r\n    punpckhdq      m15, m13                     ;\r\n    punpckldq       m0, m13                     ;\r\n    movq           [r1 + 4*r2], m0              ;\r\n    movhps         [r1 +   r4], m0              ;\r\n    mova            m0, m8                      ;\r\n    punpckhdq       m8, m2                      ;\r\n    movq           [r1 + 2*r3], m15             ;\r\n    punpckldq       m0, m2                      ;\r\n    movhps         [r1 +   r0    ], m15         ;\r\n    movq           [r1 + 4*r2 + 8], m0          ;\r\n    movhps         [r1 +   r4 + 8], m0          ;\r\n    movq           [r1 + 2*r3 + 8], m8          ;\r\n    movhps         [r1 +   r0 + 8], m8          ;\r\n    RET                                         ;\r\n%undef IDCT8_SHIFT1\r\n%undef IDCT8_SHIFT2\r\n%undef IDCT8_ADD1\r\n%undef IDCT8_ADD2\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; void idct_8x8(const coeff_t *src, coeff_t *dst, int i_dst)\r\n; ----------------------------------------------------------------------------\r\n\r\n; ------------------------------------------------------------------\r\n; idct_8x8_avx2\r\nINIT_YMM 
avx2\r\ncglobal idct_8x8, 3, 7, 13, 0-8*16\r\n    %define IDCT8_SHIFT1    5                   ; shift1 = 5\r\n    %define IDCT8_ADD1      [pd_16]             ; add1   = 16\r\n%if BIT_DEPTH == 10                             ;\r\n    %define IDCT8_SHIFT2    10                  ;\r\n    vpbroadcastd   m12,     [pd_512]            ;\r\n%elif BIT_DEPTH == 8                            ; for BIT_DEPTH: 8\r\n    %define IDCT8_SHIFT2    12                  ; shift2 = 12\r\n    vpbroadcastd   m12,     [pd_2048]           ; add2   = 2048\r\n%else                                           ;\r\n    %error Unsupported BIT_DEPTH!               ;\r\n%endif                                          ;\r\n                                                ;\r\n    vbroadcasti128 m11, IDCT8_ADD1              ; add1   = 16\r\n                                                ;\r\n    mov             r4, rsp                     ;\r\n    lea             r5, [avx2_idct8_1]          ;\r\n    lea             r6, [avx2_idct8_2]          ;\r\n                                                ;\r\n    ;pass1                                      ;\r\n    movu            m1, [r0 + 0 * 32]           ; [0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1]\r\n    movu            m0, [r0 + 1 * 32]           ; [2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3]\r\n    vpunpcklwd      m5, m1, m0                  ; [0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3]\r\n    vpunpckhwd      m1, m0                      ; [0 2 0 2 0 2 0 2 1 3 1 3 1 3 1 3]\r\n    vinserti128     m4, m5, xm1, 1              ; [0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2]\r\n    vextracti128   xm2, m5, 1                   ; [1 3 1 3 1 3 1 3]\r\n    vinserti128     m1, m1, xm2, 0              ; [1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3]\r\n                                                ;\r\n    movu            m2, [r0 + 2 * 32]           ; [4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5]\r\n    movu            m0, [r0 + 3 * 32]           ; [6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7]\r\n    vpunpcklwd      m5, m2, m0                  
; [4 6 4 6 4 6 4 6 5 7 5 7 5 7 5 7]\r\n    vpunpckhwd      m2, m0                      ; [4 6 4 6 4 6 4 6 5 7 5 7 5 7 5 7]\r\n    vinserti128     m0, m5, xm2, 1              ; [4 6 4 6 4 6 4 6 4 6 4 6 4 6 4 6]\r\n    vextracti128   xm5, m5, 1                   ; [5 7 5 7 5 7 5 7]\r\n    vinserti128     m2, m2, xm5, 0              ; [5 7 5 7 5 7 5 7 5 7 5 7 5 7 5 7]\r\n                                                ;\r\n    mova            m5, [idct8_shuf1]           ;\r\n    vpermd          m4, m5, m4                  ;\r\n    vpermd          m0, m5, m0                  ;\r\n    vpermd          m1, m5, m1                  ;\r\n    vpermd          m2, m5, m2                  ;\r\n                                                ;\r\n    IDCT8_PASS_1    0                           ;\r\n    mova            [r4     ], m3               ;\r\n    mova            [r4 + 96], m6               ;\r\n                                                ;\r\n    IDCT8_PASS_1    64                          ;\r\n    mova            [r4 + 32], m3               ;\r\n    mova            [r4 + 64], m6               ;\r\n                                                ;\r\n    ;pass2                                      ;\r\n    add            r2d, r2d                     ;\r\n    lea             r3, [r2 * 3]                ;\r\n                                                ;\r\n    mova            m0, [r4     ]               ;\r\n    mova            m1, [r4 + 32]               ;\r\n    IDCT8_PASS_2                                ;\r\n                                                ;\r\n    vextracti128   xm3, m8, 1                   ;\r\n    movu           [r1       ], xm8             ;\r\n    movu           [r1 +   r2], xm3             ;\r\n    vextracti128   xm3, m9, 1                   ;\r\n    movu           [r1 + 2*r2], xm9             ;\r\n    movu           [r1 +   r3], xm3             ;\r\n                                                ;\r\n    lea             r1, [r1 + r2 * 4]  
         ;\r\n    mova            m0, [r4 + 64]               ;\r\n    mova            m1, [r4 + 96]               ;\r\n    IDCT8_PASS_2                                ;\r\n                                                ;\r\n    vextracti128   xm3, m8, 1                   ;\r\n    movu           [r1       ], xm8             ;\r\n    movu           [r1 +   r2], xm3             ;\r\n    vextracti128   xm3, m9, 1                   ;\r\n    movu           [r1 + 2*r2], xm9             ;\r\n    movu           [r1 +   r3], xm3             ;\r\n    RET                                         ;\r\n%undef IDCT8_SHIFT1\r\n%undef IDCT8_SHIFT2\r\n%undef IDCT8_ADD1\r\n%undef IDCT8_ADD2\r\n\r\n\r\n%macro IDCT16_PASS1 2\r\n    vbroadcasti128  m5, [tab_idct16_2 + %1 * 16]\r\n\r\n    pmaddwd         m9, m0, m5                  ;\r\n    pmaddwd        m10, m7, m5                  ;\r\n    phaddd          m9, m10                     ;\r\n                                                ;\r\n    pmaddwd        m10, m6, m5                  ;\r\n    pmaddwd        m11, m8, m5                  ;\r\n    phaddd         m10, m11                     ;\r\n                                                ;\r\n    phaddd          m9, m10                     ;\r\n    vbroadcasti128  m5, [tab_idct16_1 + %1*16]  ;\r\n                                                ;\r\n    pmaddwd        m10, m1, m5                  ;\r\n    pmaddwd        m11, m3, m5                  ;\r\n    phaddd         m10, m11                     ;\r\n                                                ;\r\n    pmaddwd        m11, m4, m5                  ;\r\n    pmaddwd        m12, m2, m5                  ;\r\n    phaddd         m11, m12                     ;\r\n                                                ;\r\n    phaddd         m10, m11                     ;\r\n                                                ;\r\n    paddd          m11, m9, m10                 ;\r\n    paddd          m11, m14                     ;\r\n    
psrad          m11, IDCT16_SHIFT1           ;\r\n                                                ;\r\n    psubd           m9, m10                     ;\r\n    paddd           m9, m14                     ;\r\n    psrad           m9, IDCT16_SHIFT1           ;\r\n                                                ;\r\n    vbroadcasti128  m5, [tab_idct16_2 + %1*16 + 16]\r\n                                                ;\r\n    pmaddwd        m10, m0, m5                  ;\r\n    pmaddwd        m12, m7, m5                  ;\r\n    phaddd         m10, m12                     ;\r\n                                                ;\r\n    pmaddwd        m12, m6, m5                  ;\r\n    pmaddwd        m13, m8, m5                  ;\r\n    phaddd         m12, m13                     ;\r\n                                                ;\r\n    phaddd         m10, m12                     ;\r\n    vbroadcasti128  m5, [tab_idct16_1 + %1 * 16  + 16]\r\n                                                ;\r\n    pmaddwd        m12, m1, m5                  ;\r\n    pmaddwd        m13, m3, m5                  ;\r\n    phaddd         m12, m13                     ;\r\n                                                ;\r\n    pmaddwd        m13, m4, m5                  ;\r\n    pmaddwd         m5, m2                      ;\r\n    phaddd         m13, m5                      ;\r\n                                                ;\r\n    phaddd         m12, m13                     ;\r\n                                                ;\r\n    paddd           m5, m10, m12                ;\r\n    paddd           m5, m14                     ;\r\n    psrad           m5, IDCT16_SHIFT1           ;\r\n                                                ;\r\n    psubd          m10, m12                     ;\r\n    paddd          m10, m14                     ;\r\n    psrad          m10, IDCT16_SHIFT1           ;\r\n                                                ;\r\n    packssdw       m11, m5          
            ;\r\n    packssdw        m9, m10                     ;\r\n                                                ;\r\n    mova           m10, [idct16_shuff]          ;\r\n    mova            m5, [idct16_shuff1]         ;\r\n                                                ;\r\n    vpermd         m12, m10, m11                ;\r\n    vpermd         m13, m5, m9                  ;\r\n    mova           [r3 + %1*16*2     ], xm12    ;\r\n    mova           [r3 + %2*16*2     ], xm13    ;\r\n    vextracti128   [r3 + %2*16*2 + 32], m13, 1  ;\r\n    vextracti128   [r3 + %1*16*2 + 32], m12, 1  ;\r\n%endmacro\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; void idct_16x16(const coeff_t *src, coeff_t *dst, int i_dst)\r\n; ----------------------------------------------------------------------------\r\n\r\n; ------------------------------------------------------------------\r\n; idct_16x16_avx2\r\nINIT_YMM avx2\r\ncglobal idct_16x16, 3, 7, 16, 0-16*mmsize\r\n%define IDCT16_SHIFT1       5                   ; shift1 = 5\r\n%define IDCT16_ADD1         [pd_16]             ; add1   = 16\r\n%if BIT_DEPTH == 10                             ;\r\n    %define IDCT16_SHIFT2   10                  ;\r\n    vpbroadcastd  m15,      [pd_512]            ;\r\n%elif BIT_DEPTH == 8                            ; for BIT_DEPTH: 8\r\n    %define IDCT16_SHIFT2   12                  ; shift2 = 12\r\n    vpbroadcastd  m15,      [pd_2048]           ; add2   = 2048\r\n%else                                           ;\r\n    %error Unsupported BIT_DEPTH!               
;\r\n%endif                                          ;\r\n                                                ;\r\n    vbroadcasti128 m14, IDCT16_ADD1             ; add1   = 16\r\n                                                ;\r\n    add            r2d, r2d                     ;\r\n    mov             r3, rsp                     ;\r\n    mov            r4d, 2                       ;\r\n                                                ;\r\n.pass1:                                         ;\r\n    movu           xm0, [r0 +  0 * 32]          ;\r\n    movu           xm1, [r0 +  8 * 32]          ;\r\n    punpckhqdq     xm2, xm0, xm1                ;\r\n    punpcklqdq     xm0, xm1                     ;\r\n    vinserti128     m0, m0, xm2, 1              ;\r\n                                                ;\r\n    movu           xm1, [r0 +  1 * 32]          ;\r\n    movu           xm2, [r0 +  9 * 32]          ;\r\n    punpckhqdq     xm3, xm1, xm2                ;\r\n    punpcklqdq     xm1, xm2                     ;\r\n    vinserti128     m1, m1, xm3, 1              ;\r\n                                                ;\r\n    movu           xm2, [r0 + 2  * 32]          ;\r\n    movu           xm3, [r0 + 10 * 32]          ;\r\n    punpckhqdq     xm4, xm2, xm3                ;\r\n    punpcklqdq     xm2, xm3                     ;\r\n    vinserti128     m2, m2, xm4, 1              ;\r\n                                                ;\r\n    movu           xm3, [r0 + 3  * 32]          ;\r\n    movu           xm4, [r0 + 11 * 32]          ;\r\n    punpckhqdq     xm5, xm3, xm4                ;\r\n    punpcklqdq     xm3, xm4                     ;\r\n    vinserti128     m3, m3, xm5, 1              ;\r\n                                                ;\r\n    movu           xm4, [r0 + 4  * 32]          ;\r\n    movu           xm5, [r0 + 12 * 32]          ;\r\n    punpckhqdq     xm6, xm4, xm5                ;\r\n    punpcklqdq     xm4, xm5                     ;\r\n    vinserti128     
m4, m4, xm6, 1              ;\r\n                                                ;\r\n    movu           xm5, [r0 + 5  * 32]          ;\r\n    movu           xm6, [r0 + 13 * 32]          ;\r\n    punpckhqdq     xm7, xm5, xm6                ;\r\n    punpcklqdq     xm5, xm6                     ;\r\n    vinserti128     m5, m5, xm7, 1              ;\r\n                                                ;\r\n    movu           xm6, [r0 + 6  * 32]          ;\r\n    movu           xm7, [r0 + 14 * 32]          ;\r\n    punpckhqdq     xm8, xm6, xm7                ;\r\n    punpcklqdq     xm6, xm7                     ;\r\n    vinserti128     m6, m6, xm8, 1              ;\r\n                                                ;\r\n    movu           xm7, [r0 + 7  * 32]          ;\r\n    movu           xm8, [r0 + 15 * 32]          ;\r\n    punpckhqdq     xm9, xm7, xm8                ;\r\n    punpcklqdq     xm7, xm8                     ;\r\n    vinserti128     m7, m7, xm9, 1              ;\r\n                                                ;\r\n    punpckhwd       m8, m0, m2                  ; [8 10]\r\n    punpcklwd       m0, m2                      ; [0 2]\r\n                                                ;\r\n    punpckhwd       m2, m1, m3                  ; [9 11]\r\n    punpcklwd       m1, m3                      ; [1 3]\r\n                                                ;\r\n    punpckhwd       m3, m4, m6                  ; [12 14]\r\n    punpcklwd       m4, m6                      ; [4 6]\r\n                                                ;\r\n    punpckhwd       m6, m5, m7                  ; [13 15]\r\n    punpcklwd       m5, m7                      ; [5 7]\r\n                                                ;\r\n    punpckhdq       m7, m0, m4                  ; [02 22 42 62 03 23 43 63 06 26 46 66 07 27 47 67]\r\n    punpckldq       m0, m4                      ; [00 20 40 60 01 21 41 61 04 24 44 64 05 25 45 65]\r\n                                                ;\r\n    
punpckhdq       m4, m8, m3                  ; [82 102 122 142 83 103 123 143 86 106 126 146 87 107 127 147]\r\n    punpckldq       m8, m3                      ; [80 100 120 140 81 101 121 141 84 104 124 144 85 105 125 145]\r\n                                                ;\r\n    punpckhdq       m3, m1, m5                  ; [12 32 52 72 13 33 53 73 16 36 56 76 17 37 57 77]\r\n    punpckldq       m1, m5                      ; [10 30 50 70 11 31 51 71 14 34 54 74 15 35 55 75]\r\n                                                ;\r\n    punpckhdq       m5, m2, m6                  ; [92 112 132 152 93 113 133 153 96 116 136 156 97 117 137 157]\r\n    punpckldq       m2, m6                      ; [90 110 130 150 91 111 131 151 94 114 134 154 95 115 135 155]\r\n                                                ;\r\n    punpckhqdq      m6, m0, m8                  ; [01 21 41 61 81 101 121 141 05 25 45 65 85 105 125 145]\r\n    punpcklqdq      m0, m8                      ; [00 20 40 60 80 100 120 140 04 24 44 64 84 104 124 144]\r\n                                                ;\r\n    punpckhqdq      m8, m7, m4                  ; [03 23 43 63 83 103 123 143 07 27 47 67 87 107 127 147]\r\n    punpcklqdq      m7, m4                      ; [02 22 42 62 82 102 122 142 06 26 46 66 86 106 126 146]\r\n                                                ;\r\n    punpckhqdq      m4, m1, m2                  ; [11 31 51 71 91 111 131 151 15 35 55 75 95 115 135 155]\r\n    punpcklqdq      m1, m2                      ; [10 30 50 70 90 110 130 150 14 34 54 74 94 114 134 154]\r\n                                                ;\r\n    punpckhqdq      m2, m3, m5                  ; [13 33 53 73 93 113 133 153 17 37 57 77 97 117 137 157]\r\n    punpcklqdq      m3, m5                      ; [12 32 52 72 92 112 132 152 16 36 56 76 96 116 136 156]\r\n                                                ;\r\n    IDCT16_PASS1    0, 14                       ;\r\n    IDCT16_PASS1    2, 12                 
      ;\r\n    IDCT16_PASS1    4, 10                       ;\r\n    IDCT16_PASS1    6, 8                        ;\r\n                                                ;\r\n    add             r0, 16                      ;\r\n    add             r3, 16                      ;\r\n    dec            r4d                          ;\r\n    jnz            .pass1                       ;\r\n                                                ;\r\n    mov             r3, rsp                     ;\r\n    mov            r4d, 8                       ;\r\n    lea             r5, [tab_idct16_2]          ;\r\n    lea             r6, [tab_idct16_1]          ;\r\n                                                ;\r\n    vbroadcasti128  m7, [r5     ]               ;\r\n    vbroadcasti128  m8, [r5 + 16]               ;\r\n    vbroadcasti128  m9, [r5 + 32]               ;\r\n    vbroadcasti128 m10, [r5 + 48]               ;\r\n    vbroadcasti128 m11, [r5 + 64]               ;\r\n    vbroadcasti128 m12, [r5 + 80]               ;\r\n    vbroadcasti128 m13, [r5 + 96]               ;\r\n                                                ;\r\n.pass2:                                         ;\r\n    movu            m1, [r3]                    ;\r\n    vpermq          m0, m1, 0xD8                ;\r\n                                                ;\r\n    pmaddwd         m1, m0, m7                  ;\r\n    pmaddwd         m2, m0, m8                  ;\r\n    phaddd          m1, m2                      ;\r\n                                                ;\r\n    pmaddwd         m2, m0, m9                  ;\r\n    pmaddwd         m3, m0, m10                 ;\r\n    phaddd          m2, m3                      ;\r\n                                                ;\r\n    phaddd          m1, m2                      ;\r\n                                                ;\r\n    pmaddwd         m2, m0, m11                 ;\r\n    pmaddwd         m3, m0, m12                 ;\r\n    phaddd          m2, m3  
                    ;\r\n                                                ;\r\n    vbroadcasti128 m14, [r5 + 112]              ;\r\n    pmaddwd         m3, m0, m13                 ;\r\n    pmaddwd         m4, m0, m14                 ;\r\n    phaddd          m3, m4                      ;\r\n                                                ;\r\n    phaddd          m2, m3                      ;\r\n                                                ;\r\n    movu            m3, [r3 + 32]               ;\r\n    vpermq          m0, m3, 0xD8                ;\r\n                                                ;\r\n    vbroadcasti128 m14, [r6]                    ;\r\n    pmaddwd         m3, m0, m14                 ;\r\n    vbroadcasti128 m14, [r6 + 16]               ;\r\n    pmaddwd         m4, m0, m14                 ;\r\n    phaddd          m3, m4                      ;\r\n                                                ;\r\n    vbroadcasti128 m14, [r6 + 32]               ;\r\n    pmaddwd         m4, m0, m14                 ;\r\n    vbroadcasti128 m14, [r6 + 48]               ;\r\n    pmaddwd         m5, m0, m14                 ;\r\n    phaddd          m4, m5                      ;\r\n                                                ;\r\n    phaddd          m3, m4                      ;\r\n                                                ;\r\n    vbroadcasti128 m14, [r6 + 64]               ;\r\n    pmaddwd         m4, m0, m14                 ;\r\n    vbroadcasti128 m14, [r6 + 80]               ;\r\n    pmaddwd         m5, m0, m14                 ;\r\n    phaddd          m4, m5                      ;\r\n                                                ;\r\n    vbroadcasti128 m14, [r6 + 96]               ;\r\n    pmaddwd         m6, m0, m14                 ;\r\n    vbroadcasti128 m14, [r6 + 112]              ;\r\n    pmaddwd         m0, m14                     ;\r\n    phaddd          m6, m0                      ;\r\n                                                ;\r\n    phaddd    
      m4, m6                      ;\r\n                                                ;\r\n    paddd           m5, m1, m3                  ;\r\n    paddd           m5, m15                     ;\r\n    psrad           m5, IDCT16_SHIFT2           ;\r\n                                                ;\r\n    psubd           m1, m3                      ;\r\n    paddd           m1, m15                     ;\r\n    psrad           m1, IDCT16_SHIFT2           ;\r\n                                                ;\r\n    paddd           m6, m2, m4                  ;\r\n    paddd           m6, m15                     ;\r\n    psrad           m6, IDCT16_SHIFT2           ;\r\n                                                ;\r\n    psubd           m2, m4                      ;\r\n    paddd           m2, m15                     ;\r\n    psrad           m2, IDCT16_SHIFT2           ;\r\n                                                ;\r\n    packssdw        m5, m6                      ;\r\n    packssdw        m1, m2                      ;\r\n    pshufb          m2, m1, [dct16_shuf1]       ;\r\n                                                ;\r\n    mova           [r1          ], xm5          ;\r\n    mova           [r1      + 16], xm2          ;\r\n    vextracti128   [r1 + r2     ], m5, 1        ;\r\n    vextracti128   [r1 + r2 + 16], m2, 1        ;\r\n                                                ;\r\n    lea             r1, [r1 + 2 * r2]           ;\r\n    add             r3, 64                      ;\r\n    dec            r4d                          ;\r\n    jnz            .pass2                       ;\r\n    RET                                         ;\r\n%undef IDCT16_SHIFT1\r\n%undef IDCT16_SHIFT2\r\n%undef IDCT16_ADD1\r\n%undef IDCT16_ADD2\r\n\r\n\r\n%macro IDCT32_PASS1 1\r\n    vbroadcasti128  m3, [tab_idct32_1+%1*32   ] ;\r\n    vbroadcasti128 m13, [tab_idct32_1+%1*32+16] ;\r\n    pmaddwd         m9, m4, m3                  ;\r\n    pmaddwd        m10, m8, m13   
              ;\r\n    phaddd          m9, m10                     ;\r\n                                                ;\r\n    pmaddwd        m10, m2, m3                  ;\r\n    pmaddwd        m11, m1, m13                 ;\r\n    phaddd         m10, m11                     ;\r\n                                                ;\r\n    phaddd          m9, m10                     ;\r\n                                                ;\r\n    vbroadcasti128  m3, [tab_idct32_1+(15 - %1)*32   ]\r\n    vbroadcasti128 m13, [tab_idct32_1+(15 - %1)*32+16]\r\n    pmaddwd        m10, m4, m3                  ;\r\n    pmaddwd        m11, m8, m13                 ;\r\n    phaddd         m10, m11                     ;\r\n                                                ;\r\n    pmaddwd        m11, m2, m3                  ;\r\n    pmaddwd        m12, m1, m13                 ;\r\n    phaddd         m11, m12                     ;\r\n                                                ;\r\n    phaddd         m10, m11                     ;\r\n    phaddd          m9, m10                     ; [row0s0 row2s0 row0s15 row2s15 row1s0 row3s0 row1s15 row3s15]\r\n                                                ;\r\n    vbroadcasti128  m3, [tab_idct32_2 + %1*16]  ;\r\n    pmaddwd        m10, m0, m3                  ;\r\n    pmaddwd        m11, m7, m3                  ;\r\n    phaddd         m10, m11                     ;\r\n    phaddd         m10, m10                     ;\r\n                                                ;\r\n    vbroadcasti128  m3, [tab_idct32_3 + %1*16]  ;\r\n    pmaddwd        m11, m5, m3                  ;\r\n    pmaddwd        m12, m6, m3                  ;\r\n    phaddd         m11, m12                     ;\r\n    phaddd         m11, m11                     ;\r\n                                                ;\r\n    paddd          m12, m10, m11                ; [row0a0 row2a0 NIL NIL row1a0 row3a0 NIL NIL]\r\n    psubd          m10, m11                     ; [row0a15 
row2a15 NIL NIL row1a15 row3a15 NIL NIL]\r\n                                                ;\r\n    punpcklqdq     m12, m10                     ; [row0a0 row2a0 row0a15 row2a15 row1a0 row3a0 row1a15 row3a15]\r\n    paddd          m10, m9, m12                 ;\r\n    paddd          m10, m15                     ;\r\n    psrad          m10, IDCT32_SHIFT1           ;\r\n                                                ;\r\n    psubd          m12, m9                      ;\r\n    paddd          m12, m15                     ;\r\n    psrad          m12, IDCT32_SHIFT1           ;\r\n                                                ;\r\n    packssdw       m10, m12                     ;\r\n    vextracti128  xm12, m10, 1                  ;\r\n    movd    [r3              + %1*64], xm10     ;\r\n    movd    [r3 + 32         + %1*64], xm12     ;\r\n    pextrd  [r4              - %1*64], xm10, 1  ;\r\n    pextrd  [r4 + 32         - %1*64], xm12, 1  ;\r\n    pextrd  [r3 + 16*64      + %1*64], xm10, 3  ;\r\n    pextrd  [r3 + 16*64 + 32 + %1*64], xm12, 3  ;\r\n    pextrd  [r4 + 16*64      - %1*64], xm10, 2  ;\r\n    pextrd  [r4 + 16*64 + 32 - %1*64], xm12, 2  ;\r\n%endmacro\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; void idct_32x32(const coeff_t *src, coeff_t *dst, int i_dst)\r\n; ----------------------------------------------------------------------------\r\n\r\n; TODO: Reduce PHADDD instruction by PADDD\r\n\r\n; ------------------------------------------------------------------\r\n; idct_32x32_avx2\r\nINIT_YMM avx2\r\ncglobal idct_32x32, 3, 6, 16, 0-32*64\r\n    %define IDCT32_SHIFT1    5                  ; shift1 = 5\r\n    %define IDCT32_ADD1      [pd_16]            ; add1   = 16\r\n                                                ;\r\n    vbroadcasti128 m15, IDCT32_ADD1             ; add1   = 16\r\n                                                ;\r\n    mov             r3, rsp                     ;\r\n    lea             r4, 
[r3 + 15 * 64]          ;\r\n    mov            r5d, 8                       ;\r\n                                                ;\r\n.pass1:                                         ;\r\n    movq           xm0, [r0 +  2 * 64]          ;\r\n    movq           xm1, [r0 + 18 * 64]          ;\r\n    punpcklqdq     xm0, xm0, xm1                ;\r\n    movq           xm1, [r0 +  0 * 64]          ;\r\n    movq           xm2, [r0 + 16 * 64]          ;\r\n    punpcklqdq     xm1, xm1, xm2                ;\r\n    vinserti128     m0,  m0, xm1, 1             ; [2 18 0 16]\r\n                                                ;\r\n    movq           xm1, [r0 +  1 * 64]          ;\r\n    movq           xm2, [r0 +  9 * 64]          ;\r\n    punpcklqdq     xm1, xm1, xm2                ;\r\n    movq           xm2, [r0 + 17 * 64]          ;\r\n    movq           xm3, [r0 + 25 * 64]          ;\r\n    punpcklqdq     xm2, xm2, xm3                ;\r\n    vinserti128     m1,  m1, xm2, 1             ; [1 9 17 25]\r\n                                                ;\r\n    movq           xm2, [r0 +  6 * 64]          ;\r\n    movq           xm3, [r0 + 22 * 64]          ;\r\n    punpcklqdq     xm2, xm2, xm3                ;\r\n    movq           xm3, [r0 + 4 * 64]           ;\r\n    movq           xm4, [r0 + 20 * 64]          ;\r\n    punpcklqdq     xm3, xm3, xm4                ;\r\n    vinserti128     m2,  m2, xm3, 1             ; [6 22 4 20]\r\n                                                ;\r\n    movq           xm3, [r0 +  3 * 64]          ;\r\n    movq           xm4, [r0 + 11 * 64]          ;\r\n    punpcklqdq     xm3, xm3, xm4                ;\r\n    movq           xm4, [r0 + 19 * 64]          ;\r\n    movq           xm5, [r0 + 27 * 64]          ;\r\n    punpcklqdq     xm4, xm4, xm5                ;\r\n    vinserti128     m3,  m3, xm4, 1             ; [3 11 19 27]\r\n                                                ;\r\n    movq           xm4, [r0 + 10 * 64]          ;\r\n    movq      
     xm5, [r0 + 26 * 64]          ;\r\n    punpcklqdq     xm4, xm4, xm5                ;\r\n    movq           xm5, [r0 +  8 * 64]          ;\r\n    movq           xm6, [r0 + 24 * 64]          ;\r\n    punpcklqdq     xm5, xm5, xm6                ;\r\n    vinserti128     m4,  m4, xm5, 1             ; [10 26 8 24]\r\n                                                ;\r\n    movq           xm5, [r0 +  5 * 64]          ;\r\n    movq           xm6, [r0 + 13 * 64]          ;\r\n    punpcklqdq     xm5, xm5, xm6                ;\r\n    movq           xm6, [r0 + 21 * 64]          ;\r\n    movq           xm7, [r0 + 29 * 64]          ;\r\n    punpcklqdq     xm6, xm6, xm7                ;\r\n    vinserti128     m5,  m5, xm6, 1             ; [5 13 21 29]\r\n                                                ;\r\n    movq           xm6, [r0 + 14 * 64]          ;\r\n    movq           xm7, [r0 + 30 * 64]          ;\r\n    punpcklqdq     xm6, xm6, xm7                ;\r\n    movq           xm7, [r0 + 12 * 64]          ;\r\n    movq           xm8, [r0 + 28 * 64]          ;\r\n    punpcklqdq     xm7, xm7, xm8                ;\r\n    vinserti128     m6,  m6, xm7, 1             ; [14 30 12 28]\r\n                                                ;\r\n    movq           xm7, [r0 +  7 * 64]          ;\r\n    movq           xm8, [r0 + 15 * 64]          ;\r\n    punpcklqdq     xm7, xm7, xm8                ;\r\n    movq           xm8, [r0 + 23 * 64]          ;\r\n    movq           xm9, [r0 + 31 * 64]          ;\r\n    punpcklqdq     xm8, xm8, xm9                ;\r\n    vinserti128     m7,  m7, xm8, 1             ; [7 15 23 31]\r\n                                                ;\r\n    punpckhwd       m8, m0, m2                  ; [18 22 16 20]\r\n    punpcklwd       m0, m2                      ; [2 6 0 4]\r\n                                                ;\r\n    punpckhwd       m2, m1, m3                  ; [9 11 25 27]\r\n    punpcklwd       m1, m3                      ; [1 3 17 19]\r\n     
                                           ;\r\n    punpckhwd       m3, m4, m6                  ; [26 30 24 28]\r\n    punpcklwd       m4, m6                      ; [10 14 8 12]\r\n                                                ;\r\n    punpckhwd       m6, m5, m7                  ; [13 15 29 31]\r\n    punpcklwd       m5, m7                      ; [5 7 21 23]\r\n                                                ;\r\n    punpckhdq       m7, m0, m4                  ; [22 62 102 142 23 63 103 143 02 42 82 122 03 43 83 123]\r\n    punpckldq       m0, m4                      ; [20 60 100 140 21 61 101 141 00 40 80 120 01 41 81 121]\r\n                                                ;\r\n    punpckhdq       m4, m8, m3                  ; [182 222 262 302 183 223 263 303 162 202 242 282 163 203 243 283]\r\n    punpckldq       m8, m3                      ; [180 220 260 300 181 221 261 301 160 200 240 280 161 201 241 281]\r\n                                                ;\r\n    punpckhdq       m3, m1, m5                  ; [12 32 52 72 13 33 53 73 172 192 212 232 173 193 213 233]\r\n    punpckldq       m1, m5                      ; [10 30 50 70 11 31 51 71 170 190 210 230 171 191 211 231]\r\n                                                ;\r\n    punpckhdq       m5, m2, m6                  ; [92 112 132 152 93 113 133 153 252 272 292 312 253 273 293 313]\r\n    punpckldq       m2, m6                      ; [90 110 130 150 91 111 131 151 250 270 290 310 251 271 291 311]\r\n                                                ;\r\n    punpckhqdq      m6, m0, m8                  ; [21 61 101 141 181 221 261 301 01 41 81 121 161 201 241 281]\r\n    punpcklqdq      m0, m8                      ; [20 60 100 140 180 220 260 300 00 40 80 120 160 200 240 280]\r\n                                                ;\r\n    punpckhqdq      m8, m7, m4                  ; [23 63 103 143 183 223 263 303 03 43 83 123 163 203 243 283]\r\n    punpcklqdq      m7, m4                      ; [22 62 102 
142 182 222 262 302 02 42 82 122 162 202 242 282]\r\n                                                ;\r\n    punpckhqdq      m4, m1, m2                  ; [11 31 51 71 91 111 131 151 171 191 211 231 251 271 291 311]\r\n    punpcklqdq      m1, m2                      ; [10 30 50 70 90 110 130 150 170 190 210 230 250 270 290 310]\r\n                                                ;\r\n    punpckhqdq      m2, m3, m5                  ; [13 33 53 73 93 113 133 153 173 193 213 233 253 273 293 313]\r\n    punpcklqdq      m3, m5                      ; [12 32 52 72 92 112 132 152 172 192 212 232 252 272 292 312]\r\n                                                ;\r\n    vperm2i128      m5, m0, m6, 0x20            ; [20 60 100 140 180 220 260 300 21 61 101 141 181 221 261 301]\r\n    vperm2i128      m0, m0, m6, 0x31            ; [00 40 80 120 160 200 240 280 01 41 81 121 161 201 241 281]\r\n                                                ;\r\n    vperm2i128      m6, m7, m8, 0x20            ; [22 62 102 142 182 222 262 302 23 63 103 143 183 223 263 303]\r\n    vperm2i128      m7, m7, m8, 0x31            ; [02 42 82 122 162 202 242 282 03 43 83 123 163 203 243 283]\r\n                                                ;\r\n    vperm2i128      m8, m1, m4, 0x31            ; [170 190 210 230 250 270 290 310 171 191 211 231 251 271 291 311]\r\n    vperm2i128      m4, m1, m4, 0x20            ; [10 30 50 70 90 110 130 150 11 31 51 71 91 111 131 151]\r\n                                                ;\r\n    vperm2i128      m1, m3, m2, 0x31            ; [172 192 212 232 252 272 292 312 173 193 213 233 253 273 293 313]\r\n    vperm2i128      m2, m3, m2, 0x20            ; [12 32 52 72 92 112 132 152 13 33 53 73 93 113 133 153]\r\n                                                ;\r\n    IDCT32_PASS1    0                           ;\r\n    IDCT32_PASS1    1                           ;\r\n    IDCT32_PASS1    2                           ;\r\n    IDCT32_PASS1    3                           
;\r\n    IDCT32_PASS1    4                           ;\r\n    IDCT32_PASS1    5                           ;\r\n    IDCT32_PASS1    6                           ;\r\n    IDCT32_PASS1    7                           ;\r\n                                                ;\r\n    add             r0, 8                       ;\r\n    add             r3, 4                       ;\r\n    add             r4, 4                       ;\r\n    dec            r5d                          ;\r\n    jnz            .pass1                       ;\r\n                                                ;\r\n%if BIT_DEPTH == 10                             ;\r\n    %define IDCT_SHIFT2 10                      ;\r\n    vpbroadcastd   m15, [pd_512 ]               ;\r\n%elif BIT_DEPTH == 8                            ; for BIT_DEPTH: 8\r\n    test            r2, 0x01                    ; test flag?\r\n    jz             .b32x32                      ;\r\n    lea             r5, [pd_11  ]               ; shift2 = 11\r\n    vpbroadcastq   m15, [pd_2048]               ; add2   = 1024\r\n    and             r2, 0xFE                    ; clear the flag\r\n    jmp            .normal_start                ;\r\n.b32x32:                                        ;\r\n    lea             r5, [pd_12  ]               ; shift2 = 12\r\n    vpbroadcastq   m15, [pd_2048]               ; add2   = 2048\r\n.normal_start:                                  ;\r\n%else                                           ;\r\n    %error Unsupported BIT_DEPTH!               
;\r\n%endif                                          ;\r\n                                                ;\r\n    mov             r3, rsp                     ;\r\n    add            r2d, r2d                     ;\r\n    mov            r4d, 32                      ;\r\n                                                ;\r\n    mova            m7, [tab_idct32_4    ]      ;\r\n    mova            m8, [tab_idct32_4+ 32]      ;\r\n    mova            m9, [tab_idct32_4+ 64]      ;\r\n    mova           m10, [tab_idct32_4+ 96]      ;\r\n    mova           m11, [tab_idct32_4+128]      ;\r\n    mova           m12, [tab_idct32_4+160]      ;\r\n    mova           m13, [tab_idct32_4+192]      ;\r\n    mova           m14, [tab_idct32_4+224]      ;\r\n.pass2:                                         ;\r\n    movu            m0, [r3]                    ;\r\n    movu            m1, [r3 + 32]               ;\r\n                                                ;\r\n    pmaddwd         m2, m0, m7                  ;\r\n    pmaddwd         m3, m0, m8                  ;\r\n    phaddd          m2, m3                      ;\r\n                                                ;\r\n    pmaddwd         m3, m0, m9                  ;\r\n    pmaddwd         m4, m0, m10                 ;\r\n    phaddd          m3, m4                      ;\r\n                                                ;\r\n    phaddd          m2, m3                      ;\r\n                                                ;\r\n    pmaddwd         m3, m0, m11                 ;\r\n    pmaddwd         m4, m0, m12                 ;\r\n    phaddd          m3, m4                      ;\r\n                                                ;\r\n    pmaddwd         m4, m0, m13                 ;\r\n    pmaddwd         m5, m0, m14                 ;\r\n    phaddd          m4, m5                      ;\r\n                                                ;\r\n    phaddd          m3, m4                      ;\r\n                                  
              ;\r\n    vperm2i128      m4, m2, m3, 0x31            ;\r\n    vperm2i128      m2, m2, m3, 0x20            ;\r\n    paddd           m2, m4                      ;\r\n                                                ;\r\n    pmaddwd         m3, m0, [tab_idct32_4+256]  ;\r\n    pmaddwd         m4, m0, [tab_idct32_4+288]  ;\r\n    phaddd          m3, m4                      ;\r\n                                                ;\r\n    pmaddwd         m4, m0, [tab_idct32_4+320]  ;\r\n    pmaddwd         m5, m0, [tab_idct32_4+352]  ;\r\n    phaddd          m4, m5                      ;\r\n                                                ;\r\n    phaddd          m3, m4                      ;\r\n                                                ;\r\n    pmaddwd         m4, m0, [tab_idct32_4+384]  ;\r\n    pmaddwd         m5, m0, [tab_idct32_4+416]  ;\r\n    phaddd          m4, m5                      ;\r\n                                                ;\r\n    pmaddwd         m5, m0, [tab_idct32_4+448]  ;\r\n    pmaddwd         m0,     [tab_idct32_4+480]  ;\r\n    phaddd          m5, m0                      ;\r\n                                                ;\r\n    phaddd          m4, m5                      ;\r\n                                                ;\r\n    vperm2i128      m0, m3, m4, 0x31            ;\r\n    vperm2i128      m3, m3, m4, 0x20            ;\r\n    paddd           m3, m0                      ;\r\n                                                ;\r\n    pmaddwd         m4, m1, [tab_idct32_1]      ;\r\n    pmaddwd         m0, m1, [tab_idct32_1+32]   ;\r\n    phaddd          m4, m0                      ;\r\n                                                ;\r\n    pmaddwd         m5, m1, [tab_idct32_1+ 64]  ;\r\n    pmaddwd         m0, m1, [tab_idct32_1+ 96]  ;\r\n    phaddd          m5, m0                      ;\r\n                                                ;\r\n    phaddd          m4, m5                      ;\r\n                    
                            ;\r\n    pmaddwd         m5, m1, [tab_idct32_1+128]  ;\r\n    pmaddwd         m0, m1, [tab_idct32_1+160]  ;\r\n    phaddd          m5, m0                      ;\r\n                                                ;\r\n    pmaddwd         m6, m1, [tab_idct32_1+192]  ;\r\n    pmaddwd         m0, m1, [tab_idct32_1+224]  ;\r\n    phaddd          m6, m0                      ;\r\n                                                ;\r\n    phaddd          m5, m6                      ;\r\n                                                ;\r\n    vperm2i128      m0, m4, m5, 0x31            ;\r\n    vperm2i128      m4, m4, m5, 0x20            ;\r\n    paddd           m4, m0                      ;\r\n                                                ;\r\n    pmaddwd         m5, m1, [tab_idct32_1+256]  ;\r\n    pmaddwd         m0, m1, [tab_idct32_1+288]  ;\r\n    phaddd          m5, m0                      ;\r\n                                                ;\r\n    pmaddwd         m6, m1, [tab_idct32_1+320]  ;\r\n    pmaddwd         m0, m1, [tab_idct32_1+352]  ;\r\n    phaddd          m6, m0                      ;\r\n                                                ;\r\n    phaddd          m5, m6                      ;\r\n                                                ;\r\n    pmaddwd         m6, m1, [tab_idct32_1+384]  ;\r\n    pmaddwd         m0, m1, [tab_idct32_1+416]  ;\r\n    phaddd          m6, m0                      ;\r\n                                                ;\r\n    pmaddwd         m0, m1, [tab_idct32_1+448]  ;\r\n    pmaddwd         m1,     [tab_idct32_1+480]  ;\r\n    phaddd          m0, m1                      ;\r\n                                                ;\r\n    phaddd          m6, m0                      ;\r\n                                                ;\r\n    vperm2i128      m0, m5, m6, 0x31            ;\r\n    vperm2i128      m5, m5, m6, 0x20            ;\r\n    paddd           m5, m0                      ;\r\n      
                                          ;\r\n    paddd           m6, m2, m4                  ;\r\n    paddd           m6, m15                     ;\r\n    psrad           m6, [r5]                    ; shift2\r\n                                                ;\r\n    psubd           m2, m4                      ;\r\n    paddd           m2, m15                     ;\r\n    psrad           m2, [r5]                    ; shift2\r\n                                                ;\r\n    paddd           m4, m3, m5                  ;\r\n    paddd           m4, m15                     ;\r\n    psrad           m4, [r5]                    ; shift2\r\n                                                ;\r\n    psubd           m3, m5                      ;\r\n    paddd           m3, m15                     ;\r\n    psrad           m3, [r5]                    ; shift2\r\n                                                ;\r\n    packssdw        m6, m4                      ;\r\n    packssdw        m2, m3                      ;\r\n                                                ;\r\n    vpermq          m6, m6, 0xD8                ;\r\n    vpermq          m2, m2, 0x8D                ;\r\n    pshufb          m2, [dct16_shuf1]           ;\r\n                                                ;\r\n    movu     [r1     ], m6                      ;\r\n    movu     [r1 + 32], m2                      ;\r\n                                                ;\r\n    add             r1, r2                      ;\r\n    add             r3, 64                      ;\r\n    dec             r4d                         ;\r\n    jnz            .pass2                       ;\r\n    RET                                         ;\r\n%undef IDCT32_SHIFT1\r\n%undef IDCT32_SHIFT2\r\n%undef IDCT32_ADD1\r\n%undef IDCT32_ADD2\r\n\r\n%endif                                          ; if ARCH_X86_64 == 1\r\n"
  },
  {
    "path": "source/common/x86/dct8.h",
    "content": "/*****************************************************************************\r\n * Copyright (C) 2013-2017 MulticoreWare, Inc\r\n *\r\n * Authors: Nabajit Deka <nabajit@multicorewareinc.com>\r\n;*          Min Chen <chenm003@163.com>\r\n *          Jiaqi Zhang <zhangjiaqi.cs@gmail.com>\r\n *\r\n * This program is free software; you can redistribute it and/or modify\r\n * it under the terms of the GNU General Public License as published by\r\n * the Free Software Foundation; either version 2 of the License, or\r\n * (at your option) any later version.\r\n *\r\n * This program is distributed in the hope that it will be useful,\r\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n * GNU General Public License for more details.\r\n *\r\n * You should have received a copy of the GNU General Public License\r\n * along with this program; if not, write to the Free Software\r\n * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n * This program is also available under a commercial proprietary license.\r\n * For more information, contact us at license @ x265.com.\r\n *****************************************************************************/\r\n\r\n\r\n#ifndef DAVS2_I386_DCT8_H\r\n#define DAVS2_I386_DCT8_H\r\n#ifdef __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\nvoid FPFX(idct_4x4_sse2 )(const coeff_t *src, coeff_t *dst, int i_dst);\r\nvoid FPFX(idct_8x8_ssse3)(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#if ARCH_X86_64\r\nvoid FPFX(idct_4x4_avx2  )(const coeff_t *src, coeff_t *dst, int i_dst);\r\nvoid FPFX(idct_8x8_sse2  )(const coeff_t *src, coeff_t *dst, int i_dst);\r\nvoid FPFX(idct_8x8_avx2  )(const coeff_t *src, coeff_t *dst, int i_dst);\r\nvoid FPFX(idct_16x16_avx2)(const coeff_t *src, coeff_t *dst, int i_dst);\r\nvoid FPFX(idct_32x32_avx2)(const coeff_t *src, coeff_t *dst, int i_dst);\r\n#endif\r\n\r\n#ifdef 
__cplusplus\r\n}\r\n#endif\r\n#endif // ifndef DAVS2_I386_DCT8_H\r\n"
  },
  {
    "path": "source/common/x86/ipfilter8.asm",
    "content": ";*****************************************************************************\r\n;* Copyright (C) 2013-2017 MulticoreWare, Inc\r\n;*\r\n;* Authors: Min Chen <chenm003@163.com>\r\n;*          Nabajit Deka <nabajit@multicorewareinc.com>\r\n;*          Praveen Kumar Tiwari <praveen@multicorewareinc.com>\r\n;*\r\n;* This program is free software; you can redistribute it and/or modify\r\n;* it under the terms of the GNU General Public License as published by\r\n;* the Free Software Foundation; either version 2 of the License, or\r\n;* (at your option) any later version.\r\n;*\r\n;* This program is distributed in the hope that it will be useful,\r\n;* but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n;* GNU General Public License for more details.\r\n;*\r\n;* You should have received a copy of the GNU General Public License\r\n;* along with this program; if not, write to the Free Software\r\n;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n;*\r\n;* This program is also available under a commercial proprietary license.\r\n;* For more information, contact us at license @ x265.com.\r\n;*****************************************************************************/\r\n\r\n%include \"x86inc.asm\"\r\n%include \"x86util.asm\"\r\n\r\nSECTION_RODATA 32\r\nconst tab_Tm,    db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6\r\n                 db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10\r\n                 db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14\r\n\r\nconst interp4_vpp_shuf, times 2 db 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15\r\n\r\nconst interp_vert_shuf, times 2 db 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9\r\n                        times 2 db 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13\r\n\r\nconst interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4\r\n                         dd 2, 3, 3, 4, 4, 5, 5, 
6\r\n\r\nconst pb_8tap_hps_0, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8\r\n                     times 2 db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10\r\n                     times 2 db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12\r\n                     times 2 db 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12,12,13,13,14\r\n\r\nconst tab_Lm,    db 0, 1, 2, 3, 4,  5,  6,  7,  1, 2, 3, 4,  5,  6,  7,  8\r\n                 db 2, 3, 4, 5, 6,  7,  8,  9,  3, 4, 5, 6,  7,  8,  9,  10\r\n                 db 4, 5, 6, 7, 8,  9,  10, 11, 5, 6, 7, 8,  9,  10, 11, 12\r\n                 db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14\r\n\r\nconst tab_Vm,    db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1\r\n                 db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3\r\n\r\nconst tab_Cm,    db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3\r\n\r\nconst pd_526336, times 8 dd 8192*64+2048\r\n\r\nconst tab_ChromaCoeff, db  0, 64,  0,  0\r\n                       db -2, 58, 10, -2\r\n                       db -4, 54, 16, -2\r\n                       db -6, 46, 28, -4\r\n                       db -4, 36, 36, -4\r\n                       db -4, 28, 46, -6\r\n                       db -2, 16, 54, -4\r\n                       db -2, 10, 58, -2\r\n\r\nconst tabw_ChromaCoeff, dw  0, 64,  0,  0\r\n                        dw -2, 58, 10, -2\r\n                        dw -4, 54, 16, -2\r\n                        dw -6, 46, 28, -4\r\n                        dw -4, 36, 36, -4\r\n                        dw -4, 28, 46, -6\r\n                        dw -2, 16, 54, -4\r\n                        dw -2, 10, 58, -2\r\n\r\nconst tab_ChromaCoeff_V, times 8 db 0, 64\r\n                         times 8 db 0,  0\r\n\r\n                         times 8 db -2, 58\r\n                         times 8 db 10, -2\r\n\r\n                         times 8 db -4, 54\r\n                         times 8 db 16, -2\r\n\r\n                         times 8 db -6, 46\r\n                      
   times 8 db 28, -4\r\n\r\n                         times 8 db -4, 36\r\n                         times 8 db 36, -4\r\n\r\n                         times 8 db -4, 28\r\n                         times 8 db 46, -6\r\n\r\n                         times 8 db -2, 16\r\n                         times 8 db 54, -4\r\n\r\n                         times 8 db -2, 10\r\n                         times 8 db 58, -2\r\n\r\nconst tab_ChromaCoeffV, times 4 dw 0, 64\r\n                        times 4 dw 0, 0\r\n\r\n                        times 4 dw -2, 58\r\n                        times 4 dw 10, -2\r\n\r\n                        times 4 dw -4, 54\r\n                        times 4 dw 16, -2\r\n\r\n                        times 4 dw -6, 46\r\n                        times 4 dw 28, -4\r\n\r\n                        times 4 dw -4, 36\r\n                        times 4 dw 36, -4\r\n\r\n                        times 4 dw -4, 28\r\n                        times 4 dw 46, -6\r\n\r\n                        times 4 dw -2, 16\r\n                        times 4 dw 54, -4\r\n\r\n                        times 4 dw -2, 10\r\n                        times 4 dw 58, -2\r\n\r\nconst pw_ChromaCoeffV,  times 8 dw 0, 64\r\n                        times 8 dw 0, 0\r\n\r\n                        times 8 dw -2, 58\r\n                        times 8 dw 10, -2\r\n\r\n                        times 8 dw -4, 54\r\n                        times 8 dw 16, -2\r\n\r\n                        times 8 dw -6, 46\r\n                        times 8 dw 28, -4\r\n\r\n                        times 8 dw -4, 36\r\n                        times 8 dw 36, -4\r\n\r\n                        times 8 dw -4, 28\r\n                        times 8 dw 46, -6\r\n\r\n                        times 8 dw -2, 16\r\n                        times 8 dw 54, -4\r\n\r\n                        times 8 dw -2, 10\r\n                        times 8 dw 58, -2\r\n\r\nconst tab_LumaCoeff,   db   0, 0,  0,  64,  0,   0,  0,  0\r\n                       db  
-1, 4, -10, 58,  17, -5,  1,  0\r\n                       db  -1, 4, -11, 40,  40, -11, 4, -1\r\n                       db   0, 1, -5,  17,  58, -10, 4, -1\r\n\r\nconst tabw_LumaCoeff,  dw   0, 0,  0,  64,  0,   0,  0,  0\r\n                       dw  -1, 4, -10, 58,  17, -5,  1,  0\r\n                       dw  -1, 4, -11, 40,  40, -11, 4, -1\r\n                       dw   0, 1, -5,  17,  58, -10, 4, -1\r\n\r\nconst tab_LumaCoeffV,   times 4 dw 0, 0\r\n                        times 4 dw 0, 64\r\n                        times 4 dw 0, 0\r\n                        times 4 dw 0, 0\r\n\r\n                        times 4 dw -1, 4\r\n                        times 4 dw -10, 58\r\n                        times 4 dw 17, -5\r\n                        times 4 dw 1, 0\r\n\r\n                        times 4 dw -1, 4\r\n                        times 4 dw -11, 40\r\n                        times 4 dw 40, -11\r\n                        times 4 dw 4, -1\r\n\r\n                        times 4 dw 0, 1\r\n                        times 4 dw -5, 17\r\n                        times 4 dw 58, -10\r\n                        times 4 dw 4, -1\r\n\r\nconst pw_LumaCoeffVer,  times 8 dw 0, 0\r\n                        times 8 dw 0, 64\r\n                        times 8 dw 0, 0\r\n                        times 8 dw 0, 0\r\n\r\n                        times 8 dw -1, 4\r\n                        times 8 dw -10, 58\r\n                        times 8 dw 17, -5\r\n                        times 8 dw 1, 0\r\n\r\n                        times 8 dw -1, 4\r\n                        times 8 dw -11, 40\r\n                        times 8 dw 40, -11\r\n                        times 8 dw 4, -1\r\n\r\n                        times 8 dw 0, 1\r\n                        times 8 dw -5, 17\r\n                        times 8 dw 58, -10\r\n                        times 8 dw 4, -1\r\n\r\nconst pb_LumaCoeffVer,  times 16 db 0, 0\r\n                        times 16 db 0, 64\r\n                        times 16 db 0, 0\r\n  
                      times 16 db 0, 0\r\n\r\n                        times 16 db -1, 4\r\n                        times 16 db -10, 58\r\n                        times 16 db 17, -5\r\n                        times 16 db 1, 0\r\n\r\n                        times 16 db -1, 4\r\n                        times 16 db -11, 40\r\n                        times 16 db 40, -11\r\n                        times 16 db 4, -1\r\n\r\n                        times 16 db 0, 1\r\n                        times 16 db -5, 17\r\n                        times 16 db 58, -10\r\n                        times 16 db 4, -1\r\n\r\nconst tab_LumaCoeffVer, times 8 db 0, 0\r\n                        times 8 db 0, 64\r\n                        times 8 db 0, 0\r\n                        times 8 db 0, 0\r\n\r\n                        times 8 db -1, 4\r\n                        times 8 db -10, 58\r\n                        times 8 db 17, -5\r\n                        times 8 db 1, 0\r\n\r\n                        times 8 db -1, 4\r\n                        times 8 db -11, 40\r\n                        times 8 db 40, -11\r\n                        times 8 db 4, -1\r\n\r\n                        times 8 db 0, 1\r\n                        times 8 db -5, 17\r\n                        times 8 db 58, -10\r\n                        times 8 db 4, -1\r\n\r\nconst tab_LumaCoeffVer_32,  times 16 db 0, 0\r\n                            times 16 db 0, 64\r\n                            times 16 db 0, 0\r\n                            times 16 db 0, 0\r\n\r\n                            times 16 db -1, 4\r\n                            times 16 db -10, 58\r\n                            times 16 db 17, -5\r\n                            times 16 db 1, 0\r\n\r\n                            times 16 db -1, 4\r\n                            times 16 db -11, 40\r\n                            times 16 db 40, -11\r\n                            times 16 db 4, -1\r\n\r\n                            times 16 db 0, 1\r\n                  
          times 16 db -5, 17\r\n                            times 16 db 58, -10\r\n                            times 16 db 4, -1\r\n\r\nconst tab_ChromaCoeffVer_32,    times 16 db 0, 64\r\n                                times 16 db 0, 0\r\n\r\n                                times 16 db -2, 58\r\n                                times 16 db 10, -2\r\n\r\n                                times 16 db -4, 54\r\n                                times 16 db 16, -2\r\n\r\n                                times 16 db -6, 46\r\n                                times 16 db 28, -4\r\n\r\n                                times 16 db -4, 36\r\n                                times 16 db 36, -4\r\n\r\n                                times 16 db -4, 28\r\n                                times 16 db 46, -6\r\n\r\n                                times 16 db -2, 16\r\n                                times 16 db 54, -4\r\n\r\n                                times 16 db -2, 10\r\n                                times 16 db 58, -2\r\n\r\nconst tab_c_64_n64, times 8 db 64, -64\r\n\r\nconst interp4_shuf, times 2 db 0, 1, 8, 9, 4, 5, 12, 13, 2, 3, 10, 11, 6, 7, 14, 15\r\n\r\nconst interp4_horiz_shuf1,  db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6\r\n                            db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14\r\n\r\nconst interp4_hpp_shuf,     times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12\r\n\r\nconst interp8_hps_shuf,     dd 0, 4, 1, 5, 2, 6, 3, 7\r\n\r\nALIGN 32\r\ninterp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12\r\n\r\nSECTION .text\r\n\r\ncextern pb_128\r\ncextern pw_1\r\ncextern pw_32\r\ncextern pw_512\r\ncextern pw_2000\r\ncextern pw_8192\r\n\r\n%macro FILTER_H4_w2_2_sse2 0\r\n    pxor        m3, m3\r\n    movd        m0, [srcq - 1]\r\n    movd        m2, [srcq]\r\n    punpckldq   m0, m2\r\n    punpcklbw   m0, m3\r\n    movd        m1, [srcq + srcstrideq - 1]\r\n    movd        m2, [srcq + 
srcstrideq]\r\n    punpckldq   m1, m2\r\n    punpcklbw   m1, m3\r\n    pmaddwd     m0, m4\r\n    pmaddwd     m1, m4\r\n    packssdw    m0, m1\r\n    pshuflw     m1, m0, q2301\r\n    pshufhw     m1, m1, q2301\r\n    paddw       m0, m1\r\n    psrld       m0, 16\r\n    packssdw    m0, m0\r\n    paddw       m0, m5\r\n    psraw       m0, 6\r\n    packuswb    m0, m0\r\n    movd        r4, m0\r\n    mov         [dstq], r4w\r\n    shr         r4, 16\r\n    mov         [dstq + dststrideq], r4w\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_2xN(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_H4_W2xN_sse3 1\r\nINIT_XMM sse3\r\ncglobal interp_4tap_horiz_pp_2x%1, 4, 6, 6, src, srcstride, dst, dststride\r\n    mov         r4d,    r4m\r\n    mova        m5,     [pw_32]\r\n\r\n%ifdef PIC\r\n    lea         r5,     [tabw_ChromaCoeff]\r\n    movddup     m4,     [r5 + r4 * 8]\r\n%else\r\n    movddup     m4,     [tabw_ChromaCoeff + r4 * 8]\r\n%endif\r\n\r\n%assign x 1\r\n%rep %1/2\r\n    FILTER_H4_w2_2_sse2\r\n%if x < %1/2\r\n    lea         srcq,   [srcq + srcstrideq * 2]\r\n    lea         dstq,   [dstq + dststrideq * 2]\r\n%endif\r\n%assign x x+1\r\n%endrep\r\n\r\n    RET\r\n\r\n%endmacro\r\n\r\n    FILTER_H4_W2xN_sse3 4\r\n    FILTER_H4_W2xN_sse3 8\r\n    FILTER_H4_W2xN_sse3 16\r\n\r\n%macro FILTER_H4_w4_2_sse2 0\r\n    pxor        m5, m5\r\n    movd        m0, [srcq - 1]\r\n    movd        m6, [srcq]\r\n    punpckldq   m0, m6\r\n    punpcklbw   m0, m5\r\n    movd        m1, [srcq + 1]\r\n    movd        m6, [srcq + 2]\r\n    punpckldq   m1, m6\r\n    punpcklbw   m1, m5\r\n    movd        m2, [srcq + srcstrideq - 1]\r\n    movd        m6, [srcq + srcstrideq]\r\n    punpckldq   m2, m6\r\n    punpcklbw   m2, m5\r\n    movd        m3, [srcq + srcstrideq + 1]\r\n    movd       
 m6, [srcq + srcstrideq + 2]\r\n    punpckldq   m3, m6\r\n    punpcklbw   m3, m5\r\n    pmaddwd     m0, m4\r\n    pmaddwd     m1, m4\r\n    pmaddwd     m2, m4\r\n    pmaddwd     m3, m4\r\n    packssdw    m0, m1\r\n    packssdw    m2, m3\r\n    pshuflw     m1, m0, q2301\r\n    pshufhw     m1, m1, q2301\r\n    pshuflw     m3, m2, q2301\r\n    pshufhw     m3, m3, q2301\r\n    paddw       m0, m1\r\n    paddw       m2, m3\r\n    psrld       m0, 16\r\n    psrld       m2, 16\r\n    packssdw    m0, m2\r\n    paddw       m0, m7\r\n    psraw       m0, 6\r\n    packuswb    m0, m2\r\n    movd        [dstq], m0\r\n    psrldq      m0, 4\r\n    movd        [dstq + dststrideq], m0\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_H4_W4xN_sse3 1\r\nINIT_XMM sse3\r\ncglobal interp_4tap_horiz_pp_4x%1, 4, 6, 8, src, srcstride, dst, dststride\r\n    mov         r4d,    r4m\r\n    mova        m7,     [pw_32]\r\n\r\n%ifdef PIC\r\n    lea         r5,     [tabw_ChromaCoeff]\r\n    movddup     m4,     [r5 + r4 * 8]\r\n%else\r\n    movddup     m4,     [tabw_ChromaCoeff + r4 * 8]\r\n%endif\r\n\r\n%assign x 1\r\n%rep %1/2\r\n    FILTER_H4_w4_2_sse2\r\n%if x < %1/2\r\n    lea         srcq,   [srcq + srcstrideq * 2]\r\n    lea         dstq,   [dstq + dststrideq * 2]\r\n%endif\r\n%assign x x+1\r\n%endrep\r\n\r\n    RET\r\n\r\n%endmacro\r\n\r\n    FILTER_H4_W4xN_sse3 2\r\n    FILTER_H4_W4xN_sse3 4\r\n    FILTER_H4_W4xN_sse3 8\r\n    FILTER_H4_W4xN_sse3 16\r\n    FILTER_H4_W4xN_sse3 32\r\n\r\n%macro FILTER_H4_w6_sse2 0\r\n    pxor        m4, m4\r\n    movh        m0, [srcq - 1]\r\n    movh        m5, [srcq]\r\n    punpckldq   m0, m5\r\n    movhlps     m2, m0\r\n    punpcklbw   m0, m4\r\n    punpcklbw   m2, m4\r\n    movd        m1, [srcq + 
1]\r\n    movd        m5, [srcq + 2]\r\n    punpckldq   m1, m5\r\n    punpcklbw   m1, m4\r\n    pmaddwd     m0, m6\r\n    pmaddwd     m1, m6\r\n    pmaddwd     m2, m6\r\n    packssdw    m0, m1\r\n    packssdw    m2, m2\r\n    pshuflw     m1, m0, q2301\r\n    pshufhw     m1, m1, q2301\r\n    pshuflw     m3, m2, q2301\r\n    paddw       m0, m1\r\n    paddw       m2, m3\r\n    psrld       m0, 16\r\n    psrld       m2, 16\r\n    packssdw    m0, m2\r\n    paddw       m0, m7\r\n    psraw       m0, 6\r\n    packuswb    m0, m0\r\n    movd        [dstq], m0\r\n    pextrw      r4d, m0, 2\r\n    mov         [dstq + 4], r4w\r\n%endmacro\r\n\r\n%macro FILH4W8_sse2 1\r\n    movh        m0, [srcq - 1 + %1]\r\n    movh        m5, [srcq + %1]\r\n    punpckldq   m0, m5\r\n    movhlps     m2, m0\r\n    punpcklbw   m0, m4\r\n    punpcklbw   m2, m4\r\n    movh        m1, [srcq + 1 + %1]\r\n    movh        m5, [srcq + 2 + %1]\r\n    punpckldq   m1, m5\r\n    movhlps     m3, m1\r\n    punpcklbw   m1, m4\r\n    punpcklbw   m3, m4\r\n    pmaddwd     m0, m6\r\n    pmaddwd     m1, m6\r\n    pmaddwd     m2, m6\r\n    pmaddwd     m3, m6\r\n    packssdw    m0, m1\r\n    packssdw    m2, m3\r\n    pshuflw     m1, m0, q2301\r\n    pshufhw     m1, m1, q2301\r\n    pshuflw     m3, m2, q2301\r\n    pshufhw     m3, m3, q2301\r\n    paddw       m0, m1\r\n    paddw       m2, m3\r\n    psrld       m0, 16\r\n    psrld       m2, 16\r\n    packssdw    m0, m2\r\n    paddw       m0, m7\r\n    psraw       m0, 6\r\n    packuswb    m0, m0\r\n    movh        [dstq + %1], m0\r\n%endmacro\r\n\r\n%macro FILTER_H4_w8_sse2 0\r\n    FILH4W8_sse2 0\r\n%endmacro\r\n\r\n%macro FILTER_H4_w12_sse2 0\r\n    FILH4W8_sse2 0\r\n    movd        m1, [srcq - 1 + 8]\r\n    movd        m3, [srcq + 8]\r\n    punpckldq   m1, m3\r\n    punpcklbw   m1, m4\r\n    movd        m2, [srcq + 1 + 8]\r\n    movd        m3, [srcq + 2 + 8]\r\n    punpckldq   m2, m3\r\n    punpcklbw   m2, m4\r\n    pmaddwd     m1, m6\r\n    pmaddwd     m2, m6\r\n  
  packssdw    m1, m2\r\n    pshuflw     m2, m1, q2301\r\n    pshufhw     m2, m2, q2301\r\n    paddw       m1, m2\r\n    psrld       m1, 16\r\n    packssdw    m1, m1\r\n    paddw       m1, m7\r\n    psraw       m1, 6\r\n    packuswb    m1, m1\r\n    movd        [dstq + 8], m1\r\n%endmacro\r\n\r\n%macro FILTER_H4_w16_sse2 0\r\n    FILH4W8_sse2 0\r\n    FILH4W8_sse2 8\r\n%endmacro\r\n\r\n%macro FILTER_H4_w24_sse2 0\r\n    FILH4W8_sse2 0\r\n    FILH4W8_sse2 8\r\n    FILH4W8_sse2 16\r\n%endmacro\r\n\r\n%macro FILTER_H4_w32_sse2 0\r\n    FILH4W8_sse2 0\r\n    FILH4W8_sse2 8\r\n    FILH4W8_sse2 16\r\n    FILH4W8_sse2 24\r\n%endmacro\r\n\r\n%macro FILTER_H4_w48_sse2 0\r\n    FILH4W8_sse2 0\r\n    FILH4W8_sse2 8\r\n    FILH4W8_sse2 16\r\n    FILH4W8_sse2 24\r\n    FILH4W8_sse2 32\r\n    FILH4W8_sse2 40\r\n%endmacro\r\n\r\n%macro FILTER_H4_w64_sse2 0\r\n    FILH4W8_sse2 0\r\n    FILH4W8_sse2 8\r\n    FILH4W8_sse2 16\r\n    FILH4W8_sse2 24\r\n    FILH4W8_sse2 32\r\n    FILH4W8_sse2 40\r\n    FILH4W8_sse2 48\r\n    FILH4W8_sse2 56\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_sse3 2\r\nINIT_XMM sse3\r\ncglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride\r\n    mov         r4d,        r4m\r\n    mova        m7,         [pw_32]\r\n    pxor        m4,         m4\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tabw_ChromaCoeff]\r\n    movddup     m6,       [r5 + r4 * 8]\r\n%else\r\n    movddup     m6,       [tabw_ChromaCoeff + r4 * 8]\r\n%endif\r\n\r\n%assign x 1\r\n%rep %2\r\n    FILTER_H4_w%1_sse2\r\n%if x < %2\r\n    add         srcq,        srcstrideq\r\n    add         dstq,        dststrideq\r\n%endif\r\n%assign x x+1\r\n%endrep\r\n\r\n    RET\r\n\r\n%endmacro\r\n\r\n    
IPFILTER_CHROMA_sse3 6,   8\r\n    IPFILTER_CHROMA_sse3 8,   2\r\n    IPFILTER_CHROMA_sse3 8,   4\r\n    IPFILTER_CHROMA_sse3 8,   6\r\n    IPFILTER_CHROMA_sse3 8,   8\r\n    IPFILTER_CHROMA_sse3 8,  16\r\n    IPFILTER_CHROMA_sse3 8,  32\r\n    IPFILTER_CHROMA_sse3 12, 16\r\n\r\n    IPFILTER_CHROMA_sse3 6,  16\r\n    IPFILTER_CHROMA_sse3 8,  12\r\n    IPFILTER_CHROMA_sse3 8,  64\r\n    IPFILTER_CHROMA_sse3 12, 32\r\n\r\n    IPFILTER_CHROMA_sse3 16,  4\r\n    IPFILTER_CHROMA_sse3 16,  8\r\n    IPFILTER_CHROMA_sse3 16, 12\r\n    IPFILTER_CHROMA_sse3 16, 16\r\n    IPFILTER_CHROMA_sse3 16, 32\r\n    IPFILTER_CHROMA_sse3 32,  8\r\n    IPFILTER_CHROMA_sse3 32, 16\r\n    IPFILTER_CHROMA_sse3 32, 24\r\n    IPFILTER_CHROMA_sse3 24, 32\r\n    IPFILTER_CHROMA_sse3 32, 32\r\n\r\n    IPFILTER_CHROMA_sse3 16, 24\r\n    IPFILTER_CHROMA_sse3 16, 64\r\n    IPFILTER_CHROMA_sse3 32, 48\r\n    IPFILTER_CHROMA_sse3 24, 64\r\n    IPFILTER_CHROMA_sse3 32, 64\r\n\r\n    IPFILTER_CHROMA_sse3 64, 64\r\n    IPFILTER_CHROMA_sse3 64, 32\r\n    IPFILTER_CHROMA_sse3 64, 48\r\n    IPFILTER_CHROMA_sse3 48, 64\r\n    IPFILTER_CHROMA_sse3 64, 16\r\n\r\n%macro FILTER_2 2\r\n    movd        m3,     [srcq + %1]\r\n    movd        m4,     [srcq + 1 + %1]\r\n    punpckldq   m3,     m4\r\n    punpcklbw   m3,     m0\r\n    pmaddwd     m3,     m1\r\n    packssdw    m3,     m3\r\n    pshuflw     m4,     m3, q2301\r\n    paddw       m3,     m4\r\n    psrldq      m3,     2\r\n    psubw       m3,     m2\r\n    movd        [dstq + %2], m3\r\n%endmacro\r\n\r\n%macro FILTER_4 2\r\n    movd        m3,     [srcq + %1]\r\n    movd        m4,     [srcq + 1 + %1]\r\n    punpckldq   m3,     m4\r\n    punpcklbw   m3,     m0\r\n    pmaddwd     m3,     m1\r\n    movd        m4,     [srcq + 2 + %1]\r\n    movd        m5,     [srcq + 3 + %1]\r\n    punpckldq   m4,     m5\r\n    punpcklbw   m4,     m0\r\n    pmaddwd     m4,     m1\r\n    packssdw    m3,     m4\r\n    pshuflw     m4,     m3, q2301\r\n    pshufhw     m4,     
m4, q2301\r\n    paddw       m3,     m4\r\n    psrldq      m3,     2\r\n    pshufd      m3,     m3,     q3120\r\n    psubw       m3,     m2\r\n    movh        [dstq + %2], m3\r\n%endmacro\r\n\r\n%macro FILTER_4TAP_HPS_sse3 2\r\nINIT_XMM sse3\r\ncglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride\r\n    mov         r4d,    r4m\r\n    add         dststrided, dststrided\r\n    mova        m2,     [pw_2000]\r\n    pxor        m0,     m0\r\n\r\n%ifdef PIC\r\n    lea         r6,     [tabw_ChromaCoeff]\r\n    movddup     m1,     [r6 + r4 * 8]\r\n%else\r\n    movddup     m1,     [tabw_ChromaCoeff + r4 * 8]\r\n%endif\r\n\r\n    mov        r4d,     %2\r\n    cmp        r5m,     byte 0\r\n    je         .loopH\r\n    sub        srcq,    srcstrideq\r\n    add        r4d,     3\r\n\r\n.loopH:\r\n%assign x -1\r\n%assign y 0\r\n%rep %1/4\r\n    FILTER_4 x,y\r\n%assign x x+4\r\n%assign y y+8\r\n%endrep\r\n%rep (%1 % 4)/2\r\n    FILTER_2 x,y\r\n%endrep\r\n    add         srcq,   srcstrideq\r\n    add         dstq,   dststrideq\r\n\r\n    dec         r4d\r\n    jnz         .loopH\r\n    RET\r\n\r\n%endmacro\r\n\r\n    FILTER_4TAP_HPS_sse3 2, 4\r\n    FILTER_4TAP_HPS_sse3 2, 8\r\n    FILTER_4TAP_HPS_sse3 2, 16\r\n    FILTER_4TAP_HPS_sse3 4, 2\r\n    FILTER_4TAP_HPS_sse3 4, 4\r\n    FILTER_4TAP_HPS_sse3 4, 8\r\n    FILTER_4TAP_HPS_sse3 4, 16\r\n    FILTER_4TAP_HPS_sse3 4, 32\r\n    FILTER_4TAP_HPS_sse3 6, 8\r\n    FILTER_4TAP_HPS_sse3 6, 16\r\n    FILTER_4TAP_HPS_sse3 8, 2\r\n    FILTER_4TAP_HPS_sse3 8, 4\r\n    FILTER_4TAP_HPS_sse3 8, 6\r\n    FILTER_4TAP_HPS_sse3 8, 8\r\n    FILTER_4TAP_HPS_sse3 8, 12\r\n    FILTER_4TAP_HPS_sse3 8, 16\r\n    FILTER_4TAP_HPS_sse3 8, 32\r\n    FILTER_4TAP_HPS_sse3 8, 64\r\n    FILTER_4TAP_HPS_sse3 12, 16\r\n    FILTER_4TAP_HPS_sse3 12, 32\r\n    FILTER_4TAP_HPS_sse3 16, 4\r\n    FILTER_4TAP_HPS_sse3 16, 8\r\n    FILTER_4TAP_HPS_sse3 16, 12\r\n    FILTER_4TAP_HPS_sse3 16, 16\r\n    FILTER_4TAP_HPS_sse3 16, 24\r\n    
FILTER_4TAP_HPS_sse3 16, 32\r\n    FILTER_4TAP_HPS_sse3 16, 64\r\n    FILTER_4TAP_HPS_sse3 24, 32\r\n    FILTER_4TAP_HPS_sse3 24, 64\r\n    FILTER_4TAP_HPS_sse3 32,  8\r\n    FILTER_4TAP_HPS_sse3 32, 16\r\n    FILTER_4TAP_HPS_sse3 32, 24\r\n    FILTER_4TAP_HPS_sse3 32, 32\r\n    FILTER_4TAP_HPS_sse3 32, 48\r\n    FILTER_4TAP_HPS_sse3 32, 64\r\n    FILTER_4TAP_HPS_sse3 48, 64\r\n    FILTER_4TAP_HPS_sse3 64, 16\r\n    FILTER_4TAP_HPS_sse3 64, 32\r\n    FILTER_4TAP_HPS_sse3 64, 48\r\n    FILTER_4TAP_HPS_sse3 64, 64\r\n\r\n%macro FILTER_H8_W8_sse2 0\r\n    movh        m1, [r0 + x - 3]\r\n    movh        m4, [r0 + x - 2]\r\n    punpcklbw   m1, m6\r\n    punpcklbw   m4, m6\r\n    movh        m5, [r0 + x - 1]\r\n    movh        m0, [r0 + x]\r\n    punpcklbw   m5, m6\r\n    punpcklbw   m0, m6\r\n    pmaddwd     m1, m3\r\n    pmaddwd     m4, m3\r\n    pmaddwd     m5, m3\r\n    pmaddwd     m0, m3\r\n    packssdw    m1, m4\r\n    packssdw    m5, m0\r\n    pshuflw     m4, m1, q2301\r\n    pshufhw     m4, m4, q2301\r\n    pshuflw     m0, m5, q2301\r\n    pshufhw     m0, m0, q2301\r\n    paddw       m1, m4\r\n    paddw       m5, m0\r\n    psrldq      m1, 2\r\n    psrldq      m5, 2\r\n    pshufd      m1, m1, q3120\r\n    pshufd      m5, m5, q3120\r\n    punpcklqdq  m1, m5\r\n    movh        m7, [r0 + x + 1]\r\n    movh        m4, [r0 + x + 2]\r\n    punpcklbw   m7, m6\r\n    punpcklbw   m4, m6\r\n    movh        m5, [r0 + x + 3]\r\n    movh        m0, [r0 + x + 4]\r\n    punpcklbw   m5, m6\r\n    punpcklbw   m0, m6\r\n    pmaddwd     m7, m3\r\n    pmaddwd     m4, m3\r\n    pmaddwd     m5, m3\r\n    pmaddwd     m0, m3\r\n    packssdw    m7, m4\r\n    packssdw    m5, m0\r\n    pshuflw     m4, m7, q2301\r\n    pshufhw     m4, m4, q2301\r\n    pshuflw     m0, m5, q2301\r\n    pshufhw     m0, m0, q2301\r\n    paddw       m7, m4\r\n    paddw       m5, m0\r\n    psrldq      m7, 2\r\n    psrldq      m5, 2\r\n    pshufd      m7, m7, q3120\r\n    pshufd      m5, m5, q3120\r\n    punpcklqdq 
 m7, m5\r\n    pshuflw     m4, m1, q2301\r\n    pshufhw     m4, m4, q2301\r\n    pshuflw     m0, m7, q2301\r\n    pshufhw     m0, m0, q2301\r\n    paddw       m1, m4\r\n    paddw       m7, m0\r\n    psrldq      m1, 2\r\n    psrldq      m7, 2\r\n    pshufd      m1, m1, q3120\r\n    pshufd      m7, m7, q3120\r\n    punpcklqdq  m1, m7\r\n%endmacro\r\n\r\n%macro FILTER_H8_W4_sse2 0\r\n    movh        m1, [r0 + x - 3]\r\n    movh        m0, [r0 + x - 2]\r\n    punpcklbw   m1, m6\r\n    punpcklbw   m0, m6\r\n    movh        m4, [r0 + x - 1]\r\n    movh        m5, [r0 + x]\r\n    punpcklbw   m4, m6\r\n    punpcklbw   m5, m6\r\n    pmaddwd     m1, m3\r\n    pmaddwd     m0, m3\r\n    pmaddwd     m4, m3\r\n    pmaddwd     m5, m3\r\n    packssdw    m1, m0\r\n    packssdw    m4, m5\r\n    pshuflw     m0, m1, q2301\r\n    pshufhw     m0, m0, q2301\r\n    pshuflw     m5, m4, q2301\r\n    pshufhw     m5, m5, q2301\r\n    paddw       m1, m0\r\n    paddw       m4, m5\r\n    psrldq      m1, 2\r\n    psrldq      m4, 2\r\n    pshufd      m1, m1, q3120\r\n    pshufd      m4, m4, q3120\r\n    punpcklqdq  m1, m4\r\n    pshuflw     m0, m1, q2301\r\n    pshufhw     m0, m0, q2301\r\n    paddw       m1, m0\r\n    psrldq      m1, 2\r\n    pshufd      m1, m1, q3120\r\n%endmacro\r\n\r\n;----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;----------------------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_LUMA_sse2 3\r\nINIT_XMM sse2\r\ncglobal interp_8tap_horiz_%3_%1x%2, 4,6,8\r\n    mov       r4d, r4m\r\n    add       r4d, r4d\r\n    pxor      m6, m6\r\n\r\n%ifidn %3, ps\r\n    add       r3d, r3d\r\n    cmp       r5m, byte 0\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea       r5, [tabw_LumaCoeff]\r\n    movu      m3, [r5 + 
r4 * 8]\r\n%else\r\n    movu      m3, [tabw_LumaCoeff + r4 * 8]\r\n%endif\r\n\r\n    mov       r4d, %2\r\n\r\n%ifidn %3, pp\r\n    mova      m2, [pw_32]\r\n%else\r\n    mova      m2, [pw_2000]\r\n    je        .loopH\r\n    lea       r5, [r1 + 2 * r1]\r\n    sub       r0, r5\r\n    add       r4d, 7\r\n%endif\r\n\r\n.loopH:\r\n%assign x 0\r\n%rep %1 / 8\r\n    FILTER_H8_W8_sse2\r\n  %ifidn %3, pp\r\n    paddw     m1, m2\r\n    psraw     m1, 6\r\n    packuswb  m1, m1\r\n    movh      [r2 + x], m1\r\n  %else\r\n    psubw     m1, m2\r\n    movu      [r2 + 2 * x], m1\r\n  %endif\r\n%assign x x+8\r\n%endrep\r\n\r\n%rep (%1 % 8) / 4\r\n    FILTER_H8_W4_sse2\r\n  %ifidn %3, pp\r\n    paddw     m1, m2\r\n    psraw     m1, 6\r\n    packuswb  m1, m1\r\n    movd      [r2 + x], m1\r\n  %else\r\n    psubw     m1, m2\r\n    movh      [r2 + 2 * x], m1\r\n  %endif\r\n%endrep\r\n\r\n    add       r0, r1\r\n    add       r2, r3\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n    RET\r\n\r\n%endmacro\r\n\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\n    IPFILTER_LUMA_sse2 4, 4, pp\r\n    IPFILTER_LUMA_sse2 4, 8, pp\r\n    IPFILTER_LUMA_sse2 8, 4, pp\r\n    IPFILTER_LUMA_sse2 8, 8, pp\r\n    IPFILTER_LUMA_sse2 16, 16, pp\r\n    IPFILTER_LUMA_sse2 16, 8, pp\r\n    IPFILTER_LUMA_sse2 8, 16, pp\r\n    IPFILTER_LUMA_sse2 16, 12, pp\r\n    IPFILTER_LUMA_sse2 12, 16, pp\r\n    IPFILTER_LUMA_sse2 16, 4, pp\r\n    IPFILTER_LUMA_sse2 4, 16, pp\r\n    IPFILTER_LUMA_sse2 32, 32, pp\r\n    IPFILTER_LUMA_sse2 32, 16, pp\r\n    IPFILTER_LUMA_sse2 16, 32, pp\r\n    IPFILTER_LUMA_sse2 32, 24, pp\r\n    IPFILTER_LUMA_sse2 24, 32, pp\r\n    IPFILTER_LUMA_sse2 32, 8, pp\r\n    IPFILTER_LUMA_sse2 8, 32, 
pp\r\n    IPFILTER_LUMA_sse2 64, 64, pp\r\n    IPFILTER_LUMA_sse2 64, 32, pp\r\n    IPFILTER_LUMA_sse2 32, 64, pp\r\n    IPFILTER_LUMA_sse2 64, 48, pp\r\n    IPFILTER_LUMA_sse2 48, 64, pp\r\n    IPFILTER_LUMA_sse2 64, 16, pp\r\n    IPFILTER_LUMA_sse2 16, 64, pp\r\n\r\n;----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;----------------------------------------------------------------------------------------------------------------------------\r\n    IPFILTER_LUMA_sse2 4, 4, ps\r\n    IPFILTER_LUMA_sse2 8, 8, ps\r\n    IPFILTER_LUMA_sse2 8, 4, ps\r\n    IPFILTER_LUMA_sse2 4, 8, ps\r\n    IPFILTER_LUMA_sse2 16, 16, ps\r\n    IPFILTER_LUMA_sse2 16, 8, ps\r\n    IPFILTER_LUMA_sse2 8, 16, ps\r\n    IPFILTER_LUMA_sse2 16, 12, ps\r\n    IPFILTER_LUMA_sse2 12, 16, ps\r\n    IPFILTER_LUMA_sse2 16, 4, ps\r\n    IPFILTER_LUMA_sse2 4, 16, ps\r\n    IPFILTER_LUMA_sse2 32, 32, ps\r\n    IPFILTER_LUMA_sse2 32, 16, ps\r\n    IPFILTER_LUMA_sse2 16, 32, ps\r\n    IPFILTER_LUMA_sse2 32, 24, ps\r\n    IPFILTER_LUMA_sse2 24, 32, ps\r\n    IPFILTER_LUMA_sse2 32, 8, ps\r\n    IPFILTER_LUMA_sse2 8, 32, ps\r\n    IPFILTER_LUMA_sse2 64, 64, ps\r\n    IPFILTER_LUMA_sse2 64, 32, ps\r\n    IPFILTER_LUMA_sse2 32, 64, ps\r\n    IPFILTER_LUMA_sse2 64, 48, ps\r\n    IPFILTER_LUMA_sse2 48, 64, ps\r\n    IPFILTER_LUMA_sse2 64, 16, ps\r\n    IPFILTER_LUMA_sse2 16, 64, ps\r\n\r\n%macro PROCESS_LUMA_W4_4R_sse2 0\r\n    movd        m2,     [r0]\r\n    movd        m7,     [r0 + r1]\r\n    punpcklbw   m2,     m7                      ; m2=[0 1]\r\n\r\n    lea         r0,     [r0 + 2 * r1]\r\n    movd        m3,     [r0]\r\n    punpcklbw   m7,     m3                      ; m7=[1 2]\r\n    punpcklbw   m2,     m0\r\n    punpcklbw   m7,     m0\r\n    pmaddwd     m2,     [r6 + 0 * 32]\r\n    pmaddwd     m7,   
  [r6 + 0 * 32]\r\n    packssdw    m2,     m7                      ; m2=[0+1 1+2]\r\n\r\n    movd        m7,     [r0 + r1]\r\n    punpcklbw   m3,     m7                      ; m3=[2 3]\r\n    lea         r0,     [r0 + 2 * r1]\r\n    movd        m5,     [r0]\r\n    punpcklbw   m7,     m5                      ; m7=[3 4]\r\n    punpcklbw   m3,     m0\r\n    punpcklbw   m7,     m0\r\n    pmaddwd     m4,     m3,     [r6 + 1 * 32]\r\n    pmaddwd     m6,     m7,     [r6 + 1 * 32]\r\n    packssdw    m4,     m6                      ; m4=[2+3 3+4]\r\n    paddw       m2,     m4                      ; m2=[0+1+2+3 1+2+3+4]                   Row1-2\r\n    pmaddwd     m3,     [r6 + 0 * 32]\r\n    pmaddwd     m7,     [r6 + 0 * 32]\r\n    packssdw    m3,     m7                      ; m3=[2+3 3+4]                           Row3-4\r\n\r\n    movd        m7,     [r0 + r1]\r\n    punpcklbw   m5,     m7                      ; m5=[4 5]\r\n    lea         r0,     [r0 + 2 * r1]\r\n    movd        m4,     [r0]\r\n    punpcklbw   m7,     m4                      ; m7=[5 6]\r\n    punpcklbw   m5,     m0\r\n    punpcklbw   m7,     m0\r\n    pmaddwd     m6,     m5,     [r6 + 2 * 32]\r\n    pmaddwd     m8,     m7,     [r6 + 2 * 32]\r\n    packssdw    m6,     m8                      ; m6=[4+5 5+6]\r\n    paddw       m2,     m6                      ; m2=[0+1+2+3+4+5 1+2+3+4+5+6]           Row1-2\r\n    pmaddwd     m5,     [r6 + 1 * 32]\r\n    pmaddwd     m7,     [r6 + 1 * 32]\r\n    packssdw    m5,     m7                      ; m5=[4+5 5+6]\r\n    paddw       m3,     m5                      ; m3=[2+3+4+5 3+4+5+6]                   Row3-4\r\n\r\n    movd        m7,     [r0 + r1]\r\n    punpcklbw   m4,     m7                      ; m4=[6 7]\r\n    lea         r0,     [r0 + 2 * r1]\r\n    movd        m5,     [r0]\r\n    punpcklbw   m7,     m5                      ; m7=[7 8]\r\n    punpcklbw   m4,     m0\r\n    punpcklbw   m7,     m0\r\n    pmaddwd     m6,     m4,     [r6 + 3 * 32]\r\n    pmaddwd     
m8,     m7,     [r6 + 3 * 32]\r\n    packssdw    m6,     m8                      ; m6=[6+7 7+8]\r\n    paddw       m2,     m6                      ; m2=[0+1+2+3+4+5+6+7 1+2+3+4+5+6+7+8]   Row1-2 end\r\n    pmaddwd     m4,     [r6 + 2 * 32]\r\n    pmaddwd     m7,     [r6 + 2 * 32]\r\n    packssdw    m4,     m7                      ; m4=[6+7 7+8]\r\n    paddw       m3,     m4                      ; m3=[2+3+4+5+6+7 3+4+5+6+7+8]           Row3-4\r\n\r\n    movd        m7,     [r0 + r1]\r\n    punpcklbw   m5,     m7                      ; m5=[8 9]\r\n    movd        m4,     [r0 + 2 * r1]\r\n    punpcklbw   m7,     m4                      ; m7=[9 10]\r\n    punpcklbw   m5,     m0\r\n    punpcklbw   m7,     m0\r\n    pmaddwd     m5,     [r6 + 3 * 32]\r\n    pmaddwd     m7,     [r6 + 3 * 32]\r\n    packssdw    m5,     m7                      ; m5=[8+9 9+10]\r\n    paddw       m3,     m5                      ; m3=[2+3+4+5+6+7+8+9 3+4+5+6+7+8+9+10]  Row3-4 end\r\n%endmacro\r\n\r\n%macro PROCESS_LUMA_W8_4R_sse2 0\r\n    movq        m7,     [r0]\r\n    movq        m6,     [r0 + r1]\r\n    punpcklbw   m7,     m6\r\n    punpcklbw   m2,     m7,     m0\r\n    punpckhbw   m7,     m0\r\n    pmaddwd     m2,     [r6 + 0 * 32]\r\n    pmaddwd     m7,     [r6 + 0 * 32]\r\n    packssdw    m2,     m7                      ; m2=[0+1]               Row1\r\n\r\n    lea         r0,     [r0 + 2 * r1]\r\n    movq        m7,     [r0]\r\n    punpcklbw   m6,     m7\r\n    punpcklbw   m3,     m6,     m0\r\n    punpckhbw   m6,     m0\r\n    pmaddwd     m3,     [r6 + 0 * 32]\r\n    pmaddwd     m6,     [r6 + 0 * 32]\r\n    packssdw    m3,     m6                      ; m3=[1+2]               Row2\r\n\r\n    movq        m6,     [r0 + r1]\r\n    punpcklbw   m7,     m6\r\n    punpckhbw   m8,     m7,     m0\r\n    punpcklbw   m7,     m0\r\n    pmaddwd     m4,     m7,     [r6 + 0 * 32]\r\n    pmaddwd     m9,     m8,     [r6 + 0 * 32]\r\n    packssdw    m4,     m9                      ; m4=[2+3]               
Row3\r\n    pmaddwd     m7,     [r6 + 1 * 32]\r\n    pmaddwd     m8,     [r6 + 1 * 32]\r\n    packssdw    m7,     m8\r\n    paddw       m2,     m7                      ; m2=[0+1+2+3]           Row1\r\n\r\n    lea         r0,     [r0 + 2 * r1]\r\n    movq        m10,    [r0]\r\n    punpcklbw   m6,     m10\r\n    punpckhbw   m8,     m6,     m0\r\n    punpcklbw   m6,     m0\r\n    pmaddwd     m5,     m6,     [r6 + 0 * 32]\r\n    pmaddwd     m9,     m8,     [r6 + 0 * 32]\r\n    packssdw    m5,     m9                      ; m5=[3+4]               Row4\r\n    pmaddwd     m6,     [r6 + 1 * 32]\r\n    pmaddwd     m8,     [r6 + 1 * 32]\r\n    packssdw    m6,     m8\r\n    paddw       m3,     m6                      ; m3 = [1+2+3+4]         Row2\r\n\r\n    movq        m6,     [r0 + r1]\r\n    punpcklbw   m10,    m6\r\n    punpckhbw   m8,     m10,    m0\r\n    punpcklbw   m10,    m0\r\n    pmaddwd     m7,     m10,    [r6 + 1 * 32]\r\n    pmaddwd     m9,     m8,     [r6 + 1 * 32]\r\n    packssdw    m7,     m9\r\n    pmaddwd     m10,    [r6 + 2 * 32]\r\n    pmaddwd     m8,     [r6 + 2 * 32]\r\n    packssdw    m10,    m8\r\n    paddw       m2,     m10                     ; m2=[0+1+2+3+4+5]       Row1\r\n    paddw       m4,     m7                      ; m4=[2+3+4+5]           Row3\r\n\r\n    lea         r0,     [r0 + 2 * r1]\r\n    movq        m10,    [r0]\r\n    punpcklbw   m6,     m10\r\n    punpckhbw   m8,     m6,     m0\r\n    punpcklbw   m6,     m0\r\n    pmaddwd     m7,     m6,     [r6 + 1 * 32]\r\n    pmaddwd     m9,     m8,     [r6 + 1 * 32]\r\n    packssdw    m7,     m9\r\n    pmaddwd     m6,     [r6 + 2 * 32]\r\n    pmaddwd     m8,     [r6 + 2 * 32]\r\n    packssdw    m6,     m8\r\n    paddw       m3,     m6                      ; m3=[1+2+3+4+5+6]       Row2\r\n    paddw       m5,     m7                      ; m5=[3+4+5+6]           Row4\r\n\r\n    movq        m6,     [r0 + r1]\r\n    punpcklbw   m10,    m6\r\n    punpckhbw   m8,     m10,    m0\r\n    punpcklbw   m10,   
 m0\r\n    pmaddwd     m7,     m10,    [r6 + 2 * 32]\r\n    pmaddwd     m9,     m8,     [r6 + 2 * 32]\r\n    packssdw    m7,     m9\r\n    pmaddwd     m10,    [r6 + 3 * 32]\r\n    pmaddwd     m8,     [r6 + 3 * 32]\r\n    packssdw    m10,    m8\r\n    paddw       m2,     m10                     ; m2=[0+1+2+3+4+5+6+7]   Row1 end\r\n    paddw       m4,     m7                      ; m4=[2+3+4+5+6+7]       Row3\r\n\r\n    lea         r0,     [r0 + 2 * r1]\r\n    movq        m10,    [r0]\r\n    punpcklbw   m6,     m10\r\n    punpckhbw   m8,     m6,     m0\r\n    punpcklbw   m6,     m0\r\n    pmaddwd     m7,     m6,     [r6 + 2 * 32]\r\n    pmaddwd     m9,     m8,     [r6 + 2 * 32]\r\n    packssdw    m7,     m9\r\n    pmaddwd     m6,     [r6 + 3 * 32]\r\n    pmaddwd     m8,     [r6 + 3 * 32]\r\n    packssdw    m6,     m8\r\n    paddw       m3,     m6                      ; m3=[1+2+3+4+5+6+7+8]   Row2 end\r\n    paddw       m5,     m7                      ; m5=[3+4+5+6+7+8]       Row4\r\n\r\n    movq        m6,     [r0 + r1]\r\n    punpcklbw   m10,    m6\r\n    punpckhbw   m8,     m10,     m0\r\n    punpcklbw   m10,    m0\r\n    pmaddwd     m8,     [r6 + 3 * 32]\r\n    pmaddwd     m10,    [r6 + 3 * 32]\r\n    packssdw    m10,    m8\r\n    paddw       m4,     m10                     ; m4=[2+3+4+5+6+7+8+9]   Row3 end\r\n\r\n    movq        m10,    [r0 + 2 * r1]\r\n    punpcklbw   m6,     m10\r\n    punpckhbw   m8,     m6,     m0\r\n    punpcklbw   m6,     m0\r\n    pmaddwd     m8,     [r6 + 3 * 32]\r\n    pmaddwd     m6,     [r6 + 3 * 32]\r\n    packssdw    m6,     m8\r\n    paddw       m5,     m6                      ; m5=[3+4+5+6+7+8+9+10]  Row4 end\r\n%endmacro\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_%3_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int 
coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_LUMA_sse2 3\r\nINIT_XMM sse2\r\ncglobal interp_8tap_vert_%3_%1x%2, 5, 8, 11\r\n    lea         r5,     [3 * r1]\r\n    sub         r0,     r5\r\n    shl         r4d,    7\r\n\r\n%ifdef PIC\r\n    lea         r6,     [pw_LumaCoeffVer]\r\n    add         r6,     r4\r\n%else\r\n    lea         r6,     [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n%ifidn %3,pp\r\n    mova        m1,     [pw_32]\r\n%else\r\n    mova        m1,     [pw_2000]\r\n    add         r3d,    r3d\r\n%endif\r\n\r\n    mov         r4d,    %2/4\r\n    lea         r5,     [3 * r3]\r\n    pxor        m0,     m0\r\n\r\n.loopH:\r\n%assign x 0\r\n%rep (%1 / 8)\r\n    PROCESS_LUMA_W8_4R_sse2\r\n\r\n%ifidn %3,pp\r\n    paddw       m2,     m1\r\n    paddw       m3,     m1\r\n    paddw       m4,     m1\r\n    paddw       m5,     m1\r\n    psraw       m2,     6\r\n    psraw       m3,     6\r\n    psraw       m4,     6\r\n    psraw       m5,     6\r\n\r\n    packuswb    m2,     m3\r\n    packuswb    m4,     m5\r\n\r\n    movh        [r2 + x], m2\r\n    movhps      [r2 + r3 + x], m2\r\n    movh        [r2 + 2 * r3 + x], m4\r\n    movhps      [r2 + r5 + x], m4\r\n%else\r\n    psubw       m2,     m1\r\n    psubw       m3,     m1\r\n    psubw       m4,     m1\r\n    psubw       m5,     m1\r\n\r\n    movu        [r2 + (2*x)], m2\r\n    movu        [r2 + r3 + (2*x)], m3\r\n    movu        [r2 + 2 * r3 + (2*x)], m4\r\n    movu        [r2 + r5 + (2*x)], m5\r\n%endif\r\n%assign x x+8\r\n%if %1 > 8\r\n    lea         r7,     [8 * r1 - 8]\r\n    sub         r0,     r7\r\n%endif\r\n%endrep\r\n\r\n%rep (%1 % 8)/4\r\n    PROCESS_LUMA_W4_4R_sse2\r\n\r\n%ifidn %3,pp\r\n    paddw       m2,     m1\r\n    psraw       m2,     6\r\n    paddw       m3,     m1\r\n    psraw       m3,     6\r\n\r\n    packuswb    m2,     m3\r\n\r\n    movd        [r2 + x], m2\r\n    psrldq      m2,     4\r\n    
movd        [r2 + r3 + x], m2\r\n    psrldq      m2,     4\r\n    movd        [r2 + 2 * r3 + x], m2\r\n    psrldq      m2,     4\r\n    movd        [r2 + r5 + x], m2\r\n%else\r\n    psubw       m2,     m1\r\n    psubw       m3,     m1\r\n\r\n    movh        [r2 + (2*x)], m2\r\n    movhps      [r2 + r3 + (2*x)], m2\r\n    movh        [r2 + 2 * r3 + (2*x)], m3\r\n    movhps      [r2 + r5 + (2*x)], m3\r\n%endif\r\n%endrep\r\n\r\n    lea         r2,     [r2 + 4 * r3]\r\n%if %1 <= 8\r\n    lea         r7,     [4 * r1]\r\n    sub         r0,     r7\r\n%elif %1 == 12\r\n    lea         r7,     [4 * r1 + 8]\r\n    sub         r0,     r7\r\n%else\r\n    lea         r0,     [r0 + 4 * r1 - %1]\r\n%endif\r\n\r\n    dec         r4d\r\n    jnz         .loopH\r\n\r\n    RET\r\n\r\n%endmacro\r\n\r\n%if ARCH_X86_64\r\n    FILTER_VER_LUMA_sse2 4, 4, pp\r\n    FILTER_VER_LUMA_sse2 4, 8, pp\r\n    FILTER_VER_LUMA_sse2 4, 16, pp\r\n    FILTER_VER_LUMA_sse2 8, 4, pp\r\n    FILTER_VER_LUMA_sse2 8, 8, pp\r\n    FILTER_VER_LUMA_sse2 8, 16, pp\r\n    FILTER_VER_LUMA_sse2 8, 32, pp\r\n    FILTER_VER_LUMA_sse2 12, 16, pp\r\n    FILTER_VER_LUMA_sse2 16, 4, pp\r\n    FILTER_VER_LUMA_sse2 16, 8, pp\r\n    FILTER_VER_LUMA_sse2 16, 12, pp\r\n    FILTER_VER_LUMA_sse2 16, 16, pp\r\n    FILTER_VER_LUMA_sse2 16, 32, pp\r\n    FILTER_VER_LUMA_sse2 16, 64, pp\r\n    FILTER_VER_LUMA_sse2 24, 32, pp\r\n    FILTER_VER_LUMA_sse2 32, 8, pp\r\n    FILTER_VER_LUMA_sse2 32, 16, pp\r\n    FILTER_VER_LUMA_sse2 32, 24, pp\r\n    FILTER_VER_LUMA_sse2 32, 32, pp\r\n    FILTER_VER_LUMA_sse2 32, 64, pp\r\n    FILTER_VER_LUMA_sse2 48, 64, pp\r\n    FILTER_VER_LUMA_sse2 64, 16, pp\r\n    FILTER_VER_LUMA_sse2 64, 32, pp\r\n    FILTER_VER_LUMA_sse2 64, 48, pp\r\n    FILTER_VER_LUMA_sse2 64, 64, pp\r\n\r\n    FILTER_VER_LUMA_sse2 4, 4, ps\r\n    FILTER_VER_LUMA_sse2 4, 8, ps\r\n    FILTER_VER_LUMA_sse2 4, 16, ps\r\n    FILTER_VER_LUMA_sse2 8, 4, ps\r\n    FILTER_VER_LUMA_sse2 8, 8, ps\r\n    FILTER_VER_LUMA_sse2 8, 16, 
ps\r\n    FILTER_VER_LUMA_sse2 8, 32, ps\r\n    FILTER_VER_LUMA_sse2 12, 16, ps\r\n    FILTER_VER_LUMA_sse2 16, 4, ps\r\n    FILTER_VER_LUMA_sse2 16, 8, ps\r\n    FILTER_VER_LUMA_sse2 16, 12, ps\r\n    FILTER_VER_LUMA_sse2 16, 16, ps\r\n    FILTER_VER_LUMA_sse2 16, 32, ps\r\n    FILTER_VER_LUMA_sse2 16, 64, ps\r\n    FILTER_VER_LUMA_sse2 24, 32, ps\r\n    FILTER_VER_LUMA_sse2 32, 8, ps\r\n    FILTER_VER_LUMA_sse2 32, 16, ps\r\n    FILTER_VER_LUMA_sse2 32, 24, ps\r\n    FILTER_VER_LUMA_sse2 32, 32, ps\r\n    FILTER_VER_LUMA_sse2 32, 64, ps\r\n    FILTER_VER_LUMA_sse2 48, 64, ps\r\n    FILTER_VER_LUMA_sse2 64, 16, ps\r\n    FILTER_VER_LUMA_sse2 64, 32, ps\r\n    FILTER_VER_LUMA_sse2 64, 48, ps\r\n    FILTER_VER_LUMA_sse2 64, 64, ps\r\n%endif\r\n\r\n%macro  WORD_TO_DOUBLE 1\r\n%if ARCH_X86_64\r\n    punpcklbw   %1,     m8\r\n%else\r\n    punpcklbw   %1,     %1\r\n    psrlw       %1,     8\r\n%endif\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_%1_2x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W2_H4_sse2 2\r\nINIT_XMM sse2\r\n%if ARCH_X86_64\r\ncglobal interp_4tap_vert_%1_2x%2, 4, 6, 9\r\n    pxor        m8,        m8\r\n%else\r\ncglobal interp_4tap_vert_%1_2x%2, 4, 6, 8\r\n%endif\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifidn %1,pp\r\n    mova        m1,        [pw_32]\r\n%elifidn %1,ps\r\n    mova        m1,        [pw_2000]\r\n    add         r3d,       r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tabw_ChromaCoeff]\r\n    movh        m0,        [r5 + r4 * 8]\r\n%else\r\n    movh        m0,        [tabw_ChromaCoeff + r4 * 8]\r\n%endif\r\n\r\n    punpcklqdq  m0,        m0\r\n    lea         r5,        [3 * r1]\r\n\r\n%assign x 1\r\n%rep %2/4\r\n    movd        m2,        [r0]\r\n    movd        m3,      
  [r0 + r1]\r\n    movd        m4,        [r0 + 2 * r1]\r\n    movd        m5,        [r0 + r5]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m6,        m4,        m5\r\n    punpcklwd   m2,        m6\r\n\r\n    WORD_TO_DOUBLE         m2\r\n    pmaddwd     m2,        m0\r\n\r\n    lea         r0,        [r0 + 4 * r1]\r\n    movd        m6,        [r0]\r\n\r\n    punpcklbw   m3,        m4\r\n    punpcklbw   m7,        m5,        m6\r\n    punpcklwd   m3,        m7\r\n\r\n    WORD_TO_DOUBLE         m3\r\n    pmaddwd     m3,        m0\r\n\r\n    packssdw    m2,        m3\r\n    pshuflw     m3,        m2,          q2301\r\n    pshufhw     m3,        m3,          q2301\r\n    paddw       m2,        m3\r\n\r\n    movd        m7,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m5\r\n    punpcklbw   m3,        m6,        m7\r\n    punpcklwd   m4,        m3\r\n\r\n    WORD_TO_DOUBLE         m4\r\n    pmaddwd     m4,        m0\r\n\r\n    movd        m3,        [r0 + 2 * r1]\r\n\r\n    punpcklbw   m5,        m6\r\n    punpcklbw   m7,        m3\r\n    punpcklwd   m5,        m7\r\n\r\n    WORD_TO_DOUBLE         m5\r\n    pmaddwd     m5,        m0\r\n\r\n    packssdw    m4,        m5\r\n    pshuflw     m5,        m4,          q2301\r\n    pshufhw     m5,        m5,          q2301\r\n    paddw       m4,        m5\r\n\r\n%ifidn %1,pp\r\n    psrld       m2,        16\r\n    psrld       m4,        16\r\n    packssdw    m2,        m4\r\n    paddw       m2,        m1\r\n    psraw       m2,        6\r\n    packuswb    m2,        m2\r\n\r\n%if ARCH_X86_64\r\n    movq        r4,        m2\r\n    mov         [r2],      r4w\r\n    shr         r4,        16\r\n    mov         [r2 + r3], r4w\r\n    lea         r2,        [r2 + 2 * r3]\r\n    shr         r4,        16\r\n    mov         [r2],      r4w\r\n    shr         r4,        16\r\n    mov         [r2 + r3], r4w\r\n%else\r\n    movd        r4,        m2\r\n    mov         [r2],      r4w\r\n    shr         r4,        16\r\n    
mov         [r2 + r3], r4w\r\n    lea         r2,        [r2 + 2 * r3]\r\n    psrldq      m2,        4\r\n    movd        r4,        m2\r\n    mov         [r2],      r4w\r\n    shr         r4,        16\r\n    mov         [r2 + r3], r4w\r\n%endif\r\n%elifidn %1,ps\r\n    psrldq      m2,        2\r\n    psrldq      m4,        2\r\n    pshufd      m2,        m2, q3120\r\n    pshufd      m4,        m4, q3120\r\n    psubw       m4,        m1\r\n    psubw       m2,        m1\r\n\r\n    movd        [r2],      m2\r\n    psrldq      m2,        4\r\n    movd        [r2 + r3], m2\r\n    lea         r2,        [r2 + 2 * r3]\r\n    movd        [r2],      m4\r\n    psrldq      m4,        4\r\n    movd        [r2 + r3], m4\r\n%endif\r\n\r\n%if x < %2/4\r\n    lea         r2,        [r2 + 2 * r3]\r\n%endif\r\n%assign x x+1\r\n%endrep\r\n    RET\r\n\r\n%endmacro\r\n\r\n    FILTER_V4_W2_H4_sse2 pp, 4\r\n    FILTER_V4_W2_H4_sse2 pp, 8\r\n    FILTER_V4_W2_H4_sse2 pp, 16\r\n\r\n    FILTER_V4_W2_H4_sse2 ps, 4\r\n    FILTER_V4_W2_H4_sse2 ps, 8\r\n    FILTER_V4_W2_H4_sse2 ps, 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_%1_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro  FILTER_V2_W4_H4_sse2 1\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_%1_4x2, 4, 6, 8\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n    pxor        m7,        m7\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tabw_ChromaCoeff]\r\n    movh        m0,        [r5 + r4 * 8]\r\n%else\r\n    movh        m0,        [tabw_ChromaCoeff + r4 * 8]\r\n%endif\r\n\r\n    lea         r5,        [r0 + 2 * r1]\r\n    punpcklqdq  m0,        m0\r\n    movd        m2,        [r0]\r\n    movd        m3,        [r0 + r1]\r\n    movd        m4,        [r5]\r\n    movd        m5,        [r5 + r1]\r\n\r\n    punpcklbw   m2,       
 m3\r\n    punpcklbw   m1,        m4,        m5\r\n    punpcklwd   m2,        m1\r\n\r\n    movhlps     m6,        m2\r\n    punpcklbw   m2,        m7\r\n    punpcklbw   m6,        m7\r\n    pmaddwd     m2,        m0\r\n    pmaddwd     m6,        m0\r\n    packssdw    m2,        m6\r\n\r\n    movd        m1,        [r0 + 4 * r1]\r\n\r\n    punpcklbw   m3,        m4\r\n    punpcklbw   m5,        m1\r\n    punpcklwd   m3,        m5\r\n\r\n    movhlps     m6,        m3\r\n    punpcklbw   m3,        m7\r\n    punpcklbw   m6,        m7\r\n    pmaddwd     m3,        m0\r\n    pmaddwd     m6,        m0\r\n    packssdw    m3,        m6\r\n\r\n    pshuflw     m4,        m2,        q2301\r\n    pshufhw     m4,        m4,        q2301\r\n    paddw       m2,        m4\r\n    pshuflw     m5,        m3,        q2301\r\n    pshufhw     m5,        m5,        q2301\r\n    paddw       m3,        m5\r\n\r\n%ifidn %1, pp\r\n    psrld       m2,        16\r\n    psrld       m3,        16\r\n    packssdw    m2,        m3\r\n\r\n    paddw       m2,        [pw_32]\r\n    psraw       m2,        6\r\n    packuswb    m2,        m2\r\n\r\n    movd        [r2],      m2\r\n    psrldq      m2,        4\r\n    movd        [r2 + r3], m2\r\n%elifidn %1, ps\r\n    psrldq      m2,        2\r\n    psrldq      m3,        2\r\n    pshufd      m2,        m2, q3120\r\n    pshufd      m3,        m3, q3120\r\n    punpcklqdq  m2, m3\r\n\r\n    add         r3d,       r3d\r\n    psubw       m2,        [pw_2000]\r\n    movh        [r2],      m2\r\n    movhps      [r2 + r3], m2\r\n%endif\r\n    RET\r\n\r\n%endmacro\r\n\r\n    FILTER_V2_W4_H4_sse2 pp\r\n    FILTER_V2_W4_H4_sse2 ps\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_%1_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W4_H4_sse2 2\r\nINIT_XMM sse2\r\n%if 
ARCH_X86_64\r\ncglobal interp_4tap_vert_%1_4x%2, 4, 6, 9\r\n    pxor        m8,        m8\r\n%else\r\ncglobal interp_4tap_vert_%1_4x%2, 4, 6, 8\r\n%endif\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tabw_ChromaCoeff]\r\n    movh        m0,        [r5 + r4 * 8]\r\n%else\r\n    movh        m0,        [tabw_ChromaCoeff + r4 * 8]\r\n%endif\r\n\r\n%ifidn %1,pp\r\n    mova        m1,        [pw_32]\r\n%elifidn %1,ps\r\n    add         r3d,       r3d\r\n    mova        m1,        [pw_2000]\r\n%endif\r\n\r\n    lea         r5,        [3 * r1]\r\n    lea         r4,        [3 * r3]\r\n    punpcklqdq  m0,        m0\r\n\r\n%assign x 1\r\n%rep %2/4\r\n    movd        m2,        [r0]\r\n    movd        m3,        [r0 + r1]\r\n    movd        m4,        [r0 + 2 * r1]\r\n    movd        m5,        [r0 + r5]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m6,        m4,        m5\r\n    punpcklwd   m2,        m6\r\n\r\n    movhlps     m6,        m2\r\n    WORD_TO_DOUBLE         m2\r\n    WORD_TO_DOUBLE         m6\r\n    pmaddwd     m2,        m0\r\n    pmaddwd     m6,        m0\r\n    packssdw    m2,        m6\r\n\r\n    lea         r0,        [r0 + 4 * r1]\r\n    movd        m6,        [r0]\r\n\r\n    punpcklbw   m3,        m4\r\n    punpcklbw   m7,        m5,        m6\r\n    punpcklwd   m3,        m7\r\n\r\n    movhlps     m7,        m3\r\n    WORD_TO_DOUBLE         m3\r\n    WORD_TO_DOUBLE         m7\r\n    pmaddwd     m3,        m0\r\n    pmaddwd     m7,        m0\r\n    packssdw    m3,        m7\r\n\r\n    pshuflw     m7,        m2,        q2301\r\n    pshufhw     m7,        m7,        q2301\r\n    paddw       m2,        m7\r\n    pshuflw     m7,        m3,        q2301\r\n    pshufhw     m7,        m7,        q2301\r\n    paddw       m3,        m7\r\n\r\n%ifidn %1,pp\r\n    psrld       m2,        16\r\n    psrld       m3,        16\r\n    packssdw    m2,        m3\r\n    paddw       m2,       
 m1\r\n    psraw       m2,        6\r\n%elifidn %1,ps\r\n    psrldq      m2,        2\r\n    psrldq      m3,        2\r\n    pshufd      m2,        m2, q3120\r\n    pshufd      m3,        m3, q3120\r\n    punpcklqdq  m2,        m3\r\n\r\n    psubw       m2,        m1\r\n    movh        [r2],      m2\r\n    movhps      [r2 + r3], m2\r\n%endif\r\n\r\n    movd        m7,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m5\r\n    punpcklbw   m3,        m6,        m7\r\n    punpcklwd   m4,        m3\r\n\r\n    movhlps     m3,        m4\r\n    WORD_TO_DOUBLE         m4\r\n    WORD_TO_DOUBLE         m3\r\n    pmaddwd     m4,        m0\r\n    pmaddwd     m3,        m0\r\n    packssdw    m4,        m3\r\n\r\n    movd        m3,        [r0 + 2 * r1]\r\n\r\n    punpcklbw   m5,        m6\r\n    punpcklbw   m7,        m3\r\n    punpcklwd   m5,        m7\r\n\r\n    movhlps     m3,        m5\r\n    WORD_TO_DOUBLE         m5\r\n    WORD_TO_DOUBLE         m3\r\n    pmaddwd     m5,        m0\r\n    pmaddwd     m3,        m0\r\n    packssdw    m5,        m3\r\n\r\n    pshuflw     m7,        m4,        q2301\r\n    pshufhw     m7,        m7,        q2301\r\n    paddw       m4,        m7\r\n    pshuflw     m7,        m5,        q2301\r\n    pshufhw     m7,        m7,        q2301\r\n    paddw       m5,        m7\r\n\r\n%ifidn %1,pp\r\n    psrld       m4,        16\r\n    psrld       m5,        16\r\n    packssdw    m4,        m5\r\n\r\n    paddw       m4,        m1\r\n    psraw       m4,        6\r\n    packuswb    m2,        m4\r\n\r\n    movd        [r2],      m2\r\n    psrldq      m2,        4\r\n    movd        [r2 + r3], m2\r\n    psrldq      m2,        4\r\n    movd        [r2 + 2 * r3],      m2\r\n    psrldq      m2,        4\r\n    movd        [r2 + r4], m2\r\n%elifidn %1,ps\r\n    psrldq      m4,        2\r\n    psrldq      m5,        2\r\n    pshufd      m4,        m4, q3120\r\n    pshufd      m5,        m5, q3120\r\n    punpcklqdq  m4,        m5\r\n    psubw       m4,      
  m1\r\n    movh        [r2 + 2 * r3],      m4\r\n    movhps      [r2 + r4], m4\r\n%endif\r\n\r\n%if x < %2/4\r\n    lea         r2,        [r2 + 4 * r3]\r\n%endif\r\n\r\n%assign x x+1\r\n%endrep\r\n    RET\r\n\r\n%endmacro\r\n\r\n    FILTER_V4_W4_H4_sse2 pp, 4\r\n    FILTER_V4_W4_H4_sse2 pp, 8\r\n    FILTER_V4_W4_H4_sse2 pp, 16\r\n    FILTER_V4_W4_H4_sse2 pp, 32\r\n\r\n    FILTER_V4_W4_H4_sse2 ps, 4\r\n    FILTER_V4_W4_H4_sse2 ps, 8\r\n    FILTER_V4_W4_H4_sse2 ps, 16\r\n    FILTER_V4_W4_H4_sse2 ps, 32\r\n\r\n;-----------------------------------------------------------------------------\r\n;void interp_4tap_vert_%1_6x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W6_H4_sse2 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_%1_6x%2, 4, 7, 10\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n    shl         r4d,       5\r\n    pxor        m9,        m9\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeffV]\r\n    mova        m6,        [r5 + r4]\r\n    mova        m5,        [r5 + r4 + 16]\r\n%else\r\n    mova        m6,        [tab_ChromaCoeffV + r4]\r\n    mova        m5,        [tab_ChromaCoeffV + r4 + 16]\r\n%endif\r\n\r\n%ifidn %1,pp\r\n    mova        m4,        [pw_32]\r\n%elifidn %1,ps\r\n    mova        m4,        [pw_2000]\r\n    add         r3d,       r3d\r\n%endif\r\n    lea         r5,        [3 * r1]\r\n\r\n%assign x 1\r\n%rep %2/4\r\n    movq        m0,        [r0]\r\n    movq        m1,        [r0 + r1]\r\n    movq        m2,        [r0 + 2 * r1]\r\n    movq        m3,        [r0 + r5]\r\n\r\n    punpcklbw   m0,        m1\r\n    punpcklbw   m1,        m2\r\n    punpcklbw   m2,        m3\r\n\r\n    movhlps     m7,        m0\r\n    punpcklbw   m0,        m9\r\n    punpcklbw   m7,        m9\r\n    pmaddwd     m0,        m6\r\n    pmaddwd     m7,        m6\r\n    packssdw    m0,        
m7\r\n\r\n    movhlps     m8,        m2\r\n    movq        m7,        m2\r\n    punpcklbw   m8,        m9\r\n    punpcklbw   m7,        m9\r\n    pmaddwd     m8,        m5\r\n    pmaddwd     m7,        m5\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m0,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m0,        m4\r\n    psraw       m0,        6\r\n    packuswb    m0,        m0\r\n\r\n    movd        [r2],      m0\r\n    pextrw      r6d,       m0,        2\r\n    mov         [r2 + 4],  r6w\r\n%elifidn %1,ps\r\n    psubw       m0,        m4\r\n    movh        [r2],      m0\r\n    pshufd      m0,        m0,        2\r\n    movd        [r2 + 8],  m0\r\n%endif\r\n\r\n    lea         r0,        [r0 + 4 * r1]\r\n\r\n    movq        m0,        [r0]\r\n    punpcklbw   m3,        m0\r\n\r\n    movhlps     m8,        m1\r\n    punpcklbw   m1,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m1,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m1,        m8\r\n\r\n    movhlps     m8,        m3\r\n    movq        m7,        m3\r\n    punpcklbw   m8,        m9\r\n    punpcklbw   m7,        m9\r\n    pmaddwd     m8,        m5\r\n    pmaddwd     m7,        m5\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m1,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m1,        m4\r\n    psraw       m1,        6\r\n    packuswb    m1,        m1\r\n\r\n    movd        [r2 + r3], m1\r\n    pextrw      r6d,       m1,        2\r\n    mov         [r2 + r3 + 4], r6w\r\n%elifidn %1,ps\r\n    psubw       m1,        m4\r\n    movh        [r2 + r3], m1\r\n    pshufd      m1,        m1,        2\r\n    movd        [r2 + r3 + 8],  m1\r\n%endif\r\n\r\n    movq        m1,        [r0 + r1]\r\n    punpcklbw   m7,        m0,        m1\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m2,        m8\r\n\r\n    movhlps     m8,  
      m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m2,        m7\r\n    lea         r2,        [r2 + 2 * r3]\r\n\r\n%ifidn %1,pp\r\n    paddw       m2,        m4\r\n    psraw       m2,        6\r\n    packuswb    m2,        m2\r\n    movd        [r2],      m2\r\n    pextrw      r6d,       m2,    2\r\n    mov         [r2 + 4],  r6w\r\n%elifidn %1,ps\r\n    psubw       m2,        m4\r\n    movh        [r2],      m2\r\n    pshufd      m2,        m2,        2\r\n    movd        [r2 + 8],  m2\r\n%endif\r\n\r\n    movq        m2,        [r0 + 2 * r1]\r\n    punpcklbw   m1,        m2\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m3,        m8\r\n\r\n    movhlps     m8,        m1\r\n    punpcklbw   m1,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m1,        m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m1,        m8\r\n\r\n    paddw       m3,        m1\r\n\r\n%ifidn %1,pp\r\n    paddw       m3,        m4\r\n    psraw       m3,        6\r\n    packuswb    m3,        m3\r\n\r\n    movd        [r2 + r3], m3\r\n    pextrw      r6d,       m3,    2\r\n    mov         [r2 + r3 + 4], r6w\r\n%elifidn %1,ps\r\n    psubw       m3,        m4\r\n    movh        [r2 + r3], m3\r\n    pshufd      m3,        m3,        2\r\n    movd        [r2 + r3 + 8],  m3\r\n%endif\r\n\r\n%if x < %2/4\r\n    lea         r2,        [r2 + 2 * r3]\r\n%endif\r\n\r\n%assign x x+1\r\n%endrep\r\n    RET\r\n\r\n%endmacro\r\n\r\n%if ARCH_X86_64\r\n    FILTER_V4_W6_H4_sse2 pp, 8\r\n    FILTER_V4_W6_H4_sse2 pp, 16\r\n    FILTER_V4_W6_H4_sse2 ps, 8\r\n    FILTER_V4_W6_H4_sse2 ps, 16\r\n%endif\r\n\r\n;-----------------------------------------------------------------------------\r\n; void 
interp_4tap_vert_%1_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W8_sse2 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_%1_8x%2, 4, 7, 12\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n    shl         r4d,       5\r\n    pxor        m9,        m9\r\n\r\n%ifidn %1,pp\r\n    mova        m4,        [pw_32]\r\n%elifidn %1,ps\r\n    mova        m4,        [pw_2000]\r\n    add         r3d,       r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea         r6,        [tab_ChromaCoeffV]\r\n    mova        m6,        [r6 + r4]\r\n    mova        m5,        [r6 + r4 + 16]\r\n%else\r\n    mova        m6,        [tab_ChromaCoeffV + r4]\r\n    mova        m5,        [tab_ChromaCoeffV + r4 + 16]\r\n%endif\r\n\r\n    movq        m0,        [r0]\r\n    movq        m1,        [r0 + r1]\r\n    movq        m2,        [r0 + 2 * r1]\r\n    lea         r5,        [r0 + 2 * r1]\r\n    movq        m3,        [r5 + r1]\r\n\r\n    punpcklbw   m0,        m1\r\n    punpcklbw   m7,        m2,          m3\r\n\r\n    movhlps     m8,        m0\r\n    punpcklbw   m0,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m0,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m0,        m8\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m0,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m0,        m4\r\n    psraw       m0,        6\r\n%elifidn %1,ps\r\n    psubw       m0,        m4\r\n    movu        [r2],      m0\r\n%endif\r\n\r\n    movq        m11,        [r0 + 4 * r1]\r\n\r\n    punpcklbw   m1,        m2\r\n    punpcklbw   m7,        m3,        m11\r\n\r\n    movhlps     m8,        m1\r\n    punpcklbw   m1,        m9\r\n    punpcklbw   m8,        
m9\r\n    pmaddwd     m1,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m1,        m8\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m1,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m1,        m4\r\n    psraw       m1,        6\r\n    packuswb    m1,        m0\r\n\r\n    movhps      [r2],      m1\r\n    movh        [r2 + r3], m1\r\n%elifidn %1,ps\r\n    psubw       m1,        m4\r\n    movu        [r2 + r3], m1\r\n%endif\r\n%if %2 == 2     ;end of 8x2\r\n    RET\r\n\r\n%else\r\n    lea         r6,        [r0 + 4 * r1]\r\n    movq        m1,        [r6 + r1]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m7,        m11,        m1\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m2,        m8\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m2,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m2,        m4\r\n    psraw       m2,        6\r\n%elifidn %1,ps\r\n    psubw       m2,        m4\r\n    movu        [r2 + 2 * r3], m2\r\n%endif\r\n\r\n    movq        m10,        [r6 + 2 * r1]\r\n\r\n    punpcklbw   m3,        m11\r\n    punpcklbw   m7,        m1,        m10\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m3,        m8\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m7,       
 m8\r\n\r\n    paddw       m3,        m7\r\n    lea         r5,        [r2 + 2 * r3]\r\n\r\n%ifidn %1,pp\r\n    paddw       m3,        m4\r\n    psraw       m3,        6\r\n    packuswb    m3,        m2\r\n\r\n    movhps      [r2 + 2 * r3], m3\r\n    movh        [r5 + r3], m3\r\n%elifidn %1,ps\r\n    psubw       m3,        m4\r\n    movu        [r5 + r3], m3\r\n%endif\r\n%if %2 == 4     ;end of 8x4\r\n    RET\r\n\r\n%else\r\n    lea         r6,        [r6 + 2 * r1]\r\n    movq        m3,        [r6 + r1]\r\n\r\n    punpcklbw   m11,        m1\r\n    punpcklbw   m7,        m10,        m3\r\n\r\n    movhlps     m8,        m11\r\n    punpcklbw   m11,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m11,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m11,        m8\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m11,       m7\r\n\r\n%ifidn %1, pp\r\n    paddw       m11,       m4\r\n    psraw       m11,       6\r\n%elifidn %1,ps\r\n    psubw       m11,       m4\r\n    movu        [r2 + 4 * r3], m11\r\n%endif\r\n\r\n    movq        m7,        [r0 + 8 * r1]\r\n\r\n    punpcklbw   m1,        m10\r\n    punpcklbw   m3,        m7\r\n\r\n    movhlps     m8,        m1\r\n    punpcklbw   m1,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m1,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m1,        m8\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m3,        m8\r\n\r\n    paddw       m1,        m3\r\n    lea         r5,        [r2 + 4 * r3]\r\n\r\n%ifidn %1,pp\r\n    paddw       m1,        m4\r\n    psraw       m1,        6\r\n    packuswb    m1,        m11\r\n\r\n    movhps      [r2 + 4 * r3], m1\r\n    movh  
      [r5 + r3], m1\r\n%elifidn %1,ps\r\n    psubw       m1,        m4\r\n    movu        [r5 + r3], m1\r\n%endif\r\n%if %2 == 6\r\n    RET\r\n\r\n%else\r\n  %error INVALID macro argument, only 2, 4 or 6!\r\n%endif\r\n%endif\r\n%endif\r\n%endmacro\r\n\r\n%if ARCH_X86_64\r\n    FILTER_V4_W8_sse2 pp, 2\r\n    FILTER_V4_W8_sse2 pp, 4\r\n    FILTER_V4_W8_sse2 pp, 6\r\n    FILTER_V4_W8_sse2 ps, 2\r\n    FILTER_V4_W8_sse2 ps, 4\r\n    FILTER_V4_W8_sse2 ps, 6\r\n%endif\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_%1_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W8_H8_H16_H32_sse2 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_%1_8x%2, 4, 6, 11\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n    shl         r4d,       5\r\n    pxor        m9,        m9\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeffV]\r\n    mova        m6,        [r5 + r4]\r\n    mova        m5,        [r5 + r4 + 16]\r\n%else\r\n    mova        m6,        [tab_ChromaCoeffV + r4]\r\n    mova        m5,        [tab_ChromaCoeffV + r4 + 16]\r\n%endif\r\n\r\n%ifidn %1,pp\r\n    mova        m4,        [pw_32]\r\n%elifidn %1,ps\r\n    mova        m4,        [pw_2000]\r\n    add         r3d,       r3d\r\n%endif\r\n\r\n    lea         r5,        [r1 * 3]\r\n\r\n%assign x 1\r\n%rep %2/4\r\n    movq        m0,        [r0]\r\n    movq        m1,        [r0 + r1]\r\n    movq        m2,        [r0 + 2 * r1]\r\n    movq        m3,        [r0 + r5]\r\n\r\n    punpcklbw   m0,        m1\r\n    punpcklbw   m1,        m2\r\n    punpcklbw   m2,        m3\r\n\r\n    movhlps     m7,        m0\r\n    punpcklbw   m0,        m9\r\n    punpcklbw   m7,        m9\r\n    pmaddwd     m0,        m6\r\n    pmaddwd     m7,        m6\r\n    packssdw    m0,        m7\r\n\r\n    movhlps     m8,        
m2\r\n    movq        m7,        m2\r\n    punpcklbw   m8,        m9\r\n    punpcklbw   m7,        m9\r\n    pmaddwd     m8,        m5\r\n    pmaddwd     m7,        m5\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m0,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m0,        m4\r\n    psraw       m0,        6\r\n%elifidn %1,ps\r\n    psubw       m0,        m4\r\n    movu        [r2],      m0\r\n%endif\r\n\r\n    lea         r0,        [r0 + 4 * r1]\r\n    movq        m10,       [r0]\r\n    punpcklbw   m3,        m10\r\n\r\n    movhlps     m8,        m1\r\n    punpcklbw   m1,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m1,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m1,        m8\r\n\r\n    movhlps     m8,        m3\r\n    movq        m7,        m3\r\n    punpcklbw   m8,        m9\r\n    punpcklbw   m7,        m9\r\n    pmaddwd     m8,        m5\r\n    pmaddwd     m7,        m5\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m1,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m1,        m4\r\n    psraw       m1,        6\r\n\r\n    packuswb    m0,        m1\r\n    movh        [r2],      m0\r\n    movhps      [r2 + r3], m0\r\n%elifidn %1,ps\r\n    psubw       m1,        m4\r\n    movu        [r2 + r3], m1\r\n%endif\r\n\r\n    movq        m1,        [r0 + r1]\r\n    punpcklbw   m10,       m1\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m2,        m8\r\n\r\n    movhlps     m8,        m10\r\n    punpcklbw   m10,       m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m10,       m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m10,       m8\r\n\r\n    paddw       m2,        m10\r\n    lea         r2,        [r2 + 2 * r3]\r\n\r\n%ifidn %1,pp\r\n    paddw       m2,        m4\r\n    psraw       m2,        6\r\n%elifidn %1,ps\r\n    psubw       m2,        m4\r\n    
movu        [r2],      m2\r\n%endif\r\n\r\n    movq        m7,        [r0 + 2 * r1]\r\n    punpcklbw   m1,        m7\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m6\r\n    pmaddwd     m8,        m6\r\n    packssdw    m3,        m8\r\n\r\n    movhlps     m8,        m1\r\n    punpcklbw   m1,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m1,        m5\r\n    pmaddwd     m8,        m5\r\n    packssdw    m1,        m8\r\n\r\n    paddw       m3,        m1\r\n\r\n%ifidn %1,pp\r\n    paddw       m3,        m4\r\n    psraw       m3,        6\r\n\r\n    packuswb    m2,        m3\r\n    movh        [r2],      m2\r\n    movhps      [r2 + r3], m2\r\n%elifidn %1,ps\r\n    psubw       m3,        m4\r\n    movu        [r2 + r3], m3\r\n%endif\r\n\r\n%if x < %2/4\r\n    lea         r2,        [r2 + 2 * r3]\r\n%endif\r\n%endrep\r\n    RET\r\n%endmacro\r\n\r\n%if ARCH_X86_64\r\n    FILTER_V4_W8_H8_H16_H32_sse2 pp,  8\r\n    FILTER_V4_W8_H8_H16_H32_sse2 pp, 16\r\n    FILTER_V4_W8_H8_H16_H32_sse2 pp, 32\r\n\r\n    FILTER_V4_W8_H8_H16_H32_sse2 pp, 12\r\n    FILTER_V4_W8_H8_H16_H32_sse2 pp, 64\r\n\r\n    FILTER_V4_W8_H8_H16_H32_sse2 ps,  8\r\n    FILTER_V4_W8_H8_H16_H32_sse2 ps, 16\r\n    FILTER_V4_W8_H8_H16_H32_sse2 ps, 32\r\n\r\n    FILTER_V4_W8_H8_H16_H32_sse2 ps, 12\r\n    FILTER_V4_W8_H8_H16_H32_sse2 ps, 64\r\n%endif\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_%1_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W12_H2_sse2 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_%1_12x%2, 4, 6, 11\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n    shl         r4d,       5\r\n    pxor        m9,        m9\r\n\r\n%ifidn %1,pp\r\n    mova        m6,        
[pw_32]\r\n%elifidn %1,ps\r\n    mova        m6,        [pw_2000]\r\n    add         r3d,       r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeffV]\r\n    mova        m1,        [r5 + r4]\r\n    mova        m0,        [r5 + r4 + 16]\r\n%else\r\n    mova        m1,        [tab_ChromaCoeffV + r4]\r\n    mova        m0,        [tab_ChromaCoeffV + r4 + 16]\r\n%endif\r\n\r\n%assign x 1\r\n%rep %2/2\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m2,        m8\r\n\r\n    lea         r0,        [r0 + 2 * r1]\r\n    movu        m5,        [r0]\r\n    movu        m7,        [r0 + r1]\r\n\r\n    punpcklbw   m10,       m5,        m7\r\n    movhlps     m8,        m10\r\n    punpcklbw   m10,       m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m10,       m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m10,       m8\r\n\r\n    paddw       m4,        m10\r\n\r\n    punpckhbw   m10,       m5,        m7\r\n    movhlps     m8,        m10\r\n    punpcklbw   m10,       m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m10,       m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m10,       m8\r\n\r\n    paddw       m2,        m10\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m6\r\n    psraw       m4,        6\r\n    paddw       m2,        m6\r\n    psraw       m2,        6\r\n\r\n    packuswb    m4,        m2\r\n    movh        [r2],      m4\r\n    psrldq      m4,        8\r\n    movd        [r2 + 8],  m4\r\n%elifidn %1,ps\r\n    psubw      
 m4,        m6\r\n    psubw       m2,        m6\r\n    movu        [r2],      m4\r\n    movh        [r2 + 16], m2\r\n%endif\r\n\r\n    punpcklbw   m4,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m3,        m8\r\n\r\n    movu        m5,        [r0 + 2 * r1]\r\n    punpcklbw   m2,        m7,        m5\r\n    punpckhbw   m7,        m5\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m2,        m8\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m4,        m2\r\n    paddw       m3,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m6\r\n    psraw       m4,        6\r\n    paddw       m3,        m6\r\n    psraw       m3,        6\r\n\r\n    packuswb    m4,        m3\r\n    movh        [r2 + r3], m4\r\n    psrldq      m4,        8\r\n    movd        [r2 + r3 + 8], m4\r\n%elifidn %1,ps\r\n    psubw       m4,        m6\r\n    psubw       m3,        m6\r\n    movu        [r2 + r3], m4\r\n    movh        [r2 + r3 + 16], m3\r\n%endif\r\n\r\n%if x < %2/2\r\n    lea         r2,        [r2 + 2 * r3]\r\n%endif\r\n%assign x x+1\r\n%endrep\r\n    RET\r\n\r\n%endmacro\r\n\r\n%if ARCH_X86_64\r\n    FILTER_V4_W12_H2_sse2 pp, 16\r\n    FILTER_V4_W12_H2_sse2 pp, 32\r\n    FILTER_V4_W12_H2_sse2 ps, 16\r\n    FILTER_V4_W12_H2_sse2 ps, 
32\r\n%endif\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_%1_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W16_H2_sse2 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_%1_16x%2, 4, 6, 11\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n    shl         r4d,       5\r\n    pxor        m9,        m9\r\n\r\n%ifidn %1,pp\r\n    mova        m6,        [pw_32]\r\n%elifidn %1,ps\r\n    mova        m6,        [pw_2000]\r\n    add         r3d,       r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeffV]\r\n    mova        m1,        [r5 + r4]\r\n    mova        m0,        [r5 + r4 + 16]\r\n%else\r\n    mova        m1,        [tab_ChromaCoeffV + r4]\r\n    mova        m0,        [tab_ChromaCoeffV + r4 + 16]\r\n%endif\r\n\r\n%assign x 1\r\n%rep %2/2\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m2,        m8\r\n\r\n    lea         r0,        [r0 + 2 * r1]\r\n    movu        m5,        [r0]\r\n    movu        m10,       [r0 + r1]\r\n\r\n    punpckhbw   m7,        m5,        m10\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m7,        m8\r\n    paddw       m2,        m7\r\n\r\n    punpcklbw   m7,        
m5,        m10\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m7,        m8\r\n    paddw       m4,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m6\r\n    psraw       m4,        6\r\n    paddw       m2,        m6\r\n    psraw       m2,        6\r\n\r\n    packuswb    m4,        m2\r\n    movu        [r2],      m4\r\n%elifidn %1,ps\r\n    psubw       m4,        m6\r\n    psubw       m2,        m6\r\n    movu        [r2],      m4\r\n    movu        [r2 + 16], m2\r\n%endif\r\n\r\n    punpcklbw   m4,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m3,        m8\r\n\r\n    movu        m5,        [r0 + 2 * r1]\r\n\r\n    punpcklbw   m2,        m10,       m5\r\n    punpckhbw   m10,       m5\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m2,        m8\r\n\r\n    movhlps     m8,        m10\r\n    punpcklbw   m10,       m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m10,       m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m10,       m8\r\n\r\n    paddw       m4,        m2\r\n    paddw       m3,        m10\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m6\r\n    psraw       m4,        6\r\n    paddw       m3,        m6\r\n    psraw       m3,        6\r\n\r\n    packuswb    m4,        m3\r\n    movu        [r2 + r3], m4\r\n%elifidn %1,ps\r\n    psubw       m4,        m6\r\n    psubw 
      m3,        m6\r\n    movu        [r2 + r3], m4\r\n    movu        [r2 + r3 + 16], m3\r\n%endif\r\n\r\n%if x < %2/2\r\n    lea         r2,        [r2 + 2 * r3]\r\n%endif\r\n%assign x x+1\r\n%endrep\r\n    RET\r\n\r\n%endmacro\r\n\r\n%if ARCH_X86_64\r\n    FILTER_V4_W16_H2_sse2 pp, 4\r\n    FILTER_V4_W16_H2_sse2 pp, 8\r\n    FILTER_V4_W16_H2_sse2 pp, 12\r\n    FILTER_V4_W16_H2_sse2 pp, 16\r\n    FILTER_V4_W16_H2_sse2 pp, 32\r\n\r\n    FILTER_V4_W16_H2_sse2 pp, 24\r\n    FILTER_V4_W16_H2_sse2 pp, 64\r\n\r\n    FILTER_V4_W16_H2_sse2 ps, 4\r\n    FILTER_V4_W16_H2_sse2 ps, 8\r\n    FILTER_V4_W16_H2_sse2 ps, 12\r\n    FILTER_V4_W16_H2_sse2 ps, 16\r\n    FILTER_V4_W16_H2_sse2 ps, 32\r\n\r\n    FILTER_V4_W16_H2_sse2 ps, 24\r\n    FILTER_V4_W16_H2_sse2 ps, 64\r\n%endif\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_%1_24x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W24_sse2 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_%1_24x%2, 4, 6, 11\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n    shl         r4d,       5\r\n    pxor        m9,        m9\r\n\r\n%ifidn %1,pp\r\n    mova        m6,        [pw_32]\r\n%elifidn %1,ps\r\n    mova        m6,        [pw_2000]\r\n    add         r3d,       r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeffV]\r\n    mova        m1,        [r5 + r4]\r\n    mova        m0,        [r5 + r4 + 16]\r\n%else\r\n    mova        m1,        [tab_ChromaCoeffV + r4]\r\n    mova        m0,        [tab_ChromaCoeffV + r4 + 16]\r\n%endif\r\n\r\n%assign x 1\r\n%rep %2/2\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    
punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m2,        m8\r\n\r\n    lea         r5,        [r0 + 2 * r1]\r\n    movu        m5,        [r5]\r\n    movu        m10,       [r5 + r1]\r\n    punpcklbw   m7,        m5,        m10\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m7,        m8\r\n    paddw       m4,        m7\r\n\r\n    punpckhbw   m7,        m5,        m10\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m7,        m8\r\n\r\n    paddw       m2,        m7\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m6\r\n    psraw       m4,        6\r\n    paddw       m2,        m6\r\n    psraw       m2,        6\r\n\r\n    packuswb    m4,        m2\r\n    movu        [r2],      m4\r\n%elifidn %1,ps\r\n    psubw       m4,        m6\r\n    psubw       m2,        m6\r\n    movu        [r2],      m4\r\n    movu        [r2 + 16], m2\r\n%endif\r\n\r\n    punpcklbw   m4,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m3,        m8\r\n\r\n    movu        m2,        [r5 + 2 * r1]\r\n\r\n    punpcklbw   m5,        m10,        m2\r\n    punpckhbw   
m10,       m2\r\n\r\n    movhlps     m8,        m5\r\n    punpcklbw   m5,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m5,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m5,        m8\r\n\r\n    movhlps     m8,        m10\r\n    punpcklbw   m10,       m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m10,       m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m10,       m8\r\n\r\n    paddw       m4,        m5\r\n    paddw       m3,        m10\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m6\r\n    psraw       m4,        6\r\n    paddw       m3,        m6\r\n    psraw       m3,        6\r\n\r\n    packuswb    m4,        m3\r\n    movu        [r2 + r3], m4\r\n%elifidn %1,ps\r\n    psubw       m4,        m6\r\n    psubw       m3,        m6\r\n    movu        [r2 + r3], m4\r\n    movu        [r2 + r3 + 16], m3\r\n%endif\r\n\r\n    movq        m2,        [r0 + 16]\r\n    movq        m3,        [r0 + r1 + 16]\r\n    movq        m4,        [r5 + 16]\r\n    movq        m5,        [r5 + r1 + 16]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m4,        m5\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m2,        m8\r\n\r\n    paddw       m2,        m4\r\n\r\n%ifidn %1,pp\r\n    paddw       m2,        m6\r\n    psraw       m2,        6\r\n%elifidn %1,ps\r\n    psubw       m2,        m6\r\n    movu        [r2 + 32], m2\r\n%endif\r\n\r\n    movq        m3,        [r0 + r1 + 16]\r\n    movq        m4,        [r5 + 16]\r\n    movq        m5,        [r5 + r1 + 16]\r\n    movq        m7,        [r5 + 2 * r1 + 16]\r\n\r\n    punpcklbw   m3,        m4\r\n    punpcklbw   m5,        m7\r\n\r\n 
   movhlps     m8,        m5\r\n    punpcklbw   m5,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m5,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m5,        m8\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m3,        m8\r\n\r\n    paddw       m3,        m5\r\n\r\n%ifidn %1,pp\r\n    paddw       m3,        m6\r\n    psraw       m3,        6\r\n\r\n    packuswb    m2,        m3\r\n    movh        [r2 + 16], m2\r\n    movhps      [r2 + r3 + 16], m2\r\n%elifidn %1,ps\r\n    psubw       m3,        m6\r\n    movu        [r2 + r3 + 32], m3\r\n%endif\r\n\r\n%if x < %2/2\r\n    mov         r0,        r5\r\n    lea         r2,        [r2 + 2 * r3]\r\n%endif\r\n%assign x x+1\r\n%endrep\r\n    RET\r\n\r\n%endmacro\r\n\r\n%if ARCH_X86_64\r\n    FILTER_V4_W24_sse2 pp, 32\r\n    FILTER_V4_W24_sse2 pp, 64\r\n    FILTER_V4_W24_sse2 ps, 32\r\n    FILTER_V4_W24_sse2 ps, 64\r\n%endif\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_%1_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W32_sse2 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_%1_32x%2, 4, 6, 10\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n    shl         r4d,       5\r\n    pxor        m9,        m9\r\n\r\n%ifidn %1,pp\r\n    mova        m6,        [pw_32]\r\n%elifidn %1,ps\r\n    mova        m6,        [pw_2000]\r\n    add         r3d,       r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeffV]\r\n    mova        m1,        [r5 + r4]\r\n    mova        m0,        [r5 + r4 + 16]\r\n%else\r\n    mova        m1,        [tab_ChromaCoeffV + r4]\r\n    mova        m0,        [tab_ChromaCoeffV + r4 + 
16]\r\n%endif\r\n\r\n    mov         r4d,       %2\r\n\r\n.loop:\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m2,        m8\r\n\r\n    lea         r5,        [r0 + 2 * r1]\r\n    movu        m3,        [r5]\r\n    movu        m5,        [r5 + r1]\r\n\r\n    punpcklbw   m7,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m7,        m8\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m3,        m8\r\n\r\n    paddw       m4,        m7\r\n    paddw       m2,        m3\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m6\r\n    psraw       m4,        6\r\n    paddw       m2,        m6\r\n    psraw       m2,        6\r\n\r\n    packuswb    m4,        m2\r\n    movu        [r2],      m4\r\n%elifidn %1,ps\r\n    psubw       m4,        m6\r\n    psubw       m2,        m6\r\n    movu        [r2],      m4\r\n    movu        [r2 + 16], m2\r\n%endif\r\n\r\n    movu        m2,        [r0 + 16]\r\n    movu        m3,        [r0 + r1 + 16]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    
pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m2,        m8\r\n\r\n    movu        m3,        [r5 + 16]\r\n    movu        m5,        [r5 + r1 + 16]\r\n\r\n    punpcklbw   m7,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    movhlps     m8,        m7\r\n    punpcklbw   m7,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m7,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m7,        m8\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m3,        m8\r\n\r\n    paddw       m4,        m7\r\n    paddw       m2,        m3\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m6\r\n    psraw       m4,        6\r\n    paddw       m2,        m6\r\n    psraw       m2,        6\r\n\r\n    packuswb    m4,        m2\r\n    movu        [r2 + 16], m4\r\n%elifidn %1,ps\r\n    psubw       m4,        m6\r\n    psubw       m2,        m6\r\n    movu        [r2 + 32], m4\r\n    movu        [r2 + 48], m2\r\n%endif\r\n\r\n    lea         r0,        [r0 + r1]\r\n    lea         r2,        [r2 + r3]\r\n    dec         r4\r\n    jnz        .loop\r\n    RET\r\n\r\n%endmacro\r\n\r\n%if ARCH_X86_64\r\n    FILTER_V4_W32_sse2 pp, 8\r\n    FILTER_V4_W32_sse2 pp, 16\r\n    FILTER_V4_W32_sse2 pp, 24\r\n    FILTER_V4_W32_sse2 pp, 32\r\n\r\n    FILTER_V4_W32_sse2 pp, 48\r\n    FILTER_V4_W32_sse2 pp, 64\r\n\r\n    FILTER_V4_W32_sse2 ps, 8\r\n    FILTER_V4_W32_sse2 ps, 16\r\n    FILTER_V4_W32_sse2 ps, 24\r\n    FILTER_V4_W32_sse2 ps, 32\r\n\r\n    FILTER_V4_W32_sse2 ps, 48\r\n    FILTER_V4_W32_sse2 ps, 64\r\n%endif\r\n\r\n;-----------------------------------------------------------------------------\r\n; void 
interp_4tap_vert_%1_%2x%3(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W16n_H2_sse2 3\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_%1_%2x%3, 4, 7, 11\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n    shl         r4d,       5\r\n    pxor        m9,        m9\r\n\r\n%ifidn %1,pp\r\n    mova        m7,        [pw_32]\r\n%elifidn %1,ps\r\n    mova        m7,        [pw_2000]\r\n    add         r3d,       r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeffV]\r\n    mova        m1,        [r5 + r4]\r\n    mova        m0,        [r5 + r4 + 16]\r\n%else\r\n    mova        m1,        [tab_ChromaCoeffV + r4]\r\n    mova        m0,        [tab_ChromaCoeffV + r4 + 16]\r\n%endif\r\n\r\n    mov         r4d,       %3/2\r\n\r\n.loop:\r\n\r\n    mov         r6d,       %2/16\r\n\r\n.loopW:\r\n\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m2,        m8\r\n\r\n    lea         r5,        [r0 + 2 * r1]\r\n    movu        m5,        [r5]\r\n    movu        m6,        [r5 + r1]\r\n\r\n    punpckhbw   m10,        m5,        m6\r\n    movhlps     m8,        m10\r\n    punpcklbw   m10,       m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m10,       m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m10,       m8\r\n    paddw       m2,        m10\r\n\r\n    punpcklbw   m10,        m5,        m6\r\n    
movhlps     m8,        m10\r\n    punpcklbw   m10,       m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m10,       m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m10,       m8\r\n    paddw       m4,        m10\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m7\r\n    psraw       m4,        6\r\n    paddw       m2,        m7\r\n    psraw       m2,        6\r\n\r\n    packuswb    m4,        m2\r\n    movu        [r2],      m4\r\n%elifidn %1,ps\r\n    psubw       m4,        m7\r\n    psubw       m2,        m7\r\n    movu        [r2],      m4\r\n    movu        [r2 + 16], m2\r\n%endif\r\n\r\n    punpcklbw   m4,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    movhlps     m8,        m4\r\n    punpcklbw   m4,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m4,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m4,        m8\r\n\r\n    movhlps     m8,        m3\r\n    punpcklbw   m3,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m3,        m1\r\n    pmaddwd     m8,        m1\r\n    packssdw    m3,        m8\r\n\r\n    movu        m5,        [r5 + 2 * r1]\r\n\r\n    punpcklbw   m2,        m6,        m5\r\n    punpckhbw   m6,        m5\r\n\r\n    movhlps     m8,        m2\r\n    punpcklbw   m2,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m2,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m2,        m8\r\n\r\n    movhlps     m8,        m6\r\n    punpcklbw   m6,        m9\r\n    punpcklbw   m8,        m9\r\n    pmaddwd     m6,        m0\r\n    pmaddwd     m8,        m0\r\n    packssdw    m6,        m8\r\n\r\n    paddw       m4,        m2\r\n    paddw       m3,        m6\r\n\r\n%ifidn %1,pp\r\n    paddw       m4,        m7\r\n    psraw       m4,        6\r\n    paddw       m3,        m7\r\n    psraw       m3,        6\r\n\r\n    packuswb    m4,        m3\r\n    movu        [r2 + r3], m4\r\n    add         r2,        16\r\n%elifidn %1,ps\r\n    psubw       m4,        
m7\r\n    psubw       m3,        m7\r\n    movu        [r2 + r3], m4\r\n    movu        [r2 + r3 + 16], m3\r\n    add         r2,        32\r\n%endif\r\n\r\n    add         r0,        16\r\n    dec         r6d\r\n    jnz         .loopW\r\n\r\n    lea         r0,        [r0 + r1 * 2 - %2]\r\n\r\n%ifidn %1,pp\r\n    lea         r2,        [r2 + r3 * 2 - %2]\r\n%elifidn %1,ps\r\n    lea         r2,        [r2 + r3 * 2 - (%2 * 2)]\r\n%endif\r\n\r\n    dec         r4d\r\n    jnz        .loop\r\n    RET\r\n\r\n%endmacro\r\n\r\n%if ARCH_X86_64\r\n    FILTER_V4_W16n_H2_sse2 pp, 64, 64\r\n    FILTER_V4_W16n_H2_sse2 pp, 64, 32\r\n    FILTER_V4_W16n_H2_sse2 pp, 64, 48\r\n    FILTER_V4_W16n_H2_sse2 pp, 48, 64\r\n    FILTER_V4_W16n_H2_sse2 pp, 64, 16\r\n    FILTER_V4_W16n_H2_sse2 ps, 64, 64\r\n    FILTER_V4_W16n_H2_sse2 ps, 64, 32\r\n    FILTER_V4_W16n_H2_sse2 ps, 64, 48\r\n    FILTER_V4_W16n_H2_sse2 ps, 48, 64\r\n    FILTER_V4_W16n_H2_sse2 ps, 64, 16\r\n%endif\r\n\r\n%macro FILTER_P2S_2_4_sse2 1\r\n    movd        m2,     [r0 + %1]\r\n    movd        m3,     [r0 + r1 + %1]\r\n    punpcklwd   m2,     m3\r\n    movd        m3,     [r0 + r1 * 2 + %1]\r\n    movd        m4,     [r0 + r4 + %1]\r\n    punpcklwd   m3,     m4\r\n    punpckldq   m2,     m3\r\n    punpcklbw   m2,     m0\r\n    psllw       m2,     6\r\n    psubw       m2,     m1\r\n\r\n    movd        [r2 + r3 * 0 + %1 * 2], m2\r\n    psrldq      m2,     4\r\n    movd        [r2 + r3 * 1 + %1 * 2], m2\r\n    psrldq      m2,     4\r\n    movd        [r2 + r3 * 2 + %1 * 2], m2\r\n    psrldq      m2,     4\r\n    movd        [r2 + r5 + %1 * 2], m2\r\n%endmacro\r\n\r\n%macro FILTER_P2S_4_4_sse2 1\r\n    movd        m2,     [r0 + %1]\r\n    movd        m3,     [r0 + r1 + %1]\r\n    movd        m4,     [r0 + r1 * 2 + %1]\r\n    movd        m5,     [r0 + r4 + %1]\r\n    punpckldq   m2,     m3\r\n    punpcklbw   m2,     m0\r\n    punpckldq   m4,     m5\r\n    punpcklbw   m4,     m0\r\n    psllw       m2,     6\r\n    psllw       
m4,     6\r\n    psubw       m2,     m1\r\n    psubw       m4,     m1\r\n    movh        [r2 + r3 * 0 + %1 * 2], m2\r\n    movh        [r2 + r3 * 2 + %1 * 2], m4\r\n    movhps      [r2 + r3 * 1 + %1 * 2], m2\r\n    movhps      [r2 + r5 + %1 * 2], m4\r\n%endmacro\r\n\r\n%macro FILTER_P2S_4_2_sse2 0\r\n    movd        m2,     [r0]\r\n    movd        m3,     [r0 + r1]\r\n    punpckldq   m2,     m3\r\n    punpcklbw   m2,     m0\r\n    psllw       m2,     6\r\n    psubw       m2,     [pw_8192]\r\n    movh        [r2],   m2\r\n    movhps      [r2 + r3 * 2], m2\r\n%endmacro\r\n\r\n%macro FILTER_P2S_8_4_sse2 1\r\n    movh        m2,     [r0 + %1]\r\n    movh        m3,     [r0 + r1 + %1]\r\n    movh        m4,     [r0 + r1 * 2 + %1]\r\n    movh        m5,     [r0 + r4 + %1]\r\n    punpcklbw   m2,     m0\r\n    punpcklbw   m3,     m0\r\n    punpcklbw   m5,     m0\r\n    punpcklbw   m4,     m0\r\n    psllw       m2,     6\r\n    psllw       m3,     6\r\n    psllw       m5,     6\r\n    psllw       m4,     6\r\n    psubw       m2,     m1\r\n    psubw       m3,     m1\r\n    psubw       m4,     m1\r\n    psubw       m5,     m1\r\n    movu        [r2 + r3 * 0 + %1 * 2], m2\r\n    movu        [r2 + r3 * 1 + %1 * 2], m3\r\n    movu        [r2 + r3 * 2 + %1 * 2], m4\r\n    movu        [r2 + r5 + %1 * 2], m5\r\n%endmacro\r\n\r\n%macro FILTER_P2S_8_2_sse2 1\r\n    movh        m2,     [r0 + %1]\r\n    movh        m3,     [r0 + r1 + %1]\r\n    punpcklbw   m2,     m0\r\n    punpcklbw   m3,     m0\r\n    psllw       m2,     6\r\n    psllw       m3,     6\r\n    psubw       m2,     m1\r\n    psubw       m3,     m1\r\n    movu        [r2 + r3 * 0 + %1 * 2], m2\r\n    movu        [r2 + r3 * 1 + %1 * 2], m3\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro 
FILTER_PIX_TO_SHORT_sse2 2\r\nINIT_XMM sse2\r\ncglobal filterPixelToShort_%1x%2, 4, 6, 6\r\n    pxor        m0,     m0\r\n%if %2 == 2\r\n%if %1 == 4\r\n    FILTER_P2S_4_2_sse2\r\n%elif %1 == 8\r\n    add        r3d, r3d\r\n    mova       m1, [pw_8192]\r\n    FILTER_P2S_8_2_sse2 0\r\n%endif\r\n%else\r\n    add        r3d, r3d\r\n    mova       m1, [pw_8192]\r\n    lea        r4, [r1 * 3]\r\n    lea        r5, [r3 * 3]\r\n%assign y 1\r\n%rep %2/4\r\n%assign x 0\r\n%rep %1/8\r\n    FILTER_P2S_8_4_sse2 x\r\n%if %2 == 6\r\n    lea         r0,     [r0 + 4 * r1]\r\n    lea         r2,     [r2 + 4 * r3]\r\n    FILTER_P2S_8_2_sse2 x\r\n%endif\r\n%assign x x+8\r\n%endrep\r\n%rep (%1 % 8)/4\r\n    FILTER_P2S_4_4_sse2 x\r\n%assign x x+4\r\n%endrep\r\n%rep (%1 % 4)/2\r\n    FILTER_P2S_2_4_sse2 x\r\n%endrep\r\n%if y < %2/4\r\n    lea         r0,     [r0 + 4 * r1]\r\n    lea         r2,     [r2 + 4 * r3]\r\n%assign y y+1\r\n%endif\r\n%endrep\r\n%endif\r\nRET\r\n%endmacro\r\n\r\n    FILTER_PIX_TO_SHORT_sse2 2, 4\r\n    FILTER_PIX_TO_SHORT_sse2 2, 8\r\n    FILTER_PIX_TO_SHORT_sse2 2, 16\r\n    FILTER_PIX_TO_SHORT_sse2 4, 2\r\n    FILTER_PIX_TO_SHORT_sse2 4, 4\r\n    FILTER_PIX_TO_SHORT_sse2 4, 8\r\n    FILTER_PIX_TO_SHORT_sse2 4, 16\r\n    FILTER_PIX_TO_SHORT_sse2 4, 32\r\n    FILTER_PIX_TO_SHORT_sse2 6, 8\r\n    FILTER_PIX_TO_SHORT_sse2 6, 16\r\n    FILTER_PIX_TO_SHORT_sse2 8, 2\r\n    FILTER_PIX_TO_SHORT_sse2 8, 4\r\n    FILTER_PIX_TO_SHORT_sse2 8, 6\r\n    FILTER_PIX_TO_SHORT_sse2 8, 8\r\n    FILTER_PIX_TO_SHORT_sse2 8, 12\r\n    FILTER_PIX_TO_SHORT_sse2 8, 16\r\n    FILTER_PIX_TO_SHORT_sse2 8, 32\r\n    FILTER_PIX_TO_SHORT_sse2 8, 64\r\n    FILTER_PIX_TO_SHORT_sse2 12, 16\r\n    FILTER_PIX_TO_SHORT_sse2 12, 32\r\n    FILTER_PIX_TO_SHORT_sse2 16, 4\r\n    FILTER_PIX_TO_SHORT_sse2 16, 8\r\n    FILTER_PIX_TO_SHORT_sse2 16, 12\r\n    FILTER_PIX_TO_SHORT_sse2 16, 16\r\n    FILTER_PIX_TO_SHORT_sse2 16, 24\r\n    FILTER_PIX_TO_SHORT_sse2 16, 32\r\n    FILTER_PIX_TO_SHORT_sse2 16, 
64\r\n    FILTER_PIX_TO_SHORT_sse2 24, 32\r\n    FILTER_PIX_TO_SHORT_sse2 24, 64\r\n    FILTER_PIX_TO_SHORT_sse2 32, 8\r\n    FILTER_PIX_TO_SHORT_sse2 32, 16\r\n    FILTER_PIX_TO_SHORT_sse2 32, 24\r\n    FILTER_PIX_TO_SHORT_sse2 32, 32\r\n    FILTER_PIX_TO_SHORT_sse2 32, 48\r\n    FILTER_PIX_TO_SHORT_sse2 32, 64\r\n    FILTER_PIX_TO_SHORT_sse2 48, 64\r\n    FILTER_PIX_TO_SHORT_sse2 64, 16\r\n    FILTER_PIX_TO_SHORT_sse2 64, 32\r\n    FILTER_PIX_TO_SHORT_sse2 64, 48\r\n    FILTER_PIX_TO_SHORT_sse2 64, 64\r\n\r\n%macro FILTER_H4_w2_2 3\r\n    movh        %2, [srcq - 1]\r\n    pshufb      %2, %2, Tm0\r\n    movh        %1, [srcq + srcstrideq - 1]\r\n    pshufb      %1, %1, Tm0\r\n    punpcklqdq  %2, %1\r\n    pmaddubsw   %2, coef2\r\n    phaddw      %2, %2\r\n    pmulhrsw    %2, %3\r\n    packuswb    %2, %2\r\n    movd        r4, %2\r\n    mov         [dstq], r4w\r\n    shr         r4, 16\r\n    mov         [dstq + dststrideq], r4w\r\n%endmacro\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_pp_2x4, 4, 6, 5, src, srcstride, dst, dststride\r\n%define coef2       m4\r\n%define Tm0         m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,        r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n\r\n%rep 2\r\n    FILTER_H4_w2_2   t0, t1, t2\r\n    lea         srcq,       [srcq + srcstrideq * 2]\r\n    lea         dstq,       [dstq + dststrideq * 
2]\r\n%endrep\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_pp_2x8, 4, 6, 5, src, srcstride, dst, dststride\r\n%define coef2       m4\r\n%define Tm0         m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,        r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n\r\n%rep 4\r\n    FILTER_H4_w2_2   t0, t1, t2\r\n    lea         srcq,       [srcq + srcstrideq * 2]\r\n    lea         dstq,       [dstq + dststrideq * 2]\r\n%endrep\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_pp_2x16, 4, 6, 5, src, srcstride, dst, dststride\r\n%define coef2       m4\r\n%define Tm0         m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,        r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n\r\n    mov         r5d,        
16/2\r\n\r\n.loop:\r\n    FILTER_H4_w2_2   t0, t1, t2\r\n    lea         srcq,       [srcq + srcstrideq * 2]\r\n    lea         dstq,       [dstq + dststrideq * 2]\r\n    dec         r5d\r\n    jnz         .loop\r\n\r\n    RET\r\n\r\n%macro FILTER_H4_w4_2 3\r\n    movh        %2, [srcq - 1]\r\n    pshufb      %2, %2, Tm0\r\n    pmaddubsw   %2, coef2\r\n    movh        %1, [srcq + srcstrideq - 1]\r\n    pshufb      %1, %1, Tm0\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    pmulhrsw    %2, %3\r\n    packuswb    %2, %2\r\n    movd        [dstq], %2\r\n    palignr     %2, %2, 4\r\n    movd        [dstq + dststrideq], %2\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_pp_4x2, 4, 6, 5, src, srcstride, dst, dststride\r\n%define coef2       m4\r\n%define Tm0         m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,        r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n\r\n    FILTER_H4_w4_2   t0, t1, t2\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_pp_4x4, 4, 6, 5, src, srcstride, dst, dststride\r\n%define coef2       m4\r\n%define Tm0         
m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,        r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n\r\n%rep 2\r\n    FILTER_H4_w4_2   t0, t1, t2\r\n    lea         srcq,       [srcq + srcstrideq * 2]\r\n    lea         dstq,       [dstq + dststrideq * 2]\r\n%endrep\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_pp_4x8, 4, 6, 5, src, srcstride, dst, dststride\r\n%define coef2       m4\r\n%define Tm0         m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,        r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n\r\n%rep 4\r\n    FILTER_H4_w4_2   t0, t1, t2\r\n    lea         srcq,       [srcq + srcstrideq * 2]\r\n    lea         dstq,       [dstq + dststrideq * 2]\r\n%endrep\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal 
interp_4tap_horiz_pp_4x16, 4, 6, 5, src, srcstride, dst, dststride\r\n%define coef2       m4\r\n%define Tm0         m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,        r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n\r\n%rep 8\r\n    FILTER_H4_w4_2   t0, t1, t2\r\n    lea         srcq,       [srcq + srcstrideq * 2]\r\n    lea         dstq,       [dstq + dststrideq * 2]\r\n%endrep\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_pp_4x32, 4, 6, 5, src, srcstride, dst, dststride\r\n%define coef2       m4\r\n%define Tm0         m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,        r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n\r\n    mov         r5d,        32/2\r\n\r\n.loop:\r\n    FILTER_H4_w4_2   t0, t1, t2\r\n    lea         srcq,       [srcq + srcstrideq * 2]\r\n    lea         dstq,       [dstq + dststrideq * 2]\r\n    dec         r5d\r\n    jnz         .loop\r\n\r\n    RET\r\n\r\nALIGN 32\r\nconst interp_4tap_8x8_horiz_shuf,   dd 0, 4, 1, 5, 2, 6, 3, 7\r\n\r\n\r\n%macro FILTER_H4_w6 3\r\n    
movu        %1, [srcq - 1]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    pmulhrsw    %2, %3\r\n    packuswb    %2, %2\r\n    movd        [dstq],      %2\r\n    pextrw      [dstq + 4], %2, 2\r\n%endmacro\r\n\r\n%macro FILTER_H4_w8 3\r\n    movu        %1, [srcq - 1]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    pmulhrsw    %2, %3\r\n    packuswb    %2, %2\r\n    movh        [dstq],      %2\r\n%endmacro\r\n\r\n%macro FILTER_H4_w12 3\r\n    movu        %1, [srcq - 1]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    pmulhrsw    %2, %3\r\n    movu        %1, [srcq - 1 + 8]\r\n    pshufb      %1, %1, Tm0\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %1, %1\r\n    pmulhrsw    %1, %3\r\n    packuswb    %2, %1\r\n    movh        [dstq],      %2\r\n    pextrd      [dstq + 8], %2, 2\r\n%endmacro\r\n\r\n%macro FILTER_H4_w16 4\r\n    movu        %1, [srcq - 1]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq - 1 + 8]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %4, %1\r\n    pmulhrsw    %2, %3\r\n    pmulhrsw    %4, %3\r\n    packuswb    %2, %4\r\n    movu        [dstq],      %2\r\n%endmacro\r\n\r\n%macro FILTER_H4_w24 4\r\n    movu        %1, [srcq - 1]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq - 1 + 8]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, 
coef2\r\n    phaddw      %4, %1\r\n    pmulhrsw    %2, %3\r\n    pmulhrsw    %4, %3\r\n    packuswb    %2, %4\r\n    movu        [dstq],          %2\r\n    movu        %1, [srcq - 1 + 16]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    pmulhrsw    %2, %3\r\n    packuswb    %2, %2\r\n    movh        [dstq + 16],     %2\r\n%endmacro\r\n\r\n%macro FILTER_H4_w32 4\r\n    movu        %1, [srcq - 1]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq - 1 + 8]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %4, %1\r\n    pmulhrsw    %2, %3\r\n    pmulhrsw    %4, %3\r\n    packuswb    %2, %4\r\n    movu        [dstq],      %2\r\n    movu        %1, [srcq - 1 + 16]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq - 1 + 24]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %4, %1\r\n    pmulhrsw    %2, %3\r\n    pmulhrsw    %4, %3\r\n    packuswb    %2, %4\r\n    movu        [dstq + 16],      %2\r\n%endmacro\r\n\r\n%macro FILTER_H4_w16o 5\r\n    movu        %1, [srcq + %5 - 1]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq + %5 - 1 + 8]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %4, %1\r\n    pmulhrsw    %2, %3\r\n    pmulhrsw    %4, %3\r\n    packuswb    %2, %4\r\n    movu        [dstq + %5],      %2\r\n%endmacro\r\n\r\n%macro 
FILTER_H4_w48 4\r\n    FILTER_H4_w16o %1, %2, %3, %4, 0\r\n    FILTER_H4_w16o %1, %2, %3, %4, 16\r\n    FILTER_H4_w16o %1, %2, %3, %4, 32\r\n%endmacro\r\n\r\n%macro FILTER_H4_w64 4\r\n    FILTER_H4_w16o %1, %2, %3, %4, 0\r\n    FILTER_H4_w16o %1, %2, %3, %4, 16\r\n    FILTER_H4_w16o %1, %2, %3, %4, 32\r\n    FILTER_H4_w16o %1, %2, %3, %4, 48\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 6, src, srcstride, dst, dststride\r\n%define coef2       m5\r\n%define Tm0         m4\r\n%define Tm1         m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,        r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mov           r5d,       %2\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n    mova        Tm1,         [tab_Tm + 16]\r\n\r\n.loop:\r\n    FILTER_H4_w%1   t0, t1, t2\r\n    add         srcq,        srcstrideq\r\n    add         dstq,        dststrideq\r\n\r\n    dec         r5d\r\n    jnz        .loop\r\n\r\n    RET\r\n%endmacro\r\n\r\n\r\n    IPFILTER_CHROMA 6,   8\r\n    IPFILTER_CHROMA 8,   2\r\n    IPFILTER_CHROMA 8,   4\r\n    IPFILTER_CHROMA 8,   6\r\n    IPFILTER_CHROMA 8,   8\r\n    IPFILTER_CHROMA 8,  16\r\n    IPFILTER_CHROMA 8,  32\r\n    IPFILTER_CHROMA 12, 16\r\n\r\n    IPFILTER_CHROMA 6,  16\r\n    IPFILTER_CHROMA 8,  12\r\n    IPFILTER_CHROMA 8,  64\r\n    IPFILTER_CHROMA 12, 
32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_W 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 7, src, srcstride, dst, dststride\r\n%define coef2       m6\r\n%define Tm0         m5\r\n%define Tm1         m4\r\n%define t3          m3\r\n%define t2          m2\r\n%define t1          m1\r\n%define t0          m0\r\n\r\n    mov         r4d,         r4m\r\n\r\n%ifdef PIC\r\n    lea         r5,          [tab_ChromaCoeff]\r\n    movd        coef2,       [r5 + r4 * 4]\r\n%else\r\n    movd        coef2,       [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mov         r5d,          %2\r\n\r\n    pshufd      coef2,       coef2,      0\r\n    mova        t2,          [pw_512]\r\n    mova        Tm0,         [tab_Tm]\r\n    mova        Tm1,         [tab_Tm + 16]\r\n\r\n.loop:\r\n    FILTER_H4_w%1   t0, t1, t2, t3\r\n    add         srcq,        srcstrideq\r\n    add         dstq,        dststrideq\r\n\r\n    dec         r5d\r\n    jnz        .loop\r\n\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_W 16,  4\r\n    IPFILTER_CHROMA_W 16,  8\r\n    IPFILTER_CHROMA_W 16, 12\r\n    IPFILTER_CHROMA_W 16, 16\r\n    IPFILTER_CHROMA_W 16, 32\r\n    IPFILTER_CHROMA_W 32,  8\r\n    IPFILTER_CHROMA_W 32, 16\r\n    IPFILTER_CHROMA_W 32, 24\r\n    IPFILTER_CHROMA_W 24, 32\r\n    IPFILTER_CHROMA_W 32, 32\r\n\r\n    IPFILTER_CHROMA_W 16, 24\r\n    IPFILTER_CHROMA_W 16, 64\r\n    IPFILTER_CHROMA_W 32, 48\r\n    IPFILTER_CHROMA_W 24, 64\r\n    IPFILTER_CHROMA_W 32, 64\r\n\r\n    IPFILTER_CHROMA_W 64, 64\r\n    IPFILTER_CHROMA_W 64, 32\r\n    IPFILTER_CHROMA_W 64, 48\r\n    IPFILTER_CHROMA_W 48, 64\r\n    IPFILTER_CHROMA_W 64, 16\r\n\r\n\r\n%macro FILTER_H8_W8 7-8   ; t0, t1, t2, t3, coef, c512, src, dst\r\n    
movu        %1, %7\r\n    pshufb      %2, %1, [tab_Lm +  0]\r\n    pmaddubsw   %2, %5\r\n    pshufb      %3, %1, [tab_Lm + 16]\r\n    pmaddubsw   %3, %5\r\n    phaddw      %2, %3\r\n    pshufb      %4, %1, [tab_Lm + 32]\r\n    pmaddubsw   %4, %5\r\n    pshufb      %1, %1, [tab_Lm + 48]\r\n    pmaddubsw   %1, %5\r\n    phaddw      %4, %1\r\n    phaddw      %2, %4\r\n  %if %0 == 8\r\n    pmulhrsw    %2, %6\r\n    packuswb    %2, %2\r\n    movh        %8, %2\r\n  %endif\r\n%endmacro\r\n\r\n%macro FILTER_H8_W4 2\r\n    movu        %1, [r0 - 3 + r5]\r\n    pshufb      %2, %1, [tab_Lm]\r\n    pmaddubsw   %2, m3\r\n    pshufb      m7, %1, [tab_Lm + 16]\r\n    pmaddubsw   m7, m3\r\n    phaddw      %2, m7\r\n    phaddw      %2, %2\r\n%endmacro\r\n\r\n;----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;----------------------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_LUMA 3\r\nINIT_XMM sse4\r\ncglobal interp_8tap_horiz_%3_%1x%2, 4,7,8\r\n\r\n    mov       r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea       r6, [tab_LumaCoeff]\r\n    movh      m3, [r6 + r4 * 8]\r\n%else\r\n    movh      m3, [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    punpcklqdq  m3, m3\r\n\r\n%ifidn %3, pp\r\n    mova      m2, [pw_512]\r\n%else\r\n    mova      m2, [pw_2000]\r\n%endif\r\n\r\n    mov       r4d, %2\r\n%ifidn %3, ps\r\n    add       r3, r3\r\n    cmp       r5m, byte 0\r\n    je        .loopH\r\n    lea       r6, [r1 + 2 * r1]\r\n    sub       r0, r6\r\n    add       r4d, 7\r\n%endif\r\n\r\n.loopH:\r\n    xor       r5, r5\r\n%rep %1 / 8\r\n  %ifidn %3, pp\r\n    FILTER_H8_W8  m0, m1, m4, m5, m3, m2, [r0 - 3 + r5], [r2 + r5]\r\n  %else\r\n    FILTER_H8_W8  m0, m1, m4, m5, m3, UNUSED, [r0 - 3 + r5]\r\n    psubw     m1, m2\r\n    
movu      [r2 + 2 * r5], m1\r\n  %endif\r\n    add       r5, 8\r\n%endrep\r\n\r\n%rep (%1 % 8) / 4\r\n    FILTER_H8_W4  m0, m1\r\n  %ifidn %3, pp\r\n    pmulhrsw  m1, m2\r\n    packuswb  m1, m1\r\n    movd      [r2 + r5], m1\r\n  %else\r\n    psubw     m1, m2\r\n    movh      [r2 + 2 * r5], m1\r\n  %endif\r\n%endrep\r\n\r\n    add       r0, r1\r\n    add       r2, r3\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n    RET\r\n%endmacro\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_pp_4x4, 4,6,6\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeff]\r\n    vpbroadcastq    m0, [r5 + r4 * 8]\r\n%else\r\n    vpbroadcastq    m0, [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n\r\n    mova            m1, [tab_Lm]\r\n    vpbroadcastd    m2, [pw_1]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    sub             r0, 3\r\n    ; Row 0-1\r\n    vbroadcasti128  m3, [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m3, m1\r\n    pmaddubsw       m3, m0\r\n    pmaddwd         m3, m2\r\n    vbroadcasti128  m4, [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddwd         m4, m2\r\n    phaddd          m3, m4                          ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]\r\n\r\n    ; Row 2-3\r\n    lea             r0, [r0 + r1 * 2]\r\n    vbroadcasti128  m4, [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddwd         m4, m2\r\n    vbroadcasti128  m5, [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m5, m1\r\n    pmaddubsw       m5, m0\r\n    pmaddwd         m5, m2\r\n    phaddd          m4, m5                          ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]\r\n\r\n    packssdw        m3, m4                         
 ; WORD [R3D R3C R2D R2C R1D R1C R0D R0C R3B R3A R2B R2A R1B R1A R0B R0A]\r\n    pmulhrsw        m3, [pw_512]\r\n    vextracti128    xm4, m3, 1\r\n    packuswb        xm3, xm4                        ; BYTE [R3D R3C R2D R2C R1D R1C R0D R0C R3B R3A R2B R2A R1B R1A R0B R0A]\r\n    pshufb          xm3, [interp4_shuf]             ; [row3 row1 row2 row0]\r\n\r\n    lea             r0, [r3 * 3]\r\n    movd            [r2], xm3\r\n    pextrd          [r2+r3], xm3, 2\r\n    pextrd          [r2+r3*2], xm3, 1\r\n    pextrd          [r2+r0], xm3, 3\r\n    RET\r\n\r\n%macro FILTER_HORIZ_LUMA_AVX2_4xN 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_horiz_pp_4x%1, 4, 6, 9\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeff]\r\n    vpbroadcastq    m0, [r5 + r4 * 8]\r\n%else\r\n    vpbroadcastq    m0, [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n\r\n    mova            m1, [tab_Lm]\r\n    mova            m2, [pw_1]\r\n    mova            m7, [interp8_hps_shuf]\r\n    mova            m8, [pw_512]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n    lea             r4, [r1 * 3]\r\n    lea             r5, [r3 * 3]\r\n    sub             r0, 3\r\n%rep %1 / 8\r\n    ; Row 0-1\r\n    vbroadcasti128  m3, [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m3, m1\r\n    pmaddubsw       m3, m0\r\n    pmaddwd         m3, m2\r\n    vbroadcasti128  m4, [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddwd         m4, m2\r\n    phaddd          m3, m4                          ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]\r\n\r\n    ; Row 2-3\r\n    vbroadcasti128  m4, [r0 + r1 * 2]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddwd         m4, m2\r\n    vbroadcasti128  m5, [r0 + r4]   
                ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m5, m1\r\n    pmaddubsw       m5, m0\r\n    pmaddwd         m5, m2\r\n    phaddd          m4, m5                          ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]\r\n\r\n    packssdw        m3, m4                          ; WORD [R3D R3C R2D R2C R1D R1C R0D R0C R3B R3A R2B R2A R1B R1A R0B R0A]\r\n    lea             r0, [r0 + r1 * 4]\r\n    ; Row 4-5\r\n    vbroadcasti128  m5, [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m5, m1\r\n    pmaddubsw       m5, m0\r\n    pmaddwd         m5, m2\r\n    vbroadcasti128  m4, [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddwd         m4, m2\r\n    phaddd          m5, m4                          ; DWORD [R5D R5C R4D R4C R5B R5A R4B R4A]\r\n\r\n    ; Row 6-7\r\n    vbroadcasti128  m4, [r0 + r1 * 2]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddwd         m4, m2\r\n    vbroadcasti128  m6, [r0 + r4]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m6, m1\r\n    pmaddubsw       m6, m0\r\n    pmaddwd         m6, m2\r\n    phaddd          m4, m6                          ; DWORD [R7D R7C R6D R6C R7B R7A R6B R6A]\r\n\r\n    packssdw        m5, m4                          ; WORD [R7D R7C R6D R6C R5D R5C R4D R4C R7B R7A R6B R6A R5B R5A R4B R4A]\r\n    vpermd          m3, m7, m3\r\n    vpermd          m5, m7, m5\r\n    pmulhrsw        m3, m8\r\n    pmulhrsw        m5, m8\r\n    packuswb        m3, m5\r\n    vextracti128    xm5, m3, 1\r\n\r\n    movd            [r2], xm3\r\n    pextrd          [r2 + r3], xm3, 1\r\n    movd            [r2 + r3 * 2], xm5\r\n    pextrd          [r2 + r5], xm5, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm3, 2\r\n    pextrd          [r2 + r3], xm3, 3\r\n    pextrd          [r2 + 
r3 * 2], xm5, 2\r\n    pextrd          [r2 + r5], xm5, 3\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n%endrep\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_HORIZ_LUMA_AVX2_4xN 8\r\n    FILTER_HORIZ_LUMA_AVX2_4xN 16\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_pp_8x4, 4, 6, 7\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeff]\r\n    vpbroadcastq    m0, [r5 + r4 * 8]\r\n%else\r\n    vpbroadcastq    m0, [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n\r\n    mova            m1, [tab_Lm]\r\n    mova            m2, [tab_Lm + 32]\r\n\r\n    ; register map\r\n    ; m0     - interpolate coeff\r\n    ; m1, m2 - shuffle order table\r\n\r\n    sub             r0, 3\r\n    lea             r5, [r1 * 3]\r\n    lea             r4, [r3 * 3]\r\n\r\n    ; Row 0\r\n    vbroadcasti128  m3, [r0]                        ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m4, m3, m2\r\n    pshufb          m3, m1\r\n    pmaddubsw       m3, m0\r\n    pmaddubsw       m4, m0\r\n    phaddw          m3, m4\r\n    ; Row 1\r\n    vbroadcasti128  m4, [r0 + r1]                   ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m5, m4, m2\r\n    pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddubsw       m5, m0\r\n    phaddw          m4, m5\r\n\r\n    phaddw          m3, m4                          ; WORD [R1H R1G R1D R1C R0H R0G R0D R0C R1F R1E R1B R1A R0F R0E R0B R0A]\r\n    pmulhrsw        m3, [pw_512]\r\n\r\n    ; Row 2\r\n    vbroadcasti128  m4, [r0 + r1 * 2]               ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m5, m4, m2\r\n    pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddubsw       m5, m0\r\n    phaddw          m4, m5\r\n    ; Row 3\r\n    vbroadcasti128  m5, [r0 + r5]                   ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m6, m5, m2\r\n    pshufb          m5, m1\r\n    pmaddubsw       m5, m0\r\n    pmaddubsw       
m6, m0\r\n    phaddw          m5, m6\r\n\r\n    phaddw          m4, m5                          ; WORD [R3H R3G R3D R3C R2H R2G R2D R2C R3F R3E R3B R3A R2F R2E R2B R2A]\r\n    pmulhrsw        m4, [pw_512]\r\n\r\n    packuswb        m3, m4\r\n    vextracti128    xm4, m3, 1\r\n    punpcklwd       xm5, xm3, xm4\r\n\r\n    movq            [r2], xm5\r\n    movhps          [r2 + r3], xm5\r\n\r\n    punpckhwd       xm5, xm3, xm4\r\n    movq            [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r4], xm5\r\n    RET\r\n\r\n%macro IPFILTER_LUMA_AVX2_8xN 2\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_pp_%1x%2, 4, 7, 7\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeff]\r\n    vpbroadcastq    m0, [r5 + r4 * 8]\r\n%else\r\n    vpbroadcastq    m0, [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n\r\n    mova            m1, [tab_Lm]\r\n    mova            m2, [tab_Lm + 32]\r\n\r\n    ; register map\r\n    ; m0     - interpolate coeff\r\n    ; m1, m2 - shuffle order table\r\n\r\n    sub             r0, 3\r\n    lea             r5, [r1 * 3]\r\n    lea             r6, [r3 * 3]\r\n    mov             r4d, %2 / 4\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128  m3, [r0]                        ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m4, m3, m2\r\n    pshufb          m3, m1\r\n    pmaddubsw       m3, m0\r\n    pmaddubsw       m4, m0\r\n    phaddw          m3, m4\r\n    ; Row 1\r\n    vbroadcasti128  m4, [r0 + r1]                   ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m5, m4, m2\r\n    pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddubsw       m5, m0\r\n    phaddw          m4, m5\r\n\r\n    phaddw          m3, m4                          ; WORD [R1H R1G R1D R1C R0H R0G R0D R0C R1F R1E R1B R1A R0F R0E R0B R0A]\r\n    pmulhrsw        m3, [pw_512]\r\n\r\n    ; Row 2\r\n    vbroadcasti128  m4, [r0 + r1 * 2]               ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m5, m4, m2\r\n    
pshufb          m4, m1\r\n    pmaddubsw       m4, m0\r\n    pmaddubsw       m5, m0\r\n    phaddw          m4, m5\r\n    ; Row 3\r\n    vbroadcasti128  m5, [r0 + r5]                   ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          m6, m5, m2\r\n    pshufb          m5, m1\r\n    pmaddubsw       m5, m0\r\n    pmaddubsw       m6, m0\r\n    phaddw          m5, m6\r\n\r\n    phaddw          m4, m5                          ; WORD [R3H R3G R3D R3C R2H R2G R2D R2C R3F R3E R3B R3A R2F R2E R2B R2A]\r\n    pmulhrsw        m4, [pw_512]\r\n\r\n    packuswb        m3, m4\r\n    vextracti128    xm4, m3, 1\r\n    punpcklwd       xm5, xm3, xm4\r\n\r\n    movq            [r2], xm5\r\n    movhps          [r2 + r3], xm5\r\n\r\n    punpckhwd       xm5, xm3, xm4\r\n    movq            [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r6], xm5\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n    dec             r4d\r\n    jnz             .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_LUMA_AVX2_8xN 8, 8\r\n    IPFILTER_LUMA_AVX2_8xN 8, 16\r\n    IPFILTER_LUMA_AVX2_8xN 8, 32\r\n\r\n%macro IPFILTER_LUMA_AVX2 2\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_pp_%1x%2, 4,6,8\r\n    sub               r0,        3\r\n    mov               r4d,       r4m\r\n%ifdef PIC\r\n    lea               r5,        [tab_LumaCoeff]\r\n    vpbroadcastd      m0,        [r5 + r4 * 8]\r\n    vpbroadcastd      m1,        [r5 + r4 * 8 + 4]\r\n%else\r\n    vpbroadcastd      m0,         [tab_LumaCoeff + r4 * 8]\r\n    vpbroadcastd      m1,         [tab_LumaCoeff + r4 * 8 + 4]\r\n%endif\r\n    movu              m3,         [tab_Tm + 16]\r\n    vpbroadcastd      m7,         [pw_1]\r\n\r\n    ; register map\r\n    ; m0 , m1 interpolate coeff\r\n    ; m3 - shuffle order table (tab_Tm + 16)\r\n    ; m7 - pw_1\r\n\r\n    mov               r4d,        %2/2\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m4,         [r0]                        ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n 
   pshufb            m5,         m4,     m3\r\n    pshufb            m4,         [tab_Tm]\r\n    pmaddubsw         m4,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m4,         m5\r\n    pmaddwd           m4,         m7\r\n    vbroadcasti128    m5,         [r0 + 8]                    ; second 8 elements in Row0\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [tab_Tm]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             m5,         m6\r\n    pmaddwd           m5,         m7\r\n    packssdw          m4,         m5                          ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00]\r\n    pmulhrsw          m4,         [pw_512]\r\n    ; Row 1\r\n    vbroadcasti128    m2,         [r0 + r1]                        ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,         m2,     m3\r\n    pshufb            m2,         [tab_Tm]\r\n    pmaddubsw         m2,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m2,         m5\r\n    pmaddwd           m2,         m7\r\n    vbroadcasti128    m5,         [r0 + r1 + 8]                    ; second 8 elements in Row1\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [tab_Tm]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             m5,         m6\r\n    pmaddwd           m5,         m7\r\n    packssdw          m2,         m5                          ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00]\r\n    pmulhrsw          m2,         [pw_512]\r\n    packuswb          m4,         m2\r\n    vpermq            m4,         m4,     11011000b\r\n    vextracti128      xm5,        m4,     1\r\n    pshufd            xm4,        xm4,    11011000b\r\n    pshufd            xm5,        xm5,    11011000b\r\n    movu              [r2],       xm4\r\n    movu              [r2+r3],    xm5\r\n    lea               r0,   
      [r0 + r1 * 2]\r\n    lea               r2,         [r2 + r3 * 2]\r\n    dec               r4d\r\n    jnz              .loop\r\n    RET\r\n%endmacro\r\n\r\n%macro IPFILTER_LUMA_32x_avx2 2\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_pp_%1x%2, 4,6,8\r\n    sub               r0,         3\r\n    mov               r4d,        r4m\r\n%ifdef PIC\r\n    lea               r5,         [tab_LumaCoeff]\r\n    vpbroadcastd      m0,         [r5 + r4 * 8]\r\n    vpbroadcastd      m1,         [r5 + r4 * 8 + 4]\r\n%else\r\n    vpbroadcastd      m0,         [tab_LumaCoeff + r4 * 8]\r\n    vpbroadcastd      m1,         [tab_LumaCoeff + r4 * 8 + 4]\r\n%endif\r\n    movu              m3,         [tab_Tm + 16]\r\n    vpbroadcastd      m7,         [pw_1]\r\n\r\n    ; register map\r\n    ; m0 , m1 interpolate coeff\r\n    ; m3 - shuffle order table (tab_Tm + 16)\r\n    ; m7 - pw_1\r\n\r\n    mov               r4d,        %2\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m4,         [r0]                        ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,         m4,     m3\r\n    pshufb            m4,         [tab_Tm]\r\n    pmaddubsw         m4,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m4,         m5\r\n    pmaddwd           m4,         m7\r\n    vbroadcasti128    m5,         [r0 + 8]\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [tab_Tm]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             m5,         m6\r\n    pmaddwd           m5,         m7\r\n    packssdw          m4,         m5                          ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00]\r\n    pmulhrsw          m4,         [pw_512]\r\n    vbroadcasti128    m2,         [r0 + 16]\r\n    pshufb            m5,         m2,     m3\r\n    pshufb            m2,         [tab_Tm]\r\n    pmaddubsw         m2,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw            
 m2,         m5\r\n    pmaddwd           m2,         m7\r\n    vbroadcasti128    m5,         [r0 + 24]\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [tab_Tm]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             m5,         m6\r\n    pmaddwd           m5,         m7\r\n    packssdw          m2,         m5\r\n    pmulhrsw          m2,         [pw_512]\r\n    packuswb          m4,         m2\r\n    vpermq            m4,         m4,     11011000b\r\n    vextracti128      xm5,        m4,     1\r\n    pshufd            xm4,        xm4,    11011000b\r\n    pshufd            xm5,        xm5,    11011000b\r\n    movu              [r2],       xm4\r\n    movu              [r2 + 16],  xm5\r\n    lea               r0,         [r0 + r1]\r\n    lea               r2,         [r2 + r3]\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n%endmacro\r\n\r\n%macro IPFILTER_LUMA_64x_avx2 2\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_pp_%1x%2, 4,6,8\r\n    sub               r0,    3\r\n    mov               r4d,   r4m\r\n%ifdef PIC\r\n    lea               r5,        [tab_LumaCoeff]\r\n    vpbroadcastd      m0,        [r5 + r4 * 8]\r\n    vpbroadcastd      m1,        [r5 + r4 * 8 + 4]\r\n%else\r\n    vpbroadcastd      m0,        [tab_LumaCoeff + r4 * 8]\r\n    vpbroadcastd      m1,        [tab_LumaCoeff + r4 * 8 + 4]\r\n%endif\r\n    movu              m3,        [tab_Tm + 16]\r\n    vpbroadcastd      m7,        [pw_1]\r\n\r\n    ; register map\r\n    ; m0 , m1 interpolate coeff\r\n    ; m3 - shuffle order table (tab_Tm + 16)\r\n    ; m7 - pw_1\r\n\r\n    mov               r4d,   %2\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m4,        [r0]                        ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,        m4,    m3\r\n    pshufb            m4,        [tab_Tm]\r\n    pmaddubsw         m4,        m0\r\n    pmaddubsw         m5,        m1\r\n    paddw   
          m4,        m5\r\n    pmaddwd           m4,        m7\r\n    vbroadcasti128    m5,        [r0 + 8]\r\n    pshufb            m6,        m5,    m3\r\n    pshufb            m5,        [tab_Tm]\r\n    pmaddubsw         m5,        m0\r\n    pmaddubsw         m6,        m1\r\n    paddw             m5,        m6\r\n    pmaddwd           m5,        m7\r\n    packssdw          m4,        m5                          ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00]\r\n    pmulhrsw          m4,        [pw_512]\r\n    vbroadcasti128    m2,        [r0 + 16]\r\n    pshufb            m5,        m2,    m3\r\n    pshufb            m2,        [tab_Tm]\r\n    pmaddubsw         m2,        m0\r\n    pmaddubsw         m5,        m1\r\n    paddw             m2,        m5\r\n    pmaddwd           m2,        m7\r\n    vbroadcasti128    m5,        [r0 + 24]\r\n    pshufb            m6,        m5,    m3\r\n    pshufb            m5,        [tab_Tm]\r\n    pmaddubsw         m5,        m0\r\n    pmaddubsw         m6,        m1\r\n    paddw             m5,        m6\r\n    pmaddwd           m5,        m7\r\n    packssdw          m2,        m5\r\n    pmulhrsw          m2,        [pw_512]\r\n    packuswb          m4,        m2\r\n    vpermq            m4,        m4,    11011000b\r\n    vextracti128      xm5,       m4,    1\r\n    pshufd            xm4,       xm4,   11011000b\r\n    pshufd            xm5,       xm5,   11011000b\r\n    movu              [r2],      xm4\r\n    movu              [r2 + 16], xm5\r\n\r\n    vbroadcasti128    m4,        [r0 + 32]\r\n    pshufb            m5,        m4,    m3\r\n    pshufb            m4,        [tab_Tm]\r\n    pmaddubsw         m4,        m0\r\n    pmaddubsw         m5,        m1\r\n    paddw             m4,        m5\r\n    pmaddwd           m4,        m7\r\n    vbroadcasti128    m5,        [r0 + 40]\r\n    pshufb            m6,        m5,    m3\r\n    pshufb            m5,        [tab_Tm]\r\n    pmaddubsw         m5,        m0\r\n    pmaddubsw  
       m6,        m1\r\n    paddw             m5,        m6\r\n    pmaddwd           m5,        m7\r\n    packssdw          m4,        m5\r\n    pmulhrsw          m4,        [pw_512]\r\n    vbroadcasti128    m2,        [r0 + 48]\r\n    pshufb            m5,        m2,    m3\r\n    pshufb            m2,        [tab_Tm]\r\n    pmaddubsw         m2,        m0\r\n    pmaddubsw         m5,        m1\r\n    paddw             m2,        m5\r\n    pmaddwd           m2,        m7\r\n    vbroadcasti128    m5,        [r0 + 56]\r\n    pshufb            m6,        m5,    m3\r\n    pshufb            m5,        [tab_Tm]\r\n    pmaddubsw         m5,        m0\r\n    pmaddubsw         m6,        m1\r\n    paddw             m5,        m6\r\n    pmaddwd           m5,        m7\r\n    packssdw          m2,        m5\r\n    pmulhrsw          m2,        [pw_512]\r\n    packuswb          m4,        m2\r\n    vpermq            m4,        m4,    11011000b\r\n    vextracti128      xm5,       m4,    1\r\n    pshufd            xm4,       xm4,   11011000b\r\n    pshufd            xm5,       xm5,   11011000b\r\n    movu              [r2 +32],  xm4\r\n    movu              [r2 + 48], xm5\r\n\r\n    lea               r0,        [r0 + r1]\r\n    lea               r2,        [r2 + r3]\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n%endmacro\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_pp_48x64, 4,6,8\r\n    sub               r0,         3\r\n    mov               r4d,        r4m\r\n%ifdef PIC\r\n    lea               r5,         [tab_LumaCoeff]\r\n    vpbroadcastd      m0,         [r5 + r4 * 8]\r\n    vpbroadcastd      m1,         [r5 + r4 * 8 + 4]\r\n%else\r\n    vpbroadcastd      m0,         [tab_LumaCoeff + r4 * 8]\r\n    vpbroadcastd      m1,         [tab_LumaCoeff + r4 * 8 + 4]\r\n%endif\r\n    movu              m3,         [tab_Tm + 16]\r\n    vpbroadcastd      m7,         [pw_1]\r\n\r\n    ; register map\r\n    ; m0 , m1 interpolate coeff\r\n    ; m2 , m2  
shuffle order table\r\n    ; m7 - pw_1\r\n\r\n    mov               r4d,        64\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m4,         [r0]                        ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,         m4,     m3\r\n    pshufb            m4,         [tab_Tm]\r\n    pmaddubsw         m4,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m4,         m5\r\n    pmaddwd           m4,         m7\r\n    vbroadcasti128    m5,         [r0 + 8]\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [tab_Tm]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             m5,         m6\r\n    pmaddwd           m5,         m7\r\n    packssdw          m4,         m5                          ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00]\r\n    pmulhrsw          m4,         [pw_512]\r\n\r\n    vbroadcasti128    m2,         [r0 + 16]\r\n    pshufb            m5,         m2,     m3\r\n    pshufb            m2,         [tab_Tm]\r\n    pmaddubsw         m2,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m2,         m5\r\n    pmaddwd           m2,         m7\r\n    vbroadcasti128    m5,         [r0 + 24]\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [tab_Tm]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             m5,         m6\r\n    pmaddwd           m5,         m7\r\n    packssdw          m2,         m5\r\n    pmulhrsw          m2,         [pw_512]\r\n    packuswb          m4,         m2\r\n    vpermq            m4,         m4,     11011000b\r\n    vextracti128      xm5,        m4,     1\r\n    pshufd            xm4,        xm4,    11011000b\r\n    pshufd            xm5,        xm5,    11011000b\r\n    movu              [r2],       xm4\r\n    movu              [r2 + 16],  xm5\r\n\r\n    vbroadcasti128    m4,         [r0 + 
32]\r\n    pshufb            m5,         m4,     m3\r\n    pshufb            m4,         [tab_Tm]\r\n    pmaddubsw         m4,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m4,         m5\r\n    pmaddwd           m4,         m7\r\n    vbroadcasti128    m5,         [r0 + 40]\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [tab_Tm]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             m5,         m6\r\n    pmaddwd           m5,         m7\r\n    packssdw          m4,         m5\r\n    pmulhrsw          m4,         [pw_512]\r\n    packuswb          m4,         m4\r\n    vpermq            m4,         m4,     11011000b\r\n    pshufd            xm4,        xm4,    11011000b\r\n    movu              [r2 + 32],  xm4\r\n\r\n    lea               r0,         [r0 + r1]\r\n    lea               r2,         [r2 + r3]\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_4x4, 4,6,6\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vpbroadcastd      m2,           [pw_1]\r\n    vbroadcasti128    m1,           [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec                r0\r\n\r\n    ; Row 0-1\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    vinserti128       m3,           m3,      [r0 + r1],     1\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    ; Row 2-3\r\n    lea               r0,           [r0 + r1 * 2]\r\n    vbroadcasti128    m4,           [r0] 
                     ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    vinserti128       m4,           m4,      [r0 + r1],     1\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           [pw_512]\r\n    vextracti128      xm4,          m3,     1\r\n    packuswb          xm3,          xm4\r\n\r\n    lea               r0,           [r3 * 3]\r\n    movd              [r2],         xm3\r\n    pextrd            [r2+r3],      xm3,     2\r\n    pextrd            [r2+r3*2],    xm3,     1\r\n    pextrd            [r2+r0],      xm3,     3\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_2x4, 4, 6, 3\r\n    mov               r4d,           r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,            [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,            [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    dec               r0\r\n    lea               r4,            [r1 * 3]\r\n    movq              xm1,           [r0]\r\n    movhps            xm1,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m1,            m1,          xm2,          1\r\n    pshufb            m1,            [interp4_hpp_shuf]\r\n    pmaddubsw         m1,            m0\r\n    pmaddwd           m1,            [pw_1]\r\n    vextracti128      xm2,           m1,          1\r\n    packssdw          xm1,           xm2\r\n    pmulhrsw          xm1,           [pw_512]\r\n    packuswb          xm1,           xm1\r\n\r\n    lea               r4,            [r3 * 3]\r\n    pextrw            [r2],          xm1,         0\r\n    pextrw            [r2 + r3],     xm1,         1\r\n    pextrw            [r2 + r3 * 2], xm1,         2\r\n    pextrw            [r2 + r4],     xm1,         3\r\n    
RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_2x8, 4, 6, 6\r\n    mov               r4d,           r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,            [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,            [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m4,            [interp4_hpp_shuf]\r\n    mova              m5,            [pw_1]\r\n    dec               r0\r\n    lea               r4,            [r1 * 3]\r\n    movq              xm1,           [r0]\r\n    movhps            xm1,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m1,            m1,          xm2,          1\r\n    lea               r0,            [r0 + r1 * 4]\r\n    movq              xm3,           [r0]\r\n    movhps            xm3,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m3,            m3,          xm2,          1\r\n\r\n    pshufb            m1,            m4\r\n    pshufb            m3,            m4\r\n    pmaddubsw         m1,            m0\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m1,            m5\r\n    pmaddwd           m3,            m5\r\n    packssdw          m1,            m3\r\n    pmulhrsw          m1,            [pw_512]\r\n    vextracti128      xm2,           m1,          1\r\n    packuswb          xm1,           xm2\r\n\r\n    lea               r4,            [r3 * 3]\r\n    pextrw            [r2],          xm1,         0\r\n    pextrw            [r2 + r3],     xm1,         1\r\n    pextrw            [r2 + r3 * 2], xm1,         4\r\n    pextrw            [r2 + r4],     xm1,         5\r\n    lea               r2,            [r2 + r3 * 4]\r\n    pextrw            [r2],          xm1,         2\r\n    pextrw            [r2 + r3],     xm1,         
3\r\n    pextrw            [r2 + r3 * 2], xm1,         6\r\n    pextrw            [r2 + r4],     xm1,         7\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_32x32, 4,6,7\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m1,           [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,           [pw_1]\r\n    mova              m6,           [pw_512]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,          32\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 4]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    vbroadcasti128    m4,           [r0 + 16]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + 20]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n\r\n    movu              [r2],         m3\r\n    lea               r2,           [r2 + r3]\r\n    lea               r0,           [r0 + r1]\r\n    dec   
            r4d\r\n    jnz               .loop\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_16x16, 4, 6, 7\r\n    mov               r4d,          r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m6,           [pw_512]\r\n    mova              m1,           [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,           [pw_1]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,          8\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 4]                    ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,           [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + r1 + 4]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n\r\n    vextracti128 
     xm4,          m3,       1\r\n    movu              [r2],         xm3\r\n    movu              [r2 + r3],    xm4\r\n    lea               r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1 * 2]\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\n    IPFILTER_LUMA 4, 4, pp\r\n    IPFILTER_LUMA 4, 8, pp\r\n    IPFILTER_LUMA 12, 16, pp\r\n    IPFILTER_LUMA 4, 16, pp\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_8x8, 4,6,6\r\n    mov               r4d,    r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    movu              m1,           [tab_Tm]\r\n    vpbroadcastd      m2,           [pw_1]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    sub               r0,           1\r\n    mov               r4d,          2\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,           [r0 + r1]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           [pw_512]\r\n    lea       
        r0,           [r0 + r1 * 2]\r\n\r\n    ; Row 2\r\n    vbroadcasti128    m4,           [r0 ]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    ; Row 3\r\n    vbroadcasti128    m5,           [r0 + r1]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           [pw_512]\r\n\r\n    packuswb          m3,           m4\r\n    mova              m5,           [interp_4tap_8x8_horiz_shuf]\r\n    vpermd            m3,           m5,     m3\r\n    vextracti128      xm4,          m3,     1\r\n    movq              [r2],         xm3\r\n    movhps            [r2 + r3],    xm3\r\n    lea               r2,           [r2 + r3 * 2]\r\n    movq              [r2],         xm4\r\n    movhps            [r2 + r3],    xm4\r\n    lea               r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1*2]\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n\r\n    IPFILTER_LUMA_AVX2 16, 4\r\n    IPFILTER_LUMA_AVX2 16, 8\r\n    IPFILTER_LUMA_AVX2 16, 12\r\n    IPFILTER_LUMA_AVX2 16, 16\r\n    IPFILTER_LUMA_AVX2 16, 32\r\n    IPFILTER_LUMA_AVX2 16, 64\r\n\r\n    IPFILTER_LUMA_32x_avx2 32 , 8\r\n    IPFILTER_LUMA_32x_avx2 32 , 16\r\n    IPFILTER_LUMA_32x_avx2 32 , 24\r\n    IPFILTER_LUMA_32x_avx2 32 , 32\r\n    IPFILTER_LUMA_32x_avx2 32 , 64\r\n\r\n    IPFILTER_LUMA_64x_avx2 64 , 64\r\n    IPFILTER_LUMA_64x_avx2 64 , 48\r\n    IPFILTER_LUMA_64x_avx2 64 , 32\r\n    IPFILTER_LUMA_64x_avx2 64 , 16\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_8x2, 4, 6, 5\r\n    mov               r4d,          r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      
m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m1,           [tab_Tm]\r\n    mova              m2,           [pw_1]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,           [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           [pw_512]\r\n    vextracti128      xm4,          m3,          1\r\n    packuswb          xm3,          xm4\r\n    pshufd            xm3,          xm3,         11011000b\r\n    movq              [r2],         xm3\r\n    movhps            [r2 + r3],    xm3\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_8x6, 4, 6, 7\r\n    mov               r4d,           r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,            [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,            [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m1,            [tab_Tm]\r\n    mova              m2,            [pw_1]\r\n    mova              m6,            [pw_512]\r\n    lea               r4,            [r1 * 3]\r\n    lea               r5,            [r3 * 3]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    ; Row 0\r\n    vbroadcasti128    m3,            [r0]               
         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,            m1\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m3,            m2\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,            [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,            m1\r\n    pmaddubsw         m4,            m0\r\n    pmaddwd           m4,            m2\r\n    packssdw          m3,            m4\r\n    pmulhrsw          m3,            m6\r\n\r\n    ; Row 2\r\n    vbroadcasti128    m4,            [r0 + r1 * 2]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,            m1\r\n    pmaddubsw         m4,            m0\r\n    pmaddwd           m4,            m2\r\n\r\n    ; Row 3\r\n    vbroadcasti128    m5,            [r0 + r4]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,            m1\r\n    pmaddubsw         m5,            m0\r\n    pmaddwd           m5,            m2\r\n    packssdw          m4,            m5\r\n    pmulhrsw          m4,            m6\r\n\r\n    packuswb          m3,            m4\r\n    mova              m5,            [interp8_hps_shuf]\r\n    vpermd            m3,            m5,          m3\r\n    vextracti128      xm4,           m3,          1\r\n    movq              [r2],          xm3\r\n    movhps            [r2 + r3],     xm3\r\n    movq              [r2 + r3 * 2], xm4\r\n    movhps            [r2 + r5],     xm4\r\n    lea               r2,            [r2 + r3 * 4]\r\n    lea               r0,            [r0 + r1 * 4]\r\n    ; Row 4\r\n    vbroadcasti128    m3,            [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,            m1\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m3,            m2\r\n\r\n    ; Row 5\r\n    vbroadcasti128    m4,            [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            
m4,            m1\r\n    pmaddubsw         m4,            m0\r\n    pmaddwd           m4,            m2\r\n    packssdw          m3,            m4\r\n    pmulhrsw          m3,            m6\r\n    vextracti128      xm4,           m3,          1\r\n    packuswb          xm3,           xm4\r\n    pshufd            xm3,           xm3,         11011000b\r\n    movq              [r2],          xm3\r\n    movhps            [r2 + r3],     xm3\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_6x8, 4, 6, 7\r\n    mov               r4d,               r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,                [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,                [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,                [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m1,                [tab_Tm]\r\n    mova              m2,                [pw_1]\r\n    mova              m6,                [pw_512]\r\n    lea               r4,                [r1 * 3]\r\n    lea               r5,                [r3 * 3]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n%rep 2\r\n    ; Row 0\r\n    vbroadcasti128    m3,                [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,                m1\r\n    pmaddubsw         m3,                m0\r\n    pmaddwd           m3,                m2\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,                [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,                m1\r\n    pmaddubsw         m4,                m0\r\n    pmaddwd           m4,                m2\r\n    packssdw          m3,                m4\r\n    pmulhrsw          m3,                m6\r\n\r\n    ; Row 2\r\n    vbroadcasti128    m4,                [r0 + r1 * 2]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            
m4,                m1\r\n    pmaddubsw         m4,                m0\r\n    pmaddwd           m4,                m2\r\n\r\n    ; Row 3\r\n    vbroadcasti128    m5,                [r0 + r4]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,                m1\r\n    pmaddubsw         m5,                m0\r\n    pmaddwd           m5,                m2\r\n    packssdw          m4,                m5\r\n    pmulhrsw          m4,                m6\r\n\r\n    packuswb          m3,                m4\r\n    vextracti128      xm4,               m3,          1\r\n    movd              [r2],              xm3\r\n    pextrw            [r2 + 4],          xm4,         0\r\n    pextrd            [r2 + r3],         xm3,         1\r\n    pextrw            [r2 + r3 + 4],     xm4,         2\r\n    pextrd            [r2 + r3 * 2],     xm3,         2\r\n    pextrw            [r2 + r3 * 2 + 4], xm4,         4\r\n    pextrd            [r2 + r5],         xm3,         3\r\n    pextrw            [r2 + r5 + 4],     xm4,         6\r\n    lea               r2,                [r2 + r3 * 4]\r\n    lea               r0,                [r0 + r1 * 4]\r\n%endrep\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------;\r\n%macro IPFILTER_CHROMA_HPS_64xN 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_64x%1, 4,7,6\r\n    mov             r4d, r4m\r\n    mov             r5d, r5m\r\n    add             r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea               r6,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n   
 vbroadcasti128     m2,           [pw_1]\r\n    vbroadcasti128     m5,           [pw_2000]\r\n    mova               m1,           [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n    ; m5 - pw_2000\r\n    mov                r6d,          %1\r\n    dec                r0\r\n    test               r5d,          r5d\r\n    je                 .loop\r\n    sub                r0,           r1                          ; isRowExt: start one row above the block\r\n    add                r6d,          3                           ; isRowExt: filter N - 1 = 3 extra rows\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 8]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           m5\r\n    vpermq            m3,           m3,          11011000b\r\n    movu              [r2],         m3\r\n\r\n    vbroadcasti128    m3,           [r0 + 16]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 24]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           m5\r\n    vpermq            m3,           m3,          11011000b\r\n    movu              [r2 + 32],    m3\r\n\r\n    vbroadcasti128    m3,           [r0 + 32]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    
pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 40]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           m5\r\n    vpermq            m3,           m3,          11011000b\r\n    movu              [r2 + 64],    m3\r\n\r\n    vbroadcasti128    m3,           [r0 + 48]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 56]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           m5\r\n    vpermq            m3,           m3,          11011000b\r\n    movu              [r2 + 96],    m3\r\n\r\n    add                r2,           r3\r\n    add                r0,           r1\r\n    dec                r6d\r\n    jnz                .loop\r\n    RET\r\n%endmacro\r\n\r\n   IPFILTER_CHROMA_HPS_64xN 64\r\n   IPFILTER_CHROMA_HPS_64xN 32\r\n   IPFILTER_CHROMA_HPS_64xN 48\r\n   IPFILTER_CHROMA_HPS_64xN 16\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n\r\n%macro IPFILTER_LUMA_PS_4xN_AVX2 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_horiz_ps_4x%1, 
6,7,6\r\n    mov                         r5d,               r5m\r\n    mov                         r4d,               r4m\r\n%ifdef PIC\r\n    lea                         r6,                [tab_LumaCoeff]\r\n    vpbroadcastq                m0,                [r6 + r4 * 8]\r\n%else\r\n    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    mova                        m1,                [tab_Lm]\r\n    add                         r3d,               r3d\r\n    vbroadcasti128              m2,                [pw_2000]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - pw_2000\r\n\r\n    sub                         r0,                3\r\n    test                        r5d,               r5d\r\n    mov                         r5d,               %1                           ; loop count variable - height\r\n    jz                         .preloop\r\n    lea                         r6,                [r1 * 3]                     ; r6 = (N / 2 - 1) * srcStride\r\n    sub                         r0,                r6                           ; r0(src) - 3 * srcStride\r\n    add                         r5d,               7                            ; isRowExt: blkheight += N - 1 = 7 extra rows; the last three rows are handled after the loop\r\n\r\n.preloop:\r\n    lea                         r6,                [r3 * 3]\r\n.loop:\r\n    ; Row 0-1\r\n    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm\r\n    pmaddubsw                   m3,                m0\r\n    vbroadcasti128              m4,                [r0 + r1]                    ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m1\r\n    pmaddubsw                   m4,   
             m0\r\n    phaddw                      m3,                m4                           ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]\r\n\r\n    ; Row 2-3\r\n    lea                         r0,                [r0 + r1 * 2]                ;3rd row(i.e 2nd row)\r\n    vbroadcasti128              m4,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m1\r\n    pmaddubsw                   m4,                m0\r\n    vbroadcasti128              m5,                [r0 + r1]                    ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m5,                m1\r\n    pmaddubsw                   m5,                m0\r\n    phaddw                      m4,                m5                           ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]\r\n    phaddw                      m3,                m4                           ; all rows and col completed.\r\n\r\n    mova                        m5,                [interp8_hps_shuf]\r\n    vpermd                      m3,                m5,               m3\r\n    psubw                       m3,                m2\r\n\r\n    vextracti128                xm4,               m3,               1\r\n    movq                        [r2],              xm3                          ;row 0\r\n    movhps                      [r2 + r3],         xm3                          ;row 1\r\n    movq                        [r2 + r3 * 2],     xm4                          ;row 2\r\n    movhps                      [r2 + r6],         xm4                          ;row 3\r\n\r\n    lea                         r0,                [r0 + r1 * 2]                ; first loop src ->5th row(i.e 4)\r\n    lea                         r2,                [r2 + r3 * 4]                ; first loop dst ->5th row(i.e 4)\r\n    sub                         r5d,               4\r\n    jz                         .end\r\n    cmp                         r5d,          
     4\r\n    jge                        .loop\r\n\r\n    ; Row 8-9\r\n    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m3,                m1\r\n    pmaddubsw                   m3,                m0\r\n    vbroadcasti128              m4,                [r0 + r1]                    ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m1\r\n    pmaddubsw                   m4,                m0\r\n    phaddw                      m3,                m4                           ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]\r\n\r\n    ; Row 10\r\n    vbroadcasti128              m4,                [r0 + r1 * 2]                ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m1\r\n    pmaddubsw                   m4,                m0\r\n    phaddw                      m4,                m4                           ; DWORD [R2D R2C R2D R2C R2B R2A R2B R2A] (row 3 slots duplicate row 2)\r\n    phaddw                      m3,                m4\r\n\r\n    vpermd                      m3,                m5,            m3            ; m5 still holds interp8_hps_shuf from the loop above\r\n    psubw                       m3,                m2\r\n\r\n    vextracti128                xm4,               m3,            1\r\n    movq                        [r2],              xm3\r\n    movhps                      [r2 + r3],         xm3\r\n    movq                        [r2 + r3 * 2],     xm4\r\n.end:\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    IPFILTER_LUMA_PS_4xN_AVX2 4\r\n    IPFILTER_LUMA_PS_4xN_AVX2 8\r\n    IPFILTER_LUMA_PS_4xN_AVX2 16\r\n\r\n%macro IPFILTER_LUMA_PS_8xN_AVX2 1\r\n; TODO: verify and enable on X86 mode\r\n%if ARCH_X86_64 == 1\r\n; void filter_hps(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_ps_8x%1, 4,7,6\r\n    mov                         r5d,      
  r5m\r\n    mov                         r4d,        r4m\r\n    shl                         r4d,        7\r\n%ifdef PIC\r\n    lea                         r6,         [pb_LumaCoeffVer]\r\n    add                         r6,         r4\r\n%else\r\n    lea                         r6,         [pb_LumaCoeffVer + r4]\r\n%endif\r\n    add                         r3d,        r3d\r\n    vpbroadcastd                m0,         [pw_2000]\r\n    sub                         r0,         3\r\n    lea                         r4,         [pb_8tap_hps_0]\r\n    vbroadcasti128              m5,         [r4 + 0 * mmsize]\r\n\r\n    ; check whether the row count is extended for interpolateHV\r\n    test                        r5d,        r5d\r\n    mov                         r5d,        %1\r\n    jz                         .enter_loop\r\n    lea                         r4,         [r1 * 3]                        ; r4 = (N / 2 - 1) * srcStride\r\n    sub                         r0,         r4                              ; r0(src) -= r4\r\n    add                         r5d,        8-1-2                           ; blkheight += N - 1 - 2 (= 5): the final odd row is handled after the loop\r\n\r\n.enter_loop:\r\n    lea                         r4,         [pb_8tap_hps_0]\r\n\r\n    ; ***** register map *****\r\n    ; m0 - pw_2000\r\n    ; r4 - base pointer of shuffle order table\r\n    ; r5 - loop counter\r\n    ; r6 - pointer to LumaCoeff\r\n.loop:\r\n\r\n    ; Row 0-1\r\n    movu                        xm1,        [r0]\r\n    movu                        xm2,        [r0 + r1]\r\n    vinserti128                 m1,         m1,         xm2, 1\r\n    pshufb                      m2,         m1,         m5                  ; [0 1 1 2 2 3 3 4 ...]\r\n    pshufb                      m3,         m1,         [r4 + 1 * mmsize]   ; [2 3 3 4 4 5 5 6 ...]\r\n    pshufb                      m4,         m1,         [r4 + 2 * mmsize]   ; [4 5 5 6 6 7 7 8 ...]\r\n    pshufb                      m1,         m1,         [r4 + 
3 * mmsize]   ; [6 7 7 8 8 9 9 A ...]\r\n    pmaddubsw                   m2,         [r6 + 0 * mmsize]\r\n    pmaddubsw                   m3,         [r6 + 1 * mmsize]\r\n    pmaddubsw                   m4,         [r6 + 2 * mmsize]\r\n    pmaddubsw                   m1,         [r6 + 3 * mmsize]\r\n    paddw                       m2,         m3\r\n    paddw                       m1,         m4\r\n    paddw                       m1,         m2\r\n    psubw                       m1,         m0\r\n\r\n    vextracti128                xm2,        m1,         1\r\n    movu                        [r2],       xm1                             ; row 0\r\n    movu                        [r2 + r3],  xm2                             ; row 1\r\n\r\n    lea                         r0,         [r0 + r1 * 2]                   ; src += 2 rows\r\n    lea                         r2,         [r2 + r3 * 2]                   ; dst += 2 rows\r\n    sub                         r5d,        2\r\n    jg                         .loop\r\n    jz                         .end\r\n\r\n    ; last row (reached only when one row remains after the loop)\r\n    movu                        xm1,        [r0]\r\n    pshufb                      xm2,        xm1,         xm5                ; [0 1 1 2 2 3 3 4 ...]\r\n    pshufb                      xm3,        xm1,         [r4 + 1 * mmsize]  ; [2 3 3 4 4 5 5 6 ...]\r\n    pshufb                      xm4,        xm1,         [r4 + 2 * mmsize]  ; [4 5 5 6 6 7 7 8 ...]\r\n    pshufb                      xm1,        xm1,         [r4 + 3 * mmsize]  ; [6 7 7 8 8 9 9 A ...]\r\n    pmaddubsw                   xm2,        [r6 + 0 * mmsize]\r\n    pmaddubsw                   xm3,        [r6 + 1 * mmsize]\r\n    pmaddubsw                   xm4,        [r6 + 2 * mmsize]\r\n    pmaddubsw                   xm1,        [r6 + 3 * mmsize]\r\n    paddw                       xm2,        xm3\r\n    paddw                       xm1,        xm4\r\n    paddw                       xm1,        xm2\r\n   
 psubw                       xm1,        xm0\r\n    movu                        [r2],       xm1                          ;row 0\r\n.end\r\n    RET\r\n%endif\r\n%endmacro ; IPFILTER_LUMA_PS_8xN_AVX2\r\n\r\n    IPFILTER_LUMA_PS_8xN_AVX2  4\r\n    IPFILTER_LUMA_PS_8xN_AVX2  8\r\n    IPFILTER_LUMA_PS_8xN_AVX2 16\r\n    IPFILTER_LUMA_PS_8xN_AVX2 32\r\n\r\n\r\n%macro IPFILTER_LUMA_PS_16x_AVX2 2\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_horiz_ps_%1x%2, 6, 10, 7\r\n    mov                         r5d,               r5m\r\n    mov                         r4d,               r4m\r\n%ifdef PIC\r\n    lea                         r6,                [tab_LumaCoeff]\r\n    vpbroadcastq                m0,                [r6 + r4 * 8]\r\n%else\r\n    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    mova                        m6,                [tab_Lm + 32]\r\n    mova                        m1,                [tab_Lm]\r\n    mov                         r9,                %2                           ;height\r\n    add                         r3d,               r3d\r\n    vbroadcasti128              m2,                [pw_2000]\r\n\r\n    ; register map\r\n    ; m0      - interpolate coeff\r\n    ; m1 , m6 - shuffle order table\r\n    ; m2      - pw_2000\r\n\r\n    xor                         r7,                r7                          ; loop count variable\r\n    sub                         r0,                3\r\n    test                        r5d,               r5d\r\n    jz                          .label\r\n    lea                         r8,                [r1 * 3]                     ; r8 = (N / 2 - 1) * srcStride\r\n    sub                         r0,                r8                           ; r0(src)-r8\r\n    add                         r9,                7                            ; blkheight += N - 1  (7 - 1 = 6 ; since the last one row not in loop)\r\n\r\n.label\r\n    ; Row 0\r\n    vbroadcasti128  
            m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 4 to 7)\r\n    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    phaddw                      m3,                m4                           ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]\r\n\r\n    vbroadcasti128              m4,                [r0 + 8]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m5,                m4,            m6            ;row 1 (col 4 to 7)\r\n    pshufb                      m4,                m1                           ;row 1 (col 0 to 3)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    phaddw                      m4,                m5                           ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]\r\n    phaddw                      m3,                m4                           ; all rows and col completed.\r\n\r\n    mova                        m5,                [interp8_hps_shuf]\r\n    vpermd                      m3,                m5,               m3\r\n    psubw                       m3,                m2\r\n\r\n    movu                        [r2],              m3                          ;row 0\r\n\r\n    lea                         r0,                [r0 + r1]                ; first loop src ->5th row(i.e 4)\r\n    lea                         r2,                [r2 + r3]                ; first loop dst ->5th row(i.e 4)\r\n    dec                         r9d\r\n    jnz                         .label\r\n\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n\r\n    IPFILTER_LUMA_PS_16x_AVX2 16 , 16\r\n    IPFILTER_LUMA_PS_16x_AVX2 
16 , 8\r\n    IPFILTER_LUMA_PS_16x_AVX2 16 , 12\r\n    IPFILTER_LUMA_PS_16x_AVX2 16 , 4\r\n    IPFILTER_LUMA_PS_16x_AVX2 16 , 32\r\n    IPFILTER_LUMA_PS_16x_AVX2 16 , 64\r\n\r\n\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_LUMA_PP_W8 2\r\nINIT_XMM sse4\r\ncglobal interp_8tap_horiz_pp_%1x%2, 4,6,7\r\n    mov         r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea         r5, [tab_LumaCoeff]\r\n    movh        m3, [r5 + r4 * 8]\r\n%else\r\n    movh        m3, [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    pshufd      m0, m3, 0                       ; m0 = coeff-L\r\n    pshufd      m1, m3, 0x55                    ; m1 = coeff-H\r\n    lea         r5, [tab_Tm]                    ; r5 = shuffle\r\n    mova        m2, [pw_512]                    ; m2 = 512\r\n\r\n    mov         r4d, %2\r\n.loopH:\r\n%assign x 0\r\n%rep %1 / 8\r\n    movu        m3, [r0 - 3 + x]                ; m3 = [F E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb      m4, m3, [r5 + 0*16]             ; m4 = [6 5 4 3 5 4 3 2 4 3 2 1 3 2 1 0]\r\n    pshufb      m5, m3, [r5 + 1*16]             ; m5 = [A 9 8 7 9 8 7 6 8 7 6 5 7 6 5 4]\r\n    pshufb          m3, [r5 + 2*16]             ; m3 = [E D C B D C B A C B A 9 B A 9 8]\r\n    pmaddubsw   m4, m0\r\n    pmaddubsw   m6, m5, m1\r\n    pmaddubsw   m5, m0\r\n    pmaddubsw   m3, m1\r\n    paddw       m4, m6\r\n    paddw       m5, m3\r\n    phaddw      m4, m5\r\n    pmulhrsw    m4, m2\r\n    packuswb    m4, m4\r\n    movh        [r2 + x], m4\r\n%assign x x+8\r\n%endrep\r\n\r\n    add       r0, r1\r\n    add       r2, r3\r\n\r\n    dec       r4d\r\n    jnz      .loopH\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_LUMA_PP_W8      8,  4\r\n    IPFILTER_LUMA_PP_W8    
  8,  8\r\n    IPFILTER_LUMA_PP_W8      8, 16\r\n    IPFILTER_LUMA_PP_W8      8, 32\r\n    IPFILTER_LUMA_PP_W8     16,  4\r\n    IPFILTER_LUMA_PP_W8     16,  8\r\n    IPFILTER_LUMA_PP_W8     16, 12\r\n    IPFILTER_LUMA_PP_W8     16, 16\r\n    IPFILTER_LUMA_PP_W8     16, 32\r\n    IPFILTER_LUMA_PP_W8     16, 64\r\n    IPFILTER_LUMA_PP_W8     24, 32\r\n    IPFILTER_LUMA_PP_W8     32,  8\r\n    IPFILTER_LUMA_PP_W8     32, 16\r\n    IPFILTER_LUMA_PP_W8     32, 24\r\n    IPFILTER_LUMA_PP_W8     32, 32\r\n    IPFILTER_LUMA_PP_W8     32, 64\r\n    IPFILTER_LUMA_PP_W8     48, 64\r\n    IPFILTER_LUMA_PP_W8     64, 16\r\n    IPFILTER_LUMA_PP_W8     64, 32\r\n    IPFILTER_LUMA_PP_W8     64, 48\r\n    IPFILTER_LUMA_PP_W8     64, 64\r\n\r\n;----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;----------------------------------------------------------------------------------------------------------------------------\r\n    IPFILTER_LUMA 4, 4, ps\r\n    IPFILTER_LUMA 8, 8, ps\r\n    IPFILTER_LUMA 8, 4, ps\r\n    IPFILTER_LUMA 4, 8, ps\r\n    IPFILTER_LUMA 16, 16, ps\r\n    IPFILTER_LUMA 16, 8, ps\r\n    IPFILTER_LUMA 8, 16, ps\r\n    IPFILTER_LUMA 16, 12, ps\r\n    IPFILTER_LUMA 12, 16, ps\r\n    IPFILTER_LUMA 16, 4, ps\r\n    IPFILTER_LUMA 4, 16, ps\r\n    IPFILTER_LUMA 32, 32, ps\r\n    IPFILTER_LUMA 32, 16, ps\r\n    IPFILTER_LUMA 16, 32, ps\r\n    IPFILTER_LUMA 32, 24, ps\r\n    IPFILTER_LUMA 24, 32, ps\r\n    IPFILTER_LUMA 32, 8, ps\r\n    IPFILTER_LUMA 8, 32, ps\r\n    IPFILTER_LUMA 64, 64, ps\r\n    IPFILTER_LUMA 64, 32, ps\r\n    IPFILTER_LUMA 32, 64, ps\r\n    IPFILTER_LUMA 64, 48, ps\r\n    IPFILTER_LUMA 48, 64, ps\r\n    IPFILTER_LUMA 64, 16, ps\r\n    IPFILTER_LUMA 16, 64, ps\r\n\r\n;-----------------------------------------------------------------------------\r\n; 
Interpolate HV\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_HV8_START 7 ; (t0, t1, t2, t3, t4, off_src, off_coeff) -> (t3, t5), (t4, t1), [2]\r\n    mova        %5, [r0 +  (%6 + 0) * 16]\r\n    mova        %1, [r0 +  (%6 + 1) * 16]\r\n    mova        %2, [r0 +  (%6 + 2) * 16]\r\n    punpcklwd   %3, %5, %1\r\n    punpckhwd   %5, %1\r\n    pmaddwd     %3, [r5 + (%7) * 16]   ; R3 = L[0+1] -- Row 0\r\n    pmaddwd     %5, [r5 + (%7) * 16]   ; R0 = H[0+1]\r\n    punpcklwd   %4, %1, %2\r\n    punpckhwd   %1, %2\r\n    pmaddwd     %4, [r5 + (%7) * 16]   ; R4 = L[1+2] -- Row 1\r\n    pmaddwd     %1, [r5 + (%7) * 16]   ; R1 = H[1+2]\r\n%endmacro ; FILTER_HV8_START\r\n\r\n%macro FILTER_HV8_MID 10 ; (Row3, prevRow, sum0L, sum1L, sum0H, sum1H, t6, t7, off_src, off_coeff) -> [6]\r\n    mova        %8, [r0 +  (%9 + 0) * 16]\r\n    mova        %1, [r0 +  (%9 + 1) * 16]\r\n    punpcklwd   %7, %2, %8\r\n    punpckhwd   %2, %8\r\n    pmaddwd     %7, [r5 + %10 * 16]\r\n    pmaddwd     %2, [r5 + %10 * 16]\r\n    paddd       %3, %7              ; R3 = L[0+1+2+3] -- Row 0\r\n    paddd       %5, %2              ; R0 = H[0+1+2+3]\r\n    punpcklwd   %7, %8, %1\r\n    punpckhwd   %8, %1\r\n    pmaddwd     %7, [r5 + %10 * 16]\r\n    pmaddwd     %8, [r5 + %10 * 16]\r\n    paddd       %4, %7              ; R4 = L[1+2+3+4] -- Row 1\r\n    paddd       %6, %8              ; R1 = H[1+2+3+4]\r\n%endmacro ; FILTER_HV8_MID\r\n\r\n; Round and Saturate\r\n%macro FILTER_HV8_END 4 ; output in [1, 3]\r\n    paddd       %1, [pd_526336]\r\n    paddd       %2, [pd_526336]\r\n    paddd       %3, [pd_526336]\r\n    paddd       %4, [pd_526336]\r\n    psrad       %1, 12\r\n    psrad       %2, 12\r\n    psrad       %3, 12\r\n    psrad       %4, 12\r\n    packssdw    %1, %2\r\n    packssdw    %3, %4\r\n\r\n    ; TODO: is merge better? 
I think this way has a shorter dependency chain\r\n    packuswb    %1, %3\r\n%endmacro ; FILTER_HV8_END\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_8tap_hv_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM ssse3\r\ncglobal interp_8tap_hv_pp_8x8, 4, 7, 8, 0-15*16\r\n%define coef        m7\r\n%define stk_buf     rsp\r\n\r\n    mov         r4d,        r4m\r\n    mov         r5d,        r5m\r\n\r\n%ifdef PIC\r\n    lea         r6,         [tab_LumaCoeff]\r\n    movh        coef,       [r6 + r4 * 8]\r\n%else\r\n    movh        coef,       [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    punpcklqdq  coef,       coef\r\n\r\n    ; move to row -3\r\n    lea         r6,         [r1 + r1 * 2]\r\n    sub         r0,         r6\r\n\r\n    xor         r6,         r6\r\n    mov         r4,         rsp\r\n\r\n.loopH:\r\n    FILTER_H8_W8 m0, m1, m2, m3, coef, [pw_512], [r0 - 3]\r\n    psubw       m1,         [pw_2000]\r\n    mova        [r4],       m1\r\n\r\n    add         r0,         r1\r\n    add         r4,         16\r\n    inc         r6\r\n    cmp         r6,         8+7\r\n    jnz         .loopH\r\n\r\n    ; ready for phase V\r\n    ; all mN registers are free here\r\n\r\n    ; load coeff table\r\n    shl         r5,         6\r\n    lea         r6,         [tab_LumaCoeffV]\r\n    lea         r5,         [r5 + r6]\r\n\r\n    ; load intermediate buffer\r\n    mov         r0,         stk_buf\r\n\r\n    ; register mapping\r\n    ; r0 - src\r\n    ; r5 - coeff\r\n    ; r6 - loop_i\r\n\r\n    ; let's go\r\n    xor         r6,         r6\r\n\r\n    ; TODO: this loop has more than 70 instructions, which I think exceeds the Intel loop decode cache\r\n.loopV:\r\n\r\n    FILTER_HV8_START    m1, m2, m3, m4, m0,             0, 0\r\n    FILTER_HV8_MID      m6, m2, m3, m4, m0, m1, m7, m5, 3, 1\r\n    FILTER_HV8_MID      
m5, m6, m3, m4, m0, m1, m7, m2, 5, 2\r\n    FILTER_HV8_MID      m6, m5, m3, m4, m0, m1, m7, m2, 7, 3\r\n    FILTER_HV8_END      m3, m0, m4, m1\r\n\r\n    movh        [r2],       m3\r\n    movhps      [r2 + r3],  m3\r\n\r\n    lea         r0,         [r0 + 16 * 2]\r\n    lea         r2,         [r2 + r3 * 2]\r\n\r\n    inc         r6\r\n    cmp         r6,         8/2\r\n    jnz         .loopV\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_8tap_hv_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse3\r\ncglobal interp_8tap_hv_pp_8x8, 4, 7, 8, 0-15*16\r\n    mov         r4d,        r4m\r\n    mov         r5d,        r5m\r\n    add         r4d,        r4d\r\n    pxor        m6,         m6\r\n\r\n%ifdef PIC\r\n    lea         r6,         [tabw_LumaCoeff]\r\n    mova        m3,         [r6 + r4 * 8]\r\n%else\r\n    mova        m3,         [tabw_LumaCoeff + r4 * 8]\r\n%endif\r\n\r\n    ; move to row -3\r\n    lea         r6,         [r1 + r1 * 2]\r\n    sub         r0,         r6\r\n\r\n    mov         r4,         rsp\r\n\r\n%assign x 0     ; needed for FILTER_H8_W8_sse2 macro\r\n%assign y 1\r\n%rep 15\r\n    FILTER_H8_W8_sse2\r\n    psubw       m1,         [pw_2000]\r\n    mova        [r4],       m1\r\n\r\n%if y < 15\r\n    add         r0,         r1\r\n    add         r4,         16\r\n%endif\r\n%assign y y+1\r\n%endrep\r\n\r\n    ; ready for phase V\r\n    ; all mN registers are free here\r\n\r\n    ; load coeff table\r\n    shl         r5,         6\r\n    lea         r6,         [tab_LumaCoeffV]\r\n    lea         r5,         [r5 + r6]\r\n\r\n    ; load intermediate buffer\r\n    mov         r0,         rsp\r\n\r\n    ; register mapping\r\n    ; r0 - src\r\n    ; r5 - coeff\r\n\r\n    ; let's go\r\n%assign y 1\r\n%rep 4\r\n    FILTER_HV8_START    m1, m2, m3, m4, m0,        
     0, 0\r\n    FILTER_HV8_MID      m6, m2, m3, m4, m0, m1, m7, m5, 3, 1\r\n    FILTER_HV8_MID      m5, m6, m3, m4, m0, m1, m7, m2, 5, 2\r\n    FILTER_HV8_MID      m6, m5, m3, m4, m0, m1, m7, m2, 7, 3\r\n    FILTER_HV8_END      m3, m0, m4, m1\r\n\r\n    movh        [r2],       m3\r\n    movhps      [r2 + r3],  m3\r\n\r\n%if y < 4\r\n    lea         r0,         [r0 + 16 * 2]\r\n    lea         r2,         [r2 + r3 * 2]\r\n%endif\r\n%assign y y+1\r\n%endrep\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n;void interp_4tap_vert_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_2x4, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd        m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n    lea         r4,        [r1 * 3]\r\n    lea         r5,        [r0 + 4 * r1]\r\n    pshufb      m0,        [tab_Cm]\r\n    mova        m1,        [pw_512]\r\n\r\n    movd        m2,        [r0]\r\n    movd        m3,        [r0 + r1]\r\n    movd        m4,        [r0 + 2 * r1]\r\n    movd        m5,        [r0 + r4]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m6,        m4,        m5\r\n    punpcklbw   m2,        m6\r\n\r\n    pmaddubsw   m2,        m0\r\n\r\n    movd        m6,        [r5]\r\n\r\n    punpcklbw   m3,        m4\r\n    punpcklbw   m7,        m5,        m6\r\n    punpcklbw   m3,        m7\r\n\r\n    pmaddubsw   m3,        m0\r\n\r\n    phaddw      m2,        m3\r\n\r\n    pmulhrsw    m2,        m1\r\n\r\n    movd        m7,        [r5 + r1]\r\n\r\n    punpcklbw   m4,        m5\r\n    punpcklbw   m3,        m6,        m7\r\n    punpcklbw   m4,        m3\r\n\r\n    pmaddubsw   m4,        
m0\r\n\r\n    movd        m3,        [r5 + 2 * r1]\r\n\r\n    punpcklbw   m5,        m6\r\n    punpcklbw   m7,        m3\r\n    punpcklbw   m5,        m7\r\n\r\n    pmaddubsw   m5,        m0\r\n\r\n    phaddw      m4,        m5\r\n\r\n    pmulhrsw    m4,        m1\r\n    packuswb    m2,        m4\r\n\r\n    pextrw      [r2],      m2, 0\r\n    pextrw      [r2 + r3], m2, 2\r\n    lea         r2,        [r2 + 2 * r3]\r\n    pextrw      [r2],      m2, 4\r\n    pextrw      [r2 + r3], m2, 6\r\n\r\n    RET\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_2x4 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_2x4, 4, 6, 2\r\n    mov             r4d, r4m\r\n    shl             r4d, 5\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeff_V]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeff_V + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n\r\n    pinsrw          xm1, [r0], 0\r\n    pinsrw          xm1, [r0 + r1], 1\r\n    pinsrw          xm1, [r0 + r1 * 2], 2\r\n    pinsrw          xm1, [r0 + r4], 3\r\n    lea             r0, [r0 + r1 * 4]\r\n    pinsrw          xm1, [r0], 4\r\n    pinsrw          xm1, [r0 + r1], 5\r\n    pinsrw          xm1, [r0 + r1 * 2], 6\r\n\r\n    pshufb          xm0, xm1, [interp_vert_shuf]\r\n    pshufb          xm1, [interp_vert_shuf + 32]\r\n    vinserti128     m0, m0, xm1, 1\r\n    pmaddubsw       m0, [r5]\r\n    vextracti128    xm1, m0, 1\r\n    paddw           xm0, xm1\r\n%ifidn %1,pp\r\n    pmulhrsw        xm0, [pw_512]\r\n    packuswb        xm0, xm0\r\n    lea             r4, [r3 * 3]\r\n    pextrw          [r2], xm0, 0\r\n    pextrw          [r2 + r3], xm0, 1\r\n    pextrw          [r2 + r3 * 2], xm0, 2\r\n    pextrw          [r2 + r4], xm0, 3\r\n%else\r\n    add             r3d, r3d\r\n    lea             r4, [r3 * 3]\r\n    psubw           xm0, [pw_2000]\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    pextrd          [r2 + r3 * 2], xm0, 
2\r\n    pextrd          [r2 + r4], xm0, 3\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_2x4 pp\r\n    FILTER_VER_CHROMA_AVX2_2x4 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_2x8 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_2x8, 4, 6, 2\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n\r\n    pinsrw          xm1, [r0], 0\r\n    pinsrw          xm1, [r0 + r1], 1\r\n    pinsrw          xm1, [r0 + r1 * 2], 2\r\n    pinsrw          xm1, [r0 + r4], 3\r\n    lea             r0, [r0 + r1 * 4]\r\n    pinsrw          xm1, [r0], 4\r\n    pinsrw          xm1, [r0 + r1], 5\r\n    pinsrw          xm1, [r0 + r1 * 2], 6\r\n    pinsrw          xm1, [r0 + r4], 7\r\n    movhlps         xm0, xm1\r\n    lea             r0, [r0 + r1 * 4]\r\n    pinsrw          xm0, [r0], 4\r\n    pinsrw          xm0, [r0 + r1], 5\r\n    pinsrw          xm0, [r0 + r1 * 2], 6\r\n    vinserti128     m1, m1, xm0, 1\r\n\r\n    pshufb          m0, m1, [interp_vert_shuf]\r\n    pshufb          m1, [interp_vert_shuf + 32]\r\n    pmaddubsw       m0, [r5]\r\n    pmaddubsw       m1, [r5 + 1 * mmsize]\r\n    paddw           m0, m1\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, [pw_512]\r\n    vextracti128    xm1, m0, 1\r\n    packuswb        xm0, xm1\r\n    lea             r4, [r3 * 3]\r\n    pextrw          [r2], xm0, 0\r\n    pextrw          [r2 + r3], xm0, 1\r\n    pextrw          [r2 + r3 * 2], xm0, 2\r\n    pextrw          [r2 + r4], xm0, 3\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrw          [r2], xm0, 4\r\n    pextrw          [r2 + r3], xm0, 5\r\n    pextrw          [r2 + r3 * 2], xm0, 6\r\n    pextrw          [r2 + r4], xm0, 7\r\n%else\r\n    add             r3d, r3d\r\n    lea             r4, [r3 * 3]\r\n    psubw         
  m0, [pw_2000]\r\n    vextracti128    xm1, m0, 1\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    pextrd          [r2 + r3 * 2], xm0, 2\r\n    pextrd          [r2 + r4], xm0, 3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movd            [r2], xm1\r\n    pextrd          [r2 + r3], xm1, 1\r\n    pextrd          [r2 + r3 * 2], xm1, 2\r\n    pextrd          [r2 + r4], xm1, 3\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_2x8 pp\r\n    FILTER_VER_CHROMA_AVX2_2x8 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_2x16 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_2x16, 4, 6, 3\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    sub             r0,  r1\r\n\r\n%ifdef PIC\r\n    lea             r5,  [tab_ChromaCoeffVer_32]\r\n    add             r5,  r4\r\n%else\r\n    lea             r5,  [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4,  [r1 * 3]\r\n\r\n    movd            xm1, [r0]\r\n    pinsrw          xm1, [r0 + r1], 1\r\n    pinsrw          xm1, [r0 + r1 * 2], 2\r\n    pinsrw          xm1, [r0 + r4], 3\r\n    lea             r0,  [r0 + r1 * 4]\r\n    pinsrw          xm1, [r0], 4\r\n    pinsrw          xm1, [r0 + r1], 5\r\n    pinsrw          xm1, [r0 + r1 * 2], 6\r\n    pinsrw          xm1, [r0 + r4], 7\r\n    lea             r0,  [r0 + r1 * 4]\r\n    pinsrw          xm0, [r0], 4\r\n    pinsrw          xm0, [r0 + r1], 5\r\n    pinsrw          xm0, [r0 + r1 * 2], 6\r\n    pinsrw          xm0, [r0 + r4], 7\r\n    punpckhqdq      xm0, xm1, xm0\r\n    vinserti128     m1,  m1,  xm0,  1\r\n\r\n    pshufb          m2,  m1,  [interp_vert_shuf]\r\n    pshufb          m1,  [interp_vert_shuf + 32]\r\n    pmaddubsw       m2,  [r5]\r\n    pmaddubsw       m1,  [r5 + 1 * mmsize]\r\n    paddw           m2,  m1\r\n\r\n    lea             r0,  [r0 + r1 * 4]\r\n    pinsrw          xm1, [r0], 4\r\n    pinsrw          xm1, [r0 + r1], 5\r\n    pinsrw          xm1, [r0 + r1 * 2], 6\r\n    pinsrw         
 xm1, [r0 + r4], 7\r\n    punpckhqdq      xm1, xm0, xm1\r\n    lea             r0,  [r0 + r1 * 4]\r\n    pinsrw          xm0, [r0], 4\r\n    pinsrw          xm0, [r0 + r1], 5\r\n    pinsrw          xm0, [r0 + r1 * 2], 6\r\n    punpckhqdq      xm0, xm1, xm0\r\n    vinserti128     m1,  m1,  xm0,  1\r\n\r\n    pshufb          m0,  m1,  [interp_vert_shuf]\r\n    pshufb          m1,  [interp_vert_shuf + 32]\r\n    pmaddubsw       m0,  [r5]\r\n    pmaddubsw       m1,  [r5 + 1 * mmsize]\r\n    paddw           m0,  m1\r\n%ifidn %1,pp\r\n    mova            m1,  [pw_512]\r\n    pmulhrsw        m2,  m1\r\n    pmulhrsw        m0,  m1\r\n    packuswb        m2,  m0\r\n    lea             r4,  [r3 * 3]\r\n    pextrw          [r2], xm2, 0\r\n    pextrw          [r2 + r3], xm2, 1\r\n    pextrw          [r2 + r3 * 2], xm2, 2\r\n    pextrw          [r2 + r4], xm2, 3\r\n    vextracti128    xm0, m2, 1\r\n    lea             r2,  [r2 + r3 * 4]\r\n    pextrw          [r2], xm0, 0\r\n    pextrw          [r2 + r3], xm0, 1\r\n    pextrw          [r2 + r3 * 2], xm0, 2\r\n    pextrw          [r2 + r4], xm0, 3\r\n    lea             r2,  [r2 + r3 * 4]\r\n    pextrw          [r2], xm2, 4\r\n    pextrw          [r2 + r3], xm2, 5\r\n    pextrw          [r2 + r3 * 2], xm2, 6\r\n    pextrw          [r2 + r4], xm2, 7\r\n    lea             r2,  [r2 + r3 * 4]\r\n    pextrw          [r2], xm0, 4\r\n    pextrw          [r2 + r3], xm0, 5\r\n    pextrw          [r2 + r3 * 2], xm0, 6\r\n    pextrw          [r2 + r4], xm0, 7\r\n%else\r\n    add             r3d, r3d\r\n    lea             r4,  [r3 * 3]\r\n    vbroadcasti128  m1,  [pw_2000]\r\n    psubw           m2,  m1\r\n    psubw           m0,  m1\r\n    vextracti128    xm1, m2, 1\r\n    movd            [r2], xm2\r\n    pextrd          [r2 + r3], xm2, 1\r\n    pextrd          [r2 + r3 * 2], xm2, 2\r\n    pextrd          [r2 + r4], xm2, 3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movd            [r2], xm1\r\n    pextrd          [r2 + r3], xm1, 
1\r\n    pextrd          [r2 + r3 * 2], xm1, 2\r\n    pextrd          [r2 + r4], xm1, 3\r\n    vextracti128    xm1, m0, 1\r\n    lea             r2,  [r2 + r3 * 4]\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    pextrd          [r2 + r3 * 2], xm0, 2\r\n    pextrd          [r2 + r4], xm0, 3\r\n    lea             r2,  [r2 + r3 * 4]\r\n    movd            [r2], xm1\r\n    pextrd          [r2 + r3], xm1, 1\r\n    pextrd          [r2 + r3 * 2], xm1, 2\r\n    pextrd          [r2 + r4], xm1, 3\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_2x16 pp\r\n    FILTER_VER_CHROMA_AVX2_2x16 ps\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W2_H4 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_2x%2, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd        m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m0,        [tab_Cm]\r\n\r\n    mova        m1,        [pw_512]\r\n\r\n    mov         r4d,       %2\r\n    lea         r5,        [3 * r1]\r\n\r\n.loop:\r\n    movd        m2,        [r0]\r\n    movd        m3,        [r0 + r1]\r\n    movd        m4,        [r0 + 2 * r1]\r\n    movd        m5,        [r0 + r5]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m6,        m4,        m5\r\n    punpcklbw   m2,        m6\r\n\r\n    pmaddubsw   m2,        m0\r\n\r\n    lea         r0,        [r0 + 4 * r1]\r\n    movd        m6,        [r0]\r\n\r\n    punpcklbw   m3,        m4\r\n    punpcklbw   m7,        m5,        m6\r\n    punpcklbw   m3,        m7\r\n\r\n    pmaddubsw   m3,        m0\r\n\r\n    
phaddw      m2,        m3\r\n\r\n    pmulhrsw    m2,        m1\r\n\r\n    movd        m7,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m5\r\n    punpcklbw   m3,        m6,        m7\r\n    punpcklbw   m4,        m3\r\n\r\n    pmaddubsw   m4,        m0\r\n\r\n    movd        m3,        [r0 + 2 * r1]\r\n\r\n    punpcklbw   m5,        m6\r\n    punpcklbw   m7,        m3\r\n    punpcklbw   m5,        m7\r\n\r\n    pmaddubsw   m5,        m0\r\n\r\n    phaddw      m4,        m5\r\n\r\n    pmulhrsw    m4,        m1\r\n    packuswb    m2,        m4\r\n\r\n    pextrw      [r2],      m2, 0\r\n    pextrw      [r2 + r3], m2, 2\r\n    lea         r2,        [r2 + 2 * r3]\r\n    pextrw      [r2],      m2, 4\r\n    pextrw      [r2 + r3], m2, 6\r\n\r\n    lea         r2,        [r2 + 2 * r3]\r\n\r\n    sub         r4,        4\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_W2_H4 2, 8\r\n\r\n    FILTER_V4_W2_H4 2, 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_4x2, 4, 6, 6\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd        m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m0,        [tab_Cm]\r\n    lea         r5,        [r0 + 2 * r1]\r\n\r\n    movd        m2,        [r0]\r\n    movd        m3,        [r0 + r1]\r\n    movd        m4,        [r5]\r\n    movd        m5,        [r5 + r1]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m1,        m4,        m5\r\n    punpcklbw   m2,        m1\r\n\r\n    pmaddubsw   m2,        m0\r\n\r\n    movd        m1,        [r0 + 4 * r1]\r\n\r\n    
punpcklbw   m3,        m4\r\n    punpcklbw   m5,        m1\r\n    punpcklbw   m3,        m5\r\n\r\n    pmaddubsw   m3,        m0\r\n\r\n    phaddw      m2,        m3\r\n\r\n    pmulhrsw    m2,        [pw_512]\r\n    packuswb    m2,        m2\r\n    movd        [r2],      m2\r\n    pextrd      [r2 + r3], m2,  1\r\n\r\n    RET\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_4x2 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_4x2, 4, 6, 4\r\n    mov             r4d, r4m\r\n    shl             r4d, 5\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeff_V]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeff_V + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n\r\n    movd            xm1, [r0]\r\n    movd            xm2, [r0 + r1]\r\n    punpcklbw       xm1, xm2\r\n    movd            xm3, [r0 + r1 * 2]\r\n    punpcklbw       xm2, xm3\r\n    movlhps         xm1, xm2\r\n    movd            xm0, [r0 + r4]\r\n    punpcklbw       xm3, xm0\r\n    movd            xm2, [r0 + r1 * 4]\r\n    punpcklbw       xm0, xm2\r\n    movlhps         xm3, xm0\r\n    vinserti128     m1, m1, xm3, 1                          ; m1 = row[x x x 4 3 2 1 0]\r\n\r\n    pmaddubsw       m1, [r5]\r\n    vextracti128    xm3, m1, 1\r\n    paddw           xm1, xm3\r\n%ifidn %1,pp\r\n    pmulhrsw        xm1, [pw_512]\r\n    packuswb        xm1, xm1\r\n    movd            [r2], xm1\r\n    pextrd          [r2 + r3], xm1, 1\r\n%else\r\n    add             r3d, r3d\r\n    psubw           xm1, [pw_2000]\r\n    movq            [r2], xm1\r\n    movhps          [r2 + r3], xm1\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_4x2 pp\r\n    FILTER_VER_CHROMA_AVX2_4x2 ps\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int 
coeffIdx)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_4x4, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd        m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m0,        [tab_Cm]\r\n    mova        m1,        [pw_512]\r\n    lea         r5,        [r0 + 4 * r1]\r\n    lea         r4,        [r1 * 3]\r\n\r\n    movd        m2,        [r0]\r\n    movd        m3,        [r0 + r1]\r\n    movd        m4,        [r0 + 2 * r1]\r\n    movd        m5,        [r0 + r4]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m6,        m4,        m5\r\n    punpcklbw   m2,        m6\r\n\r\n    pmaddubsw   m2,        m0\r\n\r\n    movd        m6,        [r5]\r\n\r\n    punpcklbw   m3,        m4\r\n    punpcklbw   m7,        m5,        m6\r\n    punpcklbw   m3,        m7\r\n\r\n    pmaddubsw   m3,        m0\r\n\r\n    phaddw      m2,        m3\r\n\r\n    pmulhrsw    m2,        m1\r\n\r\n    movd        m7,        [r5 + r1]\r\n\r\n    punpcklbw   m4,        m5\r\n    punpcklbw   m3,        m6,        m7\r\n    punpcklbw   m4,        m3\r\n\r\n    pmaddubsw   m4,        m0\r\n\r\n    movd        m3,        [r5 + 2 * r1]\r\n\r\n    punpcklbw   m5,        m6\r\n    punpcklbw   m7,        m3\r\n    punpcklbw   m5,        m7\r\n\r\n    pmaddubsw   m5,        m0\r\n\r\n    phaddw      m4,        m5\r\n\r\n    pmulhrsw    m4,        m1\r\n\r\n    packuswb    m2,        m4\r\n    movd        [r2],      m2\r\n    pextrd      [r2 + r3], m2, 1\r\n    lea         r2,        [r2 + 2 * r3]\r\n    pextrd      [r2],      m2, 2\r\n    pextrd      [r2 + r3], m2, 3\r\n    RET\r\n%macro FILTER_VER_CHROMA_AVX2_4x4 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_4x4, 4, 6, 3\r\n    mov             r4d, r4m\r\n    shl             
r4d, 6\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n\r\n    movd            xm1, [r0]\r\n    pinsrd          xm1, [r0 + r1], 1\r\n    pinsrd          xm1, [r0 + r1 * 2], 2\r\n    pinsrd          xm1, [r0 + r4], 3                       ; m1 = row[3 2 1 0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm2, [r0]\r\n    pinsrd          xm2, [r0 + r1], 1\r\n    pinsrd          xm2, [r0 + r1 * 2], 2                   ; m2 = row[x 6 5 4]\r\n    vinserti128     m1, m1, xm2, 1                          ; m1 = row[x 6 5 4 3 2 1 0]\r\n    mova            m2, [interp4_vpp_shuf1]\r\n    vpermd          m0, m2, m1                              ; m0 = row[4 3 3 2 2 1 1 0]\r\n    mova            m2, [interp4_vpp_shuf1 + mmsize]\r\n    vpermd          m1, m2, m1                              ; m1 = row[6 5 5 4 4 3 3 2]\r\n\r\n    mova            m2, [interp4_vpp_shuf]\r\n    pshufb          m0, m0, m2\r\n    pshufb          m1, m1, m2\r\n    pmaddubsw       m0, [r5]\r\n    pmaddubsw       m1, [r5 + mmsize]\r\n    paddw           m0, m1                                  ; m0 = WORD ROW[3 2 1 0]\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, [pw_512]\r\n    vextracti128    xm1, m0, 1\r\n    packuswb        xm0, xm1\r\n    lea             r5, [r3 * 3]\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    pextrd          [r2 + r3 * 2], xm0, 2\r\n    pextrd          [r2 + r5], xm0, 3\r\n%else\r\n    add             r3d, r3d\r\n    psubw           m0, [pw_2000]\r\n    vextracti128    xm1, m0, 1\r\n    lea             r5, [r3 * 3]\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm1\r\n    movhps          [r2 + r5], xm1\r\n%endif\r\n    RET\r\n%endmacro\r\n    FILTER_VER_CHROMA_AVX2_4x4 pp\r\n   
 FILTER_VER_CHROMA_AVX2_4x4 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_4x8 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_4x8, 4, 6, 5\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n\r\n    movd            xm1, [r0]\r\n    pinsrd          xm1, [r0 + r1], 1\r\n    pinsrd          xm1, [r0 + r1 * 2], 2\r\n    pinsrd          xm1, [r0 + r4], 3                       ; m1 = row[3 2 1 0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm2, [r0]\r\n    pinsrd          xm2, [r0 + r1], 1\r\n    pinsrd          xm2, [r0 + r1 * 2], 2\r\n    pinsrd          xm2, [r0 + r4], 3                       ; m2 = row[7 6 5 4]\r\n    vinserti128     m1, m1, xm2, 1                          ; m1 = row[7 6 5 4 3 2 1 0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm3, [r0]\r\n    pinsrd          xm3, [r0 + r1], 1\r\n    pinsrd          xm3, [r0 + r1 * 2], 2                   ; m3 = row[x 10 9 8]\r\n    vinserti128     m2, m2, xm3, 1                          ; m2 = row[x 10 9 8 7 6 5 4]\r\n    mova            m3, [interp4_vpp_shuf1]\r\n    vpermd          m0, m3, m1                              ; m0 = row[4 3 3 2 2 1 1 0]\r\n    vpermd          m4, m3, m2                              ; m4 = row[8 7 7 6 6 5 5 4]\r\n    mova            m3, [interp4_vpp_shuf1 + mmsize]\r\n    vpermd          m1, m3, m1                              ; m1 = row[6 5 5 4 4 3 3 2]\r\n    vpermd          m2, m3, m2                              ; m2 = row[10 9 9 8 8 7 7 6]\r\n\r\n    mova            m3, [interp4_vpp_shuf]\r\n    pshufb          m0, m0, m3\r\n    pshufb          m1, m1, m3\r\n    pshufb          m2, m2, m3\r\n    pshufb          m4, m4, m3\r\n    pmaddubsw       m0, [r5]\r\n    pmaddubsw       m4, [r5]\r\n    
pmaddubsw       m1, [r5 + mmsize]\r\n    pmaddubsw       m2, [r5 + mmsize]\r\n    paddw           m0, m1                                  ; m0 = WORD ROW[3 2 1 0]\r\n    paddw           m4, m2                                  ; m4 = WORD ROW[7 6 5 4]\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, [pw_512]\r\n    pmulhrsw        m4, [pw_512]\r\n    packuswb        m0, m4\r\n    vextracti128    xm1, m0, 1\r\n    lea             r5, [r3 * 3]\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    movd            [r2 + r3 * 2], xm1\r\n    pextrd          [r2 + r5], xm1, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm0, 2\r\n    pextrd          [r2 + r3], xm0, 3\r\n    pextrd          [r2 + r3 * 2], xm1, 2\r\n    pextrd          [r2 + r5], xm1, 3\r\n%else\r\n    add             r3d, r3d\r\n    psubw           m0, [pw_2000]\r\n    psubw           m4, [pw_2000]\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm2, m4, 1\r\n    lea             r5, [r3 * 3]\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm1\r\n    movhps          [r2 + r5], xm1\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm4\r\n    movhps          [r2 + r3], xm4\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r5], xm2\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_4x8 pp\r\n    FILTER_VER_CHROMA_AVX2_4x8 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_4xN 2\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_4x%2, 4, 6, 12\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    mova            m10, [r5]\r\n    mova            m11, [r5 + mmsize]\r\n%ifidn 
%1,pp\r\n    mova            m9, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    mova            m9, [pw_2000]\r\n%endif\r\n    lea             r5, [r3 * 3]\r\n%rep %2 / 16\r\n    movd            xm1, [r0]\r\n    pinsrd          xm1, [r0 + r1], 1\r\n    pinsrd          xm1, [r0 + r1 * 2], 2\r\n    pinsrd          xm1, [r0 + r4], 3                       ; m1 = row[3 2 1 0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm2, [r0]\r\n    pinsrd          xm2, [r0 + r1], 1\r\n    pinsrd          xm2, [r0 + r1 * 2], 2\r\n    pinsrd          xm2, [r0 + r4], 3                       ; m2 = row[7 6 5 4]\r\n    vinserti128     m1, m1, xm2, 1                          ; m1 = row[7 6 5 4 3 2 1 0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm3, [r0]\r\n    pinsrd          xm3, [r0 + r1], 1\r\n    pinsrd          xm3, [r0 + r1 * 2], 2\r\n    pinsrd          xm3, [r0 + r4], 3                       ; m3 = row[11 10 9 8]\r\n    vinserti128     m2, m2, xm3, 1                          ; m2 = row[11 10 9 8 7 6 5 4]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm4, [r0]\r\n    pinsrd          xm4, [r0 + r1], 1\r\n    pinsrd          xm4, [r0 + r1 * 2], 2\r\n    pinsrd          xm4, [r0 + r4], 3                       ; m4 = row[15 14 13 12]\r\n    vinserti128     m3, m3, xm4, 1                          ; m3 = row[15 14 13 12 11 10 9 8]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm5, [r0]\r\n    pinsrd          xm5, [r0 + r1], 1\r\n    pinsrd          xm5, [r0 + r1 * 2], 2                   ; m5 = row[x 18 17 16]\r\n    vinserti128     m4, m4, xm5, 1                          ; m4 = row[x 18 17 16 15 14 13 12]\r\n    mova            m5, [interp4_vpp_shuf1]\r\n    vpermd          m0, m5, m1                              ; m0 = row[4 3 3 2 2 1 1 0]\r\n    vpermd          m6, m5, m2                              ; m6 = row[8 7 7 6 6 5 5 4]\r\n    vpermd          m7, m5, m3                              ; m7 = 
row[12 11 11 10 10 9 9 8]\r\n    vpermd          m8, m5, m4                              ; m8 = row[16 15 15 14 14 13 13 12]\r\n    mova            m5, [interp4_vpp_shuf1 + mmsize]\r\n    vpermd          m1, m5, m1                              ; m1 = row[6 5 5 4 4 3 3 2]\r\n    vpermd          m2, m5, m2                              ; m2 = row[10 9 9 8 8 7 7 6]\r\n    vpermd          m3, m5, m3                              ; m3 = row[14 13 13 12 12 11 11 10]\r\n    vpermd          m4, m5, m4                              ; m4 = row[18 17 17 16 16 15 15 14]\r\n\r\n    mova            m5, [interp4_vpp_shuf]\r\n    pshufb          m0, m0, m5\r\n    pshufb          m1, m1, m5\r\n    pshufb          m2, m2, m5\r\n    pshufb          m4, m4, m5\r\n    pshufb          m3, m3, m5\r\n    pshufb          m6, m6, m5\r\n    pshufb          m7, m7, m5\r\n    pshufb          m8, m8, m5\r\n    pmaddubsw       m0, m10\r\n    pmaddubsw       m6, m10\r\n    pmaddubsw       m7, m10\r\n    pmaddubsw       m8, m10\r\n    pmaddubsw       m1, m11\r\n    pmaddubsw       m2, m11\r\n    pmaddubsw       m3, m11\r\n    pmaddubsw       m4, m11\r\n    paddw           m0, m1                                  ; m0 = WORD ROW[3 2 1 0]\r\n    paddw           m6, m2                                  ; m6 = WORD ROW[7 6 5 4]\r\n    paddw           m7, m3                                  ; m7 = WORD ROW[11 10 9 8]\r\n    paddw           m8, m4                                  ; m8 = WORD ROW[15 14 13 12]\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m9\r\n    pmulhrsw        m6, m9\r\n    pmulhrsw        m7, m9\r\n    pmulhrsw        m8, m9\r\n    packuswb        m0, m6\r\n    packuswb        m7, m8\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm2, m7, 1\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    movd            [r2 + r3 * 2], xm1\r\n    pextrd          [r2 + r5], xm1, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm0, 2\r\n    
pextrd          [r2 + r3], xm0, 3\r\n    pextrd          [r2 + r3 * 2], xm1, 2\r\n    pextrd          [r2 + r5], xm1, 3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movd            [r2], xm7\r\n    pextrd          [r2 + r3], xm7, 1\r\n    movd            [r2 + r3 * 2], xm2\r\n    pextrd          [r2 + r5], xm2, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm7, 2\r\n    pextrd          [r2 + r3], xm7, 3\r\n    pextrd          [r2 + r3 * 2], xm2, 2\r\n    pextrd          [r2 + r5], xm2, 3\r\n%else\r\n    psubw           m0, m9\r\n    psubw           m6, m9\r\n    psubw           m7, m9\r\n    psubw           m8, m9\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm2, m6, 1\r\n    vextracti128    xm3, m7, 1\r\n    vextracti128    xm4, m8, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm1\r\n    movhps          [r2 + r5], xm1\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm6\r\n    movhps          [r2 + r3], xm6\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r5], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm7\r\n    movhps          [r2 + r3], xm7\r\n    movq            [r2 + r3 * 2], xm3\r\n    movhps          [r2 + r5], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm8\r\n    movhps          [r2 + r3], xm8\r\n    movq            [r2 + r3 * 2], xm4\r\n    movhps          [r2 + r5], xm4\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n%endrep\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_4xN pp, 16\r\n    FILTER_VER_CHROMA_AVX2_4xN ps, 16\r\n    FILTER_VER_CHROMA_AVX2_4xN pp, 32\r\n    FILTER_VER_CHROMA_AVX2_4xN ps, 32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int 
coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W4_H4 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd        m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m0,        [tab_Cm]\r\n\r\n    mova        m1,        [pw_512]\r\n\r\n    mov         r4d,       %2\r\n\r\n    lea         r5,        [3 * r1]\r\n\r\n.loop:\r\n    movd        m2,        [r0]\r\n    movd        m3,        [r0 + r1]\r\n    movd        m4,        [r0 + 2 * r1]\r\n    movd        m5,        [r0 + r5]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m6,        m4,        m5\r\n    punpcklbw   m2,        m6\r\n\r\n    pmaddubsw   m2,        m0\r\n\r\n    lea         r0,        [r0 + 4 * r1]\r\n    movd        m6,        [r0]\r\n\r\n    punpcklbw   m3,        m4\r\n    punpcklbw   m7,        m5,        m6\r\n    punpcklbw   m3,        m7\r\n\r\n    pmaddubsw   m3,        m0\r\n\r\n    phaddw      m2,        m3\r\n\r\n    pmulhrsw    m2,        m1\r\n\r\n    movd        m7,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m5\r\n    punpcklbw   m3,        m6,        m7\r\n    punpcklbw   m4,        m3\r\n\r\n    pmaddubsw   m4,        m0\r\n\r\n    movd        m3,        [r0 + 2 * r1]\r\n\r\n    punpcklbw   m5,        m6\r\n    punpcklbw   m7,        m3\r\n    punpcklbw   m5,        m7\r\n\r\n    pmaddubsw   m5,        m0\r\n\r\n    phaddw      m4,        m5\r\n\r\n    pmulhrsw    m4,        m1\r\n    packuswb    m2,        m4\r\n    movd        [r2],      m2\r\n    pextrd      [r2 + r3], m2,  1\r\n    lea         r2,        [r2 + 2 * r3]\r\n    pextrd      [r2],      m2, 2\r\n    pextrd      [r2 + r3], m2, 3\r\n\r\n    lea         r2,        [r2 + 2 * r3]\r\n\r\n    sub         r4,        
4\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_W4_H4 4,  8\r\n    FILTER_V4_W4_H4 4, 16\r\n\r\n    FILTER_V4_W4_H4 4, 32\r\n\r\n%macro FILTER_V4_W8_H2 0\r\n    punpcklbw   m1,        m2\r\n    punpcklbw   m7,        m3,        m0\r\n\r\n    pmaddubsw   m1,        m6\r\n    pmaddubsw   m7,        m5\r\n\r\n    paddw       m1,        m7\r\n\r\n    pmulhrsw    m1,        m4\r\n    packuswb    m1,        m1\r\n%endmacro\r\n\r\n%macro FILTER_V4_W8_H3 0\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m7,        m0,        m1\r\n\r\n    pmaddubsw   m2,        m6\r\n    pmaddubsw   m7,        m5\r\n\r\n    paddw       m2,        m7\r\n\r\n    pmulhrsw    m2,        m4\r\n    packuswb    m2,        m2\r\n%endmacro\r\n\r\n%macro FILTER_V4_W8_H4 0\r\n    punpcklbw   m3,        m0\r\n    punpcklbw   m7,        m1,        m2\r\n\r\n    pmaddubsw   m3,        m6\r\n    pmaddubsw   m7,        m5\r\n\r\n    paddw       m3,        m7\r\n\r\n    pmulhrsw    m3,        m4\r\n    packuswb    m3,        m3\r\n%endmacro\r\n\r\n%macro FILTER_V4_W8_H5 0\r\n    punpcklbw   m0,        m1\r\n    punpcklbw   m7,        m2,        m3\r\n\r\n    pmaddubsw   m0,        m6\r\n    pmaddubsw   m7,        m5\r\n\r\n    paddw       m0,        m7\r\n\r\n    pmulhrsw    m0,        m4\r\n    packuswb    m0,        m0\r\n%endmacro\r\n\r\n%macro FILTER_V4_W8_8x2 2\r\n    FILTER_V4_W8 %1, %2\r\n    movq        m0,        [r0 + 4 * r1]\r\n\r\n    FILTER_V4_W8_H2\r\n\r\n    movh        [r2 + r3], m1\r\n%endmacro\r\n\r\n%macro FILTER_V4_W8_8x4 2\r\n    FILTER_V4_W8_8x2 %1, %2\r\n;8x3\r\n    lea         r6,        [r0 + 4 * r1]\r\n    movq        m1,        [r6 + r1]\r\n\r\n    FILTER_V4_W8_H3\r\n\r\n    movh        [r2 + 2 * r3], m2\r\n\r\n;8x4\r\n    movq        m2,        [r6 + 2 * r1]\r\n\r\n    FILTER_V4_W8_H4\r\n\r\n    lea         r5,        [r2 + 2 * r3]\r\n    movh        [r5 + r3], m3\r\n%endmacro\r\n\r\n%macro FILTER_V4_W8_8x6 2\r\n    FILTER_V4_W8_8x4 %1, %2\r\n;8x5\r\n  
  lea         r6,        [r6 + 2 * r1]\r\n    movq        m3,        [r6 + r1]\r\n\r\n    FILTER_V4_W8_H5\r\n\r\n    movh        [r2 + 4 * r3], m0\r\n\r\n;8x6\r\n    movq        m0,        [r0 + 8 * r1]\r\n\r\n    FILTER_V4_W8_H2\r\n\r\n    lea         r5,        [r2 + 4 * r3]\r\n    movh        [r5 + r3], m1\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W8 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8\r\n\r\n    mov         r4d,       r4m\r\n\r\n    sub         r0,        r1\r\n    movq        m0,        [r0]\r\n    movq        m1,        [r0 + r1]\r\n    movq        m2,        [r0 + 2 * r1]\r\n    lea         r5,        [r0 + 2 * r1]\r\n    movq        m3,        [r5 + r1]\r\n\r\n    punpcklbw   m0,        m1\r\n    punpcklbw   m4,        m2,          m3\r\n\r\n%ifdef PIC\r\n    lea         r6,        [tab_ChromaCoeff]\r\n    movd        m5,        [r6 + r4 * 4]\r\n%else\r\n    movd        m5,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m6,        m5,       [tab_Vm]\r\n    pmaddubsw   m0,        m6\r\n\r\n    pshufb      m5,        [tab_Vm + 16]\r\n    pmaddubsw   m4,        m5\r\n\r\n    paddw       m0,        m4\r\n\r\n    mova        m4,        [pw_512]\r\n\r\n    pmulhrsw    m0,        m4\r\n    packuswb    m0,        m0\r\n    movh        [r2],      m0\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_8x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n    FILTER_V4_W8_8x2 8, 2\r\n\r\n    
RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n    FILTER_V4_W8_8x4 8, 4\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_8x6(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n    FILTER_V4_W8_8x6 8, 6\r\n\r\n    RET\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_4x2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_4x2, 4, 6, 6\r\n\r\n    mov         r4d, r4m\r\n    sub         r0, r1\r\n    add         r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea         r5, [tab_ChromaCoeff]\r\n    movd        m0, [r5 + r4 * 4]\r\n%else\r\n    movd        m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m0, [tab_Cm]\r\n\r\n    movd        m2, [r0]\r\n    movd        m3, [r0 + r1]\r\n    lea         r5, [r0 + 2 * r1]\r\n    movd        m4, [r5]\r\n    movd        m5, [r5 + r1]\r\n\r\n    punpcklbw   m2, m3\r\n    punpcklbw   m1, m4, m5\r\n    punpcklbw   m2, m1\r\n\r\n    pmaddubsw   m2, m0\r\n\r\n    movd        m1, [r0 + 4 * r1]\r\n\r\n    punpcklbw   m3, m4\r\n    punpcklbw   m5, m1\r\n    punpcklbw   m3, m5\r\n\r\n    pmaddubsw   m3, m0\r\n\r\n    phaddw      m2, m3\r\n\r\n    psubw       m2, [pw_2000]\r\n    movh        [r2], m2\r\n    movhps      [r2 + r3], m2\r\n\r\n    
RET\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_4x4, 4, 6, 7\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m0, [r5 + r4 * 4]\r\n%else\r\n    movd       m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m0, [tab_Cm]\r\n\r\n    lea        r4, [r1 * 3]\r\n    lea        r5, [r0 + 4 * r1]\r\n\r\n    movd       m2, [r0]\r\n    movd       m3, [r0 + r1]\r\n    movd       m4, [r0 + 2 * r1]\r\n    movd       m5, [r0 + r4]\r\n\r\n    punpcklbw  m2, m3\r\n    punpcklbw  m6, m4, m5\r\n    punpcklbw  m2, m6\r\n\r\n    pmaddubsw  m2, m0\r\n\r\n    movd       m6, [r5]\r\n\r\n    punpcklbw  m3, m4\r\n    punpcklbw  m1, m5, m6\r\n    punpcklbw  m3, m1\r\n\r\n    pmaddubsw  m3, m0\r\n\r\n    phaddw     m2, m3\r\n\r\n    mova       m1, [pw_2000]\r\n\r\n    psubw      m2, m1\r\n    movh       [r2], m2\r\n    movhps     [r2 + r3], m2\r\n\r\n    movd       m2, [r5 + r1]\r\n\r\n    punpcklbw  m4, m5\r\n    punpcklbw  m3, m6, m2\r\n    punpcklbw  m4, m3\r\n\r\n    pmaddubsw  m4, m0\r\n\r\n    movd       m3, [r5 + 2 * r1]\r\n\r\n    punpcklbw  m5, m6\r\n    punpcklbw  m2, m3\r\n    punpcklbw  m5, m2\r\n\r\n    pmaddubsw  m5, m0\r\n\r\n    phaddw     m4, m5\r\n\r\n    psubw      m4, m1\r\n    lea        r2, [r2 + 2 * r3]\r\n    movh       [r2], m4\r\n    movhps     [r2 + r3], m4\r\n\r\n    RET\r\n\r\n;---------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int 
coeffIdx)\r\n;---------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V_PS_W4_H4 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m0, [r5 + r4 * 4]\r\n%else\r\n    movd       m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m0, [tab_Cm]\r\n\r\n    mova       m1, [pw_2000]\r\n\r\n    mov        r4d, %2/4\r\n    lea        r5, [3 * r1]\r\n\r\n.loop:\r\n    movd       m2, [r0]\r\n    movd       m3, [r0 + r1]\r\n    movd       m4, [r0 + 2 * r1]\r\n    movd       m5, [r0 + r5]\r\n\r\n    punpcklbw  m2, m3\r\n    punpcklbw  m6, m4, m5\r\n    punpcklbw  m2, m6\r\n\r\n    pmaddubsw  m2, m0\r\n\r\n    lea        r0, [r0 + 4 * r1]\r\n    movd       m6, [r0]\r\n\r\n    punpcklbw  m3, m4\r\n    punpcklbw  m7, m5, m6\r\n    punpcklbw  m3, m7\r\n\r\n    pmaddubsw  m3, m0\r\n\r\n    phaddw     m2, m3\r\n\r\n    psubw      m2, m1\r\n    movh       [r2], m2\r\n    movhps     [r2 + r3], m2\r\n\r\n    movd       m2, [r0 + r1]\r\n\r\n    punpcklbw  m4, m5\r\n    punpcklbw  m3, m6, m2\r\n    punpcklbw  m4, m3\r\n\r\n    pmaddubsw  m4, m0\r\n\r\n    movd       m3, [r0 + 2 * r1]\r\n\r\n    punpcklbw  m5, m6\r\n    punpcklbw  m2, m3\r\n    punpcklbw  m5, m2\r\n\r\n    pmaddubsw  m5, m0\r\n\r\n    phaddw     m4, m5\r\n\r\n    psubw      m4, m1\r\n    lea        r2, [r2 + 2 * r3]\r\n    movh       [r2], m4\r\n    movhps     [r2 + r3], m4\r\n\r\n    lea        r2, [r2 + 2 * r3]\r\n\r\n    dec        r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V_PS_W4_H4 4, 8\r\n    FILTER_V_PS_W4_H4 4, 16\r\n\r\n    FILTER_V_PS_W4_H4 4, 32\r\n\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, 
int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V_PS_W8_H8_H16_H2 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_%1x%2, 4, 6, 7\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m5, [r5 + r4 * 4]\r\n%else\r\n    movd       m5, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m6, m5, [tab_Vm]\r\n    pshufb     m5, [tab_Vm + 16]\r\n    mova       m4, [pw_2000]\r\n\r\n    mov        r4d, %2/2\r\n    lea        r5, [3 * r1]\r\n\r\n.loopH:\r\n    movq       m0, [r0]\r\n    movq       m1, [r0 + r1]\r\n    movq       m2, [r0 + 2 * r1]\r\n    movq       m3, [r0 + r5]\r\n\r\n    punpcklbw  m0, m1\r\n    punpcklbw  m1, m2\r\n    punpcklbw  m2, m3\r\n\r\n    pmaddubsw  m0, m6\r\n    pmaddubsw  m2, m5\r\n\r\n    paddw      m0, m2\r\n\r\n    psubw      m0, m4\r\n    movu       [r2], m0\r\n\r\n    movq       m0, [r0 + 4 * r1]\r\n\r\n    punpcklbw  m3, m0\r\n\r\n    pmaddubsw  m1, m6\r\n    pmaddubsw  m3, m5\r\n\r\n    paddw      m1, m3\r\n    psubw      m1, m4\r\n\r\n    movu       [r2 + r3], m1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    lea        r2, [r2 + 2 * r3]\r\n\r\n    dec        r4d\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V_PS_W8_H8_H16_H2 8, 2\r\n    FILTER_V_PS_W8_H8_H16_H2 8, 4\r\n    FILTER_V_PS_W8_H8_H16_H2 8, 6\r\n\r\n    FILTER_V_PS_W8_H8_H16_H2 8, 12\r\n    FILTER_V_PS_W8_H8_H16_H2 8, 64\r\n\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V_PS_W8_H8_H16_H32 2\r\nINIT_XMM 
sse4\r\ncglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m5, [r5 + r4 * 4]\r\n%else\r\n    movd       m5, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m6, m5, [tab_Vm]\r\n    pshufb     m5, [tab_Vm + 16]\r\n    mova       m4, [pw_2000]\r\n\r\n    mov        r4d, %2/4\r\n    lea        r5, [3 * r1]\r\n\r\n.loop:\r\n    movq       m0, [r0]\r\n    movq       m1, [r0 + r1]\r\n    movq       m2, [r0 + 2 * r1]\r\n    movq       m3, [r0 + r5]\r\n\r\n    punpcklbw  m0, m1\r\n    punpcklbw  m1, m2\r\n    punpcklbw  m2, m3\r\n\r\n    pmaddubsw  m0, m6\r\n    pmaddubsw  m7, m2, m5\r\n\r\n    paddw      m0, m7\r\n\r\n    psubw      m0, m4\r\n    movu       [r2], m0\r\n\r\n    lea        r0, [r0 + 4 * r1]\r\n    movq       m0, [r0]\r\n\r\n    punpcklbw  m3, m0\r\n\r\n    pmaddubsw  m1, m6\r\n    pmaddubsw  m7, m3, m5\r\n\r\n    paddw      m1, m7\r\n\r\n    psubw      m1, m4\r\n    movu       [r2 + r3], m1\r\n\r\n    movq       m1, [r0 + r1]\r\n\r\n    punpcklbw  m0, m1\r\n\r\n    pmaddubsw  m2, m6\r\n    pmaddubsw  m0, m5\r\n\r\n    paddw      m2, m0\r\n\r\n    psubw      m2, m4\r\n    lea        r2, [r2 + 2 * r3]\r\n    movu       [r2], m2\r\n\r\n    movq       m2, [r0 + 2 * r1]\r\n\r\n    punpcklbw  m1, m2\r\n\r\n    pmaddubsw  m3, m6\r\n    pmaddubsw  m1, m5\r\n\r\n    paddw      m3, m1\r\n    psubw      m3, m4\r\n\r\n    movu       [r2 + r3], m3\r\n\r\n    lea        r2, [r2 + 2 * r3]\r\n\r\n    dec        r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V_PS_W8_H8_H16_H32 8,  8\r\n    FILTER_V_PS_W8_H8_H16_H32 8, 16\r\n    FILTER_V_PS_W8_H8_H16_H32 8, 32\r\n\r\n;------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_6x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int 
coeffIdx)\r\n;------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V_PS_W6 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_6x%2, 4, 6, 8\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m5, [r5 + r4 * 4]\r\n%else\r\n    movd       m5, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m6, m5, [tab_Vm]\r\n    pshufb     m5, [tab_Vm + 16]\r\n    mova       m4, [pw_2000]\r\n    lea        r5, [3 * r1]\r\n    mov        r4d, %2/4\r\n\r\n.loop:\r\n    movq       m0, [r0]\r\n    movq       m1, [r0 + r1]\r\n    movq       m2, [r0 + 2 * r1]\r\n    movq       m3, [r0 + r5]\r\n\r\n    punpcklbw  m0, m1\r\n    punpcklbw  m1, m2\r\n    punpcklbw  m2, m3\r\n\r\n    pmaddubsw  m0, m6\r\n    pmaddubsw  m7, m2, m5\r\n\r\n    paddw      m0, m7\r\n    psubw      m0, m4\r\n\r\n    movh       [r2], m0\r\n    pshufd     m0, m0, 2\r\n    movd       [r2 + 8], m0\r\n\r\n    lea        r0, [r0 + 4 * r1]\r\n    movq       m0, [r0]\r\n    punpcklbw  m3, m0\r\n\r\n    pmaddubsw  m1, m6\r\n    pmaddubsw  m7, m3, m5\r\n\r\n    paddw      m1, m7\r\n    psubw      m1, m4\r\n\r\n    movh       [r2 + r3], m1\r\n    pshufd     m1, m1, 2\r\n    movd       [r2 + r3 + 8], m1\r\n\r\n    movq       m1, [r0 + r1]\r\n    punpcklbw  m0, m1\r\n\r\n    pmaddubsw  m2, m6\r\n    pmaddubsw  m0, m5\r\n\r\n    paddw      m2, m0\r\n    psubw      m2, m4\r\n\r\n    lea        r2, [r2 + 2 * r3]\r\n    movh       [r2], m2\r\n    pshufd     m2, m2, 2\r\n    movd       [r2 + 8], m2\r\n\r\n    movq       m2, [r0 + 2 * r1]\r\n    punpcklbw  m1, m2\r\n\r\n    pmaddubsw  m3, m6\r\n    pmaddubsw  m1, m5\r\n\r\n    paddw      m3, m1\r\n    psubw      m3, m4\r\n\r\n    movh       [r2 + r3], m3\r\n    pshufd     m3, m3, 2\r\n    movd       [r2 + r3 + 8], m3\r\n\r\n    lea        r2, [r2 + 2 * r3]\r\n\r\n    dec        r4d\r\n    jnz        
.loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V_PS_W6 6, 8\r\n    FILTER_V_PS_W6 6, 16\r\n\r\n;---------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_12x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;---------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V_PS_W12 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_12x%2, 4, 6, 8\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m0, [r5 + r4 * 4]\r\n%else\r\n    movd       m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m1, m0, [tab_Vm]\r\n    pshufb     m0, [tab_Vm + 16]\r\n\r\n    mov        r4d, %2/2\r\n\r\n.loop:\r\n    movu       m2, [r0]\r\n    movu       m3, [r0 + r1]\r\n\r\n    punpcklbw  m4, m2, m3\r\n    punpckhbw  m2, m3\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m2, m1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movu       m5, [r0]\r\n    movu       m7, [r0 + r1]\r\n\r\n    punpcklbw  m6, m5, m7\r\n    pmaddubsw  m6, m0\r\n    paddw      m4, m6\r\n\r\n    punpckhbw  m6, m5, m7\r\n    pmaddubsw  m6, m0\r\n    paddw      m2, m6\r\n\r\n    mova       m6, [pw_2000]\r\n\r\n    psubw      m4, m6\r\n    psubw      m2, m6\r\n\r\n    movu       [r2], m4\r\n    movh       [r2 + 16], m2\r\n\r\n    punpcklbw  m4, m3, m5\r\n    punpckhbw  m3, m5\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m3, m1\r\n\r\n    movu       m2, [r0 + 2 * r1]\r\n\r\n    punpcklbw  m5, m7, m2\r\n    punpckhbw  m7, m2\r\n\r\n    pmaddubsw  m5, m0\r\n    pmaddubsw  m7, m0\r\n\r\n    paddw      m4, m5\r\n    paddw      m3, m7\r\n\r\n    psubw      m4, m6\r\n    psubw      m3, m6\r\n\r\n    movu       [r2 + r3], m4\r\n    movh       [r2 + r3 + 16], m3\r\n\r\n    lea        r2, [r2 + 2 * r3]\r\n\r\n    dec        
r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V_PS_W12 12, 16\r\n    FILTER_V_PS_W12 12, 32\r\n\r\n;---------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_16x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;---------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V_PS_W16 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m0, [r5 + r4 * 4]\r\n%else\r\n    movd       m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m1, m0, [tab_Vm]\r\n    pshufb     m0, [tab_Vm + 16]\r\n    mov        r4d, %2/2\r\n\r\n.loop:\r\n    movu       m2, [r0]\r\n    movu       m3, [r0 + r1]\r\n\r\n    punpcklbw  m4, m2, m3\r\n    punpckhbw  m2, m3\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m2, m1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movu       m5, [r0]\r\n    movu       m7, [r0 + r1]\r\n\r\n    punpcklbw  m6, m5, m7\r\n    pmaddubsw  m6, m0\r\n    paddw      m4, m6\r\n\r\n    punpckhbw  m6, m5, m7\r\n    pmaddubsw  m6, m0\r\n    paddw      m2, m6\r\n\r\n    mova       m6, [pw_2000]\r\n\r\n    psubw      m4, m6\r\n    psubw      m2, m6\r\n\r\n    movu       [r2], m4\r\n    movu       [r2 + 16], m2\r\n\r\n    punpcklbw  m4, m3, m5\r\n    punpckhbw  m3, m5\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m3, m1\r\n\r\n    movu       m5, [r0 + 2 * r1]\r\n\r\n    punpcklbw  m2, m7, m5\r\n    punpckhbw  m7, m5\r\n\r\n    pmaddubsw  m2, m0\r\n    pmaddubsw  m7, m0\r\n\r\n    paddw      m4, m2\r\n    paddw      m3, m7\r\n\r\n    psubw      m4, m6\r\n    psubw      m3, m6\r\n\r\n    movu       [r2 + r3], m4\r\n    movu       [r2 + r3 + 16], m3\r\n\r\n    lea        r2, [r2 + 2 * 
r3]\r\n\r\n    dec        r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V_PS_W16 16,  4\r\n    FILTER_V_PS_W16 16,  8\r\n    FILTER_V_PS_W16 16, 12\r\n    FILTER_V_PS_W16 16, 16\r\n    FILTER_V_PS_W16 16, 32\r\n\r\n    FILTER_V_PS_W16 16, 24\r\n    FILTER_V_PS_W16 16, 64\r\n\r\n;---------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_24x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;---------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V4_PS_W24 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_24x%2, 4, 6, 8\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m0, [r5 + r4 * 4]\r\n%else\r\n    movd       m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m1, m0, [tab_Vm]\r\n    pshufb     m0, [tab_Vm + 16]\r\n\r\n    mov        r4d, %2/2\r\n\r\n.loop:\r\n    movu       m2, [r0]\r\n    movu       m3, [r0 + r1]\r\n\r\n    punpcklbw  m4, m2, m3\r\n    punpckhbw  m2, m3\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m2, m1\r\n\r\n    lea        r5, [r0 + 2 * r1]\r\n\r\n    movu       m5, [r5]\r\n    movu       m7, [r5 + r1]\r\n\r\n    punpcklbw  m6, m5, m7\r\n    pmaddubsw  m6, m0\r\n    paddw      m4, m6\r\n\r\n    punpckhbw  m6, m5, m7\r\n    pmaddubsw  m6, m0\r\n    paddw      m2, m6\r\n\r\n    mova       m6, [pw_2000]\r\n\r\n    psubw      m4, m6\r\n    psubw      m2, m6\r\n\r\n    movu       [r2], m4\r\n    movu       [r2 + 16], m2\r\n\r\n    punpcklbw  m4, m3, m5\r\n    punpckhbw  m3, m5\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m3, m1\r\n\r\n    movu       m2, [r5 + 2 * r1]\r\n\r\n    punpcklbw  m5, m7, m2\r\n    punpckhbw  m7, m2\r\n\r\n    pmaddubsw  m5, m0\r\n    pmaddubsw  m7, m0\r\n\r\n    paddw      m4, m5\r\n    
paddw      m3, m7\r\n\r\n    psubw      m4, m6\r\n    psubw      m3, m6\r\n\r\n    movu       [r2 + r3], m4\r\n    movu       [r2 + r3 + 16], m3\r\n\r\n    movq       m2, [r0 + 16]\r\n    movq       m3, [r0 + r1 + 16]\r\n    movq       m4, [r5 + 16]\r\n    movq       m5, [r5 + r1 + 16]\r\n\r\n    punpcklbw  m2, m3\r\n    punpcklbw  m7, m4, m5\r\n\r\n    pmaddubsw  m2, m1\r\n    pmaddubsw  m7, m0\r\n\r\n    paddw      m2, m7\r\n    psubw      m2, m6\r\n\r\n    movu       [r2 + 32], m2\r\n\r\n    movq       m2, [r5 + 2 * r1 + 16]\r\n\r\n    punpcklbw  m3, m4\r\n    punpcklbw  m5, m2\r\n\r\n    pmaddubsw  m3, m1\r\n    pmaddubsw  m5, m0\r\n\r\n    paddw      m3, m5\r\n    psubw      m3,  m6\r\n\r\n    movu       [r2 + r3 + 32], m3\r\n\r\n    mov        r0, r5\r\n    lea        r2, [r2 + 2 * r3]\r\n\r\n    dec        r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_PS_W24 24, 32\r\n\r\n    FILTER_V4_PS_W24 24, 64\r\n\r\n;---------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_32x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;---------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V_PS_W32 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_%1x%2, 4, 6, 8\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m0, [r5 + r4 * 4]\r\n%else\r\n    movd       m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m1, m0, [tab_Vm]\r\n    pshufb     m0, [tab_Vm + 16]\r\n\r\n    mova       m7, [pw_2000]\r\n\r\n    mov        r4d, %2\r\n\r\n.loop:\r\n    movu       m2, [r0]\r\n    movu       m3, [r0 + r1]\r\n\r\n    punpcklbw  m4, m2, m3\r\n    punpckhbw  m2, m3\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m2, m1\r\n\r\n    lea        r5, [r0 + 2 * 
r1]\r\n    movu       m3, [r5]\r\n    movu       m5, [r5 + r1]\r\n\r\n    punpcklbw  m6, m3, m5\r\n    punpckhbw  m3, m5\r\n\r\n    pmaddubsw  m6, m0\r\n    pmaddubsw  m3, m0\r\n\r\n    paddw      m4, m6\r\n    paddw      m2, m3\r\n\r\n    psubw      m4, m7\r\n    psubw      m2, m7\r\n\r\n    movu       [r2], m4\r\n    movu       [r2 + 16], m2\r\n\r\n    movu       m2, [r0 + 16]\r\n    movu       m3, [r0 + r1 + 16]\r\n\r\n    punpcklbw  m4, m2, m3\r\n    punpckhbw  m2, m3\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m2, m1\r\n\r\n    movu       m3, [r5 + 16]\r\n    movu       m5, [r5 + r1 + 16]\r\n\r\n    punpcklbw  m6, m3, m5\r\n    punpckhbw  m3, m5\r\n\r\n    pmaddubsw  m6, m0\r\n    pmaddubsw  m3, m0\r\n\r\n    paddw      m4, m6\r\n    paddw      m2, m3\r\n\r\n    psubw      m4, m7\r\n    psubw      m2, m7\r\n\r\n    movu       [r2 + 32], m4\r\n    movu       [r2 + 48], m2\r\n\r\n    lea        r0, [r0 + r1]\r\n    lea        r2, [r2 + r3]\r\n\r\n    dec        r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V_PS_W32 32,  8\r\n    FILTER_V_PS_W32 32, 16\r\n    FILTER_V_PS_W32 32, 24\r\n    FILTER_V_PS_W32 32, 32\r\n\r\n    FILTER_V_PS_W32 32, 48\r\n    FILTER_V_PS_W32 32, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W8_H8_H16_H32 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m5,        [r5 + r4 * 4]\r\n%else\r\n    movd        m5,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m6,        m5,       [tab_Vm]\r\n    pshufb      m5,        [tab_Vm + 16]\r\n    mova        m4,        [pw_512]\r\n 
   lea         r5,        [r1 * 3]\r\n\r\n    mov         r4d,       %2\r\n\r\n.loop:\r\n    movq        m0,        [r0]\r\n    movq        m1,        [r0 + r1]\r\n    movq        m2,        [r0 + 2 * r1]\r\n    movq        m3,        [r0 + r5]\r\n\r\n    punpcklbw   m0,        m1\r\n    punpcklbw   m1,        m2\r\n    punpcklbw   m2,        m3\r\n\r\n    pmaddubsw   m0,        m6\r\n    pmaddubsw   m7,        m2, m5\r\n\r\n    paddw       m0,        m7\r\n\r\n    pmulhrsw    m0,        m4\r\n    packuswb    m0,        m0\r\n    movh        [r2],      m0\r\n\r\n    lea         r0,        [r0 + 4 * r1]\r\n    movq        m0,        [r0]\r\n\r\n    punpcklbw   m3,        m0\r\n\r\n    pmaddubsw   m1,        m6\r\n    pmaddubsw   m7,        m3, m5\r\n\r\n    paddw       m1,        m7\r\n\r\n    pmulhrsw    m1,        m4\r\n    packuswb    m1,        m1\r\n    movh        [r2 + r3], m1\r\n\r\n    movq        m1,        [r0 + r1]\r\n\r\n    punpcklbw   m0,        m1\r\n\r\n    pmaddubsw   m2,        m6\r\n    pmaddubsw   m0,        m5\r\n\r\n    paddw       m2,        m0\r\n\r\n    pmulhrsw    m2,        m4\r\n\r\n    movq        m7,        [r0 + 2 * r1]\r\n    punpcklbw   m1,        m7\r\n\r\n    pmaddubsw   m3,        m6\r\n    pmaddubsw   m1,        m5\r\n\r\n    paddw       m3,        m1\r\n\r\n    pmulhrsw    m3,        m4\r\n    packuswb    m2,        m3\r\n\r\n    lea         r2,        [r2 + 2 * r3]\r\n    movh        [r2],      m2\r\n    movhps      [r2 + r3], m2\r\n\r\n    lea         r2,        [r2 + 2 * r3]\r\n\r\n    sub         r4,         4\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_W8_H8_H16_H32 8,  8\r\n    FILTER_V4_W8_H8_H16_H32 8, 16\r\n    FILTER_V4_W8_H8_H16_H32 8, 32\r\n\r\n    FILTER_V4_W8_H8_H16_H32 8, 12\r\n    FILTER_V4_W8_H8_H16_H32 8, 64\r\n\r\n%macro PROCESS_CHROMA_AVX2_W8_8R 0\r\n    movq            xm1, [r0]                       ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 
1\r\n    punpcklbw       xm1, xm2                        ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3                        ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10]\r\n    vinserti128     m5, m1, xm2, 1                  ; m5 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    pmaddubsw       m5, [r5]\r\n    movq            xm4, [r0 + r4]                  ; m4 = row 3\r\n    punpcklbw       xm3, xm4                        ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm1, [r0]                       ; m1 = row 4\r\n    punpcklbw       xm4, xm1                        ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30]\r\n    vinserti128     m2, m3, xm4, 1                  ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    pmaddubsw       m0, m2, [r5 + 1 * mmsize]\r\n    paddw           m5, m0\r\n    pmaddubsw       m2, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 5\r\n    punpcklbw       xm1, xm3                        ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    movq            xm4, [r0 + r1 * 2]              ; m4 = row 6\r\n    punpcklbw       xm3, xm4                        ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50]\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    pmaddubsw       m0, m1, [r5 + 1 * mmsize]\r\n    paddw           m2, m0\r\n    pmaddubsw       m1, [r5]\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 7\r\n    punpcklbw       xm4, xm3                        ; m4 = [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60]\r\n    lea         
    r0, [r0 + r1 * 4]\r\n    movq            xm0, [r0]                       ; m0 = row 8\r\n    punpcklbw       xm3, xm0                        ; m3 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70]\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60]\r\n    pmaddubsw       m3, m4, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n    pmaddubsw       m4, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 9\r\n    punpcklbw       xm0, xm3                        ; m0 = [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80]\r\n    movq            xm6, [r0 + r1 * 2]              ; m6 = row 10\r\n    punpcklbw       xm3, xm6                        ; m3 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90]\r\n    vinserti128     m0, m0, xm3, 1                  ; m0 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90] - [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80]\r\n    pmaddubsw       m0, [r5 + 1 * mmsize]\r\n    paddw           m4, m0\r\n%endmacro\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_8x8 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x8, 4, 6, 7\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n    PROCESS_CHROMA_AVX2_W8_8R\r\n%ifidn %1,pp\r\n    lea             r4, [r3 * 3]\r\n    mova            m3, [pw_512]\r\n    pmulhrsw        m5, m3                          ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m3                          ; m2 = word: row 2, row 3\r\n    pmulhrsw        m1, m3                          ; m1 = word: row 4, row 5\r\n    pmulhrsw        m4, m3                          ; m4 = word: row 6, row 7\r\n    packuswb        m5, m2\r\n    packuswb        
m1, m4\r\n    vextracti128    xm2, m5, 1\r\n    vextracti128    xm4, m1, 1\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r4], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm1\r\n    movq            [r2 + r3], xm4\r\n    movhps          [r2 + r3 * 2], xm1\r\n    movhps          [r2 + r4], xm4\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m3, [pw_2000]\r\n    lea             r4, [r3 * 3]\r\n    psubw           m5, m3                          ; m5 = word: row 0, row 1\r\n    psubw           m2, m3                          ; m2 = word: row 2, row 3\r\n    psubw           m1, m3                          ; m1 = word: row 4, row 5\r\n    psubw           m4, m3                          ; m4 = word: row 6, row 7\r\n    vextracti128    xm6, m5, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm0, m1, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm6\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm1\r\n    movu            [r2 + r3], xm0\r\n    movu            [r2 + r3 * 2], xm4\r\n    vextracti128    xm4, m4, 1\r\n    movu            [r2 + r4], xm4\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_8x8 pp\r\n    FILTER_VER_CHROMA_AVX2_8x8 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_8x6 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x6, 4, 6, 6\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n\r\n    movq            xm1, [r0]                       ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 
1\r\n    punpcklbw       xm1, xm2                        ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3                        ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10]\r\n    vinserti128     m5, m1, xm2, 1                  ; m5 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    pmaddubsw       m5, [r5]\r\n    movq            xm4, [r0 + r4]                  ; m4 = row 3\r\n    punpcklbw       xm3, xm4                        ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm1, [r0]                       ; m1 = row 4\r\n    punpcklbw       xm4, xm1                        ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30]\r\n    vinserti128     m2, m3, xm4, 1                  ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    pmaddubsw       m0, m2, [r5 + 1 * mmsize]\r\n    paddw           m5, m0\r\n    pmaddubsw       m2, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 5\r\n    punpcklbw       xm1, xm3                        ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    movq            xm4, [r0 + r1 * 2]              ; m4 = row 6\r\n    punpcklbw       xm3, xm4                        ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50]\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    pmaddubsw       m0, m1, [r5 + 1 * mmsize]\r\n    paddw           m2, m0\r\n    pmaddubsw       m1, [r5]\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 7\r\n    punpcklbw       xm4, xm3                        ; m4 = [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60]\r\n    lea         
    r0, [r0 + r1 * 4]\r\n    movq            xm0, [r0]                       ; m0 = row 8\r\n    punpcklbw       xm3, xm0                        ; m3 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70]\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60]\r\n    pmaddubsw       m4, [r5 + 1 * mmsize]\r\n    paddw           m1, m4\r\n%ifidn %1,pp\r\n    lea             r4, [r3 * 3]\r\n    mova            m3, [pw_512]\r\n    pmulhrsw        m5, m3                          ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m3                          ; m2 = word: row 2, row 3\r\n    pmulhrsw        m1, m3                          ; m1 = word: row 4, row 5\r\n    packuswb        m5, m2\r\n    packuswb        m1, m1\r\n    vextracti128    xm2, m5, 1\r\n    vextracti128    xm4, m1, 1\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r4], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm1\r\n    movq            [r2 + r3], xm4\r\n%else\r\n    add             r3d, r3d\r\n    mova            m3, [pw_2000]\r\n    lea             r4, [r3 * 3]\r\n    psubw           m5, m3                          ; m5 = word: row 0, row 1\r\n    psubw           m2, m3                          ; m2 = word: row 2, row 3\r\n    psubw           m1, m3                          ; m1 = word: row 4, row 5\r\n    vextracti128    xm4, m5, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm0, m1, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm4\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm1\r\n    movu            [r2 + r3], xm0\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_8x6 pp\r\n    FILTER_VER_CHROMA_AVX2_8x6 
ps\r\n\r\n%macro PROCESS_CHROMA_AVX2_W8_16R 1\r\n    movq            xm1, [r0]                       ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 1\r\n    punpcklbw       xm1, xm2\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m5, m1, xm2, 1\r\n    pmaddubsw       m5, [r5]\r\n    movq            xm4, [r0 + r4]                  ; m4 = row 3\r\n    punpcklbw       xm3, xm4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm1, [r0]                       ; m1 = row 4\r\n    punpcklbw       xm4, xm1\r\n    vinserti128     m2, m3, xm4, 1\r\n    pmaddubsw       m0, m2, [r5 + 1 * mmsize]\r\n    paddw           m5, m0\r\n    pmaddubsw       m2, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 5\r\n    punpcklbw       xm1, xm3\r\n    movq            xm4, [r0 + r1 * 2]              ; m4 = row 6\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m0, m1, [r5 + 1 * mmsize]\r\n    paddw           m2, m0\r\n    pmaddubsw       m1, [r5]\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 7\r\n    punpcklbw       xm4, xm3\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm0, [r0]                       ; m0 = row 8\r\n    punpcklbw       xm3, xm0\r\n    vinserti128     m4, m4, xm3, 1\r\n    pmaddubsw       m3, m4, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n    pmaddubsw       m4, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 9\r\n    punpcklbw       xm0, xm3\r\n    movq            xm6, [r0 + r1 * 2]              ; m6 = row 10\r\n    punpcklbw       xm3, xm6\r\n    vinserti128     m0, m0, xm3, 1\r\n    pmaddubsw       m3, m0, [r5 + 1 * mmsize]\r\n    paddw           m4, m3\r\n    pmaddubsw       m0, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m5, m7                          ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m7           
               ; m2 = word: row 2, row 3\r\n    pmulhrsw        m1, m7                          ; m1 = word: row 4, row 5\r\n    pmulhrsw        m4, m7                          ; m4 = word: row 6, row 7\r\n    packuswb        m5, m2\r\n    packuswb        m1, m4\r\n    vextracti128    xm2, m5, 1\r\n    vextracti128    xm4, m1, 1\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r6], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm1\r\n    movq            [r2 + r3], xm4\r\n    movhps          [r2 + r3 * 2], xm1\r\n    movhps          [r2 + r6], xm4\r\n%else\r\n    psubw           m5, m7                          ; m5 = word: row 0, row 1\r\n    psubw           m2, m7                          ; m2 = word: row 2, row 3\r\n    psubw           m1, m7                          ; m1 = word: row 4, row 5\r\n    psubw           m4, m7                          ; m4 = word: row 6, row 7\r\n    vextracti128    xm3, m5, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm3\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    vextracti128    xm5, m1, 1\r\n    vextracti128    xm3, m4, 1\r\n    movu            [r2], xm1\r\n    movu            [r2 + r3], xm5\r\n    movu            [r2 + r3 * 2], xm4\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 11\r\n    punpcklbw       xm6, xm3\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm5, [r0]                       ; m5 = row 12\r\n    punpcklbw       xm3, xm5\r\n    vinserti128     m6, m6, xm3, 1\r\n    pmaddubsw       m3, m6, [r5 + 1 * mmsize]\r\n    paddw           m0, m3\r\n    pmaddubsw       m6, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 13\r\n    punpcklbw       xm5, xm3\r\n    
movq            xm2, [r0 + r1 * 2]              ; m2 = row 14\r\n    punpcklbw       xm3, xm2\r\n    vinserti128     m5, m5, xm3, 1\r\n    pmaddubsw       m3, m5, [r5 + 1 * mmsize]\r\n    paddw           m6, m3\r\n    pmaddubsw       m5, [r5]\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 15\r\n    punpcklbw       xm2, xm3\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm1, [r0]                       ; m1 = row 16\r\n    punpcklbw       xm3, xm1\r\n    vinserti128     m2, m2, xm3, 1\r\n    pmaddubsw       m3, m2, [r5 + 1 * mmsize]\r\n    paddw           m5, m3\r\n    pmaddubsw       m2, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 17\r\n    punpcklbw       xm1, xm3\r\n    movq            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5 + 1 * mmsize]\r\n    paddw           m2, m1\r\n    lea             r2, [r2 + r3 * 4]\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m7                          ; m0 = word: row 8, row 9\r\n    pmulhrsw        m6, m7                          ; m6 = word: row 10, row 11\r\n    pmulhrsw        m5, m7                          ; m5 = word: row 12, row 13\r\n    pmulhrsw        m2, m7                          ; m2 = word: row 14, row 15\r\n    packuswb        m0, m6\r\n    packuswb        m5, m2\r\n    vextracti128    xm6, m0, 1\r\n    vextracti128    xm2, m5, 1\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm6\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r6], xm6\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    psubw           m0, m7                          ; m0 = word: row 8, row 9\r\n    psubw           m6, m7                          ; m6 = word: row 10, row 11\r\n    psubw       
    m5, m7                          ; m5 = word: row 12, row 13\r\n    psubw           m2, m7                          ; m2 = word: row 14, row 15\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m6, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r6], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    vextracti128    xm1, m5, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_8x16 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x16, 4, 7, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m7, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    mova            m7, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    PROCESS_CHROMA_AVX2_W8_16R %1\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_8x16 pp\r\n    FILTER_VER_CHROMA_AVX2_8x16 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_8x12 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x12, 4, 7, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1, pp\r\n    mova            m7, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    mova            m7, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    movq            xm1, [r0] 
                      ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 1\r\n    punpcklbw       xm1, xm2\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m5, m1, xm2, 1\r\n    pmaddubsw       m5, [r5]\r\n    movq            xm4, [r0 + r4]                  ; m4 = row 3\r\n    punpcklbw       xm3, xm4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm1, [r0]                       ; m1 = row 4\r\n    punpcklbw       xm4, xm1\r\n    vinserti128     m2, m3, xm4, 1\r\n    pmaddubsw       m0, m2, [r5 + 1 * mmsize]\r\n    paddw           m5, m0\r\n    pmaddubsw       m2, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 5\r\n    punpcklbw       xm1, xm3\r\n    movq            xm4, [r0 + r1 * 2]              ; m4 = row 6\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m0, m1, [r5 + 1 * mmsize]\r\n    paddw           m2, m0\r\n    pmaddubsw       m1, [r5]\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 7\r\n    punpcklbw       xm4, xm3\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm0, [r0]                       ; m0 = row 8\r\n    punpcklbw       xm3, xm0\r\n    vinserti128     m4, m4, xm3, 1\r\n    pmaddubsw       m3, m4, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n    pmaddubsw       m4, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 9\r\n    punpcklbw       xm0, xm3\r\n    movq            xm6, [r0 + r1 * 2]              ; m6 = row 10\r\n    punpcklbw       xm3, xm6\r\n    vinserti128     m0, m0, xm3, 1\r\n    pmaddubsw       m3, m0, [r5 + 1 * mmsize]\r\n    paddw           m4, m3\r\n    pmaddubsw       m0, [r5]\r\n%ifidn %1, pp\r\n    pmulhrsw        m5, m7                          ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m7                          ; m2 = word: row 2, row 3\r\n    pmulhrsw        m1, m7        
                  ; m1 = word: row 4, row 5\r\n    pmulhrsw        m4, m7                          ; m4 = word: row 6, row 7\r\n    packuswb        m5, m2\r\n    packuswb        m1, m4\r\n    vextracti128    xm2, m5, 1\r\n    vextracti128    xm4, m1, 1\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r6], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm1\r\n    movq            [r2 + r3], xm4\r\n    movhps          [r2 + r3 * 2], xm1\r\n    movhps          [r2 + r6], xm4\r\n%else\r\n    psubw           m5, m7                          ; m5 = word: row 0, row 1\r\n    psubw           m2, m7                          ; m2 = word: row 2, row 3\r\n    psubw           m1, m7                          ; m1 = word: row 4, row 5\r\n    psubw           m4, m7                          ; m4 = word: row 6, row 7\r\n    vextracti128    xm3, m5, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm3\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    vextracti128    xm5, m1, 1\r\n    vextracti128    xm3, m4, 1\r\n    movu            [r2], xm1\r\n    movu            [r2 + r3], xm5\r\n    movu            [r2 + r3 * 2], xm4\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 11\r\n    punpcklbw       xm6, xm3\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm5, [r0]                       ; m5 = row 12\r\n    punpcklbw       xm3, xm5\r\n    vinserti128     m6, m6, xm3, 1\r\n    pmaddubsw       m3, m6, [r5 + 1 * mmsize]\r\n    paddw           m0, m3\r\n    pmaddubsw       m6, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 13\r\n    punpcklbw       xm5, xm3\r\n    movq            xm2, [r0 + r1 * 2]              ; m2 = row 14\r\n    punpcklbw    
   xm3, xm2\r\n    vinserti128     m5, m5, xm3, 1\r\n    pmaddubsw       m3, m5, [r5 + 1 * mmsize]\r\n    paddw           m6, m3\r\n    lea             r2, [r2 + r3 * 4]\r\n%ifidn %1, pp\r\n    pmulhrsw        m0, m7                          ; m0 = word: row 8, row 9\r\n    pmulhrsw        m6, m7                          ; m6 = word: row 10, row 11\r\n    packuswb        m0, m6\r\n    vextracti128    xm6, m0, 1\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm6\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r6], xm6\r\n%else\r\n    psubw           m0, m7                          ; m0 = word: row 8, row 9\r\n    psubw           m6, m7                          ; m6 = word: row 10, row 11\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m6, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_8x12 pp\r\n    FILTER_VER_CHROMA_AVX2_8x12 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_8xN 2\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x%2, 4, 7, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m7, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    mova            m7, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n%rep %2 / 16\r\n    PROCESS_CHROMA_AVX2_W8_16R %1\r\n    lea             r2, [r2 + r3 * 4]\r\n%endrep\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_8xN pp, 32\r\n    FILTER_VER_CHROMA_AVX2_8xN ps, 32\r\n    FILTER_VER_CHROMA_AVX2_8xN pp, 64\r\n    FILTER_VER_CHROMA_AVX2_8xN ps, 64\r\n\r\n%macro PROCESS_CHROMA_AVX2_W8_4R 
0\r\n    movq            xm1, [r0]                       ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 1\r\n    punpcklbw       xm1, xm2                        ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3                        ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10]\r\n    vinserti128     m0, m1, xm2, 1                  ; m0 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    pmaddubsw       m0, [r5]\r\n    movq            xm4, [r0 + r4]                  ; m4 = row 3\r\n    punpcklbw       xm3, xm4                        ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm1, [r0]                       ; m1 = row 4\r\n    punpcklbw       xm4, xm1                        ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30]\r\n    vinserti128     m2, m3, xm4, 1                  ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 5\r\n    punpcklbw       xm1, xm3                        ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    movq            xm4, [r0 + r1 * 2]              ; m4 = row 6\r\n    punpcklbw       xm3, xm4                        ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50]\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    pmaddubsw       m1, [r5 + 1 * mmsize]\r\n    paddw           m2, m1\r\n%endmacro\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_8x4 1\r\nINIT_YMM avx2\r\ncglobal 
interp_4tap_vert_%1_8x4, 4, 6, 5\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n    PROCESS_CHROMA_AVX2_W8_4R\r\n%ifidn %1,pp\r\n    lea             r4, [r3 * 3]\r\n    mova            m3, [pw_512]\r\n    pmulhrsw        m0, m3                          ; m0 = word: row 0, row 1\r\n    pmulhrsw        m2, m3                          ; m2 = word: row 2, row 3\r\n    packuswb        m0, m2\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r4], xm2\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m3, [pw_2000]\r\n    lea             r4, [r3 * 3]\r\n    psubw           m0, m3                          ; m0 = word: row 0, row 1\r\n    psubw           m2, m3                          ; m2 = word: row 2, row 3\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm4, m2, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm4\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_8x4 pp\r\n    FILTER_VER_CHROMA_AVX2_8x4 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_8x2 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x2, 4, 6, 4\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n\r\n    movq            xm1, [r0]                       ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 1\r\n   
 punpcklbw       xm1, xm2                        ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3                        ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10]\r\n    vinserti128     m1, m1, xm2, 1                  ; m1 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    pmaddubsw       m1, [r5]\r\n    movq            xm2, [r0 + r4]                  ; m2 = row 3\r\n    punpcklbw       xm3, xm2                        ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    movq            xm0, [r0 + r1 * 4]              ; m0 = row 4\r\n    punpcklbw       xm2, xm0                        ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30]\r\n    vinserti128     m3, m3, xm2, 1                  ; m3 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    pmaddubsw       m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n%ifidn %1,pp\r\n    pmulhrsw        m1, [pw_512]                    ; m1 = word: row 0, row 1\r\n    packuswb        m1, m1\r\n    vextracti128    xm0, m1, 1\r\n    movq            [r2], xm1\r\n    movq            [r2 + r3], xm0\r\n%else\r\n    add             r3d, r3d\r\n    psubw           m1, [pw_2000]                   ; m1 = word: row 0, row 1\r\n    vextracti128    xm0, m1, 1\r\n    movu            [r2], xm1\r\n    movu            [r2 + r3], xm0\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_8x2 pp\r\n    FILTER_VER_CHROMA_AVX2_8x2 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_6x8 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_6x8, 4, 6, 7\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n 
   lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n    PROCESS_CHROMA_AVX2_W8_8R\r\n%ifidn %1,pp\r\n    lea             r4, [r3 * 3]\r\n    mova            m3, [pw_512]\r\n    pmulhrsw        m5, m3                          ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m3                          ; m2 = word: row 2, row 3\r\n    pmulhrsw        m1, m3                          ; m1 = word: row 4, row 5\r\n    pmulhrsw        m4, m3                          ; m4 = word: row 6, row 7\r\n    packuswb        m5, m2\r\n    packuswb        m1, m4\r\n    vextracti128    xm2, m5, 1\r\n    vextracti128    xm4, m1, 1\r\n    movd            [r2], xm5\r\n    pextrw          [r2 + 4], xm5, 2\r\n    movd            [r2 + r3], xm2\r\n    pextrw          [r2 + r3 + 4], xm2, 2\r\n    pextrd          [r2 + r3 * 2], xm5, 2\r\n    pextrw          [r2 + r3 * 2 + 4], xm5, 6\r\n    pextrd          [r2 + r4], xm2, 2\r\n    pextrw          [r2 + r4 + 4], xm2, 6\r\n    lea             r2, [r2 + r3 * 4]\r\n    movd            [r2], xm1\r\n    pextrw          [r2 + 4], xm1, 2\r\n    movd            [r2 + r3], xm4\r\n    pextrw          [r2 + r3 + 4], xm4, 2\r\n    pextrd          [r2 + r3 * 2], xm1, 2\r\n    pextrw          [r2 + r3 * 2 + 4], xm1, 6\r\n    pextrd          [r2 + r4], xm4, 2\r\n    pextrw          [r2 + r4 + 4], xm4, 6\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m3, [pw_2000]\r\n    lea             r4, [r3 * 3]\r\n    psubw           m5, m3                          ; m5 = word: row 0, row 1\r\n    psubw           m2, m3                          ; m2 = word: row 2, row 3\r\n    psubw           m1, m3                          ; m1 = word: row 4, row 5\r\n    psubw           m4, m3                          ; m4 = word: row 6, row 7\r\n    vextracti128    xm6, m5, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm0, m1, 1\r\n    movq            [r2], xm5\r\n    pextrd          [r2 + 8], xm5, 2\r\n    movq            [r2 + r3], 
xm6\r\n    pextrd          [r2 + r3 + 8], xm6, 2\r\n    movq            [r2 + r3 * 2], xm2\r\n    pextrd          [r2 + r3 * 2 + 8], xm2, 2\r\n    movq            [r2 + r4], xm3\r\n    pextrd          [r2 + r4 + 8], xm3, 2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm1\r\n    pextrd          [r2 + 8], xm1, 2\r\n    movq            [r2 + r3], xm0\r\n    pextrd          [r2 + r3 + 8], xm0, 2\r\n    movq            [r2 + r3 * 2], xm4\r\n    pextrd          [r2 + r3 * 2 + 8], xm4, 2\r\n    vextracti128    xm4, m4, 1\r\n    movq            [r2 + r4], xm4\r\n    pextrd          [r2 + r4 + 8], xm4, 2\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_6x8 pp\r\n    FILTER_VER_CHROMA_AVX2_6x8 ps\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_6x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W6_H4 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_6x%2, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m5,        [r5 + r4 * 4]\r\n%else\r\n    movd        m5,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m6,        m5,       [tab_Vm]\r\n    pshufb      m5,        [tab_Vm + 16]\r\n    mova        m4,        [pw_512]\r\n\r\n    mov         r4d,       %2\r\n    lea         r5,        [3 * r1]\r\n\r\n.loop:\r\n    movq        m0,        [r0]\r\n    movq        m1,        [r0 + r1]\r\n    movq        m2,        [r0 + 2 * r1]\r\n    movq        m3,        [r0 + r5]\r\n\r\n    punpcklbw   m0,        m1\r\n    punpcklbw   m1,        m2\r\n    punpcklbw   m2,        m3\r\n\r\n    pmaddubsw   m0,        m6\r\n    pmaddubsw   m7,        m2, m5\r\n\r\n    paddw       m0,        m7\r\n\r\n    pmulhrsw    m0,        
m4\r\n    packuswb    m0,        m0\r\n    movd        [r2],      m0\r\n    pextrw      [r2 + 4],  m0,    2\r\n\r\n    lea         r0,        [r0 + 4 * r1]\r\n\r\n    movq        m0,        [r0]\r\n    punpcklbw   m3,        m0\r\n\r\n    pmaddubsw   m1,        m6\r\n    pmaddubsw   m7,        m3, m5\r\n\r\n    paddw       m1,        m7\r\n\r\n    pmulhrsw    m1,        m4\r\n    packuswb    m1,        m1\r\n    movd        [r2 + r3],      m1\r\n    pextrw      [r2 + r3 + 4],  m1,    2\r\n\r\n    movq        m1,        [r0 + r1]\r\n    punpcklbw   m7,        m0,        m1\r\n\r\n    pmaddubsw   m2,        m6\r\n    pmaddubsw   m7,        m5\r\n\r\n    paddw       m2,        m7\r\n\r\n    pmulhrsw    m2,        m4\r\n    packuswb    m2,        m2\r\n    lea         r2,        [r2 + 2 * r3]\r\n    movd        [r2],      m2\r\n    pextrw      [r2 + 4],  m2,    2\r\n\r\n    movq        m2,        [r0 + 2 * r1]\r\n    punpcklbw   m1,        m2\r\n\r\n    pmaddubsw   m3,        m6\r\n    pmaddubsw   m1,        m5\r\n\r\n    paddw       m3,        m1\r\n\r\n    pmulhrsw    m3,        m4\r\n    packuswb    m3,        m3\r\n\r\n    movd        [r2 + r3],        m3\r\n    pextrw      [r2 + r3 + 4],    m3,    2\r\n\r\n    lea         r2,        [r2 + 2 * r3]\r\n\r\n    sub         r4,         4\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_W6_H4 6, 8\r\n\r\n    FILTER_V4_W6_H4 6, 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W12_H2 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_12x%2, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd       
 m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m1,        m0,       [tab_Vm]\r\n    pshufb      m0,        [tab_Vm + 16]\r\n\r\n    mov         r4d,       %2\r\n\r\n.loop:\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m2,        m1\r\n\r\n    lea         r0,        [r0 + 2 * r1]\r\n    movu        m5,        [r0]\r\n    movu        m7,        [r0 + r1]\r\n\r\n    punpcklbw   m6,        m5,        m7\r\n    pmaddubsw   m6,        m0\r\n    paddw       m4,        m6\r\n\r\n    punpckhbw   m6,        m5,        m7\r\n    pmaddubsw   m6,        m0\r\n    paddw       m2,        m6\r\n\r\n    mova        m6,        [pw_512]\r\n\r\n    pmulhrsw    m4,        m6\r\n    pmulhrsw    m2,        m6\r\n\r\n    packuswb    m4,        m2\r\n\r\n    movh         [r2],     m4\r\n    pextrd       [r2 + 8], m4,  2\r\n\r\n    punpcklbw   m4,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m3,        m1\r\n\r\n    movu        m5,        [r0 + 2 * r1]\r\n\r\n    punpcklbw   m2,        m7,        m5\r\n    punpckhbw   m7,        m5\r\n\r\n    pmaddubsw   m2,        m0\r\n    pmaddubsw   m7,        m0\r\n\r\n    paddw       m4,        m2\r\n    paddw       m3,        m7\r\n\r\n    pmulhrsw    m4,        m6\r\n    pmulhrsw    m3,        m6\r\n\r\n    packuswb    m4,        m3\r\n\r\n    movh        [r2 + r3],      m4\r\n    pextrd      [r2 + r3 + 8],  m4,  2\r\n\r\n    lea         r2,        [r2 + 2 * r3]\r\n\r\n    sub         r4,        2\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_W12_H2 12, 16\r\n\r\n    FILTER_V4_W12_H2 12, 32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, 
int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W16_H2 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_16x%2, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd        m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m1,        m0,       [tab_Vm]\r\n    pshufb      m0,        [tab_Vm + 16]\r\n\r\n    mov         r4d,       %2/2\r\n\r\n.loop:\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m2,        m1\r\n\r\n    lea         r0,        [r0 + 2 * r1]\r\n    movu        m5,        [r0]\r\n    movu        m6,        [r0 + r1]\r\n\r\n    punpckhbw   m7,        m5,        m6\r\n    pmaddubsw   m7,        m0\r\n    paddw       m2,        m7\r\n\r\n    punpcklbw   m7,        m5,        m6\r\n    pmaddubsw   m7,        m0\r\n    paddw       m4,        m7\r\n\r\n    mova        m7,        [pw_512]\r\n\r\n    pmulhrsw    m4,        m7\r\n    pmulhrsw    m2,        m7\r\n\r\n    packuswb    m4,        m2\r\n\r\n    movu        [r2],      m4\r\n\r\n    punpcklbw   m4,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m3,        m1\r\n\r\n    movu        m5,        [r0 + 2 * r1]\r\n\r\n    punpcklbw   m2,        m6,        m5\r\n    punpckhbw   m6,        m5\r\n\r\n    pmaddubsw   m2,        m0\r\n    pmaddubsw   m6,        m0\r\n\r\n    paddw       m4,        m2\r\n    paddw       m3,        m6\r\n\r\n    pmulhrsw    m4,        m7\r\n    pmulhrsw    m3,        m7\r\n\r\n    packuswb    m4,        m3\r\n\r\n    movu        [r2 + r3],      m4\r\n\r\n    lea         r2,        [r2 + 2 * r3]\r\n\r\n    dec         
r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_W16_H2 16,  4\r\n    FILTER_V4_W16_H2 16,  8\r\n    FILTER_V4_W16_H2 16, 12\r\n    FILTER_V4_W16_H2 16, 16\r\n    FILTER_V4_W16_H2 16, 32\r\n\r\n    FILTER_V4_W16_H2 16, 24\r\n    FILTER_V4_W16_H2 16, 64\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_16x16 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_4tap_vert_%1_16x16, 4, 6, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    mova            m12, [r5]\r\n    mova            m13, [r5 + mmsize]\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n    lea             r5, [r3 * 3]\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, m12\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, m12\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, m13\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, m13\r\n    paddw           
m1, m5\r\n    pmaddubsw       m3, m12\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, m13\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, m12\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, m13\r\n    paddw           m3, m7\r\n    pmaddubsw       m5, m12\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhbw       xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, m13\r\n    paddw           m4, m8\r\n    pmaddubsw       m6, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, m13\r\n    paddw           m5, m9\r\n    pmaddubsw       m7, m12\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, m13\r\n    paddw           m6, m10\r\n    pmaddubsw       m8, m12\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, m13\r\n    paddw           m7, m11\r\n    pmaddubsw       m9, m12\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 0\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m14           
              ; m3 = word: row 3\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 4\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 5\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 6\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 7\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    packuswb        m4, m5\r\n    packuswb        m6, m7\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r5], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm5\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r5], xm7\r\n%else\r\n    psubw           m0, m14                         ; m0 = word: row 0\r\n    psubw           m1, m14                         ; m1 = word: row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2\r\n    psubw           m3, m14                         ; m3 = word: row 3\r\n    psubw           m4, m14                         ; m4 = word: row 4\r\n    psubw           m5, m14                         ; m5 = word: row 5\r\n    psubw           m6, m14                         ; m6 = word: row 6\r\n    psubw           m7, m14                         ; m7 = word: row 7\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r5], m3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m5\r\n    movu            [r2 + r3 * 2], m6\r\n    
movu            [r2 + r5], m7\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm11, [r0 + r4]                 ; m11 = row 11\r\n    punpckhbw       xm6, xm10, xm11\r\n    punpcklbw       xm10, xm11\r\n    vinserti128     m10, m10, xm6, 1\r\n    pmaddubsw       m6, m10, m13\r\n    paddw           m8, m6\r\n    pmaddubsw       m10, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm6, [r0]                       ; m6 = row 12\r\n    punpckhbw       xm7, xm11, xm6\r\n    punpcklbw       xm11, xm6\r\n    vinserti128     m11, m11, xm7, 1\r\n    pmaddubsw       m7, m11, m13\r\n    paddw           m9, m7\r\n    pmaddubsw       m11, m12\r\n\r\n    movu            xm7, [r0 + r1]                  ; m7 = row 13\r\n    punpckhbw       xm0, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm0, 1\r\n    pmaddubsw       m0, m6, m13\r\n    paddw           m10, m0\r\n    pmaddubsw       m6, m12\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14\r\n    punpckhbw       xm1, xm7, xm0\r\n    punpcklbw       xm7, xm0\r\n    vinserti128     m7, m7, xm1, 1\r\n    pmaddubsw       m1, m7, m13\r\n    paddw           m11, m1\r\n    pmaddubsw       m7, m12\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 15\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m2, m0, m13\r\n    paddw           m6, m2\r\n    pmaddubsw       m0, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm2, [r0]                       ; m2 = row 16\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m3, m1, m13\r\n    paddw           m7, m3\r\n    pmaddubsw       m1, m12\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 17\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n 
   pmaddubsw       m2, m13\r\n    paddw           m0, m2\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m3, m13\r\n    paddw           m1, m3\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m8, m14                         ; m8 = word: row 8\r\n    pmulhrsw        m9, m14                         ; m9 = word: row 9\r\n    pmulhrsw        m10, m14                        ; m10 = word: row 10\r\n    pmulhrsw        m11, m14                        ; m11 = word: row 11\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 12\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 13\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 14\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 15\r\n    packuswb        m8, m9\r\n    packuswb        m10, m11\r\n    packuswb        m6, m7\r\n    packuswb        m0, m1\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m10, m10, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    vextracti128    xm7, m6, 1\r\n    vextracti128    xm1, m0, 1\r\n    movu            [r2], xm8\r\n    movu            [r2 + r3], xm9\r\n    movu            [r2 + r3 * 2], xm10\r\n    movu            [r2 + r5], xm11\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm6\r\n    movu            [r2 + r3], xm7\r\n    movu            [r2 + r3 * 2], xm0\r\n    movu            [r2 + r5], xm1\r\n%else\r\n    psubw           m8, m14                         ; m8 = word: row 8\r\n    psubw           m9, m14                         ; m9 = word: row 9\r\n    psubw           m10, m14                        ; m10 = word: row 10\r\n    psubw           m11, m14                        ; m11 = word: row 11\r\n   
 psubw           m6, m14                         ; m6 = word: row 12\r\n    psubw           m7, m14                         ; m7 = word: row 13\r\n    psubw           m0, m14                         ; m0 = word: row 14\r\n    psubw           m1, m14                         ; m1 = word: row 15\r\n    movu            [r2], m8\r\n    movu            [r2 + r3], m9\r\n    movu            [r2 + r3 * 2], m10\r\n    movu            [r2 + r5], m11\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m6\r\n    movu            [r2 + r3], m7\r\n    movu            [r2 + r3 * 2], m0\r\n    movu            [r2 + r5], m1\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_16x16 pp\r\n    FILTER_VER_CHROMA_AVX2_16x16 ps\r\n%macro FILTER_VER_CHROMA_AVX2_16x8 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_16x8, 4, 7, 7\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m6, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    mova            m6, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 
1\r\n    pmaddubsw       m4, m2, [r5 + mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m6                          ; m0 = word: row 0\r\n    pmulhrsw        m1, m6                          ; m1 = word: row 1\r\n    packuswb        m0, m1\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n%else\r\n    psubw           m0, m6                          ; m0 = word: row 0\r\n    psubw           m1, m6                          ; m1 = word: row 1\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n%endif\r\n\r\n    movu            xm0, [r0 + r1]                  ; m0 = row 5\r\n    punpckhbw       xm1, xm4, xm0\r\n    punpcklbw       xm4, xm0\r\n    vinserti128     m4, m4, xm1, 1\r\n    pmaddubsw       m1, m4, [r5 + mmsize]\r\n    paddw           m2, m1\r\n    pmaddubsw       m4, [r5]\r\n    movu            xm1, [r0 + r1 * 2]              ; m1 = row 6\r\n    punpckhbw       xm5, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddubsw       m5, m0, [r5 + mmsize]\r\n    paddw           m3, m5\r\n    pmaddubsw       m0, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m6                          ; m2 = word: row 2\r\n    pmulhrsw        m3, m6                          ; m3 = word: row 3\r\n    packuswb        m2, m3\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%else\r\n    psubw           m2, m6                          ; m2 = 
word: row 2\r\n    psubw           m3, m6                          ; m3 = word: row 3\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r6], m3\r\n%endif\r\n\r\n    movu            xm2, [r0 + r4]                  ; m2 = row 7\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m3, m1, [r5 + mmsize]\r\n    paddw           m4, m3\r\n    pmaddubsw       m1, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm3, [r0]                       ; m3 = row 8\r\n    punpckhbw       xm5, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm5, 1\r\n    pmaddubsw       m5, m2, [r5 + mmsize]\r\n    paddw           m0, m5\r\n    pmaddubsw       m2, [r5]\r\n    lea             r2, [r2 + r3 * 4]\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m6                          ; m4 = word: row 4\r\n    pmulhrsw        m0, m6                          ; m0 = word: row 5\r\n    packuswb        m4, m0\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm0, m4, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm0\r\n%else\r\n    psubw           m4, m6                          ; m4 = word: row 4\r\n    psubw           m0, m6                          ; m0 = word: row 5\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m0\r\n%endif\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 9\r\n    punpckhbw       xm4, xm3, xm5\r\n    punpcklbw       xm3, xm5\r\n    vinserti128     m3, m3, xm4, 1\r\n    pmaddubsw       m3, [r5 + mmsize]\r\n    paddw           m1, m3\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 10\r\n    punpckhbw       xm0, xm5, xm4\r\n    punpcklbw       xm5, xm4\r\n    vinserti128     m5, m5, xm0, 1\r\n    pmaddubsw       m5, [r5 + mmsize]\r\n    paddw           m2, m5\r\n%ifidn %1,pp\r\n    pmulhrsw        m1, m6                          ; m1 = word: row 6\r\n    
pmulhrsw        m2, m6                          ; m2 = word: row 7\r\n    packuswb        m1, m2\r\n    vpermq          m1, m1, 11011000b\r\n    vextracti128    xm2, m1, 1\r\n    movu            [r2 + r3 * 2], xm1\r\n    movu            [r2 + r6], xm2\r\n%else\r\n    psubw           m1, m6                          ; m1 = word: row 6\r\n    psubw           m2, m6                          ; m2 = word: row 7\r\n    movu            [r2 + r3 * 2], m1\r\n    movu            [r2 + r6], m2\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_16x8 pp\r\n    FILTER_VER_CHROMA_AVX2_16x8 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_16x12 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_4tap_vert_%1_16x12, 4, 6, 10\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    mova            m8, [r5]\r\n    mova            m9, [r5 + mmsize]\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m7, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m7, [pw_2000]\r\n%endif\r\n    lea             r5, [r3 * 3]\r\n\r\n    movu            xm0, [r0]\r\n    vinserti128     m0, m0, [r0 + r1 * 2], 1\r\n    movu            xm1, [r0 + r1]\r\n    vinserti128     m1, m1, [r0 + r4], 1\r\n\r\n    punpcklbw       m2, m0, m1\r\n    punpckhbw       m3, m0, m1\r\n    vperm2i128      m4, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    pmaddubsw       m4, m8\r\n    pmaddubsw       m3, m2, m9\r\n    paddw           m4, m3\r\n    pmaddubsw       m2, m8\r\n\r\n    vextracti128    xm0, m0, 1\r\n    lea             r0, [r0 + r1 * 4]\r\n    vinserti128     m0, m0, [r0], 1\r\n\r\n    punpcklbw       m5, m1, m0\r\n    punpckhbw       m3, m1, m0\r\n    vperm2i128      m6, m5, m3, 0x20\r\n    vperm2i128      m5, m5, m3, 
0x31\r\n    pmaddubsw       m6, m8\r\n    pmaddubsw       m3, m5, m9\r\n    paddw           m6, m3\r\n    pmaddubsw       m5, m8\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                         ; m4 = word: row 0\r\n    pmulhrsw        m6, m7                         ; m6 = word: row 1\r\n    packuswb        m4, m6\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm6, m4, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm6\r\n%else\r\n    psubw           m4, m7                         ; m4 = word: row 0\r\n    psubw           m6, m7                         ; m6 = word: row 1\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m6\r\n%endif\r\n\r\n    movu            xm4, [r0 + r1 * 2]\r\n    vinserti128     m4, m4, [r0 + r1], 1\r\n    vextracti128    xm1, m4, 1\r\n    vinserti128     m0, m0, xm1, 0\r\n\r\n    punpcklbw       m6, m0, m4\r\n    punpckhbw       m1, m0, m4\r\n    vperm2i128      m0, m6, m1, 0x20\r\n    vperm2i128      m6, m6, m1, 0x31\r\n    pmaddubsw       m1, m0, m9\r\n    paddw           m5, m1\r\n    pmaddubsw       m0, m8\r\n    pmaddubsw       m1, m6, m9\r\n    paddw           m2, m1\r\n    pmaddubsw       m6, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 2\r\n    pmulhrsw        m5, m7                         ; m5 = word: row 3\r\n    packuswb        m2, m5\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm5, m2, 1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r5], xm5\r\n%else\r\n    psubw           m2, m7                         ; m2 = word: row 2\r\n    psubw           m5, m7                         ; m5 = word: row 3\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r5], m5\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r0 + r4]\r\n    lea             r0, [r0 + r1 * 4]\r\n    vinserti128     m1, m1, [r0], 1\r\n    vinserti128     m4, m4, xm1, 
1\r\n\r\n    punpcklbw       m2, m4, m1\r\n    punpckhbw       m5, m4, m1\r\n    vperm2i128      m3, m2, m5, 0x20\r\n    vperm2i128      m2, m2, m5, 0x31\r\n    pmaddubsw       m5, m3, m9\r\n    paddw           m6, m5\r\n    pmaddubsw       m3, m8\r\n    pmaddubsw       m5, m2, m9\r\n    paddw           m0, m5\r\n    pmaddubsw       m2, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m7                         ; m6 = word: row 4\r\n    pmulhrsw        m0, m7                         ; m0 = word: row 5\r\n    packuswb        m6, m0\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm0, m6, 1\r\n    movu            [r2], xm6\r\n    movu            [r2 + r3], xm0\r\n%else\r\n    psubw           m6, m7                         ; m6 = word: row 4\r\n    psubw           m0, m7                         ; m0 = word: row 5\r\n    movu            [r2], m6\r\n    movu            [r2 + r3], m0\r\n%endif\r\n\r\n    movu            xm6, [r0 + r1 * 2]\r\n    vinserti128     m6, m6, [r0 + r1], 1\r\n    vextracti128    xm0, m6, 1\r\n    vinserti128     m1, m1, xm0, 0\r\n\r\n    punpcklbw       m4, m1, m6\r\n    punpckhbw       m5, m1, m6\r\n    vperm2i128      m0, m4, m5, 0x20\r\n    vperm2i128      m5, m4, m5, 0x31\r\n    pmaddubsw       m4, m0, m9\r\n    paddw           m2, m4\r\n    pmaddubsw       m0, m8\r\n    pmaddubsw       m4, m5, m9\r\n    paddw           m3, m4\r\n    pmaddubsw       m5, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m3, m7                         ; m3 = word: row 6\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 7\r\n    packuswb        m3, m2\r\n    vpermq          m3, m3, 11011000b\r\n    vextracti128    xm2, m3, 1\r\n    movu            [r2 + r3 * 2], xm3\r\n    movu            [r2 + r5], xm2\r\n%else\r\n    psubw           m3, m7                         ; m3 = word: row 6\r\n    psubw           m2, m7                         ; m2 = word: row 7\r\n    movu            [r2 + r3 * 2], m3\r\n    movu            [r2 + 
r5], m2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm3, [r0 + r4]\r\n    lea             r0, [r0 + r1 * 4]\r\n    vinserti128     m3, m3, [r0], 1\r\n    vinserti128     m6, m6, xm3, 1\r\n\r\n    punpcklbw       m2, m6, m3\r\n    punpckhbw       m1, m6, m3\r\n    vperm2i128      m4, m2, m1, 0x20\r\n    vperm2i128      m2, m2, m1, 0x31\r\n    pmaddubsw       m1, m4, m9\r\n    paddw           m5, m1\r\n    pmaddubsw       m4, m8\r\n    pmaddubsw       m1, m2, m9\r\n    paddw           m0, m1\r\n    pmaddubsw       m2, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m5, m7                         ; m5 = word: row 8\r\n    pmulhrsw        m0, m7                         ; m0 = word: row 9\r\n    packuswb        m5, m0\r\n    vpermq          m5, m5, 11011000b\r\n    vextracti128    xm0, m5, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm0\r\n%else\r\n    psubw           m5, m7                         ; m5 = word: row 8\r\n    psubw           m0, m7                         ; m0 = word: row 9\r\n    movu            [r2], m5\r\n    movu            [r2 + r3], m0\r\n%endif\r\n\r\n    movu            xm5, [r0 + r1 * 2]\r\n    vinserti128     m5, m5, [r0 + r1], 1\r\n    vextracti128    xm0, m5, 1\r\n    vinserti128     m3, m3, xm0, 0\r\n\r\n    punpcklbw       m1, m3, m5\r\n    punpckhbw       m0, m3, m5\r\n    vperm2i128      m6, m1, m0, 0x20\r\n    vperm2i128      m0, m1, m0, 0x31\r\n    pmaddubsw       m1, m6, m9\r\n    paddw           m2, m1\r\n    pmaddubsw       m1, m0, m9\r\n    paddw           m4, m1\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                         ; m4 = word: row 10\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 11\r\n    packuswb        m4, m2\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm2, m4, 1\r\n    movu            [r2 + r3 * 2], xm4\r\n    movu            [r2 + r5], xm2\r\n%else\r\n    psubw           m4, m7                         ; m4 = 
word: row 10\r\n    psubw           m2, m7                         ; m2 = word: row 11\r\n    movu            [r2 + r3 * 2], m4\r\n    movu            [r2 + r5], m2\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_16x12 pp\r\n    FILTER_VER_CHROMA_AVX2_16x12 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_16xN 2\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_16x%2, 4, 8, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m7, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    mova            m7, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r7d, %2 / 16\r\n.loopH:\r\n    movu            xm0, [r0]\r\n    vinserti128     m0, m0, [r0 + r1 * 2], 1\r\n    movu            xm1, [r0 + r1]\r\n    vinserti128     m1, m1, [r0 + r4], 1\r\n\r\n    punpcklbw       m2, m0, m1\r\n    punpckhbw       m3, m0, m1\r\n    vperm2i128      m4, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    pmaddubsw       m4, [r5]\r\n    pmaddubsw       m3, m2, [r5 + mmsize]\r\n    paddw           m4, m3\r\n    pmaddubsw       m2, [r5]\r\n\r\n    vextracti128    xm0, m0, 1\r\n    lea             r0, [r0 + r1 * 4]\r\n    vinserti128     m0, m0, [r0], 1\r\n\r\n    punpcklbw       m5, m1, m0\r\n    punpckhbw       m3, m1, m0\r\n    vperm2i128      m6, m5, m3, 0x20\r\n    vperm2i128      m5, m5, m3, 0x31\r\n    pmaddubsw       m6, [r5]\r\n    pmaddubsw       m3, m5, [r5 + mmsize]\r\n    paddw           m6, m3\r\n    pmaddubsw       m5, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                         ; m4 = word: row 0\r\n    pmulhrsw        m6, m7                         ; m6 = word: row 1\r\n    packuswb        
m4, m6\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm6, m4, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm6\r\n%else\r\n    psubw           m4, m7                         ; m4 = word: row 0\r\n    psubw           m6, m7                         ; m6 = word: row 1\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m6\r\n%endif\r\n\r\n    movu            xm4, [r0 + r1 * 2]\r\n    vinserti128     m4, m4, [r0 + r1], 1\r\n    vextracti128    xm1, m4, 1\r\n    vinserti128     m0, m0, xm1, 0\r\n\r\n    punpcklbw       m6, m0, m4\r\n    punpckhbw       m1, m0, m4\r\n    vperm2i128      m0, m6, m1, 0x20\r\n    vperm2i128      m6, m6, m1, 0x31\r\n    pmaddubsw       m1, m0, [r5 + mmsize]\r\n    paddw           m5, m1\r\n    pmaddubsw       m0, [r5]\r\n    pmaddubsw       m1, m6, [r5 + mmsize]\r\n    paddw           m2, m1\r\n    pmaddubsw       m6, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 2\r\n    pmulhrsw        m5, m7                         ; m5 = word: row 3\r\n    packuswb        m2, m5\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm5, m2, 1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm5\r\n%else\r\n    psubw           m2, m7                         ; m2 = word: row 2\r\n    psubw           m5, m7                         ; m5 = word: row 3\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r6], m5\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r0 + r4]\r\n    lea             r0, [r0 + r1 * 4]\r\n    vinserti128     m1, m1, [r0], 1\r\n    vinserti128     m4, m4, xm1, 1\r\n\r\n    punpcklbw       m2, m4, m1\r\n    punpckhbw       m5, m4, m1\r\n    vperm2i128      m3, m2, m5, 0x20\r\n    vperm2i128      m2, m2, m5, 0x31\r\n    pmaddubsw       m5, m3, [r5 + mmsize]\r\n    paddw           m6, m5\r\n    pmaddubsw       m3, [r5]\r\n    pmaddubsw       m5, m2, 
[r5 + mmsize]\r\n    paddw           m0, m5\r\n    pmaddubsw       m2, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m7                         ; m6 = word: row 4\r\n    pmulhrsw        m0, m7                         ; m0 = word: row 5\r\n    packuswb        m6, m0\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm0, m6, 1\r\n    movu            [r2], xm6\r\n    movu            [r2 + r3], xm0\r\n%else\r\n    psubw           m6, m7                         ; m6 = word: row 4\r\n    psubw           m0, m7                         ; m0 = word: row 5\r\n    movu            [r2], m6\r\n    movu            [r2 + r3], m0\r\n%endif\r\n\r\n    movu            xm6, [r0 + r1 * 2]\r\n    vinserti128     m6, m6, [r0 + r1], 1\r\n    vextracti128    xm0, m6, 1\r\n    vinserti128     m1, m1, xm0, 0\r\n\r\n    punpcklbw       m4, m1, m6\r\n    punpckhbw       m5, m1, m6\r\n    vperm2i128      m0, m4, m5, 0x20\r\n    vperm2i128      m5, m4, m5, 0x31\r\n    pmaddubsw       m4, m0, [r5 + mmsize]\r\n    paddw           m2, m4\r\n    pmaddubsw       m0, [r5]\r\n    pmaddubsw       m4, m5, [r5 + mmsize]\r\n    paddw           m3, m4\r\n    pmaddubsw       m5, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m3, m7                         ; m3 = word: row 6\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 7\r\n    packuswb        m3, m2\r\n    vpermq          m3, m3, 11011000b\r\n    vextracti128    xm2, m3, 1\r\n    movu            [r2 + r3 * 2], xm3\r\n    movu            [r2 + r6], xm2\r\n%else\r\n    psubw           m3, m7                         ; m3 = word: row 6\r\n    psubw           m2, m7                         ; m2 = word: row 7\r\n    movu            [r2 + r3 * 2], m3\r\n    movu            [r2 + r6], m2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm3, [r0 + r4]\r\n    lea             r0, [r0 + r1 * 4]\r\n    vinserti128     m3, m3, [r0], 1\r\n    vinserti128     m6, m6, xm3, 1\r\n\r\n    punpcklbw     
  m2, m6, m3\r\n    punpckhbw       m1, m6, m3\r\n    vperm2i128      m4, m2, m1, 0x20\r\n    vperm2i128      m2, m2, m1, 0x31\r\n    pmaddubsw       m1, m4, [r5 + mmsize]\r\n    paddw           m5, m1\r\n    pmaddubsw       m4, [r5]\r\n    pmaddubsw       m1, m2, [r5 + mmsize]\r\n    paddw           m0, m1\r\n    pmaddubsw       m2, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m5, m7                         ; m5 = word: row 8\r\n    pmulhrsw        m0, m7                         ; m0 = word: row 9\r\n    packuswb        m5, m0\r\n    vpermq          m5, m5, 11011000b\r\n    vextracti128    xm0, m5, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm0\r\n%else\r\n    psubw           m5, m7                         ; m5 = word: row 8\r\n    psubw           m0, m7                         ; m0 = word: row 9\r\n    movu            [r2], m5\r\n    movu            [r2 + r3], m0\r\n%endif\r\n\r\n    movu            xm5, [r0 + r1 * 2]\r\n    vinserti128     m5, m5, [r0 + r1], 1\r\n    vextracti128    xm0, m5, 1\r\n    vinserti128     m3, m3, xm0, 0\r\n\r\n    punpcklbw       m1, m3, m5\r\n    punpckhbw       m0, m3, m5\r\n    vperm2i128      m6, m1, m0, 0x20\r\n    vperm2i128      m0, m1, m0, 0x31\r\n    pmaddubsw       m1, m6, [r5 + mmsize]\r\n    paddw           m2, m1\r\n    pmaddubsw       m6, [r5]\r\n    pmaddubsw       m1, m0, [r5 + mmsize]\r\n    paddw           m4, m1\r\n    pmaddubsw       m0, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                         ; m4 = word: row 10\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 11\r\n    packuswb        m4, m2\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm2, m4, 1\r\n    movu            [r2 + r3 * 2], xm4\r\n    movu            [r2 + r6], xm2\r\n%else\r\n    psubw           m4, m7                         ; m4 = word: row 10\r\n    psubw           m2, m7                         ; m2 = word: row 11\r\n    movu            [r2 + r3 * 2], m4\r\n 
   movu            [r2 + r6], m2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm3, [r0 + r4]\r\n    lea             r0, [r0 + r1 * 4]\r\n    vinserti128     m3, m3, [r0], 1\r\n    vinserti128     m5, m5, xm3, 1\r\n\r\n    punpcklbw       m2, m5, m3\r\n    punpckhbw       m1, m5, m3\r\n    vperm2i128      m4, m2, m1, 0x20\r\n    vperm2i128      m2, m2, m1, 0x31\r\n    pmaddubsw       m1, m4, [r5 + mmsize]\r\n    paddw           m0, m1\r\n    pmaddubsw       m4, [r5]\r\n    pmaddubsw       m1, m2, [r5 + mmsize]\r\n    paddw           m6, m1\r\n    pmaddubsw       m2, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m7                         ; m0 = word: row 12\r\n    pmulhrsw        m6, m7                         ; m6 = word: row 13\r\n    packuswb        m0, m6\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm6, m0, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm6\r\n%else\r\n    psubw           m0, m7                         ; m0 = word: row 12\r\n    psubw           m6, m7                         ; m6 = word: row 13\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m6\r\n%endif\r\n\r\n    movu            xm5, [r0 + r1 * 2]\r\n    vinserti128     m5, m5, [r0 + r1], 1\r\n    vextracti128    xm0, m5, 1\r\n    vinserti128     m3, m3, xm0, 0\r\n\r\n    punpcklbw       m1, m3, m5\r\n    punpckhbw       m0, m3, m5\r\n    vperm2i128      m6, m1, m0, 0x20\r\n    vperm2i128      m0, m1, m0, 0x31\r\n    pmaddubsw       m6, [r5 + mmsize]\r\n    paddw           m2, m6\r\n    pmaddubsw       m0, [r5 + mmsize]\r\n    paddw           m4, m0\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                         ; m4 = word: row 14\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 15\r\n    packuswb        m4, m2\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm2, m4, 1\r\n    movu            [r2 + r3 * 2], xm4\r\n    movu            [r2 + r6], 
xm2\r\n%else\r\n    psubw           m4, m7                         ; m4 = word: row 14\r\n    psubw           m2, m7                         ; m2 = word: row 15\r\n    movu            [r2 + r3 * 2], m4\r\n    movu            [r2 + r6], m2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n    dec             r7d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_16xN pp, 32\r\n    FILTER_VER_CHROMA_AVX2_16xN ps, 32\r\n    FILTER_VER_CHROMA_AVX2_16xN pp, 64\r\n    FILTER_VER_CHROMA_AVX2_16xN ps, 64\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_16x24 1\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_16x24, 4, 6, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    mova            m12, [r5]\r\n    mova            m13, [r5 + mmsize]\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n    lea             r5, [r3 * 3]\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, m12\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, m12\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, m13\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, 
m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, m13\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, m12\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, m13\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, m12\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, m13\r\n    paddw           m3, m7\r\n    pmaddubsw       m5, m12\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhbw       xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, m13\r\n    paddw           m4, m8\r\n    pmaddubsw       m6, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, m13\r\n    paddw           m5, m9\r\n    pmaddubsw       m7, m12\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, m13\r\n    paddw           m6, m10\r\n    pmaddubsw       m8, m12\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, m13\r\n    paddw           m7, m11\r\n    pmaddubsw       m9, 
m12\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 0\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m14                         ; m3 = word: row 3\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 4\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 5\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 6\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 7\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    packuswb        m4, m5\r\n    packuswb        m6, m7\r\n    vpermq          m0, m0, q3120\r\n    vpermq          m2, m2, q3120\r\n    vpermq          m4, m4, q3120\r\n    vpermq          m6, m6, q3120\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r5], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm5\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r5], xm7\r\n%else\r\n    psubw           m0, m14                         ; m0 = word: row 0\r\n    psubw           m1, m14                         ; m1 = word: row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2\r\n    psubw           m3, m14                         ; m3 = word: row 3\r\n    psubw           m4, m14                         ; m4 = word: row 4\r\n    psubw           m5, m14                         ; m5 = word: row 5\r\n    psubw           m6, m14                         ; m6 = word: row 6\r\n    psubw           m7, m14                         ; m7 = word: row 7\r\n    movu            [r2], m0\r\n 
   movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r5], m3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m5\r\n    movu            [r2 + r3 * 2], m6\r\n    movu            [r2 + r5], m7\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm11, [r0 + r4]                 ; m11 = row 11\r\n    punpckhbw       xm6, xm10, xm11\r\n    punpcklbw       xm10, xm11\r\n    vinserti128     m10, m10, xm6, 1\r\n    pmaddubsw       m6, m10, m13\r\n    paddw           m8, m6\r\n    pmaddubsw       m10, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm6, [r0]                       ; m6 = row 12\r\n    punpckhbw       xm7, xm11, xm6\r\n    punpcklbw       xm11, xm6\r\n    vinserti128     m11, m11, xm7, 1\r\n    pmaddubsw       m7, m11, m13\r\n    paddw           m9, m7\r\n    pmaddubsw       m11, m12\r\n\r\n    movu            xm7, [r0 + r1]                  ; m7 = row 13\r\n    punpckhbw       xm0, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm0, 1\r\n    pmaddubsw       m0, m6, m13\r\n    paddw           m10, m0\r\n    pmaddubsw       m6, m12\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14\r\n    punpckhbw       xm1, xm7, xm0\r\n    punpcklbw       xm7, xm0\r\n    vinserti128     m7, m7, xm1, 1\r\n    pmaddubsw       m1, m7, m13\r\n    paddw           m11, m1\r\n    pmaddubsw       m7, m12\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 15\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m2, m0, m13\r\n    paddw           m6, m2\r\n    pmaddubsw       m0, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm2, [r0]                       ; m2 = row 16\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    
pmaddubsw       m3, m1, m13\r\n    paddw           m7, m3\r\n    pmaddubsw       m1, m12\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 17\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, m13\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, m12\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, m13\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, m12\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m8, m14                         ; m8 = word: row 8\r\n    pmulhrsw        m9, m14                         ; m9 = word: row 9\r\n    pmulhrsw        m10, m14                        ; m10 = word: row 10\r\n    pmulhrsw        m11, m14                        ; m11 = word: row 11\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 12\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 13\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 14\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 15\r\n    packuswb        m8, m9\r\n    packuswb        m10, m11\r\n    packuswb        m6, m7\r\n    packuswb        m0, m1\r\n    vpermq          m8, m8, q3120\r\n    vpermq          m10, m10, q3120\r\n    vpermq          m6, m6, q3120\r\n    vpermq          m0, m0, q3120\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    vextracti128    xm7, m6, 1\r\n    vextracti128    xm1, m0, 1\r\n    movu            [r2], xm8\r\n    movu            [r2 + r3], xm9\r\n    movu            [r2 + r3 * 2], xm10\r\n    movu            [r2 + r5], xm11\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm6\r\n    movu            [r2 + r3], xm7\r\n    movu            [r2 + r3 * 2], xm0\r\n    movu            [r2 + 
r5], xm1\r\n%else\r\n    psubw           m8, m14                         ; m8 = word: row 8\r\n    psubw           m9, m14                         ; m9 = word: row 9\r\n    psubw           m10, m14                        ; m10 = word: row 10\r\n    psubw           m11, m14                        ; m11 = word: row 11\r\n    psubw           m6, m14                         ; m6 = word: row 12\r\n    psubw           m7, m14                         ; m7 = word: row 13\r\n    psubw           m0, m14                         ; m0 = word: row 14\r\n    psubw           m1, m14                         ; m1 = word: row 15\r\n    movu            [r2], m8\r\n    movu            [r2 + r3], m9\r\n    movu            [r2 + r3 * 2], m10\r\n    movu            [r2 + r5], m11\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m6\r\n    movu            [r2 + r3], m7\r\n    movu            [r2 + r3 * 2], m0\r\n    movu            [r2 + r5], m1\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm5, [r0 + r4]                  ; m5 = row 19\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, m13\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm6, [r0]                       ; m6 = row 20\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, m13\r\n    paddw           m3, m7\r\n    pmaddubsw       m5, m12\r\n    movu            xm7, [r0 + r1]                  ; m7 = row 21\r\n    punpckhbw       xm0, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm0, 1\r\n    pmaddubsw       m0, m6, m13\r\n    paddw           m4, m0\r\n    pmaddubsw       m6, m12\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 22\r\n    punpckhbw       xm1, xm7, xm0\r\n    punpcklbw       
xm7, xm0\r\n    vinserti128     m7, m7, xm1, 1\r\n    pmaddubsw       m1, m7, m13\r\n    paddw           m5, m1\r\n    pmaddubsw       m7, m12\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 23\r\n    punpckhbw       xm8, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm8, 1\r\n    pmaddubsw       m8, m0, m13\r\n    paddw           m6, m8\r\n    pmaddubsw       m0, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 24\r\n    punpckhbw       xm9, xm1, xm8\r\n    punpcklbw       xm1, xm8\r\n    vinserti128     m1, m1, xm9, 1\r\n    pmaddubsw       m9, m1, m13\r\n    paddw           m7, m9\r\n    pmaddubsw       m1, m12\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 25\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m8, m13\r\n    paddw           m0, m8\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 26\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m9, m13\r\n    paddw           m1, m9\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 16\r\n    pmulhrsw        m3, m14                         ; m3 = word: row 17\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 18\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 19\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 20\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 21\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 22\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 23\r\n    packuswb        m2, m3\r\n    packuswb        m4, m5\r\n    packuswb        m6, m7\r\n    packuswb        m0, m1\r\n    vpermq          m2, m2, q3120\r\n    vpermq 
         m4, m4, q3120\r\n    vpermq          m6, m6, q3120\r\n    vpermq          m0, m0, q3120\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    vextracti128    xm1, m0, 1\r\n    movu            [r2], xm2\r\n    movu            [r2 + r3], xm3\r\n    movu            [r2 + r3 * 2], xm4\r\n    movu            [r2 + r5], xm5\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm6\r\n    movu            [r2 + r3], xm7\r\n    movu            [r2 + r3 * 2], xm0\r\n    movu            [r2 + r5], xm1\r\n%else\r\n    psubw           m2, m14                         ; m2 = word: row 16\r\n    psubw           m3, m14                         ; m3 = word: row 17\r\n    psubw           m4, m14                         ; m4 = word: row 18\r\n    psubw           m5, m14                         ; m5 = word: row 19\r\n    psubw           m6, m14                         ; m6 = word: row 20\r\n    psubw           m7, m14                         ; m7 = word: row 21\r\n    psubw           m0, m14                         ; m0 = word: row 22\r\n    psubw           m1, m14                         ; m1 = word: row 23\r\n    movu            [r2], m2\r\n    movu            [r2 + r3], m3\r\n    movu            [r2 + r3 * 2], m4\r\n    movu            [r2 + r5], m5\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m6\r\n    movu            [r2 + r3], m7\r\n    movu            [r2 + r3 * 2], m0\r\n    movu            [r2 + r5], m1\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_16x24 pp\r\n    FILTER_VER_CHROMA_AVX2_16x24 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_24x32 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_4tap_vert_%1_24x32, 4, 9, 10\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 
+ r4]\r\n%endif\r\n\r\n    mova            m8, [r5]\r\n    mova            m9, [r5 + mmsize]\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m7, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m7, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r5d, 2\r\n.loopH:\r\n    movu            xm0, [r0]\r\n    vinserti128     m0, m0, [r0 + r1 * 2], 1\r\n    movu            xm1, [r0 + r1]\r\n    vinserti128     m1, m1, [r0 + r4], 1\r\n\r\n    punpcklbw       m2, m0, m1\r\n    punpckhbw       m3, m0, m1\r\n    vperm2i128      m4, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    pmaddubsw       m4, m8\r\n    pmaddubsw       m3, m2, m9\r\n    paddw           m4, m3\r\n    pmaddubsw       m2, m8\r\n\r\n    vextracti128    xm0, m0, 1\r\n    lea             r7, [r0 + r1 * 4]\r\n    vinserti128     m0, m0, [r7], 1\r\n\r\n    punpcklbw       m5, m1, m0\r\n    punpckhbw       m3, m1, m0\r\n    vperm2i128      m6, m5, m3, 0x20\r\n    vperm2i128      m5, m5, m3, 0x31\r\n    pmaddubsw       m6, m8\r\n    pmaddubsw       m3, m5, m9\r\n    paddw           m6, m3\r\n    pmaddubsw       m5, m8\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                         ; m4 = word: row 0\r\n    pmulhrsw        m6, m7                         ; m6 = word: row 1\r\n    packuswb        m4, m6\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm6, m4, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm6\r\n%else\r\n    psubw           m4, m7                         ; m4 = word: row 0\r\n    psubw           m6, m7                         ; m6 = word: row 1\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m6\r\n%endif\r\n\r\n    movu            xm4, [r7 + r1 * 2]\r\n    vinserti128     m4, m4, [r7 + r1], 1\r\n    vextracti128    xm1, m4, 1\r\n    vinserti128     m0, m0, xm1, 0\r\n\r\n    punpcklbw       m6, m0, m4\r\n    punpckhbw   
    m1, m0, m4\r\n    vperm2i128      m0, m6, m1, 0x20\r\n    vperm2i128      m6, m6, m1, 0x31\r\n    pmaddubsw       m1, m0, m9\r\n    paddw           m5, m1\r\n    pmaddubsw       m0, m8\r\n    pmaddubsw       m1, m6, m9\r\n    paddw           m2, m1\r\n    pmaddubsw       m6, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 2\r\n    pmulhrsw        m5, m7                         ; m5 = word: row 3\r\n    packuswb        m2, m5\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm5, m2, 1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm5\r\n%else\r\n    psubw           m2, m7                         ; m2 = word: row 2\r\n    psubw           m5, m7                         ; m5 = word: row 3\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r6], m5\r\n%endif\r\n    lea             r8, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r7 + r4]\r\n    lea             r7, [r7 + r1 * 4]\r\n    vinserti128     m1, m1, [r7], 1\r\n    vinserti128     m4, m4, xm1, 1\r\n\r\n    punpcklbw       m2, m4, m1\r\n    punpckhbw       m5, m4, m1\r\n    vperm2i128      m3, m2, m5, 0x20\r\n    vperm2i128      m2, m2, m5, 0x31\r\n    pmaddubsw       m5, m3, m9\r\n    paddw           m6, m5\r\n    pmaddubsw       m3, m8\r\n    pmaddubsw       m5, m2, m9\r\n    paddw           m0, m5\r\n    pmaddubsw       m2, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m7                         ; m6 = word: row 4\r\n    pmulhrsw        m0, m7                         ; m0 = word: row 5\r\n    packuswb        m6, m0\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm0, m6, 1\r\n    movu            [r8], xm6\r\n    movu            [r8 + r3], xm0\r\n%else\r\n    psubw           m6, m7                         ; m6 = word: row 4\r\n    psubw           m0, m7                         ; m0 = word: row 5\r\n    movu            [r8], m6\r\n    movu            [r8 + r3], 
m0\r\n%endif\r\n\r\n    movu            xm6, [r7 + r1 * 2]\r\n    vinserti128     m6, m6, [r7 + r1], 1\r\n    vextracti128    xm0, m6, 1\r\n    vinserti128     m1, m1, xm0, 0\r\n\r\n    punpcklbw       m4, m1, m6\r\n    punpckhbw       m5, m1, m6\r\n    vperm2i128      m0, m4, m5, 0x20\r\n    vperm2i128      m5, m4, m5, 0x31\r\n    pmaddubsw       m4, m0, m9\r\n    paddw           m2, m4\r\n    pmaddubsw       m0, m8\r\n    pmaddubsw       m4, m5, m9\r\n    paddw           m3, m4\r\n    pmaddubsw       m5, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m3, m7                         ; m3 = word: row 6\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 7\r\n    packuswb        m3, m2\r\n    vpermq          m3, m3, 11011000b\r\n    vextracti128    xm2, m3, 1\r\n    movu            [r8 + r3 * 2], xm3\r\n    movu            [r8 + r6], xm2\r\n%else\r\n    psubw           m3, m7                         ; m3 = word: row 6\r\n    psubw           m2, m7                         ; m2 = word: row 7\r\n    movu            [r8 + r3 * 2], m3\r\n    movu            [r8 + r6], m2\r\n%endif\r\n    lea             r8, [r8 + r3 * 4]\r\n\r\n    movu            xm3, [r7 + r4]\r\n    lea             r7, [r7 + r1 * 4]\r\n    vinserti128     m3, m3, [r7], 1\r\n    vinserti128     m6, m6, xm3, 1\r\n\r\n    punpcklbw       m2, m6, m3\r\n    punpckhbw       m1, m6, m3\r\n    vperm2i128      m4, m2, m1, 0x20\r\n    vperm2i128      m2, m2, m1, 0x31\r\n    pmaddubsw       m1, m4, m9\r\n    paddw           m5, m1\r\n    pmaddubsw       m4, m8\r\n    pmaddubsw       m1, m2, m9\r\n    paddw           m0, m1\r\n    pmaddubsw       m2, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m5, m7                         ; m5 = word: row 8\r\n    pmulhrsw        m0, m7                         ; m0 = word: row 9\r\n    packuswb        m5, m0\r\n    vpermq          m5, m5, 11011000b\r\n    vextracti128    xm0, m5, 1\r\n    movu            [r8], xm5\r\n    movu            [r8 + r3], 
xm0\r\n%else\r\n    psubw           m5, m7                         ; m5 = word: row 8\r\n    psubw           m0, m7                         ; m0 = word: row 9\r\n    movu            [r8], m5\r\n    movu            [r8 + r3], m0\r\n%endif\r\n\r\n    movu            xm5, [r7 + r1 * 2]\r\n    vinserti128     m5, m5, [r7 + r1], 1\r\n    vextracti128    xm0, m5, 1\r\n    vinserti128     m3, m3, xm0, 0\r\n\r\n    punpcklbw       m1, m3, m5\r\n    punpckhbw       m0, m3, m5\r\n    vperm2i128      m6, m1, m0, 0x20\r\n    vperm2i128      m0, m1, m0, 0x31\r\n    pmaddubsw       m1, m6, m9\r\n    paddw           m2, m1\r\n    pmaddubsw       m6, m8\r\n    pmaddubsw       m1, m0, m9\r\n    paddw           m4, m1\r\n    pmaddubsw       m0, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                         ; m4 = word: row 10\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 11\r\n    packuswb        m4, m2\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm2, m4, 1\r\n    movu            [r8 + r3 * 2], xm4\r\n    movu            [r8 + r6], xm2\r\n%else\r\n    psubw           m4, m7                         ; m4 = word: row 10\r\n    psubw           m2, m7                         ; m2 = word: row 11\r\n    movu            [r8 + r3 * 2], m4\r\n    movu            [r8 + r6], m2\r\n%endif\r\n    lea             r8, [r8 + r3 * 4]\r\n\r\n    movu            xm3, [r7 + r4]\r\n    lea             r7, [r7 + r1 * 4]\r\n    vinserti128     m3, m3, [r7], 1\r\n    vinserti128     m5, m5, xm3, 1\r\n\r\n    punpcklbw       m2, m5, m3\r\n    punpckhbw       m1, m5, m3\r\n    vperm2i128      m4, m2, m1, 0x20\r\n    vperm2i128      m2, m2, m1, 0x31\r\n    pmaddubsw       m1, m4, m9\r\n    paddw           m0, m1\r\n    pmaddubsw       m4, m8\r\n    pmaddubsw       m1, m2, m9\r\n    paddw           m6, m1\r\n    pmaddubsw       m2, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m7                         ; m0 = word: row 12\r\n    pmulhrsw        m6, 
m7                         ; m6 = word: row 13\r\n    packuswb        m0, m6\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm6, m0, 1\r\n    movu            [r8], xm0\r\n    movu            [r8 + r3], xm6\r\n%else\r\n    psubw           m0, m7                         ; m0 = word: row 12\r\n    psubw           m6, m7                         ; m6 = word: row 13\r\n    movu            [r8], m0\r\n    movu            [r8 + r3], m6\r\n%endif\r\n\r\n    movu            xm5, [r7 + r1 * 2]\r\n    vinserti128     m5, m5, [r7 + r1], 1\r\n    vextracti128    xm0, m5, 1\r\n    vinserti128     m3, m3, xm0, 0\r\n\r\n    punpcklbw       m1, m3, m5\r\n    punpckhbw       m0, m3, m5\r\n    vperm2i128      m6, m1, m0, 0x20\r\n    vperm2i128      m0, m1, m0, 0x31\r\n    pmaddubsw       m6, m9\r\n    paddw           m2, m6\r\n    pmaddubsw       m0, m9\r\n    paddw           m4, m0\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                         ; m4 = word: row 14\r\n    pmulhrsw        m2, m7                         ; m2 = word: row 15\r\n    packuswb        m4, m2\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm2, m4, 1\r\n    movu            [r8 + r3 * 2], xm4\r\n    movu            [r8 + r6], xm2\r\n    add             r2, 16\r\n%else\r\n    psubw           m4, m7                         ; m4 = word: row 14\r\n    psubw           m2, m7                         ; m2 = word: row 15\r\n    movu            [r8 + r3 * 2], m4\r\n    movu            [r8 + r6], m2\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    movq            xm1, [r0]                       ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 1\r\n    punpcklbw       xm1, xm2\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m5, m1, xm2, 1\r\n    pmaddubsw       m5, m8\r\n    movq            xm4, [r0 + r4]                  ; m4 = row 3\r\n    
punpcklbw       xm3, xm4\r\n    lea             r7, [r0 + r1 * 4]\r\n    movq            xm1, [r7]                       ; m1 = row 4\r\n    punpcklbw       xm4, xm1\r\n    vinserti128     m2, m3, xm4, 1\r\n    pmaddubsw       m0, m2, m9\r\n    paddw           m5, m0\r\n    pmaddubsw       m2, m8\r\n    movq            xm3, [r7 + r1]                  ; m3 = row 5\r\n    punpcklbw       xm1, xm3\r\n    movq            xm4, [r7 + r1 * 2]              ; m4 = row 6\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m0, m1, m9\r\n    paddw           m2, m0\r\n    pmaddubsw       m1, m8\r\n    movq            xm3, [r7 + r4]                  ; m3 = row 7\r\n    punpcklbw       xm4, xm3\r\n    lea             r7, [r7 + r1 * 4]\r\n    movq            xm0, [r7]                       ; m0 = row 8\r\n    punpcklbw       xm3, xm0\r\n    vinserti128     m4, m4, xm3, 1\r\n    pmaddubsw       m3, m4, m9\r\n    paddw           m1, m3\r\n    pmaddubsw       m4, m8\r\n    movq            xm3, [r7 + r1]                  ; m3 = row 9\r\n    punpcklbw       xm0, xm3\r\n    movq            xm6, [r7 + r1 * 2]              ; m6 = row 10\r\n    punpcklbw       xm3, xm6\r\n    vinserti128     m0, m0, xm3, 1\r\n    pmaddubsw       m3, m0, m9\r\n    paddw           m4, m3\r\n    pmaddubsw       m0, m8\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m5, m7                          ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m7                          ; m2 = word: row 2, row 3\r\n    pmulhrsw        m1, m7                          ; m1 = word: row 4, row 5\r\n    pmulhrsw        m4, m7                          ; m4 = word: row 6, row 7\r\n    packuswb        m5, m2\r\n    packuswb        m1, m4\r\n    vextracti128    xm2, m5, 1\r\n    vextracti128    xm4, m1, 1\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r6], xm2\r\n    lea             r8, [r2 + r3 * 
4]\r\n    movq            [r8], xm1\r\n    movq            [r8 + r3], xm4\r\n    movhps          [r8 + r3 * 2], xm1\r\n    movhps          [r8 + r6], xm4\r\n%else\r\n    psubw           m5, m7                          ; m5 = word: row 0, row 1\r\n    psubw           m2, m7                          ; m2 = word: row 2, row 3\r\n    psubw           m1, m7                          ; m1 = word: row 4, row 5\r\n    psubw           m4, m7                          ; m4 = word: row 6, row 7\r\n    vextracti128    xm3, m5, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm3\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    vextracti128    xm3, m1, 1\r\n    lea             r8, [r2 + r3 * 4]\r\n    movu            [r8], xm1\r\n    movu            [r8 + r3], xm3\r\n    vextracti128    xm3, m4, 1\r\n    movu            [r8 + r3 * 2], xm4\r\n    movu            [r8 + r6], xm3\r\n%endif\r\n    lea             r8, [r8 + r3 * 4]\r\n\r\n    movq            xm3, [r7 + r4]                  ; m3 = row 11\r\n    punpcklbw       xm6, xm3\r\n    lea             r7, [r7 + r1 * 4]\r\n    movq            xm5, [r7]                       ; m5 = row 12\r\n    punpcklbw       xm3, xm5\r\n    vinserti128     m6, m6, xm3, 1\r\n    pmaddubsw       m3, m6, m9\r\n    paddw           m0, m3\r\n    pmaddubsw       m6, m8\r\n    movq            xm3, [r7 + r1]                  ; m3 = row 13\r\n    punpcklbw       xm5, xm3\r\n    movq            xm2, [r7 + r1 * 2]              ; m2 = row 14\r\n    punpcklbw       xm3, xm2\r\n    vinserti128     m5, m5, xm3, 1\r\n    pmaddubsw       m3, m5, m9\r\n    paddw           m6, m3\r\n    pmaddubsw       m5, m8\r\n    movq            xm3, [r7 + r4]                  ; m3 = row 15\r\n    punpcklbw       xm2, xm3\r\n    lea             r7, [r7 + r1 * 4]\r\n    movq            xm1, [r7]                       ; m1 = row 16\r\n    punpcklbw       xm3, xm1\r\n    vinserti128     m2, 
m2, xm3, 1\r\n    pmaddubsw       m3, m2, m9\r\n    paddw           m5, m3\r\n    pmaddubsw       m2, m8\r\n    movq            xm3, [r7 + r1]                  ; m3 = row 17\r\n    punpcklbw       xm1, xm3\r\n    movq            xm4, [r7 + r1 * 2]              ; m4 = row 18\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m3, m1, m9\r\n    paddw           m2, m3\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m7                          ; m0 = word: row 8, row 9\r\n    pmulhrsw        m6, m7                          ; m6 = word: row 10, row 11\r\n    pmulhrsw        m5, m7                          ; m5 = word: row 12, row 13\r\n    pmulhrsw        m2, m7                          ; m2 = word: row 14, row 15\r\n    packuswb        m0, m6\r\n    packuswb        m5, m2\r\n    vextracti128    xm6, m0, 1\r\n    vextracti128    xm2, m5, 1\r\n    movq            [r8], xm0\r\n    movq            [r8 + r3], xm6\r\n    movhps          [r8 + r3 * 2], xm0\r\n    movhps          [r8 + r6], xm6\r\n    lea             r8, [r8 + r3 * 4]\r\n    movq            [r8], xm5\r\n    movq            [r8 + r3], xm2\r\n    movhps          [r8 + r3 * 2], xm5\r\n    movhps          [r8 + r6], xm2\r\n    lea             r2, [r8 + r3 * 4 - 16]\r\n%else\r\n    psubw           m0, m7                          ; m0 = word: row 8, row 9\r\n    psubw           m6, m7                          ; m6 = word: row 10, row 11\r\n    psubw           m5, m7                          ; m5 = word: row 12, row 13\r\n    psubw           m2, m7                          ; m2 = word: row 14, row 15\r\n    vextracti128    xm3, m0, 1\r\n    movu            [r8], xm0\r\n    movu            [r8 + r3], xm3\r\n    vextracti128    xm3, m6, 1\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm3\r\n    vextracti128    xm3, m5, 1\r\n    lea             r8, [r8 + r3 * 4]\r\n    movu            [r8], xm5\r\n    movu            [r8 + r3], xm3\r\n    vextracti128    
xm3, m2, 1\r\n    movu            [r8 + r3 * 2], xm2\r\n    movu            [r8 + r6], xm3\r\n    lea             r2, [r8 + r3 * 4 - 32]\r\n%endif\r\n    lea             r0, [r7 - 16]\r\n    dec             r5d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_24x32 pp\r\n    FILTER_VER_CHROMA_AVX2_24x32 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_24x64 1\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_24x64, 4, 7, 13\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    mova            m10, [r5]\r\n    mova            m11, [r5 + mmsize]\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m12, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m12, [pw_2000]\r\n%endif\r\n    lea             r5, [r3 * 3]\r\n    mov             r6d, 16\r\n.loopH:\r\n    movu            m0, [r0]                        ; m0 = row 0\r\n    movu            m1, [r0 + r1]                   ; m1 = row 1\r\n    punpcklbw       m2, m0, m1\r\n    punpckhbw       m3, m0, m1\r\n    pmaddubsw       m2, m10\r\n    pmaddubsw       m3, m10\r\n    movu            m0, [r0 + r1 * 2]               ; m0 = row 2\r\n    punpcklbw       m4, m1, m0\r\n    punpckhbw       m5, m1, m0\r\n    pmaddubsw       m4, m10\r\n    pmaddubsw       m5, m10\r\n    movu            m1, [r0 + r4]                   ; m1 = row 3\r\n    punpcklbw       m6, m0, m1\r\n    punpckhbw       m7, m0, m1\r\n    pmaddubsw       m8, m6, m11\r\n    pmaddubsw       m9, m7, m11\r\n    pmaddubsw       m6, m10\r\n    pmaddubsw       m7, m10\r\n    paddw           m2, m8\r\n    paddw           m3, m9\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m12\r\n    pmulhrsw        m3, m12\r\n    packuswb        m2, m3\r\n    
movu            [r2], xm2\r\n    vextracti128    xm2, m2, 1\r\n    movq            [r2 + 16], xm2\r\n%else\r\n    psubw           m2, m12\r\n    psubw           m3, m12\r\n    vperm2i128      m0, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    movu            [r2], m0\r\n    movu            [r2 + mmsize], xm2\r\n%endif\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            m0, [r0]                        ; m0 = row 4\r\n    punpcklbw       m2, m1, m0\r\n    punpckhbw       m3, m1, m0\r\n    pmaddubsw       m8, m2, m11\r\n    pmaddubsw       m9, m3, m11\r\n    pmaddubsw       m2, m10\r\n    pmaddubsw       m3, m10\r\n    paddw           m4, m8\r\n    paddw           m5, m9\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m12\r\n    pmulhrsw        m5, m12\r\n    packuswb        m4, m5\r\n    movu            [r2 + r3], xm4\r\n    vextracti128    xm4, m4, 1\r\n    movq            [r2 + r3 + 16], xm4\r\n%else\r\n    psubw           m4, m12\r\n    psubw           m5, m12\r\n    vperm2i128      m1, m4, m5, 0x20\r\n    vperm2i128      m4, m4, m5, 0x31\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 + mmsize], xm4\r\n%endif\r\n\r\n    movu            m1, [r0 + r1]                   ; m1 = row 5\r\n    punpcklbw       m4, m0, m1\r\n    punpckhbw       m5, m0, m1\r\n    pmaddubsw       m4, m11\r\n    pmaddubsw       m5, m11\r\n    paddw           m6, m4\r\n    paddw           m7, m5\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m12\r\n    pmulhrsw        m7, m12\r\n    packuswb        m6, m7\r\n    movu            [r2 + r3 * 2], xm6\r\n    vextracti128    xm6, m6, 1\r\n    movq            [r2 + r3 * 2 + 16], xm6\r\n%else\r\n    psubw           m6, m12\r\n    psubw           m7, m12\r\n    vperm2i128      m0, m6, m7, 0x20\r\n    vperm2i128      m6, m6, m7, 0x31\r\n    movu            [r2 + r3 * 2], m0\r\n    movu            [r2 + r3 * 2 + mmsize], xm6\r\n%endif\r\n\r\n    movu            m0, [r0 + r1 * 2]               ; m0 = row 6\r\n    
punpcklbw       m6, m1, m0\r\n    punpckhbw       m7, m1, m0\r\n    pmaddubsw       m6, m11\r\n    pmaddubsw       m7, m11\r\n    paddw           m2, m6\r\n    paddw           m3, m7\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m12\r\n    pmulhrsw        m3, m12\r\n    packuswb        m2, m3\r\n    movu            [r2 + r5], xm2\r\n    vextracti128    xm2, m2, 1\r\n    movq            [r2 + r5 + 16], xm2\r\n%else\r\n    psubw           m2, m12\r\n    psubw           m3, m12\r\n    vperm2i128      m0, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    movu            [r2 + r5], m0\r\n    movu            [r2 + r5 + mmsize], xm2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n    dec             r6d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_24x64 pp\r\n    FILTER_VER_CHROMA_AVX2_24x64 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_16x4 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_16x4, 4, 6, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m7, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    mova            m7, [pw_2000]\r\n%endif\r\n\r\n    movu            xm0, [r0]\r\n    vinserti128     m0, m0, [r0 + r1 * 2], 1\r\n    movu            xm1, [r0 + r1]\r\n    vinserti128     m1, m1, [r0 + r4], 1\r\n\r\n    punpcklbw       m2, m0, m1\r\n    punpckhbw       m3, m0, m1\r\n    vperm2i128      m4, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    pmaddubsw       m4, [r5]\r\n    pmaddubsw       m3, m2, [r5 + mmsize]\r\n    paddw           m4, m3\r\n    pmaddubsw       m2, [r5]\r\n\r\n    vextracti128    xm0, m0, 1\r\n    lea             r0, [r0 + r1 * 4]\r\n    vinserti128     m0, m0, [r0], 
1\r\n\r\n    punpcklbw       m5, m1, m0\r\n    punpckhbw       m3, m1, m0\r\n    vperm2i128      m6, m5, m3, 0x20\r\n    vperm2i128      m5, m5, m3, 0x31\r\n    pmaddubsw       m6, [r5]\r\n    pmaddubsw       m3, m5, [r5 + mmsize]\r\n    paddw           m6, m3\r\n    pmaddubsw       m5, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                          ; m4 = word: row 0\r\n    pmulhrsw        m6, m7                          ; m6 = word: row 1\r\n    packuswb        m4, m6\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm6, m4, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm6\r\n%else\r\n    psubw           m4, m7                          ; m4 = word: row 0\r\n    psubw           m6, m7                          ; m6 = word: row 1\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m6\r\n%endif\r\n    lea             r2, [r2 + r3 * 2]\r\n\r\n    movu            xm4, [r0 + r1 * 2]\r\n    vinserti128     m4, m4, [r0 + r1], 1\r\n    vextracti128    xm1, m4, 1\r\n    vinserti128     m0, m0, xm1, 0\r\n\r\n    punpcklbw       m6, m0, m4\r\n    punpckhbw       m1, m0, m4\r\n    vperm2i128      m0, m6, m1, 0x20\r\n    vperm2i128      m6, m6, m1, 0x31\r\n    pmaddubsw       m0, [r5 + mmsize]\r\n    paddw           m5, m0\r\n    pmaddubsw       m6, [r5 + mmsize]\r\n    paddw           m2, m6\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m7                          ; m2 = word: row 2\r\n    pmulhrsw        m5, m7                          ; m5 = word: row 3\r\n    packuswb        m2, m5\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm5, m2, 1\r\n    movu            [r2], xm2\r\n    movu            [r2 + r3], xm5\r\n%else\r\n    psubw           m2, m7                          ; m2 = word: row 2\r\n    psubw           m5, m7                          ; m5 = word: row 3\r\n    movu            [r2], m2\r\n    movu            [r2 + r3], m5\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    
FILTER_VER_CHROMA_AVX2_16x4 pp\r\n    FILTER_VER_CHROMA_AVX2_16x4 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_12xN 2\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_12x%2, 4, 7, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m7, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m7, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n%rep %2 / 16\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m7                          ; m0 = word: row 0\r\n    pmulhrsw        m1, m7                          ; m1 = word: row 1\r\n    packuswb        m0, m1\r\n    vextracti128    xm1, m0, 
1\r\n    movq            [r2], xm0\r\n    movd            [r2 + 8], xm1\r\n    movhps          [r2 + r3], xm0\r\n    pextrd          [r2 + r3 + 8], xm1, 2\r\n%else\r\n    psubw           m0, m7                          ; m0 = word: row 0\r\n    psubw           m1, m7                          ; m1 = word: row 1\r\n    movu            [r2], xm0\r\n    vextracti128    xm0, m0, 1\r\n    movq            [r2 + 16], xm0\r\n    movu            [r2 + r3], xm1\r\n    vextracti128    xm1, m1, 1\r\n    movq            [r2 + r3 + 16], xm1\r\n%endif\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 1 * mmsize]\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm0, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm0, 1\r\n    pmaddubsw       m0, m5, [r5 + 1 * mmsize]\r\n    paddw           m3, m0\r\n    pmaddubsw       m5, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m7                          ; m2 = word: row 2\r\n    pmulhrsw        m3, m7                          ; m3 = word: row 3\r\n    packuswb        m2, m3\r\n    vextracti128    xm3, m2, 1\r\n    movq            [r2 + r3 * 2], xm2\r\n    movd            [r2 + r3 * 2 + 8], xm3\r\n    movhps          [r2 + r6], xm2\r\n    pextrd          [r2 + r6 + 8], xm3, 2\r\n%else\r\n    psubw           m2, m7                          ; m2 = word: row 2\r\n    psubw           m3, m7                          ; m3 = word: row 3\r\n    movu            [r2 + r3 * 2], xm2\r\n    vextracti128    xm2, m2, 1\r\n    movq            [r2 + r3 * 2 + 16], xm2\r\n    movu            [r2 + r6], xm3\r\n    vextracti128    xm3, m3, 1\r\n    movq            [r2 + r6 + 16], xm3\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm0, [r0 
+ r4]                  ; m0 = row 7\r\n    punpckhbw       xm3, xm6, xm0\r\n    punpcklbw       xm6, xm0\r\n    vinserti128     m6, m6, xm3, 1\r\n    pmaddubsw       m3, m6, [r5 + 1 * mmsize]\r\n    paddw           m4, m3\r\n    pmaddubsw       m6, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm3, [r0]                       ; m3 = row 8\r\n    punpckhbw       xm1, xm0, xm3\r\n    punpcklbw       xm0, xm3\r\n    vinserti128     m0, m0, xm1, 1\r\n    pmaddubsw       m1, m0, [r5 + 1 * mmsize]\r\n    paddw           m5, m1\r\n    pmaddubsw       m0, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m7                          ; m4 = word: row 4\r\n    pmulhrsw        m5, m7                          ; m5 = word: row 5\r\n    packuswb        m4, m5\r\n    vextracti128    xm5, m4, 1\r\n    movq            [r2], xm4\r\n    movd            [r2 + 8], xm5\r\n    movhps          [r2 + r3], xm4\r\n    pextrd          [r2 + r3 + 8], xm5, 2\r\n%else\r\n    psubw           m4, m7                          ; m4 = word: row 4\r\n    psubw           m5, m7                          ; m5 = word: row 5\r\n    movu            [r2], xm4\r\n    vextracti128    xm4, m4, 1\r\n    movq            [r2 + 16], xm4\r\n    movu            [r2 + r3], xm5\r\n    vextracti128    xm5, m5, 1\r\n    movq            [r2 + r3 + 16], xm5\r\n%endif\r\n\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 9\r\n    punpckhbw       xm2, xm3, xm1\r\n    punpcklbw       xm3, xm1\r\n    vinserti128     m3, m3, xm2, 1\r\n    pmaddubsw       m2, m3, [r5 + 1 * mmsize]\r\n    paddw           m6, m2\r\n    pmaddubsw       m3, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 10\r\n    punpckhbw       xm4, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm4, 1\r\n    pmaddubsw       m4, m1, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       m1, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m7                         
 ; m6 = word: row 6\r\n    pmulhrsw        m0, m7                          ; m0 = word: row 7\r\n    packuswb        m6, m0\r\n    vextracti128    xm0, m6, 1\r\n    movq            [r2 + r3 * 2], xm6\r\n    movd            [r2 + r3 * 2 + 8], xm0\r\n    movhps          [r2 + r6], xm6\r\n    pextrd          [r2 + r6 + 8], xm0, 2\r\n%else\r\n    psubw           m6, m7                          ; m6 = word: row 6\r\n    psubw           m0, m7                          ; m0 = word: row 7\r\n    movu            [r2 + r3 * 2], xm6\r\n    vextracti128    xm6, m6, 1\r\n    movq            [r2 + r3 * 2 + 16], xm6\r\n    movu            [r2 + r6], xm0\r\n    vextracti128    xm0, m0, 1\r\n    movq            [r2 + r6 + 16], xm0\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm4, [r0 + r4]                  ; m4 = row 11\r\n    punpckhbw       xm6, xm2, xm4\r\n    punpcklbw       xm2, xm4\r\n    vinserti128     m2, m2, xm6, 1\r\n    pmaddubsw       m6, m2, [r5 + 1 * mmsize]\r\n    paddw           m3, m6\r\n    pmaddubsw       m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm6, [r0]                       ; m6 = row 12\r\n    punpckhbw       xm0, xm4, xm6\r\n    punpcklbw       xm4, xm6\r\n    vinserti128     m4, m4, xm0, 1\r\n    pmaddubsw       m0, m4, [r5 + 1 * mmsize]\r\n    paddw           m1, m0\r\n    pmaddubsw       m4, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m3, m7                          ; m3 = word: row 8\r\n    pmulhrsw        m1, m7                          ; m1 = word: row 9\r\n    packuswb        m3, m1\r\n    vextracti128    xm1, m3, 1\r\n    movq            [r2], xm3\r\n    movd            [r2 + 8], xm1\r\n    movhps          [r2 + r3], xm3\r\n    pextrd          [r2 + r3 + 8], xm1, 2\r\n%else\r\n    psubw           m3, m7                          ; m3 = word: row 8\r\n    psubw           m1, m7                          ; m1 = word: row 9\r\n    movu            [r2], xm3\r\n    vextracti128    xm3, m3, 
1\r\n    movq            [r2 + 16], xm3\r\n    movu            [r2 + r3], xm1\r\n    vextracti128    xm1, m1, 1\r\n    movq            [r2 + r3 + 16], xm1\r\n%endif\r\n\r\n    movu            xm0, [r0 + r1]                  ; m0 = row 13\r\n    punpckhbw       xm1, xm6, xm0\r\n    punpcklbw       xm6, xm0\r\n    vinserti128     m6, m6, xm1, 1\r\n    pmaddubsw       m1, m6, [r5 + 1 * mmsize]\r\n    paddw           m2, m1\r\n    pmaddubsw       m6, [r5]\r\n    movu            xm1, [r0 + r1 * 2]              ; m1 = row 14\r\n    punpckhbw       xm5, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddubsw       m5, m0, [r5 + 1 * mmsize]\r\n    paddw           m4, m5\r\n    pmaddubsw       m0, [r5]\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m7                          ; m2 = word: row 10\r\n    pmulhrsw        m4, m7                          ; m4 = word: row 11\r\n    packuswb        m2, m4\r\n    vextracti128    xm4, m2, 1\r\n    movq            [r2 + r3 * 2], xm2\r\n    movd            [r2 + r3 * 2 + 8], xm4\r\n    movhps          [r2 + r6], xm2\r\n    pextrd          [r2 + r6 + 8], xm4, 2\r\n%else\r\n    psubw           m2, m7                          ; m2 = word: row 10\r\n    psubw           m4, m7                          ; m4 = word: row 11\r\n    movu            [r2 + r3 * 2], xm2\r\n    vextracti128    xm2, m2, 1\r\n    movq            [r2 + r3 * 2 + 16], xm2\r\n    movu            [r2 + r6], xm4\r\n    vextracti128    xm4, m4, 1\r\n    movq            [r2 + r6 + 16], xm4\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm5, [r0 + r4]                  ; m5 = row 15\r\n    punpckhbw       xm2, xm1, xm5\r\n    punpcklbw       xm1, xm5\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddubsw       m2, m1, [r5 + 1 * mmsize]\r\n    paddw           m6, m2\r\n    pmaddubsw       m1, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm2, [r0]                       ; m2 = row 16\r\n    
punpckhbw       xm3, xm5, xm2\r\n    punpcklbw       xm5, xm2\r\n    vinserti128     m5, m5, xm3, 1\r\n    pmaddubsw       m3, m5, [r5 + 1 * mmsize]\r\n    paddw           m0, m3\r\n    pmaddubsw       m5, [r5]\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 17\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m2, [r5 + 1 * mmsize]\r\n    paddw           m1, m2\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhbw       xm2, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm2, 1\r\n    pmaddubsw       m3, [r5 + 1 * mmsize]\r\n    paddw           m5, m3\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m7                          ; m6 = word: row 12\r\n    pmulhrsw        m0, m7                          ; m0 = word: row 13\r\n    pmulhrsw        m1, m7                          ; m1 = word: row 14\r\n    pmulhrsw        m5, m7                          ; m5 = word: row 15\r\n    packuswb        m6, m0\r\n    packuswb        m1, m5\r\n    vextracti128    xm0, m6, 1\r\n    vextracti128    xm5, m1, 1\r\n    movq            [r2], xm6\r\n    movd            [r2 + 8], xm0\r\n    movhps          [r2 + r3], xm6\r\n    pextrd          [r2 + r3 + 8], xm0, 2\r\n    movq            [r2 + r3 * 2], xm1\r\n    movd            [r2 + r3 * 2 + 8], xm5\r\n    movhps          [r2 + r6], xm1\r\n    pextrd          [r2 + r6 + 8], xm5, 2\r\n%else\r\n    psubw           m6, m7                          ; m6 = word: row 12\r\n    psubw           m0, m7                          ; m0 = word: row 13\r\n    psubw           m1, m7                          ; m1 = word: row 14\r\n    psubw           m5, m7                          ; m5 = word: row 15\r\n    movu            [r2], xm6\r\n    vextracti128    xm6, m6, 1\r\n    movq            [r2 + 16], xm6\r\n    movu            [r2 + r3], xm0\r\n    vextracti128    xm0, m0, 1\r\n    movq            [r2 + 
r3 + 16], xm0\r\n    movu            [r2 + r3 * 2], xm1\r\n    vextracti128    xm1, m1, 1\r\n    movq            [r2 + r3 * 2 + 16], xm1\r\n    movu            [r2 + r6], xm5\r\n    vextracti128    xm5, m5, 1\r\n    movq            [r2 + r6 + 16], xm5\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n%endrep\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_12xN pp, 16\r\n    FILTER_VER_CHROMA_AVX2_12xN ps, 16\r\n    FILTER_VER_CHROMA_AVX2_12xN pp, 32\r\n    FILTER_VER_CHROMA_AVX2_12xN ps, 32\r\n\r\n;-----------------------------------------------------------------------------\r\n;void interp_4tap_vert_pp_24x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W24 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_24x%2, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd        m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m1,        m0,       [tab_Vm]\r\n    pshufb      m0,        [tab_Vm + 16]\r\n\r\n    mov         r4d,       %2\r\n\r\n.loop:\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m2,        m1\r\n\r\n    lea         r5,        [r0 + 2 * r1]\r\n    movu        m5,        [r5]\r\n    movu        m7,        [r5 + r1]\r\n\r\n    punpcklbw   m6,        m5,        m7\r\n    pmaddubsw   m6,        m0\r\n    paddw       m4,        m6\r\n\r\n    punpckhbw   m6,        m5,        m7\r\n    pmaddubsw   m6,        m0\r\n    paddw       m2,        m6\r\n\r\n    mova        m6,        [pw_512]\r\n\r\n    pmulhrsw    m4,        m6\r\n    pmulhrsw    m2,        m6\r\n\r\n    packuswb    
m4,        m2\r\n\r\n    movu        [r2],      m4\r\n\r\n    punpcklbw   m4,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m3,        m1\r\n\r\n    movu        m2,        [r5 + 2 * r1]\r\n\r\n    punpcklbw   m5,        m7,        m2\r\n    punpckhbw   m7,        m2\r\n\r\n    pmaddubsw   m5,        m0\r\n    pmaddubsw   m7,        m0\r\n\r\n    paddw       m4,        m5\r\n    paddw       m3,        m7\r\n\r\n    pmulhrsw    m4,        m6\r\n    pmulhrsw    m3,        m6\r\n\r\n    packuswb    m4,        m3\r\n\r\n    movu        [r2 + r3],      m4\r\n\r\n    movq        m2,        [r0 + 16]\r\n    movq        m3,        [r0 + r1 + 16]\r\n    movq        m4,        [r5 + 16]\r\n    movq        m5,        [r5 + r1 + 16]\r\n\r\n    punpcklbw   m2,        m3\r\n    punpcklbw   m4,        m5\r\n\r\n    pmaddubsw   m2,        m1\r\n    pmaddubsw   m4,        m0\r\n\r\n    paddw       m2,        m4\r\n\r\n    pmulhrsw    m2,        m6\r\n\r\n    movq        m3,        [r0 + r1 + 16]\r\n    movq        m4,        [r5 + 16]\r\n    movq        m5,        [r5 + r1 + 16]\r\n    movq        m7,        [r5 + 2 * r1 + 16]\r\n\r\n    punpcklbw   m3,        m4\r\n    punpcklbw   m5,        m7\r\n\r\n    pmaddubsw   m3,        m1\r\n    pmaddubsw   m5,        m0\r\n\r\n    paddw       m3,        m5\r\n\r\n    pmulhrsw    m3,        m6\r\n    packuswb    m2,        m3\r\n\r\n    movh        [r2 + 16], m2\r\n    movhps      [r2 + r3 + 16], m2\r\n\r\n    mov         r0,        r5\r\n    lea         r2,        [r2 + 2 * r3]\r\n\r\n    sub         r4,        2\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_W24 24, 32\r\n\r\n    FILTER_V4_W24 24, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int 
coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W32 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd        m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m1,        m0,       [tab_Vm]\r\n    pshufb      m0,        [tab_Vm + 16]\r\n\r\n    mova        m7,        [pw_512]\r\n\r\n    mov         r4d,       %2\r\n\r\n.loop:\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m2,        m1\r\n\r\n    lea         r5,        [r0 + 2 * r1]\r\n    movu        m3,        [r5]\r\n    movu        m5,        [r5 + r1]\r\n\r\n    punpcklbw   m6,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    pmaddubsw   m6,        m0\r\n    pmaddubsw   m3,        m0\r\n\r\n    paddw       m4,        m6\r\n    paddw       m2,        m3\r\n\r\n    pmulhrsw    m4,        m7\r\n    pmulhrsw    m2,        m7\r\n\r\n    packuswb    m4,        m2\r\n\r\n    movu        [r2],      m4\r\n\r\n    movu        m2,        [r0 + 16]\r\n    movu        m3,        [r0 + r1 + 16]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m2,        m1\r\n\r\n    movu        m3,        [r5 + 16]\r\n    movu        m5,        [r5 + r1 + 16]\r\n\r\n    punpcklbw   m6,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    pmaddubsw   m6,        m0\r\n    pmaddubsw   m3,        m0\r\n\r\n    paddw       m4,        m6\r\n    paddw       m2,        m3\r\n\r\n    pmulhrsw    m4,        m7\r\n    pmulhrsw    m2,        m7\r\n\r\n    packuswb    m4,        
m2\r\n\r\n    movu        [r2 + 16], m4\r\n\r\n    lea         r0,        [r0 + r1]\r\n    lea         r2,        [r2 + r3]\r\n\r\n    dec         r4\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_W32 32,  8\r\n    FILTER_V4_W32 32, 16\r\n    FILTER_V4_W32 32, 24\r\n    FILTER_V4_W32 32, 32\r\n\r\n    FILTER_V4_W32 32, 48\r\n    FILTER_V4_W32 32, 64\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_32xN 2\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_32x%2, 4, 7, 13\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    mova            m10, [r5]\r\n    mova            m11, [r5 + mmsize]\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m12, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m12, [pw_2000]\r\n%endif\r\n    lea             r5, [r3 * 3]\r\n    mov             r6d, %2 / 4\r\n.loopW:\r\n    movu            m0, [r0]                        ; m0 = row 0\r\n    movu            m1, [r0 + r1]                   ; m1 = row 1\r\n    punpcklbw       m2, m0, m1\r\n    punpckhbw       m3, m0, m1\r\n    pmaddubsw       m2, m10\r\n    pmaddubsw       m3, m10\r\n    movu            m0, [r0 + r1 * 2]               ; m0 = row 2\r\n    punpcklbw       m4, m1, m0\r\n    punpckhbw       m5, m1, m0\r\n    pmaddubsw       m4, m10\r\n    pmaddubsw       m5, m10\r\n    movu            m1, [r0 + r4]                   ; m1 = row 3\r\n    punpcklbw       m6, m0, m1\r\n    punpckhbw       m7, m0, m1\r\n    pmaddubsw       m8, m6, m11\r\n    pmaddubsw       m9, m7, m11\r\n    pmaddubsw       m6, m10\r\n    pmaddubsw       m7, m10\r\n    paddw           m2, m8\r\n    paddw           m3, m9\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m12\r\n    pmulhrsw        m3, m12\r\n    packuswb    
    m2, m3\r\n    movu            [r2], m2\r\n%else\r\n    psubw           m2, m12\r\n    psubw           m3, m12\r\n    vperm2i128      m0, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    movu            [r2], m0\r\n    movu            [r2 + mmsize], m2\r\n%endif\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            m0, [r0]                        ; m0 = row 4\r\n    punpcklbw       m2, m1, m0\r\n    punpckhbw       m3, m1, m0\r\n    pmaddubsw       m8, m2, m11\r\n    pmaddubsw       m9, m3, m11\r\n    pmaddubsw       m2, m10\r\n    pmaddubsw       m3, m10\r\n    paddw           m4, m8\r\n    paddw           m5, m9\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m12\r\n    pmulhrsw        m5, m12\r\n    packuswb        m4, m5\r\n    movu            [r2 + r3], m4\r\n%else\r\n    psubw           m4, m12\r\n    psubw           m5, m12\r\n    vperm2i128      m1, m4, m5, 0x20\r\n    vperm2i128      m4, m4, m5, 0x31\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 + mmsize], m4\r\n%endif\r\n\r\n    movu            m1, [r0 + r1]                   ; m1 = row 5\r\n    punpcklbw       m4, m0, m1\r\n    punpckhbw       m5, m0, m1\r\n    pmaddubsw       m4, m11\r\n    pmaddubsw       m5, m11\r\n    paddw           m6, m4\r\n    paddw           m7, m5\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m12\r\n    pmulhrsw        m7, m12\r\n    packuswb        m6, m7\r\n    movu            [r2 + r3 * 2], m6\r\n%else\r\n    psubw           m6, m12\r\n    psubw           m7, m12\r\n    vperm2i128      m0, m6, m7, 0x20\r\n    vperm2i128      m6, m6, m7, 0x31\r\n    movu            [r2 + r3 * 2], m0\r\n    movu            [r2 + r3 * 2 + mmsize], m6\r\n%endif\r\n\r\n    movu            m0, [r0 + r1 * 2]               ; m0 = row 6\r\n    punpcklbw       m6, m1, m0\r\n    punpckhbw       m7, m1, m0\r\n    pmaddubsw       m6, m11\r\n    pmaddubsw       m7, m11\r\n    paddw           m2, m6\r\n    paddw           m3, m7\r\n%ifidn %1,pp\r\n    pmulhrsw        
m2, m12\r\n    pmulhrsw        m3, m12\r\n    packuswb        m2, m3\r\n    movu            [r2 + r5], m2\r\n%else\r\n    psubw           m2, m12\r\n    psubw           m3, m12\r\n    vperm2i128      m0, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    movu            [r2 + r5], m0\r\n    movu            [r2 + r5 + mmsize], m2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n    dec             r6d\r\n    jnz             .loopW\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_32xN pp, 64\r\n    FILTER_VER_CHROMA_AVX2_32xN pp, 48\r\n    FILTER_VER_CHROMA_AVX2_32xN pp, 32\r\n    FILTER_VER_CHROMA_AVX2_32xN pp, 24\r\n    FILTER_VER_CHROMA_AVX2_32xN pp, 16\r\n    FILTER_VER_CHROMA_AVX2_32xN pp, 8\r\n    FILTER_VER_CHROMA_AVX2_32xN ps, 64\r\n    FILTER_VER_CHROMA_AVX2_32xN ps, 48\r\n    FILTER_VER_CHROMA_AVX2_32xN ps, 32\r\n    FILTER_VER_CHROMA_AVX2_32xN ps, 24\r\n    FILTER_VER_CHROMA_AVX2_32xN ps, 16\r\n    FILTER_VER_CHROMA_AVX2_32xN ps, 8\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_48x64 1\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_48x64, 4, 8, 13\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    mova            m10, [r5]\r\n    mova            m11, [r5 + mmsize]\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m12, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m12, [pw_2000]\r\n%endif\r\n    lea             r5, [r3 * 3]\r\n    lea             r7, [r1 * 4]\r\n    mov             r6d, 16\r\n.loopH:\r\n    movu            m0, [r0]                        ; m0 = row 0\r\n    movu            m1, [r0 + r1]                   ; m1 = row 1\r\n    punpcklbw       m2, m0, m1\r\n    punpckhbw       m3, m0, m1\r\n    pmaddubsw       m2, m10\r\n    
pmaddubsw       m3, m10\r\n    movu            m0, [r0 + r1 * 2]               ; m0 = row 2\r\n    punpcklbw       m4, m1, m0\r\n    punpckhbw       m5, m1, m0\r\n    pmaddubsw       m4, m10\r\n    pmaddubsw       m5, m10\r\n    movu            m1, [r0 + r4]                   ; m1 = row 3\r\n    punpcklbw       m6, m0, m1\r\n    punpckhbw       m7, m0, m1\r\n    pmaddubsw       m8, m6, m11\r\n    pmaddubsw       m9, m7, m11\r\n    pmaddubsw       m6, m10\r\n    pmaddubsw       m7, m10\r\n    paddw           m2, m8\r\n    paddw           m3, m9\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m12\r\n    pmulhrsw        m3, m12\r\n    packuswb        m2, m3\r\n    movu            [r2], m2\r\n%else\r\n    psubw           m2, m12\r\n    psubw           m3, m12\r\n    vperm2i128      m0, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    movu            [r2], m0\r\n    movu            [r2 + mmsize], m2\r\n%endif\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            m0, [r0]                        ; m0 = row 4\r\n    punpcklbw       m2, m1, m0\r\n    punpckhbw       m3, m1, m0\r\n    pmaddubsw       m8, m2, m11\r\n    pmaddubsw       m9, m3, m11\r\n    pmaddubsw       m2, m10\r\n    pmaddubsw       m3, m10\r\n    paddw           m4, m8\r\n    paddw           m5, m9\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m12\r\n    pmulhrsw        m5, m12\r\n    packuswb        m4, m5\r\n    movu            [r2 + r3], m4\r\n%else\r\n    psubw           m4, m12\r\n    psubw           m5, m12\r\n    vperm2i128      m1, m4, m5, 0x20\r\n    vperm2i128      m4, m4, m5, 0x31\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 + mmsize], m4\r\n%endif\r\n\r\n    movu            m1, [r0 + r1]                   ; m1 = row 5\r\n    punpcklbw       m4, m0, m1\r\n    punpckhbw       m5, m0, m1\r\n    pmaddubsw       m4, m11\r\n    pmaddubsw       m5, m11\r\n    paddw           m6, m4\r\n    paddw           m7, m5\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m12\r\n    
pmulhrsw        m7, m12\r\n    packuswb        m6, m7\r\n    movu            [r2 + r3 * 2], m6\r\n%else\r\n    psubw           m6, m12\r\n    psubw           m7, m12\r\n    vperm2i128      m0, m6, m7, 0x20\r\n    vperm2i128      m6, m6, m7, 0x31\r\n    movu            [r2 + r3 * 2], m0\r\n    movu            [r2 + r3 * 2 + mmsize], m6\r\n%endif\r\n\r\n    movu            m0, [r0 + r1 * 2]               ; m0 = row 6\r\n    punpcklbw       m6, m1, m0\r\n    punpckhbw       m7, m1, m0\r\n    pmaddubsw       m6, m11\r\n    pmaddubsw       m7, m11\r\n    paddw           m2, m6\r\n    paddw           m3, m7\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m12\r\n    pmulhrsw        m3, m12\r\n    packuswb        m2, m3\r\n    movu            [r2 + r5], m2\r\n    add             r2, 32\r\n%else\r\n    psubw           m2, m12\r\n    psubw           m3, m12\r\n    vperm2i128      m0, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    movu            [r2 + r5], m0\r\n    movu            [r2 + r5 + mmsize], m2\r\n    add             r2, 64\r\n%endif\r\n    sub             r0, r7\r\n\r\n    movu            xm0, [r0 + 32]                  ; m0 = row 0\r\n    movu            xm1, [r0 + r1 + 32]             ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, m10\r\n    movu            xm2, [r0 + r1 * 2 + 32]         ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, m10\r\n    movu            xm3, [r0 + r4 + 32]             ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, m11\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, m10\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0 + 32]                  ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    
punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, m11\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, m10\r\n    movu            xm5, [r0 + r1 + 32]             ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m4, m11\r\n    paddw           m2, m4\r\n    movu            xm6, [r0 + r1 * 2 + 32]         ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m5, m11\r\n    paddw           m3, m5\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m12                         ; m0 = word: row 0\r\n    pmulhrsw        m1, m12                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m12                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m12                         ; m3 = word: row 3\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r5], xm3\r\n    lea             r2, [r2 + r3 * 4 - 32]\r\n%else\r\n    psubw           m0, m12                         ; m0 = word: row 0\r\n    psubw           m1, m12                         ; m1 = word: row 1\r\n    psubw           m2, m12                         ; m2 = word: row 2\r\n    psubw           m3, m12                         ; m3 = word: row 3\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r5], m3\r\n    lea             r2, [r2 + r3 * 4 - 64]\r\n%endif\r\n    dec             r6d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_48x64 pp\r\n    
FILTER_VER_CHROMA_AVX2_48x64 ps\r\n\r\n%macro FILTER_VER_CHROMA_AVX2_64xN 2\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_64x%2, 4, 8, 13\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_ChromaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_ChromaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    mova            m10, [r5]\r\n    mova            m11, [r5 + mmsize]\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,pp\r\n    mova            m12, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m12, [pw_2000]\r\n%endif\r\n    lea             r5, [r3 * 3]\r\n    lea             r7, [r1 * 4]\r\n    mov             r6d, %2 / 4\r\n.loopH:\r\n%assign x 0\r\n%rep 2\r\n    movu            m0, [r0 + x]                    ; m0 = row 0\r\n    movu            m1, [r0 + r1 + x]               ; m1 = row 1\r\n    punpcklbw       m2, m0, m1\r\n    punpckhbw       m3, m0, m1\r\n    pmaddubsw       m2, m10\r\n    pmaddubsw       m3, m10\r\n    movu            m0, [r0 + r1 * 2 + x]           ; m0 = row 2\r\n    punpcklbw       m4, m1, m0\r\n    punpckhbw       m5, m1, m0\r\n    pmaddubsw       m4, m10\r\n    pmaddubsw       m5, m10\r\n    movu            m1, [r0 + r4 + x]               ; m1 = row 3\r\n    punpcklbw       m6, m0, m1\r\n    punpckhbw       m7, m0, m1\r\n    pmaddubsw       m8, m6, m11\r\n    pmaddubsw       m9, m7, m11\r\n    pmaddubsw       m6, m10\r\n    pmaddubsw       m7, m10\r\n    paddw           m2, m8\r\n    paddw           m3, m9\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m12\r\n    pmulhrsw        m3, m12\r\n    packuswb        m2, m3\r\n    movu            [r2], m2\r\n%else\r\n    psubw           m2, m12\r\n    psubw           m3, m12\r\n    vperm2i128      m0, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 0x31\r\n    movu            [r2], m0\r\n    movu            [r2 + mmsize], m2\r\n%endif\r\n 
   lea             r0, [r0 + r1 * 4]\r\n    movu            m0, [r0 + x]                    ; m0 = row 4\r\n    punpcklbw       m2, m1, m0\r\n    punpckhbw       m3, m1, m0\r\n    pmaddubsw       m8, m2, m11\r\n    pmaddubsw       m9, m3, m11\r\n    pmaddubsw       m2, m10\r\n    pmaddubsw       m3, m10\r\n    paddw           m4, m8\r\n    paddw           m5, m9\r\n%ifidn %1,pp\r\n    pmulhrsw        m4, m12\r\n    pmulhrsw        m5, m12\r\n    packuswb        m4, m5\r\n    movu            [r2 + r3], m4\r\n%else\r\n    psubw           m4, m12\r\n    psubw           m5, m12\r\n    vperm2i128      m1, m4, m5, 0x20\r\n    vperm2i128      m4, m4, m5, 0x31\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 + mmsize], m4\r\n%endif\r\n\r\n    movu            m1, [r0 + r1 + x]               ; m1 = row 5\r\n    punpcklbw       m4, m0, m1\r\n    punpckhbw       m5, m0, m1\r\n    pmaddubsw       m4, m11\r\n    pmaddubsw       m5, m11\r\n    paddw           m6, m4\r\n    paddw           m7, m5\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m12\r\n    pmulhrsw        m7, m12\r\n    packuswb        m6, m7\r\n    movu            [r2 + r3 * 2], m6\r\n%else\r\n    psubw           m6, m12\r\n    psubw           m7, m12\r\n    vperm2i128      m0, m6, m7, 0x20\r\n    vperm2i128      m6, m6, m7, 0x31\r\n    movu            [r2 + r3 * 2], m0\r\n    movu            [r2 + r3 * 2 + mmsize], m6\r\n%endif\r\n\r\n    movu            m0, [r0 + r1 * 2 + x]           ; m0 = row 6\r\n    punpcklbw       m6, m1, m0\r\n    punpckhbw       m7, m1, m0\r\n    pmaddubsw       m6, m11\r\n    pmaddubsw       m7, m11\r\n    paddw           m2, m6\r\n    paddw           m3, m7\r\n%ifidn %1,pp\r\n    pmulhrsw        m2, m12\r\n    pmulhrsw        m3, m12\r\n    packuswb        m2, m3\r\n    movu            [r2 + r5], m2\r\n    add             r2, 32\r\n%else\r\n    psubw           m2, m12\r\n    psubw           m3, m12\r\n    vperm2i128      m0, m2, m3, 0x20\r\n    vperm2i128      m2, m2, m3, 
0x31\r\n    movu            [r2 + r5], m0\r\n    movu            [r2 + r5 + mmsize], m2\r\n    add             r2, 64\r\n%endif\r\n    sub             r0, r7\r\n%assign x x+32\r\n%endrep\r\n%ifidn %1,pp\r\n    lea             r2, [r2 + r3 * 4 - 64]\r\n%else\r\n    lea             r2, [r2 + r3 * 4 - 128]\r\n%endif\r\n    add             r0, r7\r\n    dec             r6d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_AVX2_64xN pp, 64\r\n    FILTER_VER_CHROMA_AVX2_64xN pp, 48\r\n    FILTER_VER_CHROMA_AVX2_64xN pp, 32\r\n    FILTER_VER_CHROMA_AVX2_64xN pp, 16\r\n    FILTER_VER_CHROMA_AVX2_64xN ps, 64\r\n    FILTER_VER_CHROMA_AVX2_64xN ps, 48\r\n    FILTER_VER_CHROMA_AVX2_64xN ps, 32\r\n    FILTER_VER_CHROMA_AVX2_64xN ps, 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------\r\n%macro FILTER_V4_W16n_H2 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8\r\n\r\n    mov         r4d,       r4m\r\n    sub         r0,        r1\r\n\r\n%ifdef PIC\r\n    lea         r5,        [tab_ChromaCoeff]\r\n    movd        m0,        [r5 + r4 * 4]\r\n%else\r\n    movd        m0,        [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m1,        m0,       [tab_Vm]\r\n    pshufb      m0,        [tab_Vm + 16]\r\n\r\n    mov         r4d,       %2/2\r\n\r\n.loop:\r\n\r\n    mov         r6d,       %1/16\r\n\r\n.loopW:\r\n\r\n    movu        m2,        [r0]\r\n    movu        m3,        [r0 + r1]\r\n\r\n    punpcklbw   m4,        m2,        m3\r\n    punpckhbw   m2,        m3\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m2,        m1\r\n\r\n    lea         r5,        [r0 + 2 * r1]\r\n    movu        m5,        [r5]\r\n    movu        m6,        [r5 + r1]\r\n\r\n    punpckhbw   m7,        
m5,        m6\r\n    pmaddubsw   m7,        m0\r\n    paddw       m2,        m7\r\n\r\n    punpcklbw   m7,        m5,        m6\r\n    pmaddubsw   m7,        m0\r\n    paddw       m4,        m7\r\n\r\n    mova        m7,        [pw_512]\r\n\r\n    pmulhrsw    m4,        m7\r\n    pmulhrsw    m2,        m7\r\n\r\n    packuswb    m4,        m2\r\n\r\n    movu        [r2],      m4\r\n\r\n    punpcklbw   m4,        m3,        m5\r\n    punpckhbw   m3,        m5\r\n\r\n    pmaddubsw   m4,        m1\r\n    pmaddubsw   m3,        m1\r\n\r\n    movu        m5,        [r5 + 2 * r1]\r\n\r\n    punpcklbw   m2,        m6,        m5\r\n    punpckhbw   m6,        m5\r\n\r\n    pmaddubsw   m2,        m0\r\n    pmaddubsw   m6,        m0\r\n\r\n    paddw       m4,        m2\r\n    paddw       m3,        m6\r\n\r\n    pmulhrsw    m4,        m7\r\n    pmulhrsw    m3,        m7\r\n\r\n    packuswb    m4,        m3\r\n\r\n    movu        [r2 + r3],      m4\r\n\r\n    add         r0,        16\r\n    add         r2,        16\r\n    dec         r6d\r\n    jnz         .loopW\r\n\r\n    lea         r0,        [r0 + r1 * 2 - %1]\r\n    lea         r2,        [r2 + r3 * 2 - %1]\r\n\r\n    dec         r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V4_W16n_H2 64, 64\r\n    FILTER_V4_W16n_H2 64, 32\r\n    FILTER_V4_W16n_H2 64, 48\r\n    FILTER_V4_W16n_H2 48, 64\r\n    FILTER_V4_W16n_H2 64, 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_2xN 1\r\nINIT_XMM sse4\r\ncglobal filterPixelToShort_2x%1, 3, 4, 3\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n\r\n    ; load constant\r\n    mova        m1, [pb_128]\r\n    mova        m2, [tab_c_64_n64]\r\n\r\n%rep %1/2\r\n    movd        m0, [r0]\r\n    pinsrd      m0, [r0 + r1], 1\r\n    
punpcklbw   m0, m1\r\n    pmaddubsw   m0, m2\r\n\r\n    movd        [r2 + r3 * 0], m0\r\n    pextrd      [r2 + r3 * 1], m0, 2\r\n\r\n    lea         r0, [r0 + r1 * 2]\r\n    lea         r2, [r2 + r3 * 2]\r\n%endrep\r\n    RET\r\n%endmacro\r\n    P2S_H_2xN 4\r\n    P2S_H_2xN 8\r\n    P2S_H_2xN 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_4xN 1\r\nINIT_XMM sse4\r\ncglobal filterPixelToShort_4x%1, 3, 6, 4\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r3 * 3]\r\n    lea         r5, [r1 * 3]\r\n\r\n    ; load constant\r\n    mova        m2, [pb_128]\r\n    mova        m3, [tab_c_64_n64]\r\n\r\n%assign x 0\r\n%rep %1/4\r\n    movd        m0, [r0]\r\n    pinsrd      m0, [r0 + r1], 1\r\n    punpcklbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n\r\n    movd        m1, [r0 + r1 * 2]\r\n    pinsrd      m1, [r0 + r5], 1\r\n    punpcklbw   m1, m2\r\n    pmaddubsw   m1, m3\r\n\r\n    movq        [r2 + r3 * 0], m0\r\n    movq        [r2 + r3 * 2], m1\r\n    movhps      [r2 + r3 * 1], m0\r\n    movhps      [r2 + r4], m1\r\n%assign x x+1\r\n%if (x != %1/4)\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n%endif\r\n%endrep\r\n    RET\r\n%endmacro\r\n    P2S_H_4xN 4\r\n    P2S_H_4xN 8\r\n    P2S_H_4xN 16\r\n    P2S_H_4xN 32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_6xN 1\r\nINIT_XMM sse4\r\ncglobal filterPixelToShort_6x%1, 3, 7, 6\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r1 * 3]\r\n    lea         r5, [r3 * 3]\r\n\r\n    ; load 
height\r\n    mov         r6d, %1/4\r\n\r\n    ; load constant\r\n    mova        m4, [pb_128]\r\n    mova        m5, [tab_c_64_n64]\r\n\r\n.loop:\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r4]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movh        [r2 + r3 * 0], m0\r\n    pextrd      [r2 + r3 * 0 + 8], m0, 2\r\n    movh        [r2 + r3 * 1], m1\r\n    pextrd      [r2 + r3 * 1 + 8], m1, 2\r\n    movh        [r2 + r3 * 2], m2\r\n    pextrd      [r2 + r3 * 2 + 8], m2, 2\r\n    movh        [r2 + r5], m3\r\n    pextrd      [r2 + r5 + 8], m3, 2\r\n\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_6xN 8\r\n    P2S_H_6xN 16\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_8xN 1\r\nINIT_XMM ssse3\r\ncglobal filterPixelToShort_8x%1, 3, 7, 6\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r5, [r1 * 3]\r\n    lea         r6, [r3 * 3]\r\n\r\n    ; load height\r\n    mov         r4d, %1/4\r\n\r\n    ; load constant\r\n    mova        m4, [pb_128]\r\n    mova        m5, [tab_c_64_n64]\r\n\r\n.loop:\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n 
   movu        [r2 + r3 * 0], m0\r\n    movu        [r2 + r3 * 1], m1\r\n    movu        [r2 + r3 * 2], m2\r\n    movu        [r2 + r6 ], m3\r\n\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r4d\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_8xN 8\r\n    P2S_H_8xN 4\r\n    P2S_H_8xN 16\r\n    P2S_H_8xN 32\r\n    P2S_H_8xN 12\r\n    P2S_H_8xN 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM ssse3\r\ncglobal filterPixelToShort_8x6, 3, 7, 5\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r1 * 3]\r\n    lea         r5, [r1 * 5]\r\n    lea         r6, [r3 * 3]\r\n\r\n    ; load constant\r\n    mova        m3, [pb_128]\r\n    mova        m4, [tab_c_64_n64]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m3\r\n    pmaddubsw   m0, m4\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m3\r\n    pmaddubsw   m1, m4\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m3\r\n    pmaddubsw   m2, m4\r\n\r\n    movu        [r2 + r3 * 0], m0\r\n    movu        [r2 + r3 * 1], m1\r\n    movu        [r2 + r3 * 2], m2\r\n\r\n    movh        m0, [r0 + r4]\r\n    punpcklbw   m0, m3\r\n    pmaddubsw   m0, m4\r\n\r\n    movh        m1, [r0 + r1 * 4]\r\n    punpcklbw   m1, m3\r\n    pmaddubsw   m1, m4\r\n\r\n    movh        m2, [r0 + r5]\r\n    punpcklbw   m2, m3\r\n    pmaddubsw   m2, m4\r\n\r\n    movu        [r2 + r6 ], m0\r\n    movu        [r2 + r3 * 4], m1\r\n    lea         r2, [r2 + r3 * 4]\r\n    movu        [r2 + r3], m2\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t 
dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_16xN 1\r\nINIT_XMM ssse3\r\ncglobal filterPixelToShort_16x%1, 3, 7, 6\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r3 * 3]\r\n    lea         r5, [r1 * 3]\r\n\r\n   ; load height\r\n    mov         r6d, %1/4\r\n\r\n    ; load constant\r\n    mova        m4, [pb_128]\r\n    mova        m5, [tab_c_64_n64]\r\n\r\n.loop:\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0], m0\r\n    movu        [r2 + r3 * 1], m1\r\n    movu        [r2 + r3 * 2], m2\r\n    movu        [r2 + r4], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 16], m0\r\n    movu        [r2 + r3 * 1 + 16], m1\r\n    movu        [r2 + r3 * 2 + 16], m2\r\n    movu        [r2 + r4 + 16], m3\r\n\r\n    lea         r0, [r0 + r1 * 4 - 8]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_16xN 16\r\n    P2S_H_16xN 4\r\n    P2S_H_16xN 8\r\n    P2S_H_16xN 12\r\n    P2S_H_16xN 32\r\n    P2S_H_16xN 64\r\n    P2S_H_16xN 24\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t 
dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_16x4, 3, 4, 2\r\n    mov             r3d, r3m\r\n    add             r3d, r3d\r\n\r\n    ; load constant\r\n    vbroadcasti128  m1, [pw_2000]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    lea             r1, [r1 * 3]\r\n    lea             r3, [r3 * 3]\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_16x8, 3, 6, 2\r\n    mov             r3d, r3m\r\n    add             r3d, r3d\r\n    lea             r4, [r1 * 3]\r\n    lea             r5, [r3 * 3]\r\n\r\n    ; load constant\r\n    vbroadcasti128  m1, [pw_2000]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea  
           r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_16x12, 3, 6, 2\r\n    mov             r3d, r3m\r\n    add             r3d, r3d\r\n    lea             r4, [r1 * 3]\r\n    lea             r5, [r3 * 3]\r\n\r\n    ; load constant\r\n    vbroadcasti128  m1, [pw_2000]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw        
   m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_16x16, 3, 6, 2\r\n    mov             r3d, r3m\r\n    add             r3d, r3d\r\n    lea             r4, [r1 * 3]\r\n    lea             r5, [r3 * 3]\r\n\r\n    ; load constant\r\n    vbroadcasti128  m1, [pw_2000]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           
m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n    
RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_16x24, 3, 7, 2\r\n    mov             r3d, r3m\r\n    add             r3d, r3d\r\n    lea             r4, [r1 * 3]\r\n    lea             r5, [r3 * 3]\r\n    mov             r6d, 3\r\n\r\n    ; load constant\r\n    vbroadcasti128  m1, [pw_2000]\r\n.loop:\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    dec             r6d\r\n    jnz             .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void 
filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_16xN_avx2 1\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_16x%1, 3, 7, 2\r\n    mov             r3d, r3m\r\n    add             r3d, r3d\r\n    lea             r4, [r1 * 3]\r\n    lea             r5, [r3 * 3]\r\n    mov             r6d, %1/16\r\n\r\n    ; load constant\r\n    vbroadcasti128  m1, [pw_2000]\r\n.loop:\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu  
          [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    pmovzxbw        m0, [r0]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3], m0\r\n\r\n    pmovzxbw        m0, [r0 + r1 * 2]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r3 * 2], m0\r\n\r\n    pmovzxbw        m0, [r0 + r4]\r\n    psllw           m0, 6\r\n    psubw           m0, m1\r\n    movu            [r2 + r5], m0\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    dec             r6d\r\n    jnz             .loop\r\n    RET\r\n%endmacro\r\nP2S_H_16xN_avx2 32\r\nP2S_H_16xN_avx2 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_32xN 1\r\nINIT_XMM ssse3\r\ncglobal filterPixelToShort_32x%1, 3, 7, 6\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r3 * 3]\r\n    lea         r5, [r1 * 3]\r\n\r\n    ; load height\r\n    mov         r6d, %1/4\r\n\r\n    ; load constant\r\n    mova        m4, [pb_128]\r\n    mova        m5, [tab_c_64_n64]\r\n\r\n.loop:\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n  
  punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0], m0\r\n    movu        [r2 + r3 * 1], m1\r\n    movu        [r2 + r3 * 2], m2\r\n    movu        [r2 + r4], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 16], m0\r\n    movu        [r2 + r3 * 1 + 16], m1\r\n    movu        [r2 + r3 * 2 + 16], m2\r\n    movu        [r2 + r4 + 16], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 32], m0\r\n    movu        [r2 + r3 * 1 + 32], m1\r\n    movu        [r2 + r3 * 2 + 32], m2\r\n    movu        [r2 + r4 + 32], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 48], m0\r\n    movu        [r2 + r3 * 1 + 48], m1\r\n    movu        [r2 + r3 * 2 + 48], m2\r\n    movu        [r2 + r4 + 48], m3\r\n\r\n    lea         r0, [r0 + r1 * 
4 - 24]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_32xN 32\r\n    P2S_H_32xN 8\r\n    P2S_H_32xN 16\r\n    P2S_H_32xN 24\r\n    P2S_H_32xN 64\r\n    P2S_H_32xN 48\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_32xN_avx2 1\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_32x%1, 3, 7, 3\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r5, [r1 * 3]\r\n    lea         r6, [r3 * 3]\r\n\r\n    ; load height\r\n    mov         r4d, %1/4\r\n\r\n    ; load constant\r\n    vpbroadcastd m2, [pw_2000]\r\n\r\n.loop:\r\n    pmovzxbw    m0, [r0 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + 1 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psubw       m0, m2\r\n    psubw       m1, m2\r\n    movu        [r2 + 0 * mmsize], m0\r\n    movu        [r2 + 1 * mmsize], m1\r\n\r\n    pmovzxbw    m0, [r0 + r1 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + r1 + 1 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psubw       m0, m2\r\n    psubw       m1, m2\r\n    movu        [r2 + r3 + 0 * mmsize], m0\r\n    movu        [r2 + r3 + 1 * mmsize], m1\r\n\r\n    pmovzxbw    m0, [r0 + r1 * 2 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + r1 * 2 + 1 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psubw       m0, m2\r\n    psubw       m1, m2\r\n    movu        [r2 + r3 * 2 + 0 * mmsize], m0\r\n    movu        [r2 + r3 * 2 + 1 * mmsize], m1\r\n\r\n    pmovzxbw    m0, [r0 + r5 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + r5 + 1 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psubw       m0, m2\r\n    psubw       m1, m2\r\n    movu        [r2 + r6 + 0 * mmsize], m0\r\n    movu        [r2 + r6 + 
1 * mmsize], m1\r\n\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_32xN_avx2 32\r\n    P2S_H_32xN_avx2 8\r\n    P2S_H_32xN_avx2 16\r\n    P2S_H_32xN_avx2 24\r\n    P2S_H_32xN_avx2 64\r\n    P2S_H_32xN_avx2 48\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_64xN 1\r\nINIT_XMM ssse3\r\ncglobal filterPixelToShort_64x%1, 3, 7, 6\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r3 * 3]\r\n    lea         r5, [r1 * 3]\r\n\r\n    ; load height\r\n    mov         r6d, %1/4\r\n\r\n    ; load constant\r\n    mova        m4, [pb_128]\r\n    mova        m5, [tab_c_64_n64]\r\n\r\n.loop:\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0], m0\r\n    movu        [r2 + r3 * 1], m1\r\n    movu        [r2 + r3 * 2], m2\r\n    movu        [r2 + r4], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 16], m0\r\n    movu        [r2 + r3 * 1 + 16], m1\r\n    movu        [r2 + r3 * 2 + 16], m2\r\n    movu        [r2 
+ r4 + 16], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 32], m0\r\n    movu        [r2 + r3 * 1 + 32], m1\r\n    movu        [r2 + r3 * 2 + 32], m2\r\n    movu        [r2 + r4 + 32], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 48], m0\r\n    movu        [r2 + r3 * 1 + 48], m1\r\n    movu        [r2 + r3 * 2 + 48], m2\r\n    movu        [r2 + r4 + 48], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 64], m0\r\n    movu        [r2 + r3 * 1 + 64], m1\r\n    movu        [r2 + r3 * 2 + 64], m2\r\n    movu        [r2 + r4 + 64], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, 
m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 80], m0\r\n    movu        [r2 + r3 * 1 + 80], m1\r\n    movu        [r2 + r3 * 2 + 80], m2\r\n    movu        [r2 + r4 + 80], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 96], m0\r\n    movu        [r2 + r3 * 1 + 96], m1\r\n    movu        [r2 + r3 * 2 + 96], m2\r\n    movu        [r2 + r4 + 96], m3\r\n\r\n    lea         r0, [r0 + 8]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n\r\n    movh        m1, [r0 + r1]\r\n    punpcklbw   m1, m4\r\n    pmaddubsw   m1, m5\r\n\r\n    movh        m2, [r0 + r1 * 2]\r\n    punpcklbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n\r\n    movh        m3, [r0 + r5]\r\n    punpcklbw   m3, m4\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0 + 112], m0\r\n    movu        [r2 + r3 * 1 + 112], m1\r\n    movu        [r2 + r3 * 2 + 112], m2\r\n    movu        [r2 + r4 + 112], m3\r\n\r\n    lea         r0, [r0 + r1 * 4 - 56]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_64xN 64\r\n    P2S_H_64xN 16\r\n    P2S_H_64xN 32\r\n    P2S_H_64xN 48\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_64xN_avx2 1\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_64x%1, 3, 7, 5\r\n    mov         
r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r5, [r1 * 3]\r\n    lea         r6, [r3 * 3]\r\n\r\n    ; load height\r\n    mov         r4d, %1/4\r\n\r\n    ; load constant\r\n    vpbroadcastd m4, [pw_2000]\r\n\r\n.loop:\r\n    pmovzxbw    m0, [r0 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + 1 * mmsize/2]\r\n    pmovzxbw    m2, [r0 + 2 * mmsize/2]\r\n    pmovzxbw    m3, [r0 + 3 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psllw       m2, 6\r\n    psllw       m3, 6\r\n    psubw       m0, m4\r\n    psubw       m1, m4\r\n    psubw       m2, m4\r\n    psubw       m3, m4\r\n\r\n    movu        [r2 + 0 * mmsize], m0\r\n    movu        [r2 + 1 * mmsize], m1\r\n    movu        [r2 + 2 * mmsize], m2\r\n    movu        [r2 + 3 * mmsize], m3\r\n\r\n    pmovzxbw    m0, [r0 + r1 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + r1 + 1 * mmsize/2]\r\n    pmovzxbw    m2, [r0 + r1 + 2 * mmsize/2]\r\n    pmovzxbw    m3, [r0 + r1 + 3 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psllw       m2, 6\r\n    psllw       m3, 6\r\n    psubw       m0, m4\r\n    psubw       m1, m4\r\n    psubw       m2, m4\r\n    psubw       m3, m4\r\n\r\n    movu        [r2 + r3 + 0 * mmsize], m0\r\n    movu        [r2 + r3 + 1 * mmsize], m1\r\n    movu        [r2 + r3 + 2 * mmsize], m2\r\n    movu        [r2 + r3 + 3 * mmsize], m3\r\n\r\n    pmovzxbw    m0, [r0 + r1 * 2 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + r1 * 2 + 1 * mmsize/2]\r\n    pmovzxbw    m2, [r0 + r1 * 2 + 2 * mmsize/2]\r\n    pmovzxbw    m3, [r0 + r1 * 2 + 3 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psllw       m2, 6\r\n    psllw       m3, 6\r\n    psubw       m0, m4\r\n    psubw       m1, m4\r\n    psubw       m2, m4\r\n    psubw       m3, m4\r\n\r\n    movu        [r2 + r3 * 2 + 0 * mmsize], m0\r\n    movu        [r2 + r3 * 2 + 1 * mmsize], m1\r\n    movu        [r2 + r3 * 2 + 2 * mmsize], m2\r\n    movu        [r2 + r3 * 2 + 3 * mmsize], m3\r\n\r\n    
pmovzxbw    m0, [r0 + r5 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + r5 + 1 * mmsize/2]\r\n    pmovzxbw    m2, [r0 + r5 + 2 * mmsize/2]\r\n    pmovzxbw    m3, [r0 + r5 + 3 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psllw       m2, 6\r\n    psllw       m3, 6\r\n    psubw       m0, m4\r\n    psubw       m1, m4\r\n    psubw       m2, m4\r\n    psubw       m3, m4\r\n\r\n    movu        [r2 + r6 + 0 * mmsize], m0\r\n    movu        [r2 + r6 + 1 * mmsize], m1\r\n    movu        [r2 + r6 + 2 * mmsize], m2\r\n    movu        [r2 + r6 + 3 * mmsize], m3\r\n\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_64xN_avx2 64\r\n    P2S_H_64xN_avx2 16\r\n    P2S_H_64xN_avx2 32\r\n    P2S_H_64xN_avx2 48\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_12xN 1\r\nINIT_XMM ssse3\r\ncglobal filterPixelToShort_12x%1, 3, 7, 6\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r1 * 3]\r\n    lea         r6, [r3 * 3]\r\n    mov         r5d, %1/4\r\n\r\n    ; load constant\r\n    mova        m4, [pb_128]\r\n    mova        m5, [tab_c_64_n64]\r\n\r\n.loop:\r\n    movu        m0, [r0]\r\n    punpcklbw   m1, m0, m4\r\n    punpckhbw   m0, m4\r\n    pmaddubsw   m0, m5\r\n    pmaddubsw   m1, m5\r\n\r\n    movu        m2, [r0 + r1]\r\n    punpcklbw   m3, m2, m4\r\n    punpckhbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 0], m1\r\n    movu        [r2 + r3 * 1], m3\r\n\r\n    movh        [r2 + r3 * 0 + 16], m0\r\n    movh        [r2 + r3 * 1 + 16], m2\r\n\r\n    movu        m0, [r0 + r1 * 2]\r\n    punpcklbw   m1, m0, m4\r\n    punpckhbw   m0, m4\r\n    
pmaddubsw   m0, m5\r\n    pmaddubsw   m1, m5\r\n\r\n    movu        m2, [r0 + r4]\r\n    punpcklbw   m3, m2, m4\r\n    punpckhbw   m2, m4\r\n    pmaddubsw   m2, m5\r\n    pmaddubsw   m3, m5\r\n\r\n    movu        [r2 + r3 * 2], m1\r\n    movu        [r2 + r6], m3\r\n\r\n    movh        [r2 + r3 * 2 + 16], m0\r\n    movh        [r2 + r6 + 16], m2\r\n\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r5d\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_12xN 16\r\n    P2S_H_12xN 32\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_24xN 1\r\nINIT_XMM ssse3\r\ncglobal filterPixelToShort_24x%1, 3, 7, 5\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r1 * 3]\r\n    lea         r5, [r3 * 3]\r\n    mov         r6d, %1/4\r\n\r\n    ; load constant\r\n    mova        m3, [pb_128]\r\n    mova        m4, [tab_c_64_n64]\r\n\r\n.loop:\r\n    movu        m0, [r0]\r\n    punpcklbw   m1, m0, m3\r\n    punpckhbw   m0, m3\r\n    pmaddubsw   m0, m4\r\n    pmaddubsw   m1, m4\r\n\r\n    movu        m2, [r0 + 16]\r\n    punpcklbw   m2, m3\r\n    pmaddubsw   m2, m4\r\n\r\n    movu        [r2 +  r3 * 0], m1\r\n    movu        [r2 +  r3 * 0 + 16], m0\r\n    movu        [r2 +  r3 * 0 + 32], m2\r\n\r\n    movu        m0, [r0 + r1]\r\n    punpcklbw   m1, m0, m3\r\n    punpckhbw   m0, m3\r\n    pmaddubsw   m0, m4\r\n    pmaddubsw   m1, m4\r\n\r\n    movu        m2, [r0 + r1 + 16]\r\n    punpcklbw   m2, m3\r\n    pmaddubsw   m2, m4\r\n\r\n    movu        [r2 +  r3 * 1], m1\r\n    movu        [r2 +  r3 * 1 + 16], m0\r\n    movu        [r2 +  r3 * 1 + 32], m2\r\n\r\n    movu        m0, [r0 + r1 * 2]\r\n    punpcklbw   m1, m0, m3\r\n    punpckhbw   m0, m3\r\n    pmaddubsw   m0, m4\r\n  
  pmaddubsw   m1, m4\r\n\r\n    movu        m2, [r0 + r1 * 2 + 16]\r\n    punpcklbw   m2, m3\r\n    pmaddubsw   m2, m4\r\n\r\n    movu        [r2 +  r3 * 2], m1\r\n    movu        [r2 +  r3 * 2 + 16], m0\r\n    movu        [r2 +  r3 * 2 + 32], m2\r\n\r\n    movu        m0, [r0 + r4]\r\n    punpcklbw   m1, m0, m3\r\n    punpckhbw   m0, m3\r\n    pmaddubsw   m0, m4\r\n    pmaddubsw   m1, m4\r\n\r\n    movu        m2, [r0 + r4 + 16]\r\n    punpcklbw   m2, m3\r\n    pmaddubsw   m2, m4\r\n    movu        [r2 +  r5], m1\r\n    movu        [r2 +  r5 + 16], m0\r\n    movu        [r2 +  r5 + 32], m2\r\n\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_24xN 32\r\n    P2S_H_24xN 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\n%macro P2S_H_24xN_avx2 1\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_24x%1, 3, 7, 4\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r1 * 3]\r\n    lea         r5, [r3 * 3]\r\n    mov         r6d, %1/4\r\n\r\n    ; load constant\r\n    vpbroadcastd m1, [pw_2000]\r\n    vpbroadcastd m2, [pb_128]\r\n    vpbroadcastd m3, [tab_c_64_n64]\r\n\r\n.loop:\r\n    pmovzxbw    m0, [r0]\r\n    psllw       m0, 6\r\n    psubw       m0, m1\r\n    movu        [r2], m0\r\n\r\n    movu        m0, [r0 + mmsize/2]\r\n    punpcklbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    movu        [r2 +  r3 * 0 + mmsize], xm0\r\n\r\n    pmovzxbw    m0, [r0 + r1]\r\n    psllw       m0, 6\r\n    psubw       m0, m1\r\n    movu        [r2 + r3], m0\r\n\r\n    movu        m0, [r0 + r1 + mmsize/2]\r\n    punpcklbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    movu        [r2 +  r3 * 1 + mmsize], xm0\r\n\r\n    pmovzxbw    m0, [r0 + r1 * 
2]\r\n    psllw       m0, 6\r\n    psubw       m0, m1\r\n    movu        [r2 + r3 * 2], m0\r\n\r\n    movu        m0, [r0 + r1 * 2 + mmsize/2]\r\n    punpcklbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    movu        [r2 +  r3 * 2 + mmsize], xm0\r\n\r\n    pmovzxbw    m0, [r0 + r4]\r\n    psllw       m0, 6\r\n    psubw       m0, m1\r\n    movu        [r2 + r5], m0\r\n\r\n    movu        m0, [r0 + r4 + mmsize/2]\r\n    punpcklbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    movu        [r2 + r5 + mmsize], xm0\r\n\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n    RET\r\n%endmacro\r\n    P2S_H_24xN_avx2 32\r\n    P2S_H_24xN_avx2 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM ssse3\r\ncglobal filterPixelToShort_48x64, 3, 7, 4\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r4, [r1 * 3]\r\n    lea         r5, [r3 * 3]\r\n    mov         r6d, 16\r\n\r\n    ; load constant\r\n    mova        m2, [pb_128]\r\n    mova        m3, [tab_c_64_n64]\r\n\r\n.loop:\r\n    movu        m0, [r0]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r3 * 0], m1\r\n    movu        [r2 +  r3 * 0 + 16], m0\r\n\r\n    movu        m0, [r0 + 16]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r3 * 0 + 32], m1\r\n    movu        [r2 +  r3 * 0 + 48], m0\r\n\r\n    movu        m0, [r0 + 32]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r3 * 0 + 64], m1\r\n    movu        [r2 +  r3 * 0 + 80], m0\r\n\r\n    
movu        m0, [r0 + r1]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r3 * 1], m1\r\n    movu        [r2 +  r3 * 1 + 16], m0\r\n\r\n    movu        m0, [r0 + r1 + 16]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r3 * 1 + 32], m1\r\n    movu        [r2 +  r3 * 1 + 48], m0\r\n\r\n    movu        m0, [r0 + r1 + 32]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r3 * 1 + 64], m1\r\n    movu        [r2 +  r3 * 1 + 80], m0\r\n\r\n    movu        m0, [r0 + r1 * 2]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r3 * 2], m1\r\n    movu        [r2 +  r3 * 2 + 16], m0\r\n\r\n    movu        m0, [r0 + r1 * 2 + 16]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r3 * 2 + 32], m1\r\n    movu        [r2 +  r3 * 2 + 48], m0\r\n\r\n    movu        m0, [r0 + r1 * 2 + 32]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r3 * 2 + 64], m1\r\n    movu        [r2 +  r3 * 2 + 80], m0\r\n\r\n    movu        m0, [r0 + r4]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r5], m1\r\n    movu        [r2 +  r5 + 16], m0\r\n\r\n    movu        m0, [r0 + r4 + 16]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r5 + 32], m1\r\n    movu        [r2 +  r5 + 48], m0\r\n\r\n    movu        m0, [r0 + r4 + 32]\r\n    punpcklbw   m1, m0, m2\r\n    punpckhbw   m0, m2\r\n    pmaddubsw   m0, 
m3\r\n    pmaddubsw   m1, m3\r\n\r\n    movu        [r2 +  r5 + 64], m1\r\n    movu        [r2 +  r5 + 80], m0\r\n\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal filterPixelToShort_48x64, 3,7,4\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n    lea         r5, [r1 * 3]\r\n    lea         r6, [r3 * 3]\r\n\r\n    ; load height\r\n    mov         r4d, 64/4\r\n\r\n    ; load constant\r\n    vpbroadcastd m3, [pw_2000]\r\n\r\n    ; just unroll(1) because it is best choice for 48x64\r\n.loop:\r\n    pmovzxbw    m0, [r0 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + 1 * mmsize/2]\r\n    pmovzxbw    m2, [r0 + 2 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psllw       m2, 6\r\n    psubw       m0, m3\r\n    psubw       m1, m3\r\n    psubw       m2, m3\r\n    movu        [r2 + 0 * mmsize], m0\r\n    movu        [r2 + 1 * mmsize], m1\r\n    movu        [r2 + 2 * mmsize], m2\r\n\r\n    pmovzxbw    m0, [r0 + r1 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + r1 + 1 * mmsize/2]\r\n    pmovzxbw    m2, [r0 + r1 + 2 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psllw       m2, 6\r\n    psubw       m0, m3\r\n    psubw       m1, m3\r\n    psubw       m2, m3\r\n    movu        [r2 + r3 + 0 * mmsize], m0\r\n    movu        [r2 + r3 + 1 * mmsize], m1\r\n    movu        [r2 + r3 + 2 * mmsize], m2\r\n\r\n    pmovzxbw    m0, [r0 + r1 * 2 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + r1 * 2 + 1 * mmsize/2]\r\n    pmovzxbw    m2, [r0 + r1 * 2 + 2 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psllw       m2, 6\r\n    psubw       m0, m3\r\n    
psubw       m1, m3\r\n    psubw       m2, m3\r\n    movu        [r2 + r3 * 2 + 0 * mmsize], m0\r\n    movu        [r2 + r3 * 2 + 1 * mmsize], m1\r\n    movu        [r2 + r3 * 2 + 2 * mmsize], m2\r\n\r\n    pmovzxbw    m0, [r0 + r5 + 0 * mmsize/2]\r\n    pmovzxbw    m1, [r0 + r5 + 1 * mmsize/2]\r\n    pmovzxbw    m2, [r0 + r5 + 2 * mmsize/2]\r\n    psllw       m0, 6\r\n    psllw       m1, 6\r\n    psllw       m2, 6\r\n    psubw       m0, m3\r\n    psubw       m1, m3\r\n    psubw       m2, m3\r\n    movu        [r2 + r6 + 0 * mmsize], m0\r\n    movu        [r2 + r6 + 1 * mmsize], m1\r\n    movu        [r2 + r6 + 2 * mmsize], m2\r\n\r\n    lea         r0, [r0 + r1 * 4]\r\n    lea         r2, [r2 + r3 * 4]\r\n\r\n    dec         r4d\r\n    jnz        .loop\r\n    RET\r\n\r\n\r\n%macro PROCESS_LUMA_W4_4R 0\r\n    movd        m0, [r0]\r\n    movd        m1, [r0 + r1]\r\n    punpcklbw   m2, m0, m1                     ; m2=[0 1]\r\n\r\n    lea         r0, [r0 + 2 * r1]\r\n    movd        m0, [r0]\r\n    punpcklbw   m1, m0                         ; m1=[1 2]\r\n    punpcklqdq  m2, m1                         ; m2=[0 1 1 2]\r\n    pmaddubsw   m4, m2, [r6 + 0 * 16]          ; m4=[0+1 1+2]\r\n\r\n    movd        m1, [r0 + r1]\r\n    punpcklbw   m5, m0, m1                     ; m5=[2 3]\r\n    lea         r0, [r0 + 2 * r1]\r\n    movd        m0, [r0]\r\n    punpcklbw   m1, m0                         ; m1=[3 4]\r\n    punpcklqdq  m5, m1                         ; m5=[2 3 3 4]\r\n    pmaddubsw   m2, m5, [r6 + 1 * 16]          ; m2=[2+3 3+4]\r\n    paddw       m4, m2                         ; m4=[0+1+2+3 1+2+3+4]                   Row1-2\r\n    pmaddubsw   m5, [r6 + 0 * 16]              ; m5=[2+3 3+4]                           Row3-4\r\n\r\n    movd        m1, [r0 + r1]\r\n    punpcklbw   m2, m0, m1                     ; m2=[4 5]\r\n    lea         r0, [r0 + 2 * r1]\r\n    movd        m0, [r0]\r\n    punpcklbw   m1, m0                         ; m1=[5 6]\r\n    punpcklqdq  m2, m1      
                   ; m2=[4 5 5 6]\r\n    pmaddubsw   m1, m2, [r6 + 2 * 16]          ; m1=[4+5 5+6]\r\n    paddw       m4, m1                         ; m4=[0+1+2+3+4+5 1+2+3+4+5+6]           Row1-2\r\n    pmaddubsw   m2, [r6 + 1 * 16]              ; m2=[4+5 5+6]\r\n    paddw       m5, m2                         ; m5=[2+3+4+5 3+4+5+6]                   Row3-4\r\n\r\n    movd        m1, [r0 + r1]\r\n    punpcklbw   m2, m0, m1                     ; m2=[6 7]\r\n    lea         r0, [r0 + 2 * r1]\r\n    movd        m0, [r0]\r\n    punpcklbw   m1, m0                         ; m1=[7 8]\r\n    punpcklqdq  m2, m1                         ; m2=[6 7 7 8]\r\n    pmaddubsw   m1, m2, [r6 + 3 * 16]          ; m1=[6+7 7+8]\r\n    paddw       m4, m1                         ; m4=[0+1+2+3+4+5+6+7 1+2+3+4+5+6+7+8]   Row1-2 end\r\n    pmaddubsw   m2, [r6 + 2 * 16]              ; m2=[6+7 7+8]\r\n    paddw       m5, m2                         ; m5=[2+3+4+5+6+7 3+4+5+6+7+8]           Row3-4\r\n\r\n    movd        m1, [r0 + r1]\r\n    punpcklbw   m2, m0, m1                     ; m2=[8 9]\r\n    movd        m0, [r0 + 2 * r1]\r\n    punpcklbw   m1, m0                         ; m1=[9 10]\r\n    punpcklqdq  m2, m1                         ; m2=[8 9 9 10]\r\n    pmaddubsw   m2, [r6 + 3 * 16]              ; m2=[8+9 9+10]\r\n    paddw       m5, m2                         ; m5=[2+3+4+5+6+7+8+9 3+4+5+6+7+8+9+10]  Row3-4 end\r\n%endmacro\r\n\r\n%macro PROCESS_LUMA_W8_4R 0\r\n    movq       m0, [r0]\r\n    movq       m1, [r0 + r1]\r\n    punpcklbw  m0, m1\r\n    pmaddubsw  m7, m0, [r6 + 0 *16]            ;m7=[0+1]               Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m0, [r0]\r\n    punpcklbw  m1, m0\r\n    pmaddubsw  m6, m1, [r6 + 0 *16]            ;m6=[1+2]               Row2\r\n\r\n    movq       m1, [r0 + r1]\r\n    punpcklbw  m0, m1\r\n    pmaddubsw  m5, m0, [r6 + 0 *16]            ;m5=[2+3]               Row3\r\n    pmaddubsw  m0, [r6 + 1 * 16]\r\n    paddw      m7, m0        
                  ;m7=[0+1+2+3]           Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m0, [r0]\r\n    punpcklbw  m1, m0\r\n    pmaddubsw  m4, m1, [r6 + 0 *16]            ;m4=[3+4]               Row4\r\n    pmaddubsw  m1, [r6 + 1 * 16]\r\n    paddw      m6, m1                          ;m6 = [1+2+3+4]         Row2\r\n\r\n    movq       m1, [r0 + r1]\r\n    punpcklbw  m0, m1\r\n    pmaddubsw  m2, m0, [r6 + 1 * 16]\r\n    pmaddubsw  m0, [r6 + 2 * 16]\r\n    paddw      m7, m0                          ;m7=[0+1+2+3+4+5]       Row1\r\n    paddw      m5, m2                          ;m5=[2+3+4+5]           Row3\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m0, [r0]\r\n    punpcklbw  m1, m0\r\n    pmaddubsw  m2, m1, [r6 + 1 * 16]\r\n    pmaddubsw  m1, [r6 + 2 * 16]\r\n    paddw      m6, m1                          ;m6=[1+2+3+4+5+6]       Row2\r\n    paddw      m4, m2                          ;m4=[3+4+5+6]           Row4\r\n\r\n    movq       m1, [r0 + r1]\r\n    punpcklbw  m0, m1\r\n    pmaddubsw  m2, m0, [r6 + 2 * 16]\r\n    pmaddubsw  m0, [r6 + 3 * 16]\r\n    paddw      m7, m0                          ;m7=[0+1+2+3+4+5+6+7]   Row1 end\r\n    paddw      m5, m2                          ;m5=[2+3+4+5+6+7]       Row3\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m0, [r0]\r\n    punpcklbw  m1, m0\r\n    pmaddubsw  m2, m1, [r6 + 2 * 16]\r\n    pmaddubsw  m1, [r6 + 3 * 16]\r\n    paddw      m6, m1                          ;m6=[1+2+3+4+5+6+7+8]   Row2 end\r\n    paddw      m4, m2                          ;m4=[3+4+5+6+7+8]       Row4\r\n\r\n    movq       m1, [r0 + r1]\r\n    punpcklbw  m0, m1\r\n    pmaddubsw  m0, [r6 + 3 * 16]\r\n    paddw      m5, m0                          ;m5=[2+3+4+5+6+7+8+9]   Row3 end\r\n\r\n    movq       m0, [r0 + 2 * r1]\r\n    punpcklbw  m1, m0\r\n    pmaddubsw  m1, [r6 + 3 * 16]\r\n    paddw      m4, m1                          ;m4=[3+4+5+6+7+8+9+10]  Row4 
end\r\n%endmacro\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_%3_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_LUMA_4xN 3\r\nINIT_XMM sse4\r\ncglobal interp_8tap_vert_%3_%1x%2, 5, 7, 6\r\n    lea       r5, [3 * r1]\r\n    sub       r0, r5\r\n    shl       r4d, 6\r\n%ifidn %3,ps\r\n    add       r3d, r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_LumaCoeffVer]\r\n    lea       r6, [r5 + r4]\r\n%else\r\n    lea       r6, [tab_LumaCoeffVer + r4]\r\n%endif\r\n\r\n%ifidn %3,pp\r\n    mova      m3, [pw_512]\r\n%else\r\n    mova      m3, [pw_2000]\r\n%endif\r\n\r\n    mov       r4d, %2/4\r\n    lea       r5, [4 * r1]\r\n\r\n.loopH:\r\n    PROCESS_LUMA_W4_4R\r\n\r\n%ifidn %3,pp\r\n    pmulhrsw  m4, m3\r\n    pmulhrsw  m5, m3\r\n\r\n    packuswb  m4, m5\r\n\r\n    movd      [r2], m4\r\n    pextrd    [r2 + r3], m4, 1\r\n    lea       r2, [r2 + 2 * r3]\r\n    pextrd    [r2], m4, 2\r\n    pextrd    [r2 + r3], m4, 3\r\n%else\r\n    psubw     m4, m3\r\n    psubw     m5, m3\r\n\r\n    movlps    [r2], m4\r\n    movhps    [r2 + r3], m4\r\n    lea       r2, [r2 + 2 * r3]\r\n    movlps    [r2], m5\r\n    movhps    [r2 + r3], m5\r\n%endif\r\n\r\n    sub       r0, r5\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_pp_4x4, 4,6,8\r\n    mov             r4d, r4m\r\n    lea             r5, [r1 * 3]\r\n    sub             r0, r5\r\n\r\n    ; TODO: VPGATHERDD\r\n    movd            xm1, [r0]                       ; m1 = row0\r\n    movd            xm2, [r0 + r1]                  ; m2 = row1\r\n    punpcklbw       xm1, xm2                        ; m1 = [13 03 12 02 11 01 10 00]\r\n\r\n    movd        
    xm3, [r0 + r1 * 2]              ; m3 = row2\r\n    punpcklbw       xm2, xm3                        ; m2 = [23 13 22 12 21 11 20 10]\r\n    movd            xm4, [r0 + r5]\r\n    punpcklbw       xm3, xm4                        ; m3 = [33 23 32 22 31 21 30 20]\r\n    punpcklwd       xm1, xm3                        ; m1 = [33 23 13 03 32 22 12 02 31 21 11 01 30 20 10 00]\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm5, [r0]                       ; m5 = row4\r\n    punpcklbw       xm4, xm5                        ; m4 = [43 33 42 32 41 31 40 30]\r\n    punpcklwd       xm2, xm4                        ; m2 = [43 33 23 13 42 32 22 12 41 31 21 11 40 30 20 10]\r\n    vinserti128     m1, m1, xm2, 1                  ; m1 = [43 33 23 13 42 32 22 12 41 31 21 11 40 30 20 10] - [33 23 13 03 32 22 12 02 31 21 11 01 30 20 10 00]\r\n    movd            xm2, [r0 + r1]                  ; m2 = row5\r\n    punpcklbw       xm5, xm2                        ; m5 = [53 43 52 42 51 41 50 40]\r\n    punpcklwd       xm3, xm5                        ; m3 = [53 43 33 23 52 42 32 22 51 41 31 21 50 40 30 20]\r\n    movd            xm6, [r0 + r1 * 2]              ; m6 = row6\r\n    punpcklbw       xm2, xm6                        ; m2 = [63 53 62 52 61 51 60 50]\r\n    punpcklwd       xm4, xm2                        ; m4 = [63 53 43 33 62 52 42 32 61 51 41 31 60 50 40 30]\r\n    vinserti128     m3, m3, xm4, 1                  ; m3 = [63 53 43 33 62 52 42 32 61 51 41 31 60 50 40 30] - [53 43 33 23 52 42 32 22 51 41 31 21 50 40 30 20]\r\n    movd            xm4, [r0 + r5]                  ; m4 = row7\r\n    punpcklbw       xm6, xm4                        ; m6 = [73 63 72 62 71 61 70 60]\r\n    punpcklwd       xm5, xm6                        ; m5 = [73 63 53 43 72 62 52 42 71 61 51 41 70 60 50 40]\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm7, [r0]                       ; m7 = row8\r\n    punpcklbw       xm4, xm7                        ; m4 = [83 73 82 
72 81 71 80 70]\r\n    punpcklwd       xm2, xm4                        ; m2 = [83 73 63 53 82 72 62 52 81 71 61 51 80 70 60 50]\r\n    vinserti128     m5, m5, xm2, 1                  ; m5 = [83 73 63 53 82 72 62 52 81 71 61 51 80 70 60 50] - [73 63 53 43 72 62 52 42 71 61 51 41 70 60 50 40]\r\n    movd            xm2, [r0 + r1]                  ; m2 = row9\r\n    punpcklbw       xm7, xm2                        ; m7 = [93 83 92 82 91 81 90 80]\r\n    punpcklwd       xm6, xm7                        ; m6 = [93 83 73 63 92 82 72 62 91 81 71 61 90 80 70 60]\r\n    movd            xm7, [r0 + r1 * 2]              ; m7 = rowA\r\n    punpcklbw       xm2, xm7                        ; m2 = [A3 93 A2 92 A1 91 A0 90]\r\n    punpcklwd       xm4, xm2                        ; m4 = [A3 93 83 73 A2 92 82 72 A1 91 81 71 A0 90 80 70]\r\n    vinserti128     m6, m6, xm4, 1                  ; m6 = [A3 93 83 73 A2 92 82 72 A1 91 81 71 A0 90 80 70] - [93 83 73 63 92 82 72 62 91 81 71 61 90 80 70 60]\r\n\r\n    ; load filter coeff\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeff]\r\n    vpbroadcastd    m0, [r5 + r4 * 8 + 0]\r\n    vpbroadcastd    m2, [r5 + r4 * 8 + 4]\r\n%else\r\n    vpbroadcastd    m0, [tab_LumaCoeff + r4 * 8 + 0]\r\n    vpbroadcastd    m2, [tab_LumaCoeff + r4 * 8 + 4]\r\n%endif\r\n\r\n    pmaddubsw       m1, m0\r\n    pmaddubsw       m3, m0\r\n    pmaddubsw       m5, m2\r\n    pmaddubsw       m6, m2\r\n    vbroadcasti128  m0, [pw_1]\r\n    pmaddwd         m1, m0\r\n    pmaddwd         m3, m0\r\n    pmaddwd         m5, m0\r\n    pmaddwd         m6, m0\r\n    paddd           m1, m5                          ; m1 = DQWORD ROW[1 0]\r\n    paddd           m3, m6                          ; m3 = DQWORD ROW[3 2]\r\n    packssdw        m1, m3                          ; m1 =  QWORD ROW[3 1 2 0]\r\n\r\n    ; TODO: does it overflow?\r\n    pmulhrsw        m1, [pw_512]\r\n    vextracti128    xm2, m1, 1\r\n    packuswb        xm1, xm2                        ; m1 =  DWORD ROW[3 1 
2 0]\r\n    movd            [r2], xm1\r\n    pextrd          [r2 + r3], xm1, 2\r\n    pextrd          [r2 + r3 * 2], xm1, 1\r\n    lea             r4, [r3 * 3]\r\n    pextrd          [r2 + r4], xm1, 3\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_ps_4x4, 4, 6, 5\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n\r\n    add             r3d, r3d\r\n\r\n    movd            xm1, [r0]\r\n    pinsrd          xm1, [r0 + r1], 1\r\n    pinsrd          xm1, [r0 + r1 * 2], 2\r\n    pinsrd          xm1, [r0 + r4], 3                       ; m1 = row[3 2 1 0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm2, [r0]\r\n    pinsrd          xm2, [r0 + r1], 1\r\n    pinsrd          xm2, [r0 + r1 * 2], 2\r\n    pinsrd          xm2, [r0 + r4], 3                       ; m2 = row[7 6 5 4]\r\n    vinserti128     m1, m1, xm2, 1                          ; m1 = row[7 6 5 4 3 2 1 0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm3, [r0]\r\n    pinsrd          xm3, [r0 + r1], 1\r\n    pinsrd          xm3, [r0 + r1 * 2], 2                   ; m3 = row[x 10 9 8]\r\n    vinserti128     m2, m2, xm3, 1                          ; m2 = row[x 10 9 8 7 6 5 4]\r\n    mova            m3, [interp4_vpp_shuf1]\r\n    vpermd          m0, m3, m1                              ; m0 = row[4 3 3 2 2 1 1 0]\r\n    vpermd          m4, m3, m2                              ; m4 = row[8 7 7 6 6 5 5 4]\r\n    mova            m3, [interp4_vpp_shuf1 + mmsize]\r\n    vpermd          m1, m3, m1                              ; m1 = row[6 5 5 4 4 3 3 2]\r\n    vpermd          m2, m3, m2                              ; m2 = row[10 9 9 8 8 7 7 6]\r\n\r\n    mova            m3, [interp4_vpp_shuf]\r\n    pshufb          m0, m0, 
m3\r\n    pshufb          m1, m1, m3\r\n    pshufb          m4, m4, m3\r\n    pshufb          m2, m2, m3\r\n    pmaddubsw       m0, [r5]\r\n    pmaddubsw       m1, [r5 + mmsize]\r\n    pmaddubsw       m4, [r5 + 2 * mmsize]\r\n    pmaddubsw       m2, [r5 + 3 * mmsize]\r\n    paddw           m0, m1\r\n    paddw           m0, m4\r\n    paddw           m0, m2                                  ; m0 = WORD ROW[3 2 1 0]\r\n\r\n    psubw           m0, [pw_2000]\r\n    vextracti128    xm2, m0, 1\r\n    lea             r5, [r3 * 3]\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r5], xm2\r\n    RET\r\n\r\n%macro FILTER_VER_LUMA_AVX2_4xN 3\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%3_%1x%2, 4, 9, 10\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n    lea             r6, [r1 * 4]\r\n%ifidn %3,pp\r\n    mova            m6, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m6, [pw_2000]\r\n%endif\r\n    lea             r8, [r3 * 3]\r\n    mova            m5, [interp4_vpp_shuf]\r\n    mova            m0, [interp4_vpp_shuf1]\r\n    mova            m7, [interp4_vpp_shuf1 + mmsize]\r\n    mov             r7d, %2 / 8\r\n.loop:\r\n    movd            xm1, [r0]\r\n    pinsrd          xm1, [r0 + r1], 1\r\n    pinsrd          xm1, [r0 + r1 * 2], 2\r\n    pinsrd          xm1, [r0 + r4], 3                       ; m1 = row[3 2 1 0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm2, [r0]\r\n    pinsrd          xm2, [r0 + r1], 1\r\n    pinsrd          xm2, [r0 + r1 * 2], 2\r\n    pinsrd          xm2, [r0 + r4], 3                       ; m2 = row[7 6 5 4]\r\n    vinserti128     m1, 
m1, xm2, 1                          ; m1 = row[7 6 5 4 3 2 1 0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm3, [r0]\r\n    pinsrd          xm3, [r0 + r1], 1\r\n    pinsrd          xm3, [r0 + r1 * 2], 2\r\n    pinsrd          xm3, [r0 + r4], 3                       ; m3 = row[11 10 9 8]\r\n    vinserti128     m2, m2, xm3, 1                          ; m2 = row[11 10 9 8 7 6 5 4]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movd            xm4, [r0]\r\n    pinsrd          xm4, [r0 + r1], 1\r\n    pinsrd          xm4, [r0 + r1 * 2], 2                   ; m4 = row[x 14 13 12]\r\n    vinserti128     m3, m3, xm4, 1                          ; m3 = row[x 14 13 12 11 10 9 8]\r\n    vpermd          m8, m0, m1                              ; m8 = row[4 3 3 2 2 1 1 0]\r\n    vpermd          m4, m0, m2                              ; m4 = row[8 7 7 6 6 5 5 4]\r\n    vpermd          m1, m7, m1                              ; m1 = row[6 5 5 4 4 3 3 2]\r\n    vpermd          m2, m7, m2                              ; m2 = row[10 9 9 8 8 7 7 6]\r\n    vpermd          m9, m0, m3                              ; m9 = row[12 11 11 10 10 9 9 8]\r\n    vpermd          m3, m7, m3                              ; m3 = row[14 13 13 12 12 11 11 10]\r\n\r\n    pshufb          m8, m8, m5\r\n    pshufb          m1, m1, m5\r\n    pshufb          m4, m4, m5\r\n    pshufb          m9, m9, m5\r\n    pshufb          m2, m2, m5\r\n    pshufb          m3, m3, m5\r\n    pmaddubsw       m8, [r5]\r\n    pmaddubsw       m1, [r5 + mmsize]\r\n    pmaddubsw       m9, [r5 + 2 * mmsize]\r\n    pmaddubsw       m3, [r5 + 3 * mmsize]\r\n    paddw           m8, m1\r\n    paddw           m9, m3\r\n    pmaddubsw       m1, m4, [r5 + 2 * mmsize]\r\n    pmaddubsw       m3, m2, [r5 + 3 * mmsize]\r\n    pmaddubsw       m4, [r5]\r\n    pmaddubsw       m2, [r5 + mmsize]\r\n    paddw           m3, m1\r\n    paddw           m2, m4\r\n    paddw           m8, m3                                  ; m8 = WORD 
ROW[3 2 1 0]\r\n    paddw           m9, m2                                  ; m9 = WORD ROW[7 6 5 4]\r\n\r\n%ifidn %3,pp\r\n    pmulhrsw        m8, m6\r\n    pmulhrsw        m9, m6\r\n    packuswb        m8, m9\r\n    vextracti128    xm1, m8, 1\r\n    movd            [r2], xm8\r\n    pextrd          [r2 + r3], xm8, 1\r\n    movd            [r2 + r3 * 2], xm1\r\n    pextrd          [r2 + r8], xm1, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm8, 2\r\n    pextrd          [r2 + r3], xm8, 3\r\n    pextrd          [r2 + r3 * 2], xm1, 2\r\n    pextrd          [r2 + r8], xm1, 3\r\n%else\r\n    psubw           m8, m6\r\n    psubw           m9, m6\r\n    vextracti128    xm1, m8, 1\r\n    vextracti128    xm2, m9, 1\r\n    movq            [r2], xm8\r\n    movhps          [r2 + r3], xm8\r\n    movq            [r2 + r3 * 2], xm1\r\n    movhps          [r2 + r8], xm1\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm9\r\n    movhps          [r2 + r3], xm9\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r8], xm2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n    sub             r0, r6\r\n    dec             r7d\r\n    jnz             .loop\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_4xN 4, 4, pp\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    
FILTER_VER_LUMA_4xN 4, 8, pp\r\n    FILTER_VER_LUMA_AVX2_4xN 4, 8, pp\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_4xN 4, 16, pp\r\n    FILTER_VER_LUMA_AVX2_4xN 4, 16, pp\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_ps_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_4xN 4, 4, ps\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_ps_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_4xN 4, 8, ps\r\n    FILTER_VER_LUMA_AVX2_4xN 4, 8, ps\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_ps_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_4xN 4, 16, ps\r\n    FILTER_VER_LUMA_AVX2_4xN 4, 16, ps\r\n\r\n%macro PROCESS_LUMA_AVX2_W8_8R 0\r\n    movq            xm1, [r0]                       ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 1\r\n    punpcklbw       xm1, xm2                        ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 
10 00]\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3                        ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10]\r\n    vinserti128     m5, m1, xm2, 1                  ; m5 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    pmaddubsw       m5, [r5]\r\n    movq            xm4, [r0 + r4]                  ; m4 = row 3\r\n    punpcklbw       xm3, xm4                        ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm1, [r0]                       ; m1 = row 4\r\n    punpcklbw       xm4, xm1                        ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30]\r\n    vinserti128     m2, m3, xm4, 1                  ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    pmaddubsw       m0, m2, [r5 + 1 * mmsize]\r\n    paddw           m5, m0\r\n    pmaddubsw       m2, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 5\r\n    punpcklbw       xm1, xm3                        ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    movq            xm4, [r0 + r1 * 2]              ; m4 = row 6\r\n    punpcklbw       xm3, xm4                        ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50]\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    pmaddubsw       m3, m1, [r5 + 2 * mmsize]\r\n    paddw           m5, m3\r\n    pmaddubsw       m0, m1, [r5 + 1 * mmsize]\r\n    paddw           m2, m0\r\n    pmaddubsw       m1, [r5]\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 7\r\n    punpcklbw       xm4, xm3                        ; m4 = [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60]\r\n    lea             r0, [r0 + r1 * 4]\r\n   
 movq            xm0, [r0]                       ; m0 = row 8\r\n    punpcklbw       xm3, xm0                        ; m3 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70]\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60]\r\n    pmaddubsw       m3, m4, [r5 + 3 * mmsize]\r\n    paddw           m5, m3\r\n    pmaddubsw       m3, m4, [r5 + 2 * mmsize]\r\n    paddw           m2, m3\r\n    pmaddubsw       m3, m4, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n    pmaddubsw       m4, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 9\r\n    punpcklbw       xm0, xm3                        ; m0 = [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80]\r\n    movq            xm6, [r0 + r1 * 2]              ; m6 = row 10\r\n    punpcklbw       xm3, xm6                        ; m3 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90]\r\n    vinserti128     m0, m0, xm3, 1                  ; m0 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90] - [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80]\r\n    pmaddubsw       m3, m0, [r5 + 3 * mmsize]\r\n    paddw           m2, m3\r\n    pmaddubsw       m3, m0, [r5 + 2 * mmsize]\r\n    paddw           m1, m3\r\n    pmaddubsw       m0, [r5 + 1 * mmsize]\r\n    paddw           m4, m0\r\n\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 11\r\n    punpcklbw       xm6, xm3                        ; m6 = [B7 A7 B6 A6 B5 A5 B4 A4 B3 A3 B2 A2 B1 A1 B0 A0]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm0, [r0]                       ; m0 = row 12\r\n    punpcklbw       xm3, xm0                        ; m3 = [C7 B7 C6 B6 C5 B5 C4 B4 C3 B3 C2 B2 C1 B1 C0 B0]\r\n    vinserti128     m6, m6, xm3, 1                  ; m6 = [C7 B7 C6 B6 C5 B5 C4 B4 C3 B3 C2 B2 C1 B1 C0 B0] - [B7 A7 B6 A6 B5 A5 B4 A4 B3 A3 B2 A2 B1 A1 B0 A0]\r\n    pmaddubsw       m3, m6, [r5 + 3 * mmsize]\r\n    paddw   
        m1, m3\r\n    pmaddubsw       m6, [r5 + 2 * mmsize]\r\n    paddw           m4, m6\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 13\r\n    punpcklbw       xm0, xm3                        ; m0 = [D7 C7 D6 C6 D5 C5 D4 C4 D3 C3 D2 C2 D1 C1 D0 C0]\r\n    movq            xm6, [r0 + r1 * 2]              ; m6 = row 14\r\n    punpcklbw       xm3, xm6                        ; m3 = [E7 D7 E6 D6 E5 D5 E4 D4 E3 D3 E2 D2 E1 D1 E0 D0]\r\n    vinserti128     m0, m0, xm3, 1                  ; m0 = [E7 D7 E6 D6 E5 D5 E4 D4 E3 D3 E2 D2 E1 D1 E0 D0] - [D7 C7 D6 C6 D5 C5 D4 C4 D3 C3 D2 C2 D1 C1 D0 C0]\r\n    pmaddubsw       m0, [r5 + 3 * mmsize]\r\n    paddw           m4, m0\r\n%endmacro\r\n\r\n%macro PROCESS_LUMA_AVX2_W8_4R 0\r\n    movq            xm1, [r0]                       ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 1\r\n    punpcklbw       xm1, xm2                        ; m1 = [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3                        ; m2 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10]\r\n    vinserti128     m5, m1, xm2, 1                  ; m5 = [27 17 26 16 25 15 24 14 23 13 22 12 21 11 20 10] - [17 07 16 06 15 05 14 04 13 03 12 02 11 01 10 00]\r\n    pmaddubsw       m5, [r5]\r\n    movq            xm4, [r0 + r4]                  ; m4 = row 3\r\n    punpcklbw       xm3, xm4                        ; m3 = [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm1, [r0]                       ; m1 = row 4\r\n    punpcklbw       xm4, xm1                        ; m4 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30]\r\n    vinserti128     m2, m3, xm4, 1                  ; m2 = [47 37 46 36 45 35 44 34 43 33 42 32 41 31 40 30] - [37 27 36 26 35 25 34 24 33 23 32 22 31 21 30 20]\r\n    pmaddubsw       m0, m2, [r5 + 1 * mmsize]\r\n    paddw      
     m5, m0\r\n    pmaddubsw       m2, [r5]\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 5\r\n    punpcklbw       xm1, xm3                        ; m1 = [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    movq            xm4, [r0 + r1 * 2]              ; m4 = row 6\r\n    punpcklbw       xm3, xm4                        ; m3 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50]\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [67 57 66 56 65 55 64 54 63 53 62 52 61 51 60 50] - [57 47 56 46 55 45 54 44 53 43 52 42 51 41 50 40]\r\n    pmaddubsw       m3, m1, [r5 + 2 * mmsize]\r\n    paddw           m5, m3\r\n    pmaddubsw       m0, m1, [r5 + 1 * mmsize]\r\n    paddw           m2, m0\r\n    movq            xm3, [r0 + r4]                  ; m3 = row 7\r\n    punpcklbw       xm4, xm3                        ; m4 = [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movq            xm0, [r0]                       ; m0 = row 8\r\n    punpcklbw       xm3, xm0                        ; m3 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70]\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [87 77 86 76 85 75 84 74 83 73 82 72 81 71 80 70] - [77 67 76 66 75 65 74 64 73 63 72 62 71 61 70 60]\r\n    pmaddubsw       m3, m4, [r5 + 3 * mmsize]\r\n    paddw           m5, m3\r\n    pmaddubsw       m3, m4, [r5 + 2 * mmsize]\r\n    paddw           m2, m3\r\n    movq            xm3, [r0 + r1]                  ; m3 = row 9\r\n    punpcklbw       xm0, xm3                        ; m0 = [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80]\r\n    movq            xm6, [r0 + r1 * 2]              ; m6 = row 10\r\n    punpcklbw       xm3, xm6                        ; m3 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90]\r\n    vinserti128     m0, m0, xm3, 1                  ; m0 = [A7 97 A6 96 A5 95 A4 94 A3 93 A2 92 A1 91 A0 90] - [97 87 96 86 95 85 94 84 93 83 92 82 91 81 90 80]\r\n    pmaddubsw       
m3, m0, [r5 + 3 * mmsize]\r\n    paddw           m2, m3\r\n%endmacro\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_%3_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_LUMA_8xN 3\r\nINIT_XMM sse4\r\ncglobal interp_8tap_vert_%3_%1x%2, 5, 7, 8\r\n    lea       r5, [3 * r1]\r\n    sub       r0, r5\r\n    shl       r4d, 6\r\n\r\n%ifidn %3,ps\r\n    add       r3d, r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_LumaCoeffVer]\r\n    lea       r6, [r5 + r4]\r\n%else\r\n    lea       r6, [tab_LumaCoeffVer + r4]\r\n%endif\r\n\r\n %ifidn %3,pp\r\n    mova      m3, [pw_512]\r\n%else\r\n    mova      m3, [pw_2000]\r\n%endif\r\n\r\n    mov       r4d, %2/4\r\n    lea       r5, [4 * r1]\r\n\r\n.loopH:\r\n    PROCESS_LUMA_W8_4R\r\n\r\n%ifidn %3,pp\r\n    pmulhrsw  m7, m3\r\n    pmulhrsw  m6, m3\r\n    pmulhrsw  m5, m3\r\n    pmulhrsw  m4, m3\r\n\r\n    packuswb  m7, m6\r\n    packuswb  m5, m4\r\n\r\n    movlps    [r2], m7\r\n    movhps    [r2 + r3], m7\r\n    lea       r2, [r2 + 2 * r3]\r\n    movlps    [r2], m5\r\n    movhps    [r2 + r3], m5\r\n%else\r\n    psubw     m7, m3\r\n    psubw     m6, m3\r\n    psubw     m5, m3\r\n    psubw     m4, m3\r\n\r\n    movu      [r2], m7\r\n    movu      [r2 + r3], m6\r\n    lea       r2, [r2 + 2 * r3]\r\n    movu      [r2], m5\r\n    movu      [r2 + r3], m4\r\n%endif\r\n\r\n    sub       r0, r5\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n%macro FILTER_VER_LUMA_AVX2_8xN 3\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_%3_%1x%2, 4, 7, 8, 0-gprsize\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, 
r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n    lea             r6, [r1 * 4]\r\n%ifidn %3,pp\r\n    mova            m7, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m7, [pw_2000]\r\n%endif\r\n    mov             word [rsp], %2 / 8\r\n\r\n.loop:\r\n    PROCESS_LUMA_AVX2_W8_8R\r\n%ifidn %3,pp\r\n    pmulhrsw        m5, m7                          ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m7                          ; m2 = word: row 2, row 3\r\n    pmulhrsw        m1, m7                          ; m1 = word: row 4, row 5\r\n    pmulhrsw        m4, m7                          ; m4 = word: row 6, row 7\r\n    packuswb        m5, m2\r\n    packuswb        m1, m4\r\n    vextracti128    xm2, m5, 1\r\n    vextracti128    xm4, m1, 1\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    lea             r2, [r2 + r3 * 2]\r\n    movhps          [r2], xm5\r\n    movhps          [r2 + r3], xm2\r\n    lea             r2, [r2 + r3 * 2]\r\n    movq            [r2], xm1\r\n    movq            [r2 + r3], xm4\r\n    lea             r2, [r2 + r3 * 2]\r\n    movhps          [r2], xm1\r\n    movhps          [r2 + r3], xm4\r\n%else\r\n    psubw           m5, m7                          ; m5 = word: row 0, row 1\r\n    psubw           m2, m7                          ; m2 = word: row 2, row 3\r\n    psubw           m1, m7                          ; m1 = word: row 4, row 5\r\n    psubw           m4, m7                          ; m4 = word: row 6, row 7\r\n    vextracti128    xm6, m5, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm0, m1, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm6\r\n    lea             r2, [r2 + r3 * 2]\r\n    movu            [r2], xm2\r\n    movu            [r2 + r3], xm3\r\n    lea             r2, [r2 + r3 * 2]\r\n    movu            [r2], xm1\r\n    movu            [r2 
+ r3], xm0\r\n    lea             r2, [r2 + r3 * 2]\r\n    movu            [r2], xm4\r\n    vextracti128    xm4, m4, 1\r\n    movu            [r2 + r3], xm4\r\n%endif\r\n    lea             r2, [r2 + r3 * 2]\r\n    sub             r0, r6\r\n    dec             word [rsp]\r\n    jnz             .loop\r\n    RET\r\n%endmacro\r\n\r\n%macro FILTER_VER_LUMA_AVX2_8x8 1\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_%1_8x8, 4, 6, 7\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n    PROCESS_LUMA_AVX2_W8_8R\r\n%ifidn %1,pp\r\n    mova            m3, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m3, [pw_2000]\r\n%endif\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,pp\r\n    pmulhrsw        m5, m3                          ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m3                          ; m2 = word: row 2, row 3\r\n    pmulhrsw        m1, m3                          ; m1 = word: row 4, row 5\r\n    pmulhrsw        m4, m3                          ; m4 = word: row 6, row 7\r\n    packuswb        m5, m2\r\n    packuswb        m1, m4\r\n    vextracti128    xm2, m5, 1\r\n    vextracti128    xm4, m1, 1\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r4], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm1\r\n    movq            [r2 + r3], xm4\r\n    movhps          [r2 + r3 * 2], xm1\r\n    movhps          [r2 + r4], xm4\r\n%else\r\n    psubw           m5, m3                          ; m5 = word: row 0, row 1\r\n    psubw           m2, m3                          ; m2 = word: row 2, row 3\r\n    psubw           m1, m3                          ; m1 = word: row 4, row 
5\r\n    psubw           m4, m3                          ; m4 = word: row 6, row 7\r\n    vextracti128    xm6, m5, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm0, m1, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm6\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm1\r\n    movu            [r2 + r3], xm0\r\n    movu            [r2 + r3 * 2], xm4\r\n    vextracti128    xm4, m4, 1\r\n    movu            [r2 + r4], xm4\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n%macro FILTER_VER_LUMA_AVX2_8x4 1\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_%1_8x4, 4, 6, 7\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n    PROCESS_LUMA_AVX2_W8_4R\r\n%ifidn %1,pp\r\n    mova            m3, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m3, [pw_2000]\r\n%endif\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,pp\r\n    pmulhrsw        m5, m3                          ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m3                          ; m2 = word: row 2, row 3\r\n    packuswb        m5, m2\r\n    vextracti128    xm2, m5, 1\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r4], xm2\r\n%else\r\n    psubw           m5, m3                          ; m5 = word: row 0, row 1\r\n    psubw           m2, m3                          ; m2 = word: row 2, row 3\r\n    movu            [r2], xm5\r\n    vextracti128    xm5, m5, 1\r\n    movu            [r2 + r3], xm5\r\n    movu            [r2 + r3 * 2], xm2\r\n    vextracti128    xm2, m2, 1\r\n    movu            [r2 + r4], 
xm2\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_8xN 8, 4, pp\r\n    FILTER_VER_LUMA_AVX2_8x4 pp\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_pp_8x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_8xN 8, 8, pp\r\n    FILTER_VER_LUMA_AVX2_8x8 pp\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_pp_8x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_8xN 8, 16, pp\r\n    FILTER_VER_LUMA_AVX2_8xN 8, 16, pp\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_pp_8x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_8xN 8, 32, pp\r\n    FILTER_VER_LUMA_AVX2_8xN 8, 32, pp\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_ps_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int 
coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_8xN 8, 4, ps\r\n    FILTER_VER_LUMA_AVX2_8x4 ps\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_ps_8x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_8xN 8, 8, ps\r\n    FILTER_VER_LUMA_AVX2_8x8 ps\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_ps_8x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_8xN 8, 16, ps\r\n    FILTER_VER_LUMA_AVX2_8xN 8, 16, ps\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_ps_8x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_8xN 8, 32, ps\r\n    FILTER_VER_LUMA_AVX2_8xN 8, 32, ps\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_%3_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_LUMA_12xN 3\r\nINIT_XMM sse4\r\ncglobal interp_8tap_vert_%3_%1x%2, 5, 7, 8\r\n    lea       r5, [3 * r1]\r\n    sub       r0, r5\r\n    shl       r4d, 6\r\n%ifidn %3,ps\r\n   
 add       r3d, r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_LumaCoeffVer]\r\n    lea       r6, [r5 + r4]\r\n%else\r\n    lea       r6, [tab_LumaCoeffVer + r4]\r\n%endif\r\n\r\n %ifidn %3,pp\r\n    mova      m3, [pw_512]\r\n%else\r\n    mova      m3, [pw_2000]\r\n%endif\r\n\r\n    mov       r4d, %2/4\r\n\r\n.loopH:\r\n    PROCESS_LUMA_W8_4R\r\n\r\n%ifidn %3,pp\r\n    pmulhrsw  m7, m3\r\n    pmulhrsw  m6, m3\r\n    pmulhrsw  m5, m3\r\n    pmulhrsw  m4, m3\r\n\r\n    packuswb  m7, m6\r\n    packuswb  m5, m4\r\n\r\n    movlps    [r2], m7\r\n    movhps    [r2 + r3], m7\r\n    lea       r5, [r2 + 2 * r3]\r\n    movlps    [r5], m5\r\n    movhps    [r5 + r3], m5\r\n%else\r\n    psubw     m7, m3\r\n    psubw     m6, m3\r\n    psubw     m5, m3\r\n    psubw     m4, m3\r\n\r\n    movu      [r2], m7\r\n    movu      [r2 + r3], m6\r\n    lea       r5, [r2 + 2 * r3]\r\n    movu      [r5], m5\r\n    movu      [r5 + r3], m4\r\n%endif\r\n\r\n    lea       r5, [8 * r1 - 8]\r\n    sub       r0, r5\r\n%ifidn %3,pp\r\n    add       r2, 8\r\n%else\r\n    add       r2, 16\r\n%endif\r\n\r\n    PROCESS_LUMA_W4_4R\r\n\r\n%ifidn %3,pp\r\n    pmulhrsw  m4, m3\r\n    pmulhrsw  m5, m3\r\n\r\n    packuswb  m4, m5\r\n\r\n    movd      [r2], m4\r\n    pextrd    [r2 + r3], m4, 1\r\n    lea       r5, [r2 + 2 * r3]\r\n    pextrd    [r5], m4, 2\r\n    pextrd    [r5 + r3], m4, 3\r\n%else\r\n    psubw     m4, m3\r\n    psubw     m5, m3\r\n\r\n    movlps    [r2], m4\r\n    movhps    [r2 + r3], m4\r\n    lea       r5, [r2 + 2 * r3]\r\n    movlps    [r5], m5\r\n    movhps    [r5 + r3], m5\r\n%endif\r\n\r\n    lea       r5, [4 * r1 + 8]\r\n    sub       r0, r5\r\n%ifidn %3,pp\r\n    lea       r2, [r2 + 4 * r3 - 8]\r\n%else\r\n    lea       r2, [r2 + 4 * r3 - 16]\r\n%endif\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void 
interp_8tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_12xN 12, 16, pp\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_ps_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_12xN 12, 16, ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_12x16 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_12x16, 4, 7, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    
pmaddubsw       m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m6\r\n    pmaddubsw       m6, m4, [r5 + 1 * mmsize]\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m7\r\n    pmaddubsw       m7, m5, [r5 + 1 * mmsize]\r\n    paddw           m3, m7\r\n    pmaddubsw       m5, [r5]\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhbw       xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m8\r\n    pmaddubsw       m8, m6, [r5 + 2 * mmsize]\r\n    paddw           m2, m8\r\n    pmaddubsw       m8, m6, [r5 + 1 * mmsize]\r\n    paddw           m4, m8\r\n    pmaddubsw       m6, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m9\r\n    pmaddubsw       m9, m7, [r5 + 2 * mmsize]\r\n    paddw           m3, m9\r\n    pmaddubsw       m9, m7, [r5 + 1 * mmsize]\r\n    paddw           m5, m9\r\n    pmaddubsw     
  m7, [r5]\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, [r5 + 3 * mmsize]\r\n    paddw           m2, m10\r\n    pmaddubsw       m10, m8, [r5 + 2 * mmsize]\r\n    paddw           m4, m10\r\n    pmaddubsw       m10, m8, [r5 + 1 * mmsize]\r\n    paddw           m6, m10\r\n    pmaddubsw       m8, [r5]\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, [r5 + 3 * mmsize]\r\n    paddw           m3, m11\r\n    pmaddubsw       m11, m9, [r5 + 2 * mmsize]\r\n    paddw           m5, m11\r\n    pmaddubsw       m11, m9, [r5 + 1 * mmsize]\r\n    paddw           m7, m11\r\n    pmaddubsw       m9, [r5]\r\n    movu            xm11, [r0 + r4]                 ; m11 = row 11\r\n    punpckhbw       xm12, xm10, xm11\r\n    punpcklbw       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    pmaddubsw       m12, m10, [r5 + 3 * mmsize]\r\n    paddw           m4, m12\r\n    pmaddubsw       m12, m10, [r5 + 2 * mmsize]\r\n    paddw           m6, m12\r\n    pmaddubsw       m12, m10, [r5 + 1 * mmsize]\r\n    paddw           m8, m12\r\n    pmaddubsw       m10, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm12, [r0]                      ; m12 = row 12\r\n    punpckhbw       xm13, xm11, xm12\r\n    punpcklbw       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddubsw       m13, m11, [r5 + 3 * mmsize]\r\n    paddw           m5, m13\r\n    pmaddubsw       m13, m11, [r5 + 2 * mmsize]\r\n    paddw           m7, m13\r\n    pmaddubsw       m13, m11, [r5 + 1 * mmsize]\r\n    paddw           m9, m13\r\n    pmaddubsw       m11, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 0\r\n    pmulhrsw   
     m1, m14                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m14                         ; m3 = word: row 3\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 4\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 5\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    packuswb        m4, m5\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    movq            [r2], xm0\r\n    pextrd          [r2 + 8], xm0, 2\r\n    movq            [r2 + r3], xm1\r\n    pextrd          [r2 + r3 + 8], xm1, 2\r\n    movq            [r2 + r3 * 2], xm2\r\n    pextrd          [r2 + r3 * 2 + 8], xm2, 2\r\n    movq            [r2 + r6], xm3\r\n    pextrd          [r2 + r6 + 8], xm3, 2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm4\r\n    pextrd          [r2 + 8], xm4, 2\r\n    movq            [r2 + r3], xm5\r\n    pextrd          [r2 + r3 + 8], xm5, 2\r\n%else\r\n    psubw           m0, m14                         ; m0 = word: row 0\r\n    psubw           m1, m14                         ; m1 = word: row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2\r\n    psubw           m3, m14                         ; m3 = word: row 3\r\n    psubw           m4, m14                         ; m4 = word: row 4\r\n    psubw           m5, m14                         ; m5 = word: row 5\r\n    movu            [r2], xm0\r\n    vextracti128    xm0, m0, 1\r\n    movq            [r2 + 16], xm0\r\n    movu            [r2 + r3], xm1\r\n    vextracti128    xm1, m1, 1\r\n    movq            [r2 + r3 + 16], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    vextracti128    xm2, m2, 1\r\n    movq            [r2 + r3 * 2 + 16], xm2\r\n    
movu            [r2 + r6], xm3\r\n    vextracti128    xm3, m3, 1\r\n    movq            [r2 + r6 + 16], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm4\r\n    vextracti128    xm4, m4, 1\r\n    movq            [r2 + 16], xm4\r\n    movu            [r2 + r3], xm5\r\n    vextracti128    xm5, m5, 1\r\n    movq            [r2 + r3 + 16], xm5\r\n%endif\r\n\r\n    movu            xm13, [r0 + r1]                 ; m13 = row 13\r\n    punpckhbw       xm0, xm12, xm13\r\n    punpcklbw       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddubsw       m0, m12, [r5 + 3 * mmsize]\r\n    paddw           m6, m0\r\n    pmaddubsw       m0, m12, [r5 + 2 * mmsize]\r\n    paddw           m8, m0\r\n    pmaddubsw       m0, m12, [r5 + 1 * mmsize]\r\n    paddw           m10, m0\r\n    pmaddubsw       m12, [r5]\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14\r\n    punpckhbw       xm1, xm13, xm0\r\n    punpcklbw       xm13, xm0\r\n    vinserti128     m13, m13, xm1, 1\r\n    pmaddubsw       m1, m13, [r5 + 3 * mmsize]\r\n    paddw           m7, m1\r\n    pmaddubsw       m1, m13, [r5 + 2 * mmsize]\r\n    paddw           m9, m1\r\n    pmaddubsw       m1, m13, [r5 + 1 * mmsize]\r\n    paddw           m11, m1\r\n    pmaddubsw       m13, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 6\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 7\r\n    packuswb        m6, m7\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m6, 1\r\n    movq            [r2 + r3 * 2], xm6\r\n    pextrd          [r2 + r3 * 2 + 8], xm6, 2\r\n    movq            [r2 + r6], xm7\r\n    pextrd          [r2 + r6 + 8], xm7, 2\r\n%else\r\n    psubw           m6, m14                         ; m6 = word: row 6\r\n    psubw           m7, m14                         ; m7 = word: row 7\r\n    movu            [r2 + r3 * 2], xm6\r\n    vextracti128    xm6, m6, 1\r\n    movq            
[r2 + r3 * 2 + 16], xm6\r\n    movu            [r2 + r6], xm7\r\n    vextracti128    xm7, m7, 1\r\n    movq            [r2 + r6 + 16], xm7\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 15\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m2, m0, [r5 + 3 * mmsize]\r\n    paddw           m8, m2\r\n    pmaddubsw       m2, m0, [r5 + 2 * mmsize]\r\n    paddw           m10, m2\r\n    pmaddubsw       m2, m0, [r5 + 1 * mmsize]\r\n    paddw           m12, m2\r\n    pmaddubsw       m0, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm2, [r0]                       ; m2 = row 16\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m3, m1, [r5 + 3 * mmsize]\r\n    paddw           m9, m3\r\n    pmaddubsw       m3, m1, [r5 + 2 * mmsize]\r\n    paddw           m11, m3\r\n    pmaddubsw       m3, m1, [r5 + 1 * mmsize]\r\n    paddw           m13, m3\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 17\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 3 * mmsize]\r\n    paddw           m10, m4\r\n    pmaddubsw       m4, m2, [r5 + 2 * mmsize]\r\n    paddw           m12, m4\r\n    pmaddubsw       m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m2\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 3 * mmsize]\r\n    paddw           m11, m5\r\n    pmaddubsw       m5, m3, [r5 + 2 * mmsize]\r\n    paddw           m13, m5\r\n    pmaddubsw       m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n    movu            xm5, [r0 + r4]            
      ; m5 = row 19\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 3 * mmsize]\r\n    paddw           m12, m6\r\n    pmaddubsw       m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm6, [r0]                       ; m6 = row 20\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 3 * mmsize]\r\n    paddw           m13, m7\r\n    pmaddubsw       m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m5\r\n    movu            xm7, [r0 + r1]                  ; m7 = row 21\r\n    punpckhbw       xm2, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm2, 1\r\n    pmaddubsw       m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m6\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 22\r\n    punpckhbw       xm3, xm7, xm2\r\n    punpcklbw       xm7, xm2\r\n    vinserti128     m7, m7, xm3, 1\r\n    pmaddubsw       m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m7\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m8, m14                         ; m8 = word: row 8\r\n    pmulhrsw        m9, m14                         ; m9 = word: row 9\r\n    pmulhrsw        m10, m14                        ; m10 = word: row 10\r\n    pmulhrsw        m11, m14                        ; m11 = word: row 11\r\n    pmulhrsw        m12, m14                        ; m12 = word: row 12\r\n    pmulhrsw        m13, m14                        ; m13 = word: row 13\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 14\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 15\r\n    packuswb        m8, m9\r\n    packuswb        m10, m11\r\n    packuswb        m12, m13\r\n    packuswb        m0, m1\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m10, m10, 11011000b\r\n    vpermq 
         m12, m12, 11011000b\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    vextracti128    xm13, m12, 1\r\n    vextracti128    xm1, m0, 1\r\n    movq            [r2], xm8\r\n    pextrd          [r2 + 8], xm8, 2\r\n    movq            [r2 + r3], xm9\r\n    pextrd          [r2 + r3 + 8], xm9, 2\r\n    movq            [r2 + r3 * 2], xm10\r\n    pextrd          [r2 + r3 * 2 + 8], xm10, 2\r\n    movq            [r2 + r6], xm11\r\n    pextrd          [r2 + r6 + 8], xm11, 2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm12\r\n    pextrd          [r2 + 8], xm12, 2\r\n    movq            [r2 + r3], xm13\r\n    pextrd          [r2 + r3 + 8], xm13, 2\r\n    movq            [r2 + r3 * 2], xm0\r\n    pextrd          [r2 + r3 * 2 + 8], xm0, 2\r\n    movq            [r2 + r6], xm1\r\n    pextrd          [r2 + r6 + 8], xm1, 2\r\n%else\r\n    psubw           m8, m14                         ; m8 = word: row 8\r\n    psubw           m9, m14                         ; m9 = word: row 9\r\n    psubw           m10, m14                        ; m10 = word: row 10\r\n    psubw           m11, m14                        ; m11 = word: row 11\r\n    psubw           m12, m14                        ; m12 = word: row 12\r\n    psubw           m13, m14                        ; m13 = word: row 13\r\n    psubw           m0, m14                         ; m0 = word: row 14\r\n    psubw           m1, m14                         ; m1 = word: row 15\r\n    movu            [r2], xm8\r\n    vextracti128    xm8, m8, 1\r\n    movq            [r2 + 16], xm8\r\n    movu            [r2 + r3], xm9\r\n    vextracti128    xm9, m9, 1\r\n    movq            [r2 + r3 + 16], xm9\r\n    movu            [r2 + r3 * 2], xm10\r\n    vextracti128    xm10, m10, 1\r\n    movq            [r2 + r3 * 2 + 16], xm10\r\n    movu            [r2 + r6], xm11\r\n    vextracti128    xm11, m11, 1\r\n    movq            [r2 + r6 + 16], xm11\r\n 
   lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm12\r\n    vextracti128    xm12, m12, 1\r\n    movq            [r2 + 16], xm12\r\n    movu            [r2 + r3], xm13\r\n    vextracti128    xm13, m13, 1\r\n    movq            [r2 + r3 + 16], xm13\r\n    movu            [r2 + r3 * 2], xm0\r\n    vextracti128    xm0, m0, 1\r\n    movq            [r2 + r3 * 2 + 16], xm0\r\n    movu            [r2 + r6], xm1\r\n    vextracti128    xm1, m1, 1\r\n    movq            [r2 + r6 + 16], xm1\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_12x16 pp\r\n    FILTER_VER_LUMA_AVX2_12x16 ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_16x16 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_16x16, 4, 7, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       
m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m6\r\n    pmaddubsw       m6, m4, [r5 + 1 * mmsize]\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m7\r\n    pmaddubsw       m7, m5, [r5 + 1 * mmsize]\r\n    paddw           m3, m7\r\n    pmaddubsw       m5, [r5]\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhbw       xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m8\r\n    pmaddubsw       m8, m6, [r5 + 2 * mmsize]\r\n    paddw           m2, m8\r\n    pmaddubsw       m8, m6, [r5 + 1 * mmsize]\r\n    paddw           m4, m8\r\n    pmaddubsw       m6, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m9\r\n    pmaddubsw       m9, m7, [r5 + 2 * mmsize]\r\n    paddw           m3, m9\r\n    pmaddubsw       m9, m7, [r5 + 1 * mmsize]\r\n    paddw           m5, m9\r\n    pmaddubsw       m7, [r5]\r\n  
  movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, [r5 + 3 * mmsize]\r\n    paddw           m2, m10\r\n    pmaddubsw       m10, m8, [r5 + 2 * mmsize]\r\n    paddw           m4, m10\r\n    pmaddubsw       m10, m8, [r5 + 1 * mmsize]\r\n    paddw           m6, m10\r\n    pmaddubsw       m8, [r5]\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, [r5 + 3 * mmsize]\r\n    paddw           m3, m11\r\n    pmaddubsw       m11, m9, [r5 + 2 * mmsize]\r\n    paddw           m5, m11\r\n    pmaddubsw       m11, m9, [r5 + 1 * mmsize]\r\n    paddw           m7, m11\r\n    pmaddubsw       m9, [r5]\r\n    movu            xm11, [r0 + r4]                 ; m11 = row 11\r\n    punpckhbw       xm12, xm10, xm11\r\n    punpcklbw       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    pmaddubsw       m12, m10, [r5 + 3 * mmsize]\r\n    paddw           m4, m12\r\n    pmaddubsw       m12, m10, [r5 + 2 * mmsize]\r\n    paddw           m6, m12\r\n    pmaddubsw       m12, m10, [r5 + 1 * mmsize]\r\n    paddw           m8, m12\r\n    pmaddubsw       m10, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm12, [r0]                      ; m12 = row 12\r\n    punpckhbw       xm13, xm11, xm12\r\n    punpcklbw       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddubsw       m13, m11, [r5 + 3 * mmsize]\r\n    paddw           m5, m13\r\n    pmaddubsw       m13, m11, [r5 + 2 * mmsize]\r\n    paddw           m7, m13\r\n    pmaddubsw       m13, m11, [r5 + 1 * mmsize]\r\n    paddw           m9, m13\r\n    pmaddubsw       m11, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 0\r\n    pmulhrsw        m1, m14    
                     ; m1 = word: row 1\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m14                         ; m3 = word: row 3\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 4\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 5\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    packuswb        m4, m5\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm5\r\n%else\r\n    psubw           m0, m14                         ; m0 = word: row 0\r\n    psubw           m1, m14                         ; m1 = word: row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2\r\n    psubw           m3, m14                         ; m3 = word: row 3\r\n    psubw           m4, m14                         ; m4 = word: row 4\r\n    psubw           m5, m14                         ; m5 = word: row 5\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r6], m3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m5\r\n%endif\r\n\r\n    movu            xm13, [r0 + r1]                 ; m13 = row 13\r\n    punpckhbw       xm0, xm12, xm13\r\n    punpcklbw       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddubsw       m0, m12, [r5 + 3 * mmsize]\r\n    paddw           m6, m0\r\n    pmaddubsw       m0, m12, [r5 + 2 * mmsize]\r\n    paddw           m8, m0\r\n    pmaddubsw   
    m0, m12, [r5 + 1 * mmsize]\r\n    paddw           m10, m0\r\n    pmaddubsw       m12, [r5]\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14\r\n    punpckhbw       xm1, xm13, xm0\r\n    punpcklbw       xm13, xm0\r\n    vinserti128     m13, m13, xm1, 1\r\n    pmaddubsw       m1, m13, [r5 + 3 * mmsize]\r\n    paddw           m7, m1\r\n    pmaddubsw       m1, m13, [r5 + 2 * mmsize]\r\n    paddw           m9, m1\r\n    pmaddubsw       m1, m13, [r5 + 1 * mmsize]\r\n    paddw           m11, m1\r\n    pmaddubsw       m13, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 6\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 7\r\n    packuswb        m6, m7\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r6], xm7\r\n%else\r\n    psubw           m6, m14                         ; m6 = word: row 6\r\n    psubw           m7, m14                         ; m7 = word: row 7\r\n    movu            [r2 + r3 * 2], m6\r\n    movu            [r2 + r6], m7\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 15\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m2, m0, [r5 + 3 * mmsize]\r\n    paddw           m8, m2\r\n    pmaddubsw       m2, m0, [r5 + 2 * mmsize]\r\n    paddw           m10, m2\r\n    pmaddubsw       m2, m0, [r5 + 1 * mmsize]\r\n    paddw           m12, m2\r\n    pmaddubsw       m0, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm2, [r0]                       ; m2 = row 16\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m3, m1, [r5 + 3 * mmsize]\r\n    paddw           m9, m3\r\n    pmaddubsw       m3, m1, [r5 + 2 * mmsize]\r\n    paddw    
       m11, m3\r\n    pmaddubsw       m3, m1, [r5 + 1 * mmsize]\r\n    paddw           m13, m3\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 17\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 3 * mmsize]\r\n    paddw           m10, m4\r\n    pmaddubsw       m4, m2, [r5 + 2 * mmsize]\r\n    paddw           m12, m4\r\n    pmaddubsw       m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m2\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 3 * mmsize]\r\n    paddw           m11, m5\r\n    pmaddubsw       m5, m3, [r5 + 2 * mmsize]\r\n    paddw           m13, m5\r\n    pmaddubsw       m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n    movu            xm5, [r0 + r4]                  ; m5 = row 19\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 3 * mmsize]\r\n    paddw           m12, m6\r\n    pmaddubsw       m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm6, [r0]                       ; m6 = row 20\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 3 * mmsize]\r\n    paddw           m13, m7\r\n    pmaddubsw       m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m5\r\n    movu            xm7, [r0 + r1]                  ; m7 = row 21\r\n    punpckhbw       xm2, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm2, 1\r\n    pmaddubsw       m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m6\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 22\r\n    punpckhbw       xm3, 
xm7, xm2\r\n    punpcklbw       xm7, xm2\r\n    vinserti128     m7, m7, xm3, 1\r\n    pmaddubsw       m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m7\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m8, m14                         ; m8 = word: row 8\r\n    pmulhrsw        m9, m14                         ; m9 = word: row 9\r\n    pmulhrsw        m10, m14                        ; m10 = word: row 10\r\n    pmulhrsw        m11, m14                        ; m11 = word: row 11\r\n    pmulhrsw        m12, m14                        ; m12 = word: row 12\r\n    pmulhrsw        m13, m14                        ; m13 = word: row 13\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 14\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 15\r\n    packuswb        m8, m9\r\n    packuswb        m10, m11\r\n    packuswb        m12, m13\r\n    packuswb        m0, m1\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m10, m10, 11011000b\r\n    vpermq          m12, m12, 11011000b\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    vextracti128    xm13, m12, 1\r\n    vextracti128    xm1, m0, 1\r\n    movu            [r2], xm8\r\n    movu            [r2 + r3], xm9\r\n    movu            [r2 + r3 * 2], xm10\r\n    movu            [r2 + r6], xm11\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm12\r\n    movu            [r2 + r3], xm13\r\n    movu            [r2 + r3 * 2], xm0\r\n    movu            [r2 + r6], xm1\r\n%else\r\n    psubw           m8, m14                         ; m8 = word: row 8\r\n    psubw           m9, m14                         ; m9 = word: row 9\r\n    psubw           m10, m14                        ; m10 = word: row 10\r\n    psubw           m11, m14                        ; m11 = word: row 11\r\n    psubw           m12, m14                        ; m12 = word: row 12\r\n    psubw           m13, m14                        ; 
m13 = word: row 13\r\n    psubw           m0, m14                         ; m0 = word: row 14\r\n    psubw           m1, m14                         ; m1 = word: row 15\r\n    movu            [r2], m8\r\n    movu            [r2 + r3], m9\r\n    movu            [r2 + r3 * 2], m10\r\n    movu            [r2 + r6], m11\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m12\r\n    movu            [r2 + r3], m13\r\n    movu            [r2 + r3 * 2], m0\r\n    movu            [r2 + r6], m1\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_16x16 pp\r\n    FILTER_VER_LUMA_AVX2_16x16 ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_16x12 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_16x12, 4, 7, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw   
    m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m6\r\n    pmaddubsw       m6, m4, [r5 + 1 * mmsize]\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m7\r\n    pmaddubsw       m7, m5, [r5 + 1 * mmsize]\r\n    paddw           m3, m7\r\n    pmaddubsw       m5, [r5]\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhbw       xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m8\r\n    pmaddubsw       m8, m6, [r5 + 2 * mmsize]\r\n    paddw           m2, m8\r\n    pmaddubsw       m8, m6, [r5 + 1 * mmsize]\r\n    paddw           m4, m8\r\n    pmaddubsw       m6, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m9\r\n    pmaddubsw       m9, m7, [r5 + 2 * mmsize]\r\n    paddw           m3, m9\r\n    pmaddubsw       m9, m7, [r5 + 1 * mmsize]\r\n    paddw           m5, m9\r\n    pmaddubsw       m7, 
[r5]\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, [r5 + 3 * mmsize]\r\n    paddw           m2, m10\r\n    pmaddubsw       m10, m8, [r5 + 2 * mmsize]\r\n    paddw           m4, m10\r\n    pmaddubsw       m10, m8, [r5 + 1 * mmsize]\r\n    paddw           m6, m10\r\n    pmaddubsw       m8, [r5]\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, [r5 + 3 * mmsize]\r\n    paddw           m3, m11\r\n    pmaddubsw       m11, m9, [r5 + 2 * mmsize]\r\n    paddw           m5, m11\r\n    pmaddubsw       m11, m9, [r5 + 1 * mmsize]\r\n    paddw           m7, m11\r\n    pmaddubsw       m9, [r5]\r\n    movu            xm11, [r0 + r4]                 ; m11 = row 11\r\n    punpckhbw       xm12, xm10, xm11\r\n    punpcklbw       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    pmaddubsw       m12, m10, [r5 + 3 * mmsize]\r\n    paddw           m4, m12\r\n    pmaddubsw       m12, m10, [r5 + 2 * mmsize]\r\n    paddw           m6, m12\r\n    pmaddubsw       m12, m10, [r5 + 1 * mmsize]\r\n    paddw           m8, m12\r\n    pmaddubsw       m10, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm12, [r0]                      ; m12 = row 12\r\n    punpckhbw       xm13, xm11, xm12\r\n    punpcklbw       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddubsw       m13, m11, [r5 + 3 * mmsize]\r\n    paddw           m5, m13\r\n    pmaddubsw       m13, m11, [r5 + 2 * mmsize]\r\n    paddw           m7, m13\r\n    pmaddubsw       m13, m11, [r5 + 1 * mmsize]\r\n    paddw           m9, m13\r\n    pmaddubsw       m11, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 0\r\n    pmulhrsw        
m1, m14                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m14                         ; m3 = word: row 3\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 4\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 5\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    packuswb        m4, m5\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm5\r\n%else\r\n    psubw           m0, m14                         ; m0 = word: row 0\r\n    psubw           m1, m14                         ; m1 = word: row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2\r\n    psubw           m3, m14                         ; m3 = word: row 3\r\n    psubw           m4, m14                         ; m4 = word: row 4\r\n    psubw           m5, m14                         ; m5 = word: row 5\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r6], m3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m5\r\n%endif\r\n\r\n    movu            xm13, [r0 + r1]                 ; m13 = row 13\r\n    punpckhbw       xm0, xm12, xm13\r\n    punpcklbw       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddubsw       m0, m12, [r5 + 3 * mmsize]\r\n    paddw           m6, m0\r\n    pmaddubsw       m0, m12, [r5 + 2 * mmsize]\r\n    paddw           m8, m0\r\n    
pmaddubsw       m0, m12, [r5 + 1 * mmsize]\r\n    paddw           m10, m0\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14\r\n    punpckhbw       xm1, xm13, xm0\r\n    punpcklbw       xm13, xm0\r\n    vinserti128     m13, m13, xm1, 1\r\n    pmaddubsw       m1, m13, [r5 + 3 * mmsize]\r\n    paddw           m7, m1\r\n    pmaddubsw       m1, m13, [r5 + 2 * mmsize]\r\n    paddw           m9, m1\r\n    pmaddubsw       m1, m13, [r5 + 1 * mmsize]\r\n    paddw           m11, m1\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 6\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 7\r\n    packuswb        m6, m7\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r6], xm7\r\n%else\r\n    psubw           m6, m14                         ; m6 = word: row 6\r\n    psubw           m7, m14                         ; m7 = word: row 7\r\n    movu            [r2 + r3 * 2], m6\r\n    movu            [r2 + r6], m7\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 15\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m2, m0, [r5 + 3 * mmsize]\r\n    paddw           m8, m2\r\n    pmaddubsw       m2, m0, [r5 + 2 * mmsize]\r\n    paddw           m10, m2\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm2, [r0]                       ; m2 = row 16\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m3, m1, [r5 + 3 * mmsize]\r\n    paddw           m9, m3\r\n    pmaddubsw       m3, m1, [r5 + 2 * mmsize]\r\n    paddw           m11, m3\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 17\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    
vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 3 * mmsize]\r\n    paddw           m10, m4\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 3 * mmsize]\r\n    paddw           m11, m5\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m8, m14                         ; m8 = word: row 8\r\n    pmulhrsw        m9, m14                         ; m9 = word: row 9\r\n    pmulhrsw        m10, m14                        ; m10 = word: row 10\r\n    pmulhrsw        m11, m14                        ; m11 = word: row 11\r\n    packuswb        m8, m9\r\n    packuswb        m10, m11\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m10, m10, 11011000b\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    movu            [r2], xm8\r\n    movu            [r2 + r3], xm9\r\n    movu            [r2 + r3 * 2], xm10\r\n    movu            [r2 + r6], xm11\r\n%else\r\n    psubw           m8, m14                         ; m8 = word: row 8\r\n    psubw           m9, m14                         ; m9 = word: row 9\r\n    psubw           m10, m14                        ; m10 = word: row 10\r\n    psubw           m11, m14                        ; m11 = word: row 11\r\n    movu            [r2], m8\r\n    movu            [r2 + r3], m9\r\n    movu            [r2 + r3 * 2], m10\r\n    movu            [r2 + r6], m11\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_16x12 pp\r\n    FILTER_VER_LUMA_AVX2_16x12 ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_16x8 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_16x8, 4, 6, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n    
lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m6\r\n    pmaddubsw       m6, m4, [r5 + 1 * mmsize]\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m7\r\n    pmaddubsw       m7, m5, [r5 + 1 * mmsize]\r\n    paddw           m3, m7\r\n   
 pmaddubsw       m5, [r5]\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhbw       xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m8\r\n    pmaddubsw       m8, m6, [r5 + 2 * mmsize]\r\n    paddw           m2, m8\r\n    pmaddubsw       m8, m6, [r5 + 1 * mmsize]\r\n    paddw           m4, m8\r\n    pmaddubsw       m6, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m9\r\n    pmaddubsw       m9, m7, [r5 + 2 * mmsize]\r\n    paddw           m3, m9\r\n    pmaddubsw       m9, m7, [r5 + 1 * mmsize]\r\n    paddw           m5, m9\r\n    pmaddubsw       m7, [r5]\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, [r5 + 3 * mmsize]\r\n    paddw           m2, m10\r\n    pmaddubsw       m10, m8, [r5 + 2 * mmsize]\r\n    paddw           m4, m10\r\n    pmaddubsw       m10, m8, [r5 + 1 * mmsize]\r\n    paddw           m6, m10\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, [r5 + 3 * mmsize]\r\n    paddw           m3, m11\r\n    pmaddubsw       m11, m9, [r5 + 2 * mmsize]\r\n    paddw           m5, m11\r\n    pmaddubsw       m11, m9, [r5 + 1 * mmsize]\r\n    paddw           m7, m11\r\n    movu            xm11, [r0 + r4]                 ; m11 = row 11\r\n    punpckhbw       xm12, xm10, xm11\r\n    punpcklbw       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    
pmaddubsw       m12, m10, [r5 + 3 * mmsize]\r\n    paddw           m4, m12\r\n    pmaddubsw       m12, m10, [r5 + 2 * mmsize]\r\n    paddw           m6, m12\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm12, [r0]                      ; m12 = row 12\r\n    punpckhbw       xm13, xm11, xm12\r\n    punpcklbw       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddubsw       m13, m11, [r5 + 3 * mmsize]\r\n    paddw           m5, m13\r\n    pmaddubsw       m13, m11, [r5 + 2 * mmsize]\r\n    paddw           m7, m13\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 0\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m14                         ; m3 = word: row 3\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 4\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 5\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    packuswb        m4, m5\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm5\r\n%else\r\n    psubw           m0, m14                         ; m0 = word: row 0\r\n    psubw           m1, m14                         ; m1 = word: row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2\r\n    psubw           m3, m14                         ; m3 = word: row 3\r\n    psubw           m4, m14                         ; m4 = word: row 
4\r\n    psubw           m5, m14                         ; m5 = word: row 5\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r4], m3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m5\r\n%endif\r\n    movu            xm13, [r0 + r1]                 ; m13 = row 13\r\n    punpckhbw       xm0, xm12, xm13\r\n    punpcklbw       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddubsw       m0, m12, [r5 + 3 * mmsize]\r\n    paddw           m6, m0\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14\r\n    punpckhbw       xm1, xm13, xm0\r\n    punpcklbw       xm13, xm0\r\n    vinserti128     m13, m13, xm1, 1\r\n    pmaddubsw       m1, m13, [r5 + 3 * mmsize]\r\n    paddw           m7, m1\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 6\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 7\r\n    packuswb        m6, m7\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r4], xm7\r\n%else\r\n    psubw           m6, m14                         ; m6 = word: row 6\r\n    psubw           m7, m14                         ; m7 = word: row 7\r\n    movu            [r2 + r3 * 2], m6\r\n    movu            [r2 + r4], m7\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_16x8 pp\r\n    FILTER_VER_LUMA_AVX2_16x8 ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_16x4 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_16x4, 4, 6, 13\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, 
r4\r\n%ifidn %1,pp\r\n    mova            m12, [pw_512]\r\n%else\r\n    add             r3d, r3d\r\n    vbroadcasti128  m12, [pw_2000]\r\n%endif\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m6\r\n    pmaddubsw       m6, m4, [r5 + 1 * mmsize]\r\n    paddw           m2, m6\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m7\r\n    pmaddubsw       m7, m5, [r5 + 1 * mmsize]\r\n    paddw           m3, m7\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhbw       
xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m8\r\n    pmaddubsw       m8, m6, [r5 + 2 * mmsize]\r\n    paddw           m2, m8\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m9\r\n    pmaddubsw       m9, m7, [r5 + 2 * mmsize]\r\n    paddw           m3, m9\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, [r5 + 3 * mmsize]\r\n    paddw           m2, m10\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, [r5 + 3 * mmsize]\r\n    paddw           m3, m11\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m12                         ; m0 = word: row 0\r\n    pmulhrsw        m1, m12                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m12                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m12                         ; m3 = word: row 3\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    lea             r4, [r3 * 3]\r\n    movu            [r2 + r4], xm3\r\n%else\r\n    psubw           m0, m12                         ; m0 = word: row 0\r\n    psubw           m1, m12                         ; m1 = word: row 
1\r\n    psubw           m2, m12                         ; m2 = word: row 2\r\n    psubw           m3, m12                         ; m3 = word: row 3\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    lea             r4, [r3 * 3]\r\n    movu            [r2 + r4], m3\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_16x4 pp\r\n    FILTER_VER_LUMA_AVX2_16x4 ps\r\n%macro FILTER_VER_LUMA_AVX2_16xN 3\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%3_%1x%2, 4, 9, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %3,ps\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%else\r\n    mova            m14, [pw_512]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    lea             r7, [r1 * 4]\r\n    mov             r8d, %2 / 16\r\n\r\n.loop:\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]              
         ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m6\r\n    pmaddubsw       m6, m4, [r5 + 1 * mmsize]\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m7\r\n    pmaddubsw       m7, m5, [r5 + 1 * mmsize]\r\n    paddw           m3, m7\r\n    pmaddubsw       m5, [r5]\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhbw       xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m8\r\n    pmaddubsw       m8, m6, [r5 + 2 * mmsize]\r\n    paddw           m2, m8\r\n    pmaddubsw       m8, m6, [r5 + 1 * mmsize]\r\n    paddw           m4, m8\r\n    pmaddubsw       m6, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m9\r\n    pmaddubsw       m9, m7, [r5 + 2 * mmsize]\r\n    paddw           m3, m9\r\n    pmaddubsw       m9, m7, [r5 + 1 * mmsize]\r\n    paddw           m5, m9\r\n    pmaddubsw       m7, [r5]\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, 
xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, [r5 + 3 * mmsize]\r\n    paddw           m2, m10\r\n    pmaddubsw       m10, m8, [r5 + 2 * mmsize]\r\n    paddw           m4, m10\r\n    pmaddubsw       m10, m8, [r5 + 1 * mmsize]\r\n    paddw           m6, m10\r\n    pmaddubsw       m8, [r5]\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, [r5 + 3 * mmsize]\r\n    paddw           m3, m11\r\n    pmaddubsw       m11, m9, [r5 + 2 * mmsize]\r\n    paddw           m5, m11\r\n    pmaddubsw       m11, m9, [r5 + 1 * mmsize]\r\n    paddw           m7, m11\r\n    pmaddubsw       m9, [r5]\r\n    movu            xm11, [r0 + r4]                 ; m11 = row 11\r\n    punpckhbw       xm12, xm10, xm11\r\n    punpcklbw       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    pmaddubsw       m12, m10, [r5 + 3 * mmsize]\r\n    paddw           m4, m12\r\n    pmaddubsw       m12, m10, [r5 + 2 * mmsize]\r\n    paddw           m6, m12\r\n    pmaddubsw       m12, m10, [r5 + 1 * mmsize]\r\n    paddw           m8, m12\r\n    pmaddubsw       m10, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm12, [r0]                      ; m12 = row 12\r\n    punpckhbw       xm13, xm11, xm12\r\n    punpcklbw       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddubsw       m13, m11, [r5 + 3 * mmsize]\r\n    paddw           m5, m13\r\n    pmaddubsw       m13, m11, [r5 + 2 * mmsize]\r\n    paddw           m7, m13\r\n    pmaddubsw       m13, m11, [r5 + 1 * mmsize]\r\n    paddw           m9, m13\r\n    pmaddubsw       m11, [r5]\r\n\r\n%ifidn %3,pp\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 0\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m14                      
   ; m2 = word: row 2\r\n    pmulhrsw        m3, m14                         ; m3 = word: row 3\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 4\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 5\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    packuswb        m4, m5\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm5\r\n%else\r\n    psubw           m0, m14                         ; m0 = word: row 0\r\n    psubw           m1, m14                         ; m1 = word: row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2\r\n    psubw           m3, m14                         ; m3 = word: row 3\r\n    psubw           m4, m14                         ; m4 = word: row 4\r\n    psubw           m5, m14                         ; m5 = word: row 5\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r6], m3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m4\r\n    movu            [r2 + r3], m5\r\n%endif\r\n\r\n    movu            xm13, [r0 + r1]                 ; m13 = row 13\r\n    punpckhbw       xm0, xm12, xm13\r\n    punpcklbw       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddubsw       m0, m12, [r5 + 3 * mmsize]\r\n    paddw           m6, m0\r\n    pmaddubsw       m0, m12, [r5 + 2 * mmsize]\r\n    paddw           m8, m0\r\n    pmaddubsw       m0, m12, [r5 + 1 * mmsize]\r\n    paddw           m10, m0\r\n    pmaddubsw       m12, 
[r5]\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14\r\n    punpckhbw       xm1, xm13, xm0\r\n    punpcklbw       xm13, xm0\r\n    vinserti128     m13, m13, xm1, 1\r\n    pmaddubsw       m1, m13, [r5 + 3 * mmsize]\r\n    paddw           m7, m1\r\n    pmaddubsw       m1, m13, [r5 + 2 * mmsize]\r\n    paddw           m9, m1\r\n    pmaddubsw       m1, m13, [r5 + 1 * mmsize]\r\n    paddw           m11, m1\r\n    pmaddubsw       m13, [r5]\r\n\r\n%ifidn %3,pp\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 6\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 7\r\n    packuswb        m6, m7\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r6], xm7\r\n%else\r\n    psubw           m6, m14                         ; m6 = word: row 6\r\n    psubw           m7, m14                         ; m7 = word: row 7\r\n    movu            [r2 + r3 * 2], m6\r\n    movu            [r2 + r6], m7\r\n%endif\r\n\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 15\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m2, m0, [r5 + 3 * mmsize]\r\n    paddw           m8, m2\r\n    pmaddubsw       m2, m0, [r5 + 2 * mmsize]\r\n    paddw           m10, m2\r\n    pmaddubsw       m2, m0, [r5 + 1 * mmsize]\r\n    paddw           m12, m2\r\n    pmaddubsw       m0, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm2, [r0]                       ; m2 = row 16\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m3, m1, [r5 + 3 * mmsize]\r\n    paddw           m9, m3\r\n    pmaddubsw       m3, m1, [r5 + 2 * mmsize]\r\n    paddw           m11, m3\r\n    pmaddubsw       m3, m1, [r5 + 1 * mmsize]\r\n    paddw          
 m13, m3\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 17\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 3 * mmsize]\r\n    paddw           m10, m4\r\n    pmaddubsw       m4, m2, [r5 + 2 * mmsize]\r\n    paddw           m12, m4\r\n    pmaddubsw       m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m2\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 3 * mmsize]\r\n    paddw           m11, m5\r\n    pmaddubsw       m5, m3, [r5 + 2 * mmsize]\r\n    paddw           m13, m5\r\n    pmaddubsw       m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n    movu            xm5, [r0 + r4]                  ; m5 = row 19\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 3 * mmsize]\r\n    paddw           m12, m6\r\n    pmaddubsw       m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm6, [r0]                       ; m6 = row 20\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 3 * mmsize]\r\n    paddw           m13, m7\r\n    pmaddubsw       m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m5\r\n    movu            xm7, [r0 + r1]                  ; m7 = row 21\r\n    punpckhbw       xm2, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm2, 1\r\n    pmaddubsw       m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m6\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 22\r\n    punpckhbw       xm3, xm7, xm2\r\n    punpcklbw       xm7, xm2\r\n    vinserti128     m7, m7, xm3, 1\r\n    
pmaddubsw       m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m7\r\n\r\n%ifidn %3,pp\r\n    pmulhrsw        m8, m14                         ; m8 = word: row 8\r\n    pmulhrsw        m9, m14                         ; m9 = word: row 9\r\n    pmulhrsw        m10, m14                        ; m10 = word: row 10\r\n    pmulhrsw        m11, m14                        ; m11 = word: row 11\r\n    pmulhrsw        m12, m14                        ; m12 = word: row 12\r\n    pmulhrsw        m13, m14                        ; m13 = word: row 13\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 14\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 15\r\n    packuswb        m8, m9\r\n    packuswb        m10, m11\r\n    packuswb        m12, m13\r\n    packuswb        m0, m1\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m10, m10, 11011000b\r\n    vpermq          m12, m12, 11011000b\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    vextracti128    xm13, m12, 1\r\n    vextracti128    xm1, m0, 1\r\n    movu            [r2], xm8\r\n    movu            [r2 + r3], xm9\r\n    movu            [r2 + r3 * 2], xm10\r\n    movu            [r2 + r6], xm11\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm12\r\n    movu            [r2 + r3], xm13\r\n    movu            [r2 + r3 * 2], xm0\r\n    movu            [r2 + r6], xm1\r\n%else\r\n    psubw           m8, m14                         ; m8 = word: row 8\r\n    psubw           m9, m14                         ; m9 = word: row 9\r\n    psubw           m10, m14                        ; m10 = word: row 10\r\n    psubw           m11, m14                        ; m11 = word: row 11\r\n    psubw           m12, m14                        ; m12 = word: row 12\r\n    psubw           m13, m14                        ; m13 = word: row 13\r\n    psubw           m0, m14                         ; m0 = word: 
row 14\r\n    psubw           m1, m14                         ; m1 = word: row 15\r\n    movu            [r2], m8\r\n    movu            [r2 + r3], m9\r\n    movu            [r2 + r3 * 2], m10\r\n    movu            [r2 + r6], m11\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], m12\r\n    movu            [r2 + r3], m13\r\n    movu            [r2 + r3 * 2], m0\r\n    movu            [r2 + r6], m1\r\n%endif\r\n\r\n    lea             r2, [r2 + r3 * 4]\r\n    sub             r0, r7\r\n    dec             r8d\r\n    jnz             .loop\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_16xN 16, 32, pp\r\n    FILTER_VER_LUMA_AVX2_16xN 16, 64, pp\r\n    FILTER_VER_LUMA_AVX2_16xN 16, 32, ps\r\n    FILTER_VER_LUMA_AVX2_16xN 16, 64, ps\r\n\r\n%macro PROCESS_LUMA_AVX2_W16_16R 1\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, [r5]\r\n    lea             r7, [r0 + r1 * 4]\r\n    movu            xm4, [r7]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n    movu            xm5, [r7 + r1]                  ; m5 = row 5\r\n    punpckhbw   
    xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m6\r\n    pmaddubsw       m6, m4, [r5 + 1 * mmsize]\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, [r5]\r\n    movu            xm6, [r7 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m7\r\n    pmaddubsw       m7, m5, [r5 + 1 * mmsize]\r\n    paddw           m3, m7\r\n    pmaddubsw       m5, [r5]\r\n    movu            xm7, [r7 + r4]                  ; m7 = row 7\r\n    punpckhbw       xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m8\r\n    pmaddubsw       m8, m6, [r5 + 2 * mmsize]\r\n    paddw           m2, m8\r\n    pmaddubsw       m8, m6, [r5 + 1 * mmsize]\r\n    paddw           m4, m8\r\n    pmaddubsw       m6, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm8, [r7]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m9\r\n    pmaddubsw       m9, m7, [r5 + 2 * mmsize]\r\n    paddw           m3, m9\r\n    pmaddubsw       m9, m7, [r5 + 1 * mmsize]\r\n    paddw           m5, m9\r\n    pmaddubsw       m7, [r5]\r\n    movu            xm9, [r7 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, [r5 + 3 * mmsize]\r\n    paddw           m2, m10\r\n    pmaddubsw       m10, m8, [r5 + 2 * mmsize]\r\n    paddw           m4, m10\r\n    pmaddubsw       m10, m8, [r5 + 1 * mmsize]\r\n    paddw           m6, m10\r\n    
pmaddubsw       m8, [r5]\r\n    movu            xm10, [r7 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, [r5 + 3 * mmsize]\r\n    paddw           m3, m11\r\n    pmaddubsw       m11, m9, [r5 + 2 * mmsize]\r\n    paddw           m5, m11\r\n    pmaddubsw       m11, m9, [r5 + 1 * mmsize]\r\n    paddw           m7, m11\r\n    pmaddubsw       m9, [r5]\r\n    movu            xm11, [r7 + r4]                 ; m11 = row 11\r\n    punpckhbw       xm12, xm10, xm11\r\n    punpcklbw       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    pmaddubsw       m12, m10, [r5 + 3 * mmsize]\r\n    paddw           m4, m12\r\n    pmaddubsw       m12, m10, [r5 + 2 * mmsize]\r\n    paddw           m6, m12\r\n    pmaddubsw       m12, m10, [r5 + 1 * mmsize]\r\n    paddw           m8, m12\r\n    pmaddubsw       m10, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm12, [r7]                      ; m12 = row 12\r\n    punpckhbw       xm13, xm11, xm12\r\n    punpcklbw       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddubsw       m13, m11, [r5 + 3 * mmsize]\r\n    paddw           m5, m13\r\n    pmaddubsw       m13, m11, [r5 + 2 * mmsize]\r\n    paddw           m7, m13\r\n    pmaddubsw       m13, m11, [r5 + 1 * mmsize]\r\n    paddw           m9, m13\r\n    pmaddubsw       m11, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 0\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m14                         ; m3 = word: row 3\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 4\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 5\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    packuswb        
m4, m5\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    lea             r8, [r2 + r3 * 4]\r\n    movu            [r8], xm4\r\n    movu            [r8 + r3], xm5\r\n%else\r\n    psubw           m0, m14                         ; m0 = word: row 0\r\n    psubw           m1, m14                         ; m1 = word: row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2\r\n    psubw           m3, m14                         ; m3 = word: row 3\r\n    psubw           m4, m14                         ; m4 = word: row 4\r\n    psubw           m5, m14                         ; m5 = word: row 5\r\n    movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r6], m3\r\n    lea             r8, [r2 + r3 * 4]\r\n    movu            [r8], m4\r\n    movu            [r8 + r3], m5\r\n%endif\r\n\r\n    movu            xm13, [r7 + r1]                 ; m13 = row 13\r\n    punpckhbw       xm0, xm12, xm13\r\n    punpcklbw       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddubsw       m0, m12, [r5 + 3 * mmsize]\r\n    paddw           m6, m0\r\n    pmaddubsw       m0, m12, [r5 + 2 * mmsize]\r\n    paddw           m8, m0\r\n    pmaddubsw       m0, m12, [r5 + 1 * mmsize]\r\n    paddw           m10, m0\r\n    pmaddubsw       m12, [r5]\r\n    movu            xm0, [r7 + r1 * 2]              ; m0 = row 14\r\n    punpckhbw       xm1, xm13, xm0\r\n    punpcklbw       xm13, xm0\r\n    vinserti128     m13, m13, xm1, 1\r\n    pmaddubsw       m1, m13, [r5 + 3 * mmsize]\r\n    paddw           m7, m1\r\n    pmaddubsw       m1, m13, [r5 + 2 * mmsize]\r\n    paddw  
         m9, m1\r\n    pmaddubsw       m1, m13, [r5 + 1 * mmsize]\r\n    paddw           m11, m1\r\n    pmaddubsw       m13, [r5]\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 6\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 7\r\n    packuswb        m6, m7\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm7\r\n%else\r\n    psubw           m6, m14                         ; m6 = word: row 6\r\n    psubw           m7, m14                         ; m7 = word: row 7\r\n    movu            [r8 + r3 * 2], m6\r\n    movu            [r8 + r6], m7\r\n%endif\r\n\r\n    lea             r8, [r8 + r3 * 4]\r\n\r\n    movu            xm1, [r7 + r4]                  ; m1 = row 15\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m2, m0, [r5 + 3 * mmsize]\r\n    paddw           m8, m2\r\n    pmaddubsw       m2, m0, [r5 + 2 * mmsize]\r\n    paddw           m10, m2\r\n    pmaddubsw       m2, m0, [r5 + 1 * mmsize]\r\n    paddw           m12, m2\r\n    pmaddubsw       m0, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm2, [r7]                       ; m2 = row 16\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m3, m1, [r5 + 3 * mmsize]\r\n    paddw           m9, m3\r\n    pmaddubsw       m3, m1, [r5 + 2 * mmsize]\r\n    paddw           m11, m3\r\n    pmaddubsw       m3, m1, [r5 + 1 * mmsize]\r\n    paddw           m13, m3\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r7 + r1]                  ; m3 = row 17\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 3 * mmsize]\r\n    paddw           m10, m4\r\n    pmaddubsw       m4, m2, 
[r5 + 2 * mmsize]\r\n    paddw           m12, m4\r\n    pmaddubsw       m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m2\r\n    movu            xm4, [r7 + r1 * 2]              ; m4 = row 18\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 3 * mmsize]\r\n    paddw           m11, m5\r\n    pmaddubsw       m5, m3, [r5 + 2 * mmsize]\r\n    paddw           m13, m5\r\n    pmaddubsw       m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n    movu            xm5, [r7 + r4]                  ; m5 = row 19\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 3 * mmsize]\r\n    paddw           m12, m6\r\n    pmaddubsw       m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m4\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm6, [r7]                       ; m6 = row 20\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 3 * mmsize]\r\n    paddw           m13, m7\r\n    pmaddubsw       m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m5\r\n    movu            xm7, [r7 + r1]                  ; m7 = row 21\r\n    punpckhbw       xm2, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm2, 1\r\n    pmaddubsw       m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m6\r\n    movu            xm2, [r7 + r1 * 2]              ; m2 = row 22\r\n    punpckhbw       xm3, xm7, xm2\r\n    punpcklbw       xm7, xm2\r\n    vinserti128     m7, m7, xm3, 1\r\n    pmaddubsw       m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m7\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m8, m14                         ; m8 = word: row 8\r\n    pmulhrsw        m9, m14                         ; m9 = word: row 9\r\n    pmulhrsw        m10, m14                        ; m10 = word: row 10\r\n    pmulhrsw 
       m11, m14                        ; m11 = word: row 11\r\n    pmulhrsw        m12, m14                        ; m12 = word: row 12\r\n    pmulhrsw        m13, m14                        ; m13 = word: row 13\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 14\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 15\r\n    packuswb        m8, m9\r\n    packuswb        m10, m11\r\n    packuswb        m12, m13\r\n    packuswb        m0, m1\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m10, m10, 11011000b\r\n    vpermq          m12, m12, 11011000b\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    vextracti128    xm13, m12, 1\r\n    vextracti128    xm1, m0, 1\r\n    movu            [r8], xm8\r\n    movu            [r8 + r3], xm9\r\n    movu            [r8 + r3 * 2], xm10\r\n    movu            [r8 + r6], xm11\r\n    lea             r8, [r8 + r3 * 4]\r\n    movu            [r8], xm12\r\n    movu            [r8 + r3], xm13\r\n    movu            [r8 + r3 * 2], xm0\r\n    movu            [r8 + r6], xm1\r\n%else\r\n    psubw           m8, m14                         ; m8 = word: row 8\r\n    psubw           m9, m14                         ; m9 = word: row 9\r\n    psubw           m10, m14                        ; m10 = word: row 10\r\n    psubw           m11, m14                        ; m11 = word: row 11\r\n    psubw           m12, m14                        ; m12 = word: row 12\r\n    psubw           m13, m14                        ; m13 = word: row 13\r\n    psubw           m0, m14                         ; m0 = word: row 14\r\n    psubw           m1, m14                         ; m1 = word: row 15\r\n    movu            [r8], m8\r\n    movu            [r8 + r3], m9\r\n    movu            [r8 + r3 * 2], m10\r\n    movu            [r8 + r6], m11\r\n    lea             r8, [r8 + r3 * 4]\r\n    movu            [r8], m12\r\n    movu            
[r8 + r3], m13\r\n    movu            [r8 + r3 * 2], m0\r\n    movu            [r8 + r6], m1\r\n%endif\r\n%endmacro\r\n\r\n%macro PROCESS_LUMA_AVX2_W16_8R 1\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhbw       xm2, xm0, xm1\r\n    punpcklbw       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddubsw       m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhbw       xm3, xm1, xm2\r\n    punpcklbw       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhbw       xm4, xm2, xm3\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddubsw       m4, m2, [r5 + 1 * mmsize]\r\n    paddw           m0, m4\r\n    pmaddubsw       m2, [r5]\r\n    lea             r7, [r0 + r1 * 4]\r\n    movu            xm4, [r7]                       ; m4 = row 4\r\n    punpckhbw       xm5, xm3, xm4\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddubsw       m5, m3, [r5 + 1 * mmsize]\r\n    paddw           m1, m5\r\n    pmaddubsw       m3, [r5]\r\n    movu            xm5, [r7 + r1]                  ; m5 = row 5\r\n    punpckhbw       xm6, xm4, xm5\r\n    punpcklbw       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddubsw       m6, m4, [r5 + 2 * mmsize]\r\n    paddw           m0, m6\r\n    pmaddubsw       m6, m4, [r5 + 1 * mmsize]\r\n    paddw           m2, m6\r\n    pmaddubsw       m4, [r5]\r\n    movu            xm6, [r7 + r1 * 2]              ; m6 = row 6\r\n    punpckhbw       xm7, xm5, xm6\r\n    punpcklbw       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddubsw       m7, m5, [r5 + 2 * mmsize]\r\n    paddw           m1, m7\r\n    pmaddubsw       m7, m5, [r5 + 1 * mmsize]\r\n    paddw           m3, m7\r\n    pmaddubsw       m5, [r5]\r\n    movu       
     xm7, [r7 + r4]                  ; m7 = row 7\r\n    punpckhbw       xm8, xm6, xm7\r\n    punpcklbw       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddubsw       m8, m6, [r5 + 3 * mmsize]\r\n    paddw           m0, m8\r\n    pmaddubsw       m8, m6, [r5 + 2 * mmsize]\r\n    paddw           m2, m8\r\n    pmaddubsw       m8, m6, [r5 + 1 * mmsize]\r\n    paddw           m4, m8\r\n    pmaddubsw       m6, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm8, [r7]                       ; m8 = row 8\r\n    punpckhbw       xm9, xm7, xm8\r\n    punpcklbw       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddubsw       m9, m7, [r5 + 3 * mmsize]\r\n    paddw           m1, m9\r\n    pmaddubsw       m9, m7, [r5 + 2 * mmsize]\r\n    paddw           m3, m9\r\n    pmaddubsw       m9, m7, [r5 + 1 * mmsize]\r\n    paddw           m5, m9\r\n    pmaddubsw       m7, [r5]\r\n    movu            xm9, [r7 + r1]                  ; m9 = row 9\r\n    punpckhbw       xm10, xm8, xm9\r\n    punpcklbw       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddubsw       m10, m8, [r5 + 3 * mmsize]\r\n    paddw           m2, m10\r\n    pmaddubsw       m10, m8, [r5 + 2 * mmsize]\r\n    paddw           m4, m10\r\n    pmaddubsw       m10, m8, [r5 + 1 * mmsize]\r\n    paddw           m6, m10\r\n    movu            xm10, [r7 + r1 * 2]             ; m10 = row 10\r\n    punpckhbw       xm11, xm9, xm10\r\n    punpcklbw       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddubsw       m11, m9, [r5 + 3 * mmsize]\r\n    paddw           m3, m11\r\n    pmaddubsw       m11, m9, [r5 + 2 * mmsize]\r\n    paddw           m5, m11\r\n    pmaddubsw       m11, m9, [r5 + 1 * mmsize]\r\n    paddw           m7, m11\r\n    movu            xm11, [r7 + r4]                 ; m11 = row 11\r\n    punpckhbw       xm12, xm10, xm11\r\n    punpcklbw       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    pmaddubsw       m12, m10, [r5 + 3 * mmsize]\r\n    
paddw           m4, m12\r\n    pmaddubsw       m12, m10, [r5 + 2 * mmsize]\r\n    paddw           m6, m12\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm12, [r7]                      ; m12 = row 12\r\n    punpckhbw       xm13, xm11, xm12\r\n    punpcklbw       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddubsw       m13, m11, [r5 + 3 * mmsize]\r\n    paddw           m5, m13\r\n    pmaddubsw       m13, m11, [r5 + 2 * mmsize]\r\n    paddw           m7, m13\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 0\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 1\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 2\r\n    pmulhrsw        m3, m14                         ; m3 = word: row 3\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 4\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 5\r\n    packuswb        m0, m1\r\n    packuswb        m2, m3\r\n    packuswb        m4, m5\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    lea             r8, [r2 + r3 * 4]\r\n    movu            [r8], xm4\r\n    movu            [r8 + r3], xm5\r\n%else\r\n    psubw           m0, m14                         ; m0 = word: row 0\r\n    psubw           m1, m14                         ; m1 = word: row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2\r\n    psubw           m3, m14                         ; m3 = word: row 3\r\n    psubw           m4, m14                         ; m4 = word: row 4\r\n    psubw           m5, m14                         ; m5 = word: row 5\r\n    
movu            [r2], m0\r\n    movu            [r2 + r3], m1\r\n    movu            [r2 + r3 * 2], m2\r\n    movu            [r2 + r6], m3\r\n    lea             r8, [r2 + r3 * 4]\r\n    movu            [r8], m4\r\n    movu            [r8 + r3], m5\r\n%endif\r\n\r\n    movu            xm13, [r7 + r1]                 ; m13 = row 13\r\n    punpckhbw       xm0, xm12, xm13\r\n    punpcklbw       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddubsw       m0, m12, [r5 + 3 * mmsize]\r\n    paddw           m6, m0\r\n    movu            xm0, [r7 + r1 * 2]              ; m0 = row 14\r\n    punpckhbw       xm1, xm13, xm0\r\n    punpcklbw       xm13, xm0\r\n    vinserti128     m13, m13, xm1, 1\r\n    pmaddubsw       m1, m13, [r5 + 3 * mmsize]\r\n    paddw           m7, m1\r\n\r\n%ifidn %1,pp\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 6\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 7\r\n    packuswb        m6, m7\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm7\r\n%else\r\n    psubw           m6, m14                         ; m6 = word: row 6\r\n    psubw           m7, m14                         ; m7 = word: row 7\r\n    movu            [r8 + r3 * 2], m6\r\n    movu            [r8 + r6], m7\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_LUMA_AVX2_24x32 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_24x32, 4, 11, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,ps\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%else\r\n    mova            m14, [pw_512]\r\n%endif\r\n    lea             r6, [r3 * 
3]\r\n    lea             r10, [r1 * 4]\r\n    mov             r9d, 2\r\n.loopH:\r\n    PROCESS_LUMA_AVX2_W16_16R %1\r\n%ifidn %1,pp\r\n    add             r2, 16\r\n%else\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    movq            xm1, [r0]                       ; m1 = row 0\r\n    movq            xm2, [r0 + r1]                  ; m2 = row 1\r\n    punpcklbw       xm1, xm2\r\n    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2\r\n    punpcklbw       xm2, xm3\r\n    vinserti128     m5, m1, xm2, 1\r\n    pmaddubsw       m5, [r5]\r\n    movq            xm4, [r0 + r4]                  ; m4 = row 3\r\n    punpcklbw       xm3, xm4\r\n    lea             r7, [r0 + r1 * 4]\r\n    movq            xm1, [r7]                       ; m1 = row 4\r\n    punpcklbw       xm4, xm1\r\n    vinserti128     m2, m3, xm4, 1\r\n    pmaddubsw       m0, m2, [r5 + 1 * mmsize]\r\n    paddw           m5, m0\r\n    pmaddubsw       m2, [r5]\r\n    movq            xm3, [r7 + r1]                  ; m3 = row 5\r\n    punpcklbw       xm1, xm3\r\n    movq            xm4, [r7 + r1 * 2]              ; m4 = row 6\r\n    punpcklbw       xm3, xm4\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddubsw       m3, m1, [r5 + 2 * mmsize]\r\n    paddw           m5, m3\r\n    pmaddubsw       m0, m1, [r5 + 1 * mmsize]\r\n    paddw           m2, m0\r\n    pmaddubsw       m1, [r5]\r\n    movq            xm3, [r7 + r4]                  ; m3 = row 7\r\n    punpcklbw       xm4, xm3\r\n    lea             r7, [r7 + r1 * 4]\r\n    movq            xm0, [r7]                       ; m0 = row 8\r\n    punpcklbw       xm3, xm0\r\n    vinserti128     m4, m4, xm3, 1\r\n    pmaddubsw       m3, m4, [r5 + 3 * mmsize]\r\n    paddw           m5, m3\r\n    pmaddubsw       m3, m4, [r5 + 2 * mmsize]\r\n    paddw           m2, m3\r\n    pmaddubsw       m3, m4, [r5 + 1 * mmsize]\r\n    paddw           m1, m3\r\n    pmaddubsw       m4, [r5]\r\n    movq            xm3, [r7 + r1]                 
 ; m3 = row 9\r\n    punpcklbw       xm0, xm3\r\n    movq            xm6, [r7 + r1 * 2]              ; m6 = row 10\r\n    punpcklbw       xm3, xm6\r\n    vinserti128     m0, m0, xm3, 1\r\n    pmaddubsw       m3, m0, [r5 + 3 * mmsize]\r\n    paddw           m2, m3\r\n    pmaddubsw       m3, m0, [r5 + 2 * mmsize]\r\n    paddw           m1, m3\r\n    pmaddubsw       m3, m0, [r5 + 1 * mmsize]\r\n    paddw           m4, m3\r\n    pmaddubsw       m0, [r5]\r\n\r\n    movq            xm3, [r7 + r4]                  ; m3 = row 11\r\n    punpcklbw       xm6, xm3\r\n    lea             r7, [r7 + r1 * 4]\r\n    movq            xm7, [r7]                       ; m7 = row 12\r\n    punpcklbw       xm3, xm7\r\n    vinserti128     m6, m6, xm3, 1\r\n    pmaddubsw       m3, m6, [r5 + 3 * mmsize]\r\n    paddw           m1, m3\r\n    pmaddubsw       m3, m6, [r5 + 2 * mmsize]\r\n    paddw           m4, m3\r\n    pmaddubsw       m3, m6, [r5 + 1 * mmsize]\r\n    paddw           m0, m3\r\n    pmaddubsw       m6, [r5]\r\n    movq            xm3, [r7 + r1]                  ; m3 = row 13\r\n    punpcklbw       xm7, xm3\r\n    movq            xm8, [r7 + r1 * 2]              ; m8 = row 14\r\n    punpcklbw       xm3, xm8\r\n    vinserti128     m7, m7, xm3, 1\r\n    pmaddubsw       m3, m7, [r5 + 3 * mmsize]\r\n    paddw           m4, m3\r\n    pmaddubsw       m3, m7, [r5 + 2 * mmsize]\r\n    paddw           m0, m3\r\n    pmaddubsw       m3, m7, [r5 + 1 * mmsize]\r\n    paddw           m6, m3\r\n    pmaddubsw       m7, [r5]\r\n    movq            xm3, [r7 + r4]                  ; m3 = row 15\r\n    punpcklbw       xm8, xm3\r\n    lea             r7, [r7 + r1 * 4]\r\n    movq            xm9, [r7]                       ; m9 = row 16\r\n    punpcklbw       xm3, xm9\r\n    vinserti128     m8, m8, xm3, 1\r\n    pmaddubsw       m3, m8, [r5 + 3 * mmsize]\r\n    paddw           m0, m3\r\n    pmaddubsw       m3, m8, [r5 + 2 * mmsize]\r\n    paddw           m6, m3\r\n    pmaddubsw       m3, m8, [r5 + 1 * 
mmsize]\r\n    paddw           m7, m3\r\n    pmaddubsw       m8, [r5]\r\n    movq            xm3, [r7 + r1]                  ; m3 = row 17\r\n    punpcklbw       xm9, xm3\r\n    movq            xm10, [r7 + r1 * 2]             ; m10 = row 18\r\n    punpcklbw       xm3, xm10\r\n    vinserti128     m9, m9, xm3, 1\r\n    pmaddubsw       m3, m9, [r5 + 3 * mmsize]\r\n    paddw           m6, m3\r\n    pmaddubsw       m3, m9, [r5 + 2 * mmsize]\r\n    paddw           m7, m3\r\n    pmaddubsw       m3, m9, [r5 + 1 * mmsize]\r\n    paddw           m8, m3\r\n    movq            xm3, [r7 + r4]                  ; m3 = row 19\r\n    punpcklbw       xm10, xm3\r\n    lea             r7, [r7 + r1 * 4]\r\n    movq            xm9, [r7]                       ; m9 = row 20\r\n    punpcklbw       xm3, xm9\r\n    vinserti128     m10, m10, xm3, 1\r\n    pmaddubsw       m3, m10, [r5 + 3 * mmsize]\r\n    paddw           m7, m3\r\n    pmaddubsw       m3, m10, [r5 + 2 * mmsize]\r\n    paddw           m8, m3\r\n    movq            xm3, [r7 + r1]                  ; m3 = row 21\r\n    punpcklbw       xm9, xm3\r\n    movq            xm10, [r7 + r1 * 2]             ; m10 = row 22\r\n    punpcklbw       xm3, xm10\r\n    vinserti128     m9, m9, xm3, 1\r\n    pmaddubsw       m3, m9, [r5 + 3 * mmsize]\r\n    paddw           m8, m3\r\n%ifidn %1,pp\r\n    pmulhrsw        m5, m14                         ; m5 = word: row 0, row 1\r\n    pmulhrsw        m2, m14                         ; m2 = word: row 2, row 3\r\n    pmulhrsw        m1, m14                         ; m1 = word: row 4, row 5\r\n    pmulhrsw        m4, m14                         ; m4 = word: row 6, row 7\r\n    pmulhrsw        m0, m14                         ; m0 = word: row 8, row 9\r\n    pmulhrsw        m6, m14                         ; m6 = word: row 10, row 11\r\n    pmulhrsw        m7, m14                         ; m7 = word: row 12, row 13\r\n    pmulhrsw        m8, m14                         ; m8 = word: row 14, row 15\r\n    packuswb 
       m5, m2\r\n    packuswb        m1, m4\r\n    packuswb        m0, m6\r\n    packuswb        m7, m8\r\n    vextracti128    xm2, m5, 1\r\n    vextracti128    xm4, m1, 1\r\n    vextracti128    xm6, m0, 1\r\n    vextracti128    xm8, m7, 1\r\n    movq            [r2], xm5\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r6], xm2\r\n    lea             r8, [r2 + r3 * 4]\r\n    movq            [r8], xm1\r\n    movq            [r8 + r3], xm4\r\n    movhps          [r8 + r3 * 2], xm1\r\n    movhps          [r8 + r6], xm4\r\n    lea             r8, [r8 + r3 * 4]\r\n    movq            [r8], xm0\r\n    movq            [r8 + r3], xm6\r\n    movhps          [r8 + r3 * 2], xm0\r\n    movhps          [r8 + r6], xm6\r\n    lea             r8, [r8 + r3 * 4]\r\n    movq            [r8], xm7\r\n    movq            [r8 + r3], xm8\r\n    movhps          [r8 + r3 * 2], xm7\r\n    movhps          [r8 + r6], xm8\r\n%else\r\n    psubw           m5, m14                         ; m5 = word: row 0, row 1\r\n    psubw           m2, m14                         ; m2 = word: row 2, row 3\r\n    psubw           m1, m14                         ; m1 = word: row 4, row 5\r\n    psubw           m4, m14                         ; m4 = word: row 6, row 7\r\n    psubw           m0, m14                         ; m0 = word: row 8, row 9\r\n    psubw           m6, m14                         ; m6 = word: row 10, row 11\r\n    psubw           m7, m14                         ; m7 = word: row 12, row 13\r\n    psubw           m8, m14                         ; m8 = word: row 14, row 15\r\n    vextracti128    xm3, m5, 1\r\n    movu            [r2], xm5\r\n    movu            [r2 + r3], xm3\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    vextracti128    xm3, m1, 1\r\n    lea             r8, [r2 + r3 * 4]\r\n    movu            [r8], xm1\r\n    movu            [r8 + r3], xm3\r\n  
  vextracti128    xm3, m4, 1\r\n    movu            [r8 + r3 * 2], xm4\r\n    movu            [r8 + r6], xm3\r\n    vextracti128    xm3, m0, 1\r\n    lea             r8, [r8 + r3 * 4]\r\n    movu            [r8], xm0\r\n    movu            [r8 + r3], xm3\r\n    vextracti128    xm3, m6, 1\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm3\r\n    vextracti128    xm3, m7, 1\r\n    lea             r8, [r8 + r3 * 4]\r\n    movu            [r8], xm7\r\n    movu            [r8 + r3], xm3\r\n    vextracti128    xm3, m8, 1\r\n    movu            [r8 + r3 * 2], xm8\r\n    movu            [r8 + r6], xm3\r\n%endif\r\n    sub             r7, r10\r\n    lea             r0, [r7 - 16]\r\n%ifidn %1,pp\r\n    lea             r2, [r8 + r3 * 4 - 16]\r\n%else\r\n    lea             r2, [r8 + r3 * 4 - 32]\r\n%endif\r\n    dec             r9d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_24x32 pp\r\n    FILTER_VER_LUMA_AVX2_24x32 ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_32xN 3\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%3_%1x%2, 4, 12, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %3,ps\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%else\r\n    mova            m14, [pw_512]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    lea             r11, [r1 * 4]\r\n    mov             r9d, %2 / 16\r\n.loopH:\r\n    mov             r10d, %1 / 16\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W16_16R %3\r\n%ifidn %3,pp\r\n    add             r2, 16\r\n%else\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    dec             r10d\r\n    jnz             .loopW\r\n    sub             r7, r11\r\n    lea  
           r0, [r7 - 16]\r\n%ifidn %3,pp\r\n    lea             r2, [r8 + r3 * 4 - 16]\r\n%else\r\n    lea             r2, [r8 + r3 * 4 - 32]\r\n%endif\r\n    dec             r9d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_32xN 32, 32, pp\r\n    FILTER_VER_LUMA_AVX2_32xN 32, 64, pp\r\n    FILTER_VER_LUMA_AVX2_32xN 32, 32, ps\r\n    FILTER_VER_LUMA_AVX2_32xN 32, 64, ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_32x16 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_32x16, 4, 10, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,ps\r\n    add             r3d, r3d\r\n    vbroadcasti128  m14, [pw_2000]\r\n%else\r\n    mova            m14, [pw_512]\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r9d, 2\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W16_16R %1\r\n%ifidn %1,pp\r\n    add             r2, 16\r\n%else\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loopW\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_32x16 pp\r\n    FILTER_VER_LUMA_AVX2_32x16 ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_32x24 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_32x24, 4, 10, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,ps\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    
vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n    mov             r9d, 2\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W16_16R %1\r\n%ifidn %1,pp\r\n    add             r2, 16\r\n%else\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loopW\r\n    lea             r9, [r1 * 4]\r\n    sub             r7, r9\r\n    lea             r0, [r7 - 16]\r\n%ifidn %1,pp\r\n    lea             r2, [r8 + r3 * 4 - 16]\r\n%else\r\n    lea             r2, [r8 + r3 * 4 - 32]\r\n%endif\r\n    mov             r9d, 2\r\n.loop:\r\n    PROCESS_LUMA_AVX2_W16_8R %1\r\n%ifidn %1,pp\r\n    add             r2, 16\r\n%else\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loop\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_32x24 pp\r\n    FILTER_VER_LUMA_AVX2_32x24 ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_32x8 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_32x8, 4, 10, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,ps\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n    mov             r9d, 2\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W16_8R %1\r\n%ifidn %1,pp\r\n    add             r2, 16\r\n%else\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loopW\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_32x8 pp\r\n    FILTER_VER_LUMA_AVX2_32x8 ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_48x64 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal 
interp_8tap_vert_%1_48x64, 4, 12, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n\r\n%ifidn %1,ps\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    lea             r6, [r3 * 3]\r\n    lea             r11, [r1 * 4]\r\n\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n\r\n    mov             r9d, 4\r\n.loopH:\r\n    mov             r10d, 3\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W16_16R %1\r\n%ifidn %1,pp\r\n    add             r2, 16\r\n%else\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    dec             r10d\r\n    jnz             .loopW\r\n    sub             r7, r11\r\n    lea             r0, [r7 - 32]\r\n%ifidn %1,pp\r\n    lea             r2, [r8 + r3 * 4 - 32]\r\n%else\r\n    lea             r2, [r8 + r3 * 4 - 64]\r\n%endif\r\n    dec             r9d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_48x64 pp\r\n    FILTER_VER_LUMA_AVX2_48x64 ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_64xN 3\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%3_%1x%2, 4, 12, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n\r\n%ifidn %3,ps\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    lea             r6, [r3 * 3]\r\n    lea             r11, [r1 * 4]\r\n\r\n%ifidn %3,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n\r\n    mov             r9d, %2 / 
16\r\n.loopH:\r\n    mov             r10d, %1 / 16\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W16_16R %3\r\n%ifidn %3,pp\r\n    add             r2, 16\r\n%else\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    dec             r10d\r\n    jnz             .loopW\r\n    sub             r7, r11\r\n    lea             r0, [r7 - 48]\r\n%ifidn %3,pp\r\n    lea             r2, [r8 + r3 * 4 - 48]\r\n%else\r\n    lea             r2, [r8 + r3 * 4 - 96]\r\n%endif\r\n    dec             r9d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_64xN 64, 32, pp\r\n    FILTER_VER_LUMA_AVX2_64xN 64, 48, pp\r\n    FILTER_VER_LUMA_AVX2_64xN 64, 64, pp\r\n    FILTER_VER_LUMA_AVX2_64xN 64, 32, ps\r\n    FILTER_VER_LUMA_AVX2_64xN 64, 48, ps\r\n    FILTER_VER_LUMA_AVX2_64xN 64, 64, ps\r\n\r\n%macro FILTER_VER_LUMA_AVX2_64x16 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_64x16, 4, 10, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [tab_LumaCoeffVer_32]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [tab_LumaCoeffVer_32 + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n\r\n%ifidn %1,ps\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    lea             r6, [r3 * 3]\r\n\r\n%ifidn %1,pp\r\n    mova            m14, [pw_512]\r\n%else\r\n    vbroadcasti128  m14, [pw_2000]\r\n%endif\r\n\r\n    mov             r9d, 4\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W16_16R %1\r\n%ifidn %1,pp\r\n    add             r2, 16\r\n%else\r\n    add             r2, 32\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loopW\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_64x16 pp\r\n    FILTER_VER_LUMA_AVX2_64x16 ps\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void 
interp_8tap_vert_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_LUMA 3\r\nINIT_XMM sse4\r\ncglobal interp_8tap_vert_%3_%1x%2, 5, 7, 8 ,0-gprsize\r\n    lea       r5, [3 * r1]\r\n    sub       r0, r5\r\n    shl       r4d, 6\r\n%ifidn %3,ps\r\n    add       r3d, r3d\r\n%endif\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_LumaCoeffVer]\r\n    lea       r6, [r5 + r4]\r\n%else\r\n    lea       r6, [tab_LumaCoeffVer + r4]\r\n%endif\r\n\r\n%ifidn %3,pp\r\n    mova      m3, [pw_512]\r\n%else\r\n    mova      m3, [pw_2000]\r\n%endif\r\n    mov       dword [rsp], %2/4\r\n\r\n.loopH:\r\n    mov       r4d, (%1/8)\r\n.loopW:\r\n    PROCESS_LUMA_W8_4R\r\n%ifidn %3,pp\r\n    pmulhrsw  m7, m3\r\n    pmulhrsw  m6, m3\r\n    pmulhrsw  m5, m3\r\n    pmulhrsw  m4, m3\r\n\r\n    packuswb  m7, m6\r\n    packuswb  m5, m4\r\n\r\n    movlps    [r2], m7\r\n    movhps    [r2 + r3], m7\r\n    lea       r5, [r2 + 2 * r3]\r\n    movlps    [r5], m5\r\n    movhps    [r5 + r3], m5\r\n%else\r\n    psubw     m7, m3\r\n    psubw     m6, m3\r\n    psubw     m5, m3\r\n    psubw     m4, m3\r\n\r\n    movu      [r2], m7\r\n    movu      [r2 + r3], m6\r\n    lea       r5, [r2 + 2 * r3]\r\n    movu      [r5], m5\r\n    movu      [r5 + r3], m4\r\n%endif\r\n\r\n    lea       r5, [8 * r1 - 8]\r\n    sub       r0, r5\r\n%ifidn %3,pp\r\n    add       r2, 8\r\n%else\r\n    add       r2, 16\r\n%endif\r\n    dec       r4d\r\n    jnz       .loopW\r\n\r\n    lea       r0, [r0 + 4 * r1 - %1]\r\n%ifidn %3,pp\r\n    lea       r2, [r2 + 4 * r3 - %1]\r\n%else\r\n    lea       r2, [r2 + 4 * r3 - 2 * %1]\r\n%endif\r\n\r\n    dec       dword [rsp]\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA 16, 4, pp\r\n    FILTER_VER_LUMA 16, 8, pp\r\n    FILTER_VER_LUMA 16, 12, pp\r\n    FILTER_VER_LUMA 16, 16, pp\r\n    
FILTER_VER_LUMA 16, 32, pp\r\n    FILTER_VER_LUMA 16, 64, pp\r\n    FILTER_VER_LUMA 24, 32, pp\r\n    FILTER_VER_LUMA 32, 8, pp\r\n    FILTER_VER_LUMA 32, 16, pp\r\n    FILTER_VER_LUMA 32, 24, pp\r\n    FILTER_VER_LUMA 32, 32, pp\r\n    FILTER_VER_LUMA 32, 64, pp\r\n    FILTER_VER_LUMA 48, 64, pp\r\n    FILTER_VER_LUMA 64, 16, pp\r\n    FILTER_VER_LUMA 64, 32, pp\r\n    FILTER_VER_LUMA 64, 48, pp\r\n    FILTER_VER_LUMA 64, 64, pp\r\n\r\n    FILTER_VER_LUMA 16, 4, ps\r\n    FILTER_VER_LUMA 16, 8, ps\r\n    FILTER_VER_LUMA 16, 12, ps\r\n    FILTER_VER_LUMA 16, 16, ps\r\n    FILTER_VER_LUMA 16, 32, ps\r\n    FILTER_VER_LUMA 16, 64, ps\r\n    FILTER_VER_LUMA 24, 32, ps\r\n    FILTER_VER_LUMA 32, 8, ps\r\n    FILTER_VER_LUMA 32, 16, ps\r\n    FILTER_VER_LUMA 32, 24, ps\r\n    FILTER_VER_LUMA 32, 32, ps\r\n    FILTER_VER_LUMA 32, 64, ps\r\n    FILTER_VER_LUMA 48, 64, ps\r\n    FILTER_VER_LUMA 64, 16, ps\r\n    FILTER_VER_LUMA 64, 32, ps\r\n    FILTER_VER_LUMA 64, 48, ps\r\n    FILTER_VER_LUMA 64, 64, ps\r\n\r\n%macro PROCESS_LUMA_SP_W4_4R 0\r\n    movq       m0, [r0]\r\n    movq       m1, [r0 + r1]\r\n    punpcklwd  m0, m1                          ;m0=[0 1]\r\n    pmaddwd    m0, [r6 + 0 *16]                ;m0=[0+1]  Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m1, m4                          ;m1=[1 2]\r\n    pmaddwd    m1, [r6 + 0 *16]                ;m1=[1+2]  Row2\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[2 3]\r\n    pmaddwd    m2, m4, [r6 + 0 *16]            ;m2=[2+3]  Row3\r\n    pmaddwd    m4, [r6 + 1 * 16]\r\n    paddd      m0, m4                          ;m0=[0+1+2+3]  Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m5, m4                          ;m5=[3 4]\r\n    pmaddwd    m3, m5, [r6 + 0 *16]            ;m3=[3+4]  Row4\r\n    pmaddwd    m5, [r6 + 1 * 16]\r\n    paddd      m1, m5                          ;m1 = [1+2+3+4]  
Row2\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[4 5]\r\n    pmaddwd    m6, m4, [r6 + 1 * 16]\r\n    paddd      m2, m6                          ;m2=[2+3+4+5]  Row3\r\n    pmaddwd    m4, [r6 + 2 * 16]\r\n    paddd      m0, m4                          ;m0=[0+1+2+3+4+5]  Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m5, m4                          ;m5=[5 6]\r\n    pmaddwd    m6, m5, [r6 + 1 * 16]\r\n    paddd      m3, m6                          ;m3=[3+4+5+6]  Row4\r\n    pmaddwd    m5, [r6 + 2 * 16]\r\n    paddd      m1, m5                          ;m1=[1+2+3+4+5+6]  Row2\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[6 7]\r\n    pmaddwd    m6, m4, [r6 + 2 * 16]\r\n    paddd      m2, m6                          ;m2=[2+3+4+5+6+7]  Row3\r\n    pmaddwd    m4, [r6 + 3 * 16]\r\n    paddd      m0, m4                          ;m0=[0+1+2+3+4+5+6+7]  Row1 end\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m5, m4                          ;m5=[7 8]\r\n    pmaddwd    m6, m5, [r6 + 2 * 16]\r\n    paddd      m3, m6                          ;m3=[3+4+5+6+7+8]  Row4\r\n    pmaddwd    m5, [r6 + 3 * 16]\r\n    paddd      m1, m5                          ;m1=[1+2+3+4+5+6+7+8]  Row2 end\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[8 9]\r\n    pmaddwd    m4, [r6 + 3 * 16]\r\n    paddd      m2, m4                          ;m2=[2+3+4+5+6+7+8+9]  Row3 end\r\n\r\n    movq       m4, [r0 + 2 * r1]\r\n    punpcklwd  m5, m4                          ;m5=[9 10]\r\n    pmaddwd    m5, [r6 + 3 * 16]\r\n    paddd      m3, m5                          ;m3=[3+4+5+6+7+8+9+10]  Row4 end\r\n%endmacro\r\n\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_sp_%1x%2(int16_t *src, intptr_t srcStride, pixel 
*dst, intptr_t dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_LUMA_SP 2\r\nINIT_XMM sse4\r\ncglobal interp_8tap_vert_sp_%1x%2, 5, 7, 8 ,0-gprsize\r\n\r\n    add       r1d, r1d\r\n    lea       r5, [r1 + 2 * r1]\r\n    sub       r0, r5\r\n    shl       r4d, 6\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_LumaCoeffV]\r\n    lea       r6, [r5 + r4]\r\n%else\r\n    lea       r6, [tab_LumaCoeffV + r4]\r\n%endif\r\n\r\n    mova      m7, [pd_526336]\r\n\r\n    mov       dword [rsp], %2/4\r\n.loopH:\r\n    mov       r4d, (%1/4)\r\n.loopW:\r\n    PROCESS_LUMA_SP_W4_4R\r\n\r\n    paddd     m0, m7\r\n    paddd     m1, m7\r\n    paddd     m2, m7\r\n    paddd     m3, m7\r\n\r\n    psrad     m0, 12\r\n    psrad     m1, 12\r\n    psrad     m2, 12\r\n    psrad     m3, 12\r\n\r\n    packssdw  m0, m1\r\n    packssdw  m2, m3\r\n\r\n    packuswb  m0, m2\r\n\r\n    movd      [r2], m0\r\n    pextrd    [r2 + r3], m0, 1\r\n    lea       r5, [r2 + 2 * r3]\r\n    pextrd    [r5], m0, 2\r\n    pextrd    [r5 + r3], m0, 3\r\n\r\n    lea       r5, [8 * r1 - 2 * 4]\r\n    sub       r0, r5\r\n    add       r2, 4\r\n\r\n    dec       r4d\r\n    jnz       .loopW\r\n\r\n    lea       r0, [r0 + 4 * r1 - 2 * %1]\r\n    lea       r2, [r2 + 4 * r3 - %1]\r\n\r\n    dec       dword [rsp]\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_8tap_vert_sp_%1x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\n    FILTER_VER_LUMA_SP 4, 4\r\n    FILTER_VER_LUMA_SP 8, 8\r\n    FILTER_VER_LUMA_SP 8, 4\r\n    FILTER_VER_LUMA_SP 4, 8\r\n    FILTER_VER_LUMA_SP 16, 16\r\n    FILTER_VER_LUMA_SP 16, 8\r\n    FILTER_VER_LUMA_SP 8, 
16\r\n    FILTER_VER_LUMA_SP 16, 12\r\n    FILTER_VER_LUMA_SP 12, 16\r\n    FILTER_VER_LUMA_SP 16, 4\r\n    FILTER_VER_LUMA_SP 4, 16\r\n    FILTER_VER_LUMA_SP 32, 32\r\n    FILTER_VER_LUMA_SP 32, 16\r\n    FILTER_VER_LUMA_SP 16, 32\r\n    FILTER_VER_LUMA_SP 32, 24\r\n    FILTER_VER_LUMA_SP 24, 32\r\n    FILTER_VER_LUMA_SP 32, 8\r\n    FILTER_VER_LUMA_SP 8, 32\r\n    FILTER_VER_LUMA_SP 64, 64\r\n    FILTER_VER_LUMA_SP 64, 32\r\n    FILTER_VER_LUMA_SP 32, 64\r\n    FILTER_VER_LUMA_SP 64, 48\r\n    FILTER_VER_LUMA_SP 48, 64\r\n    FILTER_VER_LUMA_SP 64, 16\r\n    FILTER_VER_LUMA_SP 16, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal filterPixelToShort_4x2, 3, 4, 3\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n\r\n    ; load constant\r\n    mova        m1, [pb_128]\r\n    mova        m2, [tab_c_64_n64]\r\n\r\n    movd        m0, [r0]\r\n    pinsrd      m0, [r0 + r1], 1\r\n    punpcklbw   m0, m1\r\n    pmaddubsw   m0, m2\r\n\r\n    movq        [r2 + r3 * 0], m0\r\n    movhps      [r2 + r3 * 1], m0\r\n\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)\r\n;-----------------------------------------------------------------------------\r\nINIT_XMM ssse3\r\ncglobal filterPixelToShort_8x2, 3, 4, 3\r\n    mov         r3d, r3m\r\n    add         r3d, r3d\r\n\r\n    ; load constant\r\n    mova        m1, [pb_128]\r\n    mova        m2, [tab_c_64_n64]\r\n\r\n    movh        m0, [r0]\r\n    punpcklbw   m0, m1\r\n    pmaddubsw   m0, m2\r\n    movu        [r2 + r3 * 0], m0\r\n\r\n    movh        m0, [r0 + r1]\r\n    punpcklbw   m0, m1\r\n    pmaddubsw   m0, m2\r\n    movu        [r2 + r3 * 
1], m0\r\n\r\n    RET\r\n\r\n%macro PROCESS_CHROMA_SP_W4_4R 0\r\n    movq       m0, [r0]\r\n    movq       m1, [r0 + r1]\r\n    punpcklwd  m0, m1                          ;m0=[0 1]\r\n    pmaddwd    m0, [r6 + 0 *16]                ;m0=[0+1]         Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m1, m4                          ;m1=[1 2]\r\n    pmaddwd    m1, [r6 + 0 *16]                ;m1=[1+2]         Row2\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[2 3]\r\n    pmaddwd    m2, m4, [r6 + 0 *16]            ;m2=[2+3]         Row3\r\n    pmaddwd    m4, [r6 + 1 * 16]\r\n    paddd      m0, m4                          ;m0=[0+1+2+3]     Row1 done\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m5, m4                          ;m5=[3 4]\r\n    pmaddwd    m3, m5, [r6 + 0 *16]            ;m3=[3+4]         Row4\r\n    pmaddwd    m5, [r6 + 1 * 16]\r\n    paddd      m1, m5                          ;m1 = [1+2+3+4]   Row2\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[4 5]\r\n    pmaddwd    m4, [r6 + 1 * 16]\r\n    paddd      m2, m4                          ;m2=[2+3+4+5]     Row3\r\n\r\n    movq       m4, [r0 + 2 * r1]\r\n    punpcklwd  m5, m4                          ;m5=[5 6]\r\n    pmaddwd    m5, [r6 + 1 * 16]\r\n    paddd      m3, m5                          ;m3=[3+4+5+6]     Row4\r\n%endmacro\r\n\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_sp_%1x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_CHROMA_SP 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_sp_%1x%2, 5, 7, 7 ,0-gprsize\r\n\r\n    add       r1d, r1d\r\n    sub       r0, r1\r\n    shl      
 r4d, 5\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_ChromaCoeffV]\r\n    lea       r6, [r5 + r4]\r\n%else\r\n    lea       r6, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    mova      m6, [pd_526336]\r\n\r\n    mov       dword [rsp], %2/4\r\n\r\n.loopH:\r\n    mov       r4d, (%1/4)\r\n.loopW:\r\n    PROCESS_CHROMA_SP_W4_4R\r\n\r\n    paddd     m0, m6\r\n    paddd     m1, m6\r\n    paddd     m2, m6\r\n    paddd     m3, m6\r\n\r\n    psrad     m0, 12\r\n    psrad     m1, 12\r\n    psrad     m2, 12\r\n    psrad     m3, 12\r\n\r\n    packssdw  m0, m1\r\n    packssdw  m2, m3\r\n\r\n    packuswb  m0, m2\r\n\r\n    movd      [r2], m0\r\n    pextrd    [r2 + r3], m0, 1\r\n    lea       r5, [r2 + 2 * r3]\r\n    pextrd    [r5], m0, 2\r\n    pextrd    [r5 + r3], m0, 3\r\n\r\n    lea       r5, [4 * r1 - 2 * 4]\r\n    sub       r0, r5\r\n    add       r2, 4\r\n\r\n    dec       r4d\r\n    jnz       .loopW\r\n\r\n    lea       r0, [r0 + 4 * r1 - 2 * %1]\r\n    lea       r2, [r2 + 4 * r3 - %1]\r\n\r\n    dec       dword [rsp]\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_SP 4, 4\r\n    FILTER_VER_CHROMA_SP 4, 8\r\n    FILTER_VER_CHROMA_SP 16, 16\r\n    FILTER_VER_CHROMA_SP 16, 8\r\n    FILTER_VER_CHROMA_SP 16, 12\r\n    FILTER_VER_CHROMA_SP 12, 16\r\n    FILTER_VER_CHROMA_SP 16, 4\r\n    FILTER_VER_CHROMA_SP 4, 16\r\n    FILTER_VER_CHROMA_SP 32, 32\r\n    FILTER_VER_CHROMA_SP 32, 16\r\n    FILTER_VER_CHROMA_SP 16, 32\r\n    FILTER_VER_CHROMA_SP 32, 24\r\n    FILTER_VER_CHROMA_SP 24, 32\r\n    FILTER_VER_CHROMA_SP 32, 8\r\n\r\n    FILTER_VER_CHROMA_SP 16, 24\r\n    FILTER_VER_CHROMA_SP 16, 64\r\n    FILTER_VER_CHROMA_SP 12, 32\r\n    FILTER_VER_CHROMA_SP 4, 32\r\n    FILTER_VER_CHROMA_SP 32, 64\r\n    FILTER_VER_CHROMA_SP 32, 48\r\n    FILTER_VER_CHROMA_SP 24, 64\r\n\r\n    FILTER_VER_CHROMA_SP 64, 64\r\n    FILTER_VER_CHROMA_SP 64, 32\r\n    FILTER_VER_CHROMA_SP 64, 48\r\n    FILTER_VER_CHROMA_SP 48, 64\r\n    FILTER_VER_CHROMA_SP 64, 
16\r\n\r\n\r\n%macro PROCESS_CHROMA_SP_W2_4R 1\r\n    movd       m0, [r0]\r\n    movd       m1, [r0 + r1]\r\n    punpcklwd  m0, m1                          ;m0=[0 1]\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movd       m2, [r0]\r\n    punpcklwd  m1, m2                          ;m1=[1 2]\r\n    punpcklqdq m0, m1                          ;m0=[0 1 1 2]\r\n    pmaddwd    m0, [%1 + 0 *16]                ;m0=[0+1 1+2] Row 1-2\r\n\r\n    movd       m1, [r0 + r1]\r\n    punpcklwd  m2, m1                          ;m2=[2 3]\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movd       m3, [r0]\r\n    punpcklwd  m1, m3                          ;m1=[3 4]\r\n    punpcklqdq m2, m1                          ;m2=[2 3 3 4]\r\n\r\n    pmaddwd    m4, m2, [%1 + 1 * 16]           ;m4=[2+3 3+4] Row 1-2\r\n    pmaddwd    m2, [%1 + 0 * 16]               ;m2=[2+3 3+4] Row 3-4\r\n    paddd      m0, m4                          ;m0=[0+1+2+3 1+2+3+4] Row 1-2\r\n\r\n    movd       m1, [r0 + r1]\r\n    punpcklwd  m3, m1                          ;m3=[4 5]\r\n\r\n    movd       m4, [r0 + 2 * r1]\r\n    punpcklwd  m1, m4                          ;m1=[5 6]\r\n    punpcklqdq m3, m1                          ;m3=[4 5 5 6]\r\n    pmaddwd    m3, [%1 + 1 * 16]               ;m3=[4+5 5+6] Row 3-4\r\n    paddd      m2, m3                          ;m2=[2+3+4+5 3+4+5+6] Row 3-4\r\n%endmacro\r\n\r\n;-------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_sp_%1x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_CHROMA_SP_W2_4R 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_sp_%1x%2, 5, 6, 6\r\n\r\n    add       r1d, r1d\r\n    sub       r0, r1\r\n    shl       r4d, 5\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_ChromaCoeffV]\r\n    lea       r5, [r5 + 
r4]\r\n%else\r\n    lea       r5, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    mova      m5, [pd_526336]\r\n\r\n    mov       r4d, (%2/4)\r\n\r\n.loopH:\r\n    PROCESS_CHROMA_SP_W2_4R r5\r\n\r\n    paddd     m0, m5\r\n    paddd     m2, m5\r\n\r\n    psrad     m0, 12\r\n    psrad     m2, 12\r\n\r\n    packssdw  m0, m2\r\n    packuswb  m0, m0\r\n\r\n    pextrw    [r2], m0, 0\r\n    pextrw    [r2 + r3], m0, 1\r\n    lea       r2, [r2 + 2 * r3]\r\n    pextrw    [r2], m0, 2\r\n    pextrw    [r2 + r3], m0, 3\r\n\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_SP_W2_4R 2, 4\r\n    FILTER_VER_CHROMA_SP_W2_4R 2, 8\r\n\r\n    FILTER_VER_CHROMA_SP_W2_4R 2, 16\r\n\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_sp_4x2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_sp_4x2, 5, 6, 5\r\n\r\n    add        r1d, r1d\r\n    sub        r0, r1\r\n    shl        r4d, 5\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeffV]\r\n    lea        r5, [r5 + r4]\r\n%else\r\n    lea        r5, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    mova       m4, [pd_526336]\r\n\r\n    movq       m0, [r0]\r\n    movq       m1, [r0 + r1]\r\n    punpcklwd  m0, m1                          ;m0=[0 1]\r\n    pmaddwd    m0, [r5 + 0 *16]                ;m0=[0+1]  Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m2, [r0]\r\n    punpcklwd  m1, m2                          ;m1=[1 2]\r\n    pmaddwd    m1, [r5 + 0 *16]                ;m1=[1+2]  Row2\r\n\r\n    movq       m3, [r0 + r1]\r\n    punpcklwd  m2, m3                          ;m2=[2 3]\r\n    pmaddwd    m2, [r5 + 1 * 16]\r\n    paddd      m0, m2                          
;m0=[0+1+2+3]  Row1 done\r\n    paddd      m0, m4\r\n    psrad      m0, 12\r\n\r\n    movq       m2, [r0 + 2 * r1]\r\n    punpcklwd  m3, m2                          ;m3=[3 4]\r\n    pmaddwd    m3, [r5 + 1 * 16]\r\n    paddd      m1, m3                          ;m1 = [1+2+3+4]  Row2 done\r\n    paddd      m1, m4\r\n    psrad      m1, 12\r\n\r\n    packssdw   m0, m1\r\n    packuswb   m0, m0\r\n\r\n    movd       [r2], m0\r\n    pextrd     [r2 + r3], m0, 1\r\n\r\n    RET\r\n\r\n;-------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_sp_6x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_CHROMA_SP_W6_H4 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_sp_6x%2, 5, 7, 7\r\n\r\n    add       r1d, r1d\r\n    sub       r0, r1\r\n    shl       r4d, 5\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_ChromaCoeffV]\r\n    lea       r6, [r5 + r4]\r\n%else\r\n    lea       r6, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    mova      m6, [pd_526336]\r\n\r\n    mov       r4d, %2/4\r\n\r\n.loopH:\r\n    PROCESS_CHROMA_SP_W4_4R\r\n\r\n    paddd     m0, m6\r\n    paddd     m1, m6\r\n    paddd     m2, m6\r\n    paddd     m3, m6\r\n\r\n    psrad     m0, 12\r\n    psrad     m1, 12\r\n    psrad     m2, 12\r\n    psrad     m3, 12\r\n\r\n    packssdw  m0, m1\r\n    packssdw  m2, m3\r\n\r\n    packuswb  m0, m2\r\n\r\n    movd      [r2], m0\r\n    pextrd    [r2 + r3], m0, 1\r\n    lea       r5, [r2 + 2 * r3]\r\n    pextrd    [r5], m0, 2\r\n    pextrd    [r5 + r3], m0, 3\r\n\r\n    lea       r5, [4 * r1 - 2 * 4]\r\n    sub       r0, r5\r\n    add       r2, 4\r\n\r\n    PROCESS_CHROMA_SP_W2_4R r6\r\n\r\n    paddd     m0, m6\r\n    paddd     m2, m6\r\n\r\n    psrad     m0, 12\r\n    psrad     m2, 12\r\n\r\n    packssdw  m0, m2\r\n    packuswb  
m0, m0\r\n\r\n    pextrw    [r2], m0, 0\r\n    pextrw    [r2 + r3], m0, 1\r\n    lea       r2, [r2 + 2 * r3]\r\n    pextrw    [r2], m0, 2\r\n    pextrw    [r2 + r3], m0, 3\r\n\r\n    sub       r0, 2 * 4\r\n    lea       r2, [r2 + 2 * r3 - 4]\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_SP_W6_H4 6, 8\r\n\r\n    FILTER_VER_CHROMA_SP_W6_H4 6, 16\r\n\r\n%macro PROCESS_CHROMA_SP_W8_2R 0\r\n    movu       m1, [r0]\r\n    movu       m3, [r0 + r1]\r\n    punpcklwd  m0, m1, m3\r\n    pmaddwd    m0, [r5 + 0 * 16]                ;m0 = [0l+1l]  Row1l\r\n    punpckhwd  m1, m3\r\n    pmaddwd    m1, [r5 + 0 * 16]                ;m1 = [0h+1h]  Row1h\r\n\r\n    movu       m4, [r0 + 2 * r1]\r\n    punpcklwd  m2, m3, m4\r\n    pmaddwd    m2, [r5 + 0 * 16]                ;m2 = [1l+2l]  Row2l\r\n    punpckhwd  m3, m4\r\n    pmaddwd    m3, [r5 + 0 * 16]                ;m3 = [1h+2h]  Row2h\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movu       m5, [r0 + r1]\r\n    punpcklwd  m6, m4, m5\r\n    pmaddwd    m6, [r5 + 1 * 16]                ;m6 = [2l+3l]  Row1l\r\n    paddd      m0, m6                           ;m0 = [0l+1l+2l+3l]  Row1l sum\r\n    punpckhwd  m4, m5\r\n    pmaddwd    m4, [r5 + 1 * 16]                ;m4 = [2h+3h]  Row1h\r\n    paddd      m1, m4                           ;m1 = [0h+1h+2h+3h]  Row1h sum\r\n\r\n    movu       m4, [r0 + 2 * r1]\r\n    punpcklwd  m6, m5, m4\r\n    pmaddwd    m6, [r5 + 1 * 16]                ;m6 = [3l+4l]  Row2l\r\n    paddd      m2, m6                           ;m2 = [1l+2l+3l+4l]  Row2l sum\r\n    punpckhwd  m5, m4\r\n    pmaddwd    m5, [r5 + 1 * 16]                ;m5 = [3h+4h]  Row2h\r\n    paddd      m3, m5                           ;m3 = [1h+2h+3h+4h]  Row2h sum\r\n%endmacro\r\n\r\n;--------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_sp_8x%2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t 
dstStride, int coeffIdx)\r\n;--------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_CHROMA_SP_W8_H2 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_sp_%1x%2, 5, 6, 8\r\n\r\n    add       r1d, r1d\r\n    sub       r0, r1\r\n    shl       r4d, 5\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_ChromaCoeffV]\r\n    lea       r5, [r5 + r4]\r\n%else\r\n    lea       r5, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    mova      m7, [pd_526336]\r\n\r\n    mov       r4d, %2/2\r\n.loopH:\r\n    PROCESS_CHROMA_SP_W8_2R\r\n\r\n    paddd     m0, m7\r\n    paddd     m1, m7\r\n    paddd     m2, m7\r\n    paddd     m3, m7\r\n\r\n    psrad     m0, 12\r\n    psrad     m1, 12\r\n    psrad     m2, 12\r\n    psrad     m3, 12\r\n\r\n    packssdw  m0, m1\r\n    packssdw  m2, m3\r\n\r\n    packuswb  m0, m2\r\n\r\n    movlps    [r2], m0\r\n    movhps    [r2 + r3], m0\r\n\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    dec r4d\r\n    jnz .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_SP_W8_H2 8, 2\r\n    FILTER_VER_CHROMA_SP_W8_H2 8, 4\r\n    FILTER_VER_CHROMA_SP_W8_H2 8, 6\r\n    FILTER_VER_CHROMA_SP_W8_H2 8, 8\r\n    FILTER_VER_CHROMA_SP_W8_H2 8, 16\r\n    FILTER_VER_CHROMA_SP_W8_H2 8, 32\r\n\r\n    FILTER_VER_CHROMA_SP_W8_H2 8, 12\r\n    FILTER_VER_CHROMA_SP_W8_H2 8, 64\r\n\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_2x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_HORIZ_CHROMA_2xN 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 4, src, srcstride, dst, dststride\r\n%define coef2  m3\r\n%define Tm0    m2\r\n%define t1     m1\r\n%define t0     m0\r\n\r\n    dec        srcq\r\n    
mov        r4d, r4m\r\n    add        dststrided, dststrided\r\n\r\n%ifdef PIC\r\n    lea        r6, [tab_ChromaCoeff]\r\n    movd       coef2, [r6 + r4 * 4]\r\n%else\r\n    movd       coef2, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd     coef2, coef2, 0\r\n    mova       t1, [pw_2000]\r\n    mova       Tm0, [tab_Tm]\r\n\r\n    mov        r4d, %2\r\n    cmp        r5m, byte 0\r\n    je         .loopH\r\n    sub        srcq, srcstrideq\r\n    add        r4d, 3\r\n\r\n.loopH:\r\n    movh       t0, [srcq]\r\n    pshufb     t0, t0, Tm0\r\n    pmaddubsw  t0, coef2\r\n    phaddw     t0, t0\r\n    psubw      t0, t1\r\n    movd       [dstq], t0\r\n\r\n    lea        srcq, [srcq + srcstrideq]\r\n    lea        dstq, [dstq + dststrideq]\r\n\r\n    dec        r4d\r\n    jnz        .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_HORIZ_CHROMA_2xN 2, 4\r\n    FILTER_HORIZ_CHROMA_2xN 2, 8\r\n\r\n    FILTER_HORIZ_CHROMA_2xN 2, 16\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_4x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_HORIZ_CHROMA_4xN 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 4, src, srcstride, dst, dststride\r\n%define coef2  m3\r\n%define Tm0    m2\r\n%define t1     m1\r\n%define t0     m0\r\n\r\n    dec        srcq\r\n    mov        r4d, r4m\r\n    add        dststrided, dststrided\r\n\r\n%ifdef PIC\r\n    lea        r6, [tab_ChromaCoeff]\r\n    movd       coef2, [r6 + r4 * 4]\r\n%else\r\n    movd       coef2, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd     coef2, coef2, 0\r\n    mova       t1, [pw_2000]\r\n    mova       Tm0, [tab_Tm]\r\n\r\n    mov        r4d, %2\r\n    cmp        r5m, byte 0\r\n    je         
.loopH\r\n    sub        srcq, srcstrideq\r\n    add        r4d, 3\r\n\r\n.loopH:\r\n    movh       t0, [srcq]\r\n    pshufb     t0, t0, Tm0\r\n    pmaddubsw  t0, coef2\r\n    phaddw     t0, t0\r\n    psubw      t0, t1\r\n    movlps     [dstq], t0\r\n\r\n    lea        srcq, [srcq + srcstrideq]\r\n    lea        dstq, [dstq + dststrideq]\r\n\r\n    dec        r4d\r\n    jnz        .loopH\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_HORIZ_CHROMA_4xN 4, 2\r\n    FILTER_HORIZ_CHROMA_4xN 4, 4\r\n    FILTER_HORIZ_CHROMA_4xN 4, 8\r\n    FILTER_HORIZ_CHROMA_4xN 4, 16\r\n\r\n    FILTER_HORIZ_CHROMA_4xN 4, 32\r\n\r\n%macro PROCESS_CHROMA_W6 3\r\n    movu       %1, [srcq]\r\n    pshufb     %2, %1, Tm0\r\n    pmaddubsw  %2, coef2\r\n    pshufb     %1, %1, Tm1\r\n    pmaddubsw  %1, coef2\r\n    phaddw     %2, %1\r\n    psubw      %2, %3\r\n    movh       [dstq], %2\r\n    pshufd     %2, %2, 2\r\n    movd       [dstq + 8], %2\r\n%endmacro\r\n\r\n%macro PROCESS_CHROMA_W12 3\r\n    movu       %1, [srcq]\r\n    pshufb     %2, %1, Tm0\r\n    pmaddubsw  %2, coef2\r\n    pshufb     %1, %1, Tm1\r\n    pmaddubsw  %1, coef2\r\n    phaddw     %2, %1\r\n    psubw      %2, %3\r\n    movu       [dstq], %2\r\n    movu       %1, [srcq + 8]\r\n    pshufb     %1, %1, Tm0\r\n    pmaddubsw  %1, coef2\r\n    phaddw     %1, %1\r\n    psubw      %1, %3\r\n    movh       [dstq + 16], %1\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_6x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_HORIZ_CHROMA 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride\r\n%define coef2    m5\r\n%define Tm0      m4\r\n%define Tm1      m3\r\n%define t2       
m2\r\n%define t1       m1\r\n%define t0       m0\r\n\r\n    dec     srcq\r\n    mov     r4d, r4m\r\n    add     dststrided, dststrided\r\n\r\n%ifdef PIC\r\n    lea     r6, [tab_ChromaCoeff]\r\n    movd    coef2, [r6 + r4 * 4]\r\n%else\r\n    movd    coef2, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd  coef2, coef2, 0\r\n    mova    t2, [pw_2000]\r\n    mova    Tm0, [tab_Tm]\r\n    mova    Tm1, [tab_Tm + 16]\r\n\r\n    mov     r4d, %2\r\n    cmp     r5m, byte 0\r\n    je      .loopH\r\n    sub     srcq, srcstrideq\r\n    add     r4d, 3\r\n\r\n.loopH:\r\n    PROCESS_CHROMA_W%1  t0, t1, t2\r\n    add     srcq, srcstrideq\r\n    add     dstq, dststrideq\r\n\r\n    dec     r4d\r\n    jnz     .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_HORIZ_CHROMA 6, 8\r\n    FILTER_HORIZ_CHROMA 12, 16\r\n\r\n    FILTER_HORIZ_CHROMA 6, 16\r\n    FILTER_HORIZ_CHROMA 12, 32\r\n\r\n%macro PROCESS_CHROMA_W8 3\r\n    movu        %1, [srcq]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    psubw       %2, %3\r\n    movu        [dstq], %2\r\n%endmacro\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_HORIZ_CHROMA_8xN 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride\r\n%define coef2    m5\r\n%define Tm0      m4\r\n%define Tm1      m3\r\n%define t2       m2\r\n%define t1       m1\r\n%define t0       m0\r\n\r\n    dec     srcq\r\n    mov     r4d, r4m\r\n    add     dststrided, dststrided\r\n\r\n%ifdef PIC\r\n    lea     r6, [tab_ChromaCoeff]\r\n    movd    coef2, [r6 + r4 * 
4]\r\n%else\r\n    movd    coef2, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd  coef2, coef2, 0\r\n    mova    t2, [pw_2000]\r\n    mova    Tm0, [tab_Tm]\r\n    mova    Tm1, [tab_Tm + 16]\r\n\r\n    mov     r4d, %2\r\n    cmp     r5m, byte 0\r\n    je      .loopH\r\n    sub     srcq, srcstrideq\r\n    add     r4d, 3\r\n\r\n.loopH:\r\n    PROCESS_CHROMA_W8  t0, t1, t2\r\n    add     srcq, srcstrideq\r\n    add     dstq, dststrideq\r\n\r\n    dec     r4d\r\n    jnz     .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_HORIZ_CHROMA_8xN 8, 2\r\n    FILTER_HORIZ_CHROMA_8xN 8, 4\r\n    FILTER_HORIZ_CHROMA_8xN 8, 6\r\n    FILTER_HORIZ_CHROMA_8xN 8, 8\r\n    FILTER_HORIZ_CHROMA_8xN 8, 16\r\n    FILTER_HORIZ_CHROMA_8xN 8, 32\r\n\r\n    FILTER_HORIZ_CHROMA_8xN 8, 12\r\n    FILTER_HORIZ_CHROMA_8xN 8, 64\r\n\r\n%macro PROCESS_CHROMA_W16 4\r\n    movu        %1, [srcq]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq + 8]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %4, %1\r\n    psubw       %2, %3\r\n    psubw       %4, %3\r\n    movu        [dstq], %2\r\n    movu        [dstq + 16], %4\r\n%endmacro\r\n\r\n%macro PROCESS_CHROMA_W24 4\r\n    movu        %1, [srcq]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq + 8]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %4, %1\r\n    psubw       %2, %3\r\n    psubw       %4, %3\r\n    movu        [dstq], %2\r\n    movu        [dstq + 16], %4\r\n    movu        %1, [srcq + 16]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, 
coef2\r\n    phaddw      %2, %1\r\n    psubw       %2, %3\r\n    movu        [dstq + 32], %2\r\n%endmacro\r\n\r\n%macro PROCESS_CHROMA_W32 4\r\n    movu        %1, [srcq]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq + 8]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %4, %1\r\n    psubw       %2, %3\r\n    psubw       %4, %3\r\n    movu        [dstq], %2\r\n    movu        [dstq + 16], %4\r\n    movu        %1, [srcq + 16]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq + 24]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %4, %1\r\n    psubw       %2, %3\r\n    psubw       %4, %3\r\n    movu        [dstq + 32], %2\r\n    movu        [dstq + 48], %4\r\n%endmacro\r\n\r\n%macro PROCESS_CHROMA_W16o 5\r\n    movu        %1, [srcq + %5]\r\n    pshufb      %2, %1, Tm0\r\n    pmaddubsw   %2, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %2, %1\r\n    movu        %1, [srcq + %5 + 8]\r\n    pshufb      %4, %1, Tm0\r\n    pmaddubsw   %4, coef2\r\n    pshufb      %1, %1, Tm1\r\n    pmaddubsw   %1, coef2\r\n    phaddw      %4, %1\r\n    psubw       %2, %3\r\n    psubw       %4, %3\r\n    movu        [dstq + %5 * 2], %2\r\n    movu        [dstq + %5 * 2 + 16], %4\r\n%endmacro\r\n\r\n%macro PROCESS_CHROMA_W48 4\r\n    PROCESS_CHROMA_W16o %1, %2, %3, %4, 0\r\n    PROCESS_CHROMA_W16o %1, %2, %3, %4, 16\r\n    PROCESS_CHROMA_W16o %1, %2, %3, %4, 32\r\n%endmacro\r\n\r\n%macro PROCESS_CHROMA_W64 4\r\n    PROCESS_CHROMA_W16o %1, %2, %3, %4, 0\r\n    PROCESS_CHROMA_W16o %1, %2, %3, %4, 16\r\n    PROCESS_CHROMA_W16o 
%1, %2, %3, %4, 32\r\n    PROCESS_CHROMA_W16o %1, %2, %3, %4, 48\r\n%endmacro\r\n\r\n;------------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;------------------------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_HORIZ_CHROMA_WxN 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 7, src, srcstride, dst, dststride\r\n%define coef2    m6\r\n%define Tm0      m5\r\n%define Tm1      m4\r\n%define t3       m3\r\n%define t2       m2\r\n%define t1       m1\r\n%define t0       m0\r\n\r\n    dec     srcq\r\n    mov     r4d, r4m\r\n    add     dststrided, dststrided\r\n\r\n%ifdef PIC\r\n    lea     r6, [tab_ChromaCoeff]\r\n    movd    coef2, [r6 + r4 * 4]\r\n%else\r\n    movd    coef2, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufd  coef2, coef2, 0\r\n    mova    t2, [pw_2000]\r\n    mova    Tm0, [tab_Tm]\r\n    mova    Tm1, [tab_Tm + 16]\r\n\r\n    mov     r4d, %2\r\n    cmp     r5m, byte 0\r\n    je      .loopH\r\n    sub     srcq, srcstrideq\r\n    add     r4d, 3\r\n\r\n.loopH:\r\n    PROCESS_CHROMA_W%1   t0, t1, t2, t3\r\n    add     srcq, srcstrideq\r\n    add     dstq, dststrideq\r\n\r\n    dec     r4d\r\n    jnz     .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_HORIZ_CHROMA_WxN 16, 4\r\n    FILTER_HORIZ_CHROMA_WxN 16, 8\r\n    FILTER_HORIZ_CHROMA_WxN 16, 12\r\n    FILTER_HORIZ_CHROMA_WxN 16, 16\r\n    FILTER_HORIZ_CHROMA_WxN 16, 32\r\n    FILTER_HORIZ_CHROMA_WxN 24, 32\r\n    FILTER_HORIZ_CHROMA_WxN 32,  8\r\n    FILTER_HORIZ_CHROMA_WxN 32, 16\r\n    FILTER_HORIZ_CHROMA_WxN 32, 24\r\n    FILTER_HORIZ_CHROMA_WxN 32, 32\r\n\r\n    FILTER_HORIZ_CHROMA_WxN 16, 24\r\n    FILTER_HORIZ_CHROMA_WxN 16, 64\r\n    FILTER_HORIZ_CHROMA_WxN 24, 64\r\n    FILTER_HORIZ_CHROMA_WxN 32, 
48\r\n    FILTER_HORIZ_CHROMA_WxN 32, 64\r\n\r\n    FILTER_HORIZ_CHROMA_WxN 64, 64\r\n    FILTER_HORIZ_CHROMA_WxN 64, 32\r\n    FILTER_HORIZ_CHROMA_WxN 64, 48\r\n    FILTER_HORIZ_CHROMA_WxN 48, 64\r\n    FILTER_HORIZ_CHROMA_WxN 64, 16\r\n\r\n\r\n;---------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_%1x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;---------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V_PS_W16n 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_%1x%2, 4, 7, 8\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m0, [r5 + r4 * 4]\r\n%else\r\n    movd       m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m1, m0, [tab_Vm]\r\n    pshufb     m0, [tab_Vm + 16]\r\n    mov        r4d, %2/2\r\n\r\n.loop:\r\n\r\n    mov         r6d,       %1/16\r\n\r\n.loopW:\r\n\r\n    movu       m2, [r0]\r\n    movu       m3, [r0 + r1]\r\n\r\n    punpcklbw  m4, m2, m3\r\n    punpckhbw  m2, m3\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m2, m1\r\n\r\n    lea        r5, [r0 + 2 * r1]\r\n    movu       m5, [r5]\r\n    movu       m7, [r5 + r1]\r\n\r\n    punpcklbw  m6, m5, m7\r\n    pmaddubsw  m6, m0\r\n    paddw      m4, m6\r\n\r\n    punpckhbw  m6, m5, m7\r\n    pmaddubsw  m6, m0\r\n    paddw      m2, m6\r\n\r\n    mova       m6, [pw_2000]\r\n\r\n    psubw      m4, m6\r\n    psubw      m2, m6\r\n\r\n    movu       [r2], m4\r\n    movu       [r2 + 16], m2\r\n\r\n    punpcklbw  m4, m3, m5\r\n    punpckhbw  m3, m5\r\n\r\n    pmaddubsw  m4, m1\r\n    pmaddubsw  m3, m1\r\n\r\n    movu       m5, [r5 + 2 * r1]\r\n\r\n    punpcklbw  m2, m7, m5\r\n    punpckhbw  m7, m5\r\n\r\n    pmaddubsw  m2, m0\r\n    pmaddubsw  m7, m0\r\n\r\n    paddw      m4, m2\r\n    paddw  
    m3, m7\r\n\r\n    psubw      m4, m6\r\n    psubw      m3, m6\r\n\r\n    movu       [r2 + r3], m4\r\n    movu       [r2 + r3 + 16], m3\r\n\r\n    add         r0,        16\r\n    add         r2,        32\r\n    dec         r6d\r\n    jnz         .loopW\r\n\r\n    lea         r0,        [r0 + r1 * 2 - %1]\r\n    lea         r2,        [r2 + r3 * 2 - %1 * 2]\r\n\r\n    dec        r4d\r\n    jnz        .loop\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V_PS_W16n 64, 64\r\n    FILTER_V_PS_W16n 64, 32\r\n    FILTER_V_PS_W16n 64, 48\r\n    FILTER_V_PS_W16n 48, 64\r\n    FILTER_V_PS_W16n 64, 16\r\n\r\n\r\n;------------------------------------------------------------------------------------------------------------\r\n;void interp_4tap_vert_ps_2x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;------------------------------------------------------------------------------------------------------------\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_2x4, 4, 6, 7\r\n\r\n    mov         r4d, r4m\r\n    sub         r0, r1\r\n    add         r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea         r5, [tab_ChromaCoeff]\r\n    movd        m0, [r5 + r4 * 4]\r\n%else\r\n    movd        m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb      m0, [tab_Cm]\r\n\r\n    lea         r5, [3 * r1]\r\n\r\n    movd        m2, [r0]\r\n    movd        m3, [r0 + r1]\r\n    movd        m4, [r0 + 2 * r1]\r\n    movd        m5, [r0 + r5]\r\n\r\n    punpcklbw   m2, m3\r\n    punpcklbw   m6, m4, m5\r\n    punpcklbw   m2, m6\r\n\r\n    pmaddubsw   m2, m0\r\n\r\n    lea         r0, [r0 + 4 * r1]\r\n    movd        m6, [r0]\r\n\r\n    punpcklbw   m3, m4\r\n    punpcklbw   m1, m5, m6\r\n    punpcklbw   m3, m1\r\n\r\n    pmaddubsw   m3, m0\r\n    phaddw      m2, m3\r\n\r\n    mova        m1, [pw_2000]\r\n\r\n    psubw       m2, m1\r\n\r\n    movd        [r2], m2\r\n    pextrd      [r2 + r3], m2, 2\r\n\r\n    movd        m2, [r0 + r1]\r\n\r\n    punpcklbw   m4, m5\r\n    
punpcklbw   m3, m6, m2\r\n    punpcklbw   m4, m3\r\n\r\n    pmaddubsw   m4, m0\r\n\r\n    movd        m3, [r0 + 2 * r1]\r\n\r\n    punpcklbw   m5, m6\r\n    punpcklbw   m2, m3\r\n    punpcklbw   m5, m2\r\n\r\n    pmaddubsw   m5, m0\r\n    phaddw      m4, m5\r\n    psubw       m4, m1\r\n\r\n    lea         r2, [r2 + 2 * r3]\r\n    movd        [r2], m4\r\n    pextrd      [r2 + r3], m4, 2\r\n\r\n    RET\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ps_2x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_V_PS_W2 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ps_2x%2, 4, 6, 8\r\n\r\n    mov        r4d, r4m\r\n    sub        r0, r1\r\n    add        r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeff]\r\n    movd       m0, [r5 + r4 * 4]\r\n%else\r\n    movd       m0, [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    pshufb     m0, [tab_Cm]\r\n\r\n    mova       m1, [pw_2000]\r\n    lea        r5, [3 * r1]\r\n    mov        r4d, %2/4\r\n.loop:\r\n    movd       m2, [r0]\r\n    movd       m3, [r0 + r1]\r\n    movd       m4, [r0 + 2 * r1]\r\n    movd       m5, [r0 + r5]\r\n\r\n    punpcklbw  m2, m3\r\n    punpcklbw  m6, m4, m5\r\n    punpcklbw  m2, m6\r\n\r\n    pmaddubsw  m2, m0\r\n\r\n    lea        r0, [r0 + 4 * r1]\r\n    movd       m6, [r0]\r\n\r\n    punpcklbw  m3, m4\r\n    punpcklbw  m7, m5, m6\r\n    punpcklbw  m3, m7\r\n\r\n    pmaddubsw  m3, m0\r\n\r\n    phaddw     m2, m3\r\n    psubw      m2, m1\r\n\r\n\r\n    movd       [r2], m2\r\n    pshufd     m2, m2, 2\r\n    movd       [r2 + r3], m2\r\n\r\n    movd       m2, [r0 + r1]\r\n\r\n    punpcklbw  m4, m5\r\n    punpcklbw  m3, m6, m2\r\n    punpcklbw  m4, m3\r\n\r\n    pmaddubsw  m4, m0\r\n\r\n    movd       m3, [r0 + 2 * r1]\r\n\r\n    punpcklbw 
 m5, m6\r\n    punpcklbw  m2, m3\r\n    punpcklbw  m5, m2\r\n\r\n    pmaddubsw  m5, m0\r\n\r\n    phaddw     m4, m5\r\n\r\n    psubw      m4, m1\r\n\r\n    lea        r2, [r2 + 2 * r3]\r\n    movd       [r2], m4\r\n    pshufd     m4 , m4 ,2\r\n    movd       [r2 + r3], m4\r\n\r\n    lea        r2, [r2 + 2 * r3]\r\n\r\n    dec        r4d\r\n    jnz        .loop\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_V_PS_W2 2, 8\r\n\r\n    FILTER_V_PS_W2 2, 16\r\n\r\n;-----------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_CHROMA_SS 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_ss_%1x%2, 5, 7, 6 ,0-gprsize\r\n\r\n    add       r1d, r1d\r\n    add       r3d, r3d\r\n    sub       r0, r1\r\n    shl       r4d, 5\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_ChromaCoeffV]\r\n    lea       r6, [r5 + r4]\r\n%else\r\n    lea       r6, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    mov       dword [rsp], %2/4\r\n\r\n.loopH:\r\n    mov       r4d, (%1/4)\r\n.loopW:\r\n    PROCESS_CHROMA_SP_W4_4R\r\n\r\n    psrad     m0, 6\r\n    psrad     m1, 6\r\n    psrad     m2, 6\r\n    psrad     m3, 6\r\n\r\n    packssdw  m0, m1\r\n    packssdw  m2, m3\r\n\r\n    movlps    [r2], m0\r\n    movhps    [r2 + r3], m0\r\n    lea       r5, [r2 + 2 * r3]\r\n    movlps    [r5], m2\r\n    movhps    [r5 + r3], m2\r\n\r\n    lea       r5, [4 * r1 - 2 * 4]\r\n    sub       r0, r5\r\n    add       r2, 2 * 4\r\n\r\n    dec       r4d\r\n    jnz       .loopW\r\n\r\n    lea       r0, [r0 + 4 * r1 - 2 * %1]\r\n    lea       r2, [r2 + 4 * r3 - 2 * %1]\r\n\r\n    dec       dword [rsp]\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_SS 4, 4\r\n    FILTER_VER_CHROMA_SS 4, 8\r\n    
FILTER_VER_CHROMA_SS 16, 16\r\n    FILTER_VER_CHROMA_SS 16, 8\r\n    FILTER_VER_CHROMA_SS 16, 12\r\n    FILTER_VER_CHROMA_SS 12, 16\r\n    FILTER_VER_CHROMA_SS 16, 4\r\n    FILTER_VER_CHROMA_SS 4, 16\r\n    FILTER_VER_CHROMA_SS 32, 32\r\n    FILTER_VER_CHROMA_SS 32, 16\r\n    FILTER_VER_CHROMA_SS 16, 32\r\n    FILTER_VER_CHROMA_SS 32, 24\r\n    FILTER_VER_CHROMA_SS 24, 32\r\n    FILTER_VER_CHROMA_SS 32, 8\r\n\r\n    FILTER_VER_CHROMA_SS 16, 24\r\n    FILTER_VER_CHROMA_SS 12, 32\r\n    FILTER_VER_CHROMA_SS 4, 32\r\n    FILTER_VER_CHROMA_SS 32, 64\r\n    FILTER_VER_CHROMA_SS 16, 64\r\n    FILTER_VER_CHROMA_SS 32, 48\r\n    FILTER_VER_CHROMA_SS 24, 64\r\n\r\n    FILTER_VER_CHROMA_SS 64, 64\r\n    FILTER_VER_CHROMA_SS 64, 32\r\n    FILTER_VER_CHROMA_SS 64, 48\r\n    FILTER_VER_CHROMA_SS 48, 64\r\n    FILTER_VER_CHROMA_SS 64, 16\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_4x4 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_4x4, 4, 6, 7\r\n    mov             r4d, r4m\r\n    add             r1d, r1d\r\n    shl             r4d, 6\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n%ifidn %1,sp\r\n    mova            m6, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    movq            xm0, [r0]\r\n    movq            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movq            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m0, m0, xm1, 1                  ; m0 = [2 1 1 0]\r\n    pmaddwd         m0, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm2, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m2, m2, xm3, 1                  ; m2 = [4 3 3 2]\r\n    pmaddwd         m5, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    
paddd           m0, m5\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm4, xm3\r\n    movq            xm1, [r0 + r1 * 2]\r\n    punpcklwd       xm3, xm1\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [6 5 5 4]\r\n    pmaddwd         m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m4\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m6\r\n    paddd           m2, m6\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n%endif\r\n    packssdw        m0, m2\r\n    vextracti128    xm2, m0, 1\r\n    lea             r4, [r3 * 3]\r\n\r\n%ifidn %1,sp\r\n    packuswb        xm0, xm2\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 2\r\n    pextrd          [r2 + r3 * 2], xm0, 1\r\n    pextrd          [r2 + r4], xm0, 3\r\n%else\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r4], xm2\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_4x4 sp\r\n    FILTER_VER_CHROMA_S_AVX2_4x4 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_4x8 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_4x8, 4, 6, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    movq            xm0, [r0]\r\n    movq            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movq            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m0, m0, xm1, 1                  ; m0 = [2 1 1 0]\r\n    pmaddwd         m0, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd    
   xm2, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m2, m2, xm3, 1                  ; m2 = [4 3 3 2]\r\n    pmaddwd         m5, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    paddd           m0, m5\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm4, xm3\r\n    movq            xm1, [r0 + r1 * 2]\r\n    punpcklwd       xm3, xm1\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [6 5 5 4]\r\n    pmaddwd         m5, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m5\r\n    pmaddwd         m4, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm1, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm6, [r0]\r\n    punpcklwd       xm3, xm6\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [8 7 7 6]\r\n    pmaddwd         m5, m1, [r5 + 1 * mmsize]\r\n    paddd           m4, m5\r\n    pmaddwd         m1, [r5]\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm6, xm3\r\n    movq            xm5, [r0 + 2 * r1]\r\n    punpcklwd       xm3, xm5\r\n    vinserti128     m6, m6, xm3, 1                  ; m6 = [A 9 9 8]\r\n    pmaddwd         m6, [r5 + 1 * mmsize]\r\n    paddd           m1, m6\r\n    lea             r4, [r3 * 3]\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m7\r\n    paddd           m2, m7\r\n    paddd           m4, m7\r\n    paddd           m1, m7\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n    psrad           m4, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n    psrad           m4, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m2\r\n    packssdw        m4, m1\r\n%ifidn %1,sp\r\n    packuswb        m0, m4\r\n    vextracti128    xm2, m0, 1\r\n    movd            [r2], xm0\r\n    movd            [r2 + r3], xm2\r\n    pextrd          [r2 + r3 * 2], xm0, 1\r\n    pextrd          [r2 + r4], 
xm2, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm0, 2\r\n    pextrd          [r2 + r3], xm2, 2\r\n    pextrd          [r2 + r3 * 2], xm0, 3\r\n    pextrd          [r2 + r4], xm2, 3\r\n%else\r\n    vextracti128    xm2, m0, 1\r\n    vextracti128    xm1, m4, 1\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r4], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm4\r\n    movq            [r2 + r3], xm1\r\n    movhps          [r2 + r3 * 2], xm4\r\n    movhps          [r2 + r4], xm1\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_4x8 sp\r\n    FILTER_VER_CHROMA_S_AVX2_4x8 ss\r\n\r\n%macro PROCESS_CHROMA_AVX2_W4_16R 1\r\n    movq            xm0, [r0]\r\n    movq            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movq            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m0, m0, xm1, 1                  ; m0 = [2 1 1 0]\r\n    pmaddwd         m0, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm2, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m2, m2, xm3, 1                  ; m2 = [4 3 3 2]\r\n    pmaddwd         m5, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    paddd           m0, m5\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm4, xm3\r\n    movq            xm1, [r0 + r1 * 2]\r\n    punpcklwd       xm3, xm1\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [6 5 5 4]\r\n    pmaddwd         m5, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m5\r\n    pmaddwd         m4, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm1, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm6, [r0]\r\n    punpcklwd       xm3, xm6\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [8 7 
7 6]\r\n    pmaddwd         m5, m1, [r5 + 1 * mmsize]\r\n    paddd           m4, m5\r\n    pmaddwd         m1, [r5]\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm6, xm3\r\n    movq            xm5, [r0 + 2 * r1]\r\n    punpcklwd       xm3, xm5\r\n    vinserti128     m6, m6, xm3, 1                  ; m6 = [10 9 9 8]\r\n    pmaddwd         m3, m6, [r5 + 1 * mmsize]\r\n    paddd           m1, m3\r\n    pmaddwd         m6, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m7\r\n    paddd           m2, m7\r\n    paddd           m4, m7\r\n    paddd           m1, m7\r\n    psrad           m4, 12\r\n    psrad           m1, 12\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n    psrad           m4, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m2\r\n    packssdw        m4, m1\r\n%ifidn %1,sp\r\n    packuswb        m0, m4\r\n    vextracti128    xm4, m0, 1\r\n    movd            [r2], xm0\r\n    movd            [r2 + r3], xm4\r\n    pextrd          [r2 + r3 * 2], xm0, 1\r\n    pextrd          [r2 + r6], xm4, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm0, 2\r\n    pextrd          [r2 + r3], xm4, 2\r\n    pextrd          [r2 + r3 * 2], xm0, 3\r\n    pextrd          [r2 + r6], xm4, 3\r\n%else\r\n    vextracti128    xm2, m0, 1\r\n    vextracti128    xm1, m4, 1\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r6], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm4\r\n    movq            [r2 + r3], xm1\r\n    movhps          [r2 + r3 * 2], xm4\r\n    movhps          [r2 + r6], xm1\r\n%endif\r\n\r\n    movq            xm2, [r0 + r4]\r\n    punpcklwd       xm5, xm2\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm0, [r0]\r\n    punpcklwd       xm2, xm0\r\n    vinserti128     m5, m5, xm2, 1             
     ; m5 = [12 11 11 10]\r\n    pmaddwd         m2, m5, [r5 + 1 * mmsize]\r\n    paddd           m6, m2\r\n    pmaddwd         m5, [r5]\r\n    movq            xm2, [r0 + r1]\r\n    punpcklwd       xm0, xm2\r\n    movq            xm3, [r0 + 2 * r1]\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m0, m0, xm2, 1                  ; m0 = [14 13 13 12]\r\n    pmaddwd         m2, m0, [r5 + 1 * mmsize]\r\n    paddd           m5, m2\r\n    pmaddwd         m0, [r5]\r\n    movq            xm4, [r0 + r4]\r\n    punpcklwd       xm3, xm4\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm1, [r0]\r\n    punpcklwd       xm4, xm1\r\n    vinserti128     m3, m3, xm4, 1                  ; m3 = [16 15 15 14]\r\n    pmaddwd         m4, m3, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m3, [r5]\r\n    movq            xm4, [r0 + r1]\r\n    punpcklwd       xm1, xm4\r\n    movq            xm2, [r0 + 2 * r1]\r\n    punpcklwd       xm4, xm2\r\n    vinserti128     m1, m1, xm4, 1                  ; m1 = [18 17 17 16]\r\n    pmaddwd         m1, [r5 + 1 * mmsize]\r\n    paddd           m3, m1\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m7\r\n    paddd           m5, m7\r\n    paddd           m0, m7\r\n    paddd           m3, m7\r\n    psrad           m6, 12\r\n    psrad           m5, 12\r\n    psrad           m0, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m5, 6\r\n    psrad           m0, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m6, m5\r\n    packssdw        m0, m3\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n%ifidn %1,sp\r\n    packuswb        m6, m0\r\n    vextracti128    xm0, m6, 1\r\n    movd            [r2], xm6\r\n    movd            [r2 + r3], xm0\r\n    pextrd          [r2 + r3 * 2], xm6, 1\r\n    pextrd          [r2 + r6], xm0, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm6, 2\r\n    pextrd          [r2 + r3], xm0, 2\r\n    pextrd      
    [r2 + r3 * 2], xm6, 3\r\n    pextrd          [r2 + r6], xm0, 3\r\n%else\r\n    vextracti128    xm5, m6, 1\r\n    vextracti128    xm3, m0, 1\r\n    movq            [r2], xm6\r\n    movq            [r2 + r3], xm5\r\n    movhps          [r2 + r3 * 2], xm6\r\n    movhps          [r2 + r6], xm5\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm3\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r6], xm3\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_4x16 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_4x16, 4, 7, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    PROCESS_CHROMA_AVX2_W4_16R %1\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_4x16 sp\r\n    FILTER_VER_CHROMA_S_AVX2_4x16 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_4x32 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_4x32, 4, 7, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n%rep 2\r\n    PROCESS_CHROMA_AVX2_W4_16R %1\r\n    lea             r2, [r2 + r3 * 4]\r\n%endrep\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_4x32 sp\r\n    
FILTER_VER_CHROMA_S_AVX2_4x32 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_4x2 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_4x2, 4, 6, 6\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n%ifidn %1,sp\r\n    mova            m5, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    movq            xm0, [r0]\r\n    movq            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movq            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m0, m0, xm1, 1                  ; m0 = [2 1 1 0]\r\n    pmaddwd         m0, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm2, xm3\r\n    movq            xm4, [r0 + 4 * r1]\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m2, m2, xm3, 1                  ; m2 = [4 3 3 2]\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m2\r\n%ifidn %1,sp\r\n    paddd           m0, m5\r\n    psrad           m0, 12\r\n%else\r\n    psrad           m0, 6\r\n%endif\r\n    vextracti128    xm1, m0, 1\r\n    packssdw        xm0, xm1\r\n%ifidn %1,sp\r\n    packuswb        xm0, xm0\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n%else\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_4x2 sp\r\n    FILTER_VER_CHROMA_S_AVX2_4x2 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_2x4 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_2x4, 4, 6, 6\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea       
      r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n%ifidn %1,sp\r\n    mova            m5, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    movd            xm0, [r0]\r\n    movd            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movd            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    punpcklqdq      xm0, xm1                        ; m0 = [2 1 1 0]\r\n    movd            xm3, [r0 + r4]\r\n    punpcklwd       xm2, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movd            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    punpcklqdq      xm2, xm3                        ; m2 = [4 3 3 2]\r\n    vinserti128     m0, m0, xm2, 1                  ; m0 = [4 3 3 2 2 1 1 0]\r\n    movd            xm1, [r0 + r1]\r\n    punpcklwd       xm4, xm1\r\n    movd            xm3, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm3\r\n    punpcklqdq      xm4, xm1                        ; m4 = [6 5 5 4]\r\n    vinserti128     m2, m2, xm4, 1                  ; m2 = [6 5 5 4 4 3 3 2]\r\n    pmaddwd         m0, [r5]\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m2\r\n%ifidn %1,sp\r\n    paddd           m0, m5\r\n    psrad           m0, 12\r\n%else\r\n    psrad           m0, 6\r\n%endif\r\n    vextracti128    xm1, m0, 1\r\n    packssdw        xm0, xm1\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,sp\r\n    packuswb        xm0, xm0\r\n    pextrw          [r2], xm0, 0\r\n    pextrw          [r2 + r3], xm0, 1\r\n    pextrw          [r2 + 2 * r3], xm0, 2\r\n    pextrw          [r2 + r4], xm0, 3\r\n%else\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    pextrd          [r2 + 2 * r3], xm0, 2\r\n    pextrd          [r2 + r4], xm0, 3\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_2x4 sp\r\n    FILTER_VER_CHROMA_S_AVX2_2x4 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_8x8 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x8, 4, 6, 
8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    paddd           m0, m4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    pmaddwd         m3, [r5]\r\n    paddd           m1, m5\r\n%ifidn %1,sp\r\n    paddd           m0, m7\r\n    paddd           m1, m7\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m1\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * 
mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm1, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    pmaddwd         m5, [r5]\r\n    paddd           m3, m1\r\n%ifidn %1,sp\r\n    paddd           m2, m7\r\n    paddd           m3, m7\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m2, m3\r\n\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 7\r\n    punpckhwd       xm3, xm6, xm1\r\n    punpcklwd       xm6, xm1\r\n    vinserti128     m6, m6, xm3, 1\r\n    pmaddwd         m3, m6, [r5 + 1 * mmsize]\r\n    pmaddwd         m6, [r5]\r\n    paddd           m4, m3\r\n\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m3, [interp8_hps_shuf]\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r4], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    movu            [r2], xm0\r\n    vextracti128    xm0, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3], xm0\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm3\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm0, [r0]                       ; m0 = row 8\r\n    punpckhwd       xm2, xm1, xm0\r\n    punpcklwd       xm1, xm0\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m2, m1, [r5 + 1 * mmsize]\r\n    pmaddwd         m1, [r5]\r\n    paddd           m5, m2\r\n%ifidn %1,sp\r\n    paddd           m4, m7\r\n    paddd           
m5, m7\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m4, m5\r\n\r\n    movu            xm2, [r0 + r1]                  ; m2 = row 9\r\n    punpckhwd       xm5, xm0, xm2\r\n    punpcklwd       xm0, xm2\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddwd         m0, [r5 + 1 * mmsize]\r\n    paddd           m6, m0\r\n    movu            xm5, [r0 + r1 * 2]              ; m5 = row 10\r\n    punpckhwd       xm0, xm2, xm5\r\n    punpcklwd       xm2, xm5\r\n    vinserti128     m2, m2, xm0, 1\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m1, m2\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m7\r\n    paddd           m1, m7\r\n    psrad           m6, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m6, m1\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m3, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r2], xm4\r\n    movhps          [r2 + r3], xm4\r\n    movq            [r2 + r3 * 2], xm6\r\n    movhps          [r2 + r4], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm5, m4, 1\r\n    vextracti128    xm1, m6, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm5\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r4], xm1\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_8x8 sp\r\n    FILTER_VER_CHROMA_S_AVX2_8x8 ss\r\n\r\n%macro PROCESS_CHROMA_S_AVX2_W8_16R 1\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = 
row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r7, [r0 + r1 * 4]\r\n    movu            xm4, [r7]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n    movu            xm5, [r7 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r7 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm7, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddwd         m7, m5, [r5 + 1 * mmsize]\r\n    paddd           m3, m7\r\n    pmaddwd         m5, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m9\r\n    paddd           m1, m9\r\n    paddd           m2, m9\r\n    paddd           m3, m9\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m0, m1\r\n    packssdw        m2, m3\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m3, [interp8_hps_shuf]\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          
[r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n\r\n    movu            xm7, [r7 + r4]                  ; m7 = row 7\r\n    punpckhwd       xm8, xm6, xm7\r\n    punpcklwd       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddwd         m8, m6, [r5 + 1 * mmsize]\r\n    paddd           m4, m8\r\n    pmaddwd         m6, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm8, [r7]                       ; m8 = row 8\r\n    punpckhwd       xm0, xm7, xm8\r\n    punpcklwd       xm7, xm8\r\n    vinserti128     m7, m7, xm0, 1\r\n    pmaddwd         m0, m7, [r5 + 1 * mmsize]\r\n    paddd           m5, m0\r\n    pmaddwd         m7, [r5]\r\n    movu            xm0, [r7 + r1]                  ; m0 = row 9\r\n    punpckhwd       xm1, xm8, xm0\r\n    punpcklwd       xm8, xm0\r\n    vinserti128     m8, m8, xm1, 1\r\n    pmaddwd         m1, m8, [r5 + 1 * mmsize]\r\n    paddd           m6, m1\r\n    pmaddwd         m8, [r5]\r\n    movu            xm1, [r7 + r1 * 2]              ; m1 = row 10\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m2, m0, [r5 + 1 * mmsize]\r\n    paddd           m7, m2\r\n    pmaddwd         m0, [r5]\r\n%ifidn %1,sp\r\n    paddd           m4, m9\r\n    paddd           m5, m9\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n    paddd           m6, m9\r\n    paddd           m7, m9\r\n    psrad           m6, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n    psrad           m6, 6\r\n    psrad           m7, 6\r\n%endif\r\n 
   packssdw        m4, m5\r\n    packssdw        m6, m7\r\n    lea             r8, [r2 + r3 * 4]\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m3, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r8], xm4\r\n    movhps          [r8 + r3], xm4\r\n    movq            [r8 + r3 * 2], xm6\r\n    movhps          [r8 + r6], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm5, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r8], xm4\r\n    movu            [r8 + r3], xm5\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm7\r\n%endif\r\n\r\n    movu            xm2, [r7 + r4]                  ; m2 = row 11\r\n    punpckhwd       xm4, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm4, 1\r\n    pmaddwd         m4, m1, [r5 + 1 * mmsize]\r\n    paddd           m8, m4\r\n    pmaddwd         m1, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm4, [r7]                       ; m4 = row 12\r\n    punpckhwd       xm5, xm2, xm4\r\n    punpcklwd       xm2, xm4\r\n    vinserti128     m2, m2, xm5, 1\r\n    pmaddwd         m5, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m5\r\n    pmaddwd         m2, [r5]\r\n    movu            xm5, [r7 + r1]                  ; m5 = row 13\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m1, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r7 + r1 * 2]              ; m6 = row 14\r\n    punpckhwd       xm7, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddwd         m7, m5, [r5 + 1 * mmsize]\r\n    paddd           m2, m7\r\n    pmaddwd         m5, [r5]\r\n%ifidn %1,sp\r\n    paddd           m8, m9\r\n    paddd           m0, m9\r\n    paddd           m1, m9\r\n    paddd       
    m2, m9\r\n    psrad           m8, 12\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m8, 6\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n    psrad           m2, 6\r\n%endif\r\n    packssdw        m8, m0\r\n    packssdw        m1, m2\r\n    lea             r8, [r8 + r3 * 4]\r\n%ifidn %1,sp\r\n    packuswb        m8, m1\r\n    vpermd          m8, m3, m8\r\n    vextracti128    xm1, m8, 1\r\n    movq            [r8], xm8\r\n    movhps          [r8 + r3], xm8\r\n    movq            [r8 + r3 * 2], xm1\r\n    movhps          [r8 + r6], xm1\r\n%else\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m1, m1, 11011000b\r\n    vextracti128    xm0, m8, 1\r\n    vextracti128    xm2, m1, 1\r\n    movu            [r8], xm8\r\n    movu            [r8 + r3], xm0\r\n    movu            [r8 + r3 * 2], xm1\r\n    movu            [r8 + r6], xm2\r\n%endif\r\n    lea             r8, [r8 + r3 * 4]\r\n\r\n    movu            xm7, [r7 + r4]                  ; m7 = row 15\r\n    punpckhwd       xm2, xm6, xm7\r\n    punpcklwd       xm6, xm7\r\n    vinserti128     m6, m6, xm2, 1\r\n    pmaddwd         m2, m6, [r5 + 1 * mmsize]\r\n    paddd           m4, m2\r\n    pmaddwd         m6, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm2, [r7]                       ; m2 = row 16\r\n    punpckhwd       xm1, xm7, xm2\r\n    punpcklwd       xm7, xm2\r\n    vinserti128     m7, m7, xm1, 1\r\n    pmaddwd         m1, m7, [r5 + 1 * mmsize]\r\n    paddd           m5, m1\r\n    pmaddwd         m7, [r5]\r\n    movu            xm1, [r7 + r1]                  ; m1 = row 17\r\n    punpckhwd       xm0, xm2, xm1\r\n    punpcklwd       xm2, xm1\r\n    vinserti128     m2, m2, xm0, 1\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m6, m2\r\n    movu            xm0, [r7 + r1 * 2]              ; m0 = row 18\r\n    punpckhwd       xm2, xm1, xm0\r\n    punpcklwd       xm1, 
xm0\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m1, [r5 + 1 * mmsize]\r\n    paddd           m7, m1\r\n\r\n%ifidn %1,sp\r\n    paddd           m4, m9\r\n    paddd           m5, m9\r\n    paddd           m6, m9\r\n    paddd           m7, m9\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n    psrad           m6, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n    psrad           m6, 6\r\n    psrad           m7, 6\r\n%endif\r\n    packssdw        m4, m5\r\n    packssdw        m6, m7\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m3, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r8], xm4\r\n    movhps          [r8 + r3], xm4\r\n    movq            [r8 + r3 * 2], xm6\r\n    movhps          [r8 + r6], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm5, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r8], xm4\r\n    movu            [r8 + r3], xm5\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm7\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_Nx16 2\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_4tap_vert_%1_%2x16, 4, 10, 10\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m9, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r9d, %2 / 8\r\n.loopW:\r\n    PROCESS_CHROMA_S_AVX2_W8_16R %1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    dec            
 r9d\r\n    jnz             .loopW\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 16\r\n    FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 32\r\n    FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 64\r\n    FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 16\r\n    FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 32\r\n    FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 64\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_NxN 3\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_4tap_vert_%3_%1x%2, 4, 11, 10\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %3,sp\r\n    mova            m9, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r9d, %2 / 16\r\n.loopH:\r\n    mov             r10d, %1 / 8\r\n.loopW:\r\n    PROCESS_CHROMA_S_AVX2_W8_16R %3\r\n%ifidn %3,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    dec             r10d\r\n    jnz             .loopW\r\n    lea             r0, [r7 - 2 * %1 + 16]\r\n%ifidn %3,sp\r\n    lea             r2, [r8 + r3 * 4 - %1 + 8]\r\n%else\r\n    lea             r2, [r8 + r3 * 4 - 2 * %1 + 16]\r\n%endif\r\n    dec             r9d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, 
sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, sp\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, ss\r\n    FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, ss\r\n\r\n%macro PROCESS_CHROMA_S_AVX2_W8_4R 1\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m4\r\n    movu            xm6, [r0 + r1 * 2]   
           ; m6 = row 6\r\n    punpckhwd       xm4, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm4, 1\r\n    pmaddwd         m5, [r5 + 1 * mmsize]\r\n    paddd           m3, m5\r\n%ifidn %1,sp\r\n    paddd           m0, m7\r\n    paddd           m1, m7\r\n    paddd           m2, m7\r\n    paddd           m3, m7\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m0, m1\r\n    packssdw        m2, m3\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m3, [interp8_hps_shuf]\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm2, m0, 1\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_8x4 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x4, 4, 6, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    PROCESS_CHROMA_S_AVX2_W8_4R %1\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,sp\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r4], xm2\r\n%else\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm3\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    
FILTER_VER_CHROMA_S_AVX2_8x4 sp\r\n    FILTER_VER_CHROMA_S_AVX2_8x4 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_12x16 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_4tap_vert_%1_12x16, 4, 9, 10\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m9, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    PROCESS_CHROMA_S_AVX2_W8_16R %1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    mova            m7, m9\r\n    PROCESS_CHROMA_AVX2_W4_16R %1\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_12x16 sp\r\n    FILTER_VER_CHROMA_S_AVX2_12x16 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_12x32 1\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_12x32, 4, 9, 10\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1, sp\r\n    mova            m9, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n%rep 2\r\n    PROCESS_CHROMA_S_AVX2_W8_16R %1\r\n%ifidn %1, sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    mova            m7, m9\r\n    PROCESS_CHROMA_AVX2_W4_16R %1\r\n    sub             r0, 16\r\n%ifidn %1, sp\r\n    lea             r2, [r2 + r3 * 4 - 8]\r\n%else\r\n    lea             r2, 
[r2 + r3 * 4 - 16]\r\n%endif\r\n%endrep\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_12x32 sp\r\n    FILTER_VER_CHROMA_S_AVX2_12x32 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_16x12 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_4tap_vert_%1_16x12, 4, 9, 9\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m8, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n%rep 2\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r7, [r0 + r1 * 4]\r\n    movu            xm4, [r7]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m8\r\n    paddd           m1, m8\r\n    psrad           m0, 12\r\n    psrad           m1, 
12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m1\r\n\r\n    movu            xm5, [r7 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r7 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm1, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    pmaddwd         m5, [r5]\r\n    paddd           m3, m1\r\n%ifidn %1,sp\r\n    paddd           m2, m8\r\n    paddd           m3, m8\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m2, m3\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m3, [interp8_hps_shuf]\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    movu            [r2], xm0\r\n    vextracti128    xm0, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3], xm0\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n    lea             r8, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r7 + r4]                  ; m1 = row 7\r\n    punpckhwd       xm0, xm6, xm1\r\n    punpcklwd       xm6, xm1\r\n    vinserti128     m6, m6, xm0, 1\r\n    pmaddwd         m0, m6, [r5 + 1 * mmsize]\r\n    pmaddwd         m6, [r5]\r\n    paddd           m4, m0\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm0, [r7]                       ; m0 = row 
8\r\n    punpckhwd       xm2, xm1, xm0\r\n    punpcklwd       xm1, xm0\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m2, m1, [r5 + 1 * mmsize]\r\n    pmaddwd         m1, [r5]\r\n    paddd           m5, m2\r\n%ifidn %1,sp\r\n    paddd           m4, m8\r\n    paddd           m5, m8\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m4, m5\r\n\r\n    movu            xm2, [r7 + r1]                  ; m2 = row 9\r\n    punpckhwd       xm5, xm0, xm2\r\n    punpcklwd       xm0, xm2\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddwd         m5, m0, [r5 + 1 * mmsize]\r\n    paddd           m6, m5\r\n    pmaddwd         m0, [r5]\r\n    movu            xm5, [r7 + r1 * 2]              ; m5 = row 10\r\n    punpckhwd       xm7, xm2, xm5\r\n    punpcklwd       xm2, xm5\r\n    vinserti128     m2, m2, xm7, 1\r\n    pmaddwd         m7, m2, [r5 + 1 * mmsize]\r\n    paddd           m1, m7\r\n    pmaddwd         m2, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m8\r\n    paddd           m1, m8\r\n    psrad           m6, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m6, m1\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m3, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r8], xm4\r\n    movhps          [r8 + r3], xm4\r\n    movq            [r8 + r3 * 2], xm6\r\n    movhps          [r8 + r6], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m4, 1\r\n    vextracti128    xm1, m6, 1\r\n    movu            [r8], xm4\r\n    movu            [r8 + r3], xm7\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm1\r\n%endif\r\n    lea             r8, [r8 + r3 * 4]\r\n\r\n    movu            xm7, [r7 + r4]                  ; m7 = row 11\r\n    
punpckhwd       xm1, xm5, xm7\r\n    punpcklwd       xm5, xm7\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    paddd           m0, m1\r\n    pmaddwd         m5, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm1, [r7]                       ; m1 = row 12\r\n    punpckhwd       xm4, xm7, xm1\r\n    punpcklwd       xm7, xm1\r\n    vinserti128     m7, m7, xm4, 1\r\n    pmaddwd         m4, m7, [r5 + 1 * mmsize]\r\n    paddd           m2, m4\r\n    pmaddwd         m7, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m8\r\n    paddd           m2, m8\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n%endif\r\n    packssdw        m0, m2\r\n\r\n    movu            xm4, [r7 + r1]                  ; m4 = row 13\r\n    punpckhwd       xm2, xm1, xm4\r\n    punpcklwd       xm1, xm4\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m1, [r5 + 1 * mmsize]\r\n    paddd           m5, m1\r\n    movu            xm2, [r7 + r1 * 2]              ; m2 = row 14\r\n    punpckhwd       xm6, xm4, xm2\r\n    punpcklwd       xm4, xm2\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m4, [r5 + 1 * mmsize]\r\n    paddd           m7, m4\r\n%ifidn %1,sp\r\n    paddd           m5, m8\r\n    paddd           m7, m8\r\n    psrad           m5, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m5, 6\r\n    psrad           m7, 6\r\n%endif\r\n    packssdw        m5, m7\r\n%ifidn %1,sp\r\n    packuswb        m0, m5\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm5, m0, 1\r\n    movq            [r8], xm0\r\n    movhps          [r8 + r3], xm0\r\n    movq            [r8 + r3 * 2], xm5\r\n    movhps          [r8 + r6], xm5\r\n    add             r2, 8\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m5, m5, 11011000b\r\n    vextracti128    xm7, m0, 1\r\n    vextracti128    xm6, m5, 1\r\n    movu     
       [r8], xm0\r\n    movu            [r8 + r3], xm7\r\n    movu            [r8 + r3 * 2], xm5\r\n    movu            [r8 + r6], xm6\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n%endrep\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_16x12 sp\r\n    FILTER_VER_CHROMA_S_AVX2_16x12 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_8x12 1\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x12, 4, 7, 9\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m8, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, 
m5\r\n    pmaddwd         m3, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m8\r\n    paddd           m1, m8\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m1\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm1, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    pmaddwd         m5, [r5]\r\n    paddd           m3, m1\r\n%ifidn %1,sp\r\n    paddd           m2, m8\r\n    paddd           m3, m8\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m2, m3\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m3, [interp8_hps_shuf]\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    movu            [r2], xm0\r\n    vextracti128    xm0, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3], xm0\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 7\r\n    punpckhwd       xm0, xm6, xm1\r\n    punpcklwd       xm6, xm1\r\n    vinserti128     m6, m6, xm0, 1\r\n    pmaddwd         m0, m6, [r5 + 1 * 
mmsize]\r\n    pmaddwd         m6, [r5]\r\n    paddd           m4, m0\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm0, [r0]                       ; m0 = row 8\r\n    punpckhwd       xm2, xm1, xm0\r\n    punpcklwd       xm1, xm0\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m2, m1, [r5 + 1 * mmsize]\r\n    pmaddwd         m1, [r5]\r\n    paddd           m5, m2\r\n%ifidn %1,sp\r\n    paddd           m4, m8\r\n    paddd           m5, m8\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m4, m5\r\n\r\n    movu            xm2, [r0 + r1]                  ; m2 = row 9\r\n    punpckhwd       xm5, xm0, xm2\r\n    punpcklwd       xm0, xm2\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddwd         m5, m0, [r5 + 1 * mmsize]\r\n    paddd           m6, m5\r\n    pmaddwd         m0, [r5]\r\n    movu            xm5, [r0 + r1 * 2]              ; m5 = row 10\r\n    punpckhwd       xm7, xm2, xm5\r\n    punpcklwd       xm2, xm5\r\n    vinserti128     m2, m2, xm7, 1\r\n    pmaddwd         m7, m2, [r5 + 1 * mmsize]\r\n    paddd           m1, m7\r\n    pmaddwd         m2, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m8\r\n    paddd           m1, m8\r\n    psrad           m6, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m6, m1\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m3, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r2], xm4\r\n    movhps          [r2 + r3], xm4\r\n    movq            [r2 + r3 * 2], xm6\r\n    movhps          [r2 + r6], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m4, 1\r\n    vextracti128    xm1, m6, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm7\r\n    movu            [r2 + r3 * 2], 
xm6\r\n    movu            [r2 + r6], xm1\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 11\r\n    punpckhwd       xm1, xm5, xm7\r\n    punpcklwd       xm5, xm7\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    paddd           m0, m1\r\n    pmaddwd         m5, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm1, [r0]                       ; m1 = row 12\r\n    punpckhwd       xm4, xm7, xm1\r\n    punpcklwd       xm7, xm1\r\n    vinserti128     m7, m7, xm4, 1\r\n    pmaddwd         m4, m7, [r5 + 1 * mmsize]\r\n    paddd           m2, m4\r\n    pmaddwd         m7, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m8\r\n    paddd           m2, m8\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n%endif\r\n    packssdw        m0, m2\r\n\r\n    movu            xm4, [r0 + r1]                  ; m4 = row 13\r\n    punpckhwd       xm2, xm1, xm4\r\n    punpcklwd       xm1, xm4\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m1, [r5 + 1 * mmsize]\r\n    paddd           m5, m1\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 14\r\n    punpckhwd       xm6, xm4, xm2\r\n    punpcklwd       xm4, xm2\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m4, [r5 + 1 * mmsize]\r\n    paddd           m7, m4\r\n%ifidn %1,sp\r\n    paddd           m5, m8\r\n    paddd           m7, m8\r\n    psrad           m5, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m5, 6\r\n    psrad           m7, 6\r\n%endif\r\n    packssdw        m5, m7\r\n%ifidn %1,sp\r\n    packuswb        m0, m5\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm5, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r6], xm5\r\n%else\r\n    vpermq         
 m0, m0, 11011000b\r\n    vpermq          m5, m5, 11011000b\r\n    vextracti128    xm7, m0, 1\r\n    vextracti128    xm6, m5, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm7\r\n    movu            [r2 + r3 * 2], xm5\r\n    movu            [r2 + r6], xm6\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_8x12 sp\r\n    FILTER_VER_CHROMA_S_AVX2_8x12 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_16x4 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_16x4, 4, 7, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n%rep 2\r\n    PROCESS_CHROMA_S_AVX2_W8_4R %1\r\n    lea             r6, [r3 * 3]\r\n%ifidn %1,sp\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n    add             r2, 8\r\n%else\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    add             r2, 16\r\n%endif\r\n    lea             r6, [4 * r1 - 16]\r\n    sub             r0, r6\r\n%endrep\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_16x4 sp\r\n    FILTER_VER_CHROMA_S_AVX2_16x4 ss\r\n\r\n%macro PROCESS_CHROMA_S_AVX2_W8_8R 1\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = 
row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r7, [r0 + r1 * 4]\r\n    movu            xm4, [r7]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m7\r\n    paddd           m1, m7\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m1\r\n\r\n    movu            xm5, [r7 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r7 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm1, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    pmaddwd         m5, [r5]\r\n    paddd           m3, m1\r\n%ifidn %1,sp\r\n    paddd           m2, m7\r\n    paddd           m3, m7\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m2, m3\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m3, [interp8_hps_shuf]\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq         
   [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    movu            [r2], xm0\r\n    vextracti128    xm0, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3], xm0\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n    lea             r8, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r7 + r4]                  ; m1 = row 7\r\n    punpckhwd       xm0, xm6, xm1\r\n    punpcklwd       xm6, xm1\r\n    vinserti128     m6, m6, xm0, 1\r\n    pmaddwd         m0, m6, [r5 + 1 * mmsize]\r\n    pmaddwd         m6, [r5]\r\n    paddd           m4, m0\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm0, [r7]                       ; m0 = row 8\r\n    punpckhwd       xm2, xm1, xm0\r\n    punpcklwd       xm1, xm0\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m2, m1, [r5 + 1 * mmsize]\r\n    pmaddwd         m1, [r5]\r\n    paddd           m5, m2\r\n%ifidn %1,sp\r\n    paddd           m4, m7\r\n    paddd           m5, m7\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m4, m5\r\n\r\n    movu            xm2, [r7 + r1]                  ; m2 = row 9\r\n    punpckhwd       xm5, xm0, xm2\r\n    punpcklwd       xm0, xm2\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddwd         m0, [r5 + 1 * mmsize]\r\n    paddd           m6, m0\r\n    movu            xm5, [r7 + r1 * 2]              ; m5 = row 10\r\n    punpckhwd       xm0, xm2, xm5\r\n    punpcklwd       xm2, xm5\r\n    vinserti128     m2, m2, xm0, 1\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m1, m2\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m7\r\n    paddd           m1, m7\r\n    psrad           m6, 12\r\n    psrad           m1, 
12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m6, m1\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m3, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r8], xm4\r\n    movhps          [r8 + r3], xm4\r\n    movq            [r8 + r3 * 2], xm6\r\n    movhps          [r8 + r6], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m4, 1\r\n    vextracti128    xm1, m6, 1\r\n    movu            [r8], xm4\r\n    movu            [r8 + r3], xm7\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm1\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_Nx8 2\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_4tap_vert_%1_%2x8, 4, 9, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n%rep %2 / 8\r\n    PROCESS_CHROMA_S_AVX2_W8_8R %1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n%endrep\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 32\r\n    FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 16\r\n    FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 32\r\n    FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 16\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_8x2 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x2, 4, 6, 6\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, 
r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m5, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m2\r\n    movu            xm4, [r0 + r1 * 4]              ; m4 = row 4\r\n    punpckhwd       xm2, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm2, 1\r\n    pmaddwd         m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m3\r\n%ifidn %1,sp\r\n    paddd           m0, m5\r\n    paddd           m1, m5\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m1\r\n%ifidn %1,sp\r\n    vextracti128    xm1, m0, 1\r\n    packuswb        xm0, xm1\r\n    pshufd          xm0, xm0, 11011000b\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_8x2 sp\r\n    FILTER_VER_CHROMA_S_AVX2_8x2 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_8x6 
1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_8x6, 4, 6, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    paddd           m0, m4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    pmaddwd         m3, [r5]\r\n    paddd           m1, m5\r\n%ifidn %1,sp\r\n    paddd           m0, m7\r\n    paddd           m1, m7\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m1\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128    
 m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm1, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    pmaddwd         m5, [r5]\r\n    paddd           m3, m1\r\n%ifidn %1,sp\r\n    paddd           m2, m7\r\n    paddd           m3, m7\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m2, m3\r\n\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 7\r\n    punpckhwd       xm3, xm6, xm1\r\n    punpcklwd       xm6, xm1\r\n    vinserti128     m6, m6, xm3, 1\r\n    pmaddwd         m6, [r5 + 1 * mmsize]\r\n    paddd           m4, m6\r\n    movu            xm6, [r0 + r1 * 4]              ; m6 = row 8\r\n    punpckhwd       xm3, xm1, xm6\r\n    punpcklwd       xm1, xm6\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5 + 1 * mmsize]\r\n    paddd           m5, m1\r\n%ifidn %1,sp\r\n    paddd           m4, m7\r\n    paddd           m5, m7\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m4, m5\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m3, [interp8_hps_shuf]\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm2, m0, 1\r\n    vextracti128    xm5, m4, 1\r\n    packuswb        xm4, xm5\r\n    pshufd          xm4, xm4, 11011000b\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r4], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm4\r\n    movhps          [r2 + r3], xm4\r\n%else\r\n    vpermq       
   m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    movu            [r2], xm0\r\n    vextracti128    xm0, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    vextracti128    xm5, m4, 1\r\n    movu            [r2 + r3], xm0\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm5\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_8x6 sp\r\n    FILTER_VER_CHROMA_S_AVX2_8x6 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_8xN 2\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_4tap_vert_%1_8x%2, 4, 7, 9\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m8, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n%rep %2 / 16\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r0, [r0 + 
r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m8\r\n    paddd           m1, m8\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m1\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm1, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    pmaddwd         m5, [r5]\r\n    paddd           m3, m1\r\n%ifidn %1,sp\r\n    paddd           m2, m8\r\n    paddd           m3, m8\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m2, m3\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m3, [interp8_hps_shuf]\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    movu            [r2], xm0\r\n    vextracti128    xm0, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2 + r3], xm0\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n    lea   
          r2, [r2 + r3 * 4]\r\n\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 7\r\n    punpckhwd       xm0, xm6, xm1\r\n    punpcklwd       xm6, xm1\r\n    vinserti128     m6, m6, xm0, 1\r\n    pmaddwd         m0, m6, [r5 + 1 * mmsize]\r\n    pmaddwd         m6, [r5]\r\n    paddd           m4, m0\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm0, [r0]                       ; m0 = row 8\r\n    punpckhwd       xm2, xm1, xm0\r\n    punpcklwd       xm1, xm0\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m2, m1, [r5 + 1 * mmsize]\r\n    pmaddwd         m1, [r5]\r\n    paddd           m5, m2\r\n%ifidn %1,sp\r\n    paddd           m4, m8\r\n    paddd           m5, m8\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m4, m5\r\n\r\n    movu            xm2, [r0 + r1]                  ; m2 = row 9\r\n    punpckhwd       xm5, xm0, xm2\r\n    punpcklwd       xm0, xm2\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddwd         m5, m0, [r5 + 1 * mmsize]\r\n    paddd           m6, m5\r\n    pmaddwd         m0, [r5]\r\n    movu            xm5, [r0 + r1 * 2]              ; m5 = row 10\r\n    punpckhwd       xm7, xm2, xm5\r\n    punpcklwd       xm2, xm5\r\n    vinserti128     m2, m2, xm7, 1\r\n    pmaddwd         m7, m2, [r5 + 1 * mmsize]\r\n    paddd           m1, m7\r\n    pmaddwd         m2, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m8\r\n    paddd           m1, m8\r\n    psrad           m6, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m6, m1\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m3, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r2], xm4\r\n    movhps          [r2 + r3], xm4\r\n    movq            [r2 + r3 * 2], xm6\r\n    movhps          [r2 + r6], xm6\r\n%else\r\n    vpermq 
         m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm7, m4, 1\r\n    vextracti128    xm1, m6, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm7\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r6], xm1\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 11\r\n    punpckhwd       xm1, xm5, xm7\r\n    punpcklwd       xm5, xm7\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    paddd           m0, m1\r\n    pmaddwd         m5, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm1, [r0]                       ; m1 = row 12\r\n    punpckhwd       xm4, xm7, xm1\r\n    punpcklwd       xm7, xm1\r\n    vinserti128     m7, m7, xm4, 1\r\n    pmaddwd         m4, m7, [r5 + 1 * mmsize]\r\n    paddd           m2, m4\r\n    pmaddwd         m7, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m8\r\n    paddd           m2, m8\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n%endif\r\n    packssdw        m0, m2\r\n\r\n    movu            xm4, [r0 + r1]                  ; m4 = row 13\r\n    punpckhwd       xm2, xm1, xm4\r\n    punpcklwd       xm1, xm4\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m2, m1, [r5 + 1 * mmsize]\r\n    paddd           m5, m2\r\n    pmaddwd         m1, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 14\r\n    punpckhwd       xm6, xm4, xm2\r\n    punpcklwd       xm4, xm2\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m7, m6\r\n    pmaddwd         m4, [r5]\r\n%ifidn %1,sp\r\n    paddd           m5, m8\r\n    paddd           m7, m8\r\n    psrad           m5, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m5, 6\r\n    psrad           m7, 6\r\n%endif\r\n    
packssdw        m5, m7\r\n%ifidn %1,sp\r\n    packuswb        m0, m5\r\n    vpermd          m0, m3, m0\r\n    vextracti128    xm5, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r6], xm5\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m5, m5, 11011000b\r\n    vextracti128    xm7, m0, 1\r\n    vextracti128    xm6, m5, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm7\r\n    movu            [r2 + r3 * 2], xm5\r\n    movu            [r2 + r6], xm6\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm6, [r0 + r4]                  ; m6 = row 15\r\n    punpckhwd       xm5, xm2, xm6\r\n    punpcklwd       xm2, xm6\r\n    vinserti128     m2, m2, xm5, 1\r\n    pmaddwd         m5, m2, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm0, [r0]                       ; m0 = row 16\r\n    punpckhwd       xm5, xm6, xm0\r\n    punpcklwd       xm6, xm0\r\n    vinserti128     m6, m6, xm5, 1\r\n    pmaddwd         m5, m6, [r5 + 1 * mmsize]\r\n    paddd           m4, m5\r\n    pmaddwd         m6, [r5]\r\n%ifidn %1,sp\r\n    paddd           m1, m8\r\n    paddd           m4, m8\r\n    psrad           m1, 12\r\n    psrad           m4, 12\r\n%else\r\n    psrad           m1, 6\r\n    psrad           m4, 6\r\n%endif\r\n    packssdw        m1, m4\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 17\r\n    punpckhwd       xm4, xm0, xm5\r\n    punpcklwd       xm0, xm5\r\n    vinserti128     m0, m0, xm4, 1\r\n    pmaddwd         m0, [r5 + 1 * mmsize]\r\n    paddd           m2, m0\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhwd       xm0, xm5, xm4\r\n    punpcklwd       xm5, xm4\r\n    vinserti128     m5, m5, xm0, 1\r\n    pmaddwd         m5, [r5 + 1 * mmsize]\r\n    paddd 
          m6, m5\r\n%ifidn %1,sp\r\n    paddd           m2, m8\r\n    paddd           m6, m8\r\n    psrad           m2, 12\r\n    psrad           m6, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m6, 6\r\n%endif\r\n    packssdw        m2, m6\r\n%ifidn %1,sp\r\n    packuswb        m1, m2\r\n    vpermd          m1, m3, m1\r\n    vextracti128    xm2, m1, 1\r\n    movq            [r2], xm1\r\n    movhps          [r2 + r3], xm1\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    vpermq          m1, m1, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm6, m1, 1\r\n    vextracti128    xm4, m2, 1\r\n    movu            [r2], xm1\r\n    movu            [r2 + r3], xm6\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm4\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n%endrep\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_8xN sp, 16\r\n    FILTER_VER_CHROMA_S_AVX2_8xN sp, 32\r\n    FILTER_VER_CHROMA_S_AVX2_8xN sp, 64\r\n    FILTER_VER_CHROMA_S_AVX2_8xN ss, 16\r\n    FILTER_VER_CHROMA_S_AVX2_8xN ss, 32\r\n    FILTER_VER_CHROMA_S_AVX2_8xN ss, 64\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_Nx24 2\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_%2x24, 4, 10, 10\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m9, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r9d, %2 / 8\r\n.loopW:\r\n    PROCESS_CHROMA_S_AVX2_W8_16R %1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             
r0, 16\r\n    dec             r9d\r\n    jnz             .loopW\r\n%ifidn %1,sp\r\n    lea             r2, [r8 + r3 * 4 - %2 + 8]\r\n%else\r\n    lea             r2, [r8 + r3 * 4 - 2 * %2 + 16]\r\n%endif\r\n    lea             r0, [r7 - 2 * %2 + 16]\r\n    mova            m7, m9\r\n    mov             r9d, %2 / 8\r\n.loop:\r\n    PROCESS_CHROMA_S_AVX2_W8_8R %1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loop\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 32\r\n    FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 16\r\n    FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 32\r\n    FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 16\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_2x8 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_2x8, 4, 6, 7\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n%ifidn %1,sp\r\n    mova            m6, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    movd            xm0, [r0]\r\n    movd            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movd            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    punpcklqdq      xm0, xm1                        ; m0 = [2 1 1 0]\r\n    movd            xm3, [r0 + r4]\r\n    punpcklwd       xm2, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movd            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    punpcklqdq      xm2, xm3                        ; m2 = [4 3 3 2]\r\n    vinserti128     m0, m0, xm2, 1                  ; m0 = [4 3 3 2 2 1 1 0]\r\n    movd            xm1, [r0 + r1]\r\n    punpcklwd       xm4, xm1\r\n    movd            xm3, [r0 + r1 * 2]\r\n    
punpcklwd       xm1, xm3\r\n    punpcklqdq      xm4, xm1                        ; m4 = [6 5 5 4]\r\n    vinserti128     m2, m2, xm4, 1                  ; m2 = [6 5 5 4 4 3 3 2]\r\n    pmaddwd         m0, [r5]\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m2\r\n    movd            xm1, [r0 + r4]\r\n    punpcklwd       xm3, xm1\r\n    lea             r0, [r0 + 4 * r1]\r\n    movd            xm2, [r0]\r\n    punpcklwd       xm1, xm2\r\n    punpcklqdq      xm3, xm1                        ; m3 = [8 7 7 6]\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [8 7 7 6 6 5 5 4]\r\n    movd            xm1, [r0 + r1]\r\n    punpcklwd       xm2, xm1\r\n    movd            xm5, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm5\r\n    punpcklqdq      xm2, xm1                        ; m2 = [10 9 9 8]\r\n    vinserti128     m3, m3, xm2, 1                  ; m3 = [10 9 9 8 8 7 7 6]\r\n    pmaddwd         m4, [r5]\r\n    pmaddwd         m3, [r5 + 1 * mmsize]\r\n    paddd           m4, m3\r\n%ifidn %1,sp\r\n    paddd           m0, m6\r\n    paddd           m4, m6\r\n    psrad           m0, 12\r\n    psrad           m4, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m4, 6\r\n%endif\r\n    packssdw        m0, m4\r\n    vextracti128    xm4, m0, 1\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,sp\r\n    packuswb        xm0, xm4\r\n    pextrw          [r2], xm0, 0\r\n    pextrw          [r2 + r3], xm0, 1\r\n    pextrw          [r2 + 2 * r3], xm0, 4\r\n    pextrw          [r2 + r4], xm0, 5\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrw          [r2], xm0, 2\r\n    pextrw          [r2 + r3], xm0, 3\r\n    pextrw          [r2 + 2 * r3], xm0, 6\r\n    pextrw          [r2 + r4], xm0, 7\r\n%else\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    movd            [r2 + 2 * r3], xm4\r\n    pextrd          [r2 + r4], xm4, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm0, 2\r\n    
pextrd          [r2 + r3], xm0, 3\r\n    pextrd          [r2 + 2 * r3], xm4, 2\r\n    pextrd          [r2 + r4], xm4, 3\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_2x8 sp\r\n    FILTER_VER_CHROMA_S_AVX2_2x8 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_2x16 1\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_2x16, 4, 6, 9\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n    sub             r0, r1\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n%ifidn %1,sp\r\n    mova            m6, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    movd            xm0, [r0]\r\n    movd            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movd            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    punpcklqdq      xm0, xm1                        ; m0 = [2 1 1 0]\r\n    movd            xm3, [r0 + r4]\r\n    punpcklwd       xm2, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movd            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    punpcklqdq      xm2, xm3                        ; m2 = [4 3 3 2]\r\n    vinserti128     m0, m0, xm2, 1                  ; m0 = [4 3 3 2 2 1 1 0]\r\n    movd            xm1, [r0 + r1]\r\n    punpcklwd       xm4, xm1\r\n    movd            xm3, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm3\r\n    punpcklqdq      xm4, xm1                        ; m4 = [6 5 5 4]\r\n    vinserti128     m2, m2, xm4, 1                  ; m2 = [6 5 5 4 4 3 3 2]\r\n    pmaddwd         m0, [r5]\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m2\r\n    movd            xm1, [r0 + r4]\r\n    punpcklwd       xm3, xm1\r\n    lea             r0, [r0 + 4 * r1]\r\n    movd            xm2, [r0]\r\n    punpcklwd       xm1, xm2\r\n    punpcklqdq      xm3, xm1             
           ; m3 = [8 7 7 6]\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [8 7 7 6 6 5 5 4]\r\n    movd            xm1, [r0 + r1]\r\n    punpcklwd       xm2, xm1\r\n    movd            xm5, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm5\r\n    punpcklqdq      xm2, xm1                        ; m2 = [10 9 9 8]\r\n    vinserti128     m3, m3, xm2, 1                  ; m3 = [10 9 9 8 8 7 7 6]\r\n    pmaddwd         m4, [r5]\r\n    pmaddwd         m3, [r5 + 1 * mmsize]\r\n    paddd           m4, m3\r\n    movd            xm1, [r0 + r4]\r\n    punpcklwd       xm5, xm1\r\n    lea             r0, [r0 + 4 * r1]\r\n    movd            xm3, [r0]\r\n    punpcklwd       xm1, xm3\r\n    punpcklqdq      xm5, xm1                        ; m5 = [12 11 11 10]\r\n    vinserti128     m2, m2, xm5, 1                  ; m2 = [12 11 11 10 10 9 9 8]\r\n    movd            xm1, [r0 + r1]\r\n    punpcklwd       xm3, xm1\r\n    movd            xm7, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm7\r\n    punpcklqdq      xm3, xm1                        ; m3 = [14 13 13 12]\r\n    vinserti128     m5, m5, xm3, 1                  ; m5 = [14 13 13 12 12 11 11 10]\r\n    pmaddwd         m2, [r5]\r\n    pmaddwd         m5, [r5 + 1 * mmsize]\r\n    paddd           m2, m5\r\n    movd            xm5, [r0 + r4]\r\n    punpcklwd       xm7, xm5\r\n    lea             r0, [r0 + 4 * r1]\r\n    movd            xm1, [r0]\r\n    punpcklwd       xm5, xm1\r\n    punpcklqdq      xm7, xm5                        ; m7 = [16 15 15 14]\r\n    vinserti128     m3, m3, xm7, 1                  ; m3 = [16 15 15 14 14 13 13 12]\r\n    movd            xm5, [r0 + r1]\r\n    punpcklwd       xm1, xm5\r\n    movd            xm8, [r0 + r1 * 2]\r\n    punpcklwd       xm5, xm8\r\n    punpcklqdq      xm1, xm5                        ; m1 = [18 17 17 16]\r\n    vinserti128     m7, m7, xm1, 1                  ; m7 = [18 17 17 16 16 15 15 14]\r\n    pmaddwd         m3, [r5]\r\n    pmaddwd         m7, [r5 + 1 * mmsize]\r\n    
paddd           m3, m7\r\n%ifidn %1,sp\r\n    paddd           m0, m6\r\n    paddd           m4, m6\r\n    paddd           m2, m6\r\n    paddd           m3, m6\r\n    psrad           m0, 12\r\n    psrad           m4, 12\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m4, 6\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m0, m4\r\n    packssdw        m2, m3\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    vextracti128    xm2, m0, 1\r\n    pextrw          [r2], xm0, 0\r\n    pextrw          [r2 + r3], xm0, 1\r\n    pextrw          [r2 + 2 * r3], xm2, 0\r\n    pextrw          [r2 + r4], xm2, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrw          [r2], xm0, 2\r\n    pextrw          [r2 + r3], xm0, 3\r\n    pextrw          [r2 + 2 * r3], xm2, 2\r\n    pextrw          [r2 + r4], xm2, 3\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrw          [r2], xm0, 4\r\n    pextrw          [r2 + r3], xm0, 5\r\n    pextrw          [r2 + 2 * r3], xm2, 4\r\n    pextrw          [r2 + r4], xm2, 5\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrw          [r2], xm0, 6\r\n    pextrw          [r2 + r3], xm0, 7\r\n    pextrw          [r2 + 2 * r3], xm2, 6\r\n    pextrw          [r2 + r4], xm2, 7\r\n%else\r\n    vextracti128    xm4, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    movd            [r2 + 2 * r3], xm4\r\n    pextrd          [r2 + r4], xm4, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm0, 2\r\n    pextrd          [r2 + r3], xm0, 3\r\n    pextrd          [r2 + 2 * r3], xm4, 2\r\n    pextrd          [r2 + r4], xm4, 3\r\n    lea             r2, [r2 + r3 * 4]\r\n    movd            [r2], xm2\r\n    pextrd          [r2 + r3], xm2, 1\r\n    movd            [r2 + 2 * r3], xm3\r\n    pextrd          [r2 + r4], xm3, 
1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm2, 2\r\n    pextrd          [r2 + r3], xm2, 3\r\n    pextrd          [r2 + 2 * r3], xm3, 2\r\n    pextrd          [r2 + r4], xm3, 3\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_2x16 sp\r\n    FILTER_VER_CHROMA_S_AVX2_2x16 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_6x8 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_6x8, 4, 6, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    paddd           m0, m4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    pmaddwd         m3, [r5]\r\n    paddd           m1, m5\r\n%ifidn %1,sp\r\n    paddd 
          m0, m7\r\n    paddd           m1, m7\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m1\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm1, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    pmaddwd         m5, [r5]\r\n    paddd           m3, m1\r\n%ifidn %1,sp\r\n    paddd           m2, m7\r\n    paddd           m3, m7\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m2, m3\r\n\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 7\r\n    punpckhwd       xm3, xm6, xm1\r\n    punpcklwd       xm6, xm1\r\n    vinserti128     m6, m6, xm3, 1\r\n    pmaddwd         m3, m6, [r5 + 1 * mmsize]\r\n    pmaddwd         m6, [r5]\r\n    paddd           m4, m3\r\n\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    vextracti128    xm2, m0, 1\r\n    movd            [r2], xm0\r\n    pextrw          [r2 + 4], xm2, 0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    pextrw          [r2 + r3 + 4], xm2, 2\r\n    pextrd          [r2 + r3 * 2], xm0, 2\r\n    pextrw          [r2 + r3 * 2 + 4], xm2, 4\r\n    pextrd          [r2 + r4], xm0, 3\r\n    pextrw          [r2 + r4 + 4], xm2, 6\r\n%else\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r4], xm2\r\n    vextracti128    xm0, m0, 1\r\n    vextracti128    
xm3, m2, 1\r\n    movd            [r2 + 8], xm0\r\n    pextrd          [r2 + r3 + 8], xm0, 2\r\n    movd            [r2 + r3 * 2 + 8], xm3\r\n    pextrd          [r2 + r4 + 8], xm3, 2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm0, [r0]                       ; m0 = row 8\r\n    punpckhwd       xm2, xm1, xm0\r\n    punpcklwd       xm1, xm0\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m2, m1, [r5 + 1 * mmsize]\r\n    pmaddwd         m1, [r5]\r\n    paddd           m5, m2\r\n%ifidn %1,sp\r\n    paddd           m4, m7\r\n    paddd           m5, m7\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m4, m5\r\n\r\n    movu            xm2, [r0 + r1]                  ; m2 = row 9\r\n    punpckhwd       xm5, xm0, xm2\r\n    punpcklwd       xm0, xm2\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddwd         m0, [r5 + 1 * mmsize]\r\n    paddd           m6, m0\r\n    movu            xm5, [r0 + r1 * 2]              ; m5 = row 10\r\n    punpckhwd       xm0, xm2, xm5\r\n    punpcklwd       xm2, xm5\r\n    vinserti128     m2, m2, xm0, 1\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m1, m2\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m7\r\n    paddd           m1, m7\r\n    psrad           m6, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m6, m1\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vextracti128    xm6, m4, 1\r\n    movd            [r2], xm4\r\n    pextrw          [r2 + 4], xm6, 0\r\n    pextrd          [r2 + r3], xm4, 1\r\n    pextrw          [r2 + r3 + 4], xm6, 2\r\n    pextrd          [r2 + r3 * 2], xm4, 2\r\n    pextrw          [r2 + r3 * 2 + 4], xm6, 4\r\n    pextrd          [r2 + r4], xm4, 3\r\n    pextrw          [r2 + r4 + 4], xm6, 6\r\n%else\r\n    movq    
        [r2], xm4\r\n    movhps          [r2 + r3], xm4\r\n    movq            [r2 + r3 * 2], xm6\r\n    movhps          [r2 + r4], xm6\r\n    vextracti128    xm5, m4, 1\r\n    vextracti128    xm1, m6, 1\r\n    movd            [r2 + 8], xm5\r\n    pextrd          [r2 + r3 + 8], xm5, 2\r\n    movd            [r2 + r3 * 2 + 8], xm1\r\n    pextrd          [r2 + r4 + 8], xm1, 2\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_6x8 sp\r\n    FILTER_VER_CHROMA_S_AVX2_6x8 ss\r\n\r\n%macro FILTER_VER_CHROMA_S_AVX2_6x16 1\r\n%if ARCH_X86_64 == 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_vert_%1_6x16, 4, 7, 9\r\n    mov             r4d, r4m\r\n    shl             r4d, 6\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_ChromaCoeffV]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r1\r\n%ifidn %1,sp\r\n    mova            m8, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, 
xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m8\r\n    paddd           m1, m8\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m0, m1\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm1, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    pmaddwd         m5, [r5]\r\n    paddd           m3, m1\r\n%ifidn %1,sp\r\n    paddd           m2, m8\r\n    paddd           m3, m8\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m2, m3\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    vextracti128    xm2, m0, 1\r\n    movd            [r2], xm0\r\n    pextrw          [r2 + 4], xm2, 0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    pextrw          [r2 + r3 + 4], xm2, 2\r\n    pextrd          [r2 + r3 * 2], xm0, 2\r\n    pextrw          [r2 + r3 * 2 + 4], xm2, 4\r\n    pextrd          [r2 + r6], xm0, 3\r\n    pextrw          [r2 + r6 + 4], xm2, 6\r\n%else\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n    vextracti128    xm0, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movd            [r2 + 8], xm0\r\n    pextrd          [r2 + r3 + 8], xm0, 2\r\n    movd        
    [r2 + r3 * 2 + 8], xm3\r\n    pextrd          [r2 + r6 + 8], xm3, 2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            xm1, [r0 + r4]                  ; m1 = row 7\r\n    punpckhwd       xm0, xm6, xm1\r\n    punpcklwd       xm6, xm1\r\n    vinserti128     m6, m6, xm0, 1\r\n    pmaddwd         m0, m6, [r5 + 1 * mmsize]\r\n    pmaddwd         m6, [r5]\r\n    paddd           m4, m0\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm0, [r0]                       ; m0 = row 8\r\n    punpckhwd       xm2, xm1, xm0\r\n    punpcklwd       xm1, xm0\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m2, m1, [r5 + 1 * mmsize]\r\n    pmaddwd         m1, [r5]\r\n    paddd           m5, m2\r\n%ifidn %1,sp\r\n    paddd           m4, m8\r\n    paddd           m5, m8\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m4, m5\r\n\r\n    movu            xm2, [r0 + r1]                  ; m2 = row 9\r\n    punpckhwd       xm5, xm0, xm2\r\n    punpcklwd       xm0, xm2\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddwd         m5, m0, [r5 + 1 * mmsize]\r\n    paddd           m6, m5\r\n    pmaddwd         m0, [r5]\r\n    movu            xm5, [r0 + r1 * 2]              ; m5 = row 10\r\n    punpckhwd       xm7, xm2, xm5\r\n    punpcklwd       xm2, xm5\r\n    vinserti128     m2, m2, xm7, 1\r\n    pmaddwd         m7, m2, [r5 + 1 * mmsize]\r\n    paddd           m1, m7\r\n    pmaddwd         m2, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m8\r\n    paddd           m1, m8\r\n    psrad           m6, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m6, m1\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vextracti128    xm6, m4, 1\r\n    movd            [r2], xm4\r\n    pextrw          [r2 + 4], xm6, 0\r\n    pextrd          [r2 + r3], xm4, 
1\r\n    pextrw          [r2 + r3 + 4], xm6, 2\r\n    pextrd          [r2 + r3 * 2], xm4, 2\r\n    pextrw          [r2 + r3 * 2 + 4], xm6, 4\r\n    pextrd          [r2 + r6], xm4, 3\r\n    pextrw          [r2 + r6 + 4], xm6, 6\r\n%else\r\n    movq            [r2], xm4\r\n    movhps          [r2 + r3], xm4\r\n    movq            [r2 + r3 * 2], xm6\r\n    movhps          [r2 + r6], xm6\r\n    vextracti128    xm4, m4, 1\r\n    vextracti128    xm1, m6, 1\r\n    movd            [r2 + 8], xm4\r\n    pextrd          [r2 + r3 + 8], xm4, 2\r\n    movd            [r2 + r3 * 2 + 8], xm1\r\n    pextrd          [r2 + r6 + 8], xm1, 2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 11\r\n    punpckhwd       xm1, xm5, xm7\r\n    punpcklwd       xm5, xm7\r\n    vinserti128     m5, m5, xm1, 1\r\n    pmaddwd         m1, m5, [r5 + 1 * mmsize]\r\n    paddd           m0, m1\r\n    pmaddwd         m5, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm1, [r0]                       ; m1 = row 12\r\n    punpckhwd       xm4, xm7, xm1\r\n    punpcklwd       xm7, xm1\r\n    vinserti128     m7, m7, xm4, 1\r\n    pmaddwd         m4, m7, [r5 + 1 * mmsize]\r\n    paddd           m2, m4\r\n    pmaddwd         m7, [r5]\r\n%ifidn %1,sp\r\n    paddd           m0, m8\r\n    paddd           m2, m8\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n%endif\r\n    packssdw        m0, m2\r\n\r\n    movu            xm4, [r0 + r1]                  ; m4 = row 13\r\n    punpckhwd       xm2, xm1, xm4\r\n    punpcklwd       xm1, xm4\r\n    vinserti128     m1, m1, xm2, 1\r\n    pmaddwd         m2, m1, [r5 + 1 * mmsize]\r\n    paddd           m5, m2\r\n    pmaddwd         m1, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 14\r\n    punpckhwd       xm6, xm4, xm2\r\n    punpcklwd       xm4, xm2\r\n    vinserti128     m4, m4, 
xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m7, m6\r\n    pmaddwd         m4, [r5]\r\n%ifidn %1,sp\r\n    paddd           m5, m8\r\n    paddd           m7, m8\r\n    psrad           m5, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m5, 6\r\n    psrad           m7, 6\r\n%endif\r\n    packssdw        m5, m7\r\n%ifidn %1,sp\r\n    packuswb        m0, m5\r\n    vextracti128    xm5, m0, 1\r\n    movd            [r2], xm0\r\n    pextrw          [r2 + 4], xm5, 0\r\n    pextrd          [r2 + r3], xm0, 1\r\n    pextrw          [r2 + r3 + 4], xm5, 2\r\n    pextrd          [r2 + r3 * 2], xm0, 2\r\n    pextrw          [r2 + r3 * 2 + 4], xm5, 4\r\n    pextrd          [r2 + r6], xm0, 3\r\n    pextrw          [r2 + r6 + 4], xm5, 6\r\n%else\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm5\r\n    movhps          [r2 + r6], xm5\r\n    vextracti128    xm0, m0, 1\r\n    vextracti128    xm7, m5, 1\r\n    movd            [r2 + 8], xm0\r\n    pextrd          [r2 + r3 + 8], xm0, 2\r\n    movd            [r2 + r3 * 2 + 8], xm7\r\n    pextrd          [r2 + r6 + 8], xm7, 2\r\n%endif\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n    movu            xm6, [r0 + r4]                  ; m6 = row 15\r\n    punpckhwd       xm5, xm2, xm6\r\n    punpcklwd       xm2, xm6\r\n    vinserti128     m2, m2, xm5, 1\r\n    pmaddwd         m5, m2, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm0, [r0]                       ; m0 = row 16\r\n    punpckhwd       xm5, xm6, xm0\r\n    punpcklwd       xm6, xm0\r\n    vinserti128     m6, m6, xm5, 1\r\n    pmaddwd         m5, m6, [r5 + 1 * mmsize]\r\n    paddd           m4, m5\r\n    pmaddwd         m6, [r5]\r\n%ifidn %1,sp\r\n    paddd           m1, m8\r\n    paddd           m4, m8\r\n    psrad           m1, 12\r\n    psrad           m4, 12\r\n%else\r\n 
   psrad           m1, 6\r\n    psrad           m4, 6\r\n%endif\r\n    packssdw        m1, m4\r\n\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 17\r\n    punpckhwd       xm4, xm0, xm5\r\n    punpcklwd       xm0, xm5\r\n    vinserti128     m0, m0, xm4, 1\r\n    pmaddwd         m0, [r5 + 1 * mmsize]\r\n    paddd           m2, m0\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhwd       xm0, xm5, xm4\r\n    punpcklwd       xm5, xm4\r\n    vinserti128     m5, m5, xm0, 1\r\n    pmaddwd         m5, [r5 + 1 * mmsize]\r\n    paddd           m6, m5\r\n%ifidn %1,sp\r\n    paddd           m2, m8\r\n    paddd           m6, m8\r\n    psrad           m2, 12\r\n    psrad           m6, 12\r\n%else\r\n    psrad           m2, 6\r\n    psrad           m6, 6\r\n%endif\r\n    packssdw        m2, m6\r\n%ifidn %1,sp\r\n    packuswb        m1, m2\r\n    vextracti128    xm2, m1, 1\r\n    movd            [r2], xm1\r\n    pextrw          [r2 + 4], xm2, 0\r\n    pextrd          [r2 + r3], xm1, 1\r\n    pextrw          [r2 + r3 + 4], xm2, 2\r\n    pextrd          [r2 + r3 * 2], xm1, 2\r\n    pextrw          [r2 + r3 * 2 + 4], xm2, 4\r\n    pextrd          [r2 + r6], xm1, 3\r\n    pextrw          [r2 + r6 + 4], xm2, 6\r\n%else\r\n    movq            [r2], xm1\r\n    movhps          [r2 + r3], xm1\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n    vextracti128    xm4, m1, 1\r\n    vextracti128    xm6, m2, 1\r\n    movd            [r2 + 8], xm4\r\n    pextrd          [r2 + r3 + 8], xm4, 2\r\n    movd            [r2 + r3 * 2 + 8], xm6\r\n    pextrd          [r2 + r6 + 8], xm6, 2\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_S_AVX2_6x16 sp\r\n    FILTER_VER_CHROMA_S_AVX2_6x16 ss\r\n\r\n;---------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vertical_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t 
*dst, intptr_t dstStride, int coeffIdx)\r\n;---------------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_CHROMA_SS_W2_4R 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ss_%1x%2, 5, 6, 5\r\n\r\n    add       r1d, r1d\r\n    add       r3d, r3d\r\n    sub       r0, r1\r\n    shl       r4d, 5\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_ChromaCoeffV]\r\n    lea       r5, [r5 + r4]\r\n%else\r\n    lea       r5, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    mov       r4d, (%2/4)\r\n\r\n.loopH:\r\n    PROCESS_CHROMA_SP_W2_4R r5\r\n\r\n    psrad     m0, 6\r\n    psrad     m2, 6\r\n\r\n    packssdw  m0, m2\r\n\r\n    movd      [r2], m0\r\n    pextrd    [r2 + r3], m0, 1\r\n    lea       r2, [r2 + 2 * r3]\r\n    pextrd    [r2], m0, 2\r\n    pextrd    [r2 + r3], m0, 3\r\n\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_SS_W2_4R 2, 4\r\n    FILTER_VER_CHROMA_SS_W2_4R 2, 8\r\n\r\n    FILTER_VER_CHROMA_SS_W2_4R 2, 16\r\n\r\n;---------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ss_4x2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;---------------------------------------------------------------------------------------------------------------\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_ss_4x2, 5, 6, 4\r\n\r\n    add        r1d, r1d\r\n    add        r3d, r3d\r\n    sub        r0, r1\r\n    shl        r4d, 5\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_ChromaCoeffV]\r\n    lea        r5, [r5 + r4]\r\n%else\r\n    lea        r5, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    movq       m0, [r0]\r\n    movq       m1, [r0 + r1]\r\n    punpcklwd  m0, m1                          ;m0=[0 1]\r\n    pmaddwd    m0, [r5 + 0 *16]                ;m0=[0+1]  Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    
movq       m2, [r0]\r\n    punpcklwd  m1, m2                          ;m1=[1 2]\r\n    pmaddwd    m1, [r5 + 0 *16]                ;m1=[1+2]  Row2\r\n\r\n    movq       m3, [r0 + r1]\r\n    punpcklwd  m2, m3                          ;m2=[2 3]\r\n    pmaddwd    m2, [r5 + 1 * 16]\r\n    paddd      m0, m2                          ;m0=[0+1+2+3]  Row1 done\r\n    psrad      m0, 6\r\n\r\n    movq       m2, [r0 + 2 * r1]\r\n    punpcklwd  m3, m2                          ;m3=[3 4]\r\n    pmaddwd    m3, [r5 + 1 * 16]\r\n    paddd      m1, m3                          ;m1=[1+2+3+4]  Row2 done\r\n    psrad      m1, 6\r\n\r\n    packssdw   m0, m1\r\n\r\n    movlps     [r2], m0\r\n    movhps     [r2 + r3], m0\r\n\r\n    RET\r\n\r\n;-------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ss_6x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_CHROMA_SS_W6_H4 2\r\nINIT_XMM sse4\r\ncglobal interp_4tap_vert_ss_6x%2, 5, 7, 6\r\n\r\n    add       r1d, r1d\r\n    add       r3d, r3d\r\n    sub       r0, r1\r\n    shl       r4d, 5\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_ChromaCoeffV]\r\n    lea       r6, [r5 + r4]\r\n%else\r\n    lea       r6, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    mov       r4d, %2/4\r\n\r\n.loopH:\r\n    PROCESS_CHROMA_SP_W4_4R\r\n\r\n    psrad     m0, 6\r\n    psrad     m1, 6\r\n    psrad     m2, 6\r\n    psrad     m3, 6\r\n\r\n    packssdw  m0, m1\r\n    packssdw  m2, m3\r\n\r\n    movlps    [r2], m0\r\n    movhps    [r2 + r3], m0\r\n    lea       r5, [r2 + 2 * r3]\r\n    movlps    [r5], m2\r\n    movhps    [r5 + r3], m2\r\n\r\n    lea       r5, [4 * r1 - 2 * 4]\r\n    sub       r0, r5\r\n    add       r2, 2 * 4\r\n\r\n    PROCESS_CHROMA_SP_W2_4R r6\r\n\r\n    psrad     m0, 6\r\n    psrad     m2, 
6\r\n\r\n    packssdw  m0, m2\r\n\r\n    movd      [r2], m0\r\n    pextrd    [r2 + r3], m0, 1\r\n    lea       r2, [r2 + 2 * r3]\r\n    pextrd    [r2], m0, 2\r\n    pextrd    [r2 + r3], m0, 3\r\n\r\n    sub       r0, 2 * 4\r\n    lea       r2, [r2 + 2 * r3 - 2 * 4]\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_SS_W6_H4 6, 8\r\n\r\n    FILTER_VER_CHROMA_SS_W6_H4 6, 16\r\n\r\n\r\n;----------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_vert_ss_8x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;----------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_CHROMA_SS_W8_H2 2\r\nINIT_XMM sse2\r\ncglobal interp_4tap_vert_ss_%1x%2, 5, 6, 7\r\n\r\n    add       r1d, r1d\r\n    add       r3d, r3d\r\n    sub       r0, r1\r\n    shl       r4d, 5\r\n\r\n%ifdef PIC\r\n    lea       r5, [tab_ChromaCoeffV]\r\n    lea       r5, [r5 + r4]\r\n%else\r\n    lea       r5, [tab_ChromaCoeffV + r4]\r\n%endif\r\n\r\n    mov       r4d, %2/2\r\n.loopH:\r\n    PROCESS_CHROMA_SP_W8_2R\r\n\r\n    psrad     m0, 6\r\n    psrad     m1, 6\r\n    psrad     m2, 6\r\n    psrad     m3, 6\r\n\r\n    packssdw  m0, m1\r\n    packssdw  m2, m3\r\n\r\n    movu      [r2], m0\r\n    movu      [r2 + r3], m2\r\n\r\n    lea       r2, [r2 + 2 * r3]\r\n\r\n    dec       r4d\r\n    jnz       .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_CHROMA_SS_W8_H2 8, 2\r\n    FILTER_VER_CHROMA_SS_W8_H2 8, 4\r\n    FILTER_VER_CHROMA_SS_W8_H2 8, 6\r\n    FILTER_VER_CHROMA_SS_W8_H2 8, 8\r\n    FILTER_VER_CHROMA_SS_W8_H2 8, 16\r\n    FILTER_VER_CHROMA_SS_W8_H2 8, 32\r\n\r\n    FILTER_VER_CHROMA_SS_W8_H2 8, 12\r\n    FILTER_VER_CHROMA_SS_W8_H2 8, 64\r\n\r\n;-----------------------------------------------------------------------------------------------------------------\r\n; 
void interp_8tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)\r\n;-----------------------------------------------------------------------------------------------------------------\r\n%macro FILTER_VER_LUMA_SS 2\r\nINIT_XMM sse2\r\ncglobal interp_8tap_vert_ss_%1x%2, 5, 7, 7 ,0-gprsize\r\n\r\n    add        r1d, r1d\r\n    add        r3d, r3d\r\n    lea        r5, [3 * r1]\r\n    sub        r0, r5\r\n    shl        r4d, 6\r\n\r\n%ifdef PIC\r\n    lea        r5, [tab_LumaCoeffV]\r\n    lea        r6, [r5 + r4]\r\n%else\r\n    lea        r6, [tab_LumaCoeffV + r4]\r\n%endif\r\n\r\n    mov        dword [rsp], %2/4\r\n.loopH:\r\n    mov        r4d, (%1/4)\r\n.loopW:\r\n    movq       m0, [r0]\r\n    movq       m1, [r0 + r1]\r\n    punpcklwd  m0, m1                          ;m0=[0 1]\r\n    pmaddwd    m0, [r6 + 0 *16]                ;m0=[0+1]  Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m1, m4                          ;m1=[1 2]\r\n    pmaddwd    m1, [r6 + 0 *16]                ;m1=[1+2]  Row2\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[2 3]\r\n    pmaddwd    m2, m4, [r6 + 0 *16]            ;m2=[2+3]  Row3\r\n    pmaddwd    m4, [r6 + 1 * 16]\r\n    paddd      m0, m4                          ;m0=[0+1+2+3]  Row1\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m5, m4                          ;m5=[3 4]\r\n    pmaddwd    m3, m5, [r6 + 0 *16]            ;m3=[3+4]  Row4\r\n    pmaddwd    m5, [r6 + 1 * 16]\r\n    paddd      m1, m5                          ;m1 = [1+2+3+4]  Row2\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[4 5]\r\n    pmaddwd    m6, m4, [r6 + 1 * 16]\r\n    paddd      m2, m6                          ;m2=[2+3+4+5]  Row3\r\n    pmaddwd    m4, [r6 + 2 * 16]\r\n    paddd      m0, m4                          ;m0=[0+1+2+3+4+5]  Row1\r\n\r\n    lea    
    r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m5, m4                          ;m5=[5 6]\r\n    pmaddwd    m6, m5, [r6 + 1 * 16]\r\n    paddd      m3, m6                          ;m3=[3+4+5+6]  Row4\r\n    pmaddwd    m5, [r6 + 2 * 16]\r\n    paddd      m1, m5                          ;m1=[1+2+3+4+5+6]  Row2\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[6 7]\r\n    pmaddwd    m6, m4, [r6 + 2 * 16]\r\n    paddd      m2, m6                          ;m2=[2+3+4+5+6+7]  Row3\r\n    pmaddwd    m4, [r6 + 3 * 16]\r\n    paddd      m0, m4                          ;m0=[0+1+2+3+4+5+6+7]  Row1 end\r\n    psrad      m0, 6\r\n\r\n    lea        r0, [r0 + 2 * r1]\r\n    movq       m4, [r0]\r\n    punpcklwd  m5, m4                          ;m5=[7 8]\r\n    pmaddwd    m6, m5, [r6 + 2 * 16]\r\n    paddd      m3, m6                          ;m3=[3+4+5+6+7+8]  Row4\r\n    pmaddwd    m5, [r6 + 3 * 16]\r\n    paddd      m1, m5                          ;m1=[1+2+3+4+5+6+7+8]  Row2 end\r\n    psrad      m1, 6\r\n\r\n    packssdw   m0, m1\r\n\r\n    movlps     [r2], m0\r\n    movhps     [r2 + r3], m0\r\n\r\n    movq       m5, [r0 + r1]\r\n    punpcklwd  m4, m5                          ;m4=[8 9]\r\n    pmaddwd    m4, [r6 + 3 * 16]\r\n    paddd      m2, m4                          ;m2=[2+3+4+5+6+7+8+9]  Row3 end\r\n    psrad      m2, 6\r\n\r\n    movq       m4, [r0 + 2 * r1]\r\n    punpcklwd  m5, m4                          ;m5=[9 10]\r\n    pmaddwd    m5, [r6 + 3 * 16]\r\n    paddd      m3, m5                          ;m3=[3+4+5+6+7+8+9+10]  Row4 end\r\n    psrad      m3, 6\r\n\r\n    packssdw   m2, m3\r\n\r\n    movlps     [r2 + 2 * r3], m2\r\n    lea        r5, [3 * r3]\r\n    movhps     [r2 + r5], m2\r\n\r\n    lea        r5, [8 * r1 - 2 * 4]\r\n    sub        r0, r5\r\n    add        r2, 2 * 4\r\n\r\n    dec        r4d\r\n    jnz        .loopW\r\n\r\n    lea        r0, [r0 + 4 * r1 - 2 * %1]\r\n    lea        r2, [r2 + 4 * 
r3 - 2 * %1]\r\n\r\n    dec        dword [rsp]\r\n    jnz        .loopH\r\n\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_SS 4, 4\r\n    FILTER_VER_LUMA_SS 8, 8\r\n    FILTER_VER_LUMA_SS 8, 4\r\n    FILTER_VER_LUMA_SS 4, 8\r\n    FILTER_VER_LUMA_SS 16, 16\r\n    FILTER_VER_LUMA_SS 16, 8\r\n    FILTER_VER_LUMA_SS 8, 16\r\n    FILTER_VER_LUMA_SS 16, 12\r\n    FILTER_VER_LUMA_SS 12, 16\r\n    FILTER_VER_LUMA_SS 16, 4\r\n    FILTER_VER_LUMA_SS 4, 16\r\n    FILTER_VER_LUMA_SS 32, 32\r\n    FILTER_VER_LUMA_SS 32, 16\r\n    FILTER_VER_LUMA_SS 16, 32\r\n    FILTER_VER_LUMA_SS 32, 24\r\n    FILTER_VER_LUMA_SS 24, 32\r\n    FILTER_VER_LUMA_SS 32, 8\r\n    FILTER_VER_LUMA_SS 8, 32\r\n    FILTER_VER_LUMA_SS 64, 64\r\n    FILTER_VER_LUMA_SS 64, 32\r\n    FILTER_VER_LUMA_SS 32, 64\r\n    FILTER_VER_LUMA_SS 64, 48\r\n    FILTER_VER_LUMA_SS 48, 64\r\n    FILTER_VER_LUMA_SS 64, 16\r\n    FILTER_VER_LUMA_SS 16, 64\r\n\r\n%macro FILTER_VER_LUMA_AVX2_4x4 1\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_%1_4x4, 4, 6, 7\r\n    mov             r4d, r4m\r\n    add             r1d, r1d\r\n    shl             r4d, 7\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n\r\n%ifidn %1,sp\r\n    mova            m6, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    movq            xm0, [r0]\r\n    movq            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movq            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m0, m0, xm1, 1                  ; m0 = [2 1 1 0]\r\n    pmaddwd         m0, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm2, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m2, m2, xm3, 1                  ; m2 = [4 3 3 2]\r\n    
pmaddwd         m5, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    paddd           m0, m5\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm4, xm3\r\n    movq            xm1, [r0 + r1 * 2]\r\n    punpcklwd       xm3, xm1\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [6 5 5 4]\r\n    pmaddwd         m5, m4, [r5 + 2 * mmsize]\r\n    pmaddwd         m4, [r5 + 1 * mmsize]\r\n    paddd           m0, m5\r\n    paddd           m2, m4\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm1, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [8 7 7 6]\r\n    pmaddwd         m5, m1, [r5 + 3 * mmsize]\r\n    pmaddwd         m1, [r5 + 2 * mmsize]\r\n    paddd           m0, m5\r\n    paddd           m2, m1\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm4, xm3\r\n    movq            xm1, [r0 + 2 * r1]\r\n    punpcklwd       xm3, xm1\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [A 9 9 8]\r\n    pmaddwd         m4, [r5 + 3 * mmsize]\r\n    paddd           m2, m4\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m6\r\n    paddd           m2, m6\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n%endif\r\n    packssdw        m0, m2\r\n    vextracti128    xm2, m0, 1\r\n    lea             r4, [r3 * 3]\r\n\r\n%ifidn %1,sp\r\n    packuswb        xm0, xm2\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 2\r\n    pextrd          [r2 + r3 * 2], xm0, 1\r\n    pextrd          [r2 + r4], xm0, 3\r\n%else\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r4], xm2\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_4x4 sp\r\n    FILTER_VER_LUMA_AVX2_4x4 ss\r\n\r\n%macro 
FILTER_VER_LUMA_AVX2_4x8 1\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_%1_4x8, 4, 7, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n\r\n    movq            xm0, [r0]\r\n    movq            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movq            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m0, m0, xm1, 1                  ; m0 = [2 1 1 0]\r\n    pmaddwd         m0, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm2, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m2, m2, xm3, 1                  ; m2 = [4 3 3 2]\r\n    pmaddwd         m5, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    paddd           m0, m5\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm4, xm3\r\n    movq            xm1, [r0 + r1 * 2]\r\n    punpcklwd       xm3, xm1\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [6 5 5 4]\r\n    pmaddwd         m5, m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m5\r\n    pmaddwd         m5, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m5\r\n    pmaddwd         m4, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm1, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm6, [r0]\r\n    punpcklwd       xm3, xm6\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [8 7 7 6]\r\n    pmaddwd         m5, m1, [r5 + 3 * mmsize]\r\n    paddd           m0, m5\r\n    pmaddwd         m5, m1, [r5 + 2 * mmsize]\r\n    paddd  
         m2, m5\r\n    pmaddwd         m5, m1, [r5 + 1 * mmsize]\r\n    paddd           m4, m5\r\n    pmaddwd         m1, [r5]\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm6, xm3\r\n    movq            xm5, [r0 + 2 * r1]\r\n    punpcklwd       xm3, xm5\r\n    vinserti128     m6, m6, xm3, 1                  ; m6 = [A 9 9 8]\r\n    pmaddwd         m3, m6, [r5 + 3 * mmsize]\r\n    paddd           m2, m3\r\n    pmaddwd         m3, m6, [r5 + 2 * mmsize]\r\n    paddd           m4, m3\r\n    pmaddwd         m6, [r5 + 1 * mmsize]\r\n    paddd           m1, m6\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m7\r\n    paddd           m2, m7\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n%endif\r\n    packssdw        m0, m2\r\n\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm5, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm2, [r0]\r\n    punpcklwd       xm3, xm2\r\n    vinserti128     m5, m5, xm3, 1                  ; m5 = [C B B A]\r\n    pmaddwd         m3, m5, [r5 + 3 * mmsize]\r\n    paddd           m4, m3\r\n    pmaddwd         m5, [r5 + 2 * mmsize]\r\n    paddd           m1, m5\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm2, xm3\r\n    movq            xm5, [r0 + 2 * r1]\r\n    punpcklwd       xm3, xm5\r\n    vinserti128     m2, m2, xm3, 1                  ; m2 = [E D D C]\r\n    pmaddwd         m2, [r5 + 3 * mmsize]\r\n    paddd           m1, m2\r\n\r\n%ifidn %1,sp\r\n    paddd           m4, m7\r\n    paddd           m1, m7\r\n    psrad           m4, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m4, m1\r\n\r\n%ifidn %1,sp\r\n    packuswb        m0, m4\r\n    vextracti128    xm2, m0, 1\r\n    movd            [r2], xm0\r\n    movd            [r2 + r3], xm2\r\n    pextrd          [r2 + r3 * 2], xm0, 1\r\n    pextrd          [r2 + 
r6], xm2, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm0, 2\r\n    pextrd          [r2 + r3], xm2, 2\r\n    pextrd          [r2 + r3 * 2], xm0, 3\r\n    pextrd          [r2 + r6], xm2, 3\r\n%else\r\n    vextracti128    xm2, m0, 1\r\n    vextracti128    xm1, m4, 1\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r6], xm2\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm4\r\n    movq            [r2 + r3], xm1\r\n    movhps          [r2 + r3 * 2], xm4\r\n    movhps          [r2 + r6], xm1\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_4x8 sp\r\n    FILTER_VER_LUMA_AVX2_4x8 ss\r\n\r\n%macro PROCESS_LUMA_AVX2_W4_16R 1\r\n    movq            xm0, [r0]\r\n    movq            xm1, [r0 + r1]\r\n    punpcklwd       xm0, xm1\r\n    movq            xm2, [r0 + r1 * 2]\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m0, m0, xm1, 1                  ; m0 = [2 1 1 0]\r\n    pmaddwd         m0, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm2, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm4, [r0]\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m2, m2, xm3, 1                  ; m2 = [4 3 3 2]\r\n    pmaddwd         m5, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    paddd           m0, m5\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm4, xm3\r\n    movq            xm1, [r0 + r1 * 2]\r\n    punpcklwd       xm3, xm1\r\n    vinserti128     m4, m4, xm3, 1                  ; m4 = [6 5 5 4]\r\n    pmaddwd         m5, m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m5\r\n    pmaddwd         m5, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m5\r\n    pmaddwd         m4, [r5]\r\n    movq            xm3, [r0 + r4]\r\n    punpcklwd       xm1, xm3\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm6, [r0]\r\n    punpcklwd       
xm3, xm6\r\n    vinserti128     m1, m1, xm3, 1                  ; m1 = [8 7 7 6]\r\n    pmaddwd         m5, m1, [r5 + 3 * mmsize]\r\n    paddd           m0, m5\r\n    pmaddwd         m5, m1, [r5 + 2 * mmsize]\r\n    paddd           m2, m5\r\n    pmaddwd         m5, m1, [r5 + 1 * mmsize]\r\n    paddd           m4, m5\r\n    pmaddwd         m1, [r5]\r\n    movq            xm3, [r0 + r1]\r\n    punpcklwd       xm6, xm3\r\n    movq            xm5, [r0 + 2 * r1]\r\n    punpcklwd       xm3, xm5\r\n    vinserti128     m6, m6, xm3, 1                  ; m6 = [10 9 9 8]\r\n    pmaddwd         m3, m6, [r5 + 3 * mmsize]\r\n    paddd           m2, m3\r\n    pmaddwd         m3, m6, [r5 + 2 * mmsize]\r\n    paddd           m4, m3\r\n    pmaddwd         m3, m6, [r5 + 1 * mmsize]\r\n    paddd           m1, m3\r\n    pmaddwd         m6, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m7\r\n    paddd           m2, m7\r\n    psrad           m0, 12\r\n    psrad           m2, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m2, 6\r\n%endif\r\n    packssdw        m0, m2\r\n    vextracti128    xm2, m0, 1\r\n%ifidn %1,sp\r\n    packuswb        xm0, xm2\r\n    movd            [r2], xm0\r\n    pextrd          [r2 + r3], xm0, 2\r\n    pextrd          [r2 + r3 * 2], xm0, 1\r\n    pextrd          [r2 + r6], xm0, 3\r\n%else\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm2\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r6], xm2\r\n%endif\r\n\r\n    movq            xm2, [r0 + r4]\r\n    punpcklwd       xm5, xm2\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm0, [r0]\r\n    punpcklwd       xm2, xm0\r\n    vinserti128     m5, m5, xm2, 1                  ; m5 = [12 11 11 10]\r\n    pmaddwd         m2, m5, [r5 + 3 * mmsize]\r\n    paddd           m4, m2\r\n    pmaddwd         m2, m5, [r5 + 2 * mmsize]\r\n    paddd           m1, m2\r\n    pmaddwd         m2, m5, [r5 + 1 * mmsize]\r\n    paddd           m6, m2\r\n    
pmaddwd         m5, [r5]\r\n    movq            xm2, [r0 + r1]\r\n    punpcklwd       xm0, xm2\r\n    movq            xm3, [r0 + 2 * r1]\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m0, m0, xm2, 1                  ; m0 = [14 13 13 12]\r\n    pmaddwd         m2, m0, [r5 + 3 * mmsize]\r\n    paddd           m1, m2\r\n    pmaddwd         m2, m0, [r5 + 2 * mmsize]\r\n    paddd           m6, m2\r\n    pmaddwd         m2, m0, [r5 + 1 * mmsize]\r\n    paddd           m5, m2\r\n    pmaddwd         m0, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m4, m7\r\n    paddd           m1, m7\r\n    psrad           m4, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m4, m1\r\n    vextracti128    xm1, m4, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n%ifidn %1,sp\r\n    packuswb        xm4, xm1\r\n    movd            [r2], xm4\r\n    pextrd          [r2 + r3], xm4, 2\r\n    pextrd          [r2 + r3 * 2], xm4, 1\r\n    pextrd          [r2 + r6], xm4, 3\r\n%else\r\n    movq            [r2], xm4\r\n    movq            [r2 + r3], xm1\r\n    movhps          [r2 + r3 * 2], xm4\r\n    movhps          [r2 + r6], xm1\r\n%endif\r\n\r\n    movq            xm4, [r0 + r4]\r\n    punpcklwd       xm3, xm4\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm1, [r0]\r\n    punpcklwd       xm4, xm1\r\n    vinserti128     m3, m3, xm4, 1                  ; m3 = [16 15 15 14]\r\n    pmaddwd         m4, m3, [r5 + 3 * mmsize]\r\n    paddd           m6, m4\r\n    pmaddwd         m4, m3, [r5 + 2 * mmsize]\r\n    paddd           m5, m4\r\n    pmaddwd         m4, m3, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m3, [r5]\r\n    movq            xm4, [r0 + r1]\r\n    punpcklwd       xm1, xm4\r\n    movq            xm2, [r0 + 2 * r1]\r\n    punpcklwd       xm4, xm2\r\n    vinserti128     m1, m1, xm4, 1                  ; m1 = [18 17 17 16]\r\n    pmaddwd         m4, m1, [r5 + 3 * 
mmsize]\r\n    paddd           m5, m4\r\n    pmaddwd         m4, m1, [r5 + 2 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m1, [r5 + 1 * mmsize]\r\n    paddd           m3, m1\r\n    movq            xm4, [r0 + r4]\r\n    punpcklwd       xm2, xm4\r\n    lea             r0, [r0 + 4 * r1]\r\n    movq            xm1, [r0]\r\n    punpcklwd       xm4, xm1\r\n    vinserti128     m2, m2, xm4, 1                  ; m2 = [20 19 19 18]\r\n    pmaddwd         m4, m2, [r5 + 3 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5 + 2 * mmsize]\r\n    paddd           m3, m2\r\n    movq            xm4, [r0 + r1]\r\n    punpcklwd       xm1, xm4\r\n    movq            xm2, [r0 + 2 * r1]\r\n    punpcklwd       xm4, xm2\r\n    vinserti128     m1, m1, xm4, 1                  ; m1 = [22 21 21 20]\r\n    pmaddwd         m1, [r5 + 3 * mmsize]\r\n    paddd           m3, m1\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m7\r\n    paddd           m5, m7\r\n    paddd           m0, m7\r\n    paddd           m3, m7\r\n    psrad           m6, 12\r\n    psrad           m5, 12\r\n    psrad           m0, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m5, 6\r\n    psrad           m0, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m6, m5\r\n    packssdw        m0, m3\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n%ifidn %1,sp\r\n    packuswb        m6, m0\r\n    vextracti128    xm0, m6, 1\r\n    movd            [r2], xm6\r\n    movd            [r2 + r3], xm0\r\n    pextrd          [r2 + r3 * 2], xm6, 1\r\n    pextrd          [r2 + r6], xm0, 1\r\n    lea             r2, [r2 + r3 * 4]\r\n    pextrd          [r2], xm6, 2\r\n    pextrd          [r2 + r3], xm0, 2\r\n    pextrd          [r2 + r3 * 2], xm6, 3\r\n    pextrd          [r2 + r6], xm0, 3\r\n%else\r\n    vextracti128    xm5, m6, 1\r\n    vextracti128    xm3, m0, 1\r\n    movq            [r2], xm6\r\n    movq            [r2 + r3], xm5\r\n    movhps     
     [r2 + r3 * 2], xm6\r\n    movhps          [r2 + r6], xm5\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm0\r\n    movq            [r2 + r3], xm3\r\n    movhps          [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r6], xm3\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_LUMA_AVX2_4x16 1\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_%1_4x16, 4, 7, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    PROCESS_LUMA_AVX2_W4_16R %1\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_4x16 sp\r\n    FILTER_VER_LUMA_AVX2_4x16 ss\r\n\r\n%macro FILTER_VER_LUMA_S_AVX2_8x8 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_8x8, 4, 6, 12\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n\r\n%ifidn %1,sp\r\n    mova            m11, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, 
xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    pmaddwd         m2, [r5]\r\n    paddd           m0, m4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    pmaddwd         m3, [r5]\r\n    paddd           m1, m5\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m6\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm7, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddwd         m7, m5, [r5 + 2 * mmsize]\r\n    paddd           m1, m7\r\n    pmaddwd         m7, m5, [r5 + 1 * mmsize]\r\n    pmaddwd         m5, [r5]\r\n    paddd           m3, m7\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhwd       xm8, xm6, xm7\r\n    punpcklwd       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddwd         m8, m6, [r5 + 3 * mmsize]\r\n    paddd           m0, m8\r\n    pmaddwd         m8, m6, [r5 + 2 * mmsize]\r\n    paddd           m2, m8\r\n    pmaddwd         m8, m6, [r5 + 1 * mmsize]\r\n    pmaddwd         m6, [r5]\r\n    paddd           m4, m8\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhwd       xm9, xm7, xm8\r\n    punpcklwd       xm7, 
xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddwd         m9, m7, [r5 + 3 * mmsize]\r\n    paddd           m1, m9\r\n    pmaddwd         m9, m7, [r5 + 2 * mmsize]\r\n    paddd           m3, m9\r\n    pmaddwd         m9, m7, [r5 + 1 * mmsize]\r\n    pmaddwd         m7, [r5]\r\n    paddd           m5, m9\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhwd       xm10, xm8, xm9\r\n    punpcklwd       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddwd         m10, m8, [r5 + 3 * mmsize]\r\n    paddd           m2, m10\r\n    pmaddwd         m10, m8, [r5 + 2 * mmsize]\r\n    pmaddwd         m8, [r5 + 1 * mmsize]\r\n    paddd           m4, m10\r\n    paddd           m6, m8\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhwd       xm8, xm9, xm10\r\n    punpcklwd       xm9, xm10\r\n    vinserti128     m9, m9, xm8, 1\r\n    pmaddwd         m8, m9, [r5 + 3 * mmsize]\r\n    paddd           m3, m8\r\n    pmaddwd         m8, m9, [r5 + 2 * mmsize]\r\n    pmaddwd         m9, [r5 + 1 * mmsize]\r\n    paddd           m5, m8\r\n    paddd           m7, m9\r\n    movu            xm8, [r0 + r4]                  ; m8 = row 11\r\n    punpckhwd       xm9, xm10, xm8\r\n    punpcklwd       xm10, xm8\r\n    vinserti128     m10, m10, xm9, 1\r\n    pmaddwd         m9, m10, [r5 + 3 * mmsize]\r\n    pmaddwd         m10, [r5 + 2 * mmsize]\r\n    paddd           m4, m9\r\n    paddd           m6, m10\r\n\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,sp\r\n    paddd           m0, m11\r\n    paddd           m1, m11\r\n    paddd           m2, m11\r\n    paddd           m3, m11\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m0, m1\r\n    packssdw        m2, m3\r\n%ifidn %1,sp\r\n    
packuswb        m0, m2\r\n    mova            m1, [interp8_hps_shuf]\r\n    vpermd          m0, m1, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r4], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm3\r\n%endif\r\n\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm9, [r0]                       ; m9 = row 12\r\n    punpckhwd       xm3, xm8, xm9\r\n    punpcklwd       xm8, xm9\r\n    vinserti128     m8, m8, xm3, 1\r\n    pmaddwd         m3, m8, [r5 + 3 * mmsize]\r\n    pmaddwd         m8, [r5 + 2 * mmsize]\r\n    paddd           m5, m3\r\n    paddd           m7, m8\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 13\r\n    punpckhwd       xm0, xm9, xm3\r\n    punpcklwd       xm9, xm3\r\n    vinserti128     m9, m9, xm0, 1\r\n    pmaddwd         m9, [r5 + 3 * mmsize]\r\n    paddd           m6, m9\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14\r\n    punpckhwd       xm9, xm3, xm0\r\n    punpcklwd       xm3, xm0\r\n    vinserti128     m3, m3, xm9, 1\r\n    pmaddwd         m3, [r5 + 3 * mmsize]\r\n    paddd           m7, m3\r\n\r\n%ifidn %1,sp\r\n    paddd           m4, m11\r\n    paddd           m5, m11\r\n    paddd           m6, m11\r\n    paddd           m7, m11\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n    psrad           m6, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n    psrad           m6, 6\r\n    psrad           m7, 6\r\n%endif\r\n    packssdw        m4, m5\r\n    packssdw        m6, m7\r\n    lea             r2, [r2 + r3 * 4]\r\n%ifidn 
%1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m1, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r2], xm4\r\n    movhps          [r2 + r3], xm4\r\n    movq            [r2 + r3 * 2], xm6\r\n    movhps          [r2 + r4], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm5, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm5\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r4], xm7\r\n%endif\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_S_AVX2_8x8 sp\r\n    FILTER_VER_LUMA_S_AVX2_8x8 ss\r\n\r\n%macro FILTER_VER_LUMA_S_AVX2_8xN 2\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_8x%2, 4, 9, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,sp\r\n    mova            m14, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    lea             r7, [r1 * 4]\r\n    mov             r8d, %2 / 16\r\n.loopH:\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    
vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m6\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm7, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddwd         m7, m5, [r5 + 2 * mmsize]\r\n    paddd           m1, m7\r\n    pmaddwd         m7, m5, [r5 + 1 * mmsize]\r\n    paddd           m3, m7\r\n    pmaddwd         m5, [r5]\r\n    movu            xm7, [r0 + r4]                  ; m7 = row 7\r\n    punpckhwd       xm8, xm6, xm7\r\n    punpcklwd       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddwd         m8, m6, [r5 + 3 * mmsize]\r\n    paddd           m0, m8\r\n    pmaddwd         m8, m6, [r5 + 2 * mmsize]\r\n    paddd           m2, m8\r\n    pmaddwd         m8, m6, [r5 + 1 * mmsize]\r\n    paddd           m4, m8\r\n    pmaddwd         m6, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm8, [r0]                       ; m8 = row 8\r\n    punpckhwd       xm9, xm7, xm8\r\n    punpcklwd       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddwd         m9, m7, [r5 + 3 * mmsize]\r\n    paddd           m1, m9\r\n    pmaddwd         m9, m7, [r5 + 2 * mmsize]\r\n    paddd 
          m3, m9\r\n    pmaddwd         m9, m7, [r5 + 1 * mmsize]\r\n    paddd           m5, m9\r\n    pmaddwd         m7, [r5]\r\n    movu            xm9, [r0 + r1]                  ; m9 = row 9\r\n    punpckhwd       xm10, xm8, xm9\r\n    punpcklwd       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddwd         m10, m8, [r5 + 3 * mmsize]\r\n    paddd           m2, m10\r\n    pmaddwd         m10, m8, [r5 + 2 * mmsize]\r\n    paddd           m4, m10\r\n    pmaddwd         m10, m8, [r5 + 1 * mmsize]\r\n    paddd           m6, m10\r\n    pmaddwd         m8, [r5]\r\n    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10\r\n    punpckhwd       xm11, xm9, xm10\r\n    punpcklwd       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddwd         m11, m9, [r5 + 3 * mmsize]\r\n    paddd           m3, m11\r\n    pmaddwd         m11, m9, [r5 + 2 * mmsize]\r\n    paddd           m5, m11\r\n    pmaddwd         m11, m9, [r5 + 1 * mmsize]\r\n    paddd           m7, m11\r\n    pmaddwd         m9, [r5]\r\n    movu            xm11, [r0 + r4]                 ; m11 = row 11\r\n    punpckhwd       xm12, xm10, xm11\r\n    punpcklwd       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    pmaddwd         m12, m10, [r5 + 3 * mmsize]\r\n    paddd           m4, m12\r\n    pmaddwd         m12, m10, [r5 + 2 * mmsize]\r\n    paddd           m6, m12\r\n    pmaddwd         m12, m10, [r5 + 1 * mmsize]\r\n    paddd           m8, m12\r\n    pmaddwd         m10, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm12, [r0]                      ; m12 = row 12\r\n    punpckhwd       xm13, xm11, xm12\r\n    punpcklwd       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddwd         m13, m11, [r5 + 3 * mmsize]\r\n    paddd           m5, m13\r\n    pmaddwd         m13, m11, [r5 + 2 * mmsize]\r\n    paddd           m7, m13\r\n    pmaddwd         m13, m11, [r5 + 1 * mmsize]\r\n    paddd           m9, m13\r\n    pmaddwd         m11, 
[r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m14\r\n    paddd           m1, m14\r\n    paddd           m2, m14\r\n    paddd           m3, m14\r\n    paddd           m4, m14\r\n    paddd           m5, m14\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m0, m1\r\n    packssdw        m2, m3\r\n    packssdw        m4, m5\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m1, [interp8_hps_shuf]\r\n    vpermd          m0, m1, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n\r\n    movu            xm13, [r0 + r1]                 ; m13 = row 13\r\n    punpckhwd       xm0, xm12, xm13\r\n    punpcklwd       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddwd         m0, m12, [r5 + 3 * mmsize]\r\n    paddd           m6, m0\r\n    pmaddwd         m0, m12, [r5 + 2 * mmsize]\r\n    paddd           m8, m0\r\n    pmaddwd         m0, m12, [r5 + 1 * mmsize]\r\n    paddd           m10, m0\r\n    pmaddwd         m12, [r5]\r\n    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14\r\n    punpckhwd       xm2, xm13, xm0\r\n    punpcklwd       xm13, xm0\r\n    vinserti128     m13, m13, xm2, 1\r\n    pmaddwd         m2, m13, [r5 + 3 * mmsize]\r\n 
   paddd           m7, m2\r\n    pmaddwd         m2, m13, [r5 + 2 * mmsize]\r\n    paddd           m9, m2\r\n    pmaddwd         m2, m13, [r5 + 1 * mmsize]\r\n    paddd           m11, m2\r\n    pmaddwd         m13, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m14\r\n    paddd           m7, m14\r\n    psrad           m6, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m7, 6\r\n%endif\r\n    packssdw        m6, m7\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m1, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r2], xm4\r\n    movhps          [r2 + r3], xm4\r\n    movq            [r2 + r3 * 2], xm6\r\n    movhps          [r2 + r6], xm6\r\n%else\r\n    vpermq          m6, m6, 11011000b\r\n    vpermq          m4, m4, 11011000b\r\n    vextracti128    xm1, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r2], xm4\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm6\r\n    movu            [r2 + r6], xm7\r\n%endif\r\n\r\n    movu            xm6, [r0 + r4]                  ; m6 = row 15\r\n    punpckhwd       xm5, xm0, xm6\r\n    punpcklwd       xm0, xm6\r\n    vinserti128     m0, m0, xm5, 1\r\n    pmaddwd         m5, m0, [r5 + 3 * mmsize]\r\n    paddd           m8, m5\r\n    pmaddwd         m5, m0, [r5 + 2 * mmsize]\r\n    paddd           m10, m5\r\n    pmaddwd         m5, m0, [r5 + 1 * mmsize]\r\n    paddd           m12, m5\r\n    pmaddwd         m0, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm2, [r0]                       ; m2 = row 16\r\n    punpckhwd       xm3, xm6, xm2\r\n    punpcklwd       xm6, xm2\r\n    vinserti128     m6, m6, xm3, 1\r\n    pmaddwd         m3, m6, [r5 + 3 * mmsize]\r\n    paddd           m9, m3\r\n    pmaddwd         m3, m6, [r5 + 2 * mmsize]\r\n    paddd           m11, m3\r\n    pmaddwd         m3, m6, [r5 + 1 * mmsize]\r\n    paddd           
m13, m3\r\n    pmaddwd         m6, [r5]\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 17\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 3 * mmsize]\r\n    paddd           m10, m4\r\n    pmaddwd         m4, m2, [r5 + 2 * mmsize]\r\n    paddd           m12, m4\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m2\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18\r\n    punpckhwd       xm2, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm2, 1\r\n    pmaddwd         m2, m3, [r5 + 3 * mmsize]\r\n    paddd           m11, m2\r\n    pmaddwd         m2, m3, [r5 + 2 * mmsize]\r\n    paddd           m13, m2\r\n    pmaddwd         m3, [r5 + 1 * mmsize]\r\n    paddd           m6, m3\r\n    movu            xm2, [r0 + r4]                  ; m2 = row 19\r\n    punpckhwd       xm7, xm4, xm2\r\n    punpcklwd       xm4, xm2\r\n    vinserti128     m4, m4, xm7, 1\r\n    pmaddwd         m7, m4, [r5 + 3 * mmsize]\r\n    paddd           m12, m7\r\n    pmaddwd         m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m4\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm7, [r0]                       ; m7 = row 20\r\n    punpckhwd       xm3, xm2, xm7\r\n    punpcklwd       xm2, xm7\r\n    vinserti128     m2, m2, xm3, 1\r\n    pmaddwd         m3, m2, [r5 + 3 * mmsize]\r\n    paddd           m13, m3\r\n    pmaddwd         m2, [r5 + 2 * mmsize]\r\n    paddd           m6, m2\r\n    movu            xm3, [r0 + r1]                  ; m3 = row 21\r\n    punpckhwd       xm2, xm7, xm3\r\n    punpcklwd       xm7, xm3\r\n    vinserti128     m7, m7, xm2, 1\r\n    pmaddwd         m7, [r5 + 3 * mmsize]\r\n    paddd           m0, m7\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 22\r\n    punpckhwd       xm7, xm3, xm2\r\n    punpcklwd       xm3, xm2\r\n    vinserti128     m3, m3, xm7, 1\r\n    
pmaddwd         m3, [r5 + 3 * mmsize]\r\n    paddd           m6, m3\r\n\r\n%ifidn %1,sp\r\n    paddd           m8, m14\r\n    paddd           m9, m14\r\n    paddd           m10, m14\r\n    paddd           m11, m14\r\n    paddd           m12, m14\r\n    paddd           m13, m14\r\n    paddd           m0, m14\r\n    paddd           m6, m14\r\n    psrad           m8, 12\r\n    psrad           m9, 12\r\n    psrad           m10, 12\r\n    psrad           m11, 12\r\n    psrad           m12, 12\r\n    psrad           m13, 12\r\n    psrad           m0, 12\r\n    psrad           m6, 12\r\n%else\r\n    psrad           m8, 6\r\n    psrad           m9, 6\r\n    psrad           m10, 6\r\n    psrad           m11, 6\r\n    psrad           m12, 6\r\n    psrad           m13, 6\r\n    psrad           m0, 6\r\n    psrad           m6, 6\r\n%endif\r\n    packssdw        m8, m9\r\n    packssdw        m10, m11\r\n    packssdw        m12, m13\r\n    packssdw        m0, m6\r\n    lea             r2, [r2 + r3 * 4]\r\n\r\n%ifidn %1,sp\r\n    packuswb        m8, m10\r\n    packuswb        m12, m0\r\n    vpermd          m8, m1, m8\r\n    vpermd          m12, m1, m12\r\n    vextracti128    xm10, m8, 1\r\n    vextracti128    xm0, m12, 1\r\n    movq            [r2], xm8\r\n    movhps          [r2 + r3], xm8\r\n    movq            [r2 + r3 * 2], xm10\r\n    movhps          [r2 + r6], xm10\r\n    lea             r2, [r2 + r3 * 4]\r\n    movq            [r2], xm12\r\n    movhps          [r2 + r3], xm12\r\n    movq            [r2 + r3 * 2], xm0\r\n    movhps          [r2 + r6], xm0\r\n%else\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m10, m10, 11011000b\r\n    vpermq          m12, m12, 11011000b\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    vextracti128    xm13, m12, 1\r\n    vextracti128    xm6, m0, 1\r\n    movu            [r2], xm8\r\n    movu            [r2 + r3], xm9\r\n    movu            [r2 + r3 * 2], 
xm10\r\n    movu            [r2 + r6], xm11\r\n    lea             r2, [r2 + r3 * 4]\r\n    movu            [r2], xm12\r\n    movu            [r2 + r3], xm13\r\n    movu            [r2 + r3 * 2], xm0\r\n    movu            [r2 + r6], xm6\r\n%endif\r\n\r\n    lea             r2, [r2 + r3 * 4]\r\n    sub             r0, r7\r\n    dec             r8d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_S_AVX2_8xN sp, 16\r\n    FILTER_VER_LUMA_S_AVX2_8xN sp, 32\r\n    FILTER_VER_LUMA_S_AVX2_8xN ss, 16\r\n    FILTER_VER_LUMA_S_AVX2_8xN ss, 32\r\n\r\n%macro PROCESS_LUMA_S_AVX2_W8_4R 1\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm4, [r0]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n    movu            xm5, [r0 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m6\r\n    pmaddwd         m4, 
[r5 + 1 * mmsize]\r\n    paddd           m2, m4\r\n    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm4, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm4, 1\r\n    pmaddwd         m4, m5, [r5 + 2 * mmsize]\r\n    paddd           m1, m4\r\n    pmaddwd         m5, [r5 + 1 * mmsize]\r\n    paddd           m3, m5\r\n    movu            xm4, [r0 + r4]                  ; m4 = row 7\r\n    punpckhwd       xm5, xm6, xm4\r\n    punpcklwd       xm6, xm4\r\n    vinserti128     m6, m6, xm5, 1\r\n    pmaddwd         m5, m6, [r5 + 3 * mmsize]\r\n    paddd           m0, m5\r\n    pmaddwd         m6, [r5 + 2 * mmsize]\r\n    paddd           m2, m6\r\n    lea             r0, [r0 + r1 * 4]\r\n    movu            xm5, [r0]                       ; m5 = row 8\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 3 * mmsize]\r\n    paddd           m1, m6\r\n    pmaddwd         m4, [r5 + 2 * mmsize]\r\n    paddd           m3, m4\r\n    movu            xm6, [r0 + r1]                  ; m6 = row 9\r\n    punpckhwd       xm4, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm4, 1\r\n    pmaddwd         m5, [r5 + 3 * mmsize]\r\n    paddd           m2, m5\r\n    movu            xm4, [r0 + r1 * 2]              ; m4 = row 10\r\n    punpckhwd       xm5, xm6, xm4\r\n    punpcklwd       xm6, xm4\r\n    vinserti128     m6, m6, xm5, 1\r\n    pmaddwd         m6, [r5 + 3 * mmsize]\r\n    paddd           m3, m6\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m7\r\n    paddd           m1, m7\r\n    paddd           m2, m7\r\n    paddd           m3, m7\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n%endif\r\n    packssdw        m0, 
m1\r\n    packssdw        m2, m3\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m4, [interp8_hps_shuf]\r\n    vpermd          m0, m4, m0\r\n    vextracti128    xm2, m0, 1\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_LUMA_S_AVX2_8x4 1\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_%1_8x4, 4, 6, 8\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    PROCESS_LUMA_S_AVX2_W8_4R %1\r\n    lea             r4, [r3 * 3]\r\n%ifidn %1,sp\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r4], xm2\r\n%else\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r4], xm3\r\n%endif\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_S_AVX2_8x4 sp\r\n    FILTER_VER_LUMA_S_AVX2_8x4 ss\r\n\r\n%macro PROCESS_LUMA_AVX2_W8_16R 1\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]              
    ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r7, [r0 + r1 * 4]\r\n    movu            xm4, [r7]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n    movu            xm5, [r7 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m6\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r7 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm7, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddwd         m7, m5, [r5 + 2 * mmsize]\r\n    paddd           m1, m7\r\n    pmaddwd         m7, m5, [r5 + 1 * mmsize]\r\n    paddd           m3, m7\r\n    pmaddwd         m5, [r5]\r\n    movu            xm7, [r7 + r4]                  ; m7 = row 7\r\n    punpckhwd       xm8, xm6, xm7\r\n    punpcklwd       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddwd         m8, m6, [r5 + 3 * mmsize]\r\n    paddd           m0, m8\r\n    pmaddwd         m8, m6, [r5 + 2 * mmsize]\r\n    paddd           m2, m8\r\n    pmaddwd         m8, m6, [r5 + 1 * mmsize]\r\n    paddd           m4, m8\r\n    pmaddwd         m6, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm8, [r7]                       ; m8 = row 8\r\n    punpckhwd       xm9, xm7, xm8\r\n    punpcklwd       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddwd         m9, m7, [r5 + 3 * 
mmsize]\r\n    paddd           m1, m9\r\n    pmaddwd         m9, m7, [r5 + 2 * mmsize]\r\n    paddd           m3, m9\r\n    pmaddwd         m9, m7, [r5 + 1 * mmsize]\r\n    paddd           m5, m9\r\n    pmaddwd         m7, [r5]\r\n    movu            xm9, [r7 + r1]                  ; m9 = row 9\r\n    punpckhwd       xm10, xm8, xm9\r\n    punpcklwd       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddwd         m10, m8, [r5 + 3 * mmsize]\r\n    paddd           m2, m10\r\n    pmaddwd         m10, m8, [r5 + 2 * mmsize]\r\n    paddd           m4, m10\r\n    pmaddwd         m10, m8, [r5 + 1 * mmsize]\r\n    paddd           m6, m10\r\n    pmaddwd         m8, [r5]\r\n    movu            xm10, [r7 + r1 * 2]             ; m10 = row 10\r\n    punpckhwd       xm11, xm9, xm10\r\n    punpcklwd       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddwd         m11, m9, [r5 + 3 * mmsize]\r\n    paddd           m3, m11\r\n    pmaddwd         m11, m9, [r5 + 2 * mmsize]\r\n    paddd           m5, m11\r\n    pmaddwd         m11, m9, [r5 + 1 * mmsize]\r\n    paddd           m7, m11\r\n    pmaddwd         m9, [r5]\r\n    movu            xm11, [r7 + r4]                 ; m11 = row 11\r\n    punpckhwd       xm12, xm10, xm11\r\n    punpcklwd       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    pmaddwd         m12, m10, [r5 + 3 * mmsize]\r\n    paddd           m4, m12\r\n    pmaddwd         m12, m10, [r5 + 2 * mmsize]\r\n    paddd           m6, m12\r\n    pmaddwd         m12, m10, [r5 + 1 * mmsize]\r\n    paddd           m8, m12\r\n    pmaddwd         m10, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm12, [r7]                      ; m12 = row 12\r\n    punpckhwd       xm13, xm11, xm12\r\n    punpcklwd       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddwd         m13, m11, [r5 + 3 * mmsize]\r\n    paddd           m5, m13\r\n    pmaddwd         m13, m11, [r5 + 2 * mmsize]\r\n    paddd           m7, m13\r\n    
pmaddwd         m13, m11, [r5 + 1 * mmsize]\r\n    paddd           m9, m13\r\n    pmaddwd         m11, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m14\r\n    paddd           m1, m14\r\n    paddd           m2, m14\r\n    paddd           m3, m14\r\n    paddd           m4, m14\r\n    paddd           m5, m14\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m0, m1\r\n    packssdw        m2, m3\r\n    packssdw        m4, m5\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m5, [interp8_hps_shuf]\r\n    vpermd          m0, m5, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n\r\n    movu            xm13, [r7 + r1]                 ; m13 = row 13\r\n    punpckhwd       xm0, xm12, xm13\r\n    punpcklwd       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddwd         m0, m12, [r5 + 3 * mmsize]\r\n    paddd           m6, m0\r\n    pmaddwd         m0, m12, [r5 + 2 * mmsize]\r\n    paddd           m8, m0\r\n    pmaddwd         m0, m12, [r5 + 1 * mmsize]\r\n    paddd           m10, m0\r\n    pmaddwd         m12, [r5]\r\n    movu            xm0, [r7 + r1 * 2]              ; m0 = row 14\r\n    punpckhwd       xm1, xm13, xm0\r\n    punpcklwd       
xm13, xm0\r\n    vinserti128     m13, m13, xm1, 1\r\n    pmaddwd         m1, m13, [r5 + 3 * mmsize]\r\n    paddd           m7, m1\r\n    pmaddwd         m1, m13, [r5 + 2 * mmsize]\r\n    paddd           m9, m1\r\n    pmaddwd         m1, m13, [r5 + 1 * mmsize]\r\n    paddd           m11, m1\r\n    pmaddwd         m13, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m14\r\n    paddd           m7, m14\r\n    psrad           m6, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m7, 6\r\n%endif\r\n    packssdw        m6, m7\r\n    lea             r8, [r2 + r3 * 4]\r\n\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m5, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r8], xm4\r\n    movhps          [r8 + r3], xm4\r\n    movq            [r8 + r3 * 2], xm6\r\n    movhps          [r8 + r6], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm1, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r8], xm4\r\n    movu            [r8 + r3], xm1\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm7\r\n%endif\r\n\r\n    movu            xm1, [r7 + r4]                  ; m1 = row 15\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m2, m0, [r5 + 3 * mmsize]\r\n    paddd           m8, m2\r\n    pmaddwd         m2, m0, [r5 + 2 * mmsize]\r\n    paddd           m10, m2\r\n    pmaddwd         m2, m0, [r5 + 1 * mmsize]\r\n    paddd           m12, m2\r\n    pmaddwd         m0, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm2, [r7]                       ; m2 = row 16\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m3, m1, [r5 + 3 * mmsize]\r\n    paddd           m9, m3\r\n    pmaddwd         m3, m1, [r5 + 2 * 
mmsize]\r\n    paddd           m11, m3\r\n    pmaddwd         m3, m1, [r5 + 1 * mmsize]\r\n    paddd           m13, m3\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r7 + r1]                  ; m3 = row 17\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 3 * mmsize]\r\n    paddd           m10, m4\r\n    pmaddwd         m4, m2, [r5 + 2 * mmsize]\r\n    paddd           m12, m4\r\n    pmaddwd         m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m2\r\n    movu            xm4, [r7 + r1 * 2]              ; m4 = row 18\r\n    punpckhwd       xm2, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm2, 1\r\n    pmaddwd         m2, m3, [r5 + 3 * mmsize]\r\n    paddd           m11, m2\r\n    pmaddwd         m2, m3, [r5 + 2 * mmsize]\r\n    paddd           m13, m2\r\n    pmaddwd         m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m3\r\n    movu            xm2, [r7 + r4]                  ; m2 = row 19\r\n    punpckhwd       xm6, xm4, xm2\r\n    punpcklwd       xm4, xm2\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 3 * mmsize]\r\n    paddd           m12, m6\r\n    pmaddwd         m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m4\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm6, [r7]                       ; m6 = row 20\r\n    punpckhwd       xm7, xm2, xm6\r\n    punpcklwd       xm2, xm6\r\n    vinserti128     m2, m2, xm7, 1\r\n    pmaddwd         m7, m2, [r5 + 3 * mmsize]\r\n    paddd           m13, m7\r\n    pmaddwd         m2, [r5 + 2 * mmsize]\r\n    paddd           m1, m2\r\n    movu            xm7, [r7 + r1]                  ; m7 = row 21\r\n    punpckhwd       xm2, xm6, xm7\r\n    punpcklwd       xm6, xm7\r\n    vinserti128     m6, m6, xm2, 1\r\n    pmaddwd         m6, [r5 + 3 * mmsize]\r\n    paddd           m0, m6\r\n    movu            xm2, [r7 + r1 * 2]              ; m2 = row 22\r\n 
   punpckhwd       xm3, xm7, xm2\r\n    punpcklwd       xm7, xm2\r\n    vinserti128     m7, m7, xm3, 1\r\n    pmaddwd         m7, [r5 + 3 * mmsize]\r\n    paddd           m1, m7\r\n\r\n%ifidn %1,sp\r\n    paddd           m8, m14\r\n    paddd           m9, m14\r\n    paddd           m10, m14\r\n    paddd           m11, m14\r\n    paddd           m12, m14\r\n    paddd           m13, m14\r\n    paddd           m0, m14\r\n    paddd           m1, m14\r\n    psrad           m8, 12\r\n    psrad           m9, 12\r\n    psrad           m10, 12\r\n    psrad           m11, 12\r\n    psrad           m12, 12\r\n    psrad           m13, 12\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n%else\r\n    psrad           m8, 6\r\n    psrad           m9, 6\r\n    psrad           m10, 6\r\n    psrad           m11, 6\r\n    psrad           m12, 6\r\n    psrad           m13, 6\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n%endif\r\n    packssdw        m8, m9\r\n    packssdw        m10, m11\r\n    packssdw        m12, m13\r\n    packssdw        m0, m1\r\n    lea             r8, [r8 + r3 * 4]\r\n\r\n%ifidn %1,sp\r\n    packuswb        m8, m10\r\n    packuswb        m12, m0\r\n    vpermd          m8, m5, m8\r\n    vpermd          m12, m5, m12\r\n    vextracti128    xm10, m8, 1\r\n    vextracti128    xm0, m12, 1\r\n    movq            [r8], xm8\r\n    movhps          [r8 + r3], xm8\r\n    movq            [r8 + r3 * 2], xm10\r\n    movhps          [r8 + r6], xm10\r\n    lea             r8, [r8 + r3 * 4]\r\n    movq            [r8], xm12\r\n    movhps          [r8 + r3], xm12\r\n    movq            [r8 + r3 * 2], xm0\r\n    movhps          [r8 + r6], xm0\r\n%else\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m10, m10, 11011000b\r\n    vpermq          m12, m12, 11011000b\r\n    vpermq          m0, m0, 11011000b\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    vextracti128    xm13, m12, 1\r\n    vextracti128    xm1, m0, 
1\r\n    movu            [r8], xm8\r\n    movu            [r8 + r3], xm9\r\n    movu            [r8 + r3 * 2], xm10\r\n    movu            [r8 + r6], xm11\r\n    lea             r8, [r8 + r3 * 4]\r\n    movu            [r8], xm12\r\n    movu            [r8 + r3], xm13\r\n    movu            [r8 + r3 * 2], xm0\r\n    movu            [r8 + r6], xm1\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_LUMA_AVX2_Nx16 2\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_%2x16, 4, 10, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,sp\r\n    mova            m14, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r9d, %2 / 8\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W8_16R %1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loopW\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_Nx16 sp, 16\r\n    FILTER_VER_LUMA_AVX2_Nx16 sp, 32\r\n    FILTER_VER_LUMA_AVX2_Nx16 sp, 64\r\n    FILTER_VER_LUMA_AVX2_Nx16 ss, 16\r\n    FILTER_VER_LUMA_AVX2_Nx16 ss, 32\r\n    FILTER_VER_LUMA_AVX2_Nx16 ss, 64\r\n\r\n%macro FILTER_VER_LUMA_AVX2_NxN 3\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%3_%1x%2, 4, 12, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n\r\n%ifidn %3,sp\r\n    mova 
           m14, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n\r\n    lea             r6, [r3 * 3]\r\n    lea             r11, [r1 * 4]\r\n    mov             r9d, %2 / 16\r\n.loopH:\r\n    mov             r10d, %1 / 8\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W8_16R %3\r\n%ifidn %3,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    dec             r10d\r\n    jnz             .loopW\r\n    sub             r7, r11\r\n    lea             r0, [r7 - 2 * %1 + 16]\r\n%ifidn %3,sp\r\n    lea             r2, [r8 + r3 * 4 - %1 + 8]\r\n%else\r\n    lea             r2, [r8 + r3 * 4 - 2 * %1 + 16]\r\n%endif\r\n    dec             r9d\r\n    jnz             .loopH\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_NxN 16, 32, sp\r\n    FILTER_VER_LUMA_AVX2_NxN 16, 64, sp\r\n    FILTER_VER_LUMA_AVX2_NxN 24, 32, sp\r\n    FILTER_VER_LUMA_AVX2_NxN 32, 32, sp\r\n    FILTER_VER_LUMA_AVX2_NxN 32, 64, sp\r\n    FILTER_VER_LUMA_AVX2_NxN 48, 64, sp\r\n    FILTER_VER_LUMA_AVX2_NxN 64, 32, sp\r\n    FILTER_VER_LUMA_AVX2_NxN 64, 48, sp\r\n    FILTER_VER_LUMA_AVX2_NxN 64, 64, sp\r\n    FILTER_VER_LUMA_AVX2_NxN 16, 32, ss\r\n    FILTER_VER_LUMA_AVX2_NxN 16, 64, ss\r\n    FILTER_VER_LUMA_AVX2_NxN 24, 32, ss\r\n    FILTER_VER_LUMA_AVX2_NxN 32, 32, ss\r\n    FILTER_VER_LUMA_AVX2_NxN 32, 64, ss\r\n    FILTER_VER_LUMA_AVX2_NxN 48, 64, ss\r\n    FILTER_VER_LUMA_AVX2_NxN 64, 32, ss\r\n    FILTER_VER_LUMA_AVX2_NxN 64, 48, ss\r\n    FILTER_VER_LUMA_AVX2_NxN 64, 64, ss\r\n\r\n%macro FILTER_VER_LUMA_S_AVX2_12x16 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_12x16, 4, 9, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub        
     r0, r4\r\n%ifidn %1,sp\r\n    mova            m14, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    PROCESS_LUMA_AVX2_W8_16R %1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    mova            m7, m14\r\n    PROCESS_LUMA_AVX2_W4_16R %1\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_S_AVX2_12x16 sp\r\n    FILTER_VER_LUMA_S_AVX2_12x16 ss\r\n\r\n%macro FILTER_VER_LUMA_S_AVX2_16x12 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_16x12, 4, 10, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,sp\r\n    mova            m14, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r9d, 2\r\n.loopW:\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r7, [r0 + r1 * 4]\r\n    movu            xm4, [r7]                   
    ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n    movu            xm5, [r7 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m6\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r7 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm7, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddwd         m7, m5, [r5 + 2 * mmsize]\r\n    paddd           m1, m7\r\n    pmaddwd         m7, m5, [r5 + 1 * mmsize]\r\n    paddd           m3, m7\r\n    pmaddwd         m5, [r5]\r\n    movu            xm7, [r7 + r4]                  ; m7 = row 7\r\n    punpckhwd       xm8, xm6, xm7\r\n    punpcklwd       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddwd         m8, m6, [r5 + 3 * mmsize]\r\n    paddd           m0, m8\r\n    pmaddwd         m8, m6, [r5 + 2 * mmsize]\r\n    paddd           m2, m8\r\n    pmaddwd         m8, m6, [r5 + 1 * mmsize]\r\n    paddd           m4, m8\r\n    pmaddwd         m6, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm8, [r7]                       ; m8 = row 8\r\n    punpckhwd       xm9, xm7, xm8\r\n    punpcklwd       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddwd         m9, m7, [r5 + 3 * mmsize]\r\n    paddd           m1, m9\r\n    pmaddwd         m9, m7, [r5 + 2 * mmsize]\r\n    paddd           m3, m9\r\n    pmaddwd         m9, m7, [r5 + 1 * mmsize]\r\n    paddd           m5, m9\r\n    pmaddwd         m7, [r5]\r\n    movu            xm9, [r7 + r1]                  ; m9 = row 9\r\n    punpckhwd       xm10, xm8, 
xm9\r\n    punpcklwd       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddwd         m10, m8, [r5 + 3 * mmsize]\r\n    paddd           m2, m10\r\n    pmaddwd         m10, m8, [r5 + 2 * mmsize]\r\n    paddd           m4, m10\r\n    pmaddwd         m10, m8, [r5 + 1 * mmsize]\r\n    paddd           m6, m10\r\n    pmaddwd         m8, [r5]\r\n    movu            xm10, [r7 + r1 * 2]             ; m10 = row 10\r\n    punpckhwd       xm11, xm9, xm10\r\n    punpcklwd       xm9, xm10\r\n    vinserti128     m9, m9, xm11, 1\r\n    pmaddwd         m11, m9, [r5 + 3 * mmsize]\r\n    paddd           m3, m11\r\n    pmaddwd         m11, m9, [r5 + 2 * mmsize]\r\n    paddd           m5, m11\r\n    pmaddwd         m11, m9, [r5 + 1 * mmsize]\r\n    paddd           m7, m11\r\n    pmaddwd         m9, [r5]\r\n    movu            xm11, [r7 + r4]                 ; m11 = row 11\r\n    punpckhwd       xm12, xm10, xm11\r\n    punpcklwd       xm10, xm11\r\n    vinserti128     m10, m10, xm12, 1\r\n    pmaddwd         m12, m10, [r5 + 3 * mmsize]\r\n    paddd           m4, m12\r\n    pmaddwd         m12, m10, [r5 + 2 * mmsize]\r\n    paddd           m6, m12\r\n    pmaddwd         m12, m10, [r5 + 1 * mmsize]\r\n    paddd           m8, m12\r\n    pmaddwd         m10, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm12, [r7]                      ; m12 = row 12\r\n    punpckhwd       xm13, xm11, xm12\r\n    punpcklwd       xm11, xm12\r\n    vinserti128     m11, m11, xm13, 1\r\n    pmaddwd         m13, m11, [r5 + 3 * mmsize]\r\n    paddd           m5, m13\r\n    pmaddwd         m13, m11, [r5 + 2 * mmsize]\r\n    paddd           m7, m13\r\n    pmaddwd         m13, m11, [r5 + 1 * mmsize]\r\n    paddd           m9, m13\r\n    pmaddwd         m11, [r5]\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m14\r\n    paddd           m1, m14\r\n    paddd           m2, m14\r\n    paddd           m3, m14\r\n    paddd           m4, m14\r\n    paddd           m5, m14\r\n    psrad       
    m0, 12\r\n    psrad           m1, 12\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m0, m1\r\n    packssdw        m2, m3\r\n    packssdw        m4, m5\r\n\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m5, [interp8_hps_shuf]\r\n    vpermd          m0, m5, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n\r\n    movu            xm13, [r7 + r1]                 ; m13 = row 13\r\n    punpckhwd       xm0, xm12, xm13\r\n    punpcklwd       xm12, xm13\r\n    vinserti128     m12, m12, xm0, 1\r\n    pmaddwd         m0, m12, [r5 + 3 * mmsize]\r\n    paddd           m6, m0\r\n    pmaddwd         m0, m12, [r5 + 2 * mmsize]\r\n    paddd           m8, m0\r\n    pmaddwd         m12, [r5 + 1 * mmsize]\r\n    paddd           m10, m12\r\n    movu            xm0, [r7 + r1 * 2]              ; m0 = row 14\r\n    punpckhwd       xm1, xm13, xm0\r\n    punpcklwd       xm13, xm0\r\n    vinserti128     m13, m13, xm1, 1\r\n    pmaddwd         m1, m13, [r5 + 3 * mmsize]\r\n    paddd           m7, m1\r\n    pmaddwd         m1, m13, [r5 + 2 * mmsize]\r\n    paddd           m9, m1\r\n    pmaddwd         m13, [r5 + 1 * mmsize]\r\n    paddd           m11, m13\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m14\r\n    paddd           m7, 
m14\r\n    psrad           m6, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m7, 6\r\n%endif\r\n    packssdw        m6, m7\r\n    lea             r8, [r2 + r3 * 4]\r\n\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m5, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r8], xm4\r\n    movhps          [r8 + r3], xm4\r\n    movq            [r8 + r3 * 2], xm6\r\n    movhps          [r8 + r6], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm1, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r8], xm4\r\n    movu            [r8 + r3], xm1\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm7\r\n%endif\r\n\r\n    movu            xm1, [r7 + r4]                  ; m1 = row 15\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m2, m0, [r5 + 3 * mmsize]\r\n    paddd           m8, m2\r\n    pmaddwd         m0, [r5 + 2 * mmsize]\r\n    paddd           m10, m0\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm2, [r7]                       ; m2 = row 16\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m3, m1, [r5 + 3 * mmsize]\r\n    paddd           m9, m3\r\n    pmaddwd         m1, [r5 + 2 * mmsize]\r\n    paddd           m11, m1\r\n    movu            xm3, [r7 + r1]                  ; m3 = row 17\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m2, [r5 + 3 * mmsize]\r\n    paddd           m10, m2\r\n    movu            xm4, [r7 + r1 * 2]              ; m4 = row 18\r\n    punpckhwd       xm2, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm2, 1\r\n    pmaddwd         m3, [r5 + 3 * mmsize]\r\n    paddd       
    m11, m3\r\n\r\n%ifidn %1,sp\r\n    paddd           m8, m14\r\n    paddd           m9, m14\r\n    paddd           m10, m14\r\n    paddd           m11, m14\r\n    psrad           m8, 12\r\n    psrad           m9, 12\r\n    psrad           m10, 12\r\n    psrad           m11, 12\r\n%else\r\n    psrad           m8, 6\r\n    psrad           m9, 6\r\n    psrad           m10, 6\r\n    psrad           m11, 6\r\n%endif\r\n    packssdw        m8, m9\r\n    packssdw        m10, m11\r\n    lea             r8, [r8 + r3 * 4]\r\n\r\n%ifidn %1,sp\r\n    packuswb        m8, m10\r\n    vpermd          m8, m5, m8\r\n    vextracti128    xm10, m8, 1\r\n    movq            [r8], xm8\r\n    movhps          [r8 + r3], xm8\r\n    movq            [r8 + r3 * 2], xm10\r\n    movhps          [r8 + r6], xm10\r\n    add             r2, 8\r\n%else\r\n    vpermq          m8, m8, 11011000b\r\n    vpermq          m10, m10, 11011000b\r\n    vextracti128    xm9, m8, 1\r\n    vextracti128    xm11, m10, 1\r\n    movu            [r8], xm8\r\n    movu            [r8 + r3], xm9\r\n    movu            [r8 + r3 * 2], xm10\r\n    movu            [r8 + r6], xm11\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loopW\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_S_AVX2_16x12 sp\r\n    FILTER_VER_LUMA_S_AVX2_16x12 ss\r\n\r\n%macro FILTER_VER_LUMA_S_AVX2_16x4 1\r\nINIT_YMM avx2\r\ncglobal interp_8tap_vert_%1_16x4, 4, 7, 8, 0 - gprsize\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,sp\r\n    mova            m7, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    mov             dword [rsp], 2\r\n.loopW:\r\n    
PROCESS_LUMA_S_AVX2_W8_4R %1\r\n    lea             r6, [r3 * 3]\r\n%ifidn %1,sp\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps          [r2 + r6], xm2\r\n    add             r2, 8\r\n%else\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n    add             r2, 16\r\n%endif\r\n    lea             r6, [8 * r1 - 16]\r\n    sub             r0, r6\r\n    dec             dword [rsp]\r\n    jnz             .loopW\r\n    RET\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_S_AVX2_16x4 sp\r\n    FILTER_VER_LUMA_S_AVX2_16x4 ss\r\n\r\n%macro PROCESS_LUMA_S_AVX2_W8_8R 1\r\n    movu            xm0, [r0]                       ; m0 = row 0\r\n    movu            xm1, [r0 + r1]                  ; m1 = row 1\r\n    punpckhwd       xm2, xm0, xm1\r\n    punpcklwd       xm0, xm1\r\n    vinserti128     m0, m0, xm2, 1\r\n    pmaddwd         m0, [r5]\r\n    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2\r\n    punpckhwd       xm3, xm1, xm2\r\n    punpcklwd       xm1, xm2\r\n    vinserti128     m1, m1, xm3, 1\r\n    pmaddwd         m1, [r5]\r\n    movu            xm3, [r0 + r4]                  ; m3 = row 3\r\n    punpckhwd       xm4, xm2, xm3\r\n    punpcklwd       xm2, xm3\r\n    vinserti128     m2, m2, xm4, 1\r\n    pmaddwd         m4, m2, [r5 + 1 * mmsize]\r\n    paddd           m0, m4\r\n    pmaddwd         m2, [r5]\r\n    lea             r7, [r0 + r1 * 4]\r\n    movu            xm4, [r7]                       ; m4 = row 4\r\n    punpckhwd       xm5, xm3, xm4\r\n    punpcklwd       xm3, xm4\r\n    vinserti128     m3, m3, xm5, 1\r\n    pmaddwd         m5, m3, [r5 + 1 * mmsize]\r\n    paddd           m1, m5\r\n    pmaddwd         m3, [r5]\r\n    movu            xm5, [r7 + r1]                  ; m5 = row 5\r\n    punpckhwd       xm6, xm4, xm5\r\n    punpcklwd       xm4, xm5\r\n    vinserti128     
m4, m4, xm6, 1\r\n    pmaddwd         m6, m4, [r5 + 2 * mmsize]\r\n    paddd           m0, m6\r\n    pmaddwd         m6, m4, [r5 + 1 * mmsize]\r\n    paddd           m2, m6\r\n    pmaddwd         m4, [r5]\r\n    movu            xm6, [r7 + r1 * 2]              ; m6 = row 6\r\n    punpckhwd       xm7, xm5, xm6\r\n    punpcklwd       xm5, xm6\r\n    vinserti128     m5, m5, xm7, 1\r\n    pmaddwd         m7, m5, [r5 + 2 * mmsize]\r\n    paddd           m1, m7\r\n    pmaddwd         m7, m5, [r5 + 1 * mmsize]\r\n    paddd           m3, m7\r\n    pmaddwd         m5, [r5]\r\n    movu            xm7, [r7 + r4]                  ; m7 = row 7\r\n    punpckhwd       xm8, xm6, xm7\r\n    punpcklwd       xm6, xm7\r\n    vinserti128     m6, m6, xm8, 1\r\n    pmaddwd         m8, m6, [r5 + 3 * mmsize]\r\n    paddd           m0, m8\r\n    pmaddwd         m8, m6, [r5 + 2 * mmsize]\r\n    paddd           m2, m8\r\n    pmaddwd         m8, m6, [r5 + 1 * mmsize]\r\n    paddd           m4, m8\r\n    pmaddwd         m6, [r5]\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm8, [r7]                       ; m8 = row 8\r\n    punpckhwd       xm9, xm7, xm8\r\n    punpcklwd       xm7, xm8\r\n    vinserti128     m7, m7, xm9, 1\r\n    pmaddwd         m9, m7, [r5 + 3 * mmsize]\r\n    paddd           m1, m9\r\n    pmaddwd         m9, m7, [r5 + 2 * mmsize]\r\n    paddd           m3, m9\r\n    pmaddwd         m9, m7, [r5 + 1 * mmsize]\r\n    paddd           m5, m9\r\n    pmaddwd         m7, [r5]\r\n    movu            xm9, [r7 + r1]                  ; m9 = row 9\r\n    punpckhwd       xm10, xm8, xm9\r\n    punpcklwd       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddwd         m10, m8, [r5 + 3 * mmsize]\r\n    paddd           m2, m10\r\n    pmaddwd         m10, m8, [r5 + 2 * mmsize]\r\n    paddd           m4, m10\r\n    pmaddwd         m8, [r5 + 1 * mmsize]\r\n    paddd           m6, m8\r\n    movu            xm10, [r7 + r1 * 2]             ; m10 = row 10\r\n    punpckhwd 
      xm8, xm9, xm10\r\n    punpcklwd       xm9, xm10\r\n    vinserti128     m9, m9, xm8, 1\r\n    pmaddwd         m8, m9, [r5 + 3 * mmsize]\r\n    paddd           m3, m8\r\n    pmaddwd         m8, m9, [r5 + 2 * mmsize]\r\n    paddd           m5, m8\r\n    pmaddwd         m9, [r5 + 1 * mmsize]\r\n    paddd           m7, m9\r\n    movu            xm8, [r7 + r4]                  ; m8 = row 11\r\n    punpckhwd       xm9, xm10, xm8\r\n    punpcklwd       xm10, xm8\r\n    vinserti128     m10, m10, xm9, 1\r\n    pmaddwd         m9, m10, [r5 + 3 * mmsize]\r\n    paddd           m4, m9\r\n    pmaddwd         m10, [r5 + 2 * mmsize]\r\n    paddd           m6, m10\r\n    lea             r7, [r7 + r1 * 4]\r\n    movu            xm9, [r7]                       ; m9 = row 12\r\n    punpckhwd       xm10, xm8, xm9\r\n    punpcklwd       xm8, xm9\r\n    vinserti128     m8, m8, xm10, 1\r\n    pmaddwd         m10, m8, [r5 + 3 * mmsize]\r\n    paddd           m5, m10\r\n    pmaddwd         m8, [r5 + 2 * mmsize]\r\n    paddd           m7, m8\r\n\r\n%ifidn %1,sp\r\n    paddd           m0, m11\r\n    paddd           m1, m11\r\n    paddd           m2, m11\r\n    paddd           m3, m11\r\n    paddd           m4, m11\r\n    paddd           m5, m11\r\n    psrad           m0, 12\r\n    psrad           m1, 12\r\n    psrad           m2, 12\r\n    psrad           m3, 12\r\n    psrad           m4, 12\r\n    psrad           m5, 12\r\n%else\r\n    psrad           m0, 6\r\n    psrad           m1, 6\r\n    psrad           m2, 6\r\n    psrad           m3, 6\r\n    psrad           m4, 6\r\n    psrad           m5, 6\r\n%endif\r\n    packssdw        m0, m1\r\n    packssdw        m2, m3\r\n    packssdw        m4, m5\r\n\r\n%ifidn %1,sp\r\n    packuswb        m0, m2\r\n    mova            m5, [interp8_hps_shuf]\r\n    vpermd          m0, m5, m0\r\n    vextracti128    xm2, m0, 1\r\n    movq            [r2], xm0\r\n    movhps          [r2 + r3], xm0\r\n    movq            [r2 + r3 * 2], xm2\r\n    movhps    
      [r2 + r6], xm2\r\n%else\r\n    vpermq          m0, m0, 11011000b\r\n    vpermq          m2, m2, 11011000b\r\n    vextracti128    xm1, m0, 1\r\n    vextracti128    xm3, m2, 1\r\n    movu            [r2], xm0\r\n    movu            [r2 + r3], xm1\r\n    movu            [r2 + r3 * 2], xm2\r\n    movu            [r2 + r6], xm3\r\n%endif\r\n\r\n    movu            xm10, [r7 + r1]                 ; m10 = row 13\r\n    punpckhwd       xm0, xm9, xm10\r\n    punpcklwd       xm9, xm10\r\n    vinserti128     m9, m9, xm0, 1\r\n    pmaddwd         m9, [r5 + 3 * mmsize]\r\n    paddd           m6, m9\r\n    movu            xm0, [r7 + r1 * 2]              ; m0 = row 14\r\n    punpckhwd       xm1, xm10, xm0\r\n    punpcklwd       xm10, xm0\r\n    vinserti128     m10, m10, xm1, 1\r\n    pmaddwd         m10, [r5 + 3 * mmsize]\r\n    paddd           m7, m10\r\n\r\n%ifidn %1,sp\r\n    paddd           m6, m11\r\n    paddd           m7, m11\r\n    psrad           m6, 12\r\n    psrad           m7, 12\r\n%else\r\n    psrad           m6, 6\r\n    psrad           m7, 6\r\n%endif\r\n    packssdw        m6, m7\r\n    lea             r8, [r2 + r3 * 4]\r\n\r\n%ifidn %1,sp\r\n    packuswb        m4, m6\r\n    vpermd          m4, m5, m4\r\n    vextracti128    xm6, m4, 1\r\n    movq            [r8], xm4\r\n    movhps          [r8 + r3], xm4\r\n    movq            [r8 + r3 * 2], xm6\r\n    movhps          [r8 + r6], xm6\r\n%else\r\n    vpermq          m4, m4, 11011000b\r\n    vpermq          m6, m6, 11011000b\r\n    vextracti128    xm5, m4, 1\r\n    vextracti128    xm7, m6, 1\r\n    movu            [r8], xm4\r\n    movu            [r8 + r3], xm5\r\n    movu            [r8 + r3 * 2], xm6\r\n    movu            [r8 + r6], xm7\r\n%endif\r\n%endmacro\r\n\r\n%macro FILTER_VER_LUMA_AVX2_Nx8 2\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_%2x8, 4, 10, 12\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea 
            r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,sp\r\n    mova            m11, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r9d, %2 / 8\r\n.loopW:\r\n    PROCESS_LUMA_S_AVX2_W8_8R %1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loopW\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_AVX2_Nx8 sp, 32\r\n    FILTER_VER_LUMA_AVX2_Nx8 sp, 16\r\n    FILTER_VER_LUMA_AVX2_Nx8 ss, 32\r\n    FILTER_VER_LUMA_AVX2_Nx8 ss, 16\r\n\r\n%macro FILTER_VER_LUMA_S_AVX2_32x24 1\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_vert_%1_32x24, 4, 10, 15\r\n    mov             r4d, r4m\r\n    shl             r4d, 7\r\n    add             r1d, r1d\r\n\r\n%ifdef PIC\r\n    lea             r5, [pw_LumaCoeffVer]\r\n    add             r5, r4\r\n%else\r\n    lea             r5, [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea             r4, [r1 * 3]\r\n    sub             r0, r4\r\n%ifidn %1,sp\r\n    mova            m14, [pd_526336]\r\n%else\r\n    add             r3d, r3d\r\n%endif\r\n    lea             r6, [r3 * 3]\r\n    mov             r9d, 4\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W8_16R %1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loopW\r\n    lea             r9, [r1 * 4]\r\n    sub             r7, r9\r\n    lea             r0, [r7 - 48]\r\n%ifidn %1,sp\r\n    lea             r2, [r8 + r3 * 4 - 24]\r\n%else\r\n    lea             r2, [r8 + r3 * 4 - 48]\r\n%endif\r\n    mova            m11, m14\r\n    mov             r9d, 4\r\n.loop:\r\n    PROCESS_LUMA_S_AVX2_W8_8R 
%1\r\n%ifidn %1,sp\r\n    add             r2, 8\r\n%else\r\n    add             r2, 16\r\n%endif\r\n    add             r0, 16\r\n    dec             r9d\r\n    jnz             .loop\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\n    FILTER_VER_LUMA_S_AVX2_32x24 sp\r\n    FILTER_VER_LUMA_S_AVX2_32x24 ss\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_32x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_32x32, 4,6,8\r\n    mov             r4d, r4m\r\n    add             r3d, r3d\r\n    dec             r0\r\n\r\n    ; check isRowExt (consumed by the je below; lea, mov and the vector loads leave EFLAGS untouched)\r\n    cmp             r5m, byte 0\r\n\r\n    lea             r5, [tab_ChromaCoeff]\r\n    vpbroadcastw    m0, [r5 + r4 * 4 + 0]\r\n    vpbroadcastw    m1, [r5 + r4 * 4 + 2]\r\n    mova            m7, [pw_2000]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff Low\r\n    ; m1 - interpolate coeff High\r\n    ; m7 - constant pw_2000\r\n    mov             r4d, 32\r\n    je             .loop\r\n    sub             r0, r1                          ; isRowExt != 0: start one row above the block\r\n    add             r4d, 3                          ; and produce 3 extra rows\r\n\r\n.loop:\r\n    ; one output row per iteration\r\n    movu            m2, [r0]\r\n    movu            m3, [r0 + 1]\r\n    punpckhbw       m4, m2, m3\r\n    punpcklbw       m2, m3\r\n    pmaddubsw       m4, m0\r\n    pmaddubsw       m2, m0\r\n\r\n    movu            m3, [r0 + 2]\r\n    movu            m5, [r0 + 3]\r\n    punpckhbw       m6, m3, m5\r\n    punpcklbw       m3, m5\r\n    pmaddubsw       m6, m1\r\n    pmaddubsw       m3, m1\r\n\r\n    paddw           m4, m6\r\n    paddw           m2, m3\r\n    psubw           m4, m7\r\n    psubw           m2, m7\r\n    vperm2i128      m3, m2, m4, 0x20\r\n    vperm2i128      m5, m2, m4, 0x31\r\n    movu            [r2], m3\r\n    
movu            [r2 + mmsize], m5\r\n\r\n    add             r2, r3\r\n    add             r0, r1\r\n    dec             r4d\r\n    jnz            .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_16x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_16x16, 4,7,6\r\n    mov             r4d, r4m\r\n    mov             r5d, r5m\r\n    add             r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea               r6,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128     m2,           [pw_1]\r\n    vbroadcasti128     m5,           [pw_2000]\r\n    mova               m1,           [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n    ; m5 - constant pw_2000\r\n    mov                r6d,         16\r\n    dec                r0\r\n    test               r5d,         r5d\r\n    je                 .loop\r\n    sub                r0,          r1                          ; isRowExt != 0: start one row above the block\r\n    add                r6d,         3                           ; and produce 3 extra rows\r\n\r\n.loop:\r\n    ; one output row per iteration\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2                          ; multiply by 1 and add adjacent word pairs\r\n    vbroadcasti128    m4,           [r0 + 8]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           m5\r\n    vpermq            m3,           m3,       
   11011000b\r\n    movu              [r2],         m3\r\n\r\n    add                r2,          r3\r\n    add                r0,          r1\r\n    dec                r6d\r\n    jnz                .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_16xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_PS_16xN_AVX2 2\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_%1x%2, 4,7,6\r\n    mov                    r4d,        r4m\r\n    mov                    r5d,        r5m\r\n    add                    r3d,        r3d\r\n\r\n%ifdef PIC\r\n    lea                    r6,         [tab_ChromaCoeff]\r\n    vpbroadcastd           m0,         [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd           m0,         [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128         m2,         [pw_1]\r\n    vbroadcasti128         m5,         [pw_2000]\r\n    mova                   m1,         [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n    mov                    r6d,        %2\r\n    dec                    r0\r\n    test                   r5d,        r5d\r\n    je                     .loop\r\n    sub                    r0 ,        r1\r\n    add                    r6d ,       3\r\n\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128         m3,         [r0]\r\n    pshufb                 m3,         m1\r\n    pmaddubsw              m3,         m0\r\n    pmaddwd                m3,         m2\r\n    vbroadcasti128         m4,         [r0 + 8]\r\n    pshufb                 m4,         m1\r\n    pmaddubsw              m4,         m0\r\n    pmaddwd                m4,         
m2\r\n\r\n    packssdw               m3,         m4\r\n    psubw                  m3,         m5\r\n\r\n    vpermq                 m3,         m3,          11011000b\r\n    movu                   [r2],       m3\r\n\r\n    add                    r2,         r3\r\n    add                    r0,         r1\r\n    dec                    r6d\r\n    jnz                    .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_PS_16xN_AVX2  16 , 32\r\n    IPFILTER_CHROMA_PS_16xN_AVX2  16 , 12\r\n    IPFILTER_CHROMA_PS_16xN_AVX2  16 , 8\r\n    IPFILTER_CHROMA_PS_16xN_AVX2  16 , 4\r\n    IPFILTER_CHROMA_PS_16xN_AVX2  16 , 24\r\n    IPFILTER_CHROMA_PS_16xN_AVX2  16 , 64\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_32xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_PS_32xN_AVX2 2\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_%1x%2, 4,7,6\r\n    mov                r4d,          r4m\r\n    mov                r5d,          r5m\r\n    add                r3d,          r3d\r\n\r\n%ifdef PIC\r\n    lea                r6,           [tab_ChromaCoeff]\r\n    vpbroadcastd       m0,           [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd       m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128     m2,           [pw_1]\r\n    vbroadcasti128     m5,           [pw_2000]\r\n    mova               m1,           [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n    mov                r6d,          %2\r\n    dec                r0\r\n    test               r5d,          r5d\r\n    je                 .loop\r\n    sub                r0 ,          r1\r\n    add           
     r6d ,         3\r\n\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128     m3,           [r0]\r\n    pshufb             m3,           m1\r\n    pmaddubsw          m3,           m0\r\n    pmaddwd            m3,           m2\r\n    vbroadcasti128     m4,           [r0 + 8]\r\n    pshufb             m4,           m1\r\n    pmaddubsw          m4,           m0\r\n    pmaddwd            m4,           m2\r\n\r\n    packssdw           m3,           m4\r\n    psubw              m3,           m5\r\n\r\n    vpermq             m3,           m3,          11011000b\r\n    movu              [r2],          m3\r\n\r\n    vbroadcasti128     m3,           [r0 + 16]\r\n    pshufb             m3,           m1\r\n    pmaddubsw          m3,           m0\r\n    pmaddwd            m3,           m2\r\n    vbroadcasti128     m4,           [r0 + 24]\r\n    pshufb             m4,           m1\r\n    pmaddubsw          m4,           m0\r\n    pmaddwd            m4,           m2\r\n\r\n    packssdw           m3,           m4\r\n    psubw              m3,           m5\r\n\r\n    vpermq             m3,           m3,          11011000b\r\n    movu               [r2 + 32],    m3\r\n\r\n    add                r2,           r3\r\n    add                r0,           r1\r\n    dec                r6d\r\n    jnz                .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 16\r\n    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 24\r\n    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 8\r\n    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 64\r\n    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 48\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal 
interp_4tap_horiz_ps_4x4, 4,7,5\r\n    mov             r4d, r4m\r\n    mov             r5d, r5m\r\n    add             r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea               r6,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128     m2,           [pw_1]\r\n    vbroadcasti128     m1,           [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec                r0\r\n    test                r5d,       r5d\r\n    je                 .label\r\n    sub                r0 , r1\r\n\r\n.label\r\n    ; Row 0-1\r\n    movu              xm3,           [r0]\r\n    vinserti128       m3,           m3,      [r0 + r1],     1\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    ; Row 2-3\r\n    lea               r0,           [r0 + r1 * 2]\r\n    movu              xm4,           [r0]\r\n    vinserti128       m4,           m4,      [r0 + r1],     1\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           [pw_2000]\r\n    vextracti128      xm4,          m3,     1\r\n    movq              [r2],         xm3\r\n    movq              [r2+r3],      xm4\r\n    lea               r2,           [r2 + r3 * 2]\r\n    movhps            [r2],         xm3\r\n    movhps            [r2 + r3],    xm4\r\n\r\n    test                r5d,        r5d\r\n    jz                .end\r\n    lea               r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1 * 2]\r\n\r\n    ;Row 5-6\r\n    movu              xm3,          [r0]\r\n    vinserti128       m3,           m3,      [r0 + r1],     1\r\n    pshufb            m3,         
  m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    ; Row 7\r\n    lea               r0,           [r0 + r1 * 2]\r\n    vbroadcasti128    m4,           [r0]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           [pw_2000]\r\n\r\n    vextracti128      xm4,          m3,     1\r\n    movq              [r2],         xm3\r\n    movq              [r2+r3],      xm4\r\n    lea               r2,           [r2 + r3 * 2]\r\n    movhps            [r2],         xm3\r\n.end\r\n    RET\r\n\r\ncglobal interp_4tap_horiz_ps_4x2, 4,7,5\r\n    mov             r4d, r4m\r\n    mov             r5d, r5m\r\n    add             r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea               r6,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128     m2,           [pw_1]\r\n    vbroadcasti128     m1,           [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec                r0\r\n    test                r5d,       r5d\r\n    je                 .label\r\n    sub                r0 , r1\r\n\r\n.label\r\n    ; Row 0-1\r\n    movu              xm3,           [r0]\r\n    vinserti128       m3,           m3,      [r0 + r1],     1\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    packssdw          m3,           m3\r\n    psubw             m3,           [pw_2000]\r\n    vextracti128      xm4,          m3,     1\r\n    movq              [r2],         xm3\r\n    movq              [r2+r3],      xm4\r\n\r\n    test              r5d,          
r5d\r\n    jz                .end\r\n    lea               r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1 * 2]\r\n\r\n    ;Row 2-3\r\n    movu              xm3,          [r0]\r\n    vinserti128       m3,           m3,      [r0 + r1],     1\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    ; Row 4\r\n    lea               r0,           [r0 + r1 * 2]\r\n    vbroadcasti128    m4,           [r0]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           [pw_2000]\r\n\r\n    vextracti128      xm4,          m3,     1\r\n    movq              [r2],         xm3\r\n    movq              [r2+r3],      xm4\r\n    lea               r2,           [r2 + r3 * 2]\r\n    movhps            [r2],         xm3\r\n.end\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------;\r\n%macro IPFILTER_CHROMA_PS_4xN_AVX2 2\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_%1x%2, 4,7,5\r\n    mov             r4d, r4m\r\n    mov             r5d, r5m\r\n    add             r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea               r6,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128     m2,           [pw_1]\r\n    vbroadcasti128     m1,           [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate 
coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    mov              r4,                %2\r\n    dec              r0\r\n    test             r5d,       r5d\r\n    je               .loop\r\n    sub              r0 ,               r1\r\n\r\n\r\n.loop\r\n    sub              r4d,           4\r\n    ; Row 0-1\r\n    movu              xm3,          [r0]\r\n    vinserti128       m3,           m3,      [r0 + r1],     1\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    ; Row 2-3\r\n    lea               r0,           [r0 + r1 * 2]\r\n    movu              xm4,          [r0]\r\n    vinserti128       m4,           m4,      [r0 + r1],     1\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           [pw_2000]\r\n    vextracti128      xm4,          m3,     1\r\n    movq              [r2],         xm3\r\n    movq              [r2+r3],      xm4\r\n    lea               r2,           [r2 + r3 * 2]\r\n    movhps            [r2],         xm3\r\n    movhps            [r2 + r3],    xm4\r\n\r\n    lea               r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1 * 2]\r\n\r\n    test              r4d,          r4d\r\n    jnz               .loop\r\n    test                r5d,        r5d\r\n    jz                .end\r\n\r\n    ;Row 5-6\r\n    movu              xm3,          [r0]\r\n    vinserti128       m3,           m3,      [r0 + r1],     1\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    ; Row 7\r\n    lea               r0,           [r0 + r1 * 2]\r\n    vbroadcasti128    m4,           [r0]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw   
      m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           [pw_2000]\r\n\r\n    vextracti128      xm4,          m3,     1\r\n    movq              [r2],         xm3\r\n    movq              [r2+r3],      xm4\r\n    lea               r2,           [r2 + r3 * 2]\r\n    movhps            [r2],         xm3\r\n.end\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_PS_4xN_AVX2  4 , 8\r\n    IPFILTER_CHROMA_PS_4xN_AVX2  4 , 16\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_8x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------;\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_8x8, 4,7,6\r\n    mov             r4d, r4m\r\n    mov             r5d, r5m\r\n    add             r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea               r6,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128     m2,           [pw_1]\r\n    vbroadcasti128     m5,           [pw_2000]\r\n    mova               m1,           [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    mov                r6d,      4\r\n    dec                r0\r\n    test                r5d,     r5d\r\n    je                 .loop\r\n    sub                r0 ,      r1\r\n    add                r6d ,     1\r\n\r\n.loop\r\n     dec               r6d\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           
m2\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,           [r0 + r1]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           m5\r\n\r\n    vpermq            m3,           m3,          11011000b\r\n    vextracti128      xm4,          m3,     1\r\n    movu             [r2],         xm3\r\n    movu             [r2 + r3],    xm4\r\n\r\n    lea               r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1 * 2]\r\n    test               r6d,          r6d\r\n    jnz               .loop\r\n    test              r5d,         r5d\r\n    je                .end\r\n\r\n    ;Row 11\r\n    vbroadcasti128    m3,           [r0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    packssdw          m3,           m3\r\n    psubw             m3,           m5\r\n    vpermq            m3,           m3,          11011000b\r\n    movu             [r2],         xm3\r\n.end\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_4x2, 4,6,4\r\n    mov             r4d, r4m\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128    m1,           [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n\r\n    ; Row 0-1\r\n    movu              xm2,          [r0 - 1]\r\n    vinserti128       m2,           m2,      [r0 + r1 - 1],     1\r\n    pshufb            m2,           m1\r\n    pmaddubsw         m2,           m0\r\n    pmaddwd           m2,           [pw_1]\r\n\r\n    packssdw          m2,           m2\r\n    pmulhrsw          m2,           [pw_512]\r\n    vextracti128      xm3,          m2,     
1\r\n    packuswb          xm2,          xm3\r\n\r\n    movd              [r2],         xm2\r\n    pextrd            [r2+r3],      xm2,     2\r\n    RET\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_32xN(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_PP_32xN_AVX2 2\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_%1x%2, 4,6,7\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m1,           [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,           [pw_1]\r\n    mova              m6,           [pw_512]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,          %2\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 4]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    vbroadcasti128    m4,           [r0 + 16]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + 20]\r\n    pshufb            m5,           m1\r\n    
pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n\r\n    movu              [r2],         m3\r\n    add               r2,           r3\r\n    add               r0,           r1\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_PP_32xN_AVX2 32, 16\r\n    IPFILTER_CHROMA_PP_32xN_AVX2 32, 24\r\n    IPFILTER_CHROMA_PP_32xN_AVX2 32, 8\r\n    IPFILTER_CHROMA_PP_32xN_AVX2 32, 64\r\n    IPFILTER_CHROMA_PP_32xN_AVX2 32, 48\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_PP_8xN_AVX2 2\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_%1x%2, 4,6,6\r\n    mov               r4d,    r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    movu              m1,           [tab_Tm]\r\n    vpbroadcastd      m2,           [pw_1]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    sub               r0,           1\r\n    mov               r4d,          %2\r\n\r\n.loop:\r\n    sub               r4d,          4\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           
m2\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,           [r0 + r1]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           [pw_512]\r\n    lea               r0,           [r0 + r1 * 2]\r\n\r\n    ; Row 2\r\n    vbroadcasti128    m4,           [r0 ]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    ; Row 3\r\n    vbroadcasti128    m5,           [r0 + r1]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           [pw_512]\r\n\r\n    packuswb          m3,           m4\r\n    mova              m5,           [interp_4tap_8x8_horiz_shuf]\r\n    vpermd            m3,           m5,     m3\r\n    vextracti128      xm4,          m3,     1\r\n    movq              [r2],         xm3\r\n    movhps            [r2 + r3],    xm3\r\n    lea               r2,           [r2 + r3 * 2]\r\n    movq              [r2],         xm4\r\n    movhps            [r2 + r3],    xm4\r\n    lea               r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1*2]\r\n    test              r4d,          r4d\r\n    jnz               .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 16\r\n    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 32\r\n    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 4\r\n    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 64\r\n    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 12\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void 
interp_4tap_horiz_pp_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_PP_4xN_AVX2 2\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_%1x%2, 4,6,6\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vpbroadcastd      m2,           [pw_1]\r\n    vbroadcasti128    m1,           [tab_Tm]\r\n    mov               r4d,          %2\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec                r0\r\n\r\n.loop\r\n    sub               r4d,          4\r\n    ; Row 0-1\r\n    movu              xm3,          [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    vinserti128       m3,           m3,      [r0 + r1],     1\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n\r\n    ; Row 2-3\r\n    lea               r0,           [r0 + r1 * 2]\r\n    movu              xm4,          [r0]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    vinserti128       m4,           m4,      [r0 + r1],     1\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           [pw_512]\r\n    vextracti128      xm4,          m3,                     1\r\n    packuswb          xm3,          xm4\r\n\r\n    movd              [r2],         xm3\r\n    pextrd            [r2+r3],      xm3,                    2\r\n    lea               r2,           [r2 + r3 * 2]\r\n    pextrd            [r2],         xm3,   
                 1\r\n    pextrd            [r2+r3],      xm3,                    3\r\n\r\n    lea               r0,           [r0 + r1 * 2]\r\n    lea               r2,           [r2 + r3 * 2]\r\n    test              r4d,          r4d\r\n    jnz               .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_PP_4xN_AVX2  4 , 8\r\n    IPFILTER_CHROMA_PP_4xN_AVX2  4 , 16\r\n\r\n%macro IPFILTER_LUMA_PS_32xN_AVX2 2\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_ps_%1x%2, 4, 7, 8\r\n    mov                         r5d,               r5m\r\n    mov                         r4d,               r4m\r\n%ifdef PIC\r\n    lea                         r6,                [tab_LumaCoeff]\r\n    vpbroadcastq                m0,                [r6 + r4 * 8]\r\n%else\r\n    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    mova                        m6,                [tab_Lm + 32]\r\n    mova                        m1,                [tab_Lm]\r\n    mov                         r4d,                %2                           ;height\r\n    add                         r3d,               r3d\r\n    vbroadcasti128              m2,                [pw_1]\r\n    mova                        m7,                [interp8_hps_shuf]\r\n\r\n    ; register map\r\n    ; m0      - interpolate coeff\r\n    ; m1 , m6 - shuffle order table\r\n    ; m2      - pw_1\r\n\r\n\r\n    sub                         r0,                3\r\n    test                        r5d,               r5d\r\n    jz                          .label\r\n    lea                         r6,                [r1 * 3]                     ; r6 = (N / 2 - 1) * srcStride\r\n    sub                         r0,                r6\r\n    add                         r4d,                7\r\n\r\n.label\r\n    lea                         r6,                 [pw_2000]\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 
3 2 1 0]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 4 to 7)\r\n    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n\r\n\r\n    vbroadcasti128              m4,                [r0 + 8]\r\n    pshufb                      m5,                m4,            m6            ;row 0 (col 12 to 15)\r\n    pshufb                      m4,                m1                           ;row 0 (col 8 to 11)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m2\r\n    pmaddwd                     m5,                m2\r\n    packssdw                    m4,                m5\r\n\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n    vpermd                      m3,                m7,               m3\r\n    psubw                       m3,                [r6]\r\n\r\n    movu                        [r2],              m3                          ;row 0\r\n\r\n    vbroadcasti128              m3,                [r0 + 16]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 20 to 23)\r\n    pshufb                      m3,                m1                           ; row 0 (col 16 to 19)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,   
             m2\r\n    packssdw                    m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 24]\r\n    pshufb                      m5,                m4,            m6            ;row 0 (col 28 to 31)\r\n    pshufb                      m4,                m1                           ;row 0 (col 24 to 27)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m2\r\n    pmaddwd                     m5,                m2\r\n    packssdw                    m4,                m5\r\n\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n    vpermd                      m3,                m7,               m3\r\n    psubw                       m3,                [r6]\r\n\r\n    movu                        [r2 + 32],         m3                          ;row 0\r\n\r\n    add                         r0,                r1\r\n    add                         r2,                r3\r\n    dec                         r4d\r\n    jnz                         .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_LUMA_PS_32xN_AVX2 32 , 32\r\n    IPFILTER_LUMA_PS_32xN_AVX2 32 , 16\r\n    IPFILTER_LUMA_PS_32xN_AVX2 32 , 24\r\n    IPFILTER_LUMA_PS_32xN_AVX2 32 , 8\r\n    IPFILTER_LUMA_PS_32xN_AVX2 32 , 64\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_ps_48x64, 4, 7, 8\r\n    mov                         r5d,               r5m\r\n    mov                         r4d,               r4m\r\n%ifdef PIC\r\n    lea                         r6,                [tab_LumaCoeff]\r\n    vpbroadcastq                m0,                [r6 + r4 * 8]\r\n%else\r\n    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    mova                        m6,                [tab_Lm + 32]\r\n    mova              
          m1,                [tab_Lm]\r\n    mov                         r4d,               64                           ;height\r\n    add                         r3d,               r3d\r\n    vbroadcasti128              m2,                [pw_2000]\r\n    mova                        m7,                [pw_1]\r\n\r\n    ; register map\r\n    ; m0      - interpolate coeff\r\n    ; m1 , m6 - shuffle order table\r\n    ; m2      - pw_2000\r\n\r\n    sub                         r0,                3\r\n    test                        r5d,               r5d\r\n    jz                          .label\r\n    lea                         r6,                [r1 * 3]                     ; r6 = (N / 2 - 1) * srcStride\r\n    sub                         r0,                r6                           ; src -= (N / 2 - 1) * srcStride\r\n    add                         r4d,                7                            ; blkheight += N - 1 (8-tap filter: 7 extra rows)\r\n\r\n.label\r\n    lea                         r6,                [interp8_hps_shuf]\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 4 to 7)\r\n    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 8]\r\n    pshufb                      m5,                m4,             m6            ;row 0 (col 12 to 15)\r\n    pshufb                      m4,                m1                           ;row 0 (col 8 
to 11)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m7\r\n    pmaddwd                     m5,                m7\r\n    packssdw                    m4,                m5\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n    mova                        m5,                [r6]\r\n    vpermd                      m3,                m5,             m3\r\n    psubw                       m3,                m2\r\n    movu                        [r2],              m3                          ;row 0\r\n\r\n    vbroadcasti128              m3,                [r0 + 16]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 20 to 23)\r\n    pshufb                      m3,                m1                           ; row 0 (col 16 to 19)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 24]\r\n    pshufb                      m5,                m4,             m6            ;row 0 (col 28 to 31)\r\n    pshufb                      m4,                m1                           ;row 0 (col 24 to 27)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m7\r\n    pmaddwd                     m5,                m7\r\n    packssdw                    m4,                m5\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw 
                   m3,                m4\r\n    mova                        m5,                [r6]\r\n    vpermd                      m3,                m5,               m3\r\n    psubw                       m3,                m2\r\n    movu                        [r2 + 32],         m3                          ;row 0\r\n\r\n    vbroadcasti128              m3,                [r0 + 32]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 36 to 39)\r\n    pshufb                      m3,                m1                           ; row 0 (col 32 to 35)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 40]\r\n    pshufb                      m5,                m4,            m6            ;row 0 (col 44 to 47)\r\n    pshufb                      m4,                m1                           ;row 0 (col 40 to 43)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m7\r\n    pmaddwd                     m5,                m7\r\n    packssdw                    m4,                m5\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n    mova                        m5,                [r6]\r\n    vpermd                      m3,                m5,               m3\r\n    psubw                       m3,                m2\r\n    movu                        [r2 + 64],         m3                          ;row 0\r\n\r\n    add                         r0,                r1\r\n    add                         r2, 
               r3\r\n    dec                         r4d\r\n    jnz                         .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_pp_24x32, 4,6,8\r\n    sub               r0,         3\r\n    mov               r4d,        r4m\r\n%ifdef PIC\r\n    lea               r5,         [tab_LumaCoeff]\r\n    vpbroadcastd      m0,         [r5 + r4 * 8]\r\n    vpbroadcastd      m1,         [r5 + r4 * 8 + 4]\r\n%else\r\n    vpbroadcastd      m0,         [tab_LumaCoeff + r4 * 8]\r\n    vpbroadcastd      m1,         [tab_LumaCoeff + r4 * 8 + 4]\r\n%endif\r\n    movu              m3,         [tab_Tm + 16]\r\n    vpbroadcastd      m7,         [pw_1]\r\n    lea               r5,         [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 , m1 interpolate coeff\r\n    ; m3 , [r5] - shuffle order table (tab_Tm + 16 , tab_Tm)\r\n    ; m7 - pw_1\r\n\r\n    mov               r4d,        32\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m4,         [r0]                        ; [x E D C B A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,         m4,     m3\r\n    pshufb            m4,         [r5]\r\n    pmaddubsw         m4,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m4,         m5\r\n    pmaddwd           m4,         m7\r\n\r\n    vbroadcasti128    m5,         [r0 + 8]\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [r5]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             m5,         m6\r\n    pmaddwd           m5,         m7\r\n    packssdw          m4,         m5                          ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00]\r\n    pmulhrsw          m4,         [pw_512]\r\n\r\n    vbroadcasti128    m2,         [r0 + 16]\r\n    pshufb            m5,         m2,     m3\r\n    pshufb            m2,         [r5]\r\n    pmaddubsw         m2,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m2,         m5\r\n    pmaddwd        
   m2,         m7\r\n\r\n    packssdw          m2,         m2\r\n    pmulhrsw          m2,         [pw_512]\r\n    packuswb          m4,         m2\r\n    vpermq            m4,         m4,     11011000b\r\n    vextracti128      xm5,        m4,     1\r\n    pshufd            xm4,        xm4,    11011000b\r\n    pshufd            xm5,        xm5,    11011000b\r\n\r\n    movu              [r2],       xm4\r\n    movq              [r2 + 16],  xm5\r\n    add               r0,         r1\r\n    add               r2,         r3\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_pp_12x16, 4,6,8\r\n    sub               r0,        3\r\n    mov               r4d,       r4m\r\n%ifdef PIC\r\n    lea               r5,        [tab_LumaCoeff]\r\n    vpbroadcastd      m0,        [r5 + r4 * 8]\r\n    vpbroadcastd      m1,        [r5 + r4 * 8 + 4]\r\n%else\r\n    vpbroadcastd      m0,         [tab_LumaCoeff + r4 * 8]\r\n    vpbroadcastd      m1,         [tab_LumaCoeff + r4 * 8 + 4]\r\n%endif\r\n    movu              m3,         [tab_Tm + 16]\r\n    vpbroadcastd      m7,         [pw_1]\r\n    lea               r5,         [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 , m1 interpolate coeff\r\n    ; m3 , [r5] - shuffle order table (tab_Tm + 16 , tab_Tm)\r\n    ; m7 - pw_1\r\n\r\n    mov               r4d,        8\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m4,         [r0]                        ; first 8 elements\r\n    pshufb            m5,         m4,     m3\r\n    pshufb            m4,         [r5]\r\n    pmaddubsw         m4,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m4,         m5\r\n    pmaddwd           m4,         m7\r\n\r\n    vbroadcasti128    m5,         [r0 + 8]                    ; elements 8 to 11\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [r5]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             
m5,         m6\r\n    pmaddwd           m5,         m7\r\n\r\n    packssdw          m4,         m5                          ; [17 16 15 14 07 06 05 04 13 12 11 10 03 02 01 00]\r\n    pmulhrsw          m4,         [pw_512]\r\n\r\n    ;Row 1\r\n    vbroadcasti128    m2,         [r0 + r1]\r\n    pshufb            m5,         m2,     m3\r\n    pshufb            m2,         [r5]\r\n    pmaddubsw         m2,         m0\r\n    pmaddubsw         m5,         m1\r\n    paddw             m2,         m5\r\n    pmaddwd           m2,         m7\r\n\r\n    vbroadcasti128    m5,         [r0 + r1 + 8]\r\n    pshufb            m6,         m5,     m3\r\n    pshufb            m5,         [r5]\r\n    pmaddubsw         m5,         m0\r\n    pmaddubsw         m6,         m1\r\n    paddw             m5,         m6\r\n    pmaddwd           m5,         m7\r\n\r\n    packssdw          m2,         m5\r\n    pmulhrsw          m2,         [pw_512]\r\n    packuswb          m4,         m2\r\n    vpermq            m4,         m4,     11011000b\r\n    vextracti128      xm5,        m4,     1\r\n    pshufd            xm4,        xm4,    11011000b\r\n    pshufd            xm5,        xm5,    11011000b\r\n\r\n    movq              [r2],       xm4\r\n    pextrd            [r2+8],     xm4,    2\r\n    movq              [r2 + r3],  xm5\r\n    pextrd            [r2+r3+8],  xm5,    2\r\n    lea               r0,         [r0 + r1 * 2]\r\n    lea               r2,         [r2 + r3 * 2]\r\n    dec               r4d\r\n    jnz              .loop\r\n    RET\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_16xN(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_PP_16xN_AVX2 2\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 7\r\n    mov  
             r4d,          r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m6,           [pw_512]\r\n    mova              m1,           [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,           [pw_1]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,          %2/2\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 4]                    ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,           [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + r1 + 4]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n\r\n    vextracti128      xm4,          m3,       1\r\n    movu              [r2],         xm3\r\n    movu              [r2 + r3],    xm4\r\n    lea            
   r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1 * 2]\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 8\r\n    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 32\r\n    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 12\r\n    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 4\r\n    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 64\r\n    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 24\r\n\r\n%macro IPFILTER_LUMA_PS_64xN_AVX2 1\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_ps_64x%1, 4, 7, 8\r\n    mov                         r5d,               r5m\r\n    mov                         r4d,               r4m\r\n%ifdef PIC\r\n    lea                         r6,                [tab_LumaCoeff]\r\n    vpbroadcastq                m0,                [r6 + r4 * 8]\r\n%else\r\n    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    mova                        m6,                [tab_Lm + 32]\r\n    mova                        m1,                [tab_Lm]\r\n    mov                         r4d,               %1                           ;height\r\n    add                         r3d,               r3d\r\n    vbroadcasti128              m2,                [pw_1]\r\n    mova                        m7,                [interp8_hps_shuf]\r\n\r\n    ; register map\r\n    ; m0      - interpolate coeff\r\n    ; m1 , m6 - shuffle order table\r\n    ; m2      - pw_1\r\n    ; m7      - interp8_hps_shuf (vpermd order)\r\n    ; [r6]    - pw_2000 (loaded at .label)\r\n\r\n    sub                         r0,                3\r\n    test                        r5d,               r5d\r\n    jz                          .label\r\n    lea                         r6,                [r1 * 3]\r\n    sub                         r0,                r6                           ; r0(src)-r6\r\n    add                         r4d,               7                            ; blkheight += N - 1\r\n\r\n.label\r\n    lea                         r6,                [pw_2000]\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128    
          m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 4 to 7)\r\n    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 8]\r\n    pshufb                      m5,                m4,            m6            ;row 0 (col 12 to 15)\r\n    pshufb                      m4,                m1                           ;row 0 (col 8 to 11)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m2\r\n    pmaddwd                     m5,                m2\r\n    packssdw                    m4,                m5\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n    vpermd                      m3,                m7,               m3\r\n    psubw                       m3,                [r6]\r\n    movu                        [r2],              m3                          ;row 0\r\n\r\n    vbroadcasti128              m3,                [r0 + 16]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 20 to 23)\r\n    pshufb                      m3,                m1                           ; row 0 (col 16 to 19)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd           
          m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 24]\r\n    pshufb                      m5,                m4,            m6            ;row 0 (col 28 to 31)\r\n    pshufb                      m4,                m1                           ;row 0 (col 24 to 27)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m2\r\n    pmaddwd                     m5,                m2\r\n    packssdw                    m4,                m5\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n    vpermd                      m3,                m7,               m3\r\n    psubw                       m3,                [r6]\r\n    movu                        [r2 + 32],         m3                          ;row 0\r\n\r\n    vbroadcasti128              m3,                [r0 + 32]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 36 to 39)\r\n    pshufb                      m3,                m1                           ; row 0 (col 32 to 35)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 40]\r\n    pshufb                      m5,                m4,            m6            ;row 0 (col 44 to 47)\r\n    pshufb                      m4,                m1                           ;row 0 (col 40 to 43)\r\n    pmaddubsw                   m4,                
m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m2\r\n    pmaddwd                     m5,                m2\r\n    packssdw                    m4,                m5\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n    vpermd                      m3,                m7,               m3\r\n    psubw                       m3,                [r6]\r\n    movu                        [r2 + 64],         m3                          ;row 0\r\n    vbroadcasti128              m3,                [r0 + 48]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 52 to 55)\r\n    pshufb                      m3,                m1                           ; row 0 (col 48 to 51)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 56]\r\n    pshufb                      m5,                m4,            m6            ;row 0 (col 60 to 63)\r\n    pshufb                      m4,                m1                           ;row 0 (col 56 to 59)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m2\r\n    pmaddwd                     m5,                m2\r\n    packssdw                    m4,                m5\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4\r\n    vpermd                      m3,                m7,               m3\r\n  
  psubw                       m3,                [r6]\r\n    movu                        [r2 + 96],         m3                          ;row 0\r\n\r\n    add                          r0,                r1\r\n    add                          r2,                r3\r\n    dec                          r4d\r\n    jnz                         .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_LUMA_PS_64xN_AVX2 64\r\n    IPFILTER_LUMA_PS_64xN_AVX2 48\r\n    IPFILTER_LUMA_PS_64xN_AVX2 32\r\n    IPFILTER_LUMA_PS_64xN_AVX2 16\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_PS_8xN_AVX2 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_8x%1, 4,7,6\r\n    mov                r4d,             r4m\r\n    mov                r5d,             r5m\r\n    add                r3d,             r3d\r\n\r\n%ifdef PIC\r\n    lea                r6,              [tab_ChromaCoeff]\r\n    vpbroadcastd       m0,              [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd       m0,              [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128     m2,              [pw_1]\r\n    vbroadcasti128     m5,              [pw_2000]\r\n    mova               m1,              [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    mov                r6d,             %1/2\r\n    dec                r0\r\n    test               r5d,             r5d\r\n    jz                 .loop\r\n    sub                r0 ,             r1\r\n    inc                r6d\r\n\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128     m3,              [r0]\r\n    pshufb             
m3,              m1\r\n    pmaddubsw          m3,              m0\r\n    pmaddwd            m3,              m2\r\n\r\n    ; Row 1\r\n    vbroadcasti128     m4,              [r0 + r1]\r\n    pshufb             m4,              m1\r\n    pmaddubsw          m4,              m0\r\n    pmaddwd            m4,              m2\r\n    packssdw           m3,              m4\r\n    psubw              m3,              m5\r\n    vpermq             m3,              m3,          11011000b\r\n    vextracti128       xm4,             m3,          1\r\n    movu               [r2],            xm3\r\n    movu               [r2 + r3],       xm4\r\n\r\n    lea                r2,              [r2 + r3 * 2]\r\n    lea                r0,              [r0 + r1 * 2]\r\n    dec                r6d\r\n    jnz                .loop\r\n    test               r5d,             r5d\r\n    jz                 .end\r\n\r\n    ;Row 11\r\n    vbroadcasti128     m3,              [r0]\r\n    pshufb             m3,              m1\r\n    pmaddubsw          m3,              m0\r\n    pmaddwd            m3,              m2\r\n    packssdw           m3,              m3\r\n    psubw              m3,              m5\r\n    vpermq             m3,              m3,          11011000b\r\n    movu               [r2],            xm3\r\n.end\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_PS_8xN_AVX2  2\r\n    IPFILTER_CHROMA_PS_8xN_AVX2  32\r\n    IPFILTER_CHROMA_PS_8xN_AVX2  16\r\n    IPFILTER_CHROMA_PS_8xN_AVX2  6\r\n    IPFILTER_CHROMA_PS_8xN_AVX2  4\r\n    IPFILTER_CHROMA_PS_8xN_AVX2  12\r\n    IPFILTER_CHROMA_PS_8xN_AVX2  64\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_2x4, 4, 7, 3\r\n    mov                r4d,            r4m\r\n    mov                r5d,            r5m\r\n    add                r3d,            r3d\r\n%ifdef PIC\r\n    lea                r6,             [tab_ChromaCoeff]\r\n    vpbroadcastd       m0,             [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd       m0,             
[tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova               xm3,            [pw_2000]\r\n    dec                r0\r\n    test               r5d,            r5d\r\n    jz                 .label\r\n    sub                r0,             r1\r\n\r\n.label\r\n    lea                r6,             [r1 * 3]\r\n    movq               xm1,            [r0]\r\n    movhps             xm1,            [r0 + r1]\r\n    movq               xm2,            [r0 + r1 * 2]\r\n    movhps             xm2,            [r0 + r6]\r\n\r\n    vinserti128        m1,             m1,          xm2,          1\r\n    pshufb             m1,             [interp4_hpp_shuf]\r\n    pmaddubsw          m1,             m0\r\n    pmaddwd            m1,             [pw_1]\r\n    vextracti128       xm2,            m1,          1\r\n    packssdw           xm1,            xm2\r\n    psubw              xm1,            xm3\r\n\r\n    lea                r4,             [r3 * 3]\r\n    movd               [r2],           xm1\r\n    pextrd             [r2 + r3],      xm1,         1\r\n    pextrd             [r2 + r3 * 2],  xm1,         2\r\n    pextrd             [r2 + r4],      xm1,         3\r\n\r\n    test               r5d,            r5d\r\n    jz                .end\r\n    lea                r2,             [r2 + r3 * 4]\r\n    lea                r0,             [r0 + r1 * 4]\r\n\r\n    movq               xm1,            [r0]\r\n    movhps             xm1,            [r0 + r1]\r\n    movq               xm2,            [r0 + r1 * 2]\r\n    vinserti128        m1,             m1,          xm2,          1\r\n    pshufb             m1,             [interp4_hpp_shuf]\r\n    pmaddubsw          m1,             m0\r\n    pmaddwd            m1,             [pw_1]\r\n    vextracti128       xm2,            m1,          1\r\n    packssdw           xm1,            xm2\r\n    psubw              xm1,            xm3\r\n\r\n    movd               [r2],           xm1\r\n    pextrd             [r2 + r3],      xm1,         
1\r\n    pextrd             [r2 + r3 * 2],  xm1,         2\r\n.end\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_2x8, 4, 7, 7\r\n    mov               r4d,           r4m\r\n    mov               r5d,           r5m\r\n    add               r3d,           r3d\r\n\r\n%ifdef PIC\r\n    lea               r6,            [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,            [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n    vbroadcasti128    m6,            [pw_2000]\r\n    test              r5d,            r5d\r\n    jz                .label\r\n    sub               r0,             r1\r\n\r\n.label\r\n    mova              m4,            [interp4_hpp_shuf]\r\n    mova              m5,            [pw_1]\r\n    dec               r0\r\n    lea               r4,            [r1 * 3]\r\n    movq              xm1,           [r0]                                   ;row 0\r\n    movhps            xm1,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m1,            m1,          xm2,          1\r\n    lea               r0,            [r0 + r1 * 4]\r\n    movq              xm3,           [r0]\r\n    movhps            xm3,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m3,            m3,          xm2,          1\r\n\r\n    pshufb            m1,            m4\r\n    pshufb            m3,            m4\r\n    pmaddubsw         m1,            m0\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m1,            m5\r\n    pmaddwd           m3,            m5\r\n    packssdw          m1,            m3\r\n    psubw             m1,            m6\r\n\r\n    lea               r4,            [r3 * 3]\r\n    vextracti128      xm2,           m1,          1\r\n\r\n    movd              [r2],          xm1\r\n  
  pextrd            [r2 + r3],     xm1,         1\r\n    movd              [r2 + r3 * 2], xm2\r\n    pextrd            [r2 + r4],     xm2,         1\r\n    lea               r2,            [r2 + r3 * 4]\r\n    pextrd            [r2],          xm1,         2\r\n    pextrd            [r2 + r3],     xm1,         3\r\n    pextrd            [r2 + r3 * 2], xm2,         2\r\n    pextrd            [r2 + r4],     xm2,         3\r\n    test              r5d,            r5d\r\n    jz                .end\r\n\r\n    lea               r0,            [r0 + r1 * 4]\r\n    lea               r2,            [r2 + r3 * 4]\r\n    movq              xm1,           [r0]                                   ;row 0\r\n    movhps            xm1,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    vinserti128       m1,            m1,          xm2,          1\r\n    pshufb            m1,            m4\r\n    pmaddubsw         m1,            m0\r\n    pmaddwd           m1,            m5\r\n    packssdw          m1,            m1\r\n    psubw             m1,            m6\r\n    vextracti128      xm2,           m1,          1\r\n\r\n    movd              [r2],          xm1\r\n    pextrd            [r2 + r3],     xm1,         1\r\n    movd              [r2 + r3 * 2], xm2\r\n.end\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_12x16, 4, 6, 7\r\n    mov               r4d,          r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m6,           [pw_512]\r\n    mova              m1,           [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,           [pw_1]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,          
8\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 4]                    ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,           [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + r1 + 4]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n\r\n    vextracti128      xm4,          m3,       1\r\n    movq              [r2],         xm3\r\n    pextrd            [r2+8],       xm3,      2\r\n    movq              [r2 + r3],    xm4\r\n    pextrd            [r2 + r3 + 8],xm4,      2\r\n    lea               r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1 * 2]\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_24x32, 4,6,7\r\n    mov              r4d,           r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova     
         m1,           [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,           [pw_1]\r\n    mova              m6,           [pw_512]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,          32\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 4]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    vbroadcasti128    m4,           [r0 + 16]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + 20]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n\r\n    vextracti128      xm4,          m3,       1\r\n    movu              [r2],         xm3\r\n    movq              [r2 + 16],    xm4\r\n    add               r2,           r3\r\n    add               r0,           r1\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int 
isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------;\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_6x8, 4,7,6\r\n    mov                r4d,            r4m\r\n    mov                r5d,            r5m\r\n    add                r3d,            r3d\r\n\r\n%ifdef PIC\r\n    lea                r6,             [tab_ChromaCoeff]\r\n    vpbroadcastd       m0,             [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd       m0,             [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128     m2,             [pw_1]\r\n    vbroadcasti128     m5,             [pw_2000]\r\n    mova               m1,             [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    mov               r6d,             8/2\r\n    dec               r0\r\n    test              r5d,             r5d\r\n    jz                .loop\r\n    sub               r0 ,             r1\r\n    inc               r6d\r\n\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128    m3,              [r0]\r\n    pshufb            m3,              m1\r\n    pmaddubsw         m3,              m0\r\n    pmaddwd           m3,              m2\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,              [r0 + r1]\r\n    pshufb            m4,              m1\r\n    pmaddubsw         m4,              m0\r\n    pmaddwd           m4,              m2\r\n    packssdw          m3,              m4\r\n    psubw             m3,              m5\r\n    vpermq            m3,              m3,          11011000b\r\n    vextracti128      xm4,             m3,          1\r\n    movq              [r2],            xm3\r\n    pextrd            [r2 + 8],        xm3,         2\r\n    movq              [r2 + r3],       xm4\r\n    pextrd            [r2 + r3 + 8],   xm4,         2\r\n    lea               r2,              [r2 + r3 * 2]\r\n    lea               r0,              
[r0 + r1 * 2]\r\n    dec               r6d\r\n    jnz              .loop\r\n    test              r5d,             r5d\r\n    jz               .end\r\n\r\n    ;Row 11\r\n    vbroadcasti128    m3,              [r0]\r\n    pshufb            m3,              m1\r\n    pmaddubsw         m3,              m0\r\n    pmaddwd           m3,              m2\r\n    packssdw          m3,              m3\r\n    psubw             m3,              m5\r\n    vextracti128      xm4,             m3,          1\r\n    movq              [r2],            xm3\r\n    movd              [r2+8],          xm4\r\n.end\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_ps_12x16, 6, 7, 8\r\n    mov                         r5d,               r5m\r\n    mov                         r4d,               r4m\r\n%ifdef PIC\r\n    lea                         r6,                [tab_LumaCoeff]\r\n    vpbroadcastq                m0,                [r6 + r4 * 8]\r\n%else\r\n    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    mova                        m6,                [tab_Lm + 32]\r\n    mova                        m1,                [tab_Lm]\r\n    add                         r3d,               r3d\r\n    vbroadcasti128              m2,                [pw_2000]\r\n    mov                         r4d,                16\r\n    vbroadcasti128              m7,                [pw_1]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - pw_2000\r\n\r\n    mova                        m5,                [interp8_hps_shuf]\r\n    sub                         r0,                3\r\n    test                        r5d,               r5d\r\n    jz                          .loop\r\n    lea                         r6,                [r1 * 3]                     ; r6 = (N / 2 - 1) * srcStride\r\n    sub                         r0,                r6                           ; r0(src)-r6\r\n    add                         
r4d,                7\r\n.loop\r\n\r\n    ; Row 0\r\n\r\n    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m3,        m6\r\n    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 8]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m1\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m4,                m4\r\n\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n\r\n    vpermd                      m3,                m5,               m3\r\n    psubw                       m3,                m2\r\n\r\n    vextracti128                xm4,               m3,               1\r\n    movu                        [r2],              xm3                          ;row 0\r\n    movq                        [r2 + 16],         xm4                          ;row 1\r\n\r\n    add                         r0,                r1\r\n    add                         r2,                r3\r\n    dec                         r4d\r\n    jnz                         .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_8tap_horiz_ps_24x32, 4, 7, 8\r\n    mov                         r5d,               r5m\r\n    mov                         r4d,               r4m\r\n%ifdef PIC\r\n    lea         
                r6,                [tab_LumaCoeff]\r\n    vpbroadcastq                m0,                [r6 + r4 * 8]\r\n%else\r\n    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n    mova                        m6,                [tab_Lm + 32]\r\n    mova                        m1,                [tab_Lm]\r\n    mov                         r4d,               32                           ;height\r\n    add                         r3d,               r3d\r\n    vbroadcasti128              m2,                [pw_2000]\r\n    vbroadcasti128              m7,                [pw_1]\r\n\r\n    ; register map\r\n    ; m0      - interpolate coeff\r\n    ; m1 , m6 - shuffle order table\r\n    ; m2      - pw_2000\r\n\r\n    sub                         r0,                3\r\n    test                        r5d,               r5d\r\n    jz                          .label\r\n    lea                         r6,                [r1 * 3]                     ; r6 = (N / 2 - 1) * srcStride\r\n    sub                         r0,                r6                           ; r0(src)-r6\r\n    add                         r4d,               7                            ; blkheight += N - 1 (N = 8 taps, so 7 extra rows)\r\n\r\n.label\r\n    lea                         r6,                [interp8_hps_shuf]\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 4 to 7)\r\n    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw     
               m3,                m4\r\n\r\n    vbroadcasti128              m4,                [r0 + 8]                     ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m5,                m4,            m6            ;row 1 (col 4 to 7)\r\n    pshufb                      m4,                m1                           ;row 1 (col 0 to 3)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m7\r\n    pmaddwd                     m5,                m7\r\n    packssdw                    m4,                m5\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n    mova                        m5,                [r6]\r\n    vpermd                      m3,                m5,               m3\r\n    psubw                       m3,                m2\r\n    movu                        [r2],              m3                          ;row 0\r\n\r\n    vbroadcasti128              m3,                [r0 + 16]\r\n    pshufb                      m4,                m3,          m6\r\n    pshufb                      m3,                m1\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n    pmaddwd                     m3,                m7\r\n    pmaddwd                     m4,                m7\r\n    packssdw                    m3,                m4\r\n    mova                        m4,                [r6]\r\n    vpermd                      m3,                m4,            m3\r\n    psubw                       m3,                m2\r\n    movu                        [r2 + 32],         xm3  
                        ;row 0\r\n\r\n    add                         r0,                r1\r\n    add                         r2,                r3\r\n    dec                         r4d\r\n    jnz                         .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_24x32, 4,7,6\r\n    mov                r4d,            r4m\r\n    mov                r5d,            r5m\r\n    add                r3d,            r3d\r\n%ifdef PIC\r\n    lea                r6,             [tab_ChromaCoeff]\r\n    vpbroadcastd       m0,             [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd       m0,             [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n    vbroadcasti128     m2,             [pw_1]\r\n    vbroadcasti128     m5,             [pw_2000]\r\n    mova               m1,             [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n    mov                r6d,            32\r\n    dec                r0\r\n    test               r5d,            r5d\r\n    je                 .loop\r\n    sub                r0 ,            r1\r\n    add                r6d ,           3\r\n\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128     m3,             [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb             m3,             m1\r\n    pmaddubsw          m3,             m0\r\n    pmaddwd            m3,             m2\r\n    vbroadcasti128     m4,             [r0 + 8]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb             m4,             m1\r\n    
pmaddubsw          m4,             m0\r\n    pmaddwd            m4,             m2\r\n    packssdw           m3,             m4\r\n    psubw              m3,             m5\r\n    vpermq             m3,             m3,          11011000b\r\n    movu               [r2],           m3\r\n\r\n    vbroadcasti128     m3,             [r0 + 16]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb             m3,             m1\r\n    pmaddubsw          m3,             m0\r\n    pmaddwd            m3,             m2\r\n    packssdw           m3,             m3\r\n    psubw              m3,             m5\r\n    vpermq             m3,             m3,          11011000b\r\n    movu               [r2 + 32],      xm3\r\n\r\n    add                r2,             r3\r\n    add                r0,             r1\r\n    dec                r6d\r\n    jnz                .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------\r\n;macro FILTER_H8_W8_16N_AVX2\r\n;-----------------------------------------------------------------------------------------------------------------------\r\n%macro  FILTER_H8_W8_16N_AVX2 0\r\n    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m4,                m3,             m6           ; row 0 (col 4 to 7)\r\n    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)\r\n    pmaddubsw                   m3,                m0\r\n    pmaddubsw                   m4,                m0\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4                         ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]\r\n\r\n    vbroadcasti128              m4,                [r0 
+ 8]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb                      m5,                m4,            m6            ;row 1 (col 4 to 7)\r\n    pshufb                      m4,                m1                           ;row 1 (col 0 to 3)\r\n    pmaddubsw                   m4,                m0\r\n    pmaddubsw                   m5,                m0\r\n    pmaddwd                     m4,                m2\r\n    pmaddwd                     m5,                m2\r\n    packssdw                    m4,                m5                         ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]\r\n\r\n    pmaddwd                     m3,                m2\r\n    pmaddwd                     m4,                m2\r\n    packssdw                    m3,                m4                         ; all rows and col completed.\r\n\r\n    mova                        m5,                [interp8_hps_shuf]\r\n    vpermd                      m3,                m5,               m3\r\n    psubw                       m3,                m8\r\n\r\n    vextracti128                xm4,               m3,               1\r\n    mova                        [r4],              xm3\r\n    mova                        [r4 + 16],         xm4\r\n    %endmacro\r\n\r\n;-----------------------------------------------------------------------------\r\n; void interp_8tap_hv_pp_16x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)\r\n;-----------------------------------------------------------------------------\r\nINIT_YMM avx2\r\n%if ARCH_X86_64 == 1\r\ncglobal interp_8tap_hv_pp_16x16, 4, 10, 15, 0-31*32\r\n%define stk_buf1    rsp\r\n    mov                         r4d,               r4m\r\n    mov                         r5d,               r5m\r\n%ifdef PIC\r\n    lea                         r6,                [tab_LumaCoeff]\r\n    vpbroadcastq                m0,                [r6 + r4 * 8]\r\n%else\r\n    vpbroadcastq                m0,     
           [tab_LumaCoeff + r4 * 8]\r\n%endif\r\n\r\n    xor                         r6,                 r6\r\n    mov                         r4,                 rsp\r\n    mova                        m6,                [tab_Lm + 32]\r\n    mova                        m1,                [tab_Lm]\r\n    mov                         r8,                16                           ;height\r\n    vbroadcasti128              m8,                [pw_2000]\r\n    vbroadcasti128              m2,                [pw_1]\r\n    sub                         r0,                3\r\n    lea                         r7,                [r1 * 3]                     ; r7 = (N / 2 - 1) * srcStride\r\n    sub                         r0,                r7                           ; r0(src)-r7\r\n    add                         r8,                7\r\n\r\n.loopH:\r\n    FILTER_H8_W8_16N_AVX2\r\n    add                         r0,                r1\r\n    add                         r4,                32\r\n    inc                         r6\r\n    cmp                         r6,                16+7\r\n    jnz                        .loopH\r\n\r\n; vertical phase\r\n    xor                         r6,                r6\r\n    xor                         r1,                r1\r\n.loopV:\r\n\r\n;load necessary variables\r\n    mov                         r4d,               r5d          ;coeff here for vertical is r5m\r\n    shl                         r4d,               7\r\n    mov                         r1d,               16\r\n    add                         r1d,               r1d\r\n\r\n ; load intermediate buffer\r\n    mov                         r0,                stk_buf1\r\n\r\n    ; register mapping\r\n    ; r0 - src\r\n    ; r5 - coeff\r\n    ; r6 - loop_i\r\n\r\n; load coeff table\r\n%ifdef PIC\r\n    lea                          r5,                [pw_LumaCoeffVer]\r\n    add                          r5,                r4\r\n%else\r\n    lea                          r5,            
    [pw_LumaCoeffVer + r4]\r\n%endif\r\n\r\n    lea                          r4,                [r1*3]\r\n    mova                         m14,               [pd_526336]\r\n    lea                          r6,                [r3 * 3]\r\n    mov                          r9d,               16 / 8\r\n\r\n.loopW:\r\n    PROCESS_LUMA_AVX2_W8_16R sp\r\n    add                          r2,                 8\r\n    add                          r0,                 16\r\n    dec                          r9d\r\n    jnz                          .loopW\r\n    RET\r\n%endif\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_12x32, 4, 6, 7\r\n    mov               r4d,          r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m6,           [pw_512]\r\n    mova              m1,           [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,           [pw_1]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,          16\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 4]                    ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,           [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw   
      m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + r1 + 4]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n\r\n    vextracti128      xm4,          m3,       1\r\n    movq              [r2],         xm3\r\n    pextrd            [r2+8],       xm3,      2\r\n    movq              [r2 + r3],    xm4\r\n    pextrd            [r2 + r3 + 8],xm4,      2\r\n    lea               r2,           [r2 + r3 * 2]\r\n    lea               r0,           [r0 + r1 * 2]\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_24x64, 4,6,7\r\n    mov              r4d,           r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m1,           [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,           [pw_1]\r\n    mova              m6,           [pw_512]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,          64\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 4]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,       
    m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    vbroadcasti128    m4,           [r0 + 16]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + 20]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n\r\n    vextracti128      xm4,          m3,       1\r\n    movu              [r2],         xm3\r\n    movq              [r2 + 16],    xm4\r\n    add               r2,           r3\r\n    add               r0,           r1\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_2x16, 4, 6, 6\r\n    mov               r4d,           r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,            [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,            [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m4,            [interp4_hpp_shuf]\r\n    mova              m5,            [pw_1]\r\n    dec               r0\r\n    lea               r4,            [r1 * 3]\r\n    movq              xm1,           [r0]\r\n    movhps            xm1,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m1,            m1,          xm2,          1\r\n    lea               r0,            [r0 + r1 * 4]\r\n    movq              xm3,           [r0]\r\n    movhps            xm3,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m3,    
        m3,          xm2,          1\r\n\r\n    pshufb            m1,            m4\r\n    pshufb            m3,            m4\r\n    pmaddubsw         m1,            m0\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m1,            m5\r\n    pmaddwd           m3,            m5\r\n    packssdw          m1,            m3\r\n    pmulhrsw          m1,            [pw_512]\r\n    vextracti128      xm2,           m1,          1\r\n    packuswb          xm1,           xm2\r\n\r\n    lea               r4,            [r3 * 3]\r\n    pextrw            [r2],          xm1,         0\r\n    pextrw            [r2 + r3],     xm1,         1\r\n    pextrw            [r2 + r3 * 2], xm1,         4\r\n    pextrw            [r2 + r4],     xm1,         5\r\n    lea               r2,            [r2 + r3 * 4]\r\n    pextrw            [r2],          xm1,         2\r\n    pextrw            [r2 + r3],     xm1,         3\r\n    pextrw            [r2 + r3 * 2], xm1,         6\r\n    pextrw            [r2 + r4],     xm1,         7\r\n    lea               r2,            [r2 + r3 * 4]\r\n    lea               r0,            [r0 + r1 * 4]\r\n\r\n    lea               r4,            [r1 * 3]\r\n    movq              xm1,           [r0]\r\n    movhps            xm1,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m1,            m1,          xm2,          1\r\n    lea               r0,            [r0 + r1 * 4]\r\n    movq              xm3,           [r0]\r\n    movhps            xm3,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m3,            m3,          xm2,          1\r\n\r\n    pshufb            m1,            m4\r\n    pshufb            m3,            m4\r\n    pmaddubsw         m1,            m0\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m1,        
    m5\r\n    pmaddwd           m3,            m5\r\n    packssdw          m1,            m3\r\n    pmulhrsw          m1,            [pw_512]\r\n    vextracti128      xm2,           m1,          1\r\n    packuswb          xm1,           xm2\r\n\r\n    lea               r4,            [r3 * 3]\r\n    pextrw            [r2],          xm1,         0\r\n    pextrw            [r2 + r3],     xm1,         1\r\n    pextrw            [r2 + r3 * 2], xm1,         4\r\n    pextrw            [r2 + r4],     xm1,         5\r\n    lea               r2,            [r2 + r3 * 4]\r\n    pextrw            [r2],          xm1,         2\r\n    pextrw            [r2 + r3],     xm1,         3\r\n    pextrw            [r2 + r3 * 2], xm1,         6\r\n    pextrw            [r2 + r4],     xm1,         7\r\n    RET\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_64xN(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\n%macro IPFILTER_CHROMA_PP_64xN_AVX2 1\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_64x%1, 4,6,7\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m1,           [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,           [pw_1]\r\n    mova              m6,           [pw_512]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,          %1\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb          
  m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 4]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    vbroadcasti128    m4,           [r0 + 16]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + 20]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n    movu              [r2],         m3\r\n\r\n    vbroadcasti128    m3,           [r0 + 32]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 36]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    packssdw          m3,           m4\r\n    pmulhrsw          m3,           m6\r\n\r\n    vbroadcasti128    m4,           [r0 + 48]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n    vbroadcasti128    m5,           [r0 + 52]\r\n    pshufb            m5,           m1\r\n    pmaddubsw         m5,           m0\r\n    pmaddwd           m5,           m2\r\n    packssdw          m4,           m5\r\n    pmulhrsw          m4,           m6\r\n    packuswb          m3,           m4\r\n    vpermq            m3,           m3,      11011000b\r\n    movu             
 [r2 + 32],         m3\r\n\r\n    add               r2,           r3\r\n    add               r0,           r1\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n%endmacro\r\n\r\n    IPFILTER_CHROMA_PP_64xN_AVX2  64\r\n    IPFILTER_CHROMA_PP_64xN_AVX2  32\r\n    IPFILTER_CHROMA_PP_64xN_AVX2  48\r\n    IPFILTER_CHROMA_PP_64xN_AVX2  16\r\n\r\n;-------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_pp_48x64(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)\r\n;-------------------------------------------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_48x64, 4,6,7\r\n    mov             r4d, r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,            [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,            [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m1,            [interp4_horiz_shuf1]\r\n    vpbroadcastd      m2,            [pw_1]\r\n    mova              m6,            [pw_512]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n    mov               r4d,           64\r\n\r\n.loop:\r\n    ; Row 0\r\n    vbroadcasti128    m3,            [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,            m1\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m3,            m2\r\n    vbroadcasti128    m4,            [r0 + 4]\r\n    pshufb            m4,            m1\r\n    pmaddubsw         m4,            m0\r\n    pmaddwd           m4,            m2\r\n    packssdw          m3,            m4\r\n    pmulhrsw          m3,            m6\r\n\r\n    vbroadcasti128    m4,            [r0 + 16]\r\n    pshufb            m4,            m1\r\n  
  pmaddubsw         m4,            m0\r\n    pmaddwd           m4,            m2\r\n    vbroadcasti128    m5,            [r0 + 20]\r\n    pshufb            m5,            m1\r\n    pmaddubsw         m5,            m0\r\n    pmaddwd           m5,            m2\r\n    packssdw          m4,            m5\r\n    pmulhrsw          m4,            m6\r\n\r\n    packuswb          m3,            m4\r\n    vpermq            m3,            m3,      q3120\r\n\r\n    movu              [r2],          m3\r\n\r\n    vbroadcasti128    m3,            [r0 + mmsize]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,            m1\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m3,            m2\r\n    vbroadcasti128    m4,            [r0 + mmsize + 4]\r\n    pshufb            m4,            m1\r\n    pmaddubsw         m4,            m0\r\n    pmaddwd           m4,            m2\r\n    packssdw          m3,            m4\r\n    pmulhrsw          m3,            m6\r\n\r\n    vbroadcasti128    m4,            [r0 + mmsize + 16]\r\n    pshufb            m4,            m1\r\n    pmaddubsw         m4,            m0\r\n    pmaddwd           m4,            m2\r\n    vbroadcasti128    m5,            [r0 + mmsize + 20]\r\n    pshufb            m5,            m1\r\n    pmaddubsw         m5,            m0\r\n    pmaddwd           m5,            m2\r\n    packssdw          m4,            m5\r\n    pmulhrsw          m4,            m6\r\n\r\n    packuswb          m3,            m4\r\n    vpermq            m3,            m3,      q3120\r\n    movu              [r2 + mmsize], xm3\r\n\r\n    add               r2,            r3\r\n    add               r0,            r1\r\n    dec               r4d\r\n    jnz               .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_48x64(pixel *src, intptr_t srcStride, int16_t 
*dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------;\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_48x64, 4,7,6\r\n    mov             r4d, r4m\r\n    mov             r5d, r5m\r\n    add             r3d, r3d\r\n\r\n%ifdef PIC\r\n    lea               r6,           [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,           [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    vbroadcasti128     m2,          [pw_1]\r\n    vbroadcasti128     m5,          [pw_2000]\r\n    mova               m1,          [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n    mov               r6d,          64\r\n    dec               r0\r\n    test              r5d,          r5d\r\n    je                .loop\r\n    sub               r0 ,          r1\r\n    add               r6d ,         3\r\n\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128    m3,           [r0]                           ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 8]                       ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           m5\r\n    vpermq            m3,           m3,          q3120\r\n    movu              [r2],         m3\r\n\r\n    vbroadcasti128    m3,           [r0 + 16]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    
m4,           [r0 + 24]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           m5\r\n    vpermq            m3,           m3,          q3120\r\n    movu              [r2 + 32],    m3\r\n\r\n    vbroadcasti128    m3,           [r0 + 32]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,           m1\r\n    pmaddubsw         m3,           m0\r\n    pmaddwd           m3,           m2\r\n    vbroadcasti128    m4,           [r0 + 40]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,           m1\r\n    pmaddubsw         m4,           m0\r\n    pmaddwd           m4,           m2\r\n\r\n    packssdw          m3,           m4\r\n    psubw             m3,           m5\r\n    vpermq            m3,           m3,          q3120\r\n    movu              [r2 + 64],    m3\r\n\r\n    add               r2,          r3\r\n    add               r0,          r1\r\n    dec               r6d\r\n    jnz               .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\n; void interp_4tap_horiz_ps_24x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)\r\n;-----------------------------------------------------------------------------------------------------------------------------\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_24x64, 4,7,6\r\n    mov                r4d,            r4m\r\n    mov                r5d,            r5m\r\n    add                r3d,            r3d\r\n%ifdef PIC\r\n    lea                r6,             [tab_ChromaCoeff]\r\n    vpbroadcastd       m0,             [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd       m0,             
[tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n    vbroadcasti128     m2,             [pw_1]\r\n    vbroadcasti128     m5,             [pw_2000]\r\n    mova               m1,             [tab_Tm]\r\n\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n    mov                r6d,            64\r\n    dec                r0\r\n    test               r5d,            r5d\r\n    je                 .loop\r\n    sub                r0 ,            r1\r\n    add                r6d ,           3\r\n\r\n.loop\r\n    ; Row 0\r\n    vbroadcasti128     m3,             [r0]                          ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb             m3,             m1\r\n    pmaddubsw          m3,             m0\r\n    pmaddwd            m3,             m2\r\n    vbroadcasti128     m4,             [r0 + 8]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb             m4,             m1\r\n    pmaddubsw          m4,             m0\r\n    pmaddwd            m4,             m2\r\n    packssdw           m3,             m4\r\n    psubw              m3,             m5\r\n    vpermq             m3,             m3,          q3120\r\n    movu               [r2],           m3\r\n\r\n    vbroadcasti128     m3,             [r0 + 16]                     ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb             m3,             m1\r\n    pmaddubsw          m3,             m0\r\n    pmaddwd            m3,             m2\r\n    packssdw           m3,             m3\r\n    psubw              m3,             m5\r\n    vpermq             m3,             m3,          q3120\r\n    movu               [r2 + 32],      xm3\r\n\r\n    add                r2,             r3\r\n    add                r0,             r1\r\n    dec                r6d\r\n    jnz                .loop\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_ps_2x16, 4, 7, 7\r\n    mov               r4d,           r4m\r\n    mov      
         r5d,           r5m\r\n    add               r3d,           r3d\r\n\r\n%ifdef PIC\r\n    lea               r6,            [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,            [r6 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n    vbroadcasti128    m6,            [pw_2000]\r\n    test              r5d,            r5d\r\n    jz                .label\r\n    sub               r0,             r1\r\n\r\n.label\r\n    mova              m4,            [interp4_hps_shuf]\r\n    mova              m5,            [pw_1]\r\n    dec               r0\r\n    lea               r4,            [r1 * 3]\r\n    movq              xm1,           [r0]                                   ;row 0\r\n    movhps            xm1,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m1,            m1,           xm2,          1\r\n    lea               r0,            [r0 + r1 * 4]\r\n    movq              xm3,           [r0]\r\n    movhps            xm3,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m3,            m3,           xm2,          1\r\n\r\n    pshufb            m1,            m4\r\n    pshufb            m3,            m4\r\n    pmaddubsw         m1,            m0\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m1,            m5\r\n    pmaddwd           m3,            m5\r\n    packssdw          m1,            m3\r\n    psubw             m1,            m6\r\n\r\n    lea               r4,            [r3 * 3]\r\n    vextracti128      xm2,           m1,           1\r\n\r\n    movd              [r2],          xm1\r\n    pextrd            [r2 + r3],     xm1,          1\r\n    movd              [r2 + r3 * 2], xm2\r\n    pextrd            [r2 + r4],     xm2,          1\r\n    lea               r2,            [r2 + r3 
* 4]\r\n    pextrd            [r2],          xm1,          2\r\n    pextrd            [r2 + r3],     xm1,          3\r\n    pextrd            [r2 + r3 * 2], xm2,          2\r\n    pextrd            [r2 + r4],     xm2,          3\r\n\r\n    lea               r0,            [r0 + r1 * 4]\r\n    lea               r2,            [r2 + r3 * 4]\r\n    lea               r4,            [r1 * 3]\r\n    movq              xm1,           [r0]\r\n    movhps            xm1,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m1,            m1,          xm2,           1\r\n    lea               r0,            [r0 + r1 * 4]\r\n    movq              xm3,           [r0]\r\n    movhps            xm3,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    movhps            xm2,           [r0 + r4]\r\n    vinserti128       m3,            m3,          xm2,           1\r\n\r\n    pshufb            m1,            m4\r\n    pshufb            m3,            m4\r\n    pmaddubsw         m1,            m0\r\n    pmaddubsw         m3,            m0\r\n    pmaddwd           m1,            m5\r\n    pmaddwd           m3,            m5\r\n    packssdw          m1,            m3\r\n    psubw             m1,            m6\r\n\r\n    lea               r4,            [r3 * 3]\r\n    vextracti128      xm2,           m1,           1\r\n\r\n    movd              [r2],          xm1\r\n    pextrd            [r2 + r3],     xm1,          1\r\n    movd              [r2 + r3 * 2], xm2\r\n    pextrd            [r2 + r4],     xm2,          1\r\n    lea               r2,            [r2 + r3 * 4]\r\n    pextrd            [r2],          xm1,          2\r\n    pextrd            [r2 + r3],     xm1,          3\r\n    pextrd            [r2 + r3 * 2], xm2,          2\r\n    pextrd            [r2 + r4],     xm2,          3\r\n\r\n    test              r5d,            r5d\r\n    jz                
.end\r\n\r\n    lea               r0,            [r0 + r1 * 4]\r\n    lea               r2,            [r2 + r3 * 4]\r\n    movq              xm1,           [r0]\r\n    movhps            xm1,           [r0 + r1]\r\n    movq              xm2,           [r0 + r1 * 2]\r\n    vinserti128       m1,            m1,          xm2,           1\r\n    pshufb            m1,            m4\r\n    pmaddubsw         m1,            m0\r\n    pmaddwd           m1,            m5\r\n    packssdw          m1,            m1\r\n    psubw             m1,            m6\r\n    vextracti128      xm2,           m1,           1\r\n\r\n    movd              [r2],          xm1\r\n    pextrd            [r2 + r3],     xm1,          1\r\n    movd              [r2 + r3 * 2], xm2\r\n.end\r\n    RET\r\n\r\nINIT_YMM avx2\r\ncglobal interp_4tap_horiz_pp_6x16, 4, 6, 7\r\n    mov               r4d,               r4m\r\n\r\n%ifdef PIC\r\n    lea               r5,                [tab_ChromaCoeff]\r\n    vpbroadcastd      m0,                [r5 + r4 * 4]\r\n%else\r\n    vpbroadcastd      m0,                [tab_ChromaCoeff + r4 * 4]\r\n%endif\r\n\r\n    mova              m1,                [tab_Tm]\r\n    mova              m2,                [pw_1]\r\n    mova              m6,                [pw_512]\r\n    lea               r4,                [r1 * 3]\r\n    lea               r5,                [r3 * 3]\r\n    ; register map\r\n    ; m0 - interpolate coeff\r\n    ; m1 - shuffle order table\r\n    ; m2 - constant word 1\r\n\r\n    dec               r0\r\n%rep 4\r\n    ; Row 0\r\n    vbroadcasti128    m3,                [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m3,                m1\r\n    pmaddubsw         m3,                m0\r\n    pmaddwd           m3,                m2\r\n\r\n    ; Row 1\r\n    vbroadcasti128    m4,                [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,                m1\r\n    pmaddubsw  
       m4,                m0\r\n    pmaddwd           m4,                m2\r\n    packssdw          m3,                m4\r\n    pmulhrsw          m3,                m6\r\n\r\n    ; Row 2\r\n    vbroadcasti128    m4,                [r0 + r1 * 2]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m4,                m1\r\n    pmaddubsw         m4,                m0\r\n    pmaddwd           m4,                m2\r\n\r\n    ; Row 3\r\n    vbroadcasti128    m5,                [r0 + r4]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]\r\n    pshufb            m5,                m1\r\n    pmaddubsw         m5,                m0\r\n    pmaddwd           m5,                m2\r\n    packssdw          m4,                m5\r\n    pmulhrsw          m4,                m6\r\n\r\n    packuswb          m3,                m4\r\n    vextracti128      xm4,               m3,          1\r\n    movd              [r2],              xm3\r\n    pextrw            [r2 + 4],          xm4,         0\r\n    pextrd            [r2 + r3],         xm3,         1\r\n    pextrw            [r2 + r3 + 4],     xm4,         2\r\n    pextrd            [r2 + r3 * 2],     xm3,         2\r\n    pextrw            [r2 + r3 * 2 + 4], xm4,         4\r\n    pextrd            [r2 + r5],         xm3,         3\r\n    pextrw            [r2 + r5 + 4],     xm4,         6\r\n    lea               r2,                [r2 + r3 * 4]\r\n    lea               r0,                [r0 + r1 * 4]\r\n%endrep\r\n    RET\r\n"
  },
  {
    "path": "source/common/x86/ipfilter8.h",
    "content": "/*****************************************************************************\r\n * Copyright (C) 2013-2017 MulticoreWare, Inc\r\n *\r\n * Authors: Steve Borho <steve@borho.org>\r\n *          Jiaqi Zhang <zhangjiaqi.cs@gmail.com>\r\n *\r\n * This program is free software; you can redistribute it and/or modify\r\n * it under the terms of the GNU General Public License as published by\r\n * the Free Software Foundation; either version 2 of the License, or\r\n * (at your option) any later version.\r\n *\r\n * This program is distributed in the hope that it will be useful,\r\n * but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n * GNU General Public License for more details.\r\n *\r\n * You should have received a copy of the GNU General Public License\r\n * along with this program; if not, write to the Free Software\r\n * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n * This program is also available under a commercial proprietary license.\r\n * For more information, contact us at license @ x265.com.\r\n *****************************************************************************/\r\n\r\n#ifndef DAVS2_IPFILTER8_H\r\n#define DAVS2_IPFILTER8_H\r\n\r\n#include \"../vec/intrinsic.h\"\r\n\r\n#if defined(__cplusplus)\r\nextern \"C\" {\r\n#endif  /* __cplusplus */\r\n\r\n#define SETUP_FUNC_DEF(cpu) \\\r\n    FUNCDEF_PU(void, interp_8tap_horiz_pp, cpu, const pel_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int coeffIdx); \\\r\n    FUNCDEF_PU(void, interp_8tap_horiz_ps, cpu, const pel_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \\\r\n    FUNCDEF_PU(void, interp_8tap_vert_pp, cpu, const pel_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int coeffIdx); \\\r\n    FUNCDEF_PU(void, interp_8tap_vert_ps, cpu, const pel_t* src, intptr_t srcStride, int16_t* dst, intptr_t 
dstStride, int coeffIdx); \\\r\n    FUNCDEF_PU(void, interp_8tap_vert_sp, cpu, const int16_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int coeffIdx); \\\r\n    FUNCDEF_PU(void, interp_8tap_vert_ss, cpu, const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \\\r\n    FUNCDEF_PU(void, interp_8tap_hv_pp, cpu, const pel_t* src, intptr_t srcStride, pel_t* dst, intptr_t dstStride, int idxX, int idxY)\r\n\r\nSETUP_FUNC_DEF(sse2);\r\nSETUP_FUNC_DEF(ssse3);\r\nSETUP_FUNC_DEF(sse3);\r\nSETUP_FUNC_DEF(sse4);\r\nSETUP_FUNC_DEF(avx2);\r\n\r\n#if defined(__cplusplus)\r\n}\r\n#endif  /* __cplusplus */\r\n#endif // ifndef DAVS2_IPFILTER8_H\r\n"
  },
  {
    "path": "source/common/x86/mc-a2.asm",
    "content": ";*****************************************************************************\r\n;* mc-a2.asm: x86 motion compensation\r\n;*****************************************************************************\r\n;* Copyright (C) 2003-2013 x264 project\r\n;* Copyright (C) 2013-2017 MulticoreWare, Inc\r\n;* Copyright (C) 2018~ VCL, NELVT, Peking University\r\n;*\r\n;* Authors: Loren Merritt <lorenm@u.washington.edu>\r\n;*          Fiona Glaser <fiona@x264.com>\r\n;*          Holger Lubitz <holger@lubitz.org>\r\n;*          Mathieu Monnier <manao@melix.net>\r\n;*          Oskar Arvidsson <oskar@irock.se>\r\n;*          Min Chen <chenm003@163.com>\r\n;*          Jiaqi Zhang <zhangjiaqi.cs@gmail.com>\r\n;*\r\n;* This program is free software; you can redistribute it and/or modify\r\n;* it under the terms of the GNU General Public License as published by\r\n;* the Free Software Foundation; either version 2 of the License, or\r\n;* (at your option) any later version.\r\n;*\r\n;* This program is distributed in the hope that it will be useful,\r\n;* but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  
See the\r\n;* GNU General Public License for more details.\r\n;*\r\n;* You should have received a copy of the GNU General Public License\r\n;* along with this program; if not, write to the Free Software\r\n;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n;*\r\n;* This program is also available under a commercial proprietary license.\r\n;* For more information, contact us at license @ x265.com.\r\n;*****************************************************************************\r\n\r\n%include \"x86inc.asm\"\r\n%include \"x86util.asm\"\r\n\r\nSECTION_RODATA 32\r\n\r\ndeinterleave_shuf: times 2 db 0,2,4,6,8,10,12,14,1,3,5,7,9,11,13,15\r\n\r\n%if HIGH_BIT_DEPTH\r\ndeinterleave_shuf32a: SHUFFLE_MASK_W 0,2,4,6,8,10,12,14\r\ndeinterleave_shuf32b: SHUFFLE_MASK_W 1,3,5,7,9,11,13,15\r\n%else\r\ndeinterleave_shuf32a: db 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30\r\ndeinterleave_shuf32b: db 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31\r\n%endif\r\n\r\ncutree_fix8_unpack_shuf: db -1,-1, 0, 1,-1,-1, 2, 3,-1,-1, 4, 5,-1,-1, 6, 7\r\n                         db -1,-1, 8, 9,-1,-1,10,11,-1,-1,12,13,-1,-1,14,15\r\n\r\nconst pq_256,       times 4 dq 256.0\r\nconst pd_inv256,    times 4 dq 0.00390625\r\nconst pd_0_5,       times 4 dq 0.5\r\n\r\nSECTION .text\r\n\r\ncextern pb_0\r\ncextern pw_1\r\ncextern pw_16\r\ncextern pw_32\r\ncextern pw_512\r\ncextern pw_00ff\r\ncextern pw_1024\r\ncextern pw_3fff\r\ncextern pw_pixel_max\r\ncextern pd_ffff\r\ncextern pd_16\r\n\r\n;The hpel_filter routines use non-temporal writes for output.\r\n;The following defines may be uncommented for testing.\r\n;Doing the hpel_filter temporal may be a win if the last level cache\r\n;is big enough (preliminary benching suggests on the order of 4* framesize).\r\n\r\n;%define movntq movq\r\n;%define movntps movaps\r\n;%define sfence\r\n\r\n%if HIGH_BIT_DEPTH == 0\r\n%undef movntq\r\n%undef movntps\r\n%undef sfence\r\n%endif ; 
!HIGH_BIT_DEPTH\r\n\r\n;-----------------------------------------------------------------------------\r\n; void plane_copy_core( pixel *dst, intptr_t i_dst,\r\n;                       pixel *src, intptr_t i_src, int w, int h )\r\n;-----------------------------------------------------------------------------\r\n; assumes i_dst and w are multiples of 16, and i_dst>w\r\nINIT_MMX\r\ncglobal plane_copy_core_mmx2, 6,7\r\n    FIX_STRIDES r1, r3, r4d\r\n%if HIGH_BIT_DEPTH == 0\r\n    movsxdifnidn r4, r4d\r\n%endif\r\n    sub    r1,  r4\r\n    sub    r3,  r4\r\n.loopy:\r\n    lea   r6d, [r4-63]\r\n.loopx:\r\n    prefetchnta [r2+256]\r\n    movq   m0, [r2   ]\r\n    movq   m1, [r2+ 8]\r\n    movntq [r0   ], m0\r\n    movntq [r0+ 8], m1\r\n    movq   m2, [r2+16]\r\n    movq   m3, [r2+24]\r\n    movntq [r0+16], m2\r\n    movntq [r0+24], m3\r\n    movq   m4, [r2+32]\r\n    movq   m5, [r2+40]\r\n    movntq [r0+32], m4\r\n    movntq [r0+40], m5\r\n    movq   m6, [r2+48]\r\n    movq   m7, [r2+56]\r\n    movntq [r0+48], m6\r\n    movntq [r0+56], m7\r\n    add    r2,  64\r\n    add    r0,  64\r\n    sub    r6d, 64\r\n    jg .loopx\r\n    prefetchnta [r2+256]\r\n    add    r6d, 63\r\n    jle .end16\r\n.loop16:\r\n    movq   m0, [r2  ]\r\n    movq   m1, [r2+8]\r\n    movntq [r0  ], m0\r\n    movntq [r0+8], m1\r\n    add    r2,  16\r\n    add    r0,  16\r\n    sub    r6d, 16\r\n    jg .loop16\r\n.end16:\r\n    add    r0, r1\r\n    add    r2, r3\r\n    dec    r5d\r\n    jg .loopy\r\n    sfence\r\n    emms\r\n    RET\r\n\r\n\r\n%macro INTERLEAVE 4-5 ; dst, srcu, srcv, is_aligned, nt_hint\r\n%if HIGH_BIT_DEPTH\r\n%assign x 0\r\n%rep 16/mmsize\r\n    mov%4     m0, [%2+(x/2)*mmsize]\r\n    mov%4     m1, [%3+(x/2)*mmsize]\r\n    punpckhwd m2, m0, m1\r\n    punpcklwd m0, m1\r\n    mov%5a    [%1+(x+0)*mmsize], m0\r\n    mov%5a    [%1+(x+1)*mmsize], m2\r\n    %assign x (x+2)\r\n%endrep\r\n%else\r\n    movq   m0, [%2]\r\n%if mmsize==16\r\n%ifidn %4, a\r\n    punpcklbw m0, [%3]\r\n%else\r\n    
movq   m1, [%3]\r\n    punpcklbw m0, m1\r\n%endif\r\n    mov%5a [%1], m0\r\n%else\r\n    movq   m1, [%3]\r\n    punpckhbw m2, m0, m1\r\n    punpcklbw m0, m1\r\n    mov%5a [%1+0], m0\r\n    mov%5a [%1+8], m2\r\n%endif\r\n%endif ; HIGH_BIT_DEPTH\r\n%endmacro\r\n\r\n%macro DEINTERLEAVE 6 ; dstu, dstv, src, dstv==dstu+8, shuffle constant, is aligned\r\n%if HIGH_BIT_DEPTH\r\n%assign n 0\r\n%rep 16/mmsize\r\n    mova     m0, [%3+(n+0)*mmsize]\r\n    mova     m1, [%3+(n+1)*mmsize]\r\n    psrld    m2, m0, 16\r\n    psrld    m3, m1, 16\r\n    pand     m0, %5\r\n    pand     m1, %5\r\n    packssdw m0, m1\r\n    packssdw m2, m3\r\n    mov%6    [%1+(n/2)*mmsize], m0\r\n    mov%6    [%2+(n/2)*mmsize], m2\r\n    %assign n (n+2)\r\n%endrep\r\n%else ; !HIGH_BIT_DEPTH\r\n%if mmsize==16\r\n    mova   m0, [%3]\r\n%if cpuflag(ssse3)\r\n    pshufb m0, %5\r\n%else\r\n    mova   m1, m0\r\n    pand   m0, %5\r\n    psrlw  m1, 8\r\n    packuswb m0, m1\r\n%endif\r\n%if %4\r\n    mova   [%1], m0\r\n%else\r\n    movq   [%1], m0\r\n    movhps [%2], m0\r\n%endif\r\n%else\r\n    mova   m0, [%3]\r\n    mova   m1, [%3+8]\r\n    mova   m2, m0\r\n    mova   m3, m1\r\n    pand   m0, %5\r\n    pand   m1, %5\r\n    psrlw  m2, 8\r\n    psrlw  m3, 8\r\n    packuswb m0, m1\r\n    packuswb m2, m3\r\n    mova   [%1], m0\r\n    mova   [%2], m2\r\n%endif ; mmsize == 16\r\n%endif ; HIGH_BIT_DEPTH\r\n%endmacro\r\n\r\n%macro PLANE_INTERLEAVE 0\r\n;-----------------------------------------------------------------------------\r\n; void plane_copy_interleave_core( uint8_t *dst,  intptr_t i_dst,\r\n;                                  uint8_t *srcu, intptr_t i_srcu,\r\n;                                  uint8_t *srcv, intptr_t i_srcv, int w, int h )\r\n;-----------------------------------------------------------------------------\r\n; assumes i_dst and w are multiples of 16, and i_dst>2*w\r\ncglobal plane_copy_interleave_core, 6,9\r\n    mov   r6d, r6m\r\n%if HIGH_BIT_DEPTH\r\n    FIX_STRIDES r1, r3, r5, r6d\r\n    
movifnidn r1mp, r1\r\n    movifnidn r3mp, r3\r\n    mov  r6m, r6d\r\n%endif\r\n    lea    r0, [r0+r6*2]\r\n    add    r2,  r6\r\n    add    r4,  r6\r\n%if ARCH_X86_64\r\n    DECLARE_REG_TMP 7,8\r\n%else\r\n    DECLARE_REG_TMP 1,3\r\n%endif\r\n    mov  t1, r1\r\n    shr  t1, SIZEOF_PIXEL\r\n    sub  t1, r6\r\n    mov  t0d, r7m\r\n.loopy:\r\n    mov    r6d, r6m\r\n    neg    r6\r\n.prefetch:\r\n    prefetchnta [r2+r6]\r\n    prefetchnta [r4+r6]\r\n    add    r6, 64\r\n    jl .prefetch\r\n    mov    r6d, r6m\r\n    neg    r6\r\n.loopx:\r\n    INTERLEAVE r0+r6*2+ 0*SIZEOF_PIXEL, r2+r6+0*SIZEOF_PIXEL, r4+r6+0*SIZEOF_PIXEL, u, nt\r\n    INTERLEAVE r0+r6*2+16*SIZEOF_PIXEL, r2+r6+8*SIZEOF_PIXEL, r4+r6+8*SIZEOF_PIXEL, u, nt\r\n    add    r6, 16*SIZEOF_PIXEL\r\n    jl .loopx\r\n.pad:\r\n%assign n 0\r\n%rep SIZEOF_PIXEL\r\n%if mmsize==8\r\n    movntq [r0+r6*2+(n+ 0)], m0\r\n    movntq [r0+r6*2+(n+ 8)], m0\r\n    movntq [r0+r6*2+(n+16)], m0\r\n    movntq [r0+r6*2+(n+24)], m0\r\n%else\r\n    movntdq [r0+r6*2+(n+ 0)], m0\r\n    movntdq [r0+r6*2+(n+16)], m0\r\n%endif\r\n    %assign n n+32\r\n%endrep\r\n    add    r6, 16*SIZEOF_PIXEL\r\n    cmp    r6, t1\r\n    jl .pad\r\n    add    r0, r1mp\r\n    add    r2, r3mp\r\n    add    r4, r5\r\n    dec    t0d\r\n    jg .loopy\r\n    sfence\r\n    emms\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void store_interleave_chroma( uint8_t *dst, intptr_t i_dst, uint8_t *srcu, uint8_t *srcv, int height )\r\n;-----------------------------------------------------------------------------\r\ncglobal store_interleave_chroma, 5,5\r\n    FIX_STRIDES r1\r\n.loop:\r\n    INTERLEAVE r0+ 0, r2+           0, r3+           0, a\r\n    INTERLEAVE r0+r1, r2+FDEC_STRIDEB, r3+FDEC_STRIDEB, a\r\n    add    r2, FDEC_STRIDEB*2\r\n    add    r3, FDEC_STRIDEB*2\r\n    lea    r0, [r0+r1*2]\r\n    sub   r4d, 2\r\n    jg .loop\r\n    RET\r\n%endmacro ; PLANE_INTERLEAVE\r\n\r\n%macro DEINTERLEAVE_START 0\r\n%if 
HIGH_BIT_DEPTH\r\n    mova   m4, [pd_ffff]\r\n%elif cpuflag(ssse3)\r\n    mova   m4, [deinterleave_shuf]\r\n%else\r\n    mova   m4, [pw_00ff]\r\n%endif ; HIGH_BIT_DEPTH\r\n%endmacro\r\n\r\n%macro PLANE_DEINTERLEAVE 0\r\n;-----------------------------------------------------------------------------\r\n; void plane_copy_deinterleave( pixel *dstu, intptr_t i_dstu,\r\n;                               pixel *dstv, intptr_t i_dstv,\r\n;                               pixel *src,  intptr_t i_src, int w, int h )\r\n;-----------------------------------------------------------------------------\r\ncglobal plane_copy_deinterleave, 6,7\r\n    DEINTERLEAVE_START\r\n    mov    r6d, r6m\r\n    FIX_STRIDES r1, r3, r5, r6d\r\n%if HIGH_BIT_DEPTH\r\n    mov    r6m, r6d\r\n%endif\r\n    add    r0,  r6\r\n    add    r2,  r6\r\n    lea    r4, [r4+r6*2]\r\n.loopy:\r\n    mov    r6d, r6m\r\n    neg    r6\r\n.loopx:\r\n    DEINTERLEAVE r0+r6+0*SIZEOF_PIXEL, r2+r6+0*SIZEOF_PIXEL, r4+r6*2+ 0*SIZEOF_PIXEL, 0, m4, u\r\n    DEINTERLEAVE r0+r6+8*SIZEOF_PIXEL, r2+r6+8*SIZEOF_PIXEL, r4+r6*2+16*SIZEOF_PIXEL, 0, m4, u\r\n    add    r6, 16*SIZEOF_PIXEL\r\n    jl .loopx\r\n    add    r0, r1\r\n    add    r2, r3\r\n    add    r4, r5\r\n    dec dword r7m\r\n    jg .loopy\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void load_deinterleave_chroma_fenc( pixel *dst, pixel *src, intptr_t i_src, int height )\r\n;-----------------------------------------------------------------------------\r\ncglobal load_deinterleave_chroma_fenc, 4,4\r\n    DEINTERLEAVE_START\r\n    FIX_STRIDES r2\r\n.loop:\r\n    DEINTERLEAVE r0+           0, r0+FENC_STRIDEB*1/2, r1+ 0, 1, m4, a\r\n    DEINTERLEAVE r0+FENC_STRIDEB, r0+FENC_STRIDEB*3/2, r1+r2, 1, m4, a\r\n    add    r0, FENC_STRIDEB*2\r\n    lea    r1, [r1+r2*2]\r\n    sub   r3d, 2\r\n    jg .loop\r\n    RET\r\n\r\n;-----------------------------------------------------------------------------\r\n; void 
load_deinterleave_chroma_fdec( pixel *dst, pixel *src, intptr_t i_src, int height )\r\n;-----------------------------------------------------------------------------\r\ncglobal load_deinterleave_chroma_fdec, 4,4\r\n    DEINTERLEAVE_START\r\n    FIX_STRIDES r2\r\n.loop:\r\n    DEINTERLEAVE r0+           0, r0+FDEC_STRIDEB*1/2, r1+ 0, 0, m4, a\r\n    DEINTERLEAVE r0+FDEC_STRIDEB, r0+FDEC_STRIDEB*3/2, r1+r2, 0, m4, a\r\n    add    r0, FDEC_STRIDEB*2\r\n    lea    r1, [r1+r2*2]\r\n    sub   r3d, 2\r\n    jg .loop\r\n    RET\r\n%endmacro ; PLANE_DEINTERLEAVE\r\n\r\n%if HIGH_BIT_DEPTH\r\nINIT_MMX mmx2\r\nPLANE_INTERLEAVE\r\nINIT_MMX mmx\r\nPLANE_DEINTERLEAVE\r\nINIT_XMM sse2\r\nPLANE_INTERLEAVE\r\nPLANE_DEINTERLEAVE\r\nINIT_XMM avx\r\nPLANE_INTERLEAVE\r\nPLANE_DEINTERLEAVE\r\n%else\r\nINIT_MMX mmx2\r\nPLANE_INTERLEAVE\r\nINIT_MMX mmx\r\nPLANE_DEINTERLEAVE\r\nINIT_XMM sse2\r\nPLANE_INTERLEAVE\r\nPLANE_DEINTERLEAVE\r\nINIT_XMM ssse3\r\nPLANE_DEINTERLEAVE\r\n%endif\r\n\r\n; These functions are not general-use; not only do the SSE ones require aligned input,\r\n; but they also will fail if given a non-mod16 size.\r\n; memzero SSE will fail for non-mod128.\r\n\r\n;-----------------------------------------------------------------------------\r\n; void *memcpy_aligned( void *dst, const void *src, size_t n );\r\n;-----------------------------------------------------------------------------\r\n%macro MEMCPY 0\r\ncglobal memcpy_aligned, 3,3\r\n%if mmsize == 16\r\n    test r2d, 16\r\n    jz .copy2\r\n    mova  m0, [r1+r2-16]\r\n    mova [r0+r2-16], m0\r\n    sub  r2d, 16\r\n.copy2:\r\n%endif\r\n    test r2d, 2*mmsize\r\n    jz .copy4start\r\n    mova  m0, [r1+r2-1*mmsize]\r\n    mova  m1, [r1+r2-2*mmsize]\r\n    mova [r0+r2-1*mmsize], m0\r\n    mova [r0+r2-2*mmsize], m1\r\n    sub  r2d, 2*mmsize\r\n.copy4start:\r\n    test r2d, r2d\r\n    jz .ret\r\n.copy4:\r\n    mova  m0, [r1+r2-1*mmsize]\r\n    mova  m1, [r1+r2-2*mmsize]\r\n    mova  m2, [r1+r2-3*mmsize]\r\n    mova  m3, 
[r1+r2-4*mmsize]\r\n    mova [r0+r2-1*mmsize], m0\r\n    mova [r0+r2-2*mmsize], m1\r\n    mova [r0+r2-3*mmsize], m2\r\n    mova [r0+r2-4*mmsize], m3\r\n    sub  r2d, 4*mmsize\r\n    jg .copy4\r\n.ret:\r\n    REP_RET\r\n%endmacro\r\n\r\nINIT_MMX mmx\r\nMEMCPY\r\nINIT_XMM sse\r\nMEMCPY\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; void *fast_memcpy( void *dst, const void *src, size_t n );\r\n; ----------------------------------------------------------------------------\r\nINIT_MMX mmx\r\ncglobal fast_memcpy, 3,5,8\r\n;{\r\n    test        r2, r2                          ; if n = 0, quit\r\n    jz          .L_QUIT                         ;\r\n                                                ;\r\n    mov         r3, r2                          ; r3 <-- r2, copy\r\n    sar         r2, 3                           ; r2 <-- n/8\r\n    and         r3, 0x07                        ; r3 <-- n%8\r\n    prefetchnta [r1]                            ; prefetch ahead, non-temporal\r\n                                                ;\r\n    ; compute (n/8)/8 and (n/8)%8               ;\r\n    mov         r4, r2                          ; r4 <-- r2, copy\r\n    sar         r2, 3                           ; r2 <-- (n/8)/8\r\n    and         r4, 0x07                        ; r4 <-- (n/8)%8\r\n    cmp         r2, 0                           ;\r\n    je         .HEX_ZERO                        ;\r\n                                                ;\r\nalign 4                                         ;\r\n.L_COPY_64X:                                    ;\r\n    prefetchnta [r1 + 128]                      ; prefetch ahead, non-temporal\r\n    prefetchnta [r1 + 256]                      ; prefetch ahead, non-temporal\r\n                                                ;\r\n    ; load 64 bytes data from src               ;\r\n    movq        m0, [r1 + 0*8]                  ; load  8 bytes\r\n    movq        m1, [r1 + 1*8]                  ; load  8 
bytes\r\n    movq        m2, [r1 + 2*8]                  ; load  8 bytes\r\n    movq        m3, [r1 + 3*8]                  ; load  8 bytes\r\n    movq        m4, [r1 + 4*8]                  ; load  8 bytes\r\n    movq        m5, [r1 + 5*8]                  ; load  8 bytes\r\n    movq        m6, [r1 + 6*8]                  ; load  8 bytes\r\n    movq        m7, [r1 + 7*8]                  ; load  8 bytes\r\n                                                ;\r\n    ; store the 64 bytes to dst                 ;\r\n    movntq      [r0 + 0*8], m0                  ; store 8 bytes\r\n    movntq      [r0 + 1*8], m1                  ; store 8 bytes\r\n    movntq      [r0 + 2*8], m2                  ; store 8 bytes\r\n    movntq      [r0 + 3*8], m3                  ; store 8 bytes\r\n    movntq      [r0 + 4*8], m4                  ; store 8 bytes\r\n    movntq      [r0 + 5*8], m5                  ; store 8 bytes\r\n    movntq      [r0 + 6*8], m6                  ; store 8 bytes\r\n    movntq      [r0 + 7*8], m7                  ; store 8 bytes\r\n                                                ;\r\n    add         r1, 64                          ;\r\n    add         r0, 64                          ;\r\n    dec         r2                              ;\r\n    jnz        .L_COPY_64X                      ;\r\n                                                ;\r\n.HEX_ZERO:                                      ;\r\n    cmp         r4, 0                           ;\r\n    je         .L_RESIDUAL                      ;\r\n                                                ;\r\n.L_COPY_8X:                                     ;\r\n    movq        m3, [r1]                        ; load  8 bytes\r\n    movntq    [r0], m3                          ; store 8 bytes\r\n    add         r1, 8                           ;\r\n    add         r0, 8                           ;\r\n    dec         r4                              ;\r\n    jnz        .L_COPY_8X                       ;\r\n                  
                              ;\r\n.L_RESIDUAL:                                    ;\r\n    ; quit                                      ;\r\n    cmp         r3, 0                           ;\r\n    je         .L_QUIT                          ;\r\n                                                ;\r\n.L_COPY_1X:                                     ;\r\n    mov        r2b, [r1]                        ;\r\n    mov       [r0], r2b                         ;\r\n    add         r1, 1                           ;\r\n    add         r0, 1                           ;\r\n    dec         r3                              ;\r\n    jnz        .L_COPY_1X                       ;\r\n                                                ;\r\n.L_QUIT:                                        ;\r\n    sfence                                      ;\r\n    emms                                        ;\r\n    RET                                         ;\r\n;}\r\n\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void *memzero_aligned( void *dst, size_t n );\r\n;-----------------------------------------------------------------------------\r\n%macro MEMZERO 1\r\ncglobal memzero_aligned, 2,2\r\n    add  r0, r1\r\n    neg  r1\r\n%if mmsize == 8\r\n    pxor m0, m0\r\n%else\r\n    xorps m0, m0\r\n%endif\r\n.loop:\r\n%assign i 0\r\n%rep %1\r\n    mova [r0 + r1 + i], m0\r\n%assign i i+mmsize\r\n%endrep\r\n    add r1, mmsize*%1\r\n    jl .loop\r\n    RET\r\n%endmacro\r\n\r\nINIT_MMX mmx\r\nMEMZERO 8\r\nINIT_XMM sse\r\nMEMZERO 8\r\nINIT_YMM avx\r\nMEMZERO 4\r\n\r\n\r\n\r\n; ----------------------------------------------------------------------------\r\n; void *fast_memzero( void *dst, size_t n );\r\n; ----------------------------------------------------------------------------\r\nINIT_MMX mmx\r\ncglobal fast_memzero, 2,3,1\r\n;{\r\n    test        r1, r1                          ; if n = 0, quit\r\n    jz         .L_QUIT                          ;\r\n    mov        
 r2, r1                          ; r2 <-- r1 = n, copy\r\n    sar         r1, 3                           ; r1 = n/8\r\n    and         r2, 7                           ; r2 = n%8\r\n    cmp         r1, 0                           ; n/8 = 0?\r\n    je        .HEX_ZERO                         ; jump if n < 8\r\n    pxor        m0, m0                          ; clear m0\r\n                                                ;\r\n.L_SET_8X:                                      ;\r\n    movntq    [r0], m0                          ; clear 8 bytes\r\n    add         r0, 8                           ; r0 = r0 + 8\r\n    dec         r1                              ; r1 = r1 - 1\r\n    jnz        .L_SET_8X                        ; loop until r1 = 0\r\n                                                ;\r\n.HEX_ZERO:                                      ;\r\n    xor         r1, r1                          ; clear r1\r\n    cmp         r2, 0                           ; n%8 = 0?\r\n    je         .L_QUIT                          ;\r\n                                                ;\r\n.L_RESIDUAL:                                    ;\r\n    mov       [r0], r1b                         ; mov 1 byte\r\n    add         r0, 1                           ;\r\n    dec         r2                              ;\r\n    jnz        .L_RESIDUAL                      ;\r\n                                                ;\r\n.L_QUIT:                                        ;\r\n    emms                                        ;\r\n    RET                                         ;\r\n;}\r\n\r\n; ----------------------------------------------------------------------------\r\n; void *fast_memset( void *dst, int val, size_t n );\r\n; ----------------------------------------------------------------------------\r\nINIT_MMX mmx\r\ncglobal fast_memset, 3,4,1\r\n;{\r\n    test        r2, r2                          ; if n = 0, quit\r\n    jz         .L_QUIT                          ;\r\n    mov         r3, r2    
                      ; r3 <-- r2 = n, copy\r\n    sar         r2, 3                           ; r2 = n/8\r\n    and         r3, 7                           ; r3 = n%8\r\n    cmp         r2, 0                           ; n/8 = 0?\r\n    je        .HEX_ZERO                         ; jump if n < 8\r\n    movd        m0, r1d                         ; m0[       0] = val (DWORD)\r\n    punpcklbw   m0, m0                          ; m0[      10] = val (BYTE), no saturation\r\n    pshufw      m0, m0, 0                       ; m0[76543210] = val (BYTE)\r\n                                                ;\r\n.L_SET_8X:                                      ;\r\n    movntq    [r0], m0                          ; set 8 bytes\r\n    add         r0, 8                           ; r0 = r0 + 8\r\n    dec         r2                              ; r2 = r2 - 1\r\n    jnz        .L_SET_8X                        ; loop until r2 = 0\r\n                                                ;\r\n.HEX_ZERO:                                      ;\r\n    cmp         r3, 0                           ; n%8 = 0?\r\n    je         .L_QUIT                          ;\r\n                                                ;\r\n.L_RESIDUAL:                                    ;\r\n    mov       [r0], r1b                         ; mov 1 byte\r\n    add         r0, 1                           ;\r\n    dec         r3                              ;\r\n    jnz        .L_RESIDUAL                      ;\r\n                                                ;\r\n.L_QUIT:                                        ;\r\n    emms                                        ;\r\n    RET                                         ;\r\n;}\r\n\r\n\r\n; ------------------------------------------------------------------\r\n; param 1: dst, param 2: src stride\r\n; r0 -- src\r\n%macro FILT_8x2 2\r\n    mova      m3, [r0      + 8]\r\n    mova      m2, [r0         ]\r\n    pavgb     m3, [r0 + %2 + 8]\r\n    pavgb     m2, [r0 + %2    ]\r\n    mova      m1, [r0    
  + 9]\r\n    mova      m0, [r0      + 1]\r\n    pavgb     m1, [r0 + %2 + 9]\r\n    pavgb     m0, [r0 + %2 + 1]\r\n    pavgb     m1, m3\r\n    pavgb     m0, m2\r\n    pand      m1, m7\r\n    pand      m0, m7\r\n    packuswb  m0, m1\r\n    movu    [%1], m0\r\n%endmacro\r\n\r\n; ------------------------------------------------------------------\r\n; param 1: dst, param 2: src stride\r\n; r0 -- src\r\n%macro FILT_16x2 2\r\n    mova      m3, [r0      + mmsize]\r\n    mova      m2, [r0              ]\r\n    pavgb     m3, [r0 + %2 + mmsize]\r\n    pavgb     m2, [r0 + %2         ]\r\n    PALIGNR   m0, m3, 1, m6\r\n    pavgb     m0, m3\r\n    PALIGNR   m3, m2, 1, m6\r\n    pavgb     m3, m2\r\n    pand      m0, m7\r\n    pand      m3, m7\r\n    packuswb  m3, m0\r\n    movu    [%1], m3\r\n%endmacro\r\n\r\n; ----------------------------------------------------------------------------\r\n; void lowres_filter_core_c( pel_t *src, int i_src, pel_t *dst, int i_dst,\r\n;                            int width, int height )\r\n; ----------------------------------------------------------------------------\r\n%macro LOWRES_FILTER_CORE 0\r\ncglobal lowres_filter_core, 6,7,8\r\n%if mmsize >= 16                            ;\r\n    add       r4,   mmsize-1                ;\r\n    and       r4, ~(mmsize-1)               ;\r\n%endif                                      ;\r\n    ; src += 2*[(height-1)*i_src + width]   ;\r\n    mov      r6d, r5d                       ; r6 <-- height\r\n    dec      r6d                            ; r6 <-- (height - 1)\r\n    imul     r6d, r1d                       ; r6 <-- (height - 1) * i_src\r\n    add      r6d, r4d                       ; r6 <-- (height - 1) * i_src + width\r\n    lea       r0, [r0+r6*2]                 ; r0 <== src + 2*((height - 1) * i_src + width)\r\n    ; dst += (height-1)*stride + width      ;\r\n    mov      r6d, r5d                       ; r6 <-- height\r\n    dec      r6d                            ; r6 <-- (height - 1)\r\n    imul    
 r6d, r3d                       ; r6 <-- (height - 1) * i_dst\r\n    add      r6d, r4d                       ; r6 <-- (height - 1) * i_dst + width\r\n    add       r2, r6                        ; r2 <== dst + (height - 1) * i_dst + width\r\n    ; gap of src and dst in each line       ;\r\n    sub      r3d, r4d                       ; r3 <== i_dst - width  // dst gap\r\n    mov      r6d, r1d                       ; r6 <-- i_src\r\n    sub      r6d, r4d                       ; r6 <-- i_src - width\r\n    shl      r6d, 1                         ; r6 <-- 2 * (i_src - width)\r\n    PUSH     r6                             ; src gap\r\n%define   src_gap [rsp]                     ;\r\n                                            ;\r\n    pcmpeqb   m7, m7                        ; m7 <-- [FFFF...FFFF]\r\n    psrlw     m7, 8                         ; m7 <-- [00FF...00FF]\r\n                                            ;\r\n.vloop:                         ; ==== for (; height>0; height--) {\r\n    mov      r6d, r4d                       ; r6 <-- width\r\n%ifnidn cpuname, mmx2                       ;\r\n%if mmsize <= 16                            ;\r\n    mova      m0, [r0     ]                 ; load from src\r\n    mova      m1, [r0 + r1]                 ; load from down line\r\n    pavgb     m0, m1                        ; m0 <-- average of 2 lines\r\n%endif                                      ;\r\n%endif                                      ;\r\n.hloop:                         ; -------- for (; width>0; width-=mmsize) {\r\n    sub       r0, mmsize*2                  ; src -= mmsize * 2\r\n    sub       r2, mmsize                    ; dst -= mmsize\r\n%ifidn cpuname, mmx2                        ;\r\n    FILT_8x2  r2, r1                        ;\r\n%else                                       ;\r\n    FILT_16x2 r2, r1                        ;\r\n%endif                                      ;\r\n    sub      r6d, mmsize                    ; r6 -= mmsize\r\n    jg .hloop           
        ; -------- } // end for (width...)\r\n                                            ;\r\n.skip:                                      ;\r\n    sub       r0, src_gap                   ;\r\n    sub       r2, r3                        ;\r\n    dec      r5d                            ;\r\n    jg .vloop                   ; ==== } // end for (height...)\r\n    ADD      rsp, gprsize                   ;\r\n    emms                                    ;\r\n    RET                                     ;\r\n%endmacro ; LOWRES_FILTER_CORE\r\n\r\nINIT_MMX mmx2\r\nLOWRES_FILTER_CORE              ; lowres_filter_core_mmx2\r\nINIT_XMM sse2\r\nLOWRES_FILTER_CORE              ; lowres_filter_core_sse2\r\nINIT_XMM ssse3\r\nLOWRES_FILTER_CORE              ; lowres_filter_core_ssse3\r\nINIT_XMM avx\r\nLOWRES_FILTER_CORE              ; lowres_filter_core_avx\r\n\r\n\r\n\r\n; %if HIGH_BIT_DEPTH == 0\r\n; ;-----------------------------------------------------------------------------\r\n; ; void integral_init4h( uint16_t *sum, uint8_t *pix, intptr_t stride )\r\n; ;-----------------------------------------------------------------------------\r\n; %macro INTEGRAL_INIT4H 0\r\n; cglobal integral_init4h, 3,4\r\n;     lea     r3, [r0+r2*2]\r\n;     add     r1, r2\r\n;     neg     r2\r\n;     pxor    m4, m4\r\n; .loop:\r\n;     mova    m0, [r1+r2]\r\n; %if mmsize==32\r\n;     movu    m1, [r1+r2+8]\r\n; %else\r\n;     mova    m1, [r1+r2+16]\r\n;     palignr m1, m0, 8\r\n; %endif\r\n;     mpsadbw m0, m4, 0\r\n;     mpsadbw m1, m4, 0\r\n;     paddw   m0, [r0+r2*2]\r\n;     paddw   m1, [r0+r2*2+mmsize]\r\n;     mova  [r3+r2*2   ], m0\r\n;     mova  [r3+r2*2+mmsize], m1\r\n;     add     r2, mmsize\r\n;     jl .loop\r\n;     RET\r\n; %endmacro\r\n; \r\n; INIT_XMM sse4\r\n; INTEGRAL_INIT4H\r\n; INIT_YMM avx2\r\n; INTEGRAL_INIT4H\r\n; \r\n; %macro INTEGRAL_INIT8H 0\r\n; cglobal integral_init8h, 3,4\r\n;     lea     r3, [r0+r2*2]\r\n;     add     r1, r2\r\n;     neg     r2\r\n;     pxor    m4, m4\r\n; 
.loop:\r\n;     mova    m0, [r1+r2]\r\n; %if mmsize==32\r\n;     movu    m1, [r1+r2+8]\r\n;     mpsadbw m2, m0, m4, 100100b\r\n;     mpsadbw m3, m1, m4, 100100b\r\n; %else\r\n;     mova    m1, [r1+r2+16]\r\n;     palignr m1, m0, 8\r\n;     mpsadbw m2, m0, m4, 100b\r\n;     mpsadbw m3, m1, m4, 100b\r\n; %endif\r\n;     mpsadbw m0, m4, 0\r\n;     mpsadbw m1, m4, 0\r\n;     paddw   m0, [r0+r2*2]\r\n;     paddw   m1, [r0+r2*2+mmsize]\r\n;     paddw   m0, m2\r\n;     paddw   m1, m3\r\n;     mova  [r3+r2*2   ], m0\r\n;     mova  [r3+r2*2+mmsize], m1\r\n;     add     r2, mmsize\r\n;     jl .loop\r\n;     RET\r\n; %endmacro\r\n; \r\n; INIT_XMM sse4\r\n; INTEGRAL_INIT8H\r\n; INIT_XMM avx\r\n; INTEGRAL_INIT8H\r\n; INIT_YMM avx2\r\n; INTEGRAL_INIT8H\r\n; %endif ; !HIGH_BIT_DEPTH\r\n; \r\n; %macro INTEGRAL_INIT_8V 0\r\n; ;-----------------------------------------------------------------------------\r\n; ; void integral_init8v( uint16_t *sum8, intptr_t stride )\r\n; ;-----------------------------------------------------------------------------\r\n; cglobal integral_init8v, 3,3\r\n;     add   r1, r1\r\n;     add   r0, r1\r\n;     lea   r2, [r0+r1*8]\r\n;     neg   r1\r\n; .loop:\r\n;     mova  m0, [r2+r1]\r\n;     mova  m1, [r2+r1+mmsize]\r\n;     psubw m0, [r0+r1]\r\n;     psubw m1, [r0+r1+mmsize]\r\n;     mova  [r0+r1], m0\r\n;     mova  [r0+r1+mmsize], m1\r\n;     add   r1, 2*mmsize\r\n;     jl .loop\r\n;     RET\r\n; %endmacro\r\n; \r\n; INIT_MMX mmx\r\n; INTEGRAL_INIT_8V\r\n; INIT_XMM sse2\r\n; INTEGRAL_INIT_8V\r\n; INIT_YMM avx2\r\n; INTEGRAL_INIT_8V\r\n; \r\n; ;-----------------------------------------------------------------------------\r\n; ; void integral_init4v( uint16_t *sum8, uint16_t *sum4, intptr_t stride )\r\n; ;-----------------------------------------------------------------------------\r\n; INIT_MMX mmx\r\n; cglobal integral_init4v, 3,5\r\n;     shl   r2, 1\r\n;     lea   r3, [r0+r2*4]\r\n;     lea   r4, [r0+r2*8]\r\n;     mova  m0, [r0+r2]\r\n;     mova  m4, 
[r4+r2]\r\n; .loop:\r\n;     mova  m1, m4\r\n;     psubw m1, m0\r\n;     mova  m4, [r4+r2-8]\r\n;     mova  m0, [r0+r2-8]\r\n;     paddw m1, m4\r\n;     mova  m3, [r3+r2-8]\r\n;     psubw m1, m0\r\n;     psubw m3, m0\r\n;     mova  [r0+r2-8], m1\r\n;     mova  [r1+r2-8], m3\r\n;     sub   r2, 8\r\n;     jge .loop\r\n;     RET\r\n; \r\n; INIT_XMM sse2\r\n; cglobal integral_init4v, 3,5\r\n;     shl     r2, 1\r\n;     add     r0, r2\r\n;     add     r1, r2\r\n;     lea     r3, [r0+r2*4]\r\n;     lea     r4, [r0+r2*8]\r\n;     neg     r2\r\n; .loop:\r\n;     mova    m0, [r0+r2]\r\n;     mova    m1, [r4+r2]\r\n;     mova    m2, m0\r\n;     mova    m4, m1\r\n;     shufpd  m0, [r0+r2+16], 1\r\n;     shufpd  m1, [r4+r2+16], 1\r\n;     paddw   m0, m2\r\n;     paddw   m1, m4\r\n;     mova    m3, [r3+r2]\r\n;     psubw   m1, m0\r\n;     psubw   m3, m2\r\n;     mova  [r0+r2], m1\r\n;     mova  [r1+r2], m3\r\n;     add     r2, 16\r\n;     jl .loop\r\n;     RET\r\n; \r\n; INIT_XMM ssse3\r\n; cglobal integral_init4v, 3,5\r\n;     shl     r2, 1\r\n;     add     r0, r2\r\n;     add     r1, r2\r\n;     lea     r3, [r0+r2*4]\r\n;     lea     r4, [r0+r2*8]\r\n;     neg     r2\r\n; .loop:\r\n;     mova    m2, [r0+r2]\r\n;     mova    m0, [r0+r2+16]\r\n;     mova    m4, [r4+r2]\r\n;     mova    m1, [r4+r2+16]\r\n;     palignr m0, m2, 8\r\n;     palignr m1, m4, 8\r\n;     paddw   m0, m2\r\n;     paddw   m1, m4\r\n;     mova    m3, [r3+r2]\r\n;     psubw   m1, m0\r\n;     psubw   m3, m2\r\n;     mova  [r0+r2], m1\r\n;     mova  [r1+r2], m3\r\n;     add     r2, 16\r\n;     jl .loop\r\n;     RET\r\n; \r\n; INIT_YMM avx2\r\n; cglobal integral_init4v, 3,5\r\n;     add     r2, r2\r\n;     add     r0, r2\r\n;     add     r1, r2\r\n;     lea     r3, [r0+r2*4]\r\n;     lea     r4, [r0+r2*8]\r\n;     neg     r2\r\n; .loop:\r\n;     mova    m2, [r0+r2]\r\n;     movu    m1, [r4+r2+8]\r\n;     paddw   m0, m2, [r0+r2+8]\r\n;     paddw   m1, [r4+r2]\r\n;     mova    m3, [r3+r2]\r\n;     psubw   m1, 
m0\r\n;     psubw   m3, m2\r\n;     mova  [r0+r2], m1\r\n;     mova  [r1+r2], m3\r\n;     add     r2, 32\r\n;     jl .loop\r\n;     RET\r\n; \r\n; %macro FILT8x4 7\r\n;     mova      %3, [r0+%7]\r\n;     mova      %4, [r0+r5+%7]\r\n;     pavgb     %3, %4\r\n;     pavgb     %4, [r0+r5*2+%7]\r\n;     PALIGNR   %1, %3, 1, m6\r\n;     PALIGNR   %2, %4, 1, m6\r\n; %if cpuflag(xop)\r\n;     pavgb     %1, %3\r\n;     pavgb     %2, %4\r\n; %else\r\n;     pavgb     %1, %3\r\n;     pavgb     %2, %4\r\n;     psrlw     %5, %1, 8\r\n;     psrlw     %6, %2, 8\r\n;     pand      %1, m7\r\n;     pand      %2, m7\r\n; %endif\r\n; %endmacro\r\n; \r\n; %macro FILT32x4U 4\r\n;     movu      m1, [r0+r5]\r\n;     pavgb     m0, m1, [r0]\r\n;     movu      m3, [r0+r5+1]\r\n;     pavgb     m2, m3, [r0+1]\r\n;     pavgb     m1, [r0+r5*2]\r\n;     pavgb     m3, [r0+r5*2+1]\r\n;     pavgb     m0, m2\r\n;     pavgb     m1, m3\r\n; \r\n;     movu      m3, [r0+r5+mmsize]\r\n;     pavgb     m2, m3, [r0+mmsize]\r\n;     movu      m5, [r0+r5+1+mmsize]\r\n;     pavgb     m4, m5, [r0+1+mmsize]\r\n;     pavgb     m3, [r0+r5*2+mmsize]\r\n;     pavgb     m5, [r0+r5*2+1+mmsize]\r\n;     pavgb     m2, m4\r\n;     pavgb     m3, m5\r\n; \r\n;     pshufb    m0, m7\r\n;     pshufb    m1, m7\r\n;     pshufb    m2, m7\r\n;     pshufb    m3, m7\r\n;     punpckhqdq m4, m0, m2\r\n;     punpcklqdq m0, m0, m2\r\n;     punpckhqdq m5, m1, m3\r\n;     punpcklqdq m2, m1, m3\r\n;     vpermq    m0, m0, q3120\r\n;     vpermq    m1, m4, q3120\r\n;     vpermq    m2, m2, q3120\r\n;     vpermq    m3, m5, q3120\r\n;     movu    [%1], m0\r\n;     movu    [%2], m1\r\n;     movu    [%3], m2\r\n;     movu    [%4], m3\r\n; %endmacro\r\n; \r\n; %macro FILT16x2 4\r\n;     mova      m3, [r0+%4+mmsize]\r\n;     mova      m2, [r0+%4]\r\n;     pavgb     m3, [r0+%4+r5+mmsize]\r\n;     pavgb     m2, [r0+%4+r5]\r\n;     PALIGNR   %1, m3, 1, m6\r\n;     pavgb     %1, m3\r\n;     PALIGNR   m3, m2, 1, m6\r\n;     pavgb     m3, m2\r\n; %if 
cpuflag(xop)\r\n;     vpperm    m5, m3, %1, m7\r\n;     vpperm    m3, m3, %1, m6\r\n; %else\r\n;     psrlw     m5, m3, 8\r\n;     psrlw     m4, %1, 8\r\n;     pand      m3, m7\r\n;     pand      %1, m7\r\n;     packuswb  m3, %1\r\n;     packuswb  m5, m4\r\n; %endif\r\n;     mova    [%2], m3\r\n;     mova    [%3], m5\r\n;     mova      %1, m2\r\n; %endmacro\r\n; \r\n; %macro FILT8x2U 3\r\n;     mova      m3, [r0+%3+8]\r\n;     mova      m2, [r0+%3]\r\n;     pavgb     m3, [r0+%3+r5+8]\r\n;     pavgb     m2, [r0+%3+r5]\r\n;     mova      m1, [r0+%3+9]\r\n;     mova      m0, [r0+%3+1]\r\n;     pavgb     m1, [r0+%3+r5+9]\r\n;     pavgb     m0, [r0+%3+r5+1]\r\n;     pavgb     m1, m3\r\n;     pavgb     m0, m2\r\n;     psrlw     m3, m1, 8\r\n;     psrlw     m2, m0, 8\r\n;     pand      m1, m7\r\n;     pand      m0, m7\r\n;     packuswb  m0, m1\r\n;     packuswb  m2, m3\r\n;     mova    [%1], m0\r\n;     mova    [%2], m2\r\n; %endmacro\r\n; \r\n; %macro FILT8xU 3\r\n;     mova      m3, [r0+%3+8]\r\n;     mova      m2, [r0+%3]\r\n;     pavgw     m3, [r0+%3+r5+8]\r\n;     pavgw     m2, [r0+%3+r5]\r\n;     movu      m1, [r0+%3+10]\r\n;     movu      m0, [r0+%3+2]\r\n;     pavgw     m1, [r0+%3+r5+10]\r\n;     pavgw     m0, [r0+%3+r5+2]\r\n;     pavgw     m1, m3\r\n;     pavgw     m0, m2\r\n;     psrld     m3, m1, 16\r\n;     psrld     m2, m0, 16\r\n;     pand      m1, m7\r\n;     pand      m0, m7\r\n;     packssdw  m0, m1\r\n;     packssdw  m2, m3\r\n;     movu    [%1], m0\r\n;     mova    [%2], m2\r\n; %endmacro\r\n; \r\n; %macro FILT8xA 4\r\n;     movu      m3, [r0+%4+mmsize]\r\n;     movu      m2, [r0+%4]\r\n;     pavgw     m3, [r0+%4+r5+mmsize]\r\n;     pavgw     m2, [r0+%4+r5]\r\n;     PALIGNR   %1, m3, 2, m6\r\n;     pavgw     %1, m3\r\n;     PALIGNR   m3, m2, 2, m6\r\n;     pavgw     m3, m2\r\n; %if cpuflag(xop)\r\n;     vpperm    m5, m3, %1, m7\r\n;     vpperm    m3, m3, %1, m6\r\n; %else\r\n;     psrld     m5, m3, 16\r\n;     psrld     m4, %1, 16\r\n;     pand      m3, 
m7\r\n;     pand      %1, m7\r\n;     packssdw  m3, %1\r\n;     packssdw  m5, m4\r\n; %endif\r\n; %if cpuflag(avx2)\r\n;     vpermq     m3, m3, q3120\r\n;     vpermq     m5, m5, q3120\r\n; %endif\r\n;     movu    [%2], m3\r\n;     movu    [%3], m5\r\n;     movu      %1, m2\r\n; %endmacro\r\n; \r\n; ;-----------------------------------------------------------------------------\r\n; ; void frame_init_lowres_core( uint8_t *src0, uint8_t *dst0, uint8_t *dsth, uint8_t *dstv, uint8_t *dstc,\r\n; ;                              intptr_t src_stride, intptr_t dst_stride, int width, int height )\r\n; ;-----------------------------------------------------------------------------\r\n; %macro FRAME_INIT_LOWRES 0\r\n; cglobal frame_init_lowres_core, 6,7,(12-4*(BIT_DEPTH/9)) ; 8 for HIGH_BIT_DEPTH, 12 otherwise\r\n; %if HIGH_BIT_DEPTH\r\n;     shl   dword r6m, 1\r\n;     FIX_STRIDES r5\r\n;     shl   dword r7m, 1\r\n; %endif\r\n; %if mmsize >= 16\r\n;     add   dword r7m, mmsize-1\r\n;     and   dword r7m, ~(mmsize-1)\r\n; %endif\r\n;     ; src += 2*(height-1)*stride + 2*width\r\n;     mov      r6d, r8m\r\n;     dec      r6d\r\n;     imul     r6d, r5d\r\n;     add      r6d, r7m\r\n;     lea       r0, [r0+r6*2]\r\n;     ; dst += (height-1)*stride + width\r\n;     mov      r6d, r8m\r\n;     dec      r6d\r\n;     imul     r6d, r6m\r\n;     add      r6d, r7m\r\n;     add       r1, r6\r\n;     add       r2, r6\r\n;     add       r3, r6\r\n;     add       r4, r6\r\n;     ; gap = stride - width\r\n;     mov      r6d, r6m\r\n;     sub      r6d, r7m\r\n;     PUSH      r6\r\n;     %define dst_gap [rsp+gprsize]\r\n;     mov      r6d, r5d\r\n;     sub      r6d, r7m\r\n;     shl      r6d, 1\r\n;     PUSH      r6\r\n;     %define src_gap [rsp]\r\n; %if HIGH_BIT_DEPTH\r\n; %if cpuflag(xop)\r\n;     mova      m6, [deinterleave_shuf32a]\r\n;     mova      m7, [deinterleave_shuf32b]\r\n; %else\r\n;     pcmpeqw   m7, m7\r\n;     psrld     m7, 16\r\n; %endif\r\n; .vloop:\r\n;     mov      r6d, 
r7m\r\n; %ifnidn cpuname, mmx2\r\n;     movu      m0, [r0]\r\n;     movu      m1, [r0+r5]\r\n;     pavgw     m0, m1\r\n;     pavgw     m1, [r0+r5*2]\r\n; %endif\r\n; .hloop:\r\n;     sub       r0, mmsize*2\r\n;     sub       r1, mmsize\r\n;     sub       r2, mmsize\r\n;     sub       r3, mmsize\r\n;     sub       r4, mmsize\r\n; %ifidn cpuname, mmx2\r\n;     FILT8xU r1, r2, 0\r\n;     FILT8xU r3, r4, r5\r\n; %else\r\n;     FILT8xA m0, r1, r2, 0\r\n;     FILT8xA m1, r3, r4, r5\r\n; %endif\r\n;     sub      r6d, mmsize\r\n;     jg .hloop\r\n; %else ; !HIGH_BIT_DEPTH\r\n; %if cpuflag(avx2)\r\n;     mova      m7, [deinterleave_shuf]\r\n; %elif cpuflag(xop)\r\n;     mova      m6, [deinterleave_shuf32a]\r\n;     mova      m7, [deinterleave_shuf32b]\r\n; %else\r\n;     pcmpeqb   m7, m7\r\n;     psrlw     m7, 8\r\n; %endif\r\n; .vloop:\r\n;     mov      r6d, r7m\r\n; %ifnidn cpuname, mmx2\r\n; %if mmsize <= 16\r\n;     mova      m0, [r0]\r\n;     mova      m1, [r0+r5]\r\n;     pavgb     m0, m1\r\n;     pavgb     m1, [r0+r5*2]\r\n; %endif\r\n; %endif\r\n; .hloop:\r\n;     sub       r0, mmsize*2\r\n;     sub       r1, mmsize\r\n;     sub       r2, mmsize\r\n;     sub       r3, mmsize\r\n;     sub       r4, mmsize\r\n; %if mmsize==32\r\n;     FILT32x4U r1, r2, r3, r4\r\n; %elifdef m8\r\n;     FILT8x4   m0, m1, m2, m3, m10, m11, mmsize\r\n;     mova      m8, m0\r\n;     mova      m9, m1\r\n;     FILT8x4   m2, m3, m0, m1, m4, m5, 0\r\n; %if cpuflag(xop)\r\n;     vpperm    m4, m2, m8, m7\r\n;     vpperm    m2, m2, m8, m6\r\n;     vpperm    m5, m3, m9, m7\r\n;     vpperm    m3, m3, m9, m6\r\n; %else\r\n;     packuswb  m2, m8\r\n;     packuswb  m3, m9\r\n;     packuswb  m4, m10\r\n;     packuswb  m5, m11\r\n; %endif\r\n;     mova    [r1], m2\r\n;     mova    [r2], m4\r\n;     mova    [r3], m3\r\n;     mova    [r4], m5\r\n; %elifidn cpuname, mmx2\r\n;     FILT8x2U  r1, r2, 0\r\n;     FILT8x2U  r3, r4, r5\r\n; %else\r\n;     FILT16x2  m0, r1, r2, 0\r\n;     FILT16x2  m1, r3, r4, 
r5\r\n; %endif\r\n;     sub      r6d, mmsize\r\n;     jg .hloop\r\n; %endif ; HIGH_BIT_DEPTH\r\n; .skip:\r\n;     mov       r6, dst_gap\r\n;     sub       r0, src_gap\r\n;     sub       r1, r6\r\n;     sub       r2, r6\r\n;     sub       r3, r6\r\n;     sub       r4, r6\r\n;     dec    dword r8m\r\n;     jg .vloop\r\n;     ADD      rsp, 2*gprsize\r\n;     emms\r\n;     RET\r\n; %endmacro ; FRAME_INIT_LOWRES\r\n; \r\n; INIT_MMX mmx2\r\n; FRAME_INIT_LOWRES\r\n; %if ARCH_X86_64 == 0\r\n; INIT_MMX cache32, mmx2\r\n; FRAME_INIT_LOWRES\r\n; %endif\r\n; INIT_XMM sse2\r\n; FRAME_INIT_LOWRES\r\n; INIT_XMM ssse3\r\n; FRAME_INIT_LOWRES\r\n; INIT_XMM avx\r\n; FRAME_INIT_LOWRES\r\n; INIT_XMM xop\r\n; FRAME_INIT_LOWRES\r\n; %if ARCH_X86_64 == 1\r\n; INIT_YMM avx2\r\n; FRAME_INIT_LOWRES\r\n; %endif\r\n; \r\n; ;-----------------------------------------------------------------------------\r\n; ; void mbtree_propagate_cost( int *dst, uint16_t *propagate_in, int32_t *intra_costs,\r\n; ;                             uint16_t *inter_costs, int32_t *inv_qscales, double *fps_factor, int len )\r\n; ;-----------------------------------------------------------------------------\r\n; INIT_XMM sse2\r\n; cglobal mbtree_propagate_cost, 7,7,7\r\n;     dec         r6d\r\n;     movsd       m6, [r5]\r\n;     mulpd       m6, [pd_inv256]\r\n;     xor         r5d, r5d\r\n;     lea         r0, [r0+r5*2]\r\n;     pxor        m4, m4\r\n;     movlhps     m6, m6\r\n;     mova        m5, [pw_3fff]\r\n; \r\n; .loop:\r\n;     movh        m2, [r2+r5*4]       ; intra\r\n;     movh        m0, [r4+r5*4]       ; invq\r\n;     movd        m3, [r3+r5*2]       ; inter\r\n;     pand        m3, m5\r\n;     punpcklwd   m3, m4\r\n; \r\n;     ; PMINSD\r\n;     pcmpgtd     m1, m2, m3\r\n;     pand        m3, m1\r\n;     pandn       m1, m2\r\n;     por         m3, m1\r\n; \r\n;     movd        m1, [r1+r5*2]       ; prop\r\n;     punpckldq   m2, m2\r\n;     punpckldq   m0, m0\r\n;     pmuludq     m0, m2\r\n;     pshufd      
m2, m2, q3120\r\n;     pshufd      m0, m0, q3120\r\n; \r\n;     punpcklwd   m1, m4\r\n;     cvtdq2pd    m0, m0\r\n;     mulpd       m0, m6              ; intra*invq*fps_factor>>8\r\n;     cvtdq2pd    m1, m1              ; prop\r\n;     addpd       m0, m1              ; prop + (intra*invq*fps_factor>>8)\r\n;     ;cvtdq2ps    m1, m2              ; intra\r\n;     cvtdq2pd    m1, m2              ; intra\r\n;     psubd       m2, m3              ; intra - inter\r\n;     cvtdq2pd    m2, m2              ; intra - inter\r\n;     ;rcpps       m3, m1\r\n;     ;mulps       m1, m3              ; intra * (1/intra 1st approx)\r\n;     ;mulps       m1, m3              ; intra * (1/intra 1st approx)^2\r\n;     ;addps       m3, m3              ; 2 * (1/intra 1st approx)\r\n;     ;subps       m3, m1              ; 2nd approximation for 1/intra\r\n;     ;cvtps2pd    m3, m3              ; 1 / intra 1st approximation\r\n;     mulpd       m0, m2              ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter)\r\n;     ;mulpd       m0, m3              ; / intra\r\n; \r\n;     ; TODO: DIVPD very slow, but match to C model output, since it is not bottleneck function, I comment above faster code\r\n;     divpd       m0, m1\r\n;     addpd       m0, [pd_0_5]\r\n;     cvttpd2dq    m0, m0\r\n; \r\n;     movh        [r0+r5*4], m0\r\n;     add         r5d, 2\r\n;     cmp         r5d, r6d\r\n;     jl         .loop\r\n; \r\n;     xor         r6d, r5d\r\n;     jnz         .even\r\n;     movd        m2, [r2+r5*4]       ; intra\r\n;     movd        m0, [r4+r5*4]       ; invq\r\n;     movd        m3, [r3+r5*2]       ; inter\r\n;     pand        m3, m5\r\n;     punpcklwd   m3, m4\r\n; \r\n;     ; PMINSD\r\n;     pcmpgtd     m1, m2, m3\r\n;     pand        m3, m1\r\n;     pandn       m1, m2\r\n;     por         m3, m1\r\n; \r\n;     movd        m1, [r1+r5*2]       ; prop\r\n;     punpckldq   m2, m2              ; DWORD [_ 1 _ 0]\r\n;     punpckldq   m0, m0\r\n;     pmuludq     m0, m2              ; 
QWORD [m1 m0]\r\n;     pshufd      m2, m2, q3120\r\n;     pshufd      m0, m0, q3120\r\n;     punpcklwd   m1, m4\r\n;     cvtdq2pd    m0, m0\r\n;     mulpd       m0, m6              ; intra*invq*fps_factor>>8\r\n;     cvtdq2pd    m1, m1              ; prop\r\n;     addpd       m0, m1              ; prop + (intra*invq*fps_factor>>8)\r\n;     cvtdq2pd    m1, m2              ; intra\r\n;     psubd       m2, m3              ; intra - inter\r\n;     cvtdq2pd    m2, m2              ; intra - inter\r\n;     mulpd       m0, m2              ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter)\r\n; \r\n;     divpd       m0, m1\r\n;     addpd       m0, [pd_0_5]\r\n;     cvttpd2dq    m0, m0\r\n;     movd        [r0+r5*4], m0\r\n; .even:\r\n;     RET\r\n; \r\n; \r\n; ;-----------------------------------------------------------------------------\r\n; ; void mbtree_propagate_cost( int *dst, uint16_t *propagate_in, int32_t *intra_costs,\r\n; ;                             uint16_t *inter_costs, int32_t *inv_qscales, double *fps_factor, int len )\r\n; ;-----------------------------------------------------------------------------\r\n; ; FIXME: align loads/stores to 16 bytes\r\n; %macro MBTREE_AVX 0\r\n; cglobal mbtree_propagate_cost, 7,7,7\r\n;     sub             r6d, 3\r\n;     vbroadcastsd    m6, [r5]\r\n;     mulpd           m6, [pd_inv256]\r\n;     xor             r5d, r5d\r\n;     mova            m5, [pw_3fff]\r\n; \r\n; .loop:\r\n;     movu            xm2, [r2+r5*4]      ; intra\r\n;     movu            xm0, [r4+r5*4]      ; invq\r\n;     pmovzxwd        xm3, [r3+r5*2]      ; inter\r\n;     pand            xm3, xm5\r\n;     pminsd          xm3, xm2\r\n; \r\n;     pmovzxwd        xm1, [r1+r5*2]      ; prop\r\n;     pmulld          xm0, xm2\r\n;     cvtdq2pd        m0, xm0\r\n;     cvtdq2pd        m1, xm1             ; prop\r\n; ;%if cpuflag(avx2)\r\n; ;    fmaddpd         m0, m0, m6, m1\r\n; ;%else\r\n;     mulpd           m0, m6              ; intra*invq*fps_factor>>8\r\n;   
  addpd           m0, m1              ; prop + (intra*invq*fps_factor>>8)\r\n; ;%endif\r\n;     cvtdq2pd        m1, xm2             ; intra\r\n;     psubd           xm2, xm3            ; intra - inter\r\n;     cvtdq2pd        m2, xm2             ; intra - inter\r\n;     mulpd           m0, m2              ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter)\r\n; \r\n;     ; TODO: DIVPD very slow, but match to C model output, since it is not bottleneck function, I comment above faster code\r\n;     divpd           m0, m1\r\n;     addpd           m0, [pd_0_5]\r\n;     cvttpd2dq       xm0, m0\r\n; \r\n;     movu            [r0+r5*4], xm0\r\n;     add             r5d, 4              ; process 4 values in one iteration\r\n;     cmp             r5d, r6d\r\n;     jl             .loop\r\n; \r\n;     add             r6d, 3\r\n;     xor             r6d, r5d\r\n;     jz              .even               ; if loop counter is multiple of 4, all values are processed\r\n; \r\n;     and             r6d, 3              ; otherwise, remaining unprocessed values must be 1, 2 or 3\r\n;     cmp             r6d, 1\r\n;     je              .process1           ; if only 1 value is unprocessed\r\n; \r\n;     ; process 2 values here\r\n;     movq            xm2, [r2+r5*4]      ; intra\r\n;     movq            xm0, [r4+r5*4]      ; invq\r\n;     movd            xm3, [r3+r5*2]      ; inter\r\n;     pmovzxwd        xm3, xm3\r\n;     pand            xm3, xm5\r\n;     pminsd          xm3, xm2\r\n; \r\n;     movd            xm1, [r1+r5*2]      ; prop\r\n;     pmovzxwd        xm1, xm1\r\n;     pmulld          xm0, xm2\r\n;     cvtdq2pd        m0, xm0\r\n;     cvtdq2pd        m1, xm1             ; prop\r\n; ;%if cpuflag(avx2)\r\n; ;    fmaddpd         m0, m0, m6, m1\r\n; ;%else\r\n;     mulpd           m0, m6              ; intra*invq*fps_factor>>8\r\n;     addpd           m0, m1              ; prop + (intra*invq*fps_factor>>8)\r\n; ;%endif\r\n;     cvtdq2pd        m1, xm2             ; 
intra\r\n;     psubd           xm2, xm3            ; intra - inter\r\n;     cvtdq2pd        m2, xm2             ; intra - inter\r\n;     mulpd           m0, m2              ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter)\r\n; \r\n;     divpd           m0, m1\r\n;     addpd           m0, [pd_0_5]\r\n;     cvttpd2dq       xm0, m0\r\n;     movq            [r0+r5*4], xm0\r\n; \r\n;     xor             r6d, 2\r\n;     jz              .even\r\n;     add             r5d, 2\r\n; \r\n;     ; process 1 value here\r\n; .process1:\r\n;     movd            xm2, [r2+r5*4]      ; intra\r\n;     movd            xm0, [r4+r5*4]      ; invq\r\n;     movzx           r6d, word [r3+r5*2] ; inter\r\n;     movd            xm3, r6d\r\n;     pand            xm3, xm5\r\n;     pminsd          xm3, xm2\r\n; \r\n;     movzx           r6d, word [r1+r5*2] ; prop\r\n;     movd            xm1, r6d\r\n;     pmulld          xm0, xm2\r\n;     cvtdq2pd        m0, xm0\r\n;     cvtdq2pd        m1, xm1             ; prop\r\n; ;%if cpuflag(avx2)\r\n; ;    fmaddpd         m0, m0, m6, m1\r\n; ;%else\r\n;     mulpd           m0, m6              ; intra*invq*fps_factor>>8\r\n;     addpd           m0, m1              ; prop + (intra*invq*fps_factor>>8)\r\n; ;%endif\r\n;     cvtdq2pd        m1, xm2             ; intra\r\n;     psubd           xm2, xm3            ; intra - inter\r\n;     cvtdq2pd        m2, xm2             ; intra - inter\r\n;     mulpd           m0, m2              ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter)\r\n; \r\n;     divpd           m0, m1\r\n;     addpd           m0, [pd_0_5]\r\n;     cvttpd2dq       xm0, m0\r\n;     movd            [r0+r5*4], xm0\r\n; .even:\r\n;     RET\r\n; %endmacro\r\n; \r\n; INIT_YMM avx\r\n; MBTREE_AVX\r\n; \r\n; INIT_YMM avx2\r\n; MBTREE_AVX\r\n; \r\n; \r\n; %macro CUTREE_FIX8 0\r\n; ;-----------------------------------------------------------------------------\r\n; ; void cutree_fix8_pack( uint16_t *dst, double *src, int count )\r\n; 
;-----------------------------------------------------------------------------\r\n; cglobal cutree_fix8_pack, 3, 4, 5\r\n;     movapd       m2, [pq_256]\r\n;     sub          r2d, mmsize / 2\r\n;     movsxdifnidn r2, r2d\r\n;     lea          r1, [r1 + 8 * r2]\r\n;     lea          r0, [r0 + 2 * r2]\r\n;     neg          r2\r\n;     jg .skip_loop\r\n; .loop:\r\n;     mulpd        m0, m2, [r1 + 8 * r2]\r\n;     mulpd        m1, m2, [r1 + 8 * r2 + mmsize]\r\n;     mulpd        m3, m2, [r1 + 8 * r2 + 2 * mmsize]\r\n;     mulpd        m4, m2, [r1 + 8 * r2 + 3 * mmsize]\r\n;     cvttpd2dq    xm0, m0\r\n;     cvttpd2dq    xm1, m1\r\n;     cvttpd2dq    xm3, m3\r\n;     cvttpd2dq    xm4, m4\r\n; %if mmsize == 32\r\n;     vinserti128  m0, m0, xm3, 1\r\n;     vinserti128  m1, m1, xm4, 1\r\n;     packssdw     m0, m1\r\n; %else\r\n;     punpcklqdq   m0, m1\r\n;     punpcklqdq   m3, m4\r\n;     packssdw     m0, m3\r\n; %endif\r\n;     mova         [r0 + 2 * r2], m0\r\n;     add          r2, mmsize / 2\r\n;     jle .loop\r\n; .skip_loop:\r\n;     sub          r2, mmsize / 2\r\n;     jz .end\r\n;     ; Do the remaining values in scalar in order to avoid overreading src.\r\n; .scalar:\r\n;     movq         xm0, [r1 + 8 * r2 + 4 * mmsize] \r\n;     mulsd        xm0, xm2\r\n;     cvttsd2si    r3d, xm0\r\n;     mov          [r0 + 2 * r2 + mmsize], r3w\r\n;     inc          r2\r\n;     jl .scalar\r\n; .end:\r\n;     RET\r\n; \r\n; ;-----------------------------------------------------------------------------\r\n; ; void cutree_fix8_unpack( double *dst, uint16_t *src, int count )\r\n; ;-----------------------------------------------------------------------------\r\n; cglobal cutree_fix8_unpack, 3, 4, 7\r\n; %if mmsize != 32\r\n;     mova           m4, [cutree_fix8_unpack_shuf+16]\r\n; %endif\r\n;     movapd         m2, [pd_inv256]\r\n;     mova           m3, [cutree_fix8_unpack_shuf]\r\n;     sub            r2d, mmsize / 2\r\n;     movsxdifnidn   r2, r2d\r\n;     lea            r1, [r1 
+ 2 * r2]\r\n;     lea            r0, [r0 + 8 * r2]\r\n;     neg            r2\r\n;     jg .skip_loop\r\n; .loop:\r\n; %if mmsize == 32\r\n;     vbroadcasti128 m0, [r1 + 2 * r2]\r\n;     vbroadcasti128 m1, [r1 + 2 * r2 + 16]\r\n;     pshufb         m0, m3\r\n;     pshufb         m1, m3\r\n; %else\r\n;     mova           m1, [r1 + 2 * r2]\r\n;     pshufb         m0, m1, m3\r\n;     pshufb         m1, m4\r\n; %endif\r\n;     psrad          m0, 16 ; sign-extend\r\n;     psrad          m1, 16\r\n;     cvtdq2pd       m5, xm0\r\n;     cvtdq2pd       m6, xm1\r\n; %if mmsize == 32\r\n;     vpermq         m0, m0, q1032\r\n;     vpermq         m1, m1, q1032\r\n; %else\r\n;     psrldq         m0, 8\r\n;     psrldq         m1, 8\r\n; %endif\r\n;     cvtdq2pd       m0, xm0\r\n;     cvtdq2pd       m1, xm1\r\n;     mulpd          m0, m2\r\n;     mulpd          m1, m2\r\n;     mulpd          m5, m2\r\n;     mulpd          m6, m2\r\n;     movapd         [r0 + 8 * r2], m5\r\n;     movapd         [r0 + 8 * r2 + mmsize], m0\r\n;     movapd         [r0 + 8 * r2 + mmsize * 2], m6\r\n;     movapd         [r0 + 8 * r2 + mmsize * 3], m1\r\n;     add            r2, mmsize / 2\r\n;     jle .loop\r\n; .skip_loop:\r\n;     sub            r2, mmsize / 2\r\n;     jz .end\r\n; .scalar:\r\n;     movzx          r3d, word [r1 + 2 * r2 + mmsize]\r\n;     movsx          r3d, r3w\r\n;     cvtsi2sd       xm0, r3d\r\n;     mulsd          xm0, xm2\r\n;     movsd          [r0 + 8 * r2 + 4 * mmsize], xm0\r\n;     inc            r2\r\n;     jl .scalar\r\n; .end:\r\n;     RET\r\n; %endmacro\r\n; \r\n; INIT_XMM ssse3\r\n; CUTREE_FIX8\r\n; \r\n; INIT_YMM avx2\r\n; CUTREE_FIX8\r\n"
  },
  {
    "path": "source/common/x86/pixeladd8.asm",
    "content": ";*****************************************************************************\r\n;* Copyright (C) 2013-2017 MulticoreWare, Inc\r\n;* Copyright (C) 2018~ VCL, NELVT, Peking University\r\n;*\r\n;* Authors: Praveen Kumar Tiwari <praveen@multicorewareinc.com>\r\n;*          Min Chen <chenm003@163.com>\r\n;*          Jiaqi Zhang <zhangjiaqi.cs@gmail.com>\r\n;*\r\n;* This program is free software; you can redistribute it and/or modify\r\n;* it under the terms of the GNU General Public License as published by\r\n;* the Free Software Foundation; either version 2 of the License, or\r\n;* (at your option) any later version.\r\n;*\r\n;* This program is distributed in the hope that it will be useful,\r\n;* but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n;* GNU General Public License for more details.\r\n;*\r\n;* You should have received a copy of the GNU General Public License\r\n;* along with this program; if not, write to the Free Software\r\n;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n;*\r\n;* This program is also available under a commercial proprietary license.\r\n;* For more information, contact us at license @ x265.com.\r\n;*****************************************************************************/\r\n\r\n%include \"x86inc.asm\"\r\n%include \"x86util.asm\"\r\n\r\nSECTION_RODATA 32\r\n\r\nSECTION .text\r\n\r\ncextern pw_pixel_max\r\n\r\n;-----------------------------------------------------------------------------\r\n; void pixel_add_ps_4x4(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)\r\n;-----------------------------------------------------------------------------\r\n%if HIGH_BIT_DEPTH\r\nINIT_XMM sse2\r\ncglobal pixel_add_ps_4x4, 6, 6, 6, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mova    m1,     [pw_pixel_max]\r\n    pxor    m0,     m0\r\n    add     r4,    
 r4\r\n    add     r5,     r5\r\n    add     r1,     r1\r\n    movh    m2,     [r2]\r\n    movhps  m2,     [r2 + r4]\r\n    movh    m3,     [r3]\r\n    movhps  m3,     [r3 + r5]\r\n    lea     r2,     [r2 + r4 * 2]\r\n    lea     r3,     [r3 + r5 * 2]\r\n    movh    m4,     [r2]\r\n    movhps  m4,     [r2 + r4]\r\n    movh    m5,     [r3]\r\n    movhps  m5,     [r3 + r5]\r\n\r\n    paddw   m2,     m3\r\n    paddw   m4,     m5\r\n    CLIPW2  m2, m4, m0, m1\r\n\r\n    movh    [r0],       m2\r\n    movhps  [r0 + r1],  m2\r\n    lea     r0,     [r0 + r1 * 2]\r\n    movh    [r0],       m4\r\n    movhps  [r0 + r1],  m4\r\n\r\n    RET\r\n%else\r\nINIT_XMM sse4\r\ncglobal pixel_add_ps_4x4, 6, 6, 8, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    add         r5,         r5\r\n    pmovzxbw    m0,         [r2]\r\n    pmovzxbw    m2,         [r2 + r4]\r\n    movh        m1,         [r3]\r\n    movh        m3,         [r3 + r5]\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n    pmovzxbw    m4,         [r2]\r\n    pmovzxbw    m6,         [r2 + r4]\r\n    movh        m5,         [r3]\r\n    movh        m7,         [r3 + r5]\r\n\r\n    paddw       m0,         m1\r\n    paddw       m2,         m3\r\n    paddw       m4,         m5\r\n    paddw       m6,         m7\r\n    packuswb    m0,         m0\r\n    packuswb    m2,         m2\r\n    packuswb    m4,         m4\r\n    packuswb    m6,         m6\r\n\r\n    movd        [r0],       m0\r\n    movd        [r0 + r1],  m2\r\n    lea         r0,         [r0 + r1 * 2]\r\n    movd        [r0],       m4\r\n    movd        [r0 + r1],  m6\r\n\r\n    RET\r\n%endif\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void pixel_add_ps_4x%2(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)\r\n;-----------------------------------------------------------------------------\r\n%macro PIXEL_ADD_PS_W4_H4 
2\r\n%if HIGH_BIT_DEPTH\r\nINIT_XMM sse2\r\ncglobal pixel_add_ps_4x%2, 6, 7, 6, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mova    m1,     [pw_pixel_max]\r\n    pxor    m0,     m0\r\n    mov     r6d,    %2/4\r\n    add     r4,     r4\r\n    add     r5,     r5\r\n    add     r1,     r1\r\n.loop:\r\n    movh    m2,     [r2]\r\n    movhps  m2,     [r2 + r4]\r\n    movh    m3,     [r3]\r\n    movhps  m3,     [r3 + r5]\r\n    lea     r2,     [r2 + r4 * 2]\r\n    lea     r3,     [r3 + r5 * 2]\r\n    movh    m4,     [r2]\r\n    movhps  m4,     [r2 + r4]\r\n    movh    m5,     [r3]\r\n    movhps  m5,     [r3 + r5]\r\n    dec     r6d\r\n    lea     r2,     [r2 + r4 * 2]\r\n    lea     r3,     [r3 + r5 * 2]\r\n\r\n    paddw   m2,     m3\r\n    paddw   m4,     m5\r\n    CLIPW2  m2, m4, m0, m1\r\n\r\n    movh    [r0],       m2\r\n    movhps  [r0 + r1],  m2\r\n    lea     r0,     [r0 + r1 * 2]\r\n    movh    [r0],       m4\r\n    movhps  [r0 + r1],  m4\r\n    lea     r0,     [r0 + r1 * 2]\r\n\r\n    jnz     .loop\r\n    RET\r\n%else\r\nINIT_XMM sse4\r\ncglobal pixel_add_ps_4x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mov         r6d,        %2/4\r\n    add         r5,         r5\r\n.loop:\r\n    pmovzxbw    m0,         [r2]\r\n    pmovzxbw    m2,         [r2 + r4]\r\n    movh        m1,         [r3]\r\n    movh        m3,         [r3 + r5]\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n    pmovzxbw    m4,         [r2]\r\n    pmovzxbw    m6,         [r2 + r4]\r\n    movh        m5,         [r3]\r\n    movh        m7,         [r3 + r5]\r\n    dec         r6d\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n\r\n    paddw       m0,         m1\r\n    paddw       m2,         m3\r\n    paddw       m4,         m5\r\n    paddw       m6,         m7\r\n    packuswb    m0,         m0\r\n    packuswb    m2,         m2\r\n    packuswb    m4,         m4\r\n    
packuswb    m6,         m6\r\n\r\n    movd        [r0],       m0\r\n    movd        [r0 + r1],  m2\r\n    lea         r0,         [r0 + r1 * 2]\r\n    movd        [r0],       m4\r\n    movd        [r0 + r1],  m6\r\n    lea         r0,         [r0 + r1 * 2]\r\n\r\n    jnz         .loop\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\nPIXEL_ADD_PS_W4_H4   4,  8\r\nPIXEL_ADD_PS_W4_H4   4, 16\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void pixel_add_ps_8x%2(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)\r\n;-----------------------------------------------------------------------------\r\n%macro PIXEL_ADD_PS_W8_H4 2\r\n%if HIGH_BIT_DEPTH\r\nINIT_XMM sse2\r\ncglobal pixel_add_ps_8x%2, 6, 7, 6, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mova    m5,     [pw_pixel_max]\r\n    pxor    m4,     m4\r\n    mov     r6d,    %2/4\r\n    add     r4,     r4\r\n    add     r5,     r5\r\n    add     r1,     r1\r\n.loop:\r\n    movu    m0,     [r2]\r\n    movu    m2,     [r2 + r4]\r\n    movu    m1,     [r3]\r\n    movu    m3,     [r3 + r5]\r\n    lea     r2,     [r2 + r4 * 2]\r\n    lea     r3,     [r3 + r5 * 2]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0],       m0\r\n    movu    [r0 + r1],  m2\r\n\r\n    movu    m0,     [r2]\r\n    movu    m2,     [r2 + r4]\r\n    movu    m1,     [r3]\r\n    movu    m3,     [r3 + r5]\r\n    dec     r6d\r\n    lea     r0,     [r0 + r1 * 2]\r\n    lea     r2,     [r2 + r4 * 2]\r\n    lea     r3,     [r3 + r5 * 2]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0],       m0\r\n    movu    [r0 + r1],  m2\r\n    lea     r0,     [r0 + r1 * 2]\r\n\r\n    jnz     .loop\r\n    RET\r\n%else\r\nINIT_XMM sse4\r\ncglobal pixel_add_ps_8x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mov         r6d,        
%2/4\r\n    add         r5,         r5\r\n.loop:\r\n    pmovzxbw    m0,         [r2]\r\n    pmovzxbw    m2,         [r2 + r4]\r\n    movu        m1,         [r3]\r\n    movu        m3,         [r3 + r5]\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n    pmovzxbw    m4,         [r2]\r\n    pmovzxbw    m6,         [r2 + r4]\r\n    movu        m5,         [r3]\r\n    movu        m7,         [r3 + r5]\r\n    dec         r6d\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n\r\n    paddw       m0,         m1\r\n    paddw       m2,         m3\r\n    paddw       m4,         m5\r\n    paddw       m6,         m7\r\n    packuswb    m0,         m0\r\n    packuswb    m2,         m2\r\n    packuswb    m4,         m4\r\n    packuswb    m6,         m6\r\n\r\n    movh        [r0],       m0\r\n    movh        [r0 + r1],  m2\r\n    lea         r0,         [r0 + r1 * 2]\r\n    movh        [r0],       m4\r\n    movh        [r0 + r1],  m6\r\n    lea         r0,         [r0 + r1 * 2]\r\n\r\n    jnz         .loop\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\nPIXEL_ADD_PS_W8_H4 8,  4\r\nPIXEL_ADD_PS_W8_H4 8,  8\r\nPIXEL_ADD_PS_W8_H4 8, 16\r\nPIXEL_ADD_PS_W8_H4 8, 32\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void pixel_add_ps_16x%2(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)\r\n;-----------------------------------------------------------------------------\r\n%macro PIXEL_ADD_PS_W16_H4 2\r\n%if HIGH_BIT_DEPTH\r\nINIT_XMM sse2\r\ncglobal pixel_add_ps_16x%2, 6, 7, 6, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mova    m5,     [pw_pixel_max]\r\n    pxor    m4,     m4\r\n    mov     r6d,    %2/4\r\n    add     r4,     r4\r\n    add     r5,     r5\r\n    add     r1,     r1\r\n.loop:\r\n    movu    m0,     [r2]\r\n    movu    m2,     [r2 + 16]\r\n    movu    m1,     [r3]\r\n    movu    m3,   
  [r3 + 16]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0],       m0\r\n    movu    [r0 + 16],  m2\r\n\r\n    movu    m0,     [r2 + r4]\r\n    movu    m2,     [r2 + r4 + 16]\r\n    movu    m1,     [r3 + r5]\r\n    movu    m3,     [r3 + r5 + 16]\r\n    lea     r2,     [r2 + r4 * 2]\r\n    lea     r3,     [r3 + r5 * 2]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1],      m0\r\n    movu    [r0 + r1 + 16], m2\r\n\r\n    movu    m0,     [r2]\r\n    movu    m2,     [r2 + 16]\r\n    movu    m1,     [r3]\r\n    movu    m3,     [r3 + 16]\r\n    lea     r0,     [r0 + r1 * 2]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0],       m0\r\n    movu    [r0 + 16],  m2\r\n\r\n    movu    m0,     [r2 + r4]\r\n    movu    m2,     [r2 + r4 + 16]\r\n    movu    m1,     [r3 + r5]\r\n    movu    m3,     [r3 + r5 + 16]\r\n    dec     r6d\r\n    lea     r2,     [r2 + r4 * 2]\r\n    lea     r3,     [r3 + r5 * 2]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1],      m0\r\n    movu    [r0 + r1 + 16], m2\r\n    lea     r0,     [r0 + r1 * 2]\r\n\r\n    jnz     .loop\r\n    RET\r\n%else\r\nINIT_XMM sse4\r\ncglobal pixel_add_ps_16x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mov         r6d,        %2/4\r\n    add         r5,         r5\r\n.loop:\r\n    pmovzxbw    m0,         [r2]\r\n    pmovzxbw    m1,         [r2 + 8]\r\n    pmovzxbw    m4,         [r2 + r4]\r\n    pmovzxbw    m5,         [r2 + r4 + 8]\r\n    movu        m2,         [r3]\r\n    movu        m3,         [r3 + 16]\r\n    movu        m6,         [r3 + r5]\r\n    movu        m7,         [r3 + r5 + 16]\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n\r\n    paddw       m0,         m2\r\n    paddw       m1,    
     m3\r\n    paddw       m4,         m6\r\n    paddw       m5,         m7\r\n    packuswb    m0,         m1\r\n    packuswb    m4,         m5\r\n\r\n    movu        [r0],       m0\r\n    movu        [r0 + r1],  m4\r\n\r\n    pmovzxbw    m0,         [r2]\r\n    pmovzxbw    m1,         [r2 + 8]\r\n    pmovzxbw    m4,         [r2 + r4]\r\n    pmovzxbw    m5,         [r2 + r4 + 8]\r\n    movu        m2,         [r3]\r\n    movu        m3,         [r3 + 16]\r\n    movu        m6,         [r3 + r5]\r\n    movu        m7,         [r3 + r5 + 16]\r\n    dec         r6d\r\n    lea         r0,         [r0 + r1 * 2]\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n\r\n    paddw       m0,         m2\r\n    paddw       m1,         m3\r\n    paddw       m4,         m6\r\n    paddw       m5,         m7\r\n    packuswb    m0,         m1\r\n    packuswb    m4,         m5\r\n\r\n    movu        [r0],       m0\r\n    movu        [r0 + r1],  m4\r\n    lea         r0,         [r0 + r1 * 2]\r\n\r\n    jnz         .loop\r\n    RET\r\n%endif\r\n%endmacro\r\nPIXEL_ADD_PS_W16_H4 16,  4\r\nPIXEL_ADD_PS_W16_H4 16,  8\r\nPIXEL_ADD_PS_W16_H4 16, 12\r\nPIXEL_ADD_PS_W16_H4 16, 16\r\nPIXEL_ADD_PS_W16_H4 16, 32\r\nPIXEL_ADD_PS_W16_H4 16, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void pixel_add_ps_16x16(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)\r\n;-----------------------------------------------------------------------------\r\n%macro PIXEL_ADD_PS_W16_H4_avx2 1\r\n%if HIGH_BIT_DEPTH\r\n%if ARCH_X86_64\r\nINIT_YMM avx2\r\ncglobal pixel_add_ps_16x%1, 6, 10, 4, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mova    m3,     [pw_pixel_max]\r\n    pxor    m2,     m2\r\n    mov     r6d,    %1/4\r\n    add     r4d,    r4d\r\n    add     r5d,    r5d\r\n    add     r1d,    r1d\r\n    lea     r7,     [r4 * 3]\r\n    lea     r8,     [r5 * 3]\r\n  
  lea     r9,     [r1 * 3]\r\n\r\n.loop:\r\n    movu    m0,     [r2]\r\n    movu    m1,     [r3]\r\n    paddw   m0,     m1\r\n    CLIPW   m0, m2, m3\r\n    movu    [r0],              m0\r\n\r\n    movu    m0,     [r2 + r4]\r\n    movu    m1,     [r3 + r5]\r\n    paddw   m0,     m1\r\n    CLIPW   m0, m2, m3\r\n    movu    [r0 + r1],         m0\r\n\r\n    movu    m0,     [r2 + r4 * 2]\r\n    movu    m1,     [r3 + r5 * 2]\r\n    paddw   m0,     m1\r\n    CLIPW   m0, m2, m3\r\n    movu    [r0 + r1 * 2],     m0\r\n\r\n    movu    m0,     [r2 + r7]\r\n    movu    m1,     [r3 + r8]\r\n    paddw   m0,     m1\r\n    CLIPW   m0, m2, m3\r\n    movu    [r0 + r9],         m0\r\n\r\n    dec     r6d\r\n    lea     r0,     [r0 + r1 * 4]\r\n    lea     r2,     [r2 + r4 * 4]\r\n    lea     r3,     [r3 + r5 * 4]\r\n    jnz     .loop\r\n    RET\r\n%endif\r\n%else\r\nINIT_YMM avx2\r\ncglobal pixel_add_ps_16x%1, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mov         r6d,        %1/4\r\n    add         r5,         r5\r\n.loop:\r\n\r\n    pmovzxbw    m0,         [r2]        ; row 0 of src0\r\n    pmovzxbw    m1,         [r2 + r4]   ; row 1 of src0\r\n    movu        m2,        [r3]        ; row 0 of src1\r\n    movu        m3,        [r3 + r5]   ; row 1 of src1\r\n    paddw       m0,         m2\r\n    paddw       m1,         m3\r\n    packuswb    m0,         m1\r\n\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n\r\n    pmovzxbw    m2,         [r2]        ; row 2 of src0\r\n    pmovzxbw    m3,         [r2 + r4]   ; row 3 of src0\r\n    movu        m4,        [r3]        ; row 2 of src1\r\n    movu        m5,        [r3 + r5]   ; row 3 of src1\r\n    paddw       m2,         m4\r\n    paddw       m3,         m5\r\n    packuswb    m2,         m3\r\n\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n\r\n    vpermq      m0, m0, 11011000b\r\n    movu        [r0],      xm0           ; 
row 0 of dst\r\n    vextracti128 xm3, m0, 1\r\n    movu        [r0 + r1], xm3           ; row 1 of dst\r\n\r\n    lea         r0,         [r0 + r1 * 2]\r\n    vpermq      m2, m2, 11011000b\r\n    movu        [r0],      xm2           ; row 2 of dst\r\n    vextracti128 xm3, m2, 1\r\n    movu         [r0 + r1], xm3          ; row 3 of dst\r\n\r\n    lea         r0,         [r0 + r1 * 2]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n\r\n    RET\r\n%endif\r\n%endmacro\r\n\r\nPIXEL_ADD_PS_W16_H4_avx2  4\r\nPIXEL_ADD_PS_W16_H4_avx2  8\r\nPIXEL_ADD_PS_W16_H4_avx2 12\r\nPIXEL_ADD_PS_W16_H4_avx2 16\r\nPIXEL_ADD_PS_W16_H4_avx2 32\r\nPIXEL_ADD_PS_W16_H4_avx2 64\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void pixel_add_ps_32x%2(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)\r\n;-----------------------------------------------------------------------------\r\n%macro PIXEL_ADD_PS_W32_H2 2\r\n%if HIGH_BIT_DEPTH\r\nINIT_XMM sse2\r\ncglobal pixel_add_ps_32x%2, 6, 7, 6, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mova    m5,     [pw_pixel_max]\r\n    pxor    m4,     m4\r\n    mov     r6d,    %2/2\r\n    add     r4,     r4\r\n    add     r5,     r5\r\n    add     r1,     r1\r\n.loop:\r\n    movu    m0,     [r2]\r\n    movu    m2,     [r2 + 16]\r\n    movu    m1,     [r3]\r\n    movu    m3,     [r3 + 16]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0],       m0\r\n    movu    [r0 + 16],  m2\r\n\r\n    movu    m0,     [r2 + 32]\r\n    movu    m2,     [r2 + 48]\r\n    movu    m1,     [r3 + 32]\r\n    movu    m3,     [r3 + 48]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + 32],  m0\r\n    movu    [r0 + 48],  m2\r\n\r\n    movu    m0,     [r2 + r4]\r\n    movu    m2,     [r2 + r4 + 16]\r\n    movu    m1,     [r3 + r5]\r\n    movu    m3,    
 [r3 + r5 + 16]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1],      m0\r\n    movu    [r0 + r1 + 16], m2\r\n\r\n    movu    m0,     [r2 + r4 + 32]\r\n    movu    m2,     [r2 + r4 + 48]\r\n    movu    m1,     [r3 + r5 + 32]\r\n    movu    m3,     [r3 + r5 + 48]\r\n    dec     r6d\r\n    lea     r2,     [r2 + r4 * 2]\r\n    lea     r3,     [r3 + r5 * 2]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1 + 32], m0\r\n    movu    [r0 + r1 + 48], m2\r\n    lea     r0,     [r0 + r1 * 2]\r\n\r\n    jnz     .loop\r\n    RET\r\n%else\r\nINIT_XMM sse4\r\ncglobal pixel_add_ps_32x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mov         r6d,        %2/2\r\n    add         r5,         r5\r\n.loop:\r\n    pmovzxbw    m0,         [r2]\r\n    pmovzxbw    m1,         [r2 + 8]\r\n    pmovzxbw    m2,         [r2 + 16]\r\n    pmovzxbw    m3,         [r2 + 24]\r\n    movu        m4,         [r3]\r\n    movu        m5,         [r3 + 16]\r\n    movu        m6,         [r3 + 32]\r\n    movu        m7,         [r3 + 48]\r\n\r\n    paddw       m0,         m4\r\n    paddw       m1,         m5\r\n    paddw       m2,         m6\r\n    paddw       m3,         m7\r\n    packuswb    m0,         m1\r\n    packuswb    m2,         m3\r\n\r\n    movu        [r0],       m0\r\n    movu        [r0 + 16],  m2\r\n\r\n    pmovzxbw    m0,         [r2 + r4]\r\n    pmovzxbw    m1,         [r2 + r4 + 8]\r\n    pmovzxbw    m2,         [r2 + r4 + 16]\r\n    pmovzxbw    m3,         [r2 + r4 + 24]\r\n    movu        m4,         [r3 + r5]\r\n    movu        m5,         [r3 + r5 + 16]\r\n    movu        m6,         [r3 + r5 + 32]\r\n    movu        m7,         [r3 + r5 + 48]\r\n    dec         r6d\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n\r\n    paddw       m0,         m4\r\n    paddw       m1,         m5\r\n   
 paddw       m2,         m6\r\n    paddw       m3,         m7\r\n    packuswb    m0,         m1\r\n    packuswb    m2,         m3\r\n\r\n    movu        [r0 + r1],      m0\r\n    movu        [r0 + r1 + 16], m2\r\n    lea         r0,         [r0 + r1 * 2]\r\n\r\n    jnz         .loop\r\n    RET\r\n%endif\r\n%endmacro\r\nPIXEL_ADD_PS_W32_H2 32,  8\r\nPIXEL_ADD_PS_W32_H2 32, 16\r\nPIXEL_ADD_PS_W32_H2 32, 24\r\nPIXEL_ADD_PS_W32_H2 32, 32\r\nPIXEL_ADD_PS_W32_H2 32, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void pixel_add_ps_32x32(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)\r\n;-----------------------------------------------------------------------------\r\n%macro PIXEL_ADD_PS_W32_H4_avx2 1\r\n%if HIGH_BIT_DEPTH\r\n%if ARCH_X86_64\r\nINIT_YMM avx2\r\ncglobal pixel_add_ps_32x%1, 6, 10, 6, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mova    m5,     [pw_pixel_max]\r\n    pxor    m4,     m4\r\n    mov     r6d,    %1/4\r\n    add     r4d,    r4d\r\n    add     r5d,    r5d\r\n    add     r1d,    r1d\r\n    lea     r7,     [r4 * 3]\r\n    lea     r8,     [r5 * 3]\r\n    lea     r9,     [r1 * 3]\r\n\r\n.loop:\r\n    movu    m0,     [r2]\r\n    movu    m2,     [r2 + 32]\r\n    movu    m1,     [r3]\r\n    movu    m3,     [r3 + 32]\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0],               m0\r\n    movu    [r0 + 32],          m2\r\n\r\n    movu    m0,     [r2 + r4]\r\n    movu    m2,     [r2 + r4 + 32]\r\n    movu    m1,     [r3 + r5]\r\n    movu    m3,     [r3 + r5 + 32]\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1],          m0\r\n    movu    [r0 + r1 + 32],     m2\r\n\r\n    movu    m0,     [r2 + r4 * 2]\r\n    movu    m2,     [r2 + r4 * 2 + 32]\r\n    movu    m1,     [r3 + r5 * 2]\r\n    movu    m3,     [r3 + r5 * 2 + 32]\r\n    
paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1 * 2],      m0\r\n    movu    [r0 + r1 * 2 + 32], m2\r\n\r\n    movu    m0,     [r2 + r7]\r\n    movu    m2,     [r2 + r7 + 32]\r\n    movu    m1,     [r3 + r8]\r\n    movu    m3,     [r3 + r8 + 32]\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r9],          m0\r\n    movu    [r0 + r9 + 32],     m2\r\n\r\n    dec     r6d\r\n    lea     r0,     [r0 + r1 * 4]\r\n    lea     r2,     [r2 + r4 * 4]\r\n    lea     r3,     [r3 + r5 * 4]\r\n    jnz     .loop\r\n    RET\r\n%endif\r\n%else\r\n%if ARCH_X86_64\r\nINIT_YMM avx2\r\ncglobal pixel_add_ps_32x%1, 6, 10, 8, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mov         r6d,        %1/4\r\n    add         r5,         r5\r\n    lea         r7,         [r4 * 3]\r\n    lea         r8,         [r5 * 3]\r\n    lea         r9,         [r1 * 3]\r\n.loop:\r\n    pmovzxbw    m0,         [r2]                ; first half of row 0 of src0\r\n    pmovzxbw    m1,         [r2 + 16]           ; second half of row 0 of src0\r\n    movu        m2,         [r3]                ; first half of row 0 of src1\r\n    movu        m3,         [r3 + 32]           ; second half of row 0 of src1\r\n\r\n    paddw       m0,         m2\r\n    paddw       m1,         m3\r\n    packuswb    m0,         m1\r\n    vpermq      m0, m0, 11011000b\r\n    movu        [r0],      m0                   ; row 0 of dst\r\n\r\n    pmovzxbw    m0,         [r2 + r4]           ; first half of row 1 of src0\r\n    pmovzxbw    m1,         [r2 + r4 + 16]      ; second half of row 1 of src0\r\n    movu        m2,         [r3 + r5]           ; first half of row 1 of src1\r\n    movu        m3,         [r3 + r5 + 32]      ; second half of row 1 of src1\r\n\r\n    paddw       m0,         m2\r\n    paddw       m1,         m3\r\n    packuswb    m0,         m1\r\n    vpermq      m0, m0, 11011000b\r\n    movu     
   [r0 + r1],      m0              ; row 1 of dst\r\n\r\n    pmovzxbw    m0,         [r2 + r4 * 2]       ; first half of row 2 of src0\r\n    pmovzxbw    m1,         [r2 + r4 * 2 + 16]  ; second half of row 2 of src0\r\n    movu        m2,         [r3 + r5 * 2]       ; first half of row 2 of src1\r\n    movu        m3,         [r3 + r5 * 2 + 32]  ; second half of row 2 of src1\r\n\r\n    paddw       m0,         m2\r\n    paddw       m1,         m3\r\n    packuswb    m0,         m1\r\n    vpermq      m0, m0, 11011000b\r\n    movu        [r0 + r1 * 2],      m0          ; row 2 of dst\r\n\r\n    pmovzxbw    m0,         [r2 + r7]           ; first half of row 3 of src0\r\n    pmovzxbw    m1,         [r2 + r7 + 16]      ; second half of row 3 of src0\r\n    movu        m2,         [r3 + r8]           ; first half of row 3 of src1\r\n    movu        m3,         [r3 + r8 + 32]      ; second half of row 3 of src1\r\n\r\n    paddw       m0,         m2\r\n    paddw       m1,         m3\r\n    packuswb    m0,         m1\r\n    vpermq      m0, m0, 11011000b\r\n    movu        [r0 + r9],      m0              ; row 3 of dst\r\n\r\n    lea         r2,         [r2 + r4 * 4]\r\n    lea         r3,         [r3 + r5 * 4]\r\n    lea         r0,         [r0 + r1 * 4]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n    RET\r\n%endif\r\n%endif\r\n%endmacro\r\n\r\nPIXEL_ADD_PS_W32_H4_avx2  8\r\nPIXEL_ADD_PS_W32_H4_avx2 16\r\nPIXEL_ADD_PS_W32_H4_avx2 24\r\nPIXEL_ADD_PS_W32_H4_avx2 32\r\nPIXEL_ADD_PS_W32_H4_avx2 64\r\n\r\n\r\n;-----------------------------------------------------------------------------\r\n; void pixel_add_ps_64x%2(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)\r\n;-----------------------------------------------------------------------------\r\n%macro PIXEL_ADD_PS_W64_H2 2\r\n%if HIGH_BIT_DEPTH\r\nINIT_XMM sse2\r\ncglobal pixel_add_ps_64x%2, 6, 7, 6, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mova  
  m5,     [pw_pixel_max]\r\n    pxor    m4,     m4\r\n    mov     r6d,    %2/2\r\n    add     r4,     r4\r\n    add     r5,     r5\r\n    add     r1,     r1\r\n.loop:\r\n    movu    m0,     [r2]\r\n    movu    m2,     [r2 + 16]\r\n    movu    m1,     [r3]\r\n    movu    m3,     [r3 + 16]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0],       m0\r\n    movu    [r0 + 16],  m2\r\n\r\n    movu    m0,     [r2 + 32]\r\n    movu    m2,     [r2 + 48]\r\n    movu    m1,     [r3 + 32]\r\n    movu    m3,     [r3 + 48]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + 32],  m0\r\n    movu    [r0 + 48],  m2\r\n\r\n    movu    m0,     [r2 + 64]\r\n    movu    m2,     [r2 + 80]\r\n    movu    m1,     [r3 + 64]\r\n    movu    m3,     [r3 + 80]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + 64],  m0\r\n    movu    [r0 + 80],  m2\r\n\r\n    movu    m0,     [r2 + 96]\r\n    movu    m2,     [r2 + 112]\r\n    movu    m1,     [r3 + 96]\r\n    movu    m3,     [r3 + 112]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + 96],  m0\r\n    movu    [r0 + 112], m2\r\n\r\n    movu    m0,     [r2 + r4]\r\n    movu    m2,     [r2 + r4 + 16]\r\n    movu    m1,     [r3 + r5]\r\n    movu    m3,     [r3 + r5 + 16]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1],      m0\r\n    movu    [r0 + r1 + 16], m2\r\n\r\n    movu    m0,     [r2 + r4 + 32]\r\n    movu    m2,     [r2 + r4 + 48]\r\n    movu    m1,     [r3 + r5 + 32]\r\n    movu    m3,     [r3 + r5 + 48]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1 + 32], m0\r\n    movu    [r0 + r1 + 48], m2\r\n\r\n    movu    m0,     [r2 + r4 + 64]\r\n    movu    m2,     [r2 + r4 + 
80]\r\n    movu    m1,     [r3 + r5 + 64]\r\n    movu    m3,     [r3 + r5 + 80]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1 + 64], m0\r\n    movu    [r0 + r1 + 80], m2\r\n\r\n    movu    m0,     [r2 + r4 + 96]\r\n    movu    m2,     [r2 + r4 + 112]\r\n    movu    m1,     [r3 + r5 + 96]\r\n    movu    m3,     [r3 + r5 + 112]\r\n    dec     r6d\r\n    lea     r2,     [r2 + r4 * 2]\r\n    lea     r3,     [r3 + r5 * 2]\r\n\r\n    paddw   m0,     m1\r\n    paddw   m2,     m3\r\n    CLIPW2  m0, m2, m4, m5\r\n\r\n    movu    [r0 + r1 + 96],     m0\r\n    movu    [r0 + r1 + 112],    m2\r\n    lea     r0,     [r0 + r1 * 2]\r\n\r\n    jnz     .loop\r\n    RET\r\n%else\r\nINIT_XMM sse4\r\ncglobal pixel_add_ps_64x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mov         r6d,        %2/2\r\n    add         r5,         r5\r\n.loop:\r\n    pmovzxbw    m0,         [r2]\r\n    pmovzxbw    m1,         [r2 + 8]\r\n    pmovzxbw    m2,         [r2 + 16]\r\n    pmovzxbw    m3,         [r2 + 24]\r\n    movu        m4,         [r3]\r\n    movu        m5,         [r3 + 16]\r\n    movu        m6,         [r3 + 32]\r\n    movu        m7,         [r3 + 48]\r\n\r\n    paddw       m0,         m4\r\n    paddw       m1,         m5\r\n    paddw       m2,         m6\r\n    paddw       m3,         m7\r\n    packuswb    m0,         m1\r\n    packuswb    m2,         m3\r\n\r\n    movu        [r0],       m0\r\n    movu        [r0 + 16],  m2\r\n\r\n    pmovzxbw    m0,         [r2 + 32]\r\n    pmovzxbw    m1,         [r2 + 40]\r\n    pmovzxbw    m2,         [r2 + 48]\r\n    pmovzxbw    m3,         [r2 + 56]\r\n    movu        m4,         [r3 + 64]\r\n    movu        m5,         [r3 + 80]\r\n    movu        m6,         [r3 + 96]\r\n    movu        m7,         [r3 + 112]\r\n\r\n    paddw       m0,         m4\r\n    paddw       m1,         m5\r\n    paddw       m2,         m6\r\n    paddw       m3,         m7\r\n  
  packuswb    m0,         m1\r\n    packuswb    m2,         m3\r\n\r\n    movu        [r0 + 32],  m0\r\n    movu        [r0 + 48],  m2\r\n\r\n    pmovzxbw    m0,         [r2 + r4]\r\n    pmovzxbw    m1,         [r2 + r4 + 8]\r\n    pmovzxbw    m2,         [r2 + r4 + 16]\r\n    pmovzxbw    m3,         [r2 + r4 + 24]\r\n    movu        m4,         [r3 + r5]\r\n    movu        m5,         [r3 + r5 + 16]\r\n    movu        m6,         [r3 + r5 + 32]\r\n    movu        m7,         [r3 + r5 + 48]\r\n\r\n    paddw       m0,         m4\r\n    paddw       m1,         m5\r\n    paddw       m2,         m6\r\n    paddw       m3,         m7\r\n    packuswb    m0,         m1\r\n    packuswb    m2,         m3\r\n\r\n    movu        [r0 + r1],      m0\r\n    movu        [r0 + r1 + 16], m2\r\n\r\n    pmovzxbw    m0,         [r2 + r4 + 32]\r\n    pmovzxbw    m1,         [r2 + r4 + 40]\r\n    pmovzxbw    m2,         [r2 + r4 + 48]\r\n    pmovzxbw    m3,         [r2 + r4 + 56]\r\n    movu        m4,         [r3 + r5 + 64]\r\n    movu        m5,         [r3 + r5 + 80]\r\n    movu        m6,         [r3 + r5 + 96]\r\n    movu        m7,         [r3 + r5 + 112]\r\n    dec         r6d\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n\r\n    paddw       m0,         m4\r\n    paddw       m1,         m5\r\n    paddw       m2,         m6\r\n    paddw       m3,         m7\r\n    packuswb    m0,         m1\r\n    packuswb    m2,         m3\r\n\r\n    movu        [r0 + r1 + 32], m0\r\n    movu        [r0 + r1 + 48], m2\r\n    lea         r0,         [r0 + r1 * 2]\r\n\r\n    jnz         .loop\r\n    RET\r\n%endif\r\n%endmacro\r\nPIXEL_ADD_PS_W64_H2 64, 16\r\nPIXEL_ADD_PS_W64_H2 64, 32\r\nPIXEL_ADD_PS_W64_H2 64, 48\r\nPIXEL_ADD_PS_W64_H2 64, 64\r\n\r\n;-----------------------------------------------------------------------------\r\n; void pixel_add_ps_64x64(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t 
srcStride1)\r\n;-----------------------------------------------------------------------------\r\n%macro PIXEL_ADD_PS_W64H4_avx2 1\r\n%if HIGH_BIT_DEPTH\r\n%if ARCH_X86_64\r\nINIT_YMM avx2\r\ncglobal pixel_add_ps_64x%1, 6, 10, 6, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mova    m5,     [pw_pixel_max]\r\n    pxor    m4,     m4\r\n    mov     r6d,    %1/4\r\n    add     r4d,    r4d\r\n    add     r5d,    r5d\r\n    add     r1d,    r1d\r\n    lea     r7,     [r4 * 3]\r\n    lea     r8,     [r5 * 3]\r\n    lea     r9,     [r1 * 3]\r\n\r\n.loop:\r\n    movu    m0,     [r2]\r\n    movu    m1,     [r2 + 32]\r\n    movu    m2,     [r3]\r\n    movu    m3,     [r3 + 32]\r\n    paddw   m0,     m2\r\n    paddw   m1,     m3\r\n\r\n    CLIPW2  m0, m1, m4, m5\r\n    movu    [r0],                m0\r\n    movu    [r0 + 32],           m1\r\n\r\n    movu    m0,     [r2 + 64]\r\n    movu    m1,     [r2 + 96]\r\n    movu    m2,     [r3 + 64]\r\n    movu    m3,     [r3 + 96]\r\n    paddw   m0,     m2\r\n    paddw   m1,     m3\r\n\r\n    CLIPW2  m0, m1, m4, m5\r\n    movu    [r0 + 64],           m0\r\n    movu    [r0 + 96],           m1\r\n\r\n    movu    m0,     [r2 + r4]\r\n    movu    m1,     [r2 + r4 + 32]\r\n    movu    m2,     [r3 + r5]\r\n    movu    m3,     [r3 + r5 + 32]\r\n    paddw   m0,     m2\r\n    paddw   m1,     m3\r\n\r\n    CLIPW2  m0, m1, m4, m5\r\n    movu    [r0 + r1],           m0\r\n    movu    [r0 + r1 + 32],      m1\r\n\r\n    movu    m0,     [r2 + r4 + 64]\r\n    movu    m1,     [r2 + r4 + 96]\r\n    movu    m2,     [r3 + r5 + 64]\r\n    movu    m3,     [r3 + r5 + 96]\r\n    paddw   m0,     m2\r\n    paddw   m1,     m3\r\n\r\n    CLIPW2  m0, m1, m4, m5\r\n    movu    [r0 + r1 + 64],      m0\r\n    movu    [r0 + r1 + 96],      m1\r\n\r\n    movu    m0,     [r2 + r4 * 2]\r\n    movu    m1,     [r2 + r4 * 2 + 32]\r\n    movu    m2,     [r3 + r5 * 2]\r\n    movu    m3,     [r3 + r5 * 2+ 32]\r\n    paddw   m0,     m2\r\n    paddw   m1,     m3\r\n\r\n 
   CLIPW2  m0, m1, m4, m5\r\n    movu    [r0 + r1 * 2],       m0\r\n    movu    [r0 + r1 * 2 + 32],  m1\r\n\r\n    movu    m0,     [r2 + r4 * 2 + 64]\r\n    movu    m1,     [r2 + r4 * 2 + 96]\r\n    movu    m2,     [r3 + r5 * 2 + 64]\r\n    movu    m3,     [r3 + r5 * 2 + 96]\r\n    paddw   m0,     m2\r\n    paddw   m1,     m3\r\n\r\n    CLIPW2  m0, m1, m4, m5\r\n    movu    [r0 + r1 * 2 + 64],  m0\r\n    movu    [r0 + r1 * 2 + 96],  m1\r\n\r\n    movu    m0,     [r2 + r7]\r\n    movu    m1,     [r2 + r7 + 32]\r\n    movu    m2,     [r3 + r8]\r\n    movu    m3,     [r3 + r8 + 32]\r\n    paddw   m0,     m2\r\n    paddw   m1,     m3\r\n\r\n    CLIPW2  m0, m1, m4, m5\r\n    movu    [r0 + r9],           m0\r\n    movu    [r0 + r9 + 32],      m1\r\n\r\n    movu    m0,     [r2 + r7 + 64]\r\n    movu    m1,     [r2 + r7 + 96]\r\n    movu    m2,     [r3 + r8 + 64]\r\n    movu    m3,     [r3 + r8 + 96]\r\n    paddw   m0,     m2\r\n    paddw   m1,     m3\r\n\r\n    CLIPW2  m0, m1, m4, m5\r\n    movu    [r0 + r9 + 64],      m0\r\n    movu    [r0 + r9 + 96],      m1\r\n\r\n    dec     r6d\r\n    lea     r0,     [r0 + r1 * 4]\r\n    lea     r2,     [r2 + r4 * 4]\r\n    lea     r3,     [r3 + r5 * 4]\r\n    jnz     .loop\r\n    RET\r\n%endif\r\n%else\r\nINIT_YMM avx2\r\ncglobal pixel_add_ps_64x%1, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1\r\n    mov         r6d,        %1/2\r\n    add         r5,         r5\r\n.loop:\r\n    pmovzxbw    m0,         [r2]                ; first 16 of row 0 of src0\r\n    pmovzxbw    m1,         [r2 + 16]           ; second 16 of row 0 of src0\r\n    pmovzxbw    m2,         [r2 + 32]           ; third 16 of row 0 of src0\r\n    pmovzxbw    m3,         [r2 + 48]           ; fourth 16 of row 0 of src0\r\n    movu        m4,         [r3]                ; first 16 of row 0 of src1\r\n    movu        m5,         [r3 + 32]           ; second 16 of row 0 of src1\r\n    movu        m6,         [r3 + 64]           ; third 16 of row 0 of 
src1\r\n    movu        m7,         [r3 + 96]           ; fourth 16 of row 0 of src1\r\n\r\n    paddw       m0,         m4\r\n    paddw       m1,         m5\r\n    paddw       m2,         m6\r\n    paddw       m3,         m7\r\n    packuswb    m0,         m1\r\n    packuswb    m2,         m3\r\n    vpermq      m0, m0, 11011000b\r\n    movu        [r0],      m0                   ; first 32 of row 0 of dst\r\n    vpermq      m2, m2, 11011000b\r\n    movu        [r0 + 32],      m2              ; second 32 of row 0 of dst\r\n\r\n    pmovzxbw    m0,         [r2 + r4]           ; first 16 of row 1 of src0\r\n    pmovzxbw    m1,         [r2 + r4 + 16]      ; second 16 of row 1 of src0\r\n    pmovzxbw    m2,         [r2 + r4 + 32]      ; third 16 of row 1 of src0\r\n    pmovzxbw    m3,         [r2 + r4 + 48]      ; fourth 16 of row 1 of src0\r\n    movu        m4,         [r3 + r5]           ; first 16 of row 1 of src1\r\n    movu        m5,         [r3 + r5 + 32]      ; second 16 of row 1 of src1\r\n    movu        m6,         [r3 + r5 + 64]      ; third 16 of row 1 of src1\r\n    movu        m7,         [r3 + r5 + 96]      ; fourth 16 of row 1 of src1\r\n\r\n    paddw       m0,         m4\r\n    paddw       m1,         m5\r\n    paddw       m2,         m6\r\n    paddw       m3,         m7\r\n    packuswb    m0,         m1\r\n    packuswb    m2,         m3\r\n    vpermq      m0, m0, 11011000b\r\n    movu        [r0 + r1],      m0              ; first 32 of row 1 of dst\r\n    vpermq      m2, m2, 11011000b\r\n    movu        [r0 + r1 + 32],      m2         ; second 32 of row 1 of dst\r\n\r\n    lea         r2,         [r2 + r4 * 2]\r\n    lea         r3,         [r3 + r5 * 2]\r\n    lea         r0,         [r0 + r1 * 2]\r\n\r\n    dec         r6d\r\n    jnz         .loop\r\n    RET\r\n\r\n%endif\r\n%endmacro\r\n\r\nPIXEL_ADD_PS_W64H4_avx2 16\r\nPIXEL_ADD_PS_W64H4_avx2 32\r\nPIXEL_ADD_PS_W64H4_avx2 48\r\nPIXEL_ADD_PS_W64H4_avx2 64\r\n"
  },
  {
    "path": "source/common/x86/quant8.asm",
    "content": ";*****************************************************************************\r\n;* quant8.asm: x86 quantization functions\r\n;*****************************************************************************\r\n;*    xavs2 - video encoder of AVS2/IEEE1857.4 video coding standard\r\n;*    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n;*\r\n;*    Authors: Falei LUO <falei.luo@gmail.com>\r\n;*             Jiaqi Zhang <zhangjiaqi.cs@gmail.com>\r\n;*\r\n;*    Homepage1: http://vcl.idm.pku.edu.cn/xavs2\r\n;*    Homepage2: https://github.com/pkuvcl/xavs2\r\n;*    Homepage3: https://gitee.com/pkuvcl/xavs2\r\n;*\r\n;*    This program is free software; you can redistribute it and/or modify\r\n;*    it under the terms of the GNU General Public License as published by\r\n;*    the Free Software Foundation; either version 2 of the License, or\r\n;*    (at your option) any later version.\r\n;*\r\n;*    This program is distributed in the hope that it will be useful,\r\n;*    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n;*    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  
See the\r\n;*    GNU General Public License for more details.\r\n;*\r\n;*    You should have received a copy of the GNU General Public License\r\n;*    along with this program; if not, write to the Free Software\r\n;*    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n;*\r\n;*    This program is also available under a commercial proprietary license.\r\n;*    For more information, contact us at sswang @ pku.edu.cn.\r\n;*****************************************************************************\r\n\r\n%include \"x86inc.asm\"\r\n%include \"x86util.asm\"\r\n\r\n\r\nSECTION .text\r\n\r\n; ----------------------------------------------------------------------------\r\n; void dequant(coeff_t *coef, const int i_coef, const int scale, const int shift);\r\n; ----------------------------------------------------------------------------\r\n\r\n; ----------------------------------------------------------------------------\r\n; dequant_sse4\r\nINIT_XMM sse4\r\ncglobal dequant, 2,2,7\r\n;{\r\n    mov         r3, r3mp              ; r3  <-- shift\r\n    movq        m4, r2mp              ; m4[0] = scale\r\n    movq        m6, r3                ; m6[0] = shift\r\n    dec         r3                    ; r3d <-- shift - 1\r\n    xor         r2, r2                ; r2 <-- 0\r\n    shr         r1, 4                 ; r1    = i_coef/16\r\n    bts         r2, r3                ; r2 <-- add = 1 << (shift - 1)\r\n    movq        m5, r2                ; m5[0] = add\r\n    pshufd      m4, m4, 0             ; m4[3210] = scale\r\n    pshufd      m5, m5, 0             ; m5[3210] = add\r\n                                      ;\r\n.loop:                                ;\r\n    pmovsxwd    m0, [r0     ]         ; load 4 coeff\r\n    pmovsxwd    m1, [r0 +  8]         ;\r\n    pmovsxwd    m2, [r0 + 16]         ;\r\n    pmovsxwd    m3, [r0 + 24]         ;\r\n                                      ;\r\n    pmulld      m0, m4                ; coef[i] * scale\r\n    pmulld   
   m1, m4                ;\r\n    pmulld      m2, m4                ;\r\n    pmulld      m3, m4                ;\r\n    paddd       m0, m5                ; coef[i] * scale + add\r\n    paddd       m1, m5                ;\r\n    paddd       m2, m5                ;\r\n    paddd       m3, m5                ;\r\n    psrad       m0, m6                ; (coef[i] * scale + add) >> shift\r\n    psrad       m1, m6                ;\r\n    psrad       m2, m6                ;\r\n    psrad       m3, m6                ;\r\n                                      ;\r\n    packssdw    m0, m1                ; pack to 8 coeff\r\n    packssdw    m2, m3                ;\r\n                                      ;\r\n    mova   [r0   ], m0                ; store\r\n    mova   [r0+16], m2                ;\r\n    add         r0, 32                ;\r\n    dec         r1                    ;\r\n    jnz        .loop                  ;\r\n                                      ;\r\n    RET                               ; return\r\n;}\r\n"
  },
  {
    "path": "source/common/x86/x86inc.asm",
    "content": ";*****************************************************************************\r\n;* x86inc.asm: x264asm abstraction layer\r\n;*****************************************************************************\r\n;* Copyright (C) 2005-2014 x264 project\r\n;*               2013-2014 x265 project\r\n;*\r\n;* Authors: Loren Merritt <lorenm@u.washington.edu>\r\n;*          Anton Mitrofanov <BugMaster@narod.ru>\r\n;*          Fiona Glaser <fiona@x264.com>\r\n;*          Henrik Gramner <henrik@gramner.com>\r\n;*          Min Chen <chenm003@163.com>\r\n;*\r\n;* Permission to use, copy, modify, and/or distribute this software for any\r\n;* purpose with or without fee is hereby granted, provided that the above\r\n;* copyright notice and this permission notice appear in all copies.\r\n;*\r\n;* THE SOFTWARE IS PROVIDED \"AS IS\" AND THE AUTHOR DISCLAIMS ALL WARRANTIES\r\n;* WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF\r\n;* MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR\r\n;* ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES\r\n;* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN\r\n;* ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF\r\n;* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.\r\n;*****************************************************************************\r\n\r\n; This is a header file for the x264ASM assembly language, which uses\r\n; NASM/YASM syntax combined with a large number of macros to provide easy\r\n; abstraction between different calling conventions (x86_32, win64, linux64).\r\n; It also has various other useful features to simplify writing the kind of\r\n; DSP functions that are most often used in x264.\r\n\r\n; Unlike the rest of x264, this file is available under an ISC license, as it\r\n; has significant usefulness outside of x264 and we want it to be available\r\n; to the largest audience possible.  
Of course, if you modify it for your own\r\n; purposes to add a new feature, we strongly encourage contributing a patch\r\n; as this feature might be useful for others as well.  Send patches or ideas\r\n; to x264-devel@videolan.org .\r\n\r\n%ifndef private_prefix\r\n    %define private_prefix davs2\r\n%endif\r\n\r\n%ifndef public_prefix\r\n    %define public_prefix private_prefix\r\n%endif\r\n\r\n%ifndef STACK_ALIGNMENT\r\n    %if ARCH_X86_64\r\n        %define STACK_ALIGNMENT 16\r\n    %else\r\n        %define STACK_ALIGNMENT 4\r\n    %endif\r\n%endif\r\n\r\n%define WIN64  0\r\n%define UNIX64 0\r\n%if ARCH_X86_64\r\n    %ifidn __OUTPUT_FORMAT__,win32\r\n        %define WIN64  1\r\n    %elifidn __OUTPUT_FORMAT__,win64\r\n        %define WIN64  1\r\n    %elifidn __OUTPUT_FORMAT__,x64\r\n        %define WIN64  1\r\n    %else\r\n        %define UNIX64 1\r\n    %endif\r\n%endif\r\n\r\n%define FORMAT_ELF 0\r\n%ifidn __OUTPUT_FORMAT__,elf\r\n    %define FORMAT_ELF 1\r\n%elifidn __OUTPUT_FORMAT__,elf32\r\n    %define FORMAT_ELF 1\r\n%elifidn __OUTPUT_FORMAT__,elf64\r\n    %define FORMAT_ELF 1\r\n%endif\r\n\r\n%ifdef PREFIX\r\n    %define mangle(x) _ %+ x\r\n%else\r\n    %define mangle(x) x\r\n%endif\r\n\r\n%macro SECTION_RODATA 0-1 32\r\n    SECTION .rodata align=%1\r\n%endmacro\r\n\r\n%macro SECTION_TEXT 0-1 16\r\n    SECTION .text align=%1\r\n%endmacro\r\n\r\n%if WIN64\r\n    %define PIC\r\n%elif ARCH_X86_64 == 0\r\n; x86_32 doesn't require PIC.\r\n; Some distros prefer shared objects to be PIC, but nothing breaks if\r\n; the code contains a few textrels, so we'll skip that complexity.\r\n    %undef PIC\r\n%endif\r\n%ifdef PIC\r\n    default rel\r\n%endif\r\n\r\n%ifdef __NASM_VER__\r\n    %use smartalign\r\n%endif\r\n\r\n; Macros to eliminate most code duplication between x86_32 and x86_64:\r\n; Currently this works only for leaf functions which load all their arguments\r\n; into registers at the start, and make no other use of the stack. 
Luckily that\r\n; covers most of x264's asm.\r\n\r\n; PROLOGUE:\r\n; %1 = number of arguments. loads them from stack if needed.\r\n; %2 = number of registers used. pushes callee-saved regs if needed.\r\n; %3 = number of xmm registers used. pushes callee-saved xmm regs if needed.\r\n; %4 = (optional) stack size to be allocated. The stack will be aligned before\r\n;      allocating the specified stack size. If the required stack alignment is\r\n;      larger than the known stack alignment the stack will be manually aligned\r\n;      and an extra register will be allocated to hold the original stack\r\n;      pointer (to not invalidate r0m etc.). To prevent the use of an extra\r\n;      register as stack pointer, request a negative stack size.\r\n; %4+/%5+ = list of names to define to registers\r\n; PROLOGUE can also be invoked by adding the same options to cglobal\r\n\r\n; e.g.\r\n; cglobal foo, 2,3,7,0x40, dst, src, tmp\r\n; declares a function (foo) that automatically loads two arguments (dst and\r\n; src) into registers, uses one additional register (tmp) plus 7 vector\r\n; registers (m0-m6) and allocates 0x40 bytes of stack space.\r\n\r\n; TODO Some functions can use some args directly from the stack. 
If they're the\r\n; last args then you can just not declare them, but if they're in the middle\r\n; we need more flexible macro.\r\n\r\n; RET:\r\n; Pops anything that was pushed by PROLOGUE, and returns.\r\n\r\n; REP_RET:\r\n; Use this instead of RET if it's a branch target.\r\n\r\n; registers:\r\n; rN and rNq are the native-size register holding function argument N\r\n; rNd, rNw, rNb are dword, word, and byte size\r\n; rNh is the high 8 bits of the word size\r\n; rNm is the original location of arg N (a register or on the stack), dword\r\n; rNmp is native size\r\n\r\n%macro DECLARE_REG 2-3\r\n    %define r%1q %2\r\n    %define r%1d %2d\r\n    %define r%1w %2w\r\n    %define r%1b %2b\r\n    %define r%1h %2h\r\n    %if %0 == 2\r\n        %define r%1m  %2d\r\n        %define r%1mp %2\r\n    %elif ARCH_X86_64 ; memory\r\n        %define r%1m [rstk + stack_offset + %3]\r\n        %define r%1mp qword r %+ %1 %+ m\r\n    %else\r\n        %define r%1m [rstk + stack_offset + %3]\r\n        %define r%1mp dword r %+ %1 %+ m\r\n    %endif\r\n    %define r%1  %2\r\n%endmacro\r\n\r\n%macro DECLARE_REG_SIZE 3\r\n    %define r%1q r%1\r\n    %define e%1q r%1\r\n    %define r%1d e%1\r\n    %define e%1d e%1\r\n    %define r%1w %1\r\n    %define e%1w %1\r\n    %define r%1h %3\r\n    %define e%1h %3\r\n    %define r%1b %2\r\n    %define e%1b %2\r\n%if ARCH_X86_64 == 0\r\n    %define r%1  e%1\r\n%endif\r\n%endmacro\r\n\r\nDECLARE_REG_SIZE ax, al, ah\r\nDECLARE_REG_SIZE bx, bl, bh\r\nDECLARE_REG_SIZE cx, cl, ch\r\nDECLARE_REG_SIZE dx, dl, dh\r\nDECLARE_REG_SIZE si, sil, null\r\nDECLARE_REG_SIZE di, dil, null\r\nDECLARE_REG_SIZE bp, bpl, null\r\n\r\n; t# defines for when per-arch register allocation is more complex than just function arguments\r\n\r\n%macro DECLARE_REG_TMP 1-*\r\n    %assign %%i 0\r\n    %rep %0\r\n        CAT_XDEFINE t, %%i, r%1\r\n        %assign %%i %%i+1\r\n        %rotate 1\r\n    %endrep\r\n%endmacro\r\n\r\n%macro DECLARE_REG_TMP_SIZE 0-*\r\n    %rep %0\r\n        
%define t%1q t%1 %+ q\r\n        %define t%1d t%1 %+ d\r\n        %define t%1w t%1 %+ w\r\n        %define t%1h t%1 %+ h\r\n        %define t%1b t%1 %+ b\r\n        %rotate 1\r\n    %endrep\r\n%endmacro\r\n\r\nDECLARE_REG_TMP_SIZE 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14\r\n\r\n%if ARCH_X86_64\r\n    %define gprsize 8\r\n%else\r\n    %define gprsize 4\r\n%endif\r\n\r\n%macro PUSH 1\r\n    push %1\r\n    %ifidn rstk, rsp\r\n        %assign stack_offset stack_offset+gprsize\r\n    %endif\r\n%endmacro\r\n\r\n%macro POP 1\r\n    pop %1\r\n    %ifidn rstk, rsp\r\n        %assign stack_offset stack_offset-gprsize\r\n    %endif\r\n%endmacro\r\n\r\n%macro PUSH_IF_USED 1-*\r\n    %rep %0\r\n        %if %1 < regs_used\r\n            PUSH r%1\r\n        %endif\r\n        %rotate 1\r\n    %endrep\r\n%endmacro\r\n\r\n%macro POP_IF_USED 1-*\r\n    %rep %0\r\n        %if %1 < regs_used\r\n            pop r%1\r\n        %endif\r\n        %rotate 1\r\n    %endrep\r\n%endmacro\r\n\r\n%macro LOAD_IF_USED 1-*\r\n    %rep %0\r\n        %if %1 < num_args\r\n            mov r%1, r %+ %1 %+ mp\r\n        %endif\r\n        %rotate 1\r\n    %endrep\r\n%endmacro\r\n\r\n%macro SUB 2\r\n    sub %1, %2\r\n    %ifidn %1, rstk\r\n        %assign stack_offset stack_offset+(%2)\r\n    %endif\r\n%endmacro\r\n\r\n%macro ADD 2\r\n    add %1, %2\r\n    %ifidn %1, rstk\r\n        %assign stack_offset stack_offset-(%2)\r\n    %endif\r\n%endmacro\r\n\r\n%macro movifnidn 2\r\n    %ifnidn %1, %2\r\n        mov %1, %2\r\n    %endif\r\n%endmacro\r\n\r\n%macro movsxdifnidn 2\r\n    %ifnidn %1, %2\r\n        movsxd %1, %2\r\n    %endif\r\n%endmacro\r\n\r\n%macro ASSERT 1\r\n    %if (%1) == 0\r\n        %error assert failed\r\n    %endif\r\n%endmacro\r\n\r\n%macro DEFINE_ARGS 0-*\r\n    %ifdef n_arg_names\r\n        %assign %%i 0\r\n        %rep n_arg_names\r\n            CAT_UNDEF arg_name %+ %%i, q\r\n            CAT_UNDEF arg_name %+ %%i, d\r\n            CAT_UNDEF arg_name %+ %%i, w\r\n            CAT_UNDEF 
arg_name %+ %%i, h\r\n            CAT_UNDEF arg_name %+ %%i, b\r\n            CAT_UNDEF arg_name %+ %%i, m\r\n            CAT_UNDEF arg_name %+ %%i, mp\r\n            CAT_UNDEF arg_name, %%i\r\n            %assign %%i %%i+1\r\n        %endrep\r\n    %endif\r\n\r\n    %xdefine %%stack_offset stack_offset\r\n    %undef stack_offset ; so that the current value of stack_offset doesn't get baked in by xdefine\r\n    %assign %%i 0\r\n    %rep %0\r\n        %xdefine %1q r %+ %%i %+ q\r\n        %xdefine %1d r %+ %%i %+ d\r\n        %xdefine %1w r %+ %%i %+ w\r\n        %xdefine %1h r %+ %%i %+ h\r\n        %xdefine %1b r %+ %%i %+ b\r\n        %xdefine %1m r %+ %%i %+ m\r\n        %xdefine %1mp r %+ %%i %+ mp\r\n        CAT_XDEFINE arg_name, %%i, %1\r\n        %assign %%i %%i+1\r\n        %rotate 1\r\n    %endrep\r\n    %xdefine stack_offset %%stack_offset\r\n    %assign n_arg_names %0\r\n%endmacro\r\n\r\n%define required_stack_alignment ((mmsize + 15) & ~15)\r\n\r\n%macro ALLOC_STACK 1-2 0 ; stack_size, n_xmm_regs (for win64 only)\r\n    %ifnum %1\r\n        %if %1 != 0\r\n            %assign %%pad 0\r\n            %assign stack_size %1\r\n            %if stack_size < 0\r\n                %assign stack_size -stack_size\r\n            %endif\r\n            %if WIN64\r\n                %assign %%pad %%pad + 32 ; shadow space\r\n                %if mmsize != 8\r\n                    %assign xmm_regs_used %2\r\n                    %if xmm_regs_used > 8\r\n                        %assign %%pad %%pad + (xmm_regs_used-8)*16 ; callee-saved xmm registers\r\n                    %endif\r\n                %endif\r\n            %endif\r\n            %if required_stack_alignment <= STACK_ALIGNMENT\r\n                ; maintain the current stack alignment\r\n                %assign stack_size_padded stack_size + %%pad + ((-%%pad-stack_offset-gprsize) & (STACK_ALIGNMENT-1))\r\n                SUB rsp, stack_size_padded\r\n            %else\r\n                %assign %%reg_num (regs_used 
- 1)\r\n                %xdefine rstk r %+ %%reg_num\r\n                ; align stack, and save original stack location directly above\r\n                ; it, i.e. in [rsp+stack_size_padded], so we can restore the\r\n                ; stack in a single instruction (i.e. mov rsp, rstk or mov\r\n                ; rsp, [rsp+stack_size_padded])\r\n                %if %1 < 0 ; need to store rsp on stack\r\n                    %xdefine rstkm [rsp + stack_size + %%pad]\r\n                    %assign %%pad %%pad + gprsize\r\n                %else ; can keep rsp in rstk during whole function\r\n                    %xdefine rstkm rstk\r\n                %endif\r\n                %assign stack_size_padded stack_size + ((%%pad + required_stack_alignment-1) & ~(required_stack_alignment-1))\r\n                mov rstk, rsp\r\n                and rsp, ~(required_stack_alignment-1)\r\n                sub rsp, stack_size_padded\r\n                movifnidn rstkm, rstk\r\n            %endif\r\n            WIN64_PUSH_XMM\r\n        %endif\r\n    %endif\r\n%endmacro\r\n\r\n%macro SETUP_STACK_POINTER 1\r\n    %ifnum %1\r\n        %if %1 != 0 && required_stack_alignment > STACK_ALIGNMENT\r\n            %if %1 > 0\r\n                %assign regs_used (regs_used + 1)\r\n            %elif ARCH_X86_64 && regs_used == num_args && num_args <= 4 + UNIX64 * 2\r\n                %warning \"Stack pointer will overwrite register argument\"\r\n            %endif\r\n        %endif\r\n    %endif\r\n%endmacro\r\n\r\n%macro DEFINE_ARGS_INTERNAL 3+\r\n    %ifnum %2\r\n        DEFINE_ARGS %3\r\n    %elif %1 == 4\r\n        DEFINE_ARGS %2\r\n    %elif %1 > 4\r\n        DEFINE_ARGS %2, %3\r\n    %endif\r\n%endmacro\r\n\r\n%if WIN64 ; Windows x64 ;=================================================\r\n\r\nDECLARE_REG 0,  rcx\r\nDECLARE_REG 1,  rdx\r\nDECLARE_REG 2,  R8\r\nDECLARE_REG 3,  R9\r\nDECLARE_REG 4,  R10, 40\r\nDECLARE_REG 5,  R11, 48\r\nDECLARE_REG 6,  rax, 56\r\nDECLARE_REG 7,  rdi, 
64\r\nDECLARE_REG 8,  rsi, 72\r\nDECLARE_REG 9,  rbx, 80\r\nDECLARE_REG 10, rbp, 88\r\nDECLARE_REG 11, R12, 96\r\nDECLARE_REG 12, R13, 104\r\nDECLARE_REG 13, R14, 112\r\nDECLARE_REG 14, R15, 120\r\n\r\n%macro PROLOGUE 2-5+ 0 ; #args, #regs, #xmm_regs, [stack_size,] arg_names...\r\n    %assign num_args %1\r\n    %assign regs_used %2\r\n    ASSERT regs_used >= num_args\r\n    SETUP_STACK_POINTER %4\r\n    ASSERT regs_used <= 15\r\n    PUSH_IF_USED 7, 8, 9, 10, 11, 12, 13, 14\r\n    ALLOC_STACK %4, %3\r\n    %if mmsize != 8 && stack_size == 0\r\n        WIN64_SPILL_XMM %3\r\n    %endif\r\n    LOAD_IF_USED 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14\r\n    DEFINE_ARGS_INTERNAL %0, %4, %5\r\n%endmacro\r\n\r\n%macro WIN64_PUSH_XMM 0\r\n    ; Use the shadow space to store XMM6 and XMM7, the rest needs stack space allocated.\r\n    %if xmm_regs_used > 6\r\n        movaps [rstk + stack_offset +  8], xmm6\r\n    %endif\r\n    %if xmm_regs_used > 7\r\n        movaps [rstk + stack_offset + 24], xmm7\r\n    %endif\r\n    %if xmm_regs_used > 8\r\n        %assign %%i 8\r\n        %rep xmm_regs_used-8\r\n            movaps [rsp + (%%i-8)*16 + stack_size + 32], xmm %+ %%i\r\n            %assign %%i %%i+1\r\n        %endrep\r\n    %endif\r\n%endmacro\r\n\r\n%macro WIN64_SPILL_XMM 1\r\n    %assign xmm_regs_used %1\r\n    ASSERT xmm_regs_used <= 16\r\n    %if xmm_regs_used > 8\r\n        ; Allocate stack space for callee-saved xmm registers plus shadow space and align the stack.\r\n        %assign %%pad (xmm_regs_used-8)*16 + 32\r\n        %assign stack_size_padded %%pad + ((-%%pad-stack_offset-gprsize) & (STACK_ALIGNMENT-1))\r\n        SUB rsp, stack_size_padded\r\n    %endif\r\n    WIN64_PUSH_XMM\r\n%endmacro\r\n\r\n%macro WIN64_RESTORE_XMM_INTERNAL 1\r\n    %assign %%pad_size 0\r\n    %if xmm_regs_used > 8\r\n        %assign %%i xmm_regs_used\r\n        %rep xmm_regs_used-8\r\n            %assign %%i %%i-1\r\n            movaps xmm %+ %%i, [%1 + (%%i-8)*16 + stack_size + 32]\r\n        
%endrep\r\n    %endif\r\n    %if stack_size_padded > 0\r\n        %if stack_size > 0 && required_stack_alignment > STACK_ALIGNMENT\r\n            mov rsp, rstkm\r\n        %else\r\n            add %1, stack_size_padded\r\n            %assign %%pad_size stack_size_padded\r\n        %endif\r\n    %endif\r\n    %if xmm_regs_used > 7\r\n        movaps xmm7, [%1 + stack_offset - %%pad_size + 24]\r\n    %endif\r\n    %if xmm_regs_used > 6\r\n        movaps xmm6, [%1 + stack_offset - %%pad_size +  8]\r\n    %endif\r\n%endmacro\r\n\r\n%macro WIN64_RESTORE_XMM 1\r\n    WIN64_RESTORE_XMM_INTERNAL %1\r\n    %assign stack_offset (stack_offset-stack_size_padded)\r\n    %assign xmm_regs_used 0\r\n%endmacro\r\n\r\n%define has_epilogue regs_used > 7 || xmm_regs_used > 6 || mmsize == 32 || stack_size > 0\r\n\r\n%macro RET 0\r\n    WIN64_RESTORE_XMM_INTERNAL rsp\r\n    POP_IF_USED 14, 13, 12, 11, 10, 9, 8, 7\r\n%if mmsize == 32\r\n    vzeroupper\r\n%endif\r\n    AUTO_REP_RET\r\n%endmacro\r\n\r\n%elif ARCH_X86_64 ; *nix x64 ;=============================================\r\n\r\nDECLARE_REG 0,  rdi\r\nDECLARE_REG 1,  rsi\r\nDECLARE_REG 2,  rdx\r\nDECLARE_REG 3,  rcx\r\nDECLARE_REG 4,  R8\r\nDECLARE_REG 5,  R9\r\nDECLARE_REG 6,  rax, 8\r\nDECLARE_REG 7,  R10, 16\r\nDECLARE_REG 8,  R11, 24\r\nDECLARE_REG 9,  rbx, 32\r\nDECLARE_REG 10, rbp, 40\r\nDECLARE_REG 11, R12, 48\r\nDECLARE_REG 12, R13, 56\r\nDECLARE_REG 13, R14, 64\r\nDECLARE_REG 14, R15, 72\r\n\r\n%macro PROLOGUE 2-5+ ; #args, #regs, #xmm_regs, [stack_size,] arg_names...\r\n    %assign num_args %1\r\n    %assign regs_used %2\r\n    ASSERT regs_used >= num_args\r\n    SETUP_STACK_POINTER %4\r\n    ASSERT regs_used <= 15\r\n    PUSH_IF_USED 9, 10, 11, 12, 13, 14\r\n    ALLOC_STACK %4\r\n    LOAD_IF_USED 6, 7, 8, 9, 10, 11, 12, 13, 14\r\n    DEFINE_ARGS_INTERNAL %0, %4, %5\r\n%endmacro\r\n\r\n%define has_epilogue regs_used > 9 || mmsize == 32 || stack_size > 0\r\n\r\n%macro RET 0\r\n%if stack_size_padded > 0\r\n%if 
required_stack_alignment > STACK_ALIGNMENT\r\n    mov rsp, rstkm\r\n%else\r\n    add rsp, stack_size_padded\r\n%endif\r\n%endif\r\n    POP_IF_USED 14, 13, 12, 11, 10, 9\r\n%if mmsize == 32\r\n    vzeroupper\r\n%endif\r\n    AUTO_REP_RET\r\n%endmacro\r\n\r\n%else ; X86_32 ;==============================================================\r\n\r\nDECLARE_REG 0, eax, 4\r\nDECLARE_REG 1, ecx, 8\r\nDECLARE_REG 2, edx, 12\r\nDECLARE_REG 3, ebx, 16\r\nDECLARE_REG 4, esi, 20\r\nDECLARE_REG 5, edi, 24\r\nDECLARE_REG 6, ebp, 28\r\n%define rsp esp\r\n\r\n%macro DECLARE_ARG 1-*\r\n    %rep %0\r\n        %define r%1m [rstk + stack_offset + 4*%1 + 4]\r\n        %define r%1mp dword r%1m\r\n        %rotate 1\r\n    %endrep\r\n%endmacro\r\n\r\nDECLARE_ARG 7, 8, 9, 10, 11, 12, 13, 14\r\n\r\n%macro PROLOGUE 2-5+ ; #args, #regs, #xmm_regs, [stack_size,] arg_names...\r\n    %assign num_args %1\r\n    %assign regs_used %2\r\n    ASSERT regs_used >= num_args\r\n    %if num_args > 7\r\n        %assign num_args 7\r\n    %endif\r\n    %if regs_used > 7\r\n        %assign regs_used 7\r\n    %endif\r\n    SETUP_STACK_POINTER %4\r\n    ASSERT regs_used <= 7\r\n    PUSH_IF_USED 3, 4, 5, 6\r\n    ALLOC_STACK %4\r\n    LOAD_IF_USED 0, 1, 2, 3, 4, 5, 6\r\n    DEFINE_ARGS_INTERNAL %0, %4, %5\r\n%endmacro\r\n\r\n%define has_epilogue regs_used > 3 || mmsize == 32 || stack_size > 0\r\n\r\n%macro RET 0\r\n%if stack_size_padded > 0\r\n%if required_stack_alignment > STACK_ALIGNMENT\r\n    mov rsp, rstkm\r\n%else\r\n    add rsp, stack_size_padded\r\n%endif\r\n%endif\r\n    POP_IF_USED 6, 5, 4, 3\r\n%if mmsize == 32\r\n    vzeroupper\r\n%endif\r\n    AUTO_REP_RET\r\n%endmacro\r\n\r\n%endif ;======================================================================\r\n\r\n%if WIN64 == 0\r\n%macro WIN64_SPILL_XMM 1\r\n%endmacro\r\n%macro WIN64_RESTORE_XMM 1\r\n%endmacro\r\n%macro WIN64_PUSH_XMM 0\r\n%endmacro\r\n%endif\r\n\r\n; On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either\r\n; a branch 
or a branch target. So switch to a 2-byte form of ret in that case.\r\n; We can automatically detect \"follows a branch\", but not a branch target.\r\n; (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)\r\n%macro REP_RET 0\r\n    %if has_epilogue\r\n        RET\r\n    %else\r\n        rep ret\r\n    %endif\r\n%endmacro\r\n\r\n%define last_branch_adr $$\r\n%macro AUTO_REP_RET 0\r\n    %ifndef cpuflags\r\n        times ((last_branch_adr-$)>>31)+1 rep ; times 1 iff $ != last_branch_adr.\r\n    %elif notcpuflag(ssse3)\r\n        times ((last_branch_adr-$)>>31)+1 rep\r\n    %endif\r\n    ret\r\n%endmacro\r\n\r\n%macro BRANCH_INSTR 0-*\r\n    %rep %0\r\n        %macro %1 1-2 %1\r\n            %2 %1\r\n            %%branch_instr:\r\n            %xdefine last_branch_adr %%branch_instr\r\n        %endmacro\r\n        %rotate 1\r\n    %endrep\r\n%endmacro\r\n\r\nBRANCH_INSTR jz, je, jnz, jne, jl, jle, jnl, jnle, jg, jge, jng, jnge, ja, jae, jna, jnae, jb, jbe, jnb, jnbe, jc, jnc, js, jns, jo, jno, jp, jnp\r\n\r\n%macro TAIL_CALL 2 ; callee, is_nonadjacent\r\n    %if has_epilogue\r\n        call %1\r\n        RET\r\n    %elif %2\r\n        jmp %1\r\n    %endif\r\n%endmacro\r\n\r\n;=============================================================================\r\n; arch-independent part\r\n;=============================================================================\r\n\r\n%assign function_align 16\r\n\r\n; Begin a function.\r\n; Applies any symbol mangling needed for C linkage, and sets up a define such that\r\n; subsequent uses of the function name automatically refer to the mangled version.\r\n; Appends cpuflags to the function name if cpuflags has been specified.\r\n; The \"\" empty default parameter is a workaround for nasm, which fails if SUFFIX\r\n; is empty and we call cglobal_internal with just %1 %+ SUFFIX (without %2).\r\n%macro cglobal 1-2+ \"\" ; name, [PROLOGUE args]\r\n    cglobal_internal 1, %1 %+ SUFFIX, 
%2\r\n%endmacro\r\n%macro cvisible 1-2+ \"\" ; name, [PROLOGUE args]\r\n    cglobal_internal 0, %1 %+ SUFFIX, %2\r\n%endmacro\r\n%macro cglobal_internal 2-3+\r\n    %if %1\r\n        %xdefine %%FUNCTION_PREFIX private_prefix\r\n        %xdefine %%VISIBILITY hidden\r\n    %else\r\n        %xdefine %%FUNCTION_PREFIX public_prefix\r\n        %xdefine %%VISIBILITY\r\n    %endif\r\n    %ifndef cglobaled_%2\r\n        %xdefine %2 mangle(%%FUNCTION_PREFIX %+ _ %+ %2)\r\n        %xdefine %2.skip_prologue %2 %+ .skip_prologue\r\n        CAT_XDEFINE cglobaled_, %2, 1\r\n    %endif\r\n    %xdefine current_function %2\r\n    %if FORMAT_ELF\r\n        global %2:function %%VISIBILITY\r\n    %else\r\n        global %2\r\n    %endif\r\n    align function_align\r\n    %2:\r\n    RESET_MM_PERMUTATION        ; needed for x86-64, also makes disassembly somewhat nicer\r\n    %xdefine rstk rsp           ; copy of the original stack pointer, used when greater alignment than the known stack alignment is required\r\n    %assign stack_offset 0      ; stack pointer offset relative to the return address\r\n    %assign stack_size 0        ; amount of stack space that can be freely used inside a function\r\n    %assign stack_size_padded 0 ; total amount of allocated stack space, including space for callee-saved xmm registers on WIN64 and alignment padding\r\n    %assign xmm_regs_used 0     ; number of XMM registers requested, used for dealing with callee-saved registers on WIN64\r\n    %ifnidn %3, \"\"\r\n        PROLOGUE %3\r\n    %endif\r\n%endmacro\r\n\r\n%macro cextern 1\r\n    %xdefine %1 mangle(private_prefix %+ _ %+ %1)\r\n    CAT_XDEFINE cglobaled_, %1, 1\r\n    extern %1\r\n%endmacro\r\n\r\n; like cextern, but without the prefix\r\n%macro cextern_naked 1\r\n    %ifdef PREFIX\r\n        %xdefine %1 mangle(%1)\r\n    %endif\r\n    CAT_XDEFINE cglobaled_, %1, 1\r\n    extern %1\r\n%endmacro\r\n\r\n%macro const 1-2+\r\n    %xdefine %1 mangle(private_prefix %+ _ %+ %1)\r\n    %if 
FORMAT_ELF\r\n        global %1:data hidden\r\n    %else\r\n        global %1\r\n    %endif\r\n    ALIGN 32\r\n    %1: %2\r\n%endmacro\r\n\r\n; This is needed for ELF, otherwise the GNU linker assumes the stack is\r\n; executable by default.\r\n%if FORMAT_ELF\r\n    SECTION .note.GNU-stack noalloc noexec nowrite progbits\r\n%endif\r\n\r\n; cpuflags\r\n\r\n%assign cpuflags_mmx      (1<<0)\r\n%assign cpuflags_mmx2     (1<<1) | cpuflags_mmx\r\n%assign cpuflags_3dnow    (1<<2) | cpuflags_mmx\r\n%assign cpuflags_3dnowext (1<<3) | cpuflags_3dnow\r\n%assign cpuflags_sse      (1<<4) | cpuflags_mmx2\r\n%assign cpuflags_sse2     (1<<5) | cpuflags_sse\r\n%assign cpuflags_sse2slow (1<<6) | cpuflags_sse2\r\n%assign cpuflags_sse3     (1<<7) | cpuflags_sse2\r\n%assign cpuflags_ssse3    (1<<8) | cpuflags_sse3\r\n%assign cpuflags_sse4     (1<<9) | cpuflags_ssse3\r\n%assign cpuflags_sse42    (1<<10)| cpuflags_sse4\r\n%assign cpuflags_avx      (1<<11)| cpuflags_sse42\r\n%assign cpuflags_xop      (1<<12)| cpuflags_avx\r\n%assign cpuflags_fma4     (1<<13)| cpuflags_avx\r\n%assign cpuflags_avx2     (1<<14)| cpuflags_avx\r\n%assign cpuflags_fma3     (1<<15)| cpuflags_avx\r\n\r\n%assign cpuflags_cache32  (1<<16)\r\n%assign cpuflags_cache64  (1<<17)\r\n%assign cpuflags_slowctz  (1<<18)\r\n%assign cpuflags_lzcnt    (1<<19)\r\n%assign cpuflags_aligned  (1<<20) ; not a cpu feature, but a function variant\r\n%assign cpuflags_atom     (1<<21)\r\n%assign cpuflags_bmi1     (1<<22)|cpuflags_lzcnt\r\n%assign cpuflags_bmi2     (1<<23)|cpuflags_bmi1\r\n\r\n%define    cpuflag(x) ((cpuflags & (cpuflags_ %+ x)) == (cpuflags_ %+ x))\r\n%define notcpuflag(x) ((cpuflags & (cpuflags_ %+ x)) != (cpuflags_ %+ x))\r\n\r\n; Takes an arbitrary number of cpuflags from the above list.\r\n; All subsequent functions (up to the next INIT_CPUFLAGS) are built for the specified cpu.\r\n; You shouldn't need to invoke this macro directly, it's a subroutine for INIT_MMX &co.\r\n%macro INIT_CPUFLAGS 0-*\r\n    %xdefine 
SUFFIX\r\n    %undef cpuname\r\n    %assign cpuflags 0\r\n\r\n    %if %0 >= 1\r\n        %rep %0\r\n            %ifdef cpuname\r\n                %xdefine cpuname cpuname %+ _%1\r\n            %else\r\n                %xdefine cpuname %1\r\n            %endif\r\n            %assign cpuflags cpuflags | cpuflags_%1\r\n            %rotate 1\r\n        %endrep\r\n        %xdefine SUFFIX _ %+ cpuname\r\n\r\n        %if cpuflag(avx)\r\n            %assign avx_enabled 1\r\n        %endif\r\n        %if (mmsize == 16 && notcpuflag(sse2)) || (mmsize == 32 && notcpuflag(avx2))\r\n            %define mova movaps\r\n            %define movu movups\r\n            %define movnta movntps\r\n        %endif\r\n        %if cpuflag(aligned)\r\n            %define movu mova\r\n        %elif cpuflag(sse3) && notcpuflag(ssse3)\r\n            %define movu lddqu\r\n        %endif\r\n    %endif\r\n\r\n     %if ARCH_X86_64 || cpuflag(sse2)\r\n        %ifdef __NASM_VER__\r\n            ALIGNMODE p6\r\n        %else\r\n            CPU amdnop\r\n        %endif\r\n     %else\r\n        %ifdef __NASM_VER__\r\n            ALIGNMODE nop\r\n        %else\r\n            CPU basicnop\r\n        %endif\r\n     %endif\r\n\r\n%endmacro\r\n\r\n; Merge mmx and sse*\r\n; m# is a simd register of the currently selected size\r\n; xm# is the corresponding xmm register if mmsize >= 16, otherwise the same as m#\r\n; ym# is the corresponding ymm register if mmsize >= 32, otherwise the same as m#\r\n; (All 3 remain in sync through SWAP.)\r\n\r\n%macro CAT_XDEFINE 3\r\n    %xdefine %1%2 %3\r\n%endmacro\r\n\r\n%macro CAT_UNDEF 2\r\n    %undef %1%2\r\n%endmacro\r\n\r\n%macro INIT_MMX 0-1+\r\n    %assign avx_enabled 0\r\n    %define RESET_MM_PERMUTATION INIT_MMX %1\r\n    %define mmsize 8\r\n    %define num_mmregs 8\r\n    %define mova movq\r\n    %define movu movq\r\n    %define movh movd\r\n    %define movnta movntq\r\n    %assign %%i 0\r\n    %rep 8\r\n    CAT_XDEFINE m, %%i, mm %+ %%i\r\n    CAT_XDEFINE nmm, %%i, 
%%i\r\n    %assign %%i %%i+1\r\n    %endrep\r\n    %rep 8\r\n    CAT_UNDEF m, %%i\r\n    CAT_UNDEF nmm, %%i\r\n    %assign %%i %%i+1\r\n    %endrep\r\n    INIT_CPUFLAGS %1\r\n%endmacro\r\n\r\n%macro INIT_XMM 0-1+\r\n    %assign avx_enabled 0\r\n    %define RESET_MM_PERMUTATION INIT_XMM %1\r\n    %define mmsize 16\r\n    %define num_mmregs 8\r\n    %if ARCH_X86_64\r\n    %define num_mmregs 16\r\n    %endif\r\n    %define mova movdqa\r\n    %define movu movdqu\r\n    %define movh movq\r\n    %define movnta movntdq\r\n    %assign %%i 0\r\n    %rep num_mmregs\r\n    CAT_XDEFINE m, %%i, xmm %+ %%i\r\n    CAT_XDEFINE nxmm, %%i, %%i\r\n    %assign %%i %%i+1\r\n    %endrep\r\n    INIT_CPUFLAGS %1\r\n%endmacro\r\n\r\n%macro INIT_YMM 0-1+\r\n    %assign avx_enabled 1\r\n    %define RESET_MM_PERMUTATION INIT_YMM %1\r\n    %define mmsize 32\r\n    %define num_mmregs 8\r\n    %if ARCH_X86_64\r\n    %define num_mmregs 16\r\n    %endif\r\n    %define mova movdqa\r\n    %define movu movdqu\r\n    %undef movh\r\n    %define movnta movntdq\r\n    %assign %%i 0\r\n    %rep num_mmregs\r\n    CAT_XDEFINE m, %%i, ymm %+ %%i\r\n    CAT_XDEFINE nymm, %%i, %%i\r\n    %assign %%i %%i+1\r\n    %endrep\r\n    INIT_CPUFLAGS %1\r\n%endmacro\r\n\r\nINIT_XMM\r\n\r\n%macro DECLARE_MMCAST 1\r\n    %define  mmmm%1   mm%1\r\n    %define  mmxmm%1  mm%1\r\n    %define  mmymm%1  mm%1\r\n    %define xmmmm%1   mm%1\r\n    %define xmmxmm%1 xmm%1\r\n    %define xmmymm%1 xmm%1\r\n    %define ymmmm%1   mm%1\r\n    %define ymmxmm%1 xmm%1\r\n    %define ymmymm%1 ymm%1\r\n    %define ymm%1xmm xmm%1\r\n    %define xmm%1ymm ymm%1\r\n    %define xm%1 xmm %+ m%1\r\n    %define ym%1 ymm %+ m%1\r\n%endmacro\r\n\r\n%assign i 0\r\n%rep 16\r\n    DECLARE_MMCAST i\r\n%assign i i+1\r\n%endrep\r\n\r\n; I often want to use macros that permute their arguments. e.g. 
there's no\r\n; efficient way to implement butterfly or transpose or dct without swapping some\r\n; arguments.\r\n;\r\n; I would like to not have to manually keep track of the permutations:\r\n; If I insert a permutation in the middle of a function, it should automatically\r\n; change everything that follows. For more complex macros I may also have multiple\r\n; implementations, e.g. the SSE2 and SSSE3 versions may have different permutations.\r\n;\r\n; Hence these macros. Insert a PERMUTE or some SWAPs at the end of a macro that\r\n; permutes its arguments. It's equivalent to exchanging the contents of the\r\n; registers, except that this way you exchange the register names instead, so it\r\n; doesn't cost any cycles.\r\n\r\n%macro PERMUTE 2-* ; takes a list of pairs to swap\r\n%rep %0/2\r\n    %xdefine %%tmp%2 m%2\r\n    %rotate 2\r\n%endrep\r\n%rep %0/2\r\n    %xdefine m%1 %%tmp%2\r\n    CAT_XDEFINE n, m%1, %1\r\n    %rotate 2\r\n%endrep\r\n%endmacro\r\n\r\n%macro SWAP 2+ ; swaps a single chain (sometimes more concise than pairs)\r\n%ifnum %1 ; SWAP 0, 1, ...\r\n    SWAP_INTERNAL_NUM %1, %2\r\n%else ; SWAP m0, m1, ...\r\n    SWAP_INTERNAL_NAME %1, %2\r\n%endif\r\n%endmacro\r\n\r\n%macro SWAP_INTERNAL_NUM 2-*\r\n    %rep %0-1\r\n        %xdefine %%tmp m%1\r\n        %xdefine m%1 m%2\r\n        %xdefine m%2 %%tmp\r\n        CAT_XDEFINE n, m%1, %1\r\n        CAT_XDEFINE n, m%2, %2\r\n    %rotate 1\r\n    %endrep\r\n%endmacro\r\n\r\n%macro SWAP_INTERNAL_NAME 2-*\r\n    %xdefine %%args n %+ %1\r\n    %rep %0-1\r\n        %xdefine %%args %%args, n %+ %2\r\n    %rotate 1\r\n    %endrep\r\n    SWAP_INTERNAL_NUM %%args\r\n%endmacro\r\n\r\n; If SAVE_MM_PERMUTATION is placed at the end of a function, then any later\r\n; calls to that function will automatically load the permutation, so values can\r\n; be returned in mmregs.\r\n%macro SAVE_MM_PERMUTATION 0-1\r\n    %if %0\r\n        %xdefine %%f %1_m\r\n    %else\r\n        %xdefine %%f current_function %+ _m\r\n    
%endif\r\n    %assign %%i 0\r\n    %rep num_mmregs\r\n        CAT_XDEFINE %%f, %%i, m %+ %%i\r\n    %assign %%i %%i+1\r\n    %endrep\r\n%endmacro\r\n\r\n%macro LOAD_MM_PERMUTATION 1 ; name to load from\r\n    %ifdef %1_m0\r\n        %assign %%i 0\r\n        %rep num_mmregs\r\n            CAT_XDEFINE m, %%i, %1_m %+ %%i\r\n            CAT_XDEFINE n, m %+ %%i, %%i\r\n        %assign %%i %%i+1\r\n        %endrep\r\n    %endif\r\n%endmacro\r\n\r\n; Append cpuflags to the callee's name iff the appended name is known and the plain name isn't\r\n%macro call 1\r\n    call_internal %1, %1 %+ SUFFIX\r\n%endmacro\r\n%macro call_internal 2\r\n    %xdefine %%i %1\r\n    %ifndef cglobaled_%1\r\n        %ifdef cglobaled_%2\r\n            %xdefine %%i %2\r\n        %endif\r\n    %endif\r\n    call %%i\r\n    LOAD_MM_PERMUTATION %%i\r\n%endmacro\r\n\r\n; Substitutions that reduce instruction size but are functionally equivalent\r\n%macro add 2\r\n    %ifnum %2\r\n        %if %2==128\r\n            sub %1, -128\r\n        %else\r\n            add %1, %2\r\n        %endif\r\n    %else\r\n        add %1, %2\r\n    %endif\r\n%endmacro\r\n\r\n%macro sub 2\r\n    %ifnum %2\r\n        %if %2==128\r\n            add %1, -128\r\n        %else\r\n            sub %1, %2\r\n        %endif\r\n    %else\r\n        sub %1, %2\r\n    %endif\r\n%endmacro\r\n\r\n;=============================================================================\r\n; AVX abstraction layer\r\n;=============================================================================\r\n\r\n%assign i 0\r\n%rep 16\r\n    %if i < 8\r\n        CAT_XDEFINE sizeofmm, i, 8\r\n    %endif\r\n    CAT_XDEFINE sizeofxmm, i, 16\r\n    CAT_XDEFINE sizeofymm, i, 32\r\n%assign i i+1\r\n%endrep\r\n%undef i\r\n\r\n%macro CHECK_AVX_INSTR_EMU 3-*\r\n    %xdefine %%opcode %1\r\n    %xdefine %%dst %2\r\n    %rep %0-2\r\n        %ifidn %%dst, %3\r\n            %error non-avx emulation of ``%%opcode'' is not supported\r\n        %endif\r\n        %rotate 
1\r\n    %endrep\r\n%endmacro\r\n\r\n;%1 == instruction\r\n;%2 == minimal instruction set\r\n;%3 == 1 if float, 0 if int\r\n;%4 == 1 if non-destructive or 4-operand (xmm, xmm, xmm, imm), 0 otherwise\r\n;%5 == 1 if commutative (i.e. doesn't matter which src arg is which), 0 if not\r\n;%6+: operands\r\n%macro RUN_AVX_INSTR 6-9+\r\n    %ifnum sizeof%7\r\n        %assign __sizeofreg sizeof%7\r\n    %elifnum sizeof%6\r\n        %assign __sizeofreg sizeof%6\r\n    %else\r\n        %assign __sizeofreg mmsize\r\n    %endif\r\n    %assign __emulate_avx 0\r\n    %if avx_enabled && __sizeofreg >= 16\r\n        %xdefine __instr v%1\r\n    %else\r\n        %xdefine __instr %1\r\n        %if %0 >= 8+%4\r\n            %assign __emulate_avx 1\r\n        %endif\r\n    %endif\r\n    %ifnidn %2, fnord\r\n        %ifdef cpuname\r\n            %if notcpuflag(%2)\r\n                %error use of ``%1'' %2 instruction in cpuname function: current_function\r\n            %endif\r\n        %endif\r\n    %endif\r\n\r\n    %if __emulate_avx\r\n        %xdefine __src1 %7\r\n        %xdefine __src2 %8\r\n        %ifnidn %6, %7\r\n            %if %0 >= 9\r\n                CHECK_AVX_INSTR_EMU {%1 %6, %7, %8, %9}, %6, %8, %9\r\n            %else\r\n                CHECK_AVX_INSTR_EMU {%1 %6, %7, %8}, %6, %8\r\n            %endif\r\n            %if %5 && %4 == 0\r\n                %ifnid %8\r\n                    ; 3-operand AVX instructions with a memory arg can only have it in src2,\r\n                    ; whereas SSE emulation prefers to have it in src1 (i.e. 
the mov).\r\n                    ; So, if the instruction is commutative with a memory arg, swap them.\r\n                    %xdefine __src1 %8\r\n                    %xdefine __src2 %7\r\n                %endif\r\n            %endif\r\n            %if __sizeofreg == 8\r\n                MOVQ %6, __src1\r\n            %elif %3\r\n                MOVAPS %6, __src1\r\n            %else\r\n                MOVDQA %6, __src1\r\n            %endif\r\n        %endif\r\n        %if %0 >= 9\r\n            %1 %6, __src2, %9\r\n        %else\r\n            %1 %6, __src2\r\n        %endif\r\n    %elif %0 >= 9\r\n        __instr %6, %7, %8, %9\r\n    %elif %0 == 8\r\n        __instr %6, %7, %8\r\n    %elif %0 == 7\r\n        __instr %6, %7\r\n    %else\r\n        __instr %6\r\n    %endif\r\n%endmacro\r\n\r\n;%1 == instruction\r\n;%2 == minimal instruction set\r\n;%3 == 1 if float, 0 if int\r\n;%4 == 1 if non-destructive or 4-operand (xmm, xmm, xmm, imm), 0 otherwise\r\n;%5 == 1 if commutative (i.e. 
doesn't matter which src arg is which), 0 if not\r\n%macro AVX_INSTR 1-5 fnord, 0, 1, 0\r\n    %macro %1 1-10 fnord, fnord, fnord, fnord, %1, %2, %3, %4, %5\r\n        %ifidn %2, fnord\r\n            RUN_AVX_INSTR %6, %7, %8, %9, %10, %1\r\n        %elifidn %3, fnord\r\n            RUN_AVX_INSTR %6, %7, %8, %9, %10, %1, %2\r\n        %elifidn %4, fnord\r\n            RUN_AVX_INSTR %6, %7, %8, %9, %10, %1, %2, %3\r\n        %elifidn %5, fnord\r\n            RUN_AVX_INSTR %6, %7, %8, %9, %10, %1, %2, %3, %4\r\n        %else\r\n            RUN_AVX_INSTR %6, %7, %8, %9, %10, %1, %2, %3, %4, %5\r\n        %endif\r\n    %endmacro\r\n%endmacro\r\n\r\n; Instructions with both VEX and non-VEX encodings\r\n; Non-destructive instructions are written without parameters\r\nAVX_INSTR addpd, sse2, 1, 0, 1\r\nAVX_INSTR addps, sse, 1, 0, 1\r\nAVX_INSTR addsd, sse2, 1, 0, 1\r\nAVX_INSTR addss, sse, 1, 0, 1\r\nAVX_INSTR addsubpd, sse3, 1, 0, 0\r\nAVX_INSTR addsubps, sse3, 1, 0, 0\r\nAVX_INSTR aesdec, fnord, 0, 0, 0\r\nAVX_INSTR aesdeclast, fnord, 0, 0, 0\r\nAVX_INSTR aesenc, fnord, 0, 0, 0\r\nAVX_INSTR aesenclast, fnord, 0, 0, 0\r\nAVX_INSTR aesimc\r\nAVX_INSTR aeskeygenassist\r\nAVX_INSTR andnpd, sse2, 1, 0, 0\r\nAVX_INSTR andnps, sse, 1, 0, 0\r\nAVX_INSTR andpd, sse2, 1, 0, 1\r\nAVX_INSTR andps, sse, 1, 0, 1\r\nAVX_INSTR blendpd, sse4, 1, 0, 0\r\nAVX_INSTR blendps, sse4, 1, 0, 0\r\nAVX_INSTR blendvpd, sse4, 1, 0, 0\r\nAVX_INSTR blendvps, sse4, 1, 0, 0\r\nAVX_INSTR cmppd, sse2, 1, 1, 0\r\nAVX_INSTR cmpps, sse, 1, 1, 0\r\nAVX_INSTR cmpsd, sse2, 1, 1, 0\r\nAVX_INSTR cmpss, sse, 1, 1, 0\r\nAVX_INSTR comisd, sse2\r\nAVX_INSTR comiss, sse\r\nAVX_INSTR cvtdq2pd, sse2\r\nAVX_INSTR cvtdq2ps, sse2\r\nAVX_INSTR cvtpd2dq, sse2\r\nAVX_INSTR cvtpd2ps, sse2\r\nAVX_INSTR cvtps2dq, sse2\r\nAVX_INSTR cvtps2pd, sse2\r\nAVX_INSTR cvtsd2si, sse2\r\nAVX_INSTR cvtsd2ss, sse2\r\nAVX_INSTR cvtsi2sd, sse2\r\nAVX_INSTR cvtsi2ss, sse\r\nAVX_INSTR cvtss2sd, sse2\r\nAVX_INSTR cvtss2si, sse\r\nAVX_INSTR 
cvttpd2dq, sse2\r\nAVX_INSTR cvttps2dq, sse2\r\nAVX_INSTR cvttsd2si, sse2\r\nAVX_INSTR cvttss2si, sse\r\nAVX_INSTR divpd, sse2, 1, 0, 0\r\nAVX_INSTR divps, sse, 1, 0, 0\r\nAVX_INSTR divsd, sse2, 1, 0, 0\r\nAVX_INSTR divss, sse, 1, 0, 0\r\nAVX_INSTR dppd, sse4, 1, 1, 0\r\nAVX_INSTR dpps, sse4, 1, 1, 0\r\nAVX_INSTR extractps, sse4\r\nAVX_INSTR haddpd, sse3, 1, 0, 0\r\nAVX_INSTR haddps, sse3, 1, 0, 0\r\nAVX_INSTR hsubpd, sse3, 1, 0, 0\r\nAVX_INSTR hsubps, sse3, 1, 0, 0\r\nAVX_INSTR insertps, sse4, 1, 1, 0\r\nAVX_INSTR lddqu, sse3\r\nAVX_INSTR ldmxcsr, sse\r\nAVX_INSTR maskmovdqu, sse2\r\nAVX_INSTR maxpd, sse2, 1, 0, 1\r\nAVX_INSTR maxps, sse, 1, 0, 1\r\nAVX_INSTR maxsd, sse2, 1, 0, 1\r\nAVX_INSTR maxss, sse, 1, 0, 1\r\nAVX_INSTR minpd, sse2, 1, 0, 1\r\nAVX_INSTR minps, sse, 1, 0, 1\r\nAVX_INSTR minsd, sse2, 1, 0, 1\r\nAVX_INSTR minss, sse, 1, 0, 1\r\nAVX_INSTR movapd, sse2\r\nAVX_INSTR movaps, sse\r\nAVX_INSTR movd\r\nAVX_INSTR movddup, sse3\r\nAVX_INSTR movdqa, sse2\r\nAVX_INSTR movdqu, sse2\r\nAVX_INSTR movhlps, sse, 1, 0, 0\r\nAVX_INSTR movhpd, sse2, 1, 0, 0\r\nAVX_INSTR movhps, sse, 1, 0, 0\r\nAVX_INSTR movlhps, sse, 1, 0, 0\r\nAVX_INSTR movlpd, sse2, 1, 0, 0\r\nAVX_INSTR movlps, sse, 1, 0, 0\r\nAVX_INSTR movmskpd, sse2\r\nAVX_INSTR movmskps, sse\r\nAVX_INSTR movntdq, sse2\r\nAVX_INSTR movntdqa, sse4\r\nAVX_INSTR movntpd, sse2\r\nAVX_INSTR movntps, sse\r\nAVX_INSTR movq\r\nAVX_INSTR movsd, sse2, 1, 0, 0\r\nAVX_INSTR movshdup, sse3\r\nAVX_INSTR movsldup, sse3\r\nAVX_INSTR movss, sse, 1, 0, 0\r\nAVX_INSTR movupd, sse2\r\nAVX_INSTR movups, sse\r\nAVX_INSTR mpsadbw, sse4\r\nAVX_INSTR mulpd, sse2, 1, 0, 1\r\nAVX_INSTR mulps, sse, 1, 0, 1\r\nAVX_INSTR mulsd, sse2, 1, 0, 1\r\nAVX_INSTR mulss, sse, 1, 0, 1\r\nAVX_INSTR orpd, sse2, 1, 0, 1\r\nAVX_INSTR orps, sse, 1, 0, 1\r\nAVX_INSTR pabsb, ssse3\r\nAVX_INSTR pabsd, ssse3\r\nAVX_INSTR pabsw, ssse3\r\nAVX_INSTR packsswb, mmx, 0, 0, 0\r\nAVX_INSTR packssdw, mmx, 0, 0, 0\r\nAVX_INSTR packuswb, mmx, 0, 0, 0\r\nAVX_INSTR 
packusdw, sse4, 0, 0, 0\r\nAVX_INSTR paddb, mmx, 0, 0, 1\r\nAVX_INSTR paddw, mmx, 0, 0, 1\r\nAVX_INSTR paddd, mmx, 0, 0, 1\r\nAVX_INSTR paddq, sse2, 0, 0, 1\r\nAVX_INSTR paddsb, mmx, 0, 0, 1\r\nAVX_INSTR paddsw, mmx, 0, 0, 1\r\nAVX_INSTR paddusb, mmx, 0, 0, 1\r\nAVX_INSTR paddusw, mmx, 0, 0, 1\r\nAVX_INSTR palignr, ssse3\r\nAVX_INSTR pand, mmx, 0, 0, 1\r\nAVX_INSTR pandn, mmx, 0, 0, 0\r\nAVX_INSTR pavgb, mmx2, 0, 0, 1\r\nAVX_INSTR pavgw, mmx2, 0, 0, 1\r\nAVX_INSTR pblendvb, sse4, 0, 0, 0\r\nAVX_INSTR pblendw, sse4\r\nAVX_INSTR pclmulqdq\r\nAVX_INSTR pcmpestri, sse42\r\nAVX_INSTR pcmpestrm, sse42\r\nAVX_INSTR pcmpistri, sse42\r\nAVX_INSTR pcmpistrm, sse42\r\nAVX_INSTR pcmpeqb, mmx, 0, 0, 1\r\nAVX_INSTR pcmpeqw, mmx, 0, 0, 1\r\nAVX_INSTR pcmpeqd, mmx, 0, 0, 1\r\nAVX_INSTR pcmpeqq, sse4, 0, 0, 1\r\nAVX_INSTR pcmpgtb, mmx, 0, 0, 0\r\nAVX_INSTR pcmpgtw, mmx, 0, 0, 0\r\nAVX_INSTR pcmpgtd, mmx, 0, 0, 0\r\nAVX_INSTR pcmpgtq, sse42, 0, 0, 0\r\nAVX_INSTR pextrb, sse4\r\nAVX_INSTR pextrd, sse4\r\nAVX_INSTR pextrq, sse4\r\nAVX_INSTR pextrw, mmx2\r\nAVX_INSTR phaddw, ssse3, 0, 0, 0\r\nAVX_INSTR phaddd, ssse3, 0, 0, 0\r\nAVX_INSTR phaddsw, ssse3, 0, 0, 0\r\nAVX_INSTR phminposuw, sse4\r\nAVX_INSTR phsubw, ssse3, 0, 0, 0\r\nAVX_INSTR phsubd, ssse3, 0, 0, 0\r\nAVX_INSTR phsubsw, ssse3, 0, 0, 0\r\nAVX_INSTR pinsrb, sse4\r\nAVX_INSTR pinsrd, sse4\r\nAVX_INSTR pinsrq, sse4\r\nAVX_INSTR pinsrw, mmx2\r\nAVX_INSTR pmaddwd, mmx, 0, 0, 1\r\nAVX_INSTR pmaddubsw, ssse3, 0, 0, 0\r\nAVX_INSTR pmaxsb, sse4, 0, 0, 1\r\nAVX_INSTR pmaxsw, mmx2, 0, 0, 1\r\nAVX_INSTR pmaxsd, sse4, 0, 0, 1\r\nAVX_INSTR pmaxub, mmx2, 0, 0, 1\r\nAVX_INSTR pmaxuw, sse4, 0, 0, 1\r\nAVX_INSTR pmaxud, sse4, 0, 0, 1\r\nAVX_INSTR pminsb, sse4, 0, 0, 1\r\nAVX_INSTR pminsw, mmx2, 0, 0, 1\r\nAVX_INSTR pminsd, sse4, 0, 0, 1\r\nAVX_INSTR pminub, mmx2, 0, 0, 1\r\nAVX_INSTR pminuw, sse4, 0, 0, 1\r\nAVX_INSTR pminud, sse4, 0, 0, 1\r\nAVX_INSTR pmovmskb, mmx2\r\nAVX_INSTR pmovsxbw, sse4\r\nAVX_INSTR pmovsxbd, sse4\r\nAVX_INSTR 
pmovsxbq, sse4\r\nAVX_INSTR pmovsxwd, sse4\r\nAVX_INSTR pmovsxwq, sse4\r\nAVX_INSTR pmovsxdq, sse4\r\nAVX_INSTR pmovzxbw, sse4\r\nAVX_INSTR pmovzxbd, sse4\r\nAVX_INSTR pmovzxbq, sse4\r\nAVX_INSTR pmovzxwd, sse4\r\nAVX_INSTR pmovzxwq, sse4\r\nAVX_INSTR pmovzxdq, sse4\r\nAVX_INSTR pmuldq, sse4, 0, 0, 1\r\nAVX_INSTR pmulhrsw, ssse3, 0, 0, 1\r\nAVX_INSTR pmulhuw, mmx2, 0, 0, 1\r\nAVX_INSTR pmulhw, mmx, 0, 0, 1\r\nAVX_INSTR pmullw, mmx, 0, 0, 1\r\nAVX_INSTR pmulld, sse4, 0, 0, 1\r\nAVX_INSTR pmuludq, sse2, 0, 0, 1\r\nAVX_INSTR por, mmx, 0, 0, 1\r\nAVX_INSTR psadbw, mmx2, 0, 0, 1\r\nAVX_INSTR pshufb, ssse3, 0, 0, 0\r\nAVX_INSTR pshufd, sse2\r\nAVX_INSTR pshufhw, sse2\r\nAVX_INSTR pshuflw, sse2\r\nAVX_INSTR psignb, ssse3, 0, 0, 0\r\nAVX_INSTR psignw, ssse3, 0, 0, 0\r\nAVX_INSTR psignd, ssse3, 0, 0, 0\r\nAVX_INSTR psllw, mmx, 0, 0, 0\r\nAVX_INSTR pslld, mmx, 0, 0, 0\r\nAVX_INSTR psllq, mmx, 0, 0, 0\r\nAVX_INSTR pslldq, sse2, 0, 0, 0\r\nAVX_INSTR psraw, mmx, 0, 0, 0\r\nAVX_INSTR psrad, mmx, 0, 0, 0\r\nAVX_INSTR psrlw, mmx, 0, 0, 0\r\nAVX_INSTR psrld, mmx, 0, 0, 0\r\nAVX_INSTR psrlq, mmx, 0, 0, 0\r\nAVX_INSTR psrldq, sse2, 0, 0, 0\r\nAVX_INSTR psubb, mmx, 0, 0, 0\r\nAVX_INSTR psubw, mmx, 0, 0, 0\r\nAVX_INSTR psubd, mmx, 0, 0, 0\r\nAVX_INSTR psubq, sse2, 0, 0, 0\r\nAVX_INSTR psubsb, mmx, 0, 0, 0\r\nAVX_INSTR psubsw, mmx, 0, 0, 0\r\nAVX_INSTR psubusb, mmx, 0, 0, 0\r\nAVX_INSTR psubusw, mmx, 0, 0, 0\r\nAVX_INSTR ptest, sse4\r\nAVX_INSTR punpckhbw, mmx, 0, 0, 0\r\nAVX_INSTR punpckhwd, mmx, 0, 0, 0\r\nAVX_INSTR punpckhdq, mmx, 0, 0, 0\r\nAVX_INSTR punpckhqdq, sse2, 0, 0, 0\r\nAVX_INSTR punpcklbw, mmx, 0, 0, 0\r\nAVX_INSTR punpcklwd, mmx, 0, 0, 0\r\nAVX_INSTR punpckldq, mmx, 0, 0, 0\r\nAVX_INSTR punpcklqdq, sse2, 0, 0, 0\r\nAVX_INSTR pxor, mmx, 0, 0, 1\r\nAVX_INSTR rcpps, sse, 1, 0, 0\r\nAVX_INSTR rcpss, sse, 1, 0, 0\r\nAVX_INSTR roundpd, sse4\r\nAVX_INSTR roundps, sse4\r\nAVX_INSTR roundsd, sse4\r\nAVX_INSTR roundss, sse4\r\nAVX_INSTR rsqrtps, sse, 1, 0, 0\r\nAVX_INSTR rsqrtss, 
sse, 1, 0, 0\r\nAVX_INSTR shufpd, sse2, 1, 1, 0\r\nAVX_INSTR shufps, sse, 1, 1, 0\r\nAVX_INSTR sqrtpd, sse2, 1, 0, 0\r\nAVX_INSTR sqrtps, sse, 1, 0, 0\r\nAVX_INSTR sqrtsd, sse2, 1, 0, 0\r\nAVX_INSTR sqrtss, sse, 1, 0, 0\r\nAVX_INSTR stmxcsr, sse\r\nAVX_INSTR subpd, sse2, 1, 0, 0\r\nAVX_INSTR subps, sse, 1, 0, 0\r\nAVX_INSTR subsd, sse2, 1, 0, 0\r\nAVX_INSTR subss, sse, 1, 0, 0\r\nAVX_INSTR ucomisd, sse2\r\nAVX_INSTR ucomiss, sse\r\nAVX_INSTR unpckhpd, sse2, 1, 0, 0\r\nAVX_INSTR unpckhps, sse, 1, 0, 0\r\nAVX_INSTR unpcklpd, sse2, 1, 0, 0\r\nAVX_INSTR unpcklps, sse, 1, 0, 0\r\nAVX_INSTR xorpd, sse2, 1, 0, 1\r\nAVX_INSTR xorps, sse, 1, 0, 1\r\n\r\n; 3DNow instructions, for sharing code between AVX, SSE and 3DN\r\nAVX_INSTR pfadd, 3dnow, 1, 0, 1\r\nAVX_INSTR pfsub, 3dnow, 1, 0, 0\r\nAVX_INSTR pfmul, 3dnow, 1, 0, 1\r\n\r\n; base-4 constants for shuffles\r\n%assign i 0\r\n%rep 256\r\n    %assign j ((i>>6)&3)*1000 + ((i>>4)&3)*100 + ((i>>2)&3)*10 + (i&3)\r\n    %if j < 10\r\n        CAT_XDEFINE q000, j, i\r\n    %elif j < 100\r\n        CAT_XDEFINE q00, j, i\r\n    %elif j < 1000\r\n        CAT_XDEFINE q0, j, i\r\n    %else\r\n        CAT_XDEFINE q, j, i\r\n    %endif\r\n%assign i i+1\r\n%endrep\r\n%undef i\r\n%undef j\r\n\r\n%macro FMA_INSTR 3\r\n    %macro %1 4-7 %1, %2, %3\r\n        %if cpuflag(xop)\r\n            v%5 %1, %2, %3, %4\r\n        %elifnidn %1, %4\r\n            %6 %1, %2, %3\r\n            %7 %1, %4\r\n        %else\r\n            %error non-xop emulation of ``%5 %1, %2, %3, %4'' is not supported\r\n        %endif\r\n    %endmacro\r\n%endmacro\r\n\r\nFMA_INSTR  pmacsww,  pmullw, paddw\r\nFMA_INSTR  pmacsdd,  pmulld, paddd ; sse4 emulation\r\nFMA_INSTR pmacsdql,  pmuldq, paddq ; sse4 emulation\r\nFMA_INSTR pmadcswd, pmaddwd, paddd\r\n\r\n; convert FMA4 to FMA3 if possible\r\n%macro FMA4_INSTR 4\r\n    %macro %1 4-8 %1, %2, %3, %4\r\n        %if cpuflag(fma4)\r\n            v%5 %1, %2, %3, %4\r\n        %elifidn %1, %2\r\n            v%6 %1, %4, %3 ; %1 = 
%1 * %3 + %4\r\n        %elifidn %1, %3\r\n            v%7 %1, %2, %4 ; %1 = %2 * %1 + %4\r\n        %elifidn %1, %4\r\n            v%8 %1, %2, %3 ; %1 = %2 * %3 + %1\r\n        %else\r\n            %error fma3 emulation of ``%5 %1, %2, %3, %4'' is not supported\r\n        %endif\r\n    %endmacro\r\n%endmacro\r\n\r\nFMA4_INSTR fmaddpd, fmadd132pd, fmadd213pd, fmadd231pd\r\nFMA4_INSTR fmaddps, fmadd132ps, fmadd213ps, fmadd231ps\r\nFMA4_INSTR fmaddsd, fmadd132sd, fmadd213sd, fmadd231sd\r\nFMA4_INSTR fmaddss, fmadd132ss, fmadd213ss, fmadd231ss\r\n\r\nFMA4_INSTR fmaddsubpd, fmaddsub132pd, fmaddsub213pd, fmaddsub231pd\r\nFMA4_INSTR fmaddsubps, fmaddsub132ps, fmaddsub213ps, fmaddsub231ps\r\nFMA4_INSTR fmsubaddpd, fmsubadd132pd, fmsubadd213pd, fmsubadd231pd\r\nFMA4_INSTR fmsubaddps, fmsubadd132ps, fmsubadd213ps, fmsubadd231ps\r\n\r\nFMA4_INSTR fmsubpd, fmsub132pd, fmsub213pd, fmsub231pd\r\nFMA4_INSTR fmsubps, fmsub132ps, fmsub213ps, fmsub231ps\r\nFMA4_INSTR fmsubsd, fmsub132sd, fmsub213sd, fmsub231sd\r\nFMA4_INSTR fmsubss, fmsub132ss, fmsub213ss, fmsub231ss\r\n\r\nFMA4_INSTR fnmaddpd, fnmadd132pd, fnmadd213pd, fnmadd231pd\r\nFMA4_INSTR fnmaddps, fnmadd132ps, fnmadd213ps, fnmadd231ps\r\nFMA4_INSTR fnmaddsd, fnmadd132sd, fnmadd213sd, fnmadd231sd\r\nFMA4_INSTR fnmaddss, fnmadd132ss, fnmadd213ss, fnmadd231ss\r\n\r\nFMA4_INSTR fnmsubpd, fnmsub132pd, fnmsub213pd, fnmsub231pd\r\nFMA4_INSTR fnmsubps, fnmsub132ps, fnmsub213ps, fnmsub231ps\r\nFMA4_INSTR fnmsubsd, fnmsub132sd, fnmsub213sd, fnmsub231sd\r\nFMA4_INSTR fnmsubss, fnmsub132ss, fnmsub213ss, fnmsub231ss\r\n\r\n; workaround: yasm generates wrong code for vpbroadcastd with a register operand\r\n%macro vpbroadcastd 2\r\n  %ifid %2\r\n    movd         %1 %+ xmm, %2\r\n    vpbroadcastd %1, %1 %+ xmm\r\n  %else\r\n    vpbroadcastd %1, %2\r\n  %endif\r\n%endmacro\r\n"
  },
  {
    "path": "source/common/x86/x86util.asm",
    "content": ";*****************************************************************************\r\n;* x86util.asm: x86 utility macros\r\n;*****************************************************************************\r\n;* Copyright (C) 2008-2013 x264 project\r\n;*\r\n;* Authors: Holger Lubitz <holger@lubitz.org>\r\n;*          Loren Merritt <lorenm@u.washington.edu>\r\n;*\r\n;* This program is free software; you can redistribute it and/or modify\r\n;* it under the terms of the GNU General Public License as published by\r\n;* the Free Software Foundation; either version 2 of the License, or\r\n;* (at your option) any later version.\r\n;*\r\n;* This program is distributed in the hope that it will be useful,\r\n;* but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n;* GNU General Public License for more details.\r\n;*\r\n;* You should have received a copy of the GNU General Public License\r\n;* along with this program; if not, write to the Free Software\r\n;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n;*\r\n;* This program is also available under a commercial proprietary license.\r\n;* For more information, contact us at license @ x265.com.\r\n;*****************************************************************************\r\n\r\n%assign FENC_STRIDE 64\r\n%assign FDEC_STRIDE 32\r\n\r\n%assign SIZEOF_PIXEL 1\r\n%assign SIZEOF_DCTCOEF 2\r\n%define pixel byte\r\n%define vpbroadcastdct vpbroadcastw\r\n%define vpbroadcastpix vpbroadcastb\r\n%if HIGH_BIT_DEPTH\r\n    %assign SIZEOF_PIXEL 2\r\n    %assign SIZEOF_DCTCOEF 4\r\n    %define pixel word\r\n    %define vpbroadcastdct vpbroadcastd\r\n    %define vpbroadcastpix vpbroadcastw\r\n%endif\r\n\r\n%assign FENC_STRIDEB SIZEOF_PIXEL*FENC_STRIDE\r\n%assign FDEC_STRIDEB SIZEOF_PIXEL*FDEC_STRIDE\r\n\r\n%assign PIXEL_MAX ((1 << BIT_DEPTH)-1)\r\n\r\n%macro FIX_STRIDES 1-*\r\n%if HIGH_BIT_DEPTH\r\n%rep %0\r\n    add %1, 
%1\r\n    %rotate 1\r\n%endrep\r\n%endif\r\n%endmacro\r\n\r\n\r\n%macro SBUTTERFLY 4\r\n%ifidn %1, dqqq\r\n    vperm2i128  m%4, m%2, m%3, q0301 ; punpckh\r\n    vinserti128 m%2, m%2, xm%3, 1    ; punpckl\r\n%elif avx_enabled && mmsize >= 16\r\n    punpckh%1 m%4, m%2, m%3\r\n    punpckl%1 m%2, m%3\r\n%else\r\n    mova      m%4, m%2\r\n    punpckl%1 m%2, m%3\r\n    punpckh%1 m%4, m%3\r\n%endif\r\n    SWAP %3, %4\r\n%endmacro\r\n\r\n%macro SBUTTERFLY2 4\r\n    punpckl%1 m%4, m%2, m%3\r\n    punpckh%1 m%2, m%2, m%3\r\n    SWAP %2, %4, %3\r\n%endmacro\r\n\r\n%macro TRANSPOSE4x4W 5\r\n    SBUTTERFLY wd, %1, %2, %5\r\n    SBUTTERFLY wd, %3, %4, %5\r\n    SBUTTERFLY dq, %1, %3, %5\r\n    SBUTTERFLY dq, %2, %4, %5\r\n    SWAP %2, %3\r\n%endmacro\r\n\r\n%macro TRANSPOSE2x4x4W 5\r\n    SBUTTERFLY wd,  %1, %2, %5\r\n    SBUTTERFLY wd,  %3, %4, %5\r\n    SBUTTERFLY dq,  %1, %3, %5\r\n    SBUTTERFLY dq,  %2, %4, %5\r\n    SBUTTERFLY qdq, %1, %2, %5\r\n    SBUTTERFLY qdq, %3, %4, %5\r\n%endmacro\r\n\r\n%macro TRANSPOSE4x4D 5\r\n    SBUTTERFLY dq,  %1, %2, %5\r\n    SBUTTERFLY dq,  %3, %4, %5\r\n    SBUTTERFLY qdq, %1, %3, %5\r\n    SBUTTERFLY qdq, %2, %4, %5\r\n    SWAP %2, %3\r\n%endmacro\r\n\r\n%macro TRANSPOSE8x8W 9-11\r\n%if ARCH_X86_64\r\n    SBUTTERFLY wd,  %1, %2, %9\r\n    SBUTTERFLY wd,  %3, %4, %9\r\n    SBUTTERFLY wd,  %5, %6, %9\r\n    SBUTTERFLY wd,  %7, %8, %9\r\n    SBUTTERFLY dq,  %1, %3, %9\r\n    SBUTTERFLY dq,  %2, %4, %9\r\n    SBUTTERFLY dq,  %5, %7, %9\r\n    SBUTTERFLY dq,  %6, %8, %9\r\n    SBUTTERFLY qdq, %1, %5, %9\r\n    SBUTTERFLY qdq, %2, %6, %9\r\n    SBUTTERFLY qdq, %3, %7, %9\r\n    SBUTTERFLY qdq, %4, %8, %9\r\n    SWAP %2, %5\r\n    SWAP %4, %7\r\n%else\r\n; in:  m0..m7, unless %11 in which case m6 is in %9\r\n; out: m0..m7, unless %11 in which case m4 is in %10\r\n; spills into %9 and %10\r\n%if %0<11\r\n    movdqa %9, m%7\r\n%endif\r\n    SBUTTERFLY wd,  %1, %2, %7\r\n    movdqa %10, m%2\r\n    movdqa m%7, %9\r\n    SBUTTERFLY wd,  %3, %4, 
%2\r\n    SBUTTERFLY wd,  %5, %6, %2\r\n    SBUTTERFLY wd,  %7, %8, %2\r\n    SBUTTERFLY dq,  %1, %3, %2\r\n    movdqa %9, m%3\r\n    movdqa m%2, %10\r\n    SBUTTERFLY dq,  %2, %4, %3\r\n    SBUTTERFLY dq,  %5, %7, %3\r\n    SBUTTERFLY dq,  %6, %8, %3\r\n    SBUTTERFLY qdq, %1, %5, %3\r\n    SBUTTERFLY qdq, %2, %6, %3\r\n    movdqa %10, m%2\r\n    movdqa m%3, %9\r\n    SBUTTERFLY qdq, %3, %7, %2\r\n    SBUTTERFLY qdq, %4, %8, %2\r\n    SWAP %2, %5\r\n    SWAP %4, %7\r\n%if %0<11\r\n    movdqa m%5, %10\r\n%endif\r\n%endif\r\n%endmacro\r\n\r\n%macro WIDEN_SXWD 2\r\n    punpckhwd m%2, m%1\r\n    psrad     m%2, 16\r\n%if cpuflag(sse4)\r\n    pmovsxwd  m%1, m%1\r\n%else\r\n    punpcklwd m%1, m%1\r\n    psrad     m%1, 16\r\n%endif\r\n%endmacro\r\n\r\n%macro ABSW 2-3 ; dst, src, tmp (tmp used only if dst==src)\r\n%if cpuflag(ssse3)\r\n    pabsw   %1, %2\r\n%elifidn %3, sign ; version for pairing with PSIGNW: modifies src\r\n    pxor    %1, %1\r\n    pcmpgtw %1, %2\r\n    pxor    %2, %1\r\n    psubw   %2, %1\r\n    SWAP    %1, %2\r\n%elifidn %1, %2\r\n    pxor    %3, %3\r\n    psubw   %3, %1\r\n    pmaxsw  %1, %3\r\n%elifid %2\r\n    pxor    %1, %1\r\n    psubw   %1, %2\r\n    pmaxsw  %1, %2\r\n%elif %0 == 2\r\n    pxor    %1, %1\r\n    psubw   %1, %2\r\n    pmaxsw  %1, %2\r\n%else\r\n    mova    %1, %2\r\n    pxor    %3, %3\r\n    psubw   %3, %1\r\n    pmaxsw  %1, %3\r\n%endif\r\n%endmacro\r\n\r\n%macro ABSW2 6 ; dst1, dst2, src1, src2, tmp, tmp\r\n%if cpuflag(ssse3)\r\n    pabsw   %1, %3\r\n    pabsw   %2, %4\r\n%elifidn %1, %3\r\n    pxor    %5, %5\r\n    pxor    %6, %6\r\n    psubw   %5, %1\r\n    psubw   %6, %2\r\n    pmaxsw  %1, %5\r\n    pmaxsw  %2, %6\r\n%else\r\n    pxor    %1, %1\r\n    pxor    %2, %2\r\n    psubw   %1, %3\r\n    psubw   %2, %4\r\n    pmaxsw  %1, %3\r\n    pmaxsw  %2, %4\r\n%endif\r\n%endmacro\r\n\r\n%macro ABSB 2\r\n%if cpuflag(ssse3)\r\n    pabsb   %1, %1\r\n%else\r\n    pxor    %2, %2\r\n    psubb   %2, %1\r\n    pminub  %1, 
%2\r\n%endif\r\n%endmacro\r\n\r\n%macro ABSD 2-3\r\n%if cpuflag(ssse3)\r\n    pabsd   %1, %2\r\n%else\r\n    %define %%s %2\r\n%if %0 == 3\r\n    mova    %3, %2\r\n    %define %%s %3\r\n%endif\r\n    pxor     %1, %1\r\n    pcmpgtd  %1, %%s\r\n    pxor    %%s, %1\r\n    psubd   %%s, %1\r\n    SWAP     %1, %%s\r\n%endif\r\n%endmacro\r\n\r\n%macro PSIGN 3-4\r\n%if cpuflag(ssse3) && %0 == 4\r\n    psign%1 %2, %3, %4\r\n%elif cpuflag(ssse3)\r\n    psign%1 %2, %3\r\n%elif %0 == 4\r\n    pxor    %2, %3, %4\r\n    psub%1  %2, %4\r\n%else\r\n    pxor    %2, %3\r\n    psub%1  %2, %3\r\n%endif\r\n%endmacro\r\n\r\n%define PSIGNW PSIGN w,\r\n%define PSIGND PSIGN d,\r\n\r\n%macro SPLATB_LOAD 3\r\n%if cpuflag(ssse3)\r\n    movd      %1, [%2-3]\r\n    pshufb    %1, %3\r\n%else\r\n    movd      %1, [%2-3] ;to avoid crossing a cacheline\r\n    punpcklbw %1, %1\r\n    SPLATW    %1, %1, 3\r\n%endif\r\n%endmacro\r\n\r\n%imacro SPLATW 2-3 0\r\n%if cpuflag(avx2) && %3 == 0\r\n    vpbroadcastw %1, %2\r\n%else\r\n    PSHUFLW      %1, %2, (%3)*q1111\r\n%if mmsize == 16\r\n    punpcklqdq   %1, %1\r\n%endif\r\n%endif\r\n%endmacro\r\n\r\n%imacro SPLATD 2-3 0\r\n%if mmsize == 16\r\n    pshufd %1, %2, (%3)*q1111\r\n%else\r\n    pshufw %1, %2, (%3)*q0101 + ((%3)+1)*q1010\r\n%endif\r\n%endmacro\r\n\r\n%macro CLIPW 3 ;(dst, min, max)\r\n    pmaxsw %1, %2\r\n    pminsw %1, %3\r\n%endmacro\r\n\r\n%macro CLIPW2 4 ;(dst0, dst1, min, max)\r\n    pmaxsw %1, %3\r\n    pmaxsw %2, %3\r\n    pminsw %1, %4\r\n    pminsw %2, %4\r\n%endmacro\r\n\r\n%macro HADDD 2 ; sum junk\r\n%if sizeof%1 == 32\r\n%define %2 xmm%2\r\n    vextracti128 %2, %1, 1\r\n%define %1 xmm%1\r\n    paddd   %1, %2\r\n%endif\r\n%if mmsize >= 16\r\n%if cpuflag(xop) && sizeof%1 == 16\r\n    vphadddq %1, %1\r\n%endif\r\n    movhlps %2, %1\r\n    paddd   %1, %2\r\n%endif\r\n%if notcpuflag(xop)\r\n    PSHUFLW %2, %1, q0032\r\n    paddd   %1, %2\r\n%endif\r\n%undef %1\r\n%undef %2\r\n%endmacro\r\n\r\n%macro HADDW 2 ; reg, tmp\r\n%if cpuflag(xop) 
&& sizeof%1 == 16\r\n    vphaddwq  %1, %1\r\n    movhlps   %2, %1\r\n    paddd     %1, %2\r\n%else\r\n    pmaddwd %1, [pw_1]\r\n    HADDD   %1, %2\r\n%endif\r\n%endmacro\r\n\r\n%macro HADDUWD 2\r\n%if cpuflag(xop) && sizeof%1 == 16\r\n    vphadduwd %1, %1\r\n%else\r\n    psrld %2, %1, 16\r\n    pslld %1, 16\r\n    psrld %1, 16\r\n    paddd %1, %2\r\n%endif\r\n%endmacro\r\n\r\n%macro HADDUW 2\r\n%if cpuflag(xop) && sizeof%1 == 16\r\n    vphadduwq %1, %1\r\n    movhlps   %2, %1\r\n    paddd     %1, %2\r\n%else\r\n    HADDUWD   %1, %2\r\n    HADDD     %1, %2\r\n%endif\r\n%endmacro\r\n\r\n%macro PALIGNR 4-5 ; [dst,] src1, src2, imm, tmp\r\n; AVX2 version uses a precalculated extra input that\r\n; can be re-used across calls\r\n%if sizeof%1==32\r\n                                 ; %3 = abcdefgh ijklmnop (lower address)\r\n                                 ; %2 = ABCDEFGH IJKLMNOP (higher address)\r\n;   vperm2i128 %5, %2, %3, q0003 ; %5 = ijklmnop ABCDEFGH\r\n%if %4 < 16\r\n    palignr    %1, %5, %3, %4    ; %1 = bcdefghi jklmnopA\r\n%else\r\n    palignr    %1, %2, %5, %4-16 ; %1 = pABCDEFG HIJKLMNO\r\n%endif\r\n%elif cpuflag(ssse3)\r\n    %if %0==5\r\n        palignr %1, %2, %3, %4\r\n    %else\r\n        palignr %1, %2, %3\r\n    %endif\r\n%else\r\n    %define %%dst %1\r\n    %if %0==5\r\n        %ifnidn %1, %2\r\n            mova %%dst, %2\r\n        %endif\r\n        %rotate 1\r\n    %endif\r\n    %ifnidn %4, %2\r\n        mova %4, %2\r\n    %endif\r\n    %if mmsize==8\r\n        psllq  %%dst, (8-%3)*8\r\n        psrlq  %4, %3*8\r\n    %else\r\n        pslldq %%dst, 16-%3\r\n        psrldq %4, %3\r\n    %endif\r\n    por %%dst, %4\r\n%endif\r\n%endmacro\r\n\r\n%macro PSHUFLW 1+\r\n    %if mmsize == 8\r\n        pshufw %1\r\n    %else\r\n        pshuflw %1\r\n    %endif\r\n%endmacro\r\n\r\n; shift a mmxreg by n bytes, or a xmmreg by 2*n bytes\r\n; values shifted in are undefined\r\n; faster if dst==src\r\n%define PSLLPIX PSXLPIX l, -1, ;dst, src, shift\r\n%define 
PSRLPIX PSXLPIX r,  1, ;dst, src, shift\r\n%macro PSXLPIX 5\r\n    %if mmsize == 8\r\n        %if %5&1\r\n            ps%1lq %3, %4, %5*8\r\n        %else\r\n            pshufw %3, %4, (q3210<<8>>(8+%2*%5))&0xff\r\n        %endif\r\n    %else\r\n        ps%1ldq %3, %4, %5*2\r\n    %endif\r\n%endmacro\r\n\r\n%macro DEINTB 5 ; mask, reg1, mask, reg2, optional src to fill masks from\r\n%ifnum %5\r\n    pand   m%3, m%5, m%4 ; src .. y6 .. y4\r\n    pand   m%1, m%5, m%2 ; dst .. y6 .. y4\r\n%else\r\n    mova   m%1, %5\r\n    pand   m%3, m%1, m%4 ; src .. y6 .. y4\r\n    pand   m%1, m%1, m%2 ; dst .. y6 .. y4\r\n%endif\r\n    psrlw  m%2, 8        ; dst .. y7 .. y5\r\n    psrlw  m%4, 8        ; src .. y7 .. y5\r\n%endmacro\r\n\r\n%macro SUMSUB_BA 3-4\r\n%if %0==3\r\n    padd%1  m%2, m%3\r\n    padd%1  m%3, m%3\r\n    psub%1  m%3, m%2\r\n%elif avx_enabled\r\n    padd%1  m%4, m%2, m%3\r\n    psub%1  m%3, m%2\r\n    SWAP    %2, %4\r\n%else\r\n    mova    m%4, m%2\r\n    padd%1  m%2, m%3\r\n    psub%1  m%3, m%4\r\n%endif\r\n%endmacro\r\n\r\n%macro SUMSUB_BADC 5-6\r\n%if %0==6\r\n    SUMSUB_BA %1, %2, %3, %6\r\n    SUMSUB_BA %1, %4, %5, %6\r\n%else\r\n    padd%1  m%2, m%3\r\n    padd%1  m%4, m%5\r\n    padd%1  m%3, m%3\r\n    padd%1  m%5, m%5\r\n    psub%1  m%3, m%2\r\n    psub%1  m%5, m%4\r\n%endif\r\n%endmacro\r\n\r\n%macro HADAMARD4_V 4+\r\n    SUMSUB_BADC w, %1, %2, %3, %4\r\n    SUMSUB_BADC w, %1, %3, %2, %4\r\n%endmacro\r\n\r\n%macro HADAMARD8_V 8+\r\n    SUMSUB_BADC w, %1, %2, %3, %4\r\n    SUMSUB_BADC w, %5, %6, %7, %8\r\n    SUMSUB_BADC w, %1, %3, %2, %4\r\n    SUMSUB_BADC w, %5, %7, %6, %8\r\n    SUMSUB_BADC w, %1, %5, %2, %6\r\n    SUMSUB_BADC w, %3, %7, %4, %8\r\n%endmacro\r\n\r\n%macro TRANS_SSE2 5-6\r\n; TRANSPOSE2x2\r\n; %1: transpose width (d/q) - use SBUTTERFLY qdq for dq\r\n; %2: ord/unord (for compat with sse4, unused)\r\n; %3/%4: source regs\r\n; %5/%6: tmp regs\r\n%ifidn %1, d\r\n%define mask [mask_10]\r\n%define shift 16\r\n%elifidn %1, q\r\n%define mask 
[mask_1100]\r\n%define shift 32\r\n%endif\r\n%if %0==6 ; less dependency if we have two tmp\r\n    mova   m%5, mask   ; ff00\r\n    mova   m%6, m%4    ; x5x4\r\n    psll%1 m%4, shift  ; x4..\r\n    pand   m%6, m%5    ; x5..\r\n    pandn  m%5, m%3    ; ..x0\r\n    psrl%1 m%3, shift  ; ..x1\r\n    por    m%4, m%5    ; x4x0\r\n    por    m%3, m%6    ; x5x1\r\n%else ; more dependency, one insn less. sometimes faster, sometimes not\r\n    mova   m%5, m%4    ; x5x4\r\n    psll%1 m%4, shift  ; x4..\r\n    pxor   m%4, m%3    ; (x4^x1)x0\r\n    pand   m%4, mask   ; (x4^x1)..\r\n    pxor   m%3, m%4    ; x4x0\r\n    psrl%1 m%4, shift  ; ..(x1^x4)\r\n    pxor   m%5, m%4    ; x5x1\r\n    SWAP   %4, %3, %5\r\n%endif\r\n%endmacro\r\n\r\n%macro TRANS_SSE4 5-6 ; see above\r\n%ifidn %1, d\r\n%ifidn %2, ord\r\n    psrl%1  m%5, m%3, 16\r\n    pblendw m%5, m%4, q2222\r\n    psll%1  m%4, 16\r\n    pblendw m%4, m%3, q1111\r\n    SWAP     %3, %5\r\n%else\r\n%if avx_enabled\r\n    pblendw m%5, m%3, m%4, q2222\r\n    SWAP     %3, %5\r\n%else\r\n    mova    m%5, m%3\r\n    pblendw m%3, m%4, q2222\r\n%endif\r\n    psll%1  m%4, 16\r\n    psrl%1  m%5, 16\r\n    por     m%4, m%5\r\n%endif\r\n%elifidn %1, q\r\n    shufps m%5, m%3, m%4, q3131\r\n    shufps m%3, m%3, m%4, q2020\r\n    SWAP    %4, %5\r\n%endif\r\n%endmacro\r\n\r\n%macro TRANS_XOP 5-6\r\n%ifidn %1, d\r\n    vpperm m%5, m%3, m%4, [transd_shuf1]\r\n    vpperm m%3, m%3, m%4, [transd_shuf2]\r\n%elifidn %1, q\r\n    shufps m%5, m%3, m%4, q3131\r\n    shufps m%3, m%4, q2020\r\n%endif\r\n    SWAP    %4, %5\r\n%endmacro\r\n\r\n%macro HADAMARD 5-6\r\n; %1=distance in words (0 for vertical pass, 1/2/4 for horizontal passes)\r\n; %2=sumsub/max/amax (sum and diff / maximum / maximum of absolutes)\r\n; %3/%4: regs\r\n; %5(%6): tmpregs\r\n%if %1!=0 ; have to reorder stuff for horizontal op\r\n    %ifidn %2, sumsub\r\n        %define ORDER ord\r\n        ; sumsub needs order because a-b != b-a unless a=b\r\n    %else\r\n        %define ORDER 
unord\r\n        ; if we just max, order doesn't matter (allows pblendw+or in sse4)\r\n    %endif\r\n    %if %1==1\r\n        TRANS d, ORDER, %3, %4, %5, %6\r\n    %elif %1==2\r\n        %if mmsize==8\r\n            SBUTTERFLY dq, %3, %4, %5\r\n        %else\r\n            TRANS q, ORDER, %3, %4, %5, %6\r\n        %endif\r\n    %elif %1==4\r\n        SBUTTERFLY qdq, %3, %4, %5\r\n    %elif %1==8\r\n        SBUTTERFLY dqqq, %3, %4, %5\r\n    %endif\r\n%endif\r\n%ifidn %2, sumsub\r\n    SUMSUB_BA w, %3, %4, %5\r\n%else\r\n    %ifidn %2, amax\r\n        %if %0==6\r\n            ABSW2 m%3, m%4, m%3, m%4, m%5, m%6\r\n        %else\r\n            ABSW m%3, m%3, m%5\r\n            ABSW m%4, m%4, m%5\r\n        %endif\r\n    %endif\r\n    pmaxsw m%3, m%4\r\n%endif\r\n%endmacro\r\n\r\n\r\n%macro HADAMARD2_2D 6-7 sumsub\r\n    HADAMARD 0, sumsub, %1, %2, %5\r\n    HADAMARD 0, sumsub, %3, %4, %5\r\n    SBUTTERFLY %6, %1, %2, %5\r\n%ifnum %7\r\n    HADAMARD 0, amax, %1, %2, %5, %7\r\n%else\r\n    HADAMARD 0, %7, %1, %2, %5\r\n%endif\r\n    SBUTTERFLY %6, %3, %4, %5\r\n%ifnum %7\r\n    HADAMARD 0, amax, %3, %4, %5, %7\r\n%else\r\n    HADAMARD 0, %7, %3, %4, %5\r\n%endif\r\n%endmacro\r\n\r\n%macro HADAMARD4_2D 5-6 sumsub\r\n    HADAMARD2_2D %1, %2, %3, %4, %5, wd\r\n    HADAMARD2_2D %1, %3, %2, %4, %5, dq, %6\r\n    SWAP %2, %3\r\n%endmacro\r\n\r\n%macro HADAMARD4_2D_SSE 5-6 sumsub\r\n    HADAMARD  0, sumsub, %1, %2, %5 ; 1st V row 0 + 1\r\n    HADAMARD  0, sumsub, %3, %4, %5 ; 1st V row 2 + 3\r\n    SBUTTERFLY   wd, %1, %2, %5     ; %1: m0 1+0 %2: m1 1+0\r\n    SBUTTERFLY   wd, %3, %4, %5     ; %3: m0 3+2 %4: m1 3+2\r\n    HADAMARD2_2D %1, %3, %2, %4, %5, dq\r\n    SBUTTERFLY  qdq, %1, %2, %5\r\n    HADAMARD  0, %6, %1, %2, %5     ; 2nd H m1/m0 row 0+1\r\n    SBUTTERFLY  qdq, %3, %4, %5\r\n    HADAMARD  0, %6, %3, %4, %5     ; 2nd H m1/m0 row 2+3\r\n%endmacro\r\n\r\n%macro HADAMARD8_2D 9-10 sumsub\r\n    HADAMARD2_2D %1, %2, %3, %4, %9, wd\r\n    HADAMARD2_2D %5, %6, %7, %8, 
%9, wd\r\n    HADAMARD2_2D %1, %3, %2, %4, %9, dq\r\n    HADAMARD2_2D %5, %7, %6, %8, %9, dq\r\n    HADAMARD2_2D %1, %5, %3, %7, %9, qdq, %10\r\n    HADAMARD2_2D %2, %6, %4, %8, %9, qdq, %10\r\n%ifnidn %10, amax\r\n    SWAP %2, %5\r\n    SWAP %4, %7\r\n%endif\r\n%endmacro\r\n\r\n; doesn't include the \"pmaddubsw hmul_8p\" pass\r\n%macro HADAMARD8_2D_HMUL 10\r\n    HADAMARD4_V %1, %2, %3, %4, %9\r\n    HADAMARD4_V %5, %6, %7, %8, %9\r\n    SUMSUB_BADC w, %1, %5, %2, %6, %9\r\n    HADAMARD 2, sumsub, %1, %5, %9, %10\r\n    HADAMARD 2, sumsub, %2, %6, %9, %10\r\n    SUMSUB_BADC w, %3, %7, %4, %8, %9\r\n    HADAMARD 2, sumsub, %3, %7, %9, %10\r\n    HADAMARD 2, sumsub, %4, %8, %9, %10\r\n    HADAMARD 1, amax, %1, %5, %9, %10\r\n    HADAMARD 1, amax, %2, %6, %9, %5\r\n    HADAMARD 1, amax, %3, %7, %9, %5\r\n    HADAMARD 1, amax, %4, %8, %9, %5\r\n%endmacro\r\n\r\n%macro SUMSUB2_AB 4\r\n%if cpuflag(xop)\r\n    pmacs%1%1 m%4, m%3, [p%1_m2], m%2\r\n    pmacs%1%1 m%2, m%2, [p%1_2], m%3\r\n%elifnum %3\r\n    psub%1  m%4, m%2, m%3\r\n    psub%1  m%4, m%3\r\n    padd%1  m%2, m%2\r\n    padd%1  m%2, m%3\r\n%else\r\n    mova    m%4, m%2\r\n    padd%1  m%2, m%2\r\n    padd%1  m%2, %3\r\n    psub%1  m%4, %3\r\n    psub%1  m%4, %3\r\n%endif\r\n%endmacro\r\n\r\n%macro SUMSUBD2_AB 5\r\n%ifnum %4\r\n    psra%1  m%5, m%2, 1  ; %3: %3>>1\r\n    psra%1  m%4, m%3, 1  ; %2: %2>>1\r\n    padd%1  m%4, m%2     ; %3: %3>>1+%2\r\n    psub%1  m%5, m%3     ; %2: %2>>1-%3\r\n    SWAP     %2, %5\r\n    SWAP     %3, %4\r\n%else\r\n    mova    %5, m%2\r\n    mova    %4, m%3\r\n    psra%1  m%3, 1  ; %3: %3>>1\r\n    psra%1  m%2, 1  ; %2: %2>>1\r\n    padd%1  m%3, %5 ; %3: %3>>1+%2\r\n    psub%1  m%2, %4 ; %2: %2>>1-%3\r\n%endif\r\n%endmacro\r\n\r\n%macro DCT4_1D 5\r\n%ifnum %5\r\n    SUMSUB_BADC w, %4, %1, %3, %2, %5\r\n    SUMSUB_BA   w, %3, %4, %5\r\n    SUMSUB2_AB  w, %1, %2, %5\r\n    SWAP %1, %3, %4, %5, %2\r\n%else\r\n    SUMSUB_BADC w, %4, %1, %3, %2\r\n    SUMSUB_BA   w, %3, %4\r\n    mova     
[%5], m%2\r\n    SUMSUB2_AB  w, %1, [%5], %2\r\n    SWAP %1, %3, %4, %2\r\n%endif\r\n%endmacro\r\n\r\n%macro IDCT4_1D 6-7\r\n%ifnum %6\r\n    SUMSUBD2_AB %1, %3, %5, %7, %6\r\n    ; %3: %3>>1-%5 %5: %3+%5>>1\r\n    SUMSUB_BA   %1, %4, %2, %7\r\n    ; %4: %2+%4 %2: %2-%4\r\n    SUMSUB_BADC %1, %5, %4, %3, %2, %7\r\n    ; %5: %2+%4 + (%3+%5>>1)\r\n    ; %4: %2+%4 - (%3+%5>>1)\r\n    ; %3: %2-%4 + (%3>>1-%5)\r\n    ; %2: %2-%4 - (%3>>1-%5)\r\n%else\r\n%ifidn %1, w\r\n    SUMSUBD2_AB %1, %3, %5, [%6], [%6+16]\r\n%else\r\n    SUMSUBD2_AB %1, %3, %5, [%6], [%6+32]\r\n%endif\r\n    SUMSUB_BA   %1, %4, %2\r\n    SUMSUB_BADC %1, %5, %4, %3, %2\r\n%endif\r\n    SWAP %2, %5, %4\r\n    ; %2: %2+%4 + (%3+%5>>1) row0\r\n    ; %3: %2-%4 + (%3>>1-%5) row1\r\n    ; %4: %2-%4 - (%3>>1-%5) row2\r\n    ; %5: %2+%4 - (%3+%5>>1) row3\r\n%endmacro\r\n\r\n\r\n%macro LOAD_DIFF 5-6 1\r\n%if HIGH_BIT_DEPTH\r\n%if %6 ; %5 aligned?\r\n    mova       %1, %4\r\n    psubw      %1, %5\r\n%else\r\n    movu       %1, %4\r\n    movu       %2, %5\r\n    psubw      %1, %2\r\n%endif\r\n%else ; !HIGH_BIT_DEPTH\r\n%ifidn %3, none\r\n    movh       %1, %4\r\n    movh       %2, %5\r\n    punpcklbw  %1, %2\r\n    punpcklbw  %2, %2\r\n    psubw      %1, %2\r\n%else\r\n    movh       %1, %4\r\n    punpcklbw  %1, %3\r\n    movh       %2, %5\r\n    punpcklbw  %2, %3\r\n    psubw      %1, %2\r\n%endif\r\n%endif ; HIGH_BIT_DEPTH\r\n%endmacro\r\n\r\n%macro LOAD_DIFF8x4 8 ; 4x dst, 1x tmp, 1x mul, 2x ptr\r\n%if BIT_DEPTH == 8 && cpuflag(ssse3)\r\n    movh       m%2, [%8+%1*FDEC_STRIDE]\r\n    movh       m%1, [%7+%1*FENC_STRIDE]\r\n    punpcklbw  m%1, m%2\r\n    movh       m%3, [%8+%2*FDEC_STRIDE]\r\n    movh       m%2, [%7+%2*FENC_STRIDE]\r\n    punpcklbw  m%2, m%3\r\n    movh       m%4, [%8+%3*FDEC_STRIDE]\r\n    movh       m%3, [%7+%3*FENC_STRIDE]\r\n    punpcklbw  m%3, m%4\r\n    movh       m%5, [%8+%4*FDEC_STRIDE]\r\n    movh       m%4, [%7+%4*FENC_STRIDE]\r\n    punpcklbw  m%4, m%5\r\n    pmaddubsw  m%1, 
m%6\r\n    pmaddubsw  m%2, m%6\r\n    pmaddubsw  m%3, m%6\r\n    pmaddubsw  m%4, m%6\r\n%else\r\n    LOAD_DIFF  m%1, m%5, m%6, [%7+%1*FENC_STRIDEB], [%8+%1*FDEC_STRIDEB]\r\n    LOAD_DIFF  m%2, m%5, m%6, [%7+%2*FENC_STRIDEB], [%8+%2*FDEC_STRIDEB]\r\n    LOAD_DIFF  m%3, m%5, m%6, [%7+%3*FENC_STRIDEB], [%8+%3*FDEC_STRIDEB]\r\n    LOAD_DIFF  m%4, m%5, m%6, [%7+%4*FENC_STRIDEB], [%8+%4*FDEC_STRIDEB]\r\n%endif\r\n%endmacro\r\n\r\n%macro STORE_DCT 6\r\n    movq   [%5+%6+ 0], m%1\r\n    movq   [%5+%6+ 8], m%2\r\n    movq   [%5+%6+16], m%3\r\n    movq   [%5+%6+24], m%4\r\n    movhps [%5+%6+32], m%1\r\n    movhps [%5+%6+40], m%2\r\n    movhps [%5+%6+48], m%3\r\n    movhps [%5+%6+56], m%4\r\n%endmacro\r\n\r\n%macro STORE_IDCT 4\r\n    movhps [r0-4*FDEC_STRIDE], %1\r\n    movh   [r0-3*FDEC_STRIDE], %1\r\n    movhps [r0-2*FDEC_STRIDE], %2\r\n    movh   [r0-1*FDEC_STRIDE], %2\r\n    movhps [r0+0*FDEC_STRIDE], %3\r\n    movh   [r0+1*FDEC_STRIDE], %3\r\n    movhps [r0+2*FDEC_STRIDE], %4\r\n    movh   [r0+3*FDEC_STRIDE], %4\r\n%endmacro\r\n\r\n%macro LOAD_DIFF_8x4P 7-11 r0,r2,0,1 ; 4x dest, 2x temp, 2x pointer, increment, aligned?\r\n    LOAD_DIFF m%1, m%5, m%7, [%8],      [%9],      %11\r\n    LOAD_DIFF m%2, m%6, m%7, [%8+r1],   [%9+r3],   %11\r\n    LOAD_DIFF m%3, m%5, m%7, [%8+2*r1], [%9+2*r3], %11\r\n    LOAD_DIFF m%4, m%6, m%7, [%8+r4],   [%9+r5],   %11\r\n%if %10\r\n    lea %8, [%8+4*r1]\r\n    lea %9, [%9+4*r3]\r\n%endif\r\n%endmacro\r\n\r\n; 2xdst, 2xtmp, 2xsrcrow\r\n%macro LOAD_DIFF16x2_AVX2 6\r\n    pmovzxbw m%1, [r1+%5*FENC_STRIDE]\r\n    pmovzxbw m%2, [r1+%6*FENC_STRIDE]\r\n    pmovzxbw m%3, [r2+(%5-4)*FDEC_STRIDE]\r\n    pmovzxbw m%4, [r2+(%6-4)*FDEC_STRIDE]\r\n    psubw    m%1, m%3\r\n    psubw    m%2, m%4\r\n%endmacro\r\n\r\n%macro DIFFx2 6-7\r\n    movh       %3, %5\r\n    punpcklbw  %3, %4\r\n    psraw      %1, 6\r\n    paddsw     %1, %3\r\n    movh       %3, %6\r\n    punpcklbw  %3, %4\r\n    psraw      %2, 6\r\n    paddsw     %2, %3\r\n    packuswb   %2, 
%1\r\n%endmacro\r\n\r\n; (high depth) in: %1, %2, min to clip, max to clip, mem128\r\n; in: %1, tmp, %3, mem64\r\n%macro STORE_DIFF 4-5\r\n%if HIGH_BIT_DEPTH\r\n    psrad      %1, 6\r\n    psrad      %2, 6\r\n    packssdw   %1, %2\r\n    paddw      %1, %5\r\n    CLIPW      %1, %3, %4\r\n    mova       %5, %1\r\n%else\r\n    movh       %2, %4\r\n    punpcklbw  %2, %3\r\n    psraw      %1, 6\r\n    paddsw     %1, %2\r\n    packuswb   %1, %1\r\n    movh       %4, %1\r\n%endif\r\n%endmacro\r\n\r\n%macro SHUFFLE_MASK_W 8\r\n    %rep 8\r\n        %if %1>=0x80\r\n            db %1, %1\r\n        %else\r\n            db %1*2\r\n            db %1*2+1\r\n        %endif\r\n        %rotate 1\r\n    %endrep\r\n%endmacro\r\n\r\n; instruction, accum, input, iteration (zero to swap, nonzero to add)\r\n%macro ACCUM 4\r\n%if %4\r\n    %1        m%2, m%3\r\n%else\r\n    SWAP       %2, %3\r\n%endif\r\n%endmacro\r\n\r\n; IACA support\r\n%macro IACA_START 0\r\n    mov ebx, 111\r\n    db 0x64, 0x67, 0x90\r\n%endmacro\r\n\r\n%macro IACA_END 0\r\n    mov ebx, 222\r\n    db 0x64, 0x67, 0x90\r\n%endmacro\r\n"
  },
  {
    "path": "source/configw.h",
    "content": "/*\r\n * configw.h\r\n *\r\n * Description of this file:\r\n *    header file for MS/Intel compiler on windows platform of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_CONFIGW_H\r\n#define DAVS2_CONFIGW_H\r\n\r\n#if defined(__ICL) || defined(_MSC_VER)\r\n\r\n/* arch */\r\n#define ARCH_X86                1\r\n#define ARCH_PPC                0\r\n#define ARCH_ARM                0\r\n#define ARCH_UltraSPARC         0\r\n\r\n/* system */\r\n#define SYS_WINDOWS             1\r\n#define SYS_LINUX               0\r\n#define SYS_MACOSX              0\r\n#define SYS_BEOS                0\r\n#define SYS_FREEBSD             0\r\n#define SYS_OPENBSD             0\r\n\r\n/* cpu */\r\n#ifndef __SSE__\r\n#define __SSE__\r\n#endif\r\n#define 
HAVE_MMX                1     /* X86     */\r\n#define HAVE_ALTIVEC            0     /* ALTIVEC */\r\n#define HAVE_ALTIVEC_H          0\r\n#define HAVE_NEON               0     /* ARM     */\r\n#define HAVE_ARMV6              0\r\n#define HAVE_ARMV6T2            0\r\n\r\n/* thread */\r\n#define HAVE_THREAD             1\r\n#define HAVE_WIN32THREAD        1\r\n#define HAVE_PTHREAD            0\r\n#define HAVE_BEOSTHREAD         0\r\n#define HAVE_POSIXTHREAD        0\r\n#define PTW32_STATIC_LIB        0\r\n\r\n/* interlace support */\r\n#define HAVE_INTERLACED         1\r\n\r\n/* malloc */\r\n#define HAVE_MALLOC_H           0\r\n\r\n/* big-endian */\r\n#define WORDS_BIGENDIAN         0\r\n\r\n/* others */\r\n#define HAVE_STDINT_H           1\r\n#define HAVE_VECTOREXT          0\r\n#define HAVE_LOG2F              0\r\n#define HAVE_SWSCALE            0\r\n#define HAVE_LAVF               0\r\n#define HAVE_FFMS               0\r\n#define HAVE_GPAC               0\r\n#define HAVE_GF_MALLOC          0\r\n#define HAVE_AVS                0\r\n\r\n#endif\r\n#endif // DAVS2_CONFIGW_H\r\n"
  },
  {
    "path": "source/davs2.h",
    "content": "/*\r\n * davs2.h\r\n *\r\n * Description of this file:\r\n *    API functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_DAVS2_H\r\n#define DAVS2_DAVS2_H\r\n\r\n#include <stdint.h>\r\n\r\n#ifdef __cplusplus\r\nextern \"C\" {    // only need to export C interface if used by C++ source code\r\n#endif\r\n\r\n/* dAVS2 build version, means different API interface\r\n * (10 * VER_MAJOR + VER_MINOR) */\r\n#define DAVS2_BUILD                16\r\n\r\n/**\r\n * ===========================================================================\r\n * define DAVS2_API\r\n * ===========================================================================\r\n */\r\n#ifdef DAVS2_EXPORTS\r\n#  ifdef __GNUC__                     /* for 
Linux  */\r\n#    if __GNUC__ >= 4\r\n#      define DAVS2_API __attribute__((visibility(\"default\")))\r\n#    else\r\n#      define DAVS2_API __attribute__((dllexport))\r\n#    endif\r\n#  else                               /* for windows */\r\n#    define DAVS2_API __declspec(dllexport)\r\n#  endif\r\n#else\r\n#  ifdef __GNUC__                     /* for Linux   */\r\n#    define DAVS2_API\r\n#  else                               /* for windows */\r\n#    define DAVS2_API __declspec(dllimport)\r\n#  endif\r\n#endif\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * const defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * picture type */\r\nenum davs2_picture_type_e {\r\n    DAVS2_PIC_I       = 0,         /* picture-I */\r\n    DAVS2_PIC_P       = 1,         /* picture-P */\r\n    DAVS2_PIC_B       = 2,         /* picture-B */\r\n    DAVS2_PIC_G       = 3,         /* picture-G */\r\n    DAVS2_PIC_F       = 4,         /* picture-F */\r\n    DAVS2_PIC_S       = 5          /* picture-S */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * profile id */\r\nenum davs2_profile_id_e {\r\n    DAVS2_PROFILE_MAIN_PIC = 0x12,      /* AVS2 main picture profile */\r\n    DAVS2_PROFILE_MAIN     = 0x20,      /* AVS2 main         profile */\r\n    DAVS2_PROFILE_MAIN10   = 0x22       /* AVS2 main 10bit   profile */\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * log level\r\n */\r\nenum davs2_log_level_e {\r\n    DAVS2_LOG_DEBUG   = 0,\r\n    DAVS2_LOG_INFO    = 1,\r\n    DAVS2_LOG_WARNING = 2,\r\n    DAVS2_LOG_ERROR   = 3,\r\n    DAVS2_LOG_MAX     = 4\r\n};\r\n\r\n/* ---------------------------------------------------------------------------\r\n * information of return value for decode/flush()\r\n 
*/\r\nenum davs2_ret_e {\r\n    DAVS2_ERROR       = -1,   /* A decoding error occurred */\r\n    DAVS2_DEFAULT     = 0,    /* Decoding, but no output yet */\r\n    DAVS2_GOT_FRAME   = 1,    /* Decoding produced a frame */\r\n    DAVS2_GOT_HEADER  = 2,    /* Decoding produced a sequence header, always obtained before DAVS2_GOT_FRAME */\r\n    DAVS2_END         = 3,    /* Decoding ended: no more bit-stream to decode and no more frames to output */\r\n};\r\n\r\n/**\r\n * ===========================================================================\r\n * interface struct type defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * information of sequence header\r\n */\r\ntypedef struct davs2_seq_info_t {\r\n    uint32_t        profile_id;         /* profile ID, davs2_profile_id_e */\r\n    uint32_t        level_id;           /* level   ID */\r\n    uint32_t        progressive;        /* progressive sequence (0: interlace, 1: progressive) */\r\n    uint32_t        width;              /* image width */\r\n    uint32_t        height;             /* image height */\r\n    uint32_t        chroma_format;      /* chroma format (1: 4:2:0, 2: 4:2:2) */\r\n    uint32_t        aspect_ratio;       /* 2: 4:3,  3: 16:9 */\r\n    uint32_t        low_delay;          /* low delay */\r\n    uint32_t        bitrate;            /* bitrate (bps) */\r\n    uint32_t        internal_bit_depth; /* internal sample bit depth */\r\n    uint32_t        output_bit_depth;   /* output sample bit depth */\r\n    uint32_t        bytes_per_sample;   /* bytes per sample */\r\n    float           frame_rate;         /* frame rate */\r\n    uint32_t        frame_rate_id;      /* frame rate code, mpeg12 [1...8] */\r\n} davs2_seq_info_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * packet of bitstream\r\n */\r\ntypedef struct davs2_packet_t {\r\n    const
uint8_t  *data;             /* bitstream */\r\n    int             len;              /* bytes of the bitstream */\r\n    int64_t         pts;              /* presentation time stamp */\r\n    int64_t         dts;              /* decoding time stamp */\r\n} davs2_packet_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * decoded picture\r\n */\r\ntypedef struct davs2_picture_t {\r\n    void           *magic;            /* must be the 1st member variable (do not change it) */\r\n    /* picture information */\r\n    uint8_t        *planes[3];        /* picture planes */\r\n    int             widths[3];        /* picture width in pixels */\r\n    int             lines[3];         /* picture height in pixels */\r\n    int             strides[3];       /* number of bytes per line, stored contiguously in memory */\r\n    int             pic_order_count;  /* picture number */\r\n    int             type;             /* picture type of the corresponding frame */\r\n    int             qp;               /* QP of the corresponding picture */\r\n    int64_t         pts;              /* presentation time stamp */\r\n    int64_t         dts;              /* decoding time stamp */\r\n    int             num_planes;       /* number of planes */\r\n    int             bytes_per_sample; /* number of bytes for each sample */\r\n    int             bit_depth;        /* number of bits for each sample */\r\n    int             b_decode_error;   /* is there any decoding error of this frame?
*/\r\n    void           *dec_frame;        /* pointer to decoding frame in DPB (do not change it) */\r\n} davs2_picture_t;\r\n\r\n/* ---------------------------------------------------------------------------\r\n * parameters for creating an AVS2 decoder\r\n */\r\ntypedef struct davs2_param_t {\r\n    int               threads;        /* decoding threads: 0 for auto */\r\n    int               info_level;     /* only output information which is no less than this level (davs2_log_level_e).\r\n                                         0: All; 1: no debug info; 2: only warnings and errors; 3: only errors */\r\n    void             *opaque;         /* user data */\r\n    /* additional parameters for version >= 16 */\r\n    int               disable_avx;    /* 1: disable; 0: default (autodetect) */\r\n} davs2_param_t;\r\n\r\n/**\r\n * ===========================================================================\r\n * interface function declarations (DAVS2 library APIs for AVS2 video decoder)\r\n * ===========================================================================\r\n */\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : open an AVS2 decoder\r\n * Parameters :\r\n *   [in/out] : param - pointer to struct davs2_param_t\r\n * Return     : handle of the decoder, zero for failure\r\n * ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API void *\r\ndavs2_decoder_open(davs2_param_t *param);\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : send one bitstream packet to the decoder\r\n * Parameters :\r\n *       [in] : decoder   - pointer to the AVS2 decoder handle\r\n *       [in] : packet    - pointer to struct davs2_packet_t\r\n * Return     : see definition of davs2_ret_e\r\n * ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API int\r\ndavs2_decoder_send_packet(void *decoder, davs2_packet_t
*packet);\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : decode one frame\r\n * Parameters :\r\n *       [in] : decoder   - pointer to the AVS2 decoder handler\r\n *      [out] : headerset - pointer to output common frame information (would always appear before frame output)\r\n *      [out] : out_frame - pointer to output frame information\r\n * Return     : see definition of davs2_ret_e\r\n * ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API int\r\ndavs2_decoder_recv_frame(void *decoder, davs2_seq_info_t *headerset, davs2_picture_t *out_frame);\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : flush the decoder\r\n * Parameters :\r\n *       [in] : decoder   - decoder handle\r\n *      [out] : headerset - pointer to output common frame information (would always appear before frame output)\r\n *      [out] : out_frame - pointer to output frame information\r\n * Return     : see definition of davs2_ret_e\r\n * ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API int\r\ndavs2_decoder_flush(void *decoder, davs2_seq_info_t *headerset, davs2_picture_t *out_frame);\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : release one output frame\r\n * Parameters :\r\n *       [in] : decoder   - decoder handle\r\n *            : out_frame - frame to recycle\r\n * Return     : none\r\n * ---------------------------------------------------------------------------\r\n */\r\nDAVS2_API void\r\ndavs2_decoder_frame_unref(void *decoder, davs2_picture_t *out_frame);\r\n\r\n/**\r\n * ---------------------------------------------------------------------------\r\n * Function   : close the AVS2 decoder\r\n * Parameters :\r\n *       [in] : decoder - decoder handle\r\n * Return     : none\r\n * 
---------------------------------------------------------------------------\r\n */\r\nDAVS2_API void\r\ndavs2_decoder_close(void *decoder);\r\n\r\n#ifdef __cplusplus\r\n}\r\n#endif\r\n\r\n#endif // DAVS2_DAVS2_H\r\n"
  },
  {
    "path": "source/test/getopt/getopt.c",
    "content": "/* Getopt for GNU.\r\n   NOTE: getopt is now part of the C library, so if you don't know what\r\n   \"Keep this file name-space clean\" means, talk to drepper@gnu.org\r\n   before changing it!\r\n   Copyright (C) 1987,88,89,90,91,92,93,94,95,96,98,99,2000,2001\r\n    Free Software Foundation, Inc.\r\n   This file is part of the GNU C Library.\r\n\r\n   The GNU C Library is free software; you can redistribute it and/or\r\n   modify it under the terms of the GNU Lesser General Public\r\n   License as published by the Free Software Foundation; either\r\n   version 2.1 of the License, or (at your option) any later version.\r\n\r\n   The GNU C Library is distributed in the hope that it will be useful,\r\n   but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU\r\n   Lesser General Public License for more details.\r\n\r\n   You should have received a copy of the GNU Lesser General Public\r\n   License along with the GNU C Library; if not, write to the Free\r\n   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA\r\n   02111-1307 USA.  */\r\n\f\r\n/* This tells Alpha OSF/1 not to define a getopt prototype in <stdio.h>.\r\n   Ditto for AIX 3.2 and <stdlib.h>.  */\r\n#ifndef _NO_PROTO\r\n# define _NO_PROTO\r\n#endif\r\n\r\n#ifdef HAVE_CONFIG_H\r\n# include <config.h>\r\n#endif\r\n\r\n#if !defined __STDC__ || !__STDC__\r\n/* This is a separate conditional since some stdc systems\r\n   reject `defined (const)'.  */\r\n# ifndef const\r\n#  define const\r\n# endif\r\n#endif\r\n\r\n#include <stdio.h>\r\n\r\n/* Comment out all this code if we are using the GNU C Library, and are not\r\n   actually compiling the library itself.  This code is part of the GNU C\r\n   Library, but also included in many other GNU distributions.  Compiling\r\n   and linking in this code is a waste when using the GNU C library\r\n   (especially if it is a shared library).  
Rather than having every GNU\r\n   program understand `configure --with-gnu-libc' and omit the object files,\r\n   it is simpler to just do this in the source for each such file.  */\r\n\r\n#define GETOPT_INTERFACE_VERSION 2\r\n#if !defined _LIBC && defined __GLIBC__ && __GLIBC__ >= 2\r\n# include <gnu-versions.h>\r\n# if _GNU_GETOPT_INTERFACE_VERSION == GETOPT_INTERFACE_VERSION\r\n#  define ELIDE_CODE\r\n# endif\r\n#endif\r\n\r\n#ifndef ELIDE_CODE\r\n\r\n\r\n/* This needs to come after some library #include\r\n   to get __GNU_LIBRARY__ defined.  */\r\n#ifdef  __GNU_LIBRARY__\r\n/* Don't include stdlib.h for non-GNU C libraries because some of them\r\n   contain conflicting prototypes for getopt.  */\r\n# include <stdlib.h>\r\n# include <unistd.h>\r\n#endif  /* GNU C library.  */\r\n\r\n#ifdef VMS\r\n# include <unixlib.h>\r\n# if HAVE_STRING_H - 0\r\n#  include <string.h>\r\n# endif\r\n#endif\r\n\r\n#ifndef _\r\n/* This is for other GNU distributions with internationalized messages.  */\r\n# if defined HAVE_LIBINTL_H || defined _LIBC\r\n#  include <libintl.h>\r\n#  ifndef _\r\n#   define _(msgid) gettext (msgid)\r\n#  endif\r\n# else\r\n#  define _(msgid)  (msgid)\r\n# endif\r\n#endif\r\n\r\n/* This version of `getopt' appears to the caller like standard Unix `getopt'\r\n   but it behaves differently for the user, since it allows the user\r\n   to intersperse the options with the other arguments.\r\n\r\n   As `getopt' works, it permutes the elements of ARGV so that,\r\n   when it is done, all the options precede everything else.  Thus\r\n   all application programs are extended to handle flexible argument order.\r\n\r\n   Setting the environment variable POSIXLY_CORRECT disables permutation.\r\n   Then the behavior is completely standard.\r\n\r\n   GNU application programs can use a third alternative mode in which\r\n   they can distinguish the relative order of options and other arguments.  
*/\r\n\r\n#include \"getopt.h\"\r\n\r\n/* For communication from `getopt' to the caller.\r\n   When `getopt' finds an option that takes an argument,\r\n   the argument value is returned here.\r\n   Also, when `ordering' is RETURN_IN_ORDER,\r\n   each non-option ARGV-element is returned here.  */\r\n\r\nchar *optarg;\r\n\r\n/* Index in ARGV of the next element to be scanned.\r\n   This is used for communication to and from the caller\r\n   and for communication between successive calls to `getopt'.\r\n\r\n   On entry to `getopt', zero means this is the first call; initialize.\r\n\r\n   When `getopt' returns -1, this is the index of the first of the\r\n   non-option elements that the caller should itself scan.\r\n\r\n   Otherwise, `optind' communicates from one call to the next\r\n   how much of ARGV has been scanned so far.  */\r\n\r\n/* 1003.2 says this must be 1 before any call.  */\r\nint optind = 1;\r\n\r\n/* Formerly, initialization of getopt depended on optind==0, which\r\n   causes problems with re-calling getopt as programs generally don't\r\n   know that. */\r\n\r\nint __getopt_initialized;\r\n\r\n/* The next char to be scanned in the option-element\r\n   in which the last option character we returned was found.\r\n   This allows us to pick up the scan where we left off.\r\n\r\n   If this is zero, or a null string, it means resume the scan\r\n   by advancing to the next ARGV-element.  */\r\n\r\nstatic char *nextchar;\r\n\r\n/* Callers store zero here to inhibit the error message\r\n   for unrecognized options.  */\r\n\r\nint opterr = 1;\r\n\r\n/* Set to an option character which was unrecognized.\r\n   This must be initialized on some systems to avoid linking in the\r\n   system's own getopt implementation.  
*/\r\n\r\nint optopt = '?';\r\n\r\n/* Describe how to deal with options that follow non-option ARGV-elements.\r\n\r\n   If the caller did not specify anything,\r\n   the default is REQUIRE_ORDER if the environment variable\r\n   POSIXLY_CORRECT is defined, PERMUTE otherwise.\r\n\r\n   REQUIRE_ORDER means don't recognize them as options;\r\n   stop option processing when the first non-option is seen.\r\n   This is what Unix does.\r\n   This mode of operation is selected by either setting the environment\r\n   variable POSIXLY_CORRECT, or using `+' as the first character\r\n   of the list of option characters.\r\n\r\n   PERMUTE is the default.  We permute the contents of ARGV as we scan,\r\n   so that eventually all the non-options are at the end.  This allows options\r\n   to be given in any order, even with programs that were not written to\r\n   expect this.\r\n\r\n   RETURN_IN_ORDER is an option available to programs that were written\r\n   to expect options and other ARGV-elements in any order and that care about\r\n   the ordering of the two.  We describe each non-option ARGV-element\r\n   as if it were the argument of an option with character code 1.\r\n   Using `-' as the first character of the list of option characters\r\n   selects this mode of operation.\r\n\r\n   The special argument `--' forces an end of option-scanning regardless\r\n   of the value of `ordering'.  In the case of RETURN_IN_ORDER, only\r\n   `--' can cause `getopt' to return -1 with `optind' != ARGC.  */\r\n\r\nstatic enum {\r\n    REQUIRE_ORDER, PERMUTE, RETURN_IN_ORDER\r\n} ordering;\r\n\r\n/* Value of POSIXLY_CORRECT environment variable.  */\r\nstatic char *posixly_correct;\r\n\f\r\n#ifdef  __GNU_LIBRARY__\r\n/* We want to avoid inclusion of string.h with non-GNU libraries\r\n   because there are many ways it can cause trouble.\r\n   On some systems, it contains special magic macros that don't work\r\n   in GCC.  
*/\r\n# include <string.h>\r\n# define my_index   strchr\r\n#else\r\n\r\n#include <string.h>\r\n\r\n/* Avoid depending on library functions or files\r\n   whose names are inconsistent.  */\r\n\r\n#ifndef getenv\r\nextern char *getenv();\r\n#endif\r\n\r\nstatic char *\r\nmy_index(str, chr)\r\nconst char *str;\r\nint chr;\r\n{\r\n    while (*str) {\r\n        if (*str == chr) {\r\n            return (char *) str;\r\n        }\r\n        str++;\r\n    }\r\n    return 0;\r\n}\r\n\r\n/* If using GCC, we can safely declare strlen this way.\r\n   If not using GCC, it is ok not to declare it.  */\r\n#ifdef __GNUC__\r\n/* Note that Motorola Delta 68k R3V7 comes with GCC but not stddef.h.\r\n   That was relevant to code that was here before.  */\r\n# if (!defined __STDC__ || !__STDC__) && !defined strlen\r\n/* gcc with -traditional declares the built-in strlen to return int,\r\n   and has done so at least since version 2.4.5. -- rms.  */\r\nextern int strlen(const char *);\r\n# endif /* not __STDC__ */\r\n#endif /* __GNUC__ */\r\n\r\n#endif /* not __GNU_LIBRARY__ */\r\n\f\r\n/* Handle permutation of arguments.  */\r\n\r\n/* Describe the part of ARGV that contains non-options that have\r\n   been skipped.  `first_nonopt' is the index in ARGV of the first of them;\r\n   `last_nonopt' is the index after the last of them.  */\r\n\r\nstatic int first_nonopt;\r\nstatic int last_nonopt;\r\n\r\n#ifdef _LIBC\r\n/* Stored original parameters.\r\n   XXX This is no good solution.  We should rather copy the args so\r\n   that we can compare them later.  But we must not use malloc(3).  */\r\nextern int __libc_argc;\r\nextern char **__libc_argv;\r\n\r\n/* Bash 2.0 gives us an environment variable containing flags\r\n   indicating ARGV elements that should not be considered arguments.  
*/\r\n\r\n# ifdef USE_NONOPTION_FLAGS\r\n/* Defined in getopt_init.c  */\r\nextern char *__getopt_nonoption_flags;\r\n\r\nstatic int nonoption_flags_max_len;\r\nstatic int nonoption_flags_len;\r\n# endif\r\n\r\n# ifdef USE_NONOPTION_FLAGS\r\n#  define SWAP_FLAGS(ch1, ch2) \\\r\n    if (nonoption_flags_len > 0)                            \\\r\n    {                                         \\\r\n        char __tmp = __getopt_nonoption_flags[ch1];                 \\\r\n        __getopt_nonoption_flags[ch1] = __getopt_nonoption_flags[ch2];          \\\r\n        __getopt_nonoption_flags[ch2] = __tmp;                      \\\r\n    }\r\n# else\r\n#  define SWAP_FLAGS(ch1, ch2)\r\n# endif\r\n#else   /* !_LIBC */\r\n# define SWAP_FLAGS(ch1, ch2)\r\n#endif  /* _LIBC */\r\n\r\n/* Exchange two adjacent subsequences of ARGV.\r\n   One subsequence is elements [first_nonopt,last_nonopt)\r\n   which contains all the non-options that have been skipped so far.\r\n   The other is elements [last_nonopt,optind), which contains all\r\n   the options processed since those non-options were skipped.\r\n\r\n   `first_nonopt' and `last_nonopt' are relocated so that they describe\r\n   the new indices of the non-options in ARGV after they are moved.  */\r\n\r\n#if defined __STDC__ && __STDC__\r\nstatic void exchange(char **);\r\n#endif\r\n\r\nstatic void\r\nexchange(argv)\r\nchar **argv;\r\n{\r\n    int bottom = first_nonopt;\r\n    int middle = last_nonopt;\r\n    int top = optind;\r\n    char *tem;\r\n\r\n    /* Exchange the shorter segment with the far end of the longer segment.\r\n       That puts the shorter segment into the right place.\r\n       It leaves the longer segment in the right place overall,\r\n       but it consists of two parts that need to be swapped next.  */\r\n\r\n#if defined _LIBC && defined USE_NONOPTION_FLAGS\r\n    /* First make sure the handling of the `__getopt_nonoption_flags'\r\n       string can work normally.  
Our top argument must be in the range\r\n       of the string.  */\r\n    if (nonoption_flags_len > 0 && top >= nonoption_flags_max_len) {\r\n        /* We must extend the array.  The user plays games with us and\r\n        presents new arguments.  */\r\n        char *new_str = malloc(top + 1);\r\n        if (new_str == NULL) {\r\n            nonoption_flags_len = nonoption_flags_max_len = 0;\r\n        } else {\r\n            memset(__mempcpy(new_str, __getopt_nonoption_flags,\r\n                             nonoption_flags_max_len),\r\n                   '\\0', top + 1 - nonoption_flags_max_len);\r\n            nonoption_flags_max_len = top + 1;\r\n            __getopt_nonoption_flags = new_str;\r\n        }\r\n    }\r\n#endif\r\n\r\n    while (top > middle && middle > bottom) {\r\n        if (top - middle > middle - bottom) {\r\n            /* Bottom segment is the short one.  */\r\n            int len = middle - bottom;\r\n            register int i;\r\n\r\n            /* Swap it with the top part of the top segment.  */\r\n            for (i = 0; i < len; i++) {\r\n                tem = argv[bottom + i];\r\n                argv[bottom + i] = argv[top - (middle - bottom) + i];\r\n                argv[top - (middle - bottom) + i] = tem;\r\n                SWAP_FLAGS(bottom + i, top - (middle - bottom) + i);\r\n            }\r\n            /* Exclude the moved bottom segment from further swapping.  */\r\n            top -= len;\r\n        } else {\r\n            /* Top segment is the short one.  */\r\n            int len = top - middle;\r\n            register int i;\r\n\r\n            /* Swap it with the bottom part of the bottom segment.  
*/\r\n            for (i = 0; i < len; i++) {\r\n                tem = argv[bottom + i];\r\n                argv[bottom + i] = argv[middle + i];\r\n                argv[middle + i] = tem;\r\n                SWAP_FLAGS(bottom + i, middle + i);\r\n            }\r\n            /* Exclude the moved top segment from further swapping.  */\r\n            bottom += len;\r\n        }\r\n    }\r\n\r\n    /* Update records for the slots the non-options now occupy.  */\r\n\r\n    first_nonopt += (optind - last_nonopt);\r\n    last_nonopt = optind;\r\n}\r\n\r\n/* Initialize the internal data when the first call is made.  */\r\n\r\n#if defined __STDC__ && __STDC__\r\nstatic const char *_getopt_initialize(int, char *const *, const char *);\r\n#endif\r\nstatic const char *\r\n_getopt_initialize(argc, argv, optstring)\r\nint argc;\r\nchar *const *argv;\r\nconst char *optstring;\r\n{\r\n    /* Start processing options with ARGV-element 1 (since ARGV-element 0\r\n       is the program name); the sequence of previously skipped\r\n       non-option ARGV-elements is empty.  */\r\n\r\n    first_nonopt = last_nonopt = optind;\r\n\r\n    nextchar = NULL;\r\n\r\n    posixly_correct = getenv(\"POSIXLY_CORRECT\");\r\n\r\n    /* Determine how to handle the ordering of options and nonoptions.  
*/\r\n\r\n    if (optstring[0] == '-') {\r\n        ordering = RETURN_IN_ORDER;\r\n        ++optstring;\r\n    } else if (optstring[0] == '+') {\r\n        ordering = REQUIRE_ORDER;\r\n        ++optstring;\r\n    } else if (posixly_correct != NULL) {\r\n        ordering = REQUIRE_ORDER;\r\n    } else {\r\n        ordering = PERMUTE;\r\n    }\r\n\r\n#if defined _LIBC && defined USE_NONOPTION_FLAGS\r\n    if (posixly_correct == NULL\r\n        && argc == __libc_argc && argv == __libc_argv) {\r\n        if (nonoption_flags_max_len == 0) {\r\n            if (__getopt_nonoption_flags == NULL\r\n                || __getopt_nonoption_flags[0] == '\\0') {\r\n                nonoption_flags_max_len = -1;\r\n            } else {\r\n                const char *orig_str = __getopt_nonoption_flags;\r\n                int len = nonoption_flags_max_len = strlen(orig_str);\r\n                if (nonoption_flags_max_len < argc) {\r\n                    nonoption_flags_max_len = argc;\r\n                }\r\n                __getopt_nonoption_flags =\r\n                    (char *) malloc(nonoption_flags_max_len);\r\n                if (__getopt_nonoption_flags == NULL) {\r\n                    nonoption_flags_max_len = -1;\r\n                } else\r\n                    memset(__mempcpy(__getopt_nonoption_flags, orig_str, len),\r\n                           '\\0', nonoption_flags_max_len - len);\r\n            }\r\n        }\r\n        nonoption_flags_len = nonoption_flags_max_len;\r\n    } else {\r\n        nonoption_flags_len = 0;\r\n    }\r\n#endif\r\n\r\n    return optstring;\r\n}\r\n\f\r\n/* Scan elements of ARGV (whose length is ARGC) for option characters\r\n   given in OPTSTRING.\r\n\r\n   If an element of ARGV starts with '-', and is not exactly \"-\" or \"--\",\r\n   then it is an option element.  The characters of this element\r\n   (aside from the initial '-') are option characters.  
If `getopt'\r\n   is called repeatedly, it returns successively each of the option characters\r\n   from each of the option elements.\r\n\r\n   If `getopt' finds another option character, it returns that character,\r\n   updating `optind' and `nextchar' so that the next call to `getopt' can\r\n   resume the scan with the following option character or ARGV-element.\r\n\r\n   If there are no more option characters, `getopt' returns -1.\r\n   Then `optind' is the index in ARGV of the first ARGV-element\r\n   that is not an option.  (The ARGV-elements have been permuted\r\n   so that those that are not options now come last.)\r\n\r\n   OPTSTRING is a string containing the legitimate option characters.\r\n   If an option character is seen that is not listed in OPTSTRING,\r\n   return '?' after printing an error message.  If you set `opterr' to\r\n   zero, the error message is suppressed but we still return '?'.\r\n\r\n   If a char in OPTSTRING is followed by a colon, that means it wants an arg,\r\n   so the following text in the same ARGV-element, or the text of the following\r\n   ARGV-element, is returned in `optarg'.  Two colons mean an option that\r\n   wants an optional arg; if there is text in the current ARGV-element,\r\n   it is returned in `optarg', otherwise `optarg' is set to zero.\r\n\r\n   If OPTSTRING starts with `-' or `+', it requests different methods of\r\n   handling the non-option ARGV-elements.\r\n   See the comments about RETURN_IN_ORDER and REQUIRE_ORDER, above.\r\n\r\n   Long-named options begin with `--' instead of `-'.\r\n   Their names may be abbreviated as long as the abbreviation is unique\r\n   or is an exact match for some defined option.  
If they have an\r\n   argument, it follows the option name in the same ARGV-element, separated\r\n   from the option name by a `=', or else in the next ARGV-element.\r\n   When `getopt' finds a long-named option, it returns 0 if that option's\r\n   `flag' field is nonzero, the value of the option's `val' field\r\n   if the `flag' field is zero.\r\n\r\n   The elements of ARGV aren't really const, because we permute them.\r\n   But we pretend they're const in the prototype to be compatible\r\n   with other systems.\r\n\r\n   LONGOPTS is a vector of `struct option' terminated by an\r\n   element containing a name which is zero.\r\n\r\n   LONGIND returns the index in LONGOPTS of the long-named option found.\r\n   It is only valid when a long-named option has been found by the most\r\n   recent call.\r\n\r\n   If LONG_ONLY is nonzero, '-' as well as '--' can introduce\r\n   long-named options.  */\r\n\r\nint\r\n_getopt_internal(argc, argv, optstring, longopts, longind, long_only)\r\nint argc;\r\nchar *const *argv;\r\nconst char *optstring;\r\nconst struct option *longopts;\r\nint32_t *longind;\r\nint long_only;\r\n{\r\n    int print_errors = opterr;\r\n    if (optstring[0] == ':') {\r\n        print_errors = 0;\r\n    }\r\n\r\n    if (argc < 1) {\r\n        return -1;\r\n    }\r\n\r\n    optarg = NULL;\r\n\r\n    if (optind == 0 || !__getopt_initialized) {\r\n        if (optind == 0) {\r\n            optind = 1;    /* Don't scan ARGV[0], the program name.  */\r\n        }\r\n        optstring = _getopt_initialize(argc, argv, optstring);\r\n        __getopt_initialized = 1;\r\n    }\r\n\r\n    /* Test whether ARGV[optind] points to a non-option argument.\r\n       Either it does not have option syntax, or there is an environment flag\r\n       from the shell indicating it is not an option.  The latter information\r\n       is only used when used in the GNU libc.  
*/\r\n#if defined _LIBC && defined USE_NONOPTION_FLAGS\r\n# define NONOPTION_P (argv[optind][0] != '-' || argv[optind][1] == '\\0'       \\\r\n                      || (optind < nonoption_flags_len                \\\r\n                          && __getopt_nonoption_flags[optind] == '1'))\r\n#else\r\n# define NONOPTION_P (argv[optind][0] != '-' || argv[optind][1] == '\\0')\r\n#endif\r\n\r\n    if (nextchar == NULL || *nextchar == '\\0') {\r\n        /* Advance to the next ARGV-element.  */\r\n\r\n        /* Give FIRST_NONOPT & LAST_NONOPT rational values if OPTIND has been\r\n        moved back by the user (who may also have changed the arguments).  */\r\n        if (last_nonopt > optind) {\r\n            last_nonopt = optind;\r\n        }\r\n        if (first_nonopt > optind) {\r\n            first_nonopt = optind;\r\n        }\r\n\r\n        if (ordering == PERMUTE) {\r\n            /* If we have just processed some options following some non-options,\r\n               exchange them so that the options come first.  */\r\n\r\n            if (first_nonopt != last_nonopt && last_nonopt != optind) {\r\n                exchange((char **) argv);\r\n            } else if (last_nonopt != optind) {\r\n                first_nonopt = optind;\r\n            }\r\n\r\n            /* Skip any additional non-options\r\n               and extend the range of non-options previously skipped.  */\r\n\r\n            while (optind < argc && NONOPTION_P) {\r\n                optind++;\r\n            }\r\n            last_nonopt = optind;\r\n        }\r\n\r\n        /* The special ARGV-element `--' means premature end of options.\r\n        Skip it like a null option,\r\n         then exchange with previous non-options as if it were an option,\r\n         then skip everything else like a non-option.  
*/\r\n\r\n        if (optind != argc && !strcmp(argv[optind], \"--\")) {\r\n            optind++;\r\n\r\n            if (first_nonopt != last_nonopt && last_nonopt != optind) {\r\n                exchange((char **) argv);\r\n            } else if (first_nonopt == last_nonopt) {\r\n                first_nonopt = optind;\r\n            }\r\n            last_nonopt = argc;\r\n\r\n            optind = argc;\r\n        }\r\n\r\n        /* If we have done all the ARGV-elements, stop the scan\r\n        and back over any non-options that we skipped and permuted.  */\r\n\r\n        if (optind == argc) {\r\n            /* Set the next-arg-index to point at the non-options\r\n               that we previously skipped, so the caller will digest them.  */\r\n            if (first_nonopt != last_nonopt) {\r\n                optind = first_nonopt;\r\n            }\r\n            return -1;\r\n        }\r\n\r\n        /* If we have come to a non-option and did not permute it,\r\n        either stop the scan or describe it to the caller and pass it by.  */\r\n\r\n        if (NONOPTION_P) {\r\n            if (ordering == REQUIRE_ORDER) {\r\n                return -1;\r\n            }\r\n            optarg = argv[optind++];\r\n            return 1;\r\n        }\r\n\r\n        /* We have found another option-ARGV-element.\r\n        Skip the initial punctuation.  */\r\n\r\n        nextchar = (argv[optind] + 1\r\n                    + (longopts != NULL && argv[optind][1] == '-'));\r\n    }\r\n\r\n    /* Decode the current option-ARGV-element.  */\r\n\r\n    /* Check whether the ARGV-element is a long option.\r\n\r\n       If long_only and the ARGV-element has the form \"-f\", where f is\r\n       a valid short option, don't consider it an abbreviated form of\r\n       a long option that starts with f.  
Otherwise there would be no\r\n       way to give the -f short option.\r\n\r\n       On the other hand, if there's a long option \"fubar\" and\r\n       the ARGV-element is \"-fu\", do consider that an abbreviation of\r\n       the long option, just like \"--fu\", and not \"-f\" with arg \"u\".\r\n\r\n       This distinction seems to be the most useful approach.  */\r\n\r\n    if (longopts != NULL\r\n        && (argv[optind][1] == '-'\r\n            || (long_only && (argv[optind][2] || !my_index(optstring, argv[optind][1]))))) {\r\n        char *nameend;\r\n        const struct option *p;\r\n        const struct option *pfound = NULL;\r\n        int exact = 0;\r\n        int ambig = 0;\r\n        int indfound = -1;\r\n        int option_index;\r\n\r\n        for (nameend = nextchar; *nameend && *nameend != '='; nameend++)\r\n            /* Do nothing.  */ ;\r\n\r\n        /* Test all long options for either exact match\r\n        or abbreviated matches.  */\r\n        for (p = longopts, option_index = 0; p->name; p++, option_index++)\r\n            if (!strncmp(p->name, nextchar, nameend - nextchar)) {\r\n                if ((unsigned int)(nameend - nextchar)\r\n                    == (unsigned int) strlen(p->name)) {\r\n                    /* Exact match found.  */\r\n                    pfound = p;\r\n                    indfound = option_index;\r\n                    exact = 1;\r\n                    break;\r\n                } else if (pfound == NULL) {\r\n                    /* First nonexact match found.  */\r\n                    pfound = p;\r\n                    indfound = option_index;\r\n                } else if (long_only\r\n                           || pfound->has_arg != p->has_arg\r\n                           || pfound->flag != p->flag\r\n                           || pfound->val != p->val)\r\n                    /* Second or later nonexact match found.  
*/\r\n                {\r\n                    ambig = 1;\r\n                }\r\n            }\r\n\r\n        if (ambig && !exact) {\r\n            if (print_errors)\r\n                fprintf(stderr, _(\"%s: option `%s' is ambiguous\\n\"),\r\n                        argv[0], argv[optind]);\r\n            nextchar += strlen(nextchar);\r\n            optind++;\r\n            optopt = 0;\r\n            return '?';\r\n        }\r\n\r\n        if (pfound != NULL) {\r\n            option_index = indfound;\r\n            optind++;\r\n            if (*nameend) {\r\n                /* Don't test has_arg with >, because some C compilers don't\r\n                allow it to be used on enums.  */\r\n                if (pfound->has_arg) {\r\n                    optarg = nameend + 1;\r\n                } else {\r\n                    if (print_errors) {\r\n                        if (argv[optind - 1][1] == '-')\r\n                            /* --option */\r\n                            fprintf(stderr,\r\n                                    _(\"%s: option `--%s' doesn't allow an argument\\n\"),\r\n                                    argv[0], pfound->name);\r\n                        else\r\n                            /* +option or -option */\r\n                            fprintf(stderr,\r\n                                    _(\"%s: option `%c%s' doesn't allow an argument\\n\"),\r\n                                    argv[0], argv[optind - 1][0], pfound->name);\r\n                    }\r\n\r\n                    nextchar += strlen(nextchar);\r\n\r\n                    optopt = pfound->val;\r\n                    return '?';\r\n                }\r\n            } else if (pfound->has_arg == 1) {\r\n                if (optind < argc) {\r\n                    optarg = argv[optind++];\r\n                } else {\r\n                    if (print_errors)\r\n                        fprintf(stderr,\r\n                                _(\"%s: option `%s' requires an argument\\n\"),\r\n 
                               argv[0], argv[optind - 1]);\r\n                    nextchar += strlen(nextchar);\r\n                    optopt = pfound->val;\r\n                    return optstring[0] == ':' ? ':' : '?';\r\n                }\r\n            }\r\n            nextchar += strlen(nextchar);\r\n            if (longind != NULL) {\r\n                *longind = option_index;\r\n            }\r\n            if (pfound->flag) {\r\n                *(pfound->flag) = pfound->val;\r\n                return 0;\r\n            }\r\n            return pfound->val;\r\n        }\r\n\r\n        /* Can't find it as a long option.  If this is not getopt_long_only,\r\n        or the option starts with '--' or is not a valid short\r\n         option, then it's an error.\r\n         Otherwise interpret it as a short option.  */\r\n        if (!long_only || argv[optind][1] == '-'\r\n            || my_index(optstring, *nextchar) == NULL) {\r\n            if (print_errors) {\r\n                if (argv[optind][1] == '-')\r\n                    /* --option */\r\n                    fprintf(stderr, _(\"%s: unrecognized option `--%s'\\n\"),\r\n                            argv[0], nextchar);\r\n                else\r\n                    /* +option or -option */\r\n                    fprintf(stderr, _(\"%s: unrecognized option `%c%s'\\n\"),\r\n                            argv[0], argv[optind][0], nextchar);\r\n            }\r\n            nextchar = (char *) \"\";\r\n            optind++;\r\n            optopt = 0;\r\n            return '?';\r\n        }\r\n    }\r\n\r\n    /* Look at and handle the next short option-character.  */\r\n\r\n    {\r\n        char c = *nextchar++;\r\n        char *temp = my_index(optstring, c);\r\n\r\n        /* Increment `optind' when we start to process its last character.  
*/\r\n        if (*nextchar == '\\0') {\r\n            ++optind;\r\n        }\r\n\r\n        if (temp == NULL || c == ':') {\r\n            if (print_errors) {\r\n                if (posixly_correct)\r\n                    /* 1003.2 specifies the format of this message.  */\r\n                    fprintf(stderr, _(\"%s: illegal option -- %c\\n\"),\r\n                            argv[0], c);\r\n                else\r\n                    fprintf(stderr, _(\"%s: invalid option -- %c\\n\"),\r\n                            argv[0], c);\r\n            }\r\n            optopt = c;\r\n            return '?';\r\n        }\r\n        /* Convenience. Treat POSIX -W foo same as long option --foo */\r\n        if (temp[0] == 'W' && temp[1] == ';') {\r\n            char *nameend;\r\n            const struct option *p;\r\n            const struct option *pfound = NULL;\r\n            int exact = 0;\r\n            int ambig = 0;\r\n            int indfound = 0;\r\n            int option_index;\r\n\r\n            /* This is an option that requires an argument.  */\r\n            if (*nextchar != '\\0') {\r\n                optarg = nextchar;\r\n                /* If we end this ARGV-element by taking the rest as an arg,\r\n                   we must advance to the next element now.  */\r\n                optind++;\r\n            } else if (optind == argc) {\r\n                if (print_errors) {\r\n                    /* 1003.2 specifies the format of this message.  
*/\r\n                    fprintf(stderr, _(\"%s: option requires an argument -- %c\\n\"),\r\n                            argv[0], c);\r\n                }\r\n                optopt = c;\r\n                if (optstring[0] == ':') {\r\n                    c = ':';\r\n                } else {\r\n                    c = '?';\r\n                }\r\n                return c;\r\n            } else\r\n                /* We already incremented `optind' once;\r\n                   increment it again when taking next ARGV-elt as argument.  */\r\n            {\r\n                optarg = argv[optind++];\r\n            }\r\n\r\n            /* optarg is now the argument, see if it's in the\r\n               table of longopts.  */\r\n\r\n            for (nextchar = nameend = optarg; *nameend && *nameend != '='; nameend++)\r\n                /* Do nothing.  */ ;\r\n\r\n            /* Test all long options for either exact match\r\n               or abbreviated matches.  */\r\n            for (p = longopts, option_index = 0; p->name; p++, option_index++)\r\n                if (!strncmp(p->name, nextchar, nameend - nextchar)) {\r\n                    if ((unsigned int)(nameend - nextchar) == strlen(p->name)) {\r\n                        /* Exact match found.  */\r\n                        pfound = p;\r\n                        indfound = option_index;\r\n                        exact = 1;\r\n                        break;\r\n                    } else if (pfound == NULL) {\r\n                        /* First nonexact match found.  */\r\n                        pfound = p;\r\n                        indfound = option_index;\r\n                    } else\r\n                        /* Second or later nonexact match found.  
*/\r\n                    {\r\n                        ambig = 1;\r\n                    }\r\n                }\r\n            if (ambig && !exact) {\r\n                if (print_errors)\r\n                    fprintf(stderr, _(\"%s: option `-W %s' is ambiguous\\n\"),\r\n                            argv[0], argv[optind]);\r\n                nextchar += strlen(nextchar);\r\n                optind++;\r\n                return '?';\r\n            }\r\n            if (pfound != NULL) {\r\n                option_index = indfound;\r\n                if (*nameend) {\r\n                    /* Don't test has_arg with >, because some C compilers don't\r\n                       allow it to be used on enums.  */\r\n                    if (pfound->has_arg) {\r\n                        optarg = nameend + 1;\r\n                    } else {\r\n                        if (print_errors)\r\n                            fprintf(stderr, _(\"\\\r\n%s: option `-W %s' doesn't allow an argument\\n\"),\r\n                                    argv[0], pfound->name);\r\n\r\n                        nextchar += strlen(nextchar);\r\n                        return '?';\r\n                    }\r\n                } else if (pfound->has_arg == 1) {\r\n                    if (optind < argc) {\r\n                        optarg = argv[optind++];\r\n                    } else {\r\n                        if (print_errors)\r\n                            fprintf(stderr,\r\n                                    _(\"%s: option `%s' requires an argument\\n\"),\r\n                                    argv[0], argv[optind - 1]);\r\n                        nextchar += strlen(nextchar);\r\n                        return optstring[0] == ':' ? 
':' : '?';\r\n                    }\r\n                }\r\n                nextchar += strlen(nextchar);\r\n                if (longind != NULL) {\r\n                    *longind = option_index;\r\n                }\r\n                if (pfound->flag) {\r\n                    *(pfound->flag) = pfound->val;\r\n                    return 0;\r\n                }\r\n                return pfound->val;\r\n            }\r\n            nextchar = NULL;\r\n            return 'W';   /* Let the application handle it.   */\r\n        }\r\n        if (temp[1] == ':') {\r\n            if (temp[2] == ':') {\r\n                /* This is an option that accepts an argument optionally.  */\r\n                if (*nextchar != '\\0') {\r\n                    optarg = nextchar;\r\n                    optind++;\r\n                } else {\r\n                    optarg = NULL;\r\n                }\r\n                nextchar = NULL;\r\n            } else {\r\n                /* This is an option that requires an argument.  */\r\n                if (*nextchar != '\\0') {\r\n                    optarg = nextchar;\r\n                    /* If we end this ARGV-element by taking the rest as an arg,\r\n                       we must advance to the next element now.  */\r\n                    optind++;\r\n                } else if (optind == argc) {\r\n                    if (print_errors) {\r\n                        /* 1003.2 specifies the format of this message.  
*/\r\n                        fprintf(stderr,\r\n                                _(\"%s: option requires an argument -- %c\\n\"),\r\n                                argv[0], c);\r\n                    }\r\n                    optopt = c;\r\n                    if (optstring[0] == ':') {\r\n                        c = ':';\r\n                    } else {\r\n                        c = '?';\r\n                    }\r\n                } else\r\n                    /* We already incremented `optind' once;\r\n                    increment it again when taking next ARGV-elt as argument.  */\r\n                {\r\n                    optarg = argv[optind++];\r\n                }\r\n                nextchar = NULL;\r\n            }\r\n        }\r\n        return c;\r\n    }\r\n}\r\n\r\nint\r\ngetopt(argc, argv, optstring)\r\nint argc;\r\nchar *const *argv;\r\nconst char *optstring;\r\n{\r\n    return _getopt_internal(argc, argv, optstring,\r\n                            (const struct option *) 0,\r\n                            (int32_t *) 0,\r\n                            0);\r\n}\r\n\r\nint\r\ngetopt_long(argc, argv, options, long_options, opt_index)\r\nint argc;\r\nchar *const *argv;\r\nconst char *options;\r\nconst struct option *long_options;\r\nint32_t *opt_index;\r\n{\r\n    return _getopt_internal(argc, argv, options, long_options, opt_index, 0);\r\n}\r\n\r\n#endif  /* Not ELIDE_CODE.  */\r\n\f\r\n#ifdef TEST\r\n\r\n/* Compile with -DTEST to make an executable for use in testing\r\n   the above definition of `getopt'.  */\r\n\r\nint\r\nmain(argc, argv)\r\nint argc;\r\nchar **argv;\r\n{\r\n    int c;\r\n    int digit_optind = 0;\r\n\r\n    while (1) {\r\n        int this_option_optind = optind ? 
optind : 1;\r\n\r\n        c = getopt(argc, argv, \"abc:d:0123456789\");\r\n        if (c == -1) {\r\n            break;\r\n        }\r\n\r\n        switch (c) {\r\n        case '0':\r\n        case '1':\r\n        case '2':\r\n        case '3':\r\n        case '4':\r\n        case '5':\r\n        case '6':\r\n        case '7':\r\n        case '8':\r\n        case '9':\r\n            if (digit_optind != 0 && digit_optind != this_option_optind) {\r\n                printf(\"digits occur in two different argv-elements.\\n\");\r\n            }\r\n            digit_optind = this_option_optind;\r\n            printf(\"option %c\\n\", c);\r\n            break;\r\n\r\n        case 'a':\r\n            printf(\"option a\\n\");\r\n            break;\r\n\r\n        case 'b':\r\n            printf(\"option b\\n\");\r\n            break;\r\n\r\n        case 'c':\r\n            printf(\"option c with value `%s'\\n\", optarg);\r\n            break;\r\n\r\n        case '?':\r\n            break;\r\n\r\n        default:\r\n            printf(\"?? getopt returned character code 0%o ??\\n\", c);\r\n        }\r\n    }\r\n\r\n    if (optind < argc) {\r\n        printf(\"non-option ARGV-elements: \");\r\n        while (optind < argc) {\r\n            printf(\"%s \", argv[optind++]);\r\n        }\r\n        printf(\"\\n\");\r\n    }\r\n\r\n    exit(0);\r\n}\r\n\r\n#endif /* TEST */\r\n"
  },
  {
    "path": "source/test/getopt/getopt.h",
    "content": "/* Declarations for getopt.\r\n   Copyright (C) 1989-1994, 1996-1999, 2001 Free Software Foundation, Inc.\r\n   This file is part of the GNU C Library.\r\n\r\n   The GNU C Library is free software; you can redistribute it and/or\r\n   modify it under the terms of the GNU Lesser General Public\r\n   License as published by the Free Software Foundation; either\r\n   version 2.1 of the License, or (at your option) any later version.\r\n\r\n   The GNU C Library is distributed in the hope that it will be useful,\r\n   but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU\r\n   Lesser General Public License for more details.\r\n\r\n   You should have received a copy of the GNU Lesser General Public\r\n   License along with the GNU C Library; if not, write to the Free\r\n   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA\r\n   02111-1307 USA.  */\r\n\r\n#ifndef _GETOPT_H\r\n\r\n#ifndef __need_getopt\r\n# define _GETOPT_H 1\r\n#endif\r\n\r\n#include<stdint.h>\r\n\r\n/* If __GNU_LIBRARY__ is not already defined, either we are being used\r\n   standalone, or this is the first header included in the source file.\r\n   If we are being used with glibc, we need to include <features.h>, but\r\n   that does not exist if we are standalone.  So: if __GNU_LIBRARY__ is\r\n   not defined, include <ctype.h>, which will pull in <features.h> for us\r\n   if it's from glibc.  (Why ctype.h?  It's guaranteed to exist and it\r\n   doesn't flood the namespace with stuff the way some other headers do.)  */\r\n#if !defined __GNU_LIBRARY__\r\n# include <ctype.h>\r\n#endif\r\n\r\n#ifdef  __cplusplus\r\nextern \"C\" {\r\n#endif\r\n\r\n/* For communication from `getopt' to the caller.\r\n   When `getopt' finds an option that takes an argument,\r\n   the argument value is returned here.\r\n   Also, when `ordering' is RETURN_IN_ORDER,\r\n   each non-option ARGV-element is returned here.  
*/\r\n\r\nextern char *optarg;\r\n\r\n/* Index in ARGV of the next element to be scanned.\r\n   This is used for communication to and from the caller\r\n   and for communication between successive calls to `getopt'.\r\n\r\n   On entry to `getopt', zero means this is the first call; initialize.\r\n\r\n   When `getopt' returns -1, this is the index of the first of the\r\n   non-option elements that the caller should itself scan.\r\n\r\n   Otherwise, `optind' communicates from one call to the next\r\n   how much of ARGV has been scanned so far.  */\r\n\r\nextern int optind;\r\n\r\n/* Callers store zero here to inhibit the error message `getopt' prints\r\n   for unrecognized options.  */\r\n\r\nextern int opterr;\r\n\r\n/* Set to an option character which was unrecognized.  */\r\n\r\nextern int optopt;\r\n\r\n#ifndef __need_getopt\r\n/* Describe the long-named options requested by the application.\r\n   The LONG_OPTIONS argument to getopt_long or getopt_long_only is a vector\r\n   of `struct option' terminated by an element containing a name which is\r\n   zero.\r\n\r\n   The field `has_arg' is:\r\n   no_argument      (or 0) if the option does not take an argument,\r\n   required_argument    (or 1) if the option requires an argument,\r\n   optional_argument    (or 2) if the option takes an optional argument.\r\n\r\n   If the field `flag' is not NULL, it points to a variable that is set\r\n   to the value given in the field `val' when the option is found, but\r\n   left unchanged if the option is not found.\r\n\r\n   To have a long-named option do something other than set an `int' to\r\n   a compiled-in constant, such as set a value from `optarg', set the\r\n   option's `flag' field to zero and its `val' field to a nonzero\r\n   value (the equivalent single-letter option character, if there is\r\n   one).  For long options that have a zero `flag' field, `getopt'\r\n   returns the contents of the `val' field.  
*/\r\n\r\nstruct option {\r\n# if (defined __STDC__ && __STDC__) || defined __cplusplus\r\n    const char *name;\r\n# else\r\n    char *name;\r\n# endif\r\n    /* has_arg can't be an enum because some compilers complain about\r\n       type mismatches in all the code that assumes it is an int.  */\r\n    int has_arg;\r\n    int32_t *flag;\r\n    int val;\r\n};\r\n\r\n/* Names for the values of the `has_arg' field of `struct option'.  */\r\n\r\n# define no_argument        0\r\n# define required_argument  1\r\n# define optional_argument  2\r\n#endif  /* need getopt */\r\n\r\n\r\n/* Get definitions and prototypes for functions to process the\r\n   arguments in ARGV (ARGC of them, minus the program name) for\r\n   options given in OPTS.\r\n\r\n   Return the option character from OPTS just read.  Return -1 when\r\n   there are no more options.  For unrecognized options, or options\r\n   missing arguments, `optopt' is set to the option letter, and '?' is\r\n   returned.\r\n\r\n   The OPTS string is a list of characters which are recognized option\r\n   letters, optionally followed by colons, specifying that that letter\r\n   takes an argument, to be placed in `optarg'.\r\n\r\n   If a letter in OPTS is followed by two colons, its argument is\r\n   optional.  This behavior is specific to the GNU `getopt'.\r\n\r\n   The argument `--' causes premature termination of argument\r\n   scanning, explicitly telling `getopt' that there are no more\r\n   options.\r\n\r\n   If OPTS begins with `--', then non-option arguments are treated as\r\n   arguments to the option '\\0'.  This behavior is specific to the GNU\r\n   `getopt'.  */\r\n\r\n#if (defined __STDC__ && __STDC__) || defined __cplusplus\r\n# ifdef __GNU_LIBRARY__\r\n/* Many other libraries have conflicting prototypes for getopt, with\r\n   differences in the consts, in stdlib.h.  To avoid compilation\r\n   errors, only prototype getopt for the GNU C library.  
*/\r\nextern int getopt(int __argc, char *const *__argv, const char *__shortopts);\r\n# else /* not __GNU_LIBRARY__ */\r\nextern int getopt();\r\n# endif /* __GNU_LIBRARY__ */\r\n\r\n# ifndef __need_getopt\r\nextern int getopt_long(int __argc, char *const *__argv, const char *__shortopts,\r\n                       const struct option *__longopts, int32_t *__longind);\r\nextern int getopt_long_only(int __argc, char *const *__argv,\r\n                            const char *__shortopts,\r\n                            const struct option *__longopts, int32_t *__longind);\r\n\r\n/* Internal only.  Users should not call this directly.  */\r\nextern int _getopt_internal(int __argc, char *const *__argv,\r\n                            const char *__shortopts,\r\n                            const struct option *__longopts, int32_t *__longind,\r\n                            int __long_only);\r\n# endif\r\n#else /* not __STDC__ */\r\nextern int getopt();\r\n# ifndef __need_getopt\r\nextern int getopt_long();\r\nextern int getopt_long_only();\r\n\r\nextern int _getopt_internal();\r\n# endif\r\n#endif /* __STDC__ */\r\n\r\n#ifdef  __cplusplus\r\n}\r\n#endif\r\n\r\n/* Make sure we later can get all the definitions and declarations.  */\r\n#undef __need_getopt\r\n\r\n#endif /* getopt.h */\r\n"
  },
  {
    "path": "source/test/inputstream.h",
    "content": "/*\r\n * inputstream.h\r\n *\r\n * Description of this file:\r\n *    Inputstream Processing functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_CHECKFRAME_H\r\n#define DAVS2_CHECKFRAME_H\r\n\r\n#include \"utils.h\"\r\n\r\n#include <stdio.h>\r\n#include <memory.h>\r\n#include <math.h>\r\n#include <time.h>\r\n#include <assert.h>\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\n#define ISPIC(x)  ((x) == 0xB3 || (x) == 0xB6)\r\n#define ISUNIT(x) ((x) == 0xB0 || (x) == 0xB1 || (x) == 0xB7 || ISPIC(x))\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic __inline \r\nconst uint8_t *\r\nfind_start_code(const uint8_t 
*data, int len) \r\n{\r\n    /* look for the 00 00 01 start-code prefix byte by byte\r\n     * (avoids unaligned, endian-dependent int loads) */\r\n    while (len >= 4 && !(data[0] == 0x00 && data[1] == 0x00 && data[2] == 0x01)) {\r\n        ++data;\r\n        --len;\r\n    }\r\n\r\n    return len >= 4 ? data : 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic int\r\ncheck_frame(const uint8_t *data, int len)\r\n{\r\n    const uint8_t *p;\r\n    const uint8_t *data0 = data;\r\n    const int      len0  = len;\r\n\r\n    while (((p = find_start_code(data, len)) != 0) && !ISUNIT(p[3])) {\r\n        len -= (int)(p - data + 4);\r\n        data = p + 4;\r\n    }\r\n\r\n    return (int)(p ? p - data0 : len0 + 1);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic int\r\nfind_one_frame(uint8_t * data, int len, int *start, int *end)\r\n{\r\n    if ((*start = check_frame(data, len)) > len) {\r\n        return -1;\r\n    }\r\n\r\n    if ((*end = check_frame(data + *start + 4, len - *start - 4)) <= len) {\r\n        *end += *start + 4;\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic int\r\ncount_frames(uint8_t *data, int size)\r\n{\r\n    int count = 0;\r\n    int start, end;\r\n\r\n    for (;;) {\r\n        if (find_one_frame(data, size, &start, &end) < 0) {\r\n            break;\r\n        }\r\n\r\n        if (ISPIC(data[start + 3])) {\r\n            count++;\r\n        }\r\n\r\n        data += end;\r\n        size -= end;\r\n    }\r\n\r\n    return count;\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n*/\r\nstatic int \r\nread_input_file(davs2_input_param_t *p_param, uint8_t **data, int *size, int *frames, float errrate)\r\n{\r\n    /* get size of input file */\r\n    fseek(p_param->g_infile, 0, SEEK_END);\r\n    *size = ftell(p_param->g_infile);\r\n    fseek(p_param->g_infile, 0, SEEK_SET);\r\n\r\n    /* memory for stream buffer */\r\n    if 
((*data = (uint8_t *)calloc(*size + 1024, sizeof(uint8_t))) == NULL) {\r\n        show_message(CONSOLE_RED, \"failed to alloc memory for input file.\\n\");\r\n        return -1;\r\n    }\r\n\r\n    /* read stream data */\r\n    if (fread(*data, *size, 1, p_param->g_infile) < 1) {\r\n        show_message(CONSOLE_RED, \"failed to read input file.\\n\");\r\n        free(*data);\r\n        *data = NULL;\r\n        return -1;\r\n    }\r\n\r\n    if (errrate != 0) {\r\n        show_message(CONSOLE_WHITE, \"noise interfering is enabled:\\n\");\r\n    }\r\n\r\n    /* get total frames */\r\n    *frames = count_frames(*data, *size);\r\n\r\n    return 0;\r\n}\r\n\r\n#endif /// DAVS2_CHECKFRAME_H\r\n"
  },
  {
    "path": "source/test/md5.h",
    "content": "/*\r\n * md5.h\r\n *\r\n * Description of this file:\r\n *    MD5 calculate function of davs2.\r\n * \r\n */\r\n\r\n/* The copyright in this software is being made available under the BSD\r\n * License, included below. This software may be subject to other third party\r\n * and contributor rights, including patent rights, and no such rights are\r\n * granted under this license.\r\n *\r\n *  Copyright (c) 2002-2016, Audio Video coding Standard Workgroup of China\r\n * All rights reserved.\r\n *\r\n * Redistribution and use in source and binary forms, with or without\r\n * modification, are permitted provided that the following conditions are met:\r\n *\r\n *  * Redistributions of source code must retain the above copyright notice,\r\n *    this list of conditions and the following disclaimer.\r\n *  * Redistributions in binary form must reproduce the above copyright notice,\r\n *    this list of conditions and the following disclaimer in the documentation\r\n *     and/or other materials provided with the distribution.\r\n *  * Neither the name of Audio Video coding Standard Workgroup of China\r\n *    nor the names of its contributors maybe used to endorse or promote products\r\n *    derived from this software without\r\n *    specific prior written permission.\r\n *\r\n * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\r\n * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\r\n * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE\r\n * ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS\r\n * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR\r\n * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF\r\n * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS\r\n * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN\r\n * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)\r\n * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF\r\n * THE POSSIBILITY OF SUCH DAMAGE.\r\n */\r\n\r\n/*\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             Huiwen REN <hwrenx@126.com>\r\n *             etc.\r\n *\r\n */\r\n\r\n#ifndef DAVS2_MD5_H\r\n#define DAVS2_MD5_H\r\n\r\n#include <stdio.h>\r\n#include <stdlib.h>\r\n#include <string.h>\r\n\r\n#define F(x, y, z) (((x) & (y)) | ((~x) & (z)))\r\n#define G(x, y, z) (((x) & (z)) | ((y) & (~z)))\r\n#define H(x, y, z) ((x) ^ (y) ^ (z))\r\n#define I(x, y, z) ((y) ^ ((x) | (~z)))\r\n\r\n#define RL(x, y) (((x) << (y)) | ((x) >> (32 - (y))))\r\n\r\n#define PP(x) (x<<24)|((x<<8)&0xff0000)|((x>>8)&0xff00)|(x>>24)\r\n\r\n#define FF(a, b, c, d, x, s, ac) a = b + (RL((a + F(b,c,d) + x + ac),s))\r\n#define GG(a, b, c, d, x, s, ac) a = b + (RL((a + G(b,c,d) + x + ac),s))\r\n#define HH(a, b, c, d, x, s, ac) a = b + (RL((a + H(b,c,d) + x + ac),s))\r\n#define II(a, b, c, d, x, s, ac) a = b + (RL((a + I(b,c,d) + x + ac),s))\r\n\r\nvoid md5(unsigned int *pA, unsigned int *pB, unsigned int *pC, unsigned int *pD, unsigned int x[16])\r\n{\r\n    unsigned int a, b, c, d;\r\n    a = *pA;\r\n    b = *pB;\r\n    c = *pC;\r\n    d = *pD;\r\n    /**//* Round 1 */\r\n    FF(a, b, c, d, x[ 0],  7, 0xd76aa478); /**/ /* 1 */\r\n    FF(d, a, b, c, x[ 
1], 12, 0xe8c7b756); /**/ /* 2 */\r\n    FF(c, d, a, b, x[ 2], 17, 0x242070db); /**/ /* 3 */\r\n    FF(b, c, d, a, x[ 3], 22, 0xc1bdceee); /**/ /* 4 */\r\n    FF(a, b, c, d, x[ 4],  7, 0xf57c0faf); /**/ /* 5 */\r\n    FF(d, a, b, c, x[ 5], 12, 0x4787c62a); /**/ /* 6 */\r\n    FF(c, d, a, b, x[ 6], 17, 0xa8304613); /**/ /* 7 */\r\n    FF(b, c, d, a, x[ 7], 22, 0xfd469501); /**/ /* 8 */\r\n    FF(a, b, c, d, x[ 8],  7, 0x698098d8); /**/ /* 9 */\r\n    FF(d, a, b, c, x[ 9], 12, 0x8b44f7af); /**/ /* 10 */\r\n    FF(c, d, a, b, x[10], 17, 0xffff5bb1); /**/ /* 11 */\r\n    FF(b, c, d, a, x[11], 22, 0x895cd7be); /**/ /* 12 */\r\n    FF(a, b, c, d, x[12],  7, 0x6b901122); /**/ /* 13 */\r\n    FF(d, a, b, c, x[13], 12, 0xfd987193); /**/ /* 14 */\r\n    FF(c, d, a, b, x[14], 17, 0xa679438e); /**/ /* 15 */\r\n    FF(b, c, d, a, x[15], 22, 0x49b40821); /**/ /* 16 */\r\n\r\n    /**//* Round 2 */\r\n    GG(a, b, c, d, x[ 1],  5, 0xf61e2562); /**/ /* 17 */\r\n    GG(d, a, b, c, x[ 6],  9, 0xc040b340); /**/ /* 18 */\r\n    GG(c, d, a, b, x[11], 14, 0x265e5a51); /**/ /* 19 */\r\n    GG(b, c, d, a, x[ 0], 20, 0xe9b6c7aa); /**/ /* 20 */\r\n    GG(a, b, c, d, x[ 5],  5, 0xd62f105d); /**/ /* 21 */\r\n    GG(d, a, b, c, x[10],  9, 0x02441453); /**/ /* 22 */\r\n    GG(c, d, a, b, x[15], 14, 0xd8a1e681); /**/ /* 23 */\r\n    GG(b, c, d, a, x[ 4], 20, 0xe7d3fbc8); /**/ /* 24 */\r\n    GG(a, b, c, d, x[ 9],  5, 0x21e1cde6); /**/ /* 25 */\r\n    GG(d, a, b, c, x[14],  9, 0xc33707d6); /**/ /* 26 */\r\n    GG(c, d, a, b, x[ 3], 14, 0xf4d50d87); /**/ /* 27 */\r\n    GG(b, c, d, a, x[ 8], 20, 0x455a14ed); /**/ /* 28 */\r\n    GG(a, b, c, d, x[13],  5, 0xa9e3e905); /**/ /* 29 */\r\n    GG(d, a, b, c, x[ 2],  9, 0xfcefa3f8); /**/ /* 30 */\r\n    GG(c, d, a, b, x[ 7], 14, 0x676f02d9); /**/ /* 31 */\r\n    GG(b, c, d, a, x[12], 20, 0x8d2a4c8a); /**/ /* 32 */\r\n\r\n    /**//* Round 3 */\r\n    HH(a, b, c, d, x[ 5],  4, 0xfffa3942); /**/ /* 33 */\r\n    HH(d, a, b, c, x[ 8], 11, 0x8771f681); /**/ /* 
34 */\r\n    HH(c, d, a, b, x[11], 16, 0x6d9d6122); /**/ /* 35 */\r\n    HH(b, c, d, a, x[14], 23, 0xfde5380c); /**/ /* 36 */\r\n    HH(a, b, c, d, x[ 1],  4, 0xa4beea44); /**/ /* 37 */\r\n    HH(d, a, b, c, x[ 4], 11, 0x4bdecfa9); /**/ /* 38 */\r\n    HH(c, d, a, b, x[ 7], 16, 0xf6bb4b60); /**/ /* 39 */\r\n    HH(b, c, d, a, x[10], 23, 0xbebfbc70); /**/ /* 40 */\r\n    HH(a, b, c, d, x[13],  4, 0x289b7ec6); /**/ /* 41 */\r\n    HH(d, a, b, c, x[ 0], 11, 0xeaa127fa); /**/ /* 42 */\r\n    HH(c, d, a, b, x[ 3], 16, 0xd4ef3085); /**/ /* 43 */\r\n    HH(b, c, d, a, x[ 6], 23, 0x04881d05); /**/ /* 44 */\r\n    HH(a, b, c, d, x[ 9],  4, 0xd9d4d039); /**/ /* 45 */\r\n    HH(d, a, b, c, x[12], 11, 0xe6db99e5); /**/ /* 46 */\r\n    HH(c, d, a, b, x[15], 16, 0x1fa27cf8); /**/ /* 47 */\r\n    HH(b, c, d, a, x[ 2], 23, 0xc4ac5665); /**/ /* 48 */\r\n\r\n    /**//* Round 4 */\r\n    II(a, b, c, d, x[ 0],  6, 0xf4292244); /**/ /* 49 */\r\n    II(d, a, b, c, x[ 7], 10, 0x432aff97); /**/ /* 50 */\r\n    II(c, d, a, b, x[14], 15, 0xab9423a7); /**/ /* 51 */\r\n    II(b, c, d, a, x[ 5], 21, 0xfc93a039); /**/ /* 52 */\r\n    II(a, b, c, d, x[12],  6, 0x655b59c3); /**/ /* 53 */\r\n    II(d, a, b, c, x[ 3], 10, 0x8f0ccc92); /**/ /* 54 */\r\n    II(c, d, a, b, x[10], 15, 0xffeff47d); /**/ /* 55 */\r\n    II(b, c, d, a, x[ 1], 21, 0x85845dd1); /**/ /* 56 */\r\n    II(a, b, c, d, x[ 8],  6, 0x6fa87e4f); /**/ /* 57 */\r\n    II(d, a, b, c, x[15], 10, 0xfe2ce6e0); /**/ /* 58 */\r\n    II(c, d, a, b, x[ 6], 15, 0xa3014314); /**/ /* 59 */\r\n    II(b, c, d, a, x[13], 21, 0x4e0811a1); /**/ /* 60 */\r\n    II(a, b, c, d, x[ 4],  6, 0xf7537e82); /**/ /* 61 */\r\n    II(d, a, b, c, x[11], 10, 0xbd3af235); /**/ /* 62 */\r\n    II(c, d, a, b, x[ 2], 15, 0x2ad7d2bb); /**/ /* 63 */\r\n    II(b, c, d, a, x[ 9], 21, 0xeb86d391); /**/ /* 64 */\r\n\r\n    *pA += a;\r\n    *pB += b;\r\n    *pC += c;\r\n    *pD += d;\r\n\r\n}\r\n\r\nlong long FileMD5(const char *filename, unsigned int md5value[4])\r\n{\r\n   
 FILE *p_infile = NULL;\r\n\r\n    long long i;\r\n    unsigned int flen[2];\r\n    long long len;\r\n    unsigned int A, B, C, D;\r\n    unsigned int x[16];\r\n    size_t read_size;\r\n    memset(md5value, 0, 4 * sizeof(unsigned int));\r\n\r\n    /* nothing to hash without a file name */\r\n    if (filename == NULL || strlen(filename) == 0) {\r\n        return 0;\r\n    }\r\n\r\n    if ((p_infile = fopen(filename, \"rb\")) == NULL) {\r\n        show_message(CONSOLE_RED, \"Input file %s does not exist\\n\", filename);\r\n        return 0;\r\n    }\r\n\r\n    fseek(p_infile, 0, SEEK_END);\r\n    len = ftell(p_infile);\r\n    fseek(p_infile, 0, SEEK_SET);\r\n\r\n    if (len == -1) {\r\n        show_message(CONSOLE_RED, \"Input file %s is too large to calculate md5!\\n\", filename);\r\n        fclose(p_infile);\r\n        return 0;\r\n    }\r\n\r\n    A = 0x67452301, B = 0xefcdab89, C = 0x98badcfe, D = 0x10325476;\r\n    /* message length in bits, split into two 32-bit words */\r\n    flen[1] = (unsigned int)(len / 0x20000000);\r\n    flen[0] = (unsigned int)((len % 0x20000000) * 8);\r\n\r\n    /* read one 64-byte block; a short read leaves the tail zero-filled */\r\n    memset(x, 0, 64);\r\n    read_size = fread(x, 1, 64, p_infile);\r\n    if (read_size != 64) {\r\n        if (!feof(p_infile) && ferror(p_infile)) {\r\n            show_message(CONSOLE_RED, \"Error reading file %s!\\n\", filename);\r\n            clearerr(p_infile);\r\n        }\r\n    }\r\n\r\n    for (i = 0; i < len / 64; i++) {\r\n        md5(&A, &B, &C, &D, x);\r\n        memset(x, 0, 64);\r\n        read_size = fread(x, 1, 64, p_infile);\r\n        if (read_size != 64) {\r\n            if (!feof(p_infile) && ferror(p_infile)) {\r\n                show_message(CONSOLE_RED, \"Error reading file %s!\\n\", filename);\r\n                clearerr(p_infile);\r\n            }\r\n        }\r\n    }\r\n    /* MD5 padding: a single 1 bit right after the message */\r\n    ((unsigned char *)x)[len % 64] = 0x80;\r\n    if (len % 64 > 55) {\r\n        md5(&A, &B, &C, &D, x);\r\n        memset(x, 0, 64);\r\n    }\r\n    memcpy(x + 14, flen, 8);\r\n    md5(&A, &B, &C, &D, x);\r\n\r\n    fclose(p_infile);\r\n\r\n    md5value[0] = PP(A);\r\n    md5value[1] = PP(B);\r\n    md5value[2] = PP(C);\r\n    
md5value[3] = PP(D);\r\n    return len;\r\n}\r\n\r\n#endif   // DAVS2_MD5_H\r\n"
  },
  {
    "path": "source/test/parse_args.h",
    "content": "﻿/*\r\n * parse_args.h\r\n *\r\n * Description of this file:\r\n *    Argument Parsing functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_GETOPT_H\r\n#define DAVS2_GETOPT_H\r\n\r\n#include <stdio.h>\r\n#include <string.h>\r\n#include <getopt.h>\r\n#if _WIN32\r\n#include <io.h>\r\n#include <fcntl.h>\r\n#endif\r\n#include \"utils.h\"\r\n\r\ntypedef struct davs2_input_param_t {\r\n    const char *s_infile;\r\n    const char *s_outfile;\r\n    const char *s_recfile;\r\n    const char *s_md5;\r\n\r\n    int g_verbose;\r\n    int g_psnr;\r\n    int g_threads;\r\n    int b_y4m;     // Y4M or YUV\r\n\r\n    FILE *g_infile;\r\n    FILE *g_recfile;\r\n    FILE *g_outfile;\r\n} davs2_input_param_t;\r\n\r\n#if 
defined(__ICL) || defined(_MSC_VER)\r\n#define strcasecmp              _stricmp\r\n#endif\r\n\r\n/* Options that take an argument must be followed by a colon */\r\nstatic const char *optString = \"i:o:r:m:t:vh?\";\r\n\r\nstatic const struct option longOpts[] = {\r\n    {\"input\",   required_argument, NULL, 'i'},\r\n    {\"output\",  required_argument, NULL, 'o'},\r\n    {\"psnr\",    required_argument, NULL, 'r'},\r\n    {\"md5\",     required_argument, NULL, 'm'},\r\n    {\"threads\", required_argument, NULL, 't'},\r\n    {\"verbose\", no_argument, NULL, 'v'},\r\n    {\"help\",    no_argument, NULL, 'h'},\r\n    {NULL, no_argument, NULL, 0}\r\n};\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void display_usage(void)\r\n{\r\n    /* Description of the command-line parameters */\r\n    const char * usage = \"usage: davs2 -i avs2file -o outputfile [-r recfile] [-t threads] [-v]\";\r\n\r\n    show_message(CONSOLE_RED, \"davs2 parameters\\n    %s\\n\", usage);\r\n    show_message(CONSOLE_RED, \"+------------------+-------------+-------------------------------------------+\\n\");\r\n    show_message(CONSOLE_RED, \"|     Parameter    |    Alias    |                  Settings                 |\\n\");\r\n    show_message(CONSOLE_RED, \"+------------------+-------------+-------------------------------------------+\\n\");\r\n    show_message(CONSOLE_RED, \"| --input=test.avs | -i test.avs | input bitstream file path                 |\\n\");\r\n    show_message(CONSOLE_RED, \"| --output=dec.yuv | -o dec.yuv  | output YUV/Y4M file path                  |\\n\");\r\n    show_message(CONSOLE_RED, \"| --psnr=rec.yuv   | -r rec.yuv  | reference reconstruction YUV file         |\\n\");\r\n    show_message(CONSOLE_RED, \"| --threads=N      | -t N        | threads for decoding (default: 1)         |\\n\");\r\n    show_message(CONSOLE_RED, \"| --md5=M          | -m M        | Reference MD5 of decoded YUV              |\\n\");\r\n    show_message(CONSOLE_RED, \"| --verbose        | -v          | 
Enable decoding status every frame        |\\n\");\r\n    show_message(CONSOLE_RED, \"| --help           | -h          | Show this help message                    |\\n\");\r\n    show_message(CONSOLE_RED, \"+------------------+-------------+-------------------------------------------+\\n\");\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic int parse_args(davs2_input_param_t *p_param, int argc, char **argv)\r\n{\r\n    char title[1024] = {0};\r\n    int i;\r\n    int opt = 0;\r\n    int longIndex = 0;\r\n    for (i = 0; i < argc; ++i) {\r\n        size_t used = strlen(title);\r\n        /* stop before the echoed command line overflows the title buffer */\r\n        if (used + strlen(argv[i]) + 2 > sizeof(title)) {\r\n            break;\r\n        }\r\n        sprintf(&title[used], \"%s \", argv[i]);\r\n    }\r\n    show_message(CONSOLE_WHITE, \"%s\\n\\n\", title);\r\n\r\n    if (argc < 2) {\r\n        display_usage();\r\n        return -1;\r\n    }\r\n\r\n    /* Initialize the parameters before we get to work. */\r\n    p_param->s_infile  = NULL;\r\n    p_param->s_outfile = NULL;\r\n    p_param->s_recfile = NULL;\r\n    p_param->s_md5     = NULL;\r\n    p_param->g_infile  = NULL;\r\n    p_param->g_outfile = NULL;\r\n    p_param->g_recfile = NULL;\r\n    p_param->g_verbose = 0;\r\n    p_param->g_psnr    = 0;\r\n    p_param->g_threads = 1;\r\n    p_param->b_y4m     = 0;\r\n\r\n    opt = getopt_long(argc, argv, optString, longOpts, &longIndex);\r\n    while (opt != -1) {\r\n        switch (opt) {\r\n        case 'i':\r\n            p_param->s_infile = optarg;\r\n            break;\r\n        case 'o':\r\n            p_param->s_outfile = optarg;\r\n            break;\r\n        case 'r':\r\n            p_param->s_recfile = optarg;\r\n            break;\r\n        case 'm':\r\n            p_param->s_md5 = optarg;\r\n            break;\r\n        case 'v':\r\n            p_param->g_verbose = 1;\r\n            break;\r\n        case 't':\r\n            p_param->g_threads = atoi(optarg);\r\n            break;\r\n        case 'h':   /* fall-through is intentional */\r\n        case '?':\r\n            
display_usage();\r\n            return -1;\r\n        case 0:     /* long option without a short arg */\r\n            break;\r\n        default:\r\n            /* You won't actually get here. */\r\n            break;\r\n        }\r\n\r\n        opt = getopt_long(argc, argv, optString, longOpts, &longIndex);\r\n    }\r\n\r\n    if (p_param->s_infile == NULL) {\r\n        display_usage();\r\n        show_message(CONSOLE_RED, \"missing input file.\\n\");\r\n        return -1;\r\n    }\r\n\r\n    p_param->g_infile  = fopen(p_param->s_infile, \"rb\");\r\n\r\n    if (p_param->s_recfile != NULL) {\r\n        p_param->g_recfile = fopen(p_param->s_recfile, \"rb\");\r\n    }\r\n\r\n    if (p_param->s_outfile != NULL) {\r\n        if (!strcmp(p_param->s_outfile, \"stdout\")) {\r\n            p_param->g_outfile = stdout;\r\n        } else {\r\n            p_param->g_outfile = fopen(p_param->s_outfile, \"wb\");\r\n        }\r\n    } else if (p_param->g_outfile == NULL) {\r\n        display_usage();\r\n        show_message(CONSOLE_RED, \"WARN: missing output file.\\n\");\r\n    }\r\n\r\n    /* open stream file */\r\n    if (p_param->g_infile == NULL) {\r\n        show_message(CONSOLE_RED, \"ERROR: failed to open input file: %s\\n\", p_param->s_infile);\r\n        return -1;\r\n    }\r\n\r\n    /* open rec file */\r\n    if (p_param->s_recfile != NULL && p_param->g_recfile == NULL) {\r\n        show_message(CONSOLE_RED, \"ERROR: failed to open reference file: %s\\n\", p_param->s_recfile);\r\n    }\r\n    p_param->g_psnr = (p_param->g_recfile != NULL);\r\n\r\n    /* open output file */\r\n    if (p_param->s_outfile != NULL && p_param->g_outfile == NULL) {\r\n        show_message(CONSOLE_RED, \"ERROR: failed to open output file: %s\\n\", p_param->s_outfile);\r\n    } else if (p_param->s_outfile != NULL) {   /* s_outfile may be NULL when no output was requested */\r\n        int l = (int)strlen(p_param->s_outfile);\r\n        if (l > 4) {\r\n            if (!strcmp(p_param->s_outfile + l - 4, \".y4m\")) {\r\n                p_param->b_y4m = 1;\r\n            }\r\n        
}\r\n        if (p_param->g_outfile == stdout) {\r\n#if _WIN32\r\n            setmode(fileno(stdout), O_BINARY);\r\n#endif\r\n            p_param->b_y4m = 1;\r\n        }\r\n    }\r\n\r\n    /* get md5 */\r\n    if (p_param->s_md5 && strlen(p_param->s_md5) != 32) {\r\n        show_message(CONSOLE_RED, \"ERROR: invalid md5 value\");\r\n    }\r\n\r\n    show_message(CONSOLE_WHITE, \"--------------------------------------------------\\n\");\r\n    show_message(CONSOLE_WHITE, \" AVS2 file       : %s\\n\", p_param->s_infile);\r\n    show_message(CONSOLE_WHITE, \" Reference file  : %s\\n\", p_param->s_recfile);\r\n    show_message(CONSOLE_WHITE, \" Output file     : %s\\n\", p_param->s_outfile);\r\n    show_message(CONSOLE_WHITE, \"--------------------------------------------------\\n\");\r\n\r\n    return 0;\r\n}\r\n\r\n#endif /// DAVS2_GETOPT_H\r\n"
  },
  {
    "path": "source/test/psnr.h",
    "content": "/*\r\n * psnr.h\r\n *\r\n * Description of this file:\r\n *    PSNR Calculating functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_PSNR_H\r\n#define DAVS2_PSNR_H\r\n\r\n#ifdef _MSC_VER\r\n#undef fseek\r\n#define fseek               _fseeki64\r\n#else  //! 
for linux\r\n#define _FILE_OFFSET_BITS   64       // for 64 bit fseeko\r\n#define fseek               fseeko\r\n#endif\r\n\r\n#include <math.h>\r\n#include <stdio.h>\r\n#include <stdlib.h>\r\n#include <assert.h>\r\n#if HAVE_STDINT_H\r\n#include <stdint.h>\r\n#else\r\n#include <inttypes.h>\r\n#endif\r\n\r\nint g_width = 0;\r\nint g_lines = 0;\r\nint b_output_error_position = 1;\r\n\r\ndouble g_sum_psnr_y = 0.0;\r\ndouble g_sum_psnr_u = 0.0;\r\ndouble g_sum_psnr_v = 0.0;\r\n\r\nuint8_t *g_recbuf = NULL;\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic __inline uint64_t\r\ncal_ssd_16bit(int width, int height, uint16_t *rec, int rec_stride, uint16_t *dst, int dst_stride)\r\n{\r\n    uint64_t d = 0;\r\n    int i, j;\r\n\r\n    if (rec_stride == dst_stride) {\r\n        if (memcmp(dst, rec, rec_stride * height * 2) == 0) {\r\n            return 0;\r\n        }\r\n    }\r\n\r\n    for (j = 0; j < height; j++) {\r\n        for (i = 0; i < width; i++) {\r\n            int t = dst[i] - rec[i];\r\n            d += t * t;\r\n        }\r\n\r\n        rec += rec_stride;\r\n        dst += dst_stride;\r\n    }\r\n\r\n    return d;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic __inline uint64_t\r\ncal_ssd_8bit(int width, int height, uint8_t *rec, int rec_stride, uint8_t *dst, int dst_stride)\r\n{\r\n    uint64_t d = 0;\r\n    int i, j;\r\n\r\n    if (rec_stride == dst_stride) {\r\n        if (memcmp(dst, rec, rec_stride * height) == 0) {\r\n            return 0;\r\n        }\r\n    }\r\n\r\n    for (j = 0; j < height; j++) {\r\n        for (i = 0; i < width; i++) {\r\n            int t = dst[i] - rec[i];\r\n            d += t * t;\r\n        }\r\n\r\n        rec += rec_stride;\r\n        dst += dst_stride;\r\n    }\r\n\r\n    return d;\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Function   : calculate the SSD of 2 
frames\r\n * Parameters :\r\n *      [in ] : width      - width   of frame\r\n *            : height     - height  of frame\r\n *            : rec        - pointer to reconstructed frame buffer\r\n *            : rec_stride - stride  of reconstructed frame\r\n *            : dst        - pointer to decoded frame buffer\r\n *            : dst_stride - stride  of decoded frame\r\n *      [out] : none\r\n * Return     : SSD of 2 frames\r\n * ---------------------------------------------------------------------------\r\n */\r\nstatic __inline uint64_t\r\ncal_ssd(int width, int height, uint8_t *rec, int rec_stride, uint8_t *dst, int dst_stride, int bytes_per_sample)\r\n{\r\n    if (bytes_per_sample == 2) {\r\n        return cal_ssd_16bit(width, height, (uint16_t *)rec, rec_stride, (uint16_t *)dst, dst_stride >> 1);\r\n    } else {\r\n        return cal_ssd_8bit(width, height, rec, rec_stride, dst, dst_stride);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nfind_first_mismatch_point_16bit(int width, int height, uint16_t *rec, int rec_stride, uint16_t *dst, int dst_stride, int *x, int *y)\r\n{\r\n    int i, j;\r\n\r\n    *x = -1;\r\n    *y = -1;\r\n\r\n    for (j = 0; j < height; j++) {\r\n        for (i = 0; i < width; i++) {\r\n            int t = dst[i] - rec[i];\r\n            if (t != 0) {\r\n                *x = i;\r\n                *y = j;\r\n                return;     /* stop at the first mismatch; a break would keep scanning later rows */\r\n            }\r\n        }\r\n\r\n        rec += rec_stride;\r\n        dst += dst_stride;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nfind_first_mismatch_point_8bit(int width, int height, uint8_t *rec, int rec_stride, uint8_t *dst, int dst_stride, int *x, int *y)\r\n{\r\n    int i, j;\r\n\r\n    *x = -1;\r\n    *y = -1;\r\n\r\n    for (j = 0; j < height; j++) {\r\n        for (i = 0; i < width; i++) {\r\n            int t = dst[i] - 
rec[i];\r\n            if (t != 0) {\r\n                *x = i;\r\n                *y = j;\r\n                return;     /* stop at the first mismatch; a break would keep scanning later rows */\r\n            }\r\n        }\r\n\r\n        rec += rec_stride;\r\n        dst += dst_stride;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * Function   : find the first mismatch point of 2 frames\r\n * Parameters :\r\n *      [in ] : width      - width   of frame\r\n *            : height     - height  of frame\r\n *            : rec        - pointer to reconstructed frame buffer\r\n *            : rec_stride - stride  of reconstructed frame\r\n *            : dst        - pointer to decoded frame buffer\r\n *            : dst_stride - stride  of decoded frame\r\n *      [out] : x, y       - position of first mismatch point, (-1, -1) if none\r\n * Return     : none\r\n * ---------------------------------------------------------------------------\r\n */\r\nstatic void\r\nfind_first_mismatch_point(int width, int height, uint8_t *rec, int rec_stride, uint8_t *dst, int dst_stride, int bytes_per_sample, int *x, int *y)\r\n{\r\n    if (bytes_per_sample == 2) {\r\n        find_first_mismatch_point_16bit(width, height, (uint16_t *)rec, rec_stride, (uint16_t *)dst, dst_stride >> 1, x, y);\r\n    } else {\r\n        find_first_mismatch_point_8bit(width, height, rec, rec_stride, dst, dst_stride, x, y);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\ndouble get_psnr_with_ssd(double f_max, uint64_t diff)\r\n{\r\n    if (diff > 0) {\r\n        return 10.0 * log10(f_max / diff);\r\n    } else {\r\n        return 0;\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n* Function   : calculate and output the psnr (only for YUV 4:2:0)\r\n* Parameters :\r\n*      [in ] : rec    - pointer to buffer of reconstructed picture\r\n*            : dst    - pointer to buffer of decoded picture\r\n*            : width  - width  of 
picture\r\n*            : height - height of picture\r\n*      [out] : none\r\n* Return     : void\r\n* ---------------------------------------------------------------------------\r\n*/\r\nint \r\ncal_psnr(int number, uint8_t *dst[3], int strides[3], FILE *f_rec, int width, int height, int num_planes,\r\n         double *psnr_y, double *psnr_u, double *psnr_v, int bytes_per_sample, int bit_depth)\r\n{\r\n    int stride_ref = width;          /* stride of frame/field (luma) */\r\n    int size_l = width * height; /* size   of frame/field (luma) */\r\n    uint8_t *p1;                 /* pointer to buffer of reconstructed picture */\r\n    uint8_t *p2;                 /* pointer to buffer of decoded picture */\r\n    uint64_t diff;               /* difference between decoded and reconstructed picture */\r\n    size_t size_frame = num_planes == 3 ? (bytes_per_sample * size_l * 3) >> 1 : (bytes_per_sample * size_l); //solve warning C4018 \r\n    double f_max_signal = ((1 << bit_depth) - 1) * ((1 << bit_depth) - 1);\r\n    int64_t frameno = number;\r\n\r\n    *psnr_y = *psnr_u = *psnr_v = 0.f;\r\n\r\n    if (width != g_width || height != g_lines) {\r\n        if (g_recbuf) {\r\n            free(g_recbuf);\r\n            g_recbuf = NULL;\r\n        }\r\n\r\n        g_recbuf = (uint8_t *)malloc(size_frame);\r\n\r\n        if (g_recbuf == NULL) {\r\n            return -1;\r\n        }\r\n\r\n        g_width = width;\r\n        g_lines = height;\r\n    }\r\n\r\n    if (g_recbuf == 0) {\r\n        return -1;\r\n    }\r\n\r\n    fseek(f_rec, size_frame * frameno, SEEK_SET);\r\n\r\n    if (fread(g_recbuf, 1, size_frame, f_rec) < size_frame) {\r\n        return -1;\r\n    }\r\n\r\n    p1 = g_recbuf;\r\n    p2 = dst[0];\r\n\r\n    diff = cal_ssd(width, height, p1, stride_ref, p2, strides[0], bytes_per_sample);\r\n    *psnr_y = get_psnr_with_ssd(f_max_signal * size_l, diff);\r\n    g_sum_psnr_y += *psnr_y;\r\n    if (diff != 0 && b_output_error_position) {\r\n        int x, y;\r\n   
     find_first_mismatch_point(width, height, p1, stride_ref, p2, strides[0], bytes_per_sample, &x, &y);\r\n        show_message(CONSOLE_RED, \"mismatch POC: %3d, Y(%d, %d)\\n\", number, x, y);\r\n        b_output_error_position = 0;\r\n    }\r\n\r\n    if (num_planes == 3) {\r\n        width >>= 1;               // width  of frame/field  (chroma)\r\n        height >>= 1;              // height of frame/field  (chroma, with padding)\r\n        stride_ref >>= 1;          // stride of frame/field  (chroma)\r\n\r\n        /* PSNR U */\r\n        p1 += size_l * bytes_per_sample;\r\n        p2 = dst[1];\r\n\r\n        diff = cal_ssd(width, height, p1, stride_ref, p2, strides[1], bytes_per_sample);\r\n        *psnr_u = get_psnr_with_ssd(f_max_signal * size_l, diff << 2);\r\n        g_sum_psnr_u += *psnr_u;\r\n        if (diff != 0 && b_output_error_position) {\r\n            int x, y;\r\n            find_first_mismatch_point(width, height, p1, stride_ref, p2, strides[1], bytes_per_sample, &x, &y);\r\n            show_message(CONSOLE_RED, \"mismatch POC: %3d, U (%d, %d) => Y(%d, %d)\\n\", number, x, y, 2 * x, 2 * y);\r\n            b_output_error_position = 0;\r\n        }\r\n\r\n        /* PSNR V */\r\n        p1 += (size_l * bytes_per_sample) >> 2;\r\n        p2 = dst[2];\r\n\r\n        diff = cal_ssd(width, height, p1, stride_ref, p2, strides[2], bytes_per_sample);\r\n        *psnr_v = get_psnr_with_ssd(f_max_signal * size_l, diff << 2);\r\n        g_sum_psnr_v += *psnr_v;\r\n        if (diff != 0 && b_output_error_position) {\r\n            int x, y;\r\n            find_first_mismatch_point(width, height, p1, stride_ref, p2, strides[2], bytes_per_sample, &x, &y);\r\n            show_message(CONSOLE_RED, \"mismatch POC: %3d, V (%d, %d) => Y(%d, %d)\\n\", number, x, y, 2 * x, 2 * y);\r\n            b_output_error_position = 0;\r\n        }\r\n    }\r\n\r\n    return 0;\r\n}\r\n\r\n#endif /// DAVS2_PSNR_H\r\n"
  },
  {
    "path": "source/test/test.c",
    "content": "/*\r\n * test.c\r\n *\r\n * Description of this file:\r\n *    test the AVS2 Video Decoder  davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#if defined(_MSC_VER)\r\n#define WIN32_LEAN_AND_MEAN\r\n#define _CRT_NONSTDC_NO_DEPRECATE\r\n#define _CRT_SECURE_NO_DEPRECATE\r\n#define _CRT_SECURE_NO_WARNINGS\r\n#endif\r\n\r\n#include <stdio.h>\r\n#include <string.h>\r\n#include <assert.h>\r\n\r\n#include \"davs2.h\"\r\n#include \"utils.h\"\r\n#include \"psnr.h\"\r\n#include \"parse_args.h\"\r\n#include \"inputstream.h\"\r\n#include \"md5.h\"\r\n\r\n#if defined(_MSC_VER)\r\n#pragma comment(lib, \"libdavs2.lib\")\r\n#endif\r\n\r\n#if defined(__cplusplus)\r\nextern \"C\" {\r\n#endif  /* __cplusplus */\r\n\r\n/**\r\n * 
===========================================================================\r\n * macro defines\r\n * ===========================================================================\r\n */\r\n#define CTRL_LOOP_DEC_FILE    0   /* decode the same ES file in a loop */\r\n\r\n/* ---------------------------------------------------------------------------\r\n * disable warning C4100: : unreferenced formal parameter\r\n */\r\n#ifndef UNREFERENCED_PARAMETER\r\n#if defined(_MSC_VER) || defined(__INTEL_COMPILER)\r\n#define UNREFERENCED_PARAMETER(v) (v)\r\n#else\r\n#define UNREFERENCED_PARAMETER(v) (void)(v)\r\n#endif\r\n#endif\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * global variables\r\n * ===========================================================================\r\n */\r\nint g_frmcount = 0;\r\nint g_psnrfail = 0;\r\nunsigned int   MD5val[4];\r\nchar           MD5str[33];\r\n\r\ndavs2_input_param_t inputparam = {\r\n    NULL, NULL, NULL, NULL, 0, 0, 0, 0\r\n};\r\n\r\n\r\n/**\r\n * ===========================================================================\r\n * function defines\r\n * ===========================================================================\r\n */\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic \r\nvoid output_decoded_frame(davs2_picture_t *pic, davs2_seq_info_t *headerset, int ret_type, int num_frames)\r\n{\r\n    static char IMGTYPE[] = {'I', 'P', 'B', 'G', 'F', 'S', '\\x0'};\r\n    double psnr_y = 0.0f, psnr_u = 0.0f, psnr_v = 0.0f;\r\n\r\n    if (headerset == NULL) {\r\n        return;\r\n    }\r\n\r\n    if (pic == NULL || ret_type == DAVS2_GOT_HEADER) {\r\n        show_message(CONSOLE_GREEN,\r\n            \"  Sequence size: %dx%d, %d/%d-bit %.3lf Hz. 
ProfileLevel: 0x%x-0x%x\\n\\n\", \r\n            headerset->width, headerset->height, \r\n            headerset->internal_bit_depth, headerset->output_bit_depth,\r\n            headerset->frame_rate,\r\n            headerset->profile_id, headerset->level_id);\r\n\r\n        if (inputparam.b_y4m) {\r\n            static const int FRAME_RATE[9][2] = {\r\n                { 0, 1},  // invalid\r\n                { 24000, 1001 },\r\n                { 24, 1 },\r\n                { 25, 1 },\r\n                { 30000, 1001 },\r\n                { 30, 1 }, \r\n                { 50, 1 },\r\n                { 60000, 1001 },\r\n                { 60, 1 }\r\n            };\r\n            int fps_num = FRAME_RATE[headerset->frame_rate_id][0];\r\n            int fps_den = FRAME_RATE[headerset->frame_rate_id][1];\r\n            write_y4m_header(inputparam.g_outfile, headerset->width, headerset->height,\r\n                             fps_num, fps_den, headerset->output_bit_depth);\r\n        }\r\n        return;\r\n    }\r\n\r\n    if (inputparam.g_psnr) {\r\n        int ret = cal_psnr(pic->pic_order_count, pic->planes, pic->strides, inputparam.g_recfile,\r\n                           pic->widths[0], pic->lines[0], pic->num_planes,\r\n                           &psnr_y, &psnr_u, &psnr_v, \r\n                           pic->bytes_per_sample, pic->bit_depth);\r\n        int psnr = (psnr_y != 0 || psnr_u != 0 || psnr_v != 0);\r\n\r\n        if (ret < 0) {\r\n            g_psnrfail = 1;\r\n            show_message(CONSOLE_RED, \"failed to cal psnr for frame %d(%d).\\t\\t\\t\\t\\n\", g_frmcount, pic->pic_order_count);\r\n        } else {\r\n            if (inputparam.g_verbose || psnr) {\r\n                show_message(psnr ? 
CONSOLE_RED : CONSOLE_WHITE,\r\n                    \"%5d(%d)\\t(%c) %3d\\t%8.4lf %8.4lf %8.4lf \\t%6lld %6lld\\n\", \r\n                    g_frmcount, pic->pic_order_count,\r\n                    IMGTYPE[pic->type], pic->qp, psnr_y, psnr_u, psnr_v,\r\n                    pic->pts, pic->dts);\r\n            }\r\n        }\r\n    } else if (inputparam.g_verbose) {\r\n        show_message(CONSOLE_WHITE,\r\n            \"%5d(%d)\\t(%c)\\t%3d\\n\", g_frmcount, pic->pic_order_count, IMGTYPE[pic->type], pic->qp);\r\n    }\r\n\r\n    g_frmcount++;\r\n\r\n    if (inputparam.g_verbose == 0) {\r\n        show_progress(g_frmcount, num_frames);\r\n    }\r\n\r\n    if (inputparam.g_outfile) {\r\n        write_frame(pic, inputparam.g_outfile, inputparam.b_y4m);\r\n    }\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n * data_buf   - pointer to bitstream buffer\r\n * data_len   - number of bytes in bitstream buffer\r\n * num_frames - number of frames in bitstream buffer\r\n */\r\nvoid test_decoder(uint8_t *data_buf, int data_len, int num_frames, char *dst)\r\n{\r\n    const double f_time_fac = 1.0 / (double)CLOCKS_PER_SEC;\r\n    davs2_param_t    param;      // decoding parameters\r\n    davs2_packet_t   packet;     // input bitstream\r\n    davs2_picture_t  out_frame;  // output data, frame data\r\n    davs2_seq_info_t headerset;  // output data, sequence header\r\n    int got_frame;\r\n\r\n#if CTRL_LOOP_DEC_FILE\r\n    uint8_t *bak_data_buf = data_buf;\r\n    int      bak_data_len = data_len;\r\n    int      num_loop     = 5;      // number of decoding loops\r\n#endif\r\n    int64_t time0, time1;\r\n    void *decoder;\r\n\r\n    const uint8_t *data = data_buf;\r\n    const uint8_t *data_next_start_code;\r\n    int user_dts = 0; // only used to check the returning value of DTS and PTS\r\n\r\n    /* init the decoder */\r\n    memset(&param, 0, sizeof(param));\r\n    param.threads      = inputparam.g_threads;\r\n    param.opaque       = (void 
*)(intptr_t)num_frames;\r\n    param.info_level   = DAVS2_LOG_DEBUG;\r\n    param.disable_avx  = 0; // on some platforms, disabling AVX (setting this to 1) may be faster\r\n\r\n    decoder = davs2_decoder_open(&param);\r\n\r\n    time0 = get_time();\r\n\r\n    /* do decoding */\r\n    for (;;) {\r\n        int len;\r\n\r\n        data_next_start_code = find_start_code(data + 4, data_len - 4);\r\n\r\n        if (data_next_start_code) {\r\n            len = (int)(data_next_start_code - data);\r\n        } else {\r\n            len = data_len;\r\n        }\r\n\r\n        packet.data = data;\r\n        packet.len  = len;\r\n\r\n        // set PTS/DTS, used only to check that they are passed through the decoder correctly\r\n        packet.pts  =  user_dts;\r\n        packet.dts  = -user_dts;\r\n        user_dts++;\r\n\r\n        got_frame = davs2_decoder_send_packet(decoder, &packet);\r\n        if (got_frame == DAVS2_ERROR) {\r\n            show_message(CONSOLE_RED, \"Error: a decoder error occurred\\n\");\r\n            break;\r\n        }\r\n\r\n        got_frame = davs2_decoder_recv_frame(decoder, &headerset, &out_frame);\r\n        if (got_frame != DAVS2_DEFAULT) {\r\n            output_decoded_frame(&out_frame, &headerset, got_frame, num_frames);\r\n            davs2_decoder_frame_unref(decoder, &out_frame);\r\n        }\r\n\r\n        data_len -= len;\r\n        data += len; // could not be [data = data_next_start_code]\r\n\r\n        if (!data_len) {\r\n#if CTRL_LOOP_DEC_FILE\r\n            data_len = bak_data_len;\r\n            data = data_buf;\r\n            num_loop--;\r\n            if (num_loop <= 0) {\r\n                break;\r\n            }\r\n#else\r\n            break;              /* end of bitstream */\r\n#endif\r\n        }\r\n    }\r\n\r\n    /* flush the decoder */\r\n    for (;;) {\r\n        got_frame = davs2_decoder_flush(decoder, &headerset, &out_frame);\r\n        if (got_frame == DAVS2_ERROR || got_frame == DAVS2_END) {\r\n            
break;\r\n        }\r\n        if (got_frame != DAVS2_DEFAULT) {\r\n            output_decoded_frame(&out_frame, &headerset, got_frame, num_frames);\r\n            davs2_decoder_frame_unref(decoder, &out_frame);\r\n        }\r\n    }\r\n\r\n    time1 = get_time();\r\n\r\n    /* close the decoder */\r\n    davs2_decoder_close(decoder);\r\n\r\n    /* statistics */\r\n    show_message(CONSOLE_WHITE, \"\\n--------------------------------------------------\\n\");\r\n\r\n    show_message(CONSOLE_GREEN, \"total frames: %d/%d\\n\", g_frmcount, num_frames);\r\n    if (inputparam.g_psnr) {\r\n        if (g_psnrfail == 0 && g_frmcount != 0) {\r\n            show_message(CONSOLE_GREEN,\r\n                         \"average PSNR:\\t%8.4f, %8.4f, %8.4f\\n\\n\", \r\n                         g_sum_psnr_y / g_frmcount, g_sum_psnr_u / g_frmcount, g_sum_psnr_v / g_frmcount);\r\n\r\n            sprintf(dst, \"  Frames: %d/%d\\n  TIME : %.3lfs, %6.2lf fps\\n  PSNR : %8.4f, %8.4f, %8.4f\\n\",\r\n                    g_frmcount, num_frames,\r\n                    (double)((time1 - time0) * f_time_fac),\r\n                    (double)(g_frmcount / ((time1 - time0) * f_time_fac)),\r\n                    g_sum_psnr_y / g_frmcount, g_sum_psnr_u / g_frmcount, g_sum_psnr_v / g_frmcount);\r\n        } else {\r\n            show_message(CONSOLE_RED, \"average PSNR:\\tNaN, \\tNaN, \\tNaN\\n\\n\"); /* 'NaN' for 'Not a Number' */\r\n        }\r\n    }\r\n\r\n    show_message(CONSOLE_GREEN, \"total decoding time: %.3lfs, %6.2lf fps\\n\", \r\n        (double)((time1 - time0) * f_time_fac), \r\n        (double)(g_frmcount / ((time1 - time0) * f_time_fac)));\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nint main(int argc, char *argv[])\r\n{\r\n    char dst[1024] = \"> no decode data\\n\";\r\n    uint8_t *data = NULL;\r\n    clock_t tm_start = clock();\r\n    int size;\r\n    int frames;\r\n\r\n\r\n    memset(MD5val, 0, 16);\r\n    memset(MD5str, 0, 
33);\r\n\r\n    /* parse params */\r\n    if (parse_args(&inputparam, argc, argv) < 0) {\r\n        sprintf(dst, \"Failed to parse input parameters\\n\");\r\n        goto fail;\r\n    }\r\n\r\n    /* read input data */\r\n    if (read_input_file(&inputparam, &data, &size, &frames, 0.0f) < 0) {\r\n        sprintf(dst, \"Failed to read input bit-stream or create output file\\n\");\r\n        goto fail;\r\n    }\r\n\r\n    /* test decoding */\r\n    test_decoder(data, size, frames, dst);\r\n\r\n    show_message(CONSOLE_WHITE, \"\\n Decoder Total Time: %.3lf s\\n\", (clock() - tm_start) / (double)(CLOCKS_PER_SEC));\r\n\r\nfail:\r\n    /* tidy up */\r\n    if (data) {\r\n        free(data);\r\n    }\r\n\r\n    if (g_recbuf) {\r\n        free(g_recbuf);\r\n    }\r\n\r\n    if (inputparam.g_infile) {\r\n        fclose(inputparam.g_infile);\r\n    }\r\n\r\n    if (inputparam.g_recfile) {\r\n        fclose(inputparam.g_recfile);\r\n    }\r\n\r\n    if (inputparam.g_outfile) {\r\n        fclose(inputparam.g_outfile);\r\n    }\r\n\r\n    /* calculate MD5 */\r\n    if (inputparam.s_md5 && strlen(inputparam.s_md5) == 32) {\r\n        FileMD5(inputparam.s_outfile, MD5val);\r\n        sprintf (MD5str,\"%08X%08X%08X%08X\", MD5val[0], MD5val[1], MD5val[2], MD5val[3]);\r\n        if (strcmp(MD5str,inputparam.s_md5)) {\r\n            show_message(CONSOLE_RED, \"\\n  MD5 match failed\\n\");\r\n            show_message(CONSOLE_WHITE, \"  Input  MD5 : %s \\n\", inputparam.s_md5);\r\n            show_message(CONSOLE_WHITE, \"  Output MD5 : %s \\n\", MD5str);\r\n        } else {\r\n            show_message(CONSOLE_WHITE, \"\\n  MD5 match success \\n\");\r\n        }\r\n    }\r\n\r\n    show_message(CONSOLE_WHITE, \" Decoder Exit, Time: %.3lf s\\n\", (clock() - tm_start) / (double)(CLOCKS_PER_SEC));\r\n    return 0;\r\n}\r\n\r\n#if defined(__cplusplus)\r\n}\r\n#endif  /* __cplusplus */\r\n"
  },
  {
    "path": "source/test/utils.h",
    "content": "/*\r\n * utils.h\r\n *\r\n * Description of this file:\r\n *    functions definition of the davs2 library\r\n *\r\n * --------------------------------------------------------------------------\r\n *\r\n *    davs2 - video decoder of AVS2/IEEE1857.4 video coding standard\r\n *    Copyright (C) 2018~ VCL, NELVT, Peking University\r\n *\r\n *    Authors: Falei LUO <falei.luo@gmail.com>\r\n *             etc.\r\n *\r\n *    This program is free software; you can redistribute it and/or modify\r\n *    it under the terms of the GNU General Public License as published by\r\n *    the Free Software Foundation; either version 2 of the License, or\r\n *    (at your option) any later version.\r\n *\r\n *    This program is distributed in the hope that it will be useful,\r\n *    but WITHOUT ANY WARRANTY; without even the implied warranty of\r\n *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r\n *    GNU General Public License for more details.\r\n *\r\n *    You should have received a copy of the GNU General Public License\r\n *    along with this program; if not, write to the Free Software\r\n *    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.\r\n *\r\n *    This program is also available under a commercial proprietary license.\r\n *    For more information, contact us at sswang @ pku.edu.cn.\r\n */\r\n\r\n#ifndef DAVS2_UTILS_H\r\n#define DAVS2_UTILS_H\r\n\r\n#include <stdio.h>\r\n#include <stdlib.h>\r\n#include <stdarg.h>\r\n\r\n#include \"davs2.h\"\r\n\r\n#define CONSOLE_WHITE  0\r\n#define CONSOLE_YELLOW 1\r\n#define CONSOLE_RED    2\r\n#define CONSOLE_GREEN  3\r\n\r\n#if __ANDROID__\r\n#include <jni.h>\r\n#include <android/log.h>\r\n#define LOGE(format,...) 
__android_log_print(ANDROID_LOG_ERROR,\"davs2\", format,##__VA_ARGS__)\r\n#endif\r\n\r\n#if _WIN32\r\n#include <sys/types.h>\r\n#include <sys/timeb.h>\r\n#include <windows.h>\r\n#else\r\n#include <sys/time.h>\r\n#endif\r\n#include <time.h>\r\n\r\n/* ---------------------------------------------------------------------------\r\n * time */\r\nstatic __inline int64_t get_time()\r\n{\r\n#if _WIN32\r\n    struct timeb tb;\r\n    ftime(&tb);\r\n    return ((int64_t)tb.time * CLOCKS_PER_SEC + (int64_t)tb.millitm);\r\n#else\r\n    struct timeval tv_date;\r\n    gettimeofday(&tv_date, NULL);\r\n    return (int64_t)(tv_date.tv_sec * CLOCKS_PER_SEC + (int64_t)tv_date.tv_usec);\r\n#endif\r\n}\r\n\r\n#if _WIN32\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic __inline void set_font_color(int color)\r\n{\r\n    WORD colors[] = {\r\n        FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE,\r\n        FOREGROUND_INTENSITY | FOREGROUND_RED | FOREGROUND_GREEN,\r\n        FOREGROUND_INTENSITY | FOREGROUND_RED,\r\n        FOREGROUND_INTENSITY | FOREGROUND_GREEN,\r\n    };\r\n    SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE), colors[color]);\r\n}\r\n#endif\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic void show_message(int color, const char *format, ...)\r\n{\r\n    char message[1024] = { 0 };\r\n\r\n    va_list arg_ptr;\r\n    va_start(arg_ptr, format);\r\n\r\n    vsprintf(message, format, arg_ptr);\r\n\r\n    va_end(arg_ptr);\r\n\r\n#if _WIN32\r\n    set_font_color(color); /* set color */\r\n    fprintf(stderr, \"%s\", message);\r\n    set_font_color(0);     /* restore to white color */\r\n\r\n#elif __ANDROID__\r\n    LOGE(\"%s\", message);\r\n#else\r\n    fprintf(stderr, \"%s\", message);\r\n#endif\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic __inline void show_progress(int frame, int frames)\r\n{\r\n   
 static int64_t first_time = 0;\r\n    static int64_t last_time  = 0;\r\n    float fps       = 0.0f;\r\n    int64_t total_time  = 0;\r\n    int64_t cur_time    = get_time();\r\n    int eta;\r\n\r\n    if (first_time == 0) {\r\n        first_time = cur_time;\r\n    } else {\r\n        total_time = cur_time - first_time;\r\n        fps = frame * 1.0f / total_time * CLOCKS_PER_SEC;\r\n    }\r\n\r\n    if (cur_time - last_time < 300 && frame != frames) {\r\n        return;\r\n    }\r\n\r\n    last_time = cur_time;\r\n\r\n    eta = (int)((frames - frame) * total_time / frame) / (CLOCKS_PER_SEC / 1000);\r\n\r\n    show_message(CONSOLE_WHITE, \"\\r frames: %4d/%4d,  fps: %4.1f, LeftTime: %8.3f sec\\r\",\r\n                 frame, frames, fps, eta * 0.001);\r\n}\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid write_frame_plane(FILE *fp_out, const uint8_t *p_src, int img_w, int img_h, int bytes_per_sample, int i_stride)\r\n{\r\n    const int size_line = img_w * bytes_per_sample;\r\n    int i;\r\n\r\n    for (i = 0; i < img_h; i++) {\r\n        fwrite(p_src, size_line, 1, fp_out);\r\n        p_src += i_stride;\r\n    }\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic \r\nvoid write_y4m_header(FILE *fp, int w, int h, int fps_num, int fps_den, int bit_depth)\r\n{\r\n    static int b_y4m_header_write = 0;\r\n\r\n    if (fp != NULL && !b_y4m_header_write) {\r\n        char buf[64];\r\n\r\n        if (bit_depth != 8) {\r\n            sprintf(buf, \"YUV4MPEG2 W%d H%d F%d:%d Ip C%sp%d\\n\",\r\n                    w, h, fps_num, fps_den, \"420\", bit_depth);\r\n            fwrite(buf, 1, strlen(buf), fp);\r\n        } else {\r\n            sprintf(buf, \"YUV4MPEG2 W%d H%d F%d:%d Ip C%s\\n\",\r\n                    w, h, fps_num, fps_den, \"420\");\r\n            fwrite(buf, 1, strlen(buf), fp);\r\n        }\r\n\r\n        b_y4m_header_write = 1;\r\n    
}\r\n}\r\n\r\n\r\n/* ---------------------------------------------------------------------------\r\n */\r\nstatic\r\nvoid write_frame(davs2_picture_t *pic, FILE *fp, int b_y4m)\r\n{\r\n    const int bytes_per_sample = pic->bytes_per_sample;\r\n\r\n    if (b_y4m) {\r\n        const char *s_frm = \"FRAME\\n\";\r\n        fwrite(s_frm, 1, strlen(s_frm), fp);\r\n    }\r\n\r\n    /* write y */\r\n    write_frame_plane(fp, pic->planes[0], pic->widths[0], pic->lines[0], bytes_per_sample, pic->strides[0]);\r\n\r\n    if (pic->num_planes == 3) {\r\n        /* write u */\r\n        write_frame_plane(fp, pic->planes[1], pic->widths[1], pic->lines[1], bytes_per_sample, pic->strides[1]);\r\n\r\n        /* write v */\r\n        write_frame_plane(fp, pic->planes[2], pic->widths[2], pic->lines[2], bytes_per_sample, pic->strides[2]);\r\n    }\r\n}\r\n\r\n#endif /// DAVS2_UTILS_H\r\n"
  },
  {
    "path": "version.sh",
    "content": "#!/bin/sh\n\n# ============================================================================\n# File:\n#   version.sh\n#   - get version of repository and generate the file version.h\n# Author:\n#   Falei LUO <falei.luo@gmail.com>\n# ============================================================================\n\n# setting API version\napi=`grep '#define DAVS2_BUILD' < ./source/davs2.h | sed 's/^.* \\([1-9][0-9]*\\).*$/\\1/'`\nVER_R=0\nVER_SHA='not-in-git-tree'\n\n# get version of remote origin/master and local HEAD\nif [ -d .git ] && command -v git >/dev/null 2>&1 ; then\n    VER_R=`git rev-list --count origin/master`\n    VER_SHA=`git rev-parse HEAD | cut -c -16`\nfi\n\n# generate version numbers\nVER_MAJOR=`echo $(($api / 10))`\nVER_MINOR=`echo $(($api % 10))`\n\n# date and time information\nBUILD_TIME=`date \"+%Y-%m-%d %H:%M:%S\"`\n\n# generate the file version.h\necho \"// ===========================================================================\"  > version.h\necho \"// version.h\"                                                                   >> version.h\necho \"// - collection of version numbers\"                                             >> version.h\necho \"//\"                                                                             >> version.h\necho \"// Author:  Falei LUO <falei.luo@gmail.com>\"                                    >> version.h\necho \"//\"                                                                             >> version.h\necho \"// ===========================================================================\" >> version.h\necho \"\"                                                                               >> version.h\necho \"#ifndef DAVS2_VERSION_H\"                                                        >> version.h\necho \"#define DAVS2_VERSION_H\"                                                        >> version.h\necho \"\"                                                                            
   >> version.h\necho \"// version number\"                                                              >> version.h\necho \"#define VER_MAJOR         $VER_MAJOR     // major version number\"               >> version.h\necho \"#define VER_MINOR         $VER_MINOR     // minor version number\"               >> version.h\necho \"#define VER_BUILD         $VER_R    // build number\"                            >> version.h\necho \"#define VER_SHA_STR       \\\"$VER_SHA\\\"  // commit id\"                           >> version.h\necho \"\"                                                                               >> version.h\necho \"// stringify\"                                                                   >> version.h\necho \"#define _TOSTR(x)       #x            // stringify x\"                           >> version.h\necho \"#define TOSTR(x)        _TOSTR(x)     // stringify x, perform macro expansion\"  >> version.h\necho \"\"                                                                               >> version.h\necho \"// define XVERSION string\"                                                      >> version.h\necho \"#define XVERSION        VER_MAJOR, VER_MINOR, VER_BUILD\"                        >> version.h\necho \"#define XVERSION_STR    TOSTR(VER_MAJOR) \\\".\\\" TOSTR(VER_MINOR) \\\".\\\" TOSTR(VER_BUILD) \\\" \\\" VER_SHA_STR\" >> version.h\necho \"#define XBUILD_TIME     \\\"$BUILD_TIME\\\"\"                                        >> version.h\necho \"\"                                                                               >> version.h\necho \"#endif // DAVS2_VERSION_H\"                                                      >> version.h\n\nmv version.h source/version.h\n\n# show version information\necho \"#define DAVS2_BUILD      $api\"\necho \"#define DAVS2_POINTVER \\\"$VER_MAJOR.$VER_MINOR.$VER_R\\\"\"\n"
  }
]