[
  {
    "path": "README.md",
    "content": "# Introduction\n\n## speed_histogram_algorithm_framework\n\n- A local-histogram acceleration framework. Internally it uses some approximate computation and SSE instruction-set acceleration, and it can quickly run median filtering, max filtering, min filtering, surface blur, and similar algorithms.\n\n## resources\n- Resources related to SSE optimization.\n\n#### The test PC has a 64-bit I5-3230 CPU.\n\n#### The OpenCV version is 3.4.0.\n\n\n\n- sse_implementation_of_common_functions_in_image_processing.cpp SSE implementations of several functions commonly used in image processing.\n- speed_rgb2gray_sse.cpp Uses SSE to accelerate RGB-to-grayscale conversion; the fastest version in the table is about 4.6x faster than the original implementation. Algorithm write-up: https://mp.weixin.qq.com/s/SagVQ5gfXWWA7NATv-zvBQ . Timing results:\n\n>Test CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz\n\n| Resolution | Optimization | Iterations | Time |\n| --------- | ---------------------------------------- | -------- | ---- |\n| 4032x3024 | Original implementation | 1000 | 12.139ms |\n| 4032x3024 | Version 1 (float -> int) | 1000 | 7.629ms |\n| 4032x3024 | OpenCV built-in function | 1000 | 4.287ms |\n| 4032x3024 | Version 2 (manual 4-way parallelism) | 1000 | 10.528ms |\n| 4032x3024 | Version 3 (OpenMP, 4 threads) | 1000 | 7.632ms |\n| 4032x3024 | Version 4 (SSE, 12 pixels per step) | 1000 | 5.579ms |\n| 4032x3024 | Version 5 (SSE, 15 pixels per step) | 1000 | 5.843ms |\n| 4032x3024 | Version 6 (AVX2, 10 pixels per step) | 1000 | 3.576ms |\n| 4032x3024 | Version 7 (AVX2 + std::async) | 1000 | 2.626ms |\n\n\n\n- speed_vibrance_algorithm.cpp Uses SSE to accelerate the vibrance (natural saturation) algorithm, for about a 9x speedup. Algorithm write-up: https://mp.weixin.qq.com/s/26UVvqMNLgnquXY21Xu3OQ . Timing results:\n\n|Resolution|Optimization|Iterations|Time|\n|----|----|----|----|\n|4032x3024|Original implementation|100|115.36ms|\n|4032x3024|Version 1|100|62.43ms|\n|4032x3024|Version 2 (4 threads)|100|28.89ms|\n|4032x3024|Version 3 (SSE)|100|12.69ms|\n\n\n\n- speed_sobel_edgedetection_sse.cpp Uses SSE to accelerate Sobel edge detection; the fastest version in the table is about 22x faster than the plain implementation. Algorithm write-up: https://mp.weixin.qq.com/s/5lCfO_jmSfP7DbsgM7qbpg . Timing results:\n\n|Resolution|Optimization|Iterations|Time|\n|-|-|-|-|\n|4032x3024|Plain implementation|1000|126.54 ms|\n|4032x3024|Float -> int + lookup table|1000|81.62 ms|\n|4032x3024|SSE version 1|1000|34.95 ms|\n|4032x3024|SSE version 2|1000|28.87 ms|\n|4032x3024|AVX2 version 1|1000|15.42 ms|\n|4032x3024|AVX2 + std::async|1000|5.69 ms|\n\n- speed_skin_detection_sse.cpp Uses SSE to accelerate skin-color detection; the fastest version in the table is about 8.8x faster than the plain implementation. Algorithm write-up: https://mp.weixin.qq.com/s/UFzY1s6ohTM-dnNg0P4kkw . Timing results:\n\n|Resolution|Optimization|Iterations|Time|\n|-|-|-|-|\n|4272x2848|Plain implementation|1000|41.40ms|\n|4272x2848|OpenMP, 4 threads|1000|36.54ms|\n|4272x2848|SSE version 1|1000|6.77ms|\n|4272x2848|SSE version 2 (std::async)|1000|4.73ms|\n\n- speed_rgb2yuv_sse.cpp Heavily SSE-optimized conversion between the RGB and YUV color spaces. Algorithm write-up: https://mp.weixin.qq.com/s/ryGocz-0YpqZ1CjYXJbd7Q . Timing results:\n\n|Resolution|Optimization|Iterations|Time|\n|-|-|-|-|\n|4032x3024|Plain implementation|1000|150.58ms|\n|4032x3024|No floating point, division replaced by bit shifts|1000|76.70ms|\n|4032x3024|OpenMP, 4 threads|1000|50.48ms|\n|4032x3024|Plain SSE vectorization|1000|48.92ms|\n|4032x3024|Second pass using _mm_madd_epi16|1000|33.04ms|\n|4032x3024|SSE + 4 threads|1000|23.70ms|\n\n\n\n- speed_median_filter_3x3_sse.cpp Heavily optimized 3*3 median filter. Algorithm write-up: https://blog.csdn.net/just_sort/article/details/98617050 . Timing results:\n\n|Resolution|Optimization|Iterations|Time|\n|-|-|-|-|\n|4032x3024|Plain implementation|10|8293.79 ms|\n|4032x3024|Logic optimization, better pipelining|10|83.75 ms|\n|4032x3024|SSE|10|11.93 ms|\n|4032x3024|AVX|10|9.32 ms|\n\n----------------------------------------------------------------------------------\n\n- speed_gaussian_filter_sse.cpp Uses SSE to accelerate Gaussian filtering. Algorithm write-up: https://blog.csdn.net/just_sort/article/details/95212099 . Timing results:\n\n| Optimization | Resolution | Time |\n| ------------------- | ---------- | ---- |\n| Plain C, single thread | 4032*3024 | 290.43ms |\n| SSE, single thread | 4032*3024 | 265.96ms |\n\n- speed_integral_graph_sse.cpp Uses SSE to accelerate integral-image computation, but shows no speedup on the PC. Algorithm write-up: https://www.cnblogs.com/Imageshop/p/6897233.html . Timing results:\n\n|Optimization|Resolution|Time|\n|---------|----------|-------|\n|C, single thread|4032*3024|66.66ms|\n|C, 4 threads|4032*3024|65.34ms|\n|SSE, single thread|4032*3024|66.10ms|\n|SSE, 4 threads|4032*3024|66.20ms|\n\n\n- speed_common_functions.cpp Fast implementations of some common image-processing functions; a few of them use SSE.\n- speed_max_filter_sse.cpp Implements max filtering on top of speed_histogram_algorithm_framework; the larger the radius, the larger the speedup. Algorithm write-up: https://blog.csdn.net/just_sort/article/details/97280807 . Before running, disable SDL checks in the project properties, otherwise the build reports an uninitialized-variable error. Timing results:\n\n|Optimization|Resolution|Radius|Time|\n|---------|----------|-------|-------|\n|C, single thread|4272*2848|7|9445.90ms|\n|SSE, single thread|4272*2848|7|2234.55ms|\n|C, single thread|4272*2848|9|14468.76ms|\n|SSE, single thread|4272*2848|9|2221.68ms|\n|C, single thread|4272*2848|11|23069.10ms|\n|SSE, single thread|4272*2848|11|2180.95ms|\n\n- speed_box_filter_sse.cpp Implements an O(1) box filter with SSE optimization on top of the speed_histogram_algorithm framework. Algorithm write-up: https://blog.csdn.net/just_sort/article/details/98075712 . It is run the same way as speed_max_filter_sse.cpp. Timing results:\n\n|Optimization|Resolution|Radius|Time|\n|---------|----------|-------|-------|\n|C, single thread|4272*2848|11|163.16ms|\n|SSE, single thread|4272*2848|11|123.83ms|\n|C, single thread|4272*2848|21|167.81ms|\n|SSE, single thread|4272*2848|21|126.98ms|\n|C, single thread|4272*2848|31|168.62ms|\n|SSE, single thread|4272*2848|31|126.17ms|\n\n- speed_multi_scale_detail_boosting_see.cpp Builds on the SSE-optimized box filter from speed_box_filter_sse.cpp and uses SIMD instructions to optimize the algorithm from the paper \"DARK IMAGE ENHANCEMENT BASED ON PAIRWISE TARGET CONTRAST AND MULTI-SCALE DETAIL BOOSTING\". Algorithm write-up: https://blog.csdn.net/just_sort/article/details/98485746 . Timing results on a Core I7-3770:\n\n|Optimization|Resolution|Radius|Time|\n|---------|----------|-------|-------|\n|C, single thread|4272*2848|7|206.00ms|\n|SSE, single thread|4272*2848|7|57.12ms|\n\n- speed_bicubic_zoom_sse.cpp SSE-optimized bicubic interpolation. Algorithm write-up: https://blog.csdn.net/just_sort/article/details/100119653 . Timing results:\n\n|Optimization|Resolution|Output size|Time|\n|---------|----------|-------|-------|\n|Original C implementation|4272*2848|1.5x the original width and height|1856.29ms|\n|C + lookup table + border handling|4272*2848|1.5x the original width and height|839.10ms|\n|SSE + border handling|4272*2848|1.5x the original width and height|315.70ms|\n|OpenCV 3.1.0 built-in function|4272*2848|1.5x the original width and height|118.77ms|\n\n\n\n\n# I maintain a WeChat official account that shares papers, algorithms, competitions, and everyday life. You are welcome to join.\n\n- If the image does not load, just search for GiantPandaCV.\n\n![](image/weixin.jpg)\n"
  },
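  {
    "path": "resources/SSE指令集补充.md",
    "content": "# SSE Instruction Set Notes\n\n- _mm_cvtps_epi32 Converts four float values to four 32-bit int values. Note the rounding rule: it follows the current MXCSR rounding mode, which by default is round-to-nearest with ties rounded to the even value.\n\n- _mm_cvttps_epi32 Converts four float values to four 32-bit int values by truncation, the same as r = (int)a in C/C++.\n\n- _mm_cvtpd_ps Converts the two double-precision floating-point values of a to single precision. Return value:\n\n  ```c++\n  r0 := (float) a0\n  r1 := (float) a1\n  r2 := 0.0 ; r3 := 0.0\n  ```\n\n- _mm_movelh_ps Moves the lower two single-precision values of b into the upper two single-precision values of the result; the lower two values come from a.\n\n  ```c++\n  r3 := b1\n  r2 := b0\n  r1 := a1\n  r0 := a0\n  ```\n\n- _mm_cmpneq_ps Compares four pairs of single-precision values for inequality. Each result lane is 0 if the values are equal and 0xffffffff if they are not.\n\n- _mm_blendv_ps Blends packed single-precision values according to the sign bit of each mask lane:\n\n  ```c++\n  __m128 _mm_blendv_ps( \n     __m128 a,\n     __m128 b,\n     __m128 mask \n  );\n  \n  r0 := (mask0 & 0x80000000) ? b0 : a0\n  r1 := (mask1 & 0x80000000) ? b1 : a1\n  r2 := (mask2 & 0x80000000) ? b2 : a2\n  r3 := (mask3 & 0x80000000) ? b3 : a3\n  ```\n\n- _mm_packs_epi32 Packs the 8 signed 32-bit integers of a and b into signed 16-bit integers with signed saturation.\n\n- _mm_cvtsi128_si32 Moves the least significant 32 bits of a into a 32-bit integer, i.e. r = a0.\n\n- _mm_packus_epi16 Packs the 16 signed 16-bit integers of a and b into unsigned 8-bit integers with unsigned saturation.\n\n- _mm_cvtsi32_si128 Moves the 32-bit integer a into the low 32 bits of a 128-bit register and zeroes the upper 96 bits.\n\n- _mm_loadu_si128 Loads a 128-bit value; the address does not need to be 16-byte aligned.\n\n- _mm_max_epu8(a,b) Compares the corresponding unsigned 8-bit integers of a and b and keeps the larger one, for all 16 lanes: r0=max(a0,b0),...,r15=max(a15,b15).\n\n- _mm_min_epi8(a,b) Same lane layout as above, but it keeps the smaller value and compares signed 8-bit integers.\n\n- _mm_setzero_si128 Sets all 128 bits to 0.\n\n- _mm_subs_epu8(a,b) Subtracts the corresponding 8-bit values of a and b: r0=UnsignedSaturate(a0-b0),...,r15=UnsignedSaturate(a15-b15).\n\n- _mm_adds_epi8(a,b) Adds the corresponding 8-bit values of a and b: r0=SignedSaturate(a0+b0),...,r15=SignedSaturate(a15+b15).\n\n- _mm_unpackhi_epi64(a,b) Interleaves the high 64 bits of a and b; the low 64 bits are discarded.\n\n- _mm_srli_si128(a,imm) Logically shifts a right by imm bytes (not bits), filling the high bytes with 0.\n\n- _mm_xor_si128(a,b) Bitwise XOR of a and b, i.e. r=a^b.\n\n- _mm_or_si128(a,b) Bitwise OR of a and b, i.e. r=a|b.\n\n- _mm_and_si128(a,b) Bitwise AND of a and b, i.e. r=a&b.\n\n- _mm_cmpgt_epi8(a,b) Checks whether each signed 8-bit integer of a is greater than the corresponding one of b; a lane is 0xff if it is and 0x0 otherwise: r0=(a0>b0)?0xff:0x0  r1=(a1>b1)?0xff:0x0...r15=(a15>b15)?0xff:0x0\n\n- _mm_unpacklo_epi64(a,b) Interleaves the low 64 bits of a and b; the high 64 bits are discarded.\n\n- _mm_madd_epi16 Multiplies the corresponding signed 16-bit integers of a and b, then adds adjacent pairs of the 32-bit products. Returns an __m128i holding 4 signed 32-bit integers.\n\n  ```c++\n  r0 := (a0 * b0) + (a1 * b1)\n  r1 := (a2 * b2) + (a3 * b3)\n  r2 := (a4 * b4) + (a5 * b5)\n  r3 := (a6 * b6) + (a7 * b7)\n  ```\n\n- _mm_extract_epi16(a, imm) Returns the 16-bit element of a at position imm.\n\n- _mm_min_epu16 Returns the lane-wise minimum of the unsigned 16-bit integers of a and b.\n\n- _mm_minpos_epu16 Returns a 128-bit value whose lowest 16 bits hold the minimum of the eight unsigned 16-bit integers in a, and whose next 16 bits hold the index of that minimum.\n\n- _mm_stream_si32 Stores a 32-bit integer to the given address with a non-temporal hint, so the write bypasses the cache.\n\n- _mm_packus_epi32 Packs the signed 32-bit integers of a and b into unsigned 16-bit integers with unsigned saturation:\n\n  ```c++\n  r0 := (a0 < 0) ? 0 : ((a0 > 0xffff) ? 0xffff : a0)\n  r1 := (a1 < 0) ? 0 : ((a1 > 0xffff) ? 0xffff : a1)\n  r2 := (a2 < 0) ? 0 : ((a2 > 0xffff) ? 0xffff : a2)\n  r3 := (a3 < 0) ? 0 : ((a3 > 0xffff) ? 0xffff : a3)\n  r4 := (b0 < 0) ? 0 : ((b0 > 0xffff) ? 0xffff : b0)\n  r5 := (b1 < 0) ? 0 : ((b1 > 0xffff) ? 0xffff : b1)\n  r6 := (b2 < 0) ? 0 : ((b2 > 0xffff) ? 0xffff : b2)\n  r7 := (b3 < 0) ? 0 : ((b3 > 0xffff) ? 0xffff : b3)\n  ```\n\n- _mm_setr_epi32 Returns an __m128i built from 4 given int values; the first argument goes into the lowest lane.\n\n- _mm_mullo_epi32 Returns an __m128i holding the low 32 bits of the products of the 4 corresponding int values of a and b.\n\n- _mm_hadd_epi32 Returns an __m128i holding the sums of adjacent int pairs: r0=a0+a1, r1=a2+a3, r2=b0+b1, r3=b2+b3.\n\n- _mm_unpackhi_epi8 Returns an __m128i that interleaves the high 8 bytes of a and b.\n\n  ```c++\n  r0 := a8 ; r1 := b8\n  r2 := a9 ; r3 := b9\n  ...\n  r14 := a15 ; r15 := b15\n  ```\n\n- _mm_unpacklo_epi8 Returns an __m128i that interleaves the low 8 bytes of a and b."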
  },
  {
    "path": "speed_bicubic_zoom_sse.cpp",
    "content": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\nusing namespace std;\nusing namespace cv;\n\nvoid debug(__m128i var) {\n\tuint8_t *val = (uint8_t*)&var;//can also use uint32_t instead of 16_t \n\tprintf(\"Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\\n\",\n\t\tval[0], val[1], val[2], val[3], val[4], val[5],\n\t\tval[6], val[7], val[8], val[9], val[10], val[11], val[12], val[13],\n\t\tval[14], val[15]);\n}\n\nvoid ConvertBGR8U2BGRAF(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride)\n{\n\t//#pragma omp parallel for\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width * 4;\n\t\tfor (int X = 0; X < Width; X++, LinePS += 3, LinePD += 4)\n\t\t{\n\t\t\tLinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2]; LinePD[3] = 0;\n\t\t}\n\t}\n}\n\nvoid ConvertBGRAF2BGR8U(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride)\n{\n\t//#pragma omp parallel for\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tunsigned char *LinePS = Src + Y * Width * 4;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tfor (int X = 0; X < Width; X++, LinePS += 4, LinePD += 3)\n\t\t{\n\t\t\tLinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];\n\t\t}\n\t}\n}\n\nvoid ConvertBGR8U2BGRAF_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tconst int BlockSize = 4;\n\tint Block = (Width - 2) / BlockSize;\n\t__m128i Mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);\n\t__m128i Mask2 = _mm_setr_epi8(0, 2, 8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);\n\t__m128i Zero = _mm_setzero_si128();\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width * 4;\n\t\tint X = 0;\n\t\tfor (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize * 4) 
{\n\t\t\t__m128i SrcV = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)LinePS), Mask);\n\t\t\t__m128i Src16L = _mm_unpacklo_epi8(SrcV, Zero);\n\t\t\t__m128i Src16H = _mm_unpackhi_epi8(SrcV, Zero);\n\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + 0), _mm_shuffle_epi8(_mm_unpacklo_epi32(Src16L, Zero), Mask2));\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + 4), _mm_shuffle_epi8(_mm_unpackhi_epi32(Src16L, Zero), Mask2));\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + 8), _mm_shuffle_epi8(_mm_unpacklo_epi32(Src16H, Zero), Mask2));\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + 12), _mm_shuffle_epi8(_mm_unpackhi_epi32(Src16H, Zero), Mask2));\n\t\t}\n\t\tfor (; X < Width; X++, LinePS += 3, LinePD += 4) {\n\t\t\tLinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];    LinePD[3] = 0;\n\t\t}\n\t}\n}\n\nvoid ConvertBGRAF2BGR8U_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tconst int BlockSize = 4;\n\tint Block = (Width - 2) / BlockSize;\n\t//__m128i Mask = _mm_setr_epi8(0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15);\n\t__m128i MaskB = _mm_setr_epi8(0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);\n\t__m128i MaskG = _mm_setr_epi8(1, 5, 9, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);\n\t__m128i MaskR = _mm_setr_epi8(2, 6, 10, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);\n\t__m128i Zero = _mm_setzero_si128();\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Width * 4;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tint X = 0;\n\t\tfor (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 4, LinePD += BlockSize * 3) {\n\t\t\t__m128i SrcV = _mm_loadu_si128((const __m128i*)LinePS);\n\t\t\t__m128i B = _mm_shuffle_epi8(SrcV, MaskB);\n\t\t\t__m128i G = _mm_shuffle_epi8(SrcV, MaskG);\n\t\t\t__m128i R = _mm_shuffle_epi8(SrcV, MaskR);\n\t\t\t__m128i Ans1 = Zero, Ans2 = Zero, Ans3 = Zero;\n\t\t\tAns1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(B, _mm_setr_epi8(0, 
-1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(G, _mm_setr_epi8(-1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(R, _mm_setr_epi8(-1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\n\t\t\tAns2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(B, _mm_setr_epi8(-1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(G, _mm_setr_epi8(1, -1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(R, _mm_setr_epi8(-1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\n\t\t\tAns3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(B, _mm_setr_epi8(-1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(G, _mm_setr_epi8(-1, -1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(R, _mm_setr_epi8(2, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + 0), Ans1);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + 4), Ans2);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + 8), Ans3);\n\t\t}\n\t\tfor (; X < Width; X++, LinePS += 4, LinePD += 3) {\n\t\t\tLinePD[0] = LinePS[0]; LinePD[1] = LinePS[1]; LinePD[2] = LinePS[2];\n\t\t}\n\t}\n}\n\n// 将整形的Value值限定在Min和Max内，可取Min或者Max的值\ninline int ClampI(int Value, int Min, int Max) {\n\tif (Value < Min) return Min;\n\telse if (Value > Max) return Max;\n\telse return Value;\n}\n\n// 将整数限制到字节数据类型\ninline unsigned char ClampToByte(int Value) {\n\tif (Value < 0) return 0;\n\telse if (Value > 255) return 255;\n\telse return (unsigned char)Value;\n}\n\n// 获取PosX, PosY位置的像素\ninline unsigned char *GetCheckedPixel(unsigned char *Src, int Width, int Height, int Stride, int Channel, int PosX, int PosY) {\n\treturn Src + ClampI(PosY, 0, Height - 
1) * Stride + ClampI(PosX, 0, Width - 1) * Channel;\n}\n\n// Evaluates the interpolation kernel sin(x * PI) / (x * PI); the expression below is a cubic approximation of it\nfloat SinXDivX(float X) {\n\tconst float a = -1; // a can also be -2, -1, -0.75, -0.5, etc.; it controls the amount of sharpening or blurring\n\tX = abs(X);\n\tfloat X2 = X * X, X3 = X2 * X;\n\tif (X <= 1)\n\t\treturn (a + 2) * X3 - (a + 3) * X2 + 1;\n\telse if (X <= 2)\n\t\treturn a * X3 - (5 * a) * X2 + (8 * a) * X - (4 * a);\n\telse\n\t\treturn 0;\n}\n\n// Exact evaluation of the interpolation kernel sin(x * PI) / (x * PI)\nfloat SinXDivX_Standard(float X) {\n\tif (abs(X) < 0.000001f)\n\t\treturn 1;\n\telse\n\t\treturn sin(X * 3.1415926f) / (X * 3.1415926f);\n}\n\nvoid Bicubic_Original(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, float X, float Y)\n{\n\tint Channel = Stride / Width;\n\tint PosX = floor(X), PosY = floor(Y);\n\tfloat PartXX = X - PosX, PartYY = Y - PosY;\n\n\tunsigned char *Pixel00 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY - 1);\n\tunsigned char *Pixel01 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY - 1);\n\tunsigned char *Pixel02 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY - 1);\n\tunsigned char *Pixel03 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY - 1);\n\tunsigned char *Pixel10 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 0);\n\tunsigned char *Pixel11 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 0);\n\tunsigned char *Pixel12 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 0);\n\tunsigned char *Pixel13 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 0);\n\tunsigned char *Pixel20 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 1);\n\tunsigned char *Pixel21 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 1);\n\tunsigned char *Pixel22 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 1);\n\tunsigned char *Pixel23 = 
GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 1);\n\tunsigned char *Pixel30 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 2);\n\tunsigned char *Pixel31 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 2);\n\tunsigned char *Pixel32 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 2);\n\tunsigned char *Pixel33 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 2);\n\n\tfloat U0 = SinXDivX(1 + PartXX), U1 = SinXDivX(PartXX);\n\tfloat U2 = SinXDivX(1 - PartXX), U3 = SinXDivX(2 - PartXX);\n\tfloat V0 = SinXDivX(1 + PartYY), V1 = SinXDivX(PartYY);\n\tfloat V2 = SinXDivX(1 - PartYY), V3 = SinXDivX(2 - PartYY);\n\n\tfor (int I = 0; I < Channel; I++)\n\t{\n\t\tfloat Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;\n\t\t//printf(\"%.5f\\n\", Sum1);\n\t\tfloat Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;\n\t\t//printf(\"%.5f\\n\", Sum2);\n\t\tfloat Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;\n\t\t//printf(\"%.5f\\n\", Sum3);\n\t\t// Fourth row of the 4x4 window: uses Pixel32 (the original read Pixel22 here)\n\t\tfloat Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;\n\t\t//printf(\"%.5f\\n\", Sum4);\n\t\t// printf(\"%d %.5f %.5f %.5f %.5f\\n\", I, Sum1, Sum2, Sum3, Sum4);\n\t\tPixel[I] = ClampToByte(Sum1 + Sum2 + Sum3 + Sum4 + 0.5f);\n\t}\n}\n\n// According to ImageShop, fixing Channel to a constant would make this much faster; not yet tested\nvoid Bicubic_Border(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, short *SinXDivX_Table, int SrcX, int SrcY) {\n\tint Channel = Stride / Width;\n\tint U = (unsigned char)(SrcX >> 8), V = (unsigned char)(SrcY >> 8);\n\n\tint U0 = SinXDivX_Table[256 + U], U1 = SinXDivX_Table[U];\n\tint U2 = SinXDivX_Table[256 - U], U3 = SinXDivX_Table[512 - U];\n\tint V0 = SinXDivX_Table[256 + V], V1 = SinXDivX_Table[V];\n\tint V2 = SinXDivX_Table[256 - V], V3 = SinXDivX_Table[512 - V];\n\tint PosX 
= SrcX >> 16, PosY = SrcY >> 16;\n\n\tunsigned char *Pixel00 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY - 1);\n\tunsigned char *Pixel01 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY - 1);\n\tunsigned char *Pixel02 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY - 1);\n\tunsigned char *Pixel03 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY - 1);\n\tunsigned char *Pixel10 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 0);\n\tunsigned char *Pixel11 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 0);\n\tunsigned char *Pixel12 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 0);\n\tunsigned char *Pixel13 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 0);\n\tunsigned char *Pixel20 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 1);\n\tunsigned char *Pixel21 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 1);\n\tunsigned char *Pixel22 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 1);\n\tunsigned char *Pixel23 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 1);\n\tunsigned char *Pixel30 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 2);\n\tunsigned char *Pixel31 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 2);\n\tunsigned char *Pixel32 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 2);\n\tunsigned char *Pixel33 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 2);\n\n\tfor (int I = 0; I < Channel; I++)\n\t{\n\t\tint Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;\n\t\tint Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;\n\t\tint Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * 
V2;\n\t\tint Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;\n\t\tPixel[I] = ClampToByte((Sum1 + Sum2 + Sum3 + Sum4) >> 16);\n\t}\n}\nvoid Bicubic_Center(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, short *SinXDivX_Table, int SrcX, int SrcY)\n{\n\tint Channel = Stride / Width;\n\tint U = (unsigned char)(SrcX >> 8), V = (unsigned char)(SrcY >> 8);\n\n\tint U0 = SinXDivX_Table[256 + U], U1 = SinXDivX_Table[U];\n\tint U2 = SinXDivX_Table[256 - U], U3 = SinXDivX_Table[512 - U];\n\tint V0 = SinXDivX_Table[256 + V], V1 = SinXDivX_Table[V];\n\tint V2 = SinXDivX_Table[256 - V], V3 = SinXDivX_Table[512 - V];\n\tint PosX = SrcX >> 16, PosY = SrcY >> 16;\n\n\tunsigned char *Pixel00 = Src + (PosY - 1) * Stride + (PosX - 1) * Channel;\n\tunsigned char *Pixel01 = Pixel00 + Channel;\n\tunsigned char *Pixel02 = Pixel01 + Channel;\n\tunsigned char *Pixel03 = Pixel02 + Channel;\n\tunsigned char *Pixel10 = Pixel00 + Stride;\n\tunsigned char *Pixel11 = Pixel10 + Channel;\n\tunsigned char *Pixel12 = Pixel11 + Channel;\n\tunsigned char *Pixel13 = Pixel12 + Channel;\n\tunsigned char *Pixel20 = Pixel10 + Stride;\n\tunsigned char *Pixel21 = Pixel20 + Channel;\n\tunsigned char *Pixel22 = Pixel21 + Channel;\n\tunsigned char *Pixel23 = Pixel22 + Channel;\n\tunsigned char *Pixel30 = Pixel20 + Stride;\n\tunsigned char *Pixel31 = Pixel30 + Channel;\n\tunsigned char *Pixel32 = Pixel31 + Channel;\n\tunsigned char *Pixel33 = Pixel32 + Channel;\n\tfor (int I = 0; I < Channel; I++)\n\t{\n\t\tint Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;\n\t\tint Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;\n\t\tint Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;\n\t\tint Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;\n\t\tPixel[I] = ClampToByte((Sum1 + Sum2 + Sum3 + Sum4) >> 16);\n\t}\n}\n\n// 
Original (unoptimized) interpolation algorithm\nvoid IM_Resize_Cubic_Origin(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {\n\tint Channel = StrideS / SrcW;\n\tif ((SrcW == DstW) && (SrcH == DstH)) {\n\t\tmemcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));\n\t\treturn;\n\t}\n\tprintf(\"%d\\n\", Channel);\n\tfor (int Y = 0; Y < DstH; Y++)\n\t{\n\t\tunsigned char *LinePD = Dest + Y * StrideD;\n\t\tfloat SrcY = (Y + 0.4999999f) * SrcH / DstH - 0.5f;\n\t\tfor (int X = 0; X < DstW; X++)\n\t\t{\n\t\t\tfloat SrcX = (X + 0.4999999f) * SrcW / DstW - 0.5f;\n\t\t\tBicubic_Original(Src, SrcW, SrcH, StrideS, LinePD, SrcX, SrcY);\n\t\t\tLinePD += Channel;\n\t\t}\n\t}\n}\n\n// Lookup-table based interpolation implemented in plain C\nvoid IM_Resize_Cubic_Table(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {\n\tint Channel = StrideS / SrcW;\n\tif ((SrcW == DstW) && (SrcH == DstH)) {\n\t\tmemcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));\n\t\treturn;\n\t}\n\tshort *SinXDivX_Table = (short *)malloc(513 * sizeof(short));\n\tfor (int I = 0; I < 513; I++)\n\t\tSinXDivX_Table[I] = int(0.5 + 256 * SinXDivX(I / 256.0f)); // build the lookup table in fixed point (scaled by 256)\n\tint AddX = (SrcW << 16) / DstW, AddY = (SrcH << 16) / DstH;\n\tint ErrorX = -(1 << 15) + (AddX >> 1), ErrorY = -(1 << 15) + (AddY >> 1);\n\n\tint StartX = ((1 << 16) - ErrorX) / AddX + 1;\t\t\t//\tcompute the borders that need special handling\n\tint StartY = ((1 << 16) - ErrorY) / AddY + 1;\t\t\t//\ty0+y*yr>=1; y0=ErrorY => y>=(1-ErrorY)/yr\n\tint EndX = (((SrcW - 3) << 16) - ErrorX) / AddX + 1;\n\tint EndY = (((SrcH - 3) << 16) - ErrorY) / AddY + 1;\t//\ty0+y*yr<=(height-3) => y<=(height-3-ErrorY)/yr\n\tif (StartY >= DstH)\t\t\tStartY = DstH;\n\tif (StartX >= DstW)\t\t\tStartX = DstW;\n\tif (EndX < StartX)\t\t\tEndX = StartX;\n\tif (EndY < StartY)\t\t\tEndY = StartY;\n\t// print the borders\n\t//printf(\"%d %d %d %d\\n\", StartX, StartY, EndX, EndY);\n\tint SrcY = ErrorY;\n\tfor (int Y = 0; Y < StartY; Y++, SrcY += 
AddY)\t\t\t//\tleading rows whose 4x4 sampling window is not fully inside the image\n\t{\n\t\tunsigned char *LinePD = Dest + Y * StrideD;\n\t\tfor (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)\n\t\t{\n\t\t\tBicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);\n\t\t}\n\t}\n\tfor (int Y = StartY; Y < EndY; Y++, SrcY += AddY)\n\t{\n\t\tint SrcX = ErrorX;\n\t\tunsigned char *LinePD = Dest + Y * StrideD;\n\t\tfor (int X = 0; X < StartX; X++, SrcX += AddX, LinePD += Channel)\n\t\t{\n\t\t\tBicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);\n\t\t}\n\t\tfor (int X = StartX; X < EndX; X++, SrcX += AddX, LinePD += Channel)\n\t\t{\n\t\t\tBicubic_Center(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);\n\t\t}\n\t\tfor (int X = EndX; X < DstW; X++, SrcX += AddX, LinePD += Channel)\n\t\t{\n\t\t\tBicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);\n\t\t}\n\t}\n\tfor (int Y = EndY; Y < DstH; Y++, SrcY += AddY)\n\t{\n\t\tunsigned char *LinePD = Dest + Y * StrideD;\n\t\tfor (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)\n\t\t{\n\t\t\tBicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);\n\t\t}\n\t}\n\tfree(SinXDivX_Table);\n}\n\n// Sum of the 4 signed 32-bit integers in V\ninline int _mm_hsum_epi32(__m128i V) { //V3 V2 V1 V0\n\t__m128i T = _mm_add_epi32(V, _mm_srli_si128(V, 8)); //V3+V1\t V2+V0\tV1\tV0\n\tT = _mm_add_epi32(T, _mm_srli_si128(T, 4)); //V3+V1+V2+V0\t\tV2+V0+V1\tV1+V0\tV0\n\treturn _mm_cvtsi128_si32(T); // extract the lowest lane\n}\n\n// Bicubic interpolation optimized with SSE\n// Maximum supported image size: 32767*32767\nvoid IM_Resize_SSE(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {\n\tint Channel = StrideS / SrcW;\n\tif ((SrcW == DstW) && (SrcH == DstH)) {\n\t\tmemcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));\n\t\treturn;\n\t}\n\tshort *SinXDivX_Table = (short *)malloc(513 * sizeof(short));\n\tshort *Table = (short *)malloc(DstW * 4 * sizeof(short));\n\tfor 
(int I = 0; I < 513; I++)\n\t\tSinXDivX_Table[I] = int(0.5 + 256 * SinXDivX(I / 256.0f)); //\tbuild the lookup table in fixed point (scaled by 256)\n\tint AddX = (SrcW << 16) / DstW, AddY = (SrcH << 16) / DstH;\n\tint ErrorX = -(1 << 15) + (AddX >> 1), ErrorY = -(1 << 15) + (AddY >> 1);\n\n\tint StartX = ((1 << 16) - ErrorX) / AddX + 1;\t\t\t//\tcompute the borders that need special handling\n\tint StartY = ((1 << 16) - ErrorY) / AddY + 1;\t\t\t//\ty0+y*yr>=1; y0=ErrorY => y>=(1-ErrorY)/yr\n\tint EndX = (((SrcW - 3) << 16) - ErrorX) / AddX + 1;\n\tint EndY = (((SrcH - 3) << 16) - ErrorY) / AddY + 1;\t//\ty0+y*yr<=(height-3) => y<=(height-3-ErrorY)/yr\n\tif (StartY >= DstH)\t\t\tStartY = DstH;\n\tif (StartX >= DstW)\t\t\tStartX = DstW;\n\tif (EndX < StartX)\t\t\tEndX = StartX;\n\tif (EndY < StartY)\t\t\tEndY = StartY;\n\t// The table is indexed by X, so it must cover [StartX, EndX) (the original looped to EndY)\n\tfor (int X = StartX, SrcX = ErrorX + StartX * AddX; X < EndX; X++, SrcX += AddX) {\n\t\tint U = (unsigned char)(SrcX >> 8);\n\t\tTable[X * 4 + 0] = SinXDivX_Table[256 + U]; // build a per-column weight table that is convenient for SSE loads\n\t\tTable[X * 4 + 1] = SinXDivX_Table[U];\n\t\tTable[X * 4 + 2] = SinXDivX_Table[256 - U];\n\t\tTable[X * 4 + 3] = SinXDivX_Table[512 - U];\n\t}\n\tint SrcY = ErrorY;\n\tfor (int Y = 0; Y < StartY; Y++, SrcY += AddY) { // same as in IM_Resize_Cubic_Table\n\t\tunsigned char *LinePD = Dest + Y * StrideD;\n\t\tfor (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel) {\n\t\t\tBicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);\n\t\t}\n\t}\n\tfor (int Y = StartY; Y < EndY; Y++, SrcY += AddY) {\n\t\tint SrcX = ErrorX;\n\t\tunsigned char *LinePD = Dest + Y * StrideD;\n\t\tfor (int X = 0; X < StartX; X++, SrcX += AddX, LinePD += Channel) {\n\t\t\tBicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);\n\t\t}\n\t\tint V = (unsigned char)(SrcY >> 8);\n\t\tunsigned char *LineY = Src + ((SrcY >> 16) - 1) * StrideS;\n\t\t__m128i PartY = _mm_setr_epi32(SinXDivX_Table[256 + V], SinXDivX_Table[V], SinXDivX_Table[256 - V], SinXDivX_Table[512 - V]);\n\t\tfor (int X = StartX; X < EndX; X++, SrcX 
+= AddX, LinePD += Channel) {\n\t\t\t__m128i PartX = _mm_loadl_epi64((__m128i *)(Table + X * 4));\n\t\t\t//PartX: U0 U1 U2 U3 U0 U1 U2 U3 \n\t\t\tPartX = _mm_unpacklo_epi64(PartX, PartX);\n\t\t\tunsigned char *Pixel0 = LineY + ((SrcX >> 16) - 1) * Channel;\n\t\t\tunsigned char *Pixel1 = Pixel0 + StrideS;\n\t\t\tunsigned char *Pixel2 = Pixel1 + StrideS;\n\t\t\tunsigned char *Pixel3 = Pixel2 + StrideS;\n\t\t\tif (Channel == 1) {\n\t\t\t\t__m128i P01 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel0)), _mm_cvtsi32_si128(*((int *)Pixel1)))); //\tP00 P01 P02 P03 P10 P11 P12 P13\n\t\t\t\t__m128i P23 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel2)), _mm_cvtsi32_si128(*((int *)Pixel3)))); //\tP20 P21 P22 P23 P30 P31 P32 P33\n\t\t\t\t__m128i Sum01 = _mm_madd_epi16(P01, PartX); // P00 * U0 + P01 * U1\t\tP02 * U2 + P03 * U3\t\t P10 * U0 + P11 * U1\t\tP12 * U2 + P13 * U3\n\t\t\t\t__m128i Sum23 = _mm_madd_epi16(P23, PartX); // P20 * U0 + P21 * U1\t\tP22 * U2 + P23 * U3\t\t P30 * U0 + P31 * U1\t\tP32 * U2 + P33 * U3\n\t\t\t\t__m128i Sum = _mm_hadd_epi32(Sum01, Sum23); // P00 * U0 + P01 * U1 + P02 * U2 + P03 * U3\t P10 * U0 + P11 * U1 + P12 * U2 + P13 * U3\tP20 * U0 + P21 * U1\t+ P22 * U2 + P23 * U3\tP30 * U0 + P31 * U1 + P32 * U2 + P33 * U3\n\t\t\t\tLinePD[0] = ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(Sum, PartY)) >> 16);\n\t\t\t}\n\t\t\telse if (Channel == 4) {\n\t\t\t\t__m128i P0 = _mm_loadu_si128((__m128i *)Pixel0), P1 = _mm_loadu_si128((__m128i *)Pixel1);\n\t\t\t\t__m128i P2 = _mm_loadu_si128((__m128i *)Pixel2), P3 = _mm_loadu_si128((__m128i *)Pixel3);\n\t\t\t\tP0 = _mm_shuffle_epi8(P0, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));\t // B0 G0 R0 A0\n\t\t\t\tP1 = _mm_shuffle_epi8(P1, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));\t //\tB1 G1 R1 A1\n\t\t\t\tP2 = _mm_shuffle_epi8(P2, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));\t // B2 G2 R2 A2\n\t\t\t\tP3 
= _mm_shuffle_epi8(P3, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));\t //\tB3 G3 R3 A3\n\n\t\t\t\t__m128i BG01 = _mm_unpacklo_epi32(P0, P1);\t\t//\tB0 B1 G0 G1\n\t\t\t\t__m128i RA01 = _mm_unpackhi_epi32(P0, P1);\t\t//\tR0 R1 A0 A1\n\t\t\t\t__m128i BG23 = _mm_unpacklo_epi32(P2, P3);\t\t//\tB2 B3 G2 G3\n\t\t\t\t__m128i RA23 = _mm_unpackhi_epi32(P2, P3);\t\t//\tR2 R3 A2 A3\n\n\t\t\t\t__m128i B01 = _mm_unpacklo_epi8(BG01, _mm_setzero_si128());\n\t\t\t\t__m128i B23 = _mm_unpacklo_epi8(BG23, _mm_setzero_si128());\n\t\t\t\t__m128i SumB = _mm_hadd_epi32(_mm_madd_epi16(B01, PartX), _mm_madd_epi16(B23, PartX));\n\n\t\t\t\t__m128i G01 = _mm_unpackhi_epi8(BG01, _mm_setzero_si128());\n\t\t\t\t__m128i G23 = _mm_unpackhi_epi8(BG23, _mm_setzero_si128());\n\t\t\t\t__m128i SumG = _mm_hadd_epi32(_mm_madd_epi16(G01, PartX), _mm_madd_epi16(G23, PartX));\n\n\t\t\t\t__m128i R01 = _mm_unpacklo_epi8(RA01, _mm_setzero_si128());\n\t\t\t\t__m128i R23 = _mm_unpacklo_epi8(RA23, _mm_setzero_si128());\n\t\t\t\t__m128i SumR = _mm_hadd_epi32(_mm_madd_epi16(R01, PartX), _mm_madd_epi16(R23, PartX));\n\n\t\t\t\t__m128i A01 = _mm_unpackhi_epi8(RA01, _mm_setzero_si128());\n\t\t\t\t__m128i A23 = _mm_unpackhi_epi8(RA23, _mm_setzero_si128());\n\t\t\t\t__m128i SumA = _mm_hadd_epi32(_mm_madd_epi16(A01, PartX), _mm_madd_epi16(A23, PartX));\n\n\t\t\t\t__m128i Result = _mm_setr_epi32(_mm_hsum_epi32(_mm_mullo_epi32(SumB, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumG, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumR, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumA, PartY)));\n\t\t\t\tResult = _mm_srai_epi32(Result, 16);\n\t\t\t\t//\t*((int *)LinePD) = _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(Result, Result), Result));\n\t\t\t\t_mm_stream_si32((int *)LinePD, _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(Result, Result), Result)));\n\n\t\t\t\t//LinePD[0] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumB, PartY)) >> 16);\t//\t确实有部分存在超出unsigned char范围的，因为定点化的缘故\n\t\t\t\t//LinePD[1] 
= IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumG, PartY)) >> 16);\n\t\t\t\t//LinePD[2] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumR, PartY)) >> 16);\n\t\t\t\t//LinePD[3] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumA, PartY)) >> 16);\n\t\t\t}\n\t\t}\n\t\tfor (int X = EndX; X < DstW; X++, SrcX += AddX, LinePD += Channel)\n\t\t{\n\t\t\tBicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);\n\t\t}\n\t}\n\tfor (int Y = EndY; Y < DstH; Y++, SrcY += AddY)\n\t{\n\t\tunsigned char *LinePD = Dest + Y * StrideD;\n\t\tfor (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)\n\t\t{\n\t\t\tBicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);\n\t\t}\n\t}\n\tfree(Table);\n\tfree(SinXDivX_Table);\n}\n\nint main() {\n\tMat src = imread(\"F:\\\\car.jpg\");\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tint Stride = Width * 3;\n\tunsigned char *Src = src.data;\n\tunsigned char *Buffer = new unsigned char[Height * Width * 4];\n\tConvertBGR8U2BGRAF(Src, Buffer, Width, Height, Stride);\n\tint SrcW = Width;\n\tint SrcH = Height;\n\tint StrideS = Width * 4;\n\tint DstW = Width * 15 / 10;\n\tint DstH = Height * 15 / 10;\n\tunsigned char *Res = new unsigned char[DstH * DstW * 4];\n\tunsigned char *Dest = new unsigned char[DstH * DstW * 3];\n\tint StrideD = DstW * 4;\n\tint64 st = cvGetTickCount();\n\tfor (int i = 0; i < 10; i++) {\n\t\tIM_Resize_SSE(Buffer, Res, SrcW, SrcH, StrideS, DstW, DstH, StrideD);\n\t}\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;\n\tprintf(\"%.5f\\n\", duration);\n\tIM_Resize_Cubic_Origin(Buffer, Res, SrcW, SrcH, StrideS, DstW, DstH, StrideD);\n\tConvertBGRAF2BGR8U(Res, Dest, DstW, DstH, DstW * 3);\n\tMat dst(DstH, DstW, CV_8UC3, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n}"
  },
  {
    "path": "speed_box_filter_sse.cpp",
    "content": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include \"../../OpencvTest/OpencvTest/Core.h\"\n#include \"../../OpencvTest/OpencvTest/MaxFilter.h\"\n#include \"../../OpencvTest/OpencvTest/Utility.h\"\n#include \"../../OpencvTest/OpencvTest/BoxFilter.h\"\nusing namespace std;\nusing namespace cv;\n\nvoid BoxBlur_1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) {\n\tTMatrix a, b;\n\tTMatrix *p1 = &a, *p2 = &b;\n\tTMatrix **p3 = &p1, **p4 = &p2;\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3);\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4);\n\t(p1)->Data = Src;\n\t(p2)->Data = Dest;\n\tBoxBlur(p1, p2, Radius, EdgeMode::Smear);\n}\n\nvoid BoxBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) {\n\tTMatrix a, b;\n\tTMatrix *p1 = &a, *p2 = &b;\n\tTMatrix **p3 = &p1, **p4 = &p2;\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3);\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4);\n\t(p1)->Data = Src;\n\t(p2)->Data = Dest;\n\tBoxBlur_SSE(p1, p2, Radius, EdgeMode::Smear);\n}\n\n\nint main() {\n\tMat src = imread(\"F:\\\\car.jpg\");\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned char *Dest = new unsigned char[Height * Width * 3];\n\tint Stride = Width * 3;\n\tint Radius = 11;\n\tint64 st = cvGetTickCount();\n\tfor (int i = 0; i <10; i++) {\n\t\t//Mat temp = MaxFilter(src, Radius);\n\t\tBoxBlur_SSE(Src, Dest, Width, Height, Stride, 3, Radius);\n\t}\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;\n\tprintf(\"%.5f\\n\", duration);\n\tBoxBlur_SSE(Src, Dest, Width, Height, Stride, 3, Radius);\n\tMat dst(Height, Width, CV_8UC3, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n\treturn 0;\n}"
  },
  {
    "path": "speed_common_functions.cpp",
    "content": "// Approximation helper (a double viewed as two ints)\nunion Approximation\n{\n\tdouble Value;\n\tint X[2];\n};\n\n// Function 1: clamp a value to the Byte range [0, 255].\n// Reference: http://www.cnblogs.com/zyl910/archive/2012/03/12/noifopex1.html\n// Summary: saturates with bit masks; the masks are generated by signed right shifts.\nunsigned char ClampToByte(int Value){\n\treturn ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));\n}\n\n// Function 2: clamp a value to a given range\n// Reference: none\n// Summary: none\nint ClampToInt(int Value, int Min, int Max) {\n\tif (Value < Min) return Min;\n\telse if (Value > Max) return Max;\n\telse return Value;\n}\n\n// Function 3: divide an integer by 255\n// Reference: none\n// Summary: uses shifts\nint Div255(int Value) {\n\treturn (((Value >> 8) + Value + 1) >> 8);\n}\n\n// Function 4: absolute value\n// Reference: https://oi-wiki.org/math/bit/\n// Summary: faster than n > 0 ? n : -n\n\nint Abs(int n) {\n\treturn (n ^ (n >> 31)) - (n >> 31);\n\t/* n >> 31 extracts the sign of n: it is 0 when n is non-negative and -1 when n is negative.\n\tWhen n is non-negative, n ^ 0 = n, so the value is unchanged.\n\tWhen n is negative, n ^ -1 flips every bit of the two's complement representation,\n\twhich gives |n| - 1; subtracting -1 then yields the absolute value. */\n}\n\n// Function 5: round to the nearest integer\n// Reference: none\n// Summary: none\ndouble Round(double V)\n{\n\treturn (V > 0.0) ? 
floor(V + 0.5) : ceil(V - 0.5);\n}\n\n// Function 6: return a random number in [0, 1)\n// Reference: none\n// Summary: none\ndouble Rand()\n{\n\treturn (double)rand() / (RAND_MAX + 1.0);\n}\n\n// Function 7: approximate Pow, for both double and float\n// Reference: http://www.cvchina.info/2010/03/19/log-pow-exp-approximation/\n// Reference: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/\n// Summary: a fast approximation only; the error ranges from about 5% to 12%\ndouble Pow(double X, double Y)\n{\n\tApproximation V = { X };\n\tV.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);\n\tV.X[0] = 0;\n\treturn V.Value;\n}\n\n\nfloat Pow(float X, float Y)\n{\n\tApproximation V = { X };\n\tV.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);\n\tV.X[0] = 0;\n\treturn (float)V.Value;\n}\n\n// Function 8: approximate Exp, for both double and float\ndouble Exp(double Y)\t\t\t//\tthe union-based approach is somewhat faster\n{\n\tApproximation V;\n\tV.X[1] = (int)(Y * 1485963 + 1072632447);\n\tV.X[0] = 0;\n\treturn V.Value;\n}\n\nfloat Exp(float Y)\t\t\t//\tthe union-based approach is somewhat faster\n{\n\tApproximation V;\n\tV.X[1] = (int)(Y * 1485963 + 1072632447);\n\tV.X[0] = 0;\n\treturn (float)V.Value;\n}\n\n// Function 9: a more accurate approximation of Pow, at slightly lower speed\n// http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/\n// Besides that, I also have now a slower approximation that has much less error\n// when the exponent is larger than 1. 
It makes use of exponentiation by squaring,\n// which is exact for the integer part of the exponent, and uses only the exponent’s fraction for the approximation:\n// should be much more precise with large Y\n\ndouble PrecisePow(double X, double Y){\n\t// calculate approximation with fraction of the exponent\n\tint e = (int)Y;\n\tApproximation V = { X };\n\tV.X[1] = (int)((Y - e) * (V.X[1] - 1072632447) + 1072632447);\n\tV.X[0] = 0;\n\t// exponentiation by squaring with the exponent's integer part\n\t// double r = u.d makes everything much slower, not sure why\n\tdouble r = 1.0;\n\twhile (e)\n\t{\n\t\tif (e & 1)\tr *= X;\n\t\tX *= X;\n\t\te >>= 1;\n\t}\n\treturn r * V.Value;\n}\n\n// Function 10: return a random integer between Min and Max\n// Reference: none\n// Summary: Min is the smallest value returned, Max is the largest\nint Random(int Min, int Max){\n\treturn rand() % (Max + 1 - Min) + Min;\n}\n\n// Function 11: sign function\n// Reference: none\n// Summary: none\nint sgn(int X){\n\tif (X > 0) return 1;\n\tif (X < 0) return -1;\n\treturn 0;\n}\n\n// Function 12: extract the color components stored in an integer\n// Reference: none\n// Summary: none\nvoid GetRGB(int Color, int *R, int *G, int *B){\n\t*R = Color & 255;\n\t*G = (Color & 65280) / 256;\n\t*B = (Color & 16711680) / 65536;\n}\n\n// Function 13: approximate the square root of a number with Newton's method\n// Reference: https://www.cnblogs.com/qlky/p/7735145.html\n// Summary: still an approximation; computes the inverse square root first, then returns its reciprocal\nfloat Sqrt(float X)\n{\n\tfloat HalfX = 0.5f * X;             // not valid for double values\n\tint I = *(int*)&X;                  // get bits for floating VALUE \n\tI = 0x5f375a86 - (I >> 1);          // gives initial guess y0\n\tX = *(float*)&I;                    // convert bits BACK to float\n\tX = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy\n\tX = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy\n\tX = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy\n\treturn 1 / X;\n}\n\n// Function 14: add unsigned short histogram data, i.e. Y = X + Y\n// Reference: none\n// Summary: SSE optimized\nvoid HistgramAddShort(unsigned short *X, unsigned short *Y)\n{\n\t*(__m128i*)(Y + 0) = _mm_add_epi16(*(__m128i*)&Y[0], 
*(__m128i*)&X[0]);\t\t//\tdo not expect hand-written assembly to beat this; it has already been tried\n\t*(__m128i*)(Y + 8) = _mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);\n\t*(__m128i*)(Y + 16) = _mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);\n\t*(__m128i*)(Y + 24) = _mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);\n\t*(__m128i*)(Y + 32) = _mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);\n\t*(__m128i*)(Y + 40) = _mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]);\n\t*(__m128i*)(Y + 48) = _mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]);\n\t*(__m128i*)(Y + 56) = _mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]);\n\t*(__m128i*)(Y + 64) = _mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]);\n\t*(__m128i*)(Y + 72) = _mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]);\n\t*(__m128i*)(Y + 80) = _mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]);\n\t*(__m128i*)(Y + 88) = _mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]);\n\t*(__m128i*)(Y + 96) = _mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]);\n\t*(__m128i*)(Y + 104) = _mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]);\n\t*(__m128i*)(Y + 112) = _mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]);\n\t*(__m128i*)(Y + 120) = _mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]);\n\t*(__m128i*)(Y + 128) = _mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]);\n\t*(__m128i*)(Y + 136) = _mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]);\n\t*(__m128i*)(Y + 144) = _mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]);\n\t*(__m128i*)(Y + 152) = _mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]);\n\t*(__m128i*)(Y + 160) = _mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]);\n\t*(__m128i*)(Y + 168) = _mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]);\n\t*(__m128i*)(Y + 176) = _mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]);\n\t*(__m128i*)(Y + 184) = _mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]);\n\t*(__m128i*)(Y + 192) = _mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]);\n\t*(__m128i*)(Y + 200) = 
_mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]);\n\t*(__m128i*)(Y + 208) = _mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]);\n\t*(__m128i*)(Y + 216) = _mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]);\n\t*(__m128i*)(Y + 224) = _mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]);\n\t*(__m128i*)(Y + 232) = _mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]);\n\t*(__m128i*)(Y + 240) = _mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]);\n\t*(__m128i*)(Y + 248) = _mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);\n}\n\n// Function 15: subtract unsigned short histogram data, i.e. Y = Y - X\n// Reference: none\n// Summary: SSE optimized\nvoid HistgramSubShort(unsigned short *X, unsigned short *Y)\n{\n\t*(__m128i*)(Y + 0) = _mm_sub_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]);\n\t*(__m128i*)(Y + 8) = _mm_sub_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);\n\t*(__m128i*)(Y + 16) = _mm_sub_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);\n\t*(__m128i*)(Y + 24) = _mm_sub_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);\n\t*(__m128i*)(Y + 32) = _mm_sub_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);\n\t*(__m128i*)(Y + 40) = _mm_sub_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]);\n\t*(__m128i*)(Y + 48) = _mm_sub_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]);\n\t*(__m128i*)(Y + 56) = _mm_sub_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]);\n\t*(__m128i*)(Y + 64) = _mm_sub_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]);\n\t*(__m128i*)(Y + 72) = _mm_sub_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]);\n\t*(__m128i*)(Y + 80) = _mm_sub_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]);\n\t*(__m128i*)(Y + 88) = _mm_sub_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]);\n\t*(__m128i*)(Y + 96) = _mm_sub_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]);\n\t*(__m128i*)(Y + 104) = _mm_sub_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]);\n\t*(__m128i*)(Y + 112) = _mm_sub_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]);\n\t*(__m128i*)(Y + 120) = _mm_sub_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]);\n\t*(__m128i*)(Y + 128) = _mm_sub_epi16(*(__m128i*)&Y[128], 
*(__m128i*)&X[128]);\n\t*(__m128i*)(Y + 136) = _mm_sub_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]);\n\t*(__m128i*)(Y + 144) = _mm_sub_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]);\n\t*(__m128i*)(Y + 152) = _mm_sub_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]);\n\t*(__m128i*)(Y + 160) = _mm_sub_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]);\n\t*(__m128i*)(Y + 168) = _mm_sub_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]);\n\t*(__m128i*)(Y + 176) = _mm_sub_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]);\n\t*(__m128i*)(Y + 184) = _mm_sub_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]);\n\t*(__m128i*)(Y + 192) = _mm_sub_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]);\n\t*(__m128i*)(Y + 200) = _mm_sub_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]);\n\t*(__m128i*)(Y + 208) = _mm_sub_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]);\n\t*(__m128i*)(Y + 216) = _mm_sub_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]);\n\t*(__m128i*)(Y + 224) = _mm_sub_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]);\n\t*(__m128i*)(Y + 232) = _mm_sub_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]);\n\t*(__m128i*)(Y + 240) = _mm_sub_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]);\n\t*(__m128i*)(Y + 248) = _mm_sub_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);\n}\n\n// Function 16: combined add and subtract of unsigned short histogram data, i.e. Z = Z + Y - X\n// Reference: none\n// Summary: SSE optimized\nvoid HistgramSubAddShort(unsigned short *X, unsigned short *Y, unsigned short *Z)\n{\n\t*(__m128i*)(Z + 0) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&Z[0]), *(__m128i*)&X[0]);\t\t\t\t\t\t//\tdo not expect hand-written assembly to beat this; it has already been tried\n\t*(__m128i*)(Z + 8) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&Z[8]), *(__m128i*)&X[8]);\n\t*(__m128i*)(Z + 16) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&Z[16]), *(__m128i*)&X[16]);\n\t*(__m128i*)(Z + 24) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&Z[24]), *(__m128i*)&X[24]);\n\t*(__m128i*)(Z + 32) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&Z[32]), *(__m128i*)&X[32]);\n\t*(__m128i*)(Z + 
40) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&Z[40]), *(__m128i*)&X[40]);\n\t*(__m128i*)(Z + 48) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&Z[48]), *(__m128i*)&X[48]);\n\t*(__m128i*)(Z + 56) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&Z[56]), *(__m128i*)&X[56]);\n\t*(__m128i*)(Z + 64) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&Z[64]), *(__m128i*)&X[64]);\n\t*(__m128i*)(Z + 72) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&Z[72]), *(__m128i*)&X[72]);\n\t*(__m128i*)(Z + 80) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&Z[80]), *(__m128i*)&X[80]);\n\t*(__m128i*)(Z + 88) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&Z[88]), *(__m128i*)&X[88]);\n\t*(__m128i*)(Z + 96) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&Z[96]), *(__m128i*)&X[96]);\n\t*(__m128i*)(Z + 104) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&Z[104]), *(__m128i*)&X[104]);\n\t*(__m128i*)(Z + 112) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&Z[112]), *(__m128i*)&X[112]);\n\t*(__m128i*)(Z + 120) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&Z[120]), *(__m128i*)&X[120]);\n\t*(__m128i*)(Z + 128) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&Z[128]), *(__m128i*)&X[128]);\n\t*(__m128i*)(Z + 136) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&Z[136]), *(__m128i*)&X[136]);\n\t*(__m128i*)(Z + 144) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&Z[144]), *(__m128i*)&X[144]);\n\t*(__m128i*)(Z + 152) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&Z[152]), *(__m128i*)&X[152]);\n\t*(__m128i*)(Z + 160) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&Z[160]), *(__m128i*)&X[160]);\n\t*(__m128i*)(Z + 168) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&Z[168]), *(__m128i*)&X[168]);\n\t*(__m128i*)(Z + 176) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[176], 
*(__m128i*)&Z[176]), *(__m128i*)&X[176]);\n\t*(__m128i*)(Z + 184) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&Z[184]), *(__m128i*)&X[184]);\n\t*(__m128i*)(Z + 192) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&Z[192]), *(__m128i*)&X[192]);\n\t*(__m128i*)(Z + 200) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&Z[200]), *(__m128i*)&X[200]);\n\t*(__m128i*)(Z + 208) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&Z[208]), *(__m128i*)&X[208]);\n\t*(__m128i*)(Z + 216) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&Z[216]), *(__m128i*)&X[216]);\n\t*(__m128i*)(Z + 224) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&Z[224]), *(__m128i*)&X[224]);\n\t*(__m128i*)(Z + 232) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&Z[232]), *(__m128i*)&X[232]);\n\t*(__m128i*)(Z + 240) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&Z[240]), *(__m128i*)&X[240]);\n\t*(__m128i*)(Z + 248) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&Z[248]), *(__m128i*)&X[248]);\n}\n"
  },
  {
    "path": "speed_gaussian_filter_sse.cpp",
    "content": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n\nusing namespace std;\nusing namespace cv;\n\nvoid CalcGaussCof(float Radius, float &B0, float &B1, float &B2, float &B3)\n{\n\tfloat Q, B;\n\tif (Radius >= 2.5)\n\t\tQ = (double)(0.98711 * Radius - 0.96330);                            //    对应论文公式11b\n\telse if ((Radius >= 0.5) && (Radius < 2.5))\n\t\tQ = (double)(3.97156 - 4.14554 * sqrt(1 - 0.26891 * Radius));\n\telse\n\t\tQ = (double)0.1147705018520355224609375;\n\n\tB = 1.57825 + 2.44413 * Q + 1.4281 * Q * Q + 0.422205 * Q * Q * Q;        //    对应论文公式8c\n\tB1 = 2.44413 * Q + 2.85619 * Q * Q + 1.26661 * Q * Q * Q;\n\tB2 = -1.4281 * Q * Q - 1.26661 * Q * Q * Q;\n\tB3 = 0.422205 * Q * Q * Q;\n\n\tB0 = 1.0 - (B1 + B2 + B3) / B;\n\tB1 = B1 / B;\n\tB2 = B2 / B;\n\tB3 = B3 / B;\n}\n\nvoid ConvertBGR8U2BGRAF(unsigned char *Src, float *Dest, int Width, int Height, int Stride)\n{\n\t//#pragma omp parallel for\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tfloat *LinePD = Dest + Y * Width * 3;\n\t\tfor (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3)\n\t\t{\n\t\t\tLinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];\n\t\t}\n\t}\n}\n\nvoid ConvertBGR8U2BGRAF_SSE(unsigned char *Src, float *Dest, int Width, int Height, int Stride) {\n\tconst int BlockSize = 4;\n\tint Block = (Width - 2) / BlockSize;\n\t__m128i Mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);\n\t__m128i Zero = _mm_setzero_si128();\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tfloat *LinePD = Dest + Y * Width * 4;\n\t\tint X = 0;\n\t\tfor (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize * 4) {\n\t\t\t__m128i SrcV = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)LinePS), Mask);\n\t\t\t__m128i Src16L = _mm_unpacklo_epi8(SrcV, Zero);\n\t\t\t__m128i Src16H = _mm_unpackhi_epi8(SrcV, Zero);\n\t\t\t_mm_store_ps(LinePD 
+ 0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16L, Zero)));\n\t\t\t_mm_store_ps(LinePD + 4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16L, Zero)));\n\t\t\t_mm_store_ps(LinePD + 8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16H, Zero)));\n\t\t\t_mm_store_ps(LinePD + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16H, Zero)));\n\t\t}\n\t\tfor (; X < Width; X++, LinePS += 3, LinePD += 4) {\n\t\t\tLinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];    LinePD[3] = 0;\n\t\t}\n\t}\n}\n\nvoid GaussBlurFromLeftToRight(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)\n{\n\t//#pragma omp parallel for\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tfloat *LinePD = Data + Y * Width * 3;\n\t\t//w[n-1], w[n-2], w[n-3]\n\t\tfloat BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0]; //边缘处使用重复像素的方案\n\t\tfloat GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1];\n\t\tfloat RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2];\n\t\tfor (int X = 0; X < Width; X++, LinePD += 3)\n\t\t{\n\t\t\tLinePD[0] = LinePD[0] * B0 + BS1 * B1 + BS2 * B2 + BS3 * B3;\n\t\t\tLinePD[1] = LinePD[1] * B0 + GS1 * B1 + GS2 * B2 + GS3 * B3;         // 进行顺向迭代\n\t\t\tLinePD[2] = LinePD[2] * B0 + RS1 * B1 + RS2 * B2 + RS3 * B3;\n\t\t\tBS3 = BS2, BS2 = BS1, BS1 = LinePD[0];\n\t\t\tGS3 = GS2, GS2 = GS1, GS1 = LinePD[1];\n\t\t\tRS3 = RS2, RS2 = RS1, RS1 = LinePD[2];\n\t\t}\n\t}\n}\n\nvoid GaussBlurFromLeftToRight_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {\n\tconst __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);\n\tconst __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);\n\tconst __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);\n\tconst __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tfloat *LinePD = Data + Y * Width * 4;\n\t\t__m128 V1 = _mm_set_ps(LinePD[3], LinePD[2], LinePD[1], LinePD[0]);\n\t\t__m128 V2 = V1, V3 = V1;\n\t\tfor (int X = 0; X < Width; X++, LinePD += 4) {\n\t\t\t__m128 V0 = _mm_load_ps(LinePD);\n\t\t\t__m128 V01 = 
_mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));\n\t\t\t__m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));\n\t\t\t__m128 V = _mm_add_ps(V01, V23);\n\t\t\tV3 = V2; V2 = V1; V1 = V;\n\t\t\t_mm_store_ps(LinePD, V);\n\t\t}\n\t}\n}\n\nvoid GaussBlurFromRightToLeft(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\t//w[n+1], w[n+2], w[n+3]\n\t\tfloat *LinePD = Data + Y * Width * 3 + (Width - 1) * 3;\t// start at the last pixel of the row\n\t\tfloat BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0]; // replicate the edge pixel at the border\n\t\tfloat GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1];\n\t\tfloat RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2];\n\t\tfor (int X = Width - 1; X >= 0; X--, LinePD -= 3)\n\t\t{\n\t\t\tLinePD[0] = LinePD[0] * B0 + BS3 * B1 + BS2 * B2 + BS1 * B3;\n\t\t\tLinePD[1] = LinePD[1] * B0 + GS3 * B1 + GS2 * B2 + GS1 * B3;         // backward pass\n\t\t\tLinePD[2] = LinePD[2] * B0 + RS3 * B1 + RS2 * B2 + RS1 * B3;\n\t\t\tBS1 = BS2, BS2 = BS3, BS3 = LinePD[0];\n\t\t\tGS1 = GS2, GS2 = GS3, GS3 = LinePD[1];\n\t\t\tRS1 = RS2, RS2 = RS3, RS3 = LinePD[2];\n\t\t}\n\t}\n}\n\nvoid GaussBlurFromRightToLeft_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {\n\tconst __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);\n\tconst __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);\n\tconst __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);\n\tconst __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tfloat *LinePD = Data + Y * Width * 4 + (Width - 1) * 4;\t// start at the last pixel of the row\n\t\t__m128 V1 = _mm_set_ps(LinePD[3], LinePD[2], LinePD[1], LinePD[0]);\n\t\t__m128 V2 = V1, V3 = V1;\n\t\tfor (int X = Width - 1; X >= 0; X--, LinePD -= 4) {\n\t\t\t__m128 V0 = _mm_load_ps(LinePD);\n\t\t\t__m128 V03 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V3));\n\t\t\t__m128 V12 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V1));\n\t\t\t__m128 V = _mm_add_ps(V03, V12);\n\t\t\tV1 = V2; V2 = V3; V3 = 
V;\n\t\t\t_mm_store_ps(LinePD, V);\n\t\t}\n\t}\n}\n\n\n//w[n] w[n-1], w[n-2], w[n-3]\nvoid GaussBlurFromTopToBottom(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)\n{\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tfloat *LinePD3 = Data + (Y + 0) * Width * 3;\n\t\tfloat *LinePD2 = Data + (Y + 1) * Width * 3;\n\t\tfloat *LinePD1 = Data + (Y + 2) * Width * 3;\n\t\tfloat *LinePD0 = Data + (Y + 3) * Width * 3;\n\t\tfor (int X = 0; X < Width; X++, LinePD0 += 3, LinePD1 += 3, LinePD2 += 3, LinePD3 += 3)\n\t\t{\n\t\t\tLinePD0[0] = LinePD0[0] * B0 + LinePD1[0] * B1 + LinePD2[0] * B2 + LinePD3[0] * B3;\n\t\t\tLinePD0[1] = LinePD0[1] * B0 + LinePD1[1] * B1 + LinePD2[1] * B2 + LinePD3[1] * B3;\n\t\t\tLinePD0[2] = LinePD0[2] * B0 + LinePD1[2] * B1 + LinePD2[2] * B2 + LinePD3[2] * B3;\n\t\t}\n\t}\n}\n\nvoid GaussBlurFromTopToBottom_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3){\n\tconst  __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);\n\tconst  __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);\n\tconst  __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);\n\tconst  __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tfloat *LinePS3 = Data + (Y + 0) * Width * 4;\n\t\tfloat *LinePS2 = Data + (Y + 1) * Width * 4;\n\t\tfloat *LinePS1 = Data + (Y + 2) * Width * 4;\n\t\tfloat *LinePS0 = Data + (Y + 3) * Width * 4;\n\t\tfor (int X = 0; X < Width * 4; X += 4)\n\t\t{\n\t\t\t__m128 V3 = _mm_load_ps(LinePS3 + X);\n\t\t\t__m128 V2 = _mm_load_ps(LinePS2 + X);\n\t\t\t__m128 V1 = _mm_load_ps(LinePS1 + X);\n\t\t\t__m128 V0 = _mm_load_ps(LinePS0 + X);\n\t\t\t__m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));\n\t\t\t__m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));\n\t\t\t_mm_store_ps(LinePS0 + X, _mm_add_ps(V01, V23));\n\t\t}\n\t}\n}\n//w[n] w[n+1], w[n+2], w[n+3]\nvoid GaussBlurFromBottomToTop(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {\n\tfor (int Y 
= Height - 1; Y >= 0; Y--) {\n\t\tfloat *LinePD3 = Data + (Y + 3) * Width * 3;\n\t\tfloat *LinePD2 = Data + (Y + 2) * Width * 3;\n\t\tfloat *LinePD1 = Data + (Y + 1) * Width * 3;\n\t\tfloat *LinePD0 = Data + (Y + 0) * Width * 3;\n\t\tfor (int X = 0; X < Width; X++, LinePD0 += 3, LinePD1 += 3, LinePD2 += 3, LinePD3 += 3) {\n\t\t\tLinePD0[0] = LinePD0[0] * B0 + LinePD1[0] * B1 + LinePD2[0] * B2 + LinePD3[0] * B3;\n\t\t\tLinePD0[1] = LinePD0[1] * B0 + LinePD1[1] * B1 + LinePD2[1] * B2 + LinePD3[1] * B3;\n\t\t\tLinePD0[2] = LinePD0[2] * B0 + LinePD1[2] * B1 + LinePD2[2] * B2 + LinePD3[2] * B3;\n\t\t}\n\t}\n}\n\nvoid GaussBlurFromBottomToTop_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {\n\tconst  __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);\n\tconst  __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);\n\tconst  __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);\n\tconst  __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);\n\tfor (int Y = Height - 1; Y >= 0; Y--) {\n\t\tfloat *LinePS3 = Data + (Y + 3) * Width * 4;\n\t\tfloat *LinePS2 = Data + (Y + 2) * Width * 4;\n\t\tfloat *LinePS1 = Data + (Y + 1) * Width * 4;\n\t\tfloat *LinePS0 = Data + (Y + 0) * Width * 4;\n\t\tfor (int X = 0; X < Width * 4; X += 4) {\n\t\t\t__m128 V3 = _mm_load_ps(LinePS3 + X);\n\t\t\t__m128 V2 = _mm_load_ps(LinePS2 + X);\n\t\t\t__m128 V1 = _mm_load_ps(LinePS1 + X);\n\t\t\t__m128 V0 = _mm_load_ps(LinePS0 + X);\n\t\t\t__m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));\n\t\t\t__m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));\n\t\t\t_mm_store_ps(LinePS0 + X, _mm_add_ps(V01, V23));\n\t\t}\n\t}\n}\n\nvoid ConvertBGRAF2BGR8U(float *Src, unsigned char *Dest, int Width, int Height, int Stride)\n{\n\t//#pragma omp parallel for\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tfloat *LinePS = Src + Y * Width * 3;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tfor (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3)\n\t\t{\n\t\t\tLinePD[0] = LinePS[0];    
LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];\n\t\t}\n\t}\n}\n\n\nvoid ConvertBGRAF2BGR8U_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tconst int BlockSize = 4;\n\tint Block = (Width - 2) / BlockSize;\n\t//__m128i Mask = _mm_setr_epi8(0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15);\n\t__m128i MaskB = _mm_setr_epi8(0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);\n\t__m128i MaskG = _mm_setr_epi8(1, 5, 9, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);\n\t__m128i MaskR = _mm_setr_epi8(2, 6, 10, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);\n\t__m128i Zero = _mm_setzero_si128();\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Width * 4;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tint X = 0;\n\t\tfor (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 4, LinePD += BlockSize * 3) {\n\t\t\t__m128i SrcV = _mm_loadu_si128((const __m128i*)LinePS);\n\t\t\t__m128i B = _mm_shuffle_epi8(SrcV, MaskB);\n\t\t\t__m128i G = _mm_shuffle_epi8(SrcV, MaskG);\n\t\t\t__m128i R = _mm_shuffle_epi8(SrcV, MaskR);\n\t\t\t__m128i Ans1 = Zero, Ans2 = Zero, Ans3 = Zero;\n\t\t\tAns1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(B, _mm_setr_epi8(0, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); \n\t\t\tAns1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(G, _mm_setr_epi8(-1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(R, _mm_setr_epi8(-1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\n\t\t\tAns2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(B, _mm_setr_epi8(-1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(G, _mm_setr_epi8(1, -1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(R, _mm_setr_epi8(-1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\n\t\t\tAns3 = 
_mm_or_si128(Ans3, _mm_shuffle_epi8(B, _mm_setr_epi8(-1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(G, _mm_setr_epi8(-1, -1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\t\t\tAns3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(R, _mm_setr_epi8(2, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)));\n\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + 0), Ans1);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + 4), Ans2);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + 8), Ans3);\n\t\t}\n\t\tfor (; X < Width; X++, LinePS += 4, LinePD += 3) {\n\t\t\tLinePD[0] = LinePS[0]; LinePD[1] = LinePS[1]; LinePD[2] = LinePS[2];\n\t\t}\n\t}\n}\n\nvoid GaussBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius)\n{\n\tfloat B0, B1, B2, B3;\n\tfloat *Buffer = (float *)malloc(Width * (Height + 6) * sizeof(float) * 3);\n\tCalcGaussCof(Radius, B0, B1, B2, B3);\n\tConvertBGR8U2BGRAF(Src, Buffer + 3 * Width * 3, Width, Height, Stride);\n\tGaussBlurFromLeftToRight(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3);\n\tGaussBlurFromRightToLeft(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3);        //    如果启用多线程，建议把这个函数写到GaussBlurFromLeftToRight的for X循环里，因为这样就可以减少线程并发时的阻力\n\n\tmemcpy(Buffer + 0 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));\n\tmemcpy(Buffer + 1 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));\n\tmemcpy(Buffer + 2 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));\n\n\tGaussBlurFromTopToBottom(Buffer, Width, Height, B0, B1, B2, B3);\n\n\tmemcpy(Buffer + (Height + 3) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));\n\tmemcpy(Buffer + (Height + 4) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));\n\tmemcpy(Buffer + (Height + 5) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));\n\n\tGaussBlurFromBottomToTop(Buffer, Width, Height, B0, 
B1, B2, B3);\n\n\tConvertBGRAF2BGR8U(Buffer + 3 * Width * 3, Dest, Width, Height, Stride);\n\n\tfree(Buffer);\n}\n\nvoid GaussBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius)\n{\n\tfloat B0, B1, B2, B3;\n\tfloat *Buffer = (float *)_mm_malloc(Width * (Height + 6) * sizeof(float) * 4, 16);\n\tCalcGaussCof(Radius, B0, B1, B2, B3);\n\tConvertBGR8U2BGRAF_SSE(Src, Buffer + 3 * Width * 4, Width, Height, Stride);\n\tGaussBlurFromLeftToRight_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3);        //    在SSE版本中，这两个函数占用的时间比下面两个要多,不过C语言版本也是一样的\n\tGaussBlurFromRightToLeft_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3);        //    如果启用多线程，建议把这个函数写到GaussBlurFromLeftToRight的for X循环里，因为这样就可以减少线程并发时的阻力\n\n\tmemcpy(Buffer + 0 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));\n\tmemcpy(Buffer + 1 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));\n\tmemcpy(Buffer + 2 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));\n\n\tGaussBlurFromTopToBottom_SSE(Buffer, Width, Height, B0, B1, B2, B3);\n\n\tmemcpy(Buffer + (Height + 3) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));\n\tmemcpy(Buffer + (Height + 4) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));\n\tmemcpy(Buffer + (Height + 5) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));\n\n\tGaussBlurFromBottomToTop_SSE(Buffer, Width, Height, B0, B1, B2, B3);\n\n\tConvertBGRAF2BGR8U_SSE(Buffer + 3 * Width * 4, Dest, Width, Height, Stride);\n\n\t_mm_free(Buffer);\n}\n\nint main() {\n\tMat src = imread(\"F:\\\\car.jpg\");\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned char *Dest = new unsigned char[Height * Width * 3];\n\tint Stride = Width * 3;\n\tint Radius = 11;\n\tint64 st = cvGetTickCount();\n\tfor (int i = 0; i < 20; i++) {\n\t\tGaussBlur_SSE(Src, Dest, Width, Height, Stride, 
Radius);\n\t}\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency() *  50;\n\tprintf(\"%.5f\\n\", duration);\n\tGaussBlur_SSE(Src, Dest, Width, Height, Stride, Radius);\n\tMat dst(Height, Width, CV_8UC3, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n}"
  },
  {
    "path": "speed_histogram_algorithm_framework/BoxFilter.h",
    "content": "#pragma once\n#include \"Core.h\"\n#include \"Utility.h\"\n\n// : ʵͼ񷽿ģЧ\n// б:\n// Src: ҪԴͼݽṹ\n// Dest: 洦ͼݽṹ\n// Radius: ģİ뾶ЧΧ[1, 1000]\n// EdgeBehavior: ԵݵĴ0ʾظԵأ1ʹþķʽԱԵֵ\n// :\n// 1. ܴ8λҶȺ24λͼ\n// 2. SrcDestͬͬʱٶȻ\n// 3. SSEŻ汾ڳʼʱͰ뾶йصģڰ뾶ʱʱ΢\n\nIS_RET BoxBlur(TMatrix *Src, TMatrix *Dest, int Radius, EdgeMode Edge) {\n\tif (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;\n\tif (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;\n\tIS_RET Ret = IS_RET_OK;\n\tTMatrix *Row = NULL, *Col = NULL;\n\tint *RowPos, *ColPos, *ColSum, *Diff;\n\tint X, Y, Z, Width, Height, Channel, Index;\n\tint Value, ValueB, ValueG, ValueR;\n\tint Size = 2 * Radius + 1, Amount = Size * Size, HalfAmount = Amount / 2;\n\tWidth = Src->Width;\n\tHeight = Src->Height;\n\tChannel = Src->Channel;\n\tRet = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, EdgeMode::Smear, &Row, &Col);\t\t//\tȡƫ\n\tRowPos = ((int *)Row->Data);\n\tColPos = ((int *)Col->Data);\t\t   \n\tColSum = (int *)IS_AllocMemory(Width * Channel * sizeof(int), true);\n\tDiff = (int *)IS_AllocMemory((Width - 1) * Channel * sizeof(int), true);\n\tunsigned char *RowData = (unsigned char *)IS_AllocMemory((Width + 2 * Radius) * Channel, true);\n\tTMatrix Sum;\n\tTMatrix *p = &Sum;\n\tTMatrix **q = &p;\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_32S, Channel, q);\n\tfor (Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src->Data + Y * Src->WidthStep;\n\t\tint *LinePD = (int *)(p->Data + Y * p->WidthStep);\n\t\t//\tһݼԵֲֵʱĻ\n\t\tif (Channel == 1)\n\t\t{\n\t\t\tfor (X = 0; X < Radius; X++)\n\t\t\t\tRowData[X] = LinePS[RowPos[X]];\n\t\t\tmemcpy(RowData + Radius, LinePS, 
Width);\n\t\t\tfor (X = Radius + Width; X < Radius + Width + Radius; X++)\n\t\t\t\tRowData[X] = LinePS[RowPos[X]];\n\t\t}\n\t\telse if (Channel == 3)\n\t\t{\n\t\t\tfor (X = 0; X < Radius; X++)\n\t\t\t{\n\t\t\t\tIndex = RowPos[X] * 3;\n\t\t\t\tRowData[X * 3] = LinePS[Index];\n\t\t\t\tRowData[X * 3 + 1] = LinePS[Index + 1];\n\t\t\t\tRowData[X * 3 + 2] = LinePS[Index + 2];\n\t\t\t}\n\t\t\tmemcpy(RowData + Radius * 3, LinePS, Width * 3);\n\t\t\tfor (X = Radius + Width; X < Radius + Width + Radius; X++)\n\t\t\t{\n\t\t\t\tIndex = RowPos[X] * 3;\n\t\t\t\tRowData[X * 3 + 0] = LinePS[Index + 0];\n\t\t\t\tRowData[X * 3 + 1] = LinePS[Index + 1];\n\t\t\t\tRowData[X * 3 + 2] = LinePS[Index + 2];\n\t\t\t}\n\t\t}\n\t\tunsigned char *AddPos = RowData + Size * Channel;\n\t\tunsigned char *SubPos = RowData;\n\t\tfor (X = 0; X < (Width - 1) * Channel; X++)\n\t\t\tDiff[X] = AddPos[X] - SubPos[X];\n\t\t//\tthe first pixel of the row needs special handling\n\t\tif (Channel == 1)\n\t\t{\n\t\t\tfor (Z = 0, Value = 0; Z < Size; Z++)\tValue += RowData[Z];\n\t\t\tLinePD[0] = Value;\n\n\t\t\tfor (X = 1; X < Width; X++)\n\t\t\t{\n\t\t\t\tValue += Diff[X - 1];\tLinePD[X] = Value;\t\t\t\t//\trunning sum: add the precomputed difference\n\t\t\t}\n\t\t}\n\t\telse if (Channel == 3)\n\t\t{\n\t\t\tfor (Z = 0, ValueB = ValueG = ValueR = 0; Z < Size; Z++)\n\t\t\t{\n\t\t\t\tValueB += RowData[Z * 3 + 0];\n\t\t\t\tValueG += RowData[Z * 3 + 1];\n\t\t\t\tValueR += RowData[Z * 3 + 2];\n\t\t\t}\n\t\t\tLinePD[0] = ValueB;\tLinePD[1] = ValueG;\tLinePD[2] = ValueR;\n\n\t\t\tfor (X = 1; X < Width; X++)\n\t\t\t{\n\t\t\t\tIndex = X * 3;\n\t\t\t\tValueB += Diff[Index - 3];\t\tLinePD[Index + 0] = ValueB;\n\t\t\t\tValueG += Diff[Index - 2];\t\tLinePD[Index + 1] = ValueG;\n\t\t\t\tValueR += Diff[Index - 1];\t\tLinePD[Index + 2] = ValueR;\n\t\t\t}\n\t\t}\n\t}\n\tfor (Y = 0; Y < Size - 1; Y++)\t\t\t//\tnote: the last row of the window is not added here\t\t\t\t\t\t\n\t{\n\t\tint *LinePS = (int *)(p->Data + ColPos[Y] * p->WidthStep);\n\t\tfor (X = 0; X < Width * Channel; X++)\tColSum[X] += LinePS[X];\n\t}\n\n\tfor (Y = 0; Y < Height; 
Y++)\n\t{\n\t\tunsigned char* LinePD = Dest->Data + Y * Dest->WidthStep;\n\t\tint *AddPos = (int*)(p->Data + ColPos[Y + Size - 1] * p->WidthStep);\n\t\tint *SubPos = (int*)(p->Data + ColPos[Y] * p->WidthStep);\n\n\t\tfor (X = 0; X < Width * Channel; X++)\n\t\t{\n\t\t\tValue = ColSum[X] + AddPos[X];\n\t\t\tLinePD[X] = (Value + HalfAmount) / Amount;\t\t\t\t\t//\tadding HalfAmount rounds to nearest\n\t\t\tColSum[X] = Value - SubPos[X];\n\t\t}\n\t}\n\tIS_FreeMatrix(&Row);\t\t//\tfrees RowPos (Row->Data) and the TMatrix header\n\tIS_FreeMatrix(&Col);\t\t//\tfrees ColPos (Col->Data) and the TMatrix header\n\tIS_FreeMatrix(&p);\t\t\t//\tfrees the row-sum matrix allocated by IS_CreateMatrix\n\tIS_FreeMemory(Diff);\n\tIS_FreeMemory(ColSum);\n\tIS_FreeMemory(RowData);\n\treturn Ret;\n}\n\n// Function: box blur of an image, SSE-optimized.\n\nIS_RET BoxBlur_SSE(TMatrix *Src, TMatrix *Dest, int Radius, EdgeMode Edge) {\n\tif (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;\n\tif (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;\n\tIS_RET Ret = IS_RET_OK;\n\tTMatrix *Row = NULL, *Col = NULL;\n\tint *RowPos, *ColPos, *ColSum, *Diff;\n\tint X, Y, Z, Width, Height, Channel, Index;\n\tint Value, ValueB, ValueG, ValueR;\n\tint Size = 2 * Radius + 1, Amount = Size * Size, HalfAmount = Amount / 2;\n\tfloat Scale = 1.0 / (Size * Size);\n\tWidth = Src->Width;\n\tHeight = Src->Height;\n\tChannel = Src->Channel;\n\tRet = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, EdgeMode::Smear, &Row, &Col);\t\t//\tget the coordinate offsets\n\tRowPos = ((int *)Row->Data);\n\tColPos = ((int *)Col->Data);\n\tColSum = (int *)IS_AllocMemory(Width * Channel * sizeof(int), true);\n\tDiff = (int *)IS_AllocMemory((Width - 1) * Channel * sizeof(int), true);\n\tunsigned char *RowData = (unsigned char *)IS_AllocMemory((Width + 2 * Radius) * Channel, true);\n\tTMatrix Sum;\n\tTMatrix *p = &Sum;\n\tTMatrix **q = 
&p;\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_32S, Channel, q);\n\tfor (Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src->Data + Y * Src->WidthStep;\n\t\tint *LinePD = (int *)(p->Data + Y * p->WidthStep);\n\t\t//\tcopy one row plus its edge padding into the temporary buffer\n\t\tif (Channel == 1)\n\t\t{\n\t\t\tfor (X = 0; X < Radius; X++)\n\t\t\t\tRowData[X] = LinePS[RowPos[X]];\n\t\t\tmemcpy(RowData + Radius, LinePS, Width);\n\t\t\tfor (X = Radius + Width; X < Radius + Width + Radius; X++)\n\t\t\t\tRowData[X] = LinePS[RowPos[X]];\n\t\t}\n\t\telse if (Channel == 3)\n\t\t{\n\t\t\tfor (X = 0; X < Radius; X++)\n\t\t\t{\n\t\t\t\tIndex = RowPos[X] * 3;\n\t\t\t\tRowData[X * 3] = LinePS[Index];\n\t\t\t\tRowData[X * 3 + 1] = LinePS[Index + 1];\n\t\t\t\tRowData[X * 3 + 2] = LinePS[Index + 2];\n\t\t\t}\n\t\t\tmemcpy(RowData + Radius * 3, LinePS, Width * 3);\n\t\t\tfor (X = Radius + Width; X < Radius + Width + Radius; X++)\n\t\t\t{\n\t\t\t\tIndex = RowPos[X] * 3;\n\t\t\t\tRowData[X * 3 + 0] = LinePS[Index + 0];\n\t\t\t\tRowData[X * 3 + 1] = LinePS[Index + 1];\n\t\t\t\tRowData[X * 3 + 2] = LinePS[Index + 2];\n\t\t\t}\n\t\t}\n\t\tunsigned char *AddPos = RowData + Size * Channel;\n\t\tunsigned char *SubPos = RowData;\n\t\tX = 0;\n\t\t__m128i Zero = _mm_setzero_si128();\n\t\tfor (; X <= (Width - 1) * Channel - 8; X += 8) {\n\t\t\t__m128i Add = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i const *)(AddPos + X)), Zero);\n\t\t\t__m128i Sub = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i const *)(SubPos + X)), Zero);\n\t\t\t_mm_store_si128((__m128i *)(Diff + X + 0), _mm_sub_epi32(_mm_unpacklo_epi16(Add, Zero), _mm_unpacklo_epi16(Sub, Zero)));\n\t\t\t_mm_store_si128((__m128i *)(Diff + X + 4), _mm_sub_epi32(_mm_unpackhi_epi16(Add, Zero), _mm_unpackhi_epi16(Sub, Zero)));\n\t\t}\n\t\tfor (; X < (Width - 1) * Channel; X++)\n\t\t\tDiff[X] = AddPos[X] - SubPos[X];\n\t\t//\tthe first pixel of the row needs special handling\n\t\tif (Channel == 1)\n\t\t{\n\t\t\tfor (Z = 0, Value = 0; Z < Size; Z++)\tValue += RowData[Z];\n\t\t\tLinePD[0] = Value;\n\n\t\t\tfor (X = 1; X < 
Width; X++)\n\t\t\t{\n\t\t\t\tValue += Diff[X - 1];\n\t\t\t\tLinePD[X] = Value;\n\t\t\t}\n\t\t}\n\t\telse if (Channel == 3)\n\t\t{\n\t\t\tfor (Z = 0, ValueB = ValueG = ValueR = 0; Z < Size; Z++)\n\t\t\t{\n\t\t\t\tValueB += RowData[Z * 3 + 0];\n\t\t\t\tValueG += RowData[Z * 3 + 1];\n\t\t\t\tValueR += RowData[Z * 3 + 2];\n\t\t\t}\n\t\t\tLinePD[0] = ValueB;\tLinePD[1] = ValueG;\tLinePD[2] = ValueR;\n\n\t\t\tfor (X = 1; X < Width; X++)\n\t\t\t{\n\t\t\t\tIndex = X * 3;\n\t\t\t\tValueB += Diff[Index - 3];\t\tLinePD[Index + 0] = ValueB;\n\t\t\t\tValueG += Diff[Index - 2];\t\tLinePD[Index + 1] = ValueG;\n\t\t\t\tValueR += Diff[Index - 1];\t\tLinePD[Index + 2] = ValueR;\n\t\t\t}\n\t\t}\n\t}\n\n\tfor (Y = 0; Y < Size - 1; Y++) {\n\t\tX = 0;\n\t\tint *LinePS = (int *)(p->Data + ColPos[Y] * p->WidthStep);\n\t\tfor (; X <= Width * Channel - 4; X += 4) {\n\t\t\t__m128i SumP = _mm_load_si128((const __m128i*)(ColSum + X));\n\t\t\t// WidthStep is only a multiple of 4 bytes, so a row of the sum matrix is not guaranteed to be 16-byte aligned: use an unaligned load\n\t\t\t__m128i SrcP = _mm_loadu_si128((const __m128i*)(LinePS + X));\n\t\t\t_mm_store_si128((__m128i *)(ColSum + X), _mm_add_epi32(SumP, SrcP));\n\t\t}\n\t\tfor (; X < Width * Channel; X++) ColSum[X] += LinePS[X];\n\t}\n\n\tfor (Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePD = Dest->Data + Y * Dest->WidthStep;\n\t\tint *AddPos = (int*)(p->Data + ColPos[Y + Size - 1] * p->WidthStep);\n\t\tint *SubPos = (int*)(p->Data + ColPos[Y] * p->WidthStep);\n\t\tX = 0;\n\t\tconst __m128 Inv = _mm_set1_ps(Scale);\n\t\tfor (; X <= Width * Channel - 8; X += 8) {\n\t\t\t__m128i Sub1 = _mm_loadu_si128((const __m128i*)(SubPos + X + 0));\n\t\t\t__m128i Sub2 = _mm_loadu_si128((const __m128i*)(SubPos + X + 4));\n\t\t\t__m128i Add1 = _mm_loadu_si128((const __m128i*)(AddPos + X + 0));\n\t\t\t__m128i Add2 = _mm_loadu_si128((const __m128i*)(AddPos + X + 4));\n\t\t\t__m128i Col1 = _mm_load_si128((const __m128i*)(ColSum + X + 0));\n\t\t\t__m128i Col2 = _mm_load_si128((const __m128i*)(ColSum + X + 4));\n\n\t\t\t__m128i Sum1 = _mm_add_epi32(Col1, Add1);\n\t\t\t__m128i Sum2 = 
_mm_add_epi32(Col2, Add2);\n\n\t\t\t__m128i Dest1 = _mm_cvtps_epi32(_mm_mul_ps(Inv, _mm_cvtepi32_ps(Sum1)));\n\t\t\t__m128i Dest2 = _mm_cvtps_epi32(_mm_mul_ps(Inv, _mm_cvtepi32_ps(Sum2)));\n\n\t\t\tDest1 = _mm_packs_epi32(Dest1, Dest2);\n\t\t\t_mm_storel_epi64((__m128i *)(LinePD + X), _mm_packus_epi16(Dest1, Dest1));\n\n\t\t\t_mm_store_si128((__m128i *)(ColSum + X + 0), _mm_sub_epi32(Sum1, Sub1));\n\t\t\t_mm_store_si128((__m128i *)(ColSum + X + 4), _mm_sub_epi32(Sum2, Sub2));\n\t\t}\n\t\tfor (; X < Width * Channel; X++){\n\t\t\tValue = ColSum[X] + AddPos[X];\n\t\t\tLinePD[X] = (unsigned char)(Value * Scale + 0.5f);\t\t//\tround to nearest, like the SSE path above, instead of truncating\n\t\t\tColSum[X] = Value - SubPos[X];\n\t\t}\n\t}\n\tIS_FreeMatrix(&Row);\t\t//\tfrees RowPos (Row->Data) and the TMatrix header\n\tIS_FreeMatrix(&Col);\t\t//\tfrees ColPos (Col->Data) and the TMatrix header\n\tIS_FreeMatrix(&p);\t\t\t//\tfrees the row-sum matrix allocated by IS_CreateMatrix\n\tIS_FreeMemory(Diff);\n\tIS_FreeMemory(ColSum);\n\tIS_FreeMemory(RowData);\n\treturn Ret;\n}"
  },
  {
    "path": "speed_histogram_algorithm_framework/Core.h",
    "content": "#pragma once\n#include <stdio.h>\n#include <malloc.h>\n#include <stdlib.h>\n#include <string.h>\n#include <opencv2/opencv.hpp>\nusing namespace std;\n\n#define WIDTHBYTES(bytes) (((bytes * 8) + 31) / 32 * 4)\nconst float Inv255 = 1.0 / 255;\nconst double Eps = 2.220446049250313E-16;\n\n\n//Եķʽ\nenum EdgeMode {\n\tTile = 0, //ظԵԪ\n\tSmear = 1 //ԵԪ\n};\n\nenum IS_RET {\n\tIS_RET_OK,\t\t\t\t\t\t\t\t\t//\t\n\tIS_RET_ERR_OUTOFMEMORY,\t\t\t\t\t\t//\tڴ\n\tIS_RET_ERR_STACKOVERFLOW,\t\t\t\t\t//\tջ\n\tIS_RET_ERR_NULLREFERENCE,\t\t\t\t\t//\t\n\tIS_RET_ERR_ARGUMENTOUTOFRANGE,\t\t\t\t//\tΧ\n\tIS_RET_ERR_PARAMISMATCH,\t\t\t\t\t//\tƥ\n\tIS_RET_ERR_DIVIDEBYZERO,\n\tIS_RET_ERR_INDEXOUTOFRANGE,\n\tIS_RET_ERR_NOTSUPPORTED,\n\tIS_RET_ERR_OVERFLOW,\n\tIS_RET_ERR_FILENOTFOUND,\n\tIS_RET_ERR_UNKNOWN\n};\n\nenum IS_DEPTH\n{\n\tIS_DEPTH_8U = 0,\t\t\t//\tunsigned char\n\tIS_DEPTH_8S = 1,\t\t\t//\tchar\n\tIS_DEPTH_16S = 2,\t\t\t//\tshort\n\tIS_DEPTH_32S = 3,\t\t\t//  int\n\tIS_DEPTH_32F = 4,\t\t\t//\tfloat\n\tIS_DEPTH_64F = 5,\t\t\t//\tdouble\n};\n\nstruct TMatrix\n{\n\tint Width;\t\t\t\t\t//\tĿ\n\tint Height;\t\t\t\t\t//\tĸ߶\n\tint WidthStep;\t\t\t\t//\tһԪصռõֽ\n\tint Channel;\t\t\t\t//\tͨ\n\tint Depth;\t\t\t\t\t//\tԪص\n\tunsigned char *Data;\t\t//\t\n\tint Reserved;\t\t\t\t//\tʹ\n};\n\n// ڴ\nvoid *IS_AllocMemory(unsigned int Size, bool ZeroMemory = true) {\n\tvoid *Ptr = _mm_malloc(Size, 32);\n\tif (Ptr != NULL)\n\t\tif (ZeroMemory == true)\n\t\t\tmemset(Ptr, 0, Size);\n\treturn Ptr;\n}\n\n// ڴͷ\nvoid IS_FreeMemory(void *Ptr) {\n\tif (Ptr != NULL) _mm_free(Ptr);\n}\n\n// ݾԪصȡһԪʵռõֽ\nint IS_ELEMENT_SIZE(int Depth) {\n\tint Size;\n\tswitch (Depth)\n\t{\n\tcase IS_DEPTH_8U:\n\t\tSize = sizeof(unsigned char);\n\t\tbreak;\n\tcase IS_DEPTH_8S:\n\t\tSize = sizeof(char);\n\t\tbreak;\n\tcase IS_DEPTH_16S:\n\t\tSize = sizeof(short);\n\t\tbreak;\n\tcase IS_DEPTH_32S:\n\t\tSize = sizeof(int);\n\t\tbreak;\n\tcase IS_DEPTH_32F:\n\t\tSize = sizeof(float);\n\t\tbreak;\n\tcase 
IS_DEPTH_64F:\n\t\tSize = sizeof(double);\n\t\tbreak;\n\tdefault:\n\t\tSize = 0;\n\t\tbreak;\n\t}\n\treturn Size;\n}\n\n// Create a new matrix\nIS_RET IS_CreateMatrix(int Width, int Height, int Depth, int Channel, TMatrix **Matrix) {\n\tif (Width < 1 || Height < 1) return IS_RET_ERR_ARGUMENTOUTOFRANGE; // argument out of range\n\tif (Depth != IS_DEPTH_8U && Depth != IS_DEPTH_8S && Depth != IS_DEPTH_16S && Depth != IS_DEPTH_32S &&\n\t\tDepth != IS_DEPTH_32F && Depth != IS_DEPTH_64F) return IS_RET_ERR_ARGUMENTOUTOFRANGE; // argument out of range\n\tif (Channel != 1 && Channel != 2 && Channel != 3 && Channel != 4) return IS_RET_ERR_ARGUMENTOUTOFRANGE;\n\t*Matrix = (TMatrix *)IS_AllocMemory(sizeof(TMatrix));\n\tif (*Matrix == NULL) return IS_RET_ERR_OUTOFMEMORY; // header allocation failed\n\t(*Matrix)->Width = Width;\n\t(*Matrix)->Height = Height;\n\t(*Matrix)->Depth = Depth;\n\t(*Matrix)->Channel = Channel;\n\t(*Matrix)->WidthStep = WIDTHBYTES(Width * Channel * IS_ELEMENT_SIZE(Depth));\n\t(*Matrix)->Data = (unsigned char*)IS_AllocMemory((*Matrix)->Height * (*Matrix)->WidthStep, true);\n\tif ((*Matrix)->Data == NULL) {\n\t\tIS_FreeMemory(*Matrix);\n\t\t*Matrix = NULL; // do not leave a dangling pointer with the caller\n\t\treturn IS_RET_ERR_OUTOFMEMORY; // out of memory\n\t}\n\t(*Matrix)->Reserved = 0;\n\treturn IS_RET_OK;\n}\n\n// Free a matrix created by IS_CreateMatrix\nIS_RET IS_FreeMatrix(TMatrix **Matrix) {\n\tif ((*Matrix) == NULL) return IS_RET_ERR_NULLREFERENCE; // null reference\n\tif ((*Matrix)->Data == NULL) {\n\t\tIS_FreeMemory((*Matrix));\n\t\t*Matrix = NULL;\n\t\treturn IS_RET_ERR_OUTOFMEMORY;\n\t}\n\telse {\n\t\tIS_FreeMemory((*Matrix)->Data);\n\t\tIS_FreeMemory((*Matrix));\n\t\t*Matrix = NULL; // guard against a later double free\n\t\treturn IS_RET_OK;\n\t}\n}\n\n// Clone an existing matrix\nIS_RET IS_CloneMatrix(TMatrix *Src, TMatrix **Dest) {\n\tif (Src == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Data == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tIS_RET Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, Src->Channel, Dest);\n\tif (Ret == IS_RET_OK) memcpy((*Dest)->Data, Src->Data, (*Dest)->Height * (*Dest)->WidthStep);\n\treturn Ret;\n}"
  },
  {
    "path": "speed_histogram_algorithm_framework/MaxFilter.h",
    "content": "#pragma once\n#include \"Core.h\"\n#include \"Utility.h\"\n\n// 函数供能: 在指定半径内，最大值”滤镜用周围像素的最高亮度值替换当前像素的亮度值。\n// 参数列表:\n// Src: 需要处理的源图像的数据结构\n// Dest: 保存处理后的图像的数据结构\n// Radius: 半径，有效范围\n// 说明：\n// 1、程序的执行时间和半径基本无关，但和图像内容有关\n// 2、Src和Dest可以相同，不同时执行速度很快\n// 3、对于各向异性的图像来说，执行速度很快，对于有大面积相同像素的图像，速度会慢一点\n\nIS_RET  MaxFilter(TMatrix *Src, TMatrix *Dest, int Radius)\n{\n\tif (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;\n\tif (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;\n\tif (Radius < 0 || Radius >= 127) return IS_RET_ERR_ARGUMENTOUTOFRANGE;\n\n\tIS_RET Ret = IS_RET_OK;\n\n\tif (Src->Data == Dest->Data)\n\t{\n\t\tTMatrix *Clone = NULL;\n\t\tRet = IS_CloneMatrix(Src, &Clone);\n\t\tif (Ret != IS_RET_OK) return Ret;\n\t\tRet = MaxFilter(Clone, Dest, Radius);\n\t\tIS_FreeMatrix(&Clone);\n\t\treturn Ret;\n\t}\n\tif (Src->Channel == 1)\n\t{\n\t\tTMatrix *Row = NULL, *Col = NULL;\n\t\tunsigned char *LinePS, *LinePD;\n\t\tint X, Y, K, Width = Src->Width, Height = Src->Height;\n\t\tint *RowOffset, *ColOffSet;\n\n\t\tunsigned short *ColHist = (unsigned short *)IS_AllocMemory(256 * (Width + 2 * Radius) * sizeof(unsigned short), true);\n\t\tif (ColHist == NULL) { Ret = IS_RET_ERR_OUTOFMEMORY; goto Done8; }\n\t\tunsigned short *Hist = (unsigned short *)IS_AllocMemory(256 * sizeof(unsigned short), true);\n\t\tif (Hist == NULL) { Ret = IS_RET_ERR_OUTOFMEMORY; goto Done8; }\n\t\tRet = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, EdgeMode::Smear, &Row, &Col);\t\t//\t获取坐标偏移量\n\t\tif (Ret != IS_RET_OK) goto Done8;\n\n\t\tColHist += Radius * 256;\t\tRowOffset = ((int *)Row->Data) + Radius;\n\t\tColOffSet = ((int *)Col->Data) 
+ Radius;\t\t    \t//\t进行偏移以便操作\n\n\t\tfor (Y = 0; Y < Height; Y++)\n\t\t{\n\t\t\tif (Y == 0)\t\t\t\t\t\t\t\t\t\t\t//\t第一行的列直方图,要重头计算\n\t\t\t{\n\t\t\t\tfor (K = -Radius; K <= Radius; K++)\n\t\t\t\t{\n\t\t\t\t\tLinePS = Src->Data + ColOffSet[K] * Src->WidthStep;\n\t\t\t\t\tfor (X = -Radius; X < Width + Radius; X++)\n\t\t\t\t\t{\n\t\t\t\t\t\tColHist[X * 256 + LinePS[RowOffset[X]]]++;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\telse\t\t\t\t\t\t\t\t\t\t\t\t//\t其他行的列直方图，更新就可以了\n\t\t\t{\n\t\t\t\tLinePS = Src->Data + ColOffSet[Y - Radius - 1] * Src->WidthStep;\n\t\t\t\tfor (X = -Radius; X < Width + Radius; X++)\t\t// 删除移出范围内的那一行的直方图数据\n\t\t\t\t{\n\t\t\t\t\tColHist[X * 256 + LinePS[RowOffset[X]]]--;\n\t\t\t\t}\n\n\t\t\t\tLinePS = Src->Data + ColOffSet[Y + Radius] * Src->WidthStep;\n\t\t\t\tfor (X = -Radius; X < Width + Radius; X++)\t\t// 增加进入范围内的那一行的直方图数据\n\t\t\t\t{\n\t\t\t\t\tColHist[X * 256 + LinePS[RowOffset[X]]]++;\n\t\t\t\t}\n\t\t\t}\n\n\t\t\tmemset(Hist, 0, 256 * sizeof(unsigned short));\t\t//\t每一行直方图数据清零先\n\n\t\t\tLinePD = Dest->Data + Y * Dest->WidthStep;\n\n\t\t\tfor (X = 0; X < Width; X++)\n\t\t\t{\n\t\t\t\tif (X == 0)\n\t\t\t\t{\n\t\t\t\t\tfor (K = -Radius; K <= Radius; K++)\t\t\t//\t行第一个像素，需要重新计算\t\n\t\t\t\t\t\tHistgramAddShort(ColHist + K * 256, Hist);\n\t\t\t\t}\n\t\t\t\telse\n\t\t\t\t{\n\t\t\t\t\t/*\tHistgramAddShort(ColHist + RowOffset[X + Radius] * 256, Hist);\n\t\t\t\t\tHistgramSubShort(ColHist + RowOffset[X - Radius - 1] * 256, Hist);\n\t\t\t\t\t*/\n\t\t\t\t\tHistgramSubAddShort(ColHist + RowOffset[X - Radius - 1] * 256, ColHist + RowOffset[X + Radius] * 256, Hist);  //\t行内其他像素，依次删除和增加就可以了\n\t\t\t\t}\n\t\t\t\tfor (K = 255; K >= 0; K--)\n\t\t\t\t{\n\t\t\t\t\tif (Hist[K] != 0)\n\t\t\t\t\t{\n\t\t\t\t\t\tLinePD[X] = K;\n\t\t\t\t\t\tbreak;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tColHist -= Radius * 256;\t\t//\t恢复偏移操作\n\tDone8:\n\t\tIS_FreeMatrix(&Row);\n\t\tIS_FreeMatrix(&Col);\n\t\tIS_FreeMemory(ColHist);\n\t\tIS_FreeMemory(Hist);\n\t\treturn 
Ret;\n\t}\n\telse\n\t{\n\t\tTMatrix *Blue = NULL, *Green = NULL, *Red = NULL, *Alpha = NULL;\t\t\t//\t由于C变量如果不初始化，其值是随机值，可能会导致释放时的错误。\n\t\tIS_RET Ret = SplitRGBA(Src, &Blue, &Green, &Red, &Alpha);\n\t\tif (Ret != IS_RET_OK) goto Done24;\n\t\tRet = MaxFilter(Blue, Blue, Radius);\n\t\tif (Ret != IS_RET_OK) goto Done24;\n\t\tRet = MaxFilter(Green, Green, Radius);\n\t\tif (Ret != IS_RET_OK) goto Done24;\n\t\tRet = MaxFilter(Red, Red, Radius);\n\t\tif (Ret != IS_RET_OK) goto Done24;\t\t\t\t\t\t\t\t\t\t\t//\t32位的Alpha不做任何处理，实际上32位的相关算法基本上是不能分通道处理的\n\t\tCopyAlphaChannel(Src, Dest);\n\t\tRet = CombineRGBA(Dest, Blue, Green, Red, Alpha);\n\tDone24:\n\t\tIS_FreeMatrix(&Blue);\n\t\tIS_FreeMatrix(&Green);\n\t\tIS_FreeMatrix(&Red);\n\t\tIS_FreeMatrix(&Alpha);\n\t\treturn Ret;\n\t}\n}"
  },
  {
    "path": "speed_histogram_algorithm_framework/SelectiveBlur.h",
    "content": "#pragma once\n#include \"Core.h\"\n#include \"Utility.h\"\n\nvoid Calc(unsigned short *Hist, int Intensity, unsigned char *&Pixel, int Threshold)\n{\n\tint K, Low, High, Sum = 0, Weight = 0;\n\tLow = Intensity - Threshold; High = Intensity + Threshold;\n\tif (Low < 0) Low = 0;\n\tif (High > 255) High = 255;\n\tfor (K = Low; K <= High; K++)\n\t{\n\t\tWeight += Hist[K];\n\t\tSum += Hist[K] * K;\n\t}\n\tif (Weight != 0) *Pixel = Sum / Weight;\n}\n\n// 函数供能: 在指定半径内，实现图像选择性模糊效果。\n// 参数列表:\n// Src: 需要处理的源图像的数据结构\n// Dest: 保存处理后的图像的数据结构\n// Radius: 半径，有效范围\n// 说明：\n// 1、程序的执行时间和半径基本无关，但和图像内容有关\n// 2、Src和Dest可以相同，不同时执行速度很快\n// 3、对于各向异性的图像来说，执行速度很快，对于有大面积相同像素的图像，速度会慢一点\n\nIS_RET SelectiveBlur(TMatrix *Src, TMatrix *Dest, int Radius, int Threshold, EdgeMode Edge)\n{\n\tif (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;\n\tif (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;\n\tif (Radius < 0 || Radius >= 127 || Threshold < 2 || Threshold > 255) return IS_RET_ERR_ARGUMENTOUTOFRANGE;\n\n\tIS_RET Ret = IS_RET_OK;\n\n\tif (Src->Data == Dest->Data)\n\t{\n\t\tTMatrix *Clone = NULL;\n\t\tRet = IS_CloneMatrix(Src, &Clone);\n\t\tif (Ret != IS_RET_OK) return Ret;\n\t\tRet = SelectiveBlur(Clone, Dest, Radius, Threshold, Edge);\n\t\tIS_FreeMatrix(&Clone);\n\t\treturn Ret;\n\t}\n\tif (Src->Channel == 1)\n\t{\n\t\tTMatrix *Row = NULL, *Col = NULL;\n\t\tunsigned char *LinePS, *LinePD;\n\t\tint X, Y, K, Width = Src->Width, Height = Src->Height;\n\t\tint *RowOffset, *ColOffSet;\n\n\t\tunsigned short *ColHist = (unsigned short *)IS_AllocMemory(256 * (Width + 2 * Radius) * sizeof(unsigned short), true);\n\t\tif (ColHist == NULL) { Ret = 
IS_RET_ERR_OUTOFMEMORY; goto Done8; }\n\t\tunsigned short *Hist = (unsigned short *)IS_AllocMemory(256 * sizeof(unsigned short), true);\n\t\tif (Hist == NULL) { Ret = IS_RET_ERR_OUTOFMEMORY; goto Done8; }\n\n\t\tRet = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, Edge, &Row, &Col);\t\t//\t获取坐标偏移量\n\t\tif (Ret != IS_RET_OK) goto Done8;\n\n\t\tColHist += Radius * 256;\t\tRowOffset = ((int *)Row->Data) + Radius;\t\tColOffSet = ((int *)Col->Data) + Radius;\t\t    \t//\t进行偏移以便操作\n\n\t\tfor (Y = 0; Y < Height; Y++)\n\t\t{\n\t\t\tif (Y == 0)\t\t\t\t\t\t\t\t\t\t\t//\t第一行的列直方图,要重头计算\n\t\t\t{\n\t\t\t\tfor (K = -Radius; K <= Radius; K++)\n\t\t\t\t{\n\t\t\t\t\tLinePS = Src->Data + ColOffSet[K] * Src->WidthStep;\n\t\t\t\t\tfor (X = -Radius; X < Width + Radius; X++)\n\t\t\t\t\t{\n\t\t\t\t\t\tColHist[X * 256 + LinePS[RowOffset[X]]]++;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\telse\t\t\t\t\t\t\t\t\t\t\t\t//\t其他行的列直方图，更新就可以了\n\t\t\t{\n\t\t\t\tLinePS = Src->Data + ColOffSet[Y - Radius - 1] * Src->WidthStep;\n\t\t\t\tfor (X = -Radius; X < Width + Radius; X++)\t\t// 删除移出范围内的那一行的直方图数据\n\t\t\t\t{\n\t\t\t\t\tColHist[X * 256 + LinePS[RowOffset[X]]]--;\n\t\t\t\t}\n\n\t\t\t\tLinePS = Src->Data + ColOffSet[Y + Radius] * Src->WidthStep;\n\t\t\t\tfor (X = -Radius; X < Width + Radius; X++)\t\t// 增加进入范围内的那一行的直方图数据\n\t\t\t\t{\n\t\t\t\t\tColHist[X * 256 + LinePS[RowOffset[X]]]++;\n\t\t\t\t}\n\n\t\t\t}\n\n\t\t\tmemset(Hist, 0, 256 * sizeof(unsigned short));\t\t//\t每一行直方图数据清零先\n\n\t\t\tLinePS = Src->Data + Y * Src->WidthStep;\n\t\t\tLinePD = Dest->Data + Y * Dest->WidthStep;\n\n\t\t\tfor (X = 0; X < Width; X++)\n\t\t\t{\n\t\t\t\tif (X == 0)\n\t\t\t\t{\n\t\t\t\t\tfor (K = -Radius; K <= Radius; K++)\t\t\t//\t行第一个像素，需要重新计算\t\n\t\t\t\t\t\tHistgramAddShort(ColHist + K * 256, Hist);\n\t\t\t\t}\n\t\t\t\telse\n\t\t\t\t{\n\t\t\t\t\t/*\tHistgramAddShort(ColHist + RowOffset[X + Radius] * 256, Hist);\n\t\t\t\t\tHistgramSubShort(ColHist + RowOffset[X - Radius - 1] * 256, 
Hist);\n\t\t\t\t\t*/\n\t\t\t\t\tHistgramSubAddShort(ColHist + RowOffset[X - Radius - 1] * 256, ColHist + RowOffset[X + Radius] * 256, Hist);  //\t行内其他像素，依次删除和增加就可以了\n\t\t\t\t}\n\t\t\t\tCalc(Hist, LinePS[0], LinePD, Threshold);\n\n\t\t\t\tLinePS++;\n\t\t\t\tLinePD++;\n\t\t\t}\n\t\t}\n\t\tColHist -= Radius * 256;\t\t//\t恢复偏移操作\n\tDone8:\n\t\tIS_FreeMatrix(&Row);\n\t\tIS_FreeMatrix(&Col);\n\t\tIS_FreeMemory(ColHist);\n\t\tIS_FreeMemory(Hist);\n\n\t\treturn Ret;\n\t}\n\telse\n\t{\n\t\tTMatrix *Blue = NULL, *Green = NULL, *Red = NULL, *Alpha = NULL;\t\t\t//\t由于C变量如果不初始化，其值是随机值，可能会导致释放时的错误。\n\t\tIS_RET Ret = SplitRGBA(Src, &Blue, &Green, &Red, &Alpha);\n\t\tif (Ret != IS_RET_OK) goto Done24;\n\t\tRet = SelectiveBlur(Blue, Blue, Radius, Threshold, Edge);\n\t\tif (Ret != IS_RET_OK) goto Done24;\n\t\tRet = SelectiveBlur(Green, Green, Radius, Threshold, Edge);\n\t\tif (Ret != IS_RET_OK) goto Done24;\n\t\tRet = SelectiveBlur(Red, Red, Radius, Threshold, Edge);\n\t\tif (Ret != IS_RET_OK) goto Done24;\t\t\t\t\t\t\t\t\t\t\t//\t32位的Alpha不做任何处理，实际上32位的相关算法基本上是不能分通道处理的\n\t\tRet = CombineRGBA(Dest, Blue, Green, Red, Alpha);\n\tDone24:\n\t\tIS_FreeMatrix(&Blue);\n\t\tIS_FreeMatrix(&Green);\n\t\tIS_FreeMatrix(&Red);\n\t\tIS_FreeMatrix(&Alpha);\n\t\treturn Ret;\n\t}\n}\n"
  },
  {
    "path": "speed_histogram_algorithm_framework/Utility.h",
    "content": "#pragma once\n//ֵ\n#include \"Core.h\"\n\nunion Approximation\n{\n\tdouble Value;\n\tint X[2];\n};\n\n// 1: ݽضByteڡ\n// ο: http://www.cnblogs.com/zyl910/archive/2012/03/12/noifopex1.html\n// : λʹô롣\nunsigned char ClampToByte(int Value) {\n\treturn ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));\n}\n\n//2: ݽضָΧ\n//ο: \n//: \nint ClampToInt(int Value, int Min, int Max) {\n\tif (Value < Min) return Min;\n\telse if (Value > Max) return Max;\n\telse return Value;\n}\n\n//3: 255\n//ο: \n//: λ\nint Div255(int Value) {\n\treturn (((Value >> 8) + Value + 1) >> 8);\n}\n\n//4: ȡֵ\n//ο: https://oi-wiki.org/math/bit/\n//: n > 0 ? n : -n \n\nint Abs(int n) {\n\treturn (n ^ (n >> 31)) - (n >> 31);\n\t/* n>>31 ȡ n ķţ n Ϊn>>31  0 n Ϊn>>31  - 1\n\t n Ϊ n^0=0, 䣬 n Ϊ n^-1\n\tҪ n  - 1 Ĳ룬Ȼ㣬\n\t n ŲΪ n ľֵ 1ټȥ - 1 Ǿֵ */\n}\n\n//5: \n//ο: \n//: \ndouble Round(double V)\n{\n\treturn (V > 0.0) ? floor(V + 0.5) : Round(V - 0.5);\n}\n\n//6: -11֮\n//ο: \n//: \ndouble Rand()\n{\n\treturn (double)rand() / (RAND_MAX + 1.0);\n}\n\n//7: PowĽƼ㣬doubleͺfloat\n//ο: http://www.cvchina.info/2010/03/19/log-pow-exp-approximation/\n//ο: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/\n//: ֻΪ˼ٵĽƼ㣬5%-12%ȵ\ndouble Pow(double X, double Y)\n{\n\tApproximation V = { X };\n\tV.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);\n\tV.X[0] = 0;\n\treturn V.Value;\n}\n\n\nfloat Pow(float X, float Y)\n{\n\tApproximation V = { X };\n\tV.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);\n\tV.X[0] = 0;\n\treturn (float)V.Value;\n}\n\n//8: ExpĽƼ㣬doubleͺfloat\ndouble Exp(double Y)\t\t\t//\tķʽٶҪЩ\n{\n\tApproximation V;\n\tV.X[1] = (int)(Y * 1485963 + 1072632447);\n\tV.X[0] = 0;\n\treturn V.Value;\n}\n\nfloat Exp(float Y)\t\t\t//\tķʽٶҪЩ\n{\n\tApproximation V;\n\tV.X[1] = (int)(Y * 1485963 + 1072632447);\n\tV.X[0] = 0;\n\treturn (float)V.Value;\n}\n\n// 9: Pow׼һĽƼ㣬ٶȻ\n// 
http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/\n// Besides that, I also have now a slower approximation that has much less error\n// when the exponent is larger than 1. It makes use of exponentiation by squaring,\n// which is exact for the integer part of the exponent, and uses only the exponent's fraction for the approximation:\n// should be much more precise with large Y\n// Note: the squaring loop below only terminates for Y > -1, since a negative int never shifts down to 0.\n\ndouble PrecisePow(double X, double Y) {\n\t// calculate approximation with fraction of the exponent\n\tint e = (int)Y;\n\tApproximation V = { X };\n\tV.X[1] = (int)((Y - e) * (V.X[1] - 1072632447) + 1072632447);\n\tV.X[0] = 0;\n\t// exponentiation by squaring with the exponent's integer part\n\t// double r = u.d makes everything much slower, not sure why\n\tdouble r = 1.0;\n\twhile (e)\n\t{\n\t\tif (e & 1)\tr *= X;\n\t\tX *= X;\n\t\te >>= 1;\n\t}\n\treturn r * V.Value;\n}\n\n//10: Random integer between Min and Max.\n//Reference: \n//Note: Min is the smallest value and Max the largest value that can be returned.\nint Random(int Min, int Max) {\n\treturn rand() % (Max + 1 - Min) + Min;\n}\n\n//11: Sign function.\n//Reference: \n//Note: \nint sgn(int X) {\n\tif (X > 0) return 1;\n\tif (X < 0) return -1;\n\treturn 0;\n}\n\n//12: Extract the R, G, B components of a packed color value.\n//Reference: \n//Note: \nvoid GetRGB(int Color, int *R, int *G, int *B) {\n\t*R = Color & 255;\n\t*G = (Color & 65280) / 256;\n\t*B = (Color & 16711680) / 65536;\n}\n\n//13: Approximate square root via Newton's method.\n//Reference: https://www.cnblogs.com/qlky/p/7735145.html\n//Note: an approximation; it computes the inverse square root with three Newton steps, then returns its reciprocal.\nfloat Sqrt(float X)\n{\n\tfloat HalfX = 0.5f * X;             // float only; this bit trick is not valid for double\n\tint I = *(int*)&X;                  // get bits for floating VALUE \n\tI = 0x5f375a86 - (I >> 1);          // gives initial guess y0\n\tX = *(float*)&I;                    // convert bits BACK to float\n\tX = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy\n\tX = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy\n\tX = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy\n\treturn 1 / X;\n}\n\n//14: Add two unsigned short histograms, Y = X + Y\n//Reference: \n//Note: SSE-optimized\nvoid HistgramAddShort(unsigned short *X, 
unsigned short *Y)\n{\n\t*(__m128i*)(Y + 0) = _mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]);\t\t//\tҪԼдĻ೬ٶˣѾԹ\n\t*(__m128i*)(Y + 8) = _mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);\n\t*(__m128i*)(Y + 16) = _mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);\n\t*(__m128i*)(Y + 24) = _mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);\n\t*(__m128i*)(Y + 32) = _mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);\n\t*(__m128i*)(Y + 40) = _mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]);\n\t*(__m128i*)(Y + 48) = _mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]);\n\t*(__m128i*)(Y + 56) = _mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]);\n\t*(__m128i*)(Y + 64) = _mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]);\n\t*(__m128i*)(Y + 72) = _mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]);\n\t*(__m128i*)(Y + 80) = _mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]);\n\t*(__m128i*)(Y + 88) = _mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]);\n\t*(__m128i*)(Y + 96) = _mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]);\n\t*(__m128i*)(Y + 104) = _mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]);\n\t*(__m128i*)(Y + 112) = _mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]);\n\t*(__m128i*)(Y + 120) = _mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]);\n\t*(__m128i*)(Y + 128) = _mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]);\n\t*(__m128i*)(Y + 136) = _mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]);\n\t*(__m128i*)(Y + 144) = _mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]);\n\t*(__m128i*)(Y + 152) = _mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]);\n\t*(__m128i*)(Y + 160) = _mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]);\n\t*(__m128i*)(Y + 168) = _mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]);\n\t*(__m128i*)(Y + 176) = _mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]);\n\t*(__m128i*)(Y + 184) = _mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]);\n\t*(__m128i*)(Y + 192) = _mm_add_epi16(*(__m128i*)&Y[192], 
*(__m128i*)&X[192]);\n\t*(__m128i*)(Y + 200) = _mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]);\n\t*(__m128i*)(Y + 208) = _mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]);\n\t*(__m128i*)(Y + 216) = _mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]);\n\t*(__m128i*)(Y + 224) = _mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]);\n\t*(__m128i*)(Y + 232) = _mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]);\n\t*(__m128i*)(Y + 240) = _mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]);\n\t*(__m128i*)(Y + 248) = _mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);\n}\n\n//15: ޷ŶֱͼY = Y - X\n//ο: \n//: SSEŻ\nvoid HistgramSubShort(unsigned short *X, unsigned short *Y)\n{\n\t*(__m128i*)(Y + 0) = _mm_sub_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]);\n\t*(__m128i*)(Y + 8) = _mm_sub_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);\n\t*(__m128i*)(Y + 16) = _mm_sub_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);\n\t*(__m128i*)(Y + 24) = _mm_sub_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);\n\t*(__m128i*)(Y + 32) = _mm_sub_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);\n\t*(__m128i*)(Y + 40) = _mm_sub_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]);\n\t*(__m128i*)(Y + 48) = _mm_sub_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]);\n\t*(__m128i*)(Y + 56) = _mm_sub_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]);\n\t*(__m128i*)(Y + 64) = _mm_sub_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]);\n\t*(__m128i*)(Y + 72) = _mm_sub_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]);\n\t*(__m128i*)(Y + 80) = _mm_sub_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]);\n\t*(__m128i*)(Y + 88) = _mm_sub_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]);\n\t*(__m128i*)(Y + 96) = _mm_sub_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]);\n\t*(__m128i*)(Y + 104) = _mm_sub_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]);\n\t*(__m128i*)(Y + 112) = _mm_sub_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]);\n\t*(__m128i*)(Y + 120) = _mm_sub_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]);\n\t*(__m128i*)(Y + 128) = 
_mm_sub_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]);\n\t*(__m128i*)(Y + 136) = _mm_sub_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]);\n\t*(__m128i*)(Y + 144) = _mm_sub_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]);\n\t*(__m128i*)(Y + 152) = _mm_sub_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]);\n\t*(__m128i*)(Y + 160) = _mm_sub_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]);\n\t*(__m128i*)(Y + 168) = _mm_sub_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]);\n\t*(__m128i*)(Y + 176) = _mm_sub_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]);\n\t*(__m128i*)(Y + 184) = _mm_sub_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]);\n\t*(__m128i*)(Y + 192) = _mm_sub_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]);\n\t*(__m128i*)(Y + 200) = _mm_sub_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]);\n\t*(__m128i*)(Y + 208) = _mm_sub_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]);\n\t*(__m128i*)(Y + 216) = _mm_sub_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]);\n\t*(__m128i*)(Y + 224) = _mm_sub_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]);\n\t*(__m128i*)(Y + 232) = _mm_sub_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]);\n\t*(__m128i*)(Y + 240) = _mm_sub_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]);\n\t*(__m128i*)(Y + 248) = _mm_sub_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);\n}\n\n//16: ޷ŶֱͼӼZ = Z + Y - X\n//ο: \n//: SSEŻ\nvoid HistgramSubAddShort(unsigned short *X, unsigned short *Y, unsigned short *Z)\n{\n\t*(__m128i*)(Z + 0) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&Z[0]), *(__m128i*)&X[0]);\t\t\t\t\t\t//\tҪԼдĻ೬ٶˣѾԹ\n\t*(__m128i*)(Z + 8) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&Z[8]), *(__m128i*)&X[8]);\n\t*(__m128i*)(Z + 16) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&Z[16]), *(__m128i*)&X[16]);\n\t*(__m128i*)(Z + 24) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&Z[24]), *(__m128i*)&X[24]);\n\t*(__m128i*)(Z + 32) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&Z[32]), *(__m128i*)&X[32]);\n\t*(__m128i*)(Z + 
40) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&Z[40]), *(__m128i*)&X[40]);\n\t*(__m128i*)(Z + 48) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&Z[48]), *(__m128i*)&X[48]);\n\t*(__m128i*)(Z + 56) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&Z[56]), *(__m128i*)&X[56]);\n\t*(__m128i*)(Z + 64) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&Z[64]), *(__m128i*)&X[64]);\n\t*(__m128i*)(Z + 72) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&Z[72]), *(__m128i*)&X[72]);\n\t*(__m128i*)(Z + 80) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&Z[80]), *(__m128i*)&X[80]);\n\t*(__m128i*)(Z + 88) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&Z[88]), *(__m128i*)&X[88]);\n\t*(__m128i*)(Z + 96) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&Z[96]), *(__m128i*)&X[96]);\n\t*(__m128i*)(Z + 104) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&Z[104]), *(__m128i*)&X[104]);\n\t*(__m128i*)(Z + 112) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&Z[112]), *(__m128i*)&X[112]);\n\t*(__m128i*)(Z + 120) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&Z[120]), *(__m128i*)&X[120]);\n\t*(__m128i*)(Z + 128) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&Z[128]), *(__m128i*)&X[128]);\n\t*(__m128i*)(Z + 136) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&Z[136]), *(__m128i*)&X[136]);\n\t*(__m128i*)(Z + 144) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&Z[144]), *(__m128i*)&X[144]);\n\t*(__m128i*)(Z + 152) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&Z[152]), *(__m128i*)&X[152]);\n\t*(__m128i*)(Z + 160) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&Z[160]), *(__m128i*)&X[160]);\n\t*(__m128i*)(Z + 168) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&Z[168]), *(__m128i*)&X[168]);\n\t*(__m128i*)(Z + 176) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[176], 
*(__m128i*)&Z[176]), *(__m128i*)&X[176]);\n\t*(__m128i*)(Z + 184) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&Z[184]), *(__m128i*)&X[184]);\n\t*(__m128i*)(Z + 192) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&Z[192]), *(__m128i*)&X[192]);\n\t*(__m128i*)(Z + 200) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&Z[200]), *(__m128i*)&X[200]);\n\t*(__m128i*)(Z + 208) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&Z[208]), *(__m128i*)&X[208]);\n\t*(__m128i*)(Z + 216) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&Z[216]), *(__m128i*)&X[216]);\n\t*(__m128i*)(Z + 224) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&Z[224]), *(__m128i*)&X[224]);\n\t*(__m128i*)(Z + 232) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&Z[232]), *(__m128i*)&X[232]);\n\t*(__m128i*)(Z + 240) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&Z[240]), *(__m128i*)&X[240]);\n\t*(__m128i*)(Z + 248) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&Z[248]), *(__m128i*)&X[248]);\n}\n\n// Function 17: copy the alpha channel from Src to Dest\n// Reference:\n// Note: plain scalar code; the speed is already good\nvoid CopyAlphaChannel(TMatrix *Src, TMatrix *Dest) {\n\tif (Src->Channel != 4 || Dest->Channel != 4) return;\n\tif (Src->Data == Dest->Data) return;\n\tunsigned char *SrcP = Src->Data, *DestP = Dest->Data;\n\tint Y, Index = 3;\n\tfor (Y = 0; Y < Src->Width * Src->Height; Y++, Index += 4) {\n\t\tDestP[Index] = SrcP[Index];\t\t// copy Src alpha into Dest (the assignment was reversed)\n\t}\n}\n\n// Function 18: build coordinate lookup tables for an image extended according to the given edge mode\n// Parameters:\n// Width: width of the matrix\n// Height: height of the matrix\n// Left: number of pixels to extend on the left\n// Right: number of pixels to extend on the right\n// Top: number of pixels to extend at the top\n// Bottom: number of pixels to extend at the bottom\n// Edge: how edges are handled\n// Row: receives the lookup table of valid X coordinates\n// Col: receives the lookup table of valid Y coordinates\n// Returns whether the function succeeded\nIS_RET GetValidCoordinate(int Width, int Height, int Left, int Right, int Top, int Bottom, EdgeMode Edge, TMatrix **Row, TMatrix **Col)\n{\n\tif ((Left < 0) || (Right < 0) || (Top < 0) || (Bottom < 0)) return IS_RET_ERR_ARGUMENTOUTOFRANGE;\n\tIS_RET Ret = IS_CreateMatrix(Width + Left + Right, 1, IS_DEPTH_32S, 1, Row);\n\tif (Ret != IS_RET_OK) return Ret;\n\tRet = 
IS_CreateMatrix(1, Height + Top + Bottom, IS_DEPTH_32S, 1, Col);\n\tif (Ret != IS_RET_OK) { IS_FreeMatrix(Row); return Ret; }\t\t// release Row so it does not leak\n\n\tint X, Y, XX, YY, *RowPos = (int *)(*Row)->Data, *ColPos = (int *)(*Col)->Data;\n\n\tfor (X = -Left; X < Width + Right; X++)\n\t{\n\t\tif (X < 0)\n\t\t{\n\t\t\tif (Edge == EdgeMode::Tile)\t\t\t\t\t\t\t// clamp to the edge pixel\n\t\t\t\tRowPos[X + Left] = 0;\n\t\t\telse\n\t\t\t{\n\t\t\t\tXX = -X;\n\t\t\t\twhile (XX >= Width) XX -= Width;\t\t\t// wrap back into range\n\t\t\t\tRowPos[X + Left] = XX;\n\t\t\t}\n\t\t}\n\t\telse if (X >= Width)\n\t\t{\n\t\t\tif (Edge == EdgeMode::Tile)\n\t\t\t\tRowPos[X + Left] = Width - 1;\n\t\t\telse\n\t\t\t{\n\t\t\t\tXX = Width - (X - Width + 2);\n\t\t\t\twhile (XX < 0) XX += Width;\n\t\t\t\tRowPos[X + Left] = XX;\n\t\t\t}\n\t\t}\n\t\telse\n\t\t{\n\t\t\tRowPos[X + Left] = X;\n\t\t}\n\t}\n\n\tfor (Y = -Top; Y < Height + Bottom; Y++)\n\t{\n\t\tif (Y < 0)\n\t\t{\n\t\t\tif (Edge == EdgeMode::Tile)\n\t\t\t\tColPos[Y + Top] = 0;\n\t\t\telse\n\t\t\t{\n\t\t\t\tYY = -Y;\n\t\t\t\twhile (YY >= Height) YY -= Height;\n\t\t\t\tColPos[Y + Top] = YY;\n\t\t\t}\n\t\t}\n\t\telse if (Y >= Height)\n\t\t{\n\t\t\tif (Edge == EdgeMode::Tile)\n\t\t\t\tColPos[Y + Top] = Height - 1;\n\t\t\telse\n\t\t\t{\n\t\t\t\tYY = Height - (Y - Height + 2);\n\t\t\t\twhile (YY < 0) YY += Height;\n\t\t\t\tColPos[Y + Top] = YY;\n\t\t\t}\n\t\t}\n\t\telse\n\t\t{\n\t\t\tColPos[Y + Top] = Y;\n\t\t}\n\t}\n\treturn IS_RET_OK;\n}\n\n// Function 19: split a color image into separate R, G, B, A single-channel images\n// Parameters:\n// Src: source image to split\n// Blue: receives the blue channel image\n// Green: receives the green channel image\n// Red: receives the red channel image\n// Alpha: receives the alpha channel image\n// Writing 8 pixels per iteration is about 20% faster\n// Returns whether the function succeeded\nIS_RET SplitRGBA(TMatrix *Src, TMatrix **Blue, TMatrix **Green, TMatrix **Red, TMatrix **Alpha) {\n\tif (Src == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Data == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Src->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;\n\tIS_RET Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Blue);\n\tif (Ret != IS_RET_OK) goto Done;\n\tRet = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, 
Green);\n\tif (Ret != IS_RET_OK) goto Done;\n\tRet = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Red);\n\tif (Ret != IS_RET_OK) goto Done;\n\tif (Src->Channel == 4) {\n\t\tRet = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Alpha);\n\t\tif (Ret != IS_RET_OK) goto Done;\n\t}\n\tint X, Y, Block, Width = Src->Width, Height = Src->Height;\n\tunsigned char *LinePS, *LinePB, *LinePG, *LinePR, *LinePA;\n\tconst int BlockSize = 8;\n\tBlock = Width / BlockSize;\t\t\t\t\t\t//\t8·,ٶ࿪·ٶȲû\n\tif (Src->Channel == 3)\n\t{\n\t\tfor (Y = 0; Y < Height; Y++)\n\t\t{\n\t\t\tLinePS = Src->Data + Y * Src->WidthStep;\n\t\t\tLinePB = (*Blue)->Data + Y * (*Blue)->WidthStep;\n\t\t\tLinePG = (*Green)->Data + Y * (*Green)->WidthStep;\n\t\t\tLinePR = (*Red)->Data + Y * (*Red)->WidthStep;\n\t\t\tfor (X = 0; X < Block * BlockSize; X += BlockSize)\t\t\t//\tLinePBȫдһٶȷһЩ\n\t\t\t{\n\t\t\t\tLinePB[0] = LinePS[0];\t\tLinePG[0] = LinePS[1];\t\tLinePR[0] = LinePS[2];\n\t\t\t\tLinePB[1] = LinePS[3];\t\tLinePG[1] = LinePS[4];\t\tLinePR[1] = LinePS[5];\n\t\t\t\tLinePB[2] = LinePS[6];\t\tLinePG[2] = LinePS[7];\t\tLinePR[2] = LinePS[8];\n\t\t\t\tLinePB[3] = LinePS[9];\t\tLinePG[3] = LinePS[10];\t\tLinePR[3] = LinePS[11];\n\t\t\t\tLinePB[4] = LinePS[12];\t\tLinePG[4] = LinePS[13];\t\tLinePR[4] = LinePS[14];\n\t\t\t\tLinePB[5] = LinePS[15];\t\tLinePG[5] = LinePS[16];\t\tLinePR[5] = LinePS[17];\n\t\t\t\tLinePB[6] = LinePS[18];\t\tLinePG[6] = LinePS[19];\t\tLinePR[6] = LinePS[20];\n\t\t\t\tLinePB[7] = LinePS[21];\t\tLinePG[7] = LinePS[22];\t\tLinePR[7] = LinePS[23];\n\t\t\t\tLinePB += 8;\t\t\t\tLinePG += 8;\t\t\t\tLinePR += 8;\t\t\t\tLinePS += 24;\n\t\t\t}\n\t\t\twhile (X < Width)\n\t\t\t{\n\t\t\t\tLinePB[0] = LinePS[0];\t\tLinePG[0] = LinePS[1];\t\tLinePR[0] = LinePS[2];\n\t\t\t\tLinePB++;\t\t\t\t\tLinePG++;\t\t\t\t\tLinePR++;\t\t\t\t\tLinePS += 3;\n\t\t\t\tX++;\n\t\t\t}\n\t\t}\n\t}\n\telse if (Src->Channel == 4)\n\t{\n\t\tfor (Y = 0; Y < Height; Y++)\n\t\t{\n\t\t\tLinePS = Src->Data + Y 
* Src->WidthStep;\n\t\t\tLinePB = (*Blue)->Data + Y * (*Blue)->WidthStep;\n\t\t\tLinePG = (*Green)->Data + Y * (*Green)->WidthStep;\n\t\t\tLinePR = (*Red)->Data + Y * (*Red)->WidthStep;\n\t\t\tLinePA = (*Alpha)->Data + Y * (*Alpha)->WidthStep;\n\t\t\tfor (X = 0; X < Block * BlockSize; X += BlockSize)\n\t\t\t{\n\t\t\t\tLinePB[0] = LinePS[0];\t\tLinePG[0] = LinePS[1];\t\tLinePR[0] = LinePS[2];\t\tLinePA[0] = LinePS[3];\n\t\t\t\tLinePB[1] = LinePS[4];\t\tLinePG[1] = LinePS[5];\t\tLinePR[1] = LinePS[6];\t\tLinePA[1] = LinePS[7];\n\t\t\t\tLinePB[2] = LinePS[8];\t\tLinePG[2] = LinePS[9];\t\tLinePR[2] = LinePS[10];\t\tLinePA[2] = LinePS[11];\n\t\t\t\tLinePB[3] = LinePS[12];\t\tLinePG[3] = LinePS[13];\t\tLinePR[3] = LinePS[14];\t\tLinePA[3] = LinePS[15];\n\t\t\t\tLinePB[4] = LinePS[16];\t\tLinePG[4] = LinePS[17];\t\tLinePR[4] = LinePS[18];\t\tLinePA[4] = LinePS[19];\n\t\t\t\tLinePB[5] = LinePS[20];\t\tLinePG[5] = LinePS[21];\t\tLinePR[5] = LinePS[22];\t\tLinePA[5] = LinePS[23];\n\t\t\t\tLinePB[6] = LinePS[24];\t\tLinePG[6] = LinePS[25];\t\tLinePR[6] = LinePS[26];\t\tLinePA[6] = LinePS[27];\n\t\t\t\tLinePB[7] = LinePS[28];\t\tLinePG[7] = LinePS[29];\t\tLinePR[7] = LinePS[30];\t\tLinePA[7] = LinePS[31];\n\t\t\t\tLinePB += 8;\t\t\t\tLinePG += 8;\t\t\t\tLinePR += 8;\t\t\t\tLinePA += 8;\t\t\t\tLinePS += 32;\n\t\t\t}\n\t\t\twhile (X < Width)\n\t\t\t{\n\t\t\t\tLinePB[0] = LinePS[0];\t\tLinePG[0] = LinePS[1];\t\tLinePR[0] = LinePS[2];\t\tLinePA[0] = LinePS[3];\n\t\t\t\tLinePB++;\t\t\t\t\tLinePG++;\t\t\t\t\tLinePR++;\t\t\t\t\tLinePA++;\t\t\t\t\tLinePS += 4;\n\t\t\t\tX++;\n\t\t\t}\n\t\t}\n\t}\n\treturn IS_RET_OK;\nDone:\n\tif (*Blue != NULL) IS_FreeMatrix(Blue);\n\tif (*Green != NULL) IS_FreeMatrix(Green);\n\tif (*Red != NULL) IS_FreeMatrix(Red);\n\tif (*Alpha != NULL) IS_FreeMatrix(Alpha);\n\treturn Ret;\n}\n\n// 20: R,G,B,AͨͼϲΪɫͼ\n// б:\n// Dest: ϲͼݽṹ\n// Blue: Blueͨͼݽṹ\n// Green: Greenͨͼݽṹ\n// Red: Redͨͼݽṹ\n// Alpha: Alphaͨͼݽṹ\nIS_RET CombineRGBA(TMatrix *Dest, TMatrix *Blue, 
TMatrix *Green, TMatrix *Red, TMatrix *Alpha)\n{\n\tif (Dest == NULL || Blue == NULL || Green == NULL || Red == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif (Dest->Data == NULL || Blue->Data == NULL || Green->Data == NULL || Red->Data == NULL) return IS_RET_ERR_NULLREFERENCE;\n\tif ((Dest->Channel != 3 && Dest->Channel != 4) || Blue->Channel != 1 || Green->Channel != 1 || Red->Channel != 1) return IS_RET_ERR_PARAMISMATCH;\n\tif (Dest->Width != Blue->Width || Dest->Width != Green->Width || Dest->Width != Red->Width || Dest->Width != Blue->Width)  return IS_RET_ERR_PARAMISMATCH;\n\tif (Dest->Height != Blue->Height || Dest->Height != Green->Height || Dest->Height != Red->Height || Dest->Height != Blue->Height)  return IS_RET_ERR_PARAMISMATCH;\n\n\tif (Dest->Channel == 4)\n\t{\n\t\tif (Alpha == NULL) return IS_RET_ERR_NULLREFERENCE;\n\t\tif (Alpha->Data == NULL) return IS_RET_ERR_NULLREFERENCE;\n\t\tif (Alpha->Channel != 1) return IS_RET_ERR_PARAMISMATCH;\n\t\tif (Dest->Width != Alpha->Width || Dest->Height != Alpha->Height) return IS_RET_ERR_PARAMISMATCH;\n\t}\n\n\tint X, Y, Block, Width = Dest->Width, Height = Dest->Height;\n\tunsigned char *LinePD, *LinePB, *LinePG, *LinePR, *LinePA;\n\tconst int BlockSize = 8;\n\tBlock = Width / BlockSize;\t\t\t\t\t\t//\t8·,ٶ࿪·ٶȲû\n\n\tif (Dest->Channel == 3)\n\t{\n\t\tfor (Y = 0; Y < Height; Y++)\n\t\t{\n\t\t\tLinePD = Dest->Data + Y * Dest->WidthStep;\n\t\t\tLinePB = Blue->Data + Y * Blue->WidthStep;\n\t\t\tLinePG = Green->Data + Y * Green->WidthStep;\n\t\t\tLinePR = Red->Data + Y * Red->WidthStep;\n\t\t\tfor (X = 0; X < Block * BlockSize; X += BlockSize)\t\t\t\t//\tLinePBȫдһٶ𲻴\n\t\t\t{\n\t\t\t\tLinePD[0] = LinePB[0];\t\tLinePD[1] = LinePG[0];\t\tLinePD[2] = LinePR[0];\n\t\t\t\tLinePD[3] = LinePB[1];\t\tLinePD[4] = LinePG[1];\t\tLinePD[5] = LinePR[1];\n\t\t\t\tLinePD[6] = LinePB[2];\t\tLinePD[7] = LinePG[2];\t\tLinePD[8] = LinePR[2];\n\t\t\t\tLinePD[9] = LinePB[3];\t\tLinePD[10] = LinePG[3];\t\tLinePD[11] = 
LinePR[3];\n\t\t\t\tLinePD[12] = LinePB[4];\t\tLinePD[13] = LinePG[4];\t\tLinePD[14] = LinePR[4];\n\t\t\t\tLinePD[15] = LinePB[5];\t\tLinePD[16] = LinePG[5];\t\tLinePD[17] = LinePR[5];\n\t\t\t\tLinePD[18] = LinePB[6];\t\tLinePD[19] = LinePG[6];\t\tLinePD[20] = LinePR[6];\n\t\t\t\tLinePD[21] = LinePB[7];\t\tLinePD[22] = LinePG[7];\t\tLinePD[23] = LinePR[7];\n\t\t\t\tLinePB += 8;\t\t\t\tLinePG += 8;\t\t\t\tLinePR += 8;\t\t\t\tLinePD += 24;\n\t\t\t}\n\t\t\twhile (X < Width)\n\t\t\t{\n\t\t\t\tLinePD[0] = LinePB[0];\t\tLinePD[1] = LinePG[0];\t\tLinePD[2] = LinePR[0];\n\t\t\t\tLinePB++;\t\t\t\t\tLinePG++;\t\t\t\t\tLinePR++;\t\t\t\t\tLinePD += 3;\n\t\t\t\tX++;\n\t\t\t}\n\t\t}\n\t}\n\telse if (Dest->Channel == 4)\n\t{\n\t\tfor (Y = 0; Y < Height; Y++)\n\t\t{\n\t\t\tLinePD = Dest->Data + Y * Dest->WidthStep;\n\t\t\tLinePB = Blue->Data + Y * Blue->WidthStep;\n\t\t\tLinePG = Green->Data + Y * Green->WidthStep;\n\t\t\tLinePR = Red->Data + Y * Red->WidthStep;\n\t\t\tLinePA = Alpha->Data + Y * Alpha->WidthStep;\n\t\t\tfor (X = 0; X < Block * BlockSize; X += BlockSize)\n\t\t\t{\n\t\t\t\tLinePD[0] = LinePB[0];\t\tLinePD[1] = LinePG[0];\t\tLinePD[2] = LinePR[0];\t\tLinePD[3] = LinePA[0];\n\t\t\t\tLinePD[4] = LinePB[1];\t\tLinePD[5] = LinePG[1];\t\tLinePD[6] = LinePR[1];\t\tLinePD[7] = LinePA[1];\n\t\t\t\tLinePD[8] = LinePB[2];\t\tLinePD[9] = LinePG[2];\t\tLinePD[10] = LinePR[2];\t\tLinePD[11] = LinePA[2];\n\t\t\t\tLinePD[12] = LinePB[3];\t\tLinePD[13] = LinePG[3];\t\tLinePD[14] = LinePR[3];\t\tLinePD[15] = LinePA[3];\n\t\t\t\tLinePD[16] = LinePB[4];\t\tLinePD[17] = LinePG[4];\t\tLinePD[18] = LinePR[4];\t\tLinePD[19] = LinePA[4];\n\t\t\t\tLinePD[20] = LinePB[5];\t\tLinePD[21] = LinePG[5];\t\tLinePD[22] = LinePR[5];\t\tLinePD[23] = LinePA[5];\n\t\t\t\tLinePD[24] = LinePB[6];\t\tLinePD[25] = LinePG[6];\t\tLinePD[26] = LinePR[6];\t\tLinePD[27] = LinePA[6];\n\t\t\t\tLinePD[28] = LinePB[7];\t\tLinePD[29] = LinePG[7];\t\tLinePD[30] = LinePR[7];\t\tLinePD[31] = LinePA[7];\n\t\t\t\tLinePB 
+= 8;\t\t\t\tLinePG += 8;\t\t\t\tLinePR += 8;\t\t\t\tLinePA += 8;\t\t\t\tLinePD += 32;\n\t\t\t}\n\t\t\twhile (X < Width)\n\t\t\t{\n\t\t\t\tLinePD[0] = LinePB[0];\t\tLinePD[1] = LinePG[0];\t\tLinePD[2] = LinePR[0];\t\tLinePD[3] = LinePA[0];\n\t\t\t\tLinePB++;\t\t\t\t\tLinePG++;\t\t\t\t\tLinePR++;\t\t\t\t\tLinePA++;\t\t\t\t\tLinePD += 4;\n\t\t\t\tX++;\n\t\t\t}\n\t\t}\n\t}\n\treturn IS_RET_OK;\n}"
  },
  {
    "path": "speed_integral_graph_sse.cpp",
    "content": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n\nusing namespace std;\nusing namespace cv;\n\nvoid GetGrayIntegralImage(unsigned char *Src, int *Integral, int Width, int Height, int Stride)\n{\n\tmemset(Integral, 0, (Width + 1) * sizeof(int));                    //    第一行都为0\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tint *LinePL = Integral + Y * (Width + 1) + 1;                //上一行的位置\n\t\tint *LinePD = Integral + (Y + 1) * (Width + 1) + 1;           //    当前位置，注意每行的第一列的值都为0\n\t\tLinePD[-1] = 0;                                               //    第一列的值为0\n\t\tfor (int X = 0, Sum = 0; X < Width; X++)\n\t\t{\n\t\t\tSum += LinePS[X];                                          //    行方向累加\n\t\t\tLinePD[X] = LinePL[X] + Sum;                               //    更新积分图\n\t\t}\n\t}\n}\n\nvoid GetGrayIntegralImage_SSE(unsigned char *Src, int *Integral, int Width, int Height, int Stride) {\n\tmemset(Integral, 0, (Width + 1) * sizeof(int)); //第一行都为0\n\tint BlockSize = 8, Block = Width / BlockSize;\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tint *LinePL = Integral + Y * (Width + 1) + 1; //上一行位置\n\t\tint *LinePD = Integral + (Y + 1) * (Width + 1) + 1; //当前位置，注意每行的第一列都为0\n\t\tLinePD[-1] = 0;\n\t\t__m128i PreV = _mm_setzero_si128();\n\t\t__m128i Zero = _mm_setzero_si128();\n\t\tfor (int X = 0; X < Block * BlockSize; X += BlockSize) {\n\t\t\t__m128i Src_Shift0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i*)(LinePS + X)), Zero); //A7 A6 A5 A 4 A3 A2 A1 A0\n\t\t\t__m128i Src_Shift1 = _mm_slli_si128(Src_Shift0, 2); //A6 A5 A4 A3 A2 A1 A0 0\n\t\t\t__m128i Src_Shift2 = _mm_slli_si128(Src_Shift1, 2); //A5 A4 A3 A2 A1 A0 0  0\n\t\t\t__m128i Src_Shift3 = _mm_slli_si128(Src_Shift2, 2); //A4 A3 A2 A1 A0 0  0  0\n\t\t\t__m128i Shift_Add12 = _mm_add_epi16(Src_Shift1, Src_Shift2); //A6+A5 A5+A4 A4+A3 A3+A2 A2+A1 A1+A0 A0+0  0+0\n\t\t\t__m128i Shift_Add03 = _mm_add_epi16(Src_Shift0, 
Src_Shift3); //A7+A4 A6+A3 A5+A2 A4+A1 A3+A0 A2+0  A1+0  A0+0 \n\t\t\t__m128i Low = _mm_add_epi16(Shift_Add12, Shift_Add03); //A7+A6+A5+A4 A6+A5+A4+A3 A5+A4+A3+A2 A4+A3+A2+A1 A3+A2+A1+A0 A2+A1+A0+0 A1+A0+0+0 A0+0+0+0\n\t\t\t__m128i High = _mm_add_epi32(_mm_unpackhi_epi16(Low, Zero), _mm_unpacklo_epi16(Low, Zero)); //A7+A6+A5+A4+A3+A2+A1+A0  A6+A5+A4+A3+A2+A1+A0  A5+A4+A3+A2+A1+A0  A4+A3+A2+A1+A0\n\t\t\t__m128i SumL = _mm_loadu_si128((__m128i *)(LinePL + X + 0));\n\t\t\t__m128i SumH = _mm_loadu_si128((__m128i *)(LinePL + X + 4));\n\t\t\tSumL = _mm_add_epi32(SumL, PreV);\n\t\t\tSumL = _mm_add_epi32(SumL, _mm_unpacklo_epi16(Low, Zero));\n\t\t\tSumH = _mm_add_epi32(SumH, PreV);\n\t\t\tSumH = _mm_add_epi32(SumH, High);\n\t\t\tPreV = _mm_add_epi32(PreV, _mm_shuffle_epi32(High, _MM_SHUFFLE(3, 3, 3, 3)));\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + X + 0), SumL);\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + X + 4), SumH);\n\t\t}\n\t\tfor (int X = Block * BlockSize, V = LinePD[X - 1] - LinePL[X - 1]; X < Width; X++)\n\t\t{\n\t\t\tV += LinePS[X];\n\t\t\tLinePD[X] = V + LinePL[X];\n\t\t}\n\t}\n}\n\nvoid BoxBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) {\n\tint *Integral = (int *)malloc((Width + 1) * (Height + 1) * sizeof(int));\n\tGetGrayIntegralImage(Src, Integral, Width, Height, Stride);\n//#pragma omp parallel for num_threads(4)\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tint Y1 = max(Y - Radius, 0);\n\t\tint Y2 = min(Y + Radius + 1, Height);\t\t// the integral image has Height + 1 rows, so Height is a valid index\n\t\tint *LineP1 = Integral + Y1 * (Width + 1);\n\t\tint *LineP2 = Integral + Y2 * (Width + 1);\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tfor (int X = 0; X < Width; X++) {\t\t// iterate over columns (was bounded by Height)\n\t\t\tint X1 = max(X - Radius, 0);\n\t\t\tint X2 = min(X + Radius + 1, Width);\n\t\t\tint Sum = LineP2[X2] - LineP1[X2] - LineP2[X1] + LineP1[X1];\n\t\t\tint PixelCount = (X2 - X1) * (Y2 - Y1);\n\t\t\tLinePD[X] = (Sum + (PixelCount >> 1)) / PixelCount;\n\t\t}\n\t}\n\tfree(Integral);\n}\n\nint main() 
{\n\tMat src = imread(\"F:\\\\car.jpg\", 0);\n\tif (src.empty()) return -1;\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned char *Dest = new unsigned char[Height * Width];\n\tint Stride = Width;\n\tint Radius = 11;\n\tint64 st = cv::getTickCount();\n\tfor (int i = 0; i < 10; i++) {\n\t\tBoxBlur(Src, Dest, Width, Height, Stride, Radius);\n\t}\n\t// total seconds / 10 runs * 1000 = average milliseconds per run\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;\n\tprintf(\"%.5f\\n\", duration);\n\tBoxBlur(Src, Dest, Width, Height, Stride, Radius);\n\tMat dst(Height, Width, CV_8UC1, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n\tdelete[] Dest;\n\treturn 0;\n}"
  },
  {
    "path": "speed_max_filter_sse.cpp",
    "content": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include \"../../OpencvTest/OpencvTest/Core.h\"\n#include \"../../OpencvTest/OpencvTest/MaxFilter.h\"\n#include \"../../OpencvTest/OpencvTest/Utility.h\"\nusing namespace std;\nusing namespace cv;\n\nvoid MaxFilter_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) {\n\tTMatrix a, b;\n\tTMatrix *p1 = &a, *p2 = &b;\n\tTMatrix **p3 = &p1, **p4 = &p2;\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3);\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4);\n\t(p1)->Data = Src;\n\t(p2)->Data = Dest;\n\tMaxFilter(p1, p2, Radius);\n}\n\nMat MaxFilter(Mat src, int radius) {\n\tint row = src.rows;\n\tint col = src.cols;\n\tint border = (radius - 1) / 2;\n\tMat dst(row, col, CV_8UC3);\n\tprintf(\"success\\n\");\n\tfor (int i = border; i + border < row; i++) {\n\t\tfor (int j = border; j + border < col; j++) {\n\t\t\tfor (int k = 0; k < 3; k++) {\n\t\t\t\tint val = src.at<Vec3b>(i, j)[k];\n\t\t\t\tfor (int x = -border; x <= border; x++) {\n\t\t\t\t\tfor (int y = -border; y <= border; y++) {\n\t\t\t\t\t\tval = max(val, (int)src.at<Vec3b>(i + x, j + y)[k]);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tdst.at<Vec3b>(i, j)[k] = val;\n\t\t\t}\n\t\t}\n\t}\n\tprintf(\"success\\n\");\n\treturn dst;\n}\n\nint main() {\n\tMat src = imread(\"F:\\\\car.jpg\");\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned char *Dest = new unsigned char[Height * Width * 3];\n\tint Stride = Width * 3;\n\tint Radius = 11;\n\tint64 st = cvGetTickCount();\n\tfor (int i = 0; i <10; i++) {\n\t\tMat temp = MaxFilter(src, Radius);\n\t\t//MaxFilter_SSE(Src, Dest, Width, Height, Stride, 3, Radius);\n\t}\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;\n\tprintf(\"%.5f\\n\", duration);\n\tMaxFilter_SSE(Src, Dest, Width, Height, Stride, 3, Radius);\n\tMat dst(Height, Width, CV_8UC3, Dest);\n\timshow(\"origin\", 
src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n\treturn 0;\n}"
  },
  {
    "path": "speed_median_filter_3x3_sse.cpp",
    "content": "#include \"stdafx.h\"\r\n#include <stdio.h>\r\n#include <opencv2/opencv.hpp>\r\nusing namespace std;\r\nusing namespace cv;\r\n\r\nint ComparisonFunction(const void *X, const void *Y) {\r\n\tunsigned char Dx = *(unsigned char *)X;\r\n\tunsigned char Dy = *(unsigned char *)Y;\r\n\tif (Dx < Dy) return -1;\r\n\telse if (Dx > Dy) return 1;\r\n\telse return 0;\r\n}\r\n\r\nvoid MedianBlur3X3_Ori(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\r\n\tint Channel = Stride / Width;\r\n\tif (Channel == 1) {\r\n\t\tunsigned char Array[9];\r\n\t\tfor (int Y = 1; Y < Height - 1; Y++) {\r\n\t\t\tunsigned char *LineP0 = Src + (Y - 1) * Stride + 1;\r\n\t\t\tunsigned char *LineP1 = LineP0 + Stride;\r\n\t\t\tunsigned char *LineP2 = LineP1 + Stride;\r\n\t\t\tunsigned char *LinePD = Dest + Y * Stride + 1;\r\n\t\t\tfor (int X = 1; X < Width - 1; X++) {\r\n\t\t\t\tArray[0] = LineP0[X - 1];        Array[1] = LineP0[X];    Array[2] = LineP0[X + 1];\r\n\t\t\t\tArray[3] = LineP1[X - 1];        Array[4] = LineP1[X];    Array[5] = LineP2[X + 1];\r\n\t\t\t\tArray[6] = LineP2[X - 1];        Array[7] = LineP2[X];    Array[8] = LineP2[X + 1];\r\n\t\t\t\tqsort(Array, 9, sizeof(unsigned char), &ComparisonFunction);\r\n\t\t\t\tLinePD[X] = Array[4];\r\n\t\t\t}\r\n\t\t}\r\n\t}\r\n\telse {\r\n\t\tunsigned char ArrayB[9], ArrayG[9], ArrayR[9];\r\n\t\tfor (int Y = 1; Y < Height - 1; Y++) {\r\n\t\t\tunsigned char *LineP0 = Src + (Y - 1) * Stride + 3;\r\n\t\t\tunsigned char *LineP1 = LineP0 + Stride;\r\n\t\t\tunsigned char *LineP2 = LineP1 + Stride;\r\n\t\t\tunsigned char *LinePD = Dest + Y * Stride + 3;\r\n\t\t\tfor (int X = 1; X < Width - 1; X++) {\r\n\t\t\t\tArrayB[0] = LineP0[-3];       ArrayG[0] = LineP0[-2];       ArrayR[0] = LineP0[-1];\r\n\t\t\t\tArrayB[1] = LineP0[0];        ArrayG[1] = LineP0[1];        ArrayR[1] = LineP0[2];\r\n\t\t\t\tArrayB[2] = LineP0[3];        ArrayG[2] = LineP0[4];        ArrayR[2] = LineP0[5];\r\n\r\n\t\t\t\tArrayB[3] = 
LineP1[-3];       ArrayG[3] = LineP1[-2];       ArrayR[3] = LineP1[-1];\r\n\t\t\t\tArrayB[4] = LineP1[0];        ArrayG[4] = LineP1[1];        ArrayR[4] = LineP1[2];\r\n\t\t\t\tArrayB[5] = LineP1[3];        ArrayG[5] = LineP1[4];        ArrayR[5] = LineP1[5];\r\n\r\n\t\t\t\tArrayB[6] = LineP2[-3];       ArrayG[6] = LineP2[-2];       ArrayR[6] = LineP2[-1];\r\n\t\t\t\tArrayB[7] = LineP2[0];        ArrayG[7] = LineP2[1];        ArrayR[7] = LineP2[2];\r\n\t\t\t\tArrayB[8] = LineP2[3];        ArrayG[8] = LineP2[4];        ArrayR[8] = LineP2[5];\r\n\r\n\t\t\t\tqsort(ArrayB, 9, sizeof(unsigned char), &ComparisonFunction);\r\n\t\t\t\tqsort(ArrayG, 9, sizeof(unsigned char), &ComparisonFunction);\r\n\t\t\t\tqsort(ArrayR, 9, sizeof(unsigned char), &ComparisonFunction);\r\n\r\n\t\t\t\tLinePD[0] = ArrayB[4];\r\n\t\t\t\tLinePD[1] = ArrayG[4];\r\n\t\t\t\tLinePD[2] = ArrayR[4];\r\n\r\n\t\t\t\tLineP0 += 3;\r\n\t\t\t\tLineP1 += 3;\r\n\t\t\t\tLineP2 += 3;\r\n\t\t\t\tLinePD += 3;\r\n\t\t\t}\r\n\t\t}\r\n\t}\r\n}\r\n\r\nvoid Swap(int &X, int &Y) {\r\n\tX ^= Y;\r\n\tY ^= X;\r\n\tX ^= Y;\r\n}\r\n\r\nvoid MedianBlur3X3_Faster(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\r\n\tint Channel = Stride / Width;\r\n\tif (Channel == 1) {\r\n\r\n\t\tfor (int Y = 1; Y < Height - 1; Y++) {\r\n\t\t\tunsigned char *LineP0 = Src + (Y - 1) * Stride + 1;\r\n\t\t\tunsigned char *LineP1 = LineP0 + Stride;\r\n\t\t\tunsigned char *LineP2 = LineP1 + Stride;\r\n\t\t\tunsigned char *LinePD = Dest + Y * Stride + 1;\r\n\t\t\t// The line pointers already point at column 1, so X counts from 0 to stay inside the row\r\n\t\t\tfor (int X = 0; X < Width - 2; X++) {\r\n\t\t\t\tint Gray0, Gray1, Gray2, Gray3, Gray4, Gray5, Gray6, Gray7, Gray8;\r\n\t\t\t\tGray0 = LineP0[X - 1];        Gray1 = LineP0[X];    Gray2 = LineP0[X + 1];\r\n\t\t\t\tGray3 = LineP1[X - 1];        Gray4 = LineP1[X];    Gray5 = LineP1[X + 1];\r\n\t\t\t\tGray6 = LineP2[X - 1];        Gray7 = LineP2[X];    Gray8 = LineP2[X + 1];\r\n\r\n\t\t\t\tif (Gray1 > Gray2) Swap(Gray1, Gray2);\r\n\t\t\t\tif (Gray4 > Gray5) 
Swap(Gray4, Gray5);\r\n\t\t\t\tif (Gray7 > Gray8) Swap(Gray7, Gray8);\r\n\t\t\t\tif (Gray0 > Gray1) Swap(Gray0, Gray1);\r\n\t\t\t\tif (Gray3 > Gray4) Swap(Gray3, Gray4);\r\n\t\t\t\tif (Gray6 > Gray7) Swap(Gray6, Gray7);\r\n\t\t\t\tif (Gray1 > Gray2) Swap(Gray1, Gray2);\r\n\t\t\t\tif (Gray4 > Gray5) Swap(Gray4, Gray5);\r\n\t\t\t\tif (Gray7 > Gray8) Swap(Gray7, Gray8);\r\n\t\t\t\tif (Gray0 > Gray3) Swap(Gray0, Gray3);\r\n\t\t\t\tif (Gray5 > Gray8) Swap(Gray5, Gray8);\r\n\t\t\t\tif (Gray4 > Gray7) Swap(Gray4, Gray7);\r\n\t\t\t\tif (Gray3 > Gray6) Swap(Gray3, Gray6);\r\n\t\t\t\tif (Gray1 > Gray4) Swap(Gray1, Gray4);\r\n\t\t\t\tif (Gray2 > Gray5) Swap(Gray2, Gray5);\r\n\t\t\t\tif (Gray4 > Gray7) Swap(Gray4, Gray7);\r\n\t\t\t\tif (Gray4 > Gray2) Swap(Gray4, Gray2);\r\n\t\t\t\tif (Gray6 > Gray4) Swap(Gray6, Gray4);\r\n\t\t\t\tif (Gray4 > Gray2) Swap(Gray4, Gray2);\r\n\r\n\t\t\t\tLinePD[X] = Gray4;\r\n\t\t\t}\r\n\t\t}\r\n\r\n\t}\r\n\telse {\r\n\t\tfor (int Y = 1; Y < Height - 1; Y++) {\r\n\t\t\tunsigned char *LineP0 = Src + (Y - 1) * Stride + 3;\r\n\t\t\tunsigned char *LineP1 = LineP0 + Stride;\r\n\t\t\tunsigned char *LineP2 = LineP1 + Stride;\r\n\t\t\tunsigned char *LinePD = Dest + Y * Stride + 3;\r\n\t\t\tfor (int X = 1; X < Width - 1; X++) {\r\n\t\t\t\tint Blue0, Blue1, Blue2, Blue3, Blue4, Blue5, Blue6, Blue7, Blue8;\r\n\t\t\t\tint Green0, Green1, Green2, Green3, Green4, Green5, Green6, Green7, Green8;\r\n\t\t\t\tint Red0, Red1, Red2, Red3, Red4, Red5, Red6, Red7, Red8;\r\n\t\t\t\tBlue0 = LineP0[-3];        Green0 = LineP0[-2];    Red0 = LineP0[-1];\r\n\t\t\t\tBlue1 = LineP0[0];        Green1 = LineP0[1];        Red1 = LineP0[2];\r\n\t\t\t\tBlue2 = LineP0[3];        Green2 = LineP0[4];        Red2 = LineP0[5];\r\n\r\n\t\t\t\tBlue3 = LineP1[-3];        Green3 = LineP1[-2];    Red3 = LineP1[-1];\r\n\t\t\t\tBlue4 = LineP1[0];        Green4 = LineP1[1];        Red4 = LineP1[2];\r\n\t\t\t\tBlue5 = LineP1[3];        Green5 = LineP1[4];        Red5 = 
LineP1[5];\r\n\r\n\t\t\t\tBlue6 = LineP2[-3];        Green6 = LineP2[-2];    Red6 = LineP2[-1];\r\n\t\t\t\tBlue7 = LineP2[0];        Green7 = LineP2[1];        Red7 = LineP2[2];\r\n\t\t\t\tBlue8 = LineP2[3];        Green8 = LineP2[4];        Red8 = LineP2[5];\r\n\r\n\t\t\t\tif (Blue1 > Blue2) Swap(Blue1, Blue2);\r\n\t\t\t\tif (Blue4 > Blue5) Swap(Blue4, Blue5);\r\n\t\t\t\tif (Blue7 > Blue8) Swap(Blue7, Blue8);\r\n\t\t\t\tif (Blue0 > Blue1) Swap(Blue0, Blue1);\r\n\t\t\t\tif (Blue3 > Blue4) Swap(Blue3, Blue4);\r\n\t\t\t\tif (Blue6 > Blue7) Swap(Blue6, Blue7);\r\n\t\t\t\tif (Blue1 > Blue2) Swap(Blue1, Blue2);\r\n\t\t\t\tif (Blue4 > Blue5) Swap(Blue4, Blue5);\r\n\t\t\t\tif (Blue7 > Blue8) Swap(Blue7, Blue8);\r\n\t\t\t\tif (Blue0 > Blue3) Swap(Blue0, Blue3);\r\n\t\t\t\tif (Blue5 > Blue8) Swap(Blue5, Blue8);\r\n\t\t\t\tif (Blue4 > Blue7) Swap(Blue4, Blue7);\r\n\t\t\t\tif (Blue3 > Blue6) Swap(Blue3, Blue6);\r\n\t\t\t\tif (Blue1 > Blue4) Swap(Blue1, Blue4);\r\n\t\t\t\tif (Blue2 > Blue5) Swap(Blue2, Blue5);\r\n\t\t\t\tif (Blue4 > Blue7) Swap(Blue4, Blue7);\r\n\t\t\t\tif (Blue4 > Blue2) Swap(Blue4, Blue2);\r\n\t\t\t\tif (Blue6 > Blue4) Swap(Blue6, Blue4);\r\n\t\t\t\tif (Blue4 > Blue2) Swap(Blue4, Blue2);\r\n\r\n\t\t\t\tif (Green1 > Green2) Swap(Green1, Green2);\r\n\t\t\t\tif (Green4 > Green5) Swap(Green4, Green5);\r\n\t\t\t\tif (Green7 > Green8) Swap(Green7, Green8);\r\n\t\t\t\tif (Green0 > Green1) Swap(Green0, Green1);\r\n\t\t\t\tif (Green3 > Green4) Swap(Green3, Green4);\r\n\t\t\t\tif (Green6 > Green7) Swap(Green6, Green7);\r\n\t\t\t\tif (Green1 > Green2) Swap(Green1, Green2);\r\n\t\t\t\tif (Green4 > Green5) Swap(Green4, Green5);\r\n\t\t\t\tif (Green7 > Green8) Swap(Green7, Green8);\r\n\t\t\t\tif (Green0 > Green3) Swap(Green0, Green3);\r\n\t\t\t\tif (Green5 > Green8) Swap(Green5, Green8);\r\n\t\t\t\tif (Green4 > Green7) Swap(Green4, Green7);\r\n\t\t\t\tif (Green3 > Green6) Swap(Green3, Green6);\r\n\t\t\t\tif (Green1 > Green4) Swap(Green1, Green4);\r\n\t\t\t\tif (Green2 > 
Green5) Swap(Green2, Green5);\r\n\t\t\t\tif (Green4 > Green7) Swap(Green4, Green7);\r\n\t\t\t\tif (Green4 > Green2) Swap(Green4, Green2);\r\n\t\t\t\tif (Green6 > Green4) Swap(Green6, Green4);\r\n\t\t\t\tif (Green4 > Green2) Swap(Green4, Green2);\r\n\r\n\t\t\t\tif (Red1 > Red2) Swap(Red1, Red2);\r\n\t\t\t\tif (Red4 > Red5) Swap(Red4, Red5);\r\n\t\t\t\tif (Red7 > Red8) Swap(Red7, Red8);\r\n\t\t\t\tif (Red0 > Red1) Swap(Red0, Red1);\r\n\t\t\t\tif (Red3 > Red4) Swap(Red3, Red4);\r\n\t\t\t\tif (Red6 > Red7) Swap(Red6, Red7);\r\n\t\t\t\tif (Red1 > Red2) Swap(Red1, Red2);\r\n\t\t\t\tif (Red4 > Red5) Swap(Red4, Red5);\r\n\t\t\t\tif (Red7 > Red8) Swap(Red7, Red8);\r\n\t\t\t\tif (Red0 > Red3) Swap(Red0, Red3);\r\n\t\t\t\tif (Red5 > Red8) Swap(Red5, Red8);\r\n\t\t\t\tif (Red4 > Red7) Swap(Red4, Red7);\r\n\t\t\t\tif (Red3 > Red6) Swap(Red3, Red6);\r\n\t\t\t\tif (Red1 > Red4) Swap(Red1, Red4);\r\n\t\t\t\tif (Red2 > Red5) Swap(Red2, Red5);\r\n\t\t\t\tif (Red4 > Red7) Swap(Red4, Red7);\r\n\t\t\t\tif (Red4 > Red2) Swap(Red4, Red2);\r\n\t\t\t\tif (Red6 > Red4) Swap(Red6, Red4);\r\n\t\t\t\tif (Red4 > Red2) Swap(Red4, Red2);\r\n\r\n\t\t\t\tLinePD[0] = Blue4;\r\n\t\t\t\tLinePD[1] = Green4;\r\n\t\t\t\tLinePD[2] = Red4;\r\n\r\n\t\t\t\tLineP0 += 3;\r\n\t\t\t\tLineP1 += 3;\r\n\t\t\t\tLineP2 += 3;\r\n\t\t\t\tLinePD += 3;\r\n\t\t\t}\r\n\t\t}\r\n\t}\r\n}\r\n\r\ninline void _mm_sort_ab(__m128i &a, __m128i &b) {\r\n\tconst __m128i min = _mm_min_epu8(a, b);\r\n\tconst __m128i max = _mm_max_epu8(a, b);\r\n\ta = min;\r\n\tb = max;\r\n}\r\n\r\nvoid MedianBlur3X3_Fastest(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\r\n\tint Channel = Stride / Width;\r\n\tint BlockSize = 16, Block = ((Width - 2)* Channel) / BlockSize;\r\n\tfor (int Y = 1; Y < Height - 1; Y++) {\r\n\t\tunsigned char *LineP0 = Src + (Y - 1) * Stride + Channel;\r\n\t\tunsigned char *LineP1 = LineP0 + Stride;\r\n\t\tunsigned char *LineP2 = LineP1 + Stride;\r\n\t\tunsigned char *LinePD = Dest + Y * Stride 
+ Channel;\r\n\t\tfor (int X = 0; X < Block * BlockSize; X += BlockSize, LineP0 += BlockSize, LineP1 += BlockSize, LineP2 += BlockSize, LinePD += BlockSize)\r\n\t\t{\r\n\t\t\t__m128i P0 = _mm_loadu_si128((__m128i *)(LineP0 - Channel));\r\n\t\t\t__m128i P1 = _mm_loadu_si128((__m128i *)(LineP0 - 0));\r\n\t\t\t__m128i P2 = _mm_loadu_si128((__m128i *)(LineP0 + Channel));\r\n\t\t\t__m128i P3 = _mm_loadu_si128((__m128i *)(LineP1 - Channel));\r\n\t\t\t__m128i P4 = _mm_loadu_si128((__m128i *)(LineP1 - 0));\r\n\t\t\t__m128i P5 = _mm_loadu_si128((__m128i *)(LineP1 + Channel));\r\n\t\t\t__m128i P6 = _mm_loadu_si128((__m128i *)(LineP2 - Channel));\r\n\t\t\t__m128i P7 = _mm_loadu_si128((__m128i *)(LineP2 - 0));\r\n\t\t\t__m128i P8 = _mm_loadu_si128((__m128i *)(LineP2 + Channel));\r\n\r\n\t\t\t_mm_sort_ab(P1, P2);\t\t_mm_sort_ab(P4, P5);\t\t_mm_sort_ab(P7, P8);\r\n\t\t\t_mm_sort_ab(P0, P1);\t\t_mm_sort_ab(P3, P4);\t\t_mm_sort_ab(P6, P7);\r\n\t\t\t_mm_sort_ab(P1, P2);\t\t_mm_sort_ab(P4, P5);\t\t_mm_sort_ab(P7, P8);\r\n\t\t\t_mm_sort_ab(P0, P3);\t\t_mm_sort_ab(P5, P8);\t\t_mm_sort_ab(P4, P7);\r\n\t\t\t_mm_sort_ab(P3, P6);\t\t_mm_sort_ab(P1, P4);\t\t_mm_sort_ab(P2, P5);\r\n\t\t\t_mm_sort_ab(P4, P7);\t\t_mm_sort_ab(P4, P2);\t\t_mm_sort_ab(P6, P4);\r\n\t\t\t_mm_sort_ab(P4, P2);\r\n\r\n\t\t\t_mm_storeu_si128((__m128i *)LinePD, P4);\r\n\t\t}\r\n\r\n\t\tfor (int X = Block * BlockSize; X < (Width - 2) * Channel; X++, LinePD++) {\r\n\t\t\tint Gray0, Gray1, Gray2, Gray3, Gray4, Gray5, Gray6, Gray7, Gray8;\r\n\t\t\tGray0 = LineP0[X - Block * BlockSize - Channel];        Gray1 = LineP0[X - Block * BlockSize];    Gray2 = LineP0[X - Block * BlockSize + Channel];\r\n\t\t\tGray3 = LineP1[X - Block * BlockSize - Channel];        Gray4 = LineP1[X - Block * BlockSize];    Gray5 = LineP1[X - Block * BlockSize + Channel];\r\n\t\t\tGray6 = LineP2[X - Block * BlockSize - Channel];        Gray7 = LineP2[X - Block * BlockSize];    Gray8 = LineP2[X - Block * BlockSize + Channel];\r\n\r\n\t\t\tif (Gray1 > 
Gray2) Swap(Gray1, Gray2);\r\n\t\t\tif (Gray4 > Gray5) Swap(Gray4, Gray5);\r\n\t\t\tif (Gray7 > Gray8) Swap(Gray7, Gray8);\r\n\t\t\tif (Gray0 > Gray1) Swap(Gray0, Gray1);\r\n\t\t\tif (Gray3 > Gray4) Swap(Gray3, Gray4);\r\n\t\t\tif (Gray6 > Gray7) Swap(Gray6, Gray7);\r\n\t\t\tif (Gray1 > Gray2) Swap(Gray1, Gray2);\r\n\t\t\tif (Gray4 > Gray5) Swap(Gray4, Gray5);\r\n\t\t\tif (Gray7 > Gray8) Swap(Gray7, Gray8);\r\n\t\t\tif (Gray0 > Gray3) Swap(Gray0, Gray3);\r\n\t\t\tif (Gray5 > Gray8) Swap(Gray5, Gray8);\r\n\t\t\tif (Gray4 > Gray7) Swap(Gray4, Gray7);\r\n\t\t\tif (Gray3 > Gray6) Swap(Gray3, Gray6);\r\n\t\t\tif (Gray1 > Gray4) Swap(Gray1, Gray4);\r\n\t\t\tif (Gray2 > Gray5) Swap(Gray2, Gray5);\r\n\t\t\tif (Gray4 > Gray7) Swap(Gray4, Gray7);\r\n\t\t\tif (Gray4 > Gray2) Swap(Gray4, Gray2);\r\n\t\t\tif (Gray6 > Gray4) Swap(Gray6, Gray4);\r\n\t\t\tif (Gray4 > Gray2) Swap(Gray4, Gray2);\r\n\r\n\t\t\t// LinePD already advances in the loop header, and the loads above index by X - Block * BlockSize,\r\n\t\t\t// so store through LinePD[0] and leave the row pointers where the vector loop left them.\r\n\t\t\tLinePD[0] = Gray4;\r\n\t\t}\r\n\t}\r\n}\r\n\r\ninline void _mm_sort_AB(__m256i &a, __m256i &b) {\r\n\tconst __m256i min = _mm256_min_epu8(a, b);\r\n\tconst __m256i max = _mm256_max_epu8(a, b);\r\n\ta = min;\r\n\tb = max;\r\n}\r\n\r\nvoid MedianBlur3X3_Fastest_AVX(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\r\n\tint Channel = Stride / Width;\r\n\tint BlockSize = 32, Block = ((Width - 2)* Channel) / BlockSize;\r\n\tfor (int Y = 1; Y < Height - 1; Y++) {\r\n\t\tunsigned char *LineP0 = Src + (Y - 1) * Stride + Channel;\r\n\t\tunsigned char *LineP1 = LineP0 + Stride;\r\n\t\tunsigned char *LineP2 = LineP1 + Stride;\r\n\t\tunsigned char *LinePD = Dest + Y * Stride + Channel;\r\n\t\tfor (int X = 0; X < Block * BlockSize; X += BlockSize, LineP0 += BlockSize, LineP1 += BlockSize, LineP2 += BlockSize, LinePD += BlockSize)\r\n\t\t{\r\n\t\t\t__m256i P0 = _mm256_loadu_si256((const __m256i*)(LineP0 - Channel));\r\n\t\t\t__m256i P1 = _mm256_loadu_si256((const __m256i*)(LineP0 - 
0));\r\n\t\t\t__m256i P2 = _mm256_loadu_si256((const __m256i*)(LineP0 + Channel));\r\n\t\t\t__m256i P3 = _mm256_loadu_si256((const __m256i*)(LineP1 - Channel));\r\n\t\t\t__m256i P4 = _mm256_loadu_si256((const __m256i*)(LineP1 - 0));\r\n\t\t\t__m256i P5 = _mm256_loadu_si256((const __m256i*)(LineP1 + Channel));\r\n\t\t\t__m256i P6 = _mm256_loadu_si256((const __m256i*)(LineP2 - Channel));\r\n\t\t\t__m256i P7 = _mm256_loadu_si256((const __m256i*)(LineP2 - 0));\r\n\t\t\t__m256i P8 = _mm256_loadu_si256((const __m256i*)(LineP2 + Channel));\r\n\r\n\t\t\t_mm_sort_AB(P1, P2);\t\t_mm_sort_AB(P4, P5);\t\t_mm_sort_AB(P7, P8);\r\n\t\t\t_mm_sort_AB(P0, P1);\t\t_mm_sort_AB(P3, P4);\t\t_mm_sort_AB(P6, P7);\r\n\t\t\t_mm_sort_AB(P1, P2);\t\t_mm_sort_AB(P4, P5);\t\t_mm_sort_AB(P7, P8);\r\n\t\t\t_mm_sort_AB(P0, P3);\t\t_mm_sort_AB(P5, P8);\t\t_mm_sort_AB(P4, P7);\r\n\t\t\t_mm_sort_AB(P3, P6);\t\t_mm_sort_AB(P1, P4);\t\t_mm_sort_AB(P2, P5);\r\n\t\t\t_mm_sort_AB(P4, P7);\t\t_mm_sort_AB(P4, P2);\t\t_mm_sort_AB(P6, P4);\r\n\t\t\t_mm_sort_AB(P4, P2);\r\n\r\n\t\t\t_mm256_storeu_si256((__m256i *)LinePD, P4);\r\n\t\t}\r\n\r\n\t\tfor (int X = Block * BlockSize; X < (Width - 2) * Channel; X++, LinePD++) {\r\n\t\t\tint Gray0, Gray1, Gray2, Gray3, Gray4, Gray5, Gray6, Gray7, Gray8;\r\n\t\t\tGray0 = LineP0[X - Block * BlockSize - Channel];        Gray1 = LineP0[X - Block * BlockSize];    Gray2 = LineP0[X - Block * BlockSize + Channel];\r\n\t\t\tGray3 = LineP1[X - Block * BlockSize - Channel];        Gray4 = LineP1[X - Block * BlockSize];    Gray5 = LineP1[X - Block * BlockSize + Channel];\r\n\t\t\tGray6 = LineP2[X - Block * BlockSize - Channel];        Gray7 = LineP2[X - Block * BlockSize];    Gray8 = LineP2[X - Block * BlockSize + Channel];\r\n\r\n\t\t\tif (Gray1 > Gray2) Swap(Gray1, Gray2);\r\n\t\t\tif (Gray4 > Gray5) Swap(Gray4, Gray5);\r\n\t\t\tif (Gray7 > Gray8) Swap(Gray7, Gray8);\r\n\t\t\tif (Gray0 > Gray1) Swap(Gray0, Gray1);\r\n\t\t\tif (Gray3 > Gray4) Swap(Gray3, Gray4);\r\n\t\t\tif 
(Gray6 > Gray7) Swap(Gray6, Gray7);\r\n\t\t\tif (Gray1 > Gray2) Swap(Gray1, Gray2);\r\n\t\t\tif (Gray4 > Gray5) Swap(Gray4, Gray5);\r\n\t\t\tif (Gray7 > Gray8) Swap(Gray7, Gray8);\r\n\t\t\tif (Gray0 > Gray3) Swap(Gray0, Gray3);\r\n\t\t\tif (Gray5 > Gray8) Swap(Gray5, Gray8);\r\n\t\t\tif (Gray4 > Gray7) Swap(Gray4, Gray7);\r\n\t\t\tif (Gray3 > Gray6) Swap(Gray3, Gray6);\r\n\t\t\tif (Gray1 > Gray4) Swap(Gray1, Gray4);\r\n\t\t\tif (Gray2 > Gray5) Swap(Gray2, Gray5);\r\n\t\t\tif (Gray4 > Gray7) Swap(Gray4, Gray7);\r\n\t\t\tif (Gray4 > Gray2) Swap(Gray4, Gray2);\r\n\t\t\tif (Gray6 > Gray4) Swap(Gray6, Gray4);\r\n\t\t\tif (Gray4 > Gray2) Swap(Gray4, Gray2);\r\n\r\n\t\t\t// LinePD already advances in the loop header, and the loads above index by X - Block * BlockSize,\r\n\t\t\t// so store through LinePD[0] and leave the row pointers where the vector loop left them.\r\n\t\t\tLinePD[0] = Gray4;\r\n\t\t}\r\n\t}\r\n}\r\n\r\nint main() {\r\n\tMat src = imread(\"F:\\\\car.jpg\");\r\n\tint Height = src.rows;\r\n\tint Width = src.cols;\r\n\tunsigned char *Src = src.data;\r\n\tunsigned char *Dest = new unsigned char[Height * Width * 3];\r\n\tint Stride = Width * 3;\r\n\tint Radius = 7;\r\n\tint64 st = cvGetTickCount();\r\n\tfor (int i = 0; i <10; i++) {\r\n\t\t//Mat temp = MaxFilter(src, Radius);\r\n\t\tMedianBlur3X3_Fastest_AVX(Src, Dest, Width, Height, Stride);\r\n\t}\r\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;\r\n\tprintf(\"%.5f\\n\", duration);\r\n\tMedianBlur3X3_Fastest_AVX(Src, Dest, Width, Height, Stride);\r\n\tMat dst(Height, Width, CV_8UC3, Dest);\r\n\timshow(\"origin\", src);\r\n\timshow(\"result\", dst);\r\n\timwrite(\"F:\\\\res.jpg\", dst);\r\n\twaitKey(0);\r\n\treturn 0;\r\n}\r\n"
  },
  {
    "path": "speed_multi_scale_detail_boosting_see.cpp",
    "content": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include \"../../OpencvTest/OpencvTest/Core.h\"\n#include \"../../OpencvTest/OpencvTest/MaxFilter.h\"\n#include \"../../OpencvTest/OpencvTest/Utility.h\"\n#include \"../../OpencvTest/OpencvTest/BoxFilter.h\"\nusing namespace std;\nusing namespace cv;\n#define __SSSE3__ 1\n\n// Parameter order (Channel, Stride, Radius) matches the call sites in MultiScaleSharpen and MultiScaleSharpen_SSE.\nvoid BoxBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Channel, int Stride, int Radius) {\n\tTMatrix a, b;\n\tTMatrix *p1 = &a, *p2 = &b;\n\tTMatrix **p3 = &p1, **p4 = &p2;\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3);\n\tIS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4);\n\t(p1)->Data = Src;\n\t(p2)->Data = Dest;\n\tBoxBlur_SSE(p1, p2, Radius, EdgeMode::Smear);\n}\n\nint IM_Sign(int X) {\n\treturn (X >> 31) | (unsigned(-X)) >> 31;\n}\n\ninline unsigned char IM_ClampToByte(int Value)\n{\n\tif (Value < 0)\n\t\treturn 0;\n\telse if (Value > 255)\n\t\treturn 255;\n\telse\n\t\treturn (unsigned char)Value;\n\t//return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));\n}\n\n\ninline __m128i _mm_sgn_epi16(__m128i v) {\n#ifdef __SSSE3__\n\tv = _mm_sign_epi16(_mm_set1_epi16(1), v); // use PSIGNW on SSSE3 and later\n#else\n\tv = _mm_min_epi16(v, _mm_set1_epi16(1));  // use PMINSW/PMAXSW on SSE2/SSE3.\n\tv = _mm_max_epi16(v, _mm_set1_epi16(-1));\n\t//_mm_set1_epi16(1) = _mm_srli_epi16(_mm_cmpeq_epi16(v, v), 15);\n\t//_mm_set1_epi16(-1) = _mm_cmpeq_epi16(v, v);\n\n#endif\n\treturn v;\n}\n\nvoid MultiScaleSharpen(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) {\n\tint Channel = Stride / Width;\n\tunsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));\n\tunsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));\n\tunsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));\n\tBoxBlur_SSE(Src, B1, Width, Height, Channel, Stride, 
Radius);\n\tBoxBlur_SSE(Src, B2, Width, Height, Channel, Stride, Radius * 2);\n\tBoxBlur_SSE(Src, B3, Width, Height, Channel, Stride, Radius * 4);\n\tfor (int Y = 0; Y < Height * Stride; Y++) {\n\t\tint DiffB1 = Src[Y] - B1[Y];\n\t\tint DiffB2 = B1[Y] - B2[Y];\n\t\tint DiffB3 = B2[Y] - B3[Y];\n\t\tDest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]);\n\t}\n\t// Release the three temporary blur buffers; otherwise every call leaks them.\n\tfree(B1);\n\tfree(B2);\n\tfree(B3);\n}\n\nvoid MultiScaleSharpen_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) {\n\tint Channel = Stride / Width;\n\tunsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));\n\tunsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));\n\tunsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));\n\tBoxBlur_SSE(Src, B1, Width, Height, Channel, Stride, Radius);\n\tBoxBlur_SSE(Src, B2, Width, Height, Channel, Stride, Radius * 2);\n\tBoxBlur_SSE(Src, B3, Width, Height, Channel, Stride, Radius * 4);\n\tint BlockSize = 8, Block = (Height * Stride) / BlockSize;\n\t__m128i Zero = _mm_setzero_si128();\n\t__m128i Four = _mm_set1_epi16(4);\n\tfor (int Y = 0; Y < Block * BlockSize; Y += BlockSize) {\n\t\t__m128i SrcV = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Src + Y)), Zero);\n\t\t__m128i SrcB1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B1 + Y)), Zero);\n\t\t__m128i SrcB2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B2 + Y)), Zero);\n\t\t__m128i SrcB3 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B3 + Y)), Zero);\n\t\t__m128i DiffB1 = _mm_sub_epi16(SrcV, SrcB1);\n\t\t__m128i DiffB2 = _mm_sub_epi16(SrcB1, SrcB2);\n\t\t__m128i DiffB3 = _mm_sub_epi16(SrcB2, SrcB3);\n\t\t//__m128i Offset = _mm_srai_epi16(_mm_add_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_sub_epi16(Four, _mm_slli_epi16(_mm_sgn_epi16(DiffB1), 1)), DiffB1), _mm_slli_epi16(DiffB2, 1)), DiffB3), 2);\n\t\t__m128i Offset = 
_mm_add_epi16(_mm_srai_epi16(_mm_sub_epi16(_mm_slli_epi16(_mm_sub_epi16(SrcB1, _mm_sign_epi16(DiffB1, DiffB1)), 1), _mm_add_epi16(SrcB2, SrcB3)), 2), DiffB1);\n\t\t_mm_storel_epi64((__m128i *)(Dest + Y), _mm_packus_epi16(_mm_add_epi16(SrcV, Offset), Zero));\n\t}\n\tfor (int Y = Block * BlockSize; Y < Height * Stride; Y++) {\n\t\tint DiffB1 = Src[Y] - B1[Y];\n\t\tint DiffB2 = B1[Y] - B2[Y];\n\t\tint DiffB3 = B2[Y] - B3[Y];\n\t\tDest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]);\n\t}\n\t// Release the three temporary blur buffers; otherwise every call leaks them.\n\tfree(B1);\n\tfree(B2);\n\tfree(B3);\n}\n\nint main() {\n\tMat src = imread(\"F:\\\\car.jpg\");\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned char *Dest = new unsigned char[Height * Width * 3];\n\tint Stride = Width * 3;\n\tint Radius = 5;\n\tint64 st = cvGetTickCount();\n\tfor (int i = 0; i <10; i++) {\n\t\t//Mat temp = MaxFilter(src, Radius);\n\t\tMultiScaleSharpen_SSE(Src, Dest, Width, Height, Stride, Radius);\n\t}\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;\n\tprintf(\"%.5f\\n\", duration);\n\tMultiScaleSharpen(Src, Dest, Width, Height, Stride, Radius);\n\tMat dst(Height, Width, CV_8UC3, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n\treturn 0;\n}"
  },
  {
    "path": "speed_rgb2gray_sse.cpp",
    "content": "#include \"stdafx.h\"\n#include <opencv2/opencv.hpp>\n#include <future>\nusing namespace std;\nusing namespace cv;\n\n//origin\nvoid RGB2Y(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width;\n\t\tfor (int X = 0; X < Width; X++, LinePS += 3) {\n\t\t\tLinePD[X] = int(0.114 * LinePS[0] + 0.587 * LinePS[1] + 0.299 * LinePS[2]);\n\t\t}\n\t}\n}\n\n//int\nvoid RGB2Y_1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tconst int B_WT = int(0.114 * 256 + 0.5);\n\tconst int G_WT = int(0.587 * 256 + 0.5);\n\tconst int R_WT = 256 - B_WT - G_WT;\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width;\n\t\tfor (int X = 0; X < Width; X++, LinePS += 3) {\n\t\t\tLinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;\n\t\t}\n\t}\n}\n\n//4路并行\nvoid RGB2Y_2(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tconst int B_WT = int(0.114 * 256 + 0.5);\n\tconst int G_WT = int(0.587 * 256 + 0.5);\n\tconst int R_WT = 256 - B_WT - G_WT; // int(0.299 * 256 + 0.5)\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width;\n\t\tint X = 0;\n\t\tfor (; X < Width - 4; X += 4, LinePS += 12) {\n\t\t\tLinePD[X + 0] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;\n\t\t\tLinePD[X + 1] = (B_WT * LinePS[3] + G_WT * LinePS[4] + R_WT * LinePS[5]) >> 8;\n\t\t\tLinePD[X + 2] = (B_WT * LinePS[6] + G_WT * LinePS[7] + R_WT * LinePS[8]) >> 8;\n\t\t\tLinePD[X + 3] = (B_WT * LinePS[9] + G_WT * LinePS[10] + R_WT * LinePS[11]) >> 8;\n\t\t}\n\t\tfor (; X < Width; X++, LinePS += 3) {\n\t\t\tLinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;\n\t\t}\n\t}\n}\n\n//openmp\nvoid 
RGB2Y_3(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tconst int B_WT = int(0.114 * 256 + 0.5);\n\tconst int G_WT = int(0.587 * 256 + 0.5);\n\tconst int R_WT = 256 - B_WT - G_WT;\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width;\n#pragma omp parallel for num_threads(4)\n\t\tfor (int X = 0; X < Width; X++) {\n\t\t\tLinePD[X] = (B_WT * LinePS[0 + X*3] + G_WT * LinePS[1 + X*3] + R_WT * LinePS[2 + X*3]) >> 8;\n\t\t}\n\t}\n}\n\n//sse 一次处理12个\nvoid RGB2Y_4(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tconst int B_WT = int(0.114 * 256 + 0.5);\n\tconst int G_WT = int(0.587 * 256 + 0.5);\n\tconst int R_WT = 256 - B_WT - G_WT; // int(0.299 * 256 + 0.5)\n\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width;\n\t\tint X = 0;\n\t\tfor (; X < Width - 12; X += 12, LinePS += 36) {\n\t\t\t__m128i p1aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 0))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); //1\n\t\t\t__m128i p2aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 1))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); //2\n\t\t\t__m128i p3aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 2))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); //3\n\n\t\t\t__m128i p1aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 8))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));//4\n\t\t\t__m128i p2aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 9))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));//5\n\t\t\t__m128i p3aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 10))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, 
G_WT, R_WT));//6\n\n\t\t\t__m128i p1bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 18))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));//7\n\t\t\t__m128i p2bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 19))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));//8\n\t\t\t__m128i p3bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 20))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));//9\n\n\t\t\t__m128i p1bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 26))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));//10\n\t\t\t__m128i p2bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 27))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));//11\n\t\t\t__m128i p3bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 28))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));//12\n\n\t\t\t__m128i sumaL = _mm_add_epi16(p3aL, _mm_add_epi16(p1aL, p2aL));//13\n\t\t\t__m128i sumaH = _mm_add_epi16(p3aH, _mm_add_epi16(p1aH, p2aH));//14\n\t\t\t__m128i sumbL = _mm_add_epi16(p3bL, _mm_add_epi16(p1bL, p2bL));//15\n\t\t\t__m128i sumbH = _mm_add_epi16(p3bH, _mm_add_epi16(p1bH, p2bH));//16\n\t\t\t__m128i sclaL = _mm_srli_epi16(sumaL, 8);//17\n\t\t\t__m128i sclaH = _mm_srli_epi16(sumaH, 8);//18\n\t\t\t__m128i sclbL = _mm_srli_epi16(sumbL, 8);//19\n\t\t\t__m128i sclbH = _mm_srli_epi16(sumbH, 8);//20\n\t\t\t__m128i shftaL = _mm_shuffle_epi8(sclaL, _mm_setr_epi8(0, 6, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));//21\n\t\t\t__m128i shftaH = _mm_shuffle_epi8(sclaH, _mm_setr_epi8(-1, -1, -1, 18, 24, 30, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));//22\n\t\t\t__m128i shftbL = _mm_shuffle_epi8(sclbL, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 0, 6, 12, -1, -1, -1, -1, -1, -1, -1));//23\n\t\t\t__m128i shftbH = _mm_shuffle_epi8(sclbH, 
_mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, 18, 24, 30, -1, -1, -1, -1));//24\n\t\t\t__m128i accumL = _mm_or_si128(shftaL, shftbL);//25\n\t\t\t__m128i accumH = _mm_or_si128(shftaH, shftbH);//26\n\t\t\t__m128i h3 = _mm_or_si128(accumL, accumH);//27\n\t\t\t\t\t\t\t\t\t\t\t\t\t  //__m128i h3 = _mm_blendv_epi8(accumL, accumH, _mm_setr_epi8(0, 0, 0, -1, -1, -1, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1));\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + X), h3);\n\t\t}\n\t\tfor (; X < Width; X++, LinePS += 3) {\n\t\t\tLinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;\n\t\t}\n\t}\n}\n\n//sse 一次处理15个\nvoid RGB2Y_5(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tconst int B_WT = int(0.114 * 256 + 0.5);\n\tconst int G_WT = int(0.587 * 256 + 0.5);\n\tconst int R_WT = 256 - B_WT - G_WT; // int(0.299 * 256 + 0.5)\n\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width;\n\t\tint X = 0;\n\t\tfor (; X < Width - 15; X += 15, LinePS += 45)\n\t\t{\n\t\t\t__m128i p1aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 0))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); //1\n\t\t\t__m128i p2aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 1))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); //2\n\t\t\t__m128i p3aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 2))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); //3\n\n\t\t\t__m128i p1aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 8))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));\n\t\t\t__m128i p2aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 9))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));\n\t\t\t__m128i p3aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 
10))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));\n\n\t\t\t__m128i p1bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 18))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));\n\t\t\t__m128i p2bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 19))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));\n\t\t\t__m128i p3bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 20))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));\n\n\t\t\t__m128i p1bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 26))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));\n\t\t\t__m128i p2bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 27))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));\n\t\t\t__m128i p3bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 28))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));\n\n\t\t\t__m128i p1cH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 36))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));\n\t\t\t__m128i p2cH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 37))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));\n\t\t\t__m128i p3cH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 38))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));\n\n\t\t\t__m128i sumaL = _mm_add_epi16(p3aL, _mm_add_epi16(p1aL, p2aL));\n\t\t\t__m128i sumaH = _mm_add_epi16(p3aH, _mm_add_epi16(p1aH, p2aH));\n\t\t\t__m128i sumbL = _mm_add_epi16(p3bL, _mm_add_epi16(p1bL, p2bL));\n\t\t\t__m128i sumbH = _mm_add_epi16(p3bH, _mm_add_epi16(p1bH, p2bH));\n\t\t\t__m128i sumcH = _mm_add_epi16(p3cH, _mm_add_epi16(p1cH, p2cH));\n\n\t\t\t__m128i sclaL = _mm_srli_epi16(sumaL, 8);\n\t\t\t__m128i sclaH = 
_mm_srli_epi16(sumaH, 8);\n\t\t\t__m128i sclbL = _mm_srli_epi16(sumbL, 8);\n\t\t\t__m128i sclbH = _mm_srli_epi16(sumbH, 8);\n\t\t\t__m128i sclcH = _mm_srli_epi16(sumcH, 8);\n\n\t\t\t__m128i shftaL = _mm_shuffle_epi8(sclaL, _mm_setr_epi8(0, 6, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\t__m128i shftaH = _mm_shuffle_epi8(sclaH, _mm_setr_epi8(-1, -1, -1, 2, 8, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\t__m128i shftbL = _mm_shuffle_epi8(sclbL, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 0, 6, 12, -1, -1, -1, -1, -1, -1, -1));\n\t\t\t__m128i shftbH = _mm_shuffle_epi8(sclbH, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 8, 14, -1, -1, -1, -1));\n\t\t\t__m128i shftcH = _mm_shuffle_epi8(sclcH, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 6, 12, -1));\n\t\t\t__m128i accumL = _mm_or_si128(shftaL, shftbL);\n\t\t\t__m128i accumH = _mm_or_si128(shftaH, shftbH);\n\t\t\t__m128i h3 = _mm_or_si128(accumL, accumH);\n\t\t\th3 = _mm_or_si128(h3, shftcH);\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + X), h3);\n\t\t}\n\t\tfor (; X < Width; X++, LinePS += 3) {\n\t\t\tLinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;\n\t\t}\n\t}\n}\n\nvoid debug(__m128i var) {\n\tuint8_t *val = (uint8_t*)&var;//can also use uint32_t instead of 16_t \n\tprintf(\"Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\\n\",\n\t\tval[0], val[1], val[2], val[3], val[4], val[5],\n\t\tval[6], val[7], val[8], val[9], val[10], val[11], val[12], val[13],\n\t\tval[14], val[15]);\n}\n\nvoid debug2(__m256i var) {\n\tuint8_t *val = (uint8_t*)&var;//can also use uint32_t instead of 16_t \n\tprintf(\"Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\\n\",\n\t\tval[0], val[1], val[2], val[3], val[4], val[5],\n\t\tval[6], val[7], val[8], val[9], val[10], val[11], val[12], val[13],\n\t\tval[14], val[15], val[16], val[17], val[18], val[19], val[20], val[21], val[22], val[23], val[24], 
val[25], val[26], val[27],\n\t\tval[28], val[29], val[30], val[31]);\n}\n\n// AVX2\nconstexpr double B_WEIGHT = 0.114;\nconstexpr double G_WEIGHT = 0.587;\nconstexpr double R_WEIGHT = 0.299;\nconstexpr uint16_t B_WT = static_cast<uint16_t>(32768.0 * B_WEIGHT + 0.5);\nconstexpr uint16_t G_WT = static_cast<uint16_t>(32768.0 * G_WEIGHT + 0.5);\nconstexpr uint16_t R_WT = static_cast<uint16_t>(32768.0 * R_WEIGHT + 0.5);\nstatic const __m256i weight_vec = _mm256_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT);\n\nvoid  _RGB2Y(unsigned char* Src, const int32_t Width, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char* Dest)\n{\n\tfor (int Y = start_row; Y < start_row + thread_stride; Y++)\n\t{\n\t\t//Sleep(1);\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width;\n\t\tint X = 0;\n\t\tfor (; X < Width - 10; X += 10, LinePS += 30)\n\t\t{\n\t\t\t//B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5 G5 R5 B6 \n\t\t\t__m256i temp = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(LinePS + 0)));\n\t\t\t__m256i in1 = _mm256_mulhrs_epi16(temp, weight_vec);\n\n\t\t\t//B6 G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10 R10 B11\n\t\t\ttemp = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(LinePS + 15)));\n\t\t\t__m256i in2 = _mm256_mulhrs_epi16(temp, weight_vec);\n\n\n\t\t\t//0  1  2  3   4  5  6  7  8  9  10 11 12 13 14 15    16 17 18 19 20 21 22 23 24 25 26 27 28   29 30  31       \n\t\t\t//B1 G1 R1 B2 G2 R2 B3 G3  B6 G6 R6 B7 G7 R7 B8 G8    R3 B4 G4 R4 B5 G5 R5 B6 R8 B9 G9 R9 B10 G10 R10 B11\n\t\t\t__m256i mul = _mm256_packus_epi16(in1, in2);\n\n\t\t\t__m256i b1 = _mm256_shuffle_epi8(mul, _mm256_setr_epi8(\n\t\t\t\t//  B1 B2 B3 -1, -1, -1  B7  B8  -1, -1, -1, -1, -1, -1, -1, -1,\n\t\t\t\t0, 3, 6, -1, -1, -1, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1,\n\n\t\t\t\t//  -1, -1, -1, B4 B5 B6 -1, -1  B9 B10 -1, -1, -1, -1, -1, -1\n\t\t\t\t-1, -1, -1, 1, 4, 7, -1, 
-1, 9, 12, -1, -1, -1, -1, -1, -1));\n\n\t\t\t__m256i g1 = _mm256_shuffle_epi8(mul, _mm256_setr_epi8(\n\n\t\t\t\t// G1 G2 G3 -1, -1  G6 G7  G8  -1, -1, -1, -1, -1, -1, -1, -1, \n\t\t\t\t1, 4, 7, -1, -1, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1,\n\n\t\t\t\t//  -1, -1, -1  G4 G5 -1, -1, -1  G9  G10 -1, -1, -1, -1, -1, -1\t\n\t\t\t\t-1, -1, -1, 2, 5, -1, -1, -1, 10, 13, -1, -1, -1, -1, -1, -1));\n\n\t\t\t__m256i r1 = _mm256_shuffle_epi8(mul, _mm256_setr_epi8(\n\n\t\t\t\t//  R1 R2 -1  -1  -1  R6  R7  -1, -1, -1, -1, -1, -1, -1, -1, -1,\t\n\t\t\t\t2, 5, -1, -1, -1, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n\n\t\t\t\t//  -1, -1, R3 R4 R5 -1, -1, R8 R9  R10 -1, -1, -1, -1, -1, -1\n\t\t\t\t-1, -1, 0, 3, 6, -1, -1, 8, 11, 14, -1, -1, -1, -1, -1, -1));\n\n\n\n\t\t\t// B1+G1+R1  B2+G2+R2 B3+G3  0 0 G6+R6  B7+G7+R7 B8+G8 0 0 0 0 0 0 0 0 0 0 R3 B4+G4+R4 B5+G5+R5 B6 0 R8 B9+G9+R9 B10+G10+R10 0 0 0 0 0 0\n\n\t\t\t__m256i accum = _mm256_adds_epu8(r1, _mm256_adds_epu8(b1, g1));\n\n\n\t\t\t// _mm256_castsi256_si128(accum)\n\t\t\t// B1+G1+R1  B2+G2+R2 B3+G3  0 0 G6+R6  B7+G7+R7 B8+G8 0 0 0 0 0 0 0 0\n\n\t\t\t// _mm256_extracti128_si256(accum, 1)\n\t\t\t// 0 0 R3 B4+G4+R4 B5+G5+R5 B6 0 R8 B9+G9+R9 B10+G10+R10 0 0 0 0 0 0\n\n\t\t\t__m128i h3 = _mm_adds_epu8(_mm256_castsi256_si128(accum), _mm256_extracti128_si256(accum, 1));\n\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + X), h3);\n\t\t}\n\t\tfor (; X < Width; X++, LinePS += 3) {\n\t\t\t// Mirror _mm256_mulhrs_epi16 rounding: (((a * b) >> 14) + 1) >> 1.\n\t\t\t// The unparenthesized form \">> 14 + 1\" parses as \">> 15\" and drops the rounding step.\n\t\t\tint tmpB = (((B_WT * LinePS[0]) >> 14) + 1) >> 1;\n\t\t\ttmpB = max(min(255, tmpB), 0);\n\n\t\t\tint tmpG = (((G_WT * LinePS[1]) >> 14) + 1) >> 1;\n\t\t\ttmpG = max(min(255, tmpG), 0);\n\n\t\t\tint tmpR = (((R_WT * LinePS[2]) >> 14) + 1) >> 1;\n\t\t\ttmpR = max(min(255, tmpR), 0);\n\n\t\t\tint tmp = tmpB + tmpG + tmpR;\n\t\t\tLinePD[X] = max(min(255, tmp), 0);\n\t\t}\n\t}\n}\n\n//avx2 \nvoid RGB2Y_6(unsigned char *Src, unsigned char *Dest, int width, int height, int stride)\n{\n\t_RGB2Y(Src, width, 0, height, stride, Dest);\n}\n\n//avx2 + std::async asynchronous programming\nvoid RGB2Y_7(unsigned 
char *Src, unsigned char *Dest, int width, int height, int stride) {\n\tconst int32_t hw_concur = std::min(height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency()));\n\tstd::vector<std::future<void>> fut(hw_concur);\n\tconst int thread_stride = (height - 1) / hw_concur + 1;\n\tint i = 0, start = 0;\n\tfor (; i < std::min(height, hw_concur); i++, start += thread_stride)\n\t{\n\t\tfut[i] = std::async(std::launch::async, _RGB2Y, Src, width, start, thread_stride, stride, Dest);\n\t}\n\tfor (int j = 0; j < i; ++j)\n\t\tfut[j].wait();\n}\n\nint main() {\n\tMat src = imread(\"F:\\\\car.jpg\");\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned char *Dest = new unsigned char[Height * Width];\n\tint Stride = Width * 3;\n\tint Radius = 11;\n\tint64 st = cvGetTickCount();\n\tfor (int i = 0; i < 100; i++) {\n\t\tRGB2Y_3(Src, Dest, Width, Height, Stride);\n\t}\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 10;\n\tprintf(\"%.5f\\n\", duration);\n\tRGB2Y_5(Src, Dest, Width, Height, Stride);\n\tMat dst(Height, Width, CV_8UC1, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n\treturn 0;\n}"
  },
  {
    "path": "speed_rgb2yuv_sse.cpp",
    "content": "#include \"stdafx.h\"\n#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include <future>\nusing namespace std;\nusing namespace cv;\n\ninline unsigned char ClampToByte(int Value) {\n\tif (Value < 0)\n\t\treturn 0;\n\telse if (Value > 255)\n\t\treturn 255;\n\telse\n\t\treturn (unsigned char)Value;\n\t//return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));\n}\n\n\nvoid RGB2YUV(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) {\n\tfor (int YY = 0; YY < Height; YY++) {\n\t\tunsigned char *LinePS = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Width; XX++, LinePS += 3)\n\t\t{\n\t\t\tint Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tLinePY[XX] = int(0.299*Red + 0.587*Green + 0.114*Blue);\n\t\t\tLinePU[XX] = int(-0.147*Red - 0.289*Green + 0.436*Blue);\n\t\t\tLinePV[XX] = int(0.615*Red - 0.515*Green - 0.100*Blue);\n\t\t}\n\t}\n}\n\nvoid YUV2RGB(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) {\n\tfor (int YY = 0; YY < Height; YY++)\n\t{\n\t\tunsigned char *LinePD = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Width; XX++, LinePD += 3)\n\t\t{\n\t\t\tint YV = LinePY[XX], UV = LinePU[XX], VV = LinePV[XX];\n\t\t\tLinePD[0] = int(YV + 2.03 * UV);\n\t\t\tLinePD[1] = int(YV - 0.39 * UV - 0.58 * VV);\n\t\t\tLinePD[2] = int(YV + 1.14 * VV);\n\t\t}\n\t}\n}\n\nvoid RGB2YUV_1(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride)\n{\n\tconst int Shift = 8;\n\tconst int HalfV = 1 << (Shift - 1);\n\tconst int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << 
Shift) - Y_B_WT - Y_G_WT;\n\tconst int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT);\n\tconst int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT);\n\tfor (int YY = 0; YY < Height; YY++)\n\t{\n\t\tunsigned char *LinePS = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Width; XX++, LinePS += 3)\n\t\t{\n\t\t\tint Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tLinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift;\n\t\t\tLinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128;\n\t\t\tLinePV[XX] = ((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128;\n\t\t}\n\t}\n}\n\nvoid YUV2RGB_1(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride)\n{\n\tconst int Shift = 8;\n\tconst int HalfV = 1 << (Shift - 1);\n\tconst int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;\n\tconst int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);\n\tconst int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);\n\tfor (int YY = 0; YY < Height; YY++)\n\t{\n\t\tunsigned char *LinePD = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Width; XX++, LinePD += 3)\n\t\t{\n\t\t\tint YV = LinePY[XX], UV = LinePU[XX] - 128, VV = LinePV[XX] - 128;\n\t\t\tLinePD[0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));\n\t\t\tLinePD[1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));\n\t\t\tLinePD[2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));\n\t\t}\n\t}\n}\n\nvoid RGB2YUV_OpenMP(unsigned char *RGB, unsigned 
char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride)\n{\n\tconst int Shift = 8;\n\tconst int HalfV = 1 << (Shift - 1);\n\tconst int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT;\n\tconst int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT);\n\tconst int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT);\n\tfor (int YY = 0; YY < Height; YY++)\n\t{\n\t\tunsigned char *LinePS = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n#pragma omp parallel for num_threads(4)\n\t\tfor (int XX = 0; XX < Width; XX++)\n\t\t{\n\t\t\tint Blue = LinePS[XX*3 + 0], Green = LinePS[XX*3 + 1], Red = LinePS[XX*3 + 2];\n\t\t\tLinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift;\n\t\t\tLinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128;\n\t\t\tLinePV[XX] = ((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128;\n\t\t}\n\t}\n}\n\nvoid YUV2RGB_OpenMP(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride)\n{\n\tconst int Shift = 8;\n\tconst int HalfV = 1 << (Shift - 1);\n\tconst int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;\n\tconst int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);\n\tconst int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);\n\tfor (int YY = 0; YY < Height; YY++)\n\t{\n\t\tunsigned char *LinePD = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n#pragma omp parallel for num_threads(4)\n\t\tfor (int XX = 0; XX < Width; XX++)\n\t\t{\n\t\t\tint YV = LinePY[XX], UV = LinePU[XX] - 
128, VV = LinePV[XX] - 128;\n\t\t\tLinePD[XX*3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));\n\t\t\tLinePD[XX*3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));\n\t\t\tLinePD[XX*3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));\n\t\t}\n\t}\n}\n\nvoid RGB2YUVSSE_2(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) {\n\tconst int Shift = 13;\n\tconst int HalfV = 1 << (Shift - 1);\n\tconst int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT;\n\tconst int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT);\n\tconst int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT);\n\t__m128i Weight_YB = _mm_set1_epi32(Y_B_WT), Weight_YG = _mm_set1_epi32(Y_G_WT), Weight_YR = _mm_set1_epi32(Y_R_WT);\n\t__m128i Weight_UB = _mm_set1_epi32(U_B_WT), Weight_UG = _mm_set1_epi32(U_G_WT), Weight_UR = _mm_set1_epi32(U_R_WT);\n\t__m128i Weight_VB = _mm_set1_epi32(V_B_WT), Weight_VG = _mm_set1_epi32(V_G_WT), Weight_VR = _mm_set1_epi32(V_R_WT);\n\t__m128i C128 = _mm_set1_epi32(128);\n\t__m128i Half = _mm_set1_epi32(HalfV);\n\t__m128i Zero = _mm_setzero_si128();\n\tconst int BlockSize = 16, Block = Width / BlockSize;\n\tfor (int YY = 0; YY < Height; YY++) {\n\t\tunsigned char *LinePS = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3) {\n\t\t\t__m128i Src1, Src2, Src3, Blue, Green, Red;\n\n\t\t\tSrc1 = _mm_loadu_si128((__m128i *)(LinePS + 0));\n\t\t\tSrc2 = _mm_loadu_si128((__m128i *)(LinePS + 16));\n\t\t\tSrc3 = _mm_loadu_si128((__m128i *)(LinePS + 32));\n\n\t\t\t// 以下操作把16个连续像素的像素顺序由 B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R B G R B 
G R B G R B G R B G R \n\t\t\t// 更改为适合于SIMD指令处理的连续序列 B B B B B B B B B B B B B B B B G G G G G G G G G G G G G G G G R R R R R R R R R R R R R R R R  \n\n\t\t\tBlue = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tBlue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1)));\n\t\t\tBlue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13)));\n\n\t\t\tGreen = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tGreen = _mm_or_si128(Green, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1)));\n\t\t\tGreen = _mm_or_si128(Green, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14)));\n\n\t\t\tRed = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tRed = _mm_or_si128(Red, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1)));\n\t\t\tRed = _mm_or_si128(Red, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15)));\n\n\t\t\t// 以下操作将三个SSE变量里的字节数据分别提取到12个包含4个int类型的数据的SSE变量里，以便后续的乘积操作不溢出\n\n\t\t\t__m128i Blue16L = _mm_unpacklo_epi8(Blue, Zero);\n\t\t\t__m128i Blue16H = _mm_unpackhi_epi8(Blue, Zero);\n\t\t\t__m128i Blue32LL = _mm_unpacklo_epi16(Blue16L, Zero);\n\t\t\t__m128i Blue32LH = _mm_unpackhi_epi16(Blue16L, Zero);\n\t\t\t__m128i Blue32HL = _mm_unpacklo_epi16(Blue16H, Zero);\n\t\t\t__m128i Blue32HH = _mm_unpackhi_epi16(Blue16H, Zero);\n\n\t\t\t__m128i Green16L = _mm_unpacklo_epi8(Green, Zero);\n\t\t\t__m128i Green16H = _mm_unpackhi_epi8(Green, Zero);\n\t\t\t__m128i Green32LL = _mm_unpacklo_epi16(Green16L, Zero);\n\t\t\t__m128i Green32LH = _mm_unpackhi_epi16(Green16L, Zero);\n\t\t\t__m128i Green32HL = 
_mm_unpacklo_epi16(Green16H, Zero);\n\t\t\t__m128i Green32HH = _mm_unpackhi_epi16(Green16H, Zero);\n\n\t\t\t__m128i Red16L = _mm_unpacklo_epi8(Red, Zero);\n\t\t\t__m128i Red16H = _mm_unpackhi_epi8(Red, Zero);\n\t\t\t__m128i Red32LL = _mm_unpacklo_epi16(Red16L, Zero);\n\t\t\t__m128i Red32LH = _mm_unpackhi_epi16(Red16L, Zero);\n\t\t\t__m128i Red32HL = _mm_unpacklo_epi16(Red16H, Zero);\n\t\t\t__m128i Red32HH = _mm_unpackhi_epi16(Red16H, Zero);\n\n\t\t\t// The following computes: Y[0 - 15] = (Y_B_WT * Blue[0 - 15]+ Y_G_WT * Green[0 - 15] + Y_R_WT * Red[0 - 15] + HalfV) >> Shift;   \n\t\t\t__m128i LL_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_YG), _mm_mullo_epi32(Red32LL, Weight_YR))), Half), Shift);\n\t\t\t__m128i LH_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_YG), _mm_mullo_epi32(Red32LH, Weight_YR))), Half), Shift);\n\t\t\t__m128i HL_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_YG), _mm_mullo_epi32(Red32HL, Weight_YR))), Half), Shift);\n\t\t\t__m128i HH_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_YG), _mm_mullo_epi32(Red32HH, Weight_YR))), Half), Shift);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePY + XX), _mm_packus_epi16(_mm_packus_epi32(LL_Y, LH_Y), _mm_packus_epi32(HL_Y, HH_Y)));    //    repack the four SSE registers of 4 ints each into one SSE register holding 16 bytes\n\n\t\t\t// The following computes: U[0 - 15] = ((U_B_WT * Blue[0 - 15]+ U_G_WT * Green[0 - 15] + U_R_WT * Red[0 - 15] + HalfV) >> Shift) + 128;\n\t\t\t__m128i LL_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_UG), _mm_mullo_epi32(Red32LL, Weight_UR))), Half), Shift), C128);\n\t\t\t__m128i LH_U = 
_mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_UG), _mm_mullo_epi32(Red32LH, Weight_UR))), Half), Shift), C128);\n\t\t\t__m128i HL_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_UG), _mm_mullo_epi32(Red32HL, Weight_UR))), Half), Shift), C128);\n\t\t\t__m128i HH_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_UG), _mm_mullo_epi32(Red32HH, Weight_UR))), Half), Shift), C128);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(LL_U, LH_U), _mm_packus_epi32(HL_U, HH_U)));\n\n\t\t\t// The following computes: V[0 - 15] = ((V_B_WT * Blue[0 - 15]+ V_G_WT * Green[0 - 15] + V_R_WT * Red[0 - 15] + HalfV) >> Shift) + 128; \n\t\t\t__m128i LL_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_VG), _mm_mullo_epi32(Red32LL, Weight_VR))), Half), Shift), C128);\n\t\t\t__m128i LH_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_VG), _mm_mullo_epi32(Red32LH, Weight_VR))), Half), Shift), C128);\n\t\t\t__m128i HL_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_VG), _mm_mullo_epi32(Red32HL, Weight_VR))), Half), Shift), C128);\n\t\t\t__m128i HH_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_VG), _mm_mullo_epi32(Red32HH, Weight_VR))), Half), Shift), C128);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(LL_V, LH_V), _mm_packus_epi32(HL_V, HH_V)));\n\t\t}\n\t\tfor 
(int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3) {\n\t\t\tint Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tLinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift;\n\t\t\tLinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128;\n\t\t\tLinePV[XX] = ((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128;\n\t\t}\n\t}\n}\n\nvoid YUV2RGBSSE_2(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) {\n\tconst int Shift = 13;\n\tconst int HalfV = 1 << (Shift - 1);\n\tconst int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;\n\tconst int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);\n\tconst int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);\n\t__m128i Weight_B_Y = _mm_set1_epi32(B_Y_WT), Weight_B_U = _mm_set1_epi32(B_U_WT), Weight_B_V = _mm_set1_epi32(B_V_WT);\n\t__m128i Weight_G_Y = _mm_set1_epi32(G_Y_WT), Weight_G_U = _mm_set1_epi32(G_U_WT), Weight_G_V = _mm_set1_epi32(G_V_WT);\n\t__m128i Weight_R_Y = _mm_set1_epi32(R_Y_WT), Weight_R_U = _mm_set1_epi32(R_U_WT), Weight_R_V = _mm_set1_epi32(R_V_WT);\n\t__m128i Half = _mm_set1_epi32(HalfV);\n\t__m128i C128 = _mm_set1_epi32(128);\n\t__m128i Zero = _mm_setzero_si128();\n\n\tconst int BlockSize = 16, Block = Width / BlockSize;\n\tfor (int YY = 0; YY < Height; YY++) {\n\t\tunsigned char *LinePD = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePY += BlockSize, LinePU += BlockSize, LinePV += BlockSize) {\n\t\t\t__m128i Blue, Green, Red, YV, UV, VV, Dest1, Dest2, Dest3;\n\t\t\tYV = _mm_loadu_si128((__m128i *)(LinePY + 0));\n\t\t\tUV = _mm_loadu_si128((__m128i *)(LinePU + 0));\n\t\t\tVV = _mm_loadu_si128((__m128i *)(LinePV + 
0));\n\t\t\t//UV = _mm_sub_epi32(UV, C128);\n\t\t\t//VV = _mm_sub_epi32(VV, C128);\n\n\t\t\t__m128i YV16L = _mm_unpacklo_epi8(YV, Zero);\n\t\t\t__m128i YV16H = _mm_unpackhi_epi8(YV, Zero);\n\t\t\t__m128i YV32LL = _mm_unpacklo_epi16(YV16L, Zero);\n\t\t\t__m128i YV32LH = _mm_unpackhi_epi16(YV16L, Zero);\n\t\t\t__m128i YV32HL = _mm_unpacklo_epi16(YV16H, Zero);\n\t\t\t__m128i YV32HH = _mm_unpackhi_epi16(YV16H, Zero);\n\n\n\t\t\t__m128i UV16L = _mm_unpacklo_epi8(UV, Zero);\n\t\t\t__m128i UV16H = _mm_unpackhi_epi8(UV, Zero);\n\t\t\t__m128i UV32LL = _mm_unpacklo_epi16(UV16L, Zero);\n\t\t\t__m128i UV32LH = _mm_unpackhi_epi16(UV16L, Zero);\n\t\t\t__m128i UV32HL = _mm_unpacklo_epi16(UV16H, Zero);\n\t\t\t__m128i UV32HH = _mm_unpackhi_epi16(UV16H, Zero);\n\t\t\tUV32LL = _mm_sub_epi32(UV32LL, C128);\n\t\t\tUV32LH = _mm_sub_epi32(UV32LH, C128);\n\t\t\tUV32HL = _mm_sub_epi32(UV32HL, C128);\n\t\t\tUV32HH = _mm_sub_epi32(UV32HH, C128);\n\n\t\t\t__m128i VV16L = _mm_unpacklo_epi8(VV, Zero);\n\t\t\t__m128i VV16H = _mm_unpackhi_epi8(VV, Zero);\n\t\t\t__m128i VV32LL = _mm_unpacklo_epi16(VV16L, Zero);\n\t\t\t__m128i VV32LH = _mm_unpackhi_epi16(VV16L, Zero);\n\t\t\t__m128i VV32HL = _mm_unpacklo_epi16(VV16H, Zero);\n\t\t\t__m128i VV32HH = _mm_unpackhi_epi16(VV16H, Zero);\n\t\t\tVV32LL = _mm_sub_epi32(VV32LL, C128);\n\t\t\tVV32LH = _mm_sub_epi32(VV32LH, C128);\n\t\t\tVV32HL = _mm_sub_epi32(VV32HL, C128);\n\t\t\tVV32HH = _mm_sub_epi32(VV32HH, C128);\n\n\t\t\t__m128i LL_B = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LL, Weight_B_U)), Shift));\n\t\t\t__m128i LH_B = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LH, Weight_B_U)), Shift));\n\t\t\t__m128i HL_B = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HL, Weight_B_U)), Shift));\n\t\t\t__m128i HH_B = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HH, Weight_B_U)), Shift));\n\t\t\tBlue = 
_mm_packus_epi16(_mm_packus_epi32(LL_B, LH_B), _mm_packus_epi32(HL_B, HH_B));\n\n\t\t\t__m128i LL_G = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LL), _mm_mullo_epi32(Weight_G_V, VV32LL))), Shift));\n\t\t\t__m128i LH_G = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LH), _mm_mullo_epi32(Weight_G_V, VV32LH))), Shift));\n\t\t\t__m128i HL_G = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HL), _mm_mullo_epi32(Weight_G_V, VV32HL))), Shift));\n\t\t\t__m128i HH_G = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HH), _mm_mullo_epi32(Weight_G_V, VV32HH))), Shift));\n\t\t\tGreen = _mm_packus_epi16(_mm_packus_epi32(LL_G, LH_G), _mm_packus_epi32(HL_G, HH_G));\n\n\t\t\t__m128i LL_R = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LL, Weight_R_V)), Shift));\n\t\t\t__m128i LH_R = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LH, Weight_R_V)), Shift));\n\t\t\t__m128i HL_R = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HL, Weight_R_V)), Shift));\n\t\t\t__m128i HH_R = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HH, Weight_R_V)), Shift));\n\t\t\tRed = _mm_packus_epi16(_mm_packus_epi32(LL_R, LH_R), _mm_packus_epi32(HL_R, HH_R));\n\n\t\t\tDest1 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5));\n\t\t\tDest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1)));\n\t\t\tDest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1)));\n\n\t\t\tDest2 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1));\n\t\t\tDest2 = 
_mm_or_si128(Dest2, _mm_shuffle_epi8(Green, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10)));\n\t\t\tDest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1)));\n\n\t\t\tDest3 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1));\n\t\t\tDest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1)));\n\t\t\tDest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15)));\n\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3), Dest1);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize), Dest2);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize * 2), Dest3);\n\t\t}\n\t\t// Scalar tail: LinePY/LinePU/LinePV were already advanced by the SIMD loop, so read them at [0]; LinePD was not advanced, so index it by pixel (3 bytes each).\n\t\tfor (int XX = Block * BlockSize; XX < Width; XX++, LinePU++, LinePV++, LinePY++) {\n\t\t\tint YV = LinePY[0], UV = LinePU[0] - 128, VV = LinePV[0] - 128;\n\t\t\tLinePD[XX * 3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));\n\t\t\tLinePD[XX * 3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));\n\t\t\tLinePD[XX * 3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));\n\t\t}\n\t}\n}\n\nvoid RGB2YUVSSE_3(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride)\n{\n\tconst int Shift = 13;                            //    No coefficient here exceeds 1 in absolute value, so the scale factor could be as large as 2^15.\n\tconst int HalfV = 1 << (Shift - 1);\n\n\tconst int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT, Y_C_WT = 1;\n\tconst int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT), U_C_WT = 257;\n\tconst int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT), 
V_C_WT = 257;\n\n\t__m128i Weight_YBG = _mm_setr_epi16(Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT);\n\t__m128i Weight_YRC = _mm_setr_epi16(Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT);\n\t__m128i Weight_UBG = _mm_setr_epi16(U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT);\n\t__m128i Weight_URC = _mm_setr_epi16(U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT);\n\t__m128i Weight_VBG = _mm_setr_epi16(V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT);\n\t__m128i Weight_VRC = _mm_setr_epi16(V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT);\n\t__m128i Half = _mm_setr_epi16(0, HalfV, 0, HalfV, 0, HalfV, 0, HalfV);\n\t__m128i Zero = _mm_setzero_si128();\n\n\tint BlockSize = 16, Block = Width / BlockSize;\n\tfor (int YY = 0; YY < Height; YY++)\n\t{\n\t\tunsigned char *LinePS = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3)\n\t\t{\n\t\t\t__m128i Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0));\n\t\t\t__m128i Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16));\n\t\t\t__m128i Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32));\n\t\t\t// Src1 : B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5 G5 R5 B6 \n\t\t\t// Src2 : G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10 R10 B11 G11 \n\t\t\t// Src3 : R11 B12 G12 R12 B13 G13 R13 B14 G14 R14 B15 G15 R15 B16 G16 R16\n\n\t\t\t// BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 0 0 0 0 0 \n\t\t\t__m128i BGL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, -1, -1, -1, -1, -1));\n\n\t\t\t// BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 G6 B7 G7 B8 G8\n\t\t\tBGL = _mm_or_si128(BGL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 2, 3, 5, 6)));\n\n\t\t\t// BGH : B9 G9 B10 G10 B11 G11 0 0 0 0 0 0 0 0 0 0\n\t\t\t__m128i BGH = 
_mm_shuffle_epi8(Src2, _mm_setr_epi8(8, 9, 11, 12, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\n\t\t\t// BGH : B9 G9 B10 G10 B11 G11 B12 G12 B13 G13 B14 G14 B15 G15 B16 G16\n\t\t\tBGH = _mm_or_si128(BGH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 1, 2, 4, 5, 7, 8, 10, 11, 13, 14)));\n\n\t\t\t// RCL : R1 0 R2 0 R3 0 R4 0 R5 0 0 0 0 0 0 0 \n\t\t\t__m128i RCL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, -1, 5, -1, 8, -1, 11, -1, 14, -1, -1, -1, -1, -1, -1, -1));\n\n\t\t\t// RCL : R1 0 R2 0 R3 0 R4 0 R5 0 R6 0 R7 0 R8 0 \n\t\t\tRCL = _mm_or_si128(RCL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, 4, -1, 7, -1)));\n\n\t\t\t// RCH : R9 0 R10 0 0 0 0 0 0 0 0 0 0 0 0 0\n\t\t\t__m128i RCH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(10, -1, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\n\t\t\t// RCH : R9 0 R10 0 R11 0 R12 0 R13 0 R14 0 R15 0 R16 0\n\t\t\tRCH = _mm_or_si128(RCH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, 0, -1, 3, -1, 6, -1, 9, -1, 12, -1, 15, -1)));\n\n\t\t\t// BGLL : B1 0 G1 0 B2 0 G2 0 B3 0 G3 0 B4 0 G4 0\n\t\t\t__m128i BGLL = _mm_unpacklo_epi8(BGL, Zero);\n\n\t\t\t// BGLH : B5 0 G5 0 B6 0 G6 0 B7 0 G7 0 B8 0 G8 0\n\t\t\t__m128i BGLH = _mm_unpackhi_epi8(BGL, Zero);\n\n\t\t\t// RCLL : R1 Half Half Half R2 Half Half Half R3 Half Half Half R4 Half Half Half\n\t\t\t__m128i RCLL = _mm_or_si128(_mm_unpacklo_epi8(RCL, Zero), Half);\n\n\t\t\t// RCLH : R5 Half Half Half R6 Half Half Half R7 Half Half Half R8 Half Half Half\n\t\t\t__m128i RCLH = _mm_or_si128(_mm_unpackhi_epi8(RCL, Zero), Half);\n\n\t\t\t// BGHL : B9 0 G9 0 B10 0 G10 0 B11 0 G11 0 B12 0 G12 0 \n\t\t\t__m128i BGHL = _mm_unpacklo_epi8(BGH, Zero);\n\n\t\t\t// BGHH : B13 0 G13 0 B14 0 G14 0 B15 0 G15 0 B16 0 G16 0\n\t\t\t__m128i BGHH = _mm_unpackhi_epi8(BGH, Zero);\n\n\t\t\t// RCHL : R9 Half Half Half R10 Half Half Half R11 Half Half Half R12 Half Half Half\n\t\t\t__m128i RCHL = _mm_or_si128(_mm_unpacklo_epi8(RCH, Zero), 
Half);\n\n\t\t\t// RCHH : R13 Half Half Half R14 Half Half Half R15 Half Half Half R16 Half Half Half\n\t\t\t__m128i RCHH = _mm_or_si128(_mm_unpackhi_epi8(RCH, Zero), Half);\n\n\t\t\t//\n\t\t\t__m128i Y_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_YBG), _mm_madd_epi16(RCLL, Weight_YRC)), Shift);\n\t\t\t__m128i Y_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_YBG), _mm_madd_epi16(RCLH, Weight_YRC)), Shift);\n\t\t\t__m128i Y_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_YBG), _mm_madd_epi16(RCHL, Weight_YRC)), Shift);\n\t\t\t__m128i Y_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_YBG), _mm_madd_epi16(RCHH, Weight_YRC)), Shift);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePY + XX), _mm_packus_epi16(_mm_packus_epi32(Y_LL, Y_LH), _mm_packus_epi32(Y_HL, Y_HH)));\n\n\t\t\t__m128i U_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_UBG), _mm_madd_epi16(RCLL, Weight_URC)), Shift);\n\t\t\t__m128i U_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_UBG), _mm_madd_epi16(RCLH, Weight_URC)), Shift);\n\t\t\t__m128i U_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_UBG), _mm_madd_epi16(RCHL, Weight_URC)), Shift);\n\t\t\t__m128i U_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_UBG), _mm_madd_epi16(RCHH, Weight_URC)), Shift);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(U_LL, U_LH), _mm_packus_epi32(U_HL, U_HH)));\n\n\t\t\t__m128i V_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_VBG), _mm_madd_epi16(RCLL, Weight_VRC)), Shift);\n\t\t\t__m128i V_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_VBG), _mm_madd_epi16(RCLH, Weight_VRC)), Shift);\n\t\t\t__m128i V_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_VBG), _mm_madd_epi16(RCHL, Weight_VRC)), Shift);\n\t\t\t__m128i V_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_VBG), _mm_madd_epi16(RCHH, Weight_VRC)), 
Shift);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(V_LL, V_LH), _mm_packus_epi32(V_HL, V_HH)));\n\n\t\t}\n\t\tfor (int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3) {\n\t\t\tint Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tLinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + Y_C_WT * HalfV) >> Shift;\n\t\t\tLinePU[XX] = (U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + U_C_WT * HalfV) >> Shift;\n\t\t\tLinePV[XX] = (V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + V_C_WT * HalfV) >> Shift;\n\t\t}\n\t}\n}\n\nvoid YUV2RGBSSE_3(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) {\n\tconst int Shift = 13;\n\tconst int HalfV = 1 << (Shift - 1);\n\tconst int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;\n\tconst int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);\n\tconst int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);\n\t__m128i Weight_B_Y = _mm_set1_epi32(B_Y_WT), Weight_B_U = _mm_set1_epi32(B_U_WT), Weight_B_V = _mm_set1_epi32(B_V_WT);\n\t__m128i Weight_G_Y = _mm_set1_epi32(G_Y_WT), Weight_G_U = _mm_set1_epi32(G_U_WT), Weight_G_V = _mm_set1_epi32(G_V_WT);\n\t__m128i Weight_R_Y = _mm_set1_epi32(R_Y_WT), Weight_R_U = _mm_set1_epi32(R_U_WT), Weight_R_V = _mm_set1_epi32(R_V_WT);\n\t__m128i Half = _mm_set1_epi32(HalfV);\n\t__m128i C128 = _mm_set1_epi32(128);\n\t__m128i Zero = _mm_setzero_si128();\n\n\tconst int BlockSize = 16, Block = Width / BlockSize;\n\tfor (int YY = 0; YY < Height; YY++) {\n\t\tunsigned char *LinePD = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePY += BlockSize, LinePU += BlockSize, LinePV += BlockSize) {\n\t\t\t__m128i Blue, Green, Red, YV, UV, VV, Dest1, Dest2, 
Dest3;\n\t\t\tYV = _mm_loadu_si128((__m128i *)(LinePY + 0));\n\t\t\tUV = _mm_loadu_si128((__m128i *)(LinePU + 0));\n\t\t\tVV = _mm_loadu_si128((__m128i *)(LinePV + 0));\n\n\t\t\t__m128i YV16L = _mm_unpacklo_epi8(YV, Zero);\n\t\t\t__m128i YV16H = _mm_unpackhi_epi8(YV, Zero);\n\t\t\t__m128i YV32LL = _mm_unpacklo_epi16(YV16L, Zero);\n\t\t\t__m128i YV32LH = _mm_unpackhi_epi16(YV16L, Zero);\n\t\t\t__m128i YV32HL = _mm_unpacklo_epi16(YV16H, Zero);\n\t\t\t__m128i YV32HH = _mm_unpackhi_epi16(YV16H, Zero);\n\n\n\t\t\t__m128i UV16L = _mm_unpacklo_epi8(UV, Zero);\n\t\t\t__m128i UV16H = _mm_unpackhi_epi8(UV, Zero);\n\t\t\t__m128i UV32LL = _mm_unpacklo_epi16(UV16L, Zero);\n\t\t\t__m128i UV32LH = _mm_unpackhi_epi16(UV16L, Zero);\n\t\t\t__m128i UV32HL = _mm_unpacklo_epi16(UV16H, Zero);\n\t\t\t__m128i UV32HH = _mm_unpackhi_epi16(UV16H, Zero);\n\t\t\tUV32LL = _mm_sub_epi32(UV32LL, C128);\n\t\t\tUV32LH = _mm_sub_epi32(UV32LH, C128);\n\t\t\tUV32HL = _mm_sub_epi32(UV32HL, C128);\n\t\t\tUV32HH = _mm_sub_epi32(UV32HH, C128);\n\n\t\t\t__m128i VV16L = _mm_unpacklo_epi8(VV, Zero);\n\t\t\t__m128i VV16H = _mm_unpackhi_epi8(VV, Zero);\n\t\t\t__m128i VV32LL = _mm_unpacklo_epi16(VV16L, Zero);\n\t\t\t__m128i VV32LH = _mm_unpackhi_epi16(VV16L, Zero);\n\t\t\t__m128i VV32HL = _mm_unpacklo_epi16(VV16H, Zero);\n\t\t\t__m128i VV32HH = _mm_unpackhi_epi16(VV16H, Zero);\n\t\t\tVV32LL = _mm_sub_epi32(VV32LL, C128);\n\t\t\tVV32LH = _mm_sub_epi32(VV32LH, C128);\n\t\t\tVV32HL = _mm_sub_epi32(VV32HL, C128);\n\t\t\tVV32HH = _mm_sub_epi32(VV32HH, C128);\n\n\t\t\t__m128i LL_B = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LL, Weight_B_U)), Shift));\n\t\t\t__m128i LH_B = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LH, Weight_B_U)), Shift));\n\t\t\t__m128i HL_B = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HL, Weight_B_U)), Shift));\n\t\t\t__m128i HH_B = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, 
_mm_mullo_epi32(UV32HH, Weight_B_U)), Shift));\n\t\t\tBlue = _mm_packus_epi16(_mm_packus_epi32(LL_B, LH_B), _mm_packus_epi32(HL_B, HH_B));\n\n\t\t\t__m128i LL_G = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LL), _mm_mullo_epi32(Weight_G_V, VV32LL))), Shift));\n\t\t\t__m128i LH_G = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LH), _mm_mullo_epi32(Weight_G_V, VV32LH))), Shift));\n\t\t\t__m128i HL_G = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HL), _mm_mullo_epi32(Weight_G_V, VV32HL))), Shift));\n\t\t\t__m128i HH_G = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HH), _mm_mullo_epi32(Weight_G_V, VV32HH))), Shift));\n\t\t\tGreen = _mm_packus_epi16(_mm_packus_epi32(LL_G, LH_G), _mm_packus_epi32(HL_G, HH_G));\n\n\t\t\t__m128i LL_R = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LL, Weight_R_V)), Shift));\n\t\t\t__m128i LH_R = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LH, Weight_R_V)), Shift));\n\t\t\t__m128i HL_R = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HL, Weight_R_V)), Shift));\n\t\t\t__m128i HH_R = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HH, Weight_R_V)), Shift));\n\t\t\tRed = _mm_packus_epi16(_mm_packus_epi32(LL_R, LH_R), _mm_packus_epi32(HL_R, HH_R));\n\n\t\t\tDest1 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5));\n\t\t\tDest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1)));\n\t\t\tDest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1)));\n\n\t\t\tDest2 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, -1, 6, -1, -1, 
7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1));\n\t\t\tDest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10)));\n\t\t\tDest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1)));\n\n\t\t\tDest3 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1));\n\t\t\tDest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1)));\n\t\t\tDest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15)));\n\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3), Dest1);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize), Dest2);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize * 2), Dest3);\n\t\t}\n\t\t// Scalar tail: LinePY/LinePU/LinePV were already advanced by the SIMD loop, so read them at [0]; LinePD was not advanced, so index it by pixel (3 bytes each).\n\t\tfor (int XX = Block * BlockSize; XX < Width; XX++, LinePU++, LinePV++, LinePY++) {\n\t\t\tint YV = LinePY[0], UV = LinePU[0] - 128, VV = LinePV[0] - 128;\n\t\t\tLinePD[XX * 3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));\n\t\t\tLinePD[XX * 3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));\n\t\t\tLinePD[XX * 3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));\n\t\t}\n\t}\n}\n\n\nconst int Shift = 13;                            //    No coefficient here exceeds 1 in absolute value, so the scale factor could be as large as 2^15.\nconst int HalfV = 1 << (Shift - 1);\n\nconst int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT, Y_C_WT = 1;\nconst int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT), U_C_WT = 257;\nconst int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT), V_C_WT = 257;\n\n__m128i Weight_YBG = _mm_setr_epi16(Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, 
Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT);\n__m128i Weight_YRC = _mm_setr_epi16(Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT);\n__m128i Weight_UBG = _mm_setr_epi16(U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT);\n__m128i Weight_URC = _mm_setr_epi16(U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT);\n__m128i Weight_VBG = _mm_setr_epi16(V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT);\n__m128i Weight_VRC = _mm_setr_epi16(V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT);\n__m128i Half1 = _mm_setr_epi16(0, HalfV, 0, HalfV, 0, HalfV, 0, HalfV);\n__m128i Zero = _mm_setzero_si128();\n\nconst int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;\nconst int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);\nconst int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);\n__m128i Weight_B_Y = _mm_set1_epi32(B_Y_WT), Weight_B_U = _mm_set1_epi32(B_U_WT), Weight_B_V = _mm_set1_epi32(B_V_WT);\n__m128i Weight_G_Y = _mm_set1_epi32(G_Y_WT), Weight_G_U = _mm_set1_epi32(G_U_WT), Weight_G_V = _mm_set1_epi32(G_V_WT);\n__m128i Weight_R_Y = _mm_set1_epi32(R_Y_WT), Weight_R_U = _mm_set1_epi32(R_U_WT), Weight_R_V = _mm_set1_epi32(R_V_WT);\n__m128i Half2 = _mm_set1_epi32(HalfV);\n__m128i C128 = _mm_set1_epi32(128);\nint BlockSize, Block;\n\nvoid _RGB2YUV(unsigned char *RGB, const int32_t Width, const int32_t Height, const int32_t start_row, const int32_t thread_stride, const int32_t Stride,  unsigned char *Y, unsigned char *U, unsigned char *V)\n{\n\n\tfor (int YY = start_row; YY < start_row + thread_stride; YY++)\n\t{\n\t\tunsigned char *LinePS = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3)\n\t\t{\n\t\t\t__m128i Src1 = _mm_loadu_si128((__m128i *)(LinePS + 
0));\n\t\t\t__m128i Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16));\n\t\t\t__m128i Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32));\n\t\t\t// Src1 : B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5 G5 R5 B6 \n\t\t\t// Src2 : G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10 R10 B11 G11 \n\t\t\t// Src3 : R11 B12 G12 R12 B13 G13 R13 B14 G14 R14 B15 G15 R15 B16 G16 R16\n\n\t\t\t// BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 0 0 0 0 0 \n\t\t\t__m128i BGL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, -1, -1, -1, -1, -1));\n\n\t\t\t// BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 G6 B7 G7 B8 G8\n\t\t\tBGL = _mm_or_si128(BGL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 2, 3, 5, 6)));\n\n\t\t\t// BGH : B9 G9 B10 G10 B11 G11 0 0 0 0 0 0 0 0 0 0\n\t\t\t__m128i BGH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(8, 9, 11, 12, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\n\t\t\t// BGH : B9 G9 B10 G10 B11 G11 B12 G12 B13 G13 B14 G14 B15 G15 B16 G16\n\t\t\tBGH = _mm_or_si128(BGH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 1, 2, 4, 5, 7, 8, 10, 11, 13, 14)));\n\n\t\t\t// RCL : R1 0 R2 0 R3 0 R4 0 R5 0 0 0 0 0 0 0 \n\t\t\t__m128i RCL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, -1, 5, -1, 8, -1, 11, -1, 14, -1, -1, -1, -1, -1, -1, -1));\n\n\t\t\t// RCL : R1 0 R2 0 R3 0 R4 0 R5 0 R6 0 R7 0 R8 0 \n\t\t\tRCL = _mm_or_si128(RCL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, 4, -1, 7, -1)));\n\n\t\t\t// RCH : R9 0 R10 0 0 0 0 0 0 0 0 0 0 0 0 0\n\t\t\t__m128i RCH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(10, -1, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\n\t\t\t// RCH : R9 0 R10 0 R11 0 R12 0 R13 0 R14 0 R15 0 R16 0\n\t\t\tRCH = _mm_or_si128(RCH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, 0, -1, 3, -1, 6, -1, 9, -1, 12, -1, 15, -1)));\n\n\t\t\t// BGLL : B1 0 G1 0 B2 0 G2 0 B3 0 G3 0 B4 0 G4 0\n\t\t\t__m128i BGLL = _mm_unpacklo_epi8(BGL, Zero);\n\n\t\t\t// BGLH : B5 0 G5 0 
B6 0 G6 0 B7 0 G7 0 B8 0 G8 0\n\t\t\t__m128i BGLH = _mm_unpackhi_epi8(BGL, Zero);\n\n\t\t\t// RCLL : R1 Half Half Half R2 Half Half Half R3 Half Half Half R4 Half Half Half\n\t\t\t__m128i RCLL = _mm_or_si128(_mm_unpacklo_epi8(RCL, Zero), Half1);\n\n\t\t\t// RCLH : R5 Half Half Half R6 Half Half Half R7 Half Half Half R8 Half Half Half\n\t\t\t__m128i RCLH = _mm_or_si128(_mm_unpackhi_epi8(RCL, Zero), Half1);\n\n\t\t\t// BGHL : B9 0 G9 0 B10 0 G10 0 B11 0 G11 0 B12 0 G12 0 \n\t\t\t__m128i BGHL = _mm_unpacklo_epi8(BGH, Zero);\n\n\t\t\t// BGHH : B13 0 G13 0 B14 0 G14 0 B15 0 G15 0 B16 0 G16 0\n\t\t\t__m128i BGHH = _mm_unpackhi_epi8(BGH, Zero);\n\n\t\t\t// RCHL : R9 Half Half Half R10 Half Half Half R11 Half Half Half R12 Half Half Half\n\t\t\t__m128i RCHL = _mm_or_si128(_mm_unpacklo_epi8(RCH, Zero), Half1);\n\n\t\t\t// RCHH : R13 Half Half Half R14 Half Half Half R15 Half Half Half R16 Half Half Half\n\t\t\t__m128i RCHH = _mm_or_si128(_mm_unpackhi_epi8(RCH, Zero), Half1);\n\n\t\t\t//\n\t\t\t__m128i Y_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_YBG), _mm_madd_epi16(RCLL, Weight_YRC)), Shift);\n\t\t\t__m128i Y_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_YBG), _mm_madd_epi16(RCLH, Weight_YRC)), Shift);\n\t\t\t__m128i Y_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_YBG), _mm_madd_epi16(RCHL, Weight_YRC)), Shift);\n\t\t\t__m128i Y_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_YBG), _mm_madd_epi16(RCHH, Weight_YRC)), Shift);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePY + XX), _mm_packus_epi16(_mm_packus_epi32(Y_LL, Y_LH), _mm_packus_epi32(Y_HL, Y_HH)));\n\n\t\t\t__m128i U_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_UBG), _mm_madd_epi16(RCLL, Weight_URC)), Shift);\n\t\t\t__m128i U_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_UBG), _mm_madd_epi16(RCLH, Weight_URC)), Shift);\n\t\t\t__m128i U_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_UBG), _mm_madd_epi16(RCHL, 
Weight_URC)), Shift);\n\t\t\t__m128i U_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_UBG), _mm_madd_epi16(RCHH, Weight_URC)), Shift);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(U_LL, U_LH), _mm_packus_epi32(U_HL, U_HH)));\n\n\t\t\t__m128i V_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_VBG), _mm_madd_epi16(RCLL, Weight_VRC)), Shift);\n\t\t\t__m128i V_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_VBG), _mm_madd_epi16(RCLH, Weight_VRC)), Shift);\n\t\t\t__m128i V_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_VBG), _mm_madd_epi16(RCHL, Weight_VRC)), Shift);\n\t\t\t__m128i V_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_VBG), _mm_madd_epi16(RCHH, Weight_VRC)), Shift);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(V_LL, V_LH), _mm_packus_epi32(V_HL, V_HH)));\n\n\t\t}\n\t\tfor (int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3) {\n\t\t\tint Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tLinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + Y_C_WT * HalfV) >> Shift;\n\t\t\tLinePU[XX] = (U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + U_C_WT * HalfV) >> Shift;\n\t\t\tLinePV[XX] = (V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + V_C_WT * HalfV) >> Shift;\n\t\t}\n\t}\n}\n\nvoid _YUV2RGB(const int32_t Width, const int32_t Height, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB) {\n\t\n\tfor (int YY = start_row; YY < start_row + thread_stride; YY++){\n\t\tunsigned char *LinePD = RGB + YY * Stride;\n\t\tunsigned char *LinePY = Y + YY * Width;\n\t\tunsigned char *LinePU = U + YY * Width;\n\t\tunsigned char *LinePV = V + YY * Width;\n\t\tfor (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePY += BlockSize, LinePU += BlockSize, LinePV += BlockSize) {\n\t\t\t__m128i Blue, Green, Red, YV, UV, 
VV, Dest1, Dest2, Dest3;\n\t\t\tYV = _mm_loadu_si128((__m128i *)(LinePY + 0));\n\t\t\tUV = _mm_loadu_si128((__m128i *)(LinePU + 0));\n\t\t\tVV = _mm_loadu_si128((__m128i *)(LinePV + 0));\n\n\t\t\t__m128i YV16L = _mm_unpacklo_epi8(YV, Zero);\n\t\t\t__m128i YV16H = _mm_unpackhi_epi8(YV, Zero);\n\t\t\t__m128i YV32LL = _mm_unpacklo_epi16(YV16L, Zero);\n\t\t\t__m128i YV32LH = _mm_unpackhi_epi16(YV16L, Zero);\n\t\t\t__m128i YV32HL = _mm_unpacklo_epi16(YV16H, Zero);\n\t\t\t__m128i YV32HH = _mm_unpackhi_epi16(YV16H, Zero);\n\n\n\t\t\t__m128i UV16L = _mm_unpacklo_epi8(UV, Zero);\n\t\t\t__m128i UV16H = _mm_unpackhi_epi8(UV, Zero);\n\t\t\t__m128i UV32LL = _mm_unpacklo_epi16(UV16L, Zero);\n\t\t\t__m128i UV32LH = _mm_unpackhi_epi16(UV16L, Zero);\n\t\t\t__m128i UV32HL = _mm_unpacklo_epi16(UV16H, Zero);\n\t\t\t__m128i UV32HH = _mm_unpackhi_epi16(UV16H, Zero);\n\t\t\tUV32LL = _mm_sub_epi32(UV32LL, C128);\n\t\t\tUV32LH = _mm_sub_epi32(UV32LH, C128);\n\t\t\tUV32HL = _mm_sub_epi32(UV32HL, C128);\n\t\t\tUV32HH = _mm_sub_epi32(UV32HH, C128);\n\n\t\t\t__m128i VV16L = _mm_unpacklo_epi8(VV, Zero);\n\t\t\t__m128i VV16H = _mm_unpackhi_epi8(VV, Zero);\n\t\t\t__m128i VV32LL = _mm_unpacklo_epi16(VV16L, Zero);\n\t\t\t__m128i VV32LH = _mm_unpackhi_epi16(VV16L, Zero);\n\t\t\t__m128i VV32HL = _mm_unpacklo_epi16(VV16H, Zero);\n\t\t\t__m128i VV32HH = _mm_unpackhi_epi16(VV16H, Zero);\n\t\t\tVV32LL = _mm_sub_epi32(VV32LL, C128);\n\t\t\tVV32LH = _mm_sub_epi32(VV32LH, C128);\n\t\t\tVV32HL = _mm_sub_epi32(VV32HL, C128);\n\t\t\tVV32HH = _mm_sub_epi32(VV32HH, C128);\n\n\t\t\t__m128i LL_B = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32LL, Weight_B_U)), Shift));\n\t\t\t__m128i LH_B = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32LH, Weight_B_U)), Shift));\n\t\t\t__m128i HL_B = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32HL, Weight_B_U)), Shift));\n\t\t\t__m128i HH_B = _mm_add_epi32(YV32HH, 
_mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32HH, Weight_B_U)), Shift));\n\t\t\tBlue = _mm_packus_epi16(_mm_packus_epi32(LL_B, LH_B), _mm_packus_epi32(HL_B, HH_B));\n\n\t\t\t__m128i LL_G = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LL), _mm_mullo_epi32(Weight_G_V, VV32LL))), Shift));\n\t\t\t__m128i LH_G = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LH), _mm_mullo_epi32(Weight_G_V, VV32LH))), Shift));\n\t\t\t__m128i HL_G = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HL), _mm_mullo_epi32(Weight_G_V, VV32HL))), Shift));\n\t\t\t__m128i HH_G = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HH), _mm_mullo_epi32(Weight_G_V, VV32HH))), Shift));\n\t\t\tGreen = _mm_packus_epi16(_mm_packus_epi32(LL_G, LH_G), _mm_packus_epi32(HL_G, HH_G));\n\n\t\t\t__m128i LL_R = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32LL, Weight_R_V)), Shift));\n\t\t\t__m128i LH_R = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32LH, Weight_R_V)), Shift));\n\t\t\t__m128i HL_R = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32HL, Weight_R_V)), Shift));\n\t\t\t__m128i HH_R = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32HH, Weight_R_V)), Shift));\n\t\t\tRed = _mm_packus_epi16(_mm_packus_epi32(LL_R, LH_R), _mm_packus_epi32(HL_R, HH_R));\n\n\t\t\tDest1 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5));\n\t\t\tDest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1)));\n\t\t\tDest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1)));\n\n\t\t\tDest2 = 
_mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1));\n\t\t\tDest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10)));\n\t\t\tDest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1)));\n\n\t\t\tDest3 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1));\n\t\t\tDest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1)));\n\t\t\tDest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15)));\n\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3), Dest1);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize), Dest2);\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize * 2), Dest3);\n\t\t}\n\t\tfor (int XX = Block * BlockSize; XX < Width; XX++, LinePU++, LinePV++, LinePY++) {\n\t\t\t// LinePY/LinePU/LinePV were advanced by the SIMD loop and already point at pixel XX, so read element 0.\n\t\t\t// LinePD still points at the row start, and each output pixel occupies 3 bytes.\n\t\t\tint YV = LinePY[0], UV = LinePU[0] - 128, VV = LinePV[0] - 128;\n\t\t\tLinePD[XX * 3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));\n\t\t\tLinePD[XX * 3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));\n\t\t\tLinePD[XX * 3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));\n\t\t}\n\t}\n}\n\n\nvoid RGB2YUVSSE_4(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) {\n\tBlockSize = 16, Block = (Width) / BlockSize;\n\t// Use at least one task: Height < 16 or hardware_concurrency() == 0 would otherwise give 0 and divide by zero below.\n\tconst int32_t hw_concur = std::max<int32_t>(1, std::min(Height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency())));\n\tstd::vector<std::future<void>> fut(hw_concur);\n\tconst int thread_stride = (Height - 1) / hw_concur + 1;\n\tint i = 0, start = 0;\n\tfor (; i < std::min(Height, hw_concur); i++, start += thread_stride)\n\t{\n\t\tfut[i] = 
std::async(std::launch::async, _RGB2YUV, RGB, Width, Height, start, std::min(thread_stride, Height - start), Stride, Y, U, V);\t// clamp the row count so the last task stays inside the image\n\t}\n\tfor (int j = 0; j < i; ++j)\n\t\tfut[j].wait();\n}\n\nvoid YUV2RGBSSE_4(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) {\n\tBlockSize = 16, Block = (Width) / BlockSize;\n\t// Use at least one task: Height < 16 or hardware_concurrency() == 0 would otherwise give 0 and divide by zero below.\n\tconst int32_t hw_concur = std::max<int32_t>(1, std::min(Height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency())));\n\tstd::vector<std::future<void>> fut(hw_concur);\n\tconst int thread_stride = (Height - 1) / hw_concur + 1;\n\tint i = 0, start = 0;\n\tfor (; i < std::min(Height, hw_concur); i++, start += thread_stride)\n\t{\n\t\t// Clamp the row count so the last task stays inside the image.\n\t\tfut[i] = std::async(std::launch::async, _YUV2RGB, Width, Height, start, std::min(thread_stride, Height - start), Stride, Y, U, V, RGB);\n\t}\n\tfor (int j = 0; j < i; ++j)\n\t\tfut[j].wait();\n}\n\nint main() {\n\tMat src = imread(\"F:\\\\car.jpg\");\n\tif (src.empty()) return -1;\t// image could not be loaded\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned char *Dest = new unsigned char[Height * Width * 3];\n\tunsigned char *Y = new unsigned char[Height * Width];\n\tunsigned char *U = new unsigned char[Height * Width];\n\tunsigned char *V = new unsigned char[Height * Width];\n\tint Stride = Width * 3;\n\tint64 st = cv::getTickCount();\n\tfor (int i = 0; i < 1000; i++) {\n\t\tRGB2YUVSSE_4(Src, Y, U, V, Width, Height, Stride);\n\t\tYUV2RGBSSE_4(Y, U, V, Dest, Width, Height, Stride);\n\t}\n\t// Total time in seconds for the 1000 round trips.\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency();\n\tprintf(\"%.5f\\n\", duration);\n\tRGB2YUVSSE_4(Src, Y, U, V, Width, Height, Stride);\n\tYUV2RGBSSE_4(Y, U, V, Dest, Width, Height, Stride);\n\tMat dst(Height, Width, CV_8UC3, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n\tdelete[] Y;\n\tdelete[] U;\n\tdelete[] V;\n\tdelete[] Dest;\t// dst only wraps this buffer and is not used after this point\n\treturn 0;\n}"
  },
  {
    "path": "speed_skin_detection_sse.cpp",
    "content": "#include \"stdafx.h\"\n#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include <future>\nusing namespace std;\nusing namespace cv;\n\n#define IM_Max(a, b) (((a) >= (b)) ? (a): (b))\n#define IM_Min(a, b) (((a) >= (b)) ? (b): (a))\n#define _mm_cmpge_epu8(a, b) _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)\n\nvoid IM_GetRoughSkinRegion(unsigned char *Src, unsigned char *Skin, int Width, int Height, int Stride) {\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Skin + Y * Width;\n\t\tfor (int X = 0; X < Width; X++)\n\t\t{\n\t\t\tint Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tif (Red >= 60 && Green >= 40 && Blue >= 20 && Red >= Blue && (Red - Green) >= 10 && IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10)\n\t\t\t\tLinePD[X] = 255;\n\t\t\telse\n\t\t\t\tLinePD[X] = 16;\n\t\t\tLinePS += 3;\n\t\t}\n\t}\n}\n\nvoid IM_GetRoughSkinRegion_OpenMP(unsigned char *Src, unsigned char *Skin, int Width, int Height, int Stride) {\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Skin + Y * Width;\n#pragma omp parallel for num_threads(4)\n\t\tfor (int X = 0; X < Width; X++)\n\t\t{\n\t\t\tint Blue = LinePS[X*3 + 0], Green = LinePS[X*3 + 1], Red = LinePS[X*3 + 2];\n\t\t\tif (Red >= 60 && Green >= 40 && Blue >= 20 && Red >= Blue && (Red - Green) >= 10 && IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10)\n\t\t\t\tLinePD[X] = 255;\n\t\t\telse\n\t\t\t\tLinePD[X] = 16;\n\t\t}\n\t}\n}\n\n\nvoid IM_GetRoughSkinRegion_SSE(unsigned char *Src, unsigned char *Skin, int Width, int Height, int Stride) {\n\tconst int NonSkinLevel = 10; //非肤色部分的处理程序，本例取16，最大值取100，那样就是所有区域都为肤色，毫无意义\n\tconst int BlockSize = 16;\n\tint Block = Width / BlockSize;\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Skin + Y * Width;\n\t\tfor (int X = 
0; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize) {\n\t\t\t__m128i Src1, Src2, Src3, Blue, Green, Red, Result, Max, Min, AbsDiff;\n\t\t\tSrc1 = _mm_loadu_si128((__m128i *)(LinePS + 0));\n\t\t\tSrc2 = _mm_loadu_si128((__m128i *)(LinePS + 16));\n\t\t\tSrc3 = _mm_loadu_si128((__m128i *)(LinePS + 32));\n\n\t\t\tBlue = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tBlue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1)));\n\t\t\tBlue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13)));\n\n\t\t\tGreen = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tGreen = _mm_or_si128(Green, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1)));\n\t\t\tGreen = _mm_or_si128(Green, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14)));\n\n\t\t\tRed = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tRed = _mm_or_si128(Red, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1)));\n\t\t\tRed = _mm_or_si128(Red, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15)));\n\n\t\t\tMax = _mm_max_epu8(_mm_max_epu8(Blue, Green), Red); //IM_Max(IM_Max(Red, Green), Blue)\n\t\t\tMin = _mm_min_epu8(_mm_min_epu8(Blue, Green), Red); //IM_Min(IM_Min(Red, Green), Blue)\n\t\t\tResult = _mm_cmpge_epu8(Blue, _mm_set1_epi8(20)); //Blue >= 20\n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(Green, _mm_set1_epi8(40))); //Green >= 40\n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(Red, _mm_set1_epi8(60))); //Red >= 60\n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(Red, Blue)); 
//Red >= Blue\n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Red, Green), _mm_set1_epi8(10))); //(Red - Green) >= 10 \n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Max, Min), _mm_set1_epi8(10))); //IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10\n\t\t\tResult = _mm_or_si128(Result, _mm_set1_epi8(16));\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + 0), Result);\n\t\t}\n\t\tfor (int X = Block * BlockSize; X < Width; X++, LinePS += 3, LinePD++)\n\t\t{\n\t\t\tint Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tif (Red >= 60 && Green >= 40 && Blue >= 20 && Red >= Blue && (Red - Green) >= 10 && IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10)\n\t\t\t\tLinePD[0] = 255;\t// skin pixel\n\t\t\telse\n\t\t\t\tLinePD[0] = 16;\n\t\t}\n\t}\n}\n\nvoid _IM_GetRoughSkinRegion(unsigned char* Src, const int32_t Width, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char* Dest) {\n\tconst int NonSkinLevel = 10; // Level for non-skin pixels. This example writes 16 (this constant is currently unused); at the maximum of 100 every region would count as skin, which is meaningless.\n\tconst int BlockSize = 16;\n\tint Block = Width / BlockSize;\n\tfor (int Y = start_row; Y < start_row + thread_stride; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Width;\n\t\tfor (int X = 0; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize) {\n\t\t\t__m128i Src1, Src2, Src3, Blue, Green, Red, Result, Max, Min, AbsDiff;\n\t\t\tSrc1 = _mm_loadu_si128((__m128i *)(LinePS + 0));\n\t\t\tSrc2 = _mm_loadu_si128((__m128i *)(LinePS + 16));\n\t\t\tSrc3 = _mm_loadu_si128((__m128i *)(LinePS + 32));\n\n\t\t\tBlue = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tBlue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1)));\n\t\t\tBlue = _mm_or_si128(Blue, 
_mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13)));\n\n\t\t\tGreen = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tGreen = _mm_or_si128(Green, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1)));\n\t\t\tGreen = _mm_or_si128(Green, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14)));\n\n\t\t\tRed = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tRed = _mm_or_si128(Red, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1)));\n\t\t\tRed = _mm_or_si128(Red, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15)));\n\n\t\t\tMax = _mm_max_epu8(_mm_max_epu8(Blue, Green), Red); //IM_Max(IM_Max(Red, Green), Blue)\n\t\t\tMin = _mm_min_epu8(_mm_min_epu8(Blue, Green), Red); //IM_Min(IM_Min(Red, Green), Blue)\n\t\t\tResult = _mm_cmpge_epu8(Blue, _mm_set1_epi8(20)); //Blue >= 20\n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(Green, _mm_set1_epi8(40))); //Green >= 40\n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(Red, _mm_set1_epi8(60))); //Red >= 60\n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(Red, Blue)); //Red >= Blue\n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Red, Green), _mm_set1_epi8(10))); //(Red - Green) >= 10 \n\t\t\tResult = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Max, Min), _mm_set1_epi8(10))); //IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10\n\t\t\tResult = _mm_or_si128(Result, _mm_set1_epi8(16));\n\t\t\t_mm_storeu_si128((__m128i*)(LinePD + 0), Result);\n\t\t}\n\t\tfor (int X = Block * BlockSize; X < Width; X++, LinePS += 3, LinePD++)\n\t\t{\n\t\t\tint Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tif (Red >= 60 && Green >= 40 && 
Blue >= 20 && Red >= Blue && (Red - Green) >= 10 && IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10)\n\t\t\t\tLinePD[0] = 255;\t// skin pixel\n\t\t\telse\n\t\t\t\tLinePD[0] = 16;\n\t\t}\n\t}\n}\n\nvoid IM_GetRoughSkinRegion_SSE2(unsigned char *Src, unsigned char *Skin, int width, int height, int stride) {\n\t// Use at least one task: height < 16 or hardware_concurrency() == 0 would otherwise give 0 and divide by zero below.\n\tconst int32_t hw_concur = std::max<int32_t>(1, std::min(height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency())));\n\tstd::vector<std::future<void>> fut(hw_concur);\n\tconst int thread_stride = (height - 1) / hw_concur + 1;\n\tint i = 0, start = 0;\n\tfor (; i < std::min(height, hw_concur); i++, start += thread_stride)\n\t{\n\t\t// Clamp the row count so the last task stays inside the image.\n\t\tfut[i] = std::async(std::launch::async, _IM_GetRoughSkinRegion, Src, width, start, std::min(thread_stride, height - start), stride, Skin);\n\t}\n\tfor (int j = 0; j < i; ++j)\n\t\tfut[j].wait();\n}\n\nvoid IM_GrayToRGB(unsigned char *Gray, unsigned char *RGB, int Width, int Height, int Stride)\n{\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tunsigned char *LinePS = Gray + Y * Width;\t\t\t\t\t//\tStart address of row Y in the gray source image\n\t\tunsigned char *LinePD = RGB + Y * Stride;\t\t\t\t\t//\tStart address of row Y in the RGB destination image\n\t\tfor (int X = 0; X < Width; X++)\n\t\t{\n\t\t\tLinePD[0] = LinePD[1] = LinePD[2] = LinePS[X];\n\t\t\tLinePD += 3;\n\t\t}\n\t}\n}\n\nint main() {\n\tMat src = imread(\"F:\\\\face.jpg\");\n\tif (src.empty()) return -1;\t// image could not be loaded\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned char *Skin = new unsigned char[Height * Width];\n\tunsigned char *Dest = new unsigned char[Height * Width * 3];\n\tint Stride = Width * 3;\n\tint64 st = cv::getTickCount();\n\tfor (int i = 0; i <1000; i++) {\n\t\tIM_GetRoughSkinRegion_SSE2(Src, Skin, Width, Height, Stride);\n\t\t//IM_GrayToRGB(Skin, Dest, Width, Height, Stride);\n\t}\n\t// Total time in seconds for the 1000 iterations.\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency();\n\tprintf(\"%.5f\\n\", duration);\n\tIM_GetRoughSkinRegion_SSE2(Src, 
Skin, Width, Height, Stride);\n\tIM_GrayToRGB(Skin, Dest, Width, Height, Stride);\n\tMat dst(Height, Width, CV_8UC3, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n\tdelete[] Skin;\n\tdelete[] Dest;\t// dst only wraps this buffer and is not used after this point\n\treturn 0;\n}"
  },
  {
    "path": "speed_sobel_edgedetection_sse.cpp",
    "content": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include <future>\nusing namespace std;\nusing namespace cv;\n\ninline unsigned char IM_ClampToByte(int Value)\n{\n\tif (Value < 0)\n\t\treturn 0;\n\telse if (Value > 255)\n\t\treturn 255;\n\telse\n\t\treturn (unsigned char)Value;\n\t//return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));\n}\n\nvoid Sobel_FLOAT(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tint Channel = Stride / Width;\n\tunsigned char *RowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel);\n\tunsigned char *First = RowCopy;\n\tunsigned char *Second = RowCopy + (Width + 2) * Channel;\n\tunsigned char *Third = RowCopy + (Width + 2) * 2 * Channel;\n\t//拷贝第二行数据，边界值填充\n\tmemcpy(Second, Src, Channel);\n\tmemcpy(Second + Channel, Src, Width*Channel);\n\tmemcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel);\n\t//第一行和第二行一样\n\tmemcpy(First, Second, (Width + 2) * Channel);\n\t//拷贝第三行数据，边界值填充\n\tmemcpy(Third, Src + Stride, Channel);\n\tmemcpy(Third + Channel, Src + Stride, Width * Channel);\n\tmemcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel);\n\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tif (Y != 0) {\n\t\t\tunsigned char *Temp = First;\n\t\t\tFirst = Second;\n\t\t\tSecond = Third;\n\t\t\tThird = Temp;\n\t\t}\n\t\tif (Y == Height - 1) {\n\t\t\tmemcpy(Third, Second, (Width + 2) * Channel);\n\t\t}\n\t\telse {\n\t\t\tmemcpy(Third, Src + (Y + 1) * Stride, Channel);\n\t\t\tmemcpy(Third + Channel, Src + (Y + 1) * Stride, Width * Channel);                            //    由于备份了前面一行的数据，这里即使Src和Dest相同也是没有问题的\n\t\t\tmemcpy(Third + (Width + 1) * Channel, Src + (Y + 1) * Stride + (Width - 1) * Channel, Channel);\n\t\t}\n\t\tif (Channel == 1) {\n\t\t\tfor (int X = 0; X < Width; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 2] + 
(Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2];\n\t\t\t\tint GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2];\n\t\t\t\tLinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F));\n\t\t\t}\n\t\t}\n\t\telse\n\t\t{\n\t\t\tfor (int X = 0; X < Width * 3; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6];\n\t\t\t\tint GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6];\n\t\t\t\tLinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F));\n\t\t\t}\n\t\t}\n\t}\n\tfree(RowCopy);\n}\n\nvoid Sobel_INT(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tint Channel = Stride / Width;\n\tunsigned char *RowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel);\n\tunsigned char *First = RowCopy;\n\tunsigned char *Second = RowCopy + (Width + 2) * Channel;\n\tunsigned char *Third = RowCopy + (Width + 2) * 2 * Channel;\n\t// Copy the second row, replicating the border pixels\n\tmemcpy(Second, Src, Channel);\n\tmemcpy(Second + Channel, Src, Width*Channel);\n\tmemcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel);\n\t// The first row is the same as the second\n\tmemcpy(First, Second, (Width + 2) * Channel);\n\t// Copy the third row, replicating the border pixels\n\tmemcpy(Third, Src + Stride, Channel);\n\tmemcpy(Third + Channel, Src + Stride, Width * Channel);\n\tmemcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel);\n\n\t// Lookup table: rounded square root of every possible clamped squared magnitude\n\tunsigned char Table[65026];\n\tfor (int Y = 0; Y < 65026; Y++) Table[Y] = (sqrtf(Y + 0.0f) + 0.5f);\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tif (Y != 0) {\n\t\t\tunsigned char *Temp = First;\n\t\t\tFirst = Second;\n\t\t\tSecond = Third;\n\t\t\tThird = Temp;\n\t\t}\n\t\tif (Y == Height - 1) {\n\t\t\tmemcpy(Third, Second, (Width + 2) * Channel);\n\t\t}\n\t\telse {\n\t\t\tmemcpy(Third, Src + (Y + 1) * Stride, Channel);\n\t\t\tmemcpy(Third + 
Channel, Src + (Y + 1) * Stride, Width * Channel);                            //    The previous row has been backed up, so this works even when Src and Dest are the same buffer\n\t\t\tmemcpy(Third + (Width + 1) * Channel, Src + (Y + 1) * Stride + (Width - 1) * Channel, Channel);\n\t\t}\n\t\tif (Channel == 1) {\n\t\t\tfor (int X = 0; X < Width; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2];\n\t\t\t\tint GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2];\n\t\t\t\tLinePD[X] = Table[min(GX * GX + GY * GY, 65025)];\n\t\t\t}\n\t\t}\n\t\telse\n\t\t{\n\t\t\tfor (int X = 0; X < Width * 3; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6];\n\t\t\t\tint GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6];\n\t\t\t\tLinePD[X] = Table[min(GX * GX + GY * GY, 65025)];\n\t\t\t}\n\t\t}\n\t}\n\tfree(RowCopy);\n}\n\nvoid Sobel_SSE1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tint Channel = Stride / Width;\n\tunsigned char *RowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel);\n\tunsigned char *First = RowCopy;\n\tunsigned char *Second = RowCopy + (Width + 2) * Channel;\n\tunsigned char *Third = RowCopy + (Width + 2) * 2 * Channel;\n\t// Copy the second row, replicating the border pixels\n\tmemcpy(Second, Src, Channel);\n\tmemcpy(Second + Channel, Src, Width*Channel);\n\tmemcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel);\n\t// The first row is the same as the second\n\tmemcpy(First, Second, (Width + 2) * Channel);\n\t// Copy the third row, replicating the border pixels\n\tmemcpy(Third, Src + Stride, Channel);\n\tmemcpy(Third + Channel, Src + Stride, Width * Channel);\n\tmemcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel);\n\n\tint BlockSize = 8, Block = (Width * Channel) / BlockSize;\n\n\tunsigned char Table[65026];\n\tfor (int Y = 0; Y < 65026; Y++) Table[Y] = (sqrtf(Y + 0.0f) + 0.5f);\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned 
char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tif (Y != 0) {\n\t\t\tunsigned char *Temp = First;\n\t\t\tFirst = Second;\n\t\t\tSecond = Third;\n\t\t\tThird = Temp;\n\t\t}\n\t\tif (Y == Height - 1) {\n\t\t\tmemcpy(Third, Second, (Width + 2) * Channel);\n\t\t}\n\t\telse {\n\t\t\tmemcpy(Third, Src + (Y + 1) * Stride, Channel);\n\t\t\tmemcpy(Third + Channel, Src + (Y + 1) * Stride, Width * Channel);                            //    由于备份了前面一行的数据，这里即使Src和Dest相同也是没有问题的\n\t\t\tmemcpy(Third + (Width + 1) * Channel, Src + (Y + 1) * Stride + (Width - 1) * Channel, Channel);\n\t\t}\n\t\tif (Channel == 1) {\n\t\t\tfor (int X = 0; X < Width; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2];\n\t\t\t\tint GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2];\n\t\t\t\t//LinePD[X] = Table[min(GX * GX + GY * GY, 65025)];\n\t\t\t}\n\t\t}\n\t\telse\n\t\t{\n\t\t\t__m128i Zero = _mm_setzero_si128();\n\t\t\tfor (int X = 0; X < Block * BlockSize; X += BlockSize)\n\t\t\t{\n\t\t\t\t__m128i FirstP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X)), Zero);\n\t\t\t\t__m128i FirstP1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X + 3)), Zero);\n\t\t\t\t__m128i FirstP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X + 6)), Zero);\n\n\t\t\t\t__m128i SecondP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X)), Zero);\n\t\t\t\t__m128i SecondP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X + 6)), Zero);\n\n\t\t\t\t__m128i ThirdP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X)), Zero);\n\t\t\t\t__m128i ThirdP1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X + 3)), Zero);\n\t\t\t\t__m128i ThirdP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X + 6)), Zero);\n\n\t\t\t\t__m128i GX16 = _mm_abs_epi16(_mm_add_epi16(_mm_add_epi16(_mm_sub_epi16(FirstP0, FirstP2), 
_mm_slli_epi16(_mm_sub_epi16(SecondP0, SecondP2), 1)), _mm_sub_epi16(ThirdP0, ThirdP2)));\n\t\t\t\t__m128i GY16 = _mm_abs_epi16(_mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP2), _mm_slli_epi16(_mm_sub_epi16(FirstP1, ThirdP1), 1)), _mm_add_epi16(ThirdP0, ThirdP2)));\n\n\t\t\t\t__m128i GX32L = _mm_unpacklo_epi16(GX16, Zero);\n\t\t\t\t__m128i GX32H = _mm_unpackhi_epi16(GX16, Zero);\n\t\t\t\t__m128i GY32L = _mm_unpacklo_epi16(GY16, Zero);\n\t\t\t\t__m128i GY32H = _mm_unpackhi_epi16(GY16, Zero);\n\t\t\t\t__m128i ResultL = _mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_add_epi32(_mm_mullo_epi32(GX32L, GX32L), _mm_mullo_epi32(GY32L, GY32L)))));\n\t\t\t\t__m128i ResultH = _mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_add_epi32(_mm_mullo_epi32(GX32H, GX32H), _mm_mullo_epi32(GY32H, GY32H)))));\n\t\t\t\t_mm_storel_epi64((__m128i *)(LinePD + X), _mm_packus_epi16(_mm_packus_epi32(ResultL, ResultH), Zero));\n\t\t\t}\n\n\t\t\tfor (int X = Block * BlockSize; X < Width * 3; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6];\n\t\t\t\tint GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6];\n\t\t\t\tLinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F));\n\t\t\t}\n\t\t}\n\t}\n\tfree(RowCopy);\n}\n\nvoid Sobel_SSE2(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tint Channel = Stride / Width;\n\tunsigned char *RowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel);\n\tunsigned char *First = RowCopy;\n\tunsigned char *Second = RowCopy + (Width + 2) * Channel;\n\tunsigned char *Third = RowCopy + (Width + 2) * 2 * Channel;\n\t//拷贝第二行数据，边界值填充\n\tmemcpy(Second, Src, Channel);\n\tmemcpy(Second + Channel, Src, Width*Channel);\n\tmemcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel);\n\t//第一行和第二行一样\n\tmemcpy(First, Second, (Width + 2) * Channel);\n\t//拷贝第三行数据，边界值填充\n\tmemcpy(Third, Src + Stride, 
Channel);\n\tmemcpy(Third + Channel, Src + Stride, Width * Channel);\n\tmemcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel);\n\n\tint BlockSize = 8, Block = (Width * Channel) / BlockSize;\n\n\tunsigned char Table[65026];\n\tfor (int Y = 0; Y < 65026; Y++) Table[Y] = (sqrtf(Y + 0.0f) + 0.5f);\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tif (Y != 0) {\n\t\t\tunsigned char *Temp = First;\n\t\t\tFirst = Second;\n\t\t\tSecond = Third;\n\t\t\tThird = Temp;\n\t\t}\n\t\tif (Y == Height - 1) {\n\t\t\tmemcpy(Third, Second, (Width + 2) * Channel);\n\t\t}\n\t\telse {\n\t\t\tmemcpy(Third, Src + (Y + 1) * Stride, Channel);\n\t\t\tmemcpy(Third + Channel, Src + (Y + 1) * Stride, Width * Channel);                            //    由于备份了前面一行的数据，这里即使Src和Dest相同也是没有问题的\n\t\t\tmemcpy(Third + (Width + 1) * Channel, Src + (Y + 1) * Stride + (Width - 1) * Channel, Channel);\n\t\t}\n\t\tif (Channel == 1) {\n\t\t\tfor (int X = 0; X < Width; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2];\n\t\t\t\tint GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2];\n\t\t\t\t//LinePD[X] = Table[min(GX * GX + GY * GY, 65025)];\n\t\t\t}\n\t\t}\n\t\telse\n\t\t{\n\t\t\t__m128i Zero = _mm_setzero_si128();\n\t\t\tfor (int X = 0; X < Block * BlockSize; X += BlockSize)\n\t\t\t{\n\t\t\t\t__m128i FirstP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X)), Zero);\n\t\t\t\t__m128i FirstP1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X + 3)), Zero);\n\t\t\t\t__m128i FirstP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X + 6)), Zero);\n\n\t\t\t\t__m128i SecondP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X)), Zero);\n\t\t\t\t__m128i SecondP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X + 6)), Zero);\n\n\t\t\t\t__m128i ThirdP0 = 
_mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X)), Zero);\n\t\t\t\t__m128i ThirdP1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X + 3)), Zero);\n\t\t\t\t__m128i ThirdP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X + 6)), Zero);\n\n\t\t\t\t__m128i GX16 = _mm_abs_epi16(_mm_add_epi16(_mm_add_epi16(_mm_sub_epi16(FirstP0, FirstP2), _mm_slli_epi16(_mm_sub_epi16(SecondP0, SecondP2), 1)), _mm_sub_epi16(ThirdP0, ThirdP2)));\n\t\t\t\t__m128i GY16 = _mm_abs_epi16(_mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP2), _mm_slli_epi16(_mm_sub_epi16(FirstP1, ThirdP1), 1)), _mm_add_epi16(ThirdP0, ThirdP2)));\n\n\t\t\t\t__m128i GXYL = _mm_unpacklo_epi16(GX16, GY16);\n\t\t\t\t__m128i GXYH = _mm_unpackhi_epi16(GX16, GY16);\n\n\t\t\t\t__m128i ResultL = _mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_madd_epi16(GXYL, GXYL))));\n\t\t\t\t__m128i ResultH = _mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_madd_epi16(GXYH, GXYH))));\n\t\t\t\t_mm_storel_epi64((__m128i *)(LinePD + X), _mm_packus_epi16(_mm_packus_epi32(ResultL, ResultH), Zero));\n\t\t\t}\n\n\t\t\tfor (int X = Block * BlockSize; X < Width * 3; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6];\n\t\t\t\tint GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6];\n\t\t\t\tLinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F));\n\t\t\t}\n\t\t}\n\t}\n\tfree(RowCopy);\n}\n\nunsigned char *RowCopy;\nunsigned char *First;\nunsigned char *Second;\nunsigned char *Third;\nint Channel, Block, BlockSize;\nvoid _Sobel(unsigned char* Src, const int32_t Width, const int32_t Height, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char* Dest) {\n\tfor (int Y = start_row; Y < start_row + thread_stride; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tif (Y != 0) {\n\t\t\tunsigned char *Temp = 
First;\n\t\t\tFirst = Second;\n\t\t\tSecond = Third;\n\t\t\tThird = Temp;\n\t\t}\n\t\tif (Y == Height - 1) {\n\t\t\tmemcpy(Third, Second, (Width + 2) * Channel);\n\t\t}\n\t\telse {\n\t\t\tmemcpy(Third, Src + (Y + 1) * Stride, Channel);\n\t\t\tmemcpy(Third + Channel, Src + (Y + 1) * Stride, Width * Channel);                            //    由于备份了前面一行的数据，这里即使Src和Dest相同也是没有问题的\n\t\t\tmemcpy(Third + (Width + 1) * Channel, Src + (Y + 1) * Stride + (Width - 1) * Channel, Channel);\n\t\t}\n\t\tif (Channel == 1) {\n\t\t\tfor (int X = 0; X < Width; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2];\n\t\t\t\tint GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2];\n\t\t\t\t//LinePD[X] = Table[min(GX * GX + GY * GY, 65025)];\n\t\t\t}\n\t\t}\n\t\telse\n\t\t{\n\t\t\t__m256i Zero = _mm256_setzero_si256();\n\t\t\tfor (int X = 0; X < Block * BlockSize; X += BlockSize)\n\t\t\t{\n\t\t\t\t__m256i FirstP0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(First + X)));\n\t\t\t\t__m256i FirstP1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(First + X + 3)));\n\t\t\t\t__m256i FirstP2 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(First + X + 6)));\n\n\t\t\t\t__m256i SecondP0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Second + X)));\n\t\t\t\t__m256i SecondP2 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Second + X + 6)));\n\n\t\t\t\t__m256i ThirdP0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Third + X)));\n\t\t\t\t__m256i ThirdP1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Third + X + 3)));\n\t\t\t\t__m256i ThirdP2 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Third + X + 6)));\n\n\t\t\t\t//GX0\tGX1\t    GX2    GX3    GX4    GX5    GX6    GX7     GX8\t GX9\t GX10    GX11    GX12    GX13    GX14    GX15\n\t\t\t\t__m256i GX16 = _mm256_abs_epi16(_mm256_adds_epi16(_mm256_adds_epi16(_mm256_subs_epi16(FirstP0, 
FirstP2), _mm256_slli_epi16(_mm256_subs_epi16(SecondP0, SecondP2), 1)), _mm256_subs_epi16(ThirdP0, ThirdP2)));\n\t\t\t\t//GY0   GY1     GY2    GY3    GY4    GY5    GY6    GY7     GY8   GY9     GY10    GY11    GY12    GY13    GY14    GY15\n\t\t\t\t__m256i GY16 = _mm256_abs_epi16(_mm256_subs_epi16(_mm256_adds_epi16(_mm256_adds_epi16(FirstP0, FirstP2), _mm256_slli_epi16(_mm256_subs_epi16(FirstP1, ThirdP1), 1)), _mm256_adds_epi16(ThirdP0, ThirdP2)));\n\t\t\t\t// unpacklo works per 128-bit lane: GX0 GY0 ... GX3 GY3 | GX8 GY8 ... GX11 GY11\n\t\t\t\t__m256i GXYL = _mm256_unpacklo_epi16(GX16, GY16);\n\t\t\t\t// unpackhi works per 128-bit lane: GX4 GY4 ... GX7 GY7 | GX12 GY12 ... GX15 GY15\n\t\t\t\t__m256i GXYH = _mm256_unpackhi_epi16(GX16, GY16);\n\n\t\t\t\t__m256i ResultL = _mm256_cvtps_epi32(_mm256_sqrt_ps(_mm256_cvtepi32_ps(_mm256_madd_epi16(GXYL, GXYL))));\n\t\t\t\t__m256i ResultH = _mm256_cvtps_epi32(_mm256_sqrt_ps(_mm256_cvtepi32_ps(_mm256_madd_epi16(GXYH, GXYH))));\n\n\t\t\t\t// Saturate the 32-bit magnitudes down to bytes; the per-lane packs restore pixel order 0..7 | 8..15\n\t\t\t\t__m256i Result = _mm256_packus_epi16(_mm256_packus_epi32(ResultL, ResultH), Zero);\n\n\t\t\t\t// The low 8 bytes of each 128-bit lane hold the results\n\t\t\t\t_mm_storel_epi64((__m128i *)(LinePD + X), _mm256_castsi256_si128(Result));\n\t\t\t\t_mm_storel_epi64((__m128i *)(LinePD + X + 8), _mm256_extracti128_si256(Result, 1));\n\t\t\t}\n\n\t\t\tfor (int X = Block * BlockSize; X < Width * 3; X++)\n\t\t\t{\n\t\t\t\tint GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6];\n\t\t\t\tint GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6];\n\t\t\t\tLinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F));\n\t\t\t}\n\t\t}\n\t}\n}\n\nvoid Sobel_AVX1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\tChannel = Stride / Width;\n\tRowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel);\n\tFirst = RowCopy;\n\tSecond = RowCopy + (Width + 
2) * Channel;\n\tThird = RowCopy + (Width + 2) * 2 * Channel;\n\t//拷贝第二行数据，边界值填充\n\tmemcpy(Second, Src, Channel);\n\tmemcpy(Second + Channel, Src, Width*Channel);\n\tmemcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel);\n\t//第一行和第二行一样\n\tmemcpy(First, Second, (Width + 2) * Channel);\n\t//拷贝第三行数据，边界值填充\n\tmemcpy(Third, Src + Stride, Channel);\n\tmemcpy(Third + Channel, Src + Stride, Width * Channel);\n\tmemcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel);\n\n\tBlockSize = 16, Block = (Width * Channel) / BlockSize;\n\n\t_Sobel(Src, Width, Height, 0, Height, Stride, Dest);\n\t\n\tfree(RowCopy);\n}\n\nvoid Sobel_AVX2(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {\n\t//INIT\n\tChannel = Stride / Width;\n\tRowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel);\n\tFirst = RowCopy;\n\tSecond = RowCopy + (Width + 2) * Channel;\n\tThird = RowCopy + (Width + 2) * 2 * Channel;\n\t//拷贝第二行数据，边界值填充\n\tmemcpy(Second, Src, Channel);\n\tmemcpy(Second + Channel, Src, Width*Channel);\n\tmemcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel);\n\t//第一行和第二行一样\n\tmemcpy(First, Second, (Width + 2) * Channel);\n\t//拷贝第三行数据，边界值填充\n\tmemcpy(Third, Src + Stride, Channel);\n\tmemcpy(Third + Channel, Src + Stride, Width * Channel);\n\tmemcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel);\n\n\tBlockSize = 16, Block = (Width * Channel) / BlockSize;\n\n\t//Run\n\tconst int32_t hw_concur = std::min(Height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency()));\n\tstd::vector<std::future<void>> fut(hw_concur);\n\tconst int thread_stride = (Height - 1) / hw_concur + 1;\n\tint i = 0, start = 0;\n\tfor (; i < std::min(Height, hw_concur); i++, start += thread_stride)\n\t{\n\t\tfut[i] = std::async(std::launch::async, _Sobel, Src, Width, Height, start, thread_stride, Stride, Dest);\n\t}\n\tfor (int j = 0; j < i; 
++j)\n\t\tfut[j].wait();\n\n\tfree(RowCopy);\n}\n\n\nint main() {\n\tMat src = imread(\"F:\\\\car.jpg\");\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned char *Dest = new unsigned char[Height * Width * 3];\n\tint Stride = Width * 3;\n\tint Radius = 11;\n\tint Adjustment = 50;\n\tint64 st = cvGetTickCount();\n\t/*for (int i = 0; i <1000; i++) {\n\t\tSobel_SSE3(Src, Dest, Width, Height, Stride);\n\t}*/\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency();\n\tprintf(\"%.5f\\n\", duration);\n\tSobel_SSE1(Src, Dest, Width, Height, Stride);\n\tMat dst(Height, Width, CV_8UC3, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n}\n"
  },
  {
    "path": "speed_vibrance_algorithm.cpp",
    "content": "#include <stdio.h>\n#include <omp.h>\n#include <opencv2/opencv.hpp>\n\nusing namespace std;\nusing namespace cv;\n\nvoid GetGrayIntegralImage(unsigned char *Src, int *Integral, int Width, int Height, int Stride)\n{\n\tmemset(Integral, 0, (Width + 1) * sizeof(int));                    //    第一行都为0\n\tfor (int Y = 0; Y < Height; Y++)\n\t{\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tint *LinePL = Integral + Y * (Width + 1) + 1;                //上一行的位置\n\t\tint *LinePD = Integral + (Y + 1) * (Width + 1) + 1;           //    当前位置，注意每行的第一列的值都为0\n\t\tLinePD[-1] = 0;                                               //    第一列的值为0\n\t\tfor (int X = 0, Sum = 0; X < Width; X++)\n\t\t{\n\t\t\tSum += LinePS[X];                                          //    行方向累加\n\t\t\tLinePD[X] = LinePL[X] + Sum;                               //    更新积分图\n\t\t}\n\t}\n}\n\nvoid GetGrayIntegralImage_SSE(unsigned char *Src, int *Integral, int Width, int Height, int Stride) {\n\tmemset(Integral, 0, (Width + 1) * sizeof(int)); //第一行都为0\n\tint BlockSize = 8, Block = Width / BlockSize;\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tint *LinePL = Integral + Y * (Width + 1) + 1; //上一行位置\n\t\tint *LinePD = Integral + (Y + 1) * (Width + 1) + 1; //当前位置，注意每行的第一列都为0\n\t\tLinePD[-1] = 0;\n\t\t__m128i PreV = _mm_setzero_si128();\n\t\t__m128i Zero = _mm_setzero_si128();\n\t\tfor (int X = 0; X < Block * BlockSize; X += BlockSize) {\n\t\t\t__m128i Src_Shift0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i*)(LinePS + X)), Zero); //A7 A6 A5 A 4 A3 A2 A1 A0\n\t\t\t__m128i Src_Shift1 = _mm_slli_si128(Src_Shift0, 2); //A6 A5 A4 A3 A2 A1 A0 0\n\t\t\t__m128i Src_Shift2 = _mm_slli_si128(Src_Shift1, 2); //A5 A4 A3 A2 A1 A0 0  0\n\t\t\t__m128i Src_Shift3 = _mm_slli_si128(Src_Shift2, 2); //A4 A3 A2 A1 A0 0  0  0\n\t\t\t__m128i Shift_Add12 = _mm_add_epi16(Src_Shift1, Src_Shift2); //A6+A5 A5+A4 A4+A3 A3+A2 A2+A1 A1+A0 A0+0  0+0\n\t\t\t__m128i Shift_Add03 = 
_mm_add_epi16(Src_Shift0, Src_Shift3); //A7+A4 A6+A3 A5+A2 A4+A1 A3+A0 A2+0  A1+0  A0+0 \n\t\t\t__m128i Low = _mm_add_epi16(Shift_Add12, Shift_Add03); //A7+A6+A5+A4 A6+A5+A4+A3 A5+A4+A3+A2 A4+A3+A2+A1 A3+A2+A1+A0 A2+A1+A0+0 A1+A0+0+0 A0+0+0+0\n\t\t\t__m128i High = _mm_add_epi32(_mm_unpackhi_epi16(Low, Zero), _mm_unpacklo_epi16(Low, Zero)); //A7+A6+A5+A4+A3+A2+A1+A0  A6+A5+A4+A3+A2+A1+A0  A5+A4+A3+A2+A1+A0  A4+A3+A2+A1+A0\n\t\t\t__m128i SumL = _mm_loadu_si128((__m128i *)(LinePL + X + 0));\n\t\t\t__m128i SumH = _mm_loadu_si128((__m128i *)(LinePL + X + 4));\n\t\t\tSumL = _mm_add_epi32(SumL, PreV);\n\t\t\tSumL = _mm_add_epi32(SumL, _mm_unpacklo_epi16(Low, Zero));\n\t\t\tSumH = _mm_add_epi32(SumH, PreV);\n\t\t\tSumH = _mm_add_epi32(SumH, High);\n\t\t\tPreV = _mm_add_epi32(PreV, _mm_shuffle_epi32(High, _MM_SHUFFLE(3, 3, 3, 3)));\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + X + 0), SumL);\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + X + 4), SumH);\n\t\t}\n\t\tfor (int X = Block * BlockSize, V = LinePD[X - 1] - LinePL[X - 1]; X < Width; X++)\n\t\t{\n\t\t\tV += LinePS[X];\n\t\t\tLinePD[X] = V + LinePL[X];\n\t\t}\n\t}\n}\n\nvoid BoxBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) {\n\tint *Integral = (int *)malloc((Width + 1) * (Height + 1) * sizeof(int));\n\tGetGrayIntegralImage(Src, Integral, Width, Height, Stride);\n#pragma omp parallel for num_threads(4)\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tint Y1 = max(Y - Radius, 0);\n\t\tint Y2 = min(Y + Radius + 1, Height);    // the integral image has Height + 1 rows, so Height is a valid index\n\t\tint *LineP1 = Integral + Y1 * (Width + 1);\n\t\tint *LineP2 = Integral + Y2 * (Width + 1);\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tfor (int X = 0; X < Width; X++) {\n\t\t\tint X1 = max(X - Radius, 0);\n\t\t\tint X2 = min(X + Radius + 1, Width);\n\t\t\tint Sum = LineP2[X2] - LineP1[X2] - LineP2[X1] + LineP1[X1];\n\t\t\tint PixelCount = (X2 - X1) * (Y2 - Y1);\n\t\t\tLinePD[X] = (Sum + (PixelCount >> 1)) / 
PixelCount;\n\t\t}\n\t}\n\tfree(Integral);\n}\n\n//Adjustment如果为正值，会增加饱和度\n//Adjustment如果为负值，会降低饱和度\nvoid VibranceAlgorithm_FLOAT(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Adjustment) {\n\tfloat VibranceAdjustment = -0.01 * Adjustment;\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tfor (int X = 0; X < Width; X++) {\n\t\t\tint Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tint Avg = (Blue + Green + Green + Red) >> 2;\n\t\t\tint Max = max(max(Blue, Green), Red);\n\t\t\tfloat AmtVal = (abs(Max - Avg) / 127.0f) * VibranceAdjustment;\n\t\t\tif (Blue != Max) Blue += (Max - Blue) * AmtVal;\n\t\t\tif (Green != Max) Green += (Max - Green) * AmtVal;\n\t\t\tif (Red != Max) Red += (Max - Red) * AmtVal;\n\t\t\tif (Red < 0) Red = 0;\n\t\t\telse if (Red > 255) Red = 255;\n\t\t\tif (Green < 0) Green = 0;\n\t\t\telse if (Green > 255) Green = 255;\n\t\t\tif (Blue < 0) Blue = 0;\n\t\t\telse if (Blue > 255) Blue = 255;\n\t\t\tLinePD[0] = Blue;\n\t\t\tLinePD[1] = Green;\n\t\t\tLinePD[2] = Red;\n\t\t\tLinePS += 3;\n\t\t\tLinePD += 3;\n\t\t}\n\t}\n}\n\nvoid VibranceAlgorithm_INT(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Adjustment) {\n\tint VibranceAdjustment = -1.28 * Adjustment;\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tfor (int X = 0; X < Width; X++) {\n\t\t\tint Blue, Green, Red, Max;\n\t\t\tBlue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tint Avg = (Blue + Green + Green + Red) >> 2;\n\t\t\tif (Blue > Green)\n\t\t\t\tMax = Blue;\n\t\t\telse\n\t\t\t\tMax = Green;\n\t\t\tif (Red > Max)\n\t\t\t\tMax = Red;\n\t\t\tint AmtVal = (Max - Avg) * VibranceAdjustment;\n\t\t\tif (Blue != Max) Blue += (((Max - Blue) * AmtVal) >> 14);\n\t\t\tif (Green != Max) Green += (((Max - Green) * AmtVal) >> 14);\n\t\t\tif 
(Red != Max) Red += (((Max - Red) * AmtVal) >> 14);\n\t\t\tif (Red < 0) Red = 0;\n\t\t\telse if (Red > 255) Red = 255;\n\t\t\tif (Green < 0) Green = 0;\n\t\t\telse if (Green > 255) Green = 255;\n\t\t\tif (Blue < 0) Blue = 0;\n\t\t\telse if (Blue > 255) Blue = 255;\n\t\t\tLinePD[0] = Blue;\n\t\t\tLinePD[1] = Green;\n\t\t\tLinePD[2] = Red;\n\t\t\tLinePS += 3;\n\t\t\tLinePD += 3;\n\t\t}\n\t}\n}\n\nvoid VibranceAlgorithm_INT_OpenMP(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Adjustment) {\n\tint VibranceAdjustment = -1.28 * Adjustment;\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n#pragma omp parallel for num_threads(4)\n\t\tfor (int X = 0; X < Width; X++) {\n\t\t\tint Blue, Green, Red, Max;\n\t\t\tBlue = LinePS[X*3 + 0], Green = LinePS[X*3 + 1], Red = LinePS[X*3 + 2];\n\t\t\tint Avg = (Blue + Green + Green + Red) >> 2;\n\t\t\tif (Blue > Green)\n\t\t\t\tMax = Blue;\n\t\t\telse\n\t\t\t\tMax = Green;\n\t\t\tif (Red > Max)\n\t\t\t\tMax = Red;\n\t\t\tint AmtVal = (Max - Avg) * VibranceAdjustment;\n\t\t\tif (Blue != Max) Blue += (((Max - Blue) * AmtVal) >> 14);\n\t\t\tif (Green != Max) Green += (((Max - Green) * AmtVal) >> 14);\n\t\t\tif (Red != Max) Red += (((Max - Red) * AmtVal) >> 14);\n\t\t\tif (Red < 0) Red = 0;\n\t\t\telse if (Red > 255) Red = 255;\n\t\t\tif (Green < 0) Green = 0;\n\t\t\telse if (Green > 255) Green = 255;\n\t\t\tif (Blue < 0) Blue = 0;\n\t\t\telse if (Blue > 255) Blue = 255;\n\t\t\tLinePD[X*3 + 0] = Blue;\n\t\t\tLinePD[X*3 + 1] = Green;\n\t\t\tLinePD[X*3 + 2] = Red;\n\t\t}\n\t}\n}\n\nvoid VibranceAlgorithm_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Adjustment) {\n\tint VibranceAdjustment = (int)(-1.28 * Adjustment);\n\t__m128i Adjustment128 = _mm_setr_epi16(VibranceAdjustment, VibranceAdjustment, VibranceAdjustment, VibranceAdjustment,\n\t\tVibranceAdjustment, VibranceAdjustment, 
VibranceAdjustment, VibranceAdjustment);\n\tint X;\n\tfor (int Y = 0; Y < Height; Y++) {\n\t\tunsigned char *LinePS = Src + Y * Stride;\n\t\tunsigned char *LinePD = Dest + Y * Stride;\n\t\tX = 0;\n\t\t__m128i Src1, Src2, Src3, Dest1, Dest2, Dest3, Blue8, Green8, Red8, Max8;\n\t\t__m128i BL16, BH16, GL16, GH16, RL16, RH16, MaxL16, MaxH16, AvgL16, AvgH16, AmtVal;\n\t\t__m128i Zero = _mm_setzero_si128();\n\t\tfor (; X < Width - 16; X += 16, LinePS += 48, LinePD += 48) {\n\t\t\tSrc1 = _mm_loadu_si128((__m128i *)(LinePS + 0)); //B1,G1,R1,B2,G2,R2,B3,G3,R3,B4,G4,R4,B5,G5,R5,B6\n\t\t\tSrc2 = _mm_loadu_si128((__m128i *)(LinePS + 16));//G6,R6,B7,G7,R7,B8,G8,R8,B9,G9,R9,B10,G10,R10,B11,G11\n\t\t\tSrc3 = _mm_loadu_si128((__m128i *)(LinePS + 32));//R11,B12,G12,R12,B13,G13,R13,B14,G14,R14,B15,G15,R15,B16,G16,R16\n\n\t\t\tBlue8 = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tBlue8 = _mm_or_si128(Blue8, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1)));\n\t\t\tBlue8 = _mm_or_si128(Blue8, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13)));\n\n\t\t\tGreen8 = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tGreen8 = _mm_or_si128(Green8, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1)));\n\t\t\tGreen8 = _mm_or_si128(Green8, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14)));\n\n\t\t\tRed8 = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));\n\t\t\tRed8 = _mm_or_si128(Red8, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1)));\n\t\t\tRed8 = _mm_or_si128(Red8, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15)));\n\n\t\t\tMax8 = 
_mm_max_epu8(_mm_max_epu8(Blue8, Green8), Red8);\n\n\t\t\tBL16 = _mm_unpacklo_epi8(Blue8, Zero);\n\t\t\tBH16 = _mm_unpackhi_epi8(Blue8, Zero);\n\t\t\tGL16 = _mm_unpacklo_epi8(Green8, Zero);\n\t\t\tGH16 = _mm_unpackhi_epi8(Green8, Zero);\n\t\t\tRL16 = _mm_unpacklo_epi8(Red8, Zero);\n\t\t\tRH16 = _mm_unpackhi_epi8(Red8, Zero);\n\t\t\tMaxL16 = _mm_unpacklo_epi8(Max8, Zero);\n\t\t\tMaxH16 = _mm_unpackhi_epi8(Max8, Zero);\n\n\t\t\tAvgL16 = _mm_srli_epi16(_mm_add_epi16(_mm_add_epi16(BL16, RL16), _mm_slli_epi16(GL16, 1)), 2);\n\t\t\tAvgH16 = _mm_srli_epi16(_mm_add_epi16(_mm_add_epi16(BH16, RH16), _mm_slli_epi16(GH16, 1)), 2);\n\n\t\t\tAmtVal = _mm_mullo_epi16(_mm_sub_epi16(MaxL16, AvgL16), Adjustment128);\n\t\t\tBL16 = _mm_adds_epi16(BL16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxL16, BL16), 2), AmtVal));\n\t\t\tGL16 = _mm_adds_epi16(GL16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxL16, GL16), 2), AmtVal));\n\t\t\tRL16 = _mm_adds_epi16(RL16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxL16, RL16), 2), AmtVal));\n\n\t\t\tAmtVal = _mm_mullo_epi16(_mm_sub_epi16(MaxH16, AvgH16), Adjustment128);\n\t\t\tBH16 = _mm_adds_epi16(BH16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxH16, BH16), 2), AmtVal));\n\t\t\tGH16 = _mm_adds_epi16(GH16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxH16, GH16), 2), AmtVal));\n\t\t\tRH16 = _mm_adds_epi16(RH16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxH16, RH16), 2), AmtVal));\n\n\t\t\tBlue8 = _mm_packus_epi16(BL16, BH16);\n\t\t\tGreen8 = _mm_packus_epi16(GL16, GH16);\n\t\t\tRed8 = _mm_packus_epi16(RL16, RH16);\n\n\t\t\tDest1 = _mm_shuffle_epi8(Blue8, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5));\n\t\t\tDest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green8, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1)));\n\t\t\tDest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red8, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1)));\n\n\t\t\tDest2 = 
_mm_shuffle_epi8(Blue8, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1));\n\t\t\tDest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green8, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10)));\n\t\t\tDest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red8, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1)));\n\n\t\t\tDest3 = _mm_shuffle_epi8(Blue8, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1));\n\t\t\tDest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green8, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1)));\n\t\t\tDest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red8, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15)));\n\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + 0), Dest1);\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + 16), Dest2);\n\t\t\t_mm_storeu_si128((__m128i *)(LinePD + 32), Dest3);\n\t\t}\n\t\tfor (; X < Width; X++) {\n\t\t\tint Blue, Green, Red, Max;\n\t\t\tBlue = LinePS[0], Green = LinePS[1], Red = LinePS[2];\n\t\t\tint Avg = (Blue + Green + Green + Red) >> 2;\n\t\t\tif (Blue > Green)\n\t\t\t\tMax = Blue;\n\t\t\telse\n\t\t\t\tMax = Green;\n\t\t\tif (Red > Max)\n\t\t\t\tMax = Red;\n\t\t\tint AmtVal = (Max - Avg) * VibranceAdjustment;\n\t\t\tif (Blue != Max) Blue += (((Max - Blue) * AmtVal) >> 14);\n\t\t\tif (Green != Max) Green += (((Max - Green) * AmtVal) >> 14);\n\t\t\tif (Red != Max) Red += (((Max - Red) * AmtVal) >> 14);\n\t\t\tif (Red < 0) Red = 0;\n\t\t\telse if (Red > 255) Red = 255;\n\t\t\tif (Green < 0) Green = 0;\n\t\t\telse if (Green > 255) Green = 255;\n\t\t\tif (Blue < 0) Blue = 0;\n\t\t\telse if (Blue > 255) Blue = 255;\n\t\t\tLinePD[0] = Blue;\n\t\t\tLinePD[1] = Green;\n\t\t\tLinePD[2] = Red;\n\t\t\tLinePS += 3;\n\t\t\tLinePD += 3;\n\t\t}\n\t}\n}\n\nint main() {\n\tMat src = imread(\"F:\\\\car.jpg\");\n\tint Height = src.rows;\n\tint Width = src.cols;\n\tunsigned char *Src = src.data;\n\tunsigned 
char *Dest = new unsigned char[Height * Width * 3];\n\tint Stride = Width * 3;\n\tint Radius = 11;\n\tint Adjustment = 50;\n\tint64 st = cv::getTickCount();\n\tfor (int i = 0; i <100; i++) {\n\t\tVibranceAlgorithm_SSE(Src, Dest, Width, Height, Stride, Adjustment);\n\t}\n\t// seconds for 100 runs -> milliseconds per run: / 100 * 1000 = * 10\n\tdouble duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 10;\n\tprintf(\"%.5f\\n\", duration);\n\tVibranceAlgorithm_SSE(Src, Dest, Width, Height, Stride, Adjustment);\n\tMat dst(Height, Width, CV_8UC3, Dest);\n\timshow(\"origin\", src);\n\timshow(\"result\", dst);\n\timwrite(\"F:\\\\res.jpg\", dst);\n\twaitKey(0);\n}"
  },
  {
    "path": "sse_implementation_of_common_functions_in_image_processing.cpp",
    "content": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\nusing namespace std;\nusing namespace cv;\n\n// 函数1: 对数函数的SSE实现，高精度版\ninline __m128 _mm_log_ps(__m128 x)\n{\n\tstatic const __declspec(align(16)) int _ps_min_norm_pos[4] = { 0x00800000, 0x00800000, 0x00800000, 0x00800000 };\n\tstatic const __declspec(align(16)) int _ps_inv_mant_mask[4] = { ~0x7f800000, ~0x7f800000, ~0x7f800000, ~0x7f800000 };\n\tstatic const __declspec(align(16)) int _pi32_0x7f[4] = { 0x7f, 0x7f, 0x7f, 0x7f };\n\tstatic const __declspec(align(16)) float _ps_1[4] = { 1.0f, 1.0f, 1.0f, 1.0f };\n\tstatic const __declspec(align(16)) float _ps_0p5[4] = { 0.5f, 0.5f, 0.5f, 0.5f };\n\tstatic const __declspec(align(16)) float _ps_sqrthf[4] = { 0.707106781186547524f, 0.707106781186547524f, 0.707106781186547524f, 0.707106781186547524f };\n\tstatic const __declspec(align(16)) float _ps_log_p0[4] = { 7.0376836292E-2f, 7.0376836292E-2f, 7.0376836292E-2f, 7.0376836292E-2f };\n\tstatic const __declspec(align(16)) float _ps_log_p1[4] = { -1.1514610310E-1f, -1.1514610310E-1f, -1.1514610310E-1f, -1.1514610310E-1f };\n\tstatic const __declspec(align(16)) float _ps_log_p2[4] = { 1.1676998740E-1f, 1.1676998740E-1f, 1.1676998740E-1f, 1.1676998740E-1f };\n\tstatic const __declspec(align(16)) float _ps_log_p3[4] = { -1.2420140846E-1f, -1.2420140846E-1f, -1.2420140846E-1f, -1.2420140846E-1f };\n\tstatic const __declspec(align(16)) float _ps_log_p4[4] = { 1.4249322787E-1f, 1.4249322787E-1f, 1.4249322787E-1f, 1.4249322787E-1f };\n\tstatic const __declspec(align(16)) float _ps_log_p5[4] = { -1.6668057665E-1f, -1.6668057665E-1f, -1.6668057665E-1f, -1.6668057665E-1f };\n\tstatic const __declspec(align(16)) float _ps_log_p6[4] = { 2.0000714765E-1f, 2.0000714765E-1f, 2.0000714765E-1f, 2.0000714765E-1f };\n\tstatic const __declspec(align(16)) float _ps_log_p7[4] = { -2.4999993993E-1f, -2.4999993993E-1f, -2.4999993993E-1f, -2.4999993993E-1f };\n\tstatic const __declspec(align(16)) float _ps_log_p8[4] = { 
3.3333331174E-1f, 3.3333331174E-1f, 3.3333331174E-1f, 3.3333331174E-1f };\n\tstatic const __declspec(align(16)) float _ps_log_q1[4] = { -2.12194440e-4f, -2.12194440e-4f, -2.12194440e-4f, -2.12194440e-4f };\n\tstatic const __declspec(align(16)) float _ps_log_q2[4] = { 0.693359375f, 0.693359375f, 0.693359375f, 0.693359375f };\n\n\t__m128 one = *(__m128*)_ps_1;\n\t__m128 invalid_mask = _mm_cmple_ps(x, _mm_setzero_ps());\n\t/* cut off denormalized stuff */\n\tx = _mm_max_ps(x, *(__m128*)_ps_min_norm_pos);\n\t__m128i emm0 = _mm_srli_epi32(_mm_castps_si128(x), 23);\n\n\t/* keep only the fractional part */\n\tx = _mm_and_ps(x, *(__m128*)_ps_inv_mant_mask);\n\tx = _mm_or_ps(x, _mm_set1_ps(0.5f));\n\n\temm0 = _mm_sub_epi32(emm0, *(__m128i *)_pi32_0x7f);\n\t__m128 e = _mm_cvtepi32_ps(emm0);\n\te = _mm_add_ps(e, one);\n\n\t__m128 mask = _mm_cmplt_ps(x, *(__m128*)_ps_sqrthf);\n\t__m128 tmp = _mm_and_ps(x, mask);\n\tx = _mm_sub_ps(x, one);\n\te = _mm_sub_ps(e, _mm_and_ps(one, mask));\n\tx = _mm_add_ps(x, tmp);\n\n\t__m128 z = _mm_mul_ps(x, x);\n\t__m128 y = *(__m128*)_ps_log_p0;\n\ty = _mm_mul_ps(y, x);\n\ty = _mm_add_ps(y, *(__m128*)_ps_log_p1);\n\ty = _mm_mul_ps(y, x);\n\ty = _mm_add_ps(y, *(__m128*)_ps_log_p2);\n\ty = _mm_mul_ps(y, x);\n\ty = _mm_add_ps(y, *(__m128*)_ps_log_p3);\n\ty = _mm_mul_ps(y, x);\n\ty = _mm_add_ps(y, *(__m128*)_ps_log_p4);\n\ty = _mm_mul_ps(y, x);\n\ty = _mm_add_ps(y, *(__m128*)_ps_log_p5);\n\ty = _mm_mul_ps(y, x);\n\ty = _mm_add_ps(y, *(__m128*)_ps_log_p6);\n\ty = _mm_mul_ps(y, x);\n\ty = _mm_add_ps(y, *(__m128*)_ps_log_p7);\n\ty = _mm_mul_ps(y, x);\n\ty = _mm_add_ps(y, *(__m128*)_ps_log_p8);\n\ty = _mm_mul_ps(y, x);\n\n\ty = _mm_mul_ps(y, z);\n\ttmp = _mm_mul_ps(e, *(__m128*)_ps_log_q1);\n\ty = _mm_add_ps(y, tmp);\n\ttmp = _mm_mul_ps(z, *(__m128*)_ps_0p5);\n\ty = _mm_sub_ps(y, tmp);\n\ttmp = _mm_mul_ps(e, *(__m128*)_ps_log_q2);\n\tx = _mm_add_ps(x, y);\n\tx = _mm_add_ps(x, tmp);\n\tx = _mm_or_ps(x, invalid_mask); // negative arg will be 
NAN\n\n\treturn x;\n}\n\n// Function 2: low-precision log function, accurate to roughly 2 decimal places\n// Source of the algorithm: https://stackoverflow.com/questions/9411823/fast-log2float-x-implementation-c\ninline float IM_Flog(float val)\n{\n\tunion\n\t{\n\t\tfloat val;\n\t\tint x;\n\t} u = { val };\n\tfloat log_2 = (float)(((u.x >> 23) & 255) - 128);\n\tu.x &= ~(255 << 23);\n\tu.x += (127 << 23);\n\tlog_2 += ((-0.34484843f) * u.val + 2.02466578f) * u.val - 0.67487759f;\n\treturn log_2 * 0.69314718f;\n}\n\n// Function 3: SSE implementation of Function 2\ninline __m128 _mm_flog_ps(__m128 x)\n{\n\t__m128i I = _mm_castps_si128(x);\n\t__m128 log_2 = _mm_cvtepi32_ps(_mm_sub_epi32(_mm_and_si128(_mm_srli_epi32(I, 23), _mm_set1_epi32(255)), _mm_set1_epi32(128)));\n\tI = _mm_and_si128(I, _mm_set1_epi32(-2139095041));        //    ~(255 << 23), clears the exponent bits\n\tI = _mm_add_epi32(I, _mm_set1_epi32(1065353216));        //    127 << 23\n\t__m128 F = _mm_castsi128_ps(I);\n\t__m128 T = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(-0.34484843f), F), _mm_set1_ps(2.02466578f));\n\tT = _mm_sub_ps(_mm_mul_ps(T, F), _mm_set1_ps(0.67487759f));\n\treturn _mm_mul_ps(_mm_add_ps(log_2, T), _mm_set1_ps(0.69314718f));\n}\n\n// Function 4: approximate computation of e^x\ninline float IM_Fexp(float Y)\n{\n\tunion\n\t{\n\t\tdouble Value;\n\t\tint X[2];\n\t} V;\n\tV.X[1] = (int)(Y * 1512775 + 1072632447 + 0.5F);\n\tV.X[0] = 0;\n\treturn (float)V.Value;\n}\n\n// Function 5: SSE implementation of Function 4\ninline __m128 _mm_fexp_ps(__m128 Y)\n{\n\t__m128i T = _mm_cvtps_epi32(_mm_add_ps(_mm_mul_ps(Y, _mm_set1_ps(1512775)), _mm_set1_ps(1072632447)));\n\t__m128i TL = _mm_unpacklo_epi32(_mm_setzero_si128(), T);\n\t__m128i TH = _mm_unpackhi_epi32(_mm_setzero_si128(), T);\n\treturn _mm_movelh_ps(_mm_cvtpd_ps(_mm_castsi128_pd(TL)), _mm_cvtpd_ps(_mm_castsi128_pd(TH)));\n}\n\n// Function 6: approximate implementation of pow(a, b)\ninline float IM_Fpow(float a, float b)\n{\n\tunion\n\t{\n\t\tdouble Value;\n\t\tint X[2];\n\t} V = { a };        //    start from the bit pattern of the base a (assumes a > 0 and a little-endian double layout)\n\tV.X[1] = (int)(b * (V.X[1] - 1072632447) + 1072632447);\n\tV.X[0] = 0;\n\treturn (float)V.Value;\n}\n\n// Function 7: _mm_rcp_ps and _mm_rsqrt_ps return approximations (about 12 bits of precision) of the reciprocal and the reciprocal square root; combining _mm_rcp_ps with one Newton iteration gives a more accurate reciprocal\n__m128 
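_mm_prcp_ps(__m128 a) {\n\t__m128 rcp = _mm_rcp_ps(a); // this intrinsic has only about 12 bits of precision\n\treturn _mm_sub_ps(_mm_add_ps(rcp, rcp), _mm_mul_ps(a, _mm_mul_ps(rcp, rcp))); // x1 = x0 * (2 - d * x0) = 2 * x0 - d * x0 * x0; one Newton-Raphson step raises the precision to about 23 bits\n}\n\n// Function 8: compute a / b directly via the approximate reciprocal (about 12 bits of precision)\n__m128 _mm_fdiv_ps(__m128 a, __m128 b)\n{\n\treturn _mm_mul_ps(a, _mm_rcp_ps(b));\n}\n\n// Function 9: division that returns 0 where the divisor is 0\n// SSE provides no integer division instruction (the reason is not clear to me), so integer division generally has to borrow the floating-point instructions.\n// Another issue with division is that a zero divisor may raise an exception. SSE does not throw in this case, but it should still be avoided.\n// There are many ways to avoid it, such as special-casing a zero divisor or dividing by a very small number instead. Most often, though, the requirement is\n// to return 0 when the divisor is 0, in which case the function below can replace _mm_div_ps.\n// Divides four floats a / b; wherever a component of b is 0, the corresponding result is 0.\n\ninline __m128 _mm_divz_ps(__m128 a, __m128 b)\n{\n\t__m128 Mask = _mm_cmpeq_ps(b, _mm_setzero_ps());\n\treturn _mm_blendv_ps(_mm_div_ps(a, b), _mm_setzero_ps(), Mask);\n}\n\n// Function 10: convert four 32-bit integers to bytes and store them\n// Packs four 32-bit integers into four bytes, saturating each value to the range 0..255.\n\ninline void _mm_storesi128_4char(unsigned char *Dest, __m128i P)\n{\n\t__m128i T = _mm_packs_epi32(P, P);\n\t*((int *)Dest) = _mm_cvtsi128_si32(_mm_packus_epi16(T, T));\n}\n\n// Function 11: load 12 bytes into an XMM register\n// An XMM register is 16 bytes wide, and many SSE operations work in units that are multiples of 4 bytes.\n// In image processing, however, the data is most often a 24-bit color image, i.e. one pixel takes 3 bytes.\n// A plain load instruction would bring in 5 and 1/3 pixels at once, which is very inconvenient for the algorithm.\n// The usual approach is to load 4 pixels, i.e. 12 bytes, and expand them to 16 bytes (adding an alpha value to each pixel).\n// We could of course load 16 bytes and advance by 12 bytes before the next load, but it is also possible\n// to use the 12-byte load function below:\n// Loads 12 bytes from pointer p into an XMM register; the highest 32 bits of the register are cleared to 0.\n\ninline __m128i _mm_loadu_epi96(const __m128i * p)\n{\n\treturn _mm_unpacklo_epi64(_mm_loadl_epi64(p), _mm_cvtsi32_si128(((int *)p)[2]));\n}\n\n// Function 12: store the low 12 bytes of an XMM register\n// Writes the low 12 bytes of register Q to pointer P.\ninline void _mm_storeu_epi96(__m128i *P, __m128i Q)\n{\n\t_mm_storel_epi64(P, Q);\n\t((int *)P)[2] = _mm_cvtsi128_si32(_mm_srli_si128(Q, 8));\n}\n\n// Function 13: integer division by 255 without a divide instruction. The result is truncated, not rounded (e.g. IM_Div255(254) == 0).\ninline int IM_Div255(int V)\n{\n\treturn (((V >> 8) + V + 1) >> 8);        //    V can apparently be negative as well\n}\n \n// Function 14: SSE implementation of Function 13\n// Returns unsigned 16-bit values divided by 255 (truncated): x = ((x + 1) + (x >> 8)) >> 8\n\ninline __m128i _mm_div255_epu16(__m128i x)\n{\n\treturn 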
_mm_srli_epi16(_mm_adds_epu16(_mm_adds_epu16(x, _mm_set1_epi16(1)), _mm_srli_epi16(x, 8)), 8);\n}\n\n// Function 15: sum all elements of an XMM register\n// This is another common need: partial results are accumulated in a register and the elements are added together at the end.\n// You can of course access the elements of the __m128i variable directly, but that reportedly hinders optimization inside the loop. One alternative is to use SSE instructions directly.\n// For example, the code for adding 8 signed short values looks like this:\n//    Sum of 8 signed 16-bit values.\n//    https://stackoverflow.com/questions/31382209/computing-the-inner-product-of-vectors-with-allowed-scalar-values-0-1-and-2-usi/31382878#31382878\n\ninline int _mm_hsum_epi16(__m128i V)                            //    V7 V6 V5 V4 V3 V2 V1 V0\n{\n\t//    V = _mm_unpacklo_epi16(_mm_hadd_epi16(V, _mm_setzero_si128()), _mm_setzero_si128());    this line could be used instead; _mm_hadd_epi16 seems to give the correct result even when the sum exceeds 32768\n\t__m128i T = _mm_madd_epi16(V, _mm_set1_epi16(1));   //    32-bit lanes, high to low: V7+V6    V5+V4    V3+V2    V1+V0\n\tT = _mm_add_epi32(T, _mm_srli_si128(T, 8));            //    V7+V6    V5+V4    V7+V6+V3+V2    V5+V4+V1+V0\n\tT = _mm_add_epi32(T, _mm_srli_si128(T, 4));            //    lowest lane now holds V7+V6+V5+V4+V3+V2+V1+V0; the upper lanes hold partial sums\n\treturn _mm_cvtsi128_si32(T);                        //    extract the low 32 bits\n}\n\n// Function 16: minimum of 16 bytes\n// Suppose we want the minimum of a byte sequence. We would use a function such as _mm_min_epu8 to keep a running minimum over each block of 16 bytes,\n// so that we end up with one 16-byte XMM register that is guaranteed to contain the minimum of the whole sequence.\n// At that point the SSE statements below give the minimum of these 16 bytes:\n// Minimum of 16 bytes; works for unsigned byte data only.\n\ninline int _mm_hmin_epu8(__m128i a)\n{\n\t__m128i L = _mm_unpacklo_epi8(a, _mm_setzero_si128());\n\t__m128i H = _mm_unpackhi_epi8(a, _mm_setzero_si128());\n\treturn _mm_extract_epi16(_mm_min_epu16(_mm_minpos_epu16(L), _mm_minpos_epu16(H)), 0);\n}\n\n// Function 17: maximum of 16 bytes\n// Maximum of 16 bytes; works for unsigned byte data only.\ninline int _mm_hmax_epu8(__m128i a)\n{\n\t__m128i b = _mm_subs_epu8(_mm_set1_epi8(255), a);\n\t__m128i L = _mm_unpacklo_epi8(b, _mm_setzero_si128());\n\t__m128i H = _mm_unpackhi_epi8(b, _mm_setzero_si128());\n\treturn 255 - _mm_extract_epi16(_mm_min_epu16(_mm_minpos_epu16(L), _mm_minpos_epu16(H)), 0);\n}\n\nint main() {\n\n}\n"
  }
]