Repository: BBuf/Image-processing-algorithm-Speed
Branch: master
Commit: d22063d5c5b4
Files: 21
Total size: 246.5 KB

Directory structure:
gitextract_nm0o3cbs/

├── README.md
├── resources/
│   └── SSE指令集补充.md
├── speed_bicubic_zoom_sse.cpp
├── speed_box_filter_sse.cpp
├── speed_common_functions.cpp
├── speed_gaussian_filter_sse.cpp
├── speed_histogram_algorithm_framework/
│   ├── BoxFilter.h
│   ├── Core.h
│   ├── MaxFilter.h
│   ├── SelectiveBlur.h
│   └── Utility.h
├── speed_integral_graph_sse.cpp
├── speed_max_filter_sse.cpp
├── speed_median_filter_3x3_sse.cpp
├── speed_multi_scale_detail_boosting_see.cpp
├── speed_rgb2gray_sse.cpp
├── speed_rgb2yuv_sse.cpp
├── speed_skin_detection_sse.cpp
├── speed_sobel_edgedetection_sse.cpp
├── speed_vibrance_algorithm.cpp
└── sse_implementation_of_common_functions_in_image_processing.cpp

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Introduction

## speed_histogram_algorithm_framework 

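- A local-histogram acceleration framework. Internally it uses some approximate computations and SSE instruction-set acceleration, and it can quickly process median filtering, max filtering, min filtering, surface blur, and similar algorithms.

## resources
- Resources related to SSE optimization.

#### The PC's CPU is an I5-3230, 64-bit.

#### The OpenCV version is 3.4.0.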



- sse_implementation_of_common_functions_in_image_processing.cpp: SSE implementations of several functions commonly used in image processing.
- speed_rgb2gray_sse.cpp: uses SSE to accelerate RGB-to-grayscale conversion, close to a 5x speedup over the original implementation. Algorithm explanation: https://mp.weixin.qq.com/s/SagVQ5gfXWWA7NATv-zvBQ  Speed test results:

> Test CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

| Resolution | Optimization                              | Iterations | Time     |
| ---------- | ----------------------------------------- | ---------- | -------- |
| 4032x3024  | Original implementation                   | 1000       | 12.139ms |
| 4032x3024  | Version 1 (float->INT)                    | 1000       | 7.629ms  |
| 4032x3024  | OpenCV built-in function                  | 1000       | 4.287ms  |
| 4032x3024  | Version 2 (manual 4-way parallel)         | 1000       | 10.528ms |
| 4032x3024  | Version 3 (OpenMP, 4 threads)             | 1000       | 7.632ms  |
| 4032x3024  | Version 4 (SSE, 12 pixels per iteration)  | 1000       | 5.579ms  |
| 4032x3024  | Version 5 (SSE, 15 pixels per iteration)  | 1000       | 5.843ms  |
| 4032x3024  | Version 6 (AVX2, 10 pixels per iteration) | 1000       | 3.576ms  |
| 4032x3024  | Version 7 (AVX2 + std::async)             | 1000       | 2.626ms  |
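
The first optimization step in the table (float->INT) replaces floating-point weights with integer multiplies and a shift. A minimal sketch of that idea, assuming the common BT.601 weights 0.299/0.587/0.114 and a 16-bit scale. The function names `GrayFloat` and `GrayFixed` are hypothetical and are not taken from speed_rgb2gray_sse.cpp:

```c++
#include <cstdint>

// Reference version: floating-point weights (assumed BT.601 coefficients).
inline uint8_t GrayFloat(uint8_t b, uint8_t g, uint8_t r) {
    return static_cast<uint8_t>(0.114f * b + 0.587f * g + 0.299f * r + 0.5f);
}

// Fixed-point version: the weights are scaled by 2^16 and chosen so that they
// sum to exactly 65536; adding 32768 rounds, and the shift replaces a division.
inline uint8_t GrayFixed(uint8_t b, uint8_t g, uint8_t r) {
    const int kB = 7471;   // round(0.114 * 65536)
    const int kG = 38470;  // round(0.587 * 65536)
    const int kR = 19595;  // round(0.299 * 65536)
    return static_cast<uint8_t>((kB * b + kG * g + kR * r + 32768) >> 16);
}
```

Because the integer weights sum to 65536, a pure gray input maps to itself, and the result should differ from the float version by at most one level.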



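- speed_vibrance_algorithm.cpp: uses SSE to accelerate the vibrance (natural saturation) algorithm, about a 9x speedup. For the algorithm, see: https://mp.weixin.qq.com/s/26UVvqMNLgnquXY21Xu3OQ . Speed test results:

|Resolution|Optimization|Iterations|Time|
|----|----|----|----|
|4032x3024|Original implementation|100|115.36ms|
|4032x3024|Version 1|100|62.43ms|
|4032x3024|Version 2 (4 threads)|100|28.89ms|
|4032x3024|Version 3 (SSE)|100|12.69ms|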



- speed_sobel_edgedetection_sse.cpp: uses SSE to accelerate Sobel edge detection, with a very large speedup. For the algorithm, see: https://mp.weixin.qq.com/s/5lCfO_jmSfP7DbsgM7qbpg . Speed test results:

|Resolution|Optimization|Iterations|Time|
|-|-|-|-|
|4032x3024|Plain implementation|1000|126.54 ms|
|4032x3024|Float->INT + lookup table|1000|81.62 ms|
|4032x3024|SSE version 1|1000|34.95 ms|
|4032x3024|SSE version 2|1000|28.87 ms|
|4032x3024|AVX2 version 1|1000|15.42 ms|
|4032x3024|AVX2 + std::async|1000|5.69 ms|
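
The "Float->INT + lookup table" row suggests replacing the per-pixel square root with a table. One way this could look, as a sketch under assumptions: the gradients `gx` and `gy` are integers, the squared magnitude is clamped to 255 * 255 = 65025, and the names `BuildSqrtTable` and `SobelMagnitude` are hypothetical rather than taken from speed_sobel_edgedetection_sse.cpp:

```c++
#include <cmath>
#include <cstdint>
#include <vector>

// table[i] = round(sqrt(i)) for i in [0, 65025]; 65025 = 255 * 255, so every entry fits in a byte.
inline std::vector<uint8_t> BuildSqrtTable() {
    std::vector<uint8_t> table(65026);
    for (int i = 0; i <= 65025; ++i)
        table[i] = static_cast<uint8_t>(std::sqrt(static_cast<float>(i)) + 0.5f);
    return table;
}

// Gradient magnitude sqrt(gx^2 + gy^2), saturated to 255, without a sqrt call per pixel.
inline uint8_t SobelMagnitude(const std::vector<uint8_t> &table, int gx, int gy) {
    int sq = gx * gx + gy * gy;  // |gx|, |gy| <= 1020 for a 3x3 Sobel on 8-bit data, so no overflow
    if (sq > 65025) sq = 65025;  // clamp so the index stays inside the table
    return table[sq];
}
```

The table is built once and reused for every pixel, so the inner loop contains only integer multiplies, an add, a compare, and one load.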

- speed_skin_detection_sse.cpp: uses SSE to accelerate skin detection, with a large speedup. For the algorithm, see: https://mp.weixin.qq.com/s/UFzY1s6ohTM-dnNg0P4kkw . Speed test results:

|Resolution|Optimization|Iterations|Time|
|-|-|-|-|
|4272x2848|Plain implementation|1000|41.40ms|
|4272x2848|OpenMP, 4 threads|1000|36.54ms|
|4272x2848|SSE version 1|1000|6.77ms|
|4272x2848|SSE version 2 (std::async)|1000|4.73ms|

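- speed_rgb2yuv_sse.cpp: heavily SSE-optimized conversion between the RGB and YUV color spaces. For the algorithm, see: https://mp.weixin.qq.com/s/ryGocz-0YpqZ1CjYXJbd7Q . Speed test results:

|Resolution|Optimization|Iterations|Time|
|-|-|-|-|
|4032x3024|Plain implementation|1000|150.58ms|
|4032x3024|Floating point removed, division replaced by bit shifts|1000|76.70ms|
|4032x3024|OpenMP, 4 threads|1000|50.48ms|
|4032x3024|Plain SSE vectorization|1000|48.92ms|
|4032x3024|Second pass using _mm_madd_epi16|1000|33.04ms|
|4032x3024|SSE + 4 threads|1000|23.70ms|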



- speed_median_filter_3x3_sse.cpp: a heavily optimized 3*3 median filter. For the algorithm, see: https://blog.csdn.net/just_sort/article/details/98617050 . Speed test results:

|Resolution|Optimization|Iterations|Time|
|-|-|-|-|
|4032x3024|Plain implementation|10|8293.79 ms|
|4032x3024|Logic optimization, better pipelining|10|83.75 ms|
|4032x3024|SSE optimization|10|11.93 ms|
|4032x3024|AVX optimization|10|9.32 ms|
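
The "logic optimization" row replaces a full sort of each 3x3 window with a fixed sequence of compare-exchange steps. As an illustration only (the repository's own ordering of comparisons may differ), here is the well-known 19-exchange median-of-9 network; `SortPair` and `Median9` are hypothetical names. Because every step is a min/max pair, the same sequence can be expressed with `_mm_min_epu8` / `_mm_max_epu8` to handle 16 pixels at once:

```c++
#include <algorithm>
#include <cstdint>

// Compare-exchange: afterwards a <= b. Only min/max are used, so the
// sequence of operations does not depend on the data.
inline void SortPair(uint8_t &a, uint8_t &b) {
    const uint8_t lo = std::min(a, b);
    const uint8_t hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Median of the nine values at src[0..8], using 19 compare-exchange steps
// on a local copy (the input is left untouched).
inline uint8_t Median9(const uint8_t *src) {
    uint8_t p[9];
    std::copy(src, src + 9, p);
    SortPair(p[1], p[2]); SortPair(p[4], p[5]); SortPair(p[7], p[8]);
    SortPair(p[0], p[1]); SortPair(p[3], p[4]); SortPair(p[6], p[7]);
    SortPair(p[1], p[2]); SortPair(p[4], p[5]); SortPair(p[7], p[8]);
    SortPair(p[0], p[3]); SortPair(p[5], p[8]); SortPair(p[4], p[7]);
    SortPair(p[3], p[6]); SortPair(p[1], p[4]); SortPair(p[2], p[5]);
    SortPair(p[4], p[7]); SortPair(p[4], p[2]); SortPair(p[6], p[4]);
    SortPair(p[4], p[2]);
    return p[4];
}
```

Only the element that ends up in `p[4]` is guaranteed to be in its sorted position; the network does not fully sort the window, which is where the savings come from.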

----------------------------------------------------------------------------------

- speed_gaussian_filter_sse.cpp: uses SSE to accelerate Gaussian filtering. Algorithm explanation: https://blog.csdn.net/just_sort/article/details/95212099 . Speed test results:

| Optimization | Image resolution | Time |
| ------------------- | ---------- | ---- |
| Plain C implementation, single thread | 4032*3024  | 290.43ms |
| SSE optimization, single thread      | 4032*3024  | 265.96ms |

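- speed_integral_graph_sse.cpp: uses SSE to accelerate the integral image computation, but brings no speedup on the PC. For the algorithm, see: https://www.cnblogs.com/Imageshop/p/6897233.html . Speed test results:

|Optimization|Image resolution|Time|
|---------|----------|-------|
|C implementation, single thread|4032*3024|66.66ms|
|C implementation, 4 threads|4032*3024|65.34ms|
|SSE optimization, single thread|4032*3024|66.10ms|
|SSE optimization, 4 threads|4032*3024|66.20ms|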
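
For reference, an integral image can be computed in a single pass with a running row sum. A minimal scalar sketch, assuming a single-channel 8-bit image and an output with one extra row and column of zeros; `IntegralImage` and `BoxSum` are hypothetical names, not the functions in speed_integral_graph_sse.cpp. Each output depends on the value directly to its left, which is one plausible reason why vectorizing brings little gain here:

```c++
#include <cstdint>
#include <vector>

// Returns an (h + 1) x (w + 1) table where entry (y, x) is the sum of src
// over rows [0, y) and columns [0, x).
inline std::vector<int> IntegralImage(const std::vector<uint8_t> &src, int w, int h) {
    std::vector<int> sum((w + 1) * (h + 1), 0);
    for (int y = 0; y < h; ++y) {
        int rowSum = 0;  // running sum of the current row
        for (int x = 0; x < w; ++x) {
            rowSum += src[y * w + x];
            sum[(y + 1) * (w + 1) + (x + 1)] = sum[y * (w + 1) + (x + 1)] + rowSum;
        }
    }
    return sum;
}

// Sum of the rectangle rows [y0, y1) x columns [x0, x1) from four table lookups.
inline int BoxSum(const std::vector<int> &sum, int w, int x0, int y0, int x1, int y1) {
    const int s = w + 1;  // row stride of the table
    return sum[y1 * s + x1] - sum[y0 * s + x1] - sum[y1 * s + x0] + sum[y0 * s + x0];
}
```

With 32-bit sums this sketch is only safe while width * height * 255 fits in an `int`.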


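- speed_common_functions.cpp: fast implementations of some common image-processing functions; a few of them use SSE.
- speed_max_filter_sse.cpp: implements max filtering with the speed_histogram_algorithm_framework framework; the larger the radius, the more pronounced the speedup. For the principle, see: https://blog.csdn.net/just_sort/article/details/97280807 . When running it, remember to turn off SDL checks in the project properties, otherwise an uninitialized-variable error is reported. Speed test results:

|Optimization|Image resolution|Radius|Time|
|---------|----------|-------|-------|
|C implementation, single thread|4272*2848|7|9445.90ms|
|SSE optimization, single thread|4272*2848|7|2234.55ms|
|C implementation, single thread|4272*2848|9|14468.76ms|
|SSE optimization, single thread|4272*2848|9|2221.68ms|
|C implementation, single thread|4272*2848|11|23069.10ms|
|SSE optimization, single thread|4272*2848|11|2180.95ms|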

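- speed_box_filter_sse.cpp: implements an O(1) box filter with the speed_histogram_algorithm_framework framework, using SSE optimization. For the algorithm, see: https://blog.csdn.net/just_sort/article/details/98075712 . It is run the same way as speed_max_filter_sse.cpp. Speed test results:

|Optimization|Image resolution|Radius|Time|
|---------|----------|-------|-------|
|C implementation, single thread|4272*2848|11|163.16ms|
|SSE optimization, single thread|4272*2848|11|123.83ms|
|C implementation, single thread|4272*2848|21|167.81ms|
|SSE optimization, single thread|4272*2848|21|126.98ms|
|C implementation, single thread|4272*2848|31|168.62ms|
|SSE optimization, single thread|4272*2848|31|126.17ms|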
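
The timings above barely change with the radius, which is what a running-sum box filter gives: each output reuses the previous window sum, adding one sample and dropping one. A one-dimensional sketch of that idea, assuming clamped (edge-replicated) borders; `BoxFilter1D` is a hypothetical name and this is not the code in BoxFilter.h:

```c++
#include <algorithm>
#include <cstdint>
#include <vector>

// Mean filter of width 2 * radius + 1 with a cost per output that does not depend on radius.
inline std::vector<uint8_t> BoxFilter1D(const std::vector<uint8_t> &src, int radius) {
    const int n = static_cast<int>(src.size());
    const int size = 2 * radius + 1;
    std::vector<uint8_t> dst(n);
    if (n == 0) return dst;
    // Reads src with the index clamped to [0, n - 1] (edge replication).
    auto at = [&](int i) { return static_cast<int>(src[std::min(std::max(i, 0), n - 1)]); };
    int sum = 0;
    for (int i = -radius; i <= radius; ++i) sum += at(i);  // window centered on index 0
    for (int x = 0; x < n; ++x) {
        dst[x] = static_cast<uint8_t>((sum + size / 2) / size);  // rounded mean
        sum += at(x + radius + 1) - at(x - radius);              // slide the window by one sample
    }
    return dst;
}
```

A two-dimensional filter can apply the same update along rows and then along columns, so the work per pixel stays constant as the radius grows.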

- speed_multi_scale_detail_boosting_see.cpp: builds on the SSE-optimized box filter provided by speed_box_filter_sse.cpp and further uses SIMD instructions to optimize the algorithm from the paper "DARK IMAGE ENHANCEMENT BASED ON PAIRWISE TARGET CONTRAST AND MULTI-SCALE DETAIL BOOSTING". For the algorithm, see: https://blog.csdn.net/just_sort/article/details/98485746  . Speed test results on a Core i7-3770:

|Optimization|Image resolution|Radius|Time|
|---------|----------|-------|-------|
|C implementation, single thread|4272*2848|7|206.00ms|
|SSE optimization, single thread|4272*2848|7|57.12ms|

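- speed_bicubic_zoom_sse.cpp: SSE-optimized bicubic interpolation. For the algorithm, see: https://blog.csdn.net/just_sort/article/details/100119653 . Speed test results:

|Optimization|Image resolution|Output size|Time|
|---------|----------|-------|-------|
|Original algorithm in C|4272*2848|1.5x the original width and height|1856.29ms|
|C implementation + lookup table + border optimization|4272*2848|1.5x the original width and height|839.10ms|
|SSE optimization + border optimization|4272*2848|1.5x the original width and height|315.70ms|
|OpenCV 3.1.0 built-in function|4272*2848|1.5x the original width and height|118.77ms|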




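# I maintain a WeChat official account that shares papers, algorithms, competitions, and everyday life. You are welcome to join.

- If the image does not load, just search for GiantPandaCV.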

![](image/weixin.jpg)


================================================
FILE: resources/SSE指令集补充.md
================================================
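# SSE instruction set notes

- _mm_cvtps_epi32: converts four float values to four int values. Note its rounding rule: under the default rounding mode it rounds to nearest, and exact ties go to the even integer (0.5 becomes 0, 1.5 becomes 2).

- _mm_cvttps_epi32: converts four float values to four int values by truncation toward zero, the same as r = (int)a in C/C++.

- _mm_cvtpd_ps: converts the two double-precision floating-point values of a to single precision. Return value: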

  ```c++
  r0 := (float) a0
  r1 := (float) a1
  r2 := 0.0 ; r3 := 0.0
  ```

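- _mm_movelh_ps: moves the lower two single-precision values of b into the upper two single-precision values of the result; the lower two values come from a.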

  ```c++
  r3 := b1
  r2 := b0
  r1 := a1
  r0 := a0
  ```

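- _mm_cmpneq_ps: compares two sets of single-precision values for inequality. Each 32-bit result is 0 where the elements are equal and all ones (0xffffffff) where they differ.

- _mm_blendv_ps: blend (select) function: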

  ```c++
  __m128 _mm_blendv_ps( 
     __m128 a,
     __m128 b,
     __m128 mask 
  );
  
  r0 := (mask0 & 0x80000000) ? b0 : a0
  r1 := (mask1 & 0x80000000) ? b1 : a1
  r2 := (mask2 & 0x80000000) ? b2 : a2
  r3 := (mask3 & 0x80000000) ? b3 : a3
  ```

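- _mm_packs_epi32: packs the signed 32-bit integers of a and b into signed 16-bit integers, with signed saturation.

- _mm_cvtsi128_si32(a): copies the lowest 32 bits of a into a 32-bit integer; the return value is r = a0.

- _mm_packus_epi16: packs the signed 16-bit integers of a and b into unsigned 8-bit integers, with unsigned saturation.

- _mm_cvtsi32_si128: moves the 32-bit integer a into the low 32 bits of a 128-bit register and zeroes the upper 96 bits, i.e. r0 = a, r1 = r2 = r3 = 0.

- _mm_loadu_si128: loads a 128-bit value; the address does not need to be 16-byte aligned.

- _mm_max_epu8(a,b): compares the corresponding unsigned 8-bit integers of a and b and keeps the larger one, 16 times in total, i.e. r0=max(a0,b0),...,r15=max(a15,b15).

- _mm_min_epi8(a,b): analogous to the above, except that it keeps the smaller value and compares signed 8-bit integers.

- _mm_setzero_si128: sets all 128 bits to 0.

- _mm_subs_epu8(a,b): subtracts the corresponding 8-bit values of a and b, r0=UnsignedSaturate(a0-b0),...,r15=UnsignedSaturate(a15-b15).

- _mm_adds_epi8(a,b): adds the corresponding 8-bit values of a and b, r0=SignedSaturate(a0+b0),...,r15=SignedSaturate(a15+b15).

- _mm_unpackhi_epi64(a,b): interleaves the high 64 bits of a and b and discards the low 64 bits, i.e. r0=a1, r1=b1.

- _mm_srli_si128(a,imm): logically shifts a right by imm bytes (not bits), filling the high bytes with 0.

- _mm_xor_si128(a,b): bitwise XOR of a and b, i.e. r=a^b.

- _mm_or_si128(a,b): bitwise OR of a and b, i.e. r=a|b.

- _mm_and_si128(a,b): bitwise AND of a and b, i.e. r=a&b.

- _mm_cmpgt_epi8(a,b): checks whether each signed 8-bit integer of a is greater than the corresponding 8-bit integer of b; the result byte is 0xff if it is and 0x0 otherwise, i.e. r0=(a0>b0)?0xff:0x0  r1=(a1>b1)?0xff:0x0...r15=(a15>b15)?0xff:0x0

- _mm_unpacklo_epi64(a,b): interleaves the low 64 bits of a and b and discards the high 64 bits, i.e. r0=a0, r1=b0.

- _mm_madd_epi16: returns an __m128i register holding 4 signed 32-bit integers, each the sum of two adjacent products of signed 16-bit elements.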

  ```c++
  r0 := (a0 * b0) + (a1 * b1)
  r1 := (a2 * b2) + (a3 * b3)
  r2 := (a4 * b4) + (a5 * b5)
  r3 := (a6 * b6) + (a7 * b7)
  ```

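- _mm_extract_epi16(a, imm): returns the 16-bit element at position imm.

- _mm_min_epu16: element-wise minimum of the unsigned 16-bit integers of a and b.

- _mm_minpos_epu16: returns a 128-bit value whose lowest 16 bits hold the minimum unsigned 16-bit element found in a and whose next 16 bits hold the index of that element.

- _mm_stream_si32: stores a 32-bit integer to the address given by the pointer, using a non-temporal hint to reduce cache pollution.

- _mm_packus_epi32: packs the signed 32-bit integers of a and b into unsigned 16-bit integers, with unsigned saturation.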

  ```c++
  r0 := (a0 < 0) ? 0 : ((a0 > 0xffff) ? 0xffff : a0)
  r1 := (a1 < 0) ? 0 : ((a1 > 0xffff) ? 0xffff : a1)
  r2 := (a2 < 0) ? 0 : ((a2 > 0xffff) ? 0xffff : a2)
  r3 := (a3 < 0) ? 0 : ((a3 > 0xffff) ? 0xffff : a3)
  r4 := (b0 < 0) ? 0 : ((b0 > 0xffff) ? 0xffff : b0)
  r5 := (b1 < 0) ? 0 : ((b1 > 0xffff) ? 0xffff : b1)
  r6 := (b2 < 0) ? 0 : ((b2 > 0xffff) ? 0xffff : b2)
  r7 := (b3 < 0) ? 0 : ((b3 > 0xffff) ? 0xffff : b3)
  ```

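- _mm_setr_epi32: returns an __m128i register filled with four given int values; the first argument goes to the lowest element.

- _mm_mullo_epi32: returns an __m128i register; multiplies the four 32-bit integers of a and b element-wise and keeps the low 32 bits of each product.

- _mm_hadd_epi32: returns an __m128i register holding the sums of horizontally adjacent 32-bit pairs: r0=a0+a1, r1=a2+a3, r2=b0+b1, r3=b2+b3.

- _mm_madd_epi16: returns an __m128i register; multiplies the 16-bit elements of a and b and then adds adjacent products.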

  ```c++
  r0 := (a0 * b0) + (a1 * b1)
  r1 := (a2 * b2) + (a3 * b3)
  r2 := (a4 * b4) + (a5 * b5)
  r3 := (a6 * b6) + (a7 * b7)
  ```

- _mm_unpackhi_epi8: returns an __m128i register that interleaves the high 8 bytes of a and b.

  ```c++
  r0 := a8 ; r1 := b8
  r2 := a9 ; r3 := b9
  ...
  r14 := a15 ; r15 := b15
  ```

- _mm_unpacklo_epi8: returns an __m128i register that interleaves the low 8 bytes of a and b, i.e. r0 := a0, r1 := b0, ..., r14 := a7, r15 := b7.
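
A few of the rules in these notes are easy to get wrong, so here is a small sketch that exercises them with SSE2-only intrinsics: round-to-nearest-even versus truncation, the byte granularity of `_mm_srli_si128`, and unsigned saturation in `_mm_packus_epi16`. The helper names are hypothetical, and the rounding result assumes the default MXCSR rounding mode:

```c++
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

// Converts four floats using the current rounding mode (round-to-nearest-even by default).
inline std::array<int32_t, 4> RoundNearest(float a, float b, float c, float d) {
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.data()),
                     _mm_cvtps_epi32(_mm_setr_ps(a, b, c, d)));
    return out;
}

// Converts four floats by truncation toward zero, like (int)x in C/C++.
inline std::array<int32_t, 4> Truncate(float a, float b, float c, float d) {
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.data()),
                     _mm_cvttps_epi32(_mm_setr_ps(a, b, c, d)));
    return out;
}

// Shifts the whole register right by 4 bytes: the lowest 32-bit lane is
// dropped and the highest lane becomes 0.
inline std::array<int32_t, 4> ShiftRightFourBytes(int32_t a, int32_t b, int32_t c, int32_t d) {
    std::array<int32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.data()),
                     _mm_srli_si128(_mm_setr_epi32(a, b, c, d), 4));
    return out;
}

// Packs eight signed 16-bit values into bytes with unsigned saturation
// (negative values become 0, values above 255 become 255).
inline std::array<uint8_t, 8> PackToBytes(const std::array<int16_t, 8> &v) {
    const __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i *>(v.data()));
    std::array<uint8_t, 8> out;
    _mm_storel_epi64(reinterpret_cast<__m128i *>(out.data()), _mm_packus_epi16(x, x));
    return out;
}
```

For example, `RoundNearest(0.5f, 1.5f, 2.5f, -1.5f)` gives 0, 2, 2, -2 while `Truncate` on the same inputs gives 0, 1, 2, -1.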

================================================
FILE: speed_bicubic_zoom_sse.cpp
================================================
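#include <stdio.h>
#include <stdlib.h>     // malloc, free
#include <string.h>     // memcpy
#include <math.h>       // floor, sin, fabsf
#include <stdint.h>     // uint8_t
#include <immintrin.h>  // SSE2 / SSSE3 / SSE4.1 intrinsics
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;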

void debug(__m128i var) {
	uint8_t *val = (uint8_t*)&var; // view the register as 16 bytes; use uint16_t or uint32_t to print wider lanes
	printf("Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\n",
		val[0], val[1], val[2], val[3], val[4], val[5],
		val[6], val[7], val[8], val[9], val[10], val[11], val[12], val[13],
		val[14], val[15]);
}

void ConvertBGR8U2BGRAF(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride)
{
	//#pragma omp parallel for
	for (int Y = 0; Y < Height; Y++)
	{
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Width * 4;
		for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 4)
		{
			LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2]; LinePD[3] = 0;
		}
	}
}

void ConvertBGRAF2BGR8U(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride)
{
	//#pragma omp parallel for
	for (int Y = 0; Y < Height; Y++)
	{
		unsigned char *LinePS = Src + Y * Width * 4;
		unsigned char *LinePD = Dest + Y * Stride;
		for (int X = 0; X < Width; X++, LinePS += 4, LinePD += 3)
		{
			LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];
		}
	}
}

void ConvertBGR8U2BGRAF_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	const int BlockSize = 4;
	// Each iteration loads 16 source bytes but uses only 12, so the last 2 pixels of a row go to the scalar tail.
	int Block = (Width - 2) / BlockSize;
	// Expands 4 BGR pixels (12 bytes) to 4 BGRA pixels (16 bytes); an index of -1 produces a zero byte (alpha).
	__m128i Mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);
	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Width * 4;
		int X = 0;
		for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize * 4) {
			// The shuffled register already holds B G R 0 for four pixels, so one 16-byte store writes exactly one block.
			__m128i SrcV = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)LinePS), Mask);
			_mm_storeu_si128((__m128i *)LinePD, SrcV);
		}
		for (; X < Width; X++, LinePS += 3, LinePD += 4) {
			LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];    LinePD[3] = 0;
		}
	}
}

void ConvertBGRAF2BGR8U_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	const int BlockSize = 4;
	// Each iteration stores 16 bytes of which only 12 are final, so the last 2 pixels (6 bytes) of a row go to the scalar tail.
	int Block = (Width - 2) / BlockSize;
	// Packs 4 BGRA pixels into 12 contiguous BGR bytes; the last 4 bytes are zeroed (index -1).
	__m128i Mask = _mm_setr_epi8(0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, -1, -1, -1, -1);
	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Width * 4;
		unsigned char *LinePD = Dest + Y * Stride;
		int X = 0;
		for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 4, LinePD += BlockSize * 3) {
			__m128i SrcV = _mm_loadu_si128((const __m128i*)LinePS);
			// Bytes 12..15 of this store are overwritten by the next block or by the scalar tail below.
			_mm_storeu_si128((__m128i*)LinePD, _mm_shuffle_epi8(SrcV, Mask));
		}
		for (; X < Width; X++, LinePS += 4, LinePD += 3) {
			LinePD[0] = LinePS[0]; LinePD[1] = LinePS[1]; LinePD[2] = LinePS[2];
		}
	}
}

// Clamps the integer Value to the range [Min, Max], inclusive
inline int ClampI(int Value, int Min, int Max) {
	if (Value < Min) return Min;
	else if (Value > Max) return Max;
	else return Value;
}

// Clamps an integer to the byte range [0, 255]
inline unsigned char ClampToByte(int Value) {
	if (Value < 0) return 0;
	else if (Value > 255) return 255;
	else return (unsigned char)Value;
}

// Returns a pointer to the pixel at (PosX, PosY), clamping the coordinates to the image
inline unsigned char *GetCheckedPixel(unsigned char *Src, int Width, int Height, int Stride, int Channel, int PosX, int PosY) {
	return Src + ClampI(PosY, 0, Height - 1) * Stride + ClampI(PosX, 0, Width - 1) * Channel;
}

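// Approximates the interpolation curve sin(x * PI) / (x * PI) with the piecewise cubic fit below
float SinXDivX(float X) {
	const float a = -1; // a can also be -2, -1, -0.75, -0.5, etc.; it controls how sharp or blurry the result is
	X = fabsf(X); // float overload, so the fractional part is never truncated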
	float X2 = X * X, X3 = X2 * X;
	if (X <= 1)
		return (a + 2) * X3 - (a + 3) * X2 + 1;
	else if (X <= 2)
		return a * X3 - (5 * a) * X2 + (8 * a) * X - (4 * a);
	else
		return 0;
}
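
// Illustrative sketch, not part of the original file: Example_CubicWeights is a hypothetical helper.
// It evaluates the same piecewise cubic as SinXDivX for the four taps around a sample with
// fractional offset Frac in [0, 1), taking the sharpness parameter A as an argument.
// For any A the four weights sum to 1, so a constant image stays constant after interpolation.
static inline void Example_CubicWeights(float Frac, float A, float Weights[4]) {
	const float D[4] = { 1 + Frac, Frac, 1 - Frac, 2 - Frac }; // distances from the sample to the four taps
	for (int I = 0; I < 4; I++) {
		float X = D[I], X2 = X * X, X3 = X2 * X;
		if (X <= 1) Weights[I] = (A + 2) * X3 - (A + 3) * X2 + 1;
		else if (X <= 2) Weights[I] = A * X3 - (5 * A) * X2 + (8 * A) * X - (4 * A);
		else Weights[I] = 0;
	}
}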

// Exact evaluation of the interpolation curve sin(x * PI) / (x * PI)
float SinXDivX_Standard(float X) {
	if (fabsf(X) < 0.000001f)
		return 1;
	else
		return sin(X * 3.1415926f) / (X * 3.1415926f);
}

void Bicubic_Original(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, float X, float Y)
{
	int Channel = Stride / Width;
	int PosX = floor(X), PosY = floor(Y);
	float PartXX = X - PosX, PartYY = Y - PosY;

	unsigned char *Pixel00 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY - 1);
	unsigned char *Pixel01 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY - 1);
	unsigned char *Pixel02 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY - 1);
	unsigned char *Pixel03 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY - 1);
	unsigned char *Pixel10 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 0);
	unsigned char *Pixel11 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 0);
	unsigned char *Pixel12 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 0);
	unsigned char *Pixel13 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 0);
	unsigned char *Pixel20 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 1);
	unsigned char *Pixel21 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 1);
	unsigned char *Pixel22 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 1);
	unsigned char *Pixel23 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 1);
	unsigned char *Pixel30 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 2);
	unsigned char *Pixel31 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 2);
	unsigned char *Pixel32 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 2);
	unsigned char *Pixel33 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 2);

	float U0 = SinXDivX(1 + PartXX), U1 = SinXDivX(PartXX);
	float U2 = SinXDivX(1 - PartXX), U3 = SinXDivX(2 - PartXX);
	float V0 = SinXDivX(1 + PartYY), V1 = SinXDivX(PartYY);
	float V2 = SinXDivX(1 - PartYY), V3 = SinXDivX(2 - PartYY);

	for (int I = 0; I < Channel; I++)
	{
		float Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;
		//printf("%.5f\n", Sum1);
		float Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;
		//printf("%.5f\n", Sum2);
		float Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;
		//printf("%.5f\n", Sum3);
		float Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;
		//printf("%.5f\n", Sum4);
		// printf("%d %.5f %.5f %.5f %.5f\n", I, Sum1, Sum2, Sum3, Sum4);
		Pixel[I] = ClampToByte(Sum1 + Sum2 + Sum3 + Sum4 + 0.5f);
	}
}

// ImageShop notes that hard-coding Channel to a fixed value can speed this up considerably; not yet tested
void Bicubic_Border(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, short *SinXDivX_Table, int SrcX, int SrcY) {
	int Channel = Stride / Width;
	int U = (unsigned char)(SrcX >> 8), V = (unsigned char)(SrcY >> 8);

	int U0 = SinXDivX_Table[256 + U], U1 = SinXDivX_Table[U];
	int U2 = SinXDivX_Table[256 - U], U3 = SinXDivX_Table[512 - U];
	int V0 = SinXDivX_Table[256 + V], V1 = SinXDivX_Table[V];
	int V2 = SinXDivX_Table[256 - V], V3 = SinXDivX_Table[512 - V];
	int PosX = SrcX >> 16, PosY = SrcY >> 16;

	unsigned char *Pixel00 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY - 1);
	unsigned char *Pixel01 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY - 1);
	unsigned char *Pixel02 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY - 1);
	unsigned char *Pixel03 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY - 1);
	unsigned char *Pixel10 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 0);
	unsigned char *Pixel11 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 0);
	unsigned char *Pixel12 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 0);
	unsigned char *Pixel13 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 0);
	unsigned char *Pixel20 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 1);
	unsigned char *Pixel21 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 1);
	unsigned char *Pixel22 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 1);
	unsigned char *Pixel23 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 1);
	unsigned char *Pixel30 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 2);
	unsigned char *Pixel31 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 2);
	unsigned char *Pixel32 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 2);
	unsigned char *Pixel33 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 2);

	for (int I = 0; I < Channel; I++)
	{
		int Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;
		int Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;
		int Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;
		int Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;
		Pixel[I] = ClampToByte((Sum1 + Sum2 + Sum3 + Sum4) >> 16);
	}
}
void Bicubic_Center(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, short *SinXDivX_Table, int SrcX, int SrcY)
{
	int Channel = Stride / Width;
	int U = (unsigned char)(SrcX >> 8), V = (unsigned char)(SrcY >> 8);

	int U0 = SinXDivX_Table[256 + U], U1 = SinXDivX_Table[U];
	int U2 = SinXDivX_Table[256 - U], U3 = SinXDivX_Table[512 - U];
	int V0 = SinXDivX_Table[256 + V], V1 = SinXDivX_Table[V];
	int V2 = SinXDivX_Table[256 - V], V3 = SinXDivX_Table[512 - V];
	int PosX = SrcX >> 16, PosY = SrcY >> 16;

	unsigned char *Pixel00 = Src + (PosY - 1) * Stride + (PosX - 1) * Channel;
	unsigned char *Pixel01 = Pixel00 + Channel;
	unsigned char *Pixel02 = Pixel01 + Channel;
	unsigned char *Pixel03 = Pixel02 + Channel;
	unsigned char *Pixel10 = Pixel00 + Stride;
	unsigned char *Pixel11 = Pixel10 + Channel;
	unsigned char *Pixel12 = Pixel11 + Channel;
	unsigned char *Pixel13 = Pixel12 + Channel;
	unsigned char *Pixel20 = Pixel10 + Stride;
	unsigned char *Pixel21 = Pixel20 + Channel;
	unsigned char *Pixel22 = Pixel21 + Channel;
	unsigned char *Pixel23 = Pixel22 + Channel;
	unsigned char *Pixel30 = Pixel20 + Stride;
	unsigned char *Pixel31 = Pixel30 + Channel;
	unsigned char *Pixel32 = Pixel31 + Channel;
	unsigned char *Pixel33 = Pixel32 + Channel;
	for (int I = 0; I < Channel; I++)
	{
		int Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;
		int Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;
		int Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;
		int Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;
		Pixel[I] = ClampToByte((Sum1 + Sum2 + Sum3 + Sum4) >> 16);
	}
}

// Original interpolation algorithm
void IM_Resize_Cubic_Origin(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {
	int Channel = StrideS / SrcW;
	if ((SrcW == DstW) && (SrcH == DstH)) {
		memcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));
		return;
	}
	printf("%d\n", Channel);
	for (int Y = 0; Y < DstH; Y++)
	{
		unsigned char *LinePD = Dest + Y * StrideD;
		float SrcY = (Y + 0.4999999f) * SrcH / DstH - 0.5f;
		for (int X = 0; X < DstW; X++)
		{
			float SrcX = (X + 0.4999999f) * SrcW / DstW - 0.5f;
			Bicubic_Original(Src, SrcW, SrcH, StrideS, LinePD, SrcX, SrcY);
			LinePD += Channel;
		}
	}
}

// Lookup-table + interpolation algorithm implemented in C
void IM_Resize_Cubic_Table(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {
	int Channel = StrideS / SrcW;
	if ((SrcW == DstW) && (SrcH == DstH)) {
		memcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));
		return;
	}
	short *SinXDivX_Table = (short *)malloc(513 * sizeof(short));
	for (int I = 0; I < 513; I++)
		SinXDivX_Table[I] = int(0.5 + 256 * SinXDivX(I / 256.0f)); // build the lookup table in fixed point (scaled by 256)
	int AddX = (SrcW << 16) / DstW, AddY = (SrcH << 16) / DstH;
	int ErrorX = -(1 << 15) + (AddX >> 1), ErrorY = -(1 << 15) + (AddY >> 1);

	int StartX = ((1 << 16) - ErrorX) / AddX + 1;			//	compute the borders that need special handling
	int StartY = ((1 << 16) - ErrorY) / AddY + 1;			//	y0+y*yr>=1; y0=ErrorY => y>=(1-ErrorY)/yr
	int EndX = (((SrcW - 3) << 16) - ErrorX) / AddX + 1;
	int EndY = (((SrcH - 3) << 16) - ErrorY) / AddY + 1;	//	y0+y*yr<=(height-3) => y<=(height-3-ErrorY)/yr
	if (StartY >= DstH)			StartY = DstH;
	if (StartX >= DstW)			StartX = DstW;
	if (EndX < StartX)			EndX = StartX;
	if (EndY < StartY)			EndY = StartY;
	// Print the borders
	//printf("%d %d %d %d\n", StartX, StartY, EndX, EndY);
	int SrcY = ErrorY;
	for (int Y = 0; Y < StartY; Y++, SrcY += AddY)			//	leading rows whose sampling window is not entirely inside the image
	{
		unsigned char *LinePD = Dest + Y * StrideD;
		for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
		{
			Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
		}
	}
	for (int Y = StartY; Y < EndY; Y++, SrcY += AddY)
	{
		int SrcX = ErrorX;
		unsigned char *LinePD = Dest + Y * StrideD;
		for (int X = 0; X < StartX; X++, SrcX += AddX, LinePD += Channel)
		{
			Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
		}
		for (int X = StartX; X < EndX; X++, SrcX += AddX, LinePD += Channel)
		{
			Bicubic_Center(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
		}
		for (int X = EndX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
		{
			Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
		}
	}
	for (int Y = EndY; Y < DstH; Y++, SrcY += AddY)
	{
		unsigned char *LinePD = Dest + Y * StrideD;
		for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
		{
			Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
		}
	}
	free(SinXDivX_Table);
}
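
// Illustrative sketch, not part of the original file: the Example_ names are hypothetical.
// IM_Resize_Cubic_Table walks the source image in 16.16 fixed point. AddX is the step per
// destination pixel and ErrorX the pixel-center offset, so ErrorX + X * AddX approximates
// (X + 0.5) * SrcW / DstW - 0.5. The high 16 bits give the integer source position and the
// next 8 bits give the fractional part used to index the weight table.
// Assumes >> on a negative int is an arithmetic shift, as the surrounding code does.
struct Example_SrcCoord {
	int Pos;  // integer source position (may be -1 near the left border)
	int Frac; // fractional part in 1/256 units, 0..255
};

static inline Example_SrcCoord Example_MapToSource(int X, int SrcW, int DstW) {
	int AddX = (SrcW << 16) / DstW;
	int ErrorX = -(1 << 15) + (AddX >> 1);
	int SrcX = ErrorX + X * AddX;
	Example_SrcCoord C = { SrcX >> 16, (unsigned char)(SrcX >> 8) };
	return C;
}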

// Horizontal sum of four signed 32-bit integers
inline int _mm_hsum_epi32(__m128i V) { // lanes, high to low: V3 V2 V1 V0
	__m128i T = _mm_add_epi32(V, _mm_srli_si128(V, 8)); // V3	V2	V1+V3	V0+V2
	T = _mm_add_epi32(T, _mm_srli_si128(T, 4)); // low lane now holds (V0+V2) + (V1+V3)
	return _mm_cvtsi128_si32(T); // extract the low 32 bits
}

// Bicubic interpolation optimized with SSE
// Maximum supported image size: 32767*32767
void IM_Resize_SSE(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {
	int Channel = StrideS / SrcW;
	if ((SrcW == DstW) && (SrcH == DstH)) {
		memcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));
		return;
	}
	short *SinXDivX_Table = (short *)malloc(513 * sizeof(short));
	short *Table = (short *)malloc(DstW * 4 * sizeof(short));
	for (int I = 0; I < 513; I++)
		SinXDivX_Table[I] = int(0.5 + 256 * SinXDivX(I / 256.0f)); //	build the lookup table in fixed point (scaled by 256)
	int AddX = (SrcW << 16) / DstW, AddY = (SrcH << 16) / DstH;
	int ErrorX = -(1 << 15) + (AddX >> 1), ErrorY = -(1 << 15) + (AddY >> 1);

	int StartX = ((1 << 16) - ErrorX) / AddX + 1;			//	compute the borders that need special handling
	int StartY = ((1 << 16) - ErrorY) / AddY + 1;			//	y0+y*yr>=1; y0=ErrorY => y>=(1-ErrorY)/yr
	int EndX = (((SrcW - 3) << 16) - ErrorX) / AddX + 1;
	int EndY = (((SrcH - 3) << 16) - ErrorY) / AddY + 1;	//	y0+y*yr<=(height-3) => y<=(height-3-ErrorY)/yr
	if (StartY >= DstH)			StartY = DstH;
	if (StartX >= DstW)			StartX = DstW;
	if (EndX < StartX)			EndX = StartX;
	if (EndY < StartY)			EndY = StartY;
	for (int X = StartX, SrcX = ErrorX + StartX * AddX; X < EndX; X++, SrcX += AddX) { // columns handled by the SSE path
		int U = (unsigned char)(SrcX >> 8);
		Table[X * 4 + 0] = SinXDivX_Table[256 + U]; // per-column weight table laid out for SSE loads
		Table[X * 4 + 1] = SinXDivX_Table[U];
		Table[X * 4 + 2] = SinXDivX_Table[256 - U];
		Table[X * 4 + 3] = SinXDivX_Table[512 - U];
	}
	int SrcY = ErrorY;
	for (int Y = 0; Y < StartY; Y++, SrcY += AddY) { // same as in IM_Resize_Cubic_Table
		unsigned char *LinePD = Dest + Y * StrideD;
		for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel) {
			Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
		}
	}
	for (int Y = StartY; Y < EndY; Y++, SrcY += AddY) {
		int SrcX = ErrorX;
		unsigned char *LinePD = Dest + Y * StrideD;
		for (int X = 0; X < StartX; X++, SrcX += AddX, LinePD += Channel) {
			Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
		}
		int V = (unsigned char)(SrcY >> 8);
		unsigned char *LineY = Src + ((SrcY >> 16) - 1) * StrideS;
		__m128i PartY = _mm_setr_epi32(SinXDivX_Table[256 + V], SinXDivX_Table[V], SinXDivX_Table[256 - V], SinXDivX_Table[512 - V]);
		for (int X = StartX; X < EndX; X++, SrcX += AddX, LinePD += Channel) {
			__m128i PartX = _mm_loadl_epi64((__m128i *)(Table + X * 4));
			//PartX: U0 U1 U2 U3 U0 U1 U2 U3 
			PartX = _mm_unpacklo_epi64(PartX, PartX);
			unsigned char *Pixel0 = LineY + ((SrcX >> 16) - 1) * Channel;
			unsigned char *Pixel1 = Pixel0 + StrideS;
			unsigned char *Pixel2 = Pixel1 + StrideS;
			unsigned char *Pixel3 = Pixel2 + StrideS;
			if (Channel == 1) {
				__m128i P01 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel0)), _mm_cvtsi32_si128(*((int *)Pixel1)))); //	P00 P01 P02 P03 P10 P11 P12 P13
				__m128i P23 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel2)), _mm_cvtsi32_si128(*((int *)Pixel3)))); //	P20 P21 P22 P23 P30 P31 P32 P33
				__m128i Sum01 = _mm_madd_epi16(P01, PartX); // P00 * U0 + P01 * U1		P02 * U2 + P03 * U3		 P10 * U0 + P11 * U1		P12 * U2 + P13 * U3
				__m128i Sum23 = _mm_madd_epi16(P23, PartX); // P20 * U0 + P21 * U1		P22 * U2 + P23 * U3		 P30 * U0 + P31 * U1		P32 * U2 + P33 * U3
				__m128i Sum = _mm_hadd_epi32(Sum01, Sum23); // P00 * U0 + P01 * U1 + P02 * U2 + P03 * U3	 P10 * U0 + P11 * U1 + P12 * U2 + P13 * U3	P20 * U0 + P21 * U1	+ P22 * U2 + P23 * U3	P30 * U0 + P31 * U1 + P32 * U2 + P33 * U3
				LinePD[0] = ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(Sum, PartY)) >> 16);
			}
			else if (Channel == 4) {
				__m128i P0 = _mm_loadu_si128((__m128i *)Pixel0), P1 = _mm_loadu_si128((__m128i *)Pixel1);
				__m128i P2 = _mm_loadu_si128((__m128i *)Pixel2), P3 = _mm_loadu_si128((__m128i *)Pixel3);
				P0 = _mm_shuffle_epi8(P0, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));	 // B0 G0 R0 A0
				P1 = _mm_shuffle_epi8(P1, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));	 //	B1 G1 R1 A1
				P2 = _mm_shuffle_epi8(P2, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));	 // B2 G2 R2 A2
				P3 = _mm_shuffle_epi8(P3, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));	 //	B3 G3 R3 A3

				__m128i BG01 = _mm_unpacklo_epi32(P0, P1);		//	B0 B1 G0 G1
				__m128i RA01 = _mm_unpackhi_epi32(P0, P1);		//	R0 R1 A0 A1
				__m128i BG23 = _mm_unpacklo_epi32(P2, P3);		//	B2 B3 G2 G3
				__m128i RA23 = _mm_unpackhi_epi32(P2, P3);		//	R2 R3 A2 A3

				__m128i B01 = _mm_unpacklo_epi8(BG01, _mm_setzero_si128());
				__m128i B23 = _mm_unpacklo_epi8(BG23, _mm_setzero_si128());
				__m128i SumB = _mm_hadd_epi32(_mm_madd_epi16(B01, PartX), _mm_madd_epi16(B23, PartX));

				__m128i G01 = _mm_unpackhi_epi8(BG01, _mm_setzero_si128());
				__m128i G23 = _mm_unpackhi_epi8(BG23, _mm_setzero_si128());
				__m128i SumG = _mm_hadd_epi32(_mm_madd_epi16(G01, PartX), _mm_madd_epi16(G23, PartX));

				__m128i R01 = _mm_unpacklo_epi8(RA01, _mm_setzero_si128());
				__m128i R23 = _mm_unpacklo_epi8(RA23, _mm_setzero_si128());
				__m128i SumR = _mm_hadd_epi32(_mm_madd_epi16(R01, PartX), _mm_madd_epi16(R23, PartX));

				__m128i A01 = _mm_unpackhi_epi8(RA01, _mm_setzero_si128());
				__m128i A23 = _mm_unpackhi_epi8(RA23, _mm_setzero_si128());
				__m128i SumA = _mm_hadd_epi32(_mm_madd_epi16(A01, PartX), _mm_madd_epi16(A23, PartX));

				__m128i Result = _mm_setr_epi32(_mm_hsum_epi32(_mm_mullo_epi32(SumB, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumG, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumR, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumA, PartY)));
				Result = _mm_srai_epi32(Result, 16);
				//	*((int *)LinePD) = _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(Result, Result), Result));
				_mm_stream_si32((int *)LinePD, _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(Result, Result), Result)));

				//LinePD[0] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumB, PartY)) >> 16);	//	Some results really do fall outside the unsigned char range because of the fixed-point arithmetic, hence the clamp
				//LinePD[1] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumG, PartY)) >> 16);
				//LinePD[2] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumR, PartY)) >> 16);
				//LinePD[3] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumA, PartY)) >> 16);
			}
		}
		for (int X = EndX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
		{
			Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
		}
	}
	for (int Y = EndY; Y < DstH; Y++, SrcY += AddY)
	{
		unsigned char *LinePD = Dest + Y * StrideD;
		for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
		{
			Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
		}
	}
	free(Table);
	free(SinXDivX_Table);
}

int main() {
	Mat src = imread("F:\\car.jpg");
	int Height = src.rows;
	int Width = src.cols;
	int Stride = Width * 3;
	unsigned char *Src = src.data;
	unsigned char *Buffer = new unsigned char[Height * Width * 4];
	ConvertBGR8U2BGRAF(Src, Buffer, Width, Height, Stride);
	int SrcW = Width;
	int SrcH = Height;
	int StrideS = Width * 4;
	int DstW = Width * 15 / 10;
	int DstH = Height * 15 / 10;
	unsigned char *Res = new unsigned char[DstH * DstW * 4];
	unsigned char *Dest = new unsigned char[DstH * DstW * 3];
	int StrideD = DstW * 4;
	int64 st = cvGetTickCount();
	for (int i = 0; i < 10; i++) {
		IM_Resize_SSE(Buffer, Res, SrcW, SrcH, StrideS, DstW, DstH, StrideD);
	}
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;	// seconds * 1000 / 10 runs = average milliseconds per run
	printf("%.5f\n", duration);
	IM_Resize_Cubic_Origin(Buffer, Res, SrcW, SrcH, StrideS, DstW, DstH, StrideD);
	ConvertBGRAF2BGR8U(Res, Dest, DstW, DstH, DstW * 3);
	Mat dst(DstH, DstW, CV_8UC3, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
}

================================================
FILE: speed_box_filter_sse.cpp
================================================
#include <stdio.h>
#include <opencv2/opencv.hpp>
#include "../../OpencvTest/OpencvTest/Core.h"
#include "../../OpencvTest/OpencvTest/MaxFilter.h"
#include "../../OpencvTest/OpencvTest/Utility.h"
#include "../../OpencvTest/OpencvTest/BoxFilter.h"
using namespace std;
using namespace cv;

void BoxBlur_1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) {
	TMatrix a, b;
	TMatrix *p1 = &a, *p2 = &b;
	TMatrix **p3 = &p1, **p4 = &p2;
	IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3);
	IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4);
	(p1)->Data = Src;
	(p2)->Data = Dest;
	BoxBlur(p1, p2, Radius, EdgeMode::Smear);
}

void BoxBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) {
	TMatrix a, b;
	TMatrix *p1 = &a, *p2 = &b;
	TMatrix **p3 = &p1, **p4 = &p2;
	IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3);
	IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4);
	(p1)->Data = Src;
	(p2)->Data = Dest;
	BoxBlur_SSE(p1, p2, Radius, EdgeMode::Smear);
}


int main() {
	Mat src = imread("F:\\car.jpg");
	int Height = src.rows;
	int Width = src.cols;
	unsigned char *Src = src.data;
	unsigned char *Dest = new unsigned char[Height * Width * 3];
	int Stride = Width * 3;
	int Radius = 11;
	int64 st = cvGetTickCount();
	for (int i = 0; i <10; i++) {
		//Mat temp = MaxFilter(src, Radius);
		BoxBlur_SSE(Src, Dest, Width, Height, Stride, 3, Radius);
	}
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;	// seconds * 1000 / 10 runs = average milliseconds per run
	printf("%.5f\n", duration);
	BoxBlur_SSE(Src, Dest, Width, Height, Stride, 3, Radius);
	Mat dst(Height, Width, CV_8UC3, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
	return 0;
}

================================================
FILE: speed_common_functions.cpp
================================================
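#include <math.h>		// floor, ceil
#include <stdlib.h>		// rand, RAND_MAX
#include <emmintrin.h>	// SSE2 integer intrinsics (__m128i, _mm_add_epi16, _mm_sub_epi16)

// Approximation: lets the approximate Pow/Exp routines access the two 32-bit halves of a double.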
union Approximation
{
	double Value;
	int X[2];
};

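// Function 1: clamp a value to the range of the Byte data type, [0, 255].
// Reference: http://www.cnblogs.com/zyl910/archive/2012/03/12/noifopex1.html
// Summary: saturate with bit masks; the masks are generated with signed right shifts.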
unsigned char ClampToByte(int Value){
	return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));
}
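
/*
A minimal sketch of how the bit-mask clamp above could be cross-checked against a
plain branching clamp. ClampToByteRef, ClampToByteMask and ClampVersionsAgree are
hypothetical names that are not part of this repository. The sketch assumes a 32-bit
int and an arithmetic right shift for negative values, which ClampToByte also relies on.

```cpp
#include <cassert>

// Hypothetical reference implementation: ordinary branching clamp to [0, 255].
static unsigned char ClampToByteRef(int Value) {
	if (Value < 0) return 0;
	if (Value > 255) return 255;
	return (unsigned char)Value;
}

// Same expression as ClampToByte, copied here so the sketch is self-contained.
// (255 - Value) >> 31 is all ones when Value > 255, which forces the result to 255.
// ~(Value >> 31) is zero when Value < 0, which forces the result to 0.
static unsigned char ClampToByteMask(int Value) {
	return (unsigned char)((Value | ((255 - Value) >> 31)) & ~(Value >> 31));
}

// Returns true when both versions agree on every input in [-100000, 100000].
static bool ClampVersionsAgree() {
	for (int V = -100000; V <= 100000; V++)
		if (ClampToByteRef(V) != ClampToByteMask(V)) return false;
	return true;
}
```

The range is kept well inside int so that 255 - Value cannot overflow.
*/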

// Function 2: clamp a value to a given range.
// Reference: none
// Summary: none
int ClampToInt(int Value, int Min, int Max) {
	if (Value < Min) return Min;
	else if (Value > Max) return Max;
	else return Value;
}

// Function 3: integer division by 255.
// Reference: none
// Summary: implemented with shifts.
int Div255(int Value) {
	return (((Value >> 8) + Value + 1) >> 8);
}
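
/*
A small sketch that checks where the shift-based division agrees with Value / 255.
Div255Shift and Div255ExactForByteProducts are hypothetical names used only for this
illustration. The range tested is the product of two bytes, 0 to 255 * 255, which is
an assumption about the typical use in alpha blending, not something stated above.

```cpp
#include <cassert>

// Same expression as Div255, copied so the sketch is self-contained.
static int Div255Shift(int Value) {
	return ((Value >> 8) + Value + 1) >> 8;
}

// Returns true when the shift form equals integer division for every byte product.
static bool Div255ExactForByteProducts() {
	for (int V = 0; V <= 255 * 255; V++)
		if (Div255Shift(V) != V / 255) return false;
	return true;
}
```

The identity is not valid for arbitrary inputs: Div255Shift(65535) gives 256 while
65535 / 255 is 257, so callers should keep the argument within the checked range.
*/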

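// Function 4: absolute value.
// Reference: https://oi-wiki.org/math/bit/
// Summary: intended to be faster than n > 0 ? n : -n

int Abs(int n) {
	return (n ^ (n >> 31)) - (n >> 31);
	/* n >> 31 extracts the sign of n: it is 0 when n is non-negative and -1 when n is negative.
	If n is non-negative, n ^ 0 = n, so the value is unchanged.
	If n is negative, n ^ -1 flips every bit of the two's-complement representation,
	which yields |n| - 1; subtracting -1 then gives the absolute value. */
}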

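// Function 5: round to the nearest integer (halves round away from zero).
// Reference: none
// Summary: none
double Round(double V)
{
	return (V > 0.0) ? floor(V + 0.5) : ceil(V - 0.5);	// ceil handles zero and negative inputs
}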

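// Function 6: return a random number in [0, 1).
// Reference: none
// Summary: rand() is divided by RAND_MAX + 1.0, so the result is never negative and never reaches 1.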
double Rand()
{
	return (double)rand() / (RAND_MAX + 1.0);
}

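// Function 7: approximate Pow, for double and float.
// Reference: http://www.cvchina.info/2010/03/19/log-pow-exp-approximation/
// Reference: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/
// Summary: an approximation meant only for speed; the error ranges from about 5% to 12%.
//          X[1] is treated as the high 32 bits of the double, which assumes a little-endian layout.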
double Pow(double X, double Y)
{
	Approximation V = { X };
	V.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);
	V.X[0] = 0;
	return V.Value;
}


float Pow(float X, float Y)
{
	Approximation V = { X };
	V.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);
	V.X[0] = 0;
	return (float)V.Value;
}

// Function 8: approximate Exp, for double and float.
double Exp(double Y)			//	the union-based approach is somewhat faster
{
	Approximation V;
	V.X[1] = (int)(Y * 1485963 + 1072632447);
	V.X[0] = 0;
	return V.Value;
}

float Exp(float Y)			//	the union-based approach is somewhat faster
{
	Approximation V;
	V.X[1] = (int)(Y * 1485963 + 1072632447);
	V.X[0] = 0;
	return (float)V.Value;
}

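// Function 9: a somewhat more accurate Pow approximation, at the cost of some speed.
// Note: the squaring loop below only terminates when the integer part of Y is non-negative.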
// http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/
// Besides that, I also have now a slower approximation that has much less error
// when the exponent is larger than 1. It makes use exponentiation by squaring,
// which is exact for the integer part of the exponent, and uses only the exponent’s fraction for the approximation:
// should be much more precise with large Y

double PrecisePow(double X, double Y){
	// calculate approximation with fraction of the exponent
	int e = (int)Y;
	Approximation V = { X };
	V.X[1] = (int)((Y - e) * (V.X[1] - 1072632447) + 1072632447);
	V.X[0] = 0;
	// exponentiation by squaring with the exponent's integer part
	// double r = u.d makes everything much slower, not sure why
	double r = 1.0;
	while (e)
	{
		if (e & 1)	r *= X;
		X *= X;
		e >>= 1;
	}
	return r * V.Value;
}

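// Function 10: return a random integer between Min and Max (both inclusive).
// Reference: none
// Summary: Min is the smallest value that can be returned, Max the largest.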
int Random(int Min, int Max){
	return rand() % (Max + 1 - Min) + Min;
}

// Function 11: sign function.
// Reference: none
// Summary: none
int sgn(int X){
	if (X > 0) return 1;
	if (X < 0) return -1;
	return 0;
}

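// Function 12: extract the color components packed in an integer (R in the low byte, then G, then B).
// Reference: none
// Summary: none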
void GetRGB(int Color, int *R, int *G, int *B){
	*R = Color & 255;
	*G = (Color & 65280) / 256;
	*B = (Color & 16711680) / 65536;
}

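// Function 13: approximate the arithmetic square root of a number with Newton's method.
// Reference: https://www.cnblogs.com/qlky/p/7735145.html
// Summary: still an approximation; it computes a fast inverse square root and returns its reciprocal.
float Sqrt(float X)
{
	float HalfX = 0.5f * X;             // the bit trick below is valid for float only, not double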
	int I = *(int*)&X;                  // get bits for floating VALUE 
	I = 0x5f375a86 - (I >> 1);          // gives initial guess y0
	X = *(float*)&I;                    // convert bits BACK to float
	X = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy
	X = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy
	X = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy
	return 1 / X;
}
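
/*
A hedged sketch of the same approximation written with memcpy instead of pointer
casts, which sidesteps the strict-aliasing concerns of *(int*)&X. SqrtSketch is a
hypothetical name; the magic constant and the three Newton steps are taken from the
function above, and the sketch assumes float and std::int32_t are both 32 bits wide.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Approximate square root of a positive float via the fast inverse square root.
static float SqrtSketch(float X) {
	float HalfX = 0.5f * X;
	std::int32_t I;
	std::memcpy(&I, &X, sizeof(I));      // read the bits of the float
	I = 0x5f375a86 - (I >> 1);           // initial guess for 1 / sqrt(X)
	float Y;
	std::memcpy(&Y, &I, sizeof(Y));      // turn the bits back into a float
	Y = Y * (1.5f - HalfX * Y * Y);      // Newton step 1
	Y = Y * (1.5f - HalfX * Y * Y);      // Newton step 2
	Y = Y * (1.5f - HalfX * Y * Y);      // Newton step 3
	return 1.0f / Y;                     // sqrt(X) = 1 / (1 / sqrt(X))
}
```

The input is assumed to be strictly positive; zero and negative values are not
handled meaningfully by this trick.
*/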

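// Function 14: add unsigned short histogram data (256 bins), i.e. Y = X + Y
// Reference: none
// Summary: SSE-optimized; X and Y are dereferenced as __m128i, so both must be 16-byte aligned
void HistgramAddShort(unsigned short *X, unsigned short *Y)
{
	*(__m128i*)(Y + 0) = _mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]);		//	Don't bother trying to beat this with hand-written assembly; it has already been tried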
	*(__m128i*)(Y + 8) = _mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);
	*(__m128i*)(Y + 16) = _mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);
	*(__m128i*)(Y + 24) = _mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);
	*(__m128i*)(Y + 32) = _mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);
	*(__m128i*)(Y + 40) = _mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]);
	*(__m128i*)(Y + 48) = _mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]);
	*(__m128i*)(Y + 56) = _mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]);
	*(__m128i*)(Y + 64) = _mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]);
	*(__m128i*)(Y + 72) = _mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]);
	*(__m128i*)(Y + 80) = _mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]);
	*(__m128i*)(Y + 88) = _mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]);
	*(__m128i*)(Y + 96) = _mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]);
	*(__m128i*)(Y + 104) = _mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]);
	*(__m128i*)(Y + 112) = _mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]);
	*(__m128i*)(Y + 120) = _mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]);
	*(__m128i*)(Y + 128) = _mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]);
	*(__m128i*)(Y + 136) = _mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]);
	*(__m128i*)(Y + 144) = _mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]);
	*(__m128i*)(Y + 152) = _mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]);
	*(__m128i*)(Y + 160) = _mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]);
	*(__m128i*)(Y + 168) = _mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]);
	*(__m128i*)(Y + 176) = _mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]);
	*(__m128i*)(Y + 184) = _mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]);
	*(__m128i*)(Y + 192) = _mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]);
	*(__m128i*)(Y + 200) = _mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]);
	*(__m128i*)(Y + 208) = _mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]);
	*(__m128i*)(Y + 216) = _mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]);
	*(__m128i*)(Y + 224) = _mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]);
	*(__m128i*)(Y + 232) = _mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]);
	*(__m128i*)(Y + 240) = _mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]);
	*(__m128i*)(Y + 248) = _mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);
}
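
/*
For comparison, a loop-based sketch of the same 256-bin addition. It uses unaligned
loads and stores, so it does not share the 16-byte alignment requirement of the
unrolled version. HistgramAddShortLoop and HistgramAddLoopMatchesScalar are
hypothetical names, and no claim is made here about relative speed.

```cpp
#include <cassert>
#include <emmintrin.h>

// Y = X + Y over 256 unsigned short bins, eight bins per SSE2 register.
static void HistgramAddShortLoop(const unsigned short *X, unsigned short *Y) {
	for (int I = 0; I < 256; I += 8) {
		__m128i A = _mm_loadu_si128((const __m128i *)(X + I));
		__m128i B = _mm_loadu_si128((const __m128i *)(Y + I));
		_mm_storeu_si128((__m128i *)(Y + I), _mm_add_epi16(A, B));
	}
}

// Fills two histograms with known values and checks the sum bin by bin.
static bool HistgramAddLoopMatchesScalar() {
	unsigned short X[256], Y[256];
	for (int I = 0; I < 256; I++) {
		X[I] = (unsigned short)I;
		Y[I] = (unsigned short)(2 * I);
	}
	HistgramAddShortLoop(X, Y);
	for (int I = 0; I < 256; I++)
		if (Y[I] != 3 * I) return false;
	return true;
}
```

The same loop shape would apply to the subtract and add-subtract variants by
swapping the arithmetic intrinsic.
*/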

// Function 15: subtract unsigned short histogram data (256 bins), i.e. Y = Y - X
// Reference: none
// Summary: SSE-optimized
void HistgramSubShort(unsigned short *X, unsigned short *Y)
{
	*(__m128i*)(Y + 0) = _mm_sub_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]);
	*(__m128i*)(Y + 8) = _mm_sub_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);
	*(__m128i*)(Y + 16) = _mm_sub_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);
	*(__m128i*)(Y + 24) = _mm_sub_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);
	*(__m128i*)(Y + 32) = _mm_sub_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);
	*(__m128i*)(Y + 40) = _mm_sub_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]);
	*(__m128i*)(Y + 48) = _mm_sub_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]);
	*(__m128i*)(Y + 56) = _mm_sub_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]);
	*(__m128i*)(Y + 64) = _mm_sub_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]);
	*(__m128i*)(Y + 72) = _mm_sub_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]);
	*(__m128i*)(Y + 80) = _mm_sub_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]);
	*(__m128i*)(Y + 88) = _mm_sub_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]);
	*(__m128i*)(Y + 96) = _mm_sub_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]);
	*(__m128i*)(Y + 104) = _mm_sub_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]);
	*(__m128i*)(Y + 112) = _mm_sub_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]);
	*(__m128i*)(Y + 120) = _mm_sub_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]);
	*(__m128i*)(Y + 128) = _mm_sub_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]);
	*(__m128i*)(Y + 136) = _mm_sub_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]);
	*(__m128i*)(Y + 144) = _mm_sub_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]);
	*(__m128i*)(Y + 152) = _mm_sub_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]);
	*(__m128i*)(Y + 160) = _mm_sub_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]);
	*(__m128i*)(Y + 168) = _mm_sub_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]);
	*(__m128i*)(Y + 176) = _mm_sub_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]);
	*(__m128i*)(Y + 184) = _mm_sub_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]);
	*(__m128i*)(Y + 192) = _mm_sub_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]);
	*(__m128i*)(Y + 200) = _mm_sub_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]);
	*(__m128i*)(Y + 208) = _mm_sub_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]);
	*(__m128i*)(Y + 216) = _mm_sub_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]);
	*(__m128i*)(Y + 224) = _mm_sub_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]);
	*(__m128i*)(Y + 232) = _mm_sub_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]);
	*(__m128i*)(Y + 240) = _mm_sub_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]);
	*(__m128i*)(Y + 248) = _mm_sub_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);
}

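// Function 16: combined add and subtract on unsigned short histogram data (256 bins), i.e. Z = Z + Y - X
// Reference: none
// Summary: SSE-optimized
void HistgramSubAddShort(unsigned short *X, unsigned short *Y, unsigned short *Z)
{
	*(__m128i*)(Z + 0) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&Z[0]), *(__m128i*)&X[0]);						//	Don't bother trying to beat this with hand-written assembly; it has already been tried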
	*(__m128i*)(Z + 8) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&Z[8]), *(__m128i*)&X[8]);
	*(__m128i*)(Z + 16) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&Z[16]), *(__m128i*)&X[16]);
	*(__m128i*)(Z + 24) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&Z[24]), *(__m128i*)&X[24]);
	*(__m128i*)(Z + 32) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&Z[32]), *(__m128i*)&X[32]);
	*(__m128i*)(Z + 40) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&Z[40]), *(__m128i*)&X[40]);
	*(__m128i*)(Z + 48) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&Z[48]), *(__m128i*)&X[48]);
	*(__m128i*)(Z + 56) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&Z[56]), *(__m128i*)&X[56]);
	*(__m128i*)(Z + 64) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&Z[64]), *(__m128i*)&X[64]);
	*(__m128i*)(Z + 72) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&Z[72]), *(__m128i*)&X[72]);
	*(__m128i*)(Z + 80) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&Z[80]), *(__m128i*)&X[80]);
	*(__m128i*)(Z + 88) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&Z[88]), *(__m128i*)&X[88]);
	*(__m128i*)(Z + 96) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&Z[96]), *(__m128i*)&X[96]);
	*(__m128i*)(Z + 104) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&Z[104]), *(__m128i*)&X[104]);
	*(__m128i*)(Z + 112) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&Z[112]), *(__m128i*)&X[112]);
	*(__m128i*)(Z + 120) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&Z[120]), *(__m128i*)&X[120]);
	*(__m128i*)(Z + 128) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&Z[128]), *(__m128i*)&X[128]);
	*(__m128i*)(Z + 136) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&Z[136]), *(__m128i*)&X[136]);
	*(__m128i*)(Z + 144) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&Z[144]), *(__m128i*)&X[144]);
	*(__m128i*)(Z + 152) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&Z[152]), *(__m128i*)&X[152]);
	*(__m128i*)(Z + 160) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&Z[160]), *(__m128i*)&X[160]);
	*(__m128i*)(Z + 168) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&Z[168]), *(__m128i*)&X[168]);
	*(__m128i*)(Z + 176) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&Z[176]), *(__m128i*)&X[176]);
	*(__m128i*)(Z + 184) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&Z[184]), *(__m128i*)&X[184]);
	*(__m128i*)(Z + 192) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&Z[192]), *(__m128i*)&X[192]);
	*(__m128i*)(Z + 200) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&Z[200]), *(__m128i*)&X[200]);
	*(__m128i*)(Z + 208) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&Z[208]), *(__m128i*)&X[208]);
	*(__m128i*)(Z + 216) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&Z[216]), *(__m128i*)&X[216]);
	*(__m128i*)(Z + 224) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&Z[224]), *(__m128i*)&X[224]);
	*(__m128i*)(Z + 232) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&Z[232]), *(__m128i*)&X[232]);
	*(__m128i*)(Z + 240) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&Z[240]), *(__m128i*)&X[240]);
	*(__m128i*)(Z + 248) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&Z[248]), *(__m128i*)&X[248]);
}


================================================
FILE: speed_gaussian_filter_sse.cpp
================================================
#include <stdio.h>
#include <opencv2/opencv.hpp>

using namespace std;
using namespace cv;

void CalcGaussCof(float Radius, float &B0, float &B1, float &B2, float &B3)
{
	float Q, B;
	if (Radius >= 2.5)
		Q = (double)(0.98711 * Radius - 0.96330);                            //    corresponds to Eq. 11b of the paper
	else if ((Radius >= 0.5) && (Radius < 2.5))
		Q = (double)(3.97156 - 4.14554 * sqrt(1 - 0.26891 * Radius));
	else
		Q = (double)0.1147705018520355224609375;

	B = 1.57825 + 2.44413 * Q + 1.4281 * Q * Q + 0.422205 * Q * Q * Q;        //    corresponds to Eq. 8c of the paper
	B1 = 2.44413 * Q + 2.85619 * Q * Q + 1.26661 * Q * Q * Q;
	B2 = -1.4281 * Q * Q - 1.26661 * Q * Q * Q;
	B3 = 0.422205 * Q * Q * Q;

	B0 = 1.0 - (B1 + B2 + B3) / B;
	B1 = B1 / B;
	B2 = B2 / B;
	B3 = B3 / B;
}
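
/*
A sketch that restates the coefficient computation in double precision and checks
one property that follows directly from the formulas above: B0 + B1 + B2 + B3 = 1,
so the recursive filter leaves a constant image unchanged. GaussCofSketch and
GaussCofSum are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Same formulas as CalcGaussCof, with the normalization by B applied first.
static void GaussCofSketch(double Radius, double &B0, double &B1, double &B2, double &B3) {
	double Q;
	if (Radius >= 2.5)
		Q = 0.98711 * Radius - 0.96330;
	else if (Radius >= 0.5)
		Q = 3.97156 - 4.14554 * std::sqrt(1.0 - 0.26891 * Radius);
	else
		Q = 0.1147705018520355224609375;
	double B = 1.57825 + 2.44413 * Q + 1.4281 * Q * Q + 0.422205 * Q * Q * Q;
	B1 = (2.44413 * Q + 2.85619 * Q * Q + 1.26661 * Q * Q * Q) / B;
	B2 = (-1.4281 * Q * Q - 1.26661 * Q * Q * Q) / B;
	B3 = (0.422205 * Q * Q * Q) / B;
	B0 = 1.0 - (B1 + B2 + B3);
}

// Sum of the four coefficients for a given radius; expected to be 1 up to rounding.
static double GaussCofSum(double Radius) {
	double B0, B1, B2, B3;
	GaussCofSketch(Radius, B0, B1, B2, B3);
	return B0 + B1 + B2 + B3;
}
```

The check covers all three branches of the Q selection (Radius below 0.5, between
0.5 and 2.5, and 2.5 or above).
*/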

void ConvertBGR8U2BGRAF(unsigned char *Src, float *Dest, int Width, int Height, int Stride)
{
	//#pragma omp parallel for
	for (int Y = 0; Y < Height; Y++)
	{
		unsigned char *LinePS = Src + Y * Stride;
		float *LinePD = Dest + Y * Width * 3;
		for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3)
		{
			LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];
		}
	}
}

void ConvertBGR8U2BGRAF_SSE(unsigned char *Src, float *Dest, int Width, int Height, int Stride) {
	const int BlockSize = 4;
	int Block = (Width - 2) / BlockSize;
	__m128i Mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);
	__m128i Zero = _mm_setzero_si128();
	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Stride;
		float *LinePD = Dest + Y * Width * 4;
		int X = 0;
		for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize * 4) {
			__m128i SrcV = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)LinePS), Mask);
			__m128i Src16L = _mm_unpacklo_epi8(SrcV, Zero);
			__m128i Src16H = _mm_unpackhi_epi8(SrcV, Zero);
			_mm_store_ps(LinePD + 0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16L, Zero)));
			_mm_store_ps(LinePD + 4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16L, Zero)));
			_mm_store_ps(LinePD + 8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16H, Zero)));
			_mm_store_ps(LinePD + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16H, Zero)));
		}
		for (; X < Width; X++, LinePS += 3, LinePD += 4) {
			LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];    LinePD[3] = 0;
		}
	}
}

void GaussBlurFromLeftToRight(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
	//#pragma omp parallel for
	for (int Y = 0; Y < Height; Y++)
	{
		float *LinePD = Data + Y * Width * 3;
		//w[n-1], w[n-2], w[n-3]
		float BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0]; // replicate the edge pixel at the border
		float GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1];
		float RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2];
		for (int X = 0; X < Width; X++, LinePD += 3)
		{
			LinePD[0] = LinePD[0] * B0 + BS1 * B1 + BS2 * B2 + BS3 * B3;
			LinePD[1] = LinePD[1] * B0 + GS1 * B1 + GS2 * B2 + GS3 * B3;         // forward recursion
			LinePD[2] = LinePD[2] * B0 + RS1 * B1 + RS2 * B2 + RS3 * B3;
			BS3 = BS2, BS2 = BS1, BS1 = LinePD[0];
			GS3 = GS2, GS2 = GS1, GS1 = LinePD[1];
			RS3 = RS2, RS2 = RS1, RS1 = LinePD[2];
		}
	}
}
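
/*
A single-channel sketch of the forward and backward recursion used by the row
passes, reduced to one std::vector so the data flow is easier to follow.
RecursivePass1D and ConstantSignalMaxError are hypothetical names, and the
coefficients in the check are illustrative values that sum to 1, not values
produced by CalcGaussCof.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Forward then backward third-order recursive pass over one channel.
// The three history values start as copies of the edge sample (edge replication).
static void RecursivePass1D(std::vector<float> &Data, float B0, float B1, float B2, float B3) {
	if (Data.empty()) return;
	float S1 = Data.front(), S2 = S1, S3 = S1;          // w[n-1], w[n-2], w[n-3]
	for (std::size_t I = 0; I < Data.size(); I++) {
		float V = Data[I] * B0 + S1 * B1 + S2 * B2 + S3 * B3;
		S3 = S2; S2 = S1; S1 = V;
		Data[I] = V;
	}
	S1 = S2 = S3 = Data.back();                         // w[n+1], w[n+2], w[n+3]
	for (std::size_t I = Data.size(); I-- > 0;) {
		float V = Data[I] * B0 + S1 * B1 + S2 * B2 + S3 * B3;
		S3 = S2; S2 = S1; S1 = V;
		Data[I] = V;
	}
}

// Largest deviation from the input level after filtering a constant signal.
static float ConstantSignalMaxError(float Level, std::size_t Length) {
	std::vector<float> Data(Length, Level);
	RecursivePass1D(Data, 0.4f, 0.9f, -0.4f, 0.1f);     // illustrative coefficients, sum = 1
	float MaxError = 0.0f;
	for (float V : Data) MaxError = std::fmax(MaxError, std::fabs(V - Level));
	return MaxError;
}
```

A constant input should come out essentially unchanged whenever the coefficients
sum to 1, which makes this a convenient sanity check for the border handling.
*/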

void GaussBlurFromLeftToRight_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {
	const __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);
	const __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);
	const __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);
	const __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);
	for (int Y = 0; Y < Height; Y++) {
		float *LinePD = Data + Y * Width * 4;
		__m128 V1 = _mm_set_ps(LinePD[3], LinePD[2], LinePD[1], LinePD[0]);
		__m128 V2 = V1, V3 = V1;
		for (int X = 0; X < Width; X++, LinePD += 4) {
			__m128 V0 = _mm_load_ps(LinePD);
			__m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));
			__m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));
			__m128 V = _mm_add_ps(V01, V23);
			V3 = V2; V2 = V1; V1 = V;
			_mm_store_ps(LinePD, V);
		}
	}
}

void GaussBlurFromRightToLeft(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {
	for (int Y = 0; Y < Height; Y++) {
		//w[n+1], w[n+2], w[n+3]
		float *LinePD = Data + Y * Width * 3 + (Width - 1) * 3;	// start at the last pixel of the row
		float BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0]; // replicate the edge pixel at the border
		float GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1];
		float RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2];
		for (int X = Width - 1; X >= 0; X--, LinePD -= 3)
		{
			LinePD[0] = LinePD[0] * B0 + BS3 * B1 + BS2 * B2 + BS1 * B3;
			LinePD[1] = LinePD[1] * B0 + GS3 * B1 + GS2 * B2 + GS1 * B3;         // backward recursion
			LinePD[2] = LinePD[2] * B0 + RS3 * B1 + RS2 * B2 + RS1 * B3;
			BS1 = BS2, BS2 = BS3, BS3 = LinePD[0];
			GS1 = GS2, GS2 = GS3, GS3 = LinePD[1];
			RS1 = RS2, RS2 = RS3, RS3 = LinePD[2];
		}
	}
}

void GaussBlurFromRightToLeft_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {
	const __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);
	const __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);
	const __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);
	const __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);
	for (int Y = 0; Y < Height; Y++) {
		float *LinePD = Data + Y * Width * 4 + (Width - 1) * 4;	// start at the last pixel of the row
		__m128 V1 = _mm_set_ps(LinePD[3], LinePD[2], LinePD[1], LinePD[0]);
		__m128 V2 = V1, V3 = V1;
		for (int X = Width - 1; X >= 0; X--, LinePD -= 4) {
			__m128 V0 = _mm_load_ps(LinePD);
			__m128 V03 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V3));
			__m128 V12 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V1));
			__m128 V = _mm_add_ps(V03, V12);
			V1 = V2; V2 = V3; V3 = V;
			_mm_store_ps(LinePD, V);
		}
	}
}


//w[n] w[n-1], w[n-2], w[n-3]
void GaussBlurFromTopToBottom(float *Data, int Width, int Height, float B0, float B1, float B2, float B3)
{
	for (int Y = 0; Y < Height; Y++)
	{
		float *LinePD3 = Data + (Y + 0) * Width * 3;
		float *LinePD2 = Data + (Y + 1) * Width * 3;
		float *LinePD1 = Data + (Y + 2) * Width * 3;
		float *LinePD0 = Data + (Y + 3) * Width * 3;
		for (int X = 0; X < Width; X++, LinePD0 += 3, LinePD1 += 3, LinePD2 += 3, LinePD3 += 3)
		{
			LinePD0[0] = LinePD0[0] * B0 + LinePD1[0] * B1 + LinePD2[0] * B2 + LinePD3[0] * B3;
			LinePD0[1] = LinePD0[1] * B0 + LinePD1[1] * B1 + LinePD2[1] * B2 + LinePD3[1] * B3;
			LinePD0[2] = LinePD0[2] * B0 + LinePD1[2] * B1 + LinePD2[2] * B2 + LinePD3[2] * B3;
		}
	}
}

void GaussBlurFromTopToBottom_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3){
	const  __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);
	const  __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);
	const  __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);
	const  __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);
	for (int Y = 0; Y < Height; Y++)
	{
		float *LinePS3 = Data + (Y + 0) * Width * 4;
		float *LinePS2 = Data + (Y + 1) * Width * 4;
		float *LinePS1 = Data + (Y + 2) * Width * 4;
		float *LinePS0 = Data + (Y + 3) * Width * 4;
		for (int X = 0; X < Width * 4; X += 4)
		{
			__m128 V3 = _mm_load_ps(LinePS3 + X);
			__m128 V2 = _mm_load_ps(LinePS2 + X);
			__m128 V1 = _mm_load_ps(LinePS1 + X);
			__m128 V0 = _mm_load_ps(LinePS0 + X);
			__m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));
			__m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));
			_mm_store_ps(LinePS0 + X, _mm_add_ps(V01, V23));
		}
	}
}
//w[n] w[n+1], w[n+2], w[n+3]
void GaussBlurFromBottomToTop(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {
	for (int Y = Height - 1; Y >= 0; Y--) {
		float *LinePD3 = Data + (Y + 3) * Width * 3;
		float *LinePD2 = Data + (Y + 2) * Width * 3;
		float *LinePD1 = Data + (Y + 1) * Width * 3;
		float *LinePD0 = Data + (Y + 0) * Width * 3;
		for (int X = 0; X < Width; X++, LinePD0 += 3, LinePD1 += 3, LinePD2 += 3, LinePD3 += 3) {
			LinePD0[0] = LinePD0[0] * B0 + LinePD1[0] * B1 + LinePD2[0] * B2 + LinePD3[0] * B3;
			LinePD0[1] = LinePD0[1] * B0 + LinePD1[1] * B1 + LinePD2[1] * B2 + LinePD3[1] * B3;
			LinePD0[2] = LinePD0[2] * B0 + LinePD1[2] * B1 + LinePD2[2] * B2 + LinePD3[2] * B3;
		}
	}
}

void GaussBlurFromBottomToTop_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {
	const  __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);
	const  __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);
	const  __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);
	const  __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);
	for (int Y = Height - 1; Y >= 0; Y--) {
		float *LinePS3 = Data + (Y + 3) * Width * 4;
		float *LinePS2 = Data + (Y + 2) * Width * 4;
		float *LinePS1 = Data + (Y + 1) * Width * 4;
		float *LinePS0 = Data + (Y + 0) * Width * 4;
		for (int X = 0; X < Width * 4; X += 4) {
			__m128 V3 = _mm_load_ps(LinePS3 + X);
			__m128 V2 = _mm_load_ps(LinePS2 + X);
			__m128 V1 = _mm_load_ps(LinePS1 + X);
			__m128 V0 = _mm_load_ps(LinePS0 + X);
			__m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));
			__m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));
			_mm_store_ps(LinePS0 + X, _mm_add_ps(V01, V23));
		}
	}
}

void ConvertBGRAF2BGR8U(float *Src, unsigned char *Dest, int Width, int Height, int Stride)
{
	//#pragma omp parallel for
	for (int Y = 0; Y < Height; Y++)
	{
		float *LinePS = Src + Y * Width * 3;
		unsigned char *LinePD = Dest + Y * Stride;
		for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3)
		{
			LinePD[0] = LinePS[0];    LinePD[1] = LinePS[1];    LinePD[2] = LinePS[2];
		}
	}
}


void ConvertBGRAF2BGR8U_SSE(float *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	const int BlockSize = 4;
	int Block = (Width - 2) / BlockSize;	// each block stores 16 bytes but only 12 are valid, so the last pixels go through the scalar tail
	// Gathers B, G, R of four packed BGRA pixels into the low 12 bytes; the top 4 bytes are zeroed.
	__m128i Mask = _mm_setr_epi8(0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, -1, -1, -1, -1);
	for (int Y = 0; Y < Height; Y++) {
		float *LinePS = Src + Y * Width * 4;	// source rows hold 4 floats per pixel and are 16-byte aligned
		unsigned char *LinePD = Dest + Y * Stride;
		int X = 0;
		for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 4, LinePD += BlockSize * 3) {
			// Convert four BGRA float pixels to 32-bit integers (truncating, like the scalar version).
			__m128i P0 = _mm_cvttps_epi32(_mm_load_ps(LinePS + 0));
			__m128i P1 = _mm_cvttps_epi32(_mm_load_ps(LinePS + 4));
			__m128i P2 = _mm_cvttps_epi32(_mm_load_ps(LinePS + 8));
			__m128i P3 = _mm_cvttps_epi32(_mm_load_ps(LinePS + 12));
			// Narrow 32 -> 16 -> 8 bits with saturation: B0 G0 R0 A0 ... B3 G3 R3 A3.
			__m128i BGRA = _mm_packus_epi16(_mm_packs_epi32(P0, P1), _mm_packs_epi32(P2, P3));
			// Drop the alpha bytes; the 4 extra bytes written here are overwritten by the next block or the tail.
			_mm_storeu_si128((__m128i*)LinePD, _mm_shuffle_epi8(BGRA, Mask));
		}
		for (; X < Width; X++, LinePS += 4, LinePD += 3) {
			LinePD[0] = (unsigned char)LinePS[0]; LinePD[1] = (unsigned char)LinePS[1]; LinePD[2] = (unsigned char)LinePS[2];
		}
	}
}

void GaussBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius)
{
	float B0, B1, B2, B3;
	float *Buffer = (float *)malloc(Width * (Height + 6) * sizeof(float) * 3);
	CalcGaussCof(Radius, B0, B1, B2, B3);
	ConvertBGR8U2BGRAF(Src, Buffer + 3 * Width * 3, Width, Height, Stride);
	GaussBlurFromLeftToRight(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3);
	GaussBlurFromRightToLeft(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3);        //    If multithreading is enabled, consider folding this pass into the row loop of GaussBlurFromLeftToRight, which reduces contention between concurrent threads

	memcpy(Buffer + 0 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));
	memcpy(Buffer + 1 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));
	memcpy(Buffer + 2 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float));

	GaussBlurFromTopToBottom(Buffer, Width, Height, B0, B1, B2, B3);

	memcpy(Buffer + (Height + 3) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));
	memcpy(Buffer + (Height + 4) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));
	memcpy(Buffer + (Height + 5) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float));

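	GaussBlurFromBottomToTop(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3);        //    start at the first image row so the pass reads the three padding rows copied below the image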

	ConvertBGRAF2BGR8U(Buffer + 3 * Width * 3, Dest, Width, Height, Stride);

	free(Buffer);
}

void GaussBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius)
{
	float B0, B1, B2, B3;
	float *Buffer = (float *)_mm_malloc(Width * (Height + 6) * sizeof(float) * 4, 16);
	CalcGaussCof(Radius, B0, B1, B2, B3);
	ConvertBGR8U2BGRAF_SSE(Src, Buffer + 3 * Width * 4, Width, Height, Stride);
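	GaussBlurFromLeftToRight_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3);        //    In the SSE version these two functions take more time than the two vertical passes below; the same holds for the C version
	GaussBlurFromRightToLeft_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3);        //    If multithreading is enabled, consider folding this pass into the row loop of GaussBlurFromLeftToRight, which reduces contention between concurrent threads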

	memcpy(Buffer + 0 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));
	memcpy(Buffer + 1 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));
	memcpy(Buffer + 2 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float));

	GaussBlurFromTopToBottom_SSE(Buffer, Width, Height, B0, B1, B2, B3);

	memcpy(Buffer + (Height + 3) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));
	memcpy(Buffer + (Height + 4) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));
	memcpy(Buffer + (Height + 5) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float));

	GaussBlurFromBottomToTop_SSE(Buffer, Width, Height, B0, B1, B2, B3);

	ConvertBGRAF2BGR8U_SSE(Buffer + 3 * Width * 4, Dest, Width, Height, Stride);

	_mm_free(Buffer);
}

int main() {
	Mat src = imread("F:\\car.jpg");
	if (src.empty()) {
		printf("failed to load the input image\n");
		return -1;
	}
	int Height = src.rows;
	int Width = src.cols;
	unsigned char *Src = src.data;
	unsigned char *Dest = new unsigned char[Height * Width * 3];
	int Stride = Width * 3;		// assumes a continuous 3-channel 8-bit image, as returned by imread
	int Radius = 11;
	const int Loops = 20;
	int64 st = cv::getTickCount();
	for (int i = 0; i < Loops; i++) {
		GaussBlur_SSE(Src, Dest, Width, Height, Stride, Radius);
	}
	// average time per call in milliseconds
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 1000 / Loops;
	printf("%.5f\n", duration);
	GaussBlur_SSE(Src, Dest, Width, Height, Stride, Radius);
	Mat dst(Height, Width, CV_8UC3, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
	delete[] Dest;		// dst only wraps Dest, so release it after the last use
	return 0;
}

================================================
FILE: speed_histogram_algorithm_framework/BoxFilter.h
================================================
#pragma once
#include "Core.h"
#include "Utility.h"

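// Function: box blur of an image.
// Parameters:
// Src: data structure of the source image to process
// Dest: data structure that receives the processed image
// Radius: blur radius, valid range [1, 1000]
// Edge: how edge pixels are handled, 0 repeats the edge pixels, 1 mirrors them (the body currently always passes EdgeMode::Smear)
// Notes:
// 1. Handles 8-bit grayscale and 24-bit images.
// 2. Src and Dest may be the same image, although speed is affected in that case.
// 3. The initialization cost of the SSE-optimized version depends on the radius, so the run time varies slightly with it.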

IS_RET BoxBlur(TMatrix *Src, TMatrix *Dest, int Radius, EdgeMode Edge) {
	if (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;
	if (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
	IS_RET Ret = IS_RET_OK;
	TMatrix *Row = NULL, *Col = NULL;
	int *RowPos, *ColPos, *ColSum, *Diff;
	int X, Y, Z, Width, Height, Channel, Index;
	int Value, ValueB, ValueG, ValueR;
	int Size = 2 * Radius + 1, Amount = Size * Size, HalfAmount = Amount / 2;
	Width = Src->Width;
	Height = Src->Height;
	Channel = Src->Channel;
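	Ret = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, EdgeMode::Smear, &Row, &Col);		//	get the coordinate offsets
	if (Ret != IS_RET_OK) return Ret;
	RowPos = ((int *)Row->Data);
	ColPos = ((int *)Col->Data);
	ColSum = (int *)IS_AllocMemory(Width * Channel * sizeof(int), true);
	Diff = (int *)IS_AllocMemory((Width - 1) * Channel * sizeof(int), true);
	unsigned char *RowData = (unsigned char *)IS_AllocMemory((Width + 2 * Radius) * Channel, true);
	TMatrix *p = NULL;				//	32-bit matrix that receives the horizontal window sums
	Ret = IS_CreateMatrix(Width, Height, IS_DEPTH_32S, Channel, &p);
	if (Ret != IS_RET_OK) {			//	release what was allocated so far
		IS_FreeMatrix(&Row);	IS_FreeMatrix(&Col);
		IS_FreeMemory(ColSum);	IS_FreeMemory(Diff);	IS_FreeMemory(RowData);
		return Ret;
	}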
	for (Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src->Data + Y * Src->WidthStep;
		int *LinePD = (int *)(p->Data + Y * p->WidthStep);
		//	copy one row of data, plus the edge values, into the temporary buffer
		if (Channel == 1)
		{
			for (X = 0; X < Radius; X++)
				RowData[X] = LinePS[RowPos[X]];
			memcpy(RowData + Radius, LinePS, Width);
			for (X = Radius + Width; X < Radius + Width + Radius; X++)
				RowData[X] = LinePS[RowPos[X]];
		}
		else if (Channel == 3)
		{
			for (X = 0; X < Radius; X++)
			{
				Index = RowPos[X] * 3;
				RowData[X * 3] = LinePS[Index];
				RowData[X * 3 + 1] = LinePS[Index + 1];
				RowData[X * 3 + 2] = LinePS[Index + 2];
			}
			memcpy(RowData + Radius * 3, LinePS, Width * 3);
			for (X = Radius + Width; X < Radius + Width + Radius; X++)
			{
				Index = RowPos[X] * 3;
				RowData[X * 3 + 0] = LinePS[Index + 0];
				RowData[X * 3 + 1] = LinePS[Index + 1];
				RowData[X * 3 + 2] = LinePS[Index + 2];
			}
		}
		unsigned char *AddPos = RowData + Size * Channel;
		unsigned char *SubPos = RowData;
		for (X = 0; X < (Width - 1) * Channel; X++)
			Diff[X] = AddPos[X] - SubPos[X];
		//	the first pixel needs special handling
		if (Channel == 1)
		{
			for (Z = 0, Value = 0; Z < Size; Z++)	Value += RowData[Z];
			LinePD[0] = Value;

			for (X = 1; X < Width; X++)
			{
				Value += Diff[X - 1];	LinePD[X] = Value;				//	much faster this way
			}
		}
		else if (Channel == 3)
		{
			for (Z = 0, ValueB = ValueG = ValueR = 0; Z < Size; Z++)
			{
				ValueB += RowData[Z * 3 + 0];
				ValueG += RowData[Z * 3 + 1];
				ValueR += RowData[Z * 3 + 2];
			}
			LinePD[0] = ValueB;	LinePD[1] = ValueG;	LinePD[2] = ValueR;

			for (X = 1; X < Width; X++)
			{
				Index = X * 3;
				ValueB += Diff[Index - 3];		LinePD[Index + 0] = ValueB;
				ValueG += Diff[Index - 2];		LinePD[Index + 1] = ValueG;
				ValueR += Diff[Index - 1];		LinePD[Index + 2] = ValueR;
			}
		}
	}
	for (Y = 0; Y < Size - 1; Y++)			//	note: the last row of the window is not included here
	{
		int *LinePS = (int *)(p->Data + ColPos[Y] * p->WidthStep);
		for (X = 0; X < Width * Channel; X++)	ColSum[X] += LinePS[X];
	}

	for (Y = 0; Y < Height; Y++)
	{
		unsigned char* LinePD = Dest->Data + Y * Dest->WidthStep;
		int *AddPos = (int*)(p->Data + ColPos[Y + Size - 1] * p->WidthStep);
		int *SubPos = (int*)(p->Data + ColPos[Y] * p->WidthStep);

		for (X = 0; X < Width * Channel; X++)
		{
			Value = ColSum[X] + AddPos[X];
			LinePD[X] = (Value + HalfAmount) / Amount;					//	+ HalfAmount is there for rounding
			ColSum[X] = Value - SubPos[X];
		}
	}
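	IS_FreeMatrix(&Row);			//	frees the offset table and its TMatrix header (RowPos points into it)
	IS_FreeMatrix(&Col);
	IS_FreeMemory(Diff);
	IS_FreeMemory(ColSum);
	IS_FreeMemory(RowData);
	IS_FreeMatrix(&p);				//	the intermediate sum matrix was never released before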
	return Ret;
}
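
/*
The row pass of BoxBlur above keeps a running window sum instead of re-adding
2 * Radius + 1 samples for every pixel. A minimal 1-D sketch of that idea follows.
It assumes out-of-range samples are clamped to the nearest edge sample, which may
differ from what GetValidCoordinate produces, and BoxSum1D is a hypothetical helper
for illustration only, not part of this repository.

```cpp
#include <vector>

// Hypothetical helper: sum of the (2 * Radius + 1)-wide window centred on each
// sample, with out-of-range indices clamped to the nearest valid sample.
std::vector<int> BoxSum1D(const std::vector<unsigned char> &Row, int Radius) {
	const int Width = (int)Row.size();
	std::vector<int> Sum(Width, 0);
	if (Width == 0) return Sum;
	auto At = [&](int X) { return (int)Row[X < 0 ? 0 : (X >= Width ? Width - 1 : X)]; };
	int Value = 0;
	for (int Z = -Radius; Z <= Radius; Z++) Value += At(Z);	// first window, summed in full
	Sum[0] = Value;
	for (int X = 1; X < Width; X++) {
		Value += At(X + Radius) - At(X - Radius - 1);		// add the entering sample, drop the leaving one
		Sum[X] = Value;
	}
	return Sum;
}
```

Dividing each sum by 2 * Radius + 1, after adding half of that value, gives the rounded
1-D mean. BoxBlur applies the same kind of update first along rows and then along columns.
*/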

// Function: box blur of an image, SSE-optimized version.

IS_RET BoxBlur_SSE(TMatrix *Src, TMatrix *Dest, int Radius, EdgeMode Edge) {
	if (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;
	if (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
	IS_RET Ret = IS_RET_OK;
	TMatrix *Row = NULL, *Col = NULL;
	int *RowPos, *ColPos, *ColSum, *Diff;
	int X, Y, Z, Width, Height, Channel, Index;
	int Value, ValueB, ValueG, ValueR;
	int Size = 2 * Radius + 1, Amount = Size * Size, HalfAmount = Amount / 2;
	float Scale = 1.0 / (Size * Size);
	Width = Src->Width;
	Height = Src->Height;
	Channel = Src->Channel;
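	Ret = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, EdgeMode::Smear, &Row, &Col);		//	get the coordinate offsets
	if (Ret != IS_RET_OK) return Ret;
	RowPos = ((int *)Row->Data);
	ColPos = ((int *)Col->Data);
	ColSum = (int *)IS_AllocMemory(Width * Channel * sizeof(int), true);
	Diff = (int *)IS_AllocMemory((Width - 1) * Channel * sizeof(int), true);
	unsigned char *RowData = (unsigned char *)IS_AllocMemory((Width + 2 * Radius) * Channel, true);
	TMatrix *p = NULL;				//	32-bit matrix that receives the horizontal window sums
	Ret = IS_CreateMatrix(Width, Height, IS_DEPTH_32S, Channel, &p);
	if (Ret != IS_RET_OK) {			//	release what was allocated so far
		IS_FreeMatrix(&Row);	IS_FreeMatrix(&Col);
		IS_FreeMemory(ColSum);	IS_FreeMemory(Diff);	IS_FreeMemory(RowData);
		return Ret;
	}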
	for (Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src->Data + Y * Src->WidthStep;
		int *LinePD = (int *)(p->Data + Y * p->WidthStep);
		//	copy one row of data, plus the edge values, into the temporary buffer
		if (Channel == 1)
		{
			for (X = 0; X < Radius; X++)
				RowData[X] = LinePS[RowPos[X]];
			memcpy(RowData + Radius, LinePS, Width);
			for (X = Radius + Width; X < Radius + Width + Radius; X++)
				RowData[X] = LinePS[RowPos[X]];
		}
		else if (Channel == 3)
		{
			for (X = 0; X < Radius; X++)
			{
				Index = RowPos[X] * 3;
				RowData[X * 3] = LinePS[Index];
				RowData[X * 3 + 1] = LinePS[Index + 1];
				RowData[X * 3 + 2] = LinePS[Index + 2];
			}
			memcpy(RowData + Radius * 3, LinePS, Width * 3);
			for (X = Radius + Width; X < Radius + Width + Radius; X++)
			{
				Index = RowPos[X] * 3;
				RowData[X * 3 + 0] = LinePS[Index + 0];
				RowData[X * 3 + 1] = LinePS[Index + 1];
				RowData[X * 3 + 2] = LinePS[Index + 2];
			}
		}
		unsigned char *AddPos = RowData + Size * Channel;
		unsigned char *SubPos = RowData;
		X = 0;
		__m128i Zero = _mm_setzero_si128();
		for (; X <= (Width - 1) * Channel - 8; X += 8) {
			__m128i Add = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i const *)(AddPos + X)), Zero);
			__m128i Sub = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i const *)(SubPos + X)), Zero);
			_mm_store_si128((__m128i *)(Diff + X + 0), _mm_sub_epi32(_mm_unpacklo_epi16(Add, Zero), _mm_unpacklo_epi16(Sub, Zero)));
			_mm_store_si128((__m128i *)(Diff + X + 4), _mm_sub_epi32(_mm_unpackhi_epi16(Add, Zero), _mm_unpackhi_epi16(Sub, Zero)));
		}
		for (; X < (Width - 1) * Channel; X++)
			Diff[X] = AddPos[X] - SubPos[X];
		//	the first pixel needs special handling
		if (Channel == 1)
		{
			for (Z = 0, Value = 0; Z < Size; Z++)	Value += RowData[Z];
			LinePD[0] = Value;

			for (X = 1; X < Width; X++)
			{
				Value += Diff[X - 1];
				LinePD[X] = Value;
			}
		}
		else if (Channel == 3)
		{
			for (Z = 0, ValueB = ValueG = ValueR = 0; Z < Size; Z++)
			{
				ValueB += RowData[Z * 3 + 0];
				ValueG += RowData[Z * 3 + 1];
				ValueR += RowData[Z * 3 + 2];
			}
			LinePD[0] = ValueB;	LinePD[1] = ValueG;	LinePD[2] = ValueR;

			for (X = 1; X < Width; X++)
			{
				Index = X * 3;
				ValueB += Diff[Index - 3];		LinePD[Index + 0] = ValueB;
				ValueG += Diff[Index - 2];		LinePD[Index + 1] = ValueG;
				ValueR += Diff[Index - 1];		LinePD[Index + 2] = ValueR;
			}
		}
	}

	for (Y = 0; Y < Size - 1; Y++) {
		X = 0;
		int *LinePS = (int *)(p->Data + ColPos[Y] * p->WidthStep);
		for (; X <= Width * Channel - 4; X += 4) {
			__m128i SumP = _mm_load_si128((const __m128i*)(ColSum + X));
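			__m128i SrcP = _mm_loadu_si128((const __m128i*)(LinePS + X));		//	rows of p start on 4-byte boundaries only (WidthStep is a multiple of 4), so an aligned load is not safe here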
			_mm_store_si128((__m128i *)(ColSum + X), _mm_add_epi32(SumP, SrcP));
		}
		for (; X < Width * Channel; X++) ColSum[X] += LinePS[X];
	}

	for (Y = 0; Y < Height; Y++) {
		unsigned char *LinePD = Dest->Data + Y * Dest->WidthStep;
		int *AddPos = (int*)(p->Data + ColPos[Y + Size - 1] * p->WidthStep);
		int *SubPos = (int*)(p->Data + ColPos[Y] * p->WidthStep);
		X = 0;
		const __m128 Inv = _mm_set1_ps(Scale);
		for (; X <= Width * Channel - 8; X += 8) {
			__m128i Sub1 = _mm_loadu_si128((const __m128i*)(SubPos + X + 0));
			__m128i Sub2 = _mm_loadu_si128((const __m128i*)(SubPos + X + 4));
			__m128i Add1 = _mm_loadu_si128((const __m128i*)(AddPos + X + 0));
			__m128i Add2 = _mm_loadu_si128((const __m128i*)(AddPos + X + 4));
			__m128i Col1 = _mm_load_si128((const __m128i*)(ColSum + X + 0));
			__m128i Col2 = _mm_load_si128((const __m128i*)(ColSum + X + 4));

			__m128i Sum1 = _mm_add_epi32(Col1, Add1);
			__m128i Sum2 = _mm_add_epi32(Col2, Add2);

			__m128i Dest1 = _mm_cvtps_epi32(_mm_mul_ps(Inv, _mm_cvtepi32_ps(Sum1)));
			__m128i Dest2 = _mm_cvtps_epi32(_mm_mul_ps(Inv, _mm_cvtepi32_ps(Sum2)));

			Dest1 = _mm_packs_epi32(Dest1, Dest2);
			_mm_storel_epi64((__m128i *)(LinePD + X), _mm_packus_epi16(Dest1, Dest1));

			_mm_store_si128((__m128i *)(ColSum + X + 0), _mm_sub_epi32(Sum1, Sub1));
			_mm_store_si128((__m128i *)(ColSum + X + 4), _mm_sub_epi32(Sum2, Sub2));
		}
		for (; X < Width * Channel; X++){
			Value = ColSum[X] + AddPos[X];
			LinePD[X] = (Value + HalfAmount) / Amount;		//	round to nearest like the SIMD path and BoxBlur, instead of truncating
			ColSum[X] = Value - SubPos[X];
		}
	}
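	IS_FreeMatrix(&Row);			//	frees the offset table and its TMatrix header (RowPos points into it)
	IS_FreeMatrix(&Col);
	IS_FreeMemory(Diff);
	IS_FreeMemory(ColSum);
	IS_FreeMemory(RowData);
	IS_FreeMatrix(&p);				//	the intermediate sum matrix was never released before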
	return Ret;
}

================================================
FILE: speed_histogram_algorithm_framework/Core.h
================================================
#pragma once
#include <stdio.h>
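#include <malloc.h>
#include <math.h>			// floor, ceil (used in Utility.h)
#include <stdlib.h>
#include <string.h>
#include <emmintrin.h>		// SSE2 intrinsics (__m128i and the _mm_*_epi16 / _mm_*_epi32 family)
#include <opencv2/opencv.hpp>
using namespace std;

#define WIDTHBYTES(bytes) ((((bytes) * 8) + 31) / 32 * 4)	// rounds a row size in bytes up to a multiple of 4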
const float Inv255 = 1.0 / 255;
const double Eps = 2.220446049250313E-16;


// Edge handling mode
enum EdgeMode {
	Tile = 0, // repeat the edge elements
	Smear = 1 // mirror the edge elements
};

enum IS_RET {
	IS_RET_OK,									//	success
	IS_RET_ERR_OUTOFMEMORY,						//	out of memory
	IS_RET_ERR_STACKOVERFLOW,					//	stack overflow
	IS_RET_ERR_NULLREFERENCE,					//	null reference
	IS_RET_ERR_ARGUMENTOUTOFRANGE,				//	argument outside the valid range
	IS_RET_ERR_PARAMISMATCH,					//	parameter mismatch
	IS_RET_ERR_DIVIDEBYZERO,
	IS_RET_ERR_INDEXOUTOFRANGE,
	IS_RET_ERR_NOTSUPPORTED,
	IS_RET_ERR_OVERFLOW,
	IS_RET_ERR_FILENOTFOUND,
	IS_RET_ERR_UNKNOWN
};

enum IS_DEPTH
{
	IS_DEPTH_8U = 0,			//	unsigned char
	IS_DEPTH_8S = 1,			//	char
	IS_DEPTH_16S = 2,			//	short
	IS_DEPTH_32S = 3,			//  int
	IS_DEPTH_32F = 4,			//	float
	IS_DEPTH_64F = 5,			//	double
};

struct TMatrix
{
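	int Width;					//	matrix width
	int Height;					//	matrix height
	int WidthStep;				//	number of bytes occupied by one row of elements
	int Channel;				//	number of channels
	int Depth;					//	element type (an IS_DEPTH value)
	unsigned char *Data;		//	matrix data
	int Reserved;				//	reserved for future use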
};

// Memory allocation (32-byte aligned, optionally zero-filled)
void *IS_AllocMemory(unsigned int Size, bool ZeroMemory = true) {
	void *Ptr = _mm_malloc(Size, 32);
	if (Ptr != NULL)
		if (ZeroMemory == true)
			memset(Ptr, 0, Size);
	return Ptr;
}

// Memory release
void IS_FreeMemory(void *Ptr) {
	if (Ptr != NULL) _mm_free(Ptr);
}

// Return the number of bytes one element occupies for the given matrix element type
int IS_ELEMENT_SIZE(int Depth) {
	int Size;
	switch (Depth)
	{
	case IS_DEPTH_8U:
		Size = sizeof(unsigned char);
		break;
	case IS_DEPTH_8S:
		Size = sizeof(char);
		break;
	case IS_DEPTH_16S:
		Size = sizeof(short);
		break;
	case IS_DEPTH_32S:
		Size = sizeof(int);
		break;
	case IS_DEPTH_32F:
		Size = sizeof(float);
		break;
	case IS_DEPTH_64F:
		Size = sizeof(double);
		break;
	default:
		Size = 0;
		break;
	}
	return Size;
}

// Create a new matrix
IS_RET IS_CreateMatrix(int Width, int Height, int Depth, int Channel, TMatrix **Matrix) {
	if (Width < 1 || Height < 1) return IS_RET_ERR_ARGUMENTOUTOFRANGE; // argument out of range
	if (Depth != IS_DEPTH_8U && Depth != IS_DEPTH_8S && Depth != IS_DEPTH_16S && Depth != IS_DEPTH_32S &&
		Depth != IS_DEPTH_32F && Depth != IS_DEPTH_64F) return IS_RET_ERR_ARGUMENTOUTOFRANGE; // argument out of range
	if (Channel != 1 && Channel != 2 && Channel != 3 && Channel != 4) return IS_RET_ERR_ARGUMENTOUTOFRANGE;
	*Matrix = (TMatrix *)IS_AllocMemory(sizeof(TMatrix));
	if (*Matrix == NULL) return IS_RET_ERR_OUTOFMEMORY; // out of memory
	(*Matrix)->Width = Width;
	(*Matrix)->Height = Height;
	(*Matrix)->Depth = Depth;
	(*Matrix)->Channel = Channel;
	(*Matrix)->WidthStep = WIDTHBYTES(Width * Channel * IS_ELEMENT_SIZE(Depth));
	(*Matrix)->Data = (unsigned char*)IS_AllocMemory((*Matrix)->Height * (*Matrix)->WidthStep, true);
	if ((*Matrix)->Data == NULL) {
		IS_FreeMemory(*Matrix);
		*Matrix = NULL;					// do not leave a dangling pointer with the caller
		return IS_RET_ERR_OUTOFMEMORY; // out of memory
	}
	(*Matrix)->Reserved = 0;
	return IS_RET_OK;
}

// Free a matrix created by IS_CreateMatrix and reset the caller's pointer
IS_RET IS_FreeMatrix(TMatrix **Matrix) {
	if (Matrix == NULL || (*Matrix) == NULL) return IS_RET_ERR_NULLREFERENCE; // null reference
	IS_RET Ret = IS_RET_OK;
	if ((*Matrix)->Data == NULL) Ret = IS_RET_ERR_OUTOFMEMORY;
	else IS_FreeMemory((*Matrix)->Data);
	IS_FreeMemory((*Matrix));
	*Matrix = NULL;						// guards against a double free
	return Ret;
}

// Clone an existing matrix
IS_RET IS_CloneMatrix(TMatrix *Src, TMatrix **Dest) {
	if (Src == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
	IS_RET Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, Src->Channel, Dest);
	if (Ret == IS_RET_OK) memcpy((*Dest)->Data, Src->Data, (*Dest)->Height * (*Dest)->WidthStep);
	return Ret;
}

================================================
FILE: speed_histogram_algorithm_framework/MaxFilter.h
================================================
#pragma once
#include "Core.h"
#include "Utility.h"

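// Function: the "maximum" filter replaces the brightness of each pixel with the highest brightness among the surrounding pixels within the given radius.
// Parameters:
// Src: data structure of the source image to process
// Dest: data structure that receives the processed image
// Radius: radius, valid range [0, 126] (see the check below)
// Notes:
// 1. The run time is essentially independent of the radius but depends on the image content.
// 2. Src and Dest may be the same image; the filter runs faster when they differ, because the in-place case clones Src first.
// 3. It runs fast on anisotropic images and somewhat slower on images with large areas of identical pixels.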

IS_RET  MaxFilter(TMatrix *Src, TMatrix *Dest, int Radius)
{
	if (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;
	if (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
	if (Radius < 0 || Radius >= 127) return IS_RET_ERR_ARGUMENTOUTOFRANGE;

	IS_RET Ret = IS_RET_OK;

	if (Src->Data == Dest->Data)
	{
		TMatrix *Clone = NULL;
		Ret = IS_CloneMatrix(Src, &Clone);
		if (Ret != IS_RET_OK) return Ret;
		Ret = MaxFilter(Clone, Dest, Radius);
		IS_FreeMatrix(&Clone);
		return Ret;
	}
	if (Src->Channel == 1)
	{
		TMatrix *Row = NULL, *Col = NULL;
		unsigned char *LinePS, *LinePD;
		int X, Y, K, Width = Src->Width, Height = Src->Height;
		int *RowOffset, *ColOffSet;

		unsigned short *ColHist = NULL, *Hist = NULL;			//	declared before the first goto so that no jump skips an initialization

		ColHist = (unsigned short *)IS_AllocMemory(256 * (Width + 2 * Radius) * sizeof(unsigned short), true);
		if (ColHist == NULL) { Ret = IS_RET_ERR_OUTOFMEMORY; goto Done8; }
		Hist = (unsigned short *)IS_AllocMemory(256 * sizeof(unsigned short), true);
		if (Hist == NULL) { Ret = IS_RET_ERR_OUTOFMEMORY; goto Done8; }
		Ret = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, EdgeMode::Smear, &Row, &Col);		//	get the coordinate offsets
		if (Ret != IS_RET_OK) goto Done8;

		ColHist += Radius * 256;		RowOffset = ((int *)Row->Data) + Radius;
		ColOffSet = ((int *)Col->Data) + Radius;		    	//	offset the pointers so that indices from -Radius are valid

		for (Y = 0; Y < Height; Y++)
		{
			if (Y == 0)											//	column histograms for the first row are computed from scratch
			{
				for (K = -Radius; K <= Radius; K++)
				{
					LinePS = Src->Data + ColOffSet[K] * Src->WidthStep;
					for (X = -Radius; X < Width + Radius; X++)
					{
						ColHist[X * 256 + LinePS[RowOffset[X]]]++;
					}
				}
			}
			else												//	column histograms for the other rows only need an incremental update
			{
				LinePS = Src->Data + ColOffSet[Y - Radius - 1] * Src->WidthStep;
				for (X = -Radius; X < Width + Radius; X++)		// remove the histogram data of the row that left the window
				{
					ColHist[X * 256 + LinePS[RowOffset[X]]]--;
				}

				LinePS = Src->Data + ColOffSet[Y + Radius] * Src->WidthStep;
				for (X = -Radius; X < Width + Radius; X++)		// add the histogram data of the row that entered the window
				{
					ColHist[X * 256 + LinePS[RowOffset[X]]]++;
				}
			}

			memset(Hist, 0, 256 * sizeof(unsigned short));		//	clear the window histogram at the start of each row

			LinePD = Dest->Data + Y * Dest->WidthStep;

			for (X = 0; X < Width; X++)
			{
				if (X == 0)
				{
					for (K = -Radius; K <= Radius; K++)			//	first pixel of the row: build the histogram in full
						HistgramAddShort(ColHist + K * 256, Hist);
				}
				else
				{
					/*	HistgramAddShort(ColHist + RowOffset[X + Radius] * 256, Hist);
					HistgramSubShort(ColHist + RowOffset[X - Radius - 1] * 256, Hist);
					*/
					HistgramSubAddShort(ColHist + RowOffset[X - Radius - 1] * 256, ColHist + RowOffset[X + Radius] * 256, Hist);  //	other pixels in the row: remove the leaving column and add the entering one
				}
				for (K = 255; K >= 0; K--)
				{
					if (Hist[K] != 0)
					{
						LinePD[X] = K;
						break;
					}
				}
			}
		}
		ColHist -= Radius * 256;		//	undo the pointer offset
	Done8:
		IS_FreeMatrix(&Row);
		IS_FreeMatrix(&Col);
		IS_FreeMemory(ColHist);
		IS_FreeMemory(Hist);
		return Ret;
	}
	else
	{
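		TMatrix *Blue = NULL, *Green = NULL, *Red = NULL, *Alpha = NULL;			//	uninitialized variables hold indeterminate values, which could cause errors when freeing, so start from NULL
		Ret = SplitRGBA(Src, &Blue, &Green, &Red, &Alpha);							//	reuses the outer Ret instead of shadowing it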
		if (Ret != IS_RET_OK) goto Done24;
		Ret = MaxFilter(Blue, Blue, Radius);
		if (Ret != IS_RET_OK) goto Done24;
		Ret = MaxFilter(Green, Green, Radius);
		if (Ret != IS_RET_OK) goto Done24;
		Ret = MaxFilter(Red, Red, Radius);
		if (Ret != IS_RET_OK) goto Done24;											//	the alpha of 32-bit images is left untouched; in practice most 32-bit algorithms cannot be processed channel by channel
		CopyAlphaChannel(Src, Dest);
		Ret = CombineRGBA(Dest, Blue, Green, Red, Alpha);
	Done24:
		IS_FreeMatrix(&Blue);
		IS_FreeMatrix(&Green);
		IS_FreeMatrix(&Red);
		IS_FreeMatrix(&Alpha);
		return Ret;
	}
}
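
/*
MaxFilter above keeps a histogram of the current window, updates it incrementally
as the window moves, and scans it from bin 255 downwards to find the maximum. The
1-D sketch below shows only that core idea. SlidingMax1D is a hypothetical helper
for illustration, not part of this repository, and it assumes out-of-range samples
are clamped to the nearest edge sample.

```cpp
#include <vector>

// Hypothetical helper: maximum over a (2 * Radius + 1)-wide window, using a
// 256-bin histogram that is updated incrementally as the window slides.
std::vector<unsigned char> SlidingMax1D(const std::vector<unsigned char> &Row, int Radius) {
	const int Width = (int)Row.size();
	std::vector<unsigned char> Out(Width, 0);
	if (Width == 0) return Out;
	auto At = [&](int X) { return Row[X < 0 ? 0 : (X >= Width ? Width - 1 : X)]; };
	int Hist[256] = { 0 };
	for (int K = -Radius; K <= Radius; K++) Hist[At(K)]++;	// histogram of the first window
	for (int X = 0; X < Width; X++) {
		if (X > 0) {
			Hist[At(X - Radius - 1)]--;						// sample leaving the window
			Hist[At(X + Radius)]++;							// sample entering the window
		}
		for (int K = 255; K >= 0; K--) {					// highest non-empty bin is the maximum
			if (Hist[K] != 0) { Out[X] = (unsigned char)K; break; }
		}
	}
	return Out;
}
```

The 2-D version above does the same with one histogram per column (ColHist), so that
moving the window by one pixel costs one histogram subtraction and one addition.
*/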

================================================
FILE: speed_histogram_algorithm_framework/SelectiveBlur.h
================================================
#pragma once
#include "Core.h"
#include "Utility.h"

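// Writes to *Pixel the mean of the histogram entries whose value lies within Threshold of Intensity; *Pixel is left unchanged when that range is empty.
void Calc(unsigned short *Hist, int Intensity, unsigned char *&Pixel, int Threshold)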
{
	int K, Low, High, Sum = 0, Weight = 0;
	Low = Intensity - Threshold; High = Intensity + Threshold;
	if (Low < 0) Low = 0;
	if (High > 255) High = 255;
	for (K = Low; K <= High; K++)
	{
		Weight += Hist[K];
		Sum += Hist[K] * K;
	}
	if (Weight != 0) *Pixel = Sum / Weight;
}

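// Function: selective blur of an image within the given radius.
// Parameters:
// Src: data structure of the source image to process
// Dest: data structure that receives the processed image
// Radius: radius, valid range [0, 126] (see the check below)
// Threshold: intensity tolerance, valid range [2, 255] (see the check below)
// Notes:
// 1. The run time is essentially independent of the radius but depends on the image content.
// 2. Src and Dest may be the same image; the filter runs faster when they differ, because the in-place case clones Src first.
// 3. It runs fast on anisotropic images and somewhat slower on images with large areas of identical pixels.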

IS_RET SelectiveBlur(TMatrix *Src, TMatrix *Dest, int Radius, int Threshold, EdgeMode Edge)
{
	if (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;
	if (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
	if (Radius < 0 || Radius >= 127 || Threshold < 2 || Threshold > 255) return IS_RET_ERR_ARGUMENTOUTOFRANGE;

	IS_RET Ret = IS_RET_OK;

	if (Src->Data == Dest->Data)
	{
		TMatrix *Clone = NULL;
		Ret = IS_CloneMatrix(Src, &Clone);
		if (Ret != IS_RET_OK) return Ret;
		Ret = SelectiveBlur(Clone, Dest, Radius, Threshold, Edge);
		IS_FreeMatrix(&Clone);
		return Ret;
	}
	if (Src->Channel == 1)
	{
		TMatrix *Row = NULL, *Col = NULL;
		unsigned char *LinePS, *LinePD;
		int X, Y, K, Width = Src->Width, Height = Src->Height;
		int *RowOffset, *ColOffSet;

		unsigned short *ColHist = NULL, *Hist = NULL;			//	declared before the first goto so that no jump skips an initialization

		ColHist = (unsigned short *)IS_AllocMemory(256 * (Width + 2 * Radius) * sizeof(unsigned short), true);
		if (ColHist == NULL) { Ret = IS_RET_ERR_OUTOFMEMORY; goto Done8; }
		Hist = (unsigned short *)IS_AllocMemory(256 * sizeof(unsigned short), true);
		if (Hist == NULL) { Ret = IS_RET_ERR_OUTOFMEMORY; goto Done8; }

		Ret = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, Edge, &Row, &Col);		//	get the coordinate offsets
		if (Ret != IS_RET_OK) goto Done8;

		ColHist += Radius * 256;		RowOffset = ((int *)Row->Data) + Radius;		ColOffSet = ((int *)Col->Data) + Radius;		    	//	offset the pointers so that indices from -Radius are valid

		for (Y = 0; Y < Height; Y++)
		{
			if (Y == 0)											//	column histograms for the first row are computed from scratch
			{
				for (K = -Radius; K <= Radius; K++)
				{
					LinePS = Src->Data + ColOffSet[K] * Src->WidthStep;
					for (X = -Radius; X < Width + Radius; X++)
					{
						ColHist[X * 256 + LinePS[RowOffset[X]]]++;
					}
				}
			}
			else												//	column histograms for the other rows only need an incremental update
			{
				LinePS = Src->Data + ColOffSet[Y - Radius - 1] * Src->WidthStep;
				for (X = -Radius; X < Width + Radius; X++)		// remove the histogram data of the row that left the window
				{
					ColHist[X * 256 + LinePS[RowOffset[X]]]--;
				}

				LinePS = Src->Data + ColOffSet[Y + Radius] * Src->WidthStep;
				for (X = -Radius; X < Width + Radius; X++)		// add the histogram data of the row that entered the window
				{
					ColHist[X * 256 + LinePS[RowOffset[X]]]++;
				}

			}

			memset(Hist, 0, 256 * sizeof(unsigned short));		//	clear the window histogram at the start of each row

			LinePS = Src->Data + Y * Src->WidthStep;
			LinePD = Dest->Data + Y * Dest->WidthStep;

			for (X = 0; X < Width; X++)
			{
				if (X == 0)
				{
					for (K = -Radius; K <= Radius; K++)			//	first pixel of the row: build the histogram in full
						HistgramAddShort(ColHist + K * 256, Hist);
				}
				else
				{
					/*	HistgramAddShort(ColHist + RowOffset[X + Radius] * 256, Hist);
					HistgramSubShort(ColHist + RowOffset[X - Radius - 1] * 256, Hist);
					*/
					HistgramSubAddShort(ColHist + RowOffset[X - Radius - 1] * 256, ColHist + RowOffset[X + Radius] * 256, Hist);  //	other pixels in the row: remove the leaving column and add the entering one
				}
				Calc(Hist, LinePS[0], LinePD, Threshold);

				LinePS++;
				LinePD++;
			}
		}
		ColHist -= Radius * 256;		//	undo the pointer offset
	Done8:
		IS_FreeMatrix(&Row);
		IS_FreeMatrix(&Col);
		IS_FreeMemory(ColHist);
		IS_FreeMemory(Hist);

		return Ret;
	}
	else
	{
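		TMatrix *Blue = NULL, *Green = NULL, *Red = NULL, *Alpha = NULL;			//	uninitialized variables hold indeterminate values, which could cause errors when freeing, so start from NULL
		Ret = SplitRGBA(Src, &Blue, &Green, &Red, &Alpha);							//	reuses the outer Ret instead of shadowing it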
		if (Ret != IS_RET_OK) goto Done24;
		Ret = SelectiveBlur(Blue, Blue, Radius, Threshold, Edge);
		if (Ret != IS_RET_OK) goto Done24;
		Ret = SelectiveBlur(Green, Green, Radius, Threshold, Edge);
		if (Ret != IS_RET_OK) goto Done24;
		Ret = SelectiveBlur(Red, Red, Radius, Threshold, Edge);
		if (Ret != IS_RET_OK) goto Done24;											//	the alpha of 32-bit images is left untouched; in practice most 32-bit algorithms cannot be processed channel by channel
		Ret = CombineRGBA(Dest, Blue, Green, Red, Alpha);
	Done24:
		IS_FreeMatrix(&Blue);
		IS_FreeMatrix(&Green);
		IS_FreeMatrix(&Red);
		IS_FreeMatrix(&Alpha);
		return Ret;
	}
}


================================================
FILE: speed_histogram_algorithm_framework/Utility.h
================================================
#pragma once
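#include "Core.h"

// Gives the fast Pow/Exp approximations below access to the two 32-bit halves of a double.
// X[1] is treated as the high word, which assumes a little-endian target.
union Approximation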
{
	double Value;
	int X[2];
};

// Function 1: clamp a value to the range of a byte, [0, 255].
// Reference: http://www.cnblogs.com/zyl910/archive/2012/03/12/noifopex1.html
// Note: uses bit operations instead of branches.
unsigned char ClampToByte(int Value) {
	return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));
}

//Function 2: clamp a value to the given range.
//Reference:
//Note:
int ClampToInt(int Value, int Min, int Max) {
	if (Value < Min) return Min;
	else if (Value > Max) return Max;
	else return Value;
}

//Function 3: divide by 255.
//Reference:
//Note: uses shifts instead of a division.
int Div255(int Value) {
	return (((Value >> 8) + Value + 1) >> 8);
}
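
/*
Div255 above replaces a division by 255 with shifts and additions. The sketch below
repeats the formula under a hypothetical name (Div255Sketch) and compares it against
Value / 255 with a hypothetical checker (CountDiv255Mismatches); neither function is
part of this repository. Assuming non-negative inputs, the two agree at least up to
255 * 255, the largest product of two 8-bit values, but not for every larger input:
65535 is one counterexample.

```cpp
// Copy of the shift-based formula, repeated so that the sketch is self-contained.
int Div255Sketch(int Value) {
	return ((Value >> 8) + Value + 1) >> 8;
}

// Hypothetical checker: number of inputs in [0, Limit] for which the shift-based
// form differs from the plain integer division.
int CountDiv255Mismatches(int Limit) {
	int Mismatches = 0;
	for (int Value = 0; Value <= Limit; Value++) {
		if (Div255Sketch(Value) != Value / 255) Mismatches++;
	}
	return Mismatches;
}
```

A typical use is alpha blending, where Value is the product of a pixel and an 8-bit
alpha and therefore stays inside the range checked here.
*/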

//Function 4: absolute value.
//Reference: https://oi-wiki.org/math/bit/
//Note: equivalent to n > 0 ? n : -n

int Abs(int n) {
	return (n ^ (n >> 31)) - (n >> 31);
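	/* n >> 31 extracts the sign of n: it is 0 when n is non-negative and -1 when n is negative.
	 When n is non-negative, n ^ 0 leaves n unchanged and subtracting 0 returns n.
	 When n is negative, n ^ -1 flips every bit, which in two's complement equals -n - 1,
	 and subtracting -1 then yields -n, the absolute value. */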
}

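//Function 5: round to the nearest integer, halves away from zero.
//Reference:
//Note:
double Round(double V)
{
	return (V > 0.0) ? floor(V + 0.5) : ceil(V - 0.5);		//	the previous form called Round(V - 0.5) here and never terminated for V <= 0
}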

//Function 6: pseudo-random number; as written, the result lies in [0, 1).
//Reference:
//Note:
double Rand()
{
	return (double)rand() / (RAND_MAX + 1.0);
}

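//Function 7: approximate Pow, in double and float versions.
//Reference: http://www.cvchina.info/2010/03/19/log-pow-exp-approximation/
//Reference: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/
//Note: intended only for fast approximate computation; the error ranges from about 5% to 12%.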
double Pow(double X, double Y)
{
	Approximation V = { X };
	V.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);
	V.X[0] = 0;
	return V.Value;
}


float Pow(float X, float Y)
{
	Approximation V = { X };
	V.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);
	V.X[0] = 0;
	return (float)V.Value;
}

//Function 8: approximate Exp, in double and float versions.
double Exp(double Y)			//	this approach is somewhat faster
{
	Approximation V;
	V.X[1] = (int)(Y * 1485963 + 1072632447);
	V.X[0] = 0;
	return V.Value;
}

float Exp(float Y)			//	this approach is somewhat faster
{
	Approximation V;
	V.X[1] = (int)(Y * 1485963 + 1072632447);
	V.X[0] = 0;
	return (float)V.Value;
}

// Function 9: a more accurate approximation of Pow, at the cost of some speed.
// http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/
// Besides that, I also have now a slower approximation that has much less error
// when the exponent is larger than 1. It makes use exponentiation by squaring,
// which is exact for the integer part of the exponent, and uses only the exponents fraction for the approximation:
// should be much more precise with large Y

double PrecisePow(double X, double Y) {
	// calculate approximation with fraction of the exponent
	// (assumes Y >= 0: a negative integer part never reaches zero in the shift loop below)
	int e = (int)Y;
	Approximation V = { X };
	V.X[1] = (int)((Y - e) * (V.X[1] - 1072632447) + 1072632447);
	V.X[0] = 0;
	// exponentiation by squaring with the exponent's integer part
	// double r = u.d makes everything much slower, not sure why
	double r = 1.0;
	while (e)
	{
		if (e & 1)	r *= X;
		X *= X;
		e >>= 1;
	}
	return r * V.Value;
}

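//Function 10: pseudo-random integer between Min and Max.
//Reference:
//Note: Min is the smallest value and Max the largest value that can be returned.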
int Random(int Min, int Max) {
	return rand() % (Max + 1 - Min) + Min;
}

//Function 11: sign function.
//Reference:
//Note:
int sgn(int X) {
	if (X > 0) return 1;
	if (X < 0) return -1;
	return 0;
}

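//Function 12: extract the color components of a packed color value.
//Reference:
//Note: R is read from the lowest byte, G from the second byte and B from the third byte.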
void GetRGB(int Color, int *R, int *G, int *B) {
	*R = Color & 255;
	*G = (Color & 65280) / 256;
	*B = (Color & 16711680) / 65536;
}

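//Function 13: approximate square root of a number using Newton's method.
//Reference: https://www.cnblogs.com/qlky/p/7735145.html
//Note: this is an approximate algorithm for the square root of the given value.
float Sqrt(float X)
{
	float HalfX = 0.5f * X;             // half of the input, reused by every Newton step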
	int I = *(int*)&X;                  // get bits for floating VALUE 
	I = 0x5f375a86 - (I >> 1);          // gives initial guess y0
	X = *(float*)&I;                    // convert bits BACK to float
	X = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy
	X = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy
	X = X * (1.5f - HalfX * X * X);     // Newton step, repeating increases accuracy
	return 1 / X;
}

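//Function 14: add unsigned short histograms, Y = X + Y.
//Reference:
//Note: SSE-optimized; both pointers must be 16-byte aligned and cover 256 bins.
void HistgramAddShort(unsigned short *X, unsigned short *Y)
{
	*(__m128i*)(Y + 0) = _mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]);		//	hand-written assembly is unlikely to beat this; it has already been tried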
	*(__m128i*)(Y + 8) = _mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);
	*(__m128i*)(Y + 16) = _mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);
	*(__m128i*)(Y + 24) = _mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);
	*(__m128i*)(Y + 32) = _mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);
	*(__m128i*)(Y + 40) = _mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]);
	*(__m128i*)(Y + 48) = _mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]);
	*(__m128i*)(Y + 56) = _mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]);
	*(__m128i*)(Y + 64) = _mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]);
	*(__m128i*)(Y + 72) = _mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]);
	*(__m128i*)(Y + 80) = _mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]);
	*(__m128i*)(Y + 88) = _mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]);
	*(__m128i*)(Y + 96) = _mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]);
	*(__m128i*)(Y + 104) = _mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]);
	*(__m128i*)(Y + 112) = _mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]);
	*(__m128i*)(Y + 120) = _mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]);
	*(__m128i*)(Y + 128) = _mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]);
	*(__m128i*)(Y + 136) = _mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]);
	*(__m128i*)(Y + 144) = _mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]);
	*(__m128i*)(Y + 152) = _mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]);
	*(__m128i*)(Y + 160) = _mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]);
	*(__m128i*)(Y + 168) = _mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]);
	*(__m128i*)(Y + 176) = _mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]);
	*(__m128i*)(Y + 184) = _mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]);
	*(__m128i*)(Y + 192) = _mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]);
	*(__m128i*)(Y + 200) = _mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]);
	*(__m128i*)(Y + 208) = _mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]);
	*(__m128i*)(Y + 216) = _mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]);
	*(__m128i*)(Y + 224) = _mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]);
	*(__m128i*)(Y + 232) = _mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]);
	*(__m128i*)(Y + 240) = _mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]);
	*(__m128i*)(Y + 248) = _mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);
}

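//Function 15: subtract unsigned short histograms, Y = Y - X.
//Reference:
//Note: SSE-optimized; both pointers must be 16-byte aligned and cover 256 bins.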
void HistgramSubShort(unsigned short *X, unsigned short *Y)
{
	*(__m128i*)(Y + 0) = _mm_sub_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]);
	*(__m128i*)(Y + 8) = _mm_sub_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);
	*(__m128i*)(Y + 16) = _mm_sub_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);
	*(__m128i*)(Y + 24) = _mm_sub_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);
	*(__m128i*)(Y + 32) = _mm_sub_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);
	*(__m128i*)(Y + 40) = _mm_sub_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]);
	*(__m128i*)(Y + 48) = _mm_sub_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]);
	*(__m128i*)(Y + 56) = _mm_sub_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]);
	*(__m128i*)(Y + 64) = _mm_sub_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]);
	*(__m128i*)(Y + 72) = _mm_sub_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]);
	*(__m128i*)(Y + 80) = _mm_sub_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]);
	*(__m128i*)(Y + 88) = _mm_sub_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]);
	*(__m128i*)(Y + 96) = _mm_sub_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]);
	*(__m128i*)(Y + 104) = _mm_sub_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]);
	*(__m128i*)(Y + 112) = _mm_sub_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]);
	*(__m128i*)(Y + 120) = _mm_sub_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]);
	*(__m128i*)(Y + 128) = _mm_sub_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]);
	*(__m128i*)(Y + 136) = _mm_sub_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]);
	*(__m128i*)(Y + 144) = _mm_sub_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]);
	*(__m128i*)(Y + 152) = _mm_sub_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]);
	*(__m128i*)(Y + 160) = _mm_sub_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]);
	*(__m128i*)(Y + 168) = _mm_sub_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]);
	*(__m128i*)(Y + 176) = _mm_sub_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]);
	*(__m128i*)(Y + 184) = _mm_sub_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]);
	*(__m128i*)(Y + 192) = _mm_sub_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]);
	*(__m128i*)(Y + 200) = _mm_sub_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]);
	*(__m128i*)(Y + 208) = _mm_sub_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]);
	*(__m128i*)(Y + 216) = _mm_sub_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]);
	*(__m128i*)(Y + 224) = _mm_sub_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]);
	*(__m128i*)(Y + 232) = _mm_sub_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]);
	*(__m128i*)(Y + 240) = _mm_sub_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]);
	*(__m128i*)(Y + 248) = _mm_sub_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);
}

// 16: Add and subtract unsigned short histogram data, i.e. Z = Z + Y - X
// Reference:
// Summary: SSE optimized
void HistgramSubAddShort(unsigned short *X, unsigned short *Y, unsigned short *Z)
{
	*(__m128i*)(Z + 0) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&Z[0]), *(__m128i*)&X[0]);						//	hand-written assembly was tried and did not beat this
	*(__m128i*)(Z + 8) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&Z[8]), *(__m128i*)&X[8]);
	*(__m128i*)(Z + 16) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&Z[16]), *(__m128i*)&X[16]);
	*(__m128i*)(Z + 24) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&Z[24]), *(__m128i*)&X[24]);
	*(__m128i*)(Z + 32) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&Z[32]), *(__m128i*)&X[32]);
	*(__m128i*)(Z + 40) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&Z[40]), *(__m128i*)&X[40]);
	*(__m128i*)(Z + 48) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&Z[48]), *(__m128i*)&X[48]);
	*(__m128i*)(Z + 56) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&Z[56]), *(__m128i*)&X[56]);
	*(__m128i*)(Z + 64) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&Z[64]), *(__m128i*)&X[64]);
	*(__m128i*)(Z + 72) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&Z[72]), *(__m128i*)&X[72]);
	*(__m128i*)(Z + 80) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&Z[80]), *(__m128i*)&X[80]);
	*(__m128i*)(Z + 88) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&Z[88]), *(__m128i*)&X[88]);
	*(__m128i*)(Z + 96) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&Z[96]), *(__m128i*)&X[96]);
	*(__m128i*)(Z + 104) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&Z[104]), *(__m128i*)&X[104]);
	*(__m128i*)(Z + 112) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&Z[112]), *(__m128i*)&X[112]);
	*(__m128i*)(Z + 120) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&Z[120]), *(__m128i*)&X[120]);
	*(__m128i*)(Z + 128) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&Z[128]), *(__m128i*)&X[128]);
	*(__m128i*)(Z + 136) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&Z[136]), *(__m128i*)&X[136]);
	*(__m128i*)(Z + 144) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&Z[144]), *(__m128i*)&X[144]);
	*(__m128i*)(Z + 152) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&Z[152]), *(__m128i*)&X[152]);
	*(__m128i*)(Z + 160) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&Z[160]), *(__m128i*)&X[160]);
	*(__m128i*)(Z + 168) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&Z[168]), *(__m128i*)&X[168]);
	*(__m128i*)(Z + 176) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&Z[176]), *(__m128i*)&X[176]);
	*(__m128i*)(Z + 184) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&Z[184]), *(__m128i*)&X[184]);
	*(__m128i*)(Z + 192) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&Z[192]), *(__m128i*)&X[192]);
	*(__m128i*)(Z + 200) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&Z[200]), *(__m128i*)&X[200]);
	*(__m128i*)(Z + 208) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&Z[208]), *(__m128i*)&X[208]);
	*(__m128i*)(Z + 216) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&Z[216]), *(__m128i*)&X[216]);
	*(__m128i*)(Z + 224) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&Z[224]), *(__m128i*)&X[224]);
	*(__m128i*)(Z + 232) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&Z[232]), *(__m128i*)&X[232]);
	*(__m128i*)(Z + 240) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&Z[240]), *(__m128i*)&X[240]);
	*(__m128i*)(Z + 248) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&Z[248]), *(__m128i*)&X[248]);
}
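
// Sketch (not part of the original source): each unrolled routine above touches
// 256 unsigned short bins, 8 bins per 128-bit operation. Assuming the histograms
// may not be 16-byte aligned, the same Y = Y - X update can be written as a loop
// over unaligned loads and stores. The name HistSubShortLoop is hypothetical.

```cpp
#include <emmintrin.h>   // SSE2 intrinsics

// Hypothetical loop form of HistgramSubShort: Y[i] -= X[i] for 256 bins.
// Unaligned loads/stores are used, so X and Y need no particular alignment.
void HistSubShortLoop(const unsigned short *X, unsigned short *Y)
{
	for (int I = 0; I < 256; I += 8)                                   // 8 bins per iteration
	{
		__m128i VY = _mm_loadu_si128((const __m128i *)(Y + I));
		__m128i VX = _mm_loadu_si128((const __m128i *)(X + I));
		_mm_storeu_si128((__m128i *)(Y + I), _mm_sub_epi16(VY, VX)); // wraps modulo 65536, as in the unrolled version
	}
}
```

// Whether the loop is as fast as the fully unrolled form depends on the compiler;
// it is shown only to make the per-bin operation explicit.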

// 17: Copy the alpha channel
// Reference:
// Summary: uses the plain original code directly; it is already fast
void CopyAlphaChannel(TMatrix *Src, TMatrix *Dest) {
	if (Src->Channel != 4 || Dest->Channel != 4) return;
	if (Src->Data == Dest->Data) return;
	unsigned char *SrcP = Src->Data, *DestP = Dest->Data;
	int Y, Index = 3;
	for (Y = 0; Y < Src->Width * Src->Height; Y++, Index += 4) {
		DestP[Index] = SrcP[Index];		// copy from Src into Dest (the original assignment was reversed)
	}
}

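// 18: Compute the valid source coordinate for every position of the extended range, according to the given edge mode
// Parameters:
// Width: width of the matrix
// Height: height of the matrix
// Left: number of pixels to extend on the left
// Right: number of pixels to extend on the right
// Top: number of pixels to extend at the top
// Bottom: number of pixels to extend at the bottom
// Edge: edge handling mode
// Row: receives the mapped coordinate for each column of the extended range
// Col: receives the mapped coordinate for each row of the extended range
// Returns whether the function succeeded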
IS_RET GetValidCoordinate(int Width, int Height, int Left, int Right, int Top, int Bottom, EdgeMode Edge, TMatrix **Row, TMatrix **Col)
{
	if ((Left < 0) || (Right < 0) || (Top < 0) || (Bottom < 0)) return IS_RET_ERR_ARGUMENTOUTOFRANGE;
	IS_RET Ret = IS_CreateMatrix(Width + Left + Right, 1, IS_DEPTH_32S, 1, Row);
	if (Ret != IS_RET_OK) return Ret;
	Ret = IS_CreateMatrix(1, Height + Top + Bottom, IS_DEPTH_32S, 1, Col);
	if (Ret != IS_RET_OK) { IS_FreeMatrix(Row); return Ret; }		// do not leak Row when Col cannot be created

	int X, Y, XX, YY, *RowPos = (int *)(*Row)->Data, *ColPos = (int *)(*Col)->Data;

	for (X = -Left; X < Width + Right; X++)
	{
		if (X < 0)
		{
			if (Edge == EdgeMode::Tile)							// repeat the edge pixel
				RowPos[X + Left] = 0;
			else
			{
				XX = -X;
				while (XX >= Width) XX -= Width;			// wrap the mirrored index back into range
				RowPos[X + Left] = XX;
			}
		}
		else if (X >= Width)
		{
			if (Edge == EdgeMode::Tile)
				RowPos[X + Left] = Width - 1;
			else
			{
				XX = Width - (X - Width + 2);
				while (XX < 0) XX += Width;
				RowPos[X + Left] = XX;
			}
		}
		else
		{
			RowPos[X + Left] = X;
		}
	}

	for (Y = -Top; Y < Height + Bottom; Y++)
	{
		if (Y < 0)
		{
			if (Edge == EdgeMode::Tile)
				ColPos[Y + Top] = 0;
			else
			{
				YY = -Y;
				while (YY >= Height) YY -= Height;
				ColPos[Y + Top] = YY;
			}
		}
		else if (Y >= Height)
		{
			if (Edge == EdgeMode::Tile)
				ColPos[Y + Top] = Height - 1;
			else
			{
				YY = Height - (Y - Height + 2);
				while (YY < 0) YY += Height;
				ColPos[Y + Top] = YY;
			}
		}
		else
		{
			ColPos[Y + Top] = Y;
		}
	}
	return IS_RET_OK;
}
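
// Sketch (not part of the original source): the two loops above apply the same
// per-coordinate rule to columns and rows. Pulled out into a single function, the
// rule looks as follows. MapCoordinate and its bool parameter are hypothetical;
// the original selects the behaviour through EdgeMode.

```cpp
// Hypothetical helper: maps one coordinate Pos (possibly outside [0, Size)) to a
// valid index, following the same rules as the loops in GetValidCoordinate.
// Clamp == true repeats the edge sample; otherwise the index is mirrored about
// the edge (the edge sample itself is not repeated) and wrapped if it is still
// out of range.
int MapCoordinate(int Pos, int Size, bool Clamp)
{
	if (Pos < 0)
	{
		if (Clamp) return 0;
		int P = -Pos;                       // mirror about index 0
		while (P >= Size) P -= Size;        // wrap when the extension exceeds the size
		return P;
	}
	if (Pos >= Size)
	{
		if (Clamp) return Size - 1;
		int P = Size - (Pos - Size + 2);    // mirror about index Size - 1
		while (P < 0) P += Size;
		return P;
	}
	return Pos;                             // already valid
}
```

// For Size = 5 in mirror mode, positions -2, -1, 5 and 6 map to 2, 1, 3 and 2.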

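// 19: Split a color image into separate B, G, R and A single-channel images
// Parameters:
// Src: source image to split
// Blue: receives the blue channel
// Green: receives the green channel
// Red: receives the red channel
// Alpha: receives the alpha channel (4-channel sources only)
// Writing 8 pixels per iteration is roughly 20% faster
// Returns whether the function succeeded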
IS_RET SplitRGBA(TMatrix *Src, TMatrix **Blue, TMatrix **Green, TMatrix **Red, TMatrix **Alpha) {
	if (Src == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Src->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
	if (Src->Channel != 3 && Src->Channel != 4) return IS_RET_ERR_NOTSUPPORTED;
	// Locals are declared before the first goto so that the jump to Done does not skip an initialization.
	int X, Y, Block, Width = Src->Width, Height = Src->Height;
	unsigned char *LinePS, *LinePB, *LinePG, *LinePR, *LinePA;
	const int BlockSize = 8;
	Block = Width / BlockSize;						//	unrolled 8 ways; unrolling further gave no extra speed
	IS_RET Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Blue);
	if (Ret != IS_RET_OK) goto Done;
	Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Green);
	if (Ret != IS_RET_OK) goto Done;
	Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Red);
	if (Ret != IS_RET_OK) goto Done;
	if (Src->Channel == 4) {
		Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Alpha);
		if (Ret != IS_RET_OK) goto Done;
	}
	if (Src->Channel == 3)
	{
		for (Y = 0; Y < Height; Y++)
		{
			LinePS = Src->Data + Y * Src->WidthStep;
			LinePB = (*Blue)->Data + Y * (*Blue)->WidthStep;
			LinePG = (*Green)->Data + Y * (*Green)->WidthStep;
			LinePR = (*Red)->Data + Y * (*Red)->WidthStep;
			for (X = 0; X < Block * BlockSize; X += BlockSize)			//	grouping all the LinePB stores together is actually slightly slower
			{
				LinePB[0] = LinePS[0];		LinePG[0] = LinePS[1];		LinePR[0] = LinePS[2];
				LinePB[1] = LinePS[3];		LinePG[1] = LinePS[4];		LinePR[1] = LinePS[5];
				LinePB[2] = LinePS[6];		LinePG[2] = LinePS[7];		LinePR[2] = LinePS[8];
				LinePB[3] = LinePS[9];		LinePG[3] = LinePS[10];		LinePR[3] = LinePS[11];
				LinePB[4] = LinePS[12];		LinePG[4] = LinePS[13];		LinePR[4] = LinePS[14];
				LinePB[5] = LinePS[15];		LinePG[5] = LinePS[16];		LinePR[5] = LinePS[17];
				LinePB[6] = LinePS[18];		LinePG[6] = LinePS[19];		LinePR[6] = LinePS[20];
				LinePB[7] = LinePS[21];		LinePG[7] = LinePS[22];		LinePR[7] = LinePS[23];
				LinePB += 8;				LinePG += 8;				LinePR += 8;				LinePS += 24;
			}
			while (X < Width)
			{
				LinePB[0] = LinePS[0];		LinePG[0] = LinePS[1];		LinePR[0] = LinePS[2];
				LinePB++;					LinePG++;					LinePR++;					LinePS += 3;
				X++;
			}
		}
	}
	else if (Src->Channel == 4)
	{
		for (Y = 0; Y < Height; Y++)
		{
			LinePS = Src->Data + Y * Src->WidthStep;
			LinePB = (*Blue)->Data + Y * (*Blue)->WidthStep;
			LinePG = (*Green)->Data + Y * (*Green)->WidthStep;
			LinePR = (*Red)->Data + Y * (*Red)->WidthStep;
			LinePA = (*Alpha)->Data + Y * (*Alpha)->WidthStep;
			for (X = 0; X < Block * BlockSize; X += BlockSize)
			{
				LinePB[0] = LinePS[0];		LinePG[0] = LinePS[1];		LinePR[0] = LinePS[2];		LinePA[0] = LinePS[3];
				LinePB[1] = LinePS[4];		LinePG[1] = LinePS[5];		LinePR[1] = LinePS[6];		LinePA[1] = LinePS[7];
				LinePB[2] = LinePS[8];		LinePG[2] = LinePS[9];		LinePR[2] = LinePS[10];		LinePA[2] = LinePS[11];
				LinePB[3] = LinePS[12];		LinePG[3] = LinePS[13];		LinePR[3] = LinePS[14];		LinePA[3] = LinePS[15];
				LinePB[4] = LinePS[16];		LinePG[4] = LinePS[17];		LinePR[4] = LinePS[18];		LinePA[4] = LinePS[19];
				LinePB[5] = LinePS[20];		LinePG[5] = LinePS[21];		LinePR[5] = LinePS[22];		LinePA[5] = LinePS[23];
				LinePB[6] = LinePS[24];		LinePG[6] = LinePS[25];		LinePR[6] = LinePS[26];		LinePA[6] = LinePS[27];
				LinePB[7] = LinePS[28];		LinePG[7] = LinePS[29];		LinePR[7] = LinePS[30];		LinePA[7] = LinePS[31];
				LinePB += 8;				LinePG += 8;				LinePR += 8;				LinePA += 8;				LinePS += 32;
			}
			while (X < Width)
			{
				LinePB[0] = LinePS[0];		LinePG[0] = LinePS[1];		LinePR[0] = LinePS[2];		LinePA[0] = LinePS[3];
				LinePB++;					LinePG++;					LinePR++;					LinePA++;					LinePS += 4;
				X++;
			}
		}
	}
	return IS_RET_OK;
Done:
	if (*Blue != NULL) IS_FreeMatrix(Blue);
	if (*Green != NULL) IS_FreeMatrix(Green);
	if (*Red != NULL) IS_FreeMatrix(Red);
	if (Src->Channel == 4 && *Alpha != NULL) IS_FreeMatrix(Alpha);		// Alpha is only created for 4-channel sources
	return Ret;
}

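// 20: Merge B, G, R and A single-channel images into a color image
// Parameters:
// Dest: destination image that receives the merged result
// Blue: blue channel image
// Green: green channel image
// Red: red channel image
// Alpha: alpha channel image (required only when Dest has 4 channels)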
IS_RET CombineRGBA(TMatrix *Dest, TMatrix *Blue, TMatrix *Green, TMatrix *Red, TMatrix *Alpha)
{
	if (Dest == NULL || Blue == NULL || Green == NULL || Red == NULL) return IS_RET_ERR_NULLREFERENCE;
	if (Dest->Data == NULL || Blue->Data == NULL || Green->Data == NULL || Red->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
	if ((Dest->Channel != 3 && Dest->Channel != 4) || Blue->Channel != 1 || Green->Channel != 1 || Red->Channel != 1) return IS_RET_ERR_PARAMISMATCH;
	if (Dest->Width != Blue->Width || Dest->Width != Green->Width || Dest->Width != Red->Width)  return IS_RET_ERR_PARAMISMATCH;
	if (Dest->Height != Blue->Height || Dest->Height != Green->Height || Dest->Height != Red->Height)  return IS_RET_ERR_PARAMISMATCH;

	if (Dest->Channel == 4)
	{
		if (Alpha == NULL) return IS_RET_ERR_NULLREFERENCE;
		if (Alpha->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
		if (Alpha->Channel != 1) return IS_RET_ERR_PARAMISMATCH;
		if (Dest->Width != Alpha->Width || Dest->Height != Alpha->Height) return IS_RET_ERR_PARAMISMATCH;
	}

	int X, Y, Block, Width = Dest->Width, Height = Dest->Height;
	unsigned char *LinePD, *LinePB, *LinePG, *LinePR, *LinePA;
	const int BlockSize = 8;
	Block = Width / BlockSize;						//	unrolled 8 ways; unrolling further gave no extra speed

	if (Dest->Channel == 3)
	{
		for (Y = 0; Y < Height; Y++)
		{
			LinePD = Dest->Data + Y * Dest->WidthStep;
			LinePB = Blue->Data + Y * Blue->WidthStep;
			LinePG = Green->Data + Y * Green->WidthStep;
			LinePR = Red->Data + Y * Red->WidthStep;
			for (X = 0; X < Block * BlockSize; X += BlockSize)				//	grouping all the LinePB reads together makes little difference in speed
			{
				LinePD[0] = LinePB[0];		LinePD[1] = LinePG[0];		LinePD[2] = LinePR[0];
				LinePD[3] = LinePB[1];		LinePD[4] = LinePG[1];		LinePD[5] = LinePR[1];
				LinePD[6] = LinePB[2];		LinePD[7] = LinePG[2];		LinePD[8] = LinePR[2];
				LinePD[9] = LinePB[3];		LinePD[10] = LinePG[3];		LinePD[11] = LinePR[3];
				LinePD[12] = LinePB[4];		LinePD[13] = LinePG[4];		LinePD[14] = LinePR[4];
				LinePD[15] = LinePB[5];		LinePD[16] = LinePG[5];		LinePD[17] = LinePR[5];
				LinePD[18] = LinePB[6];		LinePD[19] = LinePG[6];		LinePD[20] = LinePR[6];
				LinePD[21] = LinePB[7];		LinePD[22] = LinePG[7];		LinePD[23] = LinePR[7];
				LinePB += 8;				LinePG += 8;				LinePR += 8;				LinePD += 24;
			}
			while (X < Width)
			{
				LinePD[0] = LinePB[0];		LinePD[1] = LinePG[0];		LinePD[2] = LinePR[0];
				LinePB++;					LinePG++;					LinePR++;					LinePD += 3;
				X++;
			}
		}
	}
	else if (Dest->Channel == 4)
	{
		for (Y = 0; Y < Height; Y++)
		{
			LinePD = Dest->Data + Y * Dest->WidthStep;
			LinePB = Blue->Data + Y * Blue->WidthStep;
			LinePG = Green->Data + Y * Green->WidthStep;
			LinePR = Red->Data + Y * Red->WidthStep;
			LinePA = Alpha->Data + Y * Alpha->WidthStep;
			for (X = 0; X < Block * BlockSize; X += BlockSize)
			{
				LinePD[0] = LinePB[0];		LinePD[1] = LinePG[0];		LinePD[2] = LinePR[0];		LinePD[3] = LinePA[0];
				LinePD[4] = LinePB[1];		LinePD[5] = LinePG[1];		LinePD[6] = LinePR[1];		LinePD[7] = LinePA[1];
				LinePD[8] = LinePB[2];		LinePD[9] = LinePG[2];		LinePD[10] = LinePR[2];		LinePD[11] = LinePA[2];
				LinePD[12] = LinePB[3];		LinePD[13] = LinePG[3];		LinePD[14] = LinePR[3];		LinePD[15] = LinePA[3];
				LinePD[16] = LinePB[4];		LinePD[17] = LinePG[4];		LinePD[18] = LinePR[4];		LinePD[19] = LinePA[4];
				LinePD[20] = LinePB[5];		LinePD[21] = LinePG[5];		LinePD[22] = LinePR[5];		LinePD[23] = LinePA[5];
				LinePD[24] = LinePB[6];		LinePD[25] = LinePG[6];		LinePD[26] = LinePR[6];		LinePD[27] = LinePA[6];
				LinePD[28] = LinePB[7];		LinePD[29] = LinePG[7];		LinePD[30] = LinePR[7];		LinePD[31] = LinePA[7];
				LinePB += 8;				LinePG += 8;				LinePR += 8;				LinePA += 8;				LinePD += 32;
			}
			while (X < Width)
			{
				LinePD[0] = LinePB[0];		LinePD[1] = LinePG[0];		LinePD[2] = LinePR[0];		LinePD[3] = LinePA[0];
				LinePB++;					LinePG++;					LinePR++;					LinePA++;					LinePD += 4;
				X++;
			}
		}
	}
	return IS_RET_OK;
}

================================================
FILE: speed_integral_graph_sse.cpp
================================================
#include <stdio.h>
#include <emmintrin.h>   // SSE2 intrinsics used by GetGrayIntegralImage_SSE
#include <opencv2/opencv.hpp>

using namespace std;
using namespace cv;

void GetGrayIntegralImage(unsigned char *Src, int *Integral, int Width, int Height, int Stride)
{
	memset(Integral, 0, (Width + 1) * sizeof(int));                    //    the first row is all zeros
	for (int Y = 0; Y < Height; Y++)
	{
		unsigned char *LinePS = Src + Y * Stride;
		int *LinePL = Integral + Y * (Width + 1) + 1;                //    previous row of the integral image
		int *LinePD = Integral + (Y + 1) * (Width + 1) + 1;           //    current row; the first column of every row is 0
		LinePD[-1] = 0;                                               //    first column is 0
		for (int X = 0, Sum = 0; X < Width; X++)
		{
			Sum += LinePS[X];                                          //    running sum along the row
			LinePD[X] = LinePL[X] + Sum;                               //    update the integral image
		}
	}
}

void GetGrayIntegralImage_SSE(unsigned char *Src, int *Integral, int Width, int Height, int Stride) {
	memset(Integral, 0, (Width + 1) * sizeof(int)); // the first row is all zeros
	int BlockSize = 8, Block = Width / BlockSize;
	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Stride;
		int *LinePL = Integral + Y * (Width + 1) + 1; // previous row of the integral image
		int *LinePD = Integral + (Y + 1) * (Width + 1) + 1; // current row; the first column of every row is 0
		LinePD[-1] = 0;
		__m128i PreV = _mm_setzero_si128();
		__m128i Zero = _mm_setzero_si128();
		for (int X = 0; X < Block * BlockSize; X += BlockSize) {
			__m128i Src_Shift0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i*)(LinePS + X)), Zero); //A7 A6 A5 A4 A3 A2 A1 A0 (16-bit lanes, highest lane first)
			__m128i Src_Shift1 = _mm_slli_si128(Src_Shift0, 2); //A6 A5 A4 A3 A2 A1 A0 0
			__m128i Src_Shift2 = _mm_slli_si128(Src_Shift1, 2); //A5 A4 A3 A2 A1 A0 0  0
			__m128i Src_Shift3 = _mm_slli_si128(Src_Shift2, 2); //A4 A3 A2 A1 A0 0  0  0
			__m128i Shift_Add12 = _mm_add_epi16(Src_Shift1, Src_Shift2); //A6+A5 A5+A4 A4+A3 A3+A2 A2+A1 A1+A0 A0+0  0+0
			__m128i Shift_Add03 = _mm_add_epi16(Src_Shift0, Src_Shift3); //A7+A4 A6+A3 A5+A2 A4+A1 A3+A0 A2+0  A1+0  A0+0 
			__m128i Low = _mm_add_epi16(Shift_Add12, Shift_Add03); //A7+A6+A5+A4 A6+A5+A4+A3 A5+A4+A3+A2 A4+A3+A2+A1 A3+A2+A1+A0 A2+A1+A0+0 A1+A0+0+0 A0+0+0+0
			__m128i High = _mm_add_epi32(_mm_unpackhi_epi16(Low, Zero), _mm_unpacklo_epi16(Low, Zero)); //A7+A6+A5+A4+A3+A2+A1+A0  A6+A5+A4+A3+A2+A1+A0  A5+A4+A3+A2+A1+A0  A4+A3+A2+A1+A0
			__m128i SumL = _mm_loadu_si128((__m128i *)(LinePL + X + 0));
			__m128i SumH = _mm_loadu_si128((__m128i *)(LinePL + X + 4));
			SumL = _mm_add_epi32(SumL, PreV);
			SumL = _mm_add_epi32(SumL, _mm_unpacklo_epi16(Low, Zero));
			SumH = _mm_add_epi32(SumH, PreV);
			SumH = _mm_add_epi32(SumH, High);
			PreV = _mm_add_epi32(PreV, _mm_shuffle_epi32(High, _MM_SHUFFLE(3, 3, 3, 3)));
			_mm_storeu_si128((__m128i *)(LinePD + X + 0), SumL);
			_mm_storeu_si128((__m128i *)(LinePD + X + 4), SumH);
		}
		for (int X = Block * BlockSize, V = LinePD[X - 1] - LinePL[X - 1]; X < Width; X++)
		{
			V += LinePS[X];
			LinePD[X] = V + LinePL[X];
		}
	}
}
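
// Sketch (not part of the original source): once the (Width + 1) x (Height + 1)
// integral image has been built by either function above, the sum over any
// axis-aligned rectangle takes four lookups. IntegralRectSum is a hypothetical
// helper; it assumes the layout used above (first row and first column zero).

```cpp
// Sum of the pixels in columns [X1, X2) and rows [Y1, Y2), read from an integral
// image with (Width + 1) ints per row whose first row and first column are zero.
int IntegralRectSum(const int *Integral, int Width, int X1, int Y1, int X2, int Y2)
{
	const int *LineP1 = Integral + Y1 * (Width + 1);   // row just above the rectangle
	const int *LineP2 = Integral + Y2 * (Width + 1);   // last row of the rectangle
	return LineP2[X2] - LineP1[X2] - LineP2[X1] + LineP1[X1];
}
```

// For the 2 x 2 image {1, 2; 3, 4} the integral image is {0,0,0; 0,1,3; 0,4,10},
// and IntegralRectSum(I, 2, 1, 0, 2, 2) returns 6, the sum of the right column.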

void BoxBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) {
	int *Integral = (int *)malloc((Width + 1) * (Height + 1) * sizeof(int));
	GetGrayIntegralImage(Src, Integral, Width, Height, Stride);
//#pragma parallel for num_threads(4)
	for (int Y = 0; Y < Height; Y++) {
		int Y1 = max(Y - Radius, 0);
		int Y2 = min(Y + Radius + 1, Height);		// the integral image has Height + 1 rows, so Height is a valid row index
		int *LineP1 = Integral + Y1 * (Width + 1);
		int *LineP2 = Integral + Y2 * (Width + 1);
		unsigned char *LinePD = Dest + Y * Stride;
		for (int X = 0; X < Width; X++) {
			int X1 = max(X - Radius, 0);
			int X2 = min(X + Radius + 1, Width);
			int Sum = LineP2[X2] - LineP1[X2] - LineP2[X1] + LineP1[X1];
			int PixelCount = (X2 - X1) * (Y2 - Y1);
			LinePD[X] = (Sum + (PixelCount >> 1)) / PixelCount;
		}
	}
	free(Integral);
}

int main() {
	Mat src = imread("F:\\car.jpg", 0);
	int Height = src.rows;
	int Width = src.cols;
	unsigned char *Src = src.data;
	unsigned char *Dest = new unsigned char[Height * Width];
	int Stride = Width;
	int Radius = 11;
	int64 st = cv::getTickCount();
	for (int i = 0; i < 10; i++) {
		BoxBlur(Src, Dest, Width, Height, Stride, Radius);
	}
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;	// 10 runs: seconds * 100 = average milliseconds per run
	printf("%.5f\n", duration);
	BoxBlur(Src, Dest, Width, Height, Stride, Radius);
	Mat dst(Height, Width, CV_8UC1, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
	delete[] Dest;
	return 0;
}

================================================
FILE: speed_max_filter_sse.cpp
================================================
#include <stdio.h>
#include <opencv2/opencv.hpp>
#include "../../OpencvTest/OpencvTest/Core.h"
#include "../../OpencvTest/OpencvTest/MaxFilter.h"
#include "../../OpencvTest/OpencvTest/Utility.h"
using namespace std;
using namespace cv;

void MaxFilter_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) {
	TMatrix a, b;
	TMatrix *p1 = &a, *p2 = &b;
	TMatrix **p3 = &p1, **p4 = &p2;
	IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3);
	IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4);
	(p1)->Data = Src;
	(p2)->Data = Dest;
	MaxFilter(p1, p2, Radius);
}

Mat MaxFilter(Mat src, int radius) {
	int row = src.rows;
	int col = src.cols;
	int border = (radius - 1) / 2;
	Mat dst = src.clone();			// border pixels keep their source values instead of being left uninitialized
	printf("success\n");
	for (int i = border; i + border < row; i++) {
		for (int j = border; j + border < col; j++) {
			for (int k = 0; k < 3; k++) {
				int val = src.at<Vec3b>(i, j)[k];
				for (int x = -border; x <= border; x++) {
					for (int y = -border; y <= border; y++) {
						val = max(val, (int)src.at<Vec3b>(i + x, j + y)[k]);
					}
				}
				dst.at<Vec3b>(i, j)[k] = val;
			}
		}
	}
	printf("success\n");
	return dst;
}

int main() {
	Mat src = imread("F:\\car.jpg");
	int Height = src.rows;
	int Width = src.cols;
	unsigned char *Src = src.data;
	unsigned char *Dest = new unsigned char[Height * Width * 3];
	int Stride = Width * 3;
	int Radius = 11;
	int64 st = cv::getTickCount();
	for (int i = 0; i <10; i++) {
		Mat temp = MaxFilter(src, Radius);
		//MaxFilter_SSE(Src, Dest, Width, Height, Stride, 3, Radius);
	}
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;	// 10 runs: seconds * 100 = average milliseconds per run
	printf("%.5f\n", duration);
	MaxFilter_SSE(Src, Dest, Width, Height, Stride, 3, Radius);
	Mat dst(Height, Width, CV_8UC3, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
	return 0;
}

================================================
FILE: speed_median_filter_3x3_sse.cpp
================================================
#include "stdafx.h"
#include <stdio.h>
#include <opencv2/opencv.hpp>
#include <immintrin.h>   // SSE2 and AVX2 intrinsics used below
using namespace std;
using namespace cv;

int ComparisonFunction(const void *X, const void *Y) {
	unsigned char Dx = *(unsigned char *)X;
	unsigned char Dy = *(unsigned char *)Y;
	if (Dx < Dy) return -1;
	else if (Dx > Dy) return 1;
	else return 0;
}

void MedianBlur3X3_Ori(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	int Channel = Stride / Width;
	if (Channel == 1) {
		unsigned char Array[9];
		for (int Y = 1; Y < Height - 1; Y++) {
			unsigned char *LineP0 = Src + (Y - 1) * Stride;		// no +1 here: the loop index X already starts at column 1
			unsigned char *LineP1 = LineP0 + Stride;
			unsigned char *LineP2 = LineP1 + Stride;
			unsigned char *LinePD = Dest + Y * Stride;
			for (int X = 1; X < Width - 1; X++) {
				Array[0] = LineP0[X - 1];        Array[1] = LineP0[X];    Array[2] = LineP0[X + 1];
				Array[3] = LineP1[X - 1];        Array[4] = LineP1[X];    Array[5] = LineP1[X + 1];
				Array[6] = LineP2[X - 1];        Array[7] = LineP2[X];    Array[8] = LineP2[X + 1];
				qsort(Array, 9, sizeof(unsigned char), &ComparisonFunction);
				LinePD[X] = Array[4];
			}
		}
	}
	else {
		unsigned char ArrayB[9], ArrayG[9], ArrayR[9];
		for (int Y = 1; Y < Height - 1; Y++) {
			unsigned char *LineP0 = Src + (Y - 1) * Stride + 3;
			unsigned char *LineP1 = LineP0 + Stride;
			unsigned char *LineP2 = LineP1 + Stride;
			unsigned char *LinePD = Dest + Y * Stride + 3;
			for (int X = 1; X < Width - 1; X++) {
				ArrayB[0] = LineP0[-3];       ArrayG[0] = LineP0[-2];       ArrayR[0] = LineP0[-1];
				ArrayB[1] = LineP0[0];        ArrayG[1] = LineP0[1];        ArrayR[1] = LineP0[2];
				ArrayB[2] = LineP0[3];        ArrayG[2] = LineP0[4];        ArrayR[2] = LineP0[5];

				ArrayB[3] = LineP1[-3];       ArrayG[3] = LineP1[-2];       ArrayR[3] = LineP1[-1];
				ArrayB[4] = LineP1[0];        ArrayG[4] = LineP1[1];        ArrayR[4] = LineP1[2];
				ArrayB[5] = LineP1[3];        ArrayG[5] = LineP1[4];        ArrayR[5] = LineP1[5];

				ArrayB[6] = LineP2[-3];       ArrayG[6] = LineP2[-2];       ArrayR[6] = LineP2[-1];
				ArrayB[7] = LineP2[0];        ArrayG[7] = LineP2[1];        ArrayR[7] = LineP2[2];
				ArrayB[8] = LineP2[3];        ArrayG[8] = LineP2[4];        ArrayR[8] = LineP2[5];

				qsort(ArrayB, 9, sizeof(unsigned char), &ComparisonFunction);
				qsort(ArrayG, 9, sizeof(unsigned char), &ComparisonFunction);
				qsort(ArrayR, 9, sizeof(unsigned char), &ComparisonFunction);

				LinePD[0] = ArrayB[4];
				LinePD[1] = ArrayG[4];
				LinePD[2] = ArrayR[4];

				LineP0 += 3;
				LineP1 += 3;
				LineP2 += 3;
				LinePD += 3;
			}
		}
	}
}

void Swap(int &X, int &Y) {
	X ^= Y;
	Y ^= X;
	X ^= Y;
}

void MedianBlur3X3_Faster(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	int Channel = Stride / Width;
	if (Channel == 1) {

		for (int Y = 1; Y < Height - 1; Y++) {
			unsigned char *LineP0 = Src + (Y - 1) * Stride;		// no +1 here: the loop index X already starts at column 1
			unsigned char *LineP1 = LineP0 + Stride;
			unsigned char *LineP2 = LineP1 + Stride;
			unsigned char *LinePD = Dest + Y * Stride;
			for (int X = 1; X < Width - 1; X++) {
				int Gray0, Gray1, Gray2, Gray3, Gray4, Gray5, Gray6, Gray7, Gray8;
				Gray0 = LineP0[X - 1];        Gray1 = LineP0[X];    Gray2 = LineP0[X + 1];
				Gray3 = LineP1[X - 1];        Gray4 = LineP1[X];    Gray5 = LineP1[X + 1];
				Gray6 = LineP2[X - 1];        Gray7 = LineP2[X];    Gray8 = LineP2[X + 1];

				if (Gray1 > Gray2) Swap(Gray1, Gray2);
				if (Gray4 > Gray5) Swap(Gray4, Gray5);
				if (Gray7 > Gray8) Swap(Gray7, Gray8);
				if (Gray0 > Gray1) Swap(Gray0, Gray1);
				if (Gray3 > Gray4) Swap(Gray3, Gray4);
				if (Gray6 > Gray7) Swap(Gray6, Gray7);
				if (Gray1 > Gray2) Swap(Gray1, Gray2);
				if (Gray4 > Gray5) Swap(Gray4, Gray5);
				if (Gray7 > Gray8) Swap(Gray7, Gray8);
				if (Gray0 > Gray3) Swap(Gray0, Gray3);
				if (Gray5 > Gray8) Swap(Gray5, Gray8);
				if (Gray4 > Gray7) Swap(Gray4, Gray7);
				if (Gray3 > Gray6) Swap(Gray3, Gray6);
				if (Gray1 > Gray4) Swap(Gray1, Gray4);
				if (Gray2 > Gray5) Swap(Gray2, Gray5);
				if (Gray4 > Gray7) Swap(Gray4, Gray7);
				if (Gray4 > Gray2) Swap(Gray4, Gray2);
				if (Gray6 > Gray4) Swap(Gray6, Gray4);
				if (Gray4 > Gray2) Swap(Gray4, Gray2);

				LinePD[X] = Gray4;
			}
		}

	}
	else {
		for (int Y = 1; Y < Height - 1; Y++) {
			unsigned char *LineP0 = Src + (Y - 1) * Stride + 3;
			unsigned char *LineP1 = LineP0 + Stride;
			unsigned char *LineP2 = LineP1 + Stride;
			unsigned char *LinePD = Dest + Y * Stride + 3;
			for (int X = 1; X < Width - 1; X++) {
				int Blue0, Blue1, Blue2, Blue3, Blue4, Blue5, Blue6, Blue7, Blue8;
				int Green0, Green1, Green2, Green3, Green4, Green5, Green6, Green7, Green8;
				int Red0, Red1, Red2, Red3, Red4, Red5, Red6, Red7, Red8;
				Blue0 = LineP0[-3];        Green0 = LineP0[-2];    Red0 = LineP0[-1];
				Blue1 = LineP0[0];        Green1 = LineP0[1];        Red1 = LineP0[2];
				Blue2 = LineP0[3];        Green2 = LineP0[4];        Red2 = LineP0[5];

				Blue3 = LineP1[-3];        Green3 = LineP1[-2];    Red3 = LineP1[-1];
				Blue4 = LineP1[0];        Green4 = LineP1[1];        Red4 = LineP1[2];
				Blue5 = LineP1[3];        Green5 = LineP1[4];        Red5 = LineP1[5];

				Blue6 = LineP2[-3];        Green6 = LineP2[-2];    Red6 = LineP2[-1];
				Blue7 = LineP2[0];        Green7 = LineP2[1];        Red7 = LineP2[2];
				Blue8 = LineP2[3];        Green8 = LineP2[4];        Red8 = LineP2[5];

				if (Blue1 > Blue2) Swap(Blue1, Blue2);
				if (Blue4 > Blue5) Swap(Blue4, Blue5);
				if (Blue7 > Blue8) Swap(Blue7, Blue8);
				if (Blue0 > Blue1) Swap(Blue0, Blue1);
				if (Blue3 > Blue4) Swap(Blue3, Blue4);
				if (Blue6 > Blue7) Swap(Blue6, Blue7);
				if (Blue1 > Blue2) Swap(Blue1, Blue2);
				if (Blue4 > Blue5) Swap(Blue4, Blue5);
				if (Blue7 > Blue8) Swap(Blue7, Blue8);
				if (Blue0 > Blue3) Swap(Blue0, Blue3);
				if (Blue5 > Blue8) Swap(Blue5, Blue8);
				if (Blue4 > Blue7) Swap(Blue4, Blue7);
				if (Blue3 > Blue6) Swap(Blue3, Blue6);
				if (Blue1 > Blue4) Swap(Blue1, Blue4);
				if (Blue2 > Blue5) Swap(Blue2, Blue5);
				if (Blue4 > Blue7) Swap(Blue4, Blue7);
				if (Blue4 > Blue2) Swap(Blue4, Blue2);
				if (Blue6 > Blue4) Swap(Blue6, Blue4);
				if (Blue4 > Blue2) Swap(Blue4, Blue2);

				if (Green1 > Green2) Swap(Green1, Green2);
				if (Green4 > Green5) Swap(Green4, Green5);
				if (Green7 > Green8) Swap(Green7, Green8);
				if (Green0 > Green1) Swap(Green0, Green1);
				if (Green3 > Green4) Swap(Green3, Green4);
				if (Green6 > Green7) Swap(Green6, Green7);
				if (Green1 > Green2) Swap(Green1, Green2);
				if (Green4 > Green5) Swap(Green4, Green5);
				if (Green7 > Green8) Swap(Green7, Green8);
				if (Green0 > Green3) Swap(Green0, Green3);
				if (Green5 > Green8) Swap(Green5, Green8);
				if (Green4 > Green7) Swap(Green4, Green7);
				if (Green3 > Green6) Swap(Green3, Green6);
				if (Green1 > Green4) Swap(Green1, Green4);
				if (Green2 > Green5) Swap(Green2, Green5);
				if (Green4 > Green7) Swap(Green4, Green7);
				if (Green4 > Green2) Swap(Green4, Green2);
				if (Green6 > Green4) Swap(Green6, Green4);
				if (Green4 > Green2) Swap(Green4, Green2);

				if (Red1 > Red2) Swap(Red1, Red2);
				if (Red4 > Red5) Swap(Red4, Red5);
				if (Red7 > Red8) Swap(Red7, Red8);
				if (Red0 > Red1) Swap(Red0, Red1);
				if (Red3 > Red4) Swap(Red3, Red4);
				if (Red6 > Red7) Swap(Red6, Red7);
				if (Red1 > Red2) Swap(Red1, Red2);
				if (Red4 > Red5) Swap(Red4, Red5);
				if (Red7 > Red8) Swap(Red7, Red8);
				if (Red0 > Red3) Swap(Red0, Red3);
				if (Red5 > Red8) Swap(Red5, Red8);
				if (Red4 > Red7) Swap(Red4, Red7);
				if (Red3 > Red6) Swap(Red3, Red6);
				if (Red1 > Red4) Swap(Red1, Red4);
				if (Red2 > Red5) Swap(Red2, Red5);
				if (Red4 > Red7) Swap(Red4, Red7);
				if (Red4 > Red2) Swap(Red4, Red2);
				if (Red6 > Red4) Swap(Red6, Red4);
				if (Red4 > Red2) Swap(Red4, Red2);

				LinePD[0] = Blue4;
				LinePD[1] = Green4;
				LinePD[2] = Red4;

				LineP0 += 3;
				LineP1 += 3;
				LineP2 += 3;
				LinePD += 3;
			}
		}
	}
}

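// Compare-exchange on 16 unsigned bytes at once: afterwards a holds the per-byte minimum and b the per-byte maximum.
inline void _mm_sort_ab(__m128i &a, __m128i &b) {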
	const __m128i min = _mm_min_epu8(a, b);
	const __m128i max = _mm_max_epu8(a, b);
	a = min;
	b = max;
}
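
// Sketch (not part of the original source): _mm_sort_ab is the vector form of a
// compare-exchange step. The scalar version below applies the same 19 steps, in
// the same order as the functions in this file, to nine values and returns the
// median. SortPair and Median9 are hypothetical names used only for illustration.

```cpp
#include <algorithm>

// Compare-exchange: afterwards A <= B. Scalar analogue of _mm_sort_ab.
static inline void SortPair(unsigned char &A, unsigned char &B)
{
	const unsigned char Lo = std::min(A, B);
	const unsigned char Hi = std::max(A, B);
	A = Lo;
	B = Hi;
}

// Median of nine values via a fixed 19-step compare-exchange network.
// The array P is modified in place; the median ends up in P[4].
unsigned char Median9(unsigned char P[9])
{
	SortPair(P[1], P[2]); SortPair(P[4], P[5]); SortPair(P[7], P[8]);
	SortPair(P[0], P[1]); SortPair(P[3], P[4]); SortPair(P[6], P[7]);
	SortPair(P[1], P[2]); SortPair(P[4], P[5]); SortPair(P[7], P[8]);
	SortPair(P[0], P[3]); SortPair(P[5], P[8]); SortPair(P[4], P[7]);
	SortPair(P[3], P[6]); SortPair(P[1], P[4]); SortPair(P[2], P[5]);
	SortPair(P[4], P[7]); SortPair(P[4], P[2]); SortPair(P[6], P[4]);
	SortPair(P[4], P[2]);
	return P[4];
}
```

// Because every step is a min/max pair with no data-dependent branch, the same
// sequence maps directly onto _mm_min_epu8 / _mm_max_epu8, which is what lets the
// SSE version process 16 neighbourhoods per pass.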

void MedianBlur3X3_Fastest(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	int Channel = Stride / Width;
	int BlockSize = 16, Block = ((Width - 2)* Channel) / BlockSize;
	for (int Y = 1; Y < Height - 1; Y++) {
		unsigned char *LineP0 = Src + (Y - 1) * Stride + Channel;
		unsigned char *LineP1 = LineP0 + Stride;
		unsigned char *LineP2 = LineP1 + Stride;
		unsigned char *LinePD = Dest + Y * Stride + Channel;
		for (int X = 0; X < Block * BlockSize; X += BlockSize, LineP0 += BlockSize, LineP1 += BlockSize, LineP2 += BlockSize, LinePD += BlockSize)
		{
			__m128i P0 = _mm_loadu_si128((__m128i *)(LineP0 - Channel));
			__m128i P1 = _mm_loadu_si128((__m128i *)(LineP0 - 0));
			__m128i P2 = _mm_loadu_si128((__m128i *)(LineP0 + Channel));
			__m128i P3 = _mm_loadu_si128((__m128i *)(LineP1 - Channel));
			__m128i P4 = _mm_loadu_si128((__m128i *)(LineP1 - 0));
			__m128i P5 = _mm_loadu_si128((__m128i *)(LineP1 + Channel));
			__m128i P6 = _mm_loadu_si128((__m128i *)(LineP2 - Channel));
			__m128i P7 = _mm_loadu_si128((__m128i *)(LineP2 - 0));
			__m128i P8 = _mm_loadu_si128((__m128i *)(LineP2 + Channel));

			_mm_sort_ab(P1, P2);		_mm_sort_ab(P4, P5);		_mm_sort_ab(P7, P8);
			_mm_sort_ab(P0, P1);		_mm_sort_ab(P3, P4);		_mm_sort_ab(P6, P7);
			_mm_sort_ab(P1, P2);		_mm_sort_ab(P4, P5);		_mm_sort_ab(P7, P8);
			_mm_sort_ab(P0, P3);		_mm_sort_ab(P5, P8);		_mm_sort_ab(P4, P7);
			_mm_sort_ab(P3, P6);		_mm_sort_ab(P1, P4);		_mm_sort_ab(P2, P5);
			_mm_sort_ab(P4, P7);		_mm_sort_ab(P4, P2);		_mm_sort_ab(P6, P4);
			_mm_sort_ab(P4, P2);

			_mm_storeu_si128((__m128i *)LinePD, P4);
		}

		for (int X = Block * BlockSize; X < (Width - 2) * Channel; X++, LinePD++) {
			int Gray0, Gray1, Gray2, Gray3, Gray4, Gray5, Gray6, Gray7, Gray8;
			Gray0 = LineP0[X - Block * BlockSize - Channel];        Gray1 = LineP0[X - Block * BlockSize];    Gray2 = LineP0[X - Block * BlockSize + Channel];
			Gray3 = LineP1[X - Block * BlockSize - Channel];        Gray4 = LineP1[X - Block * BlockSize];    Gray5 = LineP1[X - Block * BlockSize + Channel];
			Gray6 = LineP2[X - Block * BlockSize - Channel];        Gray7 = LineP2[X - Block * BlockSize];    Gray8 = LineP2[X - Block * BlockSize + Channel];

			if (Gray1 > Gray2) Swap(Gray1, Gray2);
			if (Gray4 > Gray5) Swap(Gray4, Gray5);
			if (Gray7 > Gray8) Swap(Gray7, Gray8);
			if (Gray0 > Gray1) Swap(Gray0, Gray1);
			if (Gray3 > Gray4) Swap(Gray3, Gray4);
			if (Gray6 > Gray7) Swap(Gray6, Gray7);
			if (Gray1 > Gray2) Swap(Gray1, Gray2);
			if (Gray4 > Gray5) Swap(Gray4, Gray5);
			if (Gray7 > Gray8) Swap(Gray7, Gray8);
			if (Gray0 > Gray3) Swap(Gray0, Gray3);
			if (Gray5 > Gray8) Swap(Gray5, Gray8);
			if (Gray4 > Gray7) Swap(Gray4, Gray7);
			if (Gray3 > Gray6) Swap(Gray3, Gray6);
			if (Gray1 > Gray4) Swap(Gray1, Gray4);
			if (Gray2 > Gray5) Swap(Gray2, Gray5);
			if (Gray4 > Gray7) Swap(Gray4, Gray7);
			if (Gray4 > Gray2) Swap(Gray4, Gray2);
			if (Gray6 > Gray4) Swap(Gray6, Gray4);
			if (Gray4 > Gray2) Swap(Gray4, Gray2);

			LinePD[0] = Gray4;	// LinePD already points at the current output byte
			LineP0 += 1;
			LineP1 += 1;
			LineP2 += 1;
		}
	}
}

inline void _mm_sort_AB(__m256i &a, __m256i &b) {
	const __m256i min = _mm256_min_epu8(a, b);
	const __m256i max = _mm256_max_epu8(a, b);
	a = min;
	b = max;
}

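// Note: _mm256_min_epu8 / _mm256_max_epu8 are AVX2 instructions, so this kernel needs AVX2 despite its name.
void MedianBlur3X3_Fastest_AVX(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {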
	int Channel = Stride / Width;
	int BlockSize = 32, Block = ((Width - 2)* Channel) / BlockSize;
	for (int Y = 1; Y < Height - 1; Y++) {
		unsigned char *LineP0 = Src + (Y - 1) * Stride + Channel;
		unsigned char *LineP1 = LineP0 + Stride;
		unsigned char *LineP2 = LineP1 + Stride;
		unsigned char *LinePD = Dest + Y * Stride + Channel;
		for (int X = 0; X < Block * BlockSize; X += BlockSize, LineP0 += BlockSize, LineP1 += BlockSize, LineP2 += BlockSize, LinePD += BlockSize)
		{
			__m256i P0 = _mm256_loadu_si256((const __m256i*)(LineP0 - Channel));
			__m256i P1 = _mm256_loadu_si256((const __m256i*)(LineP0 - 0));
			__m256i P2 = _mm256_loadu_si256((const __m256i*)(LineP0 + Channel));
			__m256i P3 = _mm256_loadu_si256((const __m256i*)(LineP1 - Channel));
			__m256i P4 = _mm256_loadu_si256((const __m256i*)(LineP1 - 0));
			__m256i P5 = _mm256_loadu_si256((const __m256i*)(LineP1 + Channel));
			__m256i P6 = _mm256_loadu_si256((const __m256i*)(LineP2 - Channel));
			__m256i P7 = _mm256_loadu_si256((const __m256i*)(LineP2 - 0));
			__m256i P8 = _mm256_loadu_si256((const __m256i*)(LineP2 + Channel));

			_mm_sort_AB(P1, P2);		_mm_sort_AB(P4, P5);		_mm_sort_AB(P7, P8);
			_mm_sort_AB(P0, P1);		_mm_sort_AB(P3, P4);		_mm_sort_AB(P6, P7);
			_mm_sort_AB(P1, P2);		_mm_sort_AB(P4, P5);		_mm_sort_AB(P7, P8);
			_mm_sort_AB(P0, P3);		_mm_sort_AB(P5, P8);		_mm_sort_AB(P4, P7);
			_mm_sort_AB(P3, P6);		_mm_sort_AB(P1, P4);		_mm_sort_AB(P2, P5);
			_mm_sort_AB(P4, P7);		_mm_sort_AB(P4, P2);		_mm_sort_AB(P6, P4);
			_mm_sort_AB(P4, P2);

			_mm256_storeu_si256((__m256i *)LinePD, P4);
		}

		// Scalar tail: the row pointers already sit on the first unprocessed byte and
		// advance by one per iteration, so every read is relative to the pointer itself.
		for (int X = Block * BlockSize; X < (Width - 2) * Channel; X++, LinePD++) {
			int Gray0, Gray1, Gray2, Gray3, Gray4, Gray5, Gray6, Gray7, Gray8;
			Gray0 = LineP0[-Channel];        Gray1 = LineP0[0];    Gray2 = LineP0[Channel];
			Gray3 = LineP1[-Channel];        Gray4 = LineP1[0];    Gray5 = LineP1[Channel];
			Gray6 = LineP2[-Channel];        Gray7 = LineP2[0];    Gray8 = LineP2[Channel];

			if (Gray1 > Gray2) Swap(Gray1, Gray2);
			if (Gray4 > Gray5) Swap(Gray4, Gray5);
			if (Gray7 > Gray8) Swap(Gray7, Gray8);
			if (Gray0 > Gray1) Swap(Gray0, Gray1);
			if (Gray3 > Gray4) Swap(Gray3, Gray4);
			if (Gray6 > Gray7) Swap(Gray6, Gray7);
			if (Gray1 > Gray2) Swap(Gray1, Gray2);
			if (Gray4 > Gray5) Swap(Gray4, Gray5);
			if (Gray7 > Gray8) Swap(Gray7, Gray8);
			if (Gray0 > Gray3) Swap(Gray0, Gray3);
			if (Gray5 > Gray8) Swap(Gray5, Gray8);
			if (Gray4 > Gray7) Swap(Gray4, Gray7);
			if (Gray3 > Gray6) Swap(Gray3, Gray6);
			if (Gray1 > Gray4) Swap(Gray1, Gray4);
			if (Gray2 > Gray5) Swap(Gray2, Gray5);
			if (Gray4 > Gray7) Swap(Gray4, Gray7);
			if (Gray4 > Gray2) Swap(Gray4, Gray2);
			if (Gray6 > Gray4) Swap(Gray6, Gray4);
			if (Gray4 > Gray2) Swap(Gray4, Gray2);

			LinePD[0] = Gray4;	// LinePD already points at the current output byte
			LineP0 += 1;
			LineP1 += 1;
			LineP2 += 1;
		}
	}
}
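
// A scalar reference for the 19-comparator median-of-9 network used by the SIMD
// kernels above may help when sanity-checking their output on small inputs.
// This is only a sketch: Sketch_SortPair and Sketch_Median9 are hypothetical
// helper names that do not exist elsewhere in this repository.
#include <algorithm>

// Same contract as _mm_sort_AB, but for a single lane: a = min(a, b), b = max(a, b).
static inline void Sketch_SortPair(unsigned char &a, unsigned char &b) {
	const unsigned char lo = std::min(a, b);
	const unsigned char hi = std::max(a, b);
	a = lo;
	b = hi;
}

// Returns the median of the nine values in p. The array is reordered in place.
// The comparator order mirrors the one used in MedianBlur3X3_Fastest_AVX.
static inline unsigned char Sketch_Median9(unsigned char p[9]) {
	Sketch_SortPair(p[1], p[2]);	Sketch_SortPair(p[4], p[5]);	Sketch_SortPair(p[7], p[8]);
	Sketch_SortPair(p[0], p[1]);	Sketch_SortPair(p[3], p[4]);	Sketch_SortPair(p[6], p[7]);
	Sketch_SortPair(p[1], p[2]);	Sketch_SortPair(p[4], p[5]);	Sketch_SortPair(p[7], p[8]);
	Sketch_SortPair(p[0], p[3]);	Sketch_SortPair(p[5], p[8]);	Sketch_SortPair(p[4], p[7]);
	Sketch_SortPair(p[3], p[6]);	Sketch_SortPair(p[1], p[4]);	Sketch_SortPair(p[2], p[5]);
	Sketch_SortPair(p[4], p[7]);	Sketch_SortPair(p[4], p[2]);	Sketch_SortPair(p[6], p[4]);
	Sketch_SortPair(p[4], p[2]);
	return p[4];
}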

int main() {
	Mat src = imread("F:\\car.jpg");
	int Height = src.rows;
	int Width = src.cols;
	unsigned char *Src = src.data;
	unsigned char *Dest = new unsigned char[Height * Width * 3];
	int Stride = Width * 3;
	int Radius = 7;
	int64 st = cv::getTickCount();
	for (int i = 0; i <10; i++) {
		//Mat temp = MaxFilter(src, Radius);
		MedianBlur3X3_Fastest_AVX(Src, Dest, Width, Height, Stride);
	}
	// Seconds for 10 runs * 100 = average milliseconds per run.
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;
	printf("%.5f\n", duration);
	MedianBlur3X3_Fastest_AVX(Src, Dest, Width, Height, Stride);
	Mat dst(Height, Width, CV_8UC3, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
	return 0;
}


================================================
FILE: speed_multi_scale_detail_boosting_see.cpp
================================================
#include <stdio.h>
#include <opencv2/opencv.hpp>
#include "../../OpencvTest/OpencvTest/Core.h"
#include "../../OpencvTest/OpencvTest/MaxFilter.h"
#include "../../OpencvTest/OpencvTest/Utility.h"
#include "../../OpencvTest/OpencvTest/BoxFilter.h"
using namespace std;
using namespace cv;
#define __SSSE3__ 1

void BoxBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) {
	TMatrix a, b;
	TMatrix *p1 = &a, *p2 = &b;
	TMatrix **p3 = &p1, **p4 = &p2;
	IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3);
	IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4);
	(p1)->Data = Src;
	(p2)->Data = Dest;
	BoxBlur_SSE(p1, p2, Radius, EdgeMode::Smear);
}

int IM_Sign(int X) {
	return (X >> 31) | (unsigned(-X)) >> 31;
}

inline unsigned char IM_ClampToByte(int Value)
{
	if (Value < 0)
		return 0;
	else if (Value > 255)
		return 255;
	else
		return (unsigned char)Value;
	//return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));
}


inline __m128i _mm_sgn_epi16(__m128i v) {
#ifdef __SSSE3__
	v = _mm_sign_epi16(_mm_set1_epi16(1), v); // use PSIGNW on SSSE3 and later
#else
	v = _mm_min_epi16(v, _mm_set1_epi16(1));  // use PMINSW/PMAXSW on SSE2/SSE3.
	v = _mm_max_epi16(v, _mm_set1_epi16(-1));
	//_mm_set1_epi16(1) = _mm_srli_epi16(_mm_cmpeq_epi16(v, v), 15);
	//_mm_set1_epi16(-1) = _mm_cmpeq_epi16(v, v);

#endif
	return v;
}

void MultiScaleSharpen(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) {
	int Channel = Stride / Width;
	unsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));
	unsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));
	unsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));
	// Argument order follows the wrapper's signature: (..., Stride, Channel, Radius).
	BoxBlur_SSE(Src, B1, Width, Height, Stride, Channel, Radius);
	BoxBlur_SSE(Src, B2, Width, Height, Stride, Channel, Radius * 2);
	BoxBlur_SSE(Src, B3, Width, Height, Stride, Channel, Radius * 4);
	for (int Y = 0; Y < Height * Stride; Y++) {
		int DiffB1 = Src[Y] - B1[Y];
		int DiffB2 = B1[Y] - B2[Y];
		int DiffB3 = B2[Y] - B3[Y];
		Dest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]);
	}
	// Release the three temporary blur buffers.
	free(B1);
	free(B2);
	free(B3);
}

void MultiScaleSharpen_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) {
	int Channel = Stride / Width;
	unsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));
	unsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));
	unsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));
	// Argument order follows the wrapper's signature: (..., Stride, Channel, Radius).
	BoxBlur_SSE(Src, B1, Width, Height, Stride, Channel, Radius);
	BoxBlur_SSE(Src, B2, Width, Height, Stride, Channel, Radius * 2);
	BoxBlur_SSE(Src, B3, Width, Height, Stride, Channel, Radius * 4);
	int BlockSize = 8, Block = (Height * Stride) / BlockSize;
	__m128i Zero = _mm_setzero_si128();
	__m128i Four = _mm_set1_epi16(4);
	for (int Y = 0; Y < Block * BlockSize; Y += BlockSize) {
		__m128i SrcV = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Src + Y)), Zero);
		__m128i SrcB1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B1 + Y)), Zero);
		__m128i SrcB2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B2 + Y)), Zero);
		__m128i SrcB3 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B3 + Y)), Zero);
		__m128i DiffB1 = _mm_sub_epi16(SrcV, SrcB1);
		__m128i DiffB2 = _mm_sub_epi16(SrcB1, SrcB2);
		__m128i DiffB3 = _mm_sub_epi16(SrcB2, SrcB3);
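		// Direct form: Offset = ((4 - 2 * sgn(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) >> 2
		//__m128i Offset = _mm_srai_epi16(_mm_add_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_sub_epi16(Four, _mm_slli_epi16(_mm_sgn_epi16(DiffB1), 1)), DiffB1), _mm_slli_epi16(DiffB2, 1)), DiffB3), 2);
		// Regrouped form used below: since sgn(d) * d = |d| and 2 * DiffB2 + DiffB3 = 2 * B1 - B2 - B3,
		// Offset = ((2 * (B1 - |DiffB1|) - (B2 + B3)) >> 2) + DiffB1, which avoids the multiply.
		// _mm_sign_epi16(DiffB1, DiffB1) yields |DiffB1|. The arithmetic shift rounds toward negative infinity,
		// so a result can differ by 1 from the scalar "/ 4", which truncates toward zero.
		__m128i Offset = _mm_add_epi16(_mm_srai_epi16(_mm_sub_epi16(_mm_slli_epi16(_mm_sub_epi16(SrcB1, _mm_sign_epi16(DiffB1, DiffB1)), 1), _mm_add_epi16(SrcB2, SrcB3)), 2), DiffB1);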
		_mm_storel_epi64((__m128i *)(Dest + Y), _mm_packus_epi16(_mm_add_epi16(SrcV, Offset), Zero));
	}
	for (int Y = Block * BlockSize; Y < Height * Stride; Y++) {
		int DiffB1 = Src[Y] - B1[Y];
		int DiffB2 = B1[Y] - B2[Y];
		int DiffB3 = B2[Y] - B3[Y];
		Dest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]);
	}
	// Release the three temporary blur buffers.
	free(B1);
	free(B2);
	free(B3);
}
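
// The SIMD loop above regroups the detail term so that no multiply is needed.
// The two helpers below are a scalar sketch for checking that the regrouped
// numerator equals the direct one for any byte inputs. All Sketch_* names are
// hypothetical and exist only for illustration.
static inline int Sketch_SignOf(int x) {
	return (x > 0) - (x < 0);
}

// Direct numerator, as written in the scalar MultiScaleSharpen loop (before "/ 4").
static inline int Sketch_DetailNumerator(int src, int b1, int b2, int b3) {
	const int d1 = src - b1;
	const int d2 = b1 - b2;
	const int d3 = b2 - b3;
	return (4 - 2 * Sketch_SignOf(d1)) * d1 + 2 * d2 + d3;
}

// Regrouped numerator: 4 * d1 + 2 * (b1 - |d1|) - (b2 + b3).
static inline int Sketch_DetailNumeratorRegrouped(int src, int b1, int b2, int b3) {
	const int d1 = src - b1;
	const int absD1 = d1 < 0 ? -d1 : d1;
	return 4 * d1 + 2 * (b1 - absD1) - (b2 + b3);
}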

int main() {
	Mat src = imread("F:\\car.jpg");
	int Height = src.rows;
	int Width = src.cols;
	unsigned char *Src = src.data;
	unsigned char *Dest = new unsigned char[Height * Width * 3];
	int Stride = Width * 3;
	int Radius = 5;
	int64 st = cv::getTickCount();
	for (int i = 0; i <10; i++) {
		//Mat temp = MaxFilter(src, Radius);
		MultiScaleSharpen_SSE(Src, Dest, Width, Height, Stride, Radius);
	}
	// Seconds for 10 runs * 100 = average milliseconds per run.
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100;
	printf("%.5f\n", duration);
	MultiScaleSharpen(Src, Dest, Width, Height, Stride, Radius);
	Mat dst(Height, Width, CV_8UC3, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
	return 0;
}

================================================
FILE: speed_rgb2gray_sse.cpp
================================================
#include "stdafx.h"
#include <opencv2/opencv.hpp>
#include <future>
#include <thread>
#include <vector>
#include <immintrin.h>
using namespace std;
using namespace cv;

//original floating-point implementation (BGR input)
void RGB2Y(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Width;
		for (int X = 0; X < Width; X++, LinePS += 3) {
			LinePD[X] = int(0.114 * LinePS[0] + 0.587 * LinePS[1] + 0.299 * LinePS[2]);
		}
	}
}

//int
void RGB2Y_1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	const int B_WT = int(0.114 * 256 + 0.5);
	const int G_WT = int(0.587 * 256 + 0.5);
	const int R_WT = 256 - B_WT - G_WT;
	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Width;
		for (int X = 0; X < Width; X++, LinePS += 3) {
			LinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;
		}
	}
}
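
// A small sketch of the 8-bit fixed-point weights used by RGB2Y_1, kept separate
// so the arithmetic can be checked in isolation. Sketch_GrayFixed8 is a
// hypothetical helper, not part of the original file. With a scale of 256 the
// weights come out as B = 29, G = 150, R = 77, and they sum to exactly 256, so
// pure white still maps to 255.
static inline unsigned char Sketch_GrayFixed8(int b, int g, int r) {
	const int B_W = int(0.114 * 256 + 0.5);	// 29
	const int G_W = int(0.587 * 256 + 0.5);	// 150
	const int R_W = 256 - B_W - G_W;		// 77, chosen so the weights sum to 256
	return (unsigned char)((B_W * b + G_W * g + R_W * r) >> 8);
}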

//4-way unrolled loop
void RGB2Y_2(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	const int B_WT = int(0.114 * 256 + 0.5);
	const int G_WT = int(0.587 * 256 + 0.5);
	const int R_WT = 256 - B_WT - G_WT; // int(0.299 * 256 + 0.5)
	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Width;
		int X = 0;
		for (; X < Width - 4; X += 4, LinePS += 12) {
			LinePD[X + 0] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;
			LinePD[X + 1] = (B_WT * LinePS[3] + G_WT * LinePS[4] + R_WT * LinePS[5]) >> 8;
			LinePD[X + 2] = (B_WT * LinePS[6] + G_WT * LinePS[7] + R_WT * LinePS[8]) >> 8;
			LinePD[X + 3] = (B_WT * LinePS[9] + G_WT * LinePS[10] + R_WT * LinePS[11]) >> 8;
		}
		for (; X < Width; X++, LinePS += 3) {
			LinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;
		}
	}
}

//openmp
void RGB2Y_3(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	const int B_WT = int(0.114 * 256 + 0.5);
	const int G_WT = int(0.587 * 256 + 0.5);
	const int R_WT = 256 - B_WT - G_WT;
	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Width;
#pragma omp parallel for num_threads(4)
		for (int X = 0; X < Width; X++) {
			LinePD[X] = (B_WT * LinePS[0 + X*3] + G_WT * LinePS[1 + X*3] + R_WT * LinePS[2 + X*3]) >> 8;
		}
	}
}

//SSE: 12 pixels per iteration
void RGB2Y_4(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	const int B_WT = int(0.114 * 256 + 0.5);
	const int G_WT = int(0.587 * 256 + 0.5);
	const int R_WT = 256 - B_WT - G_WT; // int(0.299 * 256 + 0.5)

	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Width;
		int X = 0;
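		// Each iteration yields 12 pixels, but the loads reach 44 source bytes and the store writes 16 bytes.
		// Stopping while X + 16 <= Width keeps both inside the current row.
		for (; X + 16 <= Width; X += 12, LinePS += 36) {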
			__m128i p1aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 0))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); //1
			__m128i p2aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 1))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); //2
			__m128i p3aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 2))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); //3

			__m128i p1aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 8))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));//4
			__m128i p2aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 9))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));//5
			__m128i p3aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 10))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));//6

			__m128i p1bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 18))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));//7
			__m128i p2bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 19))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));//8
			__m128i p3bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 20))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));//9

			__m128i p1bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 26))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));//10
			__m128i p2bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 27))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));//11
			__m128i p3bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 28))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));//12

			__m128i sumaL = _mm_add_epi16(p3aL, _mm_add_epi16(p1aL, p2aL));//13
			__m128i sumaH = _mm_add_epi16(p3aH, _mm_add_epi16(p1aH, p2aH));//14
			__m128i sumbL = _mm_add_epi16(p3bL, _mm_add_epi16(p1bL, p2bL));//15
			__m128i sumbH = _mm_add_epi16(p3bH, _mm_add_epi16(p1bH, p2bH));//16
			__m128i sclaL = _mm_srli_epi16(sumaL, 8);//17
			__m128i sclaH = _mm_srli_epi16(sumaH, 8);//18
			__m128i sclbL = _mm_srli_epi16(sumbL, 8);//19
			__m128i sclbH = _mm_srli_epi16(sumbH, 8);//20
			// Gather the low byte of every third 16-bit sum. PSHUFB only uses the low 4 bits of an index whose top bit is clear,
			// so the H masks use 2, 8, 14 directly (the same convention as RGB2Y_5).
			__m128i shftaL = _mm_shuffle_epi8(sclaL, _mm_setr_epi8(0, 6, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));//21
			__m128i shftaH = _mm_shuffle_epi8(sclaH, _mm_setr_epi8(-1, -1, -1, 2, 8, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));//22
			__m128i shftbL = _mm_shuffle_epi8(sclbL, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 0, 6, 12, -1, -1, -1, -1, -1, -1, -1));//23
			__m128i shftbH = _mm_shuffle_epi8(sclbH, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 8, 14, -1, -1, -1, -1));//24
			__m128i accumL = _mm_or_si128(shftaL, shftbL);//25
			__m128i accumH = _mm_or_si128(shftaH, shftbH);//26
			__m128i h3 = _mm_or_si128(accumL, accumH);//27
													  //__m128i h3 = _mm_blendv_epi8(accumL, accumH, _mm_setr_epi8(0, 0, 0, -1, -1, -1, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1));
			_mm_storeu_si128((__m128i *)(LinePD + X), h3);
		}
		for (; X < Width; X++, LinePS += 3) {
			LinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;
		}
	}
}

//SSE: 15 pixels per iteration
void RGB2Y_5(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
	const int B_WT = int(0.114 * 256 + 0.5);
	const int G_WT = int(0.587 * 256 + 0.5);
	const int R_WT = 256 - B_WT - G_WT; // int(0.299 * 256 + 0.5)

	for (int Y = 0; Y < Height; Y++) {
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Width;
		int X = 0;
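		// Each iteration yields 15 pixels, but the last load reaches 54 source bytes (18 pixels).
		// Stopping while X + 18 <= Width keeps every load inside the current row.
		for (; X + 18 <= Width; X += 15, LinePS += 45)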
		{
			__m128i p1aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 0))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); //1
			__m128i p2aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 1))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); //2
			__m128i p3aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 2))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); //3

			__m128i p1aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 8))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));
			__m128i p2aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 9))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));
			__m128i p3aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 10))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));

			__m128i p1bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 18))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));
			__m128i p2bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 19))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));
			__m128i p3bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 20))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));

			__m128i p1bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 26))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));
			__m128i p2bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 27))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));
			__m128i p3bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 28))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));

			__m128i p1cH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 36))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));
			__m128i p2cH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 37))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));
			__m128i p3cH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 38))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));

			__m128i sumaL = _mm_add_epi16(p3aL, _mm_add_epi16(p1aL, p2aL));
			__m128i sumaH = _mm_add_epi16(p3aH, _mm_add_epi16(p1aH, p2aH));
			__m128i sumbL = _mm_add_epi16(p3bL, _mm_add_epi16(p1bL, p2bL));
			__m128i sumbH = _mm_add_epi16(p3bH, _mm_add_epi16(p1bH, p2bH));
			__m128i sumcH = _mm_add_epi16(p3cH, _mm_add_epi16(p1cH, p2cH));

			__m128i sclaL = _mm_srli_epi16(sumaL, 8);
			__m128i sclaH = _mm_srli_epi16(sumaH, 8);
			__m128i sclbL = _mm_srli_epi16(sumbL, 8);
			__m128i sclbH = _mm_srli_epi16(sumbH, 8);
			__m128i sclcH = _mm_srli_epi16(sumcH, 8);

			__m128i shftaL = _mm_shuffle_epi8(sclaL, _mm_setr_epi8(0, 6, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
			__m128i shftaH = _mm_shuffle_epi8(sclaH, _mm_setr_epi8(-1, -1, -1, 2, 8, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
			__m128i shftbL = _mm_shuffle_epi8(sclbL, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 0, 6, 12, -1, -1, -1, -1, -1, -1, -1));
			__m128i shftbH = _mm_shuffle_epi8(sclbH, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 8, 14, -1, -1, -1, -1));
			__m128i shftcH = _mm_shuffle_epi8(sclcH, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 6, 12, -1));
			__m128i accumL = _mm_or_si128(shftaL, shftbL);
			__m128i accumH = _mm_or_si128(shftaH, shftbH);
			__m128i h3 = _mm_or_si128(accumL, accumH);
			h3 = _mm_or_si128(h3, shftcH);
			_mm_storeu_si128((__m128i *)(LinePD + X), h3);
		}
		for (; X < Width; X++, LinePS += 3) {
			LinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;
		}
	}
}

void debug(__m128i var) {
	uint8_t *val = (uint8_t*)&var;//view the register as 16 unsigned bytes
	printf("Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\n",
		val[0], val[1], val[2], val[3], val[4], val[5],
		val[6], val[7], val[8], val[9], val[10], val[11], val[12], val[13],
		val[14], val[15]);
}

void debug2(__m256i var) {
	uint8_t *val = (uint8_t*)&var;//view the register as 32 unsigned bytes
	printf("Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\n",
		val[0], val[1], val[2], val[3], val[4], val[5],
		val[6], val[7], val[8], val[9], val[10], val[11], val[12], val[13],
		val[14], val[15], val[16], val[17], val[18], val[19], val[20], val[21], val[22], val[23], val[24], val[25], val[26], val[27],
		val[28], val[29], val[30], val[31]);
}

// AVX2
constexpr double B_WEIGHT = 0.114;
constexpr double G_WEIGHT = 0.587;
constexpr double R_WEIGHT = 0.299;
constexpr uint16_t B_WT = static_cast<uint16_t>(32768.0 * B_WEIGHT + 0.5);
constexpr uint16_t G_WT = static_cast<uint16_t>(32768.0 * G_WEIGHT + 0.5);
constexpr uint16_t R_WT = static_cast<uint16_t>(32768.0 * R_WEIGHT + 0.5);
static const __m256i weight_vec = _mm256_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT);

void  _RGB2Y(unsigned char* Src, const int32_t Width, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char* Dest)
{
	for (int Y = start_row; Y < start_row + thread_stride; Y++)
	{
		//Sleep(1);
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Width;
		int X = 0;
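		// Each iteration yields 10 pixels but stores 16 bytes, so stop while the whole store still fits in this row.
		// This also keeps one row band from writing into a band owned by another task.
		for (; X + 16 <= Width; X += 10, LinePS += 30)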
		{
			//B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5 G5 R5 B6 
			__m256i temp = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(LinePS + 0)));
			__m256i in1 = _mm256_mulhrs_epi16(temp, weight_vec);

			//B6 G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10 R10 B11
			temp = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(LinePS + 15)));
			__m256i in2 = _mm256_mulhrs_epi16(temp, weight_vec);


			//0  1  2  3   4  5  6  7  8  9  10 11 12 13 14 15    16 17 18 19 20 21 22 23 24 25 26 27 28   29 30  31       
			//B1 G1 R1 B2 G2 R2 B3 G3  B6 G6 R6 B7 G7 R7 B8 G8    R3 B4 G4 R4 B5 G5 R5 B6 R8 B9 G9 R9 B10 G10 R10 B11
			__m256i mul = _mm256_packus_epi16(in1, in2);

			__m256i b1 = _mm256_shuffle_epi8(mul, _mm256_setr_epi8(
				//  B1 B2 B3 -1, -1, -1  B7  B8  -1, -1, -1, -1, -1, -1, -1, -1,
				0, 3, 6, -1, -1, -1, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1,

				//  -1, -1, -1, B4 B5 B6 -1, -1  B9 B10 -1, -1, -1, -1, -1, -1
				-1, -1, -1, 1, 4, 7, -1, -1, 9, 12, -1, -1, -1, -1, -1, -1));

			__m256i g1 = _mm256_shuffle_epi8(mul, _mm256_setr_epi8(

				// G1 G2 G3 -1, -1  G6 G7  G8  -1, -1, -1, -1, -1, -1, -1, -1, 
				1, 4, 7, -1, -1, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1,

				//  -1, -1, -1  G4 G5 -1, -1, -1  G9  G10 -1, -1, -1, -1, -1, -1	
				-1, -1, -1, 2, 5, -1, -1, -1, 10, 13, -1, -1, -1, -1, -1, -1));

			__m256i r1 = _mm256_shuffle_epi8(mul, _mm256_setr_epi8(

				//  R1 R2 -1  -1  -1  R6  R7  -1, -1, -1, -1, -1, -1, -1, -1, -1,	
				2, 5, -1, -1, -1, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1,

				//  -1, -1, R3 R4 R5 -1, -1, R8 R9  R10 -1, -1, -1, -1, -1, -1
				-1, -1, 0, 3, 6, -1, -1, 8, 11, 14, -1, -1, -1, -1, -1, -1));



			// B1+G1+R1  B2+G2+R2 B3+G3  0 0 G6+R6  B7+G7+R7 B8+G8 0 0 0 0 0 0 0 0 0 0 R3 B4+G4+R4 B5+G5+R5 B6 0 R8 B9+G9+R9 B10+G10+R10 0 0 0 0 0 0

			__m256i accum = _mm256_adds_epu8(r1, _mm256_adds_epu8(b1, g1));


			// _mm256_castsi256_si128(accum)
			// B1+G1+R1  B2+G2+R2 B3+G3  0 0 G6+R6  B7+G7+R7 B8+G8 0 0 0 0 0 0 0 0

			// _mm256_extracti128_si256(accum, 1)
			// 0 0 R3 B4+G4+R4 B5+G5+R5 B6 0 R8 B9+G9+R9 B10+G10+R10 0 0 0 0 0 0

			__m128i h3 = _mm_adds_epu8(_mm256_castsi256_si128(accum), _mm256_extracti128_si256(accum, 1));

			_mm_storeu_si128((__m128i *)(LinePD + X), h3);
		}
		for (; X < Width; X++, LinePS += 3) {
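			// Round exactly like _mm256_mulhrs_epi16: (((a * b) >> 14) + 1) >> 1.
			// The parentheses matter because ">>" binds looser than "+".
			int tmpB = (((B_WT * LinePS[0]) >> 14) + 1) >> 1;
			tmpB = max(min(255, tmpB), 0);

			int tmpG = (((G_WT * LinePS[1]) >> 14) + 1) >> 1;
			tmpG = max(min(255, tmpG), 0);

			int tmpR = (((R_WT * LinePS[2]) >> 14) + 1) >> 1;
			tmpR = max(min(255, tmpR), 0);

			int tmp = tmpB + tmpG + tmpR;
			LinePD[X] = max(min(255, tmp), 0);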
		}
	}
}
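
// A scalar model of the rounding done by _mm256_mulhrs_epi16 (PMULHRSW) can be
// handy when comparing the SIMD path against a scalar tail. The helper below is
// a sketch with a hypothetical name. It assumes the compiler performs an
// arithmetic right shift on negative values, which is guaranteed only from
// C++20 on. With Q15 weights (scale 32768) the result is a * b / 32768 rounded
// to nearest.
#include <cstdint>

static inline int16_t Sketch_MulHrs16(int16_t a, int16_t b) {
	const int32_t prod = int32_t(a) * int32_t(b);
	// Keep one extra fractional bit, add it as the rounding bit, then drop it.
	return int16_t(((prod >> 14) + 1) >> 1);
}

// For example, with the Q15 weights 3736 (B), 19235 (G), and 9798 (R), a white
// pixel gives 29 + 150 + 76 = 255, whereas a plain ">> 15" would truncate the
// green term to 149.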

//avx2 
void RGB2Y_6(unsigned char *Src, unsigned char *Dest, int width, int height, int stride)
{
	_RGB2Y(Src, width, 0, height, stride, Dest);
}

//AVX2 + std::async: split the rows into bands and convert each band in its own task
void RGB2Y_7(unsigned char *Src, unsigned char *Dest, int width, int height, int stride) {
	// At least one task; at most one per 16 rows or per hardware thread.
	const int32_t hw_concur = std::max(1, std::min(height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency())));
	std::vector<std::future<void>> fut;
	fut.reserve(hw_concur);
	const int thread_stride = (height - 1) / hw_concur + 1;
	for (int start = 0; start < height; start += thread_stride)
	{
		// Clamp the last band so it never runs past the final row.
		const int rows = std::min(thread_stride, height - start);
		fut.push_back(std::async(std::launch::async, _RGB2Y, Src, width, start, rows, stride, Dest));
	}
	for (auto &f : fut)
		f.wait();
}
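
// The row partitioning used for the std::async version can be checked on its
// own. The helper below is a sketch under the assumption that each task should
// receive one contiguous band of rows. Sketch_SplitRows is a hypothetical name
// and is not called by the code in this file.
#include <algorithm>
#include <utility>
#include <vector>

// Returns (start_row, row_count) pairs that cover [0, height) without overlap.
static std::vector<std::pair<int, int>> Sketch_SplitRows(int height, int tasks) {
	std::vector<std::pair<int, int>> bands;
	if (height <= 0)
		return bands;
	tasks = std::max(1, std::min(tasks, height));
	const int band = (height - 1) / tasks + 1;	// ceil(height / tasks)
	for (int start = 0; start < height; start += band) {
		// The last band is clamped so that it ends exactly at the final row.
		bands.emplace_back(start, std::min(band, height - start));
	}
	return bands;
}

// For example, 100 rows over 8 tasks gives seven bands of 13 rows and a final band of 9 rows.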

int main() {
	Mat src = imread("F:\\car.jpg");
	int Height = src.rows;
	int Width = src.cols;
	unsigned char *Src = src.data;
	unsigned char *Dest = new unsigned char[Height * Width];
	int Stride = Width * 3;
	int Radius = 11;
	int64 st = cv::getTickCount();
	for (int i = 0; i < 100; i++) {
		RGB2Y_3(Src, Dest, Width, Height, Stride);
	}
	// Seconds for 100 runs * 10 = average milliseconds per run.
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 10;
	printf("%.5f\n", duration);
	RGB2Y_5(Src, Dest, Width, Height, Stride);
	Mat dst(Height, Width, CV_8UC1, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
	return 0;
}

================================================
FILE: speed_rgb2yuv_sse.cpp
================================================
#include "stdafx.h"
#include <stdio.h>
#include <opencv2/opencv.hpp>
#include <future>
using namespace std;
using namespace cv;

inline unsigned char ClampToByte(int Value) {
	if (Value < 0)
		return 0;
	else if (Value > 255)
		return 255;
	else
		return (unsigned char)Value;
	//return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));
}


void RGB2YUV(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) {
	for (int YY = 0; YY < Height; YY++) {
		unsigned char *LinePS = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Width; XX++, LinePS += 3)
		{
			int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
			// Luma weights are 0.299 / 0.587 / 0.114. U and V are signed, so bias them by 128 before storing them as bytes.
			LinePY[XX] = int(0.299*Red + 0.587*Green + 0.114*Blue);
			LinePU[XX] = ClampToByte(int(-0.147*Red - 0.289*Green + 0.436*Blue) + 128);
			LinePV[XX] = ClampToByte(int(0.615*Red - 0.515*Green - 0.100*Blue) + 128);
		}
	}
}

void YUV2RGB(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) {
	for (int YY = 0; YY < Height; YY++)
	{
		unsigned char *LinePD = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Width; XX++, LinePD += 3)
		{
			// Remove the 128 bias from U and V, then clamp because the inverse transform can leave [0, 255].
			int YV = LinePY[XX], UV = LinePU[XX] - 128, VV = LinePV[XX] - 128;
			LinePD[0] = ClampToByte(int(YV + 2.03 * UV));
			LinePD[1] = ClampToByte(int(YV - 0.39 * UV - 0.58 * VV));
			LinePD[2] = ClampToByte(int(YV + 1.14 * VV));
		}
	}
}

void RGB2YUV_1(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride)
{
	const int Shift = 8;
	const int HalfV = 1 << (Shift - 1);
	const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT;
	const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT);
	const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT);
	for (int YY = 0; YY < Height; YY++)
	{
		unsigned char *LinePS = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Width; XX++, LinePS += 3)
		{
			int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
			LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift;
			LinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128;
			LinePV[XX] = ((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128;
		}
	}
}

void YUV2RGB_1(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride)
{
	const int Shift = 8;
	const int HalfV = 1 << (Shift - 1);
	const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;
	const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);
	const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);
	for (int YY = 0; YY < Height; YY++)
	{
		unsigned char *LinePD = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Width; XX++, LinePD += 3)
		{
			int YV = LinePY[XX], UV = LinePU[XX] - 128, VV = LinePV[XX] - 128;
			LinePD[0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));
			LinePD[1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));
			LinePD[2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));
		}
	}
}

void RGB2YUV_OpenMP(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride)
{
	const int Shift = 8;
	const int HalfV = 1 << (Shift - 1);
	const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT;
	const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT);
	const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT);
	for (int YY = 0; YY < Height; YY++)
	{
		unsigned char *LinePS = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
#pragma omp parallel for num_threads(4)
		for (int XX = 0; XX < Width; XX++)
		{
			int Blue = LinePS[XX*3 + 0], Green = LinePS[XX*3 + 1], Red = LinePS[XX*3 + 2];
			LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift;
			LinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128;
			LinePV[XX] = ((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128;
		}
	}
}

void YUV2RGB_OpenMP(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride)
{
	const int Shift = 8;
	const int HalfV = 1 << (Shift - 1);
	const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;
	const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);
	const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);
	for (int YY = 0; YY < Height; YY++)
	{
		unsigned char *LinePD = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
#pragma omp parallel for num_threads(4)
		for (int XX = 0; XX < Width; XX++)
		{
			int YV = LinePY[XX], UV = LinePU[XX] - 128, VV = LinePV[XX] - 128;
			LinePD[XX*3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));
			LinePD[XX*3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));
			LinePD[XX*3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));
		}
	}
}

void RGB2YUVSSE_2(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) {
	const int Shift = 13;
	const int HalfV = 1 << (Shift - 1);
	const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT;
	const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT);
	const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT);
	__m128i Weight_YB = _mm_set1_epi32(Y_B_WT), Weight_YG = _mm_set1_epi32(Y_G_WT), Weight_YR = _mm_set1_epi32(Y_R_WT);
	__m128i Weight_UB = _mm_set1_epi32(U_B_WT), Weight_UG = _mm_set1_epi32(U_G_WT), Weight_UR = _mm_set1_epi32(U_R_WT);
	__m128i Weight_VB = _mm_set1_epi32(V_B_WT), Weight_VG = _mm_set1_epi32(V_G_WT), Weight_VR = _mm_set1_epi32(V_R_WT);
	__m128i C128 = _mm_set1_epi32(128);
	__m128i Half = _mm_set1_epi32(HalfV);
	__m128i Zero = _mm_setzero_si128();
	const int BlockSize = 16, Block = Width / BlockSize;
	for (int YY = 0; YY < Height; YY++) {
		unsigned char *LinePS = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3) {
			__m128i Src1, Src2, Src3, Blue, Green, Red;

			Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0));
			Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16));
			Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32));

			// The following rearranges 16 consecutive pixels from the interleaved order B G R B G R ... B G R (48 bytes)
			// into the planar order suited to SIMD instructions: 16 x B, then 16 x G, then 16 x R.

			Blue = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
			Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1)));
			Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13)));

			Green = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
			Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1)));
			Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14)));

			Red = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
			Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1)));
			Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15)));

			// The following widens the bytes of the three SSE registers into 12 registers of four 32-bit ints each, so that the subsequent multiplications cannot overflow.

			__m128i Blue16L = _mm_unpacklo_epi8(Blue, Zero);
			__m128i Blue16H = _mm_unpackhi_epi8(Blue, Zero);
			__m128i Blue32LL = _mm_unpacklo_epi16(Blue16L, Zero);
			__m128i Blue32LH = _mm_unpackhi_epi16(Blue16L, Zero);
			__m128i Blue32HL = _mm_unpacklo_epi16(Blue16H, Zero);
			__m128i Blue32HH = _mm_unpackhi_epi16(Blue16H, Zero);

			__m128i Green16L = _mm_unpacklo_epi8(Green, Zero);
			__m128i Green16H = _mm_unpackhi_epi8(Green, Zero);
			__m128i Green32LL = _mm_unpacklo_epi16(Green16L, Zero);
			__m128i Green32LH = _mm_unpackhi_epi16(Green16L, Zero);
			__m128i Green32HL = _mm_unpacklo_epi16(Green16H, Zero);
			__m128i Green32HH = _mm_unpackhi_epi16(Green16H, Zero);

			__m128i Red16L = _mm_unpacklo_epi8(Red, Zero);
			__m128i Red16H = _mm_unpackhi_epi8(Red, Zero);
			__m128i Red32LL = _mm_unpacklo_epi16(Red16L, Zero);
			__m128i Red32LH = _mm_unpackhi_epi16(Red16L, Zero);
			__m128i Red32HL = _mm_unpacklo_epi16(Red16H, Zero);
			__m128i Red32HH = _mm_unpackhi_epi16(Red16H, Zero);

			// Computes: Y[0 - 15] = (Y_B_WT * Blue[0 - 15] + Y_G_WT * Green[0 - 15] + Y_R_WT * Red[0 - 15] + HalfV) >> Shift;
			__m128i LL_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_YG), _mm_mullo_epi32(Red32LL, Weight_YR))), Half), Shift);
			__m128i LH_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_YG), _mm_mullo_epi32(Red32LH, Weight_YR))), Half), Shift);
			__m128i HL_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_YG), _mm_mullo_epi32(Red32HL, Weight_YR))), Half), Shift);
			__m128i HH_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_YG), _mm_mullo_epi32(Red32HH, Weight_YR))), Half), Shift);
			_mm_storeu_si128((__m128i*)(LinePY + XX), _mm_packus_epi16(_mm_packus_epi32(LL_Y, LH_Y), _mm_packus_epi32(HL_Y, HH_Y)));    //    Repack four registers of 4 ints each into one register of 16 bytes (with unsigned saturation)

			// Computes: U[0 - 15] = ((U_B_WT * Blue[0 - 15] + U_G_WT * Green[0 - 15] + U_R_WT * Red[0 - 15] + HalfV) >> Shift) + 128;
			__m128i LL_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_UG), _mm_mullo_epi32(Red32LL, Weight_UR))), Half), Shift), C128);
			__m128i LH_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_UG), _mm_mullo_epi32(Red32LH, Weight_UR))), Half), Shift), C128);
			__m128i HL_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_UG), _mm_mullo_epi32(Red32HL, Weight_UR))), Half), Shift), C128);
			__m128i HH_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_UG), _mm_mullo_epi32(Red32HH, Weight_UR))), Half), Shift), C128);
			_mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(LL_U, LH_U), _mm_packus_epi32(HL_U, HH_U)));

			// Computes: V[0 - 15] = ((V_B_WT * Blue[0 - 15] + V_G_WT * Green[0 - 15] + V_R_WT * Red[0 - 15] + HalfV) >> Shift) + 128;
			__m128i LL_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_VG), _mm_mullo_epi32(Red32LL, Weight_VR))), Half), Shift), C128);
			__m128i LH_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_VG), _mm_mullo_epi32(Red32LH, Weight_VR))), Half), Shift), C128);
			__m128i HL_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_VG), _mm_mullo_epi32(Red32HL, Weight_VR))), Half), Shift), C128);
			__m128i HH_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_VG), _mm_mullo_epi32(Red32HH, Weight_VR))), Half), Shift), C128);
			_mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(LL_V, LH_V), _mm_packus_epi32(HL_V, HH_V)));
		}
		for (int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3) {
			int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
			LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift;
			LinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128;
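			// V can leave [0, 255] (e.g. pure red gives 285 here); clamp so the tail matches the saturating pack of the SIMD path.
			LinePV[XX] = ClampToByte(((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128);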
		}
	}
}

void YUV2RGBSSE_2(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) {
	const int Shift = 13;
	const int HalfV = 1 << (Shift - 1);
	const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;
	const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);
	const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);
	__m128i Weight_B_Y = _mm_set1_epi32(B_Y_WT), Weight_B_U = _mm_set1_epi32(B_U_WT), Weight_B_V = _mm_set1_epi32(B_V_WT);
	__m128i Weight_G_Y = _mm_set1_epi32(G_Y_WT), Weight_G_U = _mm_set1_epi32(G_U_WT), Weight_G_V = _mm_set1_epi32(G_V_WT);
	__m128i Weight_R_Y = _mm_set1_epi32(R_Y_WT), Weight_R_U = _mm_set1_epi32(R_U_WT), Weight_R_V = _mm_set1_epi32(R_V_WT);
	__m128i Half = _mm_set1_epi32(HalfV);
	__m128i C128 = _mm_set1_epi32(128);
	__m128i Zero = _mm_setzero_si128();

	const int BlockSize = 16, Block = Width / BlockSize;
	for (int YY = 0; YY < Height; YY++) {
		unsigned char *LinePD = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePY += BlockSize, LinePU += BlockSize, LinePV += BlockSize) {
			__m128i Blue, Green, Red, YV, UV, VV, Dest1, Dest2, Dest3;
			YV = _mm_loadu_si128((__m128i *)(LinePY + 0));
			UV = _mm_loadu_si128((__m128i *)(LinePU + 0));
			VV = _mm_loadu_si128((__m128i *)(LinePV + 0));
			//UV = _mm_sub_epi32(UV, C128);
			//VV = _mm_sub_epi32(VV, C128);

			__m128i YV16L = _mm_unpacklo_epi8(YV, Zero);
			__m128i YV16H = _mm_unpackhi_epi8(YV, Zero);
			__m128i YV32LL = _mm_unpacklo_epi16(YV16L, Zero);
			__m128i YV32LH = _mm_unpackhi_epi16(YV16L, Zero);
			__m128i YV32HL = _mm_unpacklo_epi16(YV16H, Zero);
			__m128i YV32HH = _mm_unpackhi_epi16(YV16H, Zero);


			__m128i UV16L = _mm_unpacklo_epi8(UV, Zero);
			__m128i UV16H = _mm_unpackhi_epi8(UV, Zero);
			__m128i UV32LL = _mm_unpacklo_epi16(UV16L, Zero);
			__m128i UV32LH = _mm_unpackhi_epi16(UV16L, Zero);
			__m128i UV32HL = _mm_unpacklo_epi16(UV16H, Zero);
			__m128i UV32HH = _mm_unpackhi_epi16(UV16H, Zero);
			UV32LL = _mm_sub_epi32(UV32LL, C128);
			UV32LH = _mm_sub_epi32(UV32LH, C128);
			UV32HL = _mm_sub_epi32(UV32HL, C128);
			UV32HH = _mm_sub_epi32(UV32HH, C128);

			__m128i VV16L = _mm_unpacklo_epi8(VV, Zero);
			__m128i VV16H = _mm_unpackhi_epi8(VV, Zero);
			__m128i VV32LL = _mm_unpacklo_epi16(VV16L, Zero);
			__m128i VV32LH = _mm_unpackhi_epi16(VV16L, Zero);
			__m128i VV32HL = _mm_unpacklo_epi16(VV16H, Zero);
			__m128i VV32HH = _mm_unpackhi_epi16(VV16H, Zero);
			VV32LL = _mm_sub_epi32(VV32LL, C128);
			VV32LH = _mm_sub_epi32(VV32LH, C128);
			VV32HL = _mm_sub_epi32(VV32HL, C128);
			VV32HH = _mm_sub_epi32(VV32HH, C128);

			__m128i LL_B = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LL, Weight_B_U)), Shift));
			__m128i LH_B = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LH, Weight_B_U)), Shift));
			__m128i HL_B = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HL, Weight_B_U)), Shift));
			__m128i HH_B = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HH, Weight_B_U)), Shift));
			Blue = _mm_packus_epi16(_mm_packus_epi32(LL_B, LH_B), _mm_packus_epi32(HL_B, HH_B));

			__m128i LL_G = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LL), _mm_mullo_epi32(Weight_G_V, VV32LL))), Shift));
			__m128i LH_G = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LH), _mm_mullo_epi32(Weight_G_V, VV32LH))), Shift));
			__m128i HL_G = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HL), _mm_mullo_epi32(Weight_G_V, VV32HL))), Shift));
			__m128i HH_G = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HH), _mm_mullo_epi32(Weight_G_V, VV32HH))), Shift));
			Green = _mm_packus_epi16(_mm_packus_epi32(LL_G, LH_G), _mm_packus_epi32(HL_G, HH_G));

			__m128i LL_R = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LL, Weight_R_V)), Shift));
			__m128i LH_R = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LH, Weight_R_V)), Shift));
			__m128i HL_R = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HL, Weight_R_V)), Shift));
			__m128i HH_R = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HH, Weight_R_V)), Shift));
			Red = _mm_packus_epi16(_mm_packus_epi32(LL_R, LH_R), _mm_packus_epi32(HL_R, HH_R));

			Dest1 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5));
			Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1)));
			Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1)));

			Dest2 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1));
			Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10)));
			Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1)));

			Dest3 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1));
			Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1)));
			Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15)));

			_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3), Dest1);
			_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize), Dest2);
			_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize * 2), Dest3);
		}
		for (int XX = Block * BlockSize; XX < Width; XX++, LinePU++, LinePV++, LinePY++) {
			// LinePY/LinePU/LinePV already point past the SIMD blocks, so read them at offset 0; LinePD was not advanced, so index it per pixel (3 bytes each).
			int YV = LinePY[0], UV = LinePU[0] - 128, VV = LinePV[0] - 128;
			LinePD[XX * 3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));
			LinePD[XX * 3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));
			LinePD[XX * 3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));
		}
	}
}

void RGB2YUVSSE_3(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride)
{
	const int Shift = 13;                            //    No coefficient here has an absolute value greater than 1, so the scale factor can be as large as 2^15.
	const int HalfV = 1 << (Shift - 1);

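	// The *_C_WT weights multiply the constant HalfV that is paired with Red in _mm_madd_epi16:
	// 1 * HalfV adds the rounding term, and 257 * HalfV = (128 << Shift) + HalfV adds rounding plus the +128 chroma offset.
	const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT, Y_C_WT = 1;
	const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT), U_C_WT = 257;
	const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT), V_C_WT = 257;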

	__m128i Weight_YBG = _mm_setr_epi16(Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT);
	__m128i Weight_YRC = _mm_setr_epi16(Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT);
	__m128i Weight_UBG = _mm_setr_epi16(U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT);
	__m128i Weight_URC = _mm_setr_epi16(U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT);
	__m128i Weight_VBG = _mm_setr_epi16(V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT);
	__m128i Weight_VRC = _mm_setr_epi16(V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT);
	__m128i Half = _mm_setr_epi16(0, HalfV, 0, HalfV, 0, HalfV, 0, HalfV);
	__m128i Zero = _mm_setzero_si128();

	int BlockSize = 16, Block = Width / BlockSize;
	for (int YY = 0; YY < Height; YY++)
	{
		unsigned char *LinePS = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3)
		{
			__m128i Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0));
			__m128i Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16));
			__m128i Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32));
			// Src1 : B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5 G5 R5 B6 
			// Src2 : G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10 R10 B11 G11 
			// Src3 : R11 B12 G12 R12 B13 G13 R13 B14 G14 R14 B15 G15 R15 B16 G16 R16

			// BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 0 0 0 0 0 
			__m128i BGL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, -1, -1, -1, -1, -1));

			// BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 G6 B7 G7 B8 G8
			BGL = _mm_or_si128(BGL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 2, 3, 5, 6)));

			// BGH : B9 G9 B10 G10 B11 G11 0 0 0 0 0 0 0 0 0 0
			__m128i BGH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(8, 9, 11, 12, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));

			// BGH : B9 G9 B10 G10 B11 G11 B12 G12 B13 G13 B14 G14 B15 G15 B16 G16
			BGH = _mm_or_si128(BGH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 1, 2, 4, 5, 7, 8, 10, 11, 13, 14)));

			// RCL : R1 0 R2 0 R3 0 R4 0 R5 0 0 0 0 0 0 0 
			__m128i RCL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, -1, 5, -1, 8, -1, 11, -1, 14, -1, -1, -1, -1, -1, -1, -1));

			// RCL : R1 0 R2 0 R3 0 R4 0 R5 0 R6 0 R7 0 R8 0 
			RCL = _mm_or_si128(RCL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, 4, -1, 7, -1)));

			// RCH : R9 0 R10 0 0 0 0 0 0 0 0 0 0 0 0 0
			__m128i RCH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(10, -1, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));

			// RCH : R9 0 R10 0 R11 0 R12 0 R13 0 R14 0 R15 0 R16 0
			RCH = _mm_or_si128(RCH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, 0, -1, 3, -1, 6, -1, 9, -1, 12, -1, 15, -1)));

			// BGLL : B1 0 G1 0 B2 0 G2 0 B3 0 G3 0 B4 0 G4 0
			__m128i BGLL = _mm_unpacklo_epi8(BGL, Zero);

			// BGLH : B5 0 G5 0 B6 0 G6 0 B7 0 G7 0 B8 0 G8 0
			__m128i BGLH = _mm_unpackhi_epi8(BGL, Zero);

			// RCLL (16-bit lanes) : R1 Half R2 Half R3 Half R4 Half
			__m128i RCLL = _mm_or_si128(_mm_unpacklo_epi8(RCL, Zero), Half);

			// RCLH (16-bit lanes) : R5 Half R6 Half R7 Half R8 Half
			__m128i RCLH = _mm_or_si128(_mm_unpackhi_epi8(RCL, Zero), Half);

			// BGHL : B9 0 G9 0 B10 0 G10 0 B11 0 G11 0 B12 0 G12 0 
			__m128i BGHL = _mm_unpacklo_epi8(BGH, Zero);

			// BGHH : B13 0 G13 0 B14 0 G14 0 B15 0 G15 0 B16 0 G16 0
			__m128i BGHH = _mm_unpackhi_epi8(BGH, Zero);

			// RCHL (16-bit lanes) : R9 Half R10 Half R11 Half R12 Half
			__m128i RCHL = _mm_or_si128(_mm_unpacklo_epi8(RCH, Zero), Half);

			// RCHH (16-bit lanes) : R13 Half R14 Half R15 Half R16 Half
			__m128i RCHH = _mm_or_si128(_mm_unpackhi_epi8(RCH, Zero), Half);

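			// Y = (Y_B_WT * B + Y_G_WT * G + Y_R_WT * R + Y_C_WT * HalfV) >> Shift; each _mm_madd_epi16 sums two adjacent 16-bit products (U and V follow the same pattern).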
			__m128i Y_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_YBG), _mm_madd_epi16(RCLL, Weight_YRC)), Shift);
			__m128i Y_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_YBG), _mm_madd_epi16(RCLH, Weight_YRC)), Shift);
			__m128i Y_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_YBG), _mm_madd_epi16(RCHL, Weight_YRC)), Shift);
			__m128i Y_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_YBG), _mm_madd_epi16(RCHH, Weight_YRC)), Shift);
			_mm_storeu_si128((__m128i*)(LinePY + XX), _mm_packus_epi16(_mm_packus_epi32(Y_LL, Y_LH), _mm_packus_epi32(Y_HL, Y_HH)));

			__m128i U_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_UBG), _mm_madd_epi16(RCLL, Weight_URC)), Shift);
			__m128i U_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_UBG), _mm_madd_epi16(RCLH, Weight_URC)), Shift);
			__m128i U_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_UBG), _mm_madd_epi16(RCHL, Weight_URC)), Shift);
			__m128i U_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_UBG), _mm_madd_epi16(RCHH, Weight_URC)), Shift);
			_mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(U_LL, U_LH), _mm_packus_epi32(U_HL, U_HH)));

			__m128i V_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_VBG), _mm_madd_epi16(RCLL, Weight_VRC)), Shift);
			__m128i V_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_VBG), _mm_madd_epi16(RCLH, Weight_VRC)), Shift);
			__m128i V_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_VBG), _mm_madd_epi16(RCHL, Weight_VRC)), Shift);
			__m128i V_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_VBG), _mm_madd_epi16(RCHH, Weight_VRC)), Shift);
			_mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(V_LL, V_LH), _mm_packus_epi32(V_HL, V_HH)));

		}
		for (int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3) {
			int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
			LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + Y_C_WT * HalfV) >> Shift;
			LinePU[XX] = (U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + U_C_WT * HalfV) >> Shift;
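			// V can leave [0, 255] (e.g. pure red gives 285 here); clamp so the tail matches the saturating pack of the SIMD path.
			LinePV[XX] = ClampToByte((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + V_C_WT * HalfV) >> Shift);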
		}
	}
}

void YUV2RGBSSE_3(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) {
	const int Shift = 13;
	const int HalfV = 1 << (Shift - 1);
	const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;
	const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);
	const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);
	__m128i Weight_B_Y = _mm_set1_epi32(B_Y_WT), Weight_B_U = _mm_set1_epi32(B_U_WT), Weight_B_V = _mm_set1_epi32(B_V_WT);
	__m128i Weight_G_Y = _mm_set1_epi32(G_Y_WT), Weight_G_U = _mm_set1_epi32(G_U_WT), Weight_G_V = _mm_set1_epi32(G_V_WT);
	__m128i Weight_R_Y = _mm_set1_epi32(R_Y_WT), Weight_R_U = _mm_set1_epi32(R_U_WT), Weight_R_V = _mm_set1_epi32(R_V_WT);
	__m128i Half = _mm_set1_epi32(HalfV);
	__m128i C128 = _mm_set1_epi32(128);
	__m128i Zero = _mm_setzero_si128();

	const int BlockSize = 16, Block = Width / BlockSize;
	for (int YY = 0; YY < Height; YY++) {
		unsigned char *LinePD = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePY += BlockSize, LinePU += BlockSize, LinePV += BlockSize) {
			__m128i Blue, Green, Red, YV, UV, VV, Dest1, Dest2, Dest3;
			YV = _mm_loadu_si128((__m128i *)(LinePY + 0));
			UV = _mm_loadu_si128((__m128i *)(LinePU + 0));
			VV = _mm_loadu_si128((__m128i *)(LinePV + 0));

			__m128i YV16L = _mm_unpacklo_epi8(YV, Zero);
			__m128i YV16H = _mm_unpackhi_epi8(YV, Zero);
			__m128i YV32LL = _mm_unpacklo_epi16(YV16L, Zero);
			__m128i YV32LH = _mm_unpackhi_epi16(YV16L, Zero);
			__m128i YV32HL = _mm_unpacklo_epi16(YV16H, Zero);
			__m128i YV32HH = _mm_unpackhi_epi16(YV16H, Zero);


			__m128i UV16L = _mm_unpacklo_epi8(UV, Zero);
			__m128i UV16H = _mm_unpackhi_epi8(UV, Zero);
			__m128i UV32LL = _mm_unpacklo_epi16(UV16L, Zero);
			__m128i UV32LH = _mm_unpackhi_epi16(UV16L, Zero);
			__m128i UV32HL = _mm_unpacklo_epi16(UV16H, Zero);
			__m128i UV32HH = _mm_unpackhi_epi16(UV16H, Zero);
			UV32LL = _mm_sub_epi32(UV32LL, C128);
			UV32LH = _mm_sub_epi32(UV32LH, C128);
			UV32HL = _mm_sub_epi32(UV32HL, C128);
			UV32HH = _mm_sub_epi32(UV32HH, C128);

			__m128i VV16L = _mm_unpacklo_epi8(VV, Zero);
			__m128i VV16H = _mm_unpackhi_epi8(VV, Zero);
			__m128i VV32LL = _mm_unpacklo_epi16(VV16L, Zero);
			__m128i VV32LH = _mm_unpackhi_epi16(VV16L, Zero);
			__m128i VV32HL = _mm_unpacklo_epi16(VV16H, Zero);
			__m128i VV32HH = _mm_unpackhi_epi16(VV16H, Zero);
			VV32LL = _mm_sub_epi32(VV32LL, C128);
			VV32LH = _mm_sub_epi32(VV32LH, C128);
			VV32HL = _mm_sub_epi32(VV32HL, C128);
			VV32HH = _mm_sub_epi32(VV32HH, C128);

			__m128i LL_B = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LL, Weight_B_U)), Shift));
			__m128i LH_B = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LH, Weight_B_U)), Shift));
			__m128i HL_B = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HL, Weight_B_U)), Shift));
			__m128i HH_B = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HH, Weight_B_U)), Shift));
			Blue = _mm_packus_epi16(_mm_packus_epi32(LL_B, LH_B), _mm_packus_epi32(HL_B, HH_B));

			__m128i LL_G = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LL), _mm_mullo_epi32(Weight_G_V, VV32LL))), Shift));
			__m128i LH_G = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LH), _mm_mullo_epi32(Weight_G_V, VV32LH))), Shift));
			__m128i HL_G = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HL), _mm_mullo_epi32(Weight_G_V, VV32HL))), Shift));
			__m128i HH_G = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HH), _mm_mullo_epi32(Weight_G_V, VV32HH))), Shift));
			Green = _mm_packus_epi16(_mm_packus_epi32(LL_G, LH_G), _mm_packus_epi32(HL_G, HH_G));

			__m128i LL_R = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LL, Weight_R_V)), Shift));
			__m128i LH_R = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LH, Weight_R_V)), Shift));
			__m128i HL_R = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HL, Weight_R_V)), Shift));
			__m128i HH_R = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HH, Weight_R_V)), Shift));
			Red = _mm_packus_epi16(_mm_packus_epi32(LL_R, LH_R), _mm_packus_epi32(HL_R, HH_R));

			Dest1 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5));
			Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1)));
			Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1)));

			Dest2 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1));
			Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10)));
			Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1)));

			Dest3 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1));
			Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1)));
			Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15)));

			_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3), Dest1);
			_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize), Dest2);
			_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize * 2), Dest3);
		}
		for (int XX = Block * BlockSize; XX < Width; XX++, LinePU++, LinePV++, LinePY++) {
			// LinePY/LinePU/LinePV already point past the SIMD blocks, so read them at offset 0; LinePD was not advanced, so index it per pixel (3 bytes each).
			int YV = LinePY[0], UV = LinePU[0] - 128, VV = LinePV[0] - 128;
			LinePD[XX * 3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));
			LinePD[XX * 3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));
			LinePD[XX * 3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));
		}
	}
}


const int Shift = 13;                            //    No coefficient here has an absolute value greater than 1, so the scale factor can be as large as 2^15.
const int HalfV = 1 << (Shift - 1);

const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT, Y_C_WT = 1;
const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT), U_C_WT = 257;
const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT), V_C_WT = 257;

__m128i Weight_YBG = _mm_setr_epi16(Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT);
__m128i Weight_YRC = _mm_setr_epi16(Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT);
__m128i Weight_UBG = _mm_setr_epi16(U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT);
__m128i Weight_URC = _mm_setr_epi16(U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT);
__m128i Weight_VBG = _mm_setr_epi16(V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT);
__m128i Weight_VRC = _mm_setr_epi16(V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT);
__m128i Half1 = _mm_setr_epi16(0, HalfV, 0, HalfV, 0, HalfV, 0, HalfV);
__m128i Zero = _mm_setzero_si128();

const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0;
const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift);
const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift);
__m128i Weight_B_Y = _mm_set1_epi32(B_Y_WT), Weight_B_U = _mm_set1_epi32(B_U_WT), Weight_B_V = _mm_set1_epi32(B_V_WT);
__m128i Weight_G_Y = _mm_set1_epi32(G_Y_WT), Weight_G_U = _mm_set1_epi32(G_U_WT), Weight_G_V = _mm_set1_epi32(G_V_WT);
__m128i Weight_R_Y = _mm_set1_epi32(R_Y_WT), Weight_R_U = _mm_set1_epi32(R_U_WT), Weight_R_V = _mm_set1_epi32(R_V_WT);
__m128i Half2 = _mm_set1_epi32(HalfV);
__m128i C128 = _mm_set1_epi32(128);
int BlockSize, Block;

void _RGB2YUV(unsigned char *RGB, const int32_t Width, const int32_t Height, const int32_t start_row, const int32_t thread_stride, const int32_t Stride,  unsigned char *Y, unsigned char *U, unsigned char *V)
{

	for (int YY = start_row; YY < start_row + thread_stride; YY++)
	{
		unsigned char *LinePS = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3)
		{
			__m128i Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0));
			__m128i Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16));
			__m128i Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32));
			// Src1 : B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5 G5 R5 B6 
			// Src2 : G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10 R10 B11 G11 
			// Src3 : R11 B12 G12 R12 B13 G13 R13 B14 G14 R14 B15 G15 R15 B16 G16 R16

			// BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 0 0 0 0 0 
			__m128i BGL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, -1, -1, -1, -1, -1));

			// BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 G6 B7 G7 B8 G8
			BGL = _mm_or_si128(BGL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 2, 3, 5, 6)));

			// BGH : B9 G9 B10 G10 B11 G11 0 0 0 0 0 0 0 0 0 0
			__m128i BGH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(8, 9, 11, 12, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));

			// BGH : B9 G9 B10 G10 B11 G11 B12 G12 B13 G13 B14 G14 B15 G15 B16 G16
			BGH = _mm_or_si128(BGH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 1, 2, 4, 5, 7, 8, 10, 11, 13, 14)));

			// RCL : R1 0 R2 0 R3 0 R4 0 R5 0 0 0 0 0 0 0 
			__m128i RCL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, -1, 5, -1, 8, -1, 11, -1, 14, -1, -1, -1, -1, -1, -1, -1));

			// RCL : R1 0 R2 0 R3 0 R4 0 R5 0 R6 0 R7 0 R8 0 
			RCL = _mm_or_si128(RCL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, 4, -1, 7, -1)));

			// RCH : R9 0 R10 0 0 0 0 0 0 0 0 0 0 0 0 0
			__m128i RCH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(10, -1, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));

			// RCH : R9 0 R10 0 R11 0 R12 0 R13 0 R14 0 R15 0 R16 0
			RCH = _mm_or_si128(RCH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, 0, -1, 3, -1, 6, -1, 9, -1, 12, -1, 15, -1)));

			// BGLL : B1 0 G1 0 B2 0 G2 0 B3 0 G3 0 B4 0 G4 0
			__m128i BGLL = _mm_unpacklo_epi8(BGL, Zero);

			// BGLH : B5 0 G5 0 B6 0 G6 0 B7 0 G7 0 B8 0 G8 0
			__m128i BGLH = _mm_unpackhi_epi8(BGL, Zero);

			// RCLL (16-bit lanes) : R1 Half R2 Half R3 Half R4 Half
			__m128i RCLL = _mm_or_si128(_mm_unpacklo_epi8(RCL, Zero), Half1);

			// RCLH (16-bit lanes) : R5 Half R6 Half R7 Half R8 Half
			__m128i RCLH = _mm_or_si128(_mm_unpackhi_epi8(RCL, Zero), Half1);

			// BGHL : B9 0 G9 0 B10 0 G10 0 B11 0 G11 0 B12 0 G12 0 
			__m128i BGHL = _mm_unpacklo_epi8(BGH, Zero);

			// BGHH : B13 0 G13 0 B14 0 G14 0 B15 0 G15 0 B16 0 G16 0
			__m128i BGHH = _mm_unpackhi_epi8(BGH, Zero);

			// RCHL (16-bit lanes) : R9 Half R10 Half R11 Half R12 Half
			__m128i RCHL = _mm_or_si128(_mm_unpacklo_epi8(RCH, Zero), Half1);

			// RCHH (16-bit lanes) : R13 Half R14 Half R15 Half R16 Half
			__m128i RCHH = _mm_or_si128(_mm_unpackhi_epi8(RCH, Zero), Half1);

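			// Each channel is a fixed-point weighted sum: _mm_madd_epi16 multiplies the (B, G) and (R, Half)
			// 16-bit pairs by their weights and adds each pair into a 32-bit lane; the two partial sums are
			// added, shifted right by Shift, and packed with unsigned saturation into 16 output bytes.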
			__m128i Y_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_YBG), _mm_madd_epi16(RCLL, Weight_YRC)), Shift);
			__m128i Y_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_YBG), _mm_madd_epi16(RCLH, Weight_YRC)), Shift);
			__m128i Y_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_YBG), _mm_madd_epi16(RCHL, Weight_YRC)), Shift);
			__m128i Y_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_YBG), _mm_madd_epi16(RCHH, Weight_YRC)), Shift);
			_mm_storeu_si128((__m128i*)(LinePY + XX), _mm_packus_epi16(_mm_packus_epi32(Y_LL, Y_LH), _mm_packus_epi32(Y_HL, Y_HH)));

			__m128i U_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_UBG), _mm_madd_epi16(RCLL, Weight_URC)), Shift);
			__m128i U_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_UBG), _mm_madd_epi16(RCLH, Weight_URC)), Shift);
			__m128i U_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_UBG), _mm_madd_epi16(RCHL, Weight_URC)), Shift);
			__m128i U_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_UBG), _mm_madd_epi16(RCHH, Weight_URC)), Shift);
			_mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(U_LL, U_LH), _mm_packus_epi32(U_HL, U_HH)));

			__m128i V_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_VBG), _mm_madd_epi16(RCLL, Weight_VRC)), Shift);
			__m128i V_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_VBG), _mm_madd_epi16(RCLH, Weight_VRC)), Shift);
			__m128i V_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_VBG), _mm_madd_epi16(RCHL, Weight_VRC)), Shift);
			__m128i V_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_VBG), _mm_madd_epi16(RCHH, Weight_VRC)), Shift);
			_mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(V_LL, V_LH), _mm_packus_epi32(V_HL, V_HH)));

		}
		for (int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3) {
			int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
			LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + Y_C_WT * HalfV) >> Shift;
			LinePU[XX] = (U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + U_C_WT * HalfV) >> Shift;
			LinePV[XX] = (V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + V_C_WT * HalfV) >> Shift;
		}
	}
}
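
The vector loop above computes each output byte as a fixed-point weighted sum in which `_mm_madd_epi16` multiplies adjacent 16-bit lanes by their weights and adds each pair into one 32-bit lane. The scalar sketch below mirrors that (B, G) and (R, Half) pairing for a single pixel. It is a minimal illustration under stated assumptions: the names `kShift`, `kHalf`, `kWB`, `kWG`, `kWR`, `kWC`, `MaddPair`, `SaturateToByte` and `LumaFixedPoint` are hypothetical, and the constants are BT.601-style values chosen for the example, not the `Shift` and `Y_*_WT` values this file defines elsewhere.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants only (assumptions, not this file's Shift / Y_*_WT values).
constexpr int kShift = 8;                 // number of fixed-point fraction bits
constexpr int kHalf = 1 << (kShift - 1);  // rounding term carried in the lane next to R
constexpr int16_t kWB = 29, kWG = 150, kWR = 77;  // colour weights, 29 + 150 + 77 = 256
constexpr int16_t kWC = 1;                        // weight applied to the rounding term

// One 32-bit lane of _mm_madd_epi16: a0 * w0 + a1 * w1 on signed 16-bit inputs.
inline int32_t MaddPair(int16_t a0, int16_t a1, int16_t w0, int16_t w1) {
    return int32_t(a0) * w0 + int32_t(a1) * w1;
}

// Saturate to [0, 255], as the unsigned pack steps do in the vector code.
inline uint8_t SaturateToByte(int32_t v) {
    return uint8_t(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Luma of one BGR pixel, using the same lane pairing as the SSE loop.
inline uint8_t LumaFixedPoint(uint8_t b, uint8_t g, uint8_t r) {
    const int32_t bg = MaddPair(b, g, kWB, kWG);      // B * wB + G * wG
    const int32_t rc = MaddPair(r, kHalf, kWR, kWC);  // R * wR + Half * wC
    return SaturateToByte((bg + rc) >> kShift);
}
```

With these example constants, `LumaFixedPoint(255, 255, 255)` evaluates to 255 and `LumaFixedPoint(0, 0, 0)` to 0. Carrying the rounding term as a fourth "channel" is what lets two `_mm_madd_epi16` calls cover all four products per pixel.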

void _YUV2RGB(const int32_t Width, const int32_t Height, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB) {
	
	for (int YY = start_row; YY < start_row + thread_stride; YY++){
		unsigned char *LinePD = RGB + YY * Stride;
		unsigned char *LinePY = Y + YY * Width;
		unsigned char *LinePU = U + YY * Width;
		unsigned char *LinePV = V + YY * Width;
		for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePY += BlockSize, LinePU += BlockSize, LinePV += BlockSize) {
			__m128i Blue, Green, Red, YV, UV, VV, Dest1, Dest2, Dest3;
			YV = _mm_loadu_si128((__m128i *)(LinePY + 0));
			UV = _mm_loadu_si128((__m128i *)(LinePU + 0));
			VV = _mm_loadu_si128((__m128i *)(LinePV + 0));

			// Zero-extend the 16 Y bytes to 16-bit lanes, then to four vectors of 32-bit lanes.
			__m128i YV16L = _mm_unpacklo_epi8(YV, Zero);
			__m128i YV16H = _mm_unpackhi_epi8(YV, Zero);
			__m128i YV32LL = _mm_unpacklo_epi16(YV16L, Zero);
			__m128i YV32LH = _mm_unpackhi_epi16(YV16L, Zero);
			__m128i YV32HL = _mm_unpacklo_epi16(YV16H, Zero);
			__m128i YV32HH = _mm_unpackhi_epi16(YV16H, Zero);


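			// Zero-extend U to 32-bit lanes, then subtract C128 to center the chroma values (V is handled the same way).
			__m128i UV16L = _mm_unpacklo_epi8(UV, Zero);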
			__m128i UV16H = _mm_unpackhi_epi8(UV, Zero);
			__m128i UV32LL = _mm_unpacklo_epi16(UV16L, Zero);
			__m128i UV32LH = _mm_unpackhi_epi16(UV16L, Zero);
			__m128i UV32HL = _mm_unpacklo_epi16(UV16H, Zero);
			__m128i UV32HH = _mm_unpackhi_epi16(UV16H, Zero);
			UV32LL = _mm_sub_epi32(UV32LL, C128);
			UV32LH = _mm_sub_epi32(UV32LH, C128);
			UV32HL = _mm_sub_epi32(UV32HL, C128);
			UV32HH = _mm_sub_epi32(UV32HH, C128);

			__m128i VV16L = _mm_unpacklo_epi8(VV, Zero);
			__m128i VV16H = _mm_unpackhi_epi8(VV, Zero);
			__m128i VV32LL = _mm_unpacklo_epi16(VV16L, Zero);
			__m128i VV32LH = _mm_unpackhi_epi16(VV16L, Zero);
			__m128i VV32HL = _mm_unpacklo_epi16(VV16H, Zero);
			__m128i VV32HH = _mm_unpackhi_epi16(VV16H, Zero);
			VV32LL = _mm_sub_epi32(VV32LL, C128);
			VV32LH = _mm_sub_epi32(VV32LH, C128);
			VV32HL = _mm_sub_epi32(VV32HL, C128);
			VV32HH = _mm_sub_epi32(VV32HH, C128);

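			// B = Y + (((U - C128) * Weight_B_U + Half2) >> Shift); the two pack steps saturate the result to [0, 255].
			__m128i LL_B = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32LL, Weight_B_U)), Shift));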
			__m128i LH_B = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32LH, Weight_B_U)), Shift));
			__m128i HL_B = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32HL, Weight_B_U)), Shift));
			__m128i HH_B = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32HH, Weight_B_U)), Shift));
			Blue = _mm_packus_epi16(_mm_packus_epi32(LL_B, LH_B), _mm_packus_epi32(HL_B, HH_B));

			__m128i LL_G = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LL), _mm_mullo_epi32(Weight_G_V, VV32LL))), Shift));
			__m128i LH_G = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LH), _mm_mullo_epi32(Weight_G_V, VV32LH))), Shift));
			__m128i HL_G = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HL), _mm_mullo_epi32(Weight_G_V, VV32HL))), Shift));
			__m128i HH_G = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HH), _mm_mullo_epi32(Weight_G_V, VV32HH))), Shift));
			Green = _mm_packus_epi16(_mm_packus_epi32(LL_G, LH_G), _mm_packus_epi32(HL_G, HH_G));

			__m128i LL_R = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32LL, Weight_R_V)), Shift));
			__m128i LH_R = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32LH, Weight_R_V)), Shift));
			__m128i HL_R = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32HL, Weight_R_V)), Shift));
			__m128i HH_R = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32HH, Weight_R_V)), Shift));
			Red = _mm_packus_epi16(_mm_packus_epi32(LL_R, LH_R), _mm_packus_epi32(HL_R, HH_R));

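			// Interleave the planar B, G, R vectors back into 48 bytes of packed BGR:
			// Dest1 = B1 G1 R1 ... B6, Dest2 = G6 R6 B7 ... G11, Dest3 = R11 B12 G12 ... R16.
			Dest1 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5));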
			Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1)));
			Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1)));

			Dest2 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1));
			Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10)));
			Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1)));

			Dest3 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1));
			Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1)));
			Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15)));

			_mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockS
SYMBOL INDEX (183 symbols across 19 files)

FILE: speed_bicubic_zoom_sse.cpp
  function debug (line 6) | void debug(__m128i var) {
  function ConvertBGR8U2BGRAF (line 14) | void ConvertBGR8U2BGRAF(unsigned char *Src, unsigned char *Dest, int Wid...
  function ConvertBGRAF2BGR8U (line 28) | void ConvertBGRAF2BGR8U(unsigned char *Src, unsigned char *Dest, int Wid...
  function ConvertBGR8U2BGRAF_SSE (line 42) | void ConvertBGR8U2BGRAF_SSE(unsigned char *Src, unsigned char *Dest, int...
  function ConvertBGRAF2BGR8U_SSE (line 68) | void ConvertBGRAF2BGR8U_SSE(unsigned char *Src, unsigned char *Dest, int...
  function ClampI (line 109) | inline int ClampI(int Value, int Min, int Max) {
  function ClampToByte (line 116) | inline unsigned char ClampToByte(int Value) {
  function SinXDivX (line 128) | float SinXDivX(float X) {
  function SinXDivX_Standard (line 141) | float SinXDivX_Standard(float X) {
  function Bicubic_Original (line 148) | void Bicubic_Original(unsigned char *Src, int Width, int Height, int Str...
  function Bicubic_Border (line 192) | void Bicubic_Border(unsigned char *Src, int Width, int Height, int Strid...
  function Bicubic_Center (line 228) | void Bicubic_Center(unsigned char *Src, int Width, int Height, int Strid...
  function IM_Resize_Cubic_Origin (line 266) | void IM_Resize_Cubic_Origin(unsigned char *Src, unsigned char *Dest, int...
  function IM_Resize_Cubic_Table (line 287) | void IM_Resize_Cubic_Table(unsigned char *Src, unsigned char *Dest, int ...
  function _mm_hsum_epi32 (line 347) | inline int _mm_hsum_epi32(__m128i V) { //V3 V2 V1 V0
  function IM_Resize_SSE (line 355) | void IM_Resize_SSE(unsigned char *Src, unsigned char *Dest, int SrcW, in...
  function main (line 472) | int main() {

FILE: speed_box_filter_sse.cpp
  function BoxBlur_1 (line 10) | void BoxBlur_1(unsigned char *Src, unsigned char *Dest, int Width, int H...
  function BoxBlur_SSE (line 21) | void BoxBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int...
  function main (line 33) | int main() {

FILE: speed_common_functions.cpp
  function ClampToByte (line 11) | unsigned char ClampToByte(int Value){
  function ClampToInt (line 18) | int ClampToInt(int Value, int Min, int Max) {
  function Div255 (line 27) | int Div255(int Value) {
  function Abs (line 35) | int Abs(int n) {
  function Round (line 46) | double Round(double V)
  function Rand (line 54) | double Rand()
  function Pow (line 63) | double Pow(double X, double Y)
  function Pow (line 72) | float Pow(float X, float Y)
  function Exp (line 81) | double Exp(double Y)			//	the union-based approach is somewhat faster
  function Exp (line 89) | float Exp(float Y)			//	the union-based approach is somewhat faster
  function PrecisePow (line 104) | double PrecisePow(double X, double Y){
  function Random (line 125) | int Random(int Min, int Max){
  function sgn (line 132) | int sgn(int X){
  function GetRGB (line 141) | void GetRGB(int Color, int *R, int *G, int *B){
  function Sqrt (line 150) | float Sqrt(float X)
  function HistgramAddShort (line 165) | void HistgramAddShort(unsigned short *X, unsigned short *Y)
  function HistgramSubShort (line 204) | void HistgramSubShort(unsigned short *X, unsigned short *Y)
  function HistgramSubAddShort (line 243) | void HistgramSubAddShort(unsigned short *X, unsigned short *Y, unsigned ...

FILE: speed_gaussian_filter_sse.cpp
  function CalcGaussCof (line 7) | void CalcGaussCof(float Radius, float &B0, float &B1, float &B2, float &B3)
  function ConvertBGR8U2BGRAF (line 28) | void ConvertBGR8U2BGRAF(unsigned char *Src, float *Dest, int Width, int ...
  function ConvertBGR8U2BGRAF_SSE (line 42) | void ConvertBGR8U2BGRAF_SSE(unsigned char *Src, float *Dest, int Width, ...
  function GaussBlurFromLeftToRight (line 66) | void GaussBlurFromLeftToRight(float *Data, int Width, int Height, float ...
  function GaussBlurFromLeftToRight_SSE (line 88) | void GaussBlurFromLeftToRight_SSE(float *Data, int Width, int Height, fl...
  function GaussBlurFromRightToLeft (line 108) | void GaussBlurFromRightToLeft(float *Data, int Width, int Height, float ...
  function GaussBlurFromRightToLeft_SSE (line 127) | void GaussBlurFromRightToLeft_SSE(float *Data, int Width, int Height, fl...
  function GaussBlurFromTopToBottom (line 149) | void GaussBlurFromTopToBottom(float *Data, int Width, int Height, float ...
  function GaussBlurFromTopToBottom_SSE (line 166) | void GaussBlurFromTopToBottom_SSE(float *Data, int Width, int Height, fl...
  function GaussBlurFromBottomToTop (line 190) | void GaussBlurFromBottomToTop(float *Data, int Width, int Height, float ...
  function GaussBlurFromBottomToTop_SSE (line 204) | void GaussBlurFromBottomToTop_SSE(float *Data, int Width, int Height, fl...
  function ConvertBGRAF2BGR8U (line 226) | void ConvertBGRAF2BGR8U(float *Src, unsigned char *Dest, int Width, int ...
  function ConvertBGRAF2BGR8U_SSE (line 241) | void ConvertBGRAF2BGR8U_SSE(unsigned char *Src, unsigned char *Dest, int...
  function GaussBlur (line 281) | void GaussBlur(unsigned char *Src, unsigned char *Dest, int Width, int H...
  function GaussBlur_SSE (line 307) | void GaussBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, i...
  function main (line 333) | int main() {

FILE: speed_histogram_algorithm_framework/BoxFilter.h
  function IS_RET (line 16) | IS_RET BoxBlur(TMatrix *Src, TMatrix *Dest, int Radius, EdgeMode Edge) {
  function IS_RET (line 133) | IS_RET BoxBlur_SSE(TMatrix *Src, TMatrix *Dest, int Radius, EdgeMode Edg...

FILE: speed_histogram_algorithm_framework/Core.h
  type EdgeMode (line 15) | enum EdgeMode {
  type IS_RET (line 20) | enum IS_RET {
  type IS_DEPTH (line 35) | enum IS_DEPTH
  type TMatrix (line 45) | struct TMatrix
  function IS_FreeMemory (line 66) | void IS_FreeMemory(void *Ptr) {
  function IS_ELEMENT_SIZE (line 71) | int IS_ELEMENT_SIZE(int Depth) {
  function IS_RET (line 101) | IS_RET IS_CreateMatrix(int Width, int Height, int Depth, int Channel, TM...
  function IS_RET (line 122) | IS_RET IS_FreeMatrix(TMatrix **Matrix) {
  function IS_RET (line 136) | IS_RET IS_CloneMatrix(TMatrix *Src, TMatrix **Dest) {

FILE: speed_histogram_algorithm_framework/MaxFilter.h
  function IS_RET (line 15) | IS_RET  MaxFilter(TMatrix *Src, TMatrix *Dest, int Radius)

FILE: speed_histogram_algorithm_framework/SelectiveBlur.h
  function Calc (line 5) | void Calc(unsigned short *Hist, int Intensity, unsigned char *&Pixel, in...
  function IS_RET (line 29) | IS_RET SelectiveBlur(TMatrix *Src, TMatrix *Dest, int Radius, int Thresh...

FILE: speed_histogram_algorithm_framework/Utility.h
  function ClampToByte (line 14) | unsigned char ClampToByte(int Value) {
  function ClampToInt (line 21) | int ClampToInt(int Value, int Min, int Max) {
  function Div255 (line 30) | int Div255(int Value) {
  function Abs (line 38) | int Abs(int n) {
  function Round (line 49) | double Round(double V)
  function Rand (line 57) | double Rand()
  function Pow (line 66) | double Pow(double X, double Y)
  function Pow (line 75) | float Pow(float X, float Y)
  function Exp (line 84) | double Exp(double Y)			//	the union-based approach is somewhat faster
  function Exp (line 92) | float Exp(float Y)			//	the union-based approach is somewhat faster
  function PrecisePow (line 107) | double PrecisePow(double X, double Y) {
  function Random (line 128) | int Random(int Min, int Max) {
  function sgn (line 135) | int sgn(int X) {
  function GetRGB (line 144) | void GetRGB(int Color, int *R, int *G, int *B) {
  function Sqrt (line 153) | float Sqrt(float X)
  function HistgramAddShort (line 168) | void HistgramAddShort(unsigned short *X, unsigned short *Y)
  function HistgramSubShort (line 207) | void HistgramSubShort(unsigned short *X, unsigned short *Y)
  function HistgramSubAddShort (line 246) | void HistgramSubAddShort(unsigned short *X, unsigned short *Y, unsigned ...
  function CopyAlphaChannel (line 285) | void CopyAlphaChannel(TMatrix *Src, TMatrix *Dest) {
  function IS_RET (line 307) | IS_RET GetValidCoordinate(int Width, int Height, int Left, int Right, in...
  function IS_RET (line 388) | IS_RET SplitRGBA(TMatrix *Src, TMatrix **Blue, TMatrix **Green, TMatrix ...
  function IS_RET (line 479) | IS_RET CombineRGBA(TMatrix *Dest, TMatrix *Blue, TMatrix *Green, TMatrix...

FILE: speed_integral_graph_sse.cpp
  function GetGrayIntegralImage (line 7) | void GetGrayIntegralImage(unsigned char *Src, int *Integral, int Width, ...
  function GetGrayIntegralImage_SSE (line 24) | void GetGrayIntegralImage_SSE(unsigned char *Src, int *Integral, int Wid...
  function BoxBlur (line 61) | void BoxBlur(unsigned char *Src, unsigned char *Dest, int Width, int Hei...
  function main (line 82) | int main() {

FILE: speed_max_filter_sse.cpp
  function MaxFilter_SSE (line 9) | void MaxFilter_SSE(unsigned char *Src, unsigned char *Dest, int Width, i...
  function Mat (line 20) | Mat MaxFilter(Mat src, int radius) {
  function main (line 43) | int main() {

FILE: speed_median_filter_3x3_sse.cpp
  function ComparisonFunction (line 7) | int ComparisonFunction(const void *X, const void *Y) {
  function MedianBlur3X3_Ori (line 15) | void MedianBlur3X3_Ori(unsigned char *Src, unsigned char *Dest, int Widt...
  function Swap (line 70) | void Swap(int &X, int &Y) {
  function MedianBlur3X3_Faster (line 76) | void MedianBlur3X3_Faster(unsigned char *Src, unsigned char *Dest, int W...
  function _mm_sort_ab (line 211) | inline void _mm_sort_ab(__m128i &a, __m128i &b) {
  function MedianBlur3X3_Fastest (line 218) | void MedianBlur3X3_Fastest(unsigned char *Src, unsigned char *Dest, int ...
  function _mm_sort_AB (line 283) | inline void _mm_sort_AB(__m256i &a, __m256i &b) {
  function MedianBlur3X3_Fastest_AVX (line 290) | void MedianBlur3X3_Fastest_AVX(unsigned char *Src, unsigned char *Dest, ...
  function main (line 355) | int main() {

FILE: speed_multi_scale_detail_boosting_see.cpp
  function BoxBlur_SSE (line 11) | void BoxBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int...
  function IM_Sign (line 22) | int IM_Sign(int X) {
  function IM_ClampToByte (line 26) | inline unsigned char IM_ClampToByte(int Value)
  function __m128i (line 38) | inline __m128i _mm_sgn_epi16(__m128i v) {
  function MultiScaleSharpen (line 51) | void MultiScaleSharpen(unsigned char *Src, unsigned char *Dest, int Widt...
  function MultiScaleSharpen_SSE (line 67) | void MultiScaleSharpen_SSE(unsigned char *Src, unsigned char *Dest, int ...
  function main (line 98) | int main() {

FILE: speed_rgb2gray_sse.cpp
  function RGB2Y (line 8) | void RGB2Y(unsigned char *Src, unsigned char *Dest, int Width, int Heigh...
  function RGB2Y_1 (line 19) | void RGB2Y_1(unsigned char *Src, unsigned char *Dest, int Width, int Hei...
  function RGB2Y_2 (line 33) | void RGB2Y_2(unsigned char *Src, unsigned char *Dest, int Width, int Hei...
  function RGB2Y_3 (line 54) | void RGB2Y_3(unsigned char *Src, unsigned char *Dest, int Width, int Hei...
  function RGB2Y_4 (line 69) | void RGB2Y_4(unsigned char *Src, unsigned char *Dest, int Width, int Hei...
  function RGB2Y_5 (line 120) | void RGB2Y_5(unsigned char *Src, unsigned char *Dest, int Width, int Hei...
  function debug (line 180) | void debug(__m128i var) {
  function debug2 (line 188) | void debug2(__m256i var) {
  function _RGB2Y (line 206) | void  _RGB2Y(unsigned char* Src, const int32_t Width, const int32_t star...
  function RGB2Y_6 (line 286) | void RGB2Y_6(unsigned char *Src, unsigned char *Dest, int width, int hei...
  function RGB2Y_7 (line 292) | void RGB2Y_7(unsigned char *Src, unsigned char *Dest, int width, int hei...
  function main (line 305) | int main() {

FILE: speed_rgb2yuv_sse.cpp
  function ClampToByte (line 8) | inline unsigned char ClampToByte(int Value) {
  function RGB2YUV (line 19) | void RGB2YUV(unsigned char *RGB, unsigned char *Y, unsigned char *U, uns...
  function YUV2RGB (line 35) | void YUV2RGB(unsigned char *Y, unsigned char *U, unsigned char *V, unsig...
  function RGB2YUV_1 (line 52) | void RGB2YUV_1(unsigned char *RGB, unsigned char *Y, unsigned char *U, u...
  function YUV2RGB_1 (line 75) | void YUV2RGB_1(unsigned char *Y, unsigned char *U, unsigned char *V, uns...
  function RGB2YUV_OpenMP (line 98) | void RGB2YUV_OpenMP(unsigned char *RGB, unsigned char *Y, unsigned char ...
  function YUV2RGB_OpenMP (line 122) | void YUV2RGB_OpenMP(unsigned char *Y, unsigned char *U, unsigned char *V...
  function RGB2YUVSSE_2 (line 146) | void RGB2YUVSSE_2(unsigned char *RGB, unsigned char *Y, unsigned char *U...
  function YUV2RGBSSE_2 (line 239) | void YUV2RGBSSE_2(unsigned char *Y, unsigned char *U, unsigned char *V, ...
  function RGB2YUVSSE_3 (line 339) | void RGB2YUVSSE_3(unsigned char *RGB, unsigned char *Y, unsigned char *U...
  function YUV2RGBSSE_3 (line 450) | void YUV2RGBSSE_3(unsigned char *Y, unsigned char *U, unsigned char *V, ...
  function _RGB2YUV (line 575) | void _RGB2YUV(unsigned char *RGB, const int32_t Width, const int32_t Hei...
  function _YUV2RGB (line 670) | void _YUV2RGB(const int32_t Width, const int32_t Height, const int32_t s...
  function RGB2YUVSSE_4 (line 757) | void RGB2YUVSSE_4(unsigned char *RGB, unsigned char *Y, unsigned char *U...
  function YUV2RGBSSE_4 (line 771) | void YUV2RGBSSE_4(unsigned char *Y, unsigned char *U, unsigned char *V, ...
  function main (line 785) | int main() {

FILE: speed_skin_detection_sse.cpp
  function IM_GetRoughSkinRegion (line 12) | void IM_GetRoughSkinRegion(unsigned char *Src, unsigned char *Skin, int ...
  function IM_GetRoughSkinRegion_OpenMP (line 29) | void IM_GetRoughSkinRegion_OpenMP(unsigned char *Src, unsigned char *Ski...
  function IM_GetRoughSkinRegion_SSE (line 47) | void IM_GetRoughSkinRegion_SSE(unsigned char *Src, unsigned char *Skin, ...
  function _IM_GetRoughSkinRegion (line 94) | void _IM_GetRoughSkinRegion(unsigned char* Src, const int32_t Width, con...
  function IM_GetRoughSkinRegion_SSE2 (line 141) | void IM_GetRoughSkinRegion_SSE2(unsigned char *Src, unsigned char *Skin,...
  function IM_GrayToRGB (line 154) | void IM_GrayToRGB(unsigned char *Gray, unsigned char *RGB, int Width, in...
  function main (line 169) | int main() {

FILE: speed_sobel_edgedetection_sse.cpp
  function IM_ClampToByte (line 7) | inline unsigned char IM_ClampToByte(int Value)
  function Sobel_FLOAT (line 18) | void Sobel_FLOAT(unsigned char *Src, unsigned char *Dest, int Width, int...
  function Sobel_INT (line 73) | void Sobel_INT(unsigned char *Src, unsigned char *Dest, int Width, int H...
  function Sobel_SSE1 (line 130) | void Sobel_SSE1(unsigned char *Src, unsigned char *Dest, int Width, int ...
  function Sobel_SSE2 (line 215) | void Sobel_SSE2(unsigned char *Src, unsigned char *Dest, int Width, int ...
  function _Sobel (line 304) | void _Sobel(unsigned char* Src, const int32_t Width, const int32_t Heigh...
  function Sobel_AVX1 (line 378) | void Sobel_AVX1(unsigned char *Src, unsigned char *Dest, int Width, int ...
  function Sobel_AVX2 (line 402) | void Sobel_AVX2(unsigned char *Src, unsigned char *Dest, int Width, int ...
  function main (line 438) | int main() {

FILE: speed_vibrance_algorithm.cpp
  function GetGrayIntegralImage (line 8) | void GetGrayIntegralImage(unsigned char *Src, int *Integral, int Width, ...
  function GetGrayIntegralImage_SSE (line 25) | void GetGrayIntegralImage_SSE(unsigned char *Src, int *Integral, int Wid...
  function BoxBlur (line 62) | void BoxBlur(unsigned char *Src, unsigned char *Dest, int Width, int Hei...
  function VibranceAlgorithm_FLOAT (line 85) | void VibranceAlgorithm_FLOAT(unsigned char *Src, unsigned char *Dest, in...
  function VibranceAlgorithm_INT (line 113) | void VibranceAlgorithm_INT(unsigned char *Src, unsigned char *Dest, int ...
  function VibranceAlgorithm_INT_OpenMP (line 147) | void VibranceAlgorithm_INT_OpenMP(unsigned char *Src, unsigned char *Des...
  function VibranceAlgorithm_SSE (line 180) | void VibranceAlgorithm_SSE(unsigned char *Src, unsigned char *Dest, int ...
  function main (line 282) | int main() {

FILE: sse_implementation_of_common_functions_in_image_processing.cpp
  function __m128 (line 7) | inline __m128 _mm_log_ps(__m128 x)
  function IM_Flog (line 82) | inline float IM_Flog(float val)
  function __m128 (line 97) | inline __m128 _mm_flog_ps(__m128 x)
  function IM_Fexp (line 110) | inline float IM_Fexp(float Y)
  function __m128 (line 123) | inline __m128 _mm_fexp_ps(__m128 Y)
  function IM_Fpow (line 132) | inline float IM_Fpow(float a, float b)
  function __m128 (line 145) | __m128 _mm_prcp_ps(__m128 a) {
  function __m128 (line 151) | __m128 _mm_fdiv_ps(__m128 a, __m128 b)
  function __m128 (line 163) | inline __m128 _mm_divz_ps(__m128 a, __m128 b)
  function _mm_storesi128_4char (line 172) | inline void _mm_storesi128_4char(unsigned char *Dest, __m128i P)
  function __m128i (line 187) | inline __m128i _mm_loadu_epi96(const __m128i * p)
  function _mm_storeu_epi96 (line 194) | inline void _mm_storeu_epi96(__m128i *P, __m128i Q)
  function IM_Div255 (line 201) | inline int IM_Div255(int V)
  function __m128i (line 209) | inline __m128i _mm_div255_epu16(__m128i x)
  function _mm_hsum_epi16 (line 221) | inline int _mm_hsum_epi16(__m128i V)                            //    V7...
  function _mm_hmin_epu8 (line 236) | inline int _mm_hmin_epu8(__m128i a)
  function _mm_hmax_epu8 (line 245) | inline int _mm_hmax_epu8(__m128i a)
  function main (line 253) | int main() {
Condensed preview: 21 files, each showing path, character count, and a content snippet.
[
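  {
    "path": "README.md",
    "chars": 5364,
    "preview": "# Introduction\n\n## speed_histogram_algorithm_framework \n\n- A local-histogram acceleration framework that internally uses some approximate computation and instruction-set (SSE) acceleration, and can quickly handle median filtering, max filtering, min filtering, surface blur, and other"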
  },
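  {
    "path": "resources/SSE指令集补充.md",
    "chars": 3264,
    "preview": "# SSE instruction set notes\n\n- _mm_cvtps_epi32 converts four float values to four int values. Note its rounding rule: it rounds to nearest, and a tie rounds up only when the result is even (round-half-to-even).\n\n- _mm_cvttps_epi32 takes four float values"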
  },
  {
    "path": "speed_bicubic_zoom_sse.cpp",
    "chars": 24471,
    "preview": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\nusing namespace std;\nusing namespace cv;\n\nvoid debug(__m128i var) {\n\tui"
  },
  {
    "path": "speed_box_filter_sse.cpp",
    "chars": 1793,
    "preview": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include \"../../OpencvTest/OpencvTest/Core.h\"\n#include \"../../OpencvTes"
  },
  {
    "path": "speed_common_functions.cpp",
    "chars": 12851,
    "preview": "// Approximation\nunion Approximation\n{\n\tdouble Value;\n\tint X[2];\n};\n\n// Function 1: clamp the data to the range of the Byte data type.\n// Reference: http://www.cnblogs.com/zyl910/"
  },
  {
    "path": "speed_gaussian_filter_sse.cpp",
    "chars": 15385,
    "preview": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n\nusing namespace std;\nusing namespace cv;\n\nvoid CalcGaussCof(float Radi"
  },
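  {
    "path": "speed_histogram_algorithm_framework/BoxFilter.h",
    "chars": 9799,
    "preview": "#pragma once\n#include \"Core.h\"\n#include \"Utility.h\"\n\n// Function: implements a box blur effect on an image\n// Parameter list:\n// Src: data structure of the source image to process\n// Dest: data structure that stores the processed image\n// Radius: blur radius, valid range [1, 10"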
  },
  {
    "path": "speed_histogram_algorithm_framework/Core.h",
    "chars": 3438,
    "preview": "#pragma once\n#include <stdio.h>\n#include <malloc.h>\n#include <stdlib.h>\n#include <string.h>\n#include <opencv2/opencv.hpp"
  },
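  {
    "path": "speed_histogram_algorithm_framework/MaxFilter.h",
    "chars": 4234,
    "preview": "#pragma once\n#include \"Core.h\"\n#include \"Utility.h\"\n\n// Function: within the specified radius, the maximum filter replaces the current pixel's brightness with the highest brightness of the surrounding pixels.\n// Parameter list:\n// Src: the to-be-processed"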
  },
  {
    "path": "speed_histogram_algorithm_framework/SelectiveBlur.h",
    "chars": 4700,
    "preview": "#pragma once\n#include \"Core.h\"\n#include \"Utility.h\"\n\nvoid Calc(unsigned short *Hist, int Intensity, unsigned char *&Pixe"
  },
  {
    "path": "speed_histogram_algorithm_framework/Utility.h",
    "chars": 22570,
    "preview": "#pragma once\n// Approximation\n#include \"Core.h\"\n\nunion Approximation\n{\n\tdouble Value;\n\tint X[2];\n};\n\n// Function 1: clamp the data to the range of the Byte data type.\n// Reference: http://www.c"
  },
  {
    "path": "speed_integral_graph_sse.cpp",
    "chars": 4389,
    "preview": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n\nusing namespace std;\nusing namespace cv;\n\nvoid GetGrayIntegralImage(un"
  },
  {
    "path": "speed_max_filter_sse.cpp",
    "chars": 1907,
    "preview": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include \"../../OpencvTest/OpencvTest/Core.h\"\n#include \"../../OpencvTes"
  },
  {
    "path": "speed_median_filter_3x3_sse.cpp",
    "chars": 15750,
    "preview": "#include \"stdafx.h\"\r\n#include <stdio.h>\r\n#include <opencv2/opencv.hpp>\r\nusing namespace std;\r\nusing namespace cv;\r\n\r\nint"
  },
  {
    "path": "speed_multi_scale_detail_boosting_see.cpp",
    "chars": 4963,
    "preview": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include \"../../OpencvTest/OpencvTest/Core.h\"\n#include \"../../OpencvTes"
  },
  {
    "path": "speed_rgb2gray_sse.cpp",
    "chars": 16299,
    "preview": "#include \"stdafx.h\"\n#include <opencv2/opencv.hpp>\n#include <future>\nusing namespace std;\nusing namespace cv;\n\n//origin\nv"
  },
  {
    "path": "speed_rgb2yuv_sse.cpp",
    "chars": 49271,
    "preview": "#include \"stdafx.h\"\n#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include <future>\nusing namespace std;\nusing namesp"
  },
  {
    "path": "speed_skin_detection_sse.cpp",
    "chars": 9603,
    "preview": "#include \"stdafx.h\"\n#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include <future>\nusing namespace std;\nusing namesp"
  },
  {
    "path": "speed_sobel_edgedetection_sse.cpp",
    "chars": 19260,
    "preview": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\n#include <future>\nusing namespace std;\nusing namespace cv;\n\ninline unsi"
  },
  {
    "path": "speed_vibrance_algorithm.cpp",
    "chars": 13553,
    "preview": "#include <stdio.h>\n#include <omp.h>\n#include <opencv2/opencv.hpp>\n\nusing namespace std;\nusing namespace cv;\n\nvoid GetGra"
  },
  {
    "path": "sse_implementation_of_common_functions_in_image_processing.cpp",
    "chars": 9524,
    "preview": "#include <stdio.h>\n#include <opencv2/opencv.hpp>\nusing namespace std;\nusing namespace cv;\n\n// Function 1: SSE implementation of the logarithm function, high-precision version\ninlin"
  }
]

