Repository: BBuf/Image-processing-algorithm-Speed
Branch: master
Commit: d22063d5c5b4
Files: 21
Total size: 246.5 KB

Directory structure:
gitextract_nm0o3cbs/
├── README.md
├── resources/
│   └── SSE指令集补充.md
├── speed_bicubic_zoom_sse.cpp
├── speed_box_filter_sse.cpp
├── speed_common_functions.cpp
├── speed_gaussian_filter_sse.cpp
├── speed_histogram_algorithm_framework/
│   ├── BoxFilter.h
│   ├── Core.h
│   ├── MaxFilter.h
│   ├── SelectiveBlur.h
│   └── Utility.h
├── speed_integral_graph_sse.cpp
├── speed_max_filter_sse.cpp
├── speed_median_filter_3x3_sse.cpp
├── speed_multi_scale_detail_boosting_see.cpp
├── speed_rgb2gray_sse.cpp
├── speed_rgb2yuv_sse.cpp
├── speed_skin_detection_sse.cpp
├── speed_sobel_edgedetection_sse.cpp
├── speed_vibrance_algorithm.cpp
└── sse_implementation_of_common_functions_in_image_processing.cpp

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# Introduction

## speed_histogram_algorithm_framework
- A local-histogram acceleration framework. Internally it uses some approximate computations and SSE instructions, and it provides fast median filtering, max filtering, min filtering, surface blur, and similar algorithms.

## resources
- Resources related to SSE optimization.

#### The test PC has a 64-bit I5-3230 CPU.
#### The OpenCV version is 3.4.0.

- sse_implementation_of_common_functions_in_image_processing.cpp: SSE implementations of several functions commonly used in image processing.
- speed_rgb2gray_sse.cpp: Uses SSE to accelerate RGB-to-grayscale conversion, nearly 5x faster than the original implementation. Algorithm description: https://mp.weixin.qq.com/s/SagVQ5gfXWWA7NATv-zvBQ . Timing results:

> Test CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

| Resolution | Optimization | Iterations | Time |
| --------- | ---------------------------------------- | -------- | ---- |
| 4032x3024 | Original implementation | 1000 | 12.139ms |
| 4032x3024 | Version 1 (float -> int) | 1000 | 7.629ms |
| 4032x3024 | OpenCV built-in function | 1000 | 4.287ms |
| 4032x3024 | Version 2 (manual 4-way parallelism) | 1000 | 10.528ms |
| 4032x3024 | Version 3 (OpenMP, 4 threads) | 1000 | 7.632ms |
| 4032x3024 | Version 4 (SSE, 12 pixels per iteration) | 1000 | 5.579ms |
| 4032x3024 | Version 5 (SSE, 15 pixels per iteration) | 1000 | 5.843ms |
| 4032x3024 | Version 6 (AVX2, 10 pixels per iteration) | 1000 | 3.576ms |
| 4032x3024 | Version 7 (AVX2 + std::async) | 1000 | 2.626ms |

- speed_vibrance_algorithm.cpp: Uses SSE to accelerate the vibrance (natural saturation) algorithm, a 9x speedup. Algorithm description: https://mp.weixin.qq.com/s/26UVvqMNLgnquXY21Xu3OQ . Timing results:

|Resolution|Optimization|Iterations|Time|
|----|----|----|----|
|4032x3024|Original implementation|100|115.36ms|
|4032x3024|Version 1|100|62.43ms|
|4032x3024|Version 2 (4 threads)|100|28.89ms|
|4032x3024|Version 3 (SSE)|100|12.69ms|

- speed_sobel_edgedetection_sse.cpp: Uses SSE to accelerate Sobel edge detection, with a very large speedup (about 22x from the plain implementation to the AVX2 + std::async version in the table below). Algorithm description: https://mp.weixin.qq.com/s/5lCfO_jmSfP7DbsgM7qbpg . Timing results:

|Resolution|Optimization|Iterations|Time|
|-|-|-|-|
|4032x3024|Plain implementation|1000|126.54 ms|
|4032x3024|Float -> int + lookup table|1000|81.62 ms|
|4032x3024|SSE version 1|1000|34.95 ms|
|4032x3024|SSE version 2|1000|28.87 ms|
|4032x3024|AVX2 version 1|1000|15.42 ms|
|4032x3024|AVX2 + std::async|1000|5.69 ms|

- speed_skin_detection_sse.cpp: Uses SSE to accelerate skin detection, with a substantial speedup (about 8.8x from the plain implementation to the second SSE version in the table below). Algorithm description: https://mp.weixin.qq.com/s/UFzY1s6ohTM-dnNg0P4kkw . Timing results:

|Resolution|Optimization|Iterations|Time|
|-|-|-|-|
|4272x2848|Plain implementation|1000|41.40ms|
|4272x2848|OpenMP, 4 threads|1000|36.54ms|
|4272x2848|SSE version 1|1000|6.77ms|
|4272x2848|SSE version 2 (std::async)|1000|4.73ms|

- speed_rgb2yuv_sse.cpp: Heavily SSE-optimized conversion between the RGB and YUV color spaces. Algorithm description: https://mp.weixin.qq.com/s/ryGocz-0YpqZ1CjYXJbd7Q . Timing results:

|Resolution|Optimization|Iterations|Time|
|-|-|-|-|
|4032x3024|Plain implementation|1000|150.58ms|
|4032x3024|Floating point removed, division replaced by bit operations|1000|76.70ms|
|4032x3024|OpenMP, 4 threads|1000|50.48ms|
|4032x3024|Plain SSE vectorization|1000|48.92ms|
|4032x3024|Second pass using _mm_madd_epi16|1000|33.04ms|
|4032x3024|SSE + 4 threads|1000|23.70ms|

- speed_median_filter_3x3_sse.cpp: Heavily optimized 3*3 median filter. Algorithm description: https://blog.csdn.net/just_sort/article/details/98617050 . Timing results:

|Resolution|Optimization|Iterations|Time|
|-|-|-|-|
|4032x3024|Plain implementation|10|8293.79 ms|
|4032x3024|Logic optimization, better pipelining|10|83.75 ms|
|4032x3024|SSE|10|11.93 ms|
|4032x3024|AVX|10|9.32 ms|

----------------------------------------------------------------------------------

- speed_gaussian_filter_sse.cpp: Uses SSE to accelerate Gaussian filtering. Algorithm description: https://blog.csdn.net/just_sort/article/details/95212099 . Timing results:

| Optimization | Image resolution | Time |
| ------------------- | ---------- | ---- |
| Plain C implementation, single thread | 4032*3024 | 290.43ms |
| SSE, single thread | 4032*3024 | 265.96ms |

- speed_integral_graph_sse.cpp: Uses SSE for the integral-image computation, but there is no speedup on the PC. Algorithm description: https://www.cnblogs.com/Imageshop/p/6897233.html . Timing results:

|Optimization|Image resolution|Time|
|---------|----------|-------|
|C implementation, single thread|4032*3024|66.66ms|
|C implementation, 4 threads|4032*3024|65.34ms|
|SSE, single thread|4032*3024|66.10ms|
|SSE, 4 threads|4032*3024|66.20ms|

- speed_common_functions.cpp: Fast implementations of some common image-processing functions; a few of them use SSE.

- speed_max_filter_sse.cpp: Implements max filtering with the speed_histogram_algorithm_framework; the speedup grows with the radius. Algorithm description: https://blog.csdn.net/just_sort/article/details/97280807 . When running it, remember to turn off the SDL checks in the project properties; otherwise an uninitialized-variable error is reported. Timing results:

|Optimization|Image resolution|Radius|Time|
|---------|----------|-------|-------|
|C implementation, single thread|4272*2848|7|9445.90ms|
|SSE, single thread|4272*2848|7|2234.55ms|
|C implementation, single thread|4272*2848|9|14468.76ms|
|SSE, single thread|4272*2848|9|2221.68ms|
|C implementation, single thread|4272*2848|11|23069.10ms|
|SSE, single thread|4272*2848|11|2180.95ms|

- speed_box_filter_sse.cpp: Implements an O(1) box filter with the speed_histogram_algorithm_framework, using SSE. Algorithm description: https://blog.csdn.net/just_sort/article/details/98075712 . It is run the same way as speed_max_filter_sse.cpp. Timing results:

|Optimization|Image resolution|Radius|Time|
|---------|----------|-------|-------|
|C implementation, single thread|4272*2848|11|163.16ms|
|SSE, single thread|4272*2848|11|123.83ms|
|C implementation, single thread|4272*2848|21|167.81ms|
|SSE, single thread|4272*2848|21|126.98ms|
|C implementation, single thread|4272*2848|31|168.62ms|
|SSE, single thread|4272*2848|31|126.17ms|

- speed_multi_scale_detail_boosting_see.cpp: Building on the SSE box filter from speed_box_filter_sse.cpp, this further uses SIMD instructions to optimize the algorithm from the paper "DARK IMAGE ENHANCEMENT BASED ON PAIRWISE TARGET CONTRAST AND MULTI-SCALE DETAIL BOOSTING". Algorithm description: https://blog.csdn.net/just_sort/article/details/98485746 . Timing results on a Core I7-3770:

|Optimization|Image resolution|Radius|Time|
|---------|----------|-------|-------|
|C implementation, single thread|4272*2848|7|206.00ms|
|SSE, single thread|4272*2848|7|57.12ms|

- speed_bicubic_zoom_sse.cpp: SSE-optimized bicubic interpolation. Algorithm description: https://blog.csdn.net/just_sort/article/details/100119653 . Timing results:

|Optimization|Image resolution|Output size|Time|
|---------|----------|-------|-------|
|Original C implementation|4272*2848|1.5x the original width and height|1856.29ms|
|C implementation + lookup table + border optimization|4272*2848|1.5x the original width and height|839.10ms|
|SSE + border optimization|4272*2848|1.5x the original width and height|315.70ms|
|OpenCV 3.1.0 built-in function|4272*2848|1.5x the original width and height|118.77ms|

# I maintain a WeChat official account that shares papers, algorithms, competitions, and everyday life. You are welcome to join.
- If the image does not load, just search for GiantPandaCV.

![](image/weixin.jpg)

================================================
FILE: resources/SSE指令集补充.md
================================================
# SSE Instruction Set Notes

- _mm_cvtps_epi32 Converts four float values to four int values. Note the rounding rule: under the default MXCSR rounding mode it rounds to the nearest integer, and exact ties (x.5) go to the nearest even integer.
- _mm_cvttps_epi32 Converts four float values to four int values by truncation, the same as `r = (int)a` in C/C++.
- _mm_cvtpd_ps Converts the two double-precision values in a to single precision. Return value:
```c++
r0 := (float) a0
r1 := (float) a1
r2 := 0.0 ; r3 := 0.0
```
- _mm_movelh_ps Moves the lower two single-precision values of b into the upper two single-precision values of the result.
```c++
r3 := b1
r2 := b0
r1 := a1
r0 := a0
```
- _mm_cmpneq_ps Compares single-precision values lane by lane. Each result lane is 0 if the corresponding values are equal and 0xffffffff (all bits set) if they are not.
- _mm_blendv_ps Blend function:
```c++
__m128 _mm_blendv_ps( __m128 a, __m128 b, __m128 mask );
r0 := (mask0 & 0x80000000) ? b0 : a0
r1 := (mask1 & 0x80000000) ? b1 : a1
r2 := (mask2 & 0x80000000) ? b2 : a2
r3 := (mask3 & 0x80000000) ? b3 : a3
```
- _mm_packs_epi32 Packs the eight signed 32-bit integers of a and b into signed 16-bit integers, with signed saturation.
- _mm_cvtsi128_si32 Moves the least significant 32 bits of a into a 32-bit integer.
- _mm_packus_epi16 Packs the signed 16-bit integers of a and b into unsigned 8-bit integers, with unsigned saturation.
- _mm_cvtsi32_si128 Moves the 32-bit integer a into the low 32 bits of a 128-bit register and zeroes the upper 96 bits, i.e. r0 = a, r1 = r2 = r3 = 0.
- _mm_loadu_si128: Loads a 128-bit value; the address does not need to be 16-byte aligned.
- _mm_max_epu8(a,b): Compares the corresponding unsigned 8-bit integers of a and b and takes the larger one, for all 16 lanes: r0=max(a0,b0),...,r15=max(a15,b15).
- _mm_min_epi8(a,b): Works the same way lane by lane, except that it takes the minimum and treats the 8-bit integers as signed.
- _mm_setzero_si128: Sets all 128 bits to 0.
- _mm_subs_epu8(a,b): Subtracts the corresponding 8-bit values of a and b with unsigned saturation: r0=UnsignedSaturate(a0-b0),...,r15=UnsignedSaturate(a15-b15).
- _mm_adds_epi8(a,b): Adds the corresponding 8-bit values of a and b with signed saturation: r0=SignedSaturate(a0+b0),...,r15=SignedSaturate(a15+b15).
- _mm_unpackhi_epi64(a,b): Interleaves the upper 64 bits of a and b; the lower 64 bits are discarded.
- _mm_srli_si128(a,imm): Logically shifts a right by imm bytes (not bits), filling the high bytes with 0.
- _mm_cvtsi128_si32(a): Assigns the low 32 bits of a to a 32-bit integer; the return value is r=a0.
- _mm_xor_si128(a,b): Bitwise XOR of a and b, i.e. r=a^b.
- _mm_or_si128(a,b): Bitwise OR of a and b, i.e. r=a|b.
- _mm_and_si128(a,b): Bitwise AND of a and b, i.e. r=a&b.
- _mm_cmpgt_epi8(a,b): Tests whether each signed 8-bit integer of a is greater than the corresponding 8-bit integer of b; the lane is 0xff if so and 0x0 otherwise. That is, r0=(a0>b0)?0xff:0x0 r1=(a1>b1)?0xff:0x0...r15=(a15>b15)?0xff:0x0
- _mm_unpacklo_epi64: Interleaves the lower 64 bits of a and b; the upper 64 bits are discarded.
- _mm_madd_epi16: Returns an __m128i register containing 4 signed 32-bit integers.
```c++
r0 := (a0 * b0) + (a1 * b1)
r1 := (a2 * b2) + (a3 * b3)
r2 := (a4 * b4) + (a5 * b5)
r3 := (a6 * b6) + (a7 * b7)
```
- _mm_extract_epi16(a, imm): Returns the 16-bit value at position imm.
- _mm_min_epu16: Element-wise minimum of the unsigned 16-bit integers of the two operands.
- _mm_minpos_epu16: Returns a 128-bit value whose lowest 16 bits hold the minimum value found in a and whose next 16 bits hold the index of that minimum in a.
- _mm_stream_si32 Stores a 32-bit integer to the address given by the pointer, using a non-temporal (cache-bypassing) hint.
- _mm_cvtsi128_si32 Moves the least significant 32 bits of a into a 32-bit integer.
- _mm_packus_epi32 Packs the signed 32-bit integers of a and b into unsigned 16-bit integers, with unsigned saturation:
```c++
r0 := (a0 < 0) ? 0 : ((a0 > 0xffff) ? 0xffff : a0)
r1 := (a1 < 0) ? 0 : ((a1 > 0xffff) ? 0xffff : a1)
r2 := (a2 < 0) ? 0 : ((a2 > 0xffff) ? 0xffff : a2)
r3 := (a3 < 0) ? 0 : ((a3 > 0xffff) ? 0xffff : a3)
r4 := (b0 < 0) ? 0 : ((b0 > 0xffff) ? 0xffff : b0)
r5 := (b1 < 0) ? 0 : ((b1 > 0xffff) ? 0xffff : b1)
r6 := (b2 < 0) ? 0 : ((b2 > 0xffff) ? 0xffff : b2)
r7 := (b3 < 0) ? 0 : ((b3 > 0xffff) ? 0xffff : b3)
```
- _mm_setr_epi32 Returns an __m128i register whose contents are set from 4 given int values.
- _mm_mullo_epi32 Returns an __m128i register holding the element-wise products of the 4 int values of a and b (the low 32 bits of each product).
- _mm_hadd_epi32 Returns an __m128i register holding the sums of adjacent pairs of int values: r0 = a0 + a1, r1 = a2 + a3, r2 = b0 + b1, r3 = b2 + b3.
- _mm_madd_epi16 Returns an __m128i register; the 16-bit values of a and b are multiplied and adjacent products are then added.
```c++
r0 := (a0 * b0) + (a1 * b1)
r1 := (a2 * b2) + (a3 * b3)
r2 := (a4 * b4) + (a5 * b5)
r3 := (a6 * b6) + (a7 * b7)
```
- _mm_unpackhi_epi8 Returns an __m128i register that interleaves the upper eight bytes of a and b.
```c++
r0 := a8 ; r1 := b8
r2 := a9 ; r3 := b9
...
r14 := a15 ; r15 := b15
```
- _mm_unpacklo_epi8 Returns an __m128i register that interleaves the lower eight bytes of a and b.

================================================
FILE: speed_bicubic_zoom_sse.cpp
================================================
// Assumed headers: the two original #include targets are missing from this dump,
// so the list below simply covers what the code in this file uses.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <immintrin.h>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;

// Print the 16 bytes of an SSE register
void debug(__m128i var) {
    uint8_t *val = (uint8_t*)&var; //can also use uint32_t instead of 16_t
    printf("Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\n",
        val[0], val[1], val[2], val[3], val[4], val[5], val[6], val[7],
        val[8], val[9], val[10], val[11], val[12], val[13], val[14], val[15]);
}

// Expand 3-byte BGR pixels to 4-byte BGRA pixels (alpha set to 0)
void ConvertBGR8U2BGRAF(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++) {
        unsigned char *LinePS = Src + Y * Stride;
        unsigned char *LinePD = Dest + Y * Width * 4;
        for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 4) {
            LinePD[0] = LinePS[0];
            LinePD[1] = LinePS[1];
            LinePD[2] = LinePS[2];
            LinePD[3] = 0;
        }
    }
}

// Drop the alpha byte: 4-byte BGRA pixels back to 3-byte BGR pixels
void ConvertBGRAF2BGR8U(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
    //#pragma omp parallel for
    for (int Y = 0; Y < Height; Y++) {
        unsigned char *LinePS = Src + Y * Width * 4;
        unsigned char *LinePD = Dest + Y * Stride;
        for (int X = 0; X < Width; X++, LinePS += 4, LinePD += 3) {
            LinePD[0] = LinePS[0];
            LinePD[1] = LinePS[1];
            LinePD[2] = LinePS[2];
        }
    }
}

void ConvertBGR8U2BGRAF_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
    const int BlockSize = 4;
    int Block = (Width - 2) / BlockSize;
    __m128i Mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);
    __m128i Mask2 = _mm_setr_epi8(0, 2, 8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
    __m128i Zero = _mm_setzero_si128();
    for (int Y = 0; Y < Height; Y++) {
        unsigned char *LinePS = Src + Y * Stride;
        unsigned char *LinePD = Dest + Y * Width * 4;
        int X = 0;
        for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize * 4) {
            __m128i SrcV =
_mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)LinePS), Mask); __m128i Src16L = _mm_unpacklo_epi8(SrcV, Zero); __m128i Src16H = _mm_unpackhi_epi8(SrcV, Zero); _mm_storeu_si128((__m128i *)(LinePD + 0), _mm_shuffle_epi8(_mm_unpacklo_epi32(Src16L, Zero), Mask2)); _mm_storeu_si128((__m128i *)(LinePD + 4), _mm_shuffle_epi8(_mm_unpackhi_epi32(Src16L, Zero), Mask2)); _mm_storeu_si128((__m128i *)(LinePD + 8), _mm_shuffle_epi8(_mm_unpacklo_epi32(Src16H, Zero), Mask2)); _mm_storeu_si128((__m128i *)(LinePD + 12), _mm_shuffle_epi8(_mm_unpackhi_epi32(Src16H, Zero), Mask2)); } for (; X < Width; X++, LinePS += 3, LinePD += 4) { LinePD[0] = LinePS[0]; LinePD[1] = LinePS[1]; LinePD[2] = LinePS[2]; LinePD[3] = 0; } } } void ConvertBGRAF2BGR8U_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { const int BlockSize = 4; int Block = (Width - 2) / BlockSize; //__m128i Mask = _mm_setr_epi8(0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15); __m128i MaskB = _mm_setr_epi8(0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); __m128i MaskG = _mm_setr_epi8(1, 5, 9, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); __m128i MaskR = _mm_setr_epi8(2, 6, 10, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); __m128i Zero = _mm_setzero_si128(); for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Width * 4; unsigned char *LinePD = Dest + Y * Stride; int X = 0; for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 4, LinePD += BlockSize * 3) { __m128i SrcV = _mm_loadu_si128((const __m128i*)LinePS); __m128i B = _mm_shuffle_epi8(SrcV, MaskB); __m128i G = _mm_shuffle_epi8(SrcV, MaskG); __m128i R = _mm_shuffle_epi8(SrcV, MaskR); __m128i Ans1 = Zero, Ans2 = Zero, Ans3 = Zero; Ans1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(B, _mm_setr_epi8(0, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(G, _mm_setr_epi8(-1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans1 
= _mm_or_si128(Ans1, _mm_shuffle_epi8(R, _mm_setr_epi8(-1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(B, _mm_setr_epi8(-1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(G, _mm_setr_epi8(1, -1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(R, _mm_setr_epi8(-1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(B, _mm_setr_epi8(-1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(G, _mm_setr_epi8(-1, -1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(R, _mm_setr_epi8(2, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); _mm_storeu_si128((__m128i*)(LinePD + 0), Ans1); _mm_storeu_si128((__m128i*)(LinePD + 4), Ans2); _mm_storeu_si128((__m128i*)(LinePD + 8), Ans3); } for (; X < Width; X++, LinePS += 4, LinePD += 3) { LinePD[0] = LinePS[0]; LinePD[1] = LinePS[1]; LinePD[2] = LinePS[2]; } } } // 将整形的Value值限定在Min和Max内,可取Min或者Max的值 inline int ClampI(int Value, int Min, int Max) { if (Value < Min) return Min; else if (Value > Max) return Max; else return Value; } // 将整数限制到字节数据类型 inline unsigned char ClampToByte(int Value) { if (Value < 0) return 0; else if (Value > 255) return 255; else return (unsigned char)Value; } // 获取PosX, PosY位置的像素 inline unsigned char *GetCheckedPixel(unsigned char *Src, int Width, int Height, int Stride, int Channel, int PosX, int PosY) { return Src + ClampI(PosY, 0, Height - 1) * Stride + ClampI(PosX, 0, Width - 1) * Channel; } // 该函数计算插值曲线sin(x * PI) / (x * PI)的值,下面是它的近似拟合表达式 float SinXDivX(float X) { const float a = -1; //a还可以取 a=-2,-1,-0.75,-0.5等等,起到调节锐化或模糊程度的作用 X = abs(X); float X2 = X * X, X3 = X2 * X; if (X <= 1) return (a + 2) * X3 - (a + 3) * X2 + 1; else if (X <= 2) return a * X3 - (5 * a) * X2 + 
(8 * a) * X - (4 * a); else return 0; } // 精确计算插值曲线sin(x * PI) / (x * PI) float SinXDivX_Standard(float X) { if (abs(X) < 0.000001f) return 1; else return sin(X * 3.1415926f) / (X * 3.1415926f); } void Bicubic_Original(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, float X, float Y) { int Channel = Stride / Width; int PosX = floor(X), PosY = floor(Y); float PartXX = X - PosX, PartYY = Y - PosY; unsigned char *Pixel00 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY - 1); unsigned char *Pixel01 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY - 1); unsigned char *Pixel02 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY - 1); unsigned char *Pixel03 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY - 1); unsigned char *Pixel10 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 0); unsigned char *Pixel11 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 0); unsigned char *Pixel12 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 0); unsigned char *Pixel13 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 0); unsigned char *Pixel20 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 1); unsigned char *Pixel21 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 1); unsigned char *Pixel22 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 1); unsigned char *Pixel23 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 1); unsigned char *Pixel30 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 2); unsigned char *Pixel31 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 2); unsigned char *Pixel32 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 2); unsigned char *Pixel33 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 
2, PosY + 2);
    float U0 = SinXDivX(1 + PartXX), U1 = SinXDivX(PartXX);
    float U2 = SinXDivX(1 - PartXX), U3 = SinXDivX(2 - PartXX);
    float V0 = SinXDivX(1 + PartYY), V1 = SinXDivX(PartYY);
    float V2 = SinXDivX(1 - PartYY), V3 = SinXDivX(2 - PartYY);
    for (int I = 0; I < Channel; I++) {
        float Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;
        //printf("%.5f\n", Sum1);
        float Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;
        //printf("%.5f\n", Sum2);
        float Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;
        //printf("%.5f\n", Sum3);
        // Row 3 must use Pixel32 (the original read Pixel22 here)
        float Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;
        //printf("%.5f\n", Sum4);
        // printf("%d %.5f %.5f %.5f %.5f\n", I, Sum1, Sum2, Sum3, Sum4);
        Pixel[I] = ClampToByte(Sum1 + Sum2 + Sum3 + Sum4 + 0.5f);
    }
}

// ImageShop says that fixing Channel to a constant makes this much faster; not yet tested
// Fixed-point bicubic sample for border pixels: every neighbor coordinate is clamped
void Bicubic_Border(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, short *SinXDivX_Table, int SrcX, int SrcY) {
    int Channel = Stride / Width;
    int U = (unsigned char)(SrcX >> 8), V = (unsigned char)(SrcY >> 8);
    int U0 = SinXDivX_Table[256 + U], U1 = SinXDivX_Table[U];
    int U2 = SinXDivX_Table[256 - U], U3 = SinXDivX_Table[512 - U];
    int V0 = SinXDivX_Table[256 + V], V1 = SinXDivX_Table[V];
    int V2 = SinXDivX_Table[256 - V], V3 = SinXDivX_Table[512 - V];
    int PosX = SrcX >> 16, PosY = SrcY >> 16;
    unsigned char *Pixel00 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY - 1);
    unsigned char *Pixel01 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY - 1);
    unsigned char *Pixel02 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY - 1);
    unsigned char *Pixel03 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY - 1);
    unsigned char *Pixel10 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 0);
    unsigned char *Pixel11 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 0);
    unsigned char *Pixel12 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 0);
    unsigned char *Pixel13 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 0);
    unsigned char *Pixel20 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 1);
    unsigned char *Pixel21 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 1);
    unsigned char *Pixel22 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 1);
    unsigned char *Pixel23 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 1);
    unsigned char *Pixel30 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 2);
    unsigned char *Pixel31 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 2);
    unsigned char *Pixel32 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 2);
    unsigned char *Pixel33 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 2);
    for (int I = 0; I < Channel; I++) {
        int Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;
        int Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;
        int Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;
        // Row 3 must use Pixel32 (the original read Pixel22 here)
        int Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;
        Pixel[I] = ClampToByte((Sum1 + Sum2 + Sum3 + Sum4) >> 16);
    }
}

// Fixed-point bicubic sample for interior pixels: the 4x4 neighborhood is known to be in range
void Bicubic_Center(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, short *SinXDivX_Table, int SrcX, int SrcY) {
    int Channel = Stride / Width;
    int U = (unsigned char)(SrcX >> 8), V = (unsigned char)(SrcY >> 8);
    int U0 = SinXDivX_Table[256 + U], U1 = SinXDivX_Table[U];
    int U2 = SinXDivX_Table[256 - U], U3 = SinXDivX_Table[512 - U];
    int V0 = SinXDivX_Table[256 + V], V1 = SinXDivX_Table[V];
    int V2 = SinXDivX_Table[256 - V], V3 = SinXDivX_Table[512 - V];
    int PosX = SrcX >> 16, PosY = SrcY >> 16;
    unsigned char *Pixel00 = Src + (PosY - 1) * Stride + (PosX - 1) * Channel;
    unsigned char *Pixel01 = Pixel00 + Channel;
    unsigned char *Pixel02 = Pixel01 + Channel;
    unsigned char *Pixel03 = Pixel02 + Channel;
    unsigned char *Pixel10 = Pixel00 + Stride;
    unsigned char *Pixel11 = Pixel10 + Channel;
    unsigned char *Pixel12 = Pixel11 + Channel;
    unsigned char *Pixel13 = Pixel12 + Channel;
    unsigned char *Pixel20 = Pixel10 + Stride;
    unsigned char *Pixel21 = Pixel20 + Channel;
    unsigned char *Pixel22 = Pixel21 + Channel;
    unsigned char *Pixel23 = Pixel22 + Channel;
    unsigned char *Pixel30 = Pixel20 + Stride;
    unsigned char *Pixel31 = Pixel30 + Channel;
    unsigned char *Pixel32 = Pixel31 + Channel;
    unsigned char *Pixel33 = Pixel32 + Channel;
    for (int I = 0; I < Channel; I++) {
        int Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;
        int Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;
        int Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;
        // Row 3 must use Pixel32 (the original read Pixel22 here)
        int Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;
        Pixel[I] = ClampToByte((Sum1 + Sum2 + Sum3 + Sum4) >> 16);
    }
}

// The original interpolation algorithm
void IM_Resize_Cubic_Origin(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {
    int Channel = StrideS / SrcW;
    if ((SrcW == DstW) && (SrcH == DstH)) {
        memcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));
        return;
    }
    printf("%d\n", Channel);
    for (int Y = 0; Y < DstH; Y++) {
        unsigned char *LinePD = Dest + Y * StrideD;
        float SrcY = (Y + 0.4999999f) * SrcH / DstH - 0.5f;
        for (int X = 0; X < DstW; X++) {
            float SrcX = (X + 0.4999999f) * SrcW / DstW - 0.5f;
            Bicubic_Original(Src, SrcW, SrcH, StrideS, LinePD, SrcX, SrcY);
            LinePD += Channel;
        }
    }
}

// Lookup table + interpolation, implemented in plain C
void IM_Resize_Cubic_Table(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {
    int Channel =
StrideS / SrcW; if ((SrcW == DstW) && (SrcH == DstH)) { memcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char)); return; } short *SinXDivX_Table = (short *)malloc(513 * sizeof(short)); for (int I = 0; I < 513; I++) SinXDivX_Table[I] = int(0.5 + 256 * SinXDivX(I / 256.0f)); // 建立查找表,定点化 int AddX = (SrcW << 16) / DstW, AddY = (SrcH << 16) / DstH; int ErrorX = -(1 << 15) + (AddX >> 1), ErrorY = -(1 << 15) + (AddY >> 1); int StartX = ((1 << 16) - ErrorX) / AddX + 1; // 计算出需要特殊处理的边界 int StartY = ((1 << 16) - ErrorY) / AddY + 1; // y0+y*yr>=1; y0=ErrorY => y>=(1-ErrorY)/yr int EndX = (((SrcW - 3) << 16) - ErrorX) / AddX + 1; int EndY = (((SrcH - 3) << 16) - ErrorY) / AddY + 1; // y0+y*yr<=(height-3) => y<=(height-3-ErrorY)/yr if (StartY >= DstH) StartY = DstH; if (StartX >= DstW) StartX = DstW; if (EndX < StartX) EndX = StartX; if (EndY < StartY) EndY = StartY; // 输出边界 //printf("%d %d %d %d\n", StartX, StartY, EndX, EndY); int SrcY = ErrorY; for (int Y = 0; Y < StartY; Y++, SrcY += AddY) // 前面的不是都有效的取样部分数据 { unsigned char *LinePD = Dest + Y * StrideD; for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel) { Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY); } } for (int Y = StartY; Y < EndY; Y++, SrcY += AddY) { int SrcX = ErrorX; unsigned char *LinePD = Dest + Y * StrideD; for (int X = 0; X < StartX; X++, SrcX += AddX, LinePD += Channel) { Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY); } for (int X = StartX; X < EndX; X++, SrcX += AddX, LinePD += Channel) { Bicubic_Center(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY); } for (int X = EndX; X < DstW; X++, SrcX += AddX, LinePD += Channel) { Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY); } } for (int Y = EndY; Y < DstH; Y++, SrcY += AddY) { unsigned char *LinePD = Dest + Y * StrideD; for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel) { Bicubic_Border(Src, SrcW, 
SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
        }
    }
    free(SinXDivX_Table);
}

// Horizontal sum of four signed 32-bit integers
inline int _mm_hsum_epi32(__m128i V) {                      // lanes, high to low: V3 V2 V1 V0
    __m128i T = _mm_add_epi32(V, _mm_srli_si128(V, 8));     // V3 V2 V1+V3 V0+V2
    T = _mm_add_epi32(T, _mm_srli_si128(T, 4));             // lane 0 now holds V0+V1+V2+V3
    return _mm_cvtsi128_si32(T);                            // extract the low lane
}

// Bicubic interpolation optimized with SSE
// Maximum supported image size: 32767*32767
void IM_Resize_SSE(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD) {
    int Channel = StrideS / SrcW;
    if ((SrcW == DstW) && (SrcH == DstH)) {
        memcpy(Dest, Src, SrcW * SrcH * Channel * sizeof(unsigned char));
        return;
    }
    short *SinXDivX_Table = (short *)malloc(513 * sizeof(short));
    short *Table = (short *)malloc(DstW * 4 * sizeof(short));
    for (int I = 0; I < 513; I++)
        SinXDivX_Table[I] = int(0.5 + 256 * SinXDivX(I / 256.0f)); // build the lookup table in fixed point
    int AddX = (SrcW << 16) / DstW, AddY = (SrcH << 16) / DstH;
    int ErrorX = -(1 << 15) + (AddX >> 1), ErrorY = -(1 << 15) + (AddY >> 1);
    int StartX = ((1 << 16) - ErrorX) / AddX + 1; // compute the borders that need special handling
    int StartY = ((1 << 16) - ErrorY) / AddY + 1; // y0+y*yr>=1; y0=ErrorY => y>=(1-ErrorY)/yr
    int EndX = (((SrcW - 3) << 16) - ErrorX) / AddX + 1;
    int EndY = (((SrcH - 3) << 16) - ErrorY) / AddY + 1; // y0+y*yr<=(height-3) => y<=(height-3-ErrorY)/yr
    if (StartY >= DstH) StartY = DstH;
    if (StartX >= DstW) StartX = DstW;
    if (EndX < StartX) EndX = StartX;
    if (EndY < StartY) EndY = StartY;
    // Per-column weights for X in [StartX, EndX); the bound is EndX (the original used EndY),
    // matching the range that the SSE inner loop reads from Table.
    for (int X = StartX, SrcX = ErrorX + StartX * AddX; X < EndX; X++, SrcX += AddX) {
        int U = (unsigned char)(SrcX >> 8);
        Table[X * 4 + 0] = SinXDivX_Table[256 + U]; // build a new table laid out for the SSE code
        Table[X * 4 + 1] = SinXDivX_Table[U];
        Table[X * 4 + 2] = SinXDivX_Table[256 - U];
        Table[X * 4 + 3] = SinXDivX_Table[512 - U];
    }
    int SrcY = ErrorY;
    for (int Y = 0; Y < StartY; Y++, SrcY += AddY) { // same as in IM_Resize_Cubic_Table
        unsigned char *LinePD = Dest + Y * StrideD;
        for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel) {
Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY); } } for (int Y = StartY; Y < EndY; Y++, SrcY += AddY) { int SrcX = ErrorX; unsigned char *LinePD = Dest + Y * StrideD; for (int X = 0; X < StartX; X++, SrcX += AddX, LinePD += Channel) { Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY); } int V = (unsigned char)(SrcY >> 8); unsigned char *LineY = Src + ((SrcY >> 16) - 1) * StrideS; __m128i PartY = _mm_setr_epi32(SinXDivX_Table[256 + V], SinXDivX_Table[V], SinXDivX_Table[256 - V], SinXDivX_Table[512 - V]); for (int X = StartX; X < EndX; X++, SrcX += AddX, LinePD += Channel) { __m128i PartX = _mm_loadl_epi64((__m128i *)(Table + X * 4)); //PartX: U0 U1 U2 U3 U0 U1 U2 U3 PartX = _mm_unpacklo_epi64(PartX, PartX); unsigned char *Pixel0 = LineY + ((SrcX >> 16) - 1) * Channel; unsigned char *Pixel1 = Pixel0 + StrideS; unsigned char *Pixel2 = Pixel1 + StrideS; unsigned char *Pixel3 = Pixel2 + StrideS; if (Channel == 1) { __m128i P01 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel0)), _mm_cvtsi32_si128(*((int *)Pixel1)))); // P00 P01 P02 P03 P10 P11 P12 P13 __m128i P23 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel2)), _mm_cvtsi32_si128(*((int *)Pixel3)))); // P20 P21 P22 P23 P30 P31 P32 P33 __m128i Sum01 = _mm_madd_epi16(P01, PartX); // P00 * U0 + P01 * U1 P02 * U2 + P03 * U3 P10 * U0 + P11 * U1 P12 * U2 + P13 * U3 __m128i Sum23 = _mm_madd_epi16(P23, PartX); // P20 * U0 + P21 * U1 P22 * U2 + P23 * U3 P30 * U0 + P31 * U1 P32 * U2 + P33 * U3 __m128i Sum = _mm_hadd_epi32(Sum01, Sum23); // P00 * U0 + P01 * U1 + P02 * U2 + P03 * U3 P10 * U0 + P11 * U1 + P12 * U2 + P13 * U3 P20 * U0 + P21 * U1 + P22 * U2 + P23 * U3 P30 * U0 + P31 * U1 + P32 * U2 + P33 * U3 LinePD[0] = ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(Sum, PartY)) >> 16); } else if (Channel == 4) { __m128i P0 = _mm_loadu_si128((__m128i *)Pixel0), P1 = _mm_loadu_si128((__m128i *)Pixel1); __m128i P2 = 
_mm_loadu_si128((__m128i *)Pixel2), P3 = _mm_loadu_si128((__m128i *)Pixel3); P0 = _mm_shuffle_epi8(P0, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15)); // B0 G0 R0 A0 P1 = _mm_shuffle_epi8(P1, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15)); // B1 G1 R1 A1 P2 = _mm_shuffle_epi8(P2, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15)); // B2 G2 R2 A2 P3 = _mm_shuffle_epi8(P3, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15)); // B3 G3 R3 A3 __m128i BG01 = _mm_unpacklo_epi32(P0, P1); // B0 B1 G0 G1 __m128i RA01 = _mm_unpackhi_epi32(P0, P1); // R0 R1 A0 A1 __m128i BG23 = _mm_unpacklo_epi32(P2, P3); // B2 B3 G2 G3 __m128i RA23 = _mm_unpackhi_epi32(P2, P3); // R2 R3 A2 A3 __m128i B01 = _mm_unpacklo_epi8(BG01, _mm_setzero_si128()); __m128i B23 = _mm_unpacklo_epi8(BG23, _mm_setzero_si128()); __m128i SumB = _mm_hadd_epi32(_mm_madd_epi16(B01, PartX), _mm_madd_epi16(B23, PartX)); __m128i G01 = _mm_unpackhi_epi8(BG01, _mm_setzero_si128()); __m128i G23 = _mm_unpackhi_epi8(BG23, _mm_setzero_si128()); __m128i SumG = _mm_hadd_epi32(_mm_madd_epi16(G01, PartX), _mm_madd_epi16(G23, PartX)); __m128i R01 = _mm_unpacklo_epi8(RA01, _mm_setzero_si128()); __m128i R23 = _mm_unpacklo_epi8(RA23, _mm_setzero_si128()); __m128i SumR = _mm_hadd_epi32(_mm_madd_epi16(R01, PartX), _mm_madd_epi16(R23, PartX)); __m128i A01 = _mm_unpackhi_epi8(RA01, _mm_setzero_si128()); __m128i A23 = _mm_unpackhi_epi8(RA23, _mm_setzero_si128()); __m128i SumA = _mm_hadd_epi32(_mm_madd_epi16(A01, PartX), _mm_madd_epi16(A23, PartX)); __m128i Result = _mm_setr_epi32(_mm_hsum_epi32(_mm_mullo_epi32(SumB, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumG, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumR, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumA, PartY))); Result = _mm_srai_epi32(Result, 16); // *((int *)LinePD) = _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(Result, Result), Result)); _mm_stream_si32((int *)LinePD, 
_mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(Result, Result), Result))); //LinePD[0] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumB, PartY)) >> 16); // 确实有部分存在超出unsigned char范围的,因为定点化的缘故 //LinePD[1] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumG, PartY)) >> 16); //LinePD[2] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumR, PartY)) >> 16); //LinePD[3] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumA, PartY)) >> 16); } } for (int X = EndX; X < DstW; X++, SrcX += AddX, LinePD += Channel) { Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY); } } for (int Y = EndY; Y < DstH; Y++, SrcY += AddY) { unsigned char *LinePD = Dest + Y * StrideD; for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel) { Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY); } } free(Table); free(SinXDivX_Table); } int main() { Mat src = imread("F:\\car.jpg"); int Height = src.rows; int Width = src.cols; int Stride = Width * 3; unsigned char *Src = src.data; unsigned char *Buffer = new unsigned char[Height * Width * 4]; ConvertBGR8U2BGRAF(Src, Buffer, Width, Height, Stride); int SrcW = Width; int SrcH = Height; int StrideS = Width * 4; int DstW = Width * 15 / 10; int DstH = Height * 15 / 10; unsigned char *Res = new unsigned char[DstH * DstW * 4]; unsigned char *Dest = new unsigned char[DstH * DstW * 3]; int StrideD = DstW * 4; int64 st = cvGetTickCount(); for (int i = 0; i < 10; i++) { IM_Resize_SSE(Buffer, Res, SrcW, SrcH, StrideS, DstW, DstH, StrideD); } double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100; printf("%.5f\n", duration); IM_Resize_Cubic_Origin(Buffer, Res, SrcW, SrcH, StrideS, DstW, DstH, StrideD); ConvertBGRAF2BGR8U(Res, Dest, DstW, DstH, DstW * 3); Mat dst(DstH, DstW, CV_8UC3, Dest); imshow("origin", src); imshow("result", dst); imwrite("F:\\res.jpg", dst); waitKey(0); } ================================================ FILE: speed_box_filter_sse.cpp 
================================================ #include #include #include "../../OpencvTest/OpencvTest/Core.h" #include "../../OpencvTest/OpencvTest/MaxFilter.h" #include "../../OpencvTest/OpencvTest/Utility.h" #include "../../OpencvTest/OpencvTest/BoxFilter.h" using namespace std; using namespace cv; void BoxBlur_1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) { TMatrix a, b; TMatrix *p1 = &a, *p2 = &b; TMatrix **p3 = &p1, **p4 = &p2; IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3); IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4); (p1)->Data = Src; (p2)->Data = Dest; BoxBlur(p1, p2, Radius, EdgeMode::Smear); } void BoxBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) { TMatrix a, b; TMatrix *p1 = &a, *p2 = &b; TMatrix **p3 = &p1, **p4 = &p2; IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3); IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4); (p1)->Data = Src; (p2)->Data = Dest; BoxBlur_SSE(p1, p2, Radius, EdgeMode::Smear); } int main() { Mat src = imread("F:\\car.jpg"); int Height = src.rows; int Width = src.cols; unsigned char *Src = src.data; unsigned char *Dest = new unsigned char[Height * Width * 3]; int Stride = Width * 3; int Radius = 11; int64 st = cvGetTickCount(); for (int i = 0; i <10; i++) { //Mat temp = MaxFilter(src, Radius); BoxBlur_SSE(Src, Dest, Width, Height, Stride, 3, Radius); } double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100; printf("%.5f\n", duration); BoxBlur_SSE(Src, Dest, Width, Height, Stride, 3, Radius); Mat dst(Height, Width, CV_8UC3, Dest); imshow("origin", src); imshow("result", dst); imwrite("F:\\res.jpg", dst); waitKey(0); return 0; } ================================================ FILE: speed_common_functions.cpp ================================================ //近似值 union Approximation { double Value; int X[2]; }; // 函数1: 将数据截断在Byte数据类型内。 // 参考: 
http://www.cnblogs.com/zyl910/archive/2012/03/12/noifopex1.html
// Notes: saturates with bit masks; the masks are generated with signed (arithmetic) right shifts.
unsigned char ClampToByte(int Value){
    return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));
}

// Function 2: clamp a value to a given range
// Reference: none
// Notes: none
int ClampToInt(int Value, int Min, int Max) {
    if (Value < Min) return Min;
    else if (Value > Max) return Max;
    else return Value;
}

// Function 3: divide an integer by 255
// Reference: none
// Notes: uses shifts instead of a division
int Div255(int Value) {
    return (((Value >> 8) + Value + 1) >> 8);
}

// Function 4: absolute value
// Reference: https://oi-wiki.org/math/bit/
// Notes: branch-free; intended to be faster than n > 0 ? n : -n
int Abs(int n) {
    return (n ^ (n >> 31)) - (n >> 31);
    /* n >> 31 extracts the sign of n: it is 0 when n is non-negative and -1 when n is negative.
       For non-negative n, n ^ 0 = n, so the value is unchanged.
       For negative n, n ^ -1 flips every bit of the two's complement representation,
       which gives |n| - 1; subtracting -1 then yields the absolute value. */
}

// Function 5: round to the nearest integer
// Reference: none
// Notes: none
double Round(double V) {
    // The negative branch uses ceil(V - 0.5); calling Round(V - 0.5) here would recurse forever.
    return (V > 0.0) ? floor(V + 0.5) : ceil(V - 0.5);
}

// Function 6: return a random number in [0, 1)
// Reference: none
// Notes: none
double Rand() {
    return (double)rand() / (RAND_MAX + 1.0);
}

// Function 7: approximate Pow, for double and float
// Reference: http://www.cvchina.info/2010/03/19/log-pow-exp-approximation/
// Reference: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/
// Notes: this is only an approximation for speed; the error ranges from 5% to 12%
double Pow(double X, double Y) {
    Approximation V = { X };
    V.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);
    V.X[0] = 0;
    return V.Value;
}

float Pow(float X, float Y) {
    Approximation V = { X };
    V.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);
    V.X[0] = 0;
    return (float)V.Value;
}

// Function 8: approximate Exp, for double and float
double Exp(double Y) // the union-based approach is somewhat faster
{
    Approximation V;
    V.X[1] = (int)(Y * 1485963 + 1072632447);
    V.X[0] = 0;
    return V.Value;
}

float Exp(float Y) // the union-based approach is somewhat faster
{
    Approximation V;
    V.X[1] = (int)(Y * 1485963 + 1072632447);
    V.X[0] = 0;
    return (float)V.Value;
}

// Function 9: a more accurate Pow approximation, at the cost of some speed
// http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/
// Besides that, I also have now a slower approximation that has much less error
// when the
exponent is larger than 1. It makes use of exponentiation by squaring,
// which is exact for the integer part of the exponent, and uses only the exponent’s fraction for the approximation:
// should be much more precise with large Y
double PrecisePow(double X, double Y){
    // calculate approximation with fraction of the exponent
    // Note: the squaring loop below assumes Y >= 0; with an arithmetic right shift
    // a negative integer part never reaches zero.
    int e = (int)Y;
    Approximation V = { X };
    V.X[1] = (int)((Y - e) * (V.X[1] - 1072632447) + 1072632447);
    V.X[0] = 0;
    // exponentiation by squaring with the exponent's integer part
    // double r = u.d makes everything much slower, not sure why
    double r = 1.0;
    while (e) {
        if (e & 1) r *= X;
        X *= X;
        e >>= 1;
    }
    return r * V.Value;
}

// Function 10: return a random integer between Min and Max
// Reference: none
// Notes: Min is the smallest value that can be returned, Max the largest
int Random(int Min, int Max){
    return rand() % (Max + 1 - Min) + Min;
}

// Function 11: sign function
// Reference: none
// Notes: none
int sgn(int X){
    if (X > 0) return 1;
    if (X < 0) return -1;
    return 0;
}

// Function 12: extract the color components packed in an integer
// Reference: none
// Notes: none
void GetRGB(int Color, int *R, int *G, int *B){
    *R = Color & 255;
    *G = (Color & 65280) / 256;
    *B = (Color & 16711680) / 65536;
}

// Function 13: approximate square root using Newton's method
// Reference: https://www.cnblogs.com/qlky/p/7735145.html
// Notes: still an approximation; it estimates the inverse square root and returns its reciprocal
float Sqrt(float X) {
    float HalfX = 0.5f * X; // does not work for double
    int I = *(int*)&X; // get bits for floating VALUE
    I = 0x5f375a86 - (I >> 1); // gives initial guess y0
    X = *(float*)&I; // convert bits BACK to float
    X = X * (1.5f - HalfX * X * X); // Newton step, repeating increases accuracy
    X = X * (1.5f - HalfX * X * X); // Newton step, repeating increases accuracy
    X = X * (1.5f - HalfX * X * X); // Newton step, repeating increases accuracy
    return 1 / X;
}

// Function 14: add two unsigned-short histograms (256 bins), i.e. Y = X + Y
// Reference: none
// Notes: SSE optimized
void HistgramAddShort(unsigned short *X, unsigned short *Y) {
    *(__m128i*)(Y + 0) = _mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]); // don't expect hand-written assembly to beat this; it has been tried
    *(__m128i*)(Y + 8) = _mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);
    *(__m128i*)(Y + 16) = _mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);
    *(__m128i*)(Y + 24)
= _mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]); *(__m128i*)(Y + 32) = _mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]); *(__m128i*)(Y + 40) = _mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]); *(__m128i*)(Y + 48) = _mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]); *(__m128i*)(Y + 56) = _mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]); *(__m128i*)(Y + 64) = _mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]); *(__m128i*)(Y + 72) = _mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]); *(__m128i*)(Y + 80) = _mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]); *(__m128i*)(Y + 88) = _mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]); *(__m128i*)(Y + 96) = _mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]); *(__m128i*)(Y + 104) = _mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]); *(__m128i*)(Y + 112) = _mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]); *(__m128i*)(Y + 120) = _mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]); *(__m128i*)(Y + 128) = _mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]); *(__m128i*)(Y + 136) = _mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]); *(__m128i*)(Y + 144) = _mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]); *(__m128i*)(Y + 152) = _mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]); *(__m128i*)(Y + 160) = _mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]); *(__m128i*)(Y + 168) = _mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]); *(__m128i*)(Y + 176) = _mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]); *(__m128i*)(Y + 184) = _mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]); *(__m128i*)(Y + 192) = _mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]); *(__m128i*)(Y + 200) = _mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]); *(__m128i*)(Y + 208) = _mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]); *(__m128i*)(Y + 216) = _mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]); *(__m128i*)(Y + 224) = _mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]); *(__m128i*)(Y + 232) 
= _mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]); *(__m128i*)(Y + 240) = _mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]); *(__m128i*)(Y + 248) = _mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]); } //函数15: 无符号短整形直方图数据相减,即是Y = Y - X //参考: 无 //简介: SSE优化 void HistgramSubShort(unsigned short *X, unsigned short *Y) { *(__m128i*)(Y + 0) = _mm_sub_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]); *(__m128i*)(Y + 8) = _mm_sub_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]); *(__m128i*)(Y + 16) = _mm_sub_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]); *(__m128i*)(Y + 24) = _mm_sub_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]); *(__m128i*)(Y + 32) = _mm_sub_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]); *(__m128i*)(Y + 40) = _mm_sub_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]); *(__m128i*)(Y + 48) = _mm_sub_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]); *(__m128i*)(Y + 56) = _mm_sub_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]); *(__m128i*)(Y + 64) = _mm_sub_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]); *(__m128i*)(Y + 72) = _mm_sub_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]); *(__m128i*)(Y + 80) = _mm_sub_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]); *(__m128i*)(Y + 88) = _mm_sub_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]); *(__m128i*)(Y + 96) = _mm_sub_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]); *(__m128i*)(Y + 104) = _mm_sub_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]); *(__m128i*)(Y + 112) = _mm_sub_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]); *(__m128i*)(Y + 120) = _mm_sub_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]); *(__m128i*)(Y + 128) = _mm_sub_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]); *(__m128i*)(Y + 136) = _mm_sub_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]); *(__m128i*)(Y + 144) = _mm_sub_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]); *(__m128i*)(Y + 152) = _mm_sub_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]); *(__m128i*)(Y + 160) = _mm_sub_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]); *(__m128i*)(Y + 168) = _mm_sub_epi16(*(__m128i*)&Y[168], 
*(__m128i*)&X[168]); *(__m128i*)(Y + 176) = _mm_sub_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]); *(__m128i*)(Y + 184) = _mm_sub_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]); *(__m128i*)(Y + 192) = _mm_sub_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]); *(__m128i*)(Y + 200) = _mm_sub_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]); *(__m128i*)(Y + 208) = _mm_sub_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]); *(__m128i*)(Y + 216) = _mm_sub_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]); *(__m128i*)(Y + 224) = _mm_sub_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]); *(__m128i*)(Y + 232) = _mm_sub_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]); *(__m128i*)(Y + 240) = _mm_sub_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]); *(__m128i*)(Y + 248) = _mm_sub_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]); } //函数16: 无符号短整形直方图数据相加减,即是Z = Z + Y - X //参考: 无 //简介: SSE优化 void HistgramSubAddShort(unsigned short *X, unsigned short *Y, unsigned short *Z) { *(__m128i*)(Z + 0) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&Z[0]), *(__m128i*)&X[0]); // 不要想着用自己写的汇编超过他的速度了,已经试过了 *(__m128i*)(Z + 8) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&Z[8]), *(__m128i*)&X[8]); *(__m128i*)(Z + 16) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&Z[16]), *(__m128i*)&X[16]); *(__m128i*)(Z + 24) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&Z[24]), *(__m128i*)&X[24]); *(__m128i*)(Z + 32) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&Z[32]), *(__m128i*)&X[32]); *(__m128i*)(Z + 40) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&Z[40]), *(__m128i*)&X[40]); *(__m128i*)(Z + 48) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&Z[48]), *(__m128i*)&X[48]); *(__m128i*)(Z + 56) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&Z[56]), *(__m128i*)&X[56]); *(__m128i*)(Z + 64) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&Z[64]), *(__m128i*)&X[64]); *(__m128i*)(Z + 72) = 
_mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&Z[72]), *(__m128i*)&X[72]); *(__m128i*)(Z + 80) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&Z[80]), *(__m128i*)&X[80]); *(__m128i*)(Z + 88) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&Z[88]), *(__m128i*)&X[88]); *(__m128i*)(Z + 96) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&Z[96]), *(__m128i*)&X[96]); *(__m128i*)(Z + 104) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&Z[104]), *(__m128i*)&X[104]); *(__m128i*)(Z + 112) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&Z[112]), *(__m128i*)&X[112]); *(__m128i*)(Z + 120) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&Z[120]), *(__m128i*)&X[120]); *(__m128i*)(Z + 128) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&Z[128]), *(__m128i*)&X[128]); *(__m128i*)(Z + 136) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&Z[136]), *(__m128i*)&X[136]); *(__m128i*)(Z + 144) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&Z[144]), *(__m128i*)&X[144]); *(__m128i*)(Z + 152) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&Z[152]), *(__m128i*)&X[152]); *(__m128i*)(Z + 160) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&Z[160]), *(__m128i*)&X[160]); *(__m128i*)(Z + 168) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&Z[168]), *(__m128i*)&X[168]); *(__m128i*)(Z + 176) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&Z[176]), *(__m128i*)&X[176]); *(__m128i*)(Z + 184) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&Z[184]), *(__m128i*)&X[184]); *(__m128i*)(Z + 192) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&Z[192]), *(__m128i*)&X[192]); *(__m128i*)(Z + 200) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&Z[200]), *(__m128i*)&X[200]); *(__m128i*)(Z + 208) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&Z[208]), *(__m128i*)&X[208]); 
*(__m128i*)(Z + 216) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&Z[216]), *(__m128i*)&X[216]); *(__m128i*)(Z + 224) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&Z[224]), *(__m128i*)&X[224]); *(__m128i*)(Z + 232) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&Z[232]), *(__m128i*)&X[232]); *(__m128i*)(Z + 240) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&Z[240]), *(__m128i*)&X[240]); *(__m128i*)(Z + 248) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&Z[248]), *(__m128i*)&X[248]); } ================================================ FILE: speed_gaussian_filter_sse.cpp ================================================ #include #include using namespace std; using namespace cv; void CalcGaussCof(float Radius, float &B0, float &B1, float &B2, float &B3) { float Q, B; if (Radius >= 2.5) Q = (double)(0.98711 * Radius - 0.96330); // 对应论文公式11b else if ((Radius >= 0.5) && (Radius < 2.5)) Q = (double)(3.97156 - 4.14554 * sqrt(1 - 0.26891 * Radius)); else Q = (double)0.1147705018520355224609375; B = 1.57825 + 2.44413 * Q + 1.4281 * Q * Q + 0.422205 * Q * Q * Q; // 对应论文公式8c B1 = 2.44413 * Q + 2.85619 * Q * Q + 1.26661 * Q * Q * Q; B2 = -1.4281 * Q * Q - 1.26661 * Q * Q * Q; B3 = 0.422205 * Q * Q * Q; B0 = 1.0 - (B1 + B2 + B3) / B; B1 = B1 / B; B2 = B2 / B; B3 = B3 / B; } void ConvertBGR8U2BGRAF(unsigned char *Src, float *Dest, int Width, int Height, int Stride) { //#pragma omp parallel for for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; float *LinePD = Dest + Y * Width * 3; for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3) { LinePD[0] = LinePS[0]; LinePD[1] = LinePS[1]; LinePD[2] = LinePS[2]; } } } void ConvertBGR8U2BGRAF_SSE(unsigned char *Src, float *Dest, int Width, int Height, int Stride) { const int BlockSize = 4; int Block = (Width - 2) / BlockSize; __m128i Mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1); __m128i Zero = 
_mm_setzero_si128(); for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; float *LinePD = Dest + Y * Width * 4; int X = 0; for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize * 4) { __m128i SrcV = _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)LinePS), Mask); __m128i Src16L = _mm_unpacklo_epi8(SrcV, Zero); __m128i Src16H = _mm_unpackhi_epi8(SrcV, Zero); _mm_store_ps(LinePD + 0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16L, Zero))); _mm_store_ps(LinePD + 4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16L, Zero))); _mm_store_ps(LinePD + 8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(Src16H, Zero))); _mm_store_ps(LinePD + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(Src16H, Zero))); } for (; X < Width; X++, LinePS += 3, LinePD += 4) { LinePD[0] = LinePS[0]; LinePD[1] = LinePS[1]; LinePD[2] = LinePS[2]; LinePD[3] = 0; } } } void GaussBlurFromLeftToRight(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) { //#pragma omp parallel for for (int Y = 0; Y < Height; Y++) { float *LinePD = Data + Y * Width * 3; //w[n-1], w[n-2], w[n-3] float BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0]; //边缘处使用重复像素的方案 float GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1]; float RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2]; for (int X = 0; X < Width; X++, LinePD += 3) { LinePD[0] = LinePD[0] * B0 + BS1 * B1 + BS2 * B2 + BS3 * B3; LinePD[1] = LinePD[1] * B0 + GS1 * B1 + GS2 * B2 + GS3 * B3; // 进行顺向迭代 LinePD[2] = LinePD[2] * B0 + RS1 * B1 + RS2 * B2 + RS3 * B3; BS3 = BS2, BS2 = BS1, BS1 = LinePD[0]; GS3 = GS2, GS2 = GS1, GS1 = LinePD[1]; RS3 = RS2, RS2 = RS1, RS1 = LinePD[2]; } } } void GaussBlurFromLeftToRight_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) { const __m128 CofB0 = _mm_set_ps(0, B0, B0, B0); const __m128 CofB1 = _mm_set_ps(0, B1, B1, B1); const __m128 CofB2 = _mm_set_ps(0, B2, B2, B2); const __m128 CofB3 = _mm_set_ps(0, B3, B3, B3); for (int Y = 0; Y < Height; Y++) { float 
*LinePD = Data + Y * Width * 4;
        __m128 V1 = _mm_set_ps(LinePD[3], LinePD[2], LinePD[1], LinePD[0]);
        __m128 V2 = V1, V3 = V1;
        for (int X = 0; X < Width; X++, LinePD += 4) {
            __m128 V0 = _mm_load_ps(LinePD);
            __m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1));
            __m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3));
            __m128 V = _mm_add_ps(V01, V23);
            V3 = V2; V2 = V1; V1 = V;
            _mm_store_ps(LinePD, V);
        }
    }
}

void GaussBlurFromRightToLeft(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {
    for (int Y = 0; Y < Height; Y++) {
        // w[n+1], w[n+2], w[n+3]
        // Start at the last pixel of the row; Width * 3 would point at the first pixel of the next row.
        float *LinePD = Data + Y * Width * 3 + (Width - 1) * 3;
        float BS1 = LinePD[0], BS2 = LinePD[0], BS3 = LinePD[0]; // replicate the edge pixel at the border
        float GS1 = LinePD[1], GS2 = LinePD[1], GS3 = LinePD[1];
        float RS1 = LinePD[2], RS2 = LinePD[2], RS3 = LinePD[2];
        for (int X = Width - 1; X >= 0; X--, LinePD -= 3) {
            LinePD[0] = LinePD[0] * B0 + BS3 * B1 + BS2 * B2 + BS1 * B3;
            LinePD[1] = LinePD[1] * B0 + GS3 * B1 + GS2 * B2 + GS1 * B3; // backward recursion
            LinePD[2] = LinePD[2] * B0 + RS3 * B1 + RS2 * B2 + RS1 * B3;
            BS1 = BS2, BS2 = BS3, BS3 = LinePD[0];
            GS1 = GS2, GS2 = GS3, GS3 = LinePD[1];
            RS1 = RS2, RS2 = RS3, RS3 = LinePD[2];
        }
    }
}

void GaussBlurFromRightToLeft_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) {
    const __m128 CofB0 = _mm_set_ps(0, B0, B0, B0);
    const __m128 CofB1 = _mm_set_ps(0, B1, B1, B1);
    const __m128 CofB2 = _mm_set_ps(0, B2, B2, B2);
    const __m128 CofB3 = _mm_set_ps(0, B3, B3, B3);
    for (int Y = 0; Y < Height; Y++) {
        // Start at the last pixel of the row; Width * 4 would point at the first pixel of the next row.
        float *LinePD = Data + Y * Width * 4 + (Width - 1) * 4;
        __m128 V1 = _mm_set_ps(LinePD[3], LinePD[2], LinePD[1], LinePD[0]);
        __m128 V2 = V1, V3 = V1;
        for (int X = Width - 1; X >= 0; X--, LinePD -= 4) {
            __m128 V0 = _mm_load_ps(LinePD);
            __m128 V03 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V3));
            __m128 V12 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V1));
            __m128 V = _mm_add_ps(V03, V12);
            V1 = V2; V2 = V3; V3 = V;
_mm_store_ps(LinePD, V); } } } //w[n] w[n-1], w[n-2], w[n-3] void GaussBlurFromTopToBottom(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) { for (int Y = 0; Y < Height; Y++) { float *LinePD3 = Data + (Y + 0) * Width * 3; float *LinePD2 = Data + (Y + 1) * Width * 3; float *LinePD1 = Data + (Y + 2) * Width * 3; float *LinePD0 = Data + (Y + 3) * Width * 3; for (int X = 0; X < Width; X++, LinePD0 += 3, LinePD1 += 3, LinePD2 += 3, LinePD3 += 3) { LinePD0[0] = LinePD0[0] * B0 + LinePD1[0] * B1 + LinePD2[0] * B2 + LinePD3[0] * B3; LinePD0[1] = LinePD0[1] * B0 + LinePD1[1] * B1 + LinePD2[1] * B2 + LinePD3[1] * B3; LinePD0[2] = LinePD0[2] * B0 + LinePD1[2] * B1 + LinePD2[2] * B2 + LinePD3[2] * B3; } } } void GaussBlurFromTopToBottom_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3){ const __m128 CofB0 = _mm_set_ps(0, B0, B0, B0); const __m128 CofB1 = _mm_set_ps(0, B1, B1, B1); const __m128 CofB2 = _mm_set_ps(0, B2, B2, B2); const __m128 CofB3 = _mm_set_ps(0, B3, B3, B3); for (int Y = 0; Y < Height; Y++) { float *LinePS3 = Data + (Y + 0) * Width * 4; float *LinePS2 = Data + (Y + 1) * Width * 4; float *LinePS1 = Data + (Y + 2) * Width * 4; float *LinePS0 = Data + (Y + 3) * Width * 4; for (int X = 0; X < Width * 4; X += 4) { __m128 V3 = _mm_load_ps(LinePS3 + X); __m128 V2 = _mm_load_ps(LinePS2 + X); __m128 V1 = _mm_load_ps(LinePS1 + X); __m128 V0 = _mm_load_ps(LinePS0 + X); __m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1)); __m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3)); _mm_store_ps(LinePS0 + X, _mm_add_ps(V01, V23)); } } } //w[n] w[n+1], w[n+2], w[n+3] void GaussBlurFromBottomToTop(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) { for (int Y = Height - 1; Y >= 0; Y--) { float *LinePD3 = Data + (Y + 3) * Width * 3; float *LinePD2 = Data + (Y + 2) * Width * 3; float *LinePD1 = Data + (Y + 1) * Width * 3; float *LinePD0 = Data + (Y + 0) * Width * 3; 
for (int X = 0; X < Width; X++, LinePD0 += 3, LinePD1 += 3, LinePD2 += 3, LinePD3 += 3) { LinePD0[0] = LinePD0[0] * B0 + LinePD1[0] * B1 + LinePD2[0] * B2 + LinePD3[0] * B3; LinePD0[1] = LinePD0[1] * B0 + LinePD1[1] * B1 + LinePD2[1] * B2 + LinePD3[1] * B3; LinePD0[2] = LinePD0[2] * B0 + LinePD1[2] * B1 + LinePD2[2] * B2 + LinePD3[2] * B3; } } } void GaussBlurFromBottomToTop_SSE(float *Data, int Width, int Height, float B0, float B1, float B2, float B3) { const __m128 CofB0 = _mm_set_ps(0, B0, B0, B0); const __m128 CofB1 = _mm_set_ps(0, B1, B1, B1); const __m128 CofB2 = _mm_set_ps(0, B2, B2, B2); const __m128 CofB3 = _mm_set_ps(0, B3, B3, B3); for (int Y = Height - 1; Y >= 0; Y--) { float *LinePS3 = Data + (Y + 3) * Width * 4; float *LinePS2 = Data + (Y + 2) * Width * 4; float *LinePS1 = Data + (Y + 1) * Width * 4; float *LinePS0 = Data + (Y + 0) * Width * 4; for (int X = 0; X < Width * 4; X += 4) { __m128 V3 = _mm_load_ps(LinePS3 + X); __m128 V2 = _mm_load_ps(LinePS2 + X); __m128 V1 = _mm_load_ps(LinePS1 + X); __m128 V0 = _mm_load_ps(LinePS0 + X); __m128 V01 = _mm_add_ps(_mm_mul_ps(CofB0, V0), _mm_mul_ps(CofB1, V1)); __m128 V23 = _mm_add_ps(_mm_mul_ps(CofB2, V2), _mm_mul_ps(CofB3, V3)); _mm_store_ps(LinePS0 + X, _mm_add_ps(V01, V23)); } } } void ConvertBGRAF2BGR8U(float *Src, unsigned char *Dest, int Width, int Height, int Stride) { //#pragma omp parallel for for (int Y = 0; Y < Height; Y++) { float *LinePS = Src + Y * Width * 3; unsigned char *LinePD = Dest + Y * Stride; for (int X = 0; X < Width; X++, LinePS += 3, LinePD += 3) { LinePD[0] = LinePS[0]; LinePD[1] = LinePS[1]; LinePD[2] = LinePS[2]; } } } void ConvertBGRAF2BGR8U_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { const int BlockSize = 4; int Block = (Width - 2) / BlockSize; //__m128i Mask = _mm_setr_epi8(0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 3, 7, 11, 15); __m128i MaskB = _mm_setr_epi8(0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); __m128i MaskG = 
_mm_setr_epi8(1, 5, 9, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); __m128i MaskR = _mm_setr_epi8(2, 6, 10, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); __m128i Zero = _mm_setzero_si128(); for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Width * 4; unsigned char *LinePD = Dest + Y * Stride; int X = 0; for (; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 4, LinePD += BlockSize * 3) { __m128i SrcV = _mm_loadu_si128((const __m128i*)LinePS); __m128i B = _mm_shuffle_epi8(SrcV, MaskB); __m128i G = _mm_shuffle_epi8(SrcV, MaskG); __m128i R = _mm_shuffle_epi8(SrcV, MaskR); __m128i Ans1 = Zero, Ans2 = Zero, Ans3 = Zero; Ans1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(B, _mm_setr_epi8(0, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(G, _mm_setr_epi8(-1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans1 = _mm_or_si128(Ans1, _mm_shuffle_epi8(R, _mm_setr_epi8(-1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(B, _mm_setr_epi8(-1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(G, _mm_setr_epi8(1, -1, -1, 2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans2 = _mm_or_si128(Ans2, _mm_shuffle_epi8(R, _mm_setr_epi8(-1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(B, _mm_setr_epi8(-1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(G, _mm_setr_epi8(-1, -1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); Ans3 = _mm_or_si128(Ans3, _mm_shuffle_epi8(R, _mm_setr_epi8(2, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1))); _mm_storeu_si128((__m128i*)(LinePD + 0), Ans1); _mm_storeu_si128((__m128i*)(LinePD + 4), Ans2); _mm_storeu_si128((__m128i*)(LinePD + 8), Ans3); } for (; X < Width; X++, LinePS += 4, LinePD += 3) { 
LinePD[0] = LinePS[0]; LinePD[1] = LinePS[1]; LinePD[2] = LinePS[2]; } } } void GaussBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius) { float B0, B1, B2, B3; float *Buffer = (float *)malloc(Width * (Height + 6) * sizeof(float) * 3); CalcGaussCof(Radius, B0, B1, B2, B3); ConvertBGR8U2BGRAF(Src, Buffer + 3 * Width * 3, Width, Height, Stride); GaussBlurFromLeftToRight(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3); GaussBlurFromRightToLeft(Buffer + 3 * Width * 3, Width, Height, B0, B1, B2, B3); // 如果启用多线程,建议把这个函数写到GaussBlurFromLeftToRight的for X循环里,因为这样就可以减少线程并发时的阻力 memcpy(Buffer + 0 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float)); memcpy(Buffer + 1 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float)); memcpy(Buffer + 2 * Width * 3, Buffer + 3 * Width * 3, Width * 3 * sizeof(float)); GaussBlurFromTopToBottom(Buffer, Width, Height, B0, B1, B2, B3); memcpy(Buffer + (Height + 3) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float)); memcpy(Buffer + (Height + 4) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float)); memcpy(Buffer + (Height + 5) * Width * 3, Buffer + (Height + 2) * Width * 3, Width * 3 * sizeof(float)); GaussBlurFromBottomToTop(Buffer, Width, Height, B0, B1, B2, B3); ConvertBGRAF2BGR8U(Buffer + 3 * Width * 3, Dest, Width, Height, Stride); free(Buffer); } void GaussBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, float Radius) { float B0, B1, B2, B3; float *Buffer = (float *)_mm_malloc(Width * (Height + 6) * sizeof(float) * 4, 16); CalcGaussCof(Radius, B0, B1, B2, B3); ConvertBGR8U2BGRAF_SSE(Src, Buffer + 3 * Width * 4, Width, Height, Stride); GaussBlurFromLeftToRight_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3); // 在SSE版本中,这两个函数占用的时间比下面两个要多,不过C语言版本也是一样的 GaussBlurFromRightToLeft_SSE(Buffer + 3 * Width * 4, Width, Height, B0, B1, B2, B3); // 如果启用多线程,建议把这个函数写到GaussBlurFromLeftToRight的for 
X循环里,因为这样就可以减少线程并发时的阻力 memcpy(Buffer + 0 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float)); memcpy(Buffer + 1 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float)); memcpy(Buffer + 2 * Width * 4, Buffer + 3 * Width * 4, Width * 4 * sizeof(float)); GaussBlurFromTopToBottom_SSE(Buffer, Width, Height, B0, B1, B2, B3); memcpy(Buffer + (Height + 3) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float)); memcpy(Buffer + (Height + 4) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float)); memcpy(Buffer + (Height + 5) * Width * 4, Buffer + (Height + 2) * Width * 4, Width * 4 * sizeof(float)); GaussBlurFromBottomToTop_SSE(Buffer, Width, Height, B0, B1, B2, B3); ConvertBGRAF2BGR8U_SSE(Buffer + 3 * Width * 4, Dest, Width, Height, Stride); _mm_free(Buffer); } int main() { Mat src = imread("F:\\car.jpg"); int Height = src.rows; int Width = src.cols; unsigned char *Src = src.data; unsigned char *Dest = new unsigned char[Height * Width * 3]; int Stride = Width * 3; int Radius = 11; int64 st = cvGetTickCount(); for (int i = 0; i < 20; i++) { GaussBlur_SSE(Src, Dest, Width, Height, Stride, Radius); } double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 50; printf("%.5f\n", duration); GaussBlur_SSE(Src, Dest, Width, Height, Stride, Radius); Mat dst(Height, Width, CV_8UC3, Dest); imshow("origin", src); imshow("result", dst); imwrite("F:\\res.jpg", dst); waitKey(0); } ================================================ FILE: speed_histogram_algorithm_framework/BoxFilter.h ================================================ #pragma once #include "Core.h" #include "Utility.h" // : ʵͼ񷽿ģЧ // б: // Src: ҪԴͼݽṹ // Dest: 洦ͼݽṹ // Radius: ģİ뾶ЧΧ[1, 1000] // EdgeBehavior: ԵݵĴ0ʾظԵأ1ʹþķʽԱԵֵ // : // 1. ܴ8λҶȺ24λͼ // 2. SrcDestͬͬʱٶȻ // 3. 
SSEŻ汾ڳʼʱͰ뾶йصģڰ뾶ʱʱ΢
IS_RET BoxBlur(TMatrix *Src, TMatrix *Dest, int Radius, EdgeMode Edge) {
    if (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;
    if (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
    if (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;
    if (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
    IS_RET Ret = IS_RET_OK;
    TMatrix *Row = NULL, *Col = NULL;
    int *RowPos, *ColPos, *ColSum, *Diff;
    int X, Y, Z, Width, Height, Channel, Index;
    int Value, ValueB, ValueG, ValueR;
    int Size = 2 * Radius + 1, Amount = Size * Size, HalfAmount = Amount / 2;
    Width = Src->Width;
    Height = Src->Height;
    Channel = Src->Channel;
    Ret = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, EdgeMode::Smear, &Row, &Col); // get the coordinate offsets
    RowPos = ((int *)Row->Data);
    ColPos = ((int *)Col->Data);
    ColSum = (int *)IS_AllocMemory(Width * Channel * sizeof(int), true);
    Diff = (int *)IS_AllocMemory((Width - 1) * Channel * sizeof(int), true);
    unsigned char *RowData = (unsigned char *)IS_AllocMemory((Width + 2 * Radius) * Channel, true);
    TMatrix Sum;
    TMatrix *p = &Sum;
    TMatrix **q = &p;
    IS_CreateMatrix(Width, Height, IS_DEPTH_32S, Channel, q);
    for (Y = 0; Y < Height; Y++) {
        unsigned char *LinePS = Src->Data + Y * Src->WidthStep;
        int *LinePD = (int *)(p->Data + Y * p->WidthStep);
        // copy one row of data plus its edge padding into the temporary buffer
        if (Channel == 1) {
            for (X = 0; X < Radius; X++) RowData[X] = LinePS[RowPos[X]];
            memcpy(RowData + Radius, LinePS, Width);
            for (X = Radius + Width; X < Radius + Width + Radius; X++) RowData[X] = LinePS[RowPos[X]];
        }
        else if (Channel == 3) {
            for (X = 0; X < Radius; X++) {
                Index = RowPos[X] * 3;
                RowData[X * 3] = LinePS[Index];
                RowData[X * 3 + 1] = LinePS[Index + 1];
                RowData[X * 3 + 2] = LinePS[Index + 2];
            }
            memcpy(RowData + Radius * 3, LinePS, Width * 3);
            for (X = Radius + Width; X < Radius + Width +
Radius; X++) { Index = RowPos[X] * 3; RowData[X * 3 + 0] = LinePS[Index + 0]; RowData[X * 3 + 1] = LinePS[Index + 1]; RowData[X * 3 + 2] = LinePS[Index + 2]; } } unsigned char *AddPos = RowData + Size * Channel; unsigned char *SubPos = RowData; for (X = 0; X < (Width - 1) * Channel; X++) Diff[X] = AddPos[X] - SubPos[X]; // һҪ⴦ if (Channel == 1) { for (Z = 0, Value = 0; Z < Size; Z++) Value += RowData[Z]; LinePD[0] = Value; for (X = 1; X < Width; X++) { Value += Diff[X - 1]; LinePD[X] = Value; // ·ٶߺܶ } } else if (Channel == 3) { for (Z = 0, ValueB = ValueG = ValueR = 0; Z < Size; Z++) { ValueB += RowData[Z * 3 + 0]; ValueG += RowData[Z * 3 + 1]; ValueR += RowData[Z * 3 + 2]; } LinePD[0] = ValueB; LinePD[1] = ValueG; LinePD[2] = ValueR; for (X = 1; X < Width; X++) { Index = X * 3; ValueB += Diff[Index - 3]; LinePD[Index + 0] = ValueB; ValueG += Diff[Index - 2]; LinePD[Index + 1] = ValueG; ValueR += Diff[Index - 1]; LinePD[Index + 2] = ValueR; } } } for (Y = 0; Y < Size - 1; Y++) // עûһŶ { int *LinePS = (int *)(p->Data + ColPos[Y] * p->WidthStep); for (X = 0; X < Width * Channel; X++) ColSum[X] += LinePS[X]; } for (Y = 0; Y < Height; Y++) { unsigned char* LinePD = Dest->Data + Y * Dest->WidthStep; int *AddPos = (int*)(p->Data + ColPos[Y + Size - 1] * p->WidthStep); int *SubPos = (int*)(p->Data + ColPos[Y] * p->WidthStep); for (X = 0; X < Width * Channel; X++) { Value = ColSum[X] + AddPos[X]; LinePD[X] = (Value + HalfAmount) / Amount; // + HalfAmount ҪΪ ColSum[X] = Value - SubPos[X]; } } IS_FreeMemory(RowPos); IS_FreeMemory(ColPos); IS_FreeMemory(Diff); IS_FreeMemory(ColSum); IS_FreeMemory(RowData); return Ret; } // : ʵͼ񷽿ģЧSSEŻ IS_RET BoxBlur_SSE(TMatrix *Src, TMatrix *Dest, int Radius, EdgeMode Edge) { if (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE; if (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE; if (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || 
        Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;
    if (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
    IS_RET Ret = IS_RET_OK;
    TMatrix *Row = NULL, *Col = NULL;
    int *RowPos, *ColPos, *ColSum, *Diff;
    int X, Y, Z, Width, Height, Channel, Index;
    int Value, ValueB, ValueG, ValueR;
    int Size = 2 * Radius + 1, Amount = Size * Size, HalfAmount = Amount / 2;
    float Scale = 1.0 / (Size * Size);
    Width = Src->Width;
    Height = Src->Height;
    Channel = Src->Channel;
    Ret = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, EdgeMode::Smear, &Row, &Col); // get the coordinate offsets
    RowPos = ((int *)Row->Data);
    ColPos = ((int *)Col->Data);
    ColSum = (int *)IS_AllocMemory(Width * Channel * sizeof(int), true);
    Diff = (int *)IS_AllocMemory((Width - 1) * Channel * sizeof(int), true);
    unsigned char *RowData = (unsigned char *)IS_AllocMemory((Width + 2 * Radius) * Channel, true);
    TMatrix Sum;
    TMatrix *p = &Sum;
    TMatrix **q = &p;
    IS_CreateMatrix(Width, Height, IS_DEPTH_32S, Channel, q); // 32-bit matrix holding the horizontal window sums
    for (Y = 0; Y < Height; Y++)
    {
        unsigned char *LinePS = Src->Data + Y * Src->WidthStep;
        int *LinePD = (int *)(p->Data + Y * p->WidthStep);
        // Temporary buffer holding one row of data plus its padded edges
        if (Channel == 1)
        {
            for (X = 0; X < Radius; X++)
                RowData[X] = LinePS[RowPos[X]];
            memcpy(RowData + Radius, LinePS, Width);
            for (X = Radius + Width; X < Radius + Width + Radius; X++)
                RowData[X] = LinePS[RowPos[X]];
        }
        else if (Channel == 3)
        {
            for (X = 0; X < Radius; X++)
            {
                Index = RowPos[X] * 3;
                RowData[X * 3] = LinePS[Index];
                RowData[X * 3 + 1] = LinePS[Index + 1];
                RowData[X * 3 + 2] = LinePS[Index + 2];
            }
            memcpy(RowData + Radius * 3, LinePS, Width * 3);
            for (X = Radius + Width; X < Radius + Width + Radius; X++)
            {
                Index = RowPos[X] * 3;
                RowData[X * 3 + 0] = LinePS[Index + 0];
                RowData[X * 3 + 1] = LinePS[Index + 1];
                RowData[X * 3 + 2] = LinePS[Index + 2];
            }
        }
        unsigned char *AddPos = RowData + Size * Channel;
        unsigned char *SubPos = RowData;
        X = 0;
        __m128i Zero = _mm_setzero_si128();
        // Widen 8 bytes at a time to 32-bit integers and store entering-minus-leaving differences
        for (; X <= (Width - 1) * Channel - 8; X
+= 8)
        {
            __m128i Add = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i const *)(AddPos + X)), Zero);
            __m128i Sub = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i const *)(SubPos + X)), Zero);
            _mm_store_si128((__m128i *)(Diff + X + 0), _mm_sub_epi32(_mm_unpacklo_epi16(Add, Zero), _mm_unpacklo_epi16(Sub, Zero)));
            _mm_store_si128((__m128i *)(Diff + X + 4), _mm_sub_epi32(_mm_unpackhi_epi16(Add, Zero), _mm_unpackhi_epi16(Sub, Zero)));
        }
        for (; X < (Width - 1) * Channel; X++)
            Diff[X] = AddPos[X] - SubPos[X];
        // The first pixel of each row needs special handling: its window is summed directly
        if (Channel == 1)
        {
            for (Z = 0, Value = 0; Z < Size; Z++)
                Value += RowData[Z];
            LinePD[0] = Value;
            for (X = 1; X < Width; X++)
            {
                Value += Diff[X - 1];
                LinePD[X] = Value;
            }
        }
        else if (Channel == 3)
        {
            for (Z = 0, ValueB = ValueG = ValueR = 0; Z < Size; Z++)
            {
                ValueB += RowData[Z * 3 + 0];
                ValueG += RowData[Z * 3 + 1];
                ValueR += RowData[Z * 3 + 2];
            }
            LinePD[0] = ValueB;
            LinePD[1] = ValueG;
            LinePD[2] = ValueR;
            for (X = 1; X < Width; X++)
            {
                Index = X * 3;
                ValueB += Diff[Index - 3];
                LinePD[Index + 0] = ValueB;
                ValueG += Diff[Index - 2];
                LinePD[Index + 1] = ValueG;
                ValueR += Diff[Index - 1];
                LinePD[Index + 2] = ValueR;
            }
        }
    }
    for (Y = 0; Y < Size - 1; Y++)
    {
        X = 0;
        int *LinePS = (int *)(p->Data + ColPos[Y] * p->WidthStep);
        for (; X <= Width * Channel - 4; X += 4)
        {
            __m128i SumP = _mm_load_si128((const __m128i*)(ColSum + X));
            // Rows of p are only guaranteed to be 4-byte aligned (WidthStep is a multiple of 4), so use an unaligned load
            __m128i SrcP = _mm_loadu_si128((const __m128i*)(LinePS + X));
            _mm_store_si128((__m128i *)(ColSum + X), _mm_add_epi32(SumP, SrcP));
        }
        for (; X < Width * Channel; X++)
            ColSum[X] += LinePS[X];
    }
    for (Y = 0; Y < Height; Y++)
    {
        unsigned char *LinePD = Dest->Data + Y * Dest->WidthStep;
        int *AddPos = (int*)(p->Data + ColPos[Y + Size - 1] * p->WidthStep);
        int *SubPos = (int*)(p->Data + ColPos[Y] * p->WidthStep);
        X = 0;
        const __m128 Inv = _mm_set1_ps(Scale);
        for (; X <= Width * Channel - 8; X += 8)
        {
            __m128i Sub1 = _mm_loadu_si128((const __m128i*)(SubPos + X + 0));
            __m128i Sub2 = _mm_loadu_si128((const __m128i*)(SubPos + X + 4));
            __m128i Add1 =
                _mm_loadu_si128((const __m128i*)(AddPos + X + 0));
            __m128i Add2 = _mm_loadu_si128((const __m128i*)(AddPos + X + 4));
            __m128i Col1 = _mm_load_si128((const __m128i*)(ColSum + X + 0));
            __m128i Col2 = _mm_load_si128((const __m128i*)(ColSum + X + 4));
            __m128i Sum1 = _mm_add_epi32(Col1, Add1);
            __m128i Sum2 = _mm_add_epi32(Col2, Add2);
            // Multiply by 1 / (Size * Size) in float, then convert back with rounding
            __m128i Dest1 = _mm_cvtps_epi32(_mm_mul_ps(Inv, _mm_cvtepi32_ps(Sum1)));
            __m128i Dest2 = _mm_cvtps_epi32(_mm_mul_ps(Inv, _mm_cvtepi32_ps(Sum2)));
            Dest1 = _mm_packs_epi32(Dest1, Dest2);
            _mm_storel_epi64((__m128i *)(LinePD + X), _mm_packus_epi16(Dest1, Dest1));
            _mm_store_si128((__m128i *)(ColSum + X + 0), _mm_sub_epi32(Sum1, Sub1));
            _mm_store_si128((__m128i *)(ColSum + X + 4), _mm_sub_epi32(Sum2, Sub2));
        }
        for (; X < Width * Channel; X++)
        {
            Value = ColSum[X] + AddPos[X];
            LinePD[X] = (unsigned char)(Value * Scale + 0.5f); // round like the SIMD path instead of truncating
            ColSum[X] = Value - SubPos[X];
        }
    }
    // Release the coordinate tables and the intermediate sum matrix together with their headers
    IS_FreeMatrix(&Row);
    IS_FreeMatrix(&Col);
    IS_FreeMatrix(&p);
    IS_FreeMemory(Diff);
    IS_FreeMemory(ColSum);
    IS_FreeMemory(RowData);
    return Ret;
}

================================================
FILE: speed_histogram_algorithm_framework/Core.h
================================================
#pragma once
// Note: the original header names were lost in extraction. The five below are an
// assumption: they are the headers this code needs (memset/memcpy, floor, rand, SSE intrinsics).
#include <cmath>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <immintrin.h>
using namespace std;

#define WIDTHBYTES(bytes) (((bytes * 8) + 31) / 32 * 4)
const float Inv255 = 1.0 / 255;
const double Eps = 2.220446049250313E-16;

// Edge handling mode
enum EdgeMode
{
    Tile = 0,  // repeat the edge pixels
    Smear = 1  // mirror the edge pixels
};

enum IS_RET
{
    IS_RET_OK,                     // success
    IS_RET_ERR_OUTOFMEMORY,        // out of memory
    IS_RET_ERR_STACKOVERFLOW,      // stack overflow
    IS_RET_ERR_NULLREFERENCE,      // null reference
    IS_RET_ERR_ARGUMENTOUTOFRANGE, // argument out of range
    IS_RET_ERR_PARAMISMATCH,       // parameter mismatch
    IS_RET_ERR_DIVIDEBYZERO,
    IS_RET_ERR_INDEXOUTOFRANGE,
    IS_RET_ERR_NOTSUPPORTED,
    IS_RET_ERR_OVERFLOW,
    IS_RET_ERR_FILENOTFOUND,
    IS_RET_ERR_UNKNOWN
};

enum IS_DEPTH
{
    IS_DEPTH_8U = 0,  // unsigned char
    IS_DEPTH_8S = 1,  // char
    IS_DEPTH_16S = 2, // short
    IS_DEPTH_32S = 3, // int
    IS_DEPTH_32F = 4, // float
    IS_DEPTH_64F = 5, // double
};

struct TMatrix
{
    int Width;     // width of the matrix
    int Height;    // height of the matrix
    int WidthStep; // number of bytes occupied by one row of elements
    int
        Channel;         // number of channels
    int Depth;           // element type (one of IS_DEPTH)
    unsigned char *Data; // matrix data
    int Reserved;        // reserved
};

// Allocate memory aligned to 32 bytes, optionally zero-filled
void *IS_AllocMemory(unsigned int Size, bool ZeroMemory = true)
{
    void *Ptr = _mm_malloc(Size, 32);
    if (Ptr != NULL)
        if (ZeroMemory == true)
            memset(Ptr, 0, Size);
    return Ptr;
}

// Free memory obtained from IS_AllocMemory
void IS_FreeMemory(void *Ptr)
{
    if (Ptr != NULL) _mm_free(Ptr);
}

// Return the number of bytes one element occupies for the given element type
int IS_ELEMENT_SIZE(int Depth)
{
    int Size;
    switch (Depth)
    {
    case IS_DEPTH_8U:
        Size = sizeof(unsigned char);
        break;
    case IS_DEPTH_8S:
        Size = sizeof(char);
        break;
    case IS_DEPTH_16S:
        Size = sizeof(short);
        break;
    case IS_DEPTH_32S:
        Size = sizeof(int);
        break;
    case IS_DEPTH_32F:
        Size = sizeof(float);
        break;
    case IS_DEPTH_64F:
        Size = sizeof(double);
        break;
    default:
        Size = 0;
        break;
    }
    return Size;
}

// Create a new matrix
IS_RET IS_CreateMatrix(int Width, int Height, int Depth, int Channel, TMatrix **Matrix)
{
    if (Width < 1 || Height < 1) return IS_RET_ERR_ARGUMENTOUTOFRANGE; // argument out of range
    if (Depth != IS_DEPTH_8U && Depth != IS_DEPTH_8S && Depth != IS_DEPTH_16S && Depth != IS_DEPTH_32S && Depth != IS_DEPTH_32F && Depth != IS_DEPTH_64F) return IS_RET_ERR_ARGUMENTOUTOFRANGE; // argument out of range
    if (Channel != 1 && Channel != 2 && Channel != 3 && Channel != 4) return IS_RET_ERR_ARGUMENTOUTOFRANGE;
    *Matrix = (TMatrix *)IS_AllocMemory(sizeof(TMatrix));
    (*Matrix)->Width = Width;
    (*Matrix)->Height = Height;
    (*Matrix)->Depth = Depth;
    (*Matrix)->Channel = Channel;
    (*Matrix)->WidthStep = WIDTHBYTES(Width * Channel * IS_ELEMENT_SIZE(Depth));
    (*Matrix)->Data = (unsigned char*)IS_AllocMemory((*Matrix)->Height * (*Matrix)->WidthStep, true);
    if ((*Matrix)->Data == NULL)
    {
        IS_FreeMemory(*Matrix);
        return IS_RET_ERR_OUTOFMEMORY; // out of memory
    }
    (*Matrix)->Reserved = 0;
    return IS_RET_OK;
}

// Free a matrix created by IS_CreateMatrix
IS_RET IS_FreeMatrix(TMatrix **Matrix)
{
    if ((*Matrix) == NULL) return IS_RET_ERR_NULLREFERENCE; // null reference
    if ((*Matrix)->Data == NULL)
    {
        IS_FreeMemory((*Matrix));
        return IS_RET_ERR_OUTOFMEMORY;
    }
    else
    {
        IS_FreeMemory((*Matrix)->Data);
        IS_FreeMemory((*Matrix));
        return IS_RET_OK;
    }
}

// Clone an existing matrix
IS_RET IS_CloneMatrix(TMatrix
                      *Src, TMatrix **Dest)
{
    if (Src == NULL) return IS_RET_ERR_NULLREFERENCE;
    if (Src->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
    IS_RET Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, Src->Channel, Dest);
    if (Ret == IS_RET_OK) memcpy((*Dest)->Data, Src->Data, (*Dest)->Height * (*Dest)->WidthStep);
    return Ret;
}

================================================
FILE: speed_histogram_algorithm_framework/MaxFilter.h
================================================
#pragma once
#include "Core.h"
#include "Utility.h"

// Function: within the given radius, the Maximum filter replaces the brightness of the
//           current pixel with the highest brightness among the surrounding pixels.
// Parameters:
//   Src:    data structure of the source image to process
//   Dest:   data structure that receives the processed image
//   Radius: radius, valid range 0 to 126
// Notes:
//   1. The run time is largely independent of the radius but depends on the image content.
//   2. Src and Dest may be the same; execution is faster when they differ (no clone is needed).
//   3. Anisotropic images are processed quickly; images with large areas of identical pixels are somewhat slower.
IS_RET MaxFilter(TMatrix *Src, TMatrix *Dest, int Radius)
{
    if (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;
    if (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
    if (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;
    if (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
    if (Radius < 0 || Radius >= 127) return IS_RET_ERR_ARGUMENTOUTOFRANGE;
    IS_RET Ret = IS_RET_OK;
    if (Src->Data == Dest->Data)
    {
        TMatrix *Clone = NULL;
        Ret = IS_CloneMatrix(Src, &Clone);
        if (Ret != IS_RET_OK) return Ret;
        Ret = MaxFilter(Clone, Dest, Radius);
        IS_FreeMatrix(&Clone);
        return Ret;
    }
    if (Src->Channel == 1)
    {
        TMatrix *Row = NULL, *Col = NULL;
        unsigned char *LinePS, *LinePD;
        int X, Y, K, Width = Src->Width, Height = Src->Height;
        int *RowOffset, *ColOffSet;
        unsigned short *ColHist = (unsigned short *)IS_AllocMemory(256 * (Width + 2 * Radius) * sizeof(unsigned short), true);
        if (ColHist == NULL)
        {
            Ret = IS_RET_ERR_OUTOFMEMORY;
            goto Done8;
        }
        unsigned short *Hist = (unsigned short *)IS_AllocMemory(256 * sizeof(unsigned short), true);
        if
           (Hist == NULL)
        {
            Ret = IS_RET_ERR_OUTOFMEMORY;
            goto Done8;
        }
        Ret = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, EdgeMode::Smear, &Row, &Col); // get the coordinate offsets
        if (Ret != IS_RET_OK) goto Done8;
        ColHist += Radius * 256;
        RowOffset = ((int *)Row->Data) + Radius;
        ColOffSet = ((int *)Col->Data) + Radius; // shift the pointers so that indices starting at -Radius are valid
        for (Y = 0; Y < Height; Y++)
        {
            if (Y == 0) // column histograms for the first row are built from scratch
            {
                for (K = -Radius; K <= Radius; K++)
                {
                    LinePS = Src->Data + ColOffSet[K] * Src->WidthStep;
                    for (X = -Radius; X < Width + Radius; X++)
                    {
                        ColHist[X * 256 + LinePS[RowOffset[X]]]++;
                    }
                }
            }
            else // column histograms for the other rows only need updating
            {
                LinePS = Src->Data + ColOffSet[Y - Radius - 1] * Src->WidthStep;
                for (X = -Radius; X < Width + Radius; X++) // remove the row that has left the window
                {
                    ColHist[X * 256 + LinePS[RowOffset[X]]]--;
                }
                LinePS = Src->Data + ColOffSet[Y + Radius] * Src->WidthStep;
                for (X = -Radius; X < Width + Radius; X++) // add the row that has entered the window
                {
                    ColHist[X * 256 + LinePS[RowOffset[X]]]++;
                }
            }
            memset(Hist, 0, 256 * sizeof(unsigned short)); // clear the window histogram at the start of every row
            LinePD = Dest->Data + Y * Dest->WidthStep;
            for (X = 0; X < Width; X++)
            {
                if (X == 0)
                {
                    for (K = -Radius; K <= Radius; K++) // first pixel of the row: build the window histogram from scratch
                        HistgramAddShort(ColHist + K * 256, Hist);
                }
                else
                {
                    /* HistgramAddShort(ColHist + RowOffset[X + Radius] * 256, Hist);
                       HistgramSubShort(ColHist + RowOffset[X - Radius - 1] * 256, Hist); */
                    // other pixels of the row: subtract the leaving column and add the entering one
                    HistgramSubAddShort(ColHist + RowOffset[X - Radius - 1] * 256, ColHist + RowOffset[X + Radius] * 256, Hist);
                }
                for (K = 255; K >= 0; K--) // the highest non-empty bin is the window maximum
                {
                    if (Hist[K] != 0)
                    {
                        LinePD[X] = K;
                        break;
                    }
                }
            }
        }
        ColHist -= Radius * 256; // undo the pointer offset
    Done8:
        IS_FreeMatrix(&Row);
        IS_FreeMatrix(&Col);
        IS_FreeMemory(ColHist);
        IS_FreeMemory(Hist);
        return Ret;
    }
    else
    {
        // Initialize to NULL: uninitialized C variables hold indeterminate values, which could cause errors when freeing.
        TMatrix *Blue = NULL, *Green = NULL, *Red = NULL, *Alpha = NULL;
        IS_RET Ret = SplitRGBA(Src, &Blue, &Green, &Red, &Alpha);
        if (Ret != IS_RET_OK) goto Done24;
        Ret = MaxFilter(Blue, Blue, Radius);
        if (Ret != IS_RET_OK) goto Done24;
        Ret = MaxFilter(Green, Green,
                        Radius);
        if (Ret != IS_RET_OK) goto Done24;
        Ret = MaxFilter(Red, Red, Radius);
        if (Ret != IS_RET_OK) goto Done24;
        // The alpha channel of 32-bit images is left untouched; in practice, algorithms for
        // 32-bit images generally cannot be applied channel by channel.
        CopyAlphaChannel(Src, Dest);
        Ret = CombineRGBA(Dest, Blue, Green, Red, Alpha);
    Done24:
        IS_FreeMatrix(&Blue);
        IS_FreeMatrix(&Green);
        IS_FreeMatrix(&Red);
        IS_FreeMatrix(&Alpha);
        return Ret;
    }
}

================================================
FILE: speed_histogram_algorithm_framework/SelectiveBlur.h
================================================
#pragma once
#include "Core.h"
#include "Utility.h"

// Replace *Pixel with the histogram-weighted mean of all intensities within
// [Intensity - Threshold, Intensity + Threshold], clamped to [0, 255].
void Calc(unsigned short *Hist, int Intensity, unsigned char *&Pixel, int Threshold)
{
    int K, Low, High, Sum = 0, Weight = 0;
    Low = Intensity - Threshold;
    High = Intensity + Threshold;
    if (Low < 0) Low = 0;
    if (High > 255) High = 255;
    for (K = Low; K <= High; K++)
    {
        Weight += Hist[K];
        Sum += Hist[K] * K;
    }
    if (Weight != 0) *Pixel = Sum / Weight;
}

// Function: selective blur of an image within the given radius.
// Parameters:
//   Src:       data structure of the source image to process
//   Dest:      data structure that receives the processed image
//   Radius:    radius, valid range 0 to 126
//   Threshold: intensity threshold, valid range 2 to 255
//   Edge:      edge handling mode
// Notes:
//   1. The run time is largely independent of the radius but depends on the image content.
//   2. Src and Dest may be the same; execution is faster when they differ (no clone is needed).
//   3. Anisotropic images are processed quickly; images with large areas of identical pixels are somewhat slower.
IS_RET SelectiveBlur(TMatrix *Src, TMatrix *Dest, int Radius, int Threshold, EdgeMode Edge)
{
    if (Src == NULL || Dest == NULL) return IS_RET_ERR_NULLREFERENCE;
    if (Src->Data == NULL || Dest->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
    if (Src->Width != Dest->Width || Src->Height != Dest->Height || Src->Channel != Dest->Channel || Src->Depth != Dest->Depth || Src->WidthStep != Dest->WidthStep) return IS_RET_ERR_PARAMISMATCH;
    if (Src->Depth != IS_DEPTH_8U || Dest->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
    if (Radius < 0 || Radius >= 127 || Threshold < 2 || Threshold > 255) return IS_RET_ERR_ARGUMENTOUTOFRANGE;
    IS_RET Ret = IS_RET_OK;
    if (Src->Data == Dest->Data)
    {
        TMatrix *Clone = NULL;
        Ret = IS_CloneMatrix(Src, &Clone);
        if (Ret != IS_RET_OK) return Ret;
        Ret = SelectiveBlur(Clone, Dest, Radius, Threshold, Edge);
        IS_FreeMatrix(&Clone);
        return Ret;
    }
    if
       (Src->Channel == 1)
    {
        TMatrix *Row = NULL, *Col = NULL;
        unsigned char *LinePS, *LinePD;
        int X, Y, K, Width = Src->Width, Height = Src->Height;
        int *RowOffset, *ColOffSet;
        unsigned short *ColHist = (unsigned short *)IS_AllocMemory(256 * (Width + 2 * Radius) * sizeof(unsigned short), true);
        if (ColHist == NULL)
        {
            Ret = IS_RET_ERR_OUTOFMEMORY;
            goto Done8;
        }
        unsigned short *Hist = (unsigned short *)IS_AllocMemory(256 * sizeof(unsigned short), true);
        if (Hist == NULL)
        {
            Ret = IS_RET_ERR_OUTOFMEMORY;
            goto Done8;
        }
        Ret = GetValidCoordinate(Width, Height, Radius, Radius, Radius, Radius, Edge, &Row, &Col); // get the coordinate offsets
        if (Ret != IS_RET_OK) goto Done8;
        ColHist += Radius * 256;
        RowOffset = ((int *)Row->Data) + Radius;
        ColOffSet = ((int *)Col->Data) + Radius; // shift the pointers so that indices starting at -Radius are valid
        for (Y = 0; Y < Height; Y++)
        {
            if (Y == 0) // column histograms for the first row are built from scratch
            {
                for (K = -Radius; K <= Radius; K++)
                {
                    LinePS = Src->Data + ColOffSet[K] * Src->WidthStep;
                    for (X = -Radius; X < Width + Radius; X++)
                    {
                        ColHist[X * 256 + LinePS[RowOffset[X]]]++;
                    }
                }
            }
            else // column histograms for the other rows only need updating
            {
                LinePS = Src->Data + ColOffSet[Y - Radius - 1] * Src->WidthStep;
                for (X = -Radius; X < Width + Radius; X++) // remove the row that has left the window
                {
                    ColHist[X * 256 + LinePS[RowOffset[X]]]--;
                }
                LinePS = Src->Data + ColOffSet[Y + Radius] * Src->WidthStep;
                for (X = -Radius; X < Width + Radius; X++) // add the row that has entered the window
                {
                    ColHist[X * 256 + LinePS[RowOffset[X]]]++;
                }
            }
            memset(Hist, 0, 256 * sizeof(unsigned short)); // clear the window histogram at the start of every row
            LinePS = Src->Data + Y * Src->WidthStep;
            LinePD = Dest->Data + Y * Dest->WidthStep;
            for (X = 0; X < Width; X++)
            {
                if (X == 0)
                {
                    for (K = -Radius; K <= Radius; K++) // first pixel of the row: build the window histogram from scratch
                        HistgramAddShort(ColHist + K * 256, Hist);
                }
                else
                {
                    /* HistgramAddShort(ColHist + RowOffset[X + Radius] * 256, Hist);
                       HistgramSubShort(ColHist + RowOffset[X - Radius - 1] * 256, Hist); */
                    // other pixels of the row: subtract the leaving column and add the entering one
                    HistgramSubAddShort(ColHist + RowOffset[X - Radius - 1] * 256, ColHist + RowOffset[X + Radius] * 256, Hist);
                }
                Calc(Hist, LinePS[0], LinePD, Threshold);
                LinePS++;
                LinePD++;
            }
        }
        ColHist -= Radius * 256; // undo the pointer offset
    Done8:
        IS_FreeMatrix(&Row);
        IS_FreeMatrix(&Col);
        IS_FreeMemory(ColHist);
        IS_FreeMemory(Hist);
        return Ret;
    }
    else
    {
        // Initialize to NULL: uninitialized C variables hold indeterminate values, which could cause errors when freeing.
        TMatrix *Blue = NULL, *Green = NULL, *Red = NULL, *Alpha = NULL;
        IS_RET Ret = SplitRGBA(Src, &Blue, &Green, &Red, &Alpha);
        if (Ret != IS_RET_OK) goto Done24;
        Ret = SelectiveBlur(Blue, Blue, Radius, Threshold, Edge);
        if (Ret != IS_RET_OK) goto Done24;
        Ret = SelectiveBlur(Green, Green, Radius, Threshold, Edge);
        if (Ret != IS_RET_OK) goto Done24;
        Ret = SelectiveBlur(Red, Red, Radius, Threshold, Edge);
        if (Ret != IS_RET_OK) goto Done24;
        // The alpha channel of 32-bit images is left untouched; in practice, algorithms for
        // 32-bit images generally cannot be applied channel by channel.
        Ret = CombineRGBA(Dest, Blue, Green, Red, Alpha);
    Done24:
        IS_FreeMatrix(&Blue);
        IS_FreeMatrix(&Green);
        IS_FreeMatrix(&Red);
        IS_FreeMatrix(&Alpha);
        return Ret;
    }
}

================================================
FILE: speed_histogram_algorithm_framework/Utility.h
================================================
#pragma once
#include "Core.h"

// Gives access to the two 32-bit halves of a double (assumes a little-endian IEEE-754 layout)
union Approximation
{
    double Value;
    int X[2];
};

// 1: Clamp a value to the byte range [0, 255].
// Reference: http://www.cnblogs.com/zyl910/archive/2012/03/12/noifopex1.html
// Summary: branch-free, using masks produced by arithmetic right shifts.
unsigned char ClampToByte(int Value)
{
    return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));
}

// 2: Clamp a value to the given range [Min, Max].
int ClampToInt(int Value, int Min, int Max)
{
    if (Value < Min)
        return Min;
    else if (Value > Max)
        return Max;
    else
        return Value;
}

// 3: Integer division by 255.
// Summary: implemented with shifts.
int Div255(int Value)
{
    return (((Value >> 8) + Value + 1) >> 8);
}

// 4: Absolute value.
// Reference: https://oi-wiki.org/math/bit/
// Summary: equivalent to n > 0 ? n : -n
int Abs(int n)
{
    return (n ^ (n >> 31)) - (n >> 31);
    /* n >> 31 extracts the sign of n: it is 0 when n is non-negative and -1 when n is negative.
       For non-negative n, n ^ 0 = n, so the value is unchanged.
       For negative n, n ^ -1 flips every bit of the two's-complement representation,
       which gives |n| - 1; subtracting -1 then yields the absolute value. */
}

// 5: Round to the nearest integer (halves are rounded away from zero).
double Round(double V)
{
    return (V > 0.0) ?
                       floor(V + 0.5) : ceil(V - 0.5); // ceil() replaces the original recursive Round() call, which never terminated for V <= 0
}

// 6: Random number in [0, 1).
double Rand()
{
    return (double)rand() / (RAND_MAX + 1.0);
}

// 7: Approximate Pow, for double and float.
// Reference: http://www.cvchina.info/2010/03/19/log-pow-exp-approximation/
// Reference: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/
// Summary: only a fast approximation, with an error of roughly 5%-12%.
double Pow(double X, double Y)
{
    Approximation V = { X };
    V.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);
    V.X[0] = 0;
    return V.Value;
}

float Pow(float X, float Y)
{
    Approximation V = { X };
    V.X[1] = (int)(Y * (V.X[1] - 1072632447) + 1072632447);
    V.X[0] = 0;
    return (float)V.Value;
}

// 8: Approximate Exp, for double and float.
double Exp(double Y) // this approach is somewhat faster
{
    Approximation V;
    V.X[1] = (int)(Y * 1485963 + 1072632447);
    V.X[0] = 0;
    return V.Value;
}

float Exp(float Y) // this approach is somewhat faster
{
    Approximation V;
    V.X[1] = (int)(Y * 1485963 + 1072632447);
    V.X[0] = 0;
    return (float)V.Value;
}

// 9: A more accurate approximation of Pow, at the cost of some speed.
// http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/
// Besides that, I also have now a slower approximation that has much less error
// when the exponent is larger than 1.
// It makes use of exponentiation by squaring,
// which is exact for the integer part of the exponent, and uses only the exponent's fraction for the approximation:
// should be much more precise with large Y
double PrecisePow(double X, double Y)
{
    // calculate approximation with fraction of the exponent
    int e = (int)Y;
    Approximation V = { X };
    V.X[1] = (int)((Y - e) * (V.X[1] - 1072632447) + 1072632447);
    V.X[0] = 0;
    // exponentiation by squaring with the exponent's integer part
    // double r = u.d makes everything much slower, not sure why
    double r = 1.0;
    while (e)
    {
        if (e & 1) r *= X;
        X *= X;
        e >>= 1;
    }
    return r * V.Value;
}

// 10: Random integer between Min and Max.
// Summary: Min is the smallest value, Max is the largest value.
int Random(int Min, int Max)
{
    return rand() % (Max + 1 - Min) + Min;
}

// 11: Sign function.
int sgn(int X)
{
    if (X > 0) return 1;
    if (X < 0) return -1;
    return 0;
}

// 12: Extract the color components stored in an integer.
void GetRGB(int Color, int *R, int *G, int *B)
{
    *R = Color & 255;
    *G = (Color & 65280) / 256;
    *B = (Color & 16711680) / 65536;
}

// 13: Approximate square root using Newton's method.
// Reference: https://www.cnblogs.com/qlky/p/7735145.html
// Summary: still an approximation; it computes the inverse square root and returns its reciprocal.
float Sqrt(float X)
{
    float HalfX = 0.5f * X;           // not valid for double
    int I = *(int*)&X;                // get bits for floating VALUE
    I = 0x5f375a86 - (I >> 1);        // gives initial guess y0
    X = *(float*)&I;                  // convert bits BACK to float
    X = X * (1.5f - HalfX * X * X);   // Newton step, repeating increases accuracy
    X = X * (1.5f - HalfX * X * X);   // Newton step, repeating increases accuracy
    X = X * (1.5f - HalfX * X * X);   // Newton step, repeating increases accuracy
    return 1 / X;
}

// 14: Add two unsigned short histograms of 256 bins, Y = X + Y.
// Summary: SSE-optimized; X and Y must be 16-byte aligned.
void HistgramAddShort(unsigned short *X, unsigned short *Y)
{
    *(__m128i*)(Y + 0) = _mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]); // do not expect hand-written assembly to beat this; it has already been tried
    *(__m128i*)(Y + 8) = _mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);
    *(__m128i*)(Y + 16) = _mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);
    *(__m128i*)(Y + 24) = _mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);
    *(__m128i*)(Y + 32) = _mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);
    *(__m128i*)(Y +
40) = _mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]); *(__m128i*)(Y + 48) = _mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]); *(__m128i*)(Y + 56) = _mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]); *(__m128i*)(Y + 64) = _mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]); *(__m128i*)(Y + 72) = _mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]); *(__m128i*)(Y + 80) = _mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]); *(__m128i*)(Y + 88) = _mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]); *(__m128i*)(Y + 96) = _mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]); *(__m128i*)(Y + 104) = _mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]); *(__m128i*)(Y + 112) = _mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]); *(__m128i*)(Y + 120) = _mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]); *(__m128i*)(Y + 128) = _mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]); *(__m128i*)(Y + 136) = _mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]); *(__m128i*)(Y + 144) = _mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]); *(__m128i*)(Y + 152) = _mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]); *(__m128i*)(Y + 160) = _mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]); *(__m128i*)(Y + 168) = _mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]); *(__m128i*)(Y + 176) = _mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]); *(__m128i*)(Y + 184) = _mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&X[184]); *(__m128i*)(Y + 192) = _mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]); *(__m128i*)(Y + 200) = _mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]); *(__m128i*)(Y + 208) = _mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]); *(__m128i*)(Y + 216) = _mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]); *(__m128i*)(Y + 224) = _mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]); *(__m128i*)(Y + 232) = _mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]); *(__m128i*)(Y + 240) = _mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]); 
    *(__m128i*)(Y + 248) = _mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);
}

// 15: Subtract two unsigned short histograms of 256 bins, Y = Y - X.
// Summary: SSE-optimized; X and Y must be 16-byte aligned.
void HistgramSubShort(unsigned short *X, unsigned short *Y)
{
    *(__m128i*)(Y + 0) = _mm_sub_epi16(*(__m128i*)&Y[0], *(__m128i*)&X[0]);
    *(__m128i*)(Y + 8) = _mm_sub_epi16(*(__m128i*)&Y[8], *(__m128i*)&X[8]);
    *(__m128i*)(Y + 16) = _mm_sub_epi16(*(__m128i*)&Y[16], *(__m128i*)&X[16]);
    *(__m128i*)(Y + 24) = _mm_sub_epi16(*(__m128i*)&Y[24], *(__m128i*)&X[24]);
    *(__m128i*)(Y + 32) = _mm_sub_epi16(*(__m128i*)&Y[32], *(__m128i*)&X[32]);
    *(__m128i*)(Y + 40) = _mm_sub_epi16(*(__m128i*)&Y[40], *(__m128i*)&X[40]);
    *(__m128i*)(Y + 48) = _mm_sub_epi16(*(__m128i*)&Y[48], *(__m128i*)&X[48]);
    *(__m128i*)(Y + 56) = _mm_sub_epi16(*(__m128i*)&Y[56], *(__m128i*)&X[56]);
    *(__m128i*)(Y + 64) = _mm_sub_epi16(*(__m128i*)&Y[64], *(__m128i*)&X[64]);
    *(__m128i*)(Y + 72) = _mm_sub_epi16(*(__m128i*)&Y[72], *(__m128i*)&X[72]);
    *(__m128i*)(Y + 80) = _mm_sub_epi16(*(__m128i*)&Y[80], *(__m128i*)&X[80]);
    *(__m128i*)(Y + 88) = _mm_sub_epi16(*(__m128i*)&Y[88], *(__m128i*)&X[88]);
    *(__m128i*)(Y + 96) = _mm_sub_epi16(*(__m128i*)&Y[96], *(__m128i*)&X[96]);
    *(__m128i*)(Y + 104) = _mm_sub_epi16(*(__m128i*)&Y[104], *(__m128i*)&X[104]);
    *(__m128i*)(Y + 112) = _mm_sub_epi16(*(__m128i*)&Y[112], *(__m128i*)&X[112]);
    *(__m128i*)(Y + 120) = _mm_sub_epi16(*(__m128i*)&Y[120], *(__m128i*)&X[120]);
    *(__m128i*)(Y + 128) = _mm_sub_epi16(*(__m128i*)&Y[128], *(__m128i*)&X[128]);
    *(__m128i*)(Y + 136) = _mm_sub_epi16(*(__m128i*)&Y[136], *(__m128i*)&X[136]);
    *(__m128i*)(Y + 144) = _mm_sub_epi16(*(__m128i*)&Y[144], *(__m128i*)&X[144]);
    *(__m128i*)(Y + 152) = _mm_sub_epi16(*(__m128i*)&Y[152], *(__m128i*)&X[152]);
    *(__m128i*)(Y + 160) = _mm_sub_epi16(*(__m128i*)&Y[160], *(__m128i*)&X[160]);
    *(__m128i*)(Y + 168) = _mm_sub_epi16(*(__m128i*)&Y[168], *(__m128i*)&X[168]);
    *(__m128i*)(Y + 176) = _mm_sub_epi16(*(__m128i*)&Y[176], *(__m128i*)&X[176]);
    *(__m128i*)(Y + 184) = _mm_sub_epi16(*(__m128i*)&Y[184],
                                         *(__m128i*)&X[184]);
    *(__m128i*)(Y + 192) = _mm_sub_epi16(*(__m128i*)&Y[192], *(__m128i*)&X[192]);
    *(__m128i*)(Y + 200) = _mm_sub_epi16(*(__m128i*)&Y[200], *(__m128i*)&X[200]);
    *(__m128i*)(Y + 208) = _mm_sub_epi16(*(__m128i*)&Y[208], *(__m128i*)&X[208]);
    *(__m128i*)(Y + 216) = _mm_sub_epi16(*(__m128i*)&Y[216], *(__m128i*)&X[216]);
    *(__m128i*)(Y + 224) = _mm_sub_epi16(*(__m128i*)&Y[224], *(__m128i*)&X[224]);
    *(__m128i*)(Y + 232) = _mm_sub_epi16(*(__m128i*)&Y[232], *(__m128i*)&X[232]);
    *(__m128i*)(Y + 240) = _mm_sub_epi16(*(__m128i*)&Y[240], *(__m128i*)&X[240]);
    *(__m128i*)(Y + 248) = _mm_sub_epi16(*(__m128i*)&Y[248], *(__m128i*)&X[248]);
}

// 16: Combined add and subtract of unsigned short histograms of 256 bins, Z = Z + Y - X.
// Summary: SSE-optimized; X, Y and Z must be 16-byte aligned.
void HistgramSubAddShort(unsigned short *X, unsigned short *Y, unsigned short *Z)
{
    *(__m128i*)(Z + 0) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[0], *(__m128i*)&Z[0]), *(__m128i*)&X[0]); // do not expect hand-written assembly to beat this; it has already been tried
    *(__m128i*)(Z + 8) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[8], *(__m128i*)&Z[8]), *(__m128i*)&X[8]);
    *(__m128i*)(Z + 16) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[16], *(__m128i*)&Z[16]), *(__m128i*)&X[16]);
    *(__m128i*)(Z + 24) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[24], *(__m128i*)&Z[24]), *(__m128i*)&X[24]);
    *(__m128i*)(Z + 32) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[32], *(__m128i*)&Z[32]), *(__m128i*)&X[32]);
    *(__m128i*)(Z + 40) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[40], *(__m128i*)&Z[40]), *(__m128i*)&X[40]);
    *(__m128i*)(Z + 48) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[48], *(__m128i*)&Z[48]), *(__m128i*)&X[48]);
    *(__m128i*)(Z + 56) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[56], *(__m128i*)&Z[56]), *(__m128i*)&X[56]);
    *(__m128i*)(Z + 64) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[64], *(__m128i*)&Z[64]), *(__m128i*)&X[64]);
    *(__m128i*)(Z + 72) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[72], *(__m128i*)&Z[72]), *(__m128i*)&X[72]);
    *(__m128i*)(Z + 80) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[80], *(__m128i*)&Z[80]), *(__m128i*)&X[80]);
    *(__m128i*)(Z + 88) =
_mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[88], *(__m128i*)&Z[88]), *(__m128i*)&X[88]); *(__m128i*)(Z + 96) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[96], *(__m128i*)&Z[96]), *(__m128i*)&X[96]); *(__m128i*)(Z + 104) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[104], *(__m128i*)&Z[104]), *(__m128i*)&X[104]); *(__m128i*)(Z + 112) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[112], *(__m128i*)&Z[112]), *(__m128i*)&X[112]); *(__m128i*)(Z + 120) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[120], *(__m128i*)&Z[120]), *(__m128i*)&X[120]); *(__m128i*)(Z + 128) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[128], *(__m128i*)&Z[128]), *(__m128i*)&X[128]); *(__m128i*)(Z + 136) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[136], *(__m128i*)&Z[136]), *(__m128i*)&X[136]); *(__m128i*)(Z + 144) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[144], *(__m128i*)&Z[144]), *(__m128i*)&X[144]); *(__m128i*)(Z + 152) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[152], *(__m128i*)&Z[152]), *(__m128i*)&X[152]); *(__m128i*)(Z + 160) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[160], *(__m128i*)&Z[160]), *(__m128i*)&X[160]); *(__m128i*)(Z + 168) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[168], *(__m128i*)&Z[168]), *(__m128i*)&X[168]); *(__m128i*)(Z + 176) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[176], *(__m128i*)&Z[176]), *(__m128i*)&X[176]); *(__m128i*)(Z + 184) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[184], *(__m128i*)&Z[184]), *(__m128i*)&X[184]); *(__m128i*)(Z + 192) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[192], *(__m128i*)&Z[192]), *(__m128i*)&X[192]); *(__m128i*)(Z + 200) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[200], *(__m128i*)&Z[200]), *(__m128i*)&X[200]); *(__m128i*)(Z + 208) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[208], *(__m128i*)&Z[208]), *(__m128i*)&X[208]); *(__m128i*)(Z + 216) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[216], *(__m128i*)&Z[216]), *(__m128i*)&X[216]); *(__m128i*)(Z + 224) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[224], *(__m128i*)&Z[224]), 
                                                                                                *(__m128i*)&X[224]);
    *(__m128i*)(Z + 232) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[232], *(__m128i*)&Z[232]), *(__m128i*)&X[232]);
    *(__m128i*)(Z + 240) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[240], *(__m128i*)&Z[240]), *(__m128i*)&X[240]);
    *(__m128i*)(Z + 248) = _mm_sub_epi16(_mm_add_epi16(*(__m128i*)&Y[248], *(__m128i*)&Z[248]), *(__m128i*)&X[248]);
}

// 17: Copy the alpha channel from Src to Dest.
// Summary: plain scalar code; the speed is already good.
void CopyAlphaChannel(TMatrix *Src, TMatrix *Dest)
{
    if (Src->Channel != 4 || Dest->Channel != 4) return;
    if (Src->Data == Dest->Data) return;
    unsigned char *SrcP = Src->Data, *DestP = Dest->Data;
    int Y, Index = 3;
    for (Y = 0; Y < Src->Width * Src->Height; Y++, Index += 4)
    {
        DestP[Index] = SrcP[Index]; // the original assigned in the opposite direction, overwriting Src
    }
}

// 18: Compute the valid source coordinate of every position after extending the image
//     according to the given edge mode.
// Parameters:
//   Width:  width of the matrix
//   Height: height of the matrix
//   Left:   number of positions to extend on the left
//   Right:  number of positions to extend on the right
//   Top:    number of positions to extend at the top
//   Bottom: number of positions to extend at the bottom
//   Edge:   edge handling mode
//   Row:    receives the valid coordinates in the row direction
//   Col:    receives the valid coordinates in the column direction
// Returns whether the call succeeded.
IS_RET GetValidCoordinate(int Width, int Height, int Left, int Right, int Top, int Bottom, EdgeMode Edge, TMatrix **Row, TMatrix **Col)
{
    if ((Left < 0) || (Right < 0) || (Top < 0) || (Bottom < 0)) return IS_RET_ERR_ARGUMENTOUTOFRANGE;
    IS_RET Ret = IS_CreateMatrix(Width + Left + Right, 1, IS_DEPTH_32S, 1, Row);
    if (Ret != IS_RET_OK) return Ret;
    Ret = IS_CreateMatrix(1, Height + Top + Bottom, IS_DEPTH_32S, 1, Col);
    if (Ret != IS_RET_OK) return Ret;
    int X, Y, XX, YY, *RowPos = (int *)(*Row)->Data, *ColPos = (int *)(*Col)->Data;
    for (X = -Left; X < Width + Right; X++)
    {
        if (X < 0)
        {
            if (Edge == EdgeMode::Tile) // repeat the edge pixel
                RowPos[X + Left] = 0;
            else
            {
                XX = -X;
                while (XX >= Width) XX -= Width; // mirror the data
                RowPos[X + Left] = XX;
            }
        }
        else if (X >= Width)
        {
            if (Edge == EdgeMode::Tile)
                RowPos[X + Left] = Width - 1;
            else
            {
                XX = Width - (X - Width + 2);
                while (XX < 0) XX += Width;
                RowPos[X + Left] = XX;
            }
        }
        else
        {
            RowPos[X + Left] = X;
        }
    }
    for (Y = -Top; Y < Height + Bottom; Y++)
    {
        if (Y < 0)
        {
            if (Edge == EdgeMode::Tile)
                ColPos[Y + Top] = 0;
            else
            {
                YY = -Y;
                while (YY >= Height) YY -= Height;
                ColPos[Y + Top] = YY;
            }
        }
        else if (Y >= Height)
        {
            if (Edge ==
                        EdgeMode::Tile)
                ColPos[Y + Top] = Height - 1;
            else
            {
                YY = Height - (Y - Height + 2);
                while (YY < 0) YY += Height;
                ColPos[Y + Top] = YY;
            }
        }
        else
        {
            ColPos[Y + Top] = Y;
        }
    }
    return IS_RET_OK;
}

// 19: Split a color image into separate B, G, R and A single-channel images.
// Parameters:
//   Src:   data structure of the source image to process
//   Blue:  receives the data structure of the blue-channel image
//   Green: receives the data structure of the green-channel image
//   Red:   receives the data structure of the red-channel image
//   Alpha: receives the data structure of the alpha-channel image
// Processing 8 pixels per iteration gives roughly a 20% speedup.
// Returns whether the call succeeded.
IS_RET SplitRGBA(TMatrix *Src, TMatrix **Blue, TMatrix **Green, TMatrix **Red, TMatrix **Alpha)
{
    if (Src == NULL) return IS_RET_ERR_NULLREFERENCE;
    if (Src->Data == NULL) return IS_RET_ERR_NULLREFERENCE;
    if (Src->Depth != IS_DEPTH_8U) return IS_RET_ERR_NOTSUPPORTED;
    IS_RET Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Blue);
    if (Ret != IS_RET_OK) goto Done;
    Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Green);
    if (Ret != IS_RET_OK) goto Done;
    Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Red);
    if (Ret != IS_RET_OK) goto Done;
    if (Src->Channel == 4)
    {
        Ret = IS_CreateMatrix(Src->Width, Src->Height, Src->Depth, 1, Alpha);
        if (Ret != IS_RET_OK) goto Done;
    }
    int X, Y, Block, Width = Src->Width, Height = Src->Height;
    unsigned char *LinePS, *LinePB, *LinePG, *LinePR, *LinePA;
    const int BlockSize = 8;
    Block = Width / BlockSize; // 8-way unrolling; unrolling further brings no additional speedup
    if (Src->Channel == 3)
    {
        for (Y = 0; Y < Height; Y++)
        {
            LinePS = Src->Data + Y * Src->WidthStep;
            LinePB = (*Blue)->Data + Y * (*Blue)->WidthStep;
            LinePG = (*Green)->Data + Y * (*Green)->WidthStep;
            LinePR = (*Red)->Data + Y * (*Red)->WidthStep;
            for (X = 0; X < Block * BlockSize; X += BlockSize) // grouping all LinePB writes together is actually somewhat slower
            {
                LinePB[0] = LinePS[0]; LinePG[0] = LinePS[1]; LinePR[0] = LinePS[2];
                LinePB[1] = LinePS[3]; LinePG[1] = LinePS[4]; LinePR[1] = LinePS[5];
                LinePB[2] = LinePS[6]; LinePG[2] = LinePS[7]; LinePR[2] = LinePS[8];
                LinePB[3] = LinePS[9]; LinePG[3] = LinePS[10]; LinePR[3] = LinePS[11];
                LinePB[4] = LinePS[12]; LinePG[4] = LinePS[13]; LinePR[4] = LinePS[14];
                LinePB[5] = LinePS[15]; LinePG[5] = LinePS[16]; LinePR[5] = LinePS[17];
                LinePB[6] = LinePS[18]; LinePG[6] =
LinePS[19]; LinePR[6] = LinePS[20]; LinePB[7] = LinePS[21]; LinePG[7] = LinePS[22]; LinePR[7] = LinePS[23]; LinePB += 8; LinePG += 8; LinePR += 8; LinePS += 24; } while (X < Width) { LinePB[0] = LinePS[0]; LinePG[0] = LinePS[1]; LinePR[0] = LinePS[2]; LinePB++; LinePG++; LinePR++; LinePS += 3; X++; } } } else if (Src->Channel == 4) { for (Y = 0; Y < Height; Y++) { LinePS = Src->Data + Y * Src->WidthStep; LinePB = (*Blue)->Data + Y * (*Blue)->WidthStep; LinePG = (*Green)->Data + Y * (*Green)->WidthStep; LinePR = (*Red)->Data + Y * (*Red)->WidthStep; LinePA = (*Alpha)->Data + Y * (*Alpha)->WidthStep; for (X = 0; X < Block * BlockSize; X += BlockSize) { LinePB[0] = LinePS[0]; LinePG[0] = LinePS[1]; LinePR[0] = LinePS[2]; LinePA[0] = LinePS[3]; LinePB[1] = LinePS[4]; LinePG[1] = LinePS[5]; LinePR[1] = LinePS[6]; LinePA[1] = LinePS[7]; LinePB[2] = LinePS[8]; LinePG[2] = LinePS[9]; LinePR[2] = LinePS[10]; LinePA[2] = LinePS[11]; LinePB[3] = LinePS[12]; LinePG[3] = LinePS[13]; LinePR[3] = LinePS[14]; LinePA[3] = LinePS[15]; LinePB[4] = LinePS[16]; LinePG[4] = LinePS[17]; LinePR[4] = LinePS[18]; LinePA[4] = LinePS[19]; LinePB[5] = LinePS[20]; LinePG[5] = LinePS[21]; LinePR[5] = LinePS[22]; LinePA[5] = LinePS[23]; LinePB[6] = LinePS[24]; LinePG[6] = LinePS[25]; LinePR[6] = LinePS[26]; LinePA[6] = LinePS[27]; LinePB[7] = LinePS[28]; LinePG[7] = LinePS[29]; LinePR[7] = LinePS[30]; LinePA[7] = LinePS[31]; LinePB += 8; LinePG += 8; LinePR += 8; LinePA += 8; LinePS += 32; } while (X < Width) { LinePB[0] = LinePS[0]; LinePG[0] = LinePS[1]; LinePR[0] = LinePS[2]; LinePA[0] = LinePS[3]; LinePB++; LinePG++; LinePR++; LinePA++; LinePS += 4; X++; } } } return IS_RET_OK; Done: if (*Blue != NULL) IS_FreeMatrix(Blue); if (*Green != NULL) IS_FreeMatrix(Green); if (*Red != NULL) IS_FreeMatrix(Red); if (*Alpha != NULL) IS_FreeMatrix(Alpha); return Ret; } // 20: Merge separate B, G, R (and A) single-channel images into one color image. // Parameters: // Dest: merged output image. // Blue: blue channel image. // Green: green channel image. // Red: red channel image. // Alpha: alpha channel image (required only when Dest has 4 channels). IS_RET 
CombineRGBA(TMatrix *Dest, TMatrix *Blue, TMatrix *Green, TMatrix *Red, TMatrix *Alpha) { if (Dest == NULL || Blue == NULL || Green == NULL || Red == NULL) return IS_RET_ERR_NULLREFERENCE; if (Dest->Data == NULL || Blue->Data == NULL || Green->Data == NULL || Red->Data == NULL) return IS_RET_ERR_NULLREFERENCE; if ((Dest->Channel != 3 && Dest->Channel != 4) || Blue->Channel != 1 || Green->Channel != 1 || Red->Channel != 1) return IS_RET_ERR_PARAMISMATCH; if (Dest->Width != Blue->Width || Dest->Width != Green->Width || Dest->Width != Red->Width) return IS_RET_ERR_PARAMISMATCH; if (Dest->Height != Blue->Height || Dest->Height != Green->Height || Dest->Height != Red->Height) return IS_RET_ERR_PARAMISMATCH; if (Dest->Channel == 4) { if (Alpha == NULL) return IS_RET_ERR_NULLREFERENCE; if (Alpha->Data == NULL) return IS_RET_ERR_NULLREFERENCE; if (Alpha->Channel != 1) return IS_RET_ERR_PARAMISMATCH; if (Dest->Width != Alpha->Width || Dest->Height != Alpha->Height) return IS_RET_ERR_PARAMISMATCH; } int X, Y, Block, Width = Dest->Width, Height = Dest->Height; unsigned char *LinePD, *LinePB, *LinePG, *LinePR, *LinePA; const int BlockSize = 8; Block = Width / BlockSize; // 8-way unrolling; unrolling further gave no extra speedup if (Dest->Channel == 3) { for (Y = 0; Y < Height; Y++) { LinePD = Dest->Data + Y * Dest->WidthStep; LinePB = Blue->Data + Y * Blue->WidthStep; LinePG = Green->Data + Y * Green->WidthStep; LinePR = Red->Data + Y * Red->WidthStep; for (X = 0; X < Block * BlockSize; X += BlockSize) // grouping all the LinePB loads together makes little difference in speed { LinePD[0] = LinePB[0]; LinePD[1] = LinePG[0]; LinePD[2] = LinePR[0]; LinePD[3] = LinePB[1]; LinePD[4] = LinePG[1]; LinePD[5] = LinePR[1]; LinePD[6] = LinePB[2]; LinePD[7] = LinePG[2]; LinePD[8] = LinePR[2]; LinePD[9] = LinePB[3]; LinePD[10] = LinePG[3]; LinePD[11] = LinePR[3]; LinePD[12] = LinePB[4]; LinePD[13] = LinePG[4]; LinePD[14] = LinePR[4]; LinePD[15] = LinePB[5]; LinePD[16] = LinePG[5]; LinePD[17] = LinePR[5]; LinePD[18] = LinePB[6]; 
LinePD[19] = LinePG[6]; LinePD[20] = LinePR[6]; LinePD[21] = LinePB[7]; LinePD[22] = LinePG[7]; LinePD[23] = LinePR[7]; LinePB += 8; LinePG += 8; LinePR += 8; LinePD += 24; } while (X < Width) { LinePD[0] = LinePB[0]; LinePD[1] = LinePG[0]; LinePD[2] = LinePR[0]; LinePB++; LinePG++; LinePR++; LinePD += 3; X++; } } } else if (Dest->Channel == 4) { for (Y = 0; Y < Height; Y++) { LinePD = Dest->Data + Y * Dest->WidthStep; LinePB = Blue->Data + Y * Blue->WidthStep; LinePG = Green->Data + Y * Green->WidthStep; LinePR = Red->Data + Y * Red->WidthStep; LinePA = Alpha->Data + Y * Alpha->WidthStep; for (X = 0; X < Block * BlockSize; X += BlockSize) { LinePD[0] = LinePB[0]; LinePD[1] = LinePG[0]; LinePD[2] = LinePR[0]; LinePD[3] = LinePA[0]; LinePD[4] = LinePB[1]; LinePD[5] = LinePG[1]; LinePD[6] = LinePR[1]; LinePD[7] = LinePA[1]; LinePD[8] = LinePB[2]; LinePD[9] = LinePG[2]; LinePD[10] = LinePR[2]; LinePD[11] = LinePA[2]; LinePD[12] = LinePB[3]; LinePD[13] = LinePG[3]; LinePD[14] = LinePR[3]; LinePD[15] = LinePA[3]; LinePD[16] = LinePB[4]; LinePD[17] = LinePG[4]; LinePD[18] = LinePR[4]; LinePD[19] = LinePA[4]; LinePD[20] = LinePB[5]; LinePD[21] = LinePG[5]; LinePD[22] = LinePR[5]; LinePD[23] = LinePA[5]; LinePD[24] = LinePB[6]; LinePD[25] = LinePG[6]; LinePD[26] = LinePR[6]; LinePD[27] = LinePA[6]; LinePD[28] = LinePB[7]; LinePD[29] = LinePG[7]; LinePD[30] = LinePR[7]; LinePD[31] = LinePA[7]; LinePB += 8; LinePG += 8; LinePR += 8; LinePA += 8; LinePD += 32; } while (X < Width) { LinePD[0] = LinePB[0]; LinePD[1] = LinePG[0]; LinePD[2] = LinePR[0]; LinePD[3] = LinePA[0]; LinePB++; LinePG++; LinePR++; LinePA++; LinePD += 4; X++; } } } return IS_RET_OK; } ================================================ FILE: speed_integral_graph_sse.cpp ================================================ #include <stdio.h> #include <string.h> #include <emmintrin.h> #include <opencv2/opencv.hpp> using namespace std; using namespace cv; void GetGrayIntegralImage(unsigned char *Src, int *Integral, int Width, int Height, int Stride) { memset(Integral, 0, (Width + 1) * 
sizeof(int)); // the first row is all zeros for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; int *LinePL = Integral + Y * (Width + 1) + 1; // previous row of the integral image int *LinePD = Integral + (Y + 1) * (Width + 1) + 1; // current row; the first column of every row is 0 LinePD[-1] = 0; // first column is 0 for (int X = 0, Sum = 0; X < Width; X++) { Sum += LinePS[X]; // running sum along the row LinePD[X] = LinePL[X] + Sum; // update the integral image } } } void GetGrayIntegralImage_SSE(unsigned char *Src, int *Integral, int Width, int Height, int Stride) { memset(Integral, 0, (Width + 1) * sizeof(int)); // the first row is all zeros int BlockSize = 8, Block = Width / BlockSize; for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; int *LinePL = Integral + Y * (Width + 1) + 1; // previous row of the integral image int *LinePD = Integral + (Y + 1) * (Width + 1) + 1; // current row; the first column of every row is 0 LinePD[-1] = 0; __m128i PreV = _mm_setzero_si128(); __m128i Zero = _mm_setzero_si128(); for (int X = 0; X < Block * BlockSize; X += BlockSize) { __m128i Src_Shift0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i*)(LinePS + X)), Zero); //A7 A6 A5 A4 A3 A2 A1 A0 __m128i Src_Shift1 = _mm_slli_si128(Src_Shift0, 2); //A6 A5 A4 A3 A2 A1 A0 0 __m128i Src_Shift2 = _mm_slli_si128(Src_Shift1, 2); //A5 A4 A3 A2 A1 A0 0 0 __m128i Src_Shift3 = _mm_slli_si128(Src_Shift2, 2); //A4 A3 A2 A1 A0 0 0 0 __m128i Shift_Add12 = _mm_add_epi16(Src_Shift1, Src_Shift2); //A6+A5 A5+A4 A4+A3 A3+A2 A2+A1 A1+A0 A0+0 0+0 __m128i Shift_Add03 = _mm_add_epi16(Src_Shift0, Src_Shift3); //A7+A4 A6+A3 A5+A2 A4+A1 A3+A0 A2+0 A1+0 A0+0 __m128i Low = _mm_add_epi16(Shift_Add12, Shift_Add03); //A7+A6+A5+A4 A6+A5+A4+A3 A5+A4+A3+A2 A4+A3+A2+A1 A3+A2+A1+A0 A2+A1+A0+0 A1+A0+0+0 A0+0+0+0 __m128i High = _mm_add_epi32(_mm_unpackhi_epi16(Low, Zero), _mm_unpacklo_epi16(Low, Zero)); //A7+A6+A5+A4+A3+A2+A1+A0 A6+A5+A4+A3+A2+A1+A0 A5+A4+A3+A2+A1+A0 A4+A3+A2+A1+A0 __m128i SumL = _mm_loadu_si128((__m128i *)(LinePL + X + 0)); __m128i SumH = _mm_loadu_si128((__m128i *)(LinePL + X + 4)); SumL = _mm_add_epi32(SumL, PreV); SumL = _mm_add_epi32(SumL, 
_mm_unpacklo_epi16(Low, Zero)); SumH = _mm_add_epi32(SumH, PreV); SumH = _mm_add_epi32(SumH, High); PreV = _mm_add_epi32(PreV, _mm_shuffle_epi32(High, _MM_SHUFFLE(3, 3, 3, 3))); _mm_storeu_si128((__m128i *)(LinePD + X + 0), SumL); _mm_storeu_si128((__m128i *)(LinePD + X + 4), SumH); } for (int X = Block * BlockSize, V = LinePD[X - 1] - LinePL[X - 1]; X < Width; X++) { V += LinePS[X]; LinePD[X] = V + LinePL[X]; } } } void BoxBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) { int *Integral = (int *)malloc((Width + 1) * (Height + 1) * sizeof(int)); GetGrayIntegralImage(Src, Integral, Width, Height, Stride); //#pragma omp parallel for num_threads(4) for (int Y = 0; Y < Height; Y++) { int Y1 = max(Y - Radius, 0); int Y2 = min(Y + Radius + 1, Height); int *LineP1 = Integral + Y1 * (Width + 1); int *LineP2 = Integral + Y2 * (Width + 1); unsigned char *LinePD = Dest + Y * Stride; for (int X = 0; X < Width; X++) { int X1 = max(X - Radius, 0); int X2 = min(X + Radius + 1, Width); int Sum = LineP2[X2] - LineP1[X2] - LineP2[X1] + LineP1[X1]; int PixelCount = (X2 - X1) * (Y2 - Y1); LinePD[X] = (Sum + (PixelCount >> 1)) / PixelCount; } } free(Integral); } int main() { Mat src = imread("F:\\car.jpg", 0); int Height = src.rows; int Width = src.cols; unsigned char *Src = src.data; unsigned char *Dest = new unsigned char[Height * Width]; int Stride = Width; int Radius = 11; int64 st = cv::getTickCount(); for (int i = 0; i < 10; i++) { BoxBlur(Src, Dest, Width, Height, Stride, Radius); } double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100; printf("%.5f\n", duration); BoxBlur(Src, Dest, Width, Height, Stride, Radius); Mat dst(Height, Width, CV_8UC1, Dest); imshow("origin", src); imshow("result", dst); imwrite("F:\\res.jpg", dst); waitKey(0); delete[] Dest; return 0; } ================================================ FILE: speed_max_filter_sse.cpp ================================================ #include <stdio.h> #include <opencv2/opencv.hpp> #include 
"../../OpencvTest/OpencvTest/Core.h" #include "../../OpencvTest/OpencvTest/MaxFilter.h" #include "../../OpencvTest/OpencvTest/Utility.h" using namespace std; using namespace cv; void MaxFilter_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) { TMatrix a, b; TMatrix *p1 = &a, *p2 = &b; TMatrix **p3 = &p1, **p4 = &p2; IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3); IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4); (p1)->Data = Src; (p2)->Data = Dest; MaxFilter(p1, p2, Radius); } Mat MaxFilter(Mat src, int radius) { int row = src.rows; int col = src.cols; int border = (radius - 1) / 2; Mat dst(row, col, CV_8UC3); printf("success\n"); for (int i = border; i + border < row; i++) { for (int j = border; j + border < col; j++) { for (int k = 0; k < 3; k++) { int val = src.at<Vec3b>(i, j)[k]; for (int x = -border; x <= border; x++) { for (int y = -border; y <= border; y++) { val = max(val, (int)src.at<Vec3b>(i + x, j + y)[k]); } } dst.at<Vec3b>(i, j)[k] = val; } } } printf("success\n"); return dst; } int main() { Mat src = imread("F:\\car.jpg"); int Height = src.rows; int Width = src.cols; unsigned char *Src = src.data; unsigned char *Dest = new unsigned char[Height * Width * 3]; int Stride = Width * 3; int Radius = 11; int64 st = cv::getTickCount(); for (int i = 0; i <10; i++) { Mat temp = MaxFilter(src, Radius); //MaxFilter_SSE(Src, Dest, Width, Height, Stride, 3, Radius); } double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100; printf("%.5f\n", duration); MaxFilter_SSE(Src, Dest, Width, Height, Stride, 3, Radius); Mat dst(Height, Width, CV_8UC3, Dest); imshow("origin", src); imshow("result", dst); imwrite("F:\\res.jpg", dst); waitKey(0); return 0; } ================================================ FILE: speed_median_filter_3x3_sse.cpp ================================================ #include "stdafx.h" #include <stdio.h> #include <stdlib.h> #include <immintrin.h> #include <opencv2/opencv.hpp> using namespace std; using namespace cv; int ComparisonFunction(const void *X, 
const void *Y) { unsigned char Dx = *(unsigned char *)X; unsigned char Dy = *(unsigned char *)Y; if (Dx < Dy) return -1; else if (Dx > Dy) return 1; else return 0; } void MedianBlur3X3_Ori(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { int Channel = Stride / Width; if (Channel == 1) { unsigned char Array[9]; for (int Y = 1; Y < Height - 1; Y++) { unsigned char *LineP0 = Src + (Y - 1) * Stride + 1; unsigned char *LineP1 = LineP0 + Stride; unsigned char *LineP2 = LineP1 + Stride; unsigned char *LinePD = Dest + Y * Stride + 1; for (int X = 0; X < Width - 2; X++) { Array[0] = LineP0[X - 1]; Array[1] = LineP0[X]; Array[2] = LineP0[X + 1]; Array[3] = LineP1[X - 1]; Array[4] = LineP1[X]; Array[5] = LineP1[X + 1]; Array[6] = LineP2[X - 1]; Array[7] = LineP2[X]; Array[8] = LineP2[X + 1]; qsort(Array, 9, sizeof(unsigned char), &ComparisonFunction); LinePD[X] = Array[4]; } } } else { unsigned char ArrayB[9], ArrayG[9], ArrayR[9]; for (int Y = 1; Y < Height - 1; Y++) { unsigned char *LineP0 = Src + (Y - 1) * Stride + 3; unsigned char *LineP1 = LineP0 + Stride; unsigned char *LineP2 = LineP1 + Stride; unsigned char *LinePD = Dest + Y * Stride + 3; for (int X = 1; X < Width - 1; X++) { ArrayB[0] = LineP0[-3]; ArrayG[0] = LineP0[-2]; ArrayR[0] = LineP0[-1]; ArrayB[1] = LineP0[0]; ArrayG[1] = LineP0[1]; ArrayR[1] = LineP0[2]; ArrayB[2] = LineP0[3]; ArrayG[2] = LineP0[4]; ArrayR[2] = LineP0[5]; ArrayB[3] = LineP1[-3]; ArrayG[3] = LineP1[-2]; ArrayR[3] = LineP1[-1]; ArrayB[4] = LineP1[0]; ArrayG[4] = LineP1[1]; ArrayR[4] = LineP1[2]; ArrayB[5] = LineP1[3]; ArrayG[5] = LineP1[4]; ArrayR[5] = LineP1[5]; ArrayB[6] = LineP2[-3]; ArrayG[6] = LineP2[-2]; ArrayR[6] = LineP2[-1]; ArrayB[7] = LineP2[0]; ArrayG[7] = LineP2[1]; ArrayR[7] = LineP2[2]; ArrayB[8] = LineP2[3]; ArrayG[8] = LineP2[4]; ArrayR[8] = LineP2[5]; qsort(ArrayB, 9, sizeof(unsigned char), &ComparisonFunction); qsort(ArrayG, 9, sizeof(unsigned char), &ComparisonFunction); qsort(ArrayR, 9, 
sizeof(unsigned char), &ComparisonFunction); LinePD[0] = ArrayB[4]; LinePD[1] = ArrayG[4]; LinePD[2] = ArrayR[4]; LineP0 += 3; LineP1 += 3; LineP2 += 3; LinePD += 3; } } } } void Swap(int &X, int &Y) { X ^= Y; Y ^= X; X ^= Y; } void MedianBlur3X3_Faster(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { int Channel = Stride / Width; if (Channel == 1) { for (int Y = 1; Y < Height - 1; Y++) { unsigned char *LineP0 = Src + (Y - 1) * Stride + 1; unsigned char *LineP1 = LineP0 + Stride; unsigned char *LineP2 = LineP1 + Stride; unsigned char *LinePD = Dest + Y * Stride + 1; for (int X = 0; X < Width - 2; X++) { int Gray0, Gray1, Gray2, Gray3, Gray4, Gray5, Gray6, Gray7, Gray8; Gray0 = LineP0[X - 1]; Gray1 = LineP0[X]; Gray2 = LineP0[X + 1]; Gray3 = LineP1[X - 1]; Gray4 = LineP1[X]; Gray5 = LineP1[X + 1]; Gray6 = LineP2[X - 1]; Gray7 = LineP2[X]; Gray8 = LineP2[X + 1]; if (Gray1 > Gray2) Swap(Gray1, Gray2); if (Gray4 > Gray5) Swap(Gray4, Gray5); if (Gray7 > Gray8) Swap(Gray7, Gray8); if (Gray0 > Gray1) Swap(Gray0, Gray1); if (Gray3 > Gray4) Swap(Gray3, Gray4); if (Gray6 > Gray7) Swap(Gray6, Gray7); if (Gray1 > Gray2) Swap(Gray1, Gray2); if (Gray4 > Gray5) Swap(Gray4, Gray5); if (Gray7 > Gray8) Swap(Gray7, Gray8); if (Gray0 > Gray3) Swap(Gray0, Gray3); if (Gray5 > Gray8) Swap(Gray5, Gray8); if (Gray4 > Gray7) Swap(Gray4, Gray7); if (Gray3 > Gray6) Swap(Gray3, Gray6); if (Gray1 > Gray4) Swap(Gray1, Gray4); if (Gray2 > Gray5) Swap(Gray2, Gray5); if (Gray4 > Gray7) Swap(Gray4, Gray7); if (Gray4 > Gray2) Swap(Gray4, Gray2); if (Gray6 > Gray4) Swap(Gray6, Gray4); if (Gray4 > Gray2) Swap(Gray4, Gray2); LinePD[X] = Gray4; } } } else { for (int Y = 1; Y < Height - 1; Y++) { unsigned char *LineP0 = Src + (Y - 1) * Stride + 3; unsigned char *LineP1 = LineP0 + Stride; unsigned char *LineP2 = LineP1 + Stride; unsigned char *LinePD = Dest + Y * Stride + 3; for (int X = 1; X < Width - 1; X++) { int Blue0, Blue1, Blue2, Blue3, Blue4, Blue5, Blue6, Blue7, Blue8; 
int Green0, Green1, Green2, Green3, Green4, Green5, Green6, Green7, Green8; int Red0, Red1, Red2, Red3, Red4, Red5, Red6, Red7, Red8; Blue0 = LineP0[-3]; Green0 = LineP0[-2]; Red0 = LineP0[-1]; Blue1 = LineP0[0]; Green1 = LineP0[1]; Red1 = LineP0[2]; Blue2 = LineP0[3]; Green2 = LineP0[4]; Red2 = LineP0[5]; Blue3 = LineP1[-3]; Green3 = LineP1[-2]; Red3 = LineP1[-1]; Blue4 = LineP1[0]; Green4 = LineP1[1]; Red4 = LineP1[2]; Blue5 = LineP1[3]; Green5 = LineP1[4]; Red5 = LineP1[5]; Blue6 = LineP2[-3]; Green6 = LineP2[-2]; Red6 = LineP2[-1]; Blue7 = LineP2[0]; Green7 = LineP2[1]; Red7 = LineP2[2]; Blue8 = LineP2[3]; Green8 = LineP2[4]; Red8 = LineP2[5]; if (Blue1 > Blue2) Swap(Blue1, Blue2); if (Blue4 > Blue5) Swap(Blue4, Blue5); if (Blue7 > Blue8) Swap(Blue7, Blue8); if (Blue0 > Blue1) Swap(Blue0, Blue1); if (Blue3 > Blue4) Swap(Blue3, Blue4); if (Blue6 > Blue7) Swap(Blue6, Blue7); if (Blue1 > Blue2) Swap(Blue1, Blue2); if (Blue4 > Blue5) Swap(Blue4, Blue5); if (Blue7 > Blue8) Swap(Blue7, Blue8); if (Blue0 > Blue3) Swap(Blue0, Blue3); if (Blue5 > Blue8) Swap(Blue5, Blue8); if (Blue4 > Blue7) Swap(Blue4, Blue7); if (Blue3 > Blue6) Swap(Blue3, Blue6); if (Blue1 > Blue4) Swap(Blue1, Blue4); if (Blue2 > Blue5) Swap(Blue2, Blue5); if (Blue4 > Blue7) Swap(Blue4, Blue7); if (Blue4 > Blue2) Swap(Blue4, Blue2); if (Blue6 > Blue4) Swap(Blue6, Blue4); if (Blue4 > Blue2) Swap(Blue4, Blue2); if (Green1 > Green2) Swap(Green1, Green2); if (Green4 > Green5) Swap(Green4, Green5); if (Green7 > Green8) Swap(Green7, Green8); if (Green0 > Green1) Swap(Green0, Green1); if (Green3 > Green4) Swap(Green3, Green4); if (Green6 > Green7) Swap(Green6, Green7); if (Green1 > Green2) Swap(Green1, Green2); if (Green4 > Green5) Swap(Green4, Green5); if (Green7 > Green8) Swap(Green7, Green8); if (Green0 > Green3) Swap(Green0, Green3); if (Green5 > Green8) Swap(Green5, Green8); if (Green4 > Green7) Swap(Green4, Green7); if (Green3 > Green6) Swap(Green3, Green6); if (Green1 > Green4) Swap(Green1, Green4); 
if (Green2 > Green5) Swap(Green2, Green5); if (Green4 > Green7) Swap(Green4, Green7); if (Green4 > Green2) Swap(Green4, Green2); if (Green6 > Green4) Swap(Green6, Green4); if (Green4 > Green2) Swap(Green4, Green2); if (Red1 > Red2) Swap(Red1, Red2); if (Red4 > Red5) Swap(Red4, Red5); if (Red7 > Red8) Swap(Red7, Red8); if (Red0 > Red1) Swap(Red0, Red1); if (Red3 > Red4) Swap(Red3, Red4); if (Red6 > Red7) Swap(Red6, Red7); if (Red1 > Red2) Swap(Red1, Red2); if (Red4 > Red5) Swap(Red4, Red5); if (Red7 > Red8) Swap(Red7, Red8); if (Red0 > Red3) Swap(Red0, Red3); if (Red5 > Red8) Swap(Red5, Red8); if (Red4 > Red7) Swap(Red4, Red7); if (Red3 > Red6) Swap(Red3, Red6); if (Red1 > Red4) Swap(Red1, Red4); if (Red2 > Red5) Swap(Red2, Red5); if (Red4 > Red7) Swap(Red4, Red7); if (Red4 > Red2) Swap(Red4, Red2); if (Red6 > Red4) Swap(Red6, Red4); if (Red4 > Red2) Swap(Red4, Red2); LinePD[0] = Blue4; LinePD[1] = Green4; LinePD[2] = Red4; LineP0 += 3; LineP1 += 3; LineP2 += 3; LinePD += 3; } } } } inline void _mm_sort_ab(__m128i &a, __m128i &b) { const __m128i min = _mm_min_epu8(a, b); const __m128i max = _mm_max_epu8(a, b); a = min; b = max; } void MedianBlur3X3_Fastest(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { int Channel = Stride / Width; int BlockSize = 16, Block = ((Width - 2)* Channel) / BlockSize; for (int Y = 1; Y < Height - 1; Y++) { unsigned char *LineP0 = Src + (Y - 1) * Stride + Channel; unsigned char *LineP1 = LineP0 + Stride; unsigned char *LineP2 = LineP1 + Stride; unsigned char *LinePD = Dest + Y * Stride + Channel; for (int X = 0; X < Block * BlockSize; X += BlockSize, LineP0 += BlockSize, LineP1 += BlockSize, LineP2 += BlockSize, LinePD += BlockSize) { __m128i P0 = _mm_loadu_si128((__m128i *)(LineP0 - Channel)); __m128i P1 = _mm_loadu_si128((__m128i *)(LineP0 - 0)); __m128i P2 = _mm_loadu_si128((__m128i *)(LineP0 + Channel)); __m128i P3 = _mm_loadu_si128((__m128i *)(LineP1 - Channel)); __m128i P4 = _mm_loadu_si128((__m128i 
*)(LineP1 - 0)); __m128i P5 = _mm_loadu_si128((__m128i *)(LineP1 + Channel)); __m128i P6 = _mm_loadu_si128((__m128i *)(LineP2 - Channel)); __m128i P7 = _mm_loadu_si128((__m128i *)(LineP2 - 0)); __m128i P8 = _mm_loadu_si128((__m128i *)(LineP2 + Channel)); _mm_sort_ab(P1, P2); _mm_sort_ab(P4, P5); _mm_sort_ab(P7, P8); _mm_sort_ab(P0, P1); _mm_sort_ab(P3, P4); _mm_sort_ab(P6, P7); _mm_sort_ab(P1, P2); _mm_sort_ab(P4, P5); _mm_sort_ab(P7, P8); _mm_sort_ab(P0, P3); _mm_sort_ab(P5, P8); _mm_sort_ab(P4, P7); _mm_sort_ab(P3, P6); _mm_sort_ab(P1, P4); _mm_sort_ab(P2, P5); _mm_sort_ab(P4, P7); _mm_sort_ab(P4, P2); _mm_sort_ab(P6, P4); _mm_sort_ab(P4, P2); _mm_storeu_si128((__m128i *)LinePD, P4); } for (int X = Block * BlockSize; X < (Width - 2) * Channel; X++, LinePD++) { int Gray0, Gray1, Gray2, Gray3, Gray4, Gray5, Gray6, Gray7, Gray8; Gray0 = LineP0[-Channel]; Gray1 = LineP0[0]; Gray2 = LineP0[Channel]; Gray3 = LineP1[-Channel]; Gray4 = LineP1[0]; Gray5 = LineP1[Channel]; Gray6 = LineP2[-Channel]; Gray7 = LineP2[0]; Gray8 = LineP2[Channel]; if (Gray1 > Gray2) Swap(Gray1, Gray2); if (Gray4 > Gray5) Swap(Gray4, Gray5); if (Gray7 > Gray8) Swap(Gray7, Gray8); if (Gray0 > Gray1) Swap(Gray0, Gray1); if (Gray3 > Gray4) Swap(Gray3, Gray4); if (Gray6 > Gray7) Swap(Gray6, Gray7); if (Gray1 > Gray2) Swap(Gray1, Gray2); if (Gray4 > Gray5) Swap(Gray4, Gray5); if (Gray7 > Gray8) Swap(Gray7, Gray8); if (Gray0 > Gray3) Swap(Gray0, Gray3); if (Gray5 > Gray8) Swap(Gray5, Gray8); if (Gray4 > Gray7) Swap(Gray4, Gray7); if (Gray3 > Gray6) Swap(Gray3, Gray6); if (Gray1 > Gray4) Swap(Gray1, Gray4); if (Gray2 > Gray5) Swap(Gray2, Gray5); if (Gray4 > Gray7) Swap(Gray4, Gray7); if (Gray4 > Gray2) Swap(Gray4, Gray2); if (Gray6 > Gray4) Swap(Gray6, Gray4); if (Gray4 > Gray2) Swap(Gray4, Gray2); 
LinePD[0] = Gray4; LineP0 += 1; LineP1 += 1; LineP2 += 1; } } } inline void _mm_sort_AB(__m256i &a, __m256i &b) { const __m256i min = _mm256_min_epu8(a, b); const __m256i max = _mm256_max_epu8(a, b); a = min; b = max; } void MedianBlur3X3_Fastest_AVX(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { int Channel = Stride / Width; int BlockSize = 32, Block = ((Width - 2)* Channel) / BlockSize; for (int Y = 1; Y < Height - 1; Y++) { unsigned char *LineP0 = Src + (Y - 1) * Stride + Channel; unsigned char *LineP1 = LineP0 + Stride; unsigned char *LineP2 = LineP1 + Stride; unsigned char *LinePD = Dest + Y * Stride + Channel; for (int X = 0; X < Block * BlockSize; X += BlockSize, LineP0 += BlockSize, LineP1 += BlockSize, LineP2 += BlockSize, LinePD += BlockSize) { __m256i P0 = _mm256_loadu_si256((const __m256i*)(LineP0 - Channel)); __m256i P1 = _mm256_loadu_si256((const __m256i*)(LineP0 - 0)); __m256i P2 = _mm256_loadu_si256((const __m256i*)(LineP0 + Channel)); __m256i P3 = _mm256_loadu_si256((const __m256i*)(LineP1 - Channel)); __m256i P4 = _mm256_loadu_si256((const __m256i*)(LineP1 - 0)); __m256i P5 = _mm256_loadu_si256((const __m256i*)(LineP1 + Channel)); __m256i P6 = _mm256_loadu_si256((const __m256i*)(LineP2 - Channel)); __m256i P7 = _mm256_loadu_si256((const __m256i*)(LineP2 - 0)); __m256i P8 = _mm256_loadu_si256((const __m256i*)(LineP2 + Channel)); _mm_sort_AB(P1, P2); _mm_sort_AB(P4, P5); _mm_sort_AB(P7, P8); _mm_sort_AB(P0, P1); _mm_sort_AB(P3, P4); _mm_sort_AB(P6, P7); _mm_sort_AB(P1, P2); _mm_sort_AB(P4, P5); _mm_sort_AB(P7, P8); _mm_sort_AB(P0, P3); _mm_sort_AB(P5, P8); _mm_sort_AB(P4, P7); _mm_sort_AB(P3, P6); _mm_sort_AB(P1, P4); _mm_sort_AB(P2, P5); _mm_sort_AB(P4, P7); _mm_sort_AB(P4, P2); _mm_sort_AB(P6, P4); _mm_sort_AB(P4, P2); _mm256_storeu_si256((__m256i *)LinePD, P4); } for (int X = Block * BlockSize; X < (Width - 2) * Channel; X++, LinePD++) { int Gray0, Gray1, Gray2, Gray3, Gray4, Gray5, Gray6, Gray7, Gray8; Gray0 = 
LineP0[-Channel]; Gray1 = LineP0[0]; Gray2 = LineP0[Channel]; Gray3 = LineP1[-Channel]; Gray4 = LineP1[0]; Gray5 = LineP1[Channel]; Gray6 = LineP2[-Channel]; Gray7 = LineP2[0]; Gray8 = LineP2[Channel]; if (Gray1 > Gray2) Swap(Gray1, Gray2); if (Gray4 > Gray5) Swap(Gray4, Gray5); if (Gray7 > Gray8) Swap(Gray7, Gray8); if (Gray0 > Gray1) Swap(Gray0, Gray1); if (Gray3 > Gray4) Swap(Gray3, Gray4); if (Gray6 > Gray7) Swap(Gray6, Gray7); if (Gray1 > Gray2) Swap(Gray1, Gray2); if (Gray4 > Gray5) Swap(Gray4, Gray5); if (Gray7 > Gray8) Swap(Gray7, Gray8); if (Gray0 > Gray3) Swap(Gray0, Gray3); if (Gray5 > Gray8) Swap(Gray5, Gray8); if (Gray4 > Gray7) Swap(Gray4, Gray7); if (Gray3 > Gray6) Swap(Gray3, Gray6); if (Gray1 > Gray4) Swap(Gray1, Gray4); if (Gray2 > Gray5) Swap(Gray2, Gray5); if (Gray4 > Gray7) Swap(Gray4, Gray7); if (Gray4 > Gray2) Swap(Gray4, Gray2); if (Gray6 > Gray4) Swap(Gray6, Gray4); if (Gray4 > Gray2) Swap(Gray4, Gray2); LinePD[0] = Gray4; LineP0 += 1; LineP1 += 1; LineP2 += 1; } } } int main() { Mat src = imread("F:\\car.jpg"); int Height = src.rows; int Width = src.cols; unsigned char *Src = src.data; unsigned char *Dest = new unsigned char[Height * Width * 3]; int Stride = Width * 3; int Radius = 7; int64 st = cv::getTickCount(); for (int i = 0; i <10; i++) { //Mat temp = MaxFilter(src, Radius); MedianBlur3X3_Fastest_AVX(Src, Dest, Width, Height, Stride); } double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100; printf("%.5f\n", duration); MedianBlur3X3_Fastest_AVX(Src, Dest, Width, Height, Stride); Mat dst(Height, Width, CV_8UC3, Dest); imshow("origin", src); imshow("result", dst); imwrite("F:\\res.jpg", dst); waitKey(0); return 0; } ================================================ FILE: speed_multi_scale_detail_boosting_see.cpp 
================================================ #include <stdio.h> #include <tmmintrin.h> #include <opencv2/opencv.hpp> #include "../../OpencvTest/OpencvTest/Core.h" #include "../../OpencvTest/OpencvTest/MaxFilter.h" #include "../../OpencvTest/OpencvTest/Utility.h" #include "../../OpencvTest/OpencvTest/BoxFilter.h" using namespace std; using namespace cv; #define __SSSE3__ 1 void BoxBlur_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Channel, int Radius) { TMatrix a, b; TMatrix *p1 = &a, *p2 = &b; TMatrix **p3 = &p1, **p4 = &p2; IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p3); IS_CreateMatrix(Width, Height, IS_DEPTH_8U, Channel, p4); (p1)->Data = Src; (p2)->Data = Dest; BoxBlur_SSE(p1, p2, Radius, EdgeMode::Smear); } int IM_Sign(int X) { return (X >> 31) | (unsigned(-X)) >> 31; } inline unsigned char IM_ClampToByte(int Value) { if (Value < 0) return 0; else if (Value > 255) return 255; else return (unsigned char)Value; //return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31)); } inline __m128i _mm_sgn_epi16(__m128i v) { #ifdef __SSSE3__ v = _mm_sign_epi16(_mm_set1_epi16(1), v); // use PSIGNW on SSSE3 and later #else v = _mm_min_epi16(v, _mm_set1_epi16(1)); // use PMINSW/PMAXSW on SSE2/SSE3. 
v = _mm_max_epi16(v, _mm_set1_epi16(-1)); //_mm_set1_epi16(1) = _mm_srli_epi16(_mm_cmpeq_epi16(v, v), 15); //_mm_set1_epi16(-1) = _mm_cmpeq_epi16(v, v); #endif return v; } void MultiScaleSharpen(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) { int Channel = Stride / Width; unsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); unsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); unsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); BoxBlur_SSE(Src, B1, Width, Height, Stride, Channel, Radius); BoxBlur_SSE(Src, B2, Width, Height, Stride, Channel, Radius * 2); BoxBlur_SSE(Src, B3, Width, Height, Stride, Channel, Radius * 4); for (int Y = 0; Y < Height * Stride; Y++) { int DiffB1 = Src[Y] - B1[Y]; int DiffB2 = B1[Y] - B2[Y]; int DiffB3 = B2[Y] - B3[Y]; Dest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]); } free(B1); free(B2); free(B3); } void MultiScaleSharpen_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius) { int Channel = Stride / Width; unsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); unsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); unsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char)); BoxBlur_SSE(Src, B1, Width, Height, Stride, Channel, Radius); BoxBlur_SSE(Src, B2, Width, Height, Stride, Channel, Radius * 2); BoxBlur_SSE(Src, B3, Width, Height, Stride, Channel, Radius * 4); int BlockSize = 8, Block = (Height * Stride) / BlockSize; __m128i Zero = _mm_setzero_si128(); __m128i Four = _mm_set1_epi16(4); for (int Y = 0; Y < Block * BlockSize; Y += BlockSize) { __m128i SrcV = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Src + Y)), Zero); __m128i SrcB1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B1 + Y)), Zero); __m128i SrcB2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i 
*)(B2 + Y)), Zero); __m128i SrcB3 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B3 + Y)), Zero); __m128i DiffB1 = _mm_sub_epi16(SrcV, SrcB1); __m128i DiffB2 = _mm_sub_epi16(SrcB1, SrcB2); __m128i DiffB3 = _mm_sub_epi16(SrcB2, SrcB3); //__m128i Offset = _mm_srai_epi16(_mm_add_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_sub_epi16(Four, _mm_slli_epi16(_mm_sgn_epi16(DiffB1), 1)), DiffB1), _mm_slli_epi16(DiffB2, 1)), DiffB3), 2); __m128i Offset = _mm_add_epi16(_mm_srai_epi16(_mm_sub_epi16(_mm_slli_epi16(_mm_sub_epi16(SrcB1, _mm_sign_epi16(DiffB1, DiffB1)), 1), _mm_add_epi16(SrcB2, SrcB3)), 2), DiffB1); _mm_storel_epi64((__m128i *)(Dest + Y), _mm_packus_epi16(_mm_add_epi16(SrcV, Offset), Zero)); } for (int Y = Block * BlockSize; Y < Height * Stride; Y++) { int DiffB1 = Src[Y] - B1[Y]; int DiffB2 = B1[Y] - B2[Y]; int DiffB3 = B2[Y] - B3[Y]; Dest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]); } free(B1); free(B2); free(B3); } int main() { Mat src = imread("F:\\car.jpg"); int Height = src.rows; int Width = src.cols; unsigned char *Src = src.data; unsigned char *Dest = new unsigned char[Height * Width * 3]; int Stride = Width * 3; int Radius = 5; int64 st = cv::getTickCount(); for (int i = 0; i <10; i++) { //Mat temp = MaxFilter(src, Radius); MultiScaleSharpen_SSE(Src, Dest, Width, Height, Stride, Radius); } double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 100; printf("%.5f\n", duration); MultiScaleSharpen(Src, Dest, Width, Height, Stride, Radius); Mat dst(Height, Width, CV_8UC3, Dest); imshow("origin", src); imshow("result", dst); imwrite("F:\\res.jpg", dst); waitKey(0); return 0; } ================================================ FILE: speed_rgb2gray_sse.cpp ================================================ #include "stdafx.h" #include <stdio.h> #include <immintrin.h> #include <opencv2/opencv.hpp> using namespace std; using namespace cv; // baseline: floating-point weights void RGB2Y(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = 
Src + Y * Stride; unsigned char *LinePD = Dest + Y * Width; for (int X = 0; X < Width; X++, LinePS += 3) { LinePD[X] = int(0.114 * LinePS[0] + 0.587 * LinePS[1] + 0.299 * LinePS[2]); } } } // fixed-point integer weights (float -> int) void RGB2Y_1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { const int B_WT = int(0.114 * 256 + 0.5); const int G_WT = int(0.587 * 256 + 0.5); const int R_WT = 256 - B_WT - G_WT; for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Width; for (int X = 0; X < Width; X++, LinePS += 3) { LinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8; } } } // manual 4-way unrolling void RGB2Y_2(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { const int B_WT = int(0.114 * 256 + 0.5); const int G_WT = int(0.587 * 256 + 0.5); const int R_WT = 256 - B_WT - G_WT; // int(0.299 * 256 + 0.5) for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Width; int X = 0; for (; X < Width - 4; X += 4, LinePS += 12) { LinePD[X + 0] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8; LinePD[X + 1] = (B_WT * LinePS[3] + G_WT * LinePS[4] + R_WT * LinePS[5]) >> 8; LinePD[X + 2] = (B_WT * LinePS[6] + G_WT * LinePS[7] + R_WT * LinePS[8]) >> 8; LinePD[X + 3] = (B_WT * LinePS[9] + G_WT * LinePS[10] + R_WT * LinePS[11]) >> 8; } for (; X < Width; X++, LinePS += 3) { LinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8; } } } // OpenMP, 4 threads void RGB2Y_3(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { const int B_WT = int(0.114 * 256 + 0.5); const int G_WT = int(0.587 * 256 + 0.5); const int R_WT = 256 - B_WT - G_WT; for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Width; #pragma omp parallel for num_threads(4) for (int X = 0; X < Width; X++) { LinePD[X] = (B_WT * LinePS[0 + X*3] + G_WT * LinePS[1 + X*3] + R_WT * LinePS[2 + 
X*3]) >> 8;
        }
    }
}

// SSE: 12 pixels per iteration
void RGB2Y_4(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
    const int B_WT = int(0.114 * 256 + 0.5);
    const int G_WT = int(0.587 * 256 + 0.5);
    const int R_WT = 256 - B_WT - G_WT; // int(0.299 * 256 + 0.5)
    for (int Y = 0; Y < Height; Y++) {
        unsigned char *LinePS = Src + Y * Stride;
        unsigned char *LinePD = Dest + Y * Width;
        int X = 0;
        for (; X < Width - 12; X += 12, LinePS += 36) {
            // Three loads offset by 0/1/2 bytes, each multiplied by a rotated weight pattern,
            // so that their sum holds a complete B*wB + G*wG + R*wR in every third 16-bit lane.
            __m128i p1aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 0))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); //1
            __m128i p2aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 1))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); //2
            __m128i p3aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 2))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); //3
            __m128i p1aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 8))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));//4
            __m128i p2aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 9))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));//5
            __m128i p3aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 10))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));//6
            __m128i p1bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 18))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));//7
            __m128i p2bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 19))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));//8
            __m128i p3bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 20))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));//9
            __m128i p1bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 26))),
                _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT));//10
            __m128i p2bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 27))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT));//11
            __m128i p3bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 28))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT));//12
            __m128i sumaL = _mm_add_epi16(p3aL, _mm_add_epi16(p1aL, p2aL));//13
            __m128i sumaH = _mm_add_epi16(p3aH, _mm_add_epi16(p1aH, p2aH));//14
            __m128i sumbL = _mm_add_epi16(p3bL, _mm_add_epi16(p1bL, p2bL));//15
            __m128i sumbH = _mm_add_epi16(p3bH, _mm_add_epi16(p1bH, p2bH));//16
            __m128i sclaL = _mm_srli_epi16(sumaL, 8);//17
            __m128i sclaH = _mm_srli_epi16(sumaH, 8);//18
            __m128i sclbL = _mm_srli_epi16(sumbL, 8);//19
            __m128i sclbH = _mm_srli_epi16(sumbH, 8);//20
            // _mm_shuffle_epi8 uses only the low 4 bits of a non-negative index,
            // so 18, 24, 30 below select bytes 2, 8, 14.
            __m128i shftaL = _mm_shuffle_epi8(sclaL, _mm_setr_epi8(0, 6, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));//21
            __m128i shftaH = _mm_shuffle_epi8(sclaH, _mm_setr_epi8(-1, -1, -1, 18, 24, 30, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));//22
            __m128i shftbL = _mm_shuffle_epi8(sclbL, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 0, 6, 12, -1, -1, -1, -1, -1, -1, -1));//23
            __m128i shftbH = _mm_shuffle_epi8(sclbH, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, 18, 24, 30, -1, -1, -1, -1));//24
            __m128i accumL = _mm_or_si128(shftaL, shftbL);//25
            __m128i accumH = _mm_or_si128(shftaH, shftbH);//26
            __m128i h3 = _mm_or_si128(accumL, accumH);//27
            //__m128i h3 = _mm_blendv_epi8(accumL, accumH, _mm_setr_epi8(0, 0, 0, -1, -1, -1, 0, 0, 0, -1, -1, -1, 1, 1, 1, 1));
            _mm_storeu_si128((__m128i *)(LinePD + X), h3);
        }
        // Scalar tail for the remaining pixels
        for (; X < Width; X++, LinePS += 3) {
            LinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;
        }
    }
}

// SSE: 15 pixels per iteration
void RGB2Y_5(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) {
    const int B_WT = int(0.114 * 256 + 0.5);
    const int G_WT = int(0.587 * 256 + 0.5);
    const int R_WT = 256 - B_WT -
G_WT; // int(0.299 * 256 + 0.5) for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Width; int X = 0; for (; X < Width - 15; X += 15, LinePS += 45) { __m128i p1aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 0))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); //1 __m128i p2aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 1))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); //2 __m128i p3aL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 2))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); //3 __m128i p1aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 8))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); __m128i p2aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 9))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); __m128i p3aH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 10))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); __m128i p1bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 18))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); __m128i p2bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 19))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); __m128i p3bL = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 20))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); __m128i p1bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 26))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); __m128i p2bH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 27))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); __m128i p3bH = 
_mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 28))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); __m128i p1cH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 36))), _mm_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT)); __m128i p2cH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 37))), _mm_setr_epi16(G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT)); __m128i p3cH = _mm_mullo_epi16(_mm_cvtepu8_epi16(_mm_loadu_si128((__m128i *)(LinePS + 38))), _mm_setr_epi16(R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT)); __m128i sumaL = _mm_add_epi16(p3aL, _mm_add_epi16(p1aL, p2aL)); __m128i sumaH = _mm_add_epi16(p3aH, _mm_add_epi16(p1aH, p2aH)); __m128i sumbL = _mm_add_epi16(p3bL, _mm_add_epi16(p1bL, p2bL)); __m128i sumbH = _mm_add_epi16(p3bH, _mm_add_epi16(p1bH, p2bH)); __m128i sumcH = _mm_add_epi16(p3cH, _mm_add_epi16(p1cH, p2cH)); __m128i sclaL = _mm_srli_epi16(sumaL, 8); __m128i sclaH = _mm_srli_epi16(sumaH, 8); __m128i sclbL = _mm_srli_epi16(sumbL, 8); __m128i sclbH = _mm_srli_epi16(sumbH, 8); __m128i sclcH = _mm_srli_epi16(sumcH, 8); __m128i shftaL = _mm_shuffle_epi8(sclaL, _mm_setr_epi8(0, 6, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); __m128i shftaH = _mm_shuffle_epi8(sclaH, _mm_setr_epi8(-1, -1, -1, 2, 8, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); __m128i shftbL = _mm_shuffle_epi8(sclbL, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 0, 6, 12, -1, -1, -1, -1, -1, -1, -1)); __m128i shftbH = _mm_shuffle_epi8(sclbH, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 8, 14, -1, -1, -1, -1)); __m128i shftcH = _mm_shuffle_epi8(sclcH, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 6, 12, -1)); __m128i accumL = _mm_or_si128(shftaL, shftbL); __m128i accumH = _mm_or_si128(shftaH, shftbH); __m128i h3 = _mm_or_si128(accumL, accumH); h3 = _mm_or_si128(h3, shftcH); _mm_storeu_si128((__m128i *)(LinePD + X), h3); } for (; X < Width; X++, LinePS += 3) { 
            LinePD[X] = (B_WT * LinePS[0] + G_WT * LinePS[1] + R_WT * LinePS[2]) >> 8;
        }
    }
}

// Print the 16 bytes of an SSE register
void debug(__m128i var) {
    uint8_t *val = (uint8_t*)&var;//can also use uint32_t instead of 16_t
    printf("Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\n",
        val[0], val[1], val[2], val[3], val[4], val[5], val[6], val[7],
        val[8], val[9], val[10], val[11], val[12], val[13], val[14], val[15]);
}

// Print the 32 bytes of an AVX2 register
void debug2(__m256i var) {
    uint8_t *val = (uint8_t*)&var;//can also use uint32_t instead of 16_t
    printf("Numerical: %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i %i\n",
        val[0], val[1], val[2], val[3], val[4], val[5], val[6], val[7],
        val[8], val[9], val[10], val[11], val[12], val[13], val[14], val[15],
        val[16], val[17], val[18], val[19], val[20], val[21], val[22], val[23],
        val[24], val[25], val[26], val[27], val[28], val[29], val[30], val[31]);
}

// AVX2
// Weights in Q15 fixed point (scaled by 32768) for use with _mm256_mulhrs_epi16
constexpr double B_WEIGHT = 0.114;
constexpr double G_WEIGHT = 0.587;
constexpr double R_WEIGHT = 0.299;
constexpr uint16_t B_WT = static_cast<uint16_t>(32768.0 * B_WEIGHT + 0.5);
constexpr uint16_t G_WT = static_cast<uint16_t>(32768.0 * G_WEIGHT + 0.5);
constexpr uint16_t R_WT = static_cast<uint16_t>(32768.0 * R_WEIGHT + 0.5);
static const __m256i weight_vec = _mm256_setr_epi16(B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT, G_WT, R_WT, B_WT);

// Converts rows [start_row, start_row + thread_stride) of the image
void _RGB2Y(unsigned char* Src, const int32_t Width, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char* Dest) {
    for (int Y = start_row; Y < start_row + thread_stride; Y++) {
        //Sleep(1);
        unsigned char *LinePS = Src + Y * Stride;
        unsigned char *LinePD = Dest + Y * Width;
        int X = 0;
        for (; X < Width - 10; X += 10, LinePS += 30) {
            //B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5 G5 R5 B6
            __m256i temp = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(LinePS + 0)));
            __m256i in1 = _mm256_mulhrs_epi16(temp, weight_vec);
            //B6 G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10 R10 B11
            temp = _mm256_cvtepu8_epi16(_mm_loadu_si128((const
                __m128i*)(LinePS + 15)));
            __m256i in2 = _mm256_mulhrs_epi16(temp, weight_vec);
            // Byte layout after packing (packus works per 128-bit lane):
            //0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28  29  30  31
            //B1 G1 R1 B2 G2 R2 B3 G3 B6 G6 R6 B7 G7 R7 B8 G8 R3 B4 G4 R4 B5 G5 R5 B6 R8 B9 G9 R9 B10 G10 R10 B11
            __m256i mul = _mm256_packus_epi16(in1, in2);
            __m256i b1 = _mm256_shuffle_epi8(mul, _mm256_setr_epi8(
                // B1 B2 B3 -1, -1, -1 B7 B8 -1, -1, -1, -1, -1, -1, -1, -1,
                0, 3, 6, -1, -1, -1, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1,
                // -1, -1, -1, B4 B5 B6 -1, -1 B9 B10 -1, -1, -1, -1, -1, -1
                -1, -1, -1, 1, 4, 7, -1, -1, 9, 12, -1, -1, -1, -1, -1, -1));
            __m256i g1 = _mm256_shuffle_epi8(mul, _mm256_setr_epi8(
                // G1 G2 G3 -1, -1 G6 G7 G8 -1, -1, -1, -1, -1, -1, -1, -1,
                1, 4, 7, -1, -1, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1,
                // -1, -1, -1 G4 G5 -1, -1, -1 G9 G10 -1, -1, -1, -1, -1, -1
                -1, -1, -1, 2, 5, -1, -1, -1, 10, 13, -1, -1, -1, -1, -1, -1));
            __m256i r1 = _mm256_shuffle_epi8(mul, _mm256_setr_epi8(
                // R1 R2 -1 -1 -1 R6 R7 -1, -1, -1, -1, -1, -1, -1, -1, -1,
                2, 5, -1, -1, -1, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                // -1, -1, R3 R4 R5 -1, -1, R8 R9 R10 -1, -1, -1, -1, -1, -1
                -1, -1, 0, 3, 6, -1, -1, 8, 11, 14, -1, -1, -1, -1, -1, -1));
            // Low lane : B1+G1+R1 B2+G2+R2 B3+G3 0 0 G6+R6 B7+G7+R7 B8+G8 0 0 0 0 0 0 0 0
            // High lane: 0 0 R3 B4+G4+R4 B5+G5+R5 B6 0 R8 B9+G9+R9 B10+G10+R10 0 0 0 0 0 0
            __m256i accum = _mm256_adds_epu8(r1, _mm256_adds_epu8(b1, g1));
            // _mm256_castsi256_si128(accum)
            // B1+G1+R1 B2+G2+R2 B3+G3 0 0 G6+R6 B7+G7+R7 B8+G8 0 0 0 0 0 0 0 0
            // _mm256_extracti128_si256(accum, 1)
            // 0 0 R3 B4+G4+R4 B5+G5+R5 B6 0 R8 B9+G9+R9 B10+G10+R10 0 0 0 0 0 0
            __m128i h3 = _mm_adds_epu8(_mm256_castsi256_si128(accum), _mm256_extracti128_si256(accum, 1));
            _mm_storeu_si128((__m128i *)(LinePD + X), h3);
        }
        for (; X < Width; X++, LinePS += 3) {
            // Scalar equivalent of _mm256_mulhrs_epi16: (((a * b) >> 14) + 1) >> 1.
            // The parentheses are required because + binds tighter than >>.
            int tmpB = (((B_WT * LinePS[0]) >> 14) + 1) >> 1;
            tmpB = max(min(255, tmpB), 0);
            int tmpG = (((G_WT * LinePS[1]) >> 14) + 1) >> 1;
            tmpG = max(min(255, tmpG), 0);
            int tmpR = (((R_WT * LinePS[2]) >> 14) + 1) >> 1;
            tmpR =
                max(min(255, tmpR), 0);
            int tmp = tmpB + tmpG + tmpR;
            LinePD[X] = max(min(255, tmp), 0);
        }
    }
}

// AVX2
void RGB2Y_6(unsigned char *Src, unsigned char *Dest, int width, int height, int stride) {
    _RGB2Y(Src, width, 0, height, stride, Dest);
}

// AVX2 + asynchronous tasks via std::async
void RGB2Y_7(unsigned char *Src, unsigned char *Dest, int width, int height, int stride) {
    // At most one task per 16 rows, and always at least one task.
    const int32_t hw_concur = std::max(1, std::min(height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency())));
    std::vector<std::future<void>> fut(hw_concur);
    const int thread_stride = (height - 1) / hw_concur + 1;
    int i = 0, start = 0;
    for (; i < hw_concur && start < height; i++, start += thread_stride) {
        // Clamp the last band so it never runs past the bottom of the image.
        const int rows = std::min(thread_stride, height - start);
        fut[i] = std::async(std::launch::async, _RGB2Y, Src, width, start, rows, stride, Dest);
    }
    for (int j = 0; j < i; ++j)
        fut[j].wait();
}

int main() {
    Mat src = imread("F:\\car.jpg");
    int Height = src.rows;
    int Width = src.cols;
    unsigned char *Src = src.data;
    unsigned char *Dest = new unsigned char[Height * Width];
    int Stride = Width * 3;
    int Radius = 11;
    int64 st = cv::getTickCount();
    for (int i = 0; i < 100; i++) {
        RGB2Y_3(Src, Dest, Width, Height, Stride);
    }
    // Average milliseconds per call: total seconds * 1000 / 100 iterations
    double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 10;
    printf("%.5f\n", duration);
    RGB2Y_5(Src, Dest, Width, Height, Stride);
    Mat dst(Height, Width, CV_8UC1, Dest);
    imshow("origin", src);
    imshow("result", dst);
    imwrite("F:\\res.jpg", dst);
    waitKey(0);
    return 0;
}

================================================
FILE: speed_rgb2yuv_sse.cpp
================================================
#include "stdafx.h"
#include
#include
#include
using namespace std;
using namespace cv;

// Saturate an int to the range [0, 255]
inline unsigned char ClampToByte(int Value) {
    if (Value < 0)
        return 0;
    else if (Value > 255)
        return 255;
    else
        return (unsigned char)Value;
    //return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));
}

// Floating-point reference implementation
void RGB2YUV(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) {
    for (int YY = 0; YY < Height; YY++) {
        unsigned
            char *LinePS = RGB + YY * Stride;
        unsigned char *LinePY = Y + YY * Width;
        unsigned char *LinePU = U + YY * Width;
        unsigned char *LinePV = V + YY * Width;
        for (int XX = 0; XX < Width; XX++, LinePS += 3) {
            int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
            LinePY[XX] = int(0.299*Red + 0.587*Green + 0.114*Blue);
            LinePU[XX] = int(-0.147*Red - 0.289*Green + 0.436*Blue);
            LinePV[XX] = int(0.615*Red - 0.515*Green - 0.100*Blue);
        }
    }
}

// Floating-point reference implementation of the inverse transform
void YUV2RGB(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) {
    for (int YY = 0; YY < Height; YY++) {
        unsigned char *LinePD = RGB + YY * Stride;
        unsigned char *LinePY = Y + YY * Width;
        unsigned char *LinePU = U + YY * Width;
        unsigned char *LinePV = V + YY * Width;
        for (int XX = 0; XX < Width; XX++, LinePD += 3) {
            int YV = LinePY[XX], UV = LinePU[XX], VV = LinePV[XX];
            LinePD[0] = int(YV + 2.03 * UV);
            LinePD[1] = int(YV - 0.39 * UV - 0.58 * VV);
            LinePD[2] = int(YV + 1.14 * VV);
        }
    }
}

// Fixed-point version: weights scaled by 2^Shift, rounded with HalfV, U and V offset by 128
void RGB2YUV_1(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) {
    const int Shift = 8;
    const int HalfV = 1 << (Shift - 1);
    const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT;
    const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT);
    const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT);
    for (int YY = 0; YY < Height; YY++) {
        unsigned char *LinePS = RGB + YY * Stride;
        unsigned char *LinePY = Y + YY * Width;
        unsigned char *LinePU = U + YY * Width;
        unsigned char *LinePV = V + YY * Width;
        for (int XX = 0; XX < Width; XX++, LinePS += 3) {
            int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
            LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift;
            LinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128;
            LinePV[XX] = ((V_B_WT * Blue
+ V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128; } } } void YUV2RGB_1(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) { const int Shift = 8; const int HalfV = 1 << (Shift - 1); const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0; const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift); const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift); for (int YY = 0; YY < Height; YY++) { unsigned char *LinePD = RGB + YY * Stride; unsigned char *LinePY = Y + YY * Width; unsigned char *LinePU = U + YY * Width; unsigned char *LinePV = V + YY * Width; for (int XX = 0; XX < Width; XX++, LinePD += 3) { int YV = LinePY[XX], UV = LinePU[XX] - 128, VV = LinePV[XX] - 128; LinePD[0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift)); LinePD[1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift)); LinePD[2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift)); } } } void RGB2YUV_OpenMP(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) { const int Shift = 8; const int HalfV = 1 << (Shift - 1); const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT; const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT); const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT); for (int YY = 0; YY < Height; YY++) { unsigned char *LinePS = RGB + YY * Stride; unsigned char *LinePY = Y + YY * Width; unsigned char *LinePU = U + YY * Width; unsigned char *LinePV = V + YY * Width; #pragma omp parallel for num_threads(4) for (int XX = 0; XX < Width; XX++) { int Blue = LinePS[XX*3 + 0], Green = LinePS[XX*3 + 1], Red = LinePS[XX*3 + 2]; LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift; LinePU[XX] = 
((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128; LinePV[XX] = ((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128; } } } void YUV2RGB_OpenMP(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) { const int Shift = 8; const int HalfV = 1 << (Shift - 1); const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0; const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift); const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift); for (int YY = 0; YY < Height; YY++) { unsigned char *LinePD = RGB + YY * Stride; unsigned char *LinePY = Y + YY * Width; unsigned char *LinePU = U + YY * Width; unsigned char *LinePV = V + YY * Width; #pragma omp parallel for num_threads(4) for (int XX = 0; XX < Width; XX++) { int YV = LinePY[XX], UV = LinePU[XX] - 128, VV = LinePV[XX] - 128; LinePD[XX*3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift)); LinePD[XX*3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift)); LinePD[XX*3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift)); } } } void RGB2YUVSSE_2(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) { const int Shift = 13; const int HalfV = 1 << (Shift - 1); const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT; const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT); const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT); __m128i Weight_YB = _mm_set1_epi32(Y_B_WT), Weight_YG = _mm_set1_epi32(Y_G_WT), Weight_YR = _mm_set1_epi32(Y_R_WT); __m128i Weight_UB = _mm_set1_epi32(U_B_WT), Weight_UG = _mm_set1_epi32(U_G_WT), Weight_UR = _mm_set1_epi32(U_R_WT); __m128i Weight_VB = _mm_set1_epi32(V_B_WT), Weight_VG = 
        _mm_set1_epi32(V_G_WT), Weight_VR = _mm_set1_epi32(V_R_WT);
    __m128i C128 = _mm_set1_epi32(128);
    __m128i Half = _mm_set1_epi32(HalfV);
    __m128i Zero = _mm_setzero_si128();

    const int BlockSize = 16, Block = Width / BlockSize;
    for (int YY = 0; YY < Height; YY++) {
        unsigned char *LinePS = RGB + YY * Stride;
        unsigned char *LinePY = Y + YY * Width;
        unsigned char *LinePU = U + YY * Width;
        unsigned char *LinePV = V + YY * Width;
        for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3) {
            __m128i Src1, Src2, Src3, Blue, Green, Red;
            Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0));
            Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16));
            Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32));
            // Rearrange 16 consecutive pixels from the interleaved order B G R B G R ... (48 bytes)
            // into three planar registers that suit SIMD processing: 16 x B, 16 x G, 16 x R.
            Blue = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
            Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1)));
            Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13)));
            Green = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
            Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1)));
            Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14)));
            Red = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1));
            Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1)));
            Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0,
                3, 6, 9, 12, 15)));

            // Widen the bytes of the three SSE registers into 12 registers of four 32-bit ints each,
            // so that the multiplications below cannot overflow.
            __m128i Blue16L = _mm_unpacklo_epi8(Blue, Zero);
            __m128i Blue16H = _mm_unpackhi_epi8(Blue, Zero);
            __m128i Blue32LL = _mm_unpacklo_epi16(Blue16L, Zero);
            __m128i Blue32LH = _mm_unpackhi_epi16(Blue16L, Zero);
            __m128i Blue32HL = _mm_unpacklo_epi16(Blue16H, Zero);
            __m128i Blue32HH = _mm_unpackhi_epi16(Blue16H, Zero);

            __m128i Green16L = _mm_unpacklo_epi8(Green, Zero);
            __m128i Green16H = _mm_unpackhi_epi8(Green, Zero);
            __m128i Green32LL = _mm_unpacklo_epi16(Green16L, Zero);
            __m128i Green32LH = _mm_unpackhi_epi16(Green16L, Zero);
            __m128i Green32HL = _mm_unpacklo_epi16(Green16H, Zero);
            __m128i Green32HH = _mm_unpackhi_epi16(Green16H, Zero);

            __m128i Red16L = _mm_unpacklo_epi8(Red, Zero);
            __m128i Red16H = _mm_unpackhi_epi8(Red, Zero);
            __m128i Red32LL = _mm_unpacklo_epi16(Red16L, Zero);
            __m128i Red32LH = _mm_unpackhi_epi16(Red16L, Zero);
            __m128i Red32HL = _mm_unpacklo_epi16(Red16H, Zero);
            __m128i Red32HH = _mm_unpackhi_epi16(Red16H, Zero);

            // Computes: Y[0 - 15] = (Y_B_WT * Blue[0 - 15] + Y_G_WT * Green[0 - 15] + Y_R_WT * Red[0 - 15] + HalfV) >> Shift;
            __m128i LL_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_YG), _mm_mullo_epi32(Red32LL, Weight_YR))), Half), Shift);
            __m128i LH_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_YG), _mm_mullo_epi32(Red32LH, Weight_YR))), Half), Shift);
            __m128i HL_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_YG), _mm_mullo_epi32(Red32HL, Weight_YR))), Half), Shift);
            __m128i HH_Y = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_YB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_YG), _mm_mullo_epi32(Red32HH, Weight_YR))), Half), Shift);
            _mm_storeu_si128((__m128i*)(LinePY + XX),
                _mm_packus_epi16(_mm_packus_epi32(LL_Y, LH_Y), _mm_packus_epi32(HL_Y, HH_Y))); // Repack four registers of four ints each into one register of 16 bytes

            // Computes: U[0 - 15] = ((U_B_WT * Blue[0 - 15] + U_G_WT * Green[0 - 15] + U_R_WT * Red[0 - 15] + HalfV) >> Shift) + 128;
            __m128i LL_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_UG), _mm_mullo_epi32(Red32LL, Weight_UR))), Half), Shift), C128);
            __m128i LH_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_UG), _mm_mullo_epi32(Red32LH, Weight_UR))), Half), Shift), C128);
            __m128i HL_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_UG), _mm_mullo_epi32(Red32HL, Weight_UR))), Half), Shift), C128);
            __m128i HH_U = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_UB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_UG), _mm_mullo_epi32(Red32HH, Weight_UR))), Half), Shift), C128);
            _mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(LL_U, LH_U), _mm_packus_epi32(HL_U, HH_U)));

            // Computes: V[0 - 15] = ((V_B_WT * Blue[0 - 15] + V_G_WT * Green[0 - 15] + V_R_WT * Red[0 - 15] + HalfV) >> Shift) + 128;
            __m128i LL_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LL, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32LL, Weight_VG), _mm_mullo_epi32(Red32LL, Weight_VR))), Half), Shift), C128);
            __m128i LH_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32LH, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32LH, Weight_VG), _mm_mullo_epi32(Red32LH, Weight_VR))), Half), Shift), C128);
            __m128i HL_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HL, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32HL, Weight_VG),
_mm_mullo_epi32(Red32HL, Weight_VR))), Half), Shift), C128); __m128i HH_V = _mm_add_epi32(_mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(_mm_mullo_epi32(Blue32HH, Weight_VB), _mm_add_epi32(_mm_mullo_epi32(Green32HH, Weight_VG), _mm_mullo_epi32(Red32HH, Weight_VR))), Half), Shift), C128); _mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(LL_V, LH_V), _mm_packus_epi32(HL_V, HH_V))); } for (int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3) { int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2]; LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + HalfV) >> Shift; LinePU[XX] = ((U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + HalfV) >> Shift) + 128; LinePV[XX] = ((V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + HalfV) >> Shift) + 128; } } } void YUV2RGBSSE_2(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) { const int Shift = 13; const int HalfV = 1 << (Shift - 1); const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0; const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift); const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift); __m128i Weight_B_Y = _mm_set1_epi32(B_Y_WT), Weight_B_U = _mm_set1_epi32(B_U_WT), Weight_B_V = _mm_set1_epi32(B_V_WT); __m128i Weight_G_Y = _mm_set1_epi32(G_Y_WT), Weight_G_U = _mm_set1_epi32(G_U_WT), Weight_G_V = _mm_set1_epi32(G_V_WT); __m128i Weight_R_Y = _mm_set1_epi32(R_Y_WT), Weight_R_U = _mm_set1_epi32(R_U_WT), Weight_R_V = _mm_set1_epi32(R_V_WT); __m128i Half = _mm_set1_epi32(HalfV); __m128i C128 = _mm_set1_epi32(128); __m128i Zero = _mm_setzero_si128(); const int BlockSize = 16, Block = Width / BlockSize; for (int YY = 0; YY < Height; YY++) { unsigned char *LinePD = RGB + YY * Stride; unsigned char *LinePY = Y + YY * Width; unsigned char *LinePU = U + YY * Width; unsigned char *LinePV = V + YY * Width; for (int XX = 0; XX < Block * BlockSize; XX += 
BlockSize, LinePY += BlockSize, LinePU += BlockSize, LinePV += BlockSize) { __m128i Blue, Green, Red, YV, UV, VV, Dest1, Dest2, Dest3; YV = _mm_loadu_si128((__m128i *)(LinePY + 0)); UV = _mm_loadu_si128((__m128i *)(LinePU + 0)); VV = _mm_loadu_si128((__m128i *)(LinePV + 0)); //UV = _mm_sub_epi32(UV, C128); //VV = _mm_sub_epi32(VV, C128); __m128i YV16L = _mm_unpacklo_epi8(YV, Zero); __m128i YV16H = _mm_unpackhi_epi8(YV, Zero); __m128i YV32LL = _mm_unpacklo_epi16(YV16L, Zero); __m128i YV32LH = _mm_unpackhi_epi16(YV16L, Zero); __m128i YV32HL = _mm_unpacklo_epi16(YV16H, Zero); __m128i YV32HH = _mm_unpackhi_epi16(YV16H, Zero); __m128i UV16L = _mm_unpacklo_epi8(UV, Zero); __m128i UV16H = _mm_unpackhi_epi8(UV, Zero); __m128i UV32LL = _mm_unpacklo_epi16(UV16L, Zero); __m128i UV32LH = _mm_unpackhi_epi16(UV16L, Zero); __m128i UV32HL = _mm_unpacklo_epi16(UV16H, Zero); __m128i UV32HH = _mm_unpackhi_epi16(UV16H, Zero); UV32LL = _mm_sub_epi32(UV32LL, C128); UV32LH = _mm_sub_epi32(UV32LH, C128); UV32HL = _mm_sub_epi32(UV32HL, C128); UV32HH = _mm_sub_epi32(UV32HH, C128); __m128i VV16L = _mm_unpacklo_epi8(VV, Zero); __m128i VV16H = _mm_unpackhi_epi8(VV, Zero); __m128i VV32LL = _mm_unpacklo_epi16(VV16L, Zero); __m128i VV32LH = _mm_unpackhi_epi16(VV16L, Zero); __m128i VV32HL = _mm_unpacklo_epi16(VV16H, Zero); __m128i VV32HH = _mm_unpackhi_epi16(VV16H, Zero); VV32LL = _mm_sub_epi32(VV32LL, C128); VV32LH = _mm_sub_epi32(VV32LH, C128); VV32HL = _mm_sub_epi32(VV32HL, C128); VV32HH = _mm_sub_epi32(VV32HH, C128); __m128i LL_B = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LL, Weight_B_U)), Shift)); __m128i LH_B = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LH, Weight_B_U)), Shift)); __m128i HL_B = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HL, Weight_B_U)), Shift)); __m128i HH_B = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HH, Weight_B_U)), Shift)); Blue = 
_mm_packus_epi16(_mm_packus_epi32(LL_B, LH_B), _mm_packus_epi32(HL_B, HH_B)); __m128i LL_G = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LL), _mm_mullo_epi32(Weight_G_V, VV32LL))), Shift)); __m128i LH_G = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LH), _mm_mullo_epi32(Weight_G_V, VV32LH))), Shift)); __m128i HL_G = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HL), _mm_mullo_epi32(Weight_G_V, VV32HL))), Shift)); __m128i HH_G = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HH), _mm_mullo_epi32(Weight_G_V, VV32HH))), Shift)); Green = _mm_packus_epi16(_mm_packus_epi32(LL_G, LH_G), _mm_packus_epi32(HL_G, HH_G)); __m128i LL_R = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LL, Weight_R_V)), Shift)); __m128i LH_R = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LH, Weight_R_V)), Shift)); __m128i HL_R = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HL, Weight_R_V)), Shift)); __m128i HH_R = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HH, Weight_R_V)), Shift)); Red = _mm_packus_epi16(_mm_packus_epi32(LL_R, LH_R), _mm_packus_epi32(HL_R, HH_R)); Dest1 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5)); Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1))); Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1))); Dest2 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1)); Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 
                10)));
            Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1)));

            Dest3 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1));
            Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1)));
            Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15)));

            _mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3), Dest1);
            _mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize), Dest2);
            _mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize * 2), Dest3);
        }
        // Scalar tail. LinePY/LinePU/LinePV were already advanced by the SIMD loop, so they are
        // read at offset 0; LinePD still points at the row start, so it is indexed by XX * 3.
        for (int XX = Block * BlockSize; XX < Width; XX++, LinePU++, LinePV++, LinePY++) {
            int YV = LinePY[0], UV = LinePU[0] - 128, VV = LinePV[0] - 128;
            LinePD[XX * 3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift));
            LinePD[XX * 3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift));
            LinePD[XX * 3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift));
        }
    }
}

void RGB2YUVSSE_3(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) {
    const int Shift = 13; // No coefficient here has an absolute value above 1, so the scale factor could be as large as 2^15.
    const int HalfV = 1 << (Shift - 1);
    // The *_C_WT constants multiply a constant lane; note that 257 * HalfV == (128 << Shift) + HalfV.
    const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT, Y_C_WT = 1;
    const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT), U_C_WT = 257;
    const int V_B_WT = -0.10001 * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT), V_C_WT = 257;

    __m128i Weight_YBG = _mm_setr_epi16(Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT);
    __m128i Weight_YRC = _mm_setr_epi16(Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT);
    __m128i Weight_UBG = _mm_setr_epi16(U_B_WT, U_G_WT,
U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT); __m128i Weight_URC = _mm_setr_epi16(U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT); __m128i Weight_VBG = _mm_setr_epi16(V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT); __m128i Weight_VRC = _mm_setr_epi16(V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT); __m128i Half = _mm_setr_epi16(0, HalfV, 0, HalfV, 0, HalfV, 0, HalfV); __m128i Zero = _mm_setzero_si128(); int BlockSize = 16, Block = Width / BlockSize; for (int YY = 0; YY < Height; YY++) { unsigned char *LinePS = RGB + YY * Stride; unsigned char *LinePY = Y + YY * Width; unsigned char *LinePU = U + YY * Width; unsigned char *LinePV = V + YY * Width; for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3) { __m128i Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0)); __m128i Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16)); __m128i Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32)); // Src1 : B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5 G5 R5 B6 // Src2 : G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10 R10 B11 G11 // Src3 : R11 B12 G12 R12 B13 G13 R13 B14 G14 R14 B15 G15 R15 B16 G16 R16 // BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 0 0 0 0 0 __m128i BGL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, -1, -1, -1, -1, -1)); // BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 G6 B7 G7 B8 G8 BGL = _mm_or_si128(BGL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 2, 3, 5, 6))); // BGH : B9 G9 B10 G10 B11 G11 0 0 0 0 0 0 0 0 0 0 __m128i BGH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(8, 9, 11, 12, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); // BGH : B9 G9 B10 G10 B11 G11 B12 G12 B13 G13 B14 G14 B15 G15 B16 G16 BGH = _mm_or_si128(BGH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 1, 2, 4, 5, 7, 8, 10, 11, 13, 14))); // RCL : R1 0 R2 0 R3 0 R4 0 R5 0 0 0 0 0 0 0 __m128i RCL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, -1, 5, -1, 8, -1, 11, -1, 14, -1, -1, 
-1, -1, -1, -1, -1)); // RCL : R1 0 R2 0 R3 0 R4 0 R5 0 R6 0 R7 0 R8 0 RCL = _mm_or_si128(RCL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, 4, -1, 7, -1))); // RCH : R9 0 R10 0 0 0 0 0 0 0 0 0 0 0 0 0 __m128i RCH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(10, -1, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); // RCH : R9 0 R10 0 R11 0 R12 0 R13 0 R14 0 R15 0 R16 0 RCH = _mm_or_si128(RCH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, 0, -1, 3, -1, 6, -1, 9, -1, 12, -1, 15, -1))); // BGLL : B1 0 G1 0 B2 0 G2 0 B3 0 G3 0 B4 0 G4 0 __m128i BGLL = _mm_unpacklo_epi8(BGL, Zero); // BGLH : B5 0 G5 0 B6 0 G6 0 B7 0 G7 0 B8 0 G8 0 __m128i BGLH = _mm_unpackhi_epi8(BGL, Zero); // RCLL : R1 Half Half Half R2 Half Half Half R3 Half Half Half R4 Half Half Half __m128i RCLL = _mm_or_si128(_mm_unpacklo_epi8(RCL, Zero), Half); // RCLH : R5 Half Half Half R6 Half Half Half R7 Half Half Half R8 Half Half Half __m128i RCLH = _mm_or_si128(_mm_unpackhi_epi8(RCL, Zero), Half); // BGHL : B9 0 G9 0 B10 0 G10 0 B11 0 G11 0 B12 0 G12 0 __m128i BGHL = _mm_unpacklo_epi8(BGH, Zero); // BGHH : B13 0 G13 0 B14 0 G14 0 B15 0 G15 0 B16 0 G16 0 __m128i BGHH = _mm_unpackhi_epi8(BGH, Zero); // RCHL : R9 Half Half Half R10 Half Half Half R11 Half Half Half R12 Half Half Half __m128i RCHL = _mm_or_si128(_mm_unpacklo_epi8(RCH, Zero), Half); // RCHH : R13 Half Half Half R14 Half Half Half R15 Half Half Half R16 Half Half Half __m128i RCHH = _mm_or_si128(_mm_unpackhi_epi8(RCH, Zero), Half); // __m128i Y_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_YBG), _mm_madd_epi16(RCLL, Weight_YRC)), Shift); __m128i Y_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_YBG), _mm_madd_epi16(RCLH, Weight_YRC)), Shift); __m128i Y_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_YBG), _mm_madd_epi16(RCHL, Weight_YRC)), Shift); __m128i Y_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_YBG), _mm_madd_epi16(RCHH, 
Weight_YRC)), Shift); _mm_storeu_si128((__m128i*)(LinePY + XX), _mm_packus_epi16(_mm_packus_epi32(Y_LL, Y_LH), _mm_packus_epi32(Y_HL, Y_HH))); __m128i U_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_UBG), _mm_madd_epi16(RCLL, Weight_URC)), Shift); __m128i U_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_UBG), _mm_madd_epi16(RCLH, Weight_URC)), Shift); __m128i U_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_UBG), _mm_madd_epi16(RCHL, Weight_URC)), Shift); __m128i U_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_UBG), _mm_madd_epi16(RCHH, Weight_URC)), Shift); _mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(U_LL, U_LH), _mm_packus_epi32(U_HL, U_HH))); __m128i V_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_VBG), _mm_madd_epi16(RCLL, Weight_VRC)), Shift); __m128i V_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_VBG), _mm_madd_epi16(RCLH, Weight_VRC)), Shift); __m128i V_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_VBG), _mm_madd_epi16(RCHL, Weight_VRC)), Shift); __m128i V_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_VBG), _mm_madd_epi16(RCHH, Weight_VRC)), Shift); _mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(V_LL, V_LH), _mm_packus_epi32(V_HL, V_HH))); } for (int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3) { int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2]; LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + Y_C_WT * HalfV) >> Shift; LinePU[XX] = (U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + U_C_WT * HalfV) >> Shift; LinePV[XX] = (V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + V_C_WT * HalfV) >> Shift; } } } void YUV2RGBSSE_3(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) { const int Shift = 13; const int HalfV = 1 << (Shift - 1); const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << 
Shift), B_V_WT = 0; const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift); const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983 * (1 << Shift); __m128i Weight_B_Y = _mm_set1_epi32(B_Y_WT), Weight_B_U = _mm_set1_epi32(B_U_WT), Weight_B_V = _mm_set1_epi32(B_V_WT); __m128i Weight_G_Y = _mm_set1_epi32(G_Y_WT), Weight_G_U = _mm_set1_epi32(G_U_WT), Weight_G_V = _mm_set1_epi32(G_V_WT); __m128i Weight_R_Y = _mm_set1_epi32(R_Y_WT), Weight_R_U = _mm_set1_epi32(R_U_WT), Weight_R_V = _mm_set1_epi32(R_V_WT); __m128i Half = _mm_set1_epi32(HalfV); __m128i C128 = _mm_set1_epi32(128); __m128i Zero = _mm_setzero_si128(); const int BlockSize = 16, Block = Width / BlockSize; for (int YY = 0; YY < Height; YY++) { unsigned char *LinePD = RGB + YY * Stride; unsigned char *LinePY = Y + YY * Width; unsigned char *LinePU = U + YY * Width; unsigned char *LinePV = V + YY * Width; for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePY += BlockSize, LinePU += BlockSize, LinePV += BlockSize) { __m128i Blue, Green, Red, YV, UV, VV, Dest1, Dest2, Dest3; YV = _mm_loadu_si128((__m128i *)(LinePY + 0)); UV = _mm_loadu_si128((__m128i *)(LinePU + 0)); VV = _mm_loadu_si128((__m128i *)(LinePV + 0)); __m128i YV16L = _mm_unpacklo_epi8(YV, Zero); __m128i YV16H = _mm_unpackhi_epi8(YV, Zero); __m128i YV32LL = _mm_unpacklo_epi16(YV16L, Zero); __m128i YV32LH = _mm_unpackhi_epi16(YV16L, Zero); __m128i YV32HL = _mm_unpacklo_epi16(YV16H, Zero); __m128i YV32HH = _mm_unpackhi_epi16(YV16H, Zero); __m128i UV16L = _mm_unpacklo_epi8(UV, Zero); __m128i UV16H = _mm_unpackhi_epi8(UV, Zero); __m128i UV32LL = _mm_unpacklo_epi16(UV16L, Zero); __m128i UV32LH = _mm_unpackhi_epi16(UV16L, Zero); __m128i UV32HL = _mm_unpacklo_epi16(UV16H, Zero); __m128i UV32HH = _mm_unpackhi_epi16(UV16H, Zero); UV32LL = _mm_sub_epi32(UV32LL, C128); UV32LH = _mm_sub_epi32(UV32LH, C128); UV32HL = _mm_sub_epi32(UV32HL, C128); UV32HH = _mm_sub_epi32(UV32HH, C128); __m128i VV16L = 
_mm_unpacklo_epi8(VV, Zero); __m128i VV16H = _mm_unpackhi_epi8(VV, Zero); __m128i VV32LL = _mm_unpacklo_epi16(VV16L, Zero); __m128i VV32LH = _mm_unpackhi_epi16(VV16L, Zero); __m128i VV32HL = _mm_unpacklo_epi16(VV16H, Zero); __m128i VV32HH = _mm_unpackhi_epi16(VV16H, Zero); VV32LL = _mm_sub_epi32(VV32LL, C128); VV32LH = _mm_sub_epi32(VV32LH, C128); VV32HL = _mm_sub_epi32(VV32HL, C128); VV32HH = _mm_sub_epi32(VV32HH, C128); __m128i LL_B = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LL, Weight_B_U)), Shift)); __m128i LH_B = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32LH, Weight_B_U)), Shift)); __m128i HL_B = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HL, Weight_B_U)), Shift)); __m128i HH_B = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(UV32HH, Weight_B_U)), Shift)); Blue = _mm_packus_epi16(_mm_packus_epi32(LL_B, LH_B), _mm_packus_epi32(HL_B, HH_B)); __m128i LL_G = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LL), _mm_mullo_epi32(Weight_G_V, VV32LL))), Shift)); __m128i LH_G = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LH), _mm_mullo_epi32(Weight_G_V, VV32LH))), Shift)); __m128i HL_G = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HL), _mm_mullo_epi32(Weight_G_V, VV32HL))), Shift)); __m128i HH_G = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HH), _mm_mullo_epi32(Weight_G_V, VV32HH))), Shift)); Green = _mm_packus_epi16(_mm_packus_epi32(LL_G, LH_G), _mm_packus_epi32(HL_G, HH_G)); __m128i LL_R = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LL, Weight_R_V)), Shift)); __m128i LH_R = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32LH, Weight_R_V)), Shift)); 
__m128i HL_R = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HL, Weight_R_V)), Shift)); __m128i HH_R = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half, _mm_mullo_epi32(VV32HH, Weight_R_V)), Shift)); Red = _mm_packus_epi16(_mm_packus_epi32(LL_R, LH_R), _mm_packus_epi32(HL_R, HH_R)); Dest1 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5)); Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1))); Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1))); Dest2 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1)); Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10))); Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1))); Dest3 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1)); Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1))); Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15))); _mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3), Dest1); _mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize), Dest2); _mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize * 2), Dest3); } /* Remainder pixels. The SIMD loop already advanced LinePY/LinePU/LinePV, so they are read at offset 0; LinePD holds 3 bytes per pixel. */ for (int XX = Block * BlockSize; XX < Width; XX++, LinePU++, LinePV++, LinePY++) { int YV = LinePY[0], UV = LinePU[0] - 128, VV = LinePV[0] - 128; LinePD[XX * 3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift)); LinePD[XX * 3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift)); LinePD[XX * 3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift)); } } } const int Shift = 13; // No coefficient here exceeds 1 in absolute value, so the scale factor could be as large as 2^15.
const int HalfV = 1 << (Shift - 1); const int Y_B_WT = 0.114f * (1 << Shift), Y_G_WT = 0.587f * (1 << Shift), Y_R_WT = (1 << Shift) - Y_B_WT - Y_G_WT, Y_C_WT = 1; const int U_B_WT = 0.436f * (1 << Shift), U_G_WT = -0.28886f * (1 << Shift), U_R_WT = -(U_B_WT + U_G_WT), U_C_WT = 257; const int V_B_WT = -0.10001f * (1 << Shift), V_G_WT = -0.51499f * (1 << Shift), V_R_WT = -(V_B_WT + V_G_WT), V_C_WT = 257; __m128i Weight_YBG = _mm_setr_epi16(Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT, Y_B_WT, Y_G_WT); __m128i Weight_YRC = _mm_setr_epi16(Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT, Y_R_WT, Y_C_WT); __m128i Weight_UBG = _mm_setr_epi16(U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT, U_B_WT, U_G_WT); __m128i Weight_URC = _mm_setr_epi16(U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT, U_R_WT, U_C_WT); __m128i Weight_VBG = _mm_setr_epi16(V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT, V_B_WT, V_G_WT); __m128i Weight_VRC = _mm_setr_epi16(V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT, V_R_WT, V_C_WT); __m128i Half1 = _mm_setr_epi16(0, HalfV, 0, HalfV, 0, HalfV, 0, HalfV); __m128i Zero = _mm_setzero_si128(); const int B_Y_WT = 1 << Shift, B_U_WT = 2.03211f * (1 << Shift), B_V_WT = 0; const int G_Y_WT = 1 << Shift, G_U_WT = -0.39465f * (1 << Shift), G_V_WT = -0.58060f * (1 << Shift); const int R_Y_WT = 1 << Shift, R_U_WT = 0, R_V_WT = 1.13983f * (1 << Shift); __m128i Weight_B_Y = _mm_set1_epi32(B_Y_WT), Weight_B_U = _mm_set1_epi32(B_U_WT), Weight_B_V = _mm_set1_epi32(B_V_WT); __m128i Weight_G_Y = _mm_set1_epi32(G_Y_WT), Weight_G_U = _mm_set1_epi32(G_U_WT), Weight_G_V = _mm_set1_epi32(G_V_WT); __m128i Weight_R_Y = _mm_set1_epi32(R_Y_WT), Weight_R_U = _mm_set1_epi32(R_U_WT), Weight_R_V = _mm_set1_epi32(R_V_WT); __m128i Half2 = _mm_set1_epi32(HalfV); __m128i C128 = _mm_set1_epi32(128); int BlockSize, Block; void _RGB2YUV(unsigned char *RGB, const int32_t 
Width, const int32_t Height, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char *Y, unsigned char *U, unsigned char *V) { for (int YY = start_row; YY < start_row + thread_stride; YY++) { unsigned char *LinePS = RGB + YY * Stride; unsigned char *LinePY = Y + YY * Width; unsigned char *LinePU = U + YY * Width; unsigned char *LinePV = V + YY * Width; for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePS += BlockSize * 3) { __m128i Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0)); __m128i Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16)); __m128i Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32)); // Src1 : B1 G1 R1 B2 G2 R2 B3 G3 R3 B4 G4 R4 B5 G5 R5 B6 // Src2 : G6 R6 B7 G7 R7 B8 G8 R8 B9 G9 R9 B10 G10 R10 B11 G11 // Src3 : R11 B12 G12 R12 B13 G13 R13 B14 G14 R14 B15 G15 R15 B16 G16 R16 // BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 0 0 0 0 0 __m128i BGL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, -1, -1, -1, -1, -1)); // BGL : B1 G1 B2 G2 B3 G3 B4 G4 B5 G5 B6 G6 B7 G7 B8 G8 BGL = _mm_or_si128(BGL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 2, 3, 5, 6))); // BGH : B9 G9 B10 G10 B11 G11 0 0 0 0 0 0 0 0 0 0 __m128i BGH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(8, 9, 11, 12, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); // BGH : B9 G9 B10 G10 B11 G11 B12 G12 B13 G13 B14 G14 B15 G15 B16 G16 BGH = _mm_or_si128(BGH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 1, 2, 4, 5, 7, 8, 10, 11, 13, 14))); // RCL : R1 0 R2 0 R3 0 R4 0 R5 0 0 0 0 0 0 0 __m128i RCL = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, -1, 5, -1, 8, -1, 11, -1, 14, -1, -1, -1, -1, -1, -1, -1)); // RCL : R1 0 R2 0 R3 0 R4 0 R5 0 R6 0 R7 0 R8 0 RCL = _mm_or_si128(RCL, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, 4, -1, 7, -1))); // RCH : R9 0 R10 0 0 0 0 0 0 0 0 0 0 0 0 0 __m128i RCH = _mm_shuffle_epi8(Src2, _mm_setr_epi8(10, -1, 13, -1, -1, -1, -1, 
-1, -1, -1, -1, -1, -1, -1, -1, -1)); // RCH : R9 0 R10 0 R11 0 R12 0 R13 0 R14 0 R15 0 R16 0 RCH = _mm_or_si128(RCH, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, 0, -1, 3, -1, 6, -1, 9, -1, 12, -1, 15, -1))); // BGLL : B1 0 G1 0 B2 0 G2 0 B3 0 G3 0 B4 0 G4 0 __m128i BGLL = _mm_unpacklo_epi8(BGL, Zero); // BGLH : B5 0 G5 0 B6 0 G6 0 B7 0 G7 0 B8 0 G8 0 __m128i BGLH = _mm_unpackhi_epi8(BGL, Zero); // RCLL : R1 Half Half Half R2 Half Half Half R3 Half Half Half R4 Half Half Half __m128i RCLL = _mm_or_si128(_mm_unpacklo_epi8(RCL, Zero), Half1); // RCLH : R5 Half Half Half R6 Half Half Half R7 Half Half Half R8 Half Half Half __m128i RCLH = _mm_or_si128(_mm_unpackhi_epi8(RCL, Zero), Half1); // BGHL : B9 0 G9 0 B10 0 G10 0 B11 0 G11 0 B12 0 G12 0 __m128i BGHL = _mm_unpacklo_epi8(BGH, Zero); // BGHH : B13 0 G13 0 B14 0 G14 0 B15 0 G15 0 B16 0 G16 0 __m128i BGHH = _mm_unpackhi_epi8(BGH, Zero); // RCHL : R9 Half Half Half R10 Half Half Half R11 Half Half Half R12 Half Half Half __m128i RCHL = _mm_or_si128(_mm_unpacklo_epi8(RCH, Zero), Half1); // RCHH : R13 Half Half Half R14 Half Half Half R15 Half Half Half R16 Half Half Half __m128i RCHH = _mm_or_si128(_mm_unpackhi_epi8(RCH, Zero), Half1); // __m128i Y_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_YBG), _mm_madd_epi16(RCLL, Weight_YRC)), Shift); __m128i Y_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_YBG), _mm_madd_epi16(RCLH, Weight_YRC)), Shift); __m128i Y_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_YBG), _mm_madd_epi16(RCHL, Weight_YRC)), Shift); __m128i Y_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_YBG), _mm_madd_epi16(RCHH, Weight_YRC)), Shift); _mm_storeu_si128((__m128i*)(LinePY + XX), _mm_packus_epi16(_mm_packus_epi32(Y_LL, Y_LH), _mm_packus_epi32(Y_HL, Y_HH))); __m128i U_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_UBG), _mm_madd_epi16(RCLL, Weight_URC)), Shift); __m128i U_LH = 
_mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_UBG), _mm_madd_epi16(RCLH, Weight_URC)), Shift); __m128i U_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_UBG), _mm_madd_epi16(RCHL, Weight_URC)), Shift); __m128i U_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_UBG), _mm_madd_epi16(RCHH, Weight_URC)), Shift); _mm_storeu_si128((__m128i*)(LinePU + XX), _mm_packus_epi16(_mm_packus_epi32(U_LL, U_LH), _mm_packus_epi32(U_HL, U_HH))); __m128i V_LL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLL, Weight_VBG), _mm_madd_epi16(RCLL, Weight_VRC)), Shift); __m128i V_LH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGLH, Weight_VBG), _mm_madd_epi16(RCLH, Weight_VRC)), Shift); __m128i V_HL = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHL, Weight_VBG), _mm_madd_epi16(RCHL, Weight_VRC)), Shift); __m128i V_HH = _mm_srai_epi32(_mm_add_epi32(_mm_madd_epi16(BGHH, Weight_VBG), _mm_madd_epi16(RCHH, Weight_VRC)), Shift); _mm_storeu_si128((__m128i*)(LinePV + XX), _mm_packus_epi16(_mm_packus_epi32(V_LL, V_LH), _mm_packus_epi32(V_HL, V_HH))); } for (int XX = Block * BlockSize; XX < Width; XX++, LinePS += 3) { int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2]; LinePY[XX] = (Y_B_WT * Blue + Y_G_WT * Green + Y_R_WT * Red + Y_C_WT * HalfV) >> Shift; LinePU[XX] = (U_B_WT * Blue + U_G_WT * Green + U_R_WT * Red + U_C_WT * HalfV) >> Shift; LinePV[XX] = (V_B_WT * Blue + V_G_WT * Green + V_R_WT * Red + V_C_WT * HalfV) >> Shift; } } } void _YUV2RGB(const int32_t Width, const int32_t Height, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB) { for (int YY = start_row; YY < start_row + thread_stride; YY++){ unsigned char *LinePD = RGB + YY * Stride; unsigned char *LinePY = Y + YY * Width; unsigned char *LinePU = U + YY * Width; unsigned char *LinePV = V + YY * Width; for (int XX = 0; XX < Block * BlockSize; XX += BlockSize, LinePY += BlockSize, LinePU += 
BlockSize, LinePV += BlockSize) { __m128i Blue, Green, Red, YV, UV, VV, Dest1, Dest2, Dest3; YV = _mm_loadu_si128((__m128i *)(LinePY + 0)); UV = _mm_loadu_si128((__m128i *)(LinePU + 0)); VV = _mm_loadu_si128((__m128i *)(LinePV + 0)); __m128i YV16L = _mm_unpacklo_epi8(YV, Zero); __m128i YV16H = _mm_unpackhi_epi8(YV, Zero); __m128i YV32LL = _mm_unpacklo_epi16(YV16L, Zero); __m128i YV32LH = _mm_unpackhi_epi16(YV16L, Zero); __m128i YV32HL = _mm_unpacklo_epi16(YV16H, Zero); __m128i YV32HH = _mm_unpackhi_epi16(YV16H, Zero); __m128i UV16L = _mm_unpacklo_epi8(UV, Zero); __m128i UV16H = _mm_unpackhi_epi8(UV, Zero); __m128i UV32LL = _mm_unpacklo_epi16(UV16L, Zero); __m128i UV32LH = _mm_unpackhi_epi16(UV16L, Zero); __m128i UV32HL = _mm_unpacklo_epi16(UV16H, Zero); __m128i UV32HH = _mm_unpackhi_epi16(UV16H, Zero); UV32LL = _mm_sub_epi32(UV32LL, C128); UV32LH = _mm_sub_epi32(UV32LH, C128); UV32HL = _mm_sub_epi32(UV32HL, C128); UV32HH = _mm_sub_epi32(UV32HH, C128); __m128i VV16L = _mm_unpacklo_epi8(VV, Zero); __m128i VV16H = _mm_unpackhi_epi8(VV, Zero); __m128i VV32LL = _mm_unpacklo_epi16(VV16L, Zero); __m128i VV32LH = _mm_unpackhi_epi16(VV16L, Zero); __m128i VV32HL = _mm_unpacklo_epi16(VV16H, Zero); __m128i VV32HH = _mm_unpackhi_epi16(VV16H, Zero); VV32LL = _mm_sub_epi32(VV32LL, C128); VV32LH = _mm_sub_epi32(VV32LH, C128); VV32HL = _mm_sub_epi32(VV32HL, C128); VV32HH = _mm_sub_epi32(VV32HH, C128); __m128i LL_B = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32LL, Weight_B_U)), Shift)); __m128i LH_B = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32LH, Weight_B_U)), Shift)); __m128i HL_B = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32HL, Weight_B_U)), Shift)); __m128i HH_B = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(UV32HH, Weight_B_U)), Shift)); Blue = _mm_packus_epi16(_mm_packus_epi32(LL_B, LH_B), _mm_packus_epi32(HL_B, HH_B)); __m128i LL_G = 
_mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LL), _mm_mullo_epi32(Weight_G_V, VV32LL))), Shift)); __m128i LH_G = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32LH), _mm_mullo_epi32(Weight_G_V, VV32LH))), Shift)); __m128i HL_G = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HL), _mm_mullo_epi32(Weight_G_V, VV32HL))), Shift)); __m128i HH_G = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_add_epi32(_mm_mullo_epi32(Weight_G_U, UV32HH), _mm_mullo_epi32(Weight_G_V, VV32HH))), Shift)); Green = _mm_packus_epi16(_mm_packus_epi32(LL_G, LH_G), _mm_packus_epi32(HL_G, HH_G)); __m128i LL_R = _mm_add_epi32(YV32LL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32LL, Weight_R_V)), Shift)); __m128i LH_R = _mm_add_epi32(YV32LH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32LH, Weight_R_V)), Shift)); __m128i HL_R = _mm_add_epi32(YV32HL, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32HL, Weight_R_V)), Shift)); __m128i HH_R = _mm_add_epi32(YV32HH, _mm_srai_epi32(_mm_add_epi32(Half2, _mm_mullo_epi32(VV32HH, Weight_R_V)), Shift)); Red = _mm_packus_epi16(_mm_packus_epi32(LL_R, LH_R), _mm_packus_epi32(HL_R, HH_R)); Dest1 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5)); Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1))); Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1))); Dest2 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1)); Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10))); Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Red, _mm_setr_epi8(-1, 5, -1, -1, 6, 
-1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1))); Dest3 = _mm_shuffle_epi8(Blue, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1)); Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1))); Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15))); _mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3), Dest1); _mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize), Dest2); _mm_storeu_si128((__m128i*)(LinePD + (XX / BlockSize) * BlockSize * 3 + BlockSize * 2), Dest3); } /* Remainder pixels. The SIMD loop already advanced LinePY/LinePU/LinePV, so they are read at offset 0; LinePD holds 3 bytes per pixel. */ for (int XX = Block * BlockSize; XX < Width; XX++, LinePU++, LinePV++, LinePY++) { int YV = LinePY[0], UV = LinePU[0] - 128, VV = LinePV[0] - 128; LinePD[XX * 3 + 0] = ClampToByte(YV + ((B_U_WT * UV + HalfV) >> Shift)); LinePD[XX * 3 + 1] = ClampToByte(YV + ((G_U_WT * UV + G_V_WT * VV + HalfV) >> Shift)); LinePD[XX * 3 + 2] = ClampToByte(YV + ((R_V_WT * VV + HalfV) >> Shift)); } } } void RGB2YUVSSE_4(unsigned char *RGB, unsigned char *Y, unsigned char *U, unsigned char *V, int Width, int Height, int Stride) { BlockSize = 16, Block = (Width) / BlockSize; /* At least one worker, so thread_stride never divides by zero when Height < 16. */ const int32_t hw_concur = std::max<int32_t>(1, std::min<int32_t>(Height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency()))); std::vector<std::future<void>> fut(hw_concur); const int thread_stride = (Height - 1) / hw_concur + 1; int i = 0, start = 0; /* Clamp the last band so no worker touches rows past Height. */ for (; i < std::min(Height, hw_concur) && start < Height; i++, start += thread_stride) { fut[i] = std::async(std::launch::async, _RGB2YUV, RGB, Width, Height, start, std::min(thread_stride, Height - start), Stride, Y, U, V); } for (int j = 0; j < i; ++j) fut[j].wait(); } void YUV2RGBSSE_4(unsigned char *Y, unsigned char *U, unsigned char *V, unsigned char *RGB, int Width, int Height, int Stride) { BlockSize = 16, Block = (Width) / BlockSize; const int32_t hw_concur = std::max<int32_t>(1, std::min<int32_t>(Height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency()))); std::vector<std::future<void>> fut(hw_concur); const int thread_stride 
= (Height - 1) / hw_concur + 1; int i = 0, start = 0; /* Clamp the last band so no worker touches rows past Height. */ for (; i < std::min(Height, hw_concur) && start < Height; i++, start += thread_stride) { fut[i] = std::async(std::launch::async, _YUV2RGB, Width, Height, start, std::min(thread_stride, Height - start), Stride, Y, U, V, RGB); } for (int j = 0; j < i; ++j) fut[j].wait(); } int main() { Mat src = imread("F:\\car.jpg"); int Height = src.rows; int Width = src.cols; unsigned char *Src = src.data; unsigned char *Dest = new unsigned char[Height * Width * 3]; unsigned char *Y = new unsigned char[Height * Width]; unsigned char *U = new unsigned char[Height * Width]; unsigned char *V = new unsigned char[Height * Width]; int Stride = Width * 3; int64 st = cv::getTickCount(); for (int i = 0; i < 1000; i++) { RGB2YUVSSE_4(Src, Y, U, V, Width, Height, Stride); YUV2RGBSSE_4(Y, U, V, Dest, Width, Height, Stride); } double duration = (cv::getTickCount() - st) / cv::getTickFrequency(); printf("%.5f\n", duration); RGB2YUVSSE_4(Src, Y, U, V, Width, Height, Stride); YUV2RGBSSE_4(Y, U, V, Dest, Width, Height, Stride); Mat dst(Height, Width, CV_8UC3, Dest); imshow("origin", src); imshow("result", dst); imwrite("F:\\res.jpg", dst); waitKey(0); }
================================================
FILE: speed_skin_detection_sse.cpp
================================================
#include "stdafx.h"
// The header names were lost from this listing; the includes below are an assumption based on what the code in this file uses.
#include <stdio.h>
#include <algorithm>
#include <future>
#include <thread>
#include <vector>
#include <immintrin.h>
#include <opencv2/opencv.hpp>
using namespace std; using namespace cv;
#define IM_Max(a, b) (((a) >= (b)) ? (a): (b))
#define IM_Min(a, b) (((a) >= (b)) ? 
(b): (a))
#define _mm_cmpge_epu8(a, b) _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)
void IM_GetRoughSkinRegion(unsigned char *Src, unsigned char *Skin, int Width, int Height, int Stride) { for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Skin + Y * Width; for (int X = 0; X < Width; X++) { int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2]; if (Red >= 60 && Green >= 40 && Blue >= 20 && Red >= Blue && (Red - Green) >= 10 && IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10) LinePD[X] = 255; else LinePD[X] = 16; LinePS += 3; } } } void IM_GetRoughSkinRegion_OpenMP(unsigned char *Src, unsigned char *Skin, int Width, int Height, int Stride) { for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Skin + Y * Width;
#pragma omp parallel for num_threads(4)
for (int X = 0; X < Width; X++) { int Blue = LinePS[X*3 + 0], Green = LinePS[X*3 + 1], Red = LinePS[X*3 + 2]; if (Red >= 60 && Green >= 40 && Blue >= 20 && Red >= Blue && (Red - Green) >= 10 && IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10) LinePD[X] = 255; else LinePD[X] = 16; } } } void IM_GetRoughSkinRegion_SSE(unsigned char *Src, unsigned char *Skin, int Width, int Height, int Stride) { const int NonSkinLevel = 16; // Value written for non-skin pixels; this example uses 16 (hard-coded below). At the maximum of 100 every region would count as skin, which is meaningless.
const int BlockSize = 16; int Block = Width / BlockSize; for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Skin + Y * Width; for (int X = 0; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize) { __m128i Src1, Src2, Src3, Blue, Green, Red, Result, Max, Min, AbsDiff; Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0)); Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16)); Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32)); Blue = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); Blue = 
_mm_or_si128(Blue, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1))); Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13))); Green = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1))); Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14))); Red = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1))); Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15))); Max = _mm_max_epu8(_mm_max_epu8(Blue, Green), Red); //IM_Max(IM_Max(Red, Green), Blue) Min = _mm_min_epu8(_mm_min_epu8(Blue, Green), Red); //IM_Min(IM_Min(Red, Green), Blue) Result = _mm_cmpge_epu8(Blue, _mm_set1_epi8(20)); //Blue >= 20 Result = _mm_and_si128(Result, _mm_cmpge_epu8(Green, _mm_set1_epi8(40))); //Green >= 40 Result = _mm_and_si128(Result, _mm_cmpge_epu8(Red, _mm_set1_epi8(60))); //Red >= 60 Result = _mm_and_si128(Result, _mm_cmpge_epu8(Red, Blue)); //Red >= Blue Result = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Red, Green), _mm_set1_epi8(10))); //(Red - Green) >= 10 Result = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Max, Min), _mm_set1_epi8(10))); //IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10 Result = _mm_or_si128(Result, _mm_set1_epi8(16)); _mm_storeu_si128((__m128i*)(LinePD + 0), Result); } for (int X = Block * BlockSize; X < Width; X++, LinePS += 3, LinePD++) { int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2]; if (Red >= 60 && Green >= 40 && 
Blue >= 20 && Red >= Blue && (Red - Green) >= 10 && IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10) LinePD[0] = 255; // skin pixel
else LinePD[0] = 16; } } } void _IM_GetRoughSkinRegion(unsigned char* Src, const int32_t Width, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char* Dest) { const int NonSkinLevel = 16; // Value written for non-skin pixels; this example uses 16 (hard-coded below). At the maximum of 100 every region would count as skin, which is meaningless.
const int BlockSize = 16; int Block = Width / BlockSize; for (int Y = start_row; Y < start_row + thread_stride; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Width; for (int X = 0; X < Block * BlockSize; X += BlockSize, LinePS += BlockSize * 3, LinePD += BlockSize) { __m128i Src1, Src2, Src3, Blue, Green, Red, Result, Max, Min, AbsDiff; Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0)); Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16)); Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32)); Blue = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1))); Blue = _mm_or_si128(Blue, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13))); Green = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1))); Green = _mm_or_si128(Green, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14))); Red = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1))); Red = _mm_or_si128(Red, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, 
-1, -1, 0, 3, 6, 9, 12, 15)));
			Max = _mm_max_epu8(_mm_max_epu8(Blue, Green), Red);		//IM_Max(IM_Max(Red, Green), Blue)
			Min = _mm_min_epu8(_mm_min_epu8(Blue, Green), Red);		//IM_Min(IM_Min(Red, Green), Blue)
			Result = _mm_cmpge_epu8(Blue, _mm_set1_epi8(20));		//Blue >= 20
			Result = _mm_and_si128(Result, _mm_cmpge_epu8(Green, _mm_set1_epi8(40)));	//Green >= 40
			Result = _mm_and_si128(Result, _mm_cmpge_epu8(Red, _mm_set1_epi8(60)));	//Red >= 60
			Result = _mm_and_si128(Result, _mm_cmpge_epu8(Red, Blue));	//Red >= Blue
			// Saturating subtraction gives 0 when Red < Green, so the test fails exactly as in the scalar code.
			Result = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Red, Green), _mm_set1_epi8(10)));	//(Red - Green) >= 10
			Result = _mm_and_si128(Result, _mm_cmpge_epu8(_mm_subs_epu8(Max, Min), _mm_set1_epi8(10)));	//IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10
			// Skin pixels are already 255 (all bits set); OR-ing in 16 turns the 0 of non-skin pixels into 16.
			Result = _mm_or_si128(Result, _mm_set1_epi8(16));
			_mm_storeu_si128((__m128i*)(LinePD + 0), Result);
		}
		// Scalar tail for the pixels left over after the 16-pixel blocks.
		for (int X = Block * BlockSize; X < Width; X++, LinePS += 3, LinePD++)
		{
			int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
			if (Red >= 60 && Green >= 40 && Blue >= 20 && Red >= Blue && (Red - Green) >= 10 && IM_Max(IM_Max(Red, Green), Blue) - IM_Min(IM_Min(Red, Green), Blue) >= 10)
				LinePD[0] = 255;	// skin pixel
			else
				LinePD[0] = 16;		// non-skin pixel
		}
	}
}

void IM_GetRoughSkinRegion_SSE2(unsigned char *Src, unsigned char *Skin, int width, int height, int stride)
{
	// At most one task per 16 rows, and always at least one task, so the division below never divides by zero.
	const int32_t hw_concur = std::max(1, std::min(height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency())));
	std::vector<std::future<void>> fut(hw_concur);
	const int thread_stride = (height - 1) / hw_concur + 1;
	int i = 0, start = 0;
	for (; i < hw_concur && start < height; i++, start += thread_stride)
	{
		// Clamp the last band so it never runs past the final row.
		const int rows = std::min(thread_stride, height - start);
		fut[i] = std::async(std::launch::async, _IM_GetRoughSkinRegion, Src, width, start, rows, stride, Skin);
	}
	for (int j = 0; j < i; ++j)
		fut[j].wait();
}

void IM_GrayToRGB(unsigned char *Gray, unsigned char *RGB, int Width, int Height, int Stride)
{
	for (int Y = 0; Y < Height; Y++)
	{
		unsigned char *LinePS = Gray + Y * Width;	// start of row Y in the gray source image
		unsigned char
*LinePD = RGB + Y * Stride;	// start of row Y in the RGB output image
		for (int X = 0; X < Width; X++)
		{
			LinePD[0] = LinePD[1] = LinePD[2] = LinePS[X];
			LinePD += 3;
		}
	}
}

int main()
{
	Mat src = imread("F:\\face.jpg");
	int Height = src.rows;
	int Width = src.cols;
	unsigned char *Src = src.data;
	unsigned char *Skin = new unsigned char[Height * Width];
	unsigned char *Dest = new unsigned char[Height * Width * 3];
	int Stride = Width * 3;
	int64 st = cv::getTickCount();
	for (int i = 0; i < 1000; i++)
	{
		IM_GetRoughSkinRegion_SSE2(Src, Skin, Width, Height, Stride);
		//IM_GrayToRGB(Skin, Dest, Width, Height, Stride);
	}
	// Total seconds over 1000 runs, which is numerically the average number of milliseconds per run.
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency();
	printf("%.5f\n", duration);
	IM_GetRoughSkinRegion_SSE2(Src, Skin, Width, Height, Stride);
	IM_GrayToRGB(Skin, Dest, Width, Height, Stride);
	Mat dst(Height, Width, CV_8UC3, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
	delete[] Skin;
	delete[] Dest;
}

================================================
FILE: speed_sobel_edgedetection_sse.cpp
================================================
// The original header names were lost in extraction; these are the headers the code below needs.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <algorithm>
#include <future>
#include <thread>
#include <vector>
#include <opencv2/opencv.hpp>
#include <immintrin.h>
using namespace std;
using namespace cv;

inline unsigned char IM_ClampToByte(int Value)
{
	if (Value < 0)
		return 0;
	else if (Value > 255)
		return 255;
	else
		return (unsigned char)Value;
	//return ((Value | ((signed int)(255 - Value) >> 31)) & ~((signed int)Value >> 31));
}

void Sobel_FLOAT(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride)
{
	int Channel = Stride / Width;
	unsigned char *RowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel);
	unsigned char *First = RowCopy;
	unsigned char *Second = RowCopy + (Width + 2) * Channel;
	unsigned char *Third = RowCopy + (Width + 2) * 2 * Channel;
	// Copy the second (current) row, replicating the edge pixels as padding
	memcpy(Second, Src, Channel);
	memcpy(Second + Channel, Src, Width*Channel);
	memcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel);
	// The first row starts out identical to the second
	memcpy(First, Second, (Width + 2) * Channel);
	// Copy the third row, replicating the edge pixels as padding
memcpy(Third, Src + Stride, Channel); memcpy(Third + Channel, Src + Stride, Width * Channel); memcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel); for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; if (Y != 0) { unsigned char *Temp = First; First = Second; Second = Third; Third = Temp; } if (Y == Height - 1) { memcpy(Third, Second, (Width + 2) * Channel); } else { memcpy(Third, Src + (Y + 1) * Stride, Channel); memcpy(Third + Channel, Src + (Y + 1) * Stride, Width * Channel); // 由于备份了前面一行的数据,这里即使Src和Dest相同也是没有问题的 memcpy(Third + (Width + 1) * Channel, Src + (Y + 1) * Stride + (Width - 1) * Channel, Channel); } if (Channel == 1) { for (int X = 0; X < Width; X++) { int GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2]; LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F)); } } else { for (int X = 0; X < Width * 3; X++) { int GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6]; int GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6]; LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F)); } } } free(RowCopy); } void Sobel_INT(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { int Channel = Stride / Width; unsigned char *RowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel); unsigned char *First = RowCopy; unsigned char *Second = RowCopy + (Width + 2) * Channel; unsigned char *Third = RowCopy + (Width + 2) * 2 * Channel; //拷贝第二行数据,边界值填充 memcpy(Second, Src, Channel); memcpy(Second + Channel, Src, Width*Channel); memcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel); //第一行和第二行一样 memcpy(First, Second, (Width + 2) * Channel); //拷贝第三行数据,边界值填充 memcpy(Third, Src + Stride, Channel); memcpy(Third + Channel, 
Src + Stride, Width * Channel); memcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel); unsigned char Table[65026]; for (int Y = 0; Y < 65026; Y++) Table[Y] = (sqrtf(Y + 0.0f) + 0.5f); for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; if (Y != 0) { unsigned char *Temp = First; First = Second; Second = Third; Third = Temp; } if (Y == Height - 1) { memcpy(Third, Second, (Width + 2) * Channel); } else { memcpy(Third, Src + (Y + 1) * Stride, Channel); memcpy(Third + Channel, Src + (Y + 1) * Stride, Width * Channel); // 由于备份了前面一行的数据,这里即使Src和Dest相同也是没有问题的 memcpy(Third + (Width + 1) * Channel, Src + (Y + 1) * Stride + (Width - 1) * Channel, Channel); } if (Channel == 1) { for (int X = 0; X < Width; X++) { int GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2]; LinePD[X] = Table[min(GX * GX + GY * GY, 65025)]; } } else { for (int X = 0; X < Width * 3; X++) { int GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6]; int GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6]; LinePD[X] = Table[min(GX * GX + GY * GY, 65025)]; } } } free(RowCopy); } void Sobel_SSE1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { int Channel = Stride / Width; unsigned char *RowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel); unsigned char *First = RowCopy; unsigned char *Second = RowCopy + (Width + 2) * Channel; unsigned char *Third = RowCopy + (Width + 2) * 2 * Channel; //拷贝第二行数据,边界值填充 memcpy(Second, Src, Channel); memcpy(Second + Channel, Src, Width*Channel); memcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel); //第一行和第二行一样 memcpy(First, Second, (Width + 2) * Channel); //拷贝第三行数据,边界值填充 memcpy(Third, Src + Stride, Channel); memcpy(Third 
+ Channel, Src + Stride, Width * Channel); memcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel); int BlockSize = 8, Block = (Width * Channel) / BlockSize; unsigned char Table[65026]; for (int Y = 0; Y < 65026; Y++) Table[Y] = (sqrtf(Y + 0.0f) + 0.5f); for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; if (Y != 0) { unsigned char *Temp = First; First = Second; Second = Third; Third = Temp; } if (Y == Height - 1) { memcpy(Third, Second, (Width + 2) * Channel); } else { memcpy(Third, Src + (Y + 1) * Stride, Channel); memcpy(Third + Channel, Src + (Y + 1) * Stride, Width * Channel); // 由于备份了前面一行的数据,这里即使Src和Dest相同也是没有问题的 memcpy(Third + (Width + 1) * Channel, Src + (Y + 1) * Stride + (Width - 1) * Channel, Channel); } if (Channel == 1) { for (int X = 0; X < Width; X++) { int GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2]; //LinePD[X] = Table[min(GX * GX + GY * GY, 65025)]; } } else { __m128i Zero = _mm_setzero_si128(); for (int X = 0; X < Block * BlockSize; X += BlockSize) { __m128i FirstP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X)), Zero); __m128i FirstP1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X + 3)), Zero); __m128i FirstP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X + 6)), Zero); __m128i SecondP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X)), Zero); __m128i SecondP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X + 6)), Zero); __m128i ThirdP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X)), Zero); __m128i ThirdP1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X + 3)), Zero); __m128i ThirdP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X + 6)), Zero); __m128i GX16 = _mm_abs_epi16(_mm_add_epi16(_mm_add_epi16(_mm_sub_epi16(FirstP0, 
FirstP2), _mm_slli_epi16(_mm_sub_epi16(SecondP0, SecondP2), 1)), _mm_sub_epi16(ThirdP0, ThirdP2))); __m128i GY16 = _mm_abs_epi16(_mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP2), _mm_slli_epi16(_mm_sub_epi16(FirstP1, ThirdP1), 1)), _mm_add_epi16(ThirdP0, ThirdP2))); __m128i GX32L = _mm_unpacklo_epi16(GX16, Zero); __m128i GX32H = _mm_unpackhi_epi16(GX16, Zero); __m128i GY32L = _mm_unpacklo_epi16(GY16, Zero); __m128i GY32H = _mm_unpackhi_epi16(GY16, Zero); __m128i ResultL = _mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_add_epi32(_mm_mullo_epi32(GX32L, GX32L), _mm_mullo_epi32(GY32L, GY32L))))); __m128i ResultH = _mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_add_epi32(_mm_mullo_epi32(GX32H, GX32H), _mm_mullo_epi32(GY32H, GY32H))))); _mm_storel_epi64((__m128i *)(LinePD + X), _mm_packus_epi16(_mm_packus_epi32(ResultL, ResultH), Zero)); } for (int X = Block * BlockSize; X < Width * 3; X++) { int GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6]; int GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6]; LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F)); } } } free(RowCopy); } void Sobel_SSE2(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride) { int Channel = Stride / Width; unsigned char *RowCopy = (unsigned char*)malloc((Width + 2) * 3 * Channel); unsigned char *First = RowCopy; unsigned char *Second = RowCopy + (Width + 2) * Channel; unsigned char *Third = RowCopy + (Width + 2) * 2 * Channel; //拷贝第二行数据,边界值填充 memcpy(Second, Src, Channel); memcpy(Second + Channel, Src, Width*Channel); memcpy(Second + (Width + 1)*Channel, Src + (Width - 1)*Channel, Channel); //第一行和第二行一样 memcpy(First, Second, (Width + 2) * Channel); //拷贝第三行数据,边界值填充 memcpy(Third, Src + Stride, Channel); memcpy(Third + Channel, Src + Stride, Width * Channel); memcpy(Third + (Width + 1) * Channel, Src + Stride + (Width - 1) * Channel, Channel); int BlockSize = 8, Block = (Width * 
Channel) / BlockSize; unsigned char Table[65026]; for (int Y = 0; Y < 65026; Y++) Table[Y] = (sqrtf(Y + 0.0f) + 0.5f); for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; if (Y != 0) { unsigned char *Temp = First; First = Second; Second = Third; Third = Temp; } if (Y == Height - 1) { memcpy(Third, Second, (Width + 2) * Channel); } else { memcpy(Third, Src + (Y + 1) * Stride, Channel); memcpy(Third + Channel, Src + (Y + 1) * Stride, Width * Channel); // 由于备份了前面一行的数据,这里即使Src和Dest相同也是没有问题的 memcpy(Third + (Width + 1) * Channel, Src + (Y + 1) * Stride + (Width - 1) * Channel, Channel); } if (Channel == 1) { for (int X = 0; X < Width; X++) { int GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2]; int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2]; //LinePD[X] = Table[min(GX * GX + GY * GY, 65025)]; } } else { __m128i Zero = _mm_setzero_si128(); for (int X = 0; X < Block * BlockSize; X += BlockSize) { __m128i FirstP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X)), Zero); __m128i FirstP1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X + 3)), Zero); __m128i FirstP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(First + X + 6)), Zero); __m128i SecondP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X)), Zero); __m128i SecondP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Second + X + 6)), Zero); __m128i ThirdP0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X)), Zero); __m128i ThirdP1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X + 3)), Zero); __m128i ThirdP2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Third + X + 6)), Zero); __m128i GX16 = _mm_abs_epi16(_mm_add_epi16(_mm_add_epi16(_mm_sub_epi16(FirstP0, FirstP2), _mm_slli_epi16(_mm_sub_epi16(SecondP0, SecondP2), 1)), _mm_sub_epi16(ThirdP0, ThirdP2))); __m128i GY16 = 
_mm_abs_epi16(_mm_sub_epi16(_mm_add_epi16(_mm_add_epi16(FirstP0, FirstP2), _mm_slli_epi16(_mm_sub_epi16(FirstP1, ThirdP1), 1)), _mm_add_epi16(ThirdP0, ThirdP2)));
				// Interleave GX and GY so that one _mm_madd_epi16 yields GX * GX + GY * GY per pixel.
				__m128i GXYL = _mm_unpacklo_epi16(GX16, GY16);
				__m128i GXYH = _mm_unpackhi_epi16(GX16, GY16);
				__m128i ResultL = _mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_madd_epi16(GXYL, GXYL))));
				__m128i ResultH = _mm_cvtps_epi32(_mm_sqrt_ps(_mm_cvtepi32_ps(_mm_madd_epi16(GXYH, GXYH))));
				_mm_storel_epi64((__m128i *)(LinePD + X), _mm_packus_epi16(_mm_packus_epi32(ResultL, ResultH), Zero));
			}
			for (int X = Block * BlockSize; X < Width * 3; X++)
			{
				int GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6];
				int GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6];
				LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F));
			}
		}
	}
	free(RowCopy);
}

// Layout parameters shared by Sobel_AVX1 and Sobel_AVX2. They are set before _Sobel runs and only read inside it.
int Channel, Block, BlockSize;

void _Sobel(unsigned char* Src, const int32_t Width, const int32_t Height, const int32_t start_row, const int32_t thread_stride, const int32_t Stride, unsigned char* Dest)
{
	// Every call owns its three padded row copies. Sharing one set of buffers between
	// std::async tasks would be a data race, and a task that does not start at row 0
	// needs the rows around start_row rather than the rows around row 0.
	// Src and Dest must not alias when several tasks run concurrently.
	const int RowBytes = (Width + 2) * Channel;
	unsigned char *RowCopy = (unsigned char*)malloc(RowBytes * 3);
	unsigned char *First = RowCopy;
	unsigned char *Second = RowCopy + RowBytes;
	unsigned char *Third = RowCopy + RowBytes * 2;
	// Copy one source row into a padded buffer, replicating the edge pixels.
	auto CopyRow = [&](unsigned char *Row, int Y) {
		memcpy(Row, Src + Y * Stride, Channel);
		memcpy(Row + Channel, Src + Y * Stride, Width * Channel);
		memcpy(Row + (Width + 1) * Channel, Src + Y * Stride + (Width - 1) * Channel, Channel);
	};
	// Clamp the band so the last task never runs past the final row.
	const int end_row = std::min(start_row + thread_stride, Height);
	if (start_row < end_row)
	{
		CopyRow(First, std::max(start_row - 1, 0));
		CopyRow(Second, start_row);
	}
	for (int Y = start_row; Y < end_row; Y++)
	{
		unsigned char *LinePD = Dest + Y * Stride;
		if (Y != start_row)
		{
			unsigned char *Temp = First;
			First = Second;
			Second = Third;
			Third = Temp;
		}
		if (Y == Height - 1)
			memcpy(Third, Second, RowBytes);
		else
			CopyRow(Third, Y + 1);
		if (Channel == 1)
		{
			for (int X = 0; X < Width; X++)
			{
				int GX = First[X] - First[X + 2] + (Second[X] - Second[X + 2]) * 2 + Third[X] - Third[X + 2];
				int GY = First[X] + First[X + 2] + (First[X + 1] - Third[X + 1]) * 2 - Third[X] - Third[X + 2];
				// The original store used a lookup table that is not defined in this function and was commented out.
				LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F));
			}
		}
		else
		{
			__m256i Zero = _mm256_setzero_si256();
			for (int X = 0; X < Block * BlockSize; X += BlockSize)
			{
				__m256i FirstP0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(First + X)));
				__m256i FirstP1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(First + X + 3)));
				__m256i FirstP2 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(First + X + 6)));
				__m256i SecondP0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Second + X)));
				__m256i SecondP2 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Second + X + 6)));
				__m256i ThirdP0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Third + X)));
				__m256i ThirdP1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Third + X + 3)));
				__m256i ThirdP2 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(Third + X + 6)));
				//GX0 GX1 GX2 GX3 GX4 GX5 GX6 GX7 GX8 GX9 GX10 GX11 GX12 GX13 GX14 GX15
				__m256i GX16 = _mm256_abs_epi16(_mm256_adds_epi16(_mm256_adds_epi16(_mm256_subs_epi16(FirstP0, FirstP2), _mm256_slli_epi16(_mm256_subs_epi16(SecondP0, SecondP2), 1)), _mm256_subs_epi16(ThirdP0, ThirdP2)));
				//GY0 GY1 GY2 GY3 GY4 GY5 GY6 GY7 GY8 GY9 GY10 GY11 GY12 GY13 GY14 GY15
				__m256i GY16 = _mm256_abs_epi16(_mm256_subs_epi16(_mm256_adds_epi16(_mm256_adds_epi16(FirstP0, FirstP2), _mm256_slli_epi16(_mm256_subs_epi16(FirstP1, ThirdP1), 1)), _mm256_adds_epi16(ThirdP0, ThirdP2)));
				// The 256-bit unpacks work per 128-bit lane:
				//GX0 GY0 GX1 GY1 GX2 GY2 GX3 GY3 | GX8 GY8 GX9 GY9 GX10 GY10 GX11 GY11
				__m256i GXYL = _mm256_unpacklo_epi16(GX16, GY16);
				//GX4 GY4 GX5 GY5 GX6 GY6 GX7 GY7 | GX12 GY12 GX13 GY13 GX14 GY14 GX15 GY15
				__m256i GXYH = _mm256_unpackhi_epi16(GX16, GY16);
				__m256i ResultL = _mm256_cvtps_epi32(_mm256_sqrt_ps(_mm256_cvtepi32_ps(_mm256_madd_epi16(GXYL, GXYL))));
				__m256i ResultH = _mm256_cvtps_epi32(_mm256_sqrt_ps(_mm256_cvtepi32_ps(_mm256_madd_epi16(GXYH, GXYH))));
				// Both packs are also per lane, which restores pixel order inside each lane:
				// low lane = pixels 0..7 then 8 zero bytes, high lane = pixels 8..15 then 8 zero bytes.
				__m256i Packed = _mm256_packus_epi16(_mm256_packus_epi32(ResultL, ResultH), Zero);
				// Join the low 64 bits of the two lanes into 16 consecutive output bytes.
				// (The original stored the unpacked 32-bit ResultL twice, which does not write byte results.)
				__m128i Ans = _mm_unpacklo_epi64(_mm256_castsi256_si128(Packed), _mm256_extracti128_si256(Packed, 1));
				_mm_storeu_si128((__m128i *)(LinePD + X), Ans);
			}
			for (int X = Block * BlockSize; X < Width * 3; X++)
			{
				int GX = First[X] - First[X + 6] + (Second[X] - Second[X + 6]) * 2 + Third[X] - Third[X + 6];
				int GY = First[X] + First[X + 6] + (First[X + 3] - Third[X + 3]) * 2 - Third[X] - Third[X + 6];
				LinePD[X] = IM_ClampToByte(sqrtf(GX * GX + GY * GY + 0.0F));
			}
		}
	}
	free(RowCopy);
}

void Sobel_AVX1(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride)
{
	Channel = Stride / Width;
	BlockSize = 16, Block = (Width * Channel) / BlockSize;
	// A single task covering every row; _Sobel sets up its own padded row copies.
	_Sobel(Src, Width, Height, 0, Height, Stride, Dest);
}

void Sobel_AVX2(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride)
{
	//INIT
	Channel = Stride / Width;
	BlockSize = 16, Block = (Width * Channel) / BlockSize;
	//Run
	// At most one task per 16 rows, and always at least one task, so the division below never divides by zero.
	const int32_t hw_concur = std::max(1, std::min(Height >> 4, static_cast<int32_t>(std::thread::hardware_concurrency())));
	std::vector<std::future<void>> fut(hw_concur);
	const int thread_stride = (Height - 1) / hw_concur + 1;
	int i = 0, start = 0;
	for (; i < hw_concur && start < Height; i++, start += thread_stride)
	{
		fut[i] = std::async(std::launch::async, _Sobel, Src, Width, Height, start, thread_stride, Stride, Dest);
	}
	for (int j = 0; j < i; ++j)
		fut[j].wait();
}

int main()
{
	Mat src = imread("F:\\car.jpg");
	int Height = src.rows;
	int Width = src.cols;
	unsigned char *Src = src.data;
	unsigned char *Dest = new unsigned char[Height * Width * 3];
	int Stride = Width * 3;
	int64 st = cv::getTickCount();
	/*for (int i = 0; i <1000; i++) {
		Sobel_SSE3(Src, Dest, Width, Height, Stride);
	}*/
	double duration = (cv::getTickCount() - st) / cv::getTickFrequency();
	printf("%.5f\n", duration);
	Sobel_SSE1(Src, Dest, Width, Height, Stride);
	Mat dst(Height, Width, CV_8UC3, Dest);
	imshow("origin", src);
	imshow("result", dst);
	imwrite("F:\\res.jpg", dst);
	waitKey(0);
	delete[] Dest;
}

================================================
FILE: speed_vibrance_algorithm.cpp
================================================
// The original header names were lost in extraction; these are the headers the code below needs.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#include <opencv2/opencv.hpp>
#include <immintrin.h>
using namespace std;
using namespace cv;

void GetGrayIntegralImage(unsigned char *Src, int *Integral, int Width, int Height, int Stride)
{
	memset(Integral, 0, (Width + 1) * sizeof(int));	// the first row is all zeros
	for (int Y = 0; Y < Height; Y++)
	{
		unsigned char *LinePS = Src + Y * Stride;
		int *LinePL = Integral + Y * (Width + 1) + 1;		// previous row of the integral image
		int *LinePD = Integral + (Y + 1) * (Width + 1) + 1;	// current row; column 0 of every row is zero
		LinePD[-1] = 0;	// the first column is zero
		for (int X = 0, Sum = 0; X < Width; X++)
		{
			Sum += LinePS[X];		// running sum along the row
			LinePD[X] = LinePL[X] + Sum;	// update the integral image
		}
	}
}

void GetGrayIntegralImage_SSE(unsigned char *Src, int *Integral, int Width, int Height, int Stride)
{
	memset(Integral,
0, (Width + 1) * sizeof(int)); //第一行都为0 int BlockSize = 8, Block = Width / BlockSize; for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; int *LinePL = Integral + Y * (Width + 1) + 1; //上一行位置 int *LinePD = Integral + (Y + 1) * (Width + 1) + 1; //当前位置,注意每行的第一列都为0 LinePD[-1] = 0; __m128i PreV = _mm_setzero_si128(); __m128i Zero = _mm_setzero_si128(); for (int X = 0; X < Block * BlockSize; X += BlockSize) { __m128i Src_Shift0 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i*)(LinePS + X)), Zero); //A7 A6 A5 A 4 A3 A2 A1 A0 __m128i Src_Shift1 = _mm_slli_si128(Src_Shift0, 2); //A6 A5 A4 A3 A2 A1 A0 0 __m128i Src_Shift2 = _mm_slli_si128(Src_Shift1, 2); //A5 A4 A3 A2 A1 A0 0 0 __m128i Src_Shift3 = _mm_slli_si128(Src_Shift2, 2); //A4 A3 A2 A1 A0 0 0 0 __m128i Shift_Add12 = _mm_add_epi16(Src_Shift1, Src_Shift2); //A6+A5 A5+A4 A4+A3 A3+A2 A2+A1 A1+A0 A0+0 0+0 __m128i Shift_Add03 = _mm_add_epi16(Src_Shift0, Src_Shift3); //A7+A4 A6+A3 A5+A2 A4+A1 A3+A0 A2+0 A1+0 A0+0 __m128i Low = _mm_add_epi16(Shift_Add12, Shift_Add03); //A7+A6+A5+A4 A6+A5+A4+A3 A5+A4+A3+A2 A4+A3+A2+A1 A3+A2+A1+A0 A2+A1+A0+0 A1+A0+0+0 A0+0+0+0 __m128i High = _mm_add_epi32(_mm_unpackhi_epi16(Low, Zero), _mm_unpacklo_epi16(Low, Zero)); //A7+A6+A5+A4+A3+A2+A1+A0 A6+A5+A4+A3+A2+A1+A0 A5+A4+A3+A2+A1+A0 A4+A3+A2+A1+A0 __m128i SumL = _mm_loadu_si128((__m128i *)(LinePL + X + 0)); __m128i SumH = _mm_loadu_si128((__m128i *)(LinePL + X + 4)); SumL = _mm_add_epi32(SumL, PreV); SumL = _mm_add_epi32(SumL, _mm_unpacklo_epi16(Low, Zero)); SumH = _mm_add_epi32(SumH, PreV); SumH = _mm_add_epi32(SumH, High); PreV = _mm_add_epi32(PreV, _mm_shuffle_epi32(High, _MM_SHUFFLE(3, 3, 3, 3))); _mm_storeu_si128((__m128i *)(LinePD + X + 0), SumL); _mm_storeu_si128((__m128i *)(LinePD + X + 4), SumH); } for (int X = Block * BlockSize, V = LinePD[X - 1] - LinePL[X - 1]; X < Width; X++) { V += LinePS[X]; LinePD[X] = V + LinePL[X]; } } } void BoxBlur(unsigned char *Src, unsigned char *Dest, int Width, int Height, int 
Stride, int Radius)
{
	// The integral image has (Height + 1) x (Width + 1) entries; row 0 and column 0 are zero.
	int *Integral = (int *)malloc((Width + 1) * (Height + 1) * sizeof(int));
	GetGrayIntegralImage(Src, Integral, Width, Height, Stride);
#pragma omp parallel for num_threads(4)
	for (int Y = 0; Y < Height; Y++)
	{
		// The window covers rows [Y1, Y2) and columns [X1, X2), clipped to the image.
		int Y1 = max(Y - Radius, 0);
		int Y2 = min(Y + Radius + 1, Height);
		int *LineP1 = Integral + Y1 * (Width + 1);
		int *LineP2 = Integral + Y2 * (Width + 1);
		unsigned char *LinePD = Dest + Y * Stride;
		for (int X = 0; X < Width; X++)
		{
			int X1 = max(X - Radius, 0);
			int X2 = min(X + Radius + 1, Width);
			int Sum = LineP2[X2] - LineP1[X2] - LineP2[X1] + LineP1[X1];
			int PixelCount = (X2 - X1) * (Y2 - Y1);
			LinePD[X] = (Sum + (PixelCount >> 1)) / PixelCount;	// rounded mean of the window
		}
	}
	free(Integral);
}

// A positive Adjustment increases saturation.
// A negative Adjustment decreases saturation.
void VibranceAlgorithm_FLOAT(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Adjustment)
{
	float VibranceAdjustment = -0.01f * Adjustment;
	for (int Y = 0; Y < Height; Y++)
	{
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Stride;
		for (int X = 0; X < Width; X++)
		{
			int Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2];
			int Avg = (Blue + Green + Green + Red) >> 2;
			int Max = max(max(Blue, Green), Red);
			float AmtVal = (abs(Max - Avg) / 127.0f) * VibranceAdjustment;
			if (Blue != Max) Blue += (Max - Blue) * AmtVal;
			if (Green != Max) Green += (Max - Green) * AmtVal;
			if (Red != Max) Red += (Max - Red) * AmtVal;
			if (Red < 0) Red = 0;
			else if (Red > 255) Red = 255;
			if (Green < 0) Green = 0;
			else if (Green > 255) Green = 255;
			if (Blue < 0) Blue = 0;
			else if (Blue > 255) Blue = 255;
			LinePD[0] = Blue;
			LinePD[1] = Green;
			LinePD[2] = Red;
			LinePS += 3;
			LinePD += 3;
		}
	}
}

void VibranceAlgorithm_INT(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Adjustment)
{
	// Fixed-point form of the float version: 0.01 / 127 is approximated by 1.28 / 2^14,
	// which is why the products are shifted right by 14 in the loop.
	int VibranceAdjustment = (int)(-1.28 * Adjustment);
	for (int Y = 0; Y < Height; Y++)
	{
		unsigned char *LinePS = Src + Y * Stride;
		unsigned char *LinePD = Dest + Y * Stride;
		for (int X = 0; X < Width; X++)
{ int Blue, Green, Red, Max; Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2]; int Avg = (Blue + Green + Green + Red) >> 2; if (Blue > Green) Max = Blue; else Max = Green; if (Red > Max) Max = Red; int AmtVal = (Max - Avg) * VibranceAdjustment; if (Blue != Max) Blue += (((Max - Blue) * AmtVal) >> 14); if (Green != Max) Green += (((Max - Green) * AmtVal) >> 14); if (Red != Max) Red += (((Max - Red) * AmtVal) >> 14); if (Red < 0) Red = 0; else if (Red > 255) Red = 255; if (Green < 0) Green = 0; else if (Green > 255) Green = 255; if (Blue < 0) Blue = 0; else if (Blue > 255) Blue = 255; LinePD[0] = Blue; LinePD[1] = Green; LinePD[2] = Red; LinePS += 3; LinePD += 3; } } } void VibranceAlgorithm_INT_OpenMP(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Adjustment) { int VibranceAdjustment = -1.28 * Adjustment; for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; #pragma omp parallel for num_threads(4) for (int X = 0; X < Width; X++) { int Blue, Green, Red, Max; Blue = LinePS[X*3 + 0], Green = LinePS[X*3 + 1], Red = LinePS[X*3 + 2]; int Avg = (Blue + Green + Green + Red) >> 2; if (Blue > Green) Max = Blue; else Max = Green; if (Red > Max) Max = Red; int AmtVal = (Max - Avg) * VibranceAdjustment; if (Blue != Max) Blue += (((Max - Blue) * AmtVal) >> 14); if (Green != Max) Green += (((Max - Green) * AmtVal) >> 14); if (Red != Max) Red += (((Max - Red) * AmtVal) >> 14); if (Red < 0) Red = 0; else if (Red > 255) Red = 255; if (Green < 0) Green = 0; else if (Green > 255) Green = 255; if (Blue < 0) Blue = 0; else if (Blue > 255) Blue = 255; LinePD[X*3 + 0] = Blue; LinePD[X*3 + 1] = Green; LinePD[X*3 + 2] = Red; } } } void VibranceAlgorithm_SSE(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Adjustment) { int VibranceAdjustment = (int)(-1.28 * Adjustment); __m128i Adjustment128 = _mm_setr_epi16(VibranceAdjustment, VibranceAdjustment, 
VibranceAdjustment, VibranceAdjustment, VibranceAdjustment, VibranceAdjustment, VibranceAdjustment, VibranceAdjustment); int X; for (int Y = 0; Y < Height; Y++) { unsigned char *LinePS = Src + Y * Stride; unsigned char *LinePD = Dest + Y * Stride; X = 0; __m128i Src1, Src2, Src3, Dest1, Dest2, Dest3, Blue8, Green8, Red8, Max8; __m128i BL16, BH16, GL16, GH16, RL16, RH16, MaxL16, MaxH16, AvgL16, AvgH16, AmtVal; __m128i Zero = _mm_setzero_si128(); for (; X < Width - 16; X += 16, LinePS += 48, LinePD += 48) { Src1 = _mm_loadu_si128((__m128i *)(LinePS + 0)); //B1,G1,R1,B2,G2,R2,B3,G3,R3,B4,G4,R4,B5,G5,R5,B6 Src2 = _mm_loadu_si128((__m128i *)(LinePS + 16));//G6,R6,B7,G7,R7,B8,G8,R8,B9,G9,R9,B10,G10,R10,B11,G11 Src3 = _mm_loadu_si128((__m128i *)(LinePS + 32));//R11,B12,G12,R12,B13,G13,R13,B14,G14,R14,B15,G15,R15,B16,G16,R16 Blue8 = _mm_shuffle_epi8(Src1, _mm_setr_epi8(0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); Blue8 = _mm_or_si128(Blue8, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1))); Blue8 = _mm_or_si128(Blue8, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13))); Green8 = _mm_shuffle_epi8(Src1, _mm_setr_epi8(1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); Green8 = _mm_or_si128(Green8, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1))); Green8 = _mm_or_si128(Green8, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14))); Red8 = _mm_shuffle_epi8(Src1, _mm_setr_epi8(2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1)); Red8 = _mm_or_si128(Red8, _mm_shuffle_epi8(Src2, _mm_setr_epi8(-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1))); Red8 = _mm_or_si128(Red8, _mm_shuffle_epi8(Src3, _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15))); Max8 = _mm_max_epu8(_mm_max_epu8(Blue8, Green8), Red8); BL16 = 
_mm_unpacklo_epi8(Blue8, Zero); BH16 = _mm_unpackhi_epi8(Blue8, Zero); GL16 = _mm_unpacklo_epi8(Green8, Zero); GH16 = _mm_unpackhi_epi8(Green8, Zero); RL16 = _mm_unpacklo_epi8(Red8, Zero); RH16 = _mm_unpackhi_epi8(Red8, Zero); MaxL16 = _mm_unpacklo_epi8(Max8, Zero); MaxH16 = _mm_unpackhi_epi8(Max8, Zero); AvgL16 = _mm_srli_epi16(_mm_add_epi16(_mm_add_epi16(BL16, RL16), _mm_slli_epi16(GL16, 1)), 2); AvgH16 = _mm_srli_epi16(_mm_add_epi16(_mm_add_epi16(BH16, RH16), _mm_slli_epi16(GH16, 1)), 2); AmtVal = _mm_mullo_epi16(_mm_sub_epi16(MaxL16, AvgL16), Adjustment128); BL16 = _mm_adds_epi16(BL16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxL16, BL16), 2), AmtVal)); GL16 = _mm_adds_epi16(GL16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxL16, GL16), 2), AmtVal)); RL16 = _mm_adds_epi16(RL16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxL16, RL16), 2), AmtVal)); AmtVal = _mm_mullo_epi16(_mm_sub_epi16(MaxH16, AvgH16), Adjustment128); BH16 = _mm_adds_epi16(BH16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxH16, BH16), 2), AmtVal)); GH16 = _mm_adds_epi16(GH16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxH16, GH16), 2), AmtVal)); RH16 = _mm_adds_epi16(RH16, _mm_mulhi_epi16(_mm_slli_epi16(_mm_sub_epi16(MaxH16, RH16), 2), AmtVal)); Blue8 = _mm_packus_epi16(BL16, BH16); Green8 = _mm_packus_epi16(GL16, GH16); Red8 = _mm_packus_epi16(RL16, RH16); Dest1 = _mm_shuffle_epi8(Blue8, _mm_setr_epi8(0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1, 5)); Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Green8, _mm_setr_epi8(-1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1, -1))); Dest1 = _mm_or_si128(Dest1, _mm_shuffle_epi8(Red8, _mm_setr_epi8(-1, -1, 0, -1, -1, 1, -1, -1, 2, -1, -1, 3, -1, -1, 4, -1))); Dest2 = _mm_shuffle_epi8(Blue8, _mm_setr_epi8(-1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10, -1)); Dest2 = _mm_or_si128(Dest2, _mm_shuffle_epi8(Green8, _mm_setr_epi8(5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1, 10))); Dest2 = _mm_or_si128(Dest2, 
_mm_shuffle_epi8(Red8, _mm_setr_epi8(-1, 5, -1, -1, 6, -1, -1, 7, -1, -1, 8, -1, -1, 9, -1, -1))); Dest3 = _mm_shuffle_epi8(Blue8, _mm_setr_epi8(-1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1, -1)); Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Green8, _mm_setr_epi8(-1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15, -1))); Dest3 = _mm_or_si128(Dest3, _mm_shuffle_epi8(Red8, _mm_setr_epi8(10, -1, -1, 11, -1, -1, 12, -1, -1, 13, -1, -1, 14, -1, -1, 15))); _mm_storeu_si128((__m128i *)(LinePD + 0), Dest1); _mm_storeu_si128((__m128i *)(LinePD + 16), Dest2); _mm_storeu_si128((__m128i *)(LinePD + 32), Dest3); } for (; X < Width; X++) { int Blue, Green, Red, Max; Blue = LinePS[0], Green = LinePS[1], Red = LinePS[2]; int Avg = (Blue + Green + Green + Red) >> 2; if (Blue > Green) Max = Blue; else Max = Green; if (Red > Max) Max = Red; int AmtVal = (Max - Avg) * VibranceAdjustment; if (Blue != Max) Blue += (((Max - Blue) * AmtVal) >> 14); if (Green != Max) Green += (((Max - Green) * AmtVal) >> 14); if (Red != Max) Red += (((Max - Red) * AmtVal) >> 14); if (Red < 0) Red = 0; else if (Red > 255) Red = 255; if (Green < 0) Green = 0; else if (Green > 255) Green = 255; if (Blue < 0) Blue = 0; else if (Blue > 255) Blue = 255; LinePD[0] = Blue; LinePD[1] = Green; LinePD[2] = Red; LinePS += 3; LinePD += 3; } } } int main() { Mat src = imread("F:\\car.jpg"); int Height = src.rows; int Width = src.cols; unsigned char *Src = src.data; unsigned char *Dest = new unsigned char[Height * Width * 3]; int Stride = Width * 3; int Radius = 11; int Adjustment = 50; int64 st = cvGetTickCount(); for (int i = 0; i <100; i++) { VibranceAlgorithm_SSE(Src, Dest, Width, Height, Stride, Adjustment); } double duration = (cv::getTickCount() - st) / cv::getTickFrequency() * 10; printf("%.5f\n", duration); VibranceAlgorithm_SSE(Src, Dest, Width, Height, Stride, Adjustment); Mat dst(Height, Width, CV_8UC3, Dest); imshow("origin", src); imshow("result", dst); imwrite("F:\\res.jpg", dst); 
    waitKey(0);
}

================================================
FILE: sse_implementation_of_common_functions_in_image_processing.cpp
================================================
#include <opencv2/opencv.hpp> // assumed: the header names were lost in extraction
#include <immintrin.h>        // assumed: covers the SSE through SSE4.1 intrinsics used below
using namespace std;
using namespace cv;

// Function 1: SSE implementation of the natural logarithm, high-precision version
inline __m128 _mm_log_ps(__m128 x)
{
    static const __declspec(align(16)) int _ps_min_norm_pos[4] = { 0x00800000, 0x00800000, 0x00800000, 0x00800000 };
    static const __declspec(align(16)) int _ps_inv_mant_mask[4] = { ~0x7f800000, ~0x7f800000, ~0x7f800000, ~0x7f800000 };
    static const __declspec(align(16)) int _pi32_0x7f[4] = { 0x7f, 0x7f, 0x7f, 0x7f };
    static const __declspec(align(16)) float _ps_1[4] = { 1.0f, 1.0f, 1.0f, 1.0f };
    static const __declspec(align(16)) float _ps_0p5[4] = { 0.5f, 0.5f, 0.5f, 0.5f };
    static const __declspec(align(16)) float _ps_sqrthf[4] = { 0.707106781186547524f, 0.707106781186547524f, 0.707106781186547524f, 0.707106781186547524f };
    static const __declspec(align(16)) float _ps_log_p0[4] = { 7.0376836292E-2f, 7.0376836292E-2f, 7.0376836292E-2f, 7.0376836292E-2f };
    static const __declspec(align(16)) float _ps_log_p1[4] = { -1.1514610310E-1f, -1.1514610310E-1f, -1.1514610310E-1f, -1.1514610310E-1f };
    static const __declspec(align(16)) float _ps_log_p2[4] = { 1.1676998740E-1f, 1.1676998740E-1f, 1.1676998740E-1f, 1.1676998740E-1f };
    static const __declspec(align(16)) float _ps_log_p3[4] = { -1.2420140846E-1f, -1.2420140846E-1f, -1.2420140846E-1f, -1.2420140846E-1f };
    static const __declspec(align(16)) float _ps_log_p4[4] = { 1.4249322787E-1f, 1.4249322787E-1f, 1.4249322787E-1f, 1.4249322787E-1f };
    static const __declspec(align(16)) float _ps_log_p5[4] = { -1.6668057665E-1f, -1.6668057665E-1f, -1.6668057665E-1f, -1.6668057665E-1f };
    static const __declspec(align(16)) float _ps_log_p6[4] = { 2.0000714765E-1f, 2.0000714765E-1f, 2.0000714765E-1f, 2.0000714765E-1f };
    static const __declspec(align(16)) float _ps_log_p7[4] = { -2.4999993993E-1f, -2.4999993993E-1f,
        -2.4999993993E-1f, -2.4999993993E-1f };
    static const __declspec(align(16)) float _ps_log_p8[4] = { 3.3333331174E-1f, 3.3333331174E-1f, 3.3333331174E-1f, 3.3333331174E-1f };
    static const __declspec(align(16)) float _ps_log_q1[4] = { -2.12194440e-4f, -2.12194440e-4f, -2.12194440e-4f, -2.12194440e-4f };
    static const __declspec(align(16)) float _ps_log_q2[4] = { 0.693359375f, 0.693359375f, 0.693359375f, 0.693359375f };

    __m128 one = *(__m128*)_ps_1;
    __m128 invalid_mask = _mm_cmple_ps(x, _mm_setzero_ps());

    /* cut off denormalized stuff */
    x = _mm_max_ps(x, *(__m128*)_ps_min_norm_pos);
    __m128i emm0 = _mm_srli_epi32(_mm_castps_si128(x), 23);

    /* keep only the fractional part */
    x = _mm_and_ps(x, *(__m128*)_ps_inv_mant_mask);
    x = _mm_or_ps(x, _mm_set1_ps(0.5f));

    emm0 = _mm_sub_epi32(emm0, *(__m128i *)_pi32_0x7f);
    __m128 e = _mm_cvtepi32_ps(emm0);
    e = _mm_add_ps(e, one);

    /* if x < sqrt(1/2): e -= 1 and x = x + x - 1, otherwise x = x - 1 */
    __m128 mask = _mm_cmplt_ps(x, *(__m128*)_ps_sqrthf);
    __m128 tmp = _mm_and_ps(x, mask);
    x = _mm_sub_ps(x, one);
    e = _mm_sub_ps(e, _mm_and_ps(one, mask));
    x = _mm_add_ps(x, tmp);

    __m128 z = _mm_mul_ps(x, x);

    /* polynomial in x, evaluated with Horner's scheme */
    __m128 y = *(__m128*)_ps_log_p0;
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(__m128*)_ps_log_p1);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(__m128*)_ps_log_p2);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(__m128*)_ps_log_p3);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(__m128*)_ps_log_p4);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(__m128*)_ps_log_p5);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(__m128*)_ps_log_p6);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(__m128*)_ps_log_p7);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(__m128*)_ps_log_p8);
    y = _mm_mul_ps(y, x);

    y = _mm_mul_ps(y, z);

    tmp = _mm_mul_ps(e, *(__m128*)_ps_log_q1);
    y = _mm_add_ps(y, tmp);

    tmp = _mm_mul_ps(z, *(__m128*)_ps_0p5);
    y = _mm_sub_ps(y, tmp);

    tmp = _mm_mul_ps(e, *(__m128*)_ps_log_q2);
    x = _mm_add_ps(x, y);
    x = _mm_add_ps(x, tmp);
    x = _mm_or_ps(x, invalid_mask); // negative arg will be NAN
    return x;
}

// Function 2: low-precision log, accurate to roughly two decimal places
// Algorithm source:
// https://stackoverflow.com/questions/9411823/fast-log2float-x-implementation-c
inline float IM_Flog(float val)
{
    union
    {
        float val;
        int x;
    } u = { val };
    float log_2 = (float)(((u.x >> 23) & 255) - 128); // biased exponent minus 128
    u.x &= ~(255 << 23);                              // clear the exponent bits
    u.x += (127 << 23);                               // force the exponent to 0, so u.val is in [1, 2)
    log_2 += ((-0.34484843f) * u.val + 2.02466578f) * u.val - 0.67487759f;
    return log_2 * 0.69314718f;                       // log2(x) * ln(2) = ln(x)
}

// Function 3: SSE implementation of Function 2
inline __m128 _mm_flog_ps(__m128 x)
{
    __m128i I = _mm_castps_si128(x);
    __m128 log_2 = _mm_cvtepi32_ps(_mm_sub_epi32(_mm_and_si128(_mm_srli_epi32(I, 23), _mm_set1_epi32(255)), _mm_set1_epi32(128)));
    I = _mm_and_si128(I, _mm_set1_epi32(-2139095041)); // ~(255 << 23)
    I = _mm_add_epi32(I, _mm_set1_epi32(1065353216));  // 127 << 23
    __m128 F = _mm_castsi128_ps(I);
    __m128 T = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(-0.34484843f), F), _mm_set1_ps(2.02466578f));
    T = _mm_sub_ps(_mm_mul_ps(T, F), _mm_set1_ps(0.67487759f));
    return _mm_mul_ps(_mm_add_ps(log_2, T), _mm_set1_ps(0.69314718f));
}

// Function 4: approximate e^x
// Writes a scaled copy of Y into the high 32 bits of a double (little-endian layout assumed).
inline float IM_Fexp(float Y)
{
    union
    {
        double Value;
        int X[2];
    } V;
    V.X[1] = (int)(Y * 1512775 + 1072632447 + 0.5F);
    V.X[0] = 0;
    return (float)V.Value;
}

// Function 5: SSE implementation of Function 4
inline __m128 _mm_fexp_ps(__m128 Y)
{
    __m128i T = _mm_cvtps_epi32(_mm_add_ps(_mm_mul_ps(Y, _mm_set1_ps(1512775)), _mm_set1_ps(1072632447)));
    __m128i TL = _mm_unpacklo_epi32(_mm_setzero_si128(), T); // two doubles whose high words are T0, T1
    __m128i TH = _mm_unpackhi_epi32(_mm_setzero_si128(), T); // two doubles whose high words are T2, T3
    return _mm_movelh_ps(_mm_cvtpd_ps(_mm_castsi128_pd(TL)), _mm_cvtpd_ps(_mm_castsi128_pd(TH)));
}

// Function 6: approximate pow(a, b)
inline float IM_Fpow(float a, float b)
{
    union
    {
        double Value;
        int X[2];
    } V;
    V.Value = a; // reinterpret a as a double so its high 32 bits can be scaled
    V.X[1] = (int)(b * (V.X[1] - 1072632447) + 1072632447);
    V.X[0] = 0;
    return (float)V.Value;
}

// Function 7: _mm_rcp_ps and _mm_rsqrt_ps return approximations with only about 12 bits
// of precision; one Newton-Raphson step gives a more accurate reciprocal.
__m128 _mm_prcp_ps(__m128 a)
{
    __m128 rcp = _mm_rcp_ps(a); // only about 12 bits of precision
    // x1 = x0 * (2 - a * x0) = 2 * x0 - a * x0 * x0
    // One Newton-Raphson step raises the precision to roughly 23 bits.
    return _mm_sub_ps(_mm_add_ps(rcp, rcp), _mm_mul_ps(a, _mm_mul_ps(rcp, rcp)));
}

// Function 8: compute a / b directly from the approximate reciprocal
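__m128 _mm_fdiv_ps(__m128 a, __m128 b)
{
    return _mm_mul_ps(a, _mm_rcp_ps(b));
}

// Function 9: division that returns 0 where the divisor is 0
// SSE provides no integer division instruction, so integer division usually has to go
// through the floating-point instructions.
// A zero divisor is a further problem: SSE float division does not raise an exception by
// default in that case, but it should still be handled.
// There are several ways to handle it, such as special-casing a zero divisor or dividing by
// a very small number instead. Most callers simply want 0 when the divisor is 0, and then
// the function below can replace _mm_div_ps.
// Divides four floats, a / b; lanes where b is 0 return 0. Requires SSE4.1 (_mm_blendv_ps).
inline __m128 _mm_divz_ps(__m128 a, __m128 b)
{
    __m128 Mask = _mm_cmpeq_ps(b, _mm_setzero_ps());
    return _mm_blendv_ps(_mm_div_ps(a, b), _mm_setzero_ps(), Mask);
}

// Function 10: convert four 32-bit integers to bytes and store them
// Packs four 32-bit integers into four bytes, saturating each to [0, 255].
inline void _mm_storesi128_4char(unsigned char *Dest, __m128i P)
{
    __m128i T = _mm_packs_epi32(P, P);
    *((int *)Dest) = _mm_cvtsi128_si32(_mm_packus_epi16(T, T));
}

// Function 11: load 12 bytes into an XMM register
// An XMM register is 16 bytes wide, and many SSE operations work in units that are
// multiples of 4 bytes. In image processing, however, most of the data (about 70% of cases
// by the author's estimate) is 24-bit color, three bytes per pixel.
// A plain 16-byte load brings in 5 1/3 pixels, which is awkward to process.
// The usual approach is to load 4 pixels (12 bytes) and expand them to 16 bytes by adding
// an alpha value to each pixel.
// One option is to load 16 bytes and advance the pointer by 12 each time; the 12-byte load
// below is another.
// Loads 12 bytes from p into an XMM register; the upper 32 bits of the register are zeroed.
inline __m128i _mm_loadu_epi96(const __m128i * p)
{
    return _mm_unpacklo_epi64(_mm_loadl_epi64(p), _mm_cvtsi32_si128(((int *)p)[2]));
}

// Function 12: store the low 12 bytes of an XMM register
// Writes the low 12 bytes of register Q to the memory at P.
inline void _mm_storeu_epi96(__m128i *P, __m128i Q)
{
    _mm_storel_epi64(P, Q);
    ((int *)P)[2] = _mm_cvtsi128_si32(_mm_srli_si128(Q, 8));
}

// Function 13: integer division by 255 without a divide instruction
// The result matches the truncating quotient V / 255 for 0 <= V <= 65534; it does not round
// to nearest (V = 254 gives 0).
inline int IM_Div255(int V)
{
    // The original note suggests V may be negative, but the result is then inexact
    // (V = -1 gives -1, whereas -1 / 255 is 0).
    return (((V >> 8) + V + 1) >> 8);
}

// Function 14: SSE implementation of Function 13
// Returns the truncated quotient x / 255 for unsigned 16-bit lanes: x = ((x + 1) + (x >> 8)) >> 8
// Because the adds saturate, the result is exact for x <= 65279, which covers any product
// of two 8-bit values.
inline __m128i _mm_div255_epu16(__m128i x)
{
    return _mm_srli_epi16(_mm_adds_epu16(_mm_adds_epu16(x, _mm_set1_epi16(1)), _mm_srli_epi16(x, 8)), 8);
}

// Function 15: sum of all elements in an XMM register
// This is a common need: partial results are accumulated lane by lane in a register, and
// the lanes are added together at the end.
// The elements of a __m128i variable can be accessed directly, but that reportedly hurts
// optimization inside loops. An alternative is to do the reduction with SSE instructions,
// as in the code below for eight signed short values.
// Sum of eight signed 16-bit values. Source: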
// https://stackoverflow.com/questions/31382209/computing-the-inner-product-of-vectors-with-allowed-scalar-values-0-1-and-2-usi/31382878#31382878
inline int _mm_hsum_epi16(__m128i V) // V7 V6 V5 V4 V3 V2 V1 V0
{
    // V = _mm_unpacklo_epi16(_mm_hadd_epi16(V, _mm_setzero_si128()), _mm_setzero_si128());
    // The line above could be used instead; _mm_hadd_epi16 seems to give correct results
    // when the sum exceeds 32768.
    __m128i T = _mm_madd_epi16(V, _mm_set1_epi16(1)); // 32-bit lanes, high to low: V7+V6  V5+V4  V3+V2  V1+V0
    T = _mm_add_epi32(T, _mm_srli_si128(T, 8));       // low two lanes: V7+V6+V3+V2  V5+V4+V1+V0
    T = _mm_add_epi32(T, _mm_srli_si128(T, 4));       // lowest lane: sum of all eight values
    return _mm_cvtsi128_si32(T);                      // extract the lowest lane
}

// Function 16: minimum of 16 bytes
// To find the minimum of a byte sequence, one would use something like _mm_min_epu8 to keep
// a running minimum for each of the 16 byte positions.
// That leaves one XMM register of 16 bytes, and the minimum of the whole sequence is
// among them.
// The SSE code below then reduces those 16 bytes to a single minimum.
// Minimum of 16 unsigned bytes; byte data only. Requires SSE4.1 (_mm_minpos_epu16).
inline int _mm_hmin_epu8(__m128i a)
{
    __m128i L = _mm_unpacklo_epi8(a, _mm_setzero_si128());
    __m128i H = _mm_unpackhi_epi8(a, _mm_setzero_si128());
    return _mm_extract_epi16(_mm_min_epu16(_mm_minpos_epu16(L), _mm_minpos_epu16(H)), 0);
}

// Function 17: maximum of 16 bytes
// Maximum of 16 unsigned bytes; byte data only. Computed as 255 - min(255 - a).
inline int _mm_hmax_epu8(__m128i a)
{
    __m128i b = _mm_subs_epu8(_mm_set1_epi8(255), a);
    __m128i L = _mm_unpacklo_epi8(b, _mm_setzero_si128());
    __m128i H = _mm_unpackhi_epi8(b, _mm_setzero_si128());
    return 255 - _mm_extract_epi16(_mm_min_epu16(_mm_minpos_epu16(L), _mm_minpos_epu16(H)), 0);
}

int main()
{
}
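
The file's `main` is empty, so nothing in the listing exercises the helpers. As a minimal scalar sketch, the shift-and-add expression used by `IM_Div255` can be checked against plain integer division over a chosen input range. The names `Div255Shortcut` and `FirstDiv255Mismatch` are hypothetical and not part of the repository; only the expression itself is taken from the listing.

```cpp
#include <cassert>

// Same expression as IM_Div255 in the listing above (name is hypothetical).
inline int Div255Shortcut(int V)
{
    return ((V >> 8) + V + 1) >> 8;
}

// Hypothetical helper: returns the first V in [0, Limit] for which the shortcut
// differs from the truncating quotient V / 255, or -1 if they agree everywhere.
inline int FirstDiv255Mismatch(int Limit)
{
    for (int V = 0; V <= Limit; V++)
    {
        if (Div255Shortcut(V) != V / 255)
            return V;
    }
    return -1;
}
```

Calling `FirstDiv255Mismatch(255 * 255)` is one way to confirm the shortcut over every product of two 8-bit values before relying on it in a blending loop; the same brute-force comparison could be applied to the SSE variant by extracting its lanes.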