Graphics and Image Processing: Speed Optimization of Color-to-Grayscale Conversion

                   [email protected] 2009.02.08

tag: grayscale algorithm, speed optimization, fixed-point optimization, MMX, SSE, SSE2, CPU cache optimization

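Abstract:
This article on speeding up color-to-grayscale conversion covers a simple demo framework for
graphics and image processing, an implementation of grayscale conversion and its speed
optimization, and a demonstration of optimizing it with SIMD instruction sets.
   For the first time, this article comes with complete, compilable project source code for image processing.
   (I will gradually rewrite my earlier graphics and image processing articles on top of this framework.)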

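Main text:
To keep the discussion simple, only 32-bit ARGB colors are handled here. The code is C++ and the compiler is vc2008
(the code has also been tested to compile under DevC++ and xcode). The test CPU is an AMD64x2 4200+(2.33G).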

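Notes on the speed tests:
Only memory-to-memory grayscale conversion of ARGB32 pixels is measured.
The test image is 800*600; fps is the number of frames per second, and a larger value means a faster function.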
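
As a rough illustration of how an fps figure of this kind can be obtained, the sketch below calls a conversion routine repeatedly for a minimum time span and divides the call count by the elapsed time. The helper name measureFps and its parameters are assumptions made for this example; they are not part of the article's project, whose actual timing code may differ.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not from the hGraphic32 project): calls `frame`
// repeatedly until at least `minSeconds` have elapsed and returns the
// number of calls per second.
double measureFps(const std::function<void()>& frame, double minSeconds = 1.0) {
    typedef std::chrono::steady_clock clock;
    long frames = 0;
    const clock::time_point start = clock::now();
    double elapsed = 0.0;
    do {
        frame();   // one full-image conversion
        ++frames;
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (elapsed < minSeconds);
    return frames / elapsed;
}
```

A call such as measureFps([&]{ colorToGray_float(src, dst); }) would then report frames per second for one of the functions discussed in this article, assuming src and dst are already set up.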

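A: A simple demo framework for graphics and image processing

My earlier blog articles on graphics and image processing never included complete code that could be compiled and run;
they only listed the key core code. Readers often read an article but, being unable to actually run anything,
did not understand the code well and could not port it into their own projects. So I decided to build a simple,
small framework for this series of blog articles. I named it hGraphic32.
It will be kept as small as possible and is meant mainly for demonstration: it supports only ARGB32 colors,
can load and save bmp image files, and can be compiled and run with several compilers and on several platforms.
   Download the complete project source code now: complete project source code

Files in the <hGraphic32> folder:
    "hColor32.h" : defines the 32-bit ARGB color type Color32; it occupies 4 bytes and represents one color;
        TPixels32Ref describes an image data area; think of it as a "pointer" to a pixel area made of Color32 values;
        IPixels32Buf is the image data area interface, used to describe an image buffer;
    "hPixels32.h" : defines the TPixels32 class, which implements the IPixels32Buf interface and allocates and manages a block of in-memory pixels;
    "hStream.h"   : defines the IInputStream input stream interface;
        IBufInputStream, a data-buffer input stream interface derived from IInputStream;
        TFileInputStream, a file input stream class that implements the IBufInputStream interface;
        IOutputStream, the output stream interface;
        TFileOutputStream, a file output stream class that implements the IOutputStream interface;
    "hBmpFile.h" : defines the TBmpFile class, which loads and saves bmp files;
    "hGraphic32.h" : includes all the *.h headers above, so to use the framework you only need #include "hGraphic32.h"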

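B: The grayscale conversion project
All conversion and test code is in the file "ColorToGray/ColorToGray.cpp" (a command-line program with a main function);
"ColorToGray/win_vc/ColorToGray.sln" is the vc2008 project file for Windows (when testing, set the debugger's working directory to "..");
"ColorToGray/win_DevC++/ColorToGray.dev" is the DevC++ project file for Windows;
"ColorToGray/macosx_xcode/ColorToGray.xcodeproj" is the xcode project file for macosx;
You can also create your own project: add ColorToGray.cpp and all the files in the <hGraphic32> folder, and it will compile.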

C: The grayscale formula and its implementation
The grayscale formula used in this article: Gray = R*0.299 + G*0.587 + B*0.114;

Implementation:

// Grayscale conversion coefficients
const double gray_r_coeff=0.299;
const double gray_g_coeff=0.587;
const double gray_b_coeff=0.114;

// Convert one pixel
must_inline double toGray_float(const Color32& src){
    return (src.r*gray_r_coeff +src.g*gray_g_coeff +src.b*gray_b_coeff);
}

// Convert one line
void colorToGrayLine_float(const Color32* src,Color32* dst,long width){
    for (long x = 0; x < width; ++x){
        int gray=(int)toGray_float(src[x]);
        dst[x]=Color32(gray,gray,gray,src[x].a); // R, G and B all get the same luminance value; A is unchanged
    }
}

// Convert the whole image, line by line
void colorToGray_float(const TPixels32Ref& src,const TPixels32Ref& dst){
    long width=std::min(src.width,dst.width);
    long height=std::min(src.height,dst.height);
    Color32* srcLine=src.pdata;
    Color32* dstLine=dst.pdata;
    for (long y = 0; y < height; ++y){
        colorToGrayLine_float(srcLine,dstLine,width);
        src.nextLine(srcLine);
        dst.nextLine(dstLine);
    }
}
////////////////////////////////////////////////////////////////////////////////
// Speed test
//==============================================================================
// colorToGray_float           145.49 FPS
////////////////////////////////////////////////////////////////////////////////

D: Converting floating-point arithmetic to fixed-point (integer) arithmetic

must_inline int toGray_int16(const Color32& src){
    const long bit=16;
    // Coefficients scaled by 2^16 and rounded; the B coefficient is derived so that the three sum to exactly 1<<bit
    const int gray_r_coeff_int=(int)( gray_r_coeff*(1<<bit)+0.4999999 );
    const int gray_g_coeff_int=(int)( gray_g_coeff*(1<<bit)+0.4999999 );
    const int gray_b_coeff_int=(1<<bit)-gray_r_coeff_int-gray_g_coeff_int;
    return (src.r*gray_r_coeff_int +src.g*gray_g_coeff_int +src.b*gray_b_coeff_int) >> bit;
}

inline void colorToGrayLine_int16(const Color32* src,Color32* dst,long width){
    for (long x = 0; x < width; ++x){
        int gray=toGray_int16(src[x]);
        dst[x]=Color32(gray,gray,gray,src[x].a);
    }
}

void colorToGray_int16(const TPixels32Ref& src,const TPixels32Ref& dst){
    long width=std::min(src.width,dst.width);
    long height=std::min(src.height,dst.height);
    Color32* srcLine=src.pdata;
    Color32* dstLine=dst.pdata;
    for (long y = 0; y < height; ++y){
        colorToGrayLine_int16(srcLine,dstLine,width);
        src.nextLine(srcLine);
        dst.nextLine(dstLine);
    }
}

////////////////////////////////////////////////////////////////////////////////
// Speed test
//==============================================================================
// colorToGray_int16           355.33 FPS
////////////////////////////////////////////////////////////////////////////////
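
To make the fixed-point step concrete, here is a small self-contained sketch of the same arithmetic. With 16 fractional bits the coefficients become 19595, 38470 and 7471, which sum to exactly 65536, so a pure white pixel still maps to 255. The Rgb struct is only a stand-in assumed for this example (the article's code uses Color32), and grayFloat is included just for comparison.

```cpp
#include <cstdint>

// Stand-in for the article's Color32 (assumption: only r, g, b matter here).
struct Rgb { std::uint8_t r, g, b; };

const int kBits = 16;
// R and G coefficients rounded to 16 fractional bits; B is derived so that
// the three coefficients sum to exactly 1 << kBits.
const int kR = (int)(0.299 * (1 << kBits) + 0.4999999);  // 19595
const int kG = (int)(0.587 * (1 << kBits) + 0.4999999);  // 38470
const int kB = (1 << kBits) - kR - kG;                    // 7471

// Integer-only gray value: three multiplies, two adds, one shift.
inline int grayFixed(const Rgb& c) {
    return (c.r * kR + c.g * kG + c.b * kB) >> kBits;
}

// Floating-point reference, same formula as toGray_float.
inline int grayFloat(const Rgb& c) {
    return (int)(c.r * 0.299 + c.g * 0.587 + c.b * 0.114);
}
```

Because the coefficients are rounded, the two functions can differ by 1 for some inputs; for display purposes that difference is normally invisible.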

E: A simple loop unrolling

// 4-way unrolling
void colorToGrayLine_int16_expand4(const Color32* src,Color32* dst,long width){
    long widthFast=width>>2<<2; // width rounded down to a multiple of 4
    for (long x = 0; x < widthFast; x+=4){
        int gray0=toGray_int16(src[x  ]);
        int gray1=toGray_int16(src[x+1]);
        dst[x  ]=Color32(gray0,gray0,gray0,src[x  ].a);
        dst[x+1]=Color32(gray1,gray1,gray1,src[x+1].a);
        int gray2=toGray_int16(src[x+2]);
        int gray3=toGray_int16(src[x+3]);
        dst[x+2]=Color32(gray2,gray2,gray2,src[x+2].a);
        dst[x+3]=Color32(gray3,gray3,gray3,src[x+3].a);
    }
    //border
    if (width>widthFast)
        colorToGrayLine_int16(&src[widthFast],&dst[widthFast],width-widthFast);
}

void colorToGray_int16_expand4(const TPixels32Ref& src,const TPixels32Ref& dst){
    long width=std::min(src.width,dst.width);
    long height=std::min(src.height,dst.height);
    Color32* srcLine=src.pdata;
    Color32* dstLine=dst.pdata;
    for (long y = 0; y < height; ++y){
        colorToGrayLine_int16_expand4(srcLine,dstLine,width);
        src.nextLine(srcLine);
        dst.nextLine(dstLine);
    }
}

////////////////////////////////////////////////////////////////////////////////
// Speed test
//==============================================================================
// colorToGray_int16_expand4   413.22 FPS
////////////////////////////////////////////////////////////////////////////////

F: A special version
   This version does SIMD-style (single instruction, multiple data) computation within the high-level language,
reducing the number of multiplications needed: two pixels share each multiply.
It should work well on CPUs where multiplication is expensive (on x86 it may be slower).

must_inline UInt32 toGray_int8_opMul(const Color32* src2Color){
    const UInt32 gray_r_coeff_8=(UInt32)( gray_r_coeff*(1<<8)+0.4999999);
    const UInt32 gray_g_coeff_8=(UInt32)( gray_g_coeff*(1<<8)+0.4999999);
    const UInt32 gray_b_coeff_8=(1<<8)-gray_r_coeff_8-gray_g_coeff_8;
    UInt32 RR,GG,BB;

    // Pack the same channel of two pixels into the low and high 16 bits of one 32-bit word
    BB=src2Color[0].b | (src2Color[1].b<<16);
    GG=src2Color[0].g | (src2Color[1].g<<16);
    RR=src2Color[0].r | (src2Color[1].r<<16);
    BB*=gray_b_coeff_8;
    GG*=gray_g_coeff_8;
    RR*=gray_r_coeff_8;
    return BB+GG+RR;
}

void colorToGrayLine_int8_opMul(const Color32* src,Color32* dst,long width){
    long widthFast=width>>2<<2;
    for (long x = 0; x < widthFast; x+=4){
        UInt32 gray01=toGray_int8_opMul(&src[x  ]);
        int gray0=(gray01&0x0000FF00)>>8;
        int gray1=gray01>>24;
        dst[x  ]=Color32(gray0,gray0,gray0,src[x  ].a);
        dst[x+1]=Color32(gray1,gray1,gray1,src[x+1].a);
        UInt32 gray23=toGray_int8_opMul(&src[x+2]);
        int gray2=(gray23&0x0000FF00)>>8;
        int gray3=gray23>>24;
        dst[x+2]=Color32(gray2,gray2,gray2,src[x+2].a);
        dst[x+3]=Color32(gray3,gray3,gray3,src[x+3].a);
    }
    //border
    if (width>widthFast)
        colorToGrayLine_int16(&src[widthFast],&dst[widthFast],width-widthFast);
}

void colorToGray_int8_opMul(const TPixels32Ref& src,const TPixels32Ref& dst){
    long width=std::min(src.width,dst.width);
    long height=std::min(src.height,dst.height);
    Color32* srcLine=src.pdata;
    Color32* dstLine=dst.pdata;
    for (long y = 0; y < height; ++y){
        colorToGrayLine_int8_opMul(srcLine,dstLine,width);
        src.nextLine(srcLine);
        dst.nextLine(dstLine);
    }
}
////////////////////////////////////////////////////////////////////////////////
// Speed test
//==============================================================================
// colorToGray_int8_opMul      387.97 FPS
////////////////////////////////////////////////////////////////////////////////
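
Why the two packed pixels do not disturb each other: with 8-bit coefficients (77 + 150 + 29 = 256) the weighted sum of one pixel is at most 255*256 = 65280, which fits in 16 bits, so nothing carries from the low half of the 32-bit word into the high half. The sketch below isolates this trick. The Rgb struct and the names grayPair and grayOne are assumptions for this example only.

```cpp
#include <cstdint>

// Stand-in for the article's Color32 (assumption: only r, g, b matter here).
struct Rgb { std::uint8_t r, g, b; };

// 8-bit fixed-point coefficients: 77 + 150 + 29 == 256.
const std::uint32_t kR8 = (std::uint32_t)(0.299 * 256 + 0.4999999);  // 77
const std::uint32_t kG8 = (std::uint32_t)(0.587 * 256 + 0.4999999);  // 150
const std::uint32_t kB8 = 256 - kR8 - kG8;                           // 29

// Three multiplies handle two pixels at once.
// Result: gray of p0 in bits 8..15, gray of p1 in bits 24..31.
inline std::uint32_t grayPair(const Rgb& p0, const Rgb& p1) {
    const std::uint32_t rr = p0.r | ((std::uint32_t)p1.r << 16);
    const std::uint32_t gg = p0.g | ((std::uint32_t)p1.g << 16);
    const std::uint32_t bb = p0.b | ((std::uint32_t)p1.b << 16);
    return rr * kR8 + gg * kG8 + bb * kB8;
}

// One-pixel reference with the same coefficients.
inline int grayOne(const Rgb& p) {
    return (int)((p.r * kR8 + p.g * kG8 + p.b * kB8) >> 8);
}
```

The price of the trick is precision: only 8 fractional bits are available per coefficient instead of 16.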

G: MMX version using inline assembly
   Note: the MMX code here only supports 32-bit x86 CPUs (Pentium MMX or later);
   for x64 builds the vc compiler no longer supports MMX, and the XMM registers of SSE should be used instead;
   moreover, in x64 mode the vc2008 compiler does not support inline assembly directly, so the intrinsic-function form must be used;
   GCC also supports inline assembly, but its assembly syntax is different; please consult its documentation.

void colorToGrayLine_MMX(const Color32* src,Color32* dst,long width){
    //const UInt32 gray_r_coeff_7=(UInt32)( gray_r_coeff*(1<<7)+0.4999999 );
    //const UInt32 gray_g_coeff_7=(UInt32)( gray_g_coeff*(1<<7)+0.4999999 );
    //const UInt32 gray_b_coeff_7=(1<<7)-gray_r_coeff_7-gray_g_coeff_7;
    // csMMX_rgb_coeff_w= short[ 0 , gray_r_coeff_7 , gray_g_coeff_7 , gray_b_coeff_7 ]
    const UInt64 csMMX_rgb_coeff_w = (((UInt64)0x00000026)<<32) | 0x004b000f;
    long widthFast=width>>1<<1; // width rounded down to a multiple of 2
    if (widthFast>0){
        asm{
            pcmpeqb     mm5,mm5             // FF FF FF FF FF FF FF FF
            mov         ecx,widthFast
            pxor        mm7,mm7             // 00 00 00 00 00 00 00 00
            pcmpeqb     mm4,mm4             // FF FF FF FF FF FF FF FF
            mov         eax,src
            mov         edx,dst
            movq        mm6,csMMX_rgb_coeff_w
            psrlw       mm5,15              //     1     1     1     1
            lea         eax,[eax+ecx*4]
            lea         edx,[edx+ecx*4]
            pslld       mm4,24              // FF 00 00 00 FF 00 00 00
            neg         ecx

        loop_begin:
            movq        mm0,[eax+ecx*4]     // A1 R1 G1 B1 A0 R0 G0 B0
            movq        mm1,mm0
            movq        mm3,mm0
            punpcklbw   mm0,mm7             // 00 A0 00 R0 00 G0 00 B0
            punpckhbw   mm1,mm7             // 00 A1 00 R1 00 G1 00 B1
            pmaddwd     mm0,mm6             // R0*r_coeff    G0*g_coeff+B0*b_coeff
            pmaddwd     mm1,mm6             // R1*r_coeff    G1*g_coeff+B1*b_coeff
            pand        mm3,mm4             // A1 00 00 00 A0 00 00 00
            packssdw    mm0,mm1             // sR1 sG1+sB1 sR0 sG0+sB0
            pmaddwd     mm0,mm5             // sR1+sG1+sB1   sR0+sG0+sB0
            psrld       mm0,7               // 00 00 00 Gray1 00 00 00 Gray0
            movq        mm1,mm0
            movq        mm2,mm0
            pslld       mm1,8               // 00 00 Gray1 00 00 00 Gray0 00
            por         mm0,mm3
            pslld       mm2,16              // 00 Gray1 00 00 00 Gray0 00 00
            por         mm0,mm1
            por         mm0,mm2             // A1 Gray1 Gray1 Gray1 A0 Gray0 Gray0 Gray0
            movq        [edx+ecx*4],mm0
            add         ecx,2
            jnz         loop_begin
        }
    }
    //border
    if (width>widthFast)
        colorToGrayLine_int16(&src[widthFast],&dst[widthFast],width-widthFast);
}

void colorToGray_MMX(const TPixels32Ref& src,const TPixels32Ref& dst){
    long width=std::min(src.width,dst.width);
    long height=std::min(src.height,dst.height);
    Color32* srcLine=src.pdata;
    Color32* dstLine=dst.pdata;
    for (long y = 0; y < height; ++y){
        colorToGrayLine_MMX(srcLine,dstLine,width);
        src.nextLine(srcLine);
        dst.nextLine(dstLine);
    }
    asm{
        emms // done using MMX
    }
}

////////////////////////////////////////////////////////////////////////////////
// Speed test
//==============================================================================
// colorToGray_MMX             590.84 FPS
////////////////////////////////////////////////////////////////////////////////
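
For readers who do not read MMX assembly, the following portable sketch models what the loop above computes for a single pixel, step by step. It is a scalar model written for this explanation, not code from the project; the names saturate16 and grayPixelModel are assumptions. It uses the same 7-bit coefficients (38 + 75 + 15 = 128) and the 0xAARRGGBB layout implied by the register comments.

```cpp
#include <cstdint>

// 7-bit coefficients used by the MMX routine: 38 + 75 + 15 == 128.
const int kR7 = 38, kG7 = 75, kB7 = 15;

// Scalar model of packssdw for one lane: clamp a 32-bit value to int16 range.
inline std::int16_t saturate16(std::int32_t v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return (std::int16_t)v;
}

// Scalar model of the per-pixel work in the MMX loop; argb is 0xAARRGGBB.
inline std::uint32_t grayPixelModel(std::uint32_t argb) {
    const int b = argb & 0xFF;
    const int g = (argb >> 8) & 0xFF;
    const int r = (argb >> 16) & 0xFF;
    // pmaddwd with [0, r_coeff, g_coeff, b_coeff] yields two 32-bit partial sums.
    const std::int32_t lo = b * kB7 + g * kG7;
    const std::int32_t hi = r * kR7;  // the alpha word is multiplied by 0
    // packssdw narrows both sums to 16 bits; pmaddwd with [1,1,1,1] adds them.
    const std::int32_t sum = saturate16(lo) + saturate16(hi);
    const std::uint32_t gray = (std::uint32_t)sum >> 7;  // psrld 7
    // Replicate gray into B, G and R, and keep the original alpha (pand/por).
    return (argb & 0xFF000000u) | (gray << 16) | (gray << 8) | gray;
}
```

The largest partial sum is 255*(75+15) = 22950, below 32767, so the signed saturation in packssdw never actually clamps with these coefficients. That headroom is one reason to use 7 rather than 8 fractional bits here.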

H: Inline-assembly MMX version with optimized write buffering
Compared with the MMX version above, this version changes only two things:
   first, the memory write movq [edx+ecx*4],mm0 becomes movntq [edx+ecx*4],mm0, which bypasses the cache;
   second, sfence is called when the function finishes, to flush the pending writes.
The complete code:

void colorToGrayLine_MMX2(const Color32* src,Color32* dst,long width){
    //const UInt32 gray_r_coeff_7=(UInt32)( gray_r_coeff*(1<<7)+0.4999999 );
    //const UInt32 gray_g_coeff_7=(UInt32)( gray_g_coeff*(1<<7)+0.4999999 );
    //const UInt32 gray_b_coeff_7=(1<<7)-gray_r_coeff_7-gray_g_coeff_7;
    // csMMX_rgb_coeff_w= short[ 0 , gray_r_coeff_7 , gray_g_coeff_7 , gray_b_coeff_7 ]
    const UInt64 csMMX_rgb_coeff_w = (((UInt64)0x00000026)<<32) | 0x004b000f;
    long widthFast=width>>1<<1;
    if (widthFast>0){
        asm{
            pcmpeqb     mm5,mm5             // FF FF FF FF FF FF FF FF
            mov         ecx,widthFast
            pxor        mm7,mm7             // 00 00 00 00 00 00 00 00
            pcmpeqb     mm4,mm4             // FF FF FF FF FF FF FF FF
            mov         eax,src
            mov         edx,dst
            movq        mm6,csMMX_rgb_coeff_w
            psrlw       mm5,15              //     1     1     1     1
            lea         eax,[eax+ecx*4]
            lea         edx,[edx+ecx*4]
            pslld       mm4,24              // FF 00 00 00 FF 00 00 00
            neg         ecx

        loop_begin:
            movq        mm0,[eax+ecx*4]     // A1 R1 G1 B1 A0 R0 G0 B0
            movq        mm1,mm0
            movq        mm3,mm0
            punpcklbw   mm0,mm7             // 00 A0 00 R0 00 G0 00 B0
            punpckhbw   mm1,mm7             // 00 A1 00 R1 00 G1 00 B1
            pmaddwd     mm0,mm6             // R0*r_coeff    G0*g_coeff+B0*b_coeff
            pmaddwd     mm1,mm6             // R1*r_coeff    G1*g_coeff+B1*b_coeff
            pand        mm3,mm4             // A1 00 00 00 A0 00 00 00
            packssdw    mm0,mm1             // sR1 sG1+sB1 sR0 sG0+sB0
            pmaddwd     mm0,mm5             // sR1+sG1+sB1   sR0+sG0+sB0
            psrld       mm0,7               // 00 00 00 Gray1 00 00 00 Gray0
            movq        mm1,mm0
            movq        mm2,mm0
            pslld       mm1,8               // 00 00 Gray1 00 00 00 Gray0 00
            por         mm0,mm3
            pslld       mm2,16              // 00 Gray1 00 00 00 Gray0 00 00
            por         mm0,mm1
            por         mm0,mm2             // A1 Gray1 Gray1 Gray1 A0 Gray0 Gray0 Gray0
            movntq      [edx+ecx*4],mm0     // the difference from colorToGrayLine_MMX
            add         ecx,2
            jnz         loop_begin
        }
    }
    //border
    if (width>widthFast)
        colorToGrayLine_int16(&src[widthFast],&dst[widthFast],width-widthFast);
}

void colorToGray_MMX2(const TPixels32Ref& src,const TPixels32Ref& dst){
    long width=std::min(src.width,dst.width);
    long height=std::min(src.height,dst.height);
    Color32* srcLine=src.pdata;
    Color32* dstLine=dst.pdata;
    for (long y = 0; y < height; ++y){
        colorToGrayLine_MMX2(srcLine,dstLine,width);
        src.nextLine(srcLine);
        dst.nextLine(dstLine);
    }
    asm{
        sfence // flush the writes
        emms
    }
}

////////////////////////////////////////////////////////////////////////////////
// Speed test
//==============================================================================
// colorToGray_MMX2            679.50 FPS
////////////////////////////////////////////////////////////////////////////////
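
The same write pattern (non-temporal stores followed by a store fence) can also be expressed with intrinsics. The sketch below is not the article's code: it swaps MMX's movntq for the SSE2 equivalent _mm_stream_si128, only copies pixels instead of converting them, and falls back to a plain loop when SSE2 is not enabled at compile time. The function name copyPixelsStreaming and the macro HAVE_SSE2_SKETCH are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <xmmintrin.h>  // SSE:  _mm_sfence
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_stream_si128
#define HAVE_SSE2_SKETCH 1
#endif

// Copies `count` 32-bit pixels from src to dst.
// Assumptions: count is a multiple of 4 and both pointers are 16-byte aligned.
void copyPixelsStreaming(const std::uint32_t* src, std::uint32_t* dst, std::size_t count) {
#ifdef HAVE_SSE2_SKETCH
    for (std::size_t i = 0; i < count; i += 4) {
        const __m128i v = _mm_load_si128((const __m128i*)(src + i));
        _mm_stream_si128((__m128i*)(dst + i), v);  // store that bypasses the cache
    }
    _mm_sfence();  // make the streamed stores globally visible before returning
#else
    for (std::size_t i = 0; i < count; ++i)  // portable fallback: ordinary stores
        dst[i] = src[i];
#endif
}
```

Non-temporal stores tend to pay off when the destination is large and is not read again soon, as in a full-image conversion; for small buffers that are reused immediately, ordinary stores may well be faster.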

I: MMX implementation using intrinsic functions
Besides inline assembly, special instructions such as MMX/SSE can also be used through intrinsic functions,
so that SIMD instructions can be used with several compilers and portability is much better.
For now, however, vc does not optimize such code well enough, and you may even run into compiler bugs.
(Consider compiling this code with intel's compiler; its optimizer seems quite good.)

#include <mmintrin.h>   //mmx
//#include <mm3dnow.h>  //3dnow
#include <xmmintrin.h>  //sse
//#include <emmintrin.h> //sse2
//#include <pmmintrin.h> //sse3
//#include <tmmintrin.h> //ssse3
//#include <intrin.h>    //sse4a
//#include <smmintrin.h> //sse4.1
//#include <nmmintrin.h> //sse4.2
//----------------------------------

void colorToGrayLine_MMX_mmh(const Color32* src,Color32* dst,long width){
    //const UInt32 gray_r_coeff_7=(UInt32)( gray_r_coeff*(1<<7)+0.4999999 );
    //const UInt32 gray_g_coeff_7=(UInt32)( gray_g_coeff*(1<<7)+0.4999999 );
    //const UInt32 gray_b_coeff_7=(1<<7)-gray_r_coeff_7-gray_g_coeff_7;
    // csMMX_rgb_coeff_w= short[ 0 , gray_r_coeff_7 , gray_g_coeff_7 , gray_b_coeff_7 ]
    long widthFast=width>>1<<1;
    if (widthFast>0){
        const UInt64 csMMX_rgb_coeff_w =(((UInt64)0x00000026)<<32) | 0x004b000f;
        const __m64 mm6=*(const __m64*)&csMMX_rgb_coeff_w;
        const __m64 mm7=_mm_setzero_si64();     // the mm? variables hold the same values as the mmx registers in colorToGrayLine_MMX
        __m64 mm5=_mm_cmpeq_pi8(mm7,mm7);       // I wanted to write __m64 mm5; mm5=_mm_cmpeq_pi8(mm5,mm5); but that fails :(
        const __m64 mm4=_mm_slli_pi32(mm5,24);  // ...
        mm5=_mm_srli_pi16(mm5,15);              // ...

        for (long x = 0; x < widthFast; x+=2){
            __m64 mm0=*(__m64*)&src[x];
            __m64 mm1=mm0;
            __m64 mm3=mm0;
            mm0=_mm_unpacklo_pi8(mm0,mm7);
            mm1=_mm_unpackhi_pi8(mm1,mm7);
            mm0=_mm_madd_pi16(mm0,mm6);
            mm1=_mm_madd_pi16(mm1,mm6);
            mm3=_mm_and_si64(mm3,mm4);
            mm0=_mm_packs_pi32(mm0,mm1);
            mm0=_mm_madd_pi16(mm0,mm5);
            mm0=_mm_srli_pi32(mm0,7);
            mm1=mm0;
            __m64 mm2=mm0;
            mm1=_mm_slli_pi32(mm1,8);
            mm0=_mm_or_si64(mm0,mm3);
            mm2=_mm_slli_pi32(mm2,16);
            mm0=_mm_or_si64(mm0,mm1);
            mm0=_mm_or_si64(mm0,mm2);
            *(__m64*)&dst[x]=mm0;
        }
    }
    //border
    if (width>widthFast)
        colorToGrayLine_int16(&src[widthFast],&dst[widthFast],width-widthFast);
}

void colorToGray_MMX_mmh(const TPixels32Ref& src,const TPixels32Ref& dst){
    long width=std::min(src.width,dst.width);
    long height=std::min(src.height,dst.height);
    Color32* srcLine=src.pdata;
    Color32* dstLine=dst.pdata;
    for (long y = 0; y < height; ++y){
        colorToGrayLine_MMX_mmh(srcLine,dstLine,width);
        src.nextLine(srcLine);
        dst.nextLine(dstLine);
    }
    _mm_empty(); // done using MMX
}

////////////////////////////////////////////////////////////////////////////////
// Speed test
//==============================================================================
// colorToGray_MMX_mmh         508.69 FPS
////////////////////////////////////////////////////////////////////////////////

The intrinsic-function MMX implementation with optimized write buffering

void colorToGrayLine_MMX2_mmh(const Color32* src,Color32* dst,long width){
    //const UInt32 gray_r_coeff_7=(UInt32)( gray_r_coeff*(1<<7)+0.4999999 );
    //const UInt32 gray_g_coeff_7=(UInt32)( gray_g_coeff*(1<<7)+0.4999999 );
    //const UInt32 gray_b_coeff_7=(1<<7)-gray_r_coeff_7-gray_g_coeff_7;
    // csMMX_rgb_coeff_w= short[ 0 , gray_r_coeff_7 , gray_g_coeff_7 , gray_b_coeff_7 ]
    long widthFast=width>>1<<1;
    if (widthFast>0){
        const UInt64 csMMX_rgb_coeff_w =(((UInt64)0x00000026)<<32) | 0x004b000f;
        const __m64 mm6=*(const __m64*)&csMMX_rgb_coeff_w;
        const __m64 mm7=_mm_setzero_si64();     // the mm? variables hold the same values as the mmx registers in colorToGrayLine_MMX
        __m64 mm5=_mm_cmpeq_pi8(mm7,mm7);       // ...
        const __m64 mm4=_mm_slli_pi32(mm5,24);  // ...
        mm5=_mm_srli_pi16(mm5,15);              // ...

        for (long x = 0; x < widthFast; x+=2){
            __m64 mm0=*(__m64*)&src[x];
            __m64 mm1=mm0;
            __m64 mm3=mm0;
            mm0=_mm_unpacklo_pi8(mm0,mm7);
            mm1=_mm_unpackhi_pi8(mm1,mm7);
            mm0=_mm_madd_pi16(mm0,mm6);
            mm1=_mm_madd_pi16(mm1,mm6);
            mm3=_mm_and_si64(mm3,mm4);
            mm0=_mm_packs_pi32(mm0,mm1);
            mm0=_mm_madd_pi16(mm0,mm5);
            mm0=_mm_srli_pi32(mm0,7);
            mm1=mm0;
            __m64 mm2=mm0;
            mm1=_mm_slli_pi32(mm1,8);
            mm0=_mm_or_si64(mm0,mm3);
            mm2=_mm_slli_pi32(mm2,16);
            mm0=_mm_or_si64(mm0,mm1);
            mm0=_mm_or_si64(mm0,mm2);
            //*(__m64*)&dst[x]=mm0;
            _mm_stream_pi((__m64*)&dst[x],mm0);
        }
    }
    //border
    if (width>widthFast)
        colorToGrayLine_int16(&src[widthFast],&dst[widthFast],width-widthFast);
}

void colorToGray_MMX2_mmh(const TPixels32Ref& src,const TPixels32Ref& dst){
    long width=std::min(src.width,dst.width);
    long height=std::min(src.height,dst.height);
    Color32* srcLine=src.pdata;
    Color32* dstLine=dst.pdata;
    for (long y = 0; y < height; ++y){
        colorToGrayLine_MMX2_mmh(srcLine,dstLine,width);
        src.nextLine(srcLine);
        dst.nextLine(dstLine);
    }
    _mm_sfence(); // flush the writes
    _mm_empty();  // done using MMX
}

////////////////////////////////////////////////////////////////////////////////
// Speed test
//==============================================================================
// colorToGray_MMX2_mmh        540.78 FPS
////////////////////////////////////////////////////////////////////////////////

J: All the test results together:

////////////////////////////////////////////////////////////////////////////////
//CPU: AMD64x2 4200+(2.33G) 800*600 to 800*600
//==============================================================================
// colorToGray_float           145.49 FPS
// colorToGray_int16           355.33 FPS
// colorToGray_int16_expand4   413.22 FPS
// colorToGray_int8_opMul      387.97 FPS
// colorToGray_MMX             590.84 FPS
// colorToGray_MMX2            679.50 FPS
// colorToGray_MMX_mmh         508.69 FPS
// colorToGray_MMX2_mmh        540.78 FPS
////////////////////////////////////////////////////////////////////////////////

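ps: versions using SSE floating-point instructions, SSE2 integer instructions, or the SSE3 horizontal-add instructions will be added when I get the chance.
ps: for a framework for using the special SIMD instruction sets, see my article <YUV Video Format to RGB32 Format Conversion Speed Optimization, Middle Part>, which
dynamically calls the best implementation according to the instruction sets the CPU supports.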
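
As a rough idea of what such dynamic selection can look like, here is a minimal sketch that picks a line-conversion function through a function pointer. It is not the framework from the article mentioned above. All names in it are assumptions for illustration, grayLine_simd is only a placeholder that forwards to the integer version, and the CPU check relies on the GCC/Clang builtin __builtin_cpu_supports, so on other compilers the sketch simply selects the portable path.

```cpp
#include <cstdint>

typedef void (*GrayLineProc)(const std::uint32_t* src, std::uint32_t* dst, long width);

// Portable fallback: 16-bit fixed-point conversion of 0xAARRGGBB pixels.
void grayLine_int16(const std::uint32_t* src, std::uint32_t* dst, long width) {
    for (long x = 0; x < width; ++x) {
        const std::uint32_t c = src[x];
        const std::uint32_t gray = (((c >> 16) & 0xFF) * 19595u
                                  + ((c >> 8) & 0xFF) * 38470u
                                  + (c & 0xFF) * 7471u) >> 16;
        dst[x] = (c & 0xFF000000u) | (gray << 16) | (gray << 8) | gray;
    }
}

// Placeholder for a SIMD implementation (assumption: a real one would go here).
void grayLine_simd(const std::uint32_t* src, std::uint32_t* dst, long width) {
    grayLine_int16(src, dst, width);
}

// Runtime CPU check; reports false where the builtin is unavailable.
bool cpuHasSse2() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

// Pick the best available implementation once, then call it through the pointer.
GrayLineProc selectGrayLine() {
    return cpuHasSse2() ? grayLine_simd : grayLine_int16;
}
```

Selecting once at start-up and keeping the pointer avoids repeating the CPU check for every line.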
