圖形圖像處理－之－高質量的快速的圖像縮放上篇近鄰取樣插值和其速度優化

tag:圖像縮放,速度優化,定點數優化,近鄰取樣插值,二次線性插值,三次線性插值,
MipMap鏈,三次卷積插值,MMX,SSE,SSE2,CPU緩存優化

摘要:首先給出一個基本的圖像縮放算法，然後一步一步的優化其速度和縮放質量；

高質量的快速的圖像縮放全文分爲:
     上篇近鄰取樣插值和其速度優化
     中篇二次線性插值和三次卷積插值
     下篇三次線性插值和MipMap鏈

正文：

爲了便於討論，這裏只處理32bit的ARGB顏色；
代碼使用C++;涉及到彙編優化的時候假定爲x86平臺;使用的編譯器爲vc2005;
爲了代碼的可讀性,沒有加入異常處理代碼;
測試使用的CPU爲AMD64x2 4200+(2.37G) 和 Intel Core2 4400(2.00G);

速度測試說明:
只測試內存數據到內存數據的縮放
測試圖片都是800*600縮放到1024*768; fps表示每秒鐘的幀數,值越大表示函數越快
////////////////////////////////////////////////////////////////////////////////
//Windows GDI相關函數參考速度:
//==============================================================================
// BitBlt             544.7 fps //is copy 800*600 to 800*600
// BitBlt             331.6 fps //is copy 1024*1024 to 1024*1024
// StretchBlt         232.7 fps //is zoom 800*600 to 1024*1024
////////////////////////////////////////////////////////////////////////////////

A: 首先定義圖像數據結構:

#define asm __asm
typedef unsigned char TUInt8; // [0..255]
struct TARGB32      //32 bit color
{
    TUInt8 B,G,R,A;          // A is alpha
};
struct TPicRegion //一塊顏色數據區的描述，便於參數傳遞
{
    TARGB32*    pdata;         //顏色數據首地址
long        byte_width;    //一行數據的物理寬度(字節寬度)；
//abs(byte_width)有可能大於等於width*sizeof(TARGB32);
long        width;         //像素寬度
long        height;        //像素高度
};
//那麼訪問一個點的函數可以寫爲：
inline TARGB32& Pixels(const TPicRegion& pic,const long x,const long y)
{
return ( (TARGB32*)((TUInt8*)pic.pdata+pic.byte_width*y) )[x];
}

B: 縮放原理和公式圖示:

縮放後圖片原圖片
(寬DW,高DH) (寬SW,高SH)

(Sx-0)/(SW-0)=(Dx-0)/(DW-0) (Sy-0)/(SH-0)=(Dy-0)/(DH-0)
=> Sx=Dx*SW/DW Sy=Dy*SH/DH

C: 縮放算法的一個參考實現

//給出一個最簡單的縮放函數(插值方式爲近鄰取樣,而且我“盡力”把它寫得慢一些了:D)
//Src.PColorData指向源數據區,Dst.PColorData指向目的數據區
//函數將大小爲Src.Width*Src.Height的圖片縮放到Dst.Width*Dst.Height的區域中

void PicZoom0(const TPicRegion& Dst,const TPicRegion& Src)
{
if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;
for (long x=0;x
    {
for (long y=0;y
        {
long srcx=(x*Src.width/Dst.width);
long srcy=(y*Src.height/Dst.height);
            Pixels(Dst,x,y)=Pixels(Src,srcx,srcy);
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
//速度測試:
//==============================================================================
// PicZoom0 19.4 fps
////////////////////////////////////////////////////////////////////////////////

D: 優化PicZoom0函數

a.PicZoom0函數並沒有按照顏色數據在內存中的排列順序讀寫(內部循環遞增y行
索引)，將造成CPU緩存預讀失敗和內存顛簸導致巨大的性能損失,(很多硬件都有這種特性,
包括緩存、內存、顯存、硬盤等,優化順序訪問，隨機訪問時會造成巨大的性能損失)
所以先交換x,y循環的順序:

void PicZoom1(const TPicRegion& Dst,const TPicRegion& Src)
{
if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;
for (long y=0;y
    {
for (long x=0;x
        {
long srcx=(x*Src.width/Dst.width);
long srcy=(y*Src.height/Dst.height);
            Pixels(Dst,x,y)=Pixels(Src,srcx,srcy);
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
//速度測試:
//==============================================================================
// PicZoom1 30.1 fps
////////////////////////////////////////////////////////////////////////////////

b.“(x*Src.Width/Dst.Width)”表達式中有一個除法運算，它屬於很慢的操作(比一般
的加減運算慢幾十倍!),使用定點數的方法來優化它；

void PicZoom2(const TPicRegion& Dst,const TPicRegion& Src)
{
if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;
    //函數能夠處理的最大圖片尺寸65536*65536
    unsigned long xrIntFloat_16=(Src.width<<16)/Dst.width+1; //16.16格式定點數
    unsigned long yrIntFloat_16=(Src.height<<16)/Dst.height+1; //16.16格式定點數

//可證明: (Dst.width-1)*xrIntFloat_16

    for (unsigned long y=0;y
    {
        for (unsigned long x=0;x
        {
            unsigned long srcx=(x*xrIntFloat_16)>>16;
            unsigned long srcy=(y*yrIntFloat_16)>>16;
            Pixels(Dst,x,y)=Pixels(Src,srcx,srcy);
        }
    }

}

////////////////////////////////////////////////////////////////////////////////
//速度測試:
//==============================================================================
// PicZoom2 185.8 fps
////////////////////////////////////////////////////////////////////////////////

c. 在x的循環中y一直不變，那麼可以提前計算與y相關的值; 1.可以發現srcy的值和x變量無關，可以提前到x軸循環之前；2.展開Pixels函數，優化與y相關的指針計算；

void PicZoom3(const TPicRegion& Dst,const TPicRegion& Src)
{
if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;
unsigned long xrIntFloat_16=(Src.width<<16)/Dst.width+1;
unsigned long yrIntFloat_16=(Src.height<<16)/Dst.height+1;

    unsigned long dst_width=Dst.width;
    TARGB32* pDstLine=Dst.pdata;
    unsigned long srcy_16=0;
    for (unsigned long y=0;y
    {
        TARGB32* pSrcLine=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*(srcy_16>>16)));
        unsigned long srcx_16=0;
        for (unsigned long x=0;x
        {
            pDstLine[x]=pSrcLine[srcx_16>>16];
            srcx_16+=xrIntFloat_16;
        }
        srcy_16+=yrIntFloat_16;
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
}

////////////////////////////////////////////////////////////////////////////////
//速度測試:
//==============================================================================
// PicZoom3 414.4 fps
////////////////////////////////////////////////////////////////////////////////

d.定點數優化使函數能夠處理的最大圖片尺寸和縮放結果(肉眼不可察覺的誤差)受到了一
定的影響,這裏給出一個使用浮點運算的版本,可以在有這種需求的場合使用:

void PicZoom3_float(const TPicRegion& Dst,const TPicRegion& Src)
{
//注意: 該函數需要FPU支持
if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;
    double xrFloat=1.000000001/((double)Dst.width/Src.width);
    double yrFloat=1.000000001/((double)Dst.height/Src.height);
    unsigned short RC_Old;
    unsigned short RC_Edit;
    asm //設置FPU的取整方式爲了直接使用fist浮點指令
    {
        FNSTCW RC_Old             // 保存協處理器控制字,用來恢復
        FNSTCW RC_Edit            // 保存協處理器控制字,用來修改
        FWAIT
        OR      RC_Edit, 0x0F00 // 改爲 RC=11 使FPU向零取整
        FLDCW   RC_Edit            // 載入協處理器控制字,RC場已經修改
    }
    unsigned long dst_width=Dst.width;
    TARGB32* pDstLine=Dst.pdata;
double srcy=0;
for (unsigned long y=0;y
    {
        TARGB32* pSrcLine=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*((long)srcy)));
/**//*
        double srcx=0;
        for (unsigned long x=0;x
        {
            pDstLine[x]=pSrcLine[(unsigned long)srcx];//因爲默認的浮點取整是一個很慢
                                     //的操作! 所以才使用了直接操作FPU的內聯彙編代碼。
            srcx+=xrFloat;
        }*/
        asm fld       xrFloat            //st0==xrFloat
        asm fldz                         //st0==0   st1==xrFloat
        unsigned long srcx=0;
for (long x=0;x
        {
            asm fist dword ptr srcx      //srcx=(long)st0
            pDstLine[x]=pSrcLine[srcx];
            asm fadd st,st(1)           //st0+=st1   st1==xrFloat
        }
        asm fstp      st
        asm fstp      st
        srcy+=yrFloat;
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
    asm //恢復FPU的取整方式
    {
        FWAIT
        FLDCW   RC_Old
    }
}

////////////////////////////////////////////////////////////////////////////////
//速度測試:
//==============================================================================
// PicZoom3_float 286.2 fps
////////////////////////////////////////////////////////////////////////////////

e.注意到這樣一個事實:每一行的縮放比例是固定的;那麼可以預先建立一個縮放映射表格
來處理縮放映射算法(PicZoom3_Table和PicZoom3_float的實現等價);

void PicZoom3_Table(const TPicRegion& Dst,const TPicRegion& Src)
{
if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;
    unsigned long dst_width=Dst.width;
    unsigned long* SrcX_Table = new unsigned long[dst_width];
for (unsigned long x=0;x
    {
        SrcX_Table[x]=(x*Src.width/Dst.width);
    }
    TARGB32* pDstLine=Dst.pdata;
for (unsigned long y=0;y
    {
        unsigned long srcy=(y*Src.height/Dst.height);
        TARGB32* pSrcLine=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*srcy));
for (unsigned long x=0;x
            pDstLine[x]=pSrcLine[SrcX_Table[x]];
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
    delete [] SrcX_Table;
}

////////////////////////////////////////////////////////////////////////////////
//速度測試:
//==============================================================================
// PicZoom3_Table 390.1 fps
////////////////////////////////////////////////////////////////////////////////

f.爲了加快縮放，可以採用根據縮放比例動態生成函數的方式來得到更快的縮放函數；這
有點像編譯器的工作原理；要實現它需要的工作量比較大(或比較晦澀)就不再實現了；
(動態生成是一種不錯的思路，但個人覺得對於縮放，實現它的必要性不大)

g.現代CPU中，在讀取數據和寫入數據時，都有自動的緩存機制；很容易知道，算法中生
成的數據不會很快再次使用，所以不需要寫入緩存的幫助；在SSE指令集中增加了movntq
等指令來完成這個功能；
(嘗試過利用CPU顯式prefetcht0、prefetchnta預讀指令或直接的mov讀取指令等速度反
而略有下降:( 但預讀在copy算法中速度優化效果很明顯 )

void PicZoom3_SSE(const TPicRegion& Dst,const TPicRegion& Src)
{
//警告: 函數需要CPU支持MMX和movntq指令
if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;
    unsigned long xrIntFloat_16=(Src.width<<16)/Dst.width+1;
    unsigned long yrIntFloat_16=(Src.height<<16)/Dst.height+1;
    unsigned long dst_width=Dst.width;
    TARGB32* pDstLine=Dst.pdata;
    unsigned long srcy_16=0;
for (unsigned long y=0;y
    {
        TARGB32* pSrcLine=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*(srcy_16>>16)));
        asm
        {
            push      ebp
            mov       esi,pSrcLine
            mov       edi,pDstLine
            mov       edx,xrIntFloat_16
            mov       ecx,dst_width
            xor       ebp,ebp           //srcx_16=0
            and    ecx, (not 3)    //循環4次展開
            TEST   ECX,ECX   //nop
            jle    EndWriteLoop
            lea       edi,[edi+ecx*4]
            neg       ecx
//todo: 預讀
                WriteLoop:
                        mov       eax,ebp
                        shr       eax,16 //srcx_16>>16
                        lea       ebx,[ebp+edx]
                        movd      mm0,[esi+eax*4]
                        shr       ebx,16 //srcx_16>>16
                        PUNPCKlDQ mm0,[esi+ebx*4]
                        lea       ebp,[ebp+edx*2]
// movntq qword ptr [edi+ecx*4], mm0 //不使用緩存的寫入指令
                        asm _emit 0x0F asm _emit 0xE7 asm _emit 0x04 asm _emit 0x8F
                        mov       eax,ebp
                        shr       eax,16 //srcx_16>>16
                        lea       ebx,[ebp+edx]
                        movd      mm1,[esi+eax*4]
                        shr       ebx,16 //srcx_16>>16
                        PUNPCKlDQ mm1,[esi+ebx*4]
                        lea       ebp,[ebp+edx*2]
// movntq qword ptr [edi+ecx*4+8], mm1 //不使用緩存的寫入指令
                        asm _emit 0x0F asm _emit 0xE7 asm _emit 0x4C asm _emit 0x8F asm _emit 0x08
                        add ecx, 4
                        jnz WriteLoop
//sfence //刷新寫入
                        asm _emit 0x0F asm _emit 0xAE asm _emit 0xF8
                        emms
                EndWriteLoop:
            mov    ebx,ebp
            pop    ebp
//處理邊界循環次數爲0,1,2,3；(這個循環可以展開,做一個跳轉表,略)
            mov    ecx,dst_width
            and    ecx,3
            TEST   ECX,ECX
            jle    EndLineZoom
            lea       edi,[edi+ecx*4]
            neg       ecx
      StartBorder:
            mov       eax,ebx
            shr       eax,16 //srcx_16>>16
            mov       eax,[esi+eax*4]
            mov       [edi+ecx*4],eax
            add       ebx,edx
            inc       ECX
            JNZ       StartBorder
      EndLineZoom:
        }
//
        srcy_16+=yrIntFloat_16;
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
}

//=====================================================================

//鑑於有讀者反映彙編代碼閱讀困難，這裏給出一個使用intel提供的函數調用方式的實現，
//讀者可以相互對照來閱讀代碼
//要編譯PicZoom3_SSE_mmh,需要#include #include
//並且需要編譯器支持
//函數PicZoom3_SSE_mmh速度爲 593.7 fps

void PicZoom3_SSE_mmh(const TPicRegion& Dst,const TPicRegion& Src)
{
//警告: 函數需要CPU支持MMX和movntq指令

if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;

unsigned long xrIntFloat_16=(Src.width<<16)/Dst.width+1;
unsigned long yrIntFloat_16=(Src.height<<16)/Dst.height+1;

    unsigned long dst_width=Dst.width;
    TARGB32* pDstLine=Dst.pdata;
    unsigned long srcy_16=0;
    unsigned long for4count=dst_width/4*4;
    for (unsigned long y=0;y
    {
        TARGB32* pSrcLine=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*(srcy_16>>16)));
        unsigned long srcx_16=0;
        unsigned long x;
        for (x=0;x
        {
            __m64 m0=_m_from_int(*(int*)(&pSrcLine[srcx_16>>16]));
            srcx_16+=xrIntFloat_16;
            m0=_m_punpckldq(m0, _m_from_int(*(int*)(&pSrcLine[srcx_16>>16])) );
            srcx_16+=xrIntFloat_16;
            __m64 m1=_m_from_int(*(int*)(&pSrcLine[srcx_16>>16]));
            srcx_16+=xrIntFloat_16;
            m1=_m_punpckldq(m1, _m_from_int(*(int*)(&pSrcLine[srcx_16>>16])) );
            srcx_16+=xrIntFloat_16;
            _mm_stream_pi((__m64 *)&pDstLine[x],m0); //不使用緩存的寫入指令
            _mm_stream_pi((__m64 *)&pDstLine[x+2],m1); //不使用緩存的寫入指令
        }
        for (x=for4count;x
        {
            pDstLine[x]=pSrcLine[srcx_16>>16];
            srcx_16+=xrIntFloat_16;
        }
        srcy_16+=yrIntFloat_16;
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
    _m_empty();
}

////////////////////////////////////////////////////////////////////////////////
//速度測試:
//==============================================================================
// PicZoom3_SSE 711.7 fps
////////////////////////////////////////////////////////////////////////////////

E: 縮放效果圖：

原圖放大圖(x軸放大8倍,y軸放大12倍)

原圖縮小圖(縮小到0.66倍) 放大圖(放大到1.6倍)

F: 把測試成績放在一起：
////////////////////////////////////////////////////////////////////////////////
//CPU: AMD64x2 4200+(2.1G) zoom 800*600 to 1024*768
//==============================================================================
// BitBlt             544.7 fps //is copy 800*600 to 800*600
// BitBlt             331.6 fps //is copy 1024*1024 to 1024*1024
// StretchBlt         232.7 fps //is zoom 800*600 to 1024*1024
//
// PicZoom0            19.4 fps
// PicZoom1            30.1 fps
// PicZoom2           185.8 fps
// PicZoom3           414.4 fps
// PicZoom3_float     286.2 fps
// PicZoom3_Table     390.1 fps
// PicZoom3_SSE       711.7 fps
////////////////////////////////////////////////////////////////////////////////

補充Intel Core2 4400上的測試成績:
////////////////////////////////////////////////////////////////////////////////
//CPU: Intel Core2 4400(2.00G) zoom 800*600 to 1024*768
//==============================================================================
// PicZoom0            15.0 fps
// PicZoom1            63.9 fps
// PicZoom2           231.2 fps
// PicZoom3           460.5 fps
// PicZoom3_float     422.5 fps
// PicZoom3_Table     457.6 fps
// PicZoom3_SSE      1099.7 fps
////////////////////////////////////////////////////////////////////////////////

圖形圖像處理－之－高質量的快速的圖像縮放上篇近鄰取樣插值和其速度優化

PDManer [元數建模]-v4.9.0 發佈：一款簡單好用的數據庫建模平臺

使用neovim打造go ide(支持代碼跳轉, 代碼補全, 實時語法檢查)

sql求連續值問題

cs01 CSS Syntax

sql server sp_executesql 中使用表變量進行查詢

挑戰程序設計競賽 2.3章習題 poj 3046 Ant Counting

[MASM拾遺]Offset僞指令

h30 HTML Layout Elements

瞭解顯卡

一款基於C#開發的通訊調試工具（支持Modbus RTU、MQTT調試）

Android開發入門之三--Android Overview

Qualcomm QTV Player

RTP/RTCP/RTSP/SIP/SDP

圖形圖像處理－之－高質量的快速的圖像縮放下篇三次線性插值和MipMap鏈

Android的多媒體框架OpenCore介紹

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結

圖形圖像處理－之－高質量的快速的圖像縮放 上篇 近鄰取樣插值和其速度優化

圖形圖像處理－之－高質量的快速的圖像縮放上篇近鄰取樣插值和其速度優化