An Analysis of the cuTree Algorithm in x264 and x265

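mbtree is an innovative technique introduced in x264 that effectively improves both subjective and objective quality (see Table 1 at the end of this article). x265 inherited the algorithm and renamed it cuTree. The implementation is fairly involved, so this article discusses the principle behind cuTree and walks through the code to analyze the implementation details.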

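Both cuTree and mbtree adjust qpOffset according to how heavily the current block is referenced. To know how heavily a block is referenced, the encoder clearly has to work backwards through the coding order.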

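For inter prediction, the quality of the reference frame clearly has a direct impact on the quality of the current frame. In other words, the coding cost of a reference block must account not only for the block itself but also for its influence on the future blocks that reference it. cuTree therefore introduces the notion of PropagateInCost when analyzing the cost of each block: the cost of a block is its own coding cost plus the cost of the later blocks that depend on it, and this additional cost is called PropagateInCost. The key question is how to determine PropagateInCost.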

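Consider the following simplified case: block B references block A completely, and the intra and inter prediction costs of B are IntraCostB and InterCostB respectively. Look at the trend: if IntraCostB is about the same as InterCostB, B obtains very little information from A; conversely, if IntraCostB is much larger than InterCostB, most of B's information can be obtained from A. Following this idea, the amount of information that B itself obtains from A can be expressed as (IntraCostB - InterCostB). Going one step further, B is itself referenced by other blocks, so the cost of B also includes PropagateInCostB. Putting the two together, the cost of B that depends on A is: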

(IntraCostB - InterCostB) + PropagateInCostB * (IntraCostB - InterCostB) / IntraCostB = (IntraCostB + PropagateInCostB) * (IntraCostB - InterCostB) / IntraCostB

Here (IntraCostB - InterCostB) / IntraCostB is the fraction of B's PropagateInCostB that is passed on to A.
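The relation described above, B's own gain from A plus the inherited share of PropagateInCostB, can be checked numerically with a minimal sketch. The helper name `propagateToReference` and the sample costs are assumptions made for illustration only; the sketch leaves out the invQscale and fps scaling that the x265 function applies.

```cpp
#include <cassert>

// Hypothetical helper (not an x265 function): the cost that block B passes
// back to its reference block A, assuming B references A completely.
double propagateToReference(double intraCostB, double interCostB, double propagateInCostB)
{
    assert(intraCostB > 0.0 && interCostB >= 0.0 && propagateInCostB >= 0.0);
    // Clamp the inter cost to the intra cost, as x265 does with X265_MIN.
    if (interCostB > intraCostB)
        interCostB = intraCostB;
    // Share of B's information that comes from A.
    double fraction = (intraCostB - interCostB) / intraCostB;
    // B's own cost plus what B inherited, scaled by that share.
    return (intraCostB + propagateInCostB) * fraction;
}
```

With IntraCostB = 1000, InterCostB = 250 and PropagateInCostB = 600, the fraction is 0.75 and the amount passed to A is 1600 * 0.75 = 1200. When the inter cost equals the intra cost, nothing is propagated.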

The x265 function below computes the propagate cost; its basic idea is exactly what is described above.

/* Estimate the total amount of influence on future quality that could be had if we
 * were to improve the reference samples used to inter predict any given CU. */
static void estimateCUPropagateCost(int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len)
{
    double fps = *fpsFactor / 256;  // range[0.01, 1.00]
    for (int i = 0; i < len; i++)
    {
        int intraCost = intraCosts[i];
        int interCost = X265_MIN(intraCosts[i], interCosts[i] & LOWRES_COST_MASK);
        double propagateIntra = intraCost * invQscales[i]; // Q16 x Q8.8 = Q24.8
        double propagateAmount = (double)propagateIn[i] + propagateIntra * fps; // Q16.0 + Q24.8 x Q0.x = Q25.0
        double propagateNum = (double)(intraCost - interCost); // Q32 - Q32 = Q33.0
        double propagateDenom = (double)intraCost;             // Q32
        dst[i] = (int)(propagateAmount * propagateNum / propagateDenom + 0.5);
    }
}

As stated above, when block B references block A completely, the PropagateInCost of A is PropagateInCostA = (IntraCostB + PropagateInCostB) * (IntraCostB - InterCostB) / IntraCostB

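Now consider the more general case. An MV rarely points at exactly one whole coded block, so the cost propagated from block B has to be added proportionally to each of the blocks it overlaps in the reference frame. The x265 cuTree function is shown below: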

void Lookahead::estimateCUPropagate(Lowres **frames, double averageDuration, int p0, int p1, int b, int referenced)
{
    uint16_t *refCosts[2] = { frames[p0]->propagateCost, frames[p1]->propagateCost };
    int32_t distScaleFactor = (((b - p0) << 8) + ((p1 - p0) >> 1)) / (p1 - p0);
    int32_t bipredWeight = m_param->bEnableWeightedBiPred ? 64 - (distScaleFactor >> 2) : 32;
    int32_t bipredWeights[2] = { bipredWeight, 64 - bipredWeight };
    int listDist[2] = { b - p0 - 1, p1 - b - 1 };

    memset(m_scratch, 0, m_8x8Width * sizeof(int));

    uint16_t *propagateCost = frames[b]->propagateCost;

    x265_emms();
    double fpsFactor = CLIP_DURATION((double)m_param->fpsDenom / m_param->fpsNum) / CLIP_DURATION(averageDuration);

    /* For non-referred frames the source costs are always zero, so just memset one row and re-use it. */
    if (!referenced)
        memset(frames[b]->propagateCost, 0, m_8x8Width * sizeof(uint16_t));

    int32_t strideInCU = m_8x8Width;
    for (uint16_t blocky = 0; blocky < m_8x8Height; blocky++)
    {
        int cuIndex = blocky * strideInCU;
        // Compute the cost that each block in this row of frames[b] propagates to its references; results go into m_scratch
        if (m_param->rc.qgSize == 8)
            primitives.propagateCost(m_scratch, propagateCost,
                       frames[b]->intraCost + cuIndex, frames[b]->lowresCosts[b - p0][p1 - b] + cuIndex,
                       frames[b]->invQscaleFactor8x8 + cuIndex, &fpsFactor, m_8x8Width);
        else
            primitives.propagateCost(m_scratch, propagateCost,
                       frames[b]->intraCost + cuIndex, frames[b]->lowresCosts[b - p0][p1 - b] + cuIndex,
                       frames[b]->invQscaleFactor + cuIndex, &fpsFactor, m_8x8Width);

        if (referenced)
            propagateCost += m_8x8Width;

        // Add the amounts computed for frames[b] proportionally to the blocks of the reference frame(s)
        for (uint16_t blockx = 0; blockx < m_8x8Width; blockx++, cuIndex++)
        {
            int32_t propagate_amount = m_scratch[blockx];
            /* Don't propagate for an intra block. */
            if (propagate_amount > 0)
            {
                /* Access width-2 bitfield. */
                int32_t lists_used = frames[b]->lowresCosts[b - p0][p1 - b][cuIndex] >> LOWRES_COST_SHIFT;
                /* Follow the MVs to the previous frame(s). */
                for (uint16_t list = 0; list < 2; list++)
                {
                    if ((lists_used >> list) & 1)
                    {
#define CLIP_ADD(s, x) (s) = (uint16_t)X265_MIN((s) + (x), (1 << 16) - 1)
                        int32_t listamount = propagate_amount;
                        /* Apply bipred weighting. */
                        if (lists_used == 3)
                            listamount = (listamount * bipredWeights[list] + 32) >> 6;

                        MV *mvs = frames[b]->lowresMvs[list][listDist[list]];

                        /* Early termination for simple case of mv0. */
                        // MV is (0, 0): add directly to the reference frame's propagateCost array
                        if (!mvs[cuIndex].word)
                        {
                            CLIP_ADD(refCosts[list][cuIndex], listamount);
                            continue;
                        }

                        // MV is not (0, 0): the reference area overlaps four blocks idx0, idx1, idx2, idx3, weighted by idx0weight, idx1weight, idx2weight, idx3weight
                        int32_t x = mvs[cuIndex].x;
                        int32_t y = mvs[cuIndex].y;
                        int32_t cux = (x >> 5) + blockx;
                        int32_t cuy = (y >> 5) + blocky;
                        int32_t idx0 = cux + cuy * strideInCU;
                        int32_t idx1 = idx0 + 1;
                        int32_t idx2 = idx0 + strideInCU;
                        int32_t idx3 = idx0 + strideInCU + 1;
                        x &= 31;
                        y &= 31;
                        int32_t idx0weight = (32 - y) * (32 - x);
                        int32_t idx1weight = (32 - y) * x;
                        int32_t idx2weight = y * (32 - x);
                        int32_t idx3weight = y * x;

                        /* We could just clip the MVs, but pixels that lie outside the frame probably shouldn't
                         * be counted. */
                        if (cux < m_8x8Width - 1 && cuy < m_8x8Height - 1 && cux >= 0 && cuy >= 0)
                        {
                            CLIP_ADD(refCosts[list][idx0], (listamount * idx0weight + 512) >> 10);
                            CLIP_ADD(refCosts[list][idx1], (listamount * idx1weight + 512) >> 10);
                            CLIP_ADD(refCosts[list][idx2], (listamount * idx2weight + 512) >> 10);
                            CLIP_ADD(refCosts[list][idx3], (listamount * idx3weight + 512) >> 10);
                        }
                        else /* Check offsets individually */
                        {
                            if (cux < m_8x8Width && cuy < m_8x8Height && cux >= 0 && cuy >= 0)
                                CLIP_ADD(refCosts[list][idx0], (listamount * idx0weight + 512) >> 10);
                            if (cux + 1 < m_8x8Width && cuy < m_8x8Height && cux + 1 >= 0 && cuy >= 0)
                                CLIP_ADD(refCosts[list][idx1], (listamount * idx1weight + 512) >> 10);
                            if (cux < m_8x8Width && cuy + 1 < m_8x8Height && cux >= 0 && cuy + 1 >= 0)
                                CLIP_ADD(refCosts[list][idx2], (listamount * idx2weight + 512) >> 10);
                            if (cux + 1 < m_8x8Width && cuy + 1 < m_8x8Height && cux + 1 >= 0 && cuy + 1 >= 0)
                                CLIP_ADD(refCosts[list][idx3], (listamount * idx3weight + 512) >> 10);
                        }
                    }
                }
            }
        }
    }

    if (m_param->rc.vbvBufferSize && m_param->lookaheadDepth && referenced)
        cuTreeFinish(frames[b], averageDuration, b == p1 ? b - p0 : 0);
}
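The four-way split in the non-zero MV branch above can be isolated as a small sketch. The names `Split` and `splitByMv` are hypothetical and not part of x265. The sketch assumes quarter-pel MVs on 8x8 lowres blocks, which is what the shift by 5 and the mask of 31 in the listing imply, and it omits the frame-boundary checks and the 16-bit saturation of CLIP_ADD.

```cpp
#include <array>
#include <cassert>

// Hypothetical illustration: where a propagated amount lands in the reference frame.
struct Split
{
    int cux, cuy;               // top-left overlapped block, in block coordinates
    std::array<int, 4> amount;  // shares for top-left, top-right, bottom-left, bottom-right
};

Split splitByMv(int blockx, int blocky, int mvx, int mvy, int listamount)
{
    assert(listamount >= 0);
    Split s;
    // 32 quarter-pel units span one 8-pixel block. Like the x265 listing, this
    // relies on >> being an arithmetic shift for negative values.
    s.cux = (mvx >> 5) + blockx;
    s.cuy = (mvy >> 5) + blocky;
    int x = mvx & 31;           // horizontal offset inside the block, 0..31
    int y = mvy & 31;           // vertical offset inside the block, 0..31
    // Overlap areas with the four neighbouring blocks; they always sum to 1024.
    int w[4] = { (32 - y) * (32 - x), (32 - y) * x, y * (32 - x), y * x };
    for (int i = 0; i < 4; i++)
        s.amount[i] = (listamount * w[i] + 512) >> 10;  // rounded division by 1024
    return s;
}
```

For example, an MV of (-8, 40) applied to block (3, 2) lands on block (2, 3) with offsets x = 24 and y = 8, so an amount of 1024 is split into 192, 576, 64 and 192.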

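Finally, the QPOffset of the current CU must depend on PropagateInCost: the larger PropagateInCost is, the smaller the CU's QP should be. QPOffset is negative, so it should also become smaller (more negative). In x265 the cuTree offset is QPOffset = -strength * log2(1 + PropagateInCost / IntraCost). For the concrete code, see the function cuTreeFinish below.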

void Lookahead::cuTreeFinish(Lowres *frame, double averageDuration, int ref0Distance)
{
    int fpsFactor = (int)(CLIP_DURATION(averageDuration) / CLIP_DURATION((double)m_param->fpsDenom / m_param->fpsNum) * 256);
    double weightdelta = 0.0;

    if (ref0Distance && frame->weightedCostDelta[ref0Distance - 1] > 0)
        weightdelta = (1.0 - frame->weightedCostDelta[ref0Distance - 1]);

    frame->qpAvgFrmCuTreeOffset = 0.0;
    for (int cuIndex = 0; cuIndex < m_cuCount; cuIndex++)
    {
        int intracost = (frame->intraCost[cuIndex] * frame->invQscaleFactor[cuIndex] + 128) >> 8;
        if (intracost)
        {
            int propagateCost = (frame->propagateCost[cuIndex] * fpsFactor + 128) >> 8;
            double log2_ratio = X265_LOG2(intracost + propagateCost) - X265_LOG2(intracost) + weightdelta;
            frame->qpCuTreeOffset[cuIndex] = frame->qpAqOffset[cuIndex] - m_cuTreeStrength * log2_ratio;
            frame->qpAvgFrmCuTreeOffset += frame->qpCuTreeOffset[cuIndex];
        }
    }
    frame->qpAvgFrmCuTreeOffset /= m_cuCount;
}
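The core of the offset computation can be tried out in isolation. The following is a minimal sketch: `cuTreeQpOffset` is a hypothetical helper, and it deliberately drops the AQ offset, the fps scaling and the weighted-prediction delta that cuTreeFinish also applies. The strength of 2.0 used in the example is just a sample value.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: QPOffset = -strength * log2(1 + PropagateInCost / IntraCost)
double cuTreeQpOffset(double intraCost, double propagateInCost, double strength)
{
    assert(intraCost > 0.0 && propagateInCost >= 0.0);
    return -strength * std::log2(1.0 + propagateInCost / intraCost);
}
```

With strength = 2.0, a block that nobody references (PropagateInCost = 0) gets an offset of 0, while a block whose PropagateInCost is three times its IntraCost gets -2 * log2(4) = -4.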

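Tables 1 and 2 below show the effect of cuTree on objective coding quality in x265 v2.4. The encoder configuration is preset=medium, ratecontrol=ABR, BFrames=3 (or 0), aq-mode=off, and the test sequences are the HEVC Class B (1080p) set. With BFrames=3, enabling cuTree saves 6.52% bitrate on Y, 15.38% on U and 15.56% on V, a very clear gain in compression efficiency. With BFrames=0, enabling cuTree increases the Y bitrate by 0.72% and saves 2.4% on U and 1.4% on V, so there is essentially no gain.

The reason is that with BFrames=3 cuTree lowers the QP of I and P frames by a large amount, lowers the QP of referenced B frames moderately, and leaves the QP of non-referenced B frames unchanged, which is essentially similar to the hierarchical QP scheme in HM. With BFrames=0, the QP of every P frame is lowered by roughly the same amount, which is effectively the same as not adjusting QP at all.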

Table 1. Bitrate savings from cuTree in x265 (BFrames=3)

Sequence	BD-Rate Y	BD-Rate U	BD-Rate V
BasketballDrive	-4.8%	-8.5%	-3.5%
Bqterrace	-3.5%	-19.2%	-21.3%
Cactus	-9.6%	-15.9%	-13.9%
Kimono	-3.3%	-13.4%	-17.1%
ParkScene	-11.4%	-19.9%	-22.0%
Average	-6.52%	-15.38%	-15.56%

Table 2. Bitrate savings from cuTree in x265 (BFrames=0)

Sequence	BD-Rate Y	BD-Rate U	BD-Rate V
BasketballDrive	2.4%	2.3%	4.9%
Bqterrace	4.1%	-0.9%	0.8%
Cactus	-1.8%	-4.6%	-2.3%
Kimono	2.3%	-0.6%	-2.7%
ParkScene	-3.4%	-8.2%	-7.7%
Average	0.72%	-2.4%	-1.4%

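One point deserves attention: as described above, cuTree derives qpOffset by working from later frames back to earlier ones. When rate control is enabled, x264 and x265 use a lookahead mechanism, which looks ahead from the current frame and uses the upcoming frames to assign a suitable QP to the current frame, decide its frame type, and so on. In the code, the number of frames cuTree looks ahead equals lookahead_num. For example, with the x265 medium preset lookahead_num defaults to 20, so cuTree starts from the 20th frame after the current frame and propagates backwards until it reaches the current frame to obtain qpOffset. lookahead_num therefore directly affects the cuTree result: different values give slightly different QPOffset values, although the effect is not large.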
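Why the lookahead length changes the result only slightly can be seen in a toy model. The sketch below is an assumption-laden simplification, not x265 code: one block per frame, every frame fully referencing the previous one, and identical costs in all frames. The function name `propagateInOfFirstFrame` is hypothetical.

```cpp
#include <cassert>

// Toy model: PropagateInCost accumulated by the current frame after walking
// back from the last frame of the lookahead window, one frame per step.
double propagateInOfFirstFrame(int lookahead, double intraCost, double interCost)
{
    assert(lookahead >= 0 && intraCost > 0.0);
    assert(interCost >= 0.0 && interCost <= intraCost);
    double fraction = (intraCost - interCost) / intraCost;
    double propagateIn = 0.0;            // the last frame is referenced by nobody
    for (int i = 0; i < lookahead; i++)  // step back towards the current frame
        propagateIn = (intraCost + propagateIn) * fraction;
    return propagateIn;
}
```

With IntraCost = 1000 and InterCost = 500, the value after 1, 2 and 3 steps is 500, 750 and 875, and it approaches 1000 as the window grows. In this toy setting, going from 20 to 40 frames changes the result by less than 0.01, because contributions from distant frames shrink geometrically.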

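Note also that in x265 both the frame-type decision and the cuTree QPOffset computation operate on mini-GOPs, that is, the parameters needed for encoding are determined one mini-GOP at a time. With 3 B frames, the cuTree pass therefore runs once every 4 frames (bBbP), whereas with 0 B frames it runs for every P frame. Each pass propagates back through 20 (lookahead_num) frames, which is a considerable amount of computation. This is very likely why encoding at very high resolutions is sometimes slower with 0 B frames than with 3.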
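The cost argument above can be put into rough numbers. This is only a back-of-the-envelope sketch under stated assumptions: a fixed mini-GOP of BFrames + 1 frames, one cuTree pass per mini-GOP, and each pass visiting lookahead_num frames. The function name `cuTreeFrameVisits` is hypothetical.

```cpp
#include <cassert>

// Rough estimate of how many frame-level propagation steps cuTree performs
// while encoding totalFrames frames.
long long cuTreeFrameVisits(int totalFrames, int bframes, int lookahead)
{
    assert(totalFrames >= 0 && bframes >= 0 && lookahead >= 0);
    int miniGopSize = bframes + 1;
    // One pass per mini-GOP, rounding up for a trailing partial mini-GOP.
    long long passes = (totalFrames + miniGopSize - 1) / miniGopSize;
    return passes * lookahead;
}
```

For 1000 frames and a lookahead of 20, this gives 5000 frame visits with 3 B frames and 20000 with 0 B frames, a fourfold difference in this simplified model.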
