Faster R-CNN Paper Notes

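Highlight: the paper introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, which makes region proposals nearly cost-free. [Translation reference: http://blog.csdn.net/mw_mustwin/article/details/53039338]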

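Personal take: why does sharing convolutional features make proposals nearly cost-free? Because the anchor part runs alongside feature extraction on the same feature map: one side supplies the boxes, the other supplies the features, and once a box is drawn on the map the downstream computation can run on it directly.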

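The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores. It is trained end-to-end to generate high-quality region proposals, which Fast R-CNN then uses for detection. The authors further merge RPN and Fast R-CNN into a single network by sharing their convolutional features; in the recently popular terminology of neural networks with "attention" mechanisms, the RPN component tells the network where to look. With the VGG-16 model, the detection system runs at 5 fps on a GPU (including all steps) while achieving state-of-the-art detection accuracy on PASCAL VOC 2007, 2012, and MS COCO with only 300 proposals per image. The code is publicly available.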

1. Introduction

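Recently, region proposal methods and region-based convolutional neural networks (R-CNNs) have been successful in object detection. Although R-CNNs are inherently expensive to compute, sharing convolutions across proposals has drastically reduced their cost. Ignoring the time spent on region proposals, the improved Fast R-CNN with deep networks achieves near real-time rates, so proposal computation is the main bottleneck of current detection systems. (Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.)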

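Annotations:

Paragraphs 1 and 2 say that current networks work well and pinpoint the problem: region proposal algorithms are time-consuming.

Paragraph 3 notes that part of the cost comes from region proposal methods running on the CPU, but that even generating proposals on the GPU to save time would still ignore the structure of the detection network and miss the opportunity to share computation. Paragraphs 4 and 5 then introduce the algorithm proposed in the paper.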

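Paragraph 4 introduces the RPN, which shares convolutional layers with state-of-the-art object detection networks and so removes most of the time spent computing proposals. Paragraph 5 explains where the idea comes from: the authors observe that the convolutional feature maps in Fast R-CNN can be used not only by the RoI pooling layer but also, with a few additional convolutional layers on top, to generate region proposals. That is how the RPN structure arises; as the figure shows, the extra part on the left is the RPN. The RPN is therefore essentially a fully convolutional network, trained end-to-end specifically for the task of generating region proposals.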

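Paragraphs 6 and 7 cover the structural details, which are the core of the paper. The RPN is designed to predict region proposals with a wide range of scales and aspect ratios. Earlier methods obtained proposals at different scales and aspect ratios through pyramids of images or pyramids of filters. This paper instead introduces "anchor" boxes, which can be viewed as pyramids of regression references (that is, several proposals of different sizes are predicted at the same location), as shown in the figure. This scheme avoids enumerating images or filters of multiple scales and aspect ratios, and both training and testing use images of a single scale. To unify RPN and Fast R-CNN, the authors propose a training scheme that alternates between fine-tuning for the region proposal task and fine-tuning for object detection. The scheme converges quickly and produces a unified network whose convolutional features are shared between the two tasks.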

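Paragraphs 8 to 10 present the results. Evaluated on PASCAL VOC, the method achieves higher detection accuracy than Selective Search + Fast R-CNN while removing the test-time computational burden of Selective Search. Even with the expensive VGG-16 model, the system still reaches 5 fps on a GPU (including all steps), which makes it practical in terms of both speed and accuracy. The paper also reports results on MS COCO and shows how MS COCO data improves the detection results on PASCAL VOC. The code is publicly available.

Paragraphs 9 and 10 mention the earlier version of the paper and note that the method has since been adopted by many other methods and by commercial systems.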

2. Related Work

Object Proposals

There is a large literature on object proposal methods; the paper lists several of them.

Deep Networks for Object Detection

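R-CNN: trains CNNs end-to-end to classify the proposal regions. It mainly acts as a classifier and does not predict object bounds (except for refinement by bounding-box regression), so its accuracy depends on the performance of the region proposal module. Several papers have proposed using deep networks to predict object bounding boxes.

OverFeat: trains a fully-connected layer to predict the box coordinates for a localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer to detect objects of multiple classes.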

MultiBox: generates region proposals from a network whose last fully-connected (fc) layer simultaneously predicts multiple class-agnostic boxes, generalizing the "single-box" fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN.

Shared computation of convolutions has also become popular recently, mainly through SPPnet and Fast R-CNN.

3. FASTER R-CNN

The authors call their object detection system Faster R-CNN. It consists of two modules: a deep fully convolutional network that proposes regions, and the Fast R-CNN detector that uses the proposed regions.

In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

3.1 Region Proposal Networks

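RPN: takes an image of any size as input and outputs a set of rectangular object proposals, each with an objectness score. The authors model this process with a fully convolutional network. Because the goal is to share computation with the Fast R-CNN object detection network, both networks are assumed to share a common set of convolutional layers.

Generating region proposals

As shown in the figure, a small network slides over the convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF, 512-d for VGG), and this feature is fed into two sibling fully-connected layers: a box-regression layer (reg) and a box-classification layer (cls). (The paper models this process with a fully convolutional network: an n*n convolutional layer followed by two sibling 1*1 convolutional layers.)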
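The sliding-window head described above can be sketched in plain NumPy to make the output shapes concrete. This is an illustrative sketch, not the authors' Caffe implementation: the helper names (`conv2d_same`, `init_rpn_params`, `rpn_head`), the zero padding, and the ReLU after the n*n layer are assumptions made for the example.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Cross-correlate x (C_in, H, W) with w (C_out, C_in, n, n), n odd.

    Zero padding keeps the spatial size, so there is one output per
    sliding-window position on the feature map.
    """
    n = w.shape[-1]
    pad = n // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # win has shape (C_in, H, W, n, n): one n*n window per position.
    win = np.lib.stride_tricks.sliding_window_view(xp, (n, n), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", win, w) + b[:, None, None]

def init_rpn_params(c_in, c_mid, k, n=3, rng=None):
    """Hypothetical parameter container; weights drawn from N(0, 0.01^2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    def gauss(*shape):
        return rng.normal(0.0, 0.01, size=shape)
    return {
        "w_int": gauss(c_mid, c_in, n, n), "b_int": np.zeros(c_mid),
        "w_cls": gauss(2 * k, c_mid, 1, 1), "b_cls": np.zeros(2 * k),
        "w_reg": gauss(4 * k, c_mid, 1, 1), "b_reg": np.zeros(4 * k),
    }

def rpn_head(feat, params):
    """n*n conv to a low-dimensional feature, then two sibling 1*1 convs."""
    hidden = np.maximum(conv2d_same(feat, params["w_int"], params["b_int"]), 0.0)
    cls = conv2d_same(hidden, params["w_cls"], params["b_cls"])  # (2k, H, W)
    reg = conv2d_same(hidden, params["w_reg"], params["b_reg"])  # (4k, H, W)
    return cls, reg
```

For a feature map of shape (C, H, W) and k anchors per position, `rpn_head` returns a (2k, H, W) score map and a (4k, H, W) coordinate map.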

3.1.1 Anchors

At each sliding-window location, the network simultaneously predicts multiple region proposals; the maximum number of proposals per location is denoted k.

The reg layer has 4k outputs that encode the coordinates of the k boxes, and the cls layer outputs 2k scores that estimate, for each proposal, the probability of object or not object. The k proposals are parameterized relative to k reference boxes, which are called anchors. An anchor is centered at the sliding window in question and is associated with a scale and an aspect ratio. By default there are 3 scales and 3 aspect ratios, so each sliding position yields 9 anchors. For a feature map of size W*H, this gives at most W*H*k anchors in total (only the highest-scoring ones are kept as proposals).
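A minimal sketch of how the W*H*k anchors could be enumerated. The concrete scales (128, 256, 512 pixels), ratios (1:2, 1:1, 2:1), and feature stride of 16 are the paper's defaults rather than values stated in these notes; the function names and the choice to place centers at the middle of each stride cell are assumptions for illustration.

```python
import numpy as np

def make_base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes centered at the origin.

    Each box is (x1, y1, x2, y2); a box with scale s has area s*s and
    ratio r = height / width.
    """
    anchors = []
    for s in scales:
        area = float(s) ** 2
        for r in ratios:
            w = np.sqrt(area / r)
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def make_all_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Tile the k base anchors over every feature-map position."""
    base = make_base_anchors(scales, ratios)            # (k, 4)
    xs = (np.arange(feat_w) + 0.5) * stride             # assumed cell centers
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)                        # each (H, W)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base[None, :, :]).reshape(-1, 4)   # (H * W * k, 4)
```

Because the same k base boxes are reused at every position, the anchor set is translation invariant by construction.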

Translation-Invariant Anchors

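An important property of the approach is translation invariance, both for the anchors and for the functions that compute proposals relative to the anchors. For example, if an object in the image is translated, the proposal should translate with it, and the same function should be able to predict it at either location. MultiBox, by contrast, uses k-means to generate 800 anchors, which are not translation invariant. Translation invariance also reduces the model size: MultiBox has a (4+1)*800-dimensional fully-connected output layer, whereas this method has a (4+2)*9-dimensional convolutional output layer (with k = 9). As a result, the output layer here has 2.8*10^4 parameters, two orders of magnitude fewer than the 6.1*10^6 parameters of MultiBox. Even when the feature projection layers are counted, the difference is still an order of magnitude.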
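The two parameter counts can be checked with a few lines of arithmetic. The input feature dimensions used here (512 for VGG-16 in this method, 1536 for GoogLeNet in MultiBox) come from the paper, not from these notes, and the helper name is hypothetical.

```python
def output_layer_params(feature_dim, outputs_per_box, num_boxes):
    """Weights in an output layer mapping feature_dim inputs to
    outputs_per_box * num_boxes outputs (biases ignored)."""
    return feature_dim * outputs_per_box * num_boxes

rpn_params = output_layer_params(512, 4 + 2, 9)          # 27,648, about 2.8*10^4
multibox_params = output_layer_params(1536, 4 + 1, 800)  # 6,144,000, about 6.1*10^6
ratio = multibox_params / rpn_params                     # roughly 222x
```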

Multi-Scale Anchors as Regression References

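The paper compares its multi-scale anchors with image pyramids and filter pyramids and gives examples showing that the anchor-based approach to multiple scales has two advantages: (1) it saves time; (2) it needs only the convolutional feature map of a single image (which, in the end, also saves time).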

3.1.2 Loss Function

Training RPNs: each anchor is assigned a binary class label (object / not object).

A positive label is assigned to two kinds of anchors:

（1） the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box,

（2） an anchor that has an IoU overlap higher than 0.7 with any ground-truth box.

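The second condition is usually sufficient to determine the positive samples, but the authors keep the first condition as well, because in rare cases the second condition alone finds no positive sample.

A negative label is assigned to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes.

Anchors that are neither positive nor negative do not contribute to the training objective.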
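The labeling rules above can be sketched as follows. The encoding (1 = positive, 0 = negative, -1 = ignored) and the function names are assumptions for illustration; boxes are assumed to be (x1, y1, x2, y2) with positive area.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4)."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return labels of shape (N,): 1 positive, 0 negative, -1 ignored."""
    iou = iou_matrix(anchors, gt_boxes)
    max_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -1, dtype=np.int64)
    labels[max_iou < neg_thresh] = 0          # IoU < 0.3 with every GT box
    labels[max_iou > pos_thresh] = 1          # condition (2): IoU > 0.7
    # Condition (1): the best anchor(s) for each GT box are positive even
    # if they do not reach the threshold (skipped when there is no overlap).
    best_per_gt = iou.max(axis=0)
    is_best = (iou == best_per_gt[None, :]) & (best_per_gt[None, :] > 0)
    labels[is_best.any(axis=1)] = 1
    return labels
```

Applying condition (1) last means a ground-truth box always gets at least one positive anchor whenever any anchor overlaps it.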

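With these definitions, the authors minimize an objective function following the multi-task loss in Fast R-CNN. The loss function for an image is defined as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)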

i is the index of an anchor in a mini-batch.

p_i is the predicted probability of anchor i being an object. The ground-truth label p_i* is 1 if the anchor is positive and 0 otherwise.

t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t_i* is that of the ground-truth box associated with a positive anchor.

The classification loss L_cls is log loss over two classes (object vs. not object). For the regression loss, the authors use L_reg(t_i, t_i*) = R(t_i - t_i*), where R is the robust loss function (smooth L_1).

The term p_i* L_reg means the regression loss is activated only for positive anchors (p_i* = 1).

The outputs of the cls and reg layers consist of {p_i} and {t_i} respectively.

Experimental settings:

N_cls = 256 (the mini-batch size), N_reg ~ 2400 (the number of anchor locations), λ = 10.
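A NumPy sketch of the multi-task loss with these settings. The smooth L1 form (0.5 x^2 for |x| < 1, |x| - 0.5 otherwise) follows the Fast R-CNN definition; the function names, the single-probability form of the log loss, and the small epsilon for numerical safety are assumptions for illustration.

```python
import numpy as np

def smooth_l1(x):
    """Robust loss R: quadratic near zero, linear for |x| >= 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """Multi-task RPN loss for one mini-batch of sampled anchors.

    p:      (N,)   predicted objectness probabilities
    p_star: (N,)   ground-truth labels (1 for positive, 0 for negative)
    t:      (N, 4) predicted parameterized box coordinates
    t_star: (N, 4) ground-truth parameterized box coordinates
    """
    eps = 1e-12  # avoids log(0); not part of the definition
    l_cls = -(p_star * np.log(p + eps) + (1.0 - p_star) * np.log(1.0 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # p_star gates the regression term: only positive anchors contribute.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```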

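For bounding-box regression, the authors adopt the following parameterization of the 4 coordinates:

t_x = (x - x_a) / w_a, t_y = (y - y_a) / h_a, t_w = log(w / w_a), t_h = log(h / h_a)

t_x* = (x* - x_a) / w_a, t_y* = (y* - y_a) / h_a, t_w* = log(w* / w_a), t_h* = log(h* / h_a)

Here x, y, w, and h denote the box's center coordinates and its width and height. x, x_a, and x* are for the predicted box, the anchor box, and the ground-truth box respectively (likewise for y, w, h).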
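The parameterization and its inverse can be sketched as below: offsets of the box center are normalized by the anchor size, and width and height are expressed as log ratios. Boxes are assumed to be stored as (x_center, y_center, w, h); the function names are hypothetical.

```python
import numpy as np

def encode(boxes, anchors):
    """Map boxes to (t_x, t_y, t_w, t_h) relative to their anchors.

    t_x = (x - x_a) / w_a,  t_y = (y - y_a) / h_a,
    t_w = log(w / w_a),     t_h = log(h / h_a)
    """
    tx = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = np.log(boxes[:, 2] / anchors[:, 2])
    th = np.log(boxes[:, 3] / anchors[:, 3])
    return np.stack([tx, ty, tw, th], axis=1)

def decode(t, anchors):
    """Invert encode: recover (x_center, y_center, w, h) from offsets."""
    x = t[:, 0] * anchors[:, 2] + anchors[:, 0]
    y = t[:, 1] * anchors[:, 3] + anchors[:, 1]
    w = np.exp(t[:, 2]) * anchors[:, 2]
    h = np.exp(t[:, 3]) * anchors[:, 3]
    return np.stack([x, y, w, h], axis=1)
```

Encoding a ground-truth box against its anchor gives the regression target t*, and decoding the network output t against the same anchor gives the predicted box.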

The authors point out that in their method the features used for regression all have the same spatial size (3*3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.

3.1.3 Training RPNs

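The RPN is trained end-to-end with backpropagation and stochastic gradient descent (SGD). Following the "image-centric" sampling strategy from Fast R-CNN, each mini-batch arises from a single image that contains many positive and negative anchors. Optimizing the loss over all anchors would bias it toward the negative samples, which dominate, so the authors instead randomly sample 256 anchors per image to compute the loss, keeping the ratio of positive to negative anchors at up to 1:1. If an image has fewer than 128 positive anchors, the mini-batch is padded with negatives.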
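A small sketch of this sampling step, assuming anchor labels encoded as 1 (positive), 0 (negative), and -1 (ignored); the function name and label encoding are illustrative assumptions.

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, rng=None):
    """Pick up to batch_size // 2 positives, then fill with negatives.

    Returns two index arrays (positives, negatives) into `labels`.
    """
    rng = np.random.default_rng() if rng is None else rng
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)   # pad with negatives
    pos_idx = rng.permutation(pos)[:n_pos]
    neg_idx = rng.permutation(neg)[:n_neg]
    return pos_idx, neg_idx
```

With 10 positive anchors in an image, for example, this yields 10 positives and 246 negatives; with 300 positives it yields 128 of each.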

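All new layers are initialized by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01; all other layers are initialized from a model pre-trained for ImageNet classification. All layers of the ZF net are tuned; for VGG-16, conv3_1 and the layers above it are tuned. The learning rate is 0.001 for the first 60k mini-batches and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. Momentum is 0.9, weight decay is 0.0005, and the implementation uses Caffe.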
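The schedule and hyperparameters above can be written down as a tiny sketch. The update rule shown (weight decay added to the gradient, then a momentum step) is an assumed Caffe-style SGD form, not something stated in these notes, and the function names are hypothetical.

```python
def learning_rate(iteration):
    """0.001 for the first 60k mini-batches, 0.0001 for the next 20k."""
    return 0.001 if iteration < 60_000 else 0.0001

def sgd_momentum_step(w, v, grad, iteration, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay (assumed form).

    w, v, and grad may be floats or NumPy arrays of the same shape.
    Returns the updated weights and velocity.
    """
    lr = learning_rate(iteration)
    v = momentum * v - lr * (grad + weight_decay * w)
    return w + v, v
```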

3.2 Sharing Features for RPN and Fast R-CNN

learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers.

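If RPN and Fast R-CNN are trained independently, each modifies its convolutional layers in a different way, so a technique is needed that lets the two networks share convolutional layers. The authors discuss three ways of training networks with features shared:

（1） Alternating training. First train the RPN, then use its proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize the RPN, and the process is iterated. This is the method used in the paper's experiments.

（2） Approximate joint training. The RPN and Fast R-CNN networks are merged into one network during training. In each SGD iteration, the forward pass generates region proposals, which are treated as fixed, pre-computed proposals when training the Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined.

（3） Non-approximate joint training. (See the paper for details.)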

4-Step Alternating Training

The authors adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization.

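（1） Train the RPN. It is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task.
（2） Train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not yet share convolutional layers.
（3） Use the detector network to initialize RPN training, but fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers.
（4） Keeping the shared convolutional layers fixed, fine-tune the unique layers of Fast R-CNN.

As such, both networks share the same convolutional layers and form a unified network. The same alternating training can be run for more iterations, but the authors observed no noticeable improvement.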
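The four steps can be summarized as a small schedule table. This is only a bookkeeping sketch of which network is trained, where its convolutional layers come from, and whether they are updated; the `Step` class, its field names, and `shares_conv` are hypothetical and not part of any released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    network: str          # which network is trained in this step
    conv_init: str        # where the convolutional layers come from
    conv_trainable: bool  # whether the convolutional layers are updated
    tuned: tuple          # layers that are fine-tuned

FOUR_STEP_SCHEDULE = (
    Step("RPN", "ImageNet", True, ("conv", "rpn_layers")),
    Step("Fast R-CNN", "ImageNet", True, ("conv", "detector_layers")),
    Step("RPN", "step-2 detector", False, ("rpn_layers",)),
    Step("Fast R-CNN", "step-2 detector", False, ("detector_layers",)),
)

def shares_conv(a, b):
    """Two steps end with identical conv layers only if both start from
    the same weights and neither one updates them."""
    return (a.conv_init == b.conv_init
            and not a.conv_trainable and not b.conv_trainable)
```

Steps 1 and 2 start from the same ImageNet weights but fine-tune them separately, so their convolutional layers diverge; steps 3 and 4 both keep the step-2 detector's convolutional layers frozen, which is what makes the final network unified.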
