YOLO v1 Paper Notes

You Only Look Once: Unified, Real-Time Object Detection
 
 
Abstract
Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background.
 
Introduction
A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.
1) YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. YOLO achieves more than twice the mean average precision of other real-time systems.
2) YOLO reasons globally about the image when making predictions. YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. YOLO makes less than half the number of background errors compared to Fast R-CNN.
3) YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.
However, YOLO still lags behind state-of-the-art detection systems in accuracy: while it can quickly identify objects in images, it struggles to precisely localize some objects, especially small ones.
 
Unified Detection
Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously.
The system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.
Each grid cell predicts B bounding boxes and corresponding confidence scores. These confidence scores reflect how confident the model is that the box contains an object, and how accurate it thinks the predicted box is. Confidence can be expressed as follows.
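In the paper, the confidence is defined as the product of the probability that the cell contains an object and the IOU between the predicted box and the ground truth:

Confidence = Pr(Object) × IOU(pred, truth)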
If no object exists in that cell, we want the confidence score to be 0; otherwise we want the confidence score to equal the IOU (intersection over union) between the predicted box and the ground truth.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell, while the predicted width and height (w, h) are relative to the whole image. The predicted confidence represents the IOU between the predicted box and the ground truth box.
Note how (x, y) is predicted; the paper puts it this way:
Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
I didn't understand what this sentence meant at the time, until I saw this line in the YOLO9000 paper:
Instead of predicting offsets we follow the approach of YOLO and predict location coordinates relative to the location of the grid cell
In other words, (x, y) actually predicts an offset relative to the location of each grid cell.
Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object).
These probabilities are conditioned on the grid cell containing an object. Each grid cell predicts only one set of class probabilities, regardless of the number of boxes B.
In summary, YOLO's output dimension is: output_size = (cell_size * cell_size) * (num_class + boxes_per_cell * 5), i.e. each grid cell produces num_class + boxes_per_cell * 5 predictions; see the quick check below.
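As a quick sanity check, here is a minimal Python sketch (the function name and arguments are mine, not from the paper) computing this output size for the PASCAL VOC setting used later (S = 7, B = 2, C = 20):

```python
def yolo_output_size(cell_size: int, boxes_per_cell: int, num_class: int) -> int:
    """Total number of values in YOLO v1's final output tensor."""
    # Per cell: C class probabilities + B boxes, each with (x, y, w, h, confidence).
    per_cell = num_class + boxes_per_cell * 5
    return cell_size * cell_size * per_cell

# PASCAL VOC setting: S = 7, B = 2, C = 20 -> 7 * 7 * 30 = 1470
print(yolo_output_size(cell_size=7, boxes_per_cell=2, num_class=20))  # 1470
```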
 
At test time we multiply the conditional class probabilities and the individual box confidence predictions,
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
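Written out, the multiplication from the paper is:

Pr(Class_i | Object) × Pr(Object) × IOU(pred, truth) = Pr(Class_i) × IOU(pred, truth)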

 

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
 
Two network architecture designs
1. The YOLO network architecture
The network is a convolutional network: the initial convolutional layers extract features, and the fully connected layers at the end predict the output probabilities and coordinates. 1×1 reduction layers are placed in front of the 3×3 convolutional layers. The network has 24 convolutional layers followed by 2 fully connected layers; see Figure 3 for the full architecture.
 
2. Fast YOLO
Fast YOLO has fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Apart from that, all training and testing parameters are the same as for YOLO.
 
In both cases, the final output of the network is a 7 × 7 × 30 tensor of predictions.
 
Training
About the network
The convolutional layers are pretrained on the 1000-class ImageNet dataset. For pretraining, the first 20 convolutional layers of the network shown in Figure 3 are used, followed by an average-pooling layer and a fully connected layer. "We achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set." All training and inference use the Darknet framework.
Note that the authors pretrained with only 20 convolutional layers and one fully connected layer. Why?
Because, as the paper explains later, Ren et al. [29] showed that adding both convolutional and connected layers to a pretrained network can improve performance. So the authors add 4 convolutional layers and 2 fully connected layers on top of the pretrained network, with randomly initialized weights.
Another point: the input resolution during pretraining is 224 × 224, but when converting the network to detection the authors increase the input resolution from 224 × 224 to 448 × 448. Why? Because detection often requires fine-grained visual information.
About the parameters
The bounding box width and height are normalized by the image width and height so that they fall between 0 and 1; the x and y coordinates are parameterized as offsets of a particular grid cell location, so they are also bounded between 0 and 1.
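A minimal sketch of how a ground-truth box could be encoded into these normalized targets (this helper is mine, for illustration only; it ignores the choice of which of the B boxes is responsible):

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a ground-truth box (center cx, cy and size w, h in pixels)
    into YOLO v1 targets: the responsible grid cell (row, col) plus
    normalized (x, y, w, h) values in [0, 1]."""
    col = int(cx / img_w * S)       # grid cell containing the box center
    row = int(cy / img_h * S)
    x = cx / img_w * S - col        # center offset within that cell
    y = cy / img_h * S - row
    return row, col, x, y, w / img_w, h / img_h
```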
The final layer uses a linear activation function; all other layers use the leaky rectified linear (leaky ReLU) activation:
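As given in the paper:

φ(x) = x      if x > 0
φ(x) = 0.1x   otherwise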
Loss function:
The model's loss function is the multi-part sum-squared error given as Equation (3) in the paper.
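Reconstructed from the paper, Equation (3) is (1^obj_ij denotes that the j-th box predictor in cell i is responsible for that prediction, and 1^obj_i that an object appears in cell i):

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+\ &\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
+\ &\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left(C_i - \hat{C}_i\right)^2 \\
+\ &\lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} \left(C_i - \hat{C}_i\right)^2 \\
+\ &\sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$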
Here comes a question: the paper uses sum-squared error because it is easy to optimize, so why is plain sum-squared error not used as-is? Why?
Because:
it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
How is this addressed?
By increasing the loss from bounding box coordinate predictions and decreasing the loss from confidence predictions for boxes that do not contain objects. This is done with two parameters, λcoord and λnoobj, which are set to λcoord = 5 and λnoobj = 0.5.
One more thing to note: "Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly." That is, the network predicts the square root of the bounding box width and height rather than the width and height themselves; the small example below shows the effect.
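A quick numeric illustration (mine, not from the paper): the same 5-unit deviation in width is penalized far less on a large box than on a small one once the square root is taken.

```python
import math

def sq_err(w_true, w_pred):
    return (w_true - w_pred) ** 2

def sqrt_sq_err(w_true, w_pred):
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2

# Same absolute deviation of 5 units on a large box vs. a small box.
print(sq_err(200, 195), sqrt_sq_err(200, 195))  # 25 vs. ~0.032
print(sq_err(20, 15), sqrt_sq_err(20, 15))      # 25 vs. ~0.359
```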
 
Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell)
 
Hyperparameters:
epochs: 135
batch size: 64
momentum: 0.9
decay: 0.0005
learning rate schedule is as follows:
For the first epochs we slowly raise the learning rate from 10^-3 to 10^-2. We continue training with 10^-2 for 75 epochs, then 10^-3 for 30 epochs, and finally 10^-4 for 30 epochs.
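A minimal sketch of this schedule in Python (the warm-up length is not stated in the paper, so warmup_epochs = 5 is an assumption):

```python
def learning_rate(epoch: int, warmup_epochs: int = 5) -> float:
    """Piecewise learning-rate schedule described in the paper (~135 epochs total)."""
    if epoch < warmup_epochs:
        # Slowly raise the learning rate from 1e-3 to 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    elif epoch < warmup_epochs + 75:        # 75 epochs at 1e-2
        return 1e-2
    elif epoch < warmup_epochs + 75 + 30:   # 30 epochs at 1e-3
        return 1e-3
    else:                                   # final 30 epochs at 1e-4
        return 1e-4
```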
Gradually lowering the learning rate later on is easy to understand, but why does it start low and then increase? Because: "If we start at a high learning rate our model often diverges due to unstable gradients."
Two methods are used to prevent overfitting:
Dropout: a dropout layer with rate 0.5 after the first connected layer prevents co-adaptation between layers [18].
Data augmentation: random scaling and translations of up to 20% of the original image size, plus randomly adjusting the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space (a rough sketch follows).
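A rough sketch of this augmentation, assuming OpenCV and ignoring the corresponding adjustment of the box labels (the parameters mirror the paper, but the implementation details are assumptions):

```python
import cv2
import numpy as np

def augment(img: np.ndarray) -> np.ndarray:
    """Random scaling/translation up to 20% and HSV exposure/saturation jitter up to 1.5x."""
    h, w = img.shape[:2]

    # Random scaling and translation of up to 20% of the original image size.
    scale = 1.0 + np.random.uniform(-0.2, 0.2)
    tx = np.random.uniform(-0.2, 0.2) * w
    ty = np.random.uniform(-0.2, 0.2) * h
    M = np.float32([[scale, 0, tx], [0, scale, ty]])
    img = cv2.warpAffine(img, M, (w, h))

    # Random exposure (value) and saturation adjustment in HSV space.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= np.random.uniform(1 / 1.5, 1.5)  # saturation
    hsv[..., 2] *= np.random.uniform(1 / 1.5, 1.5)  # exposure
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```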
 
Limitations of YOLO
Strong spatial constraints: each grid cell predicts only two bounding boxes and can have only one class. This spatial constraint limits the number of nearby objects the model can predict, so the model struggles to produce correct bounding boxes when many small objects appear in dense groups.
Because the architecture contains several downsampling layers, the model also predicts bounding boxes from relatively coarse features.
The loss function treats errors in small bounding boxes and in large bounding boxes the same. A small error has little effect on a large box, but a much larger effect on the IOU of a small box.
 
Experiments
 
We can see that YOLO still processes images at 45 frames per second while keeping relatively high accuracy (63.4 mAP on VOC 2007, versus 52.7 mAP at 155 FPS for Fast YOLO).
 
Conclusion
We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.
Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.