PSPNet Paper Translation and Commentary

Pyramid Scene Parsing Network

Abstract

Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective for producing good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in the ImageNet scene parsing challenge 2016, the PASCAL VOC 2012 benchmark, and the Cityscapes benchmark. A single PSPNet yields the new record of 85.4% mIoU accuracy on PASCAL VOC 2012 and 80.2% accuracy on Cityscapes.

1. Introduction

Scene parsing, based on semantic segmentation, is a fundamental topic in computer vision. The goal is to assign each pixel in the image a category label. Scene parsing provides complete understanding of the scene: it predicts the label, location, and shape of each element. This topic is of broad interest for potential applications such as automatic driving and robot sensing, to name a few.

The difficulty of scene parsing is closely related to scene and label variety. The pioneering scene parsing task [23] is to classify 33 scenes for 2,688 images on the LMO dataset [22]. The more recent PASCAL VOC semantic segmentation and PASCAL context datasets [8, 29] include more labels with similar context, such as chair and sofa, horse and cow, etc. The new ADE20K dataset [43] is the most challenging one, with a large and unrestricted open vocabulary and more scene classes. A few representative images are shown in Fig. 1. Developing an effective algorithm for these datasets requires conquering a few difficulties.

State-of-the-art scene parsing frameworks are mostly based on the fully convolutional network (FCN) [26]. Deep convolutional neural network (CNN) based methods boost dynamic object understanding, yet still face challenges given diverse scenes and unrestricted vocabulary. One example is shown in the first row of Fig. 2, where a boat is mistaken for a car. Such errors are due to the similar appearance of objects. But when viewing the image with the context prior that the scene is described as a boathouse near a river, the correct prediction should be yielded.

Towards accurate scene perception, the knowledge graph relies on prior information of scene context. We found that the major issue for current FCN based models is the lack of a suitable strategy to utilize global scene category clues. For typical complex scene understanding, spatial pyramid pooling [18] was previously widely employed to get a global image-level feature, where spatial statistics provide a good descriptor for overall scene interpretation. The spatial pyramid pooling network [12] further enhances this ability.

Different from these methods, to incorporate suitable global features, we propose the pyramid scene parsing network (PSPNet). In addition to the traditional dilated FCN [3, 40] for pixel prediction, we extend the pixel-level feature to the specially designed global pyramid pooling one. The local and global clues together make the final prediction more reliable. We also propose an optimization strategy with deeply supervised loss. We give all implementation details, which are key to our decent performance in this paper, and make the code and trained models publicly available.

Our approach achieves state-of-the-art performance on all available datasets. It is the champion of the ImageNet scene parsing challenge 2016 [43], and took 1st place on the PASCAL VOC 2012 semantic segmentation benchmark [8] and on the urban scene Cityscapes data [6]. These results manifest that PSPNet gives a promising direction for pixel-level prediction tasks, which may even benefit CNN-based stereo matching, optical flow, depth estimation, etc. in follow-up work. Our main contributions are threefold.

  • We propose a pyramid scene parsing network to embed difficult scenery context features in an FCN based pixel prediction framework.
  • We develop an effective optimization strategy for deep ResNet [13] based on deeply supervised loss.
  • We build a practical system for state-of-the-art scene parsing and semantic segmentation in which all crucial implementation details are included.

2. Related Work

In the following, we review recent advances in scene parsing and semantic segmentation tasks. Driven by powerful deep neural networks [17, 33, 34, 13], pixel-level prediction tasks like scene parsing and semantic segmentation achieve great progress, inspired by replacing the fully-connected layer in classification with the convolution layer [26]. To enlarge the receptive field of neural networks, the methods of [3, 40] used dilated convolution. Noh et al. [30] proposed a coarse-to-fine structure with a deconvolution network to learn the segmentation mask. Our baseline network is the FCN and dilated network [26, 3].

Other work mainly proceeds in two directions. One line [26, 3, 5, 39, 11] is multi-scale feature ensembling: since in deep networks a higher-layer feature contains more semantic meaning and less location information, combining multi-scale features can improve the performance.

The other direction is based on structure prediction. The pioneering work [3] used a conditional random field (CRF) as post processing to refine the segmentation result. Following methods [25, 41, 1] refined networks via end-to-end modeling. Both directions ameliorate the localization ability of scene parsing, where the predicted semantic boundary fits objects. Yet there is still much room to exploit the necessary information in complex scenes.

To make good use of global image-level priors for diverse scene understanding, the methods of [18, 27] extracted global context information with traditional features, not from deep neural networks. A similar improvement was made under object detection frameworks [35]. Liu et al. [24] proved that global average pooling with FCN can improve semantic segmentation results. However, our experiments show that these global descriptors are not representative enough for the challenging ADE20K data. Therefore, different from the global pooling in [24], we exploit the capability of global context information by different-region-based context aggregation via our pyramid scene parsing network.

3. Pyramid Scene Parsing Network

We start with our observation and analysis of representative failure cases when applying FCN methods to scene parsing. They motivate the proposal of our pyramid pooling module as an effective global context prior. Our pyramid scene parsing network (PSPNet), illustrated in Fig. 3, is then described to improve performance for open-vocabulary object and stuff identification in complex scene parsing.

3.1. Important Observations

The new ADE20K dataset [43] contains 150 stuff/object category labels (e.g., wall, sky, and tree) and 1,038 image-level scene descriptors (e.g., airport terminal, bedroom, and street). A large number of labels and a vast distribution of scenes thus come into existence. Inspecting the prediction results of the FCN baseline provided in [43], we summarize several common issues for complex-scene parsing.

Mismatched Relationship  Context relationship is universal and important, especially for complex scene understanding. There exist co-occurrent visual patterns. For example, an airplane is likely to be on a runway or flying in the sky, but not over a road. For the first-row example in Fig. 2, FCN predicts the boat in the yellow box as a "car" based on its appearance. But the common knowledge is that a car is seldom over a river. Lack of the ability to collect contextual information increases the chance of misclassification.

Confusion Categories  There are many class label pairs in the ADE20K dataset [43] that are confusing in classification, for example field and earth; mountain and hill; wall, house, building and skyscraper. They have similar appearance. The expert annotator who labeled the entire dataset still made 17.60% pixel error, as described in [43]. In the second row of Fig. 2, FCN predicts the object in the box as part skyscraper and part building. These results should be excluded so that the whole object is either skyscraper or building, but not both. This problem can be remedied by utilizing the relationship between categories.

Inconspicuous Classes  A scene contains objects/stuff of arbitrary size. Several small-size things, like streetlights and signboards, are hard to find while they may be of great importance. Contrarily, big objects or stuff may exceed the receptive field of FCN and thus cause discontinuous prediction. As shown in the third row of Fig. 2, the pillow has a similar appearance to the sheet; overlooking the global scene category may lead to failure to parse the pillow. To improve performance for remarkably small or large objects, one should pay much attention to the different sub-regions that contain inconspicuous-category stuff.

To summarize these observations, many errors are partially or completely related to contextual relationships and global information for different receptive fields. Thus a deep network with a suitable global-scene-level prior can much improve the performance of scene parsing.

3.2. Pyramid Pooling Module

With the above analysis, in what follows we introduce the pyramid pooling module, which empirically proves to be an effective global contextual prior.

In a deep neural network, the size of the receptive field can roughly indicate how much context information we use. Although theoretically the receptive field of ResNet [13] is already larger than the input image, it is shown by Zhou et al. [42] that the empirical receptive field of a CNN is much smaller than the theoretical one, especially on high-level layers. This makes many networks not sufficiently incorporate the momentous global scenery prior. We address this issue by proposing an effective global prior representation.

Global average pooling is a good baseline model as the global contextual prior, and is commonly used in image classification tasks [34, 13]. In [24], it was successfully applied to semantic segmentation. But regarding the complex scene images in ADE20K [43], this strategy is not enough to cover the necessary information. Pixels in these scene images are annotated regarding many stuff and objects. Directly fusing them to form a single vector may lose the spatial relation and cause ambiguity. Global context information together with sub-region context is helpful in this regard to distinguish among various categories. A more powerful representation could fuse information from different sub-regions with these receptive fields. A similar conclusion was drawn in classical work [18, 12] on scene/image classification.
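
To make the contrast concrete, here is a minimal PyTorch sketch (the feature-map shape is a hypothetical example, not taken from the paper): global average pooling collapses the whole map into a single vector and discards spatial relations, while pooling over a grid of sub-regions keeps a coarse spatial layout.

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 2048, 60, 60)    # final-layer feature map (hypothetical shape)

# Global average pooling: one descriptor per image; spatial layout is lost.
gap = F.adaptive_avg_pool2d(feat, 1)   # -> (1, 2048, 1, 1)

# Sub-region pooling: a coarse grid of descriptors keeps relative positions.
grid = F.adaptive_avg_pool2d(feat, 6)  # -> (1, 2048, 6, 6)
```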

In [12], the feature maps at different levels generated by pyramid pooling were finally flattened and concatenated to be fed into a fully connected layer for classification. This global prior is designed to remove the fixed-size constraint of CNNs for image classification. To further reduce context information loss between different sub-regions, we propose a hierarchical global prior, containing information at different scales and varying among different sub-regions. We call it the pyramid pooling module for global scene prior construction upon the final-layer feature map of the deep neural network, as illustrated in part (c) of Fig. 3.

The pyramid pooling module fuses features under four different pyramid scales. The coarsest level, highlighted in red, is global pooling that generates a single bin output. The following pyramid levels separate the feature map into different sub-regions and form pooled representations for different locations. The outputs of the different levels in the pyramid pooling module contain feature maps of varied sizes. To maintain the weight of the global feature, we use a 1x1 convolution layer after each pyramid level to reduce the dimension of the context representation to 1/N of the original one, where N is the number of pyramid levels. We then directly upsample the low-dimension feature maps via bilinear interpolation to obtain features of the same size as the original feature map. Finally, the features of the different levels are concatenated as the final pyramid pooling global feature.

Note that the number of pyramid levels and the size of each level can be modified. They are related to the size of the feature map that is fed into the pyramid pooling layer. The structure abstracts different sub-regions by adopting pooling kernels of varying sizes in a few strides, so the multi-stage kernels should maintain a reasonable gap in representation. Our pyramid pooling module is a four-level one with bin sizes of 1x1, 2x2, 3x3 and 6x6, respectively. For the choice between max and average for the pooling operation, we perform extensive experiments to show the difference in Section 5.2.
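
A minimal PyTorch sketch of such a four-level module follows. It reflects our reading of the text rather than the authors' released code: the BN/ReLU placement, the bias settings, and the `pool` switch between average and max pooling (the comparison in Section 5.2) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Sketch of the four-level pyramid pooling module (bin sizes 1, 2, 3, 6)."""

    def __init__(self, in_channels=2048, bin_sizes=(1, 2, 3, 6), pool="avg"):
        super().__init__()
        out_channels = in_channels // len(bin_sizes)     # 1/N of the input dimension
        Pool = nn.AdaptiveAvgPool2d if pool == "avg" else nn.AdaptiveMaxPool2d
        self.stages = nn.ModuleList(
            nn.Sequential(
                Pool(size),                              # pool the map into a size x size grid
                nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_channels),            # BN/ReLU placement is an assumption
                nn.ReLU(inplace=True),
            )
            for size in bin_sizes
        )

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample each pooled map back to the input size, then concatenate with x.
        priors = [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat([x] + priors, dim=1)            # channels: 2048 + 4 * 512 = 4096
```

With the settings above (2048 input channels, four levels), the concatenated output has 2048 + 4 x 512 = 4096 channels; passing `pool="max"` swaps in max pooling for the Section 5.2 comparison.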

3.3. Network Architecture

With the pyramid pooling module, we propose our pyramid scene parsing network (PSPNet), as illustrated in Fig. 3. Given an input image in Fig. 3(a), we use a pretrained ResNet [13] model with the dilated network strategy [3, 40] to extract the feature map. The final feature map size is 1/8 of the input image, as shown in Fig. 3(b). On top of the map, we use the pyramid pooling module shown in (c) to gather context information. Using our 4-level pyramid, the pooling kernels cover the whole, half of, and small portions of the image, and they are fused as the global prior. We then concatenate the prior with the original feature map in the final part of (c). This is followed by a convolution layer to generate the final prediction map in (d).
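
Reusing the `PyramidPoolingModule` sketch above, the overall wiring of Fig. 3 could look like the following. This is hypothetical: `backbone` stands in for the dilated pretrained ResNet and is assumed to return a 2048-channel map at 1/8 resolution, and the 3x3 convolution head is our guess at the "convolution layer" that produces the prediction map.

```python
import torch.nn as nn
import torch.nn.functional as F

class PSPNetSketch(nn.Module):
    """Hypothetical end-to-end wiring of Fig. 3 (a)-(d)."""

    def __init__(self, backbone, num_classes, in_channels=2048):
        super().__init__()
        self.backbone = backbone                         # dilated ResNet, output stride 8
        self.ppm = PyramidPoolingModule(in_channels)     # module from the sketch above
        self.head = nn.Sequential(                       # final convolution of Fig. 3(d)
            nn.Conv2d(in_channels * 2, 512, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, num_classes, kernel_size=1),
        )

    def forward(self, x):
        feat = self.backbone(x)                          # (B, 2048, H/8, W/8)
        feat = self.ppm(feat)                            # (B, 4096, H/8, W/8)
        logits = self.head(feat)                         # class scores at 1/8 resolution
        return F.interpolate(logits, size=x.shape[2:],   # upsample to input resolution
                             mode="bilinear", align_corners=False)
```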

To explain our structure: PSPNet provides an effective global contextual prior for pixel-level scene parsing. The pyramid pooling module can collect levels of information that are more representative than global pooling [24]. In terms of computational cost, our PSPNet does not increase it much compared to the original dilated FCN network. In end-to-end learning, the global pyramid pooling module and the local FCN feature can be optimized simultaneously.

4. Deep Supervision for ResNet-Based FCN

Deep pretrained networks lead to good performance [17, 33, 13]. However, increasing the depth of the network may introduce additional optimization difficulty, as shown in [32, 19] for image classification. ResNet solves this problem with a skip connection in each block; the latter layers of a deep ResNet mainly learn residues based on previous ones.

We contrarily propose generating initial results by supervision with an additional loss, and learning the residue afterwards with the final loss. The optimization of the deep network is thus decomposed into two parts, each of which is simpler to solve.

An example of our deeply supervised ResNet101 [13] model is illustrated in Fig. 4. Apart from the main branch, which uses a softmax loss to train the final classifier, another classifier is applied after the fourth stage, i.e., the res4b22 residue block. Different from relay backpropagation [32], which blocks the backward auxiliary loss at several shallow layers, we let the two loss functions pass through all previous layers. The auxiliary loss helps optimize the learning process, while the master branch loss takes the most responsibility. We add a weight to balance the auxiliary loss.
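
A sketch of this training objective is given below. It is our formulation, not the authors' code: the default weight of 0.4 is a commonly used value for PSPNet (the weight itself is studied in Section 5.2), and the `ignore_index=255` convention for unlabeled pixels is an assumption.

```python
import torch.nn.functional as F

def deeply_supervised_loss(main_logits, aux_logits, target, aux_weight=0.4):
    # Master-branch softmax loss: takes the most responsibility.
    main_loss = F.cross_entropy(main_logits, target, ignore_index=255)
    # Auxiliary classifier after stage 4 (res4b22); its gradient flows back
    # through all previous layers, unlike relay backpropagation.
    aux_loss = F.cross_entropy(aux_logits, target, ignore_index=255)
    return main_loss + aux_weight * aux_loss
```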

In the testing phase, we abandon this auxiliary branch and only use the well optimized master branch for the final prediction. This kind of deeply supervised training strategy for ResNet-based FCNs is broadly useful under different experimental settings and works with the pre-trained ResNet model, which manifests the generality of such a learning strategy. More details are provided in Section 5.2.

What follows in the paper is the experimental section and the conclusion; they are not translated in detail here, so please refer to the original paper.

That is all.
