[ Paper Quick Read ] ECCV 2020: Cross-Modal Weighting Network

Cross-Modal Weighting Network for RGB-D Salient Object Detection

[PAPER]

Targeting the SOD problem, this paper proposes a new network architecture that effectively fuses information from the RGB and depth channels while mining object localization and details across scales.

Abstract

Depth maps contain geometric clues for assisting Salient Object Detection (SOD). In this paper, we propose a novel Cross-Modal Weighting (CMW) strategy to encourage comprehensive interactions between RGB and depth channels for RGB-D SOD.

Specifically, three RGB-depth interaction modules, named CMW-L, CMW-M and CMW-H, are developed to deal with respectively low-, middle- and high-level cross-modal information fusion. These modules use Depth-to-RGB Weighting (DW) and RGB-to-RGB Weighting (RW) to allow rich cross-modal and cross-scale interactions among feature layers generated by different network blocks. To effectively train the proposed Cross-Modal Weighting Network (CMWNet), we design a composite loss function that summarizes the errors between intermediate predictions and ground truth over different scales. With all these novel components working together, CMWNet effectively fuses information from RGB and depth channels, and meanwhile explores object localization and details across scales. Thorough evaluations demonstrate CMWNet consistently outperforms 15 state-of-the-art RGB-D SOD methods on seven popular benchmarks.

The abstract opens by stating the paper's research area, SOD, and its core selling point: the Cross-Modal Weighting (CMW) strategy.

It then describes the new CMW network in more detail:

1. Components: CMW-L, CMW-M and CMW-H

2. Purpose: low-, middle- and high-level cross-modal information fusion

3. Method: Depth-to-RGB Weighting (DW) and RGB-to-RGB Weighting (RW) allow rich cross-modal and cross-scale interactions among the feature layers generated by different network blocks.

4. Training: a composite loss function is designed to sum the errors between the intermediate predictions and the ground truth across different scales.

5. Interpretation: with all these novel components working together, CMWNet effectively fuses information from the RGB and depth channels while exploring object localization and details across scales.

Finally, the experimental conclusion.

 

Related Work

CNN-based RGB-D SOD

In recent years, numerous CNN-based RGB-D SOD methods [4–6, 10, 13, 20, 26, 31, 33, 38, 42, 44] have been proposed.

4. Chen, H., Li, Y.: Progressively complementarity-aware fusion network for RGB-D salient object detection. In: IEEE CVPR (2018)

5. Chen, H., Li, Y.: Three-stream attention-aware network for RGB-D salient object detection. IEEE TIP 28(6), 2825–2835 (2019)

6. Chen, H., Li, Y., Su, D.: Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognition 86, 376–385 (2019)

10. Ding, Y., Liu, Z., Huang, M., Shi, R., Wang, X.: Depth-aware saliency detection using convolutional neural networks. Journal of Visual Communication and Image Representation 61, 1–9 (2019)

13. Fan, D.P., Lin, Z., Zhao, J.X., Liu, Y., Zhang, Z., Hou, Q., Zhu, M., Cheng, M.M.: Rethinking RGB-D salient object detection: Models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781 (2019)

20. Han, J., Chen, H., Liu, N., Yan, C., Li, X.: CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE TCYB 48(11), 3171–3183 (2018)

26. Liu, Z., Shi, S., Duan, Q., Zhang, W., Zhao, P.: Salient object detection for RGB-D image by single stream recurrent convolution neural network. Neurocomputing 363, 46–57 (2019)

31. Qu, L., He, S., Zhang, J., Tian, J., Tang, Y., Yang, Q.: RGBD salient object detection via deep fusion. IEEE TIP 26(5), 2274–2285 (2017)

33. Shigematsu, R., Feng, D., You, S., Barnes, N.: Learning RGB-D salient object detection using background enclosure, depth contrast, and top-down features. In: IEEE ICCVW (2017)

38. Wang, N., Gong, X.: Adaptive fusion for RGB-D salient object detection. IEEE Access 7, 55277–55284 (2019)

42. Zhao, J.X., Cao, Y., Fan, D.P., Cheng, M.M., Li, X.Y., Zhang, L.: Contrast prior and fluid pyramid integration for RGBD salient object detection. In: IEEE CVPR (2019)

44. Zhu, C., Cai, X., Huang, K., Li, T.H., Li, G.: PDNet: Prior-model guided depth-enhanced network for salient object detection. In: IEEE ICME (2019)

 

Proposed Method

Network Overview and Motivation

At a high level:

1. U-Net structure;

2. The encoder splits into two streams, depth and RGB;

3. Before each skip connection from the encoder to the decoder, the depth and RGB streams are first fused;

4. This connection can be seen as fusing one stream's features at a given layer with the other stream's features at the preceding layer (the authors further name these as three modules, which makes them easier to describe in the paper);

5. Loss function: every decoder layer produces an intermediate prediction, and the loss is computed against the GT (see the sketch after this list).
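As a concrete reading of item 5, here is a minimal PyTorch sketch of such a multi-scale supervision loss, assuming one saliency logit map per decoder level and binary cross-entropy as the per-scale error; the function name `composite_loss`, the GT resizing, and the equal weighting of scales are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def composite_loss(side_preds, gt):
    """Sum the per-scale errors between intermediate predictions and GT.

    side_preds: list of logit maps [B, 1, h_i, w_i], one per decoder level
    gt:         ground-truth saliency mask [B, 1, H, W], values in [0, 1]
    """
    loss = torch.zeros((), device=gt.device)
    for pred in side_preds:
        # Resize the GT to this prediction's scale before comparing.
        gt_s = F.interpolate(gt, size=pred.shape[-2:],
                             mode='bilinear', align_corners=False)
        loss = loss + F.binary_cross_entropy_with_logits(pred, gt_s)
    return loss
```

For example, with predictions at 28, 56, 112 and 224 pixels and a 224x224 GT mask, the function resizes the mask once per level and sums the four BCE terms.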

Fig. 1. Illustration of the proposed CMWNet. For both the RGB and depth channels, a Siamese encoder network is employed to extract feature blocks organized in three levels. Three Cross-Modal Weighting (CMW) modules, CMW-L, CMW-M and CMW-H, are proposed to capture the interactions at the corresponding levels and provide inputs for the decoder. The decoder progressively aggregates all the cross-modal cross-scale information for the final prediction. For training, multi-scale pixel-level supervision of the intermediate predictions is utilized.

 

CMW-L, CMW-M and CMW-H

Fig. 2. Details of the three proposed RGB-depth interaction modules: CMW-L, CMW-M and CMW-H. All modules consist of Depth-to-RGB Weighting (DW) and RGB-to-RGB Weighting (RW) as key operations. Notably, the DW in CMW-L and CMW-M is performed in a cross-scale manner between two adjacent blocks, which effectively captures feature continuity and activates cross-modal cross-scale interactions.

The three modules share an identical structure and differ only in which encoder level they sit at; moreover, the first two perform cross-level, cross-modal fusion (crossing low- and high-level features, and depth and RGB), while the last performs cross-modal fusion only.

1. The depth features pass through convolutions with different kernel sizes and are then aggregated by a 2x2 convolution;

2. The RGB features directly serve as the weights for both themselves and the depth output, similar to mixed attention;

3. Finally, the RGB features are also fed straight into the decoder through a residual connection, which is just the classic U-Net (see the module sketch after this list).
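Below is a minimal PyTorch sketch of one such module under this reading, assuming kernel sizes 3 and 5 for the multi-kernel depth branch and a sigmoid on the RGB features to form the weights; the class name `CMWBlock`, the channel layout, and the 1x1 aggregation convolution (standing in for the 2x2 convolution mentioned above) are illustrative assumptions, not the authors' exact design. The cross-scale DW used in CMW-L and CMW-M (which also weights depth features from the adjacent block) is omitted for brevity.

```python
import torch
import torch.nn as nn

class CMWBlock(nn.Module):
    """Sketch of one same-scale CMW module: DW + RW + residual to decoder."""

    def __init__(self, ch):
        super().__init__()
        # Multi-kernel convolutions on the depth features (sizes assumed).
        self.d3 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.d5 = nn.Conv2d(ch, ch, kernel_size=5, padding=2)
        # Aggregation conv (1x1 here; the post mentions a 2x2 conv).
        self.agg = nn.Conv2d(2 * ch, ch, kernel_size=1)
        self.out = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, rgb, depth):
        # Aggregate the multi-kernel depth responses.
        d = self.agg(torch.cat([self.d3(depth), self.d5(depth)], dim=1))
        # RGB features act directly as the attention map ("mixed attention").
        w = torch.sigmoid(rgb)
        fused = w * d + w * rgb  # Depth-to-RGB (DW) + RGB-to-RGB (RW) weighting
        # Residual connection: RGB skips straight into the decoder path.
        return self.out(fused) + rgb
```

For instance, `CMWBlock(64)(rgb_feat, depth_feat)` fuses two 64-channel feature maps of the same spatial size into a single decoder input.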

 

Summary:

Novelties:

1. The encoder is split into two branches, depth and RGB;

2. The cross-level fusion is indeed a subtle touch;

3. The RGB features are used directly as the attention map, in a mixed-attention fashion.

 

 
