STSN: Object Detection in Video with Spatiotemporal Sampling Networks

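STSN uses deformable convolutions to sample features across space and time in a video and uses the sampled information for object detection, which improves detection accuracy. Unlike FGFA, STSN does not need a FlowNet trained on optical-flow data to predict motion.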

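Main contribution / main problem addressed: some video frames suffer from camera shake, defocus, partial occlusion, or large pose changes, and detection that relies on deep features extracted from these hard frames tends to fail.
The reason is that these challenges make the deep features of such frames less distinctive, so the predictions have very low confidence.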

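Reaching higher accuracy requires two things:
1) Strong features: the network uses the best available backbone (ResNet-101 in the paper)
2) The ability to extract useful object-level features from the supporting frames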

First, a brief introduction to deformable convolution:

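Start with the standard 2D convolution formula: P0 is the coordinate of the output location, and Pn ranges over the positions determined by the kernel size (for a 3*3 kernel there are 9 values of n, from 1 to 9).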
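For reference, the standard 2D convolution is usually written as below. P0 and Pn follow the text above; the weight function w, the input x, the output y, and the grid symbol R are the customary notation and are assumed here rather than taken from this post.

```latex
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n),
\qquad
\mathcal{R} = \{(-1,-1),\, (-1,0),\, \ldots,\, (0,1),\, (1,1)\}
```

For a 3*3 kernel, R contains the 9 relative positions around P0.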

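Next, the deformable convolution formula:
Unlike the standard 2D convolution, an offset △Pn is added to each sampling position in the input x. The offsets are also learned during training, and the offset field has the same spatial size as the feature map.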
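In the same assumed notation (w for the kernel weights, R for the regular sampling grid), the deformable version differs only by the offset term inside x:

```latex
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)
```

Because △Pn is generally fractional, x at the shifted position is typically obtained by bilinear interpolation of the neighboring feature values.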

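A figure helps illustrate deformable convolution: in the input feature map, the green region is what a standard 2D convolution samples, whereas deformable convolution adds an offset △Pn to each input position, so the sampled region changes from the green part to the blue box (in my view, this can also be understood as a relative deformation of the kernel).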
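The effect of the offsets can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: a single-channel feature map stored as a list of rows, one output location, hand-picked offsets instead of learned ones, and hypothetical helper names (`bilinear`, `deform_conv_at`).

```python
import math

def bilinear(feat, y, x):
    """Sample a 2D feature map (list of rows) at fractional (y, x).
    Positions outside the map contribute 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value

# The regular 3x3 grid of relative positions p_n (n = 1..9).
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deform_conv_at(feat, kernel, p0, offsets):
    """Sum over n of kernel[n] * feat(p0 + p_n + offsets[n]) for one output location."""
    y0, x0 = p0
    total = 0.0
    for n, (dy, dx) in enumerate(GRID):
        oy, ox = offsets[n]
        total += kernel[n] * bilinear(feat, y0 + dy + oy, x0 + dx + ox)
    return total

# Toy 5x5 feature map whose value grows linearly with position.
feat = [[float(5 * r + c) for c in range(5)] for r in range(5)]
kernel = [1.0 / 9] * 9                 # simple averaging kernel
zero_offsets = [(0.0, 0.0)] * 9        # reduces to a standard convolution
shifted_offsets = [(0.0, 0.5)] * 9     # every sample moved half a pixel to the right

regular = deform_conv_at(feat, kernel, (2, 2), zero_offsets)
deformed = deform_conv_at(feat, kernel, (2, 2), shifted_offsets)
print(regular, deformed)
```

With zero offsets the result is the ordinary 3x3 average around (2, 2), roughly 12.0; with the half-pixel shift the same kernel reads from a displaced region and gives roughly 12.5. In a real network the offsets differ per kernel position and per location and are produced by a learned layer.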

Algorithm overview

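First, features are extracted from every frame of the video. Next, the spatiotemporal sampling module samples from the resulting feature maps, and the sampled features are summed according to their weights to form the deep feature of the current frame, on which detection is then run.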
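The weighted summation step can be sketched as follows. The post does not say how the weights are obtained, so this example makes an assumption: each sampled feature is scored by its cosine similarity to the reference feature and the scores are normalized with a softmax across frames. The function names are hypothetical, and only a single spatial location (one feature vector per frame) is shown.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-12)

def aggregate(ref_feat, sampled_feats):
    """Weighted sum of the features sampled from each frame at one location.
    Assumption: weights = softmax of cosine similarity to the reference feature."""
    weights = softmax([cosine(ref_feat, f) for f in sampled_feats])
    dim = len(ref_feat)
    agg = [sum(w * f[i] for w, f in zip(weights, sampled_feats)) for i in range(dim)]
    return agg, weights

# Reference feature and features sampled from three frames (toy 2-D vectors).
ref_feat = [1.0, 0.0]
sampled_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
agg, weights = aggregate(ref_feat, sampled_feats)
print(weights, agg)
```

Frames whose sampled features resemble the reference feature receive larger weights, so a dissimilar frame (the second one here) contributes less to the aggregated feature.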

Fig. 2: Our spatiotemporal sampling mechanism, which we use for video object
detection. Given the task of detecting objects in a particular video frame (i.e.,
a reference frame), our goal is to incorporate information from a nearby frame
of the same video (i.e., a supporting frame). First, we extract features from
both frames via a backbone convolutional network (CNN). Next, we concatenate
the features from the reference and supporting frames, and feed them through
multiple deformable convolutional layers. The last of such layers produces offsets
that are used to sample informative features from the supporting frame. Our
spatiotemporal sampling scheme allows us to produce accurate detections even
if objects in the reference frame appear blurry or occluded.

Algorithm steps:

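First, the current frame t and the supporting frame t+k are passed through the backbone, giving convolutional features of size c×w×h each.
The two tensors are then concatenated along the channel dimension into a 2c×w×h tensor, which is fed to a deformable convolution; this step produces the first offset.
The output of that convolution serves as the second offset (of spatial size w×h), which is used to apply a deformable convolution to the supporting-frame features.
The result is used as the feature for object detection on the current frame (whose image quality may be poor because of blur or shake).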
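The final sampling step can be illustrated with a toy example. It is a sketch under strong simplifications, not the paper's implementation: c = 1, a 1×1 sampling kernel, and offsets that are hard-coded to a known motion rather than predicted from the concatenated features. All names (`sample_support`, `blob`) are hypothetical.

```python
import math

def bilinear(feat, y, x):
    """Sample a 2D feature map at fractional (y, x); outside the map counts as 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                value += (1 - abs(y - yy)) * (1 - abs(x - xx)) * feat[yy][xx]
    return value

def sample_support(support, offsets):
    """For every location p, read the supporting feature at p + offsets[p]."""
    h, w = len(support), len(support[0])
    return [[bilinear(support, r + offsets[r][c][0], c + offsets[r][c][1])
             for c in range(w)] for r in range(h)]

def blob(h, w, cy, cx):
    """Toy single-channel feature map with one activation at (cy, cx)."""
    return [[1.0 if (r, c) == (cy, cx) else 0.0 for c in range(w)] for r in range(h)]

ref = blob(6, 6, 2, 2)       # object at (2, 2) in the reference frame t
support = blob(6, 6, 2, 4)   # same object, 2 px to the right, in frame t+k

# Channel-wise concatenation (2c x h x w with c = 1): the input an offset
# predictor would consume. It is built here only to show the shape.
stacked = [ref, support]

# Assumption for illustration: offsets equal the known motion (0, +2) everywhere.
# In STSN they would be produced by the stacked deformable convolution layers.
offsets = [[(0.0, 2.0) for _ in range(6)] for _ in range(6)]

aligned = sample_support(support, offsets)
print(aligned == ref)
```

With offsets that match the object's motion, the features sampled from the supporting frame line up with the object's position in the reference frame, which is what makes them usable for detection on a blurry or occluded reference frame.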

Detection results

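Overall, the idea behind the algorithm is fairly simple. Compared with FGFA (Flow-Guided Feature Aggregation for Video Object Detection): first, it does not use optical flow to predict a flow field and needs no optical-flow network (FlowNet), so I would expect it to be faster; second, it uses fewer related frames during training than FGFA (12 vs. 51); and finally, its results are better than FGFA's.
However, no code has been released yet; hopefully it will be available soon.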
