Lightweight Backbones

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications  (2017.04)

1. Replace the standard conv with depthwise conv + pointwise conv; the speedup only materializes on platforms with optimized depthwise support.
Computation reduction ratio: (D_K·D_K·M·D_F·D_F + M·N·D_F·D_F) / (D_K·D_K·M·N·D_F·D_F) = 1/N + 1/D_K² (N output channels, D_K kernel size).

Computation and parameters concentrate mostly in the pointwise conv, and pointwise conv needs no im2col operation.

2. Two hyperparameters adjust channel width and resolution: width multiplier α and resolution multiplier ρ; computation shrinks by roughly α² and ρ² respectively.

3. Because the model is small, many training tricks are unnecessary; see Section 3.2 of the paper.
4. ReLU6 is used, which behaves well on float16/int8 embedded devices.
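The cost arithmetic above can be checked in a few lines (hypothetical helpers using the paper's notation: D_K kernel size, M/N input/output channels, D_F feature-map size):

```python
def standard_conv_flops(dk, m, n, df):
    # Standard conv: D_K * D_K * M * N * D_F * D_F multiply-adds
    return dk * dk * m * n * df * df

def separable_conv_flops(dk, m, n, df):
    # Depthwise: D_K*D_K*M*D_F*D_F, plus pointwise: M*N*D_F*D_F
    return dk * dk * m * df * df + m * n * df * df

# Example layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map
std = standard_conv_flops(3, 512, 512, 14)
sep = separable_conv_flops(3, 512, 512, 14)
ratio = sep / std  # equals 1/N + 1/D_K^2 = 1/512 + 1/9, about 0.113

# Width multiplier alpha scales both M and N -> cost shrinks ~alpha^2;
# resolution multiplier rho scales D_F -> cost shrinks rho^2.
alpha, rho = 0.5, 0.5
scaled = separable_conv_flops(3, int(alpha * 512), int(alpha * 512), int(rho * 14))
```

Here the ratio is dominated by the 1/D_K² term, which is why the savings are close to 9× for 3×3 kernels.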

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices (2017.07)

Motivation:
For each residual unit in ResNeXt, the pointwise convolutions occupy 93.4% of the multiply-adds.
A straightforward solution is to apply channel-sparse connections (group convolutions).
One side effect: outputs from a certain channel are derived from only a small fraction of input channels.
Hence the paper proposes channel shuffle, building its unit on the ResNet bottleneck.
It is difficult to efficiently implement depthwise convolution on low-power mobile devices, likely because of a worse computation/memory-access ratio compared with other dense operations, so depthwise conv is used only inside the bottleneck.

from tensorflow.keras import backend as K

def channel_shuffle(x, groups):
    """
    Parameters
        x: input tensor with `channels_last` data format
        groups: number of groups to divide the channels into
    Returns
        channel-shuffled output tensor
    Examples
        Example for a 1D array with 3 groups:
        >>> d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
        >>> x = np.reshape(d, (3, 3))
        >>> x = np.transpose(x, [1, 0])
        >>> x = np.reshape(x, (9,))
        '[0 1 2 3 4 5 6 7 8] --> [0 3 6 1 4 7 2 5 8]'
    """
    height, width, in_channels = x.shape.as_list()[1:]
    channels_per_group = in_channels // groups
    # split the channel axis into (groups, channels_per_group) ...
    x = K.reshape(x, [-1, height, width, groups, channels_per_group])
    # ... swap the two sub-axes ...
    x = K.permute_dimensions(x, (0, 1, 2, 4, 3))
    # ... and flatten back: channels are now interleaved across groups
    x = K.reshape(x, [-1, height, width, in_channels])
    return x

MobileNetV2: Inverted Residuals and Linear Bottlenecks (2018.01)

The depthwise kernels are easy to "train dead": after training, many of them turn out to be all zeros, which the authors attribute to ReLU.
ReLU loses information when applied to low-dimensional embeddings.

https://www.zhihu.com/question/265709710/answer/298245276
(From the linked answer:) Depthwise conv does drastically cut computation, and the N×N depthwise + 1×1 pointwise structure gets close to an N×N conv in accuracy. In practice, though, the depthwise kernels are easy to train dead: after training, many of them come out empty. Our explanation at the time: each depthwise kernel sees far fewer dimensions than a vanilla conv kernel; with such a small kernel dim, under ReLU a neuron's output easily becomes 0, and since ReLU's gradient at a 0 output is 0, once stuck at 0 it never recovers. We also found the problem is further amplified by fixed-point low-precision training.
 

Solution: inverted residuals (expand channels first) + linear bottlenecks (drop the final ReLU6).
The reduced channel count on the residual ends also saves skip-branch bandwidth. (Bandwidth here means cache-to-DDR traffic: if the main-branch convolutions are fused on the fly, the reads/writes needed are the input read, the main-branch feature read for the element-wise add, and the element-wise output write; since V2's element-wise part has few channels, it saves bandwidth.)
The backbone also works for detection and segmentation tasks.
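The block layout above can be sketched as a shape plan (a hypothetical helper, not the paper's code; expansion factor t=6 as in the paper, and the projection conv is linear, i.e. no ReLU6):

```python
def inverted_residual_plan(c_in, c_out, t=6, stride=1):
    """Return the layer sequence of a MobileNetV2 inverted residual block
    as (op, in_channels, out_channels, activation) tuples."""
    c_mid = c_in * t  # expand to a wide intermediate representation
    plan = [
        ("1x1 conv (expand)",  c_in,  c_mid, "ReLU6"),
        ("3x3 depthwise",      c_mid, c_mid, "ReLU6"),
        ("1x1 conv (project)", c_mid, c_out, "linear"),  # linear bottleneck
    ]
    # identity skip connects the two *thin* ends, unlike classic ResNet
    use_skip = (stride == 1 and c_in == c_out)
    return plan, use_skip

plan, skip = inverted_residual_plan(24, 24)
```

Note the inversion relative to ResNet: the residual connection joins the narrow bottlenecks while the depthwise conv runs on the expanded channels.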

ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design (2018.07)

Motivation:
1. FLOPs is an indirect metric.
First, several important factors that considerably affect speed are not captured by FLOPs.
Second, operations with the same FLOPs can have different running times, depending on the platform.
2. Two principles should be considered:
First, use the direct metric (speed).
Second, evaluate that metric on the target platform.
The paper takes memory access cost (MAC) into account, runs experiments on both GPU and ARM, and proposes four guidelines:

G1) Equal channel width minimizes memory access cost (MAC)
G2) Excessive group convolution increases MAC
G3) Network fragmentation reduces degree of parallelism.
G4) Element-wise operations are non-negligible.
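G1 can be checked numerically: for a 1×1 conv with fixed FLOPs B = h·w·c1·c2, MAC = h·w·(c1 + c2) + c1·c2, which is minimized when c1 = c2 (a small sketch of the paper's argument with illustrative numbers):

```python
def mac(h, w, c1, c2):
    # Memory access cost of a 1x1 conv:
    # input read h*w*c1 + output write h*w*c2 + weight read c1*c2
    return h * w * (c1 + c2) + c1 * c2

# All pairs have identical FLOPs, since c1*c2 = 4096 is held constant
pairs = [(16, 256), (32, 128), (64, 64), (128, 32), (256, 16)]
costs = {(c1, c2): mac(28, 28, c1, c2) for c1, c2 in pairs}
best = min(costs, key=costs.get)  # equal widths give the lowest MAC
```

This follows from the AM–GM inequality: c1 + c2 ≥ 2·sqrt(c1·c2), with equality at c1 = c2.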
The resulting block design:

Benefits:
the three convs on the main branch share the same channel width (G1)
the two 1×1 convolutions are no longer group-wise (G2)
the skip connection is a pure identity (G3, G4)
Concat instead of element-wise add (G4)

Why accurate?
1. More feature channels.
2. Feature reuse: only half of the channels take the skip connection, matching the observation (as in DenseNet) that connections between adjacent layers are stronger than the others.
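The resulting data flow (channel split → process one half → concat → channel shuffle) can be sketched in NumPy, with the conv branch stubbed out as identity (a schematic, not a full implementation):

```python
import numpy as np

def shuffle_v2_unit(x, groups=2):
    """x: (H, W, C) feature map with C even."""
    c = x.shape[-1]
    # channel split: only one branch carries convs (G3), the other is identity
    left, right = x[..., : c // 2], x[..., c // 2 :]
    # `right` would pass through 1x1 -> 3x3 DW -> 1x1 (stubbed as identity here)
    out = np.concatenate([left, right], axis=-1)  # concat, not add (G4)
    # channel shuffle: reshape -> transpose -> flatten over the channel axis
    h, w, c = out.shape
    out = out.reshape(h, w, groups, c // groups).transpose(0, 1, 3, 2).reshape(h, w, c)
    return out

x = np.arange(8, dtype=np.float32).reshape(1, 1, 8)
y = shuffle_v2_unit(x)  # channels [0..7] -> [0, 4, 1, 5, 2, 6, 3, 7]
```

The shuffle at the end is what mixes information between the two halves across stacked units, enabling the feature reuse noted above.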

Searching for MobileNetV3 (2019.05)

Architecture:

Redesigning Expensive Layers
Head: initial conv channels 32 → 16; ReLU → h-swish.
Tail: move the average pooling earlier, so the expansion to 1280 channels operates on a 1×1 instead of a 7×7 feature map; with no accuracy loss this saves 10 ms (a 15% speedup) and removes 30M MAdds.

Activation h-swish: h-swish(x) = x · ReLU6(x + 3) / 6.
Most of the benefits of swish are realized by using it only in the deeper layers,
so h-swish is applied only in the second half of the model.
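h-swish replaces the sigmoid inside swish with a piecewise-linear hard sigmoid, which is cheap on fixed-point hardware (a NumPy sketch):

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    # x * hard_sigmoid(x), where hard_sigmoid(x) = ReLU6(x + 3) / 6
    return x * relu6(x + 3.0) / 6.0

vals = h_swish(np.array([-4.0, -3.0, 0.0, 3.0, 6.0]))
# -> [0, 0, 0, 3, 6]: zero for x <= -3, identity-like for large x
```

Being built from ReLU6, it quantizes well, consistent with point 4 of the MobileNetV1 notes above.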

Large squeeze-and-excite:
the SE bottleneck width is fixed to 1/4 of the number of channels in the expansion layer.
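A schematic of that squeeze-and-excite block (hypothetical weights and helper names; the bottleneck is C // 4, and MobileNetV3 gates with a hard sigmoid rather than a sigmoid):

```python
import numpy as np

def hard_sigmoid(x):
    return np.clip(x + 3.0, 0.0, 6.0) / 6.0

def se_block(x, w1, w2):
    """x: (H, W, C) feature map; w1: (C, C//4), w2: (C//4, C)."""
    s = x.mean(axis=(0, 1))      # squeeze: global average pool -> (C,)
    s = np.maximum(s @ w1, 0.0)  # excite: FC down to C/4, ReLU
    s = hard_sigmoid(s @ w2)     # FC back up to C, hard-sigmoid gate in [0, 1]
    return x * s                 # reweight channels

c = 16
rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, c))
y = se_block(x, rng.standard_normal((c, c // 4)), rng.standard_normal((c // 4, c)))
```

Because the gate lies in [0, 1], the block can only attenuate channels, never amplify them.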

Segmentation
1. Trained from scratch, without ImageNet pretraining.
2. Atrous convolution is applied to the last block of MobileNetV3.
3. Halving the channels of the last block causes no noticeable drop; the paper argues the original width was designed for 1000 ImageNet classes, while this segmentation task has only 19 classes.

References:
A Tour of Lightweight Neural Networks (1): ShuffleNetV2
A Tour of Lightweight Neural Networks (2): MobileNet, from V1 to V3
Lightweight Networks: ShuffleNet and MobileNet v1/v2 Explained
ShuffleNet v1 and ShuffleNet v2
Why is depthwise convolution slower than standard convolution?
How to evaluate MobileNet v2?
Why MobileNet and Its Variants (e.g. ShuffleNet) Are Fast

 
