Classic Network Architectures (6): DenseNet (Densely Connected Networks)

This section draws on 《深度學習之PyTorch物體檢測實戰》 and Dive into Deep Learning.

Function Decomposition

ResNet significantly changed the view of how to parametrize the functions in deep networks. DenseNet is to some extent the logical extension of this. To understand how to arrive at it, let us take a small detour to theory. Recall the Taylor expansion for functions. For scalars it can be written as
$f(x) = f(0) + f'(0)\,x + \frac{1}{2} f''(0)\,x^2 + \frac{1}{6} f'''(0)\,x^3 + o(x^3)$

The key point is that it decomposes the function into increasingly higher order terms. In a similar vein, ResNet decomposes functions into
$f(x) = x + g(x)$

That is, ResNet decomposes $f$ into a simple linear term and a more complex nonlinear one. What if we want to go beyond two terms? A solution was proposed by [Huang et al., 2017] in the form of DenseNet, an architecture that reported record performance on the ImageNet dataset.
[Fig. 5.10: Cross-layer connections in ResNet (addition) and DenseNet (concatenation).]
As shown in Fig. 5.10, the key difference between ResNet and DenseNet is that in the latter case outputs are concatenated rather than added. As a result, we perform a mapping from $x$ to its values after applying an increasingly complex sequence of functions:
$x \to \left[x,\; f_1(x),\; f_2([x, f_1(x)]),\; f_3([x, f_1(x), f_2([x, f_1(x)])]),\; \ldots\right]$

[Figure: the resulting dense connections between layers in DenseNet.]
The name DenseNet arises from the fact that the dependency graph between variables becomes quite dense. The last layer of such a chain is densely connected to all previous layers.
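To make the difference concrete, here is a minimal PyTorch illustration (the tensor shapes are illustrative, not from the source): a residual connection adds the block output to its input and keeps the channel count, while a dense connection concatenates along the channel dimension and grows it.

import torch

x = torch.randn(1, 3, 8, 8)       # input with 3 channels
fx = torch.randn(1, 3, 8, 8)      # stand-in for g(x) / f1(x)

resnet_style = x + fx                        # shape stays (1, 3, 8, 8)
densenet_style = torch.cat((x, fx), dim=1)   # shape becomes (1, 6, 8, 8)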

Building on ResNet, DenseNet maximizes the flow of information between earlier and later layers: every layer receives dense connections from all preceding layers, so features are reused along the channel dimension. This lets DenseNet outperform ResNet while using fewer parameters and less computation.

The main building blocks of DenseNet are the dense block and the transition layer. The former defines how inputs and outputs are concatenated, while the latter keeps the number of channels from growing too large.

Dense Block

DenseNet uses the improved "BN-ReLU-Conv" ordering from the ResNet variants. The $1\times 1$ convolution here is used to reduce dimensionality and save computation, i.e. it is the so-called bottleneck layer. Take a DenseBlock containing 32 conv_blocks as an example: the input of the 32nd conv_block is the concatenation of the outputs of the preceding 31 layers, each of which outputs 32 channels (the growth rate). Without the bottleneck layer, the input to the 32nd layer's $3\times 3$ convolution would have $31 \times 32$ channels plus the output channels of the previous DenseBlock, close to 1000 in total. With the bottleneck layer, the $1\times 1$ convolution typically outputs $4\times$ the growth rate, i.e. 128 channels, which dramatically reduces the computation.

import torch
from torch import nn


def conv_block(input_channels, output_channels):
    # The 1 x 1 bottleneck convolution typically outputs 4 * growth_rate channels
    inter_channels = 4 * output_channels

    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(input_channels, inter_channels, kernel_size=1, bias=False),

        nn.BatchNorm2d(inter_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(inter_channels, output_channels, kernel_size=3, padding=1, bias=False),
        )
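As a quick shape check (the values below are illustrative), a conv_block preserves the spatial size and maps its input to output_channels feature maps:

blk = conv_block(3, 10)
X = torch.randn(4, 3, 8, 8)
blk(X).shape  # torch.Size([4, 10, 8, 8])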

A dense block consists of multiple conv_blocks, all with the same number of output channels. During the forward pass, however, the input and output of each block are concatenated along the channel dimension.

# output_channels is also referred to as the growth rate (growthRate)
class DenseBlock(nn.Module):
    def __init__(self, num_convs, input_channels, output_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(output_channels*i+input_channels, output_channels))
        self.net = nn.Sequential(*layer)
        
    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate the input and output of each block on the channel
            # dimension
            X = torch.cat((X, Y), dim=1)
        return X

In the example below, we define a DenseBlock with 2 conv blocks of 10 output channels each. Given an input with 3 channels, we get an output with $3 + 2 \times 10 = 23$ channels. Because the number of output channels of the conv blocks controls how fast the channel count grows relative to the input, it is also called the growth rate.

blk = DenseBlock(2, 3, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape  # torch.Size([4, 23, 8, 8])

Transition Layer

Since every dense block increases the number of channels, stacking too many of them would make the model overly complex. A transition layer is used to control model complexity: it reduces the number of channels with a $1\times 1$ convolution and halves the height and width with an average pooling layer of stride 2.

def transition_block(input_channels, output_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), 
        nn.ReLU(inplace=True),
        nn.Conv2d(input_channels, output_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
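Continuing the example above, applying a transition layer with 10 output channels to the 23-channel output Y reduces the channel count and halves the height and width:

blk = transition_block(23, 10)
blk(Y).shape  # torch.Size([4, 10, 4, 4])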

The DenseNet Model


  • DenseNet first uses the same single convolution layer and max pooling layer as ResNet.
  • Analogous to the four residual-block modules in ResNet, DenseNet then uses four dense blocks. As with ResNet, we can choose how many conv layers each dense block uses; here we use 4, matching ResNet-18. The number of channels of the conv layers in the dense blocks (i.e. the growth rate) is set to 32, so each dense block adds 128 channels.
  • ResNet shrinks the height and width between modules with residual blocks of stride 2; DenseNet instead uses transition layers, which halve the height and width and also halve the number of channels.
  • As in ResNet, a global pooling layer and a fully connected layer produce the output.
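Working through the channel counts with these settings: the stem outputs 64 channels, each dense block adds $4 \times 32 = 128$ channels, and each transition layer halves the count, giving $64 \to 192 \to 96 \to 224 \to 112 \to 240 \to 120 \to 248$ channels entering the classifier.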
class DenseNet(nn.Module):
    def __init__(self, input_channels, class_num):
        super(DenseNet, self).__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(input_channels, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), 
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
        
        num_channels, growth_rate = 64, 32
        num_convs_in_dense_blocks = [4, 4, 4, 4]
        blks = []
        for i, num_convs in enumerate(num_convs_in_dense_blocks):
            blks.append(DenseBlock(num_convs, num_channels, growth_rate))
            # After this update, num_channels is the number of output channels
            # of the dense block just added
            num_channels += num_convs * growth_rate
            # A transition layer that halves the number of channels is added
            # between the dense blocks
            if i != len(num_convs_in_dense_blocks) - 1:
                blks.append(transition_block(num_channels, num_channels // 2))
                num_channels = num_channels // 2
        self.blks = nn.Sequential(*blks)
        
        self.classifier = nn.Sequential(
            nn.BatchNorm2d(num_channels), 
            nn.ReLU(inplace=True),
            nn.AdaptiveMaxPool2d((1,1)),
            nn.Flatten(),
            nn.Linear(num_channels, class_num))
                
    def forward(self, x):
        y = self.stem(x)
        y = self.blks(y)
        y = self.classifier(y)
        
        return y
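As a quick sanity check of the full model (the input resolution, batch size, and class count below are illustrative assumptions):

net = DenseNet(input_channels=1, class_num=10)
x = torch.randn(1, 1, 96, 96)
net(x).shape  # torch.Size([1, 10]); the final Linear layer sees 248 channels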

Why do we use average pooling rather than max-pooling in the transition layer?

Reference: https://stats.stackexchange.com/questions/413275/why-do-we-use-the-average-pooling-layers-instead-of-max-pooling-layers-in-the-de

Average pooling better represents the overall strength of a feature: the gradient flows back through all spatial positions, whereas max pooling only passes the gradient through the position of the maximum. This mirrors the spirit of DenseNet itself, where connections are built between every pair of layers.
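A small experiment (a sketch, not from the source) makes the gradient-flow difference visible:

x = torch.tensor([[[[0., 1.], [2., 3.]]]], requires_grad=True)
nn.AvgPool2d(2)(x).sum().backward()
print(x.grad)   # 0.25 at every position: all inputs receive gradient

x = torch.tensor([[[[0., 1.], [2., 3.]]]], requires_grad=True)
nn.MaxPool2d(2)(x).sum().backward()
print(x.grad)   # 1.0 only at the position of the maximum, 0 elsewhere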

High Memory Consumption

Reference: https://github.com/tensorflow/tensorflow/issues/12948

One problem for which DenseNet has been criticized is its high memory consumption.

  • Is this really the case? Try to change the input shape to 224×224 to see the actual (GPU) memory consumption.

While DenseNets are fairly easy to implement in deep learning frameworks, most implementations (such as the original one) tend to be memory-hungry. In particular, the number of intermediate feature maps generated by the batch normalization and concatenation operations grows quadratically with network depth. It is worth emphasizing that this is not a property inherent to DenseNets, but rather of the implementation.

DenseNet is an effective network design that relies on applying layers to recursive concatenations of data along the channel axis. Unfortunately, this has the side effect of quadratic memory growth in TensorFlow, because a completely new block of memory is allocated after each concat operation, resulting in poor performance during all phases of execution.
This is a feature request for a new allocation='shared' option for operations such as tf.concat(allocation='shared'). This would make it possible to use the Memory-Efficient Implementation of DenseNets, a paper which demonstrates that this memory utilization can be dramatically reduced through sharing of allocations.

This implementation uses a new strategy to reduce the memory consumption of DenseNets: checkpointing is used for the batch normalization and concatenation feature maps. These intermediate feature maps are discarded during the forward pass and recomputed for the backward pass. This adds roughly 15-20% time overhead during training, but reduces feature-map memory consumption from quadratic to linear.

This functionality would also be useful for any other application or future network design that employs recursive concatenations.
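As a rough sketch of how this checkpointing idea could be applied to the DenseBlock defined earlier in PyTorch (an illustration using torch.utils.checkpoint and reusing the conv_block above, not the official memory-efficient implementation):

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedDenseBlock(nn.Module):
    # Same layout as DenseBlock above, but each conv_block's intermediate
    # activations are recomputed in the backward pass instead of being stored.
    def __init__(self, num_convs, input_channels, growth_rate):
        super().__init__()
        self.net = nn.ModuleList(
            [conv_block(growth_rate * i + input_channels, growth_rate)
             for i in range(num_convs)])

    def forward(self, X):
        for blk in self.net:
            # checkpoint() discards blk's intermediate feature maps during the
            # forward pass and recomputes them when gradients are needed
            # (newer PyTorch versions also accept an explicit use_reentrant flag)
            Y = checkpoint(blk, X)
            X = torch.cat((X, Y), dim=1)
        return X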

Why do we not need to concatenate terms if we are just interested in x and f(x) for ResNet? Why do we need this for more than two layers in DenseNet?

Reference: https://blog.csdn.net/u014380165/article/details/75142664

Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision.
These dense connections effectively wire every layer directly to both the input and the loss, which helps alleviate the vanishing-gradient problem.

Design a DenseNet for fully connected networks and apply it to the Housing Price prediction task.

Left as a placeholder for a full implementation; a rough sketch follows.
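A minimal sketch of what such a fully connected DenseNet might look like (the layer widths, growth rate, and input dimension below are illustrative assumptions, not values from the source): each dense unit is a BN-ReLU-Linear block whose output is concatenated onto its input along the feature dimension, and a final linear layer regresses the price.

import torch
from torch import nn

def dense_fc_block(in_features, growth_rate):
    # Fully connected analogue of conv_block: BN-ReLU-Linear
    return nn.Sequential(
        nn.BatchNorm1d(in_features),
        nn.ReLU(inplace=True),
        nn.Linear(in_features, growth_rate))

class DenseMLP(nn.Module):
    def __init__(self, in_features, num_layers=4, growth_rate=32):
        super().__init__()
        self.layers = nn.ModuleList(
            [dense_fc_block(in_features + i * growth_rate, growth_rate)
             for i in range(num_layers)])
        self.out = nn.Linear(in_features + num_layers * growth_rate, 1)

    def forward(self, x):
        for layer in self.layers:
            # Concatenate input and output along the feature dimension
            x = torch.cat((x, layer(x)), dim=1)
        return self.out(x)

# Illustrative usage; the feature count 331 is an assumption about the
# preprocessed Kaggle house price data, not a value from the source
net = DenseMLP(in_features=331)
price = net(torch.randn(8, 331))   # shape: (8, 1)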
