Classic Network Architectures (6): DenseNet (Densely Connected Networks)

This section draws on 《深度學習之PyTorch物體檢測實戰》 and Dive into Deep Learning.

Function Decomposition

ResNet significantly changed the view of how to parametrize the functions in deep networks. DenseNet is to some extent the logical extension of this. To understand how to arrive at it, let us take a small detour to theory. Recall the Taylor expansion for functions. For scalars it can be written as
$f(x) = f(0) + f'(0)\,x + \frac{1}{2} f''(0)\,x^2 + \frac{1}{6} f'''(0)\,x^3 + o(x^3)$

The key point is that it decomposes the function into increasingly higher order terms. In a similar vein, ResNet decomposes functions into
$f(x) = x + g(x)$

That is, ResNet decomposes $f$ into a simple linear term and a more complex nonlinear one. What if we want to go beyond two terms? A solution was proposed by [Huang et al., 2017] in the form of DenseNet, an architecture that reported record performance on the ImageNet dataset.
[Fig. 5.10: Cross-layer connections in ResNet (addition) and DenseNet (concatenation).]
As shown in Fig. 5.10, the key difference between ResNet and DenseNet is that in the latter case outputs are concatenated rather than added. As a result, we perform a mapping from $x$ to its values after applying an increasingly complex sequence of functions:
$x \to \left[x,\; f_1(x),\; f_2([x, f_1(x)]),\; f_3([x, f_1(x), f_2([x, f_1(x)])]),\; \ldots\right]$

[Figure: the resulting dense connections between layers in DenseNet.]
The name DenseNet arises from the fact that the dependency graph between variables becomes quite dense. The last layer of such a chain is densely connected to all previous layers.
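To make the difference concrete, here is a minimal PyTorch illustration (the tensor shapes are illustrative, not from the source): a residual connection adds the block output to its input and keeps the channel count, while a dense connection concatenates along the channel dimension and grows it.

import torch

x = torch.randn(1, 3, 8, 8)       # input with 3 channels
fx = torch.randn(1, 3, 8, 8)      # stand-in for g(x) / f1(x)

resnet_style = x + fx                        # shape stays (1, 3, 8, 8)
densenet_style = torch.cat((x, fx), dim=1)   # shape becomes (1, 6, 8, 8)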

Building on ResNet, DenseNet maximizes the flow of information between earlier and later layers: every layer receives dense connections from all preceding layers, so features are reused along the channel dimension. This lets DenseNet outperform ResNet while using fewer parameters and less computation.

The main building blocks of DenseNet are the dense block and the transition layer. The former defines how inputs and outputs are concatenated, while the latter keeps the number of channels from growing too large.

Dense Block

DenseNet uses the improved "BN-ReLU-Conv" ordering from the ResNet variants. The $1\times 1$ convolution here is used to reduce dimensionality and save computation, i.e. it is the so-called bottleneck layer. Take a DenseBlock containing 32 conv_blocks as an example: the input of the 32nd conv_block is the concatenation of the outputs of the preceding 31 layers, each of which outputs 32 channels (the growth rate). Without the bottleneck layer, the input to the 32nd layer's $3\times 3$ convolution would have $31 \times 32$ channels plus the output channels of the previous DenseBlock, close to 1000 in total. With the bottleneck layer, the $1\times 1$ convolution typically outputs $4\times$ the growth rate, i.e. 128 channels, which dramatically reduces the computation.

import torch
from torch import nn


def conv_block(input_channels, output_channels):
    # The 1 x 1 bottleneck convolution typically outputs 4 * growth_rate channels
    inter_channels = 4 * output_channels

    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(input_channels, inter_channels, kernel_size=1, bias=False),

        nn.BatchNorm2d(inter_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(inter_channels, output_channels, kernel_size=3, padding=1, bias=False),
        )
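As a quick shape check (the values below are illustrative), a conv_block preserves the spatial size and maps its input to output_channels feature maps:

blk = conv_block(3, 10)
X = torch.randn(4, 3, 8, 8)
blk(X).shape  # torch.Size([4, 10, 8, 8])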

A dense block consists of multiple conv_blocks, all with the same number of output channels. During the forward pass, however, the input and output of each block are concatenated along the channel dimension.

# output_channels is also referred to as the growth rate (growthRate)
class DenseBlock(nn.Module):
    def __init__(self, num_convs, input_channels, output_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(output_channels*i+input_channels, output_channels))
        self.net = nn.Sequential(*layer)
        
    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate the input and output of each block on the channel
            # dimension
            X = torch.cat((X, Y), dim=1)
        return X

In the example below, we define a DenseBlock with 2 conv blocks of 10 output channels each. Given an input with 3 channels, we get an output with $3 + 2 \times 10 = 23$ channels. Because the number of output channels of the conv blocks controls how fast the channel count grows relative to the input, it is also called the growth rate.

blk = DenseBlock(2, 3, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape  # torch.Size([4, 23, 8, 8])

Transition Layer

Since every dense block increases the number of channels, stacking too many of them would make the model overly complex. A transition layer is used to control model complexity: it reduces the number of channels with a $1\times 1$ convolution and halves the height and width with an average pooling layer of stride 2.

def transition_block(input_channels, output_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), 
        nn.ReLU(inplace=True),
        nn.Conv2d(input_channels, output_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
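Continuing the example above, applying a transition layer with 10 output channels to the 23-channel output Y reduces the channel count and halves the height and width:

blk = transition_block(23, 10)
blk(Y).shape  # torch.Size([4, 10, 4, 4])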

The DenseNet Model


  • DenseNet first uses the same single convolution layer and max pooling layer as ResNet.
  • Analogous to the four residual-block modules in ResNet, DenseNet then uses four dense blocks. As with ResNet, we can choose how many conv layers each dense block uses; here we use 4, matching ResNet-18. The number of channels of the conv layers in the dense blocks (i.e. the growth rate) is set to 32, so each dense block adds 128 channels.
  • ResNet shrinks the height and width between modules with residual blocks of stride 2; DenseNet instead uses transition layers, which halve the height and width and also halve the number of channels.
  • As in ResNet, a global pooling layer and a fully connected layer produce the output.
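Working through the channel counts with these settings: the stem outputs 64 channels, each dense block adds $4 \times 32 = 128$ channels, and each transition layer halves the count, giving $64 \to 192 \to 96 \to 224 \to 112 \to 240 \to 120 \to 248$ channels entering the classifier.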
class DenseNet(nn.Module):
    def __init__(self, input_channels, class_num):
        super(DenseNet, self).__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(input_channels, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), 
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
        
        num_channels, growth_rate = 64, 32
        num_convs_in_dense_blocks = [4, 4, 4, 4]
        blks = []
        for i, num_convs in enumerate(num_convs_in_dense_blocks):
            blks.append(DenseBlock(num_convs, num_channels, growth_rate))
            # After this update, num_channels is the number of output channels
            # of the dense block just added
            num_channels += num_convs * growth_rate
            # A transition layer that halves the number of channels is added
            # between the dense blocks
            if i != len(num_convs_in_dense_blocks) - 1:
                blks.append(transition_block(num_channels, num_channels // 2))
                num_channels = num_channels // 2
        self.blks = nn.Sequential(*blks)
        
        self.classifier = nn.Sequential(
            nn.BatchNorm2d(num_channels), 
            nn.ReLU(inplace=True),
            nn.AdaptiveMaxPool2d((1,1)),
            nn.Flatten(),
            nn.Linear(num_channels, class_num))
                
    def forward(self, x):
        y = self.stem(x)
        y = self.blks(y)
        y = self.classifier(y)
        
        return y
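As a quick sanity check of the full model (the input resolution, batch size, and class count below are illustrative assumptions):

net = DenseNet(input_channels=1, class_num=10)
x = torch.randn(1, 1, 96, 96)
net(x).shape  # torch.Size([1, 10]); the final Linear layer sees 248 channels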

Why do we use average pooling rather than max-pooling in the transition layer?

Reference: https://stats.stackexchange.com/questions/413275/why-do-we-use-the-average-pooling-layers-instead-of-max-pooling-layers-in-the-de

Average pooling better represents the overall strength of a feature: the gradient flows back through all spatial positions, whereas max pooling only passes the gradient through the position of the maximum. This mirrors the spirit of DenseNet itself, where connections are built between every pair of layers.
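A small experiment (a sketch, not from the source) makes the gradient-flow difference visible:

x = torch.tensor([[[[0., 1.], [2., 3.]]]], requires_grad=True)
nn.AvgPool2d(2)(x).sum().backward()
print(x.grad)   # 0.25 at every position: all inputs receive gradient

x = torch.tensor([[[[0., 1.], [2., 3.]]]], requires_grad=True)
nn.MaxPool2d(2)(x).sum().backward()
print(x.grad)   # 1.0 only at the position of the maximum, 0 elsewhere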

High Memory Consumption

Reference: https://github.com/tensorflow/tensorflow/issues/12948

One problem for which DenseNet has been criticized is its high memory consumption.

  • Is this really the case? Try to change the input shape to 224×224 to see the actual (GPU) memory consumption.

While DenseNets are fairly easy to implement in deep learning frameworks, most implementations (such as the original one) tend to be memory-hungry. In particular, the number of intermediate feature maps generated by the batch normalization and concatenation operations grows quadratically with network depth. It is worth emphasizing that this is not a property inherent to DenseNets, but rather of the implementation.

DenseNet is an effective network design that relies on applying layers to recursive concatenations of data along the channel axis. Unfortunately, this has the side effect of quadratic memory growth in TensorFlow, because a completely new block of memory is allocated after each concat operation, resulting in poor performance during all phases of execution.
This is a feature request for a new allocation='shared' option for operations such as tf.concat(allocation='shared'). This would make it possible to use the Memory-Efficient Implementation of DenseNets, a paper which demonstrates that this memory utilization can be dramatically reduced through sharing of allocations.

This implementation uses a new strategy to reduce the memory consumption of DenseNets: checkpointing is used for the batch normalization and concatenation feature maps. These intermediate feature maps are discarded during the forward pass and recomputed for the backward pass. This adds roughly 15-20% time overhead during training, but reduces feature-map memory consumption from quadratic to linear.

This functionality would also be useful for any other application or future network design that employs recursive concatenations.
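As a rough sketch of how this checkpointing idea could be applied to the DenseBlock defined earlier in PyTorch (an illustration using torch.utils.checkpoint and reusing the conv_block above, not the official memory-efficient implementation):

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedDenseBlock(nn.Module):
    # Same layout as DenseBlock above, but each conv_block's intermediate
    # activations are recomputed in the backward pass instead of being stored.
    def __init__(self, num_convs, input_channels, growth_rate):
        super().__init__()
        self.net = nn.ModuleList(
            [conv_block(growth_rate * i + input_channels, growth_rate)
             for i in range(num_convs)])

    def forward(self, X):
        for blk in self.net:
            # checkpoint() discards blk's intermediate feature maps during the
            # forward pass and recomputes them when gradients are needed
            # (newer PyTorch versions also accept an explicit use_reentrant flag)
            Y = checkpoint(blk, X)
            X = torch.cat((X, Y), dim=1)
        return X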

Why do we not need to concatenate terms if we are just interested in x and f(x) for ResNet? Why do we need this for more than two layers in DenseNet?

Reference: https://blog.csdn.net/u014380165/article/details/75142664

Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision.
These dense connections effectively wire every layer directly to both the input and the loss, which helps alleviate the vanishing-gradient problem.

Design a DenseNet for fully connected networks and apply it to the Housing Price prediction task.

Left as a placeholder for a full implementation; a rough sketch follows.
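A minimal sketch of what such a fully connected DenseNet might look like (the layer widths, growth rate, and input dimension below are illustrative assumptions, not values from the source): each dense unit is a BN-ReLU-Linear block whose output is concatenated onto its input along the feature dimension, and a final linear layer regresses the price.

import torch
from torch import nn

def dense_fc_block(in_features, growth_rate):
    # Fully connected analogue of conv_block: BN-ReLU-Linear
    return nn.Sequential(
        nn.BatchNorm1d(in_features),
        nn.ReLU(inplace=True),
        nn.Linear(in_features, growth_rate))

class DenseMLP(nn.Module):
    def __init__(self, in_features, num_layers=4, growth_rate=32):
        super().__init__()
        self.layers = nn.ModuleList(
            [dense_fc_block(in_features + i * growth_rate, growth_rate)
             for i in range(num_layers)])
        self.out = nn.Linear(in_features + num_layers * growth_rate, 1)

    def forward(self, x):
        for layer in self.layers:
            # Concatenate input and output along the feature dimension
            x = torch.cat((x, layer(x)), dim=1)
        return self.out(x)

# Illustrative usage; the feature count 331 is an assumption about the
# preprocessed Kaggle house price data, not a value from the source
net = DenseMLP(in_features=331)
price = net(torch.randn(8, 331))   # shape: (8, 1)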
