1. The Concept of Batch Normalization
Batch Normalization: normalization over a batch of data.
Batch: a batch of data, usually a mini-batch.
Normalization: transforming to zero mean and unit variance.
Advantages:
- Allows a larger learning rate, speeding up model convergence;
- Removes the need for carefully designed weight initialization;
- Allows dropping dropout, or using a smaller dropout rate;
- Allows dropping L2 regularization, or using a smaller weight decay;
- Removes the need for LRN (Local Response Normalization).
The last part of the algorithm's pseudocode is the affine transform, i.e., scale and shift; the gamma and beta in the formula are learnable parameters, updated by backpropagating the loss.
Why add an affine transform after the normalize step? It increases the model's capacity and flexibility, giving it more choices: the model can learn whether (and by how much) the normalized activations should be transformed. The full algorithm is sketched below.
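The pseudocode figure from the paper is not reproduced here, but its four steps can be reconstructed as follows (notation as in the paper: a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$, with $\epsilon$ a small constant for numerical stability):

$$\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^{2}$$

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

The first three steps standardize the mini-batch to zero mean and unit variance; the last step is the scale and shift.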
This method was proposed in the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", mainly to address the ICS problem (Internal Covariate Shift: the scale and distribution of layer inputs keep changing during training).
2. PyTorch's BatchNorm 1d/2d/3d Implementations
In PyTorch, nn.BatchNorm1d, nn.BatchNorm2d, and nn.BatchNorm3d all inherit from the base class _BatchNorm.
2.1 _BatchNorm
The main parameters of _BatchNorm:
- num_features: the number of features per sample (the most important parameter);
- eps: a correction term added to the denominator to avoid division by zero;
- momentum: the coefficient of the exponential weighted average used to estimate the running mean/var;
- affine: a boolean, whether to apply the affine transform;
- track_running_stats: training state vs. testing state; in training the mean/var are continuously computed and updated, while in testing they are fixed (see the sketch after the constructor signature below);
def __init__(self, num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True)
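A minimal sketch (my own illustration, relying only on standard PyTorch behavior) of how the affine and track_running_stats flags behave:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=5)
print(bn.weight.shape, bn.bias.shape)  # gamma and beta: torch.Size([5]) each

# affine=False: no learnable gamma/beta are created
print(nn.BatchNorm1d(5, affine=False).weight)  # None

# track_running_stats=False: no running statistics are kept
print(nn.BatchNorm1d(5, track_running_stats=False).running_mean)  # None

# in eval mode, normalization uses the fixed running statistics
bn.eval()
y = bn(torch.randn(3, 5))  # running_mean/running_var are read, not updated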
2.2 nn.BatchNorm1d/nn.BatchNorm2d/nn.BatchNorm3d
The main attributes of nn.BatchNorm1d/nn.BatchNorm2d/nn.BatchNorm3d:
- running_mean: the running mean;
- running_var: the running variance;
- weight: gamma in the affine transform;
- bias: beta in the affine transform;
The BN formula:
During training, BN normalizes each feature with the statistics of the current mini-batch and updates running_mean/running_var with an exponential weighted average; at test time these fixed running statistics are used instead of batch statistics:
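The formula images are not reproduced here; the following reconstruction matches the PyTorch documentation for the BatchNorm modules:

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

with the running statistics updated as

$$\hat{x}_{\text{new}} = (1 - \text{momentum}) \cdot \hat{x} + \text{momentum} \cdot x_t$$

where $\hat{x}$ is the running estimate (running_mean or running_var) and $x_t$ is the statistic of the current batch. Note that PyTorch's momentum weights the new observation, which differs from the momentum convention used in optimizers.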
2.3 Input requirements of nn.BatchNorm1d/nn.BatchNorm2d/nn.BatchNorm3d
- nn.BatchNorm1d: input = batch_size × num_features × 1d feature dimension
- nn.BatchNorm2d: input = batch_size × num_features × 2d feature dimensions
- nn.BatchNorm3d: input = batch_size × num_features × 3d feature dimensions
2.3.1 nn.BatchNorm1d
nn.BatchNorm1d is what is used with fully connected layers, where each neuron corresponds to one feature. Suppose a layer has five neurons, i.e., five features. In the figure, each column is one sample; each sample has 5 features as the layer's input, and each feature (the part circled in red) has dimension 1. Together they form one feature of one sample.
Each training step operates on a batch. Suppose a batch has three samples; these three samples form the input of nn.BatchNorm1d, with shape [3, 5, 1]. The trailing 1 can sometimes be omitted, so the shape can also be written as [3, 5].
As we know, nn.BatchNorm1d keeps four quantities (running_mean, running_var, weight, bias), all stored per feature. As in the figure, with three samples of five features each, the mean and variance are computed across the three samples at the same feature position, and every feature dimension has its own mean, variance, gamma, and beta; a small sketch of this reduction follows.
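As a minimal illustration (my own example) of this "per-feature over the batch" reduction: for an input of shape (N, C, L), the batch statistics are reduced over every dimension except the feature dimension C, while gamma and beta are simply learnable tensors of shape [C]:

import torch

x = torch.randn(3, 5, 1)                 # (batch, features, feature dimension)
mean = x.mean(dim=(0, 2))                # one mean per feature -> shape [5]
var = x.var(dim=(0, 2), unbiased=False)  # biased variance, as BN uses when normalizing
print(mean.shape, var.shape)             # torch.Size([5]) torch.Size([5])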
Let's study nn.BatchNorm1d through code:
import torch
import torch.nn as nn

batch_size = 3     # batch size
num_features = 5   # number of features per sample
momentum = 0.3

features_shape = (1,)  # each feature has dimension 1

feature_map = torch.ones(features_shape)                                                 # [1],         1D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # [1,2,3,4,5], 2D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)          # 3 identical samples, 3D
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm1d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1  # manual bookkeeping, for comparison with bn's buffers

for i in range(2):
    outputs = bn(feature_maps_bs)

    print("\niteration:{}, running mean: {}".format(i, bn.running_mean))
    print("iteration:{}, running var: {}".format(i, bn.running_var))

    # batch statistics of the second feature: every value is 2, so mean 2, variance 0
    mean_t, var_t = 2, 0

    # reproduce PyTorch's exponential weighted average update by hand
    running_mean = (1 - momentum) * running_mean + momentum * mean_t
    running_var = (1 - momentum) * running_var + momentum * var_t

    print("iteration:{}, running mean of the second feature: {}".format(i, running_mean))
    print("iteration:{}, running var of the second feature: {}".format(i, running_var))
Running the code gives the following output:
iteration:0, running mean: tensor([0.3000, 0.6000, 0.9000, 1.2000, 1.5000])
iteration:0, running var: tensor([0.7000, 0.7000, 0.7000, 0.7000, 0.7000])
iteration:0, running mean of the second feature: 0.6
iteration:0, running var of the second feature: 0.7

iteration:1, running mean: tensor([0.5100, 1.0200, 1.5300, 2.0400, 2.5500])
iteration:1, running var: tensor([0.4900, 0.4900, 0.4900, 0.4900, 0.4900])
iteration:1, running mean of the second feature: 1.02
iteration:1, running var of the second feature: 0.48999999999999994
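These numbers can be verified by applying the update rule by hand. For the second feature, the batch mean is 2 and the batch variance is 0, starting from running_mean = 0 and running_var = 1:

$$\text{mean:} \quad 0.7 \times 0 + 0.3 \times 2 = 0.6, \qquad 0.7 \times 0.6 + 0.3 \times 2 = 1.02$$
$$\text{var:} \quad 0.7 \times 1 + 0.3 \times 0 = 0.7, \qquad 0.7 \times 0.7 + 0.3 \times 0 = 0.49$$

which matches both the module's buffers and the manual bookkeeping (the 0.48999... is ordinary floating-point rounding).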
2.3.2 nn.BatchNorm2d
The main difference between the inputs of nn.BatchNorm2d and nn.BatchNorm1d is the feature dimension: a feature map output by a convolutional layer is two-dimensional.
As in the figure, suppose each feature map is 2×2 and a layer has three kernels, so it outputs three channels of 2×2 feature maps. In BN, one feature map is treated as one feature, and BN computes the mean, variance, gamma, and beta per feature. With a batch of three such samples, the input of nn.BatchNorm2d therefore has shape [3, 3, 2, 2].
Let's examine the concrete use of nn.BatchNorm2d through code:
batch_size = 3
num_features = 6
momentum = 0.3

features_shape = (2, 2)  # each feature map is 2x2

feature_map = torch.ones(features_shape)                                                 # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)          # 4D: [3, 6, 2, 2]
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm2d(num_features=num_features, momentum=momentum)

for i in range(2):
    outputs = bn(feature_maps_bs)

    print("\niter:{}, running_mean.shape: {}".format(i, bn.running_mean.shape))
    print("iter:{}, running_var.shape: {}".format(i, bn.running_var.shape))
    print("iter:{}, weight.shape: {}".format(i, bn.weight.shape))
    print("iter:{}, bias.shape: {}".format(i, bn.bias.shape))
The corresponding output is:
iter:0, running_mean.shape: torch.Size([6])
iter:0, running_var.shape: torch.Size([6])
iter:0, weight.shape: torch.Size([6])
iter:0, bias.shape: torch.Size([6])
iter:1, running_mean.shape: torch.Size([6])
iter:1, running_var.shape: torch.Size([6])
iter:1, weight.shape: torch.Size([6])
iter:1, bias.shape: torch.Size([6])
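All four quantities have shape [6]: one scalar per channel, regardless of the 2×2 spatial size. As a further sketch (my own check, assuming only standard PyTorch behavior), the training-mode output can be reproduced by normalizing each channel with the biased batch statistics taken over the batch and spatial dimensions:

import torch
import torch.nn as nn

x = torch.randn(3, 6, 2, 2)
bn = nn.BatchNorm2d(num_features=6, affine=False)  # disable gamma/beta to compare pure normalization
out = bn(x)                                        # module is in training mode by default

mean = x.mean(dim=(0, 2, 3), keepdim=True)                # per-channel mean over N, H, W
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # biased variance, used for normalizing
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(out, manual, atol=1e-6))  # expected: True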
2.3.3 nn.BatchNorm3d
The figure shows the input form of nn.BatchNorm3d: each feature of a sample is 3-dimensional with shape [2, 2, 3], each sample has 3 features, and there are 3 samples in total, so the input of nn.BatchNorm3d has shape [3, 3, 2, 2, 3].
The code for nn.BatchNorm3d is as follows:
batch_size = 3
num_features = 4
momentum = 0.3

features_shape = (2, 2, 3)  # each feature is a 2x2x3 volume

feature = torch.ones(features_shape)                                               # 3D
feature_map = torch.stack([feature * (i + 1) for i in range(num_features)], dim=0)  # 4D
feature_maps = torch.stack([feature_map for i in range(batch_size)], dim=0)         # 5D: [3, 4, 2, 2, 3]
print("input data:\n{} shape is {}".format(feature_maps, feature_maps.shape))

bn = nn.BatchNorm3d(num_features=num_features, momentum=momentum)

for i in range(2):
    outputs = bn(feature_maps)

    print("\niter:{}, running_mean.shape: {}".format(i, bn.running_mean.shape))
    print("iter:{}, running_var.shape: {}".format(i, bn.running_var.shape))
    print("iter:{}, weight.shape: {}".format(i, bn.weight.shape))
    print("iter:{}, bias.shape: {}".format(i, bn.bias.shape))