Preface
How the weights are initialized determines whether a model converges quickly, which is crucial for reducing training time.
Below, weight initialization for two convolutional layers and one fully-connected layer serves as a controlled experiment; both versions of the code are run for only one epoch so the results can be compared.
Note: when training on a GPU, the weights must be created with requires_grad set so that gradients are tracked; otherwise no gradient is accumulated on them.
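The note above can be checked with a minimal sketch: a weight created as a leaf tensor with requires_grad=True (on whatever device is available) does accumulate a gradient after backward():

```python
import torch

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Create the weight as a leaf tensor directly on the target device,
# with requires_grad=True so autograd records operations on it.
w = torch.randn(4, 4, device=device, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
print(w.grad is not None)  # → True: the gradient was stored on the leaf tensor
```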
Without weight normalization
Code
import torch

USE_GPU = True
dtype = torch.float32  # we will be using float throughout this tutorial
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# ------------------------- weights
conv_w1 = torch.randn((32, 3, 5, 5), device=device, dtype=dtype)  # [out_channel, in_channel, kernel_H, kernel_W]
conv_w1.requires_grad = True
conv_b1 = torch.zeros((32,), device=device, dtype=dtype, requires_grad=True)  # out_channel
conv_w2 = torch.randn((16, 32, 3, 3), device=device, dtype=dtype)  # [out_channel, in_channel, kernel_H, kernel_W]
conv_w2.requires_grad = True
conv_b2 = torch.zeros((16,), device=device, dtype=dtype, requires_grad=True)  # out_channel
# you must calculate the shape of the tensor after two conv layers, before the fully-connected layer
fc_w = torch.randn((16 * 32 * 32, 10), device=device, dtype=dtype)
fc_w.requires_grad = True
fc_b = torch.zeros(10, device=device, dtype=dtype, requires_grad=True)
Results
After weight normalization
Code
import numpy as np
import torch

USE_GPU = True
dtype = torch.float32  # we will be using float throughout this tutorial
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# ------------------------ weights, scaled by sqrt(2 / fan_in) (He initialization)
conv_w1 = torch.randn((32, 3, 5, 5), device=device, dtype=dtype) * np.sqrt(2. / (3 * 5 * 5))  # [out_channel, in_channel, kernel_H, kernel_W]
conv_w1.requires_grad = True
conv_b1 = torch.zeros((32,), device=device, dtype=dtype, requires_grad=True)  # out_channel
conv_w2 = torch.randn((16, 32, 3, 3), device=device, dtype=dtype) * np.sqrt(2. / (32 * 3 * 3))  # fan_in = in_channel * kernel_H * kernel_W
conv_w2.requires_grad = True
conv_b2 = torch.zeros((16,), device=device, dtype=dtype, requires_grad=True)  # out_channel
# you must calculate the shape of the tensor after two conv layers, before the fully-connected layer
fc_w = torch.randn((16 * 32 * 32, 10), device=device, dtype=dtype) * np.sqrt(2. / (16 * 32 * 32))
fc_w.requires_grad = True
fc_b = torch.zeros(10, device=device, dtype=dtype, requires_grad=True)
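The scaling above can be sanity-checked directly: multiplying standard-normal weights by sqrt(2 / fan_in) should give a tensor whose empirical standard deviation is close to that factor (shown here for the first conv layer's fan_in):

```python
import numpy as np
import torch

# Sanity check: after scaling standard-normal weights by sqrt(2 / fan_in),
# their empirical standard deviation should be close to sqrt(2 / fan_in).
fan_in = 3 * 5 * 5
scale = np.sqrt(2. / fan_in)
w = torch.randn(32, 3, 5, 5) * scale
print(w.std().item(), scale)  # the two values should be close
```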
Results
Conclusion
As the comparison shows, after the weights are normalized the model converges much faster.
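If you would rather not compute the scaling factor by hand, PyTorch's built-in nn.init.kaiming_normal_ performs the equivalent He initialization; a minimal sketch:

```python
import torch
import torch.nn as nn

# kaiming_normal_ with mode='fan_in' and nonlinearity='relu' applies the
# same sqrt(2 / fan_in) scaling as the manual factor used above.
conv_w1 = torch.empty(32, 3, 5, 5)
nn.init.kaiming_normal_(conv_w1, mode='fan_in', nonlinearity='relu')
conv_w1.requires_grad_(True)  # remember to enable gradient tracking
```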