Notes on weight initialization for GPU training in PyTorch

Preface

How the weights are initialized determines whether a model can converge quickly, and therefore how much training time it needs.
Below, the weight initialization of two convolutional layers and one fully connected layer serves as the example. Both scripts are run for only one epoch, giving a controlled comparison.
Note that when training on the GPU, the weight tensors must be created with gradient tracking enabled (requires_grad=True); otherwise no gradients will be accumulated for them.
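The point about gradient tracking can be illustrated with a minimal sketch (the tensor name here is illustrative):

```python
import torch

# A leaf tensor only accumulates gradients if requires_grad is
# enabled before it participates in any computation.
w = torch.randn(3, 3)      # requires_grad defaults to False
w.requires_grad_()         # turn on gradient tracking in place
loss = (w ** 2).sum()
loss.backward()
print(w.grad is not None)  # → True: the gradient tensor was populated
```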

Result without normalizing the weights

Code

import torch

USE_GPU = True
dtype = torch.float32  # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# ------------------------- weights -------------------------
conv_w1 = torch.randn((32, 3, 5, 5), device=device, dtype=dtype)  # [out_channel, in_channel, kernel_H, kernel_W]
conv_w1.requires_grad = True
conv_b1 = torch.zeros((32,), device=device, dtype=dtype, requires_grad=True)  # out_channel

conv_w2 = torch.randn((16, 32, 3, 3), device=device, dtype=dtype)  # [out_channel, in_channel, kernel_H, kernel_W]
conv_w2.requires_grad = True
conv_b2 = torch.zeros((16,), device=device, dtype=dtype, requires_grad=True)  # out_channel

# you must calculate the shape of the tensor after two conv layers, before the fully-connected layer
fc_w = torch.randn((16 * 32 * 32, 10), device=device, dtype=dtype)
fc_w.requires_grad = True
fc_b = torch.zeros(10, device=device, dtype=dtype, requires_grad=True)

Result

[Figure: training output after one epoch, weights not normalized]

After normalizing the weights

Code

import numpy as np
import torch

USE_GPU = True
dtype = torch.float32  # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# ------------------------- weights -------------------------
# each weight tensor is scaled by sqrt(2 / fan_in), where fan_in is the
# number of inputs feeding a single output unit
conv_w1 = torch.randn((32, 3, 5, 5), device=device, dtype=dtype) * np.sqrt(2. / (3 * 5 * 5))  # [out_channel, in_channel, kernel_H, kernel_W]
conv_w1.requires_grad = True
conv_b1 = torch.zeros((32,), device=device, dtype=dtype, requires_grad=True)  # out_channel

conv_w2 = torch.randn((16, 32, 3, 3), device=device, dtype=dtype) * np.sqrt(2. / (32 * 3 * 3))  # fan_in = in_channel * kernel_H * kernel_W
conv_w2.requires_grad = True
conv_b2 = torch.zeros((16,), device=device, dtype=dtype, requires_grad=True)  # out_channel

# you must calculate the shape of the tensor after two conv layers, before the fully-connected layer
fc_w = torch.randn((16 * 32 * 32, 10), device=device, dtype=dtype) * np.sqrt(2. / (16 * 32 * 32))
fc_w.requires_grad = True
fc_b = torch.zeros(10, device=device, dtype=dtype, requires_grad=True)
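A quick way to see what the sqrt(2 / fan_in) factor does is to compare the output spread of the first conv layer with and without it. This sketch uses a random dummy batch and runs on CPU for simplicity:

```python
import numpy as np
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 3, 32, 32)  # dummy input batch

w_raw = torch.randn(32, 3, 5, 5)           # unscaled weights
w_he = w_raw * np.sqrt(2. / (3 * 5 * 5))   # scaled by sqrt(2 / fan_in)

out_raw = F.conv2d(x, w_raw, padding=2)
out_he = F.conv2d(x, w_he, padding=2)

# The unscaled output's std grows like sqrt(fan_in), while the
# scaled version stays near sqrt(2) — small enough not to blow up
# as activations pass through deeper layers.
print(out_raw.std().item(), out_he.std().item())
```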

Result

[Figure: training output after one epoch, weights normalized]

Conclusion

The comparison shows that after normalizing the weights in this way, the model converges much faster.
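For reference, the manual sqrt(2 / fan_in) scaling used above is He (Kaiming) initialization, which PyTorch also provides directly as torch.nn.init.kaiming_normal_. A minimal sketch of the equivalent setup (with a CPU fallback):

```python
import math
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

conv_w1 = torch.empty(32, 3, 5, 5, device=device)
nn.init.kaiming_normal_(conv_w1, mode='fan_in', nonlinearity='relu')
conv_w1.requires_grad_()  # enable gradient tracking after initialization

# fan_in = 3 * 5 * 5 = 75, so the sample std should sit near sqrt(2/75)
expected_std = math.sqrt(2. / 75)
print(conv_w1.std().item(), expected_std)
```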
