PyTorch (3): Common Network Layers

References – Chinese edition of the official PyTorch tutorial: http://pytorch123.com/FirstSection/PyTorchIntro/
PyTorch Chinese documentation: https://pytorch-cn.readthedocs.io/zh/latest/package_references/Tensor/
PyTorch English documentation: https://pytorch.org/docs/stable/tensors.html
The book 《深度學習之PyTorch物體檢測實戰》
This is my first time working with PyTorch, and up-to-date tutorials are hard to find online, so let's start from the official material!


Default imports:

import os
import json
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
import torchvision
from torchvision import models
from torch.utils.data import Dataset
from torchvision import transforms
from torch.utils.data import DataLoader
import visdom
# from tensorboardX import SummaryWriter
from torch.utils.tensorboard import SummaryWriter

Fully Connected Layer

nn.Linear(in_features, out_features, bias=True)
>>> linear = nn.Linear(784, 10)
>>> input = torch.randn(4, 784)
>>> output = linear(input)
>>> output.shape
torch.Size([4, 10])

Convolutional Layer

nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, 
		dilation=1, groups=1, bias=True, padding_mode='zeros')
  • dilation: dilated (atrous) convolution; with a value greater than 1 it enlarges the receptive field while (with matching padding) keeping the feature-map size
  • groups: grouped convolution; instead of connecting every output channel to every input channel, the input channels are split into several groups, and this sparse connectivity reduces the amount of computation (both parameters are illustrated in the sketch right after this list)
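A minimal sketch (sizes chosen arbitrarily) of both parameters: with a 3×3 kernel, dilation=2 covers a 5×5 region, so padding=2 keeps the feature-map size; with groups=4, each kernel only sees in_channels / groups input channels:

>>> dilated = nn.Conv2d(8, 16, kernel_size=3, padding=2, dilation=2)
>>> dilated(torch.randn(1, 8, 16, 16)).shape
torch.Size([1, 16, 16, 16])
>>> grouped = nn.Conv2d(8, 16, kernel_size=3, padding=1, groups=4)
>>> grouped.weight.shape        # (out_channels, in_channels // groups, kH, kW)
torch.Size([16, 2, 3, 3])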

The kernel weights and bias can be inspected via .weight and .bias:

>>> conv = nn.Conv2d(1, 1, 3, 1, 1)
>>> conv.weight.shape
torch.Size([1, 1, 3, 3])
>>> conv.bias.shape
torch.Size([1])

The input feature map must be given in the form (N, C, H, W):

>>> input = torch.randn(1, 1, 5, 5)
>>> output = conv(input)
>>> output.shape
torch.Size([1, 1, 5, 5])

Pooling Layers

Max Pooling

nn.MaxPool2d(kernel_size, stride=None, padding=0, 
			dilation=1, return_indices=False, ceil_mode=False)
  • return_indices – if True, will return the max indices along with the outputs.
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • stride – note that stride defaults to kernel_size, not 1
>>> max_pooling = nn.MaxPool2d(2, stride=2)
>>> input = torch.randn(1, 1, 4, 4)
>>> max_pooling(input)
tensor([[[[0.9636, 0.7075],
          [1.0641, 1.1749]]]])
>>> max_pooling(input).shape
torch.Size([1, 1, 2, 2])

Average Pooling

nn.AvgPool2d(kernel_size, stride=None, padding=0, 
			ceil_mode=False, count_include_pad=True, divisor_override=None)

If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.

  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • count_include_pad – when True, will include the zero-padding in the averaging calculation
  • divisor_override – if specified, it will be used as divisor, otherwise kernel_size will be used

The parameters kernel_size, stride, padding can either be:

  • a single int – in which case the same value is used for the height and width dimension
  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension
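A small example using an int kernel and a (height, width) tuple kernel (sizes are arbitrary):

>>> avg_pooling = nn.AvgPool2d(2)              # stride defaults to kernel_size
>>> avg_pooling(torch.randn(1, 1, 4, 4)).shape
torch.Size([1, 1, 2, 2])
>>> avg_pooling_hw = nn.AvgPool2d((2, 4))      # kernel height 2, width 4
>>> avg_pooling_hw(torch.randn(1, 1, 4, 8)).shape
torch.Size([1, 1, 2, 2])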

Global Average Pooling

nn.Sequential(
            nn.AdaptiveAvgPool2d((1,1)),
            nn.Flatten()
            )
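A usage sketch: adaptive average pooling reduces every channel to 1×1, and nn.Flatten turns (N, C, 1, 1) into (N, C):

>>> gap = nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())
>>> gap(torch.randn(4, 512, 7, 7)).shape
torch.Size([4, 512])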

Activation Layers

Of course, the layers below can also be replaced by the corresponding functions in torch.nn.functional.
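For example, the module form and the functional form give identical results:

>>> x = torch.randn(2, 3)
>>> torch.allclose(nn.ReLU()(x), F.relu(x))
True
>>> torch.allclose(nn.Softmax(dim=1)(x), F.softmax(x, dim=1))
True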

Sigmoid

nn.Sigmoid()
>>> sigmoid = nn.Sigmoid()
>>> sigmoid(torch.Tensor([1, 1, 2, 2]))
tensor([0.7311, 0.7311, 0.8808, 0.8808])

ReLU

nn.ReLU(inplace=False)
>>> relu = nn.ReLU(inplace=True)
>>> input = torch.randn(2, 2)
>>> input
tensor([[-0.4853,  2.3864],
        [ 0.7122, -0.6493]])
>>> relu(input)
tensor([[0.0000, 2.3864],
        [0.7122, 0.0000]])
>>> input
tensor([[0.0000, 2.3864],
        [0.7122, 0.0000]])

Softmax

nn.Softmax(dim=None)
>>> softmax = nn.Softmax(dim=1)
>>> score = torch.randn(1, 4)
>>> score
tensor([[ 0.3101,  3.5648,  1.0988, -1.5856]])
>>> softmax(score)
tensor([[0.0342, 0.8855, 0.0752, 0.0051]])

LogSoftmax

nn.LogSoftmax(dim=None)

Followed by an nn.NLLLoss layer, this is equivalent to CrossEntropyLoss.
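A quick check of the equivalence:

>>> score = torch.randn(3, 5)
>>> target = torch.tensor([1, 0, 4])
>>> log_softmax = nn.LogSoftmax(dim=1)
>>> torch.allclose(nn.NLLLoss()(log_softmax(score), target),
...                nn.CrossEntropyLoss()(score, target))
True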

Dropout

nn.Dropout(p=0.5, inplace=False)
>>> dropout = nn.Dropout(0.5, inplace=False)
>>> input = torch.randn(1, 20)
>>> output = dropout(input)
>>> output
tensor([[-2.9413,  0.0000,  1.8461,  1.9605,  0.2774, -0.0000, -2.5381, -2.0313,
         -0.1914,  0.0000,  0.5346, -0.0000,  0.0000,  4.4960, -3.8345, -1.0938,
          4.3297,  2.1258, -4.1431,  0.0000]])
>>> input
tensor([[-1.4707,  0.5105,  0.9231,  0.9802,  0.1387, -0.4195, -1.2690, -1.0156,
         -0.0957,  0.8108,  0.2673, -2.0898,  0.6666,  2.2480, -1.9173, -0.5469,
          2.1648,  1.0629, -2.0716,  0.9974]])
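Note that the surviving entries are scaled by 1/(1 − p) = 2 during training; in evaluation mode, Dropout becomes the identity:

>>> dropout = dropout.eval()   # evaluation mode: no units are dropped
>>> torch.equal(dropout(input), input)
True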

BatchNorm (BN) Layer

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, 
					affine=True, track_running_stats=True)
  • num_features – C from an expected input of size (N, C, H, W)
  • eps – a value added to the denominator for numerical stability. Default: 1e-5
  • momentum – the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1
  • affine – a boolean value that when set to True, this module has learnable affine parameters. Default: True
  • track_running_stats – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default: True

Because the Batch Normalization is done over the C dimension, computing statistics on (N, H, W) slices, it's common terminology to call this Spatial Batch Normalization.

The mean and standard-deviation are calculated per-dimension over the mini-batches, and γ and β are learnable parameter vectors of size C (where C is the input size). By default, the elements of γ are set to 1 and the elements of β are set to 0.

>>> bn = nn.BatchNorm2d(64)
>>> input = torch.randn(4, 64, 28, 28)
>>> output = bn(input)
>>> output.shape
torch.Size([4, 64, 28, 28])
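The learnable γ (bn.weight) and β (bn.bias) from the paragraph above, as well as the running statistics, all have size C = 64:

>>> bn.weight.shape, bn.bias.shape
(torch.Size([64]), torch.Size([64]))
>>> bn.running_mean.shape, bn.running_var.shape
(torch.Size([64]), torch.Size([64]))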

Loss Layers

NLLLoss

nn.NLLLoss(weight=None, size_average=None, 
			ignore_index=-100, reduce=None, reduction='mean')

The input given through a forward call is expected to contain log-probabilities of each class. input has to be a Tensor of size either (minibatch, C) or (minibatch, C, d_1, d_2, ..., d_K) with K ≥ 1 for the K-dimensional case (described later).

It is useful to train a classification problem with C (C = number of classes) classes.

The target that this loss expects should be a class index in the range [0, C-1] where C = number of classes; if ignore_index is specified, this loss also accepts this class index (this index may not necessarily be in the class range).

The unreduced (i.e. with reduction set to 'none') loss can be described as:

l(x, y) = L = \{l_1, \dots, l_N\}^T, \qquad l_n = -\,\text{weight}[y_n] \cdot x_{n, y_n} \cdot \mathbb{1}\{y_n \neq \text{ignore\_index}\}

where x is the input, y is the target, weight[y_n] is the weight of the correct class when computing the loss, and x_{n, y_n} is the log-probability (log score) of the correct class for the n-th sample.

If reduction is 'mean' (the default), then
l(x, y) = \sum_{n=1}^{N} \frac{l_n}{\sum_{n=1}^{N} \text{weight}[y_n]}

If reduction is 'sum', then
l(x, y) = \sum_{n=1}^{N} l_n

(both cases are checked numerically after the code examples below)

Can also be used for higher dimension inputs, such as 2D images, by providing an input of size (minibatch, C, d_1, d_2, ..., d_K) with K ≥ 1, where K is the number of dimensions, and a target of appropriate shape (see below). In the case of images, it computes NLL loss per-pixel.

  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, it has to be a Tensor of size C. Otherwise, it is treated as if having all ones. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set. In other words, it is the per-class weight applied when computing the loss.
  • size_average (bool, optional) – Deprecated
  • ignore_index (int, optional) – Specifies a target value that is ignored and does not contribute to the input gradient.
  • reduce (bool, optional) – Deprecated
  • reduction (string, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'

Shape:

  • Input: (N, C) where C = number of classes, or (N, C, d_1, d_2, ..., d_K) with K ≥ 1 in the case of K-dimensional loss.
  • Target: (N), where each value satisfies 0 ≤ targets[i] ≤ C-1, or (N, d_1, d_2, ..., d_K) with K ≥ 1 in the case of K-dimensional loss.
  • Output: scalar. If reduction is 'none', then the same size as the target: (N), or (N, d_1, d_2, ..., d_K) with K ≥ 1 in the case of K-dimensional loss.
m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()

input = torch.randn(3, 5, requires_grad=True)
target = torch.tensor([1, 0, 4])

output = loss(m(input), target)
N, C = 5, 4
loss = nn.NLLLoss()

# input is of size N x C x height x width
data = torch.randn(N, C, 8, 8)
m = nn.LogSoftmax(dim=1)

# each element in target has to have 0 <= value < C
target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C)
output = loss(m(data), target)
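A short sketch (values are random) illustrating the formulas above: with reduction='none' the per-sample losses l_n = -weight[y_n] · x_{n, y_n} are returned, and 'mean' divides their sum by the sum of the selected class weights:

x = nn.LogSoftmax(dim=1)(torch.randn(3, 5))      # log-probabilities
y = torch.tensor([1, 0, 4])
w = torch.tensor([1.0, 2.0, 1.0, 1.0, 0.5])      # per-class weights

per_sample = nn.NLLLoss(weight=w, reduction='none')(x, y)
assert torch.allclose(per_sample, -w[y] * x[torch.arange(3), y])

mean_loss = nn.NLLLoss(weight=w, reduction='mean')(x, y)
assert torch.allclose(mean_loss, per_sample.sum() / w[y].sum())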

CrossEntropyLoss

nn.CrossEntropyLoss(weight=None, size_average=None, 
					ignore_index=-100, reduce=None, reduction='mean')

This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.

In other words, it is essentially Softmax followed by a cross-entropy loss. I have not read the source code yet, but presumably the two are combined because, during backpropagation, the gradient of the combination takes the neat form y − t (softmax output minus target); this is checked numerically right after the example below.

The parameters have the same meaning as in nn.NLLLoss above, so they are not repeated here.

loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)

output = loss(input, target)
output.backward()
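A quick numerical check (just a sketch) of the y − t claim: with reduction='mean' the gradient w.r.t. the scores is (softmax(input) − one_hot(target)) / N, here with N = 3:

prob = F.softmax(input, dim=1).detach()
one_hot = F.one_hot(target, num_classes=5).float()
assert torch.allclose(input.grad, (prob - one_hot) / 3)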

Optimizers

SGD (including Momentum and Nesterov Momentum)

optim.SGD(params, lr=<required parameter>, momentum=0, 
			dampening=0, weight_decay=0, nesterov=False)
  • dampening (float, optional) – dampening for momentum (default: 0)
    Question: what exactly does dampening do? To be answered when reading the source code.
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Clear the gradients before every optimization step
optimizer.zero_grad()
loss.backward()
optimizer.step()

Adagrad

optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, 
			initial_accumulator_value=0, eps=1e-10)
  • lr (float, optional) – learning rate (default: 1e-2)
  • lr_decay (float, optional) – learning rate decay (default: 0)

RMSProp

optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, 
			weight_decay=0, momentum=0, centered=False)
  • alpha (float, optional) – smoothing constant (default: 0.99)
  • momentum (float, optional) – momentum factor (default: 0)
  • centered (bool, optional) – if True, compute the centered RMSProp, the gradient is normalized by an estimation of its variance

This alpha should be the decay factor that forgets past squared gradients in RMSProp; what then is the momentum parameter here? Again, this can only be answered after reading the source code.

Adadelta

optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
  • lr (float, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0). According to the original Adadelta formulation no learning rate should be needed, yet an lr parameter is present here; this also needs a look at the source code to answer.
  • rho (float, optional) – coefficient used for computing a running average of squared gradients (default: 0.9)

Adam

optim.Adam(params, lr=0.001, betas=(0.9, 0.999), 
			eps=1e-08, weight_decay=0, amsgrad=False)
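Usage is the same as for SGD; a minimal training-step sketch (net, criterion, data and labels are assumed to be defined elsewhere):

optimizer = optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)

optimizer.zero_grad()                      # clear gradients before each step
loss = criterion(net(data), labels)
loss.backward()
optimizer.step()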