PyTorch (3): Common Network Layers

References – Chinese edition of the official PyTorch tutorial: http://pytorch123.com/FirstSection/PyTorchIntro/
PyTorch Chinese documentation: https://pytorch-cn.readthedocs.io/zh/latest/package_references/Tensor/
PyTorch English documentation: https://pytorch.org/docs/stable/tensors.html
The book 《深度學習之PyTorch物體檢測實戰》
This is my first time working with PyTorch, and up-to-date tutorials are hard to find online, so let's start from the official material!


Default imports:

import os
import json
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
import torchvision
from torchvision import models
from torch.utils.data import Dataset
from torchvision import transforms
from torch.utils.data import DataLoader
import visdom
# from tensorboardX import SummaryWriter
from torch.utils.tensorboard import SummaryWriter

Fully Connected Layer

nn.Linear(in_features, out_features, bias=True)
>>> linear = nn.Linear(784, 10)
>>> input = torch.randn(4, 784)
>>> output = linear(input)
>>> output.shape
torch.Size([4, 10])

Convolutional Layer

nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, 
		dilation=1, groups=1, bias=True, padding_mode='zeros')
  • dilation: dilated (atrous) convolution; with a value greater than 1 it enlarges the receptive field while (with matching padding) keeping the feature-map size
  • groups: grouped convolution; instead of connecting every output channel to every input channel, the input channels are split into several groups, and this sparse connectivity reduces the amount of computation (both parameters are illustrated in the sketch right after this list)
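A minimal sketch (sizes chosen arbitrarily) of both parameters: with a 3×3 kernel, dilation=2 covers a 5×5 region, so padding=2 keeps the feature-map size; with groups=4, each kernel only sees in_channels / groups input channels:

>>> dilated = nn.Conv2d(8, 16, kernel_size=3, padding=2, dilation=2)
>>> dilated(torch.randn(1, 8, 16, 16)).shape
torch.Size([1, 16, 16, 16])
>>> grouped = nn.Conv2d(8, 16, kernel_size=3, padding=1, groups=4)
>>> grouped.weight.shape        # (out_channels, in_channels // groups, kH, kW)
torch.Size([16, 2, 3, 3])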

The kernel weights and bias can be inspected via .weight and .bias:

>>> conv = nn.Conv2d(1, 1, 3, 1, 1)
>>> conv.weight.shape
torch.Size([1, 1, 3, 3])
>>> conv.bias.shape
torch.Size([1])

The input feature map must be given in the form (N, C, H, W):

>>> input = torch.randn(1, 1, 5, 5)
>>> output = conv(input)
>>> output.shape
torch.Size([1, 1, 5, 5])

Pooling Layers

Max Pooling

nn.MaxPool2d(kernel_size, stride=None, padding=0, 
			dilation=1, return_indices=False, ceil_mode=False)
  • return_indices – if True, will return the max indices along with the outputs.
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • stride – note that stride defaults to kernel_size, not 1
>>> max_pooling = nn.MaxPool2d(2, stride=2)
>>> input = torch.randn(1, 1, 4, 4)
>>> max_pooling(input)
tensor([[[[0.9636, 0.7075],
          [1.0641, 1.1749]]]])
>>> max_pooling(input).shape
torch.Size([1, 1, 2, 2])

Average Pooling

nn.AvgPool2d(kernel_size, stride=None, padding=0, 
			ceil_mode=False, count_include_pad=True, divisor_override=None)

If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.

  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • count_include_pad – when True, will include the zero-padding in the averaging calculation
  • divisor_override – if specified, it will be used as divisor, otherwise kernel_size will be used

The parameters kernel_size, stride, padding can either be:

  • a single int – in which case the same value is used for the height and width dimension
  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension
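A small example using an int kernel and a (height, width) tuple kernel (sizes are arbitrary):

>>> avg_pooling = nn.AvgPool2d(2)              # stride defaults to kernel_size
>>> avg_pooling(torch.randn(1, 1, 4, 4)).shape
torch.Size([1, 1, 2, 2])
>>> avg_pooling_hw = nn.AvgPool2d((2, 4))      # kernel height 2, width 4
>>> avg_pooling_hw(torch.randn(1, 1, 4, 8)).shape
torch.Size([1, 1, 2, 2])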

Global Average Pooling

nn.Sequential(
            nn.AdaptiveAvgPool2d((1,1)),
            nn.Flatten()
            )
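A usage sketch: adaptive average pooling reduces every channel to 1×1, and nn.Flatten turns (N, C, 1, 1) into (N, C):

>>> gap = nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())
>>> gap(torch.randn(4, 512, 7, 7)).shape
torch.Size([4, 512])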

Activation Layers

Of course, the layers below can also be replaced by the corresponding functions in torch.nn.functional.
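For example, the module form and the functional form give identical results:

>>> x = torch.randn(2, 3)
>>> torch.allclose(nn.ReLU()(x), F.relu(x))
True
>>> torch.allclose(nn.Softmax(dim=1)(x), F.softmax(x, dim=1))
True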

Sigmoid

nn.Sigmoid()
>>> sigmoid = nn.Sigmoid()
>>> sigmoid(torch.Tensor([1, 1, 2, 2]))
tensor([0.7311, 0.7311, 0.8808, 0.8808])

ReLU

nn.ReLU(inplace=False)
>>> relu = nn.ReLU(inplace=True)
>>> input = torch.randn(2, 2)
>>> input
tensor([[-0.4853,  2.3864],
        [ 0.7122, -0.6493]])
>>> relu(input)
tensor([[0.0000, 2.3864],
        [0.7122, 0.0000]])
>>> input
tensor([[0.0000, 2.3864],
        [0.7122, 0.0000]])

Softmax

nn.Softmax(dim=None)
>>> softmax = nn.Softmax(dim=1)
>>> score = torch.randn(1, 4)
>>> score
tensor([[ 0.3101,  3.5648,  1.0988, -1.5856]])
>>> softmax(score)
tensor([[0.0342, 0.8855, 0.0752, 0.0051]])

LogSoftmax

nn.LogSoftmax(dim=None)

Followed by an nn.NLLLoss layer, this is equivalent to CrossEntropyLoss.
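A quick check of the equivalence:

>>> score = torch.randn(3, 5)
>>> target = torch.tensor([1, 0, 4])
>>> log_softmax = nn.LogSoftmax(dim=1)
>>> torch.allclose(nn.NLLLoss()(log_softmax(score), target),
...                nn.CrossEntropyLoss()(score, target))
True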

Dropout

nn.Dropout(p=0.5, inplace=False)
>>> dropout = nn.Dropout(0.5, inplace=False)
>>> input = torch.randn(1, 20)
>>> output = dropout(input)
>>> output
tensor([[-2.9413,  0.0000,  1.8461,  1.9605,  0.2774, -0.0000, -2.5381, -2.0313,
         -0.1914,  0.0000,  0.5346, -0.0000,  0.0000,  4.4960, -3.8345, -1.0938,
          4.3297,  2.1258, -4.1431,  0.0000]])
>>> input
tensor([[-1.4707,  0.5105,  0.9231,  0.9802,  0.1387, -0.4195, -1.2690, -1.0156,
         -0.0957,  0.8108,  0.2673, -2.0898,  0.6666,  2.2480, -1.9173, -0.5469,
          2.1648,  1.0629, -2.0716,  0.9974]])
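Note that the surviving entries are scaled by 1/(1 − p) = 2 during training; in evaluation mode, Dropout becomes the identity:

>>> dropout = dropout.eval()   # evaluation mode: no units are dropped
>>> torch.equal(dropout(input), input)
True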

BatchNorm (BN) Layer

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, 
					affine=True, track_running_stats=True)
  • num_features – C from an expected input of size (N, C, H, W)
  • eps – a value added to the denominator for numerical stability. Default: 1e-5
  • momentum – the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1
  • affine – a boolean value that when set to True, this module has learnable affine parameters. Default: True
  • track_running_stats – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default: True

Because the Batch Normalization is done over the C dimension, computing statistics on (N, H, W) slices, it's common terminology to call this Spatial Batch Normalization.

The mean and standard-deviation are calculated per-dimension over the mini-batches, and γ and β are learnable parameter vectors of size C (where C is the input size). By default, the elements of γ are set to 1 and the elements of β are set to 0.

>>> bn = nn.BatchNorm2d(64)
>>> input = torch.randn(4, 64, 28, 28)
>>> output = bn(input)
>>> output.shape
torch.Size([4, 64, 28, 28])
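The learnable γ (bn.weight) and β (bn.bias) from the paragraph above, as well as the running statistics, all have size C = 64:

>>> bn.weight.shape, bn.bias.shape
(torch.Size([64]), torch.Size([64]))
>>> bn.running_mean.shape, bn.running_var.shape
(torch.Size([64]), torch.Size([64]))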

Loss Layers

NLLLoss

nn.NLLLoss(weight=None, size_average=None, 
			ignore_index=-100, reduce=None, reduction='mean')

The input given through a forward call is expected to contain log-probabilities of each class. input has to be a Tensor of size either (minibatch, C) or (minibatch, C, d_1, d_2, ..., d_K) with K ≥ 1 for the K-dimensional case (described later).

It is useful to train a classification problem with C (C = number of classes) classes.

The target that this loss expects should be a class index in the range [0, C-1] where C = number of classes; if ignore_index is specified, this loss also accepts this class index (this index may not necessarily be in the class range).

The unreduced (i.e. with reduction set to 'none') loss can be described as:

l(x, y) = L = \{l_1, \dots, l_N\}^T, \qquad l_n = -\,\text{weight}[y_n] \cdot x_{n, y_n} \cdot \mathbb{1}\{y_n \neq \text{ignore\_index}\}

where x is the input, y is the target, weight[y_n] is the weight of the correct class when computing the loss, and x_{n, y_n} is the log-probability (log score) of the correct class for the n-th sample.

If reduction is 'mean' (the default), then
l(x, y) = \sum_{n=1}^{N} \frac{l_n}{\sum_{n=1}^{N} \text{weight}[y_n]}

If reduction is 'sum', then
l(x, y) = \sum_{n=1}^{N} l_n

(both cases are checked numerically after the code examples below)

Can also be used for higher dimension inputs, such as 2D images, by providing an input of size (minibatch, C, d_1, d_2, ..., d_K) with K ≥ 1, where K is the number of dimensions, and a target of appropriate shape (see below). In the case of images, it computes NLL loss per-pixel.

  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, it has to be a Tensor of size C. Otherwise, it is treated as if having all ones. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set. In other words, it is the per-class weight applied when computing the loss.
  • size_average (bool, optional) – Deprecated
  • ignore_index (int, optional) – Specifies a target value that is ignored and does not contribute to the input gradient.
  • reduce (bool, optional) – Deprecated
  • reduction (string, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'

Shape:

  • Input: (N, C) where C = number of classes, or (N, C, d_1, d_2, ..., d_K) with K ≥ 1 in the case of K-dimensional loss.
  • Target: (N), where each value satisfies 0 ≤ targets[i] ≤ C-1, or (N, d_1, d_2, ..., d_K) with K ≥ 1 in the case of K-dimensional loss.
  • Output: scalar. If reduction is 'none', then the same size as the target: (N), or (N, d_1, d_2, ..., d_K) with K ≥ 1 in the case of K-dimensional loss.
m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()

input = torch.randn(3, 5, requires_grad=True)
target = torch.tensor([1, 0, 4])

output = loss(m(input), target)
N, C = 5, 4
loss = nn.NLLLoss()

# input is of size N x C x height x width
data = torch.randn(N, C, 8, 8)
m = nn.LogSoftmax(dim=1)

# each element in target has to have 0 <= value < C
target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C)
output = loss(m(data), target)
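A short sketch (values are random) illustrating the formulas above: with reduction='none' the per-sample losses l_n = -weight[y_n] · x_{n, y_n} are returned, and 'mean' divides their sum by the sum of the selected class weights:

x = nn.LogSoftmax(dim=1)(torch.randn(3, 5))      # log-probabilities
y = torch.tensor([1, 0, 4])
w = torch.tensor([1.0, 2.0, 1.0, 1.0, 0.5])      # per-class weights

per_sample = nn.NLLLoss(weight=w, reduction='none')(x, y)
assert torch.allclose(per_sample, -w[y] * x[torch.arange(3), y])

mean_loss = nn.NLLLoss(weight=w, reduction='mean')(x, y)
assert torch.allclose(mean_loss, per_sample.sum() / w[y].sum())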

CrossEntropyLoss

nn.CrossEntropyLoss(weight=None, size_average=None, 
					ignore_index=-100, reduce=None, reduction='mean')

This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.

In other words, it is essentially Softmax followed by a cross-entropy loss. I have not read the source code yet, but presumably the two are combined because, during backpropagation, the gradient of the combination takes the neat form y − t (softmax output minus target); this is checked numerically right after the example below.

The parameters have the same meaning as in nn.NLLLoss above, so they are not repeated here.

loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)

output = loss(input, target)
output.backward()
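A quick numerical check (just a sketch) of the y − t claim: with reduction='mean' the gradient w.r.t. the scores is (softmax(input) − one_hot(target)) / N, here with N = 3:

prob = F.softmax(input, dim=1).detach()
one_hot = F.one_hot(target, num_classes=5).float()
assert torch.allclose(input.grad, (prob - one_hot) / 3)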

Optimizers

SGD (including Momentum and Nesterov Momentum)

optim.SGD(params, lr=<required parameter>, momentum=0, 
			dampening=0, weight_decay=0, nesterov=False)
  • dampening (float, optional) – dampening for momentum (default: 0)
    Question: what exactly does dampening do? To be answered when reading the source code.
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Clear the gradients before every optimization step
optimizer.zero_grad()
loss.backward()
optimizer.step()

Adagrad

optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, 
			initial_accumulator_value=0, eps=1e-10)
  • lr (float, optional) – learning rate (default: 1e-2)
  • lr_decay (float, optional) – learning rate decay (default: 0)

RMSProp

optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, 
			weight_decay=0, momentum=0, centered=False)
  • alpha (float, optional) – smoothing constant (default: 0.99)
  • momentum (float, optional) – momentum factor (default: 0)
  • centered (bool, optional) – if True, compute the centered RMSProp, the gradient is normalized by an estimation of its variance

This alpha should be the decay factor that forgets past squared gradients in RMSProp; what then is the momentum parameter here? Again, this can only be answered after reading the source code.

Adadelta

optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
  • lr (float, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0). According to the original Adadelta formulation no learning rate should be needed, yet an lr parameter is present here; this also needs a look at the source code to answer.
  • rho (float, optional) – coefficient used for computing a running average of squared gradients (default: 0.9)

Adam

optim.Adam(params, lr=0.001, betas=(0.9, 0.999), 
			eps=1e-08, weight_decay=0, amsgrad=False)
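Usage is the same as for SGD; a minimal training-step sketch (net, criterion, data and labels are assumed to be defined elsewhere):

optimizer = optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)

optimizer.zero_grad()                      # clear gradients before each step
loss = criterion(net(data), labels)
loss.backward()
optimizer.step()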