PyTorch - The Adam optimization algorithm: principles, formulas, and application

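　　Concept: Adam is a first-order optimization algorithm that can replace the classic stochastic gradient descent procedure. It iteratively updates neural network weights based on the training data. Adam was first proposed by Diederik Kingma of OpenAI and Jimmy Ba of the University of Toronto in a paper submitted to ICLR 2015 (Adam: A Method for Stochastic Optimization). The name "Adam" is neither an acronym nor a person's name; it derives from adaptive moment estimation.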

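　　Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term. It uses first-moment and second-moment estimates of the gradient to adjust the learning rate of each parameter dynamically. Its main advantage is that, after bias correction, the effective learning rate at every iteration stays within a definite range, which keeps the parameter updates fairly stable. The formulas are as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

　　Here $g_t$ is the gradient at step $t$, $\eta$ is the learning rate, and $\beta_1$, $\beta_2$, $\epsilon$ are the decay coefficients and the stability constant. The first two formulas are the first-moment and second-moment estimates of the gradient, which can be read as estimates of the expectations $E[g_t]$ and $E[g_t^2]$.
Formulas 3 and 4 correct these two estimates so that they become approximately unbiased estimates of the expectations; the correction matters most in early iterations, because $m_t$ and $v_t$ start at zero. The moment estimates are computed directly from the gradient, so they need no memory beyond two running averages per parameter, and they adapt as the gradient changes. In the last formula, the factor $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ acts as a dynamic constraint on the learning rate $\eta$: its magnitude is typically on the order of 1 or less, so the size of each step is roughly bounded by $\eta$.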
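　　The update rules described above can be sketched in plain Python for a single scalar parameter. This is only an illustrative sketch, not the PyTorch implementation: the function name `adam_minimize` and the quadratic objective are hypothetical choices made for this example.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run scalar Adam for `steps` iterations and return every visited point."""
    m, v = 0.0, 0.0  # first and second moment estimates, both start at zero
    history = [theta]
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # biased first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # biased second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(theta)
    return history


# Hypothetical objective: f(theta) = (theta - 3)^2, so f'(theta) = 2 * (theta - 3).
history = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
first_step = history[1] - history[0]
final_theta = history[-1]
print(first_step, final_theta)
```

　　On the very first iteration the bias-corrected ratio is g / (|g| + eps), so the parameter moves by almost exactly lr in the direction opposite to the gradient, whatever the gradient's magnitude. This is the "definite range" of the step size in action.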

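　　Advantages:

1. It combines the strength of Adagrad in handling sparse gradients with the strength of RMSprop in handling non-stationary objectives;
2. Its memory requirements are small;
3. It computes a separate adaptive learning rate for each parameter;
4. It also works for most non-convex optimization problems, and it suits large datasets and high-dimensional parameter spaces.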
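　　Point 3 can be made concrete with a small experiment. The sketch below uses a hypothetical toy loss whose two parameters have gradients six orders of magnitude apart, and compares one step of `torch.optim.SGD` with one step of `torch.optim.Adam`; the helper `one_step_delta` is introduced only for this illustration.

```python
import torch

# Gradient scale of each parameter (a made-up toy setup).
scales = torch.tensor([1e-3, 1e3])


def one_step_delta(make_optimizer):
    """Take one optimizer step from p = [1, 1] and return how far p moved."""
    p = torch.ones(2, requires_grad=True)
    optimizer = make_optimizer([p])
    loss = (scales * p).sum()  # d(loss)/dp equals `scales`
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return p.detach() - 1.0


sgd_delta = one_step_delta(lambda params: torch.optim.SGD(params, lr=0.01))
adam_delta = one_step_delta(lambda params: torch.optim.Adam(params, lr=0.01))
print("SGD :", sgd_delta)
print("Adam:", adam_delta)
```

　　With SGD the step is proportional to the raw gradient, so the two parameters move by wildly different amounts. With Adam each parameter is normalized by its own second-moment estimate, so both move by roughly lr on this first step.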

　　Application and source code:

　　Constructor signature:

class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

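　　Parameter meanings:

　　params (iterable): an iterable of parameters to optimize, or of dicts defining parameter groups.

　　lr (float, optional): learning rate (default: 1e-3).

　　betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999)).

　　eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8).

　　weight_decay (float, optional): weight decay (L2 penalty) (default: 0).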
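　　As a brief illustration of the `params` argument, the sketch below passes a list of dicts so that each parameter group can override the defaults. The tiny model and the particular values are assumptions chosen only for this example.

```python
import torch

# A small hypothetical model with two linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)

optimizer = torch.optim.Adam(
    [
        # First group: no overrides, so it uses the defaults given below.
        {"params": model[0].parameters()},
        # Second group: its own learning rate and weight decay.
        {"params": model[2].parameters(), "lr": 1e-4, "weight_decay": 1e-2},
    ],
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)

# Each group keeps its own copy of the hyperparameters.
group_settings = [(g["lr"], g["weight_decay"]) for g in optimizer.param_groups]
print(group_settings)
```

　　Keyword arguments passed to the constructor serve as defaults; any key present in a group's dict takes precedence for that group.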

　　torch.optim.adam source code (from an early PyTorch release):

import math

import torch
# Within the torch package itself this is a relative import (.optimizer)
from torch.optim import Optimizer


class Adam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super(Adam, self).__init__(params, defaults)

    def step(self, closure=None):
        # Optionally re-evaluate the model and return the loss
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(grad)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(grad)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                # L2 penalty: add weight_decay * p to the gradient
                if group['weight_decay'] != 0:
                    grad = grad.add(p.data, alpha=group['weight_decay'])

                # Update the running averages of the gradient and its square
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group['eps'])

                # Bias correction is folded into the step size
                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

                # p <- p - step_size * exp_avg / denom
                p.data.addcdiv_(exp_avg, denom, value=-step_size)

        return loss
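　　The listing above never computes the bias-corrected moments explicitly; it folds both corrections into `step_size`. The sketch below, using made-up values for the moments and the step count, compares that folded form with the textbook form that corrects each moment first. The two should differ only through where `eps` enters the denominator.

```python
import math

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m, v, t = 0.05, 0.004, 10  # hypothetical moment values at step 10

bias_correction1 = 1 - beta1 ** t
bias_correction2 = 1 - beta2 ** t

# Form used in the source listing: corrections folded into the step size.
step_size = lr * math.sqrt(bias_correction2) / bias_correction1
update_source = step_size * m / (math.sqrt(v) + eps)

# Textbook form: correct both moments first, then divide.
m_hat = m / bias_correction1
v_hat = v / bias_correction2
update_textbook = lr * m_hat / (math.sqrt(v_hat) + eps)

relative_gap = abs(update_source - update_textbook) / abs(update_textbook)
print(update_source, update_textbook, relative_gap)
```

　　Algebraically, the folded form equals lr * m_hat / (sqrt(v_hat) + eps / sqrt(bias_correction2)), so `eps` is effectively scaled up in early steps. For gradients that are not vanishingly small the difference is negligible.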

　　Usage example:

import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()
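　　After a few calls to `optimizer.step()`, the per-parameter state described in the source code can be inspected directly. The following is a minimal sketch on a small hypothetical linear model; it assumes the state keys `step`, `exp_avg`, and `exp_avg_sq` used in the source listing.

```python
import torch

# A tiny hypothetical model and random data, just to run a few steps.
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 3)
y = torch.randn(8, 1)
for _ in range(5):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

# The optimizer keeps one state dict per parameter tensor.
weight_state = optimizer.state[model.weight]
steps_taken = int(weight_state["step"])
print(sorted(weight_state.keys()), steps_taken)
print(weight_state["exp_avg"].shape, weight_state["exp_avg_sq"].shape)
```

　　The two moving averages have the same shape as the parameter they belong to, which is why Adam's extra memory cost is about twice the size of the model's parameters.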

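　　At this point, this should be enough to handle the vast majority of applications, so my goal here is basically accomplished. The next step is to deepen the understanding through practice.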

References:

1. https://blog.csdn.net/kgzhang/article/details/77479737

2. https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_optim.html
