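The main purpose of this post is a brief review of some gradient-based approximation concepts and optimization methods. See the referenced links for details; only the key points are listed here, mainly so that I can quickly catch up after forgetting them…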

Jacobian Matrix and Hessian Matrix

References:

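Jacobian matrix: the matrix of first-order derivatives of $f:R^n\rightarrow R^m$, with $J_{ij}=\partial f_i/\partial x_j$. Just think of it as the first-order gradient. For example, in Automatic Differentiation, the Chain Rule lets the whole differentiation be written as a product of a sequence of Jacobian matrices.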
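Hessian matrix: the matrix of second-order derivatives of $f:R^n\rightarrow R$, with $H_{ij}=\frac{\partial^2 f}{\partial x_i\partial x_j}$. Also, the Jacobian matrix of this function is $[\partial f/\partial x_1, \partial f/\partial x_2,...]$. Taking the Jacobian matrix of that Jacobian (transposed and treated as a vector-valued function) gives the Hessian matrix.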
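The relation "Hessian = Jacobian of the gradient" can be checked numerically. The snippet below is only a minimal sketch: the test function `f`, its hand-written gradient `grad_f`, and the helper `jacobian` (which uses central finite differences with an assumed step `h`) are all illustrative choices, not something taken from the referenced material.

```python
def jacobian(func, x, h=1e-5):
    """Approximate J[i][j] = d func_i / d x_j with central differences.

    `func` maps a list of n floats to a list of m floats.
    The step size h is an assumption chosen for illustration.
    """
    m = len(func(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J


def f(x):
    """Illustrative scalar function f: R^2 -> R, f = x1^2 * x2 + x2^3."""
    return x[0] ** 2 * x[1] + x[1] ** 3


def grad_f(x):
    """Analytic gradient of f, viewed as a vector function R^2 -> R^2."""
    return [2 * x[0] * x[1], x[0] ** 2 + 3 * x[1] ** 2]


# The Jacobian of the gradient is the Hessian of f.
hessian = jacobian(grad_f, [1.0, 2.0])
```

At the point (1, 2) the analytic Hessian is $[[2x_2, 2x_1],[2x_1, 6x_2]]=[[4,2],[2,12]]$, and `hessian` should agree with it up to floating-point error.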

Automatic Differentiation

References:

Symbolic differentiation can lead to inefficient code and faces the difficulty of converting a computer program into a single expression. It first derives a symbolic formula for $\partial y/\partial x$ and then substitutes x to evaluate it, which is rather complicated.
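Numerical differentiation can introduce round-off errors in the discretization process and cancellation. It takes points on both sides of the target point's neighborhood and approximates the gradient numerically by $\frac{\Delta y}{\Delta x}$. It easily runs into numerical problems.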
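The numerical problems mentioned above are easy to reproduce. The following is a minimal sketch, assuming a central difference on `math.sin` at x = 1 with three hand-picked step sizes; the function name `central_difference` and the step values are illustrative only.

```python
import math


def central_difference(f, x, h):
    """Approximate f'(x) from two points on either side of x."""
    return (f(x + h) - f(x - h)) / (2 * h)


exact = math.cos(1.0)  # true derivative of sin at x = 1

# Large h: truncation error dominates.
# Tiny h: cancellation and round-off dominate.
errors = {
    h: abs(central_difference(math.sin, 1.0, h) - exact)
    for h in (1e-1, 1e-5, 1e-15)
}

for h, err in errors.items():
    print(f"h = {h:g}   abs error = {err:.3e}")
```

The moderate step gives by far the smallest error, while both the large and the tiny step are noticeably worse, which is the trade-off that makes choosing $\Delta x$ delicate.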
Automatic differentiation exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. In short, automatic differentiation is mainly an application of the Chain Rule.
Consider $f:R^n\rightarrow R^m$. What we ultimately want is an $m\times n$ Jacobian matrix.
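Forward accumulation: $\frac{\partial y}{\partial x_i} =\frac{\partial y}{\partial w_0}\frac{\partial w_0}{\partial x_i}=\frac{\partial y}{\partial w_1}(\frac{\partial w_1}{\partial w_0}\frac{\partial w_0}{\partial x_i})=...$ That is, the chain is expanded along the direction of the arrows of f, from x towards y. Note the position of the parentheses: each pair of parentheses corresponds to a node in the computational graph. Each pass is carried out with respect to one particular $x_i$, so n passes are needed to obtain the full Jacobian matrix. When n<<m, forward mode needs fewer passes in theory.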
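Reverse accumulation: the opposite of forward. $\frac{\partial y}{\partial x_i} =\frac{\partial y}{\partial w_5}\frac{\partial w_5}{\partial x_i}=(\frac{\partial y}{\partial w_5}\frac{\partial w_5}{\partial w_4})\frac{\partial w_4}{\partial x_i}=...$ The chain is expanded against the direction of the arrows of f, from y towards x. Note the position of the parentheses: each pair of parentheses corresponds to a node in the computational graph. Each pass is carried out with respect to one particular output $y_j$, so m passes are needed to obtain the full Jacobian matrix. When n>>m, reverse mode needs fewer passes in theory. Note, however, that because the chain is resolved top-down recursively, the intermediate results have to be stored, so memory usage tends to be fairly large.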
Forward and Reverse are only the two extremes. Finding how to obtain the Jacobian matrix with the fewest steps is an NP-hard problem.
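The difference between the two accumulation orders can be sketched by treating the computation as a chain of local Jacobian matrices. This is only an illustration under simplifying assumptions: the local Jacobians `J1` and `J2` hold made-up numbers that are taken to be already evaluated at some point, and the helper names (`matvec`, `vecmat`, `jacobian_forward`, `jacobian_reverse`) are hypothetical.

```python
def matvec(J, v):
    """Jacobian-vector product J @ v (pushes a tangent from x towards y)."""
    return [sum(a * b for a, b in zip(row, v)) for row in J]


def vecmat(u, J):
    """Vector-Jacobian product u @ J (pulls a cotangent from y towards x)."""
    return [sum(u[i] * J[i][j] for i in range(len(u))) for j in range(len(J[0]))]


def jacobian_forward(local_jacobians, n):
    """Forward accumulation: one sweep per input x_i, i.e. n sweeps."""
    columns = []
    for i in range(n):
        tangent = [1.0 if k == i else 0.0 for k in range(n)]  # seed for x_i
        for J in local_jacobians:  # innermost factor first
            tangent = matvec(J, tangent)
        columns.append(tangent)  # column i of the full Jacobian
    return [list(row) for row in zip(*columns)]  # transpose to m x n


def jacobian_reverse(local_jacobians, m):
    """Reverse accumulation: one sweep per output y_j, i.e. m sweeps."""
    rows = []
    for j in range(m):
        cotangent = [1.0 if k == j else 0.0 for k in range(m)]  # seed for y_j
        for J in reversed(local_jacobians):  # outermost factor first
            cotangent = vecmat(cotangent, J)
        rows.append(cotangent)  # row j of the full Jacobian
    return rows


# Assumed chain x (R^3) -> w1 (R^2) -> y (R^1), so n = 3 and m = 1.
J1 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # d w1 / d x
J2 = [[1.0, -1.0]]                       # d y / d w1

jac_fwd = jacobian_forward([J1, J2], n=3)  # needs 3 sweeps
jac_rev = jacobian_reverse([J1, J2], m=1)  # needs 1 sweep
```

Both calls produce the same $1\times 3$ Jacobian, but with n = 3 and m = 1 the reverse version gets there in a single sweep. A real reverse-mode implementation would also have to store the intermediate values needed to evaluate each local Jacobian, which is where the extra memory goes.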
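Dual Numbers: this is one way of implementing AD. Every scalar is written in a form similar to a complex number: $x\rightarrow(a+b\epsilon )$, where $\epsilon$ is understood as an infinitesimal approaching zero, with $\epsilon ^2=0$. A Taylor expansion, in which every term of order $\epsilon ^2$ and higher vanishes, then gives $f(x+\epsilon )=f(x)+f'(x)\epsilon$. In other words, once x is represented as a dual number, we just compute y as usual, and the second component of the dual-number representation of y is the gradient. This is rather neat, so here is an example. For $f(x)=x^2$: $f(x+\epsilon )=(x+\epsilon )^2=x^2+2x\epsilon +\epsilon ^2=x^2+(2x)\epsilon$, so 2x is exactly f'(x)! I think this matches the definition of the gradient perfectly: the amount by which y changes when x changes by a tiny bit. The $\epsilon$ here can be understood as …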
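The dual-number idea fits in a few lines of code. What follows is a minimal sketch supporting only addition and multiplication; the class name `Dual`, its fields `real` and `eps`, and the helper `derivative` are names chosen here for illustration, not an existing library API.

```python
class Dual:
    """A number a + b*eps with eps**2 == 0."""

    def __init__(self, real, eps=0.0):
        self.real = real  # the value a
        self.eps = eps    # the coefficient b of eps

    @staticmethod
    def _lift(value):
        """Treat plain numbers as duals with a zero eps part."""
        return value if isinstance(value, Dual) else Dual(value)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.real + other.real, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, because eps^2 = 0
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f at x + 1*eps and read off the eps coefficient."""
    return f(Dual(x, 1.0)).eps


square_slope = derivative(lambda x: x * x, 3.0)                  # f(x) = x^2
poly_slope = derivative(lambda x: 3 * x * x + 2 * x + 1, 2.0)    # f(x) = 3x^2 + 2x + 1
```

Seeding the eps part with 1 corresponds to differentiating with respect to that one input, so this is forward accumulation: for a function of n inputs, one evaluation per input would be needed.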