-------------Deep Learning: Neural Networks-------------


The first part below is presented in English.


Neural Networks

Consider a supervised learning problem where we have access to labeled training examples (x(i),y(i)). Neural networks give a way of defining a complex, non-linear form of hypotheses hW,b(x), with parameters W,b that we can fit to our data.

To describe neural networks, we will begin by describing the simplest possible neural network, one which comprises a single "neuron." We will use the following diagram to denote a single neuron:

[Figure: SingleNeuron.png, a single neuron with inputs x_1, x_2, x_3 and a +1 bias unit]

This "neuron" is a computational unit that takes as input x1,x2,x3 (and a +1 intercept term), and outputs \textstyle h_{W,b}(x) = f(W^Tx) = f(\sum_{i=1}^3 W_{i}x_i +b), where f : \Re \mapsto \Re is called the activation function. In these notes, we will choose f(\cdot) to be the sigmoid function:

f(z) = \frac{1}{1+\exp(-z)}.

Thus, our single neuron corresponds exactly to the input-output mapping defined by logistic regression.

Although these notes will use the sigmoid function, it is worth noting that another common choice for f is the hyperbolic tangent, or tanh, function:

f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}},

Here are plots of the sigmoid and tanh functions:


[Figures: plots of the sigmoid activation function and the tanh activation function]

The tanh(z) function is a rescaled version of the sigmoid, and its output range is [-1, 1] instead of [0, 1].

Note that unlike some other venues (including the OpenClassroom videos, and parts of CS229), we are not using the convention here of x0 = 1. Instead, the intercept term is handled separately by the parameter b.

Finally, one identity that'll be useful later: If f(z) = 1/(1 + \exp(-z)) is the sigmoid function, then its derivative is given by f'(z) = f(z)(1 - f(z)). (If f is the tanh function, then its derivative is given by f'(z) = 1 - (f(z))^2.) You can derive this yourself using the definition of the sigmoid (or tanh) function.
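As a quick check, the sigmoid identity follows directly from the chain rule:

\begin{align}f'(z) &= \frac{d}{dz}\,(1+e^{-z})^{-1} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = f(z)\,\big(1 - f(z)\big)\end{align}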


Neural Network model

A neural network is put together by hooking together many of our simple "neurons," so that the output of a neuron can be the input of another. For example, here is a small neural network:

[Figure: Network331.png, a small neural network with 3 input units, 3 hidden units, and 1 output unit]

In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called bias units, and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer the output layer (which, in this example, has only one node). The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. We also say that our example neural network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.

We will let n_l denote the number of layers in our network; thus n_l = 3 in our example. We label layer l as L_l, so layer L_1 is the input layer, and layer L_{n_l} the output layer. Our neural network has parameters (W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}), where we write W^{(l)}_{ij} to denote the parameter (or weight) associated with the connection between unit j in layer l, and unit i in layer l+1. (Note the order of the indices.) Also, b^{(l)}_i is the bias associated with unit i in layer l+1. Thus, in our example, we have W^{(1)} \in \Re^{3\times 3}, and W^{(2)} \in \Re^{1\times 3}. Note that bias units don't have inputs or connections going into them, since they always output the value +1. We also let s_l denote the number of nodes in layer l (not counting the bias unit).

We will write a^{(l)}_i to denote the activation (meaning output value) of unit i in layer l. For l = 1, we also use a^{(1)}_i = x_i to denote the i-th input. Given a fixed setting of the parameters W, b, our neural network defines a hypothesis h_{W,b}(x) that outputs a real number. Specifically, the computation that this neural network represents is given by:

\begin{align}a_1^{(2)} &= f(W_{11}^{(1)}x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})  \\a_2^{(2)} &= f(W_{21}^{(1)}x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})  \\a_3^{(2)} &= f(W_{31}^{(1)}x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})  \\h_{W,b}(x) &= a_1^{(3)} =  f(W_{11}^{(2)}a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}) \end{align}

In the sequel, we also let z^{(l)}_i denote the total weighted sum of inputs to unit i in layer l, including the bias term (e.g., \textstyle z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i), so that a^{(l)}_i = f(z^{(l)}_i).

Note that this easily lends itself to a more compact notation. Specifically, if we extend the activation function f(\cdot) to apply to vectors in an element-wise fashion (i.e., f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]), then we can write the equations above more compactly as:

\begin{align}z^{(2)} &= W^{(1)} x + b^{(1)} \\a^{(2)} &= f(z^{(2)}) \\z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\h_{W,b}(x) &= a^{(3)} = f(z^{(3)})\end{align}

We call this step forward propagation. More generally, recalling that we also use a^{(1)} = x to denote the values from the input layer, then given layer l's activations a^{(l)}, we can compute layer l+1's activations a^{(l+1)} as:

\begin{align}z^{(l+1)} &= W^{(l)} a^{(l)} + b^{(l)}   \\a^{(l+1)} &= f(z^{(l+1)})\end{align}

By organizing our parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.
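For concreteness, here is a minimal MATLAB/Octave sketch of this vectorized forward pass for the 3-3-1 network above; the parameter values are placeholders, not anything from the text:

% parameters for layers 1->2 and 2->3 (placeholder values)
W1 = randn(3, 3);  b1 = randn(3, 1);
W2 = randn(1, 3);  b2 = randn(1, 1);
f  = @(z) 1 ./ (1 + exp(-z));        % sigmoid, applied element-wise

x  = [0.5; -1.2; 3.0];               % an example input
z2 = W1 * x  + b1;   a2 = f(z2);     % hidden-layer activations a^{(2)}
z3 = W2 * a2 + b2;   h  = f(z3);     % network output h_{W,b}(x)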


We have so far focused on one example neural network, but one can also build neural networks with other architectures (meaning patterns of connectivity between neurons), including ones with multiple hidden layers. The most common choice is a \textstyle n_l-layered network where layer \textstyle 1 is the input layer, layer \textstyle n_l is the output layer, and each layer \textstyle l is densely connected to layer \textstyle l+1. In this setting, to compute the output of the network, we can successively compute all the activations in layer \textstyle L_2, then layer \textstyle L_3, and so on, up to layer \textstyle L_{n_l}, using the equations above that describe the forward propagation step. This is one example of a feedforward neural network, since the connectivity graph does not have any directed loops or cycles.


Neural networks can also have multiple output units. For example, here is a network with two hidden layers L_2 and L_3, and two output units in layer L_4:

[Figure: Network3322.png, a network with two hidden layers and two output units]

To train this network, we would need training examples (x^{(i)}, y^{(i)}) where y^{(i)} \in \Re^2. This sort of network is useful if there are multiple outputs that you're interested in predicting. (For example, in a medical diagnosis application, the vector x might give the input features of a patient, and the different outputs y_i might indicate presence or absence of different diseases.)




This section covers: the representation and learning of neural networks

-----------------------The representation of neural networks-------------------------

(1) Why introduce neural networks? Nonlinear hypotheses


The ML problems we have discussed so far mainly analyzed regression, with gradient descent used to update the parameters. That works when there are not many parameters, but what if the number of parameters grows large? Take the example in the figure: choosing every product x_i x_j of the 100*100 pixels as a feature for logistic regression gives about 5*10^7 features in total, i.e. x has that many dimensions.

So we introduce nonlinear hypotheses, which can cope with high-dimensional data and with hypotheses that are nonlinear (as shown in the figure):





===============================

(2) Neurons and the brain

How a neuron works:


The logical units of a neural network: the input vector x (input layer), the intermediate units a^{(2)}_i (hidden layer), and the output h(x) (output layer).

In a^{(2)}_i, the superscript 2 indicates the second level (the first level being the input layer), and the subscript i indexes the unit within that layer. In other words, a^{(j)}_i is the activation of unit i in layer j.





===============================

(3) How a neural network is represented



From the figure, each hidden unit a^{(2)}_j is the sigmoid of a linear combination of the input layer, and the output is in turn the sigmoid of a linear combination of the hidden layer.

Next we vectorize the network's computation:

Let z^{(2)} denote the hidden layer's weighted inputs and x the input layer; then

z^{(2)} = \Theta^{(1)} x

a^{(2)} = g(z^{(2)})

Alternatively, we can write x as a^{(1)}. The input layer a^{(1)} then has the 4 elements [x_0 ... x_3], and the hidden layer a^{(2)} has the 4 elements [a^{(2)}_0 ... a^{(2)}_3] (with a^{(2)}_0 = 1), so that

h(x) = a^{(3)} = g(z^{(3)})

z^{(3)} = \Theta^{(2)} a^{(2)}

Computing h(x) by passing values through the neurons in this way (input -> activation -> output) is called forward propagation.

Notice that the network then looks just like logistic regression, except that the input vector [x_1 ... x_3] of logistic regression has been replaced by the hidden-layer values [a^{(2)}_1 ... a^{(2)}_3], i.e.

h(x) = g(\Theta^{(2)}_0 a^{(2)}_0 + \Theta^{(2)}_1 a^{(2)}_1 + \Theta^{(2)}_2 a^{(2)}_2 + \Theta^{(2)}_3 a^{(2)}_3)

The hidden layer itself is learned from the true inputs via \Theta^{(1)}. This frees up the input layer: the features fed to the network can be any linear or even polynomial combination of the original input data, e.g. taking x1*x2 as a new feature. How the hidden layer is used during learning is covered in more detail below. There are also other network models, for example:






===============================

(4) How can a neural network implement logical expressions?


In a neural network, the computation of a single layer of neurons (with no hidden layer) can represent logical operations, such as logical AND and logical OR.

An example: logical AND. In the figure, the left half shows the network design and the output-layer expression, the upper right shows the sigmoid function, and the lower half shows the truth table.


Given the network's weights, the truth table tells us which function the network computes. Here is another example, logical OR, shown in the figure below:


The two examples above involve only a single layer of computation. Next is a more complex example that implements the logical expression x1 XNOR x2 (logical equivalence), built by combining the previous examples:


Placing the AND and (NOT x1) AND (NOT x2) units between the input layer and the hidden layer, and the OR unit at the output, as in the figure, yields x1 XNOR x2, for the obvious reason:

a^{(2)}_1 = x1 && x2

a^{(2)}_2 = (¬x1) && (¬x2)

a^{(3)}_1 = a^{(2)}_1 || a^{(2)}_2 = (x1 && x2) || ((¬x1) && (¬x2)) = x1 XNOR x2
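The figures with the actual weight values are not reproduced in this post, so the MATLAB/Octave sketch below uses one common choice of weights (assumed here) that saturates the sigmoid to roughly 0 or 1, so each unit behaves like the corresponding gate:

g = @(z) 1 ./ (1 + exp(-z));                         % sigmoid

AND  = @(x1, x2) round(g(-30 + 20*x1 + 20*x2));      % ~1 only when x1 = x2 = 1
OR   = @(x1, x2) round(g(-10 + 20*x1 + 20*x2));      % ~1 when x1 = 1 or x2 = 1
NOR  = @(x1, x2) round(g( 10 - 20*x1 - 20*x2));      % (NOT x1) AND (NOT x2)
XNOR = @(x1, x2) OR(AND(x1, x2), NOR(x1, x2));       % hidden layer feeds an OR output unit

% truth-table check
for x1 = 0:1
  for x2 = 0:1
    fprintf('%d XNOR %d = %d\n', x1, x2, XNOR(x1, x2));
  end
end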


Application: a handwritten digit recognition system.






===============================

(5) Classification


Remember the one-vs-all classification problem from the previous chapter? The one-vs-all method generalizes two-class classification to multi-class classification. Here we describe how to do classification with a neural network; the network design is shown in the figure:


The input vector x has three dimensions, there are two hidden layers, and the output layer has 4 neurons representing 4 classes. For every example, the output layer produces a vector [a b c d]^T in which exactly one of a, b, c, d equals 1, indicating the predicted class.
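For training, the integer class labels are typically recoded as such one-hot vectors; a minimal sketch (the variable names are illustrative):

K = 4;                  % number of classes
y = [2 4 1 3 2];        % example integer labels, one per training example
I = eye(K);
Y = I(:, y);            % column i of Y is the one-hot vector for label y(i), e.g. [0;1;0;0] for 2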




===============================

Summary

This chapter introduced the idea of neural networks in ML, focusing on how a network is constructed and how logical functions can be built with it. The next chapter describes the learning process of neural networks in more detail.


----------------------- The learning of neural networks------------------




(1) Cost function

(2) Backpropagation algorithm

(3) Backpropagation intuition

(4) Implementation note: Unrolling parameters

(5) Gradient checking

(6) Random initialization

(7) Putting it together



===============================

(1) Cost function


Suppose the neural network has m training examples, each consisting of a set of inputs x and a set of output signals y. Let L denote the number of layers in the network and S_l the number of neurons in layer l (so S_L is the number of neurons in the output layer).

Classification with a neural network falls into two cases: binary classification and multi-class classification.

- Binary classification: S_L = 1, and y = 0 or 1 indicates the class;

- K-class classification: S_L = K, and y_i = 1 indicates the example belongs to class i (for K > 2).


As we saw in earlier chapters, the cost function of the logistic hypothesis is defined as:


The first part measures the distance between the hypothesis and the true values, and the second part is the regularization term on the parameters. The neural network's cost function is defined analogously:


Here the distance between the hypothesis and the true values is summed over every example and every output class, and the regularization term sums the squares of all the parameters.
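The formulas themselves appear only as images in the original post; transcribed in the course's notation (treat this as a sketch), the regularized logistic-regression cost and the neural-network cost are:

\begin{align}J(\theta) &= -\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log\big(1-h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \\J(\Theta) &= -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[ y^{(i)}_k \log\big(h_\Theta(x^{(i)})\big)_k + \big(1-y^{(i)}_k\big) \log\Big(1-\big(h_\Theta(x^{(i)})\big)_k\Big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{S_l}\sum_{j=1}^{S_{l+1}} \big(\Theta^{(l)}_{ji}\big)^2\end{align}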




===============================

(2) Backpropagation algorithm

Having seen the form of the cost function, what we need next is to minimize J(Θ).


To optimize the parameters with gradient descent, we first need the cost function and its derivatives with respect to the parameters. Using forward propagation, we first compute the outputs of each layer of the network on the training dataset:


We define the network's overall error E as:

We want to minimize E by adjusting the weight parameters W (i.e. theta).
Since


so each layer is updated as follows:

The gradients are computed with the backpropagation algorithm, which introduces an error variable δ; this residual measures how much a given node contributed to the error of the final output.
For the last layer we can directly compute the gap between the network's output and the true value, and we define this gap as the output-layer residual δ. How do we handle the hidden units? We compute each hidden layer's residuals as a weighted average of the residuals of the nodes in the layer above. (You can verify that δ is in fact the derivative of E with respect to b.)

In the last layer,

while for each earlier layer,

which gives the rule for computing the residual of node i in layer l:

Since what we really want are the derivatives of E with respect to the parameters, and

we obtain the update equations for the network's weights:
Iterating until the parameters fall into a local optimum is the whole backpropagation algorithm.
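The equations referred to in this derivation are only shown as images in the original post; in the W, b notation from the first half of these notes, with the squared error E = \frac{1}{2}\|y - h_{W,b}(x)\|^2, the standard relations are (a sketch in that convention):

\begin{align}\delta^{(n_l)}_i &= -\big(y_i - a^{(n_l)}_i\big)\, f'(z^{(n_l)}_i) \\\delta^{(l)}_i &= \Big(\sum_{j} W^{(l)}_{ji}\, \delta^{(l+1)}_j\Big) f'(z^{(l)}_i) \\\frac{\partial E}{\partial W^{(l)}_{ij}} &= a^{(l)}_j\, \delta^{(l+1)}_i, \qquad \frac{\partial E}{\partial b^{(l)}_i} = \delta^{(l+1)}_i\end{align}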


============================================================
Example: logistic cost

Below we work through the computation for the logistic cost.
For each layer, the error can be defined as:

Substituting these in gives


from which the update equation for \theta_k follows:


If we write δ for the derivative of the error with respect to the activation (activation function output), then:



For the preceding layer the update is analogous; only the first factor of the gradient of \Theta, the derivative of E with respect to a_k, changes,

while the other factor stays the same.


The figure below shows the result of this derivation:


With the figure above we have the computation of the error variable δ; now let us look at the pseudocode of the backpropagation algorithm:


P.S. The last step uses += rather than a plain assignment because Δ is treated as a matrix, updated at the corresponding entries on every pass.

Working backwards, compute δ for each layer in turn; Δ accumulates the global error, with one Δ^{(l)} per layer. We then introduce D as the derivative of the cost function with respect to the parameters. In the figure, whether j = 0 determines whether the final bias term is included in the regularization. The left-hand side is the definition; the right-hand side can be proved (somewhat tediously).
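The pseudocode itself appears only as an image; below is a minimal MATLAB/Octave sketch of the accumulation loop for one hidden layer, assuming sigmoid activations, bias columns in Theta1/Theta2, inputs X (m x n), and one-hot labels Y (K x m). All names here are illustrative:

sigmoid = @(z) 1 ./ (1 + exp(-z));
sigGrad = @(z) sigmoid(z) .* (1 - sigmoid(z));

Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
    % forward propagation for example i
    a1 = [1; X(i, :)'];
    z2 = Theta1 * a1;      a2 = [1; sigmoid(z2)];
    z3 = Theta2 * a2;      a3 = sigmoid(z3);
    % back-propagate the error
    delta3 = a3 - Y(:, i);                                 % output-layer error
    delta2 = (Theta2(:, 2:end)' * delta3) .* sigGrad(z2);  % hidden-layer error
    % accumulate ("+=") the gradient contributions
    Delta2 = Delta2 + delta3 * a2';
    Delta1 = Delta1 + delta2 * a1';
end
D1 = Delta1 / m;   D1(:, 2:end) = D1(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
D2 = Delta2 / m;   D2(:, 2:end) = D2(:, 2:end) + (lambda / m) * Theta2(:, 2:end);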






===============================

(3) Backpropagation intuition

The previous section gave the steps and some formulas of the backpropagation algorithm; in this subsection we look at how the simplest back-propagation model learns.

First, compute z^{(j)} and a^{(j)} from front to back using forward propagation;


then simplify the original cost function by dropping the regularization term (the final term in the figure).


For the i-th training example (x^{(i)}, y^{(i)}), we then have

Cost(i) = y^{(i)} log(h_\theta(x^{(i)})) + (1 - y^{(i)}) log(1 - h_\theta(x^{(i)}))


From the discussion above,


where J is just the cost. Simplifying, and ignoring for the moment the g'(z_k) = a_k(1 - a_k) factor, we get:


Carrying out the differentiation, for the network in the figure we have


In other words, for each layer, every component of δ equals the weighted sum of all the δ's of the following layer, where the weights are exactly the parameters Θ.
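In symbols (still ignoring the g'(z) factor, as above), this weighted sum is:

\delta^{(l)}_j = \sum_i \Theta^{(l)}_{ij}\, \delta^{(l+1)}_i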





===============================

(4) Implementation note: Unrolling parameters

This section describes how to implement parameter unrolling in MATLAB.

Earlier chapters already showed how to run gradient-based updates in MATLAB; it comes down to two lines:

function [jVal, gradient] = costFunction(theta)
optTheta = fminunc(@costFunction, initialTheta, options)

Unlike linear regression and logistic regression, a neural network has very many parameters: each layer j has its own parameters Θj and derivatives Dj. So we first string the per-layer blocks together into one long vector Θ (and likewise D), pass that into the function, and then reshape inside the function, as in the figure, to recover the individual blocks for the computation.


When computing, the method is as follows:
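The original figure is not reproduced here; below is a MATLAB/Octave sketch of the unroll-and-reshape pattern for a network with layer sizes 10, 10 and 1 (the sizes are just an example; D1 and D2 are the gradient matrices from backpropagation):

% unroll the per-layer matrices into one long vector
thetaVec = [Theta1(:); Theta2(:)];
DVec     = [D1(:); D2(:)];

% inside the cost function, reshape them back before doing any computation
Theta1 = reshape(thetaVec(1:110),   10, 11);   % 10 x 11 = 110 entries
Theta2 = reshape(thetaVec(111:121),  1, 11);   % 1 x 11  = 11 entries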




===============================

(5) Gradient checking


The numbers inside a neural network change in complicated ways that are hard to follow, so how do we know whether it is working correctly? We have a tool for that: gradient checking. By checking the gradients we can tell whether our code has a bug. How? See below:

In the plot of J(Θ) against Θ, take a point on each side of Θ, namely Θ+ε and Θ-ε; then the derivative (gradient) at Θ is approximately (J(Θ+ε) - J(Θ-ε)) / (2ε).

The corresponding derivative formula for each individual parameter is shown in the figure:


Since backpropagation already gives us the derivative D of J(Θ) throughout, we can compare this numerical approximation against D: if the two results are close, the code is correct; otherwise something is wrong, as illustrated in the figure:
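A minimal MATLAB/Octave sketch of this check (costFunction is whatever routine returns J(Θ) for an unrolled parameter vector theta, as in the snippet earlier; DVec is the gradient vector from backpropagation):

EPSILON    = 1e-4;
gradApprox = zeros(size(theta));
for i = 1:numel(theta)
    thetaPlus     = theta;  thetaPlus(i)  = thetaPlus(i)  + EPSILON;
    thetaMinus    = theta;  thetaMinus(i) = thetaMinus(i) - EPSILON;
    gradApprox(i) = (costFunction(thetaPlus) - costFunction(thetaMinus)) / (2 * EPSILON);
end
% the relative difference should be tiny (e.g. < 1e-9) if the backprop gradients are right
disp(norm(gradApprox - DVec) / norm(gradApprox + DVec));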


Summary: a few points to keep in mind

- In backpropagation, compute the derivative D of J(θ) with respect to θ and assemble it into a vector (DVec).

- Use numerical gradient checking to compute an approximate gradient gradApprox = (J(Θ+ε) - J(Θ-ε)) / (2ε).

- Check whether the two give the same (or very close) results.

- (Very important) Then turn gradient checking off and use only backpropagation for learning (otherwise training will be very, very slow).






===============================

(6) Random initialization


For the initialization of the parameters θ, we previously simply set everything to 0, for example:


this means all of your hidden units end up computing exactly the same function of the input, which is a highly redundant representation: every computation within a layer collapses into one, and anything interesting gets ignored.

So we should break this symmetry by choosing each parameter randomly, within the range [-ε, ε]:
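A minimal MATLAB/Octave sketch (the matrix sizes are just example dimensions):

INIT_EPSILON = 0.12;                                         % a small constant
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % entries uniform in [-epsilon, epsilon]
Theta2 = rand(1, 11)  * (2 * INIT_EPSILON) - INIT_EPSILON;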






===============================

(7) Putting it together

1. Choosing a network architecture

We have many choices of network architecture:


How do we choose?

No. of input units: Dimension of features
No. of output units: Number of classes
Reasonable default: 1 hidden layer, or if >1 hidden layer, have same no. of hidden units in every layer (usually the more the better)

2. Training the network

① Randomly initialize the weights
② Implement forward propagation to get h_θ(x^{(i)}) for any x^{(i)}
③ Implement code to compute the cost function J(θ)
④ Implement backprop to compute the partial derivatives (a sketch putting these steps together follows below)
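Putting steps ① through ④ together, here is a minimal sketch of the training call; nnCostFunction is a hypothetical cost routine that returns [J, grad] for an unrolled parameter vector, and the layer sizes and lambda are placeholder values:

inputSize = 400; hiddenSize = 25; numLabels = 10; lambda = 1;   % placeholder sizes

% step 1: random initialization, as in the previous section
initTheta1   = rand(hiddenSize, inputSize + 1) * 0.24 - 0.12;
initTheta2   = rand(numLabels, hiddenSize + 1) * 0.24 - 0.12;
initialTheta = [initTheta1(:); initTheta2(:)];

% steps 2-4: forward propagation, the cost J(theta) and the backprop gradients
% all live inside nnCostFunction (hypothetical name), which returns [J, grad]
costFun = @(p) nnCostFunction(p, inputSize, hiddenSize, numLabels, X, y, lambda);
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, cost] = fminunc(costFun, initialTheta, options);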




