This section is mainly about optimization. The focus is implementing the various gradients, especially gradients that involve matrices, covering both the formula derivations and the code. Note 3 gives the SVM gradient first; later we will meet the gradients of softmax, conv, ReLU, BN, and so on. Together with the assignments, the derivation and implementation of each will be worked through in detail.
There is an assignment here.
Derivation
For a single sample $x_i$, the SVM loss is:

$$
\begin{aligned}
L_i = & \sum_{j \neq y_i}^C \max\left( 0, w_j x_i - w_{y_i} x_i + \Delta \right) \newline
= & \max\left( 0, w_0 x_i - w_{y_i} x_i + \Delta \right) + ... + \max\left( 0, w_j x_i - w_{y_i} x_i + \Delta \right) + ...
\end{aligned}
$$
Differentiating $L_i$ with respect to $w_j$:

$$
\mathrm{d}w_j = \frac{\partial L_i}{\partial w_j} = 0 + 0 + ... +
\mathbb{1} \left( w_j x_i - w_{y_i} x_i + \Delta > 0\right) \cdot x_i
$$
Differentiating $L_i$ with respect to $w_{y_i}$:

$$
\begin{aligned}
\mathrm{d}w_{y_i} =& \frac{\partial L_i}{\partial w_{y_i}} =
\mathbb{1} \left( w_0 x_i - w_{y_i} x_i + \Delta > 0\right) \cdot (-x_i) +
... + \mathbb{1} \left( w_j x_i - w_{y_i} x_i + \Delta > 0\right) \cdot (-x_i) + ... \newline
=& - \left( \sum_{j \neq y_i}^C \mathbb{1} \left( w_j x_i - w_{y_i} x_i + \Delta > 0\right) \right) \cdot x_i
\end{aligned}
$$
svm_naive
$\mathrm{d}W$ must have the same shape as $W$; establishing this is always the first step when computing a gradient. Here the shape of $\mathrm{d}W$ is (3073, 10). Next, note that the subscript of $L$ is $i \in [0, N)$, i.e. one of the N samples, while the subscript of $w$ is $j \in [0, C)$, i.e. one of the 10 classes. If a column does not correspond to the true class and its margin is greater than 0, add this sample's $x_i$ to that column of $\mathrm{d}W$; if the column corresponds to the true class, count how many of the other 9 classes have a margin greater than 0, multiply that count by the sample's $x_i$, and subtract the result from the corresponding column of $\mathrm{d}W$. Repeat over all N samples.
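The loop just described can be sketched as follows (a minimal sketch with illustrative names, not the assignment's exact code):

```python
import numpy as np

def svm_loss_naive(W, X, y, delta=1.0):
    """Naive SVM loss and gradient.
    W: (D, C) weights, X: (N, D) samples, y: (N,) integer labels."""
    N = X.shape[0]
    C = W.shape[1]
    loss = 0.0
    dW = np.zeros_like(W)              # dW always has the same shape as W
    for i in range(N):
        scores = X[i].dot(W)           # (C,) scores for sample i
        correct = scores[y[i]]
        for j in range(C):
            if j == y[i]:
                continue
            margin = scores[j] - correct + delta
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]       # non-true-class column: add x_i
                dW[:, y[i]] -= X[i]    # true-class column: subtract x_i
    return loss / N, dW / N
```

The true-class column accumulates one `-X[i]` per positive margin, which matches the count-times-$x_i$ rule in the derivation.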
svm_vectorize
Here we introduce a very useful technique, dimension analysis, which greatly simplifies the vectorization and makes it hard to get wrong. First, the score is a function of X and W:
$$
Score = X.dot(W)
$$
Therefore $\mathrm{d}W$ must be computed from $\mathrm{d}Score$ and X. Here X is (N, 3073) and W is (3073, 10), so Score is (N, 10); $\mathrm{d}Score$ must have the same shape as Score, so it is also (N, 10). Then, from the shape constraints of matrix multiplication, the only possibility is
$$
\mathrm{d}W = X.T.dot(\mathrm{d}Score)
$$
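The dimension analysis can be checked directly on toy arrays (shapes as in the text; `dScore` is random here purely to verify shapes):

```python
import numpy as np

N, D, C = 8, 3073, 10
X = np.random.randn(N, D)               # (N, 3073)
W = np.random.randn(D, C)               # (3073, 10)

Score = X.dot(W)                        # (N, 10)
dScore = np.random.randn(*Score.shape)  # same shape as Score by definition

# The only product of X and dScore that yields W's shape:
dW = X.T.dot(dScore)                    # (3073, N) @ (N, 10) -> (3073, 10)
assert dW.shape == W.shape
```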
From the derivation above we get $\mathrm{d}Score$:
$$
\mathrm{d}s_j = \mathbb{1} \left( s_j - s_{y_i} + \Delta > 0\right)
$$
$$
\mathrm{d}s_{y_i}
= - \sum_{j \neq y_i}^C \mathbb{1} \left( s_j - s_{y_i} + \Delta > 0\right)
$$
That is, for each entry of Score: if it is not the true class and the margin is greater than 0, the corresponding entry of $\mathrm{d}Score$ is 1, otherwise 0; if it is the true class, the entry is the negative of the number of positive margins in that sample's row.
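That description translates almost line-for-line into numpy (a sketch; the variable names are illustrative):

```python
import numpy as np

def svm_grad_vectorized(W, X, y, delta=1.0):
    """Vectorized dW for the SVM loss via dScore (sketch)."""
    N = X.shape[0]
    scores = X.dot(W)                                  # (N, C)
    correct = scores[np.arange(N), y]                  # (N,) true-class scores
    margins = np.maximum(0, scores - correct[:, np.newaxis] + delta)
    margins[np.arange(N), y] = 0                       # exclude the true class
    dScore = (margins > 0).astype(float)               # 1 where margin > 0
    dScore[np.arange(N), y] = -np.sum(dScore, axis=1)  # minus count per row
    return X.T.dot(dScore) / N
```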
I hit a problem here that took a long time to debug. The code was:
```python
scores = np.maximum(0, scores - correct_score[:, np.newaxis] + 1.0)
scores[np.arange(N), y] = 0
dScore = (scores > 0)
dScore[np.arange(N), y] = -np.sum(dScore, axis=1)
print(dScore[np.arange(N), y])
```
which prints:

```
[ True  True  True ...  True]
```
The correct version is:

```python
dScore = (scores > 0).astype(float)
```

which prints:

```
[-9. -9. -9. ... -6.]
```
The cause is that `(scores > 0)` is a boolean array, so the negative counts written into the true-class positions are silently cast back to `True`. Because the grad check here uses grad_check_sparse, which only samples 10 points, the bug was not caught at that stage; it only surfaced in the final naive-vs-vectorized comparison, which takes the Frobenius norm of the difference between the two gradients.
Inline Question 1: It is possible that once in a while a dimension in the gradcheck will not match exactly. What could such a discrepancy be caused by? Is it a reason for concern? What is a simple example in one dimension where a gradient check could fail? How would changing the margin affect the frequency of this happening? Hint: the SVM loss function is not strictly speaking differentiable
The hint already gives it away: because of the kinks introduced by the max function, whenever a margin lands in a small neighborhood of 0, the gradcheck will mismatch. The simplest example: if $\Delta$ is 0, initialization has a high probability of landing near a kink.
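A concrete one-dimensional failure: for f(x) = max(0, x), pick x just left of the kink and a finite-difference step h larger than |x|. The centered difference then straddles the kink while the analytic gradient is 0:

```python
def f(x):
    return max(0.0, x)

x = -1e-6          # just left of the kink at 0
h = 1e-5           # finite-difference step larger than |x|

analytic = 0.0                              # f is flat to the left of 0
numeric = (f(x + h) - f(x - h)) / (2 * h)   # straddles the kink

print(analytic, numeric)   # numeric comes out near 0.45, nowhere near 0
```

A larger margin $\Delta$ pushes margins away from 0, making such near-kink evaluations rarer.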
Derivation
Again, stage the computation at the score level and then use $\mathrm{d}W = X.T.dot(\mathrm{d}Score)$ to compute $\mathrm{d}W$; this way the derivation never has to differentiate through a product of two matrices.
$$
L_i = - \log \left( p_{y_i} \right) = -\log \left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right )
$$
Differentiating $L_i$ with respect to an arbitrary $s_k$:
$$
\begin{aligned}
\mathrm{d} s_k =& \frac{\partial L_i}{\partial s_k} = - \frac{\partial}{\partial s_k} \left( \log \left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right ) \right) \newline
=& - \frac{\sum_j e^{s_j}}{e^{s_{y_i}}} \cdot \frac{\left( {e^{s_{y_i}}}\right)^{'} \cdot {\sum_j e^{s_j}} - {e^{s_{y_i}}} \cdot \left( {\sum_j e^{s_j}} \right)^{'}}{\left( {\sum_j e^{s_j}}\right)^2} \newline
=&\frac{\frac{\partial}{\partial s_k}\left( {\sum_j e^{s_j}} \right)}{{\sum_j e^{s_j}}} - \frac{ \frac{\partial }{\partial s_k} \left({e^{s_{y_i}}} \right)}{{e^{s_{y_i}}}} \newline
=&\frac{\frac{\partial}{\partial s_k}\left( e^{s_0} + e^{s_1} + e^{s_{y_i}} + ... \right)}{{\sum_j e^{s_j}}} - \frac{ \frac{\partial }{\partial s_k} \left({e^{s_{y_i}}} \right)}{{e^{s_{y_i}}}}
\end{aligned}
$$
When $y_i = k$:

$$
\mathrm{d} s_k = \frac{{e^{s_{y_i}}}}{{\sum_j e^{s_j}}} - 1
$$
When $y_i \neq k$:

$$
\mathrm{d} s_k = \frac{{e^{s_k}}}{{\sum_j e^{s_j}}}
$$
Combining the two cases:

$$
\mathrm{d} s_k = \frac{{e^{s_k}}}{{\sum_j e^{s_j}}} - \mathbb{1} \left( y_i = k \right)
$$
softmax_naive
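A per-sample loop applying $\mathrm{d}s_k = p_k - \mathbb{1}(y_i = k)$ from the derivation above (a sketch, with the usual max-subtraction trick for numerical stability):

```python
import numpy as np

def softmax_loss_naive(W, X, y):
    """Naive softmax loss and gradient.
    W: (D, C), X: (N, D), y: (N,) integer labels."""
    N = X.shape[0]
    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(N):
        scores = X[i].dot(W)                          # (C,)
        scores -= scores.max()                        # stability: max score -> 0
        p = np.exp(scores) / np.sum(np.exp(scores))   # softmax probabilities
        loss += -np.log(p[y[i]])
        dscore = p.copy()
        dscore[y[i]] -= 1                             # ds_k = p_k - 1(y_i = k)
        dW += np.outer(X[i], dscore)                  # (D, C) outer product
    return loss / N, dW / N
```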
softmax_vectorize
With the derivation above and the experience from the SVM, this code is not hard to write. Note that in both cases we first compute $\mathrm{d}Score$ and then use $\mathrm{d}W = X.T.dot(\mathrm{d}Score)$.
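A vectorized sketch following that same staging, $\mathrm{d}Score$ first, then $\mathrm{d}W = X.T.dot(\mathrm{d}Score)$ (illustrative names, not the assignment's exact code):

```python
import numpy as np

def softmax_loss_vectorized(W, X, y):
    """Vectorized softmax loss and gradient via dScore."""
    N = X.shape[0]
    scores = X.dot(W)                                # (N, C)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(N), y]).mean()
    dScore = p.copy()
    dScore[np.arange(N), y] -= 1                     # p_k - 1(y_i = k)
    dW = X.T.dot(dScore) / N                         # (D, N) @ (N, C) -> (D, C)
    return loss, dW
```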
Nothing much to add here.
Inline question 2:
Describe what your visualized SVM weights look like, and offer a brief explanation for why they look the way that they do.
Each set of weights is a template for one class: the score is an inner product with the input, so the learned weights end up looking like blurry averages of that class's training images.
Inline Question - True or False
It’s possible to add a new datapoint to a training set that would leave the SVM loss unchanged, but this is not the case with the Softmax classifier loss.
True. The SVM loss only requires the margins to be large enough, so a new datapoint that already satisfies all margins contributes exactly zero loss; the softmax loss is never fully satisfied, so any new datapoint changes it.
Nothing much to say about this part.
Inline question 1:
Describe the misclassification results that you see. Do they make sense?
Looking at these misclassified examples, we can see that they share many similarities in color and overall shape with the predicted class, so the mistakes make sense.