This section is mainly about optimization. The focus is implementing the various gradients, especially gradients that involve matrices, covering both the formula derivations and the code. Note 3 gives the SVM gradient first; later we will meet the gradients of softmax, conv, ReLU, BN, and so on. Together with the assignments, the derivation and implementation of each will be worked through in detail.
There is an assignment here.
Derivation
For a single sample $x_i$, the SVM loss is:

$$
\begin{aligned}
L_i = & \sum_{j \neq y_i}^C \max\left( 0, w_j x_i - w_{y_i} x_i + \Delta \right) \newline
= & \max\left( 0, w_0 x_i - w_{y_i} x_i + \Delta \right) + ... + \max\left( 0, w_j x_i - w_{y_i} x_i + \Delta \right) + ...
\end{aligned}
$$
Differentiating $L_i$ with respect to $w_j$:

$$
\mathrm{d}w_j = \frac{\partial L_i}{\partial w_j} = 0 + 0 + ... +
\mathbb{1} \left( w_j x_i - w_{y_i} x_i + \Delta > 0\right) \cdot x_i
$$
Differentiating $L_i$ with respect to $w_{y_i}$:

$$
\begin{aligned}
\mathrm{d}w_{y_i} =& \frac{\partial L_i}{\partial w_{y_i}} =
\mathbb{1} \left( w_0 x_i - w_{y_i} x_i + \Delta > 0\right) \cdot (-x_i) +
... + \mathbb{1} \left( w_j x_i - w_{y_i} x_i + \Delta > 0\right) \cdot (-x_i) + ... \newline
=& - \left( \sum_{j \neq y_i}^C \mathbb{1} \left( w_j x_i - w_{y_i} x_i + \Delta > 0\right) \right) \cdot x_i
\end{aligned}
$$
svm_naive
$\mathrm{d}W$ must have the same shape as $W$; establishing this is always the first step when computing a gradient. Here the shape of $\mathrm{d}W$ is (3073, 10). Next, note that the subscript of $L$ is $i \in [0, N)$, i.e. one of the N samples, while the subscript of $w$ is $j \in [0, C)$, i.e. one of the 10 classes. If a column does not correspond to the true class and its margin is greater than 0, add this sample's $x_i$ to that column of $\mathrm{d}W$; if the column corresponds to the true class, count how many of the other 9 classes have a margin greater than 0, multiply that count by the sample's $x_i$, and subtract the result from the corresponding column of $\mathrm{d}W$. Repeat over all N samples.
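The loop just described can be sketched as follows (a minimal sketch with illustrative names, not the assignment's exact code):

```python
import numpy as np

def svm_loss_naive(W, X, y, delta=1.0):
    """Naive SVM loss and gradient.
    W: (D, C) weights, X: (N, D) samples, y: (N,) integer labels."""
    N = X.shape[0]
    C = W.shape[1]
    loss = 0.0
    dW = np.zeros_like(W)              # dW always has the same shape as W
    for i in range(N):
        scores = X[i].dot(W)           # (C,) scores for sample i
        correct = scores[y[i]]
        for j in range(C):
            if j == y[i]:
                continue
            margin = scores[j] - correct + delta
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]       # non-true-class column: add x_i
                dW[:, y[i]] -= X[i]    # true-class column: subtract x_i
    return loss / N, dW / N
```

The true-class column accumulates one `-X[i]` per positive margin, which matches the count-times-$x_i$ rule in the derivation.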
svm_vectorize
Here we introduce a very useful technique, dimension analysis, which greatly simplifies the vectorization and makes it hard to get wrong. First, the score is a function of X and W:
$$
Score = X.dot(W)
$$
Therefore $\mathrm{d}W$ must be computed from $\mathrm{d}Score$ and X. Here X is (N, 3073) and W is (3073, 10), so Score is (N, 10); $\mathrm{d}Score$ must have the same shape as Score, so it is also (N, 10). Then, from the shape constraints of matrix multiplication, the only possibility is
$$
\mathrm{d}W = X.T.dot(\mathrm{d}Score)
$$
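The dimension analysis can be checked directly on toy arrays (shapes as in the text; `dScore` is random here purely to verify shapes):

```python
import numpy as np

N, D, C = 8, 3073, 10
X = np.random.randn(N, D)               # (N, 3073)
W = np.random.randn(D, C)               # (3073, 10)

Score = X.dot(W)                        # (N, 10)
dScore = np.random.randn(*Score.shape)  # same shape as Score by definition

# The only product of X and dScore that yields W's shape:
dW = X.T.dot(dScore)                    # (3073, N) @ (N, 10) -> (3073, 10)
assert dW.shape == W.shape
```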
From the derivation above we get $\mathrm{d}Score$:
$$
\mathrm{d}s_j = \mathbb{1} \left( s_j - s_{y_i} + \Delta > 0\right)
$$
$$
\mathrm{d}s_{y_i}
= - \sum_{j \neq y_i}^C \mathbb{1} \left( s_j - s_{y_i} + \Delta > 0\right)
$$
That is, for each entry of Score: if it is not the true class and the margin is greater than 0, the corresponding entry of $\mathrm{d}Score$ is 1, otherwise 0; if it is the true class, the entry is the negative of the number of positive margins in that sample's row.
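That description translates almost line-for-line into numpy (a sketch; the variable names are illustrative):

```python
import numpy as np

def svm_grad_vectorized(W, X, y, delta=1.0):
    """Vectorized dW for the SVM loss via dScore (sketch)."""
    N = X.shape[0]
    scores = X.dot(W)                                  # (N, C)
    correct = scores[np.arange(N), y]                  # (N,) true-class scores
    margins = np.maximum(0, scores - correct[:, np.newaxis] + delta)
    margins[np.arange(N), y] = 0                       # exclude the true class
    dScore = (margins > 0).astype(float)               # 1 where margin > 0
    dScore[np.arange(N), y] = -np.sum(dScore, axis=1)  # minus count per row
    return X.T.dot(dScore) / N
```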
I hit a problem here that took a long time to debug. The code was:
```python
scores = np.maximum(0, scores - correct_score[:, np.newaxis] + 1.0)
scores[np.arange(N), y] = 0
dScore = (scores > 0)
dScore[np.arange(N), y] = -np.sum(dScore, axis=1)
print(dScore[np.arange(N), y])
```
which prints:

```
[ True  True  True ...  True]
```
The correct version is:

```python
dScore = (scores > 0).astype(float)
```

which prints:

```
[-9. -9. -9. ... -6.]
```
The cause is that `(scores > 0)` is a boolean array, so the negative counts written into the true-class positions are silently cast back to `True`. Because the grad check here uses grad_check_sparse, which only samples 10 points, the bug was not caught at that stage; it only surfaced in the final naive-vs-vectorized comparison, which takes the Frobenius norm of the difference between the two gradients.
Inline Question 1: It is possible that once in a while a dimension in the gradcheck will not match exactly. What could such a discrepancy be caused by? Is it a reason for concern? What is a simple example in one dimension where a gradient check could fail? How would changing the margin affect the frequency of this happening? Hint: the SVM loss function is not strictly speaking differentiable
The hint already gives it away: because of the kinks introduced by the max function, whenever a margin lands in a small neighborhood of 0, the gradcheck will mismatch. The simplest example: if $\Delta$ is 0, initialization has a high probability of landing near a kink.
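A concrete one-dimensional failure: for f(x) = max(0, x), pick x just left of the kink and a finite-difference step h larger than |x|. The centered difference then straddles the kink while the analytic gradient is 0:

```python
def f(x):
    return max(0.0, x)

x = -1e-6          # just left of the kink at 0
h = 1e-5           # finite-difference step larger than |x|

analytic = 0.0                              # f is flat to the left of 0
numeric = (f(x + h) - f(x - h)) / (2 * h)   # straddles the kink

print(analytic, numeric)   # numeric comes out near 0.45, nowhere near 0
```

A larger margin $\Delta$ pushes margins away from 0, making such near-kink evaluations rarer.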
Derivation
Again, stage the computation at the score level and then use $\mathrm{d}W = X.T.dot(\mathrm{d}Score)$ to compute $\mathrm{d}W$; this way the derivation never has to differentiate through a product of two matrices.
$$
L_i = - \log \left( p_{y_i} \right) = -\log \left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right )
$$
Differentiating $L_i$ with respect to an arbitrary $s_k$:
$$
\begin{aligned}
\mathrm{d} s_k =& \frac{\partial L_i}{\partial s_k} = - \frac{\partial}{\partial s_k} \left( \log \left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right ) \right) \newline
=& - \frac{\sum_j e^{s_j}}{e^{s_{y_i}}} \cdot \frac{\left( {e^{s_{y_i}}}\right)^{'} \cdot {\sum_j e^{s_j}} - {e^{s_{y_i}}} \cdot \left( {\sum_j e^{s_j}} \right)^{'}}{\left( {\sum_j e^{s_j}}\right)^2} \newline
=&\frac{\frac{\partial}{\partial s_k}\left( {\sum_j e^{s_j}} \right)}{{\sum_j e^{s_j}}} - \frac{ \frac{\partial }{\partial s_k} \left({e^{s_{y_i}}} \right)}{{e^{s_{y_i}}}} \newline
=&\frac{\frac{\partial}{\partial s_k}\left( e^{s_0} + e^{s_1} + e^{s_{y_i}} + ... \right)}{{\sum_j e^{s_j}}} - \frac{ \frac{\partial }{\partial s_k} \left({e^{s_{y_i}}} \right)}{{e^{s_{y_i}}}}
\end{aligned}
$$
When $y_i = k$:

$$
\mathrm{d} s_k = \frac{{e^{s_{y_i}}}}{{\sum_j e^{s_j}}} - 1
$$
When $y_i \neq k$:

$$
\mathrm{d} s_k = \frac{{e^{s_k}}}{{\sum_j e^{s_j}}}
$$
Combining the two cases:

$$
\mathrm{d} s_k = \frac{{e^{s_k}}}{{\sum_j e^{s_j}}} - \mathbb{1} \left( y_i = k \right)
$$
softmax_naive
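A per-sample loop applying $\mathrm{d}s_k = p_k - \mathbb{1}(y_i = k)$ from the derivation above (a sketch, with the usual max-subtraction trick for numerical stability):

```python
import numpy as np

def softmax_loss_naive(W, X, y):
    """Naive softmax loss and gradient.
    W: (D, C), X: (N, D), y: (N,) integer labels."""
    N = X.shape[0]
    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(N):
        scores = X[i].dot(W)                          # (C,)
        scores -= scores.max()                        # stability: max score -> 0
        p = np.exp(scores) / np.sum(np.exp(scores))   # softmax probabilities
        loss += -np.log(p[y[i]])
        dscore = p.copy()
        dscore[y[i]] -= 1                             # ds_k = p_k - 1(y_i = k)
        dW += np.outer(X[i], dscore)                  # (D, C) outer product
    return loss / N, dW / N
```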
softmax_vectorize
With the derivation above and the experience from the SVM, this code is not hard to write. Note that in both cases we first compute $\mathrm{d}Score$ and then use $\mathrm{d}W = X.T.dot(\mathrm{d}Score)$.
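A vectorized sketch following that same staging, $\mathrm{d}Score$ first, then $\mathrm{d}W = X.T.dot(\mathrm{d}Score)$ (illustrative names, not the assignment's exact code):

```python
import numpy as np

def softmax_loss_vectorized(W, X, y):
    """Vectorized softmax loss and gradient via dScore."""
    N = X.shape[0]
    scores = X.dot(W)                                # (N, C)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(N), y]).mean()
    dScore = p.copy()
    dScore[np.arange(N), y] -= 1                     # p_k - 1(y_i = k)
    dW = X.T.dot(dScore) / N                         # (D, N) @ (N, C) -> (D, C)
    return loss, dW
```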
Nothing much to add here.
Inline question 2:
Describe what your visualized SVM weights look like, and offer a brief explanation for why they look the way that they do.
Each set of weights is a template for one class: the score is an inner product with the input, so the learned weights end up looking like blurry averages of that class's training images.
Inline Question - True or False
It’s possible to add a new datapoint to a training set that would leave the SVM loss unchanged, but this is not the case with the Softmax classifier loss.
True. The SVM loss only requires the margins to be large enough, so a new datapoint that already satisfies all margins contributes exactly zero loss; the softmax loss is never fully satisfied, so any new datapoint changes it.
Nothing much to say about this part.
Inline question 1:
Describe the misclassification results that you see. Do they make sense?
Looking at these misclassified examples, we can see that they share many similarities in color and overall shape with the predicted class, so the mistakes make sense.