1. The Sigmoid Function
In this article, the sigmoid function is denoted by $S(x)$:
$$S(x)=\dfrac{1}{1+e^{-x}}$$
The sigmoid function has a convenient property: $[S(x)]'=S(x)\left[1-S(x)\right]$.
The sigmoid curve grows fastest near its center $(x=0,\ y=0.5)$ and grows slowly at both ends.
(Figure: the sigmoid curve, with the step function drawn as a dashed line.)
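The derivative identity above is easy to check numerically. Below is a minimal sketch (not from the original article) comparing the closed form $S(x)[1-S(x)]$ against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))               # S(x)[1 - S(x)]
print(np.max(np.abs(numeric - analytic)))              # tiny; finite-difference error only
```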
2. The Logistic Regression Model
If the sigmoid function $S(x)$ is applied as a transformation of the linear model $f(\boldsymbol x)=\boldsymbol w^T\boldsymbol x+b$, then:
$$y(\boldsymbol x)=S[f(\boldsymbol x)]=\dfrac{1}{1+e^{-(\boldsymbol w^T\boldsymbol x+b)}}$$
For a given sample $\boldsymbol x^{\ast}$, the output value is $y=y(\boldsymbol x^{\ast})$, which gives:
$$\ln\left(\dfrac{y}{1-y}\right)=\boldsymbol w^T\boldsymbol x^{\ast}+b$$
If $y$ is interpreted as the probability that sample $\boldsymbol x^{\ast}$ is a positive example and $1-y$ as the probability that it is a negative example, then the logarithm of their ratio, $\ln\left(\dfrac{y}{1-y}\right)$, reflects how the sample $\boldsymbol x^{\ast}$ is classified by the linear model (as illustrated in the accompanying figure):
1) If $y=0.5$, then $1-y=0.5$ and $\ln\left(\dfrac{y}{1-y}\right)=0$, so $\boldsymbol w^T\boldsymbol x^{\ast}+b=0$.

From the linear model's point of view, the sample $\boldsymbol x^{\ast}$ lies exactly on the decision boundary (the red line) and is equally likely to be a positive or a negative example.

2) If $y>0.5$, then $1-y<0.5$ and $\ln\left(\dfrac{y}{1-y}\right)>0$, so $\boldsymbol w^T\boldsymbol x^{\ast}+b>0$.

This means the sample $\boldsymbol x^{\ast}$ lies on the upper side of the decision boundary.

3) If $y<0.5$, then $1-y>0.5$ and $\ln\left(\dfrac{y}{1-y}\right)<0$, so $\boldsymbol w^T\boldsymbol x^{\ast}+b<0$.

This means the sample $\boldsymbol x^{\ast}$ lies on the lower side of the decision boundary.
The sigmoid function maps the output of the linear model $\boldsymbol w^T\boldsymbol x+b$ into the interval $(0,1)$.
If an event occurs with probability $p$, its odds are defined as $\dfrac{p}{1-p}$, and its log odds as $\ln\left(\dfrac{p}{1-p}\right)$.
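For a concrete example: with $p=0.8$ the odds are $\dfrac{0.8}{0.2}=4$ and the log odds are $\ln 4\approx 1.386>0$, while $p=0.5$ gives odds $1$ and log odds $0$, matching case 1) above.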
If the variable $c=1$ denotes region $\mathcal R_1$ in the figure above and $c=0$ denotes region $\mathcal R_2$, then the value $y(\boldsymbol x)$ can be viewed as the class posterior probability, i.e.:
$$p(c=1|\boldsymbol x)=y(\boldsymbol x)=\dfrac{1}{1+e^{-(\boldsymbol w^T\boldsymbol x+b)}}$$
The log odds of $p(c=1|\boldsymbol x)$ is then exactly the linear model:
$$\ln\dfrac{p(c=1|\boldsymbol x)}{p(c=0|\boldsymbol x)}=\boldsymbol w^T\boldsymbol x+b$$
The probability that $\boldsymbol x$ is a positive example $(c=1)$:
$$p(c=1|\boldsymbol x)=\dfrac{1}{1+e^{-(\boldsymbol w^T\boldsymbol x+b)}}$$
As the value of the linear function $\boldsymbol w^T\boldsymbol x+b$ approaches $+\infty$, this probability approaches $1$; as it approaches $-\infty$, the probability approaches $0$.
The probability that $\boldsymbol x$ is a negative example $(c=0)$:
$$\begin{aligned} p(c=0|\boldsymbol x)&=1-p(c=1|\boldsymbol x)\\ &=\dfrac{e^{-(\boldsymbol w^T\boldsymbol x+b)}}{1+e^{-(\boldsymbol w^T\boldsymbol x+b)}}\\ &=\dfrac{1}{1+e^{\boldsymbol w^T\boldsymbol x+b}} \end{aligned}$$
As the value of the linear function $\boldsymbol w^T\boldsymbol x+b$ approaches $+\infty$, this probability approaches $0$; as it approaches $-\infty$, the probability approaches $1$.
Clearly, for a new input sample $\boldsymbol x^{\ast}$, the maximum a posteriori rule applies: if $p(c=1|\boldsymbol x^{\ast})>p(c=0|\boldsymbol x^{\ast})$, then $\boldsymbol x^{\ast}$ is assigned to $\mathcal R_1$; if $p(c=1|\boldsymbol x^{\ast})<p(c=0|\boldsymbol x^{\ast})$, then $\boldsymbol x^{\ast}$ is assigned to $\mathcal R_2$.
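As a minimal sketch of this decision rule (the parameter values below are made up for illustration, not taken from the article), both posteriors can be computed and compared; since they sum to $1$, the MAP rule is the same as thresholding $p(c=1|\boldsymbol x^{\ast})$ at $0.5$:

```python
import numpy as np

w = np.array([1.0, -2.0])      # hypothetical weights
b = 0.5                        # hypothetical bias
x_new = np.array([0.3, 0.1])   # a new sample x*

p_pos = 1.0 / (1 + np.exp(-(w @ x_new + b)))  # p(c=1|x*)
p_neg = 1.0 - p_pos                           # p(c=0|x*)
label = 1 if p_pos > p_neg else 0             # MAP rule (equivalently p_pos > 0.5)
print(p_pos, p_neg, label)
```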
3. Parameter Estimation
Assume the training set is $\{(\boldsymbol x_i,c_i)\}_{i=1}^N$, where $\boldsymbol x_i\in\mathbb R^n$ and $c_i\in\{0,1\}$. The model parameters $(\boldsymbol w,b)$ are estimated by maximum likelihood.
1) Since $y(\boldsymbol x)=p(c=1|\boldsymbol x)$, the likelihood of the training set can be written as:
$$L(\boldsymbol w,b)=\prod_{i=1}^N y(\boldsymbol x_i)^{c_i}\left[1-y(\boldsymbol x_i)\right]^{1-c_i}$$
2) The log-likelihood is then:
$$\begin{aligned}
\ln L(\boldsymbol w,b)&=\sum_{i=1}^N\left\{c_i\ln\left[y(\boldsymbol x_i)\right]+(1-c_i)\ln\left[1-y(\boldsymbol x_i)\right]\right\}\\
&=\sum_{i=1}^N\left\{c_i\ln\dfrac{y(\boldsymbol x_i)}{1-y(\boldsymbol x_i)}+\ln\left[1-y(\boldsymbol x_i)\right]\right\}\\
&=\sum_{i=1}^N\left\{c_i\left(\boldsymbol w^T\boldsymbol x_i+b\right)-\ln\left[1+e^{\boldsymbol w^T\boldsymbol x_i+b}\right]\right\}
\end{aligned}$$
3) Let $\boldsymbol\beta=[\boldsymbol w^T,b]^T$ and $\hat{\boldsymbol x}_i=[\boldsymbol x_i^T,1]^T$, so that the linear model becomes $\boldsymbol w^T\boldsymbol x_i+b=\boldsymbol\beta^T\hat{\boldsymbol x}_i$. Then:
$$\ln L(\boldsymbol\beta)=\sum_{i=1}^N\left[c_i\boldsymbol\beta^T\hat{\boldsymbol x}_i-\ln\left(1+e^{\boldsymbol\beta^T\hat{\boldsymbol x}_i}\right)\right]$$
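As a sketch of this expression (with made-up toy data; the rows of `xhat` are the augmented samples $\hat{\boldsymbol x}_i$ defined above), the log-likelihood can be evaluated in vectorized form:

```python
import numpy as np

xhat = np.array([[1.0, 2.0, 1.0],    # toy augmented samples [x_i^T, 1]
                 [-1.0, 0.5, 1.0],
                 [2.0, -1.0, 1.0]])
c = np.array([1, 0, 1])              # toy labels c_i
beta = np.array([0.5, -0.3, 0.1])    # toy parameters [w^T, b]^T

z = xhat @ beta                                # beta^T xhat_i for all i
log_L = np.sum(c * z - np.log(1 + np.exp(z)))  # ln L(beta)
print(log_L)
```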
Maximizing this likelihood yields the parameters $\boldsymbol\beta=[\boldsymbol w^T,b]^T$ of the logistic regression model.
4. Optimization Algorithms for Model Learning
The loss function is usually taken to be the negative log-likelihood, i.e. $l(\boldsymbol\beta)=-\ln L(\boldsymbol w,b)=-\ln L(\boldsymbol\beta)$. Maximizing the likelihood is therefore equivalent to minimizing the loss $l(\boldsymbol\beta)$.
$$l(\boldsymbol\beta)=-\ln L(\boldsymbol\beta)=-\sum_{i=1}^N\left[c_i\boldsymbol\beta^T\hat{\boldsymbol x}_i-\ln\left(1+e^{\boldsymbol\beta^T\hat{\boldsymbol x}_i}\right)\right]$$
Since $l(\boldsymbol\beta)$ is a continuous convex function of $\boldsymbol\beta$ with derivatives of all orders, numerical optimization methods can be used to solve for $\boldsymbol\beta$.
4.1 Gradient Descent
To apply gradient descent, we need the negative gradient, which serves as the descent direction:
$$\begin{aligned}
\dfrac{\partial l(\boldsymbol\beta)}{\partial\boldsymbol\beta}&=-\sum_{i=1}^N\left(c_i\hat{\boldsymbol x}_i-\dfrac{1}{1+e^{\boldsymbol\beta^T\hat{\boldsymbol x}_i}}e^{\boldsymbol\beta^T\hat{\boldsymbol x}_i}\hat{\boldsymbol x}_i\right)\\
&=-\sum_{i=1}^N\left(c_i-\dfrac{e^{\boldsymbol\beta^T\hat{\boldsymbol x}_i}}{1+e^{\boldsymbol\beta^T\hat{\boldsymbol x}_i}}\right)\hat{\boldsymbol x}_i\\
&=-\sum_{i=1}^N\left[c_i-y(\boldsymbol x_i)\right]\hat{\boldsymbol x}_i\qquad(1)
\end{aligned}$$
Since $\boldsymbol\beta=[\boldsymbol w^T,b]^T=[w_1,\cdots,w_n,b]^T$ and $\hat{\boldsymbol x}_i=[\boldsymbol x_i^T,1]^T=[x_i^{(1)},\cdots,x_i^{(n)},1]^T$, equation (1) amounts to:
$$\begin{cases}
\dfrac{\partial l(\boldsymbol\beta)}{\partial\boldsymbol w}=-\displaystyle\sum_{i=1}^N\left[c_i-y(\boldsymbol x_i)\right]\boldsymbol x_i&(2)\\[2ex]
\dfrac{\partial l(\boldsymbol\beta)}{\partial b}=-\displaystyle\sum_{i=1}^N\left[c_i-y(\boldsymbol x_i)\right]&(3)
\end{cases}$$
Considering each component $x_i^{(j)}$ of $\boldsymbol x_i=[x_i^{(1)},\cdots,x_i^{(n)}]^T$, equation (2) can also be written as:
$$\dfrac{\partial l(\boldsymbol\beta)}{\partial w_j}=-\sum_{i=1}^N\left[c_i-y(\boldsymbol x_i)\right]x_i^{(j)}$$
The gradient of the loss function can then be written as:
$$\dfrac{\partial l(\boldsymbol\beta)}{\partial\boldsymbol\beta}=\left[\dfrac{\partial l(\boldsymbol\beta)}{\partial w_1},\cdots,\dfrac{\partial l(\boldsymbol\beta)}{\partial w_n},\dfrac{\partial l(\boldsymbol\beta)}{\partial b}\right]^T$$
Therefore, the gradient-descent update rule is ($\alpha$ is the step size):
$$\boldsymbol\beta^{t+1}=\boldsymbol\beta^{t}-\alpha\dfrac{\partial l(\boldsymbol\beta)}{\partial\boldsymbol\beta}$$
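Combining equation (1) with this update, one gradient-descent iteration fits in a few vectorized lines. This is a sketch in the notation above; the full training loop appears in section 6:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def gd_step(beta, xhat, c, alpha):
    '''One update beta <- beta - alpha * dl/dbeta,
    with dl/dbeta = -sum_i [c_i - y(x_i)] xhat_i (equation (1)).'''
    y = sigmoid(xhat @ beta)    # y(x_i) for every sample
    grad = -xhat.T @ (c - y)    # equation (1), vectorized
    return beta - alpha * grad
```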
4.2 Newton's Method
Newton's method solves the optimization problem by setting the derivative of the second-order Taylor approximation at the current search point to $0$. Besides the gradient, it also requires the inverse of the Hessian matrix.
We have already derived $\dfrac{\partial l(\boldsymbol\beta)}{\partial\boldsymbol\beta}=-\displaystyle\sum_{i=1}^N\left[c_i-y(\boldsymbol x_i)\right]\hat{\boldsymbol x}_i$,
so the Hessian matrix is (the term $c_i\hat{\boldsymbol x}_i$ is constant in $\boldsymbol\beta$ and vanishes under differentiation):
$$\begin{aligned}
\dfrac{\partial}{\partial\boldsymbol\beta^T}\left(\dfrac{\partial l(\boldsymbol\beta)}{\partial\boldsymbol\beta}\right)&=\dfrac{\partial}{\partial\boldsymbol\beta^T}\left(-\sum_{i=1}^N\left[c_i-y(\boldsymbol x_i)\right]\hat{\boldsymbol x}_i\right)\\
&=\dfrac{\partial}{\partial\boldsymbol\beta^T}\left(\sum_{i=1}^N y(\boldsymbol x_i)\hat{\boldsymbol x}_i\right)\\
&=\sum_{i=1}^N y(\boldsymbol x_i)\left[1-y(\boldsymbol x_i)\right]\hat{\boldsymbol x}_i\dfrac{\partial}{\partial\boldsymbol\beta^T}\left(\boldsymbol\beta^T\hat{\boldsymbol x}_i\right)\\
&=\sum_{i=1}^N y(\boldsymbol x_i)\left[1-y(\boldsymbol x_i)\right]\hat{\boldsymbol x}_i\hat{\boldsymbol x}_i^T
\end{aligned}$$
Therefore, the Newton update rule is:
$$\boldsymbol\beta^{t+1}=\boldsymbol\beta^{t}-\left(\dfrac{\partial^2 l(\boldsymbol\beta)}{\partial\boldsymbol\beta\,\partial\boldsymbol\beta^T}\right)^{-1}\dfrac{\partial l(\boldsymbol\beta)}{\partial\boldsymbol\beta}$$
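The article only implements gradient descent (section 6), but a minimal Newton-step sketch in the same notation could look as follows; `np.linalg.solve` is used instead of explicitly forming the inverse Hessian:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def newton_step(beta, xhat, c):
    '''One Newton update beta <- beta - H^{-1} * dl/dbeta.'''
    y = sigmoid(xhat @ beta)          # y(x_i), shape (N,)
    grad = -xhat.T @ (c - y)          # gradient, as derived in 4.1
    r = y * (1 - y)                   # y(x_i)[1 - y(x_i)]
    H = (xhat * r[:, None]).T @ xhat  # sum_i r_i * xhat_i xhat_i^T
    return beta - np.linalg.solve(H, grad)
```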
5. Training Procedure
If gradient descent is used to solve for the model parameters, the training steps are as follows:
1) Randomly choose an initial value $\boldsymbol\beta^0$ for $\boldsymbol\beta=[\boldsymbol w^T,b]^T$.
2) Choose a step size $\alpha$ and iterate the following formulas until the stopping criterion is met:
$$\begin{aligned}
\dfrac{\partial l(\boldsymbol\beta)}{\partial\boldsymbol\beta}&=-\sum_{i=1}^N\left[c_i-y(\boldsymbol x_i)\right]\hat{\boldsymbol x}_i\\
&=-\sum_{i=1}^N\left[c_i-y(\boldsymbol x_i)\right]\begin{bmatrix}\boldsymbol x_i\\1\end{bmatrix}
\end{aligned}$$
$$\boldsymbol\beta^{t+1}=\boldsymbol\beta^{t}-\alpha\dfrac{\partial l(\boldsymbol\beta)}{\partial\boldsymbol\beta}$$
6. Implementation (Binary Classification)
1) Define the sigmoid function
```python
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(x):
    '''Sigmoid function'''
    return 1.0 / (1 + np.exp(-x))
```
2) Functions for generating and loading the training/testing data
Assume the dataset, generated on the 2-d plane $\mathbb R^2$, is stored in the format $(\boldsymbol x_i,y_i)=(x_i^{(1)},x_i^{(2)},y_i),\ y_i\in\{0,1\}$:
```
3.562302,25.329208,1.000000
-24.268267,1.272092,1.000000
25.405790,8.463017,1.000000
-6.908775,23.298889,1.000000
40.621010,-25.134052,0.000000
-9.305521,14.983097,1.000000
20.041330,-25.381725,0.000000
37.298540,-26.767307,0.000000
35.856177,-31.080316,0.000000
-17.976889,4.244106,1.000000
...
```
Generate a 2-d Gaussian dataset with two centers:
```python
def gen_gausssian(mean1, mean2, cov1, cov2, num):
    '''Generate a 2-d Gaussian dataset with 2 clusters'''
    # Positive class: num points around mean1, labeled 1
    data1 = np.random.multivariate_normal(mean1, cov1, num)
    label1 = np.ones((1, num)).T
    data_pos = np.append(data1, label1, axis=1)
    # Negative class: num points around mean2, labeled 0
    data2 = np.random.multivariate_normal(mean2, cov2, num)
    label2 = np.zeros((1, num)).T
    data_neg = np.append(data2, label2, axis=1)
    # Merge the two classes and shuffle the rows
    data = np.append(data_pos, data_neg, axis=0)
    shuffle_data = np.random.permutation(data)
    # Scatter plot of both clusters with their centers marked
    x1, y1 = data1.T
    x2, y2 = data2.T
    plt.scatter(x1, y1, c='r', s=3)
    plt.plot(mean1[0], mean1[1], 'ko')
    plt.scatter(x2, y2, c='b', s=3)
    plt.plot(mean2[0], mean2[1], 'ko')
    plt.axis()
    plt.title("2-d gaussian dataset with 2 clusters")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
    np.savetxt('gaussdata.txt', shuffle_data, fmt='%f', delimiter=',')
    return shuffle_data, data_pos, data_neg
```
(Figure: scatter plot of the generated dataset, as produced by the plotting code above.)
Load the training data saved in the format $(\boldsymbol x_i,y_i)=(x_i^{(1)},x_i^{(2)},y_i),\ y_i\in\{0,1\}$ and return it as a NumPy array:
```python
def load_data(filename):
    '''Load data of the training or testing set'''
    tdata = []
    with open(filename) as f:
        while True:
            line = f.readline()
            if not line:
                break
            line = line.split(',')
            tdata.append([float(item) for item in line])
    return np.array(tdata)
```
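Since the file is plain comma-separated floats, the same array can presumably be obtained with a single NumPy call (an alternative sketch, not the original code):

```python
def load_data(filename):
    '''Equivalent loader using np.loadtxt.'''
    return np.loadtxt(filename, delimiter=',')
```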
3) Iterate the gradient update of equation (1) and print the training loss after each iteration
```python
def lr_train(xhat, c, alpha, num):
    '''Gradient-descent training: beta <- beta + alpha * sum_i [c_i - y(x_i)] xhat_i'''
    beta = np.random.rand(3, 1)   # 2 weights + 1 bias for the 2-d data
    for i in range(num):
        yx = sigmoid(np.dot(xhat, beta))                 # y(x_i) for all samples
        beta = beta + alpha * np.dot(xhat.T, (c - yx))   # step along the negative gradient
        print('#' + str(i) + ',training loss:' + str(train_loss(c, yx)))
    return beta
```
The loss function (error value) is computed from the formula:
$$-\ln L(\boldsymbol w,b)=-\sum_{i=1}^N\left\{c_i\ln\left[y(\boldsymbol x_i)\right]+(1-c_i)\ln\left[1-y(\boldsymbol x_i)\right]\right\}$$
```python
def train_loss(c, yx):
    '''Loss (negative log-likelihood) over the training set'''
    err = 0.0
    for i in range(len(yx)):
        # Skip saturated predictions to avoid log(0)
        if yx[i, 0] > 0 and (1 - yx[i, 0]) > 0:
            err -= c[i, 0] * np.log(yx[i, 0]) + (1 - c[i, 0]) * np.log(1 - yx[i, 0])
    return err
```
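The `if` guard above avoids `log(0)` by silently skipping saturated predictions. A numerically stable alternative (a sketch, not the original code) computes the loss from the logits $z_i=\boldsymbol\beta^T\hat{\boldsymbol x}_i$ via `np.logaddexp`, using $-\ln L=\sum_i\left[\ln(1+e^{z_i})-c_iz_i\right]$:

```python
def train_loss_stable(c, z):
    '''Negative log-likelihood computed from the logits z = xhat @ beta.'''
    # np.logaddexp(0, z) evaluates log(1 + e^z) without overflow
    return float(np.sum(np.logaddexp(0.0, z) - c * z))
```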
Main program:
```python
mean1 = [3, -1]
cov1 = [[5, 0], [0, 10]]
mean2 = [-5, 7]
cov2 = [[10, 0], [0, 5]]
data, data_pos, data_neg = gen_gausssian(mean1, mean2, cov1, cov2, 1100)  # 2200 samples in total

# First 2000 samples for training: build xhat_i = [x_i^T, 1]^T
training_data = data
tmp1 = training_data[0:2000, 0:2]
tmp2 = np.ones((2000, 1))
xhat = np.concatenate((tmp1, tmp2), axis=1)
target = training_data[0:2000, 2:]
beta = lr_train(xhat, target, 0.01, 100)
print('beta:\n', beta)

# Remaining 200 samples for testing
tmp1 = training_data[2000:2200, 0:2]
tmp2 = np.ones((200, 1))
testing_data = np.concatenate((tmp1, tmp2), axis=1)
target = training_data[2000:2200, 2:]
y1 = classification(testing_data, beta)  # classification() is defined below
print(np.abs(y1 - target).T)
```
Training on a dataset of 2000 samples produces the following output:

```
#0,training loss:2767.7605301149197
#1,training loss:28706.32704095256
#2,training loss:24304.21071966826
#3,training loss:20729.807928831706
#4,training loss:18031.980567667095
#5,training loss:15793.907613945637
#6,training loss:13876.408972848896
#7,training loss:12260.25776604957
#8,training loss:10857.914022333194
#9,training loss:9702.173760769088
#10,training loss:8739.995403737194
#11,training loss:7909.116254144592
#12,training loss:7237.015743718265
#13,training loss:6581.515845960798
#14,training loss:6155.195323818418
#15,training loss:5782.624246205244
#16,training loss:5451.120323877512
#17,training loss:5159.985309063984
#18,training loss:4921.653117909279
#19,training loss:4728.055485820308
#20,training loss:4546.101559000789
#21,training loss:4368.003415240011
#22,training loss:4196.188712568878
#23,training loss:4032.1962049440162
#24,training loss:3876.771728177838
#25,training loss:3694.5715060625985
#26,training loss:3554.8126869561006
#27,training loss:3418.8100373192524
#28,training loss:3321.3029188728215
#29,training loss:3189.8131265721095
#30,training loss:3060.4306284382133
#31,training loss:2932.529506577584
#32,training loss:2807.0843716420854
#33,training loss:2684.4698423911955
#34,training loss:2564.789169175422
#35,training loss:2447.909093126709
#36,training loss:2333.712055985516
#37,training loss:2222.305120198585
#38,training loss:2114.0629245811747
#39,training loss:2009.9696271327145
#40,training loss:1911.4641438101942
#41,training loss:1818.4131000336629
#42,training loss:1731.1576524394175
#43,training loss:1648.321160807572
#44,training loss:1568.548376402433
#45,training loss:1491.2975705058457
#46,training loss:1416.3652001741157
#47,training loss:1343.7359069149327
#48,training loss:1273.1915049002964
#49,training loss:1204.2529637870934
#50,training loss:1136.9266223350025
#51,training loss:1071.3457943359633
#52,training loss:1007.323134162851
#53,training loss:944.9219846916478
#54,training loss:885.1816608689702
#55,training loss:899.9599868299116
#56,training loss:845.0057775193546
#57,training loss:793.5441317445959
#58,training loss:745.7136807432933
#59,training loss:701.6553680843865
#60,training loss:696.1322808438866
#61,training loss:689.6549290071879
#62,training loss:648.194522863791
#63,training loss:608.4750616604265
#64,training loss:570.7085584125251
#65,training loss:535.7643996771293
#66,training loss:504.1081114497143
#67,training loss:475.508191984496
#68,training loss:450.66041076529723
#69,training loss:429.67911785665814
#70,training loss:411.2956082830516
#71,training loss:394.87591024838343
#72,training loss:380.24926997738123
#73,training loss:403.36316867372153
#74,training loss:391.38976162245194
#75,training loss:381.3801916299097
#76,training loss:372.99093649597427
#77,training loss:366.0959147066188
#78,training loss:360.56881602093216
#79,training loss:355.8694556375197
#80,training loss:351.9153236373713
#81,training loss:348.4531107835549
#82,training loss:345.4202325927137
#83,training loss:342.7408477041635
#84,training loss:340.38193613578653
#85,training loss:338.2928279212097
#86,training loss:336.440189142936
#87,training loss:334.7845603199353
#88,training loss:333.29353615870866
#89,training loss:331.9350276518051
#90,training loss:330.68392791191314
#91,training loss:329.51840299766616
#92,training loss:328.4199642611809
#93,training loss:327.3721997842567
#94,training loss:326.36511808367675
#95,training loss:325.3868163444934
#96,training loss:324.43080556455044
#97,training loss:323.49135041237633
#98,training loss:322.5630405032738
#99,training loss:321.64316225672036
beta:
[[ 5.96987205]
[-6.41668657]
[30.84845393]]
```

This `beta` is the estimate of $\boldsymbol\beta=[\boldsymbol w^T,b]^T$ obtained by gradient descent. The `classification` function used in the main program thresholds the posterior probability at $0.5$:
```python
def classification(testing_data, beta):
    '''Assign label 1 if the posterior p(c=1|x) >= 0.5, else 0'''
    y = sigmoid(np.dot(testing_data, beta))
    for i in range(len(y)):
        if y[i, 0] < 0.5:
            y[i, 0] = 0.0
        else:
            y[i, 0] = 1.0
    return y
```
Classification results: of the 200 test samples, 2 are misclassified (the two `1.` entries below):
```
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]]
```
0. 0. 0. 0. 0. 0. 0. 0.]]