1 The Primal Problem
Notation: the data are $\{(x_i,y_i)\}_{i=1}^N$, $x_i\in \mathbb{R}^p$, $y_i\in \{1,-1\}$.
The SVM then becomes the following optimization problem:
$$(1)\quad\left\{\begin{array}{l}\min\ \frac{1}{2}w^T w \\
\text{s.t.}\ \ y_i(w^Tx_i+b)\geq 1\Longleftrightarrow 1-y_i(w^Tx_i+b)\leq 0
\end{array}\right.$$
Let $\mathcal{L}(w,b,\alpha)=\frac{1}{2}w^Tw+\sum\limits_{i=1}^N \alpha_i(1-y_i(w^Tx_i+b))$,
where $\alpha_i\geq 0$ and $1-y_i(w^Tx_i+b)\leq 0$.
We now prove that problem (1) is equivalent to the following problem $(P)$.
$$(P)\quad\left\{\begin{array}{l}\min\limits_{w,b}\max\limits_{\alpha}\mathcal{L}(w,b,\alpha)\\
\text{s.t.}\ \ \alpha_i\geq 0\end{array}\right.$$
Proof: consider two cases.
If $1-y_i(w^Tx_i+b)>0$ for some $i$, then since $\alpha_i\geq 0$ can be taken arbitrarily large,
$\max\limits_{\alpha}\mathcal{L}(w,b,\alpha)=\frac{1}{2}w^Tw+\infty=\infty$.
If $1-y_i(w^Tx_i+b)\leq 0$ for all $i$, the maximum over $\alpha_i\geq 0$ is attained with $\alpha_i(1-y_i(w^Tx_i+b))=0$, so $\max\limits_{\alpha}\mathcal{L}(w,b,\alpha)=\frac{1}{2}w^Tw+0=\frac{1}{2}w^Tw$.
Combining the two cases, the outer minimization discards the infinite values, so $\min\limits_{w,b}\max\limits_{\alpha}\mathcal{L}(w,b,\alpha)=\min\limits_{w,b}\{\infty,\frac{1}{2}w^Tw\}=\min\limits_{w,b}\frac{1}{2}w^Tw$ subject to the original constraints. Q.E.D.
2 The Dual Problem
Weak duality theorem: $\min\limits_{w,b}\max\limits_{\alpha}\mathcal{L}(w,b,\alpha)\geq \max\limits_{\alpha}\min\limits_{w,b}\mathcal{L}(w,b,\alpha)$.
Strong duality theorem: $\min\limits_{w,b}\max\limits_{\alpha}\mathcal{L}(w,b,\alpha)= \max\limits_{\alpha}\min\limits_{w,b}\mathcal{L}(w,b,\alpha)$.
Since the objective of the primal problem $(P)$ is a quadratic (convex) function and the constraints are linear, strong duality holds for $(P)$.
That is, the primal problem $(P)$ is equivalent to the following dual problem (D):
$$(D)\quad\left\{\begin{array}{l}\max\limits_{\alpha}\min\limits_{w,b}\mathcal{L}(w,b,\alpha)\\
\text{s.t.}\ \ \alpha_i\geq 0\end{array}\right.$$
3 Solving the Dual Problem
$$(D)\quad\left\{\begin{array}{l}\max\limits_{\alpha}\min\limits_{w,b}\mathcal{L}(w,b,\alpha)\\
\text{s.t.}\ \ \alpha_i\geq 0\end{array}\right.$$
We first compute $\min\limits_{w,b}\mathcal{L}(w,b,\alpha)$, which requires differentiating $\mathcal{L}(w,b,\alpha)$ with respect to $w$ and $b$, as follows.
$$\frac{\partial \mathcal{L}}{\partial b}=\frac{\partial}{\partial b}\Big(-\sum\limits_{i=1}^N\alpha_i y_i b\Big)=-\sum\limits_{i=1}^N\alpha_i y_i=0,$$
i.e. $\sum\limits_{i=1}^N\alpha_i y_i=0$.
Substituting this into $\mathcal{L}(w,b,\alpha)$ gives
$$\begin{array}{ll}\mathcal{L}(w,b,\alpha)&=\frac{1}{2}w^Tw+\sum\limits_{i=1}^N\alpha_i-\sum\limits_{i=1}^N\alpha_i y_i(w^Tx_i+b)\\
&=\frac{1}{2}w^Tw+\sum\limits_{i=1}^N\alpha_i-\sum\limits_{i=1}^N\alpha_i y_iw^Tx_i-\sum\limits_{i=1}^N\alpha_i y_ib\\
&=\frac{1}{2}w^Tw+\sum\limits_{i=1}^N\alpha_i-\sum\limits_{i=1}^N\alpha_i y_iw^Tx_i.\end{array}$$
$$\frac{\partial\mathcal{L}}{\partial w}=\frac{1}{2}\cdot 2\cdot w-\sum\limits_{i=1}^N\alpha_iy_ix_i=0\Longrightarrow w=\sum\limits_{i=1}^N\alpha_iy_ix_i.$$
Substituting this back into $\mathcal{L}$ gives
$$\begin{array}{ll}\min\limits_{w,b}\mathcal{L}(w,b,\alpha)&=\frac{1}{2}(\sum\limits_{i=1}^N\alpha_i y_ix_i)^T(\sum\limits_{i=1}^N\alpha_i y_ix_i)-\sum\limits_{i=1}^N\alpha_i y_i(\sum\limits_{j=1}^N\alpha_j y_jx_j)^T x_i+\sum\limits_{i=1}^N\alpha_i\\
&=\frac{1}{2}\sum\limits_{i=1}^N\alpha_i y_ix_i^T\sum\limits_{j=1}^N\alpha_jy_jx_j-\sum\limits_{i=1}^N\alpha_i y_i\sum\limits_{j=1}^N\alpha_jy_jx_j^Tx_i+\sum\limits_{i=1}^N\alpha_i \\
&=\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i \alpha_jy_iy_jx_i^Tx_j-\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i \alpha_jy_iy_jx_j^Tx_i+\sum\limits_{i=1}^N\alpha_i \\
&=-\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i \alpha_jy_iy_jx_i^Tx_j+\sum\limits_{i=1}^N\alpha_i,\end{array}$$
where we used $x_i^Tx_j=x_j^Tx_i$.
In summary, problem (D) is equivalent to the following problem $(D_1)$:
$$(D_1)\quad\left\{\begin{array}{l}
\max\limits_{\alpha}-\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i \alpha_jy_iy_jx_i^Tx_j+\sum\limits_{i=1}^N\alpha_i \\
\text{s.t.}\ \ \alpha_i \geq 0\\
\qquad \sum\limits_{i=1}^N\alpha_i y_i=0\end{array}\right.$$
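To make $(D_1)$ concrete, here is a minimal numerical sketch (assumptions: numpy and scipy are available, and the toy data below are mine, not from the original post) that solves the dual with a general-purpose solver; in practice one would use a dedicated QP solver or SMO (Section 6):

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative assumption).
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # negate the dual objective so that maximization becomes minimization
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

cons = [{'type': 'eq', 'fun': lambda a: a @ y}]  # sum_i alpha_i y_i = 0
bounds = [(0, None)] * N                         # alpha_i >= 0
res = minimize(neg_dual, np.zeros(N), bounds=bounds, constraints=cons)
alpha = res.x
print("alpha =", np.round(alpha, 4))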
In fact, SVMs routinely use a kernel function to map the inputs into a high-dimensional space; the plain inner product is just one particular kernel. Replacing the inner product $x_i^Tx_j$ of $x_i,x_j$ with $K(x_i,x_j)=\phi(x_i)^T\phi(x_j)$ turns $(D_1)$ into:
$$(D_1')\quad\left\{\begin{array}{l}
\max\limits_{\alpha}-\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i \alpha_jy_iy_jK(x_i,x_j)+\sum\limits_{i=1}^N\alpha_i \\
\text{s.t.}\ \ \alpha_i \geq 0\\
\qquad \sum\limits_{i=1}^N\alpha_i y_i=0\end{array}\right.$$
In this case, analogously to the above, we have $w=\sum\limits_{i=1}^N\alpha_iy_i\phi(x_i)$.
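For instance, the common RBF (Gaussian) kernel $K(x_i,x_j)=\exp(-\|x_i-x_j\|^2/(2\sigma^2))$ can replace the Gram matrix X @ X.T in the sketch above. A minimal implementation (sigma is a user-chosen hyperparameter, an assumption of this sketch):

def rbf_gram(X, sigma=1.0):
    # squared distances via ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))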
4 KKT Conditions
Consider a constrained optimization problem of the following general form:
$$\left\{\begin{array}{l}\min f(x),\\
\text{s.t.}\ g_i(x)\leq 0,\ i=1,2,\cdots,m,\\
\qquad h_j(x)=0,\ j=1,2,\cdots,\ell.\end{array}\right.$$
Suppose $x^{*}$ is a local optimum of the problem and a suitable constraint qualification holds at $x^*$. Then there exist multipliers $\alpha, \mu$ such that
$$\begin{array}{r}
\nabla f(x^*)+\sum\limits_{i=1}^m\alpha_i \nabla g_i(x^*)-\sum\limits_{j=1}^{\ell}\mu_j\nabla h_j(x^*)=0,\\
\alpha_i \geq 0,\ i=1,2,\cdots,m,\\
g_i(x^*)\leq 0,\ i=1,2,\cdots,m,\\
h_j(x^*)=0,\ j=1,2,\cdots,\ell,\\
\alpha_i g_i(x^*)=0,\ i=1,2,\cdots,m.\end{array}$$
These five conditions are known as the Karush-Kuhn-Tucker (KKT) conditions.
For the SVM dual problem:
$$(D)\quad\left\{\begin{array}{l}\max\limits_{\alpha}\min\limits_{w,b}\mathcal{L}(w,b,\alpha)\\
\text{s.t.}\ \ \alpha_i \geq 0\end{array}\right.$$
the KKT conditions corresponding to $\min\limits_{w,b}\mathcal{L}(w,b,\alpha)$ are:
$$\left\{\begin{array}{l}
\frac{\partial \mathcal{L}}{\partial w}=0,\ \frac{\partial \mathcal{L}}{\partial b}=0,\\
1-y_i(w^Tx_i+b)\leq 0,\\
\alpha_i \geq 0,\\
\alpha_i (1-y_i(w^Tx_i+b))=0.\end{array}\right.$$
The last condition is called complementary slackness.
Note: for a point with $1-y_i(w^Tx_i+b)<0$, complementary slackness can only hold if $\alpha_i=0$. That is, $\alpha_i=0$ for every non-support vector; only the support vectors actually contribute to the objective $\mathcal{L}$.
We have already derived $w^*=\sum\limits_{i=1}^N\alpha_iy_ix_i$; we now solve for $b$.
By the analysis above, there exists a support vector $(x_k,y_k)$ with $1-y_k(w^Tx_k+b)=0$, i.e. $y_k(w^Tx_k+b)=1\Rightarrow y_k^2(w^Tx_k+b)=w^Tx_k+b=y_k$, using $y_k^2=1$.
Therefore $b^*=y_k-w^Tx_k=y_k-\sum\limits_{i=1}^N\alpha_iy_ix_i^Tx_k$.
Note: (1) $\alpha$ vanishes on all non-support vectors; nonzero $\alpha_i$ occur only at support vectors.
(2) When a kernel is used, $w^*=\sum\limits_{i=1}^N\alpha_iy_i\phi(x_i)$ and $b^*=y_k-\sum\limits_{i=1}^N\alpha_iy_iK(x_i,x_k)$.
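Continuing the earlier numerical sketch (linear kernel), $w^*$ and $b^*$ can be recovered from the solved alpha like this; the 1e-6 threshold for detecting support vectors is an assumption:

sv = alpha > 1e-6            # support vectors: alpha_i clearly nonzero
w = (alpha * y) @ X          # w* = sum_i alpha_i y_i x_i
k = int(np.argmax(sv))       # index of any support vector
b = y[k] - w @ X[k]          # b* = y_k - w*^T x_k
print("w* =", w, " b* =", float(b))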
We now have expressions for $w^*$ and $b^*$, but both involve the unknown multipliers $\alpha$, so $\alpha$ must be computed first. This is the job of the SMO algorithm.
5 Soft-Margin SVM
After introducing slack variables $\xi_i$, the model becomes:
$$(2)\quad\left\{\begin{array}{l}\min\ \frac{1}{2}w^T w+C\sum\limits_{i=1}^N\xi_i \\
\text{s.t.}\ \ y_i(w^Tx_i+b)\geq 1-\xi_i\\
\qquad \xi_i\geq 0
\end{array}\right.$$
This is converted into:
$$(P_2)\quad\left\{\begin{array}{l}\min\limits_{w,b,\xi}\max\limits_{\alpha,\mu}\mathcal{L}(w,b,\alpha,\mu)=\frac{1}{2}w^Tw+C\sum\limits_{i=1}^N\xi_i-\sum\limits_{i=1}^N\alpha_i(y_i(w^Tx_i+b)-1+\xi_i)-\sum\limits_{i=1}^N\mu_i\xi_i\\
\text{s.t.}\ \ \forall i,\ \alpha_i\geq 0\\
\qquad \forall i,\ \mu_i\geq 0
\end{array}\right.$$
The corresponding dual problem is:
$$(D_2)\quad\left\{\begin{array}{l}\max\limits_{\alpha,\mu}\min\limits_{w,b,\xi}\mathcal{L}(w,b,\alpha,\mu)=\frac{1}{2}w^Tw+C\sum\limits_{i=1}^N\xi_i-\sum\limits_{i=1}^N\alpha_i(y_i(w^Tx_i+b)-1+\xi_i)-\sum\limits_{i=1}^N\mu_i\xi_i\\
\text{s.t.}\ \ \forall i,\ \alpha_i\geq 0\\
\qquad \forall i,\ \mu_i\geq 0
\end{array}\right.$$
The KKT conditions corresponding to $\min\limits_{w,b,\xi}\mathcal{L}(w,b,\alpha,\mu)$ are:
$$\left\{\begin{array}{l}
\frac{\partial \mathcal{L}}{\partial w}=0,\ \frac{\partial \mathcal{L}}{\partial b}=0,\ \frac{\partial \mathcal{L}}{\partial \xi}=0\\
y_i(w^Tx_i+b)-1+\xi_i\geq 0\\
\xi_i\geq 0\\
\alpha_i\geq 0,\ \mu_i\geq 0\\
\alpha_i(y_i(w^Tx_i+b)-1+\xi_i)=0\\
\mu_i\xi_i=0\end{array}\right.$$
The first condition expands to:
$$w=\sum\limits_{i=1}^N\alpha_iy_ix_i,\quad 0=\sum\limits_{i=1}^N\alpha_iy_i,\quad \alpha_i=C-\mu_i.$$
Substituting these equalities back in yields the following equivalent problem:
$$(D_2')\quad\left\{\begin{array}{l}
\max\limits_{\alpha}-\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^N\alpha_i\alpha_jy_iy_jx_i^Tx_j+\sum\limits_{i=1}^N\alpha_i\\
\text{s.t.}\ \ 0\leq \alpha_i\leq C\\
\qquad \sum\limits_{i=1}^N\alpha_iy_i=0\end{array}\right.$$
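In the numerical sketch from Section 3, the only change needed for the soft-margin dual $(D_2')$ is the box constraint on alpha; C is a user-chosen penalty parameter (an assumption of the sketch):

C = 1.0
bounds = [(0, C)] * N  # 0 <= alpha_i <= C replaces alpha_i >= 0
res = minimize(neg_dual, np.zeros(N), bounds=bounds, constraints=cons)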
Let $f(x)=w^Tx+b$. Reasoning backwards from the results obtained above, we get the following chains:
$$\begin{array}{l}
\alpha_i=0\rightarrow \mu_i=C\rightarrow \xi_i=0\rightarrow y_if(x_i)\geq 1\\
0<\alpha_i<C\rightarrow \mu_i>0\rightarrow \xi_i=0\rightarrow \alpha_i(y_if(x_i)-1)=0\rightarrow y_if(x_i)=1\\
\alpha_i=C\rightarrow \mu_i=0\rightarrow \xi_i\geq 0\rightarrow \alpha_i(y_if(x_i)-1+\xi_i)=0\rightarrow y_if(x_i)\leq 1.\end{array}$$
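These three cases can be read off a solved alpha directly. A small sketch (the tolerance tol is an assumption, there to absorb numerical error):

def margin_status(alpha_i, C, tol=1e-6):
    # interprets one multiplier according to the three chains above
    if alpha_i < tol:
        return "alpha=0: on or outside the margin, y*f(x) >= 1"
    if alpha_i > C - tol:
        return "alpha=C: inside the margin or misclassified, y*f(x) <= 1"
    return "0<alpha<C: exactly on the margin, y*f(x) = 1"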
6 The SMO Algorithm
The question now is how to solve the following optimization problem quickly.
$$(D_3)\quad\left\{\begin{array}{l}
\min\limits_{\alpha}\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^Ny_iy_jx_i^Tx_j\alpha_i\alpha_j-\sum\limits_{i=1}^N\alpha_i\\
\text{s.t.}\ \ 0\leq \alpha_i\leq C\\
\qquad \sum\limits_{i=1}^Ny_i\alpha_i=0\end{array}\right.$$
Written in kernel form:
$$(D_3')\quad\left\{\begin{array}{l}
\min\limits_{\alpha}\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^Ny_iy_jK(x_i,x_j)\alpha_i\alpha_j-\sum\limits_{i=1}^N\alpha_i\\
\text{s.t.}\ \ 0\leq \alpha_i\leq C\\
\qquad \sum\limits_{i=1}^Ny_i\alpha_i=0\end{array}\right.$$
Convex optimization problems with inequality constraints are usually solved by interior-point methods. But an interior-point method is too expensive here: it needs to store an $N^2$ matrix, which is infeasible when memory is limited; each full gradient evaluation is costly; and there are additional numerical issues. SMO, by contrast, is extremely effective on this quadratic problem. At each step it selects two Lagrange multipliers $\alpha_i,\alpha_j$, minimizes the objective with respect to them while the others are fixed, and updates them. With all other multipliers fixed, $\alpha_i,\alpha_j$ are related by:
$$\sum\limits_{i=1}^N\alpha_iy_i=0\ \Rightarrow\ \alpha_iy_i+\alpha_jy_j=-\sum\limits_{k\neq i,j}^N\alpha_ky_k.$$
Thus $\alpha_i$ can be expressed in terms of $\alpha_j$, and the subproblem reduces to a quadratic optimization in a single variable, which is very cheap to solve. SMO therefore consists of two parts: an outer loop that selects the two Lagrange multipliers, and an inner procedure that solves the two-variable quadratic subproblem. We first solve the two-multiplier subproblem, and then turn to how the multipliers are selected.
6.1 Computing $\alpha_i,\alpha_j$
From the constraint $\sum\limits_{i=1}^N\alpha_iy_i=0$ of the SVM dual we get:
$$\alpha_1y_1+\alpha_2y_2=-\sum\limits_{j=3}^N\alpha_jy_j\triangleq \zeta\ \Rightarrow\ \alpha_1=(\zeta-\alpha_2y_2)y_1.$$
To optimize over the two Lagrange multipliers, SMO first computes their feasible range and then solves the quadratic subproblem within that range. For convenience we write 1, 2 instead of $i,j$. The two variables satisfy $\alpha_1y_1+\alpha_2y_2=\zeta$, which constrains $(\alpha_1,\alpha_2)$ to a line segment inside the box $[0,C]\times[0,C]$.
We now work out the feasible range of $\alpha_2$ in this two-variable subproblem. If $y_1\neq y_2$, the lower bound $L$ and upper bound $H$ of $\alpha_2$ are:
$$L=\max(0,\alpha_2-\alpha_1),\quad H=\min(C,C+\alpha_2-\alpha_1).\qquad(13,\ \text{numbering from Platt's paper})$$
Conversely, if $y_1=y_2$, the lower bound $L$ and upper bound $H$ of $\alpha_2$ are:
$$L=\max(0,\alpha_2+\alpha_1-C),\quad H=\min(C,\alpha_2+\alpha_1).\qquad(14,\ \text{numbering from Platt's paper})$$
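A small helper implementing (13) and (14); the argument names are mine:

def compute_LH(alpha1, alpha2, y1, y2, C):
    # feasible interval [L, H] for alpha2 on the segment alpha1*y1 + alpha2*y2 = zeta
    if y1 != y2:  # equation (13)
        L = max(0.0, alpha2 - alpha1)
        H = min(C, C + alpha2 - alpha1)
    else:         # equation (14)
        L = max(0.0, alpha2 + alpha1 - C)
        H = min(C, alpha2 + alpha1)
    return L, H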
For problem $(D_3')$ above:
$$(D_3')\quad\left\{\begin{array}{l}
\min\limits_{\alpha}\frac{1}{2}\sum\limits_{i=1}^N\sum\limits_{j=1}^Ny_iy_jK(x_i,x_j)\alpha_i\alpha_j-\sum\limits_{i=1}^N\alpha_i\\
\text{s.t.}\ \ 0\leq \alpha_i\leq C\\
\qquad \sum\limits_{i=1}^Ny_i\alpha_i=0\end{array}\right.$$
First, split the objective into the part involving $\alpha_1$ and $\alpha_2$ and the part that does not involve them:
$$\begin{array}{rl}\min\Phi(\alpha_1,\alpha_2)&=\frac{1}{2}K_{11}\alpha_1^2+\frac{1}{2}K_{22}\alpha_2^2+y_1y_2K_{12}\alpha_1\alpha_2-(\alpha_1+\alpha_2)+\alpha_1y_1\sum\limits_{i=3}^N\alpha_iy_iK_{1i}+\alpha_2y_2\sum\limits_{j=3}^N\alpha_jy_jK_{2j}+c\\
&=\frac{1}{2}K_{11}\alpha_1^2+\frac{1}{2}K_{22}\alpha_2^2+y_1y_2K_{12}\alpha_1\alpha_2-(\alpha_1+\alpha_2)+\alpha_1y_1v_1+\alpha_2y_2v_2+c,\qquad (1)\end{array}$$
where $v_i=\sum\limits_{j=3}^N\alpha_jy_jK_{ij}\ (i=1,2)$ and $c$ is the part independent of $\alpha_1$ and $\alpha_2$, treated as a constant during this step.
Replacing every $\alpha_1$ in the objective by its expression in terms of $\alpha_2$ yields:
$$\begin{array}{rl}\min\Phi(\alpha_2)=&\frac{1}{2}K_{11}(\zeta-\alpha_2y_2)^2+\frac{1}{2}K_{22}\alpha_2^2+y_2K_{12}\alpha_2(\zeta-\alpha_2y_2)-\\
&y_1(\zeta-\alpha_2y_2)-\alpha_2+(\zeta-\alpha_2y_2)\sum\limits_{i=3}^N\alpha_iy_iK_{1i}+\alpha_2y_2\sum\limits_{j=3}^N\alpha_jy_jK_{2j}+c\\
=&\frac{1}{2}K_{11}(\zeta-\alpha_2y_2)^2+\frac{1}{2}K_{22}\alpha_2^2+y_2K_{12}\alpha_2(\zeta-\alpha_2y_2)-\\
&y_1(\zeta-\alpha_2y_2)-\alpha_2+(\zeta-\alpha_2y_2)v_1+\alpha_2y_2v_2+c.\qquad (2)
\end{array}$$
The objective now contains only the single variable $\alpha_2$. Taking the partial derivative with respect to $\alpha_2$ gives:
$$\begin{array}{rl}\frac{\partial \Phi(\alpha_2)}{\partial \alpha_2}=&(K_{11}+K_{22}-2K_{12})\alpha_2-K_{11}\zeta y_2+K_{12}\zeta y_2+y_1y_2-1-y_2\sum\limits_{i=3}^N\alpha_iy_iK_{1i}+y_2\sum\limits_{j=3}^N\alpha_jy_jK_{2j}\\
=&(K_{11}+K_{22}-2K_{12})\alpha_2-K_{11}\zeta y_2+K_{12}\zeta y_2+y_1y_2-1-y_2v_1+y_2v_2.\qquad (3)\end{array}$$
We earlier defined the SVM hyperplane model as $f(x)=w^Tx+b$. Substituting the derived expression for $w$ gives
$f(x)=\sum\limits_{i=1}^N\alpha_iy_iK(x_i,x)+b$. Here $f(x_i)$ is the predicted value for sample $x_i$ and $y_i$ is its true label; define the prediction error as $E_i=f(x_i)-y_i$. Then, since
$v_i=\sum\limits_{j=3}^N\alpha_jy_jK_{ij},\ i=1,2$, we have
$$\begin{array}{rl}
v_1&=f(x_1)-\sum\limits_{j=1}^2y_j\alpha_jK_{1j}-b=f(x_1)-y_1\alpha_1K_{11}-y_2\alpha_2K_{12}-b,\qquad (4)\\
v_2&=f(x_2)-\sum\limits_{j=1}^2y_j\alpha_jK_{2j}-b=f(x_2)-y_1\alpha_1K_{21}-y_2\alpha_2K_{22}-b.\qquad (5)\end{array}$$
Write the solution before this update as $\alpha_1^{old},\alpha_2^{old}$ and the updated solution as $\alpha_1^{new},\alpha_2^{new}$. From (1), (4), (5) we get:
$$\left\{\begin{array}{l}
\alpha_1^{old}y_1+\alpha_2^{old}y_2=\zeta\\
\sum\limits_{i=3}^N\alpha_iy_iK_{1i}=f(x_1)-y_1\alpha_1^{old}K_{11}-y_2\alpha_2^{old}K_{12}-b\\
\sum\limits_{j=3}^N\alpha_jy_jK_{2j}=f(x_2)-y_1\alpha_1^{old}K_{21}-y_2\alpha_2^{old}K_{22}-b\end{array}\right.$$
Substituting these three identities into the derivative above and setting it to zero gives:
$$\begin{array}{rl}\frac{\partial \Phi(\alpha_2)}{\partial \alpha_2}&=(K_{11}+K_{22}-2K_{12})\alpha_2-(K_{11}-K_{12})(\alpha_1^{old}y_1+\alpha_2^{old}y_2)y_2\\
&+y_1y_2-1-y_2(f(x_1)-y_1\alpha_1^{old}K_{11}-y_2\alpha_2^{old}K_{12}-b)\\
&+y_2(f(x_2)-y_1\alpha_1^{old}K_{21}-y_2\alpha_2^{old}K_{22}-b)=0.\end{array}$$
Simplifying, we obtain:
$$(K_{11}+K_{22}-2K_{12})\alpha_2=(K_{11}+K_{22}-2K_{12})\alpha_2^{old}+y_2[(f(x_1)-y_1)-(f(x_2)-y_2)].$$
Writing $\eta=K_{11}+K_{22}-2K_{12}$, we obtain the update formula for $\alpha_2$:
$$\alpha_2^{new}=\alpha_2^{old}+\frac{y_2(E_1-E_2)}{\eta}.$$
However, we must also respect the feasible range of $\alpha_2$, so the final new value of $\alpha_2$ is:
$$\alpha_2^{new,clipped}=\left\{\begin{array}{rl}
H&\text{if }\alpha_2^{new}\geq H\\
\alpha_2^{new}&\text{if } L<\alpha_2^{new}<H\\
L&\text{if }\alpha_2^{new}\leq L\end{array}\right.\qquad (6)$$
From $\alpha_1^{old}y_1+\alpha_2^{old}y_2=\alpha_1^{new}y_1+\alpha_2^{new}y_2=\zeta$ we obtain the update formula for $\alpha_1$:
$$\alpha_1^{new}=\alpha_1^{old}+y_1y_2(\alpha_2^{old}-\alpha_2^{new,clipped}).\qquad (7)$$
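Putting (6), (7) and the update formula together, one two-variable step looks roughly like this (a minimal sketch; K11, K12, K22 are kernel values, E1, E2 the cached errors, and compute_LH is the helper from above):

def smo_pair_update(alpha1, alpha2, y1, y2, E1, E2, K11, K12, K22, C):
    L, H = compute_LH(alpha1, alpha2, y1, y2, C)
    if L == H:
        return alpha1, alpha2           # no room to move on the segment
    eta = K11 + K22 - 2.0 * K12
    if eta <= 0:
        return alpha1, alpha2           # boundary case, see Section 6.2
    a2 = alpha2 + y2 * (E1 - E2) / eta  # unconstrained optimum
    a2 = min(max(a2, L), H)             # clip to [L, H], equation (6)
    a1 = alpha1 + y1 * y2 * (alpha2 - a2)  # equation (7)
    return a1, a2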
6.2 Boundary Cases for $\alpha_1,\alpha_2$
In most cases $\eta = K_{11}+K_{22}-2K_{12}>0$. In the following situations, however, $\alpha_2^{new}$ must be taken at one of the boundary values $L$ or $H$:
$\eta<0$: the kernel $K$ does not satisfy Mercer's condition (Mercer's theorem: any symmetric positive semi-definite function can serve as a kernel function), so the kernel matrix is not positive semi-definite;
$\eta=0$: samples $x_1$ and $x_2$ have identical input features.
This can also be understood as follows: differentiating (3) once more with respect to $\alpha_2$, i.e. taking the second derivative of $\Phi(\alpha_2)$ in (2), gives exactly $\eta = K_{11}+K_{22}-2K_{12}$.
When $\eta<0$, the objective is concave, has no interior minimum, and attains its minimum on the boundary of the feasible interval.
When $\eta=0$, the objective is monotone (linear in $\alpha_2$) and likewise attains its extremum on the boundary.
The computation goes as follows:
Substitute $\alpha_2^{new}=L$ and $\alpha_2^{new}=H$ into (7) to compute the corresponding $\alpha_1^{new}=L_1$ and $\alpha_1^{new}=H_1$, where $L_1=\alpha_1+s(\alpha_2-L)$, $H_1=\alpha_1+s(\alpha_2-H)$ and $s=y_1y_2$.
Substitute these into the objective (1) and compare $\Psi_L\triangleq\Psi(\alpha_1=L_1,\alpha_2=L)$ with $\Psi_H\triangleq\Psi(\alpha_1=H_1,\alpha_2=H)$; $\alpha_2$ is set to the boundary point whose objective value is smaller. $\Psi_L$ and $\Psi_H$ are computed as follows:
$$\begin{array}{rl}
\Psi_L&=L_1f_1+Lf_2+\frac{1}{2}L_1^2K_{11}+\frac{1}{2}L^2K_{22}+sLL_1K_{12},\\
\Psi_H&=H_1f_1+Hf_2+\frac{1}{2}H_1^2K_{11}+\frac{1}{2}H^2K_{22}+sHH_1K_{12},\end{array}$$
where:
$$\begin{array}{rl}
f_1&=y_1(E_1-b)-\alpha_1K_{11}-s\alpha_2K_{12},\\
f_2&=y_2(E_2-b)-s\alpha_1K_{12}-\alpha_2K_{22}.\end{array}$$
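A sketch of the boundary case: evaluate the objective at both endpoints and move alpha2 to the cheaper one (this follows the document's sign convention f(x)=w^Tx+b; the function and argument names are mine):

def endpoint_objectives(alpha1, alpha2, y1, y2, E1, E2, b, K11, K12, K22, L, H):
    s = y1 * y2
    f1 = y1 * (E1 - b) - alpha1 * K11 - s * alpha2 * K12
    f2 = y2 * (E2 - b) - s * alpha1 * K12 - alpha2 * K22
    L1 = alpha1 + s * (alpha2 - L)  # alpha1 when alpha2 = L
    H1 = alpha1 + s * (alpha2 - H)  # alpha1 when alpha2 = H
    psiL = L1 * f1 + L * f2 + 0.5 * L1**2 * K11 + 0.5 * L**2 * K22 + s * L * L1 * K12
    psiH = H1 * f1 + H * f2 + 0.5 * H1**2 * K11 + 0.5 * H**2 * K22 + s * H * H1 * K12
    return psiL, psiH  # alpha2 goes to L if psiL < psiH, to H if psiH < psiL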
7 Heuristic Variable Selection
The analysis above assumed the two variables to optimize had already been chosen from the $N$ variables. We now discuss how to choose the two variables efficiently, so that the objective decreases as fast as possible.
The main idea of the heuristic is: when picking a Lagrange multiplier, prefer an $\alpha_i$ with $0<\alpha_i<C$ (Platt's paper calls these non-bound examples), because multipliers at the bounds ($\alpha_i$ equal to $0$ or $C$) rarely change.
7.1 Choosing the First Variable
The choice of the first variable constitutes the outer loop. First sweep the whole training set and pick an $\alpha_i$ that violates the KKT conditions as the first variable (a KKT violation does not imply $0<\alpha_i<C$; multipliers at the bounds can also violate). Then pick the second variable according to the rules below (see Section 7.2), and optimize the pair using the method above.
After the sweep of the whole training set, sweep the non-bound set ($0<\alpha_i<C$), again choosing a KKT-violating $\alpha_i$ as the first variable, selecting the second variable by the same rules, and optimizing the pair.
After sweeping the non-bound set, return to sweeping the whole training set. That is, the algorithm alternates between the whole training set and the non-bound set, looking for KKT-violating $\alpha_i$ to use as the first variable, and terminates once a sweep of the whole training set finds no KKT violations.
In other words: first loop over all examples, updating any that violate the KKT conditions (whether at the bounds or not). If that pass does not converge, the next pass iterates only over the examples with $0<\alpha_i<C$.
In the above, violating the KKT conditions means violating one of the following three conditions.
$$\boxed{
\begin{array}{rl}
\alpha_i=0&\Rightarrow y^{(i)}(w^Tx^{(i)}+b)\geq 1\\
\alpha_i=C&\Rightarrow y^{(i)}(w^Tx^{(i)}+b)\leq 1\\
0<\alpha_i<C&\Rightarrow y^{(i)}(w^Tx^{(i)}+b)=1.
\end{array}}$$
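A sketch of this violation check (yfx stands for $y_i f(x_i)$; the tolerance mirrors the tol in Platt's pseudocode and is an assumption):

def violates_KKT(alpha_i, yfx, C, tol=1e-3):
    # alpha=0 requires y*f >= 1; alpha=C requires y*f <= 1; otherwise y*f = 1
    if alpha_i < tol:
        return yfx < 1.0 - tol
    if alpha_i > C - tol:
        return yfx > 1.0 + tol
    return abs(yfx - 1.0) > tol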
The code is as follows (from github.com/yhswjtuILMARE/Machine-Learning-Study-Notes):
import numpy as np

# osStruct and innerLoop are defined elsewhere in the same repository:
# osStruct caches the data, alphas, b and the error cache; innerLoop picks
# the second multiplier and performs the two-variable update.
def realSMO(trainSet, trainLabels, C, toler, kTup=('lin', 1.3), maxIter=40):
    obj = osStruct(trainSet, trainLabels, C, toler, kTup)
    entrySet = True          # True: sweep the full sample set
    iterNum = 0
    alphapairschanged = 0
    while (iterNum < maxIter) and (alphapairschanged > 0 or entrySet):
        print(iterNum)
        alphapairschanged = 0
        if entrySet:
            # outer loop over the whole training set
            for i in range(obj.m):
                alphapairschanged += innerLoop(obj, i)
                if i % 100 == 0:
                    print("full set loop, iter: %d, alphapairschanged: %d, iterNum: %d" % (i, alphapairschanged, iterNum))
            iterNum += 1
        else:
            # outer loop over the non-bound examples only (0 < alpha < C)
            validalphasList = np.nonzero((obj.alphas.A > 0) * (obj.alphas.A < C))[0]
            for i in validalphasList:
                alphapairschanged += innerLoop(obj, i)
                if i % 100 == 0:
                    print("non-bound set loop, iter: %d, alphapairschanged: %d, iterNum: %d" % (i, alphapairschanged, iterNum))
            iterNum += 1
        # alternate between full sweeps and non-bound sweeps
        if entrySet:
            entrySet = False
        elif alphapairschanged == 0:
            entrySet = True
    print("iter num: %d" % (iterNum))
    return obj.alphas, obj.b
7.2 Choosing the Second Variable
SMO calls the selection of the second variable the inner loop. Given the first $\alpha$ to optimize, how do we pick the other one? To keep the number of iterations small and convergence fast, SMO requires the two chosen $\alpha$'s to differ as much as possible. In the implementation an $\alpha$ is characterized by its prediction error, so "as different as possible" means maximizing the absolute difference of the two prediction errors.
The corresponding code is shown below (from github.com/yhswjtuILMARE/Machine-Learning-Study-Notes):
# calEi and selectJRand come from the same repository: calEi computes the
# prediction error E_k for sample k; selectJRand picks a random index != i.
def selectJIndex(obj, i, Ei):
    maxJ = -1
    maxdelta = -1
    Ek = -1
    obj.eCache[i] = [1, Ei]  # mark E_i as valid in the error cache
    validEiList = np.nonzero(obj.eCache[:, 0].A)[0]
    if len(validEiList) > 1:
        # choose j maximizing |E_i - E_j|
        for j in validEiList:
            if j == i:
                continue
            Ej = calEi(obj, j)
            delta = np.abs(Ei - Ej)
            if delta > maxdelta:
                maxdelta = delta
                maxJ = j
                Ek = Ej
    else:
        # error cache not initialized yet: pick j at random
        maxJ = selectJRand(i, obj.m)
        Ek = calEi(obj, maxJ)
    return Ek, maxJ
The code above reflects the following idea: given the index "i" of the first $\alpha$, search the error cache eCache for the index "j" whose error differs from $E_i$ by the largest absolute value, and take "j" as the index of the second $\alpha$. If the error cache eCache has not been initialized yet, choose the index of the second $\alpha$ uniformly at random.
Sometimes the second variable chosen by this heuristic does not decrease the objective sufficiently. In that case, proceed as follows:
first search the non-bound set for a sample that gives sufficient decrease and use it as the second variable;
if none exists in the non-bound set, search the whole training set for the second variable;
if the whole training set also fails, go back and choose a new first variable.
8 Computing the Threshold b
After each optimization of a pair of variables, $b$ must be updated, because $b$ enters the computation of $f(x)$ and hence of the errors $E_i$ in the next optimization step.
If $0<\alpha_1^{new}<C$, the KKT condition gives $y_1(w^Tx_1+b)=1$; multiplying both sides by $y_1$ gives $y_1=\sum\limits_{i=1}^N\alpha_iy_iK_{i1}+b$, whence
$$b_1^{new}=y_1-\sum\limits_{i=3}^N\alpha_iy_iK_{i1}-\alpha_1^{new}y_1K_{11}-\alpha_2^{new,clipped}y_2K_{21}.$$
Since $E_i=f(x_i)-y_i,\ i=1,2$, the first two terms can be rewritten as:
$$y_1-\sum\limits_{i=3}^N\alpha_iy_iK_{i1}=-E_1+\alpha_1^{old}y_1K_{11}+\alpha_2^{old}y_2K_{21}+b^{old},$$
and therefore
$$b_1=-E_1-y_1K_{11}(\alpha_1^{new}-\alpha_1^{old})-y_2K_{21}(\alpha_2^{new,clipped}-\alpha_2^{old})+b^{old}.$$
In this case $b^{new}=b_1$.
If $0<\alpha_2^{new,clipped}<C$, then
$$b_2=-E_2-y_1K_{12}(\alpha_1^{new}-\alpha_1^{old})-y_2K_{22}(\alpha_2^{new,clipped}-\alpha_2^{old})+b^{old},$$
and in this case $b^{new}=b_2$.
If both $0<\alpha_1^{new}<C$ and $0<\alpha_2^{new,clipped}<C$ hold, then $b_1=b_2$, and $b^{new}=b_1=b_2$.
If $\alpha_1^{new}$ and $\alpha_2^{new,clipped}$ are both at the bounds, then every value between $b_1$ and $b_2$ satisfies the KKT conditions and can serve as the new $b$; conventionally one takes $b^{new}=\frac{b_1+b_2}{2}$.
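A sketch of the whole b update (a1, a2 are the new, already-clipped multipliers; the function and argument names are mine):

def update_b(b_old, E1, E2, y1, y2, alpha1, alpha2, a1, a2, K11, K12, K21, K22, C, tol=1e-8):
    b1 = -E1 - y1 * K11 * (a1 - alpha1) - y2 * K21 * (a2 - alpha2) + b_old
    b2 = -E2 - y1 * K12 * (a1 - alpha1) - y2 * K22 * (a2 - alpha2) + b_old
    if tol < a1 < C - tol:      # alpha_1^new strictly inside (0, C)
        return b1
    if tol < a2 < C - tol:      # alpha_2^new,clipped strictly inside (0, C)
        return b2
    return 0.5 * (b1 + b2)      # both at the bounds: take the midpoint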
9 Code Details (from Platt's paper)
The pseudo-code below describes the entire SMO algorithm:
target - desired output vector
point - training point matrix
procedure takeStep(i1,i2)
if(i1 == i2) return 0
alph1 = Lagrange multiplier for i1
y1 = target[i1]
E1 = SVM output on point[i1] - y1 (check in error cache)
s = y1*y2
Compute L, H via equations (13) and (14)
if (L == H)
return 0
k11 = kernel(point[i1], point[i1])
k12 = kernel(point[i1], point[i2])
k22 = kernel(point[i2], point[i2])
eta = k11+k22-2*k12
if (eta > 0)
{
a2 = alph2 + y2*(E1-E2)/eta
if (a2 < L) a2 = L
else if (a2 > H) a2 = H
}
else
{
Lobj = objective function at a2=L
Hobj = objective function at a2=H
if ( Lobj < Hobj - eps)
a2 = L
else if (Lobj > Hobj + eps)
a2 = H
else
a2 = alph2
}
if (|a2-alph2| < eps*(a2+alph2+eps))
return 0
a1 = alph1 + s*(alph2-a2)
Update threshold to reflect change in Lagrange multipliers
Update weight vector to reflect change in a1 & a2, if SVM is linear
Update error cache using new Lagrange multipliers
Store a1 in the alpha array
Store a2 in the alpha array
endprocedure
procedure examineExample(i2)
y2 = target[i2]
alph2 = Lagrange multiplier for i2
E2 = SVM output on point[i2] - y2 (check in error cache)
r2 = E2*y2
if ((r2< -tol && alph2 < C) || (r2 > tol && alph2 > 0))
{
if (number of non-zero & non-C alpha > 1)
{
i1 = result of second choice heuristic
if takeStep(i1, i2)
return 1
}
loop over all non-zero and non-C alpha, starting at a random point
{
i1 = identity of current alpha
if takeStep(i1, i2)
return 1
}
loop over all possible i1, starting at a random point
{
i1 = loop variable
if takeStep(i1, i2)
return 1
}
}
return 0
endprocedure
main routine:
numChanged = 0;
examineAll = 1;
while(numChanged > 0 | examineAll)
{
numChanged = 0;
if (examineAll)
loop I over all training examples
numChanged += examineExample(I)
else
loop I over examples where alpha is not 0 & not C
numChanged += examineExample(I)
if (examineAll == 1)
examineAll = 0
else if (numChanged == 0)
examineAll = 1
}
Conclusion:
Against this long derivation, the idea behind the SVM is remarkably simple. Unlike logistic regression, it does not try to fit the sample points (through an extra sigmoid transformation); it searches directly for a separating hyperplane among the samples, and to judge which separating hyperplane is better it introduces the objective of maximizing the geometric margin.
The key role of Lagrangian duality is to move the computation of $w$ forward and eliminate it, turning the optimization into a problem in the single set of parameters $\alpha$. All subsequent derivations are about solving that optimization problem. Along the way we find that $w$ can be expressed through inner products of feature vectors, which leads to kernel functions: by merely changing the kernel, features are mapped from low to high dimension, while all computation stays in the low dimension and the effect shows up in the high dimension. Since not all samples are separable, the soft margin is introduced to keep the SVM general, which makes the optimization more involved; remarkably, the slack variables do not appear in the final objective. The final optimization is then dispatched by Lagrangian duality and the SMO algorithm, rounding the SVM off nicely.
Reference: https://wenku.baidu.com/view/5e849b14f18583d0496459bd.html
Full code: https://gitee.com/xxuffei/Machine-Learning-Study-Notes (forked from github.com/yhswjtuILMARE/Machine-Learning-Study-Notes)