Logistic Regression


  1. Derivation of gradient descent for logistic regression

  2. Proof that the logistic regression objective function is convex

Training data $D = \{ (\mathbf{x}_{1}, y_{1}), \cdots, (\mathbf{x}_{n}, y_{n}) \}$, where $(\mathbf{x}_{i}, y_{i})$ denotes one sample, $\mathbf{x}_{i} \in \mathbb{R}^{D}$ is the $D$-dimensional feature vector, and $y_{i} \in \{ 0, 1 \}$ is the sample label.

The parameters of the logistic regression model are $(\mathbf{w}, b)$. For convenience in the derivation, $b$ is usually absorbed into $\mathbf{w}$, in which case $\mathbf{w}$ and $\mathbf{x}_{i}$ are rewritten as

$$\mathbf{w} = [w_{0}, w_{1}, \cdots, w_{D}], \quad \mathbf{x}_{i} = [1, x_{i1}, \cdots, x_{iD}]$$
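
As a minimal sketch of this reparameterization (assuming NumPy and a design matrix `X` of shape `(n, D)`; the helper name is illustrative):

```python
import numpy as np

def add_bias_column(X):
    """Prepend a constant-1 column so that w[0] plays the role of the bias b."""
    return np.hstack([np.ones((X.shape[0], 1)), X])
```

With this augmentation, $\mathbf{w}^{\text{T}} \mathbf{x}_{i}$ already includes the bias term.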

1 Objective Function of Logistic Regression

The objective function (also called the loss function) is denoted $\mathcal{L} (\mathbf{w})$.

Binary classification model

$$p(y | \mathbf{x}; \mathbf{w}) = p(y = 1 | \mathbf{x}; \mathbf{w})^{y} \left[ 1 - p(y = 1 | \mathbf{x}; \mathbf{w}) \right]^{1 - y} \tag{1}$$

Maximum likelihood estimation (MLE), assuming the $n$ samples are drawn i.i.d.:

$$\begin{aligned} \mathbf{w}^{\ast} & = \arg \max_{\mathbf{w}} p(\mathbf{y} | \mathbf{x}; \mathbf{w}) \\ & = \arg \max_{\mathbf{w}} \prod_{i = 1}^{n} p(y_{i} | \mathbf{x}_{i}; \mathbf{w}) \\ & = \arg \max_{\mathbf{w}} \log \left[ \prod_{i = 1}^{n} p(y_{i} | \mathbf{x}_{i}; \mathbf{w}) \right] \\ & = \arg \max_{\mathbf{w}} \sum_{i = 1}^{n} \log p(y_{i} | \mathbf{x}_{i}; \mathbf{w}) \\ & = \arg \max_{\mathbf{w}} \sum_{i = 1}^{n} \log \left[ p(y_{i} = 1 | \mathbf{x}_{i}; \mathbf{w})^{y_{i}} [1 - p(y_{i} = 1 | \mathbf{x}_{i}; \mathbf{w})]^{1 - y_{i}} \right] \\ & = \arg \max_{\mathbf{w}} \sum_{i = 1}^{n} \left[ y_{i} \log p(y_{i} = 1 | \mathbf{x}_{i}; \mathbf{w}) + (1 - y_{i}) \log [1 - p(y_{i} = 1 | \mathbf{x}_{i}; \mathbf{w})] \right] \end{aligned} \tag{2}$$

Equation (2) is the maximum likelihood estimate of $p(\mathbf{y} | \mathbf{x}; \mathbf{w})$. By convention, the objective is written as a minimization over $\mathbf{w}$:

$$\mathbf{w}^{\ast} = \arg \min_{\mathbf{w}} \mathcal{L} (\mathbf{w})$$

so the objective function (the cross-entropy loss) is the negated log-likelihood:

$$\mathcal{L} (\mathbf{w}) = - \sum_{i = 1}^{n} \left[ y_{i} \log p(y_{i} = 1 | \mathbf{x}_{i}; \mathbf{w}) + (1 - y_{i}) \log [1 - p(y_{i} = 1 | \mathbf{x}_{i}; \mathbf{w})] \right] \tag{3}$$

Logistic sigmoid function

$$\sigma (x) = \frac{1}{1 + e^{-x}}, \quad \sigma^{\prime} (x) = \sigma (x) \left( 1 - \sigma (x) \right)$$
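
Computing $\sigma(x)$ naively as `1 / (1 + np.exp(-x))` overflows for large negative $x$; a common stabilization (a sketch, not part of the original text) branches on the sign:

```python
import numpy as np

def sigmoid(x):
    """Numerically stable logistic sigmoid, applied elementwise."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form e^x / (1 + e^x) to avoid overflow.
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out
```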

Consider the binary classification problem: given $\mathbf{x}$, the conditional probability that the event occurs ($y = 1$) is $p(y = 1 | \mathbf{x}; \mathbf{w})$. Logistic regression models the log-odds as a linear function of $\mathbf{x}$, so the odds of the event are:

$$\text{odds} = \frac{p(y = 1 | \mathbf{x}; \mathbf{w})}{1 - p(y = 1 | \mathbf{x}; \mathbf{w})} = \exp(\mathbf{w}^{\text{T}} \mathbf{x})$$

Solving for the probability gives:

$$p(y = 1 | \mathbf{x}; \mathbf{w}) = \sigma (\mathbf{w}^{\text{T}} \mathbf{x}) = \frac{1}{1 + e^{- \mathbf{w}^{\text{T}} \mathbf{x}}} \tag{4}$$

Substituting equation (4) into equation (3) yields the objective function of logistic regression:

$$\mathcal{L} (\mathbf{w}) = - \sum_{i = 1}^{n} \left[ y_{i} \log \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) + (1 - y_{i}) \log [1 - \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i})] \right] \tag{5}$$
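
Equation (5) translates directly into code. A sketch reusing the `sigmoid` helper above (the `eps` clipping is an added numerical safeguard, not part of equation (5)):

```python
def loss(w, X, y, eps=1e-12):
    """Cross-entropy loss of equation (5), summed over all n samples."""
    p = sigmoid(X @ w)              # p[i] = sigma(w^T x_i)
    p = np.clip(p, eps, 1.0 - eps)  # keep log() finite at p = 0 or 1
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```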

2 Gradient of $\mathcal{L} (\mathbf{w})$ with Respect to $\mathbf{w}$

Vector derivative identity

$$\frac{\partial \mathbf{a}^{\text{T}} \mathbf{x}}{\partial \mathbf{x}} = \frac{\partial \mathbf{x}^{\text{T}} \mathbf{a}}{\partial \mathbf{x}} = \mathbf{a}$$
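
This identity can be sanity-checked with central finite differences (a quick numerical sketch, not part of the derivation):

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0])
x = np.array([0.5, 0.1, -0.7])
eps = 1e-6

# Central-difference estimate of d(a^T x)/dx, one coordinate at a time.
grad_fd = np.array([(a @ (x + eps * e) - a @ (x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(grad_fd, a)  # the gradient is exactly a
```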

Gradient of $\mathcal{L} (\mathbf{w})$ with respect to $\mathbf{w}$

$$\begin{aligned} \frac{\partial \mathcal{L} (\mathbf{w})}{\partial \mathbf{w}} & = - \frac{\partial}{\partial \mathbf{w}} \sum_{i = 1}^{n} \left[ y_{i} \log \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) + (1 - y_{i}) \log [1 - \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i})] \right] \\ & = - \sum_{i = 1}^{n} \left[ y_{i} \frac{\partial \log \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i})}{\partial \mathbf{w}} + (1 - y_{i}) \frac{\partial \log [1 - \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i})]}{\partial \mathbf{w}} \right] \\ & = - \sum_{i = 1}^{n} \left[ y_{i} \left( 1 - \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) \right) \mathbf{x}_{i} - (1 - y_{i}) \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) \mathbf{x}_{i} \right] \\ & = \sum_{i = 1}^{n} \left[ \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) - y_{i} \right] \mathbf{x}_{i} \end{aligned} \tag{6}$$

Gradient descent update

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial \mathcal{L} (\mathbf{w})}{\partial \mathbf{w}} = \mathbf{w} - \eta \sum_{i = 1}^{n} \left[ \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) - y_{i} \right] \mathbf{x}_{i}$$
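
Combining equations (5) and (6), a minimal batch gradient-descent loop might look as follows (reusing the helpers above; the learning rate `eta` and iteration count are illustrative, not values from the text):

```python
def gradient(w, X, y):
    """Gradient of equation (6): sum_i [sigma(w^T x_i) - y_i] x_i."""
    return X.T @ (sigmoid(X @ w) - y)

def fit(X, y, eta=0.01, n_iters=1000):
    """Batch gradient descent: w <- w - eta * dL/dw."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= eta * gradient(w, X, y)
    return w
```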

3 Second Derivative of $\mathcal{L} (\mathbf{w})$ with Respect to $\mathbf{w}$

Hessian layout (derivative of a vector with respect to a vector)

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial y_{1}}{\partial \mathbf{x}} & \cdots & \dfrac{\partial y_{n}}{\partial \mathbf{x}} \end{bmatrix}$$

Second derivative of $\mathcal{L} (\mathbf{w})$ with respect to $\mathbf{w}$

$$\begin{aligned} \frac{\partial^{2} \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}^{2}} & = \frac{\partial}{\partial \mathbf{w}} \sum_{i = 1}^{n} \left[ \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) - y_{i} \right] \mathbf{x}_{i} \\ & = \sum_{i = 1}^{n} \frac{\partial \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i})}{\partial \mathbf{w}} \mathbf{x}_{i}^{\text{T}} \\ & = \sum_{i = 1}^{n} \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) \left( 1 - \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) \right) \mathbf{x}_{i} \mathbf{x}_{i}^{\text{T}} \end{aligned} \tag{7}$$
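
In matrix form, equation (7) is $X^{\text{T}} S X$ with $S = \operatorname{diag}\left( \sigma_{i} (1 - \sigma_{i}) \right)$. A sketch reusing the `sigmoid` helper:

```python
def hessian(w, X):
    """Hessian of equation (7): sum_i sigma_i (1 - sigma_i) x_i x_i^T = X^T S X."""
    s = sigmoid(X @ w)
    d = s * (1.0 - s)              # diagonal of S: sigma_i * (1 - sigma_i)
    return X.T @ (d[:, None] * X)  # equivalent to X.T @ np.diag(d) @ X
```

This is the matrix that Newton-type solvers for logistic regression (such as IRLS) invert at each iteration.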

4 The Logistic Regression Objective Is Convex

If a function is convex, any local optimum is also a global optimum. Therefore, once (stochastic) gradient descent or a similar method finds an optimum, that solution is guaranteed to be globally optimal.

One way to prove convexity is to show that the second derivative is nonnegative. For example, for $f(x) = x^{2} - 3x + 3$, the second derivative is $f''(x) = 2 > 0$, so $f(x)$ is convex. The same idea extends to multivariate functions: it suffices to show that the matrix of second derivatives (the Hessian) is positive semidefinite. To prove that a matrix $\mathbf{H}$ is positive semidefinite, one shows that $\mathbf{v}^{\text{T}} \mathbf{H} \mathbf{v} \geq 0$ for every vector $\mathbf{v}$.

Proof that $\frac{\partial^{2} \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}^{2}}$ is positive semidefinite

$$\mathbf{v}^{\text{T}} \frac{\partial^{2} \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}^{2}} \mathbf{v} = \sum_{i = 1}^{n} \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) \left( 1 - \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) \right) \mathbf{v}^{\text{T}} \mathbf{x}_{i} \mathbf{x}_{i}^{\text{T}} \mathbf{v} \tag{8}$$

Since $0 < \sigma (\mathbf{w}^{\text{T}} \mathbf{x}_{i}) < 1$, it suffices to show that $\mathbf{v}^{\text{T}} \mathbf{x}_{i} \mathbf{x}_{i}^{\text{T}} \mathbf{v} \geq 0$:

$$\mathbf{v}^{\text{T}} \mathbf{x}_{i} \mathbf{x}_{i}^{\text{T}} \mathbf{v} = \left( \mathbf{v}^{\text{T}} \mathbf{x}_{i} \right)^{2} \geq 0 \Rightarrow \frac{\partial^{2} \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}^{2}} \succeq 0$$

Therefore, $\frac{\partial^{2} \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}^{2}}$ is positive semidefinite, which proves that the logistic regression objective is convex.
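
The semidefiniteness argument can also be spot-checked numerically: every eigenvalue of the Hessian should be nonnegative up to floating-point error. A sketch using the helpers defined above with randomly generated data:

```python
rng = np.random.default_rng(0)
X = add_bias_column(rng.normal(size=(100, 3)))
w = rng.normal(size=X.shape[1])

H = hessian(w, X)
eig = np.linalg.eigvalsh(H)   # H is symmetric, so eigvalsh applies
assert np.all(eig >= -1e-10)  # PSD up to numerical tolerance
```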
