Python: Understanding Logistic Regression in Depth


Mathematically, the goal of logistic regression is to solve for the parameter values θ that make the model fit the data as well as possible, and to use them to build the prediction function; the feature matrix is then fed into this prediction function to compute the logistic regression result y. Note that although the logistic regression we are familiar with is usually applied to binary classification, it can also handle multiclass problems.
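For reference (standard textbook notation, not spelled out in the original post), the prediction function is the sigmoid applied to a linear combination of the features:

y_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}

Values close to 1 are assigned to the positive class and values close to 0 to the negative class.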

The "loss function" measures the information loss incurred when a model with parameters θ is fitted to the training set, and thereby measures how good or bad those parameters are. If a set of parameters yields a model that performs well on the training set, we say the loss during fitting is small, the value of the loss function is small, and this set of parameters is good; conversely, if the model performs poorly on the training set, the loss function is large, the model is under-trained and performs badly, and this set of parameters is poor.
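In the usual formulation (assuming m training samples with labels y_i ∈ {0, 1}), this loss function is the negative log-likelihood:

J(\theta) = -\sum_{i=1}^{m} \Big( y_i \log\big(y_\theta(x_i)\big) + (1 - y_i)\log\big(1 - y_\theta(x_i)\big) \Big)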

Regularization is the process used to keep the model from overfitting. The two common options are L1 and L2 regularization, implemented by adding a multiple of the L1 norm or the L2 norm of the parameter vector θ to the loss function. This added norm is called the "regularization term", also known as the "penalty term". Once the loss function changes, the parameter values obtained by optimizing it necessarily change as well, and this is how we adjust how closely the model fits. The L1 norm is the sum of the absolute values of every parameter in the parameter vector, while the L2 norm is the square root of the sum of the squares of every parameter in the parameter vector.
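Written under sklearn's convention, where the parameter C scales the loss term rather than the penalty term (so a smaller C means stronger regularization), the two regularized objectives look roughly like:

J(\theta)_{L1} = C \cdot J(\theta) + \sum_{j=1}^{n} |\theta_j|

J(\theta)_{L2} = C \cdot J(\theta) + \sqrt{\sum_{j=1}^{n} \theta_j^{2}}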

Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 7.6.1 -- An enhanced Interactive Python.

 

1. Import the required libraries and the dataset

from sklearn.linear_model import LogisticRegression as LR
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

 

2. Take a look at the basic shape of the dataset

data = load_breast_cancer()
X = data.data
y = data.target
data.data.shape
Out[2]: (569, 30)

 

3. Build logistic regression models with different regularization

lrl1 = LR(penalty="l1",solver="liblinear",C=0.5,max_iter=1000)
lrl2 = LR(penalty="l2",solver="liblinear",C=0.5,max_iter=1000)

 

When we choose L1 regularization, many of the feature coefficients are set to 0:

lrl1 = lrl1.fit(X,y)
lrl1.coef_
Out[4]: 
array([[ 3.98387974,  0.03124285, -0.13482427, -0.0161981 ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.50336745,  0.        , -0.07122861,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -0.24513104, -0.12830753, -0.01443125,  0.        ,
         0.        , -2.05883441,  0.        ,  0.        ,  0.        ]])

(lrl1.coef_ != 0).sum(axis=1)
Out[5]: array([10])

 

L2 regularization, by contrast, gives every feature a coefficient:
lrl2 = lrl2.fit(X,y)
lrl2.coef_
Out[6]: 
array([[ 1.58651399e+00,  1.05063447e-01,  4.48683632e-02,
        -3.74375536e-03, -8.56901613e-02, -2.94287943e-01,
        -4.37733190e-01, -2.07600072e-01, -1.22519137e-01,
        -1.87465544e-02,  2.78262183e-02,  8.41924650e-01,
         1.66118667e-01, -9.75336736e-02, -8.75079523e-03,
        -3.14114111e-02, -6.23724835e-02, -2.55180879e-02,
        -2.58371386e-02, -1.14472251e-03,  1.34078040e+00,
        -3.01592084e-01, -1.79752694e-01, -2.25790709e-02,
        -1.57048116e-01, -8.65467461e-01, -1.12967239e+00,
        -3.98774604e-01, -3.79997636e-01, -8.51004287e-02]])

 

l1 = []
l2 = []
l1test = []
l2test = []
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y,test_size=0.3,random_state=420)
for i in np.linspace(0.05,1,19):
    lrl1 = LR(penalty="l1",solver="liblinear",C=i,max_iter=1000)
    lrl2 = LR(penalty="l2",solver="liblinear",C=i,max_iter=1000)
    lrl1 = lrl1.fit(Xtrain,Ytrain)
    l1.append(accuracy_score(lrl1.predict(Xtrain),Ytrain))
    l1test.append(accuracy_score(lrl1.predict(Xtest),Ytest))
    lrl2 = lrl2.fit(Xtrain,Ytrain)
    l2.append(accuracy_score(lrl2.predict(Xtrain),Ytrain))
    l2test.append(accuracy_score(lrl2.predict(Xtest),Ytest))


graph = [l1,l2,l1test,l2test]
color = ["green","black","lightgreen","gray"]
label = ["L1","L2","L1test","L2test"]

plt.figure(figsize=(6,6))
for i in range(len(graph)):
    plt.plot(np.linspace(0.05,1,19),graph[i],color[i],label=label[i])

plt.legend(loc=4) # where should the legend go? loc=4 means the lower-right corner
plt.show()

As C gradually increases, the strength of regularization gets weaker and weaker, and the model's performance on both the training set and the test set trends upward. Around C=0.8 the training-set score is still climbing, but performance on unseen data begins to drop, which means overfitting has set in; so setting C to about 0.8 is a reasonable choice. In practice, L2 regularization is essentially the default; if the model does not perform well, switch to L1 and try again.
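As a minimal sketch (reusing the l2test list and the same C grid from the loop above), the best C on the test set can also be read off programmatically instead of by eye:

Cs = np.linspace(0.05,1,19)
best = np.argmax(l2test)                                     # index of the highest L2 test accuracy
print("best C (L2):", Cs[best], "accuracy:", l2test[best])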

 

The efficient embedded method (SelectFromModel)
from sklearn.linear_model import LogisticRegression as LR
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()
data.data.shape
Out[10]: (569, 30)

#accuracy of logistic regression on the full feature set

LR_ = LR(solver="liblinear",C=0.8,random_state=420)
cross_val_score(LR_,data.data,data.target,cv=10).mean()
Out[11]: 0.9508998790078644

#accuracy after embedded feature selection

X_embedded = SelectFromModel(LR_,norm_order=1).fit_transform(data.data,data.target)
X_embedded.shape
Out[12]: (569, 9)

cross_val_score(LR_,X_embedded,data.target,cv=10).mean()
Out[13]: 0.9368323826808401

The number of features has been cut down to single digits, yet the model's performance has not dropped by much. If our requirements are not demanding, we could actually stop right here.

We can tune the parameter threshold of the SelectFromModel class, which is the threshold of the embedded method: every feature whose coefficient has an absolute value below this threshold is removed. Right now threshold defaults to None, so SelectFromModel selects features purely based on the L1 regularization result, i.e. it keeps every feature whose coefficient is nonzero after L1 regularization. All we have to do is adjust threshold (and plot a learning curve over it) to observe how the model's performance changes under different thresholds. As soon as threshold is adjusted, we are no longer selecting features via L1 regularization, but via the per-feature coefficients stored in the model attribute .coef_. Although coef_ returns feature coefficients, their magnitudes play a role similar to feature_importances_ in decision trees and the explained variance explained_variance_ in dimensionality-reduction algorithms: they all measure a feature's importance and contribution. Therefore the threshold parameter of SelectFromModel can be set as a cutoff on coef_, removing every feature whose coefficient is smaller than the number passed in threshold.
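Conceptually, a given threshold simply keeps the features whose absolute coefficient is at least that large. A minimal sketch of the idea (the mask below is only for illustration; SelectFromModel computes the equivalent internally):

coef = abs(LR_.fit(data.data,data.target).coef_).ravel()   # |coef_| for each of the 30 features
t = 0.5                                                     # an example threshold value
print((coef >= t).sum(), "features would be kept at threshold", t)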
 

fullx = []
fsx = []
threshold = np.linspace(0,abs((LR_.fit(data.data,data.target).coef_)).max(),20)
k=0
for i in threshold:
    X_embedded = SelectFromModel(LR_,threshold=i).fit_transform(data.data,data.target)
    fullx.append(cross_val_score(LR_,data.data,data.target,cv=5).mean())
    fsx.append(cross_val_score(LR_,X_embedded,data.target,cv=5).mean())
    print((threshold[k],X_embedded.shape[1]))
    k+=1

plt.figure(figsize=(20,5))
plt.plot(threshold,fullx,label="full")
plt.plot(threshold,fsx,label="feature selection")
plt.xticks(threshold)
plt.legend()
plt.show()
(0.0, 30)
(0.1040236124018952, 17)
(0.2080472248037904, 12)
(0.3120708372056856, 10)
(0.4160944496075808, 8)
(0.520118062009476, 8)
(0.6241416744113713, 5)
(0.7281652868132664, 5)
(0.8321888992151616, 5)
(0.9362125116170568, 5)
(1.040236124018952, 5)
(1.144259736420847, 3)
(1.2482833488227425, 3)
(1.3523069612246377, 2)
(1.4563305736265328, 2)
(1.560354186028428, 1)
(1.6643777984303232, 1)
(1.7684014108322184, 1)
(1.8724250232341135, 1)
(1.9764486356360087, 1)

As threshold grows larger and larger, more and more features are removed, and the model's performance gets worse and worse.

 

Instead, let's tune the logistic regression estimator LR_ itself, by plotting a learning curve over C:
fullx = []
fsx = []
C=np.arange(0.01,10.01,0.5)
for i in C:
    LR_ = LR(solver="liblinear",C=i,random_state=420)
    fullx.append(cross_val_score(LR_,data.data,data.target,cv=10).mean())
    X_embedded = SelectFromModel(LR_,norm_order=1).fit_transform(data.data,data.target)
    fsx.append(cross_val_score(LR_,X_embedded,data.target,cv=10).mean())

print(max(fsx),C[fsx.index(max(fsx))])
plt.figure(figsize=(20,5))
plt.plot(C,fullx,label="full")
plt.plot(C,fsx,label="feature selection")
plt.xticks(C)
plt.legend()
plt.show()
0.9563164376458386 8.01

 

After pinning down the rough range, keep refining the learning curve:
fullx = []
fsx = []
C=np.arange(7.51,8.51,0.005)
for i in C:
    LR_ = LR(solver="liblinear",C=i,random_state=420)
    fullx.append(cross_val_score(LR_,data.data,data.target,cv=10).mean())
    X_embedded = SelectFromModel(LR_,norm_order=1).fit_transform(data.data,data.target)
    fsx.append(cross_val_score(LR_,X_embedded,data.target,cv=10).mean())

print(max(fsx),C[fsx.index(max(fsx))])
plt.figure(figsize=(20,5))
plt.plot(C,fullx,label="full")
plt.plot(C,fsx,label="feature selection")
plt.xticks(C)
plt.legend()
plt.show()
0.9563164376458386 7.515


fullx = []
fsx = []
C=np.arange(6.05,7.05,0.005)
for i in C:
    LR_ = LR(solver="liblinear",C=i,random_state=420)
    fullx.append(cross_val_score(LR_,data.data,data.target,cv=10).mean())
    X_embedded = SelectFromModel(LR_,norm_order=1).fit_transform(data.data,data.target)
    fsx.append(cross_val_score(LR_,X_embedded,data.target,cv=10).mean())

print(max(fsx),C[fsx.index(max(fsx))])
plt.figure(figsize=(20,5))
plt.plot(C,fullx,label="full")
plt.plot(C,fsx,label="feature selection")
plt.xticks(C)
plt.legend()
plt.show()
0.9580405755768732 6.069999999999999

 

Accuracy is highest when C=6.069999999999999.

#validate model performance: before feature selection (dimensionality reduction)
LR_ = LR(solver="liblinear",C=6.069999999999999,random_state=420)
cross_val_score(LR_,data.data,data.target,cv=10).mean()
Out[20]: 0.9491152450090743

 

#validate model performance: after feature selection
LR_ = LR(solver="liblinear",C=6.069999999999999,random_state=420)
X_embedded = SelectFromModel(LR_,norm_order=1).fit_transform(data.data,data.target)
cross_val_score(LR_,X_embedded,data.target,cv=10).mean()
Out[21]: 0.9580405755768732

X_embedded.shape
Out[22]: (569, 11)
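If we also want to know which 11 features were kept, the fitted selector exposes a boolean mask through get_support(); here is a minimal sketch (fitting the selector separately rather than calling fit_transform directly):

selector = SelectFromModel(LR_,norm_order=1).fit(data.data,data.target)
mask = selector.get_support()          # boolean mask over the 30 original features
print(data.feature_names[mask])        # names of the selected features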

 

2.3 Gradient descent: the important parameter max_iter
Let's look at the learning curve over max_iter on the breast cancer dataset.

l2 = []
l2test = []
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y,test_size=0.3,random_state=420)
for i in np.arange(1,201,10):
    lrl2 = LR(penalty="l2",solver="liblinear",C=0.9,max_iter=i)
    lrl2 = lrl2.fit(Xtrain,Ytrain)
    l2.append(accuracy_score(lrl2.predict(Xtrain),Ytrain))
    l2test.append(accuracy_score(lrl2.predict(Xtest),Ytest))

graph = [l2,l2test]
color = ["black","gray"]
label = ["L2","L2test"]
plt.figure(figsize=(20,5))
for i in range(len(graph)):
    plt.plot(np.arange(1,201,10),graph[i],color[i],label=label[i])

plt.legend(loc=4)
plt.xticks(np.arange(1,201,10))
plt.show()
H:\Anaconda3\lib\site-packages\sklearn\svm\base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
(the same warning is repeated for every max_iter value that is too small for liblinear to converge)

We can use the attribute .n_iter_ to retrieve the number of iterations actually performed in this fit; it shows that the solver has already converged at 24 iterations.

lr = LR(penalty="l2",solver="liblinear",C=0.9,max_iter=300).fit(Xtrain,Ytrain)
lr.n_iter_
Out[25]: array([24], dtype=int32)
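To see where convergence actually happens, here is a minimal sketch (reusing the same Xtrain/Ytrain split as above) that records n_iter_ for each max_iter value instead of the accuracy:

n_iters = []
for i in np.arange(1,201,10):
    lr = LR(penalty="l2",solver="liblinear",C=0.9,max_iter=i).fit(Xtrain,Ytrain)
    n_iters.append(lr.n_iter_[0])      # iterations actually performed (capped at max_iter)
print(n_iters)                          # stops increasing once the solver has converged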

 

2.4 Binary vs. multinomial regression: the important parameters solver & multi_class
Let's see how multinomial and ovr differ on the iris dataset.

輸入"ovr", "multinomial", "auto"來告知模型,我們要處理的分類問題的類型。默認是"ovr"。
'ovr':表示分類問題是二分類,或讓模型使用"一對多"的形式來處理多分類問題。
'multinomial':表示處理多分類問題,這種輸入在參數solver是'liblinear'時不可用。
"auto":表示會根據數據的分類情況和其他參數來確定模型要處理的分類問題的類型。比如說,如果數據是二分
類,或者solver的取值爲"liblinear","auto"會默認選擇"ovr"。反之,則會選擇"nultinomial"。
 

from sklearn.datasets import load_iris
iris = load_iris()


for multi_class in ('multinomial', 'ovr'):
    clf = LR(solver='sag', max_iter=100, random_state=42,
                             multi_class=multi_class).fit(iris.data, iris.target)

#print the training score under the two multi_class modes
#how % formatting works: % substitutes variables into the printed string; %.3f means a float kept to three decimal places, %s means a string
#after the string, % is followed by a tuple holding the variables: the string needs as many variables in the tuple as it has % placeholders
    print("training score : %.3f (%s)" % (clf.score(iris.data, iris.target),multi_class))
training score : 0.987 (multinomial)
training score : 0.960 (ovr)

H:\Anaconda3\lib\site-packages\sklearn\linear_model\sag.py:337: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
(the warning appears once per multi_class fit, because max_iter=100 is not enough for the sag solver to fully converge on this data)
