Follow the WeChat official account: 小程在線
Follow the CSDN blog: 程志偉的博客
Working through this article takes roughly 3 hours.
In this post we get to know the important parameters, attributes and interfaces of the SVC class. The parameters include the soft-margin penalty C, the kernel function kernel, the kernel-related parameters gamma, coef0 and degree, the class-imbalance parameter class_weight, the probability switch probability, and the memory-control parameter cache_size. The attributes covered are mainly support_vectors_, which returns the support vectors, and coef_, which inspects feature importance. Among the interfaces we study the core decision_function. Beyond that, we introduce the evaluation metrics for classification models, the confusion matrix and the ROC curve, along with some ideas for feature engineering and data preprocessing.
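As a reminder, all of the parameters summarized above live on the SVC constructor. A minimal sketch on toy data (the values shown are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data, only to show where each parameter discussed above appears.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(C=1.0                      # soft-margin penalty
          ,kernel="rbf"              # kernel function
          ,gamma="auto"              # kernel coefficient
          ,class_weight="balanced"   # handle class imbalance
          ,cache_size=500            # kernel cache size in MB
          ).fit(X, y)

print(clf.support_vectors_.shape)    # the support-vector attribute
print(clf.decision_function(X[:3]))  # the core interface
```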
Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
Load the data and explore it
weather = pd.read_csv(r"H:\程志偉\python\菜菜的機器學習skleaen課堂\SVM數據\weatherAUS5000.csv",index_col=0)
weather.head()
Out[2]:
Date Location MinTemp ... Temp9am Temp3pm RainTomorrow
0 2015-03-24 Adelaide 12.3 ... 15.1 17.7 No
1 2011-07-12 Adelaide 7.9 ... 8.4 11.3 No
2 2010-02-08 Adelaide 24.0 ... 32.4 37.4 No
3 2016-09-19 Adelaide 6.7 ... 11.2 15.9 No
4 2014-03-05 Adelaide 16.7 ... 20.8 23.7 No
[5 rows x 22 columns]
Feature/Label | Meaning |
Date | Date of the observation |
Location | Name of the weather station providing the record |
MinTemp | Minimum temperature in degrees Celsius |
MaxTemp | Maximum temperature in degrees Celsius |
Rainfall | Rainfall recorded for the day, in mm |
Evaporation | Class A pan evaporation (mm) in the 24 hours to 9am |
Sunshine | Number of hours of bright sunshine in the day |
WindGustDir | Direction of the strongest wind gust in the 24 hours to midnight |
WindGustSpeed | Speed (km/h) of the strongest wind gust in the 24 hours to midnight |
WindDir9am | Wind direction at 9am |
WindDir3pm | Wind direction at 3pm |
WindSpeed9am | Wind speed (km/h) averaged over the 10 minutes before 9am |
WindSpeed3pm | Wind speed (km/h) averaged over the 10 minutes before 3pm |
Humidity9am | Humidity (percent) at 9am |
Humidity3pm | Humidity (percent) at 3pm |
Pressure9am | Atmospheric pressure (hPa) reduced to mean sea level at 9am |
Pressure3pm | Atmospheric pressure (hPa) reduced to mean sea level at 3pm |
Cloud9am | Fraction of sky obscured by cloud at 9am, measured in "oktas", a unit recording how much of the sky the clouds cover: 0 means a completely clear sky and 8 means completely overcast |
Cloud3pm | Fraction of sky obscured by cloud at 3pm |
Temp9am | Temperature in degrees Celsius at 9am |
Temp3pm | Temperature in degrees Celsius at 3pm |
RainTomorrow | The target variable, our label: did it rain tomorrow? |
#separate the feature matrix from the label Y
X = weather.iloc[:,:-1]
Y = weather.iloc[:,-1]
#explore the data types
X.shape
Out[4]: (5000, 21)
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 5000 non-null object
1 Location 5000 non-null object
2 MinTemp 4979 non-null float64
3 MaxTemp 4987 non-null float64
4 Rainfall 4950 non-null float64
5 Evaporation 2841 non-null float64
6 Sunshine 2571 non-null float64
7 WindGustDir 4669 non-null object
8 WindGustSpeed 4669 non-null float64
9 WindDir9am 4651 non-null object
10 WindDir3pm 4887 non-null object
11 WindSpeed9am 4949 non-null float64
12 WindSpeed3pm 4919 non-null float64
13 Humidity9am 4936 non-null float64
14 Humidity3pm 4880 non-null float64
15 Pressure9am 4506 non-null float64
16 Pressure3pm 4504 non-null float64
17 Cloud9am 3111 non-null float64
18 Cloud3pm 3012 non-null float64
19 Temp9am 4967 non-null float64
20 Temp3pm 4912 non-null float64
dtypes: float64(16), object(5)
memory usage: 859.4+ KB
#explore the missing values
X.isnull().mean()
Out[6]:
Date 0.0000
Location 0.0000
MinTemp 0.0042
MaxTemp 0.0026
Rainfall 0.0100
Evaporation 0.4318
Sunshine 0.4858
WindGustDir 0.0662
WindGustSpeed 0.0662
WindDir9am 0.0698
WindDir3pm 0.0226
WindSpeed9am 0.0102
WindSpeed3pm 0.0162
Humidity9am 0.0128
Humidity3pm 0.0240
Pressure9am 0.0988
Pressure3pm 0.0992
Cloud9am 0.3778
Cloud3pm 0.3976
Temp9am 0.0066
Temp3pm 0.0176
dtype: float64
#explore the label's classes
np.unique(Y)
Out[7]: array(['No', 'Yes'], dtype=object)
1.2 Split the data, exploring the label first
Split into training and test sets, then run descriptive statistics
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y,test_size=0.3,random_state=420)
#reset the indexes
for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])
Is there a class-imbalance problem?
Ytrain.value_counts()
Out[10]:
No 2704
Yes 796
Name: RainTomorrow, dtype: int64
Ytest.value_counts()
Out[11]:
No 1157
Yes 343
Name: RainTomorrow, dtype: int64
Encode the label
from sklearn.preprocessing import LabelEncoder
encorder = LabelEncoder().fit(Ytrain)
Ytrain = pd.DataFrame(encorder.transform(Ytrain))
Ytest = pd.DataFrame(encorder.transform(Ytest))
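LabelEncoder assigns integer codes in sorted category order, so here "No" becomes 0 and "Yes" becomes 1; the classes_ attribute confirms the mapping. A quick check:

```python
from sklearn.preprocessing import LabelEncoder

# Sorted order decides the codes: "No" -> 0, "Yes" -> 1.
enc = LabelEncoder().fit(["No", "Yes", "No"])
print(list(enc.classes_))            # ['No', 'Yes']
print(enc.transform(["Yes", "No"]))  # [1 0]
```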
1.3 Explore the features and start processing the feature matrix
1.3.1 Descriptive statistics and outliers
Xtrain.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T
Out[14]:
count mean std ... 90% 99% max
MinTemp 3486.0 12.225645 6.396243 ... 20.9 25.900 29.0
MaxTemp 3489.0 23.245543 7.201839 ... 33.0 40.400 46.4
Rainfall 3467.0 2.487049 7.949686 ... 6.6 41.272 115.8
Evaporation 1983.0 5.619163 4.383098 ... 10.2 20.600 56.0
Sunshine 1790.0 7.508659 3.805841 ... 12.0 13.300 13.9
WindGustSpeed 3263.0 39.858413 13.219607 ... 57.0 76.000 117.0
WindSpeed9am 3466.0 14.046163 8.670472 ... 26.0 37.000 65.0
WindSpeed3pm 3437.0 18.553390 8.611818 ... 30.0 43.000 65.0
Humidity9am 3459.0 69.069095 18.787698 ... 94.0 100.000 100.0
Humidity3pm 3408.0 51.651995 20.697872 ... 79.0 98.000 100.0
Pressure9am 3154.0 1017.622067 7.065236 ... 1027.0 1033.247 1038.1
Pressure3pm 3154.0 1015.227077 7.032531 ... 1024.4 1030.800 1036.0
Cloud9am 2171.0 4.491939 2.858781 ... 8.0 8.000 8.0
Cloud3pm 2095.0 4.603819 2.655765 ... 8.0 8.000 8.0
Temp9am 3481.0 16.989859 6.537552 ... 26.0 31.000 38.0
Temp3pm 3431.0 21.719003 7.031199 ... 31.4 38.600 45.9
[16 rows x 13 columns]
Xtest.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T
Out[15]:
count mean std ... 90% 99% max
MinTemp 1493.0 11.916812 6.375377 ... 20.48 25.316 28.3
MaxTemp 1498.0 22.906809 6.986043 ... 32.60 38.303 45.1
Rainfall 1483.0 2.241807 7.988822 ... 5.20 35.372 108.2
Evaporation 858.0 5.657809 4.105762 ... 10.40 19.458 38.8
Sunshine 781.0 7.677465 3.862294 ... 12.20 13.400 13.9
WindGustSpeed 1406.0 40.044097 14.027052 ... 57.00 78.000 122.0
WindSpeed9am 1483.0 13.986514 9.124337 ... 26.00 39.360 72.0
WindSpeed3pm 1482.0 18.601215 8.850446 ... 31.00 43.000 56.0
Humidity9am 1477.0 68.688558 18.876448 ... 95.00 100.000 100.0
Humidity3pm 1472.0 51.431386 20.459957 ... 78.00 96.290 100.0
Pressure9am 1352.0 1017.763536 6.910275 ... 1026.50 1033.449 1038.2
Pressure3pm 1350.0 1015.397926 6.916976 ... 1024.20 1031.151 1036.9
Cloud9am 940.0 4.494681 2.870468 ... 8.00 8.000 8.0
Cloud3pm 917.0 4.403490 2.731969 ... 8.00 8.000 8.0
Temp9am 1486.0 16.751817 6.339816 ... 25.45 30.200 35.1
Temp3pm 1481.0 21.483660 6.770567 ... 30.90 37.400 42.9
[16 rows x 13 columns]
Xtrain.shape
Out[16]: (3500, 21)
Xtest.shape
Out[17]: (1500, 21)
1.3.2 Handling a difficult feature: the date
Xtrainc = Xtrain.copy()
Xtrainc.sort_values(by="Location")
Out[18]:
Date Location MinTemp ... Cloud3pm Temp9am Temp3pm
2796 2015-03-24 Adelaide 12.3 ... NaN 15.1 17.7
2975 2012-08-17 Adelaide 7.8 ... NaN 8.3 12.5
775 2013-03-16 Adelaide 17.4 ... NaN 19.1 20.7
861 2011-07-12 Adelaide 7.9 ... NaN 8.4 11.3
2906 2015-08-24 Adelaide 9.2 ... NaN 9.9 13.4
... ... ... ... ... ... ...
2223 2009-05-08 Woomera 9.2 ... 1.0 13.7 20.1
1984 2014-05-26 Woomera 15.5 ... 7.0 18.0 21.5
1592 2012-01-10 Woomera 16.8 ... 6.0 18.3 24.9
2824 2015-11-03 Woomera 16.2 ... 7.0 20.5 26.2
1005 2010-05-14 Woomera 3.9 ... 1.0 11.5 18.5
[3500 rows x 21 columns]
Xtrain.iloc[:,0].value_counts()
Out[19]:
2014-05-16 6
2015-10-12 6
2015-07-03 6
2012-09-18 5
2012-11-23 5
..
2013-11-17 1
2008-12-23 1
2011-10-26 1
2010-06-15 1
2011-06-20 1
Name: Date, Length: 2141, dtype: int64
#First, the dates are not unique; there are duplicates
#Second, after we split into training and test sets, the dates are no longer consecutive but scattered
#Does a certain day of a certain year tend to be rainy, or tend to be dry?
#It is not the date that decides whether it rains; rather, that day's sunshine hours, humidity, temperature and so on decide it
#Looking at the date alone, it has no direct bearing on our prediction
#If we treat it as a continuous variable, the algorithm will think it is just a series of numbers around 1~3000 and will not realize these are dates
Xtrain.iloc[:,0].value_counts().count()
Out[20]: 2141
#If we treat it as a categorical variable, there are too many categories, 2,141 of them; converted to numbers it would be taken for a continuous variable again, and turned into dummy variables our feature dimension would explode
One of our features is called "Rainfall", the rainfall recorded on the current date at the current location, in other words "today's rainfall". Common sense says that whether it rains today should affect whether it rains tomorrow: some places have climates where, once it starts raining, it keeps raining for many days, while elsewhere a storm comes and goes quickly. We can therefore convert the continuous influence of time on the weather into the feature "did it rain today", cleverly turning a link between samples and labels into a link between a feature and the label.
Xtrain["Rainfall"].head(20)
Out[21]:
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.2
8 0.0
9 0.2
10 1.0
11 0.0
12 0.2
13 0.0
14 0.0
15 3.0
16 0.2
17 0.0
18 35.2
19 0.0
Name: Rainfall, dtype: float64
Xtrain["Rainfall"].isnull().sum()
Out[22]: 33
Xtrain.loc[Xtrain["Rainfall"] >= 1,"RainToday"] = "Yes"
Xtrain.loc[Xtrain["Rainfall"] < 1,"RainToday"] = "No"
Xtrain.loc[Xtrain["Rainfall"].isnull(),"RainToday"] = np.nan #== np.nan never matches; use isnull()
Xtest.loc[Xtest["Rainfall"] >= 1,"RainToday"] = "Yes"
Xtest.loc[Xtest["Rainfall"] < 1,"RainToday"] = "No"
Xtest.loc[Xtest["Rainfall"].isnull(),"RainToday"] = np.nan #== np.nan never matches; use isnull()
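The same three-way rule can also be written without row-wise .loc assignments; a sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy sketch of the RainToday rule: >=1mm -> "Yes", <1mm -> "No",
# and a missing Rainfall stays missing.
df = pd.DataFrame({"Rainfall": [0.0, 0.2, 3.0, np.nan, 35.2]})
rain = df["Rainfall"]
df["RainToday"] = (rain.ge(1)                        # NaN compares as False...
                   .map({True: "Yes", False: "No"})
                   .where(rain.notnull()))           # ...so restore NaN here
print(df)
```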
Xtrain.head()
Out[25]:
Date Location MinTemp ... Temp9am Temp3pm RainToday
0 2015-08-24 Katherine 17.5 ... 27.5 NaN No
1 2016-12-10 Tuggeranong 9.5 ... 14.6 23.6 No
2 2010-04-18 Albany 13.0 ... 17.5 20.8 No
3 2009-11-26 Sale 13.9 ... 18.5 27.5 No
4 2014-04-25 Mildura 6.0 ... 12.4 22.4 No
[5 rows x 22 columns]
Xtest.head()
Out[26]:
Date Location MinTemp ... Temp9am Temp3pm RainToday
0 2016-01-23 NorahHead 22.0 ... 26.2 23.1 Yes
1 2009-03-05 MountGambier 12.0 ... 14.8 17.5 Yes
2 2010-03-05 MountGinini 9.1 ... NaN NaN NaN
3 2013-10-26 Wollongong 13.1 ... 16.8 19.6 No
4 2016-11-28 Sale 12.2 ... 13.6 19.0 No
[5 rows x 22 columns]
We have created a feature, RainToday: did it rain today.
The date itself does not affect the weather, but the month and season it falls in do. Pick any day in the rainy season and tomorrow is bound to be more likely to rain than on a day outside it. Although we cannot make machine learning understand which months form which season, we can group the records by month, and through training the algorithm can sense that "this month, or this season, rains more easily". So we can extract the month (or the season) as a feature and discard the exact date. This gives us a second new feature, the month, "Month".
#extract the month
Xtrain.loc[0,"Date"].split("-")
Out[27]: ['2015', '08', '24']
int(Xtrain.loc[0,"Date"].split("-")[1])
Out[28]: 8
Xtrain["Date"] = Xtrain["Date"].apply(lambda x:int(x.split("-")[1]))
#after the replacement we need to change the column name
#rename is one of the few functions that can change a single column name
#usually we just write df.columns = some_list to replace all the column names at once
#but rename lets us change just one column
Xtrain.head()
Out[30]:
Date Location MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8 Katherine 17.5 36.0 ... NaN 27.5 NaN No
1 12 Tuggeranong 9.5 25.0 ... NaN 14.6 23.6 No
2 4 Albany 13.0 22.6 ... 3.0 17.5 20.8 No
3 11 Sale 13.9 29.8 ... 6.0 18.5 27.5 No
4 4 Mildura 6.0 23.5 ... 4.0 12.4 22.4 No
[5 rows x 22 columns]
Xtrain.loc[:,'Date'].value_counts()
Out[31]:
3 334
5 324
7 316
9 302
6 302
1 300
11 299
10 282
4 265
2 264
12 259
8 253
Name: Date, dtype: int64
Xtrain = Xtrain.rename(columns={"Date":"Month"})
Xtrain.head()
Out[32]:
Month Location MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8 Katherine 17.5 36.0 ... NaN 27.5 NaN No
1 12 Tuggeranong 9.5 25.0 ... NaN 14.6 23.6 No
2 4 Albany 13.0 22.6 ... 3.0 17.5 20.8 No
3 11 Sale 13.9 29.8 ... 6.0 18.5 27.5 No
4 4 Mildura 6.0 23.5 ... 4.0 12.4 22.4 No
[5 rows x 22 columns]
Xtest["Date"] = Xtest["Date"].apply(lambda x:int(x.split("-")[1]))
Xtest = Xtest.rename(columns={"Date":"Month"})
Xtest.head()
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Out[33]:
Month Location MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 1 NorahHead 22.0 27.8 ... NaN 26.2 23.1 Yes
1 3 MountGambier 12.0 18.6 ... 7.0 14.8 17.5 Yes
2 3 MountGinini 9.1 13.3 ... NaN NaN NaN NaN
3 10 Wollongong 13.1 20.3 ... NaN 16.8 19.6 No
4 11 Sale 12.2 20.0 ... 4.0 13.6 19.0 No
[5 rows x 22 columns]
From the date we have derived two new features: "did it rain today" and "month".
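Incidentally, the month can also be extracted with pandas' datetime accessor instead of string splitting; a sketch assuming ISO "YYYY-MM-DD" strings, as in this dataset:

```python
import pandas as pd

# Equivalent month extraction via to_datetime, on a few dates from above.
dates = pd.Series(["2015-08-24", "2016-12-10", "2010-04-18"])
months = pd.to_datetime(dates).dt.month
print(months.tolist())  # [8, 12, 4]
```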
1.3.3 Handling a difficult feature: the location
Because climates differ from place to place, different locations influence "will it rain tomorrow" differently. If we can convert each location into the climate of that place, we can bundle different cities into the same climate, and the rainfall behaviour reflected within one climate should be similar.
The approach: given the climates of the major cities nationwide and their latitudes and longitudes, we can compute the geographic distance from every weather station in our sample to each major city, find the major city closest to that station, and take that major city's climate as the climate of the sample's location.
Xtrain.loc[:,'Location'].value_counts().count()
Out[34]: 49
cityll = pd.read_csv(r"H:\程志偉\python\菜菜的機器學習skleaen課堂\SVM數據\cityll.csv",index_col=0)
city_climate = pd.read_csv(r"H:\程志偉\python\菜菜的機器學習skleaen課堂\SVM數據\Cityclimate.csv")
cityll.head()
Out[36]:
City Latitude Longitude Latitudedir Longitudedir
0 Adelaide 34.9285° 138.6007° S, E
1 Albany 35.0275° 117.8840° S, E
2 Albury 36.0737° 146.9135° S, E
3 Wodonga 36.1241° 146.8818° S, E
4 AliceSprings 23.6980° 133.8807° S, E
city_climate.head()
Out[37]:
City Climate
0 Adelaide Warm temperate
1 Albany Mild temperate
2 Albury Hot dry summer, cool winter
3 Wodonga Hot dry summer, cool winter
4 AliceSprings Hot dry summer, warm winter
To knock these two tables into usable shape, we first strip the degree symbol from the latitude and longitude in cityll, and then merge the two tables.
#strip the degree symbol
cityll.loc[0,'Latitude'][:-1]
Out[38]: '34.9285'
cityll["Latitudenum"] = cityll["Latitude"].apply(lambda x:float(x[:-1]))
cityll["Longitudenum"] = cityll["Longitude"].apply(lambda x:float(x[:-1]))
cityll.head()
Out[39]:
City Latitude Longitude ... Longitudedir Latitudenum Longitudenum
0 Adelaide 34.9285° 138.6007° ... E 34.9285 138.6007
1 Albany 35.0275° 117.8840° ... E 35.0275 117.8840
2 Albury 36.0737° 146.9135° ... E 36.0737 146.9135
3 Wodonga 36.1241° 146.8818° ... E 36.1241 146.8818
4 AliceSprings 23.6980° 133.8807° ... E 23.6980 133.8807
[5 rows x 7 columns]
citylld = cityll.iloc[:,[0,5,6]]
#add the climates from city_climate into our citylld
citylld["climate"] = city_climate.iloc[:,-1]
citylld.head()
__main__:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Out[40]:
City Latitudenum Longitudenum climate
0 Adelaide 34.9285 138.6007 Warm temperate
1 Albany 35.0275 117.8840 Mild temperate
2 Albury 36.0737 146.9135 Hot dry summer, cool winter
3 Wodonga 36.1241 146.8818 Hot dry summer, cool winter
4 AliceSprings 23.6980 133.8807 Hot dry summer, warm winter
citylld.loc[:,'climate'].value_counts()
Out[41]:
Hot dry summer, cool winter 24
Hot dry summer, warm winter 18
Warm temperate 18
High humidity summer, warm winter 17
Mild temperate 9
Cool temperate 9
Warm humid summer, mild winter 5
Name: climate, dtype: int64
To compute the distances we need every city present in the sample data. We assume that only locations appearing in the training set will appear in the test set.
samplecity = pd.read_csv(r"H:\程志偉\\samplecity.csv",index_col=0)
#apply the same processing to samplecity: strip the degree symbol from the coordinates and drop the direction columns
samplecity["Latitudenum"] = samplecity["Latitude"].apply(lambda x:float(x[:-1]))
samplecity["Longitudenum"] = samplecity["Longitude"].apply(lambda x:float(x[:-1]))
samplecityd = samplecity.iloc[:,[0,5,6]]
samplecityd.head()
Out[42]:
City Latitudenum Longitudenum
0 Canberra 35.2809 149.1300
1 Sydney 33.8688 151.2093
2 Perth 31.9505 115.8605
3 Darwin 12.4634 130.8456
4 Hobart 42.8821 147.3272
We now have the major cities' coordinates with their climates, and the coordinates of our samples' locations. Next we compute the distance from each sample location to every major city; the climate of the major city closest to a sample location becomes that sample's climate.
#first convert the angles from degrees to radians with radians
from math import radians, sin, cos, acos
citylld.loc[:,"slat"] = citylld.iloc[:,1].apply(lambda x : radians(x))
citylld.loc[:,"slon"] = citylld.iloc[:,2].apply(lambda x : radians(x))
samplecityd.loc[:,"elat"] = samplecityd.iloc[:,1].apply(lambda x : radians(x))
samplecityd.loc[:,"elon"] = samplecityd.iloc[:,2].apply(lambda x : radians(x))
citylld.head()
Out[46]:
City Latitudenum ... slat slon
0 Adelaide 34.9285 ... 0.609617 2.419039
1 Albany 35.0275 ... 0.611345 2.057464
2 Albury 36.0737 ... 0.629605 2.564124
3 Wodonga 36.1241 ... 0.630484 2.563571
4 AliceSprings 23.6980 ... 0.413608 2.336659
[5 rows x 6 columns]
samplecityd.head()
Out[47]:
City Latitudenum Longitudenum elat elon
0 Canberra 35.2809 149.1300 0.615768 2.602810
1 Sydney 33.8688 151.2093 0.591122 2.639100
2 Perth 31.9505 115.8605 0.557641 2.022147
3 Darwin 12.4634 130.8456 0.217527 2.283687
4 Hobart 42.8821 147.3272 0.748434 2.571345
import sys
for i in range(samplecityd.shape[0]):
    slat = citylld.loc[:,"slat"]
    slon = citylld.loc[:,"slon"]
    elat = samplecityd.loc[i,"elat"]
    elon = samplecityd.loc[i,"elon"]
    dist = 6371.01 * np.arccos(np.sin(slat)*np.sin(elat) +
                               np.cos(slat)*np.cos(elat)*np.cos(slon.values - elon))
    city_index = np.argsort(dist)[0]
    #after each computation, take the closest city, then copy that city and its climate into samplecityd
    samplecityd.loc[i,"closest_city"] = citylld.loc[city_index,"City"]
    samplecityd.loc[i,"climate"] = citylld.loc[city_index,"climate"]
#inspect the final result; we need to check that the city matching is basically correct
samplecityd.head()
Out[49]:
City Latitudenum ... closest_city climate
0 Canberra 35.2809 ... Canberra Cool temperate
1 Sydney 33.8688 ... Sydney Warm temperate
2 Perth 31.9505 ... Perth Warm temperate
3 Darwin 12.4634 ... Darwin High humidity summer, warm winter
4 Hobart 42.8821 ... Hobart Cool temperate
[5 rows x 7 columns]
#look at the distribution of climates
samplecityd["climate"].value_counts()
Out[50]:
Warm temperate 15
Mild temperate 10
Cool temperate 9
Hot dry summer, cool winter 6
High humidity summer, warm winter 4
Hot dry summer, warm winter 3
Warm humid summer, mild winter 2
Name: climate, dtype: int64
#once verified, extract the climate corresponding to each sample city and save it
locafinal = samplecityd.iloc[:,[0,-1]]
locafinal.head()
Out[52]:
City climate
0 Canberra Cool temperate
1 Sydney Warm temperate
2 Perth Warm temperate
3 Darwin High humidity summer, warm winter
4 Hobart Cool temperate
locafinal.columns = ["Location","Climate"]
#setting locafinal's index to the location here prepares for the map matching later
locafinal = locafinal.set_index(keys="Location")
locafinal.to_csv(r"H:\程志偉\python\\samplelocation.csv")
locafinal.head()
Out[56]:
Climate
Location
Canberra Cool temperate
Sydney Warm temperate
Perth Warm temperate
Darwin High humidity summer, warm winter
Hobart Cool temperate
Now that every sample city has its climate, we use the climate to replace the original city, that is, the weather-station name. Here we can use map, which matches each value of a feature against the lookup we provide and replaces the sample's original value with the mapped one.
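Note that Series.map accepts not only a dict but also another Series, matching on that Series' index, which is exactly why locafinal's index was set to the location names above. A toy sketch:

```python
import pandas as pd

# map matches each value against the lookup Series' index.
lookup = pd.Series({"Sydney": "Warm temperate", "Hobart": "Cool temperate"})
cities = pd.Series(["Hobart", "Sydney", "Hobart"])
print(cities.map(lookup).tolist())
```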
Xtrain.head()
Out[57]:
Month Location MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8 Katherine 17.5 36.0 ... NaN 27.5 NaN No
1 12 Tuggeranong 9.5 25.0 ... NaN 14.6 23.6 No
2 4 Albany 13.0 22.6 ... 3.0 17.5 20.8 No
3 11 Sale 13.9 29.8 ... 6.0 18.5 27.5 No
4 4 Mildura 6.0 23.5 ... 4.0 12.4 22.4 No
[5 rows x 22 columns]
#replace the values with map
import re
Xtrain["Location"] = Xtrain["Location"].map(locafinal.iloc[:,0])
Xtrain.head()
Out[58]:
Month Location ... Temp3pm RainToday
0 8 High humidity summer, warm winter ... NaN No
1 12 Cool temperate ... 23.6 No
2 4 Mild temperate ... 20.8 No
3 11 Mild temperate ... 27.5 No
4 4 Hot dry summer, cool winter ... 22.4 No
[5 rows x 22 columns]
#replace the contents of Location, making sure the matched climate strings contain no commas and no surrounding whitespace
#we use the re module to remove the commas
#re.sub(pattern to replace, replacement string, string to operate on)
#x.strip() removes the surrounding whitespace
Xtrain["Location"] = Xtrain["Location"].apply(lambda x:re.sub(",","",x.strip()))
Xtrain.head()
Out[60]:
Month Location MinTemp ... Temp9am Temp3pm RainToday
0 8 High humidity summer warm winter 17.5 ... 27.5 NaN No
1 12 Cool temperate 9.5 ... 14.6 23.6 No
2 4 Mild temperate 13.0 ... 17.5 20.8 No
3 11 Mild temperate 13.9 ... 18.5 27.5 No
4 4 Hot dry summer cool winter 6.0 ... 12.4 22.4 No
[5 rows x 22 columns]
Xtest["Location"] = Xtest["Location"].map(locafinal.iloc[:,0]).apply(lambda x:re.sub(",","",x.strip()))
#after changing the feature's contents, we replace the old column name "Location" with the new name "Climate"
Xtrain = Xtrain.rename(columns={"Location":"Climate"})
Xtest = Xtest.rename(columns={"Location":"Climate"})
Xtrain.head()
Out[62]:
Month Climate MinTemp ... Temp9am Temp3pm RainToday
0 8 High humidity summer warm winter 17.5 ... 27.5 NaN No
1 12 Cool temperate 9.5 ... 14.6 23.6 No
2 4 Mild temperate 13.0 ... 17.5 20.8 No
3 11 Mild temperate 13.9 ... 18.5 27.5 No
4 4 Hot dry summer cool winter 6.0 ... 12.4 22.4 No
[5 rows x 22 columns]
Xtest.head()
Out[63]:
Month Climate MinTemp ... Temp9am Temp3pm RainToday
0 1 Cool temperate 22.0 ... 26.2 23.1 Yes
1 3 Mild temperate 12.0 ... 14.8 17.5 Yes
2 3 Cool temperate 9.1 ... NaN NaN NaN
3 10 Warm temperate 13.1 ... 16.8 19.6 No
4 11 Mild temperate 12.2 ... 13.6 19.0 No
[5 rows x 22 columns]
That completes the location feature. Note that we have not yet converted it to numbers, i.e. we have not encoded it; we will encode it later together with the other categorical variables.
1.3.4 Handling categorical variables: missing values
#check how much is missing
Xtrain.isnull().mean()
Out[64]:
Month 0.000000
Climate 0.000000
MinTemp 0.004000
MaxTemp 0.003143
Rainfall 0.009429
Evaporation 0.433429
Sunshine 0.488571
WindGustDir 0.067714
WindGustSpeed 0.067714
WindDir9am 0.067429
WindDir3pm 0.024286
WindSpeed9am 0.009714
WindSpeed3pm 0.018000
Humidity9am 0.011714
Humidity3pm 0.026286
Pressure9am 0.098857
Pressure3pm 0.098857
Cloud9am 0.379714
Cloud3pm 0.401429
Temp9am 0.005429
Temp3pm 0.019714
RainToday 0.009429
dtype: float64
#first, find out which features are categorical
cate = Xtrain.columns[Xtrain.dtypes == "object"].tolist()
#besides the features whose dtype is "object", cloud cover is expressed in numbers but is categorical in nature
cloud = ["Cloud9am","Cloud3pm"]
cate = cate + cloud
cate
Out[66]:
['Climate',
'WindGustDir',
'WindDir9am',
'WindDir3pm',
'RainToday',
'Cloud9am',
'Cloud3pm']
#for categorical features, we impute with the mode
from sklearn.impute import SimpleImputer
si = SimpleImputer(missing_values=np.nan,strategy="most_frequent")
#note that we fit the imputer on the training data; essentially this computes the modes of the training set
si.fit(Xtrain.loc[:,cate])
Out[67]:
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
missing_values=nan, strategy='most_frequent', verbose=0)
#then we fill both the training set and the test set with the training-set modes
Xtrain.loc[:,cate] = si.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = si.transform(Xtest.loc[:,cate])
Xtrain.head()
Out[69]:
Month Climate MinTemp ... Temp9am Temp3pm RainToday
0 8 High humidity summer warm winter 17.5 ... 27.5 NaN No
1 12 Cool temperate 9.5 ... 14.6 23.6 No
2 4 Mild temperate 13.0 ... 17.5 20.8 No
3 11 Mild temperate 13.9 ... 18.5 27.5 No
4 4 Hot dry summer cool winter 6.0 ... 12.4 22.4 No
[5 rows x 22 columns]
Xtest.head()
Out[70]:
Month Climate MinTemp ... Temp9am Temp3pm RainToday
0 1 Cool temperate 22.0 ... 26.2 23.1 Yes
1 3 Mild temperate 12.0 ... 14.8 17.5 Yes
2 3 Cool temperate 9.1 ... NaN NaN No
3 10 Warm temperate 13.1 ... 16.8 19.6 No
4 11 Mild temperate 12.2 ... 13.6 19.0 No
[5 rows x 22 columns]
#check whether the categorical features still contain missing values
Xtrain.loc[:,cate].isnull().mean()
Out[71]:
Climate 0.0
WindGustDir 0.0
WindDir9am 0.0
WindDir3pm 0.0
RainToday 0.0
Cloud9am 0.0
Cloud3pm 0.0
dtype: float64
Xtest.loc[:,cate].isnull().mean()
Out[72]:
Climate 0.0
WindGustDir 0.0
WindDir9am 0.0
WindDir3pm 0.0
RainToday 0.0
Cloud9am 0.0
Cloud3pm 0.0
dtype: float64
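The key point of the fit/transform split above is that the fill value is the training-set mode, even when the test column looks different; a toy check:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The mode of the TRAINING column ("No") is what gets filled into the test set.
train_df = pd.DataFrame({"RainToday": ["No", "No", "Yes", np.nan]})
test_df = pd.DataFrame({"RainToday": ["Yes", np.nan]})
si = SimpleImputer(missing_values=np.nan, strategy="most_frequent").fit(train_df)
print(si.transform(test_df).ravel().tolist())  # ['Yes', 'No']
```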
1.3.5 Handling categorical variables: encoding
#encode all categorical variables as numbers, one number per category
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
#fit using the training set
oe = oe.fit(Xtrain.loc[:,cate])
#use the training-set encoding to transform both the training and test feature matrices
#if transforming the test matrix raises an error here, the test set contains a category never seen in the training set
Xtrain.loc[:,cate] = oe.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = oe.transform(Xtest.loc[:,cate])
Xtrain.loc[:,cate].head()
Out[76]:
Climate WindGustDir WindDir9am WindDir3pm RainToday Cloud9am Cloud3pm
0 1.0 2.0 6.0 0.0 0.0 0.0 7.0
1 0.0 6.0 4.0 6.0 0.0 7.0 7.0
2 4.0 13.0 4.0 0.0 0.0 1.0 3.0
3 4.0 8.0 3.0 8.0 0.0 6.0 6.0
4 2.0 5.0 0.0 6.0 0.0 2.0 4.0
Xtest.loc[:,cate].head()
Out[77]:
Climate WindGustDir WindDir9am WindDir3pm RainToday Cloud9am Cloud3pm
0 0.0 11.0 8.0 11.0 1.0 7.0 7.0
1 4.0 12.0 12.0 8.0 1.0 8.0 7.0
2 0.0 4.0 3.0 9.0 0.0 7.0 7.0
3 6.0 12.0 13.0 9.0 0.0 7.0 7.0
4 4.0 0.0 12.0 0.0 0.0 8.0 4.0
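If you need to know which number stands for which category, OrdinalEncoder records the per-column ordering in its categories_ attribute; a toy sketch:

```python
from sklearn.preprocessing import OrdinalEncoder

# categories_[i] lists column i's categories; a category's position is its code.
oe = OrdinalEncoder().fit([["No"], ["Yes"], ["No"]])
print(list(oe.categories_[0]))  # ['No', 'Yes']
print(oe.transform([["Yes"]]))  # [[1.]]
```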
1.3.6 Handling continuous variables: imputing missing values
col = Xtrain.columns.tolist()
for i in cate:
    col.remove(i)
col
Out[78]:
['Month',
'MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustSpeed',
'WindSpeed9am',
'WindSpeed3pm',
'Humidity9am',
'Humidity3pm',
'Pressure9am',
'Pressure3pm',
'Temp9am',
'Temp3pm']
#instantiate the imputer; the strategy "mean" fills with the mean
impmean = SimpleImputer(missing_values=np.nan,strategy = "mean")
#fit the imputer on the training set
impmean = impmean.fit(Xtrain.loc[:,col])
#impute the training set and the test set with the training means
Xtrain.loc[:,col] = impmean.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = impmean.transform(Xtest.loc[:,col])
Xtrain.head()
Out[82]:
Month Climate MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8.0 1.0 17.5 36.0 ... 7.0 27.5 21.719003 0.0
1 12.0 0.0 9.5 25.0 ... 7.0 14.6 23.600000 0.0
2 4.0 4.0 13.0 22.6 ... 3.0 17.5 20.800000 0.0
3 11.0 4.0 13.9 29.8 ... 6.0 18.5 27.500000 0.0
4 4.0 2.0 6.0 23.5 ... 4.0 12.4 22.400000 0.0
[5 rows x 22 columns]
Xtest.head()
Out[83]:
Month Climate MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 1.0 0.0 22.0 27.8 ... 7.0 26.200000 23.100000 1.0
1 3.0 4.0 12.0 18.6 ... 7.0 14.800000 17.500000 1.0
2 3.0 0.0 9.1 13.3 ... 7.0 16.989859 21.719003 0.0
3 10.0 6.0 13.1 20.3 ... 7.0 16.800000 19.600000 0.0
4 11.0 4.0 12.2 20.0 ... 4.0 13.600000 19.000000 0.0
[5 rows x 22 columns]
Xtrain.isnull().sum()
Out[84]:
Month 0
Climate 0
MinTemp 0
MaxTemp 0
Rainfall 0
Evaporation 0
Sunshine 0
WindGustDir 0
WindGustSpeed 0
WindDir9am 0
WindDir3pm 0
WindSpeed9am 0
WindSpeed3pm 0
Humidity9am 0
Humidity3pm 0
Pressure9am 0
Pressure3pm 0
Cloud9am 0
Cloud3pm 0
Temp9am 0
Temp3pm 0
RainToday 0
dtype: int64
Xtest.isnull().sum()
Out[85]:
Month 0
Climate 0
MinTemp 0
MaxTemp 0
Rainfall 0
Evaporation 0
Sunshine 0
WindGustDir 0
WindGustSpeed 0
WindDir9am 0
WindDir3pm 0
WindSpeed9am 0
WindSpeed3pm 0
Humidity9am 0
Humidity3pm 0
Pressure9am 0
Pressure3pm 0
Cloud9am 0
Cloud3pm 0
Temp9am 0
Temp3pm 0
RainToday 0
dtype: int64
1.3.7 Handling continuous variables: scaling
col.remove("Month")
col
Out[86]:
['MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustSpeed',
'WindSpeed9am',
'WindSpeed3pm',
'Humidity9am',
'Humidity3pm',
'Pressure9am',
'Pressure3pm',
'Temp9am',
'Temp3pm']
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss = ss.fit(Xtrain.loc[:,col])
Xtrain.loc[:,col] = ss.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = ss.transform(Xtest.loc[:,col])
Xtrain.head()
Out[89]:
Month Climate MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 8.0 1.0 0.826375 1.774044 ... 7.0 1.612270 0.000000 0.0
1 12.0 0.0 -0.427048 0.244031 ... 7.0 -0.366608 0.270238 0.0
2 4.0 4.0 0.121324 -0.089790 ... 3.0 0.078256 -0.132031 0.0
3 11.0 4.0 0.262334 0.911673 ... 6.0 0.231658 0.830540 0.0
4 4.0 2.0 -0.975421 0.035393 ... 4.0 -0.704091 0.097837 0.0
[5 rows x 22 columns]
Xtest.head()
Out[90]:
Month Climate MinTemp MaxTemp ... Cloud3pm Temp9am Temp3pm RainToday
0 1.0 0.0 1.531425 0.633489 ... 7.0 1.412848 0.198404 1.0
1 3.0 4.0 -0.035354 -0.646158 ... 7.0 -0.335927 -0.606132 1.0
2 3.0 0.0 -0.489720 -1.383346 ... 7.0 0.000000 0.000000 0.0
3 10.0 6.0 0.136992 -0.409702 ... 7.0 -0.029125 -0.304431 0.0
4 11.0 4.0 -0.004018 -0.451429 ... 4.0 -0.520009 -0.390632 0.0
[5 rows x 22 columns]
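As a sanity check, columns standardized this way have mean ≈ 0 and standard deviation ≈ 1 on the training set (only approximately on the test set, since the scaler was fit on the training set alone). A toy sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=20, scale=7, size=(100, 2))    # e.g. two temperature columns
holdout = rng.normal(loc=21, scale=7, size=(30, 2))

ss = StandardScaler().fit(train)   # fit on the training set only
train_s = ss.transform(train)
holdout_s = ss.transform(holdout)  # reuses the training mean and std
print(train_s.mean(axis=0).round(8))  # ~[0. 0.]
print(train_s.std(axis=0).round(8))   # ~[1. 1.]
```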
Ytrain.head()
Out[91]:
0
0 0
1 0
2 0
3 1
4 0
Ytest.head()
Out[92]:
0
0 0
1 0
2 1
3 0
4 0
1.4 Modeling and model evaluation
from time import time
import datetime
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, recall_score
#naturally our model of choice is the support vector classifier SVC; first use a learning curve over the kernels to choose the kernel function
#we want to watch accuracy, recall and the AUC score at the same time
times = time() #SVM is a computationally heavy model, so we monitor the running time throughout
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
              ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest) #get the model's predictions
    score = clf.score(Xtest,Ytest) #returns the accuracy
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" % (kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
linear 's testing accuracy 0.844000, recall is 0.469388', auc is 0.869029
00:07:008706
poly 's testing accuracy 0.840667, recall is 0.457726', auc is 0.868157
00:07:870836
rbf 's testing accuracy 0.813333, recall is 0.306122', auc is 0.814873
00:10:493201
sigmoid 's testing accuracy 0.655333, recall is 0.154519', auc is 0.437308
00:11:395323
We notice that accuracy and the AUC are just passable, but recall is not high under any kernel. By comparison, the linear kernel actually works best. So now we can start thinking: in this situation, in which direction should we tune? What do we want most?
We can have different goals:
One, we want to identify the minority class at any cost and obtain the highest recall.
Two, we pursue the highest prediction accuracy; everything serves a higher accuracy, and we do not care about recall or AUC.
Three, we want a balance between recall, ROC and accuracy, neither chasing any single one nor sacrificing any.
1.5 Model tuning
1.5.1 Pursuing the highest recall
times = time()
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
              ,class_weight = "balanced"
              ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" %(kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
linear 's testing accuracy 0.796667, recall is 0.775510', auc is 0.870062
00:05:915204
poly 's testing accuracy 0.793333, recall is 0.763848', auc is 0.871448
00:06:949940
rbf 's testing accuracy 0.803333, recall is 0.600583', auc is 0.819713
00:09:667872
sigmoid 's testing accuracy 0.562000, recall is 0.282799', auc is 0.437119
00:11:633267
Having locked in the linear kernel, we can even tune class_weight further toward the minority class to raise recall regardless of cost.
times = time()
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
              ,class_weight = {1:10} #note: this actually means class 1 gets weight 10, with the ratio class 0:1 left implicit
              ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" %(kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
linear 's testing accuracy 0.636667, recall is 0.912536', auc is 0.866360
00:12:969724
poly 's testing accuracy 0.634667, recall is 0.912536', auc is 0.866885
00:14:926113
rbf 's testing accuracy 0.790000, recall is 0.553936', auc is 0.802820
00:18:623275
sigmoid 's testing accuracy 0.228667, recall is 1.000000', auc is 0.436592
00:21:038490
As recall rises without restraint, our accuracy drops very sharply, though the AUC holds up, staying steadily around 0.86. If our goal right now is simply a fairly high AUC score together with good recall, the model is already quite decent. Although accuracy is now very low, we really are capturing almost every rainy day.
1.5.2 Pursuing the highest accuracy
If our samples were extremely imbalanced and many majority-class samples were being misjudged, we could let the model wilfully classify every sample as 0 and ignore the minority class completely. So first, check how the majority class is doing.
Ytrain = Ytrain.iloc[:,0].ravel()
Ytest = Ytest.iloc[:,0].ravel()
valuec = pd.Series(Ytest).value_counts()
valuec
Out[98]:
0 1157
1 343
dtype: int64
valuec[0]/valuec.sum()
Out[99]: 0.7713333333333333
Classifying everything as the majority class gives an accuracy of 0.7713, while the accuracy above was 0.8440, which means some minority-class samples are also being classified correctly.
We can use the confusion matrix to compute the specificity; if the specificity is very high, it proves there is little left to squeeze out of the majority class.
from sklearn.metrics import confusion_matrix as CM
clf = SVC(kernel = "linear"
,gamma="auto"
,cache_size = 5000
).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
cm = CM(Ytest,result,labels=(1,0))
cm
Out[100]:
array([[ 161, 182],
[ 52, 1105]], dtype=int64)
specificity = cm[1,1]/cm[1,:].sum()
specificity
Out[101]: 0.9550561797752809
#almost all the 0s are classified correctly, and quite a few 1s as well
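Because we passed labels=(1,0), the matrix rows and columns are ordered minority class first, so the majority class "0" sits in the second row and specificity is cm[1,1]/cm[1,:].sum(). A toy sketch:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# With labels=(1, 0) the row/column order is [1, 0]:
# row 1 holds the true 0s, so specificity = TN / (TN + FP).
y_true = np.array([1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1])
cm = confusion_matrix(y_true, y_pred, labels=(1, 0))
print(cm)
print(cm[1, 1] / cm[1, :].sum())  # 0.75
```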
We can try using class_weight to shift the model slightly toward the minority class and see whether we have any room left to raise accuracy. If a slight shift toward the minority class yields higher accuracy, the model has not yet hit its limit.
irange = np.linspace(0.01,0.05,10)
for i in irange:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
              ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" % (1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
under ratio 1:1.010000 testing accuracy 0.844667, recall is 0.475219', auc is 0.869157
00:05:282753
under ratio 1:1.014444 testing accuracy 0.844667, recall is 0.478134', auc is 0.869185
00:06:590835
under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134', auc is 0.869198
00:06:539148
under ratio 1:1.023333 testing accuracy 0.845333, recall is 0.481050', auc is 0.869175
00:06:203914
under ratio 1:1.027778 testing accuracy 0.844000, recall is 0.481050', auc is 0.869394
00:06:585682
under ratio 1:1.032222 testing accuracy 0.844000, recall is 0.481050', auc is 0.869528
00:06:291609
under ratio 1:1.036667 testing accuracy 0.844000, recall is 0.481050', auc is 0.869659
00:05:643494
under ratio 1:1.041111 testing accuracy 0.844667, recall is 0.483965', auc is 0.869629
00:06:332509
under ratio 1:1.045556 testing accuracy 0.844667, recall is 0.483965', auc is 0.869712
00:06:435075
under ratio 1:1.050000 testing accuracy 0.845333, recall is 0.486880', auc is 0.869863
00:06:337505
A surprise: our best accuracy is 84.53%, beating the 84.40% we obtained when doing nothing at all. Clearly the model still has potential, so we can refine the learning curve further.
irange_ = np.linspace(0.018889,0.027778,10)
for i in irange_:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
              ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" %(1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134, auc is 0.869213
00:05:366834
under ratio 1:1.019877 testing accuracy 0.844000, recall is 0.478134, auc is 0.869228
00:05:415828
under ratio 1:1.020864 testing accuracy 0.844000, recall is 0.478134, auc is 0.869218
00:05:250741
under ratio 1:1.021852 testing accuracy 0.844667, recall is 0.478134, auc is 0.869188
00:05:137647
under ratio 1:1.022840 testing accuracy 0.844667, recall is 0.478134, auc is 0.869220
00:05:145678
under ratio 1:1.023827 testing accuracy 0.844667, recall is 0.481050, auc is 0.869188
00:05:224714
under ratio 1:1.024815 testing accuracy 0.844667, recall is 0.481050, auc is 0.869231
00:04:954503
under ratio 1:1.025803 testing accuracy 0.844000, recall is 0.481050, auc is 0.869253
00:05:323782
under ratio 1:1.026790 testing accuracy 0.844000, recall is 0.481050, auc is 0.869314
00:05:072606
under ratio 1:1.027778 testing accuracy 0.844667, recall is 0.481050, auc is 0.869374
00:05:165673
The model did not improve: no value beat our earlier 84.53% accuracy. Clearly, without balancing the samples, accuracy is already very close to its ceiling, and nudging the model toward the minority class cannot produce a qualitative change.
If we truly want higher accuracy, our only option is to switch models.
from sklearn.linear_model import LogisticRegression as LR
logclf = LR(solver="liblinear").fit(Xtrain, Ytrain)
logclf.score(Xtest,Ytest)
Out[105]: 0.8486666666666667
C_range = np.linspace(3, 5, 10)
for C in C_range:
    logclf = LR(solver="liblinear", C=C).fit(Xtrain, Ytrain)
    print(C, logclf.score(Xtest, Ytest))
3.0 0.8493333333333334
3.2222222222222223 0.8493333333333334
3.4444444444444446 0.8493333333333334
3.6666666666666665 0.8493333333333334
3.888888888888889 0.8493333333333334
4.111111111111111 0.8493333333333334
4.333333333333333 0.8493333333333334
4.555555555555555 0.8493333333333334
4.777777777777778 0.8493333333333334
5.0 0.8493333333333334
Although we achieved a very small improvement, it is clear that the model's accuracy still has not changed qualitatively.
1.5.3 Pursuing balance
import matplotlib.pyplot as plt

C_range = np.linspace(0.01, 20, 20)
recallall = []
aucall = []
scoreall = []
for C in C_range:
    times = time()
    clf = SVC(kernel="linear", C=C, cache_size=5000
              , class_weight="balanced"
              ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest, Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest, clf.decision_function(Xtest))
    recallall.append(recall)
    aucall.append(auc)
    scoreall.append(score)
    print("under C %f, testing accuracy is %f, recall is %f, auc is %f" % (C, score, recall, auc))
    print(datetime.datetime.fromtimestamp(time() - times).strftime("%M:%S:%f"))

print(max(aucall), C_range[aucall.index(max(aucall))])
plt.figure()
plt.plot(C_range, recallall, c="red", label="recall")
plt.plot(C_range, aucall, c="black", label="auc")
plt.plot(C_range, scoreall, c="orange", label="accuracy")
plt.legend()
plt.show()
under C 0.010000, testing accuracy is 0.800000, recall is 0.752187, auc is 0.870634
00:00:759538
under C 1.062105, testing accuracy is 0.796000, recall is 0.775510, auc is 0.870024
00:06:290471
under C 2.114211, testing accuracy is 0.794000, recall is 0.772595, auc is 0.870160
00:11:766379
under C 3.166316, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870165
00:15:810231
under C 4.218421, testing accuracy is 0.796000, recall is 0.775510, auc is 0.870112
00:20:414523
under C 5.270526, testing accuracy is 0.796000, recall is 0.775510, auc is 0.870082
00:24:632509
under C 6.322632, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870100
00:29:588032
under C 7.374737, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870022
00:34:028188
under C 8.426842, testing accuracy is 0.796000, recall is 0.775510, auc is 0.870090
00:37:620724
under C 9.478947, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870123
00:44:379563
under C 10.531053, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870092
00:47:067436
under C 11.583158, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870097
00:50:203707
under C 12.635263, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870019
00:55:903715
under C 13.687368, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870039
00:58:575636
under C 14.739474, testing accuracy is 0.795333, recall is 0.772595, auc is 0.869986
01:04:257676
under C 15.791579, testing accuracy is 0.795333, recall is 0.772595, auc is 0.869997
01:07:978324
under C 16.843684, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870032
01:13:093954
under C 17.895789, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870024
01:16:849644
under C 18.947895, testing accuracy is 0.795333, recall is 0.772595, auc is 0.870014
01:21:223715
under C 20.000000, testing accuracy is 0.794667, recall is 0.772595, auc is 0.870047
01:25:687908
0.8706340666900172 0.01
Once C rises above 1, the model's performance stabilizes: as C keeps growing, none of the metrics improve noticeably. Our C range was arguably too wide, yet widening or narrowing it further only moves the AUC up and down around 0.87, so tuning C cannot produce a qualitative change in any metric. Let's plug the best C found so far into the model and look at the exact accuracy and recall.
times = time()
clf = SVC(kernel="linear", C=3.1663157894736838, cache_size=5000
          , class_weight="balanced"
          ).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
score = clf.score(Xtest, Ytest)
recall = recall_score(Ytest, result)
auc = roc_auc_score(Ytest, clf.decision_function(Xtest))
print("testing accuracy %f, recall is %f, auc is %f" % (score, recall, auc))
print(datetime.datetime.fromtimestamp(time() - times).strftime("%M:%S:%f"))
testing accuracy 0.795333, recall is 0.772595, auc is 0.870165
00:16:030390
Can we improve this model by adjusting the decision threshold?
from sklearn.metrics import roc_curve as ROC
import matplotlib.pyplot as plt
FPR, Recall, thresholds = ROC(Ytest,clf.decision_function(Xtest),pos_label=1)
area = roc_auc_score(Ytest,clf.decision_function(Xtest))
plt.figure()
plt.plot(FPR, Recall, color='red',
label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

Using this model as a baseline, let's find the optimal threshold.
maxindex = (Recall - FPR).tolist().index(max(Recall - FPR))
thresholds[maxindex]
Out[111]: -0.08950517388953827
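The criterion above, maximizing Recall - FPR, is known as Youden's J statistic: it picks the point on the ROC curve farthest above the diagonal. A minimal self-contained sketch on synthetic scores (the labels, scores, and seed here are illustrative, not from the weather data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic binary problem: class 1 tends to get higher decision scores.
rng = np.random.RandomState(0)
y_true = np.array([0] * 50 + [1] * 50)
scores = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])

# roc_curve returns FPR, TPR (recall), and the thresholds that produce them.
fpr, recall, thresholds = roc_curve(y_true, scores)

# Youden's J = recall - fpr; the argmax is the threshold farthest
# above the ROC diagonal.
best = np.argmax(recall - fpr)
print("best threshold:", thresholds[best])
print("J statistic:", recall[best] - fpr[best])
```

Geometrically, this is the same quantity our `maxindex` computes, just written with `np.argmax` instead of converting to a list.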
Based on the threshold we selected, we manually construct y_pred and compute the recall and accuracy at that threshold.
from sklearn.metrics import accuracy_score as AC
times = time()
clf = SVC(kernel = "linear",C=3.1663157894736838,cache_size = 5000
,class_weight = "balanced"
).fit(Xtrain, Ytrain)
prob = pd.DataFrame(clf.decision_function(Xtest))
prob.loc[prob.iloc[:,0] >= thresholds[maxindex],"y_pred"]=1
prob.loc[prob.iloc[:,0] < thresholds[maxindex],"y_pred"]=0
prob.loc[:,"y_pred"].isnull().sum()
Out[113]: 0
# check the model's accuracy at this threshold
score = AC(Ytest, prob.loc[:, "y_pred"].values)
recall = recall_score(Ytest, prob.loc[:, "y_pred"])
print("testing accuracy %f, recall is %f" % (score, recall))
print(datetime.datetime.fromtimestamp(time() - times).strftime("%M:%S:%f"))
testing accuracy 0.789333, recall is 0.804665
01:04:299623
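The DataFrame-based thresholding above can also be written as a single vectorized NumPy comparison, which avoids the two `.loc` assignments and the NaN check. A minimal sketch with illustrative scores and a stand-in threshold (the real code would use `clf.decision_function(Xtest)` and `thresholds[maxindex]`):

```python
import numpy as np

threshold = -0.09                         # stand-in for thresholds[maxindex]
scores = np.array([-1.2, -0.5, -0.09, 0.3, 1.7])  # stand-in decision scores

# Every score at or above the threshold is predicted positive.
y_pred = (scores >= threshold).astype(int)
print(y_pred)                             # -> [0 0 1 1 1]
```

Because the comparison covers every element, no prediction can be left unset, which is exactly what the `isnull().sum() == 0` check verifies in the DataFrame version.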
This is actually worse than the unadjusted result. Clearly, if balance is what we are after, SVC's own output is already very close to optimal: adjusting the threshold, tuning C, and tuning class_weight are all not guaranteed to help.