While re-validating the code from *Python Machine Learning Cookbook* under Python 3.5, I kept running into warnings and errors.
The warnings generally come from library updates: the book targets Python 2.x with much older library versions, and some modules have since been merged, so remember to use the updated names when calling the corresponding methods, or the red warnings will hurt your eyes.
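A recurring example of such a rename: the old sklearn.cross_validation module was folded into sklearn.model_selection. A minimal sketch of the change (assuming a scikit-learn recent enough, 0.20 or later, to have completed the removal):

```python
# Old location used by the Python 2.x-era book -- removed in scikit-learn 0.20,
# so importing it today raises ImportError:
# from sklearn.cross_validation import cross_val_score

# Updated location after the module merge:
from sklearn.model_selection import cross_val_score, validation_curve, learning_curve
```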
The actual exception is thrown on page 40, in the test that encodes a single data point, at this line:
input_data_encoded[i]=int(label_encoder[i].transform([input_data[i]]))
The exception is:
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape ()
Step-by-step analysis shows that input_data[i] is a single string, but the transform method expects a list-like argument, so it was changed to: [input_data[i]]
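The fix can be reproduced in isolation. A minimal sketch with made-up labels (not the car.data values):

```python
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
encoder.fit(['high', 'low', 'med'])   # classes_ is sorted: ['high', 'low', 'med']

# Passing a bare string triggers "ValueError: bad input shape ()" on this
# scikit-learn version:
# encoder.transform('high')

# Wrapping the value in a list returns a one-element array:
encoded = encoder.transform(['high'])
print(int(encoded[0]))   # 0 -- 'high' is first in the sorted class list
```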
With that fixed, the second problem appears at:
output_class=classifier.predict(input_data_encoded)
which fails with:
ValueError: Expected 2D array, got 1D array instead:
array=[ 3. 3. 0. 0. 2. 1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
This problem is similar to the first one: input_data_encoded has to be reshaped into a 2D array of shape (1, 6), i.e. a single sample with six features (not a 1D array, which is what caused the error).
Before reshaping, the value of input_data_encoded is: [0 0 1 1 2 0]
The reshape is done with:
input_data_encoded=input_data_encoded.reshape(1,6)
After reshaping it is: [[0 0 1 1 2 0]]
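The same fix can be checked with NumPy alone; note that reshape(1, -1), as suggested by the error message, is equivalent to the hard-coded reshape(1, 6) but does not pin the feature count:

```python
import numpy as np

encoded = np.array([0, 0, 1, 1, 2, 0])   # shape (6,): a 1-D array of features
sample = encoded.reshape(1, -1)          # shape (1, 6): one row = one sample
print(sample)          # [[0 0 1 1 2 0]]
print(sample.shape)    # (1, 6)
```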
Those are the tricky spots in this section. The fully corrected code is as follows:
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.model_selection import validation_curve
from sklearn.model_selection import learning_curve
# Use a font that can display Chinese characters in plots
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
input_file=r'D:\python\AI\2\car.data.txt'
x=[]
count=0
with open(input_file,'r') as f:
    for line in f.readlines():
        data=line[:-1].split(',')
        x.append(data)
x=np.array(x)
# Convert string labels to numeric values
label_encoder=[]
x_encoded=np.empty(x.shape)
for i,item in enumerate(x[0]):
    label_encoder.append(preprocessing.LabelEncoder())
    label_encoder[-1].fit(x[:,i])
    x_encoded[:,i]=label_encoder[-1].transform(x[:,i])
x=x_encoded[:, :-1].astype(int)
y=x_encoded[:,-1].astype(int)
# Train the classifier
params={'n_estimators':200,'max_depth':8,'random_state':7}
classifier=RandomForestClassifier(**params)
classifier.fit(x,y)
# Cross-validation
accuracy=model_selection.cross_val_score(classifier,x,y,scoring='accuracy',cv=3)
print('Accuracy of the classifier: '+str(round(100*accuracy.mean(),2))+'%')
# Encode and test a single data point
input_data=['high','high','3','4','small','high']
input_data_encoded=[-1]*len(input_data)
for i,item in enumerate(input_data):
    input_data_encoded[i]=int(label_encoder[i].transform([input_data[i]]))
input_data_encoded=np.array(input_data_encoded)
print('Before reshaping:',input_data_encoded)
# Predict and print the output for the data point
# Reshape the array into a single sample
input_data_encoded=input_data_encoded.reshape(1,6)
print('After reshaping:',input_data_encoded)
output_class=classifier.predict(input_data_encoded)
print('Output class:',label_encoder[-1].inverse_transform(output_class)[0])
# Define hyperparameters for the random forest classifier
# Test how the number of estimators affects the classifier
classifier = RandomForestClassifier(max_depth=4, random_state=7)
parameter_grid = np.linspace(25, 200, 8).astype(int)
train_scores, validation_scores = validation_curve(classifier, x, y,
        param_name='n_estimators', param_range=parameter_grid, cv=5)
print('\n##### Validation curve #####')
print('\nParam: n_estimators\nTraining scores:\n', train_scores)
print('\nParam: n_estimators\nValidation scores:\n', validation_scores)
# Plot
plt.figure()
plt.plot(parameter_grid, 100*np.average(train_scores, axis=1), color='black')
plt.title('Training curve')
plt.xlabel('Number of estimators')
plt.ylabel('Accuracy')
plt.show()
# Test how the maximum depth affects the classifier
classifier = RandomForestClassifier(n_estimators=20, random_state=7)
parameter_grid = np.linspace(2, 10, 5).astype(int)
train_scores, validation_scores = validation_curve(classifier, x, y,
        param_name='max_depth', param_range=parameter_grid, cv=5)
print(u'\nParam: max_depth\nTraining scores:\n', train_scores)
print(u'\nParam: max_depth\nValidation scores:\n', validation_scores)
# Plot
plt.figure()
plt.plot(parameter_grid, 100*np.average(train_scores, axis=1), color='black')
plt.title('Validation curve')
plt.xlabel('Maximum depth of the tree')
plt.ylabel('Accuracy')
plt.show()
# Generate the learning curve
classifier = RandomForestClassifier(random_state=7)
parameter_grid = np.array([200, 500, 800, 1100])
train_sizes, train_scores, validation_scores = learning_curve(classifier,
x, y, train_sizes=parameter_grid, cv=5)
print('\n##### Learning curve #####')
print('\nTraining scores:\n', train_scores)
print('\nValidation scores:\n', validation_scores)
# Plot
plt.figure()
plt.plot(parameter_grid, 100*np.average(train_scores, axis=1), color='black')
plt.title('Learning curve')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.show()