Machine Learning Applications (1)

Original

2020-07-01 02:53

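1. Boston House Price Prediction

This is a regression problem.
Using the Boston dataset, we standardize the data, fit a regressor, and then compare several models.
The code is as follows: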

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

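# 1. Data preparation (506 x 14, no missing values)
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this script needs an older scikit-learn release.
boston = load_boston()
print(boston.DESCR)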

x = boston.data
y = boston.target

# 2. Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
print('x_train.shape:', x_train.shape, '\n', 'x_test.shape:', x_test.shape, '\n',
      'y_train.shape:', y_train.shape, '\n', 'y_test.shape:', y_test.shape)
print(y_train.mean(), y_test.mean(), y_train.max())

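# 3. Inspect the data (13 feature columns plus the target as the last column)
df = pd.DataFrame(np.hstack((x, y.reshape(-1, 1))))
print(df.describe())  # print explicitly so the summary also shows outside a notebook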

# 4. Standardization (both scalers are fitted on the training split only)
ss_x = StandardScaler()
ss_y = StandardScaler()

x_train = ss_x.fit_transform(x_train)
x_test = ss_x.transform(x_test)
y_train = ss_y.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = ss_y.transform(y_test.reshape(-1, 1)).ravel()

print('x_train.shape:', x_train.shape, '\n', 'x_test.shape:', x_test.shape, '\n',
      'y_train.shape:', y_train.shape, '\n', 'y_test.shape:', y_test.shape)
print(y_train.mean(), y_test.mean(), y_train.max())

# 5. Regression
rfr = RandomForestRegressor()  # initialize the random forest regressor
rfr.fit(x_train, y_train)  # fit
rfr_y_predict = rfr.predict(x_test)  # predict

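# 6. Performance evaluation
print('Model built-in score (R-squared):', rfr.score(x_test, y_test))
print('R-squared:', r2_score(y_test, rfr_y_predict))
# MSE and MAE are reported on the original price scale; inverse_transform expects a 2-D array
y_test_orig = ss_y.inverse_transform(y_test.reshape(-1, 1)).ravel()
rfr_y_predict_orig = ss_y.inverse_transform(rfr_y_predict.reshape(-1, 1)).ravel()
print('MSE:', mean_squared_error(y_test_orig, rfr_y_predict_orig))
print('MAE:', mean_absolute_error(y_test_orig, rfr_y_predict_orig))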

# Multi-model comparison
estimators = {"svr kernel=linear regressor": SVR(kernel="linear"),
              "svr kernel=rbf regressor": SVR(kernel="rbf"),
              "svr kernel=poly regressor": SVR(kernel="poly"),
              "knr weights=uniform regressor": KNeighborsRegressor(weights='uniform'),
              "knr weights=distance regressor": KNeighborsRegressor(weights='distance'),
              "dtr regressor": DecisionTreeRegressor(),
              "randomforest regressor": RandomForestRegressor(),
              "GradientBoostingRegressor": GradientBoostingRegressor(),
              "lr": LinearRegression(),
              "sgdr": SGDRegressor()}

for key, estimator in estimators.items():
    estimator.fit(x_train, y_train)
    y_predict = estimator.predict(x_test)
    print(key, "R-squared:", r2_score(y_test, y_predict))
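
Step 6 reports R-squared on the standardized target but converts back to the original scale before computing MSE and MAE. A minimal sketch of why, in plain Python with made-up numbers (the helper names `mse`, `r2`, and `standardize` and the sample values are illustrative assumptions, not part of the script above): R-squared is unchanged by the linear rescaling that standardization applies, whereas MSE shrinks by the variance of the target and is only interpretable in the original units.

```python
from statistics import fmean, pstdev


def mse(y_true, y_pred):
    """Mean squared error."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def standardize(values, mean, std):
    """Apply z = (v - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# Made-up prices and predictions, for illustration only.
y_true = [24.0, 21.6, 34.7, 33.4, 36.2]
y_pred = [25.1, 20.0, 33.0, 35.0, 34.9]

# Population standard deviation (ddof=0); any mean/std pair would show the same effect.
mu, sigma = fmean(y_true), pstdev(y_true)
z_true = standardize(y_true, mu, sigma)
z_pred = standardize(y_pred, mu, sigma)

r2_original = r2(y_true, y_pred)
r2_scaled = r2(z_true, z_pred)
mse_original = mse(y_true, y_pred)
mse_scaled = mse(z_true, z_pred)

print("R-squared (original, standardized):", r2_original, r2_scaled)
print("MSE (original, standardized):", mse_original, mse_scaled)
```

The two R-squared values agree up to floating-point error, while the standardized MSE equals the original MSE divided by `sigma ** 2`. This is why the model comparison loop can rank models by R-squared directly on the standardized data.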

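2. Titanic Survival Prediction
This is a classification problem.
The data is prepared through feature selection, missing-value handling, and feature vectorization; decision tree, random forest, and other models are then used for prediction.
The code is as follows: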

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# 1. Data preparation
train = pd.read_csv('./basic-ml/data/titanic/train.csv')
test = pd.read_csv('./basic-ml/data/titanic/test.csv')

train.info()  # Age, Cabin, and Embarked contain missing values (info() prints by itself)
test.info()  # Age, Fare, and Cabin contain missing values

# Feature selection
selected_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

x_train = train[selected_features]
x_test = test[selected_features]
y_train = train['Survived']

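# 2. Handle missing values
x_train = x_train.copy()
x_test = x_test.copy()
print(x_train['Embarked'].value_counts())  # 'S' is the most frequent value

# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and may not apply to the DataFrame.
x_train['Embarked'] = x_train['Embarked'].fillna('S')
x_train['Age'] = x_train['Age'].fillna(x_train['Age'].mean())
x_test['Age'] = x_test['Age'].fillna(x_test['Age'].mean())
x_test['Fare'] = x_test['Fare'].fillna(x_test['Fare'].mean())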

# Re-check that no missing values remain
x_train.info()
x_test.info()

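# 3. Vectorize categorical features
vec = DictVectorizer(sparse=False)
# orient must be spelled 'records'; the abbreviation 'record' is no longer accepted by recent pandas
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
print(vec.feature_names_)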

# 4. Training (mean 5-fold cross-validated accuracy for each model)
estimators = {"DecisionTree": DecisionTreeClassifier(),
              "RandomForest": RandomForestClassifier(),
              "GradientBoosting": GradientBoostingClassifier(),
              "XGBC": XGBClassifier()}

for key, estimator in estimators.items():
    print(key, ':', cross_val_score(estimator, x_train, y_train, cv=5).mean())

# Predict with GradientBoosting
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
y_predict = gbc.predict(x_test)
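
Step 3 relies on DictVectorizer to turn the mixed numeric and string columns into a purely numeric matrix. As a rough sketch of that idea, and not scikit-learn's actual implementation, the hypothetical helpers `fit_feature_names` and `vectorize_records` below expand each string-valued feature into `name=value` indicator columns and pass numeric features through unchanged. The two sample passengers are made up.

```python
def fit_feature_names(records):
    """Collect sorted column names: 'name=value' for strings, 'name' for numbers."""
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)


def vectorize_records(records, feature_names):
    """Turn a list of dicts into rows of floats ordered like feature_names."""
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            is_category = isinstance(value, str)
            column = index.get(f"{key}={value}" if is_category else key)
            if column is None:
                continue  # feature or category not seen when fitting: skip it
            row[column] = 1.0 if is_category else float(value)
        rows.append(row)
    return rows


# Made-up passengers, for illustration only.
passengers = [
    {"Pclass": 3, "Sex": "male", "Age": 22.0, "Embarked": "S"},
    {"Pclass": 1, "Sex": "female", "Age": 38.0, "Embarked": "C"},
]

feature_names = fit_feature_names(passengers)
matrix = vectorize_records(passengers, feature_names)

print(feature_names)
for row in matrix:
    print(row)
```

Under this scheme `Sex` and `Embarked` become indicator columns, while `Pclass` stays a single numeric column because it is stored as an integer rather than a string.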
