Model Tuning in Big Data: A Detailed Walkthrough

k-fold cross-validation
Step 1: Randomly split the original dataset into k folds, sampling without replacement.
Step 2: Hold out one fold as the test set and train the model on the remaining k-1 folds.
Step 3: Repeat step 2 k times, so that every fold serves as the test set exactly once (and as part of the training set the other k-1 times). Each round yields a model trained on its training folds; evaluate that model on the corresponding test fold and record the evaluation metric.
Step 4: Average the k test results; this mean is the estimate of model accuracy and serves as the model's performance measure under k-fold cross-validation.
Here we use 5-fold cross-validation; a minimal sketch of the procedure follows.
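A self-contained illustration of the four steps with scikit-learn (the toy data and the logistic-regression estimator are placeholders for illustration, not the article's own dataset or model):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data standing in for the real dataset (hypothetical).
X, y = make_classification(n_samples=500, n_features=10, random_state=2018)

# cross_val_score carries out steps 1-3: split into 5 folds, train on 4,
# score the held-out fold, and repeat until every fold has been the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring='roc_auc')
print(scores)         # one AUC value per fold
print(scores.mean())  # step 4: the averaged performance estimate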
Grid search
GridSearchCV exists to automate hyperparameter tuning: feed it the candidate parameters and it returns the best score and the parameters that produced it. The cost is that it fits one model for every parameter combination on every cross-validation fold; for the logistic-regression grid below, 7 values of C × 2 penalties × 5 folds already means 70 fits. This is practical for small datasets, but once the data volume scales up it becomes very hard to get results.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.metrics import (roc_auc_score, accuracy_score, precision_score,
                             recall_score, f1_score, roc_curve, auc)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

data_all = pd.read_csv(r'D:\data_all.csv', encoding='gbk')  # raw string so the backslash in the Windows path is not treated as an escape

X = data_all.drop(['status'], axis=1)
y = data_all['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)
# Standardize the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# LR
lr = LogisticRegression(random_state=2018, solver='liblinear')  # liblinear supports both the l1 and l2 penalties searched below
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(estimator=lr, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(X_test, y_test))
# DecisionTree
dt = DecisionTreeClassifier(random_state=2018)
param = {'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'],
         'max_depth': [2, 4, 6, 8], 'max_features': ['sqrt', 'log2', None]}
grid = GridSearchCV(estimator=dt, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)  # fit the search before reading its results
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(X_test, y_test))
# SVM
svc = svm.SVC(random_state=2018)
param = {'C': [1e-2, 1e-1, 1, 10], 'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}
grid = GridSearchCV(estimator=svc, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(X_test, y_test))
# RandomForest
rft = RandomForestClassifier(random_state=2018)
param = {'n_estimators': [10, 20, 50, 100], 'criterion': ['gini', 'entropy'],
         'max_depth': [2, 4, 6, 8, 10, None], 'max_features': ['sqrt', 'log2', None]}
grid = GridSearchCV(estimator=rft, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(X_test, y_test))
# GBDT
gb = GradientBoostingClassifier(random_state=2018)
param = {'max_features': ['sqrt', 'log2', None], 'learning_rate': [0.01, 0.1, 0.5, 1],
         'n_estimators': range(20, 200, 20), 'subsample': [0.2, 0.5, 0.7, 1.0]}
grid = GridSearchCV(estimator=gb, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(X_test, y_test))
# XGBoost
xgb_c = XGBClassifier(random_state=2018)
param = {'n_estimators': range(20, 200, 20), 'max_depth': [2, 6, 10], 'reg_lambda': [0.2, 0.5, 1]}
grid = GridSearchCV(estimator=xgb_c, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(X_test, y_test))
# LightGBM
lgbm_c = LGBMClassifier(random_state=2018)
param = {'learning_rate': [0.2, 0.5, 0.7], 'max_depth': range(1, 10, 2), 'n_estimators': range(20, 100, 10)}
grid = GridSearchCV(estimator=lgbm_c, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(X_test, y_test))
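
The per-metric functions imported at the top (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, auc) are never called in the listing above. A minimal sketch of how they could be applied to the best estimator found by the last search (assuming grid is the fitted LightGBM search and status is a binary label):

best = grid.best_estimator_                 # GridSearchCV refits this on the full training set by default
y_pred = best.predict(X_test)               # hard class predictions
y_prob = best.predict_proba(X_test)[:, 1]   # probability of the positive class
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(roc_auc_score(y_test, y_prob))
fpr, tpr, _ = roc_curve(y_test, y_prob)
print(auc(fpr, tpr))                        # same value as roc_auc_score above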

Source: https://www.itjmd.com/news/show-4312.html
