二手車數據挖掘-模型融合

Datawhale 零基礎入門數據挖掘-Task5 模型融合

五、模型融合

Tip:此部分爲零基礎入門數據挖掘的 Task5 模型融合 部分,帶你來了解各種模型結果的融合方式,在比賽的攻堅時刻衝刺Top,歡迎大家後續多多交流。

賽題:零基礎入門數據挖掘 - 二手車交易價格預測

地址:https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

5.1 模型融合目標

  • 對於多種調參完成的模型進行模型融合。

  • 完成對於多種模型的融合,提交融合結果並打卡。

5.2 內容介紹

模型融合是比賽後期一個重要的環節,大體來說有如下的類型方式。

  1. 簡單加權融合:

    • 迴歸(分類概率):算術平均融合(Arithmetic mean),幾何平均融合(Geometric mean);
    • 分類:投票(Voting)
    • 綜合:排序融合(Rank averaging),log融合
  2. stacking/blending:

    • 構建多層模型,並利用預測結果再擬合預測。
  3. boosting/bagging(在xgboost,Adaboost,GBDT中已經用到):

    • 多樹的提升方法

5.3 Stacking相關理論介紹

1) 什麼是 stacking

簡單來說 stacking 就是當用初始訓練數據學習出若干個基學習器後,將這幾個學習器的預測結果作爲新的訓練集,來學習一個新的學習器。

將個體學習器結合在一起的時候使用的方法叫做結合策略。對於分類問題,我們可以使用投票法來選擇輸出最多的類。對於迴歸問題,我們可以將分類器輸出的結果求平均值。

上面說的投票法和平均法都是很有效的結合策略,還有一種結合策略是使用另外一個機器學習算法來將個體機器學習器的結果結合在一起,這個方法就是Stacking。

在stacking方法中,我們把個體學習器叫做初級學習器,用於結合的學習器叫做次級學習器或元學習器(meta-learner),次級學習器用於訓練的數據叫做次級訓練集。次級訓練集是在訓練集上用初級學習器得到的。

2) 如何進行 stacking

算法示意圖如下:

引用自 西瓜書《機器學習》

  • 過程1-3 是訓練出來個體學習器,也就是初級學習器。
  • 過程5-9是 使用訓練出來的個體學習器來得預測的結果,這個預測的結果當做次級學習器的訓練集。
  • 過程11 是用初級學習器預測的結果訓練出次級學習器,得到我們最後訓練的模型。

3)Stacking的方法講解

首先,我們先從一種“不那麼正確”但是容易懂的Stacking方法講起。

Stacking模型本質上是一種分層的結構,這裏簡單起見,只分析二級Stacking.假設我們有2個基模型 Model1_1、Model1_2 和 一個次級模型Model2

Step 1. 基模型 Model1_1,對訓練集train訓練,然後用於預測 train 和 test 的標籤列,分別是P1,T1

Model1_1 模型訓練:

KaTeX parse error: Expected '}', got '_' at position 120: …^{\text {Model1_̲1 Train} }\left…

訓練後的模型 Model1_1 分別在 train 和 test 上預測,得到預測標籤分別是P1,T1

KaTeX parse error: Expected '}', got '_' at position 120: …^{\text {Model1_̲1 Predict} }\le…

KaTeX parse error: Expected '}', got '_' at position 119: …^{\text {Model1_̲1 Predict} }\le…

Step 2. 基模型 Model1_2 ,對訓練集train訓練,然後用於預測train和test的標籤列,分別是P2,T2

Model1_2 模型訓練:

KaTeX parse error: Expected '}', got '_' at position 120: …^{\text {Model1_̲2 Train} }\left…

訓練後的模型 Model1_2 分別在 train 和 test 上預測,得到預測標籤分別是P2,T2

KaTeX parse error: Expected '}', got '_' at position 120: …^{\text {Model1_̲2 Predict} }\le…

KaTeX parse error: Expected '}', got '_' at position 119: …^{\text {Model1_̲2 Predict} }\le…

Step 3. 分別把P1,P2以及T1,T2合併,得到一個新的訓練集和測試集train2,test2.

KaTeX parse error: Expected '}', got '_' at position 159: …}^{\text {Train_̲2 }} and \ov…

再用 次級模型 Model2 以真實訓練集標籤爲標籤訓練,以train2爲特徵進行訓練,預測test2,得到最終的測試集預測的標籤列 YPreY_{Pre}

KaTeX parse error: Expected '}', got '_' at position 159: …}^{\text {Train_̲2 }} \overbrace…

KaTeX parse error: Expected '}', got '_' at position 158: …)}^{\text {Test_̲2 }} \overbrace…

這就是我們兩層堆疊的一種基本的原始思路想法。在不同模型預測的結果基礎上再加一層模型,進行再訓練,從而得到模型最終的預測。

Stacking本質上就是這麼直接的思路,但是直接這樣有時對於如果訓練集和測試集分佈不那麼一致的情況下是有一點問題的,其問題在於用初始模型訓練的標籤再利用真實標籤進行再訓練,毫無疑問會導致一定的模型過擬合訓練集,這樣或許模型在測試集上的泛化能力或者說效果會有一定的下降,因此現在的問題變成了如何降低再訓練的過擬合性,這裏我們一般有兩種方法。

    1. 次級模型儘量選擇簡單的線性模型
    1. 利用K折交叉驗證

K-折交叉驗證:
訓練:

預測:

5.4 代碼示例

5.4.1 迴歸\分類概率-融合:

1)簡單加權平均,結果直接融合

## 生成一些簡單的樣本數據,test_prei 代表第i個模型的預測值
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

# y_test_true 代表第模型的真實值
y_test_true = [1, 3, 2, 6] 
import numpy as np
import pandas as pd

## 定義結果的加權平均函數
def Weighted_method(test_pre1,test_pre2,test_pre3,w=[1/3,1/3,1/3]):
    Weighted_result = w[0]*pd.Series(test_pre1)+w[1]*pd.Series(test_pre2)+w[2]*pd.Series(test_pre3)
    return Weighted_result
from sklearn import metrics
# 各模型的預測結果計算MAE
print('Pred1 MAE:',metrics.mean_absolute_error(y_test_true, test_pre1))
print('Pred2 MAE:',metrics.mean_absolute_error(y_test_true, test_pre2))
print('Pred3 MAE:',metrics.mean_absolute_error(y_test_true, test_pre3))
Pred1 MAE: 0.1750000000000001
Pred2 MAE: 0.07499999999999993
Pred3 MAE: 0.10000000000000009
## 根據加權計算MAE
w = [0.3,0.4,0.3] # 定義比重權值
Weighted_pre = Weighted_method(test_pre1,test_pre2,test_pre3,w)
print('Weighted_pre MAE:',metrics.mean_absolute_error(y_test_true, Weighted_pre))
Weighted_pre MAE: 0.05750000000000027

可以發現加權結果相對於之前的結果是有提升的,這種我們稱其爲簡單的加權平均。

還有一些特殊的形式,比如mean平均,median平均

## 定義結果的加權平均函數
def Mean_method(test_pre1,test_pre2,test_pre3):
    Mean_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).mean(axis=1)
    return Mean_result
Mean_pre = Mean_method(test_pre1,test_pre2,test_pre3)
print('Mean_pre MAE:',metrics.mean_absolute_error(y_test_true, Mean_pre))
Mean_pre MAE: 0.06666666666666693
## 定義結果的加權平均函數
def Median_method(test_pre1,test_pre2,test_pre3):
    Median_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).median(axis=1)
    return Median_result
Median_pre = Median_method(test_pre1,test_pre2,test_pre3)
print('Median_pre MAE:',metrics.mean_absolute_error(y_test_true, Median_pre))
Median_pre MAE: 0.07500000000000007

2) Stacking融合(迴歸):

from sklearn import linear_model

def Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,test_pre1,test_pre2,test_pre3,model_L2= linear_model.LinearRegression()):
    model_L2.fit(pd.concat([pd.Series(train_reg1),pd.Series(train_reg2),pd.Series(train_reg3)],axis=1).values,y_train_true)
    Stacking_result = model_L2.predict(pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).values)
    return Stacking_result
## 生成一些簡單的樣本數據,test_prei 代表第i個模型的預測值
train_reg1 = [3.2, 8.2, 9.1, 5.2]
train_reg2 = [2.9, 8.1, 9.0, 4.9]
train_reg3 = [3.1, 7.9, 9.2, 5.0]
# y_test_true 代表第模型的真實值
y_train_true = [3, 8, 9, 5] 

test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

# y_test_true 代表第模型的真實值
y_test_true = [1, 3, 2, 6] 
model_L2= linear_model.LinearRegression()
Stacking_pre = Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,
                               test_pre1,test_pre2,test_pre3,model_L2)
print('Stacking_pre MAE:',metrics.mean_absolute_error(y_test_true, Stacking_pre))
Stacking_pre MAE: 0.042134831460675204

可以發現模型結果相對於之前有進一步的提升,這是我們需要注意的一點是,對於第二層Stacking的模型不宜選取的過於複雜,這樣會導致模型在訓練集上過擬合,從而使得在測試集上並不能達到很好的效果。

5.4.2 分類模型融合:

對於分類,同樣的可以使用融合方法,比如簡單投票,Stacking…

from sklearn.datasets import make_blobs
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

1)Voting投票機制:

Voting即投票機制,分爲軟投票和硬投票兩種,其原理採用少數服從多數的思想。

'''
硬投票:對多個模型直接進行投票,不區分模型結果的相對重要度,最終投票數最多的類爲最終被預測的類。
'''
iris = datasets.load_iris()

x=iris.data
y=iris.target
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)

clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2, subsample=0.7,
                     colsample_bytree=0.6, objective='binary:logistic')
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                              min_samples_leaf=63,oob_score=True)
clf3 = SVC(C=0.1)

# 硬投票
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)], voting='hard')
for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.96 (+/- 0.02) [XGBBoosting]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.95 (+/- 0.03) [SVM]


C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)


Accuracy: 0.95 (+/- 0.03) [Ensemble]


C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
'''
軟投票:和硬投票原理相同,增加了設置權重的功能,可以爲不同模型設置不同權重,進而區別模型不同的重要度。
'''
x=iris.data
y=iris.target
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)

clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2, subsample=0.8,
                     colsample_bytree=0.8, objective='binary:logistic')
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                              min_samples_leaf=63,oob_score=True)
clf3 = SVC(C=0.1, probability=True)

# 軟投票
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)], voting='soft', weights=[2, 1, 1])
clf1.fit(x_train, y_train)

for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.96 (+/- 0.02) [XGBBoosting]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.95 (+/- 0.03) [SVM]


C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)


Accuracy: 0.96 (+/- 0.02) [Ensemble]


C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\94890\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

2)分類的Stacking\Blending融合:

stacking是一種分層模型集成框架。

以兩層爲例,第一層由多個基學習器組成,其輸入爲原始訓練集,第二層的模型則是以第一層基學習器的輸出作爲訓練集進行再訓練,從而得到完整的stacking模型, stacking兩層模型都使用了全部的訓練數據。

'''
5-Fold Stacking
'''
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier,GradientBoostingClassifier
import pandas as pd
#創建訓練的數據集
data_0 = iris.data
data = data_0[:100,:]

target_0 = iris.target
target = target_0[:100]

#模型融合中使用到的各個單模型
clfs = [LogisticRegression(solver='lbfgs'),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]
 
#切分一部分數據作爲測試集
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=2020)

dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
dataset_blend_test = np.zeros((X_predict.shape[0], len(clfs)))

#5折stacking
n_splits = 5
skf = StratifiedKFold(n_splits)
skf = skf.split(X, y)

for j, clf in enumerate(clfs):
    #依次訓練各個單模型
    dataset_blend_test_j = np.zeros((X_predict.shape[0], 5))
    for i, (train, test) in enumerate(skf):
        #5-Fold交叉訓練,使用第i個部分作爲預測,剩餘的部分來訓練模型,獲得其預測的輸出作爲第i部分的新特徵。
        X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]
        clf.fit(X_train, y_train)
        y_submission = clf.predict_proba(X_test)[:, 1]
        dataset_blend_train[test, j] = y_submission
        dataset_blend_test_j[:, i] = clf.predict_proba(X_predict)[:, 1]
        #print(test)
    #對於測試集,直接用這k個模型的預測值均值作爲新的特徵。
    dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_blend_test[:, j]))

clf = LogisticRegression(solver='lbfgs')
clf.fit(dataset_blend_train, y)
y_submission = clf.predict_proba(dataset_blend_test)[:, 1]

print("Val auc Score of Stacking: %f" % (roc_auc_score(y_predict, y_submission)))

val auc Score: 1.000000
val auc Score: 0.500000
val auc Score: 0.500000
val auc Score: 0.500000
val auc Score: 0.500000
Val auc Score of Stacking: 1.000000

Blending,其實和Stacking是一種類似的多層模型融合的形式

其主要思路是把原始的訓練集先分成兩部分,比如70%的數據作爲新的訓練集,剩下30%的數據作爲測試集。

在第一層,我們在這70%的數據上訓練多個模型,然後去預測那30%數據的label,同時也預測test集的label。

在第二層,我們就直接用這30%數據在第一層預測的結果做爲新特徵繼續訓練,然後用test集第一層預測的label做特徵,用第二層訓練的模型做進一步預測

其優點在於:

  • 1.比stacking簡單(因爲不用進行k次的交叉驗證來獲得stacker feature)
  • 2.避開了一個信息泄露問題:generlizers和stacker使用了不一樣的數據集

缺點在於:

  • 1.使用了很少的數據(第二階段的blender只使用training set10%的量)
  • 2.blender可能會過擬合
  • 3.stacking使用多次的交叉驗證會比較穩健
    ‘’’
'''
Blending
'''
 
#創建訓練的數據集
#創建訓練的數據集
data_0 = iris.data
data = data_0[:100,:]

target_0 = iris.target
target = target_0[:100]
 
#模型融合中使用到的各個單模型
clfs = [LogisticRegression(solver='lbfgs'),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        #ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]
 
#切分一部分數據作爲測試集
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=2020)

#切分訓練數據集爲d1,d2兩部分
X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.5, random_state=2020)
dataset_d1 = np.zeros((X_d2.shape[0], len(clfs)))
dataset_d2 = np.zeros((X_predict.shape[0], len(clfs)))
 
for j, clf in enumerate(clfs):
    #依次訓練各個單模型
    clf.fit(X_d1, y_d1)
    y_submission = clf.predict_proba(X_d2)[:, 1]
    dataset_d1[:, j] = y_submission
    #對於測試集,直接用這k個模型的預測值作爲新的特徵。
    dataset_d2[:, j] = clf.predict_proba(X_predict)[:, 1]
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_d2[:, j]))

#融合使用的模型
clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=30)
clf.fit(dataset_d1, y_d2)
y_submission = clf.predict_proba(dataset_d2)[:, 1]
print("Val auc Score of Blending: %f" % (roc_auc_score(y_predict, y_submission)))
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
Val auc Score of Blending: 1.000000

參考博客:https://blog.csdn.net/Noob_daniel/article/details/76087829

3)分類的Stacking融合(利用mlxtend):

!pip install mlxtend

import warnings
warnings.filterwarnings('ignore')
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier

from sklearn.model_selection import cross_val_score
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions

# 以python自帶的鳶尾花數據集爲例
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)

label = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]

fig = plt.figure(figsize=(10,8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)

clf_cv_mean = []
clf_cv_std = []
for clf, label, grd in zip(clf_list, label, grid):
        
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %.2f (+/- %.2f) [%s]" %(scores.mean(), scores.std(), label))
    clf_cv_mean.append(scores.mean())
    clf_cv_std.append(scores.std())
        
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(label)

plt.show()
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Requirement already satisfied: mlxtend in c:\users\94890\anaconda3\lib\site-packages (0.17.2)
Requirement already satisfied: scipy>=1.2.1 in c:\users\94890\anaconda3\lib\site-packages (from mlxtend) (1.4.1)
Requirement already satisfied: setuptools in c:\users\94890\anaconda3\lib\site-packages (from mlxtend) (41.4.0)
Requirement already satisfied: numpy>=1.16.2 in c:\users\94890\anaconda3\lib\site-packages (from mlxtend) (1.16.5)
Requirement already satisfied: matplotlib>=3.0.0 in c:\users\94890\anaconda3\lib\site-packages (from mlxtend) (3.2.1)
Requirement already satisfied: joblib>=0.13.2 in c:\users\94890\anaconda3\lib\site-packages (from mlxtend) (0.14.1)
Requirement already satisfied: scikit-learn>=0.20.3 in c:\users\94890\anaconda3\lib\site-packages (from mlxtend) (0.21.3)
Requirement already satisfied: pandas>=0.24.2 in c:\users\94890\anaconda3\lib\site-packages (from mlxtend) (0.25.3)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\94890\anaconda3\lib\site-packages (from matplotlib>=3.0.0->mlxtend) (2.8.0)
Requirement already satisfied: cycler>=0.10 in c:\users\94890\anaconda3\lib\site-packages (from matplotlib>=3.0.0->mlxtend) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\94890\anaconda3\lib\site-packages (from matplotlib>=3.0.0->mlxtend) (2.4.2)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\94890\anaconda3\lib\site-packages (from matplotlib>=3.0.0->mlxtend) (1.1.0)
Requirement already satisfied: pytz>=2017.2 in c:\users\94890\anaconda3\lib\site-packages (from pandas>=0.24.2->mlxtend) (2019.3)
Requirement already satisfied: six>=1.5 in c:\users\94890\anaconda3\lib\site-packages (from python-dateutil>=2.1->matplotlib>=3.0.0->mlxtend) (1.12.0)
Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.93 (+/- 0.05) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [Naive Bayes]
Accuracy: 0.95 (+/- 0.03) [Stacking Classifier]



<Figure size 1000x800 with 4 Axes>

可以發現 基模型 用 ‘KNN’, ‘Random Forest’, ‘Naive Bayes’ 然後再這基礎上 次級模型加一個 ‘LogisticRegression’,模型測試效果有着很好的提升。

5.4.3 一些其它方法:

將特徵放進模型中預測,並將預測結果變換並作爲新的特徵加入原有特徵中再經過模型預測結果 (Stacking變化)

(可以反覆預測多次將結果加入最後的特徵中)

def Ensemble_add_feature(train,test,target,clfs):
    
    # n_flods = 5
    # skf = list(StratifiedKFold(y, n_folds=n_flods))

    train_ = np.zeros((train.shape[0],len(clfs*2)))
    test_ = np.zeros((test.shape[0],len(clfs*2)))

    for j,clf in enumerate(clfs):
        '''依次訓練各個單模型'''
        # print(j, clf)
        '''使用第1個部分作爲預測,第2部分來訓練模型,獲得其預測的輸出作爲第2部分的新特徵。'''
        # X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]
        
        clf.fit(train,target)
        y_train = clf.predict(train)
        y_test = clf.predict(test)

        ## 新特徵生成
        train_[:,j*2] = y_train**2
        test_[:,j*2] = y_test**2
        train_[:, j+1] = np.exp(y_train)
        test_[:, j+1] = np.exp(y_test)
        # print("val auc Score: %f" % r2_score(y_predict, dataset_d2[:, j]))
        print('Method ',j)
    train_ = pd.DataFrame(train_)
    test_ = pd.DataFrame(test_)
    return train_,test_

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

data_0 = iris.data
data = data_0[:100,:]

target_0 = iris.target
target = target_0[:100]

x_train,x_test,y_train,y_test=train_test_split(data,target,test_size=0.3)
x_train = pd.DataFrame(x_train) ; x_test = pd.DataFrame(x_test)

#模型融合中使用到的各個單模型
clfs = [LogisticRegression(),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]

New_train,New_test = Ensemble_add_feature(x_train,x_test,y_train,clfs)

clf = LogisticRegression()
# clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=30)
clf.fit(New_train, y_train)
y_emb = clf.predict_proba(New_test)[:, 1]

print("Val auc Score of stacking: %f" % (roc_auc_score(y_test, y_emb)))
Method  0
Method  1
Method  2
Method  3
Method  4
Val auc Score of stacking: 1.000000

5.4.4 本賽題示例

import pandas as pd
import numpy as np
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')
%matplotlib inline

import itertools
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
# from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
# from mlxtend.plotting import plot_learning_curves
# from mlxtend.plotting import plot_decision_regions

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import GridSearchCV,cross_val_score
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error
## 數據讀取
Train_data = pd.read_csv('data/train.csv', sep=' ')
TestA_data = pd.read_csv('data/testA.csv', sep=' ')

print(Train_data.shape)
print(TestA_data.shape)
(150000, 31)
(50000, 30)
Train_data.head()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 ... 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 ... 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 ... 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 ... 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 ... 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482

5 rows × 31 columns

numerical_cols = Train_data.select_dtypes(exclude = 'object').columns
print(numerical_cols)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')
feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','price']]
X_data = Train_data[feature_cols]
Y_data = Train_data['price']

X_test  = TestA_data[feature_cols]

print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
X train shape: (150000, 26)
X test shape: (50000, 26)
def Sta_inf(data):
    print('_min',np.min(data))
    print('_max:',np.max(data))
    print('_mean',np.mean(data))
    print('_ptp',np.ptp(data))
    print('_std',np.std(data))
    print('_var',np.var(data))
print('Sta of label:')
Sta_inf(Y_data)
Sta of label:
_min 11
_max: 99999
_mean 5923.327333333334
_ptp 99988
_std 7501.973469876438
_var 56279605.94272992
X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)
def build_model_lr(x_train,y_train):
    reg_model = linear_model.LinearRegression()
    reg_model.fit(x_train,y_train)
    return reg_model

def build_model_ridge(x_train,y_train):
    reg_model = linear_model.Ridge(alpha=0.8)#alphas=range(1,100,5)
    reg_model.fit(x_train,y_train)
    return reg_model

def build_model_lasso(x_train,y_train):
    reg_model = linear_model.LassoCV()
    reg_model.fit(x_train,y_train)
    return reg_model

def build_model_gbdt(x_train,y_train):
    estimator =GradientBoostingRegressor(loss='ls',subsample= 0.85,max_depth= 5,n_estimators = 100)
    param_grid = { 
            'learning_rate': [0.05,0.08,0.1,0.2],
            }
    gbdt = GridSearchCV(estimator, param_grid,cv=3)
    gbdt.fit(x_train,y_train)
    print(gbdt.best_params_)
    # print(gbdt.best_estimator_ )
    return gbdt

def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=120, learning_rate=0.08, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=5) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model

def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=63,n_estimators = 100)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm

2)XGBoost的五折交叉迴歸驗證實現

## xgb
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) # ,objective ='reg:squarederror'

scores_train = []
scores = []

## 5折交叉驗證方式
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(X_data,Y_data):
    
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)

print('Train mae:',np.mean(score_train))
print('Val mae',np.mean(scores))
Train mae: 600.0127885014529
Val mae 691.9976473362078

3)劃分數據集,並用多種方法訓練和預測

## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)

## Train and Predict
print('Predict LR...')
model_lr = build_model_lr(x_train,y_train)
val_lr = model_lr.predict(x_val)
subA_lr = model_lr.predict(X_test)

print('Predict Ridge...')
model_ridge = build_model_ridge(x_train,y_train)
val_ridge = model_ridge.predict(x_val)
subA_ridge = model_ridge.predict(X_test)

print('Predict Lasso...')
model_lasso = build_model_lasso(x_train,y_train)
val_lasso = model_lasso.predict(x_val)
subA_lasso = model_lasso.predict(X_test)

print('Predict GBDT...')
model_gbdt = build_model_gbdt(x_train,y_train)
val_gbdt = model_gbdt.predict(x_val)
subA_gbdt = model_gbdt.predict(X_test)

Predict LR...
Predict Ridge...
Predict Lasso...
Predict GBDT...
{'learning_rate': 0.2}

一般比賽中效果最爲顯著的兩種方法

print('predict XGB...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
subA_xgb = model_xgb.predict(X_test)

print('predict lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
subA_lgb = model_lgb.predict(X_test)
predict XGB...
predict lgb...
print('Sta inf of lgb:')
Sta_inf(subA_lgb)
Sta inf of lgb:
_min -249.6417522048869
_max: 88093.34871616827
_mean 5927.779131190231
_ptp 88342.99046837316
_std 7374.103714251954
_var 54377405.588544466

1)加權融合

def Weighted_method(test_pre1,test_pre2,test_pre3,w=[1/3,1/3,1/3]):
    Weighted_result = w[0]*pd.Series(test_pre1)+w[1]*pd.Series(test_pre2)+w[2]*pd.Series(test_pre3)
    return Weighted_result

## Init the Weight
w = [0.3,0.4,0.3]

## 測試驗證集準確度
val_pre = Weighted_method(val_lgb,val_xgb,val_gbdt,w)
MAE_Weighted = mean_absolute_error(y_val,val_pre)
print('MAE of Weighted of val:',MAE_Weighted)

## 預測數據部分
subA = Weighted_method(subA_lgb,subA_xgb,subA_gbdt,w)
print('Sta inf:')
Sta_inf(subA)
## 生成提交文件
sub = pd.DataFrame()
sub['SaleID'] = X_test.index
sub['price'] = subA
sub.to_csv('./sub_Weighted.csv',index=False)
MAE of Weighted of val: 737.629358950625
Sta inf:
_min -116.38826138390652
_max: 89118.06960546214
_mean 5928.988237800195
_ptp 89234.45786684605
_std 7358.429430715667
_var 54146483.6868225
## 與簡單的LR(線性迴歸)進行對比
val_lr_pred = model_lr.predict(x_val)
MAE_lr = mean_absolute_error(y_val,val_lr_pred)
print('MAE of lr:',MAE_lr)
MAE of lr: 2589.5545307333628

2)Starking融合

## Starking

## 第一層
train_lgb_pred = model_lgb.predict(x_train)
train_xgb_pred = model_xgb.predict(x_train)
train_gbdt_pred = model_gbdt.predict(x_train)

Strak_X_train = pd.DataFrame()
Strak_X_train['Method_1'] = train_lgb_pred
Strak_X_train['Method_2'] = train_xgb_pred
Strak_X_train['Method_3'] = train_gbdt_pred

Strak_X_val = pd.DataFrame()
Strak_X_val['Method_1'] = val_lgb
Strak_X_val['Method_2'] = val_xgb
Strak_X_val['Method_3'] = val_gbdt

Strak_X_test = pd.DataFrame()
Strak_X_test['Method_1'] = subA_lgb
Strak_X_test['Method_2'] = subA_xgb
Strak_X_test['Method_3'] = subA_gbdt
Strak_X_test.head()
Method_1 Method_2 Method_3
0 42044.434453 42000.722656 41179.870051
1 278.799486 292.818481 355.614197
2 7275.306257 7491.733398 6972.124464
3 11377.657060 11747.993164 11975.158007
4 572.089176 537.258118 583.171802
## level2-method 
model_lr_Stacking = build_model_lr(Strak_X_train,y_train)
## 訓練集
train_pre_Stacking = model_lr_Stacking.predict(Strak_X_train)
print('MAE of Stacking-LR:',mean_absolute_error(y_train,train_pre_Stacking))

## 驗證集
val_pre_Stacking = model_lr_Stacking.predict(Strak_X_val)
print('MAE of Stacking-LR:',mean_absolute_error(y_val,val_pre_Stacking))

## 預測集
print('Predict Stacking-LR...')
subA_Stacking = model_lr_Stacking.predict(Strak_X_test)

MAE of Stacking-LR: 629.0601202327235
MAE of Stacking-LR: 725.7825758739375
Predict Stacking-LR...
subA_Stacking[subA_Stacking<10]=10  ## 去除過小的預測值

sub = pd.DataFrame()
sub['SaleID'] = X_test.index
sub['price'] = subA_Stacking
sub.to_csv('./sub_Stacking.csv',index=False)
print('Sta inf:')
Sta_inf(subA_Stacking)
Sta inf:
_min 10.0
_max: 88840.28945529042
_mean 5929.120726344594
_ptp 88830.28945529042
_std 7419.570814377191
_var 55050031.069557816

3.4 經驗總結

比賽的融合這個問題,個人的看法來說其實涉及多個層面,也是提分和提升模型魯棒性的一種重要方法:

  • 1)結果層面的融合,這種是最常見的融合方法,其可行的融合方法也有很多,比如根據結果的得分進行加權融合,還可以做Log,exp處理等。在做結果融合的時候,有一個很重要的條件是模型結果的得分要比較近似,然後結果的差異要比較大,這樣的結果融合往往有比較好的效果提升。

  • 2)特徵層面的融合,這個層面其實感覺不叫融合,準確說可以叫分割,很多時候如果我們用同種模型訓練,可以把特徵進行切分給不同的模型,然後在後面進行模型或者結果融合有時也能產生比較好的效果。

  • 3)模型層面的融合,模型層面的融合可能就涉及模型的堆疊和設計,比如加Staking層,部分模型的結果作爲特徵輸入等,這些就需要多實驗和思考了,基於模型層面的融合最好不同模型類型要有一定的差異,用同種模型不同的參數的收益一般是比較小的。

Task 5-模型融合 END.

— By: ML67

    Email: [email protected]
    PS: 華中科技大學研究生, 長期混跡Tianchi等,希望和大家多多交流。
    github: https://github.com/mlw67 (近期會做一些書籍推導和代碼的整理)

關於Datawhale:

Datawhale是一個專注於數據科學與AI領域的開源組織,彙集了衆多領域院校和知名企業的優秀學習者,聚合了一羣有開源精神和探索精神的團隊成員。Datawhale 以“for the learner,和學習者一起成長”爲願景,鼓勵真實地展現自我、開放包容、互信互助、敢於試錯和勇於擔當。同時 Datawhale 用開源的理念去探索開源內容、開源學習和開源方案,賦能人才培養,助力人才成長,建立起人與人,人與知識,人與企業和人與未來的聯結。

本次數據挖掘路徑學習,專題知識將在天池分享,詳情可關注Datawhale:

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章