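GBDT-LR combines two models: GBDT generates features, and LR is then fitted on those features.
GBDT (gradient boosted decision trees) is an ensemble of many trees. It first builds one decision tree, then builds another tree on the residual between the current model's prediction and the actual sample output, and repeats this step iteratively.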
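The residual-fitting loop can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not what `GradientBoostingClassifier` does internally: it uses squared-error regression with one-split stumps on a single feature, and the names `fit_stump`, `boost`, and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]  # (threshold, left value, right value)


def boost(xs, ys, n_rounds=5, learning_rate=0.5):
    """Repeatedly fit a stump to the current residuals and add it to the model."""
    preds = [0.0] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)))
    return stumps, losses


# Toy data (hypothetical): two clusters of target values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
stumps, losses = boost(xs, ys)
print(stumps[0][0])  # threshold chosen by the first stump
print(losses)        # squared error after each round
```

Each round only has to explain what the previous rounds left over, which is why the training loss shrinks from one round to the next.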
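Each iteration adds a tree whose splits are chosen for high gain, so every leaf acts as a discriminative feature. The size of the resulting feature space therefore equals the total number of leaf nodes in the GBDT trees, and these leaf features become the input to the LR model.
The example below walks through this process.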
1. Prepare the data
Use a scikit-learn dataset
# Public import paths; private modules such as sklearn.metrics.ranking
# are not available in recent scikit-learn releases.
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from scipy.sparse import hstack
#import warnings
#warnings.filterwarnings('ignore')
X, y = make_hastie_10_2(random_state=0)
x_train, x_test = X[:2000], X[2000:3000]
y_train, y_test = y[:2000], y[2000:3000]
2. Fit GBDT and LR separately
gb = GradientBoostingClassifier(n_estimators=100,
                                learning_rate=1.0,
                                max_depth=3,
                                random_state=0)
gb.fit(x_train, y_train)
score = roc_auc_score(y_test, gb.predict(x_test))
print("GBDT train data shape : {0} auc: {1}".format(x_train.shape, score))
lr = LogisticRegression()
lr.fit(x_train, y_train)
score = roc_auc_score(y_test, lr.predict(x_test))
print("LR train data shape : {0} auc: {1}".format(x_train.shape, score))
Output:
GBDT train data shape : (2000, 10) auc: 0.8978334613415259
LR train data shape : (2000, 10) auc: 0.5247535842293907
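Note that the listings pass the output of `predict` (hard class labels) to `roc_auc_score`, and the figures reported in this post come from that version. AUC is a ranking metric, so it is usually computed from scores such as `predict_proba(x_test)[:, 1]`. The sketch below shows how much the two can differ; `pairwise_auc` and the toy probabilities are hypothetical, written in plain Python for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y != 1]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


y_true = [-1, -1, 1, 1]               # labels in the same -1/+1 form as make_hastie_10_2
probabilities = [0.1, 0.2, 0.3, 0.4]  # hypothetical scores: perfectly ranked, all below 0.5
hard_labels = [1 if p >= 0.5 else -1 for p in probabilities]

auc_from_scores = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_scores)  # 1.0
print(auc_from_labels)  # 0.5
```

A model can rank samples perfectly and still look no better than chance once its scores are collapsed to labels at a fixed threshold.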
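n_estimators is the number of estimators (trees). More estimators means more leaf features for the LR model.
max_depth is the maximum depth of each individual estimator.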
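These two parameters bound the size of the feature space. A rough calculation, assuming every tree could grow to a full binary tree of depth `max_depth`:

```python
n_estimators = 100
max_depth = 3

# A full binary tree of depth d has 2**d leaves and 2**(d + 1) - 1 nodes.
max_leaves_per_tree = 2 ** max_depth
max_nodes_per_tree = 2 ** (max_depth + 1) - 1

# Upper bound on one-hot columns: one column per leaf per tree.
max_onehot_columns = n_estimators * max_leaves_per_tree

print(max_leaves_per_tree)  # 8
print(max_nodes_per_tree)   # 15
print(max_onehot_columns)   # 800
```

The 770 columns reported later in this post are below this bound of 800, which is consistent with some trees not being full or some leaves never being reached by the encoded samples.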
3. Use the GBDT tree features as input data for LR
x_train_gb = gb.apply(x_train)[:, :, 0]
x_test_gb = gb.apply(x_test)[:, :, 0]
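The `[:, :, 0]` indexing is needed because `apply` on a `GradientBoostingClassifier` returns a three-dimensional array of shape `(n_samples, n_estimators, n_classes)`, where the last axis has length 1 for binary classification. A small sketch to check this, using a reduced model (`gb_small`, 300 samples, 5 trees) chosen here only to keep it fast:

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

# Smaller dataset and model than in the post, purely for a quick shape check.
X_small, y_small = make_hastie_10_2(n_samples=300, random_state=0)
gb_small = GradientBoostingClassifier(n_estimators=5, max_depth=3, random_state=0)
gb_small.fit(X_small, y_small)

leaves = gb_small.apply(X_small)  # leaf index of every sample in every tree
leaf_matrix = leaves[:, :, 0]     # drop the length-1 class axis

print(leaves.shape)       # (300, 5, 1)
print(leaf_matrix.shape)  # (300, 5)
```

After dropping the last axis, each row holds one leaf index per tree, which is the form the one-hot encoder expects.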
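Calling apply on the fitted trees maps each training sample onto every tree and returns the index of the leaf it lands in. For example, pick one tree, such as gb.estimators_.ravel()[3],
and plot it: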
from io import StringIO  # sklearn.externals.six is gone from recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

def plot_tree(clf):
    # Export the tree as DOT text, then render it to a PNG image.
    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data, node_ids=True,
                    filled=True, rounded=True,
                    special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return Image(graph.create_png())

# Select the fourth tree
plot_tree(gb.estimators_.ravel()[3])
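The plot shows a tree with a maximum depth of 3 in which every node has its own index; for a full tree of this depth the indices run from 0 to 14.
apply maps each row of x_train onto the tree and returns the index of the leaf it reaches. This example uses 100 trees, so each sample gets 100 features. For the first training sample and the fourth tree:
x3 = 2.2408932: at the first node the sample goes to the right subtree
x9 = 0.4105985: the condition with threshold 0.244 is not satisfied, so it goes to the right subtree
x0 = 1.76405235: the condition with threshold 0.405 is not satisfied, so it goes to the right subtree; the leaf index for the fourth tree is therefore 14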
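The index 14 follows from how the nodes are numbered. The sketch below assumes a full binary tree whose nodes are numbered in depth-first preorder (root, then the whole left subtree, then the right subtree), which matches the 0 to 14 range described above; `leaf_id` is a hypothetical helper, not a scikit-learn function.

```python
def leaf_id(directions, max_depth):
    """Preorder index of the node reached by a path of 'L'/'R' steps in a full tree."""
    node = 0
    for level, step in enumerate(directions):
        # Number of nodes in the left subtree hanging below the current node.
        left_size = 2 ** (max_depth - level) - 1
        if step == "L":
            node += 1              # left child comes right after its parent
        else:
            node += 1 + left_size  # right child comes after the whole left subtree
    return node


# The walk described above: right, right, right.
print(leaf_id("RRR", 3))  # 14

# All eight leaves of a full depth-3 tree.
all_leaves = [leaf_id(a + b + c, 3) for a in "LR" for b in "LR" for c in "LR"]
print(all_leaves)  # [3, 4, 6, 7, 10, 11, 13, 14]
```

Values such as 9 or 12 in the apply output are internal-node positions in this numbering, so they can only appear as leaves in trees that stopped splitting early and are not full.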
x_train[0:1]
#array([[ 1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799, -0.97727788, 0.95008842, -0.15135721, -0.10321885, 0.4105985 ]])
gb.apply(x_train[0:1])[:, :, 0]
#array([[ 6., 10., 14., 14., 10., 7., 13., 6., 7., 7., 13., 14., 7.,
#        10., 10., 12., 14., 4., 11., 14., 3., 10., 13., 7., 11., 11.,
#        6., 6., 6., 11., 11., 11., 14., 11., 11., 4., 7., 11., 7.,
#        14., 12., 3., 13., 14., 4., 12., 13., 6., 14., 12., 6., 9.,
#        11., 14., 6., 6., 7., 7., 14., 3., 12., 13., 10., 11., 10.,
#        13., 13., 11., 9., 11., 12., 9., 7., 6., 14., 3., 14., 14.,
#        11., 13., 11., 4., 14., 13., 9., 11., 11., 14., 4., 14., 3.,
#        6., 10., 9., 13., 14., 6., 12., 14., 6.]])
gb_onehot = OneHotEncoder()
x_trains = gb_onehot.fit_transform(np.concatenate((x_train_gb, x_test_gb), axis=0))
rows = x_train.shape[0]
lr = LogisticRegression()
x_train_gb_data = x_trains[:rows, :]
x_test_gb_data = x_trains[rows:, :]
lr.fit(x_train_gb_data, y_train)
score = roc_auc_score(y_test, lr.predict(x_test_gb_data))
print("LR with GBDT apply data, train data shape : {0} auc: {1}".format(x_train_gb_data.shape, score))
One-hot encoding the leaf indices produced by the trees increases the feature length from 100 columns to 770. The AUC rises to 0.914986559139785:
LR with GBDT apply data, train data shape : (2000, 770) auc: 0.914986559139785
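To see why the width grows, here is a plain-Python sketch of what one-hot encoding does to a leaf-index matrix. It mimics the idea behind `OneHotEncoder` rather than its implementation; `one_hot_leaves` and the tiny matrix are hypothetical.

```python
def one_hot_leaves(leaf_matrix):
    """One-hot encode each tree's leaf index; one output column per distinct leaf per tree."""
    n_trees = len(leaf_matrix[0])
    # Distinct leaf indices observed for each tree, in sorted order.
    categories = [sorted({row[t] for row in leaf_matrix}) for t in range(n_trees)]
    encoded = []
    for row in leaf_matrix:
        out = []
        for t, cats in enumerate(categories):
            out.extend(1 if row[t] == c else 0 for c in cats)
        encoded.append(out)
    return encoded, categories


# Three samples, two trees (hypothetical leaf indices).
leaf_matrix = [[3, 10], [4, 10], [3, 14]]
encoded, categories = one_hot_leaves(leaf_matrix)
print(categories)  # [[3, 4], [10, 14]]
print(encoded)     # [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
```

The output width is the number of distinct leaves summed over all trees, so 100 trees with up to 8 leaves each give the 770 columns seen above.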
4. Fit on the tree features merged with the original data
lr = LogisticRegression()
x_train_merge = hstack([x_trains[:rows, :], x_train])
x_test_merge = hstack([x_trains[rows:, :], x_test])
lr.fit(x_train_merge, y_train)
score = roc_auc_score(y_test, lr.predict(x_test_merge))
print("LR with GBDT apply data and origin data, train data shape : {0} auc: {1}".format(x_train_merge.shape, score))
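This step merges the original input data with the tree features. Pay attention to the dimensions and value ranges of the data when doing so, because the result may not improve. In this run the AUC is slightly lower than with the tree features alone: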
LR with GBDT apply data and origin data, train data shape : (2000, 780) auc: 0.9119783666154633
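The dimension bookkeeping behind the 780 columns (770 one-hot columns plus 10 original features) can be checked on a toy example. This is a minimal sketch with made-up values; converting the dense block to a sparse matrix first is a choice made here for explicitness, not something the listing above requires.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical one-hot block (3 samples, 4 columns) and raw features (3 samples, 2 columns).
onehot = csr_matrix(np.array([[1, 0, 1, 0],
                              [0, 1, 1, 0],
                              [1, 0, 0, 1]], dtype=float))
raw = np.array([[0.5, -1.2],
                [1.7, 0.3],
                [-0.4, 2.2]])

# Both blocks must have one row per sample before stacking column-wise.
assert onehot.shape[0] == raw.shape[0]

merged = hstack([onehot, csr_matrix(raw)], format="csr")
print(merged.shape)  # (3, 6)
```

The one-hot columns are 0 or 1 while the raw columns are unbounded real values, which is the range mismatch the text warns about.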
Complete code
# Public import paths; private modules such as sklearn.metrics.ranking
# are not available in recent scikit-learn releases.
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from scipy.sparse import hstack
import warnings
warnings.filterwarnings('ignore')
X, y = make_hastie_10_2(random_state=0)
x_train, x_test = X[:2000], X[2000:3000]
y_train, y_test = y[:2000], y[2000:3000]
gb = GradientBoostingClassifier(n_estimators=100,
                                learning_rate=1.0,
                                max_depth=3,
                                random_state=0)
gb.fit(x_train, y_train)
score = roc_auc_score(y_test, gb.predict(x_test))
print("GBDT train data shape : {0} auc: {1}".format(x_train.shape, score))
lr = LogisticRegression()
lr.fit(x_train, y_train)
score = roc_auc_score(y_test, lr.predict(x_test))
print("LR train data shape : {0} auc: {1}".format(x_train.shape, score))
x_train_gb = gb.apply(x_train)[:, :, 0]
x_test_gb = gb.apply(x_test)[:, :, 0]
gb_onehot = OneHotEncoder()
x_trains = gb_onehot.fit_transform(np.concatenate((x_train_gb, x_test_gb), axis=0))
rows = x_train.shape[0]
lr = LogisticRegression()
x_train_gb_data = x_trains[:rows, :]
x_test_gb_data = x_trains[rows:, :]
lr.fit(x_train_gb_data, y_train)
score = roc_auc_score(y_test, lr.predict(x_test_gb_data))
print("LR with GBDT apply data, train data shape : {0} auc: {1}".format(x_train_gb_data.shape, score))
lr = LogisticRegression()
x_train_merge = hstack([x_trains[:rows, :], x_train])
x_test_merge = hstack([x_trains[rows:, :], x_test])
lr.fit(x_train_merge, y_train)
score = roc_auc_score(y_test, lr.predict(x_test_merge))
print("LR with GBDT apply data and origin data, train data shape : {0} auc: {1}".format(x_train_merge.shape, score))
# from io import StringIO
# from IPython.display import Image
# from sklearn.tree import export_graphviz
# import pydotplus
# def plot_tree(clf):
#     dot_data = StringIO()
#     export_graphviz(clf, out_file=dot_data, node_ids=True,
#                     filled=True, rounded=True,
#                     special_characters=True)
#     graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
#     return Image(graph.create_png())
# plot_tree(gb.estimators_.ravel()[0])