GBDT+LR: Introduction and Worked Example

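The idea behind GBDT+LR is to use GBDT to generate features and then fit those features with LR (logistic regression).
GBDT stands for gradient boosted decision trees, an ensemble made of many trees. A first decision tree is built; another tree is then fitted to the residual between the current model's output and the actual sample targets, and this is repeated iteratively.
Each iteration produces a tree whose splits are chosen for high gain, so every leaf acts as a discriminative feature. The feature space is therefore as large as the total number of leaf nodes across the GBDT trees, and these leaf features become the input to the LR model.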
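To make the "fit the next tree to the residual" step concrete, here is a minimal sketch of boosting with squared loss. It is an illustration only, not scikit-learn's implementation: the toy data, the `learning_rate` of 0.5, and names such as `mse_history` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem (hypothetical data, chosen only for illustration).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

learning_rate = 0.5
prediction = np.zeros_like(y)      # start from a constant (zero) model
trees, mse_history = [], []

for _ in range(20):
    residual = y - prediction      # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)          # the new tree models the residual
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(float(np.mean((y - prediction) ** 2)))

print(mse_history[0], mse_history[-1])
```

Each pass shrinks the training error a little, which is the iterative behaviour described above.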

The example below walks through this process.

1. Preparing the data

Use a scikit-learn dataset:

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from scipy.sparse import hstack
#import warnings
#warnings.filterwarnings('ignore')

X, y = make_hastie_10_2(random_state=0)
x_train, x_test = X[:2000], X[2000:3000]
y_train, y_test = y[:2000], y[2000:3000]
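It may help to look at what this dataset contains before fitting anything. The sketch below checks the labelling rule given in the scikit-learn documentation for `make_hastie_10_2` (the label is +1 when the squared norm of the ten features exceeds 9.34, otherwise -1). The variable name `rule` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(random_state=0)
print(X.shape, np.unique(y))

# Documented labelling rule: +1 when the squared norm exceeds 9.34, else -1.
rule = np.where((X ** 2.0).sum(axis=1) > 9.34, 1.0, -1.0)
print(np.array_equal(y, rule))
```

Because the class boundary is a sphere rather than a hyperplane, a linear model on the raw features has little to work with, which is worth keeping in mind when reading the LR-only result in the next step.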

2. Fitting GBDT and LR separately

# Boosted model: 100 trees, each at most 3 levels deep
gb = GradientBoostingClassifier(n_estimators=100,
                                 learning_rate=1.0,
                                 max_depth=3,
                                 random_state=0)

gb.fit(x_train, y_train)
# Note: AUC is computed here from the hard class labels returned by predict()
score = roc_auc_score(y_test, gb.predict(x_test))
print("GBDT train data shape : {0}  auc: {1}".format(x_train.shape, score))

# Baseline: logistic regression on the raw features
lr = LogisticRegression()
lr.fit(x_train, y_train)
score = roc_auc_score(y_test, lr.predict(x_test))
print("LR train data shape : {0}  auc: {1}".format(x_train.shape, score))

Output:

GBDT train data shape : (2000, 10)  auc: 0.8978334613415259
LR train data shape : (2000, 10)  auc: 0.5247535842293907
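The scores above are computed from hard class labels (`predict`). AUC is a ranking metric, so it is more commonly fed continuous scores, for example `gb.predict_proba(x_test)[:, 1]` or `gb.decision_function(x_test)`; doing so here would give figures different from the ones reported. The toy sketch below, with made-up `y_true` and `scores`, shows how much information thresholding can throw away.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4]                      # hypothetical model scores
labels = [1 if s >= 0.5 else 0 for s in scores]    # thresholded at 0.5: all zeros

auc_from_scores = roc_auc_score(y_true, scores)    # positives are ranked above negatives
auc_from_labels = roc_auc_score(y_true, labels)    # every pair is tied
print(auc_from_scores, auc_from_labels)            # 1.0 0.5
```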

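n_estimators is the number of estimators (trees). Each tree contributes one leaf-index feature, so more estimators means more features.

max_depth is the maximum depth of a single estimator.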
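As a rough check of how these two parameters shape the generated features, the sketch below fits two small models and records the shape of the leaf-index matrix and the largest leaf count per tree. The parameter pairs and the `results` list are arbitrary choices for illustration, not settings from this article.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(n_samples=1000, random_state=0)

results = []
for n_estimators, max_depth in [(5, 2), (20, 3)]:
    model = GradientBoostingClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth,
                                       random_state=0).fit(X, y)
    leaves = model.apply(X)[:, :, 0]           # one leaf-index column per tree
    max_leaves = max(tree.get_n_leaves() for tree in model.estimators_.ravel())
    results.append((n_estimators, max_depth, leaves.shape, max_leaves))
    print(n_estimators, max_depth, leaves.shape, max_leaves)
```

The number of columns follows n_estimators, while a tree of depth d can have at most 2**d leaves, which bounds how many one-hot columns each tree can later produce.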

3. Using the GBDT tree features as input to LR

x_train_gb = gb.apply(x_train)[:, :, 0]
x_test_gb = gb.apply(x_test)[:, :, 0]

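Call apply on the fitted trees. Here apply maps each training sample onto every tree and returns the index of the leaf it lands in. As an example, pick one tree, gb.estimators_.ravel()[3].

Plot this tree: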
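The relationship between the ensemble-level apply and a single tree can be checked directly. The following sketch uses a smaller model (`gb_demo`, a name introduced only for this example) and compares one column of the ensemble output with the result of calling apply on the fourth tree alone.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(n_samples=500, random_state=0)
gb_demo = GradientBoostingClassifier(n_estimators=10, max_depth=3,
                                     random_state=0).fit(X, y)

all_leaves = gb_demo.apply(X)                  # (n_samples, n_estimators, 1) for binary problems
fourth_tree = gb_demo.estimators_.ravel()[3]   # a fitted DecisionTreeRegressor
fourth_leaves = fourth_tree.apply(X)           # leaf index of each sample in that tree

print(all_leaves.shape)
print(np.array_equal(all_leaves[:, 3, 0], fourth_leaves))
```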

from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
def plot_tree(clf):
    # Export the tree to DOT text, then render it as a PNG image
    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data, node_ids=True,
                    filled=True, rounded=True,
                    special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return Image(graph.create_png())
# Select the fourth tree
plot_tree(gb.estimators_.ravel()[3])

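As the plotted tree shows, a tree has a maximum depth of 3 and every node has its own index; for a three-level tree the node indices run from 0 to 14.

apply maps the input x_train onto this tree and returns the index of the leaf each sample reaches. This example uses 100 trees, so there are 100 features. For example:

x3 = 2.2408932: at the first node the sample goes into the right subtree.

x9 = 0.4105985: the 0.244 test is not satisfied, so it goes into the right subtree.

x0 = 1.76405235: the 0.405 test is not satisfied, so it goes into the right subtree. The value for the fourth tree is therefore 14.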

x_train[0:1]
# array([[ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ,  1.86755799, -0.97727788,  0.95008842, -0.15135721, -0.10321885,  0.4105985 ]])
gb.apply(x_train[0:1])[:, :, 0]
# array([[ 6., 10., 14., 14., 10.,  7., 13.,  6.,  7.,  7., 13., 14.,  7.,
#         10., 10., 12., 14.,  4., 11., 14.,  3., 10., 13.,  7., 11., 11.,
#          6.,  6.,  6., 11., 11., 11., 14., 11., 11.,  4.,  7., 11.,  7.,
#         14., 12.,  3., 13., 14.,  4., 12., 13.,  6., 14., 12.,  6.,  9.,
#         11., 14.,  6.,  6.,  7.,  7., 14.,  3., 12., 13., 10., 11., 10.,
#         13., 13., 11.,  9., 11., 12.,  9.,  7.,  6., 14.,  3., 14., 14.,
#         11., 13., 11.,  4., 14., 13.,  9., 11., 11., 14.,  4., 14.,  3.,
#          6., 10.,  9., 13., 14.,  6., 12., 14.,  6.]])
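The hand trace above can also be done in code by walking the fitted tree's arrays. This is a sketch under assumptions: `leaf_index` is a hypothetical helper written for this example, and it relies on the `tree_` attributes `children_left`, `children_right`, `feature` and `threshold`, where a sample goes left when its feature value is less than or equal to the threshold.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

def leaf_index(tree, x):
    """Walk one sample down a fitted sklearn tree and return the leaf's node index."""
    t = tree.tree_
    x = np.asarray(x, dtype=np.float32)        # sklearn trees store inputs as float32
    node = 0                                   # the root always has index 0
    while t.children_left[node] != -1:         # -1 marks a leaf
        if x[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]       # test satisfied: go left
        else:
            node = t.children_right[node]      # test not satisfied: go right
    return int(node)

X, y = make_hastie_10_2(n_samples=500, random_state=0)
gb_demo = GradientBoostingClassifier(n_estimators=10, max_depth=3,
                                     random_state=0).fit(X, y)
tree = gb_demo.estimators_.ravel()[3]

manual = leaf_index(tree, X[0])
via_apply = int(tree.apply(X[:1])[0])
print(manual, via_apply)
```

The two values agree, which is the same walk that the trace for x3, x9 and x0 performs by reading the plotted tree.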

# Fit the encoder on the stacked train and test leaf indices so both share the same columns
gb_onehot = OneHotEncoder()
x_trains = gb_onehot.fit_transform(np.concatenate((x_train_gb, x_test_gb), axis=0))

# Split the encoded matrix back into its train and test parts
rows = x_train.shape[0]
lr = LogisticRegression()
x_train_gb_data = x_trains[:rows, :]
x_test_gb_data = x_trains[rows:, :]
lr.fit(x_train_gb_data, y_train)
score = roc_auc_score(y_test, lr.predict(x_test_gb_data))
print("LR with GBDT apply data, train data shape : {0}  auc: {1}".format(x_train_gb_data.shape, score))

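The leaf indices produced by the trees are one-hot encoded, which increases the feature dimension (here from 100 columns to 770). The AUC rises to 0.914986559139785: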

LR with GBDT apply data, train data shape : (2000, 770)  auc: 0.914986559139785
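To see where the extra columns come from, here is a toy sketch with made-up leaf indices for three samples and two trees. Each distinct leaf seen in a tree becomes one column, so the encoded width is the sum of the distinct leaves per tree. The `handle_unknown="ignore"` option is an alternative to the approach used above: it allows the encoder to be fitted on training data only, at the cost of encoding an unseen leaf as all zeros for that tree.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical leaf indices: 3 samples (rows) by 2 trees (columns).
leaves = np.array([[3, 6],
                   [4, 6],
                   [3, 7]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(leaves).toarray()
print(encoder.categories_)   # leaves {3, 4} for tree 0 and {6, 7} for tree 1
print(encoded)
# [[1. 0. 1. 0.]
#  [0. 1. 1. 0.]
#  [1. 0. 0. 1.]]

# Leaf 5 was never seen for tree 0, so its two columns stay zero.
unseen = encoder.transform(np.array([[5, 7]])).toarray()
print(unseen)                # [[0. 0. 0. 1.]]
```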

4. Fitting on the original data combined with the GBDT features

lr = LogisticRegression()
# Append the 10 raw feature columns to the one-hot leaf features
x_train_merge = hstack([x_trains[:rows, :], x_train])
x_test_merge = hstack([x_trains[rows:, :], x_test])
lr.fit(x_train_merge, y_train)
score = roc_auc_score(y_test, lr.predict(x_test_merge))
print("LR with GBDT apply data and origin data, train data shape : {0}  auc: {1}".format(x_train_merge.shape, score))

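Here the original inputs are concatenated with the leaf features. Pay attention to the dimensionality and value ranges of the data when doing this; the result may not improve. In this run the AUC is slightly lower than with the leaf features alone: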

LR with GBDT apply data and origin data, train data shape : (2000, 780)  auc: 0.9119783666154633
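One way to address the value-range concern is to standardise the raw columns before stacking them next to the 0/1 leaf indicators. The sketch below uses made-up data (`raw_train`, `onehot_train`) purely for illustration; whether scaling actually helps on the dataset in this article has not been tested here.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
raw_train = rng.normal(loc=50.0, scale=10.0, size=(6, 2))   # hypothetical large-scale raw features
onehot_train = csr_matrix(np.eye(3)[[0, 1, 2, 0, 1, 2]])    # hypothetical one-hot leaf features

# Fit the scaler on training data only, then stack the scaled columns
# to the right of the sparse one-hot block.
scaler = StandardScaler().fit(raw_train)
scaled_train = csr_matrix(scaler.transform(raw_train))
merged = hstack([onehot_train, scaled_train]).tocsr()

print(merged.shape)
```

The same fitted `scaler` would then be applied to the test rows before stacking, so that train and test columns stay on the same scale.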

Complete code

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from scipy.sparse import hstack
import warnings
warnings.filterwarnings('ignore')


X, y = make_hastie_10_2(random_state=0)
x_train, x_test = X[:2000], X[2000:3000]
y_train, y_test = y[:2000], y[2000:3000]

gb = GradientBoostingClassifier(n_estimators=100,
                                 learning_rate=1.0,
                                 max_depth=3,
                                 random_state=0)

gb.fit(x_train, y_train)
score = roc_auc_score(y_test, gb.predict(x_test))
print("GBDT train data shape : {0}  auc: {1}".format(x_train.shape, score))

lr = LogisticRegression()
lr.fit(x_train, y_train)
score = roc_auc_score(y_test, lr.predict(x_test))
print("LR train data shape : {0}  auc: {1}".format(x_train.shape, score))

x_train_gb = gb.apply(x_train)[:, :, 0]
x_test_gb = gb.apply(x_test)[:, :, 0]

gb_onehot = OneHotEncoder()
x_trains = gb_onehot.fit_transform(np.concatenate((x_train_gb, x_test_gb), axis=0))

rows = x_train.shape[0]
lr = LogisticRegression()
x_train_gb_data = x_trains[:rows, :]
x_test_gb_data = x_trains[rows:, :]
lr.fit(x_train_gb_data, y_train)
score = roc_auc_score(y_test, lr.predict(x_test_gb_data))
print("LR with GBDT apply data, train data shape : {0}  auc: {1}".format(x_train_gb_data.shape, score))

lr = LogisticRegression()
x_train_merge = hstack([x_trains[:rows, :], x_train])
x_test_merge = hstack([x_trains[rows:, :], x_test])
lr.fit(x_train_merge, y_train)
score = roc_auc_score(y_test, lr.predict(x_test_merge))
print("LR with GBDT apply data and origin data, train data shape : {0}  auc: {1}".format(x_train_merge.shape, score))


# from io import StringIO
# from IPython.display import Image
# from sklearn.tree import export_graphviz
# import pydotplus
# def plot_tree(clf):
#     dot_data = StringIO()
#     export_graphviz(clf, out_file=dot_data, node_ids=True,
#                     filled=True, rounded=True,
#                     special_characters=True)
#     graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
#     return Image(graph.create_png())
# plot_tree(gb.estimators_.ravel()[0])
