Machine Learning: Implementing AdaBoost in Code

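AdaBoost in scikit-learn comes in two flavors, classification and regression: AdaBoostClassifier and AdaBoostRegressor. The classifier offers two algorithms, SAMME and SAMME.R. Tuning AdaBoost has two parts: the first is tuning the AdaBoost framework itself, and the second is tuning the weak learner we choose. The explanation below uses the classifier as the example.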

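base_estimator: the weak learner. The default is a CART decision tree, i.e. DecisionTreeClassifier for classification and DecisionTreeRegressor for regression. (scikit-learn 1.2 renamed this parameter to estimator.)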

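algorithm: available only on AdaBoostClassifier. SAMME uses the weak learner's classification performance on the sample set as its weight, while SAMME.R uses the predicted class probabilities on the sample set. Because SAMME.R works with continuous probability values, it usually needs fewer iterations than SAMME, which is why SAMME.R is the default value of algorithm in the scikit-learn versions this post was written against. SAMME.R requires a weak learner that can output class probabilities.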

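loss: used only by AdaBoostRegressor. The options are linear error ('linear'), squared error ('square'), and exponential error ('exponential'); the default is 'linear'.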

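n_estimators: the maximum number of weak learners, default 50. Boosting may stop earlier if a weak learner fits the training data perfectly.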

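learning_rate: the shrinkage coefficient applied to each weak learner's weight. It is usually chosen between 0 and 1 (the default is 1.0). A smaller value shrinks each learner's contribution, so it generally needs a larger n_estimators to reach the same fit; the two parameters should be tuned together.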
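To make the roles of learning_rate and loss concrete, here is a minimal pure-Python sketch of two formulas behind these parameters: the weight a weak learner receives under SAMME, and the per-sample losses used by the AdaBoost.R2 regression variant. The function names samme_estimator_weight and r2_sample_losses are hypothetical helpers written for this illustration, not scikit-learn APIs, and the formulas follow my understanding of how scikit-learn implements them.

```python
import math


def samme_estimator_weight(error, n_classes=2, learning_rate=1.0):
    """Weight of one weak learner under SAMME, given its weighted error rate.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # The learner must beat random guessing, otherwise SAMME discards it.
    if not 0.0 < error < 1.0 - 1.0 / n_classes:
        raise ValueError("error must lie in (0, 1 - 1/n_classes)")
    # learning_rate scales the weight, which is why it acts as shrinkage.
    return learning_rate * (
        math.log((1.0 - error) / error) + math.log(n_classes - 1.0)
    )


def r2_sample_losses(y_true, y_pred, loss="linear"):
    """Per-sample losses in [0, 1] as used by AdaBoost.R2 (illustrative sketch)."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    max_err = max(abs_err)
    if max_err == 0:
        return [0.0] * len(abs_err)
    # Every variant starts from the absolute error scaled by the largest error.
    scaled = [e / max_err for e in abs_err]
    if loss == "linear":
        return scaled
    if loss == "square":
        return [e ** 2 for e in scaled]
    if loss == "exponential":
        return [1.0 - math.exp(-e) for e in scaled]
    raise ValueError("loss must be 'linear', 'square' or 'exponential'")


# Halving learning_rate halves the weight of the same weak learner.
print(samme_estimator_weight(0.2))
print(samme_estimator_weight(0.2, learning_rate=0.5))
print(r2_sample_losses([1.0, 2.0, 3.0], [1.0, 2.5, 5.0], loss="square"))
```

Because every learner's weight is multiplied by learning_rate, a smaller value slows down how quickly the sample weights shift, which is the reason it usually has to be paired with more estimators.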

The parameters above belong to the framework; the ones below belong to the weak learner.

max_features: by default every feature is considered at each split. When there are many features, consider splitting on only a subset of them.

max_depth: the maximum depth of the decision tree.

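min_samples_split: the minimum number of samples a node must hold to be split. A node with fewer samples than this value is not split any further.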

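min_samples_leaf: the minimum number of samples a leaf must hold. A split that would leave either child with fewer samples than this value is not made, so the undersized leaf and its sibling never appear in the tree.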

max_leaf_nodes: capping the number of leaf nodes limits the tree's complexity and helps prevent overfitting.
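The effect of these weak-learner parameters can be checked directly on a fitted tree. The sketch below is only an illustration: the synthetic dataset from make_classification and the specific parameter values are assumptions chosen for the demo, not settings taken from this post.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data chosen purely for illustration (an assumption of this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

tree = DecisionTreeClassifier(
    max_features=3,        # consider 3 of the 6 features at each split
    max_depth=3,           # at most 3 levels of splits
    min_samples_split=20,  # nodes with fewer than 20 samples are not split
    min_samples_leaf=5,    # every leaf keeps at least 5 samples
    max_leaf_nodes=6,      # at most 6 leaves in total
    random_state=0,
)
tree.fit(X_demo, y_demo)

# Leaves are the nodes without a left child (marked with -1 in tree_).
leaf_mask = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[leaf_mask]

print("depth:", tree.get_depth())
print("leaves:", tree.get_n_leaves())
print("smallest leaf:", leaf_sizes.min())
```

Whichever limit is hit first stops the tree from growing, so in practice only one or two of these parameters need to be tightened for a weak learner.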

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_gaussian_quantiles

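# Generate 2-D normally distributed data, split by quantile into two classes: 500 samples, 2 features, covariance coefficient 2
X1, y1 = make_gaussian_quantiles(cov=2.0,n_samples=500, n_features=2,n_classes=2, random_state=1)  # X1 holds the x/y coordinates, y1 holds the class labels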

# Generate 2-D normally distributed data, split by quantile into two classes: 400 samples, 2 features each with mean 3, covariance coefficient 1.5
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,n_samples=400, n_features=2, n_classes=2, random_state=1)

X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))  # flip y2 so its labels are the opposite of y1's

plt.scatter(X[:, 0], X[:, 1], marker='o', c=y)

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
                         algorithm="SAMME",
                         n_estimators=200, learning_rate=0.8)  # max depth 2; nodes with fewer than 20 samples are not split; every leaf keeps at least 5 samples
bdt.fit(X, y)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y)
plt.show()

print("Score:", bdt.score(X,y))

Score: 0.913333333333

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
                         algorithm="SAMME",
                         n_estimators=200, learning_rate=0.5)
bdt.fit(X, y)
print("Score:", bdt.score(X,y))

Score: 0.895555555556
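A single final score hides how the two settings behave as weak learners are added. The sketch below, which is an addition and not part of the original experiment, rebuilds the same data and uses staged_score to record the training accuracy after every boosting round for both learning rates. Two assumptions to note: the algorithm argument is left out so the code also runs on recent scikit-learn releases that deprecate it (so the numbers need not match the scores above), and random_state=0 is added only for repeatability.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Rebuild the same two-cluster dataset so this block runs on its own.
X1, y1 = make_gaussian_quantiles(cov=2.0, n_samples=500, n_features=2,
                                 n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5, n_samples=400,
                                 n_features=2, n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, 1 - y2))

staged = {}
for lr in (0.8, 0.5):
    weak = DecisionTreeClassifier(max_depth=2, min_samples_split=20,
                                  min_samples_leaf=5)
    # `algorithm` is omitted on purpose (assumption: library default is acceptable).
    model = AdaBoostClassifier(weak, n_estimators=200, learning_rate=lr,
                               random_state=0)
    model.fit(X, y)
    # One training-accuracy value per boosting round.
    staged[lr] = list(model.staged_score(X, y))
    print("learning_rate:", lr,
          "rounds:", len(staged[lr]),
          "final score:", round(staged[lr][-1], 4))
```

These are training-set scores, as in the rest of this post; judging which learning rate generalizes better would need a held-out split or cross-validation.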
