1. CART Decision Tree
CART (Classification and Regression Tree) was proposed by Breiman et al. in 1984 and is one of the best-known decision-tree learning algorithms. As the name suggests, it can handle both classification problems and regression problems.
The CART decision tree follows much the same overall logic as the ID3 and C4.5 algorithms discussed in earlier posts, but it differs in some respects.
1.1 The Gini Index
A CART decision tree uses the Gini index to select the optimal splitting attribute, defined as follows:
$$
\begin{aligned}
\mathrm{Gini}(D) &= \sum_{k=1}^{|\mathcal{Y}|}\sum_{k'\neq k} p_k\, p_{k'} \\
&= 1-\sum_{k=1}^{|\mathcal{Y}|} p_k^2
\end{aligned}
$$
where $p_k$ is the proportion of samples in $D$ that belong to the $k$-th class.
As the definition shows, Gini(D) is the probability that two samples drawn at random from dataset D carry different class labels. The smaller Gini(D) is, the purer dataset D is.
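As a quick sanity check of the formula, here is a minimal sketch (the toy labels are invented for illustration):

```python
from collections import Counter

# Gini value of a list of class labels: 1 - sum of squared class proportions
def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(['A', 'A', 'A', 'A']))   # 0.0 -> perfectly pure
print(gini(['A', 'A', 'B', 'B']))   # 0.5 -> maximally impure for two classes
```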
Then, for an attribute a with V possible values $\{a^1, a^2, \ldots, a^V\}$, the Gini index of dataset D is defined as:
$$
\mathrm{Gini\_index}(D,a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)
$$
where $D^v$ is the subset of samples in D that take value $a^v$ on attribute a.
With these two formulas, we evaluate each attribute in the candidate set A one by one and select the one with the smallest Gini index as the optimal splitting attribute, i.e. $a_* = \underset{a\in A}{\arg\min}\ \mathrm{Gini\_index}(D,a)$.
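A matching sketch for the attribute-level formula, on an invented two-attribute toy dataset (this reuses the `gini` helper from the snippet above):

```python
# Toy dataset, invented for illustration: (attribute 0, attribute 1, class label)
rows = [('x', 'u', 'A'), ('x', 'v', 'A'), ('y', 'u', 'B'), ('y', 'v', 'B')]

# Size-weighted sum of the Gini values of the subsets induced by each attribute value
def gini_index(rows, col):
    total = len(rows)
    index = 0.0
    for v in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == v]
        index += len(subset) / total * gini(subset)
    return index

# a_* = argmin over attributes of Gini_index(D, a)
best = min(range(2), key=lambda col: gini_index(rows, col))
print(best, gini_index(rows, 0), gini_index(rows, 1))   # 0 0.0 0.5
```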
2. Python Implementation
For the CART decision tree we use one of the best-known datasets in machine learning, the iris flower dataset (iris.csv), as the input data.
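For reference, a standard iris.csv has four numeric attribute columns followed by one string class-label column (the exact header names vary between copies of the file). The type layout matters, because the code later dispatches on float64 vs. str:

```python
import pandas as pd

# Peek at the file layout; replace the path with your own copy of iris.csv
df = pd.read_csv('/path/to/iris.csv')
print(df.head())    # four numeric attribute columns + one class-label column
print(df.dtypes)    # the code below relies on float64 attributes and a str (object) label
```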
2.1 Approach
Input: training set D = {(x1,y1),(x2,y2),…,(xm,ym)};
       attribute set A = {a1,a2,a3,…,ad}
Process: function CreateTree(D, A)
1. generate node;
2. if all samples in D belong to the same class C then
3.    mark node as a class-C leaf node; return
4. end if
5. if A = ∅ OR all samples in D take identical values on A then
6.    mark node as a leaf node labeled with the class that has the most samples in D; return
7. end if
8. select the optimal splitting attribute a∗ from A;
9. for each value a∗v of a∗ do
10.   generate a branch for node; let Dv denote the subset of samples in D that take value a∗v on a∗;
11.   if Dv is empty then
12.      mark the branch node as a leaf labeled with the class that has the most samples in D; return
13.   else
14.      take CreateTree(Dv, A \ {a∗}) as the branch node
15.   end if
16. end for
Output: a decision tree rooted at node
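One gap between this generic pseudocode and the CART code below is worth spelling out: for a continuous attribute, step 9 does not create one branch per observed value. CART instead picks a split point t and binarizes the attribute into exactly two branches, > t and <= t. A minimal pandas sketch with invented data:

```python
import pandas as pd

# Invented toy data, just to show the binary split on a continuous attribute
df = pd.DataFrame({'petal_length': [1.4, 4.7, 5.1],
                   'species': ['setosa', 'versicolor', 'virginica']})
t = 2.0  # a candidate split point
greater = df[df['petal_length'] > t]    # branch for values greater than t
smaller = df[df['petal_length'] <= t]   # branch for values less than or equal to t
print(len(greater), len(smaller))       # 2 1
```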
2.2 The Code
A few caveats are noted directly in the code comments (for example, the data file path):
```python
import pandas as pd
import operator

def loadDataSet(name='iris.csv'):
    # This is my local path; remember to change it to your own file path
    path = '/home/tyler/machine_learning/data/'
    dataSet = pd.read_csv(path + name)
    return dataSet

# Count how many samples in dataSet carry each class label; return a dict
def classCount(dataSet):
    counts = {}
    labels = dataSet.iloc[:, -1]
    for one in labels:
        if one not in counts.keys():
            counts[one] = 0
        counts[one] += 1
    return counts
# Compute the Gini value of dataSet: Gini(D) = 1 - sum of squared class proportions
def calcGini(dataSet):
    gini = 1.0
    counts = classCount(dataSet)
    for one in counts.keys():
        prob = float(counts[one]) / len(dataSet)
        gini -= prob * prob
    return gini
# Return the class label with the most samples in dataSet
def majorityClass(dataSet):
    counts = classCount(dataSet)
    sortedCounts = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    return sortedCounts[0][0]
# Split dataSet on the continuous attribute in column i:
# direction 0 keeps samples with values greater than value,
# direction 1 keeps samples with values <= value; the split column is then dropped
def splitContinuousDataSet(dataSet, i, value, direction):
    if direction == 0:
        subDataSet = dataSet[dataSet.iloc[:, i] > value]
    if direction == 1:
        subDataSet = dataSet[dataSet.iloc[:, i] <= value]
    reduceDataSet = subDataSet.drop(subDataSet.columns[i], axis=1)
    return reduceDataSet
# Split dataSet on the discrete attribute in column i, keeping samples equal to value;
# the split column is then dropped
def splitDataSet(dataSet, i, value):
    subDataSet = dataSet[dataSet.iloc[:, i] == value]
    reduceDataSet = subDataSet.drop(subDataSet.columns[i], axis=1)
    return reduceDataSet
# Choose the attribute with the smallest Gini index as the optimal splitting attribute;
# for a continuous attribute, also record the best binary split point
def chooseBestFeat(dataSet):
    splitDic = {}
    bestGiniIndex = 10000.0
    bestFeat = 0
    for i in range(len(dataSet.iloc[0, :]) - 1):
        if type(dataSet.iloc[0, i]).__name__ == 'float64':
            # continuous attribute: try every observed value as a binary split point
            valueList = set(dataSet.iloc[:, i])
            bestSplitGini = 1000.0
            for value in valueList:
                newGiniIndex = 0.0
                greaterDataSet = splitContinuousDataSet(dataSet, i, value, 0)
                prob0 = float(len(greaterDataSet)) / len(dataSet)
                newGiniIndex += prob0 * calcGini(greaterDataSet)
                smallerDataSet = splitContinuousDataSet(dataSet, i, value, 1)
                prob1 = float(len(smallerDataSet)) / len(dataSet)
                newGiniIndex += prob1 * calcGini(smallerDataSet)
                if newGiniIndex < bestSplitGini:
                    bestSplitGini = newGiniIndex
                    splitDic[dataSet.columns[i]] = value
            GiniIndex = bestSplitGini
            print('tempBestFeat:' + str(dataSet.columns[i]) + ' ,GiniIndex:' + str(GiniIndex))
        else:
            # discrete attribute: size-weighted sum of each value subset's Gini value
            valueList = set(dataSet.iloc[:, i])
            newGiniIndex = 0.0
            for value in valueList:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = float(len(subDataSet)) / len(dataSet)
                newGiniIndex += prob * calcGini(subDataSet)
            GiniIndex = newGiniIndex
        if GiniIndex < bestGiniIndex:
            bestGiniIndex = GiniIndex
            bestFeat = i
    if type(dataSet.iloc[0, bestFeat]).__name__ == 'float64':
        bestFeatValue = splitDic[dataSet.columns[bestFeat]]
    if type(dataSet.iloc[0, bestFeat]).__name__ == 'str':
        # a discrete attribute has no single split point; return the column name instead
        bestFeatValue = dataSet.columns[bestFeat]
    return bestFeat, bestFeatValue
# Build the tree by recursive calls until a stopping condition is met
def createTree(dataSet):
    # all samples belong to the same class: that class becomes a leaf
    if len(set(dataSet.iloc[:, -1])) == 1:
        return dataSet.iloc[0, -1]
    # only the label column is left: fall back to the majority class
    if len(dataSet.columns) == 1:
        return majorityClass(dataSet)
    bestFeat, bestFeatValue = chooseBestFeat(dataSet)
    bestFeatLabel = dataSet.columns[bestFeat]
    print('bestFeat:' + str(bestFeatLabel))
    print('bestFeatValue:' + str(bestFeatValue))
    myTree = {bestFeatLabel: {}}
    if type(dataSet.iloc[0, bestFeat]).__name__ == 'float64':
        # continuous attribute: two branches, > split point and <= split point
        greaterDataSet = splitContinuousDataSet(dataSet, bestFeat, bestFeatValue, 0)
        smallerDataSet = splitContinuousDataSet(dataSet, bestFeat, bestFeatValue, 1)
        # an empty branch becomes a leaf labeled with the majority class (step 12)
        if len(greaterDataSet) == 0 or len(smallerDataSet) == 0:
            return majorityClass(dataSet)
        myTree[bestFeatLabel]['>' + str(bestFeatValue)] = createTree(greaterDataSet)
        myTree[bestFeatLabel]['<=' + str(bestFeatValue)] = createTree(smallerDataSet)
    if type(dataSet.iloc[0, bestFeat]).__name__ == 'str':
        # discrete attribute: one branch per distinct value of the attribute
        for value in set(dataSet.iloc[:, bestFeat]):
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value))
    return myTree
# Program entry point
if __name__ == '__main__':
    dataSet = loadDataSet()
    # read_csv already parses numeric columns as float64, so no extra type conversion is needed
    print(createTree(dataSet))
```
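The tree comes back as a nested dict whose branch keys are '>value' / '<=value' strings for continuous attributes and raw values for discrete ones. The post stops at building the tree; purely as an illustration of how that structure could be consumed, here is a hypothetical classify helper (my own addition, not part of the original code):

```python
# Hypothetical helper, not in the original post: walk the nested-dict tree
# to classify one sample given as a dict of {attribute name: value}
def classify(tree, sample):
    if not isinstance(tree, dict):
        return tree                      # reached a leaf: tree is a class label
    feat = next(iter(tree))              # the attribute this node splits on
    for key, subtree in tree[feat].items():
        if isinstance(key, str) and key.startswith('<='):
            if sample[feat] <= float(key[2:]):
                return classify(subtree, sample)
        elif isinstance(key, str) and key.startswith('>'):
            if sample[feat] > float(key[1:]):
                return classify(subtree, sample)
        elif key == sample[feat]:        # discrete branch: exact value match
            return classify(subtree, sample)
    return None                          # no branch matched

# Hypothetical usage; the attribute names depend on your iris.csv header:
# tree = createTree(loadDataSet())
# print(classify(tree, {'sepal_length': 5.1, 'sepal_width': 3.5,
#                       'petal_length': 1.4, 'petal_width': 0.2}))
```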