1. The C4.5 Algorithm
The previous post, 機器學習算法–決策樹ID3–python實現, covered the basic concepts of decision trees and the classic ID3 algorithm.
That post said little about ID3's flaws, focusing instead on the algorithm and the underlying concepts. In fact, ID3 suffers from two defects:
1. The information gain criterion is biased toward attributes with many distinct values.
For example, if we add a [serial number] attribute to the original data and run ID3, we find that the serial number is chosen as the optimal attribute and split on first. Yet common sense tells us that a serial number has nothing whatsoever to do with a sample's class.
2. It can only handle discrete attributes; it has no way to deal with continuous values such as height, weight, age, or salary, which can take infinitely many values.
The well-known C4.5 algorithm was proposed to remedy these two problems.
1.1 Handling distractors like the "serial number" attribute
Instead of using information gain directly to choose the attribute to split on, the C4.5 decision tree algorithm uses the gain ratio, defined as

$$\mathrm{Gain\_ratio}(D,a)=\frac{\mathrm{Gain}(D,a)}{\mathrm{IV}(a)}$$

where

$$\mathrm{IV}(a)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$

is called the intrinsic value of attribute a. As the expression shows, Gain(D,a) is still the information gain, exactly the same as Gain(D,a) in ID3; the crucial part is the IV(a) term: the more possible values attribute a has (i.e., the larger V is), the larger IV(a) usually is, which in turn shrinks the final Gain_ratio. This counteracts the bias described above.
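To make the definition concrete, here is a minimal sketch of computing the gain ratio for a discrete attribute. It is independent of the implementation in section 2.2, and the names entropy and gain_ratio are mine, chosen for illustration:

from collections import Counter
from math import log

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum(c / total * log(c / total, 2) for c in Counter(labels).values())

def gain_ratio(rows, col):
    # Gain_ratio(D, a) for the discrete attribute in column `col`;
    # each row is a list whose last element is the class label
    base = entropy([r[-1] for r in rows])
    gain, iv = base, 0.0
    for val in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == val]
        p = len(subset) / len(rows)
        gain -= p * entropy(subset)  # Gain(D, a) = H(D) - sum of p_v * H(D_v)
        iv -= p * log(p, 2)          # IV(a)
    return gain / iv if iv else 0.0

rows = [['粗', '男'], ['細', '女'], ['粗', '女'], ['細', '女']]
print(gain_ratio(rows, 0))           # roughly 0.31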
1.2 Adding support for continuous attributes
- First, before processing each attribute, check whether its values are characters/strings (meaning the attribute is discrete) or ints/floats (meaning the attribute is numeric and continuous); the two kinds of values must be handled separately:
if type(featVals[0]).__name__=='float' or type(featVals[0]).__name__=='int':
    ...  # continuous attribute: search for the best numeric threshold
if type(dataSet[0][bestFeat]).__name__=='str':
    ...  # discrete attribute: branch on each distinct value
*If you have already replaced your discrete values with integer codes beforehand, adjust the type checks above to fit your data.
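If that ambiguity is a concern, one simple alternative (my own variation, not part of the code in section 2.2) is to declare the continuous columns explicitly rather than inferring them from the Python type:

# hypothetical variation: mark continuous columns by index
continuousCols = {0}                # e.g. column 0 ('序號') holds numeric values
isContinuous = i in continuousCols  # use this flag instead of the type check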
- Second, add a function that splits the data set on a continuous attribute:
def splitContinuousDataSet(dataSet, i, value, direction):
    # direction 0: keep samples with one[i] >  value
    # direction 1: keep samples with one[i] <= value
    # column i itself is dropped from the returned rows
    subDataSet = []
    for one in dataSet:
        if direction == 0:
            if one[i] > value:
                reduceData = one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
        if direction == 1:
            if one[i] <= value:
                reduceData = one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
    return subDataSet
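A quick illustration of the two directions on a toy data set (my own example, not the one from section 2.2); note that column i is removed from the returned rows, just as splitDataSet does for discrete attributes:

rows = [[1.5, 'A'], [2.5, 'B'], [3.5, 'A']]
print(splitContinuousDataSet(rows, 0, 2.0, 0))  # > 2.0  -> [['B'], ['A']]
print(splitContinuousDataSet(rows, 0, 2.0, 1))  # <= 2.0 -> [['A']]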
Because the attribute is continuous, all of its values may be distinct, so creating one branch per value, as we do for discrete attributes, is neither practical nor meaningful. Instead, we can choose one or two split points among the attribute's many values and divide the samples into those greater than (>) the split value and those less than or equal to (<=) it.
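As the snippet below also shows, the candidate thresholds are taken to be the midpoints between adjacent sorted values; a quick numeric illustration:

vals = sorted([3, 1, 8, 2])  # [1, 2, 3, 8]
print([(vals[k] + vals[k + 1]) / 2.0 for k in range(len(vals) - 1)])  # [1.5, 2.5, 5.5]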
- Then, extend the return value of the best-attribute selection module with the actual split point. For a numeric attribute, the return value is no longer just a column index as in the discrete case; we also need to know the concrete value to compare against with > or <=. We therefore keep a dictionary that maps each numeric attribute to its chosen split value. In addition, the gain ratio must be computed for every candidate split value, so that the best point among the candidates can be found and returned:
bestSplitDic = {}
...
sortedFeatVals = sorted(featVals)
splitList = []
for j in range(len(featVals) - 1):
    # candidate thresholds: midpoints between adjacent sorted values
    splitList.append((sortedFeatVals[j] + sortedFeatVals[j + 1]) / 2.0)
for j in range(len(splitList)):
    newEntropy = 0.0
    splitInfo = 0.0
    value = splitList[j]
    subDataSet0 = splitContinuousDataSet(dataSet, i, value, 0)
    subDataSet1 = splitContinuousDataSet(dataSet, i, value, 1)
    prob0 = float(len(subDataSet0)) / len(dataSet)
    prob1 = float(len(subDataSet1)) / len(dataSet)
    if prob0 == 0 or prob1 == 0:
        continue  # degenerate split: every sample falls on one side
    newEntropy += prob0 * calcShannonEntropy(subDataSet0)
    newEntropy += prob1 * calcShannonEntropy(subDataSet1)
    splitInfo -= prob0 * log(prob0, 2)
    splitInfo -= prob1 * log(prob1, 2)
    gainRatio = float(baseEntropy - newEntropy) / splitInfo
    print('IVa ' + str(j) + ':' + str(splitInfo))
    if gainRatio > baseGainRatio:
        baseGainRatio = gainRatio
        bestFeat = i
        bestSplitDic[labels[i]] = splitList[j]
Finally, in createTree, the returned split value is used to label the two branches of a numeric attribute:

myTree[labels[bestFeat]]['>' + str(value)] = createTree(greaterDataSet, subLabels)
myTree[labels[bestFeat]]['<=' + str(value)] = createTree(smallerDataSet, subLabels)
2. A Python Implementation of C4.5
2.1 Approach
Input: training set D = {(x1, y1), (x2, y2), …, (xm, ym)};
attribute set A = {a1, a2, a3, …, ad}
Process: function CreateTree(D, A)
1. generate node node;
2. if all samples in D belong to the same class C then
3.     mark node as a class-C leaf node; return
4. end if
5. if A = ∅ OR all samples in D take identical values on A then
6.     mark node as a leaf node labelled with the class that has the most samples in D; return
7. end if
8. select the optimal splitting attribute a* from A;
9. for each value a*v of a* do
10.    generate a branch for node; let Dv denote the subset of samples in D taking value a*v on a*;
11.    if Dv is empty then
12.        mark the branch node as a leaf labelled with the class that has the most samples in D; return
13.    else
14.        take CreateTree(Dv, A \ {a*}) as the branch node
15.    end if
16. end for
Output: a decision tree rooted at node
2.2 The code
The principles of the algorithm have been covered above, so let's look at the concrete Python implementation.
from math import log
import operator
def createDataSet():
    dataSet = [[1, '長', '粗', '男'],
               [2, '短', '粗', '男'],
               [3, '短', '粗', '男'],
               [4, '長', '細', '女'],
               [5, '短', '細', '女'],
               [6, '短', '粗', '女'],
               [7, '長', '粗', '女'],
               [8, '長', '粗', '女']]
    labels = ['序號', '頭髮', '聲音']  # a numeric serial-number column plus two discrete features
    return dataSet, labels
def classCount(dataSet):
    labelCount = {}
    for one in dataSet:
        if one[-1] not in labelCount.keys():
            labelCount[one[-1]] = 0
        labelCount[one[-1]] += 1
    return labelCount
def calcShannonEntropy(dataSet):
    labelCount = classCount(dataSet)
    numEntries = len(dataSet)
    Entropy = 0.0
    for i in labelCount:
        prob = float(labelCount[i]) / numEntries
        Entropy -= prob * log(prob, 2)
    return Entropy
def majorityClass(dataSet):
    labelCount = classCount(dataSet)
    sortedLabelCount = sorted(labelCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedLabelCount[0][0]
def splitDataSet(dataSet, i, value):
    # keep the samples whose i-th attribute equals value, dropping column i
    subDataSet = []
    for one in dataSet:
        if one[i] == value:
            reduceData = one[:i]
            reduceData.extend(one[i+1:])
            subDataSet.append(reduceData)
    return subDataSet
def splitContinuousDataSet(dataSet, i, value, direction):
    # direction 0: keep samples with one[i] >  value
    # direction 1: keep samples with one[i] <= value
    # column i itself is dropped from the returned rows
    subDataSet = []
    for one in dataSet:
        if direction == 0:
            if one[i] > value:
                reduceData = one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
        if direction == 1:
            if one[i] <= value:
                reduceData = one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
    return subDataSet
def chooseBestFeat(dataSet, labels):
    baseEntropy = calcShannonEntropy(dataSet)
    bestFeat = 0
    baseGainRatio = -1
    numFeats = len(dataSet[0]) - 1
    bestSplitDic = {}
    print('dataSet[0]:' + str(dataSet[0]))
    for i in range(numFeats):
        featVals = [example[i] for example in dataSet]
        if type(featVals[0]).__name__ == 'float' or type(featVals[0]).__name__ == 'int':
            # continuous attribute: candidate thresholds are the midpoints
            # between adjacent sorted values
            sortedFeatVals = sorted(featVals)
            splitList = []
            for j in range(len(featVals) - 1):
                splitList.append((sortedFeatVals[j] + sortedFeatVals[j + 1]) / 2.0)
            for j in range(len(splitList)):
                newEntropy = 0.0
                splitInfo = 0.0
                value = splitList[j]
                subDataSet0 = splitContinuousDataSet(dataSet, i, value, 0)
                subDataSet1 = splitContinuousDataSet(dataSet, i, value, 1)
                prob0 = float(len(subDataSet0)) / len(dataSet)
                prob1 = float(len(subDataSet1)) / len(dataSet)
                if prob0 == 0 or prob1 == 0:
                    continue  # degenerate split: every sample falls on one side
                newEntropy += prob0 * calcShannonEntropy(subDataSet0)
                newEntropy += prob1 * calcShannonEntropy(subDataSet1)
                splitInfo -= prob0 * log(prob0, 2)
                splitInfo -= prob1 * log(prob1, 2)
                gainRatio = float(baseEntropy - newEntropy) / splitInfo
                print('IVa ' + str(j) + ':' + str(splitInfo))
                if gainRatio > baseGainRatio:
                    baseGainRatio = gainRatio
                    bestFeat = i
                    bestSplitDic[labels[i]] = splitList[j]
        else:
            # discrete attribute: branch on every distinct value
            uniqueFeatVals = set(featVals)
            splitInfo = 0.0
            newEntropy = 0.0
            for value in uniqueFeatVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = float(len(subDataSet)) / len(dataSet)
                splitInfo -= prob * log(prob, 2)
                newEntropy += prob * calcShannonEntropy(subDataSet)
            if splitInfo == 0:
                continue  # only one distinct value: attribute is useless here
            gainRatio = float(baseEntropy - newEntropy) / splitInfo
            if gainRatio > baseGainRatio:
                bestFeat = i
                baseGainRatio = gainRatio
    if type(dataSet[0][bestFeat]).__name__ == 'float' or type(dataSet[0][bestFeat]).__name__ == 'int':
        bestFeatValue = bestSplitDic[labels[bestFeat]]
    if type(dataSet[0][bestFeat]).__name__ == 'str':
        bestFeatValue = labels[bestFeat]
    return bestFeat, bestFeatValue
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if len(set(classList)) == 1:
        return classList[0]            # all samples share one class: leaf
    if len(dataSet[0]) == 1:
        return majorityClass(dataSet)  # no attributes left: majority vote
    bestFeat, bestFeatLabel = chooseBestFeat(dataSet, labels)
    print('bestFeat:' + str(bestFeat) + '--' + str(labels[bestFeat]) + ', bestFeatLabel:' + str(bestFeatLabel))
    myTree = {labels[bestFeat]: {}}
    subLabels = labels[:bestFeat]
    subLabels.extend(labels[bestFeat + 1:])
    print('subLabels:' + str(subLabels))
    if type(dataSet[0][bestFeat]).__name__ == 'str':
        # discrete attribute: one branch per distinct value
        featVals = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featVals)
        print('uniqueVals:' + str(uniqueVals))
        for value in uniqueVals:
            reduceDataSet = splitDataSet(dataSet, bestFeat, value)
            print('reduceDataSet:' + str(reduceDataSet))
            myTree[labels[bestFeat]][value] = createTree(reduceDataSet, subLabels)
    if type(dataSet[0][bestFeat]).__name__ == 'int' or type(dataSet[0][bestFeat]).__name__ == 'float':
        # continuous attribute: two branches, > value and <= value
        value = bestFeatLabel
        greaterDataSet = splitContinuousDataSet(dataSet, bestFeat, value, 0)
        smallerDataSet = splitContinuousDataSet(dataSet, bestFeat, value, 1)
        print('greaterDataset:' + str(greaterDataSet))
        print('smallerDataSet:' + str(smallerDataSet))
        print('== ' * len(dataSet[0]))
        myTree[labels[bestFeat]]['>' + str(value)] = createTree(greaterDataSet, subLabels)
        print(myTree)
        print('== ' * len(dataSet[0]))
        myTree[labels[bestFeat]]['<=' + str(value)] = createTree(smallerDataSet, subLabels)
    return myTree
if __name__ == '__main__':
    dataSet, labels = createDataSet()
    print(createTree(dataSet, labels))
*The code above contains many print() calls. Printing these seemingly redundant intermediate values makes it possible to monitor the whole execution. I hesitated while writing this post, but in the end decided to keep them; if you find them superfluous, feel free to delete them. :/
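To actually use the tree that createTree returns, you will want a classification routine. The following is only a sketch of my own (the original post does not include one); it assumes the branch keys are either raw discrete values or the '>value' / '<=value' strings built in createTree, and it indexes samples by the original labels list:

def classify(tree, labels, sample):
    # walk the nested dict until reaching a leaf (a plain class label)
    if not isinstance(tree, dict):
        return tree
    feat = list(tree.keys())[0]
    val = sample[labels.index(feat)]
    for key, subtree in tree[feat].items():
        if isinstance(key, str) and key.startswith('<='):
            if val <= float(key[2:]):
                return classify(subtree, labels, sample)
        elif isinstance(key, str) and key.startswith('>'):
            if val > float(key[1:]):
                return classify(subtree, labels, sample)
        elif key == val:
            return classify(subtree, labels, sample)
    return None  # unseen value: no matching branch

# hypothetical usage with the data set from createDataSet():
# tree = createTree(*createDataSet())
# print(classify(tree, ['序號', '頭髮', '聲音'], [9, '短', '粗']))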