Machine Learning Algorithms | Decision Tree C4.5 -- Python Implementation

1. The C4.5 Algorithm

The previous article, 機器學習算法–決策樹ID3–python實現 (Machine Learning Algorithms – Decision Tree ID3 – Python Implementation), covered the basic concepts of decision trees and the classic ID3 algorithm.
In that article I focused on the algorithm and the underlying ideas rather than on ID3's shortcomings. In fact, ID3 has two main flaws:
1. The information-gain criterion is biased toward attributes with many possible values.

For example, if we add a [序號] (sequence-number) attribute to the original data and run ID3, we find that 序號 is chosen as the best attribute and split on first. Common sense tells us, however, that the sequence number has nothing to do with the class of a sample (a worked calculation appears at the end of Section 1.1).

2. ID3 can only handle discrete attributes; it has no way to deal with continuous values such as height, weight, age, or salary, which can take an unbounded range of values.

The well-known C4.5 algorithm was proposed to address these two problems.

1.1 Handling interference from attributes like 序號

Instead of choosing the splitting attribute directly by information gain, the C4.5 decision-tree algorithm uses the gain ratio, defined as:
Gain\_ratio(D,a)=\dfrac{Gain(D,a)}{IV(a)},
where
IV(a)=-\sum_{v=1}^{V}\dfrac{|D^v|}{|D|}\log_2\dfrac{|D^v|}{|D|}
is called the intrinsic value of attribute a. As the expression shows, Gain(D,a) is still the information gain and is identical to the Gain(D,a) used in ID3; the key is the IV(a) term: the more possible values attribute a has (i.e. the larger V is), the larger IV(a) tends to be, so the resulting Gain_ratio shrinks accordingly, which counteracts the bias described above.
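A quick worked example using the 8-sample data set from Section 2.2 (3 男, 5 女) makes the effect concrete. Under ID3, the 序號 attribute isolates every sample into its own pure subset, so its information gain equals the full entropy of D and no genuine feature can do better:

Gain(D,序號)=Ent(D)-\sum_{v=1}^{8}\dfrac{1}{8}\cdot 0=-\dfrac{3}{8}\log_2\dfrac{3}{8}-\dfrac{5}{8}\log_2\dfrac{5}{8}\approx 0.954

Under C4.5 this gain is divided by the intrinsic value

IV(序號)=-\sum_{v=1}^{8}\dfrac{1}{8}\log_2\dfrac{1}{8}=\log_2 8=3,

giving Gain_ratio(D,序號) ≈ 0.954 / 3 ≈ 0.318, whereas a two-valued attribute such as 頭髮 (4 長, 4 短) has IV(頭髮)=1 and its gain is left untouched by the penalty.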

1.2 Adding support for continuous attributes
  • First, before processing each attribute we must check whether its values are characters/strings (which means the attribute is discrete) or ints/floats (which means the attribute is numeric and continuous); the two cases are handled separately.
if type(featVals[0]).__name__=='float' or type(featVals[0]).__name__=='int':
...
if type(dataSet[0][bestFeat]).__name__=='str':
...

*If you have already replaced the discrete values with integer codes beforehand, adjust the type checks above to suit your situation.
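A slightly more idiomatic alternative to comparing type names is isinstance. This is only a sketch (isContinuousColumn is a made-up helper name, not used in the code below), and if your discrete values are integer-coded you would instead have to list the continuous columns explicitly:

def isContinuousColumn(featVals):
    # hypothetical helper: treat the column as continuous when its first value
    # is an int or float (bool is excluded because bool is a subclass of int)
    v = featVals[0]
    return isinstance(v, (int, float)) and not isinstance(v, bool)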

  • Next, we need to add a function that splits the data set on a continuous attribute:
def splitContinuousDataSet(dataSet,i,value,direction):
    subDataSet=[]
    for one in dataSet:
        if direction==0:
            if one[i]>value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
        if direction==1:
            if one[i]<=value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
    return subDataSet
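To make the direction argument concrete, here is a small usage sketch (the data is invented, not taken from the article's dataset):

tinySet = [['a', 160, '女'],
           ['b', 175, '男'],
           ['c', 182, '男']]
# direction 0 keeps the rows whose column-1 value is greater than the threshold
print(splitContinuousDataSet(tinySet, 1, 170.0, 0))   # [['b', '男'], ['c', '男']]
# direction 1 keeps the rows whose column-1 value is <= the threshold
print(splitContinuousDataSet(tinySet, 1, 170.0, 1))   # [['a', '女']]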

Because the attribute is continuous, all of its values may be distinct. Creating one branch per value, as we do for discrete attributes, is neither practical nor meaningful. Instead we pick a single split point from the attribute's values and divide the samples into those greater than (>) the threshold and those less than or equal to (<=) it.
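The candidate thresholds are simply the midpoints between consecutive sorted values, exactly as in chooseBestFeat in Section 2.2. A small illustration with an invented numeric column:

featVals = [75, 60, 70, 65]                  # hypothetical continuous column
sortedFeatVals = sorted(featVals)            # [60, 65, 70, 75]
splitList = []
for j in range(len(featVals) - 1):
    splitList.append((sortedFeatVals[j] + sortedFeatVals[j + 1]) / 2.0)
print(splitList)                             # [62.5, 67.5, 72.5]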

  • Then, the module that chooses the best splitting attribute must also return the chosen split value. For a numeric attribute the return value is no longer just an index; we also need to know which concrete value to compare against, so we keep a dictionary that maps each numeric attribute to its best split point. In addition, the gain ratio has to be computed for every candidate split point so that the best one can be found and returned. The key part:
bestSplitDic={}
...
            sortedFeatVals=sorted(featVals)
            splitList=[]
            for j in range(len(featVals)-1):
                splitList.append((sortedFeatVals[j]+sortedFeatVals[j+1])/2.0)
            bestSplit=-1
            for j in range(len(splitList)):
                newEntropy=0.0
                gainRatio=0.0
                splitInfo=0.0
                value=splitList[j]
                subDataSet0=splitContinuousDataSet(dataSet,i,value,0)
                subDataSet1=splitContinuousDataSet(dataSet,i,value,1)
                if len(subDataSet0)==0 or len(subDataSet1)==0:
                    # duplicate values can leave one side empty; skip to avoid log(0)
                    continue
                prob0=float(len(subDataSet0))/len(dataSet)
                newEntropy+=prob0*calcShannonEntropy(subDataSet0)
                prob1=float(len(subDataSet1))/len(dataSet)
                newEntropy+=prob1*calcShannonEntropy(subDataSet1)
                splitInfo-=prob0*log(prob0,2)
                splitInfo-=prob1*log(prob1,2)
                gainRatio=float(baseEntropy-newEntropy)/splitInfo
                print('IVa '+str(j)+':'+str(splitInfo))
                if gainRatio>baseGainRatio:
                    baseGainRatio=gainRatio
                    bestSplit=j
                    bestFeat=i
            if bestSplit>=0:
                bestSplitDic[labels[i]]=splitList[bestSplit]
  • Finally, in the tree-building module the branch labels for a continuous split have to be written in the form "> x.xxx" / "<= x.xxx". The key code:
        myTree[labels[bestFeat]]['>' + str(value)] = createTree(greaterDataSet, subLabels)
        print(myTree)
        print('== ' * len(dataSet[0]))
        myTree[labels[bestFeat]]['<=' + str(value)] = createTree(smallerDataSet, subLabels)
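For illustration only, a continuous split therefore appears in the nested tree dictionary with the threshold baked into the branch keys (the attribute name '身高' and the value 172.5 here are invented):

# {'身高': {'>172.5': { ... }, '<=172.5': { ... }}}
# whereas a discrete attribute keeps its raw values as keys:
# {'聲音': {'粗': { ... }, '細': { ... }}}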

2. Python Implementation of C4.5

2.1 Approach

Input:  training set D = {(x1,y1), (x2,y2), …, (xm,ym)};
        attribute set A = {a1, a2, a3, …, ad}
Process: function CreateTree(D, A)
1.  generate node;
2.  if all samples in D belong to the same class C then
3.      mark node as a leaf of class C; return
4.  end if
5.  if A = ∅ OR all samples in D take identical values on A then
6.      mark node as a leaf labelled with the class that has the most samples in D; return
7.  end if
8.  choose the optimal splitting attribute a* from A;
9.  for each value a*_v of a* do
10.     generate a branch for node; let Dv denote the subset of samples in D whose value on a* is a*_v;
11.     if Dv is empty then
12.         mark the branch node as a leaf labelled with the class that has the most samples in D; return
13.     else
14.         use CreateTree(Dv, A \ {a*}) as the branch node
15.     end if
16. end for
Output: a decision tree rooted at node
2.2 The Full Code

The principles have been covered above, so here is the concrete Python implementation.

from math import log
import operator

def createDataSet():
    dataSet = [[1,'長', '粗', '男'],
               [2,'短', '粗', '男'],
               [3,'短', '粗', '男'],
               [4,'長', '細', '女'],
               [5,'短', '細', '女'],
               [6,'短', '粗', '女'],
               [7,'長', '粗', '女'],
               [8,'長', '粗', '女']]
    labels = ['序號','頭髮', '聲音']  # the ID column plus two real features
    return dataSet, labels

def classCount(dataSet):
    # count how many samples of each class label (last column) appear in dataSet
    labelCount={}
    for one in dataSet:
        if one[-1] not in labelCount.keys():
            labelCount[one[-1]]=0
        labelCount[one[-1]]+=1
    return labelCount

def calcShannonEntropy(dataSet):
    # Shannon entropy of the class distribution in dataSet
    labelCount=classCount(dataSet)
    numEntries=len(dataSet)
    Entropy=0.0
    for i in labelCount:
        prob=float(labelCount[i])/numEntries
        Entropy-=prob*log(prob,2)
    return Entropy

def majorityClass(dataSet):
    # return the most frequent class label in dataSet
    labelCount=classCount(dataSet)
    sortedLabelCount=sorted(labelCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedLabelCount[0][0]

def splitDataSet(dataSet,i,value):
    # samples whose i-th attribute equals value, with column i removed
    subDataSet=[]
    for one in dataSet:
        if one[i]==value:
            reduceData=one[:i]
            reduceData.extend(one[i+1:])
            subDataSet.append(reduceData)
    return subDataSet

def splitContinuousDataSet(dataSet,i,value,direction):
    # split on continuous attribute i at threshold `value`:
    # direction 0 keeps samples with one[i] >  value,
    # direction 1 keeps samples with one[i] <= value;
    # column i is removed from the returned samples
    subDataSet=[]
    for one in dataSet:
        if direction==0:
            if one[i]>value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
        if direction==1:
            if one[i]<=value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
    return subDataSet

def chooseBestFeat(dataSet,labels):
    # pick the attribute (and, for a continuous attribute, the split point)
    # with the highest gain ratio
    baseEntropy=calcShannonEntropy(dataSet)
    bestFeat=0
    baseGainRatio=-1
    numFeats=len(dataSet[0])-1
    bestSplitDic={}
    print('dataSet[0]:' + str(dataSet[0]))
    for i in range(numFeats):
        featVals=[example[i] for example in dataSet]
        if type(featVals[0]).__name__=='float' or type(featVals[0]).__name__=='int':
            # continuous attribute: candidate thresholds are the midpoints
            # between consecutive sorted values
            sortedFeatVals=sorted(featVals)
            splitList=[]
            for j in range(len(featVals)-1):
                splitList.append((sortedFeatVals[j]+sortedFeatVals[j+1])/2.0)
            bestSplit=-1
            for j in range(len(splitList)):
                newEntropy=0.0
                gainRatio=0.0
                splitInfo=0.0
                value=splitList[j]
                subDataSet0=splitContinuousDataSet(dataSet,i,value,0)
                subDataSet1=splitContinuousDataSet(dataSet,i,value,1)
                if len(subDataSet0)==0 or len(subDataSet1)==0:
                    # duplicate values can leave one side empty; skip to avoid log(0)
                    continue
                prob0=float(len(subDataSet0))/len(dataSet)
                newEntropy+=prob0*calcShannonEntropy(subDataSet0)
                prob1=float(len(subDataSet1))/len(dataSet)
                newEntropy+=prob1*calcShannonEntropy(subDataSet1)
                splitInfo-=prob0*log(prob0,2)
                splitInfo-=prob1*log(prob1,2)
                gainRatio=float(baseEntropy-newEntropy)/splitInfo
                print('IVa '+str(j)+':'+str(splitInfo))
                if gainRatio>baseGainRatio:
                    baseGainRatio=gainRatio
                    bestSplit=j
                    bestFeat=i
            if bestSplit>=0:
                bestSplitDic[labels[i]]=splitList[bestSplit]
        else:
            # discrete attribute: one branch per distinct value
            uniqueFeatVals=set(featVals)
            splitInfo=0.0
            newEntropy=0.0
            for value in uniqueFeatVals:
                subDataSet=splitDataSet(dataSet,i,value)
                prob=float(len(subDataSet))/len(dataSet)
                splitInfo-=prob*log(prob,2)
                newEntropy+=prob*calcShannonEntropy(subDataSet)
            if splitInfo==0.0:
                # the attribute takes a single value here; it cannot split the data
                continue
            gainRatio=float(baseEntropy-newEntropy)/splitInfo
            if gainRatio > baseGainRatio:
                bestFeat = i
                baseGainRatio = gainRatio
    if type(dataSet[0][bestFeat]).__name__=='float' or type(dataSet[0][bestFeat]).__name__=='int':
        bestFeatValue=bestSplitDic[labels[bestFeat]]
    if type(dataSet[0][bestFeat]).__name__=='str':
        bestFeatValue=labels[bestFeat]
    return bestFeat,bestFeatValue



def createTree(dataSet,labels):
    # recursively build the decision tree as nested dicts
    classList=[example[-1] for example in dataSet]
    if len(set(classList))==1:
        return classList[0]
    if len(dataSet[0])==1:
        return majorityClass(dataSet)
    Entropy = calcShannonEntropy(dataSet)
    bestFeat,bestFeatLabel=chooseBestFeat(dataSet,labels)
    print('bestFeat:'+str(bestFeat)+'--'+str(labels[bestFeat])+', bestFeatLabel:'+str(bestFeatLabel))
    myTree={labels[bestFeat]:{}}
    subLabels = labels[:bestFeat]
    subLabels.extend(labels[bestFeat+1:])
    print('subLabels:'+str(subLabels))
    if type(dataSet[0][bestFeat]).__name__=='str':
        featVals = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featVals)
        print('uniqueVals:' + str(uniqueVals))
        for value in uniqueVals:
            reduceDataSet=splitDataSet(dataSet,bestFeat,value)
            print('reduceDataSet:'+str(reduceDataSet))
            myTree[labels[bestFeat]][value]=createTree(reduceDataSet,subLabels)
    if type(dataSet[0][bestFeat]).__name__=='int' or type(dataSet[0][bestFeat]).__name__=='float':
        value=bestFeatLabel
        greaterDataSet=splitContinuousDataSet(dataSet,bestFeat,value,0)
        smallerDataSet=splitContinuousDataSet(dataSet,bestFeat,value,1)
        print('greaterDataset:' + str(greaterDataSet))
        print('smallerDataSet:' + str(smallerDataSet))
        print('== ' * len(dataSet[0]))
        myTree[labels[bestFeat]]['>' + str(value)] = createTree(greaterDataSet, subLabels)
        print(myTree)
        print('== ' * len(dataSet[0]))
        myTree[labels[bestFeat]]['<=' + str(value)] = createTree(smallerDataSet, subLabels)
    return myTree

if __name__ == '__main__':
    dataSet,labels=createDataSet()
    print(createTree(dataSet,labels))

*The code above contains many print() calls. Printing these seemingly redundant intermediate values makes it easier to monitor the whole execution. I hesitated while writing this article, but in the end decided to keep them; if you find them superfluous, just delete them yourself. : /
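*The tree returned by createTree is just a nested dict. As a rough sketch that is not part of the original code, a hypothetical classify() helper could walk it to label a new sample; it assumes the branch-label format built above ('>x.xx' / '<=x.xx' for continuous splits, raw attribute values for discrete ones):

def classify(tree, labels, sample):
    # hypothetical helper, not from the original article
    if not isinstance(tree, dict):        # a leaf is simply the class label
        return tree
    feat = list(tree.keys())[0]           # attribute tested at this node
    value = sample[labels.index(feat)]    # the sample's value for that attribute
    for branch, subTree in tree[feat].items():
        if branch.startswith('<='):
            if value <= float(branch[2:]):
                return classify(subTree, labels, sample)
        elif branch.startswith('>'):
            if value > float(branch[1:]):
                return classify(subTree, labels, sample)
        elif branch == value:             # discrete branch: exact match
            return classify(subTree, labels, sample)
    return None                           # no branch matched

# usage sketch:
# tree = createTree(dataSet, labels)
# print(classify(tree, labels, [9, '短', '粗']))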
