Machine Learning in Action - Decision Trees

Computing the Shannon entropy of a given data set

This chapter shows how a decision tree is trained and how information gain drives that training. Information gain is computed from an impurity measure; the two common choices are Shannon entropy and Gini impurity, and the book works with Shannon entropy (I'll sketch a Gini version right after the entropy code, just for comparison). The first piece of code computes Shannon entropy. Back in school I worked on structured random forests and had to evaluate Shannon entropy, so I know it reasonably well and the code is not hard to read (it's actually pretty simple). No fear, here is the code straight from the book:

from math import log

def calcShannonEnt(dataSet):
    """Compute the Shannon entropy of the class labels in dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count how often each class label occurs
        currentLabel = featVec[-1]  # the label is the last column of each example
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
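
The book sticks with entropy, but since Gini impurity came up above, here is a minimal sketch of a Gini version of the same function (my own code, not from the book), just to show how similar the two measures are:

def calcGini(dataSet):
    # Gini impurity: 1 minus the sum of p_k squared over the classes,
    # where p_k is the fraction of examples with class label k
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    gini = 1.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        gini -= prob * prob
    return gini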


That's the Shannon entropy code. Let's first check that it actually works; throw together a quick test case:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

And the test code:

myData, labels = createDataSet()
shannon_ent = calcShannonEnt(myData)
print(shannon_ent)
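
For this toy data set (two 'yes' and three 'no' labels) the printed entropy should come out to about 0.9710. A quick hand check of that number (my own sketch, not from the book):

from math import log

# label distribution of the toy set: 2 'yes' and 3 'no' out of 5 examples
p_yes, p_no = 2 / 5, 3 / 5
print(-(p_yes * log(p_yes, 2) + p_no * log(p_no, 2)))  # ~0.9710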

Splitting the data set

Running the test above prints the entropy of the toy data set, roughly 0.9710. So we can now compute Shannon entropy, but that alone is far from enough. The plan is to split the data recursively so that the entropy of each subset keeps shrinking until it falls within a range we can tolerate; each split is chosen to maximize the information gain. In other words, we keep breaking the big set we want to classify into smaller sets until every subset is pure enough, and only then can we say: OK, we have a classification procedure (i.e. a tree) that can handle the problem! In the spirit of decomposing a big problem, the sub-problems are: splitting a big set, choosing the best way to split the data, and recursively computing the impurity of the pieces (entropy here; Gini impurity would be the other option). Let's start with the code that splits a data set:

def splitDataSet(dataSet, axis, value):
    """Return the examples whose feature `axis` equals `value`, with that feature removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            splitSet = featVec[:axis]            # everything before the split feature
            splitSet.extend(featVec[axis + 1:])  # everything after it; the feature itself is dropped
            retDataSet.append(splitSet)
    return retDataSet

One small detail in there: a list's append method and its extend method are not the same thing; a quick demo is below. And do your testing! A good developer always tests; I'm nudging you to do it here, how thorough you are is a matter of personal standards :)
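
A tiny illustration of the append/extend difference (my own example, not from the book):

a = [1, 2]
a.append([3, 4])   # append adds its argument as one element -> [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])   # extend splices the elements in one by one -> [1, 2, 3, 4]

print(a)
print(b)

Here is the test for splitDataSet: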

if __name__ == "__main__":
    myData, labels = createDataSet()
    reduced_data = splitDataSet(myData, 0, 0)
    print(reduced_data)

The printed result should be:

[[1, 'no'], [1, 'no']]

Notice that the stored result drops the feature used for the split; that is one difference between this style of decision tree (ID3-like, where a feature is consumed once) and CART, which may reuse features. If you want to keep the feature column, changing featVec[:axis] to featVec[:axis + 1] is enough; a sketch of that variant is below.
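
For reference, here is that keep-the-feature variant (my own sketch, not from the book; the rest of this post sticks with the original splitDataSet):

def splitDataSetKeepFeature(dataSet, axis, value):
    # same filtering as splitDataSet, but the split feature column is kept
    # (featVec[:axis + 1] plus featVec[axis + 1:] is simply a copy of the whole row)
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            splitSet = featVec[:axis + 1]
            splitSet.extend(featVec[axis + 1:])
            retDataSet.append(splitSet)
    return retDataSet

With that aside done, we next walk over the whole data set to find the best way to split it: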

def chooseBestFeatureToSplit(dataset):
    """Return the index of the feature whose split gives the largest information gain."""
    featNum = len(dataset[0]) - 1          # the last column is the class label
    baseEntropy = calcShannonEnt(dataset)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(featNum):
        featList = [example[i] for example in dataset]
        uniqueValue = set(featList)
        newEntropy = 0.0
        for value in uniqueValue:
            subDataSet = splitDataSet(dataset, i, value)
            prob = len(subDataSet) / float(len(featList))   # weight of this branch
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # reduction in entropy from splitting on feature i
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

Test code:

if __name__ == "__main__":
    myData, labels = createDataSet()
    bf = chooseBestFeatureToSplit(myData)
    print(bf)

Expected result:

0
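
Why 0? Because feature 0 ('no surfacing') yields the larger information gain. A quick hand check of the two gains (my own sketch, not from the book):

from math import log

def H(probs):
    # entropy of a discrete distribution (my own helper, just for this check)
    return -sum(p * log(p, 2) for p in probs if p > 0)

base = H([2/5, 3/5])                                    # entropy of the full set, ~0.9710
gain0 = base - (2/5 * H([1.0]) + 3/5 * H([2/3, 1/3]))   # split on 'no surfacing', ~0.4200
gain1 = base - (1/5 * H([1.0]) + 4/5 * H([1/2, 1/2]))   # split on 'flippers', ~0.1710
print(gain0, gain1)
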
Building the tree

The last piece: building the tree itself:

import operator

def majorityCnt(classList):
    """Return the class label that occurs most often in classList (used when features run out)."""
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataset, labels):
    """Recursively build the tree as nested dicts: {feature label: {feature value: subtree or class}}."""
    classList = [example[-1] for example in dataset]
    if classList.count(classList[0]) == len(classList):
        return classList[0]              # all examples share one class: leaf node
    if len(dataset[0]) == 1:
        return majorityCnt(classList)    # no features left: fall back to majority vote
    bestFeat = chooseBestFeatureToSplit(dataset)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])                # this feature is consumed (note: mutates the caller's list)
    featValues = [example[bestFeat] for example in dataset]
    uniqueValue = set(featValues)
    for value in uniqueValue:
        subLabels = labels[:]            # copy so sibling branches don't interfere with each other
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataset, bestFeat, value), subLabels)

    return myTree

Test code:

if __name__ == "__main__":
    myData, labels = createDataSet()
    myTree = createTree(myData, labels)
    print(myTree)

Test result:

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

At this point a decision tree has been built. Honestly, don't worry too much about the details inside createTree; this is just practice, and in real work a well-tested open-source library is the safer choice (a quick scikit-learn sketch follows).
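
For example, with scikit-learn (my own aside, assuming scikit-learn is installed; this is not part of the book's code):

from sklearn.tree import DecisionTreeClassifier

# the same toy data as createDataSet, split into feature rows X and labels y
X = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 1]]
y = ['yes', 'yes', 'no', 'no', 'no']

clf = DecisionTreeClassifier(criterion='entropy')  # entropy criterion, like the code above
clf.fit(X, y)
print(clf.predict([[1, 0]]))  # should print ['no']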

Testing the model

With the decision tree built from training data, we can use it to classify real data. Classification again needs a recursive function. A test vector does not carry feature names, so featLabels has to be passed in to let the function map the feature name at a tree node back to an index in the vector. On to the code:

def classify(inputTree, featLabels, testVec):
    """Walk the nested-dict tree and return the predicted class for testVec."""
    firstStr = list(inputTree.keys())[0]       # feature name stored at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)     # map the feature name back to a column index
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)  # internal node: recurse
            else:
                classLabel = secondDict[key]   # leaf node: this is the class
    return classLabel                          # note: classLabel is left undefined if no branch matches testVec

Now a simple test of the classification code. We first build a decision tree with createTree and then make up a vector to classify. Note: createTree deletes entries from labels one by one, so copy labels first and pass the copy to classify, otherwise you'll get an error. Test code:

if __name__ == "__main__":
    myData, labels = createDataSet()
    myLabels = labels[:]   # copy, because createTree will mutate labels
    myTree = createTree(myData, labels)
    #print(myLabels)
    label0 = classify(myTree, myLabels, [1, 0])
    print(label0)

The test result for [1, 0] is:

no

Saving the model

Retraining the model every time we need predictions on a new data set is not realistic; the trained model should be saved once and simply loaded at prediction time. Python's pickle module can serialize our dict-based tree to a file and load it back later, so that is what we use here (pickle writes binary data, hence the 'wb'/'rb' file modes below).

def storeTree(inputTree, filename):
    import pickle
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    import pickle
    with open(filename, 'rb') as fr:
        outputTree = pickle.load(fr)
    return outputTree

Test code:

if __name__ == "__main__":
    myData, labels = createDataSet()
    myLabels = labels[:]   # copy, because createTree will mutate labels
    myTree = createTree(myData, labels)
    label0 = classify(myTree, myLabels, [1, 0])
    storeTree(myTree, "classifier.txt")
    myTree0 = grabTree("classifier.txt")
    print(myTree0)

Test result:

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}




