Computing the Shannon Entropy of a Given Dataset
This chapter covers how a decision tree is trained, and the role information gain plays in that training. It starts with information gain, which is the drop in impurity after a split; the two common impurity measures are Shannon entropy and Gini impurity. The first piece of code computes Shannon entropy. I studied structured random forests back in school and had to evaluate Shannon entropy then, so I know it reasonably well and the code isn't hard to follow (it's fairly simple to begin with). No hesitation, straight to the code as given in the book:
from math import log

def calcShannonEnt(dataSet):
    # Count how many times each class label appears in the dataset.
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]          # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # Accumulate -p * log2(p) over all class labels.
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)   # log base 2
    return shannonEnt
With the entropy code written, let's check that it actually works. Throw together a quick test case:
def createDataSet():
    # Toy dataset from the book: two binary features plus a class label.
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
And the test code:
myData, labels = createDataSet()
shannon_ent = calcShannonEnt(myData)
print(shannon_ent)
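For reference, what calcShannonEnt is evaluating is H = -Σ p·log2(p) over the class labels. A quick hand check of the expected print-out (my own arithmetic, not a listing from the book):

from math import log
# Hand computation: 2 of the 5 samples are 'yes', 3 are 'no'.
h = -(2 / 5) * log(2 / 5, 2) - (3 / 5) * log(3 / 5, 2)
print(h)  # ~0.9709505944546686, which should match calcShannonEnt(myData)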
Splitting the Dataset
Running the test above prints the entropy of the toy dataset, about 0.971. So we can now compute Shannon entropy, but that alone is far from enough. The plan is to break the big set we want to classify into ever smaller subsets, at each step picking the split with the largest information gain, i.e. the one whose subsets have the lowest weighted impurity (entropy here; Gini impurity is the common alternative), and stopping once each subset's impurity is within what we can tolerate. Only then can we say: OK, we have a classification procedure (a tree) and the classification problem is solved! In the spirit of decomposing big problems, the small problems become: splitting a big set on a feature value, choosing the best split, and recursively computing the impurity of the pieces (that's Gini, not bikini!). Charging straight into the dataset-splitting code:
def splitDataSet(dataSet, axis, value):
    # Return the rows whose feature `axis` equals `value`,
    # with that feature column removed.
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            splitSet = featVec[:axis]            # everything before the feature
            splitSet.extend(featVec[axis + 1:])  # everything after it
            retDataSet.append(splitSet)
    return retDataSet
One small point in there: a list's append method and its extend method behave differently; the details are a quick Baidu search away, lol. And write tests! A good developer always tests. I'll nudge you toward testing here, but how far you take it is a matter of personal standards :)
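For the impatient, a tiny demo of that append/extend difference (my own example, not from the book):

a = [1, 2]
a.append([3, 4])   # append adds its argument as one single element
print(a)           # [1, 2, [3, 4]]
b = [1, 2]
b.extend([3, 4])   # extend splices the elements in one by one
print(b)           # [1, 2, 3, 4]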
if __name__ == "__main__":
    myData, labels = createDataSet()
    reduced_data = splitDataSet(myData, 0, 0)
    print(reduced_data)
It should print [[1, 'no'], [1, 'no']]: the two samples whose feature 0 equals 0, with that feature column stripped out.
Notice that the stored result throws away the feature the split was made on; this is a difference between this (ID3-style) tree and CART, which can reuse features. If you want to keep the column, just change featVec[:axis] to featVec[:axis + 1], as sketched below.
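For completeness, here is what that one-line change would look like; splitDataSetKeepFeature is my own name for this hypothetical CART-style variant, not code from the book:

def splitDataSetKeepFeature(dataSet, axis, value):
    # Variant of splitDataSet that keeps the split feature's column.
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            splitSet = featVec[:axis + 1]         # keep the feature itself,
            splitSet.extend(featVec[axis + 1:])   # so splitSet is a full copy
            retDataSet.append(splitSet)
    return retDataSet

With splitting handled, we next traverse the whole dataset looking for the best way to split it.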
def chooseBestFeatureToSplit(dataset):
    featNum = len(dataset[0]) - 1              # the last column is the class label
    baseEntropy = calcShannonEnt(dataset)      # entropy before any split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(featNum):
        featList = [example[i] for example in dataset]
        uniqueValue = set(featList)
        # Weighted entropy of the subsets produced by splitting on feature i.
        newEntropy = 0.0
        for value in uniqueValue:
            subDataSet = splitDataSet(dataset, i, value)
            prob = len(subDataSet) / float(len(featList))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    # information gain of this split
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
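To make the double loop concrete, here is a hand-check of the per-feature gains on the toy data (my own arithmetic, not a book listing); feature 0 wins, which is why the test below should print 0:

myData, labels = createDataSet()
base = calcShannonEnt(myData)                        # ~0.971
for i in range(len(myData[0]) - 1):
    values = set(row[i] for row in myData)
    newEnt = sum(len(sub) / float(len(myData)) * calcShannonEnt(sub)
                 for sub in (splitDataSet(myData, i, v) for v in values))
    print(i, base - newEnt)   # feature 0: ~0.420, feature 1: ~0.171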
Test code:
if __name__ == "__main__":
    myData, labels = createDataSet()
    bf = chooseBestFeatureToSplit(myData)
    print(bf)
The expected result is 0: feature 0 ('no surfacing') gives the larger information gain, so it is the best feature to split on first.
Building the Tree
Last problem: actually building the tree:
import operator

def majorityCnt(classList):
    # Vote: return the most frequent class label in the list.
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
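As an aside, the standard library's collections.Counter can express the same vote in one line; majorityCnt2 below is my own equivalent sketch (ties may break differently):

from collections import Counter

def majorityCnt2(classList):
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(classList).most_common(1)[0][0]

Back to the book's code: createTree below assembles the tree recursively.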
def createTree(dataset, labels):
    classList = [example[-1] for example in dataset]
    # Base case 1: all samples share one class -- return that class.
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Base case 2: no features left -- fall back to a majority vote.
    if len(dataset[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataset)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])          # this feature is consumed at this node
    featValues = [example[bestFeat] for example in dataset]
    uniqueValue = set(featValues)
    for value in uniqueValue:
        subLabels = labels[:]      # copy, so recursion can't mutate siblings
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataset, bestFeat, value), subLabels)
    return myTree
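Note the subLabels = labels[:] copy before each recursive call: createTree deletes from labels, so handing every branch the same list object would let one branch's deletion corrupt what its siblings see. A minimal demo of that aliasing pitfall (my own example):

labels = ['no surfacing', 'flippers']
alias = labels        # same list object, not a copy
copy = labels[:]      # independent shallow copy
del labels[0]
print(alias)          # ['flippers'] -- mutated along with labels
print(copy)           # ['no surfacing', 'flippers'] -- unaffected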
Test code:
if __name__ == "__main__":
    myData, labels = createDataSet()
    myTree = createTree(myData, labels)
    print(myTree)
It should print {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}: a nested dict of feature names, feature values, and leaf labels.
At this point a decision tree has been built. Don't worry too much about the internals of createTree; this is practice code, and for real work an open-source library is the safer choice.
Testing the Model
With the tree trained, we can use it to classify real data. Classification again calls itself recursively down the tree. Since the tree alone can't tell us which column a feature name corresponds to, featLabels has to be passed in so the code can look up feature indices. On to the code:
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]        # feature tested at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)      # column index of that feature
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                # Branch leads to a subtree: recurse.
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                # Branch leads to a leaf: that's the class label.
                classLabel = secondDict[key]
    return classLabel
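One caveat with the version above: if testVec holds a feature value that never occurred in training, no branch matches, classLabel is never assigned, and the return raises UnboundLocalError. classifySafe below is my own defensive variant (also using the more idiomatic isinstance check), not book code:

def classifySafe(inputTree, featLabels, testVec, default=None):
    # Like classify, but returns `default` for feature values
    # never seen in training instead of raising UnboundLocalError.
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    subtree = secondDict.get(testVec[featIndex])
    if subtree is None:
        return default
    if isinstance(subtree, dict):
        return classifySafe(subtree, featLabels, testVec, default)
    return subtree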
Now a simple test of the classification code: first build a tree with createTree, then make up a data point to classify. Note: createTree deletes entries from labels one by one, so copy labels before building the tree and pass the copy to classify, or the feature lookup will error out. Test code:
if __name__ == "__main__":
    myData, labels = createDataSet()
    myLabels = labels[:]    # copy first: createTree consumes labels
    myTree = createTree(myData, labels)
    #print(myLabels)
    label0 = classify(myTree, myLabels, [1, 0])
    print(label0)
For the data point [1, 0] the test prints 'no'.
Saving the Model
Retraining the model every time we predict on a new dataset is unrealistic; a trained model should be saved once and simply loaded at prediction time. Python ships a handy module, pickle, which can serialize our tree (a nested dict) to disk and read it back, so that's what we use here.
import pickle

def storeTree(inputTree, filename):
    # Serialize the tree (a nested dict) to disk.
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    # Load a previously stored tree back into memory.
    with open(filename, 'rb') as fr:
        outputTree = pickle.load(fr)
    return outputTree
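Why pickle rather than, say, json? One concrete reason for this particular tree, to my understanding: its branch keys are ints (feature values), and a json round trip turns dict keys into strings, while pickle preserves the dict exactly. A quick demonstration (my own example):

import json, pickle

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

assert pickle.loads(pickle.dumps(tree)) == tree   # int branch keys survive
assert json.loads(json.dumps(tree)) != tree       # keys come back as '0', '1'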
Test code:
if __name__ == "__main__":
    myData, labels = createDataSet()
    myLabels = labels[:]    # copy first: createTree consumes labels
    myTree = createTree(myData, labels)
    label0 = classify(myTree, myLabels, [1, 0])
    storeTree(myTree, "classifier.txt")
    myTree0 = grabTree("classifier.txt")
    print(myTree0)
The reloaded tree prints out identical to the original: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}.