The C4.5 Algorithm

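【Scope of Application】

C4.5 handles classification problems. It can be used whenever the boundaries between the classes of the target problem can be determined by a tree-style decomposition or by decision rules.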

【Type】

Supervised learning

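【Basic Idea】

We are given a dataset in which every instance is described by a set of attributes and belongs to exactly one class. Running C4.5 on this dataset learns a mapping from attribute values to classes, and that mapping can then be used to classify new, unseen instances.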
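To make "a mapping from attribute values to classes" concrete, here is a minimal sketch of what such a learned mapping could look like and how it classifies a new instance. The tree is written by hand, and the attribute names (`outlook`, `windy`), the class labels, and the nested-dict layout are all assumptions made for illustration, not something prescribed by C4.5.

```python
# Hypothetical hand-written tree (illustrative only): internal nodes are dicts
# that name the attribute to test, and leaves are plain class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {"yes": "stay_in", "no": "play"},
        },
        "rainy": "stay_in",
        "overcast": "play",
    },
}


def classify(node, instance):
    """Follow the branch matching the instance's attribute value until a leaf."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


print(classify(tree, {"outlook": "sunny", "windy": "no"}))  # play
```

An attribute value that never appeared in the tree would raise a `KeyError` here; a fuller implementation would need a fallback such as the majority class.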

【Algorithm】

Input: an attribute-valued dataset D

Tree = {}
if D is "pure" or other stopping criteria met then
    terminate
end if
for all attributes a ∈ D do
    Compute information-theoretic criteria if we split on a
end for
a_best = best attribute according to the criteria computed above
Tree = create a decision node that tests a_best in the root
Dv = induced sub-datasets from D based on a_best
for all Dv do
    Treev = C4.5(Dv)
    Attach Treev to the corresponding branch of Tree
end for
return Tree

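【Explanation】

The root node represents the given dataset. Starting from the root, each node tests one particular attribute, which splits that node's dataset into smaller subsets, each represented by a subtree. The process repeats until a subset is "pure", meaning that all of its instances belong to the same class, at which point the tree stops growing along that branch.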

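【Key Points】

1. Information-theoretic criteria

C4.5 uses information-theoretic criteria such as gain (Gain) and gain ratio (Gain Ratio) to choose the attribute on which to split into subtrees. Gain measures the reduction in the entropy of the class distribution; the larger the gain, the better the attribute separates the classes. Its drawback is a strong bias toward attributes that have many outcomes. Gain ratio corrects for this bias, so it is the default criterion in C4.5.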

GainRatio(a) = Gain(a) / Entropy(a)

where Gain(a) = Entropy(Category in D) - ∑ (|Dv| / |D|) * Entropy(Category in Dv), summed over all subsets Dv

Entropy = -∑ p * log2(p)

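D is the entire dataset, and each Dv is a subset of D whose instances all share the same value of attribute a, although their categories may differ.

Entropy(a) for attribute a depends only on the probability distribution of the attribute's values and is independent of the category.

Gain(a) for attribute a depends on the category.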
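The three formulas can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function names, the list-of-dicts row format, and the toy dataset (with a deliberately useless `id` attribute) are assumptions introduced here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy = -sum(p * log2(p)) over the empirical distribution of `values`."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain(rows, attribute, target):
    """Entropy(Category in D) minus the |Dv|/|D|-weighted entropy of each Dv."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


def gain_ratio(rows, attribute, target):
    """Gain(a) / Entropy(a); defined as 0.0 here when a has a single value."""
    split_entropy = entropy([row[attribute] for row in rows])
    return gain(rows, attribute, target) / split_entropy if split_entropy > 0 else 0.0


# Hypothetical toy dataset: "id" is unique per row, "windy" has two values.
rows = [
    {"id": "1", "windy": "no", "label": "play"},
    {"id": "2", "windy": "no", "label": "play"},
    {"id": "3", "windy": "yes", "label": "stay"},
    {"id": "4", "windy": "yes", "label": "stay"},
]

print(gain(rows, "id", "label"), gain(rows, "windy", "label"))              # 1.0 1.0
print(gain_ratio(rows, "id", "label"), gain_ratio(rows, "windy", "label"))  # 0.5 1.0
```

Both attributes produce pure subsets and therefore the same gain, but `id` does so only by giving every instance its own branch. Dividing by Entropy(a), which is 2 bits for `id` and 1 bit for `windy`, makes gain ratio prefer `windy`.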

【Code Implementation】
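What follows is a minimal Python sketch of the recursive procedure, assuming discrete attributes only and using gain ratio as the splitting criterion. It is not the original C4.5 program: the full algorithm also handles continuous attributes, missing values, and pruning, all of which are omitted here. The function names, the nested-dict tree layout, the choice to return the majority class at a leaf, and the toy dataset are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy = -sum(p * log2(p)) over the empirical distribution of `values`."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def partition(rows, attribute):
    """Group rows into the sub-datasets Dv, one per value of `attribute`."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row)
    return subsets


def gain_ratio(rows, attribute, target):
    """Gain(a) / Entropy(a); defined as 0.0 here when a has a single value."""
    labels = [row[target] for row in rows]
    remainder = sum(
        len(s) / len(rows) * entropy([r[target] for r in s])
        for s in partition(rows, attribute).values()
    )
    split_entropy = entropy([row[attribute] for row in rows])
    return (entropy(labels) - remainder) / split_entropy if split_entropy > 0 else 0.0


def build_tree(rows, attributes, target):
    """Return a class label (leaf) or a dict node testing the best attribute."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria (assumed): D is pure, or no attributes are left to test.
    if len(set(labels)) == 1 or not attributes:
        return majority
    a_best = max(attributes, key=lambda a: gain_ratio(rows, a, target))
    # Assumed extra stopping rule: no attribute improves on the current node.
    if gain_ratio(rows, a_best, target) <= 0:
        return majority
    remaining = [a for a in attributes if a != a_best]
    return {
        "attribute": a_best,
        "branches": {
            value: build_tree(subset, remaining, target)
            for value, subset in partition(rows, a_best).items()
        },
    }


# Hypothetical toy dataset, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "stay"},
    {"outlook": "rainy", "windy": "no", "label": "stay"},
    {"outlook": "rainy", "windy": "yes", "label": "stay"},
    {"outlook": "overcast", "windy": "no", "label": "play"},
    {"outlook": "overcast", "windy": "yes", "label": "play"},
]

tree = build_tree(rows, ["outlook", "windy"], "label")
print(tree)
```

On this data the root tests `outlook`, because its gain ratio is higher than that of `windy`. The `rainy` and `overcast` branches are already pure and become leaves, while the `sunny` branch is split once more on `windy`.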

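Note: this article was compiled with reference to 《數據挖掘十大算法》 (The Top Ten Algorithms in Data Mining), published by Tsinghua University Press.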