Weka Study: J48 (C4.5)

Before writing: to improve my English, I will write this blog in English.

Section 1: J48

J48 is the class that implements the C4.5 algorithm. Look at part of its code. In the buildClassifier(...) method there are two important classes: ModelSelection (extended by BinC45ModelSelection and C45ModelSelection) and ClassifierTree (extended by C45PruneableClassifierTree and PruneableClassifierTree). ModelSelection is used to select the children of a node, and ClassifierTree works together with ModelSelection to build the tree.

 public void buildClassifier(Instances instances) 
       throws Exception {

    ModelSelection modSelection;	 

    if (m_binarySplits)
      modSelection = new BinC45ModelSelection(m_minNumObj, instances);
    else
      modSelection = new C45ModelSelection(m_minNumObj, instances);
    if (!m_reducedErrorPruning)
      m_root = new C45PruneableClassifierTree(modSelection, !m_unpruned, m_CF,
					    m_subtreeRaising, !m_noCleanup);
    else
      m_root = new PruneableClassifierTree(modSelection, !m_unpruned, m_numFolds,
					   !m_noCleanup, m_Seed);
    m_root.buildClassifier(instances);
    if (m_binarySplits) {
      ((BinC45ModelSelection)modSelection).cleanup();
    } else {
      ((C45ModelSelection)modSelection).cleanup();
    }
  }
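
Before digging into the internals, here is a minimal usage sketch (the ARFF file path is a placeholder) showing how the buildClassifier(...) above gets invoked:

  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class J48Demo {
    public static void main(String[] args) throws Exception {
      // Load a dataset; "weather.nominal.arff" is a placeholder path.
      Instances data = DataSource.read("weather.nominal.arff");
      data.setClassIndex(data.numAttributes() - 1);

      J48 tree = new J48();
      tree.setBinarySplits(false);        // false -> C45ModelSelection
      tree.setReducedErrorPruning(false); // false -> C45PruneableClassifierTree
      tree.buildClassifier(data);         // the method shown above
      System.out.println(tree);
    }
  }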

Section 2: ModelSelection

Section 2.1: Split

In this part I will discuss the classes BinC45ModelSelection and C45ModelSelection. In general, these two classes iterate over the attributes to select the best one, i.e. the one with the highest info gain ratio. Calculating the info gain of a given attribute is the job of BinC45Split or C45Split, so let us first glance at BinC45Split and C45Split.
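
Before looking at the code, here is a minimal, self-contained sketch (my own code, not Weka's) of the quantities these classes compute; counts[i][j] stands for the total weight of instances that fall into subset i with class j, which is the role played by Weka's Distribution class:

  public class GainRatioSketch {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy of one class distribution.
    static double entropy(double[] classCounts) {
      double total = 0, ent = 0;
      for (double c : classCounts) total += c;
      for (double c : classCounts)
        if (c > 0) ent -= (c / total) * log2(c / total);
      return ent;
    }

    // Info gain = entropy before the split - weighted entropy after it.
    static double infoGain(double[][] counts) {
      int numClasses = counts[0].length;
      double[] parent = new double[numClasses];
      double total = 0, afterEnt = 0;
      for (double[] bag : counts)
        for (int j = 0; j < numClasses; j++) {
          parent[j] += bag[j];
          total += bag[j];
        }
      for (double[] bag : counts) {
        double bagTotal = 0;
        for (double c : bag) bagTotal += c;
        afterEnt += (bagTotal / total) * entropy(bag);
      }
      return entropy(parent) - afterEnt;
    }

    // Gain ratio = info gain / split info (entropy over the subset sizes).
    static double gainRatio(double[][] counts) {
      double[] bagTotals = new double[counts.length];
      for (int i = 0; i < counts.length; i++)
        for (double c : counts[i]) bagTotals[i] += c;
      double splitInfo = entropy(bagTotals);
      return splitInfo > 0 ? infoGain(counts) / splitInfo : 0;
    }

    public static void main(String[] args) {
      // Quinlan's weather data split on "outlook" (sunny/overcast/rainy),
      // class counts (play = yes/no): {2,3}, {4,0}, {3,2}.
      double[][] counts = { { 2, 3 }, { 4, 0 }, { 3, 2 } };
      System.out.println("info gain  = " + infoGain(counts));  // ~0.247
      System.out.println("gain ratio = " + gainRatio(counts)); // ~0.157
    }
  }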


These two classes are so alike that the only difference is that, for a nominal attribute, the former splits it into two subsets while the latter splits it into multiple subsets. So I will only explain the former (the latter is easier).

 public void buildClassifier(Instances trainInstances) 
       throws Exception {

    // Initialize the remaining instance variables.
    m_numSubsets = 0;
    m_splitPoint = Double.MAX_VALUE;
    m_infoGain = 0;
    m_gainRatio = 0;

    // Different treatment for enumerated and numeric
    // attributes.
    if (trainInstances.attribute(m_attIndex).isNominal()) {
      m_complexityIndex = trainInstances.attribute(m_attIndex).numValues();
      m_index = m_complexityIndex;
      handleEnumeratedAttribute(trainInstances);
    }else{
      m_complexityIndex = 2;
      m_index = 0;
      trainInstances.sort(trainInstances.attribute(m_attIndex));
      handleNumericAttribute(trainInstances);
    }
  }

The above is the main method for calculating the info gain. It delegates to two methods, handleEnumeratedAttribute(...) and handleNumericAttribute(...), which I will explain in turn. First, the multi-way version of handleEnumeratedAttribute(...) from C45Split:


 private void handleEnumeratedAttribute(Instances trainInstances)
       throws Exception {
    
    Instance instance;

    m_distribution = new Distribution(m_complexityIndex,
			      trainInstances.numClasses());
    
    // Only Instances with known values are relevant.
    Enumeration enu = trainInstances.enumerateInstances();
    while (enu.hasMoreElements()) {
      instance = (Instance) enu.nextElement();
      if (!instance.isMissing(m_attIndex))
	m_distribution.add((int)instance.value(m_attIndex),instance);
    }
    
    // Check if minimum number of Instances in at least two
    // subsets.
    if (m_distribution.check(m_minNoObj)) {
      m_numSubsets = m_complexityIndex;
      m_infoGain = infoGainCrit.
	splitCritValue(m_distribution,m_sumOfWeights);
      m_gainRatio = 
	gainRatioCrit.splitCritValue(m_distribution,m_sumOfWeights,
				     m_infoGain);
    }
  }

The version above builds one bag per attribute value. BinC45Split's version of the same method instead tries each value in turn as one side of a two-way split and keeps the best:

  private void handleEnumeratedAttribute(Instances trainInstances)
       throws Exception {
    
    Distribution newDistribution,secondDistribution;
    int numAttValues;
    double currIG,currGR;
    Instance instance; 
    int i;

    numAttValues = trainInstances.attribute(m_attIndex).numValues();
    newDistribution = new Distribution(numAttValues,
				       trainInstances.numClasses());
    
    // Only Instances with known values are relevant.
    Enumeration enu = trainInstances.enumerateInstances();
    while (enu.hasMoreElements()) {
      instance = (Instance) enu.nextElement();
      if (!instance.isMissing(m_attIndex))
	newDistribution.add((int)instance.value(m_attIndex),instance);
    }
    m_distribution = newDistribution;

    // For all values
    for (i = 0; i < numAttValues; i++){

      if (Utils.grOrEq(newDistribution.perBag(i),m_minNoObj)){
	/**
	 * Use newDistribution to initialize a two-bag distribution
	 * (value i vs. all the other values); for more detail, please
	 * check the Distribution class.
	 */
	secondDistribution = new Distribution(newDistribution,i);

	// Check if the minimum number of Instances is in each of the
	// two subsets, then keep the split point with the best gain ratio.
	if (secondDistribution.check(m_minNoObj)){
	  m_numSubsets = 2;

	  // Calculate the info gain and info gain ratio.
	  currIG = m_infoGainCrit.splitCritValue(secondDistribution,
					       m_sumOfWeights);
	  currGR = m_gainRatioCrit.splitCritValue(secondDistribution,
						m_sumOfWeights,
						currIG);
	  if ((i == 0) || Utils.gr(currGR,m_gainRatio)){
	    m_gainRatio = currGR;
	    m_infoGain = currIG;
	    m_splitPoint = (double)i;
	    m_distribution = secondDistribution;
	  }
	}
      }
    }
  }

That completes the binary split search for nominal attributes: each value i is tried as one bag against all the other values, and m_splitPoint records the winning value. The handleNumericAttribute(...) method is very similar to handleEnumeratedAttribute(...); it scans the instances, sorted on the attribute, and evaluates every candidate threshold between two distinct adjacent values:

  private void handleNumericAttribute(Instances trainInstances)
       throws Exception {
  
    int firstMiss;
    int next = 1;
    int last = 0;
    int index = 0;
    int splitIndex = -1;
    double currentInfoGain;
    double defaultEnt;
    double minSplit;
    Instance instance;
    int i;

    // Current attribute is a numeric attribute.
    m_distribution = new Distribution(2,trainInstances.numClasses());
    
    // Only Instances with known values are relevant.
    /**
     * Remember that trainInstances has been sorted on this attribute,
     * with missing values placed at the end.
     */
    Enumeration enu = trainInstances.enumerateInstances();
    i = 0;
    while (enu.hasMoreElements()) {
      instance = (Instance) enu.nextElement();
      if (instance.isMissing(m_attIndex))
	break;
      m_distribution.add(1,instance);
      i++;
    }
    firstMiss = i;

    // Compute minimum number of Instances required in each
    // subset.
    minSplit =  0.1*(m_distribution.total())/
      ((double)trainInstances.numClasses());
    if (Utils.smOrEq(minSplit,m_minNoObj)) 
      minSplit = m_minNoObj;
    else
      if (Utils.gr(minSplit,25)) 
	minSplit = 25;

    // Enough Instances with known values?
    if (Utils.sm((double)firstMiss,2*minSplit))
      return;
    
    // Compute values of criteria for all possible split
    // indices.
    defaultEnt = m_infoGainCrit.oldEnt(m_distribution);
    /**
     * In the while loop, we find the best split point.
     */
    while (next < firstMiss){
	  
      if (trainInstances.instance(next-1).value(m_attIndex)+1e-5 < 
	  trainInstances.instance(next).value(m_attIndex)){ 
	
	// Move class values for all Instances up to next 
	// possible split point.
	m_distribution.shiftRange(1,0,trainInstances,last,next);
	
	// Check if enough Instances in each subset and compute
	// values for criteria.
	if (Utils.grOrEq(m_distribution.perBag(0),minSplit) && 
	    Utils.grOrEq(m_distribution.perBag(1),minSplit)){
	  currentInfoGain = m_infoGainCrit.
	    splitCritValue(m_distribution,m_sumOfWeights,
			   defaultEnt);
	  if (Utils.gr(currentInfoGain,m_infoGain)){
	    m_infoGain = currentInfoGain;
	    splitIndex = next-1;
	  }
	  index++;
	}
	last = next;
      }
      next++;
    }
    
    // Was there any useful split?
    if (index == 0)
      return;
    
    // Compute modified information gain for best split. C4.5 (Release 8)
    // subtracts log2(index)/totalWeights as a penalty for having chosen
    // among "index" candidate thresholds; the gain must stay positive
    // after this correction.
    m_infoGain = m_infoGain-(Utils.log2(index)/m_sumOfWeights);
    if (Utils.smOrEq(m_infoGain,0))
      return;
    
    // Set instance variables' values to values for
    // best split.
    m_numSubsets = 2;
    m_splitPoint = 
      (trainInstances.instance(splitIndex+1).value(m_attIndex)+
       trainInstances.instance(splitIndex).value(m_attIndex))/2;

    // In case we have a numerical precision problem we need to choose the
    // smaller value
    if (m_splitPoint == trainInstances.instance(splitIndex + 1).value(m_attIndex)) {
      m_splitPoint = trainInstances.instance(splitIndex).value(m_attIndex);
    }

    // Restore distribution for best split.
    m_distribution = new Distribution(2,trainInstances.numClasses());
    m_distribution.addRange(0,trainInstances,0,splitIndex+1);
    m_distribution.addRange(1,trainInstances,splitIndex+1,firstMiss);

    // Compute modified gain ratio for best split.
    m_gainRatio = m_gainRatioCrit.
      splitCritValue(m_distribution,m_sumOfWeights,
		     m_infoGain);
  }
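
To make the scan concrete, here is a small sketch (my own code again, reusing infoGain from the GainRatioSketch above) that applies the same midpoint rule to a numeric attribute whose values are already sorted:

  // Sketch (not Weka code): find the threshold with the highest info gain.
  // values[] is sorted ascending; classes[i] is in {0, ..., numClasses-1}.
  static double bestSplitPoint(double[] values, int[] classes, int numClasses) {
    double bestGain = 0, bestPoint = Double.MAX_VALUE;
    for (int next = 1; next < values.length; next++) {
      // Only cut between two distinct adjacent values (cf. the 1e-5 test above).
      if (values[next - 1] + 1e-5 < values[next]) {
        double[][] counts = new double[2][numClasses];
        for (int i = 0; i < values.length; i++)
          counts[i < next ? 0 : 1][classes[i]]++;
        double gain = GainRatioSketch.infoGain(counts);
        if (gain > bestGain) {
          bestGain = gain;
          bestPoint = (values[next - 1] + values[next]) / 2; // midpoint rule
        }
      }
    }
    return bestPoint; // Double.MAX_VALUE means no useful split was found
  }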

Section 2.2: ModelSelection

The following is the core of selectModel(...) in BinC45ModelSelection (C45ModelSelection is analogous):

      for (i = 0; i < data.numAttributes(); i++){
	
	// Apart from class attribute.
	if (i != (data).classIndex()){
	  
	  // Get models for current attribute.
	  currentModel[i] = new BinC45Split(i,m_minNoObj,sumOfWeights);
	  currentModel[i].buildClassifier(data);
	  
	  // Check if useful split for current attribute
	  // exists and check for enumerated attributes with 
	  // a lot of values.
	  if (currentModel[i].checkModel())
	    if ((data.attribute(i).isNumeric()) ||
		(multiVal || Utils.sm((double)data.attribute(i).numValues(),
				      (0.3*(double)m_allData.numInstances())))){
	      averageInfoGain = averageInfoGain+currentModel[i].infoGain();
	      validModels++;
	    }
	}else
	  currentModel[i] = null;
      }
      
      // Check if any useful split was found.
      if (validModels == 0)
	return noSplitModel;
      averageInfoGain = averageInfoGain/(double)validModels;

      // Find "best" attribute to split on.
      minResult = 0;
      for (i=0;i<data.numAttributes();i++){
	if ((i != (data).classIndex()) &&
	    (currentModel[i].checkModel()))
	  
	  // Use 1E-3 here to get a closer approximation to the original
	  // implementation.
	  if ((currentModel[i].infoGain() >= (averageInfoGain-1E-3)) &&
	      Utils.gr(currentModel[i].gainRatio(),minResult)){ 
	    bestModel = currentModel[i];
	    minResult = currentModel[i].gainRatio();
	  }
      }

In the first for loop, a BinC45Split is built for each attribute, and the average info gain over the attributes with a valid split is calculated. In the second for loop, the best attribute is selected: among the attributes whose info gain is at least (approximately) the average, the one with the highest gain ratio wins. The iteration is simple and clear.

Section 3: ClassifierTree

The class ClassifierTree has two subclasses: C45PruneableClassifierTree and PruneableClassifierTree. The main difference between them is how the error rate is estimated, which is more complex in C45PruneableClassifierTree. If you have read the C4.5 paper, it is not hard to understand how the error rate is calculated, so I will not explain it in detail.
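
For reference, C45PruneableClassifierTree bases its estimate on an upper confidence bound of the node's error rate (computed in Weka by Stats.addErrs). Below is a minimal sketch of that bound under the usual normal approximation to the binomial; the method name is mine, and the caller must supply z, the standard-normal quantile for 1 - CF (about 0.6745 for the default CF = 0.25):

  // Sketch (not the actual Weka code): upper confidence bound on the error
  // rate of a node that covers n instances and misclassifies e of them.
  // z is the standard-normal quantile for 1 - CF.
  static double pessimisticErrorRate(double n, double e, double z) {
    double f = e / n; // observed error rate
    return (f + z * z / (2 * n)
              + z * Math.sqrt(f / n - f * f / n + z * z / (4 * n * n)))
           / (1 + z * z / n);
  }

The estimated number of errors at a node is then n times this rate; a subtree's estimate sums over its leaves, which is what getEstimatedErrors() in the pruning code below computes.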


   buildTree(data, m_subtreeRaising || !m_cleanup);
   collapse();
   if (m_pruneTheTree) {
     prune();
   }
   if (m_cleanup) {
     cleanup(new Instances(data, 0));
   }

As the code posted above shows (it is from C45PruneableClassifierTree.buildClassifier(...)), there are two major functions: buildTree(...) and prune(...). In between, collapse() turns subtrees that do not decrease the training error into leaves, and cleanup(...) finally drops the saved training data.

Let us read part of the code of buildTree(...):

   if (m_localModel.numSubsets() > 1) {
      localInstances = m_localModel.split(data);
      data = null;
      m_sons = new ClassifierTree [m_localModel.numSubsets()];
      for (int i = 0; i < m_sons.length; i++) {
	m_sons[i] = getNewTree(localInstances[i]);
	localInstances[i] = null;
      }
    }else{
      m_isLeaf = true;
      if (Utils.eq(data.sumOfWeights(), 0))
	m_isEmpty = true;
      data = null;
    }

protected ClassifierTree getNewTree(Instances data) throws Exception {
    
    C45PruneableClassifierTree newTree = 
      new C45PruneableClassifierTree(m_toSelectModel, m_pruneTheTree, m_CF,
				     m_subtreeRaising, m_cleanup);
    newTree.buildTree((Instances)data, m_subtreeRaising || !m_cleanup);

    return newTree;
  }

As the code shows, the node first uses ModelSelection to select the best ClassifierSplitModel, then uses that model to split the instances into subsets. A child tree is then built recursively from each subset.

Let us read part of prune(...):


    if (!m_isLeaf){

      // Prune all subtrees.
      for (i=0;i<m_sons.length;i++)
	son(i).prune();

      // Compute error for largest branch
      indexOfLargestBranch = localModel().distribution().maxBag();
      if (m_subtreeRaising) {
	errorsLargestBranch = son(indexOfLargestBranch).
	  getEstimatedErrorsForBranch((Instances)m_train);
      } else {
	errorsLargestBranch = Double.MAX_VALUE;
      }

      // Compute error if this Tree would be leaf
      errorsLeaf = 
	getEstimatedErrorsForDistribution(localModel().distribution());

      // Compute error for the whole subtree
      errorsTree = getEstimatedErrors();

      // Decide if leaf is best choice.
      if (Utils.smOrEq(errorsLeaf,errorsTree+0.1) &&
	  Utils.smOrEq(errorsLeaf,errorsLargestBranch+0.1)){

	// Free son Trees
	m_sons = null;
	m_isLeaf = true;
		
	// Get NoSplit Model for node.
	m_localModel = new NoSplit(localModel().distribution());
	return;
      }

If the node is not a leaf, it first recursively calls prune() on each of its children; in other words, pruning proceeds bottom-up. Then, if errorsLeaf (the estimated error if the node were turned into a leaf) is, within a tolerance of 0.1, no worse than both errorsTree (the estimated error of the current subtree) and errorsLargestBranch (the estimated error if the node were replaced by its largest branch, relevant when subtree raising is enabled), the node is pruned into a leaf.

