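The FP-growth algorithm relies on two structures: an FP-tree and a header table (a linked list of head nodes). An FP-tree resembles an ordinary tree, except that nodes holding the same item are also linked to each other through pointers. This post uses the example from Machine Learning in Action; the dataset and its corresponding header table and FP-tree are shown below.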

  • Dataset

  • FP-tree for the dataset



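First, notice that some items in the dataset, such as o, n, and m, appear in neither the header table nor the FP-tree. They are dropped because their frequency falls below the minimum support (the minimum support is set to 3 here; it is an absolute count, unlike in Apriori, where the minimum support is a ratio). This gives the first step of building an FP-tree:

1. Scan the dataset and count how often each item occurs. Discard the items whose count is below the minimum support, and write the remaining items and their counts into the header table.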

        initial_header = {}
        # 1. the first scan, get singleton set
        for record in train_data:
            for item in record:
                initial_header[item] = initial_header.get(item, 0) + train_data[record]

        # get singleton set whose support is greater than min_support. If there is no set meeting the condition,  return none
        header = {}
        for k in initial_header.keys():
            if initial_header[k] >= self.min_support:
                header[k] = initial_header[k]
        frequent_set = set(header.keys())
        if len(frequent_set) == 0:
            return None, None
        # each header entry becomes [support, pointer to the first tree node holding that item]
        for k in header:
            header[k] = [header[k], None]
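
The first scan can also be tried in isolation. The following is a minimal sketch, not the author's code: the transactions are assumed to match the book's example (the original figure is not reproduced here), and `count_frequent_items` is a hypothetical helper name.

```python
from collections import Counter

# Assumed to match the book's example dataset; treat as illustrative data.
transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
MIN_SUPPORT = 3  # absolute count, as in the text


def count_frequent_items(records, min_support):
    """First scan: count each item once per record, keep the frequent ones."""
    counts = Counter(item for record in records for item in set(record))
    return {item: n for item, n in counts.items() if n >= min_support}


frequent = count_frequent_items(transactions, MIN_SUPPORT)
# Sort by support (descending), breaking ties alphabetically for stable output.
print(sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0])))
# [('z', 5), ('x', 4), ('r', 3), ('s', 3), ('t', 3), ('y', 3)]
```

Under these assumptions, items such as o, n, and m occur only once and are filtered out, which agrees with the observation above.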


class FPNode:
    def __init__(self, item, count, parent):
        self.item = item
        self.count = count              # support
        self.parent = parent
        self.next = None                # link to the next node holding the same item
        self.children = {}


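2. Scan the dataset a second time. When the record being added follows a path that already exists in the FP-tree, only the counts of the nodes along that path need to be updated; otherwise the tree branches at the point where the paths diverge and new nodes are created. The code first defines the root node and then iterates over the dataset. For each record, it keeps only the frequent items and looks up each item's support in the first field of its header table entry. If the record contains any frequent items, they are sorted by support in descending order and then inserted into the tree.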

        FP_tree = FPNode('root', 1, None)        # root node
        for record, count in train_data.items():
            frequent_item = {}
            for item in record:                # if item is a frequent set, add it
                if item in frequent_set:       # 2.1 filter infrequent_item
                    frequent_item[item] = header[item][0]

            if len(frequent_item) > 0:
                ordered_frequent_item = [val[0] for val in sorted(frequent_item.items(), key=lambda val:val[1], reverse=True)]  # 2.1 sort all the elements in descending order according to count
                self.updataTree(ordered_frequent_item, FP_tree, header, count) # 2.2 insert frequent_item in FP-Tree, share the path with the same prefix


    def updataTree(self, data, FP_tree, header, count):
        frequent_item = data[0]
        if frequent_item in FP_tree.children:
            # the path already exists: only increase the count
            FP_tree.children[frequent_item].count += count
        else:
            # the paths diverge here: create a new node
            FP_tree.children[frequent_item] = FPNode(frequent_item, count, FP_tree)
            if header[frequent_item][1] is None:
                # first node for this item: the header entry points at it
                header[frequent_item][1] = FP_tree.children[frequent_item]
            else:
                # append the new node to the chain of nodes holding the same item
                self.updateHeader(header[frequent_item][1], FP_tree.children[frequent_item])

        if len(data) > 1:
            self.updataTree(data[1::], FP_tree.children[frequent_item], header, count)  # recursively insert the remaining items


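The call self.updateHeader(header[frequent_item][1], FP_tree.children[frequent_item]) is now easy to understand: starting from the node that the header table entry points to, it follows the chain of nodes holding the same item and appends the new node at the end.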

    def updateHeader(self, head_node, tail_node):
        while head_node.next is not None:
            head_node = head_node.next
        head_node.next = tail_node
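
To see what the chain is used for, here is a small sketch that appends nodes to a chain and then walks it to sum the counts. `Node`, `append_link`, and `chain_support` are hypothetical names introduced for illustration; `Node` is a stripped-down stand-in for FPNode.

```python
class Node:
    """Hypothetical stand-in for FPNode, keeping only the fields used here."""

    def __init__(self, item, count):
        self.item = item
        self.count = count
        self.next = None  # next node holding the same item


def append_link(head, new_node):
    """Walk to the end of the chain and attach new_node (same idea as updateHeader)."""
    while head.next is not None:
        head = head.next
    head.next = new_node


def chain_support(head):
    """Sum the counts of every node reachable through .next."""
    total = 0
    while head is not None:
        total += head.count
        head = head.next
    return total


first_r = Node('r', 1)
append_link(first_r, Node('r', 1))
append_link(first_r, Node('r', 1))
print(chain_support(first_r))  # 3
```

Appending this way costs time proportional to the chain length on every insertion. Keeping a tail pointer in the header entry would avoid the walk, at the price of one more field to maintain.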


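  • Data after filtering and sorting

  • The first record (z, r) is read first. Because the tree is initially empty, its items are simply added as new nodes.


  • The second record (z, x, y, s, t) is read next. Its first item z is already in the FP-tree, so the count of node z is increased; the recursion then continues with the next item each time. The second item x is not yet a child of z, so a new node is created for it and the header table entry for x is set to point to that node. The remaining items are added below the x branch. Note how linking works when an item gets a second node: r already has a node from (z, r), so when a later record creates another r node on a different branch, the next field of the first r node is set to point to the new one. The remaining records are handled the same way until the FP-tree is complete.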
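
The two insertions described above can be replayed with a short sketch. It is an iterative variant of the recursive updataTree and leaves out the header links to keep the focus on prefix sharing; `TreeNode` and `insert` are hypothetical names, and only the two records quoted in the walk-through are used.

```python
class TreeNode:
    """Hypothetical minimal tree node: item, count, parent, children."""

    def __init__(self, item, count, parent):
        self.item = item
        self.count = count
        self.parent = parent
        self.children = {}


def insert(node, items, count=1):
    """Insert one ordered record below node, sharing any existing prefix."""
    for item in items:
        child = node.children.get(item)
        if child is None:
            # paths diverge here: start a new branch with count 0
            child = TreeNode(item, 0, node)
            node.children[item] = child
        child.count += count
        node = child


root = TreeNode('root', 1, None)
insert(root, ('z', 'r'))
insert(root, ('z', 'x', 'y', 's', 't'))

z = root.children['z']
print(z.count, sorted(z.children))  # 2 ['r', 'x']
```

After both records, z is shared and counted twice, while r and x sit on separate branches below it.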


2.2 Frequent itemsets


    def findFrequentItem(self, header, prefix, frequent_set):
        # visit the header items in ascending order of support and recurse into each
        # conditional FP-tree until no frequent items are left
        header_items = [val[0] for val in sorted(header.items(), key=lambda val: val[1][0])]
        if len(header_items) == 0:
            return

        for base in header_items:
            new_prefix = prefix.copy()
            new_prefix.add(base)            # extend the prefix with this item (prefix is assumed to be a set)
            support = header[base][0]
            frequent_set[frozenset(new_prefix)] = support

            prefix_path = self.getPrefixPath(base, header)
            if len(prefix_path) != 0:
                conditional_tree, conditional_header = self.createFPTree(prefix_path)
                if conditional_header is not None:
                    self.findFrequentItem(conditional_header, new_prefix, frequent_set)


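  • Within the conditional FP-tree of y, item z has the empty prefix path {}, so the frequent itemset {y, z} is returned
  • Item x has the prefix path {z}, so a conditional FP-tree is built for {z}, as shown in the figure. That conditional FP-tree contains only one item, so the frequent itemset {y, x, z} is returned. Because y is itself a frequent itemset, {y, x} is also frequent
  • The frequent itemsets found for y are therefore {y}, {y, z}, {y, x}, and {y, x, z}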


  • FP-tree


  • Conditional FP-tree built from the prefix paths of y


  • Conditional FP-tree for the prefix path {z}
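
The recursion for y can be checked with a simplified sketch. It is not the author's findFrequentItem: instead of rebuilding a compressed conditional FP-tree at each level, it mines directly from ordered prefix paths, which gives the same itemsets on small inputs. The function name `mine` and the input `{('z', 'x'): 3}` (y's prefix paths with a count of 3) are assumptions made for illustration.

```python
from collections import defaultdict


def mine(paths, prefix, min_support, found):
    """Pattern growth over ordered prefix paths.

    paths maps a tuple of items (most frequent first) to a count.
    """
    support = defaultdict(int)
    for path, count in paths.items():
        for item in path:
            support[item] += count

    for item, total in support.items():
        if total < min_support:
            continue
        itemset = prefix | {item}
        found[frozenset(itemset)] = total
        # Conditional base for item: whatever precedes it on each path.
        conditional = defaultdict(int)
        for path, count in paths.items():
            if item in path:
                head = path[:path.index(item)]
                if head:
                    conditional[head] += count
        mine(conditional, itemset, min_support, found)


found = {frozenset({'y'}): 3}           # y itself is frequent
mine({('z', 'x'): 3}, {'y'}, 3, found)  # assumed prefix paths of y
print(sorted(''.join(sorted(s)) for s in found))
# ['xy', 'xyz', 'y', 'yz']
```

The four itemsets match the list given for y: {y}, {y, z}, {y, x}, and {y, x, z}.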

2.3 Association rules


    def generateRules(self, frequent_set, rules):
        for frequent_item in frequent_set:
            if len(frequent_item) > 1:
                self.getRules(frequent_item, frequent_item, frequent_set, rules)

    def getRules(self, frequent_item, current_item, frequent_set, rules):
        for item in current_item:
            subset = self.removeItem(current_item, item)
            confidence = frequent_set[frequent_item]/frequent_set[subset]
            if confidence >= self.min_confidence:
                flag = False
                for rule in rules:
                    if (rule[0] == subset) and (rule[1] == frequent_item - subset):
                        flag = True

                if flag == False:
                    rules.append((subset, frequent_item - subset, confidence))

                if (len(subset) >= 2):
                    self.getRules(frequent_item, subset, frequent_set, rules)
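
The confidence used above is support(itemset) / support(antecedent). The sketch below illustrates that formula on a single itemset. Unlike getRules, which shrinks the antecedent recursively and prunes on low confidence, it simply enumerates every non-empty proper subset; `rules_from` is a hypothetical name, and the support values are assumed from the running example.

```python
from itertools import combinations


def rules_from(itemset, supports, min_confidence):
    """Return (antecedent, consequent, confidence) triples for one frequent itemset."""
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), size)):
            confidence = supports[itemset] / supports[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Assumed supports: y occurs 3 times, z 5 times, and they co-occur 3 times.
supports = {frozenset('y'): 3, frozenset('z'): 5, frozenset('yz'): 3}
rules = rules_from(frozenset('yz'), supports, 0.7)
print(rules)  # [(frozenset({'y'}), frozenset({'z'}), 1.0)]
```

With a minimum confidence of 0.7, the rule y → z (3/3 = 1.0) is kept, while z → y (3/5 = 0.6) is dropped.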

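3. Summary and analysis

FP-growth scans the dataset only twice, so it avoids the repeated passes over the data that Apriori makes and its running time stays comparatively low. For massive datasets, however, building a single FP-tree in memory is not realistic, and a parallel implementation of FP-growth has to be considered. The same dataset as for Apriori is used here.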




