Filtering Spam Email with Naive Bayes

Preparing the data: splitting text into tokens

Python's built-in split() method can tokenize a string, but it splits only on whitespace, so punctuation stays attached to the tokens (note 'M.L.' and 'upon.' below):

In [7]: sentence = 'This is the best book on Python or M.L. I have ever laid eyes upon.'

In [8]: sentence.split()
Out[8]:
['This',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M.L.',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon.']

Splitting the sentence with a regular expression is more accurate:

In [1]: import re

In [2]: regular = re.compile('\\W*')

In [3]: sentence = 'This is the best book on Python or M.L. I have ever laid eyes upon.'

In [4]: list_of_tokens = regular.split(sentence)
C:\Users\birdguan\Anaconda3\Scripts\ipython:1: FutureWarning: split() requires a non-empty pattern match.

In [5]: list_of_tokens
Out[5]:
['This',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon',
 '']

We now have a list of tokens, but the empty strings need to be filtered out (the FutureWarning above is raised because the pattern \W* can match an empty string):

In [6]: [token for token in list_of_tokens if len(token) > 0]
Out[6]:
['This',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']

Now fix the mixed capitalization by converting every token to lowercase:

In [7]: [token.lower() for token in list_of_tokens if len(token) > 0]
Out[7]:
['this',
 'is',
 'the',
 'best',
 'book',
 'on',
 'python',
 'or',
 'm',
 'l',
 'i',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']
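
Putting these three steps together gives a compact tokenizer. A minimal sketch is below (the helper name tokenize is mine; the full program in the next section uses the equivalent parse_text). It uses \W+ rather than \W*: a pattern that can match the empty string is exactly what triggers the FutureWarning above, and on Python 3.7 and later re.split would even split the string between every character.

import re


def tokenize(sentence):
    # \W+ matches a run of one or more non-word characters,
    # so it can never match an empty string
    tokens = re.split(r'\W+', sentence)
    # drop the empty strings left at the ends and lowercase everything
    return [token.lower() for token in tokens if len(token) > 0]


print(tokenize('This is the best book on Python or M.L. I have ever laid eyes upon.'))
# ['this', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l',
#  'i', 'have', 'ever', 'laid', 'eyes', 'upon']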

Testing the algorithm: cross-validation with naive Bayes

The complete program below builds a vocabulary from the 25 spam and 25 ham emails in email/spam and email/ham, trains the classifier on a random 40-document subset, and measures the error rate on the 10 held-out documents, repeating the random split 10 times:

import re
import numpy as np


def create_vocab_list(data_set):
    """Build a list of every unique word seen across the documents."""
    vocab_set = set()
    for document in data_set:
        vocab_set = vocab_set | set(document)  # union with this document's words
    return list(vocab_set)


def set_of_word2vec(vocab_list, input_set):
    """Convert a document into a 0/1 vector marking which vocabulary words appear."""
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] = 1
        else:
            print("the word: %s is not in the vocabulary" % word)
    return return_vec


def train_naive_bayes(train_matrix, train_category):
    """Estimate log P(word|class) vectors and the class-1 prior from training data."""
    train_docs_num = len(train_matrix)
    words_num = len(train_matrix[0])
    p_abusive = sum(train_category)/float(train_docs_num)
    # initialize counts to 1 and denominators to 2 (Laplace smoothing) so that
    # a word unseen in one class never forces a zero probability
    p0_num = np.ones(words_num)
    p1_num = np.ones(words_num)
    p0_sum = 2.0
    p1_sum = 2.0
    for i in range(train_docs_num):
        if train_category[i] == 1:
            p1_num += train_matrix[i]
            p1_sum += sum(train_matrix[i])
        else:
            p0_num += train_matrix[i]
            p0_sum += sum(train_matrix[i])
    # work in log space to avoid underflow when many small probabilities multiply
    p0_vec = np.log(p0_num/p0_sum)
    p1_vec = np.log(p1_num/p1_sum)
    return p0_vec, p1_vec, p_abusive
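
# Worked example of the smoothing above: with a 3-word vocabulary, if the
# class-1 documents contain the word counts [3, 0, 1], then
#   p1_num = 1 + [3, 0, 1] = [4, 1, 2]   and   p1_sum = 2 + 4 = 6,
# so p1_vec = log([4/6, 1/6, 2/6]); the unseen word keeps a small nonzero
# probability instead of zeroing out the whole product.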


def classify_naive_bayes(vec2classify, p0_vec, p1_vec, p_class1):
    """Return 1 if the document is more likely class 1, else 0."""
    # sum of log-likelihoods plus log prior = log(P(doc|class) * P(class))
    p1 = np.sum(vec2classify * p1_vec) + np.log(p_class1)
    p0 = np.sum(vec2classify * p0_vec) + np.log(1 - p_class1)
    if p1 > p0:
        return 1
    else:
        return 0


def parse_text(big_string):
    """
    Parse a long string into a list of lowercase tokens.
    :param big_string: raw text of one email
    :return: list of lowercase word tokens
    """
    # \W+ rather than \W*: a pattern that can match the empty string raises a
    # FutureWarning and, on Python 3.7+, splits between every character
    token_list = re.split(r'\W+', big_string)
    return [token.lower() for token in token_list if len(token) > 0]


def spam_test():
    doc_list = []
    class_list = []
    full_list = []
    # load and parse the text files (25 spam and 25 ham emails)
    for i in range(1, 26):
        # latin-1 decodes every byte; some of the sample emails are not valid UTF-8
        word_list = parse_text(open('email/spam/%d.txt' % i, encoding='ISO-8859-1').read())
        doc_list.append(word_list)
        full_list.extend(word_list)
        class_list.append(1)
        word_list = parse_text(open('email/ham/%d.txt' % i, encoding='ISO-8859-1').read())
        doc_list.append(word_list)
        full_list.extend(word_list)
        class_list.append(0)
    vocab_list = create_vocab_list(doc_list)
    total_error_count = 0
    for step in range(10):
        print("=====> step  %d <=====" % (step+1))
        train_set = list(range(50))
        test_set = []
        # randomly hold out 10 of the 50 documents as the test set
        for i in range(10):
            rand_index = int(np.random.uniform(0, len(train_set)))  # 0, not 1: the first document must be eligible
            test_set.append(train_set[rand_index])
            del train_set[rand_index]
        train_mat = []
        train_class = []
        for doc_index in train_set:
            train_mat.append(set_of_word2vec(vocab_list, doc_list[doc_index]))
            train_class.append(class_list[doc_index])
        p0_v, p1_v, p_spam = train_naive_bayes(train_mat, train_class)
        error_count = 0
        for doc_index in test_set:
            word_vector = set_of_word2vec(vocab_list, doc_list[doc_index])
            if classify_naive_bayes(word_vector, p0_v, p1_v, p_spam) != class_list[doc_index]:
                error_count += 1
                total_error_count += 1
                print("[WARNING] The real class of ", doc_list[doc_index], "is", class_list[doc_index])
        print("Current error rate is:", float(error_count)/len(test_set))
    print("average error rate is %f" % (total_error_count/(10*10)))


if __name__ == '__main__':
    spam_test()

The procedure used here, randomly selecting one part of the data as the training set while the remaining part serves as the test set, is called hold-out cross-validation.
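
The set_of_word2vec function above only records whether a word occurs at all (a set-of-words model). A common refinement is the bag-of-words model, which counts how many times each word occurs in a document. A minimal sketch of that variant, usable as a drop-in replacement for set_of_word2vec (the name bag_of_word2vec is mine):

def bag_of_word2vec(vocab_list, input_set):
    # same layout as set_of_word2vec, but accumulate counts instead of setting a flag
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] += 1  # += 1 rather than = 1
    return return_vec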
