Preparing the data: tokenizing text
Python's built-in split() method can tokenize a string, but it splits on whitespace only, so punctuation stays attached to the tokens and the result is sometimes not accurate:
In [7]: sentence = 'This is the best book on Python or M.L. I have ever laid eyes upon.'
In [8]: sentence.split()
Out[8]:
['This',
'is',
'the',
'best',
'book',
'on',
'Python',
'or',
'M.L.',
'I',
'have',
'ever',
'laid',
'eyes',
'upon.']
Splitting with a regular expression is more accurate: \W+ matches any run of one or more non-word characters, so punctuation such as the trailing period in 'upon.' is treated as a separator too. (Avoid \W*, which can match the empty string; it triggers a FutureWarning on older Pythons and splits between every character on Python 3.7+.)
In [1]: import re
In [2]: regular = re.compile(r'\W+')
In [3]: sentence = 'This is the best book on Python or M.L. I have ever laid eyes upon.'
In [4]: list_of_tokens = regular.split(sentence)
In [5]: list_of_tokens
Out[5]:
['This',
'is',
'the',
'best',
'book',
'on',
'Python',
'or',
'M',
'L',
'I',
'have',
'ever',
'laid',
'eyes',
'upon',
'']
We now have a list of tokens, but the empty string at the end needs to be filtered out:
In [6]: [token for token in list_of_tokens if len(token) > 0]
Out[6]:
['This',
'is',
'the',
'best',
'book',
'on',
'Python',
'or',
'M',
'L',
'I',
'have',
'ever',
'laid',
'eyes',
'upon']
Now normalize the case by lowercasing every token:
In [7]: [token.lower() for token in list_of_tokens if len(token) > 0]
Out[7]:
['this',
'is',
'the',
'best',
'book',
'on',
'python',
'or',
'm',
'l',
'i',
'have',
'ever',
'laid',
'eyes',
'upon']
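The three steps above (regex split, drop empty tokens, lowercase) can be folded into one helper. A minimal sketch (the name tokenize is arbitrary; this is essentially the parse_text function used in the full listing below):

import re

def tokenize(text):
    # Split on runs of non-word characters, drop empties, lowercase.
    return [tok.lower() for tok in re.split(r'\W+', text) if tok]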
Testing the algorithm: cross-validation with naive Bayes
import re

import numpy as np


def create_vocab_list(data_set):
    """Build the vocabulary: the set of unique words across all documents."""
    vocab_set = set()
    for document in data_set:
        vocab_set = vocab_set | set(document)
    return list(vocab_set)


def set_of_word2vec(vocab_list, input_set):
    """Convert a document into a 0/1 vector over the vocabulary."""
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] = 1
        else:
            print("the word: %s is not in the vocabulary" % word)
    return return_vec


def train_naive_bayes(train_matrix, train_category):
    """Estimate the class prior and the per-class word probabilities."""
    train_docs_num = len(train_matrix)
    words_num = len(train_matrix[0])
    p_abusive = sum(train_category) / float(train_docs_num)
    # Initialize counts to 1 and denominators to 2 (Laplace smoothing),
    # so an unseen word cannot zero out a whole class probability.
    p0_num = np.ones(words_num)
    p1_num = np.ones(words_num)
    p0_sum = 2.0
    p1_sum = 2.0
    for i in range(train_docs_num):
        if train_category[i] == 1:
            p1_num += train_matrix[i]
            p1_sum += sum(train_matrix[i])
        else:
            p0_num += train_matrix[i]
            p0_sum += sum(train_matrix[i])
    # Work in log space to avoid floating-point underflow.
    p0_vec = np.log(p0_num / p0_sum)
    p1_vec = np.log(p1_num / p1_sum)
    return p0_vec, p1_vec, p_abusive


def classify_naive_bayes(vec2classify, p0_vec, p1_vec, p_class1):
    """Compare the two log posteriors and return the more probable class."""
    p1 = np.sum(vec2classify * p1_vec) + np.log(p_class1)
    p0 = np.sum(vec2classify * p0_vec) + np.log(1 - p_class1)
    if p1 > p0:
        return 1
    else:
        return 0


def parse_text(big_string):
    """Parse a big string into a list of lowercase tokens."""
    token_list = re.split(r'\W+', big_string)
    return [token.lower() for token in token_list if len(token) > 0]


def spam_test():
    doc_list = []
    class_list = []
    full_list = []  # accumulated here but not used below
    # Load and parse the text files: 25 spam and 25 ham emails.
    for i in range(1, 26):
        word_list = parse_text(open('email/spam/%d.txt' % i).read())
        doc_list.append(word_list)
        full_list.extend(word_list)
        class_list.append(1)
        word_list = parse_text(open('email/ham/%d.txt' % i).read())
        doc_list.append(word_list)
        full_list.extend(word_list)
        class_list.append(0)
    vocab_list = create_vocab_list(doc_list)
    total_error_count = 0
    for step in range(10):
        print("=====> step %d <=====" % (step + 1))
        train_set = list(range(50))
        test_set = []
        # Randomly hold out 10 documents as the test set.
        for i in range(10):
            rand_index = int(np.random.uniform(0, len(train_set)))
            test_set.append(train_set[rand_index])
            del train_set[rand_index]
        train_mat = []
        train_class = []
        for doc_index in train_set:
            train_mat.append(set_of_word2vec(vocab_list, doc_list[doc_index]))
            train_class.append(class_list[doc_index])
        p0_v, p1_v, p_spam = train_naive_bayes(train_mat, train_class)
        error_count = 0
        for doc_index in test_set:
            word_vector = set_of_word2vec(vocab_list, doc_list[doc_index])
            if classify_naive_bayes(word_vector, p0_v, p1_v, p_spam) != class_list[doc_index]:
                error_count += 1
                total_error_count += 1
                print("[WARNING] The real class of ", doc_list[doc_index], "is", class_list[doc_index])
        print("Current error rate is:", float(error_count) / len(test_set))
    print("average error rate is %f" % (total_error_count / (10 * 10)))


if __name__ == '__main__':
    spam_test()
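Note why train_naive_bayes takes logs: classify_naive_bayes adds log probabilities instead of multiplying many small raw probabilities, which would underflow to zero. A quick sketch of the effect (the numbers are made up for illustration):

import numpy as np

probs = np.full(500, 0.01)    # 500 word probabilities of 0.01 each
print(np.prod(probs))         # 0.0 -- the raw product underflows
print(np.sum(np.log(probs)))  # about -2302.6 -- the log sum stays representable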
This process, where a random portion of the data is selected as the training set and the remainder is held back as the test set, is called hold-out cross-validation.
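For reference, the hold-out logic buried inside spam_test can be isolated into a small helper. A minimal sketch (holdout_split is a hypothetical name, not part of the listing above):

import random

def holdout_split(indices, test_size):
    """Randomly move test_size items from the pool into a test set."""
    pool = list(indices)
    test = [pool.pop(random.randrange(len(pool))) for _ in range(test_size)]
    return pool, test

train_idx, test_idx = holdout_split(range(50), 10)  # 40 train, 10 test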