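Naive Bayes

1. Concepts

Prior probability: the probability of an event estimated from experience, before any evidence is observed.
Posterior probability: the probability of a cause, inferred after the outcome has already been observed.
Conditional probability: p(A|B), the probability that event A occurs given that event B has already occurred.
Likelihood function: a function of the model parameters that measures how well a given set of parameter values explains the observed data.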

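2. Understanding the Naive Bayes formula

Formula:

$$p(B_i \mid A) = \frac{p(B_i)\,p(A \mid B_i)}{\sum_{j=1}^{n} p(B_j)\,p(A \mid B_j)}$$

How to read the formula:
The left-hand side is the probability that a sample belongs to class Bi, inferred from the fact that feature A was observed.
The numerator on the right-hand side is the probability of class Bi among all classes, multiplied by the probability that A appears within class Bi.
The denominator on the right-hand side is the sum of that same product over every class, which is the total probability of A.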
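
To make the formula concrete, here is a minimal sketch that plugs made-up numbers into it. The class names, the priors, and the likelihoods are illustrative assumptions, not data from this article.

```python
# Hypothetical priors p(Bi) for two classes.
priors = {"sports": 0.4, "campus": 0.6}
# Hypothetical likelihoods p(A|Bi), where A = "the word 'match' appears".
likelihoods = {"sports": 0.5, "campus": 0.1}


def posterior(priors, likelihoods):
    """Return p(Bi|A) for every class Bi using Bayes' formula."""
    # Denominator: total probability of A, summed over all classes.
    evidence = sum(priors[c] * likelihoods[c] for c in priors)
    # Numerator for each class: p(Bi) * p(A|Bi).
    return {c: priors[c] * likelihoods[c] / evidence for c in priors}


post = posterior(priors, likelihoods)
print(post)
```

Although "campus" has the larger prior here, observing A shifts most of the probability to "sports", because A is much more likely inside that class.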

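3. Workflow of a Naive Bayes classifier

Preparation stage: determine the feature attributes and obtain the training samples.
Compute the probability of each class: p(Ci)
Compute the probability of each feature within each class, p(Aj|Ci), and combine it with the class probability: p(Aj|Ci)p(Ci)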
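
The steps above can be sketched from scratch in a few lines. This is only an illustration under stated assumptions: the features are categorical, the tiny data set is invented, and add-alpha smoothing is used so that unseen feature values do not produce a zero probability. The function names `train_nb` and `classify` are hypothetical.

```python
from collections import Counter, defaultdict


def train_nb(samples, labels, alpha=1.0):
    """Estimate p(Ci) and a smoothed p(Aj|Ci) from categorical samples."""
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    # feature_counts[c][j][v] = number of class-c samples whose feature j equals v
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen for each feature index
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            feature_counts[c][j][v] += 1
            values[j].add(v)

    def cond_prob(c, j, v):
        # Smoothed estimate of p(Aj = v | Ci = c)
        return (feature_counts[c][j][v] + alpha) / (
            class_counts[c] + alpha * len(values[j])
        )

    return priors, cond_prob


def classify(x, priors, cond_prob):
    """Return the class with the largest p(Ci) * product of p(Aj|Ci)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(x):
            score *= cond_prob(c, j, v)
        scores[c] = score
    return max(scores, key=scores.get)


# Invented training set: (contains "match"?, contains "exam"?)
samples = [("yes", "no"), ("yes", "yes"), ("no", "yes"), ("no", "yes")]
labels = ["sports", "sports", "campus", "campus"]
priors, cond_prob = train_nb(samples, labels)
print(classify(("yes", "no"), priors, cond_prob))
print(classify(("no", "yes"), priors, cond_prob))
```

For real text data the product of many small probabilities underflows quickly, so practical implementations usually sum log-probabilities instead.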

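4. Classifying news articles in Python (the sklearn machine learning package)

4.1 Types of classifier

Gaussian Naive Bayes: the features are continuous variables that follow a Gaussian distribution, e.g. a person's height or weight.
Multinomial Naive Bayes: the features follow a multinomial distribution, e.g. word frequencies (TF-IDF weights are also used in practice).
Bernoulli Naive Bayes: the features follow a 0/1 (Bernoulli) distribution, e.g. whether a word appears.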
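
As a rough illustration of how the three variants map to feature types, the sketch below fits each sklearn classifier on a tiny invented data set. All numbers are made up for the example; only the class names `GaussianNB`, `MultinomialNB`, and `BernoulliNB` come from sklearn.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # two invented classes

# Continuous features (height in cm, weight in kg) -> GaussianNB
X_cont = np.array([[180.0, 80.0], [175.0, 75.0], [160.0, 50.0], [155.0, 48.0]])
gauss_pred = GaussianNB().fit(X_cont, y).predict([[178.0, 77.0]])

# Non-negative word counts -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
multi_pred = MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]])

# Binary "word present or not" features -> BernoulliNB
X_bin = (X_counts > 0).astype(int)
bern_pred = BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]])

print(gauss_pred, multi_pred, bern_pred)
```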

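4.2 Concepts

TF (Term Frequency): the number of times a word appears in a document. The assumption is that a word's importance is proportional to how often it appears.
IDF (Inverse Document Frequency): how well a word discriminates between documents. The assumption is that the fewer documents a word appears in, the better it separates those documents from the rest.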

4.3 How to compute TF-IDF

Formula:
TF-IDF = TF × IDF
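
A hand-rolled version helps show what the product means. The sketch below assumes one common textbook definition, TF = (occurrences of the word in the document) / (words in the document) and IDF = log(N / (df + 1)), where N is the number of documents and df is the number of documents containing the word. This is an assumption for illustration: sklearn's `TfidfVectorizer` uses a different smoothed IDF and normalizes each row by default, so its numbers will not match these.

```python
import math

# Invented, already tokenized documents.
docs = [
    ["ball", "game", "ball"],
    ["exam", "game"],
    ["exam", "book", "book"],
]


def tf(word, doc):
    """Share of the document's words that are `word`."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """log(N / (df + 1)): rarer words get a larger value."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (df + 1))


def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


for word in ("ball", "game"):
    print(word, tf_idf(word, docs[0], docs))
```

"ball" appears in only one document and gets a positive weight, while "game" appears in two of the three documents and, under this definition, ends up with a weight of zero.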

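4.4 Computing TF-IDF with sklearn

from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer, load the stop words, and ignore words that appear in more than half of the documents
tfidf_vec = TfidfVectorizer(stop_words=stop_words, max_df=0.5)
# Fit on the documents to get the TF-IDF value of every word in every text (the feature space used by the classifier)
features = tfidf_vec.fit_transform(documents)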

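Note: attributes available on tfidf_vec after fitting
vocabulary_ the vocabulary (a mapping from each term to its column index)
idf_ the IDF value of each term
stop_words_ the terms that were dropped during fitting (for example because they exceeded max_df)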
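
To see what these attributes hold, here is a small sketch on an invented three-sentence corpus. The documents and the `max_df=0.9` threshold are assumptions chosen so that one word ("the") is filtered out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: "the" appears in every document.
documents = [
    "the ball game",
    "the exam game",
    "the exam book",
]

# Drop words that appear in more than 90% of the documents.
tfidf_vec = TfidfVectorizer(max_df=0.9)
features = tfidf_vec.fit_transform(documents)

# Terms ordered by their column index in the feature matrix.
terms = sorted(tfidf_vec.vocabulary_, key=tfidf_vec.vocabulary_.get)
idf_by_term = dict(zip(terms, tfidf_vec.idf_))

print(features.shape)  # (number of documents, number of kept terms)
print(terms)
print(idf_by_term)
```

Words found in fewer documents ("ball", "book") receive a larger IDF than words shared by several documents ("exam", "game").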

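4.5 How to classify documents

4.5.1 Processing flow

Tokenize the documents, load the stop word list, compute the word weights, train the classifier, predict on the test set, then compute the accuracy.

4.5.2 Tokenizing the documents

For English documents, use the NLTK package

import nltk
word_list = nltk.word_tokenize(text)
nltk.pos_tag(word_list)

For Chinese documents, use the jieba package

import jieba
word_list = jieba.cut(text)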

4.5.3 Loading the stop word list

# In Python 3 the file is read as text, so no decode() call is needed
stop_words = [line.strip() for line in open("stop_wordss.txt", encoding="utf-8")]

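4.5.4 Computing the word weights

# Create a TfidfVectorizer, load the stop words, and ignore words that appear in more than half of the documents
tfidf_vec = TfidfVectorizer(stop_words=stop_words, max_df=0.5)
# Fit on the documents to get the TF-IDF matrix
features = tfidf_vec.fit_transform(documents)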

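4.5.5 Building the Naive Bayes classifier

Pass the feature matrix of the training set (features) and the corresponding training labels (labels) to the Bayes classifier.
Use the multinomial classifier. alpha is the smoothing parameter: 0 < alpha < 1 gives Lidstone smoothing and alpha = 1 gives Laplace smoothing. A smaller alpha applies less smoothing, so the estimated probabilities stay closer to the frequencies observed in the training set. Fitting is a single counting pass rather than an iterative process, and whether a smaller alpha improves accuracy depends on the data.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.001).fit(features, labels)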
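
The effect of alpha is easier to see with numbers. The sketch below computes the smoothed estimate (count + alpha) / (total + alpha * vocabulary size) by hand for a hypothetical class; the counts and the helper name `smoothed_prob` are invented for illustration.

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Additive-smoothing estimate of p(word | class).

    alpha = 1 is Laplace smoothing; 0 < alpha < 1 is Lidstone smoothing.
    """
    return (count + alpha) / (total + alpha * vocab_size)


# Hypothetical class: 10 word occurrences in total, vocabulary of 4 words.
# One word was seen 5 times, another was never seen in this class.
for alpha in (1.0, 0.1, 0.001):
    seen = smoothed_prob(5, 10, 4, alpha)
    unseen = smoothed_prob(0, 10, 4, alpha)
    print(alpha, round(seen, 5), round(unseen, 5))
```

As alpha shrinks, the estimate for the seen word approaches its raw frequency of 0.5, and the probability reserved for the unseen word approaches zero without ever reaching it.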

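4.5.6 Predicting with the trained classifier

Build the feature matrix of the test set

# Reuse the training vocabulary so the test features have the same columns as the training features
# Note: fit_transform re-estimates the IDF weights from the test documents
test_tfidf_vec = TfidfVectorizer(stop_words=stop_words, max_df=0.5, vocabulary=train_vocabulary)
test_features = test_tfidf_vec.fit_transform(test_documents)

Predict with the trained classifier

predict_labels = clf.predict(test_features)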
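
An alternative to building a second vectorizer is to keep the one fitted on the training set and call its `transform` method on the test documents. That reuses both the training vocabulary and the training IDF weights. The sketch below is a self-contained illustration with invented English documents and labels, not the article's data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented, already tokenized (space-separated) documents.
train_documents = [
    "ball game team",
    "team match ball",
    "exam book class",
    "class exam teacher",
]
train_labels = [1, 1, 4, 4]  # hypothetical labels: 1 = sports, 4 = campus
test_documents = ["ball match", "teacher book"]

tfidf_vec = TfidfVectorizer()
# fit_transform learns the vocabulary and the IDF weights from the training set.
train_features = tfidf_vec.fit_transform(train_documents)
clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)

# transform only applies what was learned; nothing is refitted on the test set.
test_features = tfidf_vec.transform(test_documents)
predict_labels = clf.predict(test_features)
print(predict_labels)
```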

4.5.7 Computing the accuracy

from sklearn import metrics
accuracy = metrics.accuracy_score(test_labels, predict_labels)
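
Accuracy here is simply the fraction of test documents whose predicted label equals the true label. The hand-written version below shows that definition; the label lists are invented for the example and the helper name `accuracy_by_hand` is hypothetical.

```python
def accuracy_by_hand(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Invented labels: 4 of the 6 predictions are correct.
example_true = [1, 1, 2, 3, 4, 4]
example_pred = [1, 2, 2, 3, 4, 1]
print(accuracy_by_hand(example_true, example_pred))
```

Accuracy alone can be misleading when the classes are very unbalanced, so it is worth checking how many documents each class has in the test set.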

5. Complete code

# encoding=utf-8
import io
import os

import jieba
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Read the stop words as text (not bytes) so they can match the tokenized words
with io.open("data/stop/stopword.txt", encoding="utf-8") as f:
    stop_words = [line.strip() for line in f]
# The keys are the names of the class directories: sports, women, literature, campus
label_dic = {"體育": 1, "女性": 2, "文學": 3, "校園": 4}


# Load the documents under a directory; return the tokenized documents and their labels
def load_data(data_path):
    labels = []
    document = []
    for root, dirs, files in os.walk(data_path):
        for file in files:
            # The name of the containing directory is the class label
            label = os.path.basename(root)
            labels.append(label_dic[label])
            filename = os.path.join(root, file)
            with open(filename, "rb") as f:
                content = f.read()
                # Join the tokens with spaces so that TfidfVectorizer can split them again
                document.append(' '.join(jieba.cut(content)))
    return document, labels


# Take the tokenized documents and their labels; return the vocabulary used to tell documents apart, and the classifier
def train(documents, labels):
    # Create a TfidfVectorizer, load the stop words, and ignore words that appear in more than half of the documents
    tfidf_vec = TfidfVectorizer(stop_words=stop_words, max_df=0.5)
    # Fit on the documents to get the TF-IDF value of every word in every text (the feature space used by the classifier)
    features = tfidf_vec.fit_transform(documents)
    train_vocabulary = tfidf_vec.vocabulary_
    clf = MultinomialNB(alpha=0.001).fit(features, labels)
    return train_vocabulary, clf


# Take the training vocabulary, the classifier, and the documents to predict
def predict(train_vocabulary, clf, document):
    # A fixed vocabulary keeps the feature columns aligned with the training features;
    # note that fit_transform re-estimates the IDF weights from these documents
    test_tfidf = TfidfVectorizer(stop_words=stop_words, max_df=0.5, vocabulary=train_vocabulary)
    test_features = test_tfidf.fit_transform(document)
    predict_labels = clf.predict(test_features)
    return predict_labels


train_document, train_labels = load_data("data/train")
test_document, test_labels = load_data("data/test")
train_vocabulary, clf = train(train_document, train_labels)
predict_labels = predict(train_vocabulary, clf, test_document)
print(predict_labels)
print(test_labels)
accuracy = metrics.accuracy_score(test_labels, predict_labels)
print(accuracy)

Note: the complete code, including the data, has been uploaded to GitHub: https://github.com/huzai9527/data_analysis_algrithm

