Building and testing an LDA model of news articles with gensim.models.LdaModel, with code and text data

Original

锅巴QAQ

2020-06-24 21:23

References

https://github.com/DengYangyong/LDA_gensim

Text data

News data: news_train.txt
Preprocessed text: news_train_jieba.txt
Stopwords: news_stopwords.txt
Test data: news_test.txt
The data is in the data directory of the GitHub repository referenced above.
Link: https://pan.baidu.com/s/1emmCSJXeGSkOJhKvkguLmg , extraction code: c9vw

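Building the model

The corpus has 2262 news articles in five categories (sports, entertainment, home furnishing, education, and real estate). After single-character tokens are dropped, it yields 55759 feature words.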
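To make the feature count concrete: each distinct token gets an integer ID, and each article becomes a list of (ID, count) pairs. The sketch below reproduces that idea in plain Python on two made-up documents. `build_vocab` and `to_bow` are illustrative names, not gensim functions, and the `min_len` filter only mirrors the single-character filtering noted in the code comments; it is not necessarily how the author applied it.

```python
from collections import Counter

def build_vocab(docs, min_len=2):
    """Assign an integer ID to every distinct token of at least min_len characters."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if len(token) >= min_len and token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Toy documents, made up for illustration
docs = [["湖人", "比賽", "的", "比賽"], ["電影", "票房", "比賽"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
print(len(vocab))  # number of feature words
print(bows)
```

The length of the vocabulary plays the role of the feature count, and each entry of `bows` has the same shape as an element of `corpus` in the listing.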

lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=num_topics, iterations=400, chunksize=2262, passes=40)
topic_list = lda.print_topics(5)
This gives the word distributions of the 5 topics:

(0, '0.012*"企業" + 0.012*"產品" + 0.010*"品牌" + 0.010*"市場" + 0.009*"傢俱" +
0.009*"消費者" + 0.008*"家居" + 0.008*"櫥櫃" + 0.008*"行業" + 0.007*"中國"')

(1, '0.009*"房地產" + 0.007*"市場" + 0.006*"中國" + 0.006*"考試" + 0.006*"四六級" +
0.005*"信息" + 0.005*"項目" + 0.005*"平米" + 0.005*"房價" + 0.004*"戶型"')

(2, '0.013*"比賽" + 0.008*"球隊" + 0.007*"熱火" + 0.006*"球員" + 0.005*"時間" +
0.005*"湖人" + 0.005*"防守" + 0.005*"季後賽" + 0.005*"新浪" + 0.005*"詹姆斯"')

(3, '0.012*"電影" + 0.008*"影片" + 0.006*"導演" + 0.005*"娛樂" + 0.004*"新浪" +
0.004*"上映" + 0.004*"最佳" + 0.004*"奧斯卡" + 0.004*"票房" + 0.004*"觀衆"')

(4, '0.009*"裝修" + 0.005*"活動" + 0.004*"中國" + 0.004*"公司" + 0.004*"紅星" +
0.003*"設計" + 0.003*"業主" + 0.003*"設計師" + 0.003*"美凱龍" + 0.003*"產品"')

The resulting average topic coherence is -2.1734:
2020-03-02 11:35:23,557 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2020-03-02 11:35:23,712 : INFO : CorpusAccumulator accumulated stats from 2000 documents
Average topic coherence: -2.1734.
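
The CorpusAccumulator log lines suggest the score comes from a document co-occurrence (UMass-style) coherence measure, which is the default for gensim's `top_topics`. As a rough illustration of what such a score measures, the sketch below computes a simplified UMass coherence in plain Python. The function name, the toy corpus, the +1 smoothing, and the averaging over pairs are assumptions for illustration; gensim's implementation differs in details, so these numbers are not comparable with -2.1734.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Simplified UMass coherence: mean of log((D(w_hi, w_lo) + 1) / D(w_hi)) over word pairs.

    top_words is ordered from most to least probable; D counts the documents
    that contain the given word(s).
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    # combinations() keeps list order, so w_hi is always the higher-ranked word
    for w_hi, w_lo in combinations(top_words, 2):
        d_hi = doc_freq(w_hi)
        if d_hi == 0:
            continue  # the word never occurs in the corpus; skip the pair
        scores.append(math.log((doc_freq(w_hi, w_lo) + 1) / d_hi))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy corpus, made up for illustration
docs = [["比賽", "球隊", "球員"], ["比賽", "球隊"], ["電影", "導演"], ["電影", "比賽"]]
coherent = umass_coherence(["比賽", "球隊"], docs)  # words that co-occur
mixed = umass_coherence(["比賽", "導演"], docs)     # words that never co-occur
print(coherent, mixed)
```

Scores closer to zero mean the top words of a topic tend to appear in the same documents; more negative scores mean they rarely do.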

Topic labels assigned from the top words:

Home furnishing: 企業,產品,品牌,市場,傢俱,消費者,家居,櫥櫃,行業,中國
Education: 房地產,市場,中國,考試,四六級,信息,項目,平米,房價,戶型
Sports: 比賽,球隊,熱火,球員,時間,湖人,防守,季後賽,新浪,詹姆斯
Entertainment: 電影,影片,導演,娛樂,新浪,上映,最佳,奧斯卡,票房,觀衆
Real estate: 裝修,活動,中國,公司,紅星,設計,業主,設計師,美凱龍,產品

Testing on news data

Three test articles were used, one each on sports, entertainment, and technology:

Test results:
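
For each test article the model returns a list of (topic ID, probability) pairs, and the reported topic is the pair with the highest probability. A minimal sketch of that selection step follows, using a made-up distribution rather than the author's actual output; `dominant_topic` is an illustrative helper, not a gensim function.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not doc_topics:
        raise ValueError("empty topic distribution")
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up distribution in the (topic ID, probability) format
example = [(0, 0.03), (2, 0.91), (3, 0.06)]
topic_id, prob = dominant_topic(example)
print(topic_id, prob)
```

Because gensim can leave low-probability topics out of the list, the topic ID should be read from the pair itself rather than inferred from its position.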

Code

import jieba,os,re
from gensim import corpora, models, similarities
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

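# Stopword file
stopwords_path = "G:/1研究生/news_stopwords.txt"

# Raw news file: five categories (sports, entertainment, home furnishing, education, real estate)
filename = "G:/python code/news_train.txt"
# News after word segmentation, tokens separated by spaces
outfilename = "G:/python code/news_train_jieba.txt"

# Test news file
file_test = "G:/python code/news_test.txt"



# Build the bag-of-words corpus.
# train is a list of token lists: [['黃蜂', '湖人', '首發', '科比', '帶傷', '戰',...],[...],...]
# (the step that builds train is not shown in this listing)
dictionary = corpora.Dictionary(train)

# corpus[0]: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1),...]
# corpus is each article converted to IDs: every element is
# (dictionary ID of a word, its count in that article)
corpus = [dictionary.doc2bow(text) for text in train]
print('Number of features: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
# Before dropping single-character tokens: 57665 features; after: 55759

"""
Train the LDA model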
num_topics = 5
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make a index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

"""

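num_topics = 5

# Pass the Dictionary itself as id2word: dictionary.id2token is filled lazily
# and is still empty here, which would leave the model with no terms.
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                      iterations=400, chunksize=2262, passes=40)
topic_list = lda.print_topics(5)
print("Word distributions of the 5 topics:\n")
for topic in topic_list:
    print(topic)

"""
企業,產品,品牌,市場,傢俱,消費者,家居,櫥櫃,行業,中國
房地產,市場,中國,考試,四六級,信息,項目,平米,房價,戶型
比賽,球隊,熱火,球員,時間,湖人,防守,季後賽,新浪,詹姆斯
電影,影片,導演,娛樂,新浪,上映,最佳,奧斯卡,票房,觀衆
裝修,活動,中國,公司,紅星,設計,業主,設計師,美凱龍,產品
"""

# Extract the topics of the test news.
# Three test articles: sports, entertainment, and technology.
# seg_depart is the segmentation helper and returns space-separated tokens;
# its definition is not shown in this listing.
test = []
count = 0
# Convert each line into the same input format as the training data
with open(file_test, 'r', encoding='UTF-8') as news_test:
    for line in news_test:
        line = line.split('\t')[1]
        line = re.sub(r'[^\u4e00-\u9fa5]+', '', line)
        line_seg = seg_depart(line.strip())
        line_seg = [word.strip() for word in line_seg.split(' ')]
        test.append(line_seg)
        count += 1
print("Read " + file_test + ": " + str(count))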

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
font = FontProperties(fname=r"C:\Windows\Fonts\simhei.ttf", size=14)  


# Convert the test news to bag-of-words IDs
corpus_test = [dictionary.doc2bow(text) for text in test]
# Topic distribution of each test article
topics_test = lda.get_document_topics(corpus_test)

labels = ['Sports', 'Entertainment', 'Technology']
for i in range(3):
    print('Topic distribution of the ' + labels[i] + ' article:\n')
    # List of (topic ID, probability) pairs
    print(topics_test[i], '\n')

    id_list = []
    pro_list = []
    for topic_id, prob in topics_test[i]:
        id_list.append(topic_id + 1)  # 1-based topic number for the plot
        pro_list.append(prob)

    plt.figure(figsize=(10, 4), dpi=80)
    # Recent matplotlib versions require the lowercase keyword fontproperties
    plt.title(labels[i], fontproperties=font)
    plt.bar(id_list, pro_list)

    plt.xlabel('Topic number', fontproperties=font)
    plt.ylabel('Relevance', fontproperties=font)

    plt.xlim(0.5, num_topics + 0.5)
    plt.ylim(0.0, max(pro_list) + 0.1)
    plt.show()

    # Topic with the highest probability for this article
    topic_id = id_list[pro_list.index(max(pro_list))]
    print("Topic " + str(topic_id) + ": relevance " + str(max(pro_list)))
    # topic_list holds (topic ID, word distribution string) tuples
    print(topic_list[topic_id - 1][1] + "\n\n")
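
The listing uses `train` and `seg_depart` without defining them. A minimal sketch of what that preprocessing step might look like is given below, assuming each line of the news file is `label<TAB>text` (as the test-file loop implies) and that segmentation is done with `jieba.lcut`. The function bodies, the `stopwords` and `cut` parameters, and `load_corpus` are assumptions for illustration, not the author's code.

```python
import re

def load_stopwords(path):
    """Read one stopword per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def seg_depart(sentence, stopwords=frozenset(), cut=None):
    """Segment a sentence and return the kept tokens joined by spaces.

    Drops stopwords and single-character tokens. `cut` is any callable that
    maps a string to a token list; it defaults to jieba.lcut, imported lazily.
    """
    if cut is None:
        import jieba
        cut = jieba.lcut
    tokens = (t.strip() for t in cut(sentence))
    return " ".join(t for t in tokens if len(t) > 1 and t not in stopwords)

def load_corpus(path, stopwords=frozenset(), cut=None):
    """Turn a label<TAB>text file into a list of token lists."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t", 1)
            if len(parts) < 2:
                continue  # skip lines without a tab-separated text field
            # Keep Chinese characters only, as the test-file loop does
            text = re.sub(r"[^\u4e00-\u9fa5]+", "", parts[1])
            docs.append(seg_depart(text, stopwords, cut).split())
    return docs
```

With jieba installed, `train = load_corpus(filename, load_stopwords(stopwords_path))` would produce the list of token lists that `corpora.Dictionary` expects. The one-argument call `seg_depart(line.strip())` in the listing also works with this signature, but applies no stopword filtering unless a stopword set is passed.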
