Topic Coherence

Topic Modeling with Gensim (Python)
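Topic modeling is a technique for extracting the hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm with an excellent implementation in Python's Gensim package. The challenge, however, is how to extract good-quality topics that are clear, segregated, and meaningful. This depends heavily on the quality of text preprocessing and the strategy for finding the optimal number of topics. This tutorial attempts to tackle both of these problems.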
1. Introduction
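One of the primary applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of large text include feeds from social media; customer reviews of hotels, movies, and so on; user feedback; news stories; and e-mails of customer complaints.
Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators, and political campaigns. It is also really hard to manually read through such large volumes and compile the topics.
Thus an automated algorithm is required that can read through the text documents and automatically output the topics discussed.
In this tutorial, we will take a real example, the '20 Newsgroups' dataset, and use LDA to extract the naturally discussed topics.
I will be using the Latent Dirichlet Allocation (LDA) from the Gensim package along with Mallet's implementation (via Gensim). Mallet has an efficient implementation of LDA. It is known to run faster and give better topic segregation.
We will also extract the volume and percentage contribution of each topic to get an idea of how important a topic is.
Let's begin!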
2. Prerequisites – Download nltk stopwords and spacy model
3. Import Packages
4. What does LDA do?
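LDA's approach to topic modeling is to treat each document as a collection of topics in a certain proportion, and each topic as a collection of keywords, again in a certain proportion.
Once you provide the algorithm with the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution.
When I say topic, what actually is it and how is it represented?
A topic is nothing but a collection of dominant keywords that are typical representatives. Just by looking at the keywords, you can identify what the topic is about.
The following are key factors in obtaining well-segregated topics:
The quality of text processing.
The variety of topics the text talks about.
The choice of topic modeling algorithm.
The number of topics fed to the algorithm.
The algorithm's tuning parameters.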
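
To make the two kinds of proportions concrete, here is a minimal sketch in plain Python. The topic names, words, and probabilities are made up for illustration and are not output from any model. It shows how a document's topic proportions and each topic's keyword proportions combine into the probability of seeing a word in that document.

```python
# Hypothetical topic-keyword distributions (each topic's weights sum to 1).
topic_word = {
    "autos": {"car": 0.5, "engine": 0.3, "game": 0.0, "team": 0.0, "power": 0.2},
    "sports": {"car": 0.0, "engine": 0.0, "game": 0.6, "team": 0.4, "power": 0.0},
}

# Hypothetical document-topic distribution for a single document (sums to 1).
doc_topic = {"autos": 0.75, "sports": 0.25}


def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        share * topic_word[topic].get(word, 0.0)
        for topic, share in doc_topic.items()
    )


vocabulary = sorted(topic_word["autos"])
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}

# Print the words from most to least likely in this document.
for word, prob in sorted(doc_word.items(), key=lambda kv: -kv[1]):
    print(f"{word:>7}: {prob:.3f}")
```

LDA works in the opposite direction: it only sees the words in each document and has to infer both sets of proportions.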
5. Prepare Stopwords
6. Import Newsgroups Data
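We will be using the 20-Newsgroups dataset for this exercise. This version of the dataset contains about 11k newsgroup posts from 20 different topics. It is available as newsgroups.json.
It is imported using pandas.read_json, and the resulting dataset has 3 columns, as shown.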
7. Remove emails and newline characters
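
The code for this step is not shown here, so the following is only a sketch of one way to do it with the standard re module. The function name clean_text and the sample string are my own; the idea is to drop anything that looks like an e-mail address, collapse newlines and repeated spaces, and remove stray single quotes.

```python
import re


def clean_text(text):
    """Strip e-mail addresses, collapse whitespace, and drop single quotes."""
    text = re.sub(r"\S*@\S*\s?", "", text)  # remove any token containing '@'
    text = re.sub(r"\s+", " ", text)        # newlines, tabs, space runs -> one space
    text = re.sub(r"'", "", text)           # remove single quotes
    return text.strip()


sample = "From: someone@example.com\nSubject: What's this car?\n\nIt has  two doors."
print(clean_text(sample))
```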
8. Tokenize words and Clean-up text
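
Gensim's simple_preprocess() is a common choice for this step. The sketch below is a standard-library stand-in, not that function's implementation: it lowercases each sentence, keeps only alphabetic runs, and drops very short or very long tokens. The name tokenize and the length limits are assumptions of mine.

```python
import re


def tokenize(sentence, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs only, and drop very short or long tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


docs = ["Where's my thing? It is a 2-door sports car!", "Looked like a 60s model."]
data_words = [tokenize(doc) for doc in docs]
print(data_words)
```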
9. Creating Bigram and Trigram Models
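
Bigrams are two words that frequently occur together in the documents. As a rough illustration of the idea only, the sketch below counts adjacent word pairs and joins the frequent ones with an underscore. It uses a plain frequency threshold, which is simpler than the scoring a phrase-detection library applies, and the function names and sample data are hypothetical.

```python
from collections import Counter


def find_frequent_bigrams(tokenized_docs, min_count=2):
    """Return the set of adjacent word pairs seen at least min_count times."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Join each frequent pair into one token, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


docs = [
    ["front", "bumper", "was", "separate"],
    ["the", "front", "bumper", "is", "missing"],
    ["bumper", "sticker"],
]
bigrams = find_frequent_bigrams(docs, min_count=2)
merged_docs = [merge_bigrams(d, bigrams) for d in docs]
print(merged_docs)
```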
10. Remove Stopwords, Make Bigrams and Lemmatize
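
Of the three operations in this step, stop word removal is the easiest to show without extra libraries. The sketch below uses a tiny hand-written stop word set, which is purely hypothetical; a real run would use a much fuller list. Bigram merging and lemmatization are not covered here.

```python
# Hypothetical, hand-written stop word set for illustration only.
STOP_WORDS = {"the", "is", "a", "it", "of", "and", "from", "subject"}


def remove_stopwords(tokenized_docs, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word set."""
    return [[tok for tok in doc if tok not in stop_words] for doc in tokenized_docs]


docs = [["from", "the", "front", "bumper"], ["it", "is", "a", "sports", "car"]]
data_words_nostops = remove_stopwords(docs)
print(data_words_nostops)
```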
11. Create the Dictionary and Corpus needed for Topic Modeling
The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let's create them.
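
In Gensim these are typically built with corpora.Dictionary and its doc2bow() method. To show what the two structures contain, here is a plain-Python sketch of the same idea: a mapping from each unique word to an integer id, and each document as a list of (word_id, count) pairs. The helper names are mine, and assigning ids in order of first appearance is an assumption that may differ from Gensim's numbering.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(tokens, token2id):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


texts = [["car", "engine", "car"], ["engine", "power", "light"]]
token2id = build_dictionary(texts)
bow_corpus = [doc_to_bow(doc, token2id) for doc in texts]
print(token2id)
print(bow_corpus)
```

In the first document, word id 0 ("car") occurs twice and word id 1 ("engine") occurs once.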
12. Building the Topic Model
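We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well.
Apart from that, alpha and eta are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a 1.0/num_topics prior.
chunksize is the number of documents used in each training chunk. update_every determines how often the model parameters should be updated, and passes is the total number of training passes.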

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

13. View the topics in LDA model
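The LDA model above is built with 20 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.
You can see the keywords for each topic and the weight (importance) of each keyword using lda_model.print_topics(), as shown next.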

# Print the keywords for each topic (print_topics() shows up to 20 topics by default)
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

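How do you interpret this?
Topic 0 is represented as 0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + 0.007*"mount" + 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + 0.007*"back" + 0.006*"turn".
This means the top 10 keywords that contribute to this topic are 'car', 'power', 'light', and so on, and the weight of 'car' on topic 0 is 0.016.
The weights reflect how important a keyword is to that topic.
Looking at these keywords, can you guess what this topic could be? You might summarize it as "cars" or "automobiles".
Likewise, can you go through the remaining topic keywords and judge what each topic is?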
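
If you want the keywords and weights as data rather than as a string, the text can be parsed with a regular expression. This is a small sketch that assumes the weight*"word" format joined by plus signs; parse_topic is a helper name I made up.

```python
import re

# Topic string in the weight*"word" format, joined by ' + '.
topic_0 = ('0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + '
           '0.007*"mount" + 0.007*"controller" + 0.007*"cool" + '
           '0.007*"engine" + 0.007*"back" + 0.006*"turn"')


def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


keywords = parse_topic(topic_0)
print(keywords[:3])
```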

14. Compute Model Perplexity and Coherence Score
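Model perplexity and topic coherence provide a convenient measure to judge how good a given topic model is. In my experience, the topic coherence score in particular has been more helpful.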

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # per-word likelihood bound (a log value); perplexity is 2^(-bound), so a bound closer to zero is better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Perplexity:  -8.86067503009

Coherence Score:  0.532947587081
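
A note on reading the first number: as far as I understand Gensim's documentation, log_perplexity() returns a per-word likelihood bound rather than perplexity itself, and the perplexity estimate is 2 raised to the negative of that bound. The sketch below applies that conversion to the value printed above; the function name is my own.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base-2 log scale) into perplexity."""
    return 2 ** (-per_word_bound)


bound = -8.86067503009  # the per-word bound reported above
perplexity = perplexity_from_bound(bound)
print(round(perplexity, 1))
```

A bound closer to zero gives a lower perplexity, and lower perplexity is better.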

15. Visualize the topics-keywords
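Now that the LDA model is built, the next step is to examine the produced topics and the associated keywords. There is no better tool than the pyLDAvis package's interactive chart, which is designed to work well with Jupyter notebooks.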

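So how do you interpret pyLDAvis's output?
Each bubble on the left-hand plot represents a topic. The larger the bubble, the more prevalent that topic is.
A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.
A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.
If you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.
We have successfully built a good-looking topic model.
Given our prior knowledge of the number of natural topics in the documents, finding the best model was fairly straightforward.
Next, we will improve upon this model by using Mallet's version of the LDA algorithm, and then we will focus on how to arrive at the optimal number of topics given any large corpus of text.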
16. Building LDA Mallet Model
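So far you have seen Gensim's built-in version of the LDA algorithm. Mallet's version, however, often gives better quality topics.
Gensim provides a wrapper to run Mallet's LDA from within Gensim itself. You only need to download the zip file, unzip it, and provide the path to mallet in the unzipped directory to gensim.models.wrappers.LdaMallet. Note that the gensim.models.wrappers module was removed in Gensim 4.0, so this code needs a Gensim 3.x release. See how I have done this below.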

# Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = 'path/to/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

# Show Topics
pprint(ldamallet.show_topics(formatted=False))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

17. How to find the optimal number of topics for LDA?
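My approach to finding the optimal number of topics is to build many LDA models with different numbers of topics (k) and pick the one that gives the highest coherence value.
Choosing a 'k' that marks the end of a rapid growth in topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics.
If you see the same keywords being repeated in multiple topics, it is probably a sign that 'k' is too large.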
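
One rough way to check for repeated keywords is to measure how much the keyword sets of two topics overlap. The sketch below computes the Jaccard overlap for every pair of topics; the topics and keywords are made up, and keyword_overlap is a hypothetical helper rather than anything provided by Gensim.

```python
from itertools import combinations


def keyword_overlap(topics):
    """Jaccard overlap of the keyword sets for every pair of topics."""
    overlaps = {}
    for (a, words_a), (b, words_b) in combinations(sorted(topics.items()), 2):
        set_a, set_b = set(words_a), set(words_b)
        overlaps[(a, b)] = len(set_a & set_b) / len(set_a | set_b)
    return overlaps


# Hypothetical top keywords per topic id.
topics = {
    0: ["car", "engine", "drive", "power"],
    1: ["car", "drive", "engine", "wheel"],
    2: ["game", "team", "season", "player"],
}
overlaps = keyword_overlap(topics)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Topics 0 and 1 share three of five distinct keywords, which suggests they could be one topic split in two.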
The compute_coherence_values() function (see below) trains multiple LDA models and provides the models and their corresponding coherence scores.

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)

# Show graph
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()

# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

Num Topics = 2  has Coherence Value of 0.4451
Num Topics = 8  has Coherence Value of 0.5943
Num Topics = 14  has Coherence Value of 0.6208
Num Topics = 20  has Coherence Value of 0.6438
Num Topics = 26  has Coherence Value of 0.643
Num Topics = 32  has Coherence Value of 0.6478
Num Topics = 38  has Coherence Value of 0.6525

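If the coherence score seems to keep increasing, it may make better sense to pick the model that gave the highest coherence value (CV) before the curve flattens out. This is exactly the case here.

So for the further steps I will choose the model with 20 topics.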
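
The "before flattening out" rule can be written down as a small heuristic. This is only a sketch of how I read the rule: walk through the scores in order and stop at the first k where the next step improves coherence by less than a threshold. The function name and the 0.01 threshold are assumptions, and the scores are the ones printed above.

```python
def pick_num_topics(num_topics, scores, min_gain=0.01):
    """Return the first k after which the next coherence gain is below min_gain."""
    for i in range(len(scores) - 1):
        if scores[i + 1] - scores[i] < min_gain:
            return num_topics[i]
    return num_topics[-1]  # the curve never flattened within the tested range


ks = [2, 8, 14, 20, 26, 32, 38]
cvs = [0.4451, 0.5943, 0.6208, 0.6438, 0.643, 0.6478, 0.6525]
best_k = pick_num_topics(ks, cvs)
print(best_k)
```

With these scores the gain from 20 to 26 topics is negative, so the heuristic also lands on 20. A different threshold could give a different answer, so treat it as a starting point rather than a rule.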

# Select the model and print the topics
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))

18. Finding the dominant topic in each sentence
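One of the practical applications of topic modeling is to determine what topic a given document is about.

To find that, we find the topic number that has the highest percentage contribution in that document.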
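
For a single document the idea reduces to taking a maximum over (topic number, proportion) pairs. A minimal sketch with made-up proportions:

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, share) pair with the largest share."""
    return max(doc_topics, key=lambda pair: pair[1])


# Hypothetical topic proportions for one document.
doc_topics = [(0, 0.12), (7, 0.61), (13, 0.27)]
topic_id, share = dominant_topic(doc_topics)
print(topic_id, share)
```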

The format_topics_sentences() function below nicely aggregates this information in a presentable table.

def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        # With per_word_topics=True, LdaModel returns a tuple whose first item is the topic list
        row = row[0] if isinstance(row, tuple) else row
        # Dominant topic = the (topic_num, prop_topic) pair with the highest proportion
        topic_num, prop_topic = max(row, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])

    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

19. Find the most representative document for each topic
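Sometimes the topic keywords alone may not be enough to make sense of what a topic is about. So, to help with understanding the topic, you can find the documents that a given topic has contributed to the most and infer the topic by reading those documents. Whew!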

# Keep the most representative document under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf_mallet.head()

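The tabular output above actually has 20 rows, one for each topic. It has the topic number, the keywords, and the most representative document. The Perc_Contribution column is simply the percentage contribution of the topic in the given document.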
20. Topic distribution across documents
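Finally, we want to understand the volume and distribution of topics in order to judge how widely each was discussed. The table below exposes that information.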

# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

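# Topic Number and Keywords (one row per topic, indexed by topic number so the concat aligns)
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']].drop_duplicates().set_index('Dominant_Topic')

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Change Column names
df_dominant_topics.columns = ['Topic_Keywords', 'Num_Documents', 'Perc_Documents']
df_dominant_topics = df_dominant_topics.rename_axis('Dominant_Topic').reset_index()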

# Show
df_dominant_topics

21. Conclusion
We started with understanding what topic modeling can do. We built a basic topic model using Gensim's LDA and visualized the topics using pyLDAvis. Then we built Mallet's LDA implementation. You saw how to find the optimal number of topics using coherence scores and how to come to a logical understanding of how to choose the optimal model.

Finally, we saw how to aggregate and present the results to generate insights that may be more actionable.
