Exploring the Space of Topic Coherence Measures

Evaluation of Topic Modeling: Topic Coherence
We will go through the evaluation of topic modelling by introducing the concept of topic coherence, as topic models give no guarantee of the interpretability of their output. Topic modeling provides us with methods to organize, understand and summarize large collections of textual information. There are many techniques used to obtain topic models. Latent Dirichlet Allocation (LDA) is a widely used topic modeling technique for extracting topics from textual data.

Topic models learn topics—typically represented as sets of important words—automatically from unlabelled documents in an unsupervised way. This is an attractive way to bring structure to otherwise unstructured text data, but topics are not guaranteed to be interpretable; therefore, coherence measures have been proposed to distinguish between good and bad topics.

Let’s start with a simple example and then move on to the technical part of topic coherence.

Imagine you are a lead quality analyst sitting at location X at a logistics company and you want to check the quality of your dispatched product at 4 different locations: A, B, C, D. One way is to collect reviews from various people – for example, “Did they receive the product in good condition?”, “Did they receive it on time?”. You may need to improve your process if most people give you bad reviews. So, basically, you are evaluating with a qualitative approach, as there is no quantitative measure involved that can tell you how much worse your dispatch quality at A is compared to the dispatch quality at B.

To arrive at a quantitative measure, your central lab at X sets up 4 different quality-lab kiosks at A, B, C and D to check the dispatched product quality (let’s say quality is defined by the % of conformance to some predefined standards). Now, while sitting at the central lab, you can get the quality values from the 4 kiosks and compute your overall quality. You don’t need to rely on people’s reviews, as you have a good quantitative measure of quality.

Here is the analogy:
The dispatched product here is the set of topics from some topic modeling algorithm such as LDA. The qualitative approach is to test the topics for human interpretability by presenting them to humans and collecting their judgments. The quality-lab setup is the topic coherence framework, which is grouped into the following 4 dimensions:

  • Segmentation: The dispatched product lot is divided into different sub-lots, such that each sub-lot is different.
  • Probability Estimation: Quantitative measurement of sub-lot quality.
  • Confirmation Measure: Determine quality as per some predefined standard (say, % conformance) and assign a number to it. For example, 75% of products are of good quality as per the XXX standard.
  • Aggregation: This is the central lab, where you combine all the quality numbers and derive a single number for overall quality.


From a technical point of view, the coherence framework is represented as a composition of parts that can be combined. The parts are grouped into dimensions that span the configuration space of coherence measures. Each dimension is characterized by a set of exchangeable components.

First, the word set t is segmented into a set S of pairs of word subsets. Second, word probabilities P are computed based on a given reference corpus. Both the set of word subsets S and the computed probabilities P are consumed by the confirmation measure to calculate the agreements ϕ of the pairs in S. Finally, those values are aggregated into a single coherence value c.
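To make this pipeline concrete, here is a minimal, self-contained sketch of the four stages. The toy reference documents and topic words are made up for illustration, and PMI plus the arithmetic mean are just one possible choice of confirmation measure and aggregation:

import itertools
import math

# Toy reference corpus: each document is represented as a set of tokens.
reference_docs = [
    {"coffee", "cup", "strong", "drink"},
    {"coffee", "bean", "blend", "cup"},
    {"chocolate", "cooky", "treat"},
    {"coffee", "drink", "blend", "cup"},
]
n_docs = len(reference_docs)

topic_words = ["coffee", "cup", "blend"]  # the word set t for one topic

# 1) Segmentation: pair every word with every other word.
S = list(itertools.permutations(topic_words, 2))

# 2) Probability estimation: boolean document probabilities from the reference corpus.
def p(*words):
    return sum(all(w in doc for w in words) for doc in reference_docs) / n_docs

# 3) Confirmation measure: PMI of each pair (epsilon avoids log(0)).
eps = 1e-12
phi = [math.log((p(wi, wj) + eps) / (p(wi) * p(wj))) for wi, wj in S]

# 4) Aggregation: the arithmetic mean gives the single coherence value c.
c = sum(phi) / len(phi)
print('Toy coherence value c:', round(c, 3))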

There are two measures of topic coherence:

Intrinsic Measure
It is represented by UMass. It compares each word only to the preceding and succeeding words respectively, so it needs an ordered word set. It uses a pairwise score function, which is the empirical conditional log-probability with a smoothing count to avoid taking the logarithm of zero.
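As a rough illustration (the document counts below are made up), the UMass pairwise score is the smoothed conditional log-probability estimated from document co-occurrence counts:

import math

def umass_pair_score(D_wi_wj, D_wj):
    """Empirical conditional log-probability with +1 smoothing.

    D_wi_wj: number of documents containing both w_i and w_j
    D_wj:    number of documents containing the preceding word w_j
    """
    return math.log((D_wi_wj + 1) / D_wj)

# Illustrative counts: the pair co-occurs in 120 of the 300 documents
# that contain the preceding word.
print(umass_pair_score(120, 300))  # ~ -0.91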

Extrinsic Measure
It is represented by UCI. In the UCI measure, every single word is paired with every other single word. The UCI coherence uses pointwise mutual information (PMI).
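A corresponding sketch of the UCI pairwise score, again with made-up probabilities (in the original measure these are estimated from word co-occurrences within a sliding window over an external corpus such as Wikipedia):

import math

def uci_pair_score(p_wi, p_wj, p_wi_wj, eps=1e-12):
    """Pointwise mutual information of a word pair; eps avoids log(0)."""
    return math.log((p_wi_wj + eps) / (p_wi * p_wj))

# Illustrative sliding-window probabilities.
print(uci_pair_score(p_wi=0.05, p_wj=0.04, p_wi_wj=0.006))  # ~ 1.1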

Both the intrinsic and the extrinsic measure compute the coherence score c (the sum of pairwise scores over the words w1, …, wn used to describe the topic).
If you are interested in learning more detail, refer to this paper: Exploring the Space of Topic Coherence Measures

Implementation in Python
The Amazon Fine Food Reviews dataset, publicly available on Kaggle, is used for this article. Since the dataset is very large, only 10,000 reviews are considered. Since we are focusing on topic coherence, I am not going into the details of data pre-processing here.

It consists of the following steps:

Step 1

The first step is loading packages and data, and pre-processing the data.
We create the dictionary and corpus required for topic modeling: the two main inputs to the LDA topic model are the dictionary and the corpus. Gensim creates a unique id for each word in the document. The produced corpus, shown in the output below, is a mapping of (word_id, word_frequency). For example, (0, 1) in the output implies that word id 0 occurs once in the first document.

# Import required packages
import numpy as np
import pandas as pd
import logging
import pyLDAvis.gensim
import json
import warnings
warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary
from numpy import array

# Import dataset
p_df = pd.read_csv('C:/Users/kamal/Desktop/R project/Reviews.csv')
# Create sample of 10,000 reviews
p_df = p_df.sample(n = 10000)
# Convert to array
docs =array(p_df['Text'])
# Define a function for tokenizing and lemmatizing
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]

    # Keep only words longer than three characters.
    docs = [[token for token in doc if len(token) > 3] for doc in docs]

    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

    return docs
# Perform function on our document
docs = docs_preprocessor(docs)
# Create bigram & trigram models
from gensim.models import Phrases
# Add bigrams and trigrams to docs; min_count=10 keeps only phrases that appear 10 times or more.
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a trigram, add to document.
            docs[idx].append(token)
#Remove rare & common tokens 
# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=10, no_above=0.2)
#Create dictionary and corpus required for Topic Modeling
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
print(corpus[:1])
Number of unique tokens: 4214
Number of documents: 10000
[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 4), (7, 2), (8, 1), (9, 2), (10, 5), (11, 8), (12, 3), (13, 1), (14, 2), (15, 1), (16, 2), (17, 2), (18, 3), (19, 3), (20, 1), (21, 1), (22, 1), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2), (30, 1), (31, 2), (32, 2), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 3)]]

Step 2

We have everything required to train the LDA model. In addition to the corpus and dictionary, we need to provide the number of topics as well. Set the number of topics to 5.

# Set parameters.
num_topics = 5
chunksize = 500 
passes = 20 
iterations = 400
eval_every = 1  

# Make an index-to-word dictionary.
temp = dictionary[0]  # only to "load" the dictionary.
id2word = dictionary.id2token

lda_model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)
# Print the keywords of the 5 topics
print(lda_model.print_topics())
[(0, '0.067*"coffee" + 0.012*"strong" + 0.011*"green" + 0.010*"vanilla" + 0.009*"just_right" + 0.009*"cup" + 0.009*"blend" + 0.008*"drink" + 0.008*"http_amazon" + 0.007*"bean"'), (1, '0.046*"have_been" + 0.030*"chocolate" + 0.021*"been" + 0.021*"gluten_free" + 0.018*"recommend" + 0.017*"highly_recommend" + 0.014*"free" + 0.012*"cooky" + 0.012*"would_recommend" + 0.011*"have_ever"'), (2, '0.028*"food" + 0.018*"amazon" + 0.016*"from" + 0.015*"price" + 0.012*"store" + 0.011*"find" + 0.010*"brand" + 0.010*"time" + 0.010*"will" + 0.010*"when"'), (3, '0.019*"than" + 0.013*"more" + 0.012*"water" + 0.011*"this_stuff" + 0.011*"better" + 0.011*"sugar" + 0.009*"much" + 0.009*"your" + 0.009*"more_than" + 0.008*"which"'), (4, '0.010*"would" + 0.010*"little" + 0.010*"were" + 0.009*"they_were" + 0.009*"treat" + 0.009*"really" + 0.009*"make" + 0.008*"when" + 0.008*"some" + 0.007*"will"')]

Step 3

LDA is an unsupervised technique, meaning that we don’t know, prior to running the model, how many topics exist in our corpus. You can use the LDA visualization tool pyLDAvis, try a few numbers of topics and compare the results. Topic coherence is one of the main techniques used to estimate the number of topics. We will use both the UMass and c_v measures to see the coherence score of our LDA model.

Using c_v Measure

# Compute Coherence Score using c_v
coherence_model_lda = CoherenceModel(model=lda_model, texts=docs, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Coherence Score:  0.359704263036

Using UMass Measure

# Compute Coherence Score using UMass
coherence_model_lda = CoherenceModel(model=lda_model, texts=docs, dictionary=dictionary, coherence="u_mass")
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Coherence Score:  -2.60591638507

Step 4

The last step is to find the optimal number of topics. We need to build many LDA models with different values of the number of topics (k) and pick the one that gives the highest coherence value. Choosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics. If you see the same keywords being repeated in multiple topics, it’s probably a sign that ‘k’ is too large.

Using c_v Measure

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model=LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

Create a model list and plot the coherence score against the number of topics.

model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=docs, start=2, limit=40, step=6)
# Show graph
import matplotlib.pyplot as plt
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()

This gives the following plot of coherence score against the number of topics:
The above plot shows that the coherence score increases with the number of topics, with a decline between 15 and 20. Now, choosing the number of topics still depends on your requirements, because topics around 33 have good coherence scores but may have repeated keywords within the topics. Topic coherence gives you a good picture so that you can make a better decision.
You can try the same with the UMass measure.
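For reference, a minimal sketch of the same loop using the UMass measure might look like the following (it reuses the LdaModel and CoherenceModel imports from above; note that u_mass is computed from the corpus rather than the tokenized texts, and the function name is just an illustrative choice):

def compute_umass_coherence_values(dictionary, corpus, limit, start=2, step=6):
    """Same loop as above, but scoring each model with the u_mass measure."""
    model_list, coherence_values = [], []
    for num_topics in range(start, limit, step):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        cm = CoherenceModel(model=model, corpus=corpus, dictionary=dictionary, coherence='u_mass')
        coherence_values.append(cm.get_coherence())
    return model_list, coherence_values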

Conclusion

To conclude, there are many other approaches to evaluating topic models, such as perplexity, but it is a poor indicator of the quality of the topics. Topic visualization is also a good way to assess topic models. Topic coherence is a good way to compare different topic models based on their human interpretability. The u_mass and c_v topic coherences capture the optimal number of topics by giving the interpretability of these topics a number called the coherence score.

In response to some of the comments, the author Kamal Kumar replied as follows:

Q: I need to classify the resulting topics (LDA) using a binary classifier. Advise me!
A: You can use the document-topic distribution as the feature vector for the classifier.
In sklearn, you can get this with lda.fit_transform(tf).
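As a rough illustration of that answer (the texts, labels, and parameters below are hypothetical and use sklearn's LatentDirichletAllocation rather than gensim):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Hypothetical raw reviews and binary labels (not from the article's dataset).
texts = ["great coffee, arrived on time", "stale cookies, very disappointed",
         "good chocolate, fast delivery", "terrible tea, late shipment"]
y = [1, 0, 1, 0]

tf = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(tf)          # document-topic distribution as features

clf = LogisticRegression().fit(doc_topics, y)
print(clf.predict(doc_topics))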
Q: How can you define c_v?
A: C_V is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.
C_UCI is a coherence based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words.
C_UMass is based on document co-occurrence counts, a one-preceding segmentation and a logarithmic conditional probability as the confirmation measure.
Q: Thank you so much. Another doubt I have: apart from the quantitative value we obtain through coherence, does it have an actual relation with the goodness of the model?
A: Yes, you can say that. If we look at the code above, compute_coherence_values trains multiple LDA models and gives us their corresponding coherence scores. It is good to consider the model with the highest coherence score. In short, it measures how often the topic words appear together in the corpus.
Q: OK, but as per the above graph there are multiple high values for coherence, so what is a good coherence value in that case? I believe the coherence of a model is the mean of all the topic coherences combined, or some similar measure; is that right, any idea on that Kamal? Because there is an option I found which gets us the coherence of the topics of a model. Sorry for bombarding you with a lot of questions. Just confused.
A: As I mentioned in the article, 33 topics have a good coherence score (the peak shown). But again this depends upon your requirements, case by case. LDA is one method of topic modelling. You can compare the coherence scores of topics generated by different topic models and then use the topic model accordingly.
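For what it's worth, gensim's CoherenceModel also exposes per-topic scores, which is presumably the option referred to above; a minimal sketch, reusing coherence_model_lda from Step 3:

# Per-topic coherence scores; get_coherence() aggregates (averages) them.
per_topic = coherence_model_lda.get_coherence_per_topic()
print(per_topic)
print(sum(per_topic) / len(per_topic))  # ≈ coherence_model_lda.get_coherence()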