Latent Semantic Analysis (LSA) Tutorial (Part 4)

WangBen 20110916 Beijing

Part 2 - Modify the Counts with TFIDF


In sophisticated Latent Semantic Analysis systems, the raw matrix counts are usually modified so that rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the documents should probably be weighted more heavily than a word that occurs in 90% of the documents. The most popular weighting is TFIDF (Term Frequency - Inverse Document Frequency). Under this method, the count in each cell is replaced by the following formula.


TFIDF_{i,j} = ( N_{i,j} / N_{*,j} ) * log( D / D_i ), where

  • N_{i,j} = the number of times word i appears in document j (the original cell count).
  • N_{*,j} = the total number of words in document j (just add the counts in column j).
  • D = the number of documents (the number of columns).
  • D_i = the number of documents in which word i appears (the number of non-zero columns in row i).

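To make the formula concrete, here is a hypothetical single-cell calculation (the numbers are invented for illustration). Suppose word i appears 3 times in document j, document j contains 12 index words in total, there are 9 documents, and word i appears in only 2 of them. Then:

TFIDF_{i,j} = ( 3 / 12 ) * log( 9 / 2 ) ≈ 0.25 * 1.504 ≈ 0.376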

In this formula, words that concentrate in certain documents are emphasized (by the N_{i,j} / N_{*,j} ratio) and words that only appear in a few documents are also emphasized (by the log( D / D_i ) term).
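As a quick illustration of both effects, the sketch below applies the formula to a tiny numpy count matrix and prints the non-zero weights. This snippet is not part of the original LSA class; the matrix and variable names are made up for this example.

from math import log
from numpy import array

# A hypothetical 3-word x 4-document count matrix (invented for this example).
A = array([[4, 0, 0, 0],    # a word concentrated in a single document
           [1, 1, 1, 1],    # a word spread evenly across every document
           [2, 0, 3, 1]])

WordsPerDoc = A.sum(axis=0)          # N*,j : total words in each document
DocsPerWord = (A > 0).sum(axis=1)    # Di   : documents containing each word
rows, cols = A.shape
for i in range(rows):
    for j in range(cols):
        if A[i, j] > 0:
            weight = (float(A[i, j]) / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])
            print(i, j, round(weight, 3))

# The concentrated word's single cell gets the largest boost (log(4/1)),
# while the evenly spread word's cells are driven to 0 (log(4/4) = 0).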

Since we have such a small example, we will skip this step and move on to the heart of LSA, doing the singular value decomposition of our matrix of counts. However, if we did want to add TFIDF to our LSA class we could add the following two lines at the beginning of our python file to import the log, asarray, and sum functions.



from math import log
from numpy import asarray, sum

Then we would add the following TFIDF method to our LSA class. WordsPerDoc (N_{*,j}) just holds the sum of each column, which is the total number of index words in each document. DocsPerWord (D_i) uses asarray to create an array of what would be True and False values, depending on whether the cell value is greater than 0 or not, but the 'i' argument turns it into 1's and 0's instead. Then each row is summed up, which tells us how many documents each word appears in. Finally, we just step through each cell and apply the formula. We do have to change cols (which is the number of documents) into a float to prevent integer division.


def TFIDF(self):
    WordsPerDoc = sum(self.A, axis=0)                     # N*,j : total index words in each document
    DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)   # Di   : number of documents containing each word
    rows, cols = self.A.shape
    for i in range(rows):
        for j in range(cols):
            self.A[i, j] = (self.A[i, j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])
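For larger matrices, the double loop can also be written in a vectorized form. The method below is only a sketch, not part of the original tutorial; it assumes self.A is a 2-D numpy array and, like the loop version, that every document contains at least one word and every word appears in at least one document.

from numpy import asarray, newaxis
from numpy import log as nplog

def TFIDF_vectorized(self):
    A = asarray(self.A, dtype=float)
    WordsPerDoc = A.sum(axis=0)          # N*,j for every document
    DocsPerWord = (A > 0).sum(axis=1)    # Di for every word
    cols = A.shape[1]
    # Divide each column by its document's word total, then scale each row by its log(D / Di) factor.
    self.A = (A / WordsPerDoc[newaxis, :]) * nplog(float(cols) / DocsPerWord)[:, newaxis]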

