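1. What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases with how frequently it appears across the corpus.

In short, the more often a word appears in one article and the fewer documents in the collection contain it, the better that word represents the article. That is the idea behind TF-IDF.

TF-IDF consists of two parts, TF and IDF, which are introduced in turn below.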

1.1 TF

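TF (Term Frequency) is the frequency with which a term appears in a text. The raw count is usually normalized (typically divided by the total number of words in the document) so that it is not biased toward long documents: the same word may have a higher raw count in a long document than in a short one, whether or not it is important. TF is defined as
$TF_{i,j}=\frac{n_{i,j}}{\sum_{k}{n_{k,j}}}\tag{1}$
where $n_{i,j}$ is the number of times term $t_i$ appears in document $d_j$, the denominator is the total number of term occurrences in $d_j$, and $TF_{i,j}$ is the frequency of term $t_i$ in document $d_j$.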
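
To make equation (1) concrete, here is a minimal sketch that computes normalized term frequencies for one tokenized document. The function name term_frequencies and the whitespace tokenization are assumptions made for illustration; they do not come from any library.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# Hypothetical input: one sentence split on whitespace.
doc = "this is the second second document".split()
tf = term_frequencies(doc)

print(tf["second"])  # "second" accounts for 2 of the 6 tokens
print(tf["the"])     # "the" accounts for 1 of the 6 tokens
```

Because every count is divided by the same total, the frequencies of one document always sum to 1, which is what keeps long and short documents comparable.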

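Note, however, that very common words contribute little to identifying a topic, whereas less frequent words often express it, so TF alone is not a suitable weight. A weighting scheme should satisfy one requirement: the better a word predicts the topic, the larger its weight, and vice versa. A word that appears in only a few of the documents in the collection says a lot about the topic of those documents, so it should receive a large weight. This is the job of IDF.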

1.2 IDF

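IDF (Inverse Document Frequency) measures how common a term is across the collection. The fewer documents contain term $t_i$, the larger its IDF, which indicates that the term discriminates well between categories. The IDF of a term is obtained by dividing the total number of documents by the number of documents containing the term (plus one), then taking the logarithm of the quotient:
$IDF_i=\log\frac{\left|D \right|}{1+\left|\{j: t_i \in d_j\}\right|}\tag{2}$
where $\left|D \right|$ is the total number of documents and $\left|\{j: t_i \in d_j\}\right|$ is the number of documents containing term $t_i$. Why add 1? It prevents a division by zero when no document contains $t_i$. A side effect is that a term found in every document gets $\log\frac{|D|}{|D|+1} < 0$, which is why "the" receives a negative weight in the output of section 2.1.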
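
The effect of the +1 in equation (2) is easiest to see on a toy collection. The sketch below is an illustration only: the helper idf_plus_one and the three tiny documents are made up, and the natural logarithm is assumed because equation (2) does not fix a base.

```python
import math


def idf_plus_one(term, docs):
    """IDF as in equation (2): log(|D| / (1 + number of docs containing term))."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + df))


# Hypothetical collection: each document is a set of its distinct words.
docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "cat", "ran"},
]

print(idf_plus_one("sat", docs))   # rare word: positive IDF
print(idf_plus_one("the", docs))   # word in every document: negative IDF
print(idf_plus_one("bird", docs))  # unseen word: no division by zero
```

The unseen word is handled without an error, which is the purpose of the +1. The price is that a word present in every document ends up slightly below zero instead of exactly at zero.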

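A high term frequency within a particular document, combined with a low document frequency of that term across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones. It is expressed as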
$TF \text{-}IDF= TF \cdot IDF\tag{3}$
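
As a worked example of equations (1) to (3), take some made-up numbers: a 100-word document in which a term appears 3 times, in a corpus of 1000 documents of which 9 contain the term. Assuming the natural logarithm:
$TF = \frac{3}{100} = 0.03,\quad IDF = \log\frac{1000}{1+9} = \log 100 \approx 4.605,\quad TF\text{-}IDF \approx 0.03 \times 4.605 \approx 0.138$
If the same term instead appeared in 999 of the 1000 documents, IDF would be $\log\frac{1000}{1+999} = 0$ and the TF-IDF weight would vanish, however frequent the term is within the document.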

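2. Python implementation

Using the same corpus throughout, we compute TF-IDF in three ways: by hand in Python, with the gensim library, and with the sklearn library.

2.1 Manual implementation in Python

Input corpus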

corpus = ['this is the first document',
        'this is the second second document',
        'and the third one',
        'is this the first document']
words_list = list()
for i in range(len(corpus)):
    words_list.append(corpus[i].split(' '))
print(words_list)

[['this', 'is', 'the', 'first', 'document'], 
['this', 'is', 'the', 'second', 'second', 'document'], 
['and', 'the', 'third', 'one'], 
['is', 'this', 'the', 'first', 'document']]

Count the words

from collections import Counter
count_list = list()
for i in range(len(words_list)):
    count = Counter(words_list[i])
    count_list.append(count)
print(count_list)

[Counter({'this': 1, 'is': 1, 'the': 1, 'first': 1, 'document': 1}), 
Counter({'second': 2, 'this': 1, 'is': 1, 'the': 1, 'document': 1}), 
Counter({'and': 1, 'the': 1, 'third': 1, 'one': 1}), 
Counter({'is': 1, 'this': 1, 'the': 1, 'first': 1, 'document': 1})]

Define the functions

import math
def tf(word, count):
    return count[word] / sum(count.values())


def idf(word, count_list):
    n_contain = sum([1 for count in count_list if word in count])
    return math.log(len(count_list) / (1 + n_contain))


def tf_idf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

Print the results

for i, count in enumerate(count_list):
    print("TF-IDF statistics for document {}".format(i + 1))
    scores = {word: tf_idf(word, count, count_list) for word in count}
    sorted_word = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_word:
        print("\tword: {}, TF-IDF: {}".format(word, round(score, 5)))

TF-IDF statistics for document 1
	word: first, TF-IDF: 0.05754
	word: this, TF-IDF: 0.0
	word: is, TF-IDF: 0.0
	word: document, TF-IDF: 0.0
	word: the, TF-IDF: -0.04463
TF-IDF statistics for document 2
	word: second, TF-IDF: 0.23105
	word: this, TF-IDF: 0.0
	word: is, TF-IDF: 0.0
	word: document, TF-IDF: 0.0
	word: the, TF-IDF: -0.03719
TF-IDF statistics for document 3
	word: and, TF-IDF: 0.17329
	word: third, TF-IDF: 0.17329
	word: one, TF-IDF: 0.17329
	word: the, TF-IDF: -0.05579
TF-IDF statistics for document 4
	word: first, TF-IDF: 0.05754
	word: is, TF-IDF: 0.0
	word: this, TF-IDF: 0.0
	word: document, TF-IDF: 0.0
	word: the, TF-IDF: -0.04463

2.2 Implementation with the gensim package

We use the same corpus as in section 2.1. The steps are as follows.

Get the id and count of each word

from gensim import corpora
# Assign an integer id to every distinct word in the corpus
dic = corpora.Dictionary(words_list)
new_corpus = [dic.doc2bow(words) for words in words_list]
# In each tuple, the first element is the word's id in the dictionary and the second is its count in the document
print(new_corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], 
[(0, 1), (2, 1), (3, 1), (4, 1), (5, 2)], 
[(3, 1), (6, 1), (7, 1), (8, 1)], 
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]]

Look up the id assigned to each word

print(dic.token2id)

{'document': 0, 'first': 1, 'is': 2, 'the': 3, 'this': 4, 'second': 5, 'and': 6, 'one': 7, 'third': 8}

Train the gensim model and save it for later use

# Train the model and save it
from gensim import models
tfidf = models.TfidfModel(new_corpus)
tfidf.save("tfidf.model")
# Load the model
tfidf = models.TfidfModel.load("tfidf.model")
# Use the trained model to get the TF-IDF value of each word
tfidf_vec = []
for i in range(len(corpus)):
    string = corpus[i]
    string_bow = dic.doc2bow(string.lower().split())
    string_tfidf = tfidf[string_bow]
    tfidf_vec.append(string_tfidf)
# Print (word id, TF-IDF value) pairs for each document
print(tfidf_vec)

[[(0, 0.33699829595119235), (1, 0.8119707171924228), (2, 0.33699829595119235), (4, 0.33699829595119235)], 
[(0, 0.10212329019650272), (2, 0.10212329019650272), (4, 0.10212329019650272), (5, 0.9842319344536239)], 
[(6, 0.5773502691896258), (7, 0.5773502691896258), (8, 0.5773502691896258)], 
[(0, 0.33699829595119235), (1, 0.8119707171924228), (2, 0.33699829595119235), (4, 0.33699829595119235)]]

Testing a new sentence

# Score a new sentence with the trained model
test_words = "i is the first one"
string_bow = dic.doc2bow(test_words.lower().split())
string_tfidf = tfidf[string_bow]
print(string_tfidf)

The output, rounded to four decimal places:

[(1, 0.4397), (2, 0.1825), (7, 0.8794)]

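Note that the printed TF-IDF vectors contain only some of the words. This is not stop-word removal. By default gensim computes IDF as $\log_2(|D|/df)$ with no smoothing, so a word that appears in every document (here "the", id 3) gets a weight of 0, and gensim drops entries whose weight is effectively zero. Words that are not in the dictionary, such as "i" in the test sentence, are ignored by doc2bow.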
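
The gensim numbers above can be reproduced with the standard library alone, which also shows why "the" disappears from the vectors. This is a sketch under the assumption that gensim's defaults apply (raw counts as TF, IDF = log2(N / df), L2 normalization, near-zero weights dropped); gensim_style_tfidf is a hypothetical helper, not a gensim function.

```python
import math
from collections import Counter

corpus = ['this is the first document',
          'this is the second second document',
          'and the third one',
          'is this the first document']
docs = [doc.split() for doc in corpus]
n_docs = len(docs)
# Document frequency: number of documents that contain each word.
df = Counter(term for doc in docs for term in set(doc))


def gensim_style_tfidf(tokens, eps=1e-12):
    """Raw count * log2(N / df), drop near-zero weights, then L2-normalize."""
    counts = Counter(tokens)
    # Words never seen in the corpus are skipped, as doc2bow would skip them.
    weights = {t: c * math.log2(n_docs / df[t])
               for t, c in counts.items() if t in df}
    weights = {t: w for t, w in weights.items() if abs(w) > eps}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


for term, weight in sorted(gensim_style_tfidf(docs[0]).items()):
    print(term, weight)
```

For the first document this yields weights for "document", "first", "is" and "this" only. "the" occurs in all four documents, so its IDF is log2(4/4) = 0 and it is filtered out.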

2.3 Implementation with the sklearn package

Call the library

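from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
# All distinct words in the corpus
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(tfidf_vec.get_feature_names_out().tolist())
# The id assigned to each word
print(tfidf_vec.vocabulary_)
# The vector for each sentence; entries are ordered by word id
print(tfidf_matrix.toarray())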

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]
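
The sklearn values differ from both the manual and the gensim results because the weighting formula differs. The sketch below reproduces them with the standard library, assuming TfidfVectorizer's default settings (raw counts as TF, smoothed IDF = ln((1 + N) / (1 + df)) + 1, L2 normalization). sklearn_style_tfidf is a hypothetical helper, and whitespace splitting stands in for sklearn's tokenizer, which gives the same tokens for this corpus.

```python
import math
from collections import Counter

corpus = ['this is the first document',
          'this is the second second document',
          'and the third one',
          'is this the first document']
docs = [doc.split() for doc in corpus]
n_docs = len(docs)
# Document frequency: number of documents that contain each word.
df = Counter(term for doc in docs for term in set(doc))


def sklearn_style_tfidf(tokens):
    """Raw count * (ln((1 + N) / (1 + df)) + 1), then L2-normalize."""
    counts = Counter(tokens)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


for term, weight in sorted(sklearn_style_tfidf(docs[0]).items()):
    print(term, round(weight, 8))
```

The extra +1 on the IDF means a word that occurs in every document, such as "the", keeps a positive weight instead of dropping to zero (gensim) or below zero (the manual version with equation (2)).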

The full notebook is available in my GitHub repository.

3. References

[1] https://zh.wikipedia.org/wiki/Tf-idf

[2] https://blog.csdn.net/zrc199021/article/details/53728499

[3] https://www.zybuluo.com/lianjizhe/note/1212780
