NLP from Beginner to Practice (3)

The Bag-of-Words Model and Sentence Similarity

This article introduces the bag-of-words model, a common technique in NLP, and shows how to use it to compute the similarity between sentences (cosine similarity).
First, let's look at what the bag-of-words model is.

The idea is to put all the words into one bag, ignoring morphology and word order, so that every word is treated as independent. For example, the sentences below together form a word bag that contains all of their words, and we can then build an array (or dictionary) that maps each word to a number.
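As a quick illustration, here is a minimal sketch of the idea (the two short sentences are made up for demonstration): every distinct token from both sentences goes into one shared bag, and each sentence is then described only by how often each token occurs in it.

# A minimal bag-of-words sketch (illustrative only)
docs = ["Jane wants to go to Shenzhen", "Bob wants to go to Shanghai"]
tokens = [d.split() for d in docs]                       # naive whitespace tokenization
vocab = sorted(set(w for t in tokens for w in t))        # the shared "bag" of words
vectors = [[t.count(w) for w in vocab] for t in tokens]  # per-sentence frequency vectors
print(vocab)    # ['Bob', 'Jane', 'Shanghai', 'Shenzhen', 'go', 'to', 'wants']
print(vectors)  # [[0, 1, 0, 1, 1, 2, 1], [1, 0, 1, 0, 1, 2, 1]]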

We will use the following two simple sentences as examples:

sent1 = "Word bag model,Put all the words in a bag, regardless of their morphology and word order, that is, each word is independent. For example, the above two examples can form a word bag, which includes Jane, wants, to, go, Shenzhen, Bob and Shanghai. Suppose you build an array (or dictionary) for mapping matches."
sent2 = "Words bags model,Put all the words in a bag, regardless of their morphology and word order, this is, each word is independent. For example, the above two examples can form a word bag, which includes Jane, wants, to, go, Shenzhen, Bob and Shanghai. Suppose you build an array (or dictionary) for mapping matches."

Usually, NLP cannot process whole paragraphs or sentences in one go, so the first step is typically sentence splitting and tokenization. Here we already have individual sentences, so tokenization is all we need. For English sentences we can use the word_tokenize function from NLTK; for Chinese sentences the jieba module can be used (a short jieba sketch appears after the tokenized output below). So the first step is tokenization, and the code is as follows:

from nltk import word_tokenize
sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]

The output is as follows:

[['Word', 'bag', 'model', ',', 'Put', 'all', 'the', 'words', 'in', 'a', 'bag', ',', 'regardless', 'of', 'their', 'morphology', 'and', 'word', 'order', ',', 'that', 'is', ',', 'each', 'word', 'is', 'independent', '.', 'For', 'example', ',', 'the', 'above', 'two', 'examples', 'can', 'form', 'a', 'word', 'bag', ',', 'which', 'includes', 'Jane', ',', 'wants', ',', 'to', ',', 'go', ',', 'Shenzhen', ',', 'Bob', 'and', 'Shanghai', '.', 'Suppose', 'you', 'build', 'an', 'array', '(', 'or', 'dictionary', ')', 'for', 'mapping', 'matches', '.'],
['Words', 'bags', 'model', ',', 'Put', 'all', 'the', 'words', 'in', 'a', 'bag', ',', 'regardless', 'of', 'their', 'morphology', 'and', 'word', 'order', ',', 'this', 'is', ',', 'each', 'word', 'is', 'independent', '.', 'For', 'example', ',', 'the', 'above', 'two', 'examples', 'can', 'form', 'a', 'word', 'bag', ',', 'which', 'includes', 'Jane', ',', 'wants', ',', 'to', ',', 'go', ',', 'Shenzhen', ',', 'Bob', 'and', 'Shanghai', '.', 'Suppose', 'you', 'build', 'an', 'array', '(', 'or', 'dictionary', ')', 'for', 'mapping', 'matches', '.']]
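As mentioned above, for Chinese sentences the jieba module would be used instead of word_tokenize. A minimal sketch, assuming jieba is installed (the sample sentence is made up, and the exact segmentation depends on jieba's dictionary):

import jieba
# jieba.lcut returns the segmentation result as a list of tokens
print(jieba.lcut("我愛自然語言處理"))  # e.g. ['我', '愛', '自然語言', '處理']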

Tokenization is done. The next step is to build the corpus, i.e., the set of all words and punctuation marks that appear in the sentences. The code is as follows:

all_list = []
for text in texts:
    all_list += text
corpus = set(all_list)
print(corpus)

The output is as follows:

{'wants', 'to', 'each', 'mapping', 'Words', 'morphology', 'array', '.', 'for', 'you', 'and', 'is', 'matches', 'Word', 'word', ',', 'Bob', 'can', 'which', 'a', 'that', 'an', 'Put', 'includes', 'bag', 'this', 'the', 'bags', 'words', 'two', 'in', 'Suppose', 'build', 'dictionary', 'examples', 'Shanghai', 'all', 'For', 'Jane', 'of', 'or', 'form', 'go', 'their', 'model', 'regardless', 'order', 'independent', 'example', 'above', 'Shenzhen', '(', ')'}

As you can see, the corpus contains 53 distinct words and punctuation marks in total. Next, we map each word and punctuation mark in the corpus to a number, which will make the vector representation of the sentences easier later on. The code is as follows:

corpus_dict = dict(zip(corpus, range(len(corpus))))
print(corpus_dict)

The output is as follows:

{'wants': 0, 'to': 1, 'each': 2, 'mapping': 3, 'Words': 4, 'morphology': 5, 'array': 6, '.': 7, 'for': 8, 'you': 9, 'and': 10, 'is': 11, 'matches': 12, 'Word': 13, 'word': 14, ',': 15, 'Bob': 16, 'can': 17, 'which': 18, 'a': 19, 'that': 20, 'an': 21, 'Put': 22, 'includes': 23, 'bag': 24, 'this': 25, 'the': 26, 'bags': 27, 'words': 28, 'two': 29, 'in': 30, 'Suppose': 31, 'build': 32, 'dictionary': 33, 'examples': 34, 'Shanghai': 35, 'all': 36, 'For': 37, 'Jane': 38, 'of': 39, 'or': 40, 'form': 41, 'go': 42, 'their': 43, 'model': 44, 'regardless': 45, 'order': 46, 'independent': 47, 'example': 48, 'above': 49, 'Shenzhen': 50, '(': 51, ')': 52}

Although the words and punctuation marks are not numbered in the order in which they appear, this has no effect on the vector representation of the sentences or on the similarity computed later.
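If a reproducible numbering is preferred, the vocabulary can be sorted before assigning indices; a minimal sketch that builds the same kind of dictionary, just in a deterministic order:

# Alternative: sort the vocabulary so the word-to-index mapping is deterministic
corpus_dict = {word: idx for idx, word in enumerate(sorted(corpus))}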
The next step, and the key step of the bag-of-words model, is to build the vector representation of each sentence. Rather than simply using 0/1 to mark whether a word or punctuation mark occurs, the vector uses each token's frequency as its value. Combined with the corpus dictionary we just built, the code for the sentence vectors is as follows:

# Build the vector representation of a sentence
def vector_rep(text, corpus_dict):
    vec = []
    for key in corpus_dict.keys():
        if key in text:
            vec.append((corpus_dict[key], text.count(key)))
        else:
            vec.append((corpus_dict[key], 0))

    vec = sorted(vec, key=lambda x: x[0])

    return vec

vec1 = vector_rep(texts[0], corpus_dict)
vec2 = vector_rep(texts[1], corpus_dict)
print(vec1)
print(vec2)

The output is as follows:

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 0), (5, 1), (6, 1), (7, 3), (8, 1), (9, 1), (10, 2), (11, 2), (12, 1), (13, 1), (14, 3), (15, 11), (16, 1), (17, 1), (18, 1), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 3), (25, 0), (26, 2), (27, 0), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1)]
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 3), (8, 1), (9, 1), (10, 2), (11, 2), (12, 1), (13, 0), (14, 3), (15, 11), (16, 1), (17, 1), (18, 1), (19, 2), (20, 0), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 2), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1)]

Let's pause for a moment and look at these vectors. In the first sentence, the word 'word' appears three times; in the corpus dictionary, 'word' maps to the number 14, so the tuple (14, 3) in the list means that 'word' occurs three times in the first sentence. The raw output above is admittedly not very intuitive, but it is exactly the frequency vector we need.
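Incidentally, the same frequency vector can be built more compactly with collections.Counter from the standard library; a sketch equivalent to vector_rep above:

from collections import Counter

def vector_rep_counter(text, corpus_dict):
    # Count every token once; Counter returns 0 for tokens that do not occur
    counts = Counter(text)
    return sorted((idx, counts[word]) for word, idx in corpus_dict.items())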

OK, that completes the bag-of-words model. Next, we will use the vectors we just obtained, i.e., the two sentences' bag-of-words representations, to compute their similarity.
In NLP, once we have vector representations of two sentences, cosine similarity is the usual choice of similarity measure. The cosine similarity of two vectors is simply the cosine of the angle between them, i.e., their dot product divided by the product of their lengths. The Python code for the computation is as follows:

from math import sqrt
def similarity_with_2_sents(vec1, vec2):
    inner_product = 0
    square_length_vec1 = 0
    square_length_vec2 = 0
    for tup1, tup2 in zip(vec1, vec2):
        inner_product += tup1[1]*tup2[1]
        square_length_vec1 += tup1[1]**2
        square_length_vec2 += tup2[1]**2

    return (inner_product/sqrt(square_length_vec1*square_length_vec2))


cosine_sim = similarity_with_2_sents(vec1, vec2)
print('The cosine similarity of the two sentences is: %.4f.' % cosine_sim)

The output is as follows:

The cosine similarity of the two sentences is: 0.9853.
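For reference, the same cosine similarity can be computed with numpy (a sketch, assuming numpy is installed):

import numpy as np

# Drop the indices and keep only the counts as dense vectors
v1 = np.array([cnt for _, cnt in vec1])
v2 = np.array([cnt for _, cnt in vec2])
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # ≈ 0.9853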

In this way, we have obtained the similarity between the sentences from their bag-of-words models.
Of course, in a real NLP project, if we need to compute the similarity between two sentences, we can simply call the gensim module. It is a powerful NLP library that can help with many NLP tasks. The code below uses gensim to compute the similarity between two sentences:

sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

from nltk import word_tokenize
sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]
print(texts)

from gensim import corpora
from gensim.similarities import Similarity

# Build the dictionary from the tokenized texts
dictionary = corpora.Dictionary(texts)

# Use doc2bow to get the bag-of-words representation of each sentence
corpus = [dictionary.doc2bow(text) for text in texts]
similarity = Similarity('-Similarity-index', corpus, num_features=len(dictionary))
print(similarity)
# Query the similarity of a sentence against the indexed corpus
new_sentence = sent1
test_corpus_1 = dictionary.doc2bow(word_tokenize(new_sentence))

cosine_sim = similarity[test_corpus_1][1]
print("Similarity of the two sentences computed with gensim: %.4f." % cosine_sim)

The output is as follows:

[['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'], ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']]
Similarity index with 2 documents in 0 shards (stored under -Similarity-index)
Similarity of the two sentences computed with gensim: 0.7303.
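If the corpus comfortably fits in memory, gensim's in-memory MatrixSimilarity index can be used instead of the disk-backed Similarity index, so no shard files are written to disk. A sketch under that assumption, reusing the corpus and dictionary built above:

from gensim.similarities import MatrixSimilarity

# Build the index entirely in memory instead of under '-Similarity-index'
index = MatrixSimilarity(corpus, num_features=len(dictionary))
print(index[test_corpus_1][1])  # cosine similarity of sent1 against sent2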

Note: if the following warnings appear when running the code:

gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

gensim\matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int32 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):

To suppress these warnings, add the following code before importing the gensim module:

import warnings
warnings.filterwarnings(action='ignore',category=UserWarning,module='gensim')
warnings.filterwarnings(action='ignore',category=FutureWarning,module='gensim')

The full source code is attached below:

from nltk import word_tokenize

sent1 = "Word bag model,Put all the words in a bag, regardless of their morphology and word order, that is, each word is independent. For example, the above two examples can form a word bag, which includes Jane, wants, to, go, Shenzhen, Bob and Shanghai. Suppose you build an array (or dictionary) for mapping matches."
sent2 = "Words bags model,Put all the words in a bag, regardless of their morphology and word order, this is, each word is independent. For example, the above two examples can form a word bag, which includes Jane, wants, to, go, Shenzhen, Bob and Shanghai. Suppose you build an array (or dictionary) for mapping matches."
sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]
print(texts)
all_list = []
for text in texts:
    all_list += text
corpus = set(all_list)
print(corpus)
corpus_dict = dict(zip(corpus, range(len(corpus))))
print(corpus_dict)

def create_vector(text, corpus_dict):
    vec = []
    for key in corpus_dict.keys():
        if key in text:
            vec.append((corpus_dict[key], text.count(key)))
        else:
            vec.append((corpus_dict[key], 0))
    print(vec)
    vec = sorted(vec, key=lambda x: x[0])
    return vec

vec1 = create_vector(texts[0], corpus_dict)
vec2 = create_vector(texts[1], corpus_dict)
print(vec1, vec2)

from math import sqrt
def similarity_with_2_sents(vec1, vec2):
    inner_product = 0
    square_length_vec1 = 0
    square_length_vec2 = 0
    for tup1, tup2 in zip(vec1, vec2):
        inner_product += tup1[1]*tup2[1]
        square_length_vec1 += tup1[1]**2
        square_length_vec2 += tup2[1]**2

    return (inner_product/sqrt(square_length_vec1*square_length_vec2))


cosine_sim = similarity_with_2_sents(vec1, vec2)
print('The cosine similarity of the two sentences is: %.4f.' % cosine_sim)



sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

from nltk import word_tokenize
sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]
print(texts)

from gensim import corpora
from gensim.similarities import Similarity

# Build the dictionary from the tokenized texts
dictionary = corpora.Dictionary(texts)

# Use doc2bow to get the bag-of-words representation of each sentence
corpus = [dictionary.doc2bow(text) for text in texts]
similarity = Similarity('-Similarity-index', corpus, num_features=len(dictionary))
print(similarity)
# Query the similarity of a sentence against the indexed corpus
new_sentence = sent1
test_corpus_1 = dictionary.doc2bow(word_tokenize(new_sentence))

cosine_sim = similarity[test_corpus_1][1]
print("Similarity of the two sentences computed with gensim: %.4f." % cosine_sim)

That concludes this article. Thanks for reading! If anything here is inaccurate, please contact the author; comments and discussion are welcome. Good luck!
