This article is mainly translated from: https://radimrehurek.com/gensim/tut2.html

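This tutorial shows how to transform vectors that represent documents into another kind of vector. There are two main goals:

* To uncover hidden structure in the corpus, such as relationships between words, and then describe the documents in a new, more semantic way.
* To make the document representation more compact, which improves both efficiency (the new representation consumes fewer resources) and efficacy (noise is removed).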

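I. Review

In the earlier introduction to basic gensim usage, we covered how to extract features from a corpus and convert it into vectors (based on the bag-of-words model). The results from the previous chapter were: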

# The cleaned corpus: only nine sentences, representing nine documents
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
# The dictionary built from the corpus above; every word has an id
{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}
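# Result of converting the corpus above with the dictionary under the bag-of-words model. (0, 1.0) means that the word with id 0, i.e. "computer", appears once in the first document; the rest follow the same pattern.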
[[(0, 1.0), (1, 1.0), (2, 1.0)],
 [(0, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0)],
 [(2, 1.0), (5, 1.0), (7, 1.0), (8, 1.0)],
 [(1, 1.0), (5, 2.0), (8, 1.0)],
 [(3, 1.0), (6, 1.0), (7, 1.0)],
 [(9, 1.0)],
 [(9, 1.0), (10, 1.0)],
 [(9, 1.0), (10, 1.0), (11, 1.0)],
 [(4, 1.0), (10, 1.0), (11, 1.0)]]
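
The bag-of-words conversion reviewed above can be reproduced in a few lines of plain Python. This is only a minimal sketch: the helper name to_bow is hypothetical (it is not a gensim function), and the token ids are copied from the dictionary printed above rather than recomputed.

```python
from collections import Counter

# Cleaned corpus from the previous chapter: nine documents.
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

# Token ids copied verbatim from the dictionary shown above.
token2id = {
    'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
    'survey': 4, 'system': 5, 'time': 6, 'user': 7,
    'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11,
}


def to_bow(tokens, token2id):
    """Hypothetical helper: count known tokens and return sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted((token_id, float(n)) for token_id, n in counts.items())


bow_corpus = [to_bow(doc, token2id) for doc in texts]
for vec in bow_corpus:
    print(vec)
```

Words that are missing from the dictionary are silently skipped, which is also why a new text such as "Human computer interaction" ends up with only two entries later in this article.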

II. Loading the results of the previous chapter (the saved dictionary and corpus vectors)

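from gensim import corpora, models, similarities
import os

# Load the dictionary and corpus saved in the previous chapter, if present.
if os.path.exists('./gensim_out/deerwester.dict'):
    dictionary = corpora.Dictionary.load('./gensim_out/deerwester.dict')
    corpus = corpora.MmCorpus('./gensim_out/deerwester.mm')
    print("Using the previously saved dictionary and corpus vectors")
else:
    print("Please generate deerwester.dict and deerwester.mm first by following the previous chapter")

# from pprint import pprint
# pprint(dictionary.token2id)
# pprint(corpus)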

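III. Initializing a transformation model (Creating a transformation)

Transformation models are standard Python objects, usually initialized with a corpus.
We use the old corpus from tutorial 1, i.e. the corpus loaded above, to initialize (train) the transformation model. Different transformation models generally require different initialization parameters. In the case of TfIdf, "training" consists simply of going through the supplied corpus once and computing the document frequencies of all its terms. Training other models, such as Latent Semantic Analysis or Latent Dirichlet Allocation, is more involved and therefore takes more time.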

tfidf = models.TfidfModel(corpus)  # initialize a model

doc_bow = [(0, 1), (1, 1)]

print(tfidf[doc_bow])  # output: [(0, 0.70710678), (1, 0.70710678)]

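The tfidf model created above should be treated as a read-only object. It can be used to convert vectors from the old representation (the bag-of-words model of the previous section) to a new representation (such as tf-idf weights).
Suppose the new text is: "***Human computer interaction***"
doc_bow is the result of cleaning, tokenizing, and converting the new text to a bag-of-words vector as in the previous chapter. In (0, 1), the 0 is the id of "computer" and the 1 means that "computer" appears once in the new text; likewise, (1, 1) means that the word with id 1, "human", also appears once. The word "interaction" is not in the dictionary, so it does not appear in the vector.
The tfidf model converts the new text from a bag-of-words vector ([(0, 1), (1, 1)]) into a term frequency-inverse document frequency weight vector ([(0, 0.70710678), (1, 0.70710678)]); that is, "computer" has a weight of 0.70710678 and "human" has a weight of 0.70710678.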
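
To see where 0.70710678 comes from, the weights can be worked out by hand. The sketch below assumes the weighting that gensim's TfidfModel uses by default as far as I know (raw term frequency times log2(N / document frequency), followed by normalization to unit length); the function name tfidf_vector is hypothetical and this is not gensim's own code.

```python
import math

# Bag-of-words corpus from the review section (ids only matter for document frequency).
bow_corpus = [
    [(0, 1.0), (1, 1.0), (2, 1.0)],
    [(0, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0)],
    [(2, 1.0), (5, 1.0), (7, 1.0), (8, 1.0)],
    [(1, 1.0), (5, 2.0), (8, 1.0)],
    [(3, 1.0), (6, 1.0), (7, 1.0)],
    [(9, 1.0)],
    [(9, 1.0), (10, 1.0)],
    [(9, 1.0), (10, 1.0), (11, 1.0)],
    [(4, 1.0), (10, 1.0), (11, 1.0)],
]

# Document frequency: the number of documents each token id occurs in.
doc_freq = {}
for doc in bow_corpus:
    for token_id, _count in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf_vector(doc_bow, doc_freq, num_docs):
    """Assumed default scheme: tf * log2(N / df), then L2 normalization."""
    weighted = [
        (token_id, tf * math.log2(num_docs / doc_freq[token_id]))
        for token_id, tf in doc_bow
        if doc_freq.get(token_id, 0) > 0
    ]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    return [(token_id, w / norm) for token_id, w in weighted]


doc_bow = [(0, 1), (1, 1)]
print(tfidf_vector(doc_bow, doc_freq, len(bow_corpus)))
```

Both "computer" and "human" occur in two of the nine documents, so they receive the same idf; after normalization two equal weights each become 1/sqrt(2), which is the 0.70710678 printed above.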

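### IV. Serializing the transformed results
Calling model_name[corpus] only creates a wrapper around the old corpus document stream; the actual conversion is done on the fly, during document iteration. The whole corpus cannot be converted at the moment corpus_transformed = model[corpus] is called, because that would mean storing the result in memory, which contradicts gensim's goal of memory independence. If you will iterate over corpus_transformed multiple times and the transformation is costly, serialize the resulting corpus to disk first and then keep using the serialized copy.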
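
The lazy wrapper behaviour described above can be illustrated with a small stand-in class. This is a sketch of the idea only: LazyTransformedCorpus and double_weights are hypothetical names, and gensim's real wrapper is more elaborate.

```python
class LazyTransformedCorpus:
    """Hypothetical stand-in for the wrapper returned by model[corpus]."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        # Nothing is stored: each document is transformed when it is requested.
        for doc in self.corpus:
            yield self.transform(doc)


calls = []  # records every time the transformation actually runs


def double_weights(doc):
    """Toy transformation that doubles every weight."""
    calls.append(doc)
    return [(token_id, 2 * weight) for token_id, weight in doc]


toy_corpus = [[(0, 1.0)], [(1, 1.0), (2, 1.0)]]
wrapped = LazyTransformedCorpus(double_weights, toy_corpus)
calls_after_wrapping = len(calls)      # creating the wrapper transforms nothing

first_pass = list(wrapped)             # the transformation runs here
second_pass = list(wrapped)            # ...and runs again on every new pass
calls_after_two_passes = len(calls)

print(calls_after_wrapping, calls_after_two_passes)
```

Because every pass repeats the work, an expensive transformation is better written to disk once. In gensim the corpus from the previous chapter was stored in Matrix Market format, and the same serialization approach applies to a transformed corpus.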

Transform the corpus with tfidf:

corpus_tfidf = tfidf[corpus]

# initialize an LSI transformation
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)

# create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
corpus_lsi = lsi[corpus_tfidf]

lsi.print_topics(2)

Output:

topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"

topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"


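The results above show that "trees", "graph" and "minors" are related words and contribute the most to the first topic, while the second topic is mostly concerned with the other words.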

for doc in corpus_lsi:
    print(doc)

Output (the first five documents are more strongly associated with the second topic):

[(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications"
[(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time"
[(0, -0.090), (1, 0.724)] # "The EPS user interface management system"
[(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS"
[(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement"
[(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees"
[(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees"
[(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering"
[(0, -0.617), (1, 0.054)] # "Graph minors A survey"
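
The observation about the first five documents can be checked mechanically by picking, for each document, the topic with the largest absolute weight. The sketch below hard-codes the vectors printed above; dominant_topic is a hypothetical helper, and using the absolute value is one possible reading of "more strongly associated".

```python
# LSI vectors copied from the output above, one per document.
corpus_lsi_output = [
    [(0, -0.066), (1, 0.520)],
    [(0, -0.197), (1, 0.761)],
    [(0, -0.090), (1, 0.724)],
    [(0, -0.076), (1, 0.632)],
    [(0, -0.102), (1, 0.574)],
    [(0, -0.703), (1, -0.161)],
    [(0, -0.877), (1, -0.168)],
    [(0, -0.910), (1, -0.141)],
    [(0, -0.617), (1, 0.054)],
]


def dominant_topic(vec):
    """Return the topic id whose weight has the largest magnitude."""
    return max(vec, key=lambda pair: abs(pair[1]))[0]


dominant = [dominant_topic(vec) for vec in corpus_lsi_output]
print(dominant)  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Topic ids are zero-based, so id 1 is the "second topic" mentioned above and id 0 is the first.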


#### Saving and loading models

Save:

lsi.save('./gensim_out/model.lsi') # same for tfidf, lda, ...

Load:

lsi = models.LsiModel.load('./gensim_out/model.lsi')


### Transformation models available in gensim

* Term Frequency * Inverse Document Frequency, Tf-Idf

model = models.TfidfModel(corpus, normalize=True)

* Latent Semantic Indexing, LSI (or sometimes LSA)

model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
...
model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
lsi_vec = model[tfidf_vec]
...

* Random Projections, RP

model = models.RpModel(tfidf_corpus, num_topics=500)

* Latent Dirichlet Allocation, LDA

model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

* Hierarchical Dirichlet Process, HDP

model = models.HdpModel(corpus, id2word=dictionary)
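
Of the models listed above, random projection is the easiest to illustrate without any library. The sketch below is a hand-rolled illustration of the general idea (multiplying a sparse vector by a random matrix of scaled +1/-1 entries) and is not a description of how gensim's RpModel is implemented; make_projection and project are hypothetical names, and the tiny dimensions are chosen only for readability.

```python
import math
import random


def make_projection(num_terms, num_topics, seed=0):
    """Build a num_topics x num_terms matrix with entries +/- 1/sqrt(num_topics)."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    scale = 1.0 / math.sqrt(num_topics)
    return [
        [rng.choice((-scale, scale)) for _ in range(num_terms)]
        for _ in range(num_topics)
    ]


def project(sparse_vec, projection):
    """Map a sparse (id, weight) vector to a dense low-dimensional vector."""
    return [
        sum(row[token_id] * weight for token_id, weight in sparse_vec)
        for row in projection
    ]


# 12 terms in the dictionary, reduced to 4 dimensions for illustration.
projection = make_projection(num_terms=12, num_topics=4)
tfidf_vec = [(0, 0.70710678), (1, 0.70710678)]
reduced = project(tfidf_vec, projection)
print(reduced)
```

In practice the target dimensionality is much larger than 4 (the RpModel line above uses num_topics=500), so that distances between documents are approximately preserved.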
