NLP(1) - Training Word2vec with gensim

Word2vec

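In NLP, anyone who wants to process text has to decide how to represent words. Before Word2vec, words were represented with one-hot encoding: each word is a vector containing only 0s and 1s, with a 1 at the position of that word and 0 at the positions of all other words. This encoding has several drawbacks. One is that the Euclidean distance between any two distinct words is always the same, which is clearly unreasonable. For example, apple and banana should be close to each other, while apple and dog should be farther apart.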
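To make the distance problem concrete, the following minimal sketch builds one-hot vectors for a toy three-word vocabulary and shows that every pair of distinct words ends up the same distance apart. The vocabulary and the helper name `one_hot` are assumptions chosen for illustration.

```python
import math

# Toy vocabulary, chosen only for illustration.
vocab = ["apple", "banana", "dog"]


def one_hot(word, vocab):
    """Return a one-hot vector: 1 at the word's position, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]


apple, banana, dog = (one_hot(w, vocab) for w in vocab)

# Two distinct one-hot vectors differ in exactly two positions, so the
# Euclidean distance is always sqrt(2), whatever the words mean.
print(math.dist(apple, banana))
print(math.dist(apple, dog))
```

Both printed distances are equal, so the encoding carries no information about which words are related.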

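As the name suggests, Word2vec converts a word into a vector, but unlike one-hot encoding it captures the relationships between words much better. The dimensionality of the vectors is a parameter that has to be specified before training. The higher the dimensionality, the more information the vectors can hold, but the more time and memory training takes.

Since this post focuses on how to train Word2vec, the underlying algorithm is not covered here.

Third-party libraries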

gensim

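gensim is a third-party Python toolkit for topic modeling, mainly used for natural language processing (NLP) and information retrieval (IR). It includes many efficient, easy-to-train algorithms, such as Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) and word2vec. This post uses the word2vec model in gensim.

Installing gensim is simple: use either pip or conda.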

pip install gensim
conda install gensim

nltk

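nltk stands for Natural Language Toolkit, a third-party library for natural language processing. It provides more than 50 corpora and lexical resources, along with text processing methods such as classification, tokenization and tagging. This post mainly uses nltk's tokenizer and stop words.

nltk is installed the same way as gensim, with pip or conda.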

pip install nltk
conda install nltk

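After installation, nltk's resources still have to be downloaded (its built-in corpora and modules are downloaded separately). Start Python from the command line and run: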

>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('punkt')

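On Windows, the downloaded content can be found in C:\Users\<your username>\AppData\Roaming\nltk_data. The download may fail on a poor network connection. In that case, download the corresponding zip archives manually, put them into the folders listed below and extract them there. The resources can then be used without going through the download step above.

There are two official download locations: http://www.nltk.org/nltk_data/ and https://github.com/nltk/nltk_data/tree/gh-pages/packages. Find the zip archives you need there.

Extract stopwords.zip into C:\Users\<your username>\AppData\Roaming\nltk_data\corpora and punkt.zip into C:\Users\<your username>\AppData\Roaming\nltk_data\tokenizers.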
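If the archives have to be installed by hand, the unpacking step can also be scripted. The sketch below uses only the standard library. The function name `install_nltk_zip` and its arguments are assumptions made for illustration, and the nltk_data folder must be passed in explicitly.

```python
import zipfile
from pathlib import Path


def install_nltk_zip(zip_path, nltk_data_dir, category):
    """Extract a manually downloaded resource archive into nltk_data/<category>.

    category is 'corpora' for stopwords.zip and 'tokenizers' for punkt.zip.
    Returns the directory the archive was extracted into.
    """
    target = Path(nltk_data_dir) / category
    target.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `install_nltk_zip("stopwords.zip", nltk_data_dir, "corpora")` would place the stop word lists where nltk looks for them, with `nltk_data_dir` set to the nltk_data path on your machine.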

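Training Word2vec

Corpus

The corpus used in this post is 20 Newsgroups. It has 20 categories, each containing a number of short articles. The corpus is available for download online. A Word2vec model trained on this corpus can be used in downstream classification tasks.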

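Preprocessing

Before training, the text has to be preprocessed. The word2vec model in gensim takes fully tokenized text as input, for example ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good'].

Looking at the file /alt/atheism/49960 in the dataset, the first 21 lines are not body text, and feeding them to the model would affect the result (every file in the dataset has a similar header). The text therefore has to be cleaned first. The requirements are: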

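no punctuation;
all words converted to lowercase;
no blank lines.

In short, what we want is the list of words that make up each article. Because the text is messy, the resulting tokens may include numbers, strings made up of symbols, stop words and so on, so further filter conditions are needed.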
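The requirements above can be sketched with the standard library alone. This is an illustrative simplification, not the tokenizer used later in this post: it splits on anything that is not a letter or digit, and the tiny `STOP_WORDS` set is a stand-in for nltk's English stop word list.

```python
import re

# Stand-in stop word set for illustration; the post uses nltk's list instead.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def clean_text(text):
    """Lowercase, strip punctuation, and drop blank lines, numbers and stop words."""
    tokens = []
    for line in text.splitlines():
        line = line.strip().lower()
        if not line:  # skip blank lines
            continue
        # Treat every run of non-alphanumeric characters as a separator.
        for token in re.split(r"[^a-z0-9]+", line):
            if token and not token.isdigit() and token not in STOP_WORDS:
                tokens.append(token)
    return tokens


print(clean_text("The screen is 15 inches.\n\nGreat colors, and sharp!"))
# prints ['screen', 'inches', 'great', 'colors', 'sharp']
```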

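For this dataset, a simple way to clean the text is to check whether a line contains a colon (:). If it does, the line is treated as part of the file header; otherwise it is body text. The effect of this rule on the body text is fairly small. The code is as follows: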

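with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        # Strip the trailing newline first, otherwise a blank line is '\n', not ''.
        line = line.strip().lower()
        # Skip blank lines and header lines (anything containing a colon).
        if line == '' or ':' in line:
            continue
        # `line` is now a lowercase line of body text.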
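To see what this heuristic keeps and what it drops, here is a small sketch that runs it on an in-memory message instead of a file. The sample text and the name `body_lines` are made up for illustration. Note that a body line which happens to contain a colon is dropped as well, which is the price of such a simple rule.

```python
import io

# Invented sample in the style of a newsgroup post (not taken from the corpus).
SAMPLE = """From: someone@example.com
Subject: Monitor question
Lines: 3

My new screen flickers.
Note: it only happens at night.
Any ideas?
"""


def body_lines(stream):
    """Return lowercase lines that are non-empty and contain no colon."""
    kept = []
    for line in stream:
        line = line.strip().lower()
        if line == "" or ":" in line:
            continue
        kept.append(line)
    return kept


print(body_lines(io.StringIO(SAMPLE)))
# prints ['my new screen flickers.', 'any ideas?']
```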

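Next, strip all Chinese and English punctuation, tokenize with nltk, and filter out purely numeric tokens and stop words. Since some files may contain bytes that cannot be decoded, exception handling is added. The code is as follows: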

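import re

import nltk
from nltk.corpus import stopwords

list_stopwords = set(stopwords.words('english'))
# ASCII and full-width (Chinese) punctuation, replaced by spaces before tokenizing.
PUNCTUATION = re.compile(r"[\s.!/_,$%^*(+\"':<>\-)?]+|[+——！，。？、~@#￥%……&*（）]+")


def fetch_tokens(file_path):
    tot_words = []
    with open(file_path, 'r', encoding='utf-8') as file:
        try:
            for line in file:
                line = line.strip()
                # Skip blank lines and header lines (anything containing a colon).
                if line == '' or ':' in line:
                    continue
                line = PUNCTUATION.sub(" ", line).lower()
                tot_words.extend(nltk.tokenize.word_tokenize(line))
        except UnicodeDecodeError:
            # Some files contain bytes that are not valid UTF-8; keep what was read so far.
            print("Error happened on file {}, PASS".format(file_path))
    # Drop stop words and purely numeric tokens.
    return [w for w in tot_words if w not in list_stopwords and not w.isdigit()]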

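Given a file, this returns all of its tokens. Depending on the situation, the filter conditions can be refined further. For example, an article may contain very long words, which are either meaningless strings or occur so rarely that they can be ignored.

Training with gensim

Training with gensim is simple: pass in the tokens of all articles, with the tokens of each article stored as a list. The code is as follows: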
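One such extra filter might look like the sketch below. The threshold of 20 characters and the name `drop_long_tokens` are arbitrary assumptions for illustration, and a sensible cutoff depends on the corpus.

```python
def drop_long_tokens(tokens, max_len=20):
    """Keep only tokens of at most max_len characters."""
    return [t for t in tokens if len(t) <= max_len]


tokens = ["screen", "x" * 35, "monitor"]
print(drop_long_tokens(tokens))
# prints ['screen', 'monitor']
```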

import gensim

sentences = [['first', 'sentence'], ['second', 'sentence']]
model = gensim.models.Word2Vec(sentences, min_count=1)

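The min_count parameter makes the model ignore all words that occur fewer times than this value.

When there are many sentences, holding them all in a list takes a lot of memory. A more memory-efficient approach is to use Python's yield keyword and produce the tokens of one file at a time. The preprocessing above is rewritten as follows: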
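The effect of min_count can be illustrated without gensim. The sketch below only mimics the idea of pruning the vocabulary by frequency with a Counter. It is not gensim's implementation, and `prune_vocab` is a hypothetical helper.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Return the set of words occurring at least min_count times in total."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


sentences = [["first", "sentence"], ["second", "sentence"]]
print(sorted(prune_vocab(sentences, min_count=1)))  # all three words survive
print(sorted(prune_vocab(sentences, min_count=2)))  # only 'sentence' occurs twice
```

gensim's default for min_count is 5, which is why a toy corpus like this one needs min_count=1 to keep any words at all.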

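import os
import re

import nltk
from nltk.corpus import stopwords


class MySentence:
    # ASCII and full-width (Chinese) punctuation, replaced by spaces before tokenizing.
    PUNCTUATION = re.compile(r"[\s.!/_,$%^*(+\"':<>\-)?]+|[+——！，。？、~@#￥%……&*（）]+")

    def __init__(self, dir_name):
        self.dir = dir_name
        self.dir_list = os.listdir(dir_name)
        self.list_stopwords = set(stopwords.words('english'))

    def __iter__(self):
        # Yield the tokens of one file at a time instead of building one big list.
        for sub_dir in self.dir_list:
            sub_path = os.path.join(self.dir, sub_dir)
            if not os.path.isdir(sub_path):
                continue  # skip stray files that are not category folders
            for file in os.listdir(sub_path):
                yield self.fetch_tokens(os.path.join(sub_path, file))

    def fetch_tokens(self, file_path):
        tot_words = []
        with open(file_path, 'r', encoding='utf-8') as file:
            try:
                for line in file:
                    line = line.strip()
                    # Skip blank lines and header lines (anything containing a colon).
                    if line == '' or ':' in line:
                        continue
                    line = self.PUNCTUATION.sub(" ", line).lower()
                    tot_words.extend(nltk.tokenize.word_tokenize(line))
            except UnicodeDecodeError:
                # Some files contain bytes that are not valid UTF-8; keep what was read.
                print("Error happened on file {}, PASS".format(file_path))
        # Drop stop words and purely numeric tokens.
        return [w for w in tot_words
                if w not in self.list_stopwords and not w.isdigit()]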
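A class with an __iter__ method is used here rather than a bare generator function because training goes over the corpus more than once (one pass to build the vocabulary, then the training passes), and a generator object can only be consumed once. The sketch below shows the difference on toy data. `token_generator` and `TokenStream` are illustrative names.

```python
def token_generator():
    """A generator function: the object it returns is exhausted after one pass."""
    yield ["first", "sentence"]
    yield ["second", "sentence"]


class TokenStream:
    """A restartable iterable: every call to __iter__ starts a fresh pass."""

    def __iter__(self):
        yield ["first", "sentence"]
        yield ["second", "sentence"]


gen = token_generator()
print(len(list(gen)), len(list(gen)))        # prints 2 0: the second pass is empty

stream = TokenStream()
print(len(list(stream)), len(list(stream)))  # prints 2 2: both passes see the data
```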

MySentence is used as follows:

sentences = MySentence(CORPUS_DIR)
model = gensim.models.Word2Vec(sentences)
model.save(SAVE_PATH)

This trains the word2vec model and saves it to SAVE_PATH.

Loading Word2vec

To load a Word2vec model, use the load function in gensim:

model = gensim.models.Word2Vec.load(FILE_PATH)

Get the number of words in the Word2vec model:

# gensim 4.x; on gensim 3.x use len(model.wv.vocab) instead
print(len(model.wv))

Look up the vector representation of a word:

# Index through model.wv; indexing the model object directly fails on gensim 4.x
print(model.wv['screen'])

Find the words most similar to a given word:

# most_similar lives on model.wv; model.most_similar was removed in gensim 4.x
print(model.wv.most_similar('screen'))
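most_similar ranks words by the cosine similarity between their vectors. As a rough illustration of that measure, independent of gensim, the sketch below computes it for hand-made two-dimensional vectors. The vectors and the helper name `cosine_similarity` are invented for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors, not real Word2vec output.
screen = [1.0, 0.0]
monitor = [2.0, 0.0]   # same direction as screen
banana = [0.0, 3.0]    # orthogonal to screen

print(cosine_similarity(screen, monitor))  # prints 1.0
print(cosine_similarity(screen, banana))   # prints 0.0
```

Only the direction of the vectors matters here: monitor is twice as long as screen but still has similarity 1.0.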

Look up the index of every word in the Word2vec model:

# gensim 4.x; on gensim 3.x iterate model.wv.vocab.items() and read obj.index
for word, index in model.wv.key_to_index.items():
    print(word, index)
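A word-to-index mapping of this kind is what a downstream model typically needs in order to turn tokens into integer ids for an embedding lookup. Below is a minimal sketch, assuming a plain dict stands in for the model's vocabulary and that unknown words are simply skipped. Both choices, and the name `encode`, are assumptions for illustration.

```python
def encode(tokens, word_to_index):
    """Map tokens to their vocabulary indices, skipping out-of-vocabulary words."""
    return [word_to_index[t] for t in tokens if t in word_to_index]


# Toy mapping standing in for the trained model's vocabulary.
word_to_index = {"screen": 0, "monitor": 1, "keyboard": 2}

print(encode(["screen", "flickers", "keyboard"], word_to_index))
# prints [0, 2]
```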

Code

import os
import re

import gensim
import nltk
from nltk.corpus import stopwords

CORPUS_DIR = "./data/20_newsgroup"
CORPUS = '20_newsgroup'
SIZE = 200
WINDOW = 10


class MySentence:
    # ASCII and full-width (Chinese) punctuation, replaced by spaces before tokenizing.
    PUNCTUATION = re.compile(r"[\s.!/_,$%^*(+\"':<>\-)?]+|[+——！，。？、~@#￥%……&*（）]+")

    def __init__(self, dir_name):
        self.dir = dir_name
        self.dir_list = os.listdir(dir_name)
        self.list_stopwords = set(stopwords.words('english'))

    def __iter__(self):
        # Yield the tokens of one file at a time instead of building one big list.
        for sub_dir in self.dir_list:
            sub_path = os.path.join(self.dir, sub_dir)
            if not os.path.isdir(sub_path):
                continue  # skip stray files, such as the model saved below
            for file in os.listdir(sub_path):
                yield self.fetch_tokens(os.path.join(sub_path, file))

    def fetch_tokens(self, file_path):
        tot_words = []
        with open(file_path, 'r', encoding='utf-8') as file:
            try:
                for line in file:
                    line = line.strip()
                    # Skip blank lines and header lines (anything containing a colon).
                    if line == '' or ':' in line:
                        continue
                    line = self.PUNCTUATION.sub(" ", line).lower()
                    tot_words.extend(nltk.tokenize.word_tokenize(line))
            except UnicodeDecodeError:
                # Some files contain bytes that are not valid UTF-8; keep what was read.
                print("Error happened on file {}, PASS".format(file_path))
        # Drop stop words and purely numeric tokens.
        return [w for w in tot_words
                if w not in self.list_stopwords and not w.isdigit()]


if __name__ == '__main__':
    sentences = MySentence(CORPUS_DIR)
    # gensim 4.x names this parameter vector_size; gensim 3.x called it size.
    model = gensim.models.Word2Vec(sentences,
                                   vector_size=SIZE,
                                   window=WINDOW,
                                   min_count=10)
    model.save("{}/word2vec_{}_{}".format(CORPUS_DIR, CORPUS, SIZE))

    # load Word2vec model
    model = gensim.models.Word2Vec.load("{}/word2vec_{}_{}".format(CORPUS_DIR, CORPUS, SIZE))
    # number of words in the Word2vec model (gensim 3.x: len(model.wv.vocab))
    print(len(model.wv))
    # fetch the vector representation of screen
    print(model.wv['screen'])
    # fetch the words most similar to screen
    print(model.wv.most_similar('screen'))
    # fetch the index of every word (gensim 3.x: model.wv.vocab.items(), obj.index)
    for word, index in model.wv.key_to_index.items():
        print(word, index)

