Paper sharing --> A summary of the word2vec papers

Posts are published first on the WeChat public account "跟我一起讀論文啦啦", which regularly shares high-quality papers on machine learning, deep learning, data mining, natural language processing, and more. You are welcome to follow it!

For a long time I was fuzzy about word2vec and about how word embeddings are implemented under the hood in TensorFlow, so I decided to read the two original word2vec papers, Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality. After reading them I still only half understood, so I went through everything again with the help of some good online explanations (Word2Vec Tutorial - The Skip-Gram Model) and an open-source implementation. This post summarizes the result.

The following introduces word2vec mainly through the skip-gram model.

The word2vec workflow

  1. word2vec is just a three-layer neural network.
  2. Feed the model a word and train it to predict the words around that word.
  3. Then throw away the last layer and keep only the input_layer and hidden_layer.
  4. Feed any word from the vocabulary to the trained model; the hidden_layer output is that word's embedding representation.
import numpy as np
import tensorflow as tf
corpus_raw = 'He is the king . The king is royal . She is the royal  queen '
# convert to lower case
corpus_raw = corpus_raw.lower()

The code above is simple enough. Now we need to build the input/output pairs. Suppose the task is: given a word, collect the words around it, i.e. the n words before it and the n words after it. This n is the window_size in the code, as in the figure below:

[Figure: skip-gram training pairs taken from a window of size 2 around each word]

Note: if the word is at the beginning or end of a sentence, the window simply ignores the positions that fall outside the sentence.

We first do some simple preprocessing of the text and build a word2int dictionary and an int2word dictionary.

words = []
for word in corpus_raw.split():
    if word != '.': # because we don't want to treat . as a word
        words.append(word)
words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words
for i,word in enumerate(words):
    word2int[word] = i
    int2word[i] = word

Let's see what these dictionaries give us:

print(word2int['queen'])
-> 42 (say)
print(int2word[42])
-> 'queen'

Good, now we can build the training data:

# split the corpus into sentences, and each sentence into words
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
    sentences.append(sentence.split())

data = []
WINDOW_SIZE = 2
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1] : 
            if nb_word != word:
                data.append([word, nb_word])

The code above splits the corpus into sentences and the sentences into words, producing training samples of the form [word, nb_word], where word is the model input and nb_word is one of the words around it.

datadata打印出來看看?

print(data)
[['he', 'is'],
 ['he', 'the'],
 ['is', 'he'],
 ['is', 'the'],
 ['is', 'king'],
 ['the', 'he'],
 ['the', 'is'],
 ['the', 'king'],
.
.
.
]

Now we have training data, but it still has to be converted into a form the model can understand; this is where the word2int dictionary comes in.

Let's take it one step further and turn each word into a one-hot vector.

i.e., 
say we have a vocabulary of 3 words : pen, pineapple, apple
where 
word2int['pen'] -> 0 -> [1 0 0]
word2int['pineapple'] -> 1 -> [0 1 0]
word2int['apple'] -> 2 -> [0 0 1]

So why one-hot features? That will be explained shortly.

# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp
x_train = [] # input word
y_train = [] # output word
for data_word in data:
    x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
    y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))
# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)

Build the model with TensorFlow

# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))

[Figure: the one-hot input is projected onto the hidden (embedding) layer]

As the figure above shows, we turn the input into an embedding representation, reducing the dimensionality from vocab_size down to the chosen embedding_dim.

EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)

Next, we use a softmax layer to predict the words around the input word.


W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))

So the overall pipeline is:


input_one_hot  --->  embedded repr. ---> predicted_neighbour_prob
predicted_prob will be compared against a one hot vector to correct it.
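
Before training, it may help to sanity-check the shapes of this pipeline with plain numpy (a toy sketch: 7 vocabulary words and EMBEDDING_DIM = 5 as in this post's example, with random weights):

import numpy as np

vocab_size, embedding_dim = 7, 5
x_one_hot = np.zeros((1, vocab_size))
x_one_hot[0, 2] = 1                              # one-hot input word
W1 = np.random.randn(vocab_size, embedding_dim)
W2 = np.random.randn(embedding_dim, vocab_size)

hidden = x_one_hot @ W1                          # (1, 5)  embedded repr.
logits = hidden @ W2                             # (1, 7)
probs = np.exp(logits) / np.exp(logits).sum()    # softmax: predicted neighbour probabilities
print(hidden.shape, probs.shape)                 # (1, 5) (1, 7)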

OK, let's see how to train this model.

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init) #make sure you do this!
# define the loss function:
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))
# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)
n_iters = 10000
# train for n_iter iterations
for _ in range(n_iters):
    sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
    print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))

During training you can watch the loss go down:

loss is :  2.73213
loss is :  2.30519
loss is :  2.11106
loss is :  1.9916
loss is :  1.90923
loss is :  1.84837
loss is :  1.80133
loss is :  1.76381
loss is :  1.73312
loss is :  1.70745
loss is :  1.68556
loss is :  1.66654
loss is :  1.64975
loss is :  1.63472
loss is :  1.62112
loss is :  1.6087
loss is :  1.59725
loss is :  1.58664
loss is :  1.57676
loss is :  1.56751
loss is :  1.55882
loss is :  1.55064
loss is :  1.54291
loss is :  1.53559
loss is :  1.52865
loss is :  1.52206
loss is :  1.51578
loss is :  1.50979
loss is :  1.50408
loss is :  1.49861
.
.
.

Eventually the loss converges. Even though the accuracy never gets very high, we don't care about that here; our real goal is a good W1 and b1, i.e. the hidden representation.
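
Once training is done, the word vectors are read out exactly as in the full code at the end of this post: each row of W1 + b1 is one word's embedding.

vectors = sess.run(W1 + b1)   # shape: (vocab_size, EMBEDDING_DIM)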

Why one-hot?


When we multiply a one-hot vector by W1, we simply pick out one row of W1, so W1 acts as a lookup table.
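
A tiny numpy check of this lookup behaviour (the numbers are made up purely for illustration):

import numpy as np

W1 = np.arange(15).reshape(5, 3)      # pretend vocab_size = 5, EMBEDDING_DIM = 3
one_hot = np.array([0, 0, 1, 0, 0])   # one-hot vector for word index 2

print(one_hot @ W1)   # [6 7 8]  -- multiplying selects row 2 of W1
print(W1[2])          # [6 7 8]  -- the same row, read directly as a table lookup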

In our code example, we can look at the representation of "queen" in W1:

print(vectors[ word2int['queen'] ])
# say here word2int['queen'] is 2
-> 
[-0.69424796 -1.67628145  3.07313657 -1.14802659 -1.2207377 ]

Given a vector, we can find the vector closest to it:

def euclidean_dist(vec1, vec2):
    return np.sqrt(np.sum((vec1-vec2)**2))

def find_closest(word_index, vectors):
    min_dist = 10000 # to act like positive infinity
    min_index = -1
    query_vector = vectors[word_index]
    for index, vector in enumerate(vectors):
        if euclidean_dist(vector, query_vector) < min_dist and not np.array_equal(vector, query_vector):
            min_dist = euclidean_dist(vector, query_vector)
            min_index = index
    return min_index

Let's look at which words are closest to "king", "queen", and "royal":

print(int2word[find_closest(word2int['king'], vectors)])
print(int2word[find_closest(word2int['queen'], vectors)])
print(int2word[find_closest(word2int['royal'], vectors)])
->
queen
king
he

Going further

Everything above comes mainly from the first paper, Efficient Estimation of Word Representations in Vector Space. Although the model is only a three-layer neural network, training it on a massive corpus takes enormous compute. For example, with embedding_size = 300 and vocab_size = 10,000, the W1 matrix alone has 10,000 × 300 = 3 million weights! Optimizing that with plain SGD becomes very slow, yet you often have to train on a lot of data to avoid overfitting. The paper Distributed Representations of Words and Phrases and their Compositionality introduces several ways around this.

  • Subsample the data to reduce the number of training samples.
    The word2vec implementation that ships with TensorFlow does not treat every distinct word as vocabulary: it first counts how often each word occurs and keeps only the 50,000 most frequent words as the vocabulary, replacing every remaining word with unk. See tensorflow/examples/tutorials/word2vec/word2vec_basic.py for the details; a rough sketch of this vocabulary step is given after this list.

  • Use so-called "negative sampling", which lets each training sample update only a small slice of the weight matrix, cutting down the computation per step (see the sketch after this list).
    For example, take an input/output pair such as ("fox", "quick"). From the analysis above, its true label is a one-hot vector of length vocab_size that is 1 only at the position of "quick" and 0 everywhere else, so every sample would otherwise update the whole weight matrix, which is slow. Negative sampling instead randomly picks a few other words whose target output is 0 and updates only the weights at those positions. With 5 negative samples, the output layer effectively has just 6 active neurons (the true word plus the 5 sampled ones): instead of updating 300 × 10,000 = 3,000,000 parameters per step, we only update 300 × 6 = 1,800.

  • Hierarchical softmax is another common technique in NLP (see any write-up on hierarchical softmax for details). The main idea is to build a Huffman tree over the word frequencies, with the vocabulary words as leaves, so that frequent words sit closer to the root. To compute the probability of generating a word, we no longer normalize over the entire vocabulary; we only follow the path from the root to that word's leaf and multiply the decisions along the way, which reduces the time complexity from O(N) to O(log N) (a tree with N leaves has depth about log N).
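
To make the first point concrete, here is a rough sketch of that frequency-capped vocabulary (a hypothetical helper in the spirit of word2vec_basic.py, not its exact code):

import collections

def build_vocab(words, max_vocab_size=50000):
    # keep the 50,000 most frequent words; everything else maps to 'UNK' (index 0)
    counts = [['UNK', -1]]
    counts.extend(collections.Counter(words).most_common(max_vocab_size - 1))
    word2id = {word: i for i, (word, _) in enumerate(counts)}
    data = [word2id.get(word, 0) for word in words]
    return data, word2id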
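
And here is a minimal TF1-style sketch of negative sampling using tf.nn.nce_loss, which is what the official TensorFlow word2vec tutorial uses (the variable names are illustrative; note that the labels are word indices rather than one-hot vectors):

train_inputs = tf.placeholder(tf.int32, shape=[None])     # centre-word ids
train_labels = tf.placeholder(tf.int32, shape=[None, 1])  # context-word ids

embeddings = tf.Variable(tf.random_uniform([vocab_size, EMBEDDING_DIM], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # lookup instead of one-hot matmul

nce_weights = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# each step only touches the true context word plus num_sampled negative words,
# so the output side updates 6 rows instead of all vocab_size rows
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=5,
                   num_classes=vocab_size))
train_step = tf.train.GradientDescentOptimizer(1.0).minimize(loss)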

Personal takeaways

  • In a seq2seq model the input is multiplied by an embedding_matrix and the output by an embedding_matrix^T; sometimes the two embedding matrices are shared ("tied") and sometimes they are not (a small sketch of tying is given below). In my view, word2vec is essentially the prototype of the seq2seq model: the same skeleton applied to more complex settings, with mechanisms such as attention added internally as needed, while the overall framework still looks a lot like word2vec.
  • word2vec is one of the most fundamental pieces of today's NLP, and a deep understanding of how it works matters a great deal.
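
A minimal sketch of what sharing ("tying") the two embedding matrices looks like, in the same TF1 style as the rest of this post (the names are illustrative, not taken from any particular library):

embedding_matrix = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
hidden = tf.matmul(x, embedding_matrix)                     # input side: vocab_size -> EMBEDDING_DIM
logits = tf.matmul(hidden, tf.transpose(embedding_matrix))  # output side reuses the same weights, transposed
prediction = tf.nn.softmax(logits)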

Personally, I feel this level of understanding of word2vec is about enough.

Full code:

import tensorflow as tf
import numpy as np

corpus_raw = 'He is the king . The king is royal . She is the royal  queen '

# convert to lower case
corpus_raw = corpus_raw.lower()

words = []
for word in corpus_raw.split():
    if word != '.': # because we don't want to treat . as a word
        words.append(word)

words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words

for i,word in enumerate(words):
    word2int[word] = i
    int2word[i] = word

# raw sentences is a list of sentences.
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
    sentences.append(sentence.split())

WINDOW_SIZE = 2

data = []
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1] : 
            if nb_word != word:
                data.append([word, nb_word])

# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp

x_train = [] # input word
y_train = [] # output word

for data_word in data:
    x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
    y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))

# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)

# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))

EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)

W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))


sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init) #make sure you do this!

# define the loss function:
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))

# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)

n_iters = 10000
# train for n_iter iterations

for _ in range(n_iters):
    sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
    print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))

vectors = sess.run(W1 + b1)

def euclidean_dist(vec1, vec2):
    return np.sqrt(np.sum((vec1-vec2)**2))

def find_closest(word_index, vectors):
    min_dist = 10000 # to act like positive infinity
    min_index = -1
    query_vector = vectors[word_index]
    for index, vector in enumerate(vectors):
        if euclidean_dist(vector, query_vector) < min_dist and not np.array_equal(vector, query_vector):
            min_dist = euclidean_dist(vector, query_vector)
            min_index = index
    return min_index


from sklearn.manifold import TSNE

model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
vectors = model.fit_transform(vectors) 

from sklearn import preprocessing

normalizer = preprocessing.Normalizer(norm='l2')
vectors = normalizer.fit_transform(vectors)

print(vectors)

import matplotlib.pyplot as plt


fig, ax = plt.subplots()
print(words)
for word in words:
    print(word, vectors[word2int[word]][1])
    ax.scatter(vectors[word2int[word]][0], vectors[word2int[word]][1])  # plot the point so the axes autoscale
    ax.annotate(word, (vectors[word2int[word]][0], vectors[word2int[word]][1]))
plt.show()