Paper sharing --> A summary of the word2vec papers

Posts are published first on the WeChat public account "跟我一起讀論文啦啦", which regularly shares high-quality papers on machine learning, deep learning, data mining, natural language processing, and more. You are welcome to follow it!

For a long time I was fuzzy about word2vec and about how word embeddings are implemented under the hood in TensorFlow, so I decided to read the two original word2vec papers, Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality. After reading them I still only half understood, so I went through everything again with the help of some good online explanations (Word2Vec Tutorial - The Skip-Gram Model) and an open-source implementation. This post summarizes the result.

The following introduces word2vec mainly through the skip-gram model.

The word2vec workflow

  1. word2vec is just a three-layer neural network.
  2. Feed the model a word and train it to predict the words around that word.
  3. Then throw away the last layer and keep only the input_layer and hidden_layer.
  4. Feed any word from the vocabulary to the trained model; the hidden_layer output is that word's embedding representation.
import numpy as np
import tensorflow as tf
corpus_raw = 'He is the king . The king is royal . She is the royal  queen '
# convert to lower case
corpus_raw = corpus_raw.lower()

The code above is simple enough. Now we need to build the input/output pairs. Suppose the task is: given a word, collect the words around it, i.e. the n words before it and the n words after it. This n is the window_size in the code, as in the figure below:

[Figure: skip-gram training pairs taken from a window of size 2 around each word]

Note: if the word is at the beginning or end of a sentence, the window simply ignores the positions that fall outside the sentence.

We first do some simple preprocessing of the text and build a word2int dictionary and an int2word dictionary.

words = []
for word in corpus_raw.split():
    if word != '.': # because we don't want to treat . as a word
        words.append(word)
words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words
for i,word in enumerate(words):
    word2int[word] = i
    int2word[i] = word

Let's see what these dictionaries give us:

print(word2int['queen'])
-> 42 (say)
print(int2word[42])
-> 'queen'

Good, now we can build the training data:

# split the corpus into sentences, and each sentence into words
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
    sentences.append(sentence.split())

data = []
WINDOW_SIZE = 2
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1] : 
            if nb_word != word:
                data.append([word, nb_word])

The code above splits the corpus into sentences and the sentences into words, producing training samples of the form [word, nb_word], where word is the model input and nb_word is one of the words around it.

datadata打印出來看看?

print(data)
[['he', 'is'],
 ['he', 'the'],
 ['is', 'he'],
 ['is', 'the'],
 ['is', 'king'],
 ['the', 'he'],
 ['the', 'is'],
 ['the', 'king'],
.
.
.
]

Now we have training data, but it still has to be converted into a form the model can understand; this is where the word2int dictionary comes in.

Let's take it one step further and turn each word into a one-hot vector.

i.e., 
say we have a vocabulary of 3 words : pen, pineapple, apple
where 
word2int['pen'] -> 0 -> [1 0 0]
word2int['pineapple'] -> 1 -> [0 1 0]
word2int['apple'] -> 2 -> [0 0 1]

So why one-hot features? That will be explained shortly.

# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp
x_train = [] # input word
y_train = [] # output word
for data_word in data:
    x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
    y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))
# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)

Build the model with TensorFlow

# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))

[Figure: the one-hot input is projected onto the hidden (embedding) layer]

As the figure above shows, we turn the input into an embedding representation, reducing the dimensionality from vocab_size down to the chosen embedding_dim.

EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)

Next, we use a softmax layer to predict the words around the input word.


W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))

So the overall pipeline is:


input_one_hot  --->  embedded repr. ---> predicted_neighbour_prob
predicted_prob will be compared against a one hot vector to correct it.
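
Before training, it may help to sanity-check the shapes of this pipeline with plain numpy (a toy sketch: 7 vocabulary words and EMBEDDING_DIM = 5 as in this post's example, with random weights):

import numpy as np

vocab_size, embedding_dim = 7, 5
x_one_hot = np.zeros((1, vocab_size))
x_one_hot[0, 2] = 1                              # one-hot input word
W1 = np.random.randn(vocab_size, embedding_dim)
W2 = np.random.randn(embedding_dim, vocab_size)

hidden = x_one_hot @ W1                          # (1, 5)  embedded repr.
logits = hidden @ W2                             # (1, 7)
probs = np.exp(logits) / np.exp(logits).sum()    # softmax: predicted neighbour probabilities
print(hidden.shape, probs.shape)                 # (1, 5) (1, 7)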

OK, let's see how to train this model.

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init) #make sure you do this!
# define the loss function:
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))
# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)
n_iters = 10000
# train for n_iter iterations
for _ in range(n_iters):
    sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
    print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))

During training you can watch the loss go down:

loss is :  2.73213
loss is :  2.30519
loss is :  2.11106
loss is :  1.9916
loss is :  1.90923
loss is :  1.84837
loss is :  1.80133
loss is :  1.76381
loss is :  1.73312
loss is :  1.70745
loss is :  1.68556
loss is :  1.66654
loss is :  1.64975
loss is :  1.63472
loss is :  1.62112
loss is :  1.6087
loss is :  1.59725
loss is :  1.58664
loss is :  1.57676
loss is :  1.56751
loss is :  1.55882
loss is :  1.55064
loss is :  1.54291
loss is :  1.53559
loss is :  1.52865
loss is :  1.52206
loss is :  1.51578
loss is :  1.50979
loss is :  1.50408
loss is :  1.49861
.
.
.

Eventually the loss converges. Even though the accuracy never gets very high, we don't care about that here; our real goal is a good W1 and b1, i.e. the hidden representation.
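
Once training is done, the word vectors are read out exactly as in the full code at the end of this post: each row of W1 + b1 is one word's embedding.

vectors = sess.run(W1 + b1)   # shape: (vocab_size, EMBEDDING_DIM)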

Why one-hot?


When we multiply a one-hot vector by W1, we simply pick out one row of W1, so W1 acts as a lookup table.
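
A tiny numpy check of this lookup behaviour (the numbers are made up purely for illustration):

import numpy as np

W1 = np.arange(15).reshape(5, 3)      # pretend vocab_size = 5, EMBEDDING_DIM = 3
one_hot = np.array([0, 0, 1, 0, 0])   # one-hot vector for word index 2

print(one_hot @ W1)   # [6 7 8]  -- multiplying selects row 2 of W1
print(W1[2])          # [6 7 8]  -- the same row, read directly as a table lookup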

In our code example, we can look at the representation of "queen" in W1:

print(vectors[ word2int['queen'] ])
# say here word2int['queen'] is 2
-> 
[-0.69424796 -1.67628145  3.07313657 -1.14802659 -1.2207377 ]

Given a vector, we can find the vector closest to it:

def euclidean_dist(vec1, vec2):
    return np.sqrt(np.sum((vec1-vec2)**2))

def find_closest(word_index, vectors):
    min_dist = 10000 # to act like positive infinity
    min_index = -1
    query_vector = vectors[word_index]
    for index, vector in enumerate(vectors):
        if euclidean_dist(vector, query_vector) < min_dist and not np.array_equal(vector, query_vector):
            min_dist = euclidean_dist(vector, query_vector)
            min_index = index
    return min_index

Let's look at which words are closest to "king", "queen", and "royal":

print(int2word[find_closest(word2int['king'], vectors)])
print(int2word[find_closest(word2int['queen'], vectors)])
print(int2word[find_closest(word2int['royal'], vectors)])
->
queen
king
he

Going further

Everything above comes mainly from the first paper, Efficient Estimation of Word Representations in Vector Space. Although the model is only a three-layer neural network, training it on a massive corpus takes enormous compute. For example, with embedding_size = 300 and vocab_size = 10,000, the W1 matrix alone has 10,000 × 300 = 3 million weights! Optimizing that with plain SGD becomes very slow, yet you often have to train on a lot of data to avoid overfitting. The paper Distributed Representations of Words and Phrases and their Compositionality introduces several ways around this.

  • Subsample the data to reduce the number of training samples.
    The word2vec implementation that ships with TensorFlow does not treat every distinct word as vocabulary: it first counts how often each word occurs and keeps only the 50,000 most frequent words as the vocabulary, replacing every remaining word with unk. See tensorflow/examples/tutorials/word2vec/word2vec_basic.py for the details; a rough sketch of this vocabulary step is given after this list.

  • Use so-called "negative sampling", which lets each training sample update only a small slice of the weight matrix, cutting down the computation per step (see the sketch after this list).
    For example, take an input/output pair such as ("fox", "quick"). From the analysis above, its true label is a one-hot vector of length vocab_size that is 1 only at the position of "quick" and 0 everywhere else, so every sample would otherwise update the whole weight matrix, which is slow. Negative sampling instead randomly picks a few other words whose target output is 0 and updates only the weights at those positions. With 5 negative samples, the output layer effectively has just 6 active neurons (the true word plus the 5 sampled ones): instead of updating 300 × 10,000 = 3,000,000 parameters per step, we only update 300 × 6 = 1,800.

  • Hierarchical softmax is another common technique in NLP (see any write-up on hierarchical softmax for details). The main idea is to build a Huffman tree over the word frequencies, with the vocabulary words as leaves, so that frequent words sit closer to the root. To compute the probability of generating a word, we no longer normalize over the entire vocabulary; we only follow the path from the root to that word's leaf and multiply the decisions along the way, which reduces the time complexity from O(N) to O(log N) (a tree with N leaves has depth about log N).
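
To make the first point concrete, here is a rough sketch of that frequency-capped vocabulary (a hypothetical helper in the spirit of word2vec_basic.py, not its exact code):

import collections

def build_vocab(words, max_vocab_size=50000):
    # keep the 50,000 most frequent words; everything else maps to 'UNK' (index 0)
    counts = [['UNK', -1]]
    counts.extend(collections.Counter(words).most_common(max_vocab_size - 1))
    word2id = {word: i for i, (word, _) in enumerate(counts)}
    data = [word2id.get(word, 0) for word in words]
    return data, word2id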
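
And here is a minimal TF1-style sketch of negative sampling using tf.nn.nce_loss, which is what the official TensorFlow word2vec tutorial uses (the variable names are illustrative; note that the labels are word indices rather than one-hot vectors):

train_inputs = tf.placeholder(tf.int32, shape=[None])     # centre-word ids
train_labels = tf.placeholder(tf.int32, shape=[None, 1])  # context-word ids

embeddings = tf.Variable(tf.random_uniform([vocab_size, EMBEDDING_DIM], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # lookup instead of one-hot matmul

nce_weights = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# each step only touches the true context word plus num_sampled negative words,
# so the output side updates 6 rows instead of all vocab_size rows
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=5,
                   num_classes=vocab_size))
train_step = tf.train.GradientDescentOptimizer(1.0).minimize(loss)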

Personal takeaways

  • In a seq2seq model the input is multiplied by an embedding_matrix and the output by an embedding_matrix^T; sometimes the two embedding matrices are shared ("tied") and sometimes they are not (a small sketch of tying is given below). In my view, word2vec is essentially the prototype of the seq2seq model: the same skeleton applied to more complex settings, with mechanisms such as attention added internally as needed, while the overall framework still looks a lot like word2vec.
  • word2vec is one of the most fundamental pieces of today's NLP, and a deep understanding of how it works matters a great deal.
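
A minimal sketch of what sharing ("tying") the two embedding matrices looks like, in the same TF1 style as the rest of this post (the names are illustrative, not taken from any particular library):

embedding_matrix = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
hidden = tf.matmul(x, embedding_matrix)                     # input side: vocab_size -> EMBEDDING_DIM
logits = tf.matmul(hidden, tf.transpose(embedding_matrix))  # output side reuses the same weights, transposed
prediction = tf.nn.softmax(logits)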

Personally, I feel this level of understanding of word2vec is about enough.

Full code:

import tensorflow as tf
import numpy as np

corpus_raw = 'He is the king . The king is royal . She is the royal  queen '

# convert to lower case
corpus_raw = corpus_raw.lower()

words = []
for word in corpus_raw.split():
    if word != '.': # because we don't want to treat . as a word
        words.append(word)

words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words

for i,word in enumerate(words):
    word2int[word] = i
    int2word[i] = word

# raw sentences is a list of sentences.
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
    sentences.append(sentence.split())

WINDOW_SIZE = 2

data = []
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1] : 
            if nb_word != word:
                data.append([word, nb_word])

# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp

x_train = [] # input word
y_train = [] # output word

for data_word in data:
    x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
    y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))

# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)

# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))

EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)

W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))


sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init) #make sure you do this!

# define the loss function:
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))

# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)

n_iters = 10000
# train for n_iter iterations

for _ in range(n_iters):
    sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
    print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))

vectors = sess.run(W1 + b1)

def euclidean_dist(vec1, vec2):
    return np.sqrt(np.sum((vec1-vec2)**2))

def find_closest(word_index, vectors):
    min_dist = 10000 # to act like positive infinity
    min_index = -1
    query_vector = vectors[word_index]
    for index, vector in enumerate(vectors):
        if euclidean_dist(vector, query_vector) < min_dist and not np.array_equal(vector, query_vector):
            min_dist = euclidean_dist(vector, query_vector)
            min_index = index
    return min_index


from sklearn.manifold import TSNE

model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
vectors = model.fit_transform(vectors) 

from sklearn import preprocessing

normalizer = preprocessing.Normalizer(norm='l2')
vectors = normalizer.fit_transform(vectors)

print(vectors)

import matplotlib.pyplot as plt


fig, ax = plt.subplots()
print(words)
for word in words:
    print(word, vectors[word2int[word]][1])
    ax.scatter(vectors[word2int[word]][0], vectors[word2int[word]][1])  # plot the point so the axes autoscale
    ax.annotate(word, (vectors[word2int[word]][0], vectors[word2int[word]][1]))
plt.show()