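This post is first published on the WeChat public account "跟我一起讀論文啦啦", which regularly shares high-quality papers on machine learning, deep learning, data mining, natural language processing, and related topics. You are welcome to follow it!
For a long time I had only a vague grasp of word2vec and of how it is implemented under the hood, so I decided to read the two original word2vec papers. After reading them I still only half understood, so I also went through a good online explanation (Word2Vec Tutorial - The Skip-Gram Model) and an open-source implementation. This post summarizes what I learned.
The discussion below mainly uses the skip-gram model.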
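The word2vec workflow
- word2vec is just a three-layer neural network.
- Feed the model a word and train it to predict the words around that word.
- Then remove the last (output) layer and keep only the input layer and the hidden layer.
- Pick a word from the vocabulary and feed it to the model; the hidden layer then gives that word's embedding.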
import numpy as np
import tensorflow as tf
corpus_raw = 'He is the king . The king is royal . She is the royal queen '
# convert to lower case
corpus_raw = corpus_raw.lower()
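The code above is simple and easy to follow. Next we need the training data. Suppose the task is this: given a word fed to the model, we want the words around it, that is, the n words before it and the n words after it. This n is WINDOW_SIZE in the code.
Note: if the word is at the beginning or end of a sentence, the part of the window that falls outside the sentence is ignored.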
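To make the window rule concrete, here is a minimal sketch in plain Python. The helper name context_window is my own (it does not appear in the original code); it simply clips the window at both ends of the sentence.

```python
def context_window(sentence, index, window_size):
    """Return the neighbours of sentence[index] within window_size positions."""
    start = max(index - window_size, 0)                   # clip at the sentence start
    end = min(index + window_size, len(sentence) - 1)     # clip at the sentence end
    return [sentence[i] for i in range(start, end + 1) if i != index]

sentence = ['he', 'is', 'the', 'king']
for i, word in enumerate(sentence):
    print(word, '->', context_window(sentence, i, 2))
```

With a window of 2, 'he' at the start of the sentence only gets the two words to its right, while 'is' gets one word on the left and two on the right.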
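We first need some simple preprocessing of the text: build one dictionary that maps each word to an integer (word2int) and another that maps each integer back to its word (int2word).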
words = []
for word in corpus_raw.split():
    if word != '.': # because we don't want to treat . as a word
        words.append(word)
words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words
for i,word in enumerate(words):
    word2int[word] = i
    int2word[i] = word
Let's see what these dictionaries give us:
print(word2int['queen'])
-> 42 (say)
print(int2word[42])
-> 'queen'
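One detail worth knowing: because the vocabulary is built from a set, the integer assigned to each word can differ from one run to the next, which is why the output above says "42 (say)". If you want reproducible ids, one option (my own variation, not part of the original code) is to sort the unique words first:

```python
corpus_raw = 'he is the king . the king is royal . she is the royal queen '

# Sorting the unique words fixes the id assignment across runs.
vocab = sorted({w for w in corpus_raw.split() if w != '.'})
word2int = {word: i for i, word in enumerate(vocab)}
int2word = {i: word for word, i in word2int.items()}

print(vocab)
print(word2int['queen'], int2word[word2int['queen']])
```

Alphabetical order is an arbitrary choice here; any fixed order would work just as well.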
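Good, now we can build the training data.
# raw_sentences is a list of sentences
raw_sentences = corpus_raw.split('.')
sentences = [sentence.split() for sentence in raw_sentences]
data = []
WINDOW_SIZE = 2
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        # neighbours within WINDOW_SIZE positions, clipped at the sentence boundaries
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1]:
            if nb_word != word:
                data.append([word, nb_word])
The code above splits the corpus into sentences and each sentence into words, producing the training samples one by one. In each [word, nb_word] pair, word is the model input and nb_word is one of the words around it.
Let's print data and take a look: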
print(data)
[['he', 'is'],
['he', 'the'],
['is', 'he'],
['is', 'the'],
['is', 'king'],
['the', 'he'],
['the', 'is'],
['the', 'king'],
.
.
.
]
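A subtle point about the pair-generation loop above: it filters neighbours by comparing word values, so if the same word occurs twice inside one window, the second occurrence is dropped as well. The sketch below contrasts that rule with a hypothetical variant that skips only the centre position. The function names are mine, and whether (word, same word) pairs should be kept is a design choice rather than something the original post settles.

```python
def pairs_by_value(sentence, window_size):
    """Same filtering rule as the loop above: skip neighbours equal to the centre word."""
    pairs = []
    for i, word in enumerate(sentence):
        for nb in sentence[max(i - window_size, 0): i + window_size + 1]:
            if nb != word:
                pairs.append([word, nb])
    return pairs

def pairs_by_position(sentence, window_size):
    """Hypothetical variant: skip only the centre position itself."""
    pairs = []
    for i, word in enumerate(sentence):
        start = max(i - window_size, 0)
        for j, nb in enumerate(sentence[start: i + window_size + 1], start):
            if j != i:
                pairs.append([word, nb])
    return pairs

sentence = ['the', 'cat', 'the']
print(pairs_by_value(sentence, 2))
print(pairs_by_position(sentence, 2))
```

On the toy corpus of this post the two rules give the same pairs, since no sentence repeats a word; the difference only shows up on sentences like the one used here.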
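We now have the training data, but it has to be converted into a form the model can read; this is where the word2int dictionary comes in.
Let's go one step further and convert each word into a one-hot vector,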
i.e.,
say we have a vocabulary of 3 words : pen, pineapple, apple
where
word2int['pen'] -> 0 -> [1 0 0]
word2int['pineapple'] -> 1 -> [0 1 0]
word2int['apple'] -> 2 -> [0 0 1]
So why one-hot vectors? This is explained later.
# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp
x_train = [] # input word
y_train = [] # output word
for data_word in data:
    x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
    y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))
# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
Building the model with TensorFlow
# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))
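Here the one-hot training input is converted into its embedded representation, which reduces its dimension to the chosen EMBEDDING_DIM.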
EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)
Next, we use the softmax function to predict the word's neighbours.
W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))
So the overall process is:
input_one_hot ---> embedded repr. ---> predicted_neighbour_prob
predicted_prob will be compared against a one hot vector to correct it.
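To see the three stages without any framework, here is a small dependency-free sketch of the forward pass. The weights are hand-picked toy values (a vocabulary of 3 words and an embedding size of 2, both assumptions for illustration), and the softmax subtracts the maximum score for numerical stability, which is my addition rather than something the TensorFlow code above spells out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(one_hot, W1, b1, W2, b2):
    """input_one_hot -> embedded repr. -> predicted_neighbour_prob"""
    emb_dim, vocab = len(b1), len(b2)
    # hidden = one_hot . W1 + b1
    hidden = [sum(one_hot[i] * W1[i][d] for i in range(len(one_hot))) + b1[d]
              for d in range(emb_dim)]
    # scores = hidden . W2 + b2, then softmax turns scores into probabilities
    scores = [sum(hidden[d] * W2[d][v] for d in range(emb_dim)) + b2[v]
              for v in range(vocab)]
    return hidden, softmax(scores)

# Toy, hand-picked weights (assumed values): 3 words, embedding size 2.
W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.0, 0.0]

hidden, probs = forward([0, 1, 0], W1, b1, W2, b2)
print(hidden)
print(probs, sum(probs))
```

The probabilities always sum to 1, and the largest one marks the word the model currently considers the most likely neighbour.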
OK, let's see how to train this model.
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init) #make sure you do this!
# define the loss function:
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))
# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)
n_iters = 10000
# train for n_iter iterations
for _ in range(n_iters):
    sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
    print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))
During training, you can watch the loss change:
loss is : 2.73213
loss is : 2.30519
loss is : 2.11106
loss is : 1.9916
loss is : 1.90923
loss is : 1.84837
loss is : 1.80133
loss is : 1.76381
loss is : 1.73312
loss is : 1.70745
loss is : 1.68556
loss is : 1.66654
loss is : 1.64975
loss is : 1.63472
loss is : 1.62112
loss is : 1.6087
loss is : 1.59725
loss is : 1.58664
loss is : 1.57676
loss is : 1.56751
loss is : 1.55882
loss is : 1.55064
loss is : 1.54291
loss is : 1.53559
loss is : 1.52865
loss is : 1.52206
loss is : 1.51578
loss is : 1.50979
loss is : 1.50408
loss is : 1.49861
.
.
.
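As a reference point for reading these numbers, the sketch below computes the same per-sample cross-entropy in plain Python for a model that guesses uniformly over the 7 distinct words of the toy corpus. The eps guard against log(0) is my own addition and is not part of the TensorFlow code above.

```python
import math

def cross_entropy(y_one_hot, probs, eps=1e-12):
    """-sum(y * log(p)) for one sample; eps guards against log(0)."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_one_hot, probs))

vocab_size = 7                          # the toy corpus has 7 distinct words
uniform = [1.0 / vocab_size] * vocab_size
label = [0, 0, 1, 0, 0, 0, 0]           # one-hot target, position chosen arbitrarily

print(cross_entropy(label, uniform))    # loss of a uniform guess
print(math.log(vocab_size))             # the same value, log(7)
```

A uniform guess costs log(7), roughly 1.95, so the printed losses start above that level (random initial weights) and then fall below it. The loss cannot reach zero here, because the same input word is paired with several different neighbours in the training data.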
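The loss eventually converges. Even if the accuracy never reaches a high level, we do not care about that; what we ultimately want are good values of W1 and b1, that is, the hidden representation (the full code stores it as vectors = sess.run(W1 + b1)).
Why one-hot vectors?
When a one-hot vector is multiplied by W1, the result is one particular row of the matrix W1, so W1 acts as a lookup table.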
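The lookup-table claim is easy to check by hand. The sketch below multiplies a one-hot row vector by a small matrix in plain Python and compares the result with simply indexing the row; the matrix values and the helper name matvec_row are made up for illustration.

```python
def matvec_row(one_hot, W):
    """Multiply a 1 x V one-hot row vector by a V x D matrix."""
    dim = len(W[0])
    return [sum(one_hot[i] * W[i][d] for i in range(len(W))) for d in range(dim)]

W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy values, 3 words x 2 dimensions
one_hot = [0, 0, 1]                          # selects the word with id 2

print(matvec_row(one_hot, W1))
print(W1[2])
assert matvec_row(one_hot, W1) == W1[2]      # the product is exactly row 2
```

This is why frameworks usually index the row directly instead of doing the multiplication; TensorFlow, for example, provides tf.nn.embedding_lookup for that purpose.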
In our example, we can look at the representation of 'queen' in vectors.
print(vectors[ word2int['queen'] ])
# say here word2int['queen'] is 2
->
[-0.69424796 -1.67628145 3.07313657 -1.14802659 -1.2207377 ]
Given a vector, we can find the vector closest to it:
def euclidean_dist(vec1, vec2):
    return np.sqrt(np.sum((vec1-vec2)**2))
def find_closest(word_index, vectors):
    min_dist = 10000 # to act like positive infinity
    min_index = -1
    query_vector = vectors[word_index]
    for index, vector in enumerate(vectors):
        if euclidean_dist(vector, query_vector) < min_dist and not np.array_equal(vector, query_vector):
            min_dist = euclidean_dist(vector, query_vector)
            min_index = index
    return min_index
Let's see which words are closest to 'king', 'queen', and 'royal':
print(int2word[find_closest(word2int['king'], vectors)])
print(int2word[find_closest(word2int['queen'], vectors)])
print(int2word[find_closest(word2int['royal'], vectors)])
->
queen
king
he
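The search above uses Euclidean distance. Word vectors are also very often compared by cosine similarity, which ignores vector length and looks only at direction. The sketch below is an alternative under that assumption, not what the original code does; find_most_similar and the toy vectors are made up to show that the two measures can disagree.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_most_similar(word_index, vectors):
    """Index of the vector with the highest cosine similarity, excluding the query itself."""
    best_index, best_sim = -1, -2.0   # cosine similarity is never below -1
    for index, vector in enumerate(vectors):
        if index == word_index:
            continue
        sim = cosine_similarity(vectors[word_index], vector)
        if sim > best_sim:
            best_index, best_sim = index, sim
    return best_index

# Made-up 2-D vectors, only to exercise the function.
toy_vectors = [[1.0, 0.0], [10.0, 1.0], [0.0, 1.0], [1.2, 0.9]]
print(find_most_similar(0, toy_vectors))
```

For the query at index 0, cosine similarity picks index 1 (nearly the same direction, much longer), whereas Euclidean distance would pick index 3 (closest point in space).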
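Going further
The summary above mainly covers the first paper. Although word2vec is only a three-layer neural network, training it on a massive corpus requires enormous computing resources. Once the embedding dimension and the vocabulary size are set to realistic values, the weight matrices become huge, and optimizing them with gradient descent is very slow. Yet you sometimes have to train on a large amount of data to avoid overfitting. The paper introduces several remedies.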
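- Downsampling, to reduce the number of training samples. In the TensorFlow implementation, vocab_size is not the number of all distinct words: the frequency of every word is counted first, only the most frequent words are kept as the vocabulary, and all remaining words are replaced by a single unknown-word token. For the details, see the code in tensorflow/examples/tutorials/word2vec/word2vec_basic.py.
- Negative sampling, which lets each training sample update only a small part of the weight matrix and so reduces the computational load of training. Take one training sample (input word, output word). From the analysis above, its label is a one-hot vector of length vocab_size that is 1 only at the position of the output word and 0 everywhere else, so every sample updates the entire weight matrix, which is slow. Negative sampling instead randomly picks a few of the other words, whose target output is 0, and updates only the weights at the corresponding positions. For example, with 5 negative samples we pick 5 other words whose target is 0; only 6 output neurons are then involved, so instead of updating the weights of all vocab_size output neurons we update only the weights attached to those 6.
- Hierarchical softmax, a method commonly used in NLP (for details, see Hierarchical Softmax). The main idea is to build a Huffman tree from word frequencies: the leaves of the tree are the words of the vocabulary, and more frequent words sit closer to the root. To compute the probability of generating a word, we no longer compute a probability for every word; we only follow the path in the Huffman tree from the root to the node of that word. This reduces the time complexity from O(N) to O(log N) (a tree with N nodes has a depth of roughly log N).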
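The saving from negative sampling is easy to quantify. The sketch below picks 5 negative words for one training sample and counts how many output-layer weights would be touched. The sizes (a vocabulary of 10,000 words and an embedding dimension of 300) and the target id are hypothetical values chosen only for illustration. Sampling uniformly is also a simplification of mine; the word2vec paper draws negatives from a smoothed word-frequency distribution instead.

```python
import random

def sample_negatives(target_id, vocab_size, k, rng):
    """Pick k distinct word ids other than target_id, uniformly at random."""
    candidates = [i for i in range(vocab_size) if i != target_id]
    return rng.sample(candidates, k)

# Hypothetical sizes chosen only for illustration; they are not taken from the text.
vocab_size, embedding_dim, k = 10000, 300, 5

rng = random.Random(0)
target_id = 42                      # arbitrary id of the true neighbour word
negatives = sample_negatives(target_id, vocab_size, k, rng)

# Output neurons touched by this training sample: 1 positive + k negatives.
touched = [target_id] + negatives
full_update = vocab_size * embedding_dim        # every output-layer weight
sampled_update = len(touched) * embedding_dim   # only the touched neurons' weights

print(len(touched), full_update, sampled_update)
```

Under these assumed sizes, a single sample updates 1,800 output weights instead of 3,000,000.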
Personal summary
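- In seq2seq models, the input side is multiplied by an embedding matrix and so is the output side; these two embedding matrices are sometimes shared and sometimes not. I think word2vec is in fact the prototype of such models: they apply the same idea to different, more complex scenarios and add further mechanisms internally as the scenario requires, but the overall framework stays the same.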
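- word2vec is among the most basic knowledge in natural language processing today, so understanding its principles in depth is very important.
Personally, I feel that understanding it to this level is about enough.
Full code: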
import tensorflow as tf
import numpy as np
corpus_raw = 'He is the king . The king is royal . She is the royal queen '
# convert to lower case
corpus_raw = corpus_raw.lower()
words = []
for word in corpus_raw.split():
    if word != '.': # because we don't want to treat . as a word
        words.append(word)
words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words
for i,word in enumerate(words):
    word2int[word] = i
    int2word[i] = word
# raw sentences is a list of sentences.
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
    sentences.append(sentence.split())
WINDOW_SIZE = 2
data = []
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1] :
            if nb_word != word:
                data.append([word, nb_word])
# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp
x_train = [] # input word
y_train = [] # output word
for data_word in data:
    x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
    y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))
# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))
EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)
W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init) #make sure you do this!
# define the loss function:
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))
# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)
n_iters = 10000
# train for n_iter iterations
for _ in range(n_iters):
    sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
    print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))
vectors = sess.run(W1 + b1)
def euclidean_dist(vec1, vec2):
    return np.sqrt(np.sum((vec1-vec2)**2))
def find_closest(word_index, vectors):
    min_dist = 10000 # to act like positive infinity
    min_index = -1
    query_vector = vectors[word_index]
    for index, vector in enumerate(vectors):
        if euclidean_dist(vector, query_vector) < min_dist and not np.array_equal(vector, query_vector):
            min_dist = euclidean_dist(vector, query_vector)
            min_index = index
    return min_index
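from sklearn.manifold import TSNE
# perplexity must be smaller than the number of points (7 words here);
# recent scikit-learn versions reject the default value of 30 for such a small set
model = TSNE(n_components=2, random_state=0, perplexity=3)
np.set_printoptions(suppress=True)
vectors = model.fit_transform(vectors)
from sklearn import preprocessing
normalizer = preprocessing.Normalizer(norm='l2')
vectors = normalizer.fit_transform(vectors)
print(vectors)
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# the normalized points lie on the unit circle, so widen the default [0, 1] axes
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)
print(words)
for word in words:
    print(word, vectors[word2int[word]][1])
    ax.annotate(word, (vectors[word2int[word]][0], vectors[word2int[word]][1]))
plt.show()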