PyTorch Study Notes: Training Word Vectors (Part 3)

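Learning goals

Understand the concept of word vectors
Train word vectors with the Skip-gram model
Learn to use the PyTorch Dataset and DataLoader
Learn to define a PyTorch model
Learn the common Modules in torch.nn
- Embedding
Learn common PyTorch operations
- bmm
- logsigmoid
Save and load a PyTorch model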

The training data can be downloaded from the link below.

Link: https://pan.baidu.com/s/1tFeK3mXuVXEy3EMarfeWvg Password: v2z5

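In this notebook we try, as far as we can, to reproduce the word-vector training method from the paper Distributed Representations of Words and Phrases and their Compositionality. We implement the Skip-gram model and train it with the paper's negative sampling objective, a simplified form of noise contrastive estimation.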

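The paper contains many implementation details that matter a great deal for the quality of the word vectors. We cannot fully reproduce its experimental results, mainly because of limited compute and other practical details, but we can still show roughly how word vectors are trained.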

The following detail is not implemented here:

subsampling: see Section 2.3 of the paper
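
For readers who want to try the missing step, here is a minimal sketch of subsampling as described in Section 2.3 of the paper: each occurrence of a word w is dropped with probability 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency. The function name `subsample`, the toy corpus, and the large threshold used for the demo are assumptions for illustration; the paper suggests a threshold around 1e-5 for real corpora.

```python
import math
import random
from collections import Counter

def subsample(tokens, threshold=1e-5, seed=0):
    """Drop each token with probability 1 - sqrt(threshold / f(w))."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    # Discard probability per word; clamp at 0 so rare words are always kept.
    p_drop = {
        w: max(0.0, 1.0 - math.sqrt(threshold / (c / total)))
        for w, c in counts.items()
    }
    return [w for w in tokens if rng.random() >= p_drop[w]]

# Hypothetical toy corpus; the threshold is enlarged so the effect is visible.
toy = ["the"] * 900 + ["cat"] * 90 + ["zygote"] * 10
kept = subsample(toy, threshold=0.05)
print(Counter(kept))
```

In this notebook the filtered token list would replace `text` before the dataset is built, so very frequent words contribute fewer training pairs.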

1. Import the PyTorch packages

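import torch
import torch.nn as nn # neural network layers
import torch.nn.functional as F # functional API
import torch.utils.data as tud # Dataset and DataLoader utilities
from torch.nn.parameter import Parameter

from collections import Counter
import numpy as np
import random
import math

import pandas as pd
import scipy
import scipy.stats # imported explicitly; used by evaluate()
import scipy.spatial.distance # imported explicitly; used by find_nearest()
import sklearn
from sklearn.metrics.pairwise import cosine_similarity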
# Configuration
# Check whether a GPU is available
USE_CUDA = torch.cuda.is_available()
# Fix the random seeds so that runs are reproducible
seed_number = 1
random.seed(seed_number)
np.random.seed(seed_number)
torch.manual_seed(seed_number)
if USE_CUDA:
    torch.cuda.manual_seed(seed_number)

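# Hyperparameters
K = 100 # number of negative samples per positive (context) word
C = 3 # context window size on each side of the center word
NUM_EPOCHS = 10 # number of training epochs
MAX_VOCAB_SIZE = 30000 # vocabulary size
BATCH_SIZE = 128 # number of samples per batch
LEARNING_RATE = 0.2 # learning rate
EMBEDDING_SIZE = 100 # word vector dimension

LOG_FILE = 'word_embedding.log'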

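2. Preprocessing

Read all the text from the file and build a vocabulary from it
The number of distinct words may be too large, so we keep only the MAX_VOCAB_SIZE most common ones
We add an UNK token that stands for all the uncommon words
We need to record the word-to-index mapping, the index-to-word mapping, the count of each word, the (normalized) frequency of each word, and the vocabulary size.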

def word_tokenize(text):
    return text.split()
# Read the training text
with open("./text8/text8/text8.train.txt", "r") as fin:
    text = fin.read()
# Lowercase and tokenize; the result is a list of words
text = word_tokenize(text.lower())
# Keep the (MAX_VOCAB_SIZE - 1) most frequent words
# The result is a dict of the form {word: count}
vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE - 1))
# All remaining words are counted under <unk>
vocab["<unk>"] = len(text) - np.sum(list(vocab.values()))
# index -> word
idx_to_word = [w for w in vocab.keys()]
# word -> index
word_to_idx = {word:i for i, word in enumerate(idx_to_word)}
# Count of each word
word_counts = np.array([v for v in vocab.values()], dtype=np.float32)
# Normalize counts to frequencies
word_freqs = word_counts / np.sum(word_counts)
# Raising to the 3/4 power moves some probability mass from frequent words to rare words
word_freqs = word_freqs ** (3./4.)
# Renormalize so that the sampling distribution sums to 1
word_freqs = word_freqs / np.sum(word_freqs)
# Vocabulary size
VOCAB_SIZE = len(idx_to_word)

Check the vocabulary size

VOCAB_SIZE
> 30000
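
To see what the 3/4 power does to the negative-sampling distribution, here is a small sketch on three made-up word counts (the numbers are hypothetical, not taken from text8). It applies the same two steps as the preprocessing code: raise the frequencies to the 3/4 power, then renormalize.

```python
import numpy as np

counts = np.array([900.0, 90.0, 10.0])  # hypothetical counts for three words
unigram = counts / counts.sum()         # plain relative frequencies

smoothed = unigram ** 0.75              # same exponent as in the preprocessing step
smoothed = smoothed / smoothed.sum()    # renormalize to a probability distribution

print("unigram :", unigram)
print("smoothed:", smoothed)
```

The most frequent word ends up with a smaller share and the rarest word with a larger one, so rare words are drawn as negatives more often than their raw frequency would suggest.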

3. Implement the Dataloader

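A dataloader needs to do the following:

Encode all the text as numbers, then preprocess it with subsampling (subsampling is not implemented in this notebook).
Store the vocabulary, the word counts, and the normalized word frequencies
Sample one center word per iteration
Return the context words for the current center word
Sample some negative words for the center word
Return the word counts

There is a good tutorial that explains how to use the PyTorch dataloader.
To use a dataloader, we need to define the following two functions:

__len__ returns the number of items in the whole dataset
__getitem__ returns one item for a given index

Once we have a dataloader, it is easy to shuffle the whole dataset, fetch one batch of data, and so on.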
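
The two required methods are easier to see on a tiny dataset first. The sketch below is a toy example, not part of the word-vector pipeline: `SquaresDataset` is a made-up class whose item i is the pair (i, i*i), wrapped in a `DataLoader` that shuffles and batches it.

```python
import torch
import torch.utils.data as tud

class SquaresDataset(tud.Dataset):
    """Hypothetical toy dataset: item i is the pair (i, i * i)."""
    def __init__(self, n):
        super().__init__()
        self.n = n

    def __len__(self):
        # Number of items in the whole dataset
        return self.n

    def __getitem__(self, idx):
        # One item for the given index
        return torch.tensor(idx), torch.tensor(idx * idx)

loader = tud.DataLoader(SquaresDataset(10), batch_size=4, shuffle=True)
batches = list(loader)
for x, y in batches:
    print(x.tolist(), y.tolist())
```

With 10 items and a batch size of 4, the loader yields three batches, and the last one holds the remaining 2 items.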

class WordEmbeddingDataset(tud.Dataset):
    def __init__(self, text, word_to_idx, idx_to_word, word_freqs):
        ''' text: a list of words, all text from the training dataset
            word_to_idx: the dictionary from word to idx
            idx_to_word: idx to word mapping
            word_freqs: the normalized (3/4-power) frequency of each word
        '''
        super().__init__() # initialize the parent Dataset
        # Encode the text as word indices; unknown words map to <unk>
        self.text_encoded = [word_to_idx.get(t, word_to_idx["<unk>"]) for t in text]
        self.text_encoded = torch.LongTensor(self.text_encoded)
        # Keep the vocabulary mappings and the sampling distribution
        self.word_to_idx = word_to_idx
        self.idx_to_word = idx_to_word
        self.word_freqs = torch.Tensor(word_freqs)
    def __len__(self):
        ''' Return the length of the whole dataset (the number of tokens)
        '''
        return len(self.text_encoded)
    def __getitem__(self, idx):
        ''' Return the following data for training
            - the center word
            - the (positive) words around it
            - K randomly sampled words per positive word, used as negative samples
        '''
        # Center word
        center_word = self.text_encoded[idx]
        # Positions of the context words around the center word
        pos_indices = list(range(idx-C, idx)) + list(range(idx+1, idx+C+1))
        # Positions that fall outside the text wrap around (the text is treated as a ring)
        pos_indices = [p%self.__len__() for p in pos_indices]
        pos_words = self.text_encoded[pos_indices]
        # Negative samples; a more careful version would exclude the center word and the context words
        neg_words = torch.multinomial(self.word_freqs, K * pos_words.shape[0], True)

        return center_word, pos_words, neg_words
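
The comment in `__getitem__` notes that a more careful version would avoid drawing the center word or its context words as negatives. One possible way to do that, sketched below, is to zero out their sampling weights before calling `torch.multinomial`, which treats its input as unnormalized weights. The helper name `sample_negatives` and the four-word distribution are assumptions for illustration.

```python
import torch

def sample_negatives(word_freqs, exclude, num_samples):
    """Draw num_samples indices from word_freqs, never returning an index in exclude."""
    probs = word_freqs.clone()   # leave the shared distribution untouched
    probs[exclude] = 0.0         # zero weight: multinomial cannot pick these indices
    return torch.multinomial(probs, num_samples, replacement=True)

# Hypothetical 4-word vocabulary; words 0 and 2 play the center and context words.
freqs = torch.tensor([0.4, 0.3, 0.2, 0.1])
exclude = torch.tensor([0, 2])
neg = sample_negatives(freqs, exclude, 50)
print(neg.unique())
```

Cloning a 30000-entry tensor for every item adds overhead, so this is a trade-off between sampling cleanliness and data-loading speed.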

4. Create a dataset object and load data with DataLoader

dataset = WordEmbeddingDataset(text, word_to_idx, idx_to_word, word_freqs)
# On Windows set num_workers=0 (multi-process data loading in PyTorch is problematic there); on other systems the number of workers can be increased
dataloader = tud.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)

Look at the data the dataloader returns

# Each batch returned by the dataloader contains batch_size items
showNextData = next(iter(dataloader))
print(showNextData[0].size())
print(showNextData[1].size())
print(showNextData[2].size())

Output

torch.Size([128])
torch.Size([128, 6])
torch.Size([128, 600])

5. Define the PyTorch model

The loss function is as follows
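
For reference, the negative sampling objective from the cited paper for a single (center word, context word) pair can be written as below. Here $v_{w_I}$ is the input embedding of the center word, $v'_{w_O}$ the output embedding of the context word, $k$ the number of negative samples, and $P_n(w)$ the noise distribution (the 3/4-power unigram distribution built during preprocessing).

```latex
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
```

The paper maximizes this quantity. The model in this section sums it over the 2C context words of each center word and returns its negative, so that an optimizer can minimize it.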

class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        ''' Initialize the input and output embeddings
        '''
        super().__init__()
        self.vocab_size = vocab_size # 30000
        self.embed_size = embed_size # 100

        initrange = 0.5 / self.embed_size
        # [30000, 100] matrix
        self.in_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        # Initialize the weights uniformly in [-5e-3, 5e-3]
        self.in_embed.weight.data.uniform_(-initrange, initrange)
        self.out_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.out_embed.weight.data.uniform_(-initrange, initrange)

    def forward(self, input_labels, pos_labels, neg_labels):
        '''
        input_labels: center words, [batch_size]
        pos_labels: words that appear in the context window of the center word, [batch_size, window_size * 2]
        neg_labels: words drawn by negative sampling, [batch_size, window_size * 2 * K]

        return: loss, [batch_size]
        '''
        input_embedding = self.in_embed(input_labels) # [b, embed_size] [128, 100]
        pos_embedding = self.out_embed(pos_labels) # [b, 2*C, embed_size] [128, 6, 100]
        neg_embedding = self.out_embed(neg_labels) # [b, 2*C*K, embed_size] [128, 600, 100]
        # unsqueeze(dim) inserts a size-1 axis at dim; squeeze(dim) removes a size-1 axis at dim
        # squeeze() with no argument removes every size-1 axis: (3, 1, 2, 1, 5) -> (3, 2, 5)
        # torch.bmm() is batched matrix multiplication: (b, n, m) x (b, m, p) = (b, n, p)
        # [b, 2*C] [128, 6]; squeeze(2) removes only the last axis, so a batch of size 1 keeps its batch axis
        log_pos = torch.bmm(pos_embedding, input_embedding.unsqueeze(2)).squeeze(2)
        # [b, 2*C*K] [128, 600]; the center vector is negated so negatives contribute log sigmoid(-u.v)
        log_neg = torch.bmm(neg_embedding, -input_embedding.unsqueeze(2)).squeeze(2)
        # Sum over the context / negative words: [b]
        log_pos = F.logsigmoid(log_pos).sum(1)
        log_neg = F.logsigmoid(log_neg).sum(1)
        loss = log_pos + log_neg

        return -loss

    def input_embeddings(self):
        # Take the input embedding weights, used to compute similarities
        weight = self.in_embed.weight.data.cpu().numpy()
        # A common alternative is the average of the two embeddings
        weight1 = ((self.in_embed.weight.data + self.out_embed.weight.data) / 2.).cpu().numpy()
        # Return both so we can compare which works better
        return weight, weight1
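
The two operations at the heart of `forward`, `torch.bmm` and `F.logsigmoid`, can be checked in isolation. The sketch below uses small made-up sizes and random tensors standing in for `pos_embedding` and `input_embedding`; it confirms that the batched product is a per-example dot product between each context vector and the center vector.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, n, d = 2, 3, 4                    # hypothetical sizes: batch, context words, embedding dim
context = torch.randn(b, n, d)       # stands in for pos_embedding
center = torch.randn(b, d)           # stands in for input_embedding

# (b, n, d) x (b, d, 1) -> (b, n, 1); squeeze(2) drops the trailing axis.
scores = torch.bmm(context, center.unsqueeze(2)).squeeze(2)

# The same scores written as explicit dot products, for comparison.
expected = (context * center.unsqueeze(1)).sum(dim=2)

# Elementwise log(sigmoid(x)); the result is never positive.
log_prob = F.logsigmoid(scores)
print(scores.shape, torch.allclose(scores, expected, atol=1e-6))
```

`F.logsigmoid` is used instead of `torch.log(torch.sigmoid(x))` because it is numerically more stable for large negative scores.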

6. Create the model object

model = EmbeddingModel(VOCAB_SIZE, EMBEDDING_SIZE)
if USE_CUDA:
    model = model.to('cuda:0')

Inspect the model structure

model
# The output is shown below
EmbeddingModel(
  (in_embed): Embedding(30000, 100)
  (out_embed): Embedding(30000, 100)
)

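7. Code for evaluating the word vectors

evaluate(filename, embedding_weight_1, embedding_weight_2) uses the trained word vectors to compute the similarity of word pairs and compares it with human similarity judgments. The closer the Spearman correlation is to 1, the better the agreement; a value near -1 means the two rankings are reversed.
find_nearest(word, embedding_weight_1, embedding_weight_2) uses similarity to find the ten words closest in meaning to the given word.
embedding_weight_1 is taken from the in_embed weight matrix
embedding_weight_2 is the element-wise average of the in_embed and out_embed weight matrices.
Research means running experiments, so we try both and see which works better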

def evaluate(filename, embedding_weight_1, embedding_weight_2):
    if filename.endswith(".csv"):
        data = pd.read_csv(filename, sep=',')
    else:
        data = pd.read_csv(filename, sep='\t')
    human_similarity = []
    # Similarities from the in_embed weights
    model_similarity_1 = []
    # Similarities from the average of in_embed and out_embed
    model_similarity_2 = []
    for i in data.iloc[:, 0:2].index:
        word1, word2 = data.iloc[i, 0], data.iloc[i, 1]
        if word1 not in word_to_idx or word2 not in word_to_idx:
            continue
        else:
            # Look up the word indices
            word1_idx, word2_idx = word_to_idx.get(word1), word_to_idx.get(word2)
            # Look up the trained word vectors, kept 2-D with shape [1, embed_size]
            word1_embed_1, word2_embed_1 = embedding_weight_1[[word1_idx]], embedding_weight_1[[word2_idx]]
            word1_embed_2, word2_embed_2 = embedding_weight_2[[word1_idx]], embedding_weight_2[[word2_idx]]
            # The more similar two words are, the smaller the angle between them and the larger
            # sklearn's cosine_similarity; it returns a [1, 1] array, so take element [0, 0]
            model_similarity_1.append(float(cosine_similarity(word1_embed_1, word2_embed_1)[0, 0]))
            model_similarity_2.append(float(cosine_similarity(word1_embed_2, word2_embed_2)[0, 0]))
            human_similarity.append(float(data.iloc[i, 2]))
    # Correlation between the model's similarities and the human scores
    # Spearman's rank correlation measures how well the relation between two variables can be described by a monotonic function.
    # Without tied values, if one variable is a strictly monotonic function of the other, the Spearman correlation is +1 or -1 (a perfect Spearman correlation)
    return scipy.stats.spearmanr(human_similarity, model_similarity_1), scipy.stats.spearmanr(human_similarity, model_similarity_2)

def find_nearest(word, embedding_weight_1, embedding_weight_2):
    index = word_to_idx.get(word)
    embed_1 = embedding_weight_1[index]
    embed_2 = embedding_weight_2[index]
    # scipy and sklearn define cosine differently: cosine_scipy (a distance) = 1 - cosine_sklearn (a similarity)
    # The more similar two words are, the smaller the angle: cosine_sklearn grows and cosine_scipy shrinks
    cos_dis_1 = np.array([scipy.spatial.distance.cosine(e, embed_1) for e in embedding_weight_1])
    cos_dis_2 = np.array([scipy.spatial.distance.cosine(e, embed_2) for e in embedding_weight_2])
    # argsort() returns the indices that sort the array in ascending order
    lst_1 = [idx_to_word[i] for i in cos_dis_1.argsort()[:10]]
    lst_2 = [idx_to_word[i] for i in cos_dis_2.argsort()[:10]]
    return lst_1, lst_2
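
The two measures used by `evaluate` and `find_nearest` can be checked on toy inputs. The sketch below uses made-up vectors and scores; it shows that SciPy's cosine distance is one minus the cosine similarity, and that Spearman's correlation depends only on the ranking of the scores, not on their scale.

```python
import numpy as np
import scipy.stats
import scipy.spatial.distance

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# Cosine similarity written out by hand (what sklearn's cosine_similarity computes per pair).
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# SciPy returns a distance: 1 - similarity.
cos_dist = scipy.spatial.distance.cosine(a, b)

human = [1.0, 2.0, 3.0, 4.0]          # hypothetical human similarity scores
model_scores = [0.1, 0.2, 0.9, 5.0]   # different scale, same ranking
rho, _ = scipy.stats.spearmanr(human, model_scores)
rho_rev, _ = scipy.stats.spearmanr(human, model_scores[::-1])
print(cos_sim + cos_dist, rho, rho_rev)
```

Because only ranks matter, a model can score well on these benchmarks even if its cosine values sit in a very different numeric range from the human ratings.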

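8. Train the model:

A model is usually trained for several epochs
In each epoch we split all the data into batches
Move the tensors of each batch to the GPU when CUDA is available
Forward pass: compute the negative sampling loss from the center words, context words, and negative samples
Zero the model's current gradients
Backward pass
Update the model parameters
Every fixed number of iterations, print the current loss and evaluate the embeddings on the word-similarity datasets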

# Use SGD as the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
# Training loop
for e in range(NUM_EPOCHS):
    for i, (input_labels, pos_labels, neg_labels) in enumerate(dataloader):
        # Make sure the indices are int64, as nn.Embedding expects
        input_labels = input_labels.long()
        pos_labels = pos_labels.long()
        neg_labels = neg_labels.long()
        if USE_CUDA:
            input_labels = input_labels.cuda()
            pos_labels = pos_labels.cuda()
            neg_labels = neg_labels.cuda()
        # Zero the gradients
        optimizer.zero_grad()
        # The model returns a loss of shape [128]; take the mean
        loss = model(input_labels, pos_labels, neg_labels).mean()
        loss.backward() # backpropagation
        optimizer.step() # update the parameters

        # Log the loss every 100 iterations
        if i % 100 == 0:
            with open(LOG_FILE, "a") as fout:
                fout.write("epoch: {}, iter: {}, loss: {}\n".format(e, i, loss.item()))
                print("epoch: {}, iter: {}, loss: {}".format(e, i, loss.item()))
        # Evaluate word similarity every 2000 iterations
        if i % 2000 == 0:
            embedding_weights_1, embedding_weights_2  = model.input_embeddings()
            sim_simlex_1, sim_simlex_2 = evaluate("simlex-999.txt", embedding_weights_1, embedding_weights_2)
            sim_men_1, sim_men_2 = evaluate("men.txt", embedding_weights_1, embedding_weights_2)
            sim_353_1, sim_353_2 = evaluate("wordsim353.csv", embedding_weights_1, embedding_weights_2)
            with open(LOG_FILE, "a") as fout:
                print(f"epoch: {e}, iteration: {i}, \n simlex-999_1: {sim_simlex_1}, \n simlex-999_2: {sim_simlex_2}, \n men_1: {sim_men_1}, \n men_2: {sim_men_2},  \n sim353_1: {sim_353_1}, \n sim353_2: {sim_353_2}, \n nearest to monster: {find_nearest('monster', embedding_weights_1, embedding_weights_2)}\n")
                fout.write(f"epoch: {e}, iteration: {i}, simlex-999_1: {sim_simlex_1},simlex-999_2: {sim_simlex_2}, men_1: {sim_men_1}, men_2: {sim_men_2}, sim353_1: {sim_353_1}, sim353_2: {sim_353_2}, nearest to monster: {find_nearest('monster', embedding_weights_1, embedding_weights_2)}\n")

Sample output (choose the number of training epochs yourself)

epoch: 0, iter: 0, loss: 142.8716583251953
epoch: 0, iteration: 0, 
 simlex-999_1: SpearmanrResult(correlation=-0.035259516920833865, pvalue=0.27660953700886737), 
 simlex-999_2: SpearmanrResult(correlation=-0.047682561919958094, pvalue=0.14110938770590917), 
 men_1: SpearmanrResult(correlation=0.04988050229600173, pvalue=0.011246409260567655), 
 men_2: SpearmanrResult(correlation=0.04000382686226847, pvalue=0.0420970874212936),  
 sim353_1: SpearmanrResult(correlation=0.026812442387097967, pvalue=0.633297026852052), 
 sim353_2: SpearmanrResult(correlation=-0.0034262533468499058, pvalue=0.9513952103438084), 
 nearest to monster: (['monster', 'maltese', 'watershed', 'correspond', 'flops', 'yellowstone', 'gamal', 'tolstoy', 'aquitaine', 'denoting'], ['monster', 'etc', 'services', 'abraham', 'slightly', 'sexual', 'andrew', 'legal', 'nobel', 'broken'])

epoch: 0, iter: 100, loss: 102.23202514648438
epoch: 0, iter: 200, loss: 93.51679229736328
epoch: 0, iter: 300, loss: 91.04571533203125
epoch: 0, iter: 400, loss: 85.10859680175781
epoch: 0, iter: 500, loss: 73.21339416503906
epoch: 0, iter: 600, loss: 82.36524200439453
epoch: 0, iter: 700, loss: 71.56480407714844
epoch: 0, iter: 800, loss: 47.44879913330078
epoch: 0, iter: 900, loss: 49.65077209472656
epoch: 0, iter: 1000, loss: 53.81517028808594
epoch: 0, iter: 1100, loss: 37.037811279296875
epoch: 0, iter: 1200, loss: 49.845680236816406
epoch: 0, iter: 1300, loss: 44.053367614746094
epoch: 0, iter: 1400, loss: 29.414356231689453
epoch: 0, iter: 1500, loss: 41.82801818847656
epoch: 0, iter: 1600, loss: 35.28537368774414
epoch: 0, iter: 1700, loss: 26.633563995361328
epoch: 0, iter: 1800, loss: 31.498106002807617
epoch: 0, iter: 1900, loss: 29.859540939331055
epoch: 0, iter: 2000, loss: 31.989009857177734
epoch: 0, iteration: 2000, 
 simlex-999_1: SpearmanrResult(correlation=-0.030601859820272474, pvalue=0.3450785890869934), 
 simlex-999_2: SpearmanrResult(correlation=-0.0463389472461431, pvalue=0.15267173464395575), 
 men_1: SpearmanrResult(correlation=0.031088156058363608, pvalue=0.11426469625155944), 
 men_2: SpearmanrResult(correlation=0.02383281291831326, pvalue=0.226044371368817),  
 sim353_1: SpearmanrResult(correlation=-0.04833420023275394, pvalue=0.38957379699663996), 
 sim353_2: SpearmanrResult(correlation=-0.03349780564943204, pvalue=0.55110266180645), 
 nearest to monster: (['monster', 'a', 'but', 'he', 'home', 'empire', 'that', '<unk>', 'one', 'time'], ['monster', 'has', 'are', 'etc', 'part', 'were', 'been', 'state', 'that', 'his'])

epoch: 0, iter: 2100, loss: 29.90845489501953
epoch: 0, iter: 2200, loss: 30.369483947753906
epoch: 0, iter: 2300, loss: 24.405258178710938
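
The learning goals list saving and loading a PyTorch model. A minimal sketch of the usual `state_dict` round trip follows; a small `nn.Embedding` stands in for the trained `EmbeddingModel`, and the file name and temporary directory are assumptions for illustration.

```python
import os
import tempfile
import torch
import torch.nn as nn

toy_model = nn.Embedding(10, 4)  # stand-in for the trained EmbeddingModel
path = os.path.join(tempfile.mkdtemp(), "embedding-demo.th")  # hypothetical file name

# Save only the parameters, not the whole Python object.
torch.save(toy_model.state_dict(), path)

# To load, build a model with the same shapes, then copy the saved parameters in.
restored = nn.Embedding(10, 4)
restored.load_state_dict(torch.load(path, map_location="cpu"))
print(torch.equal(toy_model.weight, restored.weight))
```

For the model in this notebook the same two calls would apply to `model.state_dict()` and to a freshly constructed `EmbeddingModel(VOCAB_SIZE, EMBEDDING_SIZE)`.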
