2019-CS224n-Assignment1

My original post: https://www.hijerry.cn/p/54554.html

Last winter I worked through the 2017 offering of cs224n and did its three assignments in TensorFlow. This year cs224n is being offered again, with five assignments and PyTorch. The lecturer is still Manning; I really like this teacher, his lectures are lively and interesting, and he is pretty adorable hahaha~~

Assignment 1 (click to download) is about exploring word vectors. It computes word similarity in two ways, count-based co-occurrence matrices and prediction-based word2vec, and studies properties such as synonyms and antonyms, so that we understand them at the code level and remember them more deeply.

The assignment is an ipynb file, so it has to be opened with Jupyter; you can refer to chaibubble's post on how to open ipynb files.

Note: Python version >= 3.5

Word Vectors

Word vectors are a basic building block of downstream NLP tasks (question answering, text generation, translation, etc.), and their quality largely determines the performance of those tasks. Here we will explore two kinds of word vectors: those built from co-occurrence matrices and those produced by word2vec.

A note on terminology: "word vectors" and "word embeddings" are usually used interchangeably. The word "embedding" conveys that words are encoded into a low-dimensional space. "Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension, where each word or phrase is mapped to a vector of real numbers." (Wikipedia)

Part 1: Count-Based Word Vectors

Most word-vector models are based on a single idea:

You shall know a word by the company it keeps (Firth, J. R. 1957:11)

At the core of most word-vector implementations is the notion of similar words, i.e. synonyms, since they appear in similar contexts. Here we introduce one strategy based on the co-occurrence matrix (see here and here for more information).

In this part, given a corpus, we compute word vectors from the co-occurrence matrix, obtaining a vector for every word in the corpus. The workflow is:

  • Compute the corpus's set of distinct words
  • Build the co-occurrence matrix
  • Reduce the dimensionality with SVD
  • Analyze the word vectors

Question 1.1: Implement distinct_words

Compute the corpus's set of distinct words and their count.

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    # flatten the list of documents into a single list of tokens
    corpus = [w for sent in corpus for w in sent]
    # deduplicate, sort, and count the distinct words
    corpus_words = sorted(set(corpus))
    num_corpus_words = len(corpus_words)

    # ------------------

    return corpus_words, num_corpus_words
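As a quick sanity check, the function can be run on a tiny toy corpus. The corpus below is my own illustrative example, not part of the assignment:

# Toy corpus: a list of documents, each a list of tokens (illustrative example)
test_corpus = [["START", "all", "that", "glitters", "is", "not", "gold", "END"],
               ["START", "all", "is", "well", "that", "ends", "well", "END"]]

test_words, num_test_words = distinct_words(test_corpus)
print(test_words)
# ['END', 'START', 'all', 'ends', 'glitters', 'gold', 'is', 'not', 'that', 'well']
print(num_test_words)   # 10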

Question 1.2: Implement compute_co_occurrence_matrix

Build the co-occurrence matrix for the given corpus. Concretely, for each word w, count the occurrences of the words within window_size positions before and after it.

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    M = np.zeros(shape=(num_words, num_words), dtype=np.int32)
    # map each word to its row/column index in M
    for i in range(num_words):
        word2Ind[words[i]] = i
    
    for sent in corpus:
        for p in range(len(sent)):
            ci = word2Ind[sent[p]]
            
            # preceding
            for w in sent[max(0, p - window_size):p]:
                wi = word2Ind[w]
                M[ci][wi] += 1
            
            # subsequent
            for w in sent[p + 1:p + 1 + window_size]:
                wi = word2Ind[w]
                M[ci][wi] += 1
            
    # ------------------

    return M, word2Ind
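A small usage sketch on the toy corpus from the previous sketch; with window_size=1 the expected counts are easy to check by hand:

# Assuming test_corpus from the distinct_words sketch above
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)

# In the first document, "gold" is adjacent to "not" and "END", so each pair is counted once
print(M_test[word2Ind_test["gold"], word2Ind_test["not"]])   # 1
print(M_test[word2Ind_test["gold"], word2Ind_test["END"]])   # 1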

Question 1.3: Implement reduce_to_k_dim

This step is dimensionality reduction. Question 1.2 produced an N x N matrix (N is the size of the word set); using scikit-learn's SVD (singular value decomposition), we decompose this large matrix into a smaller N x k matrix of k features.

Note: numpy, scipy, and scikit-learn all provide SVD implementations, but only scipy and sklearn offer Truncated SVD, and only sklearn provides an efficient randomized algorithm for large-scale SVD; see sklearn.decomposition.TruncatedSVD for details.

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    # ------------------
    # Write your implementation here.
    # n_iter controls the number of iterations of the randomized SVD solver
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)   # rows are the k-dimensional embeddings (U * S)

    # ------------------

    print("Done.")
    return M_reduced
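Chaining the previous functions together gives the full count-based pipeline. A minimal sketch, again using the illustrative toy corpus from above:

# Assuming test_corpus and the functions defined above
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
M_test_reduced = reduce_to_k_dim(M_test, k=2)
print(M_test_reduced.shape)   # (10, 2): one 2-dimensional vector per distinct word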

Question 1.4: Implement plot_embeddings

Using matplotlib: draw the "×" markers with scatter and write the labels with text.

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensioal word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    
    for w in words:
        # look up the word's 2-D embedding
        x = M_reduced[word2Ind[w]][0]
        y = M_reduced[word2Ind[w]][1]
        plt.scatter(x, y, marker='x')   # draw an 'x' at the embedding
        plt.text(x, y, w)               # label the point with the word

    plt.show()
    # ------------------
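A usage sketch, plotting a few of the toy embeddings from the previous sketches (the word list is arbitrary):

# Assuming M_test_reduced and word2Ind_test from the sketch above
plot_embeddings(M_test_reduced, word2Ind_test, ["gold", "glitters", "well", "that"])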

Result:

[figure: 2-D scatter plot of the word embeddings produced by plot_embeddings]

Question 1.5: Co-Occurrence Plot Analysis

Embed the words into 2 dimensions and normalize them; the resulting word vectors fall on the unit circle, and we look for words that end up close together in the plot.
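A minimal sketch of the normalization step, assuming M_reduced_co_occurrence holds the 2-dimensional embeddings for the assignment corpus (the variable name is illustrative):

# Normalize each row to unit length so every word vector lies on the unit circle
row_norms = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / row_norms[:, np.newaxis]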

[figure: 2-D plot of the normalized co-occurrence embeddings]

Part 2: Prediction-Based Word Vectors

Prediction-based word vectors, such as word2vec, are currently the most popular. Now let's explore the word vectors produced by word2vec; if you want to dig deeper, read the original paper.

This part mainly uses gensim to explore word vectors rather than implementing word2vec ourselves; the vectors are 300-dimensional and were released by Google.

First we reduce the dimensionality with SVD, from 300 dimensions down to 2, to make plotting easier.
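Loading the pretrained vectors with gensim can be sketched roughly as follows (the assignment notebook ships its own loading helper; the local file name here is an assumption):

from gensim.models import KeyedVectors

# Load Google's pretrained 300-dimensional word2vec vectors (a large binary file)
wv_from_bin = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
print(wv_from_bin.vector_size)   # 300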

Question 2.1: Word2Vec Plot Analysis

Same as Question 1.5.

Question 2.2: Polysemous Words

Find a word with multiple meanings (such as "leaves" or "scoop") whose top-10 most similar words (by cosine similarity) contain words from two different senses. For example, the top-10 for "leaves" (foliage / departs) contains "vanishes" (to disappear) and "stalks" (plant stems).

The word I found is "column"; its top-10 contains "columnist" and "article".

# ------------------
# Write your polysemous word exploration code here.

wv_from_bin.most_similar("column")

# ------------------

Output:

[('columns', 0.767943263053894),
 ('columnist', 0.6541407108306885),
 ('article', 0.651928186416626),
 ('columnists', 0.617466926574707),
 ('syndicated_column', 0.599014401435852),
 ('op_ed', 0.588202714920044),
 ('Op_Ed', 0.5801560282707214),
 ('op_ed_column', 0.5779396891593933),
 ('nationally_syndicated_column', 0.572504997253418),
 ('colum', 0.5595961213111877)]

Question 2.3: Synonyms and Antonyms

Find three words (w1, w2, w3) where w1 and w2 are synonyms and w1 and w3 are antonyms, but the distance between w1 and w3 is smaller than the distance between w1 and w2. For example: w1 = "happy", w2 = "cheerful", w3 = "sad".

Why do antonyms end up more similar (a smaller distance means more similar)? Because they usually appear in very similar contexts.

# ------------------
# Write your synonym & antonym exploration code here.

w1 = "love"
w2 = "like"
w3 = "hate"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

# ------------------

Output:

Synonyms love, like have cosine distance: 0.6328612565994263
Antonyms love, hate have cosine distance: 0.39960432052612305

Question 2.4: Analogies

"man is to king as woman is to ___" is another kind of question that word2vec can answer; see the GenSim documentation for details on how to use most_similar.
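For reference, the classic case can be queried like this (a minimal sketch; "queen" should appear at or near the top of the result):

# man : king :: woman : ?   -- vector arithmetic: king - man + woman
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))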

Here we try another analogy.

# ------------------
# Write your analogy exploration code here.
# man : him :: woman : her
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'him'], negative=['man']))

# ------------------

Output:

[('her', 0.694490909576416),
 ('she', 0.6385233402252197),
 ('me', 0.628451406955719),
 ('herself', 0.6239798665046692),
 ('them', 0.5843966007232666),
 ('She', 0.5237804651260376),
 ('myself', 0.4885627031326294),
 ('saidshe', 0.48337966203689575),
 ('he', 0.48184287548065186),
 ('Gail_Quets', 0.4784894585609436)]

We can see that it correctly computed "her".

Question 2.5: An Incorrect Analogy

Find an analogy that the vectors get wrong: tree : leaf :: flower : petal.

# ------------------
# Write your incorrect analogy exploration code here.
# tree : leaf :: flower : petal
pprint.pprint(wv_from_bin.most_similar(positive=['leaf', 'flower'], negative=['tree']))

# ------------------

Output:

[('floral', 0.5532568693161011),
 ('marigold', 0.5291938185691833),
 ('tulip', 0.521312952041626),
 ('rooted_cuttings', 0.5189826488494873),
 ('variegation', 0.5136324763298035),
 ('Asiatic_lilies', 0.5132641792297363),
 ('gerberas', 0.5106234550476074),
 ('gerbera_daisies', 0.5101010203361511),
 ('Verbena_bonariensis', 0.5070016980171204),
 ('violet', 0.5058108568191528)]

The output does not contain "petal".

Question 2.6: Bias Analysis

It is important to be aware of bias, such as gender and racial bias. Run the code below and analyze two questions:

(a) Which word is most similar to "woman" and "boss" and most dissimilar to "man"?

(b) Which word is most similar to "man" and "boss" and most dissimilar to "woman"?

# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))

Output:

[('bosses', 0.5522644519805908),
 ('manageress', 0.49151360988616943),
 ('exec', 0.45940813422203064),
 ('Manageress', 0.45598435401916504),
 ('receptionist', 0.4474116563796997),
 ('Jane_Danson', 0.44480544328689575),
 ('Fiz_Jennie_McAlpine', 0.44275766611099243),
 ('Coronation_Street_actress', 0.44275566935539246),
 ('supremo', 0.4409853219985962),
 ('coworker', 0.43986251950263977)]

[('supremo', 0.6097398400306702),
 ('MOTHERWELL_boss', 0.5489562153816223),
 ('CARETAKER_boss', 0.5375303626060486),
 ('Bully_Wee_boss', 0.5333974361419678),
 ('YEOVIL_Town_boss', 0.5321705341339111),
 ('head_honcho', 0.5281980037689209),
 ('manager_Stan_Ternent', 0.525971531867981),
 ('Viv_Busby', 0.5256162881851196),
 ('striker_Gabby_Agbonlahor', 0.5250812768936157),
 ('BARNSLEY_boss', 0.5238943099975586)]

For the first analogy, man : woman :: boss : ___, the most fitting word should be something like "landlady" (a female boss), but the top-10 only contains words like "manageress" and "receptionist".

For the second analogy, woman : man :: boss : ___, I honestly can't tell what the output is supposed to mean (facepalm).

Question 2.7: Independent Analysis of Bias

The examples I looked at are:

  • man : woman :: doctor : ___
  • woman : man :: doctor : ___
# ------------------
# Write your bias exploration code here.

pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'doctor'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'doctor'], negative=['woman']))

# ------------------

Output:

[('gynecologist', 0.7093892097473145),
 ('nurse', 0.647728681564331),
 ('doctors', 0.6471461057662964),
 ('physician', 0.64389967918396),
 ('pediatrician', 0.6249487996101379),
 ('nurse_practitioner', 0.6218312978744507),
 ('obstetrician', 0.6072014570236206),
 ('ob_gyn', 0.5986712574958801),
 ('midwife', 0.5927063226699829),
 ('dermatologist', 0.5739566683769226)]

[('physician', 0.6463665962219238),
 ('doctors', 0.5858404040336609),
 ('surgeon', 0.5723941326141357),
 ('dentist', 0.552364706993103),
 ('cardiologist', 0.5413815975189209),
 ('neurologist', 0.5271126627922058),
 ('neurosurgeon', 0.5249835848808289),
 ('urologist', 0.5247740149497986),
 ('Doctor', 0.5240625143051147),
 ('internist', 0.5183224081993103)]

In the first analogy we see "nurse", which is a biased analogy.

Question 2.8: Thinking About Bias

What causes bias in word vectors?

Because the dataset itself is biased: the vectors simply reflect the co-occurrence statistics of the training corpus, including its biased associations.

