[Paper Notes] AS Reader vs Stanford Attentive Reader

Original post

changreal

2020-02-21 05:55

Attention Sum Reader Network

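Datasets

CNN & Daily Mail

Each article serves as the document. One entity is removed from the article's summary, and the summary with the blank serves as the question. The removed entity is the answer, and every entity in the document is a candidate answer. In each sample, all named entities in the text are replaced with markers such as "@entity1", and the assignment of markers to entities is randomly shuffled.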
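The anonymization step described above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual preprocessing script: the function name `anonymize`, the `@placeholder` token, and the assumption that every entity is a single token are all hypothetical simplifications.

```python
import random

def anonymize(doc_tokens, question_tokens, answer, entities, rng):
    """Replace entities with @entityN markers, shuffled per sample.

    Assumption: each entity is exactly one token. A fresh random
    assignment per sample means a marker carries no meaning across samples.
    """
    ids = list(range(len(entities)))
    rng.shuffle(ids)
    mapping = {ent: f"@entity{i}" for ent, i in zip(sorted(entities), ids)}

    def replace(tokens):
        return [mapping.get(t, t) for t in tokens]

    # Every entity that occurs in the document is a candidate answer.
    candidates = sorted({mapping[t] for t in doc_tokens if t in mapping})
    return replace(doc_tokens), replace(question_tokens), mapping[answer], candidates

doc = "Alice met Bob in Paris . Bob left Paris".split()
question = "@placeholder left Paris".split()
entities = {"Alice", "Bob", "Paris"}
d, q, a, cands = anonymize(doc, question, "Bob", entities, random.Random(0))
print(d, q, a, cands)
```

Because the markers are reassigned for every sample, a model cannot memorize facts about a particular entity and has to rely on the document itself.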

Children's Book Test (CBT)

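From each children's story, 20 consecutive sentences are taken as the document, and the 21st sentence serves as the question. One word is removed from that sentence (a named entity or a common noun in the two CBT subsets the AS Reader paper evaluates on), and the removed word is the answer.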
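As a rough sketch of how such a sample is built, assuming the story is already split into sentences and tokenized by whitespace (the function name `make_cbt_example` and the blank token passed as `blank` are illustrative choices, not part of the post):

```python
def make_cbt_example(sentences, start, answer_word, blank="XXXXX"):
    """Build one cloze sample: 20 context sentences plus a 21st with a blank."""
    context = sentences[start:start + 20]
    tokens = sentences[start + 20].split()
    if answer_word not in tokens:
        raise ValueError("answer word must occur in the 21st sentence")
    # Blank out the first occurrence of the answer word.
    tokens[tokens.index(answer_word)] = blank
    return {
        "document": context,
        "question": " ".join(tokens),
        "answer": answer_word,
    }

story = [f"Sentence {i} ." for i in range(20)] + ["The king smiled ."]
example = make_cbt_example(story, 0, "king")
print(example["question"])
```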

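Model overview

The AS Reader closely resembles the Attentive Reader and, like the Stanford Attentive Reader, is a one-dimensional matching model. The main difference is in the final answer-selection step, which applies a mechanism called Pointer Sum Attention. The model architecture is shown in the figure below: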

Model details

s_i is the probability that the answer to query q appears at position i in document d.

Comparison with the Attentive Reader:

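The attention layer uses dot-product attention, which has fewer parameters than the Attentive Reader's attention. The attention weight is s_i ∝ exp(f_i(d) · g(q)), where f_i(d) is the contextual embedding of the document token at position i and g(q) is the embedding of the query.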
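In a one-dimensional matching model, the attention score of each document token is read directly as the probability that this token is the answer, given the question's context vector. After softmax normalization over positions, the model sums the scores of all occurrences of the same word, and the word with the highest total is the answer (this is what the authors call Pointer Sum Attention).

The model's structure and attention computation are noticeably simpler than the Attentive Reader's, yet it achieves better results.

Pointer Sum Attention also implies that the more often a word occurs in the document, the more likely it is to be chosen as the answer, because it accumulates more attention mass. The experimental results suggest that this bias is reasonable, and it matches how most reading-comprehension questions work.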
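The summation step can be illustrated with a small sketch in plain Python. The toy vectors below stand in for the encoder outputs f_i(d) and g(q); the names `attention_sum`, `doc_vecs`, and `query_vec` are made up for this example, and using the summed attention directly as P(a|q, d) in the loss is a simplifying assumption.

```python
import math
from collections import defaultdict

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_sum(doc_tokens, doc_vecs, query_vec, candidates):
    """Sum the softmax attention over every occurrence of each candidate."""
    s = softmax([dot(f_i, query_vec) for f_i in doc_vecs])
    totals = defaultdict(float)
    for token, s_i in zip(doc_tokens, s):
        if token in candidates:
            totals[token] += s_i
    return dict(totals)

doc_tokens = ["@entity1", "visited", "@entity2", "and", "@entity1"]
doc_vecs = [[1.0, 0.0], [0.0, 0.0], [1.5, 0.0], [0.0, 0.0], [1.0, 0.0]]
query_vec = [1.0, 0.0]

totals = attention_sum(doc_tokens, doc_vecs, query_vec, {"@entity1", "@entity2"})
prediction = max(totals, key=totals.get)
loss = -math.log(totals["@entity1"])  # -log P(a|q, d) if @entity1 is the gold answer
print(prediction, totals, loss)
```

In this toy case `@entity2` has the single highest attention score, but `@entity1` occurs twice and wins after summation, which is the frequency effect described above.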

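Experimental setup

Optimizer: Adam
Learning rate: 0.001, 0.0005
Loss function: -log P(a|q, d)
Embedding weight matrix initialization range: [-0.1, 0.1]
GRU weight initialization: random orthogonal matrices
GRU bias initialization: 0
Batch size: 32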
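The initialization settings listed above can be sketched without any deep learning framework. This is only an illustration under stated assumptions: the helper names are invented, the matrix sizes are toy values, and the orthogonal matrix is built with Gram-Schmidt on Gaussian vectors for a square matrix only, which may differ from what the authors' code does.

```python
import random

def uniform_matrix(rows, cols, low, high, rng):
    """Embedding-style init: every entry drawn uniformly from [low, high]."""
    return [[rng.uniform(low, high) for _ in range(cols)] for _ in range(rows)]

def random_orthogonal(n, rng):
    """Random n x n matrix with orthonormal rows (Gram-Schmidt on Gaussian vectors)."""
    basis = []
    while len(basis) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for b in basis:
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:  # skip the (very unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return basis

rng = random.Random(0)
embedding = uniform_matrix(5, 3, -0.1, 0.1, rng)  # toy vocabulary of 5, dimension 3
gru_weight = random_orthogonal(4, rng)            # toy hidden size of 4
gru_bias = [0.0] * 4
```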

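Experimental results

The figure below shows the results of the model comparison experiments.

Other notes

Pointer sum attention here uses attention as a pointer over discrete tokens in the context document, and then directly sums each word's attention across all of its occurrences.

In other words, the softmax outputs at every position where a candidate answer word occurs in the document are added up.

This differs from how attention is used in seq2seq models (to blend words from the context into an answer representation). The use of attention here was inspired by Pointer Networks (Ptr-Nets).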

Attentive and Impatient Readers

The paper compares the AS Reader with the Attentive Reader.

It mentions Chen et al.

It mentions Memory Networks (MemNNs).

Stanford Attentive Reader

Paper:

A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task

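Reference 1: https://www.imooc.com/article/28801

Reference 2: https://www.cnblogs.com/sandwichnlp/p/11811396.html#model-2-attentive-sum-reader

Source code: https://github.com/danqi/rc-cnn-dailymail

Source code walkthrough: http://www.imooc.com/article/29397

Results: better than the AS Reader and the Attentive Reader.

Model introduction

Machine reading comprehension (MRC) with a deep neural network
MRC with a boosted decision tree forest

Dataset: CNN & Daily Mail

MRC model based on a boosted decision tree forest

Feature engineering is used to build a feature vector f_{p,q}(e) for each entity e. The features include whether the entity occurs, where it occurs, its frequency, n-gram match features, word-distance features, dependency-parse features, sentence co-occurrence features, and so on.

Machine reading comprehension is treated as a ranking problem, and the LambdaMART implementation in the RankLib package is used to build the boosted decision tree forest.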
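To make the feature vector concrete, here is a minimal sketch that computes three of the simpler features (occurrence, first position, frequency) for each candidate entity. The function name and feature keys are hypothetical, and the ranking step itself is left out: in the paper it is done by LambdaMART in RankLib, which this sketch does not attempt to reproduce.

```python
def entity_features(entity, passage_tokens):
    """A few hand-built features for one candidate entity e."""
    positions = [i for i, t in enumerate(passage_tokens) if t == entity]
    return {
        "occurs": 1.0 if positions else 0.0,
        # -1.0 is an arbitrary sentinel for "never occurs" (an assumption).
        "first_position": float(positions[0]) if positions else -1.0,
        "frequency": float(len(positions)),
    }

passage = "@entity1 met @entity2 . @entity1 left".split()
candidates = ["@entity1", "@entity2", "@entity3"]
features = {e: entity_features(e, passage) for e in candidates}
for e, f in features.items():
    print(e, f)
```

A ranker would then be trained on one such vector per candidate, with the correct entity labeled as the relevant item for its passage and question pair.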

Deep learning model: Stanford Attentive Reader

Encoding layer

The encoding step of the Stanford Attentive Reader is essentially the same as in the AS Reader: the document and the question are encoded in much the same way.

Attention layer

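Unlike the AS Reader, which takes a dot product, the Stanford Attentive Reader uses a bilinear function as the matching function: the score of position i is q^T W p_i, normalized with a softmax over positions. The resulting weights are then used to form a weighted sum of the contextual token embeddings p_i, and that output vector is used to score the candidate entities. The bilinear function measures the similarity between q and p_i and is more flexible than a dot product.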
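The difference between the two matching functions can be seen in a small sketch. The vectors and the matrix `W` below are toy values chosen by hand, and the function names are illustrative; in the real model `W` is learned and the vectors come from the encoders.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def dot_attention(query_vec, doc_vecs):
    """AS Reader style matching: score_i = q . p_i"""
    return softmax([dot(query_vec, p_i) for p_i in doc_vecs])

def bilinear_attention(query_vec, doc_vecs, W):
    """Stanford Attentive Reader style matching: score_i = q^T W p_i"""
    return softmax([dot(query_vec, matvec(W, p_i)) for p_i in doc_vecs])

q = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]  # toy W that compares q with the other dimension

alpha_dot = dot_attention(q, doc_vecs)
alpha_identity = bilinear_attention(q, doc_vecs, identity)
alpha_swap = bilinear_attention(q, doc_vecs, swap)
print(alpha_dot, alpha_identity, alpha_swap)
```

With `W` set to the identity, the bilinear score reduces to the dot product, so the dot-product matcher is a special case. A different `W` changes which document positions look similar to the question, which is the extra flexibility the bilinear form provides.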

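The matching function in the attention layer is what differs. This suggests that, at the time, machine reading comprehension models on CNN & Daily Mail were otherwise largely alike, and the important research question was the choice of matching function.

Differences from the Attentive Reader, noted for the record: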

Experimental setup

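Optimizer: SGD
Word embedding dimension: 100 (pretrained 100-dimensional GloVe vectors)
Learning rate: 0.1
Loss function: -log P(a|q, d)
GRU weight initialization: Gaussian distribution N(0, 0.1)
Hidden size h: 128 for CNN, 256 for Daily Mail
Attention layer weight matrix initialization range: [-0.01, 0.01]
Batch size: 32
Dropout: 0.2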
