[Paper Notes] Bi-DAF (to be revised): BI-DIRECTIONAL ATTENTION FLOW FOR MACHINE COMPREHENSION

Original

changreal

2020-06-16 02:20

0 Abstract

BiDAF represents the context at different levels of granularity.
It uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.

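1 Introduction

Attention mechanisms in previous work typically have one or more of the following characteristics.

The computed attention weights are usually used to extract the most relevant information from the context for answering the question, by summarizing the context into a fixed-size vector.
In the text domain, they are often temporally dynamic: the attention weights at the current time step are a function of the attended vector at the previous time step.
They are usually uni-directional, with the query attending to the context paragraph or the image.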

The paper proposes the Bi-Directional Attention Flow (BIDAF) network, a hierarchical multi-stage architecture for modeling representations of the context paragraph at different levels of granularity. It includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation.

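The attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, attention is computed at every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization.
Attention is applied in both directions, query-to-context and context-to-query, which provide complementary information to each other.
The model uses a memory-less attention mechanism. When attention is computed iteratively through time, the attention at each time step is a function of only the query and the context paragraph at the current time step, and does not directly depend on the attention at the previous time step. The authors hypothesize that this simplification leads to a division of labor between the attention layer and the modeling layer: it forces the attention layer to focus on learning the attention between the query and the context, and lets the modeling layer focus on learning the interaction within the query-aware context representation (the output of the attention layer). According to the paper, memory-less attention gives a clear advantage over dynamic attention.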

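2 Model

The model consists of six layers:

Character Embedding Layer: maps each word to a vector space using character-level CNNs.
Word Embedding Layer: maps each word to a vector space using a pre-trained word embedding model.
Contextual Embedding Layer: uses contextual cues from surrounding words to refine the word embeddings.
The three layers above are applied to both the context and the query.
Attention Flow Layer: couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context.
Modeling Layer: uses an RNN to scan the context.
Output Layer: provides an answer to the query.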

Details

Character Embedding Layer
Let {x1, ..., xT} and {q1, ..., qJ} represent the words in the input context paragraph and the query, respectively. A character-level embedding of each word is obtained with a CNN.
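To make the character-level step concrete, here is a minimal NumPy sketch. It assumes the usual recipe of a 1D convolution over character vectors followed by max pooling over positions, so that words of any length map to a vector of the same size. The function name `char_cnn_embed`, the table sizes, and the filter width are illustrative assumptions, not values taken from these notes.

```python
import numpy as np

def char_cnn_embed(char_ids, char_table, filters):
    """Embed one word from its characters (hypothetical helper).

    char_ids:   (L,) integer character indices of the word
    char_table: (V, c) character embedding table
    filters:    (n_filters, width, c) convolution filters
    returns:    (n_filters,) fixed-size word vector
    """
    chars = char_table[char_ids]                       # (L, c)
    n_filters, width, _ = filters.shape
    # Zero-pad short words so that at least one window fits.
    if chars.shape[0] < width:
        pad = np.zeros((width - chars.shape[0], chars.shape[1]))
        chars = np.vstack([chars, pad])
    n_windows = chars.shape[0] - width + 1
    windows = np.stack([chars[i:i + width] for i in range(n_windows)])  # (P, width, c)
    feature_maps = np.einsum("pwc,fwc->pf", windows, filters)           # (P, n_filters)
    # Max over positions: one value per filter, independent of word length.
    return feature_maps.max(axis=0)

rng = np.random.default_rng(0)
char_table = rng.normal(size=(30, 8))   # 30 characters, 8-dim character vectors
filters = rng.normal(size=(16, 5, 8))   # 16 filters of width 5
short_word = char_cnn_embed(np.array([3, 1, 20]), char_table, filters)
long_word = char_cnn_embed(np.array([2, 15, 13, 16, 18, 5, 8, 5, 14]), char_table, filters)
print(short_word.shape, long_word.shape)
```

Both words come out with one value per filter, which is what lets the result be concatenated with a fixed-size word embedding afterwards.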
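Word Embedding Layer
Maps each word to a high-dimensional vector space. Pre-trained GloVe vectors provide a fixed word embedding for each word. The character and word embedding vectors are concatenated and fed into a two-layer highway network. The outputs of the highway network are two sequences of d-dimensional vectors, or equivalently two matrices: X (d × T) for the context and Q (d × J) for the query.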
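The concatenation and the two-layer highway network can be sketched as below. This is a rough illustration under assumptions: the ReLU transform, the random weights, and the names `highway_layer` and `embed` are placeholders of mine, and the dimensions are toy values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer; x is (d, N) with one word vector per column."""
    h = np.maximum(0.0, W_h @ x + b_h[:, None])   # candidate transform (ReLU assumed)
    t = sigmoid(W_t @ x + b_t[:, None])           # transform gate in (0, 1)
    return t * h + (1.0 - t) * x                  # gate mixes transform and carry

def embed(char_vecs, word_vecs, layers):
    """Concatenate character and word vectors, then apply the highway layers."""
    x = np.concatenate([char_vecs, word_vecs], axis=0)   # (d, N)
    for params in layers:
        x = highway_layer(x, *params)
    return x

rng = np.random.default_rng(0)
T, d_char, d_word = 7, 4, 6
d = d_char + d_word
layers = [(rng.normal(size=(d, d)), np.zeros(d),
           rng.normal(size=(d, d)), np.zeros(d)) for _ in range(2)]
X = embed(rng.normal(size=(d_char, T)), rng.normal(size=(d_word, T)), layers)
print(X.shape)   # one d-dimensional column per context word
```

Running the same function on the query words would give Q with J columns. A highway layer keeps the input dimension, which is why X has d rows.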
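Contextual Embedding Layer
On top of the embeddings provided by the previous layers, an LSTM models the temporal interactions between words. The LSTM is run in both directions, and the outputs of the two LSTMs are concatenated, which gives H (2d × T) from the context vectors X and U (2d × J) from the query vectors Q.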
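A small NumPy sketch of the bi-directional LSTM step, mainly to show where the factor 2d comes from. The gate layout, the initialization, and the names `lstm` and `bilstm` are assumptions for illustration. A real model would use a framework's LSTM with trained weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm(X, W, b):
    """Single-direction LSTM over the columns of X.

    X: (d_in, T), W: (4*d, d_in + d), b: (4*d,). Returns hidden states (d, T).
    """
    d = b.shape[0] // 4
    h, c = np.zeros(d), np.zeros(d)
    outputs = []
    for t in range(X.shape[1]):
        z = W @ np.concatenate([X[:, t], h]) + b
        i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
        g = np.tanh(z[3 * d:])
        c = f * c + i * g          # update the cell state
        h = o * np.tanh(c)         # expose the gated hidden state
        outputs.append(h)
    return np.stack(outputs, axis=1)

def bilstm(X, fwd, bwd):
    """Run one LSTM left to right, one right to left, and stack the outputs."""
    forward = lstm(X, *fwd)
    backward = lstm(X[:, ::-1], *bwd)[:, ::-1]   # re-align with the original order
    return np.concatenate([forward, backward], axis=0)   # (2d, T)

rng = np.random.default_rng(0)
d_in, d, T = 5, 5, 7

def init():
    return rng.normal(scale=0.1, size=(4 * d, d_in + d)), np.zeros(4 * d)

fwd, bwd = init(), init()
X = rng.normal(size=(d_in, T))
H = bilstm(X, fwd, bwd)
print(H.shape)
```

Each column of H stacks a forward state that has seen the words up to position t and a backward state that has seen the words from position t onward.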

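The first three layers all compute features of the query and the context at different levels of granularity, similar to the multi-scale features of CNNs.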

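Attention Flow Layer
Links and fuses information from the context and the query words. The attention vector at each time step, along with the embeddings from previous layers, is allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization.
Attention is computed in two directions: from context to query and from query to context.
S_tj denotes the similarity between the t-th context word and the j-th query word. In the paper, the similarity matrix S (T × J) is computed as S_tj = α(H_:t, U_:j), where H_:t and U_:j are the contextual embeddings of the t-th context word and the j-th query word, and α(h, u) = w_(S)^T [h; u; h ∘ u] with a trainable weight vector w_(S) and ∘ denoting elementwise multiplication.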
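As a sketch of the similarity computation, the snippet below follows the paper's form S_tj = w^T [h; u; h ∘ u], where h is the t-th column of the context embedding H (2d × T) and u is the j-th column of the query embedding U (2d × J). The function name and the random inputs are illustrative assumptions. Splitting w into three parts avoids building the full T × J × 6d tensor.

```python
import numpy as np

def similarity_matrix(H, U, w):
    """H: (2d, T), U: (2d, J), w: (6d,). Returns S of shape (T, J)."""
    two_d = H.shape[0]
    w_h, w_u, w_hu = w[:two_d], w[two_d:2 * two_d], w[2 * two_d:]
    term_h = (H.T @ w_h)[:, None]            # depends on the context word only
    term_u = (U.T @ w_u)[None, :]            # depends on the query word only
    term_hu = (H * w_hu[:, None]).T @ U      # weighted elementwise interaction
    return term_h + term_u + term_hu

rng = np.random.default_rng(0)
d, T, J = 3, 5, 4
H = rng.normal(size=(2 * d, T))
U = rng.normal(size=(2 * d, J))
w = rng.normal(size=6 * d)
S = similarity_matrix(H, U, w)
print(S.shape)
```

Because S depends only on H and U at the current positions, nothing here carries state from one time step to the next, which is the memory-less property mentioned in the introduction.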
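- Context-to-query Attention.
  Context-to-query (C2Q) attention indicates which query words are most relevant to each context word.
- Query-to-context Attention.
  Query-to-context (Q2C) attention indicates which context words have the closest similarity to one of the query words and are therefore critical for answering the query.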
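The two directions can be sketched from a similarity matrix S (T × J) as follows. This follows my reading of the paper's formulation: C2Q takes a softmax over each row of S and averages the query vectors, while Q2C takes a softmax over the row-wise maxima of S, averages the context vectors, and tiles the result T times. The function names and random inputs are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def c2q_attention(S, U):
    """For every context word, a weighted sum of query vectors. Returns (2d, T)."""
    a = softmax(S, axis=1)        # (T, J): each row sums to 1 over query words
    return U @ a.T

def q2c_attention(S, H):
    """One attended context vector, repeated for every context position. Returns (2d, T)."""
    b = softmax(S.max(axis=1))    # (T,): weight of each context word
    h_tilde = H @ b               # (2d,)
    return np.tile(h_tilde[:, None], (1, H.shape[1]))

rng = np.random.default_rng(0)
d, T, J = 3, 5, 4
H = rng.normal(size=(2 * d, T))
U = rng.normal(size=(2 * d, J))
S = rng.normal(size=(T, J))
U_tilde = c2q_attention(S, U)
H_tilde = q2c_attention(S, H)
print(U_tilde.shape, H_tilde.shape)
```

Both outputs have one column per context word, so neither direction compresses the context into a single fixed-size vector before the modeling layer sees it.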
Finally, the contextual embeddings and the attention vectors are combined to produce G, where each column vector can be regarded as the query-aware representation of each context word.
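One simple way to combine them, which I believe is the concatenation the paper reports using, is G_:t = [h; ũ; h ∘ ũ; h ∘ h̃]. The sketch below assumes that form. The names `combine`, `U_tilde`, and `H_tilde` (the C2Q and Q2C outputs, each 2d × T) are illustrative.

```python
import numpy as np

def combine(H, U_tilde, H_tilde):
    """Stack context embeddings with both attention outputs. Returns (8d, T)."""
    return np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)

rng = np.random.default_rng(0)
d, T = 3, 5
H = rng.normal(size=(2 * d, T))         # contextual embeddings of the context
U_tilde = rng.normal(size=(2 * d, T))   # stand-in for the C2Q output
H_tilde = rng.normal(size=(2 * d, T))   # stand-in for the Q2C output
G = combine(H, U_tilde, H_tilde)
print(G.shape)
```

The original embeddings H are kept as the first 2d rows of G, so the earlier representation flows on to the next layer together with the attention information.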
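Modeling Layer
The input to the modeling layer is G, which encodes the query-aware representations of the context words. The output of the modeling layer captures the interaction among the context words conditioned on the query. A bi-directional LSTM is used, which yields a matrix M (2d × T) that is passed to the output layer to predict the answer. Each column vector of M is expected to contain contextual information about the word with respect to the entire context paragraph and the query.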
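Output Layer
The output layer is application-specific. The modular nature of BIDAF makes it easy to swap out the output layer depending on the task, while the rest of the architecture stays exactly the same.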
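For the question answering task, my understanding of the paper is that the output layer predicts a start and an end index in the context: a softmax over w^T [G; M] for the start, and a softmax over w^T [G; M2] for the end, where M2 comes from passing M through one more bi-directional LSTM. The sketch below assumes that setup and uses random stand-ins for G, M, and M2. The function names are hypothetical, and the span search is one straightforward linear-time way to pick the best pair with start <= end.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def span_distributions(G, M, M2, w_start, w_end):
    """G: (8d, T), M and M2: (2d, T), weights: (10d,). Returns two (T,) distributions."""
    p_start = softmax(w_start @ np.concatenate([G, M], axis=0))
    p_end = softmax(w_end @ np.concatenate([G, M2], axis=0))
    return p_start, p_end

def best_span(p_start, p_end):
    """Return (start, end) with start <= end maximizing p_start[start] * p_end[end]."""
    best, best_score, best_start = (0, 0), -1.0, 0
    for end in range(len(p_end)):
        # Track the most probable start seen so far (positions 0..end).
        if p_start[end] > p_start[best_start]:
            best_start = end
        score = p_start[best_start] * p_end[end]
        if score > best_score:
            best, best_score = (best_start, end), score
    return best

rng = np.random.default_rng(0)
d, T = 2, 6
G = rng.normal(size=(8 * d, T))
M = rng.normal(size=(2 * d, T))
M2 = rng.normal(size=(2 * d, T))
p_start, p_end = span_distributions(G, M, M2,
                                    rng.normal(size=10 * d), rng.normal(size=10 * d))
start, end = best_span(p_start, p_end)
print(start, end)
```

Swapping the task would only change this part. Everything up to G and M stays as described above.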
training
test