[Paper Translation + Notes] Neural Machine Reading Comprehension: Methods and Trends

1 Introduction

Earlier MRC techniques relied on hand-crafted rules or features.
Drawbacks:

They do not generalize.
Performance may degrade on large-scale datasets containing myriad types of articles.
They ignore long-range dependencies and fail to extract contextual information.

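Number of MRC papers by research topic:

A good survey of MRC should:

give concrete definitions of the different MRC tasks
compare them in depth
introduce new trends and open issues

Literature search method

Google Scholar, with the keywords: machine reading comprehension, machine comprehension, reading comprehension
Top conference papers: ACL, EMNLP, NAACL, ICLR, AAAI, IJCAI and CoNLL, from 2015 to 2018
http://arxiv.org/ for the latest pre-print articles

Outline of how this paper covers MRC

Paper structure: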

Four categories of MRC tasks:
cloze tests, multiple choice, span extraction, and free answering.
These tasks are compared along different dimensions. (Section 2)
The general architecture of neural MRC systems:
embeddings, feature extraction, context-question interaction and answer prediction. (Section 3)
Representative datasets and the evaluation metrics used for each task. (Section 4)
Some new trends,
such as knowledge-based MRC, MRC with unanswerable questions, multi-passage MRC and conversational MRC. (Section 5)
Open issues and possible future research directions. (Section 6)

2 Tasks

Definition of MRC:
The authors divide MRC into four categories according to the answer form: cloze tests, multiple choice, span extraction and free answering.

2.1 Cloze Tests

answer A is a word or entity in the given context C;
question Q is generated by removing a word or entity from the given context C such that Q = C − A.
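
The relation Q = C − A can be illustrated with a minimal sketch. The function name `make_cloze` and the `@placeholder` token are assumptions chosen for illustration, not something the notes or the paper prescribe.

```python
def make_cloze(context_tokens, answer_index, placeholder="@placeholder"):
    """Build a cloze question by blanking out one token of the context.

    Returns (question_tokens, answer). The placeholder string is a
    hypothetical choice for this sketch.
    """
    answer = context_tokens[answer_index]
    question = list(context_tokens)      # copy so the context is untouched
    question[answer_index] = placeholder
    return question, answer


context = "the cat sat on the mat".split()
question, answer = make_cloze(context, 1)
print(" ".join(question), "->", answer)
```

The answer is always a token of the context by construction, which is what separates cloze tests from free answering.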

2.2 Multiple Choice

2.3 Span Extraction

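Drawbacks of cloze tests and multiple choice:

Single words or entities are not always enough; some answers require complete sentences.
Some questions have no candidate answers.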

2.4 Free Answering

there are no limitations to its answer forms, and it is more suitable for real application scenarios.

2.5 Comparison of Different Tasks

we evaluated five dimensions: construction, understanding, flexibility, evaluation and application.

Because of the flexibility of the answer form, it is somewhat hard to build datasets, and how to effectively evaluate performance on these tasks remains a challenge.

3 Deep-Learning-Based Methods

3.1 General Architecture

A typical neural MRC system
contains four core modules: embeddings, feature extraction, context-question interaction and answer prediction.

Linguistic features such as part-of-speech tags, named entities and question category are combined with word representations (one-hot or word2vec) to express the semantic and syntactic information in words.

3.2 Typical Deep-Learning Methods

Components of a typical MRC system and the deep-learning methods involved:

3.2.1 Embeddings

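In existing MRC models, word representation methods fall into two groups: conventional word representation and pre-trained contextualized representation.
To encode sufficient semantic and linguistic information, multiple granularities (word-level and character-level embeddings, part-of-speech tags, named entities, word frequency, question category and so on) are also added to MRC systems.

Conventional Word Representation
Pre-trained Contextualized Word Representation

CoVe
CoVe is the LSTM encoder of a seq2seq machine translation (MT) model.
The outputs of the MT encoder (regarded as CoVe) are concatenated with word embeddings pre-trained by GloVe to represent the context and the question, which are then fed through the coattention and dynamic decoder implemented in a dynamic coattention network (DCN).
ELMo
Example: an improved version of bidirectional attention flow (Bi-DAF) + ELMo.
It is easy to integrate into existing models, but it is limited by the insufficient feature extraction capability of LSTMs.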
Generative pre-training (GPT)
a semi-supervised approach combining unsupervised pre-training and supervised fine-tuning.
The transformer architecture used in GPT and GPT-2 is unidirectional (left-to-right), so it cannot incorporate context from both directions.
In terms of MRC problems such as multiple choice, concatenate the context and the question with each possible answer and process such sequences with transformer networks. Finally, they produce an output distribution over possible answers to predict correct answers.
BERT
In particular, for MRC tasks, BERT is so competitive that using it with a simple answer prediction approach shows promise.
Drawback: BERT's pre-training process is time- and resource-consuming, which makes it nearly impossible to pre-train without abundant computational resources.

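Multiple Granularity
Word-level embeddings pre-trained with word2vec or GloVe cannot encode enough syntactic and linguistic information (such as part-of-speech, affixes and grammar). To integrate fine-grained information into word representations, the following methods encode the context and the question at different levels of granularity: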

Character Embeddings
Seo et al. [75] add character-level embeddings to their Bi-DAF model for the MRC task. Character-level embeddings can be obtained and combined in several ways:
(1) With CNNs: the concatenation of word-level and character-level embeddings is then fed to the next module as input.
(2) With bidirectional LSTMs: the outputs of the last hidden state are taken as the character-level representation of a word.
(3) With a fine-grained gating mechanism: word-level and character-level embeddings are combined dynamically rather than by simple concatenation, to mitigate the imbalance between frequent and infrequent words.
Part-of-Speech Tags
Labeling POS tags in NLP tasks captures the complex characteristics of word usage and helps with disambiguation. To translate POS tags into fixed-length vectors, they are regarded as variables, randomly initialized in the beginning and updated while training.
Name-Entity Tags: embedding the name-entity tags of context words can improve the accuracy of answer prediction. The method is similar to that for POS tags.
Binary Feature of Exact Match (EM)
This feature measures whether a context word is in the question. Some researchers use it in the embedding module to enrich word representations.
Query-Category
The types of questions (what, where, who, when, why, how) can usually provide clues to search for the answer. The query-category embeddings are often added to the query word embeddings.

The embeddings introduced above can be combined freely in the embedding module. For example, a word representation may contain character-level, word-level, POS tag, name-entity tag, EM and query-category embeddings.
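
As a rough illustration of combining several granularities, the sketch below concatenates a word vector, a POS-tag vector and the binary exact-match feature for each context token. All names (`exact_match_feature`, `build_inputs`), the toy lookup tables and the lowercasing step are assumptions made for this example; real systems learn these embeddings.

```python
def exact_match_feature(context_tokens, question_tokens):
    """1.0 if the context token also occurs in the question, else 0.0.

    Lowercasing before comparison is an assumption of this sketch.
    """
    question_vocab = {t.lower() for t in question_tokens}
    return [1.0 if t.lower() in question_vocab else 0.0 for t in context_tokens]


def build_inputs(context_tokens, question_tokens, pos_tags, word_emb, pos_emb):
    """Concatenate word embedding + POS embedding + EM flag per token."""
    em = exact_match_feature(context_tokens, question_tokens)
    return [
        list(word_emb[tok]) + list(pos_emb[tag]) + [flag]
        for tok, tag, flag in zip(context_tokens, pos_tags, em)
    ]


# Toy lookup tables (hypothetical values).
word_emb = {"Paris": [0.1, 0.2], "is": [0.0, 0.3], "nice": [0.5, 0.5]}
pos_emb = {"NNP": [1.0], "VBZ": [0.0], "JJ": [0.5]}

inputs = build_inputs(
    ["Paris", "is", "nice"], ["Where", "is", "Paris"],
    ["NNP", "VBZ", "JJ"], word_emb, pos_emb,
)
print(inputs)
```

Simple concatenation is used here; the fine-grained gating variant would instead mix the vectors with learned weights.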

3.2.2 Feature Extraction

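Feature extraction is usually placed after the embedding layer to extract features of the context and the question separately. It focuses on mining sentence-level contextual information.

RNN
Many researchers use bidirectional RNNs to encode the embeddings of the context and the question in MRC.
For questions, feature extraction with bidirectional RNNs can be word-level or sentence-level; the sentence-level variant encodes the whole question sentence.
The context in MRC is usually a long sequence, so word-level feature extraction is used to encode its sequential information.
RNN processing is time-consuming and cannot be parallelized.
CNN
When applied to NLP tasks, 1D CNNs show their superiority in mining local contextual information with sliding windows.
Advantages of CNNs: they can be parallelized, and they extract local information without being limited by vocabulary size (there is no need to represent every n-gram in the vocabulary). However, CNNs cannot handle long sequences well.
Transformer
It excels at alignment, can run in parallel, takes less time to train, and pays more attention to global dependencies.
For example, QANet is a representative MRC model that uses the transformer.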

To accelerate the training process, some researchers substitute RNNs with CNNs or the transformer.

3.2.3 Context-Question Interaction

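This module extracts the correlation between the context and the question so that the model can find evidence for predicting the answer.
Depending on how models extract correlations, existing work can be divided into one-hop and multi-hop interaction.
The attention mechanism plays a key role.
In MRC, attention can be unidirectional or bidirectional.

Unidirectional Attention
The single attention flow always goes from the query to the context, highlighting the parts of the context most relevant to the question.
**Related model: Attentive Reader.** Worth a look; early MRC models used unidirectional attention.
However, unidirectional attention cannot attend to question words that are just as critical for answer prediction, so it is not sufficient for extracting the mutual information between the context and the query.
Bidirectional Attention
It computes not only query-to-context attention but also context-to-query attention. This benefits from the interaction between the context and the query and provides complementary information.
A pair-wise matching matrix M is built from matching scores; a column-wise softmax over M gives the query-to-context (q2c) attention weights, and a row-wise softmax gives the context-to-query (c2q) attention weights.
Typical MRC models with bidirectional attention: AoA Reader, DCN, Bi-DAF.
One-Hop Interaction
One-hop interaction is a shallow architecture in which the interaction between the context and the question is computed only once. Early context-query interaction used this one-hop structure, and it appears in many MRC systems such as the Attentive Reader, the Attention Sum (AS) Reader and the AoA Reader. When a question requires reasoning over multiple sentences of the context, one-hop interaction struggles to predict the correct answer.
Multi-Hop Interaction
It tries to mimic the rereading behaviour of humans by computing the interaction between the context and the question more than once. Whether the previous hidden states (the context and question already read) can be stored effectively directly affects the performance of the next interaction.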
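
The bidirectional attention weights described above can be sketched in a few lines. This is a simplified illustration that assumes a plain dot product as the matching score and rows of M indexed by context tokens; actual models such as Bi-DAF or the AoA Reader use learned scoring functions. The names `bidirectional_attention`, `q2c` and `c2q` are chosen for this example.

```python
import math


def softmax(xs):
    m = max(xs)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bidirectional_attention(context, query):
    """context: list of n vectors, query: list of m vectors.

    M[i][j] is the matching score of context token i and query token j.
    """
    M = [[dot(c, q) for q in query] for c in context]
    # Column-wise softmax: for each query token, a distribution over context tokens.
    q2c = [softmax([M[i][j] for i in range(len(context))]) for j in range(len(query))]
    # Row-wise softmax: for each context token, a distribution over query tokens.
    c2q = [softmax(row) for row in M]
    return M, q2c, c2q


context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [[1.0, 0.0], [0.0, 2.0]]
M, q2c, c2q = bidirectional_attention(context, query)
print(M)
```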

Three ways to perform multi-hop interaction:
(1) The first method computes the similarity between the context and the question based on the previous attentive representations of the context.
Reference model: the Impatient Reader. Each time a question token is read, the query-aware context representations are updated dynamically. This mimics how humans reread the context with the question in mind.

(2) The second method introduces external memory slots to store previous memories.
Representative model: memory networks.
Advantages: they store long-term memories explicitly and make them easy to read back. With multiple turns of interaction, MRC models can reach a deeper understanding of the context and the question.
Drawback: the network is hard to train via back-propagation.
Improvement: the end-to-end version of memory networks.
Description: explicit memory storage is embedded with continuous representations, and the process of reading and updating memories is modeled by neural networks.
Advantages: it reduces the supervision needed during training and is applicable to more tasks.

The ability to update memory over multiple hops has made memory networks popular in MRC systems.
Typical model: the MEMEN model,
which stores ① question-aware context representations, ② context-aware question representations and ③ candidate answer representations in memory slots, and ④ updates them dynamically.
Typical model: the one in [107],
which uses external memory slots to store question-aware context representations and updates the memories with bidirectional GRUs.

(3) The third method takes advantage of the recurrence of RNNs, using hidden states to store previous interaction information.
[91] Idea: an RNN with the match-LSTM architecture.

Other MRC models such as R-NET and the IA Reader also use RNNs to update query-aware context representations and thereby achieve multi-hop interaction.

In any case, efficient context-query interaction deserves close attention. The gate mechanism is also an important component of multi-hop interaction.

Models that involve a gate mechanism:

GA Reader (gated-attention reader): it uses a gate mechanism to decide how question information affects the attention paid to context words when the context representations are updated. The gated attention is performed by element-wise multiplication between the query embeddings and the intermediate representations of the context.
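
A minimal sketch of one gated-attention step follows. It assumes that each context token first attends over the query tokens with dot-product scores, and that the resulting token-specific query vector is then multiplied element-wise with the token's intermediate representation. The function name `gated_attention` and the toy vectors are illustrative only.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_attention(context, query):
    """Return gated context representations, one vector per context token."""
    gated = []
    for d in context:
        # Attention of this context token over all query tokens.
        weights = softmax([dot(q, d) for q in query])
        # Token-specific query summary (weighted sum of query vectors).
        q_tilde = [sum(w * q[k] for w, q in zip(weights, query)) for k in range(len(d))]
        # Gate: element-wise multiplication with the context representation.
        gated.append([dk * qk for dk, qk in zip(d, q_tilde)])
    return gated


out = gated_attention(context=[[1.0, 1.0]], query=[[2.0, 3.0]])
print(out)
```

In a multi-hop reader this step would be repeated after each recurrent layer, so the question can reshape the context representation several times.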

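In contrast to the GA Reader, the mechanism in [78] can extract evidence from the context and the question alternately. The question is updated according to previous search states, the context representations are refined with previously inferred information as the queries are updated, and a feed-forward gate mechanism then decides how well the context and the query match.

Earlier models ignore the fact that context words differ in importance when answering a question. R-NET therefore introduces a gate mechanism to filter out unimportant parts of the context and emphasize those most relevant to the question. R-NET can be seen as a variant of attention-based RNNs: compared with match-LSTM, it adds a gate based on the current context representation and the context-aware question representation. Although it is an RNN-based model (with insufficient memory), it adds self-attention over the context itself and therefore copes well with long documents.

Summary: one-hop interaction cannot comprehensively capture the mutual question-context information. Multi-hop interaction, which keeps a memory of previous contexts and questions, can extract correlations in depth and aggregate evidence for answer prediction.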

3.2.4 Answer Prediction

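The implementation of answer prediction is highly task-specific. There are four kinds of answer predictors: word predictor, option selector, span extractor and answer generator.

(1) Word Predictor

Early work used query-aware context representations to match candidate answers. A typical example is the Attentive Reader.
This method predicts from attentive context representations, but it cannot guarantee that the answer appears in the context.
For example, with pre-trained word2vec embeddings, a word such as "february" may end up being predicted as the answer.

To address the problem that the predicted answer may not be in the context, the AS Reader was proposed, inspired by pointer networks.
The AS Reader does not compute attentive representations; instead, it uses the attention weights directly to predict the answer. The attention results of the same word are added together, and the word with the maximum sum is the answer. The method is simple but very effective for cloze tests.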
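
The attention-sum idea can be sketched as follows. The function name `attention_sum`, the optional `candidates` filter and the toy attention weights are assumptions for illustration; in the real model the weights come from the trained reader.

```python
from collections import defaultdict


def attention_sum(context_tokens, attention_weights, candidates=None):
    """Sum the attention mass of every occurrence of each word.

    Returns (best_word, scores). If `candidates` is given, only those
    words are scored.
    """
    scores = defaultdict(float)
    for token, weight in zip(context_tokens, attention_weights):
        if candidates is None or token in candidates:
            scores[token] += weight
    best = max(scores, key=scores.get)
    return best, dict(scores)


tokens = ["mary", "went", "john", "mary"]
weights = [0.30, 0.10, 0.35, 0.25]
answer, scores = attention_sum(tokens, weights, candidates={"mary", "john"})
print(answer, scores)
```

Note that "john" has the single largest weight, yet "mary" wins because its two occurrences are pooled. The answer is always a context word, which is the point of the method.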

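(2) Option Selector

The common approach is to measure the similarity between the attentive context representations and each candidate answer, and to select the most similar candidate as the correct answer.
Other approaches:
[4] uses CNNs to encode question-option tuples and the relevant context sentences, then measures relevance with cosine similarity; the most relevant option is chosen as the answer.
[111] introduces information from the options to help extract the interaction between the context and the question. In the answer prediction module, a bilinear function scores each option according to the attentive information, and the option with the highest score is the predicted answer.
[8] proposes a convolutional spatial attention model. It uses dot products to extract the correlations among the context, the question and the options, computing several similarities among the question-aware candidate representations, the context-aware representations and the self-attended question. These similarities are concatenated and fed to CNNs with different kernel sizes. The CNN outputs are treated as feature vectors and fed to a fully connected layer that computes a score for each candidate.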

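(3) Span Extractor

This can be seen as an extension of cloze tests: a subsequence has to be extracted rather than a single word.
Also inspired by pointer networks [89], **[91]** proposes two models: the sequence model and the boundary model.

The sequence model outputs the positions at which the answer tokens appear in the original context. Answer prediction resembles the decoding process of seq2seq. Answers obtained this way may not be a contiguous span, and they are not guaranteed to be a subsequence of the original context.

In practice
The boundary model solves these problems by predicting only the start and end positions of the answer. It is simpler, performs well on SQuAD, and is widely used in MRC models.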
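
Given start and end probabilities from a boundary model, a common way to decode a span is to search for the pair with the highest joint score subject to start ≤ end. The sketch below assumes the joint score is the product of the two probabilities and adds a `max_len` cap; both choices are assumptions of this example rather than details taken from [91].

```python
def best_span(p_start, p_end, max_len=15):
    """Return ((start, end), score) maximizing p_start[i] * p_end[j], i <= j."""
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        # Only consider ends at or after the start, within max_len tokens.
        for j in range(i, min(len(p_end), i + max_len)):
            score = ps * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


p_start = [0.1, 0.7, 0.2]
p_end = [0.6, 0.1, 0.3]
span, score = best_span(p_start, p_end)
print(span, score)
```

Taking the argmax of each distribution independently would give start 1 and end 0 here, which is not a valid span; the joint search avoids that.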

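However, the boundary model may extract an incorrect answer because of local maxima; [100] proposes a dynamic pointing decoder to address this. It selects the answer span over multiple iterations. An LSTM estimates the start and end positions based on the representations corresponding to the answer predicted in the previous state, and highway maxout networks (HMN, [21] + [79]) compute the scores of context tokens as start and end positions.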

(4) Answer Generator

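The answer is synthesized from the context and the question. Its wording may differ from the context, or it may draw on several pieces from different passages. This places high demands on the answer prediction module. Approaches for generating flexible answers include:
S-NET: an "extraction and then synthesis" process. The extraction module is a variant of R-NET, and the generation module has a seq2seq structure.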
encoder:

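Answers generated by existing methods suffer from syntax errors and logical problems. Generation and extraction methods are therefore usually used together to provide complementary information.
For example, the extraction module of S-NET first labels the approximate boundary of the answer span, and the generation module then produces an answer that is not restricted to the original context.

In practice:
Generation approaches are not very common in current MRC systems, because extraction methods already perform well enough in many cases.

See again the figure for this section: the general MRC architecture and the deep-learning methods it uses.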

3.3 Additional Tricks

Some typical deep-learning tricks fall outside the general MRC architecture but are still important and effective, for example reinforcement learning, the answer ranker and the sentence selector.

3.3.1 Reinforcement Learning

Reinforcement learning can be regarded as an improved approach in MRC systems that is capable not only of reducing the gap between optimization objectives and evaluation metrics (e.g. [101], [28]) but also of determining dynamically whether to stop reasoning (e.g. ReasoNet). With reinforcement learning, the model can be trained and can refine its answers even if some states are discrete.

3.3.2 Answer Ranker

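A ranker module can further improve the accuracy of answer prediction to some extent, and it has inspired researchers to explore unanswerable questions.

[87] combines pointer methods with a ranker. Similar to the AS Reader [33], it first selects several candidate answer spans with the highest attention sum scores. These candidates are then passed to a reasoner component, which inserts each of them into the placeholder of the question sequence and computes a probability to select the answer.

[108] proposes two methods for extracting candidate spans of variable length. The first captures the POS patterns of answers on the validation set and selects the subsequences that best match those patterns. The second enumerates all possible answer spans within a fixed length of the context. Once the candidates are obtained, their similarity to the question is computed and the most similar one is chosen as the answer.

3.3.3 Sentence Selector

Especially for long documents, finding the sentences most relevant to the question in advance can speed up training. [51] therefore proposes a sentence selector that finds the minimal set of sentences needed to answer the question. The sentence selector has a seq2seq structure; its decoder computes a score for each sentence based on its similarity to the question, and a sentence whose score exceeds a predefined threshold is fed to the MRC system.

This is one way to reduce training and inference time.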

4 Datasets and Evaluation Metrics

4.1 Datasets

In this part, we introduce several representative datasets of each MRC task, highlighting how to construct large-scale datasets according to task requirements, and how to reduce lexical overlap between questions and context.

4.1.1 Cloze Tests Datasets

CNN & Daily Mail: consists of 93,000 articles from CNN and 220,000 articles from the Daily Mail. All entities in the documents are anonymized. The missing items are named entities.

CBT: the Children's Book Test (CBT). Any word in the target sentence may be targeted; entities in the CBT dataset are not anonymized, so models can use background knowledge from wider contexts; the missing items are named entities, nouns, verbs and prepositions.

LAMBADA: also uses books as its source. The word to be predicted in LAMBADA is the last word of the target sentence. Compared with CBT, LAMBADA requires more understanding of the wider context.

Who-did-What: each sample is formed from two independent articles; one serves as the context and questions are generated from the other (to reduce syntactic similarity).

CLOTH: human-created, collected from English exams for Chinese students.

CliCR: based on clinical case reports for healthcare and medicine. Similar to CNN & Daily Mail.

4.1.2 Multiple-Choice Datasets

MCTest: It consists of 500 fictional stories. Choosing fictional stories avoids introducing external knowledge, and questions can be answered according to the stories themselves. Its use of a story corpus inspired other datasets such as CBT and LAMBADA. However, 500 stories is too small.

RACE: collected from English exams for middle-school and high-school Chinese students. Almost all kinds of passages can be found in RACE. It is large-scale, supports training deep-learning models, requires more reasoning, and is challenging.

4.1.3 Span Extraction Datasets

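SQuAD: a milestone for MRC that spurred the development of many MRC techniques. It is both large and of high quality, with 536 Wikipedia articles and more than 100,000 human-written questions whose answers are spans of the articles. SQuAD defined a new kind of MRC task.

NewsQA: another span extraction dataset similar to SQuAD. Its questions are also human-written, but unlike SQuAD the articles come from CNN. Some questions cannot be answered from the given context, which brings them closer to reality and inspired SQuAD 2.0. Unanswerable questions are discussed in detail in Section 5.2.

TriviaQA: in earlier datasets, the questions depend on the evidence used to answer them. In reality, people usually look for useful resources after a question has been posed. This dataset therefore collects question-answer pairs from trivia and quiz-league websites and then searches web pages and Wikipedia for evidence. The result is more than 650,000 question-answer-evidence triples for the MRC task.

DuoRC: reduces lexical overlap between questions and contexts. Questions and answers in DuoRC are created from two different versions of documents corresponding to the same movie, one from Wikipedia and one from IMDb. It requires more understanding and reasoning. DuoRC also contains unanswerable questions.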

4.1.4 Free Answering Datasets

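bAbI: It consists of 20 tasks. All data in bAbI is synthetic, so it is not very close to the real world. Each task is independent and reflects one aspect of text understanding, such as recognizing relations with two or three arguments. Answers are limited to a single word or a list of words and may not be directly found in the original context.

MS MARCO: can be seen as another MRC milestone after SQuAD. It has four features: ① all questions come from real user queries; ② each question comes with 10 related documents from the Bing search engine as context; ③ the labeled answers are written by humans, so they are not restricted to a span of the context and require more reasoning and summarization; ④ each question may have multiple answers, which sometimes even contradict each other, making it harder for a system to select the correct one. MS MARCO brings MRC datasets closer to the real world.

SearchQA: similar to TriviaQA. The authors collect question-answer pairs from the J!Archive and then search Google for snippets related to the questions (about 49.6 related snippets per pair, whereas TriviaQA has only one document).

NarrativeQA: based on book stories and movie scripts, they search related summaries from Wikipedia and ask co-workers to generate question-answer pairs according to those summaries. Answering requires understanding the whole narrative rather than superficial matching.

DuReader: similar to MS MARCO, another large-scale MRC dataset drawn from real-world use. Questions and documents are collected from Baidu Search and Baidu Zhidao. Answers are written by humans rather than being spans of the original context. It also has new question types such as yes/no and opinion questions, and questions sometimes require summarizing over multiple parts of the documents.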

4.2 Evaluation Metrics

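For cloze tests and multiple-choice tasks, the most common metric is accuracy.
Span extraction: exact match (EM), a variant of accuracy, and the F1 score.
Free answering: ROUGE-L and BLEU are widely used.

Accuracy: given a set of m questions Q = {Q1, Q2, …, Qm}, if the system answers n of them correctly, then Accuracy = n / m.

Exact match: evaluates whether a predicted answer span exactly matches the ground-truth sequence. If the predicted answer equals the gold answer, EM is 1; otherwise it is 0. It can also be computed with the accuracy formula above.

F1 score: a common metric in classification tasks. In MRC, the candidate and reference answers are treated as bags of tokens, and true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) are defined as follows: a token in both the candidate and the reference is a TP, a token only in the candidate is an FP, a token only in the reference is an FN, and a token in neither is a TN.

Precision and recall are computed as: precision = TP / (TP + FP), recall = TP / (TP + FN).

F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R), where P is precision and R is recall.

Compared with EM, F1 loosely measures the average overlap between the prediction and the ground-truth answer.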
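
The two span-extraction metrics can be sketched as below. Whitespace tokenization and lowercasing are simplifying assumptions of this example; official evaluation scripts usually also strip punctuation and articles.

```python
from collections import Counter


def _tokens(text):
    return text.lower().split()


def exact_match(prediction, gold):
    """1 if the normalized token sequences are identical, else 0."""
    return int(_tokens(prediction) == _tokens(gold))


def f1_score(prediction, gold):
    """Token-level F1 with both answers treated as bags of tokens."""
    pred, ref = _tokens(prediction), _tokens(gold)
    overlap = Counter(pred) & Counter(ref)       # multiset intersection
    tp = sum(overlap.values())
    if tp == 0:
        return 0.0
    precision = tp / len(pred)                   # TP / (TP + FP)
    recall = tp / len(ref)                       # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the cat sat", "cat sat down"))
print(f1_score("the cat sat", "cat sat down"))
```

For this pair EM is 0 while F1 gives partial credit for the two shared tokens, which is the looser behaviour described above.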

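ROUGE-L: ROUGE was originally an evaluation metric for automatic summarization. It assesses the quality of a summary by counting the overlap between the model-generated summary and the reference summary. ROUGE has several variants: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S, of which ROUGE-L is widely used for free answering in MRC. It is more flexible; the L stands for longest common subsequence (LCS). Its computation is shown in the figure:

ROUGE-L does not require the predicted answer to be a contiguous subsequence of the ground truth, although more overlap leads to a higher score.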
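
A compact sketch of ROUGE-L follows, using the commonly cited LCS-based definition: recall = LCS / reference length, precision = LCS / candidate length, and F = (1 + β²)·P·R / (R + β²·P). The value `beta=1.2` and the whitespace tokenization are assumptions of this example, and the formula is the standard one rather than a transcription of the missing figure.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]


def rouge_l(candidate, reference, beta=1.2):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


print(rouge_l("the cat is on the mat", "the cat sat on the mat"))
```

The matched tokens "the cat on the mat" are not contiguous in either sentence, yet they still count, which illustrates the remark above about non-contiguous overlap.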

BLEU: Bilingual Evaluation Understudy, widely used to evaluate translation systems. In MRC, the BLEU score can not only evaluate the similarity between candidate answers and ground-truth answers but also test the readability of candidates. (The exact computation is fairly involved; see the paper.)

5 New Trends

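Knowledge-based MRC;
MRC with unanswerable questions;
multi-passage MRC;
conversational MRC.

5.1 Knowledge-Based Machine Reading Comprehension

MRC requires answering questions with knowledge implicit in the given context. This motivates knowledge-based machine reading comprehension (KBMRC). The input of KBMRC is augmented with related knowledge extracted from knowledge bases, which is the key to KBMRC. KBMRC can be seen as MRC augmented with external knowledge K, formalized as follows:

One KBMRC dataset is MCScript, which is about human daily activities. Some of its questions cannot be answered from the context alone and require common-sense knowledge that is not in the context.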

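The challenges of KBMRC:

Knowledge retrieval: extracting the knowledge relevant to the context and the question from knowledge bases that store many different kinds of knowledge.
Knowledge fusion: knowledge has its own structure; how to encode it and integrate it into the representations of the context and the question is still an open research problem.

Approaches to these KBMRC problems
[44] proposes a cloze-like task, rare entity prediction; the difference is that the question cannot be answered from the original context alone. The task requires entity descriptions extracted from a knowledge base to help predict the entity. Methods for integrating such external knowledge include:

[102], [48] design an attention mechanism with a sentinel that considers how relevant the knowledge is to the context and the question, and prevents irrelevant knowledge from misleading the prediction (whether to use knowledge, and which knowledge to use).
[82] uses a key-value memory network to select relevant information: all candidate knowledge is stored in memory slots as key-value pairs; the keys are matched against the query, and the weighted sum of the values gives the relevant knowledge representation.
[90] proposes a data enrichment method based on the semantic relations in WordNet: for each word in the context and the question, it finds the positions of passage words that are semantically related to it. This position information is treated as external knowledge and fed to the MRC model.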
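
The key-value memory read mentioned for [82] can be sketched as a soft lookup: the query is matched against every key, and the values are averaged with the resulting weights. Dot-product matching and the name `kv_memory_read` are assumptions of this example, not details of [82].

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kv_memory_read(query, keys, values):
    """Return (knowledge_vector, weights) for one read over the memory."""
    # Match the query against each key.
    weights = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(values[0])
    # Weighted sum of the values gives the retrieved knowledge representation.
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return read, weights


keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
read, weights = kv_memory_read([1.0, 0.0], keys, values)
print(read, weights)
```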

5.2 Machine Reading Comprehension with Unanswerable Questions

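Some questions are unanswerable, which is closer to the real world. A mature MRC system should be able to recognize them. The MRC process therefore contains two sub-tasks, answerability detection and reading comprehension (recognize unanswerable questions and give correct answers only to answerable ones), defined as follows:

Overall, handling unanswerable questions involves three things: unanswerable question detection, question answering and answer verification.

Unanswerable questions pose two challenges:

Unanswerable Question Detection
The model needs to know what it does not know and mark such questions as unanswerable.
Plausible Answer Discrimination
To avoid giving fake answers, an MRC model has to check its predicted answers and tell plausible answers from correct ones. The methods fall into two groups:

Group 1: indicating no-answer cases
Method 1: use shared normalization with an extra trainable bias and a softmax to obtain the probability of "no answer"; if it exceeds the probability of the best span, the question is judged unanswerable.
Method 2: set a global confidence threshold; if the confidence of the predicted answer is below the threshold, the question is judged unanswerable. This method cannot guarantee that the answer is correct.
Method 3: padding. [85] adds a padding position to the original passage to decide whether the question is answerable.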
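
Methods 1 and 2 above can be sketched as two small decision rules. The function names, the score dictionaries and the default threshold are hypothetical; in a real system the span scores and the no-answer bias would be produced by the trained model.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def answer_with_bias(span_scores, no_answer_bias):
    """Method 1: normalize span scores together with a no-answer score.

    Returns the best span, or None if "no answer" is more probable.
    """
    spans = list(span_scores)
    if not spans:
        return None
    probs = softmax([span_scores[s] for s in spans] + [no_answer_bias])
    best = max(range(len(spans)), key=lambda i: probs[i])
    return None if probs[-1] > probs[best] else spans[best]


def answer_with_threshold(span_probs, threshold=0.5):
    """Method 2: reject the best span if its confidence is below a threshold."""
    span, prob = max(span_probs.items(), key=lambda kv: kv[1])
    return span if prob >= threshold else None


print(answer_with_bias({(3, 5): 2.0, (7, 8): 0.5}, no_answer_bias=1.0))
print(answer_with_threshold({(3, 5): 0.4, (7, 8): 0.3}))
```

As noted above, passing the threshold only says the model is confident, not that the answer is correct.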

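Two losses: for unanswerable question detection, two auxiliary losses are proposed, the independent span loss and the independent no-answer loss. See the paper.

Group 2: verifying the legitimacy of the answer
Method 1: sequential architecture. The question, the answer and the context containing the candidate answer are treated as one sequence and fed to a fine-tuned transformer model to predict the no-answer probability.
Method 2: interactive architecture. It computes the correlation between the question and the answer in the context to classify whether the question is answerable.
Method 3: a combination of the two, concatenating their outputs as a joint representation.

Besides the pipeline structure above, [81] uses multi-task learning to jointly train answer prediction, no-answer detection and answer validation. A joint universal node is used to fuse them, and the fused representations are used to decide whether the question is answerable.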

5.3 Multi-Passage Machine Reading Comprehension

This extension can be applied to tackle open-domain question-answering tasks based on a large corpus of unstructured text.
Datasets for the multi-passage MRC task: MS MARCO, TriviaQA, SearchQA, DuReader and QUASAR.
Definition:

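Multi-passage MRC is more challenging than other MRC tasks. It has the following characteristics, and retrieval efficiency is the key:

massive document corpus
noisy document retrieval
no answer
multiple answers
evidence aggregation

Pipeline
One way to tackle multi-passage MRC is the "retrieve then read" pipeline: a retrieval component first returns several relevant documents, which are then passed to a reader that produces the answer.
For example, DrQA by Danqi Chen et al. is a typical pipeline-based multi-passage MRC model. Its retrieval component uses TF-IDF to select 5 relevant Wikipedia articles for each SQuAD question, narrowing the search space. The reader module then uses rich word representations to improve the model, and a pointer module predicts the start and end positions of the answer.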
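
The retrieval half of "retrieve then read" can be sketched with a small TF-IDF ranker. This is a deliberately simplified unigram variant written for illustration, not DrQA's actual retriever; the function name `tfidf_retrieve`, the smoothed IDF formula and the whitespace tokenization are all assumptions of this example.

```python
import math
from collections import Counter


def tfidf_retrieve(question, documents, k=5):
    """Return the indices of the k documents most similar to the question."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(documents)
    df = Counter(t for doc in tokenized for t in set(doc))

    def idf(term):
        # Smoothed inverse document frequency (assumed form).
        return math.log((1 + n) / (1 + df[term])) + 1.0

    def vectorize(tokens):
        return {t: c * idf(t) for t, c in Counter(tokens).items()}

    def cosine(a, b):
        num = sum(w * b.get(t, 0.0) for t, w in a.items())
        den = math.sqrt(sum(w * w for w in a.values())) * \
            math.sqrt(sum(w * w for w in b.values()))
        return num / den if den else 0.0

    q_vec = vectorize(question.lower().split())
    ranked = sorted(range(n), key=lambda i: cosine(q_vec, vectorize(tokenized[i])),
                    reverse=True)
    return ranked[:k]


docs = [
    "paris is the capital of france",
    "the mitochondria is the powerhouse of the cell",
    "berlin is the capital of germany",
]
top = tfidf_retrieve("what is the capital of france", docs, k=2)
print(top)
```

The returned documents would then be handed to a reader such as a boundary-model span extractor.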

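However, retrieval and reading are separate in DrQA, so errors made at the retrieval stage easily propagate to the reader module and degrade performance considerably. To mitigate the error propagation caused by poor document retrieval, one option is a ranker component and another is to jointly train the retrieval and reading processes:

For the ranker component, [27] describes two rankers, the InferSent Ranker and the Relation-Networks Ranker, which use a feed-forward network and a relation network respectively. **[37]** uses a Paragraph Ranker mechanism built from bidirectional LSTMs and a dot product.

For joint training, the Reinforced Ranker-Reader (R3) of [92] is a representative model. R3 uses match-LSTM to compute the similarity between the question and each passage and obtain document representations, which are then fed to the ranker and the reader. In the ranker, reinforcement learning is used to select the most relevant passage.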

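Retrieval
The retrieval components above are inefficient. DrQA simply uses a traditional IR method, and R3 ranks with question-dependent passage representations; the computational cost of both grows with the size of the document corpus. To speed things up, [15] proposes a fast and efficient retrieval method whose representations are independent of the question and whose outputs are stored offline (see the paper).

Multiple candidate answers
For answer selection, [58] proposes three heuristic methods: RAND, MAX and SUM. [41] introduces a fast paragraph selector to filter out passages with wrong answer labels.

Evidence aggregation
[93] argues that aggregating evidence is important. Their observation is that the correct answer has more evidence, spread across different passages, and that some questions need evidence covering different aspects. To make full use of multiple pieces of evidence, they propose a strength-based and a coverage-based re-ranker. In the former, the candidate answer that occurs most often is selected. In the latter, all passages containing a candidate answer are concatenated into a new context and fed to the reader, so that the resulting answer is supported by different aspects of the evidence.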
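
The two re-ranking ideas can be illustrated with a toy sketch. The counting rule follows the description above; the normalization (strip and lowercase), the substring test and both function names are assumptions of this example and not the method of [93] in detail.

```python
from collections import Counter


def strength_rerank(candidate_answers):
    """Pick the candidate answer predicted most often across passages."""
    counts = Counter(a.strip().lower() for a in candidate_answers)
    return counts.most_common(1)[0][0]


def coverage_context(candidate, passages):
    """Concatenate every passage that mentions the candidate answer.

    The result would be fed to the reader as a new, pooled context.
    """
    hits = [p for p in passages if candidate.lower() in p.lower()]
    return " ".join(hits)


passages = ["Paris is in France.", "Lyon is a city.", "The capital is Paris."]
best = strength_rerank(["Paris", "Lyon", "paris "])
pooled = coverage_context(best, passages)
print(best, "|", pooled)
```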

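Summary
Multi-passage MRC is very close to real-world applications. Predicting the answer requires more evidence, related document retrieval is important, and the evidence aggregated from many documents may be complementary or contradictory. Free-form answers are therefore common in multi-passage tasks. Multi-passage MRC still has a long way to go.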

5.4 Conversational Machine Reading Comprehension

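Answering a follow-up question that builds on previous answers requires deeper understanding. Conversational machine reading comprehension (CMRC) has become a research hot spot.
Definition:
The conversation history H is treated as part of the context.

Related datasets:
CoQA, QuAC, and the multiparty dialog dataset of [46], which extends cloze tests.
CMRC brings new challenges to MRC, such as:

Conversational History
Dialogue pairs are fed to CMRC systems as conversational history.
Coreference Resolution
Coreference comes in two kinds, explicit and implicit. Implicit coreference is harder to resolve.

Models and methods for conversational MRC
[69] proposes a hybrid model, DrQA + PGNet, which combines seq2seq and MRC models to extract and generate answers. Previous question-answer pairs are treated as a sequence and appended to the context.
[105] uses an improved MRC model, Bi-DAF++ with ELMo, to answer based on the context and the history.
[30] goes beyond simply concatenating previous question-answer pairs as input and introduces a flow mechanism to understand the conversation history in depth; it encodes the hidden context representations produced while answering previous questions.
[110] is similar to [69] but appends previous question-answer pairs to the current question; to find the relevant parts of the conversation history, self-attention is also applied to the questions.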
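
The simplest way of using conversation history, prepending recent question-answer pairs to the current question, can be sketched as below. The "Q:/A:" markers, the `max_turns` cut-off and the function name are assumptions of this example; the cited models differ in where and how they attach the history.

```python
def build_question_with_history(history, question, max_turns=2):
    """Prepend the last `max_turns` (question, answer) pairs to the question."""
    recent = history[-max_turns:] if max_turns > 0 else []
    parts = [f"Q: {q} A: {a}" for q, a in recent]
    parts.append(f"Q: {question}")
    return " ".join(parts)


history = [
    ("Who wrote it?", "Jane Austen"),
    ("When?", "1813"),
    ("Where was she born?", "Steventon"),
]
print(build_question_with_history(history, "How old was she then?"))
```

Plain concatenation leaves coreference ("she", "then") for the reader to resolve, which is why flow mechanisms and self-attention over the history were introduced.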

Little progress has been made on coreference resolution; if coreference is not resolved correctly, performance degrades.

6 Open Issues

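Some open issues remain unsolved, such as machine inference and open-domain QA. The most important problem today is that neural MRC does not truly understand the given text; most existing models mainly rely on semantic matching to answer questions. There is a large gap between MRC and human reading comprehension in the following respects:

Limitation of Given Context
Robustness of MRC Systems
Incorporation of External Knowledge
Lack of Inference Ability
Difficulty of Interpretation

In more detail: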

Limitation of Given Context
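A context is necessary in MRC tasks, but it is a limitation in the real world. Research on multi-passage MRC breaks this limitation to some extent. The relevant resources determine answer accuracy, and finding the most relevant resources for an MRC system effectively still has a long way to go. It calls for a deep combination of information retrieval and machine reading comprehension.
Robustness of MRC Systems
Most existing MRC systems still rely on word overlap and perform poorly on adversarial question-answer pairs. This again shows that machines do not truly understand natural language. Although an answer verification component can alleviate the side effects of plausible answers, the robustness of MRC systems needs to be strengthened.
Incorporation of External Knowledge
The problem is how to introduce and exploit external knowledge effectively. On the one hand, knowledge in a knowledge base is stored in a structure different from the text of contexts and questions, so it is hard to integrate. On the other hand, the quality of the knowledge base matters, yet building one is time-consuming. In addition, knowledge in a knowledge base is scattered, and the relevant external knowledge often cannot be found directly. Effectively fusing knowledge graphs with MRC therefore needs further research.
Lack of Inference Ability
Most existing MRC systems are based on semantic matching, which leaves them with insufficient inference ability. For example, given that two people are lying on the ground and five people are lying on the bed, the system cannot infer that seven people are lying down.
Difficulty of Interpretation
MRC systems cover many tasks, yet they still work as black boxes. Lack of interpretability is a major drawback for applications; in healthcare, for instance, being able to give a rationale for the output is a prerequisite for trust. HotpotQA is worth mentioning here: its questions require reasoning over multiple supporting documents, and it labels the useful supporting sentences, so MRC systems can use this information to explain their predictions. Explainable artificial intelligence (XAI) is also relevant.
In short, MRC systems need to become more trustworthy and transparent for wider practical use.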