[Paper Notes] ELMo: Deep contextualized word representations

Abstract

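The paper introduces a new type of deep contextualized word representation that models:

complex characteristics of word use (e.g., syntax and semantics)
how these uses vary across linguistic contexts (i.e., polysemy)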

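Other key points:

The word vectors are learned functions of the internal states of a deep bidirectional language model (biLM).
Exposing the deep internals of the pre-trained network is crucial, because it allows downstream models to mix different types of semi-supervision signals.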

Introduction

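ELMo (Embeddings from Language Models) representations are a function of all of the internal layers of the biLM.
ELMo representations are deep. More specifically, for each end task we learn a linear combination of the vectors stacked above each input word, which markedly improves performance over just using the top LSTM layer.

Higher-level LSTM states better capture the context-dependent meaning of words (e.g., for word sense disambiguation);
lower-level states are better at modeling syntax (e.g., for part-of-speech tagging).
Simultaneously exposing all of these signals (presumably the higher-level and lower-level states) is highly beneficial.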

3 ELMo: Embeddings from Language Models

ELMo word representations are functions of the entire input sentence. They are

computed on top of two-layer biLMs with character convolutions,
as a linear function of the internal network states.
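As a rough illustration of that linear function, the sketch below collapses the layer vectors of one token into a single vector using softmax-normalized layer weights and a task-specific scale, i.e. ELMo_k^task = γ^task Σ_j s_j^task h_{k,j}^LM. The function names, the toy vectors, and the plain-list representation are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_vector(layer_states, raw_weights, gamma=1.0):
    """Mix the per-layer vectors of one token into a single vector.

    layer_states: list of vectors, one per biLM layer (token layer first),
                  all of the same length.
    raw_weights:  one unnormalized scalar per layer (hypothetical task parameters).
    gamma:        task-specific scalar that rescales the mixed vector.
    """
    if len(layer_states) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    s = softmax(raw_weights)
    dim = len(layer_states[0])
    return [
        gamma * sum(s[j] * layer_states[j][d] for j in range(len(s)))
        for d in range(dim)
    ]


# Toy example: three layers, two dimensions each (made-up numbers).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights give a plain average of the layers.
averaged = elmo_vector(layers, [0.0, 0.0, 0.0])

# A very large weight on the last layer approximates "top layer only".
top_only = elmo_vector(layers, [-50.0, -50.0, 50.0])
```

With equal raw weights the result is the simple average of the layers, while a dominant weight on the last layer recovers the "top LSTM layer only" baseline that the notes compare against.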

3.1 Bidirectional language models

3.2 ELMo

3.3 Using biLMs for supervised NLP tasks

Given a pre-trained biLM and a supervised architecture for a target NLP task, it is a simple process to use the biLM to improve the task model.
The same holds for ELMo:

simply run the biLM and record all of the layer representations for each word.
let the end task model learn a linear combination of these representations.

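In detail:

Most supervised NLP models share a common architecture at the lowest layers, which allows us to add ELMo in a consistent, unified manner.
To add ELMo to the supervised model:
(1) freeze the weights of the biLM
(2) concatenate ELMo_k^task with x_k to form the enhanced representation [x_k; ELMo_k^task] and pass it into the task RNN.

For some tasks (e.g., SNLI, SQuAD), further improvements are observed by also including ELMo at the output of the task RNN: a second set of output-specific linear weights is introduced and h_k is replaced with [h_k; ELMo_k^task].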
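The two concatenation points described above can be sketched as follows. This is a minimal illustration with plain Python lists; the helper name `augment` and all toy values are hypothetical, and the task RNN itself is left out.

```python
def augment(vectors, elmo_vectors):
    """Concatenate each per-token vector with its ELMo vector: [v_k; ELMo_k]."""
    if len(vectors) != len(elmo_vectors):
        raise ValueError("need exactly one ELMo vector per token")
    return [list(v) + list(e) for v, e in zip(vectors, elmo_vectors)]


# Input side: context-independent token vectors x_k (toy values, 2 tokens).
x = [[0.1, 0.2], [0.3, 0.4]]
elmo_in = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
enhanced_input = augment(x, elmo_in)  # [x_k; ELMo_k^task], fed to the task RNN

# Output side (used for some tasks): hidden states h_k of the task RNN.
# elmo_out stands for ELMo vectors mixed with a second, separate set of
# layer weights; the numbers here are made up.
h = [[0.5], [0.6]]
elmo_out = [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]
enhanced_output = augment(h, elmo_out)  # [h_k; ELMo_k^task]
```

The only change to the task model is the wider input (and, optionally, output) dimension; the biLM weights stay frozen and only the mixing weights are trained with the task.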

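The rest of the supervised model remains unchanged, so these additions can be made within more complex neural models.

For example: biLSTMs followed by a bi-attention layer, or a clustering model layered on top of the biLSTM.
Adding a moderate amount of dropout to ELMo is beneficial.

In some cases it also helps to regularize the ELMo weights by adding λ||w||^2 to the loss.
This imposes an inductive bias on the ELMo weights to stay close to an average of all biLM layers.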
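One way to see why the penalty favors an average is sketched below. It assumes, as an interpretation rather than something stated in these notes, that w denotes the unnormalized layer weights that are passed through a softmax: shrinking w toward zero then pushes the normalized weights toward the uniform distribution. The function names and toy weights are hypothetical.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def l2_penalty(raw_weights, lam):
    """The regularization term lam * ||w||^2 that is added to the task loss."""
    return lam * sum(w * w for w in raw_weights)


def spread(values):
    """Gap between the largest and smallest value (0 means perfectly uniform)."""
    return max(values) - min(values)


w = [2.0, -1.0, 0.5]  # toy unnormalized layer weights

strong = l2_penalty(w, 1.0)    # large lambda: deviating from 0 is expensive
weak = l2_penalty(w, 0.001)    # small lambda: the weights are free to vary

# Shrinking w toward zero (what a strong penalty encourages) makes the
# softmax-normalized layer weights approach a simple average.
spreads = [spread(softmax([scale * wi for wi in w])) for scale in (1.0, 0.1, 0.0)]
```

In this toy setting the spread of the normalized weights falls to zero as w shrinks, which matches the statement that a strongly regularized mix behaves like an average over layers.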

3.4 Pre-trained bidirectional language model architecture

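The pre-trained biLMs in this paper support joint training over both directions and add a residual connection between the LSTM layers.

The architecture follows the CNN-BIG-LSTM model, with all embedding and hidden dimensions halved.

2 biLSTM layers with 4096 units and 512 dimension projections
a residual connection from the first layer to the second layer
the context-insensitive type representation uses 2048 character n-gram convolutional filters
followed by 2 highway layers
and a linear projection down to a 512-dimensional representation
training for 10 epochs (the forward and backward perplexities are approximately equal, with the backward value slightly lower)

As a result, the biLM provides three layers of representations for each input token, including tokens outside the training set, thanks to the purely character-based input. (In contrast, traditional word embedding methods provide only one layer of representation for tokens in a fixed vocabulary.)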

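Summary of using ELMo
Summary adapted from: this blog post

Pre-train the biLM on a large corpus. The model consists of two bi-LSTM layers connected by a residual connection. The authors argue that the lower bi-LSTM layer extracts syntactic information from the corpus, while the higher bi-LSTM layer extracts semantic information.

Fine-tune the pre-trained biLM on our own training corpus (with the labels removed). This step can be viewed as domain transfer for the biLM.

Use the word embeddings produced by ELMo as input to the task model; in some cases they are added both at the input and at the output.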

4 Evaluation

Performance of ELMo across a diverse set of six benchmark NLP tasks:

simply adding ELMo establishes a new state-of-the-art result.

For example, in question answering:
Our baseline model is an improved version of the Bidirectional Attention Flow model (BiDAF; 2017). It adds a self-attention layer after the bidirectional attention component, simplifies the pooling, and substitutes GRUs for the LSTMs. After adding ELMo to the baseline model, F1 improves significantly.

In addition, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis all improve.

5 Analysis

ablation analysis
Syntactic information is better represented at lower layers, while semantic information is captured at higher layers.

5.1 Alternate layer weighting schemes

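There are many alternatives to Equation 1 for combining the biLM layers.

The choice of the regularization parameter λ is also important: large values such as λ = 1 effectively reduce the weighting function to a simple average over the layers, while smaller values (e.g., λ = 0.001) allow the layer weights to vary.

Including representations from all layers improves overall performance over using only the last layer, and including contextual representations from the last layer improves performance over the baseline. A small λ is preferred in most cases with ELMo.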

5.2 Where to include ELMo?

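All of the task architectures in this paper include word embeddings only as input to the lowest-layer biRNN. However, we find that including ELMo at the output of the biRNN in task-specific architectures improves overall results for some tasks.

One possible explanation is that both the SNLI and SQuAD architectures use attention layers after the biRNN, so introducing ELMo at this layer allows the model to attend directly to the biLM's internal representations.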

5.3 What information is captured by the biLM’s representations?

The biLM is able to disambiguate both the part of speech and the word sense in the source sentence.

See the paper for details.

5.4 Sample efficiency

Adding ELMo to a model increases the sample efficiency considerably, both in terms of number of parameter updates to reach state-of-the-art performance and the overall training set size.

ELMo-enhanced models use smaller training sets more efficiently than models without ELMo.

6. Conclusion

We have also confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context, and that using all layers improves overall task performance.

