[Paper Notes] Attention is all you need

Before reading these notes, see the previously copied article for a detailed introduction to self-attention and a fairly comprehensive summary of the Transformer.
With that self-attention background, the paper becomes much easier to follow.
Paper: Attention is all you need.

1-2 Introduction & Background

RNN: This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
Workarounds (these treat the symptom rather than the cause, since the fundamental constraint of sequential computation remains):

  1. factorization tricks.
  2. conditional computation.

CNNs have also been used as the basic building block, e.g. in ByteNet and ConvS2S. But this makes it more difficult to learn dependencies between distant positions (the number of operations required to relate two positions grows with the distance between them: linearly for ConvS2S and logarithmically for ByteNet).

History

Name | Description | Limitations
seq2seq | |
encoder-decoder | the traditional framework, usually paired with RNNs |
RNN / LSTM / GRU | direction: unidirectional or bidirectional; depth: single-layer or multi-layer | struggles with long sequences; cannot be parallelized; alignment problems; all necessary information of the source sentence must be compressed into a fixed-length vector
CNN | parallel computation; handles variable-length sequences | memory-hungry; needs many tricks; parameter tuning is difficult on large datasets
Attention Mechanism | attends to a subset of the vectors; solves the alignment problem |

Concepts mentioned:
self-attention;
recurrent attention mechanism.
transduction models

3 Model Architecture

Most competitive sequence transduction models use an encoder-decoder structure:
input sequence: x = (x1, …, xn), n symbols
continuous representations output by the encoder: z = (z1, …, zn), n elements
decoder outputs: y = (y1, …, ym), m elements
Generated one element at a time, consuming the previously generated symbols as additional input when generating the next.
(Figure: Transformer model architecture)
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
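As a rough illustration of this auto-regressive behaviour, here is a minimal greedy decoding sketch in Python; `encode`, `decode_step`, `bos_id` and `eos_id` are hypothetical placeholders, not functions from the paper or any library.

```python
# Minimal sketch of auto-regressive decoding with hypothetical helpers.
def greedy_decode(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    """encode() maps the source to its continuous representation z;
    decode_step() returns scores over the vocabulary given z and the
    symbols generated so far."""
    z = encode(src_tokens)               # z = (z1, ..., zn)
    ys = [bos_id]                        # start symbol
    for _ in range(max_len):
        scores = decode_step(z, ys)      # one element at a time
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_id)               # previously generated symbols become input
        if next_id == eos_id:
            break
    return ys
```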

3.1 Encoder and Decoder Stacks

(Figure: Transformer structure)

Encoder: a stack of N = 6 identical layers. Each layer has two sub-layers (from bottom to top):

  1. multi-head self-attention mechanism.
  2. simple, position-wise fully connected feed-forward network (referred to below as FFN).

(Figure: encoder layer)
We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)).

(Figure: residual connection)
All sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.
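A minimal numpy sketch of the LayerNorm(x + Sublayer(x)) wrapper described above; the learnable gain and bias of layer normalization are omitted, and the identity sub-layer is only a placeholder.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's d_model-dimensional vector (gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """Output of each sub-layer: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# Usage: x has shape (seq_len, d_model); the identity stands in for
# the self-attention or feed-forward sub-layer.
x = np.random.randn(10, 512)
out = residual_sublayer(x, lambda t: t)
```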

Decoder: also a stack of N = 6 identical layers. In addition to the two encoder sub-layers, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. From bottom to top:

  1. masked multi-head self-attention (over the previously generated output symbols)
  2. encoder-decoder attention (multi-head attention over the output of the encoder stack)
  3. FFN

(Figure: hand-drawn sketch of Transformer decoding / the decoder)

Masking ensures that the predictions for position i can depend only on the known outputs at positions less than i.

As in the encoder, a residual connection is employed around each of the sub-layers, followed by layer normalization.

3.2 Attention

The attention mechanism maps a query and a set of key-value pairs to an output (queries, keys, values, and outputs are all vectors). The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The attention variants used in this paper are:

  • Scaled Dot-Product Attention
  • Multi-Head Attention
(Figure: Scaled Dot-Product Attention)

3.2.1 Scaled Dot-Product Attention

The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by sqrt(dk), and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V.
Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) V

PS: the two most commonly used attention functions are additive attention and dot-product attention. Dot-product attention is much faster and more space-efficient in practice, since it can be implemented with highly optimized matrix multiplication code.

The two perform similarly when dk is small, but for large dk the unscaled dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients; hence the dot products are scaled by 1/sqrt(dk).
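A minimal numpy sketch of the scaled dot-product attention formula above; the optional boolean mask argument anticipates the decoder masking in 3.2.3, and the shapes used here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (m, dk), K: (n, dk), V: (n, dv) -> output of shape (m, dv)."""
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)             # compatibility of each query with each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # illegal connections ~ -inf before softmax
    weights = softmax(scores, axis=-1)         # attention weights over the values
    return weights @ V                         # weighted sum of the values

# Usage: 3 queries attending over 5 key-value pairs with dk = dv = 64.
Q, K, V = np.random.randn(3, 64), np.random.randn(5, 64), np.random.randn(5, 64)
out = scaled_dot_product_attention(Q, K, V)    # shape (3, 64)
```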

3.2.2 Multi-Head Attention

With multi-head attention we keep separate Q/K/V weight matrices for each head, which yields different Q/K/V matrices. As before, we multiply the input X by the WQ/WK/WV matrices to obtain the Q/K/V matrices.
That is: Queries, Keys, Values.
Doing the same self-attention computation listed above, just eight times with different weight matrices, produces eight different Z matrices.
The so-called attention heads are exactly these different Z matrices.

The paper projects the queries, keys and values h times with different learned linear projections to dimensions dk, dk and dv, runs scaled dot-product attention on each projection, concatenates the results, and applies a final linear projection to produce the output. Through multi-head attention, the model can capture positional information from different representation subspaces.

These are concatenated and once again projected, resulting in the final values.
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
The eight different Z matrices are concatenated into one matrix and multiplied by an additional weight matrix W^O, giving a single result matrix Z that captures information from all the attention heads; this matrix is then fed to the FFN.
(Figure: Multi-Head Attention)

Key points of multi-head attention (a numpy sketch follows this list):

  • employ h = 8 parallel attention layers, or heads.
  • For each of these we use dk = dv = dmodel / h = 64 (dmodel is the dimension after the heads are combined).
  • Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
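A minimal numpy sketch of multi-head attention built on the scaled_dot_product_attention sketch from 3.2.1; the randomly initialized projection matrices are illustrative assumptions, not trained weights.

```python
import numpy as np

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o):
    """W_q/W_k/W_v: per-head projections of shape (d_model, dk) or (d_model, dv);
    W_o: (h * dv, d_model). Returns an array of shape (len_q, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = X_q @ Wq_i, X_kv @ Wk_i, X_kv @ Wv_i        # project into each subspace
        heads.append(scaled_dot_product_attention(Q, K, V))   # each head: (len_q, dv)
    return np.concatenate(heads, axis=-1) @ W_o               # concat, then final projection

# Usage with d_model = 512, h = 8, dk = dv = 64 (self-attention, so X_q is X_kv).
d_model, h, dk = 512, 8, 64
X = np.random.randn(10, d_model)
W_q = [np.random.randn(d_model, dk) * 0.02 for _ in range(h)]
W_k = [np.random.randn(d_model, dk) * 0.02 for _ in range(h)]
W_v = [np.random.randn(d_model, dk) * 0.02 for _ in range(h)]
W_o = np.random.randn(h * dk, d_model) * 0.02
Z = multi_head_attention(X, X, W_q, W_k, W_v, W_o)            # (10, 512)
```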

3.2.3 Applications of Attention in the Model

(Figure: Transformer model architecture)
The Transformer uses multi-head attention in three different ways:

  1. In “encoder-decoder attention” layers

    The inputs are the encoder output and the output of the decoder's self-attention sub-layer: the queries come from the previous decoder layer, and the keys and values come from the output of the encoder.
    The encoder-decoder attention layer works just like multi-headed self-attention, except that it creates its queries matrix from the layer below it and takes the keys and values matrices from the output of the encoder stack.

  2. The encoder's self-attention layers.

    Each position in the encoder can attend to all positions in the previous layer of the encoder.
    The Q, K and V inputs are all the same (the sum of the input embedding and the positional encoding, or the output of the previous encoder layer).

  3. The decoder's self-attention layers.

    We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
    In the decoder's self-attention layers, each position can attend only to positions up to and including the current one; see the masking sketch below.

(Figure: where multi-head attention is used)
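A minimal sketch of the decoder's look-ahead mask, reusing the scaled_dot_product_attention sketch from 3.2.1; the convention that True marks a legal connection is an assumption of that sketch.

```python
import numpy as np

def causal_mask(n):
    """Position i may attend only to positions j <= i (lower-triangular True)."""
    return np.tril(np.ones((n, n), dtype=bool))

# Usage: decoder self-attention over 5 positions; masked-out scores become ~ -inf,
# so the softmax assigns them practically zero weight.
n, dk = 5, 64
Q = K = V = np.random.randn(n, dk)
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask(n))
```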

3.3 Position-wise Feed-Forward Networks

Each layer in the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. It consists of two linear transformations (two dense layers) with a ReLU activation in between, and can also be viewed as two 1D convolutions with kernel size 1. The hidden size changes as 512 -> 2048 -> 512.

FFN(x) = max(0, xW1 + b1)W2 + b2

While the linear transformations are the same across different positions, they use different parameters from layer to layer.
The position-wise feed-forward network is essentially just an MLP applied at every position.
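A minimal numpy sketch of the FFN formula above with the 512 -> 2048 -> 512 sizes; the randomly initialized weights are illustrative assumptions.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1)W2 + b2, applied at every position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Usage: d_model = 512, inner dimension d_ff = 2048; x has shape (seq_len, d_model).
d_model, d_ff = 512, 2048
x = np.random.randn(10, d_model)
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
out = position_wise_ffn(x, W1, b1, W2, b2)   # (10, 512)
```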

3.4 Embeddings and Softmax

We use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel. The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, those weights are multiplied by sqrt(dmodel).
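A minimal numpy sketch of the shared weight matrix: the same matrix serves for both embedding lookups (scaled by sqrt(dmodel)) and the pre-softmax linear transformation; the vocabulary size and random initialization are illustrative assumptions.

```python
import numpy as np

vocab_size, d_model = 37000, 512
E = np.random.randn(vocab_size, d_model).astype(np.float32) * 0.02  # one shared matrix

def embed(token_ids):
    """Embedding lookup, multiplied by sqrt(d_model) as described above."""
    return E[token_ids] * np.sqrt(d_model)

def pre_softmax_logits(decoder_output):
    """The pre-softmax linear transformation reuses the same matrix, transposed."""
    return decoder_output @ E.T                              # (seq_len, vocab_size)

x = embed(np.array([5, 42, 7]))                              # (3, 512)
logits = pre_softmax_logits(np.random.randn(3, d_model))     # (3, 37000)
```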

3.5 Positional Encoding

Since the model contains no recurrence and no convolution, we must inject some information about the relative or absolute position of the tokens in the sequence. "Positional encodings" are therefore added to the input embeddings at the bottoms of the encoder and decoder stacks.

The Transformer adds a vector, the positional encoding, to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence.

(Figure: positional encoding)

The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed.

Positional encodings can be either learned or fixed.

This paper uses sine and cosine functions of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

where pos is the position and i is the dimension.
We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos.
Sine and cosine are periodic, so for a fixed offset k the positional encoding at position pos + k can be expressed as a linear function (a linear transformation) of the encoding at position pos; this linear relationship makes it easier for the model to learn relative positions between words.
An advantage of this encoding is that it may extrapolate to sequence lengths longer than those seen during training, for example if the trained model is asked to translate a sentence longer than any sentence in the training set.
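A minimal numpy sketch of the sinusoidal positional encoding defined above; max_len is an illustrative assumption and dmodel is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # one frequency per pair of dims
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

# Usage: added to the input embeddings at the bottoms of the encoder and decoder stacks.
pe = sinusoidal_positional_encoding(100, 512)           # (100, 512)
```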

In other NLP papers, position embeddings are usually learned vectors and serve only as extra features: having the information helps, but removing it does not cause a drastic performance drop, because RNNs and CNNs can capture positional information by themselves. In the Transformer, however, the positional encoding is the only source of position information, so it is a core component of the model rather than an auxiliary feature.

4 Why Self-Attention

Self-attention layers are compared with recurrent and convolutional layers on three criteria:

  1. the total computational complexity per layer.
  2. the amount of computation that can be parallelized, measured by the minimum number of sequential operations required.
  3. the path length between long-range dependencies in the network.

5 Training

5.1 Training Data and Batching

Training used sentence pairs; sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens.
Sentence pairs were batched together by approximate sequence length.

5.2 Hardware and Schedule

8 NVIDIA P100 GPUs.
base models: trained for 100,000 steps, or 12 hours.
big models: trained for 300,000 steps (3.5 days).

5.3 Optimizer

Adam optimizer, with β1 = 0.9, β2 = 0.98 and ε = 10^-9.
The learning rate is varied over the course of training according to the following formula:
lrate = dmodel^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5)), with warmup_steps = 4000
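A minimal Python sketch that simply evaluates the formula above; nothing here is specific to any library.

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# The rate increases linearly for the first warmup_steps steps,
# then decays proportionally to 1/sqrt(step_num).
for step in (1, 1000, 4000, 10000, 100000):
    print(step, transformer_lrate(step))
```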

5.4 Regularization

Residual Dropout: dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized, and to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. Pdrop = 0.1 in this paper.
(That is, residual dropout is used in two places: each sub-layer's output, and the sum of the embeddings and positional encodings in the encoder and decoder stacks.)
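A minimal numpy sketch of where residual dropout sits, extending the residual_sublayer wrapper and layer_norm from 3.1; the inverted-dropout formulation is an implementation assumption.

```python
import numpy as np

def dropout(x, p_drop=0.1, training=True):
    """Inverted dropout: zero activations with probability p_drop and rescale the rest."""
    if not training or p_drop == 0.0:
        return x
    keep = np.random.rand(*x.shape) >= p_drop
    return x * keep / (1.0 - p_drop)

def residual_sublayer_with_dropout(x, sublayer, p_drop=0.1):
    """Dropout on the sub-layer output, before the residual add and layer norm."""
    return layer_norm(x + dropout(sublayer(x), p_drop))

# Dropout is likewise applied to the sum of the embeddings and positional encodings:
# h0 = dropout(embed(tokens) + pe[:len(tokens)], p_drop=0.1)
```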

Label smoothing: label smoothing with εls = 0.1 is used during training; this hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
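A minimal numpy sketch of label smoothing over one-hot targets; spreading εls uniformly over the non-target classes is a common formulation and an assumption here.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """Put 1 - eps on the true class and spread eps uniformly over the other classes."""
    dist = np.full((len(target_ids), vocab_size), eps / (vocab_size - 1))
    dist[np.arange(len(target_ids)), target_ids] = 1.0 - eps
    return dist

# Usage: three target tokens over a toy vocabulary of 10 symbols; each row sums to 1.
print(smooth_labels(np.array([2, 5, 7]), vocab_size=10).sum(axis=-1))
```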
(Figure: Transformer experimental results)

6 Results

6.1 Machine Translation

Parameters of the various machine translation models:
(Table: machine translation model parameters)

The paper varies the number of attention heads and the attention key and value dimensions, with the following findings:

  1. reducing the attention key size dk hurts model quality
  2. a more sophisticated compatibility function than dot product may be beneficial
  3. bigger models are better
  4. dropout is very helpful in avoiding over-fitting
  5. replacing the sinusoidal positional encoding with learned positional embeddings gives nearly identical results.