[Paper Notes] Attention is all you need

Before reading these notes, see the previously copied article for a detailed introduction to self-attention and a fairly comprehensive summary of the Transformer.
With that self-attention background, the paper becomes much easier to follow.
Paper: Attention is all you need.

1-2 Introduction & Background

RNN: This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
Workarounds (these treat the symptom rather than the cause, since the fundamental constraint of sequential computation remains):

  1. factorization tricks.
  2. conditional computation.

CNNs have also been used as the basic building block, e.g. in ByteNet and ConvS2S. But this makes it more difficult to learn dependencies between distant positions (the number of operations required to relate two positions grows with the distance between them: linearly for ConvS2S and logarithmically for ByteNet).

History

Name | Description | Limitations
seq2seq | |
encoder-decoder | the traditional framework, usually paired with RNNs |
RNN / LSTM / GRU | direction: unidirectional or bidirectional; depth: single-layer or multi-layer | struggles with long sequences; cannot be parallelized; alignment problems; all necessary information of the source sentence must be compressed into a fixed-length vector
CNN | parallel computation; handles variable-length sequences | memory-hungry; needs many tricks; parameter tuning is difficult on large datasets
Attention Mechanism | attends to a subset of the vectors; solves the alignment problem |

Concepts mentioned:
self-attention;
recurrent attention mechanism.
transduction models

3 Model Architecture

Most competitive sequence transduction models use an encoder-decoder structure:
input sequence: x = (x1, …, xn), n symbols
continuous representations output by the encoder: z = (z1, …, zn), n elements
decoder outputs: y = (y1, …, ym), m elements
Generated one element at a time, consuming the previously generated symbols as additional input when generating the next.
(Figure: Transformer model architecture)
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
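As a rough illustration of this auto-regressive behaviour, here is a minimal greedy decoding sketch in Python; `encode`, `decode_step`, `bos_id` and `eos_id` are hypothetical placeholders, not functions from the paper or any library.

```python
# Minimal sketch of auto-regressive decoding with hypothetical helpers.
def greedy_decode(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    """encode() maps the source to its continuous representation z;
    decode_step() returns scores over the vocabulary given z and the
    symbols generated so far."""
    z = encode(src_tokens)               # z = (z1, ..., zn)
    ys = [bos_id]                        # start symbol
    for _ in range(max_len):
        scores = decode_step(z, ys)      # one element at a time
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_id)               # previously generated symbols become input
        if next_id == eos_id:
            break
    return ys
```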

3.1 Encoder and Decoder Stacks

(Figure: Transformer structure)

Encoder: a stack of N = 6 identical layers. Each layer has two sub-layers (from bottom to top):

  1. multi-head self-attention mechanism.
  2. simple, position-wise fully connected feed-forward network (referred to below as FFN).

(Figure: encoder layer)
We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)).

(Figure: residual connection)
All sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.
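A minimal numpy sketch of the LayerNorm(x + Sublayer(x)) wrapper described above; the learnable gain and bias of layer normalization are omitted, and the identity sub-layer is only a placeholder.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's d_model-dimensional vector (gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """Output of each sub-layer: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# Usage: x has shape (seq_len, d_model); the identity stands in for
# the self-attention or feed-forward sub-layer.
x = np.random.randn(10, 512)
out = residual_sublayer(x, lambda t: t)
```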

Decoder: also a stack of N = 6 identical layers. In addition to the two encoder sub-layers, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. From bottom to top:

  1. masked multi-head self-attention (over the previously generated output symbols)
  2. encoder-decoder attention (multi-head attention over the output of the encoder stack)
  3. FFN

(Figure: hand-drawn sketch of Transformer decoding / the decoder)

Masking ensures that the predictions for position i can depend only on the known outputs at positions less than i.

As in the encoder, a residual connection is employed around each of the sub-layers, followed by layer normalization.

3.2 Attention

The attention mechanism maps a query and a set of key-value pairs to an output (queries, keys, values, and outputs are all vectors). The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The attention variants used in this paper are:

  • Scaled Dot-Product Attention
  • Multi-Head Attention
(Figure: Scaled Dot-Product Attention)

3.2.1 Scaled Dot-Product Attention

The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by sqrt(dk), and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V.
Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) V

PS: the two most commonly used attention functions are additive attention and dot-product attention. Dot-product attention is much faster and more space-efficient in practice, since it can be implemented with highly optimized matrix multiplication code.

The two perform similarly when dk is small, but for large dk the unscaled dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients; hence the dot products are scaled by 1/sqrt(dk).
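A minimal numpy sketch of the scaled dot-product attention formula above; the optional boolean mask argument anticipates the decoder masking in 3.2.3, and the shapes used here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (m, dk), K: (n, dk), V: (n, dv) -> output of shape (m, dv)."""
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)             # compatibility of each query with each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # illegal connections ~ -inf before softmax
    weights = softmax(scores, axis=-1)         # attention weights over the values
    return weights @ V                         # weighted sum of the values

# Usage: 3 queries attending over 5 key-value pairs with dk = dv = 64.
Q, K, V = np.random.randn(3, 64), np.random.randn(5, 64), np.random.randn(5, 64)
out = scaled_dot_product_attention(Q, K, V)    # shape (3, 64)
```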

3.2.2 Multi-Head Attention

With multi-head attention we keep separate Q/K/V weight matrices for each head, which yields different Q/K/V matrices. As before, we multiply the input X by the WQ/WK/WV matrices to obtain the Q/K/V matrices.
That is: Queries, Keys, Values.
Doing the same self-attention computation listed above, just eight times with different weight matrices, produces eight different Z matrices.
The so-called attention heads are exactly these different Z matrices.

The paper projects the queries, keys and values h times with different learned linear projections to dimensions dk, dk and dv, runs scaled dot-product attention on each projection, concatenates the results, and applies a final linear projection to produce the output. Through multi-head attention, the model can capture positional information from different representation subspaces.

These are concatenated and once again projected, resulting in the final values.
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
The eight different Z matrices are concatenated into one matrix and multiplied by an additional weight matrix W^O, giving a single result matrix Z that captures information from all the attention heads; this matrix is then fed to the FFN.
(Figure: Multi-Head Attention)

Key points of multi-head attention (a numpy sketch follows this list):

  • employ h = 8 parallel attention layers, or heads.
  • For each of these we use dk = dv = dmodel / h = 64 (dmodel is the dimension after the heads are combined).
  • Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
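A minimal numpy sketch of multi-head attention built on the scaled_dot_product_attention sketch from 3.2.1; the randomly initialized projection matrices are illustrative assumptions, not trained weights.

```python
import numpy as np

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o):
    """W_q/W_k/W_v: per-head projections of shape (d_model, dk) or (d_model, dv);
    W_o: (h * dv, d_model). Returns an array of shape (len_q, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = X_q @ Wq_i, X_kv @ Wk_i, X_kv @ Wv_i        # project into each subspace
        heads.append(scaled_dot_product_attention(Q, K, V))   # each head: (len_q, dv)
    return np.concatenate(heads, axis=-1) @ W_o               # concat, then final projection

# Usage with d_model = 512, h = 8, dk = dv = 64 (self-attention, so X_q is X_kv).
d_model, h, dk = 512, 8, 64
X = np.random.randn(10, d_model)
W_q = [np.random.randn(d_model, dk) * 0.02 for _ in range(h)]
W_k = [np.random.randn(d_model, dk) * 0.02 for _ in range(h)]
W_v = [np.random.randn(d_model, dk) * 0.02 for _ in range(h)]
W_o = np.random.randn(h * dk, d_model) * 0.02
Z = multi_head_attention(X, X, W_q, W_k, W_v, W_o)            # (10, 512)
```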

3.2.3 Applications of Attention in the Model

(Figure: Transformer model architecture)
The Transformer uses multi-head attention in three different ways:

  1. In “encoder-decoder attention” layers

    The inputs are the encoder output and the output of the decoder's self-attention sub-layer: the queries come from the previous decoder layer, and the keys and values come from the output of the encoder.
    The encoder-decoder attention layer works just like multi-headed self-attention, except that it creates its queries matrix from the layer below it and takes the keys and values matrices from the output of the encoder stack.

  2. The encoder's self-attention layers.

    Each position in the encoder can attend to all positions in the previous layer of the encoder.
    The Q, K and V inputs are all the same (the sum of the input embedding and the positional encoding, or the output of the previous encoder layer).

  3. The decoder's self-attention layers.

    We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
    In the decoder's self-attention layers, each position can attend only to positions up to and including the current one; see the masking sketch below.

(Figure: where multi-head attention is used)
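A minimal sketch of the decoder's look-ahead mask, reusing the scaled_dot_product_attention sketch from 3.2.1; the convention that True marks a legal connection is an assumption of that sketch.

```python
import numpy as np

def causal_mask(n):
    """Position i may attend only to positions j <= i (lower-triangular True)."""
    return np.tril(np.ones((n, n), dtype=bool))

# Usage: decoder self-attention over 5 positions; masked-out scores become ~ -inf,
# so the softmax assigns them practically zero weight.
n, dk = 5, 64
Q = K = V = np.random.randn(n, dk)
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask(n))
```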

3.3 Position-wise Feed-Forward Networks

Each layer in the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. It consists of two linear transformations (two dense layers) with a ReLU activation in between, and can also be viewed as two 1D convolutions with kernel size 1. The hidden size changes as 512 -> 2048 -> 512.

FFN(x) = max(0, xW1 + b1)W2 + b2

While the linear transformations are the same across different positions, they use different parameters from layer to layer.
The position-wise feed-forward network is essentially just an MLP applied at every position.
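A minimal numpy sketch of the FFN formula above with the 512 -> 2048 -> 512 sizes; the randomly initialized weights are illustrative assumptions.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1)W2 + b2, applied at every position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Usage: d_model = 512, inner dimension d_ff = 2048; x has shape (seq_len, d_model).
d_model, d_ff = 512, 2048
x = np.random.randn(10, d_model)
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
out = position_wise_ffn(x, W1, b1, W2, b2)   # (10, 512)
```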

3.4 Embeddings and Softmax

We use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel. The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, those weights are multiplied by sqrt(dmodel).
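A minimal numpy sketch of the shared weight matrix: the same matrix serves for both embedding lookups (scaled by sqrt(dmodel)) and the pre-softmax linear transformation; the vocabulary size and random initialization are illustrative assumptions.

```python
import numpy as np

vocab_size, d_model = 37000, 512
E = np.random.randn(vocab_size, d_model).astype(np.float32) * 0.02  # one shared matrix

def embed(token_ids):
    """Embedding lookup, multiplied by sqrt(d_model) as described above."""
    return E[token_ids] * np.sqrt(d_model)

def pre_softmax_logits(decoder_output):
    """The pre-softmax linear transformation reuses the same matrix, transposed."""
    return decoder_output @ E.T                              # (seq_len, vocab_size)

x = embed(np.array([5, 42, 7]))                              # (3, 512)
logits = pre_softmax_logits(np.random.randn(3, d_model))     # (3, 37000)
```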

3.5 Positional Encoding

Since the model contains no recurrence and no convolution, we must inject some information about the relative or absolute position of the tokens in the sequence. "Positional encodings" are therefore added to the input embeddings at the bottoms of the encoder and decoder stacks.

The Transformer adds a vector, the positional encoding, to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence.

(Figure: positional encoding)

The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed.

Positional encodings can be either learned or fixed.

This paper uses sine and cosine functions of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

where pos is the position and i is the dimension.
We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos.
Sine and cosine are periodic, so for a fixed offset k the positional encoding at position pos + k can be expressed as a linear function (a linear transformation) of the encoding at position pos; this linear relationship makes it easier for the model to learn relative positions between words.
An advantage of this encoding is that it may extrapolate to sequence lengths longer than those seen during training, for example if the trained model is asked to translate a sentence longer than any sentence in the training set.
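A minimal numpy sketch of the sinusoidal positional encoding defined above; max_len is an illustrative assumption and dmodel is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # one frequency per pair of dims
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

# Usage: added to the input embeddings at the bottoms of the encoder and decoder stacks.
pe = sinusoidal_positional_encoding(100, 512)           # (100, 512)
```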

In other NLP papers, position embeddings are usually learned vectors and serve only as extra features: having the information helps, but removing it does not cause a drastic performance drop, because RNNs and CNNs can capture positional information by themselves. In the Transformer, however, the positional encoding is the only source of position information, so it is a core component of the model rather than an auxiliary feature.

4 Why Self-Attention

Self-attention layers are compared with recurrent and convolutional layers on three criteria:

  1. the total computational complexity per layer.
  2. the amount of computation that can be parallelized, measured by the minimum number of sequential operations required.
  3. the path length between long-range dependencies in the network.

5 Training

5.1 Training Data and Batching

Training used sentence pairs; sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens.
Sentence pairs were batched together by approximate sequence length.

5.2 Hardware and Schedule

8 NVIDIA P100 GPUs.
base models: trained for 100,000 steps, or 12 hours.
big models: trained for 300,000 steps (3.5 days).

5.3 Optimizer

Adam optimizer, with β1 = 0.9, β2 = 0.98 and ε = 10^-9.
The learning rate is varied over the course of training according to the following formula:
lrate = dmodel^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5)), with warmup_steps = 4000
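A minimal Python sketch that simply evaluates the formula above; nothing here is specific to any library.

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# The rate increases linearly for the first warmup_steps steps,
# then decays proportionally to 1/sqrt(step_num).
for step in (1, 1000, 4000, 10000, 100000):
    print(step, transformer_lrate(step))
```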

5.4 Regularization

Residual Dropout: dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized, and to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. Pdrop = 0.1 in this paper.
(That is, residual dropout is used in two places: each sub-layer's output, and the sum of the embeddings and positional encodings in the encoder and decoder stacks.)
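A minimal numpy sketch of where residual dropout sits, extending the residual_sublayer wrapper and layer_norm from 3.1; the inverted-dropout formulation is an implementation assumption.

```python
import numpy as np

def dropout(x, p_drop=0.1, training=True):
    """Inverted dropout: zero activations with probability p_drop and rescale the rest."""
    if not training or p_drop == 0.0:
        return x
    keep = np.random.rand(*x.shape) >= p_drop
    return x * keep / (1.0 - p_drop)

def residual_sublayer_with_dropout(x, sublayer, p_drop=0.1):
    """Dropout on the sub-layer output, before the residual add and layer norm."""
    return layer_norm(x + dropout(sublayer(x), p_drop))

# Dropout is likewise applied to the sum of the embeddings and positional encodings:
# h0 = dropout(embed(tokens) + pe[:len(tokens)], p_drop=0.1)
```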

Label smoothing: label smoothing with εls = 0.1 is used during training; this hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
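A minimal numpy sketch of label smoothing over one-hot targets; spreading εls uniformly over the non-target classes is a common formulation and an assumption here.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """Put 1 - eps on the true class and spread eps uniformly over the other classes."""
    dist = np.full((len(target_ids), vocab_size), eps / (vocab_size - 1))
    dist[np.arange(len(target_ids)), target_ids] = 1.0 - eps
    return dist

# Usage: three target tokens over a toy vocabulary of 10 symbols; each row sums to 1.
print(smooth_labels(np.array([2, 5, 7]), vocab_size=10).sum(axis=-1))
```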
(Figure: Transformer experimental results)

6 Results

6.1 Machine Translation

Parameters of the various machine translation models:
(Table: machine translation model parameters)

The paper varies the number of attention heads and the attention key and value dimensions, with the following findings:

  1. reducing the attention key size dk hurts model quality
  2. a more sophisticated compatibility function than dot product may be beneficial
  3. bigger models are better
  4. dropout is very helpful in avoiding over-fitting
  5. replacing the sinusoidal positional encoding with learned positional embeddings gives nearly identical results.