Attention Summary, Part 2:

Papers covered:

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (uses hard/soft attention)

Effective Approaches to Attention-based Neural Machine Translation (proposes global/local attention)

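Articles referenced in this post:

Attention, Part 2
Five attention model methods you need to know and their applications
A survey of attention model methods
Reading notes on attention papers: global attention and local attention
Global Attention / Local Attention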

Outline of this post

The core idea of the attention mechanism
A summary of the attention variants (hard/soft/global/local attention)
Other attention-related topics

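1 The core idea of the attention mechanism

For the core idea, see this article, which also covers self-attention.
Put simply, attention works on a (query, key, value) triple: the query is compared with each key to produce weights, and the weights are used to combine the values. In machine translation the keys and the values are the same vectors.
PS: for the basic idea of the attention mechanism as applied in NMT, see the paper notes in Attention Summary, Part 1.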
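
The (query, key, value) view can be sketched in a few lines of NumPy. This is only an illustration: the dot-product comparison, the function name `attend`, and the random toy data are assumptions, not something taken from the papers above. The keys and the values are the same encoder states, as in the machine translation case.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys, values):
    """query: (d,), keys: (n, d), values: (n, d_v)."""
    scores = keys @ query        # one similarity score per key, shape (n,)
    weights = softmax(scores)    # attention weights, sum to 1
    context = weights @ values   # weighted sum of the values, shape (d_v,)
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))  # toy data: 5 source positions
decoder_state = rng.normal(size=4)        # toy query

# In machine translation the keys and values are the same vectors.
context, weights = attend(decoder_state, encoder_states, encoder_states)
```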

2 Attention variants

Now for the other kinds of attention:

hard attention
soft attention
global attention
local attention
self-attention: target = source -> multi-head attention (left for Attention Summary, Part 3)

2.1 hard attention

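Paper: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.

Notes based on: A survey of attention model methods

Soft attention keeps all components and combines them with weights; hard attention selects a subset of the components according to some strategy. In other words, hard attention attends to only part of the input.
Soft attention can be trained directly with backpropagation.

Characteristics of hard attention: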
the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train

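Details

The encoder uses a CNN (VGG net) to extract L vectors of dimension D from the image, a_i, i = 1, 2, …, L, each representing one part of the image.
The decoder is an LSTM whose input at each timestep t has three parts: z_t, h_{t-1}, and y_{t-1}. z_t is computed from the a_i and the weights α_{ti}.
The α_{ti} are computed by the attention model f_att.
In this paper f_att is a multilayer perceptron conditioned on the previous hidden state:

e_{ti} = f_att(a_i, h_{t-1})
α_{ti} = exp(e_{ti}) / Σ_{k=1}^{L} exp(e_{tk})

from which z_t can be computed:

z_t = φ({a_i}, {α_{ti}})

There are two versions of the attention mechanism: stochastic attention and deterministic attention.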
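
A minimal sketch of how the weights α_{ti} could be computed from the annotation vectors a_i and the previous hidden state h_{t-1}. The paper only says f_att is a multilayer perceptron; the single tanh hidden layer, the parameter names (`W_a`, `W_h`, `b`, `w`), and all sizes below are assumptions for illustration.

```python
import numpy as np

def attention_weights(a, h_prev, W_a, W_h, b, w):
    """a: (L, D) annotation vectors, h_prev: (H,) previous LSTM state."""
    # One-hidden-layer MLP (assumed form): e_i = w^T tanh(W_a a_i + W_h h_prev + b)
    hidden = np.tanh(a @ W_a + h_prev @ W_h + b)  # (L, K)
    e = hidden @ w                                # (L,) unnormalized scores
    e = e - e.max()                               # numerical stability
    return np.exp(e) / np.exp(e).sum()            # softmax over the L locations

rng = np.random.default_rng(0)
L, D, H, K = 6, 8, 5, 7                           # hypothetical sizes
a = rng.normal(size=(L, D))
h_prev = rng.normal(size=H)
W_a = rng.normal(size=(D, K))
W_h = rng.normal(size=(H, K))
b = np.zeros(K)
w = rng.normal(size=K)

alpha = attention_weights(a, h_prev, W_a, W_h, b, w)
```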

2.1.2 Stochastic “Hard” Attention

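s_t is the index of the location the decoder's attention focuses on at time t. s_{t,i} indicates whether the attention at time t focuses on location i, i = 1, 2, …, L, and [s_{t,1}, s_{t,2}, …, s_{t,L}] is a one-hot encoding. Focusing on exactly one location at each step is where the name "hard" comes from.
The model generates the sequence y = (y_1, …, y_C) from a = (a_1, a_2, …, a_L). Here s = {s_1, s_2, …, s_C} is the sequence of focus locations along the time axis; in principle there are L^C possible sequences.

PS: a general deep learning habit: study the objective function, and then the gradient of the objective with respect to the parameters.

Because s is not given explicitly, the well-known Jensen's inequality is applied to the objective (maximizing log p(y|a)) to obtain a lower bound on it:

L_s = Σ_s p(s|a) log p(y|s,a) ≤ log Σ_s p(s|a) p(y|s,a) = log p(y|a)

The lower bound L_s is then used in place of the original objective log p(y|a): its gradient with respect to the model parameters W is computed, and s is sampled with Monte Carlo methods to approximate that gradient.
Further details involve reinforcement learning.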
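
The Monte Carlo step can be illustrated on a toy problem. The sketch below is not the paper's training procedure: it uses a single categorical choice over three locations, a fixed made-up `reward` vector standing in for log p(y|s,a), and logits `theta` standing in for the parameters. It only shows that averaging reward × ∂log p(s)/∂θ over sampled locations approximates the exact gradient of the expected reward, which is why a non-differentiable choice of s can still be trained.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def exact_gradient(theta, reward):
    # d/d theta_j of sum_s p(s) r(s) with p = softmax(theta)
    p = softmax(theta)
    return p * (reward - p @ reward)

def monte_carlo_gradient(theta, reward, n_samples, rng):
    p = softmax(theta)
    samples = rng.choice(len(p), size=n_samples, p=p)  # sampled locations s
    one_hot = np.eye(len(p))[samples]                  # (N, L)
    grad_log_p = one_hot - p                           # d log p(s) / d theta
    # Score-function estimator: average of r(s) * grad log p(s)
    return (reward[samples][:, None] * grad_log_p).mean(axis=0)

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1])   # toy logits (assumption)
reward = np.array([1.0, 0.0, 2.0])   # toy stand-in for log p(y|s,a)

exact = exact_gradient(theta, reward)
estimate = monte_carlo_gradient(theta, reward, n_samples=200_000, rng=rng)
```

With enough samples the two gradients agree closely; with few samples the estimate is noisy, which is where variance reduction techniques come in.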

2.1.3 Deterministic “Soft” Attention

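The whole model is smooth and differentiable under the deterministic attention, so learning end-to-end is trivial by using standard backpropagation. (That is, the objective function, i.e. the LSTM's objective, is differentiable with respect to the weights α_{ti}. The reason is simple: the objective is differentiable with respect to z_t, and z_t is differentiable with respect to α_{ti}, so by the chain rule the objective is differentiable with respect to α_{ti}.)

In hard attention, exactly one entry of [s_{t,1}, …, s_{t,L}] is 1 at each time t and all the others are 0, i.e. the model focuses on a single location each time. Soft attention instead takes every location into account at each step, only with different weights. z_t is the weighted sum of the a_i:

z_t = Σ_{i=1}^{L} α_{ti} a_i

A refinement adds a gating scalar:

z_t = β_t Σ_{i=1}^{L} α_{ti} a_i, with β_t = σ(f_β(h_{t-1}))

β_t adjusts the weight of the context vector in the LSTM relative to h_{t-1} and y_{t-1}.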
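
To make the contrast concrete, here is a small sketch of how z_t is formed in each case. The annotation vectors, the weights `alpha`, and the gate input 0.3 are made-up values; only the two ways of forming z_t (weighted sum versus picking one sampled location) and the scalar gate follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 3
a = rng.normal(size=(L, D))              # annotation vectors a_i (toy data)
alpha = np.array([0.1, 0.6, 0.2, 0.1])   # attention weights for one timestep

# Soft attention: every location contributes, weighted by alpha.
z_soft = alpha @ a

# Hard attention: sample a single location, one-hot selects one a_i.
s_t = rng.choice(L, p=alpha)
one_hot = np.eye(L)[s_t]
z_hard = one_hot @ a

# Gated soft attention: a scalar in (0, 1) rescales the context vector.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = sigmoid(0.3)                      # 0.3 is a hypothetical f_beta(h_{t-1})
z_gated = beta * z_soft
```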

2.1.4 Training

Both attention models are trained with SGD (stochastic gradient descent).

2.2 Global/Local Attention paper

Paper: Effective Approaches to Attention-based Neural Machine Translation

Notes based on:

Reading notes on attention papers: global attention and local attention

Global Attention / Local Attention

How the paper computes the context vector:

h_t -> a_t -> c_t -> h^~_t

Global Attention

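Global attention takes all the hidden states produced by the encoder into account when computing the context vector c_t.

This also shows that global attention is very similar to the attention in Attention Summary, Part 1, but simpler. For the differences between the two, see this article.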

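Let h_t be the target hidden state of the decoder at time t, and let h^-_s, s = 1, 2, …, n, be all the hidden states of the encoder (the source hidden states, written with a bar in the paper). The attentional hidden state is h^~_t, which is computed at the end of the pipeline.

The weights a_t(s) form a variable-length alignment vector a_t whose length equals the number of time steps on the encoder side. Each a_t(s) is obtained by comparing the current decoder hidden state h_t with the encoder hidden state h^-_s:

a_t(s) = align(h_t, h^-_s) = exp(score(h_t, h^-_s)) / Σ_{s'} exp(score(h_t, h^-_{s'}))

score is a content-based function; the paper considers three ways to compute it:

dot: score(h_t, h^-_s) = h_t^T h^-_s
general: score(h_t, h^-_s) = h_t^T W_a h^-_s
concat: score(h_t, h^-_s) = v_a^T tanh(W_a [h_t; h^-_s])

Of these, dot works better for global attention and general works better for local attention.

Another, location-based way of scoring needs only h_t: the alignment scores are computed from h_t alone through a weight matrix W_a, which gives a_t directly:

a_t = softmax(W_a h_t)

Taking the weighted average of the h^-_s with weights a_t gives the context vector c_t, after which the remaining steps follow.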

Figure: the global attention process.
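
The global attention step can be sketched as follows, with the three score functions from the paper (dot, general, concat). The function names, matrix shapes, and random toy data are assumptions for illustration; in a real model the weight matrices are learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score_dot(h_t, h_src):
    return h_src @ h_t                       # h_t^T h_s for every s

def score_general(h_t, h_src, W_a):
    return h_src @ W_a.T @ h_t               # h_t^T W_a h_s for every s

def score_concat(h_t, h_src, W_a, v_a):
    # Build [h_t; h_s] for every source position, shape (S, 2d).
    pairs = np.concatenate([np.tile(h_t, (len(h_src), 1)), h_src], axis=1)
    return np.tanh(pairs @ W_a.T) @ v_a      # v_a^T tanh(W_a [h_t; h_s])

def global_attention(h_src, scores):
    a_t = softmax(scores)                    # alignment vector, length S
    c_t = a_t @ h_src                        # context vector
    return c_t, a_t

rng = np.random.default_rng(0)
S, d, k = 6, 4, 5                            # hypothetical sizes
h_src = rng.normal(size=(S, d))              # all encoder hidden states
h_t = rng.normal(size=d)                     # current decoder hidden state
W_general = rng.normal(size=(d, d))
W_concat = rng.normal(size=(k, 2 * d))
v_a = rng.normal(size=k)

results = {
    "dot": global_attention(h_src, score_dot(h_t, h_src)),
    "general": global_attention(h_src, score_general(h_t, h_src, W_general)),
    "concat": global_attention(h_src, score_concat(h_t, h_src, W_concat, v_a)),
}
```

Note that the alignment vector always has length S, the source length, whichever score function is used.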

Local Attention

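Global attention has to attend to every encoder input when computing each decoder state, which is computationally expensive.
Local attention can be seen as a blend of hard attention and soft attention (combining their advantages): its computational cost is lower than that of global attention and soft attention, and unlike hard attention it is differentiable almost everywhere and therefore easy to train.

The local attention mechanism selectively attends to a small window of the context (it focuses on only a small subset of the source positions each time), which reduces the computational cost.

In this model, for each target word at time t, the model first generates an aligned position p_t.
The context vector c_t is computed from the set of encoder hidden states that fall inside the window [p_t-D, p_t+D]; the size D is chosen empirically.

The global and local models differ in how c_t is formed; see Global vs Local Attention below.

Back to local attention: p_t is a source position index, which can be understood as the focus of the attention; it is either fixed or predicted from model parameters. There are two schemes for computing p_t: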

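Monotonic alignment (local-m)

Set p_t = t, assuming that the source and target sequences are roughly monotonically aligned. The alignment vector a_t is then defined with the same align function as in global attention, restricted to the window (h^-_s is the encoder hidden state at source position s):

a_t(s) = align(h_t, h^-_s), s ∈ [p_t-D, p_t+D]

Predictive alignment (local-p)

Instead of assuming monotonic alignment, the model predicts an aligned position:

p_t = S · sigmoid(v_p^T tanh(W_p h_t))

W_p and v_p are model parameters that are learned in order to predict positions. S is the source sentence length, so p_t ∈ [0, S].
To favor alignment points near p_t, a Gaussian distribution centered at p_t is placed on the window, so the alignment weights a_t(s) become:

a_t(s) = align(h_t, h^-_s) · exp(-(s - p_t)^2 / (2σ^2)), with σ = D/2 set empirically

The align function here is the same as in global attention. The farther a position is from the center p_t, the more strongly the weight of its source hidden state is shrunk.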
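
A rough sketch of predictive local attention (local-p) under some assumptions: source positions are 0-indexed, the window is centered on floor(p_t) and clipped to the sentence, the dot score is used inside the window, and σ = D/2. The parameter shapes and toy data are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def local_p_attention(h_t, h_src, W_p, v_p, D):
    S = h_src.shape[0]
    # Predicted center: p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real number in [0, S]
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    center = int(np.floor(p_t))
    lo, hi = max(0, center - D), min(S, center + D + 1)   # clip window to the sentence
    positions = np.arange(lo, hi)
    window = h_src[lo:hi]
    align = softmax(window @ h_t)                         # dot score inside the window
    sigma = D / 2.0
    # The Gaussian shrinks weights of positions far from p_t.
    a_t = align * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
    c_t = a_t @ window
    return c_t, a_t, p_t, positions

rng = np.random.default_rng(0)
S, d, k, D = 10, 4, 5, 2                                  # hypothetical sizes
h_src = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_p = rng.normal(size=(k, d))
v_p = rng.normal(size=k)

c_t, a_t, p_t, positions = local_p_attention(h_t, h_src, W_p, v_p, D)
```

Whatever the source length, a_t has at most 2D+1 entries, which is where the computational saving comes from.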

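Once c_t is obtained, h^~_t is computed by a concatenation layer that combines the context vector c_t and h_t:
h^~_t = tanh(W_c [c_t; h_t])
h^~_t is the attentional vector; it is used to produce the predictive distribution over output words:

p(y_t | y_{<t}, x) = softmax(W_s h^~_t)

Figure: the local attention process.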

2.2.1 Global vs Local Attention

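So the difference between global and local attention is:

In the former, the size of the alignment vector a_t is variable and depends on the length of the encoder's input sequence;
In the latter, the alignment vector a_t has a fixed size, a_t ∈ R^{2D+1}.

Global attention and local attention each have pros and cons. In practice global attention is used somewhat more often, because:

When the encoder sequence is not long, local attention does not actually reduce the amount of computation.
The prediction of the position p_t is not very accurate, which directly affects the accuracy of local attention.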

2.2.2 Input-feeding Approach

Input-feeding approach: attentional vectors h^~_t are fed as inputs to the next time steps to inform the model about past alignment decisions. The effect of doing so is twofold:

make the model fully aware of previous alignment choice
we create a very deep network spanning both horizontally and vertically
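
A sketch of the input-feeding loop, with several simplifications that are assumptions: a plain tanh RNN cell stands in for the stacked LSTM, the dot score is used for attention, and the embeddings and weights are random toy values. The point is only the data flow: the attentional vector of step t-1 is concatenated with the input of step t.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4                                       # hidden and embedding size (hypothetical)
S, T = 5, 3                                 # source length, number of decoding steps
h_src = rng.normal(size=(S, d))             # encoder hidden states
embeddings = rng.normal(size=(T, d))        # embeddings of the previous target words
W_in = rng.normal(size=(d, 2 * d)) * 0.1    # input weights for [embedding; h~_{t-1}]
W_hh = rng.normal(size=(d, d)) * 0.1        # recurrent weights
W_c = rng.normal(size=(d, 2 * d)) * 0.1     # attentional layer weights

h = np.zeros(d)
h_tilde = np.zeros(d)                       # no attentional vector before the first step
attentional_vectors = []
for t in range(T):
    # Input feeding: the previous attentional vector is part of the input.
    x_t = np.concatenate([embeddings[t], h_tilde])
    h = np.tanh(W_in @ x_t + W_hh @ h)      # simple RNN cell (stand-in for an LSTM)
    a_t = softmax(h_src @ h)                # global attention with the dot score
    c_t = a_t @ h_src
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h]))
    attentional_vectors.append(h_tilde)
```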

2.2.3 Summary of the techniques used in this paper:

global/local attention
input-feeding approach
better alignment function

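2.2.4 Implementation tips from the paper

Ideas and techniques involved in the implementation:
Build up step by step: for example, start with the base model, then + reverse, + dropout, + global attention, + feed input, + unk replace, and see how much the score improves at each step.
reverse means reversing the source sentence.
The known techniques above are, for example, source reversing, dropout, and the unknown replacement technique.
Ensemble several models, for example 8 models with different settings, such as different attention methods and with or without dropout.

Vocabulary size: for example, the top 50K words for each language.
Unknown words are replaced with <unk>.
Sentence-pair padding, the number of LSTM layers, and parameter initialization, for example uniformly in [-0.1, 0.1]; the normalized gradient is rescaled whenever its norm exceeds 5.

Training method: SGD
Hyperparameter design:
The number of LSTM layers, the number of units per layer (for example 1000 cells), the dimensionality of the word embeddings, the number of epochs, and the mini-batch size (for example 128).
The learning rate can follow a schedule: for example, start at 1 and, after 5 epochs, halve it after every epoch. Dropout is, for example, 0.2.
With dropout, train for 12 epochs and start halving the learning rate after 8 epochs.

Experimental analysis:

learning curves (watch the loss go down)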
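
The learning-rate schedule and the gradient rescaling described above can be written as two small helpers. The function names and the 1-indexed epoch convention are assumptions; the numbers (start at 1, halve every epoch after epoch 5 or 8, rescale when the norm exceeds 5) come from the notes.

```python
import numpy as np

def learning_rate(epoch, initial=1.0, start_halving_after=5):
    """Learning rate for a 1-indexed epoch: constant, then halved every epoch."""
    n_halvings = max(0, epoch - start_halving_after)
    return initial * 0.5 ** n_halvings

def rescale_gradient(grad, max_norm=5.0):
    """Scale the gradient down so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

# Without dropout: halving starts after epoch 5.
schedule = [learning_rate(e) for e in range(1, 11)]
# With dropout: 12 epochs, halving starts after epoch 8.
dropout_schedule = [learning_rate(e, start_halving_after=8) for e in range(1, 13)]

clipped = rescale_gradient(np.array([6.0, 8.0]))  # norm 10, rescaled to norm 5
```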
effects of long sentences
attentional architectures
alignment quality

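3 Other related topics

3.1 Designing attention

location-based attention

Location-based means that the attention has no additional object to attend to, i.e. the attention score is computed from h_i alone.
s_i = f(h_i) = activation(W^T h_i + b)
general attention (uncommon)
concatenation-based attention

Concatenation-based means that the attention is computed with respect to some other object.
f is a function designed to measure the relevance between h_i and h_t.
s_i = f(h_i, h_t) = v^T activation(W_1 h_i + W_2 h_t + b)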
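
The two scoring functions can be sketched as below. The choice of tanh as the activation, the parameter shapes, and the toy data are assumptions; the formulas themselves are the ones given above.

```python
import numpy as np

def location_based_scores(h, W, b):
    """s_i = activation(W^T h_i + b); h: (n, d), W: (d,), b: scalar."""
    return np.tanh(h @ W + b)

def concatenation_based_scores(h, h_t, W1, W2, b, v):
    """s_i = v^T activation(W1 h_i + W2 h_t + b); h: (n, d), h_t: (d_t,)."""
    return np.tanh(h @ W1.T + W2 @ h_t + b) @ v

rng = np.random.default_rng(0)
n, d, d_t, k = 5, 4, 3, 6                 # hypothetical sizes
h = rng.normal(size=(n, d))               # vectors being attended over
h_t = rng.normal(size=d_t)                # the other object the scores depend on

s_location = location_based_scores(h, rng.normal(size=d), 0.0)
s_concat = concatenation_based_scores(
    h, h_t,
    rng.normal(size=(k, d)), rng.normal(size=(k, d_t)),
    np.zeros(k), rng.normal(size=k),
)
```

In both cases the scores would then be passed through a softmax to obtain attention weights.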

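3.2 Extensions of attention

A document consists of k2 sentences, and each sentence consists of k1 words (k1 differs from sentence to sentence).

First level: word-level attention
Each sentence has k1 words and therefore k1 corresponding vectors w_i. Applying the attention method described earlier gives a representation vector for each sentence, denoted st_i.
Second level: sentence-level attention
From the first level of attention we obtain k2 vectors st_i. Applying the same attention method again gives a representation vector d_i for each document, and also the weight α_i of each st_i. What to do with these depends on the specific task.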
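
The two-level scheme can be sketched by applying the same attention pooling twice. The scoring function (a tanh layer followed by a context vector `u`), the helper name `attention_pool`, and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(vectors, W, b, u):
    """vectors: (n, d). Returns the attention-weighted sum (d,) and the weights (n,)."""
    scores = np.tanh(vectors @ W.T + b) @ u
    weights = softmax(scores)
    return weights @ vectors, weights

rng = np.random.default_rng(0)
d, k = 4, 3                                   # hypothetical sizes
sentence_lengths = [2, 5, 3]                  # k1 differs per sentence; k2 = 3
document = [rng.normal(size=(n, d)) for n in sentence_lengths]

# Separate (toy) parameters for the word level and the sentence level.
W_w, b_w, u_w = rng.normal(size=(k, d)), np.zeros(k), rng.normal(size=k)
W_s, b_s, u_s = rng.normal(size=(k, d)), np.zeros(k), rng.normal(size=k)

# Level 1: word-level attention gives one vector per sentence.
sentence_vectors = np.stack(
    [attention_pool(words, W_w, b_w, u_w)[0] for words in document]
)
# Level 2: sentence-level attention gives the document vector and sentence weights.
doc_vector, sentence_weights = attention_pool(sentence_vectors, W_s, b_s, u_s)
```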
