[Repost] How TensorFlow Implements RNNs and Attention

Original link: http://cairohy.github.io/2017/06/05/ml-coding-summarize/Tensorflow%E7%9A%84RNN%E5%92%8CAttention%E7%9B%B8%E5%85%B3/
Published 2017-06-05 | Category: Programming Notes
Today let's take a look at how the different kinds of RNN and attention are actually implemented in TensorFlow.

1. From RNNCell to LSTM
Every recurrent neural network must have one or more cells, and the common parent class of all these cells is RNNCell, an abstract class. It has a __call__() method: each call takes an input (batch_size × input_size) and a state, and returns a tuple of (output, new_state).
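As a minimal, self-contained sketch of that contract (not the actual TF source; it assumes TF 1.x graph mode, 1.4+ for tf.AUTO_REUSE), here is one step of a vanilla RNN written as a plain function. It also illustrates what the _linear() helper discussed below does: concatenate its inputs, multiply by a single weight matrix, and add a bias.

import tensorflow as tf

def basic_rnn_step(inputs, state, num_units, scope="basic_rnn_cell"):
    """One step of a vanilla RNN: maps (input, state) to (output, new_state)."""
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # Equivalent to _linear([inputs, state], num_units, bias=True):
        # concatenate, multiply by one weight matrix, add a bias.
        x = tf.concat([inputs, state], axis=1)   # [B, input_size + num_units]
        w = tf.get_variable("kernel", [x.get_shape().as_list()[1], num_units])
        b = tf.get_variable("bias", [num_units],
                            initializer=tf.zeros_initializer())
        new_state = tf.tanh(tf.matmul(x, w) + b)
    # For the basic RNN cell, the output and the new state are the same tensor.
    return new_state, new_state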

1. BasicRNNCell, i.e. the classic RNN. When called, the output and state are computed as output = new_state = act(W * input + U * state + B); internally it calls the _linear() function.
2. The _linear() function takes its inputs, multiplies them by the weight matrix W, adds the bias b, and returns the result.
3. BasicLSTMCell, i.e. the standard LSTM. Its call function:
def call(self, inputs, state, scope=None):
  """Long short-term memory cell (LSTM)."""
  with _checked_scope(self, scope or "basic_lstm_cell", reuse=self._reuse):
    # Parameters of gates are concatenated into one multiply for efficiency.
    if self._state_is_tuple:
      # Usually this branch is taken: unpack c_t and h_t.
      c, h = state
    else:
      c, h = array_ops.split(value=state, num_or_size_splits=2, axis=1)

    # Following "Recurrent Neural Network Regularization", all four gates
    # are computed with a single matrix multiplication.
    concat = _linear([inputs, h], 4 * self._num_units, True)

    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)

    new_c = (c * sigmoid(f + self._forget_bias) + sigmoid(i) *
             self._activation(j))
    new_h = self._activation(new_c) * sigmoid(o)

    if self._state_is_tuple:
      new_state = LSTMStateTuple(new_c, new_h)
    else:
      new_state = array_ops.concat([new_c, new_h], 1)
    # Note: the output returned here is h_t, while the state is (c, h).
    return new_h, new_state

This corresponds exactly to the equations from the paper.
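Reconstructing them in the notation of the code above (with $\sigma$ the sigmoid, $\odot$ the element-wise product, and $\tanh$ the default activation):

$[i, j, f, o] = \mathrm{split}\big(W\,[x_t; h_{t-1}] + b\big)$
$c_t = c_{t-1} \odot \sigma(f + b_{\text{forget}}) + \sigma(i) \odot \tanh(j)$
$h_t = \tanh(c_t) \odot \sigma(o)$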

Overall, these equations boil down to $h_t = G(h_{t-1}, x_t, c_t)$.

4. GRUCell, which follows the implementation from the 2014 EMNLP paper "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". The paper's equations are as follows (with the parameters written in simplified form):
Taken together, they amount to $h_t = G(h_{t-1}, x_t)$:
$r = \sigma(W x_t + U h_{t-1})$
$z = \sigma(W x_t + U h_{t-1})$
$h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}_t$
$\tilde{h}_t = \phi(W x_t + U (r \odot h_{t-1}))$
The code is as follows:

def call(self, inputs, state, scope=None):
  """Gated recurrent unit (GRU) with nunits cells."""
  with _checked_scope(self, scope or "gru_cell", reuse=self._reuse):
    with vs.variable_scope("gates"):  # Reset gate and update gate.
      # We start with bias of 1.0 to not reset and not update.
      # Both gates are computed with a single matrix multiplication.
      value = sigmoid(_linear(
          [inputs, state], 2 * self._num_units, True, 1.0))
      # The u here is the z (update gate) in the equations above.
      r, u = array_ops.split(
          value=value,
          num_or_size_splits=2,
          axis=1)
    with vs.variable_scope("candidate"):
      c = self._activation(_linear([inputs, r * state],
                                   self._num_units, True))
    new_h = u * state + (1 - u) * c
  # In a GRU, the output and the state are the same h.
  return new_h, new_h
In addition, there is LSTMCell, which supports peepholes and projection; again, it differs only in its __call__() method.
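As a usage sketch (hedged: it assumes the TF 1.x contrib API discussed in this post, and the shapes are made up for illustration), here is how one of these cells can be unrolled by hand, calling it once per time step:

import tensorflow as tf

batch_size, num_steps, input_size, num_units = 32, 10, 128, 256
inputs = tf.placeholder(tf.float32, [batch_size, num_steps, input_size])

cell = tf.contrib.rnn.BasicLSTMCell(num_units, state_is_tuple=True)
state = cell.zero_state(batch_size, tf.float32)   # an LSTMStateTuple (c, h)

outputs = []
with tf.variable_scope("manual_unroll"):
    for t in range(num_steps):
        if t > 0:
            # Reuse the same cell weights at every time step.
            tf.get_variable_scope().reuse_variables()
        # Each call returns (h_t, (c_t, h_t)).
        output, state = cell(inputs[:, t, :], state)
        outputs.append(output)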

2. Cell Wrappers
Classes such as InputProjectionWrapper and OutputProjectionWrapper, which project the cell's inputs or outputs, are often slower than simply applying the corresponding TF ops outside the cell yourself.

DropoutWrapper holds a cell as an attribute and implements the call method; it applies dropout before and after invoking the wrapped cell, supporting dropout on the inputs, the state, and the outputs.

ResidualWrapper adds the cell's input to its output (a residual connection) and returns the sum.

DeviceWrapper ensures that the cell runs on the specified device.

MultiRNNCell also counts as a wrapper, since it holds a list of cells as an attribute; it is used to build multi-layer RNNs.
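For instance, a hedged sketch of how these wrappers compose (TF 1.x contrib API; the layer sizes and keep probabilities are arbitrary): dropout is applied around each layer, and the layers are then stacked with MultiRNNCell.

import tensorflow as tf

def make_layer(num_units, keep_prob):
    cell = tf.contrib.rnn.BasicLSTMCell(num_units)
    return tf.contrib.rnn.DropoutWrapper(
        cell, input_keep_prob=keep_prob, output_keep_prob=keep_prob)

# Three dropout-wrapped LSTM layers stacked into one multi-layer cell.
stacked_cell = tf.contrib.rnn.MultiRNNCell(
    [make_layer(256, 0.8) for _ in range(3)], state_is_tuple=True)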

AttentionCellWrapper follows the implementation in "Neural Machine Translation by Jointly Learning to Align and Translate", i.e. Bahdanau-style attention. The equations are shown below, where y is the input at time t (which in that paper is also the output at time t-1), s is the hidden state, and c is the context vector obtained by computing similarity scores against the encoder hidden states, normalizing them, and taking the weighted sum:

$s_i = (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i$
$\tilde{s}_i = \tanh(W\, e(y_{i-1}) + U[r_i \circ s_{i-1}])$
$z_i = \sigma(f(y_{i-1}, s_{i-1}, c_i))$
$r_i = \sigma(f(y_{i-1}, s_{i-1}, c_i))$
In that paper, the alignment (similarity) function a is implemented with a feed-forward neural network:

$e_{ij} = a(s_{i-1}, h_j) = V \tanh(g(s_{i-1}, h_j))$
Next, the TF code:

def call(self, inputs, state, scope=None):
  """Long short-term memory cell with attention (LSTMA)."""
  # state \in R^{B \times T}
  with _checked_scope(self, scope or "attention_cell_wrapper",
                      reuse=self._reuse):
    if self._state_is_tuple:
      # Split the state into three parts: the LSTM state, attns (the
      # attention vector), and the attention states.
      state, attns, attn_states = state
    else:
      # If the state is not a tuple, slice it up by length.
      states = state
      state = array_ops.slice(states, [0, 0], [-1, self._cell.state_size])
      attns = array_ops.slice(
          states, [0, self._cell.state_size], [-1, self._attn_size])
      attn_states = array_ops.slice(
          states, [0, self._cell.state_size + self._attn_size],
          [-1, self._attn_size * self._attn_length])
    # The attention states are [None x attn_length x attn_size], i.e.
    # [batch x attention window length x attention vector size].
    attn_states = array_ops.reshape(attn_states,
                                    [-1, self._attn_length, self._attn_size])
    input_size = self._input_size
    if input_size is None:
      input_size = inputs.get_shape().as_list()[1]
    # Project the concatenation of the input and attns back to input_size.
    inputs = _linear([inputs, attns], input_size, True)
    lstm_output, new_state = self._cell(inputs, state)
    if self._state_is_tuple:
      new_state_cat = array_ops.concat(nest.flatten(new_state), 1)
    else:
      new_state_cat = new_state
    # Use the attention mechanism to compute the context vector c_t needed at
    # the next step and the new attention (hidden) states h_j.
    new_attns, new_attn_states = self._attention(new_state_cat, attn_states)
    with vs.variable_scope("attn_output_projection"):
      # Compute the output s_t at time t from c_t and x_t (i.e. y_{t-1}).
      output = _linear([lstm_output, new_attns], self._attn_size, True)
    # Append the current output s_t to the attention states for the next step.
    new_attn_states = array_ops.concat(
        [new_attn_states, array_ops.expand_dims(output, 1)], 1)
    new_attn_states = array_ops.reshape(
        new_attn_states, [-1, self._attn_length * self._attn_size])
    new_state = (new_state, new_attns, new_attn_states)
    if not self._state_is_tuple:
      new_state = array_ops.concat(list(new_state), 1)
    # Finally return s_t and the state. Note that the h here is s_t itself, so
    # this AttentionCellWrapper has limited applicability; in some cases it
    # cannot be used as-is and has to be customized.
    return output, new_state

def _attention(self, query, attn_states):
  conv2d = nn_ops.conv2d
  reduce_sum = math_ops.reduce_sum
  softmax = nn_ops.softmax
  tanh = math_ops.tanh

  with vs.variable_scope("attention"):
    k = vs.get_variable(
        "attn_w", [1, 1, self._attn_size, self._attn_vec_size])
    v = vs.get_variable("attn_v", [self._attn_vec_size])
    # Equivalent to all of the h_j.
    hidden = array_ops.reshape(attn_states,
                               [-1, self._attn_length, 1, self._attn_size])
    # Compute U h_j; shape: [None, attn_length, 1, attn_vec_size].
    hidden_features = conv2d(hidden, k, [1, 1, 1, 1], "SAME")
    y = _linear(query, self._attn_vec_size, True)
    # Compute W s_i.
    y = array_ops.reshape(y, [-1, 1, 1, self._attn_vec_size])
    # Attention score: s \in R^{B x attn_length}, corresponding to all e_{ij}.
    s = reduce_sum(v * tanh(hidden_features + y), [2, 3])
    # a \in R^{B x attn_length}, the \alpha in the paper.
    a = softmax(s)
    # Context vector c_i = \sum_j \alpha_{ij} h_j.
    d = reduce_sum(
        array_ops.reshape(a, [-1, self._attn_length, 1, 1]) * hidden, [1, 2])
    new_attns = array_ops.reshape(d, [-1, self._attn_size])
    # Drop the oldest attention state.
    new_attn_states = array_ops.slice(attn_states, [0, 1, 0], [-1, -1, -1])
    return new_attns, new_attn_states

Finally, static_rnn, dynamic_rnn, and bidirectional_dynamic_rnn all internally call these cells' __call__() method.
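A minimal sketch of the latter two (TF 1.x API; the shapes are illustrative): they differ only in how the time dimension is driven, while each step still goes through the cell's __call__().

import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 20, 128])   # [batch, time, features]

# dynamic_rnn unrolls the cell with a tf.while_loop at graph-execution time.
with tf.variable_scope("uni"):
    cell = tf.contrib.rnn.GRUCell(256)
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

# bidirectional_dynamic_rnn runs one cell forward in time and another backward,
# returning the two output sequences and final states as pairs.
with tf.variable_scope("bi"):
    cell_fw = tf.contrib.rnn.GRUCell(256)
    cell_bw = tf.contrib.rnn.GRUCell(256)
    (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, inputs, dtype=tf.float32)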

3. The Various Attention Classes
_BaseAttentionMechanism
BahdanauAttention
LuongAttention
DynamicAttentionWrapper
The usage example given in the official docs:

cell = tf.contrib.rnn.DeviceWrapper(LSTMCell(512), "/gpu:0")
attention_mechanism = tf.contrib.seq2seq.LuongAttention(512, encoder_outputs)
attn_cell = tf.contrib.seq2seq.DynamicAttentionWrapper(
    cell, attention_mechanism, attention_size=256)
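Since the wrapped cell is still an RNNCell, it can then be driven like any other cell. A hedged continuation of the snippet above (decoder_inputs and batch_size are assumed to be defined elsewhere):

outputs, final_state = tf.nn.dynamic_rnn(
    attn_cell,
    decoder_inputs,                                   # [batch, time, input_size]
    initial_state=attn_cell.zero_state(batch_size, tf.float32))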
References
https://www.tensorflow.org/api_guides/python/contrib.seq2seq#Attention
https://www.tensorflow.org/api_guides/python/contrib.rnn#Core_RNN_Cell_wrappers_RNNCells_that_wrap_other_RNNCells_
