[Repost] How TensorFlow Implements RNNs and Attention

Original link: http://cairohy.github.io/2017/06/05/ml-coding-summarize/Tensorflow%E7%9A%84RNN%E5%92%8CAttention%E7%9B%B8%E5%85%B3/
Published 2017-06-05 | Category: Programming Notes
Today let's take a look at how the different kinds of RNN and attention are actually implemented in TensorFlow.

1. From RNNCell to LSTM
Every recurrent neural network must have one or more cells, and the common parent class of all these cells is RNNCell, an abstract class. It has a __call__() method: each call takes an input (batch_size × input_size) and a state, and returns a tuple of (output, new_state).
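As a minimal, self-contained sketch of that contract (not the actual TF source; it assumes TF 1.x graph mode, 1.4+ for tf.AUTO_REUSE), here is one step of a vanilla RNN written as a plain function. It also illustrates what the _linear() helper discussed below does: concatenate its inputs, multiply by a single weight matrix, and add a bias.

import tensorflow as tf

def basic_rnn_step(inputs, state, num_units, scope="basic_rnn_cell"):
    """One step of a vanilla RNN: maps (input, state) to (output, new_state)."""
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # Equivalent to _linear([inputs, state], num_units, bias=True):
        # concatenate, multiply by one weight matrix, add a bias.
        x = tf.concat([inputs, state], axis=1)   # [B, input_size + num_units]
        w = tf.get_variable("kernel", [x.get_shape().as_list()[1], num_units])
        b = tf.get_variable("bias", [num_units],
                            initializer=tf.zeros_initializer())
        new_state = tf.tanh(tf.matmul(x, w) + b)
    # For the basic RNN cell, the output and the new state are the same tensor.
    return new_state, new_state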

1. BasicRNNCell, i.e. the classic RNN. When called, the output and state are computed as output = new_state = act(W * input + U * state + B); internally it calls the _linear() function.
2. The _linear() function takes its inputs, multiplies them by the weight matrix W, adds the bias b, and returns the result.
3. BasicLSTMCell, i.e. the standard LSTM. Its call function:
def call(self, inputs, state, scope=None):
  """Long short-term memory cell (LSTM)."""
  with _checked_scope(self, scope or "basic_lstm_cell", reuse=self._reuse):
    # Parameters of gates are concatenated into one multiply for efficiency.
    if self._state_is_tuple:
      # Usually this branch is taken: unpack c_t and h_t.
      c, h = state
    else:
      c, h = array_ops.split(value=state, num_or_size_splits=2, axis=1)

    # Following "Recurrent Neural Network Regularization", all four gates
    # are computed with a single matrix multiplication.
    concat = _linear([inputs, h], 4 * self._num_units, True)

    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)

    new_c = (c * sigmoid(f + self._forget_bias) + sigmoid(i) *
             self._activation(j))
    new_h = self._activation(new_c) * sigmoid(o)

    if self._state_is_tuple:
      new_state = LSTMStateTuple(new_c, new_h)
    else:
      new_state = array_ops.concat([new_c, new_h], 1)
    # Note: the output returned here is h_t, while the state is (c, h).
    return new_h, new_state

This corresponds exactly to the equations from the paper.
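Reconstructing them in the notation of the code above (with $\sigma$ the sigmoid, $\odot$ the element-wise product, and $\tanh$ the default activation):

$[i, j, f, o] = \mathrm{split}\big(W\,[x_t; h_{t-1}] + b\big)$
$c_t = c_{t-1} \odot \sigma(f + b_{\text{forget}}) + \sigma(i) \odot \tanh(j)$
$h_t = \tanh(c_t) \odot \sigma(o)$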

Overall, these equations boil down to $h_t = G(h_{t-1}, x_t, c_t)$.

4. GRUCell, which follows the implementation from the 2014 EMNLP paper "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". The paper's equations are as follows (with the parameters written in simplified form):
Taken together, they amount to $h_t = G(h_{t-1}, x_t)$:
$r = \sigma(W x_t + U h_{t-1})$
$z = \sigma(W x_t + U h_{t-1})$
$h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}_t$
$\tilde{h}_t = \phi(W x_t + U (r \odot h_{t-1}))$
The code is as follows:

def call(self, inputs, state, scope=None):
  """Gated recurrent unit (GRU) with nunits cells."""
  with _checked_scope(self, scope or "gru_cell", reuse=self._reuse):
    with vs.variable_scope("gates"):  # Reset gate and update gate.
      # We start with bias of 1.0 to not reset and not update.
      # Both gates are computed with a single matrix multiplication.
      value = sigmoid(_linear(
          [inputs, state], 2 * self._num_units, True, 1.0))
      # The u here is the z (update gate) in the equations above.
      r, u = array_ops.split(
          value=value,
          num_or_size_splits=2,
          axis=1)
    with vs.variable_scope("candidate"):
      c = self._activation(_linear([inputs, r * state],
                                   self._num_units, True))
    new_h = u * state + (1 - u) * c
  # In a GRU, the output and the state are the same h.
  return new_h, new_h
In addition, there is LSTMCell, which supports peepholes and projection; again, it differs only in its __call__() method.
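As a usage sketch (hedged: it assumes the TF 1.x contrib API discussed in this post, and the shapes are made up for illustration), here is how one of these cells can be unrolled by hand, calling it once per time step:

import tensorflow as tf

batch_size, num_steps, input_size, num_units = 32, 10, 128, 256
inputs = tf.placeholder(tf.float32, [batch_size, num_steps, input_size])

cell = tf.contrib.rnn.BasicLSTMCell(num_units, state_is_tuple=True)
state = cell.zero_state(batch_size, tf.float32)   # an LSTMStateTuple (c, h)

outputs = []
with tf.variable_scope("manual_unroll"):
    for t in range(num_steps):
        if t > 0:
            # Reuse the same cell weights at every time step.
            tf.get_variable_scope().reuse_variables()
        # Each call returns (h_t, (c_t, h_t)).
        output, state = cell(inputs[:, t, :], state)
        outputs.append(output)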

2. Cell Wrappers
Classes such as InputProjectionWrapper and OutputProjectionWrapper, which project the cell's inputs or outputs, are often slower than simply applying the corresponding TF ops outside the cell yourself.

DropoutWrapper holds a cell as an attribute and implements the call method; it applies dropout before and after invoking the wrapped cell, supporting dropout on the inputs, the state, and the outputs.

ResidualWrapper adds the cell's input to its output (a residual connection) and returns the sum.

DeviceWrapper ensures that the cell runs on the specified device.

MultiRNNCell also counts as a wrapper, since it holds a list of cells as an attribute; it is used to build multi-layer RNNs.
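For instance, a hedged sketch of how these wrappers compose (TF 1.x contrib API; the layer sizes and keep probabilities are arbitrary): dropout is applied around each layer, and the layers are then stacked with MultiRNNCell.

import tensorflow as tf

def make_layer(num_units, keep_prob):
    cell = tf.contrib.rnn.BasicLSTMCell(num_units)
    return tf.contrib.rnn.DropoutWrapper(
        cell, input_keep_prob=keep_prob, output_keep_prob=keep_prob)

# Three dropout-wrapped LSTM layers stacked into one multi-layer cell.
stacked_cell = tf.contrib.rnn.MultiRNNCell(
    [make_layer(256, 0.8) for _ in range(3)], state_is_tuple=True)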

AttentionCellWrapper follows the implementation in "Neural Machine Translation by Jointly Learning to Align and Translate", i.e. Bahdanau-style attention. The equations are shown below, where y is the input at time t (which in that paper is also the output at time t-1), s is the hidden state, and c is the context vector obtained by computing similarity scores against the encoder hidden states, normalizing them, and taking the weighted sum:

$s_i = (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i$
$\tilde{s}_i = \tanh(W\, e(y_{i-1}) + U[r_i \circ s_{i-1}])$
$z_i = \sigma(f(y_{i-1}, s_{i-1}, c_i))$
$r_i = \sigma(f(y_{i-1}, s_{i-1}, c_i))$
In that paper, the alignment (similarity) function a is implemented with a feed-forward neural network:

$e_{ij} = a(s_{i-1}, h_j) = V \tanh(g(s_{i-1}, h_j))$
Next, the TF code:

def call(self, inputs, state, scope=None):
  """Long short-term memory cell with attention (LSTMA)."""
  # state \in R^{B \times T}
  with _checked_scope(self, scope or "attention_cell_wrapper",
                      reuse=self._reuse):
    if self._state_is_tuple:
      # Split the state into three parts: the LSTM state, attns (the
      # attention vector), and the attention states.
      state, attns, attn_states = state
    else:
      # If the state is not a tuple, slice it up by length.
      states = state
      state = array_ops.slice(states, [0, 0], [-1, self._cell.state_size])
      attns = array_ops.slice(
          states, [0, self._cell.state_size], [-1, self._attn_size])
      attn_states = array_ops.slice(
          states, [0, self._cell.state_size + self._attn_size],
          [-1, self._attn_size * self._attn_length])
    # The attention states are [None x attn_length x attn_size], i.e.
    # [batch x attention window length x attention vector size].
    attn_states = array_ops.reshape(attn_states,
                                    [-1, self._attn_length, self._attn_size])
    input_size = self._input_size
    if input_size is None:
      input_size = inputs.get_shape().as_list()[1]
    # Project the concatenation of the input and attns back to input_size.
    inputs = _linear([inputs, attns], input_size, True)
    lstm_output, new_state = self._cell(inputs, state)
    if self._state_is_tuple:
      new_state_cat = array_ops.concat(nest.flatten(new_state), 1)
    else:
      new_state_cat = new_state
    # Use the attention mechanism to compute the context vector c_t needed at
    # the next step and the new attention (hidden) states h_j.
    new_attns, new_attn_states = self._attention(new_state_cat, attn_states)
    with vs.variable_scope("attn_output_projection"):
      # Compute the output s_t at time t from c_t and x_t (i.e. y_{t-1}).
      output = _linear([lstm_output, new_attns], self._attn_size, True)
    # Append the current output s_t to the attention states for the next step.
    new_attn_states = array_ops.concat(
        [new_attn_states, array_ops.expand_dims(output, 1)], 1)
    new_attn_states = array_ops.reshape(
        new_attn_states, [-1, self._attn_length * self._attn_size])
    new_state = (new_state, new_attns, new_attn_states)
    if not self._state_is_tuple:
      new_state = array_ops.concat(list(new_state), 1)
    # Finally return s_t and the state. Note that the h here is s_t itself, so
    # this AttentionCellWrapper has limited applicability; in some cases it
    # cannot be used as-is and has to be customized.
    return output, new_state

def _attention(self, query, attn_states):
  conv2d = nn_ops.conv2d
  reduce_sum = math_ops.reduce_sum
  softmax = nn_ops.softmax
  tanh = math_ops.tanh

  with vs.variable_scope("attention"):
    k = vs.get_variable(
        "attn_w", [1, 1, self._attn_size, self._attn_vec_size])
    v = vs.get_variable("attn_v", [self._attn_vec_size])
    # Equivalent to all of the h_j.
    hidden = array_ops.reshape(attn_states,
                               [-1, self._attn_length, 1, self._attn_size])
    # Compute U h_j; shape: [None, attn_length, 1, attn_vec_size].
    hidden_features = conv2d(hidden, k, [1, 1, 1, 1], "SAME")
    y = _linear(query, self._attn_vec_size, True)
    # Compute W s_i.
    y = array_ops.reshape(y, [-1, 1, 1, self._attn_vec_size])
    # Attention score: s \in R^{B x attn_length}, corresponding to all e_{ij}.
    s = reduce_sum(v * tanh(hidden_features + y), [2, 3])
    # a \in R^{B x attn_length}, the \alpha in the paper.
    a = softmax(s)
    # Context vector c_i = \sum_j \alpha_{ij} h_j.
    d = reduce_sum(
        array_ops.reshape(a, [-1, self._attn_length, 1, 1]) * hidden, [1, 2])
    new_attns = array_ops.reshape(d, [-1, self._attn_size])
    # Drop the oldest attention state.
    new_attn_states = array_ops.slice(attn_states, [0, 1, 0], [-1, -1, -1])
    return new_attns, new_attn_states

Finally, static_rnn, dynamic_rnn, and bidirectional_dynamic_rnn all internally call these cells' __call__() method.
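A minimal sketch of the latter two (TF 1.x API; the shapes are illustrative): they differ only in how the time dimension is driven, while each step still goes through the cell's __call__().

import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 20, 128])   # [batch, time, features]

# dynamic_rnn unrolls the cell with a tf.while_loop at graph-execution time.
with tf.variable_scope("uni"):
    cell = tf.contrib.rnn.GRUCell(256)
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

# bidirectional_dynamic_rnn runs one cell forward in time and another backward,
# returning the two output sequences and final states as pairs.
with tf.variable_scope("bi"):
    cell_fw = tf.contrib.rnn.GRUCell(256)
    cell_bw = tf.contrib.rnn.GRUCell(256)
    (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, inputs, dtype=tf.float32)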

3. The Various Attention Classes
_BaseAttentionMechanism
BahdanauAttention
LuongAttention
DynamicAttentionWrapper
The usage example given in the official docs:

cell = tf.contrib.rnn.DeviceWrapper(LSTMCell(512), "/gpu:0")
attention_mechanism = tf.contrib.seq2seq.LuongAttention(512, encoder_outputs)
attn_cell = tf.contrib.seq2seq.DynamicAttentionWrapper(
    cell, attention_mechanism, attention_size=256)
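Since the wrapped cell is still an RNNCell, it can then be driven like any other cell. A hedged continuation of the snippet above (decoder_inputs and batch_size are assumed to be defined elsewhere):

outputs, final_state = tf.nn.dynamic_rnn(
    attn_cell,
    decoder_inputs,                                   # [batch, time, input_size]
    initial_state=attn_cell.zero_state(batch_size, tf.float32))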
References
https://www.tensorflow.org/api_guides/python/contrib.seq2seq#Attention
https://www.tensorflow.org/api_guides/python/contrib.rnn#Core_RNN_Cell_wrappers_RNNCells_that_wrap_other_RNNCells_
