Adversarially Regularized Autoencoders
Kim Y, Zhang K, Rush A M, et al. Adversarially regularized autoencoders[J]. arXiv preprint arXiv:1706.04223, 2017.
GitHub: https://github.com/jakezhaojb/ARAE
adversarially regularized autoencoder (ARAE)
Abstract
Deep latent variable models (i.e., models such as VAEs and GANs that generate from a random seed variable) make it straightforward to generate continuous samples. Applying them to discrete structures such as text or discrete images, however, poses serious challenges. This paper proposes a flexible method for training deep latent variable models of discrete structures.
Background and Notation
Discrete Autoencoder
The idea is to encode a discrete sequence and then decode it, with a softmax producing the discrete output:
$$L_{rec}(\phi,\psi) = -\log p_{\psi}(x \mid enc_{\phi}(x))$$
$$\hat{x} = \arg\max_{x} p_{\psi}(x \mid enc_{\phi}(x))$$
The encoder and decoder are problem-specific; RNNs are a common choice for both.
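As a concrete illustration, below is a minimal PyTorch sketch of such a discrete sequence autoencoder, assuming GRU encoder/decoder networks; the class name `SeqAutoencoder` and all hyperparameters are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)  # logits over the vocabulary

    def encode(self, x):
        # enc_phi(x): the final hidden state serves as the continuous code z
        _, h = self.encoder(self.embed(x))
        return h.squeeze(0)

    def decode(self, z, x_in):
        # Teacher-forced decoding conditioned on z; returns per-step logits
        out, _ = self.decoder(self.embed(x_in), z.unsqueeze(0))
        return self.out(out)

model = SeqAutoencoder(vocab_size=10000)
x = torch.randint(0, 10000, (32, 20))        # a toy batch of token sequences
z = model.encode(x)
logits = model.decode(z, x[:, :-1])          # predict each next token
# L_rec = -log p_psi(x | enc_phi(x)) as a token-level cross-entropy
loss_rec = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1))
x_hat = logits.argmax(-1)                    # greedy argmax decoding
```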
Generative Adversarial Networks
WGAN:
$$\min_{\theta}\max_{w\in\mathcal{W}} E_{z\sim P_r}[f_w(z)] - E_{\tilde{z}\sim P_z}[f_w(\tilde{z})]$$
with weight clipping $w \in [-\epsilon, \epsilon]$ to enforce the Lipschitz constraint on the critic $f_w$.
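A hedged sketch of one critic update under this objective follows; the function name `critic_step`, the toy critic architecture, and the clipping value are illustrative assumptions.

```python
import torch
import torch.nn as nn

def critic_step(f_w, opt_w, z_real, z_fake, eps=0.01):
    # Maximize E[f_w(z)] - E[f_w(z~)] by minimizing its negation
    loss = -(f_w(z_real).mean() - f_w(z_fake).mean())
    opt_w.zero_grad()
    loss.backward()
    opt_w.step()
    # Weight clipping keeps f_w (roughly) Lipschitz: w in [-eps, eps]
    with torch.no_grad():
        for p in f_w.parameters():
            p.clamp_(-eps, eps)

# Usage with a toy critic over 256-dimensional inputs
f_w = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
opt_w = torch.optim.SGD(f_w.parameters(), lr=1e-4)
critic_step(f_w, opt_w, torch.randn(32, 256), torch.randn(32, 256))
```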
Adversarially Regularized Autoencoder
ARAE combines a discrete autoencoder with a GAN-regularized latent representation; the model (depicted in a figure in the paper) learns the discrete-space distribution $P_{\psi}$. Intuitively, this approach uses a more flexible prior distribution to provide a smoother discrete code space. The model consists of a discrete autoencoder regularized with a prior distribution:
$$\min_{\phi,\psi} L_{rec}(\phi,\psi) + \lambda^{(1)} W(P_Q, P_z)$$
where $W$ is the Wasserstein distance between $P_Q$, the distribution over the discrete code space (i.e., the distribution of $enc_{\phi}(x)$ after encoding $x$), and $P_z$. Training the model amounts to solving the following objectives:
(1) $$\min_{\phi,\psi} L_{rec}(\phi,\psi) = E_{x\sim P_r}[-\log p_{\psi}(x \mid enc_{\phi}(x))]$$
(2) $$\max_{w\in\mathcal{W}} L_{cri}(w) = E_{x\sim P_r}[f_w(enc_{\phi}(x))] - E_{\hat{z}\sim P_z}[f_w(\hat{z})]$$
(3) $$\min_{\phi} L_{enc}(\phi) = E_{x\sim P_r}[f_w(enc_{\phi}(x))] - E_{\hat{z}\sim P_z}[f_w(\hat{z})]$$
Objective (1) minimizes the encoder/decoder reconstruction error, (2) trains the critic, and (3) trains the encoder/generator adversarially.
Empirically, we find that the prior $P_z$ strongly affects the results. The simplest choice is a fixed Gaussian $\mathcal{N}(0, I)$, but such a rigid constraint easily causes the model to collapse. Instead of fixing $P_z$, a generator is trained to learn a mapping from the Gaussian $\mathcal{N}(0, I)$ to $P_z$.
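A minimal sketch of such a learned prior, assuming an MLP generator $g_{\theta}$; the 100-dimensional noise and 300-unit hidden layers are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# g_theta: an MLP mapping noise s ~ N(0, I) to codes z_hat ~ P_z
g_theta = nn.Sequential(
    nn.Linear(100, 300), nn.ReLU(),
    nn.Linear(300, 300), nn.ReLU(),
    nn.Linear(300, 256),   # output dimension matches the code z
)
s = torch.randn(32, 100)   # s ~ N(0, I)
z_hat = g_theta(s)         # a batch of samples from the learned prior P_z
```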
Algorithm 1 ARAE Training
for each training iteration do
(1) Train the encoder/decoder for reconstruction $(\phi, \psi)$
Sample $\{x^{(i)}\}_{i=1}^m \sim P_r$ and compute $z^{(i)} = enc_{\phi}(x^{(i)})$
Backprop loss $L_{rec} = -\frac{1}{m}\sum_{i=1}^m \log p_{\psi}(x^{(i)} \mid z^{(i)})$
(2) Train the critic $(w)$
Sample $\{x^{(i)}\}_{i=1}^m \sim P_r$ and $\{s^{(i)}\}_{i=1}^m \sim \mathcal{N}(0, I)$
Compute $z^{(i)} = enc_{\phi}(x^{(i)})$ and $\hat{z}^{(i)} = g_{\theta}(s^{(i)})$
Backprop loss $-\frac{1}{m}\sum_{i=1}^m f_w(z^{(i)}) + \frac{1}{m}\sum_{i=1}^m f_w(\hat{z}^{(i)})$
Clip critic $w$ to $[-\epsilon, \epsilon]$
(3) Train the encoder/generator adversarially $(\phi, \theta)$
Sample $\{x^{(i)}\}_{i=1}^m \sim P_r$ and $\{s^{(i)}\}_{i=1}^m \sim \mathcal{N}(0, I)$
Compute $z^{(i)} = enc_{\phi}(x^{(i)})$ and $\hat{z}^{(i)} = g_{\theta}(s^{(i)})$
Backprop loss $\frac{1}{m}\sum_{i=1}^m f_w(z^{(i)}) - \frac{1}{m}\sum_{i=1}^m f_w(\hat{z}^{(i)})$
end for
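Putting the three steps together, here is a hedged PyTorch sketch of one iteration of Algorithm 1. It assumes `ae` exposes `encode`/`decode` as in the autoencoder sketch above, `f_w` is the critic, `g` is the prior generator, and the optimizer arguments and default values are illustrative.

```python
import torch
import torch.nn.functional as F

def arae_iteration(ae, f_w, g, opt_ae, opt_w, opt_adv, x,
                   noise_dim=100, eps=0.01):
    # (1) Train the encoder/decoder for reconstruction (phi, psi)
    logits = ae.decode(ae.encode(x), x[:, :-1])            # teacher forcing
    loss_rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               x[:, 1:].reshape(-1))
    opt_ae.zero_grad(); loss_rec.backward(); opt_ae.step()

    # (2) Train the critic (w): maximize f_w(z) - f_w(z_hat)
    z = ae.encode(x).detach()                              # real codes
    z_hat = g(torch.randn(x.size(0), noise_dim)).detach()  # generated codes
    loss_cri = -(f_w(z).mean() - f_w(z_hat).mean())
    opt_w.zero_grad(); loss_cri.backward(); opt_w.step()
    with torch.no_grad():                                  # clip w to [-eps, eps]
        for p in f_w.parameters():
            p.clamp_(-eps, eps)

    # (3) Train the encoder/generator adversarially (phi, theta)
    z = ae.encode(x)
    z_hat = g(torch.randn(x.size(0), noise_dim))
    loss_adv = f_w(z).mean() - f_w(z_hat).mean()
    opt_adv.zero_grad(); loss_adv.backward(); opt_adv.step()
```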
Extension: Unaligned Transfer
For the transfer setting, the decoder is conditioned on an additional attribute variable, becoming $p_{\psi}(x \mid z, y)$ (I did not fully understand this part; I will revisit the code to see whether it becomes clearer), and the optimization also accounts for the classification loss:
$$\min_{\phi,\psi} L_{rec}(\phi,\psi) + \lambda^{(1)} W(P_Q, P_z) - \lambda^{(2)} L_{class}(\phi, u)$$
This paper sets $\lambda^{(2)} = 1$ and adds two steps to training: (2b) train the classifier, and (3b) train the encoder adversarially against the classifier.
Algorithm 2 ARAE Transfer Extension
Each loop additionally:
(2b) Train the attribute classifier $(u)$
Sample $\{x^{(i)}\}_{i=1}^m \sim P_r$, look up $y^{(i)}$, and compute $z^{(i)} = enc_{\phi}(x^{(i)})$
Backprop loss $-\frac{1}{m}\sum_{i=1}^m \log p_u(y^{(i)} \mid z^{(i)})$
(3b) Train the encoder adversarially $(\phi)$
Sample $\{x^{(i)}\}_{i=1}^m \sim P_r$, look up $y^{(i)}$, and compute $z^{(i)} = enc_{\phi}(x^{(i)})$
Backprop loss $-\frac{1}{m}\sum_{i=1}^m \log p_u(1 - y^{(i)} \mid z^{(i)})$
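A hedged sketch of these two extra steps, assuming a binary attribute label $y \in \{0, 1\}$ and a logistic classifier `clf_u` over codes; the function and variable names are illustrative, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def transfer_steps(ae, clf_u, opt_u, opt_enc, x, y):
    # (2b) Train the attribute classifier u on codes z = enc_phi(x)
    z = ae.encode(x).detach()
    loss_u = F.binary_cross_entropy_with_logits(clf_u(z).squeeze(1), y.float())
    opt_u.zero_grad(); loss_u.backward(); opt_u.step()

    # (3b) Train the encoder adversarially: maximize p_u(1 - y | z),
    # i.e. push the codes to fool the attribute classifier
    z = ae.encode(x)
    loss_enc = F.binary_cross_entropy_with_logits(clf_u(z).squeeze(1),
                                                  1.0 - y.float())
    opt_enc.zero_grad(); loss_enc.backward(); opt_enc.step()
```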
Theoretical Properties
In a standard GAN, we implicitly minimize the divergence between the real distribution and the model distribution. In this setting, my understanding is that the divergence is implicitly minimized between the real and model distributions in the embedding space, and also between the data distribution $P_r$ and the latent-variable model $p_{\psi}(x) = \int_z p_{\psi}(x \mid z)\, p(z)\, dz$.
The heavily mathematical proofs are omitted here.
Experiments
Other Notes
Looking at the GitHub repo, the authors have since updated the WGAN method to WGAN-GP.