Reproducing the paper END-TO-END CODE-SWITCHED TTS WITH MIX OF MONOLINGUAL RECORDINGS: understanding the paper, the code, and the experimental results.

Show us the samples please? By the way, you had better change the mel loss function into MAE and watch the alignment again.
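For reference, the suggested swap is just L2 -> L1 on the mel outputs. A minimal TF1-style sketch (using the same tf.nn.*-era API as the notes below; mel_pred and mel_target are placeholder names, 80 mel bins assumed):

import tensorflow as tf

mel_pred = tf.placeholder(tf.float32, [None, None, 80])    # [batch, frames, n_mels], placeholder
mel_target = tf.placeholder(tf.float32, [None, None, 80])

mse_loss = tf.reduce_mean(tf.squared_difference(mel_pred, mel_target))  # the usual L2 mel loss
mae_loss = tf.reduce_mean(tf.abs(mel_pred - mel_target))                # MAE (L1), as suggested above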

These plots show that BahdanauMonotonic Attention is better.

What are the advantages of Location Sensitive Attention?

Maybe it is better to let the network learn without any monotonic pressure. However, https://arxiv.org/abs/1803.09047 claims to use GMM attention on Tacotron and obtains better results, especially for longer sequences.

 

do you have a change related to guided attention?

I am thinking of using phone duration information to generate the guided attention for training; right, it should only provide a "reference value" and need not be trusted completely. Design the network around that.
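A minimal NumPy sketch of that idea, in the spirit of the guided-attention loss; the function name, the width g, and the centre-of-phone construction are my own choices, and durations is assumed to hold per-phone durations in frames:

import numpy as np

def duration_guided_mask(durations, n_frames=None, g=0.2):
    # Penalty mask [n_phones, n_frames]: cells far from the duration-implied
    # path get weight near 1, cells on the path get weight near 0, so the
    # durations act only as a soft reference, not a hard constraint.
    durations = np.asarray(durations, dtype=np.float64)
    if n_frames is None:
        n_frames = int(durations.sum())
    ends = np.cumsum(durations)
    centers = (ends - durations / 2.0) / ends[-1]           # expected position of each phone, in [0, 1]
    t = np.arange(n_frames, dtype=np.float64) / n_frames    # decoder-frame positions, in [0, 1]
    return 1.0 - np.exp(-((centers[:, None] - t[None, :]) ** 2) / (2.0 * g ** 2))

# Loss term: (mask * alignment).mean(), with alignment shaped [n_phones, n_frames]
# (transpose first if your alignments come out as [frames, phones]).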

 

can you provide the code for the GMM attention? I cannot find a working version that gives good alignments anywhere.

I don't have it anymore either. I totally ditched it. You can pick that out from the "voice loop" repo.
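Since a working reference is hard to find, here is a minimal NumPy sketch of one step of Graves-style GMM attention (the mechanism the paper above builds on); every name here is mine, and the voice-loop code will differ in details:

import numpy as np

def gmm_attention_step(query_params, mu_prev, memory):
    # query_params: [3*K] raw outputs of a linear layer on the decoder state.
    # mu_prev:      [K] previous component means; they only ever move forward.
    # memory:       [T, D] encoder outputs.
    omega_hat, delta_hat, sigma_hat = np.split(query_params, 3)
    omega = np.exp(omega_hat)                  # mixture weights (unnormalised)
    delta = np.exp(delta_hat)                  # positive step size => monotonic drift
    sigma = np.exp(sigma_hat)                  # component widths
    mu = mu_prev + delta                       # means advance along the input
    pos = np.arange(memory.shape[0])[None, :]  # [1, T] encoder positions
    phi = omega[:, None] * np.exp(-((pos - mu[:, None]) ** 2) / (2.0 * sigma[:, None] ** 2))
    alpha = phi.sum(axis=0)
    alpha = alpha / (alpha.sum() + 1e-8)       # some variants skip this normalisation
    context = alpha @ memory                   # [D] context vector
    return context, alpha, mu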

FORWARD ATTENTION IN SEQUENCE-TO-SEQUENCE ACOUSTIC MODELING FOR SPEECH SYNTHESIS

https://github.com/geneing/WaveRNN-Pytorch   Fast WaveRNN

https://github.com/mozilla/TTS/blob/master/notebooks/Benchmark.ipynb

“Online and Linear-Time Attention by Enforcing Monotonic Alignments”

In machine learning, is there any work on adding a prior to the attention mechanism, or any special initialization method for it?

As the title asks: in some problems the attention follows a fairly obvious pattern, e.g. in machine translation some language pairs have essentially the same word order. In such cases, can we give the attention read/write head a suitable prior so that the network converges faster?

Answering my own question, because today I came across a paper that has been accepted at ICML 2017:

Online and Linear-Time Attention by Enforcing Monotonic Alignments

Search for this title; you should be able to find the corresponding architecture. (1)

The gist: use a coin flip to decide whether to keep moving forward, and pick only one encoder state as the context at each step, so the attention makes a single front-to-back pass over the input.
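A test-time sketch of that coin-flip procedure (training instead uses the expected value of this process, computed in closed form; names here are mine):

import numpy as np

def hard_monotonic_attend(energies, start_idx, rng=np.random):
    # energies:  [T] monotonic-attention energies for the current decoder step.
    # start_idx: where the previous step stopped; attention never moves back.
    for i in range(start_idx, len(energies)):
        p = 1.0 / (1.0 + np.exp(-energies[i]))  # sigmoid turns the energy into a coin bias
        if rng.random() < p:                    # flip the coin: stop and attend here?
            return i                            # this single encoder state is the context
    return len(energies) - 1                    # fell off the end: attend the last state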

Use an approximation first:

Attention comes in content-based and location-based flavours; I think location-based is very close to the prior you are describing.

Reference: http://papers.nips.cc/paper/58
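For concreteness, a NumPy sketch of the energy computation in location-sensitive attention (Chorowski et al.), where convolved features of the previous alignment play exactly the role of such a prior; all weight names are placeholders:

import numpy as np

def location_sensitive_align(query, memory, prev_align, W, V, U, conv_filters, v):
    # e_j = v^T tanh(W s_t + V h_j + U f_j), with f = conv(prev_align).
    # query: [dq], memory: [T, dh], prev_align: [T], conv_filters: [k, n_filters],
    # W: [dq, da], V: [dh, da], U: [n_filters, da], v: [da].
    k = conv_filters.shape[0]
    padded = np.pad(prev_align, (k // 2, k // 2))
    f = np.stack([padded[j:j + k] @ conv_filters for j in range(len(prev_align))])  # [T, n_filters]
    e = np.tanh(query @ W + memory @ V + f @ U) @ v   # [T] energies
    alpha = np.exp(e - e.max())
    return alpha / alpha.sum()                        # new alignment alpha_t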

Start writing the code: LDE

determined by the language boundary information in the CS text. 

performing discriminative code lookup: for the speaker id, implement an approximation first. Is there a way to differentiate the initialization or the lookup?

 

This design enables the generated speech to stay in a single speaker's voice. The language embedding and discriminative embedding are jointly learned with the model by back-propagation. This is also a good entry point.

 

The discriminative embedding is obtained by performing discriminative code lookup, and is concatenated with the previous time-step decoder output and context information before being sent to the decoder RNN. On this point the original paper differs from the common understanding: this version of the code runs the original Tacotron-2, not Microsoft's reading of Tacotron-2.
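A TF1-style sketch of that sentence as I read it; the table size, the dimensions, and all variable names are my own placeholders:

import tensorflow as tf

n_codes, code_dim = 2, 16   # e.g. one discriminative code per monolingual corpus
code_table = tf.get_variable("discriminative_codes", [n_codes, code_dim])

code_ids = tf.placeholder(tf.int32, [None])           # [batch] which code to look up
prev_output = tf.placeholder(tf.float32, [None, 80])  # previous time-step decoder output
context = tf.placeholder(tf.float32, [None, 512])     # attention context

disc_emb = tf.nn.embedding_lookup(code_table, code_ids)  # the "discriminative code lookup"
decoder_rnn_input = tf.concat([prev_output, context, disc_emb], axis=-1)
# decoder_rnn_input then feeds the decoder RNN cell; the table trains jointly
# with the rest of the model by back-propagation, as the paper states.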

 

https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn  The paper does not spell this out; I concatenated things in according to my own understanding, and in fact there was also a bug at init time. (2)
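For the record, the reading I implemented (a sketch; whether the concatenation really belongs at the encoder input is exactly the ambiguity noted above):

import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, None, 512])   # [batch, T, d] token embeddings
lang_emb = tf.placeholder(tf.float32, [None, None, 16])  # [batch, T, d] per-token language embedding
seq_len = tf.placeholder(tf.int32, [None])

encoder_in = tf.concat([inputs, lang_emb], axis=-1)      # concatenated per my own understanding
cell_fw = tf.nn.rnn_cell.LSTMCell(256)
cell_bw = tf.nn.rnn_cell.LSTMCell(256)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, encoder_in, sequence_length=seq_len, dtype=tf.float32)
encoder_outputs = tf.concat([out_fw, out_bw], axis=-1)   # [batch, T, 512]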

The difference between np.zeros() and a Python list kept causing errors: return array(a, dtype, copy=False, order=order) ValueError: setting an array element with a sequence.
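That ValueError is what NumPy raises when a ragged (unequal-length) list is forced into a single array; a small repro and the usual padded-buffer fix:

import numpy as np

rows = [[1, 2, 3], [4, 5]]               # rows of different lengths
# np.array(rows, dtype=np.float32)       # -> ValueError: setting an array element with a sequence.

max_len = max(len(r) for r in rows)
padded = np.zeros((len(rows), max_len), dtype=np.float32)  # allocate a rectangular buffer
for i, r in enumerate(rows):
    padded[i, :len(r)] = r               # copy each row in; the tail stays zero
print(padded)                            # [[1. 2. 3.] [4. 5. 0.]]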


It feels like a decoder step is missing!!! Don't change it for now; wait for the results, then change it. I can't make sense of it; maybe it isn't actually wrong. (3)

In the file Architecture_wrappers.py:

 

https://github.com/begeekmyfriend?tab=repositories  Dig into other people's work.

https://github.com/fatchord?tab=repositories  And his as well.

https://github.com/r9y9/gantts  Another path besides VAE.


Tacotron: Advanced attention module (e.g. Monotonic attention) #13

https://github.com/mozilla/TTS/issues/13

https://github.com/mozilla/TTS


Guided Attention Loss #346

https://github.com/Rayhane-mamah/Tacotron-2/issues/346


http://itjcc.com/1172/html  Cracked UltraEdit 26. I will definitely pay for it once I have a salary.

https://blog.csdn.net/xiliuhu/article/details/5757305  Getting multiple windows in UltraEdit.

Collect statistics on how often the attention is monotonic vs. non-monotonic when left unconstrained, and only then add the monotonicity requirement. These are really two separate paths, and both are defensible; in the meantime, use si-monotonicity to guide it rather than change it. The counting step is sketched below.
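A minimal sketch of that statistic over a batch of alignment matrices; the argmax-path criterion and the tolerance are my own working definition of "monotonic":

import numpy as np

def is_monotonic(alignment, tolerance=0):
    # alignment: [decoder_steps, encoder_steps]. Monotonic if the argmax
    # position never moves backwards by more than `tolerance` encoder steps.
    path = alignment.argmax(axis=-1)
    return bool(np.all(np.diff(path) >= -tolerance))

def monotonicity_stats(alignments):
    flags = [is_monotonic(a) for a in alignments]
    return sum(flags), len(flags) - sum(flags)   # (#monotonic, #non-monotonic)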

 

Build the training datasets and scripts based on LJSpeech-1.1 and Biaobei (標貝):

1. grapheme
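A hypothetical merging script for the grapheme-level data; the paths, the column layout, and especially the Biaobei metadata format are assumptions:

import os

def collect(metadata_path, wav_dir, lang, sep='|', text_col=2):
    # LJSpeech's metadata.csv is 'id|raw text|normalized text'; Biaobei is
    # assumed to have been converted to the same layout beforehand.
    items = []
    with open(metadata_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split(sep)
            wav = os.path.join(wav_dir, parts[0] + '.wav')
            items.append((wav, parts[text_col], lang))
    return items

rows = (collect('LJSpeech-1.1/metadata.csv', 'LJSpeech-1.1/wavs', 'en') +
        collect('BZNSYP/metadata.csv', 'BZNSYP/wavs', 'zh'))
with open('train.txt', 'w', encoding='utf-8') as f:
    for wav, text, lang in rows:
        f.write('%s|%s|%s\n' % (wav, text, lang))   # one line per utterance, with a language tag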


Whether to rescale audio prior to preprocessing. I can't figure this parameter out.

rescale = False, #Whether to rescale audio prior to preprocessing

 

#M-AILABS (and other datasets) trim params
    trim_fft_size = 512,
    trim_hop_size = 128,
    trim_top_db = 60,

I don't understand these either.
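As far as I can tell, these hparams map onto a peak-normalisation step and librosa.effects.trim; a sketch of that understanding (rescaling_max and the file name are placeholders):

import librosa
import numpy as np

wav, sr = librosa.load('sample.wav', sr=22050)

# rescale: normalise the peak amplitude before any other preprocessing.
rescaling_max = 0.999
wav = wav / np.abs(wav).max() * rescaling_max

# trim params: strip leading/trailing silence; anything more than
# trim_top_db below the peak counts as silence, analysed with an FFT
# window of trim_fft_size and hop of trim_hop_size.
wav_trimmed, _ = librosa.effects.trim(wav, top_db=60, frame_length=512, hop_length=128)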

How to use sox:

https://blog.csdn.net/centnetHY/article/details/88571352

Batch_Size = 32 => 16, because of insufficient memory.

watch -n 10 nvidia-smi

As for SPE, the code is easy to write:

All that's left is organizing the data and the experimental results, and putting up a web demo.
