Tacotron-2-google-full-structure, and ideas that came up along the way

1. Where the speaker id and language id are injected: The synthesizer network uses the Tacotron 2 architecture [20], with additional inputs consisting of learned speaker (64-dim) and language embeddings (3-dim), concatenated and passed to the decoder at each step.

But I think they should be injected before the attention step instead.
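A minimal TF2/Keras sketch of this conditioning (the speaker/language counts are hypothetical placeholders). The paper concatenates the embeddings to the decoder input at each step; injecting before attention, as suggested above, would mean concatenating them to the encoder outputs instead:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Learned speaker (64-dim) and language (3-dim) embeddings, per the quote.
# num_speakers / num_languages are hypothetical placeholders.
num_speakers, num_languages = 100, 3
spk_emb = layers.Embedding(num_speakers, 64)
lang_emb = layers.Embedding(num_languages, 3)

def condition_decoder_input(decoder_in, speaker_id, language_id):
    """Concatenate both embeddings onto the decoder input for one step."""
    s = spk_emb(speaker_id)    # (batch, 64)
    l = lang_emb(language_id)  # (batch, 3)
    return tf.concat([decoder_in, s, l], axis=-1)
```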

2. The speaker classifiers are fully-connected networks with one 256-unit hidden layer followed by a softmax predicting the speaker identity. The synthesizer and speaker classifier are trained with weights 1.0 and 0.02 respectively. As described in the previous section, we apply gradient clipping with factor 0.5 to the gradient reversal layer.
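A sketch of that classifier, assuming a pooled encoder feature as input; the hidden activation is my assumption (the paper does not state one), and the 1.0 / 0.02 weights apply when the two losses are combined:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Speaker classifier: one 256-unit hidden layer, then a softmax over speakers.
# The relu activation is an assumption; the paper does not specify it.
def build_speaker_classifier(feature_dim, num_speakers):
    inp = layers.Input(shape=(feature_dim,))
    h = layers.Dense(256, activation="relu")(inp)
    out = layers.Dense(num_speakers, activation="softmax")(h)
    return tf.keras.Model(inp, out)

# Combined objective (sketch): total = 1.0 * synth_loss + 0.02 * speaker_loss
```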

3. Training: The entire model is trained jointly with a batch size of 256, using the Adam optimizer configured with an initial learning rate of 10^-3 and an exponential decay that halves the learning rate every 12.5k steps, starting at 50k steps. In practice, just tune this following joee's settings.
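A sketch of that schedule in TF2/Keras: hold Adam at 1e-3 for the first 50k steps, then halve the rate every 12.5k steps:

```python
import tensorflow as tf

# Exponential decay that halves the LR every 12.5k steps, starting at 50k.
class DelayedHalving(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, initial_lr=1e-3, start_step=50_000, half_every=12_500):
        self.initial_lr = initial_lr
        self.start_step = start_step
        self.half_every = half_every

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        decayed = self.initial_lr * tf.pow(
            0.5, (step - self.start_step) / self.half_every)
        # before start_step the exponent is negative, so cap at initial_lr
        return tf.minimum(self.initial_lr, decayed)

optimizer = tf.keras.optimizers.Adam(learning_rate=DelayedHalving())
```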

4. The simplest way to strip information out: take x => compress it to a low-dimensional bottleneck => add the label of the stripped information back in => reconstruct x. Choosing the bottleneck dimension is the key design constraint; make it dynamic and let the network train it itself.
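A rough sketch of idea 4 (all sizes hypothetical): the bottleneck is too narrow to carry the speaker label, which is re-injected before reconstruction, so the bottleneck only needs to keep what the label cannot supply:

```python
import tensorflow as tf
from tensorflow.keras import layers

# x -> narrow bottleneck -> concat speaker label back in -> reconstruct x.
feat_dim, bottleneck_dim = 512, 16      # hypothetical sizes
num_speakers, spk_dim = 100, 64

x_in = layers.Input(shape=(None, feat_dim))
spk_in = layers.Input(shape=(), dtype="int32")

z = layers.Dense(bottleneck_dim, activation="tanh")(x_in)  # compress
spk = layers.Embedding(num_speakers, spk_dim)(spk_in)      # label embedding

def tile_over_time(args):
    s, seq = args  # broadcast the (batch, dim) label along the time axis
    return tf.tile(s[:, None, :], [1, tf.shape(seq)[1], 1])

spk_t = layers.Lambda(tile_over_time)([spk, z])
x_rec = layers.Dense(feat_dim)(layers.Concatenate()([z, spk_t]))

autoencoder = tf.keras.Model([x_in, spk_in], x_rec)
autoencoder.compile(optimizer="adam", loss="mse")
```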

5. As in that Google paper, neither compressing the whole sequence nor adding the speaker id element-wise to blur it is thorough; both help only a little, because information may be lost during compression, or the recognition network may simply not put much weight on the speaker-id information, so it still does not work. (That is without considering a VAE.) Instead, back-propagate gradients from both ends, one branch carrying text + speaker id and one carrying the negated speaker-id gradient, and see whether they can be made to converge; once they do, the representation carries no speaker id (see the gradient-reversal sketch under point 8 below). Note this prevents the label from leaking extra information back through the gradients; it is not a case where the input itself already contains the information to be removed, so you cannot just pull out the id-recognition part of [28], the ICASSP 2019 paper Google cites. In fact this is much the same as idea 4, but alternating 5 and 4 repeatedly works best, combined with switch-language-id.

6. Data design: have multiple speakers per language in the dataset, so the text naturally carries no speaker-id information; and multiple languages per speaker, to prevent interference from language information hidden in the speaker id.

7. In particular, consider the language information contained in the speaker-id embedding vector. It looks like "accent", but it also interferes with synthesis, so the speaker-id embedding should be trained adversarially against the language id as well. More generally, think about making text, the speaker-id embedding, the language-id embedding, and prosody each adversarial against speaker id, language id, and text (id); that takes either a large amount of data or a structure designed for it.

8. Start writing the code:

1. Gradient Reversal Layer for Keras

Add an example implementing the 'Domain-Adversarial Training of Neural Networks' paper (https://arxiv.org/abs/1505.07818)

This allows domain adaptation in an unsupervised manner by forcing the net to learn features that are domain invariant between training and target domains using the concept of gradient reversal.
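A minimal TF2/Keras sketch of such a gradient reversal layer: identity in the forward pass, sign-flipped gradient in the backward pass. Reading "gradient clipping with factor 0.5" from point 2 as clip-by-norm on the reversed gradient is my assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

def grad_reverse(x, clip_factor=0.5):
    @tf.custom_gradient
    def _inner(x):
        def grad(dy):
            # reverse the gradient, then clip its norm (assumed reading
            # of the paper's 'clipping with factor 0.5')
            return tf.clip_by_norm(-dy, clip_factor)
        return tf.identity(x), grad  # forward pass is the identity
    return _inner(x)

class GradientReversal(layers.Layer):
    def __init__(self, clip_factor=0.5, **kwargs):
        super().__init__(**kwargs)
        self.clip_factor = clip_factor

    def call(self, x):
        return grad_reverse(x, self.clip_factor)
```

The speaker classifier from point 2 sits behind this layer: the classifier still learns to recover the speaker, while the reversed gradient pushes the shared encoder toward speaker-invariant features.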

2. Prosody | This module should not be standalone but dynamically detachable.

It quickly fits the value roughly, but not the digits after the decimal point. At that point, magnify the residual and fit only that remainder; in other words, apply a residual network repeatedly, recursively (a toy sketch follows).
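A toy numpy illustration of the recursion, under the assumption that the model can only resolve one decimal digit (mimicking a network that fits the rough value but not the fine digits); the scale factor 10 is arbitrary:

```python
import numpy as np

def coarse_fit(x, y):
    """A 'model' that memorizes y on x, but only to one decimal place."""
    xi, yi = x.copy(), np.round(y, 1)
    return lambda q: yi[np.abs(q[:, None] - xi[None, :]).argmin(axis=1)]

def residual_refine(x, y, stages=3, scale=10.0):
    models, target = [], y.astype(float)
    for _ in range(stages):
        m = coarse_fit(x, target)
        models.append(m)
        target = (target - m(x)) * scale  # magnify what's left, fit only that
    return lambda q: sum(m(q) / scale**i for i, m in enumerate(models))

x = np.linspace(0, 1, 200)
y = np.sin(6 * x)
f = residual_refine(x, y)
print(np.max(np.abs(f(x) - y)))  # ~5e-4 after 3 stages vs ~5e-2 after 1
```

Each stage contributes roughly one more decimal digit of precision, which is exactly the "fit only the part after the decimal point" idea.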

From an earlier paper: Recent development of neural end-to-end TTS models [1, 2] enables control of both labelled and unlabelled speech attributes by conditioning synthesis on both text and learned attribute representations [3, 4, 5, 6, 7, 8].

For each encoder, a mel spectrogram is first passed through two convolutional layers, each containing 512 filters with shape 3 × 1. The output of these convolutional layers is then fed to a stack of two bidirectional LSTM layers with 256 cells in each direction. A mean pooling layer is used to summarize the LSTM outputs across time, followed by a linear projection layer to predict the posterior mean and log variance.
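A sketch of that encoder in TF2/Keras; the latent dimension and the conv activation are my assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Variational reference encoder as described above: 2 conv layers
# (512 filters, kernel 3), 2 BiLSTM layers (256 cells per direction),
# mean-pool over time, then linear projections to mean and log variance.
def variational_reference_encoder(mel, latent_dim=16):
    h = mel  # (batch, time, n_mels)
    for _ in range(2):
        h = layers.Conv1D(512, kernel_size=3, padding="same",
                          activation="relu")(h)
    for _ in range(2):
        h = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(h)
    h = layers.GlobalAveragePooling1D()(h)  # mean pooling across time
    mu = layers.Dense(latent_dim)(h)        # posterior mean
    logvar = layers.Dense(latent_dim)(h)    # posterior log variance
    return mu, logvar
```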

I guess there is a mismatch between your reference and training audio. On my demo page, I trained the model with Blizzard 2011 data. Before that, I randomly selected 500 sentences as a test set, which is used to provide the reference audio. So in your experiment the reference audio is from the Blizzard 2011 database, but your model was trained on LJSpeech data.

I'm sorry, but I'm an intern at Tencent now, so I don't have the pre-trained model anymore.

 

There is also the earlier 'uncover' paper, which works at the per-frame level rather than the sentence level; with some tuning it can also produce results. And now: predict the emotion of e-book text, and make that end-to-end with the speech synthesis.

https://github.com/syang1993/gst-tacotron

 

Training

  1. Download a dataset:

    The following are supported out of the box:

    We use the Blizzard 2013 dataset to test this repo (Google's paper used 147 hours of data read by the 2013 Blizzard Challenge speaker). That year's Challenge provided about 200 hours of unsegmented speech and 9,741 segmented waveforms; I did all the experiments on the 9,741 segmented waveforms, since it's hard for me to split the unsegmented data.

    You can use other datasets if you convert them to the right format. See more details about data pre-processing in keithito's TRAINING_DATA.md.

 Lessac Technologies, Inc. release of Voice Factory audiobook recordings for Blizzard 2013

https://github.com/Kyubyong/expressive_tacotron

https://github.com/cnlinxi/style-token_tacotron2

The process notes and images above have not been uploaded yet!

https://github.com/rishikksh20/vae_tacotron2  The results are not great.

https://github.com/yanggeng1995/vae_tacotron Start from this one! Thanks to the author for their reply, and for daring to take on the VAE!!!!!
