A simple Tacotron2-based Chinese-English mixed-language synthesis, covering code-switching and voice cloning, plus a deeper discussion of architecture design.

Previous discussion

33. Prosody evaluation is very important. https://zhuanlan.zhihu.com/p/43240701

34. I have reproduced Tacotron2 monolingual synthesis for both Chinese and English, and the audio quality meets expectations (ignoring inference time). Where should I go next? If I want expressive synthesis, is there any experience with reliable methods? Meanwhile, I will try mixed-language synthesis:

The simplest approach to expressive synthesis is a look-up table, though it needs annotations; going deeper you reach the VAE family, e.g. GMVAE, and 木神 should be more familiar with those. Mixed-lingual now looks doable as long as you have data, but once you go cross-speaker, the vocoder's influence may become very large.

Expressive: if annotations exist, it works like a speaker ID passed through a look-up table; I will look for papers/datasets and give it a try (a minimal sketch follows below);
A VAE (encoder) as a prosody encoder should also be tried, although when it comes to VAEs I.....;
Mixed-lingual/cross-lingual: if a bilingual single-speaker dataset exists, just train normally, with no speaker ID or language ID involved; I will check whether such a dataset exists (or treat LJSpeech and Biaobei (標貝) as one speaker). Code-switching proper still has details to sort out (e.g. the proportion of switches in the training data differs greatly from that in the test sentences);
Cross-lingual speakers, especially when one language has only a single speaker (but with very high-quality recordings): achieving voice cloning and code-switching there is genuinely hard, though a VAE may offer a way in. Also, I did not understand 師兄's remark that "the vocoder's influence may become very large"; does it refer to the network design on the decoder side?
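
A minimal PyTorch sketch of the look-up-table idea (all names and dimensions here are illustrative assumptions, not from any paper): the discrete style/emotion label is treated exactly like a speaker ID, mapped through an embedding table, and broadcast-concatenated onto the encoder outputs.

    import torch
    import torch.nn as nn

    class StyleLookupTable(nn.Module):
        # Hypothetical sketch: a discrete style label is handled exactly
        # like a speaker ID, via an embedding look-up table.
        def __init__(self, num_styles=4, style_dim=16):
            super().__init__()
            self.table = nn.Embedding(num_styles, style_dim)  # the look-up table

        def forward(self, encoder_outputs, style_id):
            # encoder_outputs: [B, T, D]; style_id: [B]
            style = self.table(style_id)                      # [B, style_dim]
            style = style.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
            return torch.cat([encoder_outputs, style], dim=-1)  # [B, T, D + style_dim]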

"可能vocoder的影響就會變得很大" 。我覺得是訓練wavenet啥的,跟說話人關係比較大

I should look through 王木師兄's papers on emphasis (重音).

For example, take a mixed-lingual sentence: "Amazon併購了Google" ("Amazon acquired Google").

In practice there are two speakers, one English and one Chinese, and you train a multi-speaker, multilingual model. At inference you specify the Chinese speaker's ID and synthesize; the English part's pronunciation is bound to be poor, and you then have to rely on the vocoder's robustness to patch it up.

That is, for English the mapping from context text => acoustic features will struggle, because during training the network never saw this speaker ID with English.

But the network will grind through it anyway (striking a balance), so the resulting acoustic features are not that clean, and the vocoder has to repair them.

Of course there are also approaches like sharing the phone set.

With a fully shared phone set, timbre transfer (unification) should actually work well.

But it throws away the language-specific character inside each utterance (prosody, accent, pronunciation).

All in all, a long road ahead.
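
As a toy illustration of the shared-phone-set idea above (the mapping below is hypothetical, not from any paper): similar-sounding pinyin and ARPAbet symbols are collapsed onto one inventory, so both languages index the same embedding rows.

    # Toy illustration: collapse similar Mandarin pinyin and English ARPAbet
    # symbols onto one shared phone inventory (the mapping is hypothetical).
    SHARED_PHONES = {
        "b": ["b", "B"],    # pinyin "b" ~ ARPAbet "B"
        "s": ["s", "S"],
        "a": ["a", "AA"],   # pinyin "a" ~ ARPAbet "AA"
        "i": ["i", "IY"],
    }
    # Invert to a lookup: language-specific symbol -> shared symbol.
    TO_SHARED = {sym: shared for shared, syms in SHARED_PHONES.items() for sym in syms}

    def share_phones(seq):
        """Map a per-language phone sequence onto the shared set (fall back to itself)."""
        return [TO_SHARED.get(p, p) for p in seq]

    print(share_phones(["B", "AA"]))  # English  -> ['b', 'a']
    print(share_phones(["b", "a"]))   # Mandarin -> ['b', 'a']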

I have actually designed a rather large network.

Step 1: implement a cross-language TTS as simply as possible, pushing it to the state of the art.

(finish by the 20th)

1. Start with the simplest text grouping: pool the data together, with no extra labels and no changes to the model structure.

  • LJSpeech-1.1 + Biaobei (標貝).
  • Try both a letter-sequence + pinyin version and an English-phoneme + initials/finals (聲韻母) version. Engineering work.
    • The phoneme version sounds more "modulated" and "graceful". First, investigate converting English text to phonetic symbols, using https://github.com/Kyubyong/g2p; the actual code is in the .py conversion script (see the sketch after this list).
    • Welcome Mr. Li Xiang to join in our human-computer speech interaction laboratory, we are a big family. How should the Chinese name be pronounced? Does "Mr." work?
    • When installing NLTK, use the Arch packages: sudo pacman -S python-nltk and sudo pacman -S nltk-data.
    • The "from preprocess to phoneme.py" script does the conversion (note: the acoustic features extracted for Chinese and English should have exactly the same dimensions).
    • But they do in fact differ. Chinese version: https://github.com/Joee1995/tacotron2-mandarin-griffin-lim/blob/master/hparams.py English version: https://github.com/Rayhane-mamah/Tacotron-2/blob/master/hparams.py
    • Hard to say which is better; decide after experiments, or check papers from Google and others.
    • Chinese phoneme generation lives in the earlier joee folder; English phoneme generation is in test_t2_wavenet/old_del/Mix_phoneme_G2P_demo.ipynb. Then paste the results into: text_list_for_synthesis.txt
  • Problem found: at 20k steps, Chinese synthesis is already good, yet English alone still breaks up mid-sentence, and in mixed Chinese-English input the trailing English produces no sound at all. Unclear why; slow convergence? python synthesize.py --model='Tacotron2-mix-phoneme' --output_dir='mix-phoneme-output_dir/' --text_list='text_list_for_synthesis.txt'
  • Basic triage: probably just wait until training reaches around 80k steps.
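
A minimal sketch of the English G2P step with the Kyubyong/g2p package referenced above (distributed on PyPI as g2p_en; it depends on NLTK, hence the install note). The actual mixed-language handling lives in the preprocessing script.

    from g2p_en import G2p  # pip install g2p_en  (https://github.com/Kyubyong/g2p)

    g2p = G2p()
    text = "Welcome Mr. Li Xiang to join in our human-computer speech interaction laboratory."
    phones = g2p(text)  # ARPAbet phones with stress marks, e.g. ['W', 'EH1', 'L', 'K', 'AH0', 'M', ...]
    print(" ".join(p for p in phones if p.strip()))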

 

2. Add a speaker ID (no language ID)

https://github.com/Rayhane-mamah/Tacotron-2

Related work:

[3] Learning pronunciation from a foreign language in speech synthesis networks

MIT. Revised 12 Apr 2019 (unpublished)

Implements a multilingual multi-speaker Tacotron2. For pre-training (followed by fine-tuning on the low-resource language data), using the high-resource and low-resource language data together is much better than using the high-resource language alone.

The paper studies learning pronunciation from a bilingual TTS model. Learned phoneme embedding vectors lie closer together if their pronunciations are similar across the languages. This generalizes to any low-resource language: we can always supplement a small dataset with a large one, whatever its language, to improve network convergence.

In fact, these embedding vectors act something like IPA, but IPA (1) has no stress labels and (2) has no mapping for minority languages. Here, speech in different languages comes from different speakers.

Datasets used: VCTK, CMU Arctic, LJSpeech, 2013 Blizzard, CSS10.

MODEL:

A sequence of phonemes is converted to phoneme embeddings and fed to the encoder as input. The phoneme embedding dictionaries of the individual languages are concatenated to form the full dictionary, so duplicate phonemes may appear when languages share the same phonemes. Note that the phoneme embeddings are normalized to have the same norm. To model multiple speakers' voices in a single TTS model, they adopt a Deep Voice 2 [4] style speaker embedding network: a one-hot speaker identity vector is converted to a 32-dimensional speaker embedding vector.
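
A minimal PyTorch sketch of that input pipeline (the 32-dimensional speaker embedding follows the paper; everything else, including unit norm as the "same norm" and the dictionary sizes, is an illustrative assumption):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultilingualFrontend(nn.Module):
        def __init__(self, phones_per_lang, embed_dim=512, num_speakers=10, speaker_dim=32):
            super().__init__()
            # Concatenate the per-language dictionaries: each language keeps its
            # own rows, even for phonemes it shares with another language.
            total_phones = sum(phones_per_lang)  # e.g. [70, 65] -> 135 (made-up sizes)
            self.phone_table = nn.Embedding(total_phones, embed_dim)
            self.speaker_table = nn.Embedding(num_speakers, speaker_dim)  # one-hot -> 32-d

        def forward(self, phone_ids, speaker_id):
            # Renormalize all phoneme embeddings to a common (here unit) norm.
            emb = F.normalize(self.phone_table(phone_ids), dim=-1)  # [B, T, embed_dim]
            spk = self.speaker_table(speaker_id)                    # [B, speaker_dim]
            return emb, spk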

The amount of speech data differs across speakers, and this imbalance may bias the TTS model. To cope with it, they divide the loss of each sample by the total number of training samples belonging to that sample's speaker. They empirically found that this adjustment to the loss function yields better synthesis quality.
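
A sketch of that reweighting, assuming a per-sample loss vector and a precomputed table of utterance counts per speaker (the final normalization is a detail the paper does not specify):

    import torch

    def speaker_balanced_loss(per_sample_loss, speaker_ids, counts):
        """Divide each sample's loss by the number of training samples its
        speaker contributes, so large-data speakers do not dominate.
        per_sample_loss: [B], speaker_ids: [B], counts: [num_speakers]."""
        weights = 1.0 / counts[speaker_ids].float()
        # Weighted mean; plain sum is another reasonable choice.
        return (per_sample_loss * weights).sum() / weights.sum()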

Also worth watching in this paper: the phoneme input scheme, and its use of pre-training.

Step 2: survey the current state of things, including code and ideas.

The most important thing is to define an evaluation function.
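
One common objective starting point is mel-cepstral distortion (MCD) between synthesized and reference audio; a minimal sketch assuming librosa and already-aligned, equal-length utterances (a real evaluation would align frames with DTW first):

    import numpy as np
    import librosa

    def mcd(ref_wav, syn_wav, sr=22050, n_mfcc=13):
        """Mel-cepstral distortion in dB; frames are truncated to equal length
        (a proper evaluation should align them with DTW first)."""
        ref = librosa.feature.mfcc(y=ref_wav, sr=sr, n_mfcc=n_mfcc).T
        syn = librosa.feature.mfcc(y=syn_wav, sr=sr, n_mfcc=n_mfcc).T
        n = min(len(ref), len(syn))
        diff = ref[:n, 1:] - syn[:n, 1:]  # drop c0 (energy)
        return (10.0 / np.log(10)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))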

Cross-lingual Voice Conversion

I wish I could speak many languages. Wait. Actually I do. But only 4 or 5 languages with limited proficiency. Instead, can I create a voice model that can copy any voice in any language? Possibly! A while ago, my colleague Dabi and I opened a simple voice conversion project. Based on it, I expanded the idea to cross-language conversion. I found it very challenging with my limited knowledge. Unfortunately, the results I have so far are not good, but hopefully this will be helpful for some people.

February 2018

Author: Kyubyong Park ([email protected])

Version: 1.0

VQ-VAE

This is a TensorFlow implementation of VQ-VAE speaker conversion introduced in Neural Discrete Representation Learning. Although the training curves look fine, the samples generated during training were bad. Unfortunately, I have no time to dig deeper into this, as I am tied up with other projects, so I publish this project for those interested in the paper or its implementation. If you succeed in training based on this repo, please share the good news.

Voice Conversion with Non-Parallel Data

What if you could imitate a famous celebrity's voice or sing like a famous singer? This project started with the goal of converting someone's voice to a specific target voice, so-called voice style transfer. We worked on converting a voice to that of the English actress Kate Winslet, implementing a deep neural network trained on more than 2 hours of audiobook sentences read by her.

tacotron2-vae

https://github.com/jinhan/tacotron2-vae

tacotron2-gst

(Expressive)

https://github.com/jinhan/tacotron2-gst

VAE Tacotron-2:

https://github.com/rishikksh20/vae_tacotron2

https://arxiv.org/pdf/1812.04342.pdf

TPSE Tacotron2

https://github.com/cnlinxi/tpse_tacotron2

paper: Predicting Expressive Speaking Style From Text in End-to-End Speech Synthesis

reference : Rayhane-mamah/Tacotron-2

Tacotron-2:

https://github.com/cnlinxi/tacotron2decoder

Pre-trains the decoder, so that less parallel corpus is needed.

Paper: Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

Reference: Rayhane-mamah/Tacotron-2

style-token Tacotron2

https://github.com/cnlinxi/style-token_tacotron2#style-token-tacotron2

paper: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

reference : Rayhane-mamah/Tacotron-2

HTS-Project

https://github.com/cnlinxi/HTS-Project

HTS demo project with blank data, especially extended for a Mandarin Chinese SPSS system.

speech_emotion

https://github.com/cnlinxi/speech_emotion#speech_emotion

Detect human emotion from audio.

Refers to some code from Speech_emotion_recognition_BLSTM; many thanks.

Create a 'logs' directory in the project dir; most per-step logs land there, and you can find the details in them. It also creates a log file in the project dir as before.

Tacotron-2:

https://github.com/karamarieliu/gst_tacotron2_wavenet

cvqvae

Conditional VQ-VAE, adapted from the VQ-VAE code provided by MishaLaskin [https://github.com/MishaLaskin/vqvae].

TensorFlow implementation of DeepMind's Tacotron-2: a deep neural network architecture described in the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

Vector Quantized Variational Autoencoder

This is a PyTorch implementation of the vector quantized variational autoencoder (https://arxiv.org/abs/1711.00937).

You can find the author's original implementation in Tensorflow here with an example you can run in a Jupyter notebook.
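
The core of that paper is the vector-quantization bottleneck: each encoder vector snaps to its nearest codebook entry, with gradients passed straight through. A minimal PyTorch sketch (illustrative, not this repo's code):

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes=512, code_dim=64, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, code_dim)
            self.beta = beta  # commitment cost from the paper

        def forward(self, z_e):                          # z_e: [B, T, code_dim]
            flat = z_e.reshape(-1, z_e.size(-1))         # [B*T, code_dim]
            dist = torch.cdist(flat, self.codebook.weight)  # pairwise L2 distances
            idx = dist.argmin(dim=-1)                    # nearest code per vector
            z_q = self.codebook(idx).view_as(z_e)
            # Codebook loss + commitment loss, as in the paper.
            loss = ((z_q - z_e.detach()) ** 2).mean() \
                 + self.beta * ((z_e - z_q.detach()) ** 2).mean()
            z_q = z_e + (z_q - z_e).detach()             # straight-through estimator
            return z_q, idx.view(z_e.shape[:-1]), loss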

Tacotron2 for Swedish

https://github.com/ruclion/taco2swe

A modification of https://github.com/Rayhane-mamah/Tacotron-2 that is intended for use with the Swedish language.
