Voice Conversion 項目筆記（含從VCC 2016匿名比賽深挖的各前沿方法性能對比）

voice conversion 基本架構：

voice conversion 任務主要由兩個步驟構成，特徵提取與特徵參數轉換，對於這兩個步驟，都有相應的常用的技術，這兩個步驟中常用的技術各種排列組合，就產生了衆多VC系統，以下做小彙總。

STEP1：Feature extraction			STEP2：Feature conversion
Feature	Extraction toolkits	description
LSF（Line Spectral Frequency）	STRAIGHT	STRAIGHT is an MATLAB Lib design for VC.	GMM/JDGMM
MGC（mel-generalized coeffi- cient ）	STRAIGHT	STRAIGHT is an MATLAB Lib design for VC.	DNN
LF0(log f0)	AHOCODER	Ahocoder parameterizes speech waveforms into three different streams	RNN(BLSTM)
MCP(mel-spectgram)	AHOCODER		seq2seq/(with Attention)
MVF（maximum voiced frequency）	AHOCODER		GRU
	DBN		VQ
	Mixture of Factor Analyzer
parameter generation algorithm with global variance	HTS	HMM-based Speech Synthesis System

其中,LF0爲語音基頻log變換，爲主要的語音轉換參數，是表徵不同人，不同性別的最重要參數之一。

其他參數爲語音的高階分量，控制合成語音的細節。

衡量標準：

常用的兩種：基於主觀評分的MOS和基於Mel譜失真的MCD

MOS（Mean Opinion Score）主觀評分分值1-5 由受試者主觀打分一般有GMM方法的MOS分作爲beseline，對於不同的測試可以考慮把GMM的分數對齊從而在不同的MOS測試中統一基準

打分基準：

4－5分優秀(excelent) 很好，聽的清楚，延遲很小，交流流暢

3－4分良好(good) 稍差，聽的清楚，延遲小，交流欠缺順暢，有點雜音

2－3分一般(fair) 還可以，聽不太清，有一定延遲，可以交流

1－2分差(poor) 勉強，聽不太清，延遲較大，交流重複多次1分以下很差(bad) 極差，聽不懂，延遲大，交流不通暢

MCD（Mel-cepstral distortion）[dB] 越低越好

例：

seq2seq with LSTM:

Variational Auto-encoder:

Deep sequence-to-sequence Attention Model :

VCC2016:

各組綜合分數對比：

各組方案對比（紅圈爲MOS>2.5 similarity>60%的方案，紫色橫線表示不會公開方法的組別）

其中G，L,O 組

很多都採用了STRAIGHT工具包進行MGC特徵的提取

轉換方法上，常採用GMM或者LSTM方法

開源的組轉換結果樣例：

效果較好的：

略遜的：

各組的報告，包括使用的參數，工具包，及其方法：

各組來源：

現在可用的轉換方法：

用不同的神經網絡對不同參數進行轉換

用AHOcoder進行特徵提取：

LF0：LSTM

MVF：DNN

MCP：GRU

轉換結果樣例：（基於VCC2016 數據）

基於LSF LF0 UV的方法：

轉換結果樣例：

還在下載

基於GMM的轉換：

利用SPTK提取MCEP參數，對MCEP參數用24階GMM進行轉換：

轉換結果樣例：

各機構的Demo：

Voice Conversion 方法，鏈接如下：

LyreBird : https://lyrebird.ai/demo

微軟：DNN VC
香港科技大學：BLSTM VC
印度OHSU：Joint AE VC
日本東京大學：GMM VC
法國tut：基於DKPL迴歸
Voice Morphing：Voice Morphing
Adobe VOCO：http://ieeexplore.ieee.org/abstract/document/7472761/

TTS方法，鏈接如下：

ModelTalker ：(採集語音以進行語音合成):https://www.modeltalker.org/
日本Kobayashi 實驗室：Speaker-Independent HMM-Based Voice Conversion
愛丁堡大學：Listening test materials for “A study of speaker adaptation for DNN-based speech synthesis”
TOKUDA and NANKAKU LABORATORY
Princeton VOCO：http://gfx.cs.princeton.edu/pubs/Jin_2017_VTI/【5月11日發表】【10分鐘音頻內尋找最接近音素】

建議：

在VCC2016比賽中，效果較好的組在語音特徵提取的環節都採用了STRAIGHT工具包，應對此進行進一步的探索

未開源及展示論文的方案可能由於經過cherry-picked的調參，復現可能有一定困難

開源的方案主要是都以LSTM作爲主要模型，用於轉換LF0，MCEP等特徵

現有的轉換方法中，以基於GMM的MCEP特徵轉換作爲基準，利用LSTM轉換LF0的這種方法性能較好，其MOS評分應在2-3分左右，普分辨率較低

可以用STRAIGHT替換AHOcoder進行特徵提取，並使用LSTM進行LF0和MCEP的轉換（嘗試研究G組的方案）。

參考論文及其性能展示：

常用傳統轉換技術對比（不包括各類神經網絡方法）：

https://arxiv.org/pdf/1612.07523.pdf

IEEE（只看了一下，包括其中G組的方案）:

Voice conversion using deep neural networks with speaker-independent pre-training

http://ieeexplore.ieee.org/document/7078543/

Dictionary update for NMF-based voice conversion using an encoder-decoder network

http://ieeexplore.ieee.org/document/7918382/

Text-independent voice conversion using deep neural network based phonetic level features

http://ieeexplore.ieee.org/document/7918382/

F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential

http://ieeexplore.ieee.org/document/7846338/

Enhancing a glossectomy patient's speech via GMM-based voice conversion

http://ieeexplore.ieee.org/document/7820909/

Deep neural network based voice conversion with a large synthesized parallel corpus

http://ieeexplore.ieee.org/document/7820716/

SEQ2SEQ類:

https://arxiv.org/pdf/1704.02360.pdf

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks

https://arxiv.org/pdf/1704.00849.pdf

VAE：

Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder

https://arxiv.org/pdf/1610.04019.pdf

DNN：

4 Proposed Framework: Deep sequence-to-sequence Attention Model

尋找對應組：

I組：NII National Institute of Informatics：

http://cs.joensuu.fi/pages/tkinnu/webpage/pdf/i-vector-VC_icassp2017.pdf

NON-PARALLEL VOICE CONVERSION USING I-VECTOR PLDA: TOWARDS UNIFYING SPEAKER VERIFICATION AND TRANSFORMATION

L組：西北工業大學，謝磊：

http://www.nwpu-aslp.org/lxie/papers/ssw9_PS1-4_Huang.pdf

An Automatic Voice Conversion Evaluation Strategy Based on Perceptual Background Noise Distortion and Speaker Similarity

J組：NU-NAIST：

F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential

http://ieeexplore.ieee.org/abstract/document/7846338/

C組：UTSC-NELLSILP：（jun du）

https://www.researchgate.net/publication/307889222_The_USTC_System_for_Voice_Conversion_Challenge_2016_Neural_Network_Based_Approaches_for_Spectrum_Aperiodicity_and_F0_Conversion

未知：University College London voice conversion challenge：

Hulk組【G組】：Phone-aware LSTM-RNN for voice conversion

http://ieeexplore.ieee.org/abstract/document/7877819/

B組：Zhizheng WU：

CSTR The Centre for Speech Technology Research, The University of Edinburgh, UK

http://ieeexplore.ieee.org/abstract/document/7472761/

https://link.springer.com/article/10.1007/s11042-014-2180-2

F組：HCCL-CUHK Human-Computer Communications Laboratory

The Chinese University of Hong Kong, Hong Kong

http://www1.se.cuhk.edu.hk/~lfsun/ICME2016_Lifa_Sun.pdf

日本人的復現：

https://github.com/sesenosannko/ppg_vc

【一個用DNN的組（O（也有可能是C））】VoiceKontrol Center for Spoken Language Understanding (CSLU), Oregon Health & Science University, Portland, OR, USA

S.H. Mohammadi, A. Kain, Semi-supervised Training of a Voice Conversion Mapping Function using Joint-Autoencoder, Interspeech (To Appear), 2015.

S.H. Mohammadi, A. Kain, Voice Conversion Using Deep Neural Networks With Speaker-Independent Pre-Training, 2014 IEEE Spoken Language Technology Workshop (SLT), 2014.

復現：

https://github.com/shamidreza/dnnmapper

還沒找到的：

CASIA-NLPR-Taogroup National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Team Initiator Tsinghua University, Beijing, China

AST Academia Sinica, Taipei, Taiwan

https://arxiv.org/pdf/1610.04019.pdf

https://github.com/JeremyCCHsu/vae4vc

github:

RNN:

https://github.com/Pooja-Donekal/Voice-Conversion

LSTM:(還不錯)（F組）