
Far-Field End-to-End Text-Dependent Speaker Verification based on Mixed Training Data with Transfer Learning and Enrollment Data Augmentation
Xiaoyi Qin1,2, Danwei Cai1, Ming Li1
1Data Science Research Center, Duke Kunshan University, Kunshan, China
2School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China

https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1542.pdf

Abstract

In this paper, we focus on the far-field end-to-end text-dependent speaker verification task, with a small-scale far-field text-dependent dataset and a large-scale close-talking text-independent database for training. First, we show that simulating far-field text-independent data from the existing large-scale clean database for data augmentation can reduce the mismatch. Second, using a small far-field text-dependent dataset to fine-tune the deep speaker embedding model pre-trained on the simulated far-field as well as the original clean text-independent data can significantly improve system performance. Third, in the special application where close-talking clean utterances are used for enrollment and real far-field noisy utterances are employed for testing, adding reverberation and noise to the clean enrollment data can further enhance system performance. We evaluate our methods on the AISHELL ASR0009 and AISHELL 2019B-eval databases and achieve an equal error rate (EER) of 5.75% for far-field text-dependent speaker verification under noisy environments.
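The far-field simulation in the first step can be illustrated with a short sketch: clean speech is convolved with a room impulse response (RIR) and mixed with background noise at a target signal-to-noise ratio. The following is a minimal Python illustration, not the authors' exact pipeline; the function name and the pre-loaded `rir` and `noise` arrays are assumptions.

```python
# A minimal sketch (not the paper's exact pipeline) of far-field simulation:
# convolve clean speech with a room impulse response (RIR), then add
# background noise at a target SNR. `rir` and `noise` are assumed to be
# 1-D numpy arrays loaded elsewhere.
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(clean, rir, noise, snr_db):
    """Reverberate `clean` with `rir` and mix in `noise` at `snr_db` dB."""
    reverbed = fftconvolve(clean, rir)[: len(clean)]
    noise = noise[: len(reverbed)]
    # Scale the noise so the speech-to-noise power ratio equals snr_db.
    speech_power = np.mean(reverbed ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverbed + scale * noise
```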

1. Introduction

In the past decade, the performance of automatic speaker verification (ASV) has improved dramatically. The i-vector based method [1] and deep neural network (DNN) based methods [2, 3] have been widely used in telephone-channel and close-talking scenarios. Recently, smartphones and virtual assistants have become very popular, and people use pre-defined words to wake up these systems. To enhance the security level and provide personalized services, wake-up-word based text-dependent speaker verification is adopted to determine whether the wake-up speech is indeed uttered by the claimed speaker [4, 5, 6]. However, in many Internet of Things (IoT) applications, e.g., smart speakers and smart home devices, text-dependent speaker verification under far-field and complex environmental settings is still challenging due to the effects of room reverberation and various kinds of noises and distortions.

To reduce the effects of room reverberation and environmental noise, various approaches with a single-channel microphone or a multi-channel microphone array have been proposed at different levels of the text-independent ASV system. At the signal level, the linear prediction inverse modulation transfer function [7] and weighted prediction error (WPE) [8, 9] methods have been used for dereverberation. DNN based denoising methods for single-channel speech enhancement [10, 11, 12, 13] and beamforming for multi-channel speech enhancement [8, 14, 15] have also been explored for ASV systems under complex environments. At the feature level, sub-band Hilbert envelope based features [16, 17, 18], warped minimum variance distortionless response (MVDR) cepstral coefficients [19], blind spectral weighting (BSW) based features [17], power-normalized cepstral coefficients (PNCC) [20] and DNN bottleneck features [21] have been applied to ASV systems to suppress the adverse impacts of reverberation and noise. At the model level, reverberation matching with multi-condition training models has been successfully employed within universal background model (UBM) or i-vector based frontend systems [22, 23]. Multi-channel i-vector combination for far-field speaker recognition is also explored in [24]. In backend modeling, multi-condition training of probabilistic linear discriminant analysis (PLDA) models was employed in an i-vector system [25]. The robustness of deep speaker embeddings for far-field text-independent speech has also been investigated in [26, 27]. Finally, at the score level, score normalization [22] and multi-channel score fusion [28] have been applied in far-field ASV systems to improve robustness.

In this work, we focus on the far-field end-to-end text-dependent speaker verification task at the model level. Previous studies [4, 5, 6] on end-to-end deep neural network based text-dependent speaker verification directly use large-scale text-dependent databases to train the systems. However, in real-world applications, people may want to use customized wake-up words for speaker verification, and different smart home devices may have different wake-up words, even for products from the same company. Hence, collecting a large-scale far-field text-dependent speech database for each new or customized wake-up word may not be feasible. This motivates us to explore the transfer learning concept and use a small far-field text-dependent speech dataset to fine-tune an existing deep speaker embedding network trained on large-scale text-independent speech databases, such as the NIST SRE databases or VoxCeleb [29, 30].
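To make the fine-tuning step concrete, here is a hedged PyTorch sketch under assumed interfaces: a speaker embedding network pre-trained on large-scale text-independent data has its classification layer replaced for the text-dependent speakers and is then fine-tuned with a small learning rate. The tiny `EmbeddingNet`, the checkpoint path, and the speaker counts are illustrative placeholders, not the paper's actual architecture.

```python
# A hedged sketch of the transfer-learning step: pre-train on text-independent
# data, then fine-tune on the small far-field text-dependent set. All names
# and sizes below are illustrative, not the paper's actual model.
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    def __init__(self, feat_dim=40, embed_dim=512, num_speakers=5000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5), nn.ReLU(),
            nn.Conv1d(512, embed_dim, kernel_size=3), nn.ReLU())
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, x):                     # x: (batch, feat_dim, frames)
        h = self.frame_layers(x).mean(dim=2)  # simple average pooling
        return self.classifier(h)

model = EmbeddingNet()
# model.load_state_dict(torch.load("pretrained_ti.pt"))  # hypothetical checkpoint
model.classifier = nn.Linear(512, 300)        # e.g. 300 text-dependent speakers
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()

feats = torch.randn(8, 40, 200)               # stand-in filterbank batch
labels = torch.randint(0, 300, (8,))
loss = criterion(model(feats), labels)        # one fine-tuning step
loss.backward()
optimizer.step()
```

The small learning rate keeps the fine-tuned model close to the text-independent representation it starts from, which is the point of transfer learning with limited in-domain data.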
Furthermore, we propose a new research topic on far-field text-dependent speaker verification: using close-talking clean data for enrollment and employing real far-field noisy utterances for testing. This scenario corresponds to the case in which only one clean utterance, recorded by a cell phone, is used to enroll the speaker for a smart home device. In this work, we investigate an enrollment data augmentation scheme to reduce this mismatch and improve ASV performance.
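Below is a minimal sketch of one plausible form of this enrollment data augmentation, re-using the `simulate_far_field` helper from the earlier sketch and a hypothetical `extract_embedding` function; averaging the embeddings of the augmented copies is one reasonable strategy, not necessarily the paper's exact method.

```python
# A minimal sketch, under assumed interfaces, of enrollment data
# augmentation: corrupt the single clean enrollment utterance with
# simulated reverberant/noisy copies, extract an embedding from each,
# and average them into the enrollment model. `extract_embedding`,
# `rirs`, `noises` and the averaging strategy are assumptions.
import numpy as np

def augmented_enrollment(clean, rirs, noises, snrs, extract_embedding):
    """Average embeddings over the clean utterance and its noisy copies."""
    copies = [clean] + [
        simulate_far_field(clean, rir, noise, snr)
        for rir, noise, snr in zip(rirs, noises, snrs)
    ]
    embeddings = np.stack([extract_embedding(c) for c in copies])
    return embeddings.mean(axis=0)
```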

