icassp2020--TEXT-INDEPENDENT SPEAKER VERIFICATION WITH ADVERSARIAL LEARNING ON SHORT UTTERANCES

TEXT-INDEPENDENT SPEAKER VERIFICATION WITH ADVERSARIAL LEARNING ON SHORT UTTERANCES

Kai Liu, Huan Zhou
Artificial Intelligence Application Research Center, Huawei Technologies Shenzhen, PRC

ABSTRACT
A text-independent speaker verification system suffers severe performance degradation under short-utterance conditions. To address the problem, in this paper we propose an adversarially learned embedding mapping model that directly maps a short embedding to an enhanced embedding with increased discriminability. In particular, a Wasserstein GAN with a range of loss criteria is investigated. These loss functions have distinct optimization objectives, and some of them are less commonly used in the speaker verification research area. Different from most prior studies, our main objective in this study is to investigate the effectiveness of those loss criteria by conducting numerous ablation studies. Experiments on the Voxceleb dataset showed that some criteria are beneficial to the verification performance while some have trivial effects. Lastly, a Wasserstein GAN with the chosen loss criteria, without fine-tuning, achieves meaningful advancements over the baseline, with 4% relative improvement in EER and 7% in minDCF in the challenging scenario of short 2-second utterances.

  1. INTRODUCTION
    Text-independent Speaker Verification (SV) aims to automatically verify the identity of a speaker, given an enrolled speaker record and a test speech signal (with no special constraint on phonetic content). The most important step in the SV pipeline is to map speech of arbitrary duration into a speaker representation of fixed dimension. It is desirable for such a speaker representation to be compact, discriminative and robust to extrinsic and intrinsic variations.
    Several types of speaker representations have been developed over the past decades. The well-known i-vector [3] has been the state-of-the-art speaker representation, usually associated with a simple cosine-scoring strategy or the more powerful probabilistic linear discriminant analysis (PLDA) [12, 4] as the verifier. With the advent of deep neural networks (DNNs), a variety of DNN frameworks and loss functions have been developed to learn deep speaker representations, known as embeddings. By training these networks with either the cross-entropy loss or some form of contrastive loss on a large amount of data, the resulting embeddings are speaker-discriminative. Compared to the i-vector, such embeddings, e.g. the x-vector [2] and the GhostVLAD-aggregated embedding [18] (or G-vector for short), are promising, demonstrating competitive performance on long speech and a distinct advantage on short speech. Furthermore, the recently developed G-vector shows considerable gains over the x-vector under noisy test conditions, which makes it more favorable for a practical SV system.

[18] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Utterance-level aggregation for speaker recognition in the wild. In ICASSP, pages 5791–5795, 2019. (From the Oxford group.)

However, the performance of an SV system usually degrades in real scenarios, due to prevalent mismatches between development and test conditions, such as channel, domain or duration mismatch [11, 5, 18]. For instance, it has been observed [5] that on the NIST-SRE 2010 test set (female part), the performance of the i-vector/PLDA system drops from 2.48% to 24.78% when the verification trial is shortened from full duration to 5 seconds.
Numerous research studies have been proposed to mitigate the short-duration effect. An early family of research aimed to modify different aspects of the i-vector based SV system, e.g., feature extraction techniques, intermediate parameter estimation, speaker model generation, and score normalization techniques, as summarized in [11]. Recently, more novel deep learning technologies have been explored. For instance, insufficient phonetic information is compensated by a teacher-student learning framework [17], and the scoring scheme is calibrated by transfer learning [13]. Another research strategy is to design duration-robust speaker embeddings to deal with utterances of arbitrary duration. By applying different neural network architectures and alternative loss functions, the discriminability of embeddings is further enhanced. For example, an Inception Net with triplet loss is deployed in [20], an Inception-ResNet with joint softmax and center loss in [8], and a ResCNN with a novel speaker identity subspace loss in [14].

[11] A. Poddar, M. Sahidullah, and G. Saha. Performance comparison of speaker recognition systems in presence of duration variability. In 2015 Annual IEEE India Conference (INDICON), pages 1–6, Dec 2015.
[13] Qingyang Hong, Lin Li, Lihong Wan, Jun Zhang, and Feng Tong. Transfer learning for speaker verification on short utterances. In Interspeech, pages 1848–1852, 2016.
[20] Chunlei Zhang and Kazuhito Koishida. End-to-end text-independent speaker verification with triplet loss on short utterances. In Interspeech, pages 1487–1491, 2017.
[8] Na Li, Deyi Tuo, Dan Su, Zhifeng Li, and Dong Yu. Deep discriminative embeddings for duration robust speaker verification. In Interspeech, pages 2262–2266, 2018.
[14] Ruifang Ji, Xinyuan Cai, and Bo Xu. An end-to-end text-independent speaker identification system on short utterances. In Interspeech, pages 3628–3632, 2018.

Generative Adversarial Networks (GANs) [6] are one of the most popular deep learning algorithms developed recently. GANs have the potential to generate realistic instances and provide a solution to problems that require a generative solution, most notably in various image-to-image translation tasks.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. 2014. (This is the original GAN paper, by Ian Goodfellow.)

In this study, we aim to investigate the short-duration issue present in a practical SV system. Contrary to most of the techniques mentioned above, our proposed approach works directly on the speaker embeddings. In particular, given short and long embedding pairs extracted from the same speaker and session, we propose to use adversarial learning with a Wasserstein GAN to learn a new embedding with enhanced discriminability. To test our approach, the G-vector is chosen as the embedding benchmark in our experiments due to its promising performance on short speech. This poses a greater challenge to our study than to prior studies that benchmarked against i-vectors.
The remainder of this paper is organized as follows: Section 2 briefly introduces the work related to our methods. Section 3 details our proposed Wasserstein GAN based approach. Section 4 presents experimental results and discussions. Finally, our conclusions are given in Section 5.

  2. RELATED WORKS

2.1. Wasserstein-GAN

GANs [6] are deep generative models comprised of two networks, a generator and a discriminator. The discriminator D tries to learn the difference between a real sample y and a fake sample g generated from noise η, and the generator G tries to fool the discriminator. That is, the following minimax loss function is optimized through alternating optimization, until equilibrium is reached:

$$\min_G \max_D \; \mathbb{E}_{y \sim P_y}[\log D(y)] + \mathbb{E}_{\eta \sim p(\eta)}[\log(1 - D(G(\eta)))]$$
However, training a GAN model is difficult due to the well-known diminishing or exploding gradients issue. These issues have been addressed by the Wasserstein GAN (WGAN) [1], where the discriminator is designed to find a good f_w and a new loss function is configured to measure the Wasserstein distance:

$$\min_G \max_{w \in \mathcal{W}} \; \mathbb{E}_{y \sim P_y}[f_w(y)] - \mathbb{E}_{\eta \sim p(\eta)}[f_w(G(\eta))]$$

[1] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ArXiv, 2017. URL: https://arxiv.org/pdf/1701.07875.pdf.
2.2. Deployments of GANs in SV

Motivated by the remarkable success in image-to-image translation, GANs have been actively deployed in the SV research community, mainly to handle the domain-mismatch issue, for example by transforming i-vectors [15] and x-vectors [19]. In contrast, there are few works that use GANs to handle the short-duration issue. To the authors' best knowledge, the only published work proposes compensating the i-vectors via a conditional GAN [7]. However, only limited performance improvements were observed: the proposed system alone failed to outperform the baseline system, and only the score-wise fusion based system showed better performance than the baseline.

[19] Youzhi Tu, Man-Wai Mak, and Jen-Tzung Chien. Variational domain adversarial learning for speaker verification. In Interspeech, pages 4315–4319, 2019.
[7] Jiacen Zhang, Nakamasa Inoue, and Koichi Shinoda. I-vector transformation using conditional generative adversarial networks for short utterance speaker verification. In Interspeech, Hyderabad, pages 3613–3617, 2018.

In the authors' opinion, training a GAN is non-trivial; the reason behind such results might be an oversight of the effects of the loss functions of the conditional GAN. As such, in this study, we investigate the problem and seek to reveal some guidelines on choosing beneficial loss functions to make the model perform better.

  3. PROPOSED APPROACH

The architecture of our proposed approach is illustrated in Fig. 1. Here x and y are D-dimensional G-vectors corresponding to the short- and long-utterance embeddings from the same speaker session, and z is the speaker identity label. Given x, y, z, the proposed system is trained to learn a D-dimensional embedding g, with the expectation that the g-based SV system can outperform the one based on x.
[Fig. 1: Architecture of the proposed approach]
Overall, the proposed architecture can be decomposed into four core components: the embedding generator Gf, the speaker label predictor Gc, the distance calculator Gd and the Wasserstein discriminator Dw. All components are jointly trained in order to generate enhanced embeddings with carefully handcrafted optimization objectives, as described in the following.
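To make this decomposition concrete, below is a minimal PyTorch sketch of what the four components might look like. The layer widths, module names and two-layer topologies are our own assumptions, since the paper does not specify the internal architectures of Gf, Gc, Gd and Dw.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512          # assumed G-vector dimension
HIDDEN = 512     # assumed hidden width
NUM_SPK = 1057   # number of training speakers (Section 4.1)

class EmbeddingGenerator(nn.Module):
    """G_f: maps a short-utterance embedding x to an enhanced embedding g."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, D))

    def forward(self, x):
        return self.net(x)

class SpeakerPredictor(nn.Module):
    """G_c: projection layer feeding the softmax cross-entropy speaker loss."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, NUM_SPK)    # W in R^{D x c}, b in R^c

    def forward(self, g):
        return self.proj(g)                  # class logits

def cosine_distance(g, y):
    """G_d: distance calculator between the enhanced embedding and its target."""
    return 1.0 - F.cosine_similarity(g, y, dim=1)

class WassersteinDiscriminator(nn.Module):
    """D_w: critic scoring [candidate embedding, conditioning short embedding x]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * D, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1))

    def forward(self, e, x):
        return self.net(torch.cat([e, x], dim=1))   # unbounded critic score
```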

3.1. Proposed Discriminator-Related Loss Functions

As mentioned above, the primary task of the proposed approach is to learn embeddings with enhanced discriminability. Letting P denote a data distribution, we propose to achieve this by mapping Pg from the initial Px to the target Py through adversarial learning of the WGAN. To this end, in the discriminative model, several loss criteria with different optimization objectives are investigated.
Following the conventional definition of the min-max function, the loss function of the WGAN is:
$$\min_{G_f} \max_{w \in \mathcal{W}} \; \mathbb{E}_{y \sim P_y}[D_w(y)] - \mathbb{E}_{x \sim P_x}[D_w(G_f(x))]$$
Inspired by the idea of the conditional GAN [10], in this study we investigate a novel loss function that optimizes the Wasserstein distance between joint data distributions. That is, the data to be discriminated is conditioned by concatenating the short embedding x with the conventional discriminator input. The corresponding min-max function is updated as:

$$\min_{G_f} \max_{w \in \mathcal{W}} \; \mathbb{E}_{(x,y) \sim P_{xy}}[D_w(y, x)] - \mathbb{E}_{x \sim P_x}[D_w(G_f(x), x)]$$

[10] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ArXiv, 2014. URL: https://arxiv.org/pdf/1411.1784.pdf.
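As a rough sketch of how this conditional objective could be computed per batch, assuming the WassersteinDiscriminator and EmbeddingGenerator modules sketched earlier (the concatenation with x realises the conditioning):

```python
def critic_loss(D_w, G_f, x, y):
    """Conditional WGAN critic loss: E[D_w(G_f(x), x)] - E[D_w(y, x)].
    Minimizing this w.r.t. D_w maximizes the Wasserstein distance estimate."""
    g = G_f(x).detach()                    # generated embedding; no gradient to G_f here
    return D_w(g, x).mean() - D_w(y, x).mean()
```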
In addition, to seek more discriminability, the Fréchet Inception Distance (FID) [9], a popular metric for calculating the distance between feature vectors of real and generated images, is also explored herein. Assuming Py and Pg to be normal distributions with means μy, μg and covariance matrices Cy, Cg, the FID loss can be calculated as:
$$\mathcal{L}_{FID} = \lVert \mu_y - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left(C_y + C_g - 2\,(C_y C_g)^{1/2}\right)$$
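For reference, a small NumPy/SciPy sketch of this FID formula is shown below, with μ and C estimated from batches of target and generated embeddings. Note that using FID as a training loss would require a differentiable (e.g. PyTorch) re-implementation, which this sketch does not provide.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(y_emb, g_emb):
    """Frechet distance between Gaussian fits of target (y) and generated (g) embeddings.
    y_emb, g_emb: arrays of shape (num_samples, D)."""
    mu_y, mu_g = y_emb.mean(axis=0), g_emb.mean(axis=0)
    C_y = np.cov(y_emb, rowvar=False)
    C_g = np.cov(g_emb, rowvar=False)
    covmean = sqrtm(C_y @ C_g)
    if np.iscomplexobj(covmean):           # drop tiny imaginary parts from numerics
        covmean = covmean.real
    return np.sum((mu_y - mu_g) ** 2) + np.trace(C_y + C_g - 2.0 * covmean)
```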
3.2. Proposed Generator-Related Loss Functions

In order to guide GAN training with the objective of feature discriminability, four loss criteria are investigated herein as extra training guides for the GAN training.
To verify the speaker label, the widely adopted multiclass cross-entropy (CE) loss is investigated, with the formulation:
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(W_{z_i}^{\top} g_i + b_{z_i})}{\sum_{j=1}^{c} \exp(W_j^{\top} g_i + b_j)}$$
where N is the batch size and c is the number of classes. g_i denotes the i-th generated embedding sample and z_i is the corresponding label index. W ∈ R^{D×c} and b ∈ R^c denote the weight matrix and bias of the projection layer.
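In a framework such as PyTorch, this projection-plus-softmax cross-entropy collapses to a linear layer followed by the built-in criterion; a minimal sketch, reusing the SpeakerPredictor assumed earlier as the projection layer:

```python
import torch.nn.functional as F

def ce_loss(predictor, g, z):
    """Multiclass cross-entropy over speaker labels.
    predictor: projection layer (W, b); g: (N, D) embeddings; z: (N,) label indices."""
    logits = predictor(g)                  # shape (N, c)
    return F.cross_entropy(logits, z)      # averaged over the batch of size N
```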
To explicitly penalize the class-related classification error, a triplet loss is deployed as well, where a baseline (anchor) input is compared to a positive (truthy) input and a negative (falsy) input. Let Γ be the set of all possible embedding triplets γ = (g_a, g_p, g_n) in the training set; the loss is defined as:
$$\mathcal{L}_{triplet} = \sum_{\gamma \in \Gamma} \max\!\left(0,\; \lVert g_a - g_p \rVert_2^2 - \lVert g_a - g_n \rVert_2^2 + \Psi\right)$$
where g_a is an anchor input, g_p is a positive input from the same class, g_n is a negative input from a different class, and Ψ ∈ R_+ is the safety margin between positive and negative pairs.
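A minimal sketch of this criterion using PyTorch's built-in margin loss is given below; note that the built-in loss uses (non-squared) Euclidean distances, and that the margin value, as well as how triplets are mined from a batch, are assumptions on our part.

```python
import torch.nn as nn

triplet_criterion = nn.TripletMarginLoss(margin=0.5)   # margin value is an assumption

def triplet_loss(g_a, g_p, g_n):
    """g_a: anchors, g_p: positives (same speaker), g_n: negatives (other speakers)."""
    return triplet_criterion(g_a, g_p, g_n)
```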
Apart from the above, to minimize intra-class variation, center loss [16] is also adopted. It can be formulated as:
$$\mathcal{L}_{center} = \frac{1}{2} \sum_{i=1}^{m} \lVert g_i - c_{z_i} \rVert_2^2$$
where g_i denotes the i-th deep feature belonging to the z_i-th class, c_{z_i} denotes the corresponding class center, and m is the size of the mini-batch.
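Center loss keeps one learnable center per speaker and pulls each embedding towards its class center; a small sketch, following Wen et al. [16] but with the centers updated by back-propagation rather than the original moving-average rule, is shown below.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """L_center: squared distance of each embedding to its speaker center (batch-averaged)."""
    def __init__(self, num_classes, dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, g, z):
        return 0.5 * ((g - self.centers[z]) ** 2).sum(dim=1).mean()
```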
To better guide the training process, the similarity between the enhanced embedding and its target is explicitly considered. It is measured by the cosine distance and evaluated as a dot product as follows:
$$\mathcal{L}_{cos} = 1 - \bar{g}^{\top} \bar{y}$$
where ḡ and ȳ are the normalized versions of the embeddings g and y, respectively.
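On length-normalized vectors this term reduces to one line; a sketch:

```python
import torch.nn.functional as F

def cosine_loss(g, y):
    """1 - cosine similarity between the enhanced embedding and its long-utterance target."""
    return (1.0 - F.cosine_similarity(g, y, dim=1)).mean()
```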
In all, we propose to train the generator Gf with the total loss defined as:
$$\mathcal{L}_{G_f} = -\,\mathbb{E}_{x \sim P_x}\!\left[D_w(G_f(x), x)\right] + \alpha\,\mathcal{L}_{CE} + \beta\,\mathcal{L}_{triplet} + \gamma\,\mathcal{L}_{center} + \varepsilon\,\mathcal{L}_{cos}$$
and discriminator Dw with:
$$\mathcal{L}_{D_w} = \mathbb{E}_{x \sim P_x}\!\left[D_w(G_f(x), x)\right] - \mathbb{E}_{(x,y) \sim P_{xy}}\!\left[D_w(y, x)\right] + \lambda\,\mathcal{L}_{FID}$$
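Putting the pieces together, the generator update might be assembled as sketched below, reusing the helper losses above. The weight values are placeholders (the paper reports no tuning of them), the triplet indices are assumed to be mined by the caller, and the FID term is omitted, as it was also left out of the final system.

```python
# Assumed, untuned loss weights.
ALPHA, BETA, GAMMA, EPSILON = 1.0, 1.0, 1.0, 1.0

def generator_total_loss(D_w, predictor, center_crit, G_f, x, y, z, trip_idx):
    """trip_idx: (anchor, positive, negative) index tensors into the batch, mined by the caller."""
    g = G_f(x)
    a, p, n = trip_idx
    loss = -D_w(g, x).mean()                             # adversarial (conditional WGAN) term
    loss = loss + ALPHA * ce_loss(predictor, g, z)       # speaker cross-entropy
    loss = loss + BETA * triplet_loss(g[a], g[p], g[n])  # "from g only" triplet variant
    loss = loss + GAMMA * center_crit(g, z)              # intra-class compactness
    loss = loss + EPSILON * cosine_loss(g, y)            # similarity to the long-utterance target
    return loss
```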
After the training of the WGAN, the generative model Gf is retained. At the SV test stage, the short embedding x of any given short test utterance can easily be mapped to its enhanced version g by directly applying the feed-forward model Gf to x.
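At test time the enhancement is thus a single forward pass through the retained generator; a sketch:

```python
import torch

@torch.no_grad()
def enhance(G_f, x_short):
    """Map a batch of short-utterance G-vectors to their enhanced versions g."""
    G_f.eval()
    return G_f(x_short)
```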

  4. EXPERIMENTS AND RESULTS
    This section details our experimental setup and the results of our investigation into the effectiveness of the proposed loss criteria.

4.1. Experimental Setup

We use a subset of Voxceleb2 to train our proposed system, in which 1,057 speakers are chosen, with 164,716 utterances in total. Those utterances are randomly cut to 2 seconds to serve as short utterances. Similarly, a subset of Voxceleb1 with 40 speakers is sampled, and a total of 13,265 utterance pairs are used for testing.
The VGG-ResNet34s network is used to extract G-vectors as our baseline system. Regarding the GAN training, the learning rates for both Gf and Dw are 0.0001; Adam optimization is adopted; weight clipping is employed for Dw with thresholds from -0.01 to 0.01; and the batch size is set to 128.
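A condensed sketch of the adversarial training loop under these settings (Adam with learning rate 0.0001 for both Gf and Dw, weight clipping in [-0.01, 0.01], batch size 128) is given below. It assumes the modules and loss helpers from the earlier sketches and a data loader yielding paired (short, long, label, triplet-index) batches; the number of critic steps per generator step is our assumption.

```python
import torch

LR, CLIP, N_CRITIC = 1e-4, 0.01, 5       # N_CRITIC (critic steps per generator step) is assumed

G_f, D_w = EmbeddingGenerator(), WassersteinDiscriminator()
predictor, center_crit = SpeakerPredictor(), CenterLoss(NUM_SPK, D)

opt_g = torch.optim.Adam(list(G_f.parameters()) + list(predictor.parameters())
                         + list(center_crit.parameters()), lr=LR)
opt_d = torch.optim.Adam(D_w.parameters(), lr=LR)

for x, y, z, trip_idx in loader:          # batches of size 128: (short, long, label, triplets)
    # --- critic updates with WGAN weight clipping ---
    for _ in range(N_CRITIC):
        opt_d.zero_grad()
        critic_loss(D_w, G_f, x, y).backward()
        opt_d.step()
        for p in D_w.parameters():
            p.data.clamp_(-CLIP, CLIP)
    # --- one generator update ---
    opt_g.zero_grad()
    generator_total_loss(D_w, predictor, center_crit, G_f, x, y, z, trip_idx).backward()
    opt_g.step()
```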

4.2. Ablation Studies on Various Loss Functions
To verify the importance of the proposed loss criteria, a set of ablation studies is conducted by choosing different combinations of them. The overall results are illustrated in Tab. 1, where Lc and Lt denote Lcenter and Ltriplet, respectively. Triplet (a) means that the triplet inputs are sampled from both y and g, and (b) means from g only.
[Tab. 1: Ablation results for the different loss-criterion combinations (systems v1–v8)]
In our study, a total of 8 systems (v1–v8), obtained by combining different loss criteria with the Wasserstein GAN, are evaluated. Their corresponding detection error trade-off (DET) curves are plotted in Fig. 2.
[Fig. 2: DET curves of systems v1–v8]
From the above experimental results, the following conclusions could be drawn:

• FID loss has a positive effect (v1 vs. v2);
• Conditional WGAN outperforms WGAN (v3 vs. v4);
• Triplet loss is preferred (v7 vs. v2);
• Triplet (a) greatly outperforms triplet (b) (v3 vs. v8);
• Softmax loss has a positive effect (v3 vs. v5);
• Center loss has a negative effect (v6 vs. v7);
• Cosine loss has a significant positive effect (v6 vs. v8).

The above findings are very interesting, with a twofold outcome. Firstly, they demonstrate that additional training functions (e.g., the traditional softmax, cosine loss and triplet loss) all contribute positively to the performance, which verifies our earlier statement that extra training guides might be helpful for feature discriminability. Secondly, some loss criteria that are less favoured for a typical SV system (e.g., FID loss and the conditional WGAN loss) are surprisingly helpful, which are unusual findings and might be worthy of further investigation.

4.3. Comparison with the Baseline System

Finally, we make a performance comparison between our best system (v3) and the G-vector baseline system. Herein the comparison is measured in terms of equal error rate (EER) and minDCF. The results are reported in Tab. 2.
[Tab. 2: Performance comparison between the proposed system (v3) and the G-vector baseline]
From the table, we can see that our proposed system also has the merit of generalization and behaves consistently over the baseline system across different short durations. In detail, for verification with 2-second enroll-test utterances, our proposed system shows a 4.2% relative EER improvement and a 7.2% relative minDCF improvement. For shorter utterances with a duration of 1 second, it shows a comparable EER improvement (3.8%).
It is worth noting that, due to time constraints, the FID loss function has not been added to our final system; besides, there is no fine-tuning of the hyper-parameters, i.e. the loss weights α, β, γ, λ, ε and the triplet margin Ψ. This means there is still a lot of room for improvement in our system.

  5. CONCLUSIONS
    In this paper, we have successfully applied a WGAN to learn enhanced embeddings for speaker verification on short utterances. Our main contributions are twofold: the proposed WGAN-based kernel system and, on top of it, the validation of the effectiveness of a set of loss criteria for GAN training. Our final proposed system outperforms the baseline system in the challenging short-utterance speaker verification scenarios. In all, our experiments show both decent advancement and a potential direction in which our further research will go forward.

  6. REFERENCES
[1] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ArXiv, 2017. URL: https://arxiv.org/pdf/1701.07875.pdf.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur. Deep neural network embeddings for text-independent speaker verification. In Interspeech, page 999, 2017.
[3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, May 2011.
[4] Daniel Garcia-Romero and Carol Espy-Wilson. Analysis of i-vector length normalization in speaker recognition systems. In Interspeech, pages 249–252, 2011.
[5] Gautam Bhattacharya, Jahangir Alam, and Patrick Kenny. Deep speaker embeddings for short duration speaker verification. In Interspeech, Stockholm, pages 1517–1521, 2017.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. 2014.
[7] Jiacen Zhang, Nakamasa Inoue, and Koichi Shinoda. I-vector transformation using conditional generative adversarial networks for short utterance speaker verification. In Interspeech, Hyderabad, pages 3613–3617, 2018.
[8] Na Li, Deyi Tuo, Dan Su, Zhifeng Li, and Dong Yu. Deep discriminative embeddings for duration robust speaker verification. In Interspeech, pages 2262–2266, 2018.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pages 6626–6637, 2017.
[10] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ArXiv, 2014. URL: https://arxiv.org/pdf/1411.1784.pdf.
[11] A. Poddar, M. Sahidullah, and G. Saha. Performance comparison of speaker recognition systems in presence of duration variability. In 2015 Annual IEEE India Conference (INDICON), pages 1–6, Dec 2015.
[12] S. J. D. Prince and J. H. Elder. Probabilistic linear discriminant analysis for inferences about identity. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, Oct 2007.
[13] Qingyang Hong, Lin Li, Lihong Wan, Jun Zhang, and Feng Tong. Transfer learning for speaker verification on short utterances. In Interspeech, pages 1848–1852, 2016.
[14] Ruifang Ji, Xinyuan Cai, and Bo Xu. An end-to-end text-independent speaker identification system on short utterances. In Interspeech, pages 3628–3632, 2018.
[15] Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, and Haizhou Li. Unsupervised domain adaptation via domain adversarial training for speaker recognition. In ICASSP, pages 4889–4893, 2018.
[16] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
[17] Jee-weon Jung, Hee-Soo Heo, Hye-jin Shim, and Ha-Jin Yu. Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings. ArXiv, 2018. URL: https://arxiv.org/pdf/1810.10884.pdf.
[18] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Utterance-level aggregation for speaker recognition in the wild. In ICASSP, pages 5791–5795, 2019.
[19] Youzhi Tu, Man-Wai Mak, and Jen-Tzung Chien. Variational domain adversarial learning for speaker verification. In Interspeech, pages 4315–4319, 2019.
[20] Chunlei Zhang and Kazuhito Koishida. End-to-end text-independent speaker verification with triplet loss on short utterances. In Interspeech, pages 1487–1491, 2017.
