2016--Analysis of the DNN-based SRE systems in multi-language conditions

Ondřej Novotný, Pavel Matějka, Ondřej Glembek, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan “Honza” Černocký
Brno University of Technology, Speech@FIT and IT4I Center of Excellence, Brno, Czech Republic
{inovoton,matejkap,glembek,iplchot,grezl,burget,cernocky}@fit.vutbr.cz

https://ieeexplore.ieee.xilesou.top/abstract/document/7846265

This work was supported by the DARPA RATS Program under Contract No. HR0011-15-C-0038. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
This work was also supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research Laboratory contract number W911NF-12-C-0013. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.
The work was also supported by Czech Ministry of Interior project No. VI20152020025 “DRAPAK” and the European Union’s Horizon 2020 programme under grant agreement No. 645523 BISON.

Abstract
This paper analyzes the behavior of our state-of-the-art Deep Neural Network/i-vector/PLDA-based speaker recognition systems in multi-language conditions. On the “Language Pack” of the PRISM set, we evaluate the systems’ performance using NIST’s standard metrics. We show that the gain from using DNNs vanishes, that using dedicated DNNs for the target conditions does not help, and moreover that the DNN-based systems tend to produce de-calibrated scores under the studied conditions. This work gives suggestions for directions of future research rather than any particular solutions to these issues.

  1. INTRODUCTION

During the last decade, neural networks have experienced a renaissance as a powerful machine learning tool. Deep Neural Networks (DNN) have also been successfully applied to the field of speech processing. After their great success in automatic speech recognition (ASR) [1], DNNs were also found very useful in other fields of speech processing such as speaker [2, 3, 4] or language recognition [5, 6, 7]. In speech recognition, DNNs are often directly trained for the “target” task of frame-by-frame classification of speech sounds (e.g. tied tri-phone states). Similarly, a DNN directly trained for frame-by-frame classification of languages was successfully used for language recognition in [7]. However, this system provided competitive performance only for speech utterances of short durations.
Note: i.e., the DNN-based approach was competitive only for short utterances, on the order of a few to a few tens of seconds.
In the field of speaker recognition, DNNs are usually used in a more elaborate and indirect way: one approach is to use DNNs for extracting frame-by-frame speech features. Such features are then used in the usual way (e.g. as input to an i-vector based system [8]).
These features can be directly derived from the DNN output posterior probabilities [9] and combined with the conventional features (PLP or MFCC) [10]. More commonly, however, bottleneck (BN) DNNs are trained for a specific task, and the features are taken from a narrow hidden layer compressing the relevant information into low-dimensional feature vectors [6, 5, 11]. Alternatively, a standard DNN (with no bottleneck) can be used, where the high-dimensional outputs of one of the hidden layers are converted to features using a dimensionality reduction technique such as PCA [12].
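For illustration, a minimal sketch of the PCA-based variant from [12], assuming access to the pooled hidden-layer activations; the 60-dimensional target and all names are illustrative, not taken from the paper:

```python
from sklearn.decomposition import PCA

# Sketch: reduce high-dimensional hidden-layer activations of a standard
# (no-bottleneck) DNN to low-dimensional frame features with PCA.
def fit_pca_projection(train_activations, dim=60):
    """train_activations: (n_frames, H) hidden-layer outputs pooled over training data."""
    pca = PCA(n_components=dim)
    pca.fit(train_activations)
    return pca  # apply pca.transform(acts) to each utterance's (T, H) activations
```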
In [13], we analyzed various DNN approaches to speaker recognition (and similar studies were conducted, e.g., in [14, 15]). We used two different DNNs: a mono-lingual DNN trained on the Fisher English data corpus, and a multi-lingual DNN trained on 11 languages of the Babel data collection. The rest of the system was trained on the PRISM set, i.e. mainly on English data. We reported our results only on the NIST SRE 2010 telephone condition (i.e. only on English speech) via the Equal Error Rates (EERs) and the minimum DCF NIST metrics.

[13] Pavel Matějka, Ondřej Glembek, Ondřej Novotný, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan Černocký, “Analysis of DNN approaches to speaker identification,” in Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5100–5104, IEEE Signal Processing Society.
[14] Yao Tian, Meng Cai, Liang He, and Jia Liu, “Investigation of bottleneck features and multilingual deep neural networks,” in Interspeech, 2015.
[15] Sandro Cumani, Oldřich Plchot, and Pietro Laface, “Comparison of hybrid DNN-GMM architectures for speaker recognition,” in ICASSP, 2016, IEEE Signal Processing Society.

However, when tested on non-English test sets, we observed that the benefit of using the DNNs degraded dramatically. We used the “lan” Language Pack of the PRISM set (described later in the paper) and its Chinese subset, the “chn” pack, in comparison with the originally used NIST SRE 2010 telephone condition. Not only did we see performance degradation in terms of EER and the minimum DCFs, but even more so in terms of the actual DCFs, i.e. the systems produce heavily de-calibrated scores.
Note: de-calibrated scores means the raw score values can no longer be reliably interpreted or thresholded; calibrated scores are ones whose values can be used directly for decisions.

Our hypothesis was that when we use a DNN trained for the target language, the error rates would decrease. To match the sre10, “lan”, and “chn” test conditions, we chose three DNNs, trained on: i) the Fisher English, ii) the Multilingual set, and iii) the Mandarin data, respectively. However, it turned out that, apart from the Fisher English being optimal for the NIST SRE 2010 test, there was no clear correlation between the test language and the DNN training language.
Note: the hypothesis is that error rates drop when the DNN training language matches the test language. (“lan” means the test speech is multilingual, “chn” means it is Mandarin, and the remaining condition, sre10, is English; hence the three DNNs trained on English, multilingual, and Mandarin data.) The results contradict this hypothesis: matching the DNN training language to the test language does not by itself reduce error rates.

This paper analyzes the problems that emerged when applying the current state-of-the-art SRE systems to non-English domains, and provides directions for future research. This work is an extension of our previous analysis, available as a technical report [16].

Note: reference [16] appears to be virtually identical to this paper; it is not clear where the extension lies.

[16] Ondřej Novotný, Pavel Matějka, Ondřej Glembek, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan “Honza” Černocký, “DNN-based SRE systems in multi-language conditions,” 2016, BUT Technical Report, http://www.fit.vutbr.cz/research/pubs/report.php?id=11235, also being submitted to IEEE Signal Processing Letters.

2. THEORETICAL BACKGROUND

2.1. i-vector Systems

The i-vectors [8] provide an elegant way of reducing large-dimensional input data to a small-dimensional feature vector while retaining most of the relevant information. The main principle is that the utterance-dependent Gaussian Mixture Model (GMM) supervector of concatenated mean vectors s is modeled as

s = m + Tw,

where m is the UBM mean supervector, T is a low-rank matrix spanning the total-variability subspace, and w is a normally distributed latent vector whose MAP point estimate is the i-vector.
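To make the model concrete, here is a minimal sketch of extracting an i-vector as the MAP point estimate of w from zeroth- and first-order statistics, following the standard formulation of [8]; all variable names are illustrative:

```python
import numpy as np

def extract_ivector(N, F, m, T, Sigma_inv):
    """
    N: (C,) zeroth-order stats, F: (C, D) first-order stats,
    m: (C, D) UBM means, T: (C*D, R) total-variability matrix,
    Sigma_inv: (C, D) inverse diagonal UBM covariances.
    """
    C, D = m.shape
    R = T.shape[1]
    # center the first-order stats around the UBM means and whiten them
    F_tilde = ((F - N[:, None] * m) * Sigma_inv).reshape(-1)
    # posterior precision of w: L = I + sum_c N_c * T_c' Sigma_c^-1 T_c
    L = np.eye(R)
    for c in range(C):
        Tc = T[c * D:(c + 1) * D]
        L += N[c] * Tc.T @ (Sigma_inv[c][:, None] * Tc)
    # the i-vector is the posterior mean of w
    return np.linalg.solve(L, T.T @ F_tilde)
```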

We experimented with monolingual (English and Mandarin) and multilingual BN features. In the case of multilingual training, we adopted a training scheme with block-softmax, which divides the output layer into parts according to individual languages. During training, only the part of the output layer is activated that corresponds to the language that the given target belongs to. See [20, 21] for a detailed description.
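A minimal PyTorch sketch of the block-softmax idea, with illustrative class and variable names (the actual recipe is described in [20, 21]):

```python
import torch.nn as nn
import torch.nn.functional as F

class BlockSoftmaxHead(nn.Module):
    """Output layer split into per-language blocks; a frame's loss only
    involves the block of the language it belongs to."""

    def __init__(self, hidden_dim, targets_per_language):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, sum(targets_per_language))
        # offsets of each language's block within the shared output layer
        self.offsets = [0]
        for n in targets_per_language:
            self.offsets.append(self.offsets[-1] + n)

    def loss(self, hidden, lang_id, target):
        """target holds block-local class indices for language lang_id."""
        logits = self.linear(hidden)  # (batch, total_targets)
        lo, hi = self.offsets[lang_id], self.offsets[lang_id + 1]
        # the softmax (and hence the gradient) is restricted to this block
        return F.cross_entropy(logits[:, lo:hi], target)
```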

Bottleneck Neural-Network (BN-NN) refers to an NN topology in which one of the hidden layers has significantly lower dimensionality than the surrounding layers. A bottleneck feature vector is generally understood as a by-product of forwarding a primary input feature vector through the BN-NN and reading off the vector of values at the bottleneck layer. We have used a cascade of two such NNs for our experiments. The output of the first network is stacked in time, defining context-dependent input features for the second NN, hence the term Stacked Bottleneck Features.

However, it was shown that DNNs can be used directly for posterior computation [2].
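To make this posterior-extractor role concrete, a minimal sketch, assuming a hypothetical `dnn_posteriors` callable that returns a (T, C) matrix of frame-level senone posteriors; as in [2], these replace the UBM posteriors when accumulating the sufficient statistics used for i-vector extraction:

```python
import numpy as np

def collect_stats(features, dnn_posteriors):
    """features: (T, D) acoustic frames; dnn_posteriors: callable -> (T, C)."""
    gamma = dnn_posteriors(features)  # DNN senone posteriors replace UBM posteriors
    N = gamma.sum(axis=0)             # (C,)   zeroth-order statistics
    F = gamma.T @ features            # (C, D) first-order statistics
    return N, F
```

These (N, F) are exactly the statistics consumed by the i-vector extractor sketched above.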

In other words, we show the utility of the trained DNNs as both feature- and posterior-extractors.

SBN feature extraction involves two NNs: the bottleneck (BN) outputs of the first NN are stacked in time and downsampled, and serve as the input vector of the second NN. The second NN in turn has a BN layer of its own, whose outputs serve as input features for a conventional Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) speech recognition system. Fundamental frequency (f0) related features are important for speech recognition of both tonal and non-tonal languages, so the authors experimented with different f0 features as additional SBN inputs. The final SBN bottleneck features are 80-dimensional and are subsequently used as input features for a conventional GMM/UBM i-vector speaker recognition system.
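A minimal sketch of this cascade, assuming `nn1_bottleneck` and `nn2_bottleneck` are the trained networks' forward passes up to their bottleneck layers; the context offsets follow the usual BUT stacked-bottleneck setup but are an assumption here, not a detail from this paper:

```python
import numpy as np

def stack_context(frames, offsets=(-10, -5, 0, 5, 10)):
    """Stack first-stage bottleneck frames at the given (downsampled) time offsets."""
    T = len(frames)
    return np.stack([
        np.concatenate([frames[min(max(t + o, 0), T - 1)] for o in offsets])
        for t in range(T)
    ])

def sbn_features(input_feats, nn1_bottleneck, nn2_bottleneck):
    bn1 = nn1_bottleneck(input_feats)  # (T, bn1_dim) first-stage bottleneck outputs
    ctx = stack_context(bn1)           # stacked in time, edge frames repeated
    return nn2_bottleneck(ctx)         # (T, 80) final SBN features
```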

Note: the DNNs in this paper are mainly used for feature extraction. Where are the extracted posteriors used? (Presumably, as in [2], they replace the UBM posteriors when accumulating the sufficient statistics for i-vector extraction, as sketched above.)

There are two DNN architectures (the standard DNN and the SBN) and three training sets (English, Mandarin, and multilingual), which yields the following five DNNs:
English SBN
Mandarin SBN
Multilang SBN
English DNN
Mandarin DNN

Note: the structure of the baseline is not specified in the paper; presumably it is a standard Kaldi-style i-vector system. Its feature extraction is configured as follows:
19 MFCC coefficients + energy, augmented with their delta and double-delta coefficients, resulting in 60-dimensional feature vectors. The analysis window was 20 ms long with a shift of 10 ms.
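A rough librosa-based sketch of this baseline front-end; treating the 0th cepstral coefficient as the energy term and assuming 8 kHz telephone audio are both my assumptions:

```python
import librosa
import numpy as np

def baseline_features(wav, sr=8000):
    """19 MFCCs + energy with deltas and double-deltas -> (T, 60) features."""
    n_fft, hop = int(0.020 * sr), int(0.010 * sr)  # 20 ms window, 10 ms shift
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20,  # c0 stands in for energy
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T  # (T, 60)
```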

The input to the SBN is:
24 log Mel-scale filter bank outputs augmented with fundamental frequency features from 4 different f0 estimators (Kaldi, Snack, and two others according to [17] and [18]). Together, we have 13 f0-related features; see [19] for more details.
The output of the SBN is:
80-dimensional bottleneck features, subsequently used as input features for the conventional GMM/UBM i-vector speaker recognition system.

Note: what are the inputs and outputs of the English DNN and the Mandarin DNN? The paper does not seem to say.

Note: I also did not fully understand the explanation of Tab. 3:
In Tab. 3, we show the effect of a linear calibration on the English SBN system. Because of the lack of an independent held-out set, we performed a cheating (gender-independent) calibration trained using the “lan” trial set, which contains both English and Chinese trials.
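For clarity: linear calibration learns an affine map s -> a*s + b so that the mapped scores behave as well-calibrated log-likelihood ratios, and “cheating” means the map is trained on the evaluation trials themselves. A minimal sketch using logistic regression; scikit-learn is my substitution, as the paper does not state which tool was used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_linear_calibration(scores, labels):
    """scores: (N,) raw trial scores; labels: (N,) 1 = target, 0 = non-target."""
    lr = LogisticRegression()
    lr.fit(scores.reshape(-1, 1), labels)
    a, b = lr.coef_[0, 0], lr.intercept_[0]
    # subtract the training-set prior log-odds so outputs approximate LLRs
    prior_logodds = np.log(labels.mean() / (1.0 - labels.mean()))
    return lambda s: a * s + b - prior_logodds
```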

CONCLUSIONS

In this work, we have studied the behavior of the DNN techniques in SRE i-vector/PLDA systems, currently considered to be state-of-the-art, as evaluated on the most common NIST SRE English test sets, such as the NIST SRE 2010, condition 5. We have shown that when applied to non-English test sets, these techniques stop being effective and are susceptible to de-calibrating the scores produced by the traditional i-vector/PLDA systems. We have also observed that selecting a DNN to match the test condition does not solve the issues mentioned above.

This work therefore leaves more questions than answers. It suggests that we focus on the analysis of the DNN acoustic-space clustering with regard to multiple languages and other types of variability, and that we study the behavior of this clustering with regard to the available SRE training data.