Exploring End-to-End ASR Solutions for Specialized Domains

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"​​​​​​​​​​摘要:本文從《Shallow-Fusion End-to-End Contextual Biasing》入手,探索解決專有領域的端到端ASR。","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文分享自華爲雲社區","attrs":{}},{"type":"link","attrs":{"href":"https://bbs.huaweicloud.com/blogs/269842?utm_source=infoq&utm_medium=bbs-ex&utm_campaign=other&utm_content=content","title":"","type":null},"content":[{"type":"text","text":"《語境偏移如何解決?專有領域端到端ASR之路(一)》","attrs":{}}]},{"type":"text","text":",原文作者:xiaoye0829 。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於產品級的自動語音識別(Automatic Speech Recognition, ASR),能夠適應專有領域的語境偏移(contextual bias),是一個很重要的功能。舉個例子,對於手機上的ASR,系統要能準確識別出用戶說的app的名字,聯繫人的名字等等,而不是發音相同的其他詞。更具體一點,比如讀作“YaoMing”的這個詞語,在體育領域可能是我們家喻戶曉的運動員“姚明”,但是在手機上,它可能是我們通訊錄裏面一個叫做“姚敏”的朋友。如何隨着應用領域的變化,解決這種偏差問題就是我們這個系列的文章要探索的主要問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"        ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於傳統的ASR系統,它們往往有獨立的聲學模型(AM)、發音詞典(PM)、以及語言模型(LM),當需要對特定領域進行偏移時,可以通過特定語境的語言模型LM來偏移識別的過程。但是對於端到端的模型,AM、PM、以及LM被整合成了一個神經網絡模型。此時,語境偏移對於端到端的模型十分具有挑戰性,其中的原因主要有以下幾個方面:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 端到端模型只在解碼時用到了文本信息,作爲對比,傳統的ASR系統中的LM可以使用大量的文本進行訓練。因此,我們發現端到端的模型在識別稀有、語境依賴的單詞和短語,比如名詞短語時,相較於傳統模型,更容易出錯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. 
Earlier work mainly tried to address contextual modeling by fusing an independently trained contextual n-gram LM into the end-to-end model, an approach known as shallow fusion. Those methods handle proper nouns poorly, however: proper nouns are usually pruned away during beam search, so bringing in an LM to bias the scores comes too late, because the biasing is typically applied only after each full word is produced, while beam search predicts over sub-word units such as graphemes or wordpieces. (For English, the graphemes are the 26 letters plus the space and 12 common punctuation marks; for Chinese, they are the 3755 level-1 and 3008 level-2 characters plus 16 punctuation marks.)

In this post we look at a piece of work that tries to solve this problem: "Shallow-Fusion End-to-End Contextual Biasing", published by Google at InterSpeech 2019. In that work, first, to keep proper nouns from being pruned before the LM has a chance to bias them, biasing is explored at the sub-word level. Second, applying the contextual FST before beam pruning is explored. Third, since contextual n-grams usually occur together with a set of common prefixes ("call", "text"), fusing these prefixes into shallow fusion is also explored. Finally, to help the modeling of proper nouns, several techniques for exploiting large-scale text data are explored.

First, a definition of shallow fusion. Given a sequence of acoustic frames x = (x_1, ..., x_K), the end-to-end model outputs a sub-word-level posterior distribution y = (y_1, ..., y_L), i.e. P(y|x). Shallow fusion combines the end-to-end score with the score of an externally trained LM at beam-search time:

y* = argmax_y [ log P(y|x) + λ log P_C(y) ]

where λ is a parameter that balances the end-to-end model against the contextual LM. To build the contextual LM for the end-to-end model, a set of word-level biasing phrases is assumed to be known in advance and compiled into an n-gram WFST (weighted finite-state transducer). This word-level WFST is then combined with a speller FST that transduces a sequence of graphemes or wordpieces into the corresponding word.
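As a rough illustration of this scoring, the sketch below combines the two terms while a hypothesis is extended during beam search. The contextual scorer is a toy stand-in (the paper compiles the biasing phrases into a WFST); `LAMBDA`, the phrases, and the penalty value are all invented for illustration:

```python
LAMBDA = 0.3  # contextual LM weight; the paper tunes this per test set

# Toy stand-in for the contextual LM P_C: a set of biasing wordpiece
# sequences. A real system scores prefixes by walking a WFST instead.
BIAS_PHRASES = {("_Kai", "t", "lyn"), ("_call",)}

def contextual_log_prob(pieces):
    """Toy log P_C(y): 0 on a biasing path, a penalty everywhere else."""
    for phrase in BIAS_PHRASES:
        if phrase[:len(pieces)] == tuple(pieces):
            return 0.0   # hypothesis matches a prefix of a biasing phrase
    return -5.0          # hypothesis is off all biasing paths

def fused_score(e2e_log_prob, pieces):
    """Shallow fusion: log P(y|x) + lambda * log P_C(y)."""
    return e2e_log_prob + LAMBDA * contextual_log_prob(pieces)

print(fused_score(-1.2, ["_Kai"]))  # on a biasing path: -1.2
print(fused_score(-1.2, ["_xyz"]))  # off-path: -1.2 + 0.3 * (-5.0) = -2.7
```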
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所有之前的偏移工作,無論是針對傳統方法或者是端到端模型,都是將語境LM和基底模型(比如端到端模型或者ASR聲學模型)的得分在單詞(word)或者子詞(sub-word)網格上進行結合。端到端的模型由於在解碼時,通常設置了比較小的beam閾值,導致了其解碼路徑相較於傳統的方法較少。因此本文主要探索在beam剪枝前將語境信息應用到端到端模型裏。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當我們選擇對grapheme進行偏移,一個擔心是我們可能會有大量的不必要的詞語,與語境FST匹配上,從而淹沒這個beam。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/19/190423dfec02abe72ea6079e446a769d.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"​舉例來看,如上圖所示,如果我們想偏移這個單詞“cat”,那麼語境FST構建的目標就是去偏移“c”“a”和“t”這三個字母。當我們想要往“c”這個字母去偏移時,我們可能不僅會把“cat”加入到beam中,也有可能會把“car”這種無關的單詞加入到beam中。但是如果我們是在wordpiece層面進行偏移,相關的subword有較少的匹配,因此,更多相關的單詞能被加入beam中。還是以“cat”這個例子舉例,如果我們按照wordpiece來偏移,那麼“car”這個詞就不會進入beam中。因此,在本文中,我們使用了一個4096大小的wordpiece詞彙表。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"        ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們進一步分析,Shallow fusion修改了輸出的後驗概率,因此我們也可以發現shallow fusion會傷害那些沒有詞語需要偏移的語音,即那些去語境化的語音。因此,我們探索只去偏移那些符合特定前綴的短語,舉例來說,在手機中搜索聯繫人時,通常會先說一個“call”或者“message”,或者想播放音樂時,會先說一個“play”。因此在本文中,我們在構建語境FST時,考慮到這些前綴詞語。我們抽取出在語境偏移單詞前出現過50詞以上的前綴詞語。最後,我們獲得了292個常用前綴詞語用於查找聯繫人,11個用於播放歌曲,66個用於查找app。我們構建了一個無權重的前綴FST,並把它和語境FST級聯起來。我們也允許一個空前綴選項,去跳過這些前綴詞。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"       ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一個提高專有名詞覆蓋率的方法是利用大量的無監督數據。無監督的數據來自語音搜索中的匿名語音。這些語音利用一個SOTA模型進行處理,只有那些具有高confidence的語音會被保留下來。最後,爲了保證我們留下來的語音主要關於專有名詞,我們用了一個專有名詞標註器(就是ner裏的CRF作序列標註),並保留帶有專有名詞的語音。利用上述方法,我們得到了一億條無監督的語音,並結合了3500萬條有監督的語音進行訓練,在訓練時,每個batch內80%的時間是有監督的數據,20%是無監督的數據。利用無監督的數據,有一個問題就是他們識別出來的文字可能有錯,識別的結果也會限制名稱的拼寫,比如到底是Eric,還是Erik,或者Erick。因此,我們也可以利用大量的專有名詞,結合TTS的方法,創造了一個合成的數據集。我們從互聯網上針對不同類別去挖掘大量的語境偏移詞語,比如多媒體、社交、以及app等類別。最後,我們抽取除了大概58萬條聯繫人的名字,4萬2千條歌名,以及7萬個app的名字。接下來,我們從日誌中去挖掘大量的前綴詞語,比如,“call Johnmobile”,可以得到前綴詞“call”對應到社交領域。然後,我們利用特定類別的前綴詞和專有名詞去生成語音識別的文本,並利用語音合成器,爲每個類別生成了大約100萬條語音。我們進一步爲這些語音加上了噪音來模擬室內的聲音。最後,在訓練時,每個batch內90%的時間是有監督的數據,10%的是合成的數據。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"        ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,我們探索了是否能添加更多的專有名詞到有監督的訓練集中。具體來說,我們對每一條語音利用專有名詞標註器,找到其中的專有名詞。對於每一個專有名詞,我們獲得了其發音特徵。舉例來說,比如“Caitlin”可以表示成發音單位(phonemes)“K eI t l @ n”.緊接着,我們從發音詞典中,找到有相同發音單位序列的詞語,比如“Kaitlyn”。對於真實的語音,和可以替換的單詞,我們在訓練時,隨機替換。這個做法,可以讓模型觀察到更多的專有名詞。一個更直接的出發點是,模型能夠在訓練的時候拼寫出更多的名字,那麼在後面解碼時,結合語境FST,更能夠拼寫出這些名字。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"        
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面看一下實驗部分。所有實驗均基於RNN-T模型,encoder裏包含一個time reduction層,以及8層LSTM,每層有2000個隱藏層單元。decoder包含2層的LSTM,每層有2000個隱藏層單元。encoder和decoer被送到一個聯合網絡中,這個網絡有600個隱藏層單元。然後這個聯合網絡被送到一個softmax裏,輸出爲有96個單元的graphemes或者是4096個單元的wordpieces。在推理時,每條語音伴隨着一系列偏移短語用來構建一個語境FST。在這個FST中,每條弧(arc)都有相同的權重。這個權重爲每個目錄(比如音樂,聯繫人等)的測試集分別調節。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3b/3ba7bdd8b9cbdde744cc22bfd2731e76.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"​上圖是Shallow Fusion的一些結果,E0和E1是grapheme和wordpieces的結果,這些模型是沒有進行偏移的。E2是grapheme帶偏移的結果,但是不帶任何本文中的提升策略。E3是用了一個減法代價(subtractive cost)去防止在beam中保留糟糕的候選詞,這個操作在幾乎所有的測試集上都帶來了提升。再從grapheme層面的偏移轉換到wordpiece上的偏移,即我們在更長的單元上進行偏移,有助於在beam內保持相關的候選詞,並提高模型的性能。最後,我們的E5模型在beam search剪枝前,就應用偏移FST,我們稱之爲early biasing,這樣有助於確保好的候選詞能更早的保留在beam裏,並帶來了額外的性能提升。總之,我們最好的shallow fusion模型是在wordpiece層面進行偏移,並帶有subtractive cost和early biasing。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"        ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於語境偏置的可能存在於句子中,我們也需要保證當語境偏移不存在時,模型的效果不會下降,即不會損害那些不帶有偏置詞的語音的識別。爲了測試這一點,我們在VS test數據集上進行了實驗,我們隨機從Cnt-TTS測試集中選擇了200個偏置短語,去構建一個偏置FST。下圖展示了實驗的結果:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1d/1d23dee0d7851e0f069f78fa3393d270.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"​從這個表中可以看到,E1是我們的baseline模型,當添加偏移後,E5模型在VS上出現了很多程度上的效果下降。爲了解決這個問題,傳統的模型在偏移FST中包含了前綴詞。如果我們只在看到任何非空前綴詞後,才應用偏移(E6),我們可以觀察到VS數據集上相較E5出現了結果提升,但是在其他有偏移詞的測試集上,出現了結果下降。進一步,當我們允許其中一條前綴可以爲空時(主要想解決有偏移詞的場景),但是我們僅僅獲得了與E5類似的結果。爲了解決這個問題,我們對於語境短語用了較小的權重如果前面是一個空的前綴詞(即沒有前綴詞)。利用這個方法,我們觀察到E8相較於E1模型,在VS上取得了很小程度的效果下降,但是在有偏移短語的測試集上,能夠保持有效果提升。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"        
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在分析完了上述內容後,我們進一步探索下,當模型能感知到更多的專有名詞時,我們是否能進一步提升偏移的能力。我們的基線模型是E8,這個模型是在3500萬的有監督數據集上訓練得到的。結合我們上面的無監督數據和生成的數據,我們做了下面的實驗:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/58/584b1af8dc215f06d63a51406f1061bd.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"​E9的實驗結果展示,當有無監督的數據一起訓練時,在各個數據集上,都有效果提升。當有生成的數據一起訓練時(E10),相比於E9在TTS測試集上有更大的效果提升,但是在真實場景數據集Cnt-Real上出現了較大程度的下滑(7.1 vs 5.8),這表明在TTS偏移測試集上的提升,主要來源於訓練集和測試集間匹配的音頻環境,而不是學到了更豐富的專有名詞的詞彙表。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://bbs.huaweicloud.com/blogs?utm_source=infoq&utm_medium=bbs-ex&utm_campaign=other&utm_content=content","title":"","type":null},"content":[{"type":"text","text":"點擊關注,第一時間瞭解華爲雲新鮮技術~","attrs":{}}]}]}]}