Behind the Ultra-HD Audio of a Real-Time Conferencing System: A Deep Dive into the AliCloudDenoise Speech Enhancement Algorithm

> In recent years, with the development of real-time communication technology, online meetings have become an indispensable office tool. According to incomplete statistics, about 75% of online meetings are audio-only, i.e. held with neither camera nor screen sharing enabled, so the quality and clarity of the speech is critical to the online-meeting experience.

Author | 七琦

Reviewer | 泰一

## Foreword

Real-life meeting environments are highly diverse, from open, noisy spaces to transient, non-stationary keyboard clicks, and they pose a serious challenge to traditional signal-processing-based speech front-end enhancement algorithms. Meanwhile, with the rapid development of data-driven methods, deep-learning-based speech enhancement algorithms have emerged in academia [1] and industry [2, 3, 4] and achieved good results. The AliCloudDenoise algorithm was born in this context. It combines the excellent nonlinear fitting capability of neural networks with traditional speech enhancement, and through continuous iteration on denoising quality and performance overhead in real-time meeting scenarios, it delivers strong noise suppression while preserving very high speech fidelity, providing an excellent voice-meeting experience for the Alibaba Cloud Video Cloud real-time conferencing system.

![](https://static001.geekbang.org/infoq/d2/d205c7729e1385c063651667fa7fb89e.png)

![](https://static001.geekbang.org/infoq/e5/e52a226a19802965bae40b6249d7aef6.png)

## The State of Speech Enhancement Algorithms

Speech enhancement refers to techniques that filter out the noise corrupting clean speech in real-world conditions, in order to improve the quality and intelligibility of that speech. Over the past few decades, traditional single-channel speech enhancement has developed rapidly, split mainly into time-domain and frequency-domain methods. Time-domain methods can be roughly divided into parametric filtering [5, 6] and signal-subspace methods [7]; frequency-domain methods include spectral subtraction, Wiener filtering, and minimum mean-square error (MMSE) spectral amplitude estimation [8, 9].

Traditional single-channel methods are computationally cheap and can enhance speech online in real time, but they suppress non-stationary, bursty noise poorly (for example, a sudden car horn on the street), and they leave considerable residual noise, which degrades subjective listening quality and can even hurt the intelligibility of the message. From the standpoint of their mathematical derivation, traditional algorithms also rely on many assumptions to reach closed-form solutions, which puts a clear ceiling on their quality and makes them ill-suited to complex, changing real-world scenes. Since 2016, deep learning has markedly improved many supervised learning tasks, such as image classification [10], handwriting recognition [11], automatic speech recognition [12], language modeling [13], and machine translation [14], and many deep-learning methods have appeared for speech enhancement as well.

![](https://static001.geekbang.org/infoq/42/42a8178890707c4ed478fee1f734d6d9.png)

Figure 1: Classic processing pipeline of a traditional single-channel speech enhancement system

Deep-learning speech enhancement algorithms can be roughly divided into four classes by training target:

• Hybrid methods built on traditional signal processing: one or more sub-modules of a conventional signal-processing enhancement pipeline are replaced by neural networks, usually without changing the overall processing flow; RNNoise [15] is a typical example.

• Mask-based methods: a neural network is trained to predict a time-frequency mask, which is applied to the spectrum of the noisy input to reconstruct the clean speech signal.
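The mask-based idea above can be illustrated with a minimal NumPy sketch of the ideal ratio mask (IRM) [16]; the toy spectra and the choice of `beta = 0.5` here follow the common convention rather than any configuration stated in this article:

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5):
    # IRM per time-frequency bin: (|S|^2 / (|S|^2 + |N|^2))^beta
    return (clean_mag ** 2 / (clean_mag ** 2 + noise_mag ** 2 + 1e-12)) ** beta

def apply_mask(noisy_mag, mask):
    # Element-wise masking of the noisy magnitude spectrum
    return noisy_mag * mask

# Toy magnitude spectra, shape (freq_bins, frames)
clean = np.array([[3.0, 0.1], [1.0, 2.0]])
noise = np.array([[1.0, 0.5], [1.0, 0.1]])
mask = ideal_ratio_mask(clean, noise)
enhanced = apply_mask(clean + noise, mask)
```

The mask is close to 1 in bins dominated by speech and close to 0 in bins dominated by noise; the network is trained to predict it from the noisy input alone.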
Commonly used time-frequency masks include the IRM [16], PSM [17], and cIRM [18]; the loss function used during training is:

![](https://static001.geekbang.org/infoq/7b/7bfd06e316f862424cc218b3af46150b.png)

• Mapping-based methods: a neural network is trained to map features directly; common features include the magnitude spectrum, log-power spectrum, and complex spectrum. The loss function used during training is:

![](https://static001.geekbang.org/infoq/1d/1dba6a11b02e72106ddceaed60bdb9be.png)

• End-to-end methods: these push the data-driven idea to the limit. Given a dataset with a reasonable distribution, they skip the frequency-domain transform and learn a direct numerical mapping on the time-domain speech signal; this has been one of the most active academic research directions in the last couple of years.

## The AliCloudDenoise Speech Enhancement Algorithm

### 1. Algorithm Principle

After weighing the business scenarios against denoising quality, performance overhead, real-time constraints, and other factors, the AliCloudDenoise algorithm adopts the hybrid approach. It takes the ratio of noise energy to target-speech energy in the noisy signal as the fitting target, feeds that estimate into a classic gain estimator from signal processing, such as the minimum mean-square error short-time spectral amplitude (MMSE-STSA) estimator, to obtain per-frequency denoising gains, and finally applies the inverse transform to obtain the enhanced time-domain speech signal.
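To make the "SNR estimate → per-bin gain → masked spectrum" chain concrete, here is a rough sketch using the simpler Wiener gain as a stand-in; the production system uses an MMSE-STSA estimator, whose gain formula is more involved, so this is an illustration of the structure rather than the actual estimator:

```python
import numpy as np

def wiener_gain(xi):
    # Wiener gain G = xi / (1 + xi), where xi is the a priori SNR (linear scale).
    # Stand-in for MMSE-STSA: same pipeline shape, simpler formula.
    return xi / (1.0 + xi)

def enhance_frame(noisy_spec, xi):
    # Apply the per-bin gain to one frame of the noisy (complex) spectrum;
    # an inverse STFT of the gained spectrum would give the time-domain output.
    return wiener_gain(xi) * noisy_spec

xi = np.array([9.0, 1.0, 0.0])        # speech-dominated, mixed, noise-only bins
noisy = np.array([1 + 1j, 2 + 0j, 3 - 1j])
enhanced = enhance_frame(noisy, xi)
```

High-SNR bins pass almost unchanged (gain 0.9 at xi = 9), while noise-only bins are zeroed out.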
For the network architecture, to balance real-time performance and power consumption, RNN-style structures were set aside in favor of a TCN; the basic network structure is shown below:

![](https://static001.geekbang.org/infoq/8b/8b86b9d595632c182eb5f94367240e29.png)

### 2. Algorithm Optimizations for Real-Time Meetings

#### 1. What if the people around you are loud during a meeting?

##### Background

A fairly common class of background noise in real-time meetings is babble noise: the overlapping chatter of several speakers. Such noise is not only non-stationary but also similar in composition to the target speech, which makes it harder for the algorithm to suppress. A concrete example:

![](https://static001.geekbang.org/infoq/22/22cb5db66d15b392e84bd2487245d01b.gif)

##### Analysis and Improvement

After analyzing dozens of hours of office recordings containing babble noise, and taking the human speech production mechanism into account, we found that this noise behaves as quasi-long-term stationary. As is well known, contextual information strongly affects speech enhancement quality, so for babble noise, which is especially sensitive to context, the AliCloudDenoise algorithm uses dilated convolutions to systematically aggregate key stage-wise features and explicitly enlarge the receptive field, and additionally incorporates gating mechanisms. The improved model handles babble noise markedly better. The figure below compares the key model components before (TCN) and after (GaTCN) the improvement.
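The two ingredients named above, dilated convolutions for a larger receptive field plus gating, can be sketched in a toy form; the kernel sizes, dilation schedule, and GLU-style gating below are illustrative assumptions, not the actual GaTCN layer configuration:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    # 1-D causal dilated convolution: y[t] = sum_j w[j] * x[t - j*dilation].
    # Left zero-padding guarantees no future samples leak into y[t].
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

def gated_block(x, w_filter, w_gate, dilation):
    # GLU-style gating: tanh(filter branch) * sigmoid(gate branch)
    f = np.tanh(causal_dilated_conv(x, w_filter, dilation))
    g = 1.0 / (1.0 + np.exp(-causal_dilated_conv(x, w_gate, dilation)))
    return f * g

def receptive_field(kernel, dilations):
    # Receptive field of a stack of dilated convs: 1 + (k - 1) * sum(dilations)
    return 1 + (kernel - 1) * sum(dilations)
```

Doubling the dilation per layer grows the receptive field exponentially with depth, which is exactly what quasi-stationary babble noise benefits from: `receptive_field(3, [1, 2, 4, 8])` already covers 31 frames with only four layers.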
![](https://static001.geekbang.org/infoq/d3/d3e8b624291d94cb143df20b568f05d1.png)

On the speech test set, the proposed GaTCN model improves speech quality (PESQ [19]) by 9.7% and intelligibility (STOI [20]) by 3.4% over the TCN model under the IRM target; under the mapped a priori SNR target [21], PESQ improves by 7.1% and STOI by 2.0% over TCN, and GaTCN outperforms all baseline models. See Tables 1 and 2 for details.

![](https://static001.geekbang.org/infoq/6d/6d68b78f31c85caf04c2bc9f7fb67061.png)

Table 1: Objective speech quality (PESQ) comparison

![](https://static001.geekbang.org/infoq/fc/fcd7ac5f26741e95c4412a43d5dba03f.png)

Table 2: Objective speech intelligibility (STOI) comparison

Improvement demo:

![](https://static001.geekbang.org/infoq/63/633575836fa6a1d7c068472b05f5e060.gif)

#### 2. How can words go missing at the critical moment?

##### Background

In speech enhancement, swallowed or dropped words, such as vanishing sentence-final syllables, are a major factor in the subjective quality of the enhanced speech. In real-time meetings, with their variety of languages and speaking content, the phenomenon is even more common. A concrete example:

![](https://static001.geekbang.org/infoq/45/45cf1f4ad3895c9a30acb593f3943d27.gif)

##### Analysis and Improvement

On a categorized test set of 10,000+ utterances, we collected statistics on when swallowed or dropped words occur after enhancement and visualized the corresponding frequency-domain features. The phenomenon concentrates on a few specific phoneme and word classes: unvoiced sounds, reduplicated syllables, and prolonged sounds. Grouping the statistics by SNR also showed that dropped words increase significantly at low SNR. Based on this, we made improvements on three fronts:

• Data: we first measured the distribution of the affected phonemes in the training data, found them under-represented, and enriched the speech content of the training set accordingly.

• Denoising strategy: for low-SNR conditions, apply combined denoising in specific cases, i.e. traditional denoising first, then AliCloudDenoise. The drawbacks are twofold: combined denoising increases computational cost, and traditional denoising inevitably causes spectral-level damage that lowers overall audio quality. In tests this did reduce dropped words, but because of its clear drawbacks it was not deployed.

• Training strategy: enriching the training data did reduce dropped words, but did not eliminate them. Further analysis showed that the spectral features of the affected sounds are highly similar to those of certain noises, making local convergence difficult for the network. The AliCloudDenoise algorithm therefore adopts a training strategy with an auxiliary speech presence probability (SPP) output that is supervised during training but discarded at inference. The SPP is computed as:

![](https://static001.geekbang.org/infoq/df/df01c8ce016234e1ce2f13970d64bf08.png)
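An auxiliary output like the SPP typically enters training as an extra loss term on a second head that is simply dropped at inference. A minimal sketch of such a joint loss follows; the MSE/BCE pairing and the `aux_weight` value are assumptions for illustration, not the published training configuration:

```python
import numpy as np

def joint_loss(gain_pred, gain_target, spp_pred, spp_target, aux_weight=0.3):
    # Main task: regress the denoising gain (mean squared error)
    mse = np.mean((gain_pred - gain_target) ** 2)
    # Auxiliary task: binary cross-entropy against the SPP target;
    # this head is supervised only during training and ignored at inference.
    eps = 1e-12
    bce = -np.mean(spp_target * np.log(spp_pred + eps)
                   + (1.0 - spp_target) * np.log(1.0 - spp_pred + eps))
    return mse + aux_weight * bce

gain_t = np.array([0.9, 0.1, 0.5])
spp_t = np.array([1.0, 0.0, 1.0])
perfect = joint_loss(gain_t, gain_t, spp_t, spp_t)
off = joint_loss(gain_t + 0.2, gain_t, np.array([0.6, 0.4, 0.6]), spp_t)
```

The extra gradient signal from the SPP head helps the network separate speech-like noise from true speech, without adding any inference-time cost.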
On the speech test set, the proposed dual-output auxiliary training strategy improves PESQ by 3.1% and STOI by 1.2% over the original model under the IRM target, and PESQ by 4.0% and STOI by 0.7% under the mapped a priori SNR target, outperforming all baseline models. See Tables 3 and 4 for details.

![](https://static001.geekbang.org/infoq/e1/e17133473fc1ad62a0c913958592393a.png)

Table 3: Objective speech quality (PESQ) comparison

![](https://static001.geekbang.org/infoq/ed/edf5d7c4c245ab590e6b950bb7f10df0.png)

Table 4: Objective speech intelligibility (STOI) comparison

Improvement demo:

![](https://static001.geekbang.org/infoq/a7/a7ae94a9b0f7c7d6334e7cad6a473e07.gif)

### 3. Widening the Range of Devices the Algorithm Can Run On

For real-time meetings, the AliCloudDenoise algorithm typically runs on PCs, mobile devices, and IoT devices. Although energy requirements differ across these environments, CPU usage, memory footprint and bandwidth, and battery drain are all key performance metrics we watch. To let AliCloudDenoise serve a broad range of products, we applied a series of efficiency optimizations, chiefly structured model pruning, a resource-adaptive strategy, and weight and training-time quantization. Together with some auxiliary convergence tricks, we ended up with a roughly 500 KB speech enhancement model at an accuracy cost on the order of 0.1%, greatly widening the algorithm's applicability.

Below we first briefly review the model-lightweighting techniques involved, then introduce the resource-adaptive strategy and model quantization, and finally give AliCloudDenoise's key efficiency numbers.

#### 1. Model-Lightweighting Techniques Used

Lightweighting a deep model generally means optimizing its "running costs": parameter count and model size, compute, energy, and speed, so that the model is easy to deploy on all kinds of hardware. The same techniques are also widely useful for compute-heavy cloud services, where they cut serving cost and improve response speed.

The main difficulty is that while cutting running costs, the algorithm's quality, generalization, and stability must not suffer noticeably, which is hard on every front for the usual "black-box" neural models. A further difficulty lies in the divergence of optimization goals: shrinking model size does not necessarily reduce compute; reducing compute does not necessarily increase speed; and higher speed does not necessarily lower energy use. Because of this divergence, no single technique solves all the performance problems at once; multiple angles and techniques must be combined to lower overall running cost.

Common lightweighting techniques in academia and industry include parameter/compute quantization, pruning, compact building blocks, architecture hyperparameter search, distillation, low-rank factorization, and weight sharing. Each serves a different goal. Weight-only quantization compresses the stored model but still computes in floating point; full weight-plus-compute quantization shrinks the parameters and reduces on-chip compute, but needs matching hardware arithmetic units to actually deliver a speedup; knowledge distillation trains a small student network on a large model's high-level features to obtain a lightweight model of matching quality, but optimization is tricky and it mainly suits tasks with compact outputs (such as classification).

Unstructured fine-grained pruning removes the most redundant parameters and gives excellent compression, but needs dedicated hardware support to reduce actual compute; weight sharing can significantly shrink model size but rarely speeds things up or saves energy; AutoML architecture search can automatically find the stacking structure that tests best at small scale, but search-space complexity and the quality of its iterative estimates limit its applicability. The figure below shows the lightweighting techniques AliCloudDenoise mainly used during efficiency optimization.

![](https://static001.geekbang.org/infoq/99/9941a51fe811c87fdfb3a779a16bf953.png)

#### 2. Resource-Adaptive Strategy

The core idea of the resource-adaptive strategy is that when resources are scarce the model adaptively emits a lower-accuracy result that still meets the required bounds, and when resources are sufficient it emits its best, full-accuracy result. The most direct implementation, training models of several sizes and storing them all on the device to use as needed, adds storage cost, so the AliCloudDenoise algorithm uses a tiered training scheme instead, shown below:

![](https://static001.geekbang.org/infoq/b4/b4865dedbf4899acf830ab5cd425ae17.png)

Intermediate layers also produce outputs, and a joint loss constrains all outputs in one training run. In practice, however, two problems appeared:

• The features extracted by shallow layers are rather basic, so the shallow exits enhance poorly.

• After adding intermediate output heads, the final layer's output degrades: joint training pushes the shallow layers to produce decent enhancement on their own, disturbing the layer-by-layer feature distribution the original network had learned.

To address both problems we adopted multi-scale dense connections plus offline hyperparameter pre-pruning, which keeps the accuracy spread of the on-demand outputs within 3.2%.

#### 3. Model Quantization

For memory footprint and bandwidth, we mainly used the MNN team's weight-quantization tool [22] and Python offline quantization tool [23] to convert between FP32 and INT8; the scheme is sketched below:

![](https://static001.geekbang.org/infoq/61/616a206afe8db38b8be1bb355783ea5e.png)
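The FP32-to-INT8 idea can be illustrated with a generic symmetric quantization scheme; this is a simplified sketch of the principle, not MNN's actual implementation (which handles scales per channel and calibration offline):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor INT8 quantization: w ≈ scale * q, q in [-127, 127]
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original FP32 weights
    return q.astype(np.float32) * scale

w = np.array([-1.27, 0.0, 0.5, 1.27], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

Each weight shrinks from 4 bytes to 1, a 4x reduction in model storage and memory bandwidth, at the cost of a bounded rounding error per weight.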
#### 4. AliCloudDenoise's Key Efficiency Metrics

![](https://static001.geekbang.org/infoq/15/15af768261fd6d0fc3e2f7793f4add85.png)

As shown above, on the Mac platform the competitor's library is 14 MB, while AliCloudDenoise currently ships libraries of 524 KB, 912 KB, and 2.6 MB, a clear advantage. On runtime cost, Mac measurements show the competitor at 3.4% CPU usage, versus 1.1% for the 524 KB AliCloudDenoise library, 1.3% for the 912 KB library, and 2.7% for the 2.6 MB library; especially over long runs, AliCloudDenoise has a clear edge.

### 4. Quality Benchmark Results

Evaluation of AliCloudDenoise's enhancement quality currently focuses on two scenarios: a general-purpose scenario and an office-meeting scenario.

#### 1. General-Scenario Results

In the general test set, the speech data consists of Chinese and English parts (about 5,000 utterances in total), and the noise data covers four typical classes: stationary noise, non-stationary noise, babble noise, and outdoor noise. Noise levels are set between -5 and 15 dB SNR. The objective metrics are PESQ for speech quality and STOI for intelligibility; for both, higher means better enhanced speech.
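Test sets like this are typically built by scaling recorded noise to a target SNR before adding it to clean speech. A minimal sketch of that mixing step follows (the exact corpus-construction procedure is not described in the article, so treat this as the standard recipe rather than the one used here):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db,
    # then add it to the clean speech to form the noisy test utterance.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    scaled = noise * np.sqrt(target_p_noise / p_noise)
    return speech + scaled

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz (toy signal)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, 5.0)
```

Sweeping `snr_db` from -5 to 15 in this way yields the noise-level range used in the evaluation.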
As the table below shows, on the general-scenario speech test set the AliCloudDenoise 524 KB library improves PESQ over the traditional algorithm by 39.4% (English) and 48.4% (Chinese), and STOI by 21.4% (English) and 23.1% (Chinese), while roughly matching the competitor algorithm. The AliCloudDenoise 2.6 MB library improves on the competitor by 9.2% (English) and 3.9% (Chinese) in PESQ, and by 0.4% (English) and 1.6% (Chinese) in STOI, a clear quality advantage.

![](https://static001.geekbang.org/infoq/57/57fca25a0976ce7e2f0c1b61160dc3f6.png)

#### 2. Office-Scenario Results

To match the acoustic conditions of real-time meetings, we ran a separate evaluation for offices, with noise recorded in real, busy office environments, building about 5.3 hours of noisy evaluation speech. The figure below compares the AliCloudDenoise 2.6 MB library against Competitor 1, Competitor 2, Traditional 1, and Traditional 2 on the SNR, P.563, PESQ, and STOI metrics; the AliCloudDenoise 2.6 MB library shows a clear advantage.

![](https://static001.geekbang.org/infoq/9b/9b28dcba382ed06b8d2387ffb13a1470.png)

## Outlook

In real-time communication there are still many research directions for AI + audio processing to explore and bring to production. Fusing data-driven ideas with classic signal processing can upgrade the audio front-end (ANS, AEC, AGC), the audio back-end (bandwidth extension, real-time voice beautification, voice changing, sound effects), audio codecs, and weak-network audio processing (PLC, NetEQ), delivering the best possible audio experience to Alibaba Cloud Video Cloud users.

## References
[1] Wang D L, Chen J. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(10): 1702-1726.
[2] https://venturebeat.com/2020/04/09/microsoft-teams-ai-machine-learning-real-time-noise-suppression-typing/
[3] https://venturebeat.com/2020/06/08/google-meet-noise-cancellation-ai-cloud-denoiser-g-suite/
[4] https://medialab.qq.com/#/projectTea
[5] Gannot S, Burshtein D, Weinstein E. Iterative and sequential Kalman filter-based speech enhancement algorithms. IEEE Transactions on Speech and Audio Processing, 1998, 6(4): 373-385.
[6] Kim J B, Lee K Y, Lee C W. On the applications of the interacting multiple model algorithm for enhancing noisy speech. IEEE Transactions on Speech and Audio Processing, 2000, 8(3): 349-352.
[7] Ephraim Y, Van Trees H L. A signal subspace approach for speech enhancement. IEEE Transactions on Speech and Audio Processing, 1995, 3(4): 251-266.
[8] Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(6): 1109-1121.
[9] Cohen I. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Transactions on Speech and Audio Processing, 2003, 11(5): 466-475.
[10] Ciregan D, Meier U, Schmidhuber J. Multi-column deep neural networks for image classification. 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012: 3642-3649.
[11] Graves A, Liwicki M, Fernández S, et al. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 31(5): 855-868.
[12] Senior A, Vanhoucke V, Nguyen P, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.
[13] Sundermeyer M, Ney H, Schlüter R. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(3): 517-529.
[14] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems. 2014: 3104-3112.
[15] Valin J M. A hybrid DSP/deep learning approach to real-time full-band speech enhancement. 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2018: 1-5.
[16] Wang Y, Narayanan A, Wang D L. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12): 1849-1858.
[17] Erdogan H, Hershey J R, Watanabe S, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015: 708-712.
[18] Williamson D S, Wang Y, Wang D L. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 24(3): 483-492.
[19] ITU-T Recommendation P.862. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. 2001.
[20] Taal C H, Hendriks R C, Heusdens R, et al. A short-time objective intelligibility measure for time-frequency weighted noisy speech. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2010: 4214-4217.
[21] Nicolson A, Paliwal K K. Deep learning for minimum mean-square error approaches to speech enhancement. Speech Communication, 2019, 111: 44-55.
[22] https://www.yuque.com/mnn/cn/model_convert
[23] https://github.com/alibaba/MNN/tree/master/tools/MNNPythonOfflineQuant