Basic Concepts of Speech (translated from CMU Sphinx)


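This is the first part of the CMU Sphinx speech recognition system wiki, which introduces some basic concepts of speech. I have tried to translate it. My English is limited, so the translation is bound to contain mistakes; corrections are very welcome!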

Basic concepts of speech


Speech is a complex phenomenon. People rarely understand how it is produced and perceived. The naive perception is often that speech is built with words, and each word consists of phones. The reality is unfortunately very different. Speech is a dynamic process without clearly distinguished parts. It's always useful to open a recording of speech in a sound editor, look at it and listen to it. Here is, for example, a speech recording in an audio editor.

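Speech is a complex phenomenon. We know very little about how it is produced and perceived. Our most basic intuition is that speech is made of words, and each word is made of phones. The reality is quite different. Speech is a dynamic process with no clearly separated parts. Viewing a recording in audio editing software is a fairly effective way to understand speech. Below is an example of how a recording appears in an audio editor.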

All modern descriptions of speech are to some degree probabilistic. That means that there are no certain boundaries between units, or between words. Speech to text translation and other applications of speech are never 100% correct. That idea is rather unusual for software developers, who usually work with deterministic systems. And it creates a lot of issues specific only to speech technology.

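All current descriptions of speech are, to some degree, based on probability. This means there are no definite boundaries between speech units or between words. Speech recognition can never reach 100% accuracy. This idea is somewhat surprising to software developers, because the systems they work on are usually deterministic. It also gives rise to many problems that are specific to speech technology.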

Structure of speech


In current practice, speech structure is understood as follows:

Here, the structure of speech is understood in the following way:

Speech is a continuous audio stream where rather stable states mix with dynamically changing states. In this sequence of states, one can define more or less similar classes of sounds, or phones. Words are understood to be built of phones, but this is certainly not true. The acoustic properties of a waveform corresponding to a phone can vary greatly depending on many factors - phone context, speaker, style of speech and so on. The so-called coarticulation makes phones sound very different from their “canonical” representation. Next, since transitions between words are more informative than stable regions, developers often talk about diphones - parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units - different substates of a phone. Often three or more regions of a different nature can easily be found.

The number three is easily explained. The first part of the phone depends on its preceding phone, the middle part is stable, and the last part depends on the subsequent phone. That's why there are often three states in a phone selected for HMM recognition.

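Speech is a continuous audio stream made up of mostly stable states mixed with some dynamically changing states.

The sound (waveform) of a word actually depends on many factors besides its phones, such as phone context, the speaker and the style of speech.

Coarticulation (a sound changes under the influence of its neighbouring sounds; in terms of the mechanics of speech production, the vocal organs can only change gradually when moving from one sound to the next, so the spectrum of the later sound differs from its spectrum under other conditions) makes phones sound different from their canonical form, so we need context to identify phones. A phone can be divided into several subphonetic units. The number three is a natural choice: the first part of a phone is related to the preceding phone, the middle part is stable, and the last part is related to the next phone. This is why a three-state HMM is chosen for each phone in HMM-based speech recognition. (Context-dependent modeling takes this influence into account so that the model describes speech more accurately. A model that considers only the preceding sound is called a Bi-Phone; one that considers both the preceding and the following sound is called a Tri-Phone.)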

Sometimes phones are considered in context. There are triphones or even quinphones. But note that unlike phones and diphones, they are matched with the same range in waveform as just phones. They just differ by name. That's why we prefer to call this object senone. A senone's dependence on context could be more complex than just left and right context. It can be a rather complex function defined by a decision tree, or in some other way.

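Sometimes phones are considered in context, which gives triphones or other multi-phone units. Unlike subphonetic units, they match the same range of the waveform as a single phone; only the name differs, so we prefer to call such a unit a senone. A senone's dependence on context can be more complex than just left and right context: it can be a complex function defined by a decision tree or in some other way. (Context-dependent modeling for English usually takes the phone as the basic unit. Because some phones influence the following phone in similar ways, model parameters can be shared by clustering the decoding states of phones. The result of the clustering is called a senone. A decision tree implements an efficient mapping from triphones to senones: by answering a series of questions about the classes of the preceding and following sounds (vowel/consonant, voiceless/voiced and so on), it determines which senone an HMM state should use. Classification and regression tree (CART) models are used to annotate the pronunciation of words as phones.)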
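To make the idea of phones in context concrete, here is a minimal sketch that labels each phone with its left and right neighbours. The `left-phone+right` naming, the `SIL` boundary symbol and the function name `to_triphones` are assumptions chosen for illustration; the text above does not prescribe any notation, and the sketch does not attempt the senone clustering step.

```python
def to_triphones(phones, boundary="SIL"):
    """Label each phone with its left and right neighbours.

    The boundary symbol stands in for silence at the utterance edges
    (an assumption made for this illustration).
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# The word "three" written with three phones.
print(to_triphones(["TH", "R", "IY"]))
# ['SIL-TH+R', 'TH-R+IY', 'R-IY+SIL']
```

Each label still covers the same stretch of waveform as the plain phone in the middle; only the name carries the extra context.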

Next, phones build subword units, like syllables. Sometimes, syllables are defined as “reduction-stable entities”. To illustrate, when speech becomes fast, phones often change, but syllables remain the same. Also, syllables are related to intonational contour. There are other ways to build subwords - morphologically-based in morphology-rich languages or phonetically-based. Subwords are often used in open vocabulary speech recognition.

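Phones build subword units such as syllables. A syllable is a relatively stable entity: when speech becomes fast, phones often change, but syllables stay the same. Syllables are related to the intonational contour. There are several ways to build subwords: morphologically based or phonetically based. Subwords are often used in open vocabulary speech recognition.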

Subwords form words. Words are important in speech recognition because they restrict combinations of phones significantly. If there are 40 phones and an average word has 7 phones, there could be as many as 40^7 words. Luckily, even a very educated person rarely uses more than 20k words in practice, which makes recognition far more feasible.

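Subword units (syllables) form words. Words are important in speech recognition because they constrain the combinations of phones. If there are 40 phones and each word has 7 phones on average, there could be 40^7 words. Fortunately, even a highly educated person rarely uses more than 20k words, which makes recognition feasible.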
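The arithmetic behind this claim is easy to check. The sketch below uses only the figures quoted in the text (40 phones, 7 phones per word, a 20k-word vocabulary); the constant and function names are made up for the example.

```python
# Figures quoted in the text above.
NUM_PHONES = 40
PHONES_PER_WORD = 7
ACTIVE_VOCABULARY = 20_000


def possible_phone_strings(num_phones: int, length: int) -> int:
    """Count every phone string of the given length."""
    return num_phones ** length


total = possible_phone_strings(NUM_PHONES, PHONES_PER_WORD)
print(f"{total:,} possible 7-phone strings")
print(f"about {total // ACTIVE_VOCABULARY:,} phone strings per vocabulary word")
```

This prints 163,840,000,000 possible strings, roughly eight million for every word a speaker actually uses, which is why restricting the search to real words helps so much.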

Words and other non-linguistic sounds, which we call fillers (breath, um, uh, cough), form utterances. Utterances are separate chunks of audio between pauses. They don't necessarily match sentences, which are more semantic concepts.

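Words and certain non-linguistic sounds form utterances; we call the non-linguistic sounds fillers (breath, um, uh, cough and so on). Utterances are chunks of audio separated by pauses. They do not necessarily correspond to sentences, which are more of a semantic concept.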

On top of this, there are dialog acts like turns, but they go beyond the scope of this document.

Recognition process


The common way to recognize speech is the following: we take a waveform, split it into utterances at silences, and then try to recognize what is being said in each utterance. To do that, we want to take all possible combinations of words and try to match them with the audio. We choose the best matching combination. There are a few important concepts in this matching.

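The general approach to speech recognition is: record a speech waveform, split it into utterances at silences, then recognize what is said in each utterance. To do this, we need to match all possible combinations of words against the audio and choose the combination that matches best.

There are a few key concepts to understand in this matching: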

First of all, there is the concept of features. Since the number of parameters is large, we try to reduce it. The numbers are usually calculated by dividing the speech into frames. Then, for each frame, typically 10 milliseconds long, we extract 39 numbers that represent the speech. That is called a feature vector. How to generate these numbers is a subject of active investigation, but in the simple case they are derived from the spectrum.

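Features:

Describing speech requires a very large number of parameters, which places heavy demands on processing speed (and there is no need to process that much information; we only need what helps recognition). So we need to optimize by reducing the dimensionality. We split the speech waveform into frames of roughly 10 ms each and extract from every frame 39 numbers that represent the speech in it. These 39 numbers are the features of the frame, expressed as a feature vector. How to extract feature vectors is a hot research topic, but in the simple case the methods are derived from the spectrum.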
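The framing step can be sketched in a few lines. This is a toy illustration, not the CMU Sphinx front end: the 16 kHz sample rate, the non-overlapping frames, the synthetic tone and the single log-energy feature are all assumptions made to keep the example short. A real front end would produce the 39 numbers per frame mentioned above and typically uses overlapping analysis windows.

```python
import math


def split_into_frames(samples, sample_rate=16000, frame_ms=10):
    """Cut a list of samples into consecutive, non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def log_energy(frame):
    """One toy feature: log of the frame's mean squared amplitude."""
    power = sum(x * x for x in frame) / len(frame)
    return math.log(power + 1e-10)  # small constant avoids log(0) on silence


# One second of a synthetic 440 Hz tone stands in for real speech.
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
frames = split_into_frames(signal)
features = [[log_energy(f)] for f in frames]  # 1 number per frame, not 39

print(len(frames), "frames of", len(frames[0]), "samples each")
```

The point of the sketch is the shape of the data: one short feature vector per 10 ms frame replaces 160 raw samples.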

Second, there is the concept of the model. A model describes some mathematical object that gathers common attributes of the spoken word. In practice, the audio model of a senone is a Gaussian mixture of its three states; to put it simply, it is the most probable feature vector. The concept of the model raises the following questions: how well does the model fit practice, can the model be made better given its internal problems, and how well does the model adapt to changed conditions?

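Model:

A model describes a mathematical object that captures common attributes of spoken words. In practice, the audio model of a senone is a three-state Gaussian mixture model. Put simply, it is the most probable feature vector. Several questions arise about the model: how well does it describe reality? Can it perform better within its own limitations? How well does it adapt to changed conditions?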
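As a rough illustration of how a Gaussian mixture scores a feature vector, the sketch below computes the log-likelihood of a vector under a mixture with diagonal covariances. The two-component mixture, its parameters and the function names are invented for the example; a trained acoustic model would hold many such mixtures with parameters learned from data.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities (via log-sum-exp)."""
    terms = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical two-component mixture over 2-dimensional features.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near = gmm_log_likelihood([0.1, -0.1], weights, means, variances)
far = gmm_log_likelihood([10.0, 10.0], weights, means, variances)
print(near > far)  # a vector close to a component mean scores higher
```

A feature vector near one of the component means gets a high score, which is the sense in which the model stores a "most probable feature vector".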

Third, there is the matching process itself. Since comparing all feature vectors with all models would take longer than the universe has existed, the search is often optimized with many tricks. At any point we maintain the best matching variants and extend them as time goes on, producing the best matching variants for the next frame.

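Matching algorithm:

Speech recognition would have to compare all feature vectors against all models, which is extremely time-consuming. The search is therefore optimized with various tricks: at each point we keep the best matching variants and use them to produce the best matching variants for the next frame.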
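The idea of keeping only the best variants from frame to frame can be sketched as a small beam search over states. Everything here is illustrative: the three state names, the log scores and the beam width are made up, and real decoders add many refinements on top of this.

```python
def beam_viterbi(frame_scores, transitions, start_state, beam_width=2):
    """Keep only the best `beam_width` partial paths after every frame.

    frame_scores: one dict per frame mapping state -> log score.
    transitions:  dict mapping state -> list of states reachable next.
    Returns (total log score, best path).
    """
    # Each hypothesis is (accumulated log score, path so far).
    hyps = [(frame_scores[0][start_state], [start_state])]
    for scores in frame_scores[1:]:
        extended = {}
        for total, path in hyps:
            for nxt in transitions[path[-1]]:
                cand = total + scores[nxt]
                # Of the paths ending in the same state, keep the best one.
                if nxt not in extended or cand > extended[nxt][0]:
                    extended[nxt] = (cand, path + [nxt])
        ranked = sorted(extended.values(), key=lambda h: h[0], reverse=True)
        hyps = ranked[:beam_width]  # prune everything outside the beam
    return hyps[0]


# Hypothetical left-to-right topology with three states.
transitions = {
    "begin": ["begin", "middle"],
    "middle": ["middle", "end"],
    "end": ["end"],
}
frame_scores = [
    {"begin": -0.1, "middle": -3.0, "end": -5.0},
    {"begin": -2.0, "middle": -0.2, "end": -4.0},
    {"begin": -3.0, "middle": -0.3, "end": -2.5},
    {"begin": -4.0, "middle": -2.0, "end": -0.1},
]
score, path = beam_viterbi(frame_scores, transitions, "begin")
print(path)  # ['begin', 'middle', 'middle', 'end']
```

With a narrow beam the search is fast but may discard the path that would have won later, which is the trade-off between precision and speed that decoders have to manage.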

Models

According to the speech structure, three models are used in speech recognition to do the match:

An acoustic model contains acoustic properties for each senone. There are context-independent models that contain properties (most probable feature vectors for each phone) and context-dependent ones (built from senones with context).

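Acoustic model:

An acoustic model contains the acoustic properties of each senone. There are context-independent models (the most probable feature vectors for each phone) and context-dependent models (senones built with context).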

A phonetic dictionary contains a mapping from words to phones. This mapping is not very effective: for example, only two or three pronunciation variants are noted in it. It is practical enough most of the time, though. The dictionary is not the only way to map words to phones. The mapping could also be done with some complex function learned by a machine learning algorithm.

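Phonetic dictionary:

The dictionary contains the mapping from words to phones.

A dictionary is not the only way to describe the mapping from words to phones. The mapping can also be performed by a complex function learned with a machine learning algorithm.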
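In its simplest form, a phonetic dictionary is just a lookup table. The sketch below is a hypothetical two-word lexicon with ARPAbet-style phone symbols; the entries, the table name `PRONUNCIATIONS` and the function `phones_for` are illustrative and are not taken from any shipped dictionary file.

```python
# Hypothetical lexicon: each word maps to one or more pronunciation variants.
PRONUNCIATIONS = {
    "hello": [["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]],
    "world": [["W", "ER", "L", "D"]],
}


def phones_for(sentence, lexicon=PRONUNCIATIONS):
    """Return the first listed pronunciation of every word, concatenated."""
    phones = []
    for word in sentence.lower().split():
        if word not in lexicon:
            # A plain table cannot pronounce words it has never seen.
            raise KeyError(f"out-of-vocabulary word: {word}")
        phones.extend(lexicon[word][0])
    return phones


print(phones_for("Hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

The `KeyError` branch shows the limitation that motivates learned mappings: a table only covers the words someone has entered.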

A language model is used to restrict word search. It defines which word could follow previously recognized words (remember that matching is a sequential process) and helps to significantly restrict the matching process by stripping words that are not probable. The most common language models are n-gram language models, which contain statistics of word sequences, and finite state language models, which define speech sequences by a finite state automaton, sometimes with weights. To reach a good accuracy rate, your language model must be very successful in search space restriction. This means it should be very good at predicting the next word. A language model usually restricts the vocabulary considered to the words it contains. That's an issue for name recognition. To deal with this, a language model can contain smaller chunks like subwords or even phones. Please note that search space restriction in this case is usually worse and corresponding recognition accuracies are lower than with a word-based language model.

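Language model:

A language model is used to restrict the word search. It defines which words can follow the previously recognized word (matching is a sequential process), so that improbable words can be excluded from the matching process. The most widely used language models are n-gram models, which contain statistics of word sequences, and finite state models, which define speech sequences with a finite state machine, sometimes with weights. To reach good recognition accuracy, the language model must restrict the search space well, that is, it must predict the next word well. A language model restricts the vocabulary to the words it contains, which is a problem for name recognition (a name can be made up of arbitrary words). To handle this, a language model can contain smaller chunks such as subwords or even phones. In that case, however, recognition accuracy will be lower than with a word-based language model.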
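A bigram model is the smallest n-gram model that shows how word-sequence statistics restrict the search. The sketch below estimates unsmoothed probabilities from a made-up three-sentence corpus; the `<s>` start marker and the function names are assumptions for the example, and practical models add smoothing so that unseen word pairs do not get probability zero.

```python
from collections import Counter


def train_bigram(sentences):
    """Count how often each word follows each context word."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # <s> marks the sentence start
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams


def bigram_prob(contexts, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


corpus = ["open the door", "open the window", "close the door"]
contexts, bigrams = train_bigram(corpus)

for candidate in ("door", "window", "cat"):
    print(candidate, bigram_prob(contexts, bigrams, "the", candidate))
```

After "the", the model prefers "door" over "window" and rules out "cat" entirely, which is exactly the pruning effect described above, and also why out-of-vocabulary names are a problem.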

Those three entities are combined together in an engine to recognize speech. If you are going to apply your engine for some other language, you need to get such structures in place. For many languages there are acoustic models, phonetic dictionaries and even large vocabulary language models available for download.

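The acoustic model, the phonetic dictionary and the language model together make up a speech recognition engine. To recognize a different language, you need to provide these three parts for it. For many languages, acoustic models, dictionaries and even large vocabulary language models are already available for download.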

Other concepts used


A Lattice is a directed graph that represents variants of the recognition. Often, getting the best match is not practical; in that case, lattices are good intermediate formats to represent the recognition result.

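A Lattice is a directed graph that represents the different variants of the recognition. Obtaining the single best match is often not practical, so lattices are a good format for holding intermediate recognition results.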

N-best lists of variants are like lattices, though their representations are not as dense as the lattice ones.

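N-best lists are somewhat like lattices, but they are not as dense (that is, they retain fewer results than lattices). (N-best search and multi-pass search: to make use of various knowledge sources, the search is usually performed in several passes. The first pass uses inexpensive knowledge sources (such as the acoustic model, the language model and the phonetic dictionary) to produce a candidate list or word lattice. On that basis, a second pass uses expensive knowledge sources (such as 4-gram or 5-gram language models, or context-dependent models of order 4 or higher) to find the best path.)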
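The relation between a lattice and an N-best list can be shown with a tiny example. The sketch below stores a lattice as an adjacency list of `(next node, word, log score)` edges and lists the highest scoring paths; the graph, the scores and the function name `n_best` are invented, and the exhaustive enumeration only works for small acyclic lattices.

```python
def n_best(lattice, start, end, n):
    """Enumerate all start->end paths and return the n highest scoring.

    lattice: dict mapping node -> list of (next node, word, log score).
    The lattice must be acyclic, otherwise the enumeration never ends.
    """
    results = []
    stack = [(start, [], 0.0)]
    while stack:
        node, words, score = stack.pop()
        if node == end:
            results.append((score, words))
            continue
        for nxt, word, log_score in lattice.get(node, []):
            stack.append((nxt, words + [word], score + log_score))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:n]


# Hypothetical lattice for one utterance; node 0 is the start, node 4 the end.
lattice = {
    0: [(1, "recognize", -1.0), (2, "wreck", -2.0)],
    2: [(3, "a", -0.5)],
    3: [(1, "nice", -0.5)],
    1: [(4, "speech", -0.5), (4, "beach", -1.5)],
}

for score, words in n_best(lattice, 0, 4, 2):
    print(score, " ".join(words))
```

The lattice encodes four word sequences with only six edges, while the N-best list writes each sequence out in full, which is why lattices are the denser representation.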

Word confusion networks (sausages) are lattices where the strict order of nodes is taken from lattice edges.

A word confusion network is a lattice in which a strict ordering of nodes is derived from the lattice edges.

Speech database - a set of typical recordings from the task database. If we develop a dialog system, it might be dialogs recorded from users. For a dictation system, it might be recordings of read speech. Speech databases are used to train, tune and test the decoding systems.

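Speech database - a set of typical recordings from the task database. If we are developing a dialog system, the database contains dialogs recorded from multiple users. For a dictation system, it contains recordings of read speech. Speech databases are used to train, tune and test the decoding system (that is, the speech recognition system).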

Text databases - sample texts collected for language model training and so on. Usually, databases of texts are collected in sample text form. The issue with collection is converting existing documents (PDFs, web pages, scans) into spoken text form. That is, you need to remove tags and headings, expand numbers to their spoken form, and expand abbreviations.

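Text database - texts collected for training the language model. They are usually collected as sample texts. The difficulty in collection is converting existing documents such as PDFs, web pages and scans into spoken text form. We need to remove the tags and headers these documents bring with them, expand numbers into their spoken form (for example, 1 becomes "one" in English or "yi" in Chinese), and expand abbreviations into full words.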
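A minimal sketch of this kind of text normalization is shown below. The abbreviation table, the digit-by-digit reading of numbers and the function name `normalize` are simplifying assumptions; a production normalizer would read 42 as "forty two" and handle far more cases.

```python
import re

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def normalize(text):
    """Turn marked-up written text into a plain spoken-style word string."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop markup tags
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdecimal():
            # Simplification: read numbers digit by digit.
            words.extend(DIGITS[int(d)] for d in token)
        else:
            words.append(token.strip(".,!?;:"))
    return " ".join(w for w in words if w)


print(normalize("<p>Dr. Smith lives at 42 Main St.</p>"))
# doctor smith lives at four two main street
```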

What is optimized


When speech recognition is being developed, the most complex issue is to make search precise (consider as many variants to match as possible) and to make it fast enough to not run for ages. There are also issues with making the model match the speech since models aren't perfect.

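As speech recognition is developed, the hardest problem is making the search (that is, speech decoding, which should consider as many variants to match as possible) both accurate and fast. There is also the problem of matching speech to models when the models are not perfect.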

Usually the system is tested on a test database that is meant to represent the target task correctly.

In general, the system is verified on a test database that is meant to represent the target task correctly.

The following characteristics are used:

We use the following parameters to characterize system performance:

Word error rate. Suppose we have an original text of N words and a recognized text. Relative to the original, I words were inserted, D words were deleted and S words were substituted. The word error rate is

WER = (I + D + S) / N

WER is usually measured in percent.

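Word error rate: we have an original text of N words and a recognized text. (Recognizing a string of words inevitably produces insertion, substitution and deletion errors.) I is the number of inserted words, D the number of deleted words and S the number of substituted words. The word error rate is then defined as: WER = (I + D + S) / N

Word error rate is usually expressed as a percentage.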
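In practice the counts I, D and S are not known in advance; they come from aligning the recognized text with the original. A common way to get their sum is word-level edit distance, sketched below. The function name and the example sentences are made up, and the sketch assumes a non-empty reference text.

```python
def word_error_rate(reference, hypothesis):
    """WER = (I + D + S) / N, with N the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[-1][-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.1%}")  # one substitution and one deletion over 6 words
```

Because insertions count as errors, WER can exceed 100% when the recognizer outputs many more words than were spoken.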

Accuracy. It is almost the same thing as word error rate, but it doesn't count insertions.

Accuracy = (N - D - S) / N

Accuracy is actually a worse measure for most tasks, since insertions are also important in final results. But for some tasks, accuracy is a reasonable measure of the decoder performance.

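Accuracy. It is largely similar to word error rate, but it does not count inserted words. It is defined as: Accuracy = (N - D - S) / N

For most tasks, accuracy is actually a rather poor measure, because insertions also matter for the recognition result. For some tasks, though, accuracy is a reasonable measure of decoder performance.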
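A small numeric example shows why accuracy can look better than the output deserves. The counts below are hypothetical, and the two helper functions simply restate the formulas given in the text.

```python
def wer(n, insertions, deletions, substitutions):
    """Word error rate from error counts: (I + D + S) / N."""
    return (insertions + deletions + substitutions) / n


def accuracy(n, deletions, substitutions):
    """Accuracy as defined above: (N - D - S) / N, insertions ignored."""
    return (n - deletions - substitutions) / n


# Hypothetical result: 100 reference words, 20 spurious inserted words.
print(wer(100, 20, 5, 5))   # 0.3
print(accuracy(100, 5, 5))  # 0.9
```

The same output scores 90% accuracy but a 30% word error rate, because the 20 inserted words are invisible to the accuracy formula.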

Speed. Suppose the audio file was 2 hours and the decoding took 6 hours. Then speed is counted as 3xRT.

Speed: suppose the audio file is 2 hours long and decoding took 6 hours; then the speed is counted as 3xRT (three times real time).
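The real-time figure is simply decoding time divided by audio duration. A one-function sketch, with a made-up function name, using the numbers from the example above:

```python
def real_time_factor(decode_seconds, audio_seconds):
    """Decoding time divided by audio duration; below 1.0 is faster than real time."""
    return decode_seconds / audio_seconds


# 2 hours of audio decoded in 6 hours, as in the example above.
print(f"{real_time_factor(6 * 3600, 2 * 3600):g}xRT")  # 3xRT
```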

ROC curves. For detection tasks, where there are false alarms and hits/misses, ROC curves are used. A ROC curve is a graph that plots the number of false alarms against the number of hits. It is used to find the optimal point, where the number of false alarms is small and the number of hits approaches 100%.

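ROC curves: in a detection task there are false alarms and hits. ROC curves are used to evaluate detection performance. A ROC curve describes the relation between the number of false alarms and the number of hits. It can be used to find an optimal point where false alarms are fewest and hits are highest, that is, a hit rate close to 100%.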
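The points of such a curve can be computed by sweeping a threshold over the detector's scores. The sketch below uses rates rather than raw counts, which is the usual way to plot the curve; the scores, labels and function name are invented for the example, and it assumes the data contains at least one positive and one negative item.

```python
def roc_points(scores, labels):
    """Return (false alarm rate, hit rate) at every score threshold.

    labels: 1 for a true event, 0 for a non-event.
    An item counts as detected when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        hits = sum(
            1 for s, y in zip(scores, labels) if s >= threshold and y == 1
        )
        false_alarms = sum(
            1 for s, y in zip(scores, labels) if s >= threshold and y == 0
        )
        points.append((false_alarms / negatives, hits / positives))
    return points


# Hypothetical detector scores for five audio segments.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
for fa_rate, hit_rate in roc_points(scores, labels):
    print(f"false alarms {fa_rate:.2f}  hits {hit_rate:.2f}")
```

Lowering the threshold moves along the curve: more hits, but also more false alarms. Choosing the operating point is a matter of how costly each kind of error is for the task.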

There are other properties that aren't often taken into account but are still important for many practical applications. Your first task should be to build such a measure and apply it systematically during system development. Your second task is to collect the test database and test how your application performs.

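There are other ways to measure recognition performance that are not covered here but are still important for many practical applications. Your first task should be to build such a measure and apply it systematically during development. Your second task is to collect a test database to test your system's performance.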
