Speech Feature Formulas and Python Implementations

1. Zero-Crossing Rate

zero crossing rate
The number of times the signal crosses zero within each frame; it reflects the frequency content of the signal.
Z_n = \frac{1}{2}\sum_{m=0}^{N-1}|\mathrm{sgn}[x_n(m)]-\mathrm{sgn}[x_n(m-1)]|

import numpy as np

def stZCR(frame):
    # zero-crossing rate of one frame
    count = len(frame)
    count_z = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    return np.float64(count_z) / np.float64(count - 1)
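As a quick sanity check (my example, not from the original article; the sampling rate and frame length are assumptions): a pure sine of frequency f sampled at rate sr should give a ZCR close to 2f/sr per sample.

```python
import numpy as np

def stZCR(frame):
    # zero-crossing rate of one frame
    count = len(frame)
    count_z = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    return np.float64(count_z) / np.float64(count - 1)

sr = 16000                              # assumed sampling rate
t = np.arange(400) / sr                 # one 25 ms frame
frame = np.sin(2 * np.pi * 1000 * t)    # 1 kHz sine

zcr = stZCR(frame)
# a 1 kHz sine crosses zero about 2 * 1000 / 16000 = 0.125 times per sample
```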

2. Energy

energy
Short-time energy: the sum of squared samples within each frame, reflecting the strength of the signal.
E_n = \sum_{m=0}^{N-1}x_n^2(m)
(The implementation below returns the energy normalized by the frame length, i.e. the mean power.)

import numpy as np
def stEnergy(frame):
   return (np.sum(frame ** 2) / np.float64(len(frame)))
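A quick behaviour check (my addition, not from the article): a unit-amplitude sine spanning a whole number of periods has a mean power of 0.5.

```python
import numpy as np

def stEnergy(frame):
    # mean power of one frame (energy normalized by frame length)
    return np.sum(frame ** 2) / np.float64(len(frame))

sr = 8000                               # assumed sampling rate
t = np.arange(400) / sr                 # 400 samples = exactly 10 periods of 200 Hz
frame = np.sin(2 * np.pi * 200 * t)

e = stEnergy(frame)                     # mean of sin^2 over whole periods -> 0.5
```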

2.1 Shimmer: Decibel Form

shimmer in dB
Shimmer(dB) = {\frac{1}{N-1}}\sum_{i=1}^{N-1}|20\log_{10}(A_{i+1}/A_i)|

import numpy as np

eps = np.finfo(np.float64).eps  # small constant to avoid division by zero

def stShimmerDB(frame):
    '''
    amplitude shimmer, expressed as the variability of the
    peak-to-peak amplitude in decibels [3]
    '''
    count = len(frame)
    sigma = 0
    for i in range(count - 1):
        sigma += np.abs(20 * np.log10(np.abs(frame[i + 1] / (frame[i] + eps))))
    return np.float64(sigma) / np.float64(count - 1)
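To see the dB form in action (a made-up example, not from the article): a sequence whose consecutive amplitudes alternate between A and 2A should give a shimmer of |20 log10 2| ≈ 6.02 dB.

```python
import numpy as np

eps = np.finfo(np.float64).eps

def stShimmerDB(frame):
    # mean absolute dB ratio of consecutive amplitudes
    count = len(frame)
    sigma = 0
    for i in range(count - 1):
        sigma += np.abs(20 * np.log10(np.abs(frame[i + 1] / (frame[i] + eps))))
    return np.float64(sigma) / np.float64(count - 1)

amps = np.array([1.0, 2.0] * 5)   # alternating amplitudes A, 2A
shim = stShimmerDB(amps)          # every ratio is 2 or 1/2 -> 20*log10(2) ≈ 6.02 dB
```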

2.2 Shimmer: Relative Form

Shimmer(relative) = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}|A_i-A_{i+1}|}{\frac{1}{N}\sum_{i=1}^{N}A_i}

def stShimmerRelative(frame):
    '''
    shimmer relative is defined as average absolute difference between the amplitude
    of consecutive periods divided by the average amplitude, expressed as percentage
    [3]
    '''
    count = len(frame)
    sigma_diff = 0
    sigma_sum = 0
    for i in range(count):
        if i < count - 1:
            sigma_diff += np.abs(np.abs(frame[i]) - np.abs(frame[i + 1]))
        sigma_sum += np.abs(frame[i])
    return np.float64(sigma_diff / (count - 1)) / np.float64(sigma_sum / count + eps)

3. Intensity / Loudness

intensity / loudness

  • intensity: mean of squared input values multiplied by a Hamming window.
    Intensity and loudness are corresponding concepts; see the openSMILE source.
    intensity = \frac{\sum_{m=0}^{N-1}hamWin[m]\,x_n^2(m)}{\sum_{m=0}^{N-1}hamWin[m]}
    loudness = \left(\frac{intensity}{I_0}\right)^{0.3}, with reference intensity I_0 = 1\times10^{-12} in the standard definition (the openSMILE-style code below uses I_0 = 10^{-6})
###################
##
## from openSMILE
##
#####################
def stIntensity(frame):
    '''
    Windowed (Hamming) mean-square intensity of a frame and the
    corresponding loudness, following the openSMILE implementation.
    '''
    fn = len(frame)
    hamWin = np.hamming(fn)
    winSum = np.sum(hamWin)
    if winSum <= 0.0:
        winSum = 1.0
    I0 = 0.000001  # reference intensity used by openSMILE
    Im = 0
    for i in range(fn):
        Im += hamWin[i] * frame[i] ** 2  # accumulate windowed energy
    intensity = Im / winSum
    loudness = (intensity / I0) ** 0.3
    return intensity, loudness

4. Pitch (Fundamental Frequency)

Methods for estimating the fundamental frequency include the cepstrum method, the short-time autocorrelation method and linear prediction. This article uses the short-time autocorrelation method.
1) Pre-processing for pitch detection: voice activity detection
Because the head and tail of an utterance are not periodic, endpoint detection is applied first to improve the accuracy of pitch detection. The spectral-entropy method is used here.
Let the time-domain waveform be x(i); after windowing and framing, the n-th frame is x_n(m) and its FFT is X_n(k), where k indexes the spectral lines. The short-time energy of the frame in the frequency domain is
E_n = \sum_{k=0}^{N/2}X_n(k)X_n^*(k)
where N is the FFT length and only the positive frequencies are kept.
The energy spectrum of spectral line k is Y_n(k) = X_n(k)X_n^*(k), so the normalized spectral probability density of each frequency component is p_n(k)=\frac{Y_n(k)}{E_n}=\frac{Y_n(k)}{\sum_{l=0}^{N/2}Y_n(l)}
The short-time spectral entropy of the frame is defined as
H_n = -\sum_{k=0}^{N/2}p_n(k)\ln p_n(k)
H_n is the spectral entropy of frame n; it reflects how the spectral energy is distributed. Spectral entropy is fairly robust across different noise levels, but the amplitude of each spectral bin is easily corrupted by noise, which degrades endpoint detection.
Within speech regions the spectral entropy is smaller than in noise regions, so the ratio of energy to spectral entropy is larger in speech than in noise. This energy-to-entropy ratio EEF_n is
EEF_n = \sqrt{1+|E_n/H_n|}
A threshold T1 is set; frames with EEF_n not less than T1 are classified as speech.
2) Pitch detection by autocorrelation
To reduce interference from formants, the search range for the fundamental frequency is usually 60 Hz to 500 Hz, which corresponds to lags between f_s/500 and f_s/60 samples.
When a frame is autocorrelated, the autocorrelation is maximal when the lag equals the pitch period. Pitch detection therefore amounts to autocorrelating the frame and finding the lag (in samples) of the maximum within the search range.
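The core idea can be sketched on a synthetic frame (my example; the sampling rate, frame length and search range are assumptions): autocorrelate the frame, then take the lag of the maximum inside the plausible pitch-lag range.

```python
import numpy as np

sr = 8000                               # assumed sampling rate
t = np.arange(400) / sr
u = np.sin(2 * np.pi * 200 * t)         # 200 Hz tone -> pitch period of 40 samples

ru = np.correlate(u, u, mode='full')
ru = ru[len(u) - 1:]                    # keep non-negative lags

lmin = sr // 500                        # shortest lag (500 Hz upper pitch limit)
lmax = sr // 60                         # longest lag (60 Hz lower pitch limit)
lag = lmin + np.argmax(ru[lmin:lmax])   # lag of the autocorrelation peak
pitch = sr / lag                        # -> 8000 / 40 = 200.0 Hz
```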

####################################
##
## calculate pitch with correlation
## calPitch() JitterAbsolute() JitterRelative()
##
####################################	

class voiceSegment:

    def __init__(self, in1=0, in2=0, duratioin=0):
        self.begin = in1
        self.end = in2
        self.duratioin = duratioin


def pitch_vad(x, win, step, T1, miniL):
    # endpoint detection via the energy-to-entropy ratio
    y = enframe(x, win, step).T  # enframe: external helper that splits x into frames
    fn = len(y[0, :])
    Esum = []  # energy of each frame
    H = []  # spectral entropy of each frame
    for i in range(fn):
        Sp = np.abs(np.fft.fft(y[:, i]))
        Sp = Sp[:int(win / 2)]  # keep positive frequencies
        Esum.append(np.sum(Sp ** 2))  # energy
        prob = Sp / (np.sum(Sp))  # normalized spectral probability
        H.append(- np.sum(prob * np.log(prob + eps)))

    H = np.array(H)
    hindex = np.where(H < 0.1)
    H[hindex] = np.max(H)
    Ef = np.sqrt(1 + np.abs(Esum / H))  # energy-to-entropy ratio
    Ef = Ef / np.max(Ef)

    zindex = np.where(Ef >= T1)  # frames where Ef reaches the threshold
    zseg = findSegment(zindex)  # segment information from endpoint detection
    zsl = len(zseg)  # number of segments
    j = 0
    SF = np.zeros(fn)
    voiceseg = []
    for k in range(zsl):
        if zseg[k].duratioin >= miniL:
            j = j + 1
            in1 = zseg[k].begin
            in2 = zseg[k].end
            voiceseg.append(zseg[k])
            SF[in1:in2] = 1

    vosl = len(voiceseg)  # number of voiced segments
    return voiceseg, vosl, SF, Ef


def findSegment(express):
    voiceIndex = np.array(express).flatten()
    soundSegment = []
    k = 0
    soundSegment.append(voiceSegment(voiceIndex[0]))
    for i in range(len(voiceIndex) - 1):
        if voiceIndex[i + 1] - voiceIndex[i] > 1:
            soundSegment[k].end = voiceIndex[i]
            soundSegment.append(voiceSegment(voiceIndex[i + 1]))
            k = k + 1
    soundSegment[k].end = voiceIndex[-1]

    for i in range(k + 1):
        soundSegment[i].duratioin = soundSegment[i].end - soundSegment[i].begin + 1
    return soundSegment


def pitch_Corr(x, win, step, T1, sr, miniL=10):
    win = int(win)
    step = int(step)
    vseg, vsl, SF, Ef = pitch_vad(x, win, step, T1, miniL)
    y = enframe(x, win, step).T
    fn = len(SF)
    lmin = int(sr / 500)  # shortest lag to search (500 Hz upper pitch limit)
    lmax = int(sr / 27.5)  # longest lag to search (27.5 Hz lower pitch limit here)
    period = np.zeros(fn)
    for i in range(vsl):
        ixb = vseg[i].begin
        ixe = vseg[i].end
        ixd = vseg[i].duratioin
        for k in range(ixd):
            u = y[:, k + ixb]
            ru = np.correlate(u, u, mode='full')
            ru = ru[win - 1:]  # positive
            tloc = np.argmax(ru[lmin:lmax])  # offset of the peak within the search range
            period[k + ixb] = lmin + tloc
    return vseg, vsl, SF, Ef, period


def calPitch(y, win, step, sr):
    '''
    calculate pitch
    :param y: data of wav file
    :param win: frame (window) length in samples
    :param step: frame shift (inc) in samples
    :param sr: sampling rate of the wav file
    :return: pitch in Hz, pitch period in samples
    '''
    T1 = 0.05
    voicesef, vosl, SF, Ef, period = pitch_Corr(y, win, step, T1, sr)
    # period is pitch period
    pitch = sr / (period + eps)
    pindex = np.where(pitch > 5000)
    pitch[pindex] = 0
    return pitch, period

4.1 Jitter: Absolute Form

jitter
Jitter(absolute)=\frac{1}{N-1}\sum_{i=1}^{N-1}|T_i-T_{i+1}|

def JitterAbsolute(pitch):
    period = 1 / (pitch + eps)
    pindex = np.where(period > 5000)
    period[pindex] = 0
    n = len(period)
    sigma = 0
    for i in range(n - 1):
        sigma += np.abs(period[i] - period[i + 1])
    jitter_absolute = sigma / (n - 1)
    return jitter_absolute

4.2 Jitter: Relative Form

jitter

Jitter(relative)=\frac{\frac{1}{N-1}\sum_{i=1}^{N-1}|T_i-T_{i+1}|}{\frac{1}{N}\sum_{i=1}^{N}T_i}

def JitterRelative(pitch):
    period = 1 / (pitch + eps)
    pindex = np.where(period > 5000)
    period[pindex] = 0
    n = len(period)
    jitter_relative = JitterAbsolute(pitch) / (np.sum(period) / n)
    return jitter_relative

4.3 Harmonics-to-Noise Ratio (HNR)

harmonic to noise ratio
HNR=10\log_{10}\left(\frac{ACF(T_0)}{ACF(0)-ACF(T_0)}\right)
where T_0 is the pitch period.

def stHNR(frame, period):
    '''
    harmonics-to-noise ratio
    HNR = 10 * log10( ACF(T0) / (ACF(0) - ACF(T0)) )
    frame: a frame of signal
    period: pitch period (in samples) of the given frame
    return: HNR in dB
    '''
    period = int(period)
    if period == 0:
        return 0  # when the pitch period is zero, return zero
    ru = np.correlate(frame, frame, mode='full')
    win = len(frame)
    ru = ru[win - 1:]  # keep non-negative lags
    HNR = 10 * np.log10(ru[period] / (ru[0] - ru[period]))
    return HNR

5. Formants

formant
Formants can be estimated with the cepstrum method or with LPC. The vocal tract can be viewed as an acoustic tube of non-uniform cross-section that acts as a resonator during phonation; when the quasi-periodic excitation enters the vocal tract it excites a set of resonances, known as the formant frequencies.
The speech production model lumps the effects of radiation, the vocal tract and the glottal excitation into a single time-varying digital filter with transfer function H(z)=\frac{S(z)}{U(z)}=\frac{G}{1-\sum_{i=1}^{p}a_iz^{-i}}
This is the p-th order linear prediction model, an all-pole model.
Substituting z^{-1}=\exp(-j2\pi f/f_s), the power spectrum is P(f)=|H(f)|^2=\frac{G^2}{\left|1-\sum_{i=1}^{p}a_i\exp(-j2\pi if/f_s)\right|^2}
The FFT is used to evaluate this magnitude response at arbitrary frequencies, and formants are located at its peaks, either by parabolic interpolation or by solving for the complex roots of the LPC polynomial. This article uses parabolic interpolation; see "語音信號處理實驗教程" (Experimental Tutorial on Speech Signal Processing), pp. 105-107.
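The three-point parabolic interpolation used below can be verified on a known parabola (my own check, not from the textbook): sampling y = 1 - (x - 0.3)^2 at x = -1, 0, 1 and fitting should recover the vertex at x = 0.3 with value 1.0.

```python
# fit a parabola through (-1, p1), (0, p), (1, p2) and locate its vertex
def parabolic_vertex(p1, p, p2):
    aa = (p1 + p2) / 2 - p        # curvature term
    bb = (p2 - p1) / 2            # slope term
    dm = -bb / (2 * aa)           # vertex offset from the centre sample
    pm = -bb ** 2 / (4 * aa) + p  # value at the vertex
    return dm, pm

f = lambda x: 1 - (x - 0.3) ** 2
dm, pm = parabolic_vertex(f(-1), f(0), f(1))   # dm -> 0.3, pm -> 1.0
```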

#############################################
##
## calculate formant frequency and bandwidth
##
#############################################
def stFormant(u, sr, p=12):
    '''
    u: one frame of signal
    p: LPC order
    sr: sampling rate
    return:
    [1] formant frequency array F
    [2] formant bandwidth array Bw
    requires: from scipy import signal, plus an external LPC helper
    (here lpc.autocor, e.g. from the audiolazy library)
    '''
    ### compute LPC coefficients
    a_filter = lpc.autocor(u, p)
    a_filter_num = a_filter.numdict
    i = 0
    a = []  # dense coefficient array of the prediction-error filter
    for k, v in a_filter_num.items():
        if i != k:
            while i != k:
                a.append(0)
                i = i + 1
        a.append(v)
        i = i + 1
    a = np.array(a)
    ### LPC done
    U = lpcar2pf(a, 255)  # power-spectrum curve from the LPC coefficients
    df = sr / 512  # frequency resolution
    Loc, Mdict = signal.find_peaks(U)  # peak locations in U
    nFormant = len(Loc)
    F = np.zeros(nFormant)  # formant frequencies
    Bw = np.zeros(nFormant)  # formant bandwidths
    # parabolic interpolation around each spectral peak
    i = 0
    for m in Loc:
        m1 = m - 1
        m2 = m + 1
        pm = U[m]
        p1 = U[m1]
        p2 = U[m2]
        aa = (p1 + p2) / 2 - pm
        bb = (p2 - p1) / 2
        cc = pm
        dm = -bb / (2 * aa)  # vertex offset: frequency of the interpolated maximum
        pp = -bb ** 2 / (4 * aa) + cc  # power at the interpolated centre frequency
        bf = -np.sqrt(bb ** 2 - 4 * aa * (cc - 0.5 * pp)) / (2 * aa)  # half-power half-width in bins
        F[i] = (m + dm) * df
        Bw[i] = 2 * bf * df
        i = i + 1

    return F, Bw


def lpcar2pf(ar, npoints):
    '''
    ar: LPC (AR) coefficients
    npoints: the power spectrum is evaluated at npoints + 2 frequency bins
    return: power-spectrum curve
    '''
    return np.abs(np.fft.rfft(ar, 2 * npoints + 2)) ** (-2)


def pre_emphasis(y, coefficient=0.99):
    '''
    y : original signal
    coefficient: emphasis coefficient
    '''
    return np.append(y[0], y[1:] - coefficient * y[:-1])

6. Entropy of Energy

Similar to spectral entropy, but it describes the temporal distribution of the signal, reflecting its continuity; it is also a short-time feature. Within frame i, the signal is divided into K sub-blocks.
The ratio of the energy of sub-block j to the total energy of the frame is e_j =\frac{E_{subFrame_j}}{E_{shortFrame_i}}, where E_{shortFrame_i}=\sum_{k=1}^{K}E_{subFrame_k} is the total energy of the sub-blocks.
The energy entropy of the frame is then H_i=-\sum_{j=1}^{K} e_j \log_2(e_j)

def stEnergyEntropy(frame, n_short_blocks=10):
    """Computes entropy of energy"""
    Eol = numpy.sum(frame ** 2)    # total frame energy
    L = len(frame)
    sub_win_len = int(numpy.floor(L / n_short_blocks))
    if L != sub_win_len * n_short_blocks:
        frame = frame[0:sub_win_len * n_short_blocks]
    # sub_wins has shape (sub_win_len, n_short_blocks): one column per sub-block
    sub_wins = frame.reshape(sub_win_len, n_short_blocks, order='F').copy()

    # Compute normalized sub-frame energies:
    s = numpy.sum(sub_wins ** 2, axis=0) / (Eol + eps)

    # Compute entropy of the normalized sub-frame energies:
    Entropy = -numpy.sum(s * numpy.log2(s + eps))
    return Entropy
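A sanity check of the behaviour (my example): a perfectly uniform frame spreads its energy equally over the K sub-blocks, giving the maximum entropy log2(K).

```python
import numpy as np

eps = np.finfo(np.float64).eps

def stEnergyEntropy(frame, n_short_blocks=10):
    # entropy of the normalized sub-frame energies
    Eol = np.sum(frame ** 2)
    L = len(frame)
    sub_win_len = int(np.floor(L / n_short_blocks))
    if L != sub_win_len * n_short_blocks:
        frame = frame[0:sub_win_len * n_short_blocks]
    sub_wins = frame.reshape(sub_win_len, n_short_blocks, order='F').copy()
    s = np.sum(sub_wins ** 2, axis=0) / (Eol + eps)
    return -np.sum(s * np.log2(s + eps))

flat = np.ones(1000)           # constant signal: all sub-blocks carry equal energy
H = stEnergyEntropy(flat)      # -> log2(10) ≈ 3.32
```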

7. Spectral Centroid

Also called the first spectral moment. The smaller the spectral centroid, the more the spectral energy is concentrated in the low-frequency range; for example, voice usually has a lower spectral centroid than music. The spectral centroid C_i of frame i is:
C_i=\frac{\sum_{k=1}^{f_n}kX_i(k)}{\sum_{k=1}^{f_n}X_i(k)}
where X_i(k) is the k-th spectral line of frame i and f_n is the frame length.

def stSpectralCentroidAndSpread(X, fs):
    """Computes spectral centroid of frame (given abs(FFT))"""
    ind = (numpy.arange(1, len(X) + 1)) * (fs/(2.0 * len(X)))

    Xt = X.copy()
    Xt = Xt / Xt.max()
    NUM = numpy.sum(ind * Xt)
    DEN = numpy.sum(Xt) + eps

    # Centroid:
    C = (NUM / DEN)

    # Spread:
    S = numpy.sqrt(numpy.sum(((ind - C) ** 2) * Xt) / DEN)

    # Normalize:
    C = C / (fs / 2.0)
    S = S / (fs / 2.0)

    return (C, S)
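Behaviour check (my example, on an idealized magnitude spectrum; the fs value is arbitrary): if all energy sits in a single bin, the normalized centroid lands at that bin's fraction of the Nyquist frequency.

```python
import numpy as np

eps = np.finfo(np.float64).eps

def spectral_centroid(X, fs):
    # normalized spectral centroid of a magnitude spectrum X
    ind = np.arange(1, len(X) + 1) * (fs / (2.0 * len(X)))
    Xt = X / X.max()
    C = np.sum(ind * Xt) / (np.sum(Xt) + eps)
    return C / (fs / 2.0)

X = np.zeros(256)
X[63] = 1.0                          # single spectral line at bin 64 of 256
C = spectral_centroid(X, fs=16000)   # -> 64 / 256 = 0.25
```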

8. Spectral Spread

Also called the second central spectral moment; it describes how the spectrum is distributed around the spectral centroid. The spectral spread S_i of frame i is
S_i=\sqrt{\frac {\sum_{k=1}^{f_n} (k-C_i)^2X_i(k)} {\sum_{k=1}^{f_n}X_i(k)}}
The code is shared with 7. Spectral Centroid above.

9. Spectral Entropy

By the properties of entropy, the more uniform a distribution, the larger its entropy. Spectral entropy therefore reflects how uniform each frame's spectrum is: a speaker's spectrum is uneven because of formants, while white noise has a much flatter spectrum; exploiting this contrast for VAD is one application.
Analogous to the entropy of energy, the spectral entropy divides the FFT spectrum into K sub-blocks, computes each sub-block's share of the total spectral energy, and sums the entropy contributions of all sub-blocks to obtain the spectral entropy of frame i.
Share of sub-block k: n_k = \frac{E_k}{\sum_{l=1}^{K}E_l}
Spectral entropy of frame i:
H_i=-\sum_{k=1}^{K}n_k\log_2(n_k)

def stSpectralEntropy(X, n_short_blocks=10):
    """Computes the spectral entropy"""
    L = len(X)                         # number of frame samples
    Eol = numpy.sum(X ** 2)            # total spectral energy

    sub_win_len = int(numpy.floor(L / n_short_blocks))   # length of sub-frame
    if L != sub_win_len * n_short_blocks:
        X = X[0:sub_win_len * n_short_blocks]

    sub_wins = X.reshape(sub_win_len, n_short_blocks, order='F').copy()  # define sub-frames (using matrix reshape)
    s = numpy.sum(sub_wins ** 2, axis=0) / (Eol + eps)                      # compute spectral sub-energies
    En = -numpy.sum(s*numpy.log2(s + eps))                                    # compute spectral entropy

    return En

10. Spectral Flux

Describes the change of the spectrum between adjacent frames: after normalizing each spectrum, it is the sum of squared differences of the two.
Fl_{(i,i-1)}=\sum_{k=1}^{f_n}(EN_i(k)-EN_{i-1}(k))^2
where EN_i(k)=\frac{X_i(k)}{\sum_{l=1}^{f_n}X_i(l)} and X_i(k) is the k-th spectral line.

def stSpectralFlux(X, X_prev):
    """
    Computes the spectral flux feature of the current frame
    ARGUMENTS:
        X:            the abs(fft) of the current frame
        X_prev:        the abs(fft) of the previous frame
    """
    # compute the spectral flux as the sum of square distances:
    sumX = numpy.sum(X + eps)
    sumPrevX = numpy.sum(X_prev + eps)
    F = numpy.sum((X / sumX - X_prev/sumPrevX) ** 2)

    return F
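Because both spectra are normalized before differencing, the flux is invariant to overall gain (my check, not from the article): a frame and a scaled copy of it give (essentially) zero flux.

```python
import numpy as np

eps = np.finfo(np.float64).eps

def stSpectralFlux(X, X_prev):
    # sum of squared differences of the two normalized spectra
    sumX = np.sum(X + eps)
    sumPrevX = np.sum(X_prev + eps)
    return np.sum((X / sumX - X_prev / sumPrevX) ** 2)

X = np.abs(np.random.rand(128))
F_same = stSpectralFlux(2.0 * X, X)   # scaled copy: normalization cancels the gain
```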

11. Spectral Rolloff

The spectral energy is concentrated within a certain frequency range. The roll-off point is the DFT bin below which a given percentage (typically around 90%) of the spectral energy is contained; the bin index is then normalized by the FFT length: \sum_{k=1}^{m}X_i(k)=C\sum_{k=1}^{f_n}X_i(k)
where m is the roll-off bin, f_n is the FFT length and X_i(k) is the k-th spectral line.

def stSpectralRollOff(X, c, fs):
    """Computes spectral roll-off"""
    totalEnergy = numpy.sum(X ** 2)
    fftLength = len(X)
    Thres = c*totalEnergy
    # find the spectral rolloff as the frequency position
    # where the respective spectral energy equals c * totalEnergy
    CumSum = numpy.cumsum(X ** 2) + eps
    [a, ] = numpy.nonzero(CumSum > Thres)
    if len(a) > 0:
        mC = numpy.float64(a[0]) / (float(fftLength))
    else:
        mC = 0.0
    return (mC)
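Check on a flat spectrum (my example; the fs value is unused by the formula and arbitrary): with c = 0.9, the roll-off of a uniform magnitude spectrum should land near 0.9 of the FFT length.

```python
import numpy as np

eps = np.finfo(np.float64).eps

def stSpectralRollOff(X, c, fs):
    # normalized bin index below which a fraction c of the spectral energy lies
    totalEnergy = np.sum(X ** 2)
    fftLength = len(X)
    Thres = c * totalEnergy
    CumSum = np.cumsum(X ** 2) + eps
    a, = np.nonzero(CumSum > Thres)
    return np.float64(a[0]) / fftLength if len(a) > 0 else 0.0

X = np.ones(1000)                            # flat magnitude spectrum
mC = stSpectralRollOff(X, 0.9, fs=16000)     # roll-off close to 0.9
```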

12. MFCCs (Mel-Frequency Cepstral Coefficients)

Mel-frequency cepstral coefficients are based on human auditory perception: the speech spectrum is analyzed according to results from hearing experiments. The relation to physical frequency is F_{mel}(f)=1125\ln(1+f/700)

  1. Mel filter bank
    The mel scale is realized by placing M band-pass filters H_m(k), 0 \le m < M, across the speech spectrum, where M is chosen by the user. Each filter has a triangular response with centre frequency f(m); on the mel scale the filters have equal bandwidth. The transfer function of each band-pass filter is
    H_m(k)= \begin{cases} 0 & k<f(m-1)\\ \frac{k-f(m-1)}{f(m)-f(m-1)} & f(m-1)\leqslant k \leqslant f(m) \\ \frac{f(m+1)-k}{f(m+1)-f(m)} & f(m) \leqslant k \leqslant f(m+1) \\ 0 & k>f(m+1) \end{cases}
    The centre frequencies are f(m)=\frac{N}{f_s}F_{mel}^{-1}\left(F_{mel}(f_l)+m\frac{F_{mel}(f_h)-F_{mel}(f_l)}{M+1}\right)
    where f_h and f_l are the highest and lowest frequencies of the filter bank, f_s is the sampling rate, M is the number of filters and N is the FFT size.
  2. MFCC computation
    (1) Pre-processing
    Pre-emphasis, framing and windowing give the i-th frame x_i(m).
    (2) FFT
    X_i(k)=FFT[x_i(m)]
    (3) Spectral line energy
    E_i(k)=|X_i(k)|^2
    (4) Energy through the mel filters
    Multiply each spectral line E_i(k) of a frame by the frequency response H_m(k) of each mel filter and sum. Since there are M mel filters, each frame yields M filter-bank energies:
    S_i(m)=\sum_{k=0}^{f_n-1}E_i(k)H_m(k),\quad 0\leqslant m < M
    (5) DCT cepstrum
    [This step maps the Fourier spectrum into the mel domain, with one mel spectral line per filter.] The DCT of the log filter-bank energies gives the cepstral coefficients:
    mfcc_i(n)=\sqrt{\frac{2}{M}}\sum_{m=0}^{M-1}\log[S_i(m)]\cos\left[\frac{\pi n(2m+1)}{2M}\right],\quad 0\leqslant n < M
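The mel mapping and its inverse can be sanity-checked numerically (my example; the constants 1125 and 700 follow the formula above):

```python
import numpy as np

def hz2mel(f):
    # mel scale as defined above: F_mel(f) = 1125 * ln(1 + f / 700)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel2hz(m):
    # inverse mapping
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

m1k = hz2mel(1000.0)     # about 1000 mel by construction of the scale
f_back = mel2hz(m1k)     # round trip returns 1000 Hz
```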
def mfccInitFilterBanks(fs, nfft):
    """
    Computes the triangular filterbank for MFCC computation 
    (used in the stFeatureExtraction function before the stMFCC function call)
    This function is taken from the scikits.talkbox library (MIT Licence):
    https://pypi.python.org/pypi/scikits.talkbox
    """

    # filter bank params:
    lowfreq = 133.33
    linsc = 200/3.
    logsc = 1.0711703
    numLinFiltTotal = 13
    numLogFilt = 27

    if fs < 8000:
        numLogFilt = 5  # fewer log-spaced filters for low sampling rates

    # Total number of filters
    nFiltTotal = numLinFiltTotal + numLogFilt

    # Compute frequency points of the triangle:
    freqs = numpy.zeros(nFiltTotal+2)
    freqs[:numLinFiltTotal] = lowfreq + numpy.arange(numLinFiltTotal) * linsc
    freqs[numLinFiltTotal:] = freqs[numLinFiltTotal-1] * logsc ** numpy.arange(1, numLogFilt + 3)
    heights = 2./(freqs[2:] - freqs[0:-2])

    # Compute filterbank coeff (in fft domain, in bins)
    fbank = numpy.zeros((nFiltTotal, nfft))
    nfreqs = numpy.arange(nfft) / (1. * nfft) * fs

    for i in range(nFiltTotal):
        lowTrFreq = freqs[i]
        cenTrFreq = freqs[i+1]
        highTrFreq = freqs[i+2]

        lid = numpy.arange(numpy.floor(lowTrFreq * nfft / fs) + 1,
                           numpy.floor(cenTrFreq * nfft / fs) + 1,
                           dtype=int)
        lslope = heights[i] / (cenTrFreq - lowTrFreq)
        rid = numpy.arange(numpy.floor(cenTrFreq * nfft / fs) + 1,
                           numpy.floor(highTrFreq * nfft / fs) + 1,
                           dtype=int)
        rslope = heights[i] / (highTrFreq - cenTrFreq)
        fbank[i][lid] = lslope * (nfreqs[lid] - lowTrFreq)
        fbank[i][rid] = rslope * (highTrFreq - nfreqs[rid])

    return fbank, freqs
def stMFCC(X, fbank, n_mfcc_feats):
    """
    Computes the MFCCs of a frame, given the fft mag

    ARGUMENTS:
        X:        fft magnitude abs(FFT)
        fbank:    filter bank (see mfccInitFilterBanks)
    RETURN
        ceps:     MFCCs (13 element vector)

    Note:    MFCC calculation is, in general, taken from the
             scikits.talkbox library (MIT Licence),
             with a small number of modifications to make it more
             compact and suitable for the pyAudioAnalysis Lib
    """

    mspec = numpy.log10(numpy.dot(X, fbank.T) + eps)
    ceps = dct(mspec, type=2, norm='ortho', axis=-1)[:n_mfcc_feats]  # dct from scipy.fftpack
    return ceps