Extracting MFCCs with TensorFlow 2

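0x00 Preface

  • Q: Why implement MFCC extraction in TensorFlow 2? Aren't there plenty of tutorials online, plus two Python libraries that already do it?

  • A: I wanted to learn how MFCCs are actually computed and to implement that in code (this project grew out of the speech wake-word example that TensorFlow provides).

  • In TensorFlow 1.14 and earlier, it was implemented like this: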

    # stft , get spectrogram
    spectrogram = contrib_audio.audio_spectrogram(
        wav_decoder.audio,
        window_size=640,
        stride=640,
        magnitude_squared=True)
    
    # get mfcc (C, H, W)
    _mfcc = contrib_audio.mfcc(
        spectrogram,
        sample_rate=16000,
        upper_frequency_limit=4000,
        lower_frequency_limit=20,
        filterbank_channel_count=40,
        dct_coefficient_count=10)
    

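    Only the intermediate spectrogram variable is exposed. Getting at the windowing, the mel filterbank, and the other internals of the MFCC computation is extremely hard and takes strong coding skills; the underlying implementation lives in gen_audio_ops.py.

    Well, that was the first time code talked me out of something.

    Stones from other hills can polish jade, so I looked at other ways (scripts) to get MFCCs. Tutorials online say two third-party Python libraries can do it:

    librosa and python_speech_features

    But I ran into a thorny problem:

    • The MFCC values do not match the ones computed by TensorFlow 1.14

    Later I happened to notice that mfccs_from_log_mel_spectrograms in TensorFlow 2.1.0 lets the MFCC be computed step by step. After some tinkering I ended up with the current version.

    All code below was run under TensorFlow 2.1.0.

    Basic pipeline: read audio --> pre-emphasis (not used in this article) --> framing --> windowing --> FFT --> mel filterbank --> log --> DCT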

0x01 Reading the source audio file

  • shape = wav.length
import tensorflow as tf
import numpy as np
from tensorflow.python.ops import io_ops
from tensorflow import audio

def load_wav(wav_path, sample_rate=16000):
    '''
        load one wav file

    Args:
        wav_path: the wav file path, str
        sample_rate: number of samples to keep (one second of audio at 16 kHz), int

    Returns:
        wav: decoded wav samples, normalized to [-1.0, 1.0], float32
        rate: the wav's sample rate, int32 scalar tensor

    '''
    wav_loader = io_ops.read_file(wav_path)
    (wav, rate) = audio.decode_wav(wav_loader,
                                   desired_channels=1,
                                   desired_samples=sample_rate)

    # shape (16000,)
    wav = np.array(wav).flatten()

    return wav, rate
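
To make the "normalized" part of load_wav concrete, here is a minimal standard-library sketch of the same idea: read 16-bit PCM, scale it to floats, then crop or zero-pad to a fixed length. The helper names (`write_wav_bytes`, `load_wav_bytes`) are hypothetical, and dividing by 32768 is my assumption about how the int16 range is mapped to [-1.0, 1.0]; it is not taken from the TensorFlow source.

```python
import io
import struct
import wave


def write_wav_bytes(samples, sample_rate=16000):
    """Encode int16 samples as an in-memory mono 16-bit wav (test helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buf.getvalue()


def load_wav_bytes(data, desired_samples):
    """Decode a mono 16-bit wav, scale to floats, crop or zero-pad."""
    with wave.open(io.BytesIO(data), "rb") as w:
        rate = w.getframerate()
        n = w.getnframes()
        raw = w.readframes(n)
    pcm = struct.unpack("<%dh" % n, raw)
    # Assumed scaling: int16 range [-32768, 32767] -> [-1.0, 1.0)
    wav = [s / 32768.0 for s in pcm][:desired_samples]
    wav += [0.0] * (desired_samples - len(wav))
    return wav, rate


data = write_wav_bytes([0, 16384, -32768])
wav, rate = load_wav_bytes(data, desired_samples=5)
print(wav, rate)
```

Passing sample_rate as desired_samples in load_wav above has the same effect as `desired_samples` here: every clip comes out exactly one second long.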

0x02 Padding and framing & windowing & fast Fourier transform

input shape: wav, shape = wav.shape

output : spectrograms without padding, shape = (n_frames, num_spectrogram_bins)

  • n_frames = 1 + (signals - frame_length) / frame_step
  • num_spectrogram_bins = fft_length//2+1
def stft(wav, win_length=640, win_step=640, n_fft=1024):
    '''
        Short-time Fourier transform (STFT)

    Args:
        wav: samples decoded from a *.wav file, float32, shape (16000,)
        win_length: number of samples in each frame window, int
        win_step: frame shift in samples, int
        n_fft: FFT size, int

    Returns:
        spectrograms: the power spectrogram produced by the STFT
                shape: (n_frames, n_fft//2 + 1)
                n_frames = 1 + (signals - frame_length) // frame_step
        num_spectrogram_bins: spectrograms.shape[-1], int
    '''

    # if fft_length not given
    # fft_length = 2**N for integer N such that 2**N >= frame_length.
    # shape (25, 513)
    stfts = tf.signal.stft(wav, frame_length=win_length,
                           frame_step=win_step, fft_length=n_fft)
    spectrograms = tf.abs(stfts)
    spectrograms = tf.square(spectrograms)

    # Warp the linear scale spectrograms into the mel-scale.
    num_spectrogram_bins = stfts.shape.as_list()[-1]  # 513
    return spectrograms, num_spectrogram_bins

tf.signal.stft(), excerpted from tensorflow/python/ops/signal/spectral_ops.py

@tf_export('signal.stft')
def stft(signals, frame_length, frame_step, fft_length=None,
         window_fn=window_ops.hann_window,
         pad_end=False, name=None):
  
  # Convert each argument to a tensor
  with ops.name_scope(name, 'stft', [signals, frame_length,
                                     frame_step]):
    signals = ops.convert_to_tensor(signals, name='signals')
    signals.shape.with_rank_at_least(1)
    frame_length = ops.convert_to_tensor(frame_length, name='frame_length')
    frame_length.shape.assert_has_rank(0)
    frame_step = ops.convert_to_tensor(frame_step, name='frame_step')
    frame_step.shape.assert_has_rank(0)

    if fft_length is None:
      # fft_length = 2**N for integer N such that 2**N >= frame_length.
      fft_length = _enclosing_power_of_two(frame_length)
    else:
      fft_length = ops.convert_to_tensor(fft_length, name='fft_length')

    # Framing, shape=(1+(signals-frame_length)//frame_step, frame_length)
    framed_signals = shape_ops.frame(
        signals, frame_length, frame_step, pad_end=pad_end)

    # Windowing; the default is the Hann window
    if window_fn is not None:
      window = window_fn(frame_length, dtype=framed_signals.dtype)
      # Applied as an element-wise product, broadcast over the frames
      framed_signals *= window

    # fft_ops.rfft produces the (fft_length/2 + 1) unique components of the
    # FFT of the real windowed signals in framed_signals.
    return fft_ops.rfft(framed_signals, [fft_length])

2.1 Framing & padding

input : wav, shape = wav.shape

output : framed_signals, shape = (n_frames, frame_length)

  • Code:
framed_signals = shape_ops.frame(signals, frame_length, frame_step, pad_end=pad_end)
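  • How frame_length / frame_step are computed:

    • Suppose the wav is sampled at 16 kHz; signals is the number of samples in one second (1000 ms)

    $$ signals = 16000 * 1000 / 1000 = 16000 $$

    • window_size_ms is 40 ms; frame_length is the number of samples in one frame

    $$ frame\_length = 16000 * 40 / 1000 = 640 $$

    • window_stride_ms is 40 ms; frame_step is the frame shift in samples

    $$ frame\_step = 16000 * 40 / 1000 = 640 $$

  • No padding is used here

    • If pad_end is True, a final frame that falls short of the frame length is padded with zeros

    • For example, with signals=16000, frame_length=512, frame_step=180:

      with pad_end=False, n_frames is 87; the two trailing partial frames (starting at samples 15660 and 15840, holding only 340 and 160 samples) are dropped;

      with pad_end=True, n_frames is 89; those two partial frames are kept and zero-padded at the end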
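
The frame counts above can be checked with a few lines of plain Python. This is a sketch of the counting rule only; `count_frames` is a hypothetical helper, and the ceiling-division rule for pad_end=True reflects my understanding of how tf.signal.frame behaves.

```python
def count_frames(num_samples, frame_length, frame_step, pad_end=False):
    """Number of frames a 1-D signal of num_samples is split into."""
    if pad_end:
        # Every frame start that lies inside the signal yields a frame;
        # short frames are completed with zeros.
        return -(-num_samples // frame_step)  # ceiling division
    if num_samples < frame_length:
        return 0
    # Only frames that fit entirely inside the signal are kept.
    return 1 + (num_samples - frame_length) // frame_step


print(count_frames(16000, 640, 640))                 # settings used in this article
print(count_frames(16000, 512, 180))                 # pad_end=False example
print(count_frames(16000, 512, 180, pad_end=True))   # pad_end=True example
```

With the article's settings (frame_length = frame_step = 640) the frames tile the one-second clip exactly, which is why pad_end makes no difference here.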

2.2 Windowing

input : framed_signals, shape = (n_frames, frame_length)

output : framed_signals, shape = (n_frames, frame_length)

  • Code:
# shape = (frame_length, )
window = window_fn(frame_length, dtype=framed_signals.dtype)

# Applied as an element-wise product, broadcast over the frames
framed_signals *= window
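  • Why apply a window?

    Reference: "Windowing of speech signals"

    Gibbs phenomenon: when a periodic function with discontinuities (such as a rectangular pulse) is expanded as a Fourier series and resynthesized from a finite number of terms, the more terms are used, the closer the overshoot peaks in the synthesized waveform move toward the discontinuities of the original signal. When the number of terms is very large, the overshoot tends to a constant, roughly 9% of the total jump. This is known as the Gibbs phenomenon.

    • Windowing makes the signal more continuous overall and avoids the Gibbs phenomenon
    • The purpose of windowing is to reduce spectral leakage, not to eliminate it

    The explanation above is rather formal; I found a more approachable one on Zhihu: "How to explain window functions in plain terms?"

  • The windowing process

    The original time-domain signal is multiplied by the window function, and the resulting signal better satisfies the periodicity requirement of the Fourier transform

    [figure: original signal * window function = windowed signal]

  • The window function used is the Hann window:

    [figure: Hann window, https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Window_function_%28hann%29.svg/220px-Window_function_%28hann%29.svg.png]
    $$ w(t) = 0.5 \left(1 - \cos\left(\frac{2 \pi t}{T}\right)\right) $$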

    # periodic defaults to 1 (True), a defaults to 0.5, b defaults to 0.5
    even = 1 - math_ops.mod(window_length, 2)  # 1
    
    n = math_ops.cast(window_length + periodic * even - 1, dtype=dtype)  # 640
    count = math_ops.cast(math_ops.range(window_length), dtype)
    cos_arg = constant_op.constant(2 * np.pi, dtype=dtype) * count / n
    
    return math_ops.cast(a - b * math_ops.cos(cos_arg), dtype=dtype)
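
The excerpt above can be mirrored in plain Python to see the window values directly. This is a sketch, not the TensorFlow implementation; `hann_window` here is a stand-alone helper that follows the same even/periodic denominator logic, and the single-sample special case is my own guard.

```python
import math


def hann_window(window_length, periodic=True):
    """Hann window: 0.5 - 0.5 * cos(2 * pi * i / n)."""
    if window_length == 1:
        return [1.0]
    even = 1 - window_length % 2
    # Periodic windows of even length divide by window_length,
    # all other cases divide by window_length - 1.
    n = window_length + int(periodic) * even - 1
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n)
            for i in range(window_length)]


window = hann_window(640)
frame = [1.0] * 640                               # a constant frame, for illustration
windowed = [s * w for s, w in zip(frame, window)]  # element-wise product
print(window[0], window[320], len(windowed))
```

The window is 0 at the start of the frame and 1 in the middle, so the frame edges are faded out before the FFT.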
    

2.3 Fourier transform

input : framed_signals, shape = (n_frames, frame_length)

output : spectrograms without padding, shape = (n_frames, num_spectrogram_bins)

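Changes of a signal in the time domain rarely reveal its characteristics, so it is usually converted to an energy distribution over frequency; different energy distributions represent the characteristics of different speech sounds. An N-point FFT is applied to each framed and windowed signal to compute the spectrum, which is also called the short-time Fourier transform (STFT). N is typically 256 or 512 (NFFT=512); the stft() function above uses n_fft=1024.
$$ S_i(k)=\sum_{n=1}^{N}s_i(n)e^{-j2\pi kn/N}, \quad 1\le k \le K $$

  • Code (the NumPy equivalent is numpy.fft.rfft(frames, NFFT)):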
fft_ops.rfft(framed_signals, [fft_length])
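
To see why a frame of frame_length samples yields fft_length // 2 + 1 bins, here is a deliberately naive real-input DFT in plain Python. It is a sketch for tiny inputs only (the `rfft` helper below is my own, quadratic-time, and uses 0-based indices); real code should call tf.signal.rfft or numpy.fft.rfft.

```python
import cmath


def rfft(frame, fft_length):
    """Naive real-input DFT; returns the fft_length // 2 + 1 unique bins."""
    # Zero-pad (or truncate) the frame to fft_length samples.
    padded = list(frame[:fft_length]) + [0.0] * max(0, fft_length - len(frame))
    return [sum(x * cmath.exp(-2j * cmath.pi * k * n / fft_length)
                for n, x in enumerate(padded))
            for k in range(fft_length // 2 + 1)]


frame = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0]   # frame_length = 6
bins = rfft(frame, fft_length=8)           # padded from 6 to 8 samples
power = [abs(b) ** 2 for b in bins]        # squared magnitude, as in stft() above
print(len(bins), power[0])
```

With frame_length=640 and n_fft=1024 the same rule gives 1024 // 2 + 1 = 513 bins per frame.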

The rfft wrapper pulled out on its own, taken from GitHub:

# rfft
def rfft_wrapper(fft_fn, fft_rank, default_name):
  """Wrapper around gen_spectral_ops.rfft* that infers fft_length argument."""

  def _rfft(input_tensor, fft_length=None, name=None):
    """Wrapper around gen_spectral_ops.rfft* that infers fft_length argument."""
    with _ops.name_scope(name, default_name,
                         [input_tensor, fft_length]) as name:
      input_tensor = _ops.convert_to_tensor(input_tensor,
                                            preferred_dtype=_dtypes.float32)
      fft_length = _ops.convert_to_tensor(fft_length)
      if input_tensor.dtype not in (_dtypes.float32, _dtypes.float64):
        raise ValueError(
            "RFFT requires tf.float32 or tf.float64 inputs, got: %s" %
            input_tensor)
      real_dtype = input_tensor.dtype
      if real_dtype == _dtypes.float32:
        complex_dtype = _dtypes.complex64
      else:
        assert real_dtype == _dtypes.float64
        complex_dtype = _dtypes.complex128
      input_tensor.shape.with_rank_at_least(fft_rank)
#       fft_length = None
      if fft_length is None:
#         fft_length = input_tensor.get_shape()[-1:]
        fft_shape = input_tensor.get_shape()[-fft_rank:]

        # If any dim is unknown, fall back to tensor-based math.
        if not fft_shape.is_fully_defined():
          fft_length = _array_ops.shape(input_tensor)[-fft_rank:]
        else:
          # Otherwise, return a constant.
          fft_length = _ops.convert_to_tensor(fft_shape.as_list(), _dtypes.int32)
      else:
        fft_length = _ops.convert_to_tensor(fft_length, _dtypes.int32)
      
      # input_tensor is padded here: zeros are appended to go from frame_length to fft_length
      input_tensor = _maybe_pad_for_rfft(input_tensor, fft_rank, fft_length)
#       print('\n')
      fft_length_static = _tensor_util.constant_value(fft_length)
      if fft_length_static is not None:
        fft_length = fft_length_static
      return fft_fn(input_tensor, fft_length, Tcomplex=complex_dtype, name=name)
  _rfft.__doc__ = fft_fn.__doc__
  return _rfft

rfft = rfft_wrapper(gen_spectral_ops.rfft, 1, 'rfft')
spec = rfft(framed_signals, [1024])
tf.abs(spec)

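I could not find the file gen_spectral_ops.py; the gen_*_ops.py modules are generated when TensorFlow is built, so they do not appear in the source tree on GitHub.

Those interested can look at the NumPy implementation of the Fourier transform: numpy.fft.rfft(frames, NFFT)

The two produce equivalent results. I had heard that TensorFlow 2 also implements this with NumPy, but as far as I can tell the rfft op runs on TensorFlow's own compiled kernels.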

2.4 Magnitude & square of the spectrum

spectrograms = tf.abs(stfts)
spectrograms = tf.square(spectrograms)

0x03 Mel filtering

input : spectrograms, shape=(n_frames, num_spectrogram_bins)

output : mel_spectrograms, shape=(n_frames, num_mel_bins)

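Q: Why use the mel spectrum?

A: Frequency is measured in hertz (Hz), and the human ear can hear roughly 20-20000 Hz, but the ear does not perceive the Hz scale linearly. For example, once we are used to a 1000 Hz tone, raising the pitch to 2000 Hz is heard as only a modest rise, nowhere near a doubling. If the ordinary frequency scale is converted to the mel scale, the ear's perception of frequency becomes linear. In other words, on the mel scale, if the mel frequencies of two sounds differ by a factor of two, the pitch the ear perceives also differs by roughly a factor of two.

Look at the mapping from Hz to mel: because the relationship is logarithmic, mel changes quickly with Hz at low frequencies, while at high frequencies mel rises slowly and the slope of the curve is small. This shows that the ear is sensitive to pitch at low frequencies and quite dull at high frequencies, which is what inspired the mel-scale filterbank.

[figure: mapping from Hz to mel]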

def build_mel(spectrograms, num_mel_bins, num_spectrogram_bins,
              sample_rate, lower_edge_hertz, upper_edge_hertz):
    '''
        Build the mel filterbank and apply it to the spectrogram

    Args:
        spectrograms: spectrogram, shape (1 + (wav-win_length)/win_step, n_fft//2 + 1)
        num_mel_bins: How many bands in the resulting mel spectrum.
        num_spectrogram_bins:
            An integer `Tensor`. How many bins there are in the
            source spectrogram data, which is understood to be `fft_size // 2 + 1`,
            i.e. the spectrogram only contains the nonredundant FFT bins.
        sample_rate:
            An integer or float `Tensor`. Samples per second of the input
            signal used to create the spectrogram. Used to figure out the frequencies
            corresponding to each spectrogram bin, which dictates how they are mapped
            into the mel scale.
        lower_edge_hertz:
            Python float. Lower bound on the frequencies to be
            included in the mel spectrum. This corresponds to the lower edge of the
            lowest triangular band.
        upper_edge_hertz:
            Python float. The desired top edge of the highest frequency band.

    Returns:
        mel_spectrograms: the spectrogram after matrix multiplication with the mel filterbank
                shape: (1 + (wav-win_length)/win_step, n_mels)

    '''
    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=num_spectrogram_bins,
        sample_rate=sample_rate,
        lower_edge_hertz=lower_edge_hertz,
        upper_edge_hertz=upper_edge_hertz)

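    ################ The official tutorial has no sqrt here #############
    # stft() above returns the squared magnitude, so undo the square first
    spectrograms = tf.sqrt(spectrograms)
    mel_spectrograms = tf.tensordot(spectrograms,
                        linear_to_mel_weight_matrix, 1)
    # Equivalent to the line above
    # mel_spectrograms = tf.matmul(spectrograms, linear_to_mel_weight_matrix)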

    # shape (25, 40)
    mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
        linear_to_mel_weight_matrix.shape[-1:]))

    return mel_spectrograms

# Official code

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
from tensorflow.python.framework import tensor_util
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops.signal import shape_ops
from tensorflow.python.util.tf_export import tf_export

_MEL_BREAK_FREQUENCY_HERTZ = 700
_MEL_HIGH_FREQUENCY_Q = 1127

def _hertz_to_mel(frequencies_hertz, name=None):
  """Converts frequencies in `frequencies_hertz` in Hertz to the mel scale.
  Args:
    frequencies_hertz: A `Tensor` of frequencies in Hertz.
    name: An optional name for the operation.
  Returns:
    A `Tensor` of the same shape and type of `frequencies_hertz` containing
    frequencies in the mel scale.
  """
  with ops.name_scope(name, 'hertz_to_mel', [frequencies_hertz]):
    frequencies_hertz = ops.convert_to_tensor(frequencies_hertz)
    return _MEL_HIGH_FREQUENCY_Q * math_ops.log(
        1.0 + (frequencies_hertz / _MEL_BREAK_FREQUENCY_HERTZ))

@tf_export('signal.linear_to_mel_weight_matrix')
def linear_to_mel_weight_matrix(num_mel_bins=20,
                                num_spectrogram_bins=129,
                                sample_rate=8000,
                                lower_edge_hertz=125.0,
                                upper_edge_hertz=3800.0,
                                dtype=dtypes.float32,
                                name=None):
    with ops.name_scope(name, 'linear_to_mel_weight_matrix') as name:
        # Convert Tensor `sample_rate` to float, if possible.
        if isinstance(sample_rate, ops.Tensor):
            maybe_const_val = tensor_util.constant_value(sample_rate)
            if maybe_const_val is not None:
                sample_rate = maybe_const_val


        # This function can be constant folded by graph optimization since there are
        # no Tensor inputs.
        sample_rate = math_ops.cast(
            sample_rate, dtype, name='sample_rate')
        lower_edge_hertz = ops.convert_to_tensor(
            lower_edge_hertz, dtype, name='lower_edge_hertz')
        upper_edge_hertz = ops.convert_to_tensor(
            upper_edge_hertz, dtype, name='upper_edge_hertz')
        zero = ops.convert_to_tensor(0.0, dtype)

        # HTK excludes the spectrogram DC bin.
        bands_to_zero = 1
        nyquist_hertz = sample_rate / 2.0
        # Spacing is (nyquist_hertz - zero) / (num_spectrogram_bins-1)
        # shape = (512, )
        linear_frequencies = math_ops.linspace(
            zero, nyquist_hertz, num_spectrogram_bins)[bands_to_zero:]
    
        # hertz to mel
        # shape = [512, 1]
        spectrogram_bins_mel = array_ops.expand_dims(
            _hertz_to_mel(linear_frequencies), 1)

        # Compute num_mel_bins triples of (lower_edge, center, upper_edge). The
        # center of each band is the lower and upper edge of the adjacent bands.
        # Accordingly, we divide [lower_edge_hertz, upper_edge_hertz] into
        # num_mel_bins + 2 pieces.
        # shape = ((num_mel_bins + 2 - frame_length)/frame_step + 1, frame_length)
        band_edges_mel = shape_ops.frame(
            math_ops.linspace(_hertz_to_mel(lower_edge_hertz),
                              _hertz_to_mel(upper_edge_hertz),
                              num_mel_bins + 2), frame_length=3, frame_step=1)

        # Split the triples up and reshape them into [1, num_mel_bins] tensors.
        lower_edge_mel, center_mel, upper_edge_mel = tuple(array_ops.reshape(
            t, [1, num_mel_bins]) for t in array_ops.split(
                band_edges_mel, 3, axis=1))

        # Calculate lower and upper slopes for every spectrogram bin.
        # Line segments are linear in the mel domain, not Hertz.
        lower_slopes = (spectrogram_bins_mel - lower_edge_mel) / (
            center_mel - lower_edge_mel)
        print(lower_slopes)
        upper_slopes = (upper_edge_mel - spectrogram_bins_mel) / (
            upper_edge_mel - center_mel)

        # Intersect the line segments with each other and zero.
        mel_weights_matrix = math_ops.maximum(
            zero, math_ops.minimum(lower_slopes, upper_slopes))
        print(mel_weights_matrix)

        # Re-add the zeroed lower bins we sliced out above.
        # Pad bands_to_zero rows of zeros back at the top
        return array_ops.pad(
            mel_weights_matrix, [[bands_to_zero, 0], [0, 0]], name=name)
    
melbank = linear_to_mel_weight_matrix(num_mel_bins=40,
                            num_spectrogram_bins=513,
                            sample_rate=16000,
                            lower_edge_hertz=20.0,
                            upper_edge_hertz=4000.0,
                            dtype=dtypes.float32,
                            name=None)

3.1 Building the mel filterbank (40 features extracted per frame)

output : linear_to_mel_weight_matrix, shape=(num_spectrogram_bins, num_mel_bins)

  • Code:
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=num_mel_bins,
    num_spectrogram_bins=num_spectrogram_bins,
    sample_rate=sample_rate,
    lower_edge_hertz=lower_edge_hertz,
    upper_edge_hertz=upper_edge_hertz)

3.2 Computing the mel filter parameters

3.2.1 Conversion

  • Formula for converting frequency to mel frequency:

$$ M(f)=1127\ln(1+f/700) $$

# TF2 implementation
Mel = _MEL_HIGH_FREQUENCY_Q * math_ops.log(1.0 + (frequencies_hertz / _MEL_BREAK_FREQUENCY_HERTZ))
  • Formula for converting mel frequency to frequency:

$$ M^{-1}(m)=700(e^{m/1127}-1) $$

# TF2 implementation
Hertz = _MEL_BREAK_FREQUENCY_HERTZ * (math_ops.exp(mel_values / _MEL_HIGH_FREQUENCY_Q) - 1.0)
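
The two formulas are inverses of each other, which a short standard-library sketch can confirm. The constants are the ones from the official code; the function names are stand-alone helpers rather than TensorFlow's private functions.

```python
import math

MEL_BREAK_FREQUENCY_HERTZ = 700.0
MEL_HIGH_FREQUENCY_Q = 1127.0


def hertz_to_mel(hz):
    """M(f) = 1127 * ln(1 + f / 700)."""
    return MEL_HIGH_FREQUENCY_Q * math.log(1.0 + hz / MEL_BREAK_FREQUENCY_HERTZ)


def mel_to_hertz(mel):
    """Inverse mapping: 700 * (exp(m / 1127) - 1)."""
    return MEL_BREAK_FREQUENCY_HERTZ * (math.exp(mel / MEL_HIGH_FREQUENCY_Q) - 1.0)


for hz in (20.0, 1000.0, 4000.0):
    mel = hertz_to_mel(hz)
    # Converting back should recover the original frequency.
    print(hz, round(mel, 2), round(mel_to_hertz(mel), 6))
```

The values for 20 Hz and 4000 Hz are the lower and upper band edges used in this article (lower_edge_hertz=20, upper_edge_hertz=4000).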

3.2.2 Equal division

There are 40 filters in total (num_mel_bins=40), which needs 42 points: the minimum and maximum frequencies plus equally spaced points in between, distributed evenly in mel space:
$$ m(i) = 31.75, 83.32, 134.89, 186.46, 238.03, 289.59, 341.16, ... , 2094.51, 2146.08 $$

# TF2 implementation
math_ops.linspace(_hertz_to_mel(lower_edge_hertz),
                  _hertz_to_mel(upper_edge_hertz),
                  num_mel_bins + 2)

# Custom check
mel = [np.round((31.75+i*(2146.0756 - 31.75)/41), 2) for i in range(42)]
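
The same check can be written without hard-coding the two end values. The sketch below derives them from the band edges in Hz; `mel_band_edges` is a hypothetical helper name and uses plain Python rather than math_ops.linspace.

```python
import math


def hertz_to_mel(hz):
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_band_edges(lower_edge_hertz, upper_edge_hertz, num_mel_bins):
    """num_mel_bins + 2 points spaced evenly on the mel scale."""
    lo = hertz_to_mel(lower_edge_hertz)
    hi = hertz_to_mel(upper_edge_hertz)
    step = (hi - lo) / (num_mel_bins + 1)
    return [lo + i * step for i in range(num_mel_bins + 2)]


edges = mel_band_edges(20.0, 4000.0, 40)
print(len(edges), [round(e, 2) for e in edges[:3]], round(edges[-1], 2))
```

The points are equally spaced in mel, so they are packed densely at low Hz and spread out at high Hz once converted back.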

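3.2.3 Framing: splitting out the lower, center, and upper edge of each filter

shape = ((num_mel_bins + 2 - frame_length)/frame_step + 1, frame_length)

The formula was given above. The output is 40 frames of three points each.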

band_edges_mel = shape_ops.frame(math_ops.linspace(_hertz_to_mel(lower_edge_hertz),
                             _hertz_to_mel(upper_edge_hertz),
                             num_mel_bins + 2), frame_length=3, frame_step=1)

Split out the three edge arrays, each with shape = (1, (num_mel_bins + 2 - frame_length)/frame_step + 1)

# Split the triples up and reshape them into [1, num_mel_bins] tensors.
lower_edge_mel, center_mel, upper_edge_mel = tuple(array_ops.reshape(
                            t, [1, num_mel_bins]) for t in array_ops.split(
                            band_edges_mel, 3, axis=1))
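
Framing with frame_length=3 and frame_step=1 is simply a sliding window of three neighbouring points. The sketch below uses stand-in edge values 0..41 instead of real mel values so the overlap is easy to read; `frame` is a hypothetical list-based stand-in for shape_ops.frame.

```python
def frame(values, frame_length, frame_step):
    """Slide a window of frame_length over values in steps of frame_step."""
    n_frames = 1 + (len(values) - frame_length) // frame_step
    return [values[i * frame_step: i * frame_step + frame_length]
            for i in range(n_frames)]


edges = [float(i) for i in range(42)]   # stand-in for the 42 mel points
triples = frame(edges, frame_length=3, frame_step=1)

# Column-wise split: one list each for the lower, center and upper edges.
lower, center, upper = (list(col) for col in zip(*triples))
print(len(triples), triples[0], triples[-1])
```

Each filter's center is the next filter's lower edge, which is why neighbouring triangles overlap by half.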

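3.2.4 Building the filters

The filters are triangular. The first filter starts at the first point, reaches its maximum at the second point, and returns to zero at the third. The second filter starts at the second point, peaks at the third, and returns to zero at the fourth, and so on. This is expressed by the formula below, where $H_m(k)$ denotes one of the filters:
$$ H_m(k)=\begin{cases} 0 & k \lt f(m-1) \\ \frac{k-f(m-1)}{f(m)-f(m-1)} & f(m-1) \le k \le f(m) \\ \frac{f(m+1)-k}{f(m+1)-f(m)} & f(m) \le k \le f(m+1) \\ 0 & k \gt f(m+1) \end{cases} $$

With M filters in total, m runs from 1 to M, and $ f() $ is the list of M+2 mel-spaced frequencies.

The code produces a (512, 40) matrix in which each column is one filter, 40 filters in total; the final pad restores the DC row, giving (513, 40).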

# Calculate lower and upper slopes for every spectrogram bin.
# Line segments are linear in the mel domain, not Hertz.
# Uses broadcasting, shape = ((512, 1) - (1, 40)) / ((1, 40) - (1, 40)) = (512, 40)
lower_slopes = (spectrogram_bins_mel - lower_edge_mel) / (
    center_mel - lower_edge_mel)
upper_slopes = (upper_edge_mel - spectrogram_bins_mel) / (
    upper_edge_mel - center_mel)

# Intersect the line segments with each other and zero.
mel_weights_matrix = math_ops.maximum(zero, math_ops.minimum(lower_slopes, upper_slopes))


# Pad bands_to_zero rows of zeros at the top
array_ops.pad(mel_weights_matrix, [[bands_to_zero, 0], [0, 0]], name=name)
# Plotting code
import matplotlib.pyplot as plt

freq = []  # frequency (Hz) of each spectrogram bin
# Bins are spaced evenly from 0 Hz to the Nyquist frequency (sample_rate / 2)
df = (sample_rate / 2) / (num_spectrogram_bins - 1)
for n in range(0, num_spectrogram_bins):
    freqs = int(n * df)
    freq.append(freqs)

# x- and y-axis data
for i in range(1, num_mel_bins + 1):
    plt.plot(freq, melbank[:, i-1])
    
plt.xlim((0, 8000))
plt.ylim((0, 1))

plt.xlabel('frequency(Hz)', fontsize=13)
plt.ylabel('amplitude', fontsize=13)
plt.title("The 40-filter Mel Filterbank")
plt.show()

The result is a bank of 40 filters, shown in the figure below:

[figure: the 40-filter mel filterbank (Ytq89x.png)]
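
The construction of the whole filterbank (sections 3.2.1 to 3.2.4) can be sketched in plain Python with nested lists in place of tensors. This is a simplified reading of the official code, not a replacement for tf.signal.linear_to_mel_weight_matrix; `mel_weight_matrix` is a hypothetical name and the loop-based layout is my own.

```python
import math


def hertz_to_mel(hz):
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_weight_matrix(num_mel_bins, num_spectrogram_bins, sample_rate,
                      lower_edge_hertz, upper_edge_hertz):
    """Rows are spectrogram bins, columns are triangular mel filters."""
    nyquist = sample_rate / 2.0
    lo, hi = hertz_to_mel(lower_edge_hertz), hertz_to_mel(upper_edge_hertz)
    step = (hi - lo) / (num_mel_bins + 1)
    edges = [lo + i * step for i in range(num_mel_bins + 2)]

    rows = [[0.0] * num_mel_bins]            # the DC bin is zeroed out
    for i in range(1, num_spectrogram_bins):
        bin_mel = hertz_to_mel(nyquist * i / (num_spectrogram_bins - 1))
        row = []
        for m in range(num_mel_bins):
            lower, center, upper = edges[m], edges[m + 1], edges[m + 2]
            lower_slope = (bin_mel - lower) / (center - lower)
            upper_slope = (upper - bin_mel) / (upper - center)
            # Intersect the rising and falling lines with each other and zero.
            row.append(max(0.0, min(lower_slope, upper_slope)))
        rows.append(row)
    return rows


weights = mel_weight_matrix(40, 513, 16000.0, 20.0, 4000.0)
print(len(weights), len(weights[0]))
```

Bins above upper_edge_hertz (4000 Hz here) get zero weight in every filter, so the upper half of the 513 bins does not contribute to the mel spectrogram.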

3.3 Matrix multiplication

input: spectrograms, shape=(n_frames, num_spectrogram_bins)

output: mel_spectrograms, shape=(n_frames, num_mel_bins)

$$ [n\_frames, n\_fft // 2 + 1] \times [n\_fft // 2 + 1, n\_mels] = [n\_frames, n\_mels] $$

  • Code:
spectrograms = tf.sqrt(spectrograms)
mel_spectrograms = tf.matmul(spectrograms, linear_to_mel_weight_matrix)

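The example on the official site has no sqrt() at this point, so feeding it the squared spectrogram gives wrong results; put differently, the square() in the earlier step is not needed.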
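
The shape bookkeeping of this step is easy to verify on a toy example. The numbers below are made up purely for illustration (2 frames, 3 spectrogram bins, 2 mel bins), and `matmul` is a small list-based helper rather than tf.matmul.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Toy magnitude spectrogram: (n_frames, n_bins) = (2, 3)
spectrogram = [[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]]
# Toy filterbank: (n_bins, n_mels) = (3, 2); each column is one filter
mel_weights = [[0.0, 0.0],
               [1.0, 0.0],
               [0.5, 0.5]]

mel_spectrogram = matmul(spectrogram, mel_weights)   # (n_frames, n_mels)
print(mel_spectrogram)
```

Each output value is a weighted sum of one frame's bins under one triangular filter, which is how 513 bins collapse to 40 mel features per frame.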

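0x04 Log transform

Q: Why take the log?

Taking the log narrows the gap between small and large signals. If the two originally differ by a factor of 10^5, after the log they differ by 100 dB; compared with 10^5, 100 is a far smaller gap, and the log spectrum is closer to how the human ear actually perceives sound.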

def log(mel_spectrograms):
    # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
    # shape: (1 + (wav-win_length)/win_step, n_mels)
    # Also known in the literature as filter banks
    log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-12)
    return log_mel_spectrograms
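
The 10^5 versus 100 dB figure, and the reason for the small constant inside the log, can both be checked with the standard library. The helper names are hypothetical; the 20 * log10 convention assumes the ratio is between amplitudes.

```python
import math


def amplitude_ratio_to_db(ratio):
    """Decibels for a ratio of amplitudes."""
    return 20.0 * math.log10(ratio)


def stabilized_log(values, eps=1e-12):
    """Natural log with a small offset so that zeros do not give -inf."""
    return [math.log(v + eps) for v in values]


print(amplitude_ratio_to_db(1e5))          # a 10^5 amplitude ratio in dB
print(stabilized_log([0.0, 1.0, 1e5]))     # zero input stays finite
```

Without the 1e-12 offset, a mel bin that is exactly zero would make the log undefined and break the DCT that follows.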

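0x05 Obtaining the MFCCs with the DCT

The DCT is used to obtain the cepstrum of the spectrum: the low-order components of the cepstrum are the spectral envelope, and the high-order components are the spectral detail.

  • Advantage of the DCT over the DFT:

    The DCT concentrates energy in the frequency domain better than the DFT (in plain words, it gathers the more important information of a signal such as an image together), so the unimportant frequency regions and coefficients can simply be cropped away.

  • Wikipedia:

The discrete cosine transform (DCT) is a transform related to the Fourier transform, similar to the discrete Fourier transform but using only real numbers.

Comparison of the 2D DCT (type II) with the discrete Fourier transform:

[figure: YdhwCV.png]

$$ f_m = \sum_{k=0}^{n-1} x_k \cos \left[\frac{\pi}{n} m \left(k+\frac{1}{2}\right) \right] $$

Some authors further multiply $f_{0}$ by $\frac{1}{\sqrt{2}}$ (together with an overall scale factor), which makes the DCT-II matrix orthogonal.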

def dct(log_mel_spectrograms, dct_counts):
    # shape (1 + (wav-win_length)/win_step, dct)
    mfccs = tf.signal.mfccs_from_log_mel_spectrograms(
        log_mel_spectrograms)
    # Keep only the first few low-order coefficients; most speech energy sits in the low-order terms, and 13 is a common choice.
    mfcc = mfccs[..., :dct_counts]
    return mfcc
  • Heat map of the resulting MFCCs
[figure: MFCC heat map (YdraQI.png)]
# Plotting code

import matplotlib.pyplot as plt

# %matplotlib inline
# %config InlineBackend.figure_format = 'png'

# Put time on the x-axis
filter_bank = tf.transpose(log_mel_spectrograms)
# Put time on the x-axis so that it runs horizontally
mfcc = tf.transpose(mfccs)

plt.figure(figsize=(14,10), dpi=500)

plt.subplot(211)
plt.imshow(np.flipud(filter_bank), 
                       cmap=plt.cm.jet, 
                       aspect='auto',
                       extent=[0,filter_bank.shape[1],
                                        0,filter_bank.shape[0]])
plt.ylabel('filters', fontsize=15)
plt.xlabel('Time(s)', fontsize=15)
plt.title("log_mel_spectrograms", fontsize=18)

plt.subplot(212)
plt.imshow(np.flipud(mfcc), 
                       cmap=plt.cm.jet,
                       aspect=0.5, 
                       extent=[0,mfcc.shape[1],
                                        0,mfcc.shape[0]])
plt.ylabel('MFCC Coefficients', fontsize=15)
plt.xlabel('Time(s)', fontsize=15)
plt.title("mfcc", fontsize=18)

# plt.savefig('mfcc_04.png')

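The formula used in the code:
$$ C(k) = \mathrm{Re}\Bigl\lbrace 2 e^{-j\pi \frac{k}{2N}} \cdot FFT\lbrace y(n)\rbrace\Bigr\rbrace $$

A Zhihu article, "From DFT to DCT", is very detailed: it derives the DCT from the DFT step by step. The final DCT formula is:
[figure: YdhpA1.png]

The derivation shows that the DFT and the DCT are equivalent under the substitution $n = m' - 1/2$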

# tensorflow/python/ops/signal/mfcc_ops.py,
# source of mfccs_from_log_mel_spectrograms()
dct2 = dct_ops.dct(log_mel_spectrograms, type=2)

# Orthonormal scaling
# When the DCT is applied as a matrix operation, the factor sqrt(1/(2N)) is commonly used
# rsqrt: reciprocal of sqrt(); num_mel_bins = 40
return dct2 * math_ops.rsqrt(math_ops.cast(
    num_mel_bins, dct2.dtype) * 2.0)

# tensorflow/python/ops/signal/dct_ops.py
# source of dct_ops.dct
scale = 2.0 * _math_ops.exp(
    _math_ops.complex(
        zero, -_math_ops.range(axis_dim_float) * _math.pi * 0.5 /
        axis_dim_float))

# TODO(rjryan): Benchmark performance and memory usage of the various
# approaches to computing a DCT via the RFFT.
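# fft_length = 2 * axis_dim pads the signal to twice its length; the aim is to build a real, even signal from the real signal
# _math_ops.real() returns the real part of a complex number
dct2 = _math_ops.real(
    fft_ops.rfft(
        input, fft_length=[2 * axis_dim])[..., :axis_dim] * scale)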

return dct2
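
To convince myself that the FFT-based formula really is a DCT-II, the sketch below computes it twice on a tiny input: once from the cosine sum and once as Re{2 e^{-jπk/2N} · FFT} of the input zero-padded to length 2N. It uses a naive DFT from the standard library, so it is only suitable for small inputs; the function names and the toy input are my own, and the final scaling by 1/sqrt(2N) follows the mfcc_ops.py excerpt above.

```python
import cmath
import math


def dct2_direct(x):
    """Unnormalized DCT-II from the cosine definition."""
    n = len(x)
    return [2.0 * sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                      for i, v in enumerate(x))
            for k in range(n)]


def dct2_via_fft(x):
    """Same DCT-II via a length-2N DFT of the zero-padded input."""
    n = len(x)
    padded = list(x) + [0.0] * n
    out = []
    for k in range(n):
        dft_k = sum(v * cmath.exp(-2j * cmath.pi * k * i / (2 * n))
                    for i, v in enumerate(padded))
        scale = 2.0 * cmath.exp(-1j * cmath.pi * k / (2 * n))
        out.append((dft_k * scale).real)
    return out


def mfcc_from_log_mel(log_mel, num_coefficients):
    """Scale by 1 / sqrt(2N) and keep the first coefficients."""
    n = len(log_mel)
    return [c / math.sqrt(2.0 * n) for c in dct2_via_fft(log_mel)][:num_coefficients]


log_mel = [1.0, 2.0, 3.0, 4.0]   # toy "log-mel" frame
print(dct2_direct(log_mel))
print(dct2_via_fft(log_mel))
print(mfcc_from_log_mel(log_mel, 2))
```

The two DCT routines agree up to rounding, and truncating the result mirrors the mfccs[..., :dct_counts] slice in dct() above.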
