Building seq2seq Input Data for Speech Recognition

Most hands-on seq2seq tutorials deal with translation problems, such as English to French. The features fed to the model are built by first creating a character- or word-level dictionary and then turning each sentence into vectors. The final input is a three-dimensional array made up of 0s and 1s (one-hot encoding).
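
A minimal sketch with a made-up two-sentence "corpus" shows the shape of that array (the texts and variable names below are purely illustrative):

import numpy as np

# toy corpus (assumed): two "sentences" over the characters {'a', 'b'}
texts = ['ab', 'ba']
chars = sorted(set(''.join(texts)))               # the character dictionary
char_to_idx = {c: i for i, c in enumerate(chars)}
max_len = max(len(t) for t in texts)

# pre-allocate (num_samples, max_len, dictionary_size) and set one 1 per character
data = np.zeros((len(texts), max_len, len(chars)), dtype='float32')
for i, text in enumerate(texts):
    for t, char in enumerate(text):
        data[i, t, char_to_idx[char]] = 1.0
print(data.shape)  # (2, 2, 2)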

Building the input for machine translation: reading the official Keras example

The downloaded data file fra.txt looks roughly like this:

Go.		Va !
Run!	Cours !
Run!	Courez !
Wow!	Ça alors !
Fire!	Au feu !
Help!	À l'aide !
Jump.	Saute.
Stop!	Ça suffit !
Stop!	Stop !
Stop!	Arrête-toi !

The official preprocessing code is as follows:

import numpy as np

batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.
# Path to the data txt file on disk.
data_path = 'fra-eng/fra.txt'

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    # split each line into the source text, the translation, and a trailing
    # tab-separated field (attribution metadata in newer fra.txt files) that is discarded
    input_text, target_text, _ = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    # build the set of input characters
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    # build the set of target characters
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
# sentences differ in length, so use the longest one to fix the time dimension of the feature arrays
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

# map every character in each dictionary to an integer index
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])

# each sentence becomes an n x m matrix; with l sentences we pre-allocate a 3-D array (l, n, m):
# the first axis is the number of samples, the second the (maximum) sentence length,
# and the third the dictionary size, since every character is one-hot encoded
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# one-hot encode every sample
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        # the target output is shifted one time step earlier
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
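    # pad the remaining time steps of both decoder arrays with the space character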
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.
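
To make the one-time-step offset between decoder_input_data and decoder_target_data concrete, here is a minimal sketch (the toy target sentence is made up and is not part of the official script):

# at step t the decoder is fed sample[t] and is trained to predict sample[t + 1];
# positions past the end are padded with the space character, as in the loop above
sample = '\t' + 'Va !' + '\n'
for t, char in enumerate(sample):
    nxt = sample[t + 1] if t + 1 < len(sample) else ' '
    print(repr(char), '->', repr(nxt))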

Building the input for speech recognition: reading the source of a seq2seq speech-recognition example

For an audio sequence we could, in theory, do the same thing: collect every distinct value that appears across all sequences, sort them, and build a one-hot vector for each time step. Why only in theory? Suppose the sequences are normalized to the range 0-1. Every sample of a speech waveform can be treated as distinct, since the signal varies continuously, so it can take an effectively unbounded number of values in that range. Unlike a dictionary built from words, a "dictionary" of audio values could easily reach millions or tens of millions of entries. A word-embedding-style approach is therefore possible in theory but not in practice, because it would demand an enormous amount of memory and processing time (sparse coding is not considered here).
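
A quick back-of-the-envelope calculation illustrates the scale; the sampling rate and quantization below are assumed, illustrative numbers:

# one second of 16 kHz audio, coarsely quantized to 16 bits, already gives
# 2**16 = 65536 distinct "symbols" and 16000 time steps, i.e. about a billion
# one-hot entries per second of speech
samples_per_second = 16000
vocab_size = 2 ** 16
print(samples_per_second * vocab_size)  # 1048576000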

The main vector-construction code is as follows:

import numpy as np
import librosa
from python_speech_features import mfcc


def audioToInputVector(audio_filename, numcep, numcontext):
    """
    Given a WAV audio file at ``audio_filename``, calculates ``numcep`` MFCC features
    at every 0.01s time step with a window length of 0.025s. Appends ``numcontext``
    context frames to the left and right of each time step, and returns this data
    in a numpy array.
    Borrowed from Mozilla's Deep Speech and slightly modified.
    https://github.com/mozilla/DeepSpeech
    """
    # load the audio file with librosa
    audio, fs = librosa.load(audio_filename)
	
    # compute MFCC coefficients with python_speech_features' mfcc()
    features = mfcc(audio, samplerate=fs, numcep=numcep, nfft=551)
    # features = librosa.feature.mfcc(y=audio,
    #                                 sr=fs,
    #                                 n_fft=551,
    #                                 n_mfcc=numcep).T

    # We only keep every second feature (BiRNN stride = 2)
    features = features[::2]

    # One stride per time step in the input
    num_strides = len(features)
	

    # Add empty initial and final contexts
    # (numcontext = number of context frames taken on each side of the current frame)
    empty_context = np.zeros((numcontext, numcep), dtype=features.dtype)
    # pad at both ends so the first and last frames also get full context
    features = np.concatenate((empty_context, features, empty_context))
    
    # numcontext (past) + 1 (present) + numcontext (future) 
    # after adding context, the first dimension of each feature window is window_size
    window_size = 2 * numcontext + 1
    
    # use np.lib.stride_tricks.as_strided() to view `features` as overlapping
    # windows of shape (num_strides, window_size, numcep); the strides
    # (features.strides[0], features.strides[0], features.strides[1]) make
    # consecutive windows advance by exactly one frame
    train_inputs = np.lib.stride_tricks.as_strided(
        features,
        (num_strides, window_size, numcep),
        (features.strides[0], features.strides[0], features.strides[1]),
        writeable=False)

    # flatten the window and coefficient dimensions into one feature vector per time step
    train_inputs = np.reshape(train_inputs, [num_strides, -1])
    
    # Copy the strided array so that we can write to it safely
    train_inputs = np.copy(train_inputs)
    # normalize to zero mean and unit variance
    train_inputs = (train_inputs - np.mean(train_inputs)) / np.std(train_inputs)

    # Return results
    return train_inputs
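
A typical call might look like this; the file name 'example.wav' is a placeholder, and numcep=26 with numcontext=9 are values commonly seen in Deep Speech-style setups (an assumption here, not something fixed by the function):

# hypothetical usage: with numcep=26 and numcontext=9 each strided time step
# becomes a feature vector of length (2 * 9 + 1) * 26 = 494
train_inputs = audioToInputVector('example.wav', numcep=26, numcontext=9)
print(train_inputs.shape)  # (num_strides, 494)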