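Many hands-on seq2seq examples are translation tasks, such as English to French. The features fed to the model are built by first creating a dictionary of characters or words and then turning each sentence into a vector. The final input is a three-dimensional array made up of 0s and 1s.
Building the input for machine translation: a walkthrough of the official Keras example source
The downloaded data file, fra.txt, looks roughly like this (the fields on each line are separated by tabs):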
Go. Va !
Run! Cours !
Run! Courez !
Wow! Ça alors !
Fire! Au feu !
Help! À l'aide !
Jump. Saute.
Stop! Ça suffit !
Stop! Stop !
Stop! Arrête-toi !
The official preprocessing source is as follows:
import numpy as np

batch_size = 64 # Batch size for training.
epochs = 100 # Number of epochs to train for.
latent_dim = 256 # Latent dimensionality of the encoding space.
num_samples = 10000 # Number of samples to train on.
# Path to the data txt file on disk.
data_path = 'fra-eng/fra.txt'
# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    # Split each line into the input sentence and its translation
    input_text, target_text, _ = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    # Build the input character dictionary
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    # Build the output character dictionary
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
# Sentences differ in length, so the longest one fixes the sequence-length dimension of the feature arrays
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
# Map every character in the dictionary to an integer index
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])
# One sentence is vectorized as an n x m matrix; with l sentences, preallocate a 3D array of shape (l, n, m)
# Axis 0 is the number of samples, axis 1 the (maximum) sentence length, axis 2 the dictionary size
# Axis 2 is the dictionary size rather than 1 because each character is one-hot encoded
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
# Fill in the one-hot encodings
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    # Pad the remaining timesteps with the space character
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        # Shift the target output one timestep earlier
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.
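The one-timestep offset between the decoder input and the decoder target is easy to lose track of, so here is a minimal sketch on a single toy sentence. The names `dec_in`, `dec_out` and `decode` are made up for illustration and are not part of the Keras example; the encoding logic follows the loop above.

```python
import numpy as np

# Toy target sentence: "\t" marks the start and "\n" the end, as in the code above.
target_text = "\t" + "Va !" + "\n"
chars = sorted(set(target_text))
index = {c: i for i, c in enumerate(chars)}
reverse = {i: c for c, i in index.items()}

max_len = len(target_text)
dec_in = np.zeros((max_len, len(chars)), dtype="float32")   # decoder input
dec_out = np.zeros((max_len, len(chars)), dtype="float32")  # decoder target

for t, ch in enumerate(target_text):
    dec_in[t, index[ch]] = 1.0
    if t > 0:
        # The target at step t - 1 is the input character at step t.
        dec_out[t - 1, index[ch]] = 1.0
# The last target step has no character left, so pad it with a space.
dec_out[t:, index[" "]] = 1.0

def decode(one_hot):
    """Turn a (timesteps, vocab) one-hot matrix back into a string."""
    return "".join(reverse[int(i)] for i in one_hot.argmax(axis=1))

decoded_in = decode(dec_in)
decoded_out = decode(dec_out)
print(repr(decoded_in))
print(repr(decoded_out))
```

Decoding the two arrays shows that the target is the input shifted one step to the left, without the start character and with one padding space at the end. At every timestep the decoder therefore sees the current character and is trained to predict the next one.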
Building the input for speech recognition: a walkthrough of seq2seq speech-recognition source code
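For a speech sequence we could, in theory, do the same thing: collect every distinct value that occurs in the given sequences, sort them into a dictionary, and build a vector for each sequence. Why only in theory? Suppose the sequences are normalized to the range 0 to 1. Almost every sample of a speech signal differs from the others because the signal varies continuously, so within that range it can take an effectively unlimited number of values. Unlike a dictionary built from words, a dictionary for speech sequences might need millions or tens of millions of entries. A dictionary-based encoding in the style of word embeddings is therefore possible in theory but impractical, because it would require enormous amounts of memory and processing time. (Sparse encodings are not considered here.)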
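To get a feel for the scale, the sketch below compares the number of distinct values in a normalized floating-point signal with the number of distinct characters in a few text lines. The signal is synthetic noise standing in for one second of audio at an assumed 16 kHz sampling rate, not real speech. Real recordings stored as 16-bit integers would be capped at 65,536 levels, which is still far larger than a character dictionary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one second of audio at 16 kHz (an assumption, not real speech).
signal = rng.standard_normal(16000)
# Normalize to the range [0, 1].
signal = (signal - signal.min()) / (signal.max() - signal.min())

# Size of a "dictionary" built from the distinct sample values.
n_unique_samples = np.unique(signal).size

# A few lines in the style of fra.txt, for comparison.
text = "Go.\tVa !\nRun!\tCours !\nStop!\tStop !"
n_unique_chars = len(set(text))

print("audio samples:", signal.size, "distinct values:", n_unique_samples)
print("characters:", len(text), "distinct characters:", n_unique_chars)
```

Almost every sample introduces a new dictionary entry, so the dictionary grows with the amount of audio instead of levelling off the way a character dictionary does.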
The main source code for building the input vectors is as follows:
import librosa
import numpy as np
from python_speech_features import mfcc

def audioToInputVector(audio_filename, numcep, numcontext):
    """
    Given a WAV audio file at ``audio_filename``, calculates ``numcep`` MFCC features
    at every 0.01s time step with a window length of 0.025s. Appends ``numcontext``
    context frames to the left and right of each time step, and returns this data
    in a numpy array.
    Borrowed from Mozilla's Deep Speech and slightly modified.
    https://github.com/mozilla/DeepSpeech
    """
    # Load the audio file with librosa
    audio, fs = librosa.load(audio_filename)
    # Get mfcc coefficients with the mfcc function from python_speech_features
    features = mfcc(audio, samplerate=fs, numcep=numcep, nfft=551)
    # features = librosa.feature.mfcc(y=audio,
    #                                 sr=fs,
    #                                 n_fft=551,
    #                                 n_mfcc=numcep).T
    # We only keep every second feature (BiRNN stride = 2)
    features = features[::2]
    # One stride per time step in the input
    num_strides = len(features)
    # Add empty initial and final contexts
    # numcontext is the number of frames before and after the current frame used as context
    empty_context = np.zeros((numcontext, numcep), dtype=features.dtype)
    # Pad both ends so that the first and last frames also have context
    features = np.concatenate((empty_context, features, empty_context))
    # numcontext (past) + 1 (present) + numcontext (future)
    # With context included, each time step covers window_size frames
    window_size = 2 * numcontext + 1
    # Use np.lib.stride_tricks.as_strided() to view features as overlapping
    # windows of shape (num_strides, window_size, numcep), with strides
    # (features.strides[0], features.strides[0], features.strides[1])
    train_inputs = np.lib.stride_tricks.as_strided(
        features,
        (num_strides, window_size, numcep),
        (features.strides[0], features.strides[0], features.strides[1]),
        writeable=False)
    # Flatten the second and third dimensions
    train_inputs = np.reshape(train_inputs, [num_strides, -1])
    # Copy the strided array so that we can write to it safely
    train_inputs = np.copy(train_inputs)
    # Normalize to zero mean and unit standard deviation
    train_inputs = (train_inputs - np.mean(train_inputs)) / np.std(train_inputs)
    # Return results
    return train_inputs
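The `as_strided` call is the least obvious step, so the following sketch reproduces just the padding and windowing on a tiny hand-made feature matrix and checks it against a plain slicing loop. `context_windows` is a hypothetical helper written for this illustration; it leaves out audio loading, MFCC extraction and normalization.

```python
import numpy as np

def context_windows(features, numcontext):
    """Hypothetical helper: the padding and windowing steps only."""
    num_strides, numcep = features.shape
    pad = np.zeros((numcontext, numcep), dtype=features.dtype)
    padded = np.concatenate((pad, features, pad))
    window_size = 2 * numcontext + 1
    # Element [i, j, k] of the view is padded[i + j, k]: window i starts at
    # row i and spans window_size consecutive rows, with no data copied yet.
    windows = np.lib.stride_tricks.as_strided(
        padded,
        shape=(num_strides, window_size, numcep),
        strides=(padded.strides[0], padded.strides[0], padded.strides[1]),
        writeable=False)
    # Flattening the overlapping view forces NumPy to materialize a copy.
    return np.reshape(windows, (num_strides, -1))

# Four "frames" with two coefficients each: rows [0, 1], [2, 3], [4, 5], [6, 7].
features = np.arange(8, dtype=np.float64).reshape(4, 2)
fast = context_windows(features, numcontext=1)

# Reference result built with ordinary slicing.
padded = np.concatenate((np.zeros((1, 2)), features, np.zeros((1, 2))))
slow = np.stack([padded[i:i + 3].ravel() for i in range(4)])

print(fast.shape)
print(np.array_equal(fast, slow))
```

Each output row holds the previous frame, the current frame and the next frame side by side, so its length is `(2 * numcontext + 1) * numcep`. The first row starts with zeros because the first frame has no real predecessor.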