Building an LSTM-Based Story Generator with Keras

Figure: how an LSTM network works

What Is an LSTM Network?

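LSTM (Long Short-Term Memory) is a special kind of recurrent neural network (RNN).
By updating its cell state, an LSTM can learn long-term dependencies in sequence data, and it is now widely used in areas such as machine translation and speech recognition.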
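
To make the cell-state update concrete, here is a minimal sketch of a single LSTM time step in plain Python. It assumes a scalar input, hidden state, and cell state, and the weight values are hypothetical numbers chosen only for illustration; a real Keras LSTM layer uses weight matrices and learns them during training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for a scalar input, hidden state, and cell state.

    `w` maps each gate name to a hypothetical (w_x, w_h, bias) triple.
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # candidate value to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose

    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # new hidden state (output of this step)
    return h, c

# Illustrative weights only; they are not taken from any trained model
weights = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

The forget gate decides how much of the previous cell state survives each step, which is what lets information from early in a sequence persist.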

Why LSTMs Are Needed

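As you read this article, you understand each word in context, based on your understanding of the words that came before it.
You cannot start from an arbitrary point and immediately grasp the meaning of the text; rather, your brain builds up context as you read further, and only then does the text make sense.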

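A major shortcoming of traditional neural networks is that they do not really work the way neurons in the human brain do: they can typically draw only on short-term memory or recent information.
Once a data sequence becomes long, it is hard to carry information from its early steps through to the later ones.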
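
A rough way to see why early information fades is to look at a deliberately simplified linear recurrence in which the signal is multiplied by the same weight at every step. The weight of 0.5 below is an assumption chosen for illustration, and real networks are nonlinear, so treat this as an intuition rather than a model of any particular network.

```python
def carried_signal(steps, recurrent_weight=0.5):
    """Fraction of an initial signal left after `steps` repeated multiplications."""
    return recurrent_weight ** steps

for steps in (1, 5, 20):
    print(steps, carried_signal(steps))
```

After 20 steps less than one millionth of the original signal remains, which is why a plain recurrent network struggles to connect a word at the start of a long sentence to a prediction at its end.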

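Consider the following two sentences.
If we want to predict the content of “<…>” in the first sentence, the best prediction is “Telugu”, because the context shows that the sentence is about the native language of Hyderabad.
Such a prediction is elementary for a human but very difficult for an artificial neural network.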

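The word “Hyderabad” indicates that the language should be “Telugu”, but “Hyderabad” appears at the very beginning of the sentence.
To predict accurately, the network therefore has to remember the entire sequence of words.
This is exactly what an LSTM can do.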

Implementing an LSTM in Code

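In this article we develop a story generator model with an LSTM network. Natural language processing (NLP) is used for data preprocessing, and a bidirectional LSTM is used to build the model.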

Step 1: Preparing the Dataset

Create a text corpus of short stories covering a variety of genres and save it as “stories.txt”.
An excerpt from the corpus is shown below:

Frozen grass crunched beneath the steps of a shambling man. His shoes were crusted and worn, and dirty toes protruded from holes in the sides. His quivering eye scanned the surroundings: a freshly paved path through the grass, which led to a double swingset, and a picnic table off to the side with a group of parents lounging in bundles, huddled to keep warm. Squeaky clean-and-combed children giggled and bounced as they weaved through the pathways with their hot breaths escaping into the air like smoke.

Step 2: Importing the Libraries and Loading the Data

Next, we import the necessary libraries and load the dataset.
We use the Keras framework running on TensorFlow 2.0.

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku
import numpy as np
import tensorflow as tf
import pickle
# Read the whole corpus into one string and close the file afterwards
with open('stories.txt', encoding="utf8") as f:
    data = f.read()

Step 3: Preprocessing the Data with NLP Tools

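First, we convert all the data to lowercase and split it by line to obtain a Python list of sentences.
We lowercase because a word means the same thing regardless of case: “Doctor” and “doctor” both refer to a doctor, but the model would otherwise treat them as two different words.

Then we encode the words so that they can be turned into vectors. The tokenizer builds an index for every word and exposes it through the word_index attribute, a dictionary of key-value pairs in which the key is a word and the value is that word's token.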

# Converting the text to lowercase and splitting it
corpus = data.lower().split("\n")
# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
print(total_words)
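
To show what such a word index looks like, here is a small plain-Python sketch that roughly mirrors the idea: more frequent words get smaller indices, starting at 1. The function name build_word_index and the toy corpus are hypothetical, and the real Keras Tokenizer also filters punctuation, which this sketch leaves out.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_corpus = ["the cat sat", "the dog sat", "the end"]
word_index = build_word_index(toy_corpus)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding
print(word_index, vocab_size)
```

The extra 1 in the vocabulary size is the same adjustment as in total_words: index 0 never refers to a word, so the output layer needs one more slot than there are words.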

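The next step converts each sentence into a list of values based on these token indices. A line of text such as “frozen grass crunched beneath the steps” becomes a list of the tokens that correspond to its words. In the code, each line is also expanded into all of its prefixes of two or more tokens, so one line yields several training sequences.

We then make every sequence the same length, because sequences of varying length are difficult to train a neural network on. To do this, we go through all the sequences and find the longest one; once we know the maximum length, we pad all the sequences to match it.

We also need to split the data into inputs (features) and outputs (labels). The input is everything in a sequence except the last token, and the output is the last token.

Finally, we one-hot encode the labels, because this is really a classification problem: given a sequence of words, we predict the next word by classifying over all the words in the corpus.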

# create input sequences using list of tokens
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
        
# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])
print(max_sequence_len)
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]

label = ku.to_categorical(label, num_classes=total_words)
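
To make the resulting shapes concrete, the following plain-Python sketch walks one tokenized line through the same steps: prefix expansion, pre-padding with zeros, splitting off the last token, and one-hot encoding it. The helper name make_training_pairs and the token values are made up for illustration.

```python
def make_training_pairs(token_lists, vocab_size):
    """Turn tokenized lines into padded predictor rows and one-hot labels."""
    # Every prefix of two or more tokens becomes one training sequence
    sequences = [tokens[:i + 1] for tokens in token_lists for i in range(1, len(tokens))]
    max_len = max(len(seq) for seq in sequences)
    # Pre-pad with 0 so that all sequences share the same length
    padded = [[0] * (max_len - len(seq)) + seq for seq in sequences]
    # Features are all tokens but the last; the label is the last token
    predictors = [seq[:-1] for seq in padded]
    labels = []
    for seq in padded:
        one_hot = [0] * vocab_size
        one_hot[seq[-1]] = 1
        labels.append(one_hot)
    return predictors, labels

predictors, labels = make_training_pairs([[4, 2, 7, 1]], vocab_size=8)
print(predictors)  # [[0, 0, 4], [0, 4, 2], [4, 2, 7]]
print([row.index(1) for row in labels])  # [2, 7, 1]
```

A single four-token line therefore produces three training examples, each asking the model to predict the next token from a progressively longer context.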

Step 4: Building the Model

With the training data ready, we can build the model:

model = Sequential()
model.add(Embedding(total_words, 300, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(200, return_sequences = True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
# The number of units must be an integer, so use floor division
model.add(Dense(total_words // 2, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# summary() prints the table itself and returns None
model.summary()

history = model.fit(predictors, label, epochs=200, verbose=0)

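The first layer is the embedding layer. Its first argument is the number of words the model handles; we want to handle all of them, so we pass total_words. The second argument is the dimensionality of the word vectors; it can be tuned freely, and different values give different predictions. The third argument is the input sequence length; because each input is the original sequence without its last token, we subtract one.
Next come the bidirectional LSTM layer, a dropout layer, a second LSTM layer, and the Dense layers.
For the loss function we use categorical cross-entropy, and for the optimizer we choose Adam.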
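
As a worked illustration of what the softmax output and the categorical cross-entropy loss compute for one training example, here is a small sketch in plain Python. The three logits are invented numbers standing in for the scores of a three-word vocabulary; Keras performs the same calculation over batches and the full vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)

probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1, 0, 0], probs)
print(probs, loss)
```

Because the label is one-hot, the loss reduces to the negative log of the probability given to the correct next word, so it shrinks toward zero as the model grows more confident in the right answer.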

Step 5: Analyzing the Results

To assess the trained model, we look mainly at its accuracy and loss.

import matplotlib.pyplot as plt
acc = history.history['accuracy']
loss = history.history['loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'b', label='Training accuracy')
plt.title('Training accuracy')
plt.legend()  # show the label of the accuracy curve
plt.figure()  # start a second figure for the loss curve
plt.plot(epochs, loss, 'b', label='Training Loss')
plt.title('Training loss')
plt.legend()
plt.show()

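The curves show that training accuracy rises steadily while the loss keeps falling, which indicates that the model is fitting the training data well.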

Step 6: Saving the Model

The following code saves the trained model so that it can be deployed later.

# serialize model to JSON
model_json=model.to_json()
with open("model.json","w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")
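
For completeness, a saved model could be restored along the following lines. This is a sketch, not part of the original notebook: the helper name load_saved_model is hypothetical, and it assumes the two files written above exist in the working directory. model_from_json and load_weights are standard Keras calls.

```python
def load_saved_model(json_path="model.json", weights_path="model.h5"):
    """Rebuild the architecture from JSON, then restore the trained weights."""
    # Imported lazily so the helper can be defined without loading TensorFlow
    from tensorflow.keras.models import model_from_json

    with open(json_path) as json_file:
        model = model_from_json(json_file.read())
    model.load_weights(weights_path)
    # Compiling is only needed for further training or evaluation
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
```

Note that generating text with a restored model also requires the fitted tokenizer and the value of max_sequence_len, so those would need to be stored alongside the model files.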

Step 7: Making Predictions

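Next, we use the trained model to predict words and generate a story.
The user first supplies a seed sentence, which is preprocessed and fed into the LSTM model to obtain one predicted word. Repeating this process produces the story. The code is as follows: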

seed_text = "As i walked, my heart sank"
next_words = 100
for _ in range(next_words):
    # Encode the text so far and pad it to the model's input length
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict_classes was removed in TensorFlow 2.6; argmax over predict gives the same index
    predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
    # Map the predicted index back to its word
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)

The generated story is shown below:

As i walked, my heart sank until he was alarmed by the voice of the hunter
and realised what could have happened with him he flew away the boy crunched
before it disguised herself as another effort to pull out the bush which he did
the next was a small tree which the child had to struggle a lot to pull out
finally the old man showed him a bigger tree and asked the child to pull it
out the boy did so with ease and they walked on the morning she was asked
how she had slept as a while they came back with me
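
The generation procedure is greedy: at each step the single most likely word is appended and the extended text is fed back in. The sketch below isolates that loop in plain Python, using a hypothetical bigram lookup table (bigrams) as a stand-in for the trained model so that it runs without TensorFlow.

```python
def generate(seed_text, next_words, predict_next):
    """Append up to `next_words` words, one prediction at a time."""
    words = seed_text.lower().split()
    for _ in range(next_words):
        next_word = predict_next(words)
        if not next_word:  # stop when the predictor has nothing to offer
            break
        words.append(next_word)
    return " ".join(words)

# Hypothetical stand-in for the trained model: a fixed bigram lookup table
bigrams = {"heart": "sank", "sank": "until", "until": "dawn"}

def toy_predict(words):
    return bigrams.get(words[-1], "")

story = generate("my heart", 5, toy_predict)
print(story)  # my heart sank until dawn
```

Unlike the loop in the article, this sketch stops early when no word is predicted instead of appending an empty string; that is a design choice of the sketch, not behavior of the original code.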

Full text corpus: https://gist.github.com/jayashree8/08448d1b6610e444dc7a033ef4a5aae7#file-stories-txt
Source code for this article: https://github.com/jayashree8/Story_Generator/blob/master/Story_Generator.ipynb
Author: Jayashree domala
Translated by the deephub translation team: Oliver Lee
