LSTM Text Generation (text_generation)

1. Text generation (char-level)

Text generation with an LSTM.
Let's walk through a small example to see how an LSTM works in practice.
We use a biography of Winston Churchill as the training corpus.

# -*- coding: utf-8 -*-
'''
Text generation with an RNN (LSTM), using a biography of Winston Churchill as the training corpus.
The prediction task here is simple: given the preceding characters, what is the next character?
For example: "importan" -> "t", "Winsto" -> "n", "Britai" -> "n".
'''
import numpy as np
from keras.models import Sequential
from keras.layers import Dense,Dropout,LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

raw_text = open("input/Winston_Churchil.txt",encoding="utf-8").read()
raw_text = raw_text.lower()

chars = sorted(list(set(raw_text)))

char_to_int = dict((c,i) for i,c in enumerate(chars))
int_to_char = dict((i,c) for i,c in enumerate(chars))

'''
Build the training data.
We need to turn raw_text into (x, y) pairs we can train on:
x is the window of preceding characters, y is the character that follows it.
'''
seg_length = 100
x = []
y = []
for i in range(0,len(raw_text)-seg_length):
    given = raw_text[i:i+seg_length]
    predict = raw_text[i+seg_length]
    x.append([char_to_int[char] for char in given])
    y.append(char_to_int[predict])
    
    
'''
At this point, the representation above is essentially just an index encoding (something like a bag of characters).
Next we do two things:
1. We already have a numeric (index) representation of the input; we reshape it into the array format the LSTM expects: [samples, time steps, features].
2. For the output, as we learned with Word2Vec, predicting a one-hot vector gives better results than directly regressing an exact y value.
'''
n_patterns = len(x)
n_vocab = len(chars)
# reshape x into the format the LSTM expects: [samples, time steps, features]
x = np.reshape(x,(n_patterns,seg_length,1))
# simple normalization to the 0-1 range
x = x / float(n_vocab)
# one-hot encode the output
y = np_utils.to_categorical(y)
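'''
A quick sanity check on the prepared tensors (a minimal sketch, not part of the original code).
With this corpus it should report the same 276730 patterns seen in the training log below.
'''
print(x.shape)           # (n_patterns, 100, 1) -> [samples, time steps, features]
print(y.shape)           # (n_patterns, n_vocab) -> one-hot encoded next character
print(x.min(), x.max())  # inputs scaled into the 0-1 range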

'''
Build the model (LSTM)
'''
model = Sequential()
model.add(LSTM(256,input_shape=(x.shape[1],x.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1],activation="softmax"))
model.compile(loss="categorical_crossentropy",optimizer="adam")
model.fit(x,y,nb_epoch=50,batch_size=4096)
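'''
ModelCheckpoint is imported above but never used. As a hedged sketch (the filepath pattern below
is illustrative, not from the original code), the best weights of such a long run could be kept
by attaching a checkpoint callback to fit:
'''
# checkpoint = ModelCheckpoint('char-lstm-{epoch:02d}-{loss:.4f}.hdf5',
#                              monitor='loss', save_best_only=True)
# model.fit(x, y, nb_epoch=50, batch_size=4096, callbacks=[checkpoint])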


'''
Test code: see how well the trained LSTM performs
'''
def predict_next(input_array):
    x = np.reshape(input_array,(1,seg_length,1))
    x = x / float(n_vocab)
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    res = []
    for c in raw_input[(len(raw_input)-seg_length):]:
        res.append(char_to_int[c])
    return res
def y_to_char(y):
    largest_index = y.argmax()
    c = int_to_char[largest_index]
    return c

def generate_article(init,rounds=200):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_char(predict_next(string_to_index(in_string)))
        in_string += n
    return in_string

    
init = 'His object in coming to New York was to engage officers for that service. He came at an opportune moment'
article = generate_article(init)
print(article)
   

Training log

Epoch 1/50
276730/276730 [==============================] - 197s - loss: 3.1120   
Epoch 2/50
276730/276730 [==============================] - 197s - loss: 3.0227   
Epoch 3/50
276730/276730 [==============================] - 197s - loss: 2.9910   
Epoch 4/50
276730/276730 [==============================] - 197s - loss: 2.9337   
Epoch 5/50
276730/276730 [==============================] - 197s - loss: 2.8971   
Epoch 6/50
276730/276730 [==============================] - 197s - loss: 2.8784   
Epoch 7/50
276730/276730 [==============================] - 197s - loss: 2.8640   
Epoch 8/50
276730/276730 [==============================] - 197s - loss: 2.8516   
Epoch 9/50
276730/276730 [==============================] - 197s - loss: 2.8384   
Epoch 10/50
276730/276730 [==============================] - 197s - loss: 2.8254   
Epoch 11/50
276730/276730 [==============================] - 197s - loss: 2.8133   
Epoch 12/50
276730/276730 [==============================] - 197s - loss: 2.8032   
Epoch 13/50
276730/276730 [==============================] - 197s - loss: 2.7913   
Epoch 14/50
276730/276730 [==============================] - 197s - loss: 2.7831   
Epoch 15/50
276730/276730 [==============================] - 197s - loss: 2.7744   
Epoch 16/50
276730/276730 [==============================] - 197s - loss: 2.7672   
Epoch 17/50
276730/276730 [==============================] - 197s - loss: 2.7601   
Epoch 18/50
276730/276730 [==============================] - 197s - loss: 2.7540   
Epoch 19/50
276730/276730 [==============================] - 197s - loss: 2.7477   
Epoch 20/50
276730/276730 [==============================] - 197s - loss: 2.7418   
Epoch 21/50
276730/276730 [==============================] - 197s - loss: 2.7360   
Epoch 22/50
276730/276730 [==============================] - 197s - loss: 2.7296   
Epoch 23/50
276730/276730 [==============================] - 197s - loss: 2.7238   
Epoch 24/50
276730/276730 [==============================] - 197s - loss: 2.7180   
Epoch 25/50
276730/276730 [==============================] - 197s - loss: 2.7113   
Epoch 26/50
276730/276730 [==============================] - 197s - loss: 2.7055   
Epoch 27/50
276730/276730 [==============================] - 197s - loss: 2.7000   
Epoch 28/50
276730/276730 [==============================] - 197s - loss: 2.6934   
Epoch 29/50
276730/276730 [==============================] - 197s - loss: 2.6859   
Epoch 30/50
276730/276730 [==============================] - 197s - loss: 2.6800   
Epoch 31/50
276730/276730 [==============================] - 197s - loss: 2.6741   
Epoch 32/50
276730/276730 [==============================] - 197s - loss: 2.6669   
Epoch 33/50
276730/276730 [==============================] - 197s - loss: 2.6593   
Epoch 34/50
276730/276730 [==============================] - 197s - loss: 2.6529   
Epoch 35/50
276730/276730 [==============================] - 197s - loss: 2.6461   
Epoch 36/50
276730/276730 [==============================] - 197s - loss: 2.6385   
Epoch 37/50
276730/276730 [==============================] - 197s - loss: 2.6320   
Epoch 38/50
276730/276730 [==============================] - 197s - loss: 2.6249   
Epoch 39/50
276730/276730 [==============================] - 197s - loss: 2.6187   
Epoch 40/50
276730/276730 [==============================] - 197s - loss: 2.6110   
Epoch 41/50
276730/276730 [==============================] - 192s - loss: 2.6039   
Epoch 42/50
276730/276730 [==============================] - 141s - loss: 2.5969   
Epoch 43/50
276730/276730 [==============================] - 140s - loss: 2.5909   
Epoch 44/50
276730/276730 [==============================] - 140s - loss: 2.5843   
Epoch 45/50
276730/276730 [==============================] - 140s - loss: 2.5763   
Epoch 46/50
276730/276730 [==============================] - 140s - loss: 2.5697   
Epoch 47/50
276730/276730 [==============================] - 141s - loss: 2.5635   
Epoch 48/50
276730/276730 [==============================] - 140s - loss: 2.5575   
Epoch 49/50
276730/276730 [==============================] - 140s - loss: 2.5496   
Epoch 50/50
276730/276730 [==============================] - 140s - loss: 2.5451   

Output

his object in coming to new york was to engage officers for that service. he came at an opportune moment th the toote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soo
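
The output loops because y_to_char always takes the argmax, so generation quickly settles into the model's single most probable cycle. A common variation, shown here only as a hedged sketch and not part of the original code, is to sample the next character from the predicted distribution with a temperature instead of taking the argmax:

import numpy as np

def sample_with_temperature(probs, temperature=0.8):
    # probs: 1-D array of character probabilities from model.predict
    probs = np.asarray(probs).astype('float64')
    probs = np.log(probs + 1e-8) / temperature
    probs = np.exp(probs) / np.sum(np.exp(probs))
    return np.random.choice(len(probs), p=probs)

# hypothetical drop-in replacement inside y_to_char:
# c = int_to_char[sample_with_temperature(y[0])]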

2. Text generation (word-level)

Here each token is a word rather than a character: words are embedded with Word2Vec, and the LSTM learns to regress the embedding of the next word.

import os
import numpy as np
import nltk
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from gensim.models.word2vec import Word2Vec

raw_text = ''
for file in os.listdir("../input/"):
    if file.endswith(".txt"):
        raw_text += open("../input/"+file, errors='ignore').read() + '\n\n'
# raw_text = open('../input/Winston_Churchil.txt').read()
raw_text = raw_text.lower()
# load the pre-trained English (Punkt) sentence tokenizer
sentensor = nltk.data.load('tokenizers/punkt/english.pickle')
   
sents = sentensor.tokenize(raw_text)
corpus = []
for sen in sents:
    corpus.append(nltk.word_tokenize(sen))

w2v_model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)
raw_input = [item for sublist in corpus for item in sublist]
text_stream = []
vocab = w2v_model.vocab
for word in raw_input:
    if word in vocab:
        text_stream.append(word)
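
# A quick look at what the Word2Vec model learned (a sketch, not part of the original code;
# it relies on the same old gensim attribute access used above).
some_word = next(iter(vocab))                      # an arbitrary in-vocabulary word
print(len(vocab))                                  # vocabulary size after min_count filtering
print(w2v_model[some_word].shape)                  # (128,), the embedding size used below
print(w2v_model.most_similar(some_word, topn=3))   # nearest neighbours in embedding space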

seq_length = 10
x = []
y = []
for i in range(0, len(text_stream) - seq_length):

    given = text_stream[i:i + seq_length]
    predict = text_stream[i + seq_length]
    x.append(np.array([w2v_model[word] for word in given]))
    y.append(w2v_model[predict])
x = np.reshape(x, (-1, seq_length, 128))
y = np.reshape(y, (-1,128))

model = Sequential()
model.add(LSTM(256, dropout_W=0.2, dropout_U=0.2, input_shape=(seq_length, 128)))
model.add(Dropout(0.2))
model.add(Dense(128, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam')
model.fit(x, y, nb_epoch=50, batch_size=4096)

def predict_next(input_array):
    x = np.reshape(input_array, (-1,seq_length,128))
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    raw_input = raw_input.lower()
    input_stream = nltk.word_tokenize(raw_input)
    res = []
    for word in input_stream[(len(input_stream)-seq_length):]:
        res.append(w2v_model[word])
    return res

def y_to_word(y):
    word = w2v_model.most_similar(positive=y, topn=1)
    return word
def generate_article(init, rounds=30):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_word(predict_next(string_to_index(in_string)))
        in_string += ' ' + n[0][0]
    return in_string
init = 'Language Models allow us to measure how likely a sentence is, which is an important for Machine'
article = generate_article(init)
print(article)

Training log

Epoch 1/50
2058743/2058743 [==============================] - 150s - loss: 0.6839   
Epoch 2/50
2058743/2058743 [==============================] - 150s - loss: 0.6670   
Epoch 3/50
2058743/2058743 [==============================] - 150s - loss: 0.6625   
Epoch 4/50
2058743/2058743 [==============================] - 150s - loss: 0.6598   
Epoch 5/50
2058743/2058743 [==============================] - 150s - loss: 0.6577   
Epoch 6/50
2058743/2058743 [==============================] - 150s - loss: 0.6562   
Epoch 7/50
2058743/2058743 [==============================] - 150s - loss: 0.6549   
Epoch 8/50
2058743/2058743 [==============================] - 150s - loss: 0.6537   
Epoch 9/50
2058743/2058743 [==============================] - 150s - loss: 0.6527   
Epoch 10/50
2058743/2058743 [==============================] - 150s - loss: 0.6519   
Epoch 11/50
2058743/2058743 [==============================] - 150s - loss: 0.6512   
Epoch 12/50
2058743/2058743 [==============================] - 150s - loss: 0.6506   
Epoch 13/50
2058743/2058743 [==============================] - 150s - loss: 0.6500   
Epoch 14/50
2058743/2058743 [==============================] - 150s - loss: 0.6496   
Epoch 15/50
2058743/2058743 [==============================] - 150s - loss: 0.6492   
Epoch 16/50
2058743/2058743 [==============================] - 150s - loss: 0.6488   
Epoch 17/50
2058743/2058743 [==============================] - 151s - loss: 0.6485   
Epoch 18/50
2058743/2058743 [==============================] - 150s - loss: 0.6482   
Epoch 19/50
2058743/2058743 [==============================] - 150s - loss: 0.6480   
Epoch 20/50
2058743/2058743 [==============================] - 150s - loss: 0.6477   
Epoch 21/50
2058743/2058743 [==============================] - 150s - loss: 0.6475   
Epoch 22/50
2058743/2058743 [==============================] - 150s - loss: 0.6473   
Epoch 23/50
2058743/2058743 [==============================] - 150s - loss: 0.6471   
Epoch 24/50
2058743/2058743 [==============================] - 150s - loss: 0.6470   
Epoch 25/50
2058743/2058743 [==============================] - 150s - loss: 0.6468   
Epoch 26/50
2058743/2058743 [==============================] - 150s - loss: 0.6466   
Epoch 27/50
2058743/2058743 [==============================] - 150s - loss: 0.6464   
Epoch 28/50
2058743/2058743 [==============================] - 150s - loss: 0.6463   
Epoch 29/50
2058743/2058743 [==============================] - 150s - loss: 0.6462   
Epoch 30/50
2058743/2058743 [==============================] - 150s - loss: 0.6461   
Epoch 31/50
2058743/2058743 [==============================] - 150s - loss: 0.6460   
Epoch 32/50
2058743/2058743 [==============================] - 150s - loss: 0.6458   
Epoch 33/50
2058743/2058743 [==============================] - 150s - loss: 0.6458   
Epoch 34/50
2058743/2058743 [==============================] - 150s - loss: 0.6456   
Epoch 35/50
2058743/2058743 [==============================] - 150s - loss: 0.6456   
Epoch 36/50
2058743/2058743 [==============================] - 150s - loss: 0.6455   
Epoch 37/50
2058743/2058743 [==============================] - 150s - loss: 0.6454   
Epoch 38/50
2058743/2058743 [==============================] - 150s - loss: 0.6453   
Epoch 39/50
2058743/2058743 [==============================] - 150s - loss: 0.6452   
Epoch 40/50
2058743/2058743 [==============================] - 150s - loss: 0.6452   
Epoch 41/50
2058743/2058743 [==============================] - 150s - loss: 0.6451   
Epoch 42/50
2058743/2058743 [==============================] - 150s - loss: 0.6450   
Epoch 43/50
2058743/2058743 [==============================] - 150s - loss: 0.6450   
Epoch 44/50
2058743/2058743 [==============================] - 150s - loss: 0.6449   
Epoch 45/50
2058743/2058743 [==============================] - 150s - loss: 0.6448   
Epoch 46/50
2058743/2058743 [==============================] - 150s - loss: 0.6447   
Epoch 47/50
2058743/2058743 [==============================] - 150s - loss: 0.6447   
Epoch 48/50
2058743/2058743 [==============================] - 150s - loss: 0.6446   
Epoch 49/50
2058743/2058743 [==============================] - 150s - loss: 0.6446   
Epoch 50/50
2058743/2058743 [==============================] - 150s - loss: 0.6445   

Output

language models allow us to measure how likely a sentence is, which is an important for machine engagement . to-day good-for-nothing fit job job job job job . i feel thing job job job ; thing really done certainly job job ; but i need not say

CNN (cat/dog image recognition)

A VGG-style convolutional network for binary cat-vs-dog classification on 64x64 images, built with the same Keras stack.

import os,cv2,random
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns

from keras.models import Sequential
from keras.layers import Input, Dropout, Flatten, Convolution2D, MaxPooling2D, Dense, Activation
from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
from keras.utils import np_utils
from keras import backend as K
K.set_image_dim_ordering('th')

'''
Data preparation:
Load all the data first; we take 200 images each of cats and dogs.
'''
train_dir = 'G:/KNNtest/RNN/train/'
test_dir = 'G:/KNNtest/RNN/test/'

# read dogs and cats separately (label: dog = 1, cat = 0)
train_dogs = [(train_dir+i,1) for i in os.listdir(train_dir) if 'dog' in i]
train_cats = [(train_dir+i,0) for i in os.listdir(train_dir) if 'cat' in i]

# -1 is a placeholder label for the unlabelled test images
test_images = [(test_dir+i,-1) for i in os.listdir(test_dir)]
# build the training set; to keep memory use and runtime manageable, take 200 dogs and 200 cats
train_images = train_dogs[:200] + train_cats[:200]
# shuffle the training data
random.shuffle(train_images)
test_images = test_images[:10]

# read the images with OpenCV and resize everything to 64x64
rows = 64
cols = 64
def read_image(tuple_set):
    file_path = tuple_set[0]
    label = tuple_set[1]
    # the read can be colour (default) or grayscale (cv2.IMREAD_GRAYSCALE)
    img = cv2.imread(file_path)
    # choose the interpolation: cv2.INTER_CUBIC / cv2.INTER_LINEAR for enlarging, cv2.INTER_AREA for shrinking
    return cv2.resize(img,(rows,cols),interpolation=cv2.INTER_CUBIC),label

# preprocess the images: convert the image data into a numpy array
channels = 3                # the three RGB colour channels
def prep_data(images):
    no_images = len(images)
    data = np.ndarray((no_images,channels,rows,cols),dtype=np.uint8)
    labels = []

    for i,image_file in enumerate(images):
        image,label = read_image(image_file)
        data[i] = image.T
        labels.append(label)
    return data,labels

x_train,y_train = prep_data(train_images)
x_test,y_test = prep_data(test_images)
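
# A quick shape check (a sketch, not part of the original code): 200 dogs + 200 cats give
# 400 training images in channel-first order, matching K.set_image_dim_ordering('th').
print(x_train.shape)   # (400, 3, 64, 64)
print(x_test.shape)    # (10, 3, 64, 64)
print(y_train[:10])    # 0/1 labels after shuffling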


'''
模型構造
'''
optimizer = RMSprop(lr=1e-4)
objective = 'binary_crossentropy'


model = Sequential()

model.add(Convolution2D(32, 3, 3, border_mode='same', input_shape=(3, rows, cols), activation='relu'))
model.add(Convolution2D(32, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(64, 3, 3, border_mode='same', activation='relu'))
model.add(Convolution2D(64, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(128, 3, 3, border_mode='same', activation='relu'))
model.add(Convolution2D(128, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(256, 3, 3, border_mode='same', activation='relu'))
model.add(Convolution2D(256, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss=objective, optimizer=optimizer, metrics=['accuracy'])
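
# Before training, model.summary() is a cheap way to confirm the layer output shapes and
# parameter count of this VGG-style stack (a sketch, not part of the original code).
model.summary()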




'''
訓練與測試
'''
nb_epoch = 10           # 10 passes over the full training set
batch_size = 10         # batch size: samples drawn from the training set per update
# record the loss after each epoch so we can plot it later
class LossHistory(Callback):
    def on_train_begin(self, logs={}):
        self.losses =[]
        self.val_losses = []

    def on_epoch_end(self, epoch, logs={}):
        self.losses.append(logs.get('loss'))
        self.val_losses.append(logs.get('val_loss'))

early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1, mode='auto')

# train the model
history = LossHistory()

model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          validation_split=0.2, verbose=0, shuffle=True, callbacks=[history, early_stopping])

predictions = model.predict(x_test, verbose=0)

loss = history.losses
val_loss = history.val_losses

plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('VGG-16 Loss Trend')
plt.plot(loss, 'blue', label='Training Loss')
plt.plot(val_loss, 'green', label='Validation Loss')
plt.xticks(range(0,nb_epoch)[0::2])
plt.legend()
plt.show()
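
The predictions array above holds the sigmoid outputs in [0, 1], and since dogs were labelled 1 and cats 0 when the data was read in, a simple threshold turns them into class labels (a minimal sketch, not part of the original code):

for i, p in enumerate(predictions):
    label = 'dog' if p[0] >= 0.5 else 'cat'
    print('test image %d: %s (p(dog)=%.3f)' % (i, label, p[0]))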
