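Multi-Class Chinese Sentiment Analysis with LSTM

Over the National Day holiday I played around with deep learning (mainly LSTM networks) and, along the way, built a multi-class sentiment classifier for Chinese. Sentiment analysis is harder for Chinese than for English, so the final accuracy is not especially high, but the results are basically usable.

Corresponding API: https://blog.csdn.net/qq_40663357/article/details/103102396

Data

My data comes from a GitHub project, ChineseNlpCorpus, which collects quite a few Chinese datasets and corpora for natural language processing.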

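Download: Baidu Netdisk
Overview: more than 360,000 Sina Weibo posts with sentiment labels, covering 4 emotions: about 200,000 labeled joy, and about 50,000 each labeled anger, disgust, and low mood
Source: Sina Weibo
Original dataset: a Weibo sentiment analysis dataset collected from the web; the specific author and origin are unknown
Processing: I wrote a script, sampling, that cuts the dataset down to a smaller size to speed up experiments.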
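
Since joy outnumbers each of the other emotions roughly four to one, it can help to check the label distribution before sampling. Below is a minimal sketch, assuming each line has the form label, tab, sentence (the layout the scripts in this post read); the function name label_distribution and the sample lines are made up for illustration.

```python
from collections import Counter

def label_distribution(lines):
    """Count samples per label in lines shaped like "label<TAB>sentence"."""
    counts = Counter()
    for line in lines:
        label, _, sentence = line.rstrip("\n").partition("\t")
        if sentence:  # skip lines that have no tab or an empty sentence
            counts[int(label)] += 1
    return counts

# Made-up sample lines in the same layout as the training file
sample = ["0\t今天好開心\n", "0\t哈哈哈\n", "1\t氣死我了\n", "3\t有點難過\n"]
dist = label_distribution(sample)
for label in sorted(dist):
    print(label, dist[label], "{:.0%}".format(dist[label] / len(sample)))
```

On the real file you would pass the open file object instead of the sample list.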

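Preprocessing

Splitting the dataset

The original dataset is fairly large, so to speed up testing and hyperparameter tuning I wrote a script that extracts a smaller subset, keeping at most 10,000 samples per label.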

from collections import Counter

MAX_PER_LABEL = 10000  # keep at most this many samples for each label
counts = Counter()     # number of samples kept so far, per label
new_lines = []
with open("weibo_train.txt", "r", encoding="gbk", errors="ignore") as f:
    for line in f:
        label, sentence = line.strip().split("\t")
        # Labels 0-3 are the four emotions; keep the first MAX_PER_LABEL of each
        if int(label) in (0, 1, 2, 3) and counts[label] < MAX_PER_LABEL:
            counts[label] += 1
            new_lines.append(label + "\t" + sentence + "\n")
# Write as GBK, the encoding the later steps use to read this file
with open("small_train.txt", "w", encoding="gbk", errors="ignore") as f2:
    f2.writelines(new_lines)

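Data filtering

Punctuation, special symbols, English letters, digits and the like are of no use in this experiment, so they need to be filtered out.
Regular expressions do the filtering here: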

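import re

# Data filtering
def regex_filter(s_line):
    # Remove English letters, digits and whitespace
    special_regex = re.compile(r"[a-zA-Z0-9\s]+")
    # Remove English punctuation and special symbols.
    # Note: "!-_" inside the class is a range (U+0021 to U+005F), so this
    # pattern also matches double quotes, square brackets and the backslash.
    en_regex = re.compile(r"[.…{|}#$%&\'()*+,!-_./:~^;<=>?@★●，。]+")
    # Remove Chinese punctuation
    zn_regex = re.compile(r"[《》！、，“”；：（）【】]+")

    s_line = special_regex.sub(r"", s_line)
    s_line = en_regex.sub(r"", s_line)
    s_line = zn_regex.sub(r"", s_line)
    return s_line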
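
A block-list like the one above has to name every unwanted symbol. As an alternative sketch (not the approach used in this post), a keep-list removes everything that is not a Chinese character. It assumes "Chinese character" means the CJK Unified Ideographs block U+4E00 to U+9FFF; the name keep_chinese is made up for illustration.

```python
import re

# Assumption: Chinese characters are those in U+4E00-U+9FFF (CJK Unified Ideographs)
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")

def keep_chinese(s_line):
    """Drop every run of characters outside the CJK Unified Ideographs block."""
    return NON_CHINESE.sub("", s_line)

print(keep_chinese("今天天氣不錯！！ good 123 @someone"))
```

The trade-off is that a keep-list also discards emoticons and any rarer characters outside that block, which may or may not matter for sentiment.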

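Removing stop words

Chinese stop words are also of no use for sentiment analysis, so they need to be removed. I use the stop-word list from Harbin Institute of Technology (HIT), which is already included in the project linked at the end of this post.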

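# Load the stop words
def stopwords_list(file_path):
    # Use a with-block so the file is closed, and return a set so that
    # the "word not in stopword" checks later on are fast
    with open(file_path, 'r', encoding='utf-8') as f:
        stopwords = {line.strip() for line in f}
    return stopwords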

Preprocessing the dataset and counting word frequencies

import collections
import jieba

word_freqs = collections.Counter()  # word frequencies
stopword = stopwords_list("data/stopWords.txt")
max_len = 0  # length, in words, of the longest sentence
with open('data/small_train.txt', 'r', encoding="gbk", errors='ignore') as f:
    for line in f:
        # Split the line into label and sentence
        label, sentence = line.strip("\n").split("\t")
        # Preprocess: filter characters, then segment into words
        sentence = regex_filter(sentence)
        words = jieba.cut(sentence)
        x = 0  # number of words kept from this sentence
        for word in words:
            # Skip stop words
            if word not in stopword:
                word_freqs[word] += 1
                x += 1
        max_len = max(max_len, x)
print(max_len)
print('nb_words ', len(word_freqs))
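
The script above only reports the single longest sentence. When picking a padding length, a high percentile of the sentence lengths is often a more useful number, because one very long post would otherwise set the length for everything. The sketch below is my own addition, not part of the original project; length_percentile and the sample lengths are made up, and it uses the nearest-rank definition of a percentile.

```python
import math

def length_percentile(lengths, pct):
    """Nearest-rank percentile: the smallest length that covers pct% of the sentences."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

lengths = [3, 5, 8, 12, 40]  # made-up per-sentence word counts
print(length_percentile(lengths, 80))
print(length_percentile(lengths, 100))
```

To use it with the loop above, you would collect each sentence's x into a list instead of tracking only the maximum.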

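Building the dictionary

After preprocessing, each segmented word is mapped to a unique integer, and together these mappings form the dictionary. A maximum vocabulary size is defined so that only the most frequent words are kept. Two extra entries are added: "PAD", whose index 0 is used later to pad short sentences, and "UNK" for words that are not in the dictionary.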

import pickle

MAX_FEATURES = 80000 # maximum number of words kept in the vocabulary
vocab_size = min(MAX_FEATURES, len(word_freqs)) + 2  # +2 for PAD and UNK
# Build the dictionary: the most frequent words get indices starting at 2
word2index = {x[0]: i+2 for i, x in enumerate(word_freqs.most_common(MAX_FEATURES))}
word2index["PAD"] = 0
word2index["UNK"] = 1
# Save the dictionary to a file
with open('model/word_dict.pickle', 'wb') as handle:
    pickle.dump(word2index, handle, protocol=pickle.HIGHEST_PROTOCOL)
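
To make the index layout concrete, here is a toy run of the same idea on four made-up words with a vocabulary cap of 3. The helper names build_word2index and encode are mine, introduced only for this illustration.

```python
from collections import Counter

def build_word2index(word_freqs, max_features):
    """Most frequent words get indices 2, 3, ...; 0 and 1 are reserved."""
    word2index = {word: i + 2
                  for i, (word, _) in enumerate(word_freqs.most_common(max_features))}
    word2index["PAD"] = 0
    word2index["UNK"] = 1
    return word2index

def encode(words, word2index):
    """Map each word to its index, falling back to UNK for unknown words."""
    return [word2index.get(w, word2index["UNK"]) for w in words]

freqs = Counter({"開心": 5, "生氣": 3, "難過": 2, "無語": 1})  # made-up counts
w2i = build_word2index(freqs, max_features=3)
print(w2i)
print(encode(["開心", "無語", "難過"], w2i))
```

The least frequent word falls outside the cap, so it is encoded as UNK (index 1).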

Training

Let's import the packages first.

import pickle
# Public Keras import paths; internal modules such as keras.engine.saving
# exist only in some Keras 2.x releases
from keras.models import Sequential, load_model
from keras.layers import Activation, Dense, Embedding, LSTM, SpatialDropout1D
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import jieba
import numpy as np
import pandas as pd

Preparing the data

First load the dictionary from its file, then count the training samples so that the numpy arrays can be initialized.

# Load the word dictionary
with open('model/word_dict.pickle', 'rb') as handle:
    word2index = pickle.load(handle)

### Prepare the data
MAX_FEATURES = 80002 # vocabulary size: 80000 words plus PAD and UNK
MAX_SENTENCE_LENGTH = 110 # maximum sentence length, in words
num_recs = 0  # number of samples

with open('data/small_train.txt', 'r', encoding="gbk",errors='ignore') as f:
    # Count the samples
    for line in f:
        num_recs += 1

# Initialize the sentence array and the label array
X = np.empty(num_recs,dtype=list)
y = np.zeros(num_recs)
i=0

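Iterate over the training set: each word found in the dictionary is added to the input array as its index, and each word that is not found is represented by "UNK". This turns every sentence into a sequence of integers, which is then padded or truncated to a uniform length. The corresponding label goes into the output array, and the labels are one-hot encoded with pandas. Note that this step segments the raw sentence without applying regex_filter or stop-word removal, so tokens that were filtered out when the dictionary was built (punctuation, digits, stop words) map to "UNK".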

with open('data/small_train.txt', 'r', encoding="gbk",errors='ignore') as f:
    for line in f:
        label, sentence = line.strip("\n").split("\t")
        words = jieba.cut(sentence)
        seqs = []
        for word in words:
            # Word is in the dictionary
            if word in word2index:
                seqs.append(word2index[word])
            else:
                seqs.append(word2index["UNK"]) # words outside the dictionary become UNK
        X[i] = seqs
        y[i] = int(label)
        i += 1

# Bring every sequence to the same length: long ones are truncated, short ones are padded with 0
X = sequence.pad_sequences(X, maxlen=MAX_SENTENCE_LENGTH)
# One-hot encode the labels with pandas (cast to float, since newer pandas returns booleans)
y1 = pd.get_dummies(y).values.astype("float32")
print(X.shape)
print(y1.shape)
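
It is easy to miss which end of the sentence gets padded or cut. With its default arguments, Keras pad_sequences pads at the front and also truncates from the front, so a long post keeps its last words. The pure-Python sketch below mimics that default behaviour and the one-hot step on toy data; pad_pre and one_hot are illustrative names of mine, not library functions.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: keep the last maxlen items, pad at the front."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

def one_hot(label, num_classes):
    """Return a vector with 1.0 at position label and 0.0 elsewhere."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

print(pad_pre([1, 2, 3], 5))
print(pad_pre([1, 2, 3, 4, 5, 6], 4))
print(one_hot(2, 4))
```

If the start of a post matters more than its end, pad_sequences accepts truncating='post' to cut from the back instead.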

Split the prepared data into a training set and a test set by a fixed ratio. Here the test set takes 0.2 of the data.

# Split the data
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y1, test_size=0.2, random_state=42)

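Building the network

Now build the LSTM network, using the LSTM layer that ships with the Keras deep learning library. The code is commented in detail. For the meaning of each parameter, see this article: 一文學會如何在Keras中開發LSTMs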

## Build the network
EMBEDDING_SIZE = 256 # word vector dimension
HIDDEN_LAYER_SIZE = 128 # number of LSTM hidden units
BATCH_SIZE = 64 # batch size
NUM_EPOCHS = 5 # number of training epochs
# Create a Sequential model
model = Sequential()
# Embedding layer: maps word indices to word vectors
model.add(Embedding(MAX_FEATURES, EMBEDDING_SIZE,input_length=MAX_SENTENCE_LENGTH))
model.add(SpatialDropout1D(0.2))
# LSTM layer
model.add(LSTM(HIDDEN_LAYER_SIZE, dropout=0.2, recurrent_dropout=0.2))
# Output layer with four classes and a softmax activation.
# Softmax is applied once, here; stacking a second Activation('softmax')
# on top would flatten the probabilities and hurt training.
model.add(Dense(4, activation="softmax"))
# Use categorical_crossentropy as the loss function
model.compile(loss="categorical_crossentropy", optimizer="adam",metrics=["accuracy"])

## Train the model
model.fit(Xtrain, ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,validation_data=(Xtest, ytest))
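
To get a feel for where the model's weights sit, the parameter counts can be worked out by hand. This is a back-of-the-envelope sketch, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and one bias vector (my understanding of what the Keras layer builds); the three helper functions are made up for illustration, and model.summary() would give the authoritative numbers.

```python
def embedding_params(vocab_size, embedding_size):
    """One vector of embedding_size weights per vocabulary entry."""
    return vocab_size * embedding_size

def lstm_params(input_size, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_size + units) + units)

def dense_params(input_size, outputs):
    """A weight matrix plus one bias per output."""
    return input_size * outputs + outputs

MAX_FEATURES = 80002
EMBEDDING_SIZE = 256
HIDDEN_LAYER_SIZE = 128

emb = embedding_params(MAX_FEATURES, EMBEDDING_SIZE)
lstm = lstm_params(EMBEDDING_SIZE, HIDDEN_LAYER_SIZE)
dense = dense_params(HIDDEN_LAYER_SIZE, 4)
print(emb, lstm, dense, emb + lstm + dense)
```

Almost all of the roughly 20.7 million parameters are in the embedding layer, so on a small sample of the data the vocabulary size is the first knob to turn if the model overfits.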

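Evaluating the model

The model is evaluated with the F1 score, reported per class alongside precision, recall, and overall accuracy.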

## Evaluate the model
y_pred = model.predict(Xtest)
# Convert probabilities and one-hot labels back to class indices
y_pred = y_pred.argmax(axis=1)
ytest = ytest.argmax(axis=1)

print('accuracy %s' % accuracy_score(ytest, y_pred))
target_names = ['joy', 'anger', 'disgust', 'low mood']
print(classification_report(ytest, y_pred, target_names=target_names))
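
For readers unfamiliar with the numbers in that report, the sketch below computes per-class precision, recall and F1 by hand on a tiny made-up example with three classes. The function per_class_f1 is my own illustration of the definitions, not the scikit-learn implementation.

```python
def per_class_f1(y_true, y_pred, num_classes):
    """Return a (precision, recall, f1) tuple for each class index."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    return scores

y_true = [0, 0, 1, 1, 2, 2]  # made-up labels
y_pred = [0, 1, 1, 1, 2, 0]  # made-up predictions
scores = per_class_f1(y_true, y_pred, 3)
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
for c, (p, r, f1) in enumerate(scores):
    print(c, round(p, 3), round(r, 3), round(f1, 3))
print("macro F1", round(macro_f1, 3))
```

Macro F1 averages the classes equally, which matters when one emotion has far more samples than the others.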

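Saving and testing the model

Save the model to a file, then load it back and test it. The final class is the one with the highest probability in the softmax output.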

print("Saving model")
model.save('model/my_model.h5')

## Test the model
print("Loading model")
model = load_model('model/my_model.h5')

# Test sentences stay in Chinese, since the model is trained on Chinese text
INPUT_SENTENCES = ['哈哈哈開心','真是無語，你們怎麼搞的','小姐姐，祝你生日快樂','你他媽的有病']
XX = np.empty(len(INPUT_SENTENCES),dtype=list)
for i, sentence in enumerate(INPUT_SENTENCES):
    words = jieba.cut(sentence)
    seq = []
    for word in words:
        if word in word2index:
            seq.append(word2index[word])
        else:
            seq.append(word2index['UNK'])
    XX[i] = seq

XX = sequence.pad_sequences(XX, maxlen=MAX_SENTENCE_LENGTH)
label2word = {0:'joy', 1:'anger', 2:'disgust', 3:'low mood'}
for x in model.predict(XX):
    print(x)
    # Index of the highest probability is the predicted class
    label = int(np.argmax(x))
    print(label)
    print('{}'.format(label2word[label]))
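
The last step, going from the network's output to a label, is small enough to show without Keras. The sketch below applies softmax to four made-up scores and picks the most probable class; the scores are invented for illustration and do not come from the trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

label2word = {0: "joy", 1: "anger", 2: "disgust", 3: "low mood"}
probs = softmax([0.1, 2.0, 0.3, -1.0])  # made-up scores for one sentence
label = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print([round(p, 3) for p in probs])
print(label, label2word[label])
```

When the top two probabilities are close, the prediction is a near tie, which is worth keeping in mind when reading single-label results for emotions as similar as anger and disgust.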

Prediction results: [figure omitted]

GitHub repository

References:
基於LSTM的中文文本多分類實戰
一文學會如何在Keras中開發LSTM
利用 Keras 下的 LSTM 進行情感分析
