NLP (28): Multi-Label Text Classification

This article explains how to implement multi-label text classification.

What Is Multi-Label Classification?

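In classification problems we have already met binary and multi-class classification. In a binary (multi-class) problem, y takes one of two (several) classes, and each sample's y belongs to exactly one of them. In a multi-label problem, a sample's y may belong to more than one class.
A simple example: when tagging news, a single article may well be labeled both economy and culture. Multi-label problems are therefore common in everyday life.
The industry does not yet have a mature solution for multi-label problems, mainly because labels may depend on one another in complex ways, and no mature model currently captures those dependencies. One workaround is to assume the labels are mutually independent and reduce the problem to the familiar binary (multi-class) classification.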
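To make the independence assumption concrete, here is a minimal sketch of that reduction (often called binary relevance). The toy news samples and the variable names are illustrative assumptions, not part of the project: each label gets its own yes/no dataset, on which any binary classifier could then be trained.

```python
# Toy multi-label samples: (text, set of labels). Illustrative data only.
samples = [
    ("central bank raises interest rates", {"economy"}),
    ("museum funding cut in the new budget", {"economy", "culture"}),
    ("review of a newly staged opera", {"culture"}),
]

# Every distinct label becomes one independent binary task.
labels = sorted({label for _, label_set in samples for label in label_set})

# For each label, mark each sample 1 if it carries the label, else 0.
binary_tasks = {
    label: [(text, int(label in label_set)) for text, label_set in samples]
    for label in labels
}

for label, task in binary_tasks.items():
    print(label, [target for _, target in task])
```

The second sample appears as a positive example in both tasks, which is exactly what a single multi-class target could not express.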
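This article uses data from the event extraction task of 2020語言與智能技術競賽 (the 2020 Language and Intelligence Challenge) as sample multi-label data, and solves the task with a multi-label classification model.
The overall project structure is shown in the figure below:
[Figure: project structure]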

First, let us take a look at the sample data.

Data Analysis

Here are a few examples from the sample data (each line holds the event types, joined by |, followed by the text):

司法行爲-起訴|組織關係-裁員 最近，一位前便利蜂員工就因公司違規裁員，將便利蜂所在的公司蟲極科技（北京）有限公司告上法庭。
組織關係-裁員 思科上海大規模裁員人均可獲賠100萬官方澄清事實
組織關係-裁員 日本巨頭面臨危機，已裁員1000多人，蘋果也救不了它！
組織關係-裁員|組織關係-解散 在硅谷鍍金失敗的造車新勢力們：蔚來裁員、奇點被偷竊、拜騰解散

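As the examples show, a single piece of text may belong to several event types. For instance, the sentence 在硅谷鍍金失敗的造車新勢力們：蔚來裁員、奇點被偷竊、拜騰解散 above contains two event types, 組織關係-裁員 (organizational relations: layoff) and 組織關係-解散 (organizational relations: dissolution).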
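The line format can be parsed with a small helper. This is a sketch that assumes, as the project's scripts do, that the label field and the text are separated by the first space and that multiple event types are joined with |. The name parse_line is hypothetical.

```python
def parse_line(line):
    """Split one dataset line into (list of event types, text)."""
    # Assumed format: "<type1>|<type2> <text>", split at the first space only.
    label_field, text = line.strip().split(" ", maxsplit=1)
    return label_field.split("|"), text


line = "組織關係-裁員|組織關係-解散 在硅谷鍍金失敗的造車新勢力們：蔚來裁員、奇點被偷竊、拜騰解散"
event_types, text = parse_line(line)
print(event_types)
print(text)
```

Splitting with maxsplit=1 matters because the text itself may contain spaces; only the first one separates labels from content.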
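The training set of this dataset has 11,958 samples and 65 event types. We run a simple analysis on it to find the number and proportion of samples with multiple event types, as well as the count for each event type. The analysis script is as follows: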

# -*- coding: utf-8 -*-
# author: Jclian91
# place: Pudong Shanghai
# time: 2020-04-09 21:31

from collections import defaultdict
from pprint import pprint

with open("./data/multi-classification-train.txt", "r", encoding="utf-8") as f:
    content = [line.strip() for line in f]

# Number of samples per event type
event_type_count_dict = defaultdict(int)

# Number of samples with more than one event type
multi_event_type_cnt = 0

for line in content:
    # Event types: the field before the first space
    event_types = line.split(" ", maxsplit=1)[0]

    # A "|" in the field means the sample has several event types
    if "|" in event_types:
        multi_event_type_cnt += 1

    # Increment the count of every event type on this line
    for event_type in event_types.split("|"):
        event_type_count_dict[event_type] += 1


# Print the results. The message reads:
# "There are %d samples with multiple event types, a proportion of %.4f."
print("多事件類型的樣本共有%d個，佔比爲%.4f。" % (multi_event_type_cnt, multi_event_type_cnt / len(content)))

pprint(event_type_count_dict)

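The output is as follows (the first line reports 1,121 samples with multiple event types, a proportion of 0.0937):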

多事件類型的樣本共有1121個，佔比爲0.0937。
defaultdict(<class 'int'>,
            {'交往-會見': 98,
             '交往-感謝': 63,
             '交往-探班': 69,
             '交往-點贊': 95,
             '交往-道歉': 149,
             '產品行爲-上映': 286,
             '產品行爲-下架': 188,
             '產品行爲-發佈': 1196,
             '產品行爲-召回': 287,
             '產品行爲-獲獎': 139,
             '人生-產子/女': 106,
             '人生-出軌': 32,
             '人生-分手': 118,
             '人生-失聯': 105,
             '人生-婚禮': 59,
             '人生-慶生': 133,
             '人生-懷孕': 65,
             '人生-死亡': 811,
             '人生-求婚': 76,
             '人生-離婚': 268,
             '人生-結婚': 294,
             '人生-訂婚': 62,
             '司法行爲-舉報': 98,
             '司法行爲-入獄': 155,
             '司法行爲-開庭': 105,
             '司法行爲-拘捕': 712,
             '司法行爲-立案': 82,
             '司法行爲-約談': 266,
             '司法行爲-罰款': 224,
             '司法行爲-起訴': 174,
             '災害/意外-地震': 119,
             '災害/意外-坍/垮塌': 80,
             '災害/意外-墜機': 104,
             '災害/意外-洪災': 48,
             '災害/意外-爆炸': 73,
             '災害/意外-襲擊': 117,
             '災害/意外-起火': 204,
             '災害/意外-車禍': 286,
             '競賽行爲-奪冠': 430,
             '競賽行爲-晉級': 302,
             '競賽行爲-禁賽': 135,
             '競賽行爲-勝負': 1663,
             '競賽行爲-退役': 95,
             '競賽行爲-退賽': 141,
             '組織關係-停職': 87,
             '組織關係-加盟': 335,
             '組織關係-裁員': 142,
             '組織關係-解散': 81,
             '組織關係-解約': 45,
             '組織關係-解僱': 93,
             '組織關係-辭/離職': 580,
             '組織關係-退出': 183,
             '組織行爲-開幕': 251,
             '組織行爲-遊行': 73,
             '組織行爲-罷工': 63,
             '組織行爲-閉幕': 59,
             '財經/交易-上市': 51,
             '財經/交易-出售/收購': 181,
             '財經/交易-加息': 24,
             '財經/交易-漲價': 58,
             '財經/交易-漲停': 219,
             '財經/交易-融資': 116,
             '財經/交易-跌停': 102,
             '財經/交易-降價': 78,
             '財經/交易-降息': 28})
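The counts are clearly imbalanced. As a rough illustration, the sketch below takes a handful of the counts printed above (the three largest and the three smallest) and computes the ratio between the most and the least frequent event type. The variable names are illustrative.

```python
from collections import Counter

# A subset of the counts printed above: three largest and three smallest
counts = Counter({
    "競賽行爲-勝負": 1663,
    "產品行爲-發佈": 1196,
    "人生-死亡": 811,
    "人生-出軌": 32,
    "財經/交易-降息": 28,
    "財經/交易-加息": 24,
})

ranked = counts.most_common()          # sorted from most to least frequent
most, least = ranked[0], ranked[-1]
ratio = most[1] / least[1]

print("most frequent: ", most)
print("least frequent:", least)
print(f"imbalance ratio: {ratio:.1f}")
```

A gap of this size suggests that rare event types may be harder for the model to learn than frequent ones.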

Model Training

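We use MultiLabelBinarizer from sklearn to encode the labels: the element at an event type's position is set to 1 if the text has that event type and to 0 otherwise. The y value is therefore a 65-dimensional vector in which one or more elements are 1, marking the one or more event types of the text (the x value).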
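A minimal sketch of this encoding on three label sets built from event types that occur in the sample lines above. Only MultiLabelBinarizer itself comes from the article; the toy input is an assumption for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

label_sets = [
    ["司法行爲-起訴", "組織關係-裁員"],
    ["組織關係-裁員"],
    ["組織關係-裁員", "組織關係-解散"],
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(label_sets)   # shape: (n_samples, n_classes)

print(mlb.classes_)                 # one column per distinct event type
print(y)                            # multi-hot rows: 1 where the label applies

# inverse_transform maps multi-hot rows back to tuples of labels
decoded = mlb.inverse_transform(y)
print(decoded)
```

With the full dataset the same call yields 65 columns, one per event type.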
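We use ALBERT to extract features from the text, with a maximum text length of 200. The deep learning model is as follows:
[Figure: model architecture, ALBERT + bidirectional GRU + Attention + fully connected layer]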

The code of the training script (model_trian.py) is as follows:

# -*- coding: utf-8 -*-
# author: Jclian91
# place: Pudong Shanghai
# time: 2020-04-03 18:12

import json
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from keras.models import Model
from keras.optimizers import Adam
from keras.layers import Input, Dense
from att import Attention
from keras.layers import GRU, Bidirectional
from tqdm import tqdm
import matplotlib.pyplot as plt

from albert_zh.extract_feature import BertVector

with open("./data/multi-classification-train.txt", "r", encoding="utf-8") as f:
    train_content = [line.strip() for line in f]

with open("./data/multi-classification-test.txt", "r", encoding="utf-8") as f:
    test_content = [line.strip() for line in f]

# Collect the event types of every sample in the training and test sets
all_event_types = []

for line in train_content + test_content:
    event_types = line.split(" ", maxsplit=1)[0].split("|")
    all_event_types.append(event_types)

# Fit sklearn's MultiLabelBinarizer for multi-label encoding
mlb = MultiLabelBinarizer()
mlb.fit(all_event_types)

# Prints "There are %d event types in total."
print("一共有%d種事件類型。" % len(mlb.classes_))

# Save the class order so the prediction script can map indices back to names
with open("event_type.json", "w", encoding="utf-8") as h:
    h.write(json.dumps(mlb.classes_.tolist(), ensure_ascii=False, indent=4))

# Multi-label encode the training and test sets
y_train = []
y_test = []

for line in train_content:
    event_types = line.split(" ", maxsplit=1)[0].split("|")
    y_train.append(mlb.transform([event_types])[0])

for line in test_content:
    event_types = line.split(" ", maxsplit=1)[0].split("|")
    y_test.append(mlb.transform([event_types])[0])

y_train = np.array(y_train)
y_test = np.array(y_test)

print(y_train.shape)
print(y_test.shape)

# Encode the x values (texts) with ALBERT
bert_model = BertVector(pooling_strategy="NONE", max_seq_len=200)
print('begin encoding')


def encode(text):
    # One vector per token position for a single text
    return bert_model.encode([text])["encodes"][0]


x_train = []
x_test = []

for line in tqdm(train_content):
    text = line.split(" ", maxsplit=1)[1]
    x_train.append(encode(text))

for line in tqdm(test_content):
    text = line.split(" ", maxsplit=1)[1]
    x_test.append(encode(text))

x_train = np.array(x_train)
x_test = np.array(x_test)

print("end encoding")
print(x_train.shape)


# Deep learning model
# Architecture: ALBERT + bidirectional GRU + Attention + FC
# Input: 200 positions, each a 312-dimensional ALBERT vector
inputs = Input(shape=(200, 312, ), name="input")
gru = Bidirectional(GRU(128, dropout=0.2, return_sequences=True), name="bi-gru")(inputs)
attention = Attention(32, name="attention")(gru)
num_class = len(mlb.classes_)
# Sigmoid output: one independent probability per event type
output = Dense(num_class, activation='sigmoid', name="dense")(attention)
model = Model(inputs, output)

# Model visualization
# from keras.utils import plot_model
# plot_model(model, to_file='multi-label-model.png', show_shapes=True)

model.compile(loss='binary_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])

history = model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=128, epochs=10)
model.save('event_type.h5')


# Visualize the training results
# Plot the loss and accuracy curves
plt.subplot(2, 1, 1)
epochs = len(history.history['loss'])
plt.plot(range(epochs), history.history['loss'], label='loss')
plt.plot(range(epochs), history.history['val_loss'], label='val_loss')
plt.legend()

plt.subplot(2, 1, 2)
epochs = len(history.history['accuracy'])
plt.plot(range(epochs), history.history['accuracy'], label='acc')
plt.plot(range(epochs), history.history['val_accuracy'], label='val_acc')
plt.legend()
plt.savefig("loss_acc.png")
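The output layer in the script uses sigmoid with binary cross-entropy rather than softmax. The NumPy sketch below, with made-up logits for three labels, illustrates why that suits multi-label targets: sigmoid scores each label independently, so several can be close to 1 at once, whereas softmax forces the scores to compete and sum to 1. The numbers and helper names are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()


def binary_cross_entropy(y_true, p, eps=1e-7):
    # Mean over labels of the per-label binary log loss
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


logits = np.array([3.0, 2.5, -4.0])   # made-up scores for 3 labels
y_true = np.array([1.0, 1.0, 0.0])    # the sample carries labels 0 and 1

p_sigmoid = sigmoid(logits)
p_softmax = softmax(logits)

print("sigmoid:", p_sigmoid.round(3), "sum =", p_sigmoid.sum().round(3))
print("softmax:", p_softmax.round(3), "sum =", p_softmax.sum().round(3))
print("BCE with sigmoid outputs:", round(binary_cross_entropy(y_true, p_sigmoid), 4))
```

With sigmoid both true labels score above 0.5, while under softmax the second label is pushed below 0.5 by the first.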

The training run prints the following (the first line reports 65 event types in total):

一共有65種事件類型。
(11958, 65)
(1498, 65)
I:BERT_VEC:[graph:opt:128]:load parameters from checkpoint...
I:BERT_VEC:[graph:opt:130]:freeze...
I:BERT_VEC:[graph:opt:133]:optimize...
I:BERT_VEC:[graph:opt:144]:write graph to a tmp file: ./tmp_graph11
100%|██████████| 11958/11958 [02:47<00:00, 71.39it/s]
100%|██████████| 1498/1498 [00:20<00:00, 72.54it/s]
end encoding
(11958, 200, 312)
Train on 11958 samples, validate on 1498 samples

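In the final epoch, the accuracy is 0.9966 on the training set and 0.9964 on the test set. The loss and accuracy curves of the training run are as follows:
[Figure: loss and accuracy curves (loss_acc.png)]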

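Judging from these results, the multi-label classification model performs quite well. Keep in mind, though, that with a sigmoid output and binary cross-entropy loss, Keras computes the 'accuracy' metric per label, and since most of the 65 labels are 0 for any given sample, this figure tends to look higher than a per-sample measure would.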
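To see how much per-label accuracy can differ from stricter measures, here is a small NumPy sketch on made-up predictions. It compares per-label accuracy with subset accuracy (the whole label vector must match) and micro-averaged F1. The arrays are illustrative and are not results from this project.

```python
import numpy as np

# Made-up ground truth and thresholded predictions: 3 samples, 4 labels
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0]])

# Per-label accuracy: fraction of individual 0/1 entries that match
label_acc = (y_true == y_pred).mean()

# Subset accuracy: fraction of samples whose whole label vector matches
subset_acc = (y_true == y_pred).all(axis=1).mean()

# Micro-averaged F1: pool true positives, false positives and false negatives over all labels
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
micro_f1 = 2 * tp / (2 * tp + fp + fn)

print(f"per-label accuracy: {label_acc:.3f}")
print(f"subset accuracy:    {subset_acc:.3f}")
print(f"micro F1:           {micro_f1:.3f}")
```

In this toy case only one of the three samples is fully correct, yet per-label accuracy is still about 0.83, because the many correctly predicted zeros dominate.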

Model Prediction

We use the prediction script below (model_predict.py) to check the model on new test data. The script is as follows:

# -*- coding: utf-8 -*-
# author: Jclian91
# place: Pudong Shanghai
# time: 2020-04-03 21:50

import json
import numpy as np
from keras.models import load_model

from att import Attention
from albert_zh.extract_feature import BertVector

# Load the trained model; the custom Attention layer must be passed in explicitly.
# A separate name is used so that the load_model function is not shadowed.
model = load_model("event_type.h5", custom_objects={"Attention": Attention})

# Sentence to classify
text = "北京時間6月7日，中國男足在廣州天河體育場與菲律賓進行了一場熱身賽，最終國足以2-0擊敗了對手，裏皮也贏得了再度執教國足後的首場比賽勝利！"
text = text.replace("\n", "").replace("\r", "").replace("\t", "")

bert_model = BertVector(pooling_strategy="NONE", max_seq_len=200)

# Convert the sentence into a sequence of vectors
vec = bert_model.encode([text])["encodes"][0]
x = np.array([vec])

# Predict: one sigmoid probability per event type
predicted = model.predict(x)[0]

# Keep every event type whose probability exceeds 0.5
indices = [i for i in range(len(predicted)) if predicted[i] > 0.5]

# Load the class names in the order saved by the training script
with open("event_type.json", "r", encoding="utf-8") as g:
    event_types = json.loads(g.read())

# Prints "Input sentence: ..." and "Predicted event types: ..."
print("預測語句: %s" % text)
print("預測事件類型: %s" % "|".join([event_types[index] for index in indices]))

The predictions for a few samples are shown below (預測語句 is the input sentence, 預測事件類型 the predicted event types):

預測語句: 北京時間6月7日，中國男足在廣州天河體育場與菲律賓進行了一場熱身賽，最終國足以2-0擊敗了對手，裏皮也贏得了再度執教國足後的首場比賽勝利！
預測事件類型: 競賽行爲-勝負

預測語句: 巴西亞馬孫雨林大火持續多日，引發全球關注。
預測事件類型: 災害/意外-起火

預測語句: 19里加大師賽資格賽前兩天戰報中國選手8人晉級6人遭淘汰2人棄賽
預測事件類型: 競賽行爲-晉級

預測語句: 日本電車卡車相撞，車頭部分脫軌並傾斜，現場起火濃煙滾滾
預測事件類型: 災害/意外-車禍

預測語句: 截止到11日13：30 ，因颱風致浙江32人死亡，16人失聯。具體如下：永嘉縣巖坦鎮山早村23死9失聯，樂清6死，臨安區島石鎮銀坑村3死4失聯，臨海市東塍鎮王加山村3失聯。
預測事件類型: 人生-失聯|人生-死亡

預測語句: 定位B端應用，BeBop發佈Quest專屬版柔性VR手套
預測事件類型: 產品行爲-發佈

預測語句: 8月17日。凌晨3點20分左右，濟南消防支隊領秀城中隊接到指揮中心調度命令，濟南市中區中海環宇城往南方向發生車禍，有人員被困。
預測事件類型: 災害/意外-車禍

預測語句: 注意！濟南可能有雷電事故｜英才學院14.9億被收購｜八里橋蔬菜市場今日拆除，未來將建新的商業綜合體
預測事件類型: 財經/交易-出售/收購

預測語句: 昨天18：30，陝西寧強縣胡家壩鎮向家溝村三組發生山體坍塌，5人被埋。當晚，3人被救出，其中1人在醫院搶救無效死亡，2人在送醫途中死亡。今天凌晨，另外2人被發現，已無生命跡象。
預測事件類型: 人生-死亡|災害/意外-坍/垮塌
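The thresholding step in the prediction script can be wrapped in a small helper. The sketch below is one possible variant, not the project's code: decode_labels and its top-1 fallback (returning the highest-scoring type when nothing exceeds the threshold, so that the result is never empty) are assumptions added for illustration, as are the probabilities.

```python
def decode_labels(probs, classes, threshold=0.5, fallback_top1=True):
    """Map per-label probabilities to a list of label names."""
    picked = [classes[i] for i, p in enumerate(probs) if p > threshold]
    if not picked and fallback_top1:
        # Nothing passed the threshold: fall back to the single best label
        best = max(range(len(probs)), key=lambda i: probs[i])
        picked = [classes[best]]
    return picked


# A few event types from the article; the probabilities are made up
classes = ["人生-失聯", "人生-死亡", "災害/意外-車禍"]
print("|".join(decode_labels([0.91, 0.88, 0.02], classes)))
print("|".join(decode_labels([0.10, 0.30, 0.45], classes)))
```

Whether an empty prediction or a forced top-1 label is preferable depends on the application, so the fallback is left as a switch.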

Summary

This project has been uploaded to GitHub: https://github.com/percent4/multi-label-classification-4-event-type .
When there is a chance, I will cover more multi-label classification topics in later posts. Stay tuned~
