BERT with the Transformers Library: A Worked Example of Text Sentiment Classification

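Introduction

This article walks through an example of applying BERT, using a pretrained BERT model for the demonstration. The BERT implementation comes from Transformers, a library written in PyTorch that bundles many state-of-the-art NLP models such as BERT, GPT-2, and Transformer-XL. It also lets us choose from published pretrained weights, which we can then train further for our own task.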

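Part 1: Modeling with BERT, the Feature-Based Approach

1. Task and dataset

The task here is sentiment classification, a kind of text classification: given a sentence, decide whether the sentiment it expresses is positive (labeled 1) or negative (labeled 0). The dataset is SST, a sentiment analysis dataset released by Stanford University and built from movie reviews. SST-2 is its binary classification version.

Before starting, we need to install transformers, which can be done directly with pip: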

pip install transformers

Then load the libraries we will use:

#part 1 - bert feature base
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as tfs
import warnings

warnings.filterwarnings('ignore')

2. Loading the dataset

train_df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

train_set = train_df[:3000]   # use the first 3000 rows as our dataset
print("Train set shape:", train_set.shape)
train_set[1].value_counts()   # check the label distribution

This gives the following output:

Train set shape: (3000, 2)
1    1565
0    1435
Name: 1, dtype: int64

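The positive and negative labels are split roughly evenly.

3. Feature extraction with BERT

Here we use BERT to extract features from the dataset: we pass the input through the BERT model and take its outputs as features. These features are contextual and carry information about the whole sentence. The approach is similar to feature extraction with ELMo. Note that no fine-tuning of BERT takes place here, because BERT takes no part in the later training and is used only to extract features.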

model_class, tokenizer_class, pretrained_weights = (tfs.BertModel, tfs.BertTokenizer, 'bert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

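We use the pretrained "bert-base-uncased" weights, with BertModel as the model and BertTokenizer as the tokenizer. Since the inputs are English sentences, they first have to be tokenized, and each token is then mapped to its index in the vocabulary before being fed to the model. BERT does not actually tokenize into conventional words but into WordPiece units, which are finer-grained than words. We run the following code:

# add_special_tokens=True adds [CLS] at the start and [SEP] at the end of each sentence
train_tokenized = train_set[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))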
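
To make the WordPiece idea mentioned above more concrete, here is a minimal sketch of greedy longest-match-first splitting of a single word. It is only an illustration: the function name wordpiece_tokenize and the tiny vocabulary toy_vocab are hypothetical, and the real BertTokenizer also handles lowercasing, punctuation, and a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        pieces.append(current)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than the real one.
toy_vocab = {"play", "##ing", "##ed", "un", "##believ", "##able", "movie"}

print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("zzz", toy_vocab))           # ['[UNK]']
```

A word that is in the vocabulary stays whole, while a rarer word is broken into smaller known pieces, which is why a tokenized sentence can be longer than its word count.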

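Next, so that the sentences can be processed efficiently as one batch, they all need to have the same length. This is the familiar padding step: we append [PAD] tokens (id 0) to the end of the shorter sentences: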

train_max_len = 0
for i in train_tokenized.values:
    if len(i) > train_max_len:
        train_max_len = len(i)

train_padded = np.array([i + [0] * (train_max_len-len(i)) for i in train_tokenized.values])
print("train set shape:",train_padded.shape)

# output: train set shape: (3000, 66)

Finally, the model needs to know which positions it should ignore, namely the [PAD] tokens we just added:

train_attention_mask = np.where(train_padded != 0, 1, 0)
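
The padding and masking steps can be seen on a tiny example. The sketch below uses plain Python lists instead of NumPy so that it runs on its own; the helper name pad_and_mask and the token ids between the special tokens are made up for illustration. It builds the mask from the original lengths, which gives the same result as comparing against the pad id as long as no real token has id 0.

```python
def pad_and_mask(token_id_lists, pad_id=0):
    """Right-pad every id list to the longest length and build a 0/1 attention mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_lists]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in token_id_lists]
    return padded, mask

# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased; the ids in between are made up.
toy_ids = [[101, 7, 8, 9, 102], [101, 7, 102]]
padded, mask = pad_and_mask(toy_ids)
print(padded)  # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(mask)    # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```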

After these steps, the input is in a form the BERT model can accept and process, so we can compute the features directly:

train_input_ids = torch.tensor(train_padded).long()
train_attention_mask = torch.tensor(train_attention_mask).long()
with torch.no_grad():
    train_last_hidden_states = model(train_input_ids, attention_mask=train_attention_mask)

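Let's look at what the BERT model gives us as output:

train_last_hidden_states[0].size()

output: torch.Size([3000, 66, 768])

The first dimension is the number of samples, the second is the sequence length, and the third is the feature size. In other words, BERT outputs one feature vector for every input position.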
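
The three dimensions are easier to picture on a small stand-in. The sketch below uses nested Python lists with made-up numbers rather than a real tensor, and shows how to read off the shape and how to keep only the vector at position 0 of every sample, which in tensor notation is the slice [:, 0, :].

```python
# Toy stand-in for the BERT output: 2 samples, 3 positions, 4 features per position.
# (The real tensor in this article is 3000 x 66 x 768.)
toy_hidden = [
    [[0.1, 0.2, 0.3, 0.4], [1.1, 1.2, 1.3, 1.4], [2.1, 2.2, 2.3, 2.4]],
    [[5.1, 5.2, 5.3, 5.4], [6.1, 6.2, 6.3, 6.4], [7.1, 7.2, 7.3, 7.4]],
]

shape = (len(toy_hidden), len(toy_hidden[0]), len(toy_hidden[0][0]))
print(shape)  # (2, 3, 4)

# Plain-list equivalent of the tensor slice [:, 0, :]: keep position 0 of every sample.
first_position = [sample[0] for sample in toy_hidden]
print(first_position)  # [[0.1, 0.2, 0.3, 0.4], [5.1, 5.2, 5.3, 5.4]]
```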

4. Splitting the data into training and test sets

train_features = train_last_hidden_states[0][:,0,:].numpy()
train_labels = train_set[1]

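Note that we use [:,0,:] to take the output vector at the first position of each sequence. That position is [CLS], and compared with the other positions its vector should be more representative, carrying information about the whole sentence. Next, we use sklearn to split the dataset into a training set and a test set.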

train_features, test_features, train_labels, test_labels = train_test_split(train_features, train_labels)

5. Training with logistic regression

In this part, we fit sklearn's logistic regression to the training set and then evaluate it on the test set:

lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
lr_clf.score(test_features, test_labels)

Output:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

accuracy: 0.8306666666666667

After fitting, the logistic regression model reaches an accuracy of about 83.07% on the test set, which is a decent result. Can we do better?

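Part 2: Modeling with BERT, the Fine-Tuning Approach

In the previous part we relied on BERT's ability to extract features: we took BERT's output features and fed them to a linear classifier for prediction, while BERT itself took no part in training. Now we take a different approach, fine-tuning, in which BERT and the linear layer are trained together and backpropagation updates the parameters of both, so that the BERT model becomes better suited to this classification task. Let's get started.

1. Building the model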

#part 2 - bert fine-tuned
import torch
from torch import nn
from torch import optim
import transformers as tfs
import math

class BertClassificationModel(nn.Module):
    def __init__(self):
        super(BertClassificationModel, self).__init__()   
        model_class, tokenizer_class, pretrained_weights = (tfs.BertModel, tfs.BertTokenizer, 'bert-base-uncased')         
        self.tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
        self.bert = model_class.from_pretrained(pretrained_weights)
        self.dense = nn.Linear(768, 2)  # BERT-base hidden size is 768; 2 output units for binary classification
        
    def forward(self, batch_sentences):
        # tokenize, add special tokens, and pad/truncate every sentence to 66 tokens
        # (padding='max_length' needs transformers >= 3.0; older releases used pad_to_max_length=True)
        batch_tokenized = self.tokenizer.batch_encode_plus(list(batch_sentences), add_special_tokens=True,
                                max_length=66, padding='max_length', truncation=True)
        input_ids = torch.tensor(batch_tokenized['input_ids'])
        attention_mask = torch.tensor(batch_tokenized['attention_mask'])
        bert_output = self.bert(input_ids, attention_mask=attention_mask)
        bert_cls_hidden_state = bert_output[0][:,0,:]       # hidden state at the [CLS] position
        linear_output = self.dense(bert_cls_hidden_state)
        return linear_output

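The model is simple, and the key steps are commented above. It attaches a linear layer to BERT's output at the [CLS] position to predict the class of the sentence.

2. Batching the data

Next we reshape the dataset into batches of size 64 so that the model can be trained with mini-batch gradient descent.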

sentences = train_set[0].values
targets = train_set[1].values
train_inputs, test_inputs, train_targets, test_targets = train_test_split(sentences, targets)

batch_size = 64
batch_count = int(len(train_inputs) / batch_size)
batch_train_inputs, batch_train_targets = [], []
for i in range(batch_count):
    batch_train_inputs.append(train_inputs[i*batch_size : (i+1)*batch_size])
    batch_train_targets.append(train_targets[i*batch_size : (i+1)*batch_size])
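
One detail of the batching code above is that int(len(train_inputs) / batch_size) rounds down, so any samples that do not fill a complete batch are left out of training. The sketch below illustrates this with a hypothetical helper, make_batches, and an assumed training-set size of 2250, which is what train_test_split yields from 3000 rows if its default 75/25 split is used.

```python
def make_batches(items, batch_size, keep_remainder=False):
    """Split items into consecutive batches; optionally keep the final short batch."""
    full_batches = len(items) // batch_size
    batches = [items[i * batch_size:(i + 1) * batch_size] for i in range(full_batches)]
    if keep_remainder and len(items) % batch_size:
        batches.append(items[full_batches * batch_size:])
    return batches

# 2250 is the training-set size if train_test_split keeps its default 75/25 split of 3000 rows.
toy_inputs = list(range(2250))
dropped = make_batches(toy_inputs, 64)
kept = make_batches(toy_inputs, 64, keep_remainder=True)
print(len(dropped), sum(len(b) for b in dropped))  # 35 2240
print(len(kept), len(kept[-1]))                    # 36 10
```

Under that assumption there are 35 full batches per epoch and 10 samples are skipped. Keeping the short final batch is one possible alternative, not something the original code does.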

3. Training the model

#train the model
epochs = 3
lr = 0.01
print_every_batch = 5
bert_classifier_model = BertClassificationModel()
optimizer = optim.SGD(bert_classifier_model.parameters(), lr=lr, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    print_avg_loss = 0
    for i in range(batch_count):
        inputs = batch_train_inputs[i]
        labels = torch.tensor(batch_train_targets[i])
        optimizer.zero_grad()
        outputs = bert_classifier_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        print_avg_loss += loss.item()
        if i % print_every_batch == (print_every_batch-1):
            print("Batch: %d, Loss: %.4f" % ((i+1), print_avg_loss/print_every_batch))
            print_avg_loss = 0

This produces the following output:

Batch: 5, Loss: 0.6938
Batch: 10, Loss: 0.6647
Batch: 15, Loss: 0.6175
Batch: 20, Loss: 0.5445
Batch: 25, Loss: 0.7380
Batch: 30, Loss: 0.4852
Batch: 35, Loss: 0.4842
Batch: 5, Loss: 0.4027
Batch: 10, Loss: 0.2978
Batch: 15, Loss: 0.3876
Batch: 20, Loss: 0.5566
Batch: 25, Loss: 0.3102
Batch: 30, Loss: 0.2467
Batch: 35, Loss: 0.2219
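
To help read these loss values, here is a small sketch of the quantity that nn.CrossEntropyLoss computes for a single sample: the negative log of the softmax probability assigned to the correct class (PyTorch then averages it over the batch by default). The function name cross_entropy and the example logits are illustrative, and the code uses only the standard library rather than PyTorch.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one sample."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]

print(round(cross_entropy([0.0, 0.0], 1), 4))   # 0.6931
print(round(cross_entropy([-1.0, 1.0], 1), 4))  # 0.1269
print(round(cross_entropy([-1.0, 1.0], 0), 4))  # 2.1269
```

With two classes and no preference between them, the loss is ln 2, about 0.6931, which is close to the first value printed above (0.6938). This suggests the freshly initialized linear layer starts out near chance, and the loss falls as the model becomes more confident in the correct class.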

4. Evaluating the model

# eval the trained model
total = len(test_inputs)
hit = 0
with torch.no_grad():
    for i in range(total):
        outputs = bert_classifier_model([test_inputs[i]])
        _, predicted = torch.max(outputs, 1)
        if predicted == test_targets[i]:
            hit += 1

print("Accuracy: %.2f%%" % (hit / total * 100))

Here we evaluate the trained model on the test set and print its accuracy. The output is:

Accuracy: 90.53%
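
The evaluation loop boils down to taking the highest-scoring class for each sentence and counting matches with the labels. The sketch below shows that calculation on made-up logits with plain Python; accuracy_from_logits is a hypothetical helper, not part of the original code.

```python
def accuracy_from_logits(batch_logits, targets):
    """Share of samples whose highest-scoring class equals the target label."""
    hits = 0
    for logits, target in zip(batch_logits, targets):
        predicted = max(range(len(logits)), key=lambda k: logits[k])  # argmax
        hits += int(predicted == target)
    return hits / len(targets)

# Made-up logits for four sentences; each row is [score for class 0, score for class 1].
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.0]]
toy_targets = [0, 1, 0, 0]
print("Accuracy: %.2f%%" % (accuracy_from_logits(toy_logits, toy_targets) * 100))  # Accuracy: 75.00%
```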

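As we can see, with fine-tuning the model reaches 90.53% accuracy after 3 epochs of training, a clear improvement over the feature-based approach. The code for this article is available at the project address below. Thanks for reading!

Project address:
A BERT-based text classification example

References
Using BERT for the first time
Transformers official documentation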
