1. Download the BERT source code and the Chinese pretrained model
Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
2. Prepare the samples
As in the previous section, we use the AI Challenger user-review data; for your own scenario, just prepare the data in the corresponding format. Here each sample is: text + label.
(Note: the labels used here range from -2 to 1, four different sentiment labels, i.e. a 4-class task. For easier handling they are shifted into the 1~4 range: data.others_overall_experience = data.others_overall_experience + 3)
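The label shift can be sketched with a toy frame standing in for sample.csv (the rows below are made up; only the `others_overall_experience` column matters):

```python
import pandas as pd

# Toy stand-in for sample.csv: a text column plus the raw -2..1 labels.
data = pd.DataFrame({
    "content": ["還不錯", "一般", "太差了", "很好吃"],
    "others_overall_experience": [1, 0, -2, -1],
})

# Shift the labels from -2..1 to the 1..4 range used by the processor below.
data.others_overall_experience = data.others_overall_experience + 3
print(sorted(data.others_overall_experience.tolist()))  # -> [1, 2, 3, 4]
```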
Split the samples into three files, all placed in the same directory (tab-separated despite the .csv extension, matching the file names the processor below reads):
- train.csv: training set
- dev.csv: validation set
- test.csv: test set
Shuffle the samples, then split them by ratio. Create a new file preprocess.py for the preprocessing:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

def train_valid_test_split(x_data, y_data, validation_size=0.1, test_size=0.1):
    # First carve off the test set, then split the remainder into train/valid.
    x_, x_test, y_, y_test = train_test_split(x_data, y_data, test_size=test_size)
    valid_size = validation_size / (1.0 - test_size)
    x_train, x_valid, y_train, y_valid = train_test_split(x_, y_, test_size=valid_size)
    return x_train, x_valid, x_test, y_train, y_valid, y_test

pd_all = pd.read_csv("./sample.csv")
pd_all = shuffle(pd_all)
x_data, y_data = pd_all.content, pd_all.others_overall_experience

x_train, x_valid, x_test, y_train, y_valid, y_test = \
    train_valid_test_split(x_data, y_data, 0.1, 0.1)

train = pd.DataFrame({'label': y_train, 'x_train': x_train})
train.to_csv("./train.csv", index=False, encoding='utf-8', sep='\t')

valid = pd.DataFrame({'label': y_valid, 'x_valid': x_valid})
valid.to_csv("./dev.csv", index=False, encoding='utf-8', sep='\t')

test = pd.DataFrame({'label': y_test, 'x_test': x_test})
test.to_csv("./test.csv", index=False, encoding='utf-8', sep='\t')
```
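Since BERT's `_read_tsv` splits fields on tabs, it is worth sanity-checking that the written files round-trip. A minimal sketch, using an in-memory buffer and made-up rows instead of the real files:

```python
import csv
import io

import pandas as pd

# Write a tiny frame the same way the preprocessing script does
# (tab-separated, label column first), then read it back with csv.reader
# using the same delimiter that BERT's _read_tsv uses.
frame = pd.DataFrame({'label': [3, 1], 'x_train': ["還不錯", "太差了"]})
buf = io.StringIO()
frame.to_csv(buf, index=False, sep='\t')

rows = list(csv.reader(io.StringIO(buf.getvalue()), delimiter='\t'))
print(rows[0])  # header row: ['label', 'x_train']
print(rows[1])  # first sample: ['3', '還不錯']
```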
3. Modify the BERT code
run_classifier.py
Add a custom data-processing class; several processors are already defined in the file by default.
```python
class CommentProcessor(DataProcessor):
    """Processor for the comment data set."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.csv"), quotechar='"'), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.csv"), quotechar='"'), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.csv"), quotechar='"'), "test")

    def get_labels(self):
        """See base class. Returns the label set defined in the samples, i.e. 1~4."""
        return ["1", "2", "3", "4"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            # All sets have a header
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = tokenization.convert_to_unicode(line[1])  # second column: text
            label = tokenization.convert_to_unicode(line[0])   # first column: label
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

#
# .........
#

def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)
    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "comment": CommentProcessor,  # the newly added data processor
    }
```
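To see what `_create_examples` produces, here is a standalone sketch with the TensorFlow-specific pieces stubbed out (the `FakeInputExample` class and the sample rows are illustrative, not part of BERT):

```python
# Minimal stand-in for BERT's InputExample, just to trace row handling.
class FakeInputExample:
    def __init__(self, guid, text_a, text_b, label):
        self.guid, self.text_a, self.text_b, self.label = guid, text_a, text_b, label

def create_examples(lines, set_type):
    examples = []
    for i, line in enumerate(lines):
        if i == 0:          # skip the header row, as in CommentProcessor
            continue
        guid = "%s-%s" % (set_type, i)
        # column 1 is the text, column 0 is the label
        examples.append(FakeInputExample(guid, line[1], None, line[0]))
    return examples

rows = [["label", "x_train"], ["3", "還不錯"], ["1", "太差了"]]
examples = create_examples(rows, "train")
print(len(examples))  # 2 examples; the header row was skipped
print(examples[0].guid, examples[0].label, examples[0].text_a)
```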
Note: `self._read_tsv` is inherited from `DataProcessor`, so pay attention to that method's default arguments. In particular, `csv.reader` is created with `delimiter="\t"`, i.e. fields are split on tabs:

self._read_tsv(os.path.join(data_dir, "train.csv"), quotechar='"')

Here `quotechar` is the quoting character; if the text itself contains nested double quotes, parsing will go wrong.
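The pitfall can be reproduced directly with the csv module (the sample line below is made up):

```python
import csv
import io

# A tab-separated line whose text column begins with double quotes.
line = '3\t"還不錯"其實一般\n'

# With quotechar='"', csv treats the quotes as field quoting and
# strips them, silently altering the text.
with_quote = next(csv.reader(io.StringIO(line), delimiter='\t', quotechar='"'))

# Disabling quote processing keeps the text intact.
no_quote = next(csv.reader(io.StringIO(line), delimiter='\t',
                           quoting=csv.QUOTE_NONE))
print(with_quote)  # ['3', '還不錯其實一般']
print(no_quote)    # ['3', '"還不錯"其實一般']
```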
```python
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
    """Reads a tab separated value file."""
    with tf.gfile.Open(input_file, "r") as f:
        reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
        lines = []
        for line in reader:
            lines.append(line)
        print(len(lines))
        return lines
```
4. Run

```shell
python run_classifier.py --task_name=comment --do_train=true --do_eval=true --data_dir=/bert/bert-demo/bert/data/ --vocab_file=/bert/model/chinese_L-12_H-768_A-12/vocab.txt --bert_config_file=/bert/model/chinese_L-12_H-768_A-12/bert_config.json --init_checkpoint=/bert/model/chinese_L-12_H-768_A-12/bert_model.ckpt --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir=/output
```
The /bert/xxx paths point to the data and the model; change them to your actual paths.
- do_train: whether to run training
- do_eval: whether to run evaluation
Result:

```
/output/model.ckpt-7875
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:***** Eval results *****
INFO:tensorflow: eval_accuracy = 0.73209524
INFO:tensorflow: eval_loss = 0.7203514
INFO:tensorflow: global_step = 7875
INFO:tensorflow: loss = 0.72009945
```
5. Summary
Compared with the LSTM and CNN of the previous section, BERT comes out ahead on both accuracy and loss (the train/validation splits were random rather than identical across experiments, so small differences are possible).
As noted at the end of the section on several important NLP models, ERNIE's word-level masking is a better fit for Chinese; that remains a point to optimize.