Using the BERT Model with TensorFlow and Outputting Sentence and Token Vectors


2020-06-16 02:40

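1. Preface

I recently wanted to call the BERT model from TensorFlow and found that the official source repository already includes a fairly detailed example notebook: https://github.com/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb

This post builds on that example to show how to load and call the BERT model.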

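2. The BERT model

2.1 Downloading a pre-trained model

Pre-trained models are available from the BERT GitHub repository: https://github.com/google-research/bert

The repository lists many pre-trained models; pick the one that suits your task and download it. The download is a compressed archive, and extracting it yields the following files:

bert_config.json: the main hyperparameter settings of the BERT model
bert_model.ckpt.xxxx: several checkpoint files share this prefix, but loading the model only needs the prefix bert_model.ckpt
vocab.txt: the vocabulary used during pre-training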
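To make the role of bert_config.json concrete, here is a minimal sketch that reads such a file with the standard library. The values in `sample_config` are illustrative, in the style of a BERT-Base configuration, and the helpers `load_bert_config` and `head_size` are hypothetical names that are not part of the BERT code base; check the file in your own download for the real settings.

```python
import json
import os
import tempfile

# Illustrative values only (assumed, BERT-Base style); your file may differ.
sample_config = {
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "max_position_embeddings": 512,
    "vocab_size": 30522,
}


def load_bert_config(path):
    """Read a bert_config.json file into a plain dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def head_size(config):
    """Width of one attention head: hidden_size split evenly across heads."""
    return config["hidden_size"] // config["num_attention_heads"]


# Write the sample to a temporary file so the sketch runs without a download.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "bert_config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(sample_config, f)
    config = load_bert_config(config_path)

print(config["hidden_size"], head_size(config))
```

The hidden_size value is also the length of each sentence or token vector produced later in this post.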

These three files are all that is needed to load the BERT model. Their paths can be written as:

BERT_INIT_CHKPNT="./bert_pretrain_model/bert_model.ckpt"
BERT_VOCAB="./bert_pretrain_model/vocab.txt"
BERT_CONFIG="./bert_pretrain_model/bert_config.json"
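Because the checkpoint is referenced by prefix rather than by an exact file name, a typo in the directory tends to surface only when the model is restored. The sketch below checks the directory up front; `missing_bert_files` is a hypothetical helper, and the demo uses empty placeholder files in a temporary directory instead of a real download.

```python
import glob
import os
import tempfile


def missing_bert_files(model_dir):
    """Return the expected entries that are absent from model_dir."""
    missing = []
    for name in ("bert_config.json", "vocab.txt"):
        if not os.path.isfile(os.path.join(model_dir, name)):
            missing.append(name)
    # The checkpoint is addressed by prefix; the files on disk carry suffixes.
    if not glob.glob(os.path.join(model_dir, "bert_model.ckpt*")):
        missing.append("bert_model.ckpt*")
    return missing


# Demo with empty placeholder files (assumption: no real model is needed here).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("bert_config.json", "vocab.txt", "bert_model.ckpt.index"):
        open(os.path.join(tmp, name), "w").close()
    complete = missing_bert_files(tmp)

    os.remove(os.path.join(tmp, "vocab.txt"))
    incomplete = missing_bert_files(tmp)

print(complete, incomplete)
```

In practice you would call `missing_bert_files("./bert_pretrain_model")` before building the model.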

2.2 Loading the BERT model

First install bert-tensorflow and tensorflow-hub. The command for bert-tensorflow is:

pip install bert-tensorflow

Next, use the modeling module of bert-tensorflow to load the configuration file and build the model:

from bert import modeling

bert_config = modeling.BertConfig.from_json_file(hp.BERT_CONFIG)
model = modeling.BertModel(
    config=bert_config,
    is_training=is_training,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=use_one_hot_embeddings)
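The snippet above reads its settings from an object named `hp` that this post never defines. As a rough sketch, something like the following would satisfy the attributes used here and later (`BERT_CONFIG`, `MAX_SEQ_LENGTH`, `label_list`). The use of `SimpleNamespace` and every value below are assumptions for illustration, not the author's actual settings.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the post's `hp` settings object.
hp = SimpleNamespace(
    BERT_INIT_CHKPNT="./bert_pretrain_model/bert_model.ckpt",
    BERT_VOCAB="./bert_pretrain_model/vocab.txt",
    BERT_CONFIG="./bert_pretrain_model/bert_config.json",
    MAX_SEQ_LENGTH=128,  # assumed; must not exceed max_position_embeddings
    label_list="0,1",    # assumed comma-separated string of class labels
)

# The label string is parsed into integers the same way the post does later.
label_list = [int(i) for i in hp.label_list.split(",")]
print(label_list)
```

The remaining names in the call (`is_training`, `input_ids`, `input_mask`, `segment_ids`, `use_one_hot_embeddings`) are supplied by the surrounding training or inference code.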

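2.3 Downloading and preprocessing the data

The training data is a movie-review sentiment classification dataset with two class labels: 1 (positive) and 0 (negative).

The code is as follows (it relies on the TensorFlow 1.x API tf.gfile):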

import pandas as pd
import tensorflow as tf
import os
import re


# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
    data = {}
    data["sentence"] = []
    data["sentiment"] = []
    for file_path in os.listdir(directory):
        with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
            data["sentence"].append(f.read())
            data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
    return pd.DataFrame.from_dict(data)


# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
    pos_df = load_directory_data(os.path.join(directory, "pos"))
    neg_df = load_directory_data(os.path.join(directory, "neg"))
    pos_df["polarity"] = 1
    neg_df["polarity"] = 0
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)


# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
    dataset = tf.keras.utils.get_file(
        fname="aclImdb.tar.gz",
        origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
        extract=True)

    train_df = load_dataset(os.path.join(os.path.dirname(dataset), "aclImdb", "train"))
    test_df = load_dataset(os.path.join(os.path.dirname(dataset), "aclImdb", "test"))

    return train_df, test_df


if __name__ == "__main__":
    train, test = download_and_load_datasets()
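The "sentiment" column above is filled from the file name, not from the review text: files in the aclImdb archive are named `<id>_<rating>.txt`, and the regular expression captures the rating part. The small sketch below isolates that step; `rating_from_filename` is a hypothetical helper that, unlike the loader above, returns None for names that do not match.

```python
import re

# Same pattern as in load_directory_data, written as a raw string.
FILENAME_PATTERN = re.compile(r"\d+_(\d+)\.txt")


def rating_from_filename(file_name):
    """Return the rating captured from '<id>_<rating>.txt', or None if no match."""
    match = FILENAME_PATTERN.match(file_name)
    return match.group(1) if match else None


print(rating_from_filename("0_9.txt"))
print(rating_from_filename("README"))
```

Note that the binary "polarity" label comes from the directory (pos or neg), while this rating is kept only as an extra string column.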

Next, tokenize the examples and convert them into features that the BERT model can consume:

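from bert import run_classifier

# `tokenizer` (a bert.tokenization.FullTokenizer built from vocab.txt) and the
# `train_InputExamples` / `test_InputExamples` sequences of run_classifier.InputExample
# objects are created beforehand; that setup is not shown in this post.

# Convert our train and test features to InputFeatures that BERT understands.
label_list = [int(i) for i in hp.label_list.split(",")]
train_features = run_classifier.convert_examples_to_features(
    train_InputExamples, label_list, hp.MAX_SEQ_LENGTH, tokenizer)
test_features = run_classifier.convert_examples_to_features(
    test_InputExamples, label_list, hp.MAX_SEQ_LENGTH, tokenizer)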

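For example, the tokenizer splits the sentence "This here's an example of using the BERT tokenizer" into WordPiece tokens, which are then mapped to ids from vocab.txt.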
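The conversion step can be illustrated without TensorFlow. The sketch below is a simplified re-implementation for explanation only, not the library code: `wordpiece` does greedy longest-match-first splitting of a single lowercase word, and `build_features` adds [CLS] and [SEP], then pads to a fixed length. The tiny `vocab` and its ids are made up; the real ids come from vocab.txt.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def build_features(tokens, vocab, max_seq_length):
    """Return (input_ids, input_mask, segment_ids) for a single sentence."""
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    input_mask = [1] * len(input_ids)   # 1 marks real tokens
    segment_ids = [0] * len(input_ids)  # a single sentence uses segment 0
    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding


# Toy vocabulary (assumed ids, for illustration only).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "bert": 5, "token": 6, "##izer": 7}

tokens = [p for w in ["the", "bert", "tokenizer"] for p in wordpiece(w, vocab)]
input_ids, input_mask, segment_ids = build_features(tokens, vocab, 8)
print(tokens)
print(input_ids, input_mask, segment_ids)
```

These three lists correspond to the input_ids, input_mask and token_type_ids arguments passed to modeling.BertModel.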

2.4 Training the model

(1) Train with tf.estimator: https://github.com/llq20133100095/bert_use/blob/master/model_estimator.py

(2) Train with a plain tf.Session and tf.data:
https://github.com/llq20133100095/bert_use/blob/master/train.py
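Whichever route is used, the optimizer needs a total step count and a warm-up length. The sketch below follows the arithmetic used by the BERT reference fine-tuning script; whether the linked scripts use the same formula is an assumption. The batch size, epoch count and warm-up proportion are assumed example values, and `training_schedule` is a hypothetical helper name.

```python
def training_schedule(num_examples, batch_size, num_epochs, warmup_proportion):
    """Return (num_train_steps, num_warmup_steps) for a fine-tuning run."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# The IMDB training split has 25,000 reviews; the other values are assumed.
steps, warmup = training_schedule(
    num_examples=25000, batch_size=32, num_epochs=3.0, warmup_proportion=0.1)
print(steps, warmup)
```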

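2.5 Outputting BERT sentence vectors or token vectors directly

In the BERT source code, calling model.get_pooled_output() returns the sentence vector, with shape [batch_size, hidden_size], and calling model.get_sequence_output() returns the token vectors, with shape [batch_size, seq_length, hidden_size]: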

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, use_one_hot_embeddings, use_sentence):
    """Creates a classification model."""
    model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

    # Use "pooled_output" for classification tasks on an entire sentence.
    # Use "sequence_outputs" for token-level output.
    if use_sentence:
        output_layer = model.get_pooled_output()
    else:
        output_layer = model.get_sequence_output()

    return output_layer

The full code is available at: https://github.com/llq20133100095/bert_use/blob/master/get_embedding.py
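To show how the two outputs relate, here is a NumPy sketch with random numbers in place of a real model. In BERT, the pooled output is the first ([CLS]) token's vector passed through a dense layer with a tanh activation. The sizes, the random weights and the helper `pooled_from_sequence` are assumptions for illustration; real values come from the checkpoint.

```python
import numpy as np

batch_size, seq_length, hidden_size = 2, 8, 16  # toy sizes, not BERT's real ones
rng = np.random.default_rng(0)

# Stand-in for model.get_sequence_output(): one vector per token position.
sequence_output = rng.normal(size=(batch_size, seq_length, hidden_size))

# Stand-in for the pooler's dense-layer parameters (random here, learned in BERT).
pooler_weight = rng.normal(size=(hidden_size, hidden_size))
pooler_bias = np.zeros(hidden_size)


def pooled_from_sequence(sequence_output, weight, bias):
    """Take the first token's vector and apply dense + tanh."""
    first_token = sequence_output[:, 0, :]
    return np.tanh(first_token @ weight + bias)


# Stand-in for model.get_pooled_output(): one vector per sentence.
pooled_output = pooled_from_sequence(sequence_output, pooler_weight, pooler_bias)

# Token vectors at padded positions are usually dropped using input_mask.
input_mask = np.array([[1, 1, 1, 1, 1, 0, 0, 0],
                       [1, 1, 1, 0, 0, 0, 0, 0]])
token_vectors = [sequence_output[i, : input_mask[i].sum()] for i in range(batch_size)]

print(pooled_output.shape, [v.shape for v in token_vectors])
```

When consuming the token vectors downstream, keep input_mask alongside them so that padded positions can be ignored.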
