Learn to Use the PyTorch Version of BERT in One Article

Original

codebrid

2020-06-25 19:25

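I. Preface:

The PyTorch version of BERT that most coders use is probably this one: https://github.com/huggingface/pytorch-pretrained-BERT

It was released shortly after BERT first came out, so let's call it the old version for now.

Here are some rough notes I took while learning to use the old version: https://blog.csdn.net/ccbrid/article/details/88732857

Since BERT appeared, more pretrained models (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) have sprung up like mushrooms after rain, and Hugging Face's BERT tool has been updated continuously as well.

This article covers the new version 👇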

II. BERT-related links

BERT paper: https://arxiv.org/abs/1810.04805

Google's BERT code (TensorFlow): https://github.com/google-research/bert

PyTorch version of BERT: https://github.com/huggingface/transformers (the tool this article documents)

Documentation for the tool: https://huggingface.co/transformers/

Documentation for the BERT part: https://huggingface.co/transformers/model_doc/bert.html#bertmodel

Documentation for the optimizer part: https://huggingface.co/transformers/main_classes/optimizer_schedules.html

Quick tour: https://github.com/huggingface/transformers#quick-tour

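III. Installation

1. Requirements: Python 3.5+, and PyTorch 1.0.0+ or TensorFlow 2.0.0-rc1

2. A virtual environment is recommended, for example:

conda create -n transformers python=3.6

source activate transformers

(Use conda env list to see the existing virtual environments.)

3. First, you need to install one of, or both, TensorFlow 2.0 and PyTorch. (This article uses PyTorch.)

To install PyTorch, go to: https://pytorch.org/get-started/locally/#start-locally

Select your installation environment on that page and it gives you a command such as:

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

The CUDA version on my machine does not support torch 1.4, so I reinstalled an older version instead:

pip install torch==1.2.0 torchvision==0.4.0 -f https://download.pytorch.org/whl/torch_stable.html

4. pip install transformers

5. (Optional) Clone the repository locally so the source code is easy to browse: git clone https://github.com/huggingface/transformers.git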
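After the steps above, a quick sanity check can confirm that the environment matches the requirements. This is only a minimal sketch using the standard library; the helper names `is_installed` and `python_ok` are made up for illustration and are not part of transformers or PyTorch.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the import system can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


def python_ok(min_version=(3, 5)):
    """Return True if the running interpreter is at least min_version."""
    return sys.version_info >= min_version


if __name__ == "__main__":
    print("Python >= 3.5:", python_ok())
    # Package names as used by pip in the steps above
    for name in ("torch", "transformers"):
        print(name, "installed:", is_installed(name))
```

Run it inside the activated virtual environment; a False for either package usually means the pip or conda command was run in a different environment.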

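IV. Usage

1. Required imports

from transformers import BertTokenizer, BertModel, BertForMaskedLM

2. Data processing
As we know, a BERT input contains exactly one [CLS] and at least one [SEP]. '[CLS]' must appear at the very start of the sample. A sample may contain one sentence or several, and every sentence must end with '[SEP]'.
For example: ['[CLS]', 'this', 'is', 'blue', '[SEP]', 'that', 'is', 'red', '[SEP]']

The data has to be put into this form before it is fed to BERT.
For example:

words = [self.CLS_TOKEN] + words + [self.SEP_TOKEN]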
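The rule above can be sketched for samples with more than one sentence. This is a hedged illustration, not code from the original project: `build_bert_input` is a hypothetical helper, and assigning segment id 0 to the first sentence and 1 to every later sentence is an assumption made here, since BERT itself only defines two segment types.

```python
CLS_TOKEN = "[CLS]"
SEP_TOKEN = "[SEP]"


def build_bert_input(sentences):
    """Wrap tokenized sentences in BERT's special tokens.

    sentences: list of token lists, one list per sentence.
    Returns (tokens, token_type_ids) of equal length.
    """
    tokens = [CLS_TOKEN]
    token_type_ids = [0]  # [CLS] belongs to the first segment
    for index, sentence in enumerate(sentences):
        # Assumption: first sentence is segment 0, all later ones segment 1
        segment = 0 if index == 0 else 1
        tokens.extend(sentence)
        tokens.append(SEP_TOKEN)  # every sentence ends with [SEP]
        token_type_ids.extend([segment] * (len(sentence) + 1))
    return tokens, token_type_ids


tokens, token_type_ids = build_bert_input(
    [["this", "is", "blue"], ["that", "is", "red"]]
)
print(tokens)
print(token_type_ids)
```

The resulting `tokens` list matches the example sample shown above, and `token_type_ids` is the kind of list later passed to the model as `token_type_ids`.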

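3. Tokenizer
Load the tokenizer, use it to split the input into tokens, then convert the tokens to ids, as follows:

self.bert_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
# Split the raw text into WordPiece tokens
words = self.bert_tokenizer.tokenize(''.join(words))
# Here sent is assumed to be the token list of one sample (with [CLS]/[SEP] added as in step 2)
# and max_sent_len the length every sample is padded to
feature = self.bert_tokenizer.convert_tokens_to_ids(sent + [self.PAD_TOKEN for _ in range(max_sent_len - len(sent))])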
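Padding as in the line above usually goes together with an attention mask so the model can ignore the padded positions. The following is a minimal sketch under stated assumptions: `pad_batch` is a hypothetical helper, the ids in the example are made up, and `pad_id=0` assumes that '[PAD]' maps to id 0 in the vocabulary being used.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every id list to the longest one and build matching attention masks.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    """
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up ids for two samples of different length
input_ids, attention_mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(input_ids)
print(attention_mask)
```

Lists like these would then be turned into tensors and passed to the model as `input_ids` and `attention_mask`.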

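If the tokens that ship with BERT are not enough, or you want to add your own tokens to BERT's vocabulary, add the following code.
For example, to add the uppercase letters:

self.bert_tokenizer.add_tokens([chr(i) for i in range(ord("A"), ord("Z") + 1)])
args.len_token = len(self.bert_tokenizer)

Here len_token records the new vocabulary size of self.bert_tokenizer, because the model has to be updated to match it.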
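Why the new size has to be recorded can be seen with a toy vocabulary. This is not the library's implementation, only a sketch of the idea: the function `add_tokens` below is a stand-in written for illustration, and the five-entry vocabulary is made up.

```python
def add_tokens(vocab, new_tokens):
    """Append unseen tokens to vocab (token -> id); return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # new ids continue after the last existing id
            added += 1
    return added


# Toy vocabulary, far smaller than a real BERT vocabulary
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "a": 4}
num_added = add_tokens(vocab, [chr(i) for i in range(ord("A"), ord("Z") + 1)])
len_token = len(vocab)
print(num_added, len_token, vocab["A"], vocab["Z"])
```

The new tokens receive ids beyond the original vocabulary size, so an embedding table sized for the original vocabulary has no rows for them until it is resized.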

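4. Using the model
Load BertModel. Because the tokenizer was changed, the model's token embeddings have to be resized to match; after that, BertModel can be used as usual.

self.BertModel = BertModel.from_pretrained('bert-base-chinese')
# A-Z were added to the tokenizer, so resize the embedding table to the new vocabulary size
self.BertModel.resize_token_embeddings(self.args.len_token)
# ii, tti, am: tensors holding the input ids, token type ids and attention mask
outputs = self.BertModel(input_ids=ii, token_type_ids=tti, attention_mask=am)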
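Conceptually, resizing keeps the pretrained rows of the embedding table and appends freshly initialized rows for the new tokens. The pure-Python sketch below only illustrates that idea and is not how transformers implements it; `resize_embeddings` is a hypothetical helper, and the standard deviation of 0.02 for the new rows is an assumption.

```python
import random


def resize_embeddings(embeddings, new_num_tokens, seed=0):
    """Return a table with new_num_tokens rows.

    Existing rows are copied unchanged; extra rows are randomly initialized.
    """
    hidden_size = len(embeddings[0])
    rng = random.Random(seed)
    resized = [row[:] for row in embeddings[:new_num_tokens]]
    while len(resized) < new_num_tokens:
        # Assumed initialization: small Gaussian noise
        resized.append([rng.gauss(0.0, 0.02) for _ in range(hidden_size)])
    return resized


old_table = [[0.1, 0.2], [0.3, 0.4]]  # 2 tokens, hidden size 2
new_table = resize_embeddings(old_table, 4)
print(len(new_table), new_table[:2] == old_table)
```

Since the added rows start out untrained, the embeddings of newly added tokens only become meaningful after fine-tuning.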

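Notes:
5. This article uses the pretrained model bert-base-chinese as its example. For the other pretrained models, see: https://github.com/google-research/bert

6. If you want to fine-tune BERT and run into out-of-memory problems, multi-GPU parallelism may help:
# requires: import torch and from torch import nn
self.BertModel = nn.DataParallel(self.BertModel, device_ids=args.bert_gpu_ids, output_device=torch.cuda.current_device())

For other caveats about multi-GPU parallelism, see the post "Pytorch | 多GPU並行 DataParallel" (PyTorch | Multi-GPU Parallelism with DataParallel).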

Official account: NLP筆記屋
