Using BERT with PyTorch

Original

2020-05-01 22:50

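The main steps are:

Download the model and place it in a local directory.
Load the model and the tokenizer with BertModel and BertTokenizer from transformers.
Use the tokenizer's encode and decode functions to encode and decode text; pay attention to the add_special_tokens and skip_special_tokens parameters.
The input to forward is a tensor of shape [batch_size, seq_length]; also pay attention to the attention_mask parameter.
The output is a tuple (recent transformers releases return an output object that can be indexed the same way). Its first element is the hidden state of BERT's last transformer layer, with size [batch_size, seq_length, hidden_size]. This is BERT's final output, which is then used for downstream tasks.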
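
The encoding and decoding step can be tried without downloading anything. The following is a minimal sketch, assuming a made-up ten-entry vocabulary written to a temporary file, so the ids differ from those of a real pretrained model. Only `BertTokenizer`, `encode`, and `decode` come from transformers; the vocabulary and file name are hypothetical.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical toy vocabulary, one token per line; the line index is the token id.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "我", "是", "中", "國", "人"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    # The vocabulary is read into memory here, so the file can be removed afterwards.
    tokenizer = BertTokenizer(vocab_file=vocab_path)

# add_special_tokens=True wraps the sentence in [CLS] ... [SEP]
with_special = tokenizer.encode("我是中國人", add_special_tokens=True)
without_special = tokenizer.encode("我是中國人", add_special_tokens=False)
print(with_special)
print(without_special)

# skip_special_tokens controls whether [CLS]/[SEP] appear in the decoded text
print(tokenizer.decode(with_special, skip_special_tokens=False))
print(tokenizer.decode(with_special, skip_special_tokens=True))
```

With a real checkpoint, the only change would be to build the tokenizer with `BertTokenizer.from_pretrained` on the model directory instead of the toy vocabulary file.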

# -*- encoding: utf-8 -*-
import warnings

# Silence warnings before importing transformers
warnings.filterwarnings('ignore')

import os
from os.path import dirname, abspath

import torch
from transformers import BertModel, BertTokenizer

# Project root: two levels above the directory that contains this file
root_dir = dirname(dirname(dirname(abspath(__file__))))

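# Download the pretrained model from the official site beforehand and place it in this directory
pretrained_path = os.path.join(root_dir, 'pretrained/bert_zh')
# Load the BERT model from the local directory
model = BertModel.from_pretrained(pretrained_path)
# Load the vocabulary (tokenizer) from the same directory
tokenizer = BertTokenizer.from_pretrained(pretrained_path)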
print(f'vocab size :{tokenizer.vocab_size}')
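# Encode '[PAD]' (add_special_tokens defaults to True, so [CLS] and [SEP] are included in the result)
print(tokenizer.encode('[PAD]'))
print(tokenizer.encode('[SEP]'))
# Encode a Chinese sentence. Special tokens are added by default: [CLS] at the start and [SEP] at the end
ids = tokenizer.encode("我是中國人", add_special_tokens=True)
# In the result, 101 is the id of [CLS] and 2769 is the id of "我"
# [101, 2769, 3221, 704, 1744, 782, 102]
print(ids)
# Decode the ids back into text; special tokens are not skipped by default
print(tokenizer.decode([101, 2769, 3221, 704, 1744, 782, 102], skip_special_tokens=False))
# print(model)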

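# Add a batch dimension: [seq_length] -> [1, seq_length]
inputs = torch.tensor(ids).unsqueeze(0)
# forward: result is a tuple whose first tensor is the last hidden state
result = model(inputs)
# [1, 7, 768]: batch_size 1, 7 tokens including [CLS] and [SEP], hidden_size 768
print(result[0].size())
# [1, 768]: the pooled output
print(result[1].size())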

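for name, parameter in model.named_parameters():
    # Print the name of every parameter, layer by layer
    print(name)
    # By default every parameter has requires_grad=True, i.e. it is trainable
    print(parameter.requires_grad)
    # To train only the parameters of transformer layer 11 (layers are indexed from 0),
    # match the full prefix rather than the bare substring '11':
    if 'encoder.layer.11.' in name:
        parameter.requires_grad = True
    else:
        parameter.requires_grad = False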

print([p.requires_grad for name, p in model.named_parameters()])
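
The freezing loop can be checked on a small model. The sketch below is an illustration under assumptions: it builds a randomly initialised 12-layer `BertModel` from a deliberately tiny `BertConfig` (the sizes are made up so that nothing needs to be downloaded), freezes everything except `encoder.layer.11.`, and hands only the trainable parameters to an optimizer. The choice of AdamW and the learning rate are not from the original post.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialised model; these sizes are illustrative, not those of a real checkpoint.
config = BertConfig(vocab_size=200, hidden_size=32, num_hidden_layers=12,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config)

# Freeze everything except the last encoder layer (layers are indexed from 0).
for name, parameter in model.named_parameters():
    parameter.requires_grad = name.startswith("encoder.layer.11.")

trainable_names = [n for n, p in model.named_parameters() if p.requires_grad]
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
num_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {num_trainable} / {num_total}")

# Pass only the trainable parameters to the optimizer (assumed optimizer and learning rate).
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```

Counting the trainable parameters this way is a quick sanity check that the name filter matched what was intended before any training starts.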

How to build an attention mask:
Here 101 is [CLS], 102 is [SEP], and 0 is [PAD].

>>> a
tensor([[101,   3,   4,  23,  11,   1, 102,   0,   0,   0]])
>>> notpad = a!=0
>>> notpad
tensor([[ True,  True,  True,  True,  True,  True,  True, False, False, False]])
>>> notcls = a!=101
>>> notcls
tensor([[False,  True,  True,  True,  True,  True,  True,  True,  True,  True]])
>>> notsep = a!=102
>>> notsep
tensor([[ True,  True,  True,  True,  True,  True, False,  True,  True,  True]])
>>> mask = notpad & notcls & notsep
>>> mask
tensor([[False,  True,  True,  True,  True,  True, False, False, False, False]])
>>>
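
As a sketch of how such masks might be used: the tensor passed as `attention_mask` to BERT normally marks only the padding positions as 0, so `[CLS]` and `[SEP]` stay visible to the model. A mask that also drops `[CLS]` and `[SEP]`, like the final `mask` in the session above, is handy afterwards, for example to average the hidden states of the real tokens. The tiny randomly initialised model and the mean pooling step below are assumptions for illustration, not part of the original post.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialised model so the sketch runs without a download (sizes are made up).
config = BertConfig(vocab_size=200, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config)
model.eval()  # disable dropout so the output is deterministic

a = torch.tensor([[101, 3, 4, 23, 11, 1, 102, 0, 0, 0]])

# Mask for the model: 1 for every real token (including [CLS]/[SEP]), 0 for [PAD].
attention_mask = (a != 0).long()
# Mask for pooling: additionally exclude [CLS] (101) and [SEP] (102).
content_mask = ((a != 0) & (a != 101) & (a != 102)).float()

with torch.no_grad():
    outputs = model(a, attention_mask=attention_mask)
hidden = outputs[0]  # [batch_size, seq_length, hidden_size]

# Mean of the hidden states over the content tokens only.
summed = (hidden * content_mask.unsqueeze(-1)).sum(dim=1)
mean_pooled = summed / content_mask.sum(dim=1, keepdim=True)
print(hidden.shape, mean_pooled.shape)
```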
