BERT (Bidirectional Encoder Representations from Transformers) PyTorch Source Code Walkthrough (Part 1)

Preface

BERT (Bidirectional Encoder Representations from Transformers) was arguably the most dazzling pre-trained language model in NLP in 2018. There are plenty of articles explaining the BERT paper and its underlying ideas, but far fewer that walk through the source code. There is already a blog post with a walkthrough of the TensorFlow version of the BERT code, so here I record a set of "detailed" annotations for the PyTorch version.

The PyTorch BERT implementation annotated here comes from Espresso; the code lives at https://github.com/aespresso/a_journey_into_math_of_ml , and you can also watch his excellent video walkthroughs on Bilibili.

This post covers the bert_model.py file, which builds the BERT pre-training model itself.

 


BERT source code walkthrough:

1. Model architecture: bert_model.py

2. Model pre-training: bert_training.py

3. Data preprocessing: wiki_dataset.py

 


Getting Started

1. Defining the activation function

def gelu(x):
    """Implementation of the gelu activation function.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
        Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu}

The first thing the file defines is the gelu activation function. Unlike the original Transformer, some layers in BERT use GELU in place of ReLU, which adds a stochastic-regularization flavor to the activation; in the GELU paper, GELU also outperformed ReLU experimentally.

Here is the tanh approximation of GELU(x) given in the paper:

$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right)$$

The code above uses the exact form $0.5\,x\,(1 + \mathrm{erf}(x/\sqrt{2}))$; the tanh expression quoted in the docstring is the approximation used by OpenAI GPT.

An ACT2FN dictionary of activation functions is then defined so that an activation can be looked up by its name in the config.
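To see the difference concretely, here is a small sketch (not part of the original file) comparing the two activations on a few sample points; it reuses the same erf-based formula as the gelu function above:

import math
import torch

x = torch.linspace(-3, 3, 7)
gelu_out = x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))   # same formula as gelu() above
relu_out = torch.nn.functional.relu(x)

# GELU is smooth and slightly negative just below zero; ReLU cuts off hard at zero
print(torch.stack([x, gelu_out, relu_out]))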

2. Configuration parameters

class BertConfig(object):
    """Configuration class to store the configuration of a `BertModel`.
    """
    def __init__(self,
                 vocab_size, 
                 hidden_size=384, 
                 num_hidden_layers=6, 
                 num_attention_heads=12,
                 intermediate_size=384*4, 
                 hidden_act="gelu",
                 hidden_dropout_prob=0.4,
                 attention_probs_dropout_prob=0.4,
                 max_position_embeddings=512*2,
                 type_vocab_size=256,
                 initializer_range=0.02
                 ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range

Next, the BertConfig class defines the hyperparameters of the model. The settings are:

vocab_size : size of the vocabulary

hidden_size : hidden-layer dimension, which is also the token embedding dimension

num_hidden_layers : number of Transformer blocks

num_attention_heads : number of heads in multi-head self-attention

intermediate_size : dimension of the feed-forward linear projection layer

hidden_act : activation function of the hidden layers

hidden_dropout_prob : dropout probability for the hidden layers

attention_probs_dropout_prob : dropout probability used inside attention

max_position_embeddings : maximum length of the positional encoding

type_vocab_size : number of classes for the token-type (sentence A/B) embedding used in next-sentence prediction; 256 slots are reserved here, but only 0 and 1 are actually used

initializer_range : standard deviation used when initializing the model parameters
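For reference, here is a minimal sketch (not part of the original file) of instantiating the config with these defaults; the vocab_size value is just an illustrative placeholder:

config = BertConfig(vocab_size=30000)    # hypothetical vocabulary size

print(config.hidden_size)                # 384
print(config.intermediate_size)          # 1536, i.e. 4 * hidden_size
print(config.max_position_embeddings)    # 1024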

3. The Embedding layer

class BertEmbeddings(nn.Module):

    def __init__(self, config):
        super(BertEmbeddings, self).__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        # Initialize the embedding matrices
        nn.init.orthogonal_(self.word_embeddings.weight)
        nn.init.orthogonal_(self.token_type_embeddings.weight)

        # L2-normalize each row of the embedding matrices
        epsilon = 1e-8
        self.word_embeddings.weight.data = \
            self.word_embeddings.weight.data.div(torch.norm(self.word_embeddings.weight, p=2, dim=1, keepdim=True).data + epsilon)
        self.token_type_embeddings.weight.data = \
            self.token_type_embeddings.weight.data.div(torch.norm(self.token_type_embeddings.weight, p=2, dim=1, keepdim=True).data + epsilon)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)


    def forward(self, input_ids, positional_enc, token_type_ids=None):
        """
        :param input_ids: shape [batch_size, sequence_length]
        :param positional_enc: positional encoding, shape [sequence_length, embedding_dimension]
        :param token_type_ids: during BERT training, 0 marks the first sentence and 1 the second
        :return: shape [batch_size, sequence_length, embedding_dimension]
        """
        # Token embedding lookup
        words_embeddings = self.word_embeddings(input_ids)

        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = words_embeddings + positional_enc + token_type_embeddings
        # embeddings: [batch_size, sequence_length, embedding_dimension]
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In __init__, the word and token-type embedding matrices are initialized and row-normalized. The positional encoding is computed from a fixed formula, so it does not need to be initialized here.

The forward method converts input_ids and token_type_ids from token indices into embeddings, then adds the word embeddings, positional encodings, and token-type embeddings to form the embedding that is fed into the Transformer blocks. LayerNorm and dropout are applied after the addition. Applying LayerNorm at the embedding stage, as elsewhere in the model, helps the loss converge and speeds up training; I am not quite sure, though, why dropout is applied right at the input. (A sketch of how the positional-encoding table can be precomputed follows.)
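Since positional_enc is passed in from outside (it is generated in the training/data code rather than here), the sketch below shows one way a standard sinusoidal positional-encoding table could be precomputed; treat the function name and the exact generation details as my own assumption, not code from this repository:

import numpy as np
import torch

def get_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    position = np.arange(max_len)[:, None]                            # [max_len, 1]
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # [d_model / 2]
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return torch.from_numpy(pe).float()                               # [max_len, d_model]

positional_enc = get_positional_encoding(max_len=512, d_model=384)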

4. The Self-Attention mechanism

class BertSelfAttention(nn.Module):
    """Self-attention layer; see part 2 of the encoder discussion in the Transformer post (part 1)."""
    def __init__(self, config):
        super(BertSelfAttention, self).__init__()
        # Check that the embedding dimension is divisible by num_attention_heads
        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads))
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size
        # Linear projections for Q, K, V
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        # x is one of Q/K/V with shape [batch_size, seq_length, embedding_dim]
        # after reshape and transpose the output shape is [batch_size, num_heads, seq_length, embedding_dim / num_heads]
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(self, hidden_states, attention_mask, get_attention_matrices=False):
        # Linear projections for Q, K, V
        # Q, K, V each have shape [batch_size, seq_length, all_head_size] (= num_heads * head_size = hidden_size)
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)
        # Split Q, K, V into num_heads pieces
        # and reshape them to [batch_size, num_heads, seq_length, embedding_dim / num_heads]
        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        # attention_scores: [batch_size, num_heads, seq_length, seq_length]
        # Divide by the square root of the key dimension to keep the scores well scaled
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Apply the attention mask (precomputed for all layers in BertModel forward() function)
        attention_scores = attention_scores + attention_mask
        # attention_mask: [batch_size, 1, 1, seq_length]
        # after the element-wise addition it broadcasts to [batch_size, num_heads, seq_length, seq_length]

        # Softmax normalization gives the attention matrix
        # Normalize the attention scores to probabilities.
        attention_probs_ = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs_)

        # Weight V by the attention matrix
        context_layer = torch.matmul(attention_probs, value_layer)
        # Reshape the weighted V back to [batch_size, seq_length, embedding_dimension]
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        # Optionally return the attention matrix for visualization
        if get_attention_matrices:
            return context_layer, attention_probs_
        return context_layer, None

In __init__, the code first checks that the hidden_size set in the config is divisible by the number of attention heads. It then computes the dimension of each head and defines the linear maps from the hidden states to the Q, K, and V vectors.

The transpose_for_scores method handles the reshape and the head split for Q, K, V, turning the original shape [ batch_size, seq_length, hidden_size ] into [ batch_size, num_heads, seq_length, attention_head_size ]. Because multi-head self-attention is parallelized with matrix multiplication, the per-head Q, K, V vectors have to be split out and packed into matrices at the start of the computation.

The forward method implements multi-head self-attention itself. The hidden states hidden_states are first projected with separate weights into the Q, K, and V vectors, which are then split into heads and packed into matrices. The Transformer computes the attention weights as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$

In the code, the first attention_scores step takes the matrix product of the multi-head Q matrix and the multi-head K matrix, computing all Q·K dot products in one go. Because the multi-head Q and K matrices have the same shape, each head's K matrix must be transposed before the multiplication. After this step attention_scores has shape [ batch_size, num_heads, seq_length, seq_length ].

The second attention_scores step divides the dot products by $\sqrt{d_{k}}$, as in the formula. The result is then added to attention_mask, which pushes the scores at padded positions towards a very large negative value so that padding has essentially no weight after the subsequent softmax. attention_mask has shape [ batch_size, 1, 1, seq_length ] and is broadcast across all dimensions during the element-wise addition.

Next, softmax is applied along the last (seq_length) dimension, followed by dropout, producing the final attention weight matrix attention_probs of shape [ batch_size, num_heads, seq_length, seq_length ].

Finally, the attention weight matrix is multiplied with the multi-head V matrix, which completes the last step of the equation. The resulting context_layer has shape [ batch_size, num_heads, seq_length, attention_head_size ]. All heads have now been computed in parallel; their results are concatenated by reshaping context_layer back to the hidden-layer shape, so the output of self-attention is again [ batch_size, seq_length, hidden_size ].

Throughout this computation, the Q, K, V vectors of all heads are packed into matrices, so everything reduces to a few highly parallel matrix multiplications. Espresso's repository includes a diagram that illustrates this packing step, which is worth looking at alongside the code; the sketch below walks through the same shapes with toy values.
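import math
import torch

batch_size, seq_length, hidden_size, num_heads = 2, 8, 384, 12   # toy dimensions of my own choosing
head_size = hidden_size // num_heads                             # 32

q = torch.randn(batch_size, num_heads, seq_length, head_size)
k = torch.randn(batch_size, num_heads, seq_length, head_size)
v = torch.randn(batch_size, num_heads, seq_length, head_size)

scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(head_size)
print(scores.shape)                                              # [2, 12, 8, 8]

probs = torch.softmax(scores, dim=-1)
context = torch.matmul(probs, v)                                 # [2, 12, 8, 32]

# Concatenate the heads back into hidden_size
context = context.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_length, hidden_size)
print(context.shape)                                             # [2, 8, 384]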

 

5. Layer Normalization

class BertLayerNorm(nn.Module):

    def __init__(self, hidden_size, eps=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root).
        """
        super(BertLayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        u = x.mean(-1, keepdim=True)
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.weight * x + self.bias

BERT applies Layer Normalization in many places to re-normalize the output of each sub-layer and speed up training. The Layer Normalization equation is:

$$\mathrm{LN}(x) = \alpha \odot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta$$

where $\mu$ and $\sigma^{2}$ are the mean and variance over the last (feature) dimension, and $\alpha$, $\beta$ are the learned scale and shift (self.weight and self.bias in the code above).

The forward method implements exactly this operation.
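As a quick sanity check (my own sketch, assuming the BertLayerNorm class above is in scope), the custom implementation matches torch.nn.LayerNorm at its default initial parameters:

import torch

x = torch.randn(2, 8, 384)
custom = BertLayerNorm(384, eps=1e-12)
builtin = torch.nn.LayerNorm(384, eps=1e-12)

print(torch.allclose(custom(x), builtin(x), atol=1e-6))     # True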

6. Wrapping Attention with Add & Norm

class BertSelfOutput(nn.Module):
    # Wraps the LayerNorm and residual connection used to process the self-attention output
    def __init__(self, config):
        super(BertSelfOutput, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states


class BertAttention(nn.Module):
    # Wraps the multi-head attention part, including LayerNorm and the residual connection
    def __init__(self, config):
        super(BertAttention, self).__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)

    def forward(self, input_tensor, attention_mask, get_attention_matrices=False):
        self_output, attention_matrices = self.self(input_tensor, attention_mask, get_attention_matrices=get_attention_matrices)
        attention_output = self.output(self_output, input_tensor)
        return attention_output, attention_matrices

These two classes wrap the Transformer's multi-head attention and Add & Norm operations.

BertSelfOutput implements the output projection followed by Add & Norm: a linear layer and dropout, then the residual addition, then Layer Normalization.

BertAttention bundles multi-head self-attention together with this Add & Norm step.

7. FeedForward and the second Add & Norm

class BertIntermediate(nn.Module):
    # Wraps the feed-forward layer and its activation
    def __init__(self, config):
        super(BertIntermediate, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        self.intermediate_act_fn = ACT2FN[config.hidden_act]

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states


class BertOutput(nn.Module):
    # Wraps the LayerNorm and residual connection used to process the feed-forward output
    def __init__(self, config):
        super(BertOutput, self).__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

The feed-forward layer of the original Transformer is:

$$\mathrm{FFN}(x) = \max(0,\, xW_{1} + b_{1})\,W_{2} + b_{2}$$

BertIntermediate performs the first linear transformation of the feed-forward layer. The difference from the original Transformer is the activation function: the original Transformer uses ReLU in its feed-forward layer, while BERT uses GELU.

The first linear transformation also expands hidden_size to intermediate_size; with the official parameters, intermediate_size is four times hidden_size. The intuition is to project the refined vectors into a higher-dimensional feature space to strengthen their capacity to carry features, somewhat like the kernel trick in SVMs: a problem that looks complicated in a low-dimensional space may be separable by a simple hyperplane once projected into a higher-dimensional one.

BertOutput performs the second linear transformation of the feed-forward layer, followed by the Add & Norm that comes after the feed-forward sub-layer. A small shape sketch follows.
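A minimal sketch (my own, assuming the classes above are in scope) showing the expand-then-project pattern; eval() is called so dropout does not add noise:

import torch

config = BertConfig(vocab_size=1000)                 # hypothetical vocabulary size
ffn_in = BertIntermediate(config).eval()
ffn_out = BertOutput(config).eval()

x = torch.randn(2, 8, config.hidden_size)            # [batch, seq, 384]
h = ffn_in(x)                                        # expanded to [batch, seq, 1536]
y = ffn_out(h, x)                                    # projected back, residual added, LayerNorm applied
print(h.shape, y.shape)                              # [2, 8, 1536] [2, 8, 384]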

8. Assembling a Transformer Block

class BertLayer(nn.Module):
    # One transformer block
    def __init__(self, config):
        super(BertLayer, self).__init__()
        self.attention = BertAttention(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    def forward(self, hidden_states, attention_mask, get_attention_matrices=False):
        # Attention layer (including LayerNorm and the residual connection)
        attention_output, attention_matrices = self.attention(hidden_states, attention_mask, get_attention_matrices=get_attention_matrices)
        # Feed-forward layer
        intermediate_output = self.intermediate(attention_output)
        # Output layer: LayerNorm and residual connection
        layer_output = self.output(intermediate_output, attention_output)
        return layer_output, attention_matrices

With the earlier pieces explained, this part is straightforward: the forward method chains multi-head self-attention, Add & Norm, feed-forward, and a second Add & Norm, forming one complete Transformer block.
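A quick smoke test (my own sketch, assuming the classes above are in scope) running one block on random input; a mask of zeros means every position may attend to every other:

import torch

config = BertConfig(vocab_size=1000)                 # hypothetical vocabulary size
block = BertLayer(config).eval()

hidden = torch.randn(2, 8, config.hidden_size)
mask = torch.zeros(2, 1, 1, 8)                       # no padding in this toy batch

out, _ = block(hidden, mask)
print(out.shape)                                     # [2, 8, 384]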

9. BertEncoder

class BertEncoder(nn.Module):
    # transformer blocks * N
    def __init__(self, config):
        super(BertEncoder, self).__init__()
        layer = BertLayer(config)
        # Make N copies of the transformer block
        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)])

    def forward(self, hidden_states, attention_mask, output_all_encoded_layers=True, get_attention_matrices=False):
        """
        :param output_all_encoded_layers: whether to return the hidden states of every transformer block
        :param get_attention_matrices: whether to return the attention matrices, e.g. for visualization
        """
        all_attention_matrices = []
        all_encoder_layers = []
        for layer_module in self.layer:
            hidden_states, attention_matrices = layer_module(hidden_states, attention_mask, get_attention_matrices=get_attention_matrices)
            if output_all_encoded_layers:
                all_encoder_layers.append(hidden_states)
                all_attention_matrices.append(attention_matrices)
        if not output_all_encoded_layers:
            all_encoder_layers.append(hidden_states)
            all_attention_matrices.append(attention_matrices)
        return all_encoder_layers, all_attention_matrices

This part stacks the Transformer blocks.

__init__ makes N copies of the Transformer block and puts them into an nn.ModuleList.

forward runs the blocks one after another and, when asked, returns every block's output together with every block's attention weight matrix, which is useful for attention visualization.

10. Extracting the first feature vector

class BertPooler(nn.Module):
    """Pooler是把隱藏層(hidden state)中對應#CLS#的token的一條提取出來的功能"""
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

BertPooler extracts the first vector of the hidden states, i.e. the one corresponding to #CLS#, for the next-sentence classification task. Tanh is used as the activation here.

11. Mapping hidden states to vocabulary size

class BertPredictionHeadTransform(nn.Module):
    def __init__(self, config):
        super(BertPredictionHeadTransform, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.transform_act_fn = ACT2FN[config.hidden_act]
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.transform_act_fn(hidden_states)
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states


class BertLMPredictionHead(nn.Module):
    def __init__(self, config, bert_model_embedding_weights):
        super(BertLMPredictionHead, self).__init__()
        # Linear projection, activation, LayerNorm
        self.transform = BertPredictionHeadTransform(config)

        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(bert_model_embedding_weights.size(1),
                                 bert_model_embedding_weights.size(0),
                                 bias=False)
        """上面是創建一個線性映射層, 把transformer block輸出的[batch_size, seq_len, embed_dim]
        映射爲[batch_size, seq_len, vocab_size], 也就是把最後一個維度映射成字典中字的數量, 
        獲取MaskedLM的預測結果, 注意這裏其實也可以直接矩陣成embedding矩陣的轉置, 
        但一般情況下我們要隨機初始化新的一層參數
        """
        self.decoder.weight = bert_model_embedding_weights
        self.bias = nn.Parameter(torch.zeros(bert_model_embedding_weights.size(0)))

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states) + self.bias
        return hidden_states

In the two classes above, BertPredictionHeadTransform wraps a linear transformation, activation, and LayerNorm pipeline that prepares the hidden states for the masked-language-modeling head.

BertLMPredictionHead then maps the hidden-layer output from [ batch_size, seq_length, hidden_size ] to [ batch_size, seq_length, vocab_size ], so that a softmax over vocab_size can be used to predict the tokens that were masked out of the sequence.
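A shape sketch of the prediction head (my own, assuming the classes above are in scope), which also shows that the decoder shares its weight tensor with the input embedding (weight tying):

import torch

config = BertConfig(vocab_size=1000)                           # hypothetical vocabulary size
embeddings = BertEmbeddings(config)
lm_head = BertLMPredictionHead(config, embeddings.word_embeddings.weight)

hidden = torch.randn(2, 8, config.hidden_size)
logits = lm_head(hidden)
print(logits.shape)                                            # [2, 8, 1000]

# The decoder reuses the embedding matrix rather than learning a separate output matrix
print(lm_head.decoder.weight is embeddings.word_embeddings.weight)   # True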

12. Obtaining the multi-task outputs

class BertPreTrainingHeads(nn.Module):
    """
    During BERT pre-training, the hidden states are used to produce both the Masked LM predictions and the Next Sentence predictions
    """
    def __init__(self, config, bert_model_embedding_weights):
        super(BertPreTrainingHeads, self).__init__()

        self.predictions = BertLMPredictionHead(config, bert_model_embedding_weights)
        # self.predictions maps the transformer block output [batch_size, seq_len, embed_dim]
        # to [batch_size, seq_len, vocab_size]
        # and is used for the Masked LM predictions
        self.seq_relationship = nn.Linear(config.hidden_size, 2)
        # Maps pooled_output, i.e. the vector corresponding to #CLS#,
        # into a 2-way classification for the Next Sentence prediction

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.predictions(sequence_output)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_score

BertPreTrainingHeads produces the output vectors for the two BERT pre-training tasks: prediction_scores is used for masked language modeling, and seq_relationship_score for next-sentence classification.

13. Parameter initialization

class BertPreTrainedModel(nn.Module):
    """ An abstract class to handle weights initialization and
        a simple interface for dowloading and loading pretrained models.
        用來初始化模型參數
    """
    def __init__(self, config, *inputs, **kwargs):
        super(BertPreTrainedModel, self).__init__()
        if not isinstance(config, BertConfig):
            raise ValueError(
                "Parameter config in `{}(config)` should be an instance of class `BertConfig`. "
                "To create a model from a Google pretrained model use "
                "`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`".format(
                    self.__class__.__name__, self.__class__.__name__
                ))
        self.config = config

    def init_bert_weights(self, module):
        """ Initialize the weights.
        """
        if isinstance(module, (nn.Linear)):
            # Initialize linear layers from a normal distribution
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, BertLayerNorm):
            # Initialize LayerNorm alpha (weight) to all ones and beta (bias) to all zeros
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, nn.Linear) and module.bias is not None:
            # Initialize biases to zero
            module.bias.data.zero_()

BertPreTrainedModel initializes the model parameters; init_bert_weights is later applied to every sub-module via self.apply in the classes that inherit from it.

14. Bert Model

class BertModel(BertPreTrainedModel):
    """BERT model ("Bidirectional Embedding Representations from a Transformer").

    Params:
        config: a BertConfig class instance with the configuration to build a new model

    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.

    Outputs: Tuple of (encoded_layers, pooled_output)
        `encoded_layers`: controled by `output_all_encoded_layers` argument:
            - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
                of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each
                encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
            - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
                to the last attention block of shape [batch_size, sequence_length, hidden_size],
        `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
            classifier pretrained on top of the hidden state associated to the first character of the
            input (`CLS`) to train on the Next-Sentence task (see BERT's paper).

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

    config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = modeling.BertModel(config=config)
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
    ```
    """
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, positional_enc, token_type_ids=None, attention_mask=None,
                output_all_encoded_layers=True, get_attention_matrices=False):
        if attention_mask is None:
            # torch.LongTensor
            # attention_mask = torch.ones_like(input_ids)
            attention_mask = (input_ids > 0)
            # attention_mask [batch_size, length]
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
        # this attention mask is more simple than the triangular masking of causal attention
        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        # Attention mask: [batch_size, 1, 1, seq_length]

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        # Add a large negative bias to the padded (invalid) regions of the attention matrix,
        # so that after softmax these positions are effectively zero and do not take part in later computation

        # Embedding layer
        embedding_output = self.embeddings(input_ids, positional_enc, token_type_ids)
        # Output after all of the defined transformer blocks
        encoded_layers, all_attention_matrices = self.encoder(embedding_output,
                                                              extended_attention_mask,
                                                              output_all_encoded_layers=output_all_encoded_layers,
                                                              get_attention_matrices=get_attention_matrices)
        # Optionally return the attention matrices of every layer for visualization
        if get_attention_matrices:
            return all_attention_matrices
        # [-1] is the hidden-state output of the last transformer block
        sequence_output = encoded_layers[-1]
        # pooled_output is the hidden-state vector corresponding to the #CLS# token
        pooled_output = self.pooler(sequence_output)
        if not output_all_encoded_layers:
            encoded_layers = encoded_layers[-1]
        return encoded_layers, pooled_output

After the earlier walkthrough, this part is simple. The __init__ method instantiates the three main modules described above:

1. BertEmbeddings, which handles the word embeddings, positional encodings, and token-type embeddings.

2. BertEncoder, the stack of N Transformer blocks.

3. BertPooler, which extracts the vector corresponding to the #CLS# token from the hidden states.

In forward, the attention_mask matrix is prepared first, so that padding has as little influence as possible when the attention weights are computed. The embedding output and the mask are then passed to BertEncoder, and the output_all_encoded_layers flag determines whether every Transformer block's output is returned or only the final one. The hidden state corresponding to #CLS# is returned as well; the mask trick is illustrated right below.
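The mask trick is easy to see on a tiny example (my own sketch), where token id 0 is padding:

import torch

input_ids = torch.tensor([[31, 51, 99], [15, 5, 0]])

attention_mask = (input_ids > 0)                               # [batch, seq]
extended = attention_mask.unsqueeze(1).unsqueeze(2).float()    # [batch, 1, 1, seq]
extended = (1.0 - extended) * -10000.0

print(extended[1, 0, 0])    # valid positions stay ~0, the padded position becomes -10000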

15. The final BERT pre-training model

class BertForPreTraining(BertPreTrainedModel):
    """BERT model with pre-training heads.
    This module comprises the BERT model followed by the two pre-training heads:
        - the masked language modeling head, and
        - the next sentence classification head.

    Params:
        config: a BertConfig class instance with the configuration to build a new model.

    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `masked_lm_labels`: optional masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length]
            with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
            is only computed for the labels set in [0, ..., vocab_size]
        `next_sentence_label`: optional next sentence classification loss: torch.LongTensor of shape [batch_size]
            with indices selected in [0, 1].
            0 => next sentence is the continuation, 1 => next sentence is a random sentence.

    Outputs:
        if `masked_lm_labels` and `next_sentence_label` are not `None`:
            Outputs the total_loss which is the sum of the masked language modeling loss and the next
            sentence classification loss.
        if `masked_lm_labels` or `next_sentence_label` is `None`:
            Outputs a tuple comprising
            - the masked language modeling logits of shape [batch_size, sequence_length, vocab_size], and
            - the next sentence classification logits of shape [batch_size, 2].

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = BertForPreTraining(config)
    masked_lm_logits_scores, seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
    ```
    """
    def __init__(self, config):
        super(BertForPreTraining, self).__init__(config)
        self.bert = BertModel(config)
        self.cls = BertPreTrainingHeads(config, self.bert.embeddings.word_embeddings.weight)
        self.apply(self.init_bert_weights)
        self.vocab_size = config.vocab_size
        self.next_loss_func = CrossEntropyLoss()
        self.mlm_loss_func = CrossEntropyLoss(ignore_index=0)

    def compute_loss(self, predictions, labels, num_class=2, ignore_index=-100):
        loss_func = CrossEntropyLoss(ignore_index=ignore_index)
        return loss_func(predictions.view(-1, num_class), labels.view(-1))

    def forward(self, input_ids, positional_enc, token_type_ids=None, attention_mask=None,
                masked_lm_labels=None, next_sentence_label=None):
        sequence_output, pooled_output = self.bert(input_ids, positional_enc, token_type_ids, attention_mask,
                                                   output_all_encoded_layers=False)
        mlm_preds, next_sen_preds = self.cls(sequence_output, pooled_output)
        return mlm_preds, next_sen_preds

Finally we reach the last module, the BertForPreTraining class, which is the class actually called when pre-training BERT.

In __init__, a BertModel instance provides the BERT backbone and a BertPreTrainingHeads instance provides the output vectors of the two pre-training tasks; the parameters are initialized, the vocabulary size is stored, and the loss functions for the two tasks are defined, both cross-entropy.

The compute_loss method mainly defines the loss computation used for the next-sentence classification task (it is general enough to be reused for the masked-LM loss as well).

In forward, the instantiated BERT model produces the hidden-state output and the output for the #CLS# token, which the heads then transform linearly into a form from which the losses can be computed directly.
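The forward here only returns the two sets of logits; the actual loss computation lives in the training script. The sketch below is my guess at how compute_loss would be used there (toy tensors, a zero positional-encoding stand-in, and a tiny vocabulary are all assumptions for illustration):

import torch

config = BertConfig(vocab_size=1000)                           # hypothetical vocabulary size
model = BertForPreTraining(config).eval()

input_ids = torch.tensor([[31, 51, 99, 0], [15, 5, 7, 0]])     # 0 = padding
positional_enc = torch.zeros(4, config.hidden_size)            # stand-in for the real sinusoidal table
masked_lm_labels = torch.tensor([[0, 51, 0, 0], [0, 0, 7, 0]]) # 0 = position not predicted
next_sentence_label = torch.tensor([0, 1])

mlm_preds, nsp_preds = model(input_ids, positional_enc)
mlm_loss = model.compute_loss(mlm_preds, masked_lm_labels,
                              num_class=model.vocab_size, ignore_index=0)
nsp_loss = model.compute_loss(nsp_preds, next_sentence_label, num_class=2)
total_loss = mlm_loss + nsp_loss
print(total_loss)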

 


Summary

That wraps up the model-construction part of the PyTorch BERT source code. Compared with the TensorFlow version, the PyTorch code assembles the final pre-training model and its prediction outputs out of many small classes; the structure is clearer as a result, but it does get a bit long-winded. Follow-up posts will go through the pre-training code as well as the data-preprocessing and embedding code.

 

 

Questions and corrections are welcome; please credit the source when reposting.
