Similarities and Differences Between ALBERT and BERT

Paper: https://openreview.net/pdf?id=H1eA7AEtvS

Pre-trained Chinese ALBERT models: https://github.com/brightmart/albert_zh

1. Factorized embedding parameterization

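In BERT, the word embedding dimension $E$ and the encoder output (hidden) dimension $H$ are the same, both 768. ALBERT argues that word-level embeddings are context-independent representations, whereas hidden-layer outputs carry not only the meaning of the word itself but also contextual information. In principle the hidden representation should therefore hold more information, which suggests $H\gg E$. For this reason ALBERT uses a word embedding dimension that is smaller than the encoder output dimension.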

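In NLP tasks the vocabulary is usually very large, and the embedding matrix has size $E\times V$. If we set $E = H$ as BERT does, the embedding matrix has a very large number of parameters, and the updates it receives during backpropagation are sparse.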

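Combining these two observations, ALBERT uses a factorization to reduce the parameter count. The one-hot vector is first mapped to a low-dimensional space of size $E$ and then mapped to a high-dimensional space. Put simply, the input first passes through a low-dimensional embedding matrix and then through a second matrix that projects it into the hidden-layer space. This reduces the parameter count from $O(V\times H)$ to $O(V\times E+E\times H)$, and the saving is substantial when $E\ll H$.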
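
As a rough illustration of the saving, the sketch below counts embedding parameters with and without the factorization. The sizes are assumptions chosen for the example: H = 768 and E = 128 match the defaults in the code in this post, while V = 30000 is an arbitrary vocabulary size. The function names are hypothetical.

```python
def embedding_params_unfactorized(vocab_size, hidden_size):
    """Parameters of a single V x H embedding matrix (BERT-style, E == H)."""
    return vocab_size * hidden_size


def embedding_params_factorized(vocab_size, embedding_size, hidden_size):
    """Parameters of a V x E lookup table plus an E x H projection matrix."""
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes only (assumed for this example).
V, H, E = 30000, 768, 128

full = embedding_params_unfactorized(V, H)
factorized = embedding_params_factorized(V, E, H)
print("unfactorized:", full)
print("factorized:  ", factorized)
print("reduction:    %.2fx" % (full / factorized))
```

With these assumed sizes the V x E table dominates the factorized count; the E x H projection adds comparatively little.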

In modeling.py, the TensorFlow code for the embedding factorization is as follows:

def embedding_lookup_factorized(input_ids, # Factorized embedding parameterization provided by ALBERT
                     vocab_size,
                     hidden_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
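    """
    :param input_ids: int tensor of token ids, [batch_size, seq_length]
    :param vocab_size: size of the vocabulary (V)
    :param hidden_size: dimension of the encoder hidden layers (H)
    :param embedding_size: dimension of the low-dimensional word embedding (E)
    :param initializer_range: range passed to create_initializer for both matrices
    :param word_embedding_name: variable name of the embedding table
    :param use_one_hot_embeddings: if True, look up via one-hot matmul; otherwise via tf.gather
    :return: (output [batch_size, seq_length, hidden_size], embedding_table, project_variable)
    """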
    # 1. Map the one-hot vectors into a low-dimensional dense space of size embedding_size
    print("embedding_lookup_factorized. factorized embedding parameterization is used.")
    if input_ids.shape.ndims == 2:
        input_ids = tf.expand_dims(input_ids, axis=[-1])  # shape of input_ids is:[ batch_size, seq_length, 1]

    embedding_table = tf.get_variable(  # [vocab_size, embedding_size]
        name=word_embedding_name,
        shape=[vocab_size, embedding_size],
        initializer=create_initializer(initializer_range))

    flat_input_ids = tf.reshape(input_ids, [-1])  # one rank. shape as (batch_size * sequence_length,)
    if use_one_hot_embeddings:
        one_hot_input_ids = tf.one_hot(flat_input_ids,depth=vocab_size)  
        output_middle = tf.matmul(one_hot_input_ids, embedding_table)  # [batch_size * sequence_length,embedding_size]
    else:
        output_middle = tf.gather(embedding_table,flat_input_ids)  # [batch_size * sequence_length,embedding_size]

    # 2. Project the output of step 1 into the hidden_size vector space
    project_variable = tf.get_variable(  # [embedding_size, hidden_size]
        name=word_embedding_name+"_2",
        shape=[embedding_size, hidden_size],
        initializer=create_initializer(initializer_range))
    output = tf.matmul(output_middle, project_variable) # [batch_size * sequence_length, hidden_size]
    # reshape back to 3 rank
    input_shape = get_shape_list(input_ids)  
    batch_size, sequence_length, _ = input_shape
    output = tf.reshape(output, (batch_size, sequence_length, hidden_size))  # [batch_size, sequence_length, hidden_size]
    return (output, embedding_table, project_variable)
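
The two lookup branches in the function above should give the same intermediate result: multiplying a one-hot row by the table selects the same row that a gather picks directly. The following is a minimal pure-Python sketch of that idea, not the TensorFlow implementation; the toy sizes, the values, and the helper names `matmul`, `lookup_by_gather` and `lookup_by_one_hot` are all made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lookup_by_gather(table, ids):
    """Pick rows of the embedding table directly (mirrors the tf.gather branch)."""
    return [list(table[i]) for i in ids]


def lookup_by_one_hot(table, ids):
    """Multiply one-hot rows by the table (mirrors the one-hot matmul branch)."""
    vocab_size = len(table)
    one_hot = [[1.0 if j == i else 0.0 for j in range(vocab_size)] for i in ids]
    return matmul(one_hot, table)


# Toy sizes: V = 4, E = 2, H = 3 (illustrative values, not from the post).
embedding_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # [V, E]
project_variable = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]              # [E, H]
flat_input_ids = [2, 0, 3]

middle_gather = lookup_by_gather(embedding_table, flat_input_ids)    # [3, E]
middle_one_hot = lookup_by_one_hot(embedding_table, flat_input_ids)  # [3, E]
output = matmul(middle_gather, project_variable)                     # [3, H]
print(output)
```

Either branch feeds the same E-dimensional vectors into the E x H projection, so the choice between them only affects how the lookup is computed.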

2. Cross-layer parameter sharing

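ALBERT also proposes sharing parameters across layers. A Transformer can share parameters in several ways: sharing only the fully connected (feed-forward) layers, or sharing only the attention layers. ALBERT combines both and shares the fully connected and the attention parameters, that is, all parameters inside the encoder layers. For a Transformer of the same size, this scheme actually lowers accuracy somewhat, but it greatly reduces the parameter count and also speeds up training considerably.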
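
To get a feel for how much each sharing scheme saves, here is a back-of-the-envelope sketch. It is a simplification under stated assumptions: it counts only the dense weights and biases of the attention and feed-forward sublayers, ignores layer normalization parameters, and uses the default sizes from the code in this post (hidden size 768, intermediate size 3072, 12 layers). The function name `encoder_params` is hypothetical.

```python
def encoder_params(hidden_size, intermediate_size, num_layers, shared_type=None):
    """Count encoder dense weights and biases (layer norm parameters are ignored)."""
    # Query, key, value and output projections: four H x H matrices plus biases.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two dense layers, H -> I and I -> H, plus their biases.
    ffn = (hidden_size * intermediate_size + intermediate_size
           + intermediate_size * hidden_size + hidden_size)
    # A shared sublayer is stored once; an unshared one is stored once per layer.
    attention_copies = 1 if shared_type in ("all", "attention") else num_layers
    ffn_copies = 1 if shared_type in ("all", "ffn") else num_layers
    return attention * attention_copies + ffn * ffn_copies


for shared_type in (None, "attention", "ffn", "all"):
    print(shared_type, encoder_params(768, 3072, 12, shared_type))
```

Under these assumptions, sharing everything leaves one layer's worth of parameters regardless of depth, while the amount of computation per forward pass is unchanged.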

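In modeling.py, cross-layer parameter sharing is implemented by enabling variable reuse in variable_scope (reuse=True; the code further down uses reuse=tf.AUTO_REUSE).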

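In the original Transformer, layer normalization comes after the residual connection; we call this the Post-LN Transformer. The Post-LN Transformer is very sensitive to hyperparameters and needs careful tuning to reach good results. The learning-rate warm-up strategy, for example, is indispensable, and all of this is very time-consuming.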

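Since warm-up is used in the initial stage of training, the problem must lie in the optimization during that stage, including the model initialization.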

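In the initial stage of training a Post-LN Transformer, the expected gradients near the output layer are very large, so without warm-up the optimization blows up and becomes highly unstable. If layer normalization is moved, for example inside the residual branch (called the Pre-LN Transformer), the gradients in the initial stage of training behave much better than in the Post-LN Transformer. Warm-up is then not even needed, which further reduces training time.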

Reference paper: On Layer Normalization in the Transformer Architecture
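
The difference between the two arrangements comes down to where the normalization sits relative to the residual addition. Below is a minimal pure-Python sketch on a single vector. It is an illustration only: `layer_norm` here has no learned scale or shift, the epsilon value is an assumption, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math


def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Post-LN: normalize after the residual addition, LN(x + F(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input, x + F(LN(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
print(post_out)
print(pre_out)
```

In the Pre-LN form the input `x` reaches the output through an unnormalized identity path, which is the structural change the paragraphs above describe.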

def prelln_transformer_model(input_tensor,
						attention_mask=None,
						hidden_size=768,
						num_hidden_layers=12,
						num_attention_heads=12,
						intermediate_size=3072,
						intermediate_act_fn=gelu,
						hidden_dropout_prob=0.1,
						attention_probs_dropout_prob=0.1,
						initializer_range=0.02,
						do_return_all_layers=False,
						shared_type='all', # None,
						adapter_fn=None):
	
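	# Shape bookkeeping used further down. These definitions are missing from
	# the excerpt; they are assumed to follow BERT's transformer_model, and
	# get_shape_list is assumed to be the same helper used in
	# embedding_lookup_factorized.
	attention_head_size = int(hidden_size / num_attention_heads)
	input_shape = get_shape_list(input_tensor, expected_rank=3)
	batch_size = input_shape[0]
	seq_length = input_shape[1]

	prev_output = bert_utils.reshape_to_matrix(input_tensor)

	all_layer_outputs = []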

	def layer_scope(idx, shared_type):
		if shared_type == 'all':
			tmp = {
				"layer":"layer_shared",
				'attention':'attention',
				'intermediate':'intermediate',
				'output':'output'
			}
		elif shared_type == 'attention':
			tmp = {
				"layer":"layer_shared",
				'attention':'attention',
				'intermediate':'intermediate_{}'.format(idx),
				'output':'output_{}'.format(idx)
			}
		elif shared_type == 'ffn':
			tmp = {
				"layer":"layer_shared",
				'attention':'attention_{}'.format(idx),
				'intermediate':'intermediate',
				'output':'output'
			}
		else:
			tmp = {
				"layer":"layer_{}".format(idx),
				'attention':'attention',
				'intermediate':'intermediate',
				'output':'output'
			}

		return tmp

	for layer_idx in range(num_hidden_layers):

		idx_scope = layer_scope(layer_idx, shared_type)
		# Share parameters across layers
		with tf.variable_scope(idx_scope['layer'], reuse=tf.AUTO_REUSE):
			layer_input = prev_output
			# Share the attention-layer parameters
			with tf.variable_scope(idx_scope['attention'], reuse=tf.AUTO_REUSE):
				attention_heads = []
				# Pre-LN: apply layer norm to the layer input before attention (parameters shared via the scope)
				with tf.variable_scope("output", reuse=tf.AUTO_REUSE):
					layer_input_pre = layer_norm(layer_input)

				with tf.variable_scope("self"):
					attention_head = attention_layer(
							from_tensor=layer_input_pre,
							to_tensor=layer_input_pre,
							attention_mask=attention_mask,
							num_attention_heads=num_attention_heads,
							size_per_head=attention_head_size,
							attention_probs_dropout_prob=attention_probs_dropout_prob,
							initializer_range=initializer_range,
							do_return_2d_tensor=True,
							batch_size=batch_size,
							from_seq_length=seq_length,
							to_seq_length=seq_length)
					attention_heads.append(attention_head)

				attention_output = None
				if len(attention_heads) == 1:
					attention_output = attention_heads[0]
				else:
					# In the case where we have other sequences, we just concatenate
					# them to the self-attention head before the projection.
					attention_output = tf.concat(attention_heads, axis=-1)

				# Run a linear projection of `hidden_size` then add a residual
				# with `layer_input`.
				# Share the parameters of the attention output projection
				with tf.variable_scope("output", reuse=tf.AUTO_REUSE):
					attention_output = tf.layers.dense(
							attention_output,
							hidden_size,
							kernel_initializer=create_initializer(initializer_range))
					attention_output = dropout(attention_output, hidden_dropout_prob)

					# attention_output = layer_norm(attention_output + layer_input)
					attention_output = attention_output + layer_input
			# Pre-LN before the feed-forward network (parameters shared according to shared_type)
			with tf.variable_scope(idx_scope['output'], reuse=tf.AUTO_REUSE):
				attention_output_pre = layer_norm(attention_output)

			# Share the parameters of the feed-forward (fully connected) layer
			with tf.variable_scope(idx_scope['intermediate'], reuse=tf.AUTO_REUSE):
				intermediate_output = tf.layers.dense(
						attention_output_pre,
						intermediate_size,
						activation=intermediate_act_fn,
						kernel_initializer=create_initializer(initializer_range))

			# Share the parameters of the feed-forward output layer
			with tf.variable_scope(idx_scope['output'], reuse=tf.AUTO_REUSE):
				layer_output = tf.layers.dense(
						intermediate_output,
						hidden_size,
						kernel_initializer=create_initializer(initializer_range))
				layer_output = dropout(layer_output, hidden_dropout_prob)

				# layer_output = layer_norm(layer_output + attention_output)
				layer_output = layer_output + attention_output
				prev_output = layer_output
				all_layer_outputs.append(layer_output)

	if do_return_all_layers:
		final_outputs = []
		for layer_output in all_layer_outputs:
			final_output = bert_utils.reshape_from_matrix(layer_output, input_shape)
			final_outputs.append(final_output)
		return final_outputs
	else:
		final_output = bert_utils.reshape_from_matrix(prev_output, input_shape)
		return final_output
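
The sharing in the function above is driven entirely by scope names: with reuse=tf.AUTO_REUSE, variables requested under the same scope path are created once and reused afterwards. The sketch below reproduces the name mapping of `layer_scope` in plain Python (written more compactly, but intended to return the same names) and counts how many distinct scope paths each `shared_type` produces. `distinct_scopes` is a hypothetical helper added for this illustration.

```python
def layer_scope(idx, shared_type):
    """Return the variable-scope names for layer `idx` (same mapping as above)."""
    shared = shared_type in ("all", "attention", "ffn")
    # Only the sublayer that is NOT shared gets a per-layer suffix.
    attn_suffix = "_{}".format(idx) if shared_type == "ffn" else ""
    ffn_suffix = "_{}".format(idx) if shared_type == "attention" else ""
    return {
        "layer": "layer_shared" if shared else "layer_{}".format(idx),
        "attention": "attention" + attn_suffix,
        "intermediate": "intermediate" + ffn_suffix,
        "output": "output" + ffn_suffix,
    }


def distinct_scopes(num_layers, shared_type):
    """Count distinct attention and feed-forward scope paths across all layers."""
    attention_paths = set()
    ffn_paths = set()
    for idx in range(num_layers):
        scope = layer_scope(idx, shared_type)
        attention_paths.add(scope["layer"] + "/" + scope["attention"])
        ffn_paths.add(scope["layer"] + "/" + scope["intermediate"])
    return len(attention_paths), len(ffn_paths)


for shared_type in ("all", "attention", "ffn", None):
    print(shared_type, distinct_scopes(12, shared_type))
```

One distinct path means one set of variables for that sublayer across all 12 layers; 12 distinct paths means every layer has its own.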

3. Inter-sentence coherence loss

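BERT's NSP task is in effect a binary classification. Positive training examples are built by sampling two consecutive sentences from the same document, and negative examples by taking two sentences from different documents.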

To keep only the coherence task and remove the influence of topic recognition, ALBERT proposes a new task, sentence-order prediction (SOP).

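NSP (Next Sentence Prediction): positive example = two adjacent sentences; negative example = two random sentences
SOP (Sentence Order Prediction): positive example = two adjacent sentences in their original order; negative example = the same two adjacent sentences with their order swapped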

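For tasks such as natural language inference (NLI), studies have found that NSP is not very effective, mainly because the task is too easy. NSP actually contains two subtasks, topic prediction and coherence prediction, but topic prediction is far easier than coherence prediction: the model only has to notice that the two sentences are about different topics. SOP lets the model learn more. Because both sentences come from the same document, SOP focuses only on sentence order, with no influence from topic.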

In the create_instances_from_document_albert function of create_pretraining_data.py, negative examples are created by swapping two sentences that were in their normal order.
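
The construction of SOP examples can be sketched in a few lines. This is a simplified illustration and not the code from create_pretraining_data.py: it works on whole sentences rather than token segments, the 50% swap probability is an assumption, and `make_sop_examples` is a hypothetical name.

```python
import random


def make_sop_examples(sentences, rng):
    """Build (first, second, label) triples from consecutive sentence pairs.

    label 1 = original order (positive), label 0 = swapped order (negative).
    """
    examples = []
    for first, second in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:  # assumed swap probability
            examples.append((first, second, 1))
        else:
            examples.append((second, first, 0))
    return examples


document = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."]
examples = make_sop_examples(document, random.Random(0))
for example in examples:
    print(example)
```

Both the positive and the negative example use the same two sentences from the same document, so topic gives the model no signal and only the order distinguishes the labels.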
