GPT-2 Explained (Paper + TensorFlow Implementation)

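GPT-2 is an upgrade of GPT that focuses on why pretraining works: it argues that a language model (LM) is itself a multitask learner, and backs this idea up with extensive zero-shot learning (ZSL) experiments.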


I. Preface

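Compared with GPT, I think GPT-2 makes three main improvements: 1) more data; 2) a bigger model; 3) a really good insight. Readers who are not yet familiar with GPT may want to read up on it first.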

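The first two need no explanation. The last one is already visible in the title of the GPT-2 paper, "Language Models are Unsupervised Multitask Learners", and it is a key point that runs through the whole paper. Earlier pretrain + finetune papers simply applied the recipe and reported that it works well, without lifting it to a more theoretical level.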

After reading the GPT-2 paper, I feel I have a somewhat different understanding of why the pretrain + finetune pipeline works in NLP.

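My own reading of this idea is as follows. The usual explanation for why pretraining helps was a guess: it finds a good initialization point. The claim here is that, while learning, an LM naturally picks up the information that supervised tasks need. In other words, the LM itself is an unsupervised multitask learner, which explains why pretraining helps downstream tasks, i.e. why it finds a good initialization point. More concretely, the paper points out that supervised tasks are really just a subset of the sequences a language model is trained on. Here are some examples I made up: modeling the sequence "The translation of apple in Chinese is 蘋果" with an LM naturally teaches translation knowledge; modeling the sequence "Yao Ming's height is 2.26 meters" naturally teaches question-answering knowledge; and so on.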
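The argument above can be written compactly. As a sketch following the paper's formulation: a sequence $x = (s_1, \ldots, s_n)$ is modeled autoregressively, and a general system should condition not only on the input but also on the task, where the task itself can be expressed as text in the same sequence.

```latex
p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})
\qquad\text{and}\qquad
p(\text{output} \mid \text{input})
\;\longrightarrow\;
p(\text{output} \mid \text{input}, \text{task})
```

In the translation example above, "The translation of apple in Chinese is" plays the role of the task plus the input, and "蘋果" is the output, all inside one sequence that the LM scores with the product on the left.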

II. How GPT-2 Works

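With the idea above in mind, we can look at how GPT-2 works, although there is not much innovation in the method itself. This section mainly covers what changed relative to GPT.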

1. Dataset

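The authors crawled a large corpus from the web for LM pretraining. The final dataset, called WebText, contains about 8 million documents and 40 GB of text. Wikipedia documents were removed, because many of the later ZSL tasks are built on Wikipedia text; this is what keeps the zero-shot setting valid.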

PS: ZSL stands for zero-shot learning.

2. Input Representation

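The input text gets no preprocessing at all (no lowercasing, no word segmentation and the like); it is simply encoded with BPE and fed to the model.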
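To make "encode with BPE" concrete, here is a minimal toy sketch of byte-level BPE merge learning. It is an illustration under simplifying assumptions, not the tokenizer shipped with GPT-2: the function names `learn_bpe_merges` and `merge_pair` are made up for this example, and the real tokenizer has additional rules that this toy leaves out.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace every adjacent (left, right) pair in `tokens` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe_merges(text, num_merges):
    """Learn up to `num_merges` merges over the raw UTF-8 bytes of `text`."""
    # Start from single bytes, so any input string is representable.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not compress anything
        merges.append((left, right))
        tokens = merge_pair(tokens, left, right)
    return merges, tokens


merges, tokens = learn_bpe_merges("low lower lowest", num_merges=3)
print(merges)
print(tokens)
```

Because the starting alphabet is the 256 possible bytes, no text ever falls outside the vocabulary, which is why no lowercasing or segmentation step is needed beforehand.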

3. Model

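The model is basically the same as GPT, except that LayerNorm is moved to the input of each sub-block, and an additional LayerNorm is added after the final self-attention block. In addition, at initialization the weights of the residual layers are scaled by $1/\sqrt{N}$, where N is the number of residual layers (I did not follow this part; if anyone understands it, please explain: isn't a residual connection just an addition? Where are the parameters?). The vocabulary is expanded to 50257 tokens, the context length grows from 512 to 1024, and the batch size is increased to 512.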
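On the $1/\sqrt{N}$ question, one plausible reading (an assumption on my part; neither the paper's wording nor the code shown in this post settles it) is that the "weights of residual layers" are the projection matrices whose outputs get added back onto the residual stream, such as `c_proj` in the code later in this post. Under that assumption, the scaling could be sketched as follows; the function name and the `base_std` default are hypothetical.

```python
import math
import random


def init_residual_projection(n_in, n_out, n_residual_layers, base_std=0.02, seed=0):
    """Gaussian init for a projection that feeds a residual add,
    scaled down by 1/sqrt(N) (assumed interpretation, see text)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_residual_layers)
    return [
        [rng.gauss(0.0, base_std) * scale for _ in range(n_out)]
        for _ in range(n_in)
    ]


# Same seed, different depth: deeper stacks get proportionally smaller weights.
w_shallow = init_residual_projection(4, 4, n_residual_layers=1)
w_deep = init_residual_projection(4, 4, n_residual_layers=4)
print(w_shallow[0][0], w_deep[0][0])
```

The intuition behind this reading is that each residual add contributes variance to the stream, so shrinking each contribution keeps the accumulated activations from growing with depth.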

III. Experiments

The authors trained models of several different sizes, see the figure below:

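The authors note that the smallest model is equivalent to the original GPT, the second smallest is on the same scale as BERT-Large, and the largest is the one called GPT-2. **All of the models were still underfitting during LM training.** This suggests that the large dataset they crawled is really quite good!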

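The authors ran the pretrained model directly on a range of downstream NLP tasks without any finetuning, i.e. in the ZSL setting. The results are as follows: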

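Here WikiText2, PTB, enwik8, text8, WikiText103 and 1BW are language modeling benchmarks; LAMBADA tests the ability to model long-range dependencies, with the task of predicting the last word of a sentence; CBT (Children's Book Test) checks LM performance on different categories of words, mainly as a cloze task.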

The authors also tested some other tasks, such as the reasoning task Winograd Schema Challenge, with the following results:

There are also reading comprehension (CoQA), summarization, translation and QA tasks. The summarization results, for example:

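Finally, the authors give a table measuring how much the test sets overlap with training data. It shows that for these benchmarks the overlap between their own training and test sets is fairly high, so the existing SoTA numbers should be discounted somewhat, whereas the training data used for GPT-2 overlaps less with the test sets, which makes GPT-2's gains all the more convincing!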

IV. TensorFlow Implementation

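Judging from the source code, it seems that, as with GPT, the pretraining code has not been released, and the examples only cover text continuation. That does not stop me from taking a closer look, so here I will cover the pretrained model structure and the generation step of text continuation. In fact, given that the paper's main point is that a pretrained LM can handle other tasks zero-shot, these two parts of the source are arguably enough for practical use!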

1. Model Structure

Structurally, the model is still very similar to GPT: both are Transformer decoders, only scaled up. The code is as follows:

def model(hparams, X, past=None, scope='model', reuse=False):
    with tf.variable_scope(scope, reuse=reuse):
        results = {}
        batch, sequence = shape_list(X)

        # Embedding
        wpe = tf.get_variable('wpe', [hparams.n_ctx, hparams.n_embd],
                             initializer=tf.random_normal_initializer(stddev=0.01))
        wte = tf.get_variable('wte', [hparams.n_vocab, hparams.n_embd],
                             initializer=tf.random_normal_initializer(stddev=0.02))
        past_length = 0 if past is None else tf.shape(past)[-2]
        h = tf.gather(wte, X) + tf.gather(wpe, positions_for(X, past_length))

        # Transformer
        presents = []
        pasts = tf.unstack(past, axis=1) if past is not None else [None] * hparams.n_layer
        assert len(pasts) == hparams.n_layer
        for layer, past in enumerate(pasts):
            h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
            presents.append(present)
        results['present'] = tf.stack(presents, axis=1)
        h = norm(h, 'ln_f')

        # Language model loss.  Do tokens <n predict token n?
        h_flat = tf.reshape(h, [batch*sequence, hparams.n_embd])
        logits = tf.matmul(h_flat, wte, transpose_b=True)
        logits = tf.reshape(logits, [batch, sequence, hparams.n_vocab])
        results['logits'] = logits
        return results

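The code is quite clear overall and has three steps:

Embedding layer: wpe and wte are the position embedding and the token embedding, respectively.
Transformer layers: the core is still the block function, discussed in detail below. Note that no length (padding) mask is passed in here either, the same as in GPT, which is still rather crude.
Output layer: once the representation of each timestep is available, the familiar softmax layer follows. The weights are again tied: the projection onto the vocabulary reuses the token embedding parameters.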
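The weight tying in the output layer can be shown with a tiny framework-free sketch. It mirrors the `tf.matmul(h_flat, wte, transpose_b=True)` line above for a single position; the helper names `embed` and `tied_logits` and the toy numbers are made up for illustration.

```python
def embed(wte, token_id):
    """Input side: look up one row of the token embedding matrix."""
    return wte[token_id]


def tied_logits(wte, h):
    """Output side: score hidden state h against every row of the SAME matrix,
    i.e. logits = h @ wte^T for one position."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in wte]


# Toy setup: vocabulary of 3 tokens, embedding size 2.
wte = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [2.0, 3.0]
logits = tied_logits(wte, h)
print(logits)
```

Since one matrix serves both directions, the model stores a single [n_vocab, n_embd] table instead of two.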

The block function is the Transformer decoder block, implemented as follows:

def block(x, scope, *, past, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
        x = x + a
        m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
        x = x + m
        return x, present

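The main difference from GPT is where norm is applied: GPT normalizes after the residual addition, while here norm is applied to the input of each sub-layer before the residual addition.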
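The two orderings can be compared side by side in a small sketch. This is a simplification under stated assumptions: `layer_norm` here has no learned gain or bias, and `sublayer` is a stand-in for attention or the MLP; the function names are made up for this example.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_step(x, sublayer):
    """GPT-2 ordering: normalize the sub-layer input, then add the residual."""
    return [xi + yi for xi, yi in zip(x, sublayer(layer_norm(x)))]


def post_ln_step(x, sublayer):
    """GPT ordering: add the residual first, then normalize the sum."""
    return layer_norm([xi + yi for xi, yi in zip(x, sublayer(x))])


def halve(v):
    # Stand-in sub-layer; the real one would be attention or the MLP.
    return [0.5 * t for t in v]


x = [1.0, 2.0, 3.0, 6.0]
pre = pre_ln_step(x, halve)
post = post_ln_step(x, halve)
print(pre)
print(post)
```

In the pre-norm version the residual stream `x` passes through untouched and only the sub-layer's view of it is normalized, whereas the post-norm version renormalizes the stream itself at every step.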

The two building blocks used here, attn and mlp, are implemented as follows:

def attn(x, scope, n_state, *, past, hparams):
    assert x.shape.ndims == 3  # Should be [batch, sequence, features]
    assert n_state % hparams.n_head == 0
    if past is not None:
        assert past.shape.ndims == 5  # Should be [batch, 2, heads, sequence, features], where 2 is [k, v]

    def split_heads(x):
        # From [batch, sequence, features] to [batch, heads, sequence, features]
        return tf.transpose(split_states(x, hparams.n_head), [0, 2, 1, 3])

    def merge_heads(x):
        # Reverse of split_heads
        return merge_states(tf.transpose(x, [0, 2, 1, 3]))

    def mask_attn_weights(w):
        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.
        _, _, nd, ns = shape_list(w)
        b = attention_mask(nd, ns, dtype=w.dtype)
        b = tf.reshape(b, [1, 1, nd, ns])
        w = w*b - tf.cast(1e10, w.dtype)*(1-b)
        return w

    def multihead_attn(q, k, v):
        # q, k, v have shape [batch, heads, sequence, features]
        w = tf.matmul(q, k, transpose_b=True)
        w = w * tf.rsqrt(tf.cast(v.shape[-1].value, w.dtype))

        w = mask_attn_weights(w)
        w = softmax(w)
        a = tf.matmul(w, v)
        return a

    with tf.variable_scope(scope):
        c = conv1d(x, 'c_attn', n_state*3)
        q, k, v = map(split_heads, tf.split(c, 3, axis=2))
        present = tf.stack([k, v], axis=1)
        if past is not None:
            pk, pv = tf.unstack(past, axis=1)
            k = tf.concat([pk, k], axis=-2)
            v = tf.concat([pv, v], axis=-2)
        a = multihead_attn(q, k, v)
        a = merge_heads(a)
        a = conv1d(a, 'c_proj', n_state)
        return a, present


def mlp(x, scope, n_state, *, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        h = gelu(conv1d(x, 'c_fc', n_state))
        h2 = conv1d(h, 'c_proj', nx)
        return h2

The feed-forward part still uses the gelu activation function.
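The `gelu` helper is called above but not shown. For reference, here is a plain-Python sketch of the widely used tanh approximation of GELU, written for a single scalar rather than a tensor:

```python
import math


def gelu(x):
    """Tanh approximation of GELU for one scalar value."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


for value in (-2.0, 0.0, 1.0, 2.0):
    print(value, gelu(value))
```

Unlike ReLU, the function is smooth around zero and lets small negative inputs through with a small negative output, while large negative inputs are squashed to roughly zero.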

2. Text Continuation

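This mainly relies on the LM's ability to generate the next token autoregressively. The main part is the two functions below (in the released source both are nested inside sample_sequence, which is where hparams, temperature, top_k and batch_size come from):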

def body(past, prev, output):
    next_outputs = step(hparams, prev[:, tf.newaxis], past=past)
    logits = next_outputs['logits'][:, -1, :]  / tf.to_float(temperature)
    logits = top_k_logits(logits, k=top_k)
    samples = tf.multinomial(logits, num_samples=1, output_dtype=tf.int32)
    return [
        tf.concat([past, next_outputs['presents']], axis=-2),
        tf.squeeze(samples, axis=[1]),
        tf.concat([output, samples], axis=1),
    ]
    
def step(hparams, tokens, past=None):
    lm_output = model.model(hparams=hparams, X=tokens, past=past, reuse=tf.AUTO_REUSE)

    logits = lm_output['logits'][:, :, :hparams.n_vocab]
    presents = lm_output['present']
    presents.set_shape(model.past_shape(hparams=hparams, batch_size=batch_size))
    return {
        'logits': logits,
        'presents': presents,
    }

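As shown, the flow is: 1. compute the logits for the next token from the current context (the step function) and divide them by the temperature; 2. keep only the top-k candidates; 3. sample one token from the resulting probability distribution as the next token of the continuation.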
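The three steps can be reproduced without TensorFlow. The following is a minimal single-sequence sketch of temperature scaling, top-k filtering and sampling; the function names are made up for this example, and it masks filtered logits with negative infinity as a simplification.

```python
import math
import random


def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest; k == 0 means no filtering."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, top_k=0, rng=random):
    """Scale by temperature, apply top-k, then sample one token id."""
    scaled = [x / temperature for x in logits]
    probs = softmax(top_k_filter(scaled, top_k))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]
print(sample_next_token(logits, temperature=0.7, top_k=2, rng=random.Random(0)))
```

A lower temperature sharpens the distribution toward the highest-scoring tokens, and a smaller `top_k` cuts off the low-probability tail; with `top_k=1` the procedure reduces to greedy decoding.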

V. Summary

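Strengths

A large corpus, WebText, was collected; even a model as large as GPT-2 still underfits it.
The largest GPT-2 model has 1.5B parameters. Evaluated zero-shot, it reaches SoTA on 7 of the 8 language modeling datasets tested.
Pretrained parameters are released. They are TensorFlow only, but converting them to other frameworks should not be hard.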

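Weaknesses

The pretraining code has not been released, and the usage examples only cover text continuation.
Only a small 117M set of pretrained parameters is provided, probably out of concern about misuse, which is understandable.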

Links

Paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Source code: https://github.com/openai/gpt-2 (TensorFlow)
https://github.com/huggingface/pytorch-pretrained-BERT (PyTorch; despite the BERT in the name, it also contains a GPT-2 implementation)
Official blog: https://openai.com/blog/better-language-models/
