Dive into Deep Learning | Notes on language models and recurrent neural networks

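0. Overview of text processing

step1: Tokenize the raw data.
step2: Deduplicate the tokens and number them, giving a [word to index] dictionary and an [index to word] list. These two structures serve as the dataset for training the recurrent neural network later.
step3: Sample from the dataset with a sampling method to obtain training batches. Common methods are random sampling and consecutive sampling.
step4: Train a language model on this dataset to obtain an NLP model. Language models include n-gram models, RNNs, LSTMs, and so on.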
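Step 2 can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper name build_vocab and the toy sentence are assumptions made for this example, and whitespace splitting stands in for a real tokenizer.

```python
def build_vocab(tokens):
    """Return (idx_to_token, token_to_idx) for a list of tokens."""
    # Deduplicate; sorting only makes the numbering deterministic
    idx_to_token = sorted(set(tokens))
    token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

# Toy input (hypothetical); a real pipeline would use a proper tokenizer
tokens = "the cat sat on the mat".split()
idx_to_token, token_to_idx = build_vocab(tokens)

# The corpus as a sequence of indices, which is what the samplers consume
corpus_indices = [token_to_idx[token] for token in tokens]
print(idx_to_token)
print(corpus_indices)
```

The index sequence corpus_indices plays the same role as the argument of that name in the sampling functions in section 2.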

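1. Tokenizing text with spaCy

spaCy gives intuitively good results. Compared with hand-written splitting logic, its tokenization better matches the meaning of the words in the language, and it can also output the idx of each token, which is the token's character offset in the original text.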

import spacy
text = "Mr. Chen doesn't agree with my suggestion."
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])
print([token.idx for token in doc])

#------------------
['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']
[0, 4, 9, 13, 17, 23, 28, 31, 41]
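To see what the idx values mean, the short check below reuses the tokens and offsets printed above and slices the original string at each offset. It does not call spaCy; the lists are copied from the output, and the variable names are chosen only for this sketch.

```python
text = "Mr. Chen doesn't agree with my suggestion."
# Copied from the spaCy output above
tokens = ['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']
offsets = [0, 4, 9, 13, 17, 23, 28, 31, 41]

# Slicing the text at each offset, for the token's length, recovers the token
recovered = [text[start: start + len(token)] for token, start in zip(tokens, offsets)]
print(recovered == tokens)
```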

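2. Random sampling and consecutive sampling

2.1 Random sampling

The code below randomly samples one minibatch at a time. The batch size batch_size is the number of examples in each minibatch, and num_steps is the number of time steps in each example. In random sampling, each example is a segment cut from the original sequence, and two consecutive minibatches are not necessarily adjacent in the original sequence.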

import torch
import random
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because, for a sequence of length n, X can cover at most the first n - 1 characters
    # (Y is X shifted by one position)
    num_examples = (len(corpus_indices) - 1) // num_steps  # floor division: number of non-overlapping examples
    example_indices = [i * num_steps for i in range(num_examples)]  # index in corpus_indices of each example's first character
    random.shuffle(example_indices)

    def _data(i):
        # Return the sequence of length num_steps starting at i
        return corpus_indices[i: i + num_steps]
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i in range(0, num_examples, batch_size):
        # Pick batch_size random examples each time
        batch_indices = example_indices[i: i + batch_size]  # first-character indices of the examples in this batch
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)
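The index arithmetic of random sampling can be traced on a toy sequence without PyTorch. The sketch below is an illustration only: the sequence 0..29, the value num_steps = 6, and the fixed seed are assumptions made here, and batching is left out so that each example is printed on its own.

```python
import random

random.seed(0)  # fixed seed only so the sketch is reproducible
corpus_indices = list(range(30))  # toy sequence (hypothetical)
num_steps = 6

# Same arithmetic as data_iter_random: non-overlapping start positions
num_examples = (len(corpus_indices) - 1) // num_steps
starts = [i * num_steps for i in range(num_examples)]
random.shuffle(starts)

pairs = []
for s in starts:
    X = corpus_indices[s: s + num_steps]
    Y = corpus_indices[s + 1: s + num_steps + 1]  # X shifted by one position
    pairs.append((X, Y))
    print(s, X, Y)
```

With 30 items there are (30 - 1) // 6 = 4 examples, starting at 0, 6, 12, and 18 in shuffled order, and every label in Y is the item that follows the corresponding input in X.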

2.2 Consecutive sampling

In consecutive sampling, two adjacent minibatches are also adjacent in the original sequence.

def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length of the sequence that is kept
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape to (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y
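The layout produced by consecutive sampling is easier to see on a toy sequence. The sketch below mirrors the slicing above with plain Python lists instead of tensors; the sequence 0..29 and the values batch_size = 2 and num_steps = 6 are assumptions made for this illustration.

```python
corpus_indices = list(range(30))  # toy sequence (hypothetical)
batch_size, num_steps = 2, 6

# Equivalent of truncating and calling view(batch_size, -1): split into batch_size rows
row_len = len(corpus_indices) // batch_size
rows = [corpus_indices[r * row_len: (r + 1) * row_len] for r in range(batch_size)]

batch_num = (row_len - 1) // num_steps
batches = []
for b in range(batch_num):
    i = b * num_steps
    X = [row[i: i + num_steps] for row in rows]
    Y = [row[i + 1: i + num_steps + 1] for row in rows]
    batches.append((X, Y))
    print(X, Y)
```

Row r of each minibatch continues exactly where row r of the previous minibatch stopped. This is what allows a hidden state to be carried from one minibatch to the next.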

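3. Language models

Given a word sequence obtained by tokenization, the goal of a language model is to assess whether the sequence is plausible, that is, to compute the probability of the sequence. Language models began with n-gram methods, which suffer from large parameter counts, sparse parameters, and low computational efficiency. They have since moved to deep-learning methods such as RNNs and LSTMs, which have greatly improved their capability.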

3.1 n-gram

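The n-gram language model is a statistical method. It builds a probability model from the relative frequencies of tokens in the corpus and the conditional probabilities between them. Because the exact conditional probabilities depend on the entire history, the computation relies on a Markov assumption: an n-gram model assumes order n - 1, so each word depends only on the previous n - 1 words. Since the estimates come from corpus counts, many n-grams occur rarely or never, so the estimates are highly sparse and the computation is inefficient.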
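A bigram model (n = 2) makes the counting and the sparsity problem concrete. The sketch below is a minimal illustration and not a production estimator: the function name bigram_probability and the toy corpus are assumptions made here, no smoothing is applied, and the count of the previous word is used as the denominator, a common simplification.

```python
from collections import Counter

def bigram_probability(sentence, corpus):
    """Estimate P(sentence) from relative frequencies under a bigram model."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    # P(w1): relative frequency of the first word
    prob = unigrams[sentence[0]] / len(corpus)
    for prev, cur in zip(sentence, sentence[1:]):
        if unigrams[prev] == 0:
            return 0.0  # an unseen history gives probability zero
        # P(cur | prev) is approximated by count(prev, cur) / count(prev)
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

corpus = "the cat sat on the mat the cat ate".split()  # toy corpus (hypothetical)
seen = bigram_probability("the cat sat".split(), corpus)
unseen = bigram_probability("the mat sat".split(), corpus)
print(seen, unseen)
```

The sentence "the mat sat" gets probability zero only because the pair (mat, sat) never occurs in the corpus, even though each word does. This is the sparsity problem described above.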

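3.2 RNN (recurrent neural network) model

A recurrent neural network (RNN) is a language model based on learned parameters. By carrying a hidden state from one time step to the next, it implicitly realizes the conditional probabilities of the probability model above. It is also easy to train and convenient to use. As the code below shows, the RNN's parameters are independent of the number of time steps in the input and output, so the same model can be reused across sequence lengths.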

import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vocab_size = 1027
num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
# num_inputs: d, the input dimension
# num_hiddens: h, the number of hidden units (a hyperparameter)
# num_outputs: q, the output dimension

def get_params():
    def _one(shape):
        # Initialize a weight matrix from a normal distribution with std 0.01
        param = torch.zeros(shape, device=device, dtype=torch.float32)
        nn.init.normal_(param, 0, 0.01)
        return torch.nn.Parameter(param)

    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    print(W_xh.shape)
    W_hh = _one((num_hiddens, num_hiddens))
    print(W_hh.shape)
    b_h = torch.nn.Parameter(torch.zeros(num_hiddens, device=device))
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    print(W_hq.shape)
    b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device))
    return (W_xh, W_hh, b_h, W_hq, b_q)
get_params()
print(num_inputs, num_hiddens, num_outputs)

#--------------------

torch.Size([1027, 256])
torch.Size([256, 256])
torch.Size([256, 1027])
1027 256 1027
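None of the printed shapes contains a time-step dimension, which is why one set of parameters can process sequences of any length. The sketch below illustrates this with a tiny pure-Python RNN cell. It is an assumption-laden toy rather than the code above: the function name rnn_forward, the tanh activation, the sizes 5 and 3, and the token ids are all chosen for this example, and the output layer (W_hq, b_q) is omitted.

```python
import math
import random

def rnn_forward(token_ids, state, W_xh, W_hh, b_h):
    """Apply a tanh RNN cell once per token and return the final hidden state."""
    num_hiddens = len(b_h)
    for token in token_ids:
        # For a one-hot input, x @ W_xh reduces to row `token` of W_xh
        state = [
            math.tanh(
                W_xh[token][j]
                + sum(state[k] * W_hh[k][j] for k in range(num_hiddens))
                + b_h[j]
            )
            for j in range(num_hiddens)
        ]
    return state

random.seed(0)
vocab_size, num_hiddens = 5, 3  # toy sizes, far smaller than 1027 and 256
W_xh = [[random.gauss(0, 0.01) for _ in range(num_hiddens)] for _ in range(vocab_size)]
W_hh = [[random.gauss(0, 0.01) for _ in range(num_hiddens)] for _ in range(num_hiddens)]
b_h = [0.0] * num_hiddens

init_state = [0.0] * num_hiddens
# The same parameters handle a 2-step and a 7-step sequence
h_short = rnn_forward([1, 4], init_state, W_xh, W_hh, b_h)
h_long = rnn_forward([0, 2, 2, 3, 1, 4, 0], init_state, W_xh, W_hh, b_h)
print(len(h_short), len(h_long))
```

Both calls return a hidden state of length num_hiddens. Only the number of loop iterations changes with the sequence length, not the parameters.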
