OpenNMT usage notes

Official site: https://opennmt.net

What it is: an open-source NMT (neural machine translation) toolkit

OpenNMT is an open source ecosystem for neural machine translation and neural sequence learning.

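Origin: released by the Harvard NLP group in late 2016. The original main version is built on Torch, with Lua as the default language; the commands in these notes use the PyTorch version, OpenNMT-py.

GitHub: https://github.com/OpenNMT/OpenNMT-py/blob/master/docs/source/Summarization.md

Other notes: the command-line arguments shown here must be adapted to your own data and model. For the arguments needed by other setups, such as a pointer network or a transformer, see the URL above.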

Usage steps after a successful installation

Using a virtual environment is recommended

source activate onmt
cd pytorch/onmt/

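Data preprocessing

Input: the src and tgt files of the training and development sets, which are split into shards of shard_size examples.

Output: train.*.pt and valid.*.pt, plus the vocabulary file vocab.pt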

onmt_preprocess -train_src data/cnndm/train.txt.src \
                -train_tgt data/cnndm/train.txt.tgt.tagged \
                -valid_src data/cnndm/val.txt.src \
                -valid_tgt data/cnndm/val.txt.tgt.tagged \
                -save_data data/cnndm/CNNDM \
                -src_seq_length 10000 \
                -tgt_seq_length 10000 \
                -src_seq_length_trunc 400 \
                -tgt_seq_length_trunc 100 \
                -dynamic_dict \
                -share_vocab \
                -shard_size 100000
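
As a rough illustration of what -shard_size does, the sketch below cuts a list of examples into consecutive shards. It is a minimal sketch, not OpenNMT's own code: the function name split_into_shards and the toy data are made up for illustration.

```python
def split_into_shards(examples, shard_size):
    """Cut a list of examples into consecutive shards of at most shard_size items."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [examples[i:i + shard_size] for i in range(0, len(examples), shard_size)]


# Toy data: 10 examples with a shard size of 4 give shards of 4, 4 and 2 examples.
shards = split_into_shards(list(range(10)), 4)
for index, shard in enumerate(shards):
    print(f"shard {index}: {len(shard)} examples")
```

Each shard would correspond to one train.*.pt file, so a smaller shard_size means more, smaller files.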

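Pretrained embeddings (this step is optional and can be skipped)

Input: an embedding file trained on a large corpus, and the vocabulary file produced by the preprocessing step

Output: an embedding file restricted to the vocabulary of this corpus

# the GloVe path below is only an example
python embeddings_to_torch.py -emb_file_both embeddings/glove/glove.6B.300d.txt \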
-dict_file data/XXX.vocab.pt \
-output_file data/XXX_embeddings

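Training

On a multi-GPU machine, prefix the command with CUDA_VISIBLE_DEVICES to choose which cards are used, e.g. CUDA_VISIBLE_DEVICES=0 for the first card. The command below trains on two GPUs (-world_size 2 -gpu_ranks 0 1), so at least two cards must be visible; for a single card, use -world_size 1 -gpu_ranks 0.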

onmt_train -save_model models/cnndm \
           -data data/cnndm/CNNDM \
           -copy_attn \
           -global_attention mlp \
           -word_vec_size 128 \
           -rnn_size 512 \
           -layers 1 \
           -encoder_type brnn \
           -train_steps 200000 \
           -max_grad_norm 2 \
           -dropout 0. \
           -batch_size 16 \
           -valid_batch_size 16 \
           -optim adagrad \
           -learning_rate 0.15 \
           -adagrad_accumulator_init 0.1 \
           -reuse_copy_attn \
           -copy_loss_by_seqlength \
           -bridge \
           -seed 777 \
           -world_size 2 \
           -gpu_ranks 0 1

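Testing

(Usually the model that performed best on the development set during training is the one chosen for testing.)

On my machine, training could use the GPU, but testing ran out of memory.

If you run into the same problem, simply remove the -gpu argument so that translation runs on the CPU.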

onmt_translate -gpu X \
               -batch_size 20 \
               -beam_size 10 \
               -model models/cnndm... \
               -src data/cnndm/test.txt.src \
               -output testout/cnndm.out \
               -min_length 35 \
               -verbose \
               -stepwise_penalty \
               -coverage_penalty summary \
               -beta 5 \
               -length_penalty wu \
               -alpha 0.9 \
               -block_ngram_repeat 3 \
               -ignore_when_blocking "." "</t>" "<t>"

During this step, the error [OverflowError: math range error] shows up.

The error can be ignored, because by the time it appears the output file testout/cnndm.out has already been written.
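
For context, "math range error" is the message Python's math module raises when a result is too large for a float. The notes do not say where in the translate step the error is raised; a final perplexity-style exp() over the accumulated scores is one plausible place, but that is an assumption. The sketch below only reproduces the error and shows a guarded variant; safe_exp is a hypothetical helper, not part of OpenNMT.

```python
import math


def safe_exp(x):
    """exp(x) that returns infinity instead of raising on overflow."""
    try:
        return math.exp(x)
    except OverflowError:
        # math.exp raises "math range error" once the result exceeds the float range.
        return float("inf")


try:
    math.exp(1000)
except OverflowError as err:
    print("raised:", err)

print(safe_exp(1000))
```

Since the error happens in a reporting computation of this kind rather than in decoding itself, it is consistent with the output file already being complete.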

Evaluation

Example

python test_rouge.py -r data/test.tgt.new -c testout/cnndm.out
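
To give a feel for what a ROUGE score measures, here is a simplified ROUGE-1 computation on whitespace tokens. It is only an illustration under stated assumptions: the real test_rouge.py relies on the official ROUGE tooling, which applies its own tokenization and options, so the numbers will not match exactly. The function name rouge_1 is made up.

```python
from collections import Counter


def rouge_1(reference, candidate):
    """Simplified ROUGE-1: clipped unigram overlap on whitespace tokens."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Each candidate word counts at most as often as it appears in the reference.
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    if precision + recall == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
```

In the toy pair above, five of the six candidate words are matched, so precision and recall are both 5/6.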

Parameter notes:

Preprocessing the data

--dynamic_dict: required when copy-attention is used; the dataset is preprocessed so that source and target are aligned.

--share_vocab: makes source and target use the same dictionary.
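
The alignment that --dynamic_dict prepares can be pictured as a small per-example vocabulary built from the source sentence, into which both source and target tokens are mapped. The sketch below is a simplification under assumptions: build_dynamic_dict, the "<unk>" symbol and the first-occurrence ordering are illustrative choices, and OpenNMT-py's actual data structures differ in detail.

```python
def build_dynamic_dict(src_tokens, tgt_tokens, unk="<unk>"):
    """Build a per-example source vocabulary and align target tokens to it."""
    # Index 0 is reserved for words that do not occur in the source.
    vocab = {unk: 0}
    for token in src_tokens:
        vocab.setdefault(token, len(vocab))
    # Source tokens as indices into the per-example vocabulary.
    src_map = [vocab[token] for token in src_tokens]
    # Target tokens that can be copied point at their source entry, others at unk.
    alignment = [vocab.get(token, 0) for token in tgt_tokens]
    return vocab, src_map, alignment


vocab, src_map, alignment = build_dynamic_dict(
    "the cat sat on the mat".split(), "cat sat quietly".split()
)
print(src_map)
print(alignment)
```

Here "cat" and "sat" can be copied from the source, while "quietly" maps to index 0 and has to come from the regular vocabulary.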

Training (the parameter choices and implementation mostly follow See et al.)

--copy_attn: [copy] This is the most important option, since it allows the model to copy words from the source.

--global_attention mlp: uses the attention mechanism of Bahdanau et al. [3] instead of that of Luong et al. [4] (global_attention dot).

--share_embeddings: shares the word embeddings between encoder and decoder, which greatly reduces the number of parameters the model has to learn. We did not find this option to be helpful, but you can try it out by adding it to the training command above.

--reuse_copy_attn: reuses the standard attention as the copy attention. Without this, the model learns an additional attention that is only used for copying.

--copy_loss_by_seqlength: divides the loss by the sequence length. In practice we found that this lets the model generate long sequences at inference time. However, this effect can also be achieved by using penalties during decoding.

--bridge: This is an additional layer that uses the final hidden state of the encoder as input and computes an initial hidden state for the decoder. Without this, the decoder is initialized with the final hidden state of the encoder directly.

--optim adagrad: Adagrad outperforms SGD when coupled with the following option.

--adagrad_accumulator_init 0.1: PyTorch does not initialize the accumulator in adagrad with any values. To match the optimization algorithm with the Tensorflow version, this option needs to be added.
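
To see why the initial accumulator value matters, the sketch below computes the size of the very first Adagrad update for a single scalar parameter, using the textbook update rule. It is a worked illustration only: adagrad_first_step is a hypothetical helper, the gradient value is arbitrary, and eps is an assumed small constant.

```python
import math


def adagrad_first_step(grad, lr=0.15, accumulator_init=0.0, eps=1e-10):
    """Size of the first Adagrad update for one scalar parameter."""
    # Adagrad accumulates squared gradients and divides the step by their root.
    accumulator = accumulator_init + grad * grad
    return lr * grad / (math.sqrt(accumulator) + eps)


# Same gradient, with the accumulator starting at 0 and at 0.1.
print(adagrad_first_step(0.1, accumulator_init=0.0))
print(adagrad_first_step(0.1, accumulator_init=0.1))
```

With an accumulator that starts at zero, the first step is close to the full learning rate whatever the gradient size; starting at 0.1 damps the early steps, which is the behaviour the option is meant to reproduce.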

Inference (uses beam search with a beam size of 10, plus the specific penalties that can be applied during decoding, listed below)

--stepwise_penalty: Applies penalty at every step

--coverage_penalty summary: [coverage] uses a penalty that discourages repeated attention to the same source word.

--beta 5: parameter for the coverage penalty.

--length_penalty wu: uses the length penalty of Wu et al.

--alpha 0.9: parameter for the length penalty.

--block_ngram_repeat 3: prevents the model from repeating trigrams.

--ignore_when_blocking "." "</t>" "<t>": allows the model to repeat trigrams that contain the sentence-boundary tokens.
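
The two blocking options can be sketched as a check on a partial hypothesis: would appending the next token create an n-gram that already occurred, unless the n-gram contains an ignored token? This is a minimal sketch of the idea, not the beam search code itself; would_repeat_ngram and its arguments are illustrative names.

```python
def would_repeat_ngram(hypothesis, next_token, n=3, ignore=(".", "</t>", "<t>")):
    """Return True if appending next_token repeats an n-gram already in hypothesis."""
    start = max(len(hypothesis) - (n - 1), 0)
    candidate = tuple(hypothesis[start:]) + (next_token,)
    if len(candidate) < n:
        return False  # not enough tokens yet to form an n-gram
    if any(token in ignore for token in candidate):
        return False  # n-grams touching ignored tokens are never blocked
    seen = {tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1)}
    return candidate in seen


hypothesis = "the cat sat on the cat".split()
print(would_repeat_ngram(hypothesis, "sat"))
print(would_repeat_ngram(hypothesis, "ran"))
```

In the example, "sat" would recreate the trigram "the cat sat" and is blocked, while "ran" is allowed.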

Example command: http://opennmt.net/OpenNMT-py/Summarization.html
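
The length and coverage penalties used at inference can be written down as two small formulas. The length penalty follows the formula from Wu et al., ((5 + length) / 6) ^ alpha. The coverage function is my reading of the "summary" variant, which penalizes attention mass above 1 on any source token; treat that form as an assumption rather than the library's definition. Both function names are made up.

```python
def wu_length_penalty(length, alpha):
    """Length penalty of Wu et al.: ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def summary_coverage_penalty(coverage, beta):
    """Assumed 'summary' coverage penalty: beta times the attention mass above 1."""
    # coverage[j] is the total attention source token j has received so far.
    return beta * sum(max(c - 1.0, 0.0) for c in coverage)


# A hypothesis score is typically divided by the length penalty, and the
# coverage penalty is subtracted from it.
print(wu_length_penalty(35, 0.9))
print(summary_coverage_penalty([0.5, 1.0, 2.5], beta=5))
```

A larger alpha favours longer outputs more strongly, and a larger beta punishes attending to the same source word again and again.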

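Data preprocessing

We also turned off the source length limit (-src_seq_length 10000 in the preprocessing command) to ensure that inputs longer than 50 words are not truncated.

For CNN-DM, we follow See et al. [2] and additionally truncate the source to 400 tokens and the target to 100 tokens.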
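
The difference between the length limit and the truncation options can be sketched as follows. This assumes that truncation is applied first and that examples still exceeding the length limit are dropped; the notes do not confirm that order. prepare_example is a hypothetical helper whose defaults mirror the values in the preprocessing command.

```python
def prepare_example(src, tgt, src_seq_length=10000, tgt_seq_length=10000,
                    src_trunc=400, tgt_trunc=100):
    """Truncate a (src, tgt) token pair, then drop it if it is still too long."""
    src = src[:src_trunc]  # keep only the first src_trunc source tokens
    tgt = tgt[:tgt_trunc]  # keep only the first tgt_trunc target tokens
    if len(src) > src_seq_length or len(tgt) > tgt_seq_length:
        return None  # example is filtered out entirely
    return src, tgt


long_src = ["w"] * 1000
long_tgt = ["s"] * 150
src, tgt = prepare_example(long_src, long_tgt)
print(len(src), len(tgt))
```

With a length limit of 50 and no truncation, the same example would be dropped instead of shortened.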

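We also found that on CNN-DM the model works better when each sentence in the target is wrapped in tags, so that a sentence looks like <t> w1 w2 w3 . </t>. If you use this format, you can remove the tags after the inference step with the commands sed -i 's/ <\/t>//g' FILE.txt and sed -i 's/<t> //g' FILE.txt.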
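
The same tag removal can be done in Python when sed is not available. This is a small sketch that applies the two substitutions from the sed commands to a single line; strip_sentence_tags is an illustrative name.

```python
def strip_sentence_tags(line):
    """Remove the <t> and </t> sentence markers from one line of model output."""
    # Same two substitutions as the sed commands: drop " </t>" and "<t> ".
    line = line.replace(" </t>", "")
    line = line.replace("<t> ", "")
    return line


print(strip_sentence_tags("<t> w1 w2 w3 . </t> <t> w4 w5 . </t>"))
```

Applied line by line to the output file, this leaves plain sentences that can be passed to the evaluation script.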
