On-Device Transformer Model Compression: Mobile Transformer

Original

2020-06-21 18:07

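As Transformer models become widely used for sequence modeling in NLP, ASR, and related fields, the demand for deploying them in resource-constrained settings such as on-device inference keeps growing. Representative mobile-transformer architectures include Evolved Transformer, Lite Transformer, MobileBERT, and MiniLM, among others. They shrink the transformer through architectural improvements, knowledge distillation, and similar strategies while keeping accuracy robust.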

1. The Evolved Transformer

Paper Link: https://arxiv.org/abs/1901.11117

GitHub: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/evolved_transformer.py

A Transformer architecture that Google obtained through neural architecture search (NAS):

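Search space: two stackable cells, one used in the transformer encoder and one in the transformer decoder. Each cell is built from NAS-style blocks. A block transforms the input embedding through a left branch and a right branch, then aggregates the two results into a new embedding, which is passed on to the self-attention layer.
Search strategy: an evolutionary algorithm (EA).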
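To make the search strategy more concrete, here is a minimal sketch of an evolutionary search with tournament selection over block-based cells. The operation list, the combiner list, `NUM_BLOCKS`, and `toy_fitness` are all hypothetical stand-ins chosen for illustration. In a real search the fitness would come from training each candidate and measuring validation quality, and the paper's actual search space is much richer than this.

```python
import random

# Hypothetical toy search space: each block picks an op for its left and
# right branch plus a way to combine them (not the paper's actual space).
OPS = ["identity", "conv3x1", "sep_conv9x1", "self_attention", "glu"]
COMBINERS = ["add", "concat", "multiply"]
NUM_BLOCKS = 4


def random_block(rng):
    return (rng.choice(OPS), rng.choice(OPS), rng.choice(COMBINERS))


def random_cell(rng):
    return [random_block(rng) for _ in range(NUM_BLOCKS)]


def mutate(cell, rng):
    """Copy the cell and resample one field of one randomly chosen block."""
    child = list(cell)
    i = rng.randrange(len(child))
    left, right, comb = child[i]
    field = rng.randrange(3)
    if field == 0:
        left = rng.choice(OPS)
    elif field == 1:
        right = rng.choice(OPS)
    else:
        comb = rng.choice(COMBINERS)
    child[i] = (left, right, comb)
    return child


def toy_fitness(cell):
    """Stand-in for 'train the candidate and measure validation quality'."""
    score = 0.0
    for left, right, comb in cell:
        score += 1.0 if left != right else 0.0  # reward branch diversity
        score += 0.5 if "self_attention" in (left, right) else 0.0
        score += 0.25 if comb == "add" else 0.0
    return score


def evolve(steps=200, population_size=20, tournament_size=5, seed=0):
    rng = random.Random(seed)
    scored = []
    for _ in range(population_size):
        cell = random_cell(rng)
        scored.append((toy_fitness(cell), cell))
    for _ in range(steps):
        # Tournament: the fittest member of a random sample becomes the parent,
        # and the least fit member of the same sample is dropped.
        sample = rng.sample(scored, tournament_size)
        parent = max(sample, key=lambda fc: fc[0])[1]
        loser = min(sample, key=lambda fc: fc[0])
        child = mutate(parent, rng)
        scored.append((toy_fitness(child), child))
        scored.remove(loser)
    return max(scored, key=lambda fc: fc[0])


best_fitness, best_cell = evolve()
print(best_fitness, best_cell)
```

Because only the weakest member of each sampled tournament is removed, the best fitness in the population never decreases, which makes this loop easy to reason about even with a noisy real-world fitness function swapped in.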

[Figure: Evolved Transformer network architecture]

2. Lite Transformer with Long-Short Range Attention

Paper Link: https://arxiv.org/abs/2004.11886

GitHub: https://github.com/mit-han-lab/lite-transformer

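Lite Transformer, proposed by Song Han's group (MIT HAN Lab), is an efficient Transformer architecture aimed at mobile deployment. Its core is Long-Short Range Attention (LSRA). LSRA splits the input embedding into two parts along the feature dimension: one part passes through a GLU and a 1-D convolution to extract local context, while the other part passes through self-attention to encode global dependencies.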
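The feature split described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the official implementation: the GLU and all learned projections are omitted, the attention branch uses Q = K = V with a single head, and the convolution uses one fixed kernel shared by every channel. The function names and the default kernel are assumptions made for this example.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no projections)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def depthwise_conv1d(x, kernel):
    """Per-channel 1-D convolution over time, zero padded to keep the length.

    Simplification: the same kernel is reused for every channel.
    """
    n, d = len(x), len(x[0])
    pad = len(kernel) // 2
    out = []
    for t in range(n):
        row = []
        for c in range(d):
            acc = 0.0
            for i, w in enumerate(kernel):
                s = t + i - pad
                if 0 <= s < n:
                    acc += w * x[s][c]
            row.append(acc)
        out.append(row)
    return out


def lsra(x, kernel=(0.25, 0.5, 0.25)):
    """Split features in half: attention on one half, convolution on the other."""
    half = len(x[0]) // 2
    global_part = [row[:half] for row in x]   # long-range branch
    local_part = [row[half:] for row in x]    # short-range branch
    g = self_attention(global_part)
    l = depthwise_conv1d(local_part, kernel)
    return [g_row + l_row for g_row, l_row in zip(g, l)]  # concatenate features


tokens = [[1.0, 2.0, 3.0, 4.0],
          [0.0, 1.0, 0.0, 1.0],
          [2.0, 2.0, 2.0, 2.0]]
print(lsra(tokens))
```

Each branch only sees half of the channels, so the attention branch costs less than full-width attention, while the convolution branch handles nearby-token patterns that attention would otherwise have to learn.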

[Figure: core structure of Lite Transformer]

3. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

Paper Link: https://arxiv.org/abs/2005.14187

GitHub: https://github.com/mit-han-lab/hardware-aware-transformers

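HAT, also proposed by Song Han's group, is a one-for-all network: sub-transformers share the parameters of a super-transformer, which allows fast adaptation to different deployment platforms and hardware devices. Its key design elements are arbitrary encoder-decoder attention and an elastic network structure (hidden size, embedding size, number of layers, and so on).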
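One way to picture the weight sharing is to let every sub-transformer reuse the leading rows and columns of the largest weight matrices. The sketch below assumes exactly that slicing scheme and models only a stack of feed-forward layers. The `CHOICES` table and all function names are hypothetical and not taken from the HAT code base.

```python
import random

# Hypothetical elastic choices; the real HAT search space is larger.
CHOICES = {"embed_dim": [4, 6, 8], "ffn_dim": [8, 12, 16], "num_layers": [1, 2, 3]}
MAX = {name: max(options) for name, options in CHOICES.items()}


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def build_super(rng):
    """One (W1, W2) feed-forward pair per layer, all at the largest sizes."""
    return [
        (make_matrix(MAX["ffn_dim"], MAX["embed_dim"], rng),
         make_matrix(MAX["embed_dim"], MAX["ffn_dim"], rng))
        for _ in range(MAX["num_layers"])
    ]


def slice_matrix(w, rows, cols):
    # Python lists are copied here; a tensor library would index into the
    # shared super-network weights instead of copying them.
    return [row[:cols] for row in w[:rows]]


def extract_sub(super_layers, cfg):
    """Keep the first num_layers layers and the leading part of each matrix."""
    e, f = cfg["embed_dim"], cfg["ffn_dim"]
    return [
        (slice_matrix(w1, f, e), slice_matrix(w2, e, f))
        for w1, w2 in super_layers[:cfg["num_layers"]]
    ]


def forward(layers, x):
    for w1, w2 in layers:
        h = [max(0.0, sum(a * b for a, b in zip(row, x))) for row in w1]  # ReLU
        x = [sum(a * b for a, b in zip(row, h)) for row in w2]
    return x


rng = random.Random(0)
super_layers = build_super(rng)
cfg = {name: rng.choice(options) for name, options in CHOICES.items()}
sub_layers = extract_sub(super_layers, cfg)
y = forward(sub_layers, [1.0] * cfg["embed_dim"])
print(cfg, len(y))
```

Since every sampled configuration is just a slice of the same parameters, candidate sub-networks can be evaluated for a given device without training each one from scratch.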

[Figure: one-for-all automated deployment pipeline and core network structure of HAT]

4. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Paper Link: https://arxiv.org/abs/2004.02984

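Google Brain proposed MobileBERT, a task-agnostic model: it can be applied to a wide range of downstream NLP tasks through simple fine-tuning. In essence, MobileBERT is a thin version of BERT_LARGE equipped with bottleneck structures and a carefully tuned balance between self-attention and feed-forward networks (FFN). To train MobileBERT, a specially designed teacher model (BERT_LARGE with inverted-bottleneck blocks) is trained first, and its knowledge is then transferred to MobileBERT through distillation.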
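As a rough illustration of layer-wise knowledge transfer, the sketch below computes a mean squared error between teacher and student feature maps, layer by layer. It assumes the teacher and student have the same number of layers and expose the same hidden size at block boundaries, which is what the bottleneck and inverted-bottleneck designs are meant to allow. It covers only this one objective, and the function names are hypothetical.

```python
def feature_map_mse(teacher_map, student_map):
    """Mean squared error between two [tokens][hidden] feature maps."""
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher_map, student_map):
        if len(t_row) != len(s_row):
            raise ValueError("feature maps must share the hidden size")
        for t, s in zip(t_row, s_row):
            total += (t - s) ** 2
            count += 1
    return total / count


def layerwise_transfer_loss(teacher_layers, student_layers):
    """Average the per-layer feature-map MSE over all aligned layers."""
    if len(teacher_layers) != len(student_layers):
        raise ValueError("teacher and student must have the same depth")
    losses = [feature_map_mse(t, s) for t, s in zip(teacher_layers, student_layers)]
    return sum(losses) / len(losses)


# Two layers, two tokens, hidden size 2 (toy values).
teacher = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
student = [[[1.0, 2.0], [3.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]]
print(layerwise_transfer_loss(teacher, student))
```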

[Figure: MobileBERT network structure and distillation scheme]

5. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Paper Link: https://arxiv.org/abs/2002.10957

GitHub: https://github.com/microsoft/unilm/tree/master/minilm

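Microsoft Research proposed a general compression method for Transformer-based pre-trained models: deep self-attention distillation. The student model is trained to mimic the teacher by transferring the attention-score information and the value-relation information from the last self-attention layer of the teacher. Transferring only the last layer's knowledge is simple and effective, trains faster, and removes the need to hand-design a layer mapping between teacher and student.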
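The two transferred quantities can be sketched for a single attention head as follows. The attention distribution is softmax(QK^T / sqrt(d)) and the value relation is softmax(VV^T / sqrt(d)); both are tokens-by-tokens matrices, so teacher and student may use different hidden sizes. The loss here is the row-averaged KL divergence from teacher to student, summed over the two terms. Function names and toy inputs are assumptions for this example, and multi-head averaging is left out.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation(a, b):
    """Row-wise softmax of a @ b^T / sqrt(d), where d is the feature size."""
    d = len(a[0])
    return [
        softmax([sum(x * y for x, y in zip(ra, rb)) / math.sqrt(d) for rb in b])
        for ra in a
    ]


def mean_row_kl(p, q):
    """Average over rows of KL(p_row || q_row)."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p_row, q_row) if pi > 0)
    return total / len(p)


def self_attention_distillation_loss(t_q, t_k, t_v, s_q, s_k, s_v):
    """Attention-distribution transfer plus value-relation transfer (one head)."""
    attention_loss = mean_row_kl(relation(t_q, t_k), relation(s_q, s_k))
    value_loss = mean_row_kl(relation(t_v, t_v), relation(s_v, s_v))
    return attention_loss + value_loss


# Teacher head size 4, student head size 2, both over the same 2 tokens.
teacher_x = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
student_x = [[0.0, 0.0], [0.0, 0.0]]
print(self_attention_distillation_loss(teacher_x, teacher_x, teacher_x,
                                       student_x, student_x, student_x))
```

Because the loss only compares token-to-token distributions, no projection layer is needed to match the student's hidden size to the teacher's.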

[Figure: knowledge transfer of attention-score information (attention score transfer) and value-relation information (value relation transfer)]

6. Miscellaneous

For the use and advantages of separable Conv1d in sequence models, see: Depthwise Separable Convolutions for Neural Machine Translation.
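One advantage is easy to check with arithmetic: a depthwise separable 1-D convolution needs far fewer weights than a standard one. The small sketch below counts weights only (biases ignored); the channel and kernel sizes are arbitrary example values.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1-D convolution."""
    return kernel_size * c_in * c_out


def separable_conv1d_params(c_in, c_out, kernel_size):
    """Depthwise (one kernel per input channel) plus 1x1 pointwise weights."""
    depthwise = kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = conv1d_params(512, 512, 7)
separable = separable_conv1d_params(512, 512, 7)
print(standard, separable, round(standard / separable, 2))
```

With 512 input and output channels and a kernel size of 7, the standard layer has 1,835,008 weights while the separable layer has 265,728, roughly a 6.9x reduction.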

For mobile inference frameworks, see MNN, NCNN, Paddle-Lite, Tengine, TNN, TF-Lite, and others.
