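In deep learning, quantization means storing tensors in fewer bits than their original floating-point format, and doing the corresponding computation in fewer bits as well. The main benefits are:

Smaller models: going from fp32 to int8 shrinks the model by close to 4x;
Faster computation: with less memory traffic and faster int8 arithmetic, inference can be roughly 2~4x faster.

In a quantized model, some or all tensor operations run on integer types instead of the float types used before quantization. Quantization also needs support from the underlying hardware. Mainstream hardware such as x86 CPUs (with AVX2), ARM CPUs, Google TPU, Nvidia Volta/Turing/Ampere, and Qualcomm DSP all support it.

PyTorch currently supports quantization in three ways:

Post Training Dynamic Quantization: dynamic quantization after training is finished;
Post Training Static Quantization: static quantization after training is finished;
QAT (Quantization Aware Training): quantization is enabled during training.

Before getting into these three, let's first look at the most basic building block: quantizing a Tensor.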

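Tensor quantization

Quantize: $$\text{Formula 1: } x_q = \mathrm{round}\left(\frac{x}{scale} + zero\_point\right)$$

Dequantize: $$\text{Formula 2: } x = (x_q - zero\_point) \cdot scale$$

Here scale is the scaling factor and zero_point is the zero reference, i.e. the value that fp32 zero maps to in the quantized tensor. In practice the result of Formula 1 is also clamped to the range of the target integer type (0 to 255 for quint8).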
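The two formulas can be checked by hand in plain Python. This is only a sketch: the helper names quantize and dequantize, the sample values, and the choice of scale = 0.5 and zero_point = 8 are illustrative assumptions, and the clamp assumes an unsigned 8-bit target.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Formula 1, followed by a clamp to [qmin, qmax] (unsigned 8-bit by default)."""
    q = round(x / scale + zero_point)
    return max(qmin, min(qmax, q))


def dequantize(xq, scale, zero_point):
    """Formula 2."""
    return (xq - zero_point) * scale


scale, zero_point = 0.5, 8  # arbitrary picks for illustration
values = [0.9872, -1.6833, -0.9345, 0.6531]

quantized = [quantize(v, scale, zero_point) for v in values]
restored = [dequantize(q, scale, zero_point) for q in quantized]

print(quantized)  # [10, 5, 6, 9]
print(restored)   # [1.0, -1.5, -1.0, 0.5]
```

The restored values differ from the inputs, which is the rounding error that quantization introduces.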

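To implement quantization, PyTorch introduced the Quantized Tensor, which can represent quantized data. It stores int8/uint8/int32 data and carries parameters such as scale and zero_point. Converting a standard float Tensor into a quantized Tensor looks like this: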

import torch

x = torch.randn(2, 2, dtype=torch.float32)
# tensor([[ 0.9872, -1.6833],
#         [-0.9345,  0.6531]])

# Formula 1 (quantize): xq = round(x / scale + zero_point)
# Convert a float tensor into a quantized tensor using the given scale and zero_point
xq = torch.quantize_per_tensor(x, scale=0.5, zero_point=8, dtype=torch.quint8)
# tensor([[ 1.0000, -1.5000],
#         [-1.0000,  0.5000]], size=(2, 2), dtype=torch.quint8,
#        quantization_scheme=torch.per_tensor_affine, scale=0.5, zero_point=8)

print(xq.int_repr())  # given a quantized tensor, return a tensor holding the underlying uint8 values
# tensor([[10,  5],
#         [ 6,  9]], dtype=torch.uint8)

# Formula 2 (dequantize): xdq = (xq - zero_point) * scale
# Convert a quantized tensor back into a float tensor using its scale and zero_point
xdq = xq.dequantize()
# tensor([[ 1.0000, -1.5000],
#         [-1.0000,  0.5000]])

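The fact that xdq no longer matches x tells us two things:

Quantization loses precision;
The scale and zp we picked arbitrarily are poor, and choosing a suitable scale and zp can reduce the loss considerably. If you don't believe it, try scale = 0.0105 and zero_point = 161, which are derived from the minimum and maximum of the x above so that the whole range fits into quint8.

In PyTorch, the job of choosing a suitable scale and zp is done by the various observers.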
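To make the idea of "choosing a suitable scale and zp" concrete, here is a rough plain-Python sketch of what a min/max style observer does. The helper names minmax_qparams and roundtrip_error are made up for this example, and the real PyTorch observers handle more cases (symmetric schemes, reduced range, constant inputs) than this does.

```python
def minmax_qparams(values, qmin=0, qmax=255):
    """Derive an affine (scale, zero_point) pair from the observed min and max."""
    # The range is widened to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)  # assumes the values are not all zero
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def roundtrip_error(values, scale, zero_point, qmin=0, qmax=255):
    """Worst absolute error after quantizing and dequantizing every value."""
    worst = 0.0
    for v in values:
        q = max(qmin, min(qmax, round(v / scale + zero_point)))
        worst = max(worst, abs((q - zero_point) * scale - v))
    return worst


x = [0.9872, -1.6833, -0.9345, 0.6531]
scale, zp = minmax_qparams(x)

print("observed scale / zp:", scale, zp)
print("error with scale=0.5, zp=8:", roundtrip_error(x, 0.5, 8))
print("error with observed params:", roundtrip_error(x, scale, zp))
```

With parameters derived from the data, the worst-case error drops from the order of 0.1 to below 0.01 for these sample values.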

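Tensor quantization supports two modes: per tensor and per channel.

Per tensor: all values in the tensor are scaled and offset in the same way;
Per channel: the values along one dimension of the tensor (usually the channel dimension) each get their own scale and offset, so a single tensor carries several scales and offsets (forming a vector). Compared with per tensor, this introduces less quantization error. PyTorch currently supports per-channel quantization for conv2d(), conv3d(), and linear().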
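The difference between the two modes shows up when channels have very different value ranges. The sketch below uses a made-up 2x3 weight matrix and symmetric scales (max absolute value divided by 127), which is my own choice for illustration and not necessarily what a given PyTorch observer would pick.

```python
import torch

# Hypothetical weight matrix: row 0 holds tiny values, row 1 holds large ones
w = torch.tensor([[0.01, -0.02, 0.03],
                  [1.00, -2.00, 3.00]])

# Per tensor: a single symmetric scale shared by every value
scale_pt = (w.abs().max() / 127).item()
w_pt = torch.quantize_per_tensor(w, scale=scale_pt, zero_point=0, dtype=torch.qint8)

# Per channel: one symmetric scale per row (axis=0)
scales_pc = w.abs().amax(dim=1) / 127
zero_points_pc = torch.zeros(w.shape[0], dtype=torch.int64)
w_pc = torch.quantize_per_channel(w, scales_pc, zero_points_pc, axis=0, dtype=torch.qint8)

# Worst absolute round-trip error in each row
err_pt = (w_pt.dequantize() - w).abs().amax(dim=1)
err_pc = (w_pc.dequantize() - w).abs().amax(dim=1)
print("per tensor :", err_pt)
print("per channel:", err_pc)
```

With a shared scale, the small row is dominated by the range of the large row and loses most of its resolution, while the per-channel version keeps the error in each row proportional to that row's own range.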

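Before digging into PyTorch model quantization, let's check whether PyTorch's official quantization covers what we need. If it doesn't, there is no point in reading the rest.

Layer	Static quantization	Dynamic quantization
nn.Linear	Y	Y
nn.Conv1d/2d/3d	Y	N (as I understand it, PyTorch considers convolution layers to have too few parameters, so quantizing the kernels would cost more accuracy than it gains, and it chooses not to quantize them)
nn.LSTM	N (LSTM now seems to be supported after all; the official docs give an example)	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y (activations in fp32)	Y
nn.Embedding	Y	N
nn.MultiheadAttention	N	N
Activations	Mostly supported	Unchanged; computation stays in fp32

One more thing to note: dynamic quantization of a PyTorch model quantizes only the weights, not the biases.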

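Post Training Dynamic Quantization

This applies dynamic quantization to the weights of a trained model, converting a floating-point model into a dynamically quantized one. Only the weights are quantized ahead of time; biases are not quantized, and activations are quantized on the fly at inference time, which is where the name "dynamic" comes from. By default only Linear and the RNN variants are quantized (these layers have many parameters, so the payoff is larger).

torch.quantization.quantize_dynamic(model, qconfig_spec=None, dtype=torch.qint8, mapping=None, inplace=False)

Parameters:

model: the floating-point model
qconfig_spec:
- one of the following
  - a set, e.g. qconfig_spec={nn.LSTM, nn.Linear}, listing the module types to quantize
  - a dict, e.g. qconfig_spec = {nn.Linear : default_dynamic_qconfig, nn.LSTM : default_dynamic_qconfig}
dtype: float16 or qint8
mapping: maps each submodule type to the dynamically quantized type that should replace it
inplace: perform the conversion in place, so the original module is mutated

Returns: the dynamically quantized model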
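The dict form of qconfig_spec can be sketched like this. TinyNet is a hypothetical model made up for the example, and the sketch assumes torch.quantization.default_dynamic_qconfig is the config you want for every nn.Linear.

```python
import torch
from torch import nn


class TinyNet(nn.Module):  # hypothetical model, used only for this sketch
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = TinyNet().eval()

# Dict form: map a module type to the qconfig it should use
quantized = torch.quantization.quantize_dynamic(
    model,
    qconfig_spec={nn.Linear: torch.quantization.default_dynamic_qconfig},
)

out = quantized(torch.randn(3, 4))
print(type(model.fc1))      # the original model is untouched because inplace=False
print(type(quantized.fc1))  # replaced by a dynamically quantized Linear
```

Because inplace defaults to False, the float model stays usable next to the quantized copy, which is handy for comparing outputs.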

Let's walk through an example:

# -*- coding:utf-8 -*-
# Author:凌逆戰 | Never
# Date: 2022/10/17
"""
Only the weights are quantized, not the activations
"""
import torch
from torch import nn

class DemoModel(torch.nn.Module):
    def __init__(self):
        super(DemoModel, self).__init__()
        self.conv = nn.Conv2d(in_channels=1,out_channels=1,kernel_size=1)
        self.relu = nn.ReLU()
        self.fc = torch.nn.Linear(2, 2)

    def forward(self, x):
        x = self.conv(x)
        x = self.relu(x)
        x = self.fc(x)
        return x


if __name__ == "__main__":
    model_fp32 = DemoModel()
    # Create a quantized instance of the model
    model_int8 = torch.quantization.quantize_dynamic(
        model=model_fp32,  # the original model
        qconfig_spec={torch.nn.Linear},  # module types to quantize dynamically
        dtype=torch.qint8)  # quantize the weights to float16 or qint8

    print(model_fp32)
    print(model_int8)

    # Run the models
    input_fp32 = torch.randn(1,1,2, 2)
    output_fp32 = model_fp32(input_fp32)
    print(output_fp32)

    output_int8 = model_int8(input_fp32)
    print(output_int8)

Output

DemoModel(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1))
  (relu): ReLU()
  (fc): Linear(in_features=2, out_features=2, bias=True)
)
DemoModel(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1))
  (relu): ReLU()
  (fc): DynamicQuantizedLinear(in_features=2, out_features=2, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
tensor([[[[-0.5361,  0.0741],
          [-0.2033,  0.4149]]]], grad_fn=<AddBackward0>)
tensor([[[[-0.5371,  0.0713],
          [-0.2040,  0.4126]]]])
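The "close to 4x smaller" claim can be checked with a quick experiment. This is a sketch under my own assumptions: a single large nn.Linear layer, with size measured by serializing the state_dict into memory. The helper state_dict_bytes is made up for this example.

```python
import io

import torch
from torch import nn


def state_dict_bytes(model):
    """Serialize the state_dict into memory and return its size in bytes."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# A layer large enough that the weights dominate the file size
model_fp32 = nn.Sequential(nn.Linear(1024, 1024)).eval()
model_int8 = torch.quantization.quantize_dynamic(model_fp32, {nn.Linear}, dtype=torch.qint8)

size_fp32 = state_dict_bytes(model_fp32)
size_int8 = state_dict_bytes(model_int8)
print(f"fp32: {size_fp32} bytes, int8: {size_int8} bytes, ratio: {size_fp32 / size_int8:.2f}x")
```

The ratio stays slightly below 4 because the bias is still stored in fp32 and the file carries some serialization overhead.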


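Post Training Static Quantization

Static quantization quantizes both the weights and the activations of a model. It requires feeding the model the training set, or data with a similar distribution (note that there is no backpropagation), and then computing the activation quantization parameters (scale and zp) from the distribution of each op's input. This is called calibration. Because the forward pass in static quantization is integer computation from start to finish, the activation parameters have to make sure that the output of one op matches what the next op expects as input.

PyTorch performs static quantization of a model in the following 5 steps:

1. fuse_model

Fuse the layers that can be fused. The goal of this step is better speed and accuracy:

fuse_modules(model, modules_to_fuse, inplace=False, fuser_func=fuse_known_modules, fuse_custom_config_dict=None)

For example, passing the following arguments to fuse_modules fuses fc and relu in the network (model_fp32 is an instance of the F32Model defined below):

torch.quantization.fuse_modules(model_fp32, [['fc', 'relu']], inplace=True)

Once fusion succeeds, fc in the original network is replaced by the new fused module (because it is the first element in the list), and relu (the remaining element in the list) is replaced by nn.Identity(), a placeholder module that returns its input unchanged. Take this small network as an example: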

import torch
from torch import nn

class F32Model(nn.Module):
    def __init__(self):
        super(F32Model, self).__init__()
        self.fc = nn.Linear(3, 2,bias=False)
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        x = self.fc(x)
        x = self.relu(x)
        return x

model_fp32 = F32Model()
print(model_fp32)
# F32Model(
#   (fc): Linear(in_features=3, out_features=2, bias=False)
#   (relu): ReLU()
# )
model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [['fc', 'relu']])
print(model_fp32_fused)
# F32Model(
#   (fc): LinearReLU(
#     (0): Linear(in_features=3, out_features=2, bias=False)
#     (1): ReLU()
#   )
#   (relu): Identity()
# )

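The modules_to_fuse list can contain several lists of names, and names of ops inside submodules work too, for example: [ ['conv1', 'bn1', 'relu1'], ['submodule.conv', 'submodule.relu']]. Some people will ask: the modules I want to fuse are wrapped in a Sequential, so how do I pass them? Use their indices as names, as in the code below: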

torch.quantization.fuse_modules(a_sequential_module, ['0', '1', '2'], inplace=True)
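Here is a small runnable version of the Sequential case. It is a sketch with a made-up Conv/BatchNorm/ReLU block; the BatchNorm statistics are randomized only so that the comparison between the original and the fused block is not trivial. Note that the block is put into eval mode first, since this kind of Conv + BatchNorm fusion is meant for inference.

```python
import torch
from torch import nn

# Hypothetical block: Conv -> BatchNorm -> ReLU wrapped in a Sequential
block = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),
    nn.ReLU(),
).eval()

# Give BatchNorm non-trivial running statistics so the check below means something
with torch.no_grad():
    block[1].running_mean.uniform_(-1.0, 1.0)
    block[1].running_var.uniform_(0.5, 2.0)

# Children of a Sequential are addressed by their index as a string
fused = torch.quantization.fuse_modules(block, ['0', '1', '2'])
print(fused)

x = torch.randn(2, 3, 8, 8)
same = torch.allclose(block(x), fused(x), atol=1e-4)
print("outputs match:", same)
```

The BatchNorm is folded into the convolution weights, so the fused block computes the same function with fewer modules.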

At the time of writing, only the following ops, in the following orders, can be fused (this mapping is defined in DEFAULT_OP_LIST_TO_FUSER_METHOD):

Convolution, BatchNorm
Convolution, BatchNorm, ReLU
Convolution, ReLU
Linear, ReLU
BatchNorm, ReLU
ConvTranspose, BatchNorm

2. Set qconfig

The qconfig has to be set on the model or on a module.

# if deploying on an x86 server
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# if deploying on ARM
model_fp32.qconfig = torch.quantization.get_default_qconfig('qnnpack')

Targets other than x86 and ARM are currently not supported.
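Which backends are actually available depends on how your PyTorch build was compiled, so it can be worth checking before picking one. The sketch below queries torch.backends.quantized.supported_engines; preferring fbgemm over qnnpack when both are present is my own assumption, not a rule from PyTorch.

```python
import torch

# Engines compiled into this PyTorch build
supported = torch.backends.quantized.supported_engines
print("supported engines:", supported)

# Assumed preference order for this sketch: fbgemm (x86) first, then qnnpack (ARM)
backend = next((name for name in ("fbgemm", "qnnpack") if name in supported), None)

if backend is not None:
    # Make the runtime kernels match the qconfig that will be used
    torch.backends.quantized.engine = backend
    qconfig = torch.quantization.get_default_qconfig(backend)
    print("using backend:", backend)
    print(qconfig)
else:
    print("no quantized backend available in this build")
```

Keeping the engine and the qconfig on the same backend avoids preparing a model for one backend and running it on the kernels of another.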

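3. prepare

prepare inserts an Observer into each submodule to collect data for calibration.

Take the activation observer as an example: it watches the input data to obtain min_val and max_val of the quadruple (min_val, max_val, qmin, qmax). It should watch at least a few hundred iterations of data, and scale and zp are then computed from this quadruple.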

model_fp32_prepared= torch.quantization.prepare(model_fp32_fused)
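To see what a single observer does in isolation, here is a sketch using MinMaxObserver directly on random data. The quint8 / per_tensor_affine settings and the random inputs are assumptions for the example; inside prepare, the observers and their settings come from the qconfig.

```python
import torch
from torch.quantization import MinMaxObserver

torch.manual_seed(0)

# An activation-style observer: unsigned 8-bit, one scale/zero_point for the whole tensor
observer = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)

# "Calibration": every call updates the running min_val / max_val
for _ in range(10):
    observer(torch.randn(4, 16))

scale, zero_point = observer.calculate_qparams()
print("min_val / max_val:", observer.min_val, observer.max_val)
print("scale / zero_point:", scale, zero_point)
```

The observer only records statistics and passes the data through unchanged; the quantization parameters are computed when calculate_qparams is called.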

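4. Feed data

This step is not training. Its purpose is to capture the distribution of the data so that the activation scale and zp can be computed well. Feed at least a few hundred iterations of data.

# observe at least a few hundred iterations
for data in data_loader:
    model_fp32_prepared(data)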

5. Convert the model

After step 4, min_val and max_val in the quadruple (min_val, max_val, qmin, qmax) are known for each op's weights, and min_val and max_val in the quadruple for each op's activations have been observed as well. In this step we call the convert API:

model_prepared_int8 = torch.quantization.convert(model_fp32_prepared)

Here is a complete example:

# -*- coding:utf-8 -*-
# Author:凌逆戰 | Never
# Date: 2022/10/17
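"""
Both the weights and the activations are quantized
"""

import torch
from torch import nn


# Define a floating-point model in which some layers can be statically quantized
class F32Model(torch.nn.Module):
    def __init__(self):
        super(F32Model, self).__init__()
        self.quant = torch.quantization.QuantStub()  # QuantStub: converts tensors from float to quantized
        self.conv = nn.Conv2d(1, 1, 1)
        self.fc = nn.Linear(2, 2, bias=False)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # DeQuantStub: converts tensors from quantized to float

    def forward(self, x):
        x = self.quant(x)  # manually mark where the tensor goes from float to quantized
        x = self.conv(x)
        x = self.fc(x)
        x = self.relu(x)
        x = self.dequant(x)  # manually mark where the tensor goes from quantized back to float
        return x


model_fp32 = F32Model()
model_fp32.eval()  # the model must be in eval mode for the static quantization logic to work

# 1. Use 'qnnpack' when deploying on ARM; use 'fbgemm' when deploying on an x86 server
#    (the engine used at runtime can be selected with torch.backends.quantized.engine)
model_fp32.qconfig = torch.quantization.get_default_qconfig('qnnpack')

# 2. Fuse layers where applicable, which can speed things up
# Common fusions are listed in DEFAULT_OP_LIST_TO_FUSER_METHOD
model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [['fc', 'relu']])

# 3. Prepare the model: insert observers that watch activations and weights
model_fp32_prepared = torch.quantization.prepare(model_fp32_fused)

# 4. Run representative data through the model to capture its distribution,
#    so that the activation scale and zp can be computed well
input_fp32 = torch.randn(1, 1, 2, 2)  # (batch_size, channel, H, W)
model_fp32_prepared(input_fp32)

# 5. Quantize the model
model_int8 = torch.quantization.convert(model_fp32_prepared)

# Run the models; in model_int8 the relevant computation happens in int8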
output_fp32 = model_fp32(input_fp32)
output_int8 = model_int8(input_fp32)
print(output_fp32)
# tensor([[[[0.6315, 0.0000],
#           [0.2466, 0.0000]]]], grad_fn=<ReluBackward0>)
print(output_int8)
# tensor([[[[0.3886, 0.0000],
#           [0.2475, 0.0000]]]])

Quantization Aware Training

I don't need this part for now; I'll fill it in when I do.

Saving and loading quantized models

First, let's quantize a model

import torch
from torch import nn

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 5,bias=True)
        self.gru = nn.GRU(input_size=5,hidden_size=5,bias=True,)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.linear(x)
        x, _ = self.gru(x)  # nn.GRU returns (output, h_n); keep only the output
        x = self.relu(x)
        return x

m = M().eval()
model_int8 = torch.quantization.quantize_dynamic(
    model=m,  # the original model
    qconfig_spec={nn.Linear,
                  nn.GRU},  # module types to quantize dynamically
    dtype=torch.qint8, inplace=True)  # quantize the weights to float16 or qint8

Save/load the state_dict of a quantized model

torch.save(model_int8.state_dict(), "./state_dict.pth")
model_int8.load_state_dict(torch.load("./state_dict.pth"))
print(model_int8)
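One detail that is easy to miss: a quantized state_dict has to be loaded into a model that has already been quantized the same way, because its keys and packed parameters do not match the float model. Below is a sketch of that round trip, using an in-memory buffer instead of a file and a made-up build_quantized helper.

```python
import io

import torch
from torch import nn


def build_quantized():
    """Build a fresh float model and quantize it the same way every time."""
    float_model = nn.Sequential(nn.Linear(5, 5), nn.ReLU()).eval()
    return torch.quantization.quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)


# "Training side": quantize and save
source = build_quantized()
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

# "Deployment side": rebuild the same quantized architecture (with different
# random weights), then load the saved state_dict into it
target = build_quantized()
target.load_state_dict(torch.load(buffer))

x = torch.randn(3, 5)
identical = torch.equal(source(x), target(x))
print("outputs identical after loading:", identical)
```

In other words, the recipe is: create the float model, apply the same quantization call, and only then call load_state_dict.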

Save/load a scripted quantized model with torch.jit.save and torch.jit.load

traced_model = torch.jit.trace(model_int8, torch.rand(5, 1, 5))  # 3-D input: (seq_len, batch, feature)
torch.jit.save(traced_model, "./traced_quant.pt")
quantized_model = torch.jit.load("./traced_quant.pt")
print(quantized_model)

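Getting the parameters of a quantized model

Getting the parameters out of a quantized PyTorch model is actually fairly awkward. Let's use the quantized model above to read the parameter values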

print(model_int8)
# M(
#   (linear): DynamicQuantizedLinear(in_features=5, out_features=5, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
#   (gru): DynamicQuantizedGRU(5, 5)
#   (relu): ReLU()
# )
print(model_int8.linear)
print(model_int8.gru)
print(model_int8.relu)

Let's try to get the weights and bias of the linear layer

# print(dir(model_int8.linear))  # list all attributes and methods of the object
print(model_int8.linear.weight().int_repr())
# tensor([[ 104,  127,   70,  -94,  121],
#         [  98,   53,  124,   74,   38],
#         [-103, -112,   38,  117,   64],
#         [ -46,  -36,  115,   82,  -75],
#         [ -14,  -94,   42,  -25,   41]], dtype=torch.int8)
print(model_int8.linear.bias())
# tensor([ 0.2437,  0.2956,  0.4010, -0.2818,  0.0950], requires_grad=True)

Oh my God, the bias is still floating point; only the weights were quantized to integers.
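The integer weights alone are not enough to recover the float values; you also need the scale and zero_point stored on the quantized weight tensor. Here is a sketch with a small stand-in model (not the M defined above), assuming the weight uses a per-tensor scheme as in the printed output.

```python
import torch
from torch import nn

# Small stand-in model for this sketch
model = torch.quantization.quantize_dynamic(
    nn.Sequential(nn.Linear(5, 5)).eval(), {nn.Linear}, dtype=torch.qint8)

w = model[0].weight()  # quantized weight tensor
scale, zero_point = w.q_scale(), w.q_zero_point()
print("scale:", scale, "zero_point:", zero_point)

# Formula 2 by hand: (xq - zero_point) * scale
manual = (w.int_repr().float() - zero_point) * scale
matches = torch.allclose(manual, w.dequantize(), atol=1e-6)
print("manual dequantization matches:", matches)
```

This is the same dequantization formula from the beginning of the article, applied to the weights of a real layer.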

OK, now let's get the weights and biases of the GRU

print(dir(model_int8.gru))
print(model_int8.gru.get_weight()["weight_ih_l0"].int_repr())   # int8
print(model_int8.gru.get_weight()["weight_hh_l0"].int_repr())   #int8
print(model_int8.gru.get_bias()["bias_ih_l0"])  # float
print(model_int8.gru.get_bias()["bias_hh_l0"])  # float

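First, don't ask me why reading these values is so cumbersome. Do you think I like it???

Second, static quantization not supporting GRU is one thing, but dynamic quantization doesn't even quantize the biases. Sigh, PyTorch quantization really still has a long way to go!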
