How Do You Use PyTorch's Quantization Features?

Source: paperweekly

Background

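In deep learning, quantization means using fewer bits to store tensors that were originally stored as floating-point numbers, and using fewer bits to perform computation that was originally done in floating point. The main benefits are:

A smaller model, close to a 4x reduction in size;
Faster computation: thanks to fewer memory accesses and faster int8 arithmetic, inference can be 2 to 4 times faster.

In a quantized model, some or all tensor operations are computed with integer types rather than the float types used before quantization. Quantization also needs support from the underlying hardware. Mainstream hardware such as x86 CPUs (with AVX2), ARM CPUs, Google TPU, Nvidia Volta/Turing/Ampere, and Qualcomm DSPs all support it.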

PyTorch 1.1 began adding limited, experimental quantization support with the torch.qint8 dtype and the torch.quantize_linear conversion function. PyTorch 1.3 officially supported quantization: beyond quantizable Tensors, it added quantized versions of the most common CNN operators, including:

1. Functions on Tensor: view, clone, resize, slice, add, multiply, cat, mean, max, sort, topk;

2. Common modules (in torch.nn.quantized): Conv2d, Linear, AvgPool2d, AdaptiveAvgPool2d, MaxPool2d, AdaptiveMaxPool2d, Interpolate, Upsample;

3. Fused operations that keep accuracy higher after quantization (in torch.nn.intrinsic): ConvReLU2d, ConvBnReLU2d, ConvBn2d, LinearReLU, add_relu.

PyTorch 1.4 added nn.quantized.Conv3d, and at the same time torchvision 0.5 began shipping quantized versions of ResNet, ResNext, MobileNetV2, GoogleNet, InceptionV3, and ShuffleNetV2.

In PyTorch 1.5, QNNPACK added support for dynamic quantization, which made quantized LSTMs usable on mobile platforms (that is, dynamic quantization support in PyTorch mobile). This release also added quantized versions of sigmoid, leaky relu, batch_norm, BatchNorm2d, Avgpool3d, quantized_hardtanh, quantized ELU activation, quantized Upsample3d, quantized batch_norm3d, a fused batch_norm3d + relu operator, and quantized hardsigmoid.

PyTorch 1.6 added quantized Conv1d, quantized hardswish, quantized layernorm, quantized groupnorm, quantized instancenorm, quantized reflection_pad1d, quantized adaptive avgpool, quantized channel shuffle op, and Quantized Threshold; added ConvBn3d, ConvBnReLU3d, BNReLU2d, and BNReLU3d; improved per-channel quantization; added dynamic quantization support for LSTMCell, RNNCell, and GRUCell; allowed quantization aware training inside nn.DataParallel and nn.DistributedDataParallel; and added support for quantized tensors on CUDA.

The latest version at the time of writing, PyTorch 1.7, added support for Embedding and EmbeddingBag quantization, aten::repeat, aten::append, tensor stack, tensor fill_, clone of per channel affine quantized tensors, 1D batch normalization, N-Dimensional constant padding, the CELU operator, and FP16 quantization.

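PyTorch currently supports quantization in three ways:

Post Training Dynamic Quantization: dynamic quantization after the model has been trained;
Post Training Static Quantization: static quantization after the model has been trained;
QAT (Quantization Aware Training): quantization switched on during training.

Before going through these three, let's first introduce the most basic piece: Tensor quantization.

Tensor Quantization

To implement quantization, PyTorch first needs a Tensor that can represent quantized data. This is the Quantized Tensor, introduced in PyTorch 1.1. A Quantized Tensor can store int8/uint8/int32 data and carries parameters such as scale and zero_point. A standard float Tensor is converted to a quantized Tensor as follows: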

>>> x = torch.rand(2,3, dtype=torch.float32) 
>>> x
tensor([[0.6839, 0.4741, 0.7451],
        [0.9301, 0.1742, 0.6835]])

>>> xq = torch.quantize_per_tensor(x, scale = 0.5, zero_point = 8, dtype=torch.quint8)
>>> xq
tensor([[0.5000, 0.5000, 0.5000],
        [1.0000, 0.0000, 0.5000]], size=(2, 3), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.5, zero_point=8)

>>> xq.int_repr()
tensor([[ 9,  9,  9],
        [10,  8,  9]], dtype=torch.uint8)

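The quantize_per_tensor function converts a float tensor into a quantized tensor using the given scale and zp. You will meet this function again later. From the numbers above you can see that the relationship between the quantized tensor xq and the fp32 tensor is roughly:

xq = round(x / scale + zero_point)

The scaling factor scale and zero_point are the two parameters that define the mapping from an fp32 tensor to a quantized tensor. scale expresses the ratio in the mapping, and zero_point is the zero reference, i.e. the value that zero in fp32 takes in the quantized tensor. When x is zero, the xq above becomes:

xq = round(zero_point) = zero_point

Now that xq is a quantized tensor, we can dequantize it back, as shown below: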

# xq is a quantized tensor with data represented as quint8
>>> xdq = xq.dequantize()
>>> xdq
tensor([[0.5000, 0.5000, 0.5000],
        [1.0000, 0.0000, 0.5000]])

The dequantize function is the inverse of quantize_per_tensor: it converts a quantized tensor back into a float tensor. That is:

xdq = (xq - zero_point) * scale
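The two formulas can be checked without PyTorch. The following is a minimal pure-Python sketch, not the library's implementation: the helper names `quantize` and `dequantize` are made up for illustration, and the clamp to [0, 255] assumes the quint8 range used in the example.

```python
def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers: q = clamp(round(x / scale) + zero_point)."""
    return [
        [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in row]
        for row in values
    ]


def dequantize(q_values, scale, zero_point):
    """Map integers back to floats: x = (q - zero_point) * scale."""
    return [[(q - zero_point) * scale for q in row] for row in q_values]


# Same values as the float tensor x in the example above.
x = [[0.6839, 0.4741, 0.7451],
     [0.9301, 0.1742, 0.6835]]

xq = quantize(x, scale=0.5, zero_point=8)
xdq = dequantize(xq, scale=0.5, zero_point=8)

print(xq)   # [[9, 9, 9], [10, 8, 9]]
print(xdq)  # [[0.5, 0.5, 0.5], [1.0, 0.0, 0.5]]
```

The integers match xq.int_repr() and the floats match xq.dequantize() from the console session.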

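The fact that xdq already differs from x tells us two things:

Quantization loses precision;
The scale and zp we picked arbitrarily here are poor, and choosing suitable values can reduce the loss considerably. If you don't believe it, try scale = 0.0036, zero_point = 0 instead.

In PyTorch, choosing a suitable scale and zp is the job of the various observers.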
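To put numbers on that comparison, here is a small pure-Python sketch (the helper names are hypothetical) that measures the worst-case round-trip error for both parameter choices, assuming the quint8 range [0, 255].

```python
def roundtrip(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize one float and immediately dequantize it again."""
    q = min(qmax, max(qmin, round(x / scale) + zero_point))
    return (q - zero_point) * scale


def max_abs_error(values, scale, zero_point):
    """Largest absolute difference between a value and its round trip."""
    return max(abs(v - roundtrip(v, scale, zero_point)) for v in values)


x = [0.6839, 0.4741, 0.7451, 0.9301, 0.1742, 0.6835]

coarse_error = max_abs_error(x, scale=0.5, zero_point=8)
fine_error = max_abs_error(x, scale=0.0036, zero_point=0)

print(coarse_error)  # about 0.245, from 0.7451 being mapped to 0.5
print(fine_error)    # about 0.012, from 0.9301 being clipped
```

With scale = 0.0036 the largest representable value is 255 * 0.0036 = 0.918, so 0.9301 is clipped. That clipping is the remaining error, and it is still far smaller than with scale = 0.5.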

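Tensor quantization supports two modes: per tensor and per channel. Per tensor means that all values in a tensor are scaled and offset in the same way. Per channel means that the values along one dimension of the tensor (usually the channel dimension) each get their own scale and offset, so a single tensor has several scales and offsets (forming a vector). As a result, per channel quantization introduces less error than per tensor quantization. PyTorch currently supports per channel quantization for conv2d(), conv3d(), and linear().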
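The sketch below shows why a per-channel scale helps. It is plain Python with a made-up two-channel weight whose channels have very different ranges, and it assumes symmetric int8 quantization (zero_point = 0), not any particular PyTorch observer.

```python
QMIN, QMAX = -128, 127


def symmetric_scale(values):
    """Scale that maps the largest absolute value onto the int8 range."""
    return max(abs(v) for v in values) / ((QMAX - QMIN) / 2)


def quantize_dequantize(v, scale):
    """Round-trip one value through symmetric int8 quantization."""
    q = min(QMAX, max(QMIN, round(v / scale)))
    return q * scale


def max_error(row, scale):
    return max(abs(v - quantize_dequantize(v, scale)) for v in row)


# Hypothetical weight: channel 0 has a small range, channel 1 a large one.
weight = [[0.010, -0.020, 0.015],
          [1.000, -2.000, 1.500]]
small_channel = weight[0]

# Per tensor: one scale for everything, dominated by the large channel.
tensor_scale = symmetric_scale([v for row in weight for v in row])
per_tensor_error = max_error(small_channel, tensor_scale)

# Per channel: each row gets its own scale.
channel_scales = [symmetric_scale(row) for row in weight]
per_channel_error = max_error(small_channel, channel_scales[0])

print(per_tensor_error, per_channel_error)
```

With one shared scale, the small channel is squeezed into only a couple of integer levels. With its own scale it uses the full int8 range.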

Post Training Dynamic Quantization

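This method is often shortened by dropping the first two words and called Dynamic Quantization. What does it mean? Note the two keywords in the full name, Post and Dynamic:

Post: the model's weights are quantized after training is finished;
Dynamic: the float32 input is quantized dynamically during the network's forward inference.

Dynamic Quantization uses the following API to quantize a model: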

torch.quantization.quantize_dynamic(model, qconfig_spec=None, dtype=torch.qint8, mapping=None, inplace=False)

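The quantize_dynamic API converts a float model into a dynamically quantized model, i.e. a model in which only the weights are quantized. The dtype parameter can be float16 or qint8. When a whole model is converted, only the following ops are converted by default: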

Linear
LSTM
LSTMCell
RNNCell
GRUCell

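Why? Because dynamic quantization only quantizes the weights, and these layers usually have many parameters that account for a very large share of the model's total, so the marginal benefit is high. Applying dynamic quantization to other layers has almost no practical value.

Now for the second parameter of this API, qconfig_spec:

qconfig_spec specifies a set of qconfigs, i.e. which op maps to which qconfig;
Each qconfig is an instance of the QConfig class, which wraps two observers;
The two observers are the activation observer and the weight observer;
Dynamic quantization, however, uses an instance of QConfigDynamic, a subclass of QConfig, which in effect only wraps the weight observer;
Activation is the post process, i.e. the post-processing after an op's forward, but it is not part of dynamic quantization;
An observer computes the two quantization parameters, scale and zero_point, from the four-tuple (min_val, max_val, qmin, qmax);
qmin and qmax are fixed in advance by the algorithm, while min_val and max_val are observed from the input data, hence the name observer.

When qconfig_spec is None you get the default behavior. To change the default behavior, you can:

Set qconfig_spec to a set, for example {nn.LSTM, nn.Linear}, which specifies which layers in the model are to be dynamically quantized;
Set qconfig_spec to a dict whose keys are submodule names or types and whose values are QConfigDynamic instances (which contain a specific Observer, such as MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, MovingAveragePerChannelMinMaxObserver, or HistogramObserver).

In fact, when qconfig_spec is None, the quantize_dynamic API uses the following default: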

qconfig_spec = {
                nn.Linear : default_dynamic_qconfig,
                nn.LSTM : default_dynamic_qconfig,
                nn.GRU : default_dynamic_qconfig,
                nn.LSTMCell : default_dynamic_qconfig,
                nn.RNNCell : default_dynamic_qconfig,
                nn.GRUCell : default_dynamic_qconfig,
            }

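This is the truth behind Gemfield's earlier remark that dynamic quantization only quantizes Linear and the RNN variants. default_dynamic_qconfig is an instance of QConfigDynamic, constructed with the following arguments: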

default_dynamic_qconfig = QConfigDynamic(activation=default_dynamic_quant_observer, weight=default_weight_observer)
default_dynamic_quant_observer = PlaceholderObserver.with_args(dtype=torch.float, compute_dtype=torch.quint8)
default_weight_observer = MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)

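Here the PlaceholderObserver used for activation is just a placeholder that does nothing, while the MinMaxObserver used for weight records the maximum and minimum of the input tensor, which are used to compute scale and zp.

With a default quantize_dynamic call, what changes does your model go through? Gemfield demonstrates with a small network: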

class CivilNet(nn.Module):
    def __init__(self):
        super(CivilNet, self).__init__()
        gemfieldin = 1
        gemfieldout = 1
        self.conv = nn.Conv2d(gemfieldin, gemfieldout, kernel_size=1, stride=1, padding=0, groups=1, bias=False)
        self.fc = nn.Linear(3, 2,bias=False)
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        x = self.conv(x)
        x = self.fc(x)
        x = self.relu(x)
        return x

The original network and the dynamically quantized network look like this:

# Original network
CivilNet(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (fc): Linear(in_features=3, out_features=2, bias=False)
  (relu): ReLU()
)

# After quantize_dynamic
CivilNet(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (fc): DynamicQuantizedLinear(in_features=3, out_features=2, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
  (relu): ReLU()
)

As you can see, no op other than Linear has changed. Linear has been converted to DynamicQuantizedLinear, which is the torch.nn.quantized.dynamic.modules.linear.Linear class.

That's right: in essence, the quantize_dynamic API looks up the type of each op in the model, and if the type is a key in the DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS dictionary, the op is replaced with the value for that key:

# Default map for swapping dynamic modules
DEFAULT_DYNAMIC_QUANT_MODULE_MAPPINGS = {
    nn.GRUCell: nnqd.GRUCell,
    nn.Linear: nnqd.Linear,
    nn.LSTM: nnqd.LSTM,
    nn.LSTMCell: nnqd.LSTMCell,
    nn.RNNCell: nnqd.RNNCell,
}

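Here nnqd.Linear is DynamicQuantizedLinear, i.e. torch.nn.quantized.dynamic.modules.linear.Linear. But once the type changes from key to value, how is the new type instantiated? More importantly, the new instance must use the existing weights. Taking Linear as the example, this logic is defined in the from_float() method of nnqd.Linear, which is invoked as follows: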

new_mod = mapping[type(mod)].from_float(mod)

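from_float mainly does the following:

Use MinMaxObserver to compute the maximum and minimum of the weight tensors of the ops in the model (only the Linear op in this example), which narrows the range of the original values to be quantized and improves quantization accuracy;
From the min_val and max_val obtained in the step above, together with the qmin and qmax fixed by the algorithm, compute scale and zp and then the quantized weight (see the earlier "Tensor Quantization" section). There are two ways to quantize, torch.quantize_per_tensor and torch.quantize_per_channel, and the former is the default (because the default qscheme is torch.per_tensor_affine);
Instantiate nnqd.Linear, then use qlinear.set_weight_bias to set the quantized weight and the original bias on the new layer. This last step also involves packing the weight and bias, which looks like this in the source code: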

#ifdef USE_FBGEMM
    if (ctx.qEngine() == at::QEngine::FBGEMM) {
      return PackedLinearWeight::prepack(std::move(weight), std::move(bias));
    }
#endif

#ifdef USE_PYTORCH_QNNPACK
    if (ctx.qEngine() == at::QEngine::QNNPACK) {
      return PackedLinearWeightsQnnp::prepack(std::move(weight), std::move(bias));
    }
#endif
    TORCH_CHECK(false,"Didn't find engine for operation quantized::linear_prepack ",toString(ctx.qEngine()));

In other words, it depends on backends such as FBGEMM and QNNPACK. What is different when the quantized model runs inference? In the original network, the computation from input to final output goes like this:

#input
torch.Tensor([[[[-1,-2,-3],[1,2,3]]]])

# After the convolution (weight is torch.Tensor([[[[-0.7867]]]]))
torch.Tensor([[[[ 0.7867,  1.5734,  2.3601],[-0.7867, -1.5734, -2.3601]]]])

# After fc (weight is torch.Tensor([[ 0.4097, -0.2896, -0.4931], [-0.3738, -0.5541,  0.3243]]))
torch.Tensor([[[[-1.2972, -0.4004], [1.2972,  0.4004]]]])

# After relu
torch.Tensor([[[[0.0000, 0.0000],[1.2972, 0.4004]]]])

In the dynamically quantized model, the process above becomes:

#input
torch.Tensor([[[[-1,-2,-3],[1,2,3]]]])

# After the convolution (weight is torch.Tensor([[[[-0.7867]]]]))
torch.Tensor([[[[ 0.7867,  1.5734,  2.3601],[-0.7867, -1.5734, -2.3601]]]])

# After fc (weight is torch.Tensor([[ 0.4085, -0.2912, -0.4911],[-0.3737, -0.5563,  0.3259]], dtype=torch.qint8,scale=0.0043458822183310986,zero_point=0))
torch.Tensor([[[[-1.3038, -0.3847], [1.2856,  0.3969]]]])

# After relu
torch.Tensor([[[[0.0000, 0.0000], [1.2856, 0.3969]]]])

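So the key is the Linear op, since the other ops are exactly the same as before quantization. You can see that the Linear weight's scale is 0.0043458822183310986 and its zero_point is 0. Where do scale and zero_point come from? They are computed by the observer in use, specifically the default MinMaxObserver. How does it work? Recall that an observer computes scale and zp from the four-tuple:

In every observer, computing the weight's scale and zp depends on four variables: min_val, max_val, qmin, and qmax. These are the minimum and maximum of the distribution of the op's weight data or input tensor data, and the minimum and maximum of the quantized value range.

qmin and qmax are easy to determine, being basically the range representable in 8 bits. Here they are -128 and 127 (the "static quantization" section below describes the calculation in more detail). The Linear op's weight is torch.Tensor([[ 0.4097, -0.2896, -0.4931], [-0.3738, -0.5541, 0.3243]]), so its min_val and max_val are -0.5541 and 0.4097. In this context max_val is then taken as the larger of the two absolute values. From this we get: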

scale = max_val / (float(qmax - qmin) / 2) = 0.5541 / ((127 + 128) / 2) = 0.004345882...
zp = 0

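The details of computing scale and zp are described further in the "static quantization" section below. From the above we can see that the weight part of the quantization is "static", converted ahead of time. It is called "dynamic" quantization because the input float tensor is converted to a quantized tensor dynamically during forward inference.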
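The weight calculation above can be reproduced in a few lines of plain Python. This is only a sketch of the symmetric formula quoted in the text. `symmetric_qparams` is a hypothetical helper, not a PyTorch function.

```python
def symmetric_qparams(min_val, max_val, qmin=-128, qmax=127):
    """scale = max(|min_val|, |max_val|) / ((qmax - qmin) / 2), zero_point = 0."""
    max_abs = max(-min_val, max_val)
    scale = max_abs / (float(qmax - qmin) / 2)
    return scale, 0


# Linear weight from the example network.
weight = [[0.4097, -0.2896, -0.4931],
          [-0.3738, -0.5541, 0.3243]]
flat = [w for row in weight for w in row]

scale, zero_point = symmetric_qparams(min(flat), max(flat))
first_row_q = [round(w / scale) + zero_point for w in weight[0]]

print(scale)        # about 0.0043459
print(first_row_q)  # [94, -67, -113]
```

Multiplying the integers back by scale gives roughly 0.4085, -0.2912, and -0.4911, which are the first-row values of the quantized fc weight shown in the dynamic-quantization trace.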

During forward, nnqd.Linear calls the torch.ops.quantized.linear_dynamic function with the (packed) quantized weight above and the float bias. torch.ops.quantized.linear_dynamic is eventually dispatched by PyTorch to the C++ function apply_dynamic_impl, which uses either the FBGEMM implementation (on x86-64 devices) or the QNNPACK implementation (on ARM devices):

#ifdef USE_FBGEMM
at::Tensor PackedLinearWeight::apply_dynamic_impl(at::Tensor input, bool reduce_range) {
  ...
  fbgemm::xxxx
  ...
}
#endif // USE_FBGEMM

#ifdef USE_PYTORCH_QNNPACK
at::Tensor PackedLinearWeightsQnnp::apply_dynamic_impl(at::Tensor input) {
  ...
  qnnpack::qnnpackLinearDynamic(xxxx)
  ...
}
#endif // USE_PYTORCH_QNNPACK

Wait, the input is still float32, so how can this be computed? Don't worry: inside apply_dynamic_impl, the input is quantized with the following logic:

Tensor q_input = at::quantize_per_tensor(input_contig, q_params.scale, q_params.zero_point, c10::kQUInt8);

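In other words, this is the essence of dynamic quantization: the scale used to quantize the input is determined dynamically from the data range observed at run time. This ensures that the input tensor's scale factor is tuned to the input data, so that finer-grained information is preserved.

The model's parameters, on the other hand, were converted to INT8 ahead of time (when the quantize_dynamic API was called). So once the input is quantized too, the computation in the network is carried out with vectorized INT8 instructions. When the current layer produces its output, the result must be converted back to float32. The re-quantization scale is determined from the input, weight, and output scales, and is defined as: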

requant_scale = input_scale_fp32 * weight_scale_fp32 / output_scale_fp32

In fact, this is how requant_scales is implemented in the apply_dynamic_impl function:

auto output_scale = 1.f;
auto inverse_output_scale = 1.f /output_scale;
requant_scales[i] = (weight_scales_data[i] * input_scale) * inverse_output_scale;

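This is why, as Gemfield mentioned earlier, the output of the quantized fc is torch.Tensor([[[[-1.3038, -0.3847], [1.2856, 0.3969]]]]), already back to an ordinary float tensor. The forward inference process of a dynamically quantized model can therefore be summarized as follows:

# Original model: all tensors and computation are floating point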
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# Dynamically quantized model: the weights of Linear and LSTM are int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8

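To sum up: Post Training Dynamic Quantization, or Dynamic Quantization for short, also called weight-only quantization, quantizes the parameters of certain ops in the model to INT8 ahead of time, dynamically quantizes the input to INT8 at run time, and then requantizes the result back to float32 when the current op produces its output. By default, dynamic quantization applies only to Linear and the RNN variants.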
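The whole flow just summarized can be sketched in pure Python. This is an illustration under simplifying assumptions, not what FBGEMM or QNNPACK actually execute: the input range is taken from a plain min/max over the current input, the input uses the full quint8 range [0, 255], and the helper names and the `eps` guard are invented for the example.

```python
def affine_qparams(min_val, max_val, qmin=0, qmax=255, eps=1e-8):
    """Choose scale/zero_point so [min_val, max_val] (widened to contain 0) maps to [qmin, qmax]."""
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = max((max_val - min_val) / float(qmax - qmin), eps)
    zero_point = qmin - round(min_val / scale)
    return scale, min(qmax, max(qmin, zero_point))


def dynamic_linear(x_rows, q_weight, weight_scale):
    """Quantize the float input at call time, accumulate in integers, return floats."""
    flat = [v for row in x_rows for v in row]
    in_scale, in_zp = affine_qparams(min(flat), max(flat))  # recomputed for every input
    output_scale = 1.0                                       # the output stays float
    requant_scale = in_scale * weight_scale / output_scale
    out = []
    for row in x_rows:
        q_row = [min(255, max(0, round(v / in_scale) + in_zp)) for v in row]
        out.append([
            sum((q - in_zp) * w for q, w in zip(q_row, w_row)) * requant_scale
            for w_row in q_weight
        ])
    return out


# fc weight from the example, quantized ahead of time (symmetric int8, zero_point = 0).
weight = [[0.4097, -0.2896, -0.4931], [-0.3738, -0.5541, 0.3243]]
weight_scale = 0.5541 / 127.5
q_weight = [[max(-128, min(127, round(w / weight_scale))) for w in row] for row in weight]

# Output of the conv layer from the example, i.e. the float input to fc.
x = [[0.7867, 1.5734, 2.3601], [-0.7867, -1.5734, -2.3601]]

y = dynamic_linear(x, q_weight, weight_scale)
y_fp32 = [[sum(a * w for a, w in zip(row, w_row)) for w_row in weight] for row in x]

print(y)       # close to, but not exactly, the float result
print(y_fp32)  # about [[-1.2972, -0.4004], [1.2972, 0.4004]]
```

The quantized result lands within a few hundredths of the float result, the same order of deviation as in the trace above. The exact digits depend on how a backend picks the input range, so they are not expected to match the article's output.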

Post Training Static Quantization

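Rather than define post training static quantization directly, let's look at what it shares with dynamic quantization and how it differs. What they share: both convert the network's weights from float32 to int8. The difference: static quantization feeds the training set, or data with a similar distribution, through the model (note: no backpropagation), and then computes the activation quantization parameters (scale and zp) from the distribution of the data observed at each op. This is called calibration.

Yes, static quantization includes activation, that is, the post process, the post-processing after an op's forward. Why does static quantization need activation? Because the forward pass of a statically quantized model is integer computation from (start+1) to (end-1), and activation must make sure that one op's output matches what the next op expects as input.

PyTorch completes static quantization of a model in five steps:

1. fuse_model

Fuse the layers that can be fused. The goal of this step is to improve speed and accuracy: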

fuse_modules(model, modules_to_fuse, inplace=False, fuser_func=fuse_known_modules, fuse_custom_config_dict=None)

For example, passing the following arguments to fuse_modules fuses conv1, bn1, and relu1 in the network:

torch.quantization.fuse_modules(gemfield_model, [['conv1', 'bn1', 'relu1']], inplace=True)

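Once fusion succeeds, conv1 in the original network is replaced by the new fused module (because it is the first element in the list), while bn1 and relu1 (the remaining elements) are replaced with nn.Identity(), a placeholder module that simply returns its input. As an example, take the following small network: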

class CivilNet(nn.Module):
    def __init__(self):
        super(CivilNet, self).__init__()
        syszuxin = 1
        syszuxout = 1
        self.conv = nn.Conv2d(syszuxin, syszuxout, kernel_size=1, stride=1, padding=0, groups=1, bias=False)
        self.fc = nn.Linear(3, 2,bias=False)
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        x = self.conv(x)
        x = self.fc(x)
        x = self.relu(x)
        return x

The network structure is:

CivilNet(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (fc): Linear(in_features=3, out_features=2, bias=False)
  (relu): ReLU()
)

After torch.quantization.fuse_modules(c, [['fc', 'relu']], inplace=True), the network becomes:

CivilNet(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (fc): LinearReLU(
    (0): Linear(in_features=3, out_features=2, bias=False)
    (1): ReLU()
  )
  (relu): Identity()
)

The modules_to_fuse list can contain several item lists, and can also name ops inside a submodule, for example: [ ['conv1', 'bn1', 'relu1'], ['submodule.conv', 'submodule.relu']]. What if the modules to fuse are wrapped in a Sequential? Pass the arguments as in the code below:

torch.quantization.fuse_modules(a_sequential_module, ['0', '1', '2'], inplace=True)

Not every type of op can be fused, nor can ops be fused in any order. As of PyTorch 1.7.1, only the following ops and orderings are supported:

Convolution, Batch normalization
Convolution, Batch normalization, Relu
Convolution, Relu
Linear, Relu
Batch normalization, Relu

In fact, this mapping is defined in DEFAULT_OP_LIST_TO_FUSER_METHOD:

DEFAULT_OP_LIST_TO_FUSER_METHOD : Dict[Tuple, Union[nn.Sequential, Callable]] = {
    (nn.Conv1d, nn.BatchNorm1d): fuse_conv_bn,
    (nn.Conv1d, nn.BatchNorm1d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv2d, nn.BatchNorm2d): fuse_conv_bn,
    (nn.Conv2d, nn.BatchNorm2d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv3d, nn.BatchNorm3d): fuse_conv_bn,
    (nn.Conv3d, nn.BatchNorm3d, nn.ReLU): fuse_conv_bn_relu,
    (nn.Conv1d, nn.ReLU): nni.ConvReLU1d,
    (nn.Conv2d, nn.ReLU): nni.ConvReLU2d,
    (nn.Conv3d, nn.ReLU): nni.ConvReLU3d,
    (nn.Linear, nn.ReLU): nni.LinearReLU,
    (nn.BatchNorm2d, nn.ReLU): nni.BNReLU2d,
    (nn.BatchNorm3d, nn.ReLU): nni.BNReLU3d,
}
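To give a feel for what fusing a convolution with a batch normalization can mean numerically, here is a pure-Python sketch of folding BN statistics into the conv weight and bias. The folding formula is an assumption about inference-mode fusion that the article itself does not spell out, the scalar "1x1 conv" is a deliberate simplification, and every parameter value except the conv weight is made up.

```python
import math


def fold_bn(conv_w, conv_b, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into one affine op."""
    factor = bn_gamma / math.sqrt(bn_var + eps)
    return conv_w * factor, (conv_b - bn_mean) * factor + bn_beta


def conv_then_bn(x, conv_w, conv_b, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    """Reference: run the scalar conv and the batch norm as two separate steps."""
    y = conv_w * x + conv_b
    return (y - bn_mean) / math.sqrt(bn_var + eps) * bn_gamma + bn_beta


# Hypothetical parameters for a single-channel 1x1 convolution followed by BN.
params = dict(conv_w=-0.7867, conv_b=0.0,
              bn_mean=0.1, bn_var=0.25, bn_gamma=1.5, bn_beta=-0.2)

fused_w, fused_b = fold_bn(**params)

inputs = [-3.0, -1.0, 0.0, 2.0]
reference = [conv_then_bn(x, **params) for x in inputs]
fused = [fused_w * x + fused_b for x in inputs]

print(reference)
print(fused)  # same values up to floating-point rounding
```

One op instead of two means one quantization step instead of two, which is one way to read the claim that fusion helps both speed and accuracy.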

2. Set the qconfig

A qconfig is set on the model or on a submodule of the model. As Gemfield said earlier, a qconfig is an instance of QConfig, and the QConfig class simply holds two observers: one used for activation and one used for the op's weights.

# To deploy on an x86 server
gemfield_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# To deploy on ARM
gemfield_model.qconfig = torch.quantization.get_default_qconfig('qnnpack')

What about targets other than x86 and ARM? Sorry, they are not supported for now. The get_default_qconfig function is in fact implemented as follows:

def get_default_qconfig(backend='fbgemm'):
    if backend == 'fbgemm':
        qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=True),weight=default_per_channel_weight_observer)
    elif backend == 'qnnpack':
        qconfig = QConfig(activation=HistogramObserver.with_args(reduce_range=False),weight=default_weight_observer)
    else:
        qconfig = default_qconfig
    return qconfig

default_qconfig is in fact QConfig(activation=default_observer, weight=default_weight_observer), so Gemfield summarizes this in a table:

Quantization backend	activation	weight
fbgemm	HistogramObserver (reduce_range=True)	PerChannelMinMaxObserver (default_per_channel_weight_observer)
qnnpack	HistogramObserver (reduce_range=False)	MinMaxObserver (default_weight_observer)
Default (neither fbgemm nor qnnpack)	MinMaxObserver (default_observer)	MinMaxObserver (default_weight_observer)

3. prepare

prepare is invoked through the following API:

gemfield_model_prepared = torch.quantization.prepare(gemfield_model)

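prepare inserts an Observer into each submodule to collect calibration data. Taking the activation observer as an example, it is expected to observe the data passing through and obtain the min_val and max_val of the four-tuple. It should observe at least a few hundred iterations of data, and scale and zp are then derived from the four-tuple.

How is the activation observer attached to a module? Remember the statement in [1] that "_forward_hooks are registered via register_forward_hook. These hooks are called after forward completes..."? That's right: the Conv2d, Linear, ReLU, and QuantStub modules in the CivilNet model each get an activation HistogramObserver inserted into their _forward_hooks, and as soon as a submodule finishes computing, its result is sent to the HistogramObserver in its _forward_hooks for observation.

After this step, the CivilNet network has been transformed into: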

CivilNet(
  (conv): Conv2d(
    1, 1, kernel_size=(1, 1), stride=(1, 1), bias=False
    (activation_post_process): HistogramObserver()
  )
  (fc): Linear(
    in_features=3, out_features=2, bias=False
    (activation_post_process): HistogramObserver()
  )
  (relu): ReLU(
    (activation_post_process): HistogramObserver()
  )
  (quant): QuantStub(
    (activation_post_process): HistogramObserver()
  )
  (dequant): DeQuantStub()
)

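4. Feed data

This step is not training. Its purpose is to capture the distribution of the data so that the activation scale and zp can be computed more accurately. Feed at least a few hundred iterations of data.

# Observe at least a few hundred iterations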
for data in data_loader:
    gemfield_model_prepared(data)
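What an observer collects during this loop can be pictured with a much simpler stand-in. The class below is a hypothetical pure-Python tracker that only remembers the running minimum and maximum. The HistogramObserver used by default in the article is more elaborate, and the batches here are made-up numbers.

```python
class MinMaxTracker:
    """Hypothetical stand-in for an observer: remembers the running min and max."""

    def __init__(self):
        self.min_val = None
        self.max_val = None

    def observe(self, batch):
        lo, hi = min(batch), max(batch)
        self.min_val = lo if self.min_val is None else min(self.min_val, lo)
        self.max_val = hi if self.max_val is None else max(self.max_val, hi)
        return batch  # data passes through unchanged, as with a forward hook


tracker = MinMaxTracker()

# Made-up calibration batches; in practice these come from the data loader.
calibration_batches = [[-1.0, 0.5, 2.0], [-3.0, 1.0, 0.0], [0.2, 2.9971, -0.4]]
for batch in calibration_batches:
    tracker.observe(batch)

print(tracker.min_val, tracker.max_val)  # -3.0 2.9971
```

No gradients and no weight updates are involved. The only state that changes is the pair of statistics later used to derive scale and zp.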

5. Convert the model

Once step 4 is done, min_val and max_val in the four-tuple (min_val, max_val, qmin, qmax) are available for each op's weights, and min_val and max_val have also been observed for each op's activation. In this step we call the convert API:

gemfield_model_prepared_int8 = torch.quantization.convert(gemfield_model_prepared)

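This process is similar to dynamic quantization. In essence it looks up the type of each op in the model, and if the type is a key in the DEFAULT_STATIC_QUANT_MODULE_MAPPINGS dictionary (note that this dictionary differs from the dynamic one), the op is replaced with the value for that key: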

DEFAULT_STATIC_QUANT_MODULE_MAPPINGS = {
    QuantStub: nnq.Quantize,
    DeQuantStub: nnq.DeQuantize,
    nn.BatchNorm2d: nnq.BatchNorm2d,
    nn.BatchNorm3d: nnq.BatchNorm3d,
    nn.Conv1d: nnq.Conv1d,
    nn.Conv2d: nnq.Conv2d,
    nn.Conv3d: nnq.Conv3d,
    nn.ConvTranspose1d: nnq.ConvTranspose1d,
    nn.ConvTranspose2d: nnq.ConvTranspose2d,
    nn.ELU: nnq.ELU,
    nn.Embedding: nnq.Embedding,
    nn.EmbeddingBag: nnq.EmbeddingBag,
    nn.GroupNorm: nnq.GroupNorm,
    nn.Hardswish: nnq.Hardswish,
    nn.InstanceNorm1d: nnq.InstanceNorm1d,
    nn.InstanceNorm2d: nnq.InstanceNorm2d,
    nn.InstanceNorm3d: nnq.InstanceNorm3d,
    nn.LayerNorm: nnq.LayerNorm,
    nn.LeakyReLU: nnq.LeakyReLU,
    nn.Linear: nnq.Linear,
    nn.ReLU6: nnq.ReLU6,
    # Wrapper Modules:
    nnq.FloatFunctional: nnq.QFunctional,
    # Intrinsic modules:
    nni.BNReLU2d: nniq.BNReLU2d,
    nni.BNReLU3d: nniq.BNReLU3d,
    nni.ConvReLU1d: nniq.ConvReLU1d,
    nni.ConvReLU2d: nniq.ConvReLU2d,
    nni.ConvReLU3d: nniq.ConvReLU3d,
    nni.LinearReLU: nniq.LinearReLU,
    nniqat.ConvBn1d: nnq.Conv1d,
    nniqat.ConvBn2d: nnq.Conv2d,
    nniqat.ConvBnReLU1d: nniq.ConvReLU1d,
    nniqat.ConvBnReLU2d: nniq.ConvReLU2d,
    nniqat.ConvReLU2d: nniq.ConvReLU2d,
    nniqat.LinearReLU: nniq.LinearReLU,
    # QAT modules:
    nnqat.Linear: nnq.Linear,
    nnqat.Conv2d: nnq.Conv2d,
}

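The replacement also works as in dynamic quantization, using the from_float() API, which uses the four-tuple information above to compute scale and zp for the op's weights and activation and then quantizes with them. In the "dynamic quantization" section Gemfield promised a more detailed look at how scale and zp are computed, and here it is. The computation covers the following questions:

Where do QuantStub's scale and zp come from (static quantization requires inserting a QuantStub, as explained later)?
Where do the conv activation's scale and zp come from?
Where do the conv weight's scale and zp come from?
Where do the fc activation's scale and zp come from?
Where do the fc weight's scale and zp come from?
Where do the relu activation's scale and zp come from?
The relu weight's... wait, relu has no weight.

Let's start with conv. Remember the Observers mentioned earlier? There are two kinds, activation and weight. Taking the fbgemm backend that Gemfield uses here as an example, the default activation observer is HistogramObserver and the default weight observer is PerChannelMinMaxObserver. The four-tuple needed to compute scale and zp is observed by these observers (well, two of its four values are).

In the convert API call, PyTorch replaces the Conv2d op with the corresponding QuantizedConv2d, and during this replacement it computes the scale and zp of the QuantizedConv2d activation and of the QuantizedConv2d weight.

In every observer, computing scale and zp depends on four variables: min_val, max_val, qmin, and qmax. These are the minimum and maximum of the distribution of the input data or the weights, and the minimum and maximum of the quantized value range. qmin and qmax are easy to determine, being basically the range representable in 8 bits. PyTorch determines them as follows: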

if self.dtype == torch.qint8:
    if self.reduce_range:
        qmin, qmax = -64, 63
    else:
        qmin, qmax = -128, 127
else:
    if self.reduce_range:
        qmin, qmax = 0, 127
    else:
        qmin, qmax = 0, 255

For example, the observer for conv's activation (quint8) is HistogramObserver with reduce_range enabled, so its qmin, qmax = 0, 127. The observer for conv's weight (qint8) is PerChannelMinMaxObserver without reduce_range, so its qmin, qmax = -128, 127.

So how are min_val and max_val determined? For HistogramObserver, they are determined from the input data + weight values by L2Norm (an approximation for L2 error minimization). For PerChannelMinMaxObserver, they are the minimum and maximum of the input data, and in the example above the values are -0.7898 and -0.7898.

Now that the conv weight's min_val, max_val, qmin, and qmax are -0.7898, -0.7898, -128, and 127, how do we get scale and zp? PyTorch computes them with the following logic:

# When qscheme is torch.per_tensor_symmetric or torch.per_channel_symmetric
max_val = torch.max(-min_val, max_val)
scale = max_val / (float(qmax - qmin) / 2)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
if self.dtype == torch.quint8:
    zero_point = zero_point.new_full(zero_point.size(), 128)

# When qscheme is torch.per_tensor_affine
scale = (max_val - min_val) / float(qmax - qmin)
scale = torch.max(scale, torch.tensor(self.eps, device=device, dtype=scale.dtype))
zero_point = qmin - torch.round(min_val / scale)
zero_point = torch.max(zero_point, torch.tensor(qmin, device=device, dtype=zero_point.dtype))
zero_point = torch.min(zero_point, torch.tensor(qmax, device=device, dtype=zero_point.dtype))

This solves the mystery of the conv2d weight:

scale = 0.7898 / ((127 + 128)/2 ) = 0.0062
zp = 0

Now for how QuantStub's scale and zp are computed. QuantStub uses HistogramObserver. Given inputs distributed over [-3, 3], HistogramObserver computes min_val and max_val as -3 and 2.9971, while qmin and qmax are 0 and 127, and its qscheme is per_tensor_affine, so applying the per_tensor_affine logic above gives:

scale = (2.9971 + 3) / (127 - 0) = 0.0472
zp = 0 - round(-3 /0.0472) = 64
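The same arithmetic as a runnable pure-Python sketch of the per_tensor_affine branch quoted earlier. `affine_qparams` is a hypothetical helper, and the default `eps` value is an assumption standing in for the observer's self.eps.

```python
def affine_qparams(min_val, max_val, qmin, qmax, eps=1e-8):
    """scale = (max - min) / (qmax - qmin); zero_point = qmin - round(min / scale), clamped."""
    scale = max((max_val - min_val) / float(qmax - qmin), eps)
    zero_point = qmin - round(min_val / scale)
    zero_point = min(qmax, max(qmin, zero_point))
    return scale, zero_point


# QuantStub example: observed range [-3, 2.9971], quint8 with reduce_range.
scale, zero_point = affine_qparams(-3.0, 2.9971, qmin=0, qmax=127)

print(round(scale, 4), zero_point)  # 0.0472 64
```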

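The other computations work the same way and are not repeated here. With scale and zp in hand, we have the quantized versions of the modules. For the CivilNet network above, the changes after static quantization look like this:

# Original CivilNet network: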
CivilNet(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (fc): Linear(in_features=3, out_features=2, bias=False)
  (relu): ReLU()
)

# CivilNet network after static quantization:
CivilNet(
  (conv): QuantizedConv2d(1, 1, kernel_size=(1, 1), stride=(1, 1), scale=0.0077941399067640305, zero_point=0, bias=False)
  (fc): QuantizedLinear(in_features=3, out_features=2, scale=0.002811126410961151, zero_point=14, qscheme=torch.per_channel_affine)
  (relu): QuantizedReLU()
)

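How does a statically quantized model run inference?

As we know, in a PyTorch network the forward inference logic lives in each op's forward function (see Gemfield: A Detailed Look at Network Construction in PyTorch [1]). After convert completes, all ops have been replaced by their quantized versions, so what is different about a quantized op's forward? Remember?

Dynamic quantization only quantized the op weights, and the scale needed to quantize the input is computed dynamically during inference. In static quantization, everything is computed ahead of time. Let's look at the inference process of a typical statically quantized model: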

import torch
import torch.nn as nn
from torch.quantization import QuantStub, DeQuantStub

class CivilNet(nn.Module):
    def __init__(self):
        super(CivilNet, self).__init__()
        in_planes = 1
        out_planes = 1
        self.conv = nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=1, padding=0, groups=1, bias=False)
        self.fc = nn.Linear(3, 2,bias=False)
        self.relu = nn.ReLU(inplace=False)
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.fc(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

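QuantStub and DeQuantStub must also be placed at the start and end of the network's forward, as shown above. Otherwise a runtime error occurs: RuntimeError: Could not run 'quantized::conv2d.new' with arguments from the 'CPU' backend. 'quantized::conv2d.new' is only available for these backends: [QuantizedCPU].

QuantStub records the observed values during the observation phase, and DeQuantStub is equivalent to Identity during the prepare phase. During the convert API call, they are replaced by nnq.Quantize and nnq.DeQuantize respectively. In the inference process covered in this section, what does QuantStub, i.e. nnq.Quantize, do? As shown below: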

def forward(self, X):
    return torch.quantize_per_tensor(X, float(self.scale), int(self.zero_point), self.dtype)

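Doesn't this echo the earlier "Tensor Quantization" section? How scale and zero_point are computed was also just covered above. And what does nnq.DeQuantize do? Simple: it dequantizes the quantized tensor.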

def forward(self, Xq):
    return Xq.dequantize()

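Again an echo of the "Tensor Quantization" section. Taking the CivilNet network above as the example, how does forward inference on the statically quantized model differ from the original model? Suppose the network input is torch.Tensor([[[[-1,-2,-3],[1,2,3]]]]):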

c = CivilNet()
t = torch.Tensor([[[[-1,-2,-3],[1,2,3]]]])
c(t)

Suppose the conv weight is torch.Tensor([[[[-0.7867]]]]) and the fc weight is torch.Tensor([[ 0.4097, -0.2896, -0.4931], [-0.3738, -0.5541, 0.3243]]). Then in the original CivilNet forward pass, the steps from input to output are:

#input
torch.Tensor([[[[-1,-2,-3],[1,2,3]]]])

# After the convolution (weight is torch.Tensor([[[[-0.7867]]]]))
torch.Tensor([[[[ 0.7867,  1.5734,  2.3601],[-0.7867, -1.5734, -2.3601]]]])

# After fc (weight is torch.Tensor([[ 0.4097, -0.2896, -0.4931], [-0.3738, -0.5541,  0.3243]]))
torch.Tensor([[[[-1.2972, -0.4004], [1.2972,  0.4004]]]])

# After relu
torch.Tensor([[[[0.0000, 0.0000],[1.2972, 0.4004]]]])

In the forward pass of the statically quantized model, the overall picture is:

#input
torch.Tensor([[[[-1,-2,-3],[1,2,3]]]])

# After QuantStub (scale=tensor([0.0472]), zero_point=tensor([64]))
tensor([[[[-0.9916, -1.9833, -3.0221],[ 0.9916,  1.9833,  3.0221]]]],
       dtype=torch.quint8, scale=0.04722102731466293, zero_point=64)

# After the convolution (weight is torch.Tensor([[[[-0.7898]]]], dtype=torch.qint8, scale=0.0062, zero_point=0))
# The conv activation (output) has scale 0.03714831545948982 and zp 64
torch.Tensor([[[[ 0.7801,  1.5602,  2.3775],[-0.7801, -1.5602, -2.3775]]]], scale=0.03714831545948982, zero_point=64)

# After fc (weight is torch.Tensor([[ 0.4100, -0.2901, -0.4951],[-0.3737, -0.5562,  0.3259]], dtype=torch.qint8, scale=tensor([0.0039, 0.0043]),zero_point=tensor([0, 0])))
# The fc activation (output) has scale 0.020418135449290276 and zp 64
torch.Tensor([[[[-1.3068, -0.3879],[ 1.3068,  0.3879]]]], dtype=torch.quint8, scale=0.020418135449290276, zero_point=64)

# After relu
torch.Tensor([[[[0.0000, 0.0000],[1.3068, 0.3879]]]], dtype=torch.quint8, scale=0.020418135449290276, zero_point=64)

# After DeQuantStub
torch.Tensor([[[[0.0000, 0.0000],[1.3068, 0.3879]]]])

Gemfield walks through this step by step with plain Python statements. First, the work of QuantStub:

import torch
import torch.nn.quantized as nnq
# Input
>>> x = torch.Tensor([[[[-1,-2,-3],[1,2,3]]]])
>>> x
tensor([[[[-1., -2., -3.],
          [ 1.,  2.,  3.]]]])

# Through QuantStub
>>> xq = torch.quantize_per_tensor(x, scale = 0.0472, zero_point = 64, dtype=torch.quint8)
>>> xq
tensor([[[[-0.9912, -1.9824, -3.0208],
          [ 0.9912,  1.9824,  3.0208]]]], size=(1, 1, 2, 3),
       dtype=torch.quint8, quantization_scheme=torch.per_tensor_affine,
       scale=0.0472, zero_point=64)

>>> xq.int_repr()
tensor([[[[ 43,  22,   0],
          [ 85, 106, 128]]]], dtype=torch.uint8)

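The QuantStub we deliberately placed at the front of the network has done its job. Its scale = 0.0472 and zero_point = 64 are known once static quantization finishes, and the quantize_per_tensor call converts the input float tensor into a quantized tensor, which is then passed to the next op, the quantized version of Conv2d: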

>>> c = nnq.Conv2d(1,1,1)
>>> weight = torch.Tensor([[[[-0.7898]]]])
>>> qweight = torch.quantize_per_channel(weight, scales=torch.Tensor([0.0062]).to(torch.double), zero_points = torch.Tensor([0]).to(torch.int64), axis=0, dtype=torch.qint8)
>>> c.set_weight_bias(qweight, None)
>>> c.scale = 0.03714831545948982
>>> c.zero_point = 64
>>> x = c(xq)
>>> x
tensor([[[[ 0.7801,  1.5602,  2.3775],
          [-0.7801, -1.5602, -2.3775]]]], size=(1, 1, 2, 3),
       dtype=torch.quint8, quantization_scheme=torch.per_tensor_affine,
       scale=0.03714831545948982, zero_point=64)

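Likewise, the Conv2d weight's scale=0.0062 and zero_points=0 are known once static quantization finishes, and so are its activation's scale = 0.03714831545948982 and zero_point = 64. The tensor is then passed to the forward function of nnq.Conv2d (see [1]), whose forward logic is: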

def forward(self, input):
    return ops.quantized.conv2d(input, self._packed_params, self.scale, self.zero_point)

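With Conv2d computed, let's pause and reflect. In floating point, -0.7898 * -0.9912 is about 0.7828, but the int8 computation here gives 0.7801, which shows that error is already being introduced (about 0.34%). This is also why, as Gemfield said earlier, fuse_modules can improve accuracy: every layer introduces a similar error.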
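Where the 0.7801 comes from can be traced with integer arithmetic in plain Python. The sketch below assumes a 1x1 single-channel convolution, so each output is one multiply, and it applies the requantization scale as an ordinary float, whereas real backends use their own fixed-point tricks. `quantized_conv1x1` is a hypothetical helper, not the library kernel.

```python
def quantized_conv1x1(xq, x_scale, x_zp, wq, w_scale, out_scale, out_zp, qmin=0, qmax=255):
    """Integer multiply, then rescale into the output's quantized domain."""
    requant_scale = x_scale * w_scale / out_scale
    out = []
    for q in xq:
        acc = (q - x_zp) * wq                       # integer product (weight zero_point is 0)
        out_q = round(acc * requant_scale) + out_zp
        out.append(min(qmax, max(qmin, out_q)))
    return out


# Integer representation of the quantized input, flattened from xq.int_repr().
xq = [43, 22, 0, 85, 106, 128]
x_scale, x_zp = 0.0472, 64

# Conv weight -0.7898 quantized with scale 0.0062 and zero_point 0.
w_scale = 0.0062
wq = round(-0.7898 / w_scale)

# Output (activation) parameters from the example.
out_scale, out_zp = 0.03714831545948982, 64

out_q = quantized_conv1x1(xq, x_scale, x_zp, wq, w_scale, out_scale, out_zp)
out = [round((q - out_zp) * out_scale, 4) for q in out_q]

print(wq)     # -127
print(out_q)  # [85, 106, 128, 43, 22, 0]
print(out)    # [0.7801, 1.5602, 2.3775, -0.7801, -1.5602, -2.3775]
```

The dequantized values agree with the output of c(xq) shown above. The gap to 0.7828 comes from rounding the weight to -127 and rounding the result onto the output grid.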

The Linear computation that follows works the same way. Its forward logic is:

def forward(self, x):
    return torch.ops.quantized.linear(x, self._packed_params._packed_params, self.scale, self.zero_point)

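As you can see, every value computed in quantized form now goes through the activation computation. This is one of the essential differences between static and dynamic quantization: there is no longer any need to convert back to a float tensor between ops. From the analysis above, the forward inference process of a statically quantized model can be summarized as:

# Original model: all tensors and computation are floating point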
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                    /
    linear_weight_fp32

# Statically quantized model: weights and inputs are both int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                    /
  linear_weight_int8

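Finally, the biggest difference between dynamic and static quantization:

In static quantization, the float input must pass through QuantStub to become int, and everything from there until the output is int;
In dynamic quantization, the float input is quantized to int with a dynamically computed scale and zp, and is converted back to float when the op produces its output.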

QAT（Quantization Aware Training）

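Both of the previous quantization methods carry the keyword post, meaning the quantization is done after the model has been trained. QAT is different: quantization is switched on during training.

QAT takes five steps. At this point you may be thinking of static quantization, so let's compare the two as we go.

1. Set the qconfig

Before setting the qconfig, the model is first put in training mode. That is easy to understand, since the focus of QAT is the T: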

cnet = CivilNet()
cnet.train()

Use the get_default_qat_qconfig API to set the qconfig on the network that will be trained with QAT:

cnet.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

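However, this qconfig is not the same as the one in static quantization. As mentioned earlier, a qconfig holds two observers, one for activation and one for weights. In a QAT qconfig, both the activation and weight observers become FakeQuantize (which has a has-a relationship with an observer, i.e. it contains one), and the parameters differ (quant_min, quant_max, dtype, qscheme, reduce_range), as shown below:

# Parameters of the activation observer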
FakeQuantize.with_args(observer=MovingAverageMinMaxObserver,quant_min=0,quant_max=255,reduce_range=True)

# Parameters of the weight observer
FakeQuantize.with_args(observer=MovingAveragePerChannelMinMaxObserver,
                                                               quant_min=-128,
                                                               quant_max=127,
                                                               dtype=torch.qint8,
                                                               qscheme=torch.per_channel_symmetric,
                                                               reduce_range=False,
                                                               ch_axis=0)

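The observer contained in FakeQuantize here is MovingAverageMinMaxObserver, which inherits from the MinMaxObserver mentioned earlier but finds the minimum and maximum slightly differently, using the following formulas:

Xmin = min(X) if Xmin is not yet set, otherwise (1 - c) * Xmin + c * min(X)
Xmax = max(X) if Xmax is not yet set, otherwise (1 - c) * Xmax + c * max(X)

Xmin and Xmax are the running, and eventually final, minimum and maximum;
X is the current input tensor;
c is a constant that defaults to 0.01 in PyTorch, so each new running extreme is 99% the previous value and 1% the current tensor.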
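The 99% / 1% update can be written out as a small pure-Python sketch. The class is hypothetical and only mimics the running-average rule described here, with the first batch used to initialize the statistics. The batches are made-up numbers.

```python
class MovingAverageMinMax:
    """Hypothetical tracker: exponential moving average of per-batch min and max."""

    def __init__(self, averaging_constant=0.01):
        self.c = averaging_constant
        self.min_val = None
        self.max_val = None

    def observe(self, batch):
        lo, hi = min(batch), max(batch)
        if self.min_val is None:
            # First batch: take its extremes as the starting point.
            self.min_val, self.max_val = lo, hi
        else:
            # Later batches: move 1% of the way toward the new extremes.
            self.min_val += self.c * (lo - self.min_val)
            self.max_val += self.c * (hi - self.max_val)


obs = MovingAverageMinMax()
obs.observe([-1.0, 0.0, 1.0])   # initializes min/max to -1.0 and 1.0
obs.observe([-3.0, 0.0, 5.0])   # nudges them toward -3.0 and 5.0

print(obs.min_val, obs.max_val)  # approximately -1.02 and 1.04
```

Compared with a plain min/max, a single outlier batch shifts the range only slightly, which keeps scale and zp stable while the weights are still changing during training.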

Apart from how min and max are obtained, MovingAverageMinMaxObserver computes scale and zero_point in the same way as its base class MinMaxObserver. So with such an observer inside it, what does FakeQuantize do? See the steps below.

2. fuse_modules

Same as in static quantization, so not repeated here.

3. prepare_qat

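In static quantization this step uses the prepare API, whereas QAT uses the prepare_qat API. There are two key differences:

prepare_qat attaches the qconfig to every op, and the content of the qconfig itself is different (see step 1 of the five steps);
prepare_qat has to do one extra step of converting submodules: it replaces some submodules of the model in place, with each key of DEFAULT_QAT_MODULE_MAPPINGS replaced by its value. The dictionary is defined as follows: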

# Default map for swapping float module to qat modules
DEFAULT_QAT_MODULE_MAPPINGS : Dict[Callable, Any] = {
    nn.Conv2d: nnqat.Conv2d,
    nn.Linear: nnqat.Linear,
    # Intrinsic modules:
    nni.ConvBn1d: nniqat.ConvBn1d,
    nni.ConvBn2d: nniqat.ConvBn2d,
    nni.ConvBnReLU1d: nniqat.ConvBnReLU1d,
    nni.ConvBnReLU2d: nniqat.ConvBnReLU2d,
    nni.ConvReLU2d: nniqat.ConvReLU2d,
    nni.LinearReLU: nniqat.LinearReLU
}
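The swapping logic driven by this dictionary can be illustrated with a toy sketch that does not depend on PyTorch. All classes below are hypothetical stand-ins; only the idea of a type-to-type mapping and a `from_float` classmethod on the QAT side mirrors what PyTorch's QAT modules expose.

```python
class Module:
    """Minimal stand-in for nn.Module: a named collection of child modules."""

    def __init__(self, **children):
        self.children = dict(children)


class FloatConv2d(Module):
    pass


class FloatLinear(Module):
    pass


class ReLU(Module):
    pass


class QATModule(Module):
    @classmethod
    def from_float(cls, float_module):
        # Build the QAT counterpart from the float module it replaces.
        return cls(**float_module.children)


class QATConv2d(QATModule):
    pass


class QATLinear(QATModule):
    pass


# Hypothetical miniature of DEFAULT_QAT_MODULE_MAPPINGS: float type -> QAT type.
QAT_MAPPING = {FloatConv2d: QATConv2d, FloatLinear: QATLinear}


def swap_modules(module, mapping):
    """Recursively replace, in place, every child whose type is a key of mapping."""
    for name, child in list(module.children.items()):
        swap_modules(child, mapping)
        qat_cls = mapping.get(type(child))
        if qat_cls is not None:
            module.children[name] = qat_cls.from_float(child)
    return module


net = Module(conv=FloatConv2d(), fc=FloatLinear(), relu=ReLU())
swap_modules(net, QAT_MAPPING)
print({name: type(child).__name__ for name, child in net.children.items()})
# {'conv': 'QATConv2d', 'fc': 'QATLinear', 'relu': 'ReLU'}
```

Modules whose type is not in the mapping, such as ReLU here, are left untouched, which matches the printed network structure in this section.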

So, compared with prepare in static quantization, prepare_qat additionally inserts fake_quants and replaces nn.Conv2d and nn.Linear. After it runs, the CivilNet network looks like this:

CivilNet(
  (conv): QATConv2d(
    1, 1, kernel_size=(1, 1), stride=(1, 1), bias=False
    (activation_post_process): FakeQuantize(
      fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),            scale=tensor([1.]), zero_point=tensor([0])
      (activation_post_process): MovingAverageMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
    )
    (weight_fake_quant): FakeQuantize(
      fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),            scale=tensor([1.]), zero_point=tensor([0])
      (activation_post_process): MovingAveragePerChannelMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
    )
  )
  (fc): QATLinear(
    in_features=3, out_features=2, bias=False
    (activation_post_process): FakeQuantize(
      fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),            scale=tensor([1.]), zero_point=tensor([0])
      (activation_post_process): MovingAverageMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
    )
    (weight_fake_quant): FakeQuantize(
      fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),            scale=tensor([1.]), zero_point=tensor([0])
      (activation_post_process): MovingAveragePerChannelMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
    )
  )
  (relu): ReLU(
    (activation_post_process): FakeQuantize(
      fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),            scale=tensor([1.]), zero_point=tensor([0])
      (activation_post_process): MovingAverageMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
    )
  )
  (quant): QuantStub(
    (activation_post_process): FakeQuantize(
      fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),            scale=tensor([1.]), zero_point=tensor([0])
      (activation_post_process): MovingAverageMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
    )
  )
  (dequant): DeQuantStub()
)

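4. Feeding data

Unlike in static quantization, this step in QAT is used for training. In a PyTorch network, the forward-pass logic lives in the forward function of each op (see Gemfield's "A Detailed Look at Network Construction in PyTorch" [1]). Since prepare_qat replaced the ops listed in the mapping with their QAT versions, what is special about the forward functions of those ops?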

Conv2d is replaced by QATConv2d:

def forward(self, input):
   return self.activation_post_process(self._conv_forward(input, self.weight_fake_quant(self.weight)))

Linear is replaced by QATLinear:

def forward(self, input):
    return self.activation_post_process(F.linear(input, self.weight_fake_quant(self.weight), self.bias))

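ReLU is still the same ReLU, so there is nothing to add. As you can see, each op's weight is first passed through self.weight_fake_quant, and its output is then passed through self.activation_post_process. Both are FakeQuantize instances; they just contain different observers. Taking Conv2d as an example: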

#conv2d
weight=functools.partial(<class 'torch.quantization.fake_quantize.FakeQuantize'>, 
           observer=<class 'torch.quantization.observer.MovingAveragePerChannelMinMaxObserver'>, 
           quant_min=-128, quant_max=127, dtype=torch.qint8, 
           qscheme=torch.per_channel_symmetric, reduce_range=False, ch_axis=0))

activation=functools.partial(<class 'torch.quantization.fake_quantize.FakeQuantize'>, 
            observer=<class 'torch.quantization.observer.MovingAverageMinMaxObserver'>, 
            quant_min=0, quant_max=255, reduce_range=True)

The forward function of FakeQuantize looks like this:

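def forward(self, X):
    if self.observer_enabled[0] == 1:
        # Update scale and zero_point using the moving-average observer
        ...

    if self.fake_quant_enabled[0] == 1:
        # Abridged: the per-channel or per-tensor variant is chosen by qscheme
        X = torch.fake_quantize_per_channel_or_tensor_affine(X...)
    return X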

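The fake_quantize_per_channel_or_tensor_affine call in FakeQuantize (shorthand for torch.fake_quantize_per_channel_affine and torch.fake_quantize_per_tensor_affine) performs a quantize followed by a dequantize. As a formula: out = (clamp(round(x/scale + zero_point), quant_min, quant_max)-zero_point)*scale. In other words, it injects the quantization error into the training loss.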
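The quantize-then-dequantize formula can be tried out on scalars with a few lines of plain Python. This is a minimal sketch of the formula only, not PyTorch's kernel; the function name `fake_quantize` and the sample values are chosen for illustration.

```python
def fake_quantize(x, scale, zero_point, quant_min, quant_max):
    """out = (clamp(round(x/scale + zero_point), quant_min, quant_max) - zero_point) * scale"""
    q = round(x / scale + zero_point)        # quantize (Python rounds halves to even)
    q = max(quant_min, min(quant_max, q))    # clamp to the integer range
    return (q - zero_point) * scale          # dequantize back to float


scale, zero_point = 0.1, 0
inside = fake_quantize(0.123, scale, zero_point, 0, 255)    # snaps to the grid point 0.1
clipped = fake_quantize(100.0, scale, zero_point, 0, 255)   # saturates at 255 * 0.1
negative = fake_quantize(-0.4, scale, zero_point, 0, 255)   # below the range, clamps to 0.0
print(inside, clipped, negative)
```

The output stays a float, but it can only take values on the grid defined by scale and zero_point. The difference between input and output (about 0.023 for the first call) is the quantization error that the training loss gets to see.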

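In this way, all weights and activations in QAT are fake quantized as shown above, and they take part in both the forward and backward passes of training. Float values are rounded to (simulated) int8 values, but every computation is still carried out in float. As a result, all weights feel the effect of quantization while they are being optimized. This is what quantization aware training means (it is supported on both CPU and CUDA), and it is why the resulting accuracy is higher.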
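To make the "still computed in float" point concrete, here is a toy QAT-style linear layer in plain Python: the weights go through one fake quantizer, the product is an ordinary float computation, and the output goes through a second fake quantizer. It is a sketch under simplifying assumptions (per-tensor quantization for the weights instead of per-channel, hand-picked scale and zero_point, hypothetical function names), not PyTorch's QATLinear.

```python
def fake_quantize(x, scale, zero_point, quant_min, quant_max):
    q = max(quant_min, min(quant_max, round(x / scale + zero_point)))
    return (q - zero_point) * scale


def qat_linear(x, weight, weight_qparams, output_qparams):
    # 1) Fake-quantize the weights (the role of weight_fake_quant).
    w_fq = [[fake_quantize(w, *weight_qparams) for w in row] for row in weight]
    # 2) Ordinary float matrix-vector product, no bias.
    y = [sum(w * xi for w, xi in zip(row, x)) for row in w_fq]
    # 3) Fake-quantize the output (the role of activation_post_process).
    return [fake_quantize(v, *output_qparams) for v in y]


x = [0.5, -1.0, 2.0]
weight = [[0.263, -0.117, 0.041], [0.5, 0.5, -0.5]]
weight_qparams = (0.01, 0, -128, 127)   # scale, zero_point, quant_min, quant_max
output_qparams = (0.05, 128, 0, 255)
y = qat_linear(x, weight, weight_qparams, output_qparams)
print(y)  # approximately [0.35, -1.25]
```

The first row of weights is snapped to 0.26, -0.12 and 0.04 before the product, and the float result 0.33 is then snapped to 0.35 on the output grid. Every value is a Python float throughout; only the set of values it may take is restricted.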

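5. Convert

This step is the same as in static quantization, so it is not repeated here. Note that in QAT some modules were already replaced by new module types during prepare_qat, which is why the mapping dictionary used for static quantization contains the following entries: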

DEFAULT_STATIC_QUANT_MODULE_MAPPINGS = {
    ......
    # QAT modules:
    nnqat.Linear: nnq.Linear,
    nnqat.Conv2d: nnq.Conv2d,
}

To sum up:

# Original model: all tensors and computations are floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# During training, the fake_quants (fq) are active
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# Inference with the quantized model: weights and inputs are int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8
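The last diagram can be illustrated with a toy integer linear layer in plain Python: inputs and weights are stored as integers, the dot product is accumulated in integers, and the result is rescaled and quantized again for the next layer. This is only a sketch of the idea; real backends use optimized kernels with their own requantization schemes, and the function names, the zero weight zero_point (symmetric weights) and the hand-picked scales are all assumptions.

```python
def quantize(x, scale, zero_point, quant_min, quant_max):
    return max(quant_min, min(quant_max, round(x / scale + zero_point)))


def int8_linear(x_q, x_scale, x_zp, w_q, w_scale, out_scale, out_zp):
    out = []
    for row in w_q:
        # Integer accumulation; the weight zero_point is assumed to be 0.
        acc = sum(w * (xi - x_zp) for w, xi in zip(row, x_q))
        # The accumulator is in units of x_scale * w_scale.
        real = acc * x_scale * w_scale
        # Requantize so the next layer again receives 8-bit integers.
        out.append(quantize(real, out_scale, out_zp, 0, 255))
    return out


x_scale, x_zp = 0.05, 128
x_q = [quantize(v, x_scale, x_zp, 0, 255) for v in [0.5, -1.0, 2.0]]
w_q = [[26, -12, 4], [50, 50, -50]]   # int8 weights with scale 0.01
out_q = int8_linear(x_q, x_scale, x_zp, w_q, 0.01, 0.05, 128)
print(x_q, out_q)  # [138, 108, 168] [135, 103]
```

Dequantizing the outputs with (q - 128) * 0.05 gives about 0.35 and -1.25, the same values that fake quantization would produce in float during training for these weights and inputs.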

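Summary

So how can you use PyTorch's quantization features in your own code more conveniently? One fairly elegant way is to follow the deepvac standard, a project that defines an engineering standard for PyTorch projects: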

https://github.com/DeepVAC/deepvac

With the deepvac standard (which includes a library), turning on a few switches is enough to use the three kinds of quantization described above.

References

[1] https://zhuanlan.zhihu.com/p/53927068
This article is shared from the WeChat official account AI算法與圖像處理 (AI_study).