PyTorch 1.3 Quantization

PyTorch provides three approaches to quantizing a model.

1. Post training dynamic quantization. This mode targets models whose execution time is dominated by loading weights from memory rather than by the matrix multiplications themselves, which is typically the case for LSTM and Transformer type models run with small batch sizes. It is applied with a single call to torch.quantization.quantize_dynamic().
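
A minimal sketch of how the call might look (the TinyTagger model below is a made-up placeholder, not something from the original post):

import torch
import torch.nn as nn

# Placeholder float model: an LSTM followed by a classifier head.
class TinyTagger(nn.Module):
    def __init__(self):
        super(TinyTagger, self).__init__()
        self.lstm = nn.LSTM(input_size=64, hidden_size=128)
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out)

float_model = TinyTagger().eval()

# Weights of LSTM/Linear layers are quantized to int8 ahead of time;
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)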

2. Post training static quantization. This mode targets models where both memory bandwidth and compute savings matter, CNNs being the typical case.

   Steps:

   1) Add QuantStub and DeQuantStub to mark where activations are quantized and dequantized, make sure modules are not reused, and turn operations that require requantization into standalone modules. For example:

     

# QuantStub and DeQuantStub mark where tensors enter and leave the quantized region;
# self.quant = QuantStub() and self.dequant = DeQuantStub() are created in __init__.
def forward(self, x):
    x = self.quant(x)        # float tensor -> quantized tensor
    x = self.features(x)
    x = x.mean([2, 3])       # global average pooling over H and W
    x = self.classifier(x)
    x = self.dequant(x)      # quantized tensor -> float tensor
    return x

# Modularize: wrap ops that need requantization (here Conv2d + BatchNorm2d + ReLU) in one module so they can later be fused
class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes, momentum=0.1),
            nn.ReLU(inplace=False)
        )

2) Fuse operations like conv + relu or conv + batchnorm + relu together (fuse operations) to improve both accuracy and performance; steps 2-6 are sketched in code after this list;

3) Specify the quantization configuration explicitly;

4) Use torch.quantization.prepare() to insert observer modules that record activation statistics during calibration;

5) Calibrate the model by running inference against a calibration dataset;

6) Convert the model with torch.quantization.convert().
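
A minimal sketch of steps 2-6, assuming a placeholder float CNN (called float_cnn here) built from the ConvBNReLU blocks above, with the QuantStub/DeQuantStub forward() shown earlier; calibration_loader is also a placeholder DataLoader:

import torch

model = float_cnn.eval()   # static quantization is applied to a model in eval mode

# Step 2: fuse Conv2d + BatchNorm2d + ReLU inside every ConvBNReLU block
for m in model.modules():
    if type(m) is ConvBNReLU:
        torch.quantization.fuse_modules(m, ['0', '1', '2'], inplace=True)

# Step 3: choose a quantization configuration for the target backend
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# Step 4: insert observers that record activation ranges during calibration
torch.quantization.prepare(model, inplace=True)

# Step 5: calibrate by running representative data through the model
with torch.no_grad():
    for images, _ in calibration_loader:
        model(images)

# Step 6: swap observed float modules for their int8 implementations
torch.quantization.convert(model, inplace=True)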

 

3. Quantization Aware Training. This approach typically gives the highest post-quantization accuracy. During training the weights and activations are "fake quantized": float values are rounded and clamped to simulate int8, but the computation itself is still carried out in floating point.
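
A minimal sketch of the QAT flow, reusing a fresh copy of the placeholder float_cnn and assuming it has already been fused as in the static sketch above:

import torch

model = float_cnn.train()

# attach a QAT configuration: weights and activations get FakeQuantize modules
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# train or fine-tune as usual; forward passes see simulated int8 rounding/clamping,
# but all arithmetic and gradients stay in FP32
# for epoch in range(num_epochs):
#     train_one_epoch(model, optimizer, data_loader)

# after training, produce the actual int8 model for inference
model.eval()
torch.quantization.convert(model, inplace=True)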

 

The official documentation follows:

PyTorch provides three approaches to quantize models.

  1. Post Training Dynamic Quantization: This is the simplest to apply form of quantization where the weights are quantized ahead of time but the activations are dynamically quantized during inference. This is used for situations where the model execution time is dominated by loading weights from memory rather than computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch size. Applying dynamic quantization to a whole model can be done with a single call to torch.quantization.quantize_dynamic(). See the quantization tutorials

  2. Post Training Static Quantization: This is the most commonly used form of quantization where the weights are quantized ahead of time and the scale factor and bias for the activation tensors are pre-computed based on observing the behavior of the model during a calibration process. Post Training Quantization is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case. The general process for doing post training quantization is:

    1. Prepare the model: a. Specify where the activations are quantized and dequantized explicitly by adding QuantStub and DeQuantStub modules. b. Ensure that modules are not reused. c. Convert any operations that require requantization into modules

    2. Fuse operations like conv + relu or conv+batchnorm + relu together to improve both model accuracy and performance.

    3. Specify the configuration of the quantization methods, such as selecting symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques.

    4. Use the torch.quantization.prepare() to insert modules that will observe activation tensors during calibration

    5. Calibrate the model by running inference against a calibration dataset

    6. Finally, convert the model itself with the torch.quantization.convert() method. This does several things: it quantizes the weights, computes and stores the scale and bias value to be used with each activation tensor, and replaces key operators with quantized implementations.

    See the quantization tutorials

  3. Quantization Aware Training: In the rare cases where post training quantization does not provide adequate accuracy, training can be done with simulated quantization using torch.quantization.FakeQuantize. Computations will take place in FP32 but with values clamped and rounded to simulate the effects of INT8 quantization. The sequence of steps is very similar.

    1. Steps (1) and (2) from post training static quantization are identical.

    2. Specify the configuration of the fake quantization methods, such as selecting symmetric or asymmetric quantization and MinMax or Moving Average or L2Norm calibration techniques.

    3. Use the torch.quantization.prepare_qat() to insert modules that will simulate quantization during training.

    4. Train or fine tune the model.

    5. Identical to step (6) for post training quantization.

    See the quantization tutorials

While default implementations of observers to select the scale factor and bias based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model or configured differently for different parts of the model.
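
For example, a hedged sketch of a custom qconfig applied selectively (the observer choices and the features/classifier submodule names are illustrative, not prescribed by the documentation):

import torch
from torch.quantization import QConfig, MinMaxObserver, MovingAverageMinMaxObserver

# Custom qconfig: moving-average MinMax observer for activations,
# symmetric per-tensor MinMax observer for weights.
custom_qconfig = QConfig(
    activation=MovingAverageMinMaxObserver.with_args(dtype=torch.quint8),
    weight=MinMaxObserver.with_args(dtype=torch.qint8,
                                    qscheme=torch.per_tensor_symmetric),
)

# Apply configurations selectively before calling prepare():
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # default everywhere
model.features.qconfig = custom_qconfig   # override for one submodule
model.classifier.qconfig = None           # None leaves this part unquantized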

We also provide support for per channel quantization for conv2d() and linear()

Quantization workflows work by adding (e.g. adding observers as .observer submodule) or replacing (e.g. converting nn.Conv2d to nn.quantized.Conv2d) submodules in the model’s module hierarchy. It means that the model stays a regular nn.Module-based instance throughout the process and thus can work with the rest of PyTorch APIs.
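
To illustrate, using the quantized_model produced by the dynamic-quantization sketch near the top of this post:

import torch

print(isinstance(quantized_model, torch.nn.Module))   # True: still a regular nn.Module
for name, module in quantized_model.named_modules():
    # quantized submodules have replaced their float counterparts,
    # e.g. nn.Linear -> nn.quantized.dynamic.Linear in the dynamic case
    print(name, type(module))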
