Integer Quantization for Deep Learning Inference: Principle and Empirical Evaluation

This post is a digest of the survey paper on quantization and calibration, Integer Quantization for Deep Learning Inference: Principle and Empirical Evaluation.

1 – Intro

In deep learning today, 32-bit floating point is the dominant numeric representation.

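Q: What is the difference between floating-point and fixed-point numbers?
A: A fixed-point number reserves a fixed number of bits for the integer part and for the fractional part; a floating-point number does not fix that split, because its exponent moves the radix point.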

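In practice, model inference does not need that much precision, so representations with fewer bits are used. The main reasons:

Higher throughput
Lower memory bandwidth requirements
A smaller memory footprint, so caches can hold more data and locality improves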

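This paper introduces the basic mathematics behind quantization and calibration, techniques for calibrating and training quantized models, and the empirical results on real datasets.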

Quantization falls into two broad categories:

Post Training Quantization (PTQ)
Quantization Aware Training (QAT)

This paper focuses on quantization as a way to speed up computation, and therefore uses a uniform quantization scheme.

3 – Quantization Fundamentals

UNIFORM QUANTIZATION SCHEME

Uniform quantization involves two steps:

Choose the range of real numbers to be quantized, and clamp values that fall outside that range
Map the real values onto the integers representable by the chosen bit-width

QUANTIZE / DEQUANTIZE

Quantize: real number to integer representation (fp32 to int8)
Dequantize: integer representation to real number (int32 to fp16)

3.1 Range mapping

Range of real values:
[β,α]
Bit-width b of the integer representation:
signed integer: [-2^{b-1}, 2^{b-1} - 1]
unsigned integer: [0, 2^b - 1]
Uniform transformation:
affine: 𝑓(𝑥)=𝑠⋅𝑥+𝑧
scale: 𝑓(𝑥)=𝑠⋅𝑥

3.1.1 AFFINE QUANTIZATION

Quantize

𝑓(𝑥) =𝑠⋅𝑥+𝑧, 𝑠,𝑥,𝑧∈𝑅

s = \frac{2^b - 1}{\alpha - \beta}

z = -round(\beta \cdot s) - 2^{b-1}

Think of it this way: given two things, (1) the real value range and (2) the representable integer range, the values are projected "linearly" from the first onto the second.
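The affine mapping above can be sketched in a few lines of plain Python. This is a minimal sketch, not the paper's reference code: the function names `affine_params` and `affine_quantize` are made up for illustration, the target is assumed to be a signed integer, and the rounding and clamping follow the two-step description of uniform quantization earlier in this post. Note that Python's built-in `round` rounds halves to the nearest even integer.

```python
def affine_params(beta, alpha, bits=8):
    """Scale s and zero-point z that map [beta, alpha] onto signed `bits`-bit integers."""
    s = (2 ** bits - 1) / (alpha - beta)
    z = -round(beta * s) - 2 ** (bits - 1)
    return s, z


def affine_quantize(x, s, z, bits=8):
    """Round s*x + z to an integer, then clamp it to the representable range."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = round(s * x + z)
    return max(lo, min(hi, q))


# Real value range [-1, 3] mapped onto int8.
s, z = affine_params(beta=-1.0, alpha=3.0)

# 10.0 lies outside the range, so it is clamped to the largest int8 code.
codes = [affine_quantize(x, s, z) for x in (-1.0, 0.0, 0.5, 3.0, 10.0)]
print(s, z, codes)
```

The two ends of the real range land on the two ends of the int8 range, and the real value 0.0 lands exactly on the zero-point `z`.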

Dequantize

Dequantize is the inverse function of the original 𝑓(𝑥):

f'(x) = \frac{1}{s}(x - z)
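A worked round trip may make the formulas concrete. The numbers below are chosen only for illustration (b = 8, real range [-1, 3], input x = 0.5); the rounding step is the one implied by mapping onto integers.

```latex
\begin{align*}
s &= \frac{2^8 - 1}{3 - (-1)} = \frac{255}{4} = 63.75 \\
z &= -\mathrm{round}(-1 \cdot 63.75) - 2^{7} = 64 - 128 = -64 \\
\mathrm{round}(s \cdot 0.5 + z) &= \mathrm{round}(-32.125) = -32 \\
f'(-32) &= \frac{1}{63.75}\,\bigl(-32 - (-64)\bigr) = \frac{32}{63.75} \approx 0.50196
\end{align*}
```

The recovered value differs from 0.5 by about 0.002, which is within half a quantization step, 1/(2s) ≈ 0.0078.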

3.1.2 SCALE QUANTIZATION

Quantize

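Scale quantization is a special case of affine quantization: it is anchored at 0 (z = 0) and uses the symmetric real value range [−α,α]. The integer range is restricted to be symmetric as well, for example [-127, 127] for int8.

f(x) = round(s \cdot x)

s = \frac{2^{b-1} - 1}{\alpha}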

Dequantize

This one is simple: multiply by the reciprocal of s.

f'(x) = \frac{1}{s}x
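Here is a small sketch of a scale quantization round trip in plain Python. The helper names are hypothetical, and the sketch assumes the symmetric integer range, i.e. a scale of (2^{b-1} - 1)/α so that int8 uses [-127, 127] and leaves -128 unused.

```python
def scale_quantize(x, alpha, bits=8):
    """Symmetric quantization of x over [-alpha, alpha] with zero-point 0."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    s = qmax / alpha
    q = round(s * x)
    return max(-qmax, min(qmax, q)), s  # clamp outliers, return the scale too


def scale_dequantize(q, s):
    """Map an integer code back to a real value."""
    return q / s


q, s = scale_quantize(0.5, alpha=2.0)
x_hat = scale_dequantize(q, s)
print(q, x_hat)
```

Because z = 0, the real value 0.0 always maps to the integer 0, and the only parameter to store per tensor (or per channel) is the scale.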

3.2 TENSOR QUANTIZATION GRANULARITY

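Granularity is another key aspect of quantization.

The coarsest choice is for every element of a tensor to share the same quantization parameters (per tensor). The finest is for every element to have its own quantization parameters.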

Common choices in between:

per column / per row (for 2-D tensors, i.e. matrices)
per channel (for 3-D tensors, like images)
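To see why granularity matters, the sketch below compares one shared scale against one scale per row on a tiny matrix. It is an illustration only: the weights are made-up numbers, the helper names are hypothetical, scale quantization with the maximum absolute value as the range is assumed, and all-zero rows (which would divide by zero) are not handled.

```python
def max_abs(values):
    """Largest magnitude in an iterable of numbers."""
    return max(abs(v) for v in values)


def per_tensor_scale(matrix, bits=8):
    """One scale shared by every element of the matrix."""
    qmax = 2 ** (bits - 1) - 1
    return qmax / max_abs(v for row in matrix for v in row)


def per_row_scales(matrix, bits=8):
    """One scale per row, each derived from that row's own range."""
    qmax = 2 ** (bits - 1) - 1
    return [qmax / max_abs(row) for row in matrix]


weights = [
    [0.02, -0.01, 0.03],   # small-magnitude row
    [1.50, -2.00, 0.75],   # large-magnitude row
]

shared = per_tensor_scale(weights)
rows = per_row_scales(weights)

# Integer codes of the small row under each granularity.
small_row_codes_shared = [round(shared * v) for v in weights[0]]
small_row_codes_own = [round(rows[0] * v) for v in weights[0]]
print(small_row_codes_shared, small_row_codes_own)
```

With the shared scale the small row collapses onto a handful of codes near zero, while its own scale spreads it across the full int8 range. The price is more parameters to store and apply, which is the accuracy versus computation cost trade-off.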

Factors to weigh when judging a quantization choice:

accuracy
computation cost

3.3 COMPUTATIONAL COST OF AFFINE QUANTIZATION

MathJax in WordPress is broken, so I am skipping this section. Please check out the original paper for the equations.

3.4 CALIBRATION

Calibration is the process of choosing the real value range.

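Three common calibration methods:

Max: use the maximum absolute value observed. The most straightforward approach.
Entropy: use KL divergence to pick the range that loses as little information as possible. In theory this is the most principled approach.
Percentile: use a percentile of the distribution, for example keeping only the main 99.99% of the elements as the value range. This prevents extreme values far from the main distribution from stretching the range.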
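The max and percentile methods are simple enough to sketch directly. The code below is a rough illustration in plain Python with hypothetical function names; it calibrates on absolute values and uses a nearest-rank percentile, which is one of several possible percentile definitions. Entropy calibration is left out because it needs a histogram search that does not fit in a few lines.

```python
import math


def max_calibrate(values):
    """Range limit = largest magnitude seen."""
    return max(abs(v) for v in values)


def percentile_calibrate(values, percentile=99.99):
    """Range limit = nearest-rank percentile of the magnitudes."""
    mags = sorted(abs(v) for v in values)
    rank = math.ceil(percentile / 100.0 * len(mags))
    return mags[max(rank, 1) - 1]


# 101 well-behaved activations in [-5, 5] plus one extreme outlier.
activations = [i / 10 for i in range(-50, 51)] + [1000.0]

alpha_max = max_calibrate(activations)
# With only 102 samples, 99.99% still includes the outlier, so 99% is used here.
alpha_p99 = percentile_calibrate(activations, 99.0)
print(alpha_max, alpha_p99)
```

Max calibration lets the single outlier set the range to 1000, so nearly all int8 codes are spent on values that never occur. The percentile range stays at 5 and clamps the outlier instead.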

4 – Post training quantization

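As the previous section shows, the weights of a trained model can be calibrated directly, and feeding some input data through the model gives the calibration results for the activations.

Models come in many kinds: feed-forward CNNs, RNNs, and attention-based NNs.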

4.1 WEIGHT QUANTIZATION

The paper lists experimental results.

Calibration method: max.

Per channel is better than per tensor.
BN folding does not hurt per channel calibration
BN folding does hurt per tensor calibration
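For reference, this is the standard batch normalization folding for output channel c of a convolution or linear layer. The symbols are my own notation, not taken from the post: w_c and b_c are the layer's weights and bias, and γ_c, β^{BN}_c, μ_c, σ_c² are the BN scale, shift, running mean, and running variance.

```latex
\begin{align*}
w'_c &= \frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}}\; w_c \\
b'_c &= \frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}}\,(b_c - \mu_c) + \beta^{\mathrm{BN}}_c
\end{align*}
```

One plausible reading of the results above: each channel is multiplied by its own factor, so after folding the weight ranges can differ a lot from channel to channel. A per channel scale absorbs that difference, while a single per tensor range has to cover the widest channel and leaves the narrow ones with few integer codes.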

4.2 ACTIVATION QUANTIZATION

The paper lists experimental results.

Entropy is the best in most cases
Max is never a good choice
Percentile occasionally beats entropy
MobileNet, EfficientNet, and BERT lose more than 1% accuracy

Takeaway: no single calibration is best for all networks.

5 – Accuracy recovery

When calibration really does cause accuracy loss, there are methods to recover the accuracy.

5.1 PARTIAL QUANTIZATION

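Often only a few layers are responsible for the accuracy loss. One workaround is to leave those layers unquantized and run them in higher precision instead.

Testing every combination of quantized and unquantized layers grows exponentially with the number of layers. Instead, measure the accuracy with a single layer quantized at a time, sort the layers by that accuracy, and remove quantization layer by layer, most sensitive first, until the target accuracy is reached. (A fairly straightforward greedy heuristic.)

Working out which layers affect the accuracy is called sensitivity analysis.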
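The greedy heuristic can be sketched as follows. Everything here is hypothetical scaffolding: `evaluate` stands in for whatever routine measures the accuracy of the model with a given list of layers left unquantized, and the toy model simply charges each quantized layer a fixed accuracy penalty.

```python
def choose_unquantized_layers(sensitivity, evaluate, target_accuracy):
    """Un-quantize the most sensitive layers first until the target is met.

    sensitivity: dict of layer name -> accuracy with only that layer quantized
    evaluate:    callable(list of unquantized layers) -> accuracy
    """
    # The lower the single-layer accuracy, the more sensitive the layer.
    order = sorted(sensitivity, key=sensitivity.get)
    skipped = []
    accuracy = evaluate(skipped)
    for layer in order:
        if accuracy >= target_accuracy:
            break
        skipped.append(layer)
        accuracy = evaluate(skipped)
    return skipped, accuracy


# Toy stand-in for a real model: made-up accuracy penalties per quantized layer.
BASELINE = 76.0
PENALTY = {"conv1": 0.5, "conv2": 0.1, "fc": 2.0}


def toy_evaluate(skipped):
    return BASELINE - sum(p for name, p in PENALTY.items() if name not in skipped)


sensitivity = {name: BASELINE - p for name, p in PENALTY.items()}
skipped, accuracy = choose_unquantized_layers(sensitivity, toy_evaluate, 75.0)
print(skipped, accuracy)
```

The sensitivity analysis costs one evaluation per layer, and the greedy loop costs at most one more per layer, instead of one per subset of layers.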

5.2 QUANTIZATION AWARE TRAINING (QAT)

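Insert quantization operations into the model before training.

The intuition is that training with quantization in the loop steers gradient descent toward solutions that survive quantization. The model becomes aware of its integer-ness and tends to find “wide and flat” minima, where the small perturbations introduced by quantization change the loss very little.

The common approach is fake quantization, also called simulated quantization.

fake quantize: \hat{x} = dequantize(quantize(x, b, s), b, s)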
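A minimal sketch of fake quantization, assuming scale quantization over [-alpha, alpha] with the symmetric int8 range. The function name and signature are made up for illustration and do not correspond to any framework's API.

```python
def fake_quantize(x, alpha, bits=8):
    """Quantize then immediately dequantize, so the result stays a float."""
    qmax = 2 ** (bits - 1) - 1
    s = qmax / alpha
    q = max(-qmax, min(qmax, round(s * x)))   # quantize (round and clamp)
    return q / s                              # dequantize


print(fake_quantize(0.3, alpha=1.0))   # snapped to the nearest representable value
print(fake_quantize(5.0, alpha=1.0))   # clamped to alpha
```

The output is still a floating-point number, but it can only take values that an int8 code can represent, so the rest of the network trains against the same rounding and clamping error it will see at inference time.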

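Where the operation is not differentiable, use the Straight Through Estimator (STE).

STE: approximate the derivative of the fake quantization as

1 if the input is within the real value range
0 otherwise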
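In a backward pass this rule amounts to passing the upstream gradient through unchanged inside the range and blocking it outside. A tiny sketch, with a hypothetical function name and a scalar input for simplicity:

```python
def ste_backward(x, upstream_grad, beta, alpha):
    """Gradient of the loss w.r.t. x through fake quantization, under STE."""
    # Local derivative is treated as 1 inside [beta, alpha] and 0 outside.
    return upstream_grad if beta <= x <= alpha else 0.0


grads = [ste_backward(x, 2.5, beta=-1.0, alpha=1.0) for x in (-3.0, -0.2, 0.9, 1.7)]
print(grads)
```

Values that were clamped receive no gradient, while values inside the range are updated as if the rounding were not there.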

Sometimes QAT even yields better model accuracy, because quantization also acts as a regularizer.

5.3 LEARNING QUANTIZATION PARAMETERS

It is also possible to learn quantization parameters along with the model weight.

PACT learns the range of activation for activation quantization during training.

Initialized with max calibration (I read this as: the max calibration parameters are the starting point?)

learning the range (real value range) results in better accuracy than keeping it fixed

Initialized with best calibration range

learning the range yields results similar to keeping it fixed

Takeaway: learning the range (QAT) doesn’t offer additional benefit if given a carefully calibrated range.

That said, the paper also notes that PACT may be more useful in other settings. Here PACT is only used to show what QAT is doing.

6 – Workflow

Pretrained Network -> PTQ -> Partial Quantization -> QAT (start from best calibration)

Self reflection

I finally understand quantization / calibration better!!!

Author: eopXD

Hi, I'm eopXD. I hope to live a life that is fun and full of challenges.
