
Quantisation

  • Why quantise: memory reduction, throughput gains, energy savings
  • Number formats: FP32, FP16, BF16, FP8 (E4M3, E5M2), INT8, INT4, binary/ternary
  • Post-training quantisation (PTQ): calibration, min-max, percentile, MSE-optimal scaling (see the calibration sketch after this list)
  • Quantisation-aware training (QAT): fake quantisation, straight-through estimator (see the fake-quantisation sketch after this list)
  • Weight-only quantisation: GPTQ, AWQ, QuIP, SqueezeLLM
  • Activation quantisation: dynamic vs static, per-tensor vs per-channel vs per-token
  • Mixed-precision: choosing precision per layer, sensitivity analysis
  • KV-cache quantisation: reducing memory for long sequences (see the per-token sketch after this list)
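
A minimal PTQ calibration sketch, assuming PyTorch. It derives an asymmetric INT8 scale and zero-point from sample activations using either min-max or percentile clipping, then quantises and dequantises. The function names (calibrate, quantise, dequantise) and the calibration data are illustrative, not from any particular library.

```python
import torch

def calibrate(x: torch.Tensor, mode: str = "minmax", pct: float = 99.9):
    """Return (scale, zero_point) for asymmetric INT8 quantisation."""
    if mode == "minmax":
        lo, hi = x.min(), x.max()
    elif mode == "percentile":
        # Clip outliers: keep only the central pct% of observed values.
        lo = torch.quantile(x.float(), (100 - pct) / 100)
        hi = torch.quantile(x.float(), pct / 100)
    else:
        raise ValueError(mode)
    scale = (hi - lo).clamp(min=1e-8) / 255.0          # 256 levels for uint8
    zero_point = torch.round(-lo / scale).clamp(0, 255)
    return scale, zero_point

def quantise(x, scale, zero_point):
    return torch.clamp(torch.round(x / scale) + zero_point, 0, 255).to(torch.uint8)

def dequantise(q, scale, zero_point):
    return (q.to(torch.float32) - zero_point) * scale

# A random tensor stands in for a few calibration batches of real activations.
acts = torch.randn(4096) * 3.0
scale, zp = calibrate(acts, mode="percentile")
err = (dequantise(quantise(acts, scale, zp), scale, zp) - acts).abs().mean()
```

Percentile calibration trades a little clipping error on outliers for a finer scale on the bulk of the distribution, which is usually the better trade for activations.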
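A minimal fake-quantisation sketch with a straight-through estimator, again assuming PyTorch. The forward pass sees rounded weights, while the `x + (q - x).detach()` trick routes gradients around the non-differentiable round(). The layer and scale here are illustrative placeholders, not a production QAT recipe.

```python
import torch
import torch.nn as nn

def fake_quant(x: torch.Tensor, scale: float, qmin: int = -128, qmax: int = 127):
    q = torch.clamp(torch.round(x / scale), qmin, qmax) * scale
    # Straight-through estimator: forward uses q, backward sees the identity.
    return x + (q - x).detach()

class QATLinear(nn.Module):
    """Linear layer whose weights are fake-quantised during training."""
    def __init__(self, in_features: int, out_features: int, w_scale: float = 0.01):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.w_scale = w_scale

    def forward(self, x):
        return x @ fake_quant(self.weight, self.w_scale).t()

layer = QATLinear(16, 8)
loss = layer(torch.randn(4, 16)).pow(2).mean()
loss.backward()   # gradients reach layer.weight despite the round()
```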
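A sketch of per-token INT8 KV-cache quantisation, assuming PyTorch. Each cached key/value row keeps its own scale, so a long sequence occupies roughly a quarter of the FP16 cache memory. Shapes and function names are illustrative.

```python
import torch

def quantise_kv(kv: torch.Tensor):
    """kv: [seq_len, num_heads, head_dim] -> int8 codes plus per-token scales."""
    flat = kv.reshape(kv.shape[0], -1)                        # one row per token
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(flat / scale), -127, 127).to(torch.int8)
    return q.reshape(kv.shape), scale

def dequantise_kv(q: torch.Tensor, scale: torch.Tensor):
    return (q.reshape(q.shape[0], -1).to(torch.float32) * scale).reshape(q.shape)

k = torch.randn(1024, 8, 64, dtype=torch.float16)             # cached keys
qk, k_scale = quantise_kv(k.float())
k_approx = dequantise_kv(qk, k_scale)
```

Per-token scaling matters here because activation magnitudes vary sharply across positions; a single per-tensor scale would waste most of the INT8 range on a few outlier tokens.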