Quantisation
- Why quantise: memory reduction, throughput gains, energy savings
- Number formats: FP32, FP16, BF16, FP8 (E4M3, E5M2), INT8, INT4, binary/ternary
- Post-training quantisation (PTQ): calibration, min-max, percentile, MSE-optimal scaling
- Quantisation-aware training (QAT): fake quantisation, straight-through estimator
- Weight-only quantisation: GPTQ, AWQ, QuIP, SqueezeLLM
- Activation quantisation: dynamic vs static, per-tensor vs per-channel vs per-token
- Mixed-precision: choosing precision per layer, sensitivity analysis
- KV-cache quantisation: reducing memory for long sequences
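
The min-max calibration and the per-tensor vs per-channel choice listed above can be illustrated with a minimal sketch in plain Python. It assumes symmetric INT8 quantisation with an absmax scale (the largest magnitude maps to 127). The function names and the toy weight matrix are illustrative only and are not taken from any library.

```python
from typing import List

Matrix = List[List[float]]


def absmax_scale(values: List[float], num_bits: int = 8) -> float:
    """Symmetric min-max calibration: the largest magnitude maps to qmax."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantise(values: List[float], scale: float, num_bits: int = 8) -> List[int]:
    """Round to the nearest integer level and clamp to [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantise(levels: List[int], scale: float) -> List[float]:
    """Map integer levels back to approximate real values."""
    return [q * scale for q in levels]


def roundtrip_per_tensor(matrix: Matrix) -> Matrix:
    """One scale shared by every element of the matrix."""
    scale = absmax_scale([v for row in matrix for v in row])
    return [dequantise(quantise(row, scale), scale) for row in matrix]


def roundtrip_per_channel(matrix: Matrix) -> Matrix:
    """One scale per row (treating each row as an output channel)."""
    out = []
    for row in matrix:
        scale = absmax_scale(row)
        out.append(dequantise(quantise(row, scale), scale))
    return out


def mse(a: Matrix, b: Matrix) -> float:
    """Mean squared error between two matrices of the same shape."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy weights (assumed): the two rows differ in magnitude by 100x.
    weights = [[0.01, -0.02, 0.015], [1.0, -2.0, 1.5]]
    print("per-tensor MSE: ", mse(weights, roundtrip_per_tensor(weights)))
    print("per-channel MSE:", mse(weights, roundtrip_per_channel(weights)))
```

When rows differ widely in magnitude, a single per-tensor scale is dominated by the largest row, so the small row collapses onto very few integer levels. Per-channel scales avoid this at the cost of storing one scale per row.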
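
The memory motivation behind KV-cache quantisation can be made concrete with a little arithmetic. The sketch below assumes a standard transformer decoder that stores one key and one value vector per layer, per KV head, per token. The helper name and the model dimensions are hypothetical, and the estimate ignores the extra storage for quantisation scales and zero-points.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bits_per_element: int,
) -> int:
    """Approximate KV-cache size in bytes (scales/zero-points not included)."""
    # Factor 2 accounts for storing both keys and values.
    n_elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return n_elements * bits_per_element // 8


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=32, head_dim=128,
                 seq_len=4096, batch_size=1)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        gib = kv_cache_bytes(bits_per_element=bits, **shape) / 1024 ** 3
        print(f"{name}: {gib:.2f} GiB")
```

Because the cache grows linearly with sequence length and batch size, halving the bits per element halves the footprint at every context length.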