Scaling and Deployment
- Model parallelism: tensor parallelism (Megatron-style column/row splitting), pipeline parallelism (GPipe, microbatching), sequence parallelism
- Data parallelism at inference: replicating the full model across GPUs and balancing requests between replicas
- Distributed KV-cache: sharding across nodes, communication overhead
- Speculative decoding: draft model + verification, Medusa heads, EAGLE, self-speculative decoding
- Prefix caching: sharing KV-cache across requests with common prefixes
- Inference frameworks: vLLM, TensorRT-LLM, SGLang, llama.cpp, TGI
- Cost optimisation: spot instances, autoscaling, right-sizing GPU selection
- Monitoring: token-level logging, latency histograms, detection of quality and latency degradation
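
To make the "draft model + verification" item concrete, here is a minimal sketch of one speculative decoding loop. It assumes greedy decoding, where a drafted token is accepted only if it matches the target model's own choice; sampling-based variants use a probabilistic acceptance rule instead. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical and do not come from any of the frameworks listed above. Each "model" is simply a function from a token prefix to the next token.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a model: maps a token prefix to the single
# next token it would pick under greedy decoding.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. Draft phase: the cheap model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. Verify phase: compare each proposal with the target's greedy choice.
    #    A real system would score all k positions in one batched forward
    #    pass; this sketch calls the target once per position for clarity.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. All k drafts accepted: the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: Sequence[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Generate tokens by repeating speculative rounds until the budget is met."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    # A round can overshoot the budget, so trim to the requested length.
    return out[:len(prompt) + max_new_tokens]


# Toy models (assumptions for illustration only): the target counts upward
# modulo 10; the draft agrees except right after token 4, where it is wrong.
def toy_target(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: Sequence[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], max_new_tokens=12, k=3))
```

Under this greedy acceptance rule the output is identical to what the target model would produce alone, because every emitted token is either confirmed or supplied by the target. The draft only changes how many target evaluations can be batched per round, which is where the speed-up comes from when the draft is usually right.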