Scaling and Deployment

  • Model parallelism: tensor parallelism (Megatron-style column/row splitting; see the sketch after this list), pipeline parallelism (GPipe, microbatching), sequence parallelism
  • Data parallelism at inference: replicating models across GPUs
  • Distributed KV-cache: sharding across nodes, communication overhead
  • Speculative decoding: draft model + verification (sketched after this list), Medusa heads, EAGLE, self-speculative decoding
  • Prefix caching: sharing KV-cache across requests with common prefixes (sketched after this list)
  • Inference frameworks: vLLM, TensorRT-LLM, SGLang, llama.cpp, TGI
  • Cost optimisation: spot instances, autoscaling, right-sizing GPU selection
  • Monitoring: token-level logging, latency histograms, degradation detection (sketched after this list)
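
As a concrete illustration of the column/row splitting mentioned in the model-parallelism bullet, here is a minimal NumPy sketch of a two-layer MLP whose first weight is split column-wise and whose second is split row-wise across two simulated "devices". The shapes, shard count, and variable names are illustrative, and the final summation stands in for the all-reduce a real Megatron-style implementation would perform.

```python
# Minimal sketch of Megatron-style tensor parallelism for a 2-layer MLP,
# simulating two "devices" as array shards. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_shards = 8, 32, 2

x = rng.standard_normal((4, d_model))        # a batch of 4 token vectors
W1 = rng.standard_normal((d_model, d_ff))    # first MLP weight
W2 = rng.standard_normal((d_ff, d_model))    # second MLP weight

# Column-parallel first layer: each shard holds a slice of W1's output columns,
# so each shard computes a disjoint slice of the hidden activations.
W1_shards = np.split(W1, n_shards, axis=1)
hidden_shards = [np.maximum(x @ w, 0.0) for w in W1_shards]   # ReLU per shard

# Row-parallel second layer: each shard holds the matching rows of W2 and
# produces a partial output; summing the partials plays the role of all-reduce.
W2_shards = np.split(W2, n_shards, axis=0)
y_parallel = sum(h @ w for h, w in zip(hidden_shards, W2_shards))

# Reference: the same MLP without sharding.
y_reference = np.maximum(x @ W1, 0.0) @ W2
assert np.allclose(y_parallel, y_reference)
print("sharded and unsharded outputs match")
```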
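The draft-model-plus-verification idea behind speculative decoding can be sketched in a few lines. The toy loop below uses the simple greedy-acceptance variant (accept draft tokens until the first disagreement, then take the target's own token), not the full rejection-sampling scheme, and both "models" are stand-in functions over token lists. In a real system the target scores all k draft positions in one batched forward pass, which is where the speedup comes from.

```python
# Toy sketch of greedy speculative decoding with stand-in "models".
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       draft: Callable[[List[int]], int],
                       prompt: List[int],
                       max_new: int,
                       k: int = 4) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            nxt = draft(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        # 2) The target model verifies the proposals. A real system scores all
        #    k positions in one batched forward pass; we loop for clarity.
        accepted = 0
        correction = None
        for i, proposed in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected == proposed:
                accepted += 1
            else:
                correction = expected   # target's token replaces the mismatch
                break
        tokens.extend(proposal[:accepted])
        if correction is not None:
            tokens.append(correction)
    return tokens[len(prompt):len(prompt) + max_new]

# Toy usage: the target counts upward mod 100; the draft agrees except when the
# context length is a multiple of 5, so most proposals are accepted.
target = lambda ctx: (ctx[-1] + 1) % 100
def draft(ctx):
    nxt = (ctx[-1] + 1) % 100
    return nxt if len(ctx) % 5 else (nxt + 7) % 100

print(speculative_decode(target, draft, prompt=[0], max_new=12))
```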
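Prefix caching reduces to a similarly small sketch: a cache keyed by the shared leading tokens (e.g. a common system prompt) so that only the first request pays for prefilling that prefix. The PrefixCache class, the compute_kv stand-in, and the token values below are hypothetical, not any framework's API; production systems such as vLLM cache at KV-block granularity rather than whole prefixes.

```python
# Minimal sketch of prefix caching with a dict keyed by the shared prefix.
from typing import Dict, List, Tuple

PREFILL_CALLS = 0

def compute_kv(prefix: Tuple[int, ...]) -> str:
    """Stand-in for the expensive prefill pass that builds a KV-cache for `prefix`."""
    global PREFILL_CALLS
    PREFILL_CALLS += 1
    return f"kv-cache for {len(prefix)} prefix tokens"

class PrefixCache:
    """Maps a shared token prefix to its already-computed KV-cache."""
    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], str] = {}

    def get_or_build(self, tokens: List[int], prefix_len: int) -> str:
        # Only the shared prefix (e.g. a common system prompt) is cached;
        # each request's unique suffix would still be prefilled normally.
        key = tuple(tokens[:prefix_len])
        if key not in self._store:
            self._store[key] = compute_kv(key)
        return self._store[key]

cache = PrefixCache()
system_prompt = [101, 7, 7, 42]          # hypothetical shared system-prompt tokens
request_a = system_prompt + [1, 2, 3]
request_b = system_prompt + [9, 8]
cache.get_or_build(request_a, len(system_prompt))
cache.get_or_build(request_b, len(system_prompt))
print(PREFILL_CALLS)                     # 1 -- the second request reused the cached prefix
```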
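Finally, a small sketch of the latency-histogram and degradation-detection idea from the monitoring bullet: per-request latency samples are collapsed into p50/p95/p99 summaries and the observed p95 is compared against a baseline. The baseline value, alert threshold, and field names are illustrative assumptions, not part of any particular monitoring stack.

```python
# Sketch of latency percentile summaries and a simple degradation check.
import random
import statistics

def percentile(samples, p):
    data = sorted(samples)
    idx = min(len(data) - 1, int(round(p / 100 * (len(data) - 1))))
    return data[idx]

baseline_p95_ms = 180.0                                        # assumed SLO baseline
latencies_ms = [random.gauss(120, 30) for _ in range(1000)]    # fake per-request latencies

summary = {
    "p50_ms": percentile(latencies_ms, 50),
    "p95_ms": percentile(latencies_ms, 95),
    "p99_ms": percentile(latencies_ms, 99),
    "mean_ms": statistics.fmean(latencies_ms),
}
print(summary)

# Degradation detection: alert when the observed p95 drifts well above baseline.
if summary["p95_ms"] > 1.2 * baseline_p95_ms:
    print("ALERT: p95 latency degraded vs baseline")
```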