Serving and Batching

  • LLM serving fundamentals: prefill vs decode phases; time to first token (TTFT, a latency metric) vs decode throughput (tokens per second)
  • Continuous batching: dynamic request scheduling, iteration-level batching
  • PagedAttention: virtual memory for KV-cache, vLLM architecture
  • Batching strategies: static batching, dynamic batching, sequence bucketing
  • Scheduling: first-come-first-served, shortest-job-first, preemption
  • Disaggregated serving: separating prefill and decode stages
  • Multi-model serving: model multiplexing, LoRA serving (S-LoRA, Punica)
  • Metrics: throughput (tokens/s), latency (p50/p99), SLO compliance, cost per token
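To make the continuous-batching bullet concrete, here is a minimal sketch of iteration-level scheduling: finished requests leave the batch after every decode step and waiting requests are admitted immediately, rather than waiting for the whole batch to drain as in static batching. This is a toy simulator, not the vLLM scheduler; the `Request` fields and the returned trace format are invented for illustration, and prefill cost is ignored.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int              # request id (illustrative)
    prompt_len: int       # prompt tokens (prefill cost ignored in this toy)
    max_new_tokens: int   # decode budget
    generated: int = 0

def continuous_batching(requests, max_batch_size):
    """Iteration-level scheduler: after every decode step, retire
    finished requests and admit waiting ones, so batch slots never
    sit idle while long requests finish."""
    waiting = deque(requests)
    running = []
    trace = []  # batch composition at each decode step
    while waiting or running:
        # Admit new requests up to the batch-size cap.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token.
        for r in running:
            r.generated += 1
        trace.append([r.rid for r in running])
        # Retire finished requests at iteration granularity.
        running = [r for r in running if r.generated < r.max_new_tokens]
    return trace
```

With three requests and a batch cap of two, the trace shows request 2 joining as soon as request 0 finishes, while request 1 is still decoding — the defining behavior that distinguishes continuous from static batching.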
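The PagedAttention bullet can likewise be sketched in a few lines: the KV-cache is carved into fixed-size physical blocks, and each sequence holds a block table mapping its logical token positions to physical blocks, so memory is allocated on demand instead of over-reserved per sequence. This is a toy allocator in the spirit of the idea, not vLLM's implementation; the class and method names are invented for illustration.

```python
class PagedKVCache:
    """Toy block allocator: fixed-size KV blocks plus a per-sequence
    block table, so a sequence only consumes blocks proportional to
    its actual length (no contiguous over-allocation)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> tokens currently cached

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; a fresh block is
        grabbed only when the sequence's last block is full."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # last block full (or first token)
            if not self.free_blocks:
                # A real server would preempt or swap a sequence here.
                raise MemoryError("KV-cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[-1]  # physical block that holds this token's KV

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are small and non-contiguous, internal fragmentation is bounded by one partially filled block per sequence, which is what lets a paged server pack far more concurrent sequences into the same GPU memory.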
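For the metrics bullet, tail-latency percentiles and SLO compliance are simple to compute but easy to get subtly wrong; a minimal sketch using the nearest-rank definition of a percentile (one common convention among several — function names here are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at
    least p percent of samples are <= it."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def slo_compliance(samples, target):
    """Fraction of requests meeting a latency target, e.g. for an
    SLO like '99% of requests under 500 ms'."""
    return sum(x <= target for x in samples) / len(samples)
```

Note that p50 (median) tracks typical user experience while p99 is dominated by the worst-queued requests, which is why scheduling policy changes often move p99 far more than p50.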