Serving and Batching

LLM serving fundamentals: prefill vs decode phases, time to first token (TTFT) vs tokens per second
Continuous batching: dynamic request scheduling, iteration-level batching
PagedAttention: virtual memory for KV-cache, vLLM architecture
Batching strategies: static batching, dynamic batching, sequence bucketing
Scheduling: first-come-first-served, shortest-job-first, preemption
Disaggregated serving: separating prefill and decode stages
Multi-model serving: model multiplexing, LoRA serving (S-LoRA, Punica)
Metrics: throughput (tokens/s), latency (p50/p99), SLO compliance, cost per token
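
To make the contrast between static batching and continuous (iteration-level) batching concrete, the following is a minimal toy simulation, not how any particular serving engine implements scheduling. It counts decode iterations only and rests on simplifying assumptions: prefill cost is ignored, every iteration costs the same regardless of how many slots are occupied, admission is first-come-first-served, and each request's output length is a known positive integer. The function names and the `lengths`/`batch_size` parameters are hypothetical, chosen for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode iterations when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests in the batch idle until the longest one is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode iterations when slots are refilled after every iteration."""
    waiting = deque(lengths)   # FCFS queue of remaining output lengths
    running = []               # tokens still to generate per active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots (assumed FCFS policy).
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: each active request emits one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 7]  # hypothetical output lengths in tokens
    print(static_batching_steps(lengths, batch_size=2))      # 15
    print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

In this toy workload the static schedule spends 15 iterations because short requests wait for the longest member of their batch, while the continuous schedule needs 12 because a freed slot is reused on the next iteration. Real systems also have to account for prefill, KV-cache memory limits, and preemption, which this sketch leaves out.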