Serving and Batching
- LLM serving fundamentals: prefill vs decode phases, time to first token (TTFT) vs decode throughput (tokens per second)
- Continuous batching: dynamic request scheduling, iteration-level batching
- PagedAttention: virtual-memory-style paging of the KV cache, vLLM architecture
- Batching strategies: static batching, dynamic batching, sequence bucketing
- Scheduling: first-come-first-served, shortest-job-first, preemption
- Disaggregated serving: separating prefill and decode stages
- Multi-model serving: model multiplexing, LoRA serving (S-LoRA, Punica)
- Metrics: throughput (tokens/s), latency (p50/p99), SLO compliance, cost per token
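
To make the contrast between static batching and continuous (iteration-level) batching from the list above concrete, here is a minimal simulation sketch. It is not how vLLM or any other real serving engine is implemented. It assumes that prefill cost is ignored, that every decode iteration costs one time unit regardless of batch size, that output lengths are known up front, and that admission is first-come-first-served. The function names `static_batching_steps` and `continuous_batching_steps` are hypothetical and chosen only for illustration.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode iterations needed when each fixed batch runs until its
    longest request finishes (shorter requests leave their slot idle)."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode iterations needed when freed slots are refilled at every
    iteration, admitting waiting requests in FCFS order."""
    waiting = deque(output_lens)  # requests not yet admitted
    running = []                  # remaining tokens per active request
    steps = 0
    while waiting or running:
        # Iteration-level scheduling: fill any free slots before this step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every active request emits one token,
        # and finished requests release their slot immediately.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lens = [2, 8, 3, 8]  # illustrative output lengths, in tokens
    print(static_batching_steps(lens, batch_size=2))      # 16
    print(continuous_batching_steps(lens, batch_size=2))  # 13
```

In this toy workload both schedules generate the same 21 tokens, so the drop from 16 to 13 iterations comes entirely from reusing slots that short requests free early. A real scheduler also has to account for prefill work, KV-cache memory limits, and preemption, none of which this sketch models.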