Efficient Architectures

StreamingLLM: attention sinks, rolling KV-cache, infinite-length generation
Sparse attention: local attention, sliding window (Mistral), dilated, BigBird, Longformer
Linear attention and linear-time alternatives: kernel approximation, RWKV, RetNet, Mamba (state-space model)
Multi-query attention (MQA) and grouped-query attention (GQA): reducing KV-cache size
Mixture of Experts at inference: expert caching, routing efficiency
Knowledge distillation: teacher-student, task-specific vs general distillation
Pruning: unstructured (magnitude), structured (channel/head pruning), lottery ticket hypothesis
Neural architecture search (NAS) for efficient models
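
As a rough illustration of the MQA/GQA item above, the sketch below estimates how much the KV-cache shrinks when fewer key/value heads are stored. The helper name `kv_cache_bytes` and the model dimensions (32 layers, 32 query heads, head dimension 128, 4096-token context, fp16) are assumptions chosen for the example, not figures from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed configuration for illustration only.
config = dict(num_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=32, **config)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # 4 query heads share each KV head
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumptions the cache scales linearly with the number of KV heads, so GQA with 8 KV heads needs a quarter of the multi-head baseline and MQA needs 1/32 of it.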