Efficient Architectures
- StreamingLLM: attention sinks, rolling KV-cache, streaming generation over effectively unbounded lengths (without extending the context window)
- Sparse attention: local attention, sliding window (Mistral), dilated, BigBird, Longformer
- Linear attention and recurrent alternatives: kernel feature-map approximation, RWKV, RetNet, Mamba (selective state-space model)
- Multi-query attention (MQA) and grouped-query attention (GQA): reducing KV-cache size
- Mixture of Experts at inference: expert caching, routing efficiency
- Knowledge distillation: teacher-student, task-specific vs general distillation
- Pruning: unstructured (magnitude), structured (channel/head pruning), lottery ticket hypothesis
- Neural architecture search (NAS) for efficient models
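
As a rough illustration of the MQA/GQA item above, the sketch below estimates how KV-cache size scales with the number of key/value heads. The function name `kv_cache_bytes` and the model shape (32 layers, 32 query heads, head dimension 128, 4096 cached tokens, 2-byte fp16 elements) are illustrative assumptions and do not describe any particular model named in this outline.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    Each layer stores one key tensor and one value tensor (hence the
    factor of 2), each of shape
    [batch_size, n_kv_heads, seq_len, head_dim].
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

# The only thing that changes between the variants is the number of
# key/value heads shared by the query heads.
variants = {
    "MHA (32 KV heads)": N_QUERY_HEADS,  # one KV head per query head
    "GQA (8 KV heads)": 8,               # 4 query heads share each KV head
    "MQA (1 KV head)": 1,                # all query heads share one KV head
}

for name, n_kv_heads in variants.items():
    size = kv_cache_bytes(N_LAYERS, n_kv_heads, HEAD_DIM, SEQ_LEN)
    print(f"{name}: {size / 1024**3:.4f} GiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads: 8 KV heads need a quarter of the MHA cache, and a single KV head needs 1/32 of it. The query projections and attention compute are not reduced by the same factor, which is why MQA and GQA are usually discussed as memory and bandwidth optimizations for decoding.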