Efficient Architectures

  • StreamingLLM: attention sinks, rolling KV-cache, infinite-length generation
  • Sparse attention: local attention, sliding window (Mistral), dilated, BigBird, Longformer
  • Linear attention and recurrent alternatives: kernel approximation, RWKV, RetNet, Mamba (state-space models)
  • Multi-query attention (MQA) and grouped-query attention (GQA): reducing KV-cache size
  • Mixture of Experts at inference: expert caching, routing efficiency
  • Knowledge distillation: teacher-student, task-specific vs general distillation
  • Pruning: unstructured (magnitude), structured (channel/head pruning), lottery ticket hypothesis
  • Neural architecture search (NAS) for efficient models
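The StreamingLLM idea in the first bullet can be sketched as a cache policy: a few initial "attention sink" tokens are kept forever, and everything else lives in a fixed-size rolling window. A minimal sketch, tracking only token positions (the class name and parameters here are illustrative, not from the paper's code):

```python
from collections import deque

class RollingKVCache:
    """StreamingLLM-style cache sketch: keep the first `n_sink` tokens
    (attention sinks) forever, plus a rolling window of recent tokens."""

    def __init__(self, n_sink: int, window: int):
        self.n_sink = n_sink
        self.sinks: list[int] = []
        self.recent: deque[int] = deque(maxlen=window)

    def append(self, pos: int) -> None:
        if len(self.sinks) < self.n_sink:
            self.sinks.append(pos)   # sink slots fill once, never evicted
        else:
            self.recent.append(pos)  # oldest non-sink token is evicted

    def visible(self) -> list[int]:
        """Positions the next token may attend to."""
        return self.sinks + list(self.recent)

cache = RollingKVCache(n_sink=4, window=6)
for t in range(100):
    cache.append(t)
# After 100 tokens, attention covers the 4 sinks plus the last 6 tokens:
# cache.visible() == [0, 1, 2, 3, 94, 95, 96, 97, 98, 99]
```

Because the cache never grows past `n_sink + window` entries, generation length is unbounded while memory stays constant.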
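Sliding-window (local) attention, as in the sparse-attention bullet, restricts each query to a band of recent keys. A minimal mask sketch with an assumed window size (real implementations fuse the band into the attention kernel rather than materializing a boolean matrix):

```python
# Sketch: causal sliding-window attention mask.
# window = 3 means each token attends to itself and the 2 previous tokens.

def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True where query i may attend to key j."""
    return [
        [max(0, i - window + 1) <= j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=5, window=3)
# Row 4 attends only to positions 2, 3, 4:
# mask[4] == [False, False, True, True, True]
```

Stacking L such layers gives an effective receptive field of roughly L * window tokens, which is how models like Mistral reach long contexts with O(n * window) attention cost.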
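The kernel trick behind linear attention is pure associativity: replace softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), so the (n, n) attention matrix is never formed and cost drops from O(n²d) to O(nd²). A sketch using the elu(x)+1 feature map (one common choice); both orderings compute identical values:

```python
import numpy as np

def phi(x):
    # Positive feature map: elu(x) + 1 (one common choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """O(n*d^2): associate (phi(K)^T V) first."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                     # (d, d_v) summary, independent of n
    z = Qf @ Kf.sum(axis=0) + eps     # per-query normalizer
    return (Qf @ kv) / z[:, None]

def quadratic_reference(Q, K, V, eps=1e-6):
    """O(n^2*d): materialize the full attention matrix (same math, reordered)."""
    A = phi(Q) @ phi(K).T             # (n, n) unnormalized weights
    return (A @ V) / (A.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 6, 4))  # n=6 tokens, d=4
assert np.allclose(linear_attention(Q, K, V), quadratic_reference(Q, K, V))
```

The (d, d_v) summary `kv` can also be updated token by token, which is what gives RWKV- and RetNet-style models their recurrent, constant-memory inference mode.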
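The KV-cache savings from MQA and GQA are simple arithmetic: the cache stores K and V for every layer, KV head, and position, so shrinking the number of KV heads shrinks the cache proportionally. A sketch with assumed 70B-class numbers (80 layers, 64 query heads, head_dim 128, fp16), which are illustrative, not a specific model's config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed config: 80 layers, head_dim 128, 4096-token context, fp16 cache.
mha = kv_cache_bytes(80, 64, 128, 4096)  # full multi-head: 64 KV heads
gqa = kv_cache_bytes(80, 8, 128, 4096)   # grouped-query: 8 KV heads
mqa = kv_cache_bytes(80, 1, 128, 4096)   # multi-query: 1 shared KV head

print(f"MHA {mha / 2**30:.2f} GiB, GQA {gqa / 2**30:.2f} GiB, "
      f"MQA {mqa / 2**30:.2f} GiB")
# MHA 10.00 GiB, GQA 1.25 GiB, MQA 0.16 GiB
```

GQA with 8 KV heads cuts the cache 8x versus full MHA while keeping 64 query heads, which is why it has become the common middle ground between MHA quality and MQA memory.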
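Routing efficiency in inference-time MoE comes from running only the top-k experts per token. A minimal top-k gating sketch (the function name and the 8-expert example are illustrative):

```python
import math

def top_k_route(gate_logits: list[float], k: int = 2):
    """Top-k MoE routing: pick k experts, renormalize their gate weights.
    Only the chosen experts run, so per-token compute stays sparse."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(gate_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# 8 experts, token routed to the top 2:
routes = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
# Experts 1 and 4 are selected; their weights sum to 1.
```

At inference, expert caching then tries to keep the hottest experts resident (in GPU memory or a fast tier) because routing distributions are typically far from uniform.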
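The core of teacher-student distillation is a loss on temperature-softened distributions. A sketch of the soft-target term, KL(teacher || student) scaled by T² (the T² factor keeps gradient magnitudes comparable to the hard-label loss; the function names here are illustrative):

```python
import math

def softmax(logits: list[float], T: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; higher T flattens the distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((l - m) / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_logits, student_logits, T: float = 2.0) -> float:
    """Soft-target loss: T^2 * KL(teacher || student) at temperature T."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Zero when the student matches the teacher, positive otherwise:
assert distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]) == 0.0
assert distillation_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0]) > 0.0
```

In practice this term is mixed with the ordinary cross-entropy on ground-truth labels; general distillation applies it on broad pretraining data, task-specific distillation on the target task's data.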
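Unstructured magnitude pruning, the simplest technique in the pruning bullet, just zeroes the smallest-magnitude weights. A sketch on a flat weight list (real pipelines prune tensors layer-wise or globally, then usually fine-tune):

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the smallest-|w| fraction of weights.
    Ties at the threshold are also pruned in this simple sketch."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned = magnitude_prune(w, sparsity=0.5)
# The three smallest magnitudes (0.05, 0.01, 0.2) are zeroed:
# pruned == [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

Structured pruning instead removes whole channels or attention heads so the resulting model is dense and smaller, trading some accuracy for real speedups on standard hardware; unstructured sparsity needs sparse kernels to pay off.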