Vision-Language-Action Models

  • From vision-language to action: grounding language instructions in physical actions
  • VLAs: architecture (vision encoder + LLM + action head), RT-2, Octo, OpenVLA (see the architecture sketch after this list)
  • Action tokenisation: discretising continuous actions, action chunking (tokenisation sketch below)
  • Pretraining recipes: web-scale vision-language data → robot manipulation data
  • Generalisation: unseen objects, environments, instructions
  • Co-training with internet data and robot data (mixture-sampling sketch below)
  • Embodiment-agnostic models: one model for multiple robot form factors
  • Benchmarks: SIMPLER, real-world evaluation protocols
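
A minimal sketch of the architecture bullet above (vision encoder + LLM backbone + action head) in PyTorch. The module sizes, the single-image patchifier, and the regression-style action head are illustrative assumptions, not the RT-2, Octo, or OpenVLA implementations:

```python
# Minimal VLA skeleton: vision encoder -> project into the LLM token space ->
# transformer backbone over [image tokens; language tokens] -> action head.
# All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=512, n_action_dims=7, n_layers=4):
        super().__init__()
        # Vision encoder stub; real VLAs use a pretrained ViT (SigLIP, DINOv2, ...).
        self.vision = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=28, stride=28),  # 224px image -> 8x8 grid
            nn.Flatten(2),                                     # (B, d_model, 64 patches)
        )
        self.lang_embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the LLM backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Action head: map the final token to a continuous action vector.
        self.action_head = nn.Linear(d_model, n_action_dims)

    def forward(self, image, instruction_ids):
        img_tokens = self.vision(image).transpose(1, 2)   # (B, 64, d_model)
        txt_tokens = self.lang_embed(instruction_ids)     # (B, T, d_model)
        x = torch.cat([img_tokens, txt_tokens], dim=1)    # joint multimodal sequence
        x = self.backbone(x)
        return self.action_head(x[:, -1])                 # (B, n_action_dims)

model = TinyVLA()
action = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32_000, (2, 12)))
print(action.shape)  # torch.Size([2, 7])
```

RT-2-style models instead keep the LLM's own token head and emit discrete action tokens; the tokenisation sketch below shows that mapping.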
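
The tokenisation itself is a few lines: clip each action dimension to known bounds, discretise uniformly into a fixed number of bins (RT-2 uses 256), and flatten a chunk of H future actions into one token sequence. The bounds, bin count, and chunk length below are assumptions for illustration:

```python
import numpy as np

N_BINS = 256             # RT-2-style uniform discretisation
LOW, HIGH = -1.0, 1.0    # assumed per-dimension bounds after normalisation

def tokenize_chunk(actions: np.ndarray) -> np.ndarray:
    """Continuous action chunk (H, D) -> discrete tokens (H*D,)."""
    clipped = np.clip(actions, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1))
    return bins.astype(np.int64).reshape(-1)   # chunking: H steps, one sequence

def detokenize_chunk(tokens: np.ndarray, horizon: int, dim: int) -> np.ndarray:
    """Invert the mapping; recovers actions up to quantisation error."""
    vals = tokens.astype(np.float64) / (N_BINS - 1) * (HIGH - LOW) + LOW
    return vals.reshape(horizon, dim)

chunk = np.random.uniform(-1, 1, size=(8, 7))   # H=8 steps of a 7-DoF action
tokens = tokenize_chunk(chunk)                   # 56 tokens for the LLM to predict
recovered = detokenize_chunk(tokens, 8, 7)
assert np.abs(recovered - chunk).max() <= (HIGH - LOW) / (2 * (N_BINS - 1)) + 1e-9
```

Predicting a whole chunk per forward pass reduces how often the model must run at control time and smooths the executed trajectory.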
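
Co-training is, at the data-loader level, weighted mixture sampling: each training example is drawn from web vision-language data or from robot trajectories with fixed probabilities, so the model retains its VLM capabilities while learning control. The mixture ratio below is a made-up example, not a published recipe:

```python
import random

# Assumed mixture weights for illustration; real recipes tune these per dataset.
MIXTURE = {"web_vision_language": 0.6, "robot_manipulation": 0.4}

def sample_source(rng: random.Random) -> str:
    """Choose which dataset the next training example comes from."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # approximately a 6000 / 4000 split
```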