Vision-Language-Action Models
- From vision-language to action: grounding language instructions in physical actions
- VLAs: typical architecture (vision encoder + LLM backbone + action head); representative models: RT-2, Octo, OpenVLA (architecture sketch after this list)
- Action tokenisation: discretising continuous actions into per-dimension bins; action chunking: predicting a short horizon of actions at once (tokenisation sketch after this list)
- Pretraining recipes: web-scale vision-language pretraining first, then fine-tuning on robot manipulation data
- Generalisation: unseen objects, environments, instructions
- Co-training with internet data and robot data: mixing both sources in the same training run so web-derived semantic knowledge is retained (mixture sketch after this list)
- Embodiment-agnostic models: one model for multiple robot form factors (multi-head sketch after this list)
- Benchmarks: SIMPLER (simulated evaluation designed to track real-robot performance), real-world evaluation protocols (evaluation sketch after this list)
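
The bullets above compress several mechanisms; the sketches below unpack them one at a time. First, the architecture: a minimal VLA forward pass in PyTorch, pairing a vision encoder with a language-model trunk and a discrete action head. Every module choice and dimension here is an illustrative assumption, not the actual RT-2, Octo, or OpenVLA implementation.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative VLA: vision encoder + language-model trunk + action head."""
    def __init__(self, vocab_size=32000, d_model=256, action_dim=7, n_bins=256):
        super().__init__()
        # Vision encoder stand-in (a pretrained ViT/SigLIP in real systems):
        # patchify the image and project patches to the trunk width.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Language embedding + a small transformer standing in for the LLM.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        # Action head: per-dimension logits over discretised action bins.
        self.action_head = nn.Linear(d_model, action_dim * n_bins)
        self.action_dim, self.n_bins = action_dim, n_bins

    def forward(self, image, instruction_ids):
        # image: (B, 3, H, W); instruction_ids: (B, T)
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, P, D)
        txt = self.tok_embed(instruction_ids)                     # (B, T, D)
        h = self.trunk(torch.cat([vis, txt], dim=1))              # (B, P+T, D)
        pooled = h.mean(dim=1)                                    # (B, D)
        logits = self.action_head(pooled)                         # (B, A*K)
        return logits.view(-1, self.action_dim, self.n_bins)

model = ToyVLA()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 12)))
print(logits.shape)  # torch.Size([2, 7, 256])
```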
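Next, action tokenisation and chunking. RT-2 and OpenVLA discretise each continuous action dimension into 256 uniform bins so actions can be emitted as tokens; chunking has the policy predict a short horizon of future actions executed open-loop before replanning. The bin ranges and the horizon of 8 below are assumptions for illustration.

```python
import numpy as np

def tokenize_actions(actions, low, high, n_bins=256):
    """Map continuous actions in [low, high] to integer bin indices.
    Uniform binning per dimension; low/high would come from dataset stats."""
    actions = np.clip(actions, low, high)
    scaled = (actions - low) / (high - low)               # -> [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def detokenize_actions(tokens, low, high, n_bins=256):
    """Invert binning by taking each bin's centre."""
    return low + (tokens + 0.5) / n_bins * (high - low)

def make_chunks(trajectory, horizon=8):
    """(T, A) trajectory -> (T, H, A) chunk targets, repeating the last step
    to pad the tail. The horizon is a tunable assumption."""
    T = len(trajectory)
    idx = np.minimum(np.arange(T)[:, None] + np.arange(horizon)[None, :], T - 1)
    return trajectory[idx]

traj = np.random.uniform(-1, 1, size=(20, 7)).astype(np.float32)
tokens = tokenize_actions(traj, low=-1.0, high=1.0)
recon = detokenize_actions(tokens, low=-1.0, high=1.0)
chunks = make_chunks(traj, horizon=8)
print(tokens.shape, np.abs(recon - traj).max() <= 1 / 256, chunks.shape)
```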
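For co-training (the second stage of the pretraining recipe), a sketch of drawing batches from a weighted mixture of web vision-language data and robot trajectories. The mixture ratio and data names are made-up placeholders; real recipes tune the ratio carefully.

```python
import random

def mixture_batches(web_data, robot_data, robot_frac=0.5, seed=0):
    """Yield batches from two sources with a fixed mixture ratio.
    robot_frac = 0.5 is an illustrative assumption, not a published value."""
    rng = random.Random(seed)
    while True:
        source = robot_data if rng.random() < robot_frac else web_data
        yield rng.choice(source)

web = [f"web_vl_batch_{i}" for i in range(3)]      # captions/VQA batches
robot = [f"robot_batch_{i}" for i in range(3)]     # action-labelled batches
gen = mixture_batches(web, robot)
print([next(gen) for _ in range(6)])
```

The point of the interleaving is that gradient updates on web data keep flowing during robot fine-tuning, which is what lets the policy retain semantic knowledge it never sees in robot trajectories.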
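For embodiment-agnostic operation, one common pattern (assumed here, and similar in spirit to per-action-space readout heads) is a shared trunk with a lightweight head per embodiment, so a new form factor only adds a head. The embodiment names and dimensions below are invented.

```python
import torch
import torch.nn as nn

class MultiEmbodimentPolicy(nn.Module):
    """Shared trunk, embodiment-specific readout heads (illustrative)."""
    def __init__(self, d_model=256, action_dims=None):
        super().__init__()
        # Hypothetical embodiments: a 7-DoF arm and a 3-DoF mobile base.
        action_dims = action_dims or {"arm_7dof": 7, "mobile_base": 3}
        self.trunk = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.heads = nn.ModuleDict(
            {name: nn.Linear(d_model, dim) for name, dim in action_dims.items()}
        )

    def forward(self, obs_features, embodiment):
        # Trunk weights are shared across all robots; only the head differs.
        return self.heads[embodiment](self.trunk(obs_features))

policy = MultiEmbodimentPolicy()
feats = torch.randn(4, 256)
print(policy(feats, "arm_7dof").shape)     # torch.Size([4, 7])
print(policy(feats, "mobile_base").shape)  # torch.Size([4, 3])
```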
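Finally, a sketch of a success-rate evaluation protocol in the spirit of SIMPLER-style simulated rollouts: a fixed task list, several seeded episodes per task, binary success per episode. The environment, task names, and success probabilities here are stubs, not the SIMPLER API.

```python
import random

class StubEnv:
    """Stand-in environment: succeeds with a task-dependent probability.
    A real protocol would step the policy in a control loop instead."""
    def __init__(self, task, seed):
        self.rng = random.Random(seed)
        self.p_success = {"pick_object": 0.7, "open_drawer": 0.4}[task]

    def rollout(self, policy, max_steps=100):
        return self.rng.random() < self.p_success

def success_rate(policy, task, n_episodes=25, base_seed=0):
    """Average binary success over seeded episodes of one task."""
    wins = sum(StubEnv(task, base_seed + i).rollout(policy)
               for i in range(n_episodes))
    return wins / n_episodes

for task in ["pick_object", "open_drawer"]:
    print(task, success_rate(policy=None, task=task))
```

Seeding each episode and fixing the task grid in advance is what makes such evaluations comparable across policies, which is the core of both simulated and real-world protocols.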