Robotics · July 28, 2023 · Google DeepMind
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, et al.
Abstract
We study how vision-language models trained on internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. We introduce RT-2, a class of vision-language-action (VLA) models trained on both web and robotics data, and show that they can directly output robot actions.
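The mechanism that makes "directly output robot actions" possible is representing actions in the model's existing token vocabulary: RT-2 discretizes each action dimension into 256 bins and emits an action as a short string of tokens. Below is a minimal sketch of that discretization, assuming a 7-DoF end-effector action (position delta, rotation delta, gripper); the helper names and bounds are illustrative, not from the paper's codebase.

```python
import numpy as np

NUM_BINS = 256  # RT-2 discretizes each action dimension into 256 bins

def action_to_tokens(action, low, high):
    """Map a continuous action vector to one integer bin token per dimension."""
    clipped = np.clip(action, low, high)
    bins = np.floor((clipped - low) / (high - low) * (NUM_BINS - 1) + 0.5)
    return bins.astype(int)

def tokens_to_action(tokens, low, high):
    """Recover the (quantized) continuous action from its bin tokens."""
    return low + tokens / (NUM_BINS - 1) * (high - low)

# Hypothetical 7-DoF action: xyz delta, roll/pitch/yaw delta, gripper.
low = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
high = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])
action = np.array([0.02, -0.05, 0.0, 0.1, 0.0, -0.2, 1.0])

tokens = action_to_tokens(action, low, high)
print("action string:", " ".join(map(str, tokens)))  # 7-token string the model can emit
print("decoded:", tokens_to_action(tokens, low, high))
```

Decoding inverts the mapping, so a text string generated by the language model can be executed on the robot with at most half-bin quantization error per dimension.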
Key Findings
1. Transferred web knowledge from vision-language models to robotic control
2. Demonstrated emergent semantic reasoning in robot actions
3. Showed robots performing novel tasks described in natural language
4. Combined internet-scale pre-training with robot demonstration data (see the sampling sketch after this list)
5. Achieved significant improvements in generalization over prior methods
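Finding 4 corresponds to what the RT-2 paper calls co-fine-tuning: training batches mix web-scale vision-language examples with robot demonstrations, so a single model learns to produce both natural-language answers and action-token strings. The sketch below illustrates that sampling scheme under stated assumptions; the function, parameter, and field names are hypothetical.

```python
import random

def cofinetune_batches(web_examples, robot_episodes, batch_size=8, web_fraction=0.5):
    """Yield mixed batches; every example is an (image, text prompt, target tokens) tuple.

    Web examples keep natural-language targets; robot episodes use discretized
    action strings as targets, so one model trains on both output formats.
    """
    while True:
        batch = []
        for _ in range(batch_size):
            if random.random() < web_fraction:
                batch.append(random.choice(web_examples))    # e.g. VQA-style pairs
            else:
                batch.append(random.choice(robot_episodes))  # demo frames + action tokens
        yield batch
```

The mixing ratio is the key design knob: too little web data and the model forgets its semantic knowledge; too little robot data and it never grounds that knowledge in actions.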
Impact & Significance
RT-2 showed that foundation models can bridge the gap between internet knowledge and physical robot control, advancing the vision of general-purpose robots that understand and act on natural language instructions.
Related Papers
LLM · July 23, 2024
The Llama 3 Herd of Models
Meta AI
LLM · July 15, 2024
Qwen2 Technical Report
Alibaba Cloud / Qwen Team
Efficiency · May 7, 2024
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek AI
LLM · March 4, 2024
The Claude 3 Model Family: Opus, Sonnet, and Haiku
Anthropic