Robotics · July 28, 2023 · Google DeepMind

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, et al.

Abstract

We study how vision-language models trained on internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. We introduce RT-2, a class of vision-language-action (VLA) models that are trained on both web data and robotics data, and show that they can directly output robot actions.
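A key idea behind RT-2 is that robot actions are expressed as strings of discretized integer tokens, so the same language-model interface that emits text can emit motor commands. Below is a minimal sketch of that discretization in Python: the 256-bin count follows the paper, while the value ranges, dimension layout, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Sketch of the action-as-tokens idea behind VLA models like RT-2:
# each continuous action dimension is discretized into 256 bins and
# emitted as an integer token, letting a language model "speak" actions.
NUM_BINS = 256

def tokenize_action(action: np.ndarray, low: float = -1.0, high: float = 1.0) -> str:
    """Map each continuous action dimension to one of NUM_BINS integer tokens."""
    clipped = np.clip(action, low, high)
    bins = np.round((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def detokenize_action(token_str: str, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Invert tokenize_action: integer tokens back to continuous values."""
    bins = np.array([int(t) for t in token_str.split()])
    return low + bins / (NUM_BINS - 1) * (high - low)

# Example: a 7-D end-effector action (xyz delta, rpy delta, gripper).
action = np.array([0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0])
tokens = tokenize_action(action)
print(tokens)                     # integer tokens, e.g. "140 102 134 128 166 115 255"
print(detokenize_action(tokens))  # approximately recovers the original action
```

Because the tokens live in the model's ordinary output vocabulary, no new action head is needed; decoding an action is just text generation followed by de-tokenization.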

Key Findings

  • Transferred web knowledge from vision-language models to robotic control
  • Demonstrated emergent semantic reasoning in robot actions
  • Showed robots performing novel tasks described in natural language
  • Combined internet-scale pre-training with robot demonstration data (see the sketch after this list)
  • Achieved significant improvements in generalization over prior methods
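The combination of web-scale and robot data is done by co-fine-tuning: training batches mix vision-language examples with robot demonstrations so the model retains web knowledge while learning to output action tokens. A minimal sketch of such data mixing is below; the mixing ratio and function names are illustrative assumptions, not the paper's exact schedule.

```python
import random

def mixed_batches(web_data, robot_data, batch_size=8, robot_fraction=0.5):
    """Yield training batches drawn from both data sources at a fixed ratio."""
    n_robot = int(batch_size * robot_fraction)
    n_web = batch_size - n_robot
    while True:
        batch = random.sample(robot_data, n_robot) + random.sample(web_data, n_web)
        random.shuffle(batch)  # interleave the two sources within the batch
        yield batch

# Example with placeholder items standing in for (image, text, target) tuples.
web = [f"web_{i}" for i in range(100)]
robot = [f"robot_{i}" for i in range(100)]
print(next(mixed_batches(web, robot)))  # a shuffled mix of web and robot examples
```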

Impact & Significance

RT-2 showed that foundation models can bridge the gap between internet knowledge and physical robot control, advancing the vision of general-purpose robots that understand and act on natural language instructions.
