Robotics · July 28, 2023 · Google DeepMind

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, et al.

Abstract

We study how vision-language models trained on internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. We introduce RT-2, a class of vision-language-action (VLA) models that are trained on both web data and robotics data, and show that they can directly output robot actions.
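A key idea behind RT-2 is that robot actions are expressed as strings of discretized integer tokens, so the same language-model interface that emits text can emit motor commands. Below is a minimal sketch of that discretization in Python: the 256-bin count follows the paper, while the value ranges, dimension layout, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Sketch of the action-as-tokens idea behind VLA models like RT-2:
# each continuous action dimension is discretized into 256 bins and
# emitted as an integer token, letting a language model "speak" actions.
NUM_BINS = 256

def tokenize_action(action: np.ndarray, low: float = -1.0, high: float = 1.0) -> str:
    """Map each continuous action dimension to one of NUM_BINS integer tokens."""
    clipped = np.clip(action, low, high)
    bins = np.round((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def detokenize_action(token_str: str, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Invert tokenize_action: integer tokens back to continuous values."""
    bins = np.array([int(t) for t in token_str.split()])
    return low + bins / (NUM_BINS - 1) * (high - low)

# Example: a 7-D end-effector action (xyz delta, rpy delta, gripper).
action = np.array([0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0])
tokens = tokenize_action(action)
print(tokens)                     # integer tokens, e.g. "140 102 134 128 166 115 255"
print(detokenize_action(tokens))  # approximately recovers the original action
```

Because the tokens live in the model's ordinary output vocabulary, no new action head is needed; decoding an action is just text generation followed by de-tokenization.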

Key Findings

  • Transferred web knowledge from vision-language models to robotic control
  • Demonstrated emergent semantic reasoning in robot actions
  • Showed robots performing novel tasks described in natural language
  • Combined internet-scale pre-training with robot demonstration data (see the sketch after this list)
  • Achieved significant improvements in generalization over prior methods
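The combination of web-scale and robot data is done by co-fine-tuning: training batches mix vision-language examples with robot demonstrations so the model retains web knowledge while learning to output action tokens. A minimal sketch of such data mixing is below; the mixing ratio and function names are illustrative assumptions, not the paper's exact schedule.

```python
import random

def mixed_batches(web_data, robot_data, batch_size=8, robot_fraction=0.5):
    """Yield training batches drawn from both data sources at a fixed ratio."""
    n_robot = int(batch_size * robot_fraction)
    n_web = batch_size - n_robot
    while True:
        batch = random.sample(robot_data, n_robot) + random.sample(web_data, n_web)
        random.shuffle(batch)  # interleave the two sources within the batch
        yield batch

# Example with placeholder items standing in for (image, text, target) tuples.
web = [f"web_{i}" for i in range(100)]
robot = [f"robot_{i}" for i in range(100)]
print(next(mixed_batches(web, robot)))  # a shuffled mix of web and robot examples
```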

Impact & Significance

RT-2 showed that foundation models can bridge the gap between internet knowledge and physical robot control, advancing the vision of general-purpose robots that understand and act on natural language instructions.
