A Generalist Agent
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas
Abstract
We introduce Gato, a single generalist agent: a multi-modal, multi-task, multi-embodiment policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm, and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
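The unifying trick is that every modality is serialized into one flat token sequence: text via a subword vocabulary, and continuous values (such as joint torques) mu-law companded and discretized into bins placed after the text vocabulary. A minimal sketch of that tokenization, with the paper's reported constants (μ = 100, M = 256, 1024 bins, 32,000 text tokens) but otherwise simplified:

```python
import numpy as np

TEXT_VOCAB = 32_000   # SentencePiece text tokens occupy ids [0, 32000)
NUM_BINS = 1024       # continuous values get 1024 discrete bins

def tokenize_continuous(x, mu=100.0, M=256.0):
    """Map a continuous value (e.g. a joint torque) to a token id."""
    # mu-law compand into [-1, 1], then clip
    y = np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(M * mu + 1.0)
    y = np.clip(y, -1.0, 1.0)
    # discretize to an integer bin and shift past the text vocabulary
    bin_id = int((y + 1.0) / 2.0 * (NUM_BINS - 1))
    return TEXT_VOCAB + bin_id

def tokenize_discrete(a):
    """Discrete actions (e.g. Atari button ids) are kept as plain ids."""
    return int(a)
```

With this shared vocabulary, an episode from any environment becomes one sequence of integers that a single transformer can model autoregressively.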
Key Findings
- Created a single model performing 604 distinct tasks across modalities
- Demonstrated a unified architecture for text, vision, and robot control
- Showed that a single set of weights can handle diverse embodiments
- Achieved over 50% of expert score on more than 450 of the 604 tasks
- Pointed toward the possibility of foundation models for robotics
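Handling diverse embodiments with one set of weights means only the set of legal output tokens changes per task: the same model is queried, and its logits are masked to the current embodiment's action space before sampling. A hedged sketch of that idea (the names `ACTION_SPECS` and `sample_action` are illustrative, not from the paper):

```python
import numpy as np

# Illustrative per-embodiment action vocabularies: Atari uses 18 discrete
# actions; a robot arm uses the discretized-continuous token range.
ACTION_SPECS = {
    "atari": range(0, 18),
    "robot_arm": range(32_000, 33_024),
}

def sample_action(logits, embodiment, rng=np.random.default_rng(0)):
    """Mask logits to the embodiment's legal tokens, then sample one."""
    legal = np.array(list(ACTION_SPECS[embodiment]))
    masked = np.full_like(logits, -np.inf)
    masked[legal] = logits[legal]
    # softmax over the masked logits (illegal tokens get probability 0)
    probs = np.exp(masked - masked[legal].max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

The model itself never branches on the task; the context in the prompt and the output mask are what select between chat replies, button presses, and torques.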
Impact & Significance
Gato demonstrated the feasibility of building generalist AI agents that can operate across modalities and embodiments, influencing the vision of AGI as a single model that can do everything from chat to robot control.