Vision · October 22, 2020 · Google Research

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

Abstract

While the Transformer architecture has become the de-facto standard for NLP tasks, its applications to computer vision remain limited. We show that a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. Vision Transformer (ViT) attains excellent results compared to state-of-the-art CNNs while requiring substantially fewer computational resources to train.
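As a concrete reading of the title, here is a minimal sketch of the patch arithmetic, assuming the standard ViT-Base setting of 224×224 input images and 16×16 patches (variable names are illustrative, not from the paper):

```python
# A 224x224 RGB image cut into 16x16 patches becomes a sequence of
# (224 / 16)^2 = 196 "words", each a flattened vector of 16*16*3 = 768 values.
image_size, patch_size, channels = 224, 16, 3
num_patches = (image_size // patch_size) ** 2   # 196 tokens per image
patch_dim = patch_size * patch_size * channels  # 768 values per token
print(num_patches, patch_dim)                   # 196 768
```

Each flattened patch is linearly projected and fed to a standard Transformer encoder, the same way word embeddings are in NLP.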

Key Findings

  • Demonstrated that Transformers can replace CNNs for image classification
  • Split images into fixed-size patches and treated them as token sequences (a minimal sketch follows this list)
  • Matched or exceeded state-of-the-art CNNs on ImageNet while requiring substantially less pre-training compute
  • Showed that scale (data + model size) is key to ViT's success
  • Established a unified architecture for both vision and language tasks
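
The pipeline described in these findings can be condensed into a short sketch. This is not the authors' implementation, only a minimal PyTorch rendering of the ViT-Base configuration reported in the paper (16×16 patches, hidden size 768, 12 layers, 12 heads); the class name MinimalViT is hypothetical:

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Minimal ViT pipeline sketch: patchify, embed, encode, classify."""
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Linear projection of flattened patches, implemented as a conv
        # with kernel = stride = patch size (an equivalent, common trick).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                    # images: (B, 3, H, W)
        x = self.patch_embed(images)              # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                       # standard Transformer encoder
        return self.head(x[:, 0])                 # classify from the [CLS] token

logits = MinimalViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

Implementing the patch projection as a strided convolution is mathematically equivalent to flattening each patch and applying a shared linear layer; it is simply the more idiomatic way to express it in PyTorch.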

Impact & Significance

ViT unified computer vision and NLP under the Transformer architecture, enabling multimodal models that share a single backbone across modalities and simplifying the research landscape. It influenced CLIP, DALL-E, Segment Anything, and much of modern vision modeling.
