Vision · October 22, 2020 · Google Research

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

Abstract

While the Transformer architecture has become the de-facto standard for NLP tasks, its applications to computer vision remain limited. We show that a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. Vision Transformer (ViT) attains excellent results compared to state-of-the-art CNNs while requiring substantially fewer computational resources to train.
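As a concrete reading of the title, here is a minimal sketch of the patch arithmetic, assuming the standard ViT-Base setting of 224×224 input images and 16×16 patches (variable names are illustrative, not from the paper):

```python
# A 224x224 RGB image cut into 16x16 patches becomes a sequence of
# (224 / 16)^2 = 196 "words", each a flattened vector of 16*16*3 = 768 values.
image_size, patch_size, channels = 224, 16, 3
num_patches = (image_size // patch_size) ** 2   # 196 tokens per image
patch_dim = patch_size * patch_size * channels  # 768 values per token
print(num_patches, patch_dim)                   # 196 768
```

Each flattened patch is linearly projected and fed to a standard Transformer encoder, the same way word embeddings are in NLP.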

Key Findings

  • Demonstrated that Transformers can replace CNNs for image classification
  • Split images into fixed-size patches and treated them as token sequences (a minimal sketch follows this list)
  • Matched or exceeded state-of-the-art CNNs on ImageNet while requiring substantially less pre-training compute
  • Showed that scale (data + model size) is key to ViT's success
  • Established a unified architecture for both vision and language tasks
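
The pipeline described in these findings can be condensed into a short sketch. This is not the authors' implementation, only a minimal PyTorch rendering of the ViT-Base configuration reported in the paper (16×16 patches, hidden size 768, 12 layers, 12 heads); the class name MinimalViT is hypothetical:

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Minimal ViT pipeline sketch: patchify, embed, encode, classify."""
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Linear projection of flattened patches, implemented as a conv
        # with kernel = stride = patch size (an equivalent, common trick).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                    # images: (B, 3, H, W)
        x = self.patch_embed(images)              # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                       # standard Transformer encoder
        return self.head(x[:, 0])                 # classify from the [CLS] token

logits = MinimalViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

Implementing the patch projection as a strided convolution is mathematically equivalent to flattening each patch and applying a shared linear layer; it is simply the more idiomatic way to express it in PyTorch.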

Impact & Significance

ViT unified computer vision and NLP under the Transformer architecture, enabling multimodal models that share a single backbone across modalities and simplifying the research landscape. It influenced CLIP, DALL-E, Segment Anything, and much of modern vision modeling.
