Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
Abstract
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch on a dataset of 400 million image-text pairs. CLIP models learn to connect images and text in a shared embedding space, enabling zero-shot transfer to downstream tasks.
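Concretely, the pre-training task is a symmetric contrastive objective over a batch of N image-text pairs: each image must identify its own caption among the N texts, and each caption its own image. The PyTorch sketch below is a minimal illustration in the spirit of the pseudocode given in the paper; the function name, toy inputs, and standalone usage are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N aligned image-text pairs.

    image_features, text_features: [N, D] outputs of the two encoders after
    projection into the shared embedding space. logit_scale is the learned
    temperature, stored as a log-scale scalar as in the paper.
    """
    # L2-normalize so the dot product is cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] scaled pairwise similarities: the diagonal holds the N correct
    # pairings, the off-diagonal entries are the N^2 - N negatives
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # cross-entropy in both directions: each image picks its caption,
    # each caption picks its image
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2


# Toy usage with random embeddings standing in for real encoder outputs.
if __name__ == "__main__":
    n, d = 8, 512
    img = torch.randn(n, d)
    txt = torch.randn(n, d)
    scale = torch.tensor(2.659)  # log(1/0.07), the temperature initialization used in the paper
    print(clip_contrastive_loss(img, txt, scale))
```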
Key Findings
- Learned visual representations from natural language supervision at scale
- Achieved competitive zero-shot image classification without task-specific training (see the sketch after this list)
- Created a shared embedding space for images and text, enabling cross-modal retrieval
- Trained on 400 million image-text pairs collected from the internet
- Demonstrated substantially better robustness to natural distribution shift than standard ImageNet-trained models
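Zero-shot classification (second finding above) works by turning the candidate class names into text prompts such as "a photo of a {label}", embedding them with the text encoder, and assigning the image to the class whose text embedding is closest in the shared space. A minimal sketch, assuming the openai/CLIP reference package is installed; the label set and the "photo.jpg" path are placeholders.

```python
import torch
import clip  # assumes the openai/CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Build a "classifier" purely from text: one prompt per class name.
class_names = ["dog", "cat", "airplane"]  # illustrative labels
prompts = [f"a photo of a {c}" for c in class_names]
text_tokens = clip.tokenize(prompts).to(device)

# "photo.jpg" is a placeholder for whatever image you want to classify.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity in the shared embedding space, softmaxed over classes
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```

No image labels are used anywhere: swapping in a different label set changes the classifier without any retraining, which is what "without task-specific training" means here.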
Impact & Significance
CLIP bridged the gap between vision and language and became a fundamental building block for DALL-E, Stable Diffusion, and many other multimodal AI systems. Its contrastive learning approach influenced a generation of vision-language models.