Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
Abstract
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch on a dataset of 400 million image-text pairs. CLIP models learn to connect images and text in a shared embedding space, enabling zero-shot transfer to downstream tasks.
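Concretely, the pre-training task is a symmetric contrastive objective over a batch of N image-text pairs: each image must identify its own caption among the N texts, and each caption its own image. The PyTorch sketch below is a minimal illustration in the spirit of the pseudocode given in the paper; the function name, toy inputs, and standalone usage are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N aligned image-text pairs.

    image_features, text_features: [N, D] outputs of the two encoders after
    projection into the shared embedding space. logit_scale is the learned
    temperature, stored as a log-scale scalar as in the paper.
    """
    # L2-normalize so the dot product is cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] scaled pairwise similarities: the diagonal holds the N correct
    # pairings, the off-diagonal entries are the N^2 - N negatives
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # cross-entropy in both directions: each image picks its caption,
    # each caption picks its image
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2


# Toy usage with random embeddings standing in for real encoder outputs.
if __name__ == "__main__":
    n, d = 8, 512
    img = torch.randn(n, d)
    txt = torch.randn(n, d)
    scale = torch.tensor(2.659)  # log(1/0.07), the temperature initialization used in the paper
    print(clip_contrastive_loss(img, txt, scale))
```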
Key Findings
- Learned visual representations from natural language supervision at scale
- Achieved competitive zero-shot image classification without task-specific training (see the sketch after this list)
- Created a shared embedding space for images and text, enabling cross-modal retrieval
- Trained on 400 million image-text pairs collected from the internet
- Demonstrated substantially better robustness to natural distribution shift than standard ImageNet-trained models
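Zero-shot classification (second finding above) works by turning the candidate class names into text prompts such as "a photo of a {label}", embedding them with the text encoder, and assigning the image to the class whose text embedding is closest in the shared space. A minimal sketch, assuming the openai/CLIP reference package is installed; the label set and the "photo.jpg" path are placeholders.

```python
import torch
import clip  # assumes the openai/CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Build a "classifier" purely from text: one prompt per class name.
class_names = ["dog", "cat", "airplane"]  # illustrative labels
prompts = [f"a photo of a {c}" for c in class_names]
text_tokens = clip.tokenize(prompts).to(device)

# "photo.jpg" is a placeholder for whatever image you want to classify.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity in the shared embedding space, softmaxed over classes
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```

No image labels are used anywhere: swapping in a different label set changes the classifier without any retraining, which is what "without task-specific training" means here.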
Impact & Significance
CLIP bridged the gap between vision and language and became a fundamental building block for DALL-E, Stable Diffusion, and many other multimodal AI systems. Its contrastive learning approach influenced a generation of vision-language models.