Multimodal · April 13, 2022 · OpenAI
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
Abstract
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. We propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We call the resulting model DALL-E 2.
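The two-stage pipeline in the abstract can be sketched end to end. This is a toy illustration, not the paper's implementation: the embedding dimension, image shape, and every function below are hypothetical stand-ins (the real system uses learned CLIP encoders, a diffusion or autoregressive prior, and a diffusion decoder with upsamplers).

```python
import numpy as np

# Toy dimensions; the real model uses high-dimensional CLIP embeddings
# and 64x64 diffusion outputs followed by upsamplers.
EMBED_DIM = 4
IMAGE_SHAPE = (8, 8, 3)

def clip_text_embed(caption: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder: deterministic hash -> unit vector."""
    seed = sum(caption.encode()) % (2**32)
    v = np.random.default_rng(seed).standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def prior(text_emb: np.ndarray) -> np.ndarray:
    """Stage 1 (prior): produce a CLIP image embedding from the text embedding.
    A fixed linear map stands in for the learned prior."""
    z = np.eye(EMBED_DIM) @ text_emb
    return z / np.linalg.norm(z)

def decoder(image_emb: np.ndarray, steps: int = 4) -> np.ndarray:
    """Stage 2 (decoder): generate an image conditioned on the image embedding.
    A trivial update loop stands in for the diffusion decoder."""
    x = np.zeros(IMAGE_SHAPE)
    for _ in range(steps):
        x = x + np.abs(image_emb).mean() / steps  # toy conditioning signal
    return np.clip(x, 0.0, 1.0)

def generate(caption: str) -> np.ndarray:
    z_t = clip_text_embed(caption)  # text -> CLIP text embedding
    z_i = prior(z_t)                # prior: text embedding -> image embedding
    return decoder(z_i)             # decoder: image embedding -> image

img = generate("a corgi playing a flame-throwing trumpet")
print(img.shape)
```

The key design choice the sketch preserves is the explicit intermediate CLIP image embedding: the decoder never sees the caption, only the embedding the prior produced from it.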
Key Findings
1. Combined CLIP embeddings with diffusion models for text-to-image generation
2. Achieved photorealistic image generation from text descriptions
3. Demonstrated image editing capabilities through text-guided manipulation
4. Introduced a two-stage architecture (prior + decoder) for generation
5. Showed strong zero-shot generation capabilities
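The editing and interpolation results above come from manipulating CLIP image embeddings directly, e.g. blending two embeddings on the unit sphere before decoding. A minimal sketch of spherical interpolation (slerp), with toy 4-d vectors standing in for real CLIP embeddings:

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two embeddings (unit vectors)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(float(a @ b), -1.0, 1.0))
    if omega < 1e-8:  # nearly parallel: interpolation is a no-op
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Hypothetical embeddings standing in for CLIP image embeddings.
z_source = np.array([1.0, 0.0, 0.0, 0.0])
z_target = np.array([0.0, 1.0, 0.0, 0.0])
z_mid = slerp(z_source, z_target, 0.5)  # stays on the unit sphere
print(np.linalg.norm(z_mid))
```

Unlike linear interpolation, slerp keeps intermediate points at unit norm, so every blend remains a valid input for the decoder.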
Impact & Significance
DALL-E 2 brought AI image generation to mainstream awareness and demonstrated the commercial potential of text-to-image models. It influenced the development of subsequent models including DALL-E 3 and Midjourney.
Related Papers
- The Llama 3 Herd of Models — Meta AI (LLM, July 23, 2024)
- Qwen2 Technical Report — Alibaba Cloud / Qwen Team (LLM, July 15, 2024)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek AI (Efficiency, May 7, 2024)
- The Claude 3 Model Family: Opus, Sonnet, and Haiku — Anthropic (LLM, March 4, 2024)