Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi
Abstract
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. We discover that generic large language models, pre-trained on text-only corpora, are surprisingly effective at encoding text for image synthesis. Imagen achieves a new state-of-the-art FID score on the COCO benchmark.
Key Findings
1. Demonstrated that large text encoders (T5-XXL) dramatically improve image quality
2. Achieved state-of-the-art photorealism in text-to-image generation
3. Showed that language model scale matters more than diffusion model scale
4. Introduced DrawBench, a benchmark for evaluating text-to-image models
5. Applied classifier-free guidance for improved text-image alignment
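Classifier-free guidance combines the predictions of a conditional and an unconditional denoiser at each sampling step, extrapolating toward the text condition. A minimal sketch of the guidance rule (the function name and toy values are illustrative, not from the paper):

```python
def classifier_free_guidance(eps_cond: float, eps_uncond: float, w: float) -> float:
    """Blend conditional and unconditional noise predictions.

    eps_guided = eps_uncond + w * (eps_cond - eps_uncond)

    w = 1.0 recovers the plain conditional prediction; w > 1.0
    amplifies the effect of the text condition. Imagen pairs
    large guidance weights with thresholding of the sample values.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)


# Toy scalar example (real models predict a noise tensor per pixel):
guided = classifier_free_guidance(eps_cond=0.5, eps_uncond=0.1, w=3.0)
print(guided)  # 0.1 + 3.0 * 0.4 = 1.3
```

In practice the unconditional prediction is obtained by running the same network with the text embedding dropped (replaced by a null token), so no separate classifier is trained.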
Impact & Significance
Imagen showed that using large frozen language models as text encoders is key to high-quality image generation, influencing the design of subsequent models. It contributed to Google's ImageFX and Gemini image generation capabilities.