Multimodal · May 23, 2022 · Google Research

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi

Abstract

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. We discover that generic large language models, pre-trained on text-only corpora, are surprisingly effective at encoding text for image synthesis. Imagen achieves a new state-of-the-art FID score on the COCO benchmark.

Key Findings

  • Demonstrated that large text encoders (T5-XXL) dramatically improve image quality
  • Achieved state-of-the-art photorealism in text-to-image generation
  • Showed that language model scale matters more than diffusion model scale
  • Introduced DrawBench, a benchmark for evaluating text-to-image models
  • Applied classifier-free guidance for improved text-image alignment
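The last finding, classifier-free guidance, combines a model's text-conditional and unconditional noise predictions at sampling time, pushing samples toward the text prompt. A minimal sketch of the standard combination rule (the function name and toy arrays are illustrative, not from the paper):

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.

    Illustrative helper: with guidance weight w, the conditional
    prediction eps_cond is extrapolated away from the unconditional
    prediction eps_uncond. w = 1 recovers the plain conditional
    prediction; larger w strengthens text-image alignment, at some
    cost to sample fidelity if pushed too high.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-ins for the two noise predictions at one sampling step
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 1.0])
print(classifier_free_guidance(eps_c, eps_u, 1.0))  # → [1. 2.] (equals eps_cond)
print(classifier_free_guidance(eps_c, eps_u, 3.0))  # → [3. 4.]
```

In practice the unconditional prediction is obtained by running the same diffusion model with an empty text embedding, so guidance needs no separate classifier.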

Impact & Significance

Imagen showed that using large frozen language models as text encoders is key to high-quality image generation, influencing the design of subsequent models. It contributed to Google's ImageFX and Gemini image generation capabilities.
