Multimodal · May 23, 2022 · Google Research

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, Mohammad Norouzi

Abstract

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. We discover that generic large language models, pre-trained on text-only corpora, are surprisingly effective at encoding text for image synthesis. Imagen achieves a new state-of-the-art FID score on the COCO benchmark.

Key Findings

  • Demonstrated that large text encoders (T5-XXL) dramatically improve image quality
  • Achieved state-of-the-art photorealism in text-to-image generation
  • Showed that language model scale matters more than diffusion model scale
  • Introduced DrawBench, a benchmark for evaluating text-to-image models
  • Applied classifier-free guidance for improved text-image alignment
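The last finding, classifier-free guidance, combines a model's text-conditional and unconditional noise predictions at sampling time, pushing samples toward the text prompt. A minimal sketch of the standard combination rule (the function name and toy arrays are illustrative, not from the paper):

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.

    Illustrative helper: with guidance weight w, the conditional
    prediction eps_cond is extrapolated away from the unconditional
    prediction eps_uncond. w = 1 recovers the plain conditional
    prediction; larger w strengthens text-image alignment, at some
    cost to sample fidelity if pushed too high.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-ins for the two noise predictions at one sampling step
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 1.0])
print(classifier_free_guidance(eps_c, eps_u, 1.0))  # → [1. 2.] (equals eps_cond)
print(classifier_free_guidance(eps_c, eps_u, 3.0))  # → [3. 4.]
```

In practice the unconditional prediction is obtained by running the same diffusion model with an empty text embedding, so guidance needs no separate classifier.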

Impact & Significance

Imagen showed that using large frozen language models as text encoders is key to high-quality image generation, influencing the design of subsequent models. It contributed to Google's ImageFX and Gemini image generation capabilities.
