Multimodal · April 13, 2022 · OpenAI

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen

Abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. We propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We call the resulting model DALL-E 2.
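The two-stage pipeline can be summarized as: encode the caption with CLIP, sample an image embedding from the prior, then decode that embedding into pixels. Below is a minimal toy sketch of that control flow; the `clip_text_embed`, `prior`, and `decoder` functions are hypothetical stand-ins (random/linear placeholders), not the actual diffusion models used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 4  # toy dimension; real CLIP embeddings are far larger (e.g. 768)

def clip_text_embed(caption: str) -> np.ndarray:
    # Stand-in for CLIP's text encoder: deterministically hash the
    # caption into a fixed-size vector.
    seed = sum(caption.encode()) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMBED_DIM)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    # Stand-in for the diffusion prior: maps a text embedding to a
    # CLIP *image* embedding (here, just a noisy copy).
    return text_embedding + 0.1 * rng.standard_normal(EMBED_DIM)

def decoder(image_embedding: np.ndarray) -> np.ndarray:
    # Stand-in for the diffusion decoder: maps an image embedding to
    # pixels (here, a tiny fake "image" array).
    return np.tanh(np.outer(image_embedding, image_embedding))

def generate(caption: str) -> np.ndarray:
    z_t = clip_text_embed(caption)  # encode caption with CLIP
    z_i = prior(z_t)                # stage 1: prior p(z_i | caption)
    return decoder(z_i)             # stage 2: decoder p(x | z_i)

image = generate("a corgi playing a flame-throwing trumpet")
print(image.shape)  # (4, 4)
```

Factoring generation this way lets the prior and decoder be trained separately, and lets the decoder produce multiple variations of an image by re-decoding the same image embedding.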

Key Findings

  • Combined CLIP embeddings with diffusion models for text-to-image generation
  • Achieved photorealistic image generation from text descriptions
  • Demonstrated image editing capabilities through text-guided manipulation
  • Introduced a two-stage architecture (prior + decoder) for generation
  • Showed strong zero-shot generation capabilities

Impact & Significance

DALL-E 2 brought AI image generation to mainstream awareness and demonstrated the commercial potential of text-to-image models. It influenced subsequent text-to-image systems, including OpenAI's own DALL-E 3.
