Multimodal · April 13, 2022 · OpenAI

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen

Abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. We propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We call the resulting model DALL-E 2.
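The two-stage pipeline can be summarized as: encode the caption with CLIP, sample an image embedding from the prior, then decode that embedding into pixels. Below is a minimal toy sketch of that control flow; the `clip_text_embed`, `prior`, and `decoder` functions are hypothetical stand-ins (random/linear placeholders), not the actual diffusion models used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 4  # toy dimension; real CLIP embeddings are far larger (e.g. 768)

def clip_text_embed(caption: str) -> np.ndarray:
    # Stand-in for CLIP's text encoder: deterministically hash the
    # caption into a fixed-size vector.
    seed = sum(caption.encode()) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMBED_DIM)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    # Stand-in for the diffusion prior: maps a text embedding to a
    # CLIP *image* embedding (here, just a noisy copy).
    return text_embedding + 0.1 * rng.standard_normal(EMBED_DIM)

def decoder(image_embedding: np.ndarray) -> np.ndarray:
    # Stand-in for the diffusion decoder: maps an image embedding to
    # pixels (here, a tiny fake "image" array).
    return np.tanh(np.outer(image_embedding, image_embedding))

def generate(caption: str) -> np.ndarray:
    z_t = clip_text_embed(caption)  # encode caption with CLIP
    z_i = prior(z_t)                # stage 1: prior p(z_i | caption)
    return decoder(z_i)             # stage 2: decoder p(x | z_i)

image = generate("a corgi playing a flame-throwing trumpet")
print(image.shape)  # (4, 4)
```

Factoring generation this way lets the prior and decoder be trained separately, and lets the decoder produce multiple variations of an image by re-decoding the same image embedding.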

Key Findings

  • Combined CLIP embeddings with diffusion models for text-to-image generation
  • Achieved photorealistic image generation from text descriptions
  • Demonstrated image editing capabilities through text-guided manipulation
  • Introduced a two-stage architecture (prior + decoder) for generation
  • Showed strong zero-shot generation capabilities

Impact & Significance

DALL-E 2 brought AI image generation to mainstream awareness and demonstrated the commercial potential of text-to-image models. It influenced subsequent text-to-image systems, including OpenAI's own DALL-E 3.
