Vision · September 20, 2023 · OpenAI
Improving Image Generation with Better Captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo
Abstract
We study how image generation models can be improved by training on better image captions. We develop an automatic captioning pipeline that generates highly descriptive image captions. Training text-to-image models on these improved captions substantially improves the quality and prompt-following ability of the resulting models, which we call DALL-E 3.
Key Findings
1. Showed that better training captions dramatically improve image generation quality
2. Developed an automatic captioning pipeline for training-data improvement
3. Achieved significantly better prompt-following compared to DALL-E 2
4. Demonstrated improved text rendering in generated images
5. Integrated natively with ChatGPT for conversational image generation
Impact & Significance
DALL-E 3 integrated text-to-image generation directly into ChatGPT, making AI image creation accessible to millions. Its focus on caption quality influenced how subsequent models approach training data curation.