Vision · February 15, 2024 · OpenAI
Video Generation Models as World Simulators
OpenAI
Abstract
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions, and aspect ratios. We find that scaling video generation models is a promising path towards building general-purpose simulators of the physical world. Our largest model, Sora, is capable of generating a minute of high-fidelity video.
Key Findings
- Generated up to one minute of high-fidelity, coherent video from text
- Used a Transformer-based diffusion architecture on spacetime patches
- Demonstrated 3D consistency and understanding of physical interactions
- Showed emergent simulation capabilities of real-world physics
- Handled variable durations, resolutions, and aspect ratios
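To make the "spacetime patches" idea in the findings above concrete, here is a minimal sketch of how a video tensor can be cut into patch tokens spanning both space and time. The function name, patch sizes, and tensor shapes are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def spacetime_patches(video, pt=2, ph=16, pw=16):
    """Split a (T, H, W, C) video into flattened spacetime patch tokens.

    pt, ph, pw are illustrative patch sizes along time, height, and width;
    the real model's tokenization operates in a learned latent space.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Carve each axis into (num_patches, patch_size) pairs.
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Group the three patch-index axes together, then the three patch-content axes.
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten to a sequence of tokens, one per spacetime patch.
    return v.reshape(-1, pt * ph * pw * C)

frames = np.zeros((8, 64, 64, 3), dtype=np.float32)  # tiny dummy clip
tokens = spacetime_patches(frames)
print(tokens.shape)  # (64, 1536): 4*4*4 patches, each 2*16*16*3 values
```

Because the token count depends only on the input's shape, the same scheme handles variable durations, resolutions, and aspect ratios without resizing, which is the property the report highlights.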
Impact & Significance
Sora represented a paradigm shift in AI video generation, showing that text-to-video models can simulate coherent physical worlds. It set a new bar for video generation quality and sparked intense competition in the AI video space.
Related Papers
- The Llama 3 Herd of Models, Meta AI (LLM, July 23, 2024)
- Qwen2 Technical Report, Alibaba Cloud / Qwen Team (LLM, July 15, 2024)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, DeepSeek AI (Efficiency, May 7, 2024)
- The Claude 3 Model Family: Opus, Sonnet, and Haiku, Anthropic (LLM, March 4, 2024)