Vision · February 15, 2024 · OpenAI

Video Generation Models as World Simulators

OpenAI

Abstract

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We find that scaling video generation models is a promising path towards building general purpose simulators of the physical world. Our largest model, Sora, is capable of generating a minute of high fidelity video.
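The report itself contains no code. As a rough illustration of what training a text-conditional diffusion model involves, the PyTorch sketch below shows a single epsilon-prediction training step. The model interface, the linear noise schedule, and the tensor shapes are assumptions made for this example, not details taken from Sora.

```python
# Hypothetical sketch of one text-conditional diffusion training step.
# The model signature, noise schedule, and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def diffusion_training_step(model, latents, text_emb, num_timesteps=1000):
    """One denoising-diffusion training step (epsilon-prediction).

    latents:  (B, T, C, H, W) video latents from a visual encoder (assumed)
    text_emb: (B, L, D) text-conditioning embeddings (assumed)
    """
    b = latents.shape[0]
    # Sample a random diffusion timestep for each example in the batch.
    t = torch.randint(0, num_timesteps, (b,), device=latents.device)
    # Simple linear noise schedule (illustrative; real schedules differ).
    alpha_bar = 1.0 - (t.float() + 1) / num_timesteps
    alpha_bar = alpha_bar.view(b, 1, 1, 1, 1)
    # Corrupt the clean latents with Gaussian noise at the sampled timestep.
    noise = torch.randn_like(latents)
    noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise
    # The model predicts the added noise, conditioned on timestep and text.
    pred = model(noisy, t, text_emb)
    return F.mse_loss(pred, noise)
```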

Key Findings

  1. Generated up to one minute of high-fidelity, coherent video from text
  2. Used a Transformer-based diffusion architecture on spacetime patches (see the sketch after this list)
  3. Demonstrated 3D consistency and understanding of physical interactions
  4. Showed emergent simulation capabilities of real-world physics
  5. Handled variable durations, resolutions, and aspect ratios
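
The underlying report describes compressing raw video into a lower-dimensional latent space and decomposing those latents into spacetime patches that act as transformer tokens. The sketch below shows one plausible way such patchification could work; the patch sizes, tensor layout, and function name are assumptions for illustration, not Sora's actual code.

```python
# Hypothetical sketch of turning video latents into spacetime patch tokens.
# Patch sizes and the flattening order are illustrative assumptions.
import torch

def to_spacetime_patches(latents, pt=2, ph=2, pw=2):
    """Split video latents (B, T, C, H, W) into a sequence of patch tokens.

    Each token covers pt frames x (ph x pw) latent pixels, so clips of
    different durations, resolutions, and aspect ratios simply yield
    different numbers of tokens.
    """
    b, t, c, h, w = latents.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    # Split the time and spatial axes into (grid, patch) pairs.
    x = latents.reshape(b, t // pt, pt, c, h // ph, ph, w // pw, pw)
    # Group the grid dimensions together, then flatten each patch into a token.
    x = x.permute(0, 1, 4, 6, 2, 3, 5, 7)        # (B, T', H', W', pt, C, ph, pw)
    tokens = x.reshape(b, -1, pt * c * ph * pw)  # (B, num_tokens, token_dim)
    return tokens

# Example: a short and a longer, larger clip produce token sequences of
# different lengths from the same patchifier.
short = to_spacetime_patches(torch.randn(1, 16, 4, 32, 32))
long_ = to_spacetime_patches(torch.randn(1, 32, 4, 64, 64))
print(short.shape, long_.shape)
```

Because each clip yields however many tokens its duration and resolution imply, a patch-based tokenization of this kind naturally accommodates variable durations, resolutions, and aspect ratios within one model.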

Impact & Significance

Sora marked a major advance in AI video generation, demonstrating that text-to-video diffusion models can keep scenes, objects, and physical interactions coherent over videos up to a minute long. It set a new bar for video generation quality and sparked intense competition in the AI video space.
