Vision · February 15, 2024 · OpenAI

Video Generation Models as World Simulators

OpenAI

Abstract

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We find that scaling video generation models is a promising path towards building general purpose simulators of the physical world. Our largest model, Sora, is capable of generating a minute of high fidelity video.
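The report itself contains no code. As a rough illustration of what training a text-conditional diffusion model involves, the PyTorch sketch below shows a single epsilon-prediction training step. The model interface, the linear noise schedule, and the tensor shapes are assumptions made for this example, not details taken from Sora.

```python
# Hypothetical sketch of one text-conditional diffusion training step.
# The model signature, noise schedule, and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def diffusion_training_step(model, latents, text_emb, num_timesteps=1000):
    """One denoising-diffusion training step (epsilon-prediction).

    latents:  (B, T, C, H, W) video latents from a visual encoder (assumed)
    text_emb: (B, L, D) text-conditioning embeddings (assumed)
    """
    b = latents.shape[0]
    # Sample a random diffusion timestep for each example in the batch.
    t = torch.randint(0, num_timesteps, (b,), device=latents.device)
    # Simple linear noise schedule (illustrative; real schedules differ).
    alpha_bar = 1.0 - (t.float() + 1) / num_timesteps
    alpha_bar = alpha_bar.view(b, 1, 1, 1, 1)
    # Corrupt the clean latents with Gaussian noise at the sampled timestep.
    noise = torch.randn_like(latents)
    noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise
    # The model predicts the added noise, conditioned on timestep and text.
    pred = model(noisy, t, text_emb)
    return F.mse_loss(pred, noise)
```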

Key Findings

  1. Generated up to one minute of high-fidelity, coherent video from text
  2. Used a Transformer-based diffusion architecture on spacetime patches (see the sketch after this list)
  3. Demonstrated 3D consistency and understanding of physical interactions
  4. Showed emergent simulation capabilities of real-world physics
  5. Handled variable durations, resolutions, and aspect ratios
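
The underlying report describes compressing raw video into a lower-dimensional latent space and decomposing those latents into spacetime patches that act as transformer tokens. The sketch below shows one plausible way such patchification could work; the patch sizes, tensor layout, and function name are assumptions for illustration, not Sora's actual code.

```python
# Hypothetical sketch of turning video latents into spacetime patch tokens.
# Patch sizes and the flattening order are illustrative assumptions.
import torch

def to_spacetime_patches(latents, pt=2, ph=2, pw=2):
    """Split video latents (B, T, C, H, W) into a sequence of patch tokens.

    Each token covers pt frames x (ph x pw) latent pixels, so clips of
    different durations, resolutions, and aspect ratios simply yield
    different numbers of tokens.
    """
    b, t, c, h, w = latents.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    # Split the time and spatial axes into (grid, patch) pairs.
    x = latents.reshape(b, t // pt, pt, c, h // ph, ph, w // pw, pw)
    # Group the grid dimensions together, then flatten each patch into a token.
    x = x.permute(0, 1, 4, 6, 2, 3, 5, 7)        # (B, T', H', W', pt, C, ph, pw)
    tokens = x.reshape(b, -1, pt * c * ph * pw)  # (B, num_tokens, token_dim)
    return tokens

# Example: a short and a longer, larger clip produce token sequences of
# different lengths from the same patchifier.
short = to_spacetime_patches(torch.randn(1, 16, 4, 32, 32))
long_ = to_spacetime_patches(torch.randn(1, 32, 4, 64, 64))
print(short.shape, long_.shape)
```

Because each clip yields however many tokens its duration and resolution imply, a patch-based tokenization of this kind naturally accommodates variable durations, resolutions, and aspect ratios within one model.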

Impact & Significance

Sora marked a major advance in AI video generation, demonstrating that text-to-video diffusion models can keep scenes, objects, and physical interactions coherent over videos up to a minute long. It set a new bar for video generation quality and sparked intense competition in the AI video space.
