
VideoVerse: How Far is Your T2V Generator from a World Model?

Zeqing Wang1,4*, Xinyu Wei2,4*, Bairui Li2,4*, Zhen Guo2,4, Jinrui Zhang2,4, Hongyang Wei3,4, Keze Wang1†, Lei Zhang2,4†
1Sun Yat-sen University, 2Hong Kong Polytechnic University, 3Tsinghua University
4OPPO Research Institute

*Equal contribution †Corresponding authors

Abstract

The recent rapid advancement of Text-to-Video (T2V) generation, a technology critical to building "world models", has made existing benchmarks increasingly insufficient for evaluating state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, can no longer differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that evaluates whether a T2V model understands complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which independent annotators then rewrite into text-to-video prompts. For each prompt, we design a suite of binary evaluation questions covering dynamic and static properties, with ten carefully defined evaluation dimensions in total. Altogether, VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. On top of these, we develop a human-preference-aligned, QA-based evaluation pipeline using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing an in-depth analysis of how far current T2V generators are from world models.
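To make the evaluation protocol concrete, below is a minimal sketch of how a binary QA-based pipeline of this kind can be organized. The callables `t2v_generate` and `vlm_answer_yes` are hypothetical placeholders standing in for a T2V model and a vision-language judge; this is an illustration, not the released implementation.

```python
# Minimal sketch of a QA-based T2V evaluation loop (hypothetical API names).
# For each prompt, the T2V model generates a video, a vision-language model
# answers each binary question with yes/no, and "yes" answers are credited
# to the question's evaluation dimension.
from collections import defaultdict

def evaluate_model(t2v_generate, vlm_answer_yes, benchmark):
    """benchmark: iterable of (prompt, questions); each question is a
    (question_text, dimension) pair. Both callables are placeholders."""
    scores = defaultdict(int)
    for prompt, questions in benchmark:
        video = t2v_generate(prompt)                  # text -> video
        for question_text, dimension in questions:
            if vlm_answer_yes(video, question_text):  # binary VLM judgment
                scores[dimension] += 1
    scores["Overall"] = sum(scores.values())          # total over dimensions
    return dict(scores)
```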

Evolution Demos

VideoVerse Benchmark Leaderboard (evaluated by Gemini 2.5 Pro)

Dynamic dimensions: Event Following (EF), Camera Control (CC), Interaction (Int.), Mechanics (Mech.), Material Properties (MP), Natural Constraints (NC), Common Sense (CS). Static dimensions: Attribute Correctness (AC), 2D Layout (2D), 3D Depth (3D).

| Model | Overall | EF | CC | Int. | Mech. | MP | NC | CS | AC | 2D | 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | | | | | | |
| CogVideoX1.5 (S) | 894 | 424 | 36 | 32 | 20 | 13 | 36 | 40 | 177 | 65 | 51 |
| CogVideoX1.5 (L) | 893 | 426 | 37 | 32 | 22 | 14 | 38 | 37 | 182 | 58 | 47 |
| SkyReels-V2 (S) | 939 | 484 | 43 | 31 | 27 | 9 | 31 | 43 | 160 | 61 | 50 |
| SkyReels-V2 (L) | 968 | 511 | 37 | 36 | 26 | 12 | 36 | 35 | 168 | 61 | 46 |
| Wan2.1-14B | 969 | 496 | 43 | 29 | 26 | 10 | 35 | 45 | 167 | 67 | 51 |
| Hunyuan | 898 | 446 | 38 | 30 | 24 | 14 | 38 | 42 | 159 | 60 | 47 |
| OpenSora2.0 | 989 | 482 | 47 | 29 | 27 | 14 | 48 | 49 | 181 | 61 | 51 |
| Wan2.2-A14B | 1085 | 567 | 60 | 34 | 32 | 17 | 37 | 43 | 184 | 63 | 48 |
| **Closed-Source Models** | | | | | | | | | | | |
| Minimax-Hailuo | 1203 | 623 | 75 | 38 | 30 | 22 | 54 | 52 | 187 | 68 | 54 |
| Veo-3 | 1292 | 680 | 76 | 43 | 40 | 21 | 67 | 57 | 187 | 67 | 54 |
| Sora2* | 1299 | 689 | 72 | 51 | 42 | 21 | 64 | 63 | 177 | 66 | 54 |

*Due to Sora2's security review, some videos could not be generated. Despite this, Sora2 still achieves state-of-the-art overall performance.
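Reading the table: each model's Overall score is the sum of its ten per-dimension scores (seven dynamic plus three static). A quick sanity check on the Veo-3 row, using the numbers above:

```python
# Sanity check on the leaderboard layout: Overall equals the sum of the
# ten per-dimension scores (7 dynamic + 3 static). The values below are
# the Veo-3 row from the table above.
DYNAMIC = ["Event Following", "Camera Control", "Interaction", "Mechanics",
           "Material Properties", "Natural Constraints", "Common Sense"]
STATIC = ["Attribute Correctness", "2D Layout", "3D Depth"]

veo3 = dict(zip(DYNAMIC + STATIC, [680, 76, 43, 40, 21, 67, 57, 187, 67, 54]))
assert sum(veo3.values()) == 1292  # matches Veo-3's Overall column
```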

BibTeX

@article{wang2025videoverse,
  title={VideoVerse: How Far is Your T2V Generator from a World Model?},
  author={Wang, Zeqing and Wei, Xinyu and Li, Bairui and Guo, Zhen and Zhang, Jinrui and Wei, Hongyang and Wang, Keze and Zhang, Lei},
  journal={arXiv preprint arXiv:2510.08398},
  year={2025}
}