PAPER_TITLE

FIRST_AUTHOR_LAST, FIRST_AUTHOR_FIRST; SECOND_AUTHOR_LAST, SECOND_AUTHOR_FIRST

VideoVerse: How Far is Your T2V Generator from a World Model?

Zeqing Wang^1,4*, Xinyu Wei^2,4*, Bairui Li^2,4*, Zhen Guo^2,4, Jinrui Zhang^2,4, Hongyang Wei^3,4, Keze Wang^1†, Lei Zhang^2,4†

¹Sun Yat-sen University, ²Hong Kong Polytechnic University, ³Tsinghua University
⁴OPPO Research Institute
^*Equal contribution ^†Corresponding Author

Code arXiv 🤗 VideoVerse Data

Abstract

The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to build "world models", makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which are essential capabilities for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human preference aligned QA-based evaluation pipeline is developed by using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis on how far the current T2V generators are from world models.

Evolution Demos

VideoVerse Benchmark Leaderboard (Evaluated By Gemini2.5 Pro).

*Due to Sora's security review, some videos are not successfully generated. Despite this, Sora2 still achieved the SOTA performance.

Model	Overall	Dynamic					Static
Model	Overall	Event Following	Camera Control	Interaction	Mechanics	Material Properties	Natural Constra.	Common Sense	Attr. Correct.	2D Layout	3D Depth
Open-Source Models
CogVideoX1.5 (S)	894	424	36	32	20	13	36	40	177	65	51
CogVideoX1.5 (L)	893	426	37	32	22	14	38	37	182	58	47
SkyReels-V2 (S)	939	484	43	31	27	9	31	43	160	61	50
SkyReels-V2 (L)	968	511	37	36	26	12	36	35	168	61	46
Wan2.1-14B	969	496	43	29	26	10	35	45	167	67	51
Hunyuan	898	446	38	30	24	14	38	42	159	60	47
OpenSora2.0	989	482	47	29	27	14	48	49	181	61	51
Wan2.2-A14B	1085	567	60	34	32	17	37	43	184	63	48
Closed-Source Models
Minimax-Hailuo	1203	623	75	38	30	22	54	52	187	68	54
Veo-3	1292	680	76	43	40	21	67	57	187	67	54
Sora2*	1299	689	72	51	42	21	64	63	177	66	54

BibTeX

@article{wang2025videoverse,
  title={VideoVerse: How Far is Your T2V Generator from a World Model?},
  author={Wang, Zeqing and Wei, Xinyu and Li, Bairui and Guo, Zhen and Zhang, Jinrui and Wei, Hongyang and Wang, Keze and Zhang, Lei},
  journal={arXiv preprint arXiv:2510.08398},
  year={2025}
}

More Works from Our Lab

TIIF Bench

VideoVerse: How Far is Your T2V Generator from a World Model?

Abstract

Evolution Demos

VideoVerse Benchmark Leaderboard (Evaluated By Gemini2.5 Pro).

BibTeX