Microsoft and Zhejiang University Introduce World-R1: Achieving 3D Consistency in Video Models Through Reinforcement Learning

Summary

Microsoft Research and Zhejiang University unveiled World-R1 on April 28, a reinforcement learning method that teaches text-to-video models 3D geometric consistency without requiring 3D datasets or architecture changes. The system reconstructs 3D Gaussians from each generated video using Depth Anything 3, renders the scene from new viewpoints, and compares the renders with the original footage. A reward combining reconstruction error, trajectory deviation, and semantic plausibility (scored by Qwen3-VL) is optimized via Flow-GRPO. The base models are the open-source Wan 2.1 (1.3B and 14B), trained on roughly 3,000 Gemini-generated text prompts. World-R1-Large improved PSNR by 7.91 dB and World-R1-Small by 10.23 dB over their base models. The code is available on GitHub under CC BY-NC-SA 4.0.
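PSNR (peak signal-to-noise ratio) is the metric behind the dB gains quoted above: it scores, in decibels, how closely one image matches another, with higher meaning closer. A minimal sketch of the standard computation, assuming 8-bit frames; the function and toy usage below are illustrative and not taken from the released code:

```python
import numpy as np

def psnr(original: np.ndarray, rendered: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two same-shaped images, in dB."""
    mse = np.mean((original.astype(np.float64) - rendered.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage: compare a random 64x64 RGB frame with a slightly perturbed copy.
a = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
b = np.clip(a + np.random.randint(-5, 6, a.shape), 0, 255).astype(np.uint8)
print(f"{psnr(a, b):.2f} dB")
```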

AIMPACT Update, April 28 (UTC+8): According to monitoring by BlockBeats, a team from Microsoft Research and Zhejiang University has introduced World-R1, a method that enables text-to-video models to learn 3D geometric consistency through reinforcement learning, without modifying the model architecture or relying on 3D datasets.

The core idea: after generating a video, reconstruct the scene as 3D Gaussians (3DGS) using the pretrained 3D foundation model Depth Anything 3, render the scene from new viewpoints, and compare the renders with the original video. The reward signal combines reconstruction error, trajectory deviation, and the semantic plausibility of the new viewpoints (scored by Qwen3-VL). This reward is fed back to the video model via Flow-GRPO, a reinforcement learning algorithm adapted for flow-matching models.

The base model is the open-source Wan 2.1 (1.3B and 14B parameters), yielding World-R1-Small and World-R1-Large respectively. Training used only approximately 3,000 pure text prompts generated by Gemini, with no 3D assets involved. Every 100 training steps, a "dynamic fine-tuning" phase is inserted that temporarily disables the 3D reward and retains only the visual quality reward, preventing the model from suppressing non-rigid dynamics such as human motion in pursuit of geometric rigidity.

On 3D consistency metrics, World-R1-Large improves PSNR (peak signal-to-noise ratio) by 7.91 dB over the base Wan 2.1 14B model, while the Small version improves by 10.23 dB. VBench overall video quality scores remain unchanged or improve. In a blind test with 25 participants, World-R1 achieved a 92% win rate for geometric consistency and an 86% overall preference rate. The code has been open-sourced on GitHub under the CC BY-NC-SA 4.0 license. (Source: BlockBeats)
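As a rough illustration of the training loop described above, the sketch below mirrors the control flow only: the heavy components (Wan 2.1 rollouts, Depth Anything 3 reconstruction, Qwen3-VL scoring, the Flow-GRPO update) are replaced by stubs, the three reward terms are assumed to be equally weighted, and the "dynamic fine-tuning" phase is shortened to a single step every 100 steps. None of the names or signatures come from the released code.

```python
import random

# Placeholder stubs for the heavy components; only the control flow mirrors the article.

def generate_video(prompt):
    """Stub for a text-to-video rollout (Wan 2.1 in World-R1)."""
    return {"prompt": prompt, "frames": [random.random() for _ in range(16)]}

def reward_3d(video):
    """Stub for the 3D-consistency reward: reconstruct 3D Gaussians with
    Depth Anything 3, re-render from new viewpoints, then combine
    reconstruction error, trajectory deviation, and Qwen3-VL plausibility."""
    recon_error = random.random()      # rendered views vs. original frames
    traj_deviation = random.random()   # deviation from the target camera path
    plausibility = random.random()     # Qwen3-VL score of the novel views
    return -recon_error - traj_deviation + plausibility  # equal weights assumed

def reward_visual_quality(video):
    """Stub for the visual-quality-only reward used during dynamic fine-tuning."""
    return random.random()

def flow_grpo_update(model, videos, rewards):
    """Stub for the Flow-GRPO policy update (RL adapted to flow-matching models)."""
    pass

model = None                                 # stands in for Wan 2.1 (1.3B or 14B)
prompts = [f"prompt {i}" for i in range(8)]  # ~3,000 Gemini text prompts in the paper

for step in range(300):
    batch = [generate_video(p) for p in random.sample(prompts, 4)]
    if step > 0 and step % 100 == 0:
        # "Dynamic fine-tuning": drop the 3D reward and keep only visual quality,
        # so non-rigid motion (e.g. people moving) is not suppressed.
        rewards = [reward_visual_quality(v) for v in batch]
    else:
        rewards = [reward_3d(v) for v in batch]
    flow_grpo_update(model, videos=batch, rewards=rewards)
```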
