Microsoft and Zhejiang University Introduce World-R1: Achieving 3D Consistency in Video Models Through Reinforcement Learning

Summary

Microsoft Research and Zhejiang University unveiled World-R1 on April 28, a reinforcement learning method that teaches text-to-video models 3D geometric consistency without requiring 3D datasets or architecture changes. The system reconstructs 3D Gaussians from each generated video using Depth Anything 3, renders the scene from new viewpoints, and compares the renders with the original footage. A reward combining reconstruction error, trajectory deviation, and semantic plausibility (scored by Qwen3-VL) is optimized via Flow-GRPO. The base models are the open-source Wan 2.1 (1.3B and 14B), trained on roughly 3,000 Gemini-generated text prompts. World-R1-Large improved PSNR by 7.91 dB and World-R1-Small by 10.23 dB over their base models. The code is available on GitHub under CC BY-NC-SA 4.0.
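PSNR (peak signal-to-noise ratio) is the metric behind the dB gains quoted above: it scores, in decibels, how closely one image matches another, with higher meaning closer. A minimal sketch of the standard computation, assuming 8-bit frames; the function and toy usage below are illustrative and not taken from the released code:

```python
import numpy as np

def psnr(original: np.ndarray, rendered: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two same-shaped images, in dB."""
    mse = np.mean((original.astype(np.float64) - rendered.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage: compare a random 64x64 RGB frame with a slightly perturbed copy.
a = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
b = np.clip(a + np.random.randint(-5, 6, a.shape), 0, 255).astype(np.uint8)
print(f"{psnr(a, b):.2f} dB")
```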

AIMPACT Update, April 28 (UTC+8): According to monitoring by BlockBeats, a team from Microsoft Research and Zhejiang University has introduced World-R1, a method that enables text-to-video models to learn 3D geometric consistency through reinforcement learning, without modifying the model architecture or relying on 3D datasets.

The core idea: after generating a video, reconstruct the scene as 3D Gaussians (3DGS) using the pretrained 3D foundation model Depth Anything 3, render the scene from new viewpoints, and compare the renders with the original video. The reward signal combines reconstruction error, trajectory deviation, and the semantic plausibility of the new viewpoints (scored by Qwen3-VL). This reward is fed back to the video model via Flow-GRPO, a reinforcement learning algorithm adapted for flow-matching models.

The base model is the open-source Wan 2.1 (1.3B and 14B parameters), yielding World-R1-Small and World-R1-Large respectively. Training used only approximately 3,000 pure text prompts generated by Gemini, with no 3D assets involved. Every 100 training steps, a "dynamic fine-tuning" phase is inserted that temporarily disables the 3D reward and retains only the visual quality reward, preventing the model from suppressing non-rigid dynamics such as human motion in pursuit of geometric rigidity.

On 3D consistency metrics, World-R1-Large improves PSNR (peak signal-to-noise ratio) by 7.91 dB over the base Wan 2.1 14B model, while the Small version improves by 10.23 dB. VBench overall video quality scores remain unchanged or improve. In a blind test with 25 participants, World-R1 achieved a 92% win rate for geometric consistency and an 86% overall preference rate. The code has been open-sourced on GitHub under the CC BY-NC-SA 4.0 license. (Source: BlockBeats)
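As a rough illustration of the training loop described above, the sketch below mirrors the control flow only: the heavy components (Wan 2.1 rollouts, Depth Anything 3 reconstruction, Qwen3-VL scoring, the Flow-GRPO update) are replaced by stubs, the three reward terms are assumed to be equally weighted, and the "dynamic fine-tuning" phase is shortened to a single step every 100 steps. None of the names or signatures come from the released code.

```python
import random

# Placeholder stubs for the heavy components; only the control flow mirrors the article.

def generate_video(prompt):
    """Stub for a text-to-video rollout (Wan 2.1 in World-R1)."""
    return {"prompt": prompt, "frames": [random.random() for _ in range(16)]}

def reward_3d(video):
    """Stub for the 3D-consistency reward: reconstruct 3D Gaussians with
    Depth Anything 3, re-render from new viewpoints, then combine
    reconstruction error, trajectory deviation, and Qwen3-VL plausibility."""
    recon_error = random.random()      # rendered views vs. original frames
    traj_deviation = random.random()   # deviation from the target camera path
    plausibility = random.random()     # Qwen3-VL score of the novel views
    return -recon_error - traj_deviation + plausibility  # equal weights assumed

def reward_visual_quality(video):
    """Stub for the visual-quality-only reward used during dynamic fine-tuning."""
    return random.random()

def flow_grpo_update(model, videos, rewards):
    """Stub for the Flow-GRPO policy update (RL adapted to flow-matching models)."""
    pass

model = None                                 # stands in for Wan 2.1 (1.3B or 14B)
prompts = [f"prompt {i}" for i in range(8)]  # ~3,000 Gemini text prompts in the paper

for step in range(300):
    batch = [generate_video(p) for p in random.sample(prompts, 4)]
    if step > 0 and step % 100 == 0:
        # "Dynamic fine-tuning": drop the 3D reward and keep only visual quality,
        # so non-rigid motion (e.g. people moving) is not suppressed.
        rewards = [reward_visual_quality(v) for v in batch]
    else:
        rewards = [reward_3d(v) for v in batch]
    flow_grpo_update(model, videos=batch, rewards=rewards)
```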
