Xiaomi Launches JointWM Framework for Autonomous Driving, Sets New Benchmark Records

According to Beating Monitor, Xiaomi Auto has officially launched the Xiaomi EV World Model, a new world-modeling framework for assisted driving that, for the first time, deeply integrates the 3D reconstruction and video generation modules. In autonomous driving simulation, traditional techniques typically decouple reconstruction from generation: reconstruction modules can rebuild observed scenes but cannot predict how they will change, while generation modules can forecast future states but suffer from distortion and drift over long time sequences. The team proposes the JointWM architecture, which uses 3D geometric structure as a physical skeleton to anchor the scene, then employs the generation module to fill in visual details and predict unobserved regions. It sets new state-of-the-art results on major benchmarks such as Waymo and nuScenes.
Specifically, the reconstruction module, WorldRec, abandons the traditional pixel-by-pixel approach and instead represents the scene with sparse 3D query points, incrementally fusing them into a cross-view 4D Gaussian spatial skeleton. This enables rapid reconstruction of 10 seconds of video in just 10 seconds. Drawing on the geometric priors supplied by the reconstruction module, the generation module, WorldGen, operates strictly within the physical boundaries of the skeleton and focuses solely on generating plausible lighting and textures. For content beyond those boundaries, such as future frames or blind spots, the generation module makes physical predictions through two-stage temporal training and a distribution-matching distillation mechanism.
On an H20 GPU, the full architecture generates a single view in 0.19 seconds and three views in 0.46 seconds, and it supports video generation up to one minute in length. The solution achieves a PSNR of 28.48 in Waymo reconstruction accuracy tests and maintains its lead in zero-shot generalization on nuScenes.
In terms of generation efficiency, JointWM is 5.6 times faster than the autoregressive baseline Epona, and it ranks among the top of comparable algorithms in spatiotemporal coherence. The research has already been deployed in three key scenarios at Xiaomi Auto: delivering over 100,000 high-quality synthetic data segments for perception model training, building highly realistic closed-loop simulation environments that reproduce long-tail driving scenarios, and launching an Assisted Driving Academy that uses generated video to guide users through operations.
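For readers unfamiliar with the reconstruction metric quoted above, the following is a minimal sketch of how PSNR is conventionally computed and what a score of 28.48 implies. It is not Xiaomi's evaluation code: the report does not say how pixel values are scaled, so the sketch assumes the standard definition with intensities normalised to [0, 1], and the function names `psnr` and `rmse_for_psnr` are hypothetical.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images.

    Assumes both inputs hold pixel intensities in [0, max_value];
    the article does not state the scaling used in the Waymo tests.
    """
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_for_psnr(psnr_db: float, max_value: float = 1.0) -> float:
    """Invert the PSNR formula: the root-mean-square pixel error for a given score."""
    return max_value * 10.0 ** (-psnr_db / 20.0)


if __name__ == "__main__":
    # RMS error implied by the reported score, under the [0, 1] assumption.
    print(round(rmse_for_psnr(28.48), 4))
```

Under these assumptions, a PSNR of 28.48 corresponds to a root-mean-square pixel error of roughly 0.038, or a little under 4% of the full intensity range. Higher PSNR means lower error.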