A team of five universities develops a visual-guided 3D navigation framework for digital humans.

Summary

A collaborative team from Peking University, Carnegie Mellon University, Tongji University, UCLA, and the University of Michigan has developed VGHuman, a visual-guided AI framework that enables digital humans to navigate 3D environments using only visual perception. On a benchmark of 200 test scenarios, the system achieved task success rates up to 30 percentage points higher than leading baselines.

According to ME News, on April 14 (UTC+8), a collaborative team from Peking University, Carnegie Mellon University, Tongji University, UCLA, and the University of Michigan released VGHuman on arXiv, a grounded AI framework that enables digital humans to autonomously navigate unfamiliar 3D environments using only visual perception. Previous digital human systems largely relied on predefined scripts or privileged state information; VGHuman aims to give digital humans true eyes, allowing them to see, plan, and act for themselves.

The framework consists of two layers. The World Layer reconstructs a 3D Gaussian scene from monocular video, complete with semantic annotations and collision meshes; its occlusion-aware design enables accurate detection of small, partially obscured objects even in complex outdoor environments. The Agent Layer gives the digital human first-person RGB-D (color plus depth) perception, generates navigation plans through spatially aware visual prompts and iterative reasoning, and converts those plans into full-body motion sequences via a diffusion model (a rough illustrative sketch of this control flow appears at the end of this article).

On a navigation benchmark of 200 test scenarios spanning three difficulty levels (simple paths, obstacle avoidance, and dynamic pedestrian navigation), VGHuman achieved task success rates up to 30 percentage points higher than leading baselines such as NaVILA, NaVid, and Uni-NaVid, while maintaining or reducing collision rates. The framework also supports diverse movement styles, including running and jumping, as well as long-range planning to visit multiple targets in sequence.

Code and models are planned for open-source release; a GitHub repository has already been established. (Source: BlockBeats)
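The paper's implementation has not been released yet, so the following is a minimal, purely illustrative Python sketch of how the two-layer split described above might look in code: a `WorldLayer` standing in for the reconstructed scene (semantic object labels plus collision geometry, reduced here to a grid), and an `AgentLayer` that perceives from a first-person viewpoint, plans one step at a time, and would hand each waypoint to a motion model. All class names, the grid world, and the greedy planner are assumptions made for illustration, not the authors' actual design.

```python
# Illustrative sketch only; names and structure are assumptions, not VGHuman's real code.
from dataclasses import dataclass, field


@dataclass
class WorldLayer:
    """Reconstructed scene: semantic object locations plus a coarse collision grid."""
    semantics: dict                            # object name -> (x, y) grid cell
    blocked: set = field(default_factory=set)  # occupied cells (collision-mesh stand-in)

    def is_free(self, cell):
        return cell not in self.blocked


@dataclass
class AgentLayer:
    """First-person agent loop: perceive -> plan one step -> hand off to a motion model."""
    world: WorldLayer
    position: tuple = (0, 0)
    view_range: int = 8                        # stand-in for the egocentric field of view

    def perceive(self):
        # Stand-in for first-person RGB-D perception: report objects within range.
        x, y = self.position
        return {name: loc for name, loc in self.world.semantics.items()
                if abs(loc[0] - x) + abs(loc[1] - y) <= self.view_range}

    def plan_step(self, target):
        # Stand-in for visual-prompt, iterative-reasoning planning: greedily pick
        # the adjacent free cell that reduces Manhattan distance to the target.
        x, y = self.position
        moves = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        free = [c for c in moves if self.world.is_free(c)]
        return min(free,
                   key=lambda c: abs(c[0] - target[0]) + abs(c[1] - target[1]),
                   default=self.position)

    def navigate(self, goal_name, max_steps=50):
        for _ in range(max_steps):
            visible = self.perceive()
            if visible.get(goal_name) == self.position:
                return True                    # goal cell reached
            target = visible.get(goal_name, (0, 0))  # crude fallback if goal is unseen
            self.position = self.plan_step(target)
            # In the real framework, a diffusion motion model would turn each
            # waypoint into a full-body motion clip (walking, running, jumping).
        return False


world = WorldLayer(semantics={"bench": (4, 3), "fountain": (8, 1)},
                   blocked={(2, 0), (2, 1), (2, 2)})
agent = AgentLayer(world)
print("reached bench:", agent.navigate("bench"))
```

In the actual framework, perception is egocentric RGB-D imagery, planning runs through spatially aware visual prompts with iterative reasoning, and the final motion comes from a diffusion model; the stubs above only mirror the overall control flow.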
