Multi-X, a team under OPPO, has released X-OmniClaw, an open-source AI agent framework for Android. The project follows an "edge-first" design: core control, perception, and execution run locally on the device, and cloud-hosted large models are called only for complex reasoning tasks.
This framework is designed for scenarios where the phone serves as a continuous AI assistant, rather than a one-time Q&A chat tool. According to OPPO’s disclosed design, the system can understand the current environment by integrating camera input, screen content, and voice input, then directly perform actions within real apps.
Core capabilities run on the device
Many mobile AI systems currently run in the cloud, driving virtual Android environments on servers to mimic user actions. This simplifies centralized deployment, but it cuts the agent off from the user's real camera, photo gallery, and local files.
X-OmniClaw takes the opposite approach. The technical report states that the framework runs directly on users' physical devices, narrowing the gap between virtual environments and real-world usage. OPPO describes the architecture as three components (perception, execution, and memory) that form a continuous loop.
- The perception layer integrates cameras, screens, and voice input.
- The execution layer is responsible for identifying interfaces and completing clicks and navigation.
- The memory layer stores contextual information across tasks and sessions.
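The three layers above can be pictured as one step of a loop. The following is a minimal sketch for illustration only, not code from the X-OmniClaw repository: the names `Observation`, `Memory`, `perceive`, `decide`, and `run_step` are hypothetical, and the keyword check in `decide` stands in for whatever model the real framework calls.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Hypothetical snapshot fused from the three input channels."""
    camera_caption: str
    screen_text: str
    voice_request: str


@dataclass
class Memory:
    """Hypothetical store that carries context across steps."""
    events: list = field(default_factory=list)

    def remember(self, observation, action):
        self.events.append((observation, action))


def perceive(camera_caption, screen_text, voice_request):
    # Perception: merge camera, screen, and voice input into one observation.
    return Observation(camera_caption, screen_text, voice_request)


def decide(observation, memory):
    # Stand-in for the reasoning step; a real agent would query a model here.
    if "price" in observation.voice_request.lower():
        return f"search:{observation.camera_caption}"
    return "wait"


def run_step(memory, camera_caption, screen_text, voice_request):
    # One pass through the loop: perceive, choose an action, then record it.
    observation = perceive(camera_caption, screen_text, voice_request)
    action = decide(observation, memory)
    memory.remember(observation, action)
    return action
```

Calling `run_step` repeatedly with the same `Memory` instance lets each step leave a trace that later steps could consult, which is the sense in which the three components form a loop.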
Recognizing on-screen and real-world scenes
In the perception phase, the system first uses a vision-language model to understand the current scene, then determines the next action. For example, if a user points the camera at a product and asks for its price, the agent first identifies the object, then opens the corresponding shopping app and starts a search, rather than guessing from the text of the instruction alone.
The execution component combines XML interface data, on-device vision models, and OCR capabilities to determine exactly where to click on the page. Even when the interface contains numerous ads or incomplete structural information, the system can use visual recognition to assist in locating the target area.
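One way to read this is as a fallback chain: try the structured interface data first, and use text found visually when the structure is missing or unhelpful. The sketch below assumes the XML resembles an Android `uiautomator` dump (`node` elements with `text`, `clickable`, and `bounds` attributes) and that OCR results arrive as `(text, (left, top, right, bottom))` pairs. Both input shapes and all function names are assumptions for illustration; the report does not describe X-OmniClaw's actual data formats.

```python
import re
import xml.etree.ElementTree as ET

# Bounds in a uiautomator-style dump look like "[left,top][right,bottom]".
BOUNDS = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def center(box):
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


def find_in_xml(xml_dump, label):
    """Return the center of a clickable node whose text equals the label."""
    for node in ET.fromstring(xml_dump).iter("node"):
        text = node.get("text", "").strip().lower()
        if node.get("clickable") == "true" and text == label.lower():
            match = BOUNDS.fullmatch(node.get("bounds", ""))
            if match:
                return center(tuple(int(v) for v in match.groups()))
    return None


def find_in_ocr(ocr_results, label):
    """Return the center of the first OCR box whose text contains the label."""
    for text, box in ocr_results:
        if label.lower() in text.lower():
            return center(box)
    return None


def locate_tap_target(xml_dump, ocr_results, label):
    # Prefer structured data; fall back to OCR when the XML has no match.
    point = find_in_xml(xml_dump, label)
    if point is None:
        point = find_in_ocr(ocr_results, label)
    return point
```

The XML path is exact and cheap, so it goes first; the OCR path tolerates pages where the hierarchy is incomplete, at the cost of looser matching.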
OPPO has also added behavior cloning capabilities. If a user manually demonstrates the path to a deeply nested page once, the system can later use an Android deep link to reach that page directly, cutting out the repeated steps.
Cross-session semantic memory
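The idea can be sketched as a small cache that maps a demonstrated task to a deep link. This is an illustrative assumption, not OPPO's implementation: the class `DeeplinkCache`, the task name, and the `exampleshop://` URI are hypothetical, and the report does not say how the framework derives the link from a demonstration. The only real-world detail used is the standard Android activity-manager invocation `am start -a android.intent.action.VIEW -d <uri>`, which the sketch builds as an argument list without running it.

```python
class DeeplinkCache:
    """Hypothetical store of shortcuts learned from one-time demonstrations."""

    def __init__(self):
        self._links = {}

    def record_demonstration(self, task, taps, final_uri):
        # Keep the demonstrated tap path for reference, and store the
        # destination URI as the shortcut to use next time.
        self._links[task] = {"taps": list(taps), "uri": final_uri}

    def plan(self, task):
        # Return the arguments for an Android VIEW intent, or None if the
        # task has never been demonstrated.
        entry = self._links.get(task)
        if entry is None:
            return None
        return ["am", "start", "-a", "android.intent.action.VIEW", "-d", entry["uri"]]


cache = DeeplinkCache()
cache.record_demonstration(
    "order history",
    ["Me", "Orders", "History"],       # taps the user performed once
    "exampleshop://orders/history",    # hypothetical deep link URI
)
```

After the single demonstration, `cache.plan("order history")` yields one command in place of three taps, while an unknown task returns `None` so the agent can fall back to step-by-step navigation.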
Unlike conventional chatbots, X-OmniClaw emphasizes long-term semantic memory. The system not only retains context within a single task but also generates structured records about objects, scenes, and events based on album content, enabling future retrieval and execution.
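A structured record of this kind might look like the sketch below. The field names (`objects`, `scene`, `event`), the `SemanticMemory` class, and the keyword-based `search` are assumptions made for illustration; the report does not publish a schema, and a real system would more likely retrieve by semantic similarity than by substring match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhotoRecord:
    """Hypothetical structured record generated for one album photo."""
    photo_id: str
    objects: tuple
    scene: str
    event: str


class SemanticMemory:
    """Hypothetical long-term store that outlives a single task."""

    def __init__(self):
        self._records = []

    def add(self, record):
        self._records.append(record)

    def search(self, keyword):
        # Return photo ids whose objects, scene, or event mention the keyword.
        keyword = keyword.lower()
        hits = []
        for record in self._records:
            fields = list(record.objects) + [record.scene, record.event]
            if any(keyword in value.lower() for value in fields):
                hits.append(record.photo_id)
        return hits


memory = SemanticMemory()
memory.add(PhotoRecord("img_001", ("parrot", "cage"), "living room", "feeding time"))
memory.add(PhotoRecord("img_002", ("bicycle",), "park", "weekend ride"))
memory.add(PhotoRecord("img_003", ("child",), "zoo", "parrot show"))
```

Because the records are built once and stored, a later request can query them with `memory.search(...)` without rescanning the album.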
OPPO demonstrated use cases including math problem assistance and album video generation. The former reads on-screen questions through a floating window, works through them step by step, and automatically moves on to the next question; the latter filters relevant images from the album in response to requests such as “parrot-themed photos,” then uses a deep link to open CapCut and batch-generate videos.
This signals that the role of the mobile AI agent is shifting from single-turn Q&A to continuous assistance. The report notes that X-OmniClaw was built on the open-source HermesApp codebase and borrows elements of its skill structure design from OpenClaw. The project code has been released on GitHub, and OPPO plans to keep publishing related resources and releasing updated versions.
