Android, iOS, HarmonyOS, and Windows Enter the Agent Era with OS-Level AI Integration

Article by Yunyong AI, Author | Huang Yunhao

I. After Google I/O 2026: Four Major Edge OSes Enter the Agent Era

On May 12, 2026, Google held the Android Show | I/O Edition, an Android-focused event ahead of the I/O conference on May 19. Sameer Samat, President of the Android Ecosystem, set the tone for the event: Android is evolving from an operating system into an intelligent system. Central to this vision is Gemini Intelligence—a proactive AI capability integrated at the Android system level.

2026 Android Show | I/O Edition Launch Poster
Source: Android Headlines

Compared to last year’s Gemini Nano + AICore combination, Google has now further embedded the Agent’s ability to operate across apps and contexts at the OS level: cross-app task automation (ordering food, shopping, placing orders), automatic form filling, web summarization, and customizable widgets have been systematically added to the list of system-level capabilities. Google has also designated explicit user control, comprehensive data protection, and operational transparency as its three core product principles.

One week later, on May 19, Google CEO Sundar Pichai opened the I/O keynote with this theme:

Welcome to the agentic Gemini era

Google was not among the first to join the wave of on-device OS agentization.

In May 2024 at Build 2024, Microsoft launched Copilot+ PCs—a new category of Windows 11 devices featuring NPUs with over 40 TOPS—integrating Agent capabilities into the OS through three key features: the on-device small model Phi Silica, the screen Agent capability Click to Do, and the system-level activity memory Recall.

At WWDC24 in June 2024, Apple officially announced "Apple Intelligence," positioning it as a "personal intelligence system." Since then, it has gradually rolled out some AI-assisted features; however, due to delays in its own large model development and Siri's limited capabilities, the core Agent functionality of Apple Intelligence has yet to be released.

Huawei launched HarmonyOS 6 and the HarmonyOS Intelligent Agent Framework (HMAF) at HDC 2025 in June 2025, followed by the launch of over 80 intelligent agents on the Xiao Yi Intelligent Agent Plaza.

The major trend toward OS agentization on the edge is now evident across mainstream operating systems such as Android, iOS, HarmonyOS, and Windows.

The launch demos showcased only features; what OS vendors truly compete on is the three-layer foundation that lets an OS Agent run reliably and solve real-world problems: a system-level AI runtime, controllable chips, and an edge-cloud model matrix.

II. Beneath the Launch: The Three-Layer Foundation Supporting OS Agent

System-level AI Runtime: The Orchestration Hub for Edge-side Intelligence

The runtime is the inference engine and system service that runs on-device models within the operating system. It interfaces directly with the NPU and system resource scheduling below, and exposes stable APIs to all apps above. It turns on-device models into “OS-level shared intelligence”: sharing model weights across apps, scheduling compute and memory in a unified way, supporting the tool calls and guided generation that agents require, and integrating context and permissions. It determines whether an OS Agent is merely a chat button within an app or a persistent service capable of executing system-level operations.
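To make the "shared intelligence" idea concrete, here is a minimal Python sketch of a system service that loads model weights once, serves several apps, and checks permissions before a tool call. Every name in it (SharedModelRuntime, register_tool, grant, call_tool) is hypothetical and chosen for illustration; it does not reproduce AICore or any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class SharedModelRuntime:
    """Toy system service: one copy of the model weights serves every app."""
    load_count: int = 0                     # how many times weights were loaded
    _weights: Optional[str] = None          # stand-in for the on-device model
    _tools: Dict[str, Callable[..., str]] = field(default_factory=dict)
    _grants: Dict[str, Set[str]] = field(default_factory=dict)

    def _ensure_loaded(self) -> None:
        # Weights are loaded lazily and only once, whichever app calls first.
        if self._weights is None:
            self._weights = "weights-in-memory"
            self.load_count += 1

    def register_tool(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def grant(self, app_id: str, tool: str) -> None:
        self._grants.setdefault(app_id, set()).add(tool)

    def call_tool(self, app_id: str, tool: str, **kwargs) -> str:
        # The permission check lives in the runtime, not in each app.
        if tool not in self._grants.get(app_id, set()):
            raise PermissionError(f"{app_id} may not call {tool}")
        self._ensure_loaded()
        return self._tools[tool](**kwargs)


runtime = SharedModelRuntime()
runtime.register_tool("summarize", lambda text: text[:12] + "...")
runtime.grant("notes_app", "summarize")
runtime.grant("mail_app", "summarize")
runtime.call_tool("notes_app", "summarize", text="Meeting notes from Monday")
runtime.call_tool("mail_app", "summarize", text="Quarterly report attached")
print(runtime.load_count)  # 1: both apps reused the same loaded weights
```

The point of the sketch is the ownership boundary: apps hold no model of their own, so memory cost and permission policy are decided in one place.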

The most complete example within the Android ecosystem is Google AICore. In December 2023, AICore launched as a system service in Android 14; in August 2025, Gemini Nano became available to developers via ML Kit GenAI APIs. From its foundation as a system service to stable APIs for apps, AICore took nearly two years of continuous refinement.

Other OS vendors are following the same path, just at different paces. Apple opened the Foundation Models framework to developers at WWDC25; it comes with the @Generable macro, tool calling, guided generation, and stateful sessions built in, backed by an on-device foundation model with approximately 3 billion parameters and cloud support via Private Cloud Compute. Microsoft integrated the on-device AI framework Foundry on Windows and Phi Silica into Windows 11, using Windows ML as the underlying inference backend. Huawei unveiled the Agent Framework Kit (HarmonyOS Intelligent Agent Framework, HMAF) at HDC 2025, opening up its intent system and Agent collaboration protocol.

Android AICore, as a system service, orchestrates Gemini Nano inference on hardware accelerators.
Source: Android Developers

Controllable Chips: The Pivot of Software-Hardware Synergy

Google set clear hardware requirements for Gemini Intelligence at the Android Show | I/O Edition: the full feature set will launch exclusively on a select few of the latest flagships, such as the Pixel 10 series and Galaxy S26 series, excluding all models from last year. This points to a simple truth: AI models are still evolving rapidly, and software continues to demand more from hardware. Controllable chips are the foundation for meeting these demands, and the degree of control determines how much room OS vendors have to adapt their on-device OS agents to the hardware.

Apple is a prime example of the integrated hardware-software approach. iOS and macOS have evolved alongside the A-series and M-series chips from the outset, while Core ML abstracts the scheduling of CPU, GPU, and ANE into the framework layer. This approach continues to extend into the LLM era. Apple Machine Learning Research has published benchmark results showing that, following Core ML’s optimization path, deploying Llama 3.1 8B Instruct on an M1 Max achieves a local decoding speed of approximately 33 tokens/s. The technical report “Apple Intelligence Foundation Language Models” also reveals that Apple implemented architecture-level optimizations such as KV cache sharing and 2-bit quantization-aware training specifically for its own chips, enabling the successful release of a ~3B on-device foundation model to developers via the Foundation Models framework. Such depth is only possible when a company controls its own silicon—this is precisely the value of controllable chips for OS vendors: it determines the depth of hardware-software synergy and raises the upper limit of the on-device OS agent experience.
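A back-of-envelope calculation shows why low-bit quantization matters at this model size. The sketch below assumes every weight is stored at the same uniform bit width, which is a simplification made here for illustration and not a description of Apple's actual storage layout; only the roughly 3 billion parameter count comes from the text.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 3e9  # "approximately 3 billion parameters" from the text

for bits in (16, 4, 2):
    # Assumption: one uniform bit width for all weights.
    print(f"{bits:>2}-bit weights: {weight_memory_gb(PARAMS, bits):.2f} GB")
# 16-bit weights: 6.00 GB
#  4-bit weights: 1.50 GB
#  2-bit weights: 0.75 GB
```

Under that assumption, moving from 16-bit to 2-bit weights cuts storage by a factor of eight, which is the difference between a model that competes with apps for phone RAM and one that can stay resident.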

In the AI era, Google is doing the same. Since the Pixel 6 it has pursued its own Tensor SoC path, and the latest Tensor G5 boosts TPU performance by up to 60% and CPU performance by an average of 34%, making it, in the Pixel 10, the first SoC to fully run the latest Gemini Nano. The Tensor G5 still has limitations: Android Central’s real-world tests show that its memory configuration (RAM capacity) remains a bottleneck for AI performance and that its Geekbench AI score lags behind the Snapdragon 8 Elite, while in Macworld’s Geekbench 6 tests the G5’s single-core and multi-core scores are both lower than those of the A18 Pro. Google is still catching up, but the combined strategy of proprietary Tensor SoCs and on-device Gemini is already taking shape.

Huawei's Kirin, paired with the Da Vinci NPU and the Pangu on-device model, represents another controllable chip pathway alongside Apple and Google. Xiaomi has shipped the XRING O1 (Xuanjie O1), marking its entry as a new player in the controllable chip direction.

Edge-Cloud Model Matrix: The Source of Agent Intelligence

The edge-cloud model matrix is the source of "intelligence" for end devices: cloud models set the upper limit for complex tasks, while edge models establish the baseline for everyday operations, since latency, battery life, privacy, and stability all rest on the edge side. Both are indispensable; the difference lies in the depth of integration with the OS. Edge models must be embedded into the OS of every device and deeply coupled with the local NPU, where they play a dual role: downward, they serve as the local inference backend for the Runtime; upward, they are exposed to apps as system-level APIs through the Runtime’s framework and SDK.

Self-developed models make sense both in the cloud and on the edge, but the returns on the edge are more tangible. Externally sourced cloud models can still deliver the upper limit of capability, so the advantages of self-development in the cloud lie mainly in routing control, commercial terms, and the pace of model iteration. The edge is different. Edge models are embedded into the OS and NPU of every device, and the benefits of self-development show up directly in product performance: KV cache sharing, 2-bit quantization-aware training tailored to a specific chip generation, Per-Layer Embedding (from Gemma 3n, which loads embedding parameters from fast storage layer by layer), and more. All of these require the model and hardware to be designed in step, on a timeline that is not constrained by third-party hardware vendors.
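The general idea behind loading parameters layer by layer can be shown with a toy Python sketch. It is not Gemma 3n's implementation: PerLayerStore, the dictionary standing in for fast storage, and the placeholder arithmetic are all assumptions made here for illustration. What it demonstrates is that peak working memory is bounded by one layer's table, not by the sum of all tables.

```python
from typing import Dict, List


class PerLayerStore:
    """Stand-in for fast storage holding one embedding table per layer."""

    def __init__(self, tables: Dict[int, List[float]]):
        self._tables = tables
        self.resident = 0       # values currently held in working memory
        self.peak_resident = 0  # high-water mark across the whole pass

    def load(self, layer: int) -> List[float]:
        table = list(self._tables[layer])  # the copy models a read from storage
        self.resident += len(table)
        self.peak_resident = max(self.peak_resident, self.resident)
        return table

    def release(self, table: List[float]) -> None:
        self.resident -= len(table)


def forward(store: PerLayerStore, num_layers: int, x: float) -> float:
    for layer in range(num_layers):
        table = store.load(layer)        # bring in this layer's parameters only
        x = x + sum(table) / len(table)  # placeholder for the real layer math
        store.release(table)             # free before the next layer loads
    return x


store = PerLayerStore({i: [float(i)] * 1000 for i in range(4)})
print(forward(store, 4, 0.0))  # 6.0
print(store.peak_resident)     # 1000, although 4000 values exist in storage
```

How well this works on a real device depends on storage bandwidth keeping pace with the accelerator, which is one reason the model and the hardware have to be designed together.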

The TPU computing power of the Tensor G5 is up to 60% higher than its predecessor, the G4—but the improvements for Gemini Nano on the G5 go far beyond that. According to Google and Jon Peddie Research, local processing speed is 2.6 times faster, power consumption is halved, and the token window has expanded from 12,000 to 32,000—equivalent to processing approximately a hundred screenshots at once. These significantly enhanced performance metrics stem from Gemini Nano v3’s Matryoshka Transformer adaptive inference architecture, combined with co-optimization for the Tensor G5 TPU.
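The quoted figures can be combined with some simple arithmetic. The sketch below assumes that "power consumption is halved" refers to average power draw during inference and that the 2.6x speedup applies to the same workload. Neither assumption is confirmed by the source; if the halving already refers to energy per task, the energy ratio is simply 0.5.

```python
SPEEDUP = 2.6      # local processing speed, G5 vs. the previous generation
POWER_RATIO = 0.5  # "power consumption is halved" (assumed: average power draw)
OLD_WINDOW, NEW_WINDOW = 12_000, 32_000
SCREENSHOTS = 100  # "approximately a hundred screenshots"

# Energy = power x time. If the same job finishes 2.6x faster at half the
# power, energy per job falls to POWER_RATIO / SPEEDUP of the old value.
energy_ratio = POWER_RATIO / SPEEDUP
window_growth = NEW_WINDOW / OLD_WINDOW
tokens_per_screenshot = NEW_WINDOW / SCREENSHOTS

print(f"energy per job: {energy_ratio:.2f}x of before")          # 0.19x
print(f"token window:   {window_growth:.2f}x larger")            # 2.67x
print(f"budget:         {tokens_per_screenshot:.0f} tokens per screenshot")  # 320
```

Read this way, the energy saving per job (roughly a fivefold reduction) is larger than either headline number suggests on its own.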

Performance leap of Gemini Nano on Tensor G5 compared to the previous generation
Source: Google/Jon Peddie Research; illustration by Yunyong AI

At the edge model layer, each major OS vendor holds its own proprietary model: Google’s Gemini Nano, Apple’s approximately 3B-parameter on-device foundation model, Microsoft’s Phi Silica, and Huawei’s Pangu on-device model. In-house development is the default choice at this layer.

III. Between the Three Layers: Deeper Collaboration Creates Greater Room for Differentiation

The three-layer capability foundation is coupled from bottom to top: controllable chip → edge/cloud models → Runtime → Agent. The controllable chip determines the inference efficiency and power consumption achievable by edge models; the edge model determines the local intelligence that the Runtime can orchestrate; the Runtime determines the reliability of the Agent as a system service executing across apps. The deeper the collaboration among these three, the greater the product experience differentiation OS vendors can achieve in edge Agents, and the wider their competitive moat becomes.

When the three layers are integrated within the same software and hardware system, the OS Agent gains capabilities that no single layer can deliver on its own:

Response latency and power consumption. The 2.6x processing speed improvement and halved power consumption achieved by Gemini Nano on the Tensor G5 result from mutual optimization across model architecture, chip design, and runtime scheduling within the same-generation hardware-software integration—such significant improvements only emerge through this level of co-design.
Privacy and trust. Common tasks involving private data are performed locally by on-device models, while complex requests are handed off to the cloud; this is the reasonable default stance for an OS Agent toward user data today. Whether this “on-device first, cloud fallback” approach can truly be realized depends on all three layers: deep adaptation between the NPU and on-device models is what lets still-maturing on-device models handle frequent daily inference; the models are quantized, compressed, and given shared KV caches to fit the NPU; and the runtime routes tasks between device and cloud based on complexity. If any one of these three layers falls short, “on-device first” remains merely marketing rhetoric.
System-level context. By reorganizing cross-app and OS-level user data—such as semantic indexing, screen awareness, and long-term memory—into a system-level personal context for the Agent, the OS vendor enables the Agent to truly “understand the user,” which is the defining characteristic that distinguishes OS Agents from single-app Agents. Implementation relies on three tightly integrated layers: the Runtime holds cross-app indexing and permissions, on-device models remain resident to handle understanding and reasoning, and the NPU provides efficient local compute power. Apple’s Core Spotlight establishes semantic indexing on the device, while apps use App Intents to connect actions and data to the system; the Agent will access context through Personal Context (Apple has announced this capability will be available via future software updates). On Android, AppFunctions follows the same path.
System service reliability. The OS Agent must remain available under real-world conditions such as no internet connectivity, low battery, and thermal throttling. The on-device model stays resident so the Agent can function without a network connection; a highly optimized NPU handles low-power inference; and the Runtime dynamically adjusts scheduling based on available resources, falling back to lighter models or routing requests to the cloud when resources are constrained. If any one of these three layers is missing, the OS Agent cannot fulfill the role of a system service and reverts to being merely an app-level chat button.
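The routing behavior described in the points above (on-device by default, cloud for complex requests, a lighter model under resource pressure) can be sketched as a small decision function. The thresholds, the complexity score, and all names below are illustrative assumptions; no vendor has published its routing policy in this form.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    ON_DEVICE = "on_device"
    ON_DEVICE_LITE = "on_device_lite"  # hypothetical lighter fallback model
    CLOUD = "cloud"


@dataclass
class DeviceState:
    online: bool
    battery_pct: int
    thermal_throttled: bool


COMPLEXITY_THRESHOLD = 0.7  # assumed cut-off for a "complex" request
LOW_BATTERY_PCT = 20        # assumed low-battery boundary


def route(complexity: float, state: DeviceState) -> Target:
    """Pick where a request runs; the device path never needs a network."""
    constrained = state.battery_pct < LOW_BATTERY_PCT or state.thermal_throttled
    if complexity > COMPLEXITY_THRESHOLD and state.online:
        return Target.CLOUD           # complex request, network available
    if constrained:
        return Target.ON_DEVICE_LITE  # resources are tight: use a lighter model
    return Target.ON_DEVICE           # default: on-device first


print(route(0.2, DeviceState(online=True, battery_pct=80, thermal_throttled=False)))
# Target.ON_DEVICE
print(route(0.9, DeviceState(online=False, battery_pct=80, thermal_throttled=False)))
# Target.ON_DEVICE (offline, so the complex request still runs locally)
```

Note that every branch returns some target even when the device is offline, which is the property that separates a system service from a cloud-backed chat button.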

Apple Intelligence presents a complete collaborative paradigm: Apple Silicon, the approximately 3B-parameter on-device foundation model, and the Foundation Models framework are integrated from bottom to top, handling common scenarios on-device and routing complex requests to Private Cloud Compute. Google takes a different approach: the Tensor G5 in the Pixel 10 is the first SoC capable of fully running the latest generation of Gemini Nano, and unified orchestration by AICore lets system-level Agent features like Magic Cue and Pixel Screenshots be enabled by default without cloud dependency. Huawei exemplifies three-layer collaboration built domestically: Kirin, the Da Vinci NPU, the Pangu on-device model, and HMAF are all self-developed and tightly coupled from bottom to top into a complete three-layer foundation.

Interlocking mechanism of the three-layer foundation for the edge OS Agent
Source: Yunyong AI

IV. On Top of the Foundation: Other Key Variables of a Long-Term Moat

The core of building a moat lies in the collaboration of three layers. Above the foundation, numerous variables influence product competitiveness in the OS Agent era, including the interaction capability between Agent and App, privacy protection, and more.

The interaction between OS Agents and apps sits at the forefront of the power struggle between OS vendors and app developers. Two paths are currently being pursued in parallel. One is screen recognition and automation, including features like Gemini Live screen sharing, Apple Visual Intelligence, and Circle to Search. Here the OS Agent interacts with apps by reading the screen and clicking buttons, which works for single tasks but yields no structured data per invocation, making multi-step workflows hard to build reliably. The other path is deep API integration, including Google AppFunctions, Apple App Intents, and Huawei Intents Kit. Here apps expose core functions to the system as structured interfaces, enabling stable Agent calls and multi-step workflows. Whether the API path succeeds depends not on OS vendors but on app developers: letting Agents access core functions means users may bypass the app entirely, so that the OS platform captures brand exposure, ad placements, behavioral data, and payment channels in the app's place. This will become the central battleground for control over traffic distribution on end devices.
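A short Python sketch shows why structured interfaces make multi-step workflows tractable: each call returns named fields that the next step can consume directly. The registry, the app_function decorator, the "$name" placeholder convention, and the food-ordering functions are all hypothetical; they mirror the idea behind AppFunctions, App Intents, and Intents Kit without reproducing any of those APIs.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical registry: app-declared functions keyed by "app.function".
REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {}


def app_function(name: str):
    """Decorator an app would use to expose one function to the system."""
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap


@app_function("food.search")
def search_restaurants(cuisine: str) -> Dict[str, Any]:
    return {"restaurant_id": "r42", "cuisine": cuisine}


@app_function("food.order")
def place_order(restaurant_id: str, item: str) -> Dict[str, Any]:
    return {"order_id": f"{restaurant_id}-001", "item": item, "status": "placed"}


def run_workflow(steps: List[Tuple[str, Dict[str, Any]]]) -> Dict[str, Any]:
    """Run steps in order; values like "$restaurant_id" come from earlier results."""
    context: Dict[str, Any] = {}
    for name, args in steps:
        resolved = {
            key: context[val[1:]] if isinstance(val, str) and val.startswith("$") else val
            for key, val in args.items()
        }
        context.update(REGISTRY[name](**resolved))
    return context


result = run_workflow([
    ("food.search", {"cuisine": "noodles"}),
    ("food.order", {"restaurant_id": "$restaurant_id", "item": "beef noodles"}),
])
print(result["order_id"], result["status"])  # r42-001 placed
```

The screen-automation path has no equivalent of the restaurant_id field: the Agent would have to re-read pixels at every step, which is where reliability breaks down.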

Privacy protection is a core value and fundamental principle of edge-side systems. OS vendors hold the deepest system-level permissions and the most sensitive user data on the edge; privacy is both their inherent responsibility and a prerequisite for the long-term advancement of the previous two priorities. Apple has built a privacy protection system that spans device and cloud by applying the same hardware-level security design to the on-device Secure Enclave and to its Private Cloud Compute (PCC) nodes. This product strategy has made “Privacy. That’s Apple.” a core brand label for Apple in the global premium market, earning user trust.

Apple's "Privacy. That’s Apple." tagline
Source: Apple's official website

The synergy of the three layers forms the core of the moat, while these long-term variables above the foundation determine how deep it can be strengthened.

V. It’s Not Just About Redoing the OS

Under the trend of OS agents on the edge, the more solid the three foundational layers—system-level AI runtime, controllable chips, and edge-cloud model matrices—the higher the product floor and the greater the differentiation potential for OS vendors. Only those OS vendors who seize this trend will have the opportunity to reset the allocation of edge-side traffic entry points and secure a stronger competitive position.

This trend extends beyond smartphones and PCs. The underlying capabilities of the OS Agent are spreading across existing multi-device ecosystems and are particularly active in IoT. Controllable chips are moving into scenarios such as automotive SoCs, with Huawei having deployed automotive-grade Kirin chips and Xiaomi’s HyperOS entering its own vehicle models. Edge-side models are being slimmed down for new hardware forms like smart glasses, with Google, Samsung, Gentle Monster, and Warby Parker jointly developing Android XR smart glasses set to launch in fall 2026. The collaboration between Runtime and Agent is expanding across device clusters through each company’s established “Super Device” or distributed frameworks, for example Huawei’s 1+8+N and HarmonyOS Distributed Soft Bus, Xiaomi’s “People-Car-Home Full Ecosystem” and HyperConnect, Apple’s Continuity, and Google’s Cross-device SDK and cross-device services. The battle over the OS Agent will be decided well beyond smartphones and PCs.

AICore has been refined for nearly two years; Apple’s OS has taken over a decade to fully optimize with its Apple Silicon chip series; Tensor has undergone continuous iterations up to G5 before the Pixel 10 could handle Gemini Nano v3. The true advantage in this battle has never been decided in the one or two hours of a product launch, but in the countless generations of chips, models, and runtimes that have been meticulously honed over time.

References:

Gemini Intelligence brings proactive AI to Android | Google Blog
I/O 2026: Welcome to the agentic Gemini era | Google Blog
Phi Silica, small but mighty on-device SLM | Windows Experience Blog
Apple Delays Siri Upgrade Indefinitely | Bloomberg
HarmonyOS 6 Developer Beta Launch Press Release (HDC 2025) | Huawei
The latest Gemini Nano with on-device ML Kit GenAI APIs | Android Developers Blog
Foundation Models framework documentation | Apple Developer
HarmonyOS Intelligent Agent Framework Whitepaper | Huawei Developer
On-Device Llama 3.1 with Core ML | Apple Machine Learning Research
Apple Intelligence Foundation Language Models Tech Report 2025 | Apple Machine Learning Research
Google Tensor G5: Benchmarks and everything you need to know | Android Central
Google's new M5 SoC (Tensor G5 Detailed Analysis · Matryoshka Transformer) | Jon Peddie Research
Private Cloud Compute: A new frontier for AI privacy in the cloud | Apple Security Engineering
Overview of AppFunctions | Android Developers
App Intents | Apple Developer
Intents Kit Overview (HarmonyOS) | Huawei Developer
The Google Pixel 10 Pro’s Tensor G5 chip is impressive—if you compare it to an iPhone 14 | Macworld
Gemma 3n model overview | Google AI for Developers