Perplexity to Launch Hybrid Local-Cloud AI Inference System in July

CoinDesk reports:

Perplexity announced a new feature at Computex 2026 in Taipei, with plans to launch the Windows version of Perplexity Computer in July. The system will automatically determine which parts of an AI task run on the local device and which are handled by cloud models, eliminating the need for users to manually switch modes.

Sensitive content is processed locally first.

This solution was unveiled jointly by Perplexity CEO Aravind Srinivas and Intel CEO Lisa Wu. The company refers to it as a hybrid local-server inference orchestration system, emphasizing the integration of privacy, performance, and computational cost within a single workflow.

Perplexity states that content such as financial records, health information, and personal documents should first be evaluated by a lightweight on-device model, which determines whether it should remain local; more demanding reasoning tasks are then sent to larger models in the cloud.

According to the company, tasks such as document summarization, text formatting, and lightweight classification can be completed locally, while complex reasoning is handed off to the cloud. The switch happens automatically during task execution, so users should rarely notice the handoff.
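
Perplexity has not published how its orchestration layer makes this decision, so the following is only a minimal sketch of per-task routing as the article describes it. Every name here (`Task`, `Target`, `route`, `LIGHTWEIGHT_KINDS`) is hypothetical, and the rule that sensitive content always stays on the device is a simplifying assumption rather than the company's stated policy.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


# Task kinds the article lists as suitable for on-device models.
# The string labels themselves are hypothetical.
LIGHTWEIGHT_KINDS = {"summarize", "format", "classify"}


@dataclass(frozen=True)
class Task:
    kind: str                # e.g. "summarize" or "reason"
    sensitive: bool = False  # e.g. financial, health, or personal documents


def route(task: Task) -> Target:
    """Pick an execution target for one task (illustrative policy only)."""
    # Assumption: sensitive content never leaves the device in this sketch.
    if task.sensitive:
        return Target.LOCAL
    # Lightweight work runs on the compact local model.
    if task.kind in LIGHTWEIGHT_KINDS:
        return Target.LOCAL
    # Everything else goes to the larger cloud model.
    return Target.CLOUD
```

Calling `route` once per task, rather than once per session, is what removes the need for users to pick a mode in advance: a workflow that mixes formatting steps with a hard reasoning step would be split across both targets.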

However, this does not mean Perplexity is giving users a fully controllable offline model. The local components are still compact models bundled into the app, and the cloud components continue to run on Perplexity’s servers, so the system cannot be considered a fully offline solution.

Cost pressures are an important backdrop.

During Computex, Srinivas stated that the goal of AI systems should be to deliver higher "value per watt" for each user, rather than concentrating all computation on servers and the largest models. He noted that some companies are already spending hundreds of millions of dollars per month on computing power.

Perplexity previously disclosed that its revenue has increased from $100 million to $500 million, while its workforce has grown by only 34%. In this context, shifting part of the inference load to users' devices can directly reduce cloud computing costs.

This is also one of the key reasons the AI industry is pushing edge inference. For businesses, running models locally reduces server costs; for users, it means sensitive data doesn’t need to leave their devices.

The industry is shifting toward edge and hybrid models.

Currently, multiple tech companies are advancing local or hybrid inference. Apple performs certain sensitive processing on local chips; Microsoft’s Foundry Local became generally available in April this year, enabling local AI inference on Windows, macOS, and Linux.

NVIDIA also launched RTX Spark during Computex, targeting local large-model inference on laptops and desktops. Perplexity’s differentiator, by contrast, is not the model itself but the orchestration layer: the system decides between local and cloud execution dynamically, task by task, rather than requiring users to choose in advance.

Perplexity stated that the feature is not limited to Intel chips. Although the live demonstration used an Intel Core Ultra Series 3 processor, NVIDIA processors are also supported. For now, the feature is confirmed to launch first in the Windows PC app, with no release timeline announced for other platforms.