The winning company won’t be the one with the most GPUs, but the one that can tell you which GPUs are available, where, at what price, and route each workload to the location where it can run at the lowest cost.

Author: Frank Fu @ IOSG

Source: IOSG

The gap David Cahn identified in 2023 was never filled on the training side. It was filled on the inference side, and the market has only begun to price this in over the past few weeks. With Nvidia restructuring its financial reporting around “service tokens” and Cerebras’s IPO 20 times oversubscribed, the debate over where the bottleneck lies is over. The real question now is this: as inference becomes the scarce resource, where will value accumulate in the compute stack?

Part One: Following the GPU: From a $200 Billion Problem to a $600 Billion Problem

In 2023, David Cahn of Sequoia posed the question hanging over all AI development—the “$200 billion problem.” For every dollar spent on GPUs, roughly another dollar must be spent powering them in data centers, meaning that each year’s GPU CapEx requires these chips to generate approximately $200 billion in revenue just to recoup the capital investment. Even under very generous assumptions about AI revenue, he found a gap of over $125 billion between “inputs” and “what end customers are actually paying for.” The concern is straightforward: GPUs are being overbuilt ahead of real demand.

A year later, the gap had not narrowed; it had widened. In his 2024 follow-up, written as big-tech capex ballooned, Cahn restated it as the “$600 billion problem.” The bear case crystallized into a familiar shape: overbuilding leads to oversupply, and oversupply burns through capital.
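The arithmetic behind both headline numbers can be sketched in a few lines. This is a minimal reconstruction, not Cahn's own model: the 2x data-center multiplier comes from the "another dollar" rule described above, while the $50 billion and $150 billion GPU run-rates and the 50% end-user margin are assumptions recalled from his published framing. The $75 billion revenue figure is simply what the stated $125 billion gap implies.

```python
def required_revenue(gpu_spend_bn: float,
                     datacenter_multiplier: float = 2.0,
                     end_user_margin: float = 0.5) -> float:
    """Revenue (in $bn) end customers must generate to pay back one year of GPU spend.

    datacenter_multiplier: total data-center cost per dollar of GPU spend
                           (2.0 = one extra dollar for energy, buildings, etc.).
    end_user_margin:       assumed gross margin of the companies selling AI products.
    """
    total_datacenter_cost = gpu_spend_bn * datacenter_multiplier
    return total_datacenter_cost / (1.0 - end_user_margin)


# Assumed inputs (not stated in this article): annualized GPU spend in $bn.
GPU_SPEND_2023_BN = 50.0
GPU_SPEND_2024_BN = 150.0
# Implied by the article: $200bn required minus the $125bn gap.
GENEROUS_AI_REVENUE_2023_BN = 75.0

required_2023 = required_revenue(GPU_SPEND_2023_BN)
gap_2023 = required_2023 - GENEROUS_AI_REVENUE_2023_BN
required_2024 = required_revenue(GPU_SPEND_2024_BN)

print(f"2023 required revenue: ${required_2023:.0f}bn, gap: ${gap_2023:.0f}bn")
print(f"2024 required revenue: ${required_2024:.0f}bn")
```

Under these assumptions the function reproduces the $200 billion and $600 billion figures; changing the margin or multiplier shows how sensitive the "problem" is to each input.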

Both articles ask essentially the same question: who will fill this gap? The answer never appeared on the training side of the ledger. It appeared on the inference side, and the market has only begun to price it in over the past few weeks.

Part Two: Cerebras IPO and Inference Compression

Cerebras went public on Thursday. The IPO was 20 times oversubscribed, and the shares were priced at nearly double the final raised price announced on Wednesday. The demand did not stem from bets on a "next-generation Nvidia killer" but from something simpler: the market is beginning to recognize that the real bottleneck in AI is inference, not training.

Cerebras’s core strength is a chip architecture designed for extremely fast inference—not training, but inference. This is precisely what excites Wall Street. The inference market is recurring and scales with usage. Every time Claude answers a question, every time an agent performs a task, it consumes compute. Training happens once; inference never stops.

J.P. Morgan estimates the inference market size to be 10 to 50 times that of training. As machines begin executing tasks assigned by other machines—that is, agentic expansion—inference demand no longer scales with the number of users, but rather with computational power itself.

Part Three: Nvidia Redraws the Map: Inference Takes Center Stage

If Cerebras was the market’s awakening, Nvidia’s latest quarterly earnings report is the confirmation from the top of the supply chain. On the earnings call, Jensen Huang made explicit what everyone had assumed: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has evolved from one-shot inference to reasoning, and now to the agent stage, where systems autonomously invoke tools and orchestrate tasks. Huang stated, “Tokens are now profitable.” In the age of AI, compute equals revenue and profit.

This has reshaped the entire industry. Training is a one-time cost to build a model, while inference is the ongoing cost to run it—and today, the bottleneck is inference, not training.

Nvidia has built this classification into its financial reporting framework, now disclosing revenue across two platforms instead of one: Data Center and Edge Computing. The Data Center segment (approximately $75 billion this quarter, +92% year-over-year) is further broken down into Hyperscale (approximately $38 billion, +12% quarter-over-quarter) and ACIE, short for AI Cloud, Industrial, and Enterprise (approximately $37 billion, +31% quarter-over-quarter). A new line item, Edge Computing, has been introduced at $6.4 billion, up 29% year-over-year, covering the endpoints where agentic AI and physical AI actually run, such as PCs, workstations, AI-RAN base stations, robots, and automobiles.

Edge still accounts for less than 8% of total revenue, but Nvidia has elevated it to a “second platform” alongside data centers. This signals that inference is splitting into two fronts: cloud inference within data centers and endpoint inference at the edge, as AI must see, move, and act in the physical world. The roadmap follows the same logic: Vera Rubin, shipping starting in Q3, offers up to 35 times the inference throughput of Blackwell; Huang also introduced a brand-new $200 billion TAM for the Vera CPU designed for agentic workloads. Every leading model company is expected to fully transition to it on day one.
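As a quick check, the "less than 8%" figure can be recomputed from the approximate segment numbers quoted above. The sketch assumes total revenue is simply Data Center plus Edge Computing, the two platforms named in the text; any revenue outside those two lines would lower the share further.

```python
# Approximate segment revenue for the quarter, in $bn, as quoted in the article.
hyperscale_bn = 38.0
acie_bn = 37.0
edge_bn = 6.4

# Assumption: the two platforms together make up total revenue.
data_center_bn = hyperscale_bn + acie_bn
total_bn = data_center_bn + edge_bn
edge_share = edge_bn / total_bn

print(f"Data Center: ${data_center_bn:.0f}bn")
print(f"Total:       ${total_bn:.1f}bn")
print(f"Edge share:  {edge_share:.1%}")
```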

When the world’s most valuable company restructured its financial disclosures around “service tokens,” the battle over bottlenecks was already settled. The remainder of this article discusses who captures value when inference—rather than training—becomes the scarce resource.

First, a note on scope. Of these two fronts, this article focuses on cloud inference: rented data center GPUs that serve tokens through APIs. Endpoint inference runs on local chips within the device itself (such as Nvidia’s Jetson, RTX, Drive, and AI-RAN), entirely bypassing the GPU rental and aggregation stack. Here it is treated as a tailwind that amplifies the overall inference economy and supports the bottleneck argument, rather than as part of the market occupied by Hyperbolic and Venice, which operate entirely on the cloud side.

Part Four: The Squeeze Has Arrived

Anthropic is the canary in the coal mine. Usage has far exceeded provisioned capacity, with widespread online complaints about Claude being “lobotomized,” including throttled responses, slower reasoning, and compressed context windows. The fix is blunt: raw compute. In May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, comprising over 220,000 Nvidia GPUs and 300+ megawatts, and dedicated it exclusively to inference rather than training.

This capacity unlock triggered a series of quota adjustments, each one a signal. On May 6, Anthropic doubled the five-hour quota for Claude Code, eliminated peak-time throttling, and significantly increased the API rate limits for Opus. On May 13, it raised the weekly quota for Claude Code by a further 50% (through July 13). Then, starting June 15, it moved in the opposite direction: it removed agentic and programmatic use cases (Agent SDK, headless mode via claude -p, CI pipelines) from flat-rate subscriptions and moved them into a separate, metered credit pool (priced at $20 to $200 per month, billed at API rates). This final step distills the entire argument into one action: agents consume inference at a rate far exceeding what flat-rate subscriptions were designed to support, and so must be priced according to their actual recurring cost.
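Why agents break flat-rate pricing is easiest to see as a break-even calculation. The sketch below is purely illustrative: the blended token price and both usage profiles are hypothetical numbers chosen for the example, and only the $200 monthly figure echoes the top of the range quoted above.

```python
def metered_cost(tokens_millions: float, price_per_million_usd: float) -> float:
    """Cost of a month of usage when billed per token at API rates."""
    return tokens_millions * price_per_million_usd


def break_even_tokens_millions(flat_fee_usd: float, price_per_million_usd: float) -> float:
    """Monthly usage (in millions of tokens) at which a flat fee stops covering cost."""
    return flat_fee_usd / price_per_million_usd


FLAT_FEE_USD = 200.0          # top of the $20-$200 range mentioned in the article
PRICE_PER_MILLION_USD = 10.0  # hypothetical blended API price per 1M tokens

# Hypothetical monthly usage profiles, in millions of tokens.
usage_profiles = {
    "interactive user": 5.0,
    "always-on agent": 400.0,
}

limit = break_even_tokens_millions(FLAT_FEE_USD, PRICE_PER_MILLION_USD)
print(f"Flat fee covers up to {limit:.0f}M tokens per month")

for label, tokens_m in usage_profiles.items():
    cost = metered_cost(tokens_m, PRICE_PER_MILLION_USD)
    verdict = "loss-making on flat rate" if cost > FLAT_FEE_USD else "covered by flat rate"
    print(f"{label}: ${cost:,.0f} at API rates -> {verdict}")
```

Whatever the real prices are, the shape is the same: a human's usage is bounded by attention, while an agent's usage is bounded only by the quota, so a flat fee that works for the first is underwater for the second.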

Training is a one-time capital expenditure. Inference is an ongoing operational cost that compounds with each new user and each new agent.

Part Five: The Stack: Six Layers, One Bottleneck

Every AI application sits on a supply chain that begins at a TSMC wafer fab and ends at an API endpoint.

Most companies own only one layer. Nvidia owns the silicon, CoreWeave owns the bare metal, Together AI owns inference optimization, and OpenRouter owns model API routing.

Except for one.

Part Six: Hyperbolic — The Only Company Spanning All Three Layers

Hyperbolic launched its on-demand GPU marketplace in June 2025. Within its first few months, its developer base surpassed 200,000, with adoption spanning leading AI labs, search engines, and major consumer platforms.

What is interesting is its architecture.

Hyperbolic owns no GPUs of its own. Every GPU is sourced from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with excess capacity. This may sound like a weakness, but it is actually a moat.

By sitting between GPU suppliers and consumers, Hyperbolic sees real-time data others cannot. It knows who is buying which GPUs, at what prices, and when. It detects oversupply before it becomes public and anticipates demand surges before they hit the market.

Today, the moat itself is this multi-cloud aggregation. Hyperbolic stitches together fragmented computing capacity from dozens of independent clouds and data centers into a standardized, unified pool, enabling developers to rent the cheapest available GPUs anywhere without negotiating with each provider or managing multiple accounts. The more clouds it connects to, the deeper its liquidity and the richer its pricing data. Looking ahead, the team is exploring how to use this data to model GPU price curves and eventually deploy its own capital to smooth supply and demand, acting as a market maker for physical compute—but this goal remains in early stages. What’s truly compounding today is the aggregation layer.
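The core routing decision described here, picking the cheapest available GPU across many providers, can be sketched as a filter followed by a minimum. This is an illustrative toy and not Hyperbolic's actual system: the `Offer` type, the `cheapest_offer` function, the provider names, and every price are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Offer:
    """One provider's listing in the aggregated pool (all fields hypothetical)."""
    provider: str
    gpu_model: str
    region: str
    available_gpus: int
    price_per_gpu_hour: float


def cheapest_offer(offers: Iterable[Offer],
                   gpu_model: str,
                   gpus_needed: int,
                   allowed_regions: Optional[Set[str]] = None) -> Optional[Offer]:
    """Return the lowest-priced offer that can satisfy the request, or None."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model
        and o.available_gpus >= gpus_needed
        and (allowed_regions is None or o.region in allowed_regions)
    ]
    return min(candidates, key=lambda o: o.price_per_gpu_hour, default=None)


# Made-up pool spanning several hypothetical clouds.
pool = [
    Offer("cloud-a", "H100", "us-east", 64, 2.90),
    Offer("cloud-b", "H100", "eu-west", 8, 2.10),
    Offer("cloud-c", "H100", "us-west", 32, 2.45),
    Offer("cloud-d", "A100", "us-east", 128, 1.20),
]

best = cheapest_offer(pool, gpu_model="H100", gpus_needed=16)
if best is not None:
    print(f"Route to {best.provider} ({best.region}) at ${best.price_per_gpu_hour:.2f}/GPU-hour")
```

Note that the cheapest listing overall (cloud-b) is skipped because it cannot supply 16 GPUs. Each additional connected cloud adds candidates to the pool, which is why deeper aggregation tends to improve both price and fill rate.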

This is the flywheel:

Connect more clouds → more aggregated supply
More supply → Deeper market and real-time pricing data
Better data → Smarter routing now, pricing models in the long term
Better liquidity and pricing → More developers → More cloud integrations

No other company is attempting this. Hyperbolic is the only company spanning the GPU leasing layer, deployment layer, and model API layer.

Part Seven: The Mirror of Venice

Venice is the clearest manifestation of the inference economy at the application layer and serves as a useful contrast to Hyperbolic’s position. It is a privacy-first inference application: a set of OpenAI-compatible APIs combined with consumer-facing subscriptions (Free / Pro / Pro+ / Max) that route requests to approximately 75 models, about two-thirds of which are open-source or self-hosted models (Llama, Mistral, Qwen, DeepSeek), while the remainder are anonymous passthroughs to proprietary frontier models. Crucially, Venice does not own meaningful compute power itself. It rents from undisclosed GPU partners and confidential computing providers (NEAR AI Cloud, Phala) and pays frontier labs for passthrough services, meaning its true cost of revenue is inference compute—not SaaS hosting.

What Venice really sells is privacy. “Privacy” here does not mean turning public compute into private property; it means adding a layer of assurance on top of commoditized inference: no data retention, no use for training, request anonymization, and portions of the workload running inside TEEs, so that even the operator cannot see the plaintext. The underlying compute is commodity-grade; the markup comes entirely from this privacy layer. The assurance is also layered and uneven. For open-source models running on hardware under Venice’s control or on TEE GPUs, it delivers near end-to-end confidential computing. For closed-source models such as Claude or GPT, anonymization merely strips away identity, while the original prompt is still processed by the frontier lab on the other end. The strongest privacy therefore covers only the open-source portion; for frontier models, it is “anonymized,” not “truly confidential.” Venice’s gross margin = subscription price − downstream inference costs paid to providers. The premium it can charge over bare API prices rests almost entirely on this privacy premium, which also explains its thin margins and its dependence on the pricing of frontier model passthroughs.
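The margin identity above is simple enough to make concrete. The sketch below uses entirely hypothetical per-subscriber numbers (neither the subscription price nor the inference cost is Venice's actual figure) to show how a modest rise in passthrough pricing compresses a thin margin.

```python
def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Gross margin per subscriber: price minus downstream inference cost."""
    return subscription_price - inference_cost


def gross_margin_pct(subscription_price: float, inference_cost: float) -> float:
    """Gross margin as a fraction of the subscription price."""
    return gross_margin(subscription_price, inference_cost) / subscription_price


# Hypothetical monthly figures per subscriber, in USD.
PRICE = 18.0
BASE_COST = 15.0      # inference paid to GPU partners and frontier labs
HIGHER_COST = 16.5    # same usage after a 10% rise in provider pricing

for label, cost in [("base", BASE_COST), ("after 10% cost rise", HIGHER_COST)]:
    margin = gross_margin(PRICE, cost)
    pct = gross_margin_pct(PRICE, cost)
    print(f"{label}: margin ${margin:.2f} ({pct:.1%} of price)")
```

In this toy case a 10% increase in upstream cost halves the margin, which is the sense in which an application that buys all of its compute has economics set by its suppliers.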

The token design encapsulates this inference demand. Venice operates on two tokens: VVV (for staking and platform access) and DIEM, an inference credit worth approximately $1 of compute per day. Paid subscriptions trigger programmatic buybacks and burns of VVV (approximately $2, $5, and $10 for the Pro, Pro+, and Max tiers respectively), while emissions decline on a fixed schedule: monthly emissions step down from 6M to 5M to 4M VVV, then to 3M on July 1. The buybacks are real but discretionary and still relatively small: approximately $103,000 was burned in each of April and May, and June is tracking toward roughly $110,000, still well below the $200,000 monthly threshold.
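To put these figures in proportion, the snippet below computes the size of each emission cut and how far monthly burns sit below the $200,000 threshold. It uses only the numbers quoted in the paragraph above; the June value is the article's approximate run-rate, not a final figure.

```python
# Monthly VVV emissions schedule, in millions of tokens, as quoted in the article.
emissions_m = [6.0, 5.0, 4.0, 3.0]

# Relative size of each step down in the schedule.
cuts = [(prev - nxt) / prev for prev, nxt in zip(emissions_m, emissions_m[1:])]
for (prev, nxt), cut in zip(zip(emissions_m, emissions_m[1:]), cuts):
    print(f"{prev:.0f}M -> {nxt:.0f}M: emissions cut by {cut:.1%}")

# Approximate monthly buyback-and-burn amounts in USD (June is an estimate).
burns_usd = {"April": 103_000, "May": 103_000, "June (est.)": 110_000}
THRESHOLD_USD = 200_000

for month, burned in burns_usd.items():
    share = burned / THRESHOLD_USD
    print(f"{month}: ${burned:,} burned, {share:.1%} of the threshold")
```

The cuts grow in relative terms with each step, while burns are running at a little over half the threshold.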

The fundamentals are healthier than the headlines. The widely circulated figure of "$70 million ARR" is almost certainly the result of conflating subscription renewals with net new customer acquisition; a defensible observable range is closer to $6 million to $15 million ARR. Below that, traction is real: approximately 136,000 unique wallet addresses, around 9.9 million monthly website visits (roughly 330,000 daily), and new Pro subscriptions hovering near 1,400 per day. This is a legitimate business, but a low-margin one, whose economics are constrained by the compute power it purchases.

This is precisely why Hyperbolic sits one layer above. If Venice is a gas station, Hyperbolic is the refinery. Venice purchases compute from the same constrained supply that everyone relies on; Hyperbolic aggregates and standardizes that fragmented supply, then sells it to Venice and all other players like it. As inference demand grows, value doesn’t just accumulate at the applications consuming compute—it also accumulates at the layer that aggregates and routes compute, capturing the cost of revenue paid by these applications.

Part Eight: Why This Matters Now

Nvidia has restructured its finances around service tokens. Cerebras’s IPO demonstrates that the market now recognizes inference as the bottleneck. Anthropic’s scramble for capacity proves this is a real issue. Agentic and physical AI will multiply demand by orders of magnitude across both cloud and edge endpoints.

It also closes the loop on the “$600 billion problem” from the other side. Cahn’s bearish thesis—that overbuilding leads to oversupply—will likely be proven correct. But oversupply is precisely the optimal scenario for lightweight aggregators: as GPU prices decline and supply becomes fragmented across dozens of clouds, the player that owns no hardware and routes each workload to the cheapest available GPU will capture the spread, while operators holding depreciating GPUs bear the losses. Hyperbolic is betting on oversupply, not against it.