The AI Investor’s 2026 Dilemma: What Remains of Startups’ Moats When Models Dominate?

Author: Sarah Guo

Compiled by Shenchao TechFlow

Shenchao summary: As large models begin to outperform humans on every leaderboard, investors are falling into despair: what is left to invest in besides Anthropic and NVIDIA? This top Silicon Valley investor uses data and case studies to show that the real moat is not on the leaderboards; it is hidden in areas that benchmarks cannot measure.

In mid-2026, the investor version of AI madness is despair: there’s nothing left to invest in—we should put all our money into Anthropic and NVIDIA and go home.

I’ve never felt this way. I’m convinced the models are several versions smarter than me, and I’d be happy to buy Anthropic and NVIDIA at market price—every one of my smartest friends is fairly certain self-improvement will succeed soon—but I still don’t feel that sense of desperation.

This despair is not foolish. The logic runs like this: if the model keeps improving at everything, then every company built on top of it is merely a thin wrapper waiting to be absorbed, and the only value that survives is compute and frontier weights.

Take software, the case pessimists rely on most. When Devin was released in 2024, it could solve only 13% of the tasks on a standard software benchmark and was largely ignored. A year and a half later, the best agents score over 80%, and they are already doing real work inside organizations like Goldman Sachs and the U.S. Army. Almost everyone has drawn the same incorrect conclusion: that models have consumed software engineering. But as models have consumed the most easily measurable parts of software engineering, we are rediscovering what many teams have long known: engineering has always resisted measurement, and the most easily measurable parts may not be the only important ones.

Mert Demirer and his collaborators at MIT have finally provided numbers: among over 100,000 developers, the latest coding agents have increased the amount of code written by approximately 180% and the amount of code actually deployed by about 30%. Writing code has become cheaper. The remaining portion still requires human involvement—and it remains crucial. Of course, the net impact is still astonishing.
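As a rough back-of-the-envelope reading of those two figures, the gap between them can be expressed as a ratio. This is only a sketch: it assumes both percentages are measured against the same pre-agent baseline, which the summary above does not state, and the variable names are illustrative.

```python
# Baseline is normalized to 1.0 for both quantities (an assumption).
# Only the two percentage increases come from the text.
written_increase = 1.80   # code written up by roughly 180%
deployed_increase = 0.30  # code actually deployed up by roughly 30%

written_factor = 1 + written_increase    # about 2.8x as much code written
deployed_factor = 1 + deployed_increase  # about 1.3x as much code deployed

# Code deployed per unit of code written, relative to the baseline.
relative_ship_rate = deployed_factor / written_factor

print(f"written: {written_factor:.1f}x, deployed: {deployed_factor:.1f}x")
print(f"deployed per unit written vs. baseline: {relative_ship_rate:.2f}")
```

Under that assumption, each unit of written code is a bit less than half as likely to reach deployment as before, which is one way to read the claim that writing got cheaper while the rest did not.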

A benchmark is something you can measure, and anything you can measure is something you can train against. That’s why coding agents matured first: compilers are free validators, test suites are free validators, and when answers can self-check for free, you can iteratively refine them until you outperform the benchmark. But passing tests never tells you whether a change is correct for a ten-year-old codebase with three undocumented modules and a deployment pipeline barely held together by a cron job no one dares admit they wrote.

That kind of correctness cannot be read from a leaderboard, nor from anything else. You only learn whether such a complex system works by running it in the real world for long enough, and smarter models don’t make the world run faster. No one runs unit tests on a Google-scale system and trusts the green checkmarks; you trust the system because it has withstood years of real-world load. This kind of correctness isn’t just private; it is a slow moat that capital cannot erode. Even optimists acknowledge that the clock cannot be skipped: Noam Brown, a pioneer of OpenAI’s reasoning models, recently wrote that the only reliable way to evaluate an agent over a one-year timespan might be... to run it for a year.

As Gabe Pereyra said, true automation is not just about making models better. It’s about the product, the model, the workflow, and the company moving together—and three of these four move at the organization’s pace.

The people side is beyond the reach of benchmarks: convincing a skeptical partner to change how she approaches her work, and keeping the team together during a rebuild. That’s why, when hiring a CEO, the ability to handle people is at least as important as analytical ability, and smarter models won’t shift this weighting. Feedback is ambiguous, the time horizon spans years, and trust resides in one person. Every company I know has equipped all its engineers with cutting-edge coding models, yet none has transformed its engineering organization at anywhere near that speed. Adoption took a quarter (and what a magical quarter of token growth!). The rebuild is taking years.

What is visible is what is leaving. Valuable work is structurally invisible: anything you can put on a leaderboard, you can train for, so anything measurable is already on its way to becoming commoditized. This process takes time and will never be complete, but the direction is irreversible. In the monetary terms of my friend Matt MacInnis at Rippling: tokens spent answering generic questions are nearly worthless, because any model can answer them, while tokens spent reasoning over your company’s data are far more valuable, because they do what you actually want—not just what seems plausible.

Visible work is being consumed from two directions. From below, task saturation: once a task can be cheaply verified, buyers stop asking which model performed it and start asking how much it cost, causing the work to fall to the cheapest open-source or distilled model of the week. Wherever they can make an impact, margins ultimately matter. From above, labs are trying to make models consume their own scaffolding—retrieval, routing between cheap and expensive calls, tool use, and even reasoning strategies—all the mechanisms that once wrapped the model are being pulled into the weights until the wrapper becomes the model. This is the absorption of the frontier. Margin pressure also cuts in reverse: general agents must be ready for anything, which is expensive, while focused applications can tune a workflow until it runs on a fraction of token expenditure—and unlike labs selling those tokens, they retain the margin.

So, for any type of work, we can ask two questions. Is its correctness private and expensive to establish, the kind of truth that exists only within someone’s own data? Is it isolated, locked inside a system you cannot access? Crossing these with how saturated the task is yields a 2x2 matrix. Saturated work with public answers is commodity token territory, owned by open-source models. Frontier work with public answers, the domain of coding benchmarks, is where labs win, because when evaluation is free, owning it doesn’t matter. The prize lies in the final quadrant: frontier work whose correctness exists only in private domains. You can see it in the inference clouds hosting AI-native pioneers, where the vast majority of tokens are generated by custom models, not general-purpose open-source ones.
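The matrix can be written down as a small lookup, which makes it easy to see that the paragraph names three of the four cells. This is an illustrative sketch only: the function name, the boolean inputs, and the label strings are assumptions, and the fourth cell is left explicitly uncharacterized because the text does not describe it.

```python
def quadrant(saturated: bool, private_correctness: bool) -> str:
    """Place a type of work in the 2x2 described above.

    saturated: the task is cheaply verifiable and already near its ceiling.
    private_correctness: the right answer lives in private data or systems.
    """
    if saturated and not private_correctness:
        return "commodity tokens (open-source models)"
    if not saturated and not private_correctness:
        return "public frontier (labs win)"
    if not saturated and private_correctness:
        return "private frontier (the prize)"
    # Saturated work with private correctness is not discussed in the text.
    return "not characterized in the text"


# Hypothetical examples, chosen only to exercise each branch.
print(quadrant(saturated=True, private_correctness=False))
print(quadrant(saturated=False, private_correctness=False))
print(quadrant(saturated=False, private_correctness=True))
```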

The walls in that final corner vary in height. A single developer’s toy codebase is portable and standardized, so the climb is short. But a bank’s production system is neither—gaining root access won’t come from being 2% smarter on SWE-Bench Verified.

Intelligence has consumed many things, but a better model won’t turn private ground truth into public knowledge. It doesn’t hold licenses, sign liability agreements, or possess corporate documents, and it cannot be held legally accountable when answers are wrong. Intelligence is not the bottleneck here—licensing is, and liability is too. You can imagine a model far smarter than any human, yet it still must be granted access, and someone must still take responsibility for its actions.

The door has a lock and a bolt. The lock is the environment: you can verify whether the AI has done something useful only once you are trusted inside the system, after security reviews, integration, and a contract that puts your signature on the results. The bolt is the user. Today, most doctors in the U.S. open OpenEvidence every day, and no amount of compute can buy that. A lab could train a perfect medical model tomorrow and still fail to enter doctors’ routines or UCSF’s decision-making workflows, because trust is built slowly, through relationships; it requires users’ consent, which cannot be obtained by gradient descent.

This is also work. An application earns its place by doing unglamorous tasks in untrainable corners: arranging a company’s private reality so the model can act upon it, providing the model with tools to take action, and collaborating with clients to change their employees’ reality. A company that delivers translation is hard to replicate—because translation never ends. Integration and maintenance last as long as relationships do, won by teams that place domain-specific engineers and tools right beside the client.

For example, at a top-tier white-shoe law firm, the M&A practice alone handles nearly a thousand transactions annually. For confidentiality and many other reasons, you cannot have hundreds of associates each download client files to their desktops and ask a generic agent to review them. Even if you could, what you’d learn would be fragmented, one associate’s corrections at a time, with no visibility into how the entire transaction flows. The critical signals exist at the transaction level, and transactions have a structure: for M&A, it’s NDAs, term sheets, due diligence, purchase agreements, ancillary documents, closing checklists; for IP litigation, it’s motions, discovery, prior art, more motions. Each practice area has its own structure, and neither lawyers nor tools are interchangeable across them. And the actual problem the firm solves lies one level above all of this: running every practice area in parallel, as senior partners simultaneously manage hundreds of matters while taking on new ones and training associates. Transforming such a firm is not a single task you can evaluate and execute. It requires an operator who uses data-driven methods while working with extremely vague goals, incomplete feedback, and long time horizons, in an environment that never stands still.

Unfortunately, invisible value is also hard to sell, for the same reason it’s hard to commoditize: a buyer cannot tell from the outside whether AI will transform its operations, any more than a benchmark can. As a result, the strongest companies stop trying to prove it from the outside and instead move inside, pricing outcomes directly. Sierra charges only when its agent resolves a customer issue and charges nothing when it escalates to a human. That makes the price itself the evaluation, which only works if Sierra has a clear definition of “resolved.” Cognition’s Devin takes the same approach in software, offering a “performance guarantee,” and such a guarantee can only be delivered within systems you are trusted to enter.
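The pricing rule described for Sierra (charge per resolved issue, nothing on escalation) can be sketched in a few lines. This is not Sierra’s actual billing logic: the `Conversation` type, the `invoice_cents` function, and the price are all hypothetical, and the hard part the text points to, defining “resolved,” is reduced here to a single boolean.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    # True if the agent closed the issue itself; False if it escalated
    # to a human. Deciding this flag is the real evaluation problem.
    resolved_by_agent: bool


def invoice_cents(conversations: list[Conversation],
                  price_per_resolution_cents: int) -> int:
    """Bill only for conversations the agent resolved on its own."""
    resolved = sum(1 for c in conversations if c.resolved_by_agent)
    return resolved * price_per_resolution_cents


# Hypothetical month: three resolutions, two escalations, 150 cents each.
month = [Conversation(True), Conversation(True), Conversation(False),
         Conversation(True), Conversation(False)]
total = invoice_cents(month, price_per_resolution_cents=150)
print(total)
```

The invoice is only as trustworthy as the `resolved_by_agent` flag, which is why outcome pricing depends on a definition of “resolved” that both sides accept.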

Even token serving, which everyone likes to call the commodity layer, does not behave like a commodity. The best AI-native companies concentrate their serving on one or two providers (such as Baseten or Fireworks), because while cost per token is being commoditized as expected, reliability under real traffic and guaranteed access to scarce compute are not. Where you serve is a separate choice from which models you use. Price is the only part of inference that behaves like a commodity.

A common objection is that the lab is your supplier, so why wouldn’t it simply run its own first-party products below cost to squeeze you out, or revoke your API access and take over the market entirely? This is the real version of the despair argument, and it holds only if the model layer is a solo game. It clearly isn’t: it looks more like a three-and-a-half-way death race, with a group of international players six months behind in training and a development ecosystem five times larger than last year’s. Customers want competition among suppliers, and the lab wants market share more than it wants any single application to die.

You can see this in the fiercely competitive market of lab-facing applications. In consumer chat, the best model has never simply won. ChatGPT has maintained its lead through years of real competition; the share it is now losing is going to Gemini, driven by the power of Android and search, not by a superior model. Anthropic, currently rated by prediction markets and internet sentiment as the company with the best model, is nearly irrelevant in consumer chat but has built its business in enterprise and coding. If better models cannot take users away from competitors in the most core applications, they won’t penetrate hospital records or banking compliance through integration. Public choices today are not based solely on coding. If the frontier remains crowded, its upper layers will be valuable.

If work cannot be scored externally, someone internally must decide what even constitutes a good answer—and that decision is the entire game. Enough of these decisions, written down, become a benchmark. Harvey released one for law; Sierra released one for voice agents. You win the right to define what “good” means for a domain by becoming the benchmark already in use within that field—these companies earned that right through the struggle of real-world adoption.

The evaluation of real-world value is private and varies by company. What a given firm accepts as good work on a given kind of matter is far from settled, and the depth of the law dwarfs any public test. OpenEvidence is determining what a safe clinical answer looks like. These are not measurements in the strict sense; they are judgments about what is true and what is good, written down until they become the standard by which everyone else is measured. No matter how intelligent a lab may be, it cannot write them, because that authority exists only within the field itself, and it tends to remain where it already sits. Senior lawyers draft legal benchmarks. Defining a safe clinical answer falls to physicians. And “resolved” means whatever the company that already has the customers says it means.

The frontier keeps rising as we continuously learn to measure more work, and what is measurable gets consumed. The ground of the untrainable shrinks beneath anyone standing on it, so you can’t find a defensible point and rest. You constantly move toward whatever still cannot be scored, and you keep re-covering ground. On a narrow task, with your private data and your own evaluation, you can train to the frontier and outperform general models in meaningful ways, and this specialized model becomes part of your moat. Competing on general models, on the other hand, is a capital war, and you will lose to those with the most compute; it is a trap for companies with shallow access to, and visibility over, tasks. Once a company’s survival depends on promising to surpass the frontier on general tasks, the winner is increasingly determined by data center scale, and the outcome is typically not an independent champion but a sale to those with abundant compute.

All of this is defense. The harder part is offense: choosing what to build first. This is what I spent a year searching for, and I may have found it three times. Models are no help here. They’ll do whatever you point them at, but they can’t tell you what’s worth pointing at; you can’t benchmark that, so you can’t train it. This is also why existing companies won’t take everything: they hold onto their existing territory, and the next big thing comes from those who discover its use before the rest of us do. Perhaps intent is a scarcer input than compute.

Pessimism is half right. The thin wrappers are indeed being absorbed, and much of what looks like a company today turns out to be a thin wrapper. But pessimism is wrong about what remains. The mechanism is clear; the destination is not. I’d bet on the direction: intelligence keeps getting cheaper, and value keeps sliding toward the few places models cannot reach. What cannot be trained has historically been valuable. So enter one of those places, do the unglamorous translation, and start writing down what it means to be good there, because someone will. This year’s most cited benchmark scores are a map of territory soon to become worthless, and a notice about who is about to lose the right to say what counts as good.