After AI consumes everything, what remains untrainable?

Introduction: As AI capabilities continue to advance, a new pessimistic view is emerging in the investment community: as models grow increasingly powerful, all application companies will eventually be absorbed by model and compute layer players like Anthropic, OpenAI, and Nvidia, leaving only cutting-edge models, compute resources, and a handful of infrastructure providers in the market. But Sarah Guo believes this assessment is only half right. Those “thin wrappers”—applications that simply layer on top of models—will indeed be absorbed, and any tasks that can be measured by benchmarks, trained on public data, and validated at low cost will gradually become commoditized.

The real question is: After AI consumes everything that can be trained, what remains untrainable?

The answer lies in the value that exists within real organizations and cannot be easily replicated from the outside: proprietary enterprise data, complex workflows, user trust, system permissions, industry judgment, compliance responsibilities, and accumulated experience from long-term operations. Models can become smarter, but they cannot automatically access a bank’s production systems; they can generate medical responses, but cannot directly gain a doctor’s trust or penetrate a hospital’s decision-making processes; they can draft legal documents, but cannot assume the responsibility that seasoned lawyers carry or arbitrarily define what constitutes competent legal work.

Therefore, the AI companies that will truly possess a moat in the future are not those that simply outsmart general-purpose models, but those that deeply penetrate specific industries to accomplish the difficult yet critical task of “translation”: organizing clients’ private realities, tools, processes, and criteria for judgment into systems that models can act upon, and over time, gradually defining what constitutes a “good outcome.” The more powerful AI becomes, the more it devalues tasks that are measurable and replicable—and the more it highlights those “untrainable” elements rooted in history, relationships, authority, and professional judgment. This is the true value that will remain after the model-driven transformation.

The following is the original text:

In mid-2026, the investor version of “AI insanity” is a feeling of despair that nothing else remains worth investing in: it seems we should just pour all our money into Anthropic and Nvidia and go home to sleep. But I’ve never felt this way. For several minor versions now, I’ve been convinced that the models are already smarter than I am; I’d be perfectly happy to buy Anthropic and Nvidia at market prices; my smartest friends are also quite confident that the models’ self-improvement will soon truly take off. Yet I still don’t feel this despair.

This despair is not foolish. Its logic is this: if models continue to improve in every aspect, then all companies built upon these models are merely thin shells waiting to be absorbed by the models themselves; the only value that will ultimately remain is compute power and cutting-edge model weights.

Software is where this despair is most evident. When Devin was released in 2024, it could solve only 13% of tasks on standard software benchmarks and was largely dismissed by the market. A year and a half later, the most advanced agents score over 80% and are beginning to handle real-world tasks inside organizations like Goldman Sachs and the U.S. Army. Almost everyone reached the same incorrect conclusion: that the models had consumed software engineering.

But now that the models have absorbed the most easily measurable part of software engineering, we are relearning what many teams have long known: engineering has always resisted measurement, and the most easily measurable aspects are not the only important ones.

Mert Demirer of MIT and his collaborators have finally quantified this: among over 100,000 developers, the latest generation of coding agents increased the volume of code written by approximately 180%, but the amount of code actually deployed to production rose by only about 30%. Writing code has become cheaper, but the remaining steps still require human involvement—and these steps are crucial. Nonetheless, the overall net impact remains remarkable.
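The gap between those two figures can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes both percentages are increases over the same baseline period, and the function name and parameters are hypothetical, not taken from the study.

```python
def deployed_share_ratio(written_increase: float, deployed_increase: float) -> float:
    """Return the new deployed-to-written ratio relative to its baseline.

    Both arguments are fractional increases (1.80 means +180%).
    """
    written_multiplier = 1.0 + written_increase    # code written vs. baseline
    deployed_multiplier = 1.0 + deployed_increase  # code deployed vs. baseline
    return deployed_multiplier / written_multiplier


# Figures quoted in the text: roughly +180% written, roughly +30% deployed.
ratio = deployed_share_ratio(written_increase=1.80, deployed_increase=0.30)
print(f"Deployed-per-written code is {ratio:.0%} of its baseline level")
```

Under those assumptions, a given unit of written code is less than half as likely to reach production as before, which is one way to read the claim that writing became cheaper while the remaining steps did not.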

A benchmark is something you can measure, and anything that can be measured can be trained. That is why coding agents matured first: compilers are free validators, and so are test suites. When answers can be checked automatically at nearly zero cost, you can keep optimizing against that verification signal until you break through it.

But passing tests never means that a change is correct for a codebase that has been running for ten years. That module may exist for three reasons no one ever documented; the deployment pipeline might barely function thanks to a cron job no one is willing to admit they wrote.

This correctness cannot be read from a leaderboard, nor can it be directly observed from anything else. You can only know if such a complex system truly works by letting it run in the real world for long enough. And more sophisticated models won’t make the real world run faster. No one would fully trust a system as large as Google just because its unit tests passed and showed green checkmarks. You trust it because it has withstood years of real-world load.

This correctness is not only private; it is also a slowly built moat, one whose timeline capital cannot compress. Even optimists acknowledge that this clock cannot be skipped. Noam Brown, a pioneer of OpenAI’s reasoning models, recently wrote that the only reliable way to evaluate an agent’s performance over a one-year cycle may be to let it actually run for a year.

As Gabe Pereyra said, true automation isn't just about models becoming stronger—it's about the product, the model, the workflows, and the company organization all evolving together, and of these four, three move at the pace of the organization.

Getting people moving is something no benchmark can touch: convincing a skeptical partner to change how she approaches her work, or keeping a team cohesive during a rebuild. That’s why, when hiring a CEO, we value their ability to handle people at least as much as their analytical skills. Models becoming smarter doesn’t change this weighting.

The feedback here is vague, the time scale is measured in years, and trust resides with specific individuals. Every company I know has already enabled every engineer to use cutting-edge coding models, but none of their engineering organizations have changed at a pace even close to the rate of model advancement. Adopting the tools took just one quarter (a truly magical quarter of token growth!), but true transformation takes years.

Readable work is leaving. Truly valuable work is unreadable by design: anything you can put on a leaderboard can be trained on, so anything measurable is already becoming commoditized. This process takes time and will never be fully complete, but the direction never reverses.

My friend Matt MacInnis at Rippling puts this in monetary terms: a token that merely answers a general question is almost worthless, because any model can answer it; a token that reasons over your company’s data is far more valuable, because it does what you actually want rather than just generating a plausible-sounding answer.

Readable work will be swallowed from two directions.

From below, tasks become saturated: once a task can be verified at low cost, buyers no longer care which model performed it—they start asking how much it costs. As a result, the task ends up going to the cheapest open-source or distilled model of the week. As long as profit margins can play a role, they ultimately will.

From above, the labs are trying to make the model swallow its own scaffolding. Routing between retrieval, cheap and expensive calls, tool usage, and even reasoning strategies: all the external mechanisms once wrapped around the model are being pulled into its weights, until the “shell” itself becomes the model. This is the absorption boundary.

Profit pressure also works in another direction: a general-purpose agent must be ready to handle anything at all times, making it costly; whereas a focused application can optimize a workflow to the extreme, consuming only a small fraction of tokens. And unlike labs that sell these tokens, application companies can retain the margin in between.

Therefore, we can ask two questions of any task: Is its correctness private and costly, and is it a truth that exists only within a company’s internal data? Is it isolated within a system inaccessible to outsiders? When combined with the level of task saturation, these questions form a 2×2 matrix.

Saturated, openly answered tasks are the domain of commoditized tokens, which open-source models will dominate. Frontier but openly answered tasks, such as coding benchmarks, are where labs will prevail, because when evaluation is free, little holds value beyond owning the model itself.

The real prize is in the last corner—the “untrainable” corner: cutting-edge work whose validity exists only within private environments. You can see this on inference clouds serving AI-native pioneers: the vast majority of tokens are generated by custom models, not by general-purpose open-source models.
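The matrix can be sketched as a small classifier. This is only an illustration of the framing above, not a scoring method from the essay: the `Task` fields, the boolean simplification of each axis, and the quadrant labels are assumptions, and the essay does not discuss the saturated-and-private corner, so that label is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    saturated: bool            # can cheap models already do it well enough?
    private_correctness: bool  # is "correct" only checkable inside the customer's systems?


def quadrant(task: Task) -> str:
    """Map a task to a corner of the 2x2 described in the text."""
    if task.private_correctness:
        if task.saturated:
            # Placeholder: this corner is not named in the essay.
            return "saturated, private (not discussed in the essay)"
        return "untrainable corner"
    if task.saturated:
        return "commoditized tokens: open-source models"
    return "open frontier: labs"


examples = [
    Task("answering a general question", saturated=True, private_correctness=False),
    Task("coding benchmark", saturated=False, private_correctness=False),
    Task("change to a bank's production system", saturated=False, private_correctness=True),
]
for task in examples:
    print(f"{task.name} -> {quadrant(task)}")
```

In practice neither axis is binary, so a real assessment would place tasks along a spectrum rather than in four boxes.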

The walls leading to this final corner vary in height. A developer’s toy codebase is portable and standardized, so climbing in isn’t difficult. But a bank’s production system is neither portable nor standardized. You won’t get root access to it just by being 2% smarter on SWE-Bench Verified.

Capabilities can absorb many things, but a better model won’t turn private, real-world standards into public ones. It doesn’t hold licenses, doesn’t sign off on liability, and doesn’t own corporate documents; when an answer is wrong, it cannot be held legally accountable. The bottleneck here isn’t intelligence—it’s authority and responsibility. You can imagine a model far smarter than any human, but it still must be granted access, and someone still must put their name on the line for what it does.

The door has a lock and a bolt.

That lock is the environment: only after establishing trust within a system, passing a security review, completing integration, and signing a contract with accountability for outcomes can you verify whether the AI has truly done something useful.

The bolt is the user. Today, most American doctors open OpenEvidence every day, and that is not something any amount of computing power can buy. A lab could train a perfect medical model tomorrow, but it still wouldn’t be able to penetrate doctors’ usage habits or UCSF’s decision-making processes, because trust is built slowly, through relationships and user consent; gradient descent cannot erase those factors.

This is precisely the work of application companies. An app secures its place in the “untrainable” corner not through glamorous efforts but through unglamorous ones: organizing a company’s private reality so the model can act upon it; equipping the model with tools to act; and working with the client to transform how its workforce actually operates.

A company that can accomplish this kind of “translation” is extremely difficult to replicate, and this translation never ends—integration and maintenance continue alongside customer relationships. The teams that win are those who place domain-specialized engineers and tools directly beside their customers.

For example, at a top-tier, established law firm, mergers and acquisitions alone account for nearly a thousand transactions per year. You can’t have hundreds of paralegals each download client documents to their desktops and hand them off to a generic agent to read through. Confidentiality alone prohibits this, not to mention a dozen other issues. Even if it were possible, what you’d learn would be fragmented: one paralegal correcting a small piece at a time, with no one seeing how an entire transaction flows.

The most important signals exist at the transaction level. A transaction has its own shape: for mergers and acquisitions, it’s NDAs, term sheets, due diligence, purchase agreements, ancillary documents, and closing checklists; for intellectual property litigation, it’s motions, discovery, prior art, and more motions. Each business area has its own structure—lawyers and tools cannot be arbitrarily interchanged.

But the real challenge this law firm must address sits at an even higher level: how to manage every practice area at once, much like a senior partner juggling hundreds of tasks while bringing in new clients and mentoring junior associates. Transforming such a firm is not a single problem you can capture in a simple evaluation task. It requires a strategist who approaches it like “data baseball”: intermediate goals are highly ambiguous, feedback is incomplete, cycles are extremely long, and the environment itself never stands still.

Unfortunately, unreadable value is also hard to sell, for the same reason it’s hard to commoditize: a company cannot tell from the outside whether AI can truly transform its operations the way benchmarks suggest. Therefore, the strongest companies stop trying to prove themselves externally and instead enter their customers’ operations first, then price the outcomes.

Sierra only charges when its agent resolves the customer’s issue; if the issue is escalated to a human, it doesn’t charge. Thus, the price itself becomes an evaluation mechanism. This works because Sierra holds the authority to define what “resolved” means. Cognition’s Devin did the same in the software domain by introducing a “performance guarantee.” You’re only qualified to offer such a guarantee when you’re trusted enough to operate inside the system.
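The pricing rule described above can be sketched in a few lines. This is a hypothetical illustration, not Sierra’s actual billing logic: the `Conversation` fields, the function name, and the per-resolution price are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    conversation_id: str
    resolved_by_agent: bool   # per the vendor's own definition of "resolved"
    escalated_to_human: bool


def billable_amount(conversations: list[Conversation], price_per_resolution: float) -> float:
    """Charge only for conversations the agent resolved without escalation."""
    billable = [
        c for c in conversations
        if c.resolved_by_agent and not c.escalated_to_human
    ]
    return len(billable) * price_per_resolution


sample = [
    Conversation("c1", resolved_by_agent=True, escalated_to_human=False),
    Conversation("c2", resolved_by_agent=False, escalated_to_human=True),
    Conversation("c3", resolved_by_agent=True, escalated_to_human=False),
]
invoice = billable_amount(sample, price_per_resolution=1.50)  # hypothetical price
print(invoice)
```

The interesting part is not the arithmetic but who sets `resolved_by_agent`: whoever defines that flag is, in effect, running the evaluation.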

Even at the level of providing token services—what everyone calls the pure commodity layer—it doesn’t behave like a commodity. The best AI-native companies concentrate their services with one or two providers, such as Baseten or Fireworks. While the cost per token tends toward commoditization over time, reliability under real traffic and consistent access to scarce compute resources do not. Where you host inference services is a different decision from which models you use. The only truly commodity-like aspect of inference is price.

A common rebuttal is: the lab is your supplier, so why wouldn’t it dump its own first-party product below cost to drive you out of business? Or simply revoke your API access and take the market for itself? This is the real version of that sense of despair. But it only holds if the model layer is a single-player game.

Clearly, that’s not the case. The model layer resembles a death match among three and a half players, alongside a group of international participants trailing by about six months in training, and a development league five times the size of last year’s. Customers want competition among their suppliers, and labs seek market share more than they seek to eliminate any specific application.

You can see this in how competitive the market among the labs is. In consumer chat, the best model has never simply captured the entire market. ChatGPT has maintained its lead through years of real-world competition; the share it has lost now flows to Gemini, not because Gemini’s model is superior but because of Android and search distribution advantages. Anthropic is currently regarded in prediction markets and online sentiment as having the best model, yet it is hardly a major player in consumer chat and has instead built its business on enterprise and coding use cases.

If an even better model cannot win users away from competitors in its most core application, it won’t easily take over a hospital’s medical records system or a bank’s liability framework. Today, the public’s choice of product is based on more than just coding capability. If the frontier model layer remains crowded, then the application layer above it will hold value.

If a task cannot be scored externally, someone internally must decide what constitutes a good answer—and that decision is the entire game. When enough such decisions are documented, they become benchmarks. Harvey released benchmarks for the legal field; Sierra released benchmarks for voice agents. You have the authority to define what “good” means in a domain because that domain is already using you. And these companies earned that right through the hard-fought battles of real-world adoption.

The true evaluation that determines the flow of money is private and company-specific: what this company considers a good outcome in such matters. This process is far from complete, as the depth of regulation far exceeds any public testing. OpenEvidence is identifying what constitutes a safe clinical answer.

All of this is not truly “measurement,” but rather judgments about what is real and what is good. These judgments are written down until they become the standards that everyone else must accept for measurement. No matter how intelligent the foundational model lab becomes, it cannot invent these standards out of thin air, because such authority exists only within the domain itself.

This authority typically resides where it already exists. Senior lawyers establish legal benchmarks. Doctors define what constitutes a safe clinical answer. It is the company that already has a customer relationship that determines what “solved” means.

The absorption boundary will continue to rise, as we continually learn to measure more work, and what can be measured gets absorbed. The untrainable ground will keep shrinking beneath those standing on it, so you cannot stop once you find a defensible position. You must keep moving toward areas that cannot yet be scored, and keep re-underwriting and reassessing risk.

On a narrow task, with your proprietary data and your own evaluation system, you can train to state-of-the-art performance and outperform general-purpose models in critical scenarios; this specialized model becomes part of your moat. On the other hand, if you're competing on the capabilities of general-purpose models, it becomes a capital war—you will lose to those with the most computational power. This is precisely the trap that companies with only shallow access and highly readable tasks are most vulnerable to.

When a company decides to train a model capable of surpassing state-of-the-art performance across a broad range of general-purpose tasks just to survive, the outcome is often already determined by the scale of its data center. The final result is rarely an independent champion, but rather acquisition by a player with sufficient computational resources.

All of the above is defense. The harder part is offense: deciding exactly what to build. This is what I’ve been searching for all year, and I’ve only found it about three times. Models can’t help with this. You point them in a direction, and they’ll follow—but they can’t tell you what’s worth pointing toward. You can’t create benchmarks for this, so you can’t train it.

This is also why established giants won’t take everything: they’ll defend the territory they already own, while the next big thing will come from someone who discovers its use before others do. Perhaps intention is a more scarce input than computational power.

Half of this sense of despair is correct. The thin shell layer is indeed being absorbed, and many things that appear to be companies today are truly just thin shells. But its judgment about “what remains after absorption” is wrong. The mechanism is clear, but the endpoint is not.

What I’m betting on is this: intelligence will continue to get cheaper, while value will keep shifting toward areas that a handful of models cannot reach. The untrainable carries historical value.

So, enter one of these fields, do the unglamorous translation work, and begin writing down what “good” means there. Because someone always will. This year’s most frequently cited benchmark scores are, in fact, a map of land that is about to become worthless, and a notice that certain people are about to lose their right to define what counts as “good.”

律动 BlockBeats