Editor’s Note: As the capabilities of large models continue to improve, the AI application layer is facing widespread concern: If model companies like OpenAI and Anthropic control both the underlying models and distribution channels, along with brand advantages, what can startups still do on the application layer?

This is precisely the question Joe Schmidt, a partner at a16z, seeks to answer in this article. Using the “Yellow Brick Road” from The Wizard of Oz as a metaphor, he categorizes AI application opportunities into two types: one is the main road that large model companies are entering themselves—such as code generation, writing, image generation, general-purpose agents, and horizontal productivity assistants; the other is “elsewhere in Oz”—those vertical scenarios that deeply integrate into industry workflows, rely on complex processes, data accumulation, compliance governance, and system integration capabilities.

In his view, the real opportunity for startups lies in the latter.

From sales to insurance, Joe Schmidt consistently emphasizes the same logic: businesses are willing to pay not for a smarter chat window, but for a system that takes responsibility for business outcomes. It must understand the chaos of customer data, handle multi-person approvals and edge cases, assume compliance and audit responsibilities, and manage migration, routing, and cost optimization for clients as models continuously evolve.

This is the core insight of this article regarding the next generation of enterprise software: underlying models will grow increasingly powerful and more interchangeable; what truly cannot be replaced is the data, processes, governance capabilities, and operational knowledge accumulated around specific industries and workflows. The opportunity for AI application companies lies not in competing with model companies for the “yellow brick road,” but in entering areas that are more complex, messier, slower—but also closer to real business value.

The following is the original text:

Recently, I’ve heard the same question repeatedly from founders and potential employees: Is there anything left to do in the AI application layer, or will OpenAI and Anthropic ultimately kill everything?

There’s a very typical AI-induced anxiety behind this question. Some have already concluded that the only long-term valuable positions, if one wants to avoid becoming a permanent underclass, are either inside large model labs or in startups focused on robotics, hard tech, or similar cutting-edge fields—in theory, doing things that “labs can’t touch.” Because if every type of software will be absorbed—either directly consumed by Codex or Claude, or rendered unnecessary by some future model—the best choice seems to be: run fast!

I admit that I’m almost an AI maximalist myself, and I think they’re half right. Large model labs are indeed entering vast areas of the application layer. But the “application layer” is not a homogeneous set of opportunities. The real distinguishing factor is whether you’re on the yellow brick road or somewhere else in Oz.

The so-called "Yellow Brick Road" refers to the path that large model labs are currently pursuing and investing heavily in. Problems such as code generation, writing, and image creation are naturally suited for labs because they improve directly as the model's core capabilities advance: every dollar invested in pre-training and post-training directly enhances product quality.

Elsewhere in Oz, more complex—and often more vertical—challenges exist. These cannot be solved simply by giving an enterprise user a horizontal tool that connects to standard tools and computing capabilities. Here, value comes more from the scaffolding around the model: scaffolding that makes outputs trustworthy, compliant, and truly integrated into industry-specific business processes. While the raw capabilities of the underlying model remain important, they are no longer the whole story.

We are seeing this unfold in real time. OpenAI and Anthropic are effectively acknowledging to the market that they cannot solve all problems with a single general-purpose AI colleague. They have announced massive forward-deployment partnerships focused on building entire companies around configuring and customizing models for enterprises. If they truly believed the next model release would solve these issues, they wouldn’t be investing billions of dollars into such initiatives.

So, if you want to make money with AI applications, don’t follow the yellow brick road—instead, go build elsewhere in Oz. Here are some lessons we and several founders in our portfolio have learned through practice.

Yellow Brick Road

If you're starting a company, the yellow brick road is the most obvious path—but also the most dangerous. Take a high-performance model, connect it to off-the-shelf integrations like Google Drive, Slack, Salesforce, Notion, and GitHub, and layer an agent orchestration layer on top. It looks like magic.

The issue is that this is precisely what the large model labs are doing through Cowork and Codex. Clearly, they possess the models, which gives them better profit margins, greater control, and pricing power over all downstream participants. But perhaps more importantly, they also control the architectural decisions that determine what problems a product is suited to solve. So far, they have deliberately adopted the “model + tool calling” approach—which is exactly the pattern needed for horizontal, low-step-count tasks on the yellow brick road. Even if a startup could somehow outperform Codex or Claude Code, the large model labs still possess massive distribution capabilities and the strongest brand aura in the AI field.

If you are an AI application company using the same approach—integrating the same connectors, with no underlying sub-agents or configurations, and no distribution channels—then you are likely on a path leading to nothing.

Other parts of Oz

For startups, the situation isn’t all bleak. Beyond the yellow brick road, there are still enormous opportunities. Startups can acquire customers and solve complex problems in these areas.

These companies are building agent experiences: models are woven into a complex web of tools, automations, and integrations; in other words, software. This also makes most of these startups inherently vertical. They can focus on multi-step, multi-party workflows, designing sub-agents tailored to different roles and vertical scenarios to address challenges that horizontal platforms from Anthropic and OpenAI struggle to reach: gathering context across systems and routing tasks to multiple stakeholders who need to approve at different stages.

This type of work typically involves one or more legacy systems, often requiring deterministic outcomes, as ambiguity is unacceptable, and sometimes directly ties to critical business results. Large model labs are well aware of how valuable these issues are: this is why they are building their own outsourced configuration teams and why an entire ecosystem of companies offering reinforcement learning services to enterprise clients is emerging.

Why aren't other parts of Oz completely taken over by the Wizard?

A counterargument to the above is that betting against the models or the labs continuing to advance has always been a poor trade. They are likely to keep growing stronger and ultimately capture the markets served by these application-layer companies.

The large model labs will certainly continue to advance. However, I believe that companies elsewhere in Oz still have several durable defenses over the long term.

Data and Learning Flywheel

Many of the things you truly internalize in your work do not exist in any training dataset: unwritten industry norms, undocumented standards, and tribal knowledge held in the minds of practitioners. None of these are available on the public internet. No amount of training compute can replace actually stepping inside the workflows where this knowledge resides.

Two flywheels are at play here: one is the cross-customer flywheel, where patterns compound as you encounter more variations of the same type of problem; the other is the within-customer flywheel, where the underlying reasons for specific decisions, unspoken exceptions, and the company’s own heuristics only emerge through real user-system interactions.

Even if customer data cannot be shared across customers, the company can still leverage pattern recognition across different types of customer issues to inform the design of future solutions. A company that has already had its agents handle a hundred legal compliance updates, a thousand insurance underwriting cycles, or ten thousand SDR sales development activities possesses an understanding of problem patterns that a newcomer cannot replicate simply by launching a new agent for the first time.

In theory, a horizontal agent could also build the same learning infrastructure. In practice it does not, partly because it lacks focus, but mainly because of user experience. Capturing this knowledge depends entirely on the workflow interfaces you provide to users. Vertical players can design these interfaces around the specific information their workflow truly needs, which horizontal tools cannot do. Evaluation sets, annotated outputs, and edge-case taxonomies can be combined into a vertical data flywheel that further supports fine-tuning. New entrants without an equivalent scale of production exposure will find it difficult to generate such a flywheel. Whether it’s feasible depends on data rights, accumulated production usage, and customer contract structures, but pattern recognition itself continues to accumulate over time.

Manage model volatility and complexity

Large model labs already implement routing internally: they invoke different classes of models depending on the request and use model ensembles under the hood. However, they cannot route across vendors, nor can they easily evaluate competitors’ models for specific subtasks or deploy the open-source fine-tuned model that is truly the best fit for a narrow domain.

Companies elsewhere in Oz select the most suitable model for each subtask across the entire model market, rather than relying solely on models released by a single parent lab. They also take on the work that no one else wants to do: re-running evaluations whenever a new model is released, re-tuning prompts for customers’ edge cases, and deploying changes without disrupting production environments. Large model labs do not do these things for their customers: they sell you a new model and tell you to migrate yourself. Companies elsewhere in Oz absorb the migration costs. Customers receive the best intelligent capabilities available across the entire market, along with continuity throughout every upgrade cycle.
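To make the per-subtask selection described above concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from the article: the names `accuracy` and `pick_models`, the exact-match scoring, and the two stand-in "vendor" functions, which simply transform strings so the example runs without any real model API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical eval case: (prompt, expected output).
EvalCase = Tuple[str, str]
# A "model" is any callable from prompt to text; a real vendor SDK call
# would sit behind this interface.
Model = Callable[[str], str]


def accuracy(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of eval cases the model answers exactly right."""
    if not cases:
        return 0.0
    hits = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return hits / len(cases)


def pick_models(models: Dict[str, Model],
                evals: Dict[str, List[EvalCase]]) -> Dict[str, str]:
    """For each subtask, return the name of the best-scoring model."""
    routing = {}
    for subtask, cases in evals.items():
        routing[subtask] = max(
            models, key=lambda name: accuracy(models[name], cases)
        )
    return routing


# Stand-in models from two hypothetical vendors.
def vendor_a(prompt: str) -> str:
    return prompt.upper()


def vendor_b(prompt: str) -> str:
    return prompt[::-1]


models = {"vendor_a": vendor_a, "vendor_b": vendor_b}
evals = {
    "shout": [("hi", "HI"), ("ok", "OK")],
    "reverse": [("abc", "cba")],
}
routing = pick_models(models, evals)
print(routing)
```

When a new model ships, adding it to `models` and calling `pick_models` again is the "re-run evaluations" step; the routing table changes only where the new model actually scores higher on a given subtask.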

Cost optimization

Sending every query to Opus 4.7 is the fastest way to turn gross margins negative. The best Oz companies route tasks across different model tiers: assigning the most difficult tasks to frontier models, the majority of tasks to medium-sized models, and using smaller, customized, or fine-tuned models where proven effective.
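A rough sketch of tiered routing, in Python, shows why this matters for margins. The tier names, the 1-to-10 difficulty scale, and the per-call costs below are all hypothetical placeholders, not real vendor pricing; the only point is the shape of the calculation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_call: float  # hypothetical dollars per call, not real pricing
    max_difficulty: int   # hardest task (1-10 scale) this tier is trusted with


# Ordered cheapest first. All values are illustrative assumptions.
TIERS = [
    Tier("small_finetuned", 0.001, 3),
    Tier("medium", 0.01, 7),
    Tier("frontier", 0.10, 10),
]


def route(difficulty: int) -> Tier:
    """Return the cheapest tier whose validated ceiling covers the task."""
    for tier in TIERS:
        if difficulty <= tier.max_difficulty:
            return tier
    return TIERS[-1]  # anything harder still goes to the frontier tier


def total_cost(difficulties: List[int]) -> float:
    """Sum the per-call cost of routing each task to its tier."""
    return sum(route(d).cost_per_call for d in difficulties)


workload = [2, 2, 5, 5, 5, 9]  # made-up mix of easy, medium, and hard tasks
routed = total_cost(workload)
all_frontier = len(workload) * TIERS[-1].cost_per_call
print(f"routed: ${routed:.3f}  all-frontier: ${all_frontier:.3f}")
```

The `max_difficulty` ceilings stand in for the "where proven effective" condition: in practice a tier would only be trusted with a class of task after it has passed that task's evaluation set.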

Some companies are now building on this foundation to perform their own post-training, optimizing models to focus precisely on the narrow set of tasks that matter most to their customers, and offering services at a fraction of the cost of cutting-edge API calls. Large model labs price at the "floor": the minimum level of intelligence you can buy for X dollars. Oz companies, by contrast, sell the opposite: the lowest dollar cost to achieve precisely the level of intelligence needed for a specific workflow. This is only possible when you have a clear understanding of exactly what level of intelligence each subtask requires—and large model labs are structurally unable to understand every task within every vertical industry. Ultimately, this translates directly into lower, more predictable pricing for results.

Governance

Becoming the control plane for customers running AI in a specific vertical generates significant value. This control plane is where permissions, audits, what agents are authorized to do, and what agents actually do converge.

This control plane is built on guardrails specific to use cases, and these guardrails vary entirely across industries and job roles. Because these companies own end-to-end the tools, workflows, and data that agents interact with, they can deliver deterministic outcomes in ways that horizontal tools cannot match. They also absorb regulatory complexity on behalf of end buyers: the U.S. Federal Rules of Civil Procedure and attorney conduct rules in law, HIPAA in healthcare, SEC and FINRA regulations in finance, state insurance regulations, and more. Horizontal players cannot credibly achieve this without becoming a hundred different vertical industries. CIOs need a partner that can explicitly commit in contract to assume responsibility for compliance handling of the agents provided.
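As a minimal sketch of what "where permissions, audits, what agents are authorized to do, and what agents actually do converge" could look like in code, the Python below checks each attempted action against a permission set and records every attempt, allowed or not. The `ControlPlane` class, the agent name, and the action names are hypothetical; a real control plane would also need persistence, identity, and policy versioning.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class AuditEntry:
    timestamp: str
    agent: str
    action: str
    allowed: bool


@dataclass
class ControlPlane:
    # Maps agent name -> actions that agent is authorized to take.
    permissions: Dict[str, Set[str]]
    log: List[AuditEntry] = field(default_factory=list)

    def authorize(self, agent: str, action: str) -> bool:
        """Check the action against policy and record the attempt either way."""
        allowed = action in self.permissions.get(agent, set())
        self.log.append(
            AuditEntry(
                timestamp=datetime.now(timezone.utc).isoformat(),
                agent=agent,
                action=action,
                allowed=allowed,
            )
        )
        return allowed


# Hypothetical policy for one insurance customer.
plane = ControlPlane(
    {"underwriting_agent": {"read_submission", "draft_quote"}}
)
ok = plane.authorize("underwriting_agent", "draft_quote")
denied = plane.authorize("underwriting_agent", "bind_policy")
print(ok, denied, len(plane.log))
```

Logging denied attempts as well as allowed ones is the design choice that matters here: the audit trail shows what the agent tried to do, not only what it was permitted to do.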

All of this ultimately comes down to one thing: focus.

This focus could be a vertical industry, such as insurance, law, or accounting, or a function that has been deeply refined, such as sales, customer service, or finance. Regardless of the choice, this work requires a team to consistently engage with the same type of client base over the long term, understanding its workflows, edge cases, and regulatory requirements. Large model labs were not built for this purpose. They must serve everyone and cover everywhere—that’s precisely why they paved the yellow brick road in the first place. The same trade-off also makes it difficult for them to venture into other parts of Oz: you can be everywhere at once, or you can excel at one thing—but you cannot do both.

Using sales as an example: Practical advice from the CEO of 11x

In practice, how should one understand this? Here are some practical recommendations from Prabhav Jain, CEO of 11x.

Focus on results

To build a company that can withstand the impact of large model labs, a viable tactical approach is to start with the specific outcomes that customers truly care about. For us, that outcome is helping businesses generate more sales leads and pipelines.

From here, the questions become very specific: Which activities do we want to own end-to-end and that can genuinely drive pipeline growth? Break down each activity into tasks. Which tasks are suitable for agents, and which are not? Which require deep domain expertise, and which do not? While large model labs will also release workflows, when a workflow has many steps, messy inputs, hard-to-explain states, or real-world constraints, simply having a better model won’t get things done. At this point, the work reverts to traditional software engineering—and at this level, large model labs have no advantage over a focused application company.

For example, some of the tasks we handle include lead generation based on custom signals, lead data enrichment, in-depth account research, context extraction from CRMs, crafting messages for different channels, lead qualification agents, and email delivery systems. Some of these are agent-based tasks, while others are not. These tasks cannot be completed with a single prompt—they require deep engineering capabilities.

The key insight from the Oz analogy is that in any real workflow, roughly half of the tasks are non-agent tasks, and on that half the labs have no advantage. Beneath the model layer, their ability to write deterministic software is no better than yours. The other half, the agent tasks, still requires you to fine-tune, train, and constrain the model around the outcomes you truly want.

Domain knowledge is often absent from general training data. These capabilities must be built bottom-up from vertical industries or specific functions and fed into the model at the right moments in the workflow. When our agent evaluates over the phone whether an inbound lead is qualified, it must be trained to understand what constitutes a successful sales conversation for a specific industry and target user profile. This is the work that application companies must do—and this capability compounds over time.

More importantly, these capabilities continually become outdated as the businesses themselves evolve. Therefore, your ability to continuously adapt your workflows and context becomes a competitive advantage in itself. For example, when we first launched our scalable email outreach product, AI-generated emails were just beginning to emerge. Fast forward to today, people have developed a keen sense for distinguishing between AI-written emails and those that feel more human—and crucially, this judgment shifts every few months. Our agents must continuously adapt to these market dynamics, and it’s precisely here that our moat is built. In fact, despite this constant evolution, our reply rates have increased fourfold over the past few months, generating hundreds of millions of dollars in sales pipelines for our customers.

Tackle high-complexity problems

Complex problems are where real business value is unlocked. Otherwise, you risk finding yourself merely building a thin wrapper.

Break down any sufficiently complex business problem, and chaos will quickly emerge. Here’s a seemingly simple example from the GTM space: If a company is already your customer, you shouldn’t reach out to another contact within that company. But this isn’t simple at all.

Your CRM might contain the domain associated with that company. But what about companies with dozens of subsidiaries? What if the CRM only lists the parent company’s domain? What if an outdated matching field in Salesforce causes you to send cold outreach emails to the Chief Revenue Officer of an existing client? Real-world data is messy. Even humans struggle with it, and models won’t magically overcome this barrier. Building order from this chaos requires designing specialized agents tailored to the specific nature of the problem—not simply pointing a generic copilot at your CRM and calling it done. In fact, based on the data we have, we’ve found that our own data quality and freshness often exceed that of our clients themselves; therefore, we default to using our data as the anchor.
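The subsidiary problem above can be sketched as a suppression check that walks from a contact's domain up through known parent companies. This is a simplified Python illustration under stated assumptions: the function names, the example domains, and the `parent_of` mapping are hypothetical, and in practice building and maintaining that mapping is the hard part.

```python
from typing import Dict, Set


def normalize_domain(email_or_domain: str) -> str:
    """Turn 'Jane@Acme.com' or ' acme.com ' into a lowercase bare domain."""
    return email_or_domain.split("@")[-1].strip().lower()


def is_existing_customer(contact: str,
                         customer_domains: Set[str],
                         parent_of: Dict[str, str]) -> bool:
    """True if the contact's domain, or any parent company's domain,
    belongs to an existing customer."""
    domain = normalize_domain(contact)
    seen = set()  # guards against cycles in a messy parent mapping
    while domain and domain not in seen:
        if domain in customer_domains:
            return True
        seen.add(domain)
        domain = parent_of.get(domain)
    return False


# Hypothetical data: the CRM lists only the parent company's domain.
customers = {"acme.com"}
parents = {"acme-robotics.io": "acme.com"}

print(is_existing_customer("cro@acme-robotics.io", customers, parents))
print(is_existing_customer("vp@unrelated.com", customers, parents))
```

A plain domain lookup would miss the first contact entirely, which is exactly the failure mode of sending cold outreach into an existing account.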

Guardrails aren't just there to prevent bad things from happening; they are what customers are paying for.

Guardrails are severely underestimated. Even within the same product, each use case requires its own guardrails. For us, the assurances required by a regulated financial services prospect are entirely different from those demanded by a mid-sized SaaS client. These assurances cascade down to dictate how agents write, whom they can contact, what data they can access, what they can say during calls, and how every decision is logged.

A one-size-fits-all system would collapse under such variation. Guardrails must be built per use case, configured per customer, and continuously audited, and all of that responsibility falls entirely on the application company. This is why we need forward-deployed engineers and technical deployment strategists to fine-tune solutions for each customer’s specific requirements.

For example, we previously partnered with a Fortune 1000 company to conduct opt-in outbound calls via voice to its large SMB customer base. In the initial rounds, the answer rate was very low. We had to iterate quickly to learn how to engage this specific audience within the first 10 seconds of a call. SMB owners behave very differently from large B2B buyers or individual consumers. Today, we generate more sales opportunities for them in a single day than their entire sales team can generate in a month within that segment.

Using insurance as an example: Practical advice from the CEO of FurtherAI

Sales is just one example. Insurance is another, illustrating the same point from a different perspective. Here is Aman Gour, CEO of FurtherAI, on his understanding of "building off the yellow brick road."

When we began deploying AI into real-world insurance operations, we repeatedly heard this assumption: the model is intelligence, and workflows are merely scaffolding built around the model.

But the more insurance companies we partner with, the more convinced we become that the opposite is true.

In the insurance industry, much of the intelligence resides within the workflow itself. Two insurers may allow a submission to follow what appears to be the same path: submission, review, quoting, underwriting. The path itself is straightforward. What truly distinguishes the two insurers is everything inside the path: which risks require escalation, which loss signals are significant, which underwriting preference rule takes precedence when two conflict, when human approval is mandatory, which external data must be retrieved, and how the final decision is recorded.

These logic elements do not exist within a clean rule engine. They are scattered across standard operating procedures, manager approvals, underwriting philosophies, insurer-specific risk appetites, and years of operational experience. Much of this logic is not documented in a form that models can directly read.

That’s why we don’t believe in pure agents that reason from scratch every time, nor in rigid workflows that collapse under the complexity of reality. Instead, we’ve been building agent workflows. Workflows bring repeatability, auditability, and cost control; agents handle variability and recover the process when the ideal path breaks down; and humans remain in the loop for decisions requiring judgment and accountability.
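The division of labor described above, with deterministic steps, an agent that recovers when a step fails, and a human who signs off on flagged cases, can be sketched in Python as follows. This is an illustrative assumption of how such a loop might be structured, not FurtherAI's implementation; the step names, the stand-in recovery function, and the insured value it fills in are all hypothetical.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict], Dict]


def run_workflow(submission: Dict,
                 steps: List[Step],
                 agent_recover: Callable[[str, Dict, Exception], Dict],
                 needs_human: Callable[[Dict], bool],
                 human_review: Step) -> Dict:
    """Run deterministic steps in order; hand failures to an agent;
    route flagged results to a human. Every hop lands in the trail."""
    state = dict(submission)
    trail: List[str] = []
    for step in steps:
        try:
            state = step(state)
            trail.append(f"{step.__name__}: ok")
        except ValueError as err:
            state = agent_recover(step.__name__, state, err)
            trail.append(f"{step.__name__}: recovered by agent")
    if needs_human(state):
        state = human_review(state)
        trail.append("human_review: done")
    state["trail"] = trail
    return state


def extract_fields(state: Dict) -> Dict:
    if "tiv" not in state:  # total insured value
        raise ValueError("missing total insured value")
    return state


def price(state: Dict) -> Dict:
    # Hypothetical flat 1% rate, in whole dollars.
    return {**state, "premium": state["tiv"] // 100}


def stand_in_agent(step_name: str, state: Dict, err: Exception) -> Dict:
    # A real agent would read attachments or query external data here.
    return {**state, "tiv": 1_000_000, "agent_filled": True}


result = run_workflow(
    {"insured": "Acme"},
    steps=[extract_fields, price],
    agent_recover=stand_in_agent,
    needs_human=lambda s: s.get("agent_filled", False),
    human_review=lambda s: {**s, "approved": True},
)
print(result["trail"])
```

The `trail` list is the auditability piece: it records which steps ran cleanly, where the agent stepped in, and when a human reviewed the result, which is the raw material for the feedback loop the next paragraph describes.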

On the first day, this system automates manual tasks. But over time, each escalation becomes a signal, each exception a feedback loop, and every human correction reveals where the original playbook was incomplete. Gradually, the workflow evolves from a mere script into the insurance company’s operational memory.

This is precisely the part that large model labs struggle to reach. They will continue releasing better models and better general agents, and they should. But they won’t remain embedded in an insurance company’s production workflows to learn why a certain account was escalated, why a particular risk was declined, or why an underwriter overrode the risk appetite guidelines and turned out to be right.

This understanding can only come from running the same workflow tens of thousands of times in a production environment. The workflow you deliver on day one is not a moat; it’s the cyclical accumulation of production use over time that creates the moat.

For us, this is what it means to "build off the yellow brick road."

How can you tell if you're somewhere else in Oz, or still on the Yellow Brick Road?

Tools and Steps Test

How many steps does this task require? How complex are the tools you need to build to support it?

Compare this to using a horizontal AI to search in Google Drive: it’s a single-step action for one tool, and the tolerance for error is high. If the user finds the summary incorrect, they can simply ask again.

Now consider a task involving multi-step legal compliance adjustments based on a law firm’s precedents over the past three years: it may involve dozens of steps, multiple tools, require partner review, and even need to be argued in court. Both appear as if “an agent is performing a task,” but only the latter requires the deep software built over many years by a dedicated team.

System Test

Are you building a system for customers to run their workflows, or are you adding a tool on top of the customer’s existing system?

A system owns the end-to-end workflow: data capture, governance, and completion tracking. When customers describe how their actual work gets done, they are describing this system. A tool simply adds a layer of intelligence to workflows customers are already running.

Tool-type products can generate real revenue, but large model labs are more likely to take it away because customers aren’t dependent on you as the orchestration layer. High ACV is typically a signal of a system-type product, since systems replace actual human labor and therefore command corresponding payments. However, this is not an absolute guarantee. Ask yourself: If a large model lab launched a product that directly competes with yours, would customers still need your tool? If the answer is yes, you’re building a system. If the answer is no, you’re a tool—even if your ACV is high.

Hedge Fund / Profit and Loss Statement Test

Large model labs are evaluated on benchmarks; companies elsewhere in Oz are evaluated on their customers' profit and loss statements.

Customers don’t care about your model’s score on SWE-Bench or MMLU. They care about whether your agent executed an order, correctly revised contract redlines, or underwrote the right policy. If customers care about specific workflow outcomes rather than generic capability scores, you’re somewhere else in Oz. If customers are paying for generic capabilities, you’re selling what they can get through a Claude or Codex subscription.

The best agent firms must execute like hedge funds: they win on alpha, and alpha is measured in client P&L, not in benchmark scores.

Both can win, and both will win.

We will see major winners both on and off the yellow brick road. The labs will continue to win because they possess the models and the distribution capabilities designed for horizontal tools.

Elsewhere in Oz, companies can win too, provided they own working systems: the interfaces where businesses actually execute work, and the data that flows through and is captured there. These companies possess data capture, workflow action systems, and governance. As complex workflows within a vertical mature, they compound into a core experience customers cannot do without. As both established players and new entrants release successive generations of models, these companies will become the layer that integrates and delivers those models to customers. The underlying models are interchangeable, but the working system is not.

The next generation of enterprise software will be built beyond the yellow brick road.