AI Agent Output Quality Correlated with Token Burn

Summary: An AI agent's output quality scales directly with the number of tokens you spend. Techniques like "Wait" (forcing deeper reasoning) and "Verify" (early validation) improve results, and extra tokens steadily reduce errors from scale and complexity, but no token budget can compensate for gaps in the training data: novelty remains the hard limit.

Author: Systematic Long Short

Compiled by: 深潮 TechFlow

Shenchao summary: The core argument of this article is simple: the quality of an AI agent's output is proportional to the number of tokens you invest.

The author doesn't speak in generalities but offers two concrete methods you can start using today, and clearly marks the boundary where tokens stop helping: the "novelty problem."

For readers using Agents to write code or run workflows, the information is highly dense and actionable.

Introduction

Alright, you have to admit this headline is definitely eye-catching—but seriously, this isn't a joke.

In 2023, when we were using LLMs to generate production code, everyone around us was stunned, because the prevailing belief at the time was that LLMs could only produce unusable garbage. But we understood something others hadn’t realized: the output quality of an agent is a function of the number of tokens you invest. That’s it.

You can see this for yourself by running a few experiments. Have the Agent complete a complex, somewhat niche programming task—such as implementing a constrained convex optimization algorithm from scratch. First, run it with the lowest thinking level; then switch to the highest thinking level and have it review its own code to see how many bugs it can find. Try medium and high levels as well. You’ll clearly observe that the number of bugs decreases monotonically as the number of tokens invested increases.
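This experiment is easy to script. The sketch below is a minimal harness; `run_agent` and `count_bugs` are hypothetical stand-ins for your actual model call (with a reasoning-effort setting) and a fresh-context review pass, stubbed here with purely illustrative numbers:

```python
# Minimal harness for the bugs-vs-effort experiment. Both functions are
# hypothetical stubs: wire them to your actual agent API.

LEVELS = ["low", "medium", "high", "highest"]

def run_agent(task: str, effort: str) -> str:
    """Stub: call your model on `task` with the given reasoning effort."""
    return f"<code for {task!r} at effort={effort}>"

def count_bugs(code: str, effort: str) -> int:
    """Stub: have a fresh agent review `code` and count the bugs it finds.
    These numbers are illustrative placeholders, not measurements."""
    return {"low": 9, "medium": 5, "high": 2, "highest": 1}[effort]

def run_experiment(task: str) -> dict:
    """Run the task at every effort level and record bug counts."""
    results = {}
    for effort in LEVELS:
        code = run_agent(task, effort)
        results[effort] = count_bugs(code, effort)
    return results
```

Swap the stubs for real calls and the monotonic decline should show up in your own counts.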

This isn't hard to understand, is it?

More tokens = fewer errors. You can take this logic one step further—it’s essentially the simplified core idea behind code review products. In a completely new context, deploy a massive number of tokens (for example, have it parse the code line by line and check each line for bugs)—this can catch the vast majority, if not all, of the bugs. You can repeat this process ten or a hundred times, examining the codebase from a “different angle” each time, and ultimately uncover every single bug.
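A multi-pass review like this can be sketched as a loop over "angles," each pass starting from a completely fresh context. `review_pass` is a hypothetical stand-in for a real agent call, and the angle list is illustrative:

```python
# Multi-pass review sketch: many fresh-context passes, each hunting bugs
# from a different "angle". `review_pass` is a hypothetical stand-in for
# a real agent call; the angles are illustrative.

ANGLES = ["off-by-one errors", "error handling", "concurrency", "resource leaks"]

def review_pass(source: str, angle: str) -> set:
    """Stub: a fresh agent context prompted to look only for `angle` bugs."""
    return {f"{angle}: example finding"}  # replace with parsed real findings

def exhaustive_review(source: str, passes_per_angle: int = 3) -> set:
    """Union findings across all angles; repeated passes dedupe naturally."""
    findings = set()
    for angle in ANGLES:
        for _ in range(passes_per_angle):
            findings |= review_pass(source, angle)
    return findings
```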

The notion that “burning more tokens improves agent quality” is further supported by evidence: teams claiming to use agents to write code end-to-end and deploy it directly to production are either the providers of the foundational models themselves or companies with extremely substantial funding.

So, if you're still struggling to get your Agent to produce production-grade code—let’s be blunt—the issue is with you. Or rather, with your wallet.

How do I know if I’ve burned enough tokens?

I wrote an entire article arguing that the problem is absolutely not the framework you built, and that you can still get excellent results by keeping things simple; I still stand by that view. You read it, followed along, and were still deeply disappointed by the agent's output. You sent me a DM, saw that I had read it, and I never replied.

This is the reply.

Your Agent performs poorly and fails to resolve issues mostly because you haven't burned enough tokens.

The number of tokens required to solve a problem depends entirely on the problem’s scale, complexity, and novelty.

What is 2 plus 2? It doesn't require many tokens.

Help me write a bot that scans all markets between Polymarket and Kalshi to identify semantically similar markets that should settle on the same event, set arbitrage-free boundaries, and automatically execute trades with low latency whenever an arbitrage opportunity arises—this will burn through a lot of tokens.

In practice, we discovered something interesting.

If you throw enough tokens at problems of scale and complexity, the agent will always resolve them. No matter how many components or lines of code your build involves, enough tokens will eventually get it fully working.

There is a small but important exception.

Your problem cannot be too novel. Right now, no amount of tokens can overcome novelty. Enough tokens can drive complexity-induced errors to zero, but they cannot make an agent invent something it has never seen.

This conclusion actually puts us at ease.

We expended tremendous effort and burned vast, vast amounts of tokens trying to see whether an agent could reconstruct an institutional investment process with almost no guidance. Part of the motivation was to gauge how many years away we (as quantitative researchers) are from being fully replaced by AI. We found that agents cannot come close to replicating a proper institutional investment process. We believe this is because they have never encountered one: institutional investment workflows simply do not exist in the training data.

So, if your problem is novel, don’t expect to solve it by simply throwing tokens at it—you’ll need to guide the exploration process yourself. But once you’ve identified the implementation approach, you can confidently deploy as many tokens as needed to execute it—no matter how large the codebase or how complex the components.

Here is a simple heuristic: Token budgets should scale proportionally with the number of lines of code.
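As a sketch, the heuristic is a one-liner; the 200-tokens-per-line default below is purely illustrative, not a measured figure:

```python
def token_budget(lines_of_code: int, tokens_per_line: int = 200) -> int:
    """Budget scales linearly with expected lines of code.
    The 200 tokens/line default is illustrative, not a measured figure."""
    return lines_of_code * tokens_per_line
```

Calibrate `tokens_per_line` on your own projects; the point is the linear scaling, not the constant.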

What exactly are the burned tokens doing?

In practice, additional tokens typically enhance the quality of an agent's work in the following ways:

Let it spend more time reasoning in the same attempt, giving it the opportunity to discover flawed logic on its own. Deeper reasoning = better planning = higher probability of success on the first try.

Allow it to make multiple independent attempts, exploring different solution paths. Some paths are better than others. By allowing more than one attempt, it can select the optimal one.

Similarly, more independent planning attempts allow it to abandon weak directions and retain the most promising ones.

More tokens allow it to critique its previous work within a completely new context, giving it a chance to improve rather than getting stuck in a pattern of "reasoning inertia."

Of course, my favorite: more tokens mean its work can be verified with tests and tools. Running the actual code to see whether it works is the most reliable way to confirm the correct answer.
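The "multiple independent attempts plus verification" pattern above can be sketched as best-of-N sampling. `attempt` and `passes_tests` are hypothetical stand-ins for a fresh-context agent call and your real test suite:

```python
# Best-of-N sketch: N independent attempts, each verified by actually
# running tests, with the first verified candidate returned. `attempt`
# and `passes_tests` are hypothetical stubs.

def attempt(task: str, seed: int) -> str:
    """Stub: one independent agent attempt (fresh context per seed)."""
    return f"solution-{seed}"

def passes_tests(solution: str) -> bool:
    """Stub: run the real test suite against the candidate. Here we just
    pretend that certain attempts pass."""
    return solution.endswith(("2", "4"))

def best_of_n(task: str, n: int = 5):
    """Spend n attempts' worth of tokens; trust tests, not self-reports."""
    for seed in range(n):
        candidate = attempt(task, seed)
        if passes_tests(candidate):
            return candidate
    return None
```

Note that a too-small `n` simply fails: the budget really is the knob.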

This logic works because Agent engineering failures are not random. They almost always occur due to prematurely choosing the wrong path, failing to verify whether that path is viable early on, or lacking sufficient budget to recover and backtrack after discovering an error.

That's the story. Tokens are, quite literally, the price you pay for decision quality. Think of it like research: ask someone to answer a hard question on the spot, and the quality of the answer drops as time pressure rises.

Research is, at its core, the process of producing "knowing the answer." Humans spend biological time to generate better answers; agents spend computational time to do the same.

How to improve your Agent

You might still be skeptical, but numerous papers support this point—and frankly, the very existence of the "reasoning" dial is all the proof you need.

One of my favorite papers shows that after training a model on a small set of carefully curated reasoning examples, the researchers forced it to keep thinking past the point where it wanted to stop, specifically by appending "Wait" wherever it tried to halt. That single change raised performance on a benchmark from 50% to 57%.
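This "keep thinking" trick (sometimes called budget forcing) can be sketched as a decoding loop that strips the model's stop marker and appends "Wait." Everything here is a hypothetical stand-in: `generate_until_stop` would be a real streaming call, and the `</think>` marker depends on your model:

```python
# "Budget forcing" sketch: when the model emits its end-of-thinking marker
# before the budget is spent, strip the marker and append "Wait" so decoding
# continues. `generate_until_stop` and the marker are hypothetical stand-ins.

END_OF_THINKING = "</think>"

def generate_until_stop(context: str) -> str:
    """Stub: one decoding round; a real version streams model tokens."""
    return " ...more reasoning..." + END_OF_THINKING

def forced_reasoning(prompt: str, min_rounds: int = 3) -> str:
    """Force at least `min_rounds` rounds of thinking before stopping."""
    trace = prompt
    for _ in range(min_rounds):
        chunk = generate_until_stop(trace)
        if chunk.endswith(END_OF_THINKING):
            # Suppress the stop marker and nudge the model onward.
            chunk = chunk[: -len(END_OF_THINKING)] + " Wait"
        trace += chunk
    return trace
```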

I want to be as straightforward as possible: if you keep complaining that the code written by the agent is subpar, even the highest thinking tier in a single pass may not be enough for you.

I have two very simple solutions for you.

Simple method 1: WAIT

The simplest thing you can start doing today: set up an automated loop—after building it, have the Agent review it N times with fresh context each time, fixing any issues it finds.
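A minimal sketch of that loop, with `review` and `fix` as hypothetical stand-ins for fresh-context agent calls:

```python
# Review-and-fix loop sketch: up to N review passes, each with a fresh
# context; stop early on a clean pass. `review` and `fix` are hypothetical
# stubs for agent calls.

def review(code: str) -> list:
    """Stub: fresh-context review; returns the issues it finds."""
    return ["found a bug"] if "bug" in code else []

def fix(code: str, issues: list) -> str:
    """Stub: agent patches the reported issues."""
    return code.replace("bug", "ok", 1)

def review_loop(code: str, n: int = 5) -> str:
    for _ in range(n):
        issues = review(code)   # fresh context each pass
        if not issues:
            break               # clean pass: stop burning tokens
        code = fix(code, issues)
    return code
```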

If you find that this simple trick improves your Agent engineering results, then you’ve at least realized that your issue is just about the number of tokens—welcome to the Token Burning Club.

Simple method 2: VERIFY

Have the Agent verify its work early and frequently. Write tests to prove that the chosen path actually works. This is especially useful for highly complex, deeply nested projects—where a single function may be called by many downstream functions. Catching errors upstream can save you significant computational time (tokens) later on. So, wherever possible, set up “verification checkpoints” throughout the entire build process.
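One way to sketch such checkpoints: pair every build step with a check and refuse to continue past a failure. The (build, check) pairs here are illustrative; in practice a check might run your real test suite:

```python
# Verification-checkpoint sketch: every build step is paired with a check,
# and a failure halts the build before errors propagate downstream.

def build_and_verify(steps):
    """`steps` is a list of (build_fn, check_fn) pairs. Stops at the first
    failed checkpoint instead of building on a broken foundation."""
    artifacts = []
    for build, check in steps:
        artifact = build()
        if not check(artifact):
            raise RuntimeError(
                f"checkpoint failed after {len(artifacts)} verified steps"
            )
        artifacts.append(artifact)
    return artifacts
```

Catching the upstream failure here costs one check; catching it after ten downstream steps costs ten steps' worth of tokens.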

The main agent says it's done after writing a section? Have a second agent verify it. An uncorrelated stream of thought won't share the first agent's systematic biases.

That’s about it. I could write much more on this topic, but I believe recognizing these two points and implementing them well will solve 95% of your problems. I firmly believe in mastering the simple things first, then adding complexity only as needed.

I mentioned that "novelty" is a problem tokens cannot solve, and I want to emphasize it again because you will eventually hit this wall and come back to complain that stacking tokens didn’t work.

When the problem you're trying to solve isn't in the training set, you're the one who truly needs to provide the solution. Therefore, domain expertise remains extremely important.
