OpenClaw Session Burns 21.5 Million Tokens in a Day; Optimization Strategies Reduce Costs

Odaily

Why My OpenClaw Sessions Burned 21.5M Tokens in a Day (And What Actually Fixed It)

Original author: MOSHIII

Peggy, BlockBeats

Editor’s Note: As agent applications rapidly gain adoption, many teams have observed a counterintuitive phenomenon: the system operates normally, yet token costs continue to rise unnoticed. Through an analysis of a real OpenClaw workload, this article reveals that the root cause of cost spikes often lies not in user inputs or model outputs, but in the overlooked practice of cached prefix replay. The model repeatedly re-reads large volumes of historical context during each invocation, leading to massive token consumption.

The article, using specific session data, demonstrates how large intermediate outputs—such as tool outputs, browser snapshots, and JSON logs—are continuously written into the historical context and repeatedly read during the agent loop.

Through this case, the author presents a clear optimization framework: from designing context structure and managing tool outputs to configuring compaction mechanisms. For developers building Agent systems, this is not only a technical troubleshooting record but also a practical guide to saving real money.

The following is the original text:

I analyzed a real OpenClaw workload and identified a pattern that I believe many Agent users will recognize:

Token usage looks very active.

Replies look normal.

Yet token consumption suddenly surges.

Below is the structural breakdown, root cause, and practical remediation path for this analysis.

TL;DR

The largest cost driver is not that user messages are too long, but rather that a massive volume of cached prefixes is being repeatedly replayed.

Based on the session data:

Total tokens: 21,543,714

cacheRead: 17,105,970 (79.40%)

Input: 4,345,264 (20.17%)

Output: 92,480 (0.43%)

In other words: The cost of most calls is not actually due to processing new user intentions, but rather from repeatedly reading through large historical contexts.
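The arithmetic behind this breakdown is easy to verify. A minimal check using the figures above:

```python
# Recompute the TL;DR breakdown from the raw counts reported above.
cache_read = 17_105_970
input_tokens = 4_345_264
output_tokens = 92_480

total = cache_read + input_tokens + output_tokens
print(total)                           # 21543714
print(f"{cache_read / total:.2%}")     # 79.40%
print(f"{input_tokens / total:.2%}")   # 20.17%
print(f"{output_tokens / total:.2%}")  # 0.43%
```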

The moment of "Wait, how did this happen?"

I originally assumed that high token usage came from: very long user prompts, large volumes of generated output, or expensive tool calls.

But the dominant pattern is:

New input: hundreds to thousands of tokens per call

cacheRead: 170,000 to 180,000 tokens per call

In other words, the model repeatedly reads the same large, stable prefix in each round.

Data range

I analyzed data at two levels:

1. Runtime logs

2. Session transcripts

It should be noted that:

Run logs are primarily used to monitor behavioral signals such as restarts, errors, and configuration issues.

Accurate token statistics are derived from the usage field in the session JSONL.

Script used:

scripts/session_token_breakdown.py

scripts/session_duplicate_waste_analysis.py

Generated analysis file:

tmp/session_token_stats_v2.txt

tmp/session_token_stats_v2.json

tmp/session_duplicate_waste.txt

tmp/session_duplicate_waste.json

tmp/session_duplicate_waste.png
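The original scripts are not included here, but the core of a breakdown like `session_token_breakdown.py` can be sketched in a few lines. The field names (`usage`, `input`, `output`, `cacheRead`) are assumptions modeled on the article's figures, not a confirmed schema:

```python
import json
from collections import Counter
from pathlib import Path

def session_token_breakdown(sessions_dir):
    """Aggregate token usage per session from JSONL transcripts.

    Assumes each JSONL record may carry a `usage` object with
    `input`, `output`, and `cacheRead` counts; records without a
    usage field (e.g. plain events) are skipped.
    """
    totals = {}
    for path in Path(sessions_dir).glob("*.jsonl"):
        counts = Counter()
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            usage = json.loads(line).get("usage") or {}
            for key in ("input", "output", "cacheRead"):
                counts[key] += int(usage.get(key, 0))
        counts["total"] = counts["input"] + counts["output"] + counts["cacheRead"]
        totals[path.stem] = counts
    return totals
```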

Where are the tokens actually consumed?

1) Session concentration

One session is consuming significantly more than the others:

570587c3-dc42-47e4-9dd4-985c2a50af86: 19,204,645 tokens

Then comes a clear cliff-like decline:

ef42abbb-d8a1-48d8-9924-2f869dea6d4a: 1,505,038

ea880b13-f97f-4d45-ba8c-a236cf6f2bb5: 649,584

2) Behavior concentration

Tokens primarily come from:

toolUse: 16,372,294

Stop: 5,171,420

The issue primarily stems from a loop in the tool invocation chain, not from regular chat.

3) Time concentration

Token peaks are not random but concentrated in several time periods:

2026-03-08 16:00: 4,105,105

2026-03-08 09:00: 4,036,070

2026-03-08 07:00: 2,793,648

What exactly is in the huge cache prefix?

Not the conversation content, but primarily large intermediate products:

Large toolResult data blocks

Long reasoning/thinking traces

Large JSON snapshots

File listings

Browser scraping output

Sub-agent conversation histories

In the largest session, the approximate character counts are:

366,469 characters

assistant:thinking: 331,494 characters

assistant:toolCall: 53,039 characters

Once this content is retained in the historical context, it can be re-read as part of the cached prefix on every subsequent call.
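A quick way to see which categories dominate a transcript is to tally characters per message kind. This is an illustrative sketch only; the `role`/`kind`/`text` field names are assumptions, not OpenClaw's actual schema:

```python
import json
from collections import Counter

def char_counts_by_kind(jsonl_lines):
    """Tally character counts per message kind (e.g. assistant:thinking)
    so oversized blocks in a transcript stand out."""
    counts = Counter()
    for line in jsonl_lines:
        rec = json.loads(line)
        kind = f"{rec.get('role', '?')}:{rec.get('kind', '?')}"
        counts[kind] += len(rec.get("text", ""))
    return counts
```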

Specific examples (from session file)

Large blocks of context appear repeatedly at the following locations:

sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:70

Large gateway JSON log (approximately 37,000 characters)

sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:134

Browser snapshot + secure encapsulation (approximately 29,000 characters)

sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:219

Large file list output (approximately 41,000 characters)

sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:311

session/status snapshot + large prompt structure (approximately 30,000 characters)

Duplicate content waste vs. cache replay burden

I also measured the proportion of repeated content within a single call:

Repeat ratio: approximately 1.72%

It does exist, but it's not the main issue.

The real issue is that the absolute size of the cache prefix is too large.

The structure is: a vast historical context, fully re-read on each call, with only a small amount of new input added on top.

Therefore, the focus of optimization is not on deduplication, but on designing the context structure.
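For reference, a crude repeat-ratio measurement can be made by hashing fixed-size chunks of a call's content. This is a sketch under that assumption, not the article's actual script, which presumably uses a finer-grained method:

```python
import hashlib

def repeat_ratio(text, chunk=200):
    """Share of fixed-size chunks whose hash was already seen earlier
    in the same text; a rough proxy for duplicated content."""
    seen, repeated, total = set(), 0, 0
    for i in range(0, len(text), chunk):
        digest = hashlib.sha256(text[i:i + chunk].encode()).hexdigest()
        total += 1
        if digest in seen:
            repeated += 1
        seen.add(digest)
    return repeated / total if total else 0.0
```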

Why is the Agent loop particularly prone to this issue?

Three mechanisms stack together:

1. Large tool outputs are written into the historical context.

2. The tool loop generates many short-interval calls.

3. The prefix barely changes between calls, so the cache is re-read every time.

If context compaction does not trigger reliably, the issue will quickly escalate.
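The compounding effect is easy to quantify: if every call re-reads the full prefix built by all earlier calls, cumulative replay cost grows quadratically with the number of calls. A toy model:

```python
def cumulative_replay_tokens(appends):
    """Total tokens re-read across an agent loop in which each call
    re-reads the full prefix built by all earlier calls, then appends
    its own output to that prefix."""
    prefix = 0
    total_read = 0
    for added in appends:
        total_read += prefix   # the whole history is replayed this call
        prefix += added        # this call's output joins the prefix
    return total_read

# 100 calls, each appending 2,000 tokens of tool output:
# replay alone costs 2000 * (0 + 1 + ... + 99) = 9,900,000 tokens.
print(cumulative_replay_tokens([2000] * 100))  # 9900000
```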

Most critical fixes (ranked by impact)

P0—Do not overload long-term context with massive tool outputs

For ultra-large tool output:

  • Keep summary + reference path / ID
  • Write the original payload to the file artifact
  • Do not keep the full original text in chat history.

Prioritize restrictions on these categories:

  • Large JSON
  • Long directory list
  • Full browser snapshot
  • Sub-agent complete transcript
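A minimal sketch of the P0 fix: keep a summary plus a reference in chat history, and spill the full payload to disk. The threshold, directory, and function name here are illustrative assumptions, not OpenClaw APIs:

```python
import hashlib
from pathlib import Path

MAX_INLINE_CHARS = 4_000  # threshold is an assumption; tune per model

def artifactize(tool_output, artifacts_dir="tmp/artifacts"):
    """Return tool output unchanged if small; otherwise write the full
    payload to a file artifact and return a short summary + reference
    for the chat history."""
    if len(tool_output) <= MAX_INLINE_CHARS:
        return tool_output
    digest = hashlib.sha256(tool_output.encode()).hexdigest()[:12]
    path = Path(artifacts_dir) / f"{digest}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(tool_output)
    head = tool_output[:500]
    return (f"[tool output truncated: {len(tool_output):,} chars, "
            f"full payload at {path}]\n{head}")
```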

P1—Ensure the compaction mechanism is truly effective

In this data, a configuration compatibility issue appeared repeatedly: an invalid compaction key.

This silently disables the compaction mechanism.

Best practice: Use only version-compatible configurations.

Then verify:

openclaw doctor --fix

And check the startup logs to confirm that the compaction was accepted.

P1—Reduce reasoning text persistence

Avoid repeatedly replaying lengthy reasoning text.

In production: Save concise summaries rather than full reasoning.

P3—Improve prompt caching design

The goal is not to maximize cacheRead. The goal is to use the cache on compact, stable, high-value prefixes.

Recommendation:

  • Put stable rules in the system prompt.
  • Keep unstable data out of the stable prefix.
  • Avoid injecting large amounts of debug data every round.
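These recommendations amount to a message-assembly rule: stable material first, volatile material last. A hedged sketch (role names and structure are generic illustration, not a specific API):

```python
def build_messages(system_rules, history_summary, debug_data, user_msg):
    """Assemble a call so the cacheable prefix stays compact and stable:
    stability rules and a compacted history summary lead, while volatile
    per-turn material (debug dumps, fresh state) comes last, where it
    cannot bloat or invalidate the cached prefix."""
    return [
        {"role": "system", "content": system_rules},     # stable, cacheable
        {"role": "system", "content": history_summary},  # compact summary
        # volatile material stays out of the stable prefix:
        {"role": "user", "content": f"{user_msg}\n\n[debug]\n{debug_data}"},
    ]
```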

Practical stop-loss strategy (if I were to handle this tomorrow)

1. Identify the session with the highest cacheRead percentage

2. Execute /compact on the runaway session

3. Add truncation and artifactization to the tool output

4. Rerun the token statistics after each modification.

Focus on tracking four KPIs:

cacheRead / totalTokens

toolUse avgTotal/call

Number of calls with >=100k tokens

Largest session's share of total tokens
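The four KPIs can be computed from per-call usage records. The field names here (`total`, `cacheRead`, `kind`, `session`) are illustrative assumptions:

```python
def kpis(calls):
    """Compute the four tracking KPIs from a list of per-call usage dicts."""
    total = sum(c["total"] for c in calls) or 1
    cache = sum(c["cacheRead"] for c in calls)
    tool_calls = [c for c in calls if c.get("kind") == "toolUse"]
    per_session = {}
    for c in calls:
        per_session[c["session"]] = per_session.get(c["session"], 0) + c["total"]
    return {
        "cacheRead_ratio": cache / total,
        "toolUse_avg_total": (sum(c["total"] for c in tool_calls)
                              / max(len(tool_calls), 1)),
        "calls_over_100k": sum(1 for c in calls if c["total"] >= 100_000),
        "max_session_share": max(per_session.values(), default=0) / total,
    }
```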

Signals of success

If the optimization takes effect, you should see:

The number of calls over 100K tokens drops significantly.

The cacheRead percentage drops.

toolUse's share of token consumption drops.

The dominance of any single session drops.

If these metrics haven't changed, your context strategy is still too permissive.

Commands to reproduce the analysis

python3 scripts/session_token_breakdown.py 'sessions' \
  --include-deleted \
  --top 20 \
  --outlier-threshold 120000 \
  --json-out tmp/session_token_stats_v2.json \
  > tmp/session_token_stats_v2.txt

python3 scripts/session_duplicate_waste_analysis.py 'sessions' \
  --include-deleted \
  --top 20 \
  --png-out tmp/session_duplicate_waste.png \
  --json-out tmp/session_duplicate_waste.json \
  > tmp/session_duplicate_waste.txt

Conclusion

If your Agent system appears to be functioning normally but costs continue to rise, first check this issue: Are you paying for new inference, or are you heavily replaying old contexts?

In my case, the vast majority of costs come from context replay.

Once you recognize this, the solution becomes clear: strictly control the data entering the long-term context.

Original link
