Why My OpenClaw Sessions Burned 21.5M Tokens in a Day (And What Actually Fixed It)
Original author: MOSHIII
Peggy, BlockBeats
Editor’s Note: As agent applications rapidly gain adoption, many teams have observed a counterintuitive phenomenon: the system operates normally, yet token costs continue to rise unnoticed. Through an analysis of a real OpenClaw workload, this article reveals that the root cause of cost spikes often lies not in user inputs or model outputs, but in the overlooked practice of cached prefix replay. The model repeatedly re-reads large volumes of historical context during each invocation, leading to massive token consumption.
The article, using specific session data, demonstrates how large intermediate outputs—such as tool outputs, browser snapshots, and JSON logs—are continuously written into the historical context and repeatedly read during the agent loop.
Through this case, the author presents a clear optimization framework: from designing context structure and managing tool outputs to configuring compaction mechanisms. For developers building Agent systems, this is not only a technical troubleshooting record but also a practical guide to saving real money.
The following is the original text:
I analyzed a real OpenClaw workload and identified a pattern that I believe many Agent users will recognize:
The session looks very active.
The replies look normal.
But token consumption suddenly spikes.
Below is the structural breakdown, root cause, and practical remediation path for this analysis.
TL;DR
The largest cost driver is not that user messages are too long, but rather that a massive volume of cached prefixes is being repeatedly replayed.
Based on the session data:
Total tokens: 21,543,714
cacheRead: 17,105,970 (79.40%)
Input: 4,345,264 (20.17%)
Output: 92,480 (0.43%)
In other words: The cost of most calls is not actually due to processing new user intentions, but rather from repeatedly reading through large historical contexts.
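Those percentages follow directly from the raw counters; a minimal sketch of the arithmetic (the counter names mirror the figures above, not any official schema):

```python
# Reproduce the TL;DR breakdown from the raw usage counters.
# Counter names mirror the figures above, not an official schema.
usage = {
    "cacheRead": 17_105_970,
    "input": 4_345_264,
    "output": 92_480,
}

total = sum(usage.values())
shares = {k: round(100 * v / total, 2) for k, v in usage.items()}

print(total)   # 21543714
print(shares)  # {'cacheRead': 79.4, 'input': 20.17, 'output': 0.43}
```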
The moment of "Wait, how did this happen?"
I originally assumed that high token usage came from: very long user prompts, large volumes of generated output, or expensive tool calls.
But the dominant pattern is:
Input: hundreds to thousands of tokens per call
cacheRead: 170,000 to 180,000 tokens per call
In other words, the model repeatedly reads the same large, stable prefix in each round.
Data range
I analyzed data at two levels:
1. Runtime logs
2. Session transcripts
It should be noted that:
Run logs are primarily used to monitor behavioral signals such as restarts, errors, and configuration issues.
Accurate token statistics are derived from the usage field in the session JSONL.
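A minimal sketch of that aggregation, assuming one JSON object per JSONL line with a `usage` dict of numeric counters (the field layout is an assumption, not the actual OpenClaw transcript schema):

```python
import json
from collections import Counter
from pathlib import Path


def session_token_totals(path):
    """Sum per-call usage counters from a session JSONL transcript.

    Assumes each line is a JSON object that may carry a `usage` dict
    of numeric counters such as input/output/cacheRead; the field
    names are illustrative, not the real OpenClaw schema.
    """
    totals = Counter()
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for key, value in record.get("usage", {}).items():
            if isinstance(value, (int, float)):
                totals[key] += value
    return totals
```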
Scripts used:
scripts/session_token_breakdown.py
scripts/session_duplicate_waste_analysis.py
Generated analysis files:
tmp/session_token_stats_v2.txt
tmp/session_token_stats_v2.json
tmp/session_duplicate_waste.txt
tmp/session_duplicate_waste.json
tmp/session_duplicate_waste.png
Where are the tokens actually consumed?
1) Session concentration
One session is consuming significantly more than the others:
570587c3-dc42-47e4-9dd4-985c2a50af86: 19,204,645 tokens
Then comes a steep drop-off:
ef42abbb-d8a1-48d8-9924-2f869dea6d4a: 1,505,038
ea880b13-f97f-4d45-ba8c-a236cf6f2bb5: 649,584
2) Behavior concentration
Tokens primarily come from:
toolUse: 16,372,294
Stop: 5,171,420
The issue primarily stems from a loop in the tool invocation chain, not from regular chat.
3) Time concentration
Token peaks are not random but concentrated in several time periods:
2026-03-08 16:00: 4,105,105
2026-03-08 09:00: 4,036,070
2026-03-08 07:00: 2,793,648
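Peaks like these become visible once per-call totals are bucketed by hour. A minimal sketch, assuming ISO-8601 timestamps on each call record:

```python
from collections import Counter


def hourly_token_peaks(calls):
    """Group per-call token totals into hourly buckets.

    `calls` is an iterable of (iso_timestamp, total_tokens) pairs;
    the ISO-8601 timestamp format is an assumption about the logs.
    """
    buckets = Counter()
    for ts, tokens in calls:
        buckets[ts[:13]] += tokens  # slice keeps "YYYY-MM-DDTHH"
    return buckets.most_common()


calls = [
    ("2026-03-08T16:02:11", 180_000),
    ("2026-03-08T16:45:03", 175_000),
    ("2026-03-08T09:10:40", 170_000),
]
print(hourly_token_peaks(calls))
```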
What exactly is in the huge cache prefix?
Not the conversation content, but primarily large intermediate products:
Large toolResult data blocks
Long reasoning/thinking traces
Large JSON snapshots
File listings
Browser scrape output
Sub-agent conversation histories
In the largest session, the approximate character counts are:
366,469 characters
assistant:thinking: 331,494 characters
assistant:toolCall: 53,039 characters
Once these contents are retained in the historical context, they are re-read as part of the cached prefix on every subsequent call.
Specific examples (from session file)
Large blocks of context appear repeatedly at the following locations:
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:70
Large gateway JSON log (approximately 37,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:134
Browser snapshot + secure encapsulation (approximately 29,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:219
Large file list output (approximately 41,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:311
session/status status snapshot + large prompt structure (approximately 30,000 characters)
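Oversized blocks like these can be found mechanically by ranking transcript lines by character count; a small sketch (the size threshold is an arbitrary choice):

```python
def largest_lines(lines, top=5, threshold=20_000):
    """Rank transcript lines by character count to spot oversized blocks.

    Returns (size, line_number) pairs, largest first, for lines at or
    above `threshold` characters. Line numbers are 1-based to match
    file:line references.
    """
    sized = [(len(line), i + 1) for i, line in enumerate(lines)]
    return sorted((s for s in sized if s[0] >= threshold), reverse=True)[:top]


# Demo with an in-memory stand-in for a session JSONL file.
lines = ["ok", "x" * 37_000, "fine", "y" * 29_000]
for size, lineno in largest_lines(lines):
    print(f"line {lineno}: {size} chars")
```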
Duplicate content waste vs. cache replay burden
I also measured the proportion of repeated content within a single call:
Repeat ratio: approximately 1.72%
It does exist, but it's not the main issue.
The real issue is that the absolute size of the cache prefix is too large.
The structure is: a vast historical context, re-reading all previous context with each call, and only adding a small amount of new input on top.
Therefore, the focus of optimization is not on deduplication, but on designing the context structure.
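The arithmetic behind that structure is worth making explicit: with a stable cached prefix of roughly 175K tokens, cacheRead grows linearly with the number of tool-loop calls, regardless of how small each new input is. A back-of-the-envelope sketch using this workload's approximate per-call figure:

```python
# A stable cached prefix is re-read in full on every call, so
# cacheRead grows linearly with the number of tool-loop calls.
prefix_tokens = 175_000   # approx. cacheRead per call in this workload
calls = 100               # order of magnitude of calls in the big session

cache_read = prefix_tokens * calls
print(cache_read)  # 17500000 -- the same order as the observed 17.1M
```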
Why is the Agent loop particularly prone to this issue?
Three mechanisms stack together:
1. A large number of tool outputs have been written to the historical context.
2. The tool loop generates a large number of short-interval calls.
3. Minimal prefix change → cache is re-read each time
If context compaction does not trigger reliably, the issue will quickly escalate.
Most critical fixes (ranked by impact)
P0—Do not overload long-term context with massive tool outputs
For ultra-large tool output:
- Keep summary + reference path / ID
- Write the original payload to the file artifact
- Do not keep the full original text in chat history.
Prioritize restrictions on these categories:
- Large JSON
- Long directory list
- Full browser snapshot
- Sub-agent complete transcript
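A minimal sketch of that P0 policy: keep a short head plus a reference path in chat history, and spill the full payload to a file artifact. The threshold, directory, and head size here are illustrative choices, not OpenClaw internals:

```python
import hashlib
from pathlib import Path


def compact_tool_result(text, artifact_dir=Path("tmp/artifacts"),
                        max_inline=2_000, head_chars=500):
    """Keep small tool outputs inline; spill large ones to a file
    artifact and return only a short head plus a reference path.

    Threshold, directory, and head size are illustrative choices.
    """
    if len(text) <= max_inline:
        return text
    artifact_dir = Path(artifact_dir)
    artifact_dir.mkdir(parents=True, exist_ok=True)
    ref = hashlib.sha256(text.encode()).hexdigest()[:12]
    path = artifact_dir / f"{ref}.txt"
    path.write_text(text)  # full payload lives on disk, not in context
    return (f"[tool output truncated: {len(text)} chars; "
            f"full payload saved to {path}]\n{text[:head_chars]}")
```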
P1—Ensure the compaction mechanism is truly effective
In this data, a configuration compatibility issue occurred multiple times: an invalid compaction key.
This silently disables the optimization mechanism.
Best practice: Use only version-compatible configurations.
Then verify:
openclaw doctor --fix
And check the startup logs to confirm that the compaction was accepted.
P1—Reduce reasoning text persistence
Avoid repeatedly replaying lengthy reasoning text.
In production: Save concise summaries rather than full reasoning.
P3—Improve prompt caching design
The goal is not to maximize cacheRead. The goal is to use the cache on compact, stable, high-value prefixes.
Recommendation:
- Put stable rules into the system prompt.
- Do not put unstable data into a stable prefix.
- Avoid injecting large amounts of debug data in each round.
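One way to honor those three rules is to make the message builder itself enforce the ordering: stable content first, volatile data last. A minimal sketch using a generic chat-message shape (not any specific provider API):

```python
def build_messages(system_rules, stable_examples, volatile_state, user_turn):
    """Order messages so the cacheable prefix stays byte-identical:
    stable rules and examples first, per-call volatile data last."""
    messages = [{"role": "system", "content": system_rules}]
    messages += [{"role": "user", "content": ex} for ex in stable_examples]
    # Anything that changes every round (debug dumps, timestamps, status
    # snapshots) goes after the stable prefix so it never invalidates it.
    messages.append({"role": "user",
                     "content": f"{volatile_state}\n\n{user_turn}"})
    return messages
```

Because the prefix is identical across calls, a provider-side prompt cache can reuse it; only the final message changes each round.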
Practical stop-loss strategy (if I were to handle this tomorrow)
1. Identify the session with the highest cacheRead percentage
2. Execute /compact on the runaway session
3. Add truncation and artifactization to the tool output
4. Rerun the token statistics after each modification.
Focus on tracking four KPIs:
cacheRead / totalTokens
toolUse avgTotal/call
Number of calls with >=100k tokens
Maximum session percentage
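All four KPIs can be computed from the same per-call usage records; a sketch with illustrative field names:

```python
def kpis(calls):
    """Compute the four tracking KPIs from per-call records.

    `calls` is a list of dicts with illustrative keys:
    session, kind ("toolUse"/"stop"), total, cacheRead.
    """
    total = sum(c["total"] for c in calls)
    cache = sum(c["cacheRead"] for c in calls)
    tool_calls = [c for c in calls if c["kind"] == "toolUse"]
    per_session = {}
    for c in calls:
        per_session[c["session"]] = per_session.get(c["session"], 0) + c["total"]
    return {
        "cacheRead_ratio": cache / total,
        "toolUse_avg_total": sum(c["total"] for c in tool_calls)
                             / max(len(tool_calls), 1),
        "calls_over_100k": sum(1 for c in calls if c["total"] >= 100_000),
        "max_session_share": max(per_session.values()) / total,
    }
```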
Signals of success
If the optimization takes effect, you should see:
Calls over 100K tokens drop significantly.
The cacheRead percentage drops.
The toolUse share of tokens drops.
The dominance of any single session drops.
If these metrics haven't changed, your context strategy is still too permissive.
Commands to reproduce the analysis
python3 scripts/session_token_breakdown.py 'sessions' \
--include-deleted \
--top 20 \
--outlier-threshold 120000 \
--json-out tmp/session_token_stats_v2.json \
> tmp/session_token_stats_v2.txt
python3 scripts/session_duplicate_waste_analysis.py 'sessions' \
--include-deleted \
--top 20 \
--png-out tmp/session_duplicate_waste.png \
--json-out tmp/session_duplicate_waste.json \
> tmp/session_duplicate_waste.txt
Conclusion
If your Agent system appears to be functioning normally but costs continue to rise, first check this issue: Are you paying for new inference, or are you heavily replaying old contexts?
In my case, the vast majority of costs come from context replay.
Once you recognize this, the solution becomes clear: strictly control the data entering the long-term context.
