Editor's Note: Many users of Claude Code find that token consumption feels too fast and long conversations quickly deplete their quota. However, from Anthropic engineers’ perspective, what truly impacts cost is not how much code you write, but whether the system consistently reuses previously processed context.

This article's core insight is how caching saves tokens. The author reused over 300 million tokens through caching in one week, peaking at 91 million in a single day. Because cached tokens cost only 10% as much as regular input tokens, 91 million cached tokens are billed like roughly 9 million regular ones. Claude Code’s long conversations appear more “durable” not because the model works for free, but because a large volume of repeated context has been successfully reused.

The key to prompt caching is "don't break the cache." Claude Code caches system prompts, tool definitions, CLAUDE.md, project rules, and conversation history in layers; as long as the prefix of subsequent requests remains consistent, Claude can directly retrieve the cache instead of reprocessing the entire context. Anthropic internally monitors prompt cache reuse rates, as they not only affect user quotas but also directly impact model service costs and operational efficiency.

For regular users, there’s no need to understand all the underlying details—just adopt a few key habits: don’t leave your session idle for more than an hour; perform proper session handoff when switching tasks; avoid frequently switching models; and place large documents into Projects rather than repeatedly pasting them into conversations.

This article is less about a token-saving trick and more about offering a Claude Code approach that aligns more closely with engineering thinking: treat context as asset management, enable continuous cache reuse, and minimize redundant computations in long conversations.

The following is the original text:

I saved over 300 million tokens this week, including 91 million in a single day.

I haven't changed any settings. This is just prompt caching working normally in the background.

But once I truly understood what caching is and how to avoid disrupting it, my sessions lasted much longer under the same usage allowance. Here’s a concise 80/20 guide to Claude Code prompt caching, without diving into deep API-level details.

TL;DR

The cost of cached tokens is only 10% of that of regular input tokens. 91 million cached tokens are billed at approximately the equivalent of 9 million tokens.

The cache TTL for the Claude Code subscription is 1 hour; the default for the API is 5 minutes; sub-agents are always 5 minutes.

The cache is divided into three layers: system layer, project layer, and conversation layer.

Switching models mid-session clears the cache, and that includes the switches made by the "Opus plan" mode.

How exactly is caching charged?

The cost of each cached token is 10% of the cost of a regular input token.

So, when my dashboard shows that 91 million tokens were cached in a single day, the actual billing is roughly equivalent to processing only 9 million tokens. This is why, over time, using Claude Code with caching feels like an almost “free” extension of your session compared to not using caching.
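
The arithmetic above can be sketched in a few lines. This is a minimal illustration that assumes the flat 10% read rate quoted in this article; the function name and the "regular-token equivalent" framing are hypothetical, and the sketch ignores cache-creation cost and output tokens.

```python
# Assumption: cached reads cost 10% of regular input tokens, as stated in the article.
CACHE_READ_RATE = 0.10


def effective_input_tokens(cached_tokens: int, fresh_tokens: int = 0) -> float:
    """Return the regular-input-token equivalent of cached plus fresh tokens."""
    return cached_tokens * CACHE_READ_RATE + fresh_tokens


daily = effective_input_tokens(91_000_000)
print(f"{daily:,.0f}")  # prints 9,100,000
```

With no caching at all, the same 91 million tokens would count in full, which is where the roughly tenfold difference comes from.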

There are two numbers on the dashboard worth paying attention to:

Cache creation: The one-time cost of writing content to the cache. The saving shows up from the next turn onward.
Cache read: Tokens Claude reuses from the cache, such as your CLAUDE.md, tool definitions, and previous messages. These cost one-tenth as much as processing them as new input.

If your Cache read count is high, it means you're effectively using the cache; if it's low, it means you're repeatedly paying for the same context.
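
One rough way to put a number on "high" or "low" is to look at what share of all input-side tokens came from cache reads. The sketch below is my own simplification, not an official metric: the function name and parameters are hypothetical and simply mirror the dashboard columns.

```python
def cache_read_share(input_tokens: int, cache_creation: int, cache_read: int) -> float:
    """Fraction of input-side tokens that were served from the cache."""
    total = input_tokens + cache_creation + cache_read
    if total == 0:
        return 0.0  # no traffic yet, so there is nothing to measure
    return cache_read / total


# Hypothetical day: 1k fresh input, 9k written to cache, 90k read back from cache.
share = cache_read_share(input_tokens=1_000, cache_creation=9_000, cache_read=90_000)
print(f"{share:.0%}")  # prints 90%
```

A share that stays low over a long session suggests the prefix keeps getting rebuilt.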

Thariq from Anthropic said something that left a strong impression on me: “We actually monitor the prompt cache hit rate, and if it drops too low, it triggers an alert—even a SEV-level incident.”

He also wrote an excellent X article. When cache hit rates are high, four things happen simultaneously: Claude Code feels faster, Anthropic’s service costs decrease, your subscription allowance appears more durable, and extended coding sessions become more realistic.

But if the hit rate is very low, everyone will suffer.

So the incentives are aligned: Anthropic wants your cache hit rate to be high, and so do you. The only things that hold you back are a few seemingly minor habits that quietly reset your cache.

How does the cache grow with each round of conversation?

Caching relies on prefix matching.

You don’t need to dive into technical details—just understand this: if the content before a certain position matches exactly what’s already cached, Claude can reuse those cached tokens.

According to the Claude Code documentation, a brand-new session unfolds roughly like this:

First turn: No cache exists yet. The system prompt, your project context (e.g., CLAUDE.md, memory, rules), and your first message are all processed in full and written to the cache.

Second turn: Everything from the first turn is now cached. Claude only needs to process its previous reply and your new message, so this turn is much cheaper.

Third turn: Same logic. Earlier turns remain in the cache; only the latest exchange needs to be processed.
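
The turn-by-turn growth above can be mimicked with a toy model. This is only a sketch of the prefix-matching idea: it compares whole named segments rather than tokens, and the function names and segment labels are hypothetical.

```python
def common_prefix_len(cached: list[str], request: list[str]) -> int:
    """Count how many leading segments of the request match the cached prefix."""
    n = 0
    for old, new in zip(cached, request):
        if old != new:
            break
        n += 1
    return n


def simulate(requests: list[list[str]]) -> list[tuple[int, int]]:
    """For each request, return (segments read from cache, segments processed fresh)."""
    cached: list[str] = []
    results = []
    for request in requests:
        hits = common_prefix_len(cached, request)
        results.append((hits, len(request) - hits))
        cached = request  # the full request becomes the cached prefix for the next turn
    return results


turn1 = ["system", "CLAUDE.md", "user-1"]
turn2 = turn1 + ["assistant-1", "user-2"]
turn3 = turn2 + ["assistant-2", "user-3"]
print(simulate([turn1, turn2, turn3]))  # prints [(0, 3), (3, 2), (5, 2)]
```

The first turn pays for everything; after that, only the two newest segments are fresh while the cached prefix keeps growing.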

The cache itself can be divided into three levels:

From Thariq's X post:

System layer: Includes base commands, tool definitions (read, write, bash, grep, glob), and output style. This layer is globally cached.

Project layer: Includes CLAUDE.md, memory, and project rules. This layer is cached per project.

Conversation layer: Includes replies and messages that grow with each round of dialogue.

If any content at the system or project level changes midway through a session, everything must be recached from the beginning. This is the most "expensive" operation. Imagine this: you’ve reached the 16th message, and then the system prompt is suddenly changed, or there’s a one-hour pause—every token from the first message onward must be reprocessed.

The 1-hour vs. 5-minute confusion

This is the most easily misunderstood part.

Claude Code subscription: Default TTL is 1 hour.

Claude API: Default TTL is 5 minutes. You can pay a higher rate to extend it to 1 hour.

Sub-agents under any plan: Always 5 minutes.

Claude.ai web chat: Not officially documented. Might be the same as the subscription version, but I haven't confirmed it yet.

Several months ago, many users complained that their Claude subscription credits were being consumed too quickly. At the time, some believed Anthropic had quietly reduced the TTL from 1 hour to 5 minutes without notifying users. However, this was not the case—Claude Code’s TTL remains 1 hour.

The real issue is that Claude Code and the API are documented separately, and because the two are entirely different things with different defaults, the figures got mixed up.

If you're running sub-agent workflows at scale or using the API directly, then the 5-minute figure matters. But for 95% of Claude Code users, the only window that truly matters is the 1-hour window.
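
To make the windows concrete, here is a small sketch that checks whether a cache is still likely to be warm. The TTL values are the ones quoted in this article; the dictionary keys, the function name, and the assumption that the clock runs from your last request are all illustrative, not taken from any official API.

```python
from datetime import datetime, timedelta

# TTLs as quoted in the article; the key names are hypothetical labels.
TTL_BY_CONTEXT = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_is_warm(last_request: datetime, now: datetime, context: str) -> bool:
    """Assume the cache survives as long as the idle time is shorter than the TTL."""
    return now - last_request < TTL_BY_CONTEXT[context]


last = datetime(2025, 1, 1, 12, 0)
after_coffee = last + timedelta(minutes=30)
print(cache_is_warm(last, after_coffee, "claude_code_subscription"))  # prints True
print(cache_is_warm(last, after_coffee, "sub_agent"))                 # prints False
```

The same 30-minute break is harmless in a Claude Code session but far past the window for a sub-agent.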

Three habits that cover 95% of users

These are the parts I find truly useful in daily use.

Don't pause for too long

If you've been idle for over an hour, the previous content has likely expired from the cache. Your next message will rebuild the cache. In this case, it’s usually more efficient to make a clean transition and start a new session rather than trying to revive an already "cold" old one.

When switching tasks, restart directly

/compact or /clear already clears the cache, so why not use this opportunity to fully reset it?

I created a session handoff skill to replace /compact. It summarizes what we’ve accomplished, which decisions are still pending, which documents are most important, and where to pick up next. Then I run /clear and paste in this summary, so we can continue exactly as if there had been no interruption.

The compact command can sometimes run slowly, while this handoff skill typically completes in less than a minute.
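
The handoff summary is easy to picture as a fixed template with four parts. The sketch below is not my actual skill, just a hypothetical Python rendering of the same four questions; every class, field, and example value here is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical container for the four things a handoff note should capture."""
    accomplished: list[str] = field(default_factory=list)
    pending_decisions: list[str] = field(default_factory=list)
    key_documents: list[str] = field(default_factory=list)
    next_step: str = ""

    def render(self) -> str:
        """Render the note as plain text that can be pasted into a fresh session."""
        sections = [
            ("Accomplished", self.accomplished),
            ("Pending decisions", self.pending_decisions),
            ("Key documents", self.key_documents),
            ("Pick up next", [self.next_step] if self.next_step else []),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # Empty sections still get a line so the gap is visible after pasting.
            lines.extend(f"- {item}" for item in items or ["(none)"])
        return "\n".join(lines)


note = Handoff(
    accomplished=["Refactored the login form"],
    key_documents=["CLAUDE.md"],
    next_step="Write tests for the login form",
)
print(note.render())
```

Pasting a note like this after /clear gives the new session its starting context in one short message.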

In Claude chat, place large documents into Projects whenever possible.

Claude.ai does not have very detailed official documentation on its caching mechanism, but Projects clearly use a different optimization approach than regular conversation threads. Therefore, if you need to paste a large document, it’s best to place it in a Project rather than directly into the conversation.

What actions silently corrupt the cache?

Several things can reset the cache without any obvious notification.

Switching models: Caching relies on prefix matching and each model has its own cache, so after a switch the next request has to process the full history again with no cache hits.

The "Opus plan" mode: This setting uses Opus during the planning phase and Sonnet during the execution phase. I previously recommended it in some token optimization videos for good reason. However, each switch between planning and execution is essentially a model switch, meaning the cache must be rebuilt. Over the long term it still helps stretch your session quota, but you should be aware of what’s happening underneath.

Editing CLAUDE.md mid-session: The change does not take effect immediately; it applies only after the next restart. The cache of the currently running session is therefore not affected.
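
The model-switch behavior described above can be pictured as a cache that keeps a separate prefix per model. This is a toy sketch under that assumption; the class, its method, and the segment labels are hypothetical, and "opus" and "sonnet" are used here only as dictionary keys.

```python
class PrefixCache:
    """Toy cache that stores one cached prefix per model name."""

    def __init__(self) -> None:
        self._prefix_by_model: dict[str, list[str]] = {}

    def request(self, model: str, segments: list[str]) -> int:
        """Return how many leading segments hit the cache, then cache this request."""
        cached = self._prefix_by_model.get(model, [])
        hits = 0
        for old, new in zip(cached, segments):
            if old != new:
                break
            hits += 1
        self._prefix_by_model[model] = list(segments)
        return hits


cache = PrefixCache()
history = ["system", "CLAUDE.md", "user-1"]
print(cache.request("sonnet", history))                             # prints 0
print(cache.request("sonnet", history + ["reply-1", "user-2"]))     # prints 3
print(cache.request("opus", history + ["reply-1", "user-2"]))       # prints 0
```

The third request carries exactly the same history as the second, yet it misses completely because the other model has never seen that prefix.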

My Free Token Dashboard

The screenshot I showed earlier is from a token dashboard.

This is a very simple GitHub repository. Give Claude Code the link and it deploys the dashboard on localhost, then reads all your past conversation records rather than starting from scratch. You’ll immediately see daily data for input, output, cache create, and cache read.

Note, however, that the dashboard only displays token data from your local device. If you switch from a desktop to a laptop, the numbers may not match, because each device keeps its own statistics.
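
For a sense of what such a dashboard does under the hood, here is a minimal sketch of the aggregation step. It is not the repository's code. It assumes each conversation record is one JSON line with a "timestamp" string and a "message.usage" object whose keys match the names in FIELDS; that record shape is my assumption and may differ from what is actually on your disk.

```python
import json
from collections import defaultdict

# Assumed usage keys; adjust them if your local records use different names.
FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def daily_totals(jsonl_lines):
    """Sum the usage fields per calendar day from an iterable of JSON lines."""
    totals = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        message = record.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if not usage or "timestamp" not in record:
            continue  # skip records that carry no usage data
        day = record["timestamp"][:10]  # assumes an ISO timestamp, YYYY-MM-DD first
        for name in FIELDS:
            totals[day][name] += usage.get(name) or 0
    return dict(totals)


sample = [
    '{"timestamp": "2025-01-01T09:00:00Z", "message": {"usage": {"input_tokens": 10, "cache_read_input_tokens": 900}}}',
    '{"timestamp": "2025-01-01T09:05:00Z", "message": {"usage": {"input_tokens": 5, "output_tokens": 40, "cache_read_input_tokens": 950}}}',
    '{"timestamp": "2025-01-02T10:00:00Z", "message": "no usage here"}',
]
print(daily_totals(sample)["2025-01-01"]["cache_read_input_tokens"])  # prints 1850
```

Because the input is whatever sits on the local disk, two machines will naturally produce two different sets of totals.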

Summary

Prompt caching is something that can be explored in great depth. Thariq’s article covers it more comprehensively—if you want the full picture, it’s worth reading.

But you don’t need to understand every detail to benefit from it. You just need to grasp the key 80/20: cached tokens cost one-tenth as much as regular input tokens; Claude Code’s TTL is 1 hour; switching models breaks the cache; cleanly handing off between tasks is usually more cost-effective than continuing an old session until it expires.