Token Costs Rise as AI Companies Transition from 'Free Use' to 'Precision Optimization'

MetaEra
It’s time to calculate your tokens like a traditional grocery store owner.

Author and source: 0x9999in1, ME News

TL;DR

  • At the end of the compute price surge lies a runaway bill. The honeymoon of subsidies from the giants is officially over; tokens have become the hard currency of the digital age, and every unit of compute carries a steep price tag.
  • Waste is everywhere and alarming. Redundant prompts, uncontrolled RAG garbage dumping, and agents stuck in infinite loops are quietly and rapidly draining corporate cash flow.
  • Extreme, engineering-driven self-rescue is urgently needed. Semantic caching, prompt compression, and model routing are no longer optional enhancements; they are the three critical lifelines.
  • The frontier Agent framework points the way. Agents like OpenClaw and Hermes are demonstrating true "token economics" in resource-constrained scenarios such as mobile devices, through precise context management and structured outputs.
  • The ultimate remedy is the awakening of ROI (return on investment) thinking. Move away from crude interventions and embrace refined operations. Rising compute costs are not the end of the industry, but a brutal consolidation that leaves only those players who truly understand and respect cost discipline.

Illusion shattered: When compute bills become death sentences

The wind has stopped.

Over the past two years, we have lived in an illusion meticulously crafted by capital and giants. In this illusion, computing power seems like tap water—turn on the faucet, and large models endlessly produce elegant language, complex code, and answers that appear to know everything.

We waste recklessly. We indiscriminately stuff lengthy documents with tens of thousands of words into prompts. We task state-of-the-art models with hundreds of billions of parameters to perform absurdly trivial tasks like “capitalize the first letter of this text.”

Why? Because it's cheap. Because OpenAI, Anthropic, and others are using investors' money to pay for us.

But now, the dream is over.

Computing power is undergoing an across-the-board price increase. This is not an exaggeration; it is the cold, hard reality unfolding before us. The battle for NVIDIA’s H100 chips has escalated from commercial competition to a geopolitical struggle. Data center energy consumption is nearing the limits of the power grid. Behind every API call is the burning of silicon and the roar of cooling towers.

The giants are no longer running a charity. The API billing unit may still be a seemingly negligible “1K tokens,” but once your business scales and daily API calls reach the millions or tens of millions, that number is anything but trivial.

That number is a waterfall. A bloodsucker. A nightmare that can wake the CFO of any startup in the middle of the night.

The token, the most fundamental atomic unit of the large-model era, is now officially denominated in dollars and renminbi. “A word worth a thousand gold” is no longer hyperbole; it is a line item on a financial statement.

With compute prices rising, how do you save on tokens?

This is not just a tricky technical issue; it’s a matter of life and death for the business model’s viability.

The Hidden Corner: How Did Your Tokens Actually Disappear?

To stop the bleeding, first locate the wound.

Many people have no understanding of token consumption. They stare at their monthly bills, which keep soaring, as if reading an incomprehensible ancient text. In reality, token loss often occurs in the most inconspicuous, hidden corners.

The cost of politeness and the trap of trash talk

Do you speak politely to AI?

Hello, could you help me out? Thank you so much—I need you to act as a seasoned marketing expert…

Stop. Hold on.

As a human, you are a gentleman. But in token economics, you are a spendthrift.

Large models have no emotions. They do not need your "please" or "thank you." They do not require those socially polite phrases that add no informational value. Every word, every punctuation mark, even every space, is a token—and all are billed.

Even more concerning is the boilerplate baked into prompt frameworks. Many developers, trying to guarantee output stability, write excessively long, convoluted system prompts: “You must adhere to the following ten principles...” “If you don’t know, answer ‘I don’t know’; do not fabricate...”

Are these words useful? Yes. But when those thousands of tokens are re-sent and re-billed on every conversation and every turn of a multi-turn interaction, the waste is staggering. The context window is not a free storage closet; it is prime real estate in Manhattan’s CBD.
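To make that arithmetic concrete, here is a toy estimate of what filler phrases cost at scale. It is a sketch, not a real tokenizer: it assumes the common rule of thumb of roughly four characters per token for English text, and the call volume and price are hypothetical.

```python
# Toy estimate of what polite filler costs at scale.
# Assumes the rough rule of thumb of ~4 characters per token for
# English; real billing requires the provider's actual tokenizer.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def monthly_cost(prompt: str, calls_per_day: int, price_per_1k: float) -> float:
    """Approximate monthly input-token cost for one recurring prompt."""
    return estimate_tokens(prompt) / 1000 * price_per_1k * calls_per_day * 30

polite = ("Hello, could you help me out? Thank you so much! I need you "
          "to act as a seasoned marketing expert and write a tagline.")
terse = "Act as a marketing expert. Write a tagline."

# Hypothetical volume and price: 1M calls/day at $0.01 per 1K input tokens.
saved = (monthly_cost(polite, 1_000_000, 0.01)
         - monthly_cost(terse, 1_000_000, 0.01))
print(f"Monthly savings from trimming filler: ${saved:,.2f}")
```

The exact dollar figure depends entirely on the assumed price and volume; the point is that a fixed overhead multiplied by millions of calls is never negligible.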

Out-of-Control RAG: Violent Document Dump

RAG (Retrieval-Augmented Generation) is hailed as a silver bullet for addressing hallucinations in large models.

But in reality, RAG is often a disaster.

Ideal RAG: Precisely retrieve the three most relevant sentences, feed them to the model, and generate a perfect answer.

Practical RAG: The user asks a question, and the vector database frantically retrieves the top ten PDF documents, each tens of thousands of words long, and slams them directly in front of the model.

“I’ll find the answer myself,” the developer thought.

This is not just laziness. It is a crime against compute.

Excessive irrelevant background information not only interferes with the model’s attention mechanism (causing the “Lost in the Middle” phenomenon), but also results in astronomical token consumption. You may think you’re asking a simple question, but you’re actually making the model read half a library—and you’re the one paying for that reading cost.
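The fix is disciplined retrieval: rank, apply a relevance floor, and pack chunks against a hard token budget instead of dumping whole documents. A minimal sketch, assuming a hypothetical pre-embedded corpus of `(vector, text, token_count)` chunks produced by a real embedding model:

```python
# Budget-aware retrieval: rank chunks by similarity, keep only those
# above a relevance floor, and pack them under a hard token budget.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def select_context(query_vec, chunks, max_tokens=800, min_score=0.75):
    """chunks: list of (vector, text, token_count) tuples."""
    scored = sorted(((cosine(query_vec, v), t, n) for v, t, n in chunks),
                    reverse=True)
    picked, used = [], 0
    for score, text, n_tokens in scored:
        if score < min_score:
            break                      # everything below is noise: stop
        if used + n_tokens > max_tokens:
            continue                   # too big for the remaining budget
        picked.append(text)
        used += n_tokens
    return picked, used
```

The two knobs, `min_score` and `max_tokens`, are exactly the discipline missing from the “dump ten PDFs” pattern: one cuts noise, the other caps spend.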

Agent stuck in an infinite loop

More expensive than RAG is an uncontrolled agent.

Empowering AI with the ability to plan, think, and use tools is the absolute forefront of today’s advancements. The ReAct (Reasoning and Action) paradigm makes AI appear to work like a human.

I need to check today's weather.

Call the weather API.

Observation: Failed to retrieve.

Thought: It failed just now, I'll try again.

Call the weather API.

Did you notice? If the API happens to go down, or the agent's logic gets stuck in a dead end, it will spin endlessly in this loop. Each round of "thinking" and "acting" consumes extremely expensive output tokens.

The output token's price is typically several times higher than the input token's price.

An agent without proper circuit-breaking mechanisms and maximum iteration limits is a bottomless pit that consumes tokens—it can max out your credit card while you sleep.
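A minimal circuit breaker looks like this: cap iterations and cap spend, and abort loudly when either limit is hit. `llm_step` and `run_tool` are hypothetical stand-ins for a real model call and tool executor:

```python
# Circuit breaker for an agent loop: a hard iteration cap plus a hard
# token budget, so a stuck agent fails fast instead of billing forever.

class BudgetExceeded(Exception):
    pass

def run_agent(task, llm_step, run_tool, max_iterations=8, token_budget=20_000):
    spent = 0
    history = [task]
    for step in range(max_iterations):
        action, tokens_used = llm_step(history)   # one "thought" round
        spent += tokens_used
        if spent > token_budget:
            raise BudgetExceeded(f"{spent} tokens after {step + 1} steps")
        if action["type"] == "final":
            return action["answer"], spent
        history.append(run_tool(action))          # observation goes back in
    raise BudgetExceeded(f"no answer after {max_iterations} iterations")
```

An exception here is a feature, not a bug: a loud failure at iteration eight is vastly cheaper than a silent loop at iteration eight thousand.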

Scraping the Bone to Heal the Poison: A Hardcore Engineering Self-Rescue Guide

Complaining about price increases is useless. Mature observers focus only on solutions.

When brute-force computational power becomes a thing of the past, refined engineering capability becomes the only moat. How to save? Squeeze every last drop of value from each Token, like wringing out the final drop of water from a towel.

Semantic Cache: Don't pay twice for the same question

This is the most straightforward and aggressive way to save money.

Human nature is to repeat; user questions are often highly homogeneous. Questions like “How do I reset my password?” or “How do I issue an invoice?” may be asked hundreds or even thousands of times per day.

Calling GPT-4 every time is like using a cannon to swat a mosquito.

Introduce semantic caching. When a user asks a question, convert it into a vector and perform a similarity match against the cache library. If a similar question has been asked before (e.g., “What should I do if I forget my password?”) and the match is highly accurate, return the cached answer directly.

No large model required. No tokens consumed. Latency reduced from seconds to milliseconds.

This is no longer just about saving money; for user experience, it is an overwhelming advantage.
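The mechanism above can be sketched in a few lines. This is a toy in-memory version: the vectors are assumed to come from a real embedding model, and production systems add a vector database, TTLs, and cache invalidation, with `threshold` tuned per domain.

```python
# Toy in-memory semantic cache: embed the question, look for a close
# enough earlier question, and return its stored answer on a hit.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.entries = []            # list of (vector, answer) pairs
        self.threshold = threshold

    def get(self, query_vec):
        """Return a cached answer if a similar question was seen before."""
        best_answer, best_score = None, 0.0
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer if best_score >= self.threshold else None

    def put(self, query_vec, answer):
        self.entries.append((query_vec, answer))
```

On a cache hit, the large model is never called at all; the only cost is one cheap embedding plus a similarity scan.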

Prompt Compression: Algorithmic Minimalism

Since lengthy context is the original sin, compress it.

This is not about manually deleting words or phrases, but rather relying on algorithms. The industry has already developed several entropy-based prompt compression techniques. These tools can analyze a long text to identify which words are crucial for the large model to understand the meaning, and which are optional stop words or redundant information.

They can compress a 1,000-token text into 300 tokens while preserving the core semantics with negligible or no loss.

Let machines communicate with machines. Use a form of "Martian text" that seems awkward and even grammatically incorrect to humans to converse with large models, because their self-attention mechanisms are powerful enough to understand.

You saved 70% on toll fees.
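For illustration only, here is a deliberately naive compressor that drops stop words. Real entropy-based compressors (e.g., the LLMLingua family) score each token with a small language model rather than using a fixed word list; this stand-in only shows the shape of the savings.

```python
# Deliberately naive prompt compressor: drop low-information glue
# words and keep content-bearing ones. A fixed stop-word list is a
# crude stand-in for per-token information scoring.

STOP_WORDS = {
    "a", "an", "the", "is", "are", "was", "were", "be", "been",
    "to", "of", "in", "on", "at", "for", "and", "or", "that",
    "this", "it", "as", "by", "with", "very", "really", "please",
}

def compress(prompt: str) -> str:
    kept = [w for w in prompt.split()
            if w.lower().strip(".,!?") not in STOP_WORDS]
    return " ".join(kept)

original = ("Please act as a very experienced marketing expert and "
            "write a short tagline for the product launch.")
print(compress(original))  # content words survive; glue words do not
```

The compressed form reads awkwardly to a human, which is exactly the point: it is written for the model’s attention mechanism, not for you.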

Model Routing: Let the right people do the right things

This is currently the most challenging aspect for architects.

Don't blindly hand all tasks to the most expensive, most powerful model. Don't use a sledgehammer to crack a nut.

Inside a superior AI application, there should be a matrix of collaborating models. We need a "router" to handle distribution.

  • Simple entity extraction, format conversion, multilingual translation? Directly route to locally deployed open-source small models (such as Llama 3 8B) or extremely low-cost APIs (such as Claude 3 Haiku). Costs are nearly negligible.
  • Need deep logical reasoning, complex code writing, or multi-step planning? That’s when you bring out the heavy hitters like GPT-4o or Claude 3.5 Sonnet.

Like a well-run company: front-line staff handle routine inquiries without bothering the CEO. After compute prices rise across the board, whoever makes this routing mechanism smoother and more precise can cut their overall token cost to a tenth of their competitors’.
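A rule-based sketch of such a router. The model names and keyword heuristics are illustrative assumptions; production routers typically use a small classifier model or embeddings rather than keyword rules.

```python
# Rule-based sketch of a model router: send long or reasoning-heavy
# tasks to the premium tier, everything else to the cheap tier.

CHEAP_MODEL = "llama-3-8b-local"   # assumed local open-source deployment
PREMIUM_MODEL = "gpt-4o"           # assumed frontier API tier

HARD_HINTS = ("prove", "refactor", "multi-step", "plan", "debug", "architect")

def route(task: str) -> str:
    """Pick a model tier from cheap complexity heuristics."""
    text = task.lower()
    if len(text) > 500 or any(hint in text for hint in HARD_HINTS):
        return PREMIUM_MODEL
    return CHEAP_MODEL
```

The router itself must be nearly free to run; if classifying the task costs as much as answering it, the savings evaporate.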

Frontier Exploration: Examining the "Token Economics" of Agents through OpenClaw and Hermes

The true technological frontier has long smelled the blood of rising compute prices.

When we turn our attention to the most cutting-edge Agent ecosystem—particularly frameworks aiming to break free from cloud computing constraints and move toward edge and mobile devices—you’ll find that an intense battle for Token optimization has already begun.

Mobile-driven pressure: no luxury of context

Why am I specifically mentioning mobile integration? Because it’s the ultimate proving ground for token efficiency.

On a PC or in the cloud, you might tolerate a few seconds of latency and a large context window. But on a mobile device, when running an Agent in resource-constrained hardware environments, bandwidth is a bottleneck, memory is a bottleneck, and battery life is also a bottleneck.

This forces the framework to be extremely frugal.

Observing OpenClaw's development trajectory, you'll find its control over Token usage borders on obsessive-compulsive. When executing complex tasks, OpenClaw does not rely on brute-force full-context stacking. Instead, it heavily depends on optimized structured outputs.

It understands that allowing the model to generate freely produces an uncontrollable stream of tokens. By forcing the model to output results in a strict JSON Schema or even a more fundamental binary-friendly format, OpenClaw significantly eliminates redundant characters from the generation process. It doesn’t let the AI “chat”—it lets the AI “submit forms” directly.

This strict constraint on output format, while seemingly intended to facilitate downstream program parsing, objectively achieves a clever “data-saving” operation in today’s era of scarce computing power.
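To see why structure saves tokens, compare a chatty free-form answer with a schema-shaped one for the same extraction task. This is an illustration of the principle only, not OpenClaw’s actual code:

```python
# Why structure saves tokens: the same extraction, answered as chat
# versus as a strict compact-JSON "form submission".

import json

chatty = ("Sure! I'd be happy to help. After looking at the page, "
          "I believe the product is called 'Acme Mug' and it costs "
          "$12.99. Let me know if you need anything else!")

# Schema-constrained output: fields only, no pleasantries, no filler.
structured = json.dumps({"name": "Acme Mug", "price_usd": 12.99},
                        separators=(",", ":"))

print(len(chatty), "chars of chat vs", len(structured), "chars of JSON")
```

Since output tokens typically cost several times more than input tokens, trimming the generated side is where structured output pays off hardest.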

Hermes Agent: Surgical context management

Now consider the Hermes series of models and their agent-based applications introduced by Nous Research.

Many open-source models, whose instruction understanding is weaker, need repeated rounds of trial and error when performing function calls, consuming large amounts of tokens each time. The brilliance of Hermes lies in its precise instruction following.

Precision means getting it right the first time. Getting it right the first time is the greatest savings.

In multi-round interactions, as the conversation deepens, the context window grows like a snowball. Advanced players in the Hermes Agent ecosystem have long abandoned the foolish practice of "retaining all history."

They introduced a dynamic memory mechanism.

  • Working Memory: Retain only the most recent 3-5 rounds of direct conversation to stay agile.
  • Long-term Memory: When the window limit is exceeded, trigger a lightweight background model to summarize the previous conversation into a few key points and store them in the vector database.

The previous conversation was discarded, but the knowledge was retained.

They are not dumping garbage, but performing surgical memory excision and suturing. This precise context management not only breaks the physical limits of token length, but also achieves a dramatic reduction in computational cost at a macro level.
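The two-tier memory described above can be sketched as follows, with `summarize` as a hypothetical stand-in for the lightweight background model:

```python
# Two-tier conversation memory: recent turns stay verbatim; older
# turns are folded into a running summary instead of being dropped.

from collections import deque

class ConversationMemory:
    def __init__(self, summarize, working_size=4):
        self.working = deque()       # working memory: recent turns
        self.summary = ""            # long-term memory: compressed
        self.working_size = working_size
        self.summarize = summarize

    def add_turn(self, turn: str):
        self.working.append(turn)
        while len(self.working) > self.working_size:
            evicted = self.working.popleft()
            # Fold the evicted turn into the summary; knowledge survives.
            self.summary = self.summarize(self.summary, evicted)

    def context(self):
        """Context to send to the model: summary first, then recent turns."""
        head = [f"Summary so far: {self.summary}"] if self.summary else []
        return head + list(self.working)
```

The context sent to the model stays bounded no matter how long the conversation runs, which is precisely what turns an unbounded snowball into a fixed budget.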

Whether it’s OpenClaw’s structured control or Hermes’ dynamic memory management, both reveal a trend: the future of agents will be decided not by who can access the most tools, but by who can accomplish the most complex tasks under extreme token budgets.

This is dancing with chains. And those who dance the best will win the next era.

Cognitive leap: From consumer mindset to investment mindset (The awakening of ROI)

Strip away all the technical jargon and return to the essence of business.

The biggest change brought by the across-the-board rise in compute costs is not engineers staying late to patch code; it is a forced, complete refresh of the AI industry’s mindset.

In the era of cheapness, we approached tokens with a consumer mindset.

Like shopping at a supermarket and tossing discounted items into your cart, we don’t care whether this feature actually needs a large model—we only care that “it looks cool.”

Many companies blindly integrate LLMs into their internal systems, giving every employee an account—even using AI to generate the cafeteria menu. When the end-of-month bill arrives, they’re stunned.

Now, we must shift to an investment-grade mindset.

Every token consumption is an investment. With investment, you must calculate ROI (return on investment).

What did this Token bring me?

Did it improve the customer service ticket closure rate?

Did it reduce the time developers spend fixing bugs?

Or just a meaningless user reply: "Haha, this AI is funny"?

If a feature costs only 1 cent to implement using a rules engine or traditional machine learning, but requires 1 yuan in token fees to integrate with a large model, yet only improves conversion rate by a negligible 2%,

Then cut it off. Cut it off without hesitation.
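Here is that arithmetic worked through, using the article’s illustrative figures plus an assumed value per extra conversion. All numbers are hypothetical inputs, not benchmarks.

```python
# Break-even arithmetic: a rules engine at ~0.01 yuan per request
# versus an LLM at ~1 yuan per request with a 2-point conversion lift.
# The 5-yuan value per extra conversion is an assumption.

def net_value(requests, cost_per_request, lift, value_per_conversion):
    """Gain from extra conversions minus direct cost, in yuan."""
    cost = requests * cost_per_request
    gain = requests * lift * value_per_conversion
    return gain - cost

requests = 10_000
rules_engine = net_value(requests, 0.01, 0.0, 5.0)   # baseline, no lift
llm_feature = net_value(requests, 1.0, 0.02, 5.0)    # +2% conversions

print(f"rules engine: {rules_engine:,.0f} yuan, LLM: {llm_feature:,.0f} yuan")
# A deeply negative LLM number means exactly one thing: cut it off.
```

Swap in your own traffic, prices, and conversion value; the decision rule stays the same: if the token bill exceeds the lift it buys, the feature is a subsidy to your API provider.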

Move away from the hype of "big and all-inclusive" AI toward precise, "small and elegant" solutions. Business process reengineering must be built on a foundation of extreme sensitivity to compute costs.

We must learn to say "no" to business teams. When they ask, "Can AI read through all 100,000 research reports and give me a summary?" you should respond: "Does your business benefit cover the cost of several million API tokens?"

Do the math. Budget carefully. Calculate your tokens like a traditional grocery store owner.

This doesn't sound cyberpunk at all. It's so outdated.

But this is precisely the necessary path for AI to mature.

Conclusion: The landscape after the tide recedes

The frenzy of the trend is always short-lived; the laws of business gravity will ultimately take effect.

The across-the-board rise in compute prices is less a crisis than a long-overdue reckoning. It brutally punctures the bubble inflated by unlimited subsidies and brings everyone back to cold reality.

But this is not a bad thing.

It forces us to abandon blind faith in "brute force yields miracles" and rekindle our respect for engineering efficiency. It eliminates those who merely write a few prompts and go around misleading others, leaving the stage to hardcore teams that truly understand underlying architectures, model routing, and how to maximally squeeze computational power on mobile devices.

When the dust settles, the companies that survive and thrive won't be the ones holding the most expensive models.

But those who can remain calm, watching the Token numbers rapidly fluctuate on their dashboard, confident that they are earning more than they spend.

After all, only when the tide goes out do you learn who has been swimming naked. This time, what is receding is the tide of the compute dividend.

Only those who forge every single Token like gold can wear true armor.

Source:

  1. Nvidia Corporation. (2025). Data Center Compute Constraints and Global Supply Chain Outlook. Investor Relations Report.
  2. Anthropic. (2025). Prompt Caching and Context Window Economics in Claude Managed Agents. Anthropic API Documentation.
  3. OpenAI. (2025). Best Practices for Token Optimization and RAG Implementations. OpenAI Developer Platform.
  4. Nous Research. (2025). Hermes Agent Framework: Efficient Context Management for Edge Computing. Nous Research Technical Blog.
  5. OpenClaw Community. (2026). Mobile Integration and Zero-Waste Token Strategies in Deep Agentic Workflows. GitHub Repository / Technical Whitepaper.
  6. Bloomberg. (2026). The End of the AI Subsidy Era: How Datacenter Energy Caps are Restructuring Cloud Pricing. Bloomberg Technology.