04/03 2026
Thursday is my token-guzzling day of the week—I need to track work progress, conduct reviews, and compile weekly reports.
I fire up Claude Code, type in “Help me summarize this week’s work,” and watch my token quota plummet. By the time the weekly report is drafted, I’ve burned through 33% of my allowance. The other tasks haven’t even started, and I’m already staring down my token cap.
A $200 token package? Gone in 3.5 hours. A single task can devour 30% of my quota.
The real issue? I’m clueless about where my tokens are disappearing to.
A few days back, Claude Code’s source was accidentally bundled into a release, and Reddit users reverse-engineered it, confirming that Claude Code was indeed overcharging. Only after complaints and reverse-engineering evidence flooded Reddit did Anthropic come clean, with the official account responding: “We’re urgently investigating. This is our top priority.”
This boils down to an engineering problem. On one hand, the Harness architecture lets AI tackle complex tasks, at the cost of far higher overhead than single-model interactions. On the other, Claude’s metering logic has clearly missed some critical cases.
I get it—the agent is a token hog. To achieve optimal results, you’ve got to feed it more tokens.
So, do domestic MaaS providers face similar issues? Can they at least offer a viable solution during peak AI usage?
That’s why I’m proposing that MaaS providers introduce a weekly “Crazy Thursday Token Bonanza”—unlimited tokens for the day. Just V me $50, and I’ll ascend to AI greatness.

What’s Lurking in the Application-Layer Black Box?
Why is Claude Code so effective? Because it’s not a single model—it’s a multi-agent pipeline. Coders, reviewers, and debuggers operate independently, with their token consumption bundled into a single “conversation.”
The Harness architecture enables AI to handle complex tasks, but token costs skyrocket. Community tests reveal that token consumption for complex tasks can be several times—even up to ten times—higher than direct model calls. This overhead is conveniently tucked away in the “single conversation” bill, leaving users in the dark.
Even more sneaky is the model switching within the Coding Plan. Roles like Plan Mode, Reviewer, and Debugger each trigger hidden calls. You think you’re interacting with “one AI,” but in reality, the backend is jumping between five or six sub-agents.
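To make the bundling concrete, here is a purely illustrative sketch of how a Harness-style pipeline can roll hidden sub-agent calls into one opaque "conversation" bill. All role names and token counts are hypothetical, not Claude Code internals:

```python
# Illustrative only: per-role usage is tracked internally,
# but the user sees just one aggregate number per conversation.
from dataclasses import dataclass, field

@dataclass
class ConversationBill:
    per_role: dict[str, int] = field(default_factory=dict)

    def record(self, role: str, tokens: int) -> None:
        # Each hidden sub-agent call adds to its role's tally.
        self.per_role[role] = self.per_role.get(role, 0) + tokens

    def user_visible_total(self) -> int:
        # The bill the user sees collapses all roles into one sum.
        return sum(self.per_role.values())

bill = ConversationBill()
# One user request fans out to several hidden sub-agents.
bill.record("planner", 12_000)
bill.record("coder", 45_000)
bill.record("reviewer", 20_000)
bill.record("debugger", 18_000)

print(bill.user_visible_total())  # 95000 tokens for "one" conversation
```

The point of the sketch: nothing here is malicious per se, but if only `user_visible_total()` is ever surfaced, the user has no way to see which role ate the budget.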
Then there’s the truth unearthed by Reddit users through reverse engineering: two independent cache bugs rendered the prompt cache ineffective. Here are the two critical bugs:
Bug 1: Sentinel Replacement Mechanism Breaks Caching
Claude Code uses separate binaries for different platforms (Windows/macOS/Linux). When conversation content involves internal billing logic, the system replaces sensitive fields with sentinel values. The problem? This replacement disrupts the prompt cache’s hash consistency, causing cache misses even when they should hit, leading to redundant token calculations.
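A minimal sketch of why sentinel substitution breaks caching (assumed mechanism, not actual Claude Code internals): if sensitive fields are swapped for sentinel values before the cache key is hashed, and the sentinel differs across platforms or runs, two identical prompts hash to different keys and the cache always misses:

```python
# Hypothetical illustration: same prompt, different sentinels,
# therefore different cache keys and a guaranteed cache miss.
import hashlib

def cache_key(prompt: str) -> str:
    # Prompt caches are typically keyed on a hash of the exact prefix.
    return hashlib.sha256(prompt.encode()).hexdigest()

prompt = "system: billing_tier=pro\nuser: review this PR"

# Run 1: sentinel substitution on one platform
run1 = prompt.replace("billing_tier=pro", "<SENTINEL_0xA1>")
# Run 2: logically identical prompt, but a different sentinel is injected
run2 = prompt.replace("billing_tier=pro", "<SENTINEL_0xB2>")

print(cache_key(run1) == cache_key(run2))  # False: the cache never hits
```

Hash-based cache keys are all-or-nothing: a single differing byte in the prefix invalidates the entire cached context.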
Bug 2: Resume Parameter Forces Cache Invalidation
Starting from v2.1.69, the resume parameter (used to restore interrupted conversations) forces cache invalidation. This means that if you exit mid-conversation or switch devices, all previous context caches are discarded, and the system recalculates tokens for the entire conversation history. For users relying on long contexts, this is a financial disaster—every “continue conversation” burns money.
The combined impact of these two bugs is catastrophic. Imagine asking Claude Code to review a GitHub PR. Normally, caching should save you 90% of repeat calculation costs. But due to these bugs, you’re paying full price every time, inflating costs by 10–20 times.
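The back-of-envelope math looks roughly like this (a simplified model with illustrative prices and token counts; the real multiplier also depends on how much context grows per turn and on cache-write surcharges):

```python
# Simplified cost model: with a working prompt cache, repeated context
# is re-read at ~10% of the base input price; with the cache broken,
# every turn pays full price for the whole context. Numbers are examples.
BASE_PRICE = 3.00 / 1_000_000    # $ per input token (example rate)
CACHED_PRICE = BASE_PRICE * 0.1  # cache reads ~90% cheaper

context_tokens = 100_000  # PR diff + history re-sent on each turn
turns = 20

# First turn pays full price to populate the cache; later turns read it.
with_cache = context_tokens * (BASE_PRICE + CACHED_PRICE * (turns - 1))
# Broken cache: full price on every single turn.
without_cache = context_tokens * BASE_PRICE * turns

print(f"cache working: ${with_cache:.2f}")
print(f"cache broken:  ${without_cache:.2f}")
print(f"inflation:     {without_cache / with_cache:.1f}x")
```

Even in this flat-context model the broken cache costs several times more, and once the resume bug forces the ever-growing history to be re-billed on every continuation, the multiplier climbs further.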
So, the Harness structure not only has explosive consumption but also accelerates billing due to algorithmic flaws.
More amusingly, the official response only came after users reverse-engineered the bugs. As one netizen quipped: “You’ve got the world’s best models and developers, yet you ignore thousands of complaints until someone dissects your code.”
This “users discover first, vendors admit later” model has become the norm in the AI industry. ChatGPT Plus never refunds historical quotas, and Gemini Advanced slows down without warning. Anthropic’s issue isn’t the bugs themselves but the lack of basic billing observability—when users question their bills, they can’t provide data to prove their innocence.
In contrast, OpenClaw updates almost weekly, fixing issues overnight. Anthropic’s sluggish response exposes a harsh reality: when model capability becomes the competitive edge, user experience and billing transparency become expendable.
Technical debt is passed on to users. How much of your money goes to “actually using AI” versus “system waste”? No one knows.
How Are Domestic MaaS Providers Faring?
Since Claude Code’s application layer is a black box, let’s examine domestic MaaS providers. How do they stack up?
To be frank, domestic MaaS providers are generally more conscientious. At least at the API layer, they break down costs more granularly. But at the application layer, everyone still hides Harness/Agent scheduling costs in the black box:

Based solely on tokens, everyone’s pricing is transparent and traceable. But at the application layer—when actually solving problems—full transparency remains elusive, perhaps because everyone is still working within OpenClaw’s framework without innovation.
Since OpenClaw’s rise, providers have rolled out custom token packages. But these packages commonly come with quota caps and the “flexible allocation” of outdated models, plus quota shortages and lag at peak hours. Users are sometimes forced back to pay-as-you-go API pricing, which defeats the point: if the fixed package can’t meet your needs, you end up on the old metered model anyway.
In short, pricing transparency only exists at the API layer. When using AI applications that call tokens via agents, scheduling costs remain a black box. While providers like Kimi and Volcano now limit agent usage through quotas, users must wait for refreshes after exhausting their package.
API-layer transparency suits developers; application-layer transparency suits enterprise procurement. When you need to explain to your boss “why AI cost $20,000 this month,” “called the Deep Research Agent 500 times” is more convincing than “consumed 1 million tokens.” Interestingly, among these six providers, only Baidu explicitly breaks down agent costs; the other five still bundle Harness scheduling costs into tokens.
This isn’t about money—it’s about whether users have the right to know how their computational resources are being utilized.
In the cloud era, no one would accept “a server costs $200/month, but we won’t say how much CPU or bandwidth you get.”
AWS bills can break down computing time to the millisecond, traffic consumption by byte, and even price differences across availability zones. Observability is the foundation of mature cloud computing.
AI applications are still in the Wild West. Providers package Harness scheduling and multi-agent collaboration as “magic,” hiding technical debt as “usage.” This essentially strips users of their right to know.
Users need a detailed bill—or at least a “debug mode” toggle to let developers view Harness call chains. At minimum, providers should promise automatic refunds for billing errors caused by bugs instead of “investigating.” They should distinguish between “what you spent” and “what you should have spent.”
Given how rapidly MaaS providers are evolving, by next Crazy Thursday, I hope to at least know how my tokens were devoured. Just V me $50, and I’ll enjoy two extra pieces of Original Recipe chicken.