04/15 2026

Not long ago, Anthropic stopped allowing subscribed users to access the Claude API through third-party tools like OpenClaw. The reason is straightforward: running an OpenClaw proxy for a day consumes $1,000 to $5,000 in computing costs, while users only pay $200 monthly.

Boris Cherny, head of Claude Code, said in a statement that the subscription service was “not designed for the usage patterns of these third-party tools.” While true, this obscures a more fundamental issue: no subscription service can be designed to cover such usage patterns. Token consumption in agent scenarios has no upper limit and no historical data to draw on. Any fixed monthly fee is a guess about a variable that cannot be modeled.
In late March, China’s National Data Bureau released another set of figures: average daily token calls in China exceeded 140 trillion, a thousandfold increase in two years. During the same period, ByteDance’s token calls ranked among the top three globally, alongside OpenAI and Google. Xia Lixue, CEO of Infinigence AI (Wuwen Xinqiong), described this growth at an industry forum, saying the last time he saw a similar curve was in the 3G era, when monthly mobile data usage per user started around 100MB and then surged. Back then, no one anticipated that lifting data caps would give rise to Douyin, WeChat, and food delivery apps.
Together, these two events describe the same reality: token consumption is growing at an unprecedented rate, yet the pricing logic supporting the entire industry remains rooted in assumptions from the chatbot era two years ago—namely, that user consumption can be predicted by historical data, that light users will naturally offset heavy users, and that overall costs can be evenly distributed.
AI agents have shattered every one of these premises. The pace of market change has outstripped the responsiveness of any pricing model. Over the past two years, the evolution of the token market has been driven by a single logic: any advantage that competitors can replicate will be replicated. Scale can be caught up with, algorithms can be open-sourced, and scenarios can be overwhelmed by large platforms’ distribution power.
Currently, the only thing difficult to replicate quickly is the ability to internalize token efficiency into product architecture, pricing logic, and engineering culture. Among all players, only Anthropic has truly systematized this approach.
The Meaninglessness of Average Pricing
Tokens differ from traditional production factors like electricity and steel due to their unique “programmability.” No traditional factor can alter its value by a factor of 100,000 simply through different instructions. This programmability is the defining feature of tokens as a new production factor and the key to understanding the current chaos in the AI economy.
To grasp this, one must first develop a sense of scale. 36Kr reported that OpenAI’s API processes approximately 21.6 trillion tokens daily, while Google’s Gemini handles around 43 trillion. China’s 140 trillion daily token calls are more than double the combined total of the two. JPMorgan predicts that China’s AI inference token consumption alone will increase 370-fold within five years. This scale alone indicates that tokens have become an economic indicator.

Moreover, a significant portion of token consumption occurs outside public cloud statistics. Financial institutions run invoice recognition on local servers, in-car intelligent cockpits process dialogues in closed loops, and industrial robots run vision models with millisecond-level responses on edge devices—none of which appear in public data. One industry insider estimates that non-public cloud API calls are at least five to ten times those of public clouds.
Beyond scale, the value structure and production costs of tokens warrant attention. In March, Jensen Huang, in a bylined article, broke down the AI industry into five layers: energy, chips, infrastructure, models, and applications, defining tokens as the fundamental unit of modern AI, its language, and its currency. The brilliance of this definition lies in how it captures tokens’ two attributes: as a language, they are the atoms of computation; as a currency, they are the medium of value exchange.
However, the cost of producing a token is far more complex than this definition suggests. According to Sam Altman and Epoch AI, ChatGPT consumes approximately 0.3 watt-hours per text prompt. Google Search, by contrast, uses only 0.03 watt-hours, one-tenth of that amount. In 2025, Google disclosed that Gemini consumes about 0.24 watt-hours per typical text prompt and generates roughly 0.03 grams of CO2.
As model complexity increases, inference costs rise accordingly. A GPT-5-level system may consume around 18 watt-hours per query and up to 40 watt-hours during extended reasoning. The gap arises from two factors: model size (more parameters require greater computation per token) and reasoning mode (new-generation models perform extensive implicit reasoning before outputting each visible token, amplifying the true cost of a single visible token).
This is the fundamental difference between tokens and production factors like electricity or oil: a token’s value is determined entirely by its usage scenario, not its production cost. A million tokens used for casual chat might fetch $0.01; for code generation, $200; for legal document review, over $1,000—a value gap of 100,000x. Yale researchers describe this as tokens’ “contractable” attribute: quantity can be precisely measured, but value depends on what they are programmed to do.
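The 100,000x figure follows directly from the price points quoted above. A minimal back-of-envelope sketch (the scenario names and prices are taken from the text; the per-token breakdown is illustrative):

```python
# Price per million tokens by scenario, as quoted in the text.
value_per_mtok = {
    "casual chat": 0.01,
    "code generation": 200.0,
    "legal document review": 1000.0,
}

lo = min(value_per_mtok.values())
hi = max(value_per_mtok.values())
gap = hi / lo  # ratio between the highest- and lowest-value scenarios

print(f"value gap: {gap:,.0f}x")
for name, v in value_per_mtok.items():
    # The same unit of quantity carries wildly different value per token.
    print(f"{name}: ${v / 1_000_000:.8f} per token")
```

The point of the exercise: any "average" of these numbers describes none of the three scenarios.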
When the entire industry applies a single pricing logic to scenarios with a 100,000x value gap, systemic pricing chaos is not accidental—it is inevitable.
Thus, the so-called average token price is as meaningless as describing a business district with both street vendors and Michelin-starred restaurants by average spending per customer. Even if the number is correct, it is irrelevant. Collis and Brynjolfsson estimated in 2025 that generative AI alone generated approximately $97 billion in consumer surplus for U.S. consumers in 2024, with users deriving far more value than they paid. The vast majority of this value was concentrated in high-value applications.
The Window of Opportunity for the Token Economy Is Closing
In the token economy, competitive advantage hinges on time windows shaped by technological leaps, product transformations, and market structures. The beneficiaries of each window unconsciously pave the way for the next disruptor, and the true winners are those who consistently secure positions across multiple windows.
In early 2025, algorithms were the first window. After DeepSeek V3’s release, the Mixture of Experts (MoE) architecture reduced inference costs for equivalent capabilities by an order of magnitude: the model contains multiple expert submodules, activating only a small fraction per inference, drastically cutting actual computation per inference while retaining full model capabilities.
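The cost saving in MoE comes from the routing step: every expert is scored, but only a few are executed. A minimal sketch of top-k routing (expert count, k, and the toy "expert" function are illustrative, not DeepSeek's actual design):

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 8   # expert sub-networks in the layer
TOP_K = 2         # experts actually executed per token

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def expert(idx, x):
    # Stand-in for an expert feed-forward network: a fixed scaling.
    return [v * (1.0 + 0.1 * idx) for v in x]

def moe_layer(x, router_weights):
    # The router scores every expert cheaply...
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    probs = softmax(scores)
    # ...but only the top-k expensive experts actually run.
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    gate_sum = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        g = probs[i] / gate_sum  # renormalized gate weight
        out = [o + g * yi for o, yi in zip(out, expert(i, x))]
    return out, top

x = [0.5, -1.2, 0.3, 0.9]
router = [[random.gauss(0, 1) for _ in x] for _ in range(NUM_EXPERTS)]
y, active = moe_layer(x, router)
print(f"activated {len(active)}/{NUM_EXPERTS} experts")
```

With 2 of 8 experts active, per-token compute in the expert layers drops roughly 4x while total parameter count (and thus capability) stays at the full-model level, which is the order-of-magnitude effect the text describes.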
However, the paradox of the algorithm window is that the key to opening it is also the lock that closes it. DeepSeek chose open-source, releasing core model weights and architectural designs to attract global developers into its ecosystem. This choice rapidly expanded market share in the short term but actively compressed the algorithm’s leading window in the medium to long term. Once architectural innovations are open-sourced, the industry’s token cost baseline resets, turning algorithmic advantages from proprietary barriers into public infrastructure.
By year-end, scale became the second window. Volcano Engine adopted strategies from internet traffic wars, using large-scale airport advertisements to announce its presence in the token market. Tan Dai, in his latest business update on April 2, mentioned that Volcano Engine’s token calls grew 1,000-fold within two years, with the number of enterprises consuming over a trillion tokens rising to 140.

However, scale advantages are time-bound. In an interview with Yicai, Tan Dai also noted that a significant portion of large-scale token calls involved ineffective computing. Using math problems as an example, he explained that enumeration methods require massive computation, and models with insufficient capabilities resort to such approaches, leading to unnecessary consumption. Superior models find concise solutions, leaving ample room for optimization. Behind the scale numbers lies substantial avoidable computing waste. When competition shifts from “how much is consumed” to “how much value each token creates,” the scale window begins to close.
Scenarios are currently the most fiercely contested area in token competition. Zhipu, MiniMax, and Moonshot AI lack ByteDance’s traffic scale or Alibaba and Tencent’s cloud ecosystems, but they have found footing in high-value B2B scenarios. At one point, Zhipu and MiniMax’s market valuations exceeded those of traditional internet companies like Kuaishou, demonstrating the significant valuation premium scenario windows can create at specific stages.
But this window is now narrowing. At an industry forum, Yang Zhilin asked Zhipu CEO Zhang Peng why they raised prices. Zhang replied that completing an agent task consumes 10x or even 100x more tokens than answering simple questions; long-term reliance on low-price competition benefits no one in the industry.
Behind this dialogue, a larger-scale scenario battle is unfolding. ByteDance embeds large model capabilities directly into enterprise collaboration workflows and massive traffic nodes through Feishu and Coze platforms. Tencent leverages the WeChat ecosystem and Enterprise WeChat to control the shortest social path for businesses to reach and serve customers. Alibaba has consolidated its AI businesses into the ATH Group, packaging token consumption as part of enterprise digital infrastructure.
These three companies possess long-established trust relationships and system integration capabilities at the enterprise level. The scenario advantages independent vendors rely on—maintained through differences in model quality—are being rapidly compressed by such structural advantages.
Token efficiency is the fourth emerging window and the hardest to replicate quickly. Competition in this window currently centers on coding scenarios. After Anthropic banned third-party tools, users accustomed to low-cost Claude access began seeking alternatives. OpenAI quickly positioned itself as the more user-friendly option. However, Anthropic is betting on training and model runtime efficiency, while OpenAI assumes Altman can always secure more funding to support computing scale.
Using capital to stack computing power for market share is a viable but unsustainable strategy. By late March, OpenAI’s API processing volume exceeded 15 billion tokens per minute, up from 6 billion in October 2025. Yet computing supply growth lags far behind: GPU leasing prices surged 48% in two months, with NVIDIA’s latest Blackwell chip reaching $4.08 per hour, and data center construction cycles measured in years. OpenAI even partially suspended Sora, its video generation tool, to free up computing resources for coding and enterprise products.

Anthropic is pursuing Harness Engineering, redesigning agent scheduling architectures to reduce ineffective token consumption systemically, achieving more with less computing. Under the reality of scarce computing resources, this redefines efficiency itself.
In China, Alibaba Cloud is also entering the efficiency window, integrating token pricing, call tracking, and enterprise billing management into a unified cloud infrastructure. Wu Yongming noted that many enterprises now treat token consumption not as IT budgets but as production materials and R&D costs—a slower but harder-to-disrupt approach.
With computing supply hitting physical limits and demand still accelerating, what is truly scarce is not cheap tokens but tokens that deliver the highest value density under constrained computing resources.
The OpenClaw Ban Is Just the Result
Under the combined pressures of scarce computing, ineffective pricing systems, and uncontrolled agent consumption, Anthropic is the only company so far that has not just adjusted pricing strategies but also redefined “how agents should operate” at the engineering architecture level. The ban is a passive response; Managed Agents are the proactive answer.
Harness is the scheduling layer of the agent framework, responsible for deciding when to call models, how to manage context, and how to handle errors. In the chatbot era, this logic was relatively simple. Entering the agent era, Harness began handling more complex tasks and generating substantial unnecessary token consumption.

The Anthropic engineering blog provides a specific case study involving Claude Sonnet 4.5, which exhibited a behavior engineers called “context anxiety”: the model would prematurely terminate tasks when it sensed the context window nearing its limit. To address this, the harness introduced a context-reset mechanism that forcibly clears and reloads the context at appropriate times so the task can continue. At the time, this was a reasonable engineering fix.
The issue arose after the launch of Claude Opus 4.5. The new model no longer experienced “context anxiety,” yet the old reset mechanism continued to trigger on every execution, consuming unnecessary tokens and adding needless latency. Mechanisms that had once been solutions became cost burdens, which Anthropic engineers call “dead weight.”
This represents a structural flaw in the Harness framework: each Harness is a snapshot of a model's capabilities at a specific moment. While models continue to evolve, the snapshots are executed as permanent rules. The faster the model iterates, the more severe this misalignment becomes.
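One way to see the misalignment is to key each harness mitigation to the model quirk it works around, so a rule retires automatically when the quirk disappears. This is a hypothetical sketch; the class names, quirk flags, and token figures are invented for illustration and are not Anthropic's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ModelProfile:
    name: str
    quirks: set = field(default_factory=set)  # known behavioral quirks of this model

@dataclass
class Mitigation:
    name: str
    fixes_quirk: str     # the model behavior this rule works around
    extra_tokens: int    # overhead the rule adds per run

    def applies_to(self, model: ModelProfile) -> bool:
        # The rule fires only while the quirk it patches still exists,
        # instead of running unconditionally as a frozen snapshot.
        return self.fixes_quirk in model.quirks

def run_overhead(model: ModelProfile, mitigations: list) -> int:
    return sum(m.extra_tokens for m in mitigations if m.applies_to(model))

mitigations = [Mitigation("context_reset", "context_anxiety", extra_tokens=4000)]

sonnet = ModelProfile("sonnet-4.5", quirks={"context_anxiety"})
opus = ModelProfile("opus-4.5", quirks=set())

print(run_overhead(sonnet, mitigations))  # reset overhead still paid
print(run_overhead(opus, mitigations))    # rule retires with the quirk
```

A snapshot-style harness is the degenerate case where every mitigation's `applies_to` always returns `True`; that is exactly the "dead weight" the text describes.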
In commercial scenarios, this problem is further amplified. When processing a single user query, OpenClaw generates several times more API requests than the official Claude Code framework, with each request carrying a context window exceeding 100,000 tokens. When converted into API rates, the true cost of a single query is dozens of times higher than the subscription price. Regardless of an individual's subjective frequency of use, requests initiated through such frameworks inherently carry the cost profile of heavy users. Platform subsidies for heavy users thus shift from being a probabilistic issue to a deterministic one.
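The "dozens of times higher" claim is easy to sanity-check with a back-of-envelope calculation. All numbers below are illustrative assumptions (the request multiplier and per-token rate are not disclosed figures), only the 100,000-token context and $200 fee come from the text:

```python
# Assumed figures for a rough sketch, not actual Anthropic rates.
requests_per_query = 5            # "several times more" requests than the official harness
context_tokens = 100_000          # context carried by each request (from the text)
input_price_per_mtok = 3.0        # assumed $ per 1M input tokens

# Input-token cost of one user query routed through such a framework.
cost_per_query = requests_per_query * context_tokens / 1_000_000 * input_price_per_mtok

subscription = 200.0              # monthly fee (from the text)
queries_covered = subscription / cost_per_query

print(f"true API cost per query: ${cost_per_query:.2f}")
print(f"queries before the plan is underwater: {queries_covered:.0f}")
```

Under these assumptions a flat-fee subscriber crosses the break-even line after roughly 130 queries a month, a handful per day, which is why heavy-user economics become deterministic rather than probabilistic.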
Anthropic's response is Managed Agents. The core idea is to define a stable interface for the agent layer, an abstraction behind which harness logic can be swapped freely as models evolve. When “context anxiety” disappears, the corresponding reset mechanism retires with it, leaving no “dead weight.” Internal test data shows that in structured file generation tasks, Managed Agents improved task success rates by up to 10 percentage points, with the largest gains on the hardest tasks.
The concurrent emergence of the Hermes Agent corroborates this judgment from another direction. This framework, which emphasizes a “closed-loop learning cycle,” writes updates to its accumulated operational-procedure files as patches, transmitting only the fields that need modification rather than rewriting the entire file. Patches address only the issue at hand, so token consumption stays low. This is one of the most concrete expressions of token-efficiency awareness at the framework-design level.
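The saving from patch-style writes is mechanical: retransmit only the delta, not the whole document. A minimal sketch, assuming token count roughly tracks serialized length; the file contents, field names, and patch format are invented for illustration:

```python
import json

# A hypothetical accumulated-procedures file the agent maintains.
procedures = {
    "deploy": "run tests, then push to staging, then promote",
    "rollback": "restore previous image and invalidate cache",
    "on_call": "page the owner listed in the service registry",
}

def full_rewrite(doc: dict, key: str, value: str) -> str:
    # Rewriting strategy: serialize and retransmit every field.
    return json.dumps(dict(doc, **{key: value}))

def patch(doc: dict, key: str, value: str) -> str:
    # Patch strategy: transmit only the field that changed.
    return json.dumps({"op": "set", "key": key, "value": value})

new_value = "restore previous image, invalidate cache, notify #ops"
full = full_rewrite(procedures, "rollback", new_value)
delta = patch(procedures, "rollback", new_value)

# Crude proxy: characters track tokens up to a constant factor.
print(f"full rewrite: {len(full)} chars, patch: {len(delta)} chars")
```

As the procedure file grows, the full-rewrite cost grows with it while the patch cost stays proportional to the change, which is exactly why the approach scales with long-lived agents.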
Competition in the token economy has narrowed to a fine-grained question: who can make each token yield more value. Luo Fuli wrote at the end of her post, which has garnered over 730,000 views, that the true way forward lies not in cheaper tokens but in the co-evolution of models and agents.

This statement pertains not only to technical routes but also to the necessary transformation in the industry's pricing logic: shifting from volume-based billing to value-based pricing, and from cost management to result creation. This is a transformation that the entire industry needs to undertake.
Anthropic's exploration of the Harness architecture provides one of the clearest directions thus far. However, the journey ahead remains long.
*The featured image and illustrations in the text are sourced from the internet.