Can DeepSeek Save China One Trillion US Dollars?

06/05 2026

In the second half of 2026, NVIDIA will deliver its most powerful AI platform to date: Vera Rubin VR200 NVL72. A full cabinet contains 72 Rubin GPUs and 36 Vera CPUs. Morgan Stanley estimates the material cost of this machine to be approximately 7.8 million US dollars.

This figure is already alarming. But what deserves even closer attention is where the money is being spent.

Out of the 7.8 million US dollars, approximately 2 million US dollars are not spent on the world-renowned GPU chip or the computing core but on memory—high-bandwidth memory (HBM4) and regular memory (LPDDR5X). In just one year, the cost of this memory has surged by 435% due to price hikes.
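The split described above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in the text; the variable names are illustrative, and the back-calculated earlier cost assumes the 435% surge applies to the same quantity of memory, which the article does not state.

```python
# Figures quoted in the article, in millions of US dollars.
total_cost_musd = 7.8
memory_cost_musd = 2.0

# Share of the machine's material cost that goes to memory.
memory_share = memory_cost_musd / total_cost_musd

# A 435% increase means the price is now (1 + 4.35) times the earlier level.
# Assumption: the same quantity of memory was priced a year earlier.
surge = 4.35
implied_earlier_cost_musd = memory_cost_musd / (1 + surge)

print(f"memory share of machine cost: {memory_share:.1%}")
print(f"implied memory cost a year earlier: ${implied_earlier_cost_musd:.2f}M")
```

On these numbers, roughly a quarter of the machine's material cost sits in memory rather than in the compute dies.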

This is a signal. In the increasingly expensive AI machine, money is flowing massively from the “computing components” to the “memory and storage components.”

Remember this signal. Because the story of DeepSeek, which this article will discuss, does precisely the opposite: while everyone is being pushed by the times to pay AI hardware premiums for increasingly expensive memory, DeepSeek is finding ways to enhance the token production capacity of these expensive hardware components by over four times through software-hardware integration, without compromising competitiveness—effectively saving 75% on hardware investment.

Out of all this, a conjecture has recently been circulating: can DeepSeek's efforts save China's AI infrastructure build-out one trillion US dollars?

Is this really possible?

—Introduction

The price tag on NVIDIA's machine cited above is the most rigid cost in today's AI infrastructure budgets. Under the current supply-demand landscape, if you want the most advanced AI machine, you must accept this bill.

DeepSeek cannot change this fact.

What it can change is something else: how many Tokens can the same machine, with the same 2 million US dollars' worth of expensive storage hardware, produce?

This question has become especially concrete after the release of DeepSeek V4.

What deserves more attention in V4 is not just the model itself but the three key strategies it demonstrates. First, keep compressing “memory” so that long contexts no longer overwhelm GPU memory. Second, activate the “body” on demand, so that massive expert models do not need to be mobilized in full every time. Third, turn repeated computation into reusable assets, so that contexts already computed do not burn money again.

These technical features share a prominent characteristic—they focus on software-hardware collaboration rather than pure software optimization. This is why some jokingly say that DeepSeek may become China's largest AI hardware company.

Its model page shows that in a 1 million Token context scenario, V4-Pro requires only 27% of the single-Token inference computing power and 10% of the cache occupancy compared to the previous generation. In this article, we use approximately one-fourth of the computing power for our calculations.

Under traditional approaches, this hardware supports only a fixed level of throughput. Through long-context compression, on-demand activation, cache reuse, and inference scheduling, DeepSeek can roughly quadruple the effective Token output of the same hardware, not by “cutting costs” but by amortizing them. Tasks that once required four machines may now be accomplished with one; where every Token generated once bore a full share of expensive hardware cost, the same cost is now spread across four Tokens.
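The amortization argument reduces to one formula: if the same hardware produces g times as many Tokens, a fixed workload needs 1/g of the hardware. The sketch below is illustrative only; the function name is hypothetical, and it assumes throughput gains translate one-to-one into avoided hardware, which real deployments may not achieve.

```python
def hardware_saving(throughput_gain: float) -> float:
    """Fraction of hardware avoided for a fixed workload.

    Assumption: hardware needed scales as 1 / throughput_gain.
    """
    if throughput_gain <= 0:
        raise ValueError("throughput_gain must be positive")
    return 1.0 - 1.0 / throughput_gain


# The article's rounded case: four times the output per machine.
print(f"4x gain   -> {hardware_saving(4.0):.0%} less hardware")

# The unrounded case: 27% of the per-Token compute, i.e. a gain of 1 / 0.27.
print(f"1/0.27 gain -> {hardware_saving(1 / 0.27):.0%} less hardware")
```

The 27% figure gives a saving of about 73%, which the article rounds to one-fourth of the hardware and a 75% saving.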

This is where DeepSeek truly excels: it has not changed NVIDIA's pricing but has altered the output efficiency of NVIDIA's machines in AI accounting. The significance of this far exceeds a simple API price reduction.

And the one trillion US dollars figure is not a baseless assumption.

McKinsey's 2026 report, “The Cost of Computing,” provides a specific figure: by 2030, global data centers will require approximately 6.7 trillion US dollars in investment to keep up with computing power demand, with AI-specific workloads accounting for about 5.2 trillion US dollars of that total.

In other words, over the next few years, humanity plans to invest trillions of US dollars in AI hardware.

A significant portion of this vast sum will flow toward the most cutting-edge and scarce hardware components—namely, HBM high-bandwidth memory and LPDDR memory. What DeepSeek is doing is systematically reducing the entire Chinese AI industry's reliance on these expensive hardware components. Even if it only reduces dependence partially, the value it saves for the industry will be astronomical, reaching into the trillions.

As China's daily Token consumption grows from over a hundred trillion today to hundreds or thousands of trillions, any reduction in unit Token costs will be amplified into enormous infrastructure savings. If the same throughput can indeed be achieved with one-fourth of the hardware, then in the foreseeable future, it could save China's AI infrastructure nearly one trillion US dollars in computing hardware investment.

This is an infrastructure account: whoever can generate more Tokens from the same rigid hardware expenditure will build fewer data centers, buy fewer GPUs, and stack less memory—and thus redistribute the future tickets to AI.

So, how does DeepSeek achieve this? The answer is that it has made three key modifications to the large model machine.

A common misconception is that the most costly aspect of large models lies in “thinking” or computation. In reality, this is not the case.

The true energy hogs are “memory” and “body,” and they consume the same expensive fuel—high-bandwidth memory (HBM), an extremely fast and costly type of memory directly integrated into the GPU packaging system.

Let’s start with memory. When generating text, large models have a clumsy characteristic: for every new token produced, they must revisit all previous content. This is because the meaning of language is built layer by layer, and what comes next depends entirely on the context established earlier.

This is akin to a simultaneous interpreter. They cannot start translating based solely on your last sentence but must carry everything you’ve said before; only by remembering the earlier groundwork can they understand the true intent of your current statement. The longer you speak, the more they must remember.

To avoid recalculating everything from scratch for each token (which would be too slow to use), models temporarily store the intermediate results they have already computed. This archive is called the KV cache (Key-Value Cache, which can be understood as the model's short-term memory).

The problem is that it swells wildly as conversations lengthen.

To provide a specific figure: based on a certain standard structure, processing a context of approximately 120,000 characters could consume 488GB of high-bandwidth memory. Meanwhile, NVIDIA's upcoming top-tier Rubin GPU has only 288GB of memory per card. In other words, storing this single memory archive alone would take about 1.7 times the entire memory of one of the most advanced GPUs, and the model hasn’t even started its actual work yet.
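The figure can be reproduced with the usual KV cache size formula. The article does not say which "standard structure" it has in mind, so every dimension below is an assumption: 61 layers, 128 attention heads, a head dimension of 128, 2-byte values, and a 128K-token context (131,072 tokens). These happen to land on 488 GiB, but they are an illustration, not a statement of the article's source.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard multi-head attention KV cache.

    The leading factor of 2 covers one key and one value vector
    per head, per layer, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model dimensions (assumptions, not taken from the article).
cache = kv_cache_bytes(tokens=131_072, layers=61, kv_heads=128, head_dim=128)

gpu_memory_gb = 288  # per-card memory quoted in the article
print(f"KV cache: {cache / GIB:.0f} GiB")
print(f"GPUs filled by the cache alone: {cache / GIB / gpu_memory_gb:.2f}")
```

The ratio treats GiB and GB as interchangeable, which is close enough for a back-of-the-envelope check.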

Now, let’s talk about the “body.” The model's “body” refers to its parameter weights, which can be roughly understood as the carrier of all its knowledge and capabilities. The stronger the capability, the larger the body tends to be, often reaching hundreds of billions or even trillions of parameters.

Traditional dense models (which use all parameters for any input) have a problem: no matter what you ask, they must mobilize the entire body. This is like going to a hospital for a toothache only to have doctors from every department examine you from head to toe before finally reaching the dentist. Absurd, but the full cost is still charged.

This massive body must also remain constantly resident in expensive high-bandwidth memory, ready for action at any time.

Memory and body—these two energy hogs—have firmly pinned the value distribution of the entire hardware system onto the most expensive, scarce, and constrained hardware components. Over the past decade, the industry's response has been simple and crude: if computing power is insufficient, stack more; if memory is insufficient, stack more. As a result, industry wealth has become highly concentrated along this most cutting-edge hardware chain, with the fattest profits trapped at the scarcest link.

Token prices have thus been hijacked by the scarcity of one type of hardware. DeepSeek's three modifications precisely loosen this grip.

The first modification targets “memory.” It strikes precisely at the place almost no one dares to touch: the attention mechanism (the core mechanism large models use to understand contextual relationships).

The attention mechanism is the brain of large models. Its ability to comprehend context and grasp key points in long conversations relies entirely on this mechanism repeatedly weighing the relationships between tokens. That expensive memory archive is precisely the product of every pulse from this brain.

To save memory while avoiding risks, nearly everyone chooses to bypass this brain and work only on the periphery. From multi-query attention (MQA), proposed in 2019 by Noam Shazeer, one of the original authors of Transformer, to grouped-query attention (GQA), introduced by Google in 2023 and widely adopted by Llama and others, the mainstream approach has always been to “have multiple query heads share the same memory”—essentially “remember fewer copies and make do.” The space-saving effects are astonishing, but the cost is compromised model quality. Frankly, the consensus along this route has always been “compromise”: accepting that compression will inevitably damage quality and only bargaining over how much.
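The “remember fewer copies” idea can be put in numbers: the cache grows with the number of key-value heads that are stored, so sharing heads shrinks it proportionally. The sketch below uses hypothetical dimensions (61 layers, 128 query heads, head dimension 128, 2-byte values, 8 shared groups for GQA); none of them come from the article.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2: one key vector and one value vector per stored head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


LAYERS, QUERY_HEADS, HEAD_DIM = 61, 128, 128  # hypothetical model shape

# Number of key-value heads actually kept in the cache for each scheme.
variants = {
    "MHA": QUERY_HEADS,  # every query head keeps its own keys and values
    "GQA": 8,            # query heads share 8 groups (assumed group count)
    "MQA": 1,            # all query heads share a single key-value head
}

per_token = {
    name: kv_bytes_per_token(LAYERS, heads, HEAD_DIM)
    for name, heads in variants.items()
}

for name, size in per_token.items():
    ratio = per_token["MHA"] // size
    print(f"{name}: {size:,} bytes per token ({ratio}x smaller than MHA)")
```

The saving is purely a matter of storing fewer heads, which is why this route trades memory against model quality.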

DeepSeek refuses to compromise. It chooses to operate directly on the brain by modifying the attention mechanism itself.

Its solution is called multi-head latent attention (MLA), first introduced in DeepSeek-V2 in 2024. To illustrate: while other models take notes by copying every detail verbatim into several large volumes, MLA first distills the notes into a highly condensed summary, stores only the summary, and reconstructs details from it when needed. In technical terms, this is called “low-rank compression”—projecting seemingly vast but highly redundant memories into a much more compact space for storage.
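The “store the summary, rebuild on demand” idea can be sketched in a few lines. This is a toy illustration of low-rank key-value compression, not DeepSeek's implementation: the sizes are tiny, the weights are random rather than trained, and the separate positional (rotary) component that the real design carries alongside the latent is omitted. All names are hypothetical.

```python
import random

random.seed(0)

HIDDEN, LATENT, KV_DIM = 16, 4, 32  # toy sizes; LATENT is much smaller than KV_DIM


def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


w_down = random_matrix(LATENT, HIDDEN)  # compresses a hidden state to a latent
w_up_k = random_matrix(KV_DIM, LATENT)  # rebuilds keys from the latent
w_up_v = random_matrix(KV_DIM, LATENT)  # rebuilds values from the latent


def compress(hidden_state):
    return matvec(w_down, hidden_state)


def expand(latent):
    return matvec(w_up_k, latent), matvec(w_up_v, latent)


# Only the small latent vector is cached for each token.
cache = []
for _ in range(3):
    hidden_state = [random.gauss(0, 1) for _ in range(HIDDEN)]
    cache.append(compress(hidden_state))

# Keys and values are reconstructed from the latent when attention needs them.
keys, values = expand(cache[0])

stored_values = len(cache) * LATENT      # what this scheme keeps
full_values = len(cache) * 2 * KV_DIM    # what a verbatim cache would keep
print(f"cached numbers: {stored_values} instead of {full_values}")
```

In a trained model the projection matrices are learned jointly with everything else, which is what lets the compressed form preserve quality.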

How impressive are the results? According to the DeepSeek-V2 paper, compared to its predecessor, V2 achieves stronger capabilities while reducing training costs by 42.5%, cutting KV Cache by 93.3%, and raising maximum generation throughput to 5.76 times the previous level. The earlier example that consumed 488GB could potentially be reduced to a few GB under this approach.
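To see how 488GB could shrink to “a few GB,” the same size arithmetic can be rerun with a compressed cache. The sketch assumes each token stores a 512-value latent plus a 64-value positional key per layer, figures believed to match the published DeepSeek-V2 configuration but treated here as assumptions, together with the same hypothetical 61 layers, 2-byte values, and 131,072-token context used for the uncompressed estimate.

```python
GIB = 1024 ** 3

TOKENS = 131_072     # assumed 128K-token context
LAYERS = 61          # assumed layer count
BYTES_PER_VALUE = 2  # assumed 16-bit storage

# Uncompressed baseline: keys and values for 128 heads of dimension 128.
full_per_token = 2 * LAYERS * 128 * 128 * BYTES_PER_VALUE

# Compressed cache: one shared latent plus one small positional key per layer.
LATENT_DIM, ROPE_DIM = 512, 64  # assumed MLA dimensions
mla_per_token = LAYERS * (LATENT_DIM + ROPE_DIM) * BYTES_PER_VALUE

full_gib = full_per_token * TOKENS / GIB
mla_gib = mla_per_token * TOKENS / GIB

print(f"verbatim cache:   {full_gib:.0f} GiB")
print(f"compressed cache: {mla_gib:.2f} GiB")
print(f"reduction factor: {full_gib / mla_gib:.1f}x")
```

The reduction here is measured against a full multi-head baseline, so it is not directly comparable to the paper's 93.3%, which is measured against DeepSeek's own earlier model.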

But the truly remarkable part is not how much it saves but that it pays almost no penalty in detail loss.

By common sense, compressing a book into a one-page summary and then reconstructing it would inevitably lose some details. Yet in DeepSeek's published experiments, this compressed memory not only performs no worse than the “verbatim copy” standard attention mechanism but even slightly outperforms it in some cases.

By V4, this approach has been pushed to even more extreme long-context scenarios: V4-Pro adopts a hybrid attention architecture, requiring only 27% of the inference computing power and 10% of the cache occupancy compared to the previous generation under a 1 million Token context setting.

To appreciate how difficult this is, consider that it’s like performing surgery on an airplane in flight. Modifying the attention mechanism means rewriting the model's most fundamental computational logic, retraining the entire model, and rebuilding the entire service system that supports it. A mistake at any step would cause the model's intelligence to collapse. This is not replacing a tire valve; it’s brain surgery.

And DeepSeek pulled it off, leaving the AI healthier after surgery than before.

The first modification tames memory. The second modification tackles the massive “body.”

The idea behind this modification is not original to DeepSeek but follows a clearly established path: mixture of experts (MoE), which involves splitting the model into many “experts” and activating only a few each time.

This concept dates back to 1991 and was brought into large-scale deep learning in 2017 by Shazeer and others. Google’s GShard and Switch Transformer later brought it into Transformer; what truly made it mainstream was French company Mistral’s Mixtral 8x7B, released in late 2023 with nothing more than a magnet link: approximately 46.7 billion parameters in total, but only about 12.9 billion activated per token.

Returning to the hospital analogy: MoE transforms it into a specialized facility where you go for a toothache, and the receptionist directs you straight to the dental department while other doctors go about their business. The total staff remains enormous—parameter counts can still reach hundreds of billions or trillions—but only a small fraction is mobilized each time.
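The receptionist in this analogy is the router. A minimal sketch of top-k routing, assuming eight experts, two of which are chosen per token: the scores are made up, and real routers add load balancing and other details omitted here.

```python
import math


def top_k_route(scores, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Returns {expert_index: weight}; weights are a softmax over the
    selected scores only, so they sum to 1.
    """
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}


# Hypothetical router scores for one token across eight experts.
router_scores = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
routing = top_k_route(router_scores, k=2)

for expert, weight in routing.items():
    print(f"expert {expert}: weight {weight:.3f}")
print(f"experts left idle: {len(router_scores) - len(routing)}")
```

Only the selected experts run for this token; the other six contribute no computation at all.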

DeepSeek pushed this approach to a highly aggressive scale in V3 and even further in V4: V4-Pro has 1.6 trillion total parameters and 49 billion active parameters; V4-Flash has 284 billion total parameters and 13 billion active parameters. In other words, the model's “total body” continues to grow, but the portion actually mobilized at each step remains very small.
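How small “very small” is follows directly from the parameter counts quoted in the text. A short sketch that uses only those figures, in billions of parameters:

```python
# (total parameters, active parameters) in billions, as quoted in the article.
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek V4-Flash": (284, 13),
    "DeepSeek V4-Pro": (1600, 49),
}

active_fraction = {
    name: active / total for name, (total, active) in models.items()
}

for name, fraction in active_fraction.items():
    print(f"{name}: {fraction:.1%} of parameters active per token")
```

On these figures, Mixtral mobilizes a bit over a quarter of its parameters per token, while V4-Pro mobilizes about 3%.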

But the true ingenuity of the second modification lies beyond merely “activating fewer doctors.” Along the way, it also transforms how the model accesses these “bodies.”

Here’s a more fitting metaphor. Traditional large models resemble a massive but disorganized storage room: everything is piled together, and even to retrieve one item, you must open the door and rummage through everything from the bottom up to find it. To make this rummaging fast enough for a steady stream of customers, you can only place the entire storage room in the most expensive “downtown storefront”—i.e., high-bandwidth memory.

DeepSeek transforms this storage room into a cabinet with tens of thousands of numbered compartments. To use any item, you simply pull open the corresponding compartment by its number without touching the others. This means you no longer need to store the entire cabinet’s contents in the most expensive storefront. Most compartments not currently in use can be placed in much cheaper regular memory (LPDDR) or even cheaper solid-state drives, with only the needed compartment loaded quickly when required. DeepSeek's ecosystem and open-source inference systems like SGLang continue to explore such offloading and streaming loading approaches.
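The numbered-compartment picture maps onto a familiar pattern: keep a few recently used experts in fast memory and fetch the rest from a cheaper tier on demand. The sketch below uses a least-recently-used policy as one plausible choice; the class, its parameters, and the loader are hypothetical and do not describe how DeepSeek or SGLang actually schedule offloading.

```python
from collections import OrderedDict


class ExpertCache:
    """Keeps at most `capacity` experts in the fast tier (LRU eviction)."""

    def __init__(self, capacity, load_from_slow_tier):
        self.capacity = capacity
        self.load_from_slow_tier = load_from_slow_tier  # e.g. a read from LPDDR or SSD
        self.fast_tier = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.fast_tier:
            # Already resident: mark as most recently used.
            self.fast_tier.move_to_end(expert_id)
            self.hits += 1
            return self.fast_tier[expert_id]
        # Not resident: fetch from the cheaper tier, evicting the oldest entry.
        self.misses += 1
        weights = self.load_from_slow_tier(expert_id)
        self.fast_tier[expert_id] = weights
        if len(self.fast_tier) > self.capacity:
            self.fast_tier.popitem(last=False)
        return weights


# Stand-in loader: a real system would copy weight tensors between tiers.
cache = ExpertCache(capacity=2, load_from_slow_tier=lambda i: f"weights-{i}")
for expert_id in [3, 7, 3, 5, 7]:
    cache.get(expert_id)

print(f"hits={cache.hits} misses={cache.misses} resident={list(cache.fast_tier)}")
```

Every miss costs a slower read, so the approach pays off only when routing is skewed enough, or prefetching good enough, to keep the hit rate high.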

Here, the synergy between the first two modifications becomes clear: the first compresses “memory,” while the second numbers the “body” and retrieves only the necessary compartment. Together, they drastically reduce the portion of the most expensive memory truly needed by the machine at any given moment.

The third cut pushes the logic of 'retrieval by number' to its extreme: even the act of 'computation' is minimized wherever possible. Some calculation results can actually be precomputed and stored as numbered cells, retrieved directly when needed instead of recalculating each time. It’s like memorizing multiplication tables—no one counts on their fingers for 7×8; they just blurt out 56. This amounts to replacing costly 'hard computation' (chip operations) with low-cost 'retrieval' (memory reads).
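The multiplication-table analogy is literally a lookup table: pay for the computation once, then answer by retrieval. A minimal sketch, with the table and function names chosen for illustration:

```python
# Build the table once: this is the only place multiplication happens.
TIMES_TABLE = {(a, b): a * b for a in range(1, 10) for b in range(1, 10)}


def multiply_by_lookup(a, b):
    """Answer from memory instead of recomputing."""
    return TIMES_TABLE[(a, b)]


print(multiply_by_lookup(7, 8))
print(f"entries stored: {len(TIMES_TABLE)}")
```

The trade is explicit: 81 stored entries of memory in exchange for never performing the arithmetic again.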

In V4, this approach finds a more direct commercial expression: cache hit prices are slashed to near-zero, while long-context reuse is directly baked into the pricing model. Repeated computations aren’t just technically avoidable—they’re commercially incentivized to be avoided.

Viewed together, these three cuts aren’t isolated moves but a layered progression of the same logic: transforming a chaotic mess into a system where everything is precisely retrievable by number. Memory is minimized, only the necessary components are awakened, and computations default to lookup tables instead of recalculations. Each cut reduces the machine’s reliance on the most expensive hardware. Combined, they allow it to handle the same workload with a fraction of the cutting-edge hardware it once required.

In May 2026, DeepSeek announced it would make the previous 75% discounted price of V4-Pro permanent, widening the pricing gap between cache hits, cache misses, and output tokens. The cache hit price matters because it turns DeepSeek’s third cut into a commercial rule: precomputed contexts shouldn’t be charged repeatedly as 'new work.'
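A three-tier price sheet of this kind can be sketched as a simple billing function. The prices below are placeholders invented for illustration, not DeepSeek's actual rates; only the structure (cache hits, cache misses, and output tokens priced separately) follows the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    """Prices in US dollars per million tokens."""
    cache_hit: float
    cache_miss: float
    output: float


def request_cost(prices, hit_tokens, miss_tokens, output_tokens):
    total = (
        hit_tokens * prices.cache_hit
        + miss_tokens * prices.cache_miss
        + output_tokens * prices.output
    )
    return total / 1_000_000


# Placeholder prices (hypothetical, chosen only to show the mechanism).
PLACEHOLDER = PriceSheet(cache_hit=0.01, cache_miss=1.00, output=2.00)

# First call: the 100,000-token context is new, so every input token is a miss.
first_call = request_cost(PLACEHOLDER, hit_tokens=0, miss_tokens=100_000, output_tokens=1_000)

# Repeat call: the same context is already cached, so every input token is a hit.
repeat_call = request_cost(PLACEHOLDER, hit_tokens=100_000, miss_tokens=0, output_tokens=1_000)

print(f"first call:  ${first_call:.3f}")
print(f"repeat call: ${repeat_call:.3f}")
```

With a wide enough gap between hit and miss prices, the cost of a repeat call is dominated by its output tokens rather than by the context it rereads.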

The contrast becomes tangible in real invoices. For a mid-sized application processing 1 billion tokens monthly, the same workload costs: ~$522 with DeepSeek V4-Pro; ~$9,000 with Claude Opus 4.7; ~$10,000 with GPT-5.5. That’s a 17-to-19-fold difference.

Consider an extreme but common scenario: a long-context programming assistant repeatedly rereads a 100,000-token codebase 100 times. With near-free cache hits, DeepSeek charges ~$0.036; GPT-5.5 and Claude Opus 4.7 each charge ~$5—a 100+ fold difference.
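The ratios in these two comparisons can be recomputed from the dollar amounts quoted in the text. The implied per-million price at the end assumes all 10 million reread tokens are billed at a single flat rate, which is a simplification; the article does not break the charge down.

```python
# Monthly invoices for 1 billion tokens, as quoted in the article (US dollars).
monthly_invoice_usd = {
    "DeepSeek V4-Pro": 522,
    "Claude Opus 4.7": 9_000,
    "GPT-5.5": 10_000,
}
baseline = monthly_invoice_usd["DeepSeek V4-Pro"]
invoice_ratio = {name: cost / baseline for name, cost in monthly_invoice_usd.items()}

# Rereading a 100,000-token codebase 100 times.
reread_tokens = 100_000 * 100
deepseek_reread_usd = 0.036
other_reread_usd = 5.0

# Assumption: one flat rate applies to every reread token.
implied_price_per_million = deepseek_reread_usd / (reread_tokens / 1_000_000)
reread_ratio = other_reread_usd / deepseek_reread_usd

for name, ratio in invoice_ratio.items():
    print(f"{name}: {ratio:.1f}x the DeepSeek invoice")
print(f"implied DeepSeek reread price: ${implied_price_per_million:.4f} per million tokens")
print(f"reread cost ratio: {reread_ratio:.0f}x")
```

The invoice ratios come out at roughly 17 and 19, and the reread ratio at roughly 139, consistent with the ranges stated above.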

These prices aren’t loss leaders. The machine is simply this efficient—a cost advantage engineered inch by inch by Chinese teams. Two years ago, Liang Wenfeng stated pricing principles: 'no losses, but no excessive profits.' The fuller context: when your cost structure operates on a different plane, your pricing naturally follows suit.

Of course, this reengineering isn’t risk-free. Shifting loads to cheaper memory and storage introduces trade-offs in power consumption, latency, and scheduling complexity. In some cases, system-wide costs per generated token may not decrease unless hardware, software stacks, and storage media are further optimized. These three cuts represent a delicate balancing act, not mindless cost-cutting. But the direction is clear: replace the most expensive, constrained resource with cheaper, more accessible ones.

06 Turning 'One Trillion' into Visible Savings

After discussing so much 'saving,' let’s visualize it: How many fewer AI computing centers need to be built?

Start with token throughput. By March 2026, China’s daily token calls exceeded 140 trillion, a 1,000+ fold increase from early 2024. Industry data shows Doubao’s large model alone handled over 120 trillion daily tokens that month. While statistics vary, they converge on one fact: China’s AI token consumption has entered the hundred-trillion range and is rapidly approaching the thousand-trillion mark. Thus, 500 trillion tokens/day represents the near future; 5,000 trillion tokens/day reflects a high-traffic scenario with widespread agents, multimodality, and code generation.

Against this backdrop, DeepSeek’s value becomes clear. In 2025, China Unicom began building a thousand-card AI inference center in Wuhan with an initial investment of ~200 million yuan (~$28 million). Consider this a template: ~200 million yuan per thousand-card center.

DeepSeek V4’s efficiency gains aren’t minor optimizations—they’re multiples. In long-context scenarios, its hardware efficiency improves by severalfold. Using a conservative estimate: V4’s trifecta of optimizations quadruples effective token throughput per hardware unit. Thus, tasks once requiring four centers now need only one, saving 75% in equivalent hardware investment.

Note that DeepSeek doesn’t simply reduce storage use. Quite the opposite: it leverages storage—through compressed attention, on-demand activation, cache hits, and inference scheduling—to maximize GPU and memory utilization. What’s saved is the additional hardware that would otherwise be purchased for the same token throughput.

So, what does $1 trillion represent? ~$1 trillion ≈ 7 trillion yuan. At 200 million yuan per thousand-card center, that’s 35,000 centers. If V4’s approach quadruples throughput, avoiding 35,000 equivalent centers aligns with a daily token flow of ~5,000 trillion.
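The chain of numbers in this estimate can be laid out step by step. The sketch uses the article's own inputs (7 yuan per dollar, 200 million yuan per thousand-card center, a fourfold throughput gain, 5,000 trillion tokens per day); the per-card throughput at the end is merely what those inputs imply, not a measured figure.

```python
USD_SAVED = 1e12              # one trillion US dollars
YUAN_PER_USD = 7.0            # exchange rate used in the article
YUAN_PER_CENTER = 200e6       # cost of one thousand-card center
CARDS_PER_CENTER = 1_000
THROUGHPUT_GAIN = 4.0         # the article's conservative estimate
TOKENS_PER_DAY = 5_000e12     # high-traffic scenario

# Centers that no longer need to be built.
centers_avoided = USD_SAVED * YUAN_PER_USD / YUAN_PER_CENTER

# Avoiding 75% of the build implies this baseline without the gain...
baseline_centers = centers_avoided / (1 - 1 / THROUGHPUT_GAIN)
# ...and this many centers still built with it.
centers_built = baseline_centers / THROUGHPUT_GAIN

# Per-card throughput implied by the baseline (derived, not measured).
tokens_per_card_per_second = (
    TOKENS_PER_DAY / (baseline_centers * CARDS_PER_CENTER) / 86_400
)

print(f"centers avoided:  {centers_avoided:,.0f}")
print(f"baseline centers: {baseline_centers:,.0f}")
print(f"centers built:    {centers_built:,.0f}")
print(f"implied baseline throughput: {tokens_per_card_per_second:,.0f} tokens/s per card")
```

Read this way, the trillion-dollar figure corresponds to a baseline of roughly 46,700 centers, of which about 11,700 would still be built.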

This is the industrial landscape behind 'one trillion dollars.' It’s not a precise engineering bid but an infrastructure-scale calculation, projecting future—not current—traffic scenarios. The point is: in an era of modest usage, efficiency gains save a few cards or racks. In an era of thousands of trillions of tokens/day, they save entire data centers.

DeepSeek doesn’t just change pricing per call—it rewrites the future of AI infrastructure.

Now, back to the machine. Recall Vera Rubin’s $7.8 million cost: $2 million went to memory, which continues to surge in price. This reveals a dangerous trend—the industry’s value is increasingly and unhealthily tied to memory chips, which shouldn’t be this expensive.

Many assume DeepSeek 'follows' this trend by using large volumes of memory. Wrong. DeepSeek reverses it. The old approach passively devours hardware, inefficiently piling value onto chips while memory prices spiral. DeepSeek first drastically reduces hardware demands, then strategically allocates residual needs to the cheapest, most suitable storage tier. The difference: 'being pushed by prices' versus 'calculating costs before spending.'

This matters profoundly for China. It shifts the battlefield from our weakest area (cutting-edge chips) to our strongest (memory chips). We can’t yet match the fastest AI chips, but memory? That’s a strength we’ve solidified in 2026.

ChangXin Memory Technologies (CXMT), China’s DRAM leader, hit 50.8 billion yuan (~$7.1 billion) in Q1 2026 revenue with ~25 billion yuan (~$3.5 billion) in profit. Its H1 profit is projected at 66–75 billion yuan (~$9.2–10.5 billion), matching ByteDance’s 2025 annual profit. CXMT still ranks only fourth globally in DRAM, but China’s domestic capacity, once negligible, is now substantial.

This is the strategic value of DeepSeek’s three cuts. It’s not 'replacing compute with storage' but reducing marginal reliance on scarce compute while shifting pressure to accessible storage, caching, and systems engineering. When an AI machine leans heavily on memory, caching, scheduling, and systems—areas China controls—the supply chain transforms from 'constrained' to 'sufficient,' even 'advantageous.' Security soars.

Epilogue

Liang Wenfeng, whose instinct is to 'eradicate inefficiency,' won’t settle for cheaper models. His target is the AI industry’s greatest inefficiency: the assumption that 'stronger intelligence requires the most advanced, scarcest, and costliest hardware.'

If DeepSeek enables the industry to achieve the same with far less cutting-edge hardware, it frees up a trillion-dollar 'virtual capacity base'—no factories needed, just redirected capital. That 'one trillion' ceases to be a valuation myth and becomes an infrastructure ledger.

Framing DeepSeek as 'algorithm vs. NVIDIA' is a cheap narrative. A better question: Can it reduce industry reliance on the costliest hardware, scarcest memory, and 'inevitable' inference costs? Yes. Can it redistribute AI infrastructure value from premium GPUs to model architecture, inference systems, cache management, storage orchestration, and engineering optimization? Also yes. That’s its true industrial significance.

True tech revolutions don’t just make things pricier—they turn luxuries into everyday infrastructure. On a grander scale, DeepSeek’s real impact isn’t how much it saves but how those savings redistribute AI’s future tickets to China’s vast industries waiting to be transformed.

(This analysis synthesizes public data and industry discussions. Forward-looking claims—e.g., trillion-dollar infrastructure substitution, hardware efficiency trade-offs, cost equivalencies—are industry projections under debate, not established facts. Readers should interpret them critically.)
