06/02 2026
In the second half of 2026, NVIDIA will deliver its most powerful AI platform to date: the Vera Rubin VR200 NVL72. A full rack houses 72 Rubin GPUs and 36 Vera CPUs. Morgan Stanley estimates the material cost of this machine at approximately $7.8 million.
This figure is already alarming. But what deserves even closer attention is where the money is being spent.
Of that $7.8 million, roughly $2 million is not spent on the world-renowned GPU chip or the computing cores but on memory—high-bandwidth memory (HBM4) and regular memory (LPDDR5X). In just one year, the cost of this memory has surged by 435% due to price hikes.
This is a signal. In the increasingly expensive machinery of AI, money is flowing heavily from 'computing components' to 'memory and storage components.'
Remember this signal. The story of DeepSeek, which this article explores, runs in precisely the opposite direction. While everyone else is being pushed to pay ever-higher premiums for increasingly expensive memory, DeepSeek is finding ways, without compromising competitiveness, to raise the token output of this expensive hardware roughly fourfold through software-hardware integration, effectively saving 75% of the hardware investment.
And at the end of this journey, a conjecture has been gaining traction lately: Can DeepSeek, through its efforts, save China $1 trillion in AI infrastructure construction?
Is this really possible?
—Introduction
That NVIDIA quote mentioned earlier represents the hardest cost in recent AI infrastructure accounting. Under the current supply-demand landscape, if you want the most advanced AI machines, you must accept this bill.
DeepSeek cannot change that.
What it can change is something else: how many tokens can the same machine, with the same $2 million worth of expensive storage hardware, produce?
This question becomes especially concrete after the release of DeepSeek V4.
What deserves more attention in V4 is not just the model itself but the three key strategies it demonstrates: first, keep compressing 'memory' so that long contexts no longer overwhelm GPU memory; second, activate 'the body' on demand so that massive expert models do not need to mobilize fully every time; third, turn repetitive computations into reusable assets so that already-processed contexts do not keep burning money.

These technologies share a prominent feature—they focus on software-hardware collaboration rather than pure software optimization. Hence the playful metaphor: DeepSeek may become China's largest AI hardware company.
Its model page shows that in a 1 million-token context scenario, V4-Pro requires only 27% of the per-token inference computing power and 10% of the cache occupancy compared to the previous generation. In this article, we use approximately one-fourth of the computing power for subsequent calculations.
Under traditional approaches, these hardware components can only support one throughput level. Through long-context compression, on-demand activation, cache reuse, and inference scheduling, DeepSeek can quadruple the effective token output of the same hardware—not by 'cutting costs' but by amortizing them. Tasks that once required four machines may now be accomplished with one; tasks that once consumed a full share of expensive hardware costs per token can now spread the same hardware costs across four tokens.
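The amortization argument reduces to two small formulas. The sketch below is a simplification that assumes token throughput scales inversely with per-token compute; real systems are also limited by memory bandwidth and scheduling. The function names are hypothetical and chosen only for illustration.

```python
def throughput_multiplier(compute_fraction: float) -> float:
    """Relative token output of one machine if per-token compute shrinks to this fraction."""
    if not 0 < compute_fraction <= 1:
        raise ValueError("compute_fraction must be in (0, 1]")
    return 1.0 / compute_fraction


def hardware_saved(multiplier: float) -> float:
    """Share of machines no longer needed to produce the same number of tokens."""
    return 1.0 - 1.0 / multiplier


# The quoted 27% per-token compute corresponds to about 3.7x.
print(round(throughput_multiplier(0.27), 2))
# Rounding to one-fourth gives 4x, which frees 75% of the hardware.
print(hardware_saved(throughput_multiplier(0.25)))
```

Under this simplification, 27% gives a multiplier of about 3.7, so the "quadruple" used in the rest of the article is a rounded figure.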
This is where DeepSeek truly excels: it hasn't changed NVIDIA's pricing but has altered the productivity of NVIDIA's machines in AI accounting. The significance of this far exceeds a simple API price reduction.
And the $1 trillion figure is not pulled out of thin air.
McKinsey's 2025 report, 'The cost of compute,' provides a concrete number: by 2030, global data centers will require approximately $6.7 trillion in investment to keep pace with computing demand, with AI-specific workloads accounting for about $5.2 trillion of that.
In other words, over the next few years, humanity plans to pour trillions of dollars into AI hardware.
A significant portion of this vast sum will flow toward the most cutting-edge, scarcest hardware—namely, HBM high-bandwidth memory and LPDDR memory. What DeepSeek is doing is systematically reducing the entire Chinese AI industry's reliance on this expensive hardware. Even if it only reduces dependence partially, the value it saves for the industry will reach astronomical, trillion-dollar figures.
As China's daily token consumption grows from over 100 trillion today to hundreds or thousands of trillions, any reduction in per-unit token costs will amplify into massive infrastructure savings. If the same throughput can indeed be achieved with one-fourth of the hardware, then in the foreseeable future, it could save China nearly $1 trillion in computing hardware investment for AI infrastructure.
This is an infrastructure calculation: whoever can generate more tokens from the same rigid hardware expenditure will build fewer data centers, buy fewer GPUs, and stack less memory—thereby redistributing future AI access tickets.
So, how does DeepSeek achieve this? The answer is that it has made three critical modifications to the large model machinery.
A common misconception is that the most costly aspect of large models lies in 'thinking' or computation. It's not.
Their two true resource hogs are 'memory' and 'the body.' And they consume the same expensive fuel—high-bandwidth memory (HBM), an extremely fast and costly type of memory directly integrated into the GPU packaging system.
First, let's talk about memory. When generating text, large models have a clumsy characteristic: for every new token produced (roughly, each word or character), they must revisit all previous content. Because linguistic meaning builds layer by layer, what comes next depends entirely on the context established earlier.
This resembles a simultaneous interpreter, who cannot start translating from your last sentence alone but must carry everything you have said before. Only by remembering that groundwork can they grasp where your current statement is heading. The longer you speak, the more they must remember.
To avoid recalculating everything from scratch for each token (which would be unbearably slow), models temporarily store the intermediate results they have already computed. This archive is called the KV cache (key-value cache), akin to the model's short-term memory.
The trouble is that it swells wildly as conversations lengthen.
To provide a concrete figure: based on one standard architectural estimate, processing approximately 120,000 characters of context could consume 488GB of high-bandwidth memory. Meanwhile, NVIDIA's upcoming top-tier Rubin GPU offers only 288GB of memory per card. In other words, storing this single 'memory' alone would occupy the entire memory capacity of about 1.7 of the most advanced GPUs (488 ÷ 288), and the model has not even started its actual work yet.
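For readers who want to see where a number of this size can come from, here is a minimal sketch of the usual KV cache estimate for standard multi-head attention. The article does not state which configuration its estimate uses; the values below (61 layers, 128 heads of width 128, 16-bit values, 131,072 tokens) are an assumed configuration that happens to land on the same figure, treating GB loosely as binary gigabytes.

```python
GIB = 2**30


def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """One key vector and one value vector per layer, per head, per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Assumed configuration, not taken from the article.
size = kv_cache_bytes(layers=61, kv_heads=128, head_dim=128, tokens=131_072)

print(size / GIB)                   # 488.0
print(round(size / GIB / 288, 2))   # cards of 288 GB needed: 1.69
```

The cache grows linearly with context length, so doubling the context doubles this figure.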
Now, let's discuss 'the body.' The model's 'body' refers to its parameter weights, which can be roughly understood as the vessel for all its knowledge and capabilities. The stronger the capability, the larger the body tends to be, often reaching hundreds of billions or even trillions of parameters.
Traditional dense models (which mobilize all parameters for any input) have a problem: no matter what you ask, they must activate the entire body. It's like going to a hospital for a toothache only to have every department's doctors summoned to examine you from head to toe before finally reaching the dentist. Absurd, but you're billed for the full service.
This massive body must also remain constantly resident in expensive high-bandwidth memory, ready for action at any moment.
Memory and the body—these two resource hogs—have firmly pinned the value distribution of the entire hardware ecosystem onto the most expensive, scarcest, and supply-constrained hardware components. For the past decade, the industry's response has been crude and straightforward: if computing power is insufficient, stack more; if memory is insufficient, stack more. Consequently, industry wealth has heavily accumulated along this most cutting-edge hardware chain, with the fattest profits concentrated at the scarcest link.
Token prices have thus been held hostage by the scarcity of one hardware type. And DeepSeek's three modifications precisely loosen this grip.
The first cut targets 'memory.' And it strikes precisely where most avoid touching: the attention mechanism (the core mechanism large models use to understand contextual relationships).
The attention mechanism is the brain of large models. Its ability to comprehend context and grasp key points in long conversations depends entirely on this mechanism repeatedly weighing the relationships among all the tokens. That expensive 'memory' mentioned earlier is precisely the byproduct of each pulse from this brain.
To save memory while avoiding risks, nearly everyone chooses to bypass this brain and work only on the periphery. From multi-query attention (MQA), proposed in 2019 by Noam Shazeer, one of the original Transformer authors, to grouped-query attention (GQA), introduced by Google in 2023 and widely adopted by Llama and others, the mainstream approach has consistently been 'let multiple query heads share the same memory'—essentially 'remember less and make do.' The space-saving effects are astonishing, but the cost is compromised model quality. Frankly, the consensus along this route has always been 'compromise': accepting that compression will inevitably damage quality and merely bargaining over the degree of damage.
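The 'share the same memory' idea is easy to quantify. The sketch below counts how many cached values each token needs when 128 query heads share 128, 8, or 1 key-value heads. The layer count, head count, head width, and the choice of 8 groups are assumptions for illustration, not figures from any specific model.

```python
def kv_values_per_token(layers: int, kv_heads: int, head_dim: int) -> int:
    """Cached values per token: keys plus values, for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim


def kv_head_for_query(query_head: int, query_heads: int, kv_heads: int) -> int:
    """Which shared KV head a given query head reads from."""
    return query_head // (query_heads // kv_heads)


LAYERS, QUERY_HEADS, HEAD_DIM = 61, 128, 128  # assumed sizes
variants = {"MHA": QUERY_HEADS, "GQA (8 groups)": 8, "MQA": 1}
baseline = kv_values_per_token(LAYERS, QUERY_HEADS, HEAD_DIM)

for name, kv_heads in variants.items():
    values = kv_values_per_token(LAYERS, kv_heads, HEAD_DIM)
    print(f"{name}: {values} values per token, {values / baseline:.2%} of MHA")

# With 8 groups, query heads 0..15 share KV head 0, and head 127 uses KV head 7.
print(kv_head_for_query(127, QUERY_HEADS, 8))
```

The saving is exactly the sharing ratio, which is why these methods trade cache size directly against how much distinct information each query head can look at.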
DeepSeek refuses to compromise. It chooses to operate directly on the brain itself, modifying the attention mechanism.
Its solution is called multi-head latent attention (MLA), first introduced in DeepSeek-V2 in 2024. To illustrate: while other models take verbatim notes, filling multiple large volumes with every detail, MLA first distills the notes into a highly condensed summary, stores only the summary, and reconstructs details from it when needed. In technical terms, this is called 'low-rank compression': projecting seemingly vast but highly redundant memories into a much more compact space for storage.
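The 'store the summary, rebuild the details' idea can be sketched in a few lines. This is a toy illustration of low-rank key-value compression under assumed sizes, with random matrices standing in for learned weights. It is not DeepSeek's implementation, and it leaves out the separate positional component the real design also carries.

```python
import random


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
HIDDEN, LATENT, HEADS, HEAD_DIM = 64, 8, 4, 16  # toy sizes, assumed

w_down = random_matrix(LATENT, HIDDEN, rng)            # hidden state -> latent summary
w_up_k = random_matrix(HEADS * HEAD_DIM, LATENT, rng)  # latent -> keys for all heads
w_up_v = random_matrix(HEADS * HEAD_DIM, LATENT, rng)  # latent -> values for all heads

cache = []  # only the small latent vector is stored per token
for _ in range(5):
    hidden_state = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
    cache.append(matvec(w_down, hidden_state))

# At attention time, keys and values are rebuilt from the latent on demand.
keys = [matvec(w_up_k, latent) for latent in cache]
values = [matvec(w_up_v, latent) for latent in cache]

stored = len(cache) * LATENT
uncompressed = len(cache) * 2 * HEADS * HEAD_DIM
print(stored, uncompressed)  # 40 640
```

In this toy setup the cache holds 40 numbers instead of 640. Whether quality survives depends on training the projections jointly with the rest of the model, which is the hard part the article describes.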
How impressive are the results? The DeepSeek-V2 paper reports that, compared with its predecessor DeepSeek 67B, V2 achieves stronger capabilities while reducing training costs by 42.5%, cutting the KV cache by 93.3%, and raising maximum generation throughput to 5.76 times the earlier level. The earlier example that consumed 488GB could potentially be reduced to a few GB under this approach.
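The 'few GB' claim can be sanity-checked with the same kind of arithmetic. The sketch assumes a compressed cache of 576 values per token per layer (512 latent values plus 64 positional values, the widths given in the DeepSeek-V2 paper), and compares it with uncompressed attention using 128 heads of width 128. The 61 layers, 131,072 tokens, and 16-bit storage are assumptions for illustration.

```python
GIB = 2**30
LAYERS, TOKENS, BYTES_PER_VALUE = 61, 131_072, 2  # assumed

# Uncompressed: keys and values for 128 heads of width 128.
full_values_per_layer = 2 * 128 * 128
# Compressed: one latent vector plus a small positional key (assumed widths).
latent_values_per_layer = 512 + 64

full_gib = LAYERS * full_values_per_layer * TOKENS * BYTES_PER_VALUE / GIB
latent_gib = LAYERS * latent_values_per_layer * TOKENS * BYTES_PER_VALUE / GIB

print(full_gib)                         # 488.0
print(latent_gib)                       # 8.578125
print(round(full_gib / latent_gib, 1))  # 56.9
```

Under these assumptions the cache shrinks from hundreds of gigabytes to under ten, which is the order of magnitude the article describes.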
But the truly remarkable aspect isn't the savings but the near-zero cost in detail loss.
By common sense, compressing a book into a one-page summary means you can't recover all details upon reconstruction. Yet in DeepSeek's published experiments, this compressed memory not only matches but in some cases slightly outperforms the standard attention mechanism that 'transcribes the full text.'
With V4, this approach is pushed to even more extreme long-context scenarios: V4-Pro adopts a hybrid attention architecture, requiring only 27% of the inference computing power and 10% of the cache occupancy compared to the previous generation under a 1 million-token context setting.
To appreciate how difficult this is, consider performing surgery on an airplane in flight. Modifying the attention mechanism means rewriting the model's most fundamental computational logic, retraining the entire model, and rebuilding the entire service system that supports its operation. A mistake at any step could collapse the model's intelligence. This isn't changing a tire valve; it's brain surgery.
And DeepSeek pulled it off, leaving the AI healthier post-surgery than before.
The first cut tames memory. The second cut addresses the massive 'body.'
This cut's logic isn't original to DeepSeek but continues a clear existing path: mixture of experts (MoE), which means splitting the model into many 'experts' and activating only a few each time.
This concept dates back to 1991 and was scaled up to large neural networks in 2017 by Shazeer and others through sparsely gated expert layers. Google's GShard and Switch Transformer later brought it into Transformer architectures. What truly popularized it was Mixtral 8x7B from the French company Mistral, released in late 2023 with nothing more than a magnet link: roughly 46.7 billion parameters in total, of which only about 12.9 billion are activated per token.
Returning to our hospital analogy: MoE transforms it into a clearly specialized facility. When you come for a toothache, the reception directs you straight to the dental department, while other departments continue their work. The hospital's total staff remains enormous—parameter counts can reach hundreds of billions or even trillions—but only a small fraction mobilizes at any given time.
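The 'reception desk' in this analogy is a small router that scores every expert and sends the input to the top few. Below is a minimal sketch of top-k routing, with eight toy experts and hand-picked router scores. The expert functions and scores are made up for illustration, and real routers also handle load balancing, which is omitted here.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Run only the chosen experts and mix their outputs by router weight."""
    weights = top_k_route(router_logits, k)
    output = sum(w * experts[i](x) for i, w in weights.items())
    return output, sorted(weights)


# Eight toy experts: expert i simply multiplies its input by (i + 1).
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

output, used = moe_forward(10.0, experts, logits, k=2)
print(used)    # [1, 4]
print(output)  # 35.0
```

Six of the eight experts never run for this input, which is where the compute saving comes from.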
DeepSeek pushed this route to a remarkably aggressive scale in V3 and even further in V4: V4-Pro has 1.6 trillion total parameters with 49 billion activated; V4-Flash has 284 billion total parameters with 13 billion activated. In other words, the model's 'total body' continues growing, but the portion actually mobilized at each step remains very small.
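The parameter counts quoted in this section translate into activation shares with one division each. The helper name is hypothetical; the figures are the ones given in the text, in billions of parameters.

```python
def active_share(total_billion: float, active_billion: float) -> float:
    """Fraction of the model's parameters used for a single token."""
    return active_billion / total_billion


models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "V4-Flash": (284, 13),
    "V4-Pro": (1600, 49),
}

for name, (total, active) in models.items():
    print(f"{name}: {active_share(total, active):.1%} of parameters active per token")
```

This works out to roughly 27.6% for Mixtral 8x7B, 4.6% for V4-Flash, and 3.1% for V4-Pro, so the larger the total body, the smaller the share that moves at each step.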
But the second cut's true ingenuity lies beyond merely 'activating fewer doctors.' It seizes the opportunity to transform how the model accesses these 'bodies.'
Here's a more fitting metaphor. Traditional large models resemble a massive but disorganized storage room: everything is piled together, and even retrieving one item requires opening the main door and rummaging through everything from the bottom up to find it. To serve customers fast enough, you can only place this entire storage room in the most expensive 'downtown storefront'—high-bandwidth memory.
DeepSeek transforms this storage room into a cabinet with tens of thousands of numbered compartments. To use any item, you simply pull open the corresponding compartment by its number without touching the others. This means you no longer need to store the entire cabinet's contents in the most expensive storefront. Most compartments not currently in use can be placed in much cheaper regular memory (LPDDR) or even cheaper solid-state drives, with only the needed compartment loaded quickly when required. DeepSeek's ecosystem and open-source inference systems like SGLang continue exploring these offloading and streaming loading techniques.
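The numbered-compartment idea maps naturally onto a two-tier store: a small fast tier holding recently used experts, backed by a large slow tier. The sketch below uses a least-recently-used eviction rule, which is an assumption chosen for simplicity and not a description of how DeepSeek or SGLang actually schedule transfers. The class and attribute names are hypothetical.

```python
from collections import OrderedDict


class TieredExpertStore:
    """Keeps a few experts in a small fast tier; the rest wait in a larger slow tier."""

    def __init__(self, slow_tier, fast_capacity):
        self.slow_tier = slow_tier      # stands in for LPDDR or SSD
        self.fast_tier = OrderedDict()  # stands in for HBM
        self.fast_capacity = fast_capacity
        self.loads = 0                  # number of slow-tier fetches so far

    def get(self, expert_id):
        if expert_id in self.fast_tier:
            self.fast_tier.move_to_end(expert_id)  # mark as most recently used
            return self.fast_tier[expert_id]
        weights = self.slow_tier[expert_id]        # the expensive transfer
        self.loads += 1
        self.fast_tier[expert_id] = weights
        if len(self.fast_tier) > self.fast_capacity:
            self.fast_tier.popitem(last=False)     # evict the least recently used
        return weights


slow = {i: f"weights-{i}" for i in range(16)}
store = TieredExpertStore(slow, fast_capacity=4)
for expert_id in [3, 7, 3, 3, 9, 7, 12, 1, 3]:
    store.get(expert_id)

print(store.loads)            # 6
print(list(store.fast_tier))  # [7, 12, 1, 3]
```

Nine requests cost only six slow-tier loads here. The trade-off is that every miss pays the latency of the slower tier, which is the data-movement cost the article returns to later.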
Here, the synergy between the first two cuts emerges: the first cut reduces 'memory' size, while the second cut numbers 'the body' and retrieves only the necessary compartment. Together, they minimize the portion of the most expensive memory truly needed by the machine at any moment to an extremely low level.
The third cut pushes the logic of 'retrieval by number' to the extreme: even the act of 'computation' is minimized wherever possible. Some calculation results can actually be precomputed, stored as numbered cells, and retrieved directly when needed, rather than recalculated each time. It's like someone who has memorized the multiplication table doesn't need to count on their fingers every time they calculate 7 × 8—they simply blurt out 56. This amounts to replacing costly 'hard computation' (chip operations) with extremely low-cost 'retrieval' (memory reads).
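The multiplication-table analogy translates directly into code. This toy sketch contrasts recomputing an answer step by step with reading it from a table built once in advance; the function names are made up for illustration.

```python
# Built once, ahead of time: every product of two single digits.
TABLE = {(a, b): a * b for a in range(1, 10) for b in range(1, 10)}


def multiply_by_counting(a: int, b: int) -> int:
    """The 'count on your fingers' route: repeated addition, b steps of work."""
    total = 0
    for _ in range(b):
        total += a
    return total


def multiply_by_lookup(a: int, b: int) -> int:
    """The 'memorized table' route: a single read, no arithmetic."""
    return TABLE[(a, b)]


print(multiply_by_counting(7, 8))  # 56
print(multiply_by_lookup(7, 8))    # 56
```

The table costs storage and only covers inputs seen in advance, so the approach pays off when the same questions come up again and again.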
In V4, this approach finds a more direct commercial expression: cache hit pricing is driven extremely low, and long-context reuse is directly written into the pricing structure. Repeated calculations are not only technically avoidable but also commercially encouraged to be avoided.
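What counts as a 'cache hit' can be shown with a toy prefix cache: if a new request starts with tokens that were already processed, only the new tail needs fresh computation. The class below is a hypothetical sketch that stores every prefix explicitly, which is wasteful; production systems typically use block hashing or a radix tree instead.

```python
class PrefixCache:
    """Remembers which token prefixes have already been processed."""

    def __init__(self):
        self.seen = set()

    def process(self, tokens):
        """Return (hit_tokens, miss_tokens) for one request, then record its prefixes."""
        hits = 0
        # Find the longest prefix of this request that was processed before.
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self.seen:
                hits = end
                break
        for end in range(1, len(tokens) + 1):
            self.seen.add(tuple(tokens[:end]))
        return hits, len(tokens) - hits


cache = PrefixCache()
codebase = list(range(200))  # stands in for a long shared context

first = cache.process(codebase + [9001, 9002])
second = cache.process(codebase + [9003])
print(first)   # (0, 202)
print(second)  # (200, 1)
```

The second request reuses all 200 shared tokens and pays full price for only one, which is the behavior that hit-versus-miss pricing rewards.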
Viewed together, these three cuts are not isolated events but a layered progression of the same logic: transforming a chaotic mess into a system where everything can be precisely retrieved by number. Memory is minimized, only the necessary components are awakened, and computations rely on lookups rather than recalculations. Each cut reduces the machine's reliance on the most expensive hardware. Combined, the three cuts allow it to perform the same tasks while consuming only a fraction of the cutting-edge hardware it once required.

In May 2026, DeepSeek announced that it would make the previous 75% discount price for V4-Pro a permanent fixture, creating a massive pricing gap between cache hits, cache misses, and output tokens. The importance of cache hit pricing lies in its direct translation of DeepSeek's third cut into a commercial rule: previously computed contexts should not be charged repeatedly as if they were new tasks.
The contrast becomes tangible when translated into real-world bills. Consider a medium-scale application processing 1 billion tokens monthly. Under the same workload: using DeepSeek V4-Pro, the monthly bill would be approximately $522; switching to Claude Opus 4.7 would cost around $9,000; and opting for GPT-5.5 would cost about $10,000. The gap ranges from seventeen to nineteen times.
Now consider an extreme but common scenario: a long-context programming assistant repeatedly rereads a 100,000-token codebase one hundred times. Leveraging nearly free cache hits, DeepSeek would charge approximately $0.036 for this task; GPT-5.5 and Claude Opus 4.7 would each charge around $5—a difference exceeding one hundredfold.
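The totals above imply blended per-million-token rates, which make the gaps easier to compare. The sketch only divides the article's own figures; the resulting rates are implied averages, not published price lists, and the helper name is hypothetical.

```python
def usd_per_million(total_usd: float, tokens: int) -> float:
    """Blended price implied by a total bill and a token count."""
    return total_usd / (tokens / 1_000_000)


monthly_tokens = 1_000_000_000
monthly_bills = {"DeepSeek V4-Pro": 522, "Claude Opus 4.7": 9_000, "GPT-5.5": 10_000}

for name, bill in monthly_bills.items():
    rate = usd_per_million(bill, monthly_tokens)
    ratio = bill / monthly_bills["DeepSeek V4-Pro"]
    print(f"{name}: ${rate:.3f} per million tokens, {ratio:.1f}x the DeepSeek bill")

# Rereading a 100,000-token codebase one hundred times is 10 million input tokens.
reread_tokens = 100_000 * 100
cached_rate = usd_per_million(0.036, reread_tokens)
other_rate = usd_per_million(5, reread_tokens)
print(f"${cached_rate:.4f} vs ${other_rate:.2f} per million tokens, gap {5 / 0.036:.0f}x")
```

The monthly bills come out at about 17.2 and 19.2 times the DeepSeek figure, and the reread scenario at about 139 times, consistent with the ranges stated in the text.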
These prices are shockingly low, but they are not loss leaders. Instead, they reflect the inherent efficiency of this modified machine—cost savings achieved through meticulous engineering by the Chinese team. Two years ago, when Liang Wenfeng discussed pricing, he stated the principle as 'no money lost, but no exorbitant profits earned.' A more accurate interpretation: when your cost structure operates on a fundamentally different plane from others, your pricing naturally exists in a different stratosphere.
Of course, this modification is not a guaranteed win. Shifting loads to cheaper memory and storage has drawbacks: studies suggest that frequent data movement may increase power consumption, latency, and scheduling complexity. In some cases, the total system cost per generated token may not decrease unless hardware, software stacks, and storage media are further optimized. Thus, these three cuts represent a delicate balancing act rather than mindless cost-cutting. But the direction is clear: replace the most expensive, constrained resource with cheaper, more accessible alternatives.
06 Turning 'One Trillion' into a Tangible Account
After discussing so much 'cost-saving,' let's reframe it in more visual terms: how many fewer AI computing centers need to be built?
First, consider token traffic. By national estimates, China's daily token invocation volume exceeded 140 trillion by March 2026, growing over a thousandfold since early 2024. Industry estimates suggest that Doubao's large model alone surpassed 120 trillion daily invocations in the same month. While statistical boundaries differ, both figures indicate that China's AI token consumption has entered the hundred-trillion range for daily operations and is rapidly advancing toward the thousand-trillion level. Thus, 500 trillion tokens per day can be seen as the near-term next milestone, while 5,000 trillion tokens per day represents a high-traffic scenario after agents, multimodality, and code generation become ubiquitous.
Against this backdrop, the value of DeepSeek becomes apparent when examining computing center costs. In 2025, China Unicom began constructing a thousand-card AI inference center in Wuhan with an initial investment of nearly 200 million RMB. This can be roughly viewed as a sample investment for a thousand-card inference center: approximately 200 million RMB per facility.
Based on DeepSeek V4's efficiency gains, particularly in long-context scenarios, the improvements are not mere double-digit optimizations but hardware efficiency multipliers. Taking a conservative, easily understandable assumption: V4's three-pronged approach quadruples the effective token throughput of the same hardware. In other words, tasks previously requiring four centers can now be handled by one, eliminating the need for three centers and saving 75% in equivalent hardware investment.
Note that DeepSeek does not simply reduce storage usage. Quite the opposite—it optimizes storage utilization through compressed attention, on-demand activation, cache hits, and inference scheduling to maximize the utility of expensive GPUs and memory time. What is truly saved is the additional hardware that would otherwise be required to achieve the same token throughput.
So, what does $1 trillion correspond to? One trillion U.S. dollars is approximately 7 trillion RMB. At roughly 200 million RMB per thousand-card inference center, 7 trillion RMB equates to 35,000 such centers. If V4's approach quadruples effective throughput, avoiding the construction of 35,000 equivalent centers would correspond to a daily token traffic volume of approximately 5,000 trillion.
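The center count is sensitive to the currency of the per-center cost, so it is worth running both readings. The sketch assumes the article's round conversion of 7 RMB per dollar. The 35,000 figure follows only when the 200 million is read as RMB, which is the assumption used for the rest of the calculation.

```python
def centers_for_budget(budget: int, cost_per_center: int) -> int:
    """How many whole centers a budget buys."""
    return budget // cost_per_center


USD_BUDGET = 1_000_000_000_000
RMB_PER_USD = 7            # the article's round conversion
CENTER_COST = 200_000_000  # per thousand-card center

# Reading 1 (assumed): the center costs 200 million RMB.
print(centers_for_budget(USD_BUDGET * RMB_PER_USD, CENTER_COST))  # 35000
# Reading 2: the center costs 200 million USD.
print(centers_for_budget(USD_BUDGET, CENTER_COST))                # 5000

# If 35,000 avoided centers are the 75% saved by a 4x gain,
# the fleet without the gain and the fleet still built follow directly.
avoided, multiplier = 35_000, 4
without_gain = avoided / (1 - 1 / multiplier)
still_built = without_gain / multiplier
print(round(without_gain), round(still_built))  # 46667 11667
```

In other words, saving 35,000 centers presupposes a world that would otherwise need about 46,667 of them, which is why the figure belongs to the high-traffic scenario and not to today's demand.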
This is the industrial landscape implied by the 'one trillion dollars' discussed herein. It is not a precise calculation from engineering bids but an infrastructure-scale estimate aligned with future traffic scenarios rather than current realities. What it truly demonstrates is that in an era of modest usage, efficiency gains save a few cards or cabinets; in an era of thousands of trillions of tokens daily, efficiency gains save thousands of AI computing centers that would otherwise need to be built.
Thus, DeepSeek does not merely alter the price of individual invocations but reshapes the future accounting of AI infrastructure.

Now, return to the machine mentioned at the beginning. Recall that of Vera Rubin's $7.8 million budget, $2 million was allocated to memory, which continues to surge in price. This reveals a dangerous trend—the industry's value is increasingly and unhealthily tied to memory chips, which should not be so expensive.
Many mistakenly believe DeepSeek is 'complying' with this trend because it also heavily uses memory. On the contrary, DeepSeek is reversing it. The old approach passively and inefficiently consumed hardware, piling value onto chips in an inverted manner while letting memory prices be driven by market frenzies. DeepSeek first drastically reduces true hardware demand through its three cuts, then strategically allocates the remaining minimal demand to the cheapest, most suitable storage tier. The former is 'being pushed by prices,' while the latter is 'calculating costs before spending.'
This distinction matters greatly for China. It shifts the battleground from an area where we are at a disadvantage to one where we have a better chance of winning. We may temporarily lag in cutting-edge computing chips, but memory and storage chips are areas where China has made tangible progress this year.
ChangXin Memory Technologies (CXMT), China's leading DRAM manufacturer, reported revenue of 50.8 billion RMB in Q1 2026 and net profit of approximately 25 billion RMB. The company expects H1 net profit to reach 66–75 billion RMB, equivalent to earning ByteDance's entire annual profit from last year in just six months. Although CXMT remains fourth in the global DRAM market, this domestic production capacity, once nearly nonexistent, has finally gained traction.
This is precisely the strategic significance of DeepSeek's three cuts. It is not about 'replacing compute with storage' but reducing marginal reliance on the scarcest compute resources while shifting some pressure to more accessible storage, caching, and systems engineering. When an AI machine relies more heavily on memory, caching, scheduling, and systems engineering—areas where China has greater control—the existing supply chain suddenly transforms from 'constrained everywhere' to 'sufficient' or even 'excellent.' This significantly enhances security across the entire chain.
Epilogue
Liang Wenfeng, whose instinct is to 'eliminate inefficiency,' will not settle for merely making a model cheaper. His target is the industry's greatest inefficiency: the assumption that 'achieving stronger intelligence requires relying on the most cutting-edge, scarcest, and expensive hardware.'
If DeepSeek enables the industry to accomplish the same tasks with far fewer cutting-edge hardware components, it effectively creates a trillion-dollar virtual production base, one that occupies no physical factory space but frees up billions that would otherwise be invested in hardware. That 'one trillion' thus ceases to be a valuation narrative and becomes an infrastructure account.
Framing DeepSeek as 'using algorithms to eliminate NVIDIA' is another cheap myth. But rephrasing the question yields an interesting answer: Could DeepSeek enable the industry to buy less of the most expensive hardware, occupy less of the scarcest memory, and pay less for inference costs once deemed inevitable? Yes. Could it redistribute the value of AI infrastructure away from a singular narrative of high-end GPUs toward model architecture, inference systems, cache management, storage scheduling, and engineering optimization? Also yes. That is its true industrial significance.
True technological revolutions do not necessarily make everything more expensive; they transform what only a few could afford into everyday infrastructure accessible to the masses. From a broader perspective, what truly matters in this endeavor is not how much money is saved but how those savings quietly redistribute tickets to the future across China's myriad industries in need of AI empowerment.
(This article is compiled from public information and industry discussions. Some forward-looking judgments, such as the trillion-dollar infrastructure substitution value, hardware energy efficiency trade-offs, and equivalent cost conversions, represent industry speculations and debated viewpoints rather than established facts. Readers are advised to interpret them cautiously.)