March 18, 2026
Is There a CUDA Moat in Inference?
"By 2027, market orders for Blackwell and Vera Rubin systems will generate at least $1 trillion in revenue."
Another year, another GTC. At this year's 'tech Super Bowl,' Jensen Huang, clad in his signature leather jacket, unveiled a new round of blockbuster hardware and delivered an explosive revenue forecast. The staggering figure reflects Huang's consistent optimism about the sustained growth of AI infrastructure, signaling to the market that Nvidia's growth story is far from over.
Yet the capital markets responded tepidly. Nvidia's stock initially jumped 4.3% before retreating, ultimately closing up just 1.2%. Even a record revenue forecast failed to ignite the market.
The crux lies in the shifting rules of the booming inference computing market. Low latency, high energy efficiency, and low application cost are displacing the traditional metrics of raw performance, high throughput, large memory, and high bandwidth as the factors that dominate the computing power market.
Amid this structural upheaval, Nvidia, the undisputed king of AI computing for the past three years, is facing powerful centrifugal forces. Beyond traditional chipmakers, major customers like Amazon, Meta, and even OpenAI are accelerating in-house chip development. Meanwhile, China represents a massive source of inference demand, and its domestically produced computing solutions now offer highly competitive inference costs.
To counter this inference anxiety, Nvidia unveiled a series of new products at this year's GTC aimed squarely at inference demand, and it is reshaping its competitive moat through an 'AI factory' narrative. The market, however, is still waiting to see whether these moves will work.
What's clear is that this defensive battle over the moat has only just begun.
01
Centrifugal Anxiety in the Inference Era
Nvidia is facing a massive 'centrifugal movement.' Multiple players competing for the inference market are creating strong outward pull, challenging the giant's dominance in the training market.
The root cause lies in the AI industry's seismic shift: the inference market is surpassing training to become the primary AI computing battleground.
As Huang himself declared at this year's GTC keynote, 'The inference inflection point has arrived.' This represents an enormous emerging market. IDC predicts that by 2027, China's inference computing will account for over 70% of total computing power. Globally, intelligent agent usage will grow 10x, with inference demand surging 1000x. Deloitte also notes in a report that inference workloads will constitute two-thirds of all AI computing by 2026, having rapidly increased from one-third in 2023 to half by 2025.
However, inference workloads in this high-potential market place fundamentally different demands on computing hardware than the training phase does.
In a paper published earlier this year, RISC architecture pioneer David Patterson and Google DeepMind senior engineer Ma Xiaoyu noted that training requires massive parallel computing to process vast datasets. For example, a single GPT-4-level training run needs 25,000 A100 GPUs operating continuously for 90 days—an 'arms race' of peak computing power and capital.
The inference phase operates entirely differently. It is essentially a sequential autoregressive process that generates one token at a time, and each step must load model parameters from GPU memory into the compute units. Available memory bandwidth, rather than raw compute, becomes the decisive factor in token generation speed, making bandwidth and end-to-end latency the core bottlenecks.
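A back-of-envelope calculation makes the bottleneck concrete. The sketch below uses purely illustrative numbers (a hypothetical 70-billion-parameter model and 3.35 TB/s of memory bandwidth, neither taken from the article) to show how bandwidth alone caps single-stream decode speed:

```python
# Illustrative sketch with assumed numbers: each generated token must
# stream the full set of model weights from memory to the compute
# units, so memory bandwidth sets an upper bound on decode speed.

def max_decode_tokens_per_sec(params: float,
                              bytes_per_param: float,
                              mem_bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode speed: one full weight read
    per token; ignores KV-cache traffic, batching, and compute time."""
    bytes_per_token = params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token

# Hypothetical 70B-parameter model at 1 byte per parameter (FP8) on a
# GPU with 3.35 TB/s of HBM bandwidth:
print(max_decode_tokens_per_sec(70e9, 1.0, 3.35e12))  # ~48 tokens/s
```

No amount of extra FLOPS raises that ceiling; only more bandwidth or smaller, lower-precision weights do, which is why inference-focused chips attack memory rather than compute.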
Cost structures also differ dramatically. Training is a one-time burst of spending, while inference bleeds cash continuously. Under billions of daily requests, AI application providers prioritize cost control: 'token output per watt per dollar' directly determines whether an AI application is viable.
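To see how that metric cashes out, here is a minimal cost model; the node specs and prices are hypothetical assumptions, not figures from the article:

```python
# Illustrative sketch with assumed numbers: serving cost per million
# tokens as a function of throughput, power draw, electricity price,
# and amortized hardware cost.

def cost_per_million_tokens(tokens_per_sec: float,
                            power_kw: float,
                            usd_per_kwh: float,
                            capex_usd_per_hour: float) -> float:
    """Energy cost plus amortized hardware cost per million tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    usd_per_hour = power_kw * usd_per_kwh + capex_usd_per_hour
    return usd_per_hour / tokens_per_hour * 1e6

# Hypothetical inference node: 5,000 tokens/s at 10 kW, $0.08/kWh
# electricity, $8/hour of amortized hardware cost:
print(cost_per_million_tokens(5_000, 10, 0.08, 8.0))  # ~$0.49 per 1M tokens
```

At this scale, a chip that doubles tokens per watt roughly halves the energy term, which is why efficiency rather than peak performance decides who wins the serving business.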
To address the memory bandwidth, end-to-end latency, and cost-efficiency problems, an industry consensus has formed that chips customized for specific tasks outperform general-purpose GPUs.
Multiple forces are now entering the inference computing market:
Traditional chipmakers like AMD and Intel haven't missed this structural growth opportunity. AMD's MI350 series (including the MI355X) offers memory and inference performance advantages that translate into total-cost-of-ownership benefits. Supply chain statistics show that by 2025 Meta had purchased 173,000 MI300-series chips (with plans to shift to the MI350 at scale) and Microsoft 96,000, while Oracle has committed to deploying up to 131,000 MI355X units. Meanwhile, Intel's Gaudi 3 accelerators are rapidly gaining traction in the enterprise and cloud inference markets.
Leading cloud providers, previously the largest contributors to Nvidia's data center revenue, are aggressively pursuing in-house chips out of cost-control and supply-chain-autonomy considerations. For these giants, custom silicon can save billions of dollars annually while providing crucial supply flexibility under billions of daily inference requests.
Cloud giants from Google to Amazon have partnered with Broadcom to design and mass-produce inference chips. Google's TPU, after multiple iterations, has secured orders from Anthropic (deploying over 1 million units) and Meta (which signed a multi-year, billion-dollar leasing agreement in February 2026). Amazon's Trainium received a 2 GW capacity order from OpenAI, and Anthropic has also embraced Amazon's solutions. Meta's in-house MTIA series (including the MTIA 300 and later versions) has deployed hundreds of thousands of chips to support recommendation system inference across its platforms.
Specialized inference chip companies are accelerating their market entry as well. Groq, for example, attracted significant developer and enterprise interest in 2025 with its LPU's superior first-token latency and lower pricing compared to GPUs, before being acquired and integrated by Nvidia at the end of 2025.
Beyond these competitors, China represents a major inference market where a domestic inference computing ecosystem is rising. Industry observers note that the market has evolved from Huawei's dominance into a diverse landscape: inference-specific chips from companies like Biren now offer significant cost advantages, while Moore Threads and other vendors are increasingly favored by AI agent companies.
Under this multi-front competition, market research firms believe the AI server market will shift from Nvidia dominance to 'diversified competition,' with XPUs (specialized accelerators that are neither GPUs nor CPUs) growing faster than GPUs. Tech analysis firm byteiota, synthesizing analyst views, even suggests Nvidia's inference market share could plummet from today's 80%, with ASICs capturing 70-75% of production inference workloads by 2028.
'There is no CUDA moat in inference,' The Wall Street Journal recently reported, citing Andrew Feldman, CEO of emerging chipmaker Cerebras Systems. This may represent Nvidia's greatest anxiety source at present.
02
Nvidia's Moat Defense in the Trillion-Dollar Market
Nevertheless, Nvidia is taking strategic actions to address inference era challenges. At GTC, both Huang's keynote and the array of new product announcements demonstrated Nvidia's ambition for the inference era.
By one count, during his more than two-hour keynote, 'training' was mentioned only about a dozen times, while 'inference' appeared nearly 40 times.
Huang also used the $1 trillion revenue forecast to assert Nvidia's continued relevance in the inference era:
'Last year at this time, I mentioned that by 2026, demand for Blackwell and Rubin could reach $500 billion. Today, I'm telling you: standing here in 2027, we see high-certainty demand of at least $1 trillion. And I believe actual demand will be even higher.'
Huang noted that, starting in 2025, Nvidia committed fully to inference capabilities, ensuring the company excels not just at training but across the entire AI lifecycle, including post-training and inference.
At the conference, Nvidia presented its comprehensive strategic layout for addressing inference era challenges. Huang dissected the inference process into two distinct phases—'prefill' and 'decode'—and equipped each with specially optimized hardware architectures.
Some observers commented that this redefinition of inference computing's fundamentals is aimed at reclaiming Nvidia's control of the narrative in the inference era.
The new flagship GPU, the Vera Rubin GPU, handles the 'prefill' phase, ingesting user requests and converting them into tokens, with inference performance 3.3-5x that of the previous generation.
The integration of the Groq 3 LPU marks Nvidia's crucial step in shoring up its low-latency inference weakness. In December 2025, Nvidia invested $20 billion to acquire Groq's low-latency inference technology and core team through an unconventional deal, its largest ever. Groq, founded by Jonathan Ross, a key architect of Google's TPU, specializes in extreme low latency and performance determinism.
The Groq 3 LPU, the first product of this integration, is manufactured by Samsung and expected to ship in Q3 2026. Designed specifically for the decode phase, it bypasses the traditional GPU HBM memory bottleneck, achieving first-token latency below 0.1 ms alongside a 35x inference performance improvement. Huang stated that the 'GPU for prefill, LPU for decode' division of labor represents the optimal architecture for the inference era.
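As a conceptual illustration, the sketch below shows what this division of labor looks like as a scheduling policy. The scheduler and its API are hypothetical stand-ins, not Nvidia's actual software (in Nvidia's stack, the Dynamo software described below handles this orchestration):

```python
# Conceptual sketch of prefill/decode disaggregation (hypothetical
# scheduler, not Nvidia's Dynamo API). Prefill is compute-heavy and
# parallelizes over the whole prompt; decode is sequential and
# latency-sensitive, so each phase is routed to its own pool.

from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int

def plan(req: Request) -> list[str]:
    """Return the execution plan for one request under disaggregation."""
    return [
        # Phase 1: process the entire prompt in parallel on the
        # compute-optimized pool (GPUs, in the article's framing).
        f"prefill_pool(GPU): process {req.prompt_tokens} prompt tokens",
        # Hand the resulting KV cache off between pools.
        "transfer: KV cache from prefill_pool to decode_pool",
        # Phase 2: generate tokens one at a time on the
        # latency-optimized pool (LPUs, in the article's framing).
        f"decode_pool(LPU): generate up to {req.max_new_tokens} tokens",
    ]

for step in plan(Request(prompt_tokens=2048, max_new_tokens=256)):
    print(step)
```

The design choice mirrors the bottleneck analysis above: prefill wants FLOPS, decode wants bandwidth and determinism, and no single chip is optimal for both.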
With the agent era arriving, Nvidia also designed a new CPU specifically for agent workflows: the Vera CPU. Built on the LPDDR5 low-power memory commonly found in mobile devices, it repositions the CPU from general-purpose computing to agent task orchestration, achieving efficient, precise data scheduling at lower power rather than blindly stacking memory bandwidth. Huang claimed its performance doubles that of mainstream CPUs and called it a 'billion-dollar business': 'We never thought we'd sell CPUs alone, but now we're selling plenty.'
Thus, Nvidia has shifted from a universal GPU narrative to scenario-based division of labor. The current system forms a triangular division: GPUs handle heavy computing, CPUs manage orchestration, and LPUs deliver ultra-fast output. Paired with Nvidia's in-house Dynamo scheduling software, this flexibly meets complex requirements for cost, latency, and throughput across different AI tasks. In high-value token generation scenarios, token throughput per megawatt improves 35x over the previous Blackwell generation.
Huang further provided deployment recommendations: high-throughput workloads can use 100% Vera Rubin; for coding and high-value engineering token generation, a 25% Groq + 75% Vera Rubin combination is optimal.
Beyond hardware and software releases, Nvidia constructed a new narrative—the 'AI Factory':
'We're not just optimizing chips individually but pursuing extreme co-design: chips, systems, networks, software, algorithms, and deployment methods—full-stack synergy. In the future, all cloud providers, AI companies, and large enterprises will study their token factory efficiency like today's manufacturing production lines. Because data centers are no longer just 'places to store files' but factories producing tokens. Tokens are becoming new commodities, and AI computing is becoming a new revenue source.'
Under this narrative, competition shifts from single-dimension chip performance to a full-stack ecosystem spanning chips, liquid-cooled racks, network interconnects, and AI factory operating systems. Nvidia now occupies multiple levels from energy and chips to infrastructure and models, enabling customers to obtain optimal costs across the full training+inference lifecycle through 'one-stop' solutions. Huang also elaborated on 'Token Factory Economics,' emphasizing 'token output per watt per dollar' as the new benchmark.
Observers believe Nvidia is using this holistic delivery model to bring systemic advantages to bear against rivals' single-dimension cost advantages, thereby countering competition in the inference market.
At GTC 2026, Nvidia remains the leader of the AI computing market, but it has entered the opening phase of a defensive war. This battle over inference is a struggle for survival and dominance in the new era, and everything has only just begun.