Computing Power Showdown: Cloud Titans Challenge NVIDIA's Dominance with Cost-Effective NPUs

April 24, 2026

This is not merely a substitution but a paradigm shift. NPUs are propelling AI computing into a "LEGO Era" of modular innovation.

Source | Silicon Quadrant

For the past decade, NVIDIA has single-handedly shaped the narrative of AI computing power.

From the A100 to the H100, and now the H200, NVIDIA's GPUs have acted as an ever-expanding production line of computing power, driving deep learning from experimental labs into the era of large-scale AI models.

However, a critical but long-ignored fact remains: GPUs were never designed for AI—they were originally built for graphics rendering.

This means GPUs are inherently general-purpose parallel computing architectures rather than architectures purpose-built for AI.

As a result, a more fundamental shift is underway:

As AI computing demands grow exponentially, the marginal efficiency of GPUs is starting to decline.

The industry is now pivoting toward a new direction—redesigning computing paradigms. This has given rise to a new class of computing chips based on dedicated computing logic (ASIC): the NPU.

On April 22 (U.S. time), at the Google Cloud Next event, Google unveiled two eighth-generation NPU chips: the TPU8t for AI training and the TPU8i for AI inference. The TPU8t delivers a 124% improvement in performance per watt over its predecessor, while the TPU8i boosts performance per dollar by 80% and overall performance by 117%. Industry analysts suggest that "if sold externally, it could disrupt NVIDIA's market dominance."

Google is not alone in this chip development race among cloud providers.

Amazon, the global cloud services leader, launched its first inference-focused NPU, Inferentia1, in 2018, followed by the second-generation Inferentia2 in 2023, and later the training-oriented NPU Trainium3 at the end of last year. Microsoft Azure, ranked second globally, introduced its first cloud-based NPU (Maia 100) in 2023 and followed up with the Maia 200 earlier this year.

A similar trend is emerging in China. Alibaba unveiled its first NPU (Hanguang 800) in 2019, focusing on cloud-based inference and visual computing. Baidu began releasing its self-developed ASIC-based AI chips, starting with Kunlun 1 in 2018, which has since evolved to the third-generation Kunlun chip.

In 2026, ByteDance, a major consumer of computing power chips, is poised to enter the NPU market. Foreign media reports indicate that ByteDance has initiated discussions with Samsung to develop its own NPU chip, codenamed SeedChip, specifically designed for AI inference tasks, with the first prototypes expected by the end of March 2026.

The trend for 2026 is clear: chips will no longer be monolithic. Google, ByteDance, Alibaba, and others aim to integrate their own specialized modules into NVIDIA's ecosystem.

Cloud providers developing their own NPUs could redefine AI cost structures, energy consumption profiles, and even business models.

01 What Is an NPU?

Google's TPU and Alibaba's Hanguang 800 are both examples of NPUs.

An NPU (Neural Processing Unit) is, as the name suggests, a chip designed specifically for neural network processing.

To understand the difference between NPUs and GPUs, we must examine their underlying architectures. NPUs fall under the category of Application-Specific Integrated Circuits (ASICs), while GPUs are general-purpose processing chips.

Chips can be broadly categorized into three types based on design philosophy: general-purpose computing chips, FPGAs (reconfigurable hardware), and ASICs (dedicated processing chips).

First, general-purpose computing chips, such as CPUs and GPUs, use a single instruction to drive hundreds or thousands of threads in parallel, making them highly effective for large-scale parallel computing. They typically do not modify hardware but instead optimize "task scheduling" through software (e.g., CUDA). This is why NVIDIA is often described as a software company—the core strength of GPUs lies in their high programmability, adaptability to diverse computing tasks, and complex architecture (requiring substantial caching). However, this versatility comes at the cost of lower efficiency.

Second, ASICs (Application-Specific Integrated Circuits) are custom-designed for specific tasks (e.g., image recognition or speech processing). With fixed data flow and extremely high energy efficiency, ASICs represent a design approach that "hardwires algorithms into silicon." The downside is that once the circuitry is etched onto the silicon, its function cannot be altered, making it less flexible. ASICs essentially transform AI computing from a "software problem" into a "hardware problem," but this also means they offer the least flexibility and have long update cycles.

Third, FPGAs (Field-Programmable Gate Arrays) can dynamically reconfigure their hardware layout through "rewiring" and modify software code to alter chip functionality, akin to a set of "LEGO bricks." FPGAs occupy a middle ground between general-purpose and dedicated chips and are often used in prototype development or edge computing where algorithms iterate rapidly.

GPUs are powerful and can handle many tasks in parallel. When fully utilized, they are incredibly capable, but they come with higher costs and greater energy consumption. In contrast, NPUs are designed for a single task or a narrow set of tasks, offering limited functionality but at a lower cost and with greater energy efficiency.
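The architectural split described above can be sketched in code. The toy below contrasts a general-purpose triple loop (which could run any kernel, the GPU philosophy) with a weight-stationary systolic-array schedule, the fixed dataflow that ASIC/NPU designs bake into silicon. This is a didactic simulation only, not any vendor's real hardware or API; all names here are illustrative.

```python
def matmul_general(a, b):
    """General-purpose style: flexible loops, nothing fixed in hardware.
    The same engine could just as easily run a convolution or a sort."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for t in range(k):
                out[i][j] += a[i][t] * b[t][j]
    return out

def matmul_systolic(a, b):
    """Weight-stationary systolic sketch: conceptually, each processing
    element holds one weight b[t][j] and fires exactly one multiply-
    accumulate as an activation streams past. The schedule is fixed by
    the circuit; only data moves -- which is where the energy savings
    of an ASIC dataflow come from."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]
    macs = 0
    for t in range(k):              # one wavefront per inner-dim step
        for i in range(n):          # activations a[i][t] stream through
            for j in range(m):
                acc[i][j] += a[i][t] * b[t][j]   # PE(t, j) fires once
                macs += 1
    return acc, macs

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result, macs = matmul_systolic(a, b)
assert result == matmul_general(a, b)   # same math, different dataflow
```

Both paths compute the identical product; the difference is that the systolic schedule admits no other computation, which is exactly the flexibility-for-efficiency trade the article describes.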

02 Not Selling Chips, But Offering More Cost-Effective Cloud Services

Cloud service providers are not in the business of selling chips; their goal is to provide more cost-effective computing power.

In 2015, Google began researching NPUs after identifying a critical issue: the demand for neural network inference in its data centers was surging, but GPU efficiency was insufficient.

This led to the launch of Google's internal TPU project. The first-generation TPU, designed solely for inference, debuted in 2015. In 2018, Google Cloud TPU was made available to the public, and from 2020 to 2024 the focus shifted toward "integrated training and inference."

In 2026, with the release of the TPU 8, Google for the first time clearly divided its approach into two paths: the TPU8t for training and the TPU8i for inference. This reflects a broader industry trend: the center of gravity in AI computing power is shifting from training to inference.

Industry organizations predict that by 2030, 75%-80% of AI computing power will be dedicated to inference. This means that while a GPT model may be trained once, it will run billions of inference operations. Therefore, whoever can reduce the cost of inference from one cent to 0.1 cents will dominate the future of computing power.
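A back-of-envelope cost model makes the "train once, infer billions of times" point concrete. All figures below are hypothetical, chosen only to echo the article's one-cent-versus-0.1-cent comparison:

```python
def lifetime_cost(train_cost_usd, inference_calls, cost_per_call_usd):
    """Total spend over a model's life: one training run plus serving."""
    return train_cost_usd + inference_calls * cost_per_call_usd

TRAIN = 100e6     # hypothetical one-off training run: $100M
CALLS = 10e9      # 10 billion inference calls over the model's lifetime

gpu_total = lifetime_cost(TRAIN, CALLS, 0.01)    # 1 cent per call
npu_total = lifetime_cost(TRAIN, CALLS, 0.001)   # 0.1 cents per call

print(f"GPU serving: ${gpu_total / 1e6:,.0f}M")  # inference = half the bill
print(f"NPU serving: ${npu_total / 1e6:,.0f}M")  # inference nearly vanishes
```

Under these made-up numbers, a 10x cut in per-call cost shrinks the lifetime bill from $200M to $110M, and the larger the call volume, the more the inference term dominates the fixed training cost.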

Google's two new chips were designed by separate partners: the TPU8t by Broadcom and the TPU8i by MediaTek. Both are expected to be based on TSMC's 2nm process, with mass production slated for the end of 2027. The most significant change in the TPU 8 is its attempt to address the "memory wall" issue through higher-bandwidth HBM and denser inter-chip interconnectivity.

It is reported that, compared with the previous-generation Ironwood product, the TPU8i inference chip increases HBM capacity from 216 GB to 288 GB, raises bandwidth from 6,528 GB/s to 8,601 GB/s, and triples on-chip SRAM to 384 MB. The cluster scale expands from tens of thousands of chips to 134,000, with a maximum connectivity of 1 million chips.

NPU development is not limited to Google: Amazon, Microsoft, and others also have NPU products, while Chinese companies such as Huawei (with its Ascend line), Cambricon, and Horizon Robotics have released similar offerings.

Previously, cloud providers relied on NVIDIA's "full-stack solutions." Now, they want to purchase NVIDIA's "foundation" and build their own "houses."

03 Seizing the Initiative in the Computing Power Era

The release of the TPU 8 series reflects a clear strategy: reducing reliance on NVIDIA.

If successful, the TPU could transform AI computing power from a "GPU monopoly" into a "multi-architecture competition."

However, replacing NVIDIA will not be easy.

The most critical challenge is the ecosystem. NVIDIA's CUDA remains the industry standard, with 4 million developers. On the other hand, TPUs are highly specialized—GPUs can be used for AI training and inference, graphics processing, and rendering, whereas TPUs have a narrower focus.

Industry insiders generally believe that the significance of NPUs lies not in "replacing GPUs" but in redefining the AI computing power structure. In the future, GPUs may serve as the general-purpose computing foundation, while TPUs/NPUs act as dedicated AI acceleration layers.

NVIDIA has also recognized this trend. By the end of 2025, NVIDIA invested $20 billion to acquire Groq, whose LPU (Language Processing Unit) achieves speeds more than 10 times faster than traditional GPUs when running large language models (LLMs).

This resembles the competition among smartphone manufacturers over the past decade. With the most critical SoCs (Systems on a Chip) already monopolized by Qualcomm and MediaTek, whose entrenched positions formed strong competitive moats, developing a proprietary SoC required significant investment and carried high risk.

As a result, most smartphone companies chose not to develop their own SoCs but instead optimized specific functions of existing SoCs to gain a competitive edge.

Instead, smartphone manufacturers like Samsung, vivo, and OPPO developed proprietary NPUs to enhance photography capabilities and achieve differentiation, such as vivo's V1 imaging chip and OPPO's MariSilicon chip.

The computing power competition among cloud providers is also heating up, with more NPUs for training and inference emerging and continuously improving in capability.

The true dividing line in the future computing power industry will be: whoever can minimize AI inference costs will seize the initiative in the next era of computing power.

Disclaimer: The copyright of this article belongs to the original author. It is reprinted solely to disseminate information. If any author information is marked incorrectly, please contact us promptly for correction or removal. Thank you.