Depth | From GPU to Full-Stack Systems, the Focus of AI Computing Value is Shifting to CPU

03/30 2026

Preface:

Over the past three years, the narrative around AI computing power has been dominated by a single logic: GPU equals computing power, and computing power equals GPU.

However, as AI transitions from model competition to system competition, a deeper structural shift in the value of computing power is underway.

The focus of computing value is shifting from the GPU chip itself to the CPU + system layer.

The CPU Returns to Center Stage, Evolving from Supporting Role to Scheduling Hub

The emergence of Agentic AI has completely revolutionized AI's working model.

A typical Agent task often involves dozens of network searches, API calls, code executions, document parsing, and result orchestration—workloads that far exceed the parallel advantages of GPUs.

In an Agent's workflow, GPUs still handle core token generation, while CPUs take on the critical role of "making tokens truly effective."

This means that the AI response speed and experience users perceive are no longer determined by the GPU's computing ceiling, but are instead constrained by the CPU's processing efficiency.

Even if a GPU can generate tokens in milliseconds, any delay in the CPU's task orchestration or tool execution will drastically prolong the system's end-to-end experience.

The industry has finally realized that in the Agentic AI era, simply stacking GPUs no longer solves the fundamental problem.

Research from Cornell University shows that across five representative Agent workloads, CPU-side tool processing, logical scheduling, and data preprocessing account for 43.8%–90.6% of total end-to-end latency, far exceeding the GPU's share in model inference.

In the most common Haystack RAG scenario, CPU processing accounts for over 90% of total latency, while GPU inference computing contributes less than 10%.
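The latency split described above can be made concrete with a toy model. The sketch below uses entirely hypothetical stage timings for a RAG-style Agent request (the stage names and millisecond values are illustrative, not measurements from the Cornell study) to show how CPU-side stages can dominate end-to-end latency even when GPU inference itself is fast:

```python
# Illustrative end-to-end latency model for an Agent request. All stage
# timings here are hypothetical, chosen only to show how CPU-side work
# can dominate end-to-end latency in a RAG-style pipeline.

def cpu_latency_share(stages: dict[str, tuple[str, float]]) -> float:
    """Return the fraction of end-to-end latency spent on CPU-side stages.

    stages maps stage name -> (device, latency_ms).
    """
    total = sum(ms for _, ms in stages.values())
    cpu = sum(ms for dev, ms in stages.values() if dev == "cpu")
    return cpu / total

# A hypothetical request: GPU token generation is fast, while retrieval,
# parsing, and orchestration all run on the CPU.
request = {
    "query_parsing":        ("cpu", 15.0),
    "vector_search":        ("cpu", 120.0),
    "document_parsing":     ("cpu", 200.0),
    "prompt_assembly":      ("cpu", 25.0),
    "gpu_inference":        ("gpu", 40.0),
    "result_orchestration": ("cpu", 30.0),
}

share = cpu_latency_share(request)
print(f"CPU share of end-to-end latency: {share:.1%}")  # 390/430 = 90.7%
```

Even with generous GPU speed, the CPU stages in this toy request account for over 90% of the wall-clock time, matching the shape of the RAG figures cited above.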

When millions of Agents run concurrently, the demand for CPU cores grows exponentially.

Cloud providers' real-world testing shows that to fully utilize a cluster of 10,000 A100 GPUs, the number of accompanying CPU cores must increase from the traditional 500,000 to 1.2 million.
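A quick sanity check on those cluster figures shows what they imply per GPU:

```python
# Sanity check on the cluster figures cited above: how many CPU cores
# per GPU do they imply before and after the Agentic shift?

gpus = 10_000
traditional_cores = 500_000
agentic_cores = 1_200_000

per_gpu_before = traditional_cores / gpus   # 50 cores per GPU
per_gpu_after = agentic_cores / gpus        # 120 cores per GPU
growth = agentic_cores / traditional_cores  # 2.4x total

print(f"{per_gpu_before:.0f} -> {per_gpu_after:.0f} CPU cores per GPU ({growth:.1f}x)")
```

In other words, the cited numbers amount to moving from roughly 50 to 120 CPU cores per accelerator, a 2.4x increase in CPU provisioning for the same GPU fleet.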

As computing scale expands, the challenge becomes how to schedule, allocate, and improve utilization—exactly where CPUs and the system layer excel.

The Primary Consumer of Computing Power Has Changed, Altering Value Standards

When computing resources were extremely scarce, the key question was "who has GPUs"; as supply has eased, the question is shifting to who can actually use them efficiently.

IDC's survey data also reveals that even among leading internet companies' AI inference clusters, GPU average utilization remains below 40% long-term, with many SMEs' GPU clusters operating at less than 15% utilization.

The root cause of this massive waste is that the system's data flow, task scheduling, and memory management capabilities cannot keep pace with GPU computing speeds.

It's like a top-tier supercar stuck in congested city traffic, unable to reach its maximum speed—with the CPU acting as both road designer and traffic controller.

MLPerf industry benchmark tests show that in large model training scenarios, delays in data loading, preprocessing, and parameter synchronization can consume 35%–60% of total training time, directly leading to GPU utilization below 40%.
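The MLPerf figures above imply a simple bound: if stalls consume a fraction of wall-clock time, GPU utilization cannot exceed the remainder. A minimal sketch of that relationship:

```python
# A minimal utilization model for the stall figures cited above: if data
# loading, preprocessing, and parameter synchronization stall the GPU
# for a fraction `stall_fraction` of wall-clock time, GPU utilization
# is at most 1 - stall_fraction. The 35%-60% stall range cited implies
# a 40%-65% utilization ceiling, consistent with the sub-40% figures.

def max_gpu_utilization(stall_fraction: float) -> float:
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be in [0, 1]")
    return 1.0 - stall_fraction

for stall in (0.35, 0.60):
    print(f"stall {stall:.0%} -> utilization <= {max_gpu_utilization(stall):.0%}")
```

This is an upper bound only; real clusters also lose time to kernel launch gaps and communication, which is why observed utilization sits below even this ceiling.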

The GPU's computing ceiling is often determined by CPU performance, with this scheduling and management value becoming even more pronounced in distributed AI clusters.

The Rise of CXL (Compute Express Link) Technology Further Solidifies CPU's Central Role

As a next-generation high-speed interconnect protocol, CXL uses memory pooling technology to integrate scattered memory resources across servers and acceleration cards into a unified shared memory pool, breaking through traditional architectural memory wall bottlenecks.

The CPU serves as the sole master control unit for the entire CXL memory pool, responsible for unified memory address mapping, cache coherence maintenance, and dynamic resource allocation.

Real-world testing shows that CXL 3.0-based memory pooling architectures reduce cross-node memory access latency from 220ns (traditional NUMA) to 90ns, improve memory bandwidth utilization from 65% to 92%, and lower cache miss rates from 18% to 6%.
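Those gains can be summarized as relative improvements over the NUMA baseline. All input figures below come from the paragraph above; only the percentage arithmetic is added:

```python
# Summarizing the CXL 3.0 memory-pooling figures cited above as relative
# improvements over the traditional NUMA baseline.

def improvement(before: float, after: float, lower_is_better: bool = True) -> float:
    """Return the fractional improvement from `before` to `after`."""
    if lower_is_better:
        return (before - after) / before
    return (after - before) / before

latency_gain = improvement(220, 90)                          # ~59% lower latency
bandwidth_gain = improvement(65, 92, lower_is_better=False)  # ~42% higher utilization
miss_gain = improvement(18, 6)                               # ~67% fewer cache misses

print(f"latency -{latency_gain:.0%}, bandwidth +{bandwidth_gain:.0%}, misses -{miss_gain:.0%}")
```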

General-Purpose Computing Foundation: The Key to AI Generalization

AI applications in real-world industries almost always involve "hybrid workload" scenarios. Financial institutions' servers must run core trading systems, databases, and risk-control middleware alongside vector searches for user profiling.

Manufacturing enterprises' production line servers must operate industrial control software and equipment management systems alongside computer vision models for product quality inspection.

Government system servers must support e-government platforms and data sharing systems alongside large model applications for intelligent Q&A and document review.

In these scenarios, users' core need is not to deploy a separate AI computing cluster but to seamlessly integrate AI capabilities into existing business systems, which is exactly where CPUs excel.

GPU architectures are inherently designed for parallel computing and struggle to efficiently handle serial general-purpose business workloads like databases and middleware simultaneously. Forced hybrid deployment only degrades performance for both.

In contrast, CPUs' general-purpose architecture naturally accommodates mixed operation of various business and AI workloads, enabling unified hardware, operations, and scheduling while significantly reducing deployment costs and operational complexity for enterprises.

Cloud providers' self-developed CPUs have already demonstrated immense value in such scenarios.

AWS's Graviton4 processor delivers 35%–50% performance improvement over its predecessor in mainstream online AI applications like search, advertising, and recommendations, with 30%–50% better cost-effectiveness compared to x86 instances of similar specifications.

Currently, over 100,000 enterprises worldwide have migrated their core online AI inference businesses to Graviton instances, including internet companies like Epic Games and enterprise service providers like SAP and IBM.

Alibaba Cloud's Yitian 710 processor, based on ARM v9 architecture and SVE2 instruction set, delivers up to 2x inference performance improvement after optimization for AI inference scenarios.

Domestic independently controllable CPUs have leveraged this trend to achieve rapid breakthroughs in AI scenarios.

Meanwhile, CPUs' native AI capabilities have undergone qualitative leaps, completely shattering the conventional wisdom that "CPUs are unsuitable for AI."

Traditionally, CPU-based AI computing relied on general-purpose core vector operations, with performance lagging far behind GPUs.

However, mainstream server CPUs now integrate dedicated AI acceleration units, achieving exponential AI performance improvements through specialized instruction sets and hardware acceleration engines.
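These acceleration units (Intel's AMX tile engine is one public example) work by multiplying small low-precision tiles and accumulating into wider integers. The pure-Python sketch below simulates that dataflow to illustrate the principle only; the tile size is hypothetical, and real hardware executes an entire tile multiply-accumulate in a single instruction:

```python
# Pure-Python simulation of the dataflow behind CPU matrix engines such
# as Intel AMX: multiply small int8 tiles and accumulate into int32.
# Illustrative only; real hardware does a whole tile per instruction.

TILE = 4  # hypothetical tile edge, chosen for readability

def tiled_int8_matmul(a, b):
    """C = A @ B with int8-range inputs and wide accumulation, tile by tile."""
    n, k = len(a), len(a[0])
    m = len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, m, TILE):
            for k0 in range(0, k, TILE):
                # One "tile multiply-accumulate" step.
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        acc = c[i][j]  # wide (int32-style) accumulator
                        for kk in range(k0, min(k0 + TILE, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

a = [[1, -2, 3, 4]] * 4  # int8-range values
b = [[2, 0], [1, 1], [0, -1], [3, 2]]
print(tiled_int8_matmul(a, b))  # [[12, 3], [12, 3], [12, 3], [12, 3]]
```

Keeping the inputs in 8-bit precision while accumulating in 32 bits is what lets these units multiply throughput without overflowing, and it is why quantized inference maps so well onto them.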

CPU as Core, GPU as Wing: The Industrial Transformation Ahead

① AI-native CPU architectures will become the core competitive point for next-generation server chips.

Past CPU designs focused primarily on improving general-purpose computing performance, with AI acceleration as an add-on feature.

Future CPU designs will incorporate native AI workload optimizations at the architectural level.

CPU competition will evolve from simple comparisons of core count and clock speed to contests over full-scenario AI capability.

② Unified end-cloud collaborative computing architectures will make CPU the core foundation for AI generalization.

Current AI computing often uses different architectures for end-side, edge-side, and cloud-side deployments, resulting in extremely high model development, adaptation, and deployment costs.

This explains why cloud providers like AWS, Alibaba Cloud, and Huawei are heavily investing in self-developed Arm architecture CPUs.

③ In the AI-native CPU race, global vendors start from the same baseline, with domestic vendors holding natural advantages in scenario understanding, customer demand adaptation, and localized ecosystems.

Vendors like Kunpeng and Hygon have already achieved technical breakthroughs in AI scenarios. As AI penetrates thousands of industries, domestic CPUs are poised to evolve from "alternatives" into core players in the AI computing market, building independent and controllable full-stack AI computing systems.

④ Competition in full-stack software ecosystems will become CPU vendors' core moat.

The full realization of CPU AI performance depends heavily on software ecosystem maturity, including deep adaptation to mainstream AI frameworks like TensorFlow, PyTorch, and PaddlePaddle.

This encompasses quantization and compression optimization for mainstream large models, operator customization for industry scenarios, and development toolchain refinement.
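One building block of the quantization work mentioned above can be sketched simply. The following is a minimal, illustrative example of symmetric per-tensor int8 quantization in pure Python; production toolchains add per-channel scales, calibration datasets, and fused dequantization, none of which are shown here:

```python
# Minimal sketch of symmetric per-tensor int8 quantization, the kind of
# model-compression step mentioned above. Illustrative only.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights to int8 values with a single symmetric scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.52, -1.27, 0.08, 0.91]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
max_err = max(abs(x - y) for x, y in zip(w, restored))
print(q, f"scale={scale:.6f}", f"max_err={max_err:.4f}")
```

Storing weights as int8 plus one float scale cuts memory traffic roughly 4x versus float32, which is precisely what lets the CPU matrix engines discussed earlier keep their tiles fed.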

Future CPU vendors will intensify investment in software ecosystems, building full-stack AI software systems spanning hardware to frameworks and models to scenarios—a key determinant of market positioning.

⑤ A new axis of competition has emerged in CPU industry instruction sets.

The x86 and Arm camps will engage in a new round of competition around Agentic workloads.

Most tools invoked by Agents have undergone decades of optimization on x86 architectures, and this ecosystem inertia remains x86's strongest moat.

The Arm camp's core competitive edge is power efficiency. Arm architecture CPUs like NVIDIA Vera/Grace, AWS Graviton, and Ampere deliver higher concurrent processing capability at equal power consumption, aligning well with Agentic workloads' light-thread characteristics.

Epilogue:

Market responses always provide the most authentic commentary on industrial transformation.

Today, CPUs have evolved from standardized commodity components into differentiated products that significantly impact AI system performance.

The core challenge in AI computing has shifted from raw performance to efficiency. While GPUs remain AI's engine, CPUs and system layers are becoming the steering wheel and transmission.

Partial references: Yin Technology, "The New Bottleneck Replacing HBM!"; Semiconductor Industry Observer, "The New Causality of Computing Power: Revaluing CPU Value in the AI Agent Era"; Semiconductor Frontline, "GPU Hegemony Loosens! Agentic AI Gains Momentum, CPUs Rise"; Financial Associated Press, "Why Has the CPU Taken the 'Computing Power Center Stage'?"

Disclaimer: the copyright of this article belongs to the original author. It is reprinted solely to share more information. If the author's information is marked incorrectly, please contact us promptly so we can correct or delete it. Thank you.