11/14 2025
Alibaba Cloud reports that its Aegaeon pooling system cut the number of Nvidia GPUs needed to serve large language models by 82% during several months of beta testing in its Model Studio marketplace.
The findings, published in a peer-reviewed paper at the 2025 ACM Symposium on Operating Systems Principles (SOSP) in Seoul, indicate that cloud providers could extract substantially more inference capacity from existing chips. This is particularly relevant in markets like China, where access to Nvidia's latest H20 chips is constrained.
In contrast to breakthroughs at training time that focus on enhancing model quality or speed, Aegaeon functions as an inference-time scheduler. It is engineered to optimize GPU utilization across multiple models that have variable or unpredictable demands.
Rather than assigning a dedicated accelerator to each model, Aegaeon virtualizes GPU access at the token level. This allows the system to schedule minute work fragments within a shared pool of resources.
As a result, a single H20 GPU can serve multiple models simultaneously. The system's "effective throughput" (a measure of useful output per GPU) improved by up to ninefold compared with previous serverless systems.
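To make the token-level idea concrete, here is a minimal sketch of the general technique: instead of running one request to completion, a scheduler interleaves generation across models at token boundaries, so one GPU is never monopolized by a single model. The names (`Request`, `schedule_tokens`) and the round-robin policy are illustrative assumptions, not Alibaba's actual API or algorithm.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    model: str      # which LLM this request targets
    remaining: int  # tokens still to generate

def schedule_tokens(requests):
    """Round-robin one token at a time across all active requests,
    preempting each request at every token boundary (illustrative only)."""
    queue = deque(requests)
    trace = []                   # order in which models get GPU time
    while queue:
        req = queue.popleft()
        trace.append(req.model)  # generate exactly one token for this model
        req.remaining -= 1
        if req.remaining > 0:
            queue.append(req)    # re-queue until the request finishes
    return trace

trace = schedule_tokens([Request("model-A", 2), Request("model-B", 3)])
print(trace)
# → ['model-A', 'model-B', 'model-A', 'model-B', 'model-B']
```

The point of the sketch is the interleaving: tokens from both models alternate on the shared GPU rather than one model holding the accelerator until it finishes.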
The paper highlights that the system underwent several months of testing in real-world production environments. Researchers from Peking University and Alibaba's infrastructure team, including Chief Technology Officer Zhou Jingren, reduced the number of GPUs required to serve numerous large language models (LLMs) of up to 72 billion parameters from 1,192 to just 213.
Although the report does not detail which models contributed most significantly to these savings, the South China Morning Post noted that the tests were conducted using Nvidia's H20 GPU—one of the few accelerators still legally available to Chinese buyers under current U.S. export restrictions.
Alibaba attributes these efficiencies mainly to two technological advancements:
Alibaba Cloud observed that in real AI applications, only a handful of models are frequently utilized. Yet, a substantial portion of GPU resources is allocated to models that are seldom called upon, resulting in low utilization rates. Data indicates that 17.7% of GPU resources handled just 1.35% of total inference requests.
Through Aegaeon, Alibaba has tackled this imbalance using pooling and intelligent scaling techniques. The system ensures continuous GPU utilization and prevents idle processing of rarely used models. This approach has enabled Alibaba to achieve higher throughput and enhance hardware efficiency for enterprise-level deployments.
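The scale of the imbalance can be illustrated with a quick calculation using the article's figures (17.7% of GPUs handling 1.35% of requests, from a fleet of 1,192). The comparison below is my own arithmetic, not from the paper:

```python
# Figures from the article; the per-GPU load comparison is illustrative.
total_gpus = 1192
cold_gpu_share = 0.177        # fraction of GPUs tied to rarely-used models
cold_request_share = 0.0135   # fraction of inference requests they serve

cold_gpus = round(total_gpus * cold_gpu_share)   # ≈ 211 GPUs mostly idle
# Requests per GPU for the cold slice vs. the rest of the fleet:
cold_load = cold_request_share / cold_gpu_share
hot_load = (1 - cold_request_share) / (1 - cold_gpu_share)
ratio = hot_load / cold_load

print(f"{cold_gpus} GPUs sit nearly idle; the rest handle "
      f"~{ratio:.0f}x more requests per GPU")
```

Pooling erases this split: once rarely-called models no longer pin their own accelerators, the same hardware serves the hot models, which is where the bulk of the 82% reduction comes from.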
In benchmark tests, Aegaeon demonstrated actual throughput improvements ranging from 1.5 to 9 times that of ServerlessLLM and MuxServe.
Chinese firms such as Huawei and Cambricon are stepping up efforts to develop domestic GPUs to lessen reliance on foreign suppliers. Nvidia's CEO has acknowledged that the company's share of China's advanced AI chip market has fallen to zero. This shift is fostering local innovation and the localization of AI hardware supply chains.
Alibaba's new approach not only strengthens its market position but also aligns with national objectives for technological self-sufficiency. By diminishing its reliance on U.S. chips, Alibaba secures a more robust foothold in China's rapidly evolving AI landscape.
Whether these efficiencies can be replicated outside of Alibaba's proprietary technology stack remains uncertain. The paper from Alibaba Cloud does not specify the exact network architectures used during the beta tests. However, given that Alibaba offers its proprietary eRDMA elastic RDMA networking and has a history of building highly integrated GPU service stacks, the results may hinge on an optimized, vertically integrated environment.
Nevertheless, as the demand for inference continues to escalate, these findings are likely to pique the interest of other hyperscale companies seeking to maximize the use of limited accelerator resources.