With the backing of Volcano Engine's AI Cloud Native, enterprises gain, beneath the upper-layer application products, an IT environment best suited to AI inference deployment.
This environment can be regarded as the newest and most fitting IT architecture for AI inference growth in China. It encompasses elastic scheduling and management of large-scale GPU clusters, storage and computing products tailored to AI inference scenarios in China, network enhancements built around AI training and inference needs, and environments that meet the data demands of specialized inference scenarios, helping enterprises deploy AI faster, more stably, and at lower cost.
Author | Mr. Pi
Produced by | Industry Home
Where is large model deployment headed in 2025?
"In the process of deployment in specific scenarios, our inference demand is nearly 5-10 times that of training demand, and with the deepening of AI usage, it may even exceed that," a relevant executive from an AI enterprise told Industry Home.
The reality is stark: in the nearly two years since "AI large models" became a buzzword, the industrial deployment of large models has kept accelerating, and inference demand is its most prominent manifestation.
According to IDC reports, the demand for training and inference computing power in the Chinese market is projected to grow at compound annual growth rates of over 50% and 190%, respectively, in the next five years, with inference computing power comprehensively surpassing training computing power by 2028.
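As a rough, purely illustrative back-of-envelope check of what those growth rates imply (the 2024 baseline and the starting ratio of inference to training demand below are assumptions, not IDC figures), compounding the two rates shows how quickly a small inference base can overtake training:

```python
# Back-of-envelope check of the quoted growth rates.
# Assumption (not from the IDC report): inference demand starts at ~8%
# of training demand in 2024; training grows ~50%/yr, inference ~190%/yr.
training, inference = 1.0, 0.08

for year in range(2024, 2031):
    if inference >= training:
        print(f"Inference overtakes training around {year}")
        break
    training *= 1.50    # ~50% compound annual growth
    inference *= 2.90   # ~190% compound annual growth
```

Under these assumed starting conditions the crossover lands around 2028, consistent with the projection cited above.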
This is also the focal point of current market discussions. That is, with the emergence of more and more AI deployment forms such as intelligent agents, enterprises' demands for AI technology deployment, or inference demand, are significantly increasing.
However, behind this robust demand, another question is rapidly coming to the forefront: what is the state of the AI deployment infrastructure in China's AI industry ecosystem? Looked at closely, the question is not merely one of data systems and model development; increasingly, the attention falls on the underlying AI infrastructure, that is, the construction of the AI Infra layer.
At a more fundamental level, the large-scale, high-traffic services of the past were supported by massive CPU clusters, with well-adapted and battle-tested databases, storage and computing middleware, and various PaaS-layer products built on top, together keeping the upper-layer applications running.
But now, in the AI era, whether it is the components of the PaaS layer, elastic capacity at the underlying IaaS side, or the network, all of them need to be rethought for new environments with more complex data types and far larger data volumes.
These new solutions sit close to the business front end and bear directly on an enterprise's AI application deployment, for example how to release and manage applications efficiently, and how to handle large-scale online inference traffic.
It can be said that beyond the overt data and model challenges, if we aim to achieve large-scale industrial deployment of AI large models, AI infra is a hurdle that must be overcome.
So, where do we stand now? In 2025, when AI inference demand is on the verge of exploding, or can even be said to have already exploded, what should the underlying AI infra that truly adapts to the large-scale deployment of AI large models look like?
I. On the Eve of the Explosion of Large-Scale Inference Scenarios:
AI Infra Stepping onto the Stage
"We have ample data, and we utilize the industry's top-ranking models as our underlying models, but the AI applications we build simply don't work," a retail enterprise executive said at an industry event.
More specifically, this enterprise has a solid IT foundation. Over the years it has built a full-chain digital architecture spanning ERP, CRM, and databases, and it has accumulated substantial data. This is also why the executive was so enthusiastic when the AI wave arrived. In their view, "with our data advantage, this is an opportunity to overtake on the curve."
But the results were unsatisfactory. In terms of performance, issues such as high training and inference costs and slow response times for AI applications became increasingly apparent, and the project was ultimately shelved at the end of 2023.
In fact, this enterprise was among the earliest to make the attempt, and even now many enterprises run into the same situation when deploying AI. Beyond data and models, more and more AI infra problems are surfacing: insufficient GPU resources, existing compute and storage products that cannot connect to front-end models (database incompatibility, for example), and network jitter that drags down training efficiency, all of which are becoming obstacles for enterprises deploying large models.
A common definition of AI infra is that it often refers to a complete system that supports AI training, inference, and other operations, including hardware (such as GPU servers, storage devices), software (such as operating systems, development frameworks), networks (such as high-speed network connections and security protection), and data systems.
Mapped onto the familiar cloud computing stack, it corresponds to the entire IT architecture of infrastructure, platforms, software, data, and models that sits behind the applications, and whose operation carries the full chain from customer demand to the final application product.
But this is no easy feat.
"Many things are different from the previous CPU model, especially in the inference stage," Luo Hao, head of cloud infrastructure products at Volcano Engine, told Industry Home. "For example, the types of data to be stored are more abundant, shifting from primarily text and image small files in the past to now requiring storage for large-scale videos and large files, with the number of storage items also growing exponentially. In the past, the objects arranged by the CPU architecture were functions, but now in the GPU architecture, what is arranged are large models, requiring re-optimization of the computing, storage, and network architecture to improve throughput and reduce IO latency."
Put more precisely, in the era of AI large models, as the business architecture shifts from CPU-centric to GPU-centric, the entire system needs to be upgraded: not only a more intricate scheduling model at the resource layer, but also new ways of handling that resource layer and the new data models. On the product side, this corresponds to computing, storage, and database products adapted to AI models, new orchestration and other middleware, and new approaches to network stability.
The priority of these underlying IT environments is even higher than that of data and large model capabilities. "Strictly speaking, the data in some scenarios is already sufficient; what new AI infra needs to do first is complete the AI engineering groundwork, so that enterprises' large-scale inference deployments can be supported," an investor told us.
In fact, over the past two years this hard demand for AI infra has been repeatedly confirmed. According to incomplete statistics, between January 1 and July 31, 2024, companies working on smart computing centers, vector databases, orchestration capabilities for large models, and similar directions were heavily favored by capital, accounting for over 15% of financing in the entire large model field.
But beyond the buzz and the hard demand, problems remain. For example, amid today's explosion of large model inference demand, different scenarios and fields still have varied needs for AI Infra, yet most AI Infra providers on the domestic market offer only point solutions, making it hard for enterprises to obtain a full chain of services from elastic computing power to data storage and computing, and on to model inference and application delivery.
Where is the answer? Or put another way, has such a full-chain AI Infra service model emerged in China's current wave of AI deployment?
II. Volcano Engine AI Cloud Native
'Taking a Quick Step Forward'
Meitu is practically a veteran player in China's application market. Over the years, its Meitu Xiuxiu product has led the beauty camera segment. With the advent of the AI era, transformation became a necessary path.
But just as mentioned above, this is not an easy proposition. Specifically for the application of Meitu Xiuxiu, it not only needs to ensure the user experience of AI functions in front-end products but also maintain controllable costs and investments.
On the AI infra side, this demand for inference deployment translates into elastic scheduling of GPU resources, storage product performance, network stability requirements across regions, and, most critically, the training efficiency of scheduling heterogeneous GPU cards.
Volcano Engine became the underlying enabler Meitu chose. With support spanning computing power, storage, and networking, Meitu built an elastic, cost-controllable, and healthy AI infra architecture that can not only schedule different GPU card resources across scenarios but also scale out rapidly during traffic peaks, satisfying its large-scale inference deployments.
Companies with similar experiences include Moonton and DeepPotential. The former is one of the earliest Chinese game companies to go overseas; its 'Mobile Legends: Bang Bang', launched in 2016, now has over 110 million monthly active users and over 1 billion cumulative downloads globally. Within the game, abusive language and behaviors such as insults and religious discrimination between opposing players need to be identified and contained with AI.
The solution they adopted was to access and invoke the Doubao large model API through Private Link on Volcano Engine, while also optimizing and customizing based on the Doubao large model with the support of Volcano Engine's machine learning platform, cloud search, vector database, and other products, ultimately achieving ultra-low latency and low-cost deployment on the inference side and completing the deployment of related AI products.
This is also true for DeepPotential. As a leading enterprise in AI for science in China today, it often encounters a large number of data processing issues with different formats during business development, requiring high-speed reading of unstructured data. With the support of Volcano Engine, it not only achieves resource matching for different training and inference scenarios but also ensures high utilization of underlying resources based on the platform's unified scheduling capabilities, ensuring efficient business advancement.
At the recently held Volcano Engine FORCE conference, this help for enterprises with large-scale inference deployment, which can also be understood as advanced underlying AI practice, was formally presented by Volcano Engine as a solution: 'AI Cloud Native'.
Among them, several highlights are particularly noteworthy.
For instance, at the computing power level, Volcano Engine has launched Elastic Scheduled Instances (ESI) and Spot Instances, fully supporting CPUs and GPUs. Relying on ByteDance's massive resource pooling technology, it can provide millions of CPU cores of elastic computing power and tens of thousands of GPU cards of elastic capabilities online, meeting the elastic computing power needs of customers in different scenarios cost-effectively.
Moreover, at the storage level, to address the problem that core data flows in traditional AI architectures must detour through the CPU, Volcano Engine officially launched Elastic Instant Cache (EIC).
As another self-developed Volcano Engine product, it rebuilds the KV Cache data path with GPU Direct and RDMA technologies, allowing the KV Cache in GPU memory to be offloaded to and cached in the memory of remote or local hosts. Compared with traditional caching technologies, latency is reduced to 1/50. In scenarios such as Prefix Cache, prefill/decode (P/D) separation, multi-turn dialogue, and long-text processing, the core metrics TTFT (time to first token) and TPOT (time per output token) can improve several-fold, while overall GPU consumption also falls.
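As a purely conceptual sketch of the underlying idea, not Volcano Engine's EIC implementation (class and function names, and the prefix-hashing scheme, are illustrative assumptions), keeping KV Cache outside GPU memory and reusing it by prefix can be pictured roughly like this:

```python
import hashlib

class HostKVCache:
    """Toy prefix-keyed KV cache held in host (CPU) memory.
    Illustrative only: a real system would move tensors over
    GPU Direct / RDMA rather than store Python objects."""

    def __init__(self):
        self._store = {}  # prefix hash -> precomputed KV blocks

    @staticmethod
    def _key(token_ids):
        # Hash the token prefix so identical prefixes map to one entry.
        return hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()

    def put(self, token_ids, kv_blocks):
        self._store[self._key(token_ids)] = kv_blocks

    def get(self, token_ids):
        # On a hit, prefill can be skipped for the cached prefix,
        # which is what shortens TTFT in multi-turn / long-text cases.
        return self._store.get(self._key(token_ids))

# Usage: cache the KV of a shared system prompt, then reuse it.
cache = HostKVCache()
system_prompt = [101, 2023, 2003, 1037, 2291, 25732]  # made-up token ids
cache.put(system_prompt, kv_blocks="<KV tensors for the prompt>")
assert cache.get(system_prompt) is not None  # hit: prefix prefill skipped
```

The design point is simply that GPU memory holds only the hot working set, while reusable prefixes live in cheaper, larger host memory and are pulled back over a fast path when needed.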
The highlights at the network level are just as pronounced. Volcano Engine's third-generation heterogeneous GPU and NPU instances, as well as its fourth-generation CPU instances, fully support affordable vRDMA interconnection, providing up to 320 Gbps of bandwidth inside VPC networks with roughly 80% lower latency on average than traditional VPC networks, significantly improving training and inference efficiency. Meanwhile, the AI Gateway's intelligent routing supports load balancing based on GPU utilization metrics, helping users handle large-scale inference traffic through intelligent scheduling, with network costs cut by up to 70%.
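Utilization-aware routing of this kind can be sketched in a simplified, hypothetical form (the endpoint addresses, thresholds, and metric source below are assumptions; a real gateway would pull live GPU metrics and weigh more signals than a single number):

```python
from dataclasses import dataclass

@dataclass
class Backend:
    endpoint: str
    gpu_utilization: float  # 0.0 - 1.0, as reported by a metrics agent

def pick_backend(backends, max_util=0.9):
    """Route the next inference request to the least-loaded GPU backend,
    skipping instances already above the utilization threshold."""
    candidates = [b for b in backends if b.gpu_utilization < max_util]
    if not candidates:
        raise RuntimeError("all backends saturated; queue or scale out")
    return min(candidates, key=lambda b: b.gpu_utilization)

# Usage with made-up endpoints and utilization readings.
pool = [
    Backend("10.0.0.11:8000", 0.82),
    Backend("10.0.0.12:8000", 0.41),
    Backend("10.0.0.13:8000", 0.95),
]
print(pick_backend(pool).endpoint)  # -> 10.0.0.12:8000
```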
What is more worth mentioning is that at this conference, Volcano Engine's veStack Smart Computing Edition was also upgraded to version 2.0. "The new generation of the Smart Computing Edition not only offers richer support for smart computing infrastructure but also further improves stability, operations and maintenance capabilities, training frameworks, and model development capabilities. At the same time, it provides standardized APIs on the ecosystem side and deployment capabilities for different scenarios across industries, so customers can better address the various challenges of the smart computing era," Luo Hao told us.
In fact, all these products have already been fully deployed within ByteDance's internal AI system. According to Tan Dai, president of Volcano Engine, since ByteDance released the Doubao large model in May of this year, its invocation volume has increased by over 33 times within seven months, with the average daily token usage exceeding 4 trillion as of December.
A significant portion of these demands stem from the inference side. These demands are met in a more efficient, cost-effective, practical, and secure manner with the support of Volcano Engine AI Cloud Native.
Luo Hao informed us that Volcano Engine's underlying AI Cloud Native solution is now propelling the diverse needs of various enterprises. 'One category comprises enterprises eager to experiment independently, perhaps by developing an application to test the waters. Another seeks the 'low-hanging fruit,' meaning those who have decided to adopt AI and have identified a specific direction. Lastly, there are enterprises with robust AI strategic requirements, such as building substantial models or possessing their own card resources.'
Looked at more closely, whether it is an initial exploration of AI applications, the AI-based transformation of a specific link, or AI upgrades at the enterprise level, Volcano Engine AI Cloud Native gives enterprises, beneath the upper-layer application products, an IT environment best suited to AI inference deployment.
This environment can be regarded as the newest and most fitting IT architecture for AI inference growth in China. It encompasses elastic scheduling and management of large-scale GPU clusters, storage and computing products tailored to AI inference scenarios in China, network enhancements built around AI training and inference needs, and environments that meet the data demands of special inference scenarios, helping enterprises deploy AI faster, more stably, and at lower cost.
III. From Within to Beyond:
A New Paradigm for the Soil of AI Inference Deployment
Cultivating soil this well suited to AI inference is no small task. For Luo Hao and the Volcano Engine team, it has been a long journey of 'seeking truth.'
Back at its cloud product launch in December 2021, Volcano Engine unveiled a series of AI products spanning upper-layer applications, AI development platforms, and AI deployment solutions for diverse scenarios. Even today, given its breadth of scenarios and its AI development efficiency, this remains an advanced model of AI practice for many industrial settings.
This early move reflected the long-standing technological and industrial depth in AI at Volcano Engine and, more broadly, ByteDance. When OpenAI emerged, this internal AI foundation helped ignite the domestic large model market.
Consequently, in 2023, the market was abuzz with the slogan, '70% of domestic large models run on Volcano Engine.' However, Luo Hao and the Volcano Engine team observed a more pronounced trend emerging shortly after: the robust inference-side demand mentioned earlier.
At the same time, the more pronounced trends and challenges around inference scenarios came from within. As ByteDance advances its AI business, whether it is the Doubao large model, upper-layer AI applications such as Doubao Assistant, Jianying, and the Kouzi development platform, or the many products deployed in diverse scenarios at home and abroad, all of them place strong demands on AI infrastructure.
Globally, this represents one of the largest-scale AI inference deployment demands.
For Luo Hao and Volcano Engine, their primary objective is to serve these native AI applications within ByteDance. This involves addressing issues such as elastic scheduling of GPU resources, more efficient and low-latency computing and storage products, and the optimization of various network environments.
This practical experience and "trial and error" in serving large-scale AI inference scenario deployments, rare even globally, constitutes Volcano Engine's unique advantage in domestic AI infra services. It offers a stable IT architecture highly compatible with large-scale inference demands, cutting-edge GPU-centric resource scheduling and data processing capabilities, and service guarantees for core elements like network environments.
"Overall, we can not only enhance single card utilization for enterprises but also assist them in optimizing specific scenarios and deployment details," said Luo Hao.
Comprehensive data indicates that, with Volcano Engine's AI cloud-native solution, enterprises can achieve over 99% effective training duration in training scenarios and save 20% of GPU resources while improving performance by 100% in inference scenarios.
"In fact, compared to IDC's prediction that 'inference calculations will exceed training calculations by 2028,' this may occur two years earlier on our Volcano Engine," Luo Hao revealed.
It is not hard to see that, as inference demand surges, Volcano Engine's ability to meet enterprises' inference needs will increasingly make it the preferred choice for many of them.
The virtuous chain is straightforward: better underlying AI infrastructure support helps enterprises complete large-scale inference deployment faster, which yields better, more usable, and more cost-effective industrial AI applications, and in turn helps enterprises build new competitive strengths.
Widening the lens, the AI Cloud Native solution also matches the newest underlying infrastructure needed for countless industries, and for China, to evolve in the AI era. Only with a solid underlying IT environment can AI technology genuinely land, can the inference demands of countless industrial scenarios be met, and can industries move from digitization to digital intelligence.
On this novel AI soil for enterprises and industries, Volcano Engine has taken the inaugural step.