The explosive growth of AI large models is forcing the digital infrastructure industry to accelerate its upgrade.
Over the past year and a half, landmark AI large model applications have emerged one after another, from ChatGPT to Sora, repeatedly reshaping people's perceptions. Behind this shock lies the exponential growth of model parameters.
The pressure of this data explosion is quickly transmitted to the infrastructure underlying large models: the "three pillars" that support them, computing power, network, and storage, are all iterating rapidly.
In terms of computing power, NVIDIA upgraded its GPUs from the H100 to the H200 in just two years, improving model training performance by as much as 5 times.
In terms of network, bandwidth has jumped from 25G to 200G, an eightfold increase, and the large-scale adoption of RDMA has cut network latency by 60%.
In terms of storage, major players like Huawei, Alibaba Cloud, Baidu Smart Cloud, and Tencent Cloud have successively launched storage solutions tailored for AI large models.
So, what changes have occurred in storage, one of the three pillars of infrastructure, in the context of AI large models? And what new technical challenges do they pose?
Storage Challenges Brought by AI Large Models
The importance of computing power, algorithms, and data in the development of AI is well known, but storage, as the carrier of data, is often overlooked.
During AI large model training, enormous volumes of data must be exchanged. As the basic hardware that carries this data, storage does more than simply record it; it is deeply involved in data collection, circulation, utilization, and every other stage of the data lifecycle.
If storage performance is weak, a single training run can take far longer than it should, which severely restricts the development and iteration of large models.
In fact, many enterprises have begun to realize the enormous challenges faced by storage systems in the process of developing and implementing large model applications.
Viewed along the R&D and production pipeline of AI large models, the process falls into four stages: data collection, data cleaning, model training, and application. Each stage places new requirements on storage. For example:
In the data collection stage, due to the massive scale and diverse sources of raw training data, enterprises hope to have a large-capacity, low-cost, and highly reliable data storage base.
In the data cleaning stage, raw data collected from the internet cannot be directly used for AI model training. It requires cleaning, deduplication, filtering, and processing of multi-format and multi-protocol data, which is known in the industry as "data preprocessing".
Compared to traditional single-modal small model training, the amount of training data required for multi-modal large models is over 1000 times greater. For a typical 100TB-level large model dataset, the preprocessing time exceeds 10 days, accounting for 30% of the entire AI data mining process.
At the same time, data preprocessing involves highly concurrent processing and consumes substantial computing power. Storage therefore needs to provide multi-protocol, high-performance support so that massive volumes of data can be cleaned and converted into standard file formats quickly, shortening the preprocessing time.
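To make the preprocessing step concrete, below is a minimal Python sketch of the deduplication-and-filtering pass described above, assuming line-delimited JSON text shards; the directory names, field names, and length threshold are illustrative, not part of any vendor's pipeline.

```python
import hashlib
import json
from pathlib import Path

def iter_records(raw_dir):
    """Yield one JSON record per line from every .jsonl file in raw_dir."""
    for path in Path(raw_dir).glob("*.jsonl"):
        with path.open(encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

def clean_corpus(raw_dir, out_path, min_chars=200):
    """Deduplicate by content hash and drop records that are too short."""
    seen = set()
    kept = dropped = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for record in iter_records(raw_dir):
            text = record.get("text", "").strip()
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if len(text) < min_chars or digest in seen:
                dropped += 1
                continue
            seen.add(digest)
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            kept += 1
    print(f"kept {kept}, dropped {dropped}")

if __name__ == "__main__":
    clean_corpus("raw_data", "cleaned.jsonl")  # illustrative input/output paths
```

In a real pipeline this step runs in parallel across many shards, which is exactly why the storage layer's concurrent read performance matters so much.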
In the model training stage, issues such as slow loading of training sets, easy interruption, and long data recovery times often arise.
Compared with traditional models, both the parameters and the training datasets of large models have grown exponentially. Rapidly loading datasets made up of massive numbers of small files, so that GPUs spend less time waiting for data, is therefore crucial.
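As an illustration of hiding small-file latency during loading, here is a minimal PyTorch-style sketch that overlaps file reads with GPU compute using worker processes and prefetching; the mount path, file pattern, and batch parameters are assumptions, and all images are assumed to share one resolution so default batching works.

```python
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset
from torchvision.io import read_image  # assumes torchvision is installed

class SmallFileDataset(Dataset):
    """Dataset over a directory containing a very large number of small image files."""

    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.jpg"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Each item is a single small file; per-file open/read latency is what
        # the parallel workers and prefetching below are meant to hide.
        return read_image(str(self.paths[idx])).float() / 255.0

loader = DataLoader(
    SmallFileDataset("/mnt/training_data"),  # illustrative mount point
    batch_size=256,
    num_workers=8,        # parallel reads hide small-file latency
    prefetch_factor=4,    # keep batches queued ahead of the GPU
    pin_memory=True,      # faster host-to-GPU copies
)
```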
Mainstream pre-trained models now have hundreds of billions of parameters. Yet frequent parameter tuning, network instability, server failures, and similar factors make training unstable and prone to interruption, so a checkpoint mechanism is needed to roll training back to a recent restore point rather than the starting point.
Today, checkpoint recovery can take on the order of days, which has significantly lengthened the overall training cycle of large models. Given the massive data volumes involved and the coming shift toward hourly checkpoint frequencies, reducing checkpoint recovery time deserves serious attention.
Therefore, whether storage can quickly read and write checkpoint files has become the key to efficiently utilizing computing resources and improving training efficiency.
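A minimal sketch of such a checkpoint mechanism in PyTorch is shown below; the model, optimizer, and file paths are placeholders, and a production system would write asynchronously to shared storage rather than a local directory.

```python
from pathlib import Path

import torch

def save_checkpoint(path, model, optimizer, step):
    """Persist everything needed to resume training from this point."""
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer state and return the step to resume from."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["step"]

# Roll training back to the last saved step instead of restarting from step 0.
Path("checkpoints").mkdir(exist_ok=True)
model = torch.nn.Linear(1024, 1024)                        # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
save_checkpoint("checkpoints/step_1000.pt", model, optimizer, step=1000)
resume_step = load_checkpoint("checkpoints/step_1000.pt", model, optimizer)
```

For real large models the saved state runs to terabytes, which is why the read/write speed of the underlying storage directly determines how much GPU time is lost to each save and restore.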
In the application stage, storage needs to provide relatively rich data audit capabilities to meet the requirements of identifying and screening inappropriate content, ensuring that the content generated by large models is used in a legal and compliant manner.
Overall, to maximize the efficiency of AI large model training and cut unnecessary waste, the work has to start with the data; specifically, data storage technology must innovate.
AI Forcing Storage Technology Innovation
According to a forecast by investment firm ARK Invest, by 2030 the industry is expected to train AI models with 57 times more parameters and 720 times more tokens than GPT-3, while the cost drops from today's $17 billion to $600,000. As the price of compute falls, data will become the primary constraint on large model production.
Faced with data constraints, many enterprises have begun to make forward-looking layouts.
For example, large model enterprises such as Baichuan Intelligence, Zhipu, and Yuanxiang have adopted Tencent Cloud's AIGC cloud storage solution to improve efficiency.
Data shows that Tencent Cloud's AIGC cloud storage solution can double the efficiency of data cleaning and training for large models, reducing the required time by half.
Large model enterprises and institutions such as iFLYTEK and the Chinese Academy of Sciences have adopted Huawei's AI storage-related products.
Data shows that Huawei's OceanStor A310 enables full-pipeline management of massive AI data, from collection and preprocessing through model training and inference, simplifying data collection, reducing data migration, and improving preprocessing efficiency by 30%.
Currently, major domestic vendors have also successively released storage solutions tailored for AI large model scenarios.
In July 2023, Huawei released two storage products tailored for AI large models - OceanStor A310 Deep Learning Data Lake Storage and FusionCube A3000 Training/Inference Hyperconverged Integrated System.
At the 2023 Yunqi Conference in November, Alibaba Cloud launched a series of storage product innovations tailored for large model scenarios, leveraging AI technology to empower AI businesses and help users more easily manage large-scale multimodal datasets, improving the efficiency and accuracy of model training and inference.
In December 2023, Baidu Smart Cloud released the "Baidu Canghai Storage" unified technical foundation, comprehensively enhancing both data lake storage and AI storage capabilities.
In April 2024, Tencent Cloud announced a comprehensive upgrade of its cloud storage solution for AIGC scenarios, providing comprehensive and efficient cloud storage support for the entire process of AI large model data collection, cleaning, training, inference, and data governance.
Looking across the storage innovations of these vendors, their technical directions are largely aligned: targeted performance optimization of storage products around the entire production and research pipeline of AI large models.
Taking Tencent Cloud as an example, in the data collection and cleaning stages, storage must first offer multi-protocol access, high performance, and large bandwidth.
To that end, Tencent Cloud Object Storage (COS) supports single-cluster management at the hundreds-of-EB scale, provides convenient and efficient public network access to data, and speaks multiple protocols, fully supporting PB-scale data collection for large models.
At the same time, during data cleaning, big data engines need to quickly read and filter out valid data. Tencent Cloud Object Storage (COS) enhances data access performance through its self-developed data accelerator GooseFS, achieving read bandwidth of up to several TBps, supporting high-speed computing operations, and greatly improving data cleaning efficiency.
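As a rough illustration of what high-concurrency reads during cleaning look like from the application side, here is a sketch that pulls text shards from an S3-compatible object store with a thread pool; the endpoint, bucket, object keys, and filter rule are all assumptions, and it uses the generic boto3 client rather than Tencent's own SDK or the GooseFS API.

```python
from concurrent.futures import ThreadPoolExecutor

import boto3  # generic S3-compatible client; endpoint and bucket below are illustrative

s3 = boto3.client(
    "s3",
    endpoint_url="https://cos.example-region.myqcloud.com",  # assumed S3-compatible endpoint
)

def fetch_and_filter(key):
    """Download one object and keep only lines longer than 80 characters."""
    body = s3.get_object(Bucket="training-corpus", Key=key)["Body"].read()
    lines = body.decode("utf-8", errors="ignore").splitlines()
    return [line for line in lines if len(line) > 80]

keys = [f"raw/shard-{i:05d}.txt" for i in range(1000)]   # hypothetical object keys
with ThreadPoolExecutor(max_workers=64) as pool:         # concurrent reads drive up aggregate bandwidth
    filtered = [line for chunk in pool.map(fetch_and_filter, keys) for line in chunk]
```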
In the model training stage, it is usually necessary to save training results every 2-4 hours to enable rollback in case of GPU failure.
Tencent Cloud's self-developed parallel file storage CFS Turbo has been specifically optimized for AIGC training scenarios, achieving aggregate read/write throughput at the TiB/s level and metadata performance in the millions of OPS, both industry-leading. Writing a 3TB checkpoint now takes about 10 seconds instead of 10 minutes, significantly improving large model training efficiency.
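For context, writing 3TB in roughly 10 seconds implies sustained write bandwidth on the order of 300 GB/s. Below is a minimal sketch of the wall-clock checkpoint schedule such storage supports; the interval, mount path, and the train_step and save_checkpoint helpers are placeholders, not Tencent APIs.

```python
import time

CHECKPOINT_INTERVAL_S = 3 * 60 * 60            # e.g. every 3 hours, within the 2-4 hour range
CHECKPOINT_DIR = "/mnt/cfs-turbo/checkpoints"  # hypothetical parallel filesystem mount

def train_forever(model, optimizer, train_step, save_checkpoint):
    """Run training steps and write a checkpoint on a wall-clock schedule."""
    step = 0
    last_save = time.monotonic()
    while True:
        train_step(model, optimizer)            # placeholder for one training iteration
        step += 1
        if time.monotonic() - last_save >= CHECKPOINT_INTERVAL_S:
            # Writing a 3 TB checkpoint in ~10 s requires roughly 300 GB/s of
            # sustained write bandwidth from the storage backend.
            save_checkpoint(f"{CHECKPOINT_DIR}/step_{step}.pt", model, optimizer, step)
            last_save = time.monotonic()
```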
Large model inference scenarios place higher requirements on data security and traceability.
Tencent Cloud Data Vision CI offers capabilities such as invisible image watermarks, AIGC content moderation, and intelligent data retrieval (MetaInsight), supporting the full business flow of user input, preprocessing, content moderation, copyright protection, secure distribution, and information retrieval. This optimizes how AIGC content is produced and managed, keeps it compliant with regulatory guidelines, and broadens the boundaries of what storage can do.
At the same time, as training and inference data grow, low-cost storage is needed to keep storage overhead down. Tencent Cloud Object Storage offers up to 12 nines of data durability and 99.995% data availability, delivering continuously available storage services for businesses.
Overall, as AI large models advance, new trends are emerging in data storage. The market wants higher-performance, larger-capacity, lower-cost storage products, and every stage of the large model pipeline is being pushed toward tighter integration and greater efficiency.
Major vendors are also continuously meeting the needs of all aspects of large models through technological innovation, lowering the barriers for enterprises to implement large models.
Driven by AI large models, storage innovation is already on the way.
【Original Report by Tech Cloud News】
Please indicate "Tech Cloud News" and attach this article's link when reprinting.
"