06/11 2024
Apple's WWDC keynote early this morning confirmed one point: AI inference computing power will remain primarily in the "cloud" for the long term, and this "long term" means at least three to five years. Yes, Apple has established a strategic partnership with OpenAI and plans to deeply integrate the next-generation iOS with ChatGPT; however, most generative AI inference, including text and image generation tasks, will still be uploaded to ChatGPT's data centers and completed in the cloud. OpenAI made this clear in its own announcement. Apple's "edge AI" remains largely confined to the software level.
If even Apple cannot deliver "edge-side" inference computing power, other phone manufacturers are even less capable of doing so. PCs may fare slightly better than mobile devices, but for the foreseeable future most AI PCs (including desktop workstations) will still rely on NVIDIA's desktop-grade graphics cards and will only be able to run inference for relatively small (distilled) models. From both a technical and a cost perspective, large model and application developers will prefer to complete most inference tasks in the cloud, i.e., in data centers. The capital market recognized this point once again: after WWDC, Apple's stock price fell while NVIDIA's rose slightly.
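To make the point about desktop-grade cards concrete, here is a minimal back-of-envelope sketch in Python. All figures are illustrative assumptions rather than vendor specifications; it simply compares the memory needed to hold a model's weights against a hypothetical 24 GB consumer card, which is roughly why only small or heavily quantized (distilled) models run locally.

```python
# Back-of-envelope sketch: can a desktop GPU hold a given model's weights?
# All numbers below are illustrative assumptions, not vendor specifications.

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (ignores KV cache and runtime overhead)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

VRAM_GB = 24  # hypothetical high-end consumer card with 24 GB of VRAM

for params_b in (7, 13, 70):
    fp16 = weights_gb(params_b, 2.0)   # 16-bit weights
    int4 = weights_gb(params_b, 0.5)   # 4-bit quantized weights
    print(f"{params_b}B params: fp16 ~= {fp16:.0f} GB, int4 ~= {int4:.1f} GB, "
          f"fits in {VRAM_GB} GB at fp16: {fp16 < VRAM_GB}")
```

Under these assumptions a 7B model fits comfortably, a 13B model only fits once quantized, and a 70B model does not fit at all, which is consistent with desktop cards being limited to distilled models.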
For a long time to come, we will not need to pay much attention to "edge-side computing power." It follows that the domestic AI computing power shortage cannot be solved by developing so-called "edge-side computing power." Since ChatGPT emerged in late November 2022, domestic AI computing power has almost always been in short supply, a situation jointly determined by the following factors:
Global AI computing power is in short supply, and the bottleneck lies above all in manufacturing. NVIDIA's H-series graphics cards can only be fabricated by TSMC (Samsung is not up to the task), and this production bottleneck will persist for years.
The US chip export ban has become increasingly stringent. After the comprehensive tightening in the second half of 2023 in particular, many "backdoors" were closed, making it harder and harder for domestic manufacturers to purchase data-center-grade graphics cards.
We know that the computing power required by AI large models divides into training and inference, with training demanding the higher standard; the current situation in China, however, is that both kinds are lacking. Ironically, when cloud gaming was being built out in China in earlier years, major internet companies and telecom operators purchased batches of NVIDIA Turing-architecture graphics cards to set up RTX blade servers, which can now be used for AI inference; without cloud gaming, the domestic inference computing power bottleneck would be even more severe. The Chinese gaming industry is a long-suffering scapegoat that anyone can step on and anyone can vilify, yet it turns out that rescuing the so-called "hard tech" industry still depends on it!
Even so, the supply-demand balance for domestic AI inference computing power remains very tight. The "price cut" measures taken by domestic large models over the past month are therefore largely performance art. For B-end customers in particular, no matter how low the per-call price of a large model API drops, the key question is whether they can actually buy capacity in bulk. The current situation is a quoted price with no real market: only very small purchases can be executed at the "list price," while slightly larger purchases must be negotiated separately with sales staff and queued for, making the actual transaction price unpredictable (and certainly much higher than the "list price").
Not only B-end customers but even C-end users can feel the strain on inference computing power: in several of the most popular AI large model applications in China, free users will almost certainly hit a queue during peak hours and must pay or tip to speed things up. Bear in mind that the DAU of mainstream domestic generative AI applications is currently only in the millions, and inference computing power is already this scarce; if a super app with hundreds of millions of DAU truly emerged, computing power would almost certainly fail to keep up, which is why such a super app is impossible in China at present. (Note: Wenxin Yiyan and Tongyi Qianwen both claim over 100 million cumulative users and over 100 million daily API calls, but they are still far from 100 million DAU; Doubao is estimated to be far from it as well.)
One can imagine that training computing power, which has even higher requirements than inference, is in even shorter supply. In February 2024, ByteDance disclosed in a paper that it had built a "10,000-card cluster" in September of the previous year. Unfortunately, it consists of 12,000 relatively outdated A100 graphics cards, while US technology giants have already moved on to "10,000-card clusters" built from the more advanced H100: Meta's LLaMA-3, for example, was trained on a cluster of about 25,000 H100s, and cloud computing giants led by Amazon are actively shifting to even more advanced B100 and GB200 clusters. The A-series was released in 2020, before the chip ban, when domestic procurement faced few obstacles; the H-series was released in 2022, after the ban took effect, but China could still work around it by purchasing "special edition" cards (mainly the H800); the B-series was released in 2024, by which point the ways around the chip ban have become very narrow and uncertain.
The long-standing and severe computing power bottleneck has had two profound effects on the domestic AI industry. First, the shortage makes computing power expensive (whether purchased or rented): no domestic large model vendor's selling prices can cover training plus inference costs, and some cannot even cover marginal inference costs, losing money on every sale (the recent wave of price cuts may have made the losses worse). Second, most domestic computing power is concentrated in the hands of a few major technology companies, so startups are highly dependent on them and eager for them to invest in the form of computing power. In short, building a large model startup in China is a very poor business proposition, far inferior to mobile internet entrepreneurship in its day.
Below, let's take a closer look at the current state of domestic AI computing power in question-and-answer form. The questions are those the market cares about most; the answers come not from me personally but from trusted friends in the cloud computing and AI industries, and I am merely summarizing them.
Q: What is the current status of domestic AI computing power reserve and distribution?
A: Let's start with the "big cards" used for training. If we count both the A100/A800 and the H100/H800 as "big cards," the domestic reserve certainly exceeds 100,000 cards, possibly even 200,000. The problem is that, as the technology advances, the A-series no longer counts as a "big card." Measured by Zuckerberg's so-called "H100-equivalent compute," the domestic reserve certainly falls short of 100,000, whereas Meta alone already has over 300,000 H100 equivalents and will exceed 650,000 by the end of 2024, far more than the combined reserves of all major domestic manufacturers.
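To illustrate the "H100-equivalent" accounting used above, here is a minimal sketch in Python. The conversion ratios and the example fleet are purely hypothetical assumptions, chosen only to show how a large stock of older cards shrinks once it is normalized to H100 equivalents.

```python
# Minimal sketch of "H100-equivalent" accounting.
# The per-card ratios are illustrative assumptions, not official figures; they
# only need to capture that older or export-restricted cards count for less
# than a full H100 when a mixed fleet is normalized.

H100_EQUIVALENT = {
    "H100": 1.00,
    "H800": 0.85,  # assumed: export variant with reduced interconnect bandwidth
    "A100": 0.30,  # assumed: previous-generation Ampere card
    "A800": 0.28,  # assumed
}

def h100_equivalent(fleet: dict[str, int]) -> float:
    """Convert a mixed GPU fleet into an H100-equivalent count."""
    return sum(count * H100_EQUIVALENT[card] for card, count in fleet.items())

# A purely hypothetical fleet: a large stock of A-series cards adds up to far
# fewer "H100 equivalents" than the raw card count suggests.
fleet = {"A100": 12_000, "H800": 5_000}
print(f"{sum(fleet.values()):,} physical cards ~ {h100_equivalent(fleet):,.0f} H100-equivalents")
```

Under these assumed ratios, 17,000 physical cards amount to fewer than 8,000 H100 equivalents, which is why a raw card count and an "H100-equivalent" count can diverge so sharply.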
As for how that computing power is distributed, there are two yardsticks: "owned computing power" and "callable computing power." Cloud computing giants like Alibaba own enormous computing power, but a significant share of it is rented out to customers, so for their own large model training and inference they do not necessarily hold an absolute advantage in callable computing power. Counting only "owned computing power," Alibaba is undoubtedly first in China, followed by Baidu and ByteDance, with Tencent perhaps holding less. Many internet companies own a thousand or two thousand big cards, because content recommendation algorithms, autonomous driving training, and similar workloads require them.
The distribution of inference computing power is even more complicated. As mentioned above, graphics cards deployed for cloud gaming can take on some inference tasks, and a large share of domestic inference computing power today may come from capacity originally built for cloud gaming.
Q: What is your view on the domestic substitution of AI computing power?
A: On the training side it is extremely difficult. Even if some domestic graphics cards claim technical parameters on par with the A100, they lack NVLink interconnect technology and the CUDA development environment, and so cannot take on large model training tasks. Moreover, the A100 was released by NVIDIA in 2020; catching up in 2024 to where NVIDIA stood four years earlier is hardly cutting-edge. Large models are not atomic bombs; they are civilian products that live or die on cost-effectiveness, and large models developed on non-mainstream hardware may have no commercial value.
On the inference side, however, it is not entirely hopeless, because inference cards depend very little on NVLink and CUDA. NVIDIA's moat on the inference side still exists, but it is much lower than on the training side. The problem is that the technical roadmap for inference computing power keeps changing, and the one driving those changes is still NVIDIA. Given a choice, mainstream manufacturers would certainly prefer to buy NVIDIA's inference solutions; the point for domestic manufacturers is that under the chip ban they have no choice, and domestic substitution on the inference side is still better than nothing.
Q: What is your view on Groq and on certain domestic manufacturers' inference cards that claim to "far exceed NVIDIA"?
A: On a highly specialized technical route, it is indeed possible to build inference cards whose headline specs far surpass NVIDIA's contemporary products, but the cost is very narrow applicability. Such cards are suited only to large model inference, and perhaps only to one specific type of inference at that. Large companies building data centers have to consider versatility and future upgrades, which highly specialized cards cannot satisfy. As mentioned above, graphics cards used for cloud gaming can be repurposed for inference; but can a highly specialized inference card handle graphics rendering, or non-generative inference tasks such as autonomous driving?
Moreover, the wealthy large companies of Silicon Valley now prefer to use "big cards" for both training and inference: it is faster, more flexible, and easier to manage. Training workloads are not spread evenly across the year; you may burn more computing power on training for one quarter and more on inference over the following months, and building a unified "big card" cluster improves flexibility. Of course this approach is not the most economical, so inference tasks are still mostly run on dedicated inference cards. My point is simply that NVIDIA's moats on the training side and the inference side are complementary, not isolated from each other.
Q: Is it possible to bypass the chip ban? What are the alternative solutions currently?
A: Many people believe the chip ban can be circumvented through "irregular" channels. They overlook two points. First, NVIDIA's high-end graphics cards have been in short supply for years, so there is no large second-hand or bulk market for them, and even cards retired by overseas companies are generally repurposed internally. Second, even if you bypass NVIDIA's official sales channels and obtain some cards, you will receive no technical support.
Neither the H-series nor the B-series training cards are sold as individual cards; they are sold as whole servers (training machines). B-series training machines are already a lot like high-end CNC machine tools: a geolocation check is built in, and the machine can shut itself down automatically if it detects that its location has moved. So both in theory and in practice, as long as NVIDIA is willing to enforce the chip ban seriously, it will be hard to get around. NVIDIA certainly wants to sell to more customers and expand in the Chinese market, but its graphics cards are in no danger of going unsold, and it is unlikely to risk actively violating the ban in the short term.
Of course, everything is negotiable. As long as both sides genuinely want to do business and can offer something in exchange, no deal is impossible; the key is how badly each side wants it. But we must not underestimate the difficulty of the problem: only by facing the difficulty squarely can we solve it realistically. Wishfully underestimating the difficulty, or pretending the problem is already solved, is not advisable, and I believe true practitioners would not do so.