11/18/2024
The intelligent computing center is the 'heart' of computing power in this wave of technological revolution and has long been a focus of international technological competition. Plans for building intelligent computing centers have consistently drawn public attention in China.
Recently, the application of OXC optical switching technology in intelligent computing scenarios has come into the public eye. Can this technology and its solutions support the network of intelligent computing centers?
Judged on its technical fundamentals, practical deployments, and industry progress, one might say, 'OXC technology actually has no future in intelligent computing scenarios.'
Technically, OXC optical switching faces challenges in intelligent computing scenarios that are difficult to resolve effectively, such as coordinating optical and electrical switching and the lack of support for the many-to-many traffic of AI tasks.
In practice, Google is currently the only company in the industry that uses MEMS-OXC equipment commercially. Google's core purpose in using OXC in its TPU clusters is to address the availability problem of the Torus topology. However, the weak point in network availability lies in the access ports, which OXC does not address, so in this role OXC is not fundamentally different from an automatic patch panel.
From an industrial perspective, having Google as the sole commercial user limits the market. According to LightCounting's predictions, the global OXC market will be approximately $500 million by 2029, with most of it attributable to Google. That is only one-twentieth the scale of electrical switching.
Considering the above dimensions, it is not difficult to conclude that OXC technology is merely a patch panel in intelligent computing scenarios and cannot be truly scaled up or support the network of intelligent computing centers with over 10,000 cards.
Next, let's take a comprehensive look behind the veil of MEMS-OXC in intelligent computing scenarios, from its technical starting point to its industrial end point.
Simply put, OXC optical switching technology switches optical signals between different optical paths. Technical approaches include MEMS, DLC, and DLBS. Among them, MEMS is currently the most mainstream, and MEMS-OXC is the only type of equipment in commercial use, by Google.
However, in the network of intelligent computing centers with over 10,000 cards, the role of MEMS-OXC is essentially that of a patch panel.
Let's first look at how intelligent computing center networks are built. The report 'AI Data Center Network Construction' released by the Open Data Center Committee (ODCC) mentions that the AI parameter plane network is built as a two-tier Spine-Leaf or a three-tier CLOS architecture. In AI cluster networking practice, scales of over 100,000 cards are achieved through three-tier networking.
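To make the difference between two tiers and three tiers concrete, the sketch below computes the maximum number of endpoints in an idealized non-blocking folded-Clos network. It assumes identical switches, a 1:1 subscription ratio, and one network port per card. The function name `max_endpoints` and the switch radix values are illustrative assumptions and do not come from the ODCC report.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints of an idealized non-blocking folded-Clos network.

    Assumption: every switch below the top tier splits its ports evenly
    between downlinks and uplinks, and top-tier switches use all ports as
    downlinks. This gives 2 * (radix / 2) ** tiers endpoints.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be a positive even number")
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    return 2 * (radix // 2) ** tiers


if __name__ == "__main__":
    # Hypothetical switch radix values, chosen only for illustration.
    for radix in (64, 128):
        for tiers in (2, 3):
            print(f"radix={radix} tiers={tiers} endpoints={max_endpoints(radix, tiers)}")
```

Under these assumptions, 64-port switches reach 2,048 endpoints in two tiers and 65,536 in three, while 128-port switches reach 8,192 and 524,288. Real clusters with oversubscription or several ports per card will land on different numbers, but the jump from two tiers to three is what moves a network into the 100,000-card range.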
Currently, AI giants such as Meta, OpenAI, and Microsoft build super-large-scale clusters by expanding their networks from two tiers to three, which means adding a Core layer above the Leaf and Spine layers. Within these designs, two-tier networking uses electrical switches. Even Google, the only commercial user of OXC in the industry, adopts a hybrid optical-electrical architecture.
It can be seen that if a two-tier networking model is adopted for the intelligent computing center network, OXC is not required; if a three-tier networking model is adopted, the main role of MEMS-OXC equipment in the Core layer is flexible patching, which is not fundamentally different from an automatic patch panel.
Introducing MEMS-OXC not only fails to bring gains to the network but may also create additional problems:
First, the issue of optoelectronic coordination.
If an OXC optical switch is introduced in the third tier, but the underlying data center network still uses electrical switches, this requires coordination, communication, and cooperation between optics and electronics, which has a significant impact on the entire data center network.
For example, OXC technology is characterized by flexible switching. For the network as a whole, however, this means optical paths are connected and disconnected intermittently, which requires the entire access layer and Spine layer to adjust their strategies accordingly.
Consider that most large model training in intelligent computing scenarios is parallel training, with traffic patterns that can change at any time. If the data center network is readjusted at any time on a timescale of seconds, the reliability of training is difficult to guarantee. Any large model development team may find frequent interruptions in training unacceptable.
Second, the compatibility issue between OXC and AI services.
OXC optical switching technology does not support many-to-many communications and can only perform pure physical forwarding. In intelligent computing scenarios, AI tasks involve many algorithms and operators, and their communication patterns differ considerably. They may require one-to-many, many-to-one, or many-to-many forwarding. These algorithms need efficient communication, a requirement that is difficult to meet with OXC technology, which hinders the development of related intelligent computing services.
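The one-to-one constraint can be illustrated with a toy model. The sketch below treats a single OXC configuration as a mapping from input ports to output ports and shows that, if every node had only one optical port and no electrical switch in the path, an all-to-all exchange among n nodes would need n - 1 successive configurations. The function names and the one-port-per-node setup are assumptions made for illustration, not a description of any vendor's equipment.

```python
def is_valid_circuit_config(mapping: dict[int, int]) -> bool:
    """Return True if no two input ports are steered to the same output port."""
    outputs = list(mapping.values())
    return len(outputs) == len(set(outputs))


def all_to_all_schedule(n: int) -> list[dict[int, int]]:
    """Build n - 1 shift configurations; configuration k connects src to (src + k) % n."""
    return [{src: (src + shift) % n for src in range(n)} for shift in range(1, n)]


def covered_pairs(schedule: list[dict[int, int]]) -> set[tuple[int, int]]:
    """Collect every (src, dst) pair that gets a circuit somewhere in the schedule."""
    return {(src, dst) for config in schedule for src, dst in config.items()}


if __name__ == "__main__":
    # Many-to-one: three senders target port 3 at once, which one circuit
    # configuration cannot express.
    print(is_valid_circuit_config({0: 3, 1: 3, 2: 3}))

    # All-to-all among 4 nodes needs 3 separate configurations in this model.
    schedule = all_to_all_schedule(4)
    print(len(schedule), len(covered_pairs(schedule)))
```

An electrical packet switch serves the same traffic without reconfiguration because it forwards per packet rather than per circuit, which is the gap this paragraph describes.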
The third major issue is the energy consumption of OXC.
The insertion loss of OXC optical switches is significant, meaning that the signal attenuates as it passes through the switch. To compensate, it is necessary to use optical modules with higher transmit power or a longer rated reach, which increases energy consumption. In addition, insertion loss constrains the evolution of optical modules toward higher speeds.
Given these issues, once insertion loss, power consumption, and other factors are fully accounted for, intelligent computing centers will find that MEMS-OXC equipment is inferior to automatic patch panels.
Another key reason MEMS-OXC is inferior to automatic patch panels lies in its commercial prospects.
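The effect of insertion loss can be sketched as a simple optical link budget: the margin is the transmit power minus the receiver sensitivity minus every loss along the path. All numeric values below are hypothetical placeholders chosen to make the arithmetic visible; they are not measurements of any real optical module or OXC.

```python
def link_margin_db(tx_power_dbm: float, rx_sensitivity_dbm: float,
                   losses_db: list[float]) -> float:
    """Power margin in dB: launch power minus receiver sensitivity minus path losses."""
    return tx_power_dbm - rx_sensitivity_dbm - sum(losses_db)


def extra_tx_power_needed_db(margin_db: float) -> float:
    """Additional launch power (dB) needed to bring a negative margin back to zero."""
    return max(0.0, -margin_db)


# Hypothetical placeholder values, for illustration only.
TX_POWER_DBM = 0.0
RX_SENSITIVITY_DBM = -6.0
FIBER_AND_CONNECTOR_LOSS_DB = 3.0
OXC_INSERTION_LOSS_DB = 3.5

if __name__ == "__main__":
    direct = link_margin_db(TX_POWER_DBM, RX_SENSITIVITY_DBM,
                            [FIBER_AND_CONNECTOR_LOSS_DB])
    via_oxc = link_margin_db(TX_POWER_DBM, RX_SENSITIVITY_DBM,
                             [FIBER_AND_CONNECTOR_LOSS_DB, OXC_INSERTION_LOSS_DB])
    print(f"margin without OXC: {direct:.1f} dB")
    print(f"margin with OXC:    {via_oxc:.1f} dB")
    print(f"extra launch power: {extra_tx_power_needed_db(via_oxc):.1f} dB")
```

With these placeholder numbers the link closes with margin to spare when connected directly, but falls below zero once the OXC is in the path, so a stronger or longer-reach module would be needed. Actual budgets depend on the specific modules, fiber plant, and switch.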
We know that a new technology must close the loop in the commercial market, recovering its investment through actual use, in order to attract further investment in infrastructure and form a virtuous cycle. However, it is difficult for the industrialization of OXC technology to form such a cycle.
The primary constraint is cost.
The implementation of OXC technology requires extensive use of optical switches, and all related components such as optical modules need to be upgraded, which results in huge initial investments and high comprehensive costs.
In its 'AI Network Optical Switch Technology Report', the Open Data Center Committee (ODCC) noted that, given the challenges of the network system and of OCS (Optical Circuit Switch) itself, and judged against requirements for port count, switching time, low cost, high reliability, and easy topology management, optical switches still need design optimization to reduce insertion loss and return loss. The report also calls for exploring networking solutions that combine optical switches with electrical switches to reduce costs.
However, the above investments require commercial returns from industrial users. As mentioned earlier, due to the limitations of optical switching technology itself, many AI tasks and scenarios are difficult to implement in the short term, leading to high uncertainty in the commercialization of OXC.
Taken together, these factors explain why the industry is moving notably cautiously on OXC and remains largely in a wait-and-see state.
Technology does not exist in a vacuum but is embedded in the realities of talent, capital, industry, and the real economy.
China's intelligent computing industry is still in the catch-up phase, with relatively insufficient resources and talent. It must grasp development prospects and opportunities while facing current survival and commercial realities, and even address historical issues in some cases.
In this context, if China's intelligent computing industry invests precious resources in OXC, which is not suitable for networking, it may lead to a series of chain reactions.
For example, industrial resources are dispersed and intelligent computing centers are costly to build, so OXC equipment, which offers no advantage in networking scale, insertion loss, power consumption, or cost, represents an inefficient investment and weakens the risk resistance of technology enterprises.
MEMS-OXC equipment delivers no significant benefit in intelligent computing clusters, and it cannot solve the issue of network availability. Introducing OXC affects the transmission and supply of AI computing power, thereby hindering the resilient development of AI training, AI inference, and other services.
What requires greater vigilance is that hyping the OXC route may cause China's intelligent computing industry to miss the chance to explore other technological routes, and the opportunity cost incurred is immeasurable.
Therefore, OXC, which can only serve as an automatic patch panel, is not suitable as a choice for networking in intelligent computing centers and has no future in such scenarios. At present, what China's intelligent computing industry should truly do is to further leverage its core advantages in mature switching technology, existing precious resources, and industrial intelligence opportunities.