10/09 2025
Editor's Note:
The internet's founding myth is rooted in a vision of interconnectedness. From Sir Tim Berners-Lee's World Wide Web to the TCP/IP protocols that underpin the global information superhighway, 'connectivity' has always been the internet's lifeblood. Yet China's fast-growing AI computing infrastructure presents a stark contrast: the super-node intelligent computing centers erected by tech giants claim to support 'heterogeneity' technically but operate as isolated ecosystems.
Computing power, the 'oil' of this new era, has not pooled into a vast, flowing river as anticipated. Instead, it risks stagnating in isolated ponds.
The 'Tower of Babel' Predicament
The internet's essence lies in dismantling isolated islands. When the U.S. Defense Advanced Research Projects Agency (DARPA) first conceived ARPANET, its primary objective was resource sharing and communication across disparate computers. The internet carried that legacy forward into its current scale. In AI computing, the same philosophy extends to 'computing power networks' or the 'internet of computing': orchestrating distributed computing resources uniformly and allocating them on demand, much like a power grid, so that users can draw computing power seamlessly and transparently from anywhere.
However, aspirations often outpace reality. As the national 'Computing Power Transmission from East to West' initiative deepens and demand for large-model training compute explodes, super-node products continue to proliferate and the construction of intelligent computing centers has reached a fever pitch. The crux of the problem lies in the deep ecological barriers between these nodes.
A common misconception arises: 'If they've all open-sourced their protocols, why can't they interoperate?' In truth, this is the deceptive allure of the 'small courtyards with high walls' strategy. Open-sourcing here is not about dismantling barriers but about constructing them more efficiently.
In ancient times, humans endeavored to build the 'Tower of Babel' to unify their strength, only to be ensnared in a communication quandary due to divine linguistic confusion, ultimately aborting the project. This allegory underscores the core of the 'Tower of Babel' predicament: even with shared objectives, collaboration sans effective communication is doomed to fail.
Presently, China's computing power sector grapples with a similar dilemma. Propelled by the national 'Computing Power Transmission from East to West' project, major tech conglomerates are constructing intelligent computing 'super-nodes,' yet their self-contained technical routes and ecological barriers pose formidable challenges to the vision of a nationwide interconnected computing power 'tower.'
For instance, a vendor's partial open-sourcing of code is comparable to Tesla publishing its charging-station interface protocol. Any third party can manufacture charging connectors to that protocol, but that does not guarantee a BYD car will charge at optimal speed at a Tesla station, nor that Tesla cars will seamlessly use NIO's charging network. Peak performance still depends heavily on the vendor's proprietary 'core assets.' This deep integration makes it hard for an application optimized for one platform to achieve the same efficiency on different hardware.
A specific computing platform at a major internet company's intelligent computing center is tightly coupled with its self-developed chips and deep learning framework; its technical documentation extensively highlights joint optimization cases between the two. While such optimizations boost performance within its own ecosystem, they also raise technical barriers around it.

Another platform, by contrast, exports its core technical capabilities through productized cloud services (e.g., elastic computing, file storage), offering standard APIs and SDKs for user convenience. It actively participates in and contributes to open-source communities like Kubernetes, ensuring seamless integration with those ecosystems. Its strategy leans toward a 'black-box', 'service-oriented' approach: rather than confining users to specific hardware or frameworks, it encapsulates its scheduling and networking capabilities in stable, reliable cloud services. Users reap the benefits of the service without delving into the implementation. This model's 'openness' manifests chiefly in support for standard interfaces rather than in public code.
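The 'service-oriented' model can be sketched in a few lines: application code targets a vendor-neutral interface, and the platform's scheduling and networking stay behind it as a black box. All names here (`ComputeService`, `JobSpec`, the platform class) are invented for illustration, not any real vendor's API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class JobSpec:
    """A vendor-neutral training-job description (hypothetical schema)."""
    image: str
    gpus: int
    command: List[str]


class ComputeService(ABC):
    """Standard interface: callers depend only on this, never on a vendor SDK."""
    @abstractmethod
    def submit(self, spec: JobSpec) -> str:
        ...


class ServiceOrientedPlatform(ComputeService):
    """'Black-box' platform: scheduling and networking details are
    encapsulated; users see only a stable service API."""
    def submit(self, spec: JobSpec) -> str:
        # Internally the platform may target proprietary hardware, but the
        # caller never writes vendor-specific code.
        return f"job-{abs(hash((spec.image, spec.gpus))) % 10000}"


def run_training(service: ComputeService, spec: JobSpec) -> str:
    # Application code stays portable: switching vendors means swapping the
    # ComputeService implementation, not rewriting the application.
    return service.submit(spec)
```

Under this design, a user who migrates between platforms rewrites one adapter class, not the application.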
Moreover, ecological compatibility necessitates seamless collaboration among heterogeneous chips at the software stack (drivers, frameworks, libraries), protocol standards (communication, scheduling, security), and development toolchain levels. For example, NVIDIA's GPU ecosystem has constructed a comprehensive technical stack through components like CUDA, NCCL, and NVLink, enabling developers to seamlessly migrate models across GPU generations. In stark contrast, China's domestic super-node ecosystem lacks analogous standards, with vendors employing incompatible protocols.
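What a shared stack like CUDA/NCCL buys can be illustrated with a toy abstraction layer: applications call one collective API, and each vendor registers its own backend behind it. The registry and backend names below are hypothetical, and real collectives operate on device buffers rather than Python lists; this is only a sketch of the contract.

```python
from typing import Callable, Dict, List

# Registry mapping a backend name to its all-reduce implementation.
_BACKENDS: Dict[str, Callable[[List[float]], List[float]]] = {}


def register_backend(name: str, fn: Callable[[List[float]], List[float]]) -> None:
    """Each chip vendor plugs its implementation in behind the common API."""
    _BACKENDS[name] = fn


def allreduce(values: List[float], backend: str) -> List[float]:
    """Vendor-neutral entry point: callers never touch vendor protocols."""
    if backend not in _BACKENDS:
        raise ValueError(f"no backend registered for {backend!r}")
    return _BACKENDS[backend](values)


# Two hypothetical vendors implement the same contract (sum-all-reduce);
# application code calling allreduce() is unchanged between them.
register_backend("vendor_a", lambda xs: [sum(xs)] * len(xs))
register_backend("vendor_b", lambda xs: [sum(xs)] * len(xs))
```

Without such a shared contract, every application must be rewritten per vendor, which is precisely the fragmentation the article describes.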
Tests conducted at an intelligent computing center revealed that when Super-node A schedules GPUs from Vendor B, protocol conversion increases communication latency by 42% and triples task startup time. The degradation stems from differences between the two vendors' protocol stacks.
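As a back-of-envelope illustration of how those two penalties compound, the sketch below applies the reported factors (+42% communication latency, 3x startup) to an assumed job profile. The baseline seconds are invented; only the multipliers come from the test figures above.

```python
def cross_vendor_job_time(startup_s: float, comm_s: float, compute_s: float) -> float:
    """End-to-end job time when scheduling across vendor boundaries,
    using the reported penalty factors. Baselines are illustrative."""
    startup = startup_s * 3.0   # task startup time tripled
    comm = comm_s * 1.42        # communication latency +42%
    return startup + comm + compute_s


# Assumed native profile: 10 s startup, 60 s communication, 300 s compute.
native = 10.0 + 60.0 + 300.0
cross = cross_vendor_job_time(10.0, 60.0, 300.0)
overhead = cross / native - 1.0  # fractional end-to-end slowdown
```

Note that the end-to-end penalty depends on the job mix: communication-heavy, short-lived jobs suffer far more than long compute-bound ones.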
Domestic super-node vendors commonly adopt closed protocols to erect technical barriers. Even when partial specifications are disclosed, core components such as inference frameworks still require deep adaptation to the vendor's self-developed chips. This 'hardware open, software closed' paradigm forces third-party chips to rewrite driver-level code, drastically inflating development costs.
Furthermore, ecological compatibility mandates that development toolchains (e.g., model conversion tools, debuggers) support cross-chip optimization. Currently, domestic super-node vendors' toolchains are fragmented.
The ultimate outcome of this strategy is the formation of multiple thriving 'open-source islands.' Each island entices residents with open-source and exemplary technology, boasting wide roads and developed transportation within. However, no standardized bridges or ferries connect these islands. Open-sourcing lowers the entry threshold to each 'small courtyard' but escalates the migration cost between them.
The Tragedy of 'Small Courtyards with High Walls'
The adverse impacts of this ecological isolation are profound and multifaceted.
Firstly, it leads to colossal waste of computing resources and scheduling inefficiencies. According to the China Academy of Information and Communications Technology's 'White Paper on Cloud Computing Development (2023),' one of the chief challenges enterprises face in adopting cloud and diverse computing power is the 'complexity of managing and scheduling heterogeneous resources.' When an AI startup urgently needs large-scale computing power, it discovers, painfully, that applications developed for Ecosystem A cannot run efficiently on Cluster B, and vice versa. Users must repeatedly adapt and re-develop for each computing power source, significantly increasing time and financial costs. These ecological barriers make cross-vendor, cross-regional coordinated scheduling of computing power nearly impossible, leaving vast capacity idle under uneven demand and preventing optimal global allocation.
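The stranded-capacity effect can be made concrete with a toy scheduler that may place a job only on clusters of its own ecosystem. The cluster and job data are invented; the point is that idle GPUs in the 'wrong' ecosystem cannot absorb demand.

```python
from typing import Dict, List, Tuple


def schedule(jobs: List[dict], clusters: List[dict]) -> Tuple[Dict[str, str], List[str], Dict[str, int]]:
    """Greedy placement constrained by ecosystem compatibility.
    Returns (placements, stranded job names, remaining free GPUs)."""
    free = {c["name"]: c["gpus"] for c in clusters}
    eco = {c["name"]: c["ecosystem"] for c in clusters}
    placements: Dict[str, str] = {}
    stranded: List[str] = []
    for job in jobs:
        # A job can only land on a cluster of its own ecosystem with capacity.
        target = next((n for n in free
                       if eco[n] == job["ecosystem"] and free[n] >= job["gpus"]), None)
        if target is None:
            stranded.append(job["name"])
        else:
            free[target] -= job["gpus"]
            placements[job["name"]] = target
    return placements, stranded, free


# Invented example: demand for Ecosystem A exceeds A's capacity, while a
# large B-ecosystem cluster sits entirely idle.
clusters = [{"name": "east-A", "ecosystem": "A", "gpus": 8},
            {"name": "west-B", "ecosystem": "B", "gpus": 64}]
jobs = [{"name": "train-1", "ecosystem": "A", "gpus": 32}]
placements, stranded, free = schedule(jobs, clusters)
```

Here `train-1` is stranded even though 64 GPUs sit idle in `west-B`; with a compatibility layer, a global scheduler could have placed it.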
Secondly, it holds developers hostage and stifles innovation. Developers should focus on algorithm and model innovation but now squander substantial energy learning and adapting to different underlying hardware ecosystems. Their creativity and technical flexibility are shackled by ecological chains. A healthy market should witness computing power providers vying to offer developers superior experiences and prices. However, under the 'small courtyards with high walls,' this relationship is partially inverted—developers, seeking peak performance, must align with a specific ecosystem, weakening their bargaining power and technical agility.
Finally, this contradicts the national strategy of a 'nationally integrated computing power system.' The grand vision of the 'Computing Power Transmission from East to West' project is to achieve large-scale, intensive development of national computing power resources by constructing national-level computing hubs and data center clusters. The current fragmented 'small courtyards with high walls' pose the biggest obstacle to realizing this vision. They prevent computing power from being scheduled efficiently and losslessly to where demand is, the way electricity flows in the 'West-to-East Electricity Transmission' project, hindering the formation of a unified national computing power market.
Conclusion
To dismantle the 'small courtyards with high walls' dilemma, we cannot rely solely on corporate self-regulation, as ecological strategies driven by commercial interests possess inherent rationality. We must adopt top-level design and industrial collaboration from a higher vantage point, encouraging bottom-layer technological innovation while vigorously promoting standardization and openness in the middle layers to construct a neutral computing power scheduling system capable of penetrating ecological barriers.
The internet's story is one of moving from fragmentation to connection. The ecological fragmentation of 'super-nodes' we confront today is the inevitable growing pain of a technological explosion, at bottom a struggle among vendors for discourse power in the intelligent era. But it is not the future we should embrace.
As computing power becomes the 'oil' of this new era, enterprises must strike a balance between self-reliance and open collaboration—fortifying the 'small courtyard's' security while opening 'high wall' interconnection channels. Only then can they secure a commanding position in the global intelligent revolution.