No Consensus in Embodied AI: The Best Path Forward

November 26, 2025

In the nascent stages of any technological revolution, there's always a rush to identify the one true path—a singular solution that promises to cut through uncertainty. However, the intricate landscape of embodied AI is teaching the industry a valuable lesson: progress is not carved from a single block of marble but is rather sculpted through a myriad of trials, conflicts, and resolutions. Far from being flaws, imperfect models, incomplete datasets, and diverse architectures are the very elements that infuse embodied AI with its most genuine vitality.

Editor: Lv Xinyi

As anticipated, embodied AI is charging ahead with formidable momentum as we approach the end of 2025.

What's perhaps even more predictable is the persistent lack of consensus within the field.

At the 2025 Zhiyuan Embodied OpenDay Roundtable Forum, leading domestic experts in embodied AI engaged in a spirited debate, sharing their diverse perspectives. Whether discussing model architectures or data utilization, no unified direction emerged from the discussions. Many participants expressed disappointment over the continued absence of a consensus in embodied AI.

However, the Embodied AI Research Society views this lack of consensus as a sign of promise, suggesting that technology may advance in unexpected ways. After all, a clear path can sometimes be limiting. When we let go of the need for certainty, certain trends start to emerge. Perhaps the absence of consensus is, in itself, a form of consensus.

Image Source: Zhiyuan Institute

From an industry standpoint, the absence of consensus carries three positive implications:

Firstly, it breaks the monopolistic discourse of a single technological path, preventing the industry from falling into the 'path dependency' innovation trap. In embodied AI, from the divergence in technological paths (e.g., 'layered architecture vs. end-to-end') to the choices in implementation (e.g., 'general-purpose humanoid robots vs. scenario-specific embodied AI'), the lack of consensus provides teams with different technological philosophies and academic backgrounds equal opportunities for experimentation and learning.

Secondly, consensus in mature industries often comes with high barriers to entry. The 'no consensus' state of embodied AI offers small and medium-sized enterprises, start-up teams, and even cross-industry players opportunities to leapfrog competitors. Without the need to adhere to existing technological standards or business rules, new entrants can leverage their differentiated advantages to enter the market.

Thirdly, as a cross-disciplinary field, the technological foundations of embodied AI are still rapidly evolving. Forming consensus too early could solidify technological paths and limit the industry's potential for breakthroughs to higher dimensions. The core value of the 'no consensus' state lies in preserving 'flexibility' for technological iteration.

At the Zhiyuan Embodied OpenDay Roundtable Forum, the abundance of 'no consensus' discussions also revealed more possibilities. Based on the guests' responses, the Embodied AI Research Society identified five key signals in embodied AI, with the directions for future development perhaps hidden within these signals.

Models Are Still Not Good Enough: Some Are Starting From Scratch

Signal 1: World Models Cannot Shoulder the Burden Alone

In discussions about models in embodied AI, the 'rising star' world model is an unavoidable topic.

Its core value lies in 'prediction': enabling robots, like humans, to predict the next change from the current spatiotemporal state and then plan actions. This capability was widely recognized by the roundtable guests. Wang He, an assistant professor at Peking University and founder of Galaxy General, used robot motion control as an example, pointing out that whether it is the legged locomotion and dancing of humanoid robots or the delicate manipulation of dexterous hands, the underlying control logic requires predictive capabilities for physical interaction, which world models can provide. However, for world models to truly serve robots, their training data must include more data from the robots themselves.
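As a concrete illustration of this predict-then-plan loop, here is a minimal sketch in Python. A stubbed `world_model` stands in for a learned dynamics model, and a simple random-shooting planner scores candidate action sequences by where the model predicts they end up. All names and the additive toy dynamics are hypothetical, not any team's actual system.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    """Hypothetical learned dynamics model predicting the next state.
    Stubbed here as simple additive dynamics purely for illustration."""
    return state + 0.1 * action

def plan_action(state, goal, horizon=5, n_candidates=64):
    """Random-shooting planner: roll candidate action sequences through the
    world model and return the first action of the best-scoring sequence."""
    best_cost, best_first = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, state.shape[0]))
        s = state.copy()
        for a in actions:                       # predict forward, step by step
            s = world_model(s, a)
        cost = float(np.linalg.norm(s - goal))  # predicted endpoint's distance from goal
        if cost < best_cost:
            best_cost, best_first = cost, actions[0]
    return best_first, best_cost

state, goal = np.zeros(3), np.ones(3)
action, predicted_cost = plan_action(state, goal)
print(action.shape)  # (3,)
```

Real systems replace the stub with a model trained on interaction data, which is exactly why the quality of that data matters so much.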

Yet, the shortcomings of world models are equally prominent, making it difficult for them to serve as a 'universal solution' for embodied AI alone. Wang He emphasized that many current world models rely on human behavior videos for training, but the physical structures of robots (such as wheeled chassis and multi-degree-of-freedom robotic arms) differ greatly from humans, limiting the practical help of such data for robot operations. Cheng Hao, founder and CEO of Accelerated Evolution, also mentioned that in real-world scenarios like cooking and complex assembly, the predictive accuracy of world models is still insufficient, necessitating the use of layered models to solve simple tasks first before gradual iteration and upgrading.

Signal 2: Models Need to 'Start From Scratch'

Since existing models struggle to meet demands, 'building dedicated models for embodied AI' has become a consensus among many companies.

Zhao Xing, an assistant professor at the Institute for Interdisciplinary Information Sciences at Tsinghua University and CTO of Xinghaitu, stated that embodied AI requires a 'Large Action Model' parallel to large language models, with 'actions' as the core rather than language. He explained that the evolution of human intelligence follows 'actions first, then vision, and finally language.' For robots to adapt to the physical world, they should follow a similar logic—for example, when driving, humans rely on vision to observe road conditions and actions to control the steering wheel, with language not involved in core operations. Embodied models should also prioritize closing the 'vision-action' loop.

Wang Qian, founder and CEO of Zivariable, offered a more specific viewpoint, arguing that embodied AI needs a 'physical world foundational model' capable of both controlling robot actions and predicting physical laws, as a world model does. While multimodal models in virtual worlds are trained on text and images, the subtle processes of friction, collision, and force feedback in the physical world cannot be accurately described by language. When a robot grasps an egg, it needs to perceive the fragility of the eggshell and adjust its grip strength accordingly; such understanding of physical properties must come from models trained specifically on the physical world.
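The egg-grasping example can be made concrete with a toy control loop. The sketch below, with entirely hypothetical thresholds and a made-up `adjust_grip` helper, shows the kind of force-feedback logic involved: tighten while the object slips, but never exceed the object's inferred fragility limit.

```python
def adjust_grip(force_reading, slip_detected, current_force,
                fragility_limit, step=0.05):
    """One step of a hypothetical force-feedback grip loop: tighten while the
    object slips, but never exceed the fragility limit inferred for the
    object (e.g. an eggshell). All values are illustrative, in arbitrary units."""
    if slip_detected and current_force + step <= fragility_limit:
        return current_force + step            # tighten gradually
    if force_reading > fragility_limit:
        return max(0.0, current_force - step)  # back off before the shell cracks
    return current_force                       # hold steady

# Simulated grasp: the object slips until grip reaches 0.3; the shell cracks at 0.5
force = 0.1
for _ in range(10):
    slip = force < 0.3
    force = adjust_grip(force_reading=force, slip_detected=slip,
                        current_force=force, fragility_limit=0.5)
print(round(force, 2))  # 0.3
```

The point of the example is that no amount of language describes where 0.3 and 0.5 sit for a given egg; that knowledge has to come from physical interaction data.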

Signal 3: A Revolution in the Underlying Architecture

Over the past few years, the Transformer architecture has supported the explosion of large language models like ChatGPT with its cross-modal processing capabilities. However, its applicability in embodied AI is being questioned. Zhang Jiaxing, Chief AI Scientist at China Merchants Group, represents this viewpoint, stating bluntly that 'embodied AI cannot follow the path from LLM to VLM.'

In his view, the Transformer architecture is language-centric, mapping visual, action, and other modalities to language, which contradicts the operational logic of the physical world—when humans perform actions, visual perception directly guides muscle movement without language 'translation.' He revealed that leading teams in Silicon Valley are already exploring new architectures like 'Vision First' or 'Vision Action First,' enabling direct interaction between vision and actions to reduce the loss from language intermediaries.

Wang He added that while the Transformer, as a cross-modal attention mechanism, is versatile, handling text, video, and audio alike, the challenge lies in embodied scenarios. 'Today, the issue with embodied AI is that humans have eyes, ears, mouths, noses, and tongues: so many senses. While attention mechanisms can tokenize these senses and feed them into the Transformer, the output is not ideal. The fundamental challenge lies in data issues and the corresponding learning paradigms.'
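Wang He's point about tokenizing many 'senses' can be sketched as follows: each modality is projected into a shared embedding width, and the resulting tokens are concatenated into one sequence for attention to operate over. All dimensions and projection matrices below are invented for illustration; a real system would learn them end to end.

```python
import numpy as np

rng = np.random.default_rng(42)
D = 64  # shared embedding width all modalities are projected into

def tokenize(features, proj):
    """Project one modality's raw feature vectors into the shared token space."""
    return features @ proj

# Hypothetical per-modality projection matrices (learned in a real system)
projections = {
    "vision":  rng.normal(size=(512, D)),  # e.g. image patch features
    "audio":   rng.normal(size=(128, D)),  # e.g. spectrogram frames
    "tactile": rng.normal(size=(16, D)),   # e.g. fingertip pressure readings
}

# Hypothetical raw sensory streams of differing lengths and widths
streams = {
    "vision":  rng.normal(size=(196, 512)),  # 196 image patches
    "audio":   rng.normal(size=(50, 128)),   # 50 audio frames
    "tactile": rng.normal(size=(10, 16)),    # 10 tactile samples
}

# Concatenate all modality tokens into a single sequence for the Transformer
tokens = np.concatenate(
    [tokenize(streams[m], projections[m]) for m in streams], axis=0
)
print(tokens.shape)  # (256, 64)
```

Mechanically this works for any number of senses, which is Wang He's point: the tokenization is not the bottleneck; whether the model learns anything useful from such sequences depends on the data and the learning paradigm.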

Wang He proposed that in the short term, simulation and synthetic data are core means to accelerate exploration speed; in the long term, the scale of humanoid robots in the real world must continue to expand rapidly. Only with a sufficiently large 'robot population' and mutual promotion of capability improvements can truly powerful embodied large models emerge.

This mismatch at the level of the underlying architecture has made the industry realize that breakthroughs in embodied AI may require a revolution at the architectural roots rather than patching up existing frameworks.

Data Remains a Bottleneck: The Appetite Keeps Growing

Signal 4: No Perfect Data, Only Adaptive Choices

'Data is the fuel for embodied AI': this was a consensus at the roundtable forum, but there was no unified answer on what data to use. Given that different data types have their own strengths and weaknesses, companies generally adopt a strategy of multi-source fusion and need-based selection, matching the most suitable data sources to each task scenario.

Real machine data is the most 'authentic' choice, directly reflecting the interaction laws of the real physical world, which makes it the preferred option for delicate manipulation scenarios. Zhao Xing's team at Xinghaitu insists on collecting data in real-world scenarios, viewing authenticity and quality as the starting points for real robot data collection. Luo Jianlan, Partner and Chief Scientist at Zhiyuan Robotics, also emphasized that Zhiyuan Robotics adheres to real data and, rather than relying solely on data collection factories, insists on real-world scenarios, exploring a path of building a data flywheel through autonomous robot data generation.

Simulation data, with its advantages of low cost and scalability, has become the mainstay for training low-level control. Wang He believes that in reinforcement learning, many extreme scenarios (such as robot falls and robotic arm overloads) are difficult to test repeatedly on real machines, while simulators can quickly generate large amounts of such data to help models learn coping strategies. In his view, simulators are not a denial of the real world but a starting point: they provide embodied AI companies with a solid base controller to kickstart the data flywheel in the real world.

Cheng Hao's team at Accelerated Evolution adopts a similar strategy, first using simulation data to enable robots to acquire basic motion control capabilities, then fine-tuning with real machine data to adapt to real-world scenarios. 'Our goal in training with simulation data is to enable robots to obtain more real data next, as only with real data can their overall capabilities improve further.' In Cheng Hao's view, this is likely a spiral upward process.
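The 'spiral upward' process Cheng Hao describes can be written down as a purely illustrative skeleton: cheap simulation data pretrains basic control, deployment yields a small amount of real data, and fine-tuning on it enables the next round. The `train` function below is a placeholder that only records what happened, not a real optimizer.

```python
def train(model, data, steps):
    """Placeholder optimization step; a real system would run SGD here.
    We just record how many updates ran and on which data source."""
    return {**model,
            "updates": model["updates"] + steps,
            "sources": model["sources"] + [data["source"]]}

def spiral(model, rounds=3):
    """Sketch of the 'spiral upward' loop: simulation pretraining enables
    real-world deployment, which yields real data for fine-tuning."""
    for _ in range(rounds):
        sim_data = {"source": "sim", "hours": 1000}   # cheap and scalable
        model = train(model, sim_data, steps=100)     # pretrain basic control
        real_data = {"source": "real", "hours": 10}   # collected by deployed robots
        model = train(model, real_data, steps=10)     # fine-tune for the real world
    return model

model = spiral({"updates": 0, "sources": []})
print(model["updates"])  # 330
```

The hour figures and step counts are made up; the structure is the point: each round of real deployment feeds the next round of training.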

Video data has become an important supplement for foundational model training. Wang Zhongyuan, Dean of the Zhiyuan Institute, believes that the logic of training foundational models with video data resembles how children learn about the world by watching videos on their phones: first learning from videos, then sharpening their skills through real-world interaction. Such video data contains multi-dimensional information, including spatiotemporal relationships, causality, and intentions, and can be acquired at scale, making it the 'best compromise' when massive real machine data is lacking. However, when the Embodied AI Research Society asked how to compensate for the fine-grained tactile and force control data that videos lack, Wang Zhongyuan acknowledged that videos indeed lack force feedback and tactile information, but argued this does not diminish their value. The Embodied AI Laboratory at the Zhiyuan Institute is currently equipped with collection devices for force feedback data. Video data is mainly used for 'laying the foundation,' with further optimization and fine-tuning done on other data types.

Signal 5: 'Quantity,' 'Quality,' 'Variety'—Embodied AI Companies Demand Data in All Aspects

As embodied AI penetrates into complex scenarios, the industry's demand for data is constantly upgrading, requiring not only large 'quantities' but also high 'quality' and greater 'variety,' forming an increasingly large 'data appetite.'

First is the craving for 'quantity,' with 'Internet-scale' data becoming a common expectation in the industry. For example, Zhao Xing believes that data scalability can drive model evolution and the realization of intelligence. Wang Zhongyuan also stated that 'better embodied large models may only emerge after a large number of robots solve specific problems in real-world scenarios and accumulate 'Internet-scale' embodied AI data.' In other words, without sufficient data, models are like undernourished children—unable to run fast or grow strong.

When the industry cheered for the 270,000-hour real machine dataset built by Generalist, seemingly touching the so-called scaling law, Wang Zhongyuan admitted to the Embodied AI Research Society that 'hundreds of thousands of hours of data still cannot be called massive data and are far from reaching the ChatGPT moment.'

Image Source: Zhiyuan Institute

Beyond 'quantity' lies the pursuit of 'quality,' with the view that 'high-quality data is more valuable than massive low-quality data' gradually becoming mainstream. Wang Qian believes that while data is important, it is not simply a case of 'the more, the better.'

In fact, language models have already demonstrated that merely stacking data scale does not necessarily yield the best results; high-quality, efficient data is the decisive factor. He believes that in embodied scenarios, data quality can make an order-of-magnitude difference compared with sheer data volume. Top-tier real machine data, though perhaps small in quantity, may serve as the foundation and the critical support beyond simulation and video data.

Ultimately, there arises a pressing need for 'diversity' in data, with the utilization of multimodal data growing increasingly crucial. As the application scenarios for robots continue to expand, relying solely on a single type of data is no longer sufficient to meet the evolving demands. Take, for instance, home service scenarios, where robots are required to process multiple types of information simultaneously. This includes visual data (for object recognition), auditory signals (for understanding instructions), tactile feedback (for perceiving the softness of objects), and force feedback (for controlling the force of actions). Currently, within the industry, the multimodal capabilities predominantly draw upon visual and linguistic abilities inherited from foundational large models. However, there is a scarcity of modalities that capture real physical interactions, such as tactile and force feedback.

This burgeoning demand for data diversity has also prompted the industry to recognize that future data collection must extend beyond merely recording 'what robots do.' It must also capture 'what happens in the environment,' 'what feedback interactions produce,' and 'what humans need.' Only with this holistic view can models gain a deeper comprehension of the physical world and of human requirements.

In the nascent stages of technological advancement, there are always those who seek a single correct path, hoping to navigate uncertainty with one strategic move. Yet the intricate complexity of embodied AI is reminding the industry that true intelligence does not emerge from a solitary trajectory; it is sculpted through a myriad of trials, conflicts, and reconciliations. Imperfect models, incomplete datasets, and non-uniform architectures, while seemingly flawed, are precisely the elements that imbue embodied AI with its most genuine and vibrant vitality.

Disclaimer: The copyright of this article belongs to the original author. It is reprinted only for the purpose of sharing information. If any author information is marked incorrectly, please contact us immediately so we can correct or delete it. Thank you.