From Demo to Industrial Application: Embodied AI Will Ruthlessly Eliminate 99% of Teams, Says Zhou Erjin of Yuanli Lingji

04/10 2026

After the hype surrounding large models, the next major battleground in AI is indisputably embodied AI.

On this red-hot track, Yuanli Lingji, established just a year ago, has taken a somewhat “unconventional” pragmatic approach. While many teams are keen on showcasing flashy demos of robots performing cutting-edge actions, the founding team, with a background at Megvii and extensive experience in large model development and AI commercialization, has delved into the most challenging foundational infrastructure.

Over the past year, Yuanli Lingji has delivered five key solutions: the embodied native large model DM0, the open-source development framework Dexbotic 2.0, the mass production workflow DFOL, the real-world evaluation platform RoboChallenge, and open-source hardware. Their goal is clear: to transition embodied AI from “lab-only demos” to “real-world scenarios, commercial closed loops, and continuous iteration” on an industrial scale.

This focus on practical implementation aligns closely with the background of co-founder Zhou Erjin. Admitted to Tsinghua University's Department of Electronic Engineering in his third year of junior high, Zhou led the development of FaceID, the industry's first financial-grade identity authentication cloud service, as an early member of Megvii Research in 2013—one of the earliest and most successful large-scale commercial applications of AI vision technology. Now, armed with experience in achieving large-scale commercial closed loops, he oversees embodied model and framework development at Yuanli Lingji, tackling the industry's most fundamental challenges.

Facing the current hype in the embodied AI sector, Zhou Erjin offers three sobering judgments:

First, the value of real-world data is unshakable. The watershed in embodied AI lies in who can first deploy robots at scale, vigorously turning the data flywheel.

Second, the “ChatGPT moment” for embodied AI is not about flashy tricks but about “out-of-the-box generalization.” This means robots can reliably perform basic actions in unfamiliar scenarios and with unfamiliar objects.

Third, the extremely high barriers between a demo and a truly closed loop in real scenarios “will ruthlessly eliminate 99% of participants.” This journey is filled with irreducible time costs that cannot be bypassed by computational power alone.

Building on these assertions, we engaged in an in-depth dialogue with Zhou Erjin. Let’s see how this entrepreneur, blending top-tier algorithmic thinking with seasoned practical experience, deconstructs the industry's key variables:

How should expensive real-world data and vast amounts of first-person (human) data be strategically deployed?

When will the long-awaited “ChatGPT moment” for embodied AI finally arrive?

Where do the true breakthroughs lie in solving the industry-wide “generalization” challenge?

Why is transitioning robots from demo to real-world scenarios far more difficult than imagined—by a factor of ten thousand?

Yuanli Lingji co-founder Zhou Erjin at the 2026 Technology Open Day

Data and Generalization

Intelligence Evolution Theory: After more than a year in the embodied AI field, what do you see as the biggest changes in the industry?

Zhou Erjin: Confidence in data has grown stronger, with a broad consensus emerging on the need for large-scale real-world data deployment. Two years ago, proposing a 100,000-hour data collection effort would have been considered insane. Today, there’s general agreement that achieving true proficiency in embodied AI requires data on the scale of millions of hours.

Many studies, including our own experiments, show that as data volume increases, models demonstrate increasingly robust generalization capabilities.

Intelligence Evolution Theory: What about factors other than data?

Zhou Erjin: There are many, but the core priority is to scale up data first; other aspects will follow. If data volume increases, model capacity must also expand to accommodate it. Consequently, we’re seeing a gradual rise in model parameter counts.

Intelligence Evolution Theory: Some argue that reliance on real-world data may not be necessary due to its scarcity. What’s your view?

Zhou Erjin: That’s a temporary state. For large-scale deployment, real-world data is indispensable. Take autonomous driving: no one trains self-driving systems on human cycling data. You need to build machines, deploy them at scale, and let robots generate their own data.

Intelligence Evolution Theory: Where does embodied AI currently stand in its development?

Zhou Erjin: It’s still early, but different from one or two years ago when simply performing a stable action was impressive. Our goal is to achieve out-of-the-box basic actions with scenario and object generalization by year-end.

Mass deployment of real-world robots will mark the next major milestone. Now, the focus is on ensuring foundational models deliver guaranteed accuracy and generalization. Teams achieving this can deploy robots at scale for reinforcement learning, widening the gap with those still collecting data in lab environments.

Intelligence Evolution Theory: What scale of deployment are you targeting?

Zhou Erjin: Current data collection typically involves dozens to hundreds of units. For real-world reinforcement learning, it’s often just a dozen to 20 units, and even fewer in labs. We aim for scales ranging from hundreds to thousands.

Intelligence Evolution Theory: When might the “ChatGPT moment” for embodied AI arrive, and what will mark it?

Zhou Erjin: Definitions vary. I see it as out-of-the-box usability with scenario generalization and guaranteed accuracy. Your model and hardware should work not just in your lab but anywhere.

This could involve simple tasks like pick-and-place (moving objects from point A to B), which already address many problems. Achieving error-free performance across diverse scenarios and objects would represent a significant leap beyond existing models.

Intelligence Evolution Theory: Could you clarify “out-of-the-box” with an example? Would downstream manufacturers simply deploy your models?

Zhou Erjin: We’ll initially bind our models to our own hardware, as our proprietary hardware and algorithms are optimally aligned. On our platforms, we aim for out-of-the-box basic actions. We break down generalization into four dimensions: objects, scenarios, tasks, and hardware configurations, with difficulty increasing sequentially. Our primary focus is on the first three. Hardware adaptation involves a learning process—like a large model unfamiliar with German needing fine-tuning with German data to become proficient.

Intelligence Evolution Theory: What’s overhyped and what’s underestimated in the industry?

Zhou Erjin: Compared to large models, embodied AI isn’t overhyped—if anything, it lacks sufficient attention. Expectations for general-purpose robots remain consistently underestimated.

I believe data investment is still insufficient. Many large-scale data efforts originate overseas, proving their viability, while we follow. The industry needs greater, more resolute commitment to data investment.

Intelligence Evolution Theory: What challenges does the industry face with real-world data?

Zhou Erjin: Balancing cost, scale, and precision is key. When you scale up while keeping costs under control, it becomes hard to maintain high precision at the same time.

For broad scenario generalization (e.g., human or egocentric data), precision is low but volume is high. For precise execution of specific actions, teleoperation captures high-precision joint motor signals—but at the cost of scalability.

Intelligence Evolution Theory: What constitutes high-quality data for training embodied models?

Zhou Erjin: It depends on your goals—address deficiencies directly.

Intelligence Evolution Theory: You use three data types: multimodal internet data, driving behavior data, and embodied multi-sensor data. Will real-world data’s share keep growing?

Zhou Erjin: Absolutely. Real-world data is the biggest variable for improving model quality. Robot data collection is just beginning to scale up, and volume can be increased rapidly.

Intelligence Evolution Theory: You mentioned “investing data where entropy lies.” Who judges entropy: humans or systems?

Zhou Erjin: Ultimately, we aim for automated system feedback. Areas prone to errors—high information density—should be prioritized for resource allocation. Once resolved, they become trivial (like elementary math problems). But current baselines are low, so human judgment suffices for now.
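
One way to picture automated feedback of this kind is a budget allocator driven by per-task uncertainty. The following is a minimal sketch under stated assumptions, not Yuanli Lingji's method: the task names, success rates, and the `allocate_budget` helper are all hypothetical, and binary outcome entropy is used only as a stand-in for "where entropy lies". Note that this proxy gives nothing to a task that fails every time, so a real system would likely combine it with failure rate or model uncertainty.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a success/failure outcome with success rate p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def allocate_budget(success_rates: dict[str, float], total_hours: float) -> dict[str, float]:
    """Split a data-collection budget across tasks in proportion to outcome entropy."""
    entropies = {task: binary_entropy(p) for task, p in success_rates.items()}
    total = sum(entropies.values())
    if total == 0.0:
        # Every task is deterministic (always succeeds or always fails): split evenly.
        return {task: total_hours / len(success_rates) for task in success_rates}
    return {task: total_hours * h / total for task, h in entropies.items()}


# Hypothetical measured success rates per task (illustrative numbers only).
rates = {"pick_box": 0.99, "pick_soft_bag": 0.50, "fold_towel": 0.80}
plan = allocate_budget(rates, total_hours=1000.0)
for task, hours in plan.items():
    print(f"{task}: {hours:.0f} h")
```

A nearly solved task such as the hypothetical `pick_box` receives almost no new budget, which matches the remark that resolved areas become trivial.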

Intelligence Evolution Theory: What does “omni-domain” mean in your “whole-body, full-time, omni-domain” data collection?

Zhou Erjin: It refers to collection scenarios and locations. If robots are to eventually perform most human daily tasks, their training should cover all areas where humans currently operate.

Intelligence Evolution Theory: How do you view first-person data collection?

Zhou Erjin: It’s highly valuable. First-person perspective is a major focus for us this year. Before large-scale robot deployment, it offers a cost-effective way to capture diverse actions across scenarios.

Models and Closed Loops

Intelligence Evolution Theory: What’s your take on the VLA (Vision-Language-Action) pathway?

Zhou Erjin: Current embodied training transfers knowledge from existing systems to the physical world. VLA builds on internet-pretrained VLMs, adding action or robot data to acquire physical skills. However, this route may have a visible ceiling: it is like a child raised solely on books until age 10 before learning to play soccer, which limits athletic development. That’s why our DM0 model adopts an embodied-native VLA approach, training on internet and robot data jointly from day one, like a child who reads and plays sports equally from the start.

Intelligence Evolution Theory: Is this the core of your embodied-native approach? How is it implemented?

Zhou Erjin: The key lies in training methodology. We train our VLM from scratch, incorporating a multi-task paradigm called the “physical spatial reasoning chain of thought.”

Intelligence Evolution Theory: How does the spatial reasoning chain work?

Zhou Erjin: Like humans, it involves task decomposition (e.g., cleaning a room: first sweep, then mop), object localization (finding the broom), and subconscious motion planning (walking to the broom before sweeping). We want the model to generate motion trajectories, represented as lines or 3D paths, enabling robots to understand and execute complex tasks like humans.

Intelligence Evolution Theory: Will the spatial reasoning chain evolve this year to achieve higher generalization?

Zhou Erjin: Yes, it will become more complex. Robots will need to interpret deictic references (e.g., gestures, spatial pronouns like “there”)—common in daily life but currently challenging for AI. Expanding training modes to include such references will be crucial.

Intelligence Evolution Theory: What barriers does the spatial reasoning chain face?

Zhou Erjin: Methods are open-source, so they’re not barriers. The core lies in data and understanding embodied tasks. Without hands-on robot experience, it’s hard to identify pitfalls or error-prone areas, which guide our training workflows.

Intelligence Evolution Theory: What’s the toughest challenge for embodied large models: generalization, memory, precision, or long-horizon tasks?

Zhou Erjin: Generalization remains paramount.

Intelligence Evolution Theory: Where do generalization bottlenecks lie?

Zhou Erjin: First, it requires vast data—hence the focus on human and egocentric data this year. Second, sensor diversity is critical. Many tactile sensors for robots are still experimental, with data volumes far smaller than visual data. While pure vision may suffice for 60–70% accuracy, tasks like dishwashing demand 99% reliability—requiring multimodal sensors.

Intelligence Evolution Theory: What’s Yuanli Lingji’s primary strategy for boosting generalization this year?

Zhou Erjin: A sophisticated model architecture and training paradigm, built on diverse data. We’re pursuing both VLA and world models pragmatically, focusing on problem-solving rather than technological allegiance.

We will integrate VLA with world models to create a unified model capable of making two types of predictions simultaneously: one for determining the next action to take, and the other for forecasting how the world will change next. These two aspects are completely dual to each other.
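
To illustrate what "two types of predictions from one model" can look like structurally, here is a minimal toy sketch. It is not the DM0 architecture: the `UnifiedModel` class, the random linear layers, and the choice to feed the predicted action into the world head are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

Vector = list[float]


def linear(weights: list[Vector], x: Vector) -> Vector:
    """Multiply a weight matrix (one row per output) by an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows: int, cols: int, rng: random.Random) -> list[Vector]:
    """Build a small random weight matrix; stands in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


@dataclass
class UnifiedModel:
    """Hypothetical toy: one shared encoder feeding an action head and a world head."""
    encoder: list[Vector]
    action_head: list[Vector]
    world_head: list[Vector]

    @classmethod
    def create(cls, obs_dim: int, hidden_dim: int, action_dim: int, seed: int = 0) -> "UnifiedModel":
        rng = random.Random(seed)
        return cls(
            encoder=random_matrix(hidden_dim, obs_dim, rng),
            action_head=random_matrix(action_dim, hidden_dim, rng),
            # Assumption: the world head also sees the action it is conditioned on.
            world_head=random_matrix(obs_dim, hidden_dim + action_dim, rng),
        )

    def predict(self, observation: Vector) -> tuple[Vector, Vector]:
        hidden = linear(self.encoder, observation)
        action = linear(self.action_head, hidden)            # what to do next
        next_obs = linear(self.world_head, hidden + action)  # how the world changes next
        return action, next_obs


model = UnifiedModel.create(obs_dim=4, hidden_dim=8, action_dim=2)
action, next_obs = model.predict([0.1, 0.2, 0.3, 0.4])
print(len(action), len(next_obs))
```

Both outputs come from the same hidden representation, which is one simple reading of the two predictions being dual to each other.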

Intelligence Evolution Theory: In terms of model architecture, do you follow other companies' approaches or innovate on your own?

Zhou Erjin: We are currently exploring several areas independently, including memory, tactile perception, and the encoding forms of actions. However, we ultimately aim to integrate these into a single model.

Last year, we were the first to propose a memory-based approach in the VLA field, and many have since followed suit. Regarding the encoding forms of actions, everyone currently uses the Pi model. Are there other encoding forms that could make the training of actions and trajectories smoother?

Intelligence Evolution Theory: Will the DM0 model be upgraded this year? Will it continue the small-parameter route?

Zhou Erjin: Yes, we'll see. DM is a model series with a planned release cadence, and a new version comes out roughly every six months.

Intelligence Evolution Theory: We emphasize the high intelligence density of the DM0 model. How should this be understood?

Zhou Erjin: The blind pursuit of a large number of parameters, as if big equals excellence, is highly problematic. For robots, a large model implies issues with inference efficiency, which can of course be seen as merely a cost issue.

The core question is, is big truly excellent? Or, for models with 1B or 2B parameters, what is their actual ceiling? This issue has been overlooked. By releasing a model with a bit over one billion parameters, we aim to convey the idea that through meticulous data preparation and scientific training paradigms, we can even achieve better results than larger models.

Intelligence Evolution Theory: Has the DM0 model been applied in industries yet?

Zhou Erjin: The logistics industry is the first direction we will choose for business applications, and some clients are already conducting POC verifications.

Intelligence Evolution Theory: We emphasize the importance of a closed-loop model training process, running in real-world scenarios 24x7. What are the specific challenges in achieving this closed loop?

Zhou Erjin: Embodied models do not have their intelligence locked in at the moment of model training completion. Instead, they can only truly come alive and generate real data when deployed in real-world scenarios. This data can then be incorporated back into the training process to keep the flywheel turning.

The core issue is whether we can truly enter these scenarios. The final step will actually screen out 99% of the teams. Teams that have not experienced the complete commercialization and deployment of AI products, or have not worked in factories, will never realize the numerous challenges hidden beneath the surface of a seemingly fully closed-loop scenario where robots are utilized 24x7.

For example, have you ever interfaced with a factory's operating system? Have you ever modified its production line? What do you do when a robot makes a mistake? Without considering these issues, no matter how well the demo performs, the final step will always remain unattainable.

Intelligence Evolution Theory: How do we approach this?

Zhou Erjin: We have been deploying algorithms at Megvii for over a decade. We are very clear about the numerous pitfalls involved, the kind of deployment team required, how to interface with clients' business systems, and the fact that what we should deliver is a comprehensive solution rather than just a single model or robot.

The reason we chose logistics is that we have a strong customer base, and in many logistics scenarios, we and our partners possess the capability to modify production lines.

Intelligence Evolution Theory: What is the most difficult challenge in overcoming this step?

Zhou Erjin: There are many things you will never be capable of unless you have experienced them. There is a significant element of time that cannot be compressed. So I do not accept the view that going from algorithms to a demo is the 0-to-1 step and going from demo to factory deployment is just the 1-to-100 step. It is a much more complex journey. Moving atoms is far more difficult than moving bits. Having fully experienced every stage of entrepreneurship is also our barrier and our advantage.

Intelligence Evolution Theory: Currently, in logistics scenarios, which real-world delivery scenarios have been successfully implemented?

Zhou Erjin: For example, material sorting. A typical task involves picking items from bins, sorting them, and completing packaging. The first thing we do is break it down into many roles and steps, with the first step being material screening. That requires moving items from one place to another, so pick-and-place is a crucial technical capability.

Intelligence Evolution Theory: Is our current process fully automated?

Zhou Erjin: Saying it's fully automated is not rigorous. Saying my model is 99% there in logistics scenarios would be boasting. Our solution is a comprehensive one with fallback plans to ensure that if something goes wrong, your production line won't stop. Within logistics scenarios, as the data flywheel turns, accuracy will continue to improve, and cost savings will accumulate.

Intelligence Evolution Theory: What are the imaginative spaces for combining OpenClaw (Lobster) with embodied intelligence? Will it be one of the future directions?

Zhou Erjin: Lobster is an excellent direction that has completely opened up everyone's imagination about large models. However, using Lobster to operate robots is not particularly popular today. Lobster acts as a brain, capable of excellent task planning and issuing instructions, but if the embodied side cannot execute, there's nothing it can do. More importantly, the success rate of the robot body in performing simple low-level tasks needs to improve. Once it does, combining it with the cloud-based Lobster might become much more popular.

Framework and Workflow

Intelligence Evolution Theory: What advantages does the Dexbotic open-source framework have in the industry?

Zhou Erjin: Many frameworks in the industry today come from a team publishing a good piece of work and then engineering that work's code for release. That open-sources only the specific work.

In embodied intelligence today, the VLM, the visual encoder, and the action expert that generates action sequences can each come from a different company's solution. From a more universal perspective, if you want to provide infrastructure similar to scaffolding, you shouldn't be tied to a specific model structure but rather give everyone ample choice. The design philosophy of Dexbotic is to encourage everyone to freely create their own experiments and structures. We have achieved better engineering decoupling, allowing different modules to be combined with each other.
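
The decoupling idea can be sketched with plain interfaces. The snippet below is an illustrative assumption and does not show Dexbotic's actual API: the `VisionEncoder`, `LanguageModel`, `ActionExpert`, and `Policy` names, and the toy implementations, are hypothetical. It only shows how a policy can be assembled from interchangeable modules.

```python
from dataclasses import dataclass
from typing import Protocol


class VisionEncoder(Protocol):
    def encode(self, image: list[float]) -> list[float]: ...


class LanguageModel(Protocol):
    def fuse(self, visual_features: list[float], instruction: str) -> list[float]: ...


class ActionExpert(Protocol):
    def act(self, context: list[float]) -> list[float]: ...


@dataclass
class Policy:
    """Composes three independently replaceable modules into one control step."""
    vision: VisionEncoder
    language: LanguageModel
    action: ActionExpert

    def step(self, image: list[float], instruction: str) -> list[float]:
        features = self.vision.encode(image)
        context = self.language.fuse(features, instruction)
        return self.action.act(context)


# Toy stand-ins; any object with the right method can be swapped in.
class MeanPoolEncoder:
    def encode(self, image: list[float]) -> list[float]:
        return [sum(image) / len(image)]


class LengthConditionedVLM:
    def fuse(self, visual_features: list[float], instruction: str) -> list[float]:
        return visual_features + [float(len(instruction.split()))]


class ScalingActionExpert:
    def __init__(self, gain: float) -> None:
        self.gain = gain

    def act(self, context: list[float]) -> list[float]:
        return [self.gain * c for c in context]


policy = Policy(MeanPoolEncoder(), LengthConditionedVLM(), ScalingActionExpert(gain=0.5))
command = policy.step([0.2, 0.4, 0.6], "pick up the box")
print(command)
```

Replacing `ScalingActionExpert` with another class that has an `act` method changes the action head without touching the encoder or the language model.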

Intelligence Evolution Theory: How is the Dexbotic framework being used currently?

Zhou Erjin: We have been receiving feedback from users on GitHub, with many suggestions for improvements. We believe there is a clear need for such a framework.

The framework is just the first step; we have also open-sourced our own hardware. Combining the framework with the hardware, Dexbotic provides a complete platform for the entire process, from data collection to model training and finally redeployment back to the machine. Many of our university clients and partner companies need such a comprehensive platform.

Intelligence Evolution Theory: Are there any similar frameworks to Dexbotic in the industry?

Zhou Erjin: Very few; we are quite unique, not only in terms of the framework but also in its integration with our own hardware. Last year, we collaborated deeply with Tsinghua University and the RLinf team from Wuxian Xinqiong, enabling one-click completion of everything from imitation learning to reinforcement learning on Dexbotic based on RLinf. From a functionality completeness perspective, we are currently unique.

Dexbotic has already served dozens of institutions and over a thousand developers since its launch

Intelligence Evolution Theory: What value does the DFOL flexible production workflow generate?

Zhou Erjin: It's about closing the loop. It's a standardized infrastructure that tightly couples algorithm training, data updates, and data collection and cleaning.

After deployment at the client's site, the model operates and generates high-quality data that flows back to the cloud. Automated infrastructure completes model iteration, and the improved model is then sent back to the client's side for further feedback and to trigger more data collection. From the perspective of model iteration, it's about quickly getting the data flywheel turning and boosting the efficiency of model iteration. Some of our core logistics clients are already using it.
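
The loop described above (deploy, collect, retrain, redeploy) can be simulated in a few lines. This is a toy sketch, not DFOL itself: the `Model`, `Cloud`, and `flywheel` names are hypothetical, the starting success rate is invented, and the rule that each logged failure closes 1% of the remaining accuracy gap is an arbitrary assumption chosen only to make the loop visible.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Model:
    version: int = 0
    success_rate: float = 0.70  # hypothetical starting accuracy


@dataclass
class Cloud:
    dataset: list[bool] = field(default_factory=list)

    def retrain(self, model: Model, episodes: list[bool]) -> Model:
        """Absorb returned logs and produce the next model version."""
        self.dataset.extend(episodes)
        failures = episodes.count(False)
        # Toy assumption: each failure fed back closes 1% of the remaining gap to 100%.
        gap = 1.0 - model.success_rate
        new_rate = 1.0 - gap * (0.99 ** failures)
        return Model(version=model.version + 1, success_rate=new_rate)


def run_on_site(model: Model, episodes: int, rng: random.Random) -> list[bool]:
    """Deploy the model and log whether each episode succeeded."""
    return [rng.random() < model.success_rate for _ in range(episodes)]


def flywheel(rounds: int, episodes_per_round: int, seed: int = 0) -> list[Model]:
    rng = random.Random(seed)
    cloud = Cloud()
    model = Model()
    history = [model]
    for _ in range(rounds):
        logs = run_on_site(model, episodes_per_round, rng)  # deploy and collect
        model = cloud.retrain(model, logs)                   # data flows back, model iterates
        history.append(model)                                # improved model is redeployed
    return history


for m in flywheel(rounds=5, episodes_per_round=200):
    print(f"v{m.version}: success rate {m.success_rate:.3f}")
```

In this toy, the gains shrink as failures become rarer, which is why the speed of each loop matters as much as the size of each data batch.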

Intelligence Evolution Theory: Other vendors in the industry are also conducting real machine evaluations. How does RoboChallenge maintain its industry-leading position in real machine evaluations?

Zhou Erjin: We develop both algorithms and hardware ourselves, so we are at the forefront, pushing for a scientific evaluation mechanism that we truly need. Evaluation is handled by a dedicated team within our organization, with importance on par with algorithm training. We periodically update the evaluation process, akin to an attack-defense approach where we replace old questions with new ones.

Intelligence Evolution Theory: What is the evolution direction of RoboChallenge this year?

Zhou Erjin: Generalization is a key focus of this year's evaluations. The previous Table30 test set did not actually measure generalization; it still involved completing tasks under specific conditions. But can you still accomplish the task if I change the objects to be grasped?

Second, we are gradually moving from desktops to larger spaces, from grasping to movement to full-body control. Evaluating the complexity of the entire robot's movements is another dimension.

Intelligence Evolution Theory: How do you measure generalization?

Zhou Erjin: Returning to our definition of generalization, it involves constantly changing different operation objects, scenarios, and tasks. It means the tasks you trained on are not exactly the same as the tasks I give you for testing.
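
A simple way to report this kind of measurement is to bucket evaluation trials by which dimensions were absent from training. The sketch below is an assumption for illustration, not RoboChallenge's scoring code: the `Trial` record, the `novelty` labels, and all trial data are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Trial(NamedTuple):
    obj: str
    scene: str
    task: str
    success: bool


def novelty(trial: Trial, seen: dict[str, set[str]]) -> str:
    """Label a trial by the dimensions whose value never appeared in training."""
    dims = (("object", trial.obj), ("scene", trial.scene), ("task", trial.task))
    new = [dim for dim, value in dims if value not in seen[dim]]
    return "+".join(new) if new else "in-distribution"


def success_by_novelty(trials: list[Trial], seen: dict[str, set[str]]) -> dict[str, float]:
    """Compute the success rate separately for each novelty label."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for trial in trials:
        buckets[novelty(trial, seen)].append(trial.success)
    return {label: sum(results) / len(results) for label, results in buckets.items()}


# Hypothetical training coverage and evaluation trials.
seen = {"object": {"cup", "box"}, "scene": {"lab_table"}, "task": {"pick_place"}}
trials = [
    Trial("cup", "lab_table", "pick_place", True),
    Trial("box", "lab_table", "pick_place", True),
    Trial("bottle", "lab_table", "pick_place", True),
    Trial("bottle", "lab_table", "pick_place", False),
    Trial("cup", "kitchen", "pick_place", False),
    Trial("bottle", "kitchen", "stack", False),
]
report = success_by_novelty(trials, seen)
for label, rate in report.items():
    print(f"{label}: {rate:.2f}")
```

Reporting the buckets separately keeps a strong in-distribution score from hiding weak performance on new objects, scenes, or tasks.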

Table30 V2 task set

Growth and Vision

Intelligence Evolution Theory: After more than a year in the embodied intelligence field, what is your biggest personal takeaway?

Zhou Erjin: The physical world is an extremely complex environment where algorithms and hardware are highly coupled, making it much more complex than purely working on models. I believe an open mindset is essential, fully absorbing knowledge from cross-disciplinary fields. Since each field has its own experts, it must be a team collaboration effort.

Intelligence Evolution Theory: Based on your own experience, what can be transferred over, and what needs to be relearned?

Zhou Erjin: We come from a visual background and have relatively rich experience in model training. Later, we transitioned from vision to text and multimodality, covering the entire chain. Over the past year and a half, we have all gotten our hands dirty repairing machines, deploying models, and watching robots collect data. Throughout this process, we have learned a great deal about hardware-related knowledge.

Intelligence Evolution Theory: What has remained constant for you from your school days, through your time at Megvii, to now?

Zhou Erjin: The pursuit of technical excellence and a curiosity for the unknown are probably the most fundamental aspects. There have been many setbacks along the way, but when I look back, there's still something that gets me excited every day when I wake up.

One is the drive to take things to the extreme, questioning why something isn't working as it should theoretically and striving to make it happen. The second is curiosity; whether it's embodied intelligence or large models, we still face many new challenges today, and curiosity drives us to experiment.

Intelligence Evolution Theory: Were you always particularly interested in computers since childhood, or was there something that inspired you?

Zhou Erjin: I dedicated a significant amount of effort to information science competitions during my student years, and the impact of these competitions has been profound.

First, I find it particularly fascinating to apply logical thinking to analyze problems, break down complex projects, and implement them in code within a limited time frame.

Secondly, during the competitions, you are exposed to many open problems in computer science, greatly expanding your horizons and revealing how many interesting things are out there. So, starting from these competitions, I developed a series of interests and focus points that gravitated towards the field of computer science.

Intelligence Evolution Theory: With AI impacting industries, including layoffs at tech giants, many young people feel lost and anxious. What advice do you have for them?

Zhou Erjin: Young people don't want to hear advice. I'm still young myself, so I can only share my own thoughts. Throughout my life, I have basically always been doing things that genuinely interest me at a deep level. Because there are so many challenges along the way, to persist, you must truly be interested in what you're doing. You also need to keep learning new things and not let your vision and understanding stagnate.

Intelligence Evolution Theory: What is your vision for embodied intelligence, or what is the ideal scenario you envision?

Zhou Erjin: One of my visions is to see the emergence of robots with generalized social identities, which is quite exciting.

Today, people's expectations for robots are more functional; they want to know what tasks the robot can perform for them. If a robot has its own ID, Alipay account, and phone number, it has been granted a virtual social identity to some extent.

When robots have their own identities, many infrastructure elements can be built for robots rather than just for humans. It is like roads: they were not built until cars became widespread, and now the world has roads of every kind.

Human-robot symbiosis is not just about functional aspects; it's about a more complex social level, encompassing everything from the entire society's infrastructure to the relationship between humans and robots, as well as the rights and interests of robots themselves.

END

This is an original work by "Intelligence Evolution Theory." Please follow us for more content.
