Interview with Li Hongsheng from Daxiao Robotics: Instead of Waiting for Data, Why Not Create an Infinite 'Chinese Home Training Ground' for Robots

06/15 2026

Editor's Note: In the wave of embodied AI, there are always those who stand at the forefront, defining the direction of the flow. Xinghe Frequency launches a new interview series, 'Above the Wave.'

Focusing on key figures in the embodied AI industry, we share insights from technical turning points to business decisions, from product implementation to industrial outlook. We do not discuss vague trends but only record the thoughts, judgments, and actions that truly drive the wave.

Not chasing hot topics, but catching the crest of the wave, allowing more people to hear the next frequency of embodied AI first.

Author: Mao Xinru

For embodied AI, the most pressing bottleneck today is not that models are too small but that there is not enough data, and in particular not enough authentic data.

This issue becomes even more pronounced in home scenarios: a living room might simultaneously contain a coffee table, a robot vacuum, children's toys, and a pile of temporarily stacked clutter.

For robots, these seemingly trivial details are precisely what determine whether they can truly enter households.

This is why data collection in home scenarios has always been challenging.

On one hand, the spatial structures of real homes vary greatly; on the other hand, collecting and annotating 3D data with physical properties and interactive relationships is extremely costly and difficult to scale.

To address this challenge, Daxiao Robotics has introduced Kairos-HomeWorld, a fully interactive 3D world model for entire homes.

This is the world's first unified framework for a world model that enables full-house generation and full interaction with individual objects.

Under this framework, there is no need to exhaustively collect real home data. Instead, algorithms generate a virtual training ground for Chinese households that can be infinitely replicated and expanded, with every object interactive.

Currently, the project has open-sourced 300,000 real Chinese household floor plans and 5,000 complete, fully interactive home scenarios on GitHub, along with a self-developed four-stage hierarchical generation process, filling the gap in localized simulation for the industry.

Since being open-sourced, the project has drawn attention and usage requests from numerous universities and research institutions in China and abroad.

These numbers trace back to Li Hongsheng, the scientist behind Daxiao Robotics' embodied large model and a professor at the Multimedia Laboratory (CUHK MMLab) at the Chinese University of Hong Kong.

Following the open-sourcing of Kairos-HomeWorld, we conducted an in-depth interview with Li Hongsheng.

The conversation covered the paradigm shift triggered by large models, the design philosophy of Kairos-HomeWorld, the approach to balancing simulated and real-world data, and the future evolution of embodied AI brains.

As both a scholar and an industry practitioner, he has chosen a slower, more difficult path—but one he believes is truly impactful.

AI Moves from Virtual to Real: Embodied AI Reaches a Tipping Point in Engineering Implementation

Reviewing the development of AI over the past decade, the implementation of large models has completely rewritten the trajectory of AI evolution.

Before the widespread adoption of large models, traditional AI was long confined to perception and reasoning at the software level, excelling in processing virtual information such as text, images, and speech. It could complete numerous lightweight tasks but was entirely incapable of solving problems in the real physical world.

This significant gap between virtual intelligence and physical reality has driven a clear transformation in the AI sector over the past three years.

The technological focus has shifted from pure software reasoning to a fusion of software and hardware, marking AI's official transition from the virtual to the real world, empowering physical devices.

At the same time, the robot hardware supply chain has also matured after long-term accumulation, reaching a critical point of readiness.

Core hardware and technical challenges that previously hindered robot implementation, such as reducers, drive motors, and low-level motion control, have largely been overcome after a decade of industrial iteration. The stability and cost-effectiveness of hardware now support large-scale trials.

Thus, the embodied AI industry has seized an opportune moment: upper-layer large models inject general intelligence and decision-making capabilities into robots, while lower-layer hardware provides the physical carrier for intelligence implementation.

This bidirectional convergence—top-down intelligent empowerment and bottom-up hardware maturity—has transformed embodied AI from an academic concept into an engineering problem that is implementable, iterable, and optimizable.

Changes in industry trends have also influenced Li Hongsheng's career choices.

His academic research spans automation, computer vision, multimodal intelligence, and other fields, covering nearly the entire core knowledge system of today's embodied AI practitioners.

After years of deep academic involvement, he recognized the limitations of pure academic research.

"Tens of thousands of conference papers are published each year. As researchers, we pursue impactful work. Truly impactful work is something that can be implemented and actually works."

The segmentation task in computer vision is a classic example. Decades of academic accumulation in this field only achieved generalizable, tool-based implementation through Meta's SAM series models, truly empowering the entire industry.

For Li Hongsheng, truly valuable technological innovations must ultimately be implemented in real-world scenarios to solve real problems.

With this core objective in mind, he chose to join Daxiao Robotics. More than a decade of deep collaboration with the SenseTime team has fostered profound trust and tacit understanding between the two sides.

For Li Hongsheng, entering the industry was a natural extension of his personal academic research and technological ideals.

Full-House, Interactive, and Infinitely Generative: Filling the Gap in Localized Training Scenarios

The release of Kairos-HomeWorld directly addresses the severe shortage of training data for robots in the industry.

Publicly available simulated training environments generally suffer from limited scenarios and ineffective interactions. Most support only small-scale training in desktop or single-room settings, unable to cover entire home spaces.

These static simulation environments can only fulfill basic robot training needs for navigation and obstacle avoidance, failing to support core interaction tasks such as grasping, placing, and fine manipulation.

More critically, the industry has lacked an open-source training ground tailored to Chinese households, covering entire homes with fully interactive objects.

"We looked at what was available on the market. There were desktop scenarios and single-room scenarios, but nothing for full-house, interactive scenarios with Chinese household characteristics," Li Hongsheng recalled.

To address this gap, Daxiao Robotics chose to build a new training system from scratch, establishing three goals: covering complete home spaces, enabling interaction with all objects in the scenarios, and supporting infinite generation and iteration of scenarios.

The team developed a four-stage hierarchical generation architecture, abandoning black-box end-to-end generation in favor of a modular, decomposable, and adjustable engineering approach to create highly realistic home scenarios.

The first stage is floor plan structuring.

Based on real floor plan data, the overall framework of the house is locked, defining the number of rooms, functional zones, and the orientation of doors and windows. Following human spatial design logic, blueprints are finalized before constructing the scenario, ensuring the rationality and authenticity of the layout from the source.

The second stage involves full-house decoration and detail filling.

Details such as furniture dimensions, floor and wall materials, wallpaper styles, and lighting arrangements are meticulously restored.

Unlike the minimalist, uniformly white environments of laboratories, the diverse home decoration styles, lighting tones, and personalized soft furnishings in Chinese households directly impact a robot's visual recognition and task judgment.

Detailed scenario restoration is the foundation for adapting to local households.

The third stage involves automated conflict checking.

Rule-based algorithms comprehensively screen generated scenarios, automatically eliminating flaws such as furniture overlapping, blocked pathways, and unreasonable layouts, ensuring that each simulated space adheres to real-world human habitation logic and provides valid samples for subsequent robot training.
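
One of the checks named above, overlapping furniture, can be illustrated with a short sketch. This is not the team's implementation, which the article does not describe in detail. It assumes, purely for illustration, that each furniture item is reduced to an axis-aligned rectangle on the floor. The `Footprint` class, the function names, and the example room are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Footprint:
    """Hypothetical axis-aligned floor footprint of one furniture item, in metres."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def overlaps(a: Footprint, b: Footprint) -> bool:
    # Two rectangles overlap only if their extents overlap on both axes.
    # Strict inequalities let items touch edge to edge without a conflict.
    return (a.x_min < b.x_max and b.x_min < a.x_max
            and a.y_min < b.y_max and b.y_min < a.y_max)


def find_conflicts(items: list[Footprint]) -> list[tuple[str, str]]:
    """Return the names of every pair of items whose footprints overlap."""
    return [(a.name, b.name) for a, b in combinations(items, 2) if overlaps(a, b)]


# Illustrative room: the coffee table intrudes 0.1 m into the sofa's footprint.
room = [
    Footprint("sofa", 0.0, 0.0, 2.0, 0.9),
    Footprint("coffee_table", 0.5, 0.8, 1.5, 1.4),
    Footprint("tv_stand", 0.0, 3.0, 1.8, 3.4),
]
conflicts = find_conflicts(room)
print(conflicts)
```

A generated scenario with a non-empty conflict list would be rejected or repaired before it is used for training. Checks for blocked pathways and unreasonable layouts would need additional rules beyond this pairwise test.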

The fourth stage, and the project's core innovation, involves infusing life into static scenarios.

Traditional simulation scenarios stop after furniture placement, leaving what look like empty showrooms. Real households, however, are defined by clutter, dynamism, and the presence of everyday items.

The team generated high-frequency daily objects such as cups, remote controls, books, toys, tableware, and snacks in bulk on tables, bookshelves, kitchen counters, and TV stands.

These seemingly trivial clutter items are precisely the biggest challenge for robots in household tasks.

When faced with everyday clutter that is randomly placed, irregularly shaped, and variable in its center of gravity, a robot's operational success rate often plummets.

This step of filling in details about the physical properties of diverse objects brings simulated scenarios infinitely close to real human habitats, providing a higher-quality training ground for robots to improve their generalization capabilities.

The entire four-stage process adopts a modular design, allowing each stage to be trained, validated, and tuned independently before being combined into a complete generation pipeline.

Although this engineering approach doubles the initial workload, its stability, controllability, and adaptability far exceed those of end-to-end black-box models.
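
The modular idea can be sketched as a chain of independent stage functions. The four stage names follow the article. Everything else is an assumption made for illustration: the dictionary-based scene, the placeholder stage bodies, and the `run_pipeline` helper. None of it reflects Kairos-HomeWorld's actual code.

```python
from typing import Any, Callable

Scene = dict[str, Any]            # hypothetical scene representation
Stage = Callable[[Scene], Scene]  # every stage maps a scene to a new scene


def structure_floor_plan(scene: Scene) -> Scene:
    # Stage 1 (placeholder): fix the room list taken from the floor plan.
    return {**scene, "rooms": ["living_room", "kitchen", "bedroom"]}


def decorate(scene: Scene) -> Scene:
    # Stage 2 (placeholder): attach an empty furniture list to each room.
    return {**scene, "furniture": {room: [] for room in scene["rooms"]}}


def check_conflicts(scene: Scene) -> Scene:
    # Stage 3 (placeholder): mark the scene valid if every room was decorated.
    valid = all(room in scene["furniture"] for room in scene["rooms"])
    return {**scene, "valid": valid}


def add_clutter(scene: Scene) -> Scene:
    # Stage 4 (placeholder): attach an empty clutter list to each room.
    return {**scene, "clutter": {room: [] for room in scene["rooms"]}}


def run_pipeline(scene: Scene, stages: list[Stage]) -> Scene:
    # Apply the stages in order; each one sees only the previous stage's output.
    for stage in stages:
        scene = stage(scene)
    return scene


PIPELINE: list[Stage] = [structure_floor_plan, decorate, check_conflicts, add_clutter]

home = run_pipeline({"floor_plan_id": "example-001"}, PIPELINE)
# Because stages are independent, a prefix of the pipeline can be run and inspected alone.
partial = run_pipeline({"floor_plan_id": "example-001"}, PIPELINE[:2])
print(sorted(home))
```

Because each stage only consumes and returns a scene, any one of them can be swapped out or tested alone. That is the kind of controllability the modular approach is credited with here.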

Li Hongsheng admitted that similar projects in the industry are generally smaller in scale. The team's decision to expand at once to full-house scale, with every scenario covered and every object interactive, imposed significant engineering pressure.

"But the biggest challenge is not technical; there are always solutions. The biggest challenge is the determination to break through and persevere with a project that no one has done before."

In terms of data sources, the team has amassed 800,000 real residential floor plans, all derived from genuine Chinese households, covering conventional layouts, rare and unusual layouts, and special structural layouts.

Considering the high costs associated with standardized CAD blueprints, they opted for more cost-effective and widely covered—though noisier—data sources.

Through iterative cleaning involving model-based intelligent labeling and manual refinement, they preserved the diversity of real floor plans while eliminating data flaws, making the simulated scenarios more closely resemble authentic Chinese households.

Compared to mainstream 3D generation projects in the industry, such as Fei-Fei Li's World Labs, Kairos-HomeWorld's core differentiation lies in interactivity.

While World Labs generates visually realistic 3D scenarios that support free roaming, the objects in them are fused together and cannot be manipulated individually.

In contrast, Kairos-HomeWorld achieves full object decoupling, allowing every piece of furniture and clutter item in the scenario to be independently grasped, moved, and opened or closed, making it a true training ground for robots.

Additionally, Kairos-HomeWorld has open-sourced 300,000 Chinese household floor plans, along with 5,000 complete, fully interactive home scenarios.

This open-source volume also exceeds existing datasets by an order of magnitude.

More importantly, relying on a unified generation algorithm, this system can infinitely iterate to produce new scenarios, replicating tens of thousands to millions of scenarios at low cost, providing the industry with a reusable and evolvable localized simulation infrastructure.

Simulation Data Is Not Dead: The Optimal Solution for Embodied AI Lies in Combining Virtual and Real

Over the past year, the industry's perception of simulation data has undergone extreme fluctuations.

From initially being hailed as a low-cost implementation miracle to facing widespread skepticism due to the gap between simulation and reality, the industry's understanding of fundamental data needs has gradually clarified.

Simulation data and real-world data are not an either-or choice but an engineering combination that must be dynamically adjusted based on task scenarios.

Li Hongsheng's judgment is grounded in the most pragmatic logic of cost and efficiency.

He does not deny the gap between simulation and reality—pure simulation-trained models indeed underperform compared to those trained on real-world data.

However, the issue lies in the fact that, within full-house home scenarios, the cost of collecting real-world data is unsustainable for the industry.

Collecting a complete set of real 3D home data requires on-site scanning, 3D reconstruction, object segmentation, and physical property annotation—a process that is time-consuming, labor-intensive, and extremely costly.

Even with massive human resources, accumulating vast amounts of data is unfeasible. Moreover, the difficulty of collecting robot-home interaction data increases exponentially, making it impossible to support the high-frequency iteration needs of models.

The team dynamically adjusts data ratios based on task types. For full-house spatial understanding tasks, real-world data accounts for 5%-10%; for standardized grasping tasks, real-world data dominates.

The core principle of data allocation is to weigh data collection difficulty, cost, and model iteration needs, achieving on-demand allocation and dynamic optimization.
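
The task-dependent ratio can be made concrete with a small sketch that splits a training batch into real and simulated samples. The 10% figure is the upper end of the range quoted above for full-house spatial understanding. The 80% figure for grasping is only a stand-in for "dominates", since the article gives no number. The function and dictionary names are hypothetical.

```python
def batch_composition(batch_size: int, real_fraction: float) -> dict[str, int]:
    """Split a batch into real and simulated sample counts for a given real-data share."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    real = round(batch_size * real_fraction)
    return {"real": real, "simulated": batch_size - real}


# Hypothetical per-task settings; only the first value comes from the article's 5%-10% range.
REAL_FRACTION = {
    "full_house_spatial_understanding": 0.10,
    "standardized_grasping": 0.80,  # assumed stand-in for "real-world data dominates"
}

for task, fraction in REAL_FRACTION.items():
    print(task, batch_composition(200, fraction))
```

With a batch of 200, the first setting yields 20 real and 180 simulated samples, and the second yields 160 real and 40 simulated.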

It is worth mentioning that Daxiao Robotics has not chosen household robots as their current implementation carrier.

In Li Hongsheng's view, the whole-house robot represents the ultimate form of embodied AI implementation, but it is not a suitable entry point at the initial stage.

The core issue lies in the difficulty of managing safety risks. Household robots are relatively heavy, and issues such as power outages, loss of control, or tipping over can easily cause harm to the elderly and children. The industry currently lacks mature safety solutions to mitigate these risks.

Therefore, he advocates for a gradual approach, starting with easier, controlled scenarios in commercial and industrial settings before progressing to more challenging environments. Current implementation efforts focus on controllable commercial and industrial scenarios, including pre-warehouse packaging, factory assembly, and large-space navigation.

These scenarios present lower technical complexity, controlled environments, and reduced safety risks, enabling rapid implementation and iteration.

Of course, this is not contradictory.

The family setting represents the ultimate form of embodied AI implementation. Precisely because of its difficulty and high risks, it necessitates early accumulation of data and capabilities at the lowest possible cost. This is where the value of Kairos-HomeWorld lies.

Commercial and industrial scenarios serve as training grounds at the current stage. Many underlying technologies, such as grasping, placing, and navigation, are transferable across domains.

A more pragmatic approach involves first refining models in controlled environments to improve success rates before transitioning them to household settings.

Although there is no rush to deploy in household settings, Kairos-HomeWorld still plays a significant role in current robot model training.

For instance, the success rate of transferring from desktop simulation training to real-world deployment has reached 50%, placing it at the upper echelon of the industry.

Li Hongsheng's training logic is straightforward: if performance in simulation environments is subpar, there is no need to proceed to real-world environments. Unless simulation success rates reach 80-90%, real-world performance will only be worse.
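
This gating rule amounts to a simple threshold check on simulated trials. The sketch below uses the lower end of the quoted 80-90% range as the default threshold. The function names and trial counts are illustrative assumptions, not the team's evaluation code.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of trials that succeeded."""
    if not outcomes:
        raise ValueError("at least one trial is required")
    return sum(outcomes) / len(outcomes)


def ready_for_real_world(sim_outcomes: list[bool], threshold: float = 0.8) -> bool:
    # Move on to real-world testing only if simulation performance clears the bar.
    return success_rate(sim_outcomes) >= threshold


# Illustrative results: 17 of 20 simulated trials succeed for one policy, 12 of 20 for another.
strong_policy = [True] * 17 + [False] * 3
weak_policy = [True] * 12 + [False] * 8

print(ready_for_real_world(strong_policy))
print(ready_for_real_world(weak_policy))
```

The first policy, at 85%, passes the gate. The second, at 60%, goes back for more simulated training.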

The core significance of simulation scenarios lies in providing robots with a low-cost, zero-risk, and infinitely iterable pre-training platform, serving as a preparatory stage for real-world model deployment.

Technical Path Uncertain, Long-Term Patience as Core Competence

Compared to the mature and stable trajectory of large language models, the technical architecture of embodied AI remains in a chaotic and iterative phase, with no unified industry standards yet established.

A year ago, VLA models were the dominant paradigm. Today, world models and world action models have rapidly emerged as new research hotspots.

The rapid iteration and shift in technical approaches underscore that embodied AI is still in its early exploratory stage.

In Li Hongsheng's view, VLA models and world models are not mutually exclusive or antagonistic; instead, they are destined to converge and integrate in the future.

Each fulfills distinct roles and complements the other's capabilities.

World models excel in spatial perception, scene memory, future state prediction, and physical reasoning, serving as the spatial intelligence brain for robots. VLA models, leveraging large language model capabilities, specialize in long-term task decomposition, logical planning, and semantic understanding.

Both approaches possess irreplaceable advantages, and blindly choosing one over the other would result in missed technological opportunities.

Taking dexterous hand manipulation as an example, Li Hongsheng argues that world models are better suited for training dexterous hands.

Most human hand operations involve fine-grained movements such as in-hand rotation, pinching, and sliding.

He illustrates this with the example of picking up a phone: when you lift it from a table with the screen facing your palm, a precise 180-degree rotation is required for use. While humans perceive this action as simple, data collection through teleoperation is 5 to 10 times less efficient.

Due to the low efficiency and high cost of teleoperation, scaling up dataset accumulation is nearly impossible.

However, world action models can autonomously generate fine-grained action strategies, perfectly meeting the training needs of dexterous hands and addressing the limitation of traditional VLA models' reliance on manually demonstrated data.

Furthermore, the industry is not only fragmented in terms of technical approaches but also in its evaluation systems.

Unlike the mature benchmark tests and leaderboards in the large model domain, the embodied AI industry lacks standardized simulation environments, task frameworks, and evaluation criteria across enterprises and institutions. This makes horizontal comparisons extremely challenging and hinders the clear identification of industry leaders.

While this fragmented state may attract short-term attention, it also presents significant opportunities for disruptive innovation.

Short-term advantages built on capital and resource accumulation can be entirely overturned by a single iteration of underlying architecture.

Currently, there is no absolute leader; all top players are exploring multiple technical routes, conducting trial-and-error experiments, and awaiting the convergence of technological outcomes.

For young professionals aspiring to enter the embodied AI field, Li Hongsheng offers a simple piece of advice: patience.

"The technical stack of embodied AI is much broader than that of pure large models. It requires not only AI expertise but also knowledge of robotics, control systems, and hardware, along with longer training cycles. Many students are impatient, but those who truly accomplish significant work are the patient ones."

As for when general-purpose household robots will become widespread, Li Hongsheng admits that no clear timeline can be provided at present. However, the technological development strategy of Daxiao Robotics is clear.

Instead of passively waiting for industry data to mature, it is proactive in building localized simulation infrastructure. Rather than getting entangled in debates over the value of simulation technologies, it validates technical feasibility through engineering implementation.

Kairos-HomeWorld exemplifies this pragmatic philosophy, representing a long-term, industry-empowering infrastructure-level innovation.

In the marathon of embodied AI development, single-point algorithmic advantages are insufficient to establish long-term competitive barriers. Only the depth, breadth, and sustainable iterative capacity of scenario-based data can form an irreplaceable core defense.

While it is far too early to speak of a harvest period, the industry has reached a stage where foundational infrastructure must be established.

Someone needs to keep strengthening data, scenario, and engineering capabilities, patiently turning technological imagination into real-world capability.
