Embodied AI Limbs Trained to Perfection, Brain Still Needs Millions of Hours of Data


07/02 2026 387

While demos of humanoid robots walking steadily and grasping dexterously fill exhibition halls, the industry faces a clear dividing line: the 'cerebellum' that handles motion control has largely matured, while the robot 'brain' responsible for understanding the world and planning autonomously remains constrained by a shortage of broad scenario data. The 'data factory' business model that Alexandr Wang pioneered at Scale AI is now being rapidly replicated in the embodied AI sector, and data infrastructure looks set to become the decisive factor in the industry's future landscape.

Cerebellum and Lower Limbs Fully Mastered, Higher-Level Brain Trapped in Data Desert

The industry has already formed a clear hierarchical architecture: the robot cerebellum corresponds to foundational capabilities such as motion control, joint execution, and force feedback, while the brain encompasses VLA (Vision-Language-Action) large models and world models, responsible for environmental understanding, long-term task planning, and cross-scenario generalization.

Over the past two years, breakthroughs have been achieved in bipedal balance, dexterous hand manipulation, and whole-body motion algorithms at scale: humanoid robots from manufacturers like Unitree, UBTECH, and Zhipu can now stably navigate stairs, transport materials, and perform standardized actions such as twisting bottle caps and folding clothes. Shihe Special Robots, leveraging mature motion platforms, have been deployed in high-risk scenarios like ship rust removal and facade cleaning. Their entire limb execution system has undergone millions of real-world iterations, meeting commercial standards for accuracy, stability, and responsiveness. In short, robot 'hands and feet' have become sufficiently agile.

However, the situation for the higher-level cognitive brain is entirely different. Unlike large language models (LLMs), which can be trained on text scraped from the internet, embodied AI requires multimodal physical interaction data (vision, touch, joint trajectories, object mechanics, and time-aligned environmental signals) that cannot be crawled online and must instead be captured by digitally recording human skills in the real world. The industry consensus is that an embodied large model with general autonomous capabilities requires at least tens of millions of hours of high-quality real-world interaction data. As of early 2026, the global stock of compliant, usable data, counting both real-robot data and 'non-physical' data collected without a robot body, stands at just 500,000 hours. Against a requirement of 10 million hours or more, that leaves a gap of at least 95%, rising above 99% once the requirement exceeds 50 million hours.
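The size of the gap depends on which figure is taken for 'tens of millions of hours'. The sketch below is only an illustration of that arithmetic: the 500,000-hour stock comes from the article, while the three target values and the helper name `data_gap` are assumptions chosen for the example.

```python
# Hypothetical helper: share of a data requirement that is still unmet.
def data_gap(available_hours: float, required_hours: float) -> float:
    """Return the unmet fraction of the requirement, between 0 and 1."""
    if required_hours <= 0:
        raise ValueError("required_hours must be positive")
    return max(0.0, 1.0 - available_hours / required_hours)


AVAILABLE_HOURS = 500_000  # stock quoted in the article (early 2026)

# Assumed readings of "tens of millions of hours"; not figures from the article.
for required in (10_000_000, 50_000_000, 100_000_000):
    gap = data_gap(AVAILABLE_HOURS, required)
    print(f"target {required:>11,} h -> gap {gap:.1%}")
```

Under these assumptions the gap is 95% for a 10-million-hour target and reaches 99% only at 50 million hours, so the '99%' figure implicitly assumes a requirement at the upper end of the range.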

Data shortages directly expose brain limitations: robots can precisely execute single predefined actions but fail to adapt to scenario variables—they may easily crush eggs in different packaging, struggle to organize cluttered desktops autonomously, and lack cross-scenario capabilities between home and factory settings. The root cause lies in the fact that most existing data is generated in single-scenario laboratories, lacking the massive, diverse, real-world samples needed to establish physical common sense for world models.

The industry's five major data challenges all point to an insufficient supply of training data for the brain:

1. High Collection Costs: Traditional real-robot teleoperation data costs 500–1,000 RMB per hour, requiring heavy investment in dedicated scenarios, robot deployment, and manual operation across the entire chain.

2. Scalability Bottlenecks: The industry has so far relied on small-scale laboratory pilots, which struggle to cover the thousands of distinct scenarios found in homes, supermarkets, factories, and warehouses.

3. Multimodal Alignment Difficulties: Achieving millisecond-level synchronization of vision, touch, hand movements, and environmental audio presents high technical barriers, leaving large volumes of raw, 'dirty' data unusable for brain training.

4. Scarcity of Generalizable Scenario Samples: Existing datasets focus on standardized, simple tasks, lacking real-world interaction cases involving clutter, emergencies, and long-tail scenarios.

5. Poor Data Reusability: Under project-based collection models, single batches of data serve only one-time model fine-tuning, failing to accumulate as universal training assets.
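The multimodal alignment difficulty in item 3 can be made concrete with a small example. The sketch below pairs each camera frame with the nearest tactile sample and rejects pairs that are too far apart in time. It is a minimal illustration of nearest-timestamp matching, not a description of any vendor's pipeline; the function name `align_nearest`, the sampling rates, and the 5 ms tolerance are all assumptions.

```python
from bisect import bisect_left


def align_nearest(reference_ts, other_ts, tolerance_ms):
    """For each reference timestamp, return the index of the nearest entry
    in other_ts, or None if nothing lies within tolerance_ms.
    Both lists hold milliseconds and must be sorted ascending."""
    matches = []
    for t in reference_ts:
        i = bisect_left(other_ts, t)
        # Only the neighbors just before and just after t can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_ts)]
        best = min(candidates, key=lambda j: abs(other_ts[j] - t), default=None)
        if best is not None and abs(other_ts[best] - t) <= tolerance_ms:
            matches.append(best)
        else:
            matches.append(None)  # unmatched frame: treat as unusable
    return matches


# Hypothetical streams: camera at roughly 30 Hz, tactile sensor at 100 Hz.
camera_ts = [0, 33, 67, 100, 133]
tactile_ts = list(range(0, 120, 10))  # 0, 10, ..., 110 (stream ends early)

print(align_nearest(camera_ts, tactile_ts, tolerance_ms=5))
# -> [0, 3, 7, 10, None]
```

The last camera frame has no tactile sample within 5 ms, so it is marked unmatched. Frames like this are one source of the raw, 'dirty' data that cannot be used for training without further processing.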

The Embodied AI Sector Enters the Data Factory Era

This data supply dilemma closely mirrors the early trajectory of large language models. The core logic behind Alexandr Wang’s founding of Scale AI was to move beyond fragmented annotation outsourcing and build a standardized, full-chain, reusable AI data factory, serving as a unified data supply base for OpenAI, Meta, and NVIDIA.

Scale AI’s core business model can be summarized as a three-layer closed loop:

1. Standardized Production Line Collection: Establishing a globally distributed data collection network with unified equipment, collection protocols, and quality inspection standards.

2. Automated Refinement Processing: Using AI pre-screening + manual review to clean, atomically annotate, and multimodally align raw materials, transforming them into training sets directly readable by models.

3. Model-Driven Iteration: Customizing supplementary datasets based on large models’ training shortcomings, forming a data flywheel of 'train, detect defects, supplement data, improve performance.' A single standardized dataset can be reused across clients and models, reducing marginal costs.

Today, this 'data refinery' logic is being fully replicated in the embodied AI sector, with domestic players pursuing three parallel routes, all targeting scalable data infrastructure:

Route 1: Non-Physical Wearable Distributed Collection (Jianzhi, Mifeng)

Abandoning heavy-asset real robots, these players use lightweight wearable devices (data gloves, three-lens headsets, full-body sensing suits) as hardware cores to conduct crowdsourced collection in homes, factories, and stores. Jianzhi’s Gen DAS devices achieve millimeter-level motion capture and 1mm high-density tactile sensing, deployed in over a thousand real homes, producing processed training data within 2 hours. Zhipu spun off Mifeng Technology to launch the MEgo collection kit, opening co-creation modes in stores and factories, enabling ordinary people to work part-time as data collectors and cost-effectively expanding real-world scenario sample pools.

This UMI-style (Universal Manipulation Interface) collection model, which needs no robot body, costs just one-third as much as real-robot teleoperation and can be scaled massively. It specifically addresses the brain’s need for massive, lifestyle-oriented, fragmented scenario data, resolving the pain point of laboratory data being detached from reality.
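To give a sense of what the one-third cost ratio means at scale, the sketch below applies the article's quoted teleoperation cost of 500 to 1,000 RMB per hour to a collection target. The 10-million-hour target is an assumption taken from the low end of 'tens of millions of hours', the helper name `cost_range_rmb` is hypothetical, and the calculation ignores any change in unit cost as volume grows.

```python
TELEOP_RMB_PER_HOUR = (500, 1_000)  # range quoted in the article
WEARABLE_DIVISOR = 3                # article: about one-third of teleoperation
TARGET_HOURS = 10_000_000           # assumption, not a figure from the article


def cost_range_rmb(hours, divisor=1):
    """Return the (low, high) total cost in RMB for a given number of hours."""
    low, high = TELEOP_RMB_PER_HOUR
    return (hours * low / divisor, hours * high / divisor)


for label, divisor in (("teleoperation", 1), ("wearable", WEARABLE_DIVISOR)):
    low, high = cost_range_rmb(TARGET_HOURS, divisor)
    print(f"{label:>13}: {low / 1e9:.2f} to {high / 1e9:.2f} billion RMB")
```

Under these assumptions, teleoperation alone would cost 5 to 10 billion RMB for 10 million hours, against roughly 1.7 to 3.3 billion RMB for wearable collection.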

Route 2: Virtual-Real Fusion Data Factories (Guanglun, Wuwen Zhike)

Mirroring Scale AI’s synthetic data capabilities, these players build dual production lines combining 'human collection + simulation generation.' Guanglun Intelligence completed three large funding rounds in under four months, raising a cumulative 2 billion RMB in two weeks. Leveraging simulation engines, it generates long-tail boundary scenario data in bulk while accumulating millions of hours of human operation videos. Its standardized datasets achieve resale rates above 10x, with a single dataset adaptable to VLA model training for multiple robot manufacturers. Wuwen Zhike established a virtual-real training ground in the Yangtze River Delta, producing thousands of hours of fused data daily to continuously supply training material for general world models.

Route 3: Large-Scale Enterprise Scenario Crowdsourcing (JD.com, Baidu)

Internet giants open their ecosystems for data supply. JD.com plans to mobilize 600,000 internal and external personnel wearing collection devices to achieve 10 million hours of human first-person data within two years. Baidu launched the Embodied Data Supermarket, integrating industry-wide collection resources to streamline data circulation and lower access thresholds for small and medium-sized model developers.
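A quick sanity check on the JD.com target shows how little each participant would need to contribute. The sketch below uses the article's figures (600,000 people, 10 million hours, two years) and assumes, purely for illustration, that the hours are spread evenly across participants and across 52 weeks per year.

```python
PARTICIPANTS = 600_000     # figure from the article
TARGET_HOURS = 10_000_000  # figure from the article
YEARS = 2                  # figure from the article
WEEKS_PER_YEAR = 52        # approximation


def hours_per_person(target_hours, participants):
    """Average hours each participant must record (assumes an even split)."""
    return target_hours / participants


total = hours_per_person(TARGET_HOURS, PARTICIPANTS)
weekly_minutes = total / (YEARS * WEEKS_PER_YEAR) * 60

print(f"{total:.1f} hours per person in total")
print(f"{weekly_minutes:.1f} minutes per person per week")
```

With an even split this comes to about 16.7 hours per person over the two years, or just under 10 minutes a week.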

Data Platforms: The 'Shovel-Selling' Business at the Industry’s Foundation

Capital markets have already placed early bets on the sector’s certainty: Jianzhi secured three funding rounds totaling over 200 million RMB within four months of establishment. Guanglun became the world’s first embodied data unicorn, with a valuation exceeding 2 billion USD. Mifeng Technology secured tens of millions of USD in seed funding upon spin-off. Data service providers like Yiren and Jinglianwen achieved over 100 million RMB in revenue and positive profitability, commercializing ahead of robot and model companies still mired in losses.

The underlying logic of capital bets is clear:

1. Persistent Rigid Demand: As long as robot brains rely on multimodal physical data for training, the data market is unlikely to face oversupply. Unlike LLM text data, physical interaction data cannot be infinitely replicated, making real-world scenario samples perpetually scarce.

2. Amplifying Scale Effects: After establishing standardized production lines, data factories see continuous declines in marginal costs for collection, annotation, and simulation. Data assets can be resold and reused repeatedly, forming a flywheel where accumulation reinforces barriers.

3. Strong Cross-Industry Universality: The same home and industrial interaction datasets can simultaneously supply humanoid robots, dexterous manipulator arms, and special-purpose equipment manufacturers, unconstrained by single hardware forms.

In contrast, most industry players still focus on robot hardware and end-to-end model iteration, neglecting data infrastructure investment. This leads to a cycle of 'impressive demos but poor real-world performance'—no matter how excellent cerebellar motion algorithms are, brains starved of multi-scenario data cannot make general autonomous decisions and are confined to repeating predefined actions in fixed scenarios.

Currently, 99% of publicly available datasets lack fine-grained force interaction dimensions, causing robot grasping and assembly task models to frequently exhibit physical hallucinations. After supplementing tactile and timing-aligned data, VLA models’ physical interaction capabilities undergo a qualitative leap. This precisely demonstrates that standardized multimodal materials provided by data platforms are the only solution to breaking through robot brain bottlenecks.

The Second Half of Embodied Competition: Data Infrastructure Determines the Winner

During the LLM wave, Scale AI used its standardized data factory to capture an outsized share of the industry's gains. History is now repeating in the embodied AI sector.

Today, humanoid robots’ limb motion and low-level control have entered a homogenization phase. The true differentiator lies in whether companies can build a scalable, low-cost, full-chain data production system to continuously supply 'robot brains' with tens of millions of hours of real-world training materials.

From laboratory real-robot workshops to distributed wearable crowdsourcing, virtual-real fusion data factories, and cross-industry data circulation platforms, the industrialization of embodied data infrastructure has only just begun. Over the next 3–5 years, players mastering high-quality multimodal data supply capabilities will become indispensable foundational infrastructure for the entire embodied AI value chain.

Before robots enter millions of homes and factory floors, building complete data refineries represents the industry’s most certain long-term priority.

Solemnly declare: the copyright of this article belongs to the original author. The reprinted article is only for the purpose of spreading more information. If the author's information is marked incorrectly, please contact us immediately to modify or delete it. Thank you.
