Revenue Surges 50-Fold: Q1 Earnings Match Annual Profit, This Year's Hottest Business Revealed


06/10 2026

Cover Image | Created by ChatGPT

The embodied AI industry is frantically scrambling for data.

"Everyone desperately craves datasets exceeding 10 million hours... At 200 RMB per hour, 10 million hours would cost 2 billion RMB." Gao Shaolong, founder of Jiyuan Zhihang, told Pencil News that leading embodied AI companies are willing to invest significant resources to obtain high-quality data.

"Nowadays, a company can't claim to be in embodied AI without at least 1 million hours of data," said Zhang Ji, founder of Zhuma Innovation. While 1 million hours sounds substantial, it actually meets only 1/10,000 of the real demand in embodied AI.

If 1 million hours is the baseline, then at 200 RMB per hour, entering the embodied AI field requires at least 200 million RMB.

The problem is that even with billions in cash, companies often can't buy what they need—there simply isn't enough high-quality real-world data.

Assembly actions in factories, service procedures in coffee shops, household organization, elderly care scenarios in nursing homes... These real-world behaviors cannot be crawled like internet text or bulk-downloaded like images. They must be collected hour by hour by humans.

Meanwhile, capital markets have started paying for "data sellers."

In June last year, data annotation giant Scale AI received a $14.3 billion investment from Meta, valuing the company at $29 billion.

This year, Chinese data company Tashi Zhihang secured over $450 million in financing, setting a record for single-round funding in China's embodied AI industry. Data sensor startup Yuanche Taichu raised over 500 million RMB within five months of its founding.

Orders are pouring in. Ma Chenghui, founder of real-world AI data collection company Yiren Technology, told Pencil News that embodied AI data orders exceeded 100 million RMB in Q1 this year, surpassing last year's total. Yang Hongbing, founder of Lingsheng Technology, revealed that revenue from embodied data orders is expected to grow over 50-fold this year.

In the U.S., human data collection company Mecka.ai secured $100 million in orders within a year.

A new data gold rush is underway. Pencil News interviewed multiple data collection practitioners and investors to uncover where the profit opportunities in the sector lie.

- 01 - Buying Data: 200 Million RMB

By Gao Shaolong, Founder of Jiyuan Zhihang

Jiyuan Zhihang is an innovative company specializing in embodied AI data infrastructure, having completed its angel round financing.

Today, nearly every company in the embodied AI industry lacks data.

If you ask leading Chinese embodied AI companies about their actual data procurement needs, the minimum they hope to acquire is measured in millions of hours.

What does 1 million hours represent?

Calculated at a standard data collection cost of 50-60 RMB per hour, 1 million hours would require 50-60 million RMB in investment.

For deep scenario data costing 200 RMB per hour, 1 million hours would cost 200 million RMB.
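The arithmetic behind these figures can be checked with a short sketch. The hourly rates and hour counts come from the text; the function and variable names are illustrative only.

```python
def collection_cost_rmb(hours: int, rate_rmb_per_hour: int) -> int:
    """Total collection cost in RMB for a given number of hours at a flat hourly rate."""
    return hours * rate_rmb_per_hour


ONE_MILLION_HOURS = 1_000_000

# Standard daily-life data at 50-60 RMB per hour.
standard_low = collection_cost_rmb(ONE_MILLION_HOURS, 50)
standard_high = collection_cost_rmb(ONE_MILLION_HOURS, 60)

# Deep scenario data at 200 RMB per hour.
deep = collection_cost_rmb(ONE_MILLION_HOURS, 200)

# The 10-million-hour datasets quoted at the top of the article.
ten_million_deep = collection_cost_rmb(10_000_000, 200)

print(standard_low, standard_high)  # 50000000 60000000
print(deep)                         # 200000000
print(ten_million_deep)             # 2000000000
```

A flat hourly rate is a simplification; it ignores any volume discounts or annotation costs, which the text does not quantify.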

More embarrassingly, these companies have billions in cash and can afford the data—but often can't find it.

AI data annotation work interface. Source: Public data

Because available data is insufficient in both quality and scale, most models today are not general-purpose but optimized for specific scenarios.

This is a common dilemma across the industry.

The embodied AI industry has faced data shortages for some time because real-machine data lacks universality. We've visited nearly every major robotics manufacturer in China—both large and small firms consistently report that real-machine data can only train their own robots, not others'.

For embodied AI to achieve true DeepSeek-like intelligence emergence, approximately 2 billion hours of data would be needed, a collection effort that would be all but impossible by any precedent in engineering history.

Increasingly, papers demonstrate that non-robotic (human) data is effective. The industry must ultimately find a new path: returning to authentic human behavior data, which will become the largest embodied AI data asset.

While data has value, that doesn't guarantee a viable business model. The data industry faces one critical pain point: piracy.

If an institution spends 1 million RMB on data, they have an incentive to replicate it 20 times and sell each copy for 50,000 RMB to recoup costs. This is devastating for original collectors, who then lose motivation to gather high-quality data.

Studying the autonomous driving industry, we found companies like Horizon Robotics have adopted a new approach: DaaS (Data as a Service). Data remains on servers. Clients bring their models for training, then take away only the parameters while leaving the data intact. This enables data reuse without replication risks. This may represent the true business model for embodied AI data.
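The "data stays, parameters leave" idea can be illustrated with a toy sketch. This is not Horizon Robotics' system or any real product's API: the `DataService` class, its `train` method, and the `fit_slope` routine are all hypothetical names invented for illustration, and the three records stand in for real sensor data.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[float, float]  # (input, target) pair; a stand-in for real sensor data


class DataService:
    """Hypothetical DaaS endpoint: records stay on the server, only parameters leave."""

    def __init__(self, records: Sequence[Record]) -> None:
        self._records: List[Record] = list(records)  # never handed back to the client

    def train(self, fit: Callable[[Sequence[Record]], List[float]]) -> List[float]:
        # Run the client's training routine next to the data and
        # return only the learned parameters.
        return list(fit(tuple(self._records)))


def fit_slope(records: Sequence[Record]) -> List[float]:
    """Client-supplied model: least-squares slope of a line through the origin."""
    numerator = sum(x * y for x, y in records)
    denominator = sum(x * x for x, _ in records)
    return [numerator / denominator]


service = DataService([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
params = service.train(fit_slope)
print(params)  # [2.0]
```

In this sketch nothing stops a dishonest training routine from returning the raw records disguised as parameters, so a real service would presumably need sandboxing and output auditing on top of this pattern.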

Data that once cost 100 RMB could be sold only once. In the future, data collected at the same cost could serve 1,000 companies, reducing the per-company cost to 0.1 RMB. Data would become as cheap and accessible as tap water, and industry supply would expand dramatically.
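The per-company figure follows from simple amortization, sketched below. The 100 RMB cost and the 1,000 clients are the speaker's numbers; the function name is illustrative, and the sketch assumes the cost is split evenly across clients.

```python
def per_company_cost_rmb(collection_cost_rmb: float, num_clients: int) -> float:
    """Collection cost carried by each client when one dataset serves many clients."""
    return collection_cost_rmb / num_clients


sold_once = per_company_cost_rmb(100, 1)      # traditional one-off sale
shared = per_company_cost_rmb(100, 1_000)     # same data reused by 1,000 clients

print(sold_once)  # 100.0
print(shared)     # 0.1
```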

Currently, mainstream (embodied AI) model companies procure at least millions of hours of data annually. Prices vary dramatically by data depth.

The cheapest are ordinary daily life scenarios: making beds, setting tables, organizing items. These require no specialized personnel—a few outsourced workers can complete them. Such data costs about 50-60 RMB per hour.

When entering real service scenarios like cafes, costs surge because businesses must halt operations to cooperate. Such service scenario data typically exceeds 200 RMB per hour.

Industrial robotic arm assembly lines: data collection costs are higher in industrial scenarios. Source: Public data

The next tier is industrial scenarios. Many assume the high costs stem from collection difficulties, but the cost of negotiating access plays a bigger role. Factories that are paid too little will not cooperate, and major manufacturers may refuse collection altogether over intellectual property and trade secret concerns. Industrial data ultimately costs at least 200-300 RMB per hour. Many aggressive model companies now prioritize such data despite the higher prices because it better reflects real production environments.

The hardest data to collect comes from household scenarios due to privacy, property, and safety issues—far more complex than factory settings. The industry remains extremely cautious about household data collection.

We broadly categorize clients into two types.

The first type wants all data, regardless of scenario. They target general-purpose embodied AI foundation models, aiming to expose their models to diverse worlds so that later industry-specific fine-tuning becomes cheaper. A few leading domestic teams pursue this approach.

The second type—representing the vast majority—defines itself as vertical applications from day one. They procure data specifically for the most promising future implementation scenarios.

Service industry data currently sees the highest demand, explaining our extensive collection of cafe and restaurant data.

I must emphasize: Services aren't necessarily the most valuable direction for embodied AI. They're just the easiest to obtain data from.

Data from industrial assembly, medicine, and deep manufacturing was previously unobtainable at low cost. Without this data, model developers would not invest in those directions, which created the market illusion that services represent embodied AI's future.

If someone could supply manufacturing, medical, industrial assembly, and elderly care data at scale, the entire industry would pivot. Data companies' true value lies not in selling data but in creating supply that helps the industry discover new possibilities.

Three clear trends emerged in client data demands this year:

1. Breadth: Companies desperately seek datasets exceeding 10 million hours to expose models to diverse worlds.

2. Cost: At 200 RMB per hour, 10 million hours would cost 2 billion RMB. Unless prices drop, models can't achieve commercial viability.

3. Depth: Embodied AI companies must now answer: With significant investor funding, in which scenarios can your robots create value? Only deep scenario data helps models enter real production/service environments to form commercial loops.

Another lucrative direction is AI annotation.

Many assume data collection is the costliest part, but as data becomes more sophisticated, annotation often exceeds collection costs.

Example: Ordinary motion data might be annotated by crowdsourced workers. But cooking data requires professional chefs due to specialized movements, terminology, and procedures. Chef annotation costs far exceed general crowdsourcing.

This issue will worsen as more manufacturing, medical, and industrial data enters the market.

AI-powered automatic annotation for vertical scenarios represents a major future opportunity. Many vertical datasets may be first annotated by AI, then verified by experts, dramatically reducing industry costs.

- 02 - The 1:200,000 Gap

By Yang Hongbing, Founder of Lingsheng Technology

Lingsheng Technology focuses on real-world scenario data engines and has completed multiple rounds of financing totaling hundreds of millions.

Early this year, the entire industry had only about 500,000 hours of embodied AI data. But training truly excellent models requires approximately 100 billion hours—equivalent to having one steamed bun when you need 200,000.
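The 1:200,000 ratio in this section's title can be reproduced directly from the two figures above. Both numbers are the speaker's estimates; the variable names are illustrative.

```python
available_hours = 500_000            # industry-wide embodied AI data early this year
required_hours = 100_000_000_000     # estimated need for a truly excellent model

gap = required_hours // available_hours
print(f"1:{gap:,}")  # 1:200,000
```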

This represents the industry's current reality. However, data isn't simply sold by the hour. We oppose treating data as a mere commodity.

We've rigorously graded embodied data from L1 to L5 and established a data SLA (Service Level Agreement) system, quantifying data quality management for the first time globally.

Currently, L5 data sees the highest demand. Why is L5 scarce? Because requirements are extremely high. L5 data must undergo detailed annotation and task segmentation, featuring complete task, scenario, and object descriptions with resolutions typically exceeding 1280 and sub-millimeter positioning accuracy.

I've always believed the embodied data industry shouldn't stop at "I have data, come buy it." Even fruits have varieties and oil has grades, let alone data for training large models. Some enterprises don't truly understand data—they just resell secondhand information. Lingsheng not only provides high-quality multimodal data but helps clients optimize data pipelines and underlying infrastructure.

The biggest change I've observed this year is exponential growth in demand for Ego data (first-person perspective human data), which has surged relative to demand for teleoperation data. Ego data significantly improves model performance while offering over five times the collection efficiency at lower cost.

Overseas markets already began shifting toward Ego and human-centric data in late 2023. This trend accelerated further in H1 2024.

Another clear trend is growing preference for real-world scenario data.

Example: For USB or wiring plugging tasks, data collected in controlled studios offers limited scenario variations. But with Lingsheng's thousands of external collectors—each with unique home/office environments and operating habits—the resulting data variations can reach thousands. Such diversity is hard to achieve in training facilities. Lingsheng expects to reach 1.2 million hours of real-world Ego datasets this year.

One problem plaguing the industry is inflated data accuracy claims. Some companies advertise millimeter-level positioning precision, but clients report severe drift during actual use, with errors reaching centimeter levels, ten times worse than advertised.

For sustainable embodied AI development, we must focus on results and train truly effective, accurate models. This requires genuinely high-precision, high-quality, high-value data.

Operationally, we expect order volumes to reach hundreds of millions this year, with annual revenue potentially exceeding last year's by over 50-fold. Our clients are primarily leading embodied AI companies, many of whom make repeat purchases.

We remain focused on three core metrics: data quality, data diversity, and real-world scenarios.

- 03 - Even 1 Billion in Financing Can't Guarantee Good Data

By Zhang Ji, Founder of Zhuma Innovation

Zhuma Innovation is a spatial intelligence company specializing in "3D cameras + AI," having completed tens of millions in angel round financing.

How severe is the data shortage in embodied AI? Current available data might not even reach 1/10,000 of actual needs.

Why?

First, companies don't know what they lack. For large language models, everyone knows to seek text data. Embodied AI models require stacked multimodal data—physical AI, spatial intelligence, dimensions, mechanics, joints, currents, sounds... Which data types take priority? Without knowing deficiencies, companies can't address them. Each firm now collects data differently, creating structural challenges.

Second, while text data can be crawled, embodied AI requires physical world data that hasn't been digitized—and thus can't be directly used for training.

Service robots: service industry data is currently the easiest to obtain. Source: Public data

Third, collection currently relies on manual hour-by-hour efforts. An operator works 7-8 hours daily, creating linear time constraints that prevent rapid scaling.
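A rough sketch shows what this linear constraint means in practice. The 7-8 hour working day and the 1-million-hour target come from the article; the headcount of 1,000 operators is a hypothetical figure, and the sketch optimistically assumes every working hour yields one usable hour of data.

```python
import math


def days_needed(target_hours: int, operators: int, hours_per_day: int) -> int:
    """Calendar working days needed when output grows linearly with headcount and time."""
    return math.ceil(target_hours / (operators * hours_per_day))


TARGET_HOURS = 1_000_000  # the "baseline" dataset size cited in the article
OPERATORS = 1_000         # hypothetical headcount, not a figure from the article

print(days_needed(TARGET_HOURS, OPERATORS, 8))      # 125
print(days_needed(TARGET_HOURS, OPERATORS, 7))      # 143
print(days_needed(TARGET_HOURS, 2 * OPERATORS, 8))  # 63
```

Doubling the workforce only halves the time, which is the sense in which manual collection cannot scale faster than linearly.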

In addition, unclear requirements and a lack of data standards result in chaotic formats. Many firms receive data in new formats they cannot use, wasting scarce data resources.

In 2026, the industry suddenly recognized this issue.

Last year, everyone was still competing on models and algorithms, but this year they realized that those efforts had not yielded great results. Data is the most crucial element, and everyone is now focusing on it. In particular, many people have started paying attention to ego data, that is, first-person data. Our own method of using cameras to collect real 3D data has also quickly become popular.

Another reason is that some teams from the intelligent driving industry have entered the field of embodied intelligence. They strongly believe that data must come from real-world scenarios. The intelligent driving industry has already proven that collecting large amounts of real data is crucial, with 90% of scenarios relying on real-world collection.

I believe that while embodied intelligence companies may not achieve immediate success, data companies are likely to become the new unicorns. Even if they don't become unicorns, these data companies will be the most profitable.

Simply put, the amount of data required for embodied intelligence is much larger than the original internet data, possibly by a factor of ten thousand. If all this data needs to be provided by data companies, the market will be enormous.

Nowadays, a company cannot even discuss embodied intelligence without having a million hours of data. At a rate of 500 RMB per hour for real physical data, a million hours would amount to 500 million RMB.

The problem is that even if a leading embodied intelligence company raises 1 billion RMB in a round of financing, it may not be able to acquire high-quality data. Data is difficult to buy and there isn't enough of it. Every collection method scales roughly linearly with time, and no method yet expands the supply of high-quality real data exponentially.

For embodied intelligence companies, the most valuable data comes from the earliest scenarios where robots are sold. For example, many people are now working on industrial-grade scenarios such as factories and logistics. For them, data from these scenarios is the most valuable. However, for data providers, data from real-world scenarios is the most valuable.

As long as the data is generated from real-world scenarios, there will be buyers, and it doesn't necessarily have to be strictly scenario-specific. From the perspective of data scale, synthetic data provides the largest scale because it does not grow linearly. It has the opportunity to break through the limitations of accumulation by people and time.

Companies that build data factories should also be quite profitable now; it is a way to make money quietly. Everyone wants to build them, but few know how, and many are funded by the government. These companies can make money from projects, but they may not be particularly valuable in terms of valuation.

A data factory is a large venue where people operate real machines to simulate various scenarios and collect data, row after row, much like workers on a traditional factory floor. Similar data factories now exist everywhere.

The problem with data factories is that scaling is difficult because it depends on adding people. The cost per piece of data may be high, but the scale is limited. With limited space, limited setups, limited personnel, and limited time, the output is entirely calculable. So it is hard-earned money.

There is still demand for simulated data, and the demand should be quite large. Simulated or synthetic data has the opportunity to break through the logic of linear growth and may form an exponential supply. So, its demand will not be small.

Moreover, those who make simulated data should be among the first to make money, at least for now. Its unit price is cheap, but the quantity is large.

Companies that do data annotation will definitely make money. It doesn't have to be AI annotation; algorithmic annotation is also fine. What matters is that the data gets annotated.

To sum up: Companies that can scale will definitely make money. The key is not to rely on human scaling but to scale through algorithms.

In this environment, anyone providing data will make money; it is just a matter of who makes more and who makes less.

However, in the long run, the companies that can truly continue to make money may be those that provide data infrastructure. The model of relying on people for collection may not make big money in the long run.

In the past, the SaaS industry had Databricks (valued at over 130 billion USD). In the future, the embodied intelligence industry will also have many data infrastructure companies similar to Databricks, providing not only data but also data engines, closed data loops, automation capabilities, labeling capabilities, inference capabilities, and framework capabilities.

- 04 - Raising a Round of Funding Every One or Two Months

By Wang Xuehui, Managing Partner of Tsinghua Alumni Seed Fund

The Tsinghua Alumni Seed Fund is the first university alumni fund in China, dedicated to becoming the "first stop for Tsinghua alumni entrepreneurship."

Data is like the "oil" of the embodied intelligence industry, and everyone is short of it now.

Even in a relatively fixed and single scenario like autonomous driving, Scale AI (a data training company valued at nearly 30 billion USD) has emerged, and many domestic autonomous driving annotation and data companies have also made money.

In the future, if humanoid robots truly enter hundreds of industries, the demand for data will be several orders of magnitude higher than that of autonomous driving, possibly two or three orders of magnitude.

This market will be huge, but for now, whether it's the robot body (the hardware), embodied models, world models, or data collection routes, nothing has converged yet, and many technical routes are being pursued in parallel.

We have invested in data collection startups such as Lingyu Intelligence, Yuanche Taichu, and Shouyi Technology. For example, Lingyu Intelligence mainly focuses on teleoperation of real robots, with good data quality but relatively high costs. This year, wristband technology has become more popular, including at companies like Yuanche Taichu and Shouyi Technology.

The big opportunity in data collection largely comes from Meta's wristband technology. Before, people didn't believe that myoelectric technology could be so precise, but after Meta made it work, the market began to see opportunities. It is said that Apple's next-generation products may also follow this technical route, while other routes may be put on hold.

After the wristband technology was proven to work, people found that it could be worn not only on human hands but also on robots, combining "wristbands" and "data collection." Humans wear wristbands, and robots wear wristbands, forming a connection in between. This is both a technological innovation and a model innovation.

Currently, data collection companies are not yet truly profitable. None has broken even (in terms of revenue and costs) or achieved profitability so far. The industry is still in its early stages, having been around for only a little over a year. For these companies, securing orders is already a good result, and profitability is not the main consideration for now.

If mass production is achieved, some companies will definitely die off, and the routes will converge to some extent. The biggest pain point in the industry right now is that no one has found the final convergent route yet, and many companies are trying several solutions simultaneously.

Now, many companies are raising funding every one or two months, and it's hard to say which route is definitely better. Tsinghua's strategy is that if the technical route is unclear, we generally won't bet on one route to succeed.

This applies not only to data companies but also to companies building complete robots, embodied models, and world models. The entire industry is now in a state of raising funding every one or two months.

Giant companies specializing in data will emerge in the embodied data industry. However, leading complete-robot makers may choose to collect data themselves. By analogy with the automotive industry, leading giants like Tesla and BYD do many things themselves, with BYD even making its own batteries. The top few companies in the robotics industry will probably do the same.

However, this does not mean that there is no room for third-party independent data companies. Apart from the top few companies, mid-tier and lower-tier companies, as well as various corner case scenarios, specialized robots, and special-purpose robots, will all have a large demand for data. For these companies, using third-party independent data companies jointly may be a more cost-effective approach. In the future, the typical customers of third-party independent data companies will at least include many mid-tier companies, and this volume is sufficient to support their growth into publicly listed companies.

There are currently two relatively mature business models for data collection companies: one is selling data on a one-time basis, and the other is sharing revenue based on the data value on each robot. Companies definitely hope that more revenue sharing based on the number of robots will be adopted in the future, but there will be negotiations involved.

When investors evaluate whether a data collection company is good, the most critical factors are its orders and which complete-robot makers are using it. Whether first-tier institutions and customers are using it is a very important indicator.

The views expressed in this article are the independent opinions of the speaker and do not constitute any investment advice.
