07/07 2025
Both China and the United States are making significant investments; why is data annotation taking center stage?
"Dear President Trump, America must win the AI war." Earlier this year, 28-year-old Alexandr Wang took out a full-page advertisement in the Washington Post for his data annotation service company, Scale AI, on the second day of Trump's inauguration.
Alexandr Wang's dramatic move put data annotation before the public for the first time and highlighted a lesser-known fact: among the three key elements of AI (data, models, and computing power), progress on the data front draws far less attention than the intense competition over models and computing power.
However, two weeks ago, Meta's acquisition of a 49% stake in Scale AI for $14.3 billion truly propelled the field of AI data services into the global spotlight, causing a major upheaval in the US data annotation industry.
Coincidentally, besides US giants betting on the value of AI data services, the domestic data annotation industry has also been heating up over the past year, with significant actions at both the top-level design and market ends. Seven national data annotation base pilot cities have been established, and the National Data Administration has centrally released a collection of 47 excellent data annotation cases. Meanwhile, a batch of data annotation service companies have seen rapid growth in their performance.
However, beyond the frequent actions in the industry, there is a prevalent saying among industry insiders: Data annotation is accelerating towards automation, and technological progress is gradually eliminating many annotation tasks.
This raises curiosity: What exactly is this industry that both China and the US are investing in? What stage of development is this field currently at? Will automation render data annotation obsolete? And how will competition unfold next?
01 Behind the Acquisition: AI Basic Data Services in the Limelight
"Data is one of the most valuable assets in artificial intelligence," a consensus in the AI era, has been superbly validated by the Scale AI acquisition and the ensuing turbulence in the AI basic data services industry.
The $14.3 billion price tag is second only to Meta's acquisition of WhatsApp in the company's history. Meta is willing to pay it out of anxiety about falling behind in the current large model race.
Over the past few months, the Silicon Valley giant has been under considerable pressure. In April this year, feedback on Meta's Llama 4 fell short of expectations, and its larger model, Behemoth, was delayed.
On the other side of the deal, Scale AI can command such a price because of its position in AI basic data services and the crucial role that data annotation and mining play in current model training.
Founded in 2016, Scale AI initially served as a platform providing crowdsourcing services to help businesses complete tasks requiring manual operations such as content review and data extraction. Later, due to the massive demand for data review and annotation in the autonomous driving field, Scale AI began to focus on data annotation, assisting clients in collecting, cleaning, annotating, and managing large-scale data to facilitate the development of autonomous driving algorithms.
With the advent of the large model wave, Scale AI's revenue soared from $290 million in 2022 to $760 million in 2023, and is expected to continue growing to $870 million in 2024. Some sources claim that the company's revenue is expected to reach $2 billion by 2025.
To put that revenue into perspective, OpenAI's revenue for 2024 is projected at $3.7 billion. According to Grand View Research, the global data annotation and services market reached $14.07 billion in 2023, of which the US market accounted for $4.2 billion, nearly 30% of the global total. By revenue, Scale AI ranks among the leaders in basic data services.
Scale AI's clients include a group of Silicon Valley giants such as Google, Apple, xAI, Meta, Microsoft, and Amazon. Last year, Google spent approximately $150 million on Scale AI, making it the company's largest client.
According to the technology outlet Business Insider, as of April this year Scale AI was running at least 38 active projects for Google, more than a third of the 107 generative AI projects then on Scale AI's list. Its work for xAI includes a project named Xylophone, which mainly helps train xAI's chatbot to converse on a wide range of topics.
The extensive client network actually reflects the important position of data annotation and AI basic data services in current model training.
There is a saying in the AI industry, "garbage in, garbage out": the quality of the data largely determines the performance of the model. Data annotation essentially translates large amounts of unstructured data that machines cannot understand into structured data that they can. In the large model wave, with training data at unprecedented scale, budgets for data annotation and processing have soared in the push to make models more intelligent.
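Concretely, annotation attaches machine-readable structure to raw data. A minimal, hypothetical sketch of what that looks like for text (the record schema and entity labels here are illustrative, not any vendor's actual format):

```python
# Hypothetical example: turning raw, unstructured text into a
# structured annotation record a model can consume.
raw_text = "Meta acquired a 49% stake in Scale AI for $14.3 billion."

# A human (or tool-assisted) annotator marks character spans with labels.
annotation = {
    "text": raw_text,
    "entities": [
        {"span": (0, 4),   "label": "ORG",   "value": "Meta"},
        {"span": (29, 37), "label": "ORG",   "value": "Scale AI"},
        {"span": (42, 55), "label": "MONEY", "value": "$14.3 billion"},
    ],
}

# Quality check: every labeled span must match the underlying text exactly.
for ent in annotation["entities"]:
    start, end = ent["span"]
    assert raw_text[start:end] == ent["value"]
```

Structured records like this, at scale and with consistent labels, are the raw material the article's "garbage in, garbage out" maxim refers to.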
According to a 2024 survey by AI basic data service vendor LXT of 322 US enterprises with AI project experience, throughout 2023, these enterprises' capital investment in training data accounted for 15% of their overall AI construction investment. Previously, there was also a saying in the industry that high-quality annotated data is one of the reasons why ChatGPT's performance differs from other competitors.
All of these factors led Meta to its major stake in Scale AI. In Meta's view, partnering with the leader in data services presumably helps it obtain proprietary data for model training, build higher-intelligence models on that data, and thereby keep pace in the current large model race.
This major acquisition has also triggered a series of chain reactions in the data annotation industry and the AI supply chain.
First, a large number of vendors competing with Meta's models have begun to sever their cooperation with Scale AI. For example, Scale AI's largest client, Google, immediately suspended cooperation on two projects codenamed "Genesis" and "Beetle Crown" upon the completion of the transaction.
Second, a group of data annotation vendors competing with Scale AI took the opportunity to expand their clientele. For instance, companies such as Sapien, Appen, Prolific, and Turing became candidates for many AI vendors when diversifying their data annotation supplier choices. Sapien AI's CEO Rowan Stone also stated that within 48 hours after the Meta transaction, their platform gained 40,000 new data annotation registrants, causing their servers to crash.
Amid concerns about the impact of Meta's acquisition on the neutrality of Scale AI's annotated data and the leakage of business secrets, Scale AI also issued a platform neutrality statement.
However, the issuance of the statement did not stop the various controversies within the industry. A major industry reshuffle is already underway.
02 Policy and Market-Driven Surge in the Domestic Market
Amid the major reshuffle in the overseas data annotation industry, China, home to one of the fastest-growing AI industries in the world, has seen data demand surge over the past one to two years, and its data annotation field has evolved accordingly.
First, policy support has been unmistakable. Since last year, China has issued a succession of policies and regulations on data annotation, catalyzing the industry from the top-level design down.
In June last year, the National Data Administration released a list of the first batch of seven pilot cities for data annotation bases. These seven cities have played a pioneering role in the ecological construction, capacity enhancement, and scenario application of the data annotation industry.
IDC told Digital Frontline that the initial intention of this policy is to promote the construction of high-quality datasets, with the goal of better promoting AI development and providing standard data support for the circulation of data elements. Factors such as city demand and talent structure are comprehensively considered in city selection.
In December last year, the field received another programmatic document: four state ministries and commissions jointly issued the "Implementation Opinions on Promoting the High-Quality Development of the Data Annotation Industry," setting a target of over 20% compound annual growth in industry scale through 2027 and erecting the "four beams and eight pillars" (the main structural framework) of the domestic data annotation industry.
At the same time, various regions have continuously issued relevant regulations and policies over the past year to guide industrial development.
Chart Source: Northeast Securities Research Report
Meanwhile, the industry regulatory authorities have also actively set benchmarks and promoted industry standardization. In April this year, the National Data Administration released a collection of 47 excellent data annotation cases at the theme exchange event on high-quality datasets and data annotation during the 8th Digital China Construction Summit, covering more than 20 fields such as healthcare, transportation, agriculture, and energy. These benchmark cases provide reusable practical templates and lay the foundation for standard unification and experience sharing in related fields.
While being supported by policies, with the arrival of the wave of large model applications, the heat and scale of the data annotation market have also increased significantly. A batch of enterprises such as Haitian Ruisheng and Appen have witnessed rapid growth in their performance.
Taking Appen as an example, its 2024 annual report released in February this year showed that its revenue from the China region exceeded 420 million yuan last year, with an annual growth rate of 71%, and its large model/AIGC business grew by 526%. Appen disclosed that many AI leaders, especially large model AI enterprises, have become its clients, with large models and related businesses accounting for 40% of its revenue in China.
Lin Qunshu, CEO of the AI data service startup Integer Intelligence, told Digital Frontline that last year, with the rapid evolution of multimodal models, they felt that the market demand for data annotation showed exponential growth.
An industry veteran believes that the bustling market end in the data annotation field is related to the structural changes in the AI field over the past year. Domestic open-source models represented by DeepSeek are significantly narrowing the gap with overseas models. At the same time, the progress of domestic models reduces the consumption of computing power, alleviating the computing power anxiety of many enterprises and elevating the importance of the data level to a higher position.
"The quality, scale, and accuracy of data will directly determine the upper limit of model capabilities and become the key to the effectiveness of model implementation," the person told Digital Frontline.
The imagination space for the industry is rapidly opening up. According to iResearch data, the scale of China's AI basic data services market was 5.8 billion yuan in 2024 and is expected to reach 17 billion yuan by 2028, with a compound annual growth rate of 30.84%.
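The implied growth rate can be sanity-checked directly from the iResearch figures cited above (5.8 billion yuan in 2024 to 17 billion yuan in 2028, i.e. four compounding years):

```python
# Sanity check of the reported CAGR: 5.8B yuan (2024) -> 17B yuan (2028).
start, end, years = 5.8, 17.0, 2028 - 2024

cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.2%}")  # -> 30.84%
```

The computed figure matches the 30.84% compound annual growth rate the report cites.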
IDC told Digital Frontline that currently, model applications are moving towards vertical fields, and the scene demand for data annotation mainly revolves around autonomous driving, education, healthcare, finance, retail, government affairs, etc.
With the increase in market heat, Digital Frontline observed that the number of participants in the industry is also increasing, competition is becoming fiercer, and at the same time, the boundaries between the upstream, midstream, and downstream of the industrial chain are gradually blurring.
For example, model vendors have rolled out annotation-related products and services as part of offering more complete model capabilities. A typical example is Zhipu AI, which launched a Batch API last year to tackle data annotation with large model technology; Baidu Intelligent Cloud likewise offers data annotation services.
Some application enterprises, approaching from the angle of AI deployment, have also built operational tools into their applications to annotate data and reduce hallucinations in their scenarios. A typical example is the AI operations center Lingyang launched in its Quick Service intelligent customer service product: targeting hallucinations in customer service scenarios, it uses a training center to annotate and feed high-quality data back into the model, making question answering more accurate.
"Annotation inside applications alleviates model hallucinations and serves the fine-tuning stage; it is a supplement, or a stopgap, for capabilities the base model currently lacks," an industry insider told Digital Frontline.
03 Will Technological Evolution Make Data Annotation Obsolete?
Amid the rapid development of the global data annotation industry, there is also a voice that believes that the data annotation field may face new challenges due to technological progress. For example, some have pointed out that in the future, AI will automatically complete many annotation tasks, and enterprises in the annotation field may need to accelerate their transformation.
Regarding this trend, Digital Frontline communicated with multiple industry insiders, and it is generally believed within the industry that in the era of large models, data annotation is gradually becoming more complex, automated, and specialized. The wave of automation does not mean that annotation is no longer needed.
First is the trend of increasing complexity in data annotation, which is related to changes in data annotation demand brought about by the evolution of large model technology.
Mainstream large models rely primarily on self-supervised learning, consuming vast amounts of unlabeled data during pre-training. The subsequent stages of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), however, still require manual annotation.
An insider in the data annotation industry explained that in the RLHF stage, humans must rank and align machine-generated answers, teaching models human tendencies, values, and preferences. Compared with the simple box-drawing and circling tasks of the past, the complexity of annotation in the fine-tuning and RLHF stages has soared, and the bar for annotation teams has risen accordingly.
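The ranking work described above is typically captured as preference pairs. A schematic record (field names here are illustrative, not any specific vendor's schema):

```python
# Hypothetical RLHF preference record: an annotator compares two model
# answers to the same prompt and marks which better matches human
# preferences. Collections of such pairs are used to train a reward model.
preference_pair = {
    "prompt": "Explain overfitting to a beginner.",
    "chosen": "Overfitting is when a model memorizes its training data "
              "and performs poorly on new examples.",
    "rejected": "Overfitting minimizes empirical risk on the training "
                "distribution via excessive parameterization.",
    "rationale": "The chosen answer fits the requested audience.",
}

# Downstream, a reward model is trained so that, roughly:
#   reward(prompt, chosen) > reward(prompt, rejected)
assert preference_pair["chosen"] != preference_pair["rejected"]
```

Producing good pairs requires judging tone, audience fit, and factuality at once, which is why this work is far harder than drawing boxes around objects.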
It is also said in the industry that some teams deploy PhD-level annotators for the RLHF stage. Scale AI, for instance, has recruited numerous PhDs to provide annotation services for RLHF, and OpenAI likewise works with a team of PhDs to quality-check those annotations after Scale AI's work.
The trend toward automated annotation, in turn, comes from applying large model advances to annotation itself, which has brought marked gains in quality and efficiency. Refuel AI, an overseas open-source data annotation and cleaning platform, has run tests showing that AI can markedly improve annotation quality while cutting costs.
Chart: Across various NLP tasks, model annotations agree with ground-truth labels more often than human annotations; the highest value in each column is highlighted in green. Source: Refuel AI.
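A common pattern behind such hybrid pipelines, sketched here generically (this is not Refuel AI's actual implementation, and the toy keyword "model" merely stands in for an LLM or classifier), is to let a model label each item with a confidence score and route only low-confidence items to human annotators:

```python
# Generic auto-labeling sketch with confidence-based human routing.
def toy_model(text: str) -> tuple[str, float]:
    """Stand-in for an LLM/classifier returning (label, confidence)."""
    if "great" in text:
        return "positive", 0.97
    if "terrible" in text:
        return "negative", 0.95
    return "neutral", 0.55  # the model is unsure

CONFIDENCE_THRESHOLD = 0.9

def auto_label(items: list[str]) -> tuple[list[tuple[str, str]], list[str]]:
    """Label items automatically; route low-confidence ones to humans."""
    auto, needs_human = [], []
    for text in items:
        label, confidence = toy_model(text)
        if confidence >= CONFIDENCE_THRESHOLD:
            auto.append((text, label))
        else:
            needs_human.append(text)  # the hard residue humans still handle
    return auto, needs_human

auto, needs_human = auto_label(
    ["great product", "terrible service", "it arrived on Tuesday"]
)
print(len(auto), len(needs_human))  # -> 2 1
```

The items the model cannot label confidently are exactly the "remaining 10%" that, as the analysts quoted below argue, becomes more valuable as automation rises.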
Digital Frontline has observed that data annotation vendors at home and abroad are raising the automation level of annotation, shifting it from labor-intensive manual work to platform-based automatic annotation. Scale AI overseas, and Haitian Ruisheng, Appen, and Integer Intelligence domestically, all operate their own automated annotation platforms.
Beyond professional data service providers, annotation inside some enterprises is also being automated. In autonomous driving, for example, Tesla once built a sizable in-house annotation team, but since 2022 it has been downsizing the team supporting assisted-driving development, instead using the Dojo supercomputer for automated annotation and training on massive volumes of video data.
Liu Yu, president of data intelligence service provider MobTech, told Digital Frontline that in today's fiercely competitive market, turning service capabilities into standardized products raises the competitive bar for annotation providers. "The same workforce can annotate more efficiently, with higher quality and more stable supply."
However, the industry also acknowledges that this trend towards automation does not imply that annotation tasks and professional service providers are rendered obsolete. In fact, as AI advances towards vertical scenarios, the demand for manual annotation for complex tasks in specialized fields is on the rise.
"Data annotation is becoming increasingly challenging. When the degree of data automation is high, for instance, AI can complete 90% of automatic annotation, the remaining 10% becomes even more crucial," Li Haoran, a senior analyst at IDC China, told Digital Wisdom Frontline.
An AI application vendor also previously told Digital Frontline that while AI may handle simple box-drawing and labeling tasks, much annotation involving specialized domain knowledge can only be done manually.
Furthermore, with the advent of reasoning models, there is a pressing need for data related to the chain of thought. "It necessitates professionals who comprehend the business to dissect the problem more effectively through the configuration of rules and model parameters," Li Haoran added.
Li Haoran also noted that when data can be automatically annotated and synthesized, its value to the model diminishes, prompting enterprises to invest more resources in manually annotating more intricate problems. "Previous educational questions may have been tailored for junior and senior high school students, but now they might be geared towards college students. Additionally, previous image annotations only required circling faces, but now it is also necessary to input text to comprehend the meaning conveyed in the image and the structural relationships within it."
Amidst these trends, the evolution direction of the data annotation field has become apparent.
On one hand, the industry's entry barriers are transitioning from labor-intensive to technology-intensive, with higher professional thresholds. Moreover, as the focus of competition among players shifts to composite capabilities encompassing technical expertise and scenario resources, and more players enter the market, the industry's knockout competition has commenced simultaneously, intensifying market competition.