AI is potent, yet it cannot create from nothing.
The prowess of AI essentially stems from the data utilized in algorithms and the training of large models. The quantity and quality of data play a pivotal role in the performance of these models. Previously, OpenAI staff noted that the Orion project (i.e., GPT-5) was progressing slowly due to a lack of sufficient high-quality data. Consequently, OpenAI recruited numerous mathematicians, physicists, and programmers to generate original data for training these large models.
The data challenges faced by AI companies extend beyond this. Frequent copyright infringement is a persistent issue plaguing the industry. While large AI companies possess the resources and capacity to manage infringement disputes, smaller companies could suffer catastrophic losses under a barrage of lawsuits.
Copyright: Another Hurdle for the AI Industry
Since the inception of ChatGPT, the battle over copyright has been underway. Initially, the opposition to AI was led primarily by artists, as AI companies used their works to train large models that might ultimately displace them. At that time, however, training large AI models did not yet demand enough data to affect a broad range of rights holders, and the artist community was relatively small, with limited influence.
As the capabilities of large AI models continue to evolve, so does their demand for data. In addition to public scientific papers, AI companies also scrape posts on social platforms, news reports published by media outlets, and other information. Social media posts carry comparatively loose usage terms and raise fewer hurdles, but news reports published by media outlets are protected by copyright.
(Image source: Doubao AI generation)
In late November 2024, The Toronto Star and several other major Canadian media outlets filed a lawsuit against OpenAI, alleging that it scraped content from Canadian media without permission to train large models. They demanded that OpenAI pay CAD 20,000 (approximately RMB 100,000) per news report used, a total estimated in the billions of Canadian dollars.
Faced with the accusations and exorbitant compensation demands from The Toronto Star, OpenAI denied the allegations and issued a statement claiming that the training of large AI models is based on public data, adhering to fair use and international copyright principles, which is fair to creators.
It's not just Canadian media; The Intercept, The New York Times, Raw Story, AlterNet in the United States, ANI in India, and GEMA, the German copyright agency, have all sued OpenAI.
As large models for video and audio generation become increasingly sophisticated, the copyright issues posed by AI companies have also intensified. In June 2024, the Recording Industry Association of America sued two AI music companies, Suno and Udio.
The domestic AI industry faces similar challenges. For instance, MiniMax, one of China's six leading AI large-model startups, was recently sued by iQIYI for using its materials without authorization to train Hailuo AI, with iQIYI demanding RMB 100,000 in compensation.
(Image source: MiniMax)
Even more concerning, some companies infringe not only on the copyright of works but also on the portrait rights of public figures. In the well-known incidents involving AI Sun Yanzi and AI Lei Jun, for example, netizens used AI synthesis technology to make Sun Yanzi sing various songs and make Lei Jun "speak rudely." On April 23, 2024, the first domestic AI-generated voice personality infringement case was adjudicated: the plaintiff, Ms. Yin, won, and the infringing company was ordered to pay her RMB 250,000, offering creators a glimmer of hope.
Although OpenAI claims, when facing infringement lawsuits, that the training of large AI models is based on public data, "public" does not equate to "copyright-free." Photographs taken by photographers and articles written by editors are protected by copyright, and allowing AI companies to scrape them freely undoubtedly infringes on the interests of creators.
If this trend continues, creators' enthusiasm and confidence in their work will inevitably wane, leading to a decrease in content creation. Consequently, the data available for training large AI models will become even more scarce, adversely affecting the normal development of the AI industry. Thus, safeguarding the legitimate rights and interests of creators and cracking down on infringements has become a pressing issue that the AI industry must address.
The Imperative of Establishing a "Shared Database"
Recently, the domestic AI company DeepSeek used a data distillation scheme, with other large AI models serving as teacher models, to train the DeepSeek-V3 large model with fewer parameters, lower resource consumption, and extremely low training costs. However, because DeepSeek-V3 sometimes called itself "ChatGPT" when answering user questions, it drew mockery from OpenAI CEO Sam Altman. OpenAI, which maintains that it has not infringed on Canadian media, is evidently far less tolerant of potential infringements by other AI companies against itself.
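The data distillation idea mentioned above can be shown in miniature: a "teacher" model's softened output distribution supervises a "student." Below is a minimal, self-contained sketch in plain NumPy with made-up toy logits; the temperature, learning rate, and update rule are illustrative assumptions, not DeepSeek's actual training recipe.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_step(student_logits, teacher_logits, lr=1.0, T=2.0):
    """One gradient step pulling the student's distribution toward the
    teacher's softened distribution. For a softmax + KL-divergence loss,
    the gradient w.r.t. the student logits is proportional to
    (student_probs - teacher_probs)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    grad = p_student - p_teacher
    return student_logits - lr * grad

# Toy example: a hypothetical 3-class teacher supervises an untrained student.
teacher = np.array([4.0, 1.0, 0.5])  # made-up teacher logits
student = np.zeros(3)                # student starts with uniform output
for _ in range(300):
    student = distill_step(student, teacher)
```

After the loop, the student's softened distribution closely matches the teacher's, which is the essence of using one model's outputs as training signal for another.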
However Sam Altman frames it, the infringement allegations against OpenAI keep mounting, and the problem is widespread throughout the AIGC industry.
To tackle increasingly complex problems, the parameters of advanced large AI models will continue to increase in the future, and the demand for data will also rise. Especially with the advent of large models for video and audio generation, infringements will become more prevalent and frequent.
(Image source: Doubao AI generation)
To resolve copyright disputes at the source, relevant departments need to formulate corresponding laws and regulations to restrict AI companies from infringing and protect the rights and interests of creators. The "Opinions of the CPC Central Committee and the State Council on Establishing a Basic Data System to Better Play the Role of Data as a Factor" issued in December 2022 emphasizes downplaying ownership and emphasizing usage rights regarding AI companies' use of publicly available content on the internet. If commercial use is involved, fees need to be paid to the creators.
At the China-Europe Copyright Protection Seminar in the Digital Environment held in Xi'an on November 19, 2024, the organizers emphasized that they would take the "Regulations for the Implementation of the Copyright Law of the People's Republic of China" as an opportunity to revise and improve the system design to protect the legitimate rights and interests of authors.
Yan Xiaohong, chair of the China Copyright Association, stated that from a technical perspective, using copyrighted works in principle requires disclosing copyright information and obtaining authorization, but this is practically infeasible. The data sources for companies training large AI models are simply too complex, spanning news reports from media outlets, posts published by individuals, papers from research institutions, reports from major companies, and more, making it difficult to catalogue them and apply for authorization individually.
Therefore, there is a need for global internet companies and academic research institutions to unite and create a shared database to label publicly available data on the internet and clarify copyright ownership. When AI companies require data, they must cooperate with the alliance formed by internet companies and academic research institutions to negotiate which data can be accessed and at what cost. While building the shared database, the internet company alliance should also communicate and cooperate with creators, obtain their authorization, and pay the corresponding fees before adding the content to the database.
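To make the idea concrete, here is a minimal sketch of what one entry in such a shared database might look like: the content's identifier, its rights holder, its license tier, and the fee owed per training use. All names, license tiers, and the fee logic are hypothetical illustrations, not the schema of any existing system.

```python
from dataclasses import dataclass
from enum import Enum

class License(Enum):
    """Hypothetical license tiers a shared database might record."""
    PUBLIC_DOMAIN = "public_domain"  # free to use for training
    PAID = "paid"                    # usable after paying the rights holder
    RESTRICTED = "restricted"        # not licensed for AI training at all

@dataclass
class DataRecord:
    """One labeled entry in the hypothetical shared database:
    the content, who holds the copyright, and on what terms it
    may be used for model training."""
    content_id: str
    owner: str           # creator or rights holder
    license: License
    fee_per_use: float   # fee owed to the rights holder per training use

def usable_for_training(record: DataRecord, budget: float) -> bool:
    """An AI company may use a record if it is public domain, or if it
    is paid-tier and the company's budget covers the fee."""
    if record.license is License.PUBLIC_DOMAIN:
        return True
    if record.license is License.PAID:
        return budget >= record.fee_per_use
    return False
```

Under this arrangement, the "which data, at what cost" negotiation described above reduces to querying labeled records and checking license terms, rather than scraping first and litigating later.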
(Image source: Doubao AI generation)
In this way, internet companies possessing vast amounts of data would assume the role of "intermediaries," connecting creators and AI companies, ensuring that creators earn revenue while the companies themselves take a cut. For domestic and foreign internet companies such as Tencent, Baidu, ByteDance, Facebook, and X, this opens an additional channel for monetizing information.
Although AI companies need to invest in purchasing data, the difficulty of capturing data is significantly reduced, and the channels for obtaining data will also increase, potentially reducing some costs. OpenAI staff have complained about the lack of data, but in reality, it is the lack of publicly and easily accessible data. The internet is akin to an iceberg, with only one-third visible above the water, and the remaining two-thirds hidden beneath the surface. Only when AI companies are willing to pay the corresponding costs can they utilize this hidden data to train large models.
A Robust Data Sharing Mechanism: The Cornerstone of AI
Ilya Sutskever, OpenAI's co-founder and former chief scientist, once stated that data is the fossil fuel of AI, and this fuel is running out: we have only one internet, and data growth has peaked. Additionally, GPT-5, originally scheduled for release in mid-to-late 2024, has seen its training delayed, leading many to question whether human society has enough data to support the AI industry in entering its next stage.
In fact, human society is generating new data continuously. The "National Data Resources Survey Report" reveals that China's total data generation in 2023 reached 32.85ZB (zettabytes), with an average of 90 billion GB of data generated daily.
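A quick back-of-the-envelope check confirms the report's two figures are mutually consistent (using decimal units, where 1 ZB = 10^12 GB):

```python
# Cross-check the report's figures: 32.85 ZB generated in 2023
# should average out to roughly 90 billion GB per day.
ZB_IN_GB = 10**12                 # 1 zettabyte = 10^12 gigabytes (decimal units)
total_gb = 32.85 * ZB_IN_GB       # total data generated in 2023, in GB
per_day_gb = total_gb / 365       # average daily volume
print(round(per_day_gb / 1e9))    # prints 90, i.e. 90 billion GB per day
```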
(Image source: Doubao AI generation)
Today, with the internet permeating various aspects of our lives, work, and entertainment, covering nearly 70% of the global population, the notion that there is not enough data for training large AI models is a misconception. For AI companies, the challenge lies in extracting effective data.
Relevant departments providing a legal foundation and internet giants jointly building a database to filter effective data and protect the rights and interests of creators is undoubtedly the most efficient solution. In the past, AI enterprises did not lack data but wanted to keep it all to themselves, lacking the awareness to build a shared database. Now that the situation has changed, easily accessible data is no longer sufficient to support the AI industry in entering its next stage. Only by eliminating barriers and collaborating can all enterprises overcome the challenge of insufficient data volume.
Among the technologies touted as potential triggers of the fourth industrial revolution, such as the metaverse, blockchain, 3D printing, room-temperature superconductivity, and artificial intelligence, AI and its related robotics industry currently appear the most likely to lead humanity there.
To promote and regulate industry development, at the German Digital Summit on October 21, 2024, the German digital company Schwarz and Deutsche Bahn announced the establishment of the "European Data Center," aimed at providing data support for AI companies to train large models.
Just one month after that conference, the China-Europe Copyright Protection Seminar in the Digital Environment was held in Xi'an, indicating that relevant departments and enterprises in China and Europe intend to cooperate to jointly build the cornerstone of the AI industry's development. It is believed that with the cooperation of numerous countries and enterprises worldwide, data will no longer be a daunting challenge for AI companies in the future. While content creators provide data to aid AI companies in training large models, they will also be able to profit from it, bidding farewell to the era of frequent infringements without any compensation.