In the past two years, large models have emerged in droves, making a dazzling showing in the generation of text, images, audio, video, and other forms of content. Content creation had long been considered a skill exclusive to humans, but since OpenAI released ChatGPT in 2022, a wave of large models has begun to challenge that monopoly. After the initial fascination and subsequent "disenchantment," the public has gradually come to understand the "creative principles" behind this new phenomenon.
A large model first needs to "consume" vast amounts of text, images, audio, and video, which it then analyzes and processes at speed. Driven by deep learning, large models increasingly resemble humans in their ability to create multi-modal content spanning text, images, audio, and video. Covering scenarios from social entertainment to work and study, ever more capable large models will profoundly change the world to come.
Behind this rapid development, however, infringement disputes involving large models keep erupting.
1
At the end of April this year, several news organizations, including The New York Daily News and The Chicago Tribune, filed lawsuits against OpenAI and Microsoft in a federal court in New York, accusing them of using their news articles without authorization to train generative artificial intelligence (AI) technologies. Subsequently, the Center for Investigative Reporting (CIR) accused OpenAI and Microsoft of using copyrighted material to train their AI models. A complaint filed in the New York federal court alleged that OpenAI utilized CIR's content without permission or payment.
This inevitably recalls the legal battles between American news organizations and Google's search engine more than a decade ago. Large models have come to be seen as the next tool, after search engines, through which internet users access information. Compared with search engines, large models not only deliver precise information but can also directly "create" original text, images, audio, and video for users to consume.
Today, Google already "pays" many news organizations, and large models may find it hard to escape a similar fate, even though OpenAI insists that using publicly available material to train AI models constitutes fair use.
The confrontation between news organizations and internet giants can be traced back to 2009.
In 2009, The Wall Street Journal's website, owned by News Corporation, operated behind a paywall: users could browse the first paragraph of some articles but had to pay to read the full text. At the time, such paywalled articles could still be read in full via Google search links.
At the "Cable 2009 Show," Rupert Murdoch denounced Google, claiming that the search giant was stealing content that did not belong to them and urging content owners to fight back. Murdoch lamented, "Do we want to continue letting Google steal our copyrighted content? This cannot continue."
Even today, high-quality content from news websites remains a vital part of the premium experience that search engines such as Google provide. Yet while the search engines amass enormous wealth, the news websites end up, in effect, laboring for someone else's benefit. The debate over whether search engines should pay news websites has spread from the United States to the rest of the world, and the controversy has persisted from more than a decade ago to the present day.
After a long standoff, Google's payments to news organizations have become routine.
As early as 2020, Google announced that it had struck partnerships with roughly 200 news organizations worldwide and would launch a new service to showcase their news, paying $1 billion over the following three years in usage fees for news articles and other content.
2
Compared with search engines' indexing "infringement" and advertising monopolies, the confrontation between large models and news organizations is more comprehensive, and the conflict between the two sides is sharper.
News websites around the world rely on the steady traffic Google sends them to earn revenue from advertising, paid subscriptions, and other businesses. Large models work differently: hyperlinks play only a small role in their answers, so most of the service begins and ends inside the model's own product, leaving news websites far less able to capture any benefit.
This time, The New York Times was the first to declare war on large models. At the end of 2023, the newspaper sued OpenAI and Microsoft, accusing the two companies of using its copyrighted content without authorization to train their AI models and of presenting that content to users inside ChatGPT. By the end of June this year, at least 13 news media organizations had filed infringement lawsuits against OpenAI and Microsoft.
As News Corp CEO Robert Thomson put it, "The collective intellectual property rights of the media are under threat, and we should loudly demand compensation." Steven Lieberman, a lawyer representing news organizations, went further, arguing that OpenAI owes much of its great success to the work of others, having accessed a vast amount of high-quality content without permission or payment.
Such lawsuits are not limited to the news industry. The multi-modal development of large models has also prompted counterattacks from companies and institutions in other industries.
On June 24, the world's three major record labels, Sony Music Group, Universal Music Group, and Warner Music, joined multiple other record companies in suing the AI music generation company Suno and Uncharted Labs, the developer of Udio, accusing them of illegally using copyrighted music to train their models and provide services.
The record labels alleged that Suno copied 662 songs and Udio copied 1,670 songs, seeking compensation of up to $150,000 per musical work.
Similar incidents have occurred in China. At a 360 AI conference on June 6 this year, Zhou Hongyi, founder and chairman of 360 Group, demonstrated the 360 AI browser's new "Partial Repainting" feature using an image of a woman in ancient Chinese costume. Two days later, a creator using the ID DynamicWangs posted on social media, claiming that he had painstakingly created the image with an AI drawing model and that 360 had used it without authorization.
The content creation industry is defined by its pursuit of the "new": the latest ideas, events, opinions, painting styles, or video formats. For a large model, lacking the most up-to-date information inevitably draws user complaints that its output is stale and derivative. Yet in chasing the "new," disputes over "copyright" with content industry institutions of every kind are unavoidable.
The lawsuit The New York Times filed last year included a passage stating that ChatGPT had copied its news reports almost word for word. It cited a 2019 example: a Pulitzer Prize-winning series on predatory lending in New York City's taxi industry, which the newspaper said ChatGPT would recite largely verbatim with minimal prompting.
Evidently, some users have come to treat ChatGPT as a search engine. Whether this constitutes infringement is still a matter of legal debate, but as large models commercialize at speed, similar challenges will keep surfacing. Even if current copyright law does not treat it as a serious violation, with copyright holders actively defending their rights, new legislation may well emerge to curb the practice. After all, news websites rely chiefly on traffic and the advertising that comes with it, and by cutting out the "link" between users and news websites, ChatGPT directly harms their interests.
In fact, both the United States and China, the two AI powerhouses, are still working out how their copyright laws should treat AI. Given that a great many content creators depend on copyright for their livelihood, the confrontation between large models and content copyright will be a long-term issue. Judging from the long standoff between news websites and search engines, it seems inevitable that large model companies will eventually have to pay "copyright fees" to content providers.
3
Content copyright holders' future challenges to large models will focus on two levels: first, whether copyrighted content was used to train the AI models; second, whether the text, images, audio, and video they output themselves infringe copyright.
The commercialization of large models will inevitably run into the question of "copyright." Take OpenAI's latest GPT-4o as an example. The model handles 50 languages, is faster and better than its predecessors, and has gained the ability to read human emotions. It accepts any combination of text, audio, and images as input and can generate any combination of text, audio, and images as output; according to OpenAI, "compared to existing models, GPT-4o excels particularly in image and audio understanding."
Its applications are plentiful: real-time translation, meeting report generation, legal consultation, creative writing, virtual customer service, and real-time speech and video analysis, among others. Users can also simply chat with it, asking questions to get the latest knowledge; some have even developed "romantic relationships" with large models.
Beyond everyday scenarios, large models will be applied in ever more commercial settings. And although OpenAI has made GPT-4o free to use for now (within usage limits), users must pay for expanded access, and the rights to commercialize the model remain in OpenAI's hands.
Since GPT-4o is not available in China, I asked Tencent's Yuanbao large model and Baidu's ERNIE Bot about the film adaptation of "The Three-Body Problem" directed by Zhang Yimou. Tencent's Yuanbao answered with hyperlinks to the source of each paragraph; Baidu's ERNIE Bot provided no inline hyperlinks but did list related topic links below its answer.
In essence, a large model is a tool that outputs "answers" related to whatever it is given as input. Content creation, it must be noted, evolves rapidly: to deliver the best experience in both everyday and commercial scenarios, large models must be "fed" the latest data, and, driven by user demand, their answers will inevitably "copy" content from news websites and other copyright holders. For now such conflicts are confined to a handful of large news organizations and large model companies, but once large models are ubiquitous in daily life, they will only intensify.
How will future copyright disputes be resolved? Plenty of cases have already played out, and the likely solutions fall into a few broad categories.
Legislation on artificial intelligence is being introduced. On December 8, 2023, the European Commission, the European Parliament, and representatives of EU member states reached agreement on the Artificial Intelligence Act (AI Act). The Act explicitly requires providers of general-purpose AI (GPAI) systems such as ChatGPT, and of the underlying GPAI models, to produce technical documentation, comply with EU copyright law, and publish summaries of the content used to train their systems. Companies and institutions that violate the EU's AI rules face fines.
On August 15, 2023, China's first normative policy for the generative AI industry, the "Interim Measures for the Administration of Generative Artificial Intelligence Services," jointly issued by seven departments including the Cyberspace Administration of China, came into effect. It was also the world's first regulatory policy aimed at AI-generated content.
Regulators will also penalize violations. In March this year, France's competition authority announced a fine of €250 million (roughly RMB 1.97 billion) on Google for using content from French publishers and news organizations, without their consent, to train its chatbot Bard (since renamed Gemini), in breach of European Union intellectual property rules.
As the first company to be fined over "infringement" in its training data, Google serves as a cautionary example; more large model companies may yet face similar scrutiny over where their training data comes from.
For large model companies, striking deals with the content companies that hold copyrights will be an essential strategy going forward. In June this year, Time magazine and OpenAI announced a multi-year content licensing agreement and strategic partnership, under which OpenAI may bring the publisher's content into ChatGPT and use it to help train its most advanced AI models.
The partnership reportedly runs deep: OpenAI gains access to Time's archives, more than a century of articles, to train its AI models and to answer user queries in consumer products such as ChatGPT.
In return, OpenAI will cite and link to the original sources when using Time magazine's content. Time magazine will be able to use OpenAI's technology to "develop new products" for its audience.
In any case, original content is one of the pillars on which the internet's rapid development rests. The "copyright war" that news websites, music companies, and other rights holders have fought with search engines like Google for more than a decade will be replayed in the field of large models, and the struggle will be even fiercer.
The prosperity of any technology should not be built on "exploitation and plunder." Indeed, by securing cooperation with content institutions such as news websites, large model companies can raise the bar for competitors and widen their own moat.
For now, a large model cannot go from zero to one hundred on its own. As the providers of its "nourishment," content creators and institutions have every reason to share in the gains from the large model boom.
Referenced Articles:
Cailian Press: "Under Tremendous Pressure, Google Abandons 'Going It Alone' and Promises to Pay Publishers $1 Billion Over the Next Three Years"
National Business Daily: "Behind 13 Media Outlets' Anger at OpenAI and Other AI Giants: Why Has Content Creation Become a 'Free Lunch' for Large Models?"
Guancha: "EU Internal Market Commissioner: EU Reaches 'Historic AI Legislation,' Becoming the First Continent to Establish Clear Rules for AI Use"
Sichuan Observer: "Google Fined €250 Million, AI Training Data Copyright Issues Spark Further Controversy"
Cailian Press: "OpenAI and Time Magazine Reach Cooperation Agreement to Use Its Content to Train ChatGPT"