11/15 2024 474
AI makes everything possible.
Handcrafted Labor/Editor by Wage/Produced by Uncle Jiao/Unicorn Observation
A fashionable woman walks down a Tokyo street filled with warm neon lights and animated city signs. She wears a black leather jacket, a red dress, and black boots, carries a black handbag, and has on sunglasses and red lipstick. She walks confidently and casually. The street is wet and reflective, mirroring the colorful lights. Many pedestrians walk by.
In February of this year, OpenAI's Sora made a stunning debut. Its 60-second, single-take text-to-video clip quickly went viral, and the industry exclaimed that the GPT moment for AI video had arrived.
While onlookers in China marveled at Sora's smoothness, they also began asking a soul-searching question: when will China have its own Sora? The pressure fell on Chinese AI leaders such as Baidu.
To follow or not to follow?
On November 12, at the Baidu World 2024 Conference, Baidu founder Robin Li gave his answer: "When the entire Chinese internet was lamenting over Sora earlier this year, we decided to tackle the hallucination problem in image generation. This problem may seem simpler and even more mundane, but if it's not solved, there will be no applications."
The decision surprised many observers: next to Sora, it is not at all "sexy."
By abandoning the popular Sora and choosing the mundane iRAG, did Robin Li make the right decision?
01 Decision Making
At the beginning of the year, when Sora was particularly hot, Unicorn Observation learned that there had been internal discussions within Baidu. The final conclusion was: Absolutely do not attempt to create a Sora because the cycle is too long, possibly requiring a decade or two of investment. No matter how popular it is, do not attempt it.
By the end of the year, this decision allowed Robin Li to stand confidently on the podium at the Shanghai World Expo Center and announce: "In the past 24 months, the biggest change in this industry is that large models have basically eliminated hallucinations, significantly improving the accuracy of answering questions."
Robin Li's confidence stems from Baidu's groundbreaking technology, iRAG (image-based RAG), a retrieval-augmented text-to-image generation technique.
As we all know, a large model is a probabilistic model, so its output carries a degree of uncertainty and is often nonsensical in ways that are both frustrating and amusing. The industry calls this unrealistic, fabricated content AI hallucination.
On stage, Robin Li showed an image of Beijing's Temple of Heaven generated by an open-source model. It looked similar but somehow felt off. Only next to a photo of the real Temple of Heaven did the problem become apparent: the real building has three tiers, whereas the model had generated four.
Passing off a fake as the real thing may be amusing when people are just entertaining themselves, but for AI to be usable, reliable, and "human," the hallucination problem has to be solved.
Compared with Sora, the hallucination problem may not be flashy, but it is one of the biggest constraints on the widespread application of large models, limiting their practicality in many fields. In applications that demand high accuracy, such as healthcare and law, even a small error can have severe consequences.
To AI applications, the hallucination problem is like a thick wall that blocks the sun and keeps the flowers from blooming.
Therefore, in terms of priority, iRAG comes ahead of Sora.
If we broaden our perspective to the entire AI industry, solving the hallucination problem is more critical than creating a Sora. It can help more applications come to fruition, allowing more people to use AI technology and thereby benefiting more industries.
iRAG, Baidu's retrieval-augmented text-to-image technology, combines the company's billions of images with the capabilities of its foundation model. Just as text RAG uses retrieved information to guide the generated text or answer, iRAG uses retrieved material to guide image generation, which significantly improves quality and accuracy and addresses the machine-like, fake-looking images produced by earlier text-to-image models.
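To make the idea concrete, here is a minimal Python sketch of retrieval-augmented conditioning in general. It is not Baidu's implementation, whose details are not described here. It ranks a toy image index by cosine similarity to a query vector and attaches the best match to the generation request. The names `ReferenceImage`, `retrieve`, and `build_generation_request`, the three-number embeddings, and the file names are all hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReferenceImage:
    name: str
    embedding: List[float]  # toy vector standing in for a learned embedding (assumption)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: List[float], index: List[ReferenceImage], k: int = 1) -> List[ReferenceImage]:
    """Return the k reference images most similar to the query vector."""
    ranked = sorted(index, key=lambda ref: cosine(query, ref.embedding), reverse=True)
    return ranked[:k]


def build_generation_request(prompt: str, query: List[float],
                             index: List[ReferenceImage], k: int = 1) -> Dict[str, object]:
    """Bundle the prompt with retrieved references for a (hypothetical) image generator."""
    refs = retrieve(query, index, k)
    return {"prompt": prompt, "reference_images": [ref.name for ref in refs]}


# Hypothetical index: a real system would search a very large image collection.
INDEX = [
    ReferenceImage("temple_of_heaven.jpg", [0.9, 0.1, 0.0]),
    ReferenceImage("eiffel_tower.jpg", [0.1, 0.9, 0.0]),
    ReferenceImage("hukou_falls.jpg", [0.0, 0.1, 0.9]),
]

request = build_generation_request("The Temple of Heaven at sunset", [0.8, 0.2, 0.1], INDEX)
print(request)
```

In a real system the embeddings would come from a learned model, and the retrieved images would condition the generator itself so that details such as the number of tiers on the Temple of Heaven match the real building. Here they are simply attached to the request.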
Seeing is believing. Unicorn Observation conducted a round of practical tests on Wenxiaoyan and generated the following set of images.
▲ Bill Gates playing mahjong with Guan Yu on the Great Wall. Would he dare?
▲ Lin Daiyu holding a sniper rifle. Have you ever seen that before?
▲ An elderly Sophie Marceau embracing her younger self.
▲ The Statue of Liberty and the Eiffel Tower "relocated" to the desert.
▲ Zhang Juzheng going to court in the snow alone.
▲ A Ferrari flying over the Hukou Falls.
Although most of these whimsical scenes could never occur in reality, the elements in the iRAG-generated images look real and are highly accurate. If the content were not so incongruous, one could hardly detect any "AI flavor."
With iRAG separating the real from the fake, AI-generated images become far more usable, which opens up more applications. In film and television production, comics, sequential art, and poster creation, for example, iRAG can significantly reduce creation costs.
For instance, for a large brand's promotional campaign, creating a set of high-quality posters requires a lot of manpower, such as planners, models, photographers, etc., consuming significant financial resources. Projects often cost at least hundreds of thousands, sometimes even millions, but now the cost is close to zero.
Robin Li summarized the commercial value of iRAG as: no hallucinations, ultra-realistic, cost-free, and instant results.
02 Useful
The theme of Robin Li's speech this year was "The Applications Are Here."
Consistent with Robin Li's thinking on large models over the past year, the core message comes down to one word: useful.
"Without a rich ecosystem of AI-native applications built on foundational models, large models are worthless." Last year at the Baidu World Conference, Robin Li urged entrepreneurs to create applications that generate more value.
This year, Robin Li "upgraded" usefulness to super usefulness: "Baidu is not introducing a 'super app' but continuously helping more people and businesses create millions of 'super useful' applications."
Guided by the principle of usefulness, Robin Li passed over the seemingly glamorous Sora and chose to develop iRAG to solve the hallucination problem, removing the biggest obstacle to large models becoming "useful."
"With foundational model capabilities in place, we will usher in a moment when AI applications shine brightly. Every application is a star, and every application will become a force for changing the world." Robin Li believes there are two major directions for future AI applications: intelligent agents and industrial applications.
At the conference, Baidu released 100 industrial applications based on large models, covering various industries such as manufacturing, energy, transportation, government affairs, finance, automotive, education, and the internet.
This signals that large models are no longer castles in the air but are tangibly reshaping various industries.
If iRAG makes large models more useful, Miaoda lowers the threshold for using large models.
Robin Li demonstrated how to build an event registration system using Miaoda.
Throughout the "development" process, Robin Li only described the requirements to Miaoda. Five intelligent agents—team leader, planner, editor, programmer, and quality inspector—collaborated to complete various tasks such as planning, content creation, and development. They can even automatically identify bugs.
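The division of labor described above can be pictured with a small Python sketch. It is only an illustration of a sequential multi-agent pipeline, not Miaoda's actual design: each "agent" here is a plain function that reads and extends a shared state, whereas a real agent would presumably call a large model. All function names, the task list, and the bug check are hypothetical.

```python
from typing import Dict, List

State = Dict[str, object]


def planner(state: State) -> State:
    # Break the plain-language requirement into tasks (hard-coded for illustration).
    state["tasks"] = ["registration form", "confirmation page"]
    return state


def editor(state: State) -> State:
    # Write the text that will appear on each page.
    state["copy"] = {task: f"Text for the {task}" for task in state["tasks"]}
    return state


def programmer(state: State) -> State:
    # Turn each task and its text into a page fragment.
    copy = state["copy"]
    state["pages"] = [
        f"<section><h1>{task}</h1><p>{copy[task]}</p></section>"
        for task in state["tasks"]
    ]
    return state


def quality_inspector(state: State) -> State:
    # Flag pages with an empty heading as bugs (a deliberately simple check).
    pages: List[str] = state["pages"]
    state["bugs"] = [page for page in pages if "<h1></h1>" in page]
    return state


def team_leader(requirement: str) -> State:
    # Coordinate the other four agents in a fixed order.
    state: State = {"requirement": requirement}
    for agent in (planner, editor, programmer, quality_inspector):
        state = agent(state)
    return state


result = team_leader("Build an event registration system")
print(len(result["pages"]), "pages built,", len(result["bugs"]), "bugs found")
```

The fixed order is a simplification. A real team leader could route work back to the programmer whenever the quality inspector reports a bug.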
With no-code programming, multi-agent collaboration, and multi-tool invocation, Miaoda aims to turn any idea into reality without a line of code, giving everyone the abilities of a programmer.
"We will usher in an unprecedented era where one can earn money solely through ideas," said Baidu CEO Robin Li.
This tool, set to launch in the first quarter of next year, may be as significant to large models as the Windows system was to the popularization of PCs.
Late in the last century, PC operating systems were still built on DOS, a text-based command-line interface that was unfriendly to anyone without a computer background. The graphical interface of Windows significantly lowered the threshold for computer use, making computers truly accessible to ordinary households.
03 Foresight
Since ChatGPT debuted in 2022, large models have been popular for nearly two years.
Is this global large model frenzy a new technological revolution or just another bubble?
Robin Li and Baidu's ERNIE Bot offer an answer: as of early November, daily calls to ERNIE Bot averaged more than 1.5 billion, a 7.5-fold increase over the past six months.
Over 30 years ago, while still a student at Peking University, Robin Li proactively enrolled in an AI course, forging an inseparable bond with AI and becoming a long-term believer in it.
Last year, in the wake of ChatGPT, Robin Li released China's first large model.
Today, large models are the "number one project" at internet giants, yet few top executives are, like Robin Li, still personally promoting AI on the front lines.
In September 2023, Time magazine released its inaugural list of the top 100 AI figures globally, naming Robin Li a global AI leader alongside Elon Musk and Jensen Huang.
Time's commentary: "Robin Li is China's most outstanding futurist, long committed to the wave of AI development."
Foresight allows Robin Li to maintain a sense of "clarity" in the fervent market.
Last year, when various players flooded in to compete in large models, Robin Li said, "Don't compete in models, compete in applications." It turned out that so many large models were indeed unnecessary, and now only a few giants in the US persist in developing foundational large models.
Once applications came into focus, many people began chasing blockbuster consumer-facing (C-end) AI products. Robin Li said, "Large models will transform B2B business an order of magnitude more than the internet did." Today, consumer AI "super apps" remain hard to find, while B2B applications are flourishing.
When Sora made a splash at the beginning of the year, Robin Li chose to tackle the hallucination problem in image generation, resulting in the groundbreaking technology iRAG.
In a recent interview, Robin Li explained in detail why he chose not to pursue Sora. He believed that Sora essentially provides video generation capabilities in any scenario, which is very meaningful but also very difficult, requiring a very long time to develop.
His words came true.
As the year ends, Sora has yet to be delivered. Some film producers who have tried it found the results less than ideal, and some report that the model must generate hundreds of clips to yield a single usable one.
When Sora first emerged, Hollywood film and television professionals worried about being replaced by AI, and some protested. Since then, Hollywood has long been quiet on the subject.
Followers chase whatever others are doing. Those who can endure loneliness and stick to their own path may become trendsetters.
In Robin Li's view, AI resembles a new industrial revolution, meaning it won't end in three to five years or produce "super apps" in a year or two. Instead, it's more like a thorough reconstruction of society over the next three to five decades.
In this marathon-like race of AI, one must not seek short-term gains but maintain sufficient patience and strategic focus to avoid falling behind or going astray. (End)