06/14 2024
Will video generation large models be the next competitive high ground?
Written by | Bluehole Business, Zhao Weiwei
Since the launch of Kuaishou's self-developed video generation large model "Keling," the number of reservation applications has exceeded 65,000, sparking industry buzz.
The reason is simple. Since OpenAI released its text-to-video model Sora, it has remained in internal testing and unavailable to the outside world, whereas Kuaishou's "Keling" was opened for testing on release: through Kuaishou's creative tool Kuaiying App, users can apply directly for the public beta. Once approved, they can generate videos from text at up to 2 minutes and 1080p, with visual effects not inferior to OpenAI's Sora.
Text-to-video requires tremendous computing power resources and higher demands on the model's capabilities, making it a territory that domestic large model vendors have not yet fully competed in. Surprisingly, Kuaishou's "Keling" has become the first domestic large model to "hand in the exam paper," ahead of ByteDance.
However, Kuaishou's leading edge will not last long. "ByteDance's video generation large model is also in internal testing and is expected to be released soon," an industry insider revealed, predicting that, similar to Kuaishou, ByteDance's video generation large model will also be launched first through its creative tool Jianying.
Moreover, on June 13th, just one week after the launch of "Keling," Luma AI released its latest text-to-video model, Dream Machine, which is freely available to all users. It can generate 120 frames in 120 seconds and quickly produce 5-second clips with movie-level visual effects. In addition, Luma's model surpasses Kuaishou's Keling in the richness of its aesthetic style options.
More competitors are on their way. "Before the end of June, large model vendors will continue to release Sora-like model products, and text-to-video and image-to-video large models will flourish everywhere," according to industry analysts. Various large model vendors already possessed video generation capabilities, but because of the high computing cost and the lack of comprehensive optimization of video effects, these capabilities have not been fully rolled out.
The battle of large models has evolved from technology to applications, from the competition of hundreds of models to price wars. Will video generation large models be the next competitive high ground? The answer is being revealed.
Overtaking ByteDance on the Curve?
Industry analysts believe that "Keling's effect is currently the best among China's Sora-like models, and it is surprisingly from the Kuaishou team."
After the launch of Kuaishou's self-developed video generation large model "Keling," what surprised the outside world was, on the one hand, that the video generation effect could rival Sora and, on the other hand, that it came from the Kuaishou team. In previous large model competitions, Kuaishou was not a first-tier player that attracted attention. Kuaishou had previously released the general-purpose large language model "Kuaiyi" and the text-to-image large model product "Ketu," but their influence was limited. That changed with "Keling."
Judging from the data Kuaishou has released for "Keling," the model is indeed benchmarked against Sora and positioned as the Chinese version of it.
From a technical perspective, Kuaishou's "Keling" adopts a DiT (Diffusion Transformer) architecture similar to Sora's, replacing the convolutional U-Net of traditional diffusion models with a Transformer. Kuaishou's large model team has also developed a 3D spatio-temporal joint attention module and a 3D VAE network to achieve better spatio-temporal motion modeling and more efficient latent space encoding and decoding.
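To make the idea of joint spatio-temporal attention concrete, here is a minimal sketch in plain Python. It is not Kuaishou's implementation, whose details are not public in this article. It assumes a toy latent video indexed as latent[t][h][w], uses a single attention head with identity projections (no learned weights), and the names flatten_video, joint_attention and attention_pairs are hypothetical. The point is only that every token attends to every other token across both space and time, which costs more pairwise interactions than attending over space and time separately.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_video(latent):
    """Turn latent[t][h][w] (each entry a feature vector) into one token sequence of length T*H*W."""
    return [vec for frame in latent for row in frame for vec in row]


def joint_attention(tokens):
    """Toy single-head self-attention over ALL tokens at once.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections), purely for illustration.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in tokens:
        # Score this token against every token in every frame.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Weighted average of all value vectors.
        output.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)]
        )
    return output


def attention_pairs(frames, height, width):
    """Count query-key pairs for joint attention vs. separate spatial and temporal attention."""
    joint = (frames * height * width) ** 2
    factorized = frames * (height * width) ** 2 + (height * width) * frames ** 2
    return joint, factorized


# A toy latent: 2 frames of 2x2 positions, each a 3-dimensional feature vector.
latent = [
    [[[float(t + h + w), 1.0, 0.0] for w in range(2)] for h in range(2)]
    for t in range(2)
]
tokens = flatten_video(latent)
mixed = joint_attention(tokens)
pair_counts = attention_pairs(2, 2, 2)
```

For this toy case the joint variant compares 64 token pairs against 48 for the factorized variant. The gap widens quickly as the number of frames and the resolution grow, which is one way to see why long, high-resolution video generation is so demanding on computing power.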
And from "Keling's" official website, one can easily see its product selling points.
Most notably, Keling supports generating 2-minute videos at 30 frames per second, with a resolution of up to 1080p and freely customizable aspect ratios, far surpassing Sora and domestic large model vendors. In terms of video generation effects, Keling emphasizes three major advantages: generating large-scale, reasonable movements; simulating the characteristics of the physical world; and combining concepts with imagination.
In terms of dissemination, "Keling" also differs from previous domestic large model releases. It first gained attention on foreign social media and only then grew in popularity at home, a pattern described as "export turned domestic sale" or "flowers blooming inside the wall but fragrant outside it."
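A quick back-of-the-envelope calculation shows what those headline numbers imply. This sketch assumes a 16:9 frame of 1920 x 1080 pixels for "1080p" (the article notes that aspect ratios are customizable, so this is only one case), and the helper names are hypothetical.

```python
def frame_count(duration_seconds, frames_per_second):
    """Total number of frames in a clip."""
    return duration_seconds * frames_per_second


def raw_pixel_count(frames, width, height):
    """Total pixels across all frames, before any compression or latent encoding."""
    return frames * width * height


# 2 minutes at 30 frames per second, assuming 1920 x 1080 frames.
keling_frames = frame_count(120, 30)
keling_pixels = raw_pixel_count(keling_frames, 1920, 1080)
print(keling_frames, keling_pixels)
```

Under these assumptions a single maximum-length clip contains 3,600 frames and roughly 7.5 billion raw pixels, which helps explain why models of this kind work in a compressed latent space rather than on pixels directly.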
On Twitter, user reviews and evaluations of "Keling" are widespread.
"I feel like everyone shouldn't wait for expensive and time-consuming industrial-grade AI like Sora. Just use Keling for free first. Kuaishou really surprised me this time."
"Compared to foreign Sora-like video generation large models, Chinese large model developers have a deeper understanding of local culture, and the content their models generate can better meet the needs of local users."
"I paid for an annual Kuaiying membership in the afternoon, and it seems I skipped the queue for Kuaishou's Keling and can generate videos directly by changing the prompt. The effect is amazing. It takes about 3 minutes to generate a video with a VIP membership."
Riding on its popularity on foreign social media, "Keling" has gained momentum at home. A week after its release, Kuaishou officially recommended the product on its official WeChat public account under the title, "Are you 'Keling' today?"
In fact, internet companies like Tencent and ByteDance also possess video generation large models, but they have not yet been fully released for public testing or their effects are not satisfactory. ByteDance's Jianying product "Jimeng" has short video generation capabilities, allowing users to choose camera movement types, video ratios, and movement speeds to generate 3-6 second videos. However, in terms of video presentation effects and duration, it does not demonstrate an advantage comparable to Sora.
This further highlights the unexpected advantage of Kuaishou's "Keling," as the industry has always believed that solid model training is essential and that there is no shortcut to overtaking on the curve. If the base model is not well trained, neither text-to-text nor text-to-image generation will be good, let alone text-to-video generation. Yet Kuaishou's video large model has successfully staged a surprise attack.
The People Behind Keling
Who is the decisive figure behind "Keling"? This may be a story of talent mobility.
Just a few days before the official launch of Kuaishou's "Keling," Wang Xintao, an expert researcher at Kuaishou, gave an academic talk titled "An Initial Exploration of Video Generation and Its Controllability," which was regarded as a window into Kuaishou's internal thinking on the technical aspects of the "Keling" large model. The accompanying slides quickly circulated and became a resource for large model industry research.
After the launch of "Keling," Wang Xintao appeared again at an artificial intelligence academic session in Shenzhen. He said that the core challenge in catching up with Sora lies in learning physical laws in long videos and long shots, so that the generated videos have a high degree of physical consistency.
Therefore, this is the issue that Wang Xintao believes is most worthy of in-depth research. "Traditionally, AI-generated videos are often limited to a single shot, lacking coherence and realism in complex scenarios. However, Sora can achieve smooth transitions between shots in complex long videos while maintaining strong 3D, temporal, and physical consistency."
In fact, Wang Xintao has not been with Kuaishou for long. Currently, he is a senior researcher at Kuaishou's Visual Generation and Interaction Center, belonging to Kuaishou's Multi-Model and AIGC department, responsible for research on visual content generation. Public information shows that last year, he was still a senior researcher at Tencent AI Lab, leading work on visual content generation (AIGC).
It can be said that the surprise attack behind Kuaishou's "Keling" is inseparable from the contribution of Tencent's former AI forces like Wang Xintao.
When Tencent open-sourced its Hunyuan large model, it had already disclosed that the model possessed various video generation capabilities such as text-to-video, image-to-video, text-image-to-video, and video-to-video, and already supported 16-second video generation. At that time, Lu Qinglin, the head of Hunyuan's text-to-image large model, mentioned that alignment between different modalities was one of the difficulties. Hunyuan wanted to generate video and audio simultaneously, but doing so with a single model meant solving the problem of keeping the two outputs aligned.
On the other hand, the popularity of "Keling" also reflects a certain sense of loss among former Kuaishou AI team members.
Wang Zhongyuan, the former technical vice president of Kuaishou, is now the president of the Beijing Academy of Artificial Intelligence. Last December, during Kuaishou's organizational restructuring, the main station, e-commerce, and commercialization divisions all embraced change, while Wang Zhongyuan, responsible for AI business, no longer held any position.
Just half a year ago, as the head of Kuaishou's AI & User Growth Business, Wang Zhongyuan first announced the progress of Kuaishou's AIGC at the Kuaishou Creator Conference, with the core aim of enhancing the creativity and productivity of short video content. At that time, Kuaishou had already opened up the "Ketu" large model product, which supported both text-to-image and image-to-image functions and had launched over 20 AI image effects.
For Kuaishou, 2023 was a year without a CTO, as well as a year of team formation and business implementation for its large models. In terms of organizational structure, Kuaishou's large model team belongs to the company's Community Science line, and its work covers large language models, text-to-image large models, video generation large models, and other directions. Compared with peers, however, both the large language model and the text-to-image model have been lackluster.
China's version of Sora is undoubtedly one of Wang Zhongyuan's expectations, but it's unclear how he views "Keling."
After leaving Kuaishou, Wang Zhongyuan gave an interview on behalf of the Beijing Academy of Artificial Intelligence in which he said that progress toward AGI (Artificial General Intelligence) is accelerating: he now believes AGI may emerge in another four or five years, compared to his previous estimate of 40-50 years.
"The emergence of Sora is also a momentous occasion. Its true value is not just generating beautiful videos from text, but showing that large models may possess the ability to understand the three-dimensional world. In other words, Sora has initially demonstrated the scaling law on the world model," said Wang Zhongyuan.
How Long Can the First Place Last?
Kuaishou's "Keling" is highly praised at present, but how long can it maintain its position as China's first Sora?
The only usage channel for "Keling" is Kuaishou's creative tool Kuaiying App, but there has not been a significant fluctuation in Kuaiying App's download data. According to data from Qimai, the average daily downloads on the App Store in the past seven days have remained at around 20,000, and its ranking on the App (free) and Photography & Videography (free) charts has remained stable, without significant changes.
From a business perspective, "Keling" currently attracts more C-end consumers. Compared to models like text-to-image and text-to-text, which already have extensive usage scenarios in areas like advertising, the usage scenarios for text-to-video large models are still limited. Therefore, the strategy often serves content producers first, continuously expanding usage scenarios on the consumer end, and ultimately attracting B-end customers and merchants to pay for usage.
More importantly, competitors aiming to be China's first Sora are on their way.
On the one hand, in the domestic market, according to insiders, ByteDance's video generation large model is also in internal testing and is expected to be released soon, relying on its creative tool Jianying for launch. For Jianying, the previously launched "Jimeng" has already achieved the corresponding functions of text-to-video large models, but the optimization of these functions is currently insufficient.
On the other hand, the international market is changing even faster. On June 13th, Luma AI released its own video generation model, Dream Machine, which allows users to generate high-quality HD videos from text or images. Going one step further than Kuaishou's "Keling," Luma has achieved free and full access, allowing users to log in and use it without reservation or waiting.
However, like Kuaishou's "Keling," Luma AI also faces insufficient computing power, and users must wait a long time to use it. The wait may also end in a failed generation, so computing power remains the biggest bottleneck restricting text-to-video large models.
The large model industry has previously disclosed relevant data, indicating that to achieve a level similar to Sora, large models require thousands of GPU cards for computing power, and further optimization capabilities require tens of thousands of GPU cards, which means the ability to mobilize large-scale computing power clusters, whether using Nvidia's flagship GPU chips or Huawei's Ascend domestic AI chips.
Large model competition is still in its early stages, and AI large models are a bonus for cloud services. How to effectively implement applications and minimize costs remains a common challenge facing the current large model industry.
Compared to ByteDance's large model strategy, Kuaishou's strategy of relying solely on "Keling" is still insufficient. ByteDance's Doubao large model is characterized by its low cost, significantly reducing the unit cost of model inference through price wars, attracting B-end customers into Volcano Engine's cloud services. If ByteDance releases a video generation large model, it will undoubtedly follow a path of lower costs.
Regardless, catching up with Sora has become one of the main consensuses and tasks of the large model industry in 2024. Kuaishou, aiming to maintain its position as China's first Sora, still faces a cruel test.