The Reference Frame of AI Has Changed After Google I/O

05/21 2026 569

    Where is the next alpha in AI?

    Author | Jingxing Editor | Guxi

Currently, the industry consensus on the Coding era has been established.

“Despite raising our token prices, customer acceptance remains high, with sustained strong demand. Even now, supply cannot fully meet demand, and a large number of customers are still waiting in line for our services.”

During last week's Q4 FY2026 earnings call, Alibaba CEO Wu Yongming's remarks highlighted the enormous size of the Coding market.

AI has finally moved from product launches into corporate production budgets. Alibaba has addressed the first question: Is there real demand for AI?

The second question comes from Google: What will AI look like next?

In the early morning hours of May 20, Beijing time, Google I/O 2026 kicked off as scheduled.

The highlight of this conference was undoubtedly the demonstration of intelligent agents and multimodal capabilities. With the release of Gemini Omni Flash, Google provided a precise definition—supporting input in any modality and generating output in any modality.

The video output showcased at the conference is just the beginning. According to Google's plans, Omni has the capability to achieve omnimodal output across text, images, audio, and video. Based on Gemini's world model capabilities, it can generate more precise physical effects, including gravity and dynamics.

For Google, Omni is no longer just a video model but a true super content creation gateway. It can be embedded into all creator workflows, creating a multimodal application market with even greater imagination than Coding.

Compared to programming, this is where AI's real wealth lies. From industry-standard pricing, the price per million tokens for video models is significantly higher than for images and text. This means that as token usage increases, video will create far greater API value than text.

More importantly, multimodal is approaching a historic technological inflection point.

Compared to the early model of simply stitching together text, image, and video models, the emergence of unified omnimodal models like Google Gemini Omni in 2026 marks the industry's entry into a brand-new era.

Multimodal: The Next Token Inflection Point

OpenAI CEO Sam Altman did not expect that achieving 1 million users would take ChatGPT five days at launch, while GPT-4o's image generation took only one hour.

With its highly realistic Ghibli-style artwork, GPT-4o's image generation feature became an instant hit upon release. OpenAI had to restrict free access and pleaded with users to stop generating images Crazy Life Picture ( Crazy Life Picture means " Crazy image generation " or " Crazy creation of images " in a more natural English phrasing would be "to stop generating images excessively"), allowing the team to get some rest.

This year's image generation model, Image 2, broke GPT-4o's record by gaining 1.8 million new users globally in just one hour. Within a week, it surpassed 120 million active users worldwide, driving a 23% month-over-month increase in ChatGPT Plus subscriptions.

The release of Google Nano Banana 2 earlier this year achieved global dominance in testing. The product reduced the generation time for a single 4K highly detailed image from minutes to seconds.

To date, the Nano Banana series has generated over 50 billion images. Media outlets commented that Google is ending the era of Photoshop.

Undoubtedly, groundbreaking multimodal models wield decisive market influence.

At last year's Google I/O conference, VEO 3 made a stunning debut, with its fruit-cutting video going viral on TikTok. In just six months, the total number of generated videos exceeded 230 million. Some media outlets wrote that VEO 3 saved Google's financial results.

But even greater disruptions are on the horizon.

A few days ago, a Reddit user accidentally discovered and shared a demo of Gemini Omni, instantly igniting the global AI community:

A teacher lectures while writing formulas on the blackboard, with the entire process—voice, visuals, and board text—flowing precisely and smoothly.

An X user commented that the Nano Banana moment for video models has arrived.

Gemini Omni's impressiveness doesn't stop there. The model supports one-click watermark removal, object replacement, and adaptive lighting effects. From the demonstration, its text consistency and character continuity surpass all previous video models.

AI users who have experienced Mars-like text outputs understand how difficult it is for AI to create content with clear and accurate text, let alone mathematical formulas written on a blackboard during a lecture.

Compared to VEO, Google Omni is a true omnimodal input and output model, allowing users to mix content from any modality as input to generate high-quality videos while supporting conversational editing.

This means Google Omni can handle all modality analysis and generation within a single unified model, rather than relying on multiple systems for post-processing integration.

According to Google, Omni represents an evolution of the Gemini main architecture, extending its native multimodal capabilities—present since its inception—from input to output.

In contrast, VEO and Nano Banana are not standalone products but capability components within Omni.

During the live demonstration, Google executives showcased specific editing scenarios: When a user inputs “Change the background to snow,” the model alters the video environment; when they input “Switch to a side-tracking angle,” the camera movement adjusts accordingly; when they input “Add narration,” the video generates commentary and background music.

From start to finish, users can modify videos through simple dialogue, precisely controlling every detail without switching threads or re-uploading. This completely rewrites the prompt-based, luck-dependent model of previous-generation video models like VEO.

DeepMind CEO Demis Hassabis stated that future versions of Omni will support input and output in any modality, with entry points covering the Gemini app, Google Flow, and YouTube Shorts. Even more powerful versions of Omni will be released later.

Google's ambition is clear. It aims to create a true world model without media limitations or modality barriers, where AI can interact with the world in any human-understandable way, defining the future form of AI with a single model.

Supporting this ambition is omnimodal capability.

Many people don't realize that unified omnimodal models have advantages in research and development efficiency.

When performing cross-modal tasks, improvements in text understanding can enhance image and video quality, making generated content more logical. Training data from images and videos can help models better understand the physical world, improving text reasoning and common-sense judgment.

This creates a positive feedback loop of 1+1>2. It also explains why leading figures like Yann LeCun and Fei-Fei Li insist that multimodal world models represent the future path of AI.

In the past, the market focused on Coding and underestimated multimodal. This paradigm is now being overturned.

Morgan Stanley recently noted in a research report that Minimax's potential value has been overlooked by the market, with its ARR projected to reach $1 billion by the end of 2026. A key reason is that the market underestimates the commercial value of multimodal technology, particularly the mutual reinforcement between large language models and multimodal models.

This statement exposes the biggest blind spot in the current AI industry.

A Native Five-Senses Powerhouse?

Turning to the domestic market, a wave of technology-driven growth is brewing.

Morgan Stanley points out that China's model market has reached a convex explosive inflection point, poised to replicate the supernova-like growth speed of the U.S. market. There are two reasons: First, model capabilities are approaching or surpassing previous leading U.S. products. Second, compared to U.S. models, Chinese models generally offer more competitive pricing.

In the domestic market, the current narrative logic among major players is highly convergent: Competing to become a domestic alternative to Claude, then finding unique advantages, such as specializing in long-form text, intelligent agents, or reasoning. Finally, they compete on subscription pricing to break out of the red ocean.

But this is not the full picture of the market.

Some players remain highly aligned with Gemini Omni's technical direction and are poised to replicate this ecological niche domestically first—namely, Minimax.

Recently, Goldman Sachs listed ByteDance, Alibaba, and Minimax together, citing Minimax's unique and comprehensive omnimodal layout among Chinese independent AI firms, as well as its industry-leading cost-effective and flexible computing architecture.

Goldman Sachs: Chinese Multimodal Models Continue Global Expansion, Focus on Hailuo 3

According to Goldman Sachs' predictions, the release of M3 and Hailuo 3 models will mark a major milestone for Minimax, with text API business gross margins reaching 40% and multimodal API business gross margins reaching 60-70%, higher than industry peers.

UBS sets Minimax's target price at HK$1,000, reasoning that as multimodal capabilities unlock potential, synergistic R&D across modalities will reduce training costs and rapidly enhance model capabilities.

In other words, multimodal R&D brings Minimax far more than just a product matrix—it includes a more refined and efficient engineering framework. This will further lower the barrier for enterprise models, expanding from developers to ordinary users.

J.P. Morgan gives Minimax an “Overweight” rating, citing a “rare combination of technical strength, multimodal commercialization potential, and global scalability.”

Minimax is not only the only independent large model vendor in China with full-stack capabilities across “text + images + video + audio + music” but also ranks globally in the top tier for text, voice, and video generation.

In the past, omnimodal was often misunderstood as a “feature checklist”—ticking boxes for text, images, video, voice, and music. But in reality, the true value of omnimodal lies not in “what it can do” but in “whether these capabilities can mutually reinforce each other.” This is the fundamental difference between an innate route choice and a patchwork upgrade later.

Video generation exemplifies this point.

Text models claim to understand the physical world, but it's hard to verify. You can ask it to write an essay about an apple falling, and it will sound convincing, but you'll never know if it truly understands gravity.

Video generation is different—it reveals flaws in a second. Are hand positions correct? Do object trajectories follow physical laws? Is camera switching smooth? Is text clear and accurate? Are audio and visuals synchronized? A single mistake is immediately noticeable to users.

This represents the ultimate test of a large model's world understanding capabilities. It requires not only stronger spatial understanding but also causal reasoning, long-range consistency, and multi-object relationship modeling. These, in turn, enhance text, agent, and tool-calling performance.

In other words, a unified omnimodal model is not a simple sum of five independent models but an organic whole.

This is Minimax's approach—from its M-series large language models to the Hailuo video model and Music audio model. This completeness of self-developed omnimodal capabilities and land ( land means "implementation" or "deployment" in this context) is unique among domestic independent AI enterprises.

This foundational, disruptive, and innate integrated route enables Minimax to achieve smoother full-sensory intelligence at lower costs.

Morgan Stanley calculates that through infrastructure optimization, Minimax can generate approximately $1 in revenue per minute on an 8-card H800 inference server, with costs below $0.30, while the industry average is around $0.50 per minute.

According to its prospectus, since inception, Minimax has spent only $500 million to reach the global top tier in multimodal capabilities—a scale only about 1% of OpenAI's costs.

When its text large model M2 was released, it ranked first in open-source evaluations on the global authoritative benchmark Artificial Analysis, with a comprehensive reasoning cost of just $0.53 per million tokens—only 8% of Claude 4.5 Sonnet's cost—and twice the reasoning speed.

At the same time, through its omnimodal model technical route, Minimax can iteratively improve text, image, audio, and video capabilities in tandem, breaking the impossible trinity of iteration efficiency, training cost, and model performance.

Last year's release of Minimax's video model helped global creators generate over 600 million videos in just about a month, while its voice model, with globally leading ultra-low latency, has generated over 200 million hours of speech to date.

In other words, with its globally top-tier multimodal model capabilities, Minimax's models have already become core infrastructure in the global multimodal field.

The Growth Inflection Point of Pure-Play

For investors, the most pressing question now is: Who will become the next rising star in the omnimodal explosion?

The answer is likely Minimax, which demonstrates the qualities of a scarce asset and is poised to reap three historic dividends.

The first dividend is the industry beta dividend ( dividend means "dividend" or "bonus" in this context) of rising token prices and volumes, as validated by Alibaba's MaaS performance.

Alibaba's FY2026 financial results show that its AI model and application services ARR (Annualized Recurring Revenue), including the BaiLian MaaS platform, has surpassed RMB 8 billion and is projected to exceed RMB 30 billion by year-end.

Wu Yongming's remarks prove that the agent market is undersupplied, with distinct seller's market characteristics. Behind this, the market logic has completely shifted.

J.P. Morgan points out that the current market battleground has moved from token pricing to model capabilities. Against a backdrop of highly robust demand, the optimal strategy is not to cut prices but to enhance model capabilities. Players with faster technical direction and iteration speeds will lead the market.

The second dividend is the industry alpha of multimodal valuation reassessment, catalyzed by Google's omnimodal foundation model route.

In the past, pure text model companies enjoyed most of the AI market's valuation premiums. However, omnimodal foundation models will disrupt this perception—all scenarios requiring visual, auditory, or spatial perception, such as education, media, industrial, medical, and consumer applications, will have room to grow. Their commercial ceiling will far exceed that of pure text.

With the advent of a full-modality foundational model boasting superior comprehension capabilities, full-modality is set to experience a turning point in valuation.

The third factor is the valuation flexibility dividend enjoyed by Pure-Play, an independently operating AI enterprise in China.

The AI businesses of tech giants are often diluted within their massive revenues. Alibaba's MaaS revenue remains at a low proportion, while ByteDance's AI capabilities are dispersed across multiple product lines, making it difficult for the market to precisely align valuation benchmarks with their AI businesses.

However, Minimax's model capabilities serve as its primary engine, with all revenue generated directly from the models themselves, undiluted by any other businesses. This difference in purity will significantly amplify the slope of its growth curve.

This implies that when the large model industry experiences a boom, Minimax's performance will demonstrate greater elasticity.

In other words, Alibaba has proven the viability of the industry beta, forming a closed logical loop; Google will drive the alpha of the full-modality technological route; and Minimax is poised to capture another unique alpha in China's AI landscape.

The upcoming model upgrade will serve as a clarion call for this revaluation.

During the 2025 financial results conference, Minimax's founder and CEO, Yan Junjie, explicitly revealed that the M3 and Hailuo 3 models, set to be released in the first half of this year, will advance to the direct generation stage of medium-to-long-form production-grade content. This will elevate the platform's token demand by one to two orders of magnitude.

Morgan Stanley stated that M3 is expected to rival the performance of the world's top models and demonstrate multimodal comprehension capabilities.

Hailuo 3 is anticipated to replicate the ecological niche of Seedance 2.0. Goldman Sachs indicated that Hailuo's next-generation model will achieve qualitative advancements in audio-video synchronization, editing capabilities, and multi-shot generation, while lowering production barriers for ordinary users.

More importantly, Hailuo 3 will be part of Minimax's full-modality foundation. This means that Hailuo 3's technological path will seamlessly integrate with text, image, and audio capabilities, enabling more complex multimodal tasks.

Soon, we will witness China's newest attempt in the full-modality foundational model direction, coming closest to Google's philosophy.

For this reason, leading investment banks generally consider Minimax one of the most investment-worthy targets in the current AI industry. As the only independent full-modality large model vendor in China, it not only follows a technological path closest to Google's but also possesses untapped growth potential.

As the release windows for M3 and Hailuo 3 draw near, Minimax's scarcity is transitioning from a 'technological narrative' to a 'financial reality.' After industry revaluation and the release of the new generation of models, the market's judgment may differ entirely.

Solemnly declare: the copyright of this article belongs to the original author. The reprinted article is only for the purpose of spreading more information. If the author's information is marked incorrectly, please contact us immediately to modify or delete it. Thank you.