04/22 2026

On April 16, 2026, Tencent and Alibaba each released a "world model" product on the same day. The former is the open-source Hunyuan 3D World Model 2.0 (HY-World 2.0), while the latter is HappyOyster, which focuses on real-time interaction. Such coincidences are not uncommon in the tech industry, where competitors closely monitor each other's release schedules, neither wanting to fall behind.
Over the past two years, discussions around "world models" have been heating up in both academia and industry, though they have largely remained speculative and contentious. The topic was truly thrust into the public eye by comments made by Yann LeCun, Meta's former Chief AI Scientist, at an MIT seminar in late 2025. He stated, "Within three to five years, world models will replace LLMs as the dominant AI architecture. No rational person will still be using today's large language models."
These remarks upset many in Silicon Valley and brought the term "world models" into mainstream discourse.
Opinions in the industry are sharply divided on whether LeCun's prediction will come true. However, one thing is happening: capital, talent, and top labs are all converging in this direction. Li Feifei's World Labs has completed a new $1 billion funding round, NVIDIA's Cosmos platform has surpassed 5 million downloads, and LeCun himself left Meta to found AMI Labs, securing $1.03 billion in seed funding.
In China, Tencent, Alibaba, Shengshu Technology, and Qunhe Technology are each betting on different routes, with Chinese players participating far more deeply in this competition than most outside observers expect.
Against this backdrop, this article attempts to answer three questions: Where is the essential boundary between world models and large language models? How has the global technical landscape differentiated? And what is the real situation for Chinese players in this race? These three questions are interconnected; none can be fully understood in isolation.
The blind spot of large language models and where world models begin
The core mechanism of LLMs is to find patterns in language space—given preceding words, they predict the probability of the next word appearing.
After training on large-scale data, this mechanism has produced surprising capabilities: writing, reasoning, programming, and translation. However, at their core, these abilities remain rooted in statistical language patterns rather than a true understanding of the physical world. An LLM knows that "a glass falling to the ground will shatter" because this sentence has appeared countless times in its training data, not because it understands elastic modulus, stress propagation, and impact energy. For it, "gravity" is a word that frequently co-occurs with specific contexts but not a physical law that can be generalized to new scenarios.
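The mechanism can be caricatured in a few lines of code. The toy next-word predictor below (a hypothetical nine-word corpus, nothing like how a production LLM is actually trained) captures the essence of the paradigm: probabilities estimated from co-occurrence counts, with no model of the glass, the floor, or gravity behind them.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus -- an LLM reduced to its statistical essence:
# estimate P(next word | previous word) from co-occurrence counts alone.
corpus = "the glass falls the glass shatters the glass falls".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev):
    """Return the empirical distribution over the next word."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

# After "glass", the model says "falls" is twice as likely as "shatters" --
# purely because of how often each pair appeared, not because of physics.
print(next_word_probs("glass"))
```

Scaled up by many orders of magnitude, with transformers in place of counting tables, this is still prediction over words, which is exactly why the "gravity" the model knows is a distributional fact rather than a law.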
This distinction is irrelevant for tasks like chatting, summarization, and code generation, where LLMs are already highly effective. However, when AI needs to interact genuinely with the physical world, their limitations become clear.
For a robot to plan a path around obstacles to retrieve a cup from a table, it must understand 3D space, object shapes and masses, and the force and direction of actions. For an autonomous driving system to predict the position of a vehicle ahead in the next second, it must understand speed, acceleration, and driving intentions. For an AI character to behave reasonably in a game world, it must understand the causal structure of the scene, not just visual consistency of pixels. These tasks are fundamentally unsuitable for language modeling frameworks.
The starting point of world models is precisely to fill this gap. Simply put, world models predict not the next word but the next state—how an object's position in space will change, what chain reactions an action will trigger, or how light reflections on different surfaces will evolve as the viewpoint moves. They aim to construct an internal representation of physical reality, enabling AI to plan, predict, and infer within this representation rather than merely matching patterns in language space.
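The contrast with word prediction is easiest to see in code. In the sketch below (an illustrative toy in which the physics is written in by hand, whereas a real world model would have to learn an approximation of this transition function from raw observations), the model's output is a physical state, and rolling the transition forward yields a prediction about when and how hard a dropped glass hits the floor.

```python
from dataclasses import dataclass

G = 9.81   # gravitational acceleration, m/s^2
DT = 0.01  # time step, s

@dataclass
class State:
    height: float    # metres above the floor
    velocity: float  # m/s, positive = upward

def step(s: State) -> State:
    """Predict the next state: an explicit Euler step under constant gravity."""
    return State(height=s.height + s.velocity * DT,
                 velocity=s.velocity - G * DT)

# Roll a glass dropped from 1 m forward until it reaches the floor.
s = State(height=1.0, velocity=0.0)
t = 0.0
while s.height > 0.0:
    s = step(s)
    t += DT

print(f"impact after ~{t:.2f} s at {abs(s.velocity):.1f} m/s")
```

The point of the sketch is the interface, not the integrator: a world model's unit of prediction is "state at the next instant," a continuous, high-dimensional object governed by causal structure, rather than a probability over a discrete vocabulary.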
To use an imperfect but helpful analogy, an LLM is like a librarian who has read every travel guide and can tell you the name and history of any street or alley in Beijing but would not know which direction to walk to find the nearest subway station if placed on that street. In contrast, a world model aims to train a guide who has actually navigated the city and has an embodied perception of space.
This is not about the quantity of knowledge but the nature of knowledge.
However, "world models" are not currently a clearly defined technical concept. The work being done by different teams varies far more than their names suggest. Some teams are building video-based interactive generation systems, focusing on teaching models how visuals change in response to user actions. Others are generating editable 3D geometric assets directly from images or descriptions, emphasizing the engineering usability of outputs. Still others are providing physically accurate simulation training environments for robots and autonomous vehicles, focusing on data correctness.
These three approaches have limited overlap, and their underlying business logics are also entirely different. Understanding this is key to making sense of the current landscape.
Technical divergences and strategic choices among the three routes
From a technical standpoint, the global competition in world models is currently unfolding along three main directions, each with its own internal logic and inherent limitations.
The first route can be called "video-based world models." The core assumption here is that video is the richest record of the physical world, and by training models deeply on video data, they can learn how the world operates. Google's Genie series is an academic representative of this route. Genie 3, which opened an experimental preview to select researchers in August 2025, allows users to input text descriptions and generates interactive 3D scenes in real time.
Li Feifei's World Labs introduced Marble, which can generate stylized, navigable virtual worlds from text or images. Alibaba's ATH Business Unit's HappyOyster also follows this path, differentiating itself through a native multimodal architecture combined with streaming generation capabilities. The model continuously receives user commands during generation and responds in real time, allowing users to adjust the camera, rewrite the plot, and direct characters within the generated scene rather than waiting for a complete video to render before seeing results.
Currently, HappyOyster supports continuous real-time director-level interaction for over three minutes, making it the most mature domestic product in this route in terms of user experience. However, this route has a built-in limitation: world models based on video learning generate pixel-level consistency rather than necessarily physical-level authenticity. Smooth visuals do not guarantee true 3D structure; plausible lighting does not mean the model truly understands light propagation.
HappyOyster's technical documentation acknowledges that its navigation and director modes are not yet fully integrated and that consistency in long-duration scenes still needs improvement. This is a common engineering challenge for the entire route at this stage.
The second route is "3D asset-based world models," with Tencent's Hunyuan 3D World Model HY-World 2.0 being the most representative product to date. The key shift here is directly generating editable 3D geometric assets, such as meshes, 3D Gaussian Splats (3DGS), and point clouds, which can be seamlessly imported into mainstream game engines like Unity and Unreal Engine for secondary editing and physical interaction.
Tencent's bet on this route is driven by clear strategic logic. The company possesses vast amounts of 3D gaming data and mature engine engineering expertise, making game content production efficiency the most direct commercial validation scenario. Traditionally, modeling an open-world map takes months and involves dozens of artists; HY-World 2.0 can generate an interactive 3D game prototype scene in about 12 minutes. Even discounting for real-world complexities, the impact on the gaming industry would be enormous.
However, this route also has limitations. 3D asset generation solves content production efficiency but remains a generative model at its core, not a true physics-understanding simulation system. It can produce visually plausible 3D scenes but cannot guarantee physical correctness—such as collision detection, material properties, or dynamic behavior—which still require engineer intervention in game engines. This gap is acceptable for game prototyping but becomes problematic when migrating to robot training or digital twins, where physical precision is critical.
Thus, the third route is closer to the infrastructure layer and can be called "spatial data and simulation platforms." This approach does not focus on end-user products but instead provides high-quality 3D training data, physically accurate simulation environments, and toolchains for bridging virtual and real worlds.
The most notable domestic case in this route is Qunhe Technology. This home design software company entered the field from an entirely different angle than Tencent or Alibaba, discovering a path to spatial intelligence through over a decade of data accumulation in home design software.
The 480 million 3D models and 500 million structured spatial scenes accumulated on its Kujiale platform represent physically accurate real-world design data. At NVIDIA's GTC 2025 conference, Qunhe open-sourced its Spatial Language Model (SpatialLM), which can generate physically constrained 3D scene layouts from just a smartphone video. After release, it ranked second on Hugging Face's trending list. Its spatial intelligence platform, SpatialVerse, has established partnerships with embodied AI companies like Zhiyuan Robotics, Galaxy General, and Qiongche Intelligence, providing virtual training environments for robots.
On April 17, 2026, Qunhe Technology went public on the Hong Kong Stock Exchange as the "world's first spatial intelligence stock," with its share price opening 171% higher on the first day.
Comparing the three routes side by side, the competitive landscapes in China and the U.S. show clear structural differences. In the U.S., major platform companies (NVIDIA, Google) focus on general-purpose infrastructure and frontier research, while academic startups (World Labs, AMI Labs) explore the technology's frontiers. A mature commercial product layer has yet to emerge—Meta and OpenAI have been relatively cautious in their substantive investments in world models, with the former still at the theoretical stage and the latter remaining focused on commercializing large language models.
In China, leading tech giants prefer to enter from their strongest vertical scenarios, while a group of vertical data companies are positioning themselves at the upstream asset layer. The competitive logics differ: the U.S. emphasizes the generality of technical principles, while China prioritizes speed of scenario adoption and the scarcity of data assets. How these differences will manifest in the next stage of competition remains unclear.
The "hype" has begun, but "profitability" remains unclear
When shifting focus from macro-level path comparisons to micro-level industry operations, the aforementioned differences are spawning a series of concrete, short-term frictions within China. Chinese players have entered rapidly by leveraging scene and data advantages, but because they moved so quickly, foundational consensus and rules have yet to be established, creating unique systemic risks beneath the surface excitement.
These issues are rarely discussed openly in the industry but genuinely exist and will influence the trajectory of this sector over the next two to three years.
The first issue is that definition ambiguity is creating a false sense of prosperity. Currently, many domestic "world model" products use the same term but refer to vastly different things. Some are essentially video generation models with interactive packaging, others are 3D reconstruction tools with real-time rendering, and a few are genuinely attempting physical simulation.
This definitional chaos leads to misjudgments at the capital level, accumulates disappointment among users, and blurs the line between technical progress and market hype at the industry level. To give "true world models" an operationally meaningful standard, Xinlichang proposes the following test: Can the model autonomously learn causal relationships from raw perceptual data without explicit annotations, and make physically reliable predictions in never-before-seen scenarios?
By this standard, most current products still fall far short. This does not mean these products lack value, but equating iterative progress with a paradigm shift is a cognitive shortcut worth guarding against.
The second issue is that the perceived value of data barriers is overestimated. Chinese players do possess genuine data advantages, such as Tencent's 3D gaming data, Qunhe's spatial design data, and autonomous driving companies' road test data, which represent real moats in terms of volume. However, world models have fundamentally different data requirements than large language models. LLMs can learn useful patterns from vast but noisy text data, where breadth matters more than precision; world models require physically correct, temporally coherent, and precisely annotated 3D data, where quality trumps quantity.
The proportion of existing data assets truly usable for world model training is far lower than claimed. The controversy over synthetic data further complicates this issue: because collecting high-quality real 3D data is extremely costly, many teams use simulators to generate synthetic data for training sets.
Research trends reported by Nature in 2024 show that continuous training with synthetic data leads to accelerated model performance degradation over iteration cycles, a phenomenon researchers liken to "inbreeding." As of today, no widely accepted solution exists, meaning Chinese players' data advantages are more fragile than imagined.
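The mechanism is easy to reproduce in miniature. The toy experiment below (a deliberately simplified Gaussian setting, nothing like the Nature study's LLM-scale setup) refits a model on samples drawn from its own previous generation; estimation error compounds across generations and the learned distribution's spread collapses toward zero, which is the diversity loss the "inbreeding" analogy points at.

```python
import numpy as np

# Toy illustration of recursive training on synthetic data (hypothetical
# setup): each generation, draw a finite sample from the current model and
# refit the model on that sample alone. Small estimation errors compound.
rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0  # generation 0: the "real world" distribution
n_samples = 50        # synthetic dataset size per generation

stds = [sigma]
for generation in range(400):
    data = rng.normal(mu, sigma, n_samples)  # sample from the current model
    mu, sigma = data.mean(), data.std()      # refit the model on its own output
    stds.append(sigma)

print(f"std after 400 generations: {stds[-1]:.4f} (started at {stds[0]:.1f})")
```

Real pipelines mix synthetic data with fresh real data, which slows the collapse but, per the reported findings, does not obviously eliminate it.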
The third issue, a perennial one, is that commercialization paths remain unsolved. After ChatGPT, large language models' business models gradually became clear—API pricing, enterprise subscriptions, and vertical industry deployments—with proven pathways.
For world models, however, no company has yet demonstrated a replicable commercial closed loop. Tencent's HY-World 2.0 is currently open-sourced primarily as a developer tool, while 96.9% of Qunhe Technology's 2025 revenue came from software subscription services (mainly the Kujiale and Coohom products). Spatial intelligence-related businesses (including SpatialVerse) accounted for just 3.1%, with the core SpatialVerse platform contributing only 0.6%.
Game companies are willing to pay for AI-generated 3D scenes only if the quality can truly replace or significantly reduce labor costs—a gap that still exists. The integration cycle for film and television industry workflows is much longer than estimated, and procurement volumes from embodied AI companies have not yet reached commercialization thresholds. World models currently resemble a promissory note with vast imagination space but uncertain realization timelines.
This represents both current difficulties and future opportunities. Undoubtedly, the first player to validate a replicable commercial unit in a vertical scene will gain a first-mover advantage far out of proportion to its scale.
Epilogue
The rise of large language models has proven that when language prediction is conducted on a sufficiently large scale, the emergent capabilities can far exceed the designers' expectations. The core bet of "world models" is whether this logic of "emergence through scale" can be transferred to modeling the physical world.
The technical challenges are real. The complexity of the physical world far exceeds that of language spaces. The basic units of language are discrete words, while the states of the physical world are continuous and high-dimensional, relying on causal structures far more complex than grammatical rules. Data collection and annotation costs are orders of magnitude higher than those for text. Training paradigms need to be redesigned, and evaluation methods are far less mature than those in the NLP field. This path is longer and more difficult than the one taken by language models, filled with unknown detours.
However, the driving forces are equally real. There is a genuine demand for "AI that truly understands the physical world" in fields such as robotics, autonomous driving, digital twins, and immersive content, and this demand will only grow stronger with the proliferation of intelligent hardware.
China's strengths and weaknesses in this competition are very specific: the accumulation of scenario-based data and the pressure to implement solutions in vertical industries provide strong support, while the depth of basic research and the pathways for commercial validation are real weaknesses.
The fact that Tencent and Alibaba released world model products on the same day indicates that among China's top tech companies, a consensus has formed regarding the next major battleground for AI. Whether this consensus is correct will be validated by time.
This moment may be closer than we imagine, yet also farther than we hope.