April 21, 2026
507
Produced by | RoboIsland
On April 16, Alibaba unveiled its open-world model Happy Oyster, while Tencent open-sourced its 3D world model HY-World 2.0 the same day. With those twin announcements, the two Chinese internet giants asserted their presence in the world model arena.
Less than a month prior, Li Feifei's World Labs had just secured $1 billion in funding, and Yann LeCun's AMI Labs stunned Silicon Valley with a $1.03 billion seed round.
Capital, giants, and entrepreneurs are flocking in, with a resounding slogan quickly spreading across the industry: World models are the most important frontier after large language models.
But if you were to ask these players, 'What exactly is a world model?' you'd likely get a stack of contradictory answers.
Some say it's an 'interactive 3D world,' others call it a 'causal model that understands physical laws,' some describe it as a 'digital simulator for robot training,' and still others simply say, 'It's just more advanced video generation.'
This isn't a mere divergence of academic opinion; it's the cognitive chaos gripping the entire frontier.
This article attempts to untangle this chaos through three progressively deeper questions: Why are all the major players suddenly betting on world models? What exactly are their products doing, and which claims are real and which are hollow? And how deep are the dilemmas and gray areas obscured by the hype?
1. Why the Sudden All-In on World Models?
To understand why world models suddenly exploded in popularity, we must first revisit an awkward truth about large language models (LLMs).
Over the past two years, ChatGPT and its peers have showcased astonishing linguistic abilities but also exposed a fatal flaw: They don't understand the physical world.
Ask an LLM, 'What happens if you push a cup off the edge of a table?' It can answer, 'The cup will fall to the ground,' but it doesn't truly grasp gravity, acceleration, or collision—it merely recalls similar sentences from its training data.
A study in early 2026 pointed out that hallucinations are not a data or training issue but an inherent flaw in LLM architecture.
This flaw might be tolerable in pure text tasks, but when AI enters the real world—manipulating robots, driving cars, or operating in factories—it becomes an insurmountable obstacle. You can't let an autonomous driving model 'probably correctly' judge obstacles ahead, nor can you allow an industrial robot to 'approximately' predict the trajectory of parts.
Thus, a more fundamental need emerges: We need an AI that understands the causality of the physical world.
It must not only speak but also act; not only see but also anticipate. This is the fundamental reason world models have been thrust into the spotlight.
Large language models transformed the relationship between humans and information, while world models aim to transform the relationship between humans and reality.
Over the past two years, AI commercialization has mainly remained at the level of information processing: writing copy, translating text, generating code. But the next growth engines clearly lie in the physical world: embodied intelligence, autonomous driving, and smart manufacturing.
These scenarios all require AI to understand space, predict dynamics, and plan actions.
Therefore, when major players bet on world models, they are essentially vying for the technological high ground in the 'post-LLM era.' Whoever first enables AI to truly understand the physical world will dominate the next industrial cycle.
Domestic and international players have vastly different strategies.
In the U.S., DeepMind, World Labs, and AMI Labs resemble basic science endeavors.
They focus on granting AI physical intuition and causal reasoning abilities akin to humans, with commercialization as a distant goal. Yann LeCun himself admits that AMI's products may not be visible for several years.
In China, the landscape is different. Alibaba and Tencent almost immediately tied their models to commercial scenarios: Happy Oyster targets paying users in film and game development, while HY-World 2.0 directly outputs 3D assets importable into Unity/UE, engaging in the business of 'AI-created worlds.'
Then there's Sand.ai's VidMuse, which focuses on the niche scenario of music-generated videos and achieved annual revenue in the tens of millions of dollars within months of launch.
The logic of Chinese teams is pragmatic: World models must first be profitable products.
Neither approach is inherently superior, but each dictates its own pace and risk: U.S. teams dare to bet on breakthroughs a decade away, while Chinese teams must see returns within a year.
The problem is that when everyone crowds under the same buzzword, outsiders struggle to discern who is doing what.
2. Interrogating the Technical Claims
After spending time reviewing each product's introduction, you're likely to fall into deeper confusion. Every world model looks different, and their underlying logics even contradict each other.
Consider the most counterintuitive camp: Yann LeCun's AMI Labs takes a path few dare to follow, arguing that AI doesn't need to generate photorealistic visuals.
LeCun's JEPA architecture deliberately discards pixel details, making predictions only in abstract latent spaces. The latest LeWorldModel has just 15 million parameters and trains in a few hours on a single GPU, yet its planning speed is 48 times faster than traditional methods.
The downside? Its outputs are incomprehensible to humans. You can't 'see' its predicted future—you can only trust it calculated correctly.
This is a purely academic route, far from ordinary users, but LeCun bets that true intelligence doesn't require simulating every falling leaf—only understanding that 'wind blows leaves off trees.'
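For intuition, here is a minimal, illustrative PyTorch sketch of the JEPA idea: encode observations into an abstract latent state and predict the next latent directly, with no pixel reconstruction anywhere in the loss. The layer sizes, the 4-dimensional action vector, and the single shared encoder are simplifying assumptions for this sketch, not AMI Labs' actual design.

```python
import torch
import torch.nn as nn

LATENT = 64

# Toy encoder: a 32x32 grayscale observation -> 64-dim abstract state.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, LATENT))
# Toy predictor: (current latent + 4-dim action) -> predicted next latent.
predictor = nn.Linear(LATENT + 4, LATENT)

def jepa_loss(obs, action, next_obs):
    z = encoder(obs)                     # abstract state; pixel detail discarded
    with torch.no_grad():                # target side gets no gradient
        z_target = encoder(next_obs)     # (real JEPA uses an EMA target encoder)
    z_pred = predictor(torch.cat([z, action], dim=-1))
    # The loss lives entirely in latent space: no pixel reconstruction term.
    return ((z_pred - z_target) ** 2).mean()

obs, next_obs = torch.randn(8, 1, 32, 32), torch.randn(8, 1, 32, 32)
action = torch.randn(8, 4)
print(jepa_loss(obs, action, next_obs))  # a scalar training loss
```

The point of the design is that gradients never push the model to render every leaf; they only push it to predict where the abstract state goes next.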
Another path comes from Li Feifei's World Labs. Li believes intelligence must be built on explicit understanding of 3D space. Her Marble model generates editable, navigable 3D worlds from a single photo or text, allowing users to freely move perspectives.
World Labs also open-sourced the rendering engine Spark 2.0, enabling ordinary browsers to smoothly load billions of 3D points.
A candid assessment: Marble excels at reconstructing spatial appearance but remains weak in understanding what happens within that space.
You can walk into its generated rooms, but you can't push chairs or knock over cups. It's a replicator of static worlds, not a simulator of dynamic physics.
The most vibrant camp belongs to the generative faction. Google's Genie 3, Alibaba's Happy Oyster, and Tencent's HY-World 2.0 all fall here.
Their logic: If generated visuals are realistic enough and interactions smooth enough, physical laws will naturally emerge.
Alibaba introduced a fascinating feature in Happy Oyster called Director Mode, where users can input text commands during video playback to alter plot directions or switch camera angles. Tencent is more pragmatic, directly outputting editable 3D assets for game developers to import into Unity or UE engines.
But these products share a common weakness: Long-term consistency and physical accuracy remain unstable.
Genie 3's demos are stunning, but visuals degrade after a few minutes. Alibaba's roaming mode currently supports only 1 minute of continuous movement—what happens beyond that? The company hasn't said.
Tencent's 3D assets look decent in single scenes, but their strength lies in scene completeness and adherence to input images—metrics of 'looking right,' not 'being physically correct.'
Finally, there's a unique player: NVIDIA. Its Cosmos platform doesn't produce world models—it produces 'tools for producing world models.'
Data processing pipelines, video tokenizers, and pre-trained foundational models are all freely available for download. Jensen Huang's calculation is clear: No matter which route wins, training and inference will require NVIDIA GPUs.
It's the smartest business—betting not on direction, but on compute.
So, which of these world models are the real deal? A key technical standard: True world models must be 'action-conditioned,' meaning that given an action, the model must output changes in world state.
Press 'W' on your keyboard, and the perspective in the visuals should move forward; give a robot a grasping command, and the model should predict the object's new position.
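To make that standard concrete, here is a toy Python sketch of what 'action-conditioned' means as an interface contract. The class and field names are hypothetical, not any vendor's API; the test is simply that feeding the model an action must yield a changed world state.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class WorldState:
    x: float
    y: float
    z: float  # z is the forward axis in this toy coordinate frame

class ToyWorldModel:
    """Action-conditioned: the same state plus an action yields a new state."""
    STEP = 0.5

    def step(self, state: WorldState, action: str) -> WorldState:
        if action == "W":                              # move forward
            return replace(state, z=state.z + self.STEP)
        if action == "S":                              # move backward
            return replace(state, z=state.z - self.STEP)
        return state                                   # unknown action: no-op

model = ToyWorldModel()
s0 = WorldState(0.0, 0.0, 0.0)
print(model.step(s0, "W"))  # WorldState(x=0.0, y=0.0, z=0.5)
```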
By this standard, Li Feifei's Marble falls short—users can only observe, not act. It's more a 3D reconstruction tool than a world simulator.
Google's Genie 3 and Alibaba's Happy Oyster support interaction, but physical accuracy is questionable. Tencent's HY-World 2.0 outputs static assets, so dynamic prediction isn't involved.
In other words, hardly any product on the market meets the standard of a 'perfect physical world simulator.' Each has chosen a demonstrable, commercializable entry point within its capabilities.
This isn't wrong in itself—the issue is that everyone packages themselves under the vague umbrella term 'world model,' misleading outsiders into thinking all problems are solved.
3. The Deliberately Avoided Gray Areas
Reading only companies' press releases, one might think world models are on the verge of large-scale deployment—but neglected details paint a starkly different picture.
The data problem tops the list. Training a true world model requires massive datasets of 'observation, action, result' triplets, but no such ready-made datasets exist in reality.
Some use game data, where action labels are perfect, but game physics are engine-simulated, not real-world physics.
Others use human first-person videos, which are closest to reality, but these lack action labels, and head and hand movements are entangled in the footage, making it hard for a model to attribute which motion produced which change.
Still others use real robot teleoperation data, which has the highest fidelity but costs tens of thousands of dollars per hour to collect, making large-scale training infeasible.
This means every world model has inherent 'capability boundaries.'
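For concreteness, here is a hypothetical record format for such a triplet dataset (all names and shapes are assumptions for illustration); the comments map the three data sources above onto which fields each can fill faithfully.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Transition:
    observation: np.ndarray  # frame seen before the action
    action: np.ndarray       # e.g. joint velocities or an encoded keypress
    result: np.ndarray       # frame seen after the action
    source: str              # "game" | "egocentric_video" | "teleop"

# Game logs fill `action` perfectly, but `result` obeys engine physics;
# egocentric video gives a real-world `result`, but `action` must be inferred;
# teleoperation fills all three faithfully, at high collection cost.
t = Transition(
    observation=np.zeros((64, 64, 3), dtype=np.uint8),
    action=np.array([0.1, -0.2, 0.0]),
    result=np.zeros((64, 64, 3), dtype=np.uint8),
    source="teleop",
)
print(t.source, t.observation.shape)
```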
The evaluation vacuum is another headache. Visit any world model company's website, and you'll almost certainly see claims of 'topping global authoritative benchmarks.'
The issue? These benchmarks themselves are immature. Some prioritize visual realism, others physical accuracy, and still others task completion rates. A model ranking first in visual benchmarks might rank last in physical ones.
This lack of standardization lets companies claim whatever they want, and outsiders can't tell whether the rankings reflect genuinely different benchmark categories or cleverly crafted marketing spin.
Then there's the deliberately avoided 'impossible trinity.'
World models face three mutually constraining metrics: spatial scale, visual fidelity, and real-time interactivity.
You can't simultaneously achieve 'a vast world, crisp visuals, and smooth interactions.' Li Feifei's Marble exemplifies this: Version 1.1 offers high visual fidelity but a limited spatial range, while 1.1-Plus generates large scenes at visibly lower fidelity.
Kunlun Wanwei's Matrix-Game 3.0 achieves real-time generation at 720P/40FPS, but its demo scenes are limited in style and complexity.
Few products openly admit their weaknesses; they prefer to showcase demos under optimal conditions while hiding failures under extreme conditions. This selective presentation is creating a dangerous bubble.
Finally, the capital frenzy brings new speculative risks.
A notable trend: Capital has shifted from chasing industry veterans to betting on young scholars from top universities. The two founders of Inverse Matrix Technology, born in 1998 and 2004, hail from Peking University and secured over $10 million in first-round funding.
Their technical route is 'reinforcement learning + world models,' with only papers so far, no product. This isn't to dismiss young talent but to note that during paradigm chaos, capital is willing to pay an extreme premium for the possibility of 'defining the next generation of technology.'
Yet most such lab projects ultimately fail to cross the 'paper-to-product' chasm. If even Yann LeCun, a Turing Award winner, admits commercialization is years away, what of fresh PhDs?
4. Conclusion
The goal of world models is to enable AI to predict and even intervene in the physical world. So, if AI's predictions are wrong, who bears responsibility?
Imagine a scenario: An autonomous vehicle's world model 'imagines' a nonexistent obstacle during simulation, causing the vehicle to brake suddenly and get rear-ended.
Who is at fault—the algorithm engineer or the provider of simulation data?
Another scenario: An industrial robot's world model incorrectly predicts a part's trajectory, damaging an entire production line. What are the insurance claims standards?
A more extreme case: Someone uses a world model to generate a hyper-realistic 3D disaster video, causing panic on social media. Does the platform have a review obligation? How does law define harm from 'virtual-reality confusion'?
Currently, no company or country has provided clear answers to these questions. The ethical framework and legal boundaries for world models lag far behind technological progress.
While capital and media focus on 'who can create the most realistic virtual world,' a more fundamental question is shelved: Are we truly prepared?
This may be the most underestimated variable in the world model race. Not compute, not data, not algorithms—but responsibility.