Li Feifei's Manifesto on World Models

06/09 2026

"The world is everything that is the case."

In 1921, Ludwig Wittgenstein penned this famous line in Tractatus Logico-Philosophicus. A century later, it was quoted by Li Feifei, one of AI's leading figures, as the opening to her latest technical blog.

Over the past three years, people in deep learning have grown accustomed to AI's dominance in language. It began with ChatGPT, which gave machines abilities in expression, programming, and reasoning that rival, and on some tasks surpass, those of humans.

However, behind the digital marvels, a blind spot often goes unnoticed: machines can talk about the world but know nothing of its physical essence. Li Feifei's blog serves as a sobering antidote.

Today, as generative AI has become an indispensable tool globally, the definition of "world models" within the industry is becoming increasingly muddled. Companies are vying for interpretive authority over the concept, whether in video generation or embodied intelligence.

After Li Feifei published her blog, many believed she was attempting to reclaim the definitive authority over "world models." On the contrary, I believe what Li Feifei truly aims to do is to issue a manifesto: the world is not constituted by language but by rigorous physical space and temporal laws.

For machines to truly step into humanity's physical world, they must break free from the comfort zone of textual statistics and instead comprehend the refraction of light and shadow, the inertia of objects, and the logic of collisions. This represents not only a technological paradigm shift but also an inevitable path for AI toward embodied intelligence.

01 A Taxonomy is Needed

It must be acknowledged that in AI's lexicon, "world model" has become a catch-all term, seemingly applicable to any project involving image generation or environmental simulation. This ambiguity stems from the many different demands placed on the word "world."

When a technology is in its infancy, it naturally lacks unified regulations to confine it within clear boundaries. The confusion surrounding the definition of "world models" is not historically unprecedented. When ancient Greek philosophers debated whether the essence of the world was water, fire, or indivisible atoms, they were, in fact, seeking a foundation for their reasoning.

The AI field now faces a similar dilemma: when a video generation model produces visually hyper-realistic effects that are physically impossible, how should it be defined? Li Feifei's blog references a long-established yet robust definitional basis: the Partially Observable Markov Decision Process (POMDP).

This formalism also underlies much of reinforcement learning. It describes the closed loop of an agent's interaction with the physical world: the agent takes an action, which changes the world's state. Lacking a god's-eye view, however, the agent can only build a partial picture of reality through observation.

A world model, in essence, is the abstract representation of the world that a machine constructs in its "brain" to survive within this closed loop. If any link in this loop remains undefined, the so-called world model remains a blind accumulation of pixels.
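The action, state, and observation loop described above can be sketched in a few lines of Python. This is only a toy illustration, not anything taken from Li Feifei's blog: the one-dimensional cart, the noise level, and the names World, run_episode, belief, and gain are all assumptions made for the example. The point it shows is that the agent never reads the true state; it acts on an internal estimate that it keeps correcting from noisy observations.

```python
import random
from dataclasses import dataclass


@dataclass
class World:
    """Hidden state of a toy 1-D world: the true position of a cart (hypothetical)."""
    position: float = 0.0

    def step(self, action: float, dt: float = 0.1) -> None:
        # State transition: the action (a commanded velocity) changes the state.
        self.position += action * dt

    def observe(self, rng: random.Random, noise_std: float = 0.05) -> float:
        # Observation: a noisy reading, never the true state itself.
        return self.position + rng.gauss(0.0, noise_std)


def run_episode(target: float = 1.0, steps: int = 100, seed: int = 0) -> float:
    """Run the action -> state -> observation loop; return the final true position."""
    rng = random.Random(seed)
    world = World()
    belief = 0.0          # the agent's internal estimate of the hidden state
    dt, gain = 0.1, 0.3   # assumed step size and correction gain
    for _ in range(steps):
        action = target - belief                 # act on the belief, not the true state
        world.step(action, dt)                   # the world's state changes
        belief += action * dt                    # predict with the agent's own model
        observation = world.observe(rng)         # perceive, partially and noisily
        belief += gain * (observation - belief)  # correct the belief with the observation
    return world.position
```

Calling run_episode() drives the cart close to the target even though the agent only ever sees noisy readings. If the prediction or correction line is removed, the loop is broken in exactly the sense the text describes.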

02 The Three Pillars of Intelligence

This closed loop sounds simple, with each link's function easily understandable. However, upon closer inspection, countless poorly defined details emerge. To explain the confusion, Li Feifei dissects world models into three core components, which serve as both technical classifications and the three pillars of AI's path toward embodied intelligence.

1. Renderer

The renderer's core logic is visual plausibility. Its output is pixels, striving to make images appear natural, coherent, and aesthetically pleasing to humans.

This is currently the most commercially mature field. Familiar examples include video generation models like OpenAI's Sora and ByteDance's Seedance 2.0, as well as image generation models like OpenAI's GPT-image-2 and Google's Nano Banana 2. Essentially, these are the most sophisticated visual probability machines to date. By learning from billions of internet images and videos, they have mastered the distributional patterns of light, shadow, and form.

Despite this seemingly ideal reality, Li Feifei points out the cost: while these top-tier models can generate magnificent architecture, attempting to interact with their physical structures would likely cause them to collapse instantly due to a lack of supporting structures. In other words, they do not understand what "support" means; they generate what the viewer "sees," not what the world "is."

2. Simulator

The simulator pursues precisely the structural fidelity that the renderer lacks. It could not care less about whether a video looks good; its sole concern is whether the world adheres to physical laws. When a simulator outputs a mundane cup, it must also include the cup's mass distribution, material friction coefficient, gravitational response, and physical boundaries upon collision.
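To make the contrast with pixels concrete, here is a minimal sketch of the kind of state a simulator might attach to that cup. The class RigidObject, its field names, and the numeric values are hypothetical and chosen only for illustration; a real physics engine tracks far more (full inertia tensors, contact geometry, restitution, and so on).

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass(frozen=True)
class RigidObject:
    """Illustrative physical description of an object, as opposed to its appearance."""
    name: str
    mass_kg: float
    friction_coeff: float     # static friction against the supporting surface
    base_half_width_m: float  # half-width of the footprint the object stands on
    com_offset_m: float       # horizontal offset of the center of mass from the base center

    def weight_n(self) -> float:
        # Gravitational response: the force the support must carry.
        return self.mass_kg * G

    def max_static_friction_n(self) -> float:
        # Largest horizontal force the contact can resist before sliding.
        return self.friction_coeff * self.weight_n()

    def is_supported(self) -> bool:
        # Standing is stable only if the center of mass lies over the base.
        return abs(self.com_offset_m) <= self.base_half_width_m

    def slides_under(self, push_n: float) -> bool:
        return push_n > self.max_static_friction_n()


cup = RigidObject("cup", mass_kg=0.3, friction_coeff=0.5,
                  base_half_width_m=0.03, com_offset_m=0.0)
```

With these assumed values, the cup resists a 1 N push but slides under a 2 N push. A renderer cannot answer that question, because none of these quantities appears in the image.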

With a simulator, the content of a video gains physical authenticity. Yet in today's AI wave, simulators are severely underestimated and frequently overlooked.

As the cup example shows, the simulator shifts the discussion from "art" to "physics." Constructing a simulator that strictly adheres to physical laws requires enormous computational resources and annotation effort. Yet for robots, visual aesthetics are nearly useless; physical precision determines everything.

If the simulator is not precise enough, robots trained within it will never be able to enter the real world. The Sim-to-Real challenge is an objective reality: a test action that passes 100% of the time in the lab may leave a robot completely paralyzed in the real world because of a minute difference in friction. The difficulty echoes what is often called Moravec's Paradox: the low-level sensorimotor skills that humans find effortless are among the hardest for machines to master.
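A small worked example shows how sensitive an outcome can be to friction. It uses the textbook result that an object pushed to speed v on a flat surface slides a distance v^2 / (2 * mu * g) before kinetic friction stops it. The function names and the friction values 0.5 (simulated) and 0.4 (real) are assumptions picked for illustration, not measurements.

```python
G = 9.81  # gravitational acceleration in m/s^2


def sliding_distance(initial_speed: float, friction_coeff: float) -> float:
    """Distance an object slides before kinetic friction stops it: v^2 / (2 * mu * g)."""
    if friction_coeff <= 0:
        raise ValueError("friction coefficient must be positive")
    return initial_speed ** 2 / (2.0 * friction_coeff * G)


def sim_to_real_error(initial_speed: float, sim_mu: float, real_mu: float) -> float:
    """Relative error of the simulated sliding distance against the real one."""
    simulated = sliding_distance(initial_speed, sim_mu)
    real = sliding_distance(initial_speed, real_mu)
    return (real - simulated) / simulated


# Assumed values: the simulator believes mu = 0.5, the real floor has mu = 0.4.
error = sim_to_real_error(initial_speed=1.0, sim_mu=0.5, real_mu=0.4)
```

Because the distance is inversely proportional to the friction coefficient, the real object slides 25% farther than the simulator predicted. A policy tuned to the simulated distance would consistently overshoot.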

3. Planner

The planner is responsible for action output. As the nexus of perception and feedback, it must address the core question of "what to do next," which never has a standard answer. In Li Feifei's framework, this is also the final link in the entire "perception-action" closed loop and the most challenging frontier.

Current Vision-Language-Action (VLA) models attempt to enable systems to make decisions in unstructured, complex worlds. The planner is not merely about predicting the future but about selecting, from countless possibilities, the path most likely to achieve the goal. It is the key to machines evolving from "observers" to "practitioners."
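One simple way to picture "selecting a path from countless possibilities" is random-shooting planning: sample many candidate action sequences, roll each through a model of the world, and keep the one whose predicted outcome lands closest to the goal. The sketch below is a generic illustration under stated assumptions, not how VLA models work internally. The names toy_model, rollout, and plan, and the trivial point-on-a-plane dynamics, are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple

State = Tuple[float, float]   # (x, y) position on a plane
Action = Tuple[float, float]  # (dx, dy) displacement applied in one step
Model = Callable[[State, Action], State]


def toy_model(state: State, action: Action) -> State:
    """Assumed world model: the action simply displaces the point."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(model: Model, state: State, actions: List[Action]) -> State:
    """Predict where a sequence of actions leads, without touching the real world."""
    for action in actions:
        state = model(state, action)
    return state


def plan(model: Model, start: State, goal: State, horizon: int = 5,
         candidates: int = 200, seed: int = 0) -> Tuple[List[Action], float]:
    """Return the sampled action sequence whose predicted end state is nearest the goal."""
    rng = random.Random(seed)
    best_actions: List[Action] = []
    best_cost = float("inf")
    for _ in range(candidates):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        cost = math.dist(rollout(model, start, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


best_actions, best_cost = plan(toy_model, start=(0.0, 0.0), goal=(2.0, 1.0))
```

The quality of the plan depends entirely on the model used for the rollouts. If toy_model misjudges the dynamics, the planner confidently picks a path that fails in reality.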

03 The Hundred-Billion-Dollar Hub

Among the three classifications provided by Li Feifei, models corresponding to renderers and planners are already relatively common; the remaining simulator, naturally, has become the most difficult component to realize. Li Feifei also offers a highly insightful judgment: the simulator is the linchpin connecting rendering and planning, as well as the core hub of the entire system.

In the field of simulators, the standout performer is not OpenAI, Anthropic, or Google but Jensen Huang's NVIDIA.

NVIDIA's Omniverse can claim to support trillion-scale digital twin ambitions precisely because it grasps the essence of simulators. On NVIDIA's platform, the operations of factories, supply chains, and warehouses have been transformed into complete digital mirrors. For the industrial sector, this is no longer just a visual demo but the core infrastructure of productivity.

This is not an exaggeration but a trillion-dollar market opportunity lying before everyone.

From virtual visualization in construction engineering to molecular dynamics simulations in the pharmaceutical industry, to scenario testing for autonomous driving, what these industries lack is not lifelike image and video generation models but an extremely high-fidelity simulator. To put it bluntly, mastering the ability to simulate the physical world equates to holding priority admission tickets to AI industrialization.

However, the practical difficulties leave little room for techno-optimism in this field. Li Feifei herself admits that a significant gap remains.

First is the issue of embodied intelligence data, which we have repeatedly mentioned before. While video data on the internet is abundant, 3D data with explicit geometric structures, material properties, and physical feedback annotations is extremely scarce.

Second, the application of generative AI always comes with hidden risks. AI-generated geometric models can at best achieve visual perfection but are often physically unreasonable, such as cups intersecting with tables or objects passing through one another as if they had no volume. In plain terms, these eerie phenomena can be summed up in one word: "clipping." But in real industrial applications, this spells disaster.
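A basic check for this kind of "clipping" is an overlap test between axis-aligned bounding boxes. The sketch below is a simplified illustration: the Box class, the penetration_depth function, and the cup and table dimensions are assumptions made for the example. Real engines run such a test only as a coarse first pass before exact mesh-level collision detection.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box given by its minimum and maximum corners (meters)."""
    min_corner: Vec3
    max_corner: Vec3


def penetration_depth(a: Box, b: Box) -> float:
    """Smallest overlap along any axis, or 0.0 if the boxes do not interpenetrate."""
    overlaps = [
        min(a.max_corner[i], b.max_corner[i]) - max(a.min_corner[i], b.min_corner[i])
        for i in range(3)
    ]
    # The boxes intersect only if they overlap on all three axes.
    return min(overlaps) if all(o > 0 for o in overlaps) else 0.0


# Hypothetical scene: a table top whose surface is at z = 0.75.
table = Box((-0.5, -0.5, 0.70), (0.5, 0.5, 0.75))
resting_cup = Box((0.0, 0.0, 0.75), (0.08, 0.08, 0.85))  # sits exactly on the surface
sunken_cup = Box((0.0, 0.0, 0.73), (0.08, 0.08, 0.83))   # sinks 2 cm into the table
```

Here penetration_depth(table, resting_cup) is 0.0, while the sunken cup yields a depth of about 0.02 m. A generated scene can look flawless from the camera's viewpoint and still fail this test.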

04 Toward a Unified World Model

Despite the formidable challenges, Li Feifei offers a positive prediction about industry trends: the boundaries between rendering, simulation, and planning are becoming increasingly blurred.

This is not a distant vision but a reality already unfolding. Through exploration, Li Feifei's World Labs team believes humanity is moving toward a unified foundational model. In this architecture, imagination and logic can merge into one.

Future models will no longer be mere stacks and patchworks of single functions but a unified neural network foundation. They will be able to render realistic scenes via Gaussian splatting while simultaneously generating, in real time, the collision meshes that physics engines require. Simply put, the unified foundational model will switch seamlessly between the visual modes humans need and the state modes physics engines require.

From another perspective, traditional models are static, whereas future world models will possess far greater interactivity. Renderers will no longer be passive video generators but will begin to accept action commands; simulators will become more editable and controllable; planners will engage in logical thinking, automatically adjusting strategies based on environmental changes.

05 The Long Arc of Spatial Intelligence

Finally, stepping back to the macro level: why do "world models" matter at all?

In Li Feifei's view, AI research over the past few decades has been searching for the key to enable machines to enter the physical world. Today, we already possess language models adept at handling logic; what we need next are models capable of processing space. The core of spatial intelligence lies in how machines interact with the physical world they inhabit.

This battle is not about who possesses more computational power but who can define the digital standards of the physical world.

World models represent not a simple algorithmic optimization but a monumental leap in AI evolution.

"Language has endowed machines with the ability to talk about the world, whereas world models are the means by which machines ultimately understand, imagine, reason, and interact with the physical world."

We are all moving from an era of talking about the world to one of truly understanding and reconstructing it.

Nevertheless, world models are merely an intermediate milestone on the path to AGI, and the AI humanity has created is still far from a true "world model" in the meaningful sense. Here, another leading figure in world models, Yann LeCun, offers a somewhat extreme viewpoint worth sharing:

Optimistically, it will take at least five to ten more years for machine intelligence to barely approach that of a puppy.
