VLA Won't Die, Except Those Without World Model Integration

06/05 2026 557

Author | AIGC Relative Theory

In May 2026, a not-so-funny made-up joke circulated in embodied AI circles. During a demo, a VLA model was asked to 'Hand me the apple on the table.' The robotic arm elegantly reached out and firmly grasped a mug. The room fell silent. The engineer, drenched in cold sweat, quickly typed on the pad: 'Redefine apple.'

Over the past six months, similar 'fail' jokes have been abundant. Their protagonists range from China's highest-valued unicorns to Figure AI and Physical Intelligence across the ocean, and none has been spared.

A couple of years ago, the industry was still rallying behind VLA (Vision-Language-Action) as the technical path. When Covariant's RFM-1 first appeared, the media eagerly crowned it the 'general robotics singularity.' Once Google DeepMind's RT-2 paper was published, analysts in the secondary market revised their reports overnight, advancing the commercialization timeline for embodied AI by three years.

Now, no one mentions 'singularity' anymore.

What everyone cares about is whether this thing can actually screw a bolt into a hole in a factory, rather than stabbing a screwdriver into its own motor. Under the VLA framework, the somewhat clumsy performance of embodied AI led NVIDIA's robotics lead, Jim Fan, to outright declare 'VLA is dead.'

But that statement was premature.

VLA won't die. Those VLA models that fantasize about creating general-purpose robots using only internet images, videos, and a few teleoperated robotic arm datasets should indeed be buried. However, something new is emerging—a fusion with the 'world model,' a concept the industry has talked about for years without seriously addressing. This may be the only viable path for embodied AI in the next three years.

The 'Brain in a Vat' Living in the Internet

To understand why VLA frequently fails, we must first grasp its genetic flaws.

Today's mainstream VLA architectures, whether Google's RT-2 or the models built by domestic companies like Stardust Intelligence, all follow the same underlying logic. They first use massive amounts of internet image-text data to align vision and language, enabling the model to understand images and human language. Then they add robotic action data for end-to-end fine-tuning, so that the model can output action commands.

The biggest allure of this approach is 'cost-effectiveness.' It attempts to reuse the infrastructure of large language models and vision-language models, turning robot learning into a 'lightweight' fine-tuning task.

Investors love this story: No need to collect expensive physical world interaction data from scratch; just stand on the shoulders of internet giants.

But here's the problem. Internet data teaches the model that 'an apple is a red, round object,' but not that 'an apple will deform and possibly roll away when a force of 10 Newtons is applied.'

Videos on the internet are edited, aesthetically pleasing clips filled with smooth transitions and large leaps in causality.

When a cup falls off the edge of a table, the next shot often shows it already shattered on the floor or caught by a hand. The fateful moment—the cup slipping from the fingertips, insufficient friction, excessive tilt—is forever lost.

The physics VLA learns is a 'pseudo-physics' based on superficial correlations. It knows 'falling' often accompanies 'shattering,' but it doesn't understand at what tilt angle a glass pot full of hot coffee becomes top-heavy and its lid slides off. Google DeepMind's RT-2 paper also admits that the model's generalization ability plummets when faced with novel object combinations or scenarios requiring fine force control.

Furthermore, research by Physical Intelligence reveals a hard reality: even if you scale the model up tenfold and feed it more internet images, its ability to predict physical interactions remains almost flat. When it comes to physical interaction, the scaling law has hit a wall.

Thus, current VLA demos resemble carefully rehearsed magic tricks.

You can only witness the robot smoothly grasping objects in a controlled lab area of 0.5 square meters, using a fixed set of three to five props under strictly controlled lighting and backgrounds. Once the background is slightly altered or a reflective, transparent object is introduced, the model's 'brain in a vat' nature is exposed.

It only knows the answer, not the process.

The World Model Isn't a Panacea, But It's the Only Remedy

The recent hype around 'world models' resembles that of the metaverse a few years ago—everyone talks about it, but few have seen its true form. Yann LeCun at Meta's AI division constantly brings up world models, believing they are the key to true intelligence. NVIDIA's Jensen Huang also endorsed it at GTC.

In the context of embodied AI, world models are highly anticipated, but in some hands, they've nearly become a word game. Some teams take a simplistic approach: They slap a pre-existing physics simulation engine onto the VLA's output to 'correct' actions that violate physical laws.

For example, if the model says to penetrate the table to grab something, the simulator issues a 'collision warning' and stops the arm.

Is this integrating a world model? No. This is patching bad code.
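The patch-style approach described above can be sketched in a few lines. This is only an illustration, not any team's actual code: `Action`, `safety_filter`, and `TABLE_HEIGHT_M` are hypothetical names, and the 'physics engine' is reduced to a single height check.

```python
from dataclasses import dataclass
from typing import Optional

TABLE_HEIGHT_M = 0.75  # assumed tabletop height, for illustration only


@dataclass(frozen=True)
class Action:
    """Target gripper position in metres (hypothetical action format)."""
    x: float
    y: float
    z: float


def safety_filter(action: Action) -> Optional[Action]:
    """External check bolted onto the policy output.

    Returns the action unchanged if it stays above the tabletop,
    or None (stop the arm) if it would pass through the table.
    """
    if action.z < TABLE_HEIGHT_M:
        return None  # "collision warning": a veto, with no better action proposed
    return action


# The policy proposes reaching through the table; the filter can only say no.
proposed = Action(x=0.40, y=0.10, z=0.60)
print(safety_filter(proposed))
```

The filter can veto an action, but the policy that produced it learns nothing from the veto, which is why this counts as a patch rather than integration.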

True integration lies in internalization.

A powerful world model should serve as VLA's 'subconscious' and 'intuition module,' not as an external safety supervisor.

Before VLA commits to an action, the world model should rapidly simulate, internally, how the physical scene will change over the next few seconds, and that prediction should in turn constrain and guide action generation.
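A rough way to picture 'simulate first, then act' is planning by imagined rollout. The sketch below is a toy under loud assumptions: `predict_stop_position` stands in for a learned world model and uses a constant-deceleration formula, and `choose_push` is a hypothetical explicit search. A truly internalized model would fold this prediction into the policy itself rather than run it as an outer loop.

```python
from typing import Callable, Optional, Sequence


def predict_stop_position(x0: float, push_speed: float,
                          friction_decel: float = 2.0) -> float:
    """Toy stand-in for a world model: where a pushed block slides to rest.

    Assumes constant friction deceleration, so distance = v^2 / (2 * a).
    """
    return x0 + push_speed ** 2 / (2.0 * friction_decel)


def choose_push(x0: float, target: float, table_edge: float,
                candidates: Sequence[float],
                model: Callable[[float, float], float] = predict_stop_position
                ) -> Optional[float]:
    """Imagine each candidate push and keep the one predicted to land closest
    to the target without sliding past the table edge."""
    best_speed, best_error = None, float("inf")
    for speed in candidates:
        x_pred = model(x0, speed)
        if x_pred > table_edge:
            continue  # imagined outcome: block falls off, so discard before acting
        error = abs(x_pred - target)
        if error < best_error:
            best_speed, best_error = speed, error
    return best_speed


# Three candidate push speeds in m/s; the fastest is rejected in imagination.
print(choose_push(x0=0.0, target=0.25, table_edge=0.30,
                  candidates=[0.5, 1.0, 1.5]))
```

The point of the sketch is the ordering: the bad push is rejected before the arm moves, not after a collision warning.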

When I reach out to catch a thrown key, my brain doesn't first plan the exact trajectory of my fingers and wait for visual feedback to correct. Instead, I have an internalized model of 'how the key will fly in a parabola, the wind resistance, and where it will land,' which directly drives my muscle memory, allowing me to adjust my posture almost instinctively.

Fei-Fei Li's team's RoboAgent work and recent new attempts are moving in this direction. They force the model not just to learn 'see cup - output grab action' but to predict the next frame's depth map, object segmentation map, and even contact force distribution while learning actions.

This isn't just an expansion of input-output channels. It compels the model to break free from two-dimensional pixel correlations and construct an internal, three-dimensional, causal physical representation.
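One common way to express 'predict depth, segmentation, and contact force while learning actions' is a joint loss with auxiliary heads. The sketch below is an assumption-laden illustration, not the method of any cited team: the head names, the weights in `AUX_WEIGHTS`, and the use of mean squared error for every head (segmentation would normally use a classification loss) are all simplifications chosen for brevity.

```python
from typing import Dict, Sequence

# Assumed weights for the auxiliary prediction heads (illustrative values).
AUX_WEIGHTS: Dict[str, float] = {
    "depth": 0.5,
    "segmentation": 0.5,
    "contact_force": 1.0,
}


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(pred: Dict[str, Sequence[float]],
               target: Dict[str, Sequence[float]],
               aux_weights: Dict[str, float] = AUX_WEIGHTS) -> float:
    """Action loss plus weighted losses on next-frame physical predictions."""
    total = mse(pred["action"], target["action"])
    for head, weight in aux_weights.items():
        total += weight * mse(pred[head], target[head])
    return total
```

Because the auxiliary terms share the model's internal representation with the action head, an action that fits the demonstration but implies the wrong depth or contact force is still penalized.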

When the model can accurately predict 'if I push this bottle at this angle and speed, it will tilt rightward in the next 0.5 seconds,' it truly 'understands' the bottle's dynamics. Only then will grasping actions stop being either timid or overly aggressive.
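To make the bottle example concrete, here is the kind of relationship such a model has to capture, written as textbook quasi-static rigid-body physics. It is a simplification under stated assumptions (a rigid bottle, a horizontal push, Coulomb friction, no dynamics over time), and `push_outcome` and its parameters are hypothetical names. The bottle slides once the push exceeds friction, and tips once the push torque about the base edge exceeds the torque from its weight.

```python
G = 9.81  # gravitational acceleration in m/s^2


def push_outcome(force_n: float, push_height_m: float, mass_kg: float,
                 base_width_m: float, friction_coeff: float) -> str:
    """Classify a horizontal push on an upright rigid bottle.

    Slide threshold: F > mu * m * g
    Tip threshold:   F * h > m * g * (w / 2)
    Whichever threshold is lower is reached first.
    """
    if push_height_m <= 0:
        raise ValueError("push_height_m must be positive")
    weight = mass_kg * G
    slide_threshold = friction_coeff * weight
    tip_threshold = weight * (base_width_m / 2.0) / push_height_m
    if force_n <= min(slide_threshold, tip_threshold):
        return "stays"
    return "tips" if tip_threshold < slide_threshold else "slides"


# Same 0.5 kg bottle, different contact heights.
print(push_outcome(1.0, 0.20, 0.5, 0.06, 0.4))  # pushed near the top
print(push_outcome(3.0, 0.02, 0.5, 0.06, 0.4))  # pushed near the base
```

A push near the top tips the bottle with far less force than a push near the base needs to slide it, and that distinction never shows up in pixel correlations alone.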

The path forward is clear. Robot companies of all sizes have begun such integrations. VLA + world model, under various conceptual guises, will become the industry consensus.

Jim Fan's exclamation of 'Long live WAM' essentially refers to this combination.

Soon, every serious embodied AI company will write 'We built an end-to-end world model' in its technical white paper, or use some similar concept that fuses VLA and world models. The names will differ, and some will still be called VLA models, but the substance will be the same.

The Silent War of Data Factories Will Determine Who Laughs Last

Debating whether VLA is dead or whether world models work misses the point.

These high-level issues ultimately boil down to the most fundamental, least glamorous factor: data.

A colleague responsible for data collection at a leading humanoid robot company privately told 'AIGC Relative Theory' that their biggest headache isn't algorithm tuning but keeping the teleoperation operators awake.

To collect high-quality operational data, they hire retired engineers to wear data gloves and screw in the same part over and over, all day. But the elderly engineers' hands tremble, which causes problems for fine teleoperation mapping. After a day's data has been collected, cleaned, and aligned, less than 10% of it is usable for the model.

And that's just one action. For VLA + world models to truly learn to brew a cup of coffee, they need to know the kettle's weight changes, steam temperature distribution, water flow impact force, and teacup material. No internet image-text database can provide this data.

This is an unprecedented data factory war.

Tesla's Optimus team is under intense scrutiny not just because of Musk's celebrity status but because they're migrating the 'shadow mode' and data engine system from automotive autonomous driving to robots. Every success and failure of Optimus screwing bolts in the factory is automatically labeled, fed back, and iteratively trained. This creates a formidable, self-sustaining data flywheel.
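The core of such a flywheel is that outcomes label themselves. The sketch below is purely hypothetical and is not a description of Tesla's system: it assumes each bolt-tightening episode logs a final torque reading, and that hitting an assumed target torque within an assumed tolerance counts as success. `Episode`, `auto_label`, and `split_for_training` are illustrative names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

TARGET_TORQUE_NM = 5.0  # assumed spec for the bolt, illustration only
TOLERANCE_NM = 0.5      # assumed acceptable deviation


@dataclass(frozen=True)
class Episode:
    """One logged attempt at tightening a bolt (hypothetical record format)."""
    episode_id: str
    final_torque_nm: float


def auto_label(episode: Episode) -> bool:
    """Label an episode as a success from sensor data alone, with no annotator."""
    return abs(episode.final_torque_nm - TARGET_TORQUE_NM) <= TOLERANCE_NM


def split_for_training(episodes: Iterable[Episode]
                       ) -> Tuple[List[Episode], List[Episode]]:
    """Route successes and failures into separate pools for the next training round."""
    successes: List[Episode] = []
    failures: List[Episode] = []
    for episode in episodes:
        (successes if auto_label(episode) else failures).append(episode)
    return successes, failures


log = [Episode("a", 5.2), Episode("b", 3.1), Episode("c", 4.8)]
good, bad = split_for_training(log)
print([e.episode_id for e in good], [e.episode_id for e in bad])
```

Failures are kept rather than discarded, since they are exactly the cases the next round of training needs to see.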

In contrast, most domestic robot companies still rely on the archaic model of throwing people at the problem. They rent a facility of several thousand square meters and pack it with hired teleoperators, a scene reminiscent of the data annotation villages. Data quality is inconsistent, and collection costs remain high.

The direct result: while the VLA + world model technical route will become the consensus, the real technological barrier will rapidly shift from model architecture to the scale and efficiency of data factories.

Future competition will be hierarchical. At the top tier are companies that can build 'physical world foundation models,' such as OpenAI, Google DeepMind, and NVIDIA. They provide the underlying VLA base models that understand fundamental physical laws.

In the middle tier are robot companies with efficient, massive, and diverse private data factories. They use their scenario-specific 'private domain data' to deeply fine-tune the base models, creating super-expert models for specific fields (e.g., 3C assembly, food service).

Companies without efficient data factories will become distributors for foundation model providers or remain stuck in low-tech scenarios like inspection and guidance, endlessly competing on price.

Data, specifically high-quality physical interaction data, is the only ammunition VLA can ultimately use. Without ammunition, even the most advanced gun is no better than a fire poker.

Look at Physical Intelligence, a star company founded by top academic luminaries. This year, they've aggressively signed cooperation agreements with various manufacturing and logistics companies. They're not after the service fees but the most authentic, dirtiest, and most uncertain physical interaction data from those scenarios. Uber's rise wasn't due to algorithms but the data monopoly created by private cars driving on city streets worldwide.

The 'Uber moment' for embodied AI hasn't arrived yet, but the countdown has begun.

Conclusion

VLA isn't dead; it's just growing up. The sign of this growth is that it must be uprooted from the internet's greenhouse and thrown into the physical world's soil.

It needs to grow a new cognitive organ—the world model—to understand and predict physical causality. Whether this happens depends on the most unglamorous corners—data factories—where workers' movements are standardized, sensor noise is filtered out, and failed operations are meticulously recorded.

The grand narrative of embodied AI has ended. A duller, crueler engineering battle has just begun.

Solemnly declare: the copyright of this article belongs to the original author. The reprinted article is only for the purpose of spreading more information. If the author's information is marked incorrectly, please contact us immediately to modify or delete it. Thank you.