07/03 2026

Editor: Lv Xinyi
In late June, Waymo announced the recall of 3,871 Robotaxis after discovering that some vehicles may fail to identify closed construction zones or incorrectly prioritize avoiding other risks while driving on highways, leading them to enter construction zones and continue at high speeds.
This marks the latest in a series of recalls by Waymo over the past two years and the second triggered by real-world road scenarios in less than two months. In May, Waymo recalled approximately 3,800 Robotaxis due to some vehicles' inability to avoid standing water on roads.
What makes this incident notable is that construction zones, standing water, and road closures are hardly unfamiliar scenarios in the autonomous driving industry. They are certainly covered in pre-deployment training, yet a leading player like Waymo still runs into such "ordinary" problems in large-scale operation. This inevitably raises a question: despite steady advances in world models, simulation, and closed-loop data pipelines, is the industry still far from truly digesting the complexity of real-world roads?
Such incidents cannot be simply interpreted as flaws in the world model itself.
This reality check challenges the notion of "exhausting long-tail scenarios": real-world road conditions can never be fully enumerated in training. Moreover, scenarios that were covered in training may recombine during operation with real-time factors such as vehicle speed, traffic flow, and temporary signage, producing new failures.
Therefore, for autonomous driving, beyond generating more complex long-tail scenarios and simulating more realistic road interactions, it is equally crucial, as a matter of overall technical strategy, to efficiently filter real-world failures and turn them into new training, evaluation, and repair assets, ultimately achieving a "failure closed loop." A single wrong entry, a single takeover, or a single recall should not remain an isolated case in an accident review; each should be a starting point for model generation, training, evaluation, and feedback.
Waymo's recall should not be interpreted as a "record of autonomous driving failures."
What it truly reveals is the reality that Robotaxis must confront after moving from technical validation to large-scale operation: failures on the road will not disappear simply because simulation is extensive, world models are powerful, or the data loop is closed. Instead, real-world operation will keep exposing new issues.
For autonomous driving, what truly needs to scale in the future is not just vehicle numbers and operational mileage but also safety feedback capabilities. For autonomous driving world models, what truly matters in the next phase is not continuing to prove how many rare scenarios they can generate but whether they can swiftly transform past failures into new training, evaluation, and repair assets.
World models must still simulate more complex worlds, but what autonomous driving truly needs to prove is its ability to learn from past failures and avoid repeating them.
After all, real-world roads will continue to pose challenges.

Exposing problems on real roads is not that simple.
The complexity of Waymo's recall lies here. Whether a vehicle enters a construction zone due to failing to identify closure signs (a perception issue) or due to misprioritizing risks after partial identification (a planning and risk-ranking issue), the surface outcomes may appear similar, but the fault chains can differ entirely.

Similar issues are not unique to Waymo.
Cruise recalled nearly 1,200 autonomous vehicles after system misjudgments led to inappropriate hard braking, with some cases even resulting in collisions and injuries. Zoox recalled 270 Robotaxis last year because the system misjudged vehicle trajectories, and later recalled another 332 vehicles for unexpected reverse parking.
These cases collectively demonstrate that once Robotaxis enter real-world roads, the failures often stem from combinatorial breakdowns in perception, prediction, planning, and risk-ranking capabilities amid complex traffic interactions.
These anomalies triggered by real-world scenarios must be promptly logged into backend systems. However, the complexity of real-world data means that simply feeding it back is not a panacea.
Associate Professor Feng Shuo of Tsinghua University and a research team from the University of Michigan recently published a study on autonomous driving safety training that highlights a "seesaw effect": as an autonomous driving model becomes safer in some training scenarios, it becomes more prone to safety degradation in others. Elon Musk has also publicly mentioned this effect, noting that while large autonomous driving models are relatively reliable on the data they were trained on, new issues inevitably arise after real-world deployment in areas that training did not cover.
The research team proposed a "dense learning" method, establishing a strategy for autonomous driving models to automatically filter high-value data samples and proactively engage in stratified learning, significantly increasing the density of high-value information in training data. They argue that more failure samples are not necessarily better; the key lies in identifying which failures are worth learning, how to organize them, and whether new side effects emerge after learning.
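The filtering idea can be illustrated with a toy example. The sketch below is not the method from the study; it only assumes that each logged failure already carries a category label and a criticality score (both hypothetical fields), and keeps the most critical samples per category up to a fixed quota so that no single failure type floods the training set.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureSample:
    sample_id: str
    category: str       # hypothetical label, e.g. "construction_zone"
    criticality: float  # hypothetical score in [0, 1]; higher means closer to a crash


def select_high_value(samples, min_criticality=0.5, per_category=2):
    """Keep the most critical samples in each category, up to a fixed quota."""
    by_category = defaultdict(list)
    for sample in samples:
        if sample.criticality >= min_criticality:
            by_category[sample.category].append(sample)

    selected = []
    for category in sorted(by_category):
        ranked = sorted(by_category[category], key=lambda s: s.criticality, reverse=True)
        selected.extend(ranked[:per_category])
    return selected


samples = [
    FailureSample("a", "construction_zone", 0.9),
    FailureSample("b", "construction_zone", 0.7),
    FailureSample("c", "construction_zone", 0.6),
    FailureSample("d", "standing_water", 0.8),
    FailureSample("e", "standing_water", 0.2),
]
chosen = select_high_value(samples)
print([s.sample_id for s in chosen])  # ['a', 'b', 'd']
```

The threshold and quota are arbitrary illustration values; in practice, how criticality is scored is the hard part, and that is exactly what the study's argument is about.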
However, filtering is just the first step. The real challenge in post-training and evaluation is: how can these filtered failures be transformed from accident samples into interactive, adaptable, and evaluable scenario assets within world model capabilities?
This is precisely where world models must bridge the gap.

As a powerful cornerstone of autonomous driving training systems, world models resemble well-equipped, gleaming armored divisions prepared for extreme scenarios. Yet real-world accidents thrust them into brutal street fighting that is subtle, dangerous, and exhaustingly slow.
How should world models fight such tough battles?
The latest research by Li Hongyang's team at the University of Hong Kong, in collaboration with Huawei and Tsinghua University, rethinks this problem from the perspective of the overall training workflow and deployment strategy. They propose that autonomous driving models should enter a "post-training" era: once vehicles trained on massive data have been deployed, their safety boundaries should be systematically identified, real-world failures near those boundaries turned into learnable experience, and driving policies updated in a constrained manner.
Their designed World Engine provides a technical fulcrum: discovering safety-critical scenarios from real driving logs, reconstructing them into interactive environments, and generating similar but not identical traffic variants for closed-loop evaluation and reinforcement learning post-training. The question it addresses is no longer "creating more precise long-tail scenarios" but whether real-world failures can be rapidly expanded into new scenarios that autonomous driving models can efficiently relearn and validate.
This precisely fills the most challenging segment of the failure digestion chain.
A single wrong entry into a construction zone, if left merely as an accident review, remains an isolated driving event. Only when reconstructed by world models into a set of scenarios with varying speeds, cone placements, surrounding vehicle behaviors, and temporary closure conditions can it become a trainable problem.
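As a rough illustration of what "a set of scenarios" could mean, the sketch below expands one logged event into parameter variants. It is a toy grid expansion under assumed names: the `Scenario` fields, the parameter values, and `expand_variants` are all hypothetical, and a real world model would reconstruct and re-simulate the scene rather than enumerate a few fields.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class Scenario:
    ego_speed_mps: float   # ego vehicle speed when approaching the zone
    cone_offset_m: float   # lateral shift of the cone line from the logged position
    lead_vehicle: str      # behavior of the nearest surrounding vehicle
    closure_signed: bool   # whether a temporary closure sign is present


def expand_variants(logged, speeds, cone_offsets, lead_behaviors, closure_flags):
    """Return every parameter combination except the logged case itself."""
    variants = []
    for speed, offset, lead, signed in product(
        speeds, cone_offsets, lead_behaviors, closure_flags
    ):
        candidate = replace(
            logged,
            ego_speed_mps=speed,
            cone_offset_m=offset,
            lead_vehicle=lead,
            closure_signed=signed,
        )
        if candidate != logged:
            variants.append(candidate)
    return variants


logged = Scenario(ego_speed_mps=29.0, cone_offset_m=0.0,
                  lead_vehicle="keeps_lane", closure_signed=True)
variants = expand_variants(
    logged,
    speeds=[20.0, 29.0, 33.0],
    cone_offsets=[-0.5, 0.0, 0.5],
    lead_behaviors=["keeps_lane", "late_merge"],
    closure_flags=[True, False],
)
print(len(variants))  # 35 (36 combinations minus the logged case)
```

Even this small grid turns one event into dozens of test cases, which is the point of the paragraph above: the variants, not the original log, are what make the failure trainable and testable.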

Furthermore, the repaired model must be tested repeatedly across these variants to show that it has not merely memorized the causal chain of a single accident but has truly mastered the underlying logic of that class of risk, without becoming overly conservative in other scenarios and inadvertently creating new risks.
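One simple way to watch for this kind of side effect is to compare per-category pass rates before and after a fix. The sketch below assumes a hypothetical evaluation harness that already produces such pass rates; the category names, numbers, and tolerance are made up for illustration.

```python
def find_regressions(before, after, tolerance=0.01):
    """Return categories whose pass rate dropped by more than `tolerance` after a fix."""
    regressions = {}
    for category, old_rate in before.items():
        new_rate = after.get(category)
        if new_rate is None:
            continue  # category was not re-evaluated, so it cannot be judged here
        drop = old_rate - new_rate
        if drop > tolerance:
            regressions[category] = round(drop, 4)
    return regressions


# Hypothetical closed-loop pass rates per scenario category.
before = {"construction_zone": 0.82, "standing_water": 0.95, "unprotected_left": 0.97}
after = {"construction_zone": 0.99, "standing_water": 0.95, "unprotected_left": 0.91}

# The fix helped construction zones but degraded another category.
print(list(find_regressions(before, after)))  # ['unprotected_left']
```

A check like this only catches degradation in categories that are already being measured, so it complements rather than replaces the variant testing described above.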
Chinese players are also moving in similar directions. Momenta emphasizes enabling intelligent driving to "understand the world" in its R7 reinforcement learning world model. NIO and XPENG are variously doubling down on world models, VLA, and closed-loop reinforcement learning. Huawei's Octopus leans more toward a cloud-based autonomous driving development toolchain, emphasizing end-to-end capabilities from data preprocessing, labeling, training, and simulation through to deployment.
These approaches are not entirely identical, nor can they be simply defined as comprehensive solutions for real-world complex scenarios. However, they collectively indicate that the industry is shifting from "generating more scenarios" to "more rapidly processing real-world failures into training and validation assets."

For today's autonomous driving industry, every recall involves more than just lightweight software and hardware updates.
Mainstream media outlets such as Reuters, CNBC, and Fox News promptly reported on the Waymo incident, with U.S. automotive and AI industry media following suit. Accident witnesses were invited to condemn the dangers and fragility of autonomous driving technology. For enterprises, an unexpected incident means more than a potential contraction of operational scope, disrupted expansion plans, and higher costs of communicating with regulators; it also prompts users, the public, and markets to reevaluate the technical safety and commercial prospects of autonomous driving as an entire industry.
After all, the greater commercial risk lies not in a single recall itself but in enterprises' long-term reliance on a passive cycle of real-world exposure, recall and repair, redeployment, and re-exposure.
If every failure can only be discovered through real-world operations, addressed through recalls, and validated through subsequent deployments, Robotaxi expansion speed, safety trust, and operational costs will be continually dragged down and consumed by unpredictable real-world surprises.
The issue here is no longer simply judging which enterprise is performing well or poorly but recognizing that similar scenarios could occur at any company as products are rolled out at scale.
After all, Robotaxis cannot wait for every issue to be resolved in the laboratory before entering commercialization en masse; real-world roads will inevitably continue to expose new failures. What truly matters in the market is how quickly a company can mitigate risks and respond to public concerns through a technological "failure closed loop" after a failure occurs, and even turn that failure into a new driver of commercial progress.
High-risk industries offer relevant precedents. The aviation industry has long established safety feedback mechanisms centered on daily flight data, incidents, and frontline reports, allowing anomalies and near-misses during operations to inform training, procedures, maintenance, and regulatory improvements.
Its significance lies not in guaranteeing that aircraft will never encounter new risks but in making risk feedback during real-world operations increasingly systematic, traceable, and reusable.
Autonomous driving faces a similar transformational need. Markets inherently demand that autonomous vehicles quickly adapt to various real-world surprises, but model and training technologies must keep pace with large-scale deployment. Unlike traditional transportation industries, autonomous driving has an additional technological lever.
World models and simulation training can reconstruct real-world failures into interactive, adaptable, and repeatedly evaluable scenario assets, reducing the cost of addressing similar issues in the future. This not only allows systems to pre-practice potential problems but also enables them to digest past failures more rapidly.