World Model: The "Super Brain" of Autonomous Driving, Redefining Intelligence Boundaries

07/18 2025

Introduction

Autonomous driving technology stands on the cusp of practicality, yet a missing layer of "common sense" still blocks its path to real-world reliability. Behind that barrier lies the evolution of AI models, progressing from merely "seeing" to genuinely "understanding" and ultimately "foreseeing".

The advent of the World Model is propelling autonomous driving towards the intuitive thinking of an "experienced driver." Mushroom Auto delves into this revolutionary concept with you!

I. Tales of Artificial Naivety: Machines Confronting Floating Mattresses

During a torrential downpour on the Guangzhou Ring Expressway, a self-driving car abruptly slammed on its brakes for a perceived "obstacle" ahead. The startled driver behind got out to investigate, only to find a plastic bag, thin as a cicada's wing, fluttering on the road.

"This fool mistook a plastic bag for a wall!" The dashcam footage garnered millions of complaints on Douyin.

This absurd misjudgment exposes the critical flaws of traditional autonomous driving:

  • Perceptual Fragmentation: LiDAR identifies the bag as an "obstacle" but fails to calculate its drifting trajectory under wind.
  • Rigid Prediction: Algorithms based on historical trajectories cannot comprehend the sudden impulses of a child chasing a ball.
  • Short-sighted Planning: When encountering water accumulation, it only knows to slow down without anticipating the need to change lanes 500 meters in advance to avoid danger.

Data from a car testing ground is alarming: current mass-produced systems recognize suspended cables only 23% of the time, and need a sluggish 0.4 seconds to correct an unintended acceleration.

Even more absurd was a road test where the AI erroneously identified the lead wedding car as an "ambulance," simply because both were adorned with red and white stripes!

"Machines need a common sense database, not a pixel database," emphasized NVIDIA engineers, showcasing a comparison video: a traditional model drives past a person waving at the roadside at constant speed, whereas the World Model, combining the person's body orientation with the road environment, predicts a 73% probability of a taxi hail and changes lanes and slows down in advance.

II. Dream Training Ground: Silicon-Based Intuition Racing 1 Million Kilometers a Day

In the Mushroom Auto lab in Lingang, Shanghai, engineers are rigorously "training" AI:

The rainstorm simulator dumps 150 millimeters of rain per hour, fans whip a Beaufort force-8 wind across a tire array, and remote-controlled cars dart out from blind spots to simulate "ghost-probe" pedestrians.

"This is a hundred times more intense than a driving school!" The technical director pointed to the cloud monitoring screen.

AI equipped with the MogoMind system undergoes grueling special training in a digital twin environment:

  • Visual Compression: A 2560×512 pixel image is compressed into a 32×32 "mind map".
  • Memory Deduction: The GRU neural network stores 300 frames of historical images to predict the next 5-second scene.
  • Physics Engine: Real-time calculation of changes in the friction coefficient between tires and the wet road surface.
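The Memory Deduction step above can be sketched as a single GRU cell rolled over a frame history. The following is a minimal numpy sketch, with made-up dimensions and random weights standing in for the trained MogoMind model (only the 32×32 latent size and the 300-frame history come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_step(x, h, W, U, b):
    """One GRU step: gates computed from input x and previous hidden state h."""
    z = 1 / (1 + np.exp(-(W[0] @ x + U[0] @ h + b[0])))   # update gate
    r = 1 / (1 + np.exp(-(W[1] @ x + U[1] @ h + b[1])))   # reset gate
    h_cand = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])    # candidate state
    return (1 - z) * h + z * h_cand

latent_dim, hidden_dim = 32 * 32, 256          # 32x32 "mind map", flattened
W = rng.normal(0, 0.01, (3, hidden_dim, latent_dim))
U = rng.normal(0, 0.01, (3, hidden_dim, hidden_dim))
b = np.zeros((3, hidden_dim))

h = np.zeros(hidden_dim)
for _ in range(300):                           # 300 frames of history
    frame_latent = rng.normal(size=latent_dim) # stand-in for a VQ-VAE code
    h = gru_step(frame_latent, h, W, U, b)

print(h.shape)                                 # memory state summarizing history
```

In the real system this hidden state would feed the predictor of the next 5-second scene; here it only demonstrates the recurrence.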

Most impressive is the "dream learning" ability. Once the V-M-C (Vision-Memory-Controller) module completes training, AI can simulate driving at 1000 times the speed in the cloud, equivalent to accumulating 1 million kilometers of virtual mileage every day—sufficient to traverse the Beijing-Shanghai Expressway 500 times round trip!

The actual results are astounding:

  • In the Tongxiang Smart Connected Vehicle Demonstration Zone in Zhejiang, the system predicts intersection conflict risks 3 seconds in advance, enhancing traffic efficiency by 35%.
  • The prediction error for braking distance on rainy days drops sharply from 30% to within 5%, avoiding 90% of aquaplaning accidents.

III. Newton's Law Chip: Endowing AI with a Physical Brain

While Tesla relies solely on vision, the World Model implants "physical genes." NVIDIA Lab's neural PDE architecture equips AI with a Newton's law processor:

  • Fourier Operator: Solves fluid equations to predict splash trajectories of water accumulation.
  • Multi-granularity Token: Deconstructs mattresses into rigid skeletons, flexible surfaces, and air resistance.
  • Physical Loss Function: Penalizes unphysical fantasies like "translating 5 meters in 0.2 seconds".
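The physical loss idea is simple to illustrate: any predicted trajectory whose implied speed breaks a physical bound is penalized. A toy sketch, where `physical_loss` and the 20 m/s bound are purely illustrative (not NVIDIA's actual formulation):

```python
import numpy as np

def physical_loss(pred_positions, dt, v_max):
    """Penalize predicted trajectories whose implied speed exceeds v_max (m/s)."""
    # speed between consecutive predicted positions
    speeds = np.linalg.norm(np.diff(pred_positions, axis=0), axis=1) / dt
    return float(np.maximum(speeds - v_max, 0.0).sum())

# The "unphysical fantasy" from the text: translating 5 m in 0.2 s (25 m/s).
traj = np.array([[0.0, 0.0], [5.0, 0.0]])
print(physical_loss(traj, dt=0.2, v_max=20.0))  # → 5.0 (25 m/s over a 20 m/s cap)
```

A plausible trajectory under the cap would score zero, so gradient descent steers the model away from teleporting objects.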

In one extreme test, the system encountered the scenario of a "tornado lifting a piece of iron":

  • The vision module identifies the size of the iron sheet.
  • The physics engine calculates the wind thrust.
  • The memory module retrieves similar cases.
  • The controller generates a serpentine avoidance route.
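The four stages above can be sketched as a straight-line pipeline. Every function below is a hypothetical stub (the real modules are neural networks); the sketch only shows the data flow and the latency budget:

```python
import time

def vision(frame):
    return {"sheet_area_m2": 1.2}                  # stub: size of the iron sheet

def physics(obs, wind_speed):
    # crude drag-style estimate of wind thrust on the sheet (illustrative only)
    return {"thrust_N": 0.5 * 1.2 * obs["sheet_area_m2"] * wind_speed ** 2}

def memory(obs):
    return {"similar_cases": 3}                    # stub: retrieved precedents

def controller(obs, force, recall):
    return "serpentine_avoidance" if force["thrust_N"] > 100 else "keep_lane"

t0 = time.perf_counter()
obs = vision("camera_frame")
force = physics(obs, wind_speed=30.0)              # tornado-strength gust
recall = memory(obs)
action = controller(obs, force, recall)
elapsed_ms = (time.perf_counter() - t0) * 1000
print(action)                                      # stubs run far under 80 ms
```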

The entire process takes only 80 milliseconds, four times faster than human reaction time. Even more impressive is the self-evolution ability—when the predicted trajectory deviates from reality beyond a threshold, the system automatically generates 3000 derivative scenarios to feed back into training, akin to an experienced driver replaying near-misses after a drive.

Huawei's strategy revolves around "conservative approach + human-machine co-driving." When the collision probability exceeds 3%, it immediately degrades to L2, twice as strict as industry standards. In a road test in Shenzhen during a rainstorm, this mechanism triggered 17 emergency avoidance maneuvers, preventing multiple chain-reaction rear-end collisions.

IV. Cost Grinder: Investing Millions to Cultivate AI Intuition

Training the World Model is a costly endeavor.

Mushroom Auto's public bill is staggering:

  • Data Annotation: 6D video with friction coefficients/wind speeds costs $80 per second of annotation.
  • Computational Power Consumption: Three weeks of training on a thousand A100 cards runs an electricity bill equivalent to the price of a Porsche.
  • Simulation Cost: High-precision maps + weather simulation, costing $500 per kilometer of virtual road.

However, technological breakthroughs in 2025 are reshaping economics:

  • Mixed Precision Training: Cuts compute demand to roughly a quarter.
  • MoE Architecture: Activates partial parameters, lowering power consumption by 60%.
  • 8-bit Quantization: Reduces in-vehicle inference power consumption to 25 watts.
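Of the three techniques, 8-bit quantization is the easiest to show concretely: weights are stored as int8 plus one float scale, quartering memory traffic. A minimal numpy sketch of symmetric int8 quantization (a generic scheme, not Mushroom Auto's actual implementation):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric 8-bit quantization: map floats into int8 with a single scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

w = np.linspace(-1.0, 1.0, 1024).astype(np.float32)   # toy weight tensor
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale                  # dequantize for compute

print(q.nbytes, w.nbytes)                 # 1024 vs 4096 bytes: 4x smaller
print(float(np.abs(w - w_hat).max()))     # reconstruction error, under one step
```

Smaller weights mean less DRAM traffic per inference, which is where most of the quoted power savings come from.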

Tsinghua University's MARS dataset leverages industry resources—by opening up 2000 hours of driving footage with 6D poses, the training cost for small and medium-sized enterprises drops from millions to hundreds of thousands. As the CTO of a startup quipped, "We used to burn money on LiDAR, now we burn money on 'common sense'!"

V. Cognitive Revolution: When Machines Learn to "Foresee"

World Model: A Digital Brain Capable of "Imagining"

The core architecture of the World Model, V-M-C (Vision-Memory-Controller), forms a cognitive chain akin to the human brain:

  • The Vision module uses VQ-VAE to compress camera footage and extract key features.
  • The Memory module stores historical information through GRU and Mixture Density Networks (MDN) to predict the latent code distribution of the next frame.
  • The Controller module generates actions based on current features and memory states.
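The Mixture Density Network step in the Memory module can be sketched in a few lines: the network outputs component weights, means, and spreads, and the next latent is sampled from that mixture. A toy 1-D version, with hypothetical hand-picked parameters in place of real network outputs:

```python
import numpy as np

rng = np.random.default_rng(1)

def mdn_sample(pi, mu, sigma):
    """Sample the next latent from a Gaussian mixture: pick a component by
    weight pi, then draw from that component's Gaussian."""
    k = rng.choice(len(pi), p=pi)
    return rng.normal(mu[k], sigma[k])

# Hypothetical 3-component mixture over a 1-D latent, for illustration only.
pi = np.array([0.7, 0.2, 0.1])      # component weights (sum to 1)
mu = np.array([0.0, 2.0, -2.0])     # component means
sigma = np.array([0.5, 0.5, 0.5])   # component spreads
z_next = mdn_sample(pi, mu, sigma)
print(z_next)
```

Modeling the next frame as a distribution rather than a point lets the system represent genuinely ambiguous futures, such as a pedestrian who may or may not step off the curb.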

Its most ingenious aspect lies in the "dream training" mechanism—once the V and M modules are trained, they can detach from the real vehicle and deduce at 1000 times real-time speed in the cloud, equivalent to the AI "racing" 1 million kilometers in the virtual world every day, accumulating extreme scenario experience at zero cost.

The hidden battle at the 2025 Beijing Auto Show heralds an industry upheaval:

  • Huawei showcased the "mental rehearsal" function, predicting oncoming traffic 5 seconds in advance at a tunnel entrance.
  • Tesla introduced "counterfactual reasoning," simulating "what would happen if I brake suddenly".
  • Mushroom Auto implemented "risk foresight," announcing water-logged road segments 500 meters in advance.

The far-reaching impact extends beyond driving. In an experiment with a home robot, a robotic arm equipped with the World Model, when handing over a coffee:

  • Predicts the owner's hand movement trajectory.
  • Calculates the coffee's sloshing waveform.
  • Adjusts the delivery angle, achieving a "zero spillage" handover.
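The angle-adjustment step above reduces to a simple balance: tilt the cup so the liquid surface stays perpendicular to the apparent gravity felt during the motion. A toy sketch, where the helper `tilt_angle_deg` and the 1.7 m/s² acceleration are entirely hypothetical:

```python
import numpy as np

def tilt_angle_deg(lateral_accel, g=9.81):
    """Tilt needed to keep the liquid surface level under lateral acceleration:
    tan(theta) = a / g, the zero-spill heuristic."""
    return float(np.degrees(np.arctan2(lateral_accel, g)))

# A gentle 1.7 m/s^2 lateral acceleration during the handover motion.
print(round(tilt_angle_deg(1.7), 1))   # small tilt, under 10 degrees
```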

This fine-grained understanding of physical laws is transforming AI from a mere tool into a "scenario partner." On a rainy night in Tongxiang, an autonomous vehicle slowly pulls up to a bus stop. As the passenger walks towards the door with an umbrella, the car's body automatically tilts 15 degrees—a small gesture engineers have dubbed the "gentleman's bow," born of the World Model's precise deduction of how the standing water would splash.

"We're not teaching machines to drive," said a Huawei scientist, gazing at the flowing data on the monitoring screen, "we're creating silicon-based life that understands the physical world."

At this moment, the Thor chip in the NVIDIA lab flashes blue light. Its internal 200GB/s shared memory has reserved seats for the "mental cinema" of the memory module.

In summary, Mushroom Auto also believes that while human drivers rely on experience to predict risks, these World Models—silicon-based brains—are deducing the future at nanosecond speeds! What do you think, dear reader?
