Supervision Sparsity Issue Resolved! DriveVLA-W0 Enhances Autonomous Driving Data Scaling Law via World Modeling

12/01/2025

At the recent ICCV conference, Tesla highlighted a pressing challenge it is grappling with: supervision sparsity.

Supervision signals, which consist of low-dimensional, sparse driving actions, are at odds with the high-dimensional, dense visual input stream that Vision-Language-Action (VLA) models consume. Consequently, even with vast amounts of data, such training fails to unleash the full potential of VLA models.
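To make that mismatch concrete, the back-of-the-envelope sketch below compares how many values per frame are supervised (a handful of future waypoints) against how many are observed (thousands of visual tokens). The numbers are purely illustrative assumptions, not figures from the paper.

```python
# Illustrative only: hypothetical shapes, not values from the paper.
num_waypoints = 8            # e.g., 8 future (x, y) waypoints as the planning target
action_values = num_waypoints * 2

image_tokens = 32 * 32       # e.g., a 32x32 grid of visual tokens for one front-view frame
token_dim = 1024             # embedding width of each visual token
observed_values = image_tokens * token_dim

print(f"supervised values per frame: {action_values}")
print(f"observed values per frame:   {observed_values}")
print(f"ratio: 1 : {observed_values // action_values}")
```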

A research paper from a collaboration between leading Chinese research institutions and Huawei offers a solution to this conundrum. The researchers propose DriveVLA-W0, a training paradigm that uses world modeling to predict future images, underscoring the pivotal role world models play in unlocking the data scaling law of VLA systems.

The research methodology is divided into three crucial steps:

First, a VLA baseline model is constructed to illustrate the limitations of supervision that relies solely on sparse actions.

Next, the baseline model is enhanced through the incorporation of world modeling, which provides a wealth of dense self-supervised information.

Finally, a lightweight, Mixture-of-Experts (MoE)-based action expert is introduced to tackle inference bottlenecks, thereby ensuring the model's real-time performance.

The Vision-Language-Action (VLA) baseline model takes language instructions (L), frontal images (V), and sequences of past actions (A) as input. To ensure broad applicability, the research team built two variants on mainstream vision-language models (VLMs): VLA (VQ), which quantizes images into discrete visual tokens, and VLA (ViT), which extracts continuous visual features with a Qwen2.5-VL-style backbone.
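A minimal sketch of how such a baseline might assemble its V, L, A inputs into one token sequence and train with an action-only loss is shown below. All module names, shapes, and the loss form are hypothetical assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class ToyVLABaseline(nn.Module):
    """Illustrative VLA baseline: hypothetical modules and shapes, not the paper's architecture."""

    def __init__(self, d_model=1024, num_action_tokens=8, action_dim=2):
        super().__init__()
        self.vision_proj = nn.Linear(768, d_model)      # project ViT-style patch features
        self.text_embed = nn.Embedding(32000, d_model)  # language instruction tokens
        self.action_embed = nn.Linear(action_dim, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.action_head = nn.Linear(d_model, action_dim)
        self.num_action_tokens = num_action_tokens

    def forward(self, vision_feats, text_ids, past_actions, future_actions):
        # Concatenate V, L, A into a single token sequence, VLM-style.
        tokens = torch.cat(
            [self.vision_proj(vision_feats),
             self.text_embed(text_ids),
             self.action_embed(past_actions)],
            dim=1,
        )
        hidden = self.backbone(tokens)
        # Supervision only touches the last few positions: the predicted future waypoints.
        pred = self.action_head(hidden[:, -self.num_action_tokens:])
        return nn.functional.l1_loss(pred, future_actions)
```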

To address the supervision sparsity issue, the research team incorporates world modeling as a self-supervised objective. This design covers multiple world-model families, including autoregressive (AR) and diffusion world models.
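Conceptually, the world-modeling term adds a dense prediction target on top of the sparse action loss. The sketch below shows one plausible form of such a combined objective, assuming an AR-style world model that predicts the visual tokens of a future frame; the exact loss in the paper may differ.

```python
import torch.nn as nn

def combined_loss(pred_actions, future_actions,
                  pred_image_logits, future_image_tokens,
                  lambda_world=1.0):
    """Illustrative combined objective (hypothetical form, not the paper's exact loss)."""
    # Sparse supervision: a few future waypoints per sample.
    action_loss = nn.functional.l1_loss(pred_actions, future_actions)

    # Dense self-supervision: predict thousands of next-frame visual tokens
    # (AR world modeling; a diffusion world model would use a denoising loss instead).
    world_loss = nn.functional.cross_entropy(
        pred_image_logits.flatten(0, 1),   # (B * N_tokens, vocab_size)
        future_image_tokens.flatten(),     # (B * N_tokens,)
    )
    return action_loss + lambda_world * world_loss
```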

The lightweight action expert (around 500M parameters) works in tandem with the main VLA expert in a Mixture-of-Experts (MoE) architecture. Because the two experts share a similar architecture, a joint attention mechanism lets them fuse information deeply and efficiently.
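The sketch below illustrates one way such joint attention can be wired: action-expert tokens attend over both streams so the small expert can read the large expert's context without rerunning the full backbone. Dimensions and wiring are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """Illustrative joint attention between a large VLA expert and a small action expert
    (hypothetical widths and wiring, not the paper's architecture)."""

    def __init__(self, d_vla=2048, d_act=1024, nhead=8):
        super().__init__()
        # Each expert keeps its own width; the VLA stream is projected into the action width.
        self.to_shared = nn.Linear(d_vla, d_act)
        self.attn = nn.MultiheadAttention(d_act, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_act, 4 * d_act), nn.GELU(),
                                 nn.Linear(4 * d_act, d_act))

    def forward(self, vla_tokens, act_tokens):
        # Action tokens attend jointly over both expert streams.
        context = torch.cat([self.to_shared(vla_tokens), act_tokens], dim=1)
        fused, _ = self.attn(act_tokens, context, context)
        act_tokens = act_tokens + fused
        return act_tokens + self.ffn(act_tokens)
```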

DriveVLA-W0 has achieved remarkable performance on the NAVSIM benchmark, outperforming top-tier methods under different architectural paradigms. These include BEV-based WoTE and VLA-based AutoVLA.

Remarkably, the model attains such high performance using only a single front-view camera.

As training data scales up, world modeling proves superior to supervision that relies solely on actions. The baseline quickly hits a performance ceiling under sparse supervision, whereas DriveVLA-W0 continues to improve.

The researchers validated the efficiency of the MoE architecture by measuring inference latency on an H200 GPU. Compared with the baseline configuration (117.8 ms latency, 85.6 PDMS), the MoE-equipped DriveVLA-W0 reduces latency to 74.3 ms while improving PDMS to 88.4.
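For reference, GPU inference latency of this kind is typically measured with warm-up iterations and CUDA events. The generic utility below is an illustrative sketch, not the paper's benchmarking code.

```python
import torch

def measure_latency_ms(model, example_inputs, warmup=10, iters=100):
    """Average per-forward latency in milliseconds on a CUDA device (illustrative utility)."""
    model.eval()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels and caches
            model(*example_inputs)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(*example_inputs)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```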

The study reveals that insufficient supervision is a fundamental bottleneck that impedes the scalability of vision-language-action models in the field of autonomous driving. Adopting dense predictive world modeling represents a crucial step towards unlocking the full potential of large-scale data and constructing more generalized driving intelligence.

References:

https://arxiv.org/abs/2510.12796
