April 22, 2025
In our previous article, "End-to-End Large Model 2.0 - VLA (Vision Language Action) for Autonomous Driving in 2025", we introduced the VLA model. Since then, numerous companies have announced plans to put this architecture into production in the second half of 2025, and Ideal Auto stands out as one of the pioneers in adopting VLA for intelligent driving. The VLA model integrates perception (a 3D encoder), reasoning (a language model), and decision-making (a diffusion policy) into a single, end-to-end trainable large model. Ideal Auto further claims that its VLA model will also support external multimodal interaction, such as voice commands from the driver and specific visual cues from the surroundings, enabling intelligent driving that can understand, perceive, and navigate effectively.
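As a rough illustration of how these three components could be wired together, here is a minimal sketch of a VLA-style pipeline in PyTorch. The module choices, dimensions, and interfaces below are assumptions made for illustration only, not Ideal Auto's actual implementation.

```python
# Minimal, illustrative sketch of a VLA-style pipeline: a 3D scene encoder feeds
# scene tokens into a language-model backbone, whose hidden states condition a
# diffusion policy that denoises a motion trajectory. All names, shapes, and
# module choices are assumptions for illustration, not Ideal Auto's code.
import torch
import torch.nn as nn


class VLAPipeline(nn.Module):
    def __init__(self, d_model=512, n_waypoints=16, n_layers=6):
        super().__init__()
        # Perception: project per-view image features into shared scene tokens.
        self.scene_encoder = nn.Linear(1024, d_model)
        # Reasoning: a small Transformer stands in for the language-model backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Action: a diffusion-policy head that predicts the noise added to a
        # trajectory of (x, y) waypoints, conditioned on the backbone summary.
        self.denoiser = nn.Sequential(
            nn.Linear(n_waypoints * 2 + d_model + 1, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_waypoints * 2),
        )
        self.n_waypoints = n_waypoints

    def forward(self, view_features, noisy_traj, timestep):
        # view_features: (B, N_tokens, 1024) multi-view features from a 3D encoder.
        tokens = self.scene_encoder(view_features)
        hidden = self.backbone(tokens)               # (B, N_tokens, d_model)
        context = hidden.mean(dim=1)                 # pooled scene/language context
        flat_traj = noisy_traj.flatten(1)            # (B, n_waypoints * 2)
        t = timestep.unsqueeze(1).float()            # (B, 1) diffusion step
        eps_pred = self.denoiser(torch.cat([flat_traj, context, t], dim=1))
        return eps_pred.view(-1, self.n_waypoints, 2)
```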
This article examines the details of Ideal Auto's intelligent-driving VLA algorithm based on publicly available information. The Ideal VLA model architecture comprises four core modules.
These three steps, from perception to processing to motion-trajectory generation, constitute the structure of the Ideal VLA model, all within a single trainable model. How is this model trained? Ideal Auto uses reinforcement learning: the model is fed scenario data together with the outcomes humans want, so that it responds correctly in similar situations in the future. To supply those scenarios, the company adopts a world-model approach, combining 3D reconstruction and generation technology to build a high-fidelity virtual environment that mirrors the physical world, similar in spirit to NVIDIA's Cosmos. Human-provided cases are then used for reinforcement learning training and closed-loop verification.
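To make the closed-loop idea concrete, below is a minimal sketch of reinforcement learning against a simulated world-model environment. The toy environment, reward, and REINFORCE-style update are stand-ins chosen for illustration and are not Ideal Auto's training stack.

```python
# Illustrative sketch of closed-loop RL against a learned world model: the policy
# drives a simulated environment, and human-defined outcomes (rewards) steer the
# updates. The environment, reward, and policy are toy placeholders.
import torch
import torch.nn as nn


class ToyWorldModel:
    """Stands in for a reconstructed/generated 3D driving environment."""

    def reset(self):
        return torch.zeros(8)  # initial ego/scene state (assumed 8-dim)

    def step(self, state, action):
        next_state = state + 0.1 * action.mean() * torch.ones_like(state)
        # The reward encodes the human-desired outcome, e.g. staying near a target.
        reward = -torch.abs(next_state.mean() - 1.0)
        return next_state, reward


policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
env = ToyWorldModel()

for episode in range(100):            # each episode is one simulated drive
    state, log_probs, rewards = env.reset(), [], []
    for _ in range(32):               # closed-loop rollout inside the world model
        mean = policy(state)
        dist = torch.distributions.Normal(mean, 0.1)
        action = dist.sample()
        log_probs.append(dist.log_prob(action).sum())
        state, reward = env.step(state, action)
        rewards.append(reward)
    # REINFORCE-style update: reinforce rollouts that earned higher total reward.
    total_reward = torch.stack(rewards).sum()
    loss = -torch.stack(log_probs).sum() * total_reward.detach()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```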
Detailed Construction of the Ideal VLA Model Architecture:
The appeal of a true end-to-end model lies in integrating and connecting these parts so that information flows between them losslessly and in real time through a common set of tokens; training the model means learning the parameters, such as the weights, that produce and consume those tokens. Ideal Auto's reinforcement learning (RL) framework relies on a highly realistic world model created by combining scene reconstruction and generation technology, which addresses the training bias that traditional RL suffers when the simulated environment is not realistic enough. Self-supervised learning reconstructs dynamic 3D scenes from multi-view RGB images and produces multi-scale geometric and semantic information. 3D Gaussians represent the scene as a point cloud in which each point carries a position, color, opacity, and covariance matrix, enabling efficient rendering of complex environments.
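As a concrete picture of the representation described above, here is a minimal sketch of a single 3D Gaussian primitive carrying position, color, opacity, and a covariance matrix. The field names and the density evaluation are illustrative assumptions rather than any particular renderer's code.

```python
# Minimal sketch of a 3D Gaussian scene primitive: each point stores a position,
# color, opacity, and covariance matrix, and contributes an anisotropic density
# to the scene. Field names and layout are illustrative only.
from dataclasses import dataclass

import numpy as np


@dataclass
class Gaussian3D:
    position: np.ndarray    # (3,) center of the Gaussian in world space
    color: np.ndarray       # (3,) RGB value
    opacity: float          # scalar transparency in [0, 1]
    covariance: np.ndarray  # (3, 3) symmetric positive-definite covariance

    def density(self, x: np.ndarray) -> float:
        """Unnormalized Gaussian density at point x, scaled by opacity."""
        d = x - self.position
        inv_cov = np.linalg.inv(self.covariance)
        return float(self.opacity * np.exp(-0.5 * d @ inv_cov @ d))


# Example: one roughly disc-shaped Gaussian (flat along z) sampled at a query point.
g = Gaussian3D(
    position=np.array([0.0, 0.0, 0.0]),
    color=np.array([0.8, 0.2, 0.1]),
    opacity=0.9,
    covariance=np.diag([0.5, 0.5, 0.05]),
)
print(g.density(np.array([0.1, 0.0, 0.0])))
```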
This approach allows the VLA model (end-to-end + language model) to be trained against cloud-based virtual 3D environments, running millions of kilometers of simulated driving and potentially replacing some real-vehicle testing.
In conclusion, the information presented here is public and largely drawn from Ideal Auto's own technical promotional material, so its effectiveness still needs to be validated in practice. Nevertheless, this article offers a general understanding of the algorithm's structure, ideas, and core technologies. If Ideal Auto's model proves successful, it could also be applied to other Physical AI applications, such as robots.
References: Ideal Auto 2025 GTC presentation, "VLA: A Leap Towards Physical AI in Autonomous Driving."