Understanding Ideal Auto's VLA Innovation

05/12 2025

Ideal Auto's success has long been attributed to its "fuel tank" technology, that is, its extended-range powertrain, and the company now wants to move past that perception.

Admittedly, Ideal Auto capitalized on the automotive electrification transition by embracing an intermediary technology: extended-range vehicles that run on both fuel and electricity. A fuel-driven generator charges the battery, and the car's other components resemble those of pure electric vehicles. For more details, refer to our earlier article, "Why Ideal Auto Achieved First Place in Sales Among New Forces and Raised Funds in Hong Kong Stocks."

Ideal Auto distinguished itself from its contemporaries, outpacing them and becoming a favorite among both consumers and investors. Recently, Ideal's widely discussed VLA technology has also drawn significant attention. This article examines Ideal's VLA through three questions:

Why did Ideal Auto launch VLA at this juncture?

What unique product features does VLA bring to intelligent assisted driving?

What is the true nature of Ideal Auto's VLA?

Ideal Under Pressure

Starting in 2022, AITO introduced its extended-range vehicles, and within a year its sales rivaled those of Ideal Auto. AITO's sales are significantly bolstered by Huawei's support, but Leapmotor, which also offers extended-range vehicles, entered the new energy market in 2023 and became the second profitable new force after Ideal Auto in early 2025. Extended-range technology has since been widely adopted by new and traditional automakers alike, both in China and abroad, so it no longer sets Ideal Auto apart.

Ideal Auto now feels a sense of urgency and is seeking its next growth trajectory. For a forward-looking company, AI is the logical answer, which raises the question of how AI has evolved and what stage it has reached. By leveraging AI, Ideal Auto aims to excel in both product innovation and marketing, building an imaginative product identity that resonates with investors and attracts consumers.

Readers familiar with our previous article, "2025 CES NVIDIA Insights: Agentic AI/Physical AI Rapidly Lands, the Future is Here," will recognize that Physical AI/Agentic AI represents the current trend and direction in AI.

Therefore, Ideal Auto has adopted a new label, Physical Agent, which combines these two popular AI terms and applies them to intelligent assisted driving. The technology behind this label is VLA. If you're unfamiliar with VLA, read "In 2025, Autonomous Driving is About to 'Roll Out' the End-to-End Large Model 2.0 - VLA (Vision Language Action)."

Product Features of VLA Implementation

Ideal Auto refers to its VLA technology as MindVLA, promising a novel product form and user experience. A MindVLA-equipped vehicle is meant to act like a dedicated driver who can understand, see, and find.

'Understanding' means that users can alter the vehicle's route and behavior via voice commands. For instance, in an unfamiliar park, a user can simply say to the vehicle through Ideal Mate, "Take me to find a supermarket." The vehicle will then drive itself to the destination without navigation assistance. During the journey, users can give further instructions, such as "You're going too fast" or "You should take the left road," and MindVLA will comprehend and execute them.

'Seeing' highlights MindVLA's robust general knowledge capabilities. It can recognize various store signs, like Starbucks and KFC. When users cannot locate their vehicle in an unfamiliar area, they can take a photo of the surroundings and send it to the vehicle. The MindVLA-enabled vehicle will search for the location in the photo and find the user automatically.

'Finding' signifies that the vehicle can autonomously navigate in underground garages, parks, and public roads. A practical scenario is when a user cannot find a parking spot in a mall's underground garage. They can instruct the vehicle, "Go find a parking spot and park." Leveraging its advanced spatial reasoning ability, the vehicle will autonomously search for a parking spot. Even if it encounters a dead end, the vehicle will reverse smoothly and continue searching for a suitable spot to park. This entire process does not rely on maps or navigation but solely on MindVLA's spatial understanding and logical reasoning.

In essence, the interaction mimics human-like communication, akin to that with a dedicated driver. However, to fully appreciate Ideal Auto's Physical Agent and VLA, it's essential to delve into their technical principles.

Technical Principles of VLA

For detailed information on the VLA model structure, refer to our previous article, "Ideal Intelligent Driving's VLA Model and Its Structure." Regarding its engineering implementation, there are four specific steps:

Firstly, Ideal Auto trained a base model in the cloud using a combination of vision data, language data, and joint VL (vision and language) data. This base model has approximately 32B (32 billion) parameters. According to Li Xiang's AI Talk speech, this base model may have been distilled from DeepSeek's open-source models, or at least borrowed from their structural methodology, such as MoE (Mixture of Experts). Ideal Auto indicates that its model is an MoE model comprising 8 experts.
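To make the MoE idea concrete, here is a minimal sketch of top-k expert routing in plain Python. Only the expert count (8) comes from the article. The top-2 selection, the toy experts, and the random gate scores are assumptions for illustration; in a real model the gate scores come from a learned layer applied to each token, and nothing here is claimed to match Ideal Auto's implementation.

```python
import math
import random

NUM_EXPERTS = 8   # from the article
TOP_K = 2         # assumption: the article does not say how many experts fire per token


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, top_k=TOP_K):
    """Pick the top_k experts and renormalize their gate weights to sum to 1."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def moe_forward(x, experts, gate_logits, top_k=TOP_K):
    """Weighted sum of the selected experts' outputs; other experts are never called."""
    weights = route(gate_logits, top_k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy experts (hypothetical): expert i simply scales its input by (i + 1).
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

# Stand-in gate scores; a real gate computes these from the token representation.
random.seed(0)
gate_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
output = moe_forward(1.0, experts, gate_logits)
```

The point of the design is that only the selected experts run for a given input, so the compute per token is a fraction of what the total parameter count would suggest.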

This base model was then distilled into a 3.6B (3.6 billion) parameter small model suitable for on-vehicle deployment.
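The article does not say which distillation objective Ideal Auto used. As a hedged illustration of the general technique, the sketch below computes the classic temperature-softened KL divergence between teacher and student output distributions, and works out the compression ratio implied by the two parameter counts quoted above. The temperature value and the example logits are assumptions.

```python
import math

TEACHER_PARAMS = 32e9    # cloud base model, per the article
STUDENT_PARAMS = 3.6e9   # distilled on-vehicle model, per the article


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer distribution)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


# The student has roughly 8.9 times fewer parameters than the teacher.
compression_ratio = TEACHER_PARAMS / STUDENT_PARAMS

# Illustrative logits only: the loss is zero when the student matches the teacher.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Minimizing this loss pushes the small model to reproduce the large model's full output distribution rather than only its top answer, which is why distillation can retain more capability than training the small model from scratch.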

Secondly, post-training transforms the distilled small model into a VLA (driver large model). The distilled model only understands the environment; this step adds action (vehicle planning and control) to create an end-to-end VLA that links perception to planning and control, so that perception inputs produce outputs such as steering, acceleration, and braking commands. The final on-vehicle VLA model has approximately 4B (4 billion) parameters.
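To put the roughly 4B figure in deployment terms, the arithmetic below estimates how much memory the weights alone would occupy at a few common numeric precisions. The article does not state which precision Ideal Auto uses on the vehicle, so the precision table is an assumption, and activations and caches are ignored.

```python
ONBOARD_PARAMS = 4e9  # approximate final VLA size, per the article

# Assumed storage formats; the article does not say which one is deployed.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params, bytes_per_param):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


footprints = {
    name: weight_memory_gb(ONBOARD_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}
# footprints == {"fp16": 8.0, "int8": 4.0, "int4": 2.0}
```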

The third step involves reinforcement training, akin to targeted driving education for the model. Ideal Auto's reinforcement training comprises two parts:

The first part is RLHF (Reinforcement Learning from Human Feedback), utilizing Ideal Auto's previously accumulated human takeover data for training, enabling the model to differentiate between good and bad actions.

The second part is pure RL (Reinforcement Learning), employing a world model for training. Essentially, a world model is a collection of physical rules in the human world used to educate or train models. Ideal Auto's world model encompasses three types of rules:

Comfort rules - primarily judging comfort through G-force values and providing feedback on comfort levels.

Safety collision rules - teaching the model that collisions are unacceptable.

Traffic rules - ensuring the model does not violate traffic regulations.

Comfort, safety collision, and traffic rules constitute the three core rules of Ideal Auto's world model.
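One way to picture how three such rules could feed reinforcement learning is as a single reward signal. The sketch below is an assumption-laden illustration, not Ideal Auto's reward design: the comfort threshold, the penalty sizes, and the StepOutcome fields are all hypothetical.

```python
from dataclasses import dataclass

# All thresholds and weights are illustrative assumptions, not Ideal Auto's values.
COMFORT_G_LIMIT = 0.3              # g-force above which a maneuver starts to feel harsh
COLLISION_PENALTY = -100.0         # large, so that collisions dominate the reward
TRAFFIC_VIOLATION_PENALTY = -10.0  # applied once per violation


@dataclass
class StepOutcome:
    """What the simulated environment reports for one driving step (hypothetical)."""
    peak_g: float
    collided: bool
    traffic_violations: int


def comfort_reward(peak_g: float) -> float:
    """Zero within the comfort limit, increasingly negative beyond it."""
    return -max(0.0, peak_g - COMFORT_G_LIMIT)


def step_reward(outcome: StepOutcome) -> float:
    """Combine the comfort, collision, and traffic rules into one scalar reward."""
    reward = comfort_reward(outcome.peak_g)
    if outcome.collided:
        reward += COLLISION_PENALTY
    reward += TRAFFIC_VIOLATION_PENALTY * outcome.traffic_violations
    return reward


smooth = step_reward(StepOutcome(peak_g=0.1, collided=False, traffic_violations=0))
harsh = step_reward(StepOutcome(peak_g=0.8, collided=False, traffic_violations=0))
crash = step_reward(StepOutcome(peak_g=0.1, collided=True, traffic_violations=0))
```

Making the collision penalty far larger than the others reflects the rule that collisions are unacceptable, while the comfort term only nudges the policy toward smoother driving.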

Together, the three steps above produce the VLA (driver large model).

The fourth step addresses how humans and vehicles interact to form the so-called Physical Agent. Ideal Auto explains that it builds an Agent (intelligent entity) for the driver, an interactive system for language and images. General short commands are processed directly by the VLA (driver large model) deployed on the vehicle. More complex commands are first processed by the 32B cloud model and then sent to the on-vehicle VLA.
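The split between on-vehicle and cloud handling can be sketched as a simple router. The article does not say how Ideal Auto decides which commands count as short, so the command list, the word limit, and the handler names below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical short-command list and word limit; the article gives neither.
SHORT_COMMANDS = {"slow down", "speed up", "turn left", "turn right", "pull over"}
MAX_ONBOARD_WORDS = 4


@dataclass
class RoutedCommand:
    text: str
    handler: str  # "onboard_vla" or "cloud_32b_then_onboard_vla"


def route_command(text: str) -> RoutedCommand:
    """Send short commands to the on-vehicle VLA and everything else via the cloud model."""
    normalized = " ".join(text.lower().strip(" .!?").split())
    is_short = (
        normalized in SHORT_COMMANDS
        or len(normalized.split()) <= MAX_ONBOARD_WORDS
    )
    handler = "onboard_vla" if is_short else "cloud_32b_then_onboard_vla"
    return RoutedCommand(text=normalized, handler=handler)


quick = route_command("You're going too fast")
complex_request = route_command("Take me to find a supermarket")
```

Under these assumed thresholds, "You're going too fast" stays on the vehicle, while "Take me to find a supermarket" makes a round trip through the cloud model before reaching the on-vehicle VLA.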

In practice, Ideal Auto's VLA may excel in human-like interactions for specific commands and environments but may struggle to guarantee real-time performance in complex scenarios.

This concludes the comprehensive methodology and structural system of Ideal Auto's VLA.

Final Thoughts

As predicted in our earlier article, "New Autonomous Driving Trend: The 'On-Vehicle Revolution' of DeepSeek-R1," DeepSeek can be likened to the Linux moment for AI large models. Among the companies across industries that build on or borrow from DeepSeek's open-source work, Ideal Auto is at the forefront, at least in terms of its announcements.

Ideal Auto has constructed a multimodal DeepSeek-like large model in the cloud and distilled it into a smaller on-vehicle model, using a consistent token language to bridge vehicle planning and control with human interaction.

However, assessing its usability through publicly available texts and information is challenging because what we easily access is often what others want us to see.

Nevertheless, Ideal Auto's VLA has indeed undertaken significant pioneering work in chip-level interactive compilation, enabling VLA's usage on both dual Orin and NVIDIA's latest Thor. As mentioned in our previous article, "Ideal Intelligent Driving's VLA Model and Its Structure," it innovatively adopts AI large model technologies like 3DGS, Diffusion, MoE, and CoT in the realm of intelligent assisted driving algorithms.
