What Are the Physical AI Large Models Released by Car Companies? What Are Their Strengths?

Home

Finance

ICV

Smart City

Digital Live

Cloud

Optics

Home Finance AI ICV Smart City Digital Live Cloud Optics

06/26 2026 429

In 2026, a term that will frequently echo through the Beijing Auto Show, major manufacturers' press conferences, and even the financial reports of tech companies is "physical AI." XPENG explicitly redefined its corporate identity in its financial report, shifting from an intelligent electric vehicle company to a physical AI enterprise. He Xiaopeng emphasized that the application of physical AI is on the cusp of transitioning from mass production to explosive growth, with the company substantially ramping up its R&D investment in this area in 2026. Huawei unveiled the ADS 5 system, elevating its underlying technology to an AI agent for autonomous driving. NIO's founder, Li Bin, publicly endorsed the latest update of NIO's World Model, announcing that both NIO and Leapmotor would receive new version upgrades of NIO's World Model NWM in June. Li Auto also revealed its next-generation autonomous driving foundation model, MindVLA-o1, at NVIDIA GTC 2026, stating that autonomous driving is merely the starting point for physical AI, and future foundation models will drive new paradigms in embodied intelligence.

From these concentrated actions by companies, it's evident that the technological focus of intelligent driving is shifting from enabling vehicles to 'see' the world to enabling them to 'understand' it, with physical AI serving as the core of this transformation.

What is Physical AI, and How Does It Differ from Traditional AI?

To grasp the concept of physical AI, we must first establish a fundamental premise: traditional AI processes information from the digital realm, whereas physical AI aims to enable intelligent agents to operate in the real, physical world.

Baidu Baike defines physical AI as a system that integrates physical world modeling with intelligent decision-making capabilities. Its essence lies in combining mathematical models, sensor data, and machine learning algorithms to empower intelligent agents to comprehend physical laws, predict environmental changes, and execute operations that adhere to physical constraints. Wang Xiang, a specially appointed professor at the University of Science and Technology of China, offers a more technically nuanced explanation: physical AI implies that AI systems possess closed-loop capabilities for perception, reasoning, action, and feedback in the real world. They not only think but also execute tasks through embodied devices like robots, continuously correcting errors and self-evolving based on real-world feedback.

Image Source: Internet

The core strength of traditional AI (or digital AI) lies in pattern recognition, enabling it to learn statistical patterns from vast amounts of labeled data and make predictions on unseen inputs. This capability excels in tasks like text generation and image recognition but has a fundamental flaw: it lacks an understanding of the underlying laws of the physical world.

A traditional model can accurately identify a car ahead but won't know the braking distance on a slippery road or predict the impact of road slope on vehicle center-of-gravity changes. Zhou Guang, CEO of Yuanrong Qixing, clarified this distinction more vividly in a speech: small models resemble conditioned reflexes, relying on local features and excelling at immediate reactions but struggling with advanced cognitive understanding. Large models, in contrast, are closer to cognitive intelligence, possessing stronger generalization capabilities and able to make holistic judgments like humans.

The fundamental difference is that traditional AI is data-driven mapping, while physical AI is physics-driven reasoning. The former requires massive amounts of labeled data to cover as many scenarios as possible, while the latter can make reasonable decisions even in unseen scenarios based on an understanding of physical laws and causal relationships.

Image Source: Internet

Cao Xudong, Partner and CEO of Momenta, provided a more foundational breakdown from the perspective of model predictive capability evolution. He pointed out that the core strength of large language models is next-token prediction, enabling AI to compress common sense from the digital world and possess text understanding capabilities. World models, however, aim to predict the future states and interaction logic of the physical world, thereby acquiring the ability to understand objects' physical properties, motion causality, and interaction potentials. Cao summarized that world models and reinforcement learning jointly constitute the two core pillars of physical AI, a concept widely recognized in the industry.

What Technologies Are Involved in Physical AI?

If physical AI is a capability goal, the technical pathways to achieve it primarily include world models and VLA (Vision-Language-Action).

Let's first discuss world models. World models are not simple simulation engines but systems capable of understanding the operational laws of the physical world and predicting future states accordingly. World models can be decomposed into three layers: the first layer is pre-training, where physical laws, common sense, and causal relationships are compressed into the model through vast amounts of real driving data, forming a foundational understanding of the physical world. The second layer is simulation, where world models are used for closed-loop simulation in autonomous driving, enabling the system to deduce how the world will evolve when its behavior changes and to evaluate performance in long-tail scenarios. The third layer involves reinforcement learning within world models, constructing highly realistic virtual training grounds for reinforcement learning, allowing the system to repeatedly explore and learn from mistakes in an environment close to reality.

Huawei's ADS 5 WEWA 2.0 architecture adopts this technical route. Its cloud-based world model introduces a multi-agent game mechanism and multimodal generative AI technology, increasing training intensity by 10 times. The WEWA 2.0 architecture employs online reinforcement learning that generates, learns, and verifies simultaneously, further boosting training efficiency by 10 times. The vehicle-end world behavior model applies safety risk field theory, generating dynamic risk heatmaps by quantifying kinetic energy fields, potential energy fields, and behavior fields, reducing collision risk by 50%. This technical architecture transfers extremely rare extreme scenarios from the physical world to virtual space for repeated training, thereby approaching the safety limits of real-world driving conditions at a controllable cost.

Image Source: Internet

NIO's world model route has also entered the stage of large-scale validation. In January 2026, NIO fully rolled out an intelligent assisted driving version utilizing a world model + closed-loop reinforcement learning technical architecture to hundreds of thousands of models equipped with Banyan, Cedar, and Cedar S intelligent systems. In the three months following the January version update, NIO users' usage mileage and duration of urban navigation-assisted driving increased by 92% and 116% month-over-month, respectively. Subsequently, NIO further upgraded its technical architecture to a three-layer training framework of world model + supervised fine-tuning + closed-loop reinforcement learning, achieving for the first time in China the direct operation of the steering wheel and pedals by the intelligent assisted driving system, bypassing the traditional intermediate layer of trajectory planning and resulting in a shorter path and lower latency. Additionally, NIO realized real-time recognition and understanding of tidal lanes, variable lane sky signs through reconstructing sensor information representation methods in its self-developed system.

Tesla disclosed a world model patent in January 2026, detailing its digital twin + parallel universe implementation scheme. Using visual data collected by real vehicles, it can reconstruct high-precision three-dimensional road models and then generate countless extreme scenarios difficult to collect in reality through algorithms on this model for training vehicle-end algorithms. This simulator allows AI to learn virtual driving conditions equivalent to 500 years of human experience in a single day, significantly reducing reliance on real-world road testing.

VLA large models, unlike world models that focus on virtual deduction, aim to unify perception, reasoning, and decision-making within a single model framework. Li Auto's MindVLA-o1, unveiled at NVIDIA GTC 2026, represents this direction. The model adopts a native multimodal MoE Transformer architecture and constructs a foundation model for autonomous driving oriented towards physical world intelligence through five major technological innovations. At the perception level, Li Auto employs a vision-centric 3D ViT Encoder, using LiDAR point clouds as three-dimensional geometric prompts to guide the model in understanding real spatial structures. Simultaneously, it introduces a predictive latent world model capable of efficiently simulating scene evolution over the next few seconds in latent space. Zhan Kun, head of Li Auto's foundation models, stated, 'When we unify vision, language, and actions into a single model, it is no longer just an autonomous driving model but is evolving into a general-purpose intelligent agent for the physical world.'

XPENG's second-generation VLA takes a more streamlined engineering approach, abandoning reliance on explicit three-dimensional reconstruction and emphasizing the use of continuous video streams to enable the model to learn spatial relationships and causal relationships on its own. XPENG's second-generation VLA has been officially rolled out to users, with the proportion of assisted driving miles for models equipped with the second-generation VLA exceeding 50% for the first time in the first month of rollout. XPENG's self-developed Turing chip also provides computational support for this model, with four Turing chips in the Robotaxi model GX delivering 3000 TOPS of effective computational power. He Xiaopeng described the second-generation VLA as 'not just an autonomous driving model but a foundation model for the physical world.'

Image Source: Internet

It's worth mentioning that these two routes are not mutually exclusive. Li Auto's MindVLA-o1 has already integrated the deductive capabilities of implicit world models, while Momenta's R7 reinforcement learning world model also covers decision-making and reasoning functions similar to VLA across the three levels of pre-training, simulation, and reinforcement learning. At this stage, the industry's competitive focus is shifting from theoretical route debates to comparisons of engineering implementation efficiency.

Why Are Car Companies Shifting Towards Physical AI Large Models?

Zhou Guang pointed out in an industry speech that the industry has invested substantial resources in small models over the past few years, but the improvement in model capabilities has shown a clear diminishing marginal effect. Initially, investing a small amount of resources yielded significant improvements, but as scenario complexity increased, investments grew while benefits became increasingly limited. Meanwhile, small models also exhibit a seesaw effect, where one version solves some problems but may introduce new ones. Subsequent targeted fixes may then bring new instability factors. This capability fluctuation not only affects system reliability but also makes it difficult for users to establish long-term trust in assisted driving. Zhou Guang's judgment is that the industry is transitioning from a small model-dominated phase to a large model consensus phase.

In fact, it's not hard to see that the long-tail scenarios on real roads are nearly infinite—a car suddenly darting out from the roadside, non-standard construction barriers, reflective interference from rainwater on the road surface during heavy rain... These scenarios have an extremely low probability of appearing in collected data, and traditional end-to-end models significantly increase their error rates when encountering them. The solution offered by physical AI is to enable AI to reason based on an understanding of physical laws and causal relationships rather than relying on memorization of scenarios. Like human drivers, they can make on-the-spot judgments when encountering unseen situations, which is the primary reason physical AI large models are gradually taking center stage.

Image Source: Internet

By the end of 2025, the Ministry of Industry and Information Technology approved the first batch of L3 autonomous driving vehicle models for market access, meaning vehicles can be fully taken over by the system under certain conditions. However, L3 autonomous driving imposes far higher reliability requirements on the system than L2, and the typical uncertainties present in L2 systems are unacceptable under L3 permissions. The causal reasoning capabilities and more stable performance provided by physical AI are precisely the necessary foundation for meeting L3 requirements.

Final Thoughts

The year 2026 could well be regarded as the inaugural year for embodied AI, a time when this technology stands at a pivotal juncture, transitioning from mere theoretical notions to widespread, real-world deployment. Public data reveals that, between January and February 2026, the adoption rate of new passenger vehicles in China equipped with L2 combined driving assistance features soared to 69.15%. The introduction of urban NOA (Navigate on Autopilot) functions is gathering pace, with manufacturers such as Huawei, XPENG, and Li Auto rapidly iterating their offerings. Moreover, Tesla's supervised version of FSD (Full Self-Driving) is also poised to officially make its entry into the Chinese market. Against this backdrop of swiftly rising adoption rates and the gradual convergence of technological approaches, the competition in intelligent driving among automakers has evolved from a focus on mere availability to a emphasis on usability. The crux of usability lies precisely in the system's ability to comprehend the causal relationships inherent in the physical world, much like humans do, enabling it to make decisions that are safe, seamless, and predictable.

-- END --

Solemnly declare: the copyright of this article belongs to the original author. The reprinted article is only for the purpose of spreading more information. If the author's information is marked incorrectly, please contact us immediately to modify or delete it. Thank you.

Newest

Links