Dialogue with Momenta's Cao Xudong: Speeding Up the Rollout of Embodied AI and Crafting a Legend of an Oriental Silicon Valley

04/27/2026

A decade ago, while wandering through Silicon Valley, Cao Xudong noticed a street sign that read "Fairchild." He thought nothing of it at first, then felt a jolt like an electric shock: the street shared its name with Fairchild Semiconductor, the company that symbolizes Silicon Valley's origins and that seeded the cascade of groundbreaking innovations that followed.

That initial "spark" has been carefully nurtured, and over the past ten years it has set Momenta's "flywheel" spinning fast. At the Beijing Auto Show on April 25, Momenta announced that it had secured design wins (designated projects) for more than 200 vehicle models, delivered more than 70 mass-produced models, and reached mass production in more than ten countries and regions.

More tellingly, over 60 vehicles from automotive brands on display at this Beijing Auto Show are equipped with Momenta's solutions, including newly released models from Mercedes-Benz, BMW, and Audi.

Judging by its mass-production and design-win (designated model) figures, as well as the size of its "circle of friends," Momenta has established itself as one of the undisputed leaders among China's intelligent driving suppliers. Together with Huawei, it forms a "two superpowers and multiple strong players" landscape, a pattern unlikely to be disrupted in the near future. It has held a market share above 60% in the third-party urban NOA supplier market for three consecutive years.

On the occasion of Momenta's 10th anniversary, Cao Xudong finally had the confidence to declare, "Creating an Oriental Silicon Valley Legend Belonging to China."

Sources close to Momenta revealed to us that its meteoric rise stems from a confluence of strength and opportunity. On one hand, it was the first company in China to simultaneously develop L2+L4 technologies, laying a solid foundation in technical prowess. Years of technological accumulation and validation have enabled its mass production business to achieve efficient and high-quality delivery.

On the other hand, it capitalized on market opportunities early on. During a period shaped by the "soul theory" debate, OEMs had limited options, and Momenta seized this window of opportunity. Cao Xudong corroborated this point in an interview after the press conference, saying that the intelligent driving market exhibits strong economies of scale and first-mover advantages. In other words, Momenta's "snowballing" advantage also serves as a barrier that protects its continued growth.

Judging from the auto show and its recent moves, Momenta's character is evolving, perhaps buoyed by the confidence that its commercial breakout has brought. The company has begun to give its vision a more prominent place and to look toward a longer-term future.

For instance, "Better AI, Better Life" is now being prominently featured on a larger scale, and on the backs of Momenta employees' T-shirts at the auto show, the English phrase "Saving a Million Lives in a Decade" is uniformly printed.

If Momenta in the past was a tech-savvy "science and engineering enthusiast" who revered the technological flywheel, today's Momenta has taken on a more humanistic touch—a key turning point for many tech giants transitioning from being powerful to truly great.

The engine behind Momenta's new ambitions for rapid growth is an across-the-board acceleration of its work on "embodied AI."

This stems partly from changes in the industry's competitive landscape: Currently, digital AI has become a fiercely competitive red ocean, and traditional technological routes are hitting ceilings, especially in long-tail scenarios where effective breakthroughs remain elusive. At the same time, many intelligent driving suppliers and automakers are beginning to accelerate the implementation of end-to-end and world models, with homogeneous price wars and experience battles squeezing growth space.

More importantly, it is due to Momenta's proactive strategic "positioning" based on a clear anticipation of future trends. Cao Xudong believes that autonomous driving is the prologue to embodied AI, and world models + reinforcement learning are the two core pillars of embodied AI.

Simply put, this represents a leap from "seeing the world" to "understanding the world" for intelligent driving, enabling it to truly comprehend the physical world and its laws like humans, ultimately achieving a level that significantly surpasses human drivers.

According to Xia Yan, Momenta's Partner and R&D SVP, Momenta's technological route is divided into three layers. First is world model pre-training, which compresses physical laws, common sense, and causality into the model through massive real driving data, enabling the system to form a basic understanding of the physical world. Second is world model simulation, used for closed-loop simulation in autonomous driving, allowing the system to deduce how the world will evolve when its behavior changes and to evaluate long-tail scenarios. Third is reinforcement learning, which builds a highly realistic virtual training ground based on the first two layers, enabling the system to repeatedly explore and learn from mistakes in an environment close to reality.
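The three layers described above can be illustrated with a deliberately tiny sketch. It is not Momenta's method: the one-parameter braking "world model," the proportional braking policy, the cost function, and every constant below are assumptions made up for illustration, and a plain grid search stands in for reinforcement learning. The point is only the order of operations: fit a model of the world from logged driving, use that model as a closed-loop simulator, then improve a policy inside the simulator.

```python
import random

# Toy illustration only: every name, constant, and formula here is hypothetical.
DT = 0.1              # simulation step in seconds
CRASH_PENALTY = 1e6   # cost assigned to any rollout that reaches the obstacle

def true_step(gap, speed, brake):
    """The hidden 'real world': full braking decelerates at 8 m/s^2."""
    new_speed = max(0.0, speed - 8.0 * brake * DT)
    return gap - new_speed * DT, new_speed

def collect_logs(n, rng):
    """Stand-in for fleet data: (speed, brake, next_speed) triples."""
    logs = []
    for _ in range(n):
        speed, brake = rng.uniform(5.0, 30.0), rng.uniform(0.0, 1.0)
        _, next_speed = true_step(100.0, speed, brake)
        logs.append((speed, brake, next_speed))
    return logs

def pretrain_world_model(logs):
    """Layer 1: least-squares fit of k in next_speed = speed - k * brake * DT."""
    num = sum((s - s2) * b * DT for s, b, s2 in logs)
    den = sum((b * DT) ** 2 for _, b, _ in logs)
    return num / den

def rollout_cost(k, gain, gap, speed, max_steps=3000):
    """Layer 2: closed-loop simulation of one scenario using the learned k."""
    discomfort = 0.0
    for _ in range(max_steps):
        if speed < 0.1:                        # effectively stopped
            break
        brake = min(1.0, gain * speed / gap)   # policy: brake harder when fast and close
        speed = max(0.0, speed - k * brake * DT)
        gap -= speed * DT
        if gap <= 0.0:
            return CRASH_PENALTY
        discomfort += brake ** 2 * DT          # harsh braking is penalized
    return discomfort

def search_policy(k, scenarios, gains):
    """Layer 3: choose the gain with the lowest total simulated cost."""
    def total(gain):
        return sum(rollout_cost(k, gain, gap, speed) for gap, speed in scenarios)
    best = min(gains, key=total)
    return best, total(best)

rng = random.Random(0)
k_est = pretrain_world_model(collect_logs(500, rng))
scenarios = [(60.0, 20.0), (40.0, 15.0), (80.0, 25.0)]   # (gap in m, speed in m/s)
gains = [0.05 * i for i in range(1, 41)]
best_gain, best_cost = search_policy(k_est, scenarios, gains)
print(f"learned k={k_est:.3f}, best gain={best_gain:.2f}, cost={best_cost:.3f}")
```

Because the policy is only ever evaluated inside the simulator, a candidate that crashes costs nothing but compute, which is the property the "virtual training ground" relies on.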

In terms of capability application, Sun Gang, Momenta's Partner and R&D SVP, stated that enabling vehicles to drive smoothly in most daily scenarios is just the foundation. The true value of Momenta's embodied AI lies in its ability to ensure safety even in extremely rare scenarios that occur once in ten thousand times.

Sun Gang gave an example: If a box of apples falls from the vehicle ahead while driving, traditional technology would logically identify the obstacle and brake urgently. However, Momenta's embodied AI capability can predict the trajectory and spread of the rolling apples, decelerate smoothly in advance, and plan a detour route, handling the situation in a more composed and human-like driving manner.

In fact, Momenta's embodied AI is not just a technological concept; it has already moved into mass production. Momenta R7, the first to feature embodied AI capabilities, is installed in SAIC Volkswagen's ID.ERA 9X, which was officially launched at the auto show. With a starting price of 299,800 yuan, the car drew more than 10,000 orders within an hour.

For Momenta, being chosen by overseas giants such as BBA (Mercedes-Benz, BMW, and Audi) and Volkswagen represents an "admission ticket" earned under the world's most stringent standards, and a "passport" for its embodied AI to enter international high-end supply chains.

At the scene, Photon Intelligence had an offline dialogue with Cao Xudong. The following is a record of the dialogue, with some content edited for clarity while preserving core viewpoints and expression logic:

Q: Reverse joint ventures are now gaining popularity in the global automotive industry, with more and more overseas automakers valuing Chinese tech giants. How do you view this new trend? During this Beijing Auto Show, have any overseas potential clients come to communicate with us?

Cao Xudong: Chinese technology is now going global at a rapid pace. When entering overseas markets, such as Europe and others, it brings more advanced product value to local users but also brings some impact, such as on local companies, employment, or taxation. A good solution is to learn from China's previous model—reverse joint ventures. After reverse joint ventures, local areas can enjoy the excellent user experience of Chinese high-tech products while also benefiting from Chinese technology empowering local enterprises, bringing more development, better job opportunities, more employment, and better taxation. It's a win-win model.

Q: Which overseas clients have communicated with us at this auto show? What challenges has Momenta faced in cooperating with foreign automakers? What is this year's goal for going overseas?

Cao Xudong: Not just this year; last year we were already the common choice of global brands. Among the world's top brands, we already have mass-production partnerships with Germany's BBA and Volkswagen, Japan's Toyota, Honda, and Nissan, and America's GM and Ford.

The most common challenge is the conflict between China's speed and international OEM standards. However, this conflict mainly revolves around customers and users. By co-creating with a focus on customer and user value, better innovative methods can often be found, leading to better results.

Q: Data-driven is something Momenta has always emphasized. In the actual mass production process of the data flywheel, do you think the biggest bottleneck is data volume, algorithms, or automaker cooperation? There are also claims in the market that obtaining large amounts of data is not difficult, but the challenge lies in utilizing it well, and few automakers can truly do so. How do you view this perspective? And what does Momenta do?

Cao Xudong: Data is not just data itself; you could say data is like ore, specifically iron ore with a very low mineral content. So, to truly use data, you first need to turn this low-grade ore into high-grade ore.

For example, as I shared earlier, three puppies walking in a line across a highway is an extremely rare scenario. How do you pick out such data? The difficulty itself is like finding a needle in a haystack, which already sets a high threshold. How do you turn low-grade ore into high-grade ore, then into steel, and finally into an engine that gets installed in a car? That's where the ultimate value lies. So, the entire data flywheel system is a systemic capability. Having raw data, even massive amounts of it, is only 10% of the value source. The remaining 90% comes from the value of the system. That's the first point.
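As a rough illustration of the "needle in a haystack" problem, the sketch below surfaces clips whose combination of scenario tags is rare across a fleet log. Everything in it is an assumption for illustration: the clip IDs, the tag names, the `max_share` threshold, and above all the premise that clips already carry reliable tags. In practice, producing those tags is itself a large part of the difficulty described above, and frequency counting is only the simplest possible stand-in for real data mining.

```python
from collections import Counter

def mine_rare_scenarios(clips, max_share=0.001):
    """Return IDs of clips whose tag combination makes up at most max_share of all clips."""
    # Sort tags so that the same combination always maps to the same key.
    counts = Counter(tuple(sorted(tags)) for _, tags in clips)
    total = len(clips)
    return [clip_id for clip_id, tags in clips
            if counts[tuple(sorted(tags))] / total <= max_share]

# Hypothetical fleet log: 9,999 routine clips and a single unusual one.
clips = [(f"clip-{i}", ["highway", "car_ahead"]) for i in range(9999)]
clips.append(("clip-9999", ["highway", "animal", "animal", "animal"]))

print(mine_rare_scenarios(clips))  # ['clip-9999']
```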

Q: There's a saying that data is not difficult to obtain, but utilizing it well is. So, how does Momenta utilize data effectively?

Cao Xudong: I can't go into too much detail about large models, but I can share that we divide the process into a pre-training phase and a post-training phase. Pre-training uses massive data from our mass-produced vehicles, which now total 800,000 cars. That data includes a large amount of long-tail data and is used for world model pre-training.

After pre-training, the model has physical common sense, but having physical common sense doesn't make it a good driver. Because among the massive data, there are good driving behaviors, but even more, there are bad driving behaviors. So, it's a bit like training large models in digital AI. With massive data as input, the model has worldly common sense, but that doesn't mean it has good behavior. Therefore, Post-Training is still needed to align or inspire its behavior toward good human behavior. These are roughly the two stages.

Q: At this Beijing Auto Show, many automakers are emphasizing the differences in their assisted driving technology routes, such as XPeng's upgraded VA and Huawei's ADS 5.0. Compared with them, what most distinguishes Momenta's world model?

Cao Xudong: I thought Xia Yan explained this very well during our exchange. Architectural capability matters more than any single-point algorithm. Architecture involves trade-offs: not every innovation fits into the same architecture, and a good architecture allows better accumulation and synergy. Above architecture sits the system, which includes the data iteration system, the training system, and the entire iteration and verification system. Above the system come organization and culture. As the Chinese saying goes, "Oranges grown south of the Huai River are juicy and sweet, while those grown north of the Huai River are bitter and puny."

I believe the fundamental gap between enterprises comes from the construction of their organization, culture, and corresponding systems, which creates a larger gap. While single-point algorithm innovations are certainly important, each generation of algorithmic architecture innovation actually brings significant progress. However, frankly, in the Chinese environment, the flow of knowledge and talent is relatively fast. Relying solely on single-point algorithms does not create particularly large barriers or differences. The barriers lie in systemic and organizational capabilities. That's why you may find that everyone is talking about the same direction for single-point algorithms, but the final results may differ by one or two generations. The gap is not in single-point algorithms but in systems and organizations.

Q: This year marks Momenta's 10th anniversary. At the beginning of the venture, three visions were set. At the press conference just now, user stories left a deep impression. At this moment, at the Beijing Auto Show, what insights would you like to share with everyone?

Cao Xudong: I feel quite fortunate. The most important thing is to do what you truly love with like-minded people. It really makes your life vibrant. There are many difficulties and challenges in entrepreneurship. Every year, you may feel that this year is the hardest, and next year will be better, but that's not the case. So, if you don't enjoy the process of discovering and solving problems, if you don't enjoy exploring and facing difficulties with like-minded people, it's hard to persist in entrepreneurship. You might grit your teeth and persist for one year, two years, or three years, but it's hard to persist for ten years. So, you must find like-minded people to do what you love and make your life vibrant.

Q: After embodied AI was popularized by NVIDIA's CEO Jensen Huang, many companies claim to be embodied AI companies. Where do you think Momenta stands globally in terms of embodied AI?

Cao Xudong: First, I believe embodied AI is the trend. Why? Everyone knows that digital AI has significant advantages. First, digital AI data can be quickly obtained on a large scale.

Everyone knows that OpenAI worked on robots in its early days alongside digital AI, but when it narrowed its focus it temporarily set robots aside and chose to build GPT. An important reason was that robot data was too hard to obtain.

GPT relies on the vast expanse of internet data, and digital AI has advanced remarkably in recent years. Second, digital AI is cheaper to test and has shorter development cycles, because it can interact with the digital world at lower cost and with greater efficiency. For instance, if an agent needs to make a call, all it requires is an interface. In contrast, for a robot to use a tool, a mechanical hand must first be built, and the robot must then grasp the tool and operate it. The difficulty and complexity are significantly higher.

Nevertheless, our world comprises both digital and physical realms, with the latter potentially being even more extensive. Consequently, after achieving substantial progress in the digital domain, numerous successful experiences and methodologies will naturally spill over into the physical world, fostering innovation there. This is why I believe we are merely at the dawn of the Physical AI era.

Returning to our company's focus on Physical AI, I believe its core lies in two interconnected and mutually reinforcing aspects: the data loop and the commercial loop. I've observed, and this is only a personal observation, that once an AI application approaches human-level performance, it tends to surpass human capabilities within a short span. For example, both AlphaGo and earlier facial recognition technologies went through a long period of gradual refinement, perhaps a decade or two, to reach human-level performance. Yet surpassing or significantly exceeding human performance might take just a few years. This observation prompted me to ponder the underlying reasons.

Subsequently, I realized that the crux still lies in the data loop and the commercial loop, which form a positive feedback cycle. Initially, the data loop fosters a superior user experience. Once this experience matches or surpasses human-level performance, it can trigger explosive commercialization. This, in turn, leads to an exponential increase in data, further enhancing model capabilities. Ultimately, these factors synergistically promote and inspire each other, creating a robust positive feedback loop. This potent feedback mechanism enables achieving performance levels that are tens, hundreds, or even thousands of times greater than human capabilities in a remarkably short period.
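The shape of this argument, slow progress up to human level followed by rapid progress beyond it, can be reproduced with a toy recurrence. The model below is an assumption invented for illustration, not anything Momenta has described: capability grows as the square root of accumulated data, a fixed pilot fleet collects data before human level is reached, and the fleet grows by a constant factor each step afterwards. All units and parameter values are arbitrary.

```python
import math

def simulate_flywheel(steps, human_level_data=100.0, pilot_fleet=1.0, growth=1.5):
    """Return capability per step; 1.0 means human level (toy units throughout)."""
    data, fleet, history = 0.0, pilot_fleet, []
    for _ in range(steps):
        data += fleet                                     # data loop: the fleet generates data
        capability = math.sqrt(data / human_level_data)   # diminishing returns on data
        if capability >= 1.0:
            fleet *= growth                               # commercial loop: sales expand the fleet
        history.append(capability)
    return history

def first_step_reaching(history, level):
    """1-based index of the first step whose capability is at least `level`."""
    return next(i + 1 for i, c in enumerate(history) if c >= level)

history = simulate_flywheel(200)
to_human = first_step_reaching(history, 1.0)
to_10x = first_step_reaching(history, 10.0) - to_human
print(f"steps to human level: {to_human}, further steps to 10x human level: {to_10x}")
```

Under these assumed parameters, reaching human level takes 100 steps while the next tenfold improvement takes 20. The numbers themselves mean nothing; the asymmetry comes entirely from the fleet starting to compound once the commercial loop switches on.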

Our assessment indicates that autonomous driving has entered this phase, whereas robots still require some time. That's the first point. Hence, autonomous driving marks the beginning of the Physical AI era, as it is the first to establish extensive data loops and commercial loops.

The second point is that achieving large-scale L4 autonomous driving, in my estimation, necessitates cumulative investments of at least billions of US dollars, and that might still reflect the research and development efficiency of a startup. For a large corporation, the figure could escalate to hundreds of billions of US dollars.

But what about robots? What is the cost of developing a general-purpose robot? My estimation is that it could range from hundreds of billions to a trillion US dollars, again possibly reflecting the research and development efficiency of a startup. Therefore, my conclusion is that Physical AI demands an entry ticket, which is a business capable of generating cash flow. Although the entire Chinese embodied intelligence capital market is currently highly active, relying solely on investment and financing to achieve general-purpose Physical AI or AGI in the physical world is unrealistic in the long run. Instead, there must be a cash flow business, which could be autonomous driving or another direction within Physical AI that I haven't yet considered, capable of achieving large-scale data loops and commercial loops earlier, or other cash flow businesses stemming from digital AI. In any case, a cash flow business is essential to support the research and development of Physical AI.

Q: What progress has Momenta made in its L4 business this year? Are there any plans or key milestones? Additionally, with an increasing number of players entering the Robotaxi sector, could you share Momenta's advantages in this area?

Cao Xudong: Our company's L4 endeavors extend beyond Robotaxi; we are also pursuing Robovan for logistics. In our ten-year vision, we aim to double the efficiency of logistics and transportation within a decade. In fact, logistics takes precedence over transportation in our strategy. Next year, we will also venture into Robotruck, although we won't embark on that journey this year.

What is the rationale behind this strategy? It harks back to the book by Jeff Hawkins that I mentioned earlier, which discusses a core concept: a neural network or a large model can attain general AI capabilities. Specifically, in the field of autonomous driving, what do we believe? We believe that a large autonomous driving model can encompass all vertical applications of autonomous driving and perform even better.

Moreover, we have already successfully validated this on Robotaxi, Robovan, and passenger vehicles, yielding excellent results. The value it brings is a significant reduction in R&D costs for each vertical. Furthermore, the experience and data from each application scenario and vertical can be aggregated and assimilated into this large model, enhancing the performance of each vertical. This is, in essence, a platform advantage.

This is somewhat analogous to the entire internet industry over a decade ago, where there were both vertical e-commerce platforms and platform e-commerce. However, in the end, the platform e-commerce companies prevailed, and vertical e-commerce might not even exist now. A significant reason for this is the platform effect. Our judgment is that there is also a potent platform effect in the field of autonomous driving and large models. A single large model can encompass all vertical domains and perform even better, resulting in lower costs and superior performance in each vertical.

Q: How do you perceive the landscape of intelligent driving? Will Huawei, Momenta, and others continue to dominate this year, or will other stronger intelligent driving suppliers catch up? Also, what is your perspective on whether there will be a definitive outcome in intelligent driving by 2030?

Cao Xudong: Intelligent driving, or autonomous driving, as a whole exhibits very strong scale effects and first-mover advantages, even stronger than those in the chip industry. If you look back in history, you'll notice that in the chip industry, whether it's PC-era chips or mobile phone chips, there are only two global players: Qualcomm and MediaTek (MTK).

Autonomous driving, being software-based, has zero marginal costs, so its scale effects are even more pronounced. These scale effects encompass not only cost reductions but also enhancements in user experience.

On the other hand, there is a particularly strong first-mover advantage with automakers because many business deals with automakers take three years to finalize, from the initial meeting to securing the contract. For international OEMs, it might even take five to seven years.

Take Mercedes-Benz as an example. They invested in us in 2017, which was quite serendipitous and fortunate. Ola Källenius, the current Chairman of Mercedes-Benz, thought our company was very dynamic and chose to invest in us. However, our first mass-produced project with Mercedes-Benz didn't reach the market until the second half of 2025. That is eight years in total, and even that was an accelerated timeline.

I once consulted a senior Tsinghua University alumnus, who told me that collaborating with Mercedes-Benz on mass production would take at least ten years. From 2017 to 2020, we were in the POC (Proof of Concept) phase; from 2020 to 2022, we were in the Pre-SOP (Pre-Start of Production) phase; from 2022 to 2024, we were working on small-scale mass production; it wasn't until 2024 that we secured all of Mercedes-Benz's electric and gasoline vehicle businesses, and true mass production did not begin until the end of 2025.

So, from this example, you can see why it takes three years to close a deal with domestic OEMs and five to seven years with overseas OEMs. This industry possesses very strong scale effects and first-mover advantages. Therefore, I still adhere to my original judgment: there will only be 2-3 players in China and 3-4 globally, and the convergence will occur very swiftly.
