SenseTime Embraces the Grand Game of Embodied Intelligence

July 23, 2025

Amid the industry's turning point, SenseTime unveils its embodied intelligence "brain" plan.

The 2025 World Artificial Intelligence Conference (WAIC2025) opens this weekend. Ahead of the event, SenseTime made a notable announcement: it will unveil a new embodied intelligence "brain" at the conference.

According to SenseTime's official WeChat account, at the WAIC2025 Large Model Forum on July 27, the company will showcase its intelligent "brain" system, which integrates perception, visual navigation, and multimodal interaction, empowering intelligent devices such as robots and smart appliances.

SenseTime's entry into the embodied intelligence "brain" race comes as no surprise to the industry. Research and industrial deployment in this field are led primarily by two camps: computer vision experts such as Fei-Fei Li, and robotics practitioners. For SenseTime, with its roots in computer vision, doubling down on embodied intelligence is a logical and necessary move.

01

Laying the Foundations for the Embodied Intelligence "Brain"

The embodied intelligence "brain" has emerged as a key competitive arena in global AI. OpenAI and robotics firm Figure AI have collaborated on a versatile robot, Google has launched its embodied intelligence RT-2 model, and NVIDIA is focusing on world models and simulations. Domestically, Huawei also introduced the CloudRobo embodied intelligence platform with a "brain" in June this year. SenseTime, one of the earliest players, has been continuously evolving its technical approach.

Why do global tech giants place such emphasis on this track? The current wave of embodied intelligence is driven, at its root, by the deep integration of large models with robotics. In the "pre-large model era," robots were specialized "one-trick ponies": a food delivery robot couldn't tighten a bolt, and a bolt-tightening robot couldn't pour coffee. Generalization across robot bodies, tasks, and scenarios remained a major bottleneck for embodied intelligence.

The tide turned in 2022. With the rise of large models such as ChatGPT, AI gained the ability to understand natural language, generate content, and perform deep reasoning. The industry began envisioning robots equipped with "smarter brains," able to transcend the constraints of any single robot body and tackle more complex, flexible tasks.

This underscores the significance of the embodied intelligence "brain." The field is still exploratory, however, with no settled technical path. Some industry insiders point to three primary approaches (a code sketch contrasting them follows the list):

VLA Model (Vision-Language-Action): language plus images in, actions out. The structure is simple, but the approach struggles to recognize physical attributes, exploit physical laws, and supply sufficient control trajectories.

"Cerebellum and Cerebrum" Architecture: Divides responsibilities between "planning" and "execution," enhancing system modularity and interpretability, yet still faces generalization challenges.

World Model: ambitiously models environmental states, physical laws, temporal logic, and more, emphasizing multimodal information fusion and reasoning, with the goal of enabling agents to understand the world, predict changes, and plan behavior.
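To make the contrast concrete, here is a minimal Python sketch of the three interfaces. Every class and method name is an illustrative assumption, not any vendor's actual API:

```python
# Illustrative only: the three architectures contrasted at the interface level.
from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes      # current camera frame
    instruction: str  # natural-language command

@dataclass
class Action:
    joint_targets: list[float]  # e.g., target joint angles for an arm

class VLAPolicy:
    """Vision-Language-Action: one end-to-end model, observation in, action out."""
    def act(self, obs: Observation) -> Action:
        # A single network maps (image, language) directly to motor commands.
        raise NotImplementedError

class CerebrumCerebellum:
    """The 'cerebrum' plans at task level; the 'cerebellum' does low-level control."""
    def act(self, obs: Observation) -> Action:
        plan = self.plan(obs)      # e.g., "grasp the red cup"
        return self.execute(plan)  # servo-level trajectory for that step
    def plan(self, obs: Observation) -> str: ...
    def execute(self, plan: str) -> Action: ...

class WorldModelAgent:
    """World model: predict candidate futures, then act against the predictions."""
    def act(self, obs: Observation) -> Action:
        candidates = self.propose(obs)
        # Score each candidate action by the state it is predicted to lead to.
        return max(candidates, key=lambda a: self.score(self.predict(obs, a)))
    def propose(self, obs: Observation) -> list[Action]: ...
    def predict(self, obs: Observation, action: Action): ...  # learned dynamics
    def score(self, predicted_state) -> float: ...
```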

From what SenseTime has disclosed so far, it has not explicitly bet on a single technical route, but its phased evolution points to a steady accumulation of "world model" capabilities.

SenseTime's exploration and layout of the embodied intelligence "brain" can be divided into four stages:

Step 1: From "seeing" to "moving," constructing a vision-perception-decision-making closed loop.

In August 2022, SenseTime launched the home robot "Yuanluobo," pairing vision algorithms with a robotic arm for the first time to recognize chess pieces and grasp them precisely even under occlusion. This marked the start of SenseTime's "vision-perception-decision" closed loop, giving robots a basic framework for interacting with the physical world.

Traditional AI is "open-loop": it sits in the cloud, "thinking" about and "seeing" the world. Once physical operation is involved, however, a more complex closed loop must form, in which perception is converted into "understanding" that drives action execution. This is the starting point of embodied intelligence.
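In code, that closed loop reduces to a simple control cycle. The sketch below is a generic illustration with assumed `camera` and `actuators` interfaces, not SenseTime's implementation:

```python
import time

def decide(scene, goal):
    """Placeholder policy: map the perceived scene and the goal to the next action."""
    ...

def control_loop(camera, actuators, goal, hz: float = 10.0):
    """Perception-decision-action loop: each action changes the world, and the
    next perception feeds that change back in, which is what closes the loop."""
    while True:
        scene = camera.capture()      # perceive: "see" the current state
        action = decide(scene, goal)  # understand: turn perception into intent
        actuators.apply(action)       # act: drive the physical world
        time.sleep(1.0 / hz)          # run at a fixed control frequency
```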

Step 2: Release of "Ririxin V5.5 - V6," advancing multimodal fusion and reasoning capabilities toward the cognitive center.

In April 2025, SenseTime introduced the "Ririxin V6" multimodal large model, whose core breakthrough is modal fusion: support for long chains of thought, multimodal reasoning, and planning. It tackles a long-standing problem in embodied intelligence: traditional systems "break the chain" when faced with even moderately complex, multi-step tasks over long time spans. The model acts like a robot's "cerebral cortex," leaping from "receiving signals" to "understanding intentions."

Crucially, the model has already been adopted by robotics firms such as Fourier and Guixu, a sign that SenseTime can deliver platform-level technology and is moving from R&D to industrial enablement, a strategic leap.

These capabilities did not evolve overnight. As early as July 2024, SenseTime released "Ririxin V5.5," a crucial waypoint on this fusion path: building on a performance upgrade of the 600-billion-parameter model, it introduced synthetic high-order chain-of-thought data, excelled at mathematical logic and instruction following, and debuted "Ririxin 5o," a natively streaming multimodal interaction model that pushed AI from "responding to input" toward "understanding scenarios."

In addition, edge-side models such as "Ririxin 5.5 Lite" markedly improved efficiency and accessibility, bringing large models closer to users.

In late 2024, SenseTime pioneered the "Ririxin Fusion Large Model," achieving natively fused multimodal training and breaking the old pattern of language models and multimodal models operating separately. The model topped both the SuperCLUE and OpenCompass rankings, a "double champion," marking SenseTime's first substantial breakthrough in unifying deep reasoning and multimodal fusion within a single model. This series of innovations paved the way for the leap in V6's capabilities.

Step 3: "Brain" platformization, progressing toward the world model.

Next, SenseTime will unveil its embodied intelligence "brain" platform, marking its leap from single-point capabilities to an integrated system.

Furthermore, SenseTime has long invested in intelligent driving, where the world model is a key breakthrough direction. "Kaiwu," developed under SenseTime's intelligent driving brand "Jueying," can already grasp physical laws, learn traffic rules, and has been deployed in real-world scenarios.

Cars and robots are essentially embodied intelligent agents. Possessing capabilities like perception, navigation, and interaction is their shared pursuit. SenseTime may transfer its "world model" experience from autonomous driving to robotics, driving further evolution of the embodied intelligence brain.

This product launch signifies the advancement of SenseTime's embodied intelligence strategy into a new "platformization" and industrial output phase. Behind it lies SenseTime's long-term technology embedding and strategic patience.

02

SenseTime's Preparation

Besides the embodied intelligence brain, SenseTime sent another clear signal through WAIC 2025: embodied intelligence is a competition of "compute density × data density × ecological density."

In recent years, demand for compute has surged, and compute density largely determines how quickly models evolve and applications land. Embodied intelligence in particular must perceive and understand the physical world across multiple modalities, which drives compute consumption even higher.

Statistics reveal that the proportion of compute power consumption in embodied intelligence scenarios has soared from 12% in 2023 to 28%. Additionally, 30% of NVIDIA's chip sales in the first quarter of this year went to embodied intelligence devices.

Over the years, SenseTime has invested heavily in compute. As early as 2018, it built the country's first prototype of an AI thousand-card cluster. Two years later, it established the nation's first intelligent computing center. By the end of 2024, the compute scale of SenseTime's large device had reached 23,000 petaflops, surpassing the public intelligent-computing capacity of some major cities. In April this year, SenseTime launched SenseCore 2.0, aiming to become "the AI infrastructure that best understands large models."

SenseTime's original aim in building the large device was to raise AI model production efficiency and lower usage costs. The combination of large device and large models, however, has significantly strengthened SenseTime's compute capabilities. For instance, SenseTime separates the prefill and decode phases of model inference, improving GPU utilization and reducing inference latency.
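Prefill/decode separation (often called disaggregated inference) is a known large-model serving technique; SenseTime has not published its implementation details, so the sketch below is a generic illustration with hypothetical `model.forward` and `model.step` methods:

```python
def prefill(model, prompt_tokens):
    """Compute-bound phase: one batched forward pass over the whole prompt,
    producing the KV cache. This batches well on dedicated 'prefill' GPUs."""
    return model.forward(prompt_tokens)  # hypothetical API

def decode(model, kv_cache, max_new_tokens):
    """Memory-bandwidth-bound phase: generate one token per step from the cache.
    Running it on a separate 'decode' GPU pool keeps long prefills from stalling
    in-flight generations, raising utilization and cutting latency."""
    tokens = []
    for _ in range(max_new_tokens):
        logits, kv_cache = model.step(kv_cache)  # hypothetical API
        tokens.append(int(logits.argmax()))
        # A real server would also stop early on an end-of-sequence token.
    return tokens

# In a disaggregated deployment, prefill() runs on one GPU pool, the KV cache
# is transferred, and decode() continues on a second pool tuned for its phase.
```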

SenseTime's large devices have helped it win numerous intelligent computing and large model orders. Leveraging them, China Southern Power Grid achieved 100% localization from models and platform algorithms down to the underlying compute, building a fully domestic AI infrastructure for the power sector. A leading infrastructure-industry design institute relied on the large devices' domestic chips and base platform to develop large language models and multimodal models for engineering survey and design, easing problems such as expert knowledge that is hard to pass down, poor system integration, and low adoption.

IDC's report shows that SenseTime's large devices ranked second in the domestic AI large model solution market in the second half of 2024.

Today, SenseTime has evolved into an AI vendor with a trinity of "large devices-large models-applications."

Besides compute power, high-quality data is a major bottleneck in embodied intelligence development. Despite rapid evolution in brain architectures and technical routes, all ultimately converge on a consensus: data is the hardest nut to crack in embodied intelligence.

Moreover, some industry insiders argue that embodied intelligence, like large models, follows a scaling law. Their experiments show that every ten-fold increase in collected data cuts the robot's error rate by roughly a factor of ten. Raising the success rate from 99% to 99.9% therefore requires ten times more data, with costs multiplying at each step.
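As a back-of-the-envelope check, the claim corresponds to an inverse relationship between error rate and data volume; the exact functional form below is an assumption consistent with the figures quoted:

```latex
% Assumed inverse scaling of error rate \varepsilon with data volume D:
\varepsilon(D) \approx \frac{c}{D}
\quad\Longrightarrow\quad
\frac{D_{99.9\%}}{D_{99\%}}
  = \frac{\varepsilon_{99\%}}{\varepsilon_{99.9\%}}
  = \frac{10^{-2}}{10^{-3}} = 10
```

Each additional "nine" of reliability multiplies the data requirement, and with it roughly the collection cost, by another factor of ten.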

Currently, high-quality data for embodied intelligence comes from offline real-world collection, simulated synthetic data, and internet data, each with its own pros and cons. Both Tesla and Google, for instance, use teleoperation to collect data, but the underlying costs are immense: Google reportedly spent more than ten million dollars and ten months to produce just over a hundred thousand data samples.

The industry sees a more practical approach: use internet video and synthetic data to reach high accuracy first, then apply reinforcement learning with real data, as sketched below.
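A minimal sketch of that two-stage recipe follows; the `policy` methods and data objects are placeholders standing in for whatever models and loaders a team actually uses:

```python
from itertools import chain

def train_policy(policy, web_video_batches, synthetic_batches, real_rollouts):
    """Two-stage training: supervised pretraining on cheap, abundant data,
    then reinforcement learning on scarce, expensive real-robot data."""
    # Stage 1: imitation-style updates over web video and synthetic samples.
    for batch in chain(web_video_batches, synthetic_batches):
        policy.supervised_update(batch)            # placeholder method
    # Stage 2: RL fine-tuning driven by real-world task outcomes.
    for episode in real_rollouts:
        reward = episode.success()                 # placeholder outcome signal
        policy.reinforce_update(episode, reward)   # placeholder RL step
    return policy
```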

Over 80% of the information humans acquire comes through vision. SenseTime, rooted in machine vision, has accumulated a deep stack of technologies for processing visual information.

At the same time, SenseTime is advancing fusion-modal data synthesis and strengthened training on fusion tasks. In the pre-training stage, for example, it uses not only massive naturally occurring interleaved image-text data but also synthesizes large volumes of fusion-modal data through inverse rendering and semantics-conditioned image generation. In the post-training stage, it constructs numerous cross-modal tasks covering video interaction, multimodal document analysis, urban scene understanding, and in-vehicle scene understanding.

Additionally, beyond accumulating foundational capabilities in compute, data, and models, SenseTime has invested consistently in the ecosystem, backing more than a dozen embodied intelligence companies through its private equity fund Guoxiang Capital, including Yinhe General Robotics, Zhongqing Robotics, Taihu Robotics, and Luming Robotics. These companies span the entire embodied intelligence industry chain: Zhongqing Robotics focuses on robot bodies and motion control, while Taihu Robotics works on joint modules.

This ecological density allows SenseTime to stay closer to industrial needs and understand trends better than other large model vendors. Simultaneously, these invested enterprises provide SenseTime with abundant landing scenarios and real data, accelerating industrial implementation.

It's evident that from compute power infrastructure to data and ecology, SenseTime is quietly placing bets and investing in all facets required for embodied intelligence.

SenseTime's upcoming embodied intelligence "brain" launch comes at a critical 2025 inflection point, as the industry moves from concept validation to early productization and platformization. The fusion of large models and robots is accelerating the shift from the laboratory to the real world, and the accelerated moves of leading technology companies and research institutions have ignited a wave of competition in embodied intelligence.

The combined push of policy and capital is hard to overstate: embodied intelligence was highlighted for the first time in the 2025 government work report as an emerging industry, and it is steadily permeating sectors such as manufacturing, unmanned retail, hospitality, and healthcare. Remarkably, in the first half of this year alone, financing in embodied intelligence and related fields surpassed 20 billion yuan across 130 funding events, far exceeding the total for all of 2024. The "golden age" of embodied intelligence has quietly dawned.

Given SenseTime's technical layout for the embodied intelligence "brain" and its closed-loop capabilities across compute, data, and ecosystem, its strategy in this field is not a tentative foray but a natural extension of its technological heritage, and it may well become the lever that propels SenseTime's second growth curve.

Taken together, these signals make clear that SenseTime's moves in embodied intelligence not only redefine its capability boundaries but also stake out an early position in the next generation of intelligent forms. As robots and smart devices evolve into "embodied intelligent agents with brains," embodied intelligence could well be the linchpin that secures SenseTime's future.
