November 17, 2025
On September 15, Unitree open-sourced UnifoLM-WMA-0, a world-model-action architecture that spans multiple robot embodiments. It is designed for general-purpose robot learning and is built around a world model that can understand the physical interactions between robots and their environments.
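This world model provides two key functions: it serves as a simulation engine that generates synthetic data for robot learning, and it enhances the policy by predicting future interactions with the environment to improve decision-making.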
The team also demonstrated the model deployed on real robots:
According to the official announcement, UnifoLM-WMA-0 belongs to UnifoLM, Unitree's family of unified large models for robots, and is designed for general-purpose robot learning across different robot forms.
The training code, inference code, and model checkpoints for UnifoLM-WMA-0 are already public, and the GitHub repository has attracted more than 100 stars.
The Unitree team also described how UnifoLM-WMA-0 was trained.
First, the team fine-tuned a video generation model on the Open-X dataset to adapt its generation capabilities to robot manipulation scenarios. Given an image and a text instruction, the model generates a video of the future actions described by the instruction.
Below are the fine-tuned model's generation results on the test set:
Next, the team designed a policy architecture built on the world model that supports two operational modes: a decision-making mode and a simulation mode.
In decision-making mode, the team post-trained UnifoLM-WMA separately on each downstream task dataset. The results after post-training are shown below:
The team has also released five open-source datasets for model training. In simulation mode, test results show that the model can perform interactive, controllable generation conditioned on the "current image" and a given number of "future robot actions".
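To make the simulation-mode idea more concrete, here is a minimal toy sketch, assuming a world model can be reduced to a function that maps the current frame and one action to the next frame. Everything in it (`ToyWorldModel`, `simulate`, `additive_dynamics`, and the list-of-floats "frames") is hypothetical and is not part of the UnifoLM-WMA-0 codebase; the real model generates video with a fine-tuned video generation model. The sketch only illustrates the control flow of action-conditioned rollout: start from the current image and feed in a chosen number of future actions, one step at a time.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-ins: a real system would use image tensors and
# robot action vectors rather than short lists of floats.
Frame = List[float]
Action = List[float]


@dataclass
class ToyWorldModel:
    """Hypothetical action-conditioned world model (not Unitree's API).

    step_fn predicts the next frame from the current frame and one action.
    """
    step_fn: Callable[[Frame, Action], Frame]

    def simulate(self, current: Frame, actions: Sequence[Action]) -> List[Frame]:
        """Simulation-mode sketch: roll forward autoregressively, feeding
        each predicted frame back in together with the next action."""
        frames: List[Frame] = []
        frame = current
        for action in actions:
            frame = self.step_fn(frame, action)
            frames.append(frame)
        return frames


def additive_dynamics(frame: Frame, action: Action) -> Frame:
    # Toy dynamics (assumption): each action shifts the "pixels" element-wise.
    return [f + a for f, a in zip(frame, action)]


model = ToyWorldModel(step_fn=additive_dynamics)
rollout = model.simulate([0.0, 0.0], [[1.0, 0.5], [1.0, 0.5], [0.0, -1.0]])
print(rollout)  # [[1.0, 0.5], [2.0, 1.0], [2.0, 0.0]]
```

The number of predicted frames equals the number of actions supplied, which mirrors the idea of generating a controllable rollout from the current image plus a specified number of future actions. Swapping `additive_dynamics` for a learned model is where an actual world model would come in.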
A side-by-side comparison of the generated results and the original video is shown below:
The Road to General-Purpose Robots Is a Long One
Unitree says the "world model-action" architecture will be fully open-sourced and continuously updated, with the aim of advancing embodied intelligence and bringing general-purpose robots closer to reality.
Unitree is best known for its humanoid robot hardware, so its software work has also drawn considerable attention. Founder and CEO Wang Xingxing has said the company is cautious about investing in embodied intelligence model R&D: although Unitree has grown substantially, its investment is still small compared with that of large AI companies.
Wang Xingxing noted that robot hardware today is generally "good enough to use, but not yet good enough," and that continued improvement is needed for wider adoption, lower costs, and higher reliability. In his view, developing embodied intelligence models is now the central task, and these models are still far from mature and fall short of industry expectations.
On training data, he pointed out that large language models improve rapidly when given large amounts of high-quality data, whereas robotics faces a harder problem: models must be aligned with physical hardware, which places higher demands on the AI models themselves.
He stressed that breakthroughs in embodied intelligence are not simply a contest of resources and funding; experience has shown that small and medium-sized teams can also achieve pioneering results.