π*0.6: A VLA Model with the Ability to Learn from Experience

12/01/2025

Today, Physical Intelligence (PI), a U.S. startup specializing in embodied intelligence, introduced its newest robot foundation model, π*0.6.

According to the company, getting a robot to complete a task about half the time has become relatively straightforward over the past year. Achieving consistent success on every attempt, however, remains very hard, let alone reaching human-level performance in real-world settings.

Real-world robot tasks require a system that is both reliable and fast. PI has devised a method called Recap (Reinforcement Learning with Experience and Corrections via Advantage-Conditioned Policies), which comprises three key steps:

Training the robot from demonstrations;

Correcting its mistakes through expert interventions;

Letting it learn and improve from its own autonomous experience.

Using Recap, the latest iteration of PI's Vision-Language-Action (VLA) model, π*0.6, is trained to carry out complex tasks efficiently, including brewing espresso, assembling boxes, and folding a variety of garments.

When π*0.6 is trained with autonomous experiences utilizing Recap, the throughput of some of the most demanding tasks can more than double, and failure rates can be reduced by a factor of two or more.

This capability gives π*0.6 the robustness needed for practical applications. It can brew espresso continuously for an entire day, fold unfamiliar clothes in a new home for hours on end without interruption, and assemble cardboard boxes for actual packaging in factories.

Why do VLA models trained solely through imitation learning struggle to achieve sustained success, when supervised learning works so well for large language models (LLMs) and other machine learning systems?

Because robots interact with the real physical world, even a minor error, such as slightly misplacing an object, can put the robot in a situation that differs from anything in the training data. In such unfamiliar situations the robot is more likely to make even larger errors, and those errors accumulate.

Small errors can be corrected, but accumulated errors eventually lead to failure. For AI systems that produce a single static output, such as LLMs, this is not a major concern; for a VLA acting in a closed loop, it means tasks cannot be completed reliably.
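To make the compounding concrete, here is a toy calculation (with made-up numbers, not PI's measurements): even a policy that acts correctly at 99% of steps rarely finishes a long task if its errors are never corrected.

```python
# Illustrative only: how small per-step error rates compound over a long task.
# The per-step success probability and horizons below are invented for the example.

def episode_success_rate(per_step_success: float, horizon: int) -> float:
    """Probability of finishing an episode with no unrecovered error,
    assuming errors at each step are independent and never corrected."""
    return per_step_success ** horizon

for horizon in (50, 150, 300):
    print(f"{horizon:4d} steps: {episode_success_rate(0.99, horizon):.2%}")

# Expected output:
#   50 steps: 60.50%
#  150 steps: 22.15%
#  300 steps: 4.90%
```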

This issue can be resolved by utilizing additional data derived from the VLA's own behavior. VLAs can be trained to correct errors they actually make in the real world, and accumulated errors can be mitigated by allowing the policy (i.e., the VLA) to practice repeatedly.

Recap offers two methods for obtaining effective training signals from experiential data:

Coaching with corrections, where experts show the robot how to recover from errors or perform a task more effectively;

Reinforcement learning, where the robot judges for itself which behaviors are better or worse based on the overall outcome of an episode, and iteratively learns to repeat the good behaviors while avoiding the bad ones.

For coaching to be effective, expert teleoperators provide corrections that show the robot how to recover from the errors it makes during real operation. Because these interventions target the exact situations the policy actually encounters on the robot, they directly address the problem of error accumulation.
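As a rough sketch of what intervention-based data collection could look like (the `env`, `policy`, and `expert` interfaces below are hypothetical stand-ins, not PI's actual tooling), the idea is to roll out the current policy and record any expert takeovers as correction data:

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class EpisodeLog:
    # (observation, action, was_expert_correction) triples
    steps: List[Tuple[Any, Any, bool]] = field(default_factory=list)

def collect_with_corrections(env, policy, expert, max_steps: int = 500) -> EpisodeLog:
    """Roll out the current policy while an expert teleoperator can take over.

    Hypothetical interfaces:
      env.reset() -> obs, env.step(action) -> (obs, done)
      policy(obs) -> action
      expert.wants_control(obs) -> bool, expert.action(obs) -> action
    """
    log = EpisodeLog()
    obs = env.reset()
    for _ in range(max_steps):
        if expert.wants_control(obs):
            action, corrected = expert.action(obs), True   # expert correction: supervise on this
        else:
            action, corrected = policy(obs), False         # ordinary on-policy step
        log.steps.append((obs, action, corrected))
        obs, done = env.step(action)
        if done:
            break
    return log
```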

However, corrections alone are far from adequate. Their value as supervision depends on a person intervening at the right moment and actually providing a high-quality correction. To complete tasks quickly, reliably, and consistently, robots also need to learn autonomously.

The core challenge lies in credit assignment—determining which actions executed by the robot led to favorable outcomes and which led to unfavorable ones.

Credit assignment is a pivotal challenge in reinforcement learning. Recap addresses it by learning a value function that predicts how much better or worse a given situation is relative to others.

For instance, in a game like chess, the agent receives a reward after winning, and the value function predicts the agent's probability of winning based on the current board state. If the value function can be learned from the robot's experiences, good and bad behaviors can be judged by observing changes in the value function.
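As a minimal sketch of this idea, assuming a sparse outcome reward and placeholder state features, one can regress a value function onto observed returns and then score each action by how much the predicted outcome improves after taking it. This is a generic value/advantage estimate, not PI's exact training recipe:

```python
import torch
import torch.nn as nn

# Placeholder value network: maps a 64-dim state feature vector to a predicted
# discounted outcome (roughly, how likely the episode is to end in success).
value_net = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-4)

def fit_value(states: torch.Tensor, returns: torch.Tensor, epochs: int = 10) -> None:
    """Regress V(s) onto observed discounted returns (the outcome of each episode)."""
    for _ in range(epochs):
        loss = nn.functional.mse_loss(value_net(states).squeeze(-1), returns)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def advantages(states: torch.Tensor, next_states: torch.Tensor,
               rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Score each logged transition by the change in predicted outcome:
    A(s, a) ≈ r + gamma * V(s') - V(s)."""
    with torch.no_grad():
        v = value_net(states).squeeze(-1)
        v_next = value_net(next_states).squeeze(-1)
    return rewards + gamma * v_next - v
```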

(Figure in the original post: value-function predictions over the course of a clothes-folding episode.)

At execution time, the advantage-conditioned VLA is simply instructed to take higher-advantage actions, yielding a policy that can outperform the data it was trained on.
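One concrete (and assumed) reading of "advantage-conditioned": during training the policy receives an extra input indicating whether the logged action turned out better or worse than expected, and at execution time that input is clamped to "better". The toy module below illustrates the conditioning mechanism only; it is not the π*0.6 architecture:

```python
import torch
import torch.nn as nn

class AdvantageConditionedPolicy(nn.Module):
    """Toy policy head: takes an observation embedding plus a binary
    'good action' indicator and outputs an action vector."""

    def __init__(self, obs_dim: int = 64, act_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))

    def forward(self, obs: torch.Tensor, good: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, good], dim=-1))

policy = AdvantageConditionedPolicy()

# Training: the indicator comes from the estimated advantage of the logged action.
obs = torch.randn(32, 64)                  # placeholder observation embeddings
adv = torch.randn(32)                      # advantages from the value function
good = (adv > 0).float().unsqueeze(-1)     # 1 = better than expected, 0 = worse
predicted_actions = policy(obs, good)

# Execution: always ask the model for its higher-advantage behavior.
deploy_actions = policy(obs, torch.ones(32, 1))
```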

The team explored three application scenarios: brewing espresso, folding various types of clothing, and assembling packaging boxes.

The first phase of Recap pre-trains and fine-tunes the π*0.6 model with offline reinforcement learning (RL); the model is then trained further with RL on additional data collected on the robot.
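Putting the pieces together, the offline-then-online flow described here might be organized like the outline below; every callable is a hypothetical placeholder naming a stage, not a real PI API:

```python
from typing import Any, Callable, Sequence

def recap_style_training(
    pretrain_offline: Callable[[Sequence[Any]], Any],    # offline RL on demos + corrections
    fit_value: Callable[[Sequence[Any]], Any],           # value function over episode outcomes
    collect_on_robot: Callable[[Any], Sequence[Any]],    # autonomous rollouts (plus corrections)
    label_advantages: Callable[[Sequence[Any], Any], Sequence[Any]],
    finetune: Callable[[Any, Sequence[Any]], Any],       # advantage-conditioned fine-tuning
    offline_data: Sequence[Any],
    num_rounds: int = 5,
) -> Any:
    """Sketch of the two-phase flow: offline pre-training, then repeated
    rounds of on-robot collection, advantage labeling, and fine-tuning."""
    model = pretrain_offline(offline_data)
    value_fn = fit_value(offline_data)
    experience: list = []
    for _ in range(num_rounds):
        episodes = collect_on_robot(model)
        experience.extend(episodes)
        labeled = label_advantages(episodes, value_fn)
        model = finetune(model, labeled)
        value_fn = fit_value(list(offline_data) + experience)
    return model
```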

Notably, significant improvements were observed on some of the most challenging tasks, such as brewing espresso, where throughput more than doubled and the failure rate was cut by more than half after incorporating the robot's own operational experience.

From a qualitative standpoint, the final π*0.6 model, after learning from demonstration data and the robot's real-world experiences, can proficiently master various applications.

Each task presents numerous challenges that make high-throughput autonomous execution difficult. Even for the best current VLA models, each of these stages is challenging, yet π*0.6 can complete them with a success rate exceeding 90%.

The team stated that expert demonstrations are used to define new behaviors, coaching is used to correct mistakes, and autonomous experience is used to refine those behaviors, which may ultimately enable robots to reach performance levels that surpass those of humans.

References:

https://www.pi.website/blog/pistar06#where-are-we-headed
