What's the Use of Having a General-Purpose Motion 'Cerebellum' for Humanoid Robots?

06/22 2026

A few days ago, Galaxy General released a model called AstraBrain-WBC 0.5, equipping humanoid robots with a Transformer controller capable of zero-shot learning of new movements.

The accompanying paper, titled Humanoid-GPT, has been accepted to CVPR 2026, and both code and data are open-sourced.

Much like GPT in the text domain, it aims to prove one thing: with enough data and a Transformer model, scaling laws can hold in the physical world just as they do for text.

In the demo video, a Unitree G1 robot dances in sync with the human movements shown on screen. Its motions are smooth and coherent, with no pre-programming and no fine-tuning for specific movements. The paper's numbers are solid: a zero-shot motion tracking success rate of 92.58% and an inference latency of 0.39 milliseconds.

01

What is a robot's 'cerebellum',

and how does it differ from the 'brain'?

Galaxy General divides robot intelligence into three layers.

The brain handles perception and task planning, knowing that there is a box in front that needs to be moved to Area B. Neural control manages fine-grained manipulation at the end effector, such as how to pinch a screw with fingers. The cerebellum, sandwiched in between, manages whole-body motion coordination—where the center of gravity is, which leg to move first, how the arms and torso coordinate, and at what speed.

This division of labor is not an invention of Galaxy General; the human brain is structured this way.

The cerebral cortex handles planning, the pons transmits instructions downward, and the cerebellum coordinates and executes. The robotics industry has long struggled most with the cerebellum.

The brain's perceptual abilities are rapidly improving each year with visual large models, and the dexterity of end effectors is also becoming increasingly refined. However, for the layer in between—enabling a bipedal humanoid robot to stand stably, walk, and perform specified movements in any pose—there has been no universal solution.

The previous approach was to train a separate controller for each movement. Teaching a robot to walk required collecting a dedicated batch of walking motion capture data, labeling joint angles, and training a policy using reinforcement learning. Teaching it to run required training a second controller.

For each type of movement, a separate controller was trained, and each controller would fail when the scenario changed. A robot might learn a proficient walking strategy, but it might not work on a slope.

AstraBrain-WBC 0.5 aims to try a different path. Can a single model handle all movements, just as a GPT model processes various text tasks?

The Galaxy General team gathered almost all publicly available human motion capture datasets, including AMASS, LAFAN1, Motion-X++, PHUMA, and MotionMillion, and added over a thousand hours of their own motion data. After merging, filtering, and augmenting, they obtained 2 billion frames of motion data retargeted to the Unitree G1's joint space.

Previous similar studies had training sets on the order of 100 million frames. NVIDIA's SONIC, for example, reached around 100 million frames. Galaxy General's dataset is about 20 times larger.

02

Transformer Catches the Plate

That MLP Couldn't Hold

Previously, the mainstream architecture for humanoid robot motion tracking was the MLP (multilayer perceptron). An MLP has an inherent limitation in motion control: it can only 'see' a state slice at a single moment.

The relationship between steps and center of gravity spans dozens or even hundreds of frames, and MLP cannot naturally model such long-range dependencies, relying instead on temporary solutions like 'concatenating historical data into the input vector.'
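
The 'concatenating historical data' workaround can be sketched roughly as below. This is only an illustration, not the paper's code: the helper name `make_history_input`, the 2-value frames, and the window of 4 are all assumptions chosen for the example. The point is that the input size is fixed at `window * frame_dim`, so anything older than the window is invisible to the MLP.

```python
from collections import deque


def make_history_input(history, frame_dim, window):
    """Flatten the last `window` frames into one fixed-size input vector.

    Missing frames at the start of an episode are zero-padded.
    (Hypothetical helper for illustration only.)
    """
    frames = list(history)[-window:]
    padding = [[0.0] * frame_dim for _ in range(window - len(frames))]
    flat = []
    for frame in padding + frames:
        flat.extend(frame)
    return flat


# Keep only the 4 most recent frames; older frames fall out of the buffer.
history = deque(maxlen=4)
for t in range(6):
    history.append([float(t), float(t) * 0.1])  # toy 2-value "state" per frame

x = make_history_input(history, frame_dim=2, window=4)
print(len(x))  # input width is always window * frame_dim
```

Widening the window makes the input vector, and the first MLP layer, grow linearly, which is one reason this approach is usually treated as a stopgap.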

When training MLP on multimodal, highly dynamic motion data, marginal gains diminish after reaching a certain scale. The self-attention mechanism of Transformers is different.

At each position in the sequence, the model can 'look back' at any earlier frame within its context window, capturing connections such as the one between 'the current movement and a pose 32 frames ago.'
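
A minimal sketch of this 'look back' behavior, assuming a single attention head and plain scaled dot-product attention with a causal mask. It is a generic textbook formulation in pure Python, not the model's actual implementation, and the function names and toy frames are made up for the example.

```python
import math


def causal_attention_weights(queries, keys):
    """Row t holds softmax weights over frames 0..t; future frames get 0."""
    dim = len(queries[0])
    weights = []
    for t, q in enumerate(queries):
        # Score the current frame against itself and every earlier frame.
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(keys) - t - 1)
        weights.append(row)
    return weights


def attend(weights, values):
    """Mix value vectors using the attention weights of each position."""
    dims = range(len(values[0]))
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in dims]
            for row in weights]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy per-frame features
weights = causal_attention_weights(frames, frames)
mixed = attend(weights, frames)
```

Every row of `weights` is computed independently of the others, which is why all positions of a sequence can be processed in one pass during training.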

For humanoid robots, this cross-frame coherence directly determines whether walking looks human-like and whether dancing will suddenly freeze.

The Galaxy General team conducted clean ablation experiments.

With the same 2 billion frames of training data, MLP's loss curve plateaued after about 50K steps, while the Transformer continued to decline after 200K steps. The Transformer eventually stabilized at around 0.06, while MLP stopped at around 0.08. This 0.02 gap manifests in real robots as whether the walking gait appears human-like.

During training, an MLP can only process one time step at a time, requiring N passes to cover a sequence of N steps. The Transformer can process all positions in the entire sequence in a single forward pass, creating an order-of-magnitude gap in training throughput at the 2 billion frame scale.

If MLP were still used, the same 2 billion frames of data would require several times more computational resources and time to complete a single training run.

For engineering deployment, the team used TensorRT compilation and C++ pipeline optimizations, achieving an inference latency of 0.39 milliseconds within a 50Hz control loop. Compared to the TWIST system's 2.79 milliseconds, this is about 7 times faster. Although the model is larger, it runs faster, thanks to specialized kernel optimizations for causal attention and fused MLP operators.
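
To put the latency in context, the short calculation below compares each quoted inference time with the time available in one control period. It uses only figures from the article; measuring TWIST against the same 50Hz period is an assumption made here purely for comparison.

```python
CONTROL_RATE_HZ = 50                 # control loop rate quoted in the article
PERIOD_MS = 1000 / CONTROL_RATE_HZ   # time available per control step: 20 ms


def budget_share(latency_ms: float) -> float:
    """Fraction of one control period consumed by a single inference call."""
    return latency_ms / PERIOD_MS


for name, latency_ms in [("AstraBrain-WBC 0.5", 0.39), ("TWIST", 2.79)]:
    share = budget_share(latency_ms)
    print(f"{name}: {latency_ms} ms is {share:.1%} of a {PERIOD_MS:.0f} ms period")
```

At roughly 2% of the period, inference leaves most of each control step free for sensing, communication, and safety checks.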

03

300+ Experts Handed Over to One Model

Directly training a single Transformer end-to-end on 2 billion frames of raw data is impractical. The team first trained 384 'motion experts' with the PPO reinforcement learning algorithm across approximately 300 movement families.

Each expert was responsible for only one style of movement—the walking expert didn't handle dancing, and the dancing expert didn't handle sprinting. Each expert achieved high fidelity in its own style.

Then, using the DAgger distillation framework, a unified Transformer generalist model learned simultaneously from all 384 experts.
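
The DAgger idea can be sketched with a toy example: the student drives the rollout, the matching expert labels every state the student visits, and the student is refit on the growing dataset. Everything concrete here is an assumption for illustration and not the team's training code: a scalar state, two hypothetical experts with fixed gains, a two-parameter linear student, and rollouts driven purely by the student with no expert mixing.

```python
import random

EXPERT_GAINS = {0: 0.5, 1: 0.9}  # hypothetical per-cluster expert controllers


def expert_action(cluster, error):
    """Specialist policy: each expert only knows its own motion cluster."""
    return EXPERT_GAINS[cluster] * error


def student_action(w, cluster, error):
    """One generalist policy, conditioned on which cluster is being tracked."""
    return (w[0] + w[1] * cluster) * error


def fit_least_squares(dataset):
    """Fit w on features [error, cluster * error] via 2x2 normal equations."""
    a00 = a01 = a11 = b0 = b1 = 0.0
    for cluster, error, action in dataset:
        f0, f1 = error, cluster * error
        a00 += f0 * f0
        a01 += f0 * f1
        a11 += f1 * f1
        b0 += f0 * action
        b1 += f1 * action
    det = a00 * a11 - a01 * a01
    return [(a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det]


def dagger(iterations=5, rollouts=20, horizon=10, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]   # untrained student
    dataset = []     # aggregated across all iterations
    for _ in range(iterations):
        for _ in range(rollouts):
            cluster = rng.choice([0, 1])
            state, target = rng.uniform(-1, 1), rng.uniform(-1, 1)
            for _ in range(horizon):
                error = target - state
                # The expert labels the state the STUDENT actually reached.
                dataset.append((cluster, error, expert_action(cluster, error)))
                # The student's own action drives the rollout forward.
                state += student_action(w, cluster, error)
        w = fit_least_squares(dataset)
    return w


w = dagger()
print(w)  # student gains per cluster: w[0] for cluster 0, w[0] + w[1] for cluster 1
```

Labeling the student's own states, rather than states from expert demonstrations, is what distinguishes DAgger from plain behavior cloning: the student gets supervision on exactly the situations its own mistakes lead to.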

The knowledge of the 384 experts was distilled into an 80.4 million-parameter model. After distillation, only this single large model was needed for deployment.

The paper's ablation experiments showed that the number of clusters could neither be too few nor too many.

With 128 clusters, each expert had too broad a responsibility, reducing the training quality of individual experts and weakening the resulting generalist. With 1024 clusters, supervision signals between adjacent experts began to interfere, leaving the student model unsure of whom to follow. Around 384 clusters represented the optimal trade-off between diversity, quality, and cost at the current data scale.

The entire training process consumed approximately 15,000 GPU hours. 75% was spent on expert training using RTX 4090s, and 25% on Transformer distillation using H100s. This cost is reasonable for academic research and not excessively expensive for commercial deployment.
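
As a quick breakdown of that budget, the snippet below splits the total using only the figures quoted above. The per-expert average is a derived number that assumes the expert-training hours are spread evenly across the 384 experts, which the article does not state.

```python
TOTAL_GPU_HOURS = 15_000   # approximate total quoted in the article
NUM_EXPERTS = 384

expert_hours = TOTAL_GPU_HOURS * 0.75    # expert training on RTX 4090s
distill_hours = TOTAL_GPU_HOURS * 0.25   # Transformer distillation on H100s
hours_per_expert = expert_hours / NUM_EXPERTS  # assumes an even split

print(f"expert training: {expert_hours:.0f} GPU hours")
print(f"distillation:    {distill_hours:.0f} GPU hours")
print(f"average per expert: {hours_per_expert:.1f} GPU hours")
```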

04

Does It Actually Work?

AstraBrain-WBC 0.5 answers three questions.

◎ First, motion data can be scaled to 2 billion frames.

◎ Second, the Transformer architecture can handle this scale of data and continue learning from it.

◎ Third, the distilled model can run in real-time on actual robots.

That said, three caveats remain.

● First, it is a pure motion tracking model.

As clearly stated in the paper, the next direction is to integrate with vision-language-action (VLA) models, incorporating multimodal information from vision, touch, and language.

Current AstraBrain-WBC 0.5 only understands joint angles. It doesn't know there's a box on the ground or a cup on the table. It follows motion sequences fed to it but isn't told where to go, what to pick up, or how to do it. It's a cerebellum, not a brain.

● Second, the demo environment is an open space with a flat floor.

There's a significant validation gap between high-dynamic movements in such an environment and real-world factory settings filled with pallets and narrow walkways. The paper does not provide test data in unstructured environments.

● Third, and the industry's biggest concern:

Galaxy General's current primary commercial direction is robotic warehousing for instant retail, using wheeled bases with dual-arm manipulation. Wheeled bases don't need to do backflips or dance. There is currently no direct quantitative data on the practical commercial value of this 'cerebellum GPT' for wheeled robots.

Summary

The Scaling Law for robot motion control has been validated at the 2 billion frame scale.

The significance of this validation for the industry is methodological: previously, it was thought that 'robot motion data is hard to scale, so Scaling Law might not work,' but now someone has proven it does. With Transformers and sufficiently large data, a general-purpose cerebellum can be created.
