ExGRPO Framework: Experience-Driven Learning, Pioneering a New Era in Reasoning

November 14, 2025

While the prevailing training paradigm for AI models still revolves around 'solve problems, get a score,' a collaborative research team from the Shanghai AI Laboratory, the University of Macau, Nanjing University, and The Chinese University of Hong Kong argues that training is more than problem-solving: it also involves reviewing, revisiting, and internalizing what has been learned.

Their recent paper, 'ExGRPO: Learning to Reason from Experience,' is the first to systematically lay out the pivotal role of experience management in reasoning training for large models.

Compared with traditional on-policy RLVR (Reinforcement Learning with Verifiable Rewards) methods, ExGRPO markedly improves the model's ability to tackle complex reasoning problems.

Let's now delve into the underlying logic, advantages, and insights of the ExGRPO framework.

Over the past several years, to strengthen large language models' capabilities in mathematical reasoning, logical reasoning, and complex task solving, the research community has widely adopted Reinforcement Learning with Verifiable Rewards (RLVR).

However, in RLVR training, each reasoning trajectory (rollout) the model generates is scored, used for a single update, and then discarded.

On one hand, generating these trajectories is often expensive. On the other, these valuable trajectories are used only once and then thrown away, much like a student who never reviews or archives the problems they have solved.
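To make this single-use pattern concrete, here is a minimal toy sketch of an on-policy RLVR step; the policy, verifier, and problem format are invented stand-ins for illustration, not the paper's setup:

```python
import random

# Toy stand-ins: a "policy" that guesses answers and a verifier that scores them.
# These are illustrative placeholders, not the paper's implementation.
def generate_rollouts(policy, problem, n):
    return [policy(problem) for _ in range(n)]

def verify(problem, answer):
    return 1.0 if answer == problem["target"] else 0.0

def on_policy_rlvr_step(policy, problems, num_rollouts=4):
    batch = []
    for problem in problems:
        rollouts = generate_rollouts(policy, problem, num_rollouts)  # costly sampling
        rewards = [verify(problem, r) for r in rollouts]             # verifiable 0/1 reward
        batch.append((problem, rollouts, rewards))
    # ... a single gradient update would consume `batch` here ...
    # and then the rollouts are thrown away: nothing is archived for reuse.
    return batch

problems = [{"question": "2 + 3", "target": 5}]

def guessing_policy(problem):
    return random.randint(0, 9)   # stand-in "model"

on_policy_rlvr_step(guessing_policy, problems)
```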

In essence, the traditional training process grapples with three major shortcomings:

Wasted experience — Successful reasoning trajectories are simply discarded and forgotten.

Inefficiency — Practicing problems without reviewing leads to sluggish progress.

Unstable training — The model may slip into 'solving problems without understanding them.'

In other words, as 'who has more data' and 'who trains longer' become the bottlenecks, systematically enabling models to review and reuse key experiences could be the breakthrough.

Against this backdrop, the study posits that not all experiences warrant revisiting; the crux lies in discerning 'what types of experiences' and how to review them.

The proposed ExGRPO (Experiential Group Relative Policy Optimization) is a framework for experience management and policy optimization in reasoning training for large models.

Its core revolves around two dimensions:

Experience management: Identifying, storing, and filtering high-quality experiences.

Hybrid experience optimization: Integrating selected experiences with the exploration of new problems for training purposes.

In ExGRPO, experience management unfolds in three stages (a combined code sketch follows the third stage):

Experience collection: When the model solves a problem correctly, the trajectory is added to an experience replay pool, somewhat like keeping a wrong-answer notebook.

Experience categorization and storage: Based on the model's recent performance, each experience is dynamically labeled as 'easy,' 'medium,' or 'hard.' At the same time, if the model has solved a problem correctly several times in a row, that problem is retired from the pool so the model does not keep rehearsing material it has already mastered.

Experience filtering: Experiences are selected at two levels: problem-level filtering (which problems are worth replaying) and trajectory-level filtering (which successful trajectory of a problem to replay).
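Putting the three stages together, here is a minimal Python sketch of an experience pool. The bucket thresholds, the 'retire after three straight successes' rule, and the confidence-based trajectory choice are assumptions made for illustration; the paper's exact criteria and hyperparameters may differ.

```python
import random
from collections import defaultdict

class ExperiencePool:
    """Illustrative experience pool covering collection, categorization, and filtering."""

    def __init__(self):
        self.trajectories = defaultdict(list)          # problem_id -> successful rollouts
        self.success_rate = {}                         # problem_id -> recent success rate
        self.consecutive_successes = defaultdict(int)

    # Stage 1: collection -- archive a rollout only if it actually solved the problem.
    def add(self, problem_id, rollout, reward, success_rate):
        self.success_rate[problem_id] = success_rate
        if reward > 0:
            self.trajectories[problem_id].append(rollout)
            self.consecutive_successes[problem_id] += 1
        else:
            self.consecutive_successes[problem_id] = 0
        # Retirement: drop problems the model has clearly mastered (threshold is assumed).
        if self.consecutive_successes[problem_id] >= 3:
            self.trajectories.pop(problem_id, None)

    # Stage 2: categorization -- label each stored problem by recent difficulty.
    def bucket(self, problem_id):
        rate = self.success_rate.get(problem_id, 0.0)
        if rate >= 0.75:
            return "easy"
        if rate >= 0.25:
            return "medium"
        return "hard"

    # Stage 3: filtering -- prefer medium-difficulty problems, and for each one keep
    # the trajectory the model is most confident about (a stand-in selection rule).
    def sample(self, k, confidence):
        candidates = [pid for pid in self.trajectories if self.bucket(pid) == "medium"]
        chosen = random.sample(candidates, min(k, len(candidates)))
        return [(pid, max(self.trajectories[pid], key=confidence)) for pid in chosen]
```

In a real training loop, the success rate and confidence scores would come from the model's own rollouts and token probabilities; here they are simply passed in as arguments.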

ExGRPO then adopts a mixed-policy training objective. In each training round, part of the minibatch is devoted to exploring entirely new problems, while the rest repeatedly learns from selected trajectories drawn from the experience pool.
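A minimal sketch of how a minibatch might be composed under this mixed objective, reusing the hypothetical ExperiencePool above; the 50/50 split is an illustrative default, not necessarily the paper's setting:

```python
import random

def build_mixed_minibatch(new_problems, pool, batch_size, confidence, replay_ratio=0.5):
    """Split a minibatch between fresh exploration and replayed experience.

    `pool` is the illustrative ExperiencePool above; `replay_ratio` is an assumed default.
    """
    n_replay = int(batch_size * replay_ratio)
    replayed = pool.sample(n_replay, confidence)             # selected past successes
    n_fresh = batch_size - len(replayed)                     # top up with new problems
    fresh = random.sample(new_problems, min(n_fresh, len(new_problems)))
    return fresh, replayed
```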

Furthermore, a policy shaping mechanism is introduced so that heavy reviewing does not make the model overly conservative and erode its ability to explore.
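The article does not spell out the shaping function. As one generic way to keep replayed, off-policy trajectories from dominating the update (an assumption for illustration, not the paper's exact mechanism), the importance weight on replayed tokens can be clipped:

```python
import torch

def clipped_replay_weight(logp_current, logp_old, max_ratio=2.0):
    """Clipped importance weight for a replayed trajectory (illustrative only).

    logp_current / logp_old: log-probabilities of the replayed tokens under the
    current policy and under the policy that originally generated them.
    """
    ratio = torch.exp(logp_current - logp_old)   # pi_theta / pi_old
    return torch.clamp(ratio, max=max_ratio)     # cap so stale successes cannot dominate
```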

Across models from 1.5B to 8B parameters and diverse architectures (such as Qwen and Llama), ExGRPO achieves average gains of roughly +3.5 percentage points on in-distribution tasks and +7.6 percentage points on out-of-distribution tasks over traditional on-policy RL methods.

Moreover, overall training stability and efficiency have also witnessed enhancements.

However, ExGRPO confronts three major challenges.

Firstly, in larger-scale scenarios with a more diverse array of task types, will experience identification maintain its accuracy?

Secondly, establishing and maintaining an experience pool, categorizing experiences, and filtering trajectories all necessitate additional computational resources and engineering support, raising concerns regarding management costs.

Thirdly, the paper primarily conducts tests on mathematical and general reasoning benchmarks. Whether this experience reuse mechanism will prove equally effective in tasks like language generation, dialogue, and cross-modal tasks remains to be ascertained.

Nevertheless, for model training systems, ExGRPO offers a pragmatic 'wrong-answer notebook' approach: not merely solving problems but also reviewing them; not just practicing problems but optimizing experiences.

For developers, incorporating similar experience filtering mechanisms into actual model training—archiving, labeling, and reusing successful model trajectories instead of simply discarding them—could be a viable consideration.
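As a starting point, a developer could log every verified-correct rollout with enough metadata to label and replay it later; the field names and JSONL format below are just one possible choice, not a prescribed schema:

```python
import json
import time

def archive_trajectory(path, problem_id, prompt, completion, reward, model_version):
    """Append a verified-correct rollout to a JSONL archive for later labeling and reuse."""
    if reward <= 0:                      # keep only successes, in the spirit of ExGRPO
        return
    record = {
        "problem_id": problem_id,
        "prompt": prompt,
        "completion": completion,
        "reward": reward,
        "model_version": model_version,
        "timestamp": time.time(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```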

For industrial applications, when models necessitate long-term service, continuous learning, and rapid iteration, experience mechanisms assume paramount importance. They signify that the model not only 'knows how to do' but also 'knows how to do better.'

For future research, experience categorization, trajectory filtering, and reuse mechanisms represent a promising avenue. They may also integrate seamlessly with domains like automated experience selection, meta-learning, and continual online learning.

Reference: https://arxiv.org/pdf/2510.02245

