Researchers have noted that large language models still struggle to prove theorems in formal languages, and that the computational cost of doing so remains substantial.
Recently, the ByteDance Seed team unveiled its next-generation formal mathematical reasoning model, Seed Prover 1.5. The model is trained with large-scale agentic reinforcement learning and pairs this with an efficient Test-Time Scaling (TTS) workflow.
Through extensive tool interactions during reinforcement learning, the model continually accumulates experience, substantially improving both its efficiency and its capability in formal theorem proving.
Moreover, the TTS workflow effectively narrows the gap between natural and formal languages. Compared with existing approaches, Seed Prover 1.5 delivers better performance at lower computational cost, solving 88% of the problems in PutnamBench (undergraduate level), 80% in Fate-H (graduate level), and 33% in Fate-X (doctoral level).
Remarkably, using this system, the Seed team solved 11 out of 12 problems from Putnam 2025 within just 9 hours. The research findings suggest that empirical learning, driven by high-quality formal feedback, holds considerable promise for the future advancement of formal mathematical reasoning.
Unlike earlier agent-based provers, the team introduces a more efficient strategy that incrementally employs multiple tools to build formal proofs step by step. Once a lemma is successfully compiled, it is stored in memory for reuse in subsequent reasoning stages, eliminating the need to re-generate previously validated code.
This incremental caching mechanism uses the context window more efficiently than methods that rely on generating complete proofs, and it offers several benefits (a minimal code sketch follows the list):
Alignment with lemma-style proofs: Works seamlessly with the lemma-style proofs proposed earlier.
Simplification of complexity: Reduces the model's need to generate complete proofs. Instead, the model can concentrate on solving immediate sub-goals.
Context optimization: By caching effective lemmas sequentially, it significantly cuts down on context overhead compared to methods that require iterative regeneration of complete proof histories.
Flexible reasoning strategies: Allows for diverse reasoning tactics, such as eliminating irrelevant intermediate steps or reverting to a specific point to restart the dialogue.
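As a concrete illustration, here is a minimal sketch of what such an incremental lemma cache could look like. Everything in it is an assumption made for illustration: `compile_with_lean`, `LemmaCache`, and the CLI invocation are hypothetical stand-ins, not the paper's implementation.

```python
# Hypothetical sketch of an incremental lemma cache; all names are
# illustrative, not the paper's actual code.
import subprocess
import tempfile
from dataclasses import dataclass, field


def compile_with_lean(source: str) -> bool:
    """Type-check a Lean source string via the `lean` CLI on a temp file
    (a Mathlib-based project would instead go through `lake env lean`)."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    return subprocess.run(["lean", path], capture_output=True).returncode == 0


@dataclass
class LemmaCache:
    """Keeps every lemma the compiler has already accepted."""
    verified: dict[str, str] = field(default_factory=dict)  # statement -> proof

    def prelude(self) -> str:
        # Only verified lemmas are re-injected into the context, so the
        # model never regenerates code that has already compiled.
        return "\n\n".join(self.verified.values())

    def try_add(self, statement: str, proof: str) -> bool:
        # A candidate lemma is cached only if it compiles on top of
        # everything proved so far.
        if compile_with_lean(self.prelude() + "\n\n" + proof):
            self.verified[statement] = proof
            return True
        return False
```

Each successfully compiled lemma thus becomes a compact, permanent part of the context, which is what keeps the overhead low compared with regenerating the full proof history on every attempt.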
The researchers developed various reinforcement learning training tasks, including proving directly from formal statements, proving based on natural language summaries of proofs, and proving using summaries of previous unsuccessful attempts.
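Concretely, these task variants could be realized as different conditioning fields in the training prompt. The sketch below is an assumption about how such prompts might be assembled; the field names and wording are not the paper's templates.

```python
# Illustrative prompt construction for the three RL task variants;
# wording and structure are assumptions, not the paper's templates.
def build_prompt(formal_statement: str,
                 nl_proof_summary: str | None = None,
                 failure_summary: str | None = None) -> str:
    parts = [f"Prove the following Lean statement:\n{formal_statement}"]
    if nl_proof_summary:  # variant 2: condition on an NL proof summary
        parts.append(f"Natural-language proof outline:\n{nl_proof_summary}")
    if failure_summary:   # variant 3: condition on summarized failed attempts
        parts.append(f"Previous failed attempts (summarized):\n{failure_summary}")
    return "\n\n".join(parts)
```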
They also propose training a sketch model, which generates a lemma-style Lean sketch from a formal statement and its corresponding natural-language proof.
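To make "lemma-style Lean sketch" concrete, here is a toy example of our own (not taken from the paper): sub-goals are factored out as named lemmas that the prover agent can verify and cache one by one, and any still-unproven step would be stubbed with `sorry` during search.

```lean
import Mathlib

-- Toy lemma-style sketch (illustrative, not from the paper).
-- Each sub-goal is a named lemma; unproven steps would temporarily
-- read `:= by sorry` until the prover agent discharges them.
lemma step1 (a b : ℤ) : a ^ 2 + 2 * a * b + b ^ 2 = (a + b) ^ 2 := by
  ring

theorem sum_sq_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + 2 * a * b + b ^ 2 := by
  rw [step1]
  exact sq_nonneg (a + b)
```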
To train the sketch model, the team used VAPO with hybrid reward signals: the Lean compiler guarantees that the sketch's structure is well formed, while an LLM acting as a critic serves as a semantic reward model, using long chain-of-thought (Long-CoT) reasoning to generalize better than scalar reward models.
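A hedged sketch of what such a hybrid reward might look like: a hard structural gate from the compiler followed by a semantic score from the critic. Both helper functions below are placeholders, not a published API.

```python
# Hedged sketch of a hybrid reward for sketch-model training; both
# helpers are placeholders rather than a published API.
def compile_with_lean(source: str) -> bool:
    """Placeholder structural check (see the caching sketch above)."""
    raise NotImplementedError


def llm_critic_score(sketch: str, nl_proof: str) -> float:
    """Placeholder for an LLM-as-critic call that reasons with long
    chain-of-thought and returns a faithfulness score in [0, 1]."""
    raise NotImplementedError


def hybrid_reward(sketch: str, nl_proof: str) -> float:
    # The Lean compiler gives a hard, binary structural signal...
    if not compile_with_lean(sketch):
        return 0.0
    # ...and the LLM critic grades semantic faithfulness to the NL proof.
    return llm_critic_score(sketch, nl_proof)
```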
Seed Prover 1.5 integrates three specialized agents (a pipeline sketch follows the list):
Natural Language Prover: An LLM fine-tuned for natural language proofs.
Sketch Model: A trained translation agent that transforms natural language proofs into lemma-style Lean sketches, bridging informal reasoning and formal syntax.
Agentic Lean Prover: A tool-integrated agent that verifies each lemma individually.
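Read together, the three agents suggest a pipeline of roughly the following shape. This is a hedged reconstruction; every function here is an illustrative stand-in, not the system's actual API.

```python
# Hedged reconstruction of how the three agents might compose; every
# function is an illustrative stand-in, not the system's actual API.
def natural_language_prover(statement: str) -> str: ...           # agent 1
def sketch_model(statement: str, nl_proof: str) -> str: ...       # agent 2
def extract_lemmas(sketch: str) -> list[str]: ...
def lean_prover_agent(lemma: str, context: list[str]) -> str | None: ...  # agent 3


def prove_formal(statement: str) -> str | None:
    nl_proof = natural_language_prover(statement)  # informal proof first
    sketch = sketch_model(statement, nl_proof)     # lemma-style Lean sketch
    done: list[str] = []
    for lemma in extract_lemmas(sketch):           # verify lemma by lemma
        proved = lean_prover_agent(lemma, context=done)
        if proved is None:
            return None  # in practice: retry, e.g. with a failure summary
        done.append(proved)
    return "\n\n".join(done)
```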
Training accuracy during reinforcement learning rose from roughly 50% at the start to nearly 90% after 1,000 steps, owing mainly to the carefully curated training dataset and the precise reward signals from Lean verification.
The average number of tool calls dropped from around 15 to 10, accompanied by a steady decrease in average sequence length. This suggests the model is learning to use tools more strategically and is avoiding unnecessary trial-and-error calls.
Despite making fewer calls, the model's reasoning ability improved: as optimization proceeds, the agent grows more capable on complex, long-horizon problems. The persistence of negative scores in the 32K to 64K token range, however, shows that handling extremely long contexts remains a challenge.
Seed Prover 1.5 solved 87.9% of the problems on PutnamBench and 80% on Fate-H, demonstrating its proficiency in formal mathematical proof at the undergraduate and graduate levels. The system still struggles with doctoral-level problems and beyond, however, primarily due to greater problem complexity and limited Mathlib coverage.
The researchers note that the system is not yet capable of making significant mathematical contributions on par with those of human experts. This limitation stems from how heavily advanced mathematical research depends on prior literature.
The key difference lies in the nature of mathematical tasks: Problems in the IMO or Putnam mathematics competitions are carefully crafted so that solvers do not need prior knowledge of specific research papers. In contrast, driving substantial progress in mathematics generally requires a thorough review of numerous relevant papers.
To achieve such advancements, three interconnected challenges must be tackled:
First, pinpointing the most influential and relevant papers;
Second, constructing natural language proofs based on these papers;
Third, developing scalable methods to formalize the papers themselves and the findings derived from them.
Overcoming these three challenges will pave the way for the large-scale generation of formalized mathematical research—an achievement that could eventually lead to solving some long-standing mathematical conjectures.
References:
https://arxiv.org/abs/2512.17260