12/01 2025
DeepSeek has officially open-sourced its latest mathematical model, DeepSeekMath-V2, which emphasizes a self-verifiable mathematical reasoning framework.
DeepSeekMath-V2 demonstrated outstanding performance on the IMO-ProofBench benchmark and on recent competition problems, solving 5 of the 6 problems from IMO 2025 and achieving a near-perfect score of 118/120 on Putnam 2024.
The innovative model can perform self-verifying reasoning via a verifier-generator loop, marking a significant advancement in autonomous mathematical exploration.
Shao Zhihong, the lead author of this study, was also the principal contributor to DeepSeekMath 7B, which previously introduced GRPO.
Experiments have revealed that correct answers do not necessarily indicate a valid reasoning process. Moreover, numerous mathematical tasks necessitate rigorous step-by-step derivation, rendering final answer-based rewards ineffective.
To push the boundaries of deep reasoning, DeepSeek argues that verifying the comprehensiveness and rigor of mathematical reasoning is crucial. Self-verification is especially vital for scaling test-time compute, particularly for open problems without established solutions.
To achieve self-verifiable mathematical reasoning, the team explored training a theorem-proving verifier using a large language model. Subsequently, this verifier served as a reward model to train a proof generator, encouraging the generator to identify and rectify as many flaws in its proofs as possible before finalization.
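The article describes this loop only at a high level. As a rough illustration, the Python sketch below shows a generate-verify-revise cycle in which the verifier's output serves as the generator's reward signal; every function name, the scoring scale, and the stubbed behavior are assumptions for illustration, not DeepSeek's actual interfaces.

```python
import random

def verify_proof(problem: str, proof: str) -> tuple[float, list[str]]:
    """Stand-in for the LLM verifier: returns a rigor score in [0, 1]
    plus a list of flagged issues (random stub, not a real model call)."""
    issues = [] if random.random() < 0.5 else ["Step 3: inequality asserted without justification"]
    return (1.0 if not issues else random.uniform(0.0, 0.7)), issues

def generate_proof(problem: str, feedback: list[str] | None = None) -> str:
    """Stand-in for the proof generator; a real system would condition on feedback."""
    return f"Proof of '{problem}' (revised against {len(feedback or [])} issues)"

def draft_with_self_verification(problem: str, max_rounds: int = 4) -> tuple[str, float]:
    """Generate, self-verify, and revise until the verifier stops flagging issues,
    so flaws are fixed before the proof is finalized; the final verifier score is
    the quantity a reinforcement-learning reward could be built on."""
    proof = generate_proof(problem)
    score = 0.0
    for _ in range(max_rounds):
        score, issues = verify_proof(problem, proof)
        if not issues:
            break
        proof = generate_proof(problem, issues)
    return proof, score

print(draft_with_self_verification("sqrt(2) is irrational"))
```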
To keep verification ahead of generation as the generator improves, the DeepSeek team proposed scaling verification compute to automatically label proofs that are hard to verify, producing new training data to further refine the verifier.
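One plausible reading of "scaling verification compute" is to sample the verifier many times per proof and trust only near-unanimous verdicts, deferring everything else. The sketch below illustrates that idea with invented thresholds and a stub verifier; it is not the paper's actual labeling pipeline.

```python
import random

def verification_pass(proof: str) -> bool:
    """Stub for one independent verification attempt; True = no issues found."""
    return random.random() < 0.8

def auto_label(proof: str, n_samples: int = 64, agreement: float = 0.95) -> str:
    """Auto-label a hard-to-verify proof only when repeated verification passes
    agree almost unanimously; anything ambiguous is deferred (e.g. to humans)."""
    pass_rate = sum(verification_pass(proof) for _ in range(n_samples)) / n_samples
    if pass_rate >= agreement:
        return "correct"
    if pass_rate <= 1.0 - agreement:
        return "flawed"
    return "undecided: escalate for review"

print(auto_label("candidate proof of a shortlist problem"))
```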
DeepSeekMath-V2 showcases robust theorem-proving abilities, reaching gold-medal-level performance at IMO 2025 and CMO 2024 and achieving a near-perfect 118/120 on Putnam 2024 (with scaled test-time compute).
These results suggest that self-verifiable mathematical reasoning is a promising research avenue and could aid in developing more advanced mathematical AI systems.
The team also introduced meta-verification: an auxiliary evaluation process to determine whether issues flagged by the verifier truly exist and whether they logically justify the predicted proof score based on evaluation criteria.
The meta-verifier generates a summary of issues identified during the analysis and assigns a quality score to gauge the accuracy and reasonableness of the verifier's assessment.
Leveraging the trained meta-verifier, researchers enhanced verifier training by incorporating meta-verification feedback into the reward function.
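The article does not quote the exact reward formula, but the sketch below shows one way meta-verification feedback could be folded into the verifier's reward; the weighting, field names, and stub meta-verifier are assumptions.

```python
def meta_quality(verifier_report: dict) -> float:
    """Stub meta-verifier: returns a score in [0, 1] for how well the flagged
    issues actually exist and justify the verifier's predicted proof score."""
    return 0.9 if verifier_report["issues"] else 0.6  # placeholder logic only

def verifier_reward(verifier_report: dict, reference_score: float) -> float:
    """Illustrative reward for verifier training: agreement with a reference
    proof score, scaled by the meta-verifier's quality judgement."""
    agreement = 1.0 - abs(verifier_report["score"] - reference_score)
    return agreement * meta_quality(verifier_report)

report = {"score": 0.8, "issues": ["Lemma 2 relies on an unproven bound"]}
print(round(verifier_reward(report, reference_score=1.0), 3))
```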
The proof verifier and generator form a synergistic cycle: the verifier refines the generator, and as the generator improves, it produces new proofs that test the verifier's current capabilities.
To boost labeling efficiency, researchers generated multiple verifier analyses for each proof to pinpoint potential issues for human review.
Through AI-assisted labeling, the team uncovered two key insights:
Increasing the number of verifier samples raises the likelihood of detecting genuine issues in flawed proofs (see the small calculation after this list);
Reviewing issues flagged by the verifier is essentially meta-verification, which is a more sample-efficient task for language models to learn.
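The first insight follows from independent sampling: if one verifier pass catches a real flaw with probability p, then n passes catch it with probability 1 - (1 - p)^n. The numbers below are purely illustrative; the article reports no specific value of p.

```python
p = 0.3  # assumed per-sample chance that one verifier pass catches a real flaw
for n in (1, 4, 16, 64):
    print(f"n={n:3d}  detection probability = {1 - (1 - p) ** n:.4f}")
```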
These discoveries pave the way for further automation.
Across all CNML-level problem categories (algebra, geometry, number theory, combinatorics, and inequalities), DeepSeekMath-V2 consistently outperformed GPT-5-Thinking-High and Gemini 2.5 Pro, demonstrating strong theorem-proving ability across diverse domains.
On IMO 2024 shortlist problems, iterative refinement improved proof quality. Each problem was run through 32 independent refinement threads, and the verification score of the selected best proof was notably higher than the thread average, indicating the generator can reliably judge proof quality.
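In code terms, the best-of-32 selection reduces to taking the maximum self-verification score over independent threads, as in this toy sketch (scores are random placeholders, not real runs):

```python
import random

def refinement_thread(problem: str) -> float:
    """Stub: one independent refinement thread, returning the final proof's
    self-verification score (random placeholder)."""
    return random.uniform(0.4, 1.0)

scores = [refinement_thread("IMO 2024 shortlist problem") for _ in range(32)]
print(f"best-of-32 score {max(scores):.2f} vs thread average {sum(scores) / 32:.2f}")
```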
Self-verification effectively guides iterative enhancements. The generator can differentiate between high-quality and flawed proofs, leveraging this self-awareness to systematically bolster its mathematical reasoning capabilities.
The model solved 5 out of 6 problems in IMO 2025 and 4 problems in CMO 2024, earning partial credit on another, achieving gold medal performances in both prestigious high school math competitions.
In Putnam 2024 (a premier undergraduate math competition), the model solved 11 out of 12 problems flawlessly, with only minor errors on the remaining one, scoring 118/120 and surpassing the highest human score of 90.
On IMO-ProofBench, the model outperformed Google DeepMind's Deep Think (an IMO gold-medal-level system) on the basic subset and remained competitive on the advanced subset, while significantly outperforming all other baselines. Nevertheless, the most challenging IMO-level problems remain difficult.
Remarkably, for unsolved problems, the generator typically identified genuine issues in its proofs, while fully solved problems passed all 64 verification attempts. This demonstrates that LLM-based verifiers can automatically evaluate proofs previously deemed difficult to verify.
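A minimal sketch of that consistency check: the 64-attempt figure comes from the article, while the stub verifier and its pass rate are invented.

```python
import random

def verification_attempt(proof: str) -> bool:
    """Stub for one verification attempt; True means no genuine issue was found."""
    return random.random() < 0.97

def fully_verified(proof: str, attempts: int = 64) -> bool:
    """A proof is treated as verified only if every attempt comes back clean."""
    return all(verification_attempt(proof) for _ in range(attempts))

print(fully_verified("final proof of an IMO 2025 problem"))
```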
References:
https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf