DeepSeek on the Cover of Nature: Unveiling Further Training Insights of R1

November 17, 2025

On September 18th, DeepSeek R1 made its debut on the cover of Nature, the prestigious British scientific journal.

This marks the first time a Chinese large language model has appeared on the cover of Nature. The cover recommendation highlights:

The DeepSeek R1 model is trained using reinforcement learning, a method where the model receives high scores as rewards for correctly solving mathematical problems and penalties for incorrect responses. Consequently, it learns to reason by solving problems step-by-step and revealing these steps, thereby increasing the probability of arriving at the correct answer. This enables DeepSeek R1 to self-verify and self-reflect, evaluating its performance before providing answers to new questions, which in turn enhances its performance in programming and graduate-level scientific problems.

The work is described in the paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," which DeepSeek first released as a preprint in January. The corresponding author is DeepSeek's founder, Liang Wenfeng.

The Nature paper updates that January preprint, detailing how DeepSeek enhanced a standard large language model (LLM) for reasoning tasks. The supplementary materials disclose, for the first time, the training cost of R1: roughly $294,000.

Lewis Tunstall, a machine learning engineer at Hugging Face, commented that R1 is the first large language model to undergo peer review, setting a highly welcome precedent. He stated, "It's challenging to assess whether these systems pose risks if we lack publicly shared norms for most of this process."

R1 has garnered 10.9 million downloads on the AI community platform Hugging Face, making it the most popular model in its category.

Faced with challenges such as poor readability and language mixing in DeepSeek R1 Zero, the DeepSeek team developed DeepSeek R1. Initially, they collected thousands of cold-start data points that exhibited conversational, human-like thought processes.

Subsequently, they integrated reasoning and non-reasoning datasets into the supervised fine-tuning (SFT) process. This enabled the model not only to excel at reasoning tasks but also to demonstrate strong writing capabilities.

To further align the model with human preferences, R1 implemented a second reinforcement learning (RL) phase. This phase aimed to enhance its usefulness and harmlessness while improving its reasoning abilities.

DeepSeek R1 employs the GRPO (Group Relative Policy Optimization) reinforcement learning algorithm, pairing rule-based rewards for reasoning-oriented data with model-based rewards for general data. This approach makes the learning process adaptable across different domains.
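The core idea of GRPO is to sample a group of outputs for each prompt and score each output relative to the group's mean reward, avoiding a separate value network. The following is a minimal illustrative sketch of that group-relative advantage computation; the function name, the example rewards, and the simplifications are assumptions for illustration, not DeepSeek's actual training code.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled output's reward
    against the mean and standard deviation of its group (all outputs
    sampled for one prompt)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 16 sampled outputs for one prompt, scored by a rule-based reward.
group_rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0,
                 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(grpo_advantages(group_rewards))  # positive for correct answers, negative otherwise
```

Outputs with above-average rewards receive positive advantages and are reinforced; below-average outputs are suppressed.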

The team used two kinds of rule-based rewards: accuracy rewards and format rewards. Accuracy rewards evaluate whether the final answer is correct, while format rewards complement them by enforcing specific output-format requirements.
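As an illustration of how such rule-based rewards can be combined, here is a minimal sketch. The tag conventions, answer-matching logic, and equal weighting are assumptions chosen for clarity; the paper only specifies that correctness is checked against a reference answer and that responses must follow a prescribed format.

```python
import re

def accuracy_reward(response: str, reference_answer: str) -> float:
    # Assumed convention: the final answer is wrapped in <answer>...</answer> tags.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def format_reward(response: str) -> float:
    # Assumed convention: reasoning must appear inside <think>...</think> tags
    # before the final answer.
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL)
    return 1.0 if ok else 0.0

def rule_based_reward(response: str, reference_answer: str) -> float:
    # Combined signal; equal weighting is an illustrative choice.
    return accuracy_reward(response, reference_answer) + format_reward(response)

sample = "<think>2 + 2 = 4</think><answer>4</answer>"
print(rule_based_reward(sample, "4"))  # 2.0
```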

To assess and enhance model safety, they compiled a dataset of 106,000 prompts. Model-generated responses were labeled as "safe" or "unsafe" based on predefined safety guidelines.

In the first RL phase, R1 sampled 16 outputs per prompt, each with a maximum length of 32,768 tokens. To speed up training, each rollout generated 8,192 outputs, which were randomly split into 16 mini-batches and trained for only one inner epoch. To address language mixing, R1 introduced a language consistency reward, calculated as the proportion of target-language words in the Chain-of-Thought (CoT).
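To make that proportion concrete, here is a rough sketch of such a language consistency reward for a Chinese target, using a naive character-counting heuristic. The tokenization and counting details are assumptions for illustration; the paper only defines the reward as the fraction of target-language words in the CoT.

```python
def language_consistency_reward(cot: str, target: str = "zh") -> float:
    """Fraction of the chain of thought written in the target language.
    'zh' counts CJK characters; other targets count ASCII-alphabetic tokens.
    A crude heuristic for illustration only."""
    if target == "zh":
        chars = [c for c in cot if not c.isspace()]
        if not chars:
            return 0.0
        zh = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
        return zh / len(chars)
    tokens = cot.split()
    if not tokens:
        return 0.0
    latin = sum(1 for t in tokens if any(ch.isascii() and ch.isalpha() for ch in t))
    return latin / len(tokens)

print(language_consistency_reward("首先 we 计算 the sum", target="zh"))  # ~0.33
```

A CoT written mostly in the target language scores close to 1.0, while heavily mixed reasoning is penalized.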

In the second phase, R1 combined reward signals with a diverse prompt distribution; for general data, a reward model guided the training. This combination allowed the model to excel at reasoning while prioritizing usefulness and harmlessness.

The paper also acknowledges several limitations. The first concerns structured output and tool use: R1's structured-output capabilities remain weaker than those of some existing models, and it cannot invoke tools such as search engines or calculators to improve its outputs.

The second limitation is token efficiency. R1 dynamically allocates compute during reasoning based on problem complexity, generating more tokens for harder tasks, but its token efficiency still requires further optimization.

The third limitation involves reward signals. In the reasoning domain, R1 keeps rewards reliable by using rule-based rewards; constructing an equally reliable reward model for subjective tasks such as writing remains a significant challenge.

To give lighter, smaller models reasoning capabilities akin to DeepSeek R1's, the team directly used 800,000 samples curated with DeepSeek R1 to fine-tune open-source models such as Qwen and Llama.

Experimental results demonstrate that this straightforward distillation strategy significantly enhances the reasoning performance of smaller models.
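Conceptually, this distillation is just supervised fine-tuning of a smaller student model on R1-curated reasoning traces. Below is a minimal sketch using the Hugging Face transformers Trainer; the checkpoint name, data file, column names, and hyperparameters are placeholders, not the team's actual configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholders: swap in the actual student checkpoint and distillation data.
student_name = "Qwen/Qwen2.5-7B"                 # assumed student model
data_files = {"train": "r1_distill_sft.jsonl"}   # assumed file of R1-curated samples

tokenizer = AutoTokenizer.from_pretrained(student_name)
model = AutoModelForCausalLM.from_pretrained(student_name)

dataset = load_dataset("json", data_files=data_files)["train"]

def tokenize(example):
    # Assumed schema: each sample stores a prompt and an R1-generated response.
    text = example["prompt"] + example["response"]
    return tokenizer(text, truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-distill-student",
                           per_device_train_batch_size=1,
                           num_train_epochs=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point of the design is that no reinforcement learning is needed for the student: plain SFT on the teacher's curated outputs is enough to transfer much of the reasoning behavior.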

For more technical details, please refer to the original paper: https://www.nature.com/articles/s41586-025-09422-z

