November 17, 2025
Today, Meituan proudly announced the launch of its latest reasoning model, LongCat-Flash-Thinking.
This model is the first large language model in China to integrate 'deep thinking + tool invocation' capabilities with 'informal + formal' reasoning skills. It has 560 billion total parameters and uses a Mixture-of-Experts (MoE) architecture that activates between 18.6 billion and 31.3 billion parameters per token (averaging around 27 billion) depending on context, so only a fraction of the model is computed for any given input.
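Meituan has not published the routing details behind this variable activation, but the general MoE mechanism it rests on is standard top-k gating: a small gate scores all experts per token and only the top-scoring few run. The sketch below illustrates that baseline only; the expert count and logits are illustrative, not LongCat-Flash's actual configuration.

```python
import math
import random

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-renormalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    z = sum(math.exp(gate_logits[i]) for i in top)
    return {i: math.exp(gate_logits[i]) / z for i in top}

random.seed(0)
gate_logits = [random.gauss(0.0, 1.0) for _ in range(16)]  # 16 experts (illustrative)
weights = top_k_route(gate_logits, k=2)  # e.g. {expert_id: weight}, weights sum to 1
```

Because different tokens can be routed to experts of different sizes (or to cheap pass-through experts), the number of activated parameters varies per token, which matches the 18.6B–31.3B range described above.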
According to Meituan's official announcements, the new model retains the rapid response speed of LongCat-Flash-Chat while integrating formal and agentic reasoning technologies. This enhancement significantly boosts its reasoning prowess across a range of complex tasks, including mathematics, logic, programming, automated theorem proving, and tool utilization.
The development of LongCat-Flash-Thinking unfolds in two distinct stages:
Long CoT Cold-Start Training: The initial phase focuses on building the model's foundational reasoning abilities. A curriculum learning strategy is applied during mid-training to strengthen the model's intrinsic capabilities, followed by a supervised fine-tuning (SFT) phase on reasoning-intensive and agentic data to prepare the model for more advanced learning.
Large-Scale Reinforcement Learning: The second phase leverages an efficient reinforcement learning framework, built upon the DORA system, to expand the model's potential. To tackle stability issues in asynchronous reinforcement learning training, the team has refined and extended the GRPO algorithm.
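Meituan's specific refinements to GRPO for asynchronous training are not detailed in the announcement, but the vanilla GRPO baseline they start from is simple to state: instead of a learned value critic, each rollout's advantage is its reward normalized against the other rollouts for the same prompt. A minimal sketch of that baseline:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: score each rollout against the mean and
    std of its own group of rollouts (no learned value critic needed)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Rewards for 4 rollouts of one prompt (toy values)
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # -> [1.0, -1.0, 1.0, -1.0]
```

The advantages in each group sum to zero, which keeps the policy update centered even when reward scales drift between domains.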
To address the instability inherent in traditional mixed-domain reinforcement learning training, LongCat-Flash-Thinking adopts a domain-parallel training approach that decouples optimization for STEM, coding, and agentic tasks. This not only stabilizes training; the resulting domain-expert models are then merged into a near-Pareto-optimal final model that performs well across all specialized domains.
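The announcement does not say how the domain-expert models are merged; uniform parameter averaging is one simple instance of such a merge, sketched below with plain lists standing in for tensors. The checkpoint names and shapes are purely illustrative.

```python
def merge_experts(state_dicts, weights=None):
    """Weighted-average the parameters of several domain-expert checkpoints.
    Tensors are stood in for by plain lists of floats."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n  # default: uniform average
    return {
        name: [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
        for name in state_dicts[0]
    }

stem_expert = {"layer.w": [0.0, 2.0]}   # hypothetical STEM-domain checkpoint
code_expert = {"layer.w": [2.0, 4.0]}   # hypothetical coding-domain checkpoint
merged = merge_experts([stem_expert, code_expert])  # {"layer.w": [1.0, 3.0]}
```

In practice the merge weights could themselves be tuned to trade off domain performance, which is how a near-Pareto-optimal combination would be selected.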
The training of LongCat-Flash-Thinking is built on the team's earlier DORA system. Its primary objective is to optimize long-tail generation by deploying multiple older versions of actor models in a streaming fashion while maintaining sampling consistency. DORA comprises two core components, elastic hosting and multi-version asynchronous pipelining, which together improve training efficiency, guarantee policy consistency for each sample, and enable efficient key-value cache reuse, yielding stable and scalable training across tens of thousands of accelerators.
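DORA itself is not public, but the per-sample policy-consistency property it guarantees can be illustrated in miniature: each sample is generated end-to-end by a single actor snapshot and tagged with that snapshot's version, so the trainer always knows which policy produced it. Everything below (the round-robin assignment, the function names) is a hypothetical toy, not Meituan's implementation.

```python
def rollout(version, prompt):
    # Toy stand-in for generation by actor snapshot `version`
    return f"v{version}:{prompt}"

def async_rollouts(prompts, actor_versions, buffer):
    """Each sample is produced end-to-end by one actor snapshot and tagged
    with its version, preserving per-sample policy consistency."""
    for i, prompt in enumerate(prompts):
        version = actor_versions[i % len(actor_versions)]  # round-robin (toy)
        buffer.append({"version": version, "sample": rollout(version, prompt)})

buffer = []
async_rollouts(["p1", "p2", "p3"], actor_versions=[7, 8], buffer=buffer)
```

With the version recorded per sample, the learner can apply version-aware corrections (or filtering) when consuming asynchronously generated data.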
Beyond general reasoning, LongCat-Flash-Thinking places a strong emphasis on two other key capabilities:
Formal Reasoning: LongCat-Flash-Thinking is adept at solving complex formal reasoning tasks, such as automated theorem proving. To bolster the model's formal reasoning capabilities, the team has introduced a novel expert iteration framework for refined data synthesis. This framework encompasses statement formalization, iterative proof synthesis, and syntax/consistency filtering.
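The expert iteration loop described here (attempt proofs, filter for syntax and consistency, keep survivors as training data) can be sketched as follows. The prover and verifier below are toy stand-ins; a real pipeline would call the model and a formal proof checker such as a Lean kernel.

```python
def expert_iteration(statements, prover, verifier, rounds=3):
    """Repeatedly attempt proofs and keep only attempts that pass the
    syntax/consistency filter; surviving pairs become training data."""
    dataset = []
    for _ in range(rounds):
        for stmt in statements:
            proof = prover(stmt)
            if verifier(stmt, proof):
                dataset.append((stmt, proof))
    return dataset

# Toy stand-ins: a real pipeline would call a model and a proof checker.
def toy_prover(stmt):
    return stmt[::-1]

def toy_verifier(stmt, proof):
    return proof == stmt[::-1] and len(stmt) % 2 == 0

data = expert_iteration(["ab", "abc", "abcd"], toy_prover, toy_verifier, rounds=1)
# Only "ab" and "abcd" pass the filter; "abc" is rejected.
```

The filtering step is what makes the loop self-improving: only verified proofs ever re-enter training, so errors in generation cannot accumulate.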
Agentic Reasoning: LongCat-Flash-Thinking can adaptively employ provided tools to tackle complex reasoning tasks. To achieve this, a dual-path reasoning method is introduced to identify and retain high-quality queries that genuinely necessitate tool assistance. This fosters the development of robust agentic capabilities. After selecting high-value queries, corresponding high-quality solution trajectories are synthesized based on a versatile environment, which includes an MCP server and simulated tools for both single-round and multi-round interactions.
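The dual-path selection idea reduces to a simple predicate: run each query down both paths and keep only those the model fails without tools but solves with them. The judges below are toy placeholders; in the real pipeline each path would be a full model rollout that is then graded.

```python
def select_tool_queries(queries, solve_without_tools, solve_with_tools):
    """Keep only queries the model fails on its own but solves with tools,
    i.e. the ones that genuinely require tool assistance."""
    return [q for q in queries
            if solve_with_tools(q) and not solve_without_tools(q)]

# Toy judges: real ones would run the model down both paths and grade answers.
plain_ok = lambda q: "easy" in q   # solvable without tools
tools_ok = lambda q: True          # solvable with tools
kept = select_tool_queries(["easy sum", "hard integral"], plain_ok, tools_ok)
# -> ["hard integral"]
```

Queries solvable either way carry no signal about tool use, so filtering them out concentrates training on trajectories where tool invocation actually matters.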
LongCat-Flash-Thinking scored 89.3% on the MMLU (Massive Multitask Language Understanding) benchmark, outperforming Alibaba's Qwen3-235B-A22B. It also achieved breakthrough results on mathematics benchmarks such as HMMT and AIME, surpassing OpenAI's o3, and scored 79.4 on the LiveCodeBench coding benchmark, on par with GPT-5.