November 17, 2025
On September 12, Alibaba's Tongyi Qianwen (Qwen) team introduced its next-generation foundation model architecture, Qwen3-Next, and open-sourced the Qwen3-Next-80B-A3B series of models built on this new architecture.
Users on X have showered the architecture with positive feedback, praising its design and reasoning capabilities. Compared with the MoE structure used in Qwen3, Qwen3-Next introduces several core enhancements: a hybrid attention mechanism, a highly sparse MoE structure, a series of optimizations for training stability, and a multi-token prediction mechanism that boosts inference efficiency.
Leveraging the Qwen3-Next model structure, Alibaba trained the Qwen3-Next-80B-A3B-Base model. This model boasts 80 billion parameters but only activates 3 billion during operation. Remarkably, the Base model delivers performance on par with or slightly superior to that of the Qwen3-32B dense model, while incurring less than one-tenth of the training costs of Qwen3-32B. Its inference throughput for contexts exceeding 32k tokens is over ten times that of Qwen3-32B, offering unparalleled training and inference cost-effectiveness.
Alibaba has also developed and released the Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking models alongside Qwen3-Next-80B-A3B-Base.
Alibaba has successfully tackled the long-standing issues of stability and efficiency in reinforcement learning training for hybrid attention mechanisms and highly sparse MoE architectures, achieving dual improvements in RL training efficiency and final model performance. In multiple benchmark tests, Qwen3-Next-80B-A3B-Thinking outperforms the proprietary Gemini-2.5-Flash-Thinking model.
Qwen3-Next adopts a hybrid architecture combining Gated DeltaNet and Gated Attention, introducing several enhanced designs on top of the standard attention mechanism.
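To make the hybrid layout concrete, here is a minimal sketch, not Alibaba's implementation, of how a stack might interleave linear-attention blocks with gated full-attention blocks. The placeholder block internals and the every-fourth-layer mixing pattern are illustrative assumptions, not the model's actual configuration.

```python
# Illustrative sketch of a hybrid attention stack: most layers use a linear-attention
# mechanism (standing in for Gated DeltaNet), while periodic layers keep standard
# attention with a sigmoid output gate. Placeholder blocks; mixing ratio is assumed.
import torch
import torch.nn as nn

class LinearAttnBlock(nn.Module):
    """Placeholder standing in for a Gated DeltaNet (linear attention) layer."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
    def forward(self, x):
        return x + torch.tanh(self.proj(x))  # stand-in token mixing, not real DeltaNet

class GatedAttnBlock(nn.Module):
    """Standard self-attention whose output is modulated by a learned sigmoid gate."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)
    def forward(self, x):
        y, _ = self.attn(x, x, x, need_weights=False)
        return x + torch.sigmoid(self.gate(x)) * y  # gate scales the attention output

def build_hybrid_stack(n_layers=12, d_model=256, full_attn_every=4):
    # e.g. every 4th layer uses gated full attention, the rest use linear attention
    layers = []
    for i in range(n_layers):
        if (i + 1) % full_attn_every == 0:
            layers.append(GatedAttnBlock(d_model))
        else:
            layers.append(LinearAttnBlock(d_model))
    return nn.Sequential(*layers)

if __name__ == "__main__":
    model = build_hybrid_stack()
    x = torch.randn(2, 16, 256)   # (batch, sequence, hidden)
    print(model(x).shape)         # torch.Size([2, 16, 256])
```

The appeal of such a layout is that the linear-attention layers scale well with sequence length, while the retained full-attention layers preserve global recall.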
Qwen3-Next also employs a highly sparse Mixture-of-Experts (MoE) architecture with a total of 80 billion parameters, activating only about 3 billion per inference. Compared to Qwen3-MoE's configuration of 128 total experts with 8 routed per token, Qwen3-Next expands to 512 total experts, with 10 routed experts and 1 shared expert, maximizing resource utilization without compromising performance.
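The routing pattern described above can be illustrated with a small sketch. This is not the model's code: the dimensions are shrunk for the demo, and only the shape of the computation, a top-k router over a large expert pool plus an always-active shared expert, follows the description (the real model uses 512 experts with 10 routed and 1 shared).

```python
# Minimal sparse-MoE sketch: a router picks top-k experts per token, their outputs
# are combined with normalized routing weights, and a shared expert always runs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=32, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts
        out = self.shared_expert(x)              # shared expert always contributes
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 64)
print(SparseMoE()(tokens).shape)  # torch.Size([8, 64])
```

Because only the selected experts run for each token, compute per token scales with the activated parameters rather than the total parameter count.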
To prevent abnormally large norm weights in certain layers, Qwen3-Next adopts Zero-Centered RMSNorm and applies weight decay to the norm weights to curb unbounded growth. It also normalizes the MoE router's parameters at initialization to keep expert selection unbiased early in training, reducing the disturbance that random initialization introduces into experimental results.
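The idea behind a zero-centered norm can be shown in a few lines. The sketch below is an illustration under the assumption that the learnable scale is stored as an offset around zero, so weight decay on it pulls the effective scale toward 1 rather than collapsing it toward 0; it is not the model's actual layer.

```python
# Zero-centered RMSNorm sketch: effective scale is (1 + weight), with weight
# initialized at zero, so ordinary weight decay regularizes the scale toward 1.
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-centered learnable offset

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * (1.0 + self.weight)

norm = ZeroCenteredRMSNorm(16)
x = torch.randn(4, 16)
print(norm(x).shape)  # torch.Size([4, 16])
```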
Moreover, Qwen3-Next introduces a native Multi-Token Prediction (MTP) mechanism, featuring an MTP module with a high speculative decoding acceptance rate that also enhances the overall performance of the main model. The MTP module's multi-step inference is specifically optimized as well, further boosting the speculative decoding acceptance rate in real-world scenarios.
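The acceptance step that makes speculative decoding pay off can be sketched as follows. This toy verifier illustrates the general mechanism, not the Qwen3-Next MTP module itself: a draft head proposes several future tokens, the main model scores them in one pass, and the longest agreeing prefix is accepted.

```python
# Toy speculative-decoding acceptance check (greedy verification).
import torch

def accept_draft(draft_tokens: torch.Tensor, main_logits: torch.Tensor) -> int:
    """draft_tokens: (k,) proposed token ids; main_logits: (k, vocab) from the main model.
    Returns how many leading draft tokens the main model agrees with."""
    main_choice = main_logits.argmax(dim=-1)                     # main model's greedy picks
    agree = (draft_tokens == main_choice)
    return int(agree.to(torch.int64).cumprod(dim=0).sum())       # length of agreeing prefix

# Toy example: 4 draft tokens, vocab of 10; the main model agrees on the first two.
draft = torch.tensor([3, 7, 1, 4])
logits = torch.zeros(4, 10)
logits[0, 3] = logits[1, 7] = logits[2, 9] = logits[3, 4] = 5.0
print(accept_draft(draft, logits))  # 2 -> higher acceptance rates mean larger speedups
```

The higher the acceptance rate, the more tokens are committed per main-model forward pass, which is where the inference speedup comes from.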
Qwen3-Next utilizes a uniformly sampled subset of the Qwen3 36T-token pre-training corpus, containing only 15T tokens. Its training consumes less than 80% of the GPU hours required for Qwen3-30B-A3B and only 9.3% of the GPU compute used for Qwen3-32B, demonstrating exceptional training efficiency and cost-effectiveness.
Thanks to its innovative hybrid architecture, Qwen3-Next demonstrates significant advantages in inference efficiency. In the prefill stage, Qwen3-Next-80B-A3B achieves nearly seven times the throughput of Qwen3-32B at a context length of 4k tokens; when the context length exceeds 32k, the improvement surpasses tenfold.
During the decoding phase, the model achieves nearly four times the throughput at a 4k context length and maintains an advantage of more than ten times in long-context scenarios beyond 32k.
The Qwen3-Next-80B-A3B-Base model, using only about one-tenth of the non-embedding activated parameters of Qwen3-32B-Base, outperforms that model in most benchmark tests and significantly surpasses Qwen3-30B-A3B, demonstrating superior model efficiency and performance.
Qwen3-Next-80B-A3B-Instruct significantly outperforms Qwen3-30B-A3B-Instruct-2507 and Qwen3-32B-Non-thinking, achieving results nearly on par with those of Qwen3-235B-A22B-Instruct-2507.
On the RULER benchmark, the model outperforms Qwen3-30B-A3B-Instruct-2507, which has the same number of layers but more attention layers, across all context lengths. This underscores the advantage of the Gated DeltaNet and Gated Attention hybrid in long-text scenarios.
Qwen3-Next-80B-A3B-Thinking outperforms Qwen3-30B-A3B-Thinking-2507 and Qwen3-32B-Thinking, both of which cost more to pre-train; it also surpasses the proprietary Gemini-2.5-Flash-Thinking model and approaches Alibaba's latest flagship, Qwen3-235B-A22B-Thinking-2507, on certain metrics.
Qwen3-Next introduces multiple innovations in attention mechanisms within its model architecture, including linear attention and attention gating mechanisms, and further enhances sparsity in its MoE design.
In both "thinking mode" and "non-thinking mode," the performance of Qwen3-Next-80B-A3B is comparable to that of the larger-scale Qwen3-235B-A22B-2507, with significant improvements in inference speed, particularly in long-context scenarios.
Alibaba has stated its commitment to continuously optimizing this architecture and developing Qwen3.5, with the goal of achieving even higher levels of intelligence and productivity.
Qwen3-Next is now open-sourced on the ModelScope community and the Hugging Face platform.