Prioritizing Latency: NVIDIA Unveils Nemotron-Flash, Empowering Small Models with "Faster Computation"

12/04 2025

When it comes to the design of Small Language Models (SLMs), the primary goal has traditionally been to minimize parameter count. However, parameter efficiency doesn't automatically translate into proportional speedups on real-world hardware.

NVIDIA has recently published a paper that tackles this issue head-on. The paper's objective is to pinpoint the crucial factors influencing SLM latency on actual devices and to offer universal guidelines and techniques for crafting and training SLMs with a strong emphasis on real-device latency.

The research team has introduced Nemotron-Flash, an innovative hybrid small language model that prioritizes reducing latency in real-world applications over simply minimizing parameter count. It boasts latency-optimized depth-width ratios, hybrid operators discovered through evolutionary search, and weight normalization during training.

It's worth noting that this paper has been accepted for presentation at NeurIPS 2025.

To overcome the limitations inherent in small models, the team pinpointed two fundamental architectural elements: the depth-width ratio and operator selection. The former is vital for latency in small-batch scenarios, while the latter impacts both latency and throughput in large-batch contexts.

The study's findings reveal a trade-off between accuracy and parameters/latency when adjusting depth and width. Although deeper models often strike a better balance between accuracy and parameters, they may not fare as well in the accuracy-latency trade-off. There exists an optimal depth-width ratio for specific latency constraints.

The research team also delved into cutting-edge efficient attention mechanisms to evaluate their suitability as building blocks. Leveraging the most effective operators identified, they developed an evolutionary search framework that automatically discovers latency-optimal operator combinations for hybrid SLMs, improving both accuracy and latency.

In addition to architectural enhancements, the team further refined SLM training through weight normalization. This technique updates weights more efficiently and hastens final convergence, potentially becoming a standard component in future SLMs.

For SLM design, real-device latency is primarily influenced by two key factors: the model's depth-width configuration and its choice of operators.

The team made three key observations:

Across a broad range of depths, deeper models generally offer a better accuracy-parameter trade-off, though this benefit eventually plateaus;

In terms of accuracy-latency trade-offs, the benefits of deeper models may not be evident. For a given latency constraint, there's an optimal depth setting. For instance, with a 3-second latency limit, a depth-12 model achieved the highest accuracy among the tested configurations;

The ideal depth-width ratio typically rises with higher latency constraints. These findings highlight the importance of carefully choosing depth and width based on deployment limitations, rather than defaulting to deeper models.

Consequently, the team investigated systematic approaches to determine the optimal depth-width ratios within model families. They expanded existing scaling laws by parameterizing model loss based on depth and width.
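To make the idea concrete, here is a minimal sketch of depth-width selection under a latency budget. The power-law loss surface, the toy latency model, and all coefficients below are illustrative assumptions, not the paper's actual parameterization: the point is only that, given such models, the best (depth, width) pair shifts as the budget changes.

```python
import numpy as np

# Hypothetical parametric loss surface L(d, w) = a*d^-alpha + b*w^-beta + c.
# Coefficients are invented for illustration, not fitted values from the paper.
a, alpha = 2.0, 0.30
b, beta = 3.0, 0.35
c = 1.2

def loss(depth, width):
    return a * depth**-alpha + b * width**-beta + c

# Toy latency model: decoding latency grows linearly in depth and
# sub-linearly in width (width parallelizes better on a GPU).
def latency_ms(depth, width):
    return 0.08 * depth + 0.002 * width**0.8

def best_config(latency_budget_ms, depths, widths):
    """Return (loss, depth, width) with lowest predicted loss under budget."""
    best = None
    for d in depths:
        for w in widths:
            if latency_ms(d, w) <= latency_budget_ms:
                cand = (loss(d, w), d, w)
                best = cand if best is None or cand < best else best
    return best

depths = range(8, 49, 4)
widths = range(512, 4097, 256)
for budget in (2.0, 4.0, 8.0):
    l, d, w = best_config(budget, depths, widths)
    print(f"budget {budget} ms -> depth {d}, width {w}, predicted loss {l:.3f}")
```

A looser budget admits a strictly larger feasible set, so predicted loss can only improve, mirroring the observation that the optimal depth-width ratio depends on the deployment constraint.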

Beyond model depth and width, the operators employed in each layer represent another crucial aspect. Initially, the team trained existing LM architectures in a controlled setting to identify the most promising operators in terms of accuracy-latency balance. Subsequently, they developed an evolutionary search process to automatically and efficiently discover hybrid operator combinations, thereby constructing hybrid SLMs.

The proliferation of various efficient attention mechanisms and their complex interactions in hybrid models prompted the team to create an automated framework. This framework identifies efficient and complementary attention mechanism combinations in hybrid models, featuring an evolutionary search engine to efficiently explore complex combinatorial design spaces.
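The shape of such a search can be sketched in a few lines. Everything below is an illustrative stand-in: the operator set, the per-operator accuracy/latency proxies, and the penalized fitness function are invented for this example and are not the paper's actual search space or objective.

```python
import random

# Toy per-layer operators with invented accuracy/latency proxy scores.
OPS = {
    "full_attn":   {"acc": 1.00, "lat": 1.00},  # strongest, slowest
    "sliding_win": {"acc": 0.92, "lat": 0.45},
    "linear_attn": {"acc": 0.85, "lat": 0.30},
    "mamba":       {"acc": 0.90, "lat": 0.35},
}

DEPTH = 12
LAT_BUDGET = 6.0  # hypothetical total latency budget across all layers

def fitness(genome):
    """Reward proxy accuracy; heavily penalize exceeding the latency budget."""
    acc = sum(OPS[op]["acc"] for op in genome)
    lat = sum(OPS[op]["lat"] for op in genome)
    return acc - 10.0 * max(0.0, lat - LAT_BUDGET)

def mutate(genome, rate=0.2):
    return [random.choice(list(OPS)) if random.random() < rate else op
            for op in genome]

def evolve(pop_size=32, generations=50, seed=0):
    random.seed(seed)
    pop = [[random.choice(list(OPS)) for _ in range(DEPTH)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 4]  # keep the fittest quarter (elitism)
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(pop_size - len(parents))]
    return max(pop, key=fitness)

best = evolve()
print(best)
```

The search converges on hybrid layer sequences that mix a few expensive full-attention layers with cheaper alternatives, which is the qualitative behavior the framework automates.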

During training, researchers projected model weights onto a unit-norm sphere after each iteration, constraining weight magnitudes. This normalization step removed radial components and emphasized angular updates, resulting in larger relative weight changes under similar gradient magnitudes.
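The projection step itself is simple; the sketch below uses made-up shapes, a made-up learning rate, and a random gradient purely to illustrate the mechanics and why the relative update is larger on the unit sphere:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight tensor and gradient; values are for illustration only.
W = rng.normal(size=(64, 64))
grad = rng.normal(size=(64, 64))
lr = 0.1

def project_to_unit_sphere(w):
    """Rescale the tensor to unit Frobenius norm (the projection step)."""
    return w / np.linalg.norm(w)

W = project_to_unit_sphere(W)               # start on the unit sphere
W = project_to_unit_sphere(W - lr * grad)   # SGD step, then project back

# The same gradient applied to a weight of larger norm yields a smaller
# *relative* change; projecting to unit norm removes that radial damping.
big_W = 10.0 * project_to_unit_sphere(rng.normal(size=(64, 64)))
rel_change_unit = np.linalg.norm(lr * grad) / np.linalg.norm(W)
rel_change_big = np.linalg.norm(lr * grad) / np.linalg.norm(big_W)
print(rel_change_unit > rel_change_big)  # True
```

Keeping the weight norm fixed at 1 means every update is purely angular, which is the "larger relative weight changes under similar gradient magnitudes" effect described above.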

The Nemotron-Flash series boasts the lowest decoding latency and highest accuracy among models of comparable size.

Nemotron-Flash-1B outperforms Qwen3-0.6B with 5.5% higher accuracy, 1.9 times lower latency, and 46 times higher throughput.

Similarly, Nemotron-Flash-3B surpasses Qwen2.5-3B and Qwen3-1.7B with 2.0% and 5.5% higher average accuracy, respectively, along with 1.7 times and 1.3 times lower latency and 6.4 times and 18.7 times higher throughput.

Through further optimization of attention mechanism configurations, Nemotron-Flash-3B-TP achieves 10.1 times and 29.7 times higher throughput than Qwen2.5-3B and Qwen3-1.7B, respectively.

Beyond delivering the most competitive latency and throughput, Nemotron-Flash-3B excels in common-sense reasoning, mathematics, coding, and recall tasks among models with over 1.5 billion parameters.

Nemotron-Flash-3B-Instruct showcases robust reasoning and instruction-following abilities, achieving the highest average accuracy and efficiency. Compared to Qwen2.5-1.5B and Qwen3-1.7B, it enhances average accuracy by over 4.7% and boosts throughput by 4.3 times and 18.7 times, respectively.

References:

https://arxiv.org/pdf/2511.18890
