Wink Pings

NVIDIA Releases Tri-Mode Large Language Model: Switch Decoding Modes Simply by Modifying Attention Mask, Boosting Single-User Throughput by Up to 4x

In May 2026, NVIDIA launched Nemotron-Labs-Diffusion, billed as the industry's first tri-mode language model family. Switching between its three decoding modes (autoregressive, diffusion, and self-speculation) requires no architecture changes and no auxiliary models, only a different attention mask. The family targets scenarios ranging from high-concurrency cloud services to personal local inference, and delivers up to a 4x improvement in single-user throughput. All models are released under an open license.

On May 19, 2026, NVIDIA researcher Pavlo Molchanov unveiled the Nemotron-Labs-Diffusion model series, which drew wide industry attention after Emad Mostaque, former CEO of Stability AI, shared the announcement. It is presented as the world's first large model family in which three decoding modes share a single architecture. The models come in 3B, 8B, and 14B parameter sizes and in three variants: base, instruction-tuned, and vision-language multimodal. All versions are available for download under an open license.

Most existing solutions need a separate architecture for each decoding method, or deploy an additional small draft model. Nemotron-Labs-Diffusion instead switches decoding modes purely by changing the model's attention pattern (mask): the parameters stay the same, and no auxiliary modules are required. The model was trained with a joint objective that combines autoregressive and diffusion approaches, and the two capabilities complement each other. Diffusion training strengthens the model's lookahead planning, while autoregressive training provides a left-to-right language generation prior that safeguards output quality.
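To make the idea of switching modes through the attention mask concrete, here is a minimal sketch in plain Python. It is not NVIDIA's implementation: the function names, the `prefix_len` parameter, and the block-style diffusion mask (a causal prefix followed by one block that attends bidirectionally) are assumptions chosen for illustration, since the article does not specify the exact mask layouts.

```python
def causal_mask(n):
    """Autoregressive-style mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_diffusion_mask(n, prefix_len):
    """Hypothetical diffusion-style mask (an assumption, not the published layout).

    The first `prefix_len` positions keep the causal pattern. The remaining
    positions form one block: each sees the whole prefix and every other
    position inside the block, so the block can be denoised in parallel.
    """
    mask = causal_mask(n)
    for i in range(prefix_len, n):
        for j in range(prefix_len, n):
            mask[i][j] = True  # open up bidirectional attention inside the block
    return mask


if __name__ == "__main__":
    for name, mask in [("causal", causal_mask(4)),
                       ("block diffusion", block_diffusion_mask(4, 2))]:
        print(name)
        for row in mask:
            print(" ".join("1" if allowed else "0" for allowed in row))
```

Both functions describe the same sequence of positions and differ only in which entries are allowed, which is the sense in which one set of weights could serve several decoding modes.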

![Nemotron-Labs-Diffusion tri-mode description and performance comparison](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHIsyQI_b0AAj7cS%3Fformat%3Djpg%26name%3Dlarge)

The official performance comparison chart clearly marks the positioning and applicable scenarios of the three modes:

1. Autoregressive (AR) Mode: Designed for high-concurrency cloud service scenarios. This is the conventional decoding logic used by the vast majority of current large language models, and delivers the most stable performance when multiple users send requests simultaneously.

2. Diffusion Mode: Offers the highest theoretical speed potential. According to the official analysis, with an optimal sampler this mode can emit 76.5% more tokens per forward pass than self-speculation mode, which makes it a core direction for future gains in inference efficiency.

3. Self-Speculation Mode: Adapted for low-concurrency personal AI inference scenarios. In this mode, the diffusion module generates candidate draft tokens, which are then verified by the autoregressive module. Compared with commonly used multi-token prediction methods in the industry, this approach achieves significant improvements in both candidate acceptance rate and actual device efficiency.
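The draft-then-verify loop of self-speculation mode can be sketched as follows. This shows the generic greedy acceptance rule used in speculative decoding, assumed here for illustration; the article does not state which acceptance rule Nemotron-Labs-Diffusion actually uses, and `accept_draft` and its arguments are hypothetical names.

```python
def accept_draft(draft, verified):
    """Greedy verification of draft tokens (generic sketch, not NVIDIA's code).

    draft:    k tokens proposed by the diffusion pass.
    verified: k + 1 tokens from the autoregressive pass, where verified[i] is
              the token the AR pass predicts after the context plus draft[:i].

    Returns the tokens to commit: the longest prefix of `draft` that matches
    the AR predictions, followed by one AR token (a correction at the first
    mismatch, or a bonus token if every draft token was accepted).
    """
    accepted = []
    for i, token in enumerate(draft):
        if token != verified[i]:
            break  # first disagreement: discard the rest of the draft
        accepted.append(token)
    accepted.append(verified[len(accepted)])
    return accepted


if __name__ == "__main__":
    # Draft diverges at the third token, so two draft tokens plus the AR correction are kept.
    print(accept_draft([5, 7, 9], [5, 7, 2, 4]))
    # Draft fully accepted, so the AR pass contributes one extra token.
    print(accept_draft([5, 7, 9], [5, 7, 9, 4]))
```

Under this rule, every verification pass commits at least one token, and a higher candidate acceptance rate means more tokens committed per pass.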

![Official effect demonstration GIF](https://research.nvidia.com/sites/default/files/publications/demo.gif)

Measured data shows that in single-user scenarios, the throughput of this model series can reach up to 4 times that of traditional solutions. Taking the 8B version as an example: running the SPEED-Bench benchmark on an NVIDIA GB200 GPU with SGLang, Nemotron-Labs-Diffusion-8B decodes 5.9 times as many tokens per forward pass as Qwen3-8B, while also achieving better accuracy.
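Tokens per forward pass and end-to-end throughput are related but not identical metrics, because a pass that emits more tokens may also take longer. The short sketch below spells out that relationship. The function name and the per-pass cost ratios are hypothetical; the article reports no per-pass latency, so only the 5.9 figure comes from the text above.

```python
def throughput_speedup(tokens_per_pass_ratio, pass_cost_ratio):
    """Speedup in tokens per second relative to a baseline.

    tokens_per_pass_ratio: tokens committed per forward pass, relative to the baseline.
    pass_cost_ratio:       time per forward pass, relative to the baseline
                           (1.0 means each pass costs the same as the baseline's).
    """
    if pass_cost_ratio <= 0:
        raise ValueError("pass_cost_ratio must be positive")
    return tokens_per_pass_ratio / pass_cost_ratio


if __name__ == "__main__":
    # 5.9 is the reported tokens-per-pass ratio; the cost ratios are made-up examples.
    for cost in (1.0, 1.25, 1.5):
        print(f"pass cost x{cost}: speedup x{throughput_speedup(5.9, cost):.2f}")
```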

Currently, all weights of this model series have been synchronized to NVIDIA's official collection on Hugging Face, along with complete project documentation and the technical report:

- Hugging Face Model Collection: [Nemotron-Labs-Diffusion](https://huggingface.co/collections/nvidia/nemotron-labs-diffusion)

- Official Project Page: [Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding](https://research.nvidia.com/publication/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive)

- Technical Report: [Nemotron-Labs-Diffusion Technical Report](http://bit.ly/Nemotron-Labs-Diffusion-Report)

Before this work, most large model inference optimizations either improved efficiency along a single decoding path or maintained multiple independent sets of model weights for different scenarios. This solution integrates three mainstream decoding approaches into a single set of weights, which eliminates the storage and operational costs of maintaining multiple models. That benefit applies both to cloud service providers handling a mix of high- and low-concurrency traffic and to individual developers deploying general-purpose models locally.

Published: 2026-05-20 06:50