---
title: "SGLang Adds Day-0 Support for NVIDIA Nemotron 3 Super for Building High-Efficiency Multi-Agent Systems"
author: "NVIDIA Nemotron Team"
date: "March 11, 2026"
previewImg: /images/blog/nemotron-3-super/figure_1.svg
---

We are excited to announce that SGLang supports NVIDIA Nemotron 3 Super on Day 0.

[Nemotron 3 Super](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/) is a leading open model in the Nemotron 3 family, built for running many collaborating agents together. Agentic systems that chain planning, reasoning, and tool use produce far more tokens than single-turn chat, and they need strong reasoning at every step.

Nemotron 3 Super is a 120B-parameter hybrid MoE that activates only 12B parameters per forward pass, giving you leading accuracy for coding, tool calling, and instruction following at a fraction of the cost, plus a 1M-token context so agents keep conversation and plan state in view across long workflows.

![figure1](/images/blog/nemotron-3-super/figure_1.svg)<small><center>[Artificial Analysis](https://artificialanalysis.ai/) chart showing Nemotron 3 Super leading on intelligence vs. openness when compared to popular open models of similar size</center></small>

As the chart above shows, Nemotron 3 Super leads on the Artificial Analysis Openness Index. Unlike most other open models, Nemotron is fully open, with open weights, datasets, and training recipes, so developers can easily customize, optimize, and deploy on their own infrastructure for maximum privacy and security.

In this post we walk through installing SGLang and serving Nemotron 3 Super for inference.

## About Nemotron 3 Super

- **Architecture**: Mixture of Experts (MoE) with a hybrid Transformer-Mamba architecture
  - Highest throughput efficiency in its size category, and up to 5x higher throughput than the previous Nemotron Super model (Llama Nemotron Super 1.5)
  - Multi-Token Prediction (MTP): by predicting several future tokens in a single forward pass, MTP drastically accelerates long-form text generation
  - Supports a thinking budget for optimal accuracy with minimal reasoning-token generation
- **Accuracy**: Leading accuracy on the Artificial Analysis Intelligence Index in its size category
  - Up to 2x higher accuracy on the Artificial Analysis Intelligence Index than the previous Nemotron Super model
  - Latent MoE enables calling 4 experts for the inference cost of only one
- **Model size**: 120B total parameters, 12B active parameters
- **Context length**: up to 1M tokens
- **Model I/O**: Text in, text out
- **Supported GPUs**: B200, H100, H200, DGX Spark, RTX 6000
- **Get started**:
  - Download model weights from Hugging Face: [BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16), [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), and [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
  - [Run with SGLang for inference](https://cookbook.sglang.io/autoregressive/NVIDIA/Nemotron3-Super)
  - Read the [technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf) to build custom, optimized models with Nemotron techniques

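To build intuition for Multi-Token Prediction, here is a toy sketch in the spirit of speculative decoding: a cheap draft head proposes several next tokens, one pass of the full model verifies them, and the longest agreeing prefix is accepted. The `draft_fn` and `target_fn` functions below are stand-ins we made up for illustration; this is a simplified picture of the general technique, not SGLang's or Nemotron's actual implementation.

```python
# Sketch of MTP-style speculative decoding with greedy verification.
# draft_fn proposes k next tokens cheaply; target_fn represents one
# (batched) pass of the full model giving its greedy choice at each
# draft position. We accept the longest prefix where draft and target
# agree, then take the target's token at the first mismatch, so every
# step emits at least one token.

def speculative_step(prefix, k, draft_fn, target_fn):
    drafts = draft_fn(prefix, k)            # k cheap draft tokens
    targets = target_fn(prefix, drafts)     # target's choice at each position
    accepted = []
    for d, t in zip(drafts, targets):
        if d != t:
            accepted.append(t)              # correct the first mismatch
            break
        accepted.append(d)
    return accepted

# Toy "models" over integer token ids: the target always continues
# n -> n + 1; the draft head gets the third token wrong.
def target_fn(prefix, drafts):
    out, last = [], prefix[-1]
    for _ in drafts:
        last = last + 1
        out.append(last)
    return out

def draft_fn(prefix, k):
    proposal = [prefix[-1] + 1 + i for i in range(k)]
    if k >= 3:
        proposal[2] = -1                    # inject a wrong draft token
    return proposal

tokens = speculative_step([0, 1, 2], 4, draft_fn, target_fn)  # -> [3, 4, 5]
```

Two of the four drafts survive verification and the third is corrected, so three tokens are emitted for one full-model pass, which is where the long-form speedup comes from.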
## Installation and Quick Start

For an easier setup with SGLang, refer to our getting-started cookbook, available [here](https://cookbook.sglang.io/autoregressive/NVIDIA/Nemotron3-Super) or through the NVIDIA Brev [launchable](https://brev.nvidia.com/launchable/deploy?launchableID=env-39d03Y3mDAiGuIrnHKmZwv0tA4s).

Run the command below to install dependencies:

```bash
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'
```

We can then serve the model. The command below is configured for a 4xH200 setup; refer to the cookbook for detailed instructions.

```bash
# BF16
python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --host 0.0.0.0 \
  --port 5000 \
  --trust-remote-code \
  --tp 4 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_3
```

Once the server is up and running, you can prompt the model with the following snippet:

```python
from openai import OpenAI

# The model name we used when launching the server.
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"

BASE_URL = "http://localhost:5000/v1"
API_KEY = "EMPTY"  # The SGLang server doesn't require an API key by default.

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."},
    ],
    temperature=0.6,
    max_tokens=512,
)
print("Reasoning:", resp.choices[0].message.reasoning_content)
print("Content:", resp.choices[0].message.content)
```
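Because the server is launched with `--tool-call-parser qwen3_coder`, it can return OpenAI-style tool calls. The sketch below shows the shape of such a request; the `get_weather` tool is a hypothetical example we made up, and any JSON-schema tool definition works the same way. It only builds the payload, so it runs without the server.

```python
# Sketch: the shape of a tool-calling request for the server above.
# The get_weather tool is a hypothetical example. With the server
# running, pass tools=TOOLS to client.chat.completions.create(...) and
# read tool calls from resp.choices[0].message.tool_calls.

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name."},
                },
                "required": ["city"],
            },
        },
    }
]

def build_request(user_prompt: str) -> dict:
    """Assemble a chat-completion payload with tools attached."""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": TOOLS,
        "tool_choice": "auto",   # let the model decide whether to call a tool
        "temperature": 0.6,
    }

payload = build_request("What's the weather in Berlin?")
```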

## Nemotron 3 Super is ideal for multi-agent and reasoning workloads

![figure2](/images/blog/nemotron-3-super/figure_2.svg)<small><center>[Artificial Analysis](https://artificialanalysis.ai/) chart showing Nemotron 3 Super leading on intelligence vs. efficiency when compared to popular open models of similar size</center></small>

As the chart above shows, the model achieves leading accuracy with higher efficiency on Artificial Analysis benchmarks, making it a strong choice for multi-agent systems that need both efficiency and capability.

The 1M-token context is built for long-horizon agent work: agents can keep full conversation history and plan state in context, and RAG pipelines can supply large document sets in one shot. That reduces fragmentation and goal drift in multi-step workflows.

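Supplying a document set "in one shot" still means respecting a token budget. A minimal sketch of greedy context packing, assuming a rough 4-characters-per-token heuristic (in production you would count tokens with the model's real tokenizer):

```python
# Sketch: greedily pack retrieved documents into one request while
# staying under a token budget. The 4-characters-per-token estimate is
# a rough heuristic, not the model's actual tokenization.

def approx_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token of English."""
    return max(1, len(text) // 4)

def pack_context(docs: list[str], budget: int) -> list[str]:
    """Keep documents, in order, until the token budget is exhausted."""
    packed, used = [], 0
    for doc in docs:
        cost = approx_tokens(doc)
        if used + cost > budget:
            break
        packed.append(doc)
        used += cost
    return packed

# Two small documents fit in a 250-token budget; the large one does not.
docs = ["a" * 400, "b" * 400, "c" * 4000]
selected = pack_context(docs, budget=250)  # keeps the first two docs
```

With a 1M-token budget the same loop rarely has to drop anything, which is exactly why long context reduces fragmentation in multi-step workflows.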

Together, this makes Super a strong choice for orchestrating and running many agents on a single node, from code generation and debugging to research summarization, alert triage, and document analysis.

## Get Started

Nemotron 3 Super helps you build scalable, cost-efficient multi-agent AI with high accuracy. With open weights, datasets, and recipes, you get full transparency and the flexibility to fine-tune and deploy on your own infrastructure, from workstation to cloud.

Ready to run multi-agent AI at scale?

- Download Nemotron 3 Super model weights from Hugging Face: [BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16), [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), and [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
- Run with SGLang for inference using the [cookbook](https://cookbook.sglang.io/autoregressive/NVIDIA/Nemotron3-Super) or the [Brev launchable](https://brev.nvidia.com/launchable/deploy?launchableID=env-39d03Y3mDAiGuIrnHKmZwv0tA4s)
- Read the Nemotron 3 Super [technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)

## Acknowledgement

Thanks to everyone who contributed to bringing Nemotron 3 Super to SGLang.

**NVIDIA**: Nirmal Kumar Juluru, Anusha Pant, Max Xu, Daniel Afrimi, Shahar Mor, Roi Koren, Ann Guan, and many more

**SGLang team and community**: Baizhou Zhang, Jiajun Li, Ke Bao, Lingyan Hao, Mingyi Lu
