---
title: "SGLang Adds Day-0 Support for NVIDIA Nemotron 3 Super for Building High-Efficiency Multi-Agent Systems"
author: "NVIDIA Nemotron Team"
date: "March 11, 2026"
previewImg: /images/blog/nemotron-3-super/figure_1.svg
---

We are excited to announce that SGLang supports NVIDIA Nemotron 3 Super on Day 0.

[Nemotron 3 Super](https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/) is a leading open model in the Nemotron 3 family, built for running many collaborating agents together. Agentic systems that chain planning, reasoning, and tool use produce far more tokens than single-turn chat, and they need strong reasoning at every step.

Nemotron 3 Super is a 120B-parameter hybrid MoE that activates only 12B parameters per forward pass, delivering leading accuracy for coding, tool calling, and instruction following at a fraction of the cost, plus a 1M-token context so agents keep conversation and plan state in view across long workflows.

![Artificial Analysis chart: intelligence vs. openness](/images/blog/nemotron-3-super/figure_1.svg)

<small><center>[Artificial Analysis](https://artificialanalysis.ai/) chart showing Nemotron 3 Super leading on intelligence vs. openness when compared to popular open models of similar size</center></small>

As the chart above shows, Nemotron 3 Super leads on the Artificial Analysis Openness Index. Compared to other open models, Nemotron is fully open, with open weights, datasets, and recipes, so developers can easily customize, optimize, and deploy it on their own infrastructure for maximum privacy and security.

In this post we walk through installing SGLang and serving Nemotron 3 Super for inference.

## About Nemotron 3 Super

- **Architecture**: Mixture of Experts (MoE) with a hybrid Transformer-Mamba architecture
  - Highest throughput efficiency in its size category, and up to 5x higher throughput than the previous Nemotron Super model (Llama Nemotron Super 1.5)
  - Multi-Token Prediction (MTP): by predicting several future tokens simultaneously in a single forward pass, MTP drastically accelerates the generation of long-form text
  - Supports a thinking budget for optimal accuracy with minimal reasoning-token generation
- **Accuracy**: Leading accuracy on the Artificial Analysis Intelligence Index in its size category
  - Up to 2x higher accuracy on the Artificial Analysis Intelligence Index compared to the previous Nemotron Super model
  - Latent MoE enables calling 4 experts for the inference cost of only one
- **Model size**: 120B total parameters, 12B active parameters
- **Context length**: up to 1M tokens
- **Model I/O**: Text in, text out
- **Supported GPUs**: B200, H100, H200, DGX Spark, RTX 6000
- **Get started**:
  - Download model weights from Hugging Face: [BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16), [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8) and [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
  - [Run with SGLang for inference](https://cookbook.sglang.io/autoregressive/NVIDIA/Nemotron3-Super)
  - Read the [technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf) to build custom, optimized models with Nemotron techniques

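The MTP bullet above can be pictured with a toy draft-and-verify loop. This is only an illustrative sketch of the general idea behind multi-token generation schemes, not Nemotron's actual implementation, and `accept_prefix` is a hypothetical helper:

```python
def accept_prefix(draft: list[int], verified: list[int]) -> list[int]:
    """Keep the longest prefix of drafted tokens that the verifier agrees with.

    Toy illustration only: real multi-token schemes check all drafted tokens
    in a single forward pass, which is where the speedup comes from.
    """
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted

# If 4 tokens are drafted and the first 2 match the verifier's choices,
# this step emits 2 tokens for the price of one verification pass.
print(accept_prefix([5, 7, 9, 2], [5, 7, 1, 2]))  # -> [5, 7]
```

The more drafted tokens survive verification, the fewer sequential forward passes long-form generation needs.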
## Installation and Quick Start

For an easier setup with SGLang, refer to our getting-started cookbook, available [here](https://cookbook.sglang.io/autoregressive/NVIDIA/Nemotron3-Super) or through the NVIDIA Brev [launchable](https://brev.nvidia.com/launchable/deploy?launchableID=env-39d03Y3mDAiGuIrnHKmZwv0tA4s).

Run the command below to install dependencies:
```bash
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'
```
We can then serve the model. The command below is configured for a 4xH200 setup; refer to the cookbook for detailed instructions.
```bash
# BF16
python3 -m sglang.launch_server \
    --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    --host 0.0.0.0 \
    --port 5000 \
    --trust-remote-code \
    --tp 4 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nemotron_3
```
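Loading 120B parameters can take a while, so it is handy to wait for readiness before sending requests. A minimal polling sketch, assuming the server's `/health` endpoint:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll the server's /health endpoint until it responds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not up yet; retry after a short pause.
        time.sleep(interval)
    return False


# For a real deployment use a generous timeout, e.g. 600 seconds.
if wait_for_server("http://localhost:5000", timeout=1.0, interval=0.5):
    print("server is ready")
else:
    print("server is not reachable yet")
```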
Once the server is up and running, you can prompt the model using the code snippet below:

```python
from openai import OpenAI

# The model name we used when launching the server.
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"

BASE_URL = "http://localhost:5000/v1"
API_KEY = "EMPTY"  # The SGLang server doesn't require an API key by default.

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."},
    ],
    temperature=0.6,
    max_tokens=512,
)
print("Reasoning:", resp.choices[0].message.reasoning_content)
print("Content:", resp.choices[0].message.content)
```
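Since the server is launched with `--tool-call-parser qwen3_coder`, the same client can do OpenAI-style tool calling. In the sketch below, `get_weather` is a hypothetical tool, and `simulated` merely mimics the shape of an assistant message carrying `tool_calls`, so the parsing logic can be shown without a live server:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema;
# with a live server you would pass tools=tools to
# client.chat.completions.create(...) and read resp.choices[0].message.tool_calls.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def extract_tool_calls(message: dict) -> list[tuple[str, dict]]:
    """Return (name, parsed-arguments) pairs from an OpenAI-style assistant message dict."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Simulated assistant message, shaped like a tool-call response.
simulated = {
    "tool_calls": [
        {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
    ]
}
print(extract_tool_calls(simulated))  # [('get_weather', {'city': 'Paris'})]
```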

## Nemotron 3 Super is ideal for multi-agent and reasoning workloads

<small><center>[Artificial Analysis](https://artificialanalysis.ai/) chart showing Nemotron 3 Super leading on intelligence vs. efficiency when compared to popular open models of similar size</center></small>

As the chart above shows, the model achieves leading accuracy with higher efficiency on Artificial Analysis benchmarks, making it a strong choice for multi-agent systems that need both efficiency and capability.

The 1M-token context is built for long-horizon agent work: agents can keep full conversation history and plan state in context, and RAG pipelines can supply large document sets in one shot. That reduces fragmentation and goal drift in multi-step workflows.
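A minimal sketch of what keeping plan state in context looks like in client code: one growing message list shared across agent steps, with `reply_fn` standing in for a call to the served model (the names here are illustrative, not part of any SGLang API):

```python
def make_agent(system_prompt: str):
    """Return a step function that accumulates full history across turns."""
    history = [{"role": "system", "content": system_prompt}]

    def step(user_msg: str, reply_fn) -> str:
        # With a 1M-token context there is no need to truncate old turns;
        # the full plan and conversation state stay in view.
        history.append({"role": "user", "content": user_msg})
        reply = reply_fn(history)  # e.g. wraps client.chat.completions.create
        history.append({"role": "assistant", "content": reply})
        return reply

    return step, history


# Usage with a stub model so the sketch runs without a server.
step, history = make_agent("You are a planning agent.")
step("Draft a plan.", lambda msgs: f"plan based on {len(msgs)} messages")
step("Refine step 2.", lambda msgs: f"refined with {len(msgs)} messages")
print(len(history))  # 5: one system turn plus two user/assistant pairs
```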

Together, this makes Super a strong choice for orchestrating and running many agents on a single node, from code generation and debugging to research summarization, alert triage, and document analysis.

## Get Started

Nemotron 3 Super helps you build scalable, cost-efficient multi-agent AI with high accuracy. With open weights, datasets, and recipes, you get full transparency and the flexibility to fine-tune and deploy on your own infrastructure, from workstation to cloud.

Ready to run multi-agent AI at scale?
- Download Nemotron 3 Super model weights from Hugging Face: [BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16), [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8) and [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
- Run with SGLang for inference using the [cookbook](https://cookbook.sglang.io/autoregressive/NVIDIA/Nemotron3-Super) or through the [Brev launchable](https://brev.nvidia.com/launchable/deploy?launchableID=env-39d03Y3mDAiGuIrnHKmZwv0tA4s)
- Read the Nemotron 3 Super [technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)

## Acknowledgement

Thanks to everyone who contributed to bringing Nemotron 3 Super to SGLang.

**NVIDIA**: Nirmal Kumar Juluru, Anusha Pant, Max Xu, Daniel Afrimi, Shahar Mor, Roi Koren, Ann Guan and many more

**SGLang team and community**: Baizhou Zhang, Jiajun Li, Ke Bao, Lingyan Hao, Mingyi Lu