
Commit d92734b

Authored by Faradawn Yang
Sglang nvidia gb300 blog fix images link (#319)
* add gb300 nvl72 blog
* update image 1
* add hyperlinks
* add paragraph to explain 25x perf
* add link to prior blog
* update ack
* Fix broken image links in GB300 blog

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
1 parent 6b11da9 commit d92734b

1 file changed: blog/2026-02-20-gb300-inferencex.md

Lines changed: 0 additions & 6 deletions

@@ -42,16 +42,10 @@ To fully exploit the capabilities of Blackwell Ultra on GB300 NVL72, SGLang inco
 
 **NVFP4 GEMM for MoE and dense layers.** Using [NVFP4 precision](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/) for MoE experts and other GEMMs reduces memory bandwidth pressure, taps into the higher FP4 Tensor Core throughput on Blackwell Ultra, and halves communication traffic for token dispatch. This shrinks weights in memory, freeing capacity for a larger KV cache and enabling higher concurrency.
 
-<img src="/images/blog/gb300_inferencex/overlap_scheduling.png"
-     style="display: block; margin: 20px auto 0; width: 75%; max-width: 100%; height: auto;">
-
 **Computation–communication overlap.** Instead of relying on traditional Two-Batch overlapping (TBO), we adopt a single-batch overlap strategy tuned to the higher interconnect bandwidth of NVL72. In practice, this allows combining communication to run concurrently with down-GEMM computation in a producer–consumer pattern, while overlapping shared-expert computation on an additional CUDA stream to minimize idle time.
 
 **NVIDIA Dynamo for disaggregated inference.** For prefill–decode disaggregation, we integrate with [NVIDIA Dynamo](https://www.nvidia.com/en-us/ai/dynamo/), an open-source distributed inference serving engine. Dynamo's modular design makes it possible to deeply couple its KV-aware router with SGLang's HiCache radix tree, while exposing flexible KV cache transfer backends such as NIXL and Mooncake to match different deployment scenarios.
 
-<img src="/images/blog/gb300_inferencex/dynamo_integration.png"
-     style="display: block; margin: 20px auto 0; width: 75%; max-width: 100%; height: auto;">
-
 Together, these optimizations align the inference software stack with the characteristics of Blackwell Ultra, driving higher utilization and turning its raw hardware capability into delivered throughput.
 
 ## **8x More Performance on GB200 NVL72**
