NVIDIA · chloecrozier · Jun 12, 2026 · Jun 12, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -32,7 +32,7 @@ There is no unit test suite. Verification is done via the benchmark executables
 
 | Executable | Source | Typical config |
 |---|---|---|
-| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml`, `daqiri_bench_raw_tx_rx_spark_mq.yaml` (mq base; `run_spark_mq_bench.sh` derives the 4 cells via `scripts/gen_spark_mq_config.py`) |
+| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml`, `daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml`, `daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml`, `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml`, `daqiri_bench_raw_rx_multi_q.yaml`, `daqiri_bench_raw_tx_rx_spark_mq.yaml` (mq base; `run_spark_mq_bench.sh` derives the 4 cells via `scripts/gen_spark_mq_config.py`) |
 | `daqiri_bench_raw_hds` | `raw_hds_bench.cpp` | `daqiri_bench_raw_tx_rx_hds.yaml` |
 | `daqiri_bench_raw_reorder_seq` | `raw_reorder_seq_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_seq_1024*.yaml`, `daqiri_bench_raw_rx_reorder_seq_*.yaml` |
 | `daqiri_bench_raw_reorder_quantize` | `raw_reorder_quantize_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml` |

diff --git a/docs/benchmarks/raw_benchmarking.md b/docs/benchmarks/raw_benchmarking.md
@@ -59,6 +59,26 @@ docker run --rm -it --privileged \
 
     The Spark configs also pin the benchmark application's `bench_tx.cpu_core` / `bench_rx.cpu_core` fields to the high-frequency Cortex-X925 cores. Keep both the DAQIRI queue cores and the application worker cores on cores 16-19 unless you intentionally want a lower-power core in the measurement.
 
+!!! tip "RTX PRO 6000 Blackwell (x86_64 workstation / server)"
+
+    For discrete Blackwell RTX PRO 6000 systems, build with [`CMAKE_CUDA_ARCHITECTURES=120`](../tutorials/bare-metal-cmake-build.md) and use these configs:
+
+    - [`daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml) — software loopback, no NIC required. Validates the GPUDirect build path; throughput is not wire-rate. See measured numbers in [`examples/rtx_pro_6000_baseline.md`](https://github.com/nvidia/daqiri/blob/main/examples/rtx_pro_6000_baseline.md).
+    - [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml) — prefilled dual-port NIC run (dev-box PCIe BDFs; fill `eth_dst_addr` from the rx_port MAC). Port 0 TX on GPU CUDA 0, port 1 RX on GPU CUDA 1. Requires an L2 link between the two ports (QSFP cable, passive loopback optic, or switch). `carrier=1` on both ports does not guarantee they are looped to each other.
+    - [`daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml) — generic `<placeholder>` template for cross-card or custom topology (800 Gbps target once cabled).
+    - [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml) — experimental same-port TX+RX; failed `daqiri_init` on the reference box.
+
+    Unlike DGX Spark, typical RTX PRO servers expose one PF per physical port, so there is no on-chip eswitch shortcut between two PFs on the same port without a link. After a NIC run, confirm whether traffic crossed the wire using `tx_phy_packets` / `rx_phy_packets` in the DPDK extended stats (near zero = on-chip or no wire loop; rising with vport counts = over-the-wire). Full constraints and baseline results: [`rtx_pro_6000_baseline.md`](https://github.com/nvidia/daqiri/blob/main/examples/rtx_pro_6000_baseline.md).
+
+    ```bash
+    sudo ./daqiri_bench_raw_gpudirect \
+      ./examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml --seconds 30
+
+    # Once p0 and p1 are cabled:
+    sudo ./daqiri_bench_raw_gpudirect \
+      ./examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml --seconds 30
+    ```
+
 #### Cross-host two-DGX-Spark loopback
 
 If you have two DGX Sparks cross-cabled p0↔p0 instead of a chassis QSFP loop on one machine, use the `_xhost` configs. Each host runs only its own role, so the YAML on each side configures one port instead of two. Both hosts must already be set up per the [DGX Spark profile](../tutorials/system_configuration.md#dgx-spark-profile), with one adjustment: the `daqiri-tx` (`1.1.1.1/24`) and `daqiri-rx` (`2.2.2.2/24`) nmcli profiles are *split across* the two hosts — bring up `daqiri-tx` on the TX host's p0 and `daqiri-rx` on the RX host's p0, instead of both on one box.

diff --git a/docs/tutorials/configuration-walkthrough.md b/docs/tutorials/configuration-walkthrough.md
@@ -32,6 +32,10 @@ For a shorter selection guide, start with the [Benchmarking overview](../benchma
     - **DGX Spark multi-queue core-scaling matrix** (prefilled) — one base config [`daqiri_bench_raw_tx_rx_spark_mq.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark_mq.yaml) (the balanced TX=2/RX=2 superset; cores TX → 16,17, RX → 18,19) from which `examples/run_spark_mq_bench.sh` (via `scripts/gen_spark_mq_config.py`) derives the four `(TX, RX)` cells — (1,1), (1,2) (RX scaling), (2,1) (TX scaling), (2,2) (balanced) — by pruning queues/flows. All run on `daqiri_bench_raw_gpudirect` at the native 8 KB shape.
     - **DGX Spark cross-host** (prefilled, runs on two Sparks) — [`daqiri_bench_raw_tx_spark_xhost.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_spark_xhost.yaml) on the TX host and [`daqiri_bench_raw_rx_spark_xhost.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_rx_spark_xhost.yaml) on the RX host. Each host runs `daqiri_bench_raw_gpudirect` against its own half; cables connect p0↔p0 between the two boxes. See the [Cross-host two-DGX-Spark loopback](../benchmarks/raw_benchmarking.md#cross-host-two-dgx-spark-loopback) section for run details.
     - **No physical NIC available** — [`daqiri_bench_raw_sw_loopback.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback.yaml). `loopback: "sw"`, no NIC required. Useful for first-time build verification, not representative of production performance.
+    - **RTX PRO 6000 Blackwell — no cable** — [`daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml). `kind: device`, `affinity: 0`; build with [`CMAKE_CUDA_ARCHITECTURES=120`](../tutorials/bare-metal-cmake-build.md). SW loopback smoke test only.
+    - **RTX PRO 6000 Blackwell — real NIC, dual-port on one card** (prefilled dev box) — [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml). `61:00.0` p0 → `61:00.1` p1, GPU 0 TX / GPU 1 RX; needs L2 link between ports (not SW loopback). See [`rtx_pro_6000_baseline.md`](https://github.com/nvidia/daqiri/blob/main/examples/rtx_pro_6000_baseline.md) for hardware limits and measured baseline.
+    - **RTX PRO 6000 Blackwell — same-PF NIC attempt** (experimental) — [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml). Single port `61:00.0` TX+RX; failed `daqiri_init` on the reference box and is kept for follow-up.
+    - **RTX PRO 6000 Blackwell — dual-NIC loopback** (generic template) — [`daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml). Placeholders for the 800 Gbps cross-card target; fill in the PCIe BDFs, destination MAC, and IP addresses before running.
 
     To watch the same raw loopback benchmark with live Prometheus and Grafana
     counters, use the Grafana compose stack described in

diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt
@@ -34,6 +34,10 @@ set(DAQIRI_BENCH_CONFIGS
   daqiri_bench_raw_rx_reorder_seq_batch.yaml
   daqiri_bench_raw_rx_multi_q.yaml
   daqiri_bench_raw_sw_loopback.yaml
+  daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml
+  daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml
+  daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml
+  daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml
   daqiri_example_gds_write_sw_loopback.yaml
   daqiri_example_gds_write_tx_rx.yaml
   daqiri_example_pcap_writer_sw_loopback.yaml

diff --git a/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml b/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml
@@ -0,0 +1,77 @@
+# RTX PRO 6000 Blackwell (discrete dGPU) software-loopback smoke test.
+# No NIC or cable required — validates build + GPUDirect on one GPU.
+# Not representative of wire-speed performance; use the hardware template
+# (daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml) once a QSFP loopback cable is installed.
+#
+# Build (native sm_120):
+#   cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=OFF \
+#     -DDAQIRI_MGR="dpdk socket rdma" -DCMAKE_CUDA_ARCHITECTURES=120
+#   cmake --build build -j
+#
+# Run:
+#   ./build/examples/daqiri_bench_raw_gpudirect \
+#     ./build/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml --seconds 30
+#
+# memory_regions[].affinity is the CUDA device index (not nvidia-smi GPU id).
+# Change affinity to target a different GPU, e.g. affinity: 1 for CUDA device 1.
+#
+%YAML 1.2
+---
+daqiri:
+  cfg:
+    version: 1
+    stream_type: "raw"
+    master_core: 3
+    debug: false
+    log_level: "info"
+    loopback: "sw"
+
+    memory_regions:
+    - name: "Data_TX_GPU"
+      kind: "device"
+      affinity: 0
+      num_bufs: 51200
+      buf_size: 8064
+    - name: "Data_RX_GPU"
+      kind: "device"
+      affinity: 0
+      num_bufs: 51200
+      buf_size: 8064
+
+    interfaces:
+    - name: "loopback_ports"
+      address: "loopback"
+      tx:
+        queues:
+        - name: "tx_q_0"
+          id: 0
+          batch_size: 10240
+          cpu_core: 11
+          timeout_us: 1000
+          memory_regions:
+            - "Data_TX_GPU"
+          offloads:
+            - "tx_eth_src"
+      rx:
+        queues:
+        - name: "rq_q_0"
+          id: 0
+          cpu_core: 9
+          timeout_us: 1000
+          batch_size: 10240
+          memory_regions:
+            - "Data_RX_GPU"
+
+bench_rx:
+  interface_name: "loopback_ports"
+
+bench_tx:
+  interface_name: "loopback_ports"
+  batch_size: 10240
+  payload_size: 8000
+  header_size: 64
+  eth_dst_addr: 00:00:00:00:00:00
+  ip_src_addr: 0.0.0.0
+  ip_dst_addr: 0.0.0.0
+  udp_src_port: 4096
+  udp_dst_port: 4096
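
A note on the buffer sizing in the config above: `buf_size` (8064) equals `payload_size` + `header_size` (8000 + 64), and the same numbers recur in the other three new configs. The Python sketch below assumes a buffer has to hold one whole packet. That constraint is inferred from these values and is not stated anywhere in this diff. The helper names `check_buffer_fits` and `region_bytes` are hypothetical and not part of DAQIRI.

```python
def check_buffer_fits(buf_size, payload_size, header_size):
    """Return spare bytes per buffer; raise if one packet would not fit."""
    packet_size = payload_size + header_size
    if buf_size < packet_size:
        raise ValueError(
            f"buf_size {buf_size} is smaller than packet size {packet_size}"
        )
    return buf_size - packet_size


def region_bytes(num_bufs, buf_size):
    """Total bytes one memory region asks for (ignores any allocator padding)."""
    return num_bufs * buf_size


# Values copied from daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml.
spare = check_buffer_fits(buf_size=8064, payload_size=8000, header_size=64)
total = region_bytes(num_bufs=51200, buf_size=8064)
print(f"spare bytes per buffer: {spare}")
print(f"bytes per region: {total} ({total / 2**20:.2f} MiB)")
```

With these values there is no slack per buffer, and each of the two regions requests 412,876,800 bytes (393.75 MiB) of GPU memory, so raising `payload_size` would also require raising `buf_size` under the stated assumption.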
diff --git a/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml b/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml
@@ -0,0 +1,93 @@
+# RTX PRO 6000 Blackwell dual-NIC hardware loopback TEMPLATE (generic placeholders).
+# For a prefilled single-card dual-port run on the reference dev box, see
+# daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml instead.
+#
+# Requires L2 connectivity between tx_port and rx_port (QSFP cable, loopback optic, or switch).
+# NOT runnable until <placeholders> are replaced for your system.
+#
+# Target topology (800 Gbps aggregate goal: two cards or two ports at line rate):
+#   GPU 0 (CUDA affinity 0) --TX buffers--> NIC0 --link--> NIC1 --RX buffers--> GPU 1
+#
+# Hardware limits:
+#   - 800 Gbps aggregate needs two ~400G ports with an active link each; no cable = no wire test.
+#   - This server lacks Spark's dual-PF-per-port on-chip eswitch shortcut (see baseline doc).
+#   - SW loopback (daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml) never touches the NIC.
+#
+# Discovery helpers:
+#   lspci -d 15b3:
+#   ibdev2netdev
+#   cat /sys/class/net/<iface>/address   # eth_dst_addr = rx_port MAC
+#   nvidia-smi topo -m                   # NUMA / PCIe proximity for cpu_core picks
+#
+# Build: same as daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml header.
+#
+%YAML 1.2
+---
+daqiri:
+  cfg:
+    version: 1
+    stream_type: "raw"
+    master_core: 3
+    debug: false
+    log_level: "info"
+    loopback: ""
+
+    memory_regions:
+    - name: "Data_TX_GPU"
+      kind: "device"
+      affinity: 0
+      num_bufs: 51200
+      buf_size: 8064
+    - name: "Data_RX_GPU"
+      kind: "device"
+      affinity: 1
+      num_bufs: 51200
+      buf_size: 8064
+
+    interfaces:
+    - name: "tx_port"
+      address: <0000:00:00.0>
+      tx:
+        queues:
+        - name: "tx_q_0"
+          id: 0
+          batch_size: 10240
+          cpu_core: 11
+          memory_regions:
+            - "Data_TX_GPU"
+          offloads:
+            - "tx_eth_src"
+    - name: "rx_port"
+      address: <0000:00:00.1>
+      rx:
+        flow_isolation: true
+        queues:
+        - name: "rq_q_0"
+          id: 0
+          cpu_core: 9
+          batch_size: 10240
+          memory_regions:
+            - "Data_RX_GPU"
+        flows:
+        - name: "flow_0"
+          id: 0
+          action:
+            type: queue
+            id: 0
+          match:
+            udp_src: 4096
+            udp_dst: 4096
+
+bench_rx:
+- interface_name: "rx_port"
+
+bench_tx:
+- interface_name: "tx_port"
+  batch_size: 10240
+  payload_size: 8000
+  header_size: 64
+  eth_dst_addr: <00:00:00:00:00:00>
+  ip_src_addr: <1.2.3.4>
+  ip_dst_addr: <5.6.7.8>
+  udp_src_port: 4096
+  udp_dst_port: 4096
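
The template above is marked as not runnable until its `<placeholders>` are replaced. A minimal Python sketch of a pre-flight check follows. It assumes placeholders always appear as a whole value wrapped in angle brackets on a `key: <value>` line, as they do in this file. The function name `find_placeholders` is hypothetical, and the scan is a plain-text regex rather than a YAML parse.

```python
import re

# Matches a "key: <value>" line whose value is still an angle-bracket
# placeholder, e.g. "address: <0000:00:00.0>". An optional leading "- "
# covers list items. Comment lines are skipped by the caller.
PLACEHOLDER = re.compile(r"^\s*(?:-\s*)?([A-Za-z_]\w*):\s*(<[^<>]+>)\s*$")


def find_placeholders(yaml_text):
    """Return (line_number, key, placeholder) for every unfilled value."""
    hits = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue
        match = PLACEHOLDER.match(line)
        if match:
            hits.append((lineno, match.group(1), match.group(2)))
    return hits


# Small excerpt in the shape of the template, used only for the demo.
sample = """\
interfaces:
- name: "tx_port"
  address: <0000:00:00.0>
bench_tx:
- interface_name: "tx_port"
  eth_dst_addr: <00:00:00:00:00:00>
  ip_src_addr: 1.2.3.4
"""

for lineno, key, value in find_placeholders(sample):
    print(f"line {lineno}: {key} still set to {value}")
```

Running the same function over the full template text should flag both `address` fields, `eth_dst_addr`, and the two IP fields. An empty result is a reasonable gate before launching the benchmark.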
diff --git a/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml b/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml
@@ -0,0 +1,93 @@
+# RTX PRO 6000 Blackwell: real NIC loopback on the reference dev box. p0 and p1 must be
+# linked to each other (see limitations below). Uses one ConnectX-7 / BF-3 dual-port card:
+#   tx_port 0000:61:00.0 (ens1f0np0, p0)  GPU CUDA 0
+#   rx_port 0000:61:00.1 (ens1f1np1, p1)  GPU CUDA 1
+#
+# Hardware limitations (read before comparing numbers):
+#   - NOT the 800 Gbps dual-card test. This is one ASIC, two ports (~400G class each).
+#   - Requires L2 connectivity between p0 and p1 (QSFP cable, passive loopback optic, or
+#     switch). carrier=0 on either port means no traffic; carrier=1 on both is not proof of a p0<->p1 loop.
+#   - Unlike DGX Spark, this server has one PF per physical port — no on-chip eswitch
+#     loopback between PFs on the same port without a link.
+#   - SW loopback (daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml) does not use the NIC.
+#
+# Verify link: cat /sys/class/net/ens1f{0,1}np*/carrier
+# eth_dst_addr: rx_port MAC — cat /sys/class/net/ens1f1np1/address on this dev box
+# Verify wire vs on-chip after a run: tx_phy_packets / rx_phy_packets near zero = on-chip;
+#   rising with TX/RX = packets crossed the SerDes.
+#
+# Build: cmake ... -DCMAKE_CUDA_ARCHITECTURES=120 && cmake --build build -j
+# Run:
+#   sudo ./build/examples/daqiri_bench_raw_gpudirect \
+#     ./build/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml --seconds 30
+#
+%YAML 1.2
+---
+daqiri:
+  cfg:
+    version: 1
+    stream_type: "raw"
+    master_core: 3
+    debug: false
+    log_level: "info"
+    loopback: ""
+
+    memory_regions:
+    - name: "Data_TX_GPU"
+      kind: "device"
+      affinity: 0
+      num_bufs: 51200
+      buf_size: 8064
+    - name: "Data_RX_GPU"
+      kind: "device"
+      affinity: 1
+      num_bufs: 51200
+      buf_size: 8064
+
+    interfaces:
+    - name: "tx_port"
+      address: 0000:61:00.0
+      tx:
+        queues:
+        - name: "tx_q_0"
+          id: 0
+          batch_size: 10240
+          cpu_core: 11
+          memory_regions:
+            - "Data_TX_GPU"
+          offloads:
+            - "tx_eth_src"
+    - name: "rx_port"
+      address: 0000:61:00.1
+      rx:
+        flow_isolation: true
+        queues:
+        - name: "rq_q_0"
+          id: 0
+          cpu_core: 9
+          batch_size: 10240
+          memory_regions:
+            - "Data_RX_GPU"
+        flows:
+        - name: "flow_0"
+          id: 0
+          action:
+            type: queue
+            id: 0
+          match:
+            udp_src: 4096
+            udp_dst: 4096
+
+bench_rx:
+- interface_name: "rx_port"
+
+bench_tx:
+- interface_name: "tx_port"
+  batch_size: 10240
+  payload_size: 8000
+  header_size: 64
+  eth_dst_addr: <00:00:00:00:00:00>
+  ip_src_addr: 1.2.3.4
+  ip_dst_addr: 5.6.7.8
+  udp_src_port: 4096
+  udp_dst_port: 4096
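
The header of the config above describes a post-run check: `tx_phy_packets` / `rx_phy_packets` near zero means on-chip, while counters rising with TX/RX mean the packets crossed the SerDes. The Python sketch below turns that rule of thumb into a small classifier over counter snapshots taken before and after a run. How the snapshots are collected is left open. The function names and the 90% / 10% thresholds are assumptions for illustration, not DAQIRI values.

```python
def counter_deltas(before, after):
    """Per-counter difference between two snapshots (dicts of name -> value)."""
    return {name: after[name] - before[name] for name in before}


def classify_path(tx_phy_delta, rx_phy_delta, packets_sent, wire_fraction=0.9):
    """Classify a run from physical-port counter deltas.

    tx_phy_delta comes from the TX port, rx_phy_delta from the RX port.
    wire_fraction is a hypothetical threshold, not a DAQIRI setting.
    """
    if packets_sent <= 0:
        return "no traffic"
    if min(tx_phy_delta, rx_phy_delta) >= wire_fraction * packets_sent:
        return "over the wire"
    if max(tx_phy_delta, rx_phy_delta) <= (1 - wire_fraction) * packets_sent:
        return "on-chip or no wire loop"
    # For example TX left the port but little arrived on the RX port.
    return "inconclusive"


# Made-up snapshots: tx_phy_packets read on tx_port, rx_phy_packets on rx_port.
before = {"tx_phy_packets": 1_000, "rx_phy_packets": 1_000}
after = {"tx_phy_packets": 1_001_000, "rx_phy_packets": 1_000_990}
deltas = counter_deltas(before, after)
verdict = classify_path(
    deltas["tx_phy_packets"], deltas["rx_phy_packets"], packets_sent=1_000_000
)
print(verdict)
```

The "inconclusive" branch covers the case where both ports report carrier but are not actually looped to each other: the TX physical counter rises while the RX one stays flat.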
diff --git a/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml b/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml
@@ -0,0 +1,75 @@
+# EXPERIMENTAL: same-PF TX+RX on 0000:61:00.0 (ens1f0np0). Failed daqiri_init on the reference box.
+# Intended as an alternative when p0<->p1 are not cabled. eth_dst_addr is this port's own MAC
+# (hairpin / L2 loopback if supported): cat /sys/class/net/<iface>/address
+# See daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml for the dual-port p0->p1 config.
+#
+%YAML 1.2
+---
+daqiri:
+  cfg:
+    version: 1
+    stream_type: "raw"
+    master_core: 3
+    debug: false
+    log_level: "info"
+    loopback: ""
+
+    memory_regions:
+    - name: "Data_TX_GPU"
+      kind: "device"
+      affinity: 0
+      num_bufs: 51200
+      buf_size: 8064
+    - name: "Data_RX_GPU"
+      kind: "device"
+      affinity: 0
+      num_bufs: 51200
+      buf_size: 8064
+
+    interfaces:
+    - name: "tx_port"
+      address: 0000:61:00.0
+      tx:
+        queues:
+        - name: "tx_q_0"
+          id: 0
+          batch_size: 10240
+          cpu_core: 11
+          memory_regions:
+            - "Data_TX_GPU"
+          offloads:
+            - "tx_eth_src"
+    - name: "rx_port"
+      address: 0000:61:00.0
+      rx:
+        flow_isolation: true
+        queues:
+        - name: "rq_q_0"
+          id: 0
+          cpu_core: 9
+          batch_size: 10240
+          memory_regions:
+            - "Data_RX_GPU"
+        flows:
+        - name: "flow_0"
+          id: 0
+          action:
+            type: queue
+            id: 0
+          match:
+            udp_src: 4096
+            udp_dst: 4096
+
+bench_rx:
+- interface_name: "rx_port"
+
+bench_tx:
+- interface_name: "tx_port"
+  batch_size: 10240
+  payload_size: 8000
+  header_size: 64
+  eth_dst_addr: <00:00:00:00:00:00>
+  ip_src_addr: 1.2.3.4
+  ip_dst_addr: 5.6.7.8
+  udp_src_port: 4096
+  udp_dst_port: 4096
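
Both NIC configs leave `eth_dst_addr` as a placeholder, and their header comments point at `/sys/class/net/<iface>/address` and `/sys/class/net/<iface>/carrier`. The Python sketch below reads both files in one call. The function name `read_port_state` and its `sysfs_root` parameter are hypothetical. The demo runs against a temporary directory with a made-up MAC so it works on any machine.

```python
import tempfile
from pathlib import Path


def read_port_state(iface, sysfs_root="/sys/class/net"):
    """Return (mac, carrier_up) for a network interface from sysfs."""
    base = Path(sysfs_root) / iface
    mac = (base / "address").read_text().strip()
    try:
        carrier_up = (base / "carrier").read_text().strip() == "1"
    except OSError:
        # Reading carrier can fail, for example while the interface is
        # administratively down; treat that as "no link".
        carrier_up = False
    return mac, carrier_up


# Demo against a throwaway directory so the sketch runs on any machine.
# On the target host, call read_port_state("<iface>") with the default root.
with tempfile.TemporaryDirectory() as root:
    port = Path(root) / "ens1f0np0"
    port.mkdir()
    (port / "address").write_text("aa:bb:cc:dd:ee:ff\n")  # made-up MAC
    (port / "carrier").write_text("1\n")
    mac, carrier_up = read_port_state("ens1f0np0", sysfs_root=root)
    print(f"eth_dst_addr: {mac}  carrier_up: {carrier_up}")
```

A `carrier_up` of `True` only shows that the port sees a link partner. It does not show that the partner is the other port on the same card.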