2 changes: 1 addition & 1 deletion AGENTS.md
@@ -32,7 +32,7 @@ There is no unit test suite. Verification is done via the benchmark executables

| Executable | Source | Typical config |
|---|---|---|
| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml`, `daqiri_bench_raw_tx_rx_spark_mq.yaml` (mq base; `run_spark_mq_bench.sh` derives the 4 cells via `scripts/gen_spark_mq_config.py`) |
| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml`, `daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml`, `daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml`, `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml`, `daqiri_bench_raw_rx_multi_q.yaml`, `daqiri_bench_raw_tx_rx_spark_mq.yaml` (mq base; `run_spark_mq_bench.sh` derives the 4 cells via `scripts/gen_spark_mq_config.py`) |
| `daqiri_bench_raw_hds` | `raw_hds_bench.cpp` | `daqiri_bench_raw_tx_rx_hds.yaml` |
| `daqiri_bench_raw_reorder_seq` | `raw_reorder_seq_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_seq_1024*.yaml`, `daqiri_bench_raw_rx_reorder_seq_*.yaml` |
| `daqiri_bench_raw_reorder_quantize` | `raw_reorder_quantize_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml` |
20 changes: 20 additions & 0 deletions docs/benchmarks/raw_benchmarking.md
@@ -59,6 +59,26 @@ docker run --rm -it --privileged \

The Spark configs also pin the benchmark application's `bench_tx.cpu_core` / `bench_rx.cpu_core` fields to the high-frequency Cortex-X925 cores. Keep both the DAQIRI queue cores and the application worker cores on cores 16-19 unless you intentionally want a lower-power core in the measurement.

!!! tip "RTX PRO 6000 Blackwell (x86_64 workstation / server)"

For discrete Blackwell RTX PRO 6000 systems, build with [`CMAKE_CUDA_ARCHITECTURES=120`](../tutorials/bare-metal-cmake-build.md) and use these configs:

- [`daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml) — software loopback, no NIC required. Validates the GPUDirect build path; throughput is not wire-rate. See measured numbers in [`examples/rtx_pro_6000_baseline.md`](https://github.com/nvidia/daqiri/blob/main/examples/rtx_pro_6000_baseline.md).
- [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml) — prefilled dual-port NIC run (dev-box PCIe BDFs; fill `eth_dst_addr` from the rx_port MAC). Port 0 TX on GPU CUDA 0, port 1 RX on GPU CUDA 1. Requires an L2 link between the two ports (QSFP cable, passive loopback optic, or switch). `carrier=1` on both ports does not guarantee they are looped to each other.
- [`daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml) — generic `<placeholder>` template for cross-card or custom topology (800 Gbps target once cabled).
- [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml) — experimental same-port TX+RX; failed `daqiri_init` on the reference box.
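
Before the dual-port NIC run, it can help to read each port's link state and MAC address from sysfs, since `eth_dst_addr` has to be filled from the rx_port MAC. The following is a minimal sketch, not part of DAQIRI: `read_link_state` and its `sysfs_root` parameter are hypothetical names, and the interface names are the ones quoted for the reference dev box.

```python
from pathlib import Path


def read_link_state(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Return the carrier flag and MAC address for one network interface."""
    base = Path(sysfs_root) / iface
    # "carrier" reads "1" when the port has link. Reading it can raise
    # OSError while the interface is administratively down, so treat that
    # case as "no carrier".
    try:
        carrier = (base / "carrier").read_text().strip() == "1"
    except OSError:
        carrier = False
    mac = (base / "address").read_text().strip()
    return {"iface": iface, "carrier": carrier, "mac": mac}


if __name__ == "__main__":
    # Interface names taken from the reference dev box; adjust for your host.
    for name in ("ens1f0np0", "ens1f1np1"):
        try:
            print(read_link_state(name))
        except OSError as exc:
            print(f"{name}: not readable ({exc})")
```

A `carrier` value of true on both ports only shows that each port has a link partner. It does not show that the two ports are connected to each other.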

Unlike DGX Spark, which exposes two PFs per physical port, typical RTX PRO 6000 servers expose one PF per physical port, so there is no on-chip eswitch shortcut that loops traffic without a physical link. After a NIC run, confirm whether traffic crossed the wire using `tx_phy_packets` / `rx_phy_packets` in the DPDK extended stats (near zero = on-chip or no wire loop; rising with vport counts = over-the-wire). Full constraints and baseline results: [`rtx_pro_6000_baseline.md`](https://github.com/nvidia/daqiri/blob/main/examples/rtx_pro_6000_baseline.md).
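
The counter check described above can be sketched as a small pure function over counter deltas sampled before and after a run. This is only an illustration: `classify_path`, its parameters, and both thresholds are assumptions rather than anything DAQIRI or DPDK provides, and how you collect the deltas is left to you.

```python
def classify_path(tx_phy_delta: int, rx_phy_delta: int, vport_tx_delta: int,
                  wire_ratio: float = 0.5, idle_ratio: float = 0.01) -> str:
    """Classify a run from per-run counter deltas.

    tx_phy_delta / rx_phy_delta: growth of tx_phy_packets / rx_phy_packets.
    vport_tx_delta: growth of the vport-level TX packet count for the same run.
    wire_ratio / idle_ratio: arbitrary illustrative thresholds (assumptions).
    """
    if vport_tx_delta <= 0:
        return "no-traffic"
    tx_ratio = tx_phy_delta / vport_tx_delta
    rx_ratio = rx_phy_delta / vport_tx_delta
    # Physical counters rising together with the vport count: packets
    # crossed the wire.
    if tx_ratio >= wire_ratio and rx_ratio >= wire_ratio:
        return "over-the-wire"
    # Physical counters near zero while the vport count rose: traffic
    # stayed on-chip, or there is no wire loop.
    if tx_ratio < idle_ratio and rx_ratio < idle_ratio:
        return "on-chip-or-no-wire-loop"
    return "inconclusive"


if __name__ == "__main__":
    print(classify_path(tx_phy_delta=1000, rx_phy_delta=1000, vport_tx_delta=1000))
```

In the dual-port config the TX physical counter belongs to port 0 and the RX physical counter to port 1, so sample each from the matching port.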

```bash
sudo ./daqiri_bench_raw_gpudirect \
./examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml --seconds 30

# Once p0 and p1 are cabled:
sudo ./daqiri_bench_raw_gpudirect \
./examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml --seconds 30
```
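
The RTX PRO 6000 configs set `num_bufs: 51200` and `buf_size: 8064` per memory region, with `payload_size: 8000` and `header_size: 64` in `bench_tx`. The short sketch below works through that arithmetic. The helper names are hypothetical, the figure is raw buffer bytes only (no allocator alignment or metadata), and reading `buf_size` as header plus payload is an inference from the numbers matching rather than a documented rule.

```python
def region_bytes(num_bufs: int, buf_size: int) -> int:
    """Raw bytes requested for one memory region."""
    return num_bufs * buf_size


def packet_fits(buf_size: int, header_size: int, payload_size: int) -> bool:
    """True when one header plus one payload fits in a single buffer."""
    return header_size + payload_size <= buf_size


if __name__ == "__main__":
    total = region_bytes(num_bufs=51200, buf_size=8064)
    print(total)                      # 412876800 bytes per region
    print(total / (1024 * 1024))      # 393.75 MiB per region
    print(packet_fits(buf_size=8064, header_size=64, payload_size=8000))  # True
```

With one TX and one RX region at these settings, each config requests roughly 394 MiB of GPU memory per region before any overhead.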

#### Cross-host two-DGX-Spark loopback

If you have two DGX Sparks cross-cabled p0↔p0 instead of a chassis QSFP loop on one machine, use the `_xhost` configs. Each host runs only its own role, so the YAML on each side configures one port instead of two. Both hosts must already be set up per the [DGX Spark profile](../tutorials/system_configuration.md#dgx-spark-profile), with one adjustment: the `daqiri-tx` (`1.1.1.1/24`) and `daqiri-rx` (`2.2.2.2/24`) nmcli profiles are *split across* the two hosts — bring up `daqiri-tx` on the TX host's p0 and `daqiri-rx` on the RX host's p0, instead of both on one box.
4 changes: 4 additions & 0 deletions docs/tutorials/configuration-walkthrough.md
@@ -32,6 +32,10 @@ For a shorter selection guide, start with the [Benchmarking overview](../benchma
- **DGX Spark multi-queue core-scaling matrix** (prefilled) — one base config [`daqiri_bench_raw_tx_rx_spark_mq.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark_mq.yaml) (the balanced TX=2/RX=2 superset; cores TX → 16,17, RX → 18,19) from which `examples/run_spark_mq_bench.sh` (via `scripts/gen_spark_mq_config.py`) derives the four `(TX, RX)` cells — (1,1), (1,2) (RX scaling), (2,1) (TX scaling), (2,2) (balanced) — by pruning queues/flows. All run on `daqiri_bench_raw_gpudirect` at the native 8 KB shape.
- **DGX Spark cross-host** (prefilled, runs on two Sparks) — [`daqiri_bench_raw_tx_spark_xhost.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_spark_xhost.yaml) on the TX host and [`daqiri_bench_raw_rx_spark_xhost.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_rx_spark_xhost.yaml) on the RX host. Each host runs `daqiri_bench_raw_gpudirect` against its own half; cables connect p0↔p0 between the two boxes. See the [Cross-host two-DGX-Spark loopback](../benchmarks/raw_benchmarking.md#cross-host-two-dgx-spark-loopback) section for run details.
- **No physical NIC available** — [`daqiri_bench_raw_sw_loopback.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback.yaml). `loopback: "sw"`, no NIC required. Useful for first-time build verification, not representative of production performance.
- **RTX PRO 6000 Blackwell — no cable** — [`daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml). `kind: device`, `affinity: 0`; build with [`CMAKE_CUDA_ARCHITECTURES=120`](../tutorials/bare-metal-cmake-build.md). SW loopback smoke test only.
- **RTX PRO 6000 Blackwell — real NIC, dual-port on one card** (prefilled dev box) — [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml). `61:00.0` p0 → `61:00.1` p1, GPU 0 TX / GPU 1 RX; needs L2 link between ports (not SW loopback). See [`rtx_pro_6000_baseline.md`](https://github.com/nvidia/daqiri/blob/main/examples/rtx_pro_6000_baseline.md) for hardware limits and measured baseline.
- **RTX PRO 6000 — same-PF NIC attempt** (experimental) — [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml). Single port `61:00.0` TX+RX; failed `daqiri_init` on reference box — kept for follow-up.
- **RTX PRO 6000 Blackwell — dual-NIC loopback** (generic template) — [`daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml). Placeholders for Cliff's 800 Gbps cross-card target; fill PCIe BDFs and MACs.
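
The generic template is not runnable until its `<placeholder>` values are replaced, and the prefilled NIC configs still carry a placeholder `eth_dst_addr`. A quick pre-flight check can flag anything left unfilled. This is a minimal sketch with hypothetical names (`find_placeholders`, `PLACEHOLDER`); it strips comments naively at the first `#`, which is adequate for these configs but is not general YAML handling.

```python
import re

# Matches angle-bracket placeholders such as <0000:00:00.0> or <1.2.3.4>.
PLACEHOLDER = re.compile(r"<[^<>\s]+>")


def find_placeholders(yaml_text: str) -> list:
    """Return (line_number, placeholder) pairs found outside comments."""
    hits = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        # Naive comment strip: everything after the first '#' is ignored.
        content = line.split("#", 1)[0]
        for match in PLACEHOLDER.finditer(content):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = (
        "address: <0000:00:00.0>\n"
        "# cat /sys/class/net/<iface>/address\n"
        "eth_dst_addr: 00:11:22:33:44:55\n"
        "ip_src_addr: <1.2.3.4>\n"
    )
    for lineno, token in find_placeholders(sample):
        print(f"line {lineno}: unreplaced placeholder {token}")
```

An empty result means no angle-bracket placeholders remain. It does not validate that the filled-in BDFs or MACs are correct for your system.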

To watch the same raw loopback benchmark with live Prometheus and Grafana
counters, use the Grafana compose stack described in
4 changes: 4 additions & 0 deletions examples/CMakeLists.txt
@@ -34,6 +34,10 @@ set(DAQIRI_BENCH_CONFIGS
daqiri_bench_raw_rx_reorder_seq_batch.yaml
daqiri_bench_raw_rx_multi_q.yaml
daqiri_bench_raw_sw_loopback.yaml
daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml
daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml
daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml
daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml
daqiri_example_gds_write_sw_loopback.yaml
daqiri_example_gds_write_tx_rx.yaml
daqiri_example_pcap_writer_sw_loopback.yaml
77 changes: 77 additions & 0 deletions examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml
@@ -0,0 +1,77 @@
# RTX PRO 6000 Blackwell (discrete dGPU) software-loopback smoke test.
# No NIC or cable required — validates build + GPUDirect on one GPU.
# Not representative of wire-speed performance; use the hardware template
# (daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml) once a QSFP loopback cable is installed.
#
# Build (native sm_120):
# cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=OFF \
# -DDAQIRI_MGR="dpdk socket rdma" -DCMAKE_CUDA_ARCHITECTURES=120
# cmake --build build -j
#
# Run:
# ./build/examples/daqiri_bench_raw_gpudirect \
# ./build/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml --seconds 30
#
# memory_regions[].affinity is the CUDA device index (not nvidia-smi GPU id).
# Change affinity to target a different GPU, e.g. affinity: 1 for CUDA device 1.
#
%YAML 1.2
---
daqiri:
  cfg:
    version: 1
    stream_type: "raw"
    master_core: 3
    debug: false
    log_level: "info"
    loopback: "sw"

    memory_regions:
    - name: "Data_TX_GPU"
      kind: "device"
      affinity: 0
      num_bufs: 51200
      buf_size: 8064
    - name: "Data_RX_GPU"
      kind: "device"
      affinity: 0
      num_bufs: 51200
      buf_size: 8064

    interfaces:
    - name: "loopback_ports"
      address: "loopback"
      tx:
        queues:
        - name: "tx_q_0"
          id: 0
          batch_size: 10240
          cpu_core: 11
          timeout_us: 1000
          memory_regions:
            - "Data_TX_GPU"
          offloads:
            - "tx_eth_src"
      rx:
        queues:
        - name: "rq_q_0"
          id: 0
          cpu_core: 9
          timeout_us: 1000
          batch_size: 10240
          memory_regions:
            - "Data_RX_GPU"

bench_rx:
  interface_name: "loopback_ports"

bench_tx:
  interface_name: "loopback_ports"
  batch_size: 10240
  payload_size: 8000
  header_size: 64
  eth_dst_addr: 00:00:00:00:00:00
  ip_src_addr: 0.0.0.0
  ip_dst_addr: 0.0.0.0
  udp_src_port: 4096
  udp_dst_port: 4096
93 changes: 93 additions & 0 deletions examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml
@@ -0,0 +1,93 @@
# RTX PRO 6000 Blackwell dual-NIC hardware loopback TEMPLATE (generic placeholders).
# For a prefilled single-card dual-port run on the reference dev box, see
# daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml instead.
#
# Requires L2 connectivity between tx_port and rx_port (QSFP cable, loopback optic, or switch).
# NOT runnable until <placeholders> are replaced for your system.
#
# Target topology (Cliff's 800 Gbps vision — two cards or two ports at line rate):
# GPU 0 (CUDA affinity 0) --TX buffers--> NIC0 --link--> NIC1 --RX buffers--> GPU 1
#
# Hardware limits:
# - 800 Gbps aggregate needs two ~400G ports with an active link each; no cable = no wire test.
# - This server lacks Spark's dual-PF-per-port on-chip eswitch shortcut (see baseline doc).
# - SW loopback (daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml) never touches the NIC.
#
# Discovery helpers:
# lspci -d 15b3:
# ibdev2netdev
# cat /sys/class/net/<iface>/address # eth_dst_addr = rx_port MAC
# nvidia-smi topo -m # NUMA / PCIe proximity for cpu_core picks
#
# Build: same as daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml header.
#
%YAML 1.2
---
daqiri:
  cfg:
    version: 1
    stream_type: "raw"
    master_core: 3
    debug: false
    log_level: "info"
    loopback: ""

    memory_regions:
    - name: "Data_TX_GPU"
      kind: "device"
      affinity: 0
      num_bufs: 51200
      buf_size: 8064
    - name: "Data_RX_GPU"
      kind: "device"
      affinity: 1
      num_bufs: 51200
      buf_size: 8064

    interfaces:
    - name: "tx_port"
      address: <0000:00:00.0>
      tx:
        queues:
        - name: "tx_q_0"
          id: 0
          batch_size: 10240
          cpu_core: 11
          memory_regions:
            - "Data_TX_GPU"
          offloads:
            - "tx_eth_src"
    - name: "rx_port"
      address: <0000:00:00.1>
      rx:
        flow_isolation: true
        queues:
        - name: "rq_q_0"
          id: 0
          cpu_core: 9
          batch_size: 10240
          memory_regions:
            - "Data_RX_GPU"
        flows:
        - name: "flow_0"
          id: 0
          action:
            type: queue
            id: 0
          match:
            udp_src: 4096
            udp_dst: 4096

bench_rx:
  - interface_name: "rx_port"

bench_tx:
  - interface_name: "tx_port"
    batch_size: 10240
    payload_size: 8000
    header_size: 64
    eth_dst_addr: <00:00:00:00:00:00>
    ip_src_addr: <1.2.3.4>
    ip_dst_addr: <5.6.7.8>
    udp_src_port: 4096
    udp_dst_port: 4096
93 changes: 93 additions & 0 deletions examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml
@@ -0,0 +1,93 @@
# RTX PRO 6000 Blackwell: real NIC loopback on the reference dev box. No extra cabling is
# needed only if p0 and p1 are already linked to each other (carrier=1 on both ports does
# not prove that). Uses one ConnectX-7 / BF-3 dual-port card:
# tx_port 0000:61:00.0 (ens1f0np0, p0) GPU CUDA 0
# rx_port 0000:61:00.1 (ens1f1np1, p1) GPU CUDA 1
#
# Hardware limitations (read before comparing numbers):
# - NOT Cliff's 800 Gbps dual-card test. This is one ASIC, two ports (~400G class each).
# - Requires L2 connectivity between p0 and p1 (QSFP cable, passive loopback optic, or
# switch). If carrier=0 on either port, this config will not pass traffic.
# - Unlike DGX Spark, this server has one PF per physical port — no on-chip eswitch
# loopback between PFs on the same port without a link.
# - SW loopback (daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml) does not use the NIC.
#
# Verify link: cat /sys/class/net/ens1f{0,1}np*/carrier
# eth_dst_addr: rx_port MAC — cat /sys/class/net/ens1f1np1/address on this dev box
# Verify wire vs on-chip after a run: tx_phy_packets / rx_phy_packets near zero = on-chip;
# rising with TX/RX = packets crossed the SerDes.
#
# Build: cmake ... -DCMAKE_CUDA_ARCHITECTURES=120 && cmake --build build -j
# Run:
# sudo ./build/examples/daqiri_bench_raw_gpudirect \
# ./build/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml --seconds 30
#
%YAML 1.2
---
daqiri:
  cfg:
    version: 1
    stream_type: "raw"
    master_core: 3
    debug: false
    log_level: "info"
    loopback: ""

    memory_regions:
    - name: "Data_TX_GPU"
      kind: "device"
      affinity: 0
      num_bufs: 51200
      buf_size: 8064
    - name: "Data_RX_GPU"
      kind: "device"
      affinity: 1
      num_bufs: 51200
      buf_size: 8064

    interfaces:
    - name: "tx_port"
      address: 0000:61:00.0
      tx:
        queues:
        - name: "tx_q_0"
          id: 0
          batch_size: 10240
          cpu_core: 11
          memory_regions:
            - "Data_TX_GPU"
          offloads:
            - "tx_eth_src"
    - name: "rx_port"
      address: 0000:61:00.1
      rx:
        flow_isolation: true
        queues:
        - name: "rq_q_0"
          id: 0
          cpu_core: 9
          batch_size: 10240
          memory_regions:
            - "Data_RX_GPU"
        flows:
        - name: "flow_0"
          id: 0
          action:
            type: queue
            id: 0
          match:
            udp_src: 4096
            udp_dst: 4096

bench_rx:
  - interface_name: "rx_port"

bench_tx:
  - interface_name: "tx_port"
    batch_size: 10240
    payload_size: 8000
    header_size: 64
    eth_dst_addr: <00:00:00:00:00:00>
    ip_src_addr: 1.2.3.4
    ip_dst_addr: 5.6.7.8
    udp_src_port: 4096
    udp_dst_port: 4096
75 changes: 75 additions & 0 deletions examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml
@@ -0,0 +1,75 @@
# Experimental: same-PF TX+RX on 0000:61:00.0 (ens1f0np0), an alternative when p0<->p1 are
# not cabled. Failed daqiri_init on the reference box; kept for follow-up.
# eth_dst_addr: this port's own MAC (hairpin / L2 loopback if supported):
# cat /sys/class/net/<iface>/address
# See daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml for the dual-port p0->p1 config.
#
%YAML 1.2
---
daqiri:
  cfg:
    version: 1
    stream_type: "raw"
    master_core: 3
    debug: false
    log_level: "info"
    loopback: ""

    memory_regions:
    - name: "Data_TX_GPU"
      kind: "device"
      affinity: 0
      num_bufs: 51200
      buf_size: 8064
    - name: "Data_RX_GPU"
      kind: "device"
      affinity: 0
      num_bufs: 51200
      buf_size: 8064

    interfaces:
    - name: "tx_port"
      address: 0000:61:00.0
      tx:
        queues:
        - name: "tx_q_0"
          id: 0
          batch_size: 10240
          cpu_core: 11
          memory_regions:
            - "Data_TX_GPU"
          offloads:
            - "tx_eth_src"
    - name: "rx_port"
      address: 0000:61:00.0
      rx:
        flow_isolation: true
        queues:
        - name: "rq_q_0"
          id: 0
          cpu_core: 9
          batch_size: 10240
          memory_regions:
            - "Data_RX_GPU"
        flows:
        - name: "flow_0"
          id: 0
          action:
            type: queue
            id: 0
          match:
            udp_src: 4096
            udp_dst: 4096

bench_rx:
  - interface_name: "rx_port"

bench_tx:
  - interface_name: "tx_port"
    batch_size: 10240
    payload_size: 8000
    header_size: 64
    eth_dst_addr: <00:00:00:00:00:00>
    ip_src_addr: 1.2.3.4
    ip_dst_addr: 5.6.7.8
    udp_src_port: 4096
    udp_dst_port: 4096