diff --git a/AGENTS.md b/AGENTS.md index 7ef7b254..3cb274df 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -32,7 +32,7 @@ There is no unit test suite. Verification is done via the benchmark executables | Executable | Source | Typical config | |---|---|---| -| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml`, `daqiri_bench_raw_tx_rx_spark_mq.yaml` (mq base; `run_spark_mq_bench.sh` derives the 4 cells via `scripts/gen_spark_mq_config.py`) | +| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml`, `daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml`, `daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml`, `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml`, `daqiri_bench_raw_rx_multi_q.yaml`, `daqiri_bench_raw_tx_rx_spark_mq.yaml` (mq base; `run_spark_mq_bench.sh` derives the 4 cells via `scripts/gen_spark_mq_config.py`) | | `daqiri_bench_raw_hds` | `raw_hds_bench.cpp` | `daqiri_bench_raw_tx_rx_hds.yaml` | | `daqiri_bench_raw_reorder_seq` | `raw_reorder_seq_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_seq_1024*.yaml`, `daqiri_bench_raw_rx_reorder_seq_*.yaml` | | `daqiri_bench_raw_reorder_quantize` | `raw_reorder_quantize_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml` | diff --git a/docs/benchmarks/raw_benchmarking.md b/docs/benchmarks/raw_benchmarking.md index 20264b24..04b527d8 100644 --- a/docs/benchmarks/raw_benchmarking.md +++ b/docs/benchmarks/raw_benchmarking.md @@ -59,6 +59,26 @@ docker run --rm -it --privileged \ The Spark configs also pin the benchmark application's `bench_tx.cpu_core` / 
`bench_rx.cpu_core` fields to the high-frequency Cortex-X925 cores. Keep both the DAQIRI queue cores and the application worker cores on cores 16-19 unless you intentionally want a lower-power core in the measurement. +!!! tip "RTX PRO 6000 Blackwell (x86_64 workstation / server)" + + For discrete Blackwell RTX PRO 6000 systems, build with [`CMAKE_CUDA_ARCHITECTURES=120`](../tutorials/bare-metal-cmake-build.md) and use these configs: + + - [`daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml) — software loopback, no NIC required. Validates the GPUDirect build path; throughput is not wire-rate. See measured numbers in [`examples/rtx_pro_6000_baseline.md`](https://github.com/nvidia/daqiri/blob/main/examples/rtx_pro_6000_baseline.md). + - [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml) — prefilled dual-port NIC run (dev-box PCIe BDFs; fill `eth_dst_addr` from the rx_port MAC). Port 0 TX on GPU CUDA 0, port 1 RX on GPU CUDA 1. Requires an L2 link between the two ports (QSFP cable, passive loopback optic, or switch). `carrier=1` on both ports does not guarantee they are looped to each other. + - [`daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml) — generic placeholder template for cross-card or custom topology (800 Gbps target once cabled). + - [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml) — experimental same-port TX+RX; failed `daqiri_init` on the reference box. + + Unlike DGX Spark, typical RTX PRO servers expose one PF per physical port — there is no on-chip eswitch shortcut between two PFs on the same port without a link. 
After a NIC run, confirm whether traffic crossed the wire using `tx_phy_packets` / `rx_phy_packets` in the DPDK extended stats (near zero = on-chip or no wire loop; rising with vport counts = over-the-wire). Full constraints and baseline results: [`rtx_pro_6000_baseline.md`](https://github.com/nvidia/daqiri/blob/main/examples/rtx_pro_6000_baseline.md). + + ```bash + sudo ./daqiri_bench_raw_gpudirect \ + ./examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml --seconds 30 + + # Once p0 and p1 are cabled: + sudo ./daqiri_bench_raw_gpudirect \ + ./examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml --seconds 30 + ``` + #### Cross-host two-DGX-Spark loopback If you have two DGX Sparks cross-cabled p0↔p0 instead of a chassis QSFP loop on one machine, use the `_xhost` configs. Each host runs only its own role, so the YAML on each side configures one port instead of two. Both hosts must already be set up per the [DGX Spark profile](../tutorials/system_configuration.md#dgx-spark-profile), with one adjustment: the `daqiri-tx` (`1.1.1.1/24`) and `daqiri-rx` (`2.2.2.2/24`) nmcli profiles are *split across* the two hosts — bring up `daqiri-tx` on the TX host's p0 and `daqiri-rx` on the RX host's p0, instead of both on one box. 
diff --git a/docs/tutorials/configuration-walkthrough.md b/docs/tutorials/configuration-walkthrough.md index b2232bcd..9e0957e9 100644 --- a/docs/tutorials/configuration-walkthrough.md +++ b/docs/tutorials/configuration-walkthrough.md @@ -32,6 +32,10 @@ For a shorter selection guide, start with the [Benchmarking overview](../benchma - **DGX Spark multi-queue core-scaling matrix** (prefilled) — one base config [`daqiri_bench_raw_tx_rx_spark_mq.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark_mq.yaml) (the balanced TX=2/RX=2 superset; cores TX → 16,17, RX → 18,19) from which `examples/run_spark_mq_bench.sh` (via `scripts/gen_spark_mq_config.py`) derives the four `(TX, RX)` cells — (1,1), (1,2) (RX scaling), (2,1) (TX scaling), (2,2) (balanced) — by pruning queues/flows. All run on `daqiri_bench_raw_gpudirect` at the native 8 KB shape. - **DGX Spark cross-host** (prefilled, runs on two Sparks) — [`daqiri_bench_raw_tx_spark_xhost.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_spark_xhost.yaml) on the TX host and [`daqiri_bench_raw_rx_spark_xhost.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_rx_spark_xhost.yaml) on the RX host. Each host runs `daqiri_bench_raw_gpudirect` against its own half; cables connect p0↔p0 between the two boxes. See the [Cross-host two-DGX-Spark loopback](../benchmarks/raw_benchmarking.md#cross-host-two-dgx-spark-loopback) section for run details. - **No physical NIC available** — [`daqiri_bench_raw_sw_loopback.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback.yaml). `loopback: "sw"`, no NIC required. Useful for first-time build verification, not representative of production performance. + - **RTX PRO 6000 Blackwell — no cable** — [`daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml). 
`kind: device`, `affinity: 0`; build with [`CMAKE_CUDA_ARCHITECTURES=120`](../tutorials/bare-metal-cmake-build.md). SW loopback smoke test only. + - **RTX PRO 6000 Blackwell — real NIC, dual-port on one card** (prefilled dev box) — [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml). `61:00.0` p0 → `61:00.1` p1, GPU 0 TX / GPU 1 RX; needs L2 link between ports (not SW loopback). See [`rtx_pro_6000_baseline.md`](https://github.com/nvidia/daqiri/blob/main/examples/rtx_pro_6000_baseline.md) for hardware limits and measured baseline. + - **RTX PRO 6000 — same-PF NIC attempt** (experimental) — [`daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml). Single port `61:00.0` TX+RX; failed `daqiri_init` on reference box — kept for follow-up. + - **RTX PRO 6000 Blackwell — dual-NIC loopback** (generic template) — [`daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml). Placeholders for Cliff's 800 Gbps cross-card target; fill PCIe BDFs and MACs. 
To watch the same raw loopback benchmark with live Prometheus and Grafana counters, use the Grafana compose stack described in diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt index 829d88e5..2f4d1235 100644 --- a/examples/CMakeLists.txt +++ b/examples/CMakeLists.txt @@ -34,6 +34,10 @@ set(DAQIRI_BENCH_CONFIGS daqiri_bench_raw_rx_reorder_seq_batch.yaml daqiri_bench_raw_rx_multi_q.yaml daqiri_bench_raw_sw_loopback.yaml + daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml + daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml + daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml + daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml daqiri_example_gds_write_sw_loopback.yaml daqiri_example_gds_write_tx_rx.yaml daqiri_example_pcap_writer_sw_loopback.yaml diff --git a/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml b/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml new file mode 100644 index 00000000..b514bb31 --- /dev/null +++ b/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml @@ -0,0 +1,77 @@ +# RTX PRO 6000 Blackwell (discrete dGPU) software-loopback smoke test. +# No NIC or cable required — validates build + GPUDirect on one GPU. +# Not representative of wire-speed performance; use the hardware template +# (daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml) once a QSFP loopback cable is installed. +# +# Build (native sm_120): +# cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=OFF \ +# -DDAQIRI_MGR="dpdk socket rdma" -DCMAKE_CUDA_ARCHITECTURES=120 +# cmake --build build -j +# +# Run: +# ./build/examples/daqiri_bench_raw_gpudirect \ +# ./build/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml --seconds 30 +# +# memory_regions[].affinity is the CUDA device index (not nvidia-smi GPU id). +# Change affinity to target a different GPU, e.g. affinity: 1 for CUDA device 1. 
+# +%YAML 1.2 +--- +daqiri: + cfg: + version: 1 + stream_type: "raw" + master_core: 3 + debug: false + log_level: "info" + loopback: "sw" + + memory_regions: + - name: "Data_TX_GPU" + kind: "device" + affinity: 0 + num_bufs: 51200 + buf_size: 8064 + - name: "Data_RX_GPU" + kind: "device" + affinity: 0 + num_bufs: 51200 + buf_size: 8064 + + interfaces: + - name: "loopback_ports" + address: "loopback" + tx: + queues: + - name: "tx_q_0" + id: 0 + batch_size: 10240 + cpu_core: 11 + timeout_us: 1000 + memory_regions: + - "Data_TX_GPU" + offloads: + - "tx_eth_src" + rx: + queues: + - name: "rq_q_0" + id: 0 + cpu_core: 9 + timeout_us: 1000 + batch_size: 10240 + memory_regions: + - "Data_RX_GPU" + +bench_rx: + interface_name: "loopback_ports" + +bench_tx: + interface_name: "loopback_ports" + batch_size: 10240 + payload_size: 8000 + header_size: 64 + eth_dst_addr: 00:00:00:00:00:00 + ip_src_addr: 0.0.0.0 + ip_dst_addr: 0.0.0.0 + udp_src_port: 4096 + udp_dst_port: 4096 diff --git a/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml b/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml new file mode 100644 index 00000000..4af275ec --- /dev/null +++ b/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml @@ -0,0 +1,93 @@ +# RTX PRO 6000 Blackwell dual-NIC hardware loopback TEMPLATE (generic placeholders). +# For a prefilled single-card dual-port run on the reference dev box, see +# daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml instead. +# +# Requires L2 connectivity between tx_port and rx_port (QSFP cable, loopback optic, or switch). +# NOT runnable until the placeholders are replaced for your system. +# +# Target topology (Cliff's 800 Gbps vision — two cards or two ports at line rate): +# GPU 0 (CUDA affinity 0) --TX buffers--> NIC0 --link--> NIC1 --RX buffers--> GPU 1 +# +# Hardware limits: +# - 800 Gbps aggregate needs two ~400G ports with an active link each; no cable = no wire test. +# - This server lacks Spark's dual-PF-per-port on-chip eswitch shortcut (see baseline doc). 
+# - SW loopback (daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml) never touches the NIC. +# +# Discovery helpers: +# lspci -d 15b3: +# ibdev2netdev +# cat /sys/class/net//address # eth_dst_addr = rx_port MAC +# nvidia-smi topo -m # NUMA / PCIe proximity for cpu_core picks +# +# Build: same as daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml header. +# +%YAML 1.2 +--- +daqiri: + cfg: + version: 1 + stream_type: "raw" + master_core: 3 + debug: false + log_level: "info" + loopback: "" + + memory_regions: + - name: "Data_TX_GPU" + kind: "device" + affinity: 0 + num_bufs: 51200 + buf_size: 8064 + - name: "Data_RX_GPU" + kind: "device" + affinity: 1 + num_bufs: 51200 + buf_size: 8064 + + interfaces: + - name: "tx_port" + address: <0000:00:00.0> + tx: + queues: + - name: "tx_q_0" + id: 0 + batch_size: 10240 + cpu_core: 11 + memory_regions: + - "Data_TX_GPU" + offloads: + - "tx_eth_src" + - name: "rx_port" + address: <0000:00:00.1> + rx: + flow_isolation: true + queues: + - name: "rq_q_0" + id: 0 + cpu_core: 9 + batch_size: 10240 + memory_regions: + - "Data_RX_GPU" + flows: + - name: "flow_0" + id: 0 + action: + type: queue + id: 0 + match: + udp_src: 4096 + udp_dst: 4096 + +bench_rx: +- interface_name: "rx_port" + +bench_tx: +- interface_name: "tx_port" + batch_size: 10240 + payload_size: 8000 + header_size: 64 + eth_dst_addr: <00:00:00:00:00:00> + ip_src_addr: <1.2.3.4> + ip_dst_addr: <5.6.7.8> + udp_src_port: 4096 + udp_dst_port: 4096 diff --git a/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml b/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml new file mode 100644 index 00000000..f9cb21f9 --- /dev/null +++ b/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml @@ -0,0 +1,93 @@ +# RTX PRO 6000 Blackwell — real NIC loopback on this dev box (no external cable required +# if p0/p1 already have link). 
Uses one ConnectX-7 / BF-3 dual-port card: +# tx_port 0000:61:00.0 (ens1f0np0, p0) GPU CUDA 0 +# rx_port 0000:61:00.1 (ens1f1np1, p1) GPU CUDA 1 +# +# Hardware limitations (read before comparing numbers): +# - NOT Cliff's 800 Gbps dual-card test. This is one ASIC, two ports (~400G class each). +# - Requires L2 connectivity between p0 and p1 (QSFP cable, passive loopback optic, or +# switch). If carrier=0 on either port, this config will not pass traffic. +# - Unlike DGX Spark, this server has one PF per physical port — no on-chip eswitch +# loopback between PFs on the same port without a link. +# - SW loopback (daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml) does not use the NIC. +# +# Verify link: cat /sys/class/net/ens1f{0,1}np*/carrier +# eth_dst_addr: rx_port MAC — cat /sys/class/net/ens1f1np1/address on this dev box +# Verify wire vs on-chip after a run: tx_phy_packets / rx_phy_packets near zero = on-chip; +# rising with TX/RX = packets crossed the SerDes. +# +# Build: cmake ... -DCMAKE_CUDA_ARCHITECTURES=120 && cmake --build build -j +# Run: +# sudo ./build/examples/daqiri_bench_raw_gpudirect \ +# ./build/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml --seconds 30 +# +%YAML 1.2 +--- +daqiri: + cfg: + version: 1 + stream_type: "raw" + master_core: 3 + debug: false + log_level: "info" + loopback: "" + + memory_regions: + - name: "Data_TX_GPU" + kind: "device" + affinity: 0 + num_bufs: 51200 + buf_size: 8064 + - name: "Data_RX_GPU" + kind: "device" + affinity: 1 + num_bufs: 51200 + buf_size: 8064 + + interfaces: + - name: "tx_port" + address: 0000:61:00.0 + tx: + queues: + - name: "tx_q_0" + id: 0 + batch_size: 10240 + cpu_core: 11 + memory_regions: + - "Data_TX_GPU" + offloads: + - "tx_eth_src" + - name: "rx_port" + address: 0000:61:00.1 + rx: + flow_isolation: true + queues: + - name: "rq_q_0" + id: 0 + cpu_core: 9 + batch_size: 10240 + memory_regions: + - "Data_RX_GPU" + flows: + - name: "flow_0" + id: 0 + action: + type: queue + id: 0 + match: + 
udp_src: 4096 + udp_dst: 4096 + +bench_rx: +- interface_name: "rx_port" + +bench_tx: +- interface_name: "tx_port" + batch_size: 10240 + payload_size: 8000 + header_size: 64 + eth_dst_addr: <00:00:00:00:00:00> + ip_src_addr: 1.2.3.4 + ip_dst_addr: 5.6.7.8 + udp_src_port: 4096 + udp_dst_port: 4096 diff --git a/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml b/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml new file mode 100644 index 00000000..77ec6cc7 --- /dev/null +++ b/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml @@ -0,0 +1,75 @@ +# Same-PF TX+RX on 0000:61:00.0 (ens1f0np0). Alternative when p0<->p1 are not cabled. +# eth_dst_addr: this port's own MAC (hairpin / L2 loopback if supported) — +# cat /sys/class/net//address +# See daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml for dual-port p0->p1 attempt. +# +%YAML 1.2 +--- +daqiri: + cfg: + version: 1 + stream_type: "raw" + master_core: 3 + debug: false + log_level: "info" + loopback: "" + + memory_regions: + - name: "Data_TX_GPU" + kind: "device" + affinity: 0 + num_bufs: 51200 + buf_size: 8064 + - name: "Data_RX_GPU" + kind: "device" + affinity: 0 + num_bufs: 51200 + buf_size: 8064 + + interfaces: + - name: "tx_port" + address: 0000:61:00.0 + tx: + queues: + - name: "tx_q_0" + id: 0 + batch_size: 10240 + cpu_core: 11 + memory_regions: + - "Data_TX_GPU" + offloads: + - "tx_eth_src" + - name: "rx_port" + address: 0000:61:00.0 + rx: + flow_isolation: true + queues: + - name: "rq_q_0" + id: 0 + cpu_core: 9 + batch_size: 10240 + memory_regions: + - "Data_RX_GPU" + flows: + - name: "flow_0" + id: 0 + action: + type: queue + id: 0 + match: + udp_src: 4096 + udp_dst: 4096 + +bench_rx: +- interface_name: "rx_port" + +bench_tx: +- interface_name: "tx_port" + batch_size: 10240 + payload_size: 8000 + header_size: 64 + eth_dst_addr: <00:00:00:00:00:00> + ip_src_addr: 1.2.3.4 + ip_dst_addr: 5.6.7.8 + udp_src_port: 4096 + udp_dst_port: 4096 diff --git 
a/examples/rtx_pro_6000_baseline.md b/examples/rtx_pro_6000_baseline.md new file mode 100644 index 00000000..bd2babcc --- /dev/null +++ b/examples/rtx_pro_6000_baseline.md @@ -0,0 +1,60 @@ +# RTX PRO 6000 benchmark baseline + +Host: x86_64 EPYC · GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition · `CMAKE_CUDA_ARCHITECTURES=120` · branch `ccrozier-rtx-pro-6000-bench` · 2026-06-12 + +## Hardware limitations (read first) + +| What | Limit | +|---|---| +| **Cliff 800 Gbps target** | Two ~400G ports with an **active L2 link** (QSFP cable, loopback optic, or switch). Not achievable without that link. | +| **This server vs DGX Spark** | One PF per physical port. **No** Spark-style on-chip eswitch loopback (two PFs on the same port). | +| **`loopback: "sw"`** | Does **not** use the NIC. Measures in-process DPDK path only; can exceed line rate (not comparable to Gbps on the wire). | +| **`carrier=1`** | Link up on a port ≠ p0 and p1 are cabled **to each other**. Our dual-port run proved this: TX on p0, RX 0 on p1. | +| **Starting point** | SW smoke test validates GPUDirect build. Real NIC numbers need a completed L2 loop; use `tx_phy_packets` / `rx_phy_packets` to confirm wire vs internal. 
| + +## Results (this dev box) + +| Config | GPU (CUDA) | Mode | Duration | TX Gbps | RX Gbps | Notes | +|---|---|---|---|---|---|---| +| `daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml` | 0 | `loopback: sw` | 30s | 1579.5 | 1579.5 | No NIC; GPUDirect smoke baseline | +| `daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml` | 0 TX / 1 RX | real NIC `61:00.0`→`61:00.1` | 30s | 381.5 | 0 | NIC TX path works; **no RX** — p0/p1 not looped (phy_pkts ≪ vport) | +| `daqiri_bench_raw_tx_rx_rtx_pro_6000_nic_same_port.yaml` | 0 | same PF `61:00.0` | — | — | — | `daqiri_init` failed (extmem pool on single port); not a baseline | +| `daqiri_bench_raw_sw_loopback_reorder_seq_1024.yaml` | CPU huge | `loopback: sw` + reorder | 30s | — | 160.2 | GPU reorder kernel path; CPU buffers not GPUDirect | +| `daqiri_bench_socket_udp_tx_rx.yaml` (`iterations: 0`) | host | kernel UDP `127.0.0.1` | 15s | 4.7 | 4.7 | Kernel socket baseline; stock YAML uses `iterations: 1000` (not a perf test) | + +### NIC dual-port run detail + +- Card: `0000:61:00.0` (ens1f0np0, p0) TX → `0000:61:00.1` (ens1f1np1, p1) RX +- `eth_dst_addr`: `c4:70:bd:c2:8a:93` (p1 MAC) +- DPDK `tx_phy_packets` / `rx_phy_packets` on port 0: **2** / **76** vs **177M** TX vport packets → traffic did not cross SerDes between ports + +## Commands + +```bash +# SW smoke (no NIC) +sudo ./build/examples/daqiri_bench_raw_gpudirect \ + ./build/examples/daqiri_bench_raw_sw_loopback_rtx_pro_6000.yaml --seconds 30 + +# Real NIC dual-port (needs p0↔p1 L2 loop) +sudo ./build/examples/daqiri_bench_raw_gpudirect \ + ./build/examples/daqiri_bench_raw_tx_rx_rtx_pro_6000_nic.yaml --seconds 30 +``` + +## What we can still run now (#17, partial) + +| Test | Status | Notes | +|---|---|---| +| Raw SW loopback GPUDirect | Done | Best no-cable perf number | +| Raw NIC TX (one port) | Done | Proves mlx5 + GPUDirect TX | +| Reorder SW loopback | Done | ~160 Gbps RX, CPU-side buffers | +| Socket UDP (`iterations: 0`) | Done | ~5 Gbps kernel baseline | +| Socket 
TCP (`iterations: 0`) | Easy | Same yaml tweak | +| HDS / RoCE / NIC closed-loop | Blocked | Need filled YAML + L2 loop | +| FFT / GEMM workloads | Not in repo | #17 asks for these | + +## Follow-ups + +- Install QSFP cable (or passive loopback) between p0 and p1 on `61:00.x`, re-run `_nic.yaml` +- Scale to cross-card 800 Gbps using `daqiri_bench_raw_tx_rx_rtx_pro_6000.yaml` template + second card +- Tune CPU cores from `nvidia-smi topo -m` once RX path is up +- Add RTX socket YAML with `iterations: 0` for repeatable kernel baseline
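
The wire-versus-on-chip check described under "NIC dual-port run detail" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_traffic`, its parameter names, and the `min_ratio` threshold of 0.5 are hypothetical and not part of DAQIRI or DPDK; the counts must be read from the DPDK extended stats by whatever means you already use.

```python
def classify_traffic(tx_phy_packets: int, tx_vport_packets: int,
                     min_ratio: float = 0.5) -> str:
    """Classify a NIC run by comparing physical-port and vport TX counts.

    Hypothetical helper: min_ratio is an assumed cutoff, not a value from
    the baseline document. Tune it for your own setup.
    """
    if tx_vport_packets <= 0:
        return "no traffic"
    # Phy counters that track the vport counters mean packets left the port.
    # Phy counters near zero mean traffic stayed on-chip or never found a loop.
    ratio = tx_phy_packets / tx_vport_packets
    if ratio >= min_ratio:
        return "over-the-wire"
    return "on-chip or no wire loop"


if __name__ == "__main__":
    # Counts reported for the dual-port run: 2 phy TX packets vs ~177M vport.
    print(classify_traffic(2, 177_000_000))
```

With the counts from the dual-port run above, the sketch reports that traffic did not cross the link, which matches the RX 0 result in the table.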