Describe the bug
When the TX ring drains, tx_core_worker idles without polling NIC completions. Completed mbufs are never reclaimed, the mempool free count stays below the 2 × batch_size gate in is_tx_burst_available(), and send_tx_burst is permanently blocked.
Root Cause
send_tx_burst (app thread) gates new submissions on mempool availability:
// is_tx_burst_available()
if (rte_mempool_avail_count(q->pools[seg]) < burst->hdr.hdr.num_pkts * 2) {
return false; // → caller gets NO_FREE_PACKET_BUFFERS
}
tx_core_worker (TX lcore) only reclaims mbufs via rte_eth_tx_burst(), which is only called when the ring has work. When the ring is empty, the idle path does nothing:
// tx_core_worker main loop
while (!force_quit.load()) {
if (rte_ring_dequeue(tparams->ring, reinterpret_cast<void**>(&msg)) != 0) {
continue; // ← no completion polling, mbufs never reclaimed
}
// ... rte_eth_tx_burst() only reached below ...
}
Once the pool is exhausted, neither side can make progress.
Example (num_bufs=12288, batch_size=4096, num_tx_desc=8192)
Gate threshold = batch_size × 2 = 8192. Both bursts fit in the NIC descriptor ring (8192 descriptors), so tx_core_worker finishes and enters idle before the NIC completes any transmission.
| Step |
Action |
Pool free |
Ring |
NIC in-flight |
| 1 |
initial state |
12288 |
0 |
0 |
| 2 |
send_tx_burst: burst 1 (4096) |
8192 |
1 |
0 |
| 3 |
send_tx_burst: burst 2 (4096) |
4096 |
2 |
0 |
| 4 |
send_tx_burst: 4096 < 8192 gate |
BLOCKED |
2 |
0 |
| 5 |
tx_core_worker: tx_burst(4096) |
4096 |
1 |
4096 |
| 6 |
tx_core_worker: tx_burst(4096) |
4096 |
0 |
8192 |
| 7 |
tx_core_worker: ring empty → continue |
4096 |
0 |
8192 |
| 8 |
NIC completes all 8192 packets |
4096 |
0 |
8192 (never reclaimed) |
After step 8 the pool is stuck at 4096 free < 8192 gate. send_tx_burst stays blocked, tx_core_worker spins on empty ring, and the 8192 completed mbufs are never returned to the pool.
Steps to reproduce
Save the following script as repro.sh and run it inside the daqiri container (set NIC_ADDR, IP_SRC, IP_DST as needed):
docker run --rm --privileged --net=host \
--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
-v /dev/hugepages:/dev/hugepages \
-v "$(pwd)/repro.sh:/workspace/repro.sh" \
daqiri:local /workspace/repro.sh
repro.sh:
#!/usr/bin/env bash
set -euo pipefail
NIC_ADDR="${NIC_ADDR:-0005:03:00.0}"
IP_SRC="${IP_SRC:-10.0.0.1}"
IP_DST="${IP_DST:-10.0.0.2}"
MIN_PACKETS="${MIN_PACKETS:-1000000}"
BENCH=""
for d in ./build/examples /opt/daqiri/bin; do
[[ -x "$d/daqiri_bench_raw_gpudirect" ]] && BENCH="$d/daqiri_bench_raw_gpudirect" && break
done
[[ -z "$BENCH" ]] && { echo "daqiri_bench_raw_gpudirect not found"; exit 1; }
CONFIG="/tmp/repro_tx_stall_$$.yaml"
trap 'rm -f "$CONFIG"' EXIT
cat > "$CONFIG" <<EOF
%YAML 1.2
---
daqiri:
cfg:
version: 1
stream_type: "raw"
master_core: 0
debug: false
log_level: "warn"
loopback: ""
memory_regions:
- name: "Data_TX"
kind: "host_pinned"
affinity: 0
num_bufs: 12288
buf_size: 1064
interfaces:
- name: "tx_port"
address: "${NIC_ADDR}"
tx:
queues:
- name: "tx_q_0"
id: 0
batch_size: 4096
cpu_core: 3
memory_regions:
- "Data_TX"
offloads:
- "tx_eth_src"
bench_tx:
interface_name: "tx_port"
cpu_core: 4
batch_size: 4096
payload_size: 1000
header_size: 64
eth_dst_addr: "ff:ff:ff:ff:ff:ff"
ip_src_addr: "${IP_SRC}"
ip_dst_addr: "${IP_DST}"
udp_src_port: 4096
udp_dst_port: 4096
EOF
OUTPUT=$(timeout --signal=KILL 30 "$BENCH" "$CONFIG" --seconds 10 2>&1) && rc=0 || rc=$?
echo "$OUTPUT"
if [[ $rc -eq 137 ]]; then
echo -e "\nFAIL: killed after 30s — TX deadlock."
exit 1
fi
TX_PACKETS=$(grep -oP 'packets=\K[0-9]+' <<< "$OUTPUT" | head -1)
: "${TX_PACKETS:=0}"
echo ""
echo "Transmitted: $TX_PACKETS packets (minimum: $MIN_PACKETS)"
if (( TX_PACKETS < MIN_PACKETS )); then
echo -e "\nFAIL: throughput collapse — only $TX_PACKETS packets in 10s (need $MIN_PACKETS)."
exit 1
fi
echo -e "\nPASS"
Output (look for the TX complete line):
TX complete: interface=tx_port queue=0 packets=~20480 bytes=21790720 bursts=5 seconds=10.0065
Expected behavior
The benchmark should sustain >10 Mpps for the full 10-second run (>100M packets total).
Actual behavior
Throughput collapses after 2 bursts — only ~20K packets sent in 10 seconds (~2 Kpps).
Environment overview
- Environment location: Docker
- Method of DAQIRI install: source (container build via
./scripts/build-container.sh)
Environment details
- OS: Ubuntu 24.04 (container base:
nvcr.io/nvidia/cuda:13.1.0-devel-ubuntu24.04)
- DAQIRI version: main branch
- DPDK version: 25.11
- Hardware: IGX Dev Kit, ConnectX-7 (mlx5 PMD)
Describe the bug
When the TX ring drains,
tx_core_workeridles without polling NIC completions. Completed mbufs are never reclaimed, the mempool free count stays below the2 × batch_sizegate inis_tx_burst_available(), andsend_tx_burstis permanently blocked.Root Cause
send_tx_burst(app thread) gates new submissions on mempool availability:tx_core_worker(TX lcore) only reclaims mbufs viarte_eth_tx_burst(), which is only called when the ring has work. When the ring is empty, the idle path does nothing:Once the pool is exhausted, neither side can make progress.
Example (num_bufs=12288, batch_size=4096, num_tx_desc=8192)
Gate threshold =
batch_size × 2= 8192. Both bursts fit in the NIC descriptor ring (8192 descriptors), sotx_core_workerfinishes and enters idle before the NIC completes any transmission.send_tx_burst: burst 1 (4096)send_tx_burst: burst 2 (4096)send_tx_burst: 4096 < 8192 gatetx_core_worker: tx_burst(4096)tx_core_worker: tx_burst(4096)tx_core_worker: ring empty →continueAfter step 8 the pool is stuck at 4096 free < 8192 gate.
send_tx_burststays blocked,tx_core_workerspins on empty ring, and the 8192 completed mbufs are never returned to the pool.Steps to reproduce
Save the following script as
repro.shand run it inside the daqiri container (setNIC_ADDR,IP_SRC,IP_DSTas needed):docker run --rm --privileged --net=host \ --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \ -v /dev/hugepages:/dev/hugepages \ -v "$(pwd)/repro.sh:/workspace/repro.sh" \ daqiri:local /workspace/repro.shrepro.sh:Output (look for the
TX completeline):Expected behavior
The benchmark should sustain >10 Mpps for the full 10-second run (>100M packets total).
Actual behavior
Throughput collapses after 2 bursts — only ~20K packets sent in 10 seconds (~2 Kpps).
Environment overview
./scripts/build-container.sh)Environment details
nvcr.io/nvidia/cuda:13.1.0-devel-ubuntu24.04)