[BUG] TX mempool exhaustion stall in DPDK tx_core_worker

## Describe the bug

When the TX ring drains, `tx_core_worker` idles without polling NIC completions. Completed mbufs are never reclaimed, the mempool free count stays below the `2 × batch_size` gate in `is_tx_burst_available()`, and `send_tx_burst` is permanently blocked.

### Root Cause

`send_tx_burst` (app thread) gates new submissions on mempool availability:

```cpp
// is_tx_burst_available()
if (rte_mempool_avail_count(q->pools[seg]) < burst->hdr.hdr.num_pkts * 2) {
  return false;  // → caller gets NO_FREE_PACKET_BUFFERS
}
```

`tx_core_worker` (TX lcore) only reclaims mbufs via `rte_eth_tx_burst()`, which is only called when the ring has work. When the ring is empty, the idle path does nothing:

```cpp
// tx_core_worker main loop
while (!force_quit.load()) {
  if (rte_ring_dequeue(tparams->ring, reinterpret_cast<void**>(&msg)) != 0) {
    continue;  // ← no completion polling, mbufs never reclaimed
  }
  // ... rte_eth_tx_burst() only reached below ...
}
```

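Once the free count drops below the gate while the ring is empty, neither side can make progress: the app thread cannot enqueue, and the worker never reaches `rte_eth_tx_burst()`.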
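One possible shape for a fix is to poll completions on the idle path. The sketch below is not DAQIRI code and makes no DPDK calls. `WorkerHooks`, `run_worker`, and the three hook names are hypothetical stand-ins for the ring dequeue, the transmit call, and whatever reclaim mechanism the PMD offers. In DPDK, `rte_eth_tx_done_cleanup()` is one candidate for the reclaim hook, but it returns `-ENOTSUP` on drivers that do not implement it, and support on the mlx5 PMD has not been checked here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical hooks standing in for the real ring and NIC calls.
struct WorkerHooks {
  std::function<std::optional<uint32_t>()> dequeue_burst;  // nullopt = ring empty
  std::function<void(uint32_t)> transmit;                  // posts a burst to the NIC
  std::function<uint32_t()> reclaim_completed;             // returns mbufs freed
};

// Runs a bounded number of loop iterations and returns how many mbufs
// were reclaimed on the idle path.
inline uint64_t run_worker(const WorkerHooks& h, uint32_t max_iterations) {
  assert(h.dequeue_burst && h.transmit && h.reclaim_completed);
  uint64_t reclaimed = 0;
  for (uint32_t i = 0; i < max_iterations; ++i) {
    std::optional<uint32_t> burst = h.dequeue_burst();
    if (!burst) {
      // Ring empty: poll completions instead of spinning with nothing to do.
      reclaimed += h.reclaim_completed();
      continue;
    }
    h.transmit(*burst);
  }
  return reclaimed;
}
```

With two queued bursts of 4096 packets and a reclaim hook that frees everything the NIC holds, the first idle iteration returns all 8192 mbufs. A real implementation would likely want to rate-limit the idle poll rather than call it on every empty dequeue.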

### Example (num_bufs=12288, batch_size=4096, num_tx_desc=8192)

Gate threshold = `batch_size × 2` = 8192. Both bursts fit in the NIC descriptor ring (8192 descriptors), so `tx_core_worker` posts them without having to wait for free descriptors and goes idle before the NIC completes any transmission.

| Step | Action | Pool free | Ring | NIC in-flight |
|------|--------|-----------|------|---------------|
| 1 | initial state | 12288 | 0 | 0 |
| 2 | `send_tx_burst`: burst 1 (4096) | 8192 | 1 | 0 |
| 3 | `send_tx_burst`: burst 2 (4096) | 4096 | 2 | 0 |
| 4 | `send_tx_burst`: 4096 < 8192 gate → **BLOCKED** | 4096 | 2 | 0 |
| 5 | `tx_core_worker`: tx_burst(4096) | 4096 | 1 | 4096 |
| 6 | `tx_core_worker`: tx_burst(4096) | 4096 | 0 | 8192 |
| 7 | `tx_core_worker`: ring empty → `continue` | 4096 | 0 | 8192 |
| 8 | NIC completes all 8192 packets | 4096 | 0 | 8192 (**never reclaimed**) |

After step 8 the pool is stuck at 4096 free, below the 8192 gate. `send_tx_burst` stays blocked, `tx_core_worker` spins on the empty ring, and the 8192 completed mbufs are never returned to the pool.
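The accounting in the table can be replayed with a small standalone model. This is a toy, not the DAQIRI implementation: `TxModel`, `replay_example`, and `reclaim` are hypothetical names, and the model assumes, as the table does, that the NIC completes nothing until both bursts are posted.

```cpp
#include <cassert>
#include <cstdint>

// Toy accounting model: every mbuf is free in the pool, queued on the
// software ring, or held by the NIC. No DPDK calls are made.
struct TxModel {
  uint32_t pool_free;    // what rte_mempool_avail_count() would report
  uint32_t ring_bursts;  // bursts waiting on the software ring
  uint32_t nic_held;     // mbufs posted to the NIC and not yet reclaimed
  uint32_t batch;        // packets per burst

  // App side: applies the 2 x batch gate, then takes one batch from the pool.
  bool send_tx_burst() {
    if (pool_free < 2 * batch) return false;  // NO_FREE_PACKET_BUFFERS
    pool_free -= batch;
    ++ring_bursts;
    return true;
  }

  // Worker side: one loop iteration; returns false when the ring is empty.
  bool worker_iteration() {
    if (ring_bursts == 0) return false;  // idle path: nothing is reclaimed
    --ring_bursts;
    nic_held += batch;  // handed to the NIC, not yet back in the pool
    return true;
  }

  // Hypothetical reclaim step: returns all NIC-held mbufs to the pool.
  void reclaim() {
    pool_free += nic_held;
    nic_held = 0;
  }
};

// Replays steps 1-8 of the table and returns the final state.
inline TxModel replay_example() {
  TxModel m{12288, 0, 0, 4096};  // step 1: num_bufs=12288, batch_size=4096
  m.send_tx_burst();             // step 2: pool 8192, ring 1
  m.send_tx_burst();             // step 3: pool 4096, ring 2
  m.send_tx_burst();             // step 4: refused, 4096 < 8192
  m.worker_iteration();          // step 5: ring 1, NIC holds 4096
  m.worker_iteration();          // step 6: ring 0, NIC holds 8192
  m.worker_iteration();          // steps 7-8: ring empty, worker idles
  // mbufs are conserved across pool, ring, and NIC
  assert(m.pool_free + m.ring_bursts * m.batch + m.nic_held == 12288);
  return m;
}
```

The returned state has 4096 free and 8192 held by the NIC, so a further `send_tx_burst()` is refused. Calling `reclaim()` on it restores 12288 free and reopens the gate, which is the step the idle path currently lacks.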

## Steps to reproduce

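Save the script below as `repro.sh`, make it executable, and run it inside the daqiri container with the following command. `NIC_ADDR`, `IP_SRC`, and `IP_DST` default to the values in the script; to override them, edit the defaults or pass them to the container with `-e`: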

```bash
docker run --rm --privileged --net=host \
  --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  -v /dev/hugepages:/dev/hugepages \
  -v "$(pwd)/repro.sh:/workspace/repro.sh" \
  daqiri:local /workspace/repro.sh
```

`repro.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

NIC_ADDR="${NIC_ADDR:-0005:03:00.0}"
IP_SRC="${IP_SRC:-10.0.0.1}"
IP_DST="${IP_DST:-10.0.0.2}"
MIN_PACKETS="${MIN_PACKETS:-1000000}"

BENCH=""
for d in ./build/examples /opt/daqiri/bin; do
  [[ -x "$d/daqiri_bench_raw_gpudirect" ]] && BENCH="$d/daqiri_bench_raw_gpudirect" && break
done
[[ -z "$BENCH" ]] && { echo "daqiri_bench_raw_gpudirect not found"; exit 1; }

CONFIG="/tmp/repro_tx_stall_$$.yaml"
trap 'rm -f "$CONFIG"' EXIT

cat > "$CONFIG" <<EOF
%YAML 1.2
---
daqiri:
  cfg:
    version: 1
    stream_type: "raw"
    master_core: 0
    debug: false
    log_level: "warn"
    loopback: ""
    memory_regions:
    - name: "Data_TX"
      kind: "host_pinned"
      affinity: 0
      num_bufs: 12288
      buf_size: 1064
    interfaces:
    - name: "tx_port"
      address: "${NIC_ADDR}"
      tx:
        queues:
        - name: "tx_q_0"
          id: 0
          batch_size: 4096
          cpu_core: 3
          memory_regions:
            - "Data_TX"
          offloads:
            - "tx_eth_src"
bench_tx:
  interface_name: "tx_port"
  cpu_core: 4
  batch_size: 4096
  payload_size: 1000
  header_size: 64
  eth_dst_addr: "ff:ff:ff:ff:ff:ff"
  ip_src_addr: "${IP_SRC}"
  ip_dst_addr: "${IP_DST}"
  udp_src_port: 4096
  udp_dst_port: 4096
EOF

OUTPUT=$(timeout --signal=KILL 30 "$BENCH" "$CONFIG" --seconds 10 2>&1) && rc=0 || rc=$?
echo "$OUTPUT"

if [[ $rc -eq 137 ]]; then
  echo -e "\nFAIL: killed after 30s — TX deadlock."
  exit 1
fi

TX_PACKETS=$(grep -oP 'packets=\K[0-9]+' <<< "$OUTPUT" | head -1)
: "${TX_PACKETS:=0}"

echo ""
echo "Transmitted: $TX_PACKETS packets (minimum: $MIN_PACKETS)"

if (( TX_PACKETS < MIN_PACKETS )); then
  echo -e "\nFAIL: throughput collapse — only $TX_PACKETS packets in 10s (need $MIN_PACKETS)."
  exit 1
fi

echo -e "\nPASS"
```

Output (look for the `TX complete` line):

```
TX complete: interface=tx_port queue=0 packets=~20480 bytes=21790720 bursts=5 seconds=10.0065
```

## Expected behavior

The benchmark should sustain >10 Mpps for the full 10-second run (>100M packets total).

## Actual behavior

Throughput collapses after the first few bursts: only ~20K packets are sent in 10 seconds (~2 Kpps).

## Environment overview

- Environment location: Docker
- Method of DAQIRI install: source (container build via `./scripts/build-container.sh`)

## Environment details

- OS: Ubuntu 24.04 (container base: `nvcr.io/nvidia/cuda:13.1.0-devel-ubuntu24.04`)
- DAQIRI version: main branch
- DPDK version: 25.11
- Hardware: IGX Dev Kit, ConnectX-7 (mlx5 PMD)

