[BUG] TX mempool exhaustion stall in DPDK tx_core_worker #207

Description

@lobolanja

Describe the bug

When the TX ring drains, tx_core_worker idles without polling NIC completions. Completed mbufs are never reclaimed, the mempool free count stays below the 2 × batch_size gate in is_tx_burst_available(), and send_tx_burst is permanently blocked.

Root Cause

send_tx_burst (app thread) gates new submissions on mempool availability:

// is_tx_burst_available()
if (rte_mempool_avail_count(q->pools[seg]) < burst->hdr.hdr.num_pkts * 2) {
  return false;  // → caller gets NO_FREE_PACKET_BUFFERS
}

tx_core_worker (TX lcore) reclaims mbufs only via rte_eth_tx_burst(), which is called only when the ring has work. When the ring is empty, the idle path does nothing:

// tx_core_worker main loop
while (!force_quit.load()) {
  if (rte_ring_dequeue(tparams->ring, reinterpret_cast<void**>(&msg)) != 0) {
    continue;  // ← no completion polling, mbufs never reclaimed
  }
  // ... rte_eth_tx_burst() only reached below ...
}

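Once the pool's free count drops below the gate while the ring is empty, neither side can make progress: the app thread waits for free mbufs, and the worker waits for ring entries that the app thread can no longer submit.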
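One possible shape for a fix is to poll for completions on the idle path as well. The sketch below is a self-contained model, not DAQIRI code: TxModel, reclaim_completed, nic_complete_all, and worker_iteration are hypothetical names, and the pool, ring, and NIC are reduced to plain counters. In a real worker the reclaim step would have to map onto whatever completion-polling mechanism the PMD provides (for example rte_eth_tx_done_cleanup(), which returns -ENOTSUP on drivers that do not implement it); whether that call is usable on mlx5 is not established here.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical stand-in for the mbuf pool, the TX ring, and one NIC queue.
struct TxModel {
  uint32_t pool_free = 0;      // mbufs currently available in the mempool
  std::deque<uint32_t> ring;   // queued bursts, stored as packet counts
  uint32_t nic_in_flight = 0;  // posted to the NIC, not yet completed
  uint32_t nic_completed = 0;  // completed by the NIC, not yet reclaimed

  // Return completed mbufs to the pool (models a completion poll).
  uint32_t reclaim_completed() {
    const uint32_t n = nic_completed;
    pool_free += n;
    nic_completed = 0;
    return n;
  }

  // Model of a TX burst: reclaim earlier completions, then post new packets.
  void tx_burst(uint32_t num_pkts) {
    reclaim_completed();
    nic_in_flight += num_pkts;
  }

  // The NIC finishes everything it was given.
  void nic_complete_all() {
    nic_completed += nic_in_flight;
    nic_in_flight = 0;
  }
};

// One iteration of the worker loop. poll_on_idle = false mirrors the
// behavior described in this issue; true is the assumed fix.
inline void worker_iteration(TxModel& m, bool poll_on_idle) {
  if (m.ring.empty()) {
    if (poll_on_idle) {
      m.reclaim_completed();  // idle path now returns mbufs to the pool
    }
    return;
  }
  const uint32_t num_pkts = m.ring.front();
  m.ring.pop_front();
  m.tx_burst(num_pkts);
}
```

In this model, starting from pool_free = 4096 with 8192 completed packets outstanding, an idle iteration with poll_on_idle = false leaves pool_free at 4096, while one with poll_on_idle = true brings it back to 12288, which clears the 8192 gate.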

Example (num_bufs=12288, batch_size=4096, num_tx_desc=8192)

Gate threshold = batch_size × 2 = 8192. Both bursts fit in the NIC descriptor ring (8192 descriptors), so tx_core_worker finishes and enters idle before the NIC completes any transmission.

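| Step | Action                               | Pool free | Ring | NIC in-flight          |
|------|--------------------------------------|-----------|------|------------------------|
| 1    | initial state                        | 12288     | 0    | 0                      |
| 2    | send_tx_burst: burst 1 (4096)        | 8192      | 1    | 0                      |
| 3    | send_tx_burst: burst 2 (4096)        | 4096      | 2    | 0                      |
| 4    | send_tx_burst: 4096 < 8192 gate      | BLOCKED   | 2    | 0                      |
| 5    | tx_core_worker: tx_burst(4096)       | 4096      | 1    | 4096                   |
| 6    | tx_core_worker: tx_burst(4096)       | 4096      | 0    | 8192                   |
| 7    | tx_core_worker: ring empty → continue | 4096      | 0    | 8192                   |
| 8    | NIC completes all 8192 packets       | 4096      | 0    | 8192 (never reclaimed) |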

After step 8 the pool is stuck at 4096 free < 8192 gate. send_tx_burst stays blocked, tx_core_worker spins on empty ring, and the 8192 completed mbufs are never returned to the pool.
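The accounting in this walkthrough can be replayed with a few counters. The following is a minimal sketch under simplifying assumptions: the app thread submits until the gate closes before the worker runs, and no completion is ever polled. State and replay_stall are hypothetical names introduced only for this illustration.

```cpp
#include <cstdint>

// Final counters after the walkthrough (hypothetical helper type).
struct State {
  uint32_t pool_free;        // free mbufs left in the pool
  uint32_t ring_depth;       // bursts still queued in the TX ring
  uint32_t nic_unreclaimed;  // mbufs handed to the NIC and never reclaimed
  bool blocked;              // true if the send gate is closed
};

// Replays steps 1-8 for the given configuration.
inline State replay_stall(uint32_t num_bufs, uint32_t batch_size,
                          uint32_t num_tx_desc) {
  State s{num_bufs, 0, 0, false};
  const uint32_t gate = 2 * batch_size;

  // App thread: the gate rejects a burst only when free < 2 * batch_size,
  // so submissions continue while pool_free >= gate.
  while (s.pool_free >= gate) {
    s.pool_free -= batch_size;
    ++s.ring_depth;
  }

  // Worker: drain the ring while the bursts fit in the descriptor ring.
  while (s.ring_depth > 0 && s.nic_unreclaimed + batch_size <= num_tx_desc) {
    --s.ring_depth;
    s.nic_unreclaimed += batch_size;
  }

  // Worker idles on the empty ring. The NIC completes its work, but nothing
  // polls completions, so no mbufs return to the pool.
  s.blocked = s.pool_free < gate;
  return s;
}
```

Calling replay_stall(12288, 4096, 8192) ends with pool_free = 4096, ring_depth = 0, nic_unreclaimed = 8192, and blocked = true, matching step 8.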

Steps to reproduce

Save the script below as repro.sh, make it executable, and run it inside the daqiri container with the following command (set NIC_ADDR, IP_SRC, and IP_DST as needed):

docker run --rm --privileged --net=host \
  --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  -v /dev/hugepages:/dev/hugepages \
  -v "$(pwd)/repro.sh:/workspace/repro.sh" \
  daqiri:local /workspace/repro.sh

repro.sh:

#!/usr/bin/env bash
set -euo pipefail

NIC_ADDR="${NIC_ADDR:-0005:03:00.0}"
IP_SRC="${IP_SRC:-10.0.0.1}"
IP_DST="${IP_DST:-10.0.0.2}"
MIN_PACKETS="${MIN_PACKETS:-1000000}"

BENCH=""
for d in ./build/examples /opt/daqiri/bin; do
  [[ -x "$d/daqiri_bench_raw_gpudirect" ]] && BENCH="$d/daqiri_bench_raw_gpudirect" && break
done
[[ -z "$BENCH" ]] && { echo "daqiri_bench_raw_gpudirect not found"; exit 1; }

CONFIG="/tmp/repro_tx_stall_$$.yaml"
trap 'rm -f "$CONFIG"' EXIT

cat > "$CONFIG" <<EOF
%YAML 1.2
---
daqiri:
  cfg:
    version: 1
    stream_type: "raw"
    master_core: 0
    debug: false
    log_level: "warn"
    loopback: ""
    memory_regions:
    - name: "Data_TX"
      kind: "host_pinned"
      affinity: 0
      num_bufs: 12288
      buf_size: 1064
    interfaces:
    - name: "tx_port"
      address: "${NIC_ADDR}"
      tx:
        queues:
        - name: "tx_q_0"
          id: 0
          batch_size: 4096
          cpu_core: 3
          memory_regions:
            - "Data_TX"
          offloads:
            - "tx_eth_src"
bench_tx:
  interface_name: "tx_port"
  cpu_core: 4
  batch_size: 4096
  payload_size: 1000
  header_size: 64
  eth_dst_addr: "ff:ff:ff:ff:ff:ff"
  ip_src_addr: "${IP_SRC}"
  ip_dst_addr: "${IP_DST}"
  udp_src_port: 4096
  udp_dst_port: 4096
EOF

OUTPUT=$(timeout --signal=KILL 30 "$BENCH" "$CONFIG" --seconds 10 2>&1) && rc=0 || rc=$?
echo "$OUTPUT"

if [[ $rc -eq 137 ]]; then
  echo -e "\nFAIL: killed after 30s — TX deadlock."
  exit 1
fi

TX_PACKETS=$(grep -oP 'packets=\K[0-9]+' <<< "$OUTPUT" | head -1)
: "${TX_PACKETS:=0}"

echo ""
echo "Transmitted: $TX_PACKETS packets (minimum: $MIN_PACKETS)"

if (( TX_PACKETS < MIN_PACKETS )); then
  echo -e "\nFAIL: throughput collapse — only $TX_PACKETS packets in 10s (need $MIN_PACKETS)."
  exit 1
fi

echo -e "\nPASS"

Output (look for the TX complete line):

TX complete: interface=tx_port queue=0 packets=~20480 bytes=21790720 bursts=5 seconds=10.0065
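As a sanity check on that line, the reported counters are consistent with the configuration above: 5 bursts of batch_size packets, each packet sized header_size + payload_size. The constants below are copied from the repro config and the log line; the names and the average_pps helper are hypothetical.

```cpp
#include <cstdint>

constexpr uint64_t kBatchSize = 4096;        // batch_size in the config
constexpr uint64_t kFrameBytes = 64 + 1000;  // header_size + payload_size
constexpr uint64_t kBursts = 5;              // bursts= in the log line
constexpr uint64_t kPackets = kBursts * kBatchSize;

static_assert(kPackets == 20480, "packets = bursts * batch_size");
static_assert(kPackets * kFrameBytes == 21790720,
              "bytes = packets * (header_size + payload_size)");

// Average packet rate over the whole run.
constexpr double average_pps(uint64_t packets, double seconds) {
  return static_cast<double>(packets) / seconds;
}
```

With seconds=10.0065 from the log, average_pps(kPackets, 10.0065) comes out at roughly 2 Kpps.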

Expected behavior

The benchmark should sustain >10 Mpps for the full 10-second run (>100M packets total).

Actual behavior

Throughput collapses after 2 bursts — only ~20K packets sent in 10 seconds (~2 Kpps).

Environment overview

  • Environment location: Docker
  • Method of DAQIRI install: source (container build via ./scripts/build-container.sh)

Environment details

  • OS: Ubuntu 24.04 (container base: nvcr.io/nvidia/cuda:13.1.0-devel-ubuntu24.04)
  • DAQIRI version: main branch
  • DPDK version: 25.11
  • Hardware: IGX Dev Kit, ConnectX-7 (mlx5 PMD)
