Skip to content

#207: Fix TX mempool exhaustion deadlock in tx_core_worker#208

Open
lobolanja wants to merge 3 commits into
NVIDIA:mainfrom
lobolanja:fix/tx-mempool-stall
Open

#207: Fix TX mempool exhaustion deadlock in tx_core_worker#208
lobolanja wants to merge 3 commits into
NVIDIA:mainfrom
lobolanja:fix/tx-mempool-stall

Conversation

@lobolanja

Copy link
Copy Markdown

Fixes #207

Problem

tx_core_worker never polls NIC TX completions while idling (empty ring), so mbufs are never reclaimed and is_tx_burst_available() blocks the producer permanently. Throughput collapses from >10 Mpps to ~2 Kpps.

Fix

Poll completions in the tx_core_worker idle path so mbufs are returned to the pool even when no new work is queued.

Why the DPDK patch is needed

DPDK provides rte_eth_tx_done_cleanup() as the standard API for explicitly reclaiming completed TX mbufs without sending new packets. However, the mlx5 PMD in DPDK 25.11 does not implement the tx_done_cleanup callback — calling it returns -ENOTSUP.

The only ways to trigger mlx5 completion processing (mlx5_tx_handle_completion) are:

Method API Problem
Send packets rte_eth_tx_burst() Requires work in the ring — useless when idle
Check descriptor rte_eth_tx_descriptor_status() Calls mlx5_tx_handle_completion() as a side effect, but that's not the function's contract — it is meant to report descriptor state, not reclaim mbufs. Future DPDK versions could change this behavior.
Done cleanup rte_eth_tx_done_cleanup() Correct API, but returns -ENOTSUP on mlx5

The patch adds a trivial mlx5_tx_done_cleanup() (~20 lines) that wraps the existing mlx5_tx_handle_completion() — no new completion logic is introduced. It simply exposes the internal function through the official DPDK callback so rte_eth_tx_done_cleanup() works as intended.

Without this patch, the only alternative is the rte_eth_tx_descriptor_status side-effect workaround (commit 1), which is fragile and semantically incorrect.

Commits

Commit Description
9f62fe2 Workaround — call rte_eth_tx_descriptor_status(port, queue, 0) in idle path. On mlx5 this triggers mlx5_tx_handle_completion() as a side effect.
19f5329 DPDK patch — add mlx5_tx_done_cleanup() to mlx5 PMD as a wrapper around mlx5_tx_handle_completion. Registered in mlx5_dev_ops and mlx5_dev_ops_isolate.
a7354dc Use proper API — replace descriptor_status workaround with rte_eth_tx_done_cleanup(port, queue, 0).

Reproduction

See the self-contained test script in issue #207.

# Without fix (demonstrates the bug):
# → ~20K packets in 10s → FAIL

# With fix:
# → >10M packets in 10s → PASS

Testing

  • Container build: clean, no warnings
  • Throughput test PASS after commit 1 (workaround) and after commit 3 (final fix)

Poll NIC TX completions via rte_eth_tx_descriptor_status() in the
tx_core_worker idle path (empty ring).  This triggers
mlx5_tx_handle_completion() as a side effect, reclaiming mbufs so
is_tx_burst_available() unblocks.

Signed-off-by: Juanca-Lobolanja <lobolanja@gmail.com>
Add mlx5_tx_done_cleanup() to the mlx5 PMD (DPDK 25.11) as a thin
wrapper around mlx5_tx_handle_completion().  Registers the callback in
mlx5_dev_ops and mlx5_dev_ops_isolate so rte_eth_tx_done_cleanup()
no longer returns -ENOTSUP.

Signed-off-by: Juanca-Lobolanja <lobolanja@gmail.com>
Replace the rte_eth_tx_descriptor_status() side-effect workaround with
the proper rte_eth_tx_done_cleanup() API, now that the mlx5 patch
provides the callback.

Signed-off-by: Juanca-Lobolanja <lobolanja@gmail.com>
@greptile-apps

greptile-apps Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Fixes the TX mempool exhaustion deadlock in tx_core_worker where mbufs were never reclaimed while the ring was empty, causing throughput collapse. The fix adds a rte_eth_tx_done_cleanup() call in the idle path so the PMD reclaims completed TX mbufs back to the pool even when no new work is queued.

  • src/engines/dpdk/daqiri_dpdk_engine.cpp: Replaces the bare continue in the idle branch with rte_eth_tx_done_cleanup(port, queue, 0) before continue; the comment clearly explains why rte_eth_tx_burst() alone is insufficient as a reclaim path.
  • dpdk_patches/mlx5_tx_done_cleanup.patch (ignored per repo config): Provides the mlx5 PMD callback needed to make rte_eth_tx_done_cleanup() return something other than -ENOTSUP on that driver.

Confidence Score: 4/5

The change is a targeted one-liner in the idle spin path of tx_core_worker with a clear comment; no new allocations, no vtable changes, and all three commits carry proper DCO sign-offs and correctly formatted titles.

The code change itself is straightforward and low-risk — adding a reclaim call in the idle path is exactly the right fix and doesn't touch any allocation or free path. The only open item is a doc-sync pass against the engine docs, which is a housekeeping requirement regardless of whether new content is actually needed.

No files require close scrutiny; src/engines/dpdk/daqiri_dpdk_engine.cpp is the only reviewed file and the change is minimal and self-explanatory.

Important Files Changed

Filename Overview
src/engines/dpdk/daqiri_dpdk_engine.cpp Adds rte_eth_tx_done_cleanup() call in the tx_core_worker idle path to reclaim TX mbufs back to the pool when no packets are queued; fix is correctly scoped to the DpdkEngine implementation with no API or vtable changes.

Reviews (1): Last reviewed commit: "#207 - Use rte_eth_tx_done_cleanup for i..." | Re-trigger Greptile

@cliffburdick

Copy link
Copy Markdown
Collaborator

Hi @lobolanja, this appears to be a patch for DPDK. Should this be submitted to that project?

@lobolanja

lobolanja commented Jun 30, 2026

Copy link
Copy Markdown
Author

Hi @cliffburdick, Yes, the DPDK patch (mlx5_tx_done_cleanup) is intended for upstream submission as well. I started here because I wanted to validate the approach with the DAQIRI team first: I wasn't fully confident this is a genuine gap rather than a misuse on mi side (e.g., maybe there's a different intended pattern for idle-path mbuf reclaim that I'm missing).

I'm also happy to submit the mlx5 patch upstream to DPDK via their mailing list process, though that's a separate and longer cycle. In the meantime, carrying the patch in daqiri's dpdk_patches/ keeps things working without waiting on upstream.

If the fix looks correct to you, I'll go ahead and open a patch on dpdk.org in parallel.

@cliffburdick

Copy link
Copy Markdown
Collaborator

Hi @lobolanja, we're trying to migrate most users to our ibverbs backend since it performs better than the DPDK backend. This means we're also trying to deprecate the patches. I don't know enough about the mlx5 tx freeing to know whether this is needed, so perhaps you could submit it there and see if it gets accepted. In my experience things get reviewed within a few days of you submitting the patches, and it would at least be on the dev branch of DPDK.

@lobolanja

Copy link
Copy Markdown
Author

@cliffburdick

That makes sense, thanks for the context on the ibverbs migration.

In that case, would you be okay with keeping just commit 9f62fe2 (the rte_eth_tx_descriptor_status workaround)? It fixes the deadlock without adding any DPDK patch to maintain. I can rebase the PR to drop commits 2 and 3.

In parallel, I'll submit the tx_done_cleanup patch upstream to DPDK. Once it lands there, the workaround here can be swapped for the proper API in a follow-up, or it may become a non-issue if the engine moves to ibverbs by then.

@cliffburdick

Copy link
Copy Markdown
Collaborator

Yes, I think that's reasonable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] TX mempool exhaustion stall in DPDK tx_core_worker

2 participants