
[https://nvbugs/5916092][fix] Fix MTP+PP hang by preserving speculative layer weights on last PP rank #12555

Open
xxi-nv wants to merge 6 commits into NVIDIA:main from xxi-nv:dev-xxi-bug5916092-bot

Conversation

@xxi-nv
Collaborator

@xxi-nv xxi-nv commented Mar 26, 2026

Summary

  • Fix hang in DecoderModel.__pp_init__() when both pipeline parallelism (PP) and MTP speculative decoding are enabled
  • MTP layers appended beyond num_hidden_layers were incorrectly getting skip_forward() + remove_weights() on all PP ranks, including the last rank where the MTP draft worker needs them
  • For layers beyond num_hidden_layers: always make them no-ops in the main decoder loop (all ranks), but only remove weights on non-last PP ranks

Affects: DeepSeekV3, NemotronH, ExaoneMoE, GLM (all use self.model.layers.extend(self.draft_model.mtp_layers))
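As a minimal illustration of the layout described above (toy classes, not the real TensorRT-LLM model code), the extend call places the draft layers at indices at or above num_hidden_layers:

```python
# Toy sketch of how MTP draft layers land beyond num_hidden_layers.
# All names here are illustrative stand-ins, not the real classes.
class ToyLayer:
    def __init__(self, idx):
        self.idx = idx

num_hidden_layers = 4
layers = [ToyLayer(i) for i in range(num_hidden_layers)]  # base decoder layers

# Mirrors self.model.layers.extend(self.draft_model.mtp_layers)
mtp_layers = [ToyLayer(num_hidden_layers + i) for i in range(2)]
layers.extend(mtp_layers)

# The draft layers are exactly the indices >= num_hidden_layers
extra_indices = [layer.idx for layer in layers if layer.idx >= num_hidden_layers]
print(extra_indices)  # [4, 5]
```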

Regression

Introduced by commit 2acd03030 ([https://nvbugs/5781589][fix] Implement pp skip forward for all spec workers. (#10578), 2026-01-14, @yuxianq).

That commit changed __pp_init__'s layer loop from range(num_hidden_layers) to enumerate(self.layers) to handle speculative worker layers, and simultaneously removed the epilogue-based handling (self.epilogue.extend(self.draft_model.mtp_layers)) from DeepSeekV3/GLM. The new loop correctly iterates ALL layers, but does not distinguish MTP layers (index >= num_hidden_layers) from base model layers: both get skip_forward() + remove_weights() on all ranks, including the last PP rank where the MTP draft worker needs them.

Root cause

__pp_init__ iterates ALL self.layers, including MTP layers at indices >= num_hidden_layers. Since these indices are not in pp_layer_list, the existing code calls skip_forward() which:

  1. Replaces layer.forward with a no-op
  2. Calls remove_weights() to free GPU memory

This happens on ALL ranks, including the last PP rank. The MTP draft worker (running on rank N-1) then finds its speculative layers have no weights → hang.
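The failure mode can be sketched with a toy layer (hypothetical names mirroring the description above, not the real implementation): once skip_forward is bound and the weights are freed, the real forward path the draft worker needs is gone.

```python
class ToyLayer:
    """Toy stand-in for a decoder layer; not the real TensorRT-LLM class."""

    def __init__(self):
        self.weights = [0.5, 2.0]  # stand-in for GPU parameters

    def forward(self, x):
        if self.weights is None:
            # In the real system the draft worker stalls here -> hang
            raise RuntimeError("weights were removed on this rank")
        return [w * x for w in self.weights]

    def skip_forward(self, x):
        return x  # no-op for the main decoder loop


def remove_weights(layer):
    layer.weights = None  # frees memory; hypothetical mirror of the real helper


layer = ToyLayer()
# Pre-fix behavior on EVERY rank, including the last PP rank:
layer.forward = layer.skip_forward
remove_weights(layer)

print(layer.forward(3))  # 3 -- the no-op path still "works"...
# ...but the real computation the draft worker needs is broken:
try:
    ToyLayer.forward(layer, 3)
except RuntimeError as err:
    print(err)  # weights were removed on this rank
```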

Fix

for layer_idx, layer in enumerate(self.layers):
    if layer_idx >= num_hidden_layers:
        # Make MTP layers no-ops in main decoder loop (all ranks)
        if hasattr(layer, 'skip_forward'):
            layer.forward = layer.skip_forward
        # Only remove weights on non-last PP ranks
        if not mapping.is_last_pp_rank():
            remove_weights(layer)
        continue

Note: simply reverting to range(num_hidden_layers) does NOT work because the same commit also removed the epilogue-based MTP handling from model files.
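Why MTP indices always miss pp_layer_list can be seen from how PP partitions the base layers: every rank's slice covers only indices below num_hidden_layers. A simplified even-split sketch (the real mapping logic may balance layers differently):

```python
def pp_layer_list(num_hidden_layers, pp_size, pp_rank):
    """Simplified even partition of base layers across PP ranks.

    Illustrative only; the real TensorRT-LLM mapping may distribute
    layers differently, but never assigns indices >= num_hidden_layers.
    """
    per_rank = num_hidden_layers // pp_size
    start = pp_rank * per_rank
    return list(range(start, start + per_rank))

num_hidden_layers, pp_size = 8, 4
owned = [pp_layer_list(num_hidden_layers, pp_size, r) for r in range(pp_size)]
print(owned)  # [[0, 1], [2, 3], [4, 5], [6, 7]]

# MTP layers appended at indices 8 and 9 belong to no rank's pp_layer_list,
# so every rank takes the skip branch for them -- hence the need to gate
# remove_weights on is_last_pp_rank.
mtp_indices = [8, 9]
assert all(i not in chunk for i in mtp_indices for chunk in owned)
```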

Test plan

  • CI: L0_Test-SBSA-Multi-GPU TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=CUTLASS-mtp_nextn=2-pp4-...] should pass (was hanging)
  • CI: Verify no regressions in other PP and MTP tests
  • Removed waive for bug 5916092 from waives.txt

@xxi-nv xxi-nv requested a review from a team as a code owner March 26, 2026 08:18
@xxi-nv xxi-nv requested a review from yechank-nvidia March 26, 2026 08:18
@xxi-nv
Collaborator Author

xxi-nv commented Mar 26, 2026

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Mar 26, 2026

📝 Walkthrough

Walkthrough

Modified pipeline-parallel layer initialization in DecoderModel to handle cases where self.layers exceeds num_hidden_layers by converting extra layers' forward methods to skip variants and selectively removing weights based on PP rank. Additionally removed one test-waive entry for DeepSeekV3Lite.

Changes

Pipeline-Parallel Layer Initialization (tensorrt_llm/_torch/models/modeling_utils.py):
Updated DecoderModel.__pp_init__ to handle extra layers in the layers list by converting their forward to skip_forward when available and removing weights on non-last PP ranks.

Test Waiver (tests/integration/test_lists/waives.txt):
Removed waive entry for accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus with specific parameter set and its associated bug reference.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Title check: ✅ Passed. The PR title clearly describes the main fix: resolving an MTP+PP hang by preserving speculative layer weights on the last PP rank, which directly addresses the core issue in the changeset.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check: ✅ Passed. PR description comprehensively covers the issue, root cause, solution, affected models, and test plan with clear structure.



Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_utils.py (1)

307-316: Good fix for the MTP+PP hang issue.

The logic correctly preserves weights on the last PP rank while making extra layers no-ops in the main decoder loop. The separation of concerns (skip forward vs. remove weights) is well-reasoned.

Consider adding a warning when skip_forward is unavailable. If a layer doesn't have skip_forward, its forward remains unchanged. On the last PP rank, this means the layer would execute normally in the main decoder loop instead of being a no-op. While unlikely with MTP layers (DecoderLayer instances have skip_forward), a warning would be consistent with the existing skip_forward function at lines 163-165 and help debug unexpected behavior.

Suggested defensive warning
             if layer_idx >= num_hidden_layers:
                 # Extra layers (e.g., MTP speculative layers) appended beyond
                 # the base model. Skip their forward on all ranks so they are
                 # no-ops in the main decoder loop, but preserve weights on the
                 # last PP rank where the MTP draft worker needs them.
                 if hasattr(layer, 'skip_forward'):
                     layer.forward = layer.skip_forward
+                else:
+                    logger.warning(
+                        f"Layer {layer_idx} ({layer.__class__.__name__}) does not have "
+                        f"`skip_forward`; it will not be a no-op in the main decoder loop.")
                 if not mapping.is_last_pp_rank():
                     remove_weights(layer)
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_utils.py` around lines 307 - 316, When
handling extra layers (layer_idx >= num_hidden_layers) add a defensive warning
if a layer does not have skip_forward: if hasattr(layer, 'skip_forward') is
false, emit a warning (e.g., logging.warning or warnings.warn) that the extra
layer lacks skip_forward so its forward will remain active on the last PP rank;
keep the existing behavior of calling layer.skip_forward when present and
remove_weights when not mapping.is_last_pp_rank(), but log this unexpected
condition referencing layer_idx, the layer object, and mapping.is_last_pp_rank()
to aid debugging (mirror the existing skip_forward handling used elsewhere).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 34356177-7950-4e51-9c55-a9da2695250f

📥 Commits

Reviewing files that changed from the base of the PR and between e44df9e and d5c4c88.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@tensorrt-cicd
Collaborator

PR_Github #40435 [ run ] triggered by Bot. Commit: d5c4c88 Link to invocation

@xxi-nv xxi-nv removed the request for review from yechank-nvidia March 26, 2026 09:22
@tensorrt-cicd
Collaborator

PR_Github #40435 [ run ] completed with state SUCCESS. Commit: d5c4c88
/LLM/main/L0_MergeRequest_PR pipeline #31527 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Comment thread tensorrt_llm/_torch/models/modeling_utils.py Outdated
@xxi-nv
Collaborator Author

xxi-nv commented Apr 1, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41175 [ run ] triggered by Bot. Commit: c1a03e7 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41175 [ run ] completed with state SUCCESS. Commit: c1a03e7
/LLM/main/L0_MergeRequest_PR pipeline #32140 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 2, 2026

/bot run --disable-fail-fast

1 similar comment
@xxi-nv
Collaborator Author

xxi-nv commented Apr 2, 2026

/bot run --disable-fail-fast

@xxi-nv xxi-nv requested a review from syuoni April 2, 2026 07:03
@tensorrt-cicd
Collaborator

PR_Github #41373 [ run ] triggered by Bot. Commit: c1a03e7 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41373 [ run ] completed with state SUCCESS. Commit: c1a03e7
/LLM/main/L0_MergeRequest_PR pipeline #32312 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

xxi-nv added 2 commits April 21, 2026 02:08
…ve layer weights on last PP rank

DecoderModel.__pp_init__ iterates all layers including MTP speculative layers
appended via layers.extend(mtp_layers). Since MTP layer indices exceed
num_hidden_layers, they are not in pp_layer_list and get skip_forward() on
ALL ranks, which replaces forward with a no-op AND removes weights.

On non-last PP ranks this is correct (MTP layers unused). But on the last PP
rank, the MTP draft worker needs the layer weights. Removing them causes a
hang during generation.

Fix: For layers beyond num_hidden_layers, always replace forward with
skip_forward (so they are no-ops in the main decoder loop on all ranks),
but only remove weights on non-last PP ranks. The last PP rank preserves
weights so the MTP speculative decoding worker can use them.

Affects all models using layers.extend(mtp_layers): DeepSeekV3, NemotronH,
ExaoneMoE, and GLM.

Signed-off-by: xxi <xxi@nvidia.com>
…eedback

Separate the MTP speculative layer handling into its own loop instead of
using a guard clause with continue inside the PP layer partitioning loop.
This improves readability by cleanly separating the two concerns.

Signed-off-by: xxi <xxi@nvidia.com>
@xxi-nv xxi-nv force-pushed the dev-xxi-bug5916092-bot branch from c1a03e7 to 00146e3 Compare April 21, 2026 02:14
@xxi-nv
Collaborator Author

xxi-nv commented Apr 21, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44572 [ run ] triggered by Bot. Commit: a39a74f Link to invocation

@xxi-nv xxi-nv enabled auto-merge (squash) April 21, 2026 06:34
@tensorrt-cicd
Collaborator

PR_Github #44572 [ run ] completed with state SUCCESS. Commit: a39a74f
/LLM/main/L0_MergeRequest_PR pipeline #34959 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 21, 2026

/bot run --disable-fail-fast

@github-actions

👎 Promotion blocked, new vulnerability found

Vulnerability report

Component: encode/uvicorn
Vulnerability: CVE-2020-7694 (HIGH)
Description: This affects all versions of package uvicorn. The request logger provided by the package is vulnerable to ANSI escape sequence injection. Whenever any HTTP request is received, the default behaviour of uvicorn is to log its details to either the console or a log file. When attackers request crafted URLs with percent-encoded escape sequences, the logging component will log the URL after it has been processed with urllib.parse.unquote, therefore converting any percent-encoded characters into their single-character equivalent, which can have special meaning in terminal emulators. By requesting URLs with crafted paths, attackers can: pollute uvicorn's access logs, therefore jeopardising the integrity of such files; and use ANSI sequence codes to attempt to interact with the terminal emulator that is displaying the logs (either in real time or from a file).

Component: encode/uvicorn
Vulnerability: CVE-2020-7695 (MEDIUM)
Description: Uvicorn before 0.11.7 is vulnerable to HTTP response splitting. CRLF sequences are not escaped in the value of HTTP headers. Attackers can exploit this to add arbitrary headers to HTTP responses, or even return an arbitrary response body, whenever crafted input is used to construct HTTP headers.

@xxi-nv
Collaborator Author

xxi-nv commented Apr 21, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44707 [ run ] triggered by Bot. Commit: a39a74f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44707 [ run ] completed with state FAILURE. Commit: a39a74f
/LLM/main/L0_MergeRequest_PR pipeline #35071 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 21, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44722 [ run ] triggered by Bot. Commit: 3b987bd Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44722 [ run ] completed with state SUCCESS. Commit: 3b987bd
/LLM/main/L0_MergeRequest_PR pipeline #35085 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 22, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44861 [ run ] triggered by Bot. Commit: baf4f23 Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 23, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45036 [ run ] triggered by Bot. Commit: baf4f23 Link to invocation
