
[https://nvbugs/5916092][fix] Fix MTP+PP hang by preserving speculative layer weights on last PP rank #12555

Open
xxi-nv wants to merge 6 commits into NVIDIA:main from xxi-nv:dev-xxi-bug5916092-bot

Conversation

@xxi-nv
Collaborator

@xxi-nv xxi-nv commented Mar 26, 2026

Summary

  • Fix hang in DecoderModel.__pp_init__() when both pipeline parallelism (PP) and MTP speculative decoding are enabled
  • MTP layers appended beyond num_hidden_layers were incorrectly getting skip_forward() + remove_weights() on all PP ranks, including the last rank where the MTP draft worker needs them
  • For layers beyond num_hidden_layers: always make them no-ops in the main decoder loop (all ranks), but only remove weights on non-last PP ranks

Affects: DeepSeekV3, NemotronH, ExaoneMoE, GLM (all use self.model.layers.extend(self.draft_model.mtp_layers))
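As a minimal illustration of the layout described above (toy classes, not the real TensorRT-LLM model code), the extend call places the draft layers at indices at or above num_hidden_layers:

```python
# Toy sketch of how MTP draft layers land beyond num_hidden_layers.
# All names here are illustrative stand-ins, not the real classes.
class ToyLayer:
    def __init__(self, idx):
        self.idx = idx

num_hidden_layers = 4
layers = [ToyLayer(i) for i in range(num_hidden_layers)]  # base decoder layers

# Mirrors self.model.layers.extend(self.draft_model.mtp_layers)
mtp_layers = [ToyLayer(num_hidden_layers + i) for i in range(2)]
layers.extend(mtp_layers)

# The draft layers are exactly the indices >= num_hidden_layers
extra_indices = [layer.idx for layer in layers if layer.idx >= num_hidden_layers]
print(extra_indices)  # [4, 5]
```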

Regression

Introduced by commit 2acd03030 ([https://nvbugs/5781589][fix] Implement pp skip forward for all spec workers. (#10578), 2026-01-14, @yuxianq).

That commit changed __pp_init__'s layer loop from range(num_hidden_layers) to enumerate(self.layers) to handle speculative worker layers, and simultaneously removed the epilogue-based handling (self.epilogue.extend(self.draft_model.mtp_layers)) from DeepSeekV3/GLM. The new loop correctly iterates ALL layers, but does not distinguish MTP layers (index >= num_hidden_layers) from base model layers: both get skip_forward() + remove_weights() on all ranks, including the last PP rank where the MTP draft worker needs them.

Root cause

__pp_init__ iterates ALL self.layers, including MTP layers at indices >= num_hidden_layers. Since these indices are not in pp_layer_list, the existing code calls skip_forward() which:

  1. Replaces layer.forward with a no-op
  2. Calls remove_weights() to free GPU memory

This happens on ALL ranks, including the last PP rank. The MTP draft worker (running on rank N-1) then finds its speculative layers have no weights → hang.
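The failure mode can be sketched with a toy layer (hypothetical names mirroring the description above, not the real implementation): once skip_forward is bound and the weights are freed, the real forward path the draft worker needs is gone.

```python
class ToyLayer:
    """Toy stand-in for a decoder layer; not the real TensorRT-LLM class."""

    def __init__(self):
        self.weights = [0.5, 2.0]  # stand-in for GPU parameters

    def forward(self, x):
        if self.weights is None:
            # In the real system the draft worker stalls here -> hang
            raise RuntimeError("weights were removed on this rank")
        return [w * x for w in self.weights]

    def skip_forward(self, x):
        return x  # no-op for the main decoder loop


def remove_weights(layer):
    layer.weights = None  # frees memory; hypothetical mirror of the real helper


layer = ToyLayer()
# Pre-fix behavior on EVERY rank, including the last PP rank:
layer.forward = layer.skip_forward
remove_weights(layer)

print(layer.forward(3))  # 3 -- the no-op path still "works"...
# ...but the real computation the draft worker needs is broken:
try:
    ToyLayer.forward(layer, 3)
except RuntimeError as err:
    print(err)  # weights were removed on this rank
```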

Fix

for layer_idx, layer in enumerate(self.layers):
    if layer_idx >= num_hidden_layers:
        # Make MTP layers no-ops in main decoder loop (all ranks)
        if hasattr(layer, 'skip_forward'):
            layer.forward = layer.skip_forward
        # Only remove weights on non-last PP ranks
        if not mapping.is_last_pp_rank():
            remove_weights(layer)
        continue

Note: simply reverting to range(num_hidden_layers) does NOT work because the same commit also removed the epilogue-based MTP handling from model files.
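Why MTP indices always miss pp_layer_list can be seen from how PP partitions the base layers: every rank's slice covers only indices below num_hidden_layers. A simplified even-split sketch (the real mapping logic may balance layers differently):

```python
def pp_layer_list(num_hidden_layers, pp_size, pp_rank):
    """Simplified even partition of base layers across PP ranks.

    Illustrative only; the real TensorRT-LLM mapping may distribute
    layers differently, but never assigns indices >= num_hidden_layers.
    """
    per_rank = num_hidden_layers // pp_size
    start = pp_rank * per_rank
    return list(range(start, start + per_rank))

num_hidden_layers, pp_size = 8, 4
owned = [pp_layer_list(num_hidden_layers, pp_size, r) for r in range(pp_size)]
print(owned)  # [[0, 1], [2, 3], [4, 5], [6, 7]]

# MTP layers appended at indices 8 and 9 belong to no rank's pp_layer_list,
# so every rank takes the skip branch for them -- hence the need to gate
# remove_weights on is_last_pp_rank.
mtp_indices = [8, 9]
assert all(i not in chunk for i in mtp_indices for chunk in owned)
```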

Test plan

  • CI: L0_Test-SBSA-Multi-GPU TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=CUTLASS-mtp_nextn=2-pp4-...] should pass (was hanging)
  • CI: Verify no regressions in other PP and MTP tests
  • Removed waive for bug 5916092 from waives.txt

@xxi-nv xxi-nv requested a review from a team as a code owner March 26, 2026 08:18
@xxi-nv xxi-nv requested a review from yechank-nvidia March 26, 2026 08:18
@xxi-nv
Collaborator Author

xxi-nv commented Mar 26, 2026

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Mar 26, 2026

📝 Walkthrough

Walkthrough

Modified pipeline-parallel layer initialization in DecoderModel to handle cases where self.layers exceeds num_hidden_layers by converting extra layers' forward methods to skip variants and selectively removing weights based on PP rank. Additionally removed one test-waive entry for DeepSeekV3Lite.

Changes

Pipeline-Parallel Layer Initialization (tensorrt_llm/_torch/models/modeling_utils.py):
Updated DecoderModel.__pp_init__ to handle extra layers in the layers list by converting their forward to skip_forward when available and removing weights on non-last PP ranks.

Test Waiver (tests/integration/test_lists/waives.txt):
Removed waive entry for accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus with specific parameter set and its associated bug reference.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Title check: ✅ Passed. The PR title clearly describes the main fix: resolving an MTP+PP hang by preserving speculative layer weights on the last PP rank, which directly addresses the core issue in the changeset.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check: ✅ Passed. PR description comprehensively covers the issue, root cause, solution, affected models, and test plan with clear structure.



Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_utils.py (1)

307-316: Good fix for the MTP+PP hang issue.

The logic correctly preserves weights on the last PP rank while making extra layers no-ops in the main decoder loop. The separation of concerns (skip forward vs. remove weights) is well-reasoned.

Consider adding a warning when skip_forward is unavailable. If a layer doesn't have skip_forward, its forward remains unchanged. On the last PP rank, this means the layer would execute normally in the main decoder loop instead of being a no-op. While unlikely with MTP layers (DecoderLayer instances have skip_forward), a warning would be consistent with the existing skip_forward function at lines 163-165 and help debug unexpected behavior.

Suggested defensive warning
             if layer_idx >= num_hidden_layers:
                 # Extra layers (e.g., MTP speculative layers) appended beyond
                 # the base model. Skip their forward on all ranks so they are
                 # no-ops in the main decoder loop, but preserve weights on the
                 # last PP rank where the MTP draft worker needs them.
                 if hasattr(layer, 'skip_forward'):
                     layer.forward = layer.skip_forward
+                else:
+                    logger.warning(
+                        f"Layer {layer_idx} ({layer.__class__.__name__}) does not have "
+                        f"`skip_forward`; it will not be a no-op in the main decoder loop.")
                 if not mapping.is_last_pp_rank():
                     remove_weights(layer)
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_utils.py` around lines 307 - 316, When
handling extra layers (layer_idx >= num_hidden_layers) add a defensive warning
if a layer does not have skip_forward: if hasattr(layer, 'skip_forward') is
false, emit a warning (e.g., logging.warning or warnings.warn) that the extra
layer lacks skip_forward so its forward will remain active on the last PP rank;
keep the existing behavior of calling layer.skip_forward when present and
remove_weights when not mapping.is_last_pp_rank(), but log this unexpected
condition referencing layer_idx, the layer object, and mapping.is_last_pp_rank()
to aid debugging (mirror the existing skip_forward handling used elsewhere).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 34356177-7950-4e51-9c55-a9da2695250f

📥 Commits

Reviewing files that changed from the base of the PR and between e44df9e and d5c4c88.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@tensorrt-cicd
Collaborator

PR_Github #40435 [ run ] triggered by Bot. Commit: d5c4c88 Link to invocation

@xxi-nv xxi-nv removed the request for review from yechank-nvidia March 26, 2026 09:22
@tensorrt-cicd
Collaborator

PR_Github #40435 [ run ] completed with state SUCCESS. Commit: d5c4c88
/LLM/main/L0_MergeRequest_PR pipeline #31527 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Comment thread tensorrt_llm/_torch/models/modeling_utils.py Outdated
@xxi-nv
Collaborator Author

xxi-nv commented Apr 1, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41175 [ run ] triggered by Bot. Commit: c1a03e7 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41175 [ run ] completed with state SUCCESS. Commit: c1a03e7
/LLM/main/L0_MergeRequest_PR pipeline #32140 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 2, 2026

/bot run --disable-fail-fast

1 similar comment
@xxi-nv
Collaborator Author

xxi-nv commented Apr 2, 2026

/bot run --disable-fail-fast

@xxi-nv xxi-nv requested a review from syuoni April 2, 2026 07:03
@tensorrt-cicd
Collaborator

PR_Github #41373 [ run ] triggered by Bot. Commit: c1a03e7 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41373 [ run ] completed with state SUCCESS. Commit: c1a03e7
/LLM/main/L0_MergeRequest_PR pipeline #32312 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

xxi-nv added 2 commits April 21, 2026 02:08
…ve layer weights on last PP rank

DecoderModel.__pp_init__ iterates all layers including MTP speculative layers
appended via layers.extend(mtp_layers). Since MTP layer indices exceed
num_hidden_layers, they are not in pp_layer_list and get skip_forward() on
ALL ranks, which replaces forward with a no-op AND removes weights.

On non-last PP ranks this is correct (MTP layers unused). But on the last PP
rank, the MTP draft worker needs the layer weights. Removing them causes a
hang during generation.

Fix: For layers beyond num_hidden_layers, always replace forward with
skip_forward (so they are no-ops in the main decoder loop on all ranks),
but only remove weights on non-last PP ranks. The last PP rank preserves
weights so the MTP speculative decoding worker can use them.

Affects all models using layers.extend(mtp_layers): DeepSeekV3, NemotronH,
ExaoneMoE, and GLM.

Signed-off-by: xxi <xxi@nvidia.com>
…eedback

Separate the MTP speculative layer handling into its own loop instead of
using a guard clause with continue inside the PP layer partitioning loop.
This improves readability by cleanly separating the two concerns.

Signed-off-by: xxi <xxi@nvidia.com>
@xxi-nv xxi-nv force-pushed the dev-xxi-bug5916092-bot branch from c1a03e7 to 00146e3 Compare April 21, 2026 02:14
@xxi-nv
Collaborator Author

xxi-nv commented Apr 21, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44572 [ run ] triggered by Bot. Commit: a39a74f Link to invocation

@xxi-nv xxi-nv enabled auto-merge (squash) April 21, 2026 06:34
@tensorrt-cicd
Collaborator

PR_Github #44572 [ run ] completed with state SUCCESS. Commit: a39a74f
/LLM/main/L0_MergeRequest_PR pipeline #34959 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 21, 2026

/bot run --disable-fail-fast

@github-actions

👎 Promotion blocked, new vulnerability found

Vulnerability report

Component: encode/uvicorn
Vulnerability: CVE-2020-7694 (HIGH)
Description: This affects all versions of package uvicorn. The request logger provided by the package is vulnerable to ANSI escape sequence injection. Whenever any HTTP request is received, the default behaviour of uvicorn is to log its details to either the console or a log file. When attackers request crafted URLs with percent-encoded escape sequences, the logging component will log the URL after it has been processed with urllib.parse.unquote, therefore converting any percent-encoded characters into their single-character equivalent, which can have special meaning in terminal emulators. By requesting URLs with crafted paths, attackers can: pollute uvicorn's access logs, therefore jeopardising the integrity of such files; and use ANSI sequence codes to attempt to interact with the terminal emulator that is displaying the logs (either in real time or from a file).

Component: encode/uvicorn
Vulnerability: CVE-2020-7695 (MEDIUM)
Description: Uvicorn before 0.11.7 is vulnerable to HTTP response splitting. CRLF sequences are not escaped in the value of HTTP headers. Attackers can exploit this to add arbitrary headers to HTTP responses, or even return an arbitrary response body, whenever crafted input is used to construct HTTP headers.

@xxi-nv
Collaborator Author

xxi-nv commented Apr 21, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44707 [ run ] triggered by Bot. Commit: a39a74f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44707 [ run ] completed with state FAILURE. Commit: a39a74f
/LLM/main/L0_MergeRequest_PR pipeline #35071 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 21, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44722 [ run ] triggered by Bot. Commit: 3b987bd Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44722 [ run ] completed with state SUCCESS. Commit: 3b987bd
/LLM/main/L0_MergeRequest_PR pipeline #35085 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 22, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44861 [ run ] triggered by Bot. Commit: baf4f23 Link to invocation

@xxi-nv
Collaborator Author

xxi-nv commented Apr 23, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #45036 [ run ] triggered by Bot. Commit: baf4f23 Link to invocation
