[xpu] fix interrupt error #7805

Open
zhupengyang wants to merge 1 commit into PaddlePaddle:develop from zhupengyang:fix_interrupt_error

@zhupengyang
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 13, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 13, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-15 10:36:49

This CI report is generated from the following code (updated every 30 minutes):


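1 Task overview

Required tasks are still running: 2 required tasks are in progress, so wait for them to finish before merging. 2 of 4 required tasks have passed so far.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 21 (0) | 21 | 14 | 1 | 4 | 0 | 2 |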

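2 Task status summary

2.1 Required tasks: 2/4 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| | xpu_8cards_case_test / run_xpu_8cards_cases | - | Running | - | Job | - |
| | The other 2 required tasks passed | - | - | - | - | - |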

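2.2 Optional tasks: 12/17 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | Check PR Template | 15s | Job | - |
| | xpu_unit_test / run_xpu_unit_test | - | Job | - |
| | Trigger Jenkins for PR | - | Job | - |
| | The other 12 optional tasks passed | - | - | - |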

3 Failure details (required only)

No required tasks have failed.

@codecov-commenter

codecov-commenter commented May 13, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@c2df4c6). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/worker/xpu_model_runner.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             develop   #7805   +/-   ##
=========================================
  Coverage           ?   7.12%           
=========================================
  Files              ?     462           
  Lines              ?   64274           
  Branches           ?    9851           
=========================================
  Hits               ?    4577           
  Misses             ?   59608           
  Partials           ?      89           
Flag Coverage Δ
XPU 7.12% <0.00%> (?)
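
The figures in the report above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes line coverage is computed as hits divided by the total of hits, misses, and partials; the variable names are illustrative and not part of any Codecov API.

```python
# Totals copied from the coverage table above.
hits, misses, partials = 4577, 59608, 89

# Assumption: every tracked line is exactly one of hit, missed, or partial.
total_lines = hits + misses + partials

# Project coverage as a percentage, rounded to two decimals as in the report.
project_coverage = round(100 * hits / total_lines, 2)

# Patch coverage: the report lists 3 changed lines, all of them missing.
patch_lines, patch_hits = 3, 0
patch_coverage = round(100 * patch_hits / patch_lines, 2)

print(total_lines, project_coverage, patch_coverage)
```

Under that assumption the totals are consistent with the 64274 lines and 7.12% shown in the report, and the 0% patch coverage follows from none of the 3 changed lines being executed by tests.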


PaddlePaddle-bot

This comment was marked as outdated.

@EmmonsCurse
Collaborator

/skip-ci ci_iluvatar
/skip-ci ci_hpu
/skip-ci build_gpu

PaddlePaddle-bot

This comment was marked as outdated.

@zhupengyang zhupengyang force-pushed the fix_interrupt_error branch from 442afa5 to 7de938c Compare May 15, 2026 02:20
@zhupengyang zhupengyang changed the title from "[not merge][xpu] fix interrupt error" to "[xpu] fix interrupt error" May 15, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-15 10:24:44

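📋 Review summary

PR overview: In non-EP mode, the XPU ModelRunner exits early when num_tokens is 0, which fixes the interrupt error.
Scope of changes: fastdeploy/worker/xpu_model_runner.py
Impact tags: [XPU] [BugFix]

📝 PR convention check

The title contains a non-standard [not merge] prefix, and the [xpu] tag does not follow the capitalization convention. The sections of the PR description have not been filled in.

Suggested title (can be copied directly):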

  • [XPU][BugFix] Fix interrupt error when num_tokens is 0

Suggested PR description (can be copied directly):

## Motivation

In non-Expert-Parallel mode, when `ids_remove_padding` contains 0 tokens, the XPU ModelRunner triggers an interrupt error if it calls the model forward directly. Adding early-exit logic to `execute_model` avoids running inference on an empty batch.

## Modifications

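- `fastdeploy/worker/xpu_model_runner.py`: in `execute_model`, after the `padding_cudagraph_inputs()` call, add an early-return check for `num_tokens <= 0`. In EP mode a rank must still take part in collective communication even when it has 0 tokens, so the early return is taken only in non-EP mode.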

## Usage or Command

N/A

## Accuracy Tests

N/A

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

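Issues

| Level | File | Summary |
|---|---|---|
| 📝 PR convention | | The title contains a non-standard [not merge] prefix and the [xpu] tag is not capitalized per convention; the description sections have not been filled in |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1304 | In num_tokens <= 0, the < 0 branch is unreachable; consider changing it to == 0 so the check states its intent more precisely |

Overall assessment

The fix is logically correct: in non-EP mode it returns early when the token count is 0, which is consistent with the existing comment on EP mode, and the change is small and low-risk. Before merging, please fill in the PR description and correct the title format.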
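
The early-return behavior discussed in this review can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions, not the code in `xpu_model_runner.py`: the names `should_skip_forward`, `execute_model_sketch`, and `expert_parallel` are hypothetical, and the forward pass is replaced by a stand-in. The check uses `== 0`, as the review suggests, on the assumption that a token count cannot be negative.

```python
from typing import List, Optional


def should_skip_forward(num_tokens: int, expert_parallel: bool) -> bool:
    """Return True when an empty batch lets us skip the forward pass."""
    if expert_parallel:
        # Assumption from the review: EP ranks must still join collective
        # communication, so they never skip, even with zero tokens.
        return False
    return num_tokens == 0


def execute_model_sketch(token_ids: List[int], expert_parallel: bool) -> Optional[List[int]]:
    """Hypothetical stand-in for an execute_model-style entry point."""
    num_tokens = len(token_ids)
    if should_skip_forward(num_tokens, expert_parallel):
        return None  # early return: no forward pass on an empty batch
    # Placeholder for the real model forward.
    return [t + 1 for t in token_ids]
```

With this shape, an empty batch returns `None` in non-EP mode but still reaches the forward path in EP mode, which mirrors the distinction the review draws between the two modes.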

Comment thread fastdeploy/worker/xpu_model_runner.py