[Feature] GPU Model Runner V1#7810

Draft
ming1753 wants to merge 2 commits into PaddlePaddle:develop from ming1753:mrv1

@ming1753
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 13, 2026

Thanks for your contribution!

@ming1753 ming1753 marked this pull request as draft May 13, 2026 16:07
@ming1753 ming1753 changed the title 0513 code backup [Feature] GPU Model Runner V1 May 13, 2026
@PaddlePaddle-bot

PaddlePaddle-bot commented May 13, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-18 16:45:26

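The CI report is generated from the following code (refreshed every 30 minutes):


1 Task overview

There are currently no Required tasks (GitHub Branch Protection Rules are not configured), so all tasks are optional. One optional task failed (the Jenkins trigger timed out); this does not block merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 2 (0) | 2 | 1 | 1 | 0 | 0 | 0 |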

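2 Task status summary

2.1 Required tasks: 0/0 passed

No Required tasks were detected (GitHub Branch Protection Rules are not configured).

⚠️ Note: the main test task Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage does not appear in this CI run. Please confirm that the workflow was triggered correctly.

2.2 Optional tasks: 1/2 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Trigger Jenkins for PR (CI_METAX) | 1h0m | Job | - |
| ✅ | 1 other optional task passed | - | - | - |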

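3 Failure details (required only)

No required tasks failed.

ℹ️ Note on the optional task failure: Trigger Jenkins for PR (CI_METAX) timed out while triggering the Jenkins job (timeout: 60 minutes). This is an environment/infrastructure issue and does not affect merging the PR. The workflow can be re-triggered if needed.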

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented May 13, 2026

Codecov Report

❌ Patch coverage is 0.19973% with 1499 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@a0141b9). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/worker/gpu/model_runner.py | 0.00% | 478 Missing ⚠️ |
| fastdeploy/worker/gpu/input_batch.py | 0.00% | 189 Missing ⚠️ |
| ...el_executor/layers/attention/flashinfer_backend.py | 0.00% | 183 Missing ⚠️ |
| fastdeploy/worker/gpu/buffer_utils.py | 0.00% | 156 Missing ⚠️ |
| fastdeploy/worker/gpu/sampler/post_process.py | 0.00% | 105 Missing ⚠️ |
| fastdeploy/worker/gpu/gather_tokens_kernel.py | 0.00% | 82 Missing ⚠️ |
| fastdeploy/worker/gpu/sampler/sampler_state.py | 0.00% | 73 Missing ⚠️ |
| fastdeploy/worker/gpu/block_table.py | 0.00% | 67 Missing ⚠️ |
| fastdeploy/worker/gpu/request_state.py | 0.00% | 58 Missing ⚠️ |
| fastdeploy/worker/gpu/async_output.py | 0.00% | 44 Missing ⚠️ |
| ... and 7 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7810   +/-   ##
==========================================
  Coverage           ?   15.80%           
==========================================
  Files              ?      474           
  Lines              ?    65574           
  Branches           ?     9963           
==========================================
  Hits               ?    10366           
  Misses             ?    54722           
  Partials           ?      486           
Flag Coverage Δ
XPU 15.80% <0.19%> (?)

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-18 15:36:49

📋 Review summary

PR overview: Introduces a new GPU Model Runner V1 implementation, adds a CuTe-based KV Cache Reshape operator, unifies CUDA error checking under the FD_CUDA_CHECK macro, and adds bad-words configuration fields.
Scope of changes: fastdeploy/worker/gpu/ (new directory), custom_ops/gpu_ops/cache_kv/ (new operator), fastdeploy/config.py, fastdeploy/envs.py, attention backends
Affected tags: [Feature] [OP] [FDConfig]

⚠️ This PR is large (31 files). Splitting it is recommended to make review easier and reduce merge risk.

Suggested split

  • PR 1: [OP] reshape_and_cache_flash CuTe kernel: custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu, custom_ops/setup_ops.py, cpp_extensions.cc (reshape binding)
  • PR 2: [Others] Unify CUDA macros: custom_ops/gpu_ops/macros.h, helper.h, append_attn/, iluvatar_ops/
  • PR 3: [FDConfig] bad words configuration: fastdeploy/config.py, fastdeploy/envs.py
  • PR 4: [Feature] GPU Model Runner V1 core: fastdeploy/worker/gpu/, gpu_worker.py, worker_process.py, attention backends
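Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/attention/append_attn_backend.py:324 | Leftover debug code writes to /tmp/v0_debug.txt; must not reach production |
| 🟡 Suggestion | fastdeploy/envs.py:58 | Environment variable names FD_MAX_BDA_WORDS_NUM / FD_BDA_WORDS_MAX_LEN look like a misspelling of BAD and are inconsistent with the bad_words config fields |
| 🟡 Suggestion | fastdeploy/config.py:392 | New fields max_bad_words_num / bad_words_max_len are not synced to the EngineArgs CLI (A2 three-entry-point sync missing) |
| 🟡 Suggestion | custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu | New operator lacks unit tests under tests/operators/ |
| 📝 PR convention | | The Motivation / Modifications / Usage / Accuracy Tests sections of the PR description are all empty templates |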

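📝 PR convention check

The PR title [Feature] GPU Model Runner V1 follows the required format, and [Feature] is an official tag, but every section of the description is empty and needs to be filled in.

Suggested title (can be copied directly):
[Feature] Add GPU Model Runner V1 with CuTe KV cache reshape kernel

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):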

## Motivation

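Introduce GPU Model Runner V1 (MRV1), a new implementation path for the GPU inference engine that can be enabled with the `FD_ENABLE_GPU_MRV1=1` environment variable. Also add a high-performance CuTe-based KV Cache Reshape operator (supporting two paths: direct copy and dynamic FP8 E4M3 quantization) and unify the CUDA error-checking macros to improve code consistency.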

## Modifications

- Add the `fastdeploy/worker/gpu/` directory containing the MRV1 core modules: `model_runner.py`, `block_table.py`, `buffer_utils.py`, `forward_meta.py`, `input_batch.py`, `request_state.py`, `sampler/`, `rotary_embedding/`
- Add `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu`: a CuTe-based KV Cache reshape kernel supporting non-FP8 128-bit vectorized copy (NHD/HND layout) and dynamic FP8 E4M3 per-head quantization
- Add `custom_ops/gpu_ops/macros.h`: a unified CUDA error-checking macro `FD_CUDA_CHECK` that replaces the scattered `CUDA_CHECK` / `CHECK`
- `custom_ops/gpu_ops/cpp_extensions.cc`: add Python bindings for `copy_array_to_tensor`, `get_cuda_view_from_cpu_tensor`, `numpy_view_of_cpu_tensor`, and `reshape_and_cache_flash`; remove the custom `CudaError` exception class
- `fastdeploy/envs.py`: add the `FD_ENABLE_GPU_MRV1` switch and the bad-words environment variables
- `fastdeploy/config.py`: add the `max_bad_words_num` / `bad_words_max_len` config fields

## Usage or Command

```bash
# Enable GPU Model Runner V1
FD_ENABLE_GPU_MRV1=1 python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> ...
```

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

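Overall assessment

The MRV1 framework is clearly structured and the CuTe kernel implementation is well-formed. However, the debug code (writing to /tmp/v0_debug.txt) must not reach production and has to be removed before merging; the environment variable spelling issue and the three-entry-point CLI sync also need to be fixed.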

_kv_size = layer.kv_num_heads * layer.head_dim
_q = qkv[0, :_q_size].reshape([layer.num_heads, layer.head_dim])
_k = qkv[0, _q_size : _q_size + _kv_size].reshape([layer.kv_num_heads, layer.head_dim])
with open("/tmp/v0_debug.txt", "a") as _f:

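🔴 Bug: leftover debug code writes to /tmp/v0_debug.txt

with open("/tmp/v0_debug.txt", "a") and the entire # V0 Debug block (lines 317-325) are development debugging code and must not reach production. The block runs at layer 0 for the first request of every inference, can cause file contention in multi-process or multi-instance deployments, and exposes internal tensor values.

Suggested fix: delete the debug block completely (10 lines in total, from the # V0 Debug: comment through _f.write(...)).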
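
If similar QKV diagnostics are still needed during development, one option is to route them through the standard logging module behind an opt-in flag instead of appending to a shared file. The sketch below is illustrative only: the FD_DEBUG_QKV variable, the logger name, and both helper functions are hypothetical and do not exist in FastDeploy. It logs summary statistics rather than raw tensor values.

```python
import logging
import os

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("mrv1_debug_sketch")


def qkv_debug_enabled(env=None) -> bool:
    """Return True only when the hypothetical FD_DEBUG_QKV flag is "1"."""
    env = os.environ if env is None else env
    return env.get("FD_DEBUG_QKV", "0") == "1"


def log_tensor_summary(name, values, env=None) -> bool:
    """Log count/min/max of a non-empty sequence; return whether it logged."""
    if not qkv_debug_enabled(env):
        return False  # default path: no file I/O, no logging work
    flat = [float(v) for v in values]
    logger.debug("%s: n=%d min=%.4f max=%.4f", name, len(flat), min(flat), max(flat))
    return True
```

With the flag unset, log_tensor_summary returns False immediately, so nothing is written on the default path and no per-process file is shared.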

Comment thread fastdeploy/envs.py
# Maximum length of stop sequences.
"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),

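🟡 Suggestion: BDA in the environment variable names looks like a misspelling of BAD

The BDA in FD_MAX_BDA_WORDS_NUM / FD_BDA_WORDS_MAX_LEN is inconsistent with the bad_words_* config field names. A user who intuitively sets FD_MAX_BAD_WORDS_NUM will find it silently ignored (the environment variable is never read, so the default is used).

Suggested consistent naming: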

"FD_MAX_BAD_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BAD_WORDS_NUM", "16")),
"FD_BAD_WORDS_MAX_LEN": lambda: int(os.getenv("FD_BAD_WORDS_MAX_LEN", "8")),

Update the references in config.py accordingly: envs.FD_MAX_BAD_WORDS_NUM / envs.FD_BAD_WORDS_MAX_LEN
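
The silent fallback described in this comment can be reproduced in isolation. This is a minimal sketch, not FastDeploy code: read_max_bad_words is a hypothetical helper that imitates the os.getenv(name, default) lookup from the envs.py hunk, and the value 64 is an arbitrary example.

```python
def read_max_bad_words(env: dict, var_name: str, default: int = 16) -> int:
    """Imitate int(os.getenv(var_name, default)) against an explicit mapping."""
    return int(env.get(var_name, str(default)))


# A user sets the intuitively spelled variable.
user_env = {"FD_MAX_BAD_WORDS_NUM": "64"}

# Lookup under the misspelled name from the PR: the key is absent, so the
# default is returned and no error is raised.
as_written = read_max_bad_words(user_env, "FD_MAX_BDA_WORDS_NUM")

# Lookup under the suggested name: the user's value is picked up.
as_suggested = read_max_bad_words(user_env, "FD_MAX_BAD_WORDS_NUM")

print(as_written, as_suggested)  # prints "16 64"
```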

Comment thread fastdeploy/config.py
@@ -390,6 +390,8 @@ def read_from_env(self):
"""
self.max_stop_seqs_num = envs.FD_MAX_STOP_SEQS_NUM
self.stop_seqs_max_len = envs.FD_STOP_SEQS_MAX_LEN

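🟡 Suggestion: A2 three-entry-point sync missing; new config fields not added to the EngineArgs CLI

Per checklist A2, a new Config field must be synced to:

  1. fastdeploy/engine/args_utils.py (EngineArgs, CLI -- arguments)
  2. fastdeploy/envs.py (environment variables, already present)
  3. fastdeploy/config.py (FDConfig, already present)

max_bad_words_num / bad_words_max_len currently do not appear in args_utils.py, so users cannot configure the bad words limits through the CLI or discover the feature in the documentation. Consider adding the corresponding add_argument entries to EngineArgs.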
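
A minimal sketch of what the missing CLI entries might look like, using plain argparse. The flag names --max-bad-words-num and --bad-words-max-len, the build_parser helper, and the defaults (16 and 8, borrowed from the environment variable defaults suggested earlier in this review) are assumptions; the real EngineArgs class in args_utils.py may register arguments differently.

```python
import argparse


def build_parser(default_num: int = 16, default_len: int = 8) -> argparse.ArgumentParser:
    """Build a standalone parser with two hypothetical bad-words flags."""
    parser = argparse.ArgumentParser(description="bad words limits (sketch)")
    parser.add_argument(
        "--max-bad-words-num",
        type=int,
        default=default_num,
        help="Maximum number of bad words (hypothetical flag name).",
    )
    parser.add_argument(
        "--bad-words-max-len",
        type=int,
        default=default_len,
        help="Maximum length of a bad-words entry (hypothetical flag name).",
    )
    return parser


# Override one limit on the command line; the other keeps its default.
args = build_parser().parse_args(["--max-bad-words-num", "32"])
print(args.max_bad_words_num, args.bad_words_max_len)  # prints "32 8"
```

Passing the environment-derived values in as defaults is one way to keep the CLI and environment entry points consistent: an explicit flag wins, otherwise the environment value applies.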

3 participants