[Feature] GPU Model Runner V1 by ming1753 · Pull Request #7810 · PaddlePaddle/FastDeploy

ming1753 · 2026-05-13T16:06:32Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-13T16:06:39Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-13T16:17:12Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-18 16:45:26

This CI report is generated from the following code (refreshed every 30 minutes):

PR commit: 7bd5ea1
Merge base: a0141b9 (branch: develop)
View full diff
CI details

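1 Task overview

There are currently no Required tasks (GitHub Branch Protection Rules are not configured), so all tasks are optional. One optional task failed (Jenkins trigger timeout); it does not block merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 2(0) | 2 | 1 | 1 | 0 | 0 | 0 |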

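2 Task status summary

2.1 Required tasks: 0/0 passed

No Required tasks were detected (GitHub Branch Protection Rules are not configured).

⚠️ Note: the main test task Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage does not appear in the results of this CI run. Please confirm that the workflow was triggered correctly.

2.2 Optional tasks: 1/2 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | `Trigger Jenkins for PR` (CI_METAX) | 1h0m | Job | - |
| ✅ | The remaining 1 optional task passed | - | - | - |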

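3 Failure details (required only)

No required tasks failed.

ℹ️ Note on the optional task failure: Trigger Jenkins for PR (CI_METAX) timed out while triggering the Jenkins job (timeout: 60 minutes). This is an environment/infrastructure issue and does not affect merging the PR. The workflow can be re-triggered if needed.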

codecov-commenter · 2026-05-13T17:22:51Z

Codecov Report

❌ Patch coverage is 0.19973% with 1499 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@a0141b9). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/worker/gpu/model_runner.py | 0.00% | 478 Missing ⚠️ |
| fastdeploy/worker/gpu/input_batch.py | 0.00% | 189 Missing ⚠️ |
| ...el_executor/layers/attention/flashinfer_backend.py | 0.00% | 183 Missing ⚠️ |
| fastdeploy/worker/gpu/buffer_utils.py | 0.00% | 156 Missing ⚠️ |
| fastdeploy/worker/gpu/sampler/post_process.py | 0.00% | 105 Missing ⚠️ |
| fastdeploy/worker/gpu/gather_tokens_kernel.py | 0.00% | 82 Missing ⚠️ |
| fastdeploy/worker/gpu/sampler/sampler_state.py | 0.00% | 73 Missing ⚠️ |
| fastdeploy/worker/gpu/block_table.py | 0.00% | 67 Missing ⚠️ |
| fastdeploy/worker/gpu/request_state.py | 0.00% | 58 Missing ⚠️ |
| fastdeploy/worker/gpu/async_output.py | 0.00% | 44 Missing ⚠️ |
| ... and 7 more | | |

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7810   +/-   ##
==========================================
  Coverage           ?   15.80%           
==========================================
  Files              ?      474           
  Lines              ?    65574           
  Branches           ?     9963           
==========================================
  Hits               ?    10366           
  Misses             ?    54722           
  Partials           ?      486

| Flag | Coverage Δ |
|---|---|
| XPU | `15.80% <0.19%> (?)` |

Flags with carried forward coverage won't be shown. Click here to find out more.


PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-18 15:36:49

📋 Review summary

PR overview: Introduces a new GPU Model Runner V1 implementation, adds a CuTe-based KV Cache Reshape operator and a unified CUDA error-checking macro FD_CUDA_CHECK, and adds bad-words configuration fields.
Scope of changes: fastdeploy/worker/gpu/ (new directory), custom_ops/gpu_ops/cache_kv/ (new operator), fastdeploy/config.py, fastdeploy/envs.py, attention backends
Impact tags: [Feature] [OP] [FDConfig]

⚠️ This PR is large (31 files). Consider splitting it to reduce review difficulty and merge risk.

Suggested split:

PR 1: [OP] reshape_and_cache_flash CuTe kernel: custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu, custom_ops/setup_ops.py, cpp_extensions.cc (reshape binding)
PR 2: [Others] Unified CUDA macros: custom_ops/gpu_ops/macros.h, helper.h, append_attn/, iluvatar_ops/
PR 3: [FDConfig] bad words configuration: fastdeploy/config.py, fastdeploy/envs.py
PR 4: [Feature] GPU Model Runner V1 core: fastdeploy/worker/gpu/, gpu_worker.py, worker_process.py, attention backends

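Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | `fastdeploy/model_executor/layers/attention/append_attn_backend.py:324` | Leftover debug code writes to `/tmp/v0_debug.txt`; must not reach production |
| 🟡 Suggestion | `fastdeploy/envs.py:58` | Environment variable names `FD_MAX_BDA_WORDS_NUM` / `FD_BDA_WORDS_MAX_LEN` look like a misspelling of `BAD` and are inconsistent with the `bad_words` config fields |
| 🟡 Suggestion | `fastdeploy/config.py:392` | New fields `max_bad_words_num` / `bad_words_max_len` are not synced to the `EngineArgs` CLI (checklist A2: three-entry-point sync missing) |
| 🟡 Suggestion | `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu` | New operator lacks unit tests under `tests/operators/` |
| 📝 PR conventions | - | The Motivation / Modifications / Usage / Accuracy Tests sections of the PR description are all empty templates |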
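
For the missing operator test flagged in the table above, a plain-Python reference could serve as the ground truth that the kernel output is compared against. The sketch below is an illustration under stated assumptions, not the PR's operator: the function name `reference_reshape_and_cache`, the `slot_mapping` argument, the negative-slot skip convention, and the `[block][slot][head][dim]` (NHD-style) cache layout are all hypothetical, and only the direct-copy path is covered, not FP8 quantization.

```python
from typing import List, Sequence

# One token's K or V, shaped [num_heads][head_dim].
Token = List[List[float]]


def reference_reshape_and_cache(
    key: Sequence[Token],
    value: Sequence[Token],
    key_cache: List[List[Token]],
    value_cache: List[List[Token]],
    slot_mapping: Sequence[int],
    block_size: int,
) -> None:
    """Scatter per-token K/V into a paged cache laid out as [block][slot][head][dim].

    Hypothetical reference for testing; the real operator's signature may differ.
    """
    for token_idx, slot in enumerate(slot_mapping):
        if slot < 0:
            continue  # assumed convention: a negative slot means "skip this token"
        # A flat slot index decomposes into (block index, offset inside the block).
        block_idx, block_offset = divmod(slot, block_size)
        key_cache[block_idx][block_offset] = [list(head) for head in key[token_idx]]
        value_cache[block_idx][block_offset] = [list(head) for head in value[token_idx]]
```

A test under `tests/operators/` would presumably run the same inputs through the GPU operator and compare its cache contents element-wise against this reference.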

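📝 PR conventions check

The PR title [Feature] GPU Model Runner V1 follows the required format, and [Feature] is an official tag, but every section of the description is empty and needs to be filled in.

Suggested title (can be copied directly):
[Feature] Add GPU Model Runner V1 with CuTe KV cache reshape kernel

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):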

## Motivation

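Introduce GPU Model Runner V1 (MRV1), a new implementation path for the GPU inference engine that can be enabled with the `FD_ENABLE_GPU_MRV1=1` environment variable. Also add a high-performance CuTe-based KV Cache Reshape operator (with two paths: direct copy and dynamic FP8 E4M3 quantization) and unify the CUDA error-checking macros to improve code consistency.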

## Modifications

- Add the `fastdeploy/worker/gpu/` directory containing the MRV1 core modules: `model_runner.py`, `block_table.py`, `buffer_utils.py`, `forward_meta.py`, `input_batch.py`, `request_state.py`, `sampler/`, `rotary_embedding/`, and others
- Add `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu`: a CuTe-based KV Cache reshape kernel that supports non-FP8 128-bit vectorized copy (NHD/HND layout) and dynamic FP8 E4M3 per-head quantization
- Add `custom_ops/gpu_ops/macros.h`: a unified CUDA error-checking macro `FD_CUDA_CHECK` that replaces the scattered `CUDA_CHECK` / `CHECK` macros
- `custom_ops/gpu_ops/cpp_extensions.cc`: add the `copy_array_to_tensor`, `get_cuda_view_from_cpu_tensor`, `numpy_view_of_cpu_tensor`, and `reshape_and_cache_flash` Python bindings; remove the custom `CudaError` exception class
- `fastdeploy/envs.py`: add the `FD_ENABLE_GPU_MRV1` switch and the bad-words environment variables
- `fastdeploy/config.py`: add the `max_bad_words_num` / `bad_words_max_len` config fields

## Usage or Command

```bash
# Enable GPU Model Runner V1
FD_ENABLE_GPU_MRV1=1 python -m fastdeploy.entrypoints.openai.api_server \
    --model <model_path> ...
```

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

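Overall assessment

The MRV1 framework is clearly structured and the CuTe kernel implementation is well-formed. However, the debug code (writing to /tmp/v0_debug.txt) must not reach production and has to be removed before merging; the env variable spelling issue and the three-entry-point CLI sync also need to be fixed.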

PaddlePaddle-bot · 2026-05-18T07:41:32Z

+            _kv_size = layer.kv_num_heads * layer.head_dim
+            _q = qkv[0, :_q_size].reshape([layer.num_heads, layer.head_dim])
+            _k = qkv[0, _q_size : _q_size + _kv_size].reshape([layer.kv_num_heads, layer.head_dim])
+            with open("/tmp/v0_debug.txt", "a") as _f:


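🔴 Bug: Leftover debug code writes to /tmp/v0_debug.txt

with open("/tmp/v0_debug.txt", "a") and the entire # V0 Debug block (lines 317-325) are development debug code and must not reach production. The block runs on layer 0 for the first request of every inference run, which can cause file contention in multi-process/multi-instance deployments, and it also exposes internal tensor values.

Suggested fix: delete the debug block entirely (from the # V0 Debug: comment through _f.write(...), 10 lines in total).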
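
If a similar dump is still wanted during development, one option is to route it through the standard `logging` module so that nothing is written unless DEBUG logging is explicitly enabled. This is a minimal sketch under assumptions: the helper `debug_dump_qk` and the logger name `fastdeploy.debug` are hypothetical and not part of the PR, and plain float sequences stand in for the tensors.

```python
import logging
from typing import Sequence

logger = logging.getLogger("fastdeploy.debug")  # hypothetical logger name


def debug_dump_qk(q: Sequence[float], k: Sequence[float], layer_id: int) -> bool:
    """Log summary statistics of q/k only when DEBUG logging is enabled.

    Returns True if a record was emitted, so callers and tests can check it.
    """
    if not logger.isEnabledFor(logging.DEBUG):
        # Default path: no file I/O and no tensor values leave the process.
        return False
    logger.debug(
        "layer=%d q_len=%d k_len=%d q_absmax=%.6f k_absmax=%.6f",
        layer_id,
        len(q),
        len(k),
        max((abs(x) for x in q), default=0.0),
        max((abs(x) for x in k), default=0.0),
    )
    return True
```

Logging summary statistics instead of raw values also avoids the shared fixed path, so concurrent workers do not contend for one file.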

PaddlePaddle-bot · 2026-05-18T07:41:32Z

    # Maximum length of stop sequences.
    "FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
+    # Maximum number of bad words.
+    "FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),


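🟡 Suggestion: BDA in the environment variable names looks like a misspelling of BAD

The BDA in FD_MAX_BDA_WORDS_NUM / FD_BDA_WORDS_MAX_LEN is inconsistent with the config field names bad_words_*. A user who intuitively sets FD_MAX_BAD_WORDS_NUM will find it silently ignored (the environment variable is never read, so the default is used).

Suggested consistent naming:

    "FD_MAX_BAD_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BAD_WORDS_NUM", "16")),
    "FD_BAD_WORDS_MAX_LEN": lambda: int(os.getenv("FD_BAD_WORDS_MAX_LEN", "8")),

Update the references in config.py accordingly: envs.FD_MAX_BAD_WORDS_NUM / envs.FD_BAD_WORDS_MAX_LEN.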
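
The silent fallback can be reproduced in isolation. The sketch below mirrors the lazy-lookup pattern from the diff with the misspelled key as it appears in the PR; the table name `env_table` and the helper `effective_max_bad_words` are hypothetical names used only for illustration.

```python
import os

# Same lazy-lookup shape as the diff; the key and default "16" are taken from the PR.
env_table = {
    "FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
}


def effective_max_bad_words(table) -> int:
    """Return the bad-words limit as the misspelled lookup would resolve it."""
    return table["FD_MAX_BDA_WORDS_NUM"]()


# A user sets the intuitive spelling...
os.environ["FD_MAX_BAD_WORDS_NUM"] = "64"
os.environ.pop("FD_MAX_BDA_WORDS_NUM", None)

# ...but the lookup reads the misspelled name, so the default is used without any warning.
value_seen = effective_max_bad_words(env_table)
```

Here `value_seen` is the default rather than the value the user exported, which is the failure mode the review comment describes.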

PaddlePaddle-bot · 2026-05-18T07:41:32Z

@@ -390,6 +390,8 @@ def read_from_env(self):
        """
        self.max_stop_seqs_num = envs.FD_MAX_STOP_SEQS_NUM
        self.stop_seqs_max_len = envs.FD_STOP_SEQS_MAX_LEN


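🟡 Suggestion: A2 three-entry-point sync missing: new config fields not added to the EngineArgs CLI

Per checklist A2, new Config fields must be synced to:

- fastdeploy/engine/args_utils.py (EngineArgs, CLI -- arguments)
- fastdeploy/envs.py (environment variables, already present)
- fastdeploy/config.py (FDConfig, already present)

Currently max_bad_words_num / bad_words_max_len do not appear in args_utils.py, so users cannot configure the bad-words limits via the CLI and cannot discover the feature in the documentation. Consider adding the corresponding add_argument entries to EngineArgs.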
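
As a rough illustration of what such CLI entries could look like, here is a sketch using plain `argparse`. The flag names, the help strings, and the `build_parser` helper are assumptions; the actual `EngineArgs` wiring in `args_utils.py` may be structured differently. The defaults fall back to the corrected environment variable names and values proposed in the review.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two bad-words limits (hypothetical flag names)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max-bad-words-num",
        type=int,
        # Default comes from the environment so the CLI and env entry points agree.
        default=int(os.getenv("FD_MAX_BAD_WORDS_NUM", "16")),
        help="Maximum number of bad words.",
    )
    parser.add_argument(
        "--bad-words-max-len",
        type=int,
        default=int(os.getenv("FD_BAD_WORDS_MAX_LEN", "8")),
        help="Maximum length of a bad-word sequence (assumed meaning).",
    )
    return parser


# An explicit flag overrides the environment-derived default.
args = build_parser().parse_args(["--max-bad-words-num", "32"])
```

With this shape, an explicit flag takes precedence over the environment variable, and `--help` makes the option discoverable.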

ming1753 had a problem deploying to Metax_ci May 13, 2026 16:06 — with GitHub Actions Failure

ming1753 marked this pull request as draft May 13, 2026 16:07

ming1753 changed the title ~~0513 code backup~~ [Feature] GPU Model Runner V1 May 13, 2026

ming1753 had a problem deploying to Metax_ci May 14, 2026 04:10 — with GitHub Actions Failure

GPU Model Runner V1

922e813

ming1753 force-pushed the mrv1 branch from 0a91e38 to 922e813 Compare May 15, 2026 10:01

ming1753 had a problem deploying to Metax_ci May 15, 2026 10:01 — with GitHub Actions Failure

support sampler

7bd5ea1

ming1753 had a problem deploying to Metax_ci May 18, 2026 07:27 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes May 18, 2026

[Feature] GPU Model Runner V1 #7810
ming1753 wants to merge 2 commits into PaddlePaddle:develop from ming1753:mrv1

3 participants
