[Feature] GPU Model Runner V1 #7810
Thanks for your contribution!
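CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
There are currently no Required tasks (GitHub Branch Protection Rules are not configured), so all tasks are optional. One optional task failed (Jenkins trigger timeout); this does not block merging.
2 Task status summary
2.1 Required tasks: 0/0 passed
2.2 Optional tasks: 1/2 passed
3 Failure details (required only)
No required tasks failed.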
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7810 +/- ##
==========================================
Coverage ? 15.80%
==========================================
Files ? 474
Lines ? 65574
Branches ? 9963
==========================================
Hits ? 10366
Misses ? 54722
Partials ? 486
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-18 15:36:49
📋 Review Summary
PR overview: Introduces a new GPU Model Runner V1 implementation, adds a CuTe-based KV Cache Reshape operator, unifies CUDA error checking under the FD_CUDA_CHECK macro, and adds bad-words configuration fields.
Scope of changes: fastdeploy/worker/gpu/ (new directory), custom_ops/gpu_ops/cache_kv/ (new operator), fastdeploy/config.py, fastdeploy/envs.py, attention backends
Impact tags: [Feature] [OP] [FDConfig]
Suggested split:
- PR 1: [OP] reshape_and_cache_flash CuTe kernel: custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu, custom_ops/setup_ops.py, cpp_extensions.cc (reshape binding)
- PR 2: [Others] CUDA macro unification: custom_ops/gpu_ops/macros.h, helper.h, append_attn/, iluvatar_ops/
- PR 3: [FDConfig] bad words configuration: fastdeploy/config.py, fastdeploy/envs.py
- PR 4: [Feature] GPU Model Runner V1 core: fastdeploy/worker/gpu/, gpu_worker.py, worker_process.py, attention backends
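Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/attention/append_attn_backend.py:324 | Leftover debug code writes to /tmp/v0_debug.txt; it must not ship to production |
| 🟡 Suggestion | fastdeploy/envs.py:58 | The environment variable names FD_MAX_BDA_WORDS_NUM / FD_BDA_WORDS_MAX_LEN appear to misspell BAD and are inconsistent with the bad_words config fields |
| 🟡 Suggestion | fastdeploy/config.py:392 | The new max_bad_words_num / bad_words_max_len fields are not synced to the EngineArgs CLI (checklist A2: three-entry-point sync missing) |
| 🟡 Suggestion | custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu | The new operator has no unit test under tests/operators/ |
| 📝 PR convention | - | The Motivation / Modifications / Usage / Accuracy Tests sections of the PR description are all empty templates |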
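📝 PR Convention Check
The PR title [Feature] GPU Model Runner V1 follows the required format, and [Feature] is an official tag, but every section of the description is empty and needs to be filled in.
Suggested title (can be copied directly):
[Feature] Add GPU Model Runner V1 with CuTe KV cache reshape kernel
Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):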
## Motivation
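Introduces GPU Model Runner V1 (MRV1), a new implementation path for the GPU inference engine that can be enabled with the `FD_ENABLE_GPU_MRV1=1` environment variable. Also adds a high-performance CuTe-based KV Cache Reshape operator (supporting two paths: direct copy and dynamic FP8 E4M3 quantization) and unifies the CUDA error-checking macros to improve code consistency.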
## Modifications
- Add the `fastdeploy/worker/gpu/` directory containing the MRV1 core modules: `model_runner.py`, `block_table.py`, `buffer_utils.py`, `forward_meta.py`, `input_batch.py`, `request_state.py`, `sampler/`, `rotary_embedding/`, etc.
- Add `custom_ops/gpu_ops/cache_kv/reshape_and_cache_flash.cu`: a CuTe-based KV Cache reshape kernel that supports non-FP8 128-bit vectorized copy (NHD/HND layouts) and dynamic FP8 E4M3 per-head quantization
- Add `custom_ops/gpu_ops/macros.h`: a unified CUDA error-checking macro, `FD_CUDA_CHECK`, that replaces the scattered `CUDA_CHECK` / `CHECK`
- `custom_ops/gpu_ops/cpp_extensions.cc`: add Python bindings for `copy_array_to_tensor`, `get_cuda_view_from_cpu_tensor`, `numpy_view_of_cpu_tensor`, and `reshape_and_cache_flash`; remove the custom `CudaError` exception class
- `fastdeploy/envs.py`: add the `FD_ENABLE_GPU_MRV1` switch and the bad-words environment variables
- `fastdeploy/config.py`: add the `max_bad_words_num` / `bad_words_max_len` config fields
## Usage or Command
```bash
# Enable GPU Model Runner V1
FD_ENABLE_GPU_MRV1=1 python -m fastdeploy.entrypoints.openai.api_server \
--model <model_path> ...
```
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
    - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
    - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
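- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall Assessment
The MRV1 framework is clearly structured and the CuTe kernel implementation follows conventions. However, the debug code that writes to /tmp/v0_debug.txt must not ship to production and has to be removed before merging; the misspelled environment variable names and the missing three-entry-point CLI sync also need to be fixed.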
_kv_size = layer.kv_num_heads * layer.head_dim
_q = qkv[0, :_q_size].reshape([layer.num_heads, layer.head_dim])
_k = qkv[0, _q_size : _q_size + _kv_size].reshape([layer.kv_num_heads, layer.head_dim])
with open("/tmp/v0_debug.txt", "a") as _f:
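🔴 Bug: leftover debug code writes to /tmp/v0_debug.txt
with open("/tmp/v0_debug.txt", "a") and the entire # V0 Debug block (lines 317-325) are development-time debug code and must not ship to production. The block executes at layer 0 for the first request of every inference, which can cause file contention in multi-process or multi-instance deployments and also exposes internal tensor values.
Suggested fix: delete the whole debug block (from the # V0 Debug: comment through _f.write(...), 10 lines in total).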
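The review asks for plain deletion. If some of this tracing is still wanted during development, one possible alternative (not part of the PR) is to route it through the standard `logging` module, so that nothing is written unless DEBUG is enabled and only shapes, not tensor values, are recorded. The sketch below is illustrative only: the logger name, the helper `trace_qk_shapes`, and the example shapes are all hypothetical.

```python
import logging

# Hypothetical logger name; the project's real logger may differ.
logger = logging.getLogger("fd_debug_sketch")


def trace_qk_shapes(layer_id, q_shape, k_shape):
    """Log Q/K shapes for layer 0 at DEBUG level.

    Returns True if a record was emitted, False otherwise.
    Only shapes are logged, so no tensor values leave the process.
    """
    if layer_id != 0 or not logger.isEnabledFor(logging.DEBUG):
        return False
    logger.debug("layer %d q=%s k=%s", layer_id, q_shape, k_shape)
    return True


# With the logger at WARNING, the helper is a no-op.
logger.setLevel(logging.WARNING)
emitted_by_default = trace_qk_shapes(0, [32, 128], [8, 128])

# Tracing only happens once DEBUG is switched on explicitly.
logger.setLevel(logging.DEBUG)
emitted_when_enabled = trace_qk_shapes(0, [32, 128], [8, 128])
```

Compared with appending to a fixed file under /tmp, this leaves the output destination to the logging configuration, which avoids several processes sharing one hard-coded path.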
# Maximum length of stop sequences.
"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
# Maximum number of bad words.
"FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
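🟡 Suggestion: BDA in the environment variable names looks like a misspelling of BAD
The BDA in FD_MAX_BDA_WORDS_NUM / FD_BDA_WORDS_MAX_LEN is inconsistent with the bad_words_* config field names. A user who intuitively sets FD_MAX_BAD_WORDS_NUM will find it silently ignored (the variable is never read, so the default is used).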
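The silent fallback can be reproduced with a small sketch. It assumes the same lookup pattern as the quoted envs.py lines (a dict of zero-argument lambdas wrapping `os.getenv`); the dict name `environment_variables` and the helper `read_env` are hypothetical and are not taken from the PR.

```python
import os

# Hypothetical stand-in for the table in fastdeploy/envs.py, using the
# misspelled key exactly as it appears in the PR.
environment_variables = {
    "FD_MAX_BDA_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BDA_WORDS_NUM", "16")),
}


def read_env(name):
    """Evaluate the lazily-read setting registered under `name`."""
    return environment_variables[name]()


# A user sets the intuitive spelling, which no code path reads.
os.environ["FD_MAX_BAD_WORDS_NUM"] = "64"
os.environ.pop("FD_MAX_BDA_WORDS_NUM", None)
value_with_intuitive_name = read_env("FD_MAX_BDA_WORDS_NUM")

# Only the misspelled name actually takes effect.
os.environ["FD_MAX_BDA_WORDS_NUM"] = "64"
value_with_misspelled_name = read_env("FD_MAX_BDA_WORDS_NUM")

print(value_with_intuitive_name, value_with_misspelled_name)  # prints "16 64"
```

No error or warning is raised in the first case, which is why the mismatch is easy to miss.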
Suggested consistent naming:
"FD_MAX_BAD_WORDS_NUM": lambda: int(os.getenv("FD_MAX_BAD_WORDS_NUM", "16")),
"FD_BAD_WORDS_MAX_LEN": lambda: int(os.getenv("FD_BAD_WORDS_MAX_LEN", "8")),
Also update the references in config.py: envs.FD_MAX_BAD_WORDS_NUM / envs.FD_BAD_WORDS_MAX_LEN.
@@ -390,6 +390,8 @@ def read_from_env(self):
"""
self.max_stop_seqs_num = envs.FD_MAX_STOP_SEQS_NUM
self.stop_seqs_max_len = envs.FD_STOP_SEQS_MAX_LEN
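🟡 Suggestion: A2 three-entry-point sync missing: the new config fields are not added to the EngineArgs CLI
Per checklist A2, a new Config field must be synced to:
- fastdeploy/engine/args_utils.py (EngineArgs, CLI -- arguments)
- fastdeploy/envs.py (environment variables, already present)
- fastdeploy/config.py (FDConfig, already present)
Currently max_bad_words_num / bad_words_max_len do not appear in args_utils.py, so users cannot configure the bad-words limits from the CLI and cannot discover the feature in the documentation. Consider adding the corresponding add_argument entries to EngineArgs.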
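A minimal sketch of what such CLI entries could look like, using the standard-library `argparse`. The flag names `--max-bad-words-num` and `--bad-words-max-len`, the help strings, and the helper `add_bad_words_args` are assumptions for illustration; the defaults 16 and 8 mirror the values in the envs.py suggestion, and the real EngineArgs in args_utils.py may organize its arguments differently.

```python
import argparse


def add_bad_words_args(parser):
    """Register hypothetical bad-words limit flags on an argparse parser."""
    parser.add_argument(
        "--max-bad-words-num",
        type=int,
        default=16,
        help="Maximum number of bad words.",
    )
    parser.add_argument(
        "--bad-words-max-len",
        type=int,
        default=8,
        help="Maximum length of a bad-words entry.",
    )
    return parser


parser = add_bad_words_args(argparse.ArgumentParser())

# Without flags, the defaults apply.
defaults = parser.parse_args([])

# A single flag overrides only its own field.
overridden = parser.parse_args(["--max-bad-words-num", "32"])

print(defaults.max_bad_words_num, defaults.bad_words_max_len)
print(overridden.max_bad_words_num, overridden.bad_words_max_len)
```

argparse converts the dashes in the flag names to underscores, so the parsed attributes line up with the `max_bad_words_num` / `bad_words_max_len` config field names.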