[Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc(#7777) #7818
ShaneGZhu wants to merge 3 commits into
Conversation
[Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc (PaddlePaddle#7777)
Thanks for your contribution!
CI report generated from the following code (updated every 30 minutes):

1 Task overview
2 Task status summary
2.1 Required tasks: 1/8 passed
2.2 Optional tasks: 19/24 passed
3 Failure details (required only)
Approval — approval workflow (confidence: high)
Root cause: this PR (a cherry-pick of a custom-op kernel fusion) triggered the approval requirement. Neither of the two required approvals has been granted, so the script returned exit code 6 (2 approval errors). Key log: Fix suggestion:
Fix suggestion summary: request review from the designated RDs: FastDeploy RD + PaddlePaddle RD. Related change: this PR contains a kernel-fusion custom op (cast+sigmoid+bias+noauxtc), which triggers the approval requirement. Link: view log
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-14 19:24:50
📋 Review summary
PR overview: adds a fused grouped_topk CUDA op that merges the four steps cast+sigmoid+bias+noaux_tc into a single kernel launch, improving MoE routing performance; controlled by the --enable-moe-scores-elementwise-fuse switch.
Scope of change: custom_ops/gpu_ops/, fastdeploy/model_executor/layers/moe/, fastdeploy/engine/
Impact tags: [OP] [Optimization]
📝 PR convention check
Two convention issues: (1) the title tag [Op] deviates in casing (the official tag is [OP]); (2) every section of the PR description is empty or a placeholder and needs to be filled in.
Suggested title (copy-paste ready):
[Cherry-Pick][OP][Optimization] Kernel fusion: cast+sigmoid+bias+noauxtc(#7777)
Suggested PR description (copy-paste ready):
## Motivation
Adds a fused `grouped_topk` CUDA kernel that merges the previously separate cast + sigmoid + e_score_correction_bias addition + noaux_tc operations into a single kernel launch, reducing MoE routing kernel-launch overhead and intermediate GPU-memory reads/writes. Enabled on demand via the `--enable-moe-scores-elementwise-fuse` flag; covers MoE configurations such as DeepSeek-V3/R1, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2.
## Modifications
- `custom_ops/gpu_ops/grouped_topk_kernels.cu`: new fused kernel implementing cast+sigmoid+bias+grouped_topk in one pass; supports float32/float16/bfloat16 input
- `custom_ops/gpu_ops/cpp_extensions.cc`: add the `grouped_topk` C++ function declaration and pybind11 binding
- `custom_ops/setup_ops.py`: add `grouped_topk_kernels.cu` to both compile-source lists
- `fastdeploy/engine/args_utils.py`: update the comment on the `enable_moe_scores_elementwise_fuse` field
- `fastdeploy/model_executor/layers/moe/moe.py`: add a `use_fused_cast=True` branch in `get_moe_scores` that calls `grouped_topk` in place of the previous `fused_cast_sigmoid_bias + noaux_tc` path; remove the `fused_cast_sigmoid_bias` import accordingly
- `fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py`: both `apply()` and `apply_tp()` now gate the fused path on the `enable_moe_scores_elementwise_fuse` flag; fix the stride_bk/stride_bn order for `up_gate_proj_weight`
- `tests/operators/test_grouped_topk_op.py`: new operator unit test covering the DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2 model configurations plus the bfloat16 input path
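The fused routing computation described above can be sketched as a plain-NumPy reference (a hypothetical sketch; the function name, argument order, and group-masking details are assumptions — the real implementation is the CUDA kernel in `grouped_topk_kernels.cu`):

```python
import numpy as np

def grouped_topk_reference(gating_output, bias, n_group, topk_group, top_k,
                           renormalize=False):
    """Reference for the fused cast + sigmoid + bias + grouped top-k routing.

    gating_output: [num_tokens, num_experts], any float dtype (fp16/bf16/fp32)
    bias:          [num_experts] e_score_correction_bias
    """
    # Step 1: cast to float32 (previously a separate kernel)
    logits = gating_output.astype(np.float32)
    # Step 2: sigmoid activation
    scores = 1.0 / (1.0 + np.exp(-logits))
    # Step 3: add the expert-score correction bias (used for selection only)
    scores_with_bias = scores + bias.astype(np.float32)

    num_tokens, num_experts = scores_with_bias.shape
    grouped = scores_with_bias.reshape(num_tokens, n_group, -1)
    # Step 4a: keep only the topk_group best expert groups per token
    group_scores = grouped.max(axis=-1)
    top_groups = np.argsort(-group_scores, axis=-1)[:, :topk_group]
    mask = np.zeros((num_tokens, n_group, 1), dtype=np.float32)
    for t in range(num_tokens):
        mask[t, top_groups[t], 0] = 1.0
    masked = (grouped * mask).reshape(num_tokens, num_experts)
    # Step 4b: final top-k experts among the surviving groups;
    # routing weights come from the un-biased scores
    topk_ids = np.argsort(-masked, axis=-1)[:, :top_k]
    topk_weights = np.take_along_axis(scores, topk_ids, axis=-1)
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(axis=-1, keepdims=True)
    return topk_weights, topk_ids
```

A reference like this is what the unit test compares the CUDA kernel against.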
## Usage or Command
```bash
# Enable the fused MoE scoring op
python -m fastdeploy.entrypoints.openai.api_server \
--enable-moe-scores-elementwise-fuse \
...
```
## Accuracy Tests
The operator unit test (`tests/operators/test_grouped_topk_op.py`) validates numerical accuracy by comparing a Python reference implementation against the CUDA kernel (atol=1e-3, rtol=1e-3), covering the four model configurations above.
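The comparison pattern above can be sketched as follows (a minimal sketch; the helper name is hypothetical — the actual test lives in `test_grouped_topk_op.py`). Expert ids are discrete, so they should match exactly, while the floating-point routing weights are compared within the stated tolerances:

```python
import numpy as np

def check_grouped_topk(out_weights, out_ids, ref_weights, ref_ids,
                       atol=1e-3, rtol=1e-3):
    """Expert ids must match exactly; routing weights only within tolerance."""
    if not np.array_equal(np.asarray(out_ids), np.asarray(ref_ids)):
        return False
    return bool(np.allclose(out_weights, ref_weights, atol=atol, rtol=rtol))
```

A kernel output that differs from the reference by less than atol passes; any disagreement in the selected expert ids fails immediately.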
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | moe.py:136 | With EP redundant and fused cast both enabled, gating_output enters noaux_tc_redundant without being cast to float32 |
| 📝 PR convention | moe.py:44 | The exception warning message was not updated and still reads "import noaux_tc Failed!" |
Overall assessment
The fused operator design is sound, test coverage is thorough (four mainstream MoE configurations plus the bf16 path), and the cherry-pick process follows convention. Watch the dtype compatibility when EP redundant and fused cast are enabled together, and fill in the PR description.
    n_group if n_group > 0 else 1,
    topk_group if topk_group > 0 else 1,
    top_k,
    renormalize,
🟡 Suggestion: when use_fused_cast=True and expert_id_to_ep_rank_array is not None, gating_output enters this else branch without `.cast("float32")`, so the output of `paddle.nn.functional.sigmoid(gating_output)` keeps the original dtype (bf16/fp16). If noaux_tc_redundant does not support non-float32 scores input, this will cause a runtime error or an accuracy anomaly.

Suggest an explicit cast at the top of the else branch:

```python
else:
    if use_fused_cast:  # fused path only covers non-redundant EP
        gating_output = gating_output.cast("float32")
    scores = paddle.nn.functional.sigmoid(gating_output)
    scores_with_bias = scores + e_score_correction_bias
```

Or document in the function comment/docs that the EP redundant path does not support fused cast mode.
    noaux_tc,
    noaux_tc_redundant,
)
except:
📝 PR convention: the warning message was not updated alongside the new grouped_topk; suggest changing it to:
logger.warning("import noaux_tc/grouped_topk Failed!")
Codecov Report
❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## release/2.6 #7818 +/- ##
==============================================
Coverage ? 72.46%
==============================================
Files ? 381
Lines ? 54142
Branches ? 8456
==============================================
Hits ? 39235
Misses ? 12148
Partials ? 2759
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
-    stride_bk=layer.up_gate_proj_weight.strides[1],
-    stride_bn=layer.up_gate_proj_weight.strides[2],
+    stride_bk=layer.up_gate_proj_weight.strides[2],
+    stride_bn=layer.up_gate_proj_weight.strides[1],
Does the fusion kernel affect the preceding topk operator?
This was accidentally changed in Minzheng's previous PR; I restored it here.
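Why swapping the two stride arguments matters can be sketched with NumPy (an illustration only; the [num_experts, K, N] shape is an assumption, and note that NumPy strides are in bytes while Triton-style kernels take element strides):

```python
import numpy as np

# Hypothetical 3-D expert weight tensor: [num_experts=8, K=128, N=256]
w = np.zeros((8, 128, 256), dtype=np.float32)

# Convert byte strides to element strides
elem_strides = tuple(s // w.itemsize for s in w.strides)

# For a row-major [E, K, N] layout: moving one step along K jumps N elements,
# moving one step along N jumps 1 element
assert elem_strides == (128 * 256, 256, 1)

# Which axis stride feeds stride_bk vs. stride_bn depends on the layout
# ([E, K, N] vs. [E, N, K]); passing them in the wrong order makes the
# kernel walk memory with swapped row/column steps and silently corrupts
# the matmul result.
```

That silent corruption is why the stride order fix above is part of this PR even though it looks like a two-line swap.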
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.