[Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc(#7777) #7818
ShaneGZhu wants to merge 3 commits into
Conversation
[Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc (PaddlePaddle#7777)
Thanks for your contribution!
CI report generated from the following code (updated every 30 minutes):

1 Task overview
2 Task status summary
2.1 Required tasks: 1/8 passed
2.2 Optional tasks: 19/24 passed
3 Failure details (required only)
Approval — approval workflow (confidence: high)
Root cause: this PR (a cherry-pick of a custom-op kernel fusion) triggered the approval requirement. Neither of the two required approvals has been granted, so the script returned exit code 6 (2 approval errors). Key log: Fix suggestion:
Fix suggestion summary: request review from the designated RDs: FastDeploy RD + PaddlePaddle RD. Related change: this PR contains a kernel-fusion custom op (cast+sigmoid+bias+noauxtc), which triggers the approval requirement. Link: view log
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-14 19:24:50
📋 Review summary
PR overview: adds a fused grouped_topk CUDA op that merges the four steps cast+sigmoid+bias+noaux_tc into a single kernel launch, improving MoE routing performance; controlled by the --enable-moe-scores-elementwise-fuse switch.
Scope of change: custom_ops/gpu_ops/, fastdeploy/model_executor/layers/moe/, fastdeploy/engine/
Impact tags: [OP] [Optimization]
📝 PR convention check
Two convention issues: (1) the title tag [Op] deviates in casing (the official tag is [OP]); (2) every section of the PR description is empty or a placeholder and needs to be filled in.
Suggested title (copy-paste ready):
[Cherry-Pick][OP][Optimization] Kernel fusion: cast+sigmoid+bias+noauxtc(#7777)
Suggested PR description (copy-paste ready):
## Motivation
Adds a fused `grouped_topk` CUDA kernel that merges the previously separate cast + sigmoid + e_score_correction_bias addition + noaux_tc operations into a single kernel launch, reducing MoE routing kernel-launch overhead and intermediate GPU-memory reads/writes. Enabled on demand via the `--enable-moe-scores-elementwise-fuse` flag; covers MoE configurations such as DeepSeek-V3/R1, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2.
## Modifications
- `custom_ops/gpu_ops/grouped_topk_kernels.cu`: new fused kernel implementing cast+sigmoid+bias+grouped_topk in one pass; supports float32/float16/bfloat16 input
- `custom_ops/gpu_ops/cpp_extensions.cc`: add the `grouped_topk` C++ function declaration and pybind11 binding
- `custom_ops/setup_ops.py`: add `grouped_topk_kernels.cu` to both compile-source lists
- `fastdeploy/engine/args_utils.py`: update the comment on the `enable_moe_scores_elementwise_fuse` field
- `fastdeploy/model_executor/layers/moe/moe.py`: add a `use_fused_cast=True` branch in `get_moe_scores` that calls `grouped_topk` in place of the previous `fused_cast_sigmoid_bias + noaux_tc` path; remove the `fused_cast_sigmoid_bias` import accordingly
- `fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py`: both `apply()` and `apply_tp()` now gate the fused path on the `enable_moe_scores_elementwise_fuse` flag; fix the stride_bk/stride_bn order for `up_gate_proj_weight`
- `tests/operators/test_grouped_topk_op.py`: new operator unit test covering the DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2 model configurations plus the bfloat16 input path
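The fused routing computation described above can be sketched as a plain-NumPy reference (a hypothetical sketch; the function name, argument order, and group-masking details are assumptions — the real implementation is the CUDA kernel in `grouped_topk_kernels.cu`):

```python
import numpy as np

def grouped_topk_reference(gating_output, bias, n_group, topk_group, top_k,
                           renormalize=False):
    """Reference for the fused cast + sigmoid + bias + grouped top-k routing.

    gating_output: [num_tokens, num_experts], any float dtype (fp16/bf16/fp32)
    bias:          [num_experts] e_score_correction_bias
    """
    # Step 1: cast to float32 (previously a separate kernel)
    logits = gating_output.astype(np.float32)
    # Step 2: sigmoid activation
    scores = 1.0 / (1.0 + np.exp(-logits))
    # Step 3: add the expert-score correction bias (used for selection only)
    scores_with_bias = scores + bias.astype(np.float32)

    num_tokens, num_experts = scores_with_bias.shape
    grouped = scores_with_bias.reshape(num_tokens, n_group, -1)
    # Step 4a: keep only the topk_group best expert groups per token
    group_scores = grouped.max(axis=-1)
    top_groups = np.argsort(-group_scores, axis=-1)[:, :topk_group]
    mask = np.zeros((num_tokens, n_group, 1), dtype=np.float32)
    for t in range(num_tokens):
        mask[t, top_groups[t], 0] = 1.0
    masked = (grouped * mask).reshape(num_tokens, num_experts)
    # Step 4b: final top-k experts among the surviving groups;
    # routing weights come from the un-biased scores
    topk_ids = np.argsort(-masked, axis=-1)[:, :top_k]
    topk_weights = np.take_along_axis(scores, topk_ids, axis=-1)
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(axis=-1, keepdims=True)
    return topk_weights, topk_ids
```

A reference like this is what the unit test compares the CUDA kernel against.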
## Usage or Command
```bash
# Enable the fused MoE scoring op
python -m fastdeploy.entrypoints.openai.api_server \
--enable-moe-scores-elementwise-fuse \
...
```
## Accuracy Tests
The operator unit test (`tests/operators/test_grouped_topk_op.py`) validates numerical accuracy by comparing a Python reference implementation against the CUDA kernel (atol=1e-3, rtol=1e-3), covering the four model configurations above.
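The comparison pattern above can be sketched as follows (a minimal sketch; the helper name is hypothetical — the actual test lives in `test_grouped_topk_op.py`). Expert ids are discrete, so they should match exactly, while the floating-point routing weights are compared within the stated tolerances:

```python
import numpy as np

def check_grouped_topk(out_weights, out_ids, ref_weights, ref_ids,
                       atol=1e-3, rtol=1e-3):
    """Expert ids must match exactly; routing weights only within tolerance."""
    if not np.array_equal(np.asarray(out_ids), np.asarray(ref_ids)):
        return False
    return bool(np.allclose(out_weights, ref_weights, atol=atol, rtol=rtol))
```

A kernel output that differs from the reference by less than atol passes; any disagreement in the selected expert ids fails immediately.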
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | moe.py:136 | With EP redundant and fused cast both enabled, gating_output enters noaux_tc_redundant without being cast to float32 |
| 📝 PR convention | moe.py:44 | The exception warning message was not updated and still reads "import noaux_tc Failed!" |
Overall assessment
The fused operator design is sound, test coverage is thorough (four mainstream MoE configurations plus the bf16 path), and the cherry-pick process follows convention. Watch the dtype compatibility when EP redundant and fused cast are enabled together, and fill in the PR description.
    n_group if n_group > 0 else 1,
    topk_group if topk_group > 0 else 1,
    top_k,
    renormalize,
🟡 Suggestion: when use_fused_cast=True and expert_id_to_ep_rank_array is not None, gating_output enters this else branch without `.cast("float32")`, so the output of `paddle.nn.functional.sigmoid(gating_output)` keeps the original dtype (bf16/fp16). If noaux_tc_redundant does not support non-float32 scores input, this will cause a runtime error or an accuracy anomaly.

Suggest an explicit cast at the top of the else branch:

```python
else:
    if use_fused_cast:  # fused path only covers non-redundant EP
        gating_output = gating_output.cast("float32")
    scores = paddle.nn.functional.sigmoid(gating_output)
    scores_with_bias = scores + e_score_correction_bias
```

Or document in the function comment/docs that the EP redundant path does not support fused cast mode.
    noaux_tc,
    noaux_tc_redundant,
)
except:
📝 PR convention: the warning message was not updated alongside the new grouped_topk; suggest changing it to:
logger.warning("import noaux_tc/grouped_topk Failed!")
Codecov Report
❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## release/2.6 #7818 +/- ##
==============================================
Coverage ? 72.46%
==============================================
Files ? 381
Lines ? 54142
Branches ? 8456
==============================================
Hits ? 39235
Misses ? 12148
Partials ? 2759
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
-    stride_bk=layer.up_gate_proj_weight.strides[1],
-    stride_bn=layer.up_gate_proj_weight.strides[2],
+    stride_bk=layer.up_gate_proj_weight.strides[2],
+    stride_bn=layer.up_gate_proj_weight.strides[1],
Does the fusion kernel affect the preceding topk operator?
This was accidentally changed in Minzheng's previous PR; I restored it here.
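Why swapping the two stride arguments matters can be sketched with NumPy (an illustration only; the [num_experts, K, N] shape is an assumption, and note that NumPy strides are in bytes while Triton-style kernels take element strides):

```python
import numpy as np

# Hypothetical 3-D expert weight tensor: [num_experts=8, K=128, N=256]
w = np.zeros((8, 128, 256), dtype=np.float32)

# Convert byte strides to element strides
elem_strides = tuple(s // w.itemsize for s in w.strides)

# For a row-major [E, K, N] layout: moving one step along K jumps N elements,
# moving one step along N jumps 1 element
assert elem_strides == (128 * 256, 256, 1)

# Which axis stride feeds stride_bk vs. stride_bn depends on the layout
# ([E, K, N] vs. [E, N, K]); passing them in the wrong order makes the
# kernel walk memory with swapped row/column steps and silently corrupts
# the matmul result.
```

That silent corruption is why the stride order fix above is part of this PR even though it looks like a two-line swap.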
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.