
[Cherry-Pick][Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc(#7777)#7818

Open
ShaneGZhu wants to merge 3 commits into PaddlePaddle:release/2.6 from ShaneGZhu:cp26

Conversation

@ShaneGZhu
Contributor

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 14, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-14 18:27:42

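This CI report was generated from the following code (refreshed every 30 minutes):


1 Task overview

⚠️ 1 Required task failed and another 6 Required tasks are still running or pending; resolve the approval issue first.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 32 (0) | 32 | 20 | 3 | 5 | 4 | 0 |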

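2 Task status summary

2.1 Required tasks: 1/8 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 7s | PR issue: the custom-operator PR lacks approval from one FastDeploy RD and one Paddle RD | Request approval from the designated RDs: FastDeploy RD + PaddlePaddle RD | Job | - |
| ⏳ | Run Base Tests / base_tests | - | Running | - | Job | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | Run Stable Tests / stable_tests | - | Running | - | Job | - |
| ⏸️ | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Pending | - | - | - |
| ⏸️ | Run FastDeploy LogProb Tests / run_tests_logprob | - | Pending | - | - | - |
| ⏸️ | Run Four Cards Tests / run_4_cards_tests | - | Pending | - | - | - |
| ✅ | The remaining 1 required task passed | - | - | - | - | - |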

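2.2 Optional tasks: 19/24 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Check PR Template | 12s | Job | - |
| ❌ | Trigger Jenkins for PR | 20s | Job | - |
| ⏳ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| ⏳ | xpu_build_test / xpu-build-test | - | - | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The remaining 19 optional tasks passed | - | - | - |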

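3 Failure details (Required only)

Approval: approval process (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: approval process
  • Confidence: high
  • Root cause summary: the custom-operator PR lacks the required approvals; one FastDeploy RD and one Paddle RD must each approve it
  • Analyzer: generic analysis (fallback)

Root cause details:

This PR (a Cherry-Pick of the custom-operator kernel fusion) triggers two approval requirements in the check_approval.sh script:

  1. One FastDeploy RD (qingqing01/dangqingqing, Jiang-Jia-Jun/jiangjiajun, heavengate/dengkaipeng) must approve
  2. One PaddlePaddle RD (jeff41404/gaoxiang, yongqiangma/mayongqiang) must approve

Neither approval is currently satisfied, so the script returns exit code 6 (2 approval errors).

Key log: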

0. You must have one FastDeploy RD (qingqing01(dangqingqing), Jiang-Jia-Jun(jiangjiajun), heavengate(dengkaipeng)) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404(gaoxiang), yongqiangma(mayongqiang)) approval for adding custom op.

There are 2 approved errors.
##[error]Process completed with exit code 6.

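Suggested fix:

  1. Have any one FastDeploy RD among @dangqingqing, @jiangjiajun, @DENGKAIPENG approve this PR
  2. Have any one PaddlePaddle RD among @gaoxiang, @mayongqiang approve this PR

Suggested fix summary: request approval from the designated RDs: FastDeploy RD + PaddlePaddle RD

Related change: this PR contains a kernel-fusion custom operator (cast+sigmoid+bias+noauxtc), which triggers the approval requirement

Link: View log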

PaddlePaddle-bot

This comment was marked as outdated.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-14 19:24:50

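📋 Review summary

PR overview: Adds a fused grouped_topk CUDA operator that merges the four steps cast + sigmoid + bias + noaux_tc into a single kernel launch to speed up MoE routing, controlled by the --enable-moe-scores-elementwise-fuse switch.
Scope of changes: custom_ops/gpu_ops/, fastdeploy/model_executor/layers/moe/, fastdeploy/engine/
Impact tags: [OP] [Optimization]

📝 PR convention check

There are two convention issues: (1) the title tag [Op] deviates in capitalization (the official tag is [OP]); (2) every section of the PR description is empty or a placeholder and needs to be filled in.

Suggested title (can be copied directly):

  • [Cherry-Pick][OP][Optimization] Kernel fusion: cast+sigmoid+bias+noauxtc(#7777)

Suggested PR description (can be copied directly):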

## Motivation
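Adds a fused `grouped_topk` CUDA kernel that merges four previously separate operations (cast, sigmoid, adding e_score_correction_bias, and noaux_tc) into a single kernel launch, reducing kernel-launch overhead and intermediate GPU-memory reads and writes in MoE routing. It is enabled on demand with the `--enable-moe-scores-elementwise-fuse` flag and covers several MoE configurations, including DeepSeek-V3/R1, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2.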

## Modifications
- `custom_ops/gpu_ops/grouped_topk_kernels.cu`: adds the fused kernel, which performs cast+sigmoid+bias+grouped_topk in one pass and supports float32/float16/bfloat16 input
- `custom_ops/gpu_ops/cpp_extensions.cc`: adds the `grouped_topk` C++ function declaration and pybind11 binding
- `custom_ops/setup_ops.py`: adds `grouped_topk_kernels.cu` to both lists of source files to compile
- `fastdeploy/engine/args_utils.py`: updates the comment on the `enable_moe_scores_elementwise_fuse` field
- `fastdeploy/model_executor/layers/moe/moe.py`: `get_moe_scores` gains a `use_fused_cast=True` branch that calls `grouped_topk` in place of the original `fused_cast_sigmoid_bias + noaux_tc` path; the `fused_cast_sigmoid_bias` import is removed accordingly
- `fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py`: `apply()` and `apply_tp()` both use the `enable_moe_scores_elementwise_fuse` flag to control the fused path; fixes the order of stride_bk/stride_bn for `up_gate_proj_weight`
- `tests/operators/test_grouped_topk_op.py`: adds operator unit tests covering the four model configurations DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2, as well as the bfloat16 input path

## Usage or Command
```bash
# Enable the fused MoE scoring operator
python -m fastdeploy.entrypoints.openai.api_server \
    --enable-moe-scores-elementwise-fuse \
    ...
```

## Accuracy Tests
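The operator unit test (`tests/operators/test_grouped_topk_op.py`) verifies numerical accuracy by comparing a Python reference implementation against the CUDA kernel (atol=1e-3, rtol=1e-3), covering the four model configurations above.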

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

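Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | moe.py:136 | When EP redundant and fused cast are both enabled, gating_output enters noaux_tc_redundant without being cast to float32 |
| 📝 PR convention | moe.py:44 | The exception warning message was not updated and still reads "import noaux_tc Failed!" |

Overall assessment

The fused operator is well designed, test coverage is comprehensive (four mainstream MoE configurations plus the bf16 path), and the cherry-pick process follows convention. Attention is needed on dtype compatibility when EP redundant and fused cast are both enabled, and the PR description needs to be filled in.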

n_group if n_group > 0 else 1,
topk_group if topk_group > 0 else 1,
top_k,
renormalize,

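🟡 Suggestion: When use_fused_cast=True and expert_id_to_ep_rank_array is not None, gating_output enters this else branch without .cast("float32"), so the output of paddle.nn.functional.sigmoid(gating_output) keeps the original dtype (bf16/fp16). If noaux_tc_redundant does not support non-float32 scores input, this will cause a runtime error or accuracy anomalies.

Suggest casting explicitly at the start of the else branch: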

else:
    if use_fused_cast:  # fused path only covers non-redundant EP
        gating_output = gating_output.cast("float32")
    scores = paddle.nn.functional.sigmoid(gating_output)
    scores_with_bias = scores + e_score_correction_bias

Alternatively, state in the function comment/documentation that the EP redundant path does not support fused cast mode.
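
To illustrate why the dtype of the sigmoid output can matter for routing, here is a small standalone sketch that emulates bfloat16 rounding in pure Python. It does not call Paddle and says nothing about what noaux_tc_redundant actually accepts; `to_bf16` is a hypothetical helper, and the two logits are chosen so that both are exactly representable in bfloat16.

```python
import math
import struct


def to_bf16(x):
    """Round x to the nearest bfloat16 value (round-to-nearest-even), returned as a float.

    Emulation only: goes through float32 bits and keeps the upper 16 bits.
    NaN handling is omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Two distinct logits, both exactly representable in bfloat16.
logit_a, logit_b = 3.0, 3.015625

# Sigmoid evaluated in a wide float type keeps the two scores apart.
wide_a, wide_b = sigmoid(logit_a), sigmoid(logit_b)

# Storing the same scores in bfloat16 collapses them onto one value.
bf16_a, bf16_b = to_bf16(wide_a), to_bf16(wide_b)
```

Here the wide scores differ, but both round to 0.953125 in bfloat16, so a top-k selection on the low-precision scores would have to break a tie that does not exist at higher precision.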

noaux_tc,
noaux_tc_redundant,
)
except:

📝 PR convention: The warning message was not updated along with the newly added grouped_topk; suggest changing it to:

logger.warning("import noaux_tc/grouped_topk Failed!")
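
One way to keep such a message from going stale is to build it from the list of names being imported and to catch only import-related exceptions instead of using a bare except. The sketch below shows that pattern with the standard library only; `try_import_ops` and the logger name are hypothetical and are not part of FastDeploy.

```python
import importlib
import logging

logger = logging.getLogger("moe_ops_example")  # hypothetical logger name


def try_import_ops(module_name, op_names):
    """Return {op_name: callable}; on failure, log a warning naming every op and return {}."""
    try:
        module = importlib.import_module(module_name)
        return {name: getattr(module, name) for name in op_names}
    except (ImportError, AttributeError) as exc:
        # The message is derived from op_names, so adding an op updates it automatically.
        logger.warning("import %s from %s failed: %s", "/".join(op_names), module_name, exc)
        return {}


# Demonstration with a stdlib module standing in for the custom-op module.
available = try_import_ops("math", ["sqrt", "floor"])
missing = try_import_ops("no_such_module_xyz", ["noaux_tc", "grouped_topk"])
```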

@codecov-commenter

Codecov Report

❌ Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@d02f3ba). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...el_executor/layers/moe/fused_moe_triton_backend.py | 80.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7818   +/-   ##
==============================================
  Coverage               ?   72.46%           
==============================================
  Files                  ?      381           
  Lines                  ?    54142           
  Branches               ?     8456           
==============================================
  Hits                   ?    39235           
  Misses                 ?    12148           
  Partials               ?     2759           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 72.46% <92.30%> (?) |
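
The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers printed above. The patch size of 13 lines is an inference (one line not fully covered at 92.30769% is consistent with 12 of 13), not a number the report states, and the observation that the displayed project percentage matches truncation to two decimals is likewise an inference from these figures.

```python
import math

# Totals copied from the coverage table above.
hits, misses, partials, lines = 39235, 12148, 2759, 54142

# Every counted line is a hit, a miss, or a partial.
totals_consistent = (hits + misses + partials == lines)

# Project coverage counts only full hits.
project_pct = 100.0 * hits / lines
project_pct_truncated = math.floor(project_pct * 100) / 100

# Assumed patch size: 13 changed lines, 12 of them fully covered.
patch_pct = 100.0 * 12 / 13
```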


- stride_bk=layer.up_gate_proj_weight.strides[1],
- stride_bn=layer.up_gate_proj_weight.strides[2],
+ stride_bk=layer.up_gate_proj_weight.strides[2],
+ stride_bn=layer.up_gate_proj_weight.strides[1],
Contributor


Does the fusion kernel affect the preceding topk operator?

Contributor Author


This was changed by mistake in 敏峥's previous PR; I have restored it.
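
As background on why the order of the two stride arguments matters, here is a small sketch of stride-based addressing. It assumes a contiguous weight stored as [E, N, K], which this thread does not state, and the helpers `contiguous_strides` and `load` are hypothetical. It only shows that swapping the K and N strides makes the same (e, k, n) coordinates resolve to a different element.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a contiguous tensor of the given shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def load(flat, e, k, n, stride_be, stride_bk, stride_bn):
    """Address element (e, k, n) the way a stride-parameterized kernel would."""
    return flat[e * stride_be + k * stride_bk + n * stride_bn]


E, N, K = 2, 3, 4                      # assumed layout: [E, N, K]
strides = contiguous_strides((E, N, K))

# Each stored element records its own (e, n, k) coordinates.
flat = [(e, n, k) for e in range(E) for n in range(N) for k in range(K)]

# For an [E, N, K] layout the K stride is strides[2] and the N stride is strides[1].
right = load(flat, 0, 1, 2, strides[0], stride_bk=strides[2], stride_bn=strides[1])
# Swapping the two strides silently reads a different element.
wrong = load(flat, 0, 1, 2, strides[0], stride_bk=strides[1], stride_bn=strides[2])
```

With larger coordinates the swapped version can also index past the end of the buffer, which in a GPU kernel would be an out-of-bounds read instead of a Python IndexError.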


4 participants