[bugfix] free blocks even if AS write failed #7807
Thanks for your contribution!
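CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
❌ 2 Required tasks failed; they must be fixed first before this PR can be merged.
2 Task status summary
2.1 Required tasks: 6/8 passed
2.2 Optional tasks: 17/21 passed
3 Failure details (Required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below the required threshold (confidence: high)
Root cause details: Key logs: Fix suggestions:
Fix suggestion summary: add test cases for prefix_cache_manager.py L1485/1491/1516-1518
Related changes: Link: view logs
Run Base Tests / base_tests: test assertion failed (confidence: medium)
Failed cases:
Root cause details: Key logs: Fix suggestions:
Fix suggestion summary: rerun the job; if it keeps failing, investigate the block management logic
Related changes: Link: view logs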
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7807 +/- ##
==========================================
Coverage ? 63.27%
==========================================
Files ? 462
Lines ? 64292
Branches ? 9853
==========================================
Hits ? 40680
Misses ? 20846
Partials ? 2766
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-17 01:28:50
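📋 Review summary
PR overview: fixes two defects in KVCache block lifecycle management: leaked LRU heap nodes, and blocks not being freed when a write to storage fails
Change scope: fastdeploy/cache_manager/, fastdeploy/engine/sched/
Impact tags: [KVCache] [Scheduler]
Issues
No blocking issues found.
📝 PR convention check
The tag in the PR title is formatted incorrectly ([bugfix] should be written as [BugFix]). Every section of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) is still an empty template placeholder, and no Checklist item is ticked, so the actual content needs to be filled in.
Suggested title (can be copied as-is):
[BugFix] free blocks even if AS write failed
Suggested PR description (can be copied as-is):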
## Motivation
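Fix two defects in KVCache block lifecycle management:
1. In `free_block_ids_async`, nodes with `shared_count > 0` are skipped via `continue` and then permanently dropped from the GPU LRU heap, so the heap leaks nodes that can never be freed afterwards.
2. In `_trigger_preempt` and `finish_requests`, if `write_cache_to_storage[_decode]` raises an exception, `_free_blocks` is never executed, so the blocks leak permanently and the available KVCache capacity keeps shrinking.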
## Modifications
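- `fastdeploy/cache_manager/prefix_cache_manager.py`: in the main loop of `free_block_ids_async`, collect the skipped nodes (`shared_count > 0` or any other skip condition) into a `skipped_nodes` list and, once the loop ends, push them back into `gpu_lru_leaf_heap` and `gpu_lru_leaf_set`, so that they are not permanently leaked from the heap.
- `fastdeploy/engine/sched/resource_manager_v1.py`: wrap the `write_cache_to_storage` / `write_cache_to_storage_decode` calls in `_trigger_preempt` and `finish_requests` in `try/except`, and log a warning when an exception is caught, so that `_free_blocks` still runs when the write to storage fails.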
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
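The approach to the fix is clear and correct: both the `skipped_nodes` write-back logic and the `try/except` guard effectively address their respective bugs. It is recommended to fill in the PR description and normalize the title tag to [BugFix] before merging.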
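For readers who want to see the shape of the `try/except` guard described in this review, here is a minimal sketch. It is not the code from `resource_manager_v1.py`: `ToyResourceManager`, `storage_writer`, `failing_writer`, and the `allocated` / `free_pool` fields are hypothetical names invented for illustration, and only the idea (a failed storage write must not prevent `_free_blocks` from running) comes from the PR.

```python
import logging

logger = logging.getLogger("sched_sketch")


class ToyResourceManager:
    """Hypothetical stand-in for the scheduler's resource manager."""

    def __init__(self, storage_writer):
        self.storage_writer = storage_writer  # callable that may raise
        self.allocated = {}   # request_id -> list of block ids
        self.free_pool = []   # block ids available for reuse

    def _free_blocks(self, request_id):
        # Return the request's blocks to the pool (no-op if already freed).
        self.free_pool.extend(self.allocated.pop(request_id, []))

    def finish_request(self, request_id):
        try:
            self.storage_writer(request_id)
        except Exception as exc:
            # A failed write loses the persisted copy, not the GPU blocks.
            logger.warning("write to storage failed for %s: %s", request_id, exc)
        # Reached whether or not the write succeeded.
        self._free_blocks(request_id)


def failing_writer(request_id):
    # Simulates a storage backend that always fails.
    raise IOError("storage backend unavailable")


manager = ToyResourceManager(failing_writer)
manager.allocated["req-0"] = [3, 4, 5]
manager.finish_request("req-0")
```

In this toy model the three blocks end up back in `free_pool` even though the writer raised. A `try/finally` would also guarantee the free, but it would let the exception propagate to the caller; catching and logging, as the PR is described to do, treats the storage write as best-effort.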
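CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
2 Required tasks failed; they must be fixed first before this PR can be merged.
2 Task status summary
2.1 Required tasks: 8/10 passed
2.2 Optional tasks: 28/32 passed
3 Failure details (Required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below the required threshold (confidence: high)
Root cause details: Key logs: Fix suggestions:
Fix suggestion summary: add unit tests for the skipped_nodes logic at prefix_cache_manager.py L1485-1518
Related changes:
Run Base Tests / base_tests: test failed (confidence: medium)
Failed cases:
Root cause details: Key logs: Fix suggestions:
Fix suggestion summary: confirm how re-inserting skipped_nodes affects block allocation, or try a rerun
Related changes: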
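Since both CI reports ask for tests around the `skipped_nodes` logic, the following toy model may help pin down the invariant such a test would check: a node skipped during eviction must still be in the heap and the leaf set afterwards. This is a sketch under assumptions, not the `free_block_ids_async` implementation. `ToyNode`, `free_lru_blocks`, and the `last_used` field are hypothetical; only `shared_count` and the push-back idea come from the PR description.

```python
import heapq


class ToyNode:
    """Hypothetical cache node ordered by last-use time (oldest first)."""

    def __init__(self, block_id, last_used, shared_count=0):
        self.block_id = block_id
        self.last_used = last_used
        self.shared_count = shared_count

    def __lt__(self, other):
        return self.last_used < other.last_used


def free_lru_blocks(heap, leaf_set, need):
    """Pop up to `need` evictable nodes; push skipped nodes back afterwards."""
    freed, skipped = [], []
    while heap and len(freed) < need:
        node = heapq.heappop(heap)
        leaf_set.discard(node)
        if node.shared_count > 0:
            # Still in use: cannot evict now, but must not be forgotten.
            skipped.append(node)
            continue
        freed.append(node.block_id)
    # Without this loop the skipped nodes would be lost from the heap for good.
    for node in skipped:
        heapq.heappush(heap, node)
        leaf_set.add(node)
    return freed


nodes = [
    ToyNode(0, last_used=1, shared_count=1),  # oldest, but still shared
    ToyNode(1, last_used=2),
    ToyNode(2, last_used=3),
]
heap = list(nodes)
heapq.heapify(heap)
leaf_set = set(nodes)

freed = free_lru_blocks(heap, leaf_set, need=1)
```

Here the oldest node is shared, so the call frees block 1 instead and puts node 0 back. Once `shared_count` on node 0 drops to zero, a later call can evict it, which is exactly what the leak prevented.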