[bugfix] free blocks even if AS write failed #7807

Open
zccjjj wants to merge 4 commits into PaddlePaddle:develop from zccjjj:ASbugfix

zccjjj (Contributor) commented May 13, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot Bot commented May 13, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.


PaddlePaddle-bot commented May 13, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-17 22:32:13

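This CI report was generated from the following code (updated every 30 minutes):


1 Task overview

2 Required tasks failed and must be fixed before this PR can be merged.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 29(0) | 29 | 23 | 6 | 0 | 0 | 0 |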

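2 Task status summary

2.1 Required tasks: 6/8 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h20m | PR issue: new-code coverage is 28.57%, total diff coverage is 76% < 80% | Add unit tests for the new code in prefix_cache_manager.py | Job | - |
| ❌ | Run Base Tests / base_tests | 12m31s | Flaky: 1005 responses with status 200 < 1024 threshold, a gap of only 1.9% | Rerun; if it keeps failing, investigate the block management logic | Job | - |
| ✅ | The other 6 required tasks passed | - | - | - | - | - |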

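2.2 Optional tasks: 17/21 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 10m29s | Job | - |
| ❌ | Check PR Template | 10s | Job | - |
| ❌ | CI_HPU | 1h7m | Job | - |
| ❌ | Trigger Jenkins for PR | 15m11s | Job | - |
| ✅ | The other 17 optional tasks passed | - | - | - |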

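3 Failure details (required tasks only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)

run_tests_with_coverage

  • Status: ❌ Failed
  • Error type: coverage below threshold
  • Confidence: high
  • Root cause summary: coverage of the code added by this PR is only 28.57%, and total diff coverage is 76%, below the 80% threshold
  • Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
The PR adds skipped_nodes logic to prefix_cache_manager.py (violation_lines: 1485, 1491, 1516, 1517, 1518). These 5 lines are not covered by existing tests, which drops the file's diff coverage to 28.57% (2 lines covered, 5 violations). The changes to resource_manager_v1.py are 100% covered and do not lower the total. Overall diff coverage is 76%, below the 80% threshold.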
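The 28.57% figure follows directly from the line counts in the report. The sketch below reproduces that arithmetic; the helper name `patch_coverage` is hypothetical, and the formula is an assumption about how the coverage tool derives the percentage, not a documented guarantee.

```python
def patch_coverage(covered_lines, violation_lines):
    """Percent of changed lines that were executed by tests.

    Hypothetical helper: assumes coverage = covered / (covered + violations).
    """
    total = len(covered_lines) + len(violation_lines)
    if total == 0:
        return 100.0  # nothing to cover
    return 100.0 * len(covered_lines) / total


# Line numbers taken from the CI report for prefix_cache_manager.py.
covered = [1451, 1515]
violations = [1485, 1491, 1516, 1517, 1518]
print(f"{patch_coverage(covered, violations):.2f}")  # 28.57
```

Covering the five violation lines would bring this file to 100% under the same formula.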

Key logs:

Coverage generation failed (exit code 9)
GPU Patch Coverage Details:
  "fastdeploy/cache_manager/prefix_cache_manager.py": {
    "percent_covered": 28.57,
    "violation_lines": [1485, 1491, 1516, 1517, 1518],
    "covered_lines": [1451, 1515]
  },
  "total_percent_covered": 76

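Suggested fixes:

  1. In the free_block_ids_async method of fastdeploy/cache_manager/prefix_cache_manager.py, add unit tests for the skipped_nodes branches (the skip branch at L1485, the skip branch at L1491, and the write-back loop at L1516-1518), covering the skip scenarios shared_count > 0 and is_gpu_leaf_node=False
  2. Alternatively, request a coverage exemption in the PR description and let the reviewer decide whether to accept it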
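The skip-and-restore pattern that these tests need to exercise can be sketched as follows. This is a minimal illustration, not FastDeploy's implementation: `Node`, `free_leaf_nodes`, and the test are hypothetical stand-ins, and the real `free_block_ids_async` has more skip conditions and different signatures.

```python
import heapq


class Node:
    """Hypothetical stand-in for a cache tree node; not FastDeploy's class."""

    def __init__(self, node_id, last_used, shared_count=0):
        self.node_id = node_id
        self.last_used = last_used
        self.shared_count = shared_count

    def __lt__(self, other):
        # heapq only needs "<"; least recently used nodes pop first.
        return self.last_used < other.last_used


def free_leaf_nodes(heap, leaf_set, need):
    """Pop up to `need` evictable nodes, then restore the skipped ones."""
    freed, skipped_nodes = [], []
    while heap and len(freed) < need:
        node = heapq.heappop(heap)
        leaf_set.discard(node)
        if node.shared_count > 0:
            # Still referenced: cannot be freed now, but must not be lost.
            skipped_nodes.append(node)
            continue
        freed.append(node)
    # Without this loop the skipped nodes would vanish from the heap.
    for node in skipped_nodes:
        heapq.heappush(heap, node)
        leaf_set.add(node)
    return freed


def test_skipped_node_is_pushed_back():
    shared = Node(1, last_used=1, shared_count=1)
    idle = Node(2, last_used=2)
    heap, leaf_set = [shared, idle], {shared, idle}
    heapq.heapify(heap)

    freed = free_leaf_nodes(heap, leaf_set, need=1)

    assert [n.node_id for n in freed] == [2]
    assert heap == [shared] and leaf_set == {shared}
```

A test against the real method would follow the same shape: put a node with `shared_count > 0` at the top of the heap, request one block, and assert that the node is still in both the heap and the set afterwards.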

Fix summary: add test cases for prefix_cache_manager.py L1485/1491/1516-1518

Related changes: fastdeploy/cache_manager/prefix_cache_manager.py L1451, L1483-1488, L1511-1518 (new skipped_nodes logic)

Link: view log

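Run Base Tests / base_tests: test assertion failed (confidence: medium)

base_tests

  • Status: ❌ Failed
  • Error type: test failure
  • Confidence: medium
  • Root cause summary: test_max_waiting_time received 1005 responses with status 200, below the 1024 threshold (1.9% short)
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| test_max_waiting_time.py::test_waiting_time | AssertionError: wrong number of 200 responses, actual=1005 | Under high concurrency the number of 200 responses is slightly below the expected value |

Root cause details:
test_max_waiting_time expects at least 1024 requests to return 200 but received 1005 (19 short, about 1.9%). This PR's change to resource_manager_v1.py (the try/except wrapper) only takes effect when kvcache_storage_backend is enabled, whereas this test runs with --no-enable-prefix-caching, so both of the PR's changes are only weakly related to the test scenario. The gap is within the normal jitter of a high-concurrency scheduling stress test, so this is most likely a flaky failure.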
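The size of the gap can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration and is not part of the test suite; it uses only the counts quoted in the report.

```python
def shortfall(actual, threshold):
    """Return (missing responses, shortfall as a percentage of the threshold)."""
    missing = threshold - actual
    return missing, 100.0 * missing / threshold


# Counts from the failing assertion: 1005 observed, 1024 required.
missing, pct = shortfall(1005, 1024)
print(missing, round(pct, 1))  # 19 1.9
```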

Key logs:

assert count_200 >= 1024, f"200 数量错误,应大于等于1024,实际={count_200}"
AssertionError: 200 数量错误,应大于等于1024,实际=1005
assert 1005 >= 1024
test_max_waiting_time.py:62: AssertionError
============================== 1 failed in 37.09s ==============================

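Suggested fixes:

  1. Rerun Run Base Tests / base_tests; if the rerun passes, treat this as a flaky test and merge as normal
  2. If it keeps failing, check whether the block release logic in resource_manager_v1.py changes scheduling behavior under high concurrency without prefix caching

Fix summary: rerun; if it keeps failing, investigate the block management logic

Related changes: fastdeploy/engine/sched/resource_manager_v1.py L437-460, L1720-1734 (try/except wrapping the cache write)

Link: view log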

PaddlePaddle-bot

This comment was marked as outdated.


codecov-commenter commented May 13, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@12c6ae0). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/cache_manager/prefix_cache_manager.py | 14.28% | 5 Missing and 1 partial ⚠️ |
| fastdeploy/engine/sched/resource_manager_v1.py | 92.85% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7807   +/-   ##
==========================================
  Coverage           ?   63.27%           
==========================================
  Files              ?      462           
  Lines              ?    64292           
  Branches           ?     9853           
==========================================
  Hits               ?    40680           
  Misses             ?    20846           
  Partials           ?     2766           
| Flag | Coverage Δ |
|---|---|
| GPU | 72.38% <66.66%> (?) |
| XPU | 7.11% <0.00%> (?) |


PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-17 01:28:50

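📋 Review summary

PR overview: fixes two defects in KVCache block lifecycle management: nodes leaking from the LRU heap, and blocks not being freed when the write to storage fails
Scope of changes: fastdeploy/cache_manager/, fastdeploy/engine/sched/
Impact tags: [KVCache] [Scheduler]

Issues

No blocking issues found.

📝 PR convention check

The tag in the PR title is misformatted ([bugfix] should be [BugFix]). Every section of the PR description (Motivation / Modifications / Usage or Command / Accuracy Tests) is still an empty template placeholder, and no Checklist item is ticked. Please fill in the actual content.

Suggested title (can be copied directly):

  • [BugFix] free blocks even if AS write failed

Suggested PR description (can be copied directly):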

## Motivation
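Fix two defects in KVCache block lifecycle management:
1. In `free_block_ids_async`, nodes with `shared_count > 0` were skipped with `continue` and then permanently lost from the GPU LRU heap, so the heap leaked and those nodes could never be freed later.
2. In `_trigger_preempt` and `finish_requests`, if `write_cache_to_storage[_decode]` raised an exception, `_free_blocks` was never executed, so blocks leaked permanently and the available KVCache resources kept shrinking.

## Modifications
- `fastdeploy/cache_manager/prefix_cache_manager.py`: in the main loop of `free_block_ids_async`, collect the skipped nodes (`shared_count > 0` or any other skip condition) into a `skipped_nodes` list and, after the loop ends, push them back into `gpu_lru_leaf_heap` and `gpu_lru_leaf_set` to prevent a permanent heap leak.
- `fastdeploy/engine/sched/resource_manager_v1.py`: wrap the `write_cache_to_storage` / `write_cache_to_storage_decode` calls in `_trigger_preempt` and `finish_requests` with `try/except`, log a warning when an exception is caught, and ensure `_free_blocks` still runs when the write to storage fails.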

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

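Overall assessment

The fix is clear and correct: both the skipped_nodes write-back logic and the try/except guard effectively resolve their respective bugs. Recommend filling in the PR description and normalizing the title tag to [BugFix] before merging.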
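The try/except guard can be illustrated with a small, self-contained sketch. The names `release_request`, `write_cache`, and `free_blocks` are hypothetical placeholders for the write-to-storage and block-freeing steps; the actual code in `resource_manager_v1.py` may be structured differently.

```python
import logging

logger = logging.getLogger(__name__)


def release_request(request_id, write_cache, free_blocks):
    """Try to persist the request's cache, then free its blocks regardless.

    `write_cache` and `free_blocks` are caller-supplied callables
    (hypothetical stand-ins for the real storage and block managers).
    """
    try:
        write_cache(request_id)
    except Exception as exc:
        # A failed write loses the cached data but must not leak blocks.
        logger.warning("cache write failed for %s: %s", request_id, exc)
    # Runs whether or not the write succeeded.
    free_blocks(request_id)
```

A `try/finally` would also guarantee the free, but it would let the exception propagate to the caller. Catching and logging it, as the PR description suggests, keeps the preempt and finish paths running after a storage failure.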


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-17 23:18:01

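This CI report was generated from the following code (updated every 30 minutes):


1 Task overview

2 Required tasks failed and must be fixed before this PR can be merged.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 42(0) | 42 | 36 | 6 | 0 | 0 | 0 |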

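2 Task status summary

2.1 Required tasks: 8/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h20m | PR issue: diff coverage of the new code in prefix_cache_manager.py is 76%, below the 80% threshold | Add unit tests for the skipped_nodes logic at L1485/1491/1516-1518 | Job | - |
| ❌ | Run Base Tests / base_tests | 12m31s | PR issue: test_waiting_time count_200=1005 < 1024; the block management change affects the concurrent success rate | Check how the skipped_nodes change affects block allocation, or adjust the test threshold | Job | - |
| ✅ | The other 8 required tasks passed | - | - | - | - | - |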

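2.2 Optional tasks: 28/32 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 10m29s | Job | - |
| ❌ | Check PR Template | 10s | Job | - |
| ❌ | CI_HPU | 1h7m | Job | - |
| ❌ | Trigger Jenkins for PR | 15m11s | Job | - |
| ✅ | The other 28 optional tasks passed | - | - | - |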

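3 Failure details (required tasks only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage below threshold (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

  • Status: ❌ Failed
  • Error type: coverage below threshold
  • Confidence: high
  • Root cause summary: prefix_cache_manager.py adds 7 lines of code, 5 of them uncovered; diff coverage of 76% is below the 80% threshold
  • Analyzer: ci_analyze_unittest_fastdeploy

Root cause details:
The PR adds skipped_nodes logic to fastdeploy/cache_manager/prefix_cache_manager.py (7 new lines in total), but only 2 of those lines are covered by existing tests; lines 1485, 1491, 1516, 1517, and 1518 are not executed by any test. The diff coverage report shows only 28.57% coverage for prefix_cache_manager.py and 76% overall diff coverage, below the 80% threshold required by CI, which triggers exit code 9.

Key logs: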

COVERAGE_EXIT_CODE: 9
{"fastdeploy/cache_manager/prefix_cache_manager.py": {
  "percent_covered": 28.57,
  "violation_lines": [1485, 1491, 1516, 1517, 1518]},
 "total_percent_covered": 76,
 "total_num_violations": 5}
##[error]Process completed with exit code 9.

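Suggested fixes:

  1. In the corresponding test file, add unit tests for the skipped_nodes logic in free_block_ids_async, covering the case where nodes with shared_count > 0 are pushed back onto the heap (this covers lines 1485, 1491, and 1516-1518 of prefix_cache_manager.py)
  2. If tests cannot be added right away, request a coverage exemption for this file in the CI configuration

Fix summary: add unit tests for the skipped_nodes logic at prefix_cache_manager.py L1485-1518

Related changes: fastdeploy/cache_manager/prefix_cache_manager.py adds the skipped_nodes list and push-back logic (+13 lines)
Link: view log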

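Run Base Tests / base_tests: test failure (confidence: medium)

Run Base Tests / base_tests

  • Status: ❌ Failed
  • Error type: test failure
  • Confidence: medium
  • Root cause summary: test_waiting_time count_200=1005 < 1024; the block management change is suspected of affecting concurrent request scheduling
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed cases:

| Test | Error | Root cause |
|---|---|---|
| test_max_waiting_time.py::test_waiting_time | AssertionError: count_200=1005 < 1024 | The skipped_nodes push-back logic changes block allocation timing |

Root cause details:
test_waiting_time sends requests concurrently and counts the HTTP 200 responses, expecting >= 1024 but getting only 1005 (19 short, a 98.1% success rate). The PR modifies the skipped_nodes logic in prefix_cache_manager.py::free_block_ids_async: previously, nodes with shared_count > 0 were "permanently lost" from the heap, and now they are pushed back onto it. This behavior change affects block allocation timing under heavy concurrency and may cause more requests to miss the wait timeout because of resource contention. Confidence is "medium" because 1005 vs 1024 is very close, so environmental fluctuation cannot be ruled out.

Key logs: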

assert count_200 >= 1024, f"200 数量错误,应大于等于1024,实际={count_200}"
AssertionError: 200 数量错误,应大于等于1024,实际=1005
assert 1005 >= 1024
test_max_waiting_time.py:62: AssertionError
FAILED test_max_waiting_time.py::test_waiting_time - 1 failed in 37.09s

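Suggested fixes:

  1. Review the skipped_nodes push-back logic in prefix_cache_manager.py::free_block_ids_async and confirm that concurrent block scheduling behaves as expected under the new behavior; if block allocation has materially changed, assess the impact on the 1024 threshold in test_max_waiting_time.py
  2. If this is confirmed to be a legitimate block management change, update the assertion threshold at test_max_waiting_time.py:62; if it is environmental fluctuation, try a rerun

Fix summary: confirm the impact of the skipped_nodes push-back on block allocation, or try a rerun

Related changes: fastdeploy/cache_manager/prefix_cache_manager.py L1511-1516 adds the skipped_nodes push-back logic
Link: view log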

3 participants