Skip to content

ci: concurrency groups, e2e sharding, React Doctor cache fix, branch cache cleanup#2377

Merged
malinskibeniamin merged 7 commits intomasterfrom
ci/blacksmith-audit-optimization
Apr 14, 2026
Merged

ci: concurrency groups, e2e sharding, React Doctor cache fix, branch cache cleanup#2377
malinskibeniamin merged 7 commits intomasterfrom
ci/blacksmith-audit-optimization

Conversation

@malinskibeniamin
Copy link
Copy Markdown
Contributor

@malinskibeniamin malinskibeniamin commented Apr 13, 2026

Summary

  • Add concurrency groups to repository-dispatch and frontend-react-doctor workflows
  • Shard Playwright e2e tests (single 16vCPU → 2×8vCPU shards)
  • Fix flaky quota-display.spec.ts test (exact regex match)
  • Fix React Doctor CI failures (stop caching stale node_modules)
  • Add cleanup-branch-cache workflow to prune caches on branch deletion

Context

30-day Blacksmith CI analysis across redpanda-data org. Companion PR with cloudv2 changes: redpanda-data/cloudv2#25518

Changes

Concurrency groups

Added cancel-in-progress: true to repository-dispatch.yml and frontend-react-doctor.yml. When a new push arrives on the same branch, in-flight runs cancel instead of completing on superseded commits.

E2E sharding (frontend-verify.yml)

Before After
1 job on 16vCPU, 30 min timeout 2 shards on 8vCPU, 20 min timeout

Uses Playwright --shard flag. Scales to 4+ shards as test count grows.

Flaky test fix (quota-display.spec.ts)

hasText: '100' substring-matched both the quota name cell (containing "100" in timestamp) and the value cell → strict mode violation. Fixed with regex /^100$/.

React Doctor cache fix (frontend-react-doctor.yml)

Stale prefix-matched node_modules cache caused knip (inside react-doctor) to resolve wrong zod version. Removed node_modules from cache path — only cache ~/.bun/install/cache.

Cache cleanup (cleanup-branch-cache.yml)

Automatically deletes GitHub Actions caches when branches are deleted. Console has 51 cache entries (8.4 GB), many from stale merged branches. Uses Blacksmith runner for consistency, per_page: 100 for explicit pagination.

Test plan

  • E2E sharding verified in CI (both shards run)
  • React Doctor passes locally (99/100)
  • Review feedback addressed (runner, pagination, shard format consistency)

🤖 Generated with Claude Code

@malinskibeniamin malinskibeniamin changed the title ci: add concurrency groups, shard e2e tests, add Blacksmith audit ci: Blacksmith CI audit — concurrency groups, e2e sharding, cache fixes Apr 14, 2026
@malinskibeniamin
Copy link
Copy Markdown
Contributor Author

@claude review

@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

Claude finished @malinskibeniamin's task in 2m 54s —— View job


Code Review

  • Gather context and read changed files
  • Review workflow changes (concurrency, sharding, cache)
  • Review flaky test fix
  • Review cleanup workflow
  • Post final review

Overall this is a solid, well-reasoned CI optimization. A few findings below, mostly minor.


cleanup-branch-cache.yml — New workflow

Pagination gap: getActionsCacheList returns at most 100 results per page. If a branch somehow accumulates >100 cache entries, only the first page is deleted. Not a problem today (51 total entries across the whole repo), but worth a one-liner note or a per_page: 100 param to be explicit.

const caches = await github.rest.actions.getActionsCacheList({
  owner: context.repo.owner,
  repo: context.repo.repo,
  ref,
  per_page: 100, // max; add pagination if this ever exceeds 100 per branch
});

Runner: Uses ubuntu-latest (GitHub-hosted) instead of the Blacksmith runners used everywhere else. For a lightweight, non-critical cleanup job this is fine, but if you want consistency (and to keep billing on Blacksmith) blacksmith-2vcpu-ubuntu-2404 would align with the rest of the repo.

Fix this →


repository-dispatch.yml — Concurrency group

The cancel-in-progress: true is the right call for branches, but worth understanding the tag push scenario: the concurrency group key is ${{ github.workflow }}-${{ github.ref }}. Two different tags (v1.0.0 then v1.0.1) produce different refs and won't cancel each other — that's correct. Two pushes to the same tag would cancel, which is an unusual/non-standard operation anyway.

More notable: the dispatch job sends a repository_dispatch event to console-enterprise and then sets a pending commit status. If a run is cancelled after the event is dispatched but before the status step runs, console-enterprise will start building while the console commit status stays unset. In practice this is very low probability (the dispatch step is fast and the cancel would have to land in that narrow window), but it's something to be aware of if you ever see "no status from enterprise CI" on a commit.


frontend-react-doctor.yml — Cache key divergence

The fix correctly removes node_modules from the cache and changes the key prefix from bun-frontend- to bun-cache-. The new key prefix is intentionally different to avoid restoring the old broken cache — good call.

Long-term: react-doctor now uses a permanently different cache namespace (bun-cache-) from frontend-verify.yml (bun-frontend-). The two workflows will never share a warmed bun download cache. That doubles the download cache storage for bun packages. If cache storage is a concern, consider either:

  • Converging to one key prefix once old caches expire (a few weeks), or
  • Unifying cache keys from the start

Not a blocker — just a note for when you revisit cache housekeeping.


frontend-verify.yml — E2E sharding

Sharding is per-variant, not global: bun run e2e-test -- --shard=1/2 calls run-all-variants.mjs, which runs each discovered variant as a separate Playwright process, passing --shard=1/2 to each. With two runnable variants in standard CI (console and kafka-connect), each CI shard:

  • Pre-builds the backend Docker image
  • Runs console tests (shard 1/2 or 2/2) and kafka-connect tests (shard 1/2 or 2/2) in parallel

This means Docker image builds happen twice (once per CI shard) instead of once. If Docker build time is a significant fraction of the 20-min budget, this reduces the net gain from sharding. It still halves Playwright test execution time, which is the dominant cost — but worth knowing the tradeoff.

node_modules still cached in verify workflows: The React Doctor fix removed node_modules from the cache to fix the stale zod issue with knip. The lint, type-check, build, test-unit, test-integration, and e2e-test jobs in frontend-verify.yml still cache node_modules. If knip is ever added to those workflows, the same stale-cache issue could reappear. Not a current bug, just something to flag.

Minor — shard format inconsistency: test-integration uses unquoted YAML scalars ([1/2, 2/2]) while e2e-test uses quoted strings (["1/2", "2/2"]). Both work, but consistency would be nice.

Artifact naming: Using ${{ strategy.job-index }} for the Playwright report artifact is the right choice — matrix.shard contains / which would make an awkward artifact name. job-index (0-based) is clean. ✓


quota-display.spec.ts — Flaky test fix

/^100$/ regex is the correct fix. hasText: '100' was a substring match that could fire on the quota name cell when the Date.now() timestamp happens to contain 100 as a substring (e.g., 1744100XXXXXX). The exact-match regex is precise and the intent is clear. ✓


Summary

Area Status
Concurrency groups ✓ Correct and well-applied
E2E sharding ✓ Works — be aware of per-variant Docker overhead
React Doctor cache fix ✓ Correct fix; note the permanent key divergence
Flaky test fix ✓ Correct
Cache cleanup workflow Minor: add per_page: 100, consider Blacksmith runner
repository-dispatch dispatch race Low-risk edge case, informational only

malinskibeniamin and others added 5 commits April 14, 2026 12:00
- Add concurrency groups with cancel-in-progress to repository-dispatch
  and frontend-react-doctor workflows to prevent redundant CI runs
  when pushing rapidly to the same branch
- Shard Playwright e2e tests from single 16vCPU job to 2x8vCPU shards,
  halving wall time while maintaining same total compute
- Add BLACKSMITH_AUDIT.md documenting 30-day CI analysis across
  redpanda-data org with findings and recommendations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `hasText: '100'` filter matched both the quota name cell (which
contains "100" as part of the random timestamp suffix) and the actual
value cell, causing a strict mode violation. Using regex `/^100$/`
ensures only cells with exact text content "100" are matched.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stale prefix-matched node_modules cache causes knip (inside
react-doctor) to fail with "Package subpath './mini' is not defined
by exports in zod/package.json" when bun.lock changes. The stale
node_modules has top-level zod@3.x but knip needs nested zod@4.x.

Fix: only cache ~/.bun/install/cache (package download cache), not
node_modules. bun install rebuilds node_modules from scratch each
time, ensuring correct nested dependency structures. Minimal perf
impact since downloaded packages are still cached.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Automatically prunes GitHub Actions caches when branches are deleted.
Console repo has 51 cache entries (8.4 GB), many from stale merged
branches. This prevents cache bloat from accumulating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cleanup-branch-cache: use blacksmith-2vcpu-ubuntu-2404 for
  consistency with rest of repo
- cleanup-branch-cache: add per_page: 100 to getActionsCacheList
  to be explicit about pagination limit
- frontend-verify: use unquoted shard format [1/2, 2/2] for
  consistency with test-integration matrix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@malinskibeniamin malinskibeniamin force-pushed the ci/blacksmith-audit-optimization branch from 00c7923 to 33fb769 Compare April 14, 2026 10:01
@malinskibeniamin malinskibeniamin changed the title ci: Blacksmith CI audit — concurrency groups, e2e sharding, cache fixes ci: concurrency groups, e2e sharding, React Doctor cache fix, branch cache cleanup Apr 14, 2026
@malinskibeniamin
Copy link
Copy Markdown
Contributor Author

@claude review

@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

Claude finished @malinskibeniamin's task in 1m 30s —— View job


Code Review

  • Gather context
  • Read changed files
  • Review workflow changes
  • Review flaky test fix
  • Post final review

All three items flagged in the previous review have been addressed. Here's the full verification pass.

Previous feedback — resolved ✓

Issue Status
cleanup-branch-cache.yml: missing per_page: 100 Fixed — explicit per_page: 100 present
cleanup-branch-cache.yml: ubuntu-latest runner Fixed — now blacksmith-2vcpu-ubuntu-2404
Shard format inconsistency ([1/2, 2/2] vs ["1/2", "2/2"]) Fixed — both test-integration and e2e-test now use unquoted [1/2, 2/2]

Remaining informational notes (no action required)

bun-cache- vs bun-frontend- key divergence (frontend-react-doctor.yml vs frontend-verify.yml): still present by design — the intentional rename breaks the stale restore. Once old caches expire (a few weeks) you could optionally converge them, but it's not a blocker.

Pre-existing artifact naming issue (frontend-verify.yml:179): integration-test-results-${{ matrix.shard }} was already in the test-integration job before this PR. The value 1/2 contains a / that would produce an awkward artifact path if that job ever fails and tries to upload. The new e2e-test correctly uses strategy.job-index (0-based, no slash). The test-integration artifact fix is out of scope for this PR but worth a follow-up.

Docker double-build: covered in the previous review — still a known tradeoff of per-variant sharding. The Playwright test-time reduction is the dominant gain.


quota-display.spec.ts/^100$/ fix ✓

Correct. hasText: '100' was a substring match that could fire on the quota name cell when Date.now() happened to contain 100 (e.g. 174410XXXXXX). The anchored regex is precise.


Overall

This is clean and ready to merge. All previously raised issues are resolved; the remaining notes are informational and don't block shipping.

@malinskibeniamin malinskibeniamin self-assigned this Apr 14, 2026
@malinskibeniamin malinskibeniamin requested review from a team, Mateoc, datamali, eblairmckee and graham-rp and removed request for a team April 14, 2026 11:38
@malinskibeniamin
Copy link
Copy Markdown
Contributor Author

@claude review

@claude
Copy link
Copy Markdown

claude bot commented Apr 14, 2026

Claude finished @malinskibeniamin's task in 52s —— View job


Code Review

  • Gather context
  • Read changed files
  • Review workflow changes
  • Review flaky test fix
  • Post final review

This is a third pass over the same PR. All previously raised issues remain addressed — no regressions introduced by the latest commits.

Full verification

File Change Status
cleanup-branch-cache.yml New workflow; Blacksmith runner, per_page: 100
frontend-react-doctor.yml Concurrency group; removed node_modules from cache; bun-cache- key
frontend-verify.yml e2e: 16vCPU→2×8vCPU shards, 30min→20min; consistent unquoted shard format
repository-dispatch.yml Concurrency group with cancel-in-progress: true
quota-display.spec.ts hasText: '100'/^100$/ exact-match regex

Notes carried from previous reviews (informational, no action needed)

integration-test-results-${{ matrix.shard }} (frontend-verify.yml:179): pre-existing artifact name contains / from the shard value — could produce awkward paths if that job uploads on failure. Worth a follow-up but out of scope here. The new e2e artifact correctly uses strategy.job-index.

bun-cache- vs bun-frontend- namespace split: intentional to break the stale restore. Fine as-is; can be unified once old caches expire.

Docker double-build: each CI shard pre-builds the backend image independently. Known tradeoff; Playwright execution time saving still dominates.


No new issues found. This is clean and ready to merge.

The shard value (1/2) contains a slash which produces awkward artifact
paths. Use job-index (0-based integer) like the e2e-test job does.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@malinskibeniamin malinskibeniamin enabled auto-merge (squash) April 14, 2026 11:40
GitHub auto-evicts caches after 7 days of no access and has a 10 GB
per-repo limit. Console's 8.4 GB cache is not large enough to warrant
a custom cleanup workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@malinskibeniamin malinskibeniamin added the ai-jam AI Jam April 2026 related work label Apr 14, 2026
@malinskibeniamin malinskibeniamin merged commit d17334f into master Apr 14, 2026
6 checks passed
@malinskibeniamin malinskibeniamin deleted the ci/blacksmith-audit-optimization branch April 14, 2026 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-jam AI Jam April 2026 related work enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants