FE-813: Cook evaluator runs the verification tests instead of an LLM verdict by kostandinang · Pull Request #170 · hashintel/brunch

kostandinang · 2026-06-04T16:01:21Z

What

Makes brunch cook's per-slice done trustworthy. The evaluator was an LLM judging criterion prose with read,write,edit,bash — it could fix code mid-evaluation and rubber-stamp done (collapsing the write-tests → write-code → evaluate loop) and pass orphan output.

evaluate-done now runs the verification targets (BunTestRunner) and gates done on real pass/fail — requires ≥1 target and every target passing. It no longer calls pi.
Per-action tool scoping (toolsForAction): the evaluator can't mutate the sandbox; code-producing actions keep write tools.
Dropped the now-dead evaluator.md prompt; hardened the writer prompts.

Limit

done now requires real test execution, not integration. Targets are agent-authored against the emitter's synthesized paths, so a weak/isolated target can still pass — guaranteeing a wired feature needs the emitter to emit integration-demanding targets (the FE-800 integration-blind-verification follow-on).

Verify

pi-actions.test.ts + test-runner.test.ts + npm run check pass. Two full-suite failures (build-boundary.test.ts, app.test.ts) are a fe-800-base failure and a parallel-run flake — not this diff (see #167).

🤖 Generated with Claude Code

kostandinang · 2026-06-04T16:01:56Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

cursor · 2026-06-04T16:27:23Z

PR Summary

Medium Risk
Changes core cook completion semantics and test execution paths used by every orchestrator run; mitigated by new unit/integration tests and unchanged engine contract fakes, but real bun subprocess behavior can still vary by environment.

Overview
Cook harness fidelity (FE-813): Per-slice done is no longer an LLM judgment. evaluate-done runs every slice verification target via bun test (evaluateVerificationTargets: ≥1 target, all must pass) and drops the evaluator.md / pi path entirely.

Tool scoping: toolsForAction limits evaluate-done to read-only pi tools so evaluation cannot mutate the sandbox; write-capable actions keep full tools.

Net + runner: run-tests handlers now execute all slice verification targets (not just the first). BunTestRunner uses spawnSync with argv (no shell) and merges stdout+stderr so failure diagnostics survive in reports.

Prompts & spec: test-writer / code-writer prompts stress tests as the oracle; D161-K and I126-K codify read-only, execution-derived completion. PLAN marks the brownfield TDD-collapse follow-on resolved.

Tests: New pi-actions.test.ts and test-runner.test.ts; client build-boundary timeout bumped to 120s for parallel vitest flake.

^{Reviewed by Cursor Bugbot for commit 162fc50. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 85753f7. Configure here.}

lunelson

This closes a real hole. An evaluator that held write tools and judged criterion prose could fix code mid-evaluation and rubber-stamp done, collapsing the whole test → code → evaluate loop; gating done on actual bun test pass/fail instead is a solid correctness win. Nice touches: evaluateVerificationTargets being pure with an injected runner, requiring at least one target, and the test that pins shell-metacharacter path safety.

A few things above the diff:

The gate is only as strong as the targets. Trust moves from "an LLM judges prose" to "agent-authored tests pass" — better, but still self-graded, since the same agent writes the code and its tests, and a weak or isolated target (a component tested in a vacuum) can still go green. You call this out as the integration-demanding-targets follow-on; is that tracked somewhere durable? Without it, "trustworthy done" reads as fully delivered when it's really the first of two halves.
Tool scoping fails open. toolsForAction hands back full write tools for everything except the hard-coded evaluate-done. For a boundary whose entire purpose is restriction, that's allow-by-default — a new action is write-capable unless someone remembers to lock it down. Would deny-by-default (explicit per-action grants) be a more durable shape here?
Self-reported non-green verify. The PR lands with two full-suite failures attributed to base + flake, and it also bumps the timeout in build-boundary.test.ts (the flaky one). CI is green, so this is just a confidence check: is build-boundary actually stable on this branch now, and is the app.test.ts / base failure tracked against #167 so it doesn't quietly become "expected red"?

Approving — none of these block.

The base branch was changed.

`evaluate-done` ran with `read,write,edit,bash`, so the evaluator could fix code during evaluation and report `done` on the first call — collapsing the write-tests → write-code → evaluate TDD loop (2026-05-26 brownfield smoke). Add per-action tool scoping (`toolsForAction`): the evaluator is read-only; code-producing actions keep the full toolset. Threads `tools` through `runPi`. Opens the `cook-harness-fidelity` frontier (Slice 1) and records it in PLAN. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

`evaluate-done` returned an LLM verdict over criterion prose — a soft oracle a standalone component or Ladle story could satisfy without the feature working. It now executes the slice's verification targets (`bun test`) and gates `done` on real pass/fail: `evaluateVerificationTargets` requires ≥1 target and every target passing. The evaluator no longer dispatches pi, so it cannot mutate the sandbox at all. Removes the now-dead `extractJson` helper. Completes the cook-harness-fidelity frontier (Slice 2). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…stics `bun test` writes its results (failure detail, pass/fail counts) to stderr and only the version banner to stdout. The execSync failure path returned `err.stdout`, so every failing tests-run / epic-verified report carried just "bun test v1.3.14" with zero diagnostics (observed on cook run 289c9843). Switch to spawnSync and concatenate both streams; the arg-array form also drops shell interpolation of the target path. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Slice 2 made the evaluator a deterministic oracle (it executes the verification targets; `done` = all pass), which makes the test-writer the load-bearing oracle: a test that passes without exercising the slice's behavior now marks broken code DONE. Port ln-build discipline into the two live writer prompts to close that gap: - test-writer: orient first (read + match the repo's conventions); one observable behavior per test mapped to an acceptance criterion; test through the public interface; make the red meaningful; ban trivially-passing tests (a false oracle under the deterministic gate). - code-writer: orient + inside-out build; no speculative abstraction; never weaken a test to go green; pre-release deletion posture. - code-writer: drop the hardcoded "TypeScript" — derive the language from the repo (the harness is meant to be host-agnostic; the remaining `bun test` coupling in test-writer is the executor's, to be removed with the ProjectProfile/toolchain adapter). Delete evaluator.md: dead since Slice 2 stopped dispatching pi for evaluation (verified zero references repo-wide). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…ter-prompt slice The "pi-actions evaluate-done collapses the TDD workflow" follow-on (under cook-codebase-mode) was closed by FE-813 Slices 1+2 but still listed as open. Mark it resolved with commit pointers, and extend the FE-813 status with Slice 3 (writer-prompt hardening, 9fb5af1) and the still-unowned bun→host test-runner decoupling. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

ln-sync of the cook-harness-fidelity work: records the read-only, execution-derived completion oracle as Active Decision D161-K and Critical Invariant I126-K (protected by pi-actions / engine-contract / brownfield-smoke tests), and points the PLAN frontier status at them. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Two real vite builds (~50s) run inside the default-parallel vitest pool; 120s gives headroom so parallel contention doesn't surface as a timeout flake. Co-authored-by: Claude <noreply@anthropic.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

kostandinang mentioned this pull request Jun 4, 2026

FE-812: Completed-spec fixture for the plan emitter #168

Merged

kostandinang changed the title ~~FE-813: Scope the cook evaluator to read-only tools~~ FE-813: Cook harness fidelity — trustworthy per-slice completion signal Jun 4, 2026

kostandinang marked this pull request as ready for review June 4, 2026 16:27

cursor Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread src/orchestrator/src/pi-actions.ts

Comment thread src/orchestrator/src/pi-actions.ts Outdated

kostandinang changed the title ~~FE-813: Cook harness fidelity — trustworthy per-slice completion signal~~ FE-813: Cook evaluator runs the verification tests instead of an LLM verdict Jun 4, 2026

kostandinang self-assigned this Jun 4, 2026

kostandinang force-pushed the ka/fe-813-cook-harness-fidelity branch 2 times, most recently from 1cbed39 to a537038 Compare June 5, 2026 13:14

This was referenced Jun 5, 2026

FE-815: Orchestrator demo fixtures: parallel, diamond, halt #174

Merged

FE-819: Orchestrator <> Petrinaut integration improvements #176

Open

lunelson previously approved these changes Jun 8, 2026

View reviewed changes

kostandinang force-pushed the ka/fe-800-spec-to-cook-plan branch 3 times, most recently from 777428f to 260b8cb Compare June 8, 2026 15:04

Base automatically changed from ka/fe-800-spec-to-cook-plan to main June 8, 2026 15:45

kostandinang and others added 8 commits June 8, 2026 18:07

FE-813: Raise build-boundary minify test timeout to 120s

b342c06

Two real vite builds (~50s) run inside the default-parallel vitest pool; 120s gives headroom so parallel contention doesn't surface as a timeout flake. Co-authored-by: Claude <noreply@anthropic.com>

FE-813: Run all verification targets safely

162fc50

Co-authored-by: Cursor <cursoragent@cursor.com>

kostandinang force-pushed the ka/fe-813-cook-harness-fidelity branch from a537038 to 162fc50 Compare June 8, 2026 16:10

kostandinang requested a review from lunelson June 8, 2026 16:10

lunelson approved these changes Jun 8, 2026

View reviewed changes

kostandinang added this pull request to the merge queue Jun 8, 2026

Merged via the queue into main with commit 2b92992 Jun 8, 2026
6 checks passed

kostandinang deleted the ka/fe-813-cook-harness-fidelity branch June 8, 2026 17:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FE-813: Cook evaluator runs the verification tests instead of an LLM verdict#170

FE-813: Cook evaluator runs the verification tests instead of an LLM verdict#170
kostandinang merged 8 commits into
mainfrom
ka/fe-813-cook-harness-fidelity

kostandinang commented Jun 4, 2026 •

edited

Loading

Uh oh!

kostandinang commented Jun 4, 2026 •

edited

Loading

Uh oh!

cursor Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

lunelson left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kostandinang commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Limit

Verify

Uh oh!

kostandinang commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

lunelson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kostandinang commented Jun 4, 2026 •

edited

Loading

kostandinang commented Jun 4, 2026 •

edited

Loading

cursor Bot commented Jun 4, 2026 •

edited

Loading