hashintel · kostandinang · Jun 8, 2026 · Jun 4, 2026 · Jun 4, 2026 · Jun 4, 2026
diff --git a/memory/PLAN.md b/memory/PLAN.md
@@ -29,6 +29,7 @@ The May 2026 intent-spec, multi-chat, changeset-ledger, prompt/context, and agen
 2. `chat-runtime-secondary-chats` — FE-716; V1 done — PR #141 merged to main.
 3. **Petrinaut integration sub-track** — umbrella **FE-760** (Orchestrator ⇄ Petrinaut). FE-761 (semantics), FE-762 (`net.json` + SDCPN export), FE-763 (event stream), and FE-784 (colour fold) have **landed**. **`petri-sync-server` (FE-764)** is the active piece, reshaped (2026-06-01 meeting) into an **ephemeral cook-hosted SSE live stream** for the Bristol demo — no-colour, replay-on-connect, brunch-initiated session, supersedes the dropped static-bundle idea. Replaces the POC interpreter's visualization role with Petrinaut as canonical surface.
 4. `spec-to-cook-plan` — **FE-800**; **done — branch-complete off FE-764**, PR #167 pending re-description. Six slices landed: 1 (deterministic projection) + 2 (LLM planning pass) + 3 (deterministic reconciliation) + 4 (CLI wiring) + 5 (warning-model hardening) + 6 (read from spec id — `brunch plan <specId>`, server-side snapshot builder `buildCompletedSpecSnapshot` over `getEntitiesForSpecificationOnActivePath`, plan driver moved into `src/server/plan-runner.ts`, orchestrator `plan-cli.ts` deleted). Bristol-demo front half (`brunch plan <specId>` → `.brunch/cook/plan.yaml` → `brunch cook --petrinaut-stream`) is now operational against any completed spec in the project DB. Two proving spikes done 2026-06-03. Move to **Recently Completed** on PR merge.
+5. `cook-harness-fidelity` — make the cook execution harness's per-slice "done" signal trustworthy: the evaluator must *observe, not mutate*, and "done" must come from running verification targets, not an LLM verdict. Opening slice (evaluator read-only) is the documented `cook-codebase-mode` TDD-collapse follow-on; complements `spec-to-cook-plan`'s integration-blind-verification follow-on.
 
 ### Recently Completed
 
@@ -39,7 +40,7 @@ The May 2026 intent-spec, multi-chat, changeset-ledger, prompt/context, and agen
 
 #### Follow-ons surfaced by the 2026-05-26 cook-codebase-mode smoke
 
-- **pi-actions evaluate-done collapses the TDD workflow** — `pi-actions.ts:70` passes `--tools read,write,edit,bash` to every action including `evaluate-done`. Real pi fixed the buggy file *during evaluation* and reported `done: true` on the first call; write-tests / write-code / run-tests never executed. Affects both modes but is more visible in brownfield. Either restrict evaluator tools to `read` or accept this as the intended pi-as-agent behavior. Worth its own frontier.
+- ~~**pi-actions evaluate-done collapses the TDD workflow**~~ — **resolved by `cook-harness-fidelity` (FE-813)**: Slice 1 (`d2139d8c`) scoped the evaluator to read-only tools so it cannot fix code during evaluation; Slice 2 (`fcba8ab3`) replaced the LLM verdict with executing the verification targets.
 - **cook output promotion (follow-on)** — slice 3 creates real slice branches (`cook-slice/<runId>/<sliceId>`) but never commits; `cook/<runId>` HEAD === source HEAD with modifications in untracked subdirs, so there is no promotion path into the user's checkout. To close: commit slice work, `git merge` slice→epic→`cook/<runId>`, then `git merge cook/<runId>` from the working branch. Pairs with worktree/branch GC. Quality-of-life; the run worktree is already inspectable by hand.
 
 ### Next
@@ -170,6 +171,21 @@ The May 2026 intent-spec, multi-chat, changeset-ledger, prompt/context, and agen
 - **Traceability:** SPEC §D50 (reserved codebase-mode resolver); §A49 (worktree isolation at `<cwd>/.brunch/cook/runs/<runId>/worktree/`); Requirement 49.
 - **Design docs:** SPEC §D50 + §A49; `docs/next/architecture/plan-graph-petri-orchestration.md` (worktree section).
 
+### cook-harness-fidelity
+
+- **Name:** Cook harness fidelity — a trustworthy per-slice completion signal
+- **Linear:** FE-813
+- **Kind:** structural
+- **Status:** branch-complete on `ka/fe-813-cook-harness-fidelity` (PR #170) — Slice 1 (evaluator read-only via per-action `toolsForAction`) + Slice 2 (`evaluate-done` gates `done` on executing the verification targets, replacing the LLM verdict; `evaluateVerificationTargets` requires ≥1 target and all-pass) landed + unit-tested 2026-06-04. Slice 3 (`9fb5af12`) hardens the writer prompts now that the test *is* the oracle: ports ln-build discipline into `test-writer`/`code-writer` (orient + match conventions, behavioral tests through the public interface, ban trivially-passing tests, no speculative abstraction), de-hardcodes `code-writer` from TypeScript, and deletes the now-dead `evaluator.md`. The "evaluator observes, never produces; completion reflects real test execution" invariant is now promoted into SPEC as **D161-K + I126-K** (ln-sync 2026-06-04; rides with PR #170's merge). Remaining sibling: the bun→host test-runner decoupling (ProjectProfile/toolchain adapter) is still unowned — `test-writer` stays bun-bound until it lands.
+- **Objective:** Make the cook execution harness's per-slice "done" signal trustworthy. The evaluator must **observe, not produce**: (a) `evaluate-done` runs `pi` with **read-only** tools so it cannot fix code during evaluation (today `pi-actions.ts` hands every action `read,write,edit,bash`); (b) "done" is decided by **executing** the slice's verification targets — mirroring `verify-epic`'s `execAsync('bun test …')` gate on real pass/fail — instead of an LLM verdict over prose. Establishes the invariant: *the evaluator never mutates the sandbox; completion reflects real test execution.*
+- **Why now / unlocks:** The 2026-05-26 brownfield smoke caught `evaluate-done` fixing the file during evaluation and reporting `done: true` on the first call, so write-tests/write-code/run-tests never executed. Separately, "done" is a soft LLM judgment with no requisite variety: it let orphan code pass (2026-06-04). The harness's success signal is therefore untrustworthy across **every** run, and no downstream oracle work (integration oracle, simulation oracle) can be trusted until completion means something. Highest-leverage harness fix.
+- **Build order (slices — keep in CARDS/session, do not fragment):** (1) evaluator read-only — per-action tool scoping, `evaluate-done` → `read` [bugfix; the documented `cook-codebase-mode` follow-on]; (2) `evaluate-done` executes the slice's verification targets and gates `done` on real results, not an LLM verdict; (3) harden the `test-writer`/`code-writer` prompts now that the tests are the oracle, and delete the now-dead `evaluator.md`.
+- **Acceptance:** (1) per-action tool scoping; `toolsForAction('evaluate-done') === 'read'`; write-tests/write-code/verify-epic keep write-capable tools. (2) `evaluate-done` reports `done` from executed verification targets, not LLM judgment. (3) brownfield smoke: the TDD loop runs end-to-end (the evaluator no longer short-circuits). (4) engine contract suite green on both engines.
+- **Verification:** unit test on a pure `toolsForAction` map; adapter/contract test that `evaluate-done` gates on executed-target results; outer-loop brownfield smoke replaying the 2026-05-26 regression.
+- **Depends on:** `orchestrator-poc` (done), `cook-codebase-mode` (done). Complements `spec-to-cook-plan`'s integration-blind-verification follow-on (the emitter emits integration-demanding targets; this frontier makes the harness actually *run* them); upstream of any future integration oracle.
+- **Lexicon:** `evaluator` = read-only observer of verification results, distinct from the test-runner / code-writer; ties to `ln-oracles` "requisite variety."
+- **Design docs:** `docs/design/orchestrator.md`; SPEC §Verification Design.
+
 ### petri-petrinaut-semantics
 
 - **Name:** Petri-net semantic alignment for Petrinaut visualization

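The per-action tool scoping described in the `cook-harness-fidelity` entry could look roughly like the sketch below. The action names, the `read,write,edit,bash` tool string, the `--tools` flag, and the `toolsForAction('evaluate-done') === 'read'` acceptance check come from the plan text. The type names, the map layout, and the `toolsArgs` helper are assumptions made for illustration, not the code in `pi-actions.ts`.

```typescript
// Hypothetical sketch of per-action tool scoping; not the repo's actual code.
// Only the actions named in the plan's acceptance criteria are listed here.
type CookAction = 'write-tests' | 'write-code' | 'verify-epic' | 'evaluate-done';

const WRITE_CAPABLE_TOOLS = 'read,write,edit,bash';
const READ_ONLY_TOOLS = 'read';

// A pure lookup table: easy to unit-test, no I/O involved.
const TOOLS_BY_ACTION: Record<CookAction, string> = {
  'write-tests': WRITE_CAPABLE_TOOLS,
  'write-code': WRITE_CAPABLE_TOOLS,
  'verify-epic': WRITE_CAPABLE_TOOLS,
  // The evaluator observes only, so it cannot fix code during evaluation.
  'evaluate-done': READ_ONLY_TOOLS,
};

function toolsForAction(action: CookAction): string {
  return TOOLS_BY_ACTION[action];
}

// Hypothetical helper: builds the `--tools` argument pair for one invocation.
function toolsArgs(action: CookAction): string[] {
  return ['--tools', toolsForAction(action)];
}

console.log(toolsArgs('evaluate-done').join(' ')); // --tools read
console.log(toolsArgs('write-code').join(' '));    // --tools read,write,edit,bash
```

Keeping the mapping as plain data is what makes the "unit test on a pure `toolsForAction` map" in the Verification bullet cheap to write.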
diff --git a/memory/SPEC.md b/memory/SPEC.md
@@ -208,6 +208,7 @@ Brunch operates inside a **workspace**: the cwd-backed software context whose lo
 158. **Plan model is two-level (epics → slices), no milestones in POC** — schema is provisional pending canonical brunch plan emission. Forward-compatible for intent/design/oracle pointers.
 159. **Worktree isolation per run** — agents write freely inside `<cwd>/.brunch/cook/runs/<runId>/worktree/` (cwd-scoped, not fixture-scoped); fixture dir and source repo untouched. Fixtures stay byte-identical before and after a run. Depends on: Requirement 49.
 160. **Spec→cook-plan emission is a CLI/orchestrator-track seam, not a V1 product UI surface** — projecting and planning a cook `plan.yaml` from a completed intent graph is dev-layer orchestrator capability extending Requirements 46–50, so it does not breach the V1 product non-goal "Brunch elicits specs and stops at the handoff/export boundary," which governs interactive product UX. The emitter is three-stage: projection (deterministic graph read of requirements + verifies edges) + planning pass (LLM-inferred execution-order DAG, epic grouping, non-buildable detection) + reconciliation (deterministic validation: no dangling/cyclic deps, cook-valid schema, synthesized verification targets). Generated plans are reviewable artifacts, not silent inputs. Depends on: Requirements 46–50; A97.
+161. **The cook evaluator is a read-only oracle; per-slice completion is execution-derived, not model-judged** — `evaluate-done` runs `pi` with read-only tools so it cannot mutate the sandbox during evaluation, and per-slice `done` is decided by executing the slice's verification targets (≥1 target, all pass) — mirroring `verify-epic`'s real test gate — rather than by an LLM verdict over prose. This makes the harness's success signal trustworthy enough that downstream oracle work (integration oracle, simulation oracle) can build on it. Depends on: Requirements 46–50.
 
 #### Provider, prompt/context, and agent substrate
 
@@ -260,6 +261,7 @@ Each invariant is a formalization candidate: the property is stated in human lan
 | I123-K | Worktree isolation holds — fixture directory and source repo are never mutated by an orchestrator run; worktree is cwd-scoped at `<cwd>/.brunch/cook/runs/<runId>/worktree/`. Codebase mode preserves the source repo's HEAD and tracked-file state byte-identically.                                                                                                                                                      | worktree.test.ts, brownfield-smoke.integration.test.ts                                        | Requirement 49; D159-K                           |
 | I124-K | Epic verification runs against a freshly-rebuilt `<parentSandboxDir>/__epic__/<epicId>/` dir holding the deterministic merge of its completed slices' worktrees (later slices in plan declaration order overwrite earlier ones on path collisions; collisions are reported via the `epic-sandbox-merged` event). Per-slice worktrees are not mutated by the merge. | epic-sandbox-merge.test.ts, engine-contract.test.ts                                            | Requirement 49; D159-K                           |
 | I125-K | Topology output-place candidates are fully declared in `HandlerDescriptor` via typed `Guard` predicates; `wireHandlers` introduces no new output places at fire time. Pure consumers can enumerate the reachable output-place set per transition from topology data alone via `enumerateCandidateOutputs(transition)`. Halt paths (budget exhaustion, verify-epic failure) and token transforms (reportId attach, retry/rework count propagation) remain runtime concerns and are explicitly not covered by this invariant. | topology.test.ts, engine-contract.test.ts                                                     | Requirements 46, 47, 48; D155-K (FE-747)         |
+| I126-K | The cook evaluator observes, never produces: `evaluate-done` runs with read-only tools (`toolsForAction('evaluate-done') === 'read'`) so it cannot mutate the sandbox during evaluation, and per-slice `done` reflects real execution of the slice's verification targets — ≥1 target and every target passing via `evaluateVerificationTargets` — rather than an LLM verdict. | pi-actions.test.ts, engine-contract.test.ts, brownfield-smoke.integration.test.ts             | Requirements 46–50; D161-K (FE-813)              |
 
 ## Future Direction Register
 

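The completion rule in D161 and I126-K (at least one verification target, and every target passing) can be sketched as a pure function over already-executed results. The function name `evaluateVerificationTargets` appears in the spec text. Its signature, the `TargetResult` and `DoneVerdict` shapes, and the reason strings are assumptions made for illustration.

```typescript
// Hypothetical shapes; the real harness may represent results differently.
interface TargetResult {
  target: string;  // path that was handed to the test runner
  passed: boolean; // true when the runner reported success for this target
}

interface DoneVerdict {
  done: boolean;
  reason: string;
}

// "done" is derived from executed results only; no model judgment is consulted.
function evaluateVerificationTargets(results: TargetResult[]): DoneVerdict {
  // Zero targets must not count as vacuous success.
  if (results.length === 0) {
    return { done: false, reason: 'no verification targets executed' };
  }
  const failed = results.filter((r) => !r.passed).map((r) => r.target);
  if (failed.length > 0) {
    return { done: false, reason: `failing targets: ${failed.join(', ')}` };
  }
  return { done: true, reason: `all ${results.length} target(s) passed` };
}

console.log(evaluateVerificationTargets([]).done); // false
console.log(
  evaluateVerificationTargets([
    { target: 'a.test.ts', passed: true },
    { target: 'b.test.ts', passed: false },
  ]).reason,
); // failing targets: b.test.ts
```

The explicit empty-list branch matters because a slice with no targets would otherwise pass an "every target passed" check trivially.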
diff --git a/src/client/__tests__/build-boundary.test.ts b/src/client/__tests__/build-boundary.test.ts
@@ -138,5 +138,7 @@ describe('client build boundary', () => {
 
     const minifiedBuild = await buildClient({ minify: true });
     expect(statSync(minifiedBuild.entryPath).size).toBeLessThan(1_050_000);
-  }, 60_000);
+    // Two real `vite build`s (~50s) run inside the default-parallel vitest pool;
+    // 120s gives headroom so parallel contention doesn't surface as a timeout flake.
+  }, 120_000);
 });
diff --git a/src/orchestrator/prompts/code-writer.md b/src/orchestrator/prompts/code-writer.md
@@ -1,10 +1,24 @@
-You are a code-writing agent. Your job is to write the minimum implementation to make existing tests pass.
+You are the code-writing agent. Write the minimum coherent implementation that makes the slice's existing tests pass.
 
-## Rules
+The tests are the contract and the oracle. Implement exactly what they require — no less, no more — and never weaken a test to go green.
 
-- Read the existing test files first to understand what's expected.
-- Write the minimum code to make ALL tests pass.
-- Use TypeScript with Bun conventions.
-- Do NOT modify test files.
-- Do NOT add features beyond what the tests require.
-- Create any necessary directory structure and configuration files.
+## Orient first
+
+- Read the existing test files first — they define what must exist.
+- Read the surrounding code before writing: existing modules, shared types, neighbouring patterns. Match the conventions you find — import paths, naming, structure, error handling. Implement *into* the codebase, not beside it.
+
+## Discipline
+
+- **Minimum coherent code to pass all tests.** Build inside-out: functional core first, thin I/O shell second, end-to-end wiring last.
+- **No speculative abstraction.** Extract a helper only when two concrete cases force it. Do not anticipate tests that don't exist or scaffold shape for imagined future work.
+- Do not add behavior beyond what the tests require.
+- **Do not modify the test files.** If a test looks wrong, leave it and say so in your output — do not weaken the oracle to make it pass.
+
+## Pre-release posture
+
+- If existing schema, fixtures, dummy data, or terminology is wrong for what the slice requires, change it and update its dependents rather than preserving accidental compatibility. Delete obsolete paths inside the seam you are touching.
+
+## Constraints
+
+- Write in the repo's language, derived from the surrounding code — do not assume one. Match its conventions, idioms, and toolchain.
+- Create any directory structure or configuration the implementation needs.
diff --git a/src/orchestrator/prompts/evaluator.md b/src/orchestrator/prompts/evaluator.md
diff --git a/src/orchestrator/prompts/test-writer.md b/src/orchestrator/prompts/test-writer.md
@@ -1,11 +1,23 @@
-You are a test-writing agent. Your job is to write failing tests for a given slice specification.
+You are the test-writing agent. You write the failing tests that DEFINE "done" for one slice.
 
-## Rules
+The evaluator decides completion solely by executing your tests — there is no second judge. A test that passes without exercising the slice's behavior will mark broken code as DONE. **Your tests are the oracle.** Write them to fail for the right reason now, and to pass only once the behavior actually exists.
 
-- Write tests that will initially FAIL because the implementation doesn't exist yet.
-- Use `bun test` conventions (import { describe, expect, it } from "bun:test").
-- Each test should verify one observable behavior from the slice definition.
-- Write tests to the file paths specified in the verification targets.
-- Keep tests simple and focused — test behavior, not implementation.
-- Create any necessary directory structure.
-- Do NOT write implementation code — only tests.
+## Orient first
+
+- Read the slice definition and its verification targets.
+- Read the surrounding code before writing: the modules under test, neighbouring test files, and shared types. Match the conventions you find — import paths, naming, file layout, assertion style. Do not invent a style the repo doesn't use.
+- Write tests to the exact file paths named in the verification targets.
+
+## Discipline
+
+- **One observable behavior per test**, named for the capability it proves. Each test should trace to an acceptance criterion in the slice definition. If a criterion has no test, the slice is unverified.
+- **Test through the public interface, not the implementation.** A good test survives an internal refactor. Do not mock internal collaborators, assert private call order, or inspect storage directly when the public surface can prove the behavior.
+- **Make the red meaningful.** Each test must fail because the behavior is *absent* — not because of a typo, a missing import, or trivial wiring. A test that cannot fail proves nothing.
+- **No trivially-passing tests.** `expect(true).toBe(true)`, asserting a literal, or testing a stub you also wrote is a false oracle — the deterministic evaluator will report DONE over nothing.
+- Cover the boundaries the behavior implies (empty, error, edge cases), not just the happy path.
+
+## Constraints
+
+- Use `bun test` conventions: `import { describe, expect, it } from "bun:test"`. (The harness executes `bun test` against the target paths; match the repo's conventions for everything else — imports, structure, style.)
+- Write tests only — no implementation code.
+- Create any directory structure the target paths require.
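The "make the red meaningful" and "no trivially-passing tests" rules in the `test-writer` prompt can be illustrated without a test runner. In the sketch below, `normalizeTag` is a made-up slice behavior, and each check is a plain function returning whether it would go green. Everything here is hypothetical and stands in for real `bun test` cases.

```typescript
// Hypothetical slice behavior: trim a tag and lowercase it.
type NormalizeTag = (raw: string) => string;

// Stand-in for "behavior absent": the slice is not implemented yet.
const unimplemented: NormalizeTag = (raw) => raw;
// Stand-in for the finished slice.
const implemented: NormalizeTag = (raw) => raw.trim().toLowerCase();

// A check returns true when the corresponding test would pass.
type Check = (fn: NormalizeTag) => boolean;

// False oracle: never calls the function, so it is green over nothing.
const trivialCheck: Check = () => true;

// Behavioral check: goes through the public function and asserts the observable result.
const behavioralCheck: Check = (fn) => fn('  Petri-Net ') === 'petri-net';

console.log(trivialCheck(unimplemented));    // true  (would report DONE with no behavior)
console.log(behavioralCheck(unimplemented)); // false (red for the right reason)
console.log(behavioralCheck(implemented));   // true  (green only once the behavior exists)
```

Because the evaluator gates on executed tests alone, only the second kind of check gives the harness a signal worth trusting.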
diff --git a/src/orchestrator/src/net-blueprint.ts b/src/orchestrator/src/net-blueprint.ts
@@ -142,7 +142,7 @@ type RunTestsDescriptor = {
   kind: 'run-tests';
   sliceId: string;
   epicId: string;
-  target: string;
+  targets: string[];
   /** Single intermediate output place; siblings route from here. */
   intermediatePlace: string;
   /** Place to emit the (decremented or reset) retry-budget token to. */