diff --git a/memory/PLAN.md b/memory/PLAN.md index 5f00ef88..39a10a9d 100644 --- a/memory/PLAN.md +++ b/memory/PLAN.md @@ -29,6 +29,7 @@ The May 2026 intent-spec, multi-chat, changeset-ledger, prompt/context, and agen 2. `chat-runtime-secondary-chats` — FE-716; V1 done — PR #141 merged to main. 3. **Petrinaut integration sub-track** — umbrella **FE-760** (Orchestrator ⇄ Petrinaut). FE-761 (semantics), FE-762 (`net.json` + SDCPN export), FE-763 (event stream), and FE-784 (colour fold) have **landed**. **`petri-sync-server` (FE-764)** is the active piece, reshaped (2026-06-01 meeting) into an **ephemeral cook-hosted SSE live stream** for the Bristol demo — no-colour, replay-on-connect, brunch-initiated session, supersedes the dropped static-bundle idea. Replaces the POC interpreter's visualization role with Petrinaut as canonical surface. 4. `spec-to-cook-plan` — **FE-800**; **done — branch-complete off FE-764**, PR #167 pending re-description. Six slices landed: 1 (deterministic projection) + 2 (LLM planning pass) + 3 (deterministic reconciliation) + 4 (CLI wiring) + 5 (warning-model hardening) + 6 (read from spec id — `brunch plan `, server-side snapshot builder `buildCompletedSpecSnapshot` over `getEntitiesForSpecificationOnActivePath`, plan driver moved into `src/server/plan-runner.ts`, orchestrator `plan-cli.ts` deleted). Bristol-demo front half (`brunch plan ` → `.brunch/cook/plan.yaml` → `brunch cook --petrinaut-stream`) is now operational against any completed spec in the project DB. Two proving spikes done 2026-06-03. Move to **Recently Completed** on PR merge. +5. `cook-harness-fidelity` — make the cook execution harness's per-slice "done" signal trustworthy: the evaluator must *observe, not mutate*, and "done" must come from running verification targets, not an LLM verdict. Opening slice (evaluator read-only) is the documented `cook-codebase-mode` TDD-collapse follow-on; complements `spec-to-cook-plan`'s integration-blind-verification follow-on. ### Recently Completed @@ -39,7 +40,7 @@ The May 2026 intent-spec, multi-chat, changeset-ledger, prompt/context, and agen #### Follow-ons surfaced by the 2026-05-26 cook-codebase-mode smoke -- **pi-actions evaluate-done collapses the TDD workflow** — `pi-actions.ts:70` passes `--tools read,write,edit,bash` to every action including `evaluate-done`. Real pi fixed the buggy file *during evaluation* and reported `done: true` on the first call; write-tests / write-code / run-tests never executed. Affects both modes but is more visible in brownfield. Either restrict evaluator tools to `read` or accept this as the intended pi-as-agent behavior. Worth its own frontier. +- ~~**pi-actions evaluate-done collapses the TDD workflow**~~ — **resolved by `cook-harness-fidelity` (FE-813)**: Slice 1 (`d2139d8c`) scoped the evaluator to read-only tools so it cannot fix code during evaluation; Slice 2 (`fcba8ab3`) replaced the LLM verdict with executing the verification targets. - **cook output promotion (follow-on)** — slice 3 creates real slice branches (`cook-slice//`) but never commits; `cook/` HEAD === source HEAD with modifications in untracked subdirs, so there is no promotion path into the user's checkout. To close: commit slice work, `git merge` slice→epic→`cook/`, then `git merge cook/` from the working branch. Pairs with worktree/branch GC. Quality-of-life; the run worktree is already inspectable by hand. ### Next @@ -170,6 +171,21 @@ The May 2026 intent-spec, multi-chat, changeset-ledger, prompt/context, and agen - **Traceability:** SPEC §D50 (reserved codebase-mode resolver); §A49 (worktree isolation at `/.brunch/cook/runs//worktree/`); Requirement 49. - **Design docs:** SPEC §D50 + §A49; `docs/next/architecture/plan-graph-petri-orchestration.md` (worktree section). +### cook-harness-fidelity + +- **Name:** Cook harness fidelity — a trustworthy per-slice completion signal +- **Linear:** unassigned (create on start) +- **Kind:** structural +- **Status:** branch-complete on `ka/fe-813-cook-harness-fidelity` (PR #170) — Slice 1 (evaluator read-only via per-action `toolsForAction`) + Slice 2 (`evaluate-done` gates `done` on executing the verification targets, replacing the LLM verdict; `evaluateVerificationTargets` requires ≥1 target and all-pass) landed + unit-tested 2026-06-04. Slice 3 (`9fb5af12`) hardens the writer prompts now that the test *is* the oracle: ports ln-build discipline into `test-writer`/`code-writer` (orient + match conventions, behavioral tests through the public interface, ban trivially-passing tests, no speculative abstraction), de-hardcodes `code-writer` from TypeScript, and deletes the now-dead `evaluator.md`. The "evaluator observes, never produces; completion reflects real test execution" invariant is now promoted into SPEC as **D161-K + I126-K** (ln-sync 2026-06-04; rides with PR #170's merge). Remaining sibling: the bun→host test-runner decoupling (ProjectProfile/toolchain adapter) is still unowned — `test-writer` stays bun-bound until it lands. +- **Objective:** Make the cook execution harness's per-slice "done" signal trustworthy. The evaluator must **observe, not produce**: (a) `evaluate-done` runs `pi` with **read-only** tools so it cannot fix code during evaluation (today `pi-actions.ts` hands every action `read,write,edit,bash`); (b) "done" is decided by **executing** the slice's verification targets — mirroring `verify-epic`'s `execAsync('bun test …')` gate on real pass/fail — instead of an LLM verdict over prose. Establishes the invariant: *the evaluator never mutates the sandbox; completion reflects real test execution.* +- **Why now / unlocks:** The 2026-05-26 brownfield smoke caught `evaluate-done` fixing the file during evaluation and reporting `done:true` on the first call, so write-tests/write-code/run never executed; and "done" is a soft LLM judgment with no requisite variety — it let orphan code pass (2026-06-04). The harness's success signal is untrustworthy across **every** run, so no downstream oracle work (integration oracle, simulation oracle) can be trusted until completion means something. Highest-leverage harness fix. +- **Build order (slices — keep in CARDS/session, do not fragment):** (1) evaluator read-only — per-action tool scoping, `evaluate-done` → `read` [bugfix; the documented `cook-codebase-mode` follow-on]; (2) `evaluate-done` executes the slice's verification targets and gates `done` on real results, not an LLM verdict. +- **Acceptance:** (1) per-action tool scoping; `toolsForAction('evaluate-done') === 'read'`; write-tests/write-code/verify-epic keep write-capable tools. (2) `evaluate-done` reports `done` from executed verification targets, not LLM judgment. (3) brownfield smoke: the TDD loop runs end-to-end (the evaluator no longer short-circuits). (4) engine contract suite green on both engines. +- **Verification:** unit test on a pure `toolsForAction` map; adapter/contract test that `evaluate-done` gates on executed-target results; outer-loop brownfield smoke replaying the 2026-05-26 regression. +- **Depends on:** `orchestrator-poc` (done), `cook-codebase-mode` (done). Complements `spec-to-cook-plan`'s integration-blind-verification follow-on (the emitter emits integration-demanding targets; this frontier makes the harness actually *run* them); upstream of any future integration oracle. +- **Lexicon:** `evaluator` = read-only observer of verification results, distinct from the test-runner / code-writer; ties to `ln-oracles` "requisite variety." +- **Design docs:** `docs/design/orchestrator.md`; SPEC §Verification Design. + ### petri-petrinaut-semantics - **Name:** Petri-net semantic alignment for Petrinaut visualization diff --git a/memory/SPEC.md b/memory/SPEC.md index 3ab82552..0ac549a3 100644 --- a/memory/SPEC.md +++ b/memory/SPEC.md @@ -208,6 +208,7 @@ Brunch operates inside a **workspace**: the cwd-backed software context whose lo 158. **Plan model is two-level (epics → slices), no milestones in POC** — schema is provisional pending canonical brunch plan emission. Forward-compatible for intent/design/oracle pointers. 159. **Worktree isolation per run** — agents write freely inside `/.brunch/cook/runs//worktree/` (cwd-scoped, not fixture-scoped); fixture dir and source repo untouched. Fixtures stay byte-identical before and after a run. Depends on: Requirement 49. 160. **Spec→cook-plan emission is a CLI/orchestrator-track seam, not a V1 product UI surface** — projecting and planning a cook `plan.yaml` from a completed intent graph is dev-layer orchestrator capability extending Requirements 46–50, so it does not breach the V1 product non-goal "Brunch elicits specs and stops at the handoff/export boundary," which governs interactive product UX. The emitter is three-stage: projection (deterministic graph read of requirements + verifies edges) + planning pass (LLM-inferred execution-order DAG, epic grouping, non-buildable detection) + reconciliation (deterministic validation: no dangling/cyclic deps, cook-valid schema, synthesized verification targets). Generated plans are reviewable artifacts, not silent inputs. Depends on: Requirements 46–50; A97. +161. **The cook evaluator is a read-only oracle; per-slice completion is execution-derived, not model-judged** — `evaluate-done` runs `pi` with read-only tools so it cannot mutate the sandbox during evaluation, and per-slice `done` is decided by executing the slice's verification targets (≥1 target, all pass) — mirroring `verify-epic`'s real test gate — rather than by an LLM verdict over prose. This makes the harness's success signal trustworthy enough that downstream oracle work (integration oracle, simulation oracle) can build on it. Depends on: Requirements 46–50. #### Provider, prompt/context, and agent substrate @@ -260,6 +261,7 @@ Each invariant is a formalization candidate: the property is stated in human lan | I123-K | Worktree isolation holds — fixture directory and source repo are never mutated by an orchestrator run; worktree is cwd-scoped at `/.brunch/cook/runs//worktree/`. Codebase mode preserves the source repo's HEAD and tracked-file state byte-identically. | worktree.test.ts, brownfield-smoke.integration.test.ts | Requirement 49; D159-K | | I124-K | Epic verification runs against a freshly-rebuilt `/__epic__//` dir holding the deterministic merge of its completed slices' worktrees (later slices in plan declaration order overwrite earlier ones on path collisions; collisions are reported via the `epic-sandbox-merged` event). Per-slice worktrees are not mutated by the merge. | epic-sandbox-merge.test.ts, engine-contract.test.ts | Requirement 49; D159-K | | I125-K | Topology output-place candidates are fully declared in `HandlerDescriptor` via typed `Guard` predicates; `wireHandlers` introduces no new output places at fire time. Pure consumers can enumerate the reachable output-place set per transition from topology data alone via `enumerateCandidateOutputs(transition)`. Halt paths (budget exhaustion, verify-epic failure) and token transforms (reportId attach, retry/rework count propagation) remain runtime concerns and are explicitly not covered by this invariant. | topology.test.ts, engine-contract.test.ts | Requirements 46, 47, 48; D155-K (FE-747) | +| I126-K | The cook evaluator observes, never produces: `evaluate-done` runs with read-only tools (`toolsForAction('evaluate-done') === 'read'`) so it cannot mutate the sandbox during evaluation, and per-slice `done` reflects real execution of the slice's verification targets — ≥1 target and every target passing via `evaluateVerificationTargets` — rather than an LLM verdict. | pi-actions.test.ts, engine-contract.test.ts, brownfield-smoke.integration.test.ts | Requirements 46–50; D161-K (FE-813) | ## Future Direction Register diff --git a/src/client/__tests__/build-boundary.test.ts b/src/client/__tests__/build-boundary.test.ts index 86ce8ec7..837dbfe5 100644 --- a/src/client/__tests__/build-boundary.test.ts +++ b/src/client/__tests__/build-boundary.test.ts @@ -138,5 +138,7 @@ describe('client build boundary', () => { const minifiedBuild = await buildClient({ minify: true }); expect(statSync(minifiedBuild.entryPath).size).toBeLessThan(1_050_000); - }, 60_000); + // Two real `vite build`s (~50s) run inside the default-parallel vitest pool; + // 120s gives headroom so parallel contention doesn't surface as a timeout flake. + }, 120_000); }); diff --git a/src/orchestrator/prompts/code-writer.md b/src/orchestrator/prompts/code-writer.md index 8d777517..36538216 100644 --- a/src/orchestrator/prompts/code-writer.md +++ b/src/orchestrator/prompts/code-writer.md @@ -1,10 +1,24 @@ -You are a code-writing agent. Your job is to write the minimum implementation to make existing tests pass. +You are the code-writing agent. Write the minimum coherent implementation that makes the slice's existing tests pass. -## Rules +The tests are the contract and the oracle. Implement exactly what they require — no less, no more — and never weaken a test to go green. -- Read the existing test files first to understand what's expected. -- Write the minimum code to make ALL tests pass. -- Use TypeScript with Bun conventions. -- Do NOT modify test files. -- Do NOT add features beyond what the tests require. -- Create any necessary directory structure and configuration files. +## Orient first + +- Read the existing test files first — they define what must exist. +- Read the surrounding code before writing: existing modules, shared types, neighbouring patterns. Match the conventions you find — import paths, naming, structure, error handling. Implement *into* the codebase, not beside it. + +## Discipline + +- **Minimum coherent code to pass all tests.** Build inside-out: functional core first, thin I/O shell second, end-to-end wiring last. +- **No speculative abstraction.** Extract a helper only when two concrete cases force it. Do not anticipate tests that don't exist or scaffold shape for imagined future work. +- Do not add behavior beyond what the tests require. +- **Do not modify the test files.** If a test looks wrong, leave it and say so in your output — do not weaken the oracle to make it pass. + +## Pre-release posture + +- If existing schema, fixtures, dummy data, or terminology is wrong for what the slice requires, change it and update its dependents rather than preserving accidental compatibility. Delete obsolete paths inside the seam you are touching. + +## Constraints + +- Write in the repo's language, derived from the surrounding code — do not assume one. Match its conventions, idioms, and toolchain. +- Create any directory structure or configuration the implementation needs. diff --git a/src/orchestrator/prompts/evaluator.md b/src/orchestrator/prompts/evaluator.md deleted file mode 100644 index 29554391..00000000 --- a/src/orchestrator/prompts/evaluator.md +++ /dev/null @@ -1,11 +0,0 @@ -You are an evaluator agent. Your job is to assess whether a slice specification is fully satisfied by the current code. - -## Rules - -- Read the slice definition and verification targets. -- Check if test files exist and if they cover the specification. -- Run `bun test` on the verification targets to check if tests pass. -- Respond with a JSON object: { "done": true/false, "reasoning": "..." } -- "done": true means ALL verification targets pass and the slice spec is satisfied. -- "done": false means more work is needed. -- Be honest — if tests are missing, failing, or incomplete, say so. diff --git a/src/orchestrator/prompts/test-writer.md b/src/orchestrator/prompts/test-writer.md index 749e20cb..00d8c517 100644 --- a/src/orchestrator/prompts/test-writer.md +++ b/src/orchestrator/prompts/test-writer.md @@ -1,11 +1,23 @@ -You are a test-writing agent. Your job is to write failing tests for a given slice specification. +You are the test-writing agent. You write the failing tests that DEFINE "done" for one slice. -## Rules +The evaluator decides completion solely by executing your tests — there is no second judge. A test that passes without exercising the slice's behavior will mark broken code as DONE. **Your tests are the oracle.** Write them to fail for the right reason now, and to pass only once the behavior actually exists. -- Write tests that will initially FAIL because the implementation doesn't exist yet. -- Use `bun test` conventions (import { describe, expect, it } from "bun:test"). -- Each test should verify one observable behavior from the slice definition. -- Write tests to the file paths specified in the verification targets. -- Keep tests simple and focused — test behavior, not implementation. -- Create any necessary directory structure. -- Do NOT write implementation code — only tests. +## Orient first + +- Read the slice definition and its verification targets. +- Read the surrounding code before writing: the modules under test, neighbouring test files, and shared types. Match the conventions you find — import paths, naming, file layout, assertion style. Do not invent a style the repo doesn't use. +- Write tests to the exact file paths named in the verification targets. + +## Discipline + +- **One observable behavior per test**, named for the capability it proves. Each test should trace to an acceptance criterion in the slice definition. If a criterion has no test, the slice is unverified. +- **Test through the public interface, not the implementation.** A good test survives an internal refactor. Do not mock internal collaborators, assert private call order, or inspect storage directly when the public surface can prove the behavior. +- **Make the red meaningful.** Each test must fail because the behavior is *absent* — not because of a typo, a missing import, or trivial wiring. A test that cannot fail proves nothing. +- **No trivially-passing tests.** `expect(true).toBe(true)`, asserting a literal, or testing a stub you also wrote is a false oracle — the deterministic evaluator will report DONE over nothing. +- Cover the boundaries the behavior implies (empty, error, edge cases), not just the happy path. + +## Constraints + +- Use `bun test` conventions: `import { describe, expect, it } from "bun:test"`. (The harness executes `bun test` against the target paths; match the repo's conventions for everything else — imports, structure, style.) +- Write tests only — no implementation code. +- Create any directory structure the target paths require. diff --git a/src/orchestrator/src/net-blueprint.ts b/src/orchestrator/src/net-blueprint.ts index d33b5560..129698c6 100644 --- a/src/orchestrator/src/net-blueprint.ts +++ b/src/orchestrator/src/net-blueprint.ts @@ -142,7 +142,7 @@ type RunTestsDescriptor = { kind: 'run-tests'; sliceId: string; epicId: string; - target: string; + targets: string[]; /** Single intermediate output place; siblings route from here. */ intermediatePlace: string; /** Place to emit the (decremented or reset) retry-budget token to. */ diff --git a/src/orchestrator/src/net-compiler.ts b/src/orchestrator/src/net-compiler.ts index b8d84b25..403558ea 100644 --- a/src/orchestrator/src/net-compiler.ts +++ b/src/orchestrator/src/net-compiler.ts @@ -310,7 +310,7 @@ export function compileTopology(plan: Plan, policy: RunPolicy): NetBlueprint { kind: 'run-tests', sliceId: sid, epicId: epic.id, - target: slice.verification[0]?.target ?? '', + targets: slice.verification.map((v) => v.target), intermediatePlace: p(sid, 'run-tests:reported'), budgetPlace: p(sid, 'retry-budget'), maxRetries: policy.maxRetries, @@ -689,7 +689,7 @@ export function wireHandlers(blueprint: NetBlueprint, input: OrchestratorInput, } case 'run-tests': { - const { sliceId, epicId, target, intermediatePlace, budgetPlace, maxRetries } = h; + const { sliceId, epicId, targets, intermediatePlace, budgetPlace, maxRetries } = h; const baseToken: Token = { sliceId, epicId }; // FE-761 Slice 3: deferred-completion split. The synchronous part @@ -707,18 +707,23 @@ export function wireHandlers(blueprint: NetBlueprint, input: OrchestratorInput, const sandboxDir = seedSliceSandboxFromDeps(input.sandboxDir, plan, slice, { preserveExisting: true, }); - const result = await testRunner.run(target, sandboxDir); + const results = []; + for (const target of targets) { + results.push({ target, ...(await testRunner.run(target, sandboxDir)) }); + } + const passed = results.length > 0 && results.every((result) => result.passed); + const output = results.map((result) => result.output).join('\n'); const reportId = createReport(reports, { epicId, sliceId, actor: 'test-runner', event: 'tests-run', - payload: { passed: result.passed, output: result.output }, + payload: { passed, output, results }, }); ctx.reportIds.push(reportId); const tok: Token = { ...inputToken, reportId }; - if (result.passed) { + if (passed) { return [ { place: intermediatePlace, token: tok }, { place: budgetPlace, token: { ...baseToken, retryCount: 0 } }, diff --git a/src/orchestrator/src/pi-actions.test.ts b/src/orchestrator/src/pi-actions.test.ts new file mode 100644 index 00000000..5bff1d60 --- /dev/null +++ b/src/orchestrator/src/pi-actions.test.ts @@ -0,0 +1,109 @@ +import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; + +import { describe, expect, it } from 'vitest'; + +import { createPiActions, evaluateVerificationTargets, toolsForAction } from './pi-actions.js'; +import { InMemoryReportSink } from './report-sink.js'; +import type { ActionContext } from './types.js'; + +describe('evaluateVerificationTargets — done reflects real test execution', () => { + it('done only when at least one target exists and every target passes', async () => { + const { done } = await evaluateVerificationTargets([{ target: 'a' }, { target: 'b' }], async () => true); + expect(done).toBe(true); + }); + + it('not done if any target fails, and reports per-target results', async () => { + const { done, results } = await evaluateVerificationTargets( + [{ target: 'a' }, { target: 'b' }], + async (t) => t === 'a', + ); + expect(done).toBe(false); + expect(results).toEqual([ + { target: 'a', passed: true }, + { target: 'b', passed: false }, + ]); + }); + + it('not done when there are no verification targets (nothing proves it)', async () => { + const { done } = await evaluateVerificationTargets([], async () => true); + expect(done).toBe(false); + }); + + it('a throwing runner counts as a failed target', async () => { + const { done } = await evaluateVerificationTargets([{ target: 'x' }], async () => { + throw new Error('runner blew up'); + }); + expect(done).toBe(false); + }); +}); + +describe('pi-actions tool scoping', () => { + it('evaluate-done is read-only — the evaluator cannot mutate the sandbox during evaluation', () => { + const tools = toolsForAction('evaluate-done'); + expect(tools).toContain('read'); + expect(tools).not.toContain('write'); + expect(tools).not.toContain('edit'); + expect(tools).not.toContain('bash'); + }); + + it('code-producing actions keep write-capable tools', () => { + for (const action of ['write-tests', 'write-code', 'verify-epic']) { + const tools = toolsForAction(action); + expect(tools).toContain('read'); + expect(tools).toContain('write'); + expect(tools).toContain('edit'); + expect(tools).toContain('bash'); + } + }); +}); + +describe('createPiActions evaluate-done', () => { + it('runs verification target paths with spaces and shell metacharacters without shell splitting', async () => { + const sandboxDir = mkdtempSync(join(tmpdir(), 'brunch-pi-actions-')); + try { + mkdirSync(join(sandboxDir, 'tests')); + const target = 'tests/path with spaces; false.test.ts'; + writeFileSync( + join(sandboxDir, target), + "import { expect, test } from 'bun:test';\n\ntest('runs', () => expect(1).toBe(1));\n", + ); + const reports = new InMemoryReportSink(); + const ctx: ActionContext = { + sandboxDir, + reports, + plan: { + epics: [{ id: 'epic-1', summary: 'Epic', depends_on: [], verification: [] }], + slices: [ + { + id: 'slice-1', + epic_id: 'epic-1', + definition: 'Run a spaced test path', + depends_on: [], + verification: [{ kind: 'unit-test', target }], + }, + ], + }, + epic: { id: 'epic-1', summary: 'Epic', depends_on: [], verification: [] }, + slice: { + id: 'slice-1', + epic_id: 'epic-1', + definition: 'Run a spaced test path', + depends_on: [], + verification: [{ kind: 'unit-test', target }], + }, + }; + + const reportId = await createPiActions()['evaluate-done']!(ctx); + const report = reports.getById(reportId); + + expect(report?.payload).toMatchObject({ + done: true, + results: [{ target, passed: true }], + }); + } finally { + rmSync(sandboxDir, { recursive: true, force: true }); + } + }); +}); diff --git a/src/orchestrator/src/pi-actions.ts b/src/orchestrator/src/pi-actions.ts index 12862a87..23c698dd 100644 --- a/src/orchestrator/src/pi-actions.ts +++ b/src/orchestrator/src/pi-actions.ts @@ -1,9 +1,6 @@ -import { exec, spawn } from 'node:child_process'; +import { spawn } from 'node:child_process'; import { dirname, join } from 'node:path'; import { fileURLToPath } from 'node:url'; -import { promisify } from 'node:util'; - -const execAsync = promisify(exec); import { createReport } from './report-helpers.js'; import { sliceLabel } from './slice-label.js'; @@ -48,6 +45,17 @@ function logVerbose(output: string): void { const PI_TIMEOUT_MS = 300_000; const PI_MAX_BUFFER = 10 * 1024 * 1024; +// Per-action tool scoping. The evaluator observes, it does not produce: a +// read-only toolset means `evaluate-done` cannot fix code during evaluation and +// short-circuit the write-tests → write-code → evaluate loop. Code-producing +// actions keep the full toolset. +const WRITE_TOOLS = 'read,write,edit,bash'; +const READ_ONLY_TOOLS = 'read'; + +export function toolsForAction(action: string): string { + return action === 'evaluate-done' ? READ_ONLY_TOOLS : WRITE_TOOLS; +} + // Async on purpose: `pi` runs for tens of seconds per call. A synchronous // `spawnSync` would freeze the shared event loop, starving the SSE stream // server of the chance to flush frames while a slice is being worked. @@ -59,6 +67,7 @@ function runPi(opts: { promptFile: string; task: string; sandboxDir: string; + tools: string; }): Promise { const start = Date.now(); @@ -80,7 +89,7 @@ function runPi(opts: { '--append-system-prompt', opts.promptFile, '--tools', - 'read,write,edit,bash', + opts.tools, opts.task, ], { cwd: opts.sandboxDir, stdio: ['ignore', 'pipe', 'pipe'] }, @@ -139,15 +148,60 @@ function runPi(opts: { }); } -/** Try to extract a JSON object from pi's text output. */ -function extractJson(raw: string): Record | undefined { - const match = raw.match(/\{[\s\S]*?\}/); - if (!match) return undefined; - try { - return JSON.parse(match[0]) as Record; - } catch { - return undefined; +/** + * Decide whether a slice is done by executing its verification targets. `done` + * requires at least one target and every target passing — a slice with no + * runnable verification cannot be proven done (no requisite variety). This is + * the real oracle: it replaces the prior LLM verdict over criterion prose, + * which a standalone component or Ladle story could satisfy without the + * feature working. + */ +export async function evaluateVerificationTargets( + targets: readonly { target: string }[], + runTarget: (target: string) => Promise, +): Promise<{ done: boolean; results: Array<{ target: string; passed: boolean }> }> { + const results: Array<{ target: string; passed: boolean }> = []; + for (const t of targets) { + let passed = false; + try { + passed = await runTarget(t.target); + } catch { + passed = false; + } + results.push({ target: t.target, passed }); } + return { done: results.length > 0 && results.every((r) => r.passed), results }; +} + +async function runBunTest(target: string, sandboxDir: string): Promise { + return new Promise((resolve) => { + const child = spawn('bun', ['test', target], { + cwd: sandboxDir, + stdio: ['ignore', 'pipe', 'pipe'], + }); + const stdoutChunks: Buffer[] = []; + const stderrChunks: Buffer[] = []; + let settled = false; + + const finish = (passed: boolean): void => { + if (settled) return; + settled = true; + clearTimeout(timer); + const output = Buffer.concat([...stdoutChunks, ...stderrChunks]).toString('utf8'); + logVerbose(output); + resolve(passed); + }; + + const timer = setTimeout(() => { + child.kill('SIGTERM'); + finish(false); + }, 60_000); + + child.stdout?.on('data', (chunk: Buffer) => stdoutChunks.push(chunk)); + child.stderr?.on('data', (chunk: Buffer) => stderrChunks.push(chunk)); + child.on('error', () => finish(false)); + child.on('close', (code) => finish(code === 0)); + }); } function report(ctx: ActionContext, actor: string, event: string, payload: Record): string { @@ -166,30 +220,14 @@ export function createPiActions(opts?: { verbose?: boolean; runStart?: number }) 'evaluate-done': async (ctx: ActionContext) => { const label = sliceLabel(ctx.slice); log('?', `evaluate ${label}`); - const task = `Evaluate slice "${ctx.slice.id}": ${ctx.slice.definition}\nVerification targets: ${ctx.slice.verification.map((v) => v.target).join(', ')}\nDetermine if all verification targets are satisfied. Respond with a JSON object: { "done": true/false, "reasoning": "..." }`; - - try { - const raw = await runPi({ - label: `evaluate ${label}`, - model: 'claude-haiku-4-5', - promptFile: join(promptsDir, 'evaluator.md'), - task, - sandboxDir: ctx.sandboxDir, - }); - const parsed = extractJson(raw) as { done?: boolean; reasoning?: string } | undefined; - const done = !!parsed?.done; - log(done ? '●' : '○', `verdict ${label} → ${done ? 'DONE' : 'NEEDS WORK'}`); - return report(ctx, 'evaluator', 'eval-done', { - done, - reasoning: parsed?.reasoning ?? raw.slice(0, 200), - }); - } catch (err) { - log('✗', `evaluate ${label} — ${err instanceof Error ? err.message : err}`); - return report(ctx, 'evaluator', 'eval-done', { - done: false, - reasoning: `evaluation failed: ${err instanceof Error ? err.message : String(err)}`, - }); + const { done, results } = await evaluateVerificationTargets(ctx.slice.verification, (target) => + runBunTest(target, ctx.sandboxDir), + ); + for (const r of results) { + log(r.passed ? '✓' : '✗', `verify ${r.target}`); } + log(done ? '●' : '○', `verdict ${label} → ${done ? 'DONE' : 'NEEDS WORK'}`); + return report(ctx, 'evaluator', 'eval-done', { done, results }); }, 'write-tests': async (ctx: ActionContext) => { @@ -203,6 +241,7 @@ export function createPiActions(opts?: { verbose?: boolean; runStart?: number }) promptFile: join(promptsDir, 'test-writer.md'), task, sandboxDir: ctx.sandboxDir, + tools: toolsForAction('write-tests'), }); return report(ctx, 'test-writer', 'tests-written', { @@ -222,6 +261,7 @@ export function createPiActions(opts?: { verbose?: boolean; runStart?: number }) promptFile: join(promptsDir, 'code-writer.md'), task, sandboxDir: ctx.sandboxDir, + tools: toolsForAction('write-code'), }); return report(ctx, 'code-writer', 'code-written', { @@ -248,25 +288,14 @@ export function createPiActions(opts?: { verbose?: boolean; runStart?: number }) promptFile: join(promptsDir, 'test-writer.md'), task: writeTask, sandboxDir: ctx.sandboxDir, + tools: toolsForAction('verify-epic'), }); let allPassed = true; for (const v of ctx.epic.verification) { - try { - const { stdout } = await execAsync(`bun test ${v.target}`, { - cwd: ctx.sandboxDir, - encoding: 'utf8', - timeout: 60_000, - }); - log('✓', `verify ${v.target}`); - logVerbose(stdout); - } catch (err) { - log('✗', `verify ${v.target}`); - if (_verbose && err && typeof err === 'object' && 'stdout' in err) { - logVerbose(String((err as { stdout: unknown }).stdout)); - } - allPassed = false; - } + const passed = await runBunTest(v.target, ctx.sandboxDir); + log(passed ? '✓' : '✗', `verify ${v.target}`); + allPassed &&= passed; } log(allPassed ? '●' : '✗', `epic ${ctx.epic.id} → ${allPassed ? 'PASS' : 'FAIL'}`); diff --git a/src/orchestrator/src/test-runner.test.ts b/src/orchestrator/src/test-runner.test.ts new file mode 100644 index 00000000..beb16d67 --- /dev/null +++ b/src/orchestrator/src/test-runner.test.ts @@ -0,0 +1,55 @@ +// Regression oracle for cook-harness report fidelity: `bun test` writes its +// results (failure detail, pass/fail counts) to **stderr** and only the +// version banner to stdout. A `TestResult.output` that drops stderr reduces +// every failing `tests-run` / `epic-verified` report to the bare banner — +// zero diagnostics (observed on cook run 289c9843, 2026-06-04). +// +// These tests spawn the real `bun` binary against tmpdir test files, so they +// pin the actual stream behavior rather than a mock's assumption of it. + +import { mkdtempSync, rmSync, writeFileSync } from 'node:fs'; +import { tmpdir } from 'node:os'; +import { join } from 'node:path'; + +import { afterEach, describe, expect, it } from 'vitest'; + +import { BunTestRunner } from './test-runner.js'; + +describe('BunTestRunner output fidelity', () => { + const dirs: string[] = []; + + function makeSandbox(testFileContent: string): string { + const dir = mkdtempSync(join(tmpdir(), 'bun-test-runner-')); + dirs.push(dir); + writeFileSync(join(dir, 'sample.test.ts'), testFileContent); + return dir; + } + + afterEach(() => { + for (const dir of dirs.splice(0)) rmSync(dir, { recursive: true, force: true }); + }); + + it('failing run output carries the failure detail, not just the banner', async () => { + const sandbox = makeSandbox( + `import { expect, test } from 'bun:test';\n` + + `test('deliberately fails', () => { expect(1).toBe(2); });\n`, + ); + + const result = await new BunTestRunner().run('sample.test.ts', sandbox); + + expect(result.passed).toBe(false); + expect(result.output).toContain('1 fail'); + expect(result.output).toContain('expect(received).toBe(expected)'); + }); + + it('passing run output carries the test summary', async () => { + const sandbox = makeSandbox( + `import { expect, test } from 'bun:test';\n` + `test('passes', () => { expect(1).toBe(1); });\n`, + ); + + const result = await new BunTestRunner().run('sample.test.ts', sandbox); + + expect(result.passed).toBe(true); + expect(result.output).toContain('1 pass'); + }); +}); diff --git a/src/orchestrator/src/test-runner.ts b/src/orchestrator/src/test-runner.ts index 7bff8ddf..d3c4b689 100644 --- a/src/orchestrator/src/test-runner.ts +++ b/src/orchestrator/src/test-runner.ts @@ -1,23 +1,21 @@ -import { execSync } from 'node:child_process'; +import { spawnSync } from 'node:child_process'; import type { TestResult, TestRunner } from './types.js'; export class BunTestRunner implements TestRunner { async run(target: string, sandboxDir: string): Promise { - try { - const output = execSync(`bun test ${target}`, { - cwd: sandboxDir, - encoding: 'utf8', - timeout: 60_000, - stdio: ['ignore', 'pipe', 'pipe'], - }); - return { passed: true, output }; - } catch (err) { - const output = - err && typeof err === 'object' && 'stdout' in err - ? String((err as { stdout: unknown }).stdout) - : String(err); - return { passed: false, output }; - } + const result = spawnSync('bun', ['test', target], { + cwd: sandboxDir, + encoding: 'utf8', + timeout: 60_000, + stdio: ['ignore', 'pipe', 'pipe'], + }); + // `bun test` writes its results (failure detail, pass/fail counts) to + // stderr and only the version banner to stdout — concatenate both so + // tests-run reports carry real diagnostics. + const output = [result.stdout, result.stderr, result.error ? String(result.error) : ''] + .filter(Boolean) + .join(''); + return { passed: result.status === 0, output }; } } diff --git a/src/orchestrator/src/topology.test.ts b/src/orchestrator/src/topology.test.ts index 06166455..d242dd39 100644 --- a/src/orchestrator/src/topology.test.ts +++ b/src/orchestrator/src/topology.test.ts @@ -91,6 +91,22 @@ const depPlan: Plan = { ], }; +const multiTargetPlan: Plan = { + epics: [{ id: 'epic-1', summary: 'E', depends_on: [], verification: [] }], + slices: [ + { + id: 'slice-1', + epic_id: 'epic-1', + definition: 'D', + depends_on: [], + verification: [ + { kind: 'unit-test', target: 'tests/a.test.ts' }, + { kind: 'integration-test', target: 'tests/b.test.ts' }, + ], + }, + ], +}; + describe('enumerateCandidateOutputs', () => { it('returns a non-empty output set for every transition in simplePlan', () => { const blueprint = compileTopology(simplePlan, { maxRetries: 3 }); @@ -139,6 +155,16 @@ describe('enumerateCandidateOutputs', () => { expect(enumerateCandidateOutputs(runTests!)).toEqual(expected); }); + it('run-tests producer carries every slice verification target', () => { + const blueprint = compileTopology(multiTargetPlan, { maxRetries: 3 }); + const runTests = blueprint.transitions.find((t) => t.id === 'slice-1:run-tests:complete'); + expect(runTests).toBeDefined(); + const handler = runTests!.handler; + if (handler.kind !== 'run-tests') throw new Error('expected run-tests descriptor'); + + expect(handler.targets).toEqual(['tests/a.test.ts', 'tests/b.test.ts']); + }); + it('assess-semantic producer enumerates intermediatePlace plus budgetPlace', () => { const blueprint = compileTopology(simplePlan, { maxRetries: 3 }); const assess = blueprint.transitions.find((t) => t.id === 'slice-1:assess-semantic:complete');