feat(metrics): ST1 — loop metrics + adversarially-hardened FCR/RP baseline#16
Merged
…and-reject (ST1 AC2-AC5)
…ple record to .loop/repair (ST1 AC1)
Adversarial review of the ST1 metrics slice found the honesty gates were satisfiable without real evidence. Harden compute_metrics/build_baseline:

- evidence_backed no longer credits a RUNLOG prose mention or a bare TASKS verify declaration. Only (a) a structurally-valid held-out verdict artifact (per-check visible/holdout arrays + flags re-derivable from them; a 4-field stub is rejected) or (b) a NON-COMMENT gate line in a verify-* script whose gate script file exists. Records the verdict sha256 in provenance and states it is evidence, not proof (tamper detection is the anti-cheat layer's job).
- FCR cross-join is fail-closed: a success claim is clean only if every attached verify bundle is green, EXCEPT a red bundle whose own task later reached green (an honestly-repaired intermediate). An unrelated green can no longer launder a claimed task's still-red gate.
- Unrecognized ### Outcome tokens are surfaced under provenance.unrecognized_outcomes instead of silently escaping the FCR denominator.
- RP records are anchored: before/after scores must be corroborated by real verify-bundle scores (fabricated deltas -> rejected); no bundle to anchor -> flagged unanchored.
- --baseline additionally refuses when the two FCR methods disagree, when no iteration claims success (vacuous 0/0), or when a counted RP record is unanchored; each refusal names the failed precondition.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
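A minimal sketch of what "structurally valid, with flags re-derivable from the per-check arrays" could look like. The field names (`checks`, `visible`, `holdout`, `passed`, `visible_pass`, `holdout_pass`) are assumptions made for illustration; the actual verdict schema is not shown in this PR.

```python
import hashlib


def flags_from_checks(checks):
    """Re-derive the summary flags from the per-check arrays (hypothetical field names)."""
    return {
        "visible_pass": all(c["passed"] for c in checks["visible"]),
        "holdout_pass": all(c["passed"] for c in checks["holdout"]),
    }


def is_structurally_valid(verdict):
    """Accept a verdict only if it carries per-check arrays and its flags match them."""
    checks = verdict.get("checks")
    if not isinstance(checks, dict):
        return False  # a flags-only stub has nothing to re-derive from
    for group in ("visible", "holdout"):
        items = checks.get(group)
        if not isinstance(items, list) or not items:
            return False
        if not all(isinstance(c, dict) and isinstance(c.get("passed"), bool) for c in items):
            return False
    derived = flags_from_checks(checks)
    return all(verdict.get(name) == value for name, value in derived.items())


def verdict_sha256(raw_bytes):
    """Digest of the artifact bytes, recorded as evidence rather than proof."""
    return hashlib.sha256(raw_bytes).hexdigest()
```

A shape check of this kind rejects a stub that only asserts flags, but it cannot detect a fabricated artifact that is internally consistent, which is why the digest is recorded as evidence only.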
doctor validated repair records and receipt/rollout ledgers when present but reported only the 4 core schema ids, under-counting its own coverage. Return the record schema keys actually checked and append their schema ids (repair, rollout, receipt) to schemas_checked, in deterministic order. A loop with no record files is unchanged (still the 4 core ids). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
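One way the deterministic ordering could be expressed, as a sketch only. The two ids `loop-engineer/repair@1` and `loop-engineer/rollout@1` appear in this PR; the receipt id and the core ids below are placeholders, and `schemas_checked` here is a hypothetical helper, not the doctor's real function.

```python
# Fixed order in which record schema ids are appended (matches the commit text:
# repair, rollout, receipt).
RECORD_ORDER = ("repair", "rollout", "receipt")

RECORD_SCHEMA_IDS = {
    "repair": "loop-engineer/repair@1",
    "rollout": "loop-engineer/rollout@1",
    "receipt": "loop-engineer/receipt@1",  # assumed id, not confirmed by the source
}


def schemas_checked(core_ids, record_keys_checked):
    """Return the core ids followed by the ids of record schemas actually validated."""
    ids = list(core_ids)
    for key in RECORD_ORDER:  # iterate the fixed tuple, never the input set
        if key in record_keys_checked:
            ids.append(RECORD_SCHEMA_IDS[key])
    return ids
```

Iterating the fixed tuple rather than the set of checked keys keeps the output order stable across runs, and a loop with no record files yields only the core ids.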
summarize() returned its productive fraction under the key repair_productivity, the exact mislabel the slice de-branding closes: the rollout ledger carries rollout-productivity, NOT the RP baseline (metrics.py derives RP from the canonical repair record). Rename the key and update callers and tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ne binding

- Scaffolding (manifest template, repo-os-contract, prompt-templates) declared the stale .loop/artifacts/repair-record.json output path the metrics tool cannot read; point them at the canonical per-iteration repair path.
- README Measured-baseline passage now states the committed verdict is evidence, not proof, and cites its sha256.
- Add a test binding the README FCR/RP literals to docs/metrics-baseline.json so they cannot silently drift, plus a guard against the stale scaffolding path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
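The README-to-baseline binding could be sketched roughly as follows. The README phrasing pattern (`FCR <number>`, `RP <number>`) and the baseline keys `fcr` and `rp` are assumptions for illustration; the real test and JSON layout are not shown here.

```python
import re


def readme_literals(readme_text):
    """Extract the first 'FCR <num>' and 'RP <num>' literals from README text."""
    found = {}
    for name in ("FCR", "RP"):
        match = re.search(rf"\b{name}\s+(\d+(?:\.\d+)?)", readme_text)
        if match:
            found[name] = float(match.group(1))
    return found


def drifted_metrics(readme_text, baseline):
    """Return the metric names whose README literal disagrees with the baseline."""
    literals = readme_literals(readme_text)
    drifted = []
    for name, key in (("FCR", "fcr"), ("RP", "rp")):  # hypothetical baseline keys
        if literals.get(name) != baseline.get(key):
            drifted.append(name)
    return drifted
```

A test would then load the README and `docs/metrics-baseline.json` from disk and assert that `drifted_metrics(...)` returns an empty list; a missing literal counts as drift because `None` never equals a number.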
Recompute docs/metrics-baseline.json with the hardened metrics command. Still evidence_backed via the real committed holdout verdict, FCR 0.0 / RP 1.0; adds the new provenance fields (holdout sha256, unanchored_records, unrecognized_outcomes). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… anchoring (round-2 adversarial findings)

A completion-class claim (task_passed/succeeded/terminal) is clean only if every attached bundle is green, with no cross-iteration escape; the repaired-intermediate exception is progress-class (advanced) only and order-aware. --baseline accepts nothing weaker than a structurally-valid gate verdict artifact; the gate-script existence check no longer leaks to this repo's own toolkit. RP anchors to a same-task red->green bundle pair (order enforced when known), not global score membership. Short outcome tokens are surfaced. Every round-2 exploit (G0-G5, F, H1/H2, N2/N2b, N3) is pinned as a regression test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
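The claim-cleanliness rule described above might look like this in outline. It is a sketch under assumptions: bundles are modelled as dicts with hypothetical `task`, `green`, and `seq` fields, and `claim_is_clean` is an illustrative name rather than the function in metrics.py.

```python
COMPLETION_OUTCOMES = {"task_passed", "succeeded", "terminal"}
PROGRESS_OUTCOMES = {"advanced"}


def later_green_exists(red, history):
    """True if the same task has a green bundle strictly after this red one."""
    return any(
        b["task"] == red["task"] and b["green"] and b["seq"] > red["seq"]
        for b in history
    )


def claim_is_clean(outcome, attached, history):
    """Fail-closed check of a success claim against its attached verify bundles."""
    if not attached:
        return False  # an unmatched claim is never clean
    reds = [b for b in attached if not b["green"]]
    if outcome in COMPLETION_OUTCOMES:
        return not reds  # no exception: every attached bundle must be green
    if outcome in PROGRESS_OUTCOMES:
        # Repaired-intermediate exception: same task, later in order.
        return all(later_green_exists(r, history) for r in reds)
    return False  # unrecognized tokens are not credited in this sketch
```

Matching on the red bundle's own task and requiring a later sequence number is what keeps an unrelated or earlier green from laundering a still-red gate.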
The repo's live .loop run-state is gitignored, so CI has none — the live-contract gate skips there and stays a local/operator check. The foreign-cwd CLI test sets PYTHONPATH for module visibility (CI has no editable install); it tests cwd-independent scripts/ resolution, not packaging. Verified against a git-archive fresh-checkout simulation. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
ST1: "Metrics real" (spec: `docs/superpowers/specs/2026-06-30-st1-metrics-baseline.md`). FCR and repair-productivity go from claims to derivations, and the derivation itself was adversarially red-teamed twice before this PR.

- `loop metrics <dir>`: derives FCR + RP from real on-disk evidence (RUNLOG, verify bundles, held-out verdict, repair records, receipts), never narration. FCR computed two ways, disagreement surfaced; unmatched success claims fail closed. Deterministic (byte-identical) scorecard with a full `provenance` block.
- `loop metrics --baseline`: writes `docs/metrics-baseline.json`; refuses over any run without a structurally-valid gate verdict artifact, with rejected/unanchored records, disagreeing FCR methods, or zero success claims.
- … (`loop-engineer/repair@1`, `loop-engineer/rollout@1`) end the two-shapes ambiguity; `recheck_productive` recomputes-and-rejects (wired into `rollout_ledger.summarize`, key renamed `rollout_productivity`); `validate_contract` checks record files when present.
- `[project.scripts] loop` console entry (editable install).

Adversarial review
Two red-team rounds (4 + 2 Opus reviewers) produced 17 confirmed findings, including a `--baseline` that laundered a holdout-flagged run and an `evidence_backed` satisfiable by a prose gate mention. All fixed; every exploit pinned as a regression test. Documented residual: a fully-fabricated, internally-consistent verdict artifact defeats offline shape-checking by construction; tamper detection belongs to the anti-cheat layer.

Test plan
- `doctor .loop` ok
- `loop metrics examples/coverage-repair` byte-identical across runs; scorecard == committed baseline

🤖 Generated with Claude Code
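The byte-identical property in the test plan depends on a canonical serialization. A minimal sketch, assuming the scorecard is a plain JSON-serializable dict; `render_scorecard` is a hypothetical helper name, not the tool's actual function.

```python
import json


def render_scorecard(scorecard):
    """Serialize a scorecard to canonical bytes: sorted keys, fixed indent, trailing newline."""
    text = json.dumps(scorecard, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def is_byte_identical(scorecard_a, scorecard_b):
    """Compare two scorecards by their canonical byte form."""
    return render_scorecard(scorecard_a) == render_scorecard(scorecard_b)
```

Sorting keys removes dependence on dict insertion order, so two runs that compute the same values in a different order still produce the same bytes; timestamps or absolute paths in the payload would break the property and would need to be excluded upstream.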