feat(metrics): ST1 — loop metrics + adversarially-hardened FCR/RP baseline by SollanSystems · Pull Request #16 · SollanSystems/loop-engineer

SollanSystems · 2026-07-03T20:22:51Z

Summary

ST1 — "Metrics real" (spec: docs/superpowers/specs/2026-06-30-st1-metrics-baseline.md). FCR and repair-productivity go from claims to derivations, and the derivation itself was adversarially red-teamed twice before this PR.

- loop metrics <dir>: derives FCR + RP from real on-disk evidence (RUNLOG, verify bundles, held-out verdict, repair records, receipts), never narration. FCR is computed two ways and any disagreement is surfaced; unmatched success claims fail closed. The scorecard is deterministic (byte-identical) and carries a full provenance block.
- loop metrics --baseline: writes docs/metrics-baseline.json; refuses any run that lacks a structurally-valid gate verdict artifact, or that has rejected/unanchored records, disagreeing FCR methods, or zero success claims.
- Published baseline over the gate-backed flagship: FCR 0.0 / RP 1.0; README literals are bound to the JSON by a test.
- Canonical record schemas (loop-engineer/repair@1, loop-engineer/rollout@1) end the two-shapes ambiguity; recheck_productive recomputes-and-rejects (wired into rollout_ledger.summarize, key renamed rollout_productivity); validate_contract checks record files when present.
- Claim semantics are outcome-class aware: completion-class claims require every bundle green (no exceptions); progress-class claims may carry a red only if the same task goes green in a strictly later iteration. RP anchors to a same-task red→green bundle pair, order-enforced when known.
- QW8: [project.scripts] loop console entry (editable install).
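
The outcome-class rule above can be sketched as follows. This is an illustrative sketch only, not the code in metrics.py: the `Bundle` type, the `claim_is_clean` name, and the grouping of outcome tokens into two sets are assumptions made for the example (the tokens themselves come from the commit messages in this PR).

```python
from dataclasses import dataclass

# Outcome tokens named in this PR; grouping them into sets is an assumption.
COMPLETION_CLASS = {"task_passed", "succeeded", "terminal"}
PROGRESS_CLASS = {"advanced"}


@dataclass(frozen=True)
class Bundle:
    """Hypothetical stand-in for one verify bundle."""
    task: str
    iteration: int
    green: bool


def claim_is_clean(outcome, attached, all_bundles):
    """Return True only if the success claim is backed by its bundles."""
    if not attached:
        return False  # an unmatched success claim fails closed
    if outcome in COMPLETION_CLASS:
        # Completion-class: every attached bundle must be green.
        return all(b.green for b in attached)
    if outcome in PROGRESS_CLASS:
        # Progress-class: a red is tolerated only if the same task
        # reaches green in a strictly later iteration.
        return all(
            b.green
            or any(
                later.task == b.task
                and later.green
                and later.iteration > b.iteration
                for later in all_bundles
            )
            for b in attached
        )
    return False  # unrecognized tokens are never treated as clean here
```

Under these assumptions, a green bundle from an unrelated task does not rescue a red one, which is the laundering case the review describes.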

Adversarial review

Two red-team rounds (4 + 2 Opus reviewers) produced 17 confirmed findings — including a --baseline that laundered a holdout-flagged run and an evidence_backed satisfiable by a prose gate mention. All fixed; every exploit pinned as a regression test. Documented residual: a fully-fabricated, internally-consistent verdict artifact defeats offline shape-checking by construction — tamper detection belongs to the anti-cheat layer.
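
To illustrate the "prose gate mention" exploit class, here is a minimal sketch of a check that credits a gate reference only when it appears on a non-comment line of a verify script. The function name `has_live_gate_line` and the gate filename are hypothetical, the trailing-comment handling is deliberately naive, and the real check also requires the gate script file to exist, which this sketch omits.

```python
def has_live_gate_line(script_text: str, gate_name: str) -> bool:
    """True if gate_name appears on a line that is not a shell comment."""
    for raw in script_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines, shebangs, and full-line comments
        code = line.split(" #", 1)[0]  # naive: drop a trailing comment
        if gate_name in code:
            return True
    return False


# A commented-out mention is not evidence; an actual invocation is.
print(has_live_gate_line("# run gate.sh later\necho ok\n", "gate.sh"))
print(has_live_gate_line("./gate.sh --strict\n", "gate.sh"))
```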

Test plan

pytest scripts: 217 passed (was 150 at branch point; +67 incl. all exploit regressions)
self_eval 13/13 · validate_frontmatter 9/0 · doctor .loop ok
loop metrics examples/coverage-repair byte-identical across runs; scorecard == committed baseline
All red-team exploit harnesses re-run post-fix: every exploit caught/refused, controls unchanged
CI green (py3.10/3.11/3.12)
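
One common way to get a byte-identical scorecard across runs is canonical JSON serialization, sketched below. This is an assumption about technique, not a description of what `loop metrics` does internally; `render_scorecard` and the example keys are hypothetical.

```python
import json


def render_scorecard(scorecard: dict) -> bytes:
    """Serialize with sorted keys and a fixed layout so output is stable."""
    text = json.dumps(scorecard, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


# Insertion order differs, serialized bytes do not.
a = {"fcr": 0.0, "rp": 1.0, "provenance": {"b": 1, "a": 2}}
b = {"provenance": {"a": 2, "b": 1}, "rp": 1.0, "fcr": 0.0}
print(render_scorecard(a) == render_scorecard(b))
```

The sketch only holds if the scorecard contains no run-dependent values such as timestamps or absolute paths.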

🤖 Generated with Claude Code


Adversarial review of the ST1 metrics slice found the honesty gates were satisfiable without real evidence. Harden compute_metrics/build_baseline:

- evidence_backed no longer credits a RUNLOG prose mention or a bare TASKS verify declaration. Only (a) a structurally-valid held-out verdict artifact (per-check visible/holdout arrays + flags re-derivable from them; a 4-field stub is rejected) or (b) a NON-COMMENT gate line in a verify-* script whose gate script file exists. Records the verdict sha256 in provenance and states it is evidence, not proof (tamper detection is the anti-cheat layer's job).
- FCR cross-join is fail-closed: a success claim is clean only if every attached verify bundle is green, EXCEPT a red bundle whose own task later reached green (an honestly-repaired intermediate). An unrelated green can no longer launder a claimed task's still-red gate.
- Unrecognized ### Outcome tokens are surfaced under provenance.unrecognized_outcomes instead of silently escaping the FCR denominator.
- RP records are anchored: before/after scores must be corroborated by real verify-bundle scores (fabricated deltas -> rejected); no bundle to anchor -> flagged unanchored.
- --baseline additionally refuses when the two FCR methods disagree, when no iteration claims success (vacuous 0/0), or when a counted RP record is unanchored; each refusal names the failed precondition.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
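
The `--baseline` refusal logic described in this commit can be pictured as a list of named preconditions. The sketch below is hypothetical: `Scorecard`, its field names, and the reason strings are invented for illustration and are not taken from build_baseline.

```python
from dataclasses import dataclass


@dataclass
class Scorecard:
    """Hypothetical summary of one metrics run; field names are illustrative."""
    has_valid_verdict: bool
    rejected_records: int
    unanchored_records: int
    fcr_method_a: float
    fcr_method_b: float
    success_claims: int


def baseline_refusals(card: Scorecard) -> list:
    """Return one named reason per failed precondition; empty means OK."""
    reasons = []
    if not card.has_valid_verdict:
        reasons.append("no structurally-valid gate verdict artifact")
    if card.rejected_records:
        reasons.append("rejected records present")
    if card.unanchored_records:
        reasons.append("unanchored records present")
    if card.fcr_method_a != card.fcr_method_b:
        reasons.append("FCR methods disagree")
    if card.success_claims == 0:
        reasons.append("zero success claims (vacuous 0/0)")
    return reasons
```

Collecting every failed precondition, rather than stopping at the first, is one way to make each refusal name what went wrong.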

doctor validated repair records and receipt/rollout ledgers when present but reported only the 4 core schema ids, under-counting its own coverage. Return the record schema keys actually checked and append their schema ids (repair, rollout, receipt) to schemas_checked, in deterministic order. A loop with no record files is unchanged (still the 4 core ids).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
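
A minimal sketch of the deterministic append, assuming a fixed repair/rollout/receipt order as listed in the commit message. The function signature, the core ids, and the receipt schema id in the usage lines are placeholders, not the real values.

```python
# Fixed order taken from the commit message; everything else is illustrative.
RECORD_ORDER = ("repair", "rollout", "receipt")


def schemas_checked(core_ids, record_ids, checked_keys):
    """Core ids first, then ids of the record schemas actually checked."""
    extra = [record_ids[key] for key in RECORD_ORDER if key in checked_keys]
    return list(core_ids) + extra


core = ["core/a@1", "core/b@1", "core/c@1", "core/d@1"]  # placeholder ids
ids = {
    "repair": "loop-engineer/repair@1",
    "rollout": "loop-engineer/rollout@1",
    "receipt": "example/receipt@1",  # placeholder id
}
print(schemas_checked(core, ids, {"receipt", "repair"}))
```

Iterating over the fixed tuple rather than over the set of checked keys keeps the result independent of set ordering.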

summarize() returned its productive fraction under the key repair_productivity, the exact mislabel the slice de-branding closes: the rollout ledger carries rollout-productivity, NOT the RP baseline (metrics.py derives RP from the canonical repair record). Rename the key and update callers and tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ne binding

- Scaffolding (manifest template, repo-os-contract, prompt-templates) declared the stale .loop/artifacts/repair-record.json output path, which the metrics tool cannot read; point them at the canonical per-iteration repair path.
- README Measured-baseline passage now states the committed verdict is evidence, not proof, and cites its sha256.
- Add a test binding the README FCR/RP literals to docs/metrics-baseline.json so they cannot silently drift, plus a guard against the stale scaffolding path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
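
A binding test of this kind might look roughly like the sketch below. The README phrasing matched by the regex and the `fcr`/`rp` JSON keys are assumptions; the actual test and baseline layout in the repo may differ.

```python
import json
import re

# Assumed README phrasing, e.g. "FCR 0.0 / RP 1.0".
_LITERALS = re.compile(
    r"FCR\s+([0-9]+(?:\.[0-9]+)?)\s*/\s*RP\s+([0-9]+(?:\.[0-9]+)?)"
)


def readme_matches_baseline(readme_text: str, baseline_json: str) -> bool:
    """True only if the README literals equal the baseline JSON values."""
    baseline = json.loads(baseline_json)
    match = _LITERALS.search(readme_text)
    if match is None:
        return False  # missing literals count as drift
    fcr, rp = float(match.group(1)), float(match.group(2))
    return fcr == baseline["fcr"] and rp == baseline["rp"]
```

In a pytest suite, a test would read README.md and docs/metrics-baseline.json from disk and assert that this returns True.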

Recompute docs/metrics-baseline.json with the hardened metrics command. Still evidence_backed via the real committed holdout verdict, FCR 0.0 / RP 1.0; adds the new provenance fields (holdout sha256, unanchored_records, unrecognized_outcomes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
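
For reference, recording a verdict digest in a provenance block can be sketched with the standard library. The `provenance_for` helper and the `holdout_sha256` key name are hypothetical; only `unanchored_records` and `unrecognized_outcomes` are named in this PR.

```python
import hashlib


def provenance_for(verdict_bytes: bytes) -> dict:
    """Build an illustrative provenance block for a verdict artifact."""
    return {
        # Digest of the artifact exactly as stored: evidence, not proof.
        "holdout_sha256": hashlib.sha256(verdict_bytes).hexdigest(),
        "unanchored_records": [],
        "unrecognized_outcomes": [],
    }
```

The digest lets a reader confirm which artifact was scored. It cannot show that the artifact itself is genuine, which matches the documented residual.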

… anchoring (round-2 adversarial findings)

A completion-class claim (task_passed/succeeded/terminal) is clean only if every attached bundle is green, with no cross-iteration escape; the repaired-intermediate exception is progress-class (advanced) only and order-aware. --baseline accepts nothing weaker than a structurally-valid gate verdict artifact; the gate-script existence check no longer leaks to this repo's own toolkit. RP anchors to a same-task red->green bundle pair (order enforced when known), not global score membership. Short outcome tokens are surfaced. Every round-2 exploit (G0-G5, F, H1/H2, N2/N2b, N3) is pinned as a regression test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
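
The RP anchoring rule (same-task red->green pair, order enforced when known) can be sketched as below. The `Bundle` shape, the `anchor_repair` name, and the three status strings are assumptions for illustration, not the repo's implementation.

```python
from typing import NamedTuple, Optional


class Bundle(NamedTuple):
    """Hypothetical verify bundle; iteration is None when order is unknown."""
    task: str
    iteration: Optional[int]
    score: float
    green: bool


def anchor_repair(task, before, after, bundles):
    """Classify a repair record as 'anchored', 'rejected', or 'unanchored'."""
    same_task = [b for b in bundles if b.task == task]
    if not same_task:
        return "unanchored"  # nothing on disk to corroborate against
    reds = [b for b in same_task if not b.green and b.score == before]
    greens = [b for b in same_task if b.green and b.score == after]
    for red in reds:
        for green in greens:
            order_known = red.iteration is not None and green.iteration is not None
            # Enforce red-before-green only when both iterations are known.
            if not order_known or red.iteration < green.iteration:
                return "anchored"
    return "rejected"  # scores not corroborated by a same-task red->green pair
```

Filtering to the record's own task first is what rules out "global score membership": a matching score on some other task's bundle does not anchor the record.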

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chatgpt-codex-connector · 2026-07-03T20:22:57Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

The repo's live .loop run-state is gitignored, so CI has none; the live-contract gate skips there and stays a local/operator check. The foreign-cwd CLI test sets PYTHONPATH for module visibility (CI has no editable install); it tests cwd-independent scripts/ resolution, not packaging. Verified against a git-archive fresh-checkout simulation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

SollanSystems and others added 14 commits July 3, 2026 15:13

feat(schemas): canonical repair-record + rollout-record schemas (ST1 …

68b441f

…AC1)

feat(metrics): derive FCR/RP from real .loop evidence with recompute-…

8592f1a

…and-reject (ST1 AC2-AC5)

feat(rollout-ledger): reject-on-disagree via shared recheck_productiv…

5b2b8ff

…e (ST1 AC2)

feat(contract): validate repair/rollout/receipt records when present …

67bee4d

…(ST1 AC6)

feat(cli): wire 'loop metrics' subcommand + console-script entry (ST1…

bb383f7

… Rider A/B)

docs(eval-suite,example): name canonical repair record; relocate exam…

4d2db18

…ple record to .loop/repair (ST1 AC1)

feat(metrics): publish real FCR/RP baseline + README Metrics passage …

720639b

…(ST1 AC5)

chore(metrics): restamp baseline at the round-2-hardened derivation

46d4ce1

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings July 3, 2026 20:22

Copilot AI reviewed Jul 3, 2026

SollanSystems merged commit f7f8b38 into main Jul 4, 2026
4 checks passed

SollanSystems deleted the feat/st1-metrics-baseline branch July 4, 2026 01:14

SollanSystems mentioned this pull request Jul 4, 2026

chore(release): cut 0.6.0 — metrics real #17

Merged

SollanSystems merged 15 commits into main from feat/st1-metrics-baseline

2 participants
