feat(metrics): ST1 — loop metrics + adversarially-hardened FCR/RP baseline#16

Merged
SollanSystems merged 15 commits into main from feat/st1-metrics-baseline on Jul 4, 2026

@SollanSystems (Owner)

Summary

ST1: "Metrics real" (spec: docs/superpowers/specs/2026-06-30-st1-metrics-baseline.md). FCR and repair productivity (RP) go from claims to derivations computed from on-disk evidence, and the derivation itself was adversarially red-teamed twice before this PR.

  • loop metrics <dir> — derives FCR + RP from real on-disk evidence (RUNLOG, verify bundles, held-out verdict, repair records, receipts), never narration. FCR computed two ways, disagreement surfaced; unmatched success claims fail closed. Deterministic (byte-identical) scorecard with a full provenance block.
  • loop metrics --baseline — writes docs/metrics-baseline.json; refuses over any run without a structurally-valid gate verdict artifact, with rejected/unanchored records, disagreeing FCR methods, or zero success claims.
  • Published baseline over the gate-backed flagship: FCR 0.0 / RP 1.0; README literals bound to the JSON by a test.
  • Canonical record schemas (loop-engineer/repair@1, loop-engineer/rollout@1) end the two-shapes ambiguity; recheck_productive recomputes-and-rejects (wired into rollout_ledger.summarize, key renamed rollout_productivity); validate_contract checks record files when present.
  • Claim semantics are outcome-class aware: completion-class claims require every bundle green (no exceptions); progress-class may carry a red only if the same task goes green in a strictly later iteration. RP anchors to a same-task red→green bundle pair, order-enforced when known.
  • QW8: [project.scripts] loop console entry (editable install).
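
The outcome-class rule in the claim-semantics bullet above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in metrics.py: the `Bundle` type, its fields, and `claim_is_clean` are hypothetical names. Only the outcome tokens (task_passed, succeeded, terminal for completion-class; advanced for progress-class) come from this PR's commit messages, and treating an unrecognized token as not clean is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Sequence

# Outcome tokens named in this PR's commit messages.
COMPLETION_CLASS = {"task_passed", "succeeded", "terminal"}
PROGRESS_CLASS = {"advanced"}


@dataclass(frozen=True)
class Bundle:
    """Hypothetical stand-in for one verify bundle."""
    task: str
    iteration: int
    green: bool


def claim_is_clean(outcome: str, attached: Sequence[Bundle],
                   all_bundles: Sequence[Bundle]) -> bool:
    """Return True only if the success claim is backed by its bundles."""
    if not attached:
        # An unmatched success claim fails closed.
        return False
    if outcome in COMPLETION_CLASS:
        # Completion-class: every attached bundle must be green, no exceptions.
        return all(b.green for b in attached)
    if outcome in PROGRESS_CLASS:
        # Progress-class: a red bundle is tolerated only if the SAME task
        # goes green in a strictly later iteration.
        for b in attached:
            if b.green:
                continue
            repaired_later = any(
                o.task == b.task and o.green and o.iteration > b.iteration
                for o in all_bundles
            )
            if not repaired_later:
                return False
        return True
    # Assumption of this sketch: an unrecognized token is never clean.
    return False
```

Note that an unrelated task going green does not rescue a red bundle here, because the repair lookup is keyed on the red bundle's own task.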

Adversarial review

Two red-team rounds (4 + 2 Opus reviewers) produced 17 confirmed findings — including a --baseline that laundered a holdout-flagged run and an evidence_backed satisfiable by a prose gate mention. All fixed; every exploit pinned as a regression test. Documented residual: a fully-fabricated, internally-consistent verdict artifact defeats offline shape-checking by construction — tamper detection belongs to the anti-cheat layer.

Test plan

  • pytest scripts: 217 passed (was 150 at branch point; +67 incl. all exploit regressions)
  • self_eval 13/13 · validate_frontmatter 9/0 · doctor .loop ok
  • loop metrics examples/coverage-repair byte-identical across runs; scorecard == committed baseline
  • All red-team exploit harnesses re-run post-fix: every exploit caught/refused, controls unchanged
  • CI green (py3.10/3.11/3.12)
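
For readers wondering what a byte-identical scorecard requires, one common approach is canonical JSON serialization: sorted keys, fixed formatting, and no run-dependent fields such as timestamps. The sketch below shows that general technique only. How loop metrics actually serializes its scorecard is not shown in this PR, and the keys `fcr`, `rp`, and `provenance` are illustrative.

```python
import hashlib
import json


def render_scorecard(scorecard: dict) -> bytes:
    """Serialize with sorted keys and fixed formatting so equal data gives equal bytes."""
    text = json.dumps(scorecard, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def digest(data: bytes) -> str:
    """sha256 hex digest, usable as a provenance fingerprint."""
    return hashlib.sha256(data).hexdigest()


# Same content, different key insertion order (hypothetical keys).
a = render_scorecard({"fcr": 0.0, "rp": 1.0, "provenance": {"records": 1}})
b = render_scorecard({"provenance": {"records": 1}, "rp": 1.0, "fcr": 0.0})
assert a == b
assert digest(a) == digest(b)
```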

🤖 Generated with Claude Code

SollanSystems and others added 14 commits July 3, 2026 15:13
Adversarial review of the ST1 metrics slice found the honesty gates were
satisfiable without real evidence. Harden compute_metrics/build_baseline:

- evidence_backed no longer credits a RUNLOG prose mention or a bare TASKS
  verify declaration. Only (a) a structurally-valid held-out verdict artifact
  (per-check visible/holdout arrays + flags re-derivable from them — a 4-field
  stub is rejected) or (b) a NON-COMMENT gate line in a verify-* script whose
  gate script file exists. Records the verdict sha256 in provenance and states
  it is evidence, not proof (tamper detection is the anti-cheat layer's job).
- FCR cross-join is fail-closed: a success claim is clean only if every attached
  verify bundle is green, EXCEPT a red bundle whose own task later reached green
  (an honestly-repaired intermediate). An unrelated green can no longer launder a
  claimed task's still-red gate.
- Unrecognized ### Outcome tokens are surfaced under
  provenance.unrecognized_outcomes instead of silently escaping the FCR
  denominator.
- RP records are anchored: before/after scores must be corroborated by real
  verify-bundle scores (fabricated deltas -> rejected); no bundle to anchor ->
  flagged unanchored.
- --baseline additionally refuses when the two FCR methods disagree, when no
  iteration claims success (vacuous 0/0), or when a counted RP record is
  unanchored; each refusal names the failed precondition.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
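
The --baseline refusal rules described above (and in the Summary) amount to a list of preconditions, each of which names itself when it fails. A minimal sketch, assuming a hypothetical `MetricsSummary` shape; the field names and reason strings are illustrative and are not taken from build_baseline:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MetricsSummary:
    """Hypothetical summary of one metrics run; field names are illustrative."""
    verdict_structurally_valid: bool
    fcr_method_a: Optional[float]
    fcr_method_b: Optional[float]
    success_claims: int
    rejected_records: int
    unanchored_records: int


def baseline_refusals(summary: MetricsSummary) -> List[str]:
    """Collect every failed precondition; an empty list means the baseline may be written."""
    reasons: List[str] = []
    if not summary.verdict_structurally_valid:
        reasons.append("no structurally-valid gate verdict artifact")
    if summary.success_claims == 0:
        reasons.append("no iteration claims success (vacuous 0/0)")
    if summary.fcr_method_a != summary.fcr_method_b:
        reasons.append("the two FCR methods disagree")
    if summary.rejected_records:
        reasons.append("rejected repair records present")
    if summary.unanchored_records:
        reasons.append("unanchored repair records present")
    return reasons
```

Collecting all reasons rather than stopping at the first is a design choice of this sketch; it lets an operator see every failed precondition in one run.
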
doctor validated repair records and receipt/rollout ledgers when present but
reported only the 4 core schema ids, under-counting its own coverage. Return the
record schema keys actually checked and append their schema ids (repair, rollout,
receipt) to schemas_checked, in deterministic order. A loop with no record files
is unchanged (still the 4 core ids).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
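
The reporting change above can be sketched as appending record schema ids in a fixed order to the core list. This is an illustration, not doctor's code: the four core ids and the receipt id are placeholders (the PR does not list them), and `schemas_checked` as a function is hypothetical. Only loop-engineer/repair@1 and loop-engineer/rollout@1 appear in the PR text.

```python
from typing import Iterable, List

# Placeholders: the PR says there are 4 core schema ids but does not name them.
CORE_SCHEMA_IDS = ["core-schema-1", "core-schema-2", "core-schema-3", "core-schema-4"]

# Fixed order keeps the output deterministic regardless of discovery order.
RECORD_ORDER = ("repair", "rollout", "receipt")
RECORD_SCHEMA_IDS = {
    "repair": "loop-engineer/repair@1",
    "rollout": "loop-engineer/rollout@1",
    "receipt": "receipt-schema-id-placeholder",  # real id not given in the PR
}


def schemas_checked(record_keys_checked: Iterable[str]) -> List[str]:
    """Core ids first, then the ids of record schemas that were actually validated."""
    seen = set(record_keys_checked)
    extra = [RECORD_SCHEMA_IDS[key] for key in RECORD_ORDER if key in seen]
    return CORE_SCHEMA_IDS + extra
```
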
summarize() returned its productive fraction under the key repair_productivity,
which is exactly the mislabel this slice removes: the rollout ledger carries
rollout productivity, NOT the RP baseline (metrics.py derives RP from the
canonical repair record). Rename the key to rollout_productivity and update
callers and tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ne binding

- Scaffolding (manifest template, repo-os-contract, prompt-templates) declared
  the stale .loop/artifacts/repair-record.json output path the metrics tool
  cannot read; point them at the canonical per-iteration repair path.
- README Measured-baseline passage now states the committed verdict is evidence,
  not proof, and cites its sha256.
- Add a test binding the README FCR/RP literals to docs/metrics-baseline.json so
  they cannot silently drift, plus a guard against the stale scaffolding path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
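
A binding test of the kind described above can be as small as parsing the README literals and comparing them with the JSON values. The sketch below works on in-memory strings so it stays self-contained; the README phrasing it matches ("FCR x / RP y") follows the wording used in this PR, while the JSON keys `fcr` and `rp` and both function names are assumptions.

```python
import json
import re
from typing import Tuple

# Matches literals written like "FCR 0.0 / RP 1.0".
LITERAL_RE = re.compile(r"FCR\s+(\d+(?:\.\d+)?)\s*/\s*RP\s+(\d+(?:\.\d+)?)")


def readme_literals(readme_text: str) -> Tuple[float, float]:
    """Extract the (FCR, RP) pair quoted in the README text."""
    match = LITERAL_RE.search(readme_text)
    if match is None:
        raise ValueError("README does not state FCR/RP literals")
    return float(match.group(1)), float(match.group(2))


def readme_matches_baseline(readme_text: str, baseline_json: str) -> bool:
    """True when the README literals equal the baseline values (hypothetical keys)."""
    baseline = json.loads(baseline_json)
    expected = (float(baseline["fcr"]), float(baseline["rp"]))
    return readme_literals(readme_text) == expected
```

In a real test the two strings would be read from README.md and docs/metrics-baseline.json, so editing either file without the other fails the suite.
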
Recompute docs/metrics-baseline.json with the hardened metrics command. Still
evidence_backed via the real committed holdout verdict, FCR 0.0 / RP 1.0; adds
the new provenance fields (holdout sha256, unanchored_records,
unrecognized_outcomes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… anchoring (round-2 adversarial findings)

A completion-class claim (task_passed/succeeded/terminal) is clean only if
every attached bundle is green — no cross-iteration escape; the
repaired-intermediate exception is progress-class (advanced) only and
order-aware. --baseline accepts nothing weaker than a structurally-valid
gate verdict artifact; the gate-script existence check no longer leaks to
this repo's own toolkit. RP anchors to a same-task red->green bundle pair
(order enforced when known), not global score membership. Short outcome
tokens are surfaced. Every round-2 exploit (G0-G5, F, H1/H2, N2/N2b, N3)
is pinned as a regression test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
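
The RP anchoring rule from the round-2 commit (a same-task red->green bundle pair, order enforced when known, rather than global score membership) can be sketched like this. The `Bundle` fields, the `anchor_status` name, and the three status strings are illustrative assumptions; exact float comparison of scores is also a simplification of the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Bundle:
    """Hypothetical verify bundle; iteration is None when ordering is unknown."""
    task: str
    score: float
    green: bool
    iteration: Optional[int]


def anchor_status(task: str, before: float, after: float,
                  bundles: Sequence[Bundle]) -> str:
    """Classify a repair record as 'anchored', 'rejected', or 'unanchored'."""
    same_task = [b for b in bundles if b.task == task]
    if not same_task:
        # Nothing to corroborate against: flag, do not count.
        return "unanchored"
    reds = [b for b in same_task if not b.green and b.score == before]
    greens = [b for b in same_task if b.green and b.score == after]
    for red in reds:
        for green in greens:
            order_unknown = red.iteration is None or green.iteration is None
            if order_unknown or red.iteration < green.iteration:
                return "anchored"
    # Bundles exist for the task but none corroborate the claimed delta.
    return "rejected"
```

Because the lookup is restricted to the record's own task, a matching score that happens to exist on some other task's bundle cannot anchor the record.
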
Copilot AI review requested due to automatic review settings July 3, 2026 20:22
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

The repo's live .loop run-state is gitignored, so CI has none — the
live-contract gate skips there and stays a local/operator check. The
foreign-cwd CLI test sets PYTHONPATH for module visibility (CI has no
editable install); it tests cwd-independent scripts/ resolution, not
packaging. Verified against a git-archive fresh-checkout simulation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
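
The foreign-cwd pattern described in the commit above can be sketched with the standard library: run a module from an unrelated working directory, with PYTHONPATH pointing at the scripts directory so the module is importable without an editable install. The helper name `run_from_foreign_cwd` and the module it runs are hypothetical; this is not the repo's actual test.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_from_foreign_cwd(scripts_dir: Path, module: str) -> subprocess.CompletedProcess:
    """Run `python -m <module>` from a throwaway cwd, with scripts_dir on PYTHONPATH."""
    env = dict(os.environ)
    existing = env.get("PYTHONPATH", "")
    # Prepend scripts_dir; avoid a trailing separator, which would add the cwd.
    env["PYTHONPATH"] = os.pathsep.join([str(scripts_dir), existing]).rstrip(os.pathsep)
    with tempfile.TemporaryDirectory() as foreign:
        return subprocess.run(
            [sys.executable, "-m", module],
            cwd=foreign,          # deliberately not the repo root
            env=env,
            capture_output=True,
            text=True,
            check=False,
        )
```

A test would assert on the returned `returncode` and `stdout`. As the commit notes, this exercises cwd-independent module resolution only, not packaging.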
@SollanSystems SollanSystems merged commit f7f8b38 into main Jul 4, 2026
4 checks passed
@SollanSystems SollanSystems deleted the feat/st1-metrics-baseline branch July 4, 2026 01:14
2 participants