Merged
6 changes: 5 additions & 1 deletion evals/cases/structural.json
Original file line number Diff line number Diff line change
@@ -18,7 +18,8 @@
"prompt-templates.md",
"eval-suite.md",
"safety-and-approvals.md",
-"platform-map.md"
+"platform-map.md",
+"model-routing.md"
],
"terminal_states": [
"Succeeded",
@@ -57,6 +58,9 @@
"terminal_state.json.tmpl",
"verify-fast.sh",
"verify-full.sh",
"verify-safety.sh",
"judge-rubric.sh",
"extract-trace-metrics.sh",
"EVALS-rubric.md.tmpl"
],
"eval_layer_names": [
54 changes: 54 additions & 0 deletions reference/model-routing.md
@@ -0,0 +1,54 @@
# Model-routing doctrine — the canonical table + rationale

The one place the `read → haiku / reason → sonnet / write → opus` rule is defined. Every
spoke states the one-line rule inline (so it can act) and points here for the table, the
rationale, and the optional enforcement. This file is the single source of truth; if a skill
and this file ever disagree, this file wins.

> **Base directory.** This file ships at the **plugin root** `reference/` (a sibling of
> `skills/`), i.e. `${CLAUDE_PLUGIN_ROOT}/reference/model-routing.md`. Skills reach it as
> `reference/model-routing.md` resolved against the plugin root, not their own folder.
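The resolution described above can be sketched as follows. This is a minimal illustration, assuming `CLAUDE_PLUGIN_ROOT` is set in the environment when the plugin runs; the helper name `resolve_reference` and the fallback of walking two levels up from a `skills/<name>/` folder are hypothetical and not part of the plugin.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_reference(name: str, skill_dir: Optional[str] = None) -> Path:
    """Resolve a shared doc such as 'model-routing.md' against the plugin root.

    Hypothetical helper: prefers ${CLAUDE_PLUGIN_ROOT}; if that is unset, it
    assumes skill_dir is a skills/<name>/ folder and walks two levels up,
    because reference/ is a sibling of skills/.
    """
    root = os.environ.get("CLAUDE_PLUGIN_ROOT")
    if root:
        plugin_root = Path(root)
    elif skill_dir is not None:
        plugin_root = Path(skill_dir).resolve().parents[1]
    else:
        raise RuntimeError("CLAUDE_PLUGIN_ROOT is unset and no skill_dir was given")
    return plugin_root / "reference" / name
```

The point of the sketch is only that the lookup never starts from the skill's own folder; a skill-local `reference/` directory would be a different path.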

## The one rule

**Every agent dispatch names an explicit `model:`.** This holds for the `Agent` tool and for
every Workflow `agent()` call. Omission is not a safe default: an omitted `model:` silently
inherits the costly main-loop model, which is the single biggest cost leak in an agent loop.

## The tier table

| Tier | `model:` | Use it for |
|---|---|---|
| **read** | `haiku` | Read-only lookups feeding the loop — status/coverage scans, trace/RUNLOG fact extraction, a monitor poll, "where is X", list-and-report. The default: anything that only *reports* what it found. |
| **reason** | `sonnet` | Judgment without production writes — plan critique / pre-execution reflection, rubric judging, failure triage, multi-source synthesis, an ADR review. |
| **write** | `opus` | Production writes and load-bearing decisions — the per-task worker, a bounded repair (repairs edit code), committing regression cases. |
| **orchestrate** | main loop | The operator itself — advancing the state machine, choosing the next transition, adjudicating verification. Not a dispatched sub-agent tier. |

Rule of thumb: **read → haiku, reason → sonnet, write → opus, orchestrate → main loop.** If you
cannot justify sonnet or opus for a dispatch, it is a haiku dispatch.
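As a rough illustration of the rule of thumb, the tier choice can be written as a small function. This is a sketch only: the `Tier` enum, the `pick_model` name, and the two boolean inputs are assumptions made for the example, and real dispatches will need more judgment than two flags.

```python
from enum import Enum


class Tier(Enum):
    READ = "haiku"     # read-only lookups that only report what they found
    REASON = "sonnet"  # judgment without production writes
    WRITE = "opus"     # production writes and load-bearing decisions


def pick_model(writes_production: bool, needs_judgment: bool) -> str:
    """Return the `model:` value for a dispatch (hypothetical helper).

    Falls through to haiku unless a higher tier is justified, mirroring
    "if you cannot justify sonnet or opus, it is a haiku dispatch".
    """
    if writes_production:
        return Tier.WRITE.value
    if needs_judgment:
        return Tier.REASON.value
    return Tier.READ.value
```

The orchestrate tier is deliberately absent: it is the main loop itself, not something a dispatch can select.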

## Why it is load-bearing

- **Cost is bounded.** Routing read-only work to haiku instead of the main-loop model is the
difference between a loop that is cheap to run overnight and one that is not.
- **Dispatches are auditable.** An explicit tier per dispatch is a receipt line — append one
receipt per dispatch to `.loop/receipts/*.jsonl` (schema: `schemas/receipt.schema.json`), so
cost and routing are reconstructable after the run.
- **Omission is a broken call.** Treat a missing `model:` like a missing `prompt:` — fix it
before dispatching. On the Workflow tool the leak is the same: a model-less `agent()` inherits
the main-loop model just as an `Agent` call does.
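The receipt idea above might look like this in practice. It is a minimal sketch that appends one JSON line per dispatch; the field names (`ts`, `tier`, `model`, `prompt_summary`) and the `run_id` file naming are illustrative assumptions, since the real shape is defined by `schemas/receipt.schema.json`, which is not reproduced here.

```python
import json
import time
from pathlib import Path


def append_receipt(receipts_dir, tier, model, prompt_summary, run_id="run"):
    """Append one receipt line to <receipts_dir>/<run_id>.jsonl.

    Hypothetical helper with illustrative field names. A missing model is
    treated as a broken call and rejected before anything is written.
    """
    if not model:
        raise ValueError("dispatch has no explicit model: fix it before dispatching")
    path = Path(receipts_dir) / f"{run_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": time.time(),
        "tier": tier,
        "model": model,
        "prompt_summary": prompt_summary,
    }
    # Append-only: earlier receipts are never rewritten.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

A call such as `append_receipt(".loop/receipts", "read", "haiku", "status scan")` would add one line to `.loop/receipts/run.jsonl`, which is enough to reconstruct routing after the run.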

## Optional enforcement (the author's stack — not required)

The rule holds as policy text in `WORKFLOW.md` on any platform, even one that cannot enforce it
at runtime. Where you already run them, these harden it — none are required to run a loop:

- **PreToolUse hooks** (`model_routing.py` for the `Agent` tool, `workflow_routing.py` for
Workflow `agent()` calls) block a model-less or over-tier dispatch before it fires.
- **`/routing` modes** (`normal` / `conserve` / `burn`) modulate the ceilings; explicitness is
never waived in any mode.
- **The `[escalation]` valve** — after a *verified* dispatch failure, re-dispatch the same prompt
at +1 tier once, with the literal `[escalation]` marker in the prompt. Never on a first attempt.
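To make the checks in the list above concrete, here is a hedged sketch of a pre-dispatch check and the escalation valve as plain functions. It is not the author's `model_routing.py` or `workflow_routing.py`, and it does not model the real hook input format or the `/routing` mode ceilings; `check_dispatch`, `escalate`, the dict shape of a call, and prefixing the marker to the prompt are all assumptions made for illustration.

```python
TIER_ORDER = ["haiku", "sonnet", "opus"]  # ascending tier, per the tier table


def check_dispatch(call: dict, ceiling: str = "opus") -> tuple:
    """Return (allowed, reason) for a dispatch described as a plain dict.

    Hypothetical stand-in for a pre-dispatch check: blocks a model-less
    dispatch and one whose model sits above the given ceiling.
    """
    model = call.get("model")
    if not model:
        return False, "missing model: (omission inherits the main-loop model)"
    if model not in TIER_ORDER:
        return False, f"unknown model {model!r}"
    if TIER_ORDER.index(model) > TIER_ORDER.index(ceiling):
        return False, f"{model} is above the {ceiling} ceiling"
    return True, "ok"


def escalate(call: dict, verified_failure: bool, already_escalated: bool = False) -> dict:
    """Re-dispatch the same prompt at +1 tier, once, only after a verified failure."""
    if not verified_failure or already_escalated:
        raise ValueError("escalation needs a verified failure and may happen only once")
    idx = TIER_ORDER.index(call["model"])
    if idx + 1 >= len(TIER_ORDER):
        raise ValueError("already at the top tier")
    # Assumption: the literal marker is placed at the start of the prompt.
    return {**call, "model": TIER_ORDER[idx + 1], "prompt": "[escalation] " + call["prompt"]}
```

Keeping both checks as pure functions means the same logic can be applied by hand on a platform with no hook support, which matches the policy-text fallback described in this section.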

Without any of this, keep the rule as a line in the loop's `WORKFLOW.md` and name `model:` by
hand on every dispatch. That is enough — the enforcement tooling only automates the same rule.
11 changes: 7 additions & 4 deletions skills/loop-architect/SKILL.md
@@ -5,6 +5,8 @@ description: "Classify an agent-loop task and choose its architecture + realizat

# loop-architect

> **Base directory.** `reference/…` paths below are **plugin-root-relative** — resolve them against the plugin root (`${CLAUDE_PLUGIN_ROOT}/reference/…`, i.e. `../../reference/…` from this `skills/loop-architect/` folder), where the shared docs ship, not inside this skill's own folder. (The `scripts/verify-*` gate named below is the operated loop's own workspace script.)

The **brain** of the [[loop-engineer]] suite. It does **not** do the end task — it
turns an underspecified objective into a decision: *what shape should this loop be,
and what Claude-Code primitive physically runs it?* The output is an **architecture
@@ -14,7 +16,8 @@ it decides, it does not scaffold or run.
Two separable choices (see `reference/architecture-matrix.md`):
- **(A) Architecture** — how many agents, how much orchestration.
- **(B) Realization** — which Claude-Code primitive (Workflow tool / markdown
-supervisor / portable Python FSM spine / delegate to an acceptance gate such as `/verify-slice`).
+supervisor / portable Python FSM spine / delegate to an acceptance gate — the
+contract's `verify-fast`→`verify-full` by default, optionally `/verify-slice`).

## Prime directive

@@ -108,9 +111,9 @@ to reach an explicit terminal state; never a silent "completed."

Any agent dispatch you suggest in the ADR names an explicit `model:` (read → `haiku`,
reason → `sonnet`, write → `opus`) per the model-routing rule — a Workflow
-`agent({ model: "sonnet", … })` fan-out, a write agent on `opus`. Omitting
-`model:` inherits the costly main-loop model; the author blocks that with a PreToolUse hook
-(`workflow_routing.py`), but the rule holds on any surface.
+`agent({ model: "sonnet", … })` fan-out, a write agent on `opus`. Never omit it: omitting
+`model:` inherits the costly main-loop model. The full tier table + rationale (and the
+optional `workflow_routing.py` enforcement) are in `reference/model-routing.md`.

## Hand-off

7 changes: 7 additions & 0 deletions skills/loop-contract/SKILL.md
@@ -5,6 +5,8 @@ description: "Scaffold the repo-OS operating contract for an agent loop — SPEC

# loop-contract — scaffold the repo-OS operating contract

> **Base directory.** `reference/…` and `templates/…` paths below are **plugin-root-relative** — resolve them against the plugin root (`${CLAUDE_PLUGIN_ROOT}/…`, i.e. `../../` from this `skills/loop-contract/` folder), where the shared docs and scaffold templates ship. The `scripts/verify-*` you scaffold land in the *new loop's* own workspace, not this plugin.

Turn an architecture decision into the **on-disk operating contract** an agent loop reads its
truth from every turn. State lives in files, not chat context, so the loop survives compaction,
a crashed session, and even a different engine. This is the externalized-state ("code as agent
@@ -46,6 +48,11 @@ outputs / permissions / approval_gates / terminal_states) and an **iteration-0 R
recording the pre-execution reflection. Each artifact has exactly one owner concern — no file
carries two jobs (rationale: `reference/repo-os-contract.md` §9).

The two deterministic gates `verify-fast` and `verify-full` scaffold as runnable stubs; the
three deeper proof-surface scripts (`verify-safety`, `judge-rubric`, `extract-trace-metrics`)
ship in `templates/` as stubs you copy into `scripts/` and wire as the SPEC criteria earn them —
`[[loop-evals]]` owns that proof logic, not this spoke.

## How to fill each template

Map the ADR + goal onto the templates in `templates/` (the `{{PLACEHOLDER}}` tokens are the fill
5 changes: 4 additions & 1 deletion skills/loop-engineer/SKILL.md
@@ -5,6 +5,8 @@ description: "Router for designing, launching, verifying, repairing, and improvi

# loop-engineer

> **Base directory.** `reference/…` paths below are **plugin-root-relative** — resolve them against the plugin root (`${CLAUDE_PLUGIN_ROOT}/reference/…`, i.e. `../../reference/…` from this `skills/loop-engineer/` folder), where the shared docs ship, not inside this skill's own folder.

**The loop is the design object — not the prompt.** A loop-engineer designs, launches, verifies, repairs, and improves *other* agent loops; it does not primarily solve the end task. Its first job is to turn an underspecified objective into an **executable operating contract** — success criteria, task queue, tool boundaries, evaluation methods, stopping rules, approval gates, and persistent artifacts that survive across turns and sessions.

**Prime directive.** If you cannot define success, verification, or a terminal state, the task is **underspecified** (`FailedSpecGap`) — say so, do not call the next completion "done." This is the central defense against the #1 long-horizon failure mode: false completion / weak self-verification / verifier gaming.
@@ -58,7 +60,7 @@ The bundled portable core runs every loop with no external setup: `python3 -m lo
- **`/verify-slice` and `/verify-milestone`** (claude-code-orchestration, *optional*) — auto-repair + cross-review layered on the contract's `verify-*` gate. `loop-evals` *designs* the criteria; `loop-run` *calls* the gate. No new verification engine is shipped here.
- **A portable Python FSM spine** (*optional*) — the init/next/complete + `state.json`-resume pattern for max-determinism / cross-engine resume; ~100 lines, or reuse the author's `harmony-agent` `engine/cli.py`. v1 ships no spine code.
- **The grader-split pattern** (as in the `launch-local-agent` skill) — an objective blocking gate in front of a judged advisory rubric; the model for keeping deterministic checks ahead of any model verdict.
-- **The model-routing rule** — every dispatched agent names an explicit `model:` (read→haiku, reason→sonnet, write→opus) so cost is bounded and dispatches are auditable; receipts append to `.loop/receipts/*.jsonl`. *Optional:* the author enforces this with PreToolUse hooks (`model_routing.py` / `workflow_routing.py`) and `/routing` modes.
+- **The model-routing rule** — every dispatched agent names an explicit `model:` (read→haiku, reason→sonnet, write→opus) so cost is bounded and dispatches are auditable; receipts append to `.loop/receipts/*.jsonl`. Canonical tier table + rationale: `reference/model-routing.md`. *Optional:* the author enforces this with PreToolUse hooks (`model_routing.py` / `workflow_routing.py`) and `/routing` modes.
- **superpowers** (*optional*) — `writing-plans`, `executing-plans`, `subagent-driven-development`, `verification-before-completion`, `test-driven-development` compose the markdown-supervisor realization. Any on-disk planning dir works as the planning surface (the author uses GSD `.gsd/`).
- **ui/orchestration surfaces** — when a loop's actual work is UI/UX or general orchestration (not loop engineering), defer to the appropriate `ui-ux`/`orchestration` surface; this suite builds and runs the loop, it does not do that domain work.

@@ -75,5 +77,6 @@ This router stays deliberately thin; every detail lives one hop away in `referen
- `reference/eval-suite.md` — [[loop-evals]] and [[loop-flywheel]]; the 7 layers, the two first-class metrics, the flywheel schedule.
- `reference/safety-and-approvals.md` — [[loop-run]] and [[loop-repair]]; escalation ladder, approval lifecycle, terminal states, anti-cheat.
- `reference/platform-map.md` — [[loop-architect]]; the engine-neutral contract mapped onto Claude / Codex / Hermes / Google.
- `reference/model-routing.md` — every spoke; the canonical read→haiku / reason→sonnet / write→opus tier table + rationale + optional enforcement.

If a question is about *how* a step works rather than *which* step is next, you have left the router — open the reference above and read it there.
4 changes: 3 additions & 1 deletion skills/loop-evals/SKILL.md
@@ -5,6 +5,8 @@ description: "Design the evaluation harness for an agent loop — the proof laye

# loop-evals — design the harness that proves the loop

> **Base directory.** `reference/…` and this plugin's bundled tools (`scripts/holdout_gate.py`, `scripts/anticheat_scan.py`) are **plugin-root-relative** — resolve them against the plugin root (`${CLAUDE_PLUGIN_ROOT}/…`, i.e. `../../` from this `skills/loop-evals/` folder). The `scripts/verify-*` gate and the `EVALS/…` tree live inside the *graded loop's* own repo, not this plugin.

A loop without measurement is a loop that *claims* success. This skill designs the evaluation harness for a loop — what to check, in what order, with which metric — so a "Succeeded" terminal state is backed by evidence, not narration. It is the **designer** of the suite; `[[loop-run]]` is the caller that runs the gate each iteration, and `[[loop-flywheel]]` feeds real failures back into it.

**In → out.** In: the loop's `SPEC.md` (success criteria, constraints, evidence rules) + its artifacts. Out: `scripts/verify-*` skeletons, an `EVALS/{dataset,rubrics,regressions,traces}/` tree, and the metric definitions — all committed inside the loop's own repo. This skill is read-only/advisory toward the loop it grades; it authors the harness, it does not run the task.
@@ -89,7 +91,7 @@ for (const c of regressionCases) {
}
}
```
-(`read → haiku`, `reason → sonnet`, `write → opus`; receipts append to `.loop/receipts/`.)
+(`read → haiku`, `reason → sonnet`, `write → opus` per the model-routing rule — tier table + rationale in `reference/model-routing.md`; receipts append to `.loop/receipts/`.)

## Standing the suite up

4 changes: 3 additions & 1 deletion skills/loop-flywheel/SKILL.md
@@ -5,6 +5,8 @@ description: "Turn a loop's own run history into compounding improvement — min

# loop-flywheel — the loop that improves the loop

> **Base directory.** `reference/…` paths below are **plugin-root-relative** — resolve them against the plugin root (`${CLAUDE_PLUGIN_ROOT}/reference/…`, i.e. `../../reference/…` from this `skills/loop-flywheel/` folder), where the shared docs ship. `EVALS/…` and `.loop/…` are inside the *improved loop's* own repo, not this plugin.

A loop that only runs gets no better. `loop-flywheel` is the **improvement engine**: it reads what a loop has already done (its `RUNLOG.md`, `EVALS/traces/`, and `.loop/receipts/*.jsonl`) and turns that history into three durable outputs — **new eval cases**, **harness-change proposals**, and **compacted memory**. It is the reflect→see step of the self-learning flywheel applied to an agent loop itself.

It owns no gate. The deterministic and rubric layers live in [[loop-evals]] and `reference/eval-suite.md`; this skill *feeds* that suite (mines failures into it) and *watches* its two first-class metrics over time. Read [[loop-evals]] first if you are standing the suite up; come here once a loop has run ≥2 iterations and you want it to compound.
@@ -44,7 +46,7 @@ await agent({ model: "opus", // write: turn confirmed failures into committe
prompt: `From the confirmed failures in this proposal, write one EVALS/regressions/<case>.json per distinct real failure (input + expected deterministic verdict). Commit only failures that actually occurred; leave harness-change proposals for human review.` });
```

-(`read → haiku`, `reason → sonnet`, `write → opus`; receipts append to `.loop/receipts/`. The haiku+sonnet pass mines and *proposes*; only the opus pass writes the committed regression cases — and even then never reimplements the verify engine: the contract's `scripts/verify-*` gate, optionally `/verify-slice`, is the source of truth.)
+(`read → haiku`, `reason → sonnet`, `write → opus` per the model-routing rule — canonical table in `reference/model-routing.md`; receipts append to `.loop/receipts/`. The haiku+sonnet pass mines and *proposes*; only the opus pass writes the committed regression cases — and even then never reimplements the verify engine: the contract's `scripts/verify-*` gate, optionally `/verify-slice`, is the source of truth.)

## Memory compaction: two stores, never mixed

2 changes: 2 additions & 0 deletions skills/loop-inspector/SKILL.md
@@ -5,6 +5,8 @@ description: "Inspect an existing agent loop and emit a scored gap report — th

# loop-inspector — the quality layer above the ecosystem

> **Base directory.** This skill's own `reference/patterns.md` sits in this folder. `reference/repo-os-contract.md` and the bundled `scripts/inspect_loop.py` are **plugin-root-relative** (`${CLAUDE_PLUGIN_ROOT}/…`, i.e. `../../` from this `skills/loop-inspector/` folder). The `scripts/verify-*` / `holdout_gate.py` / `anticheat_scan.py` names are *signals it looks for in the inspected loop*, not files shipped by this plugin.

Most of the [[loop-engineer]] suite *builds* a loop. `loop-inspector` **judges one
that already exists** — yours or someone else's. Point it at a loop directory (a
`.loop/` repo-OS contract, a superpowers or ruflo harness, any agent-loop dir) and it
5 changes: 5 additions & 0 deletions skills/loop-inspector/reference/patterns.md
@@ -1,5 +1,10 @@
# loop-inspector — inspection checklist, scoring rubric, and foreign-harness reading

> **Base directory.** The bundled `scripts/inspect_loop.py` and `reference/repo-os-contract.md`
> named below are **plugin-root-relative** (`${CLAUDE_PLUGIN_ROOT}/…`, i.e. `../../../` from this
> `skills/loop-inspector/reference/` folder). The `scripts/verify-*` / `holdout_gate.py` /
> `anticheat_scan.py` names are *signals read from the inspected loop*, not files in this plugin.

This is the depth behind [[loop-inspector]]: the exact checklist, how the score is
computed, and how to read a loop that does **not** use this suite's filenames. The
runnable core is `scripts/inspect_loop.py`; this file is the rubric it encodes.
4 changes: 3 additions & 1 deletion skills/loop-repair/SKILL.md
@@ -5,6 +5,8 @@ description: "Patch-and-repair loop for a failing agent run — use when a loop

# loop-repair

> **Base directory.** `reference/…` paths below are **plugin-root-relative** — resolve them against the plugin root (`${CLAUDE_PLUGIN_ROOT}/reference/…`, i.e. `../../reference/…` from this `skills/loop-repair/` folder), where the shared docs ship. The `scripts/verify-*` gate and `.loop/…` are inside the *operated loop's* workspace, not this plugin.

The repair lane. When verification disagrees with the work, **this skill is what reacts — bounded, recorded, and capped.** It does not own running the loop ([[loop-run]] does) or defining the checks ([[loop-evals]] does); it owns the disciplined response to a *failing* check so the loop converges instead of thrashing. Every rule here is downstream of the escalation ladder, the repair cap, and the verifier-gaming guard in `reference/safety-and-approvals.md` — read that for the full safety model; this is the operating procedure.

**When this fires:** a deterministic check (test / lint / typecheck / schema) failed, the rubric judge fell below threshold, or [[loop-run]] reached the `repair` state. Inputs: the failing `verification_bundle`, the best prior `.loop/state.json`, and the diff since that best state. Output: a structured **repair record** (below) + an updated state, then control back to [[loop-run]] to re-verify.
@@ -83,7 +85,7 @@ The cap is `repair.max_attempts`, **default N=2**, configurable in `WORKFLOW.md`

## Dispatching a repair to a subagent

-For a bounded, isolated fix, [[loop-run]] may dispatch the repair to a write-tier agent. Per the model-routing rule, the dispatch **must** name an explicit `model:` — repairs write production code, so they route to `opus`:
+For a bounded, isolated fix, [[loop-run]] may dispatch the repair to a write-tier agent. Per the model-routing rule (tier table: `reference/model-routing.md`), the dispatch **must** name an explicit `model:` — repairs write production code, so they route to `opus`:

```
Agent(