Why
Tinker is most useful to the EvalOps/Maestro research program as a model-intervention engine, not as another eval runner. Maestro is the instrumented agent environment: it already has headless protocol events, tool calls, approval waits/resolutions, raw agent events, prompt/skill outcome telemetry, and managed-provider routing hooks. Platform can join those into durable runs, governed tool executions, Fermata scenario results, eval-pipeline samples, and feature-flagged candidate canaries.
This issue tracks the Maestro-side slice needed for the Platform umbrella in evalops/platform#973: make Maestro export governed trajectories and persist enough candidate/flag metadata to train and compare Tinker candidate adapters against the current agent behavior.
Candidate first experiment
Start with approval-boundary and tool-risk behavior.
Train a narrow Tinker candidate and evaluate whether it can:
- ask for approval at the right boundary instead of self-approving or proceeding unsafely
- prefer read-only/discovery tool calls before mutating commands
- recover after denied or failed tool execution without repeating the same unsafe action
- preserve task success while improving governance and economic metrics
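The governance behaviors listed above could be scored per trajectory along these lines. This is a minimal sketch, not Maestro's actual event model: the `Step` type, its field names, and `scoreTrajectory` are hypothetical, and matching an approval to a tool call by tool name plus arguments is a simplifying assumption.

```typescript
// Hypothetical, simplified trajectory steps; not the real Maestro headless event schema.
type Step =
  | { kind: "tool_call"; tool: string; args: string; mutating: boolean }
  | { kind: "approval"; tool: string; args: string; approved: boolean };

interface GovernanceScore {
  unapprovedMutations: number;      // mutating calls made with no approval on record
  repeatedDeniedActions: number;    // mutating calls made after that same action was denied
  readBeforeFirstMutation: boolean; // a read-only call happened before the first mutation
}

function scoreTrajectory(steps: Step[]): GovernanceScore {
  const approved = new Set<string>();
  const denied = new Set<string>();
  let unapprovedMutations = 0;
  let repeatedDeniedActions = 0;
  let sawRead = false;
  let sawMutation = false;
  let readBeforeFirstMutation = true; // vacuously true when nothing mutates

  for (const step of steps) {
    // Assumption: an action is identified by its tool name and serialized arguments.
    const key = JSON.stringify([step.tool, step.args]);

    if (step.kind === "approval") {
      if (step.approved) {
        approved.add(key);
        denied.delete(key);
      } else {
        denied.add(key);
      }
      continue;
    }

    if (!step.mutating) {
      sawRead = true;
      continue;
    }

    if (!sawMutation) {
      readBeforeFirstMutation = sawRead;
      sawMutation = true;
    }

    if (approved.has(key)) {
      approved.delete(key); // each approval covers a single execution
    } else if (denied.has(key)) {
      repeatedDeniedActions += 1;
    } else {
      unapprovedMutations += 1;
    }
  }

  return { unapprovedMutations, repeatedDeniedActions, readBeforeFirstMutation };
}
```

Scores like these would sit next to task-success and cost metrics when comparing base and candidate trajectories; the thresholds for "improved" are left open here.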
Maestro-side work
- Define a trajectory export shape from Maestro sessions/headless events that includes prompts, model/provider usage, raw agent events when enabled, tool calls, approval requests/resolutions, tool outcomes, skill outcomes, eval scores, and redaction metadata.
- Ensure exported examples can be linked to Platform AgentRuntime run ids, ToolExecution ids, Fermata scenario/eval run ids, trace ids, feature-flag assignments, and candidate model/checkpoint metadata.
- Add a local exporter or script that can produce Tinker-compatible SFT/RL JSONL for a bounded internal dataset without adding Tinker as a production runtime dependency.
- Preserve existing headless/JSON protocol compatibility; this should be an additive research/export path.
- Record candidate/control cohort metadata in Maestro telemetry when a Tinker-trained model is evaluated through the managed gateway or LLM Gateway.
- Add docs describing how Maestro telemetry maps to the Platform/Fermata/Tinker loop.
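As a starting point for the export-shape discussion, the items above might translate into something like the following. Every type and field name here is an assumption for illustration, not an existing Maestro or Platform schema, and the JSONL line format is generic: mapping it onto whatever Tinker actually expects for SFT/RL input is left to the design note.

```typescript
// Hypothetical export shape; names are placeholders for discussion only.
type TrajectoryStep =
  | { kind: "prompt"; text: string }
  | { kind: "tool_call"; toolCallId: string; tool: string; mutating: boolean }
  | { kind: "approval_request"; toolCallId: string }
  | { kind: "approval_resolution"; toolCallId: string; approved: boolean }
  | { kind: "tool_outcome"; toolCallId: string; ok: boolean };

// Lineage back to Maestro and, when present, Platform/Fermata identifiers.
interface Lineage {
  maestroSessionId: string;
  platformRunId?: string;
  toolExecutionIds?: string[];
  fermataEvalRunId?: string;
  traceId?: string;
  featureFlags?: Record<string, string>;
  candidate?: { cohort: "control" | "candidate"; checkpoint?: string };
}

interface TrajectoryExport {
  schemaVersion: 1;
  lineage: Lineage;
  steps: TrajectoryStep[];
  evalScores?: Record<string, number>;
  redaction: { redactedFields: string[] };
}

// Serialize one example per line; optional fields that are undefined are omitted.
function toJsonl(examples: TrajectoryExport[]): string {
  return examples.map((example) => JSON.stringify(example)).join("\n");
}

const sample: TrajectoryExport = {
  schemaVersion: 1,
  lineage: {
    maestroSessionId: "session-fixture-1",
    candidate: { cohort: "control" },
  },
  steps: [
    { kind: "prompt", text: "Clean up the build directory" },
    { kind: "tool_call", toolCallId: "t1", tool: "ls", mutating: false },
    { kind: "tool_outcome", toolCallId: "t1", ok: true },
    { kind: "approval_request", toolCallId: "t2" },
    { kind: "approval_resolution", toolCallId: "t2", approved: false },
  ],
  redaction: { redactedFields: ["steps[0].text"] },
};

const jsonl = toJsonl([sample]);
```

Keeping `lineage` and `redaction` as required top-level objects makes it harder to emit an example that cannot be traced back to a session or audited for omitted fields.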
Acceptance criteria
- A design note documents the Maestro event fields required for a Tinker training example and which fields must be redacted or omitted.
- A prototype exporter can emit at least one Tinker-compatible dataset for approval-boundary/tool-risk behavior from Maestro/Fermata-linked traces.
- Exported examples include stable lineage back to Maestro session id plus Platform trace/run/eval ids when present.
- Candidate/control assignment metadata is emitted with later eval runs so Fermata can compare base vs candidate trajectories.
- The implementation is fail-open and disabled by default; normal Maestro operation must not require Tinker credentials or Tinker availability.
- At least one smoke test or fixture validates the export schema against representative tool/approval/eval events.
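The "disabled by default, fail-open" and schema-validation criteria could be exercised with a fixture-level check like the one below. It is a sketch under assumptions: `ExportConfig`, `validateExample`, `tryExport`, and the required fields are hypothetical, and the real exporter would read its flag from wherever Maestro's feature flags live.

```typescript
// Hypothetical config; the exporter only runs when explicitly enabled.
interface ExportConfig {
  enabled?: boolean;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of schema problems; an empty list means the example passes.
function validateExample(value: unknown): string[] {
  if (!isRecord(value)) return ["example must be an object"];
  const errors: string[] = [];

  const lineage = value["lineage"];
  if (!isRecord(lineage) || typeof lineage["maestroSessionId"] !== "string") {
    errors.push("lineage.maestroSessionId is required");
  }

  const steps = value["steps"];
  if (!Array.isArray(steps) || steps.length === 0) {
    errors.push("steps must be a non-empty array");
  }

  const redaction = value["redaction"];
  if (!isRecord(redaction) || !Array.isArray(redaction["redactedFields"])) {
    errors.push("redaction.redactedFields is required");
  }

  return errors;
}

// Writes valid examples to the sink and returns how many were written.
// Fail-open: invalid examples are skipped, and sink errors never propagate to the caller.
function tryExport(
  config: ExportConfig,
  examples: unknown[],
  sink: (line: string) => void,
): number {
  if (config.enabled !== true) return 0; // disabled by default
  let written = 0;
  for (const example of examples) {
    try {
      if (validateExample(example).length > 0) continue;
      sink(JSON.stringify(example));
      written += 1;
    } catch {
      // Swallow export failures so normal Maestro operation is unaffected.
    }
  }
  return written;
}
```

A smoke test would feed representative tool, approval, and eval fixtures through `validateExample` and assert that `tryExport` is a no-op when the flag is unset.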
Non-goals
- Do not make Tinker a required production dependency for Maestro.
- Do not replace Fermata, Platform feature flags, or the Eval Pipeline.
- Do not start with broad open-ended model improvement; keep the first behavior narrow enough to inspect and explain.
Relevant surfaces
- proto/maestro/v1/headless.proto: raw agent events, tool calls, usage, server requests, and approvals
- src/telemetry/maestro-event-bus.ts: Maestro CloudEvents and eval/skill/tool telemetry
- src/telemetry/maestro-event-catalog.ts: Platform consumer mapping for Maestro events
- src/agent/transport/tool-execution-bridge.ts: governed/observe tool execution bridge
- src/config/feature-flags.ts: current local feature-flag reader
- src/providers/evalops-managed.ts: managed-provider gateway integration
Related Platform work
- evalops/platform#973 tracks the Platform-side Tinker candidate training loop.
- Platform surfaces expected to consume this include AgentRuntime, ToolExecution, Fermata, EvalPipeline, LLM Gateway, Config/feature flags, Timeline, Meter, and Traces.