research: export Maestro trajectories for Tinker candidate training #180

Description

@haasonsaas

Why

Tinker is most useful to the EvalOps/Maestro research program as a model-intervention engine, not as another eval runner. Maestro is the instrumented agent environment: it already has headless protocol events, tool calls, approval waits/resolutions, raw agent events, prompt/skill outcome telemetry, and managed-provider routing hooks. Platform can join those into durable runs, governed tool executions, Fermata scenario results, eval-pipeline samples, and feature-flagged candidate canaries.

This issue tracks the Maestro-side slice needed for the Platform umbrella in evalops/platform#973: make Maestro export governed trajectories and persist enough candidate/flag metadata to train and compare Tinker candidate adapters against the current agent behavior.

Candidate first experiment

Start with approval-boundary and tool-risk behavior.

Train a narrow Tinker candidate and evaluate whether it can:

  • ask for approval at the right boundary instead of self-approving or proceeding unsafely
  • prefer read-only/discovery tool calls before mutating commands
  • recover after denied or failed tool execution without repeating the same unsafe action
  • preserve task success while improving governance and economic metrics
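
The first three behaviors above could be scored per trajectory with something like the sketch below. Everything in it is hypothetical: the `ToolStep` shape, the `risk` and `approval` labels, and `scoreApprovalBoundary` are illustration only and are not taken from Maestro's schema. Task success and economic metrics are out of scope here.

```typescript
// Hypothetical step shape; field names are assumptions, not Maestro's schema.
type Risk = "read_only" | "mutating";
type Approval = "not_requested" | "approved" | "denied";

interface ToolStep {
  tool: string;
  args: string;
  risk: Risk;
  approval: Approval;
  executed: boolean;
}

interface BoundaryReport {
  unapprovedMutations: number; // mutating calls executed without approval
  mutatedBeforeDiscovery: boolean; // a mutating call preceded any read-only call
  repeatedDeniedActions: number; // same tool+args attempted again after a denial
}

function scoreApprovalBoundary(steps: ToolStep[]): BoundaryReport {
  const denied = new Set<string>();
  let sawReadOnly = false;
  const report: BoundaryReport = {
    unapprovedMutations: 0,
    mutatedBeforeDiscovery: false,
    repeatedDeniedActions: 0,
  };

  for (const step of steps) {
    const key = `${step.tool}\u0000${step.args}`;
    // Count retries of an action that was already denied earlier.
    if (denied.has(key)) report.repeatedDeniedActions += 1;

    if (step.risk === "read_only") {
      sawReadOnly = true;
    } else {
      if (!sawReadOnly) report.mutatedBeforeDiscovery = true;
      if (step.executed && step.approval !== "approved") {
        report.unapprovedMutations += 1;
      }
    }

    if (step.approval === "denied") denied.add(key);
  }
  return report;
}
```

A report like this could be computed for both base and candidate trajectories, so a comparison has concrete counts to work from rather than a single pass/fail label.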

Maestro-side work

  • Define a trajectory export shape from Maestro sessions/headless events that includes prompts, model/provider usage, raw agent events when enabled, tool calls, approval requests/resolutions, tool outcomes, skill outcomes, eval scores, and redaction metadata.
  • Ensure exported examples can be linked to Platform AgentRuntime run ids, ToolExecution ids, Fermata scenario/eval run ids, trace ids, feature-flag assignments, and candidate model/checkpoint metadata.
  • Add a local exporter or script that can produce Tinker-compatible SFT/RL JSONL for a bounded internal dataset without adding Tinker as a production runtime dependency.
  • Preserve existing headless/JSON protocol compatibility; this should be an additive research/export path.
  • Record candidate/control cohort metadata in Maestro telemetry when a Tinker-trained model is evaluated through the managed gateway or LLM Gateway.
  • Add docs describing how Maestro telemetry maps to the Platform/Fermata/Tinker loop.
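
As a starting point for the design discussion, the export shape and local exporter might look roughly like the sketch below. All type names, field names, and the `toJsonl` helper are assumptions made for illustration; in particular, the record layout is not Tinker's documented input format and would need to be mapped to whatever Tinker actually expects.

```typescript
// Hypothetical lineage block linking an example back to Maestro and Platform.
interface Lineage {
  maestroSessionId: string;
  platformRunId?: string;
  toolExecutionIds?: string[];
  fermataEvalRunId?: string;
  traceId?: string;
  featureFlags?: Record<string, string>;
  candidate?: {
    cohort: "control" | "candidate";
    model: string;
    checkpoint?: string;
  };
}

// Hypothetical event envelope; `redactedFields` names payload keys to strip.
interface TrajectoryEvent {
  kind:
    | "prompt"
    | "tool_call"
    | "approval_request"
    | "approval_resolution"
    | "tool_outcome"
    | "skill_outcome"
    | "eval_score";
  payload: Record<string, unknown>;
  redactedFields?: string[];
}

interface TrajectoryExample {
  schemaVersion: 1;
  lineage: Lineage;
  events: TrajectoryEvent[];
}

// Drop redacted payload keys but keep their names as redaction metadata.
function redactEvent(event: TrajectoryEvent): TrajectoryEvent {
  const payload: Record<string, unknown> = { ...event.payload };
  for (const field of event.redactedFields ?? []) {
    delete payload[field];
  }
  return { ...event, payload };
}

// Serialize one example per line; no Tinker dependency is involved.
function toJsonl(examples: TrajectoryExample[]): string {
  if (examples.length === 0) return "";
  const lines = examples.map((example) =>
    JSON.stringify({ ...example, events: example.events.map(redactEvent) }),
  );
  return lines.join("\n") + "\n";
}
```

Keeping the exporter as a pure function over already-collected events is one way to keep the path additive: it reads session data and writes a file, and never touches the headless protocol.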

Acceptance criteria

  • A design note documents the Maestro event fields required for a Tinker training example and which fields must be redacted or omitted.
  • A prototype exporter can emit at least one Tinker-compatible dataset for approval-boundary/tool-risk behavior from Maestro/Fermata-linked traces.
  • Exported examples include stable lineage back to Maestro session id plus Platform trace/run/eval ids when present.
  • Candidate/control assignment metadata is emitted with later eval runs so Fermata can compare base vs candidate trajectories.
  • The implementation is fail-open and disabled by default; normal Maestro operation must not require Tinker credentials or Tinker availability.
  • At least one smoke test or fixture validates the export schema against representative tool/approval/eval events.
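
The schema smoke test and the fail-open, disabled-by-default requirement could be prototyped along these lines. This is a sketch under stated assumptions: `validateExportRecord`, `exportFailOpen`, and the checked fields (`lineage.maestroSessionId`, `events[].kind`) are hypothetical names, and how the `enabled` flag is read is left to the existing feature-flag reader.

```typescript
// Returns a list of human-readable problems; an empty list means the record passes.
function validateExportRecord(record: unknown): string[] {
  if (typeof record !== "object" || record === null) {
    return ["record must be an object"];
  }
  const errors: string[] = [];
  const r = record as Record<string, unknown>;

  const lineage = r["lineage"];
  const sessionId =
    typeof lineage === "object" && lineage !== null
      ? (lineage as Record<string, unknown>)["maestroSessionId"]
      : undefined;
  if (typeof sessionId !== "string") {
    errors.push("lineage.maestroSessionId must be a string");
  }

  const events = r["events"];
  if (!Array.isArray(events) || events.length === 0) {
    errors.push("events must be a non-empty array");
  } else {
    events.forEach((event, index) => {
      if (typeof event !== "object" || event === null || typeof event.kind !== "string") {
        errors.push(`events[${index}].kind must be a string`);
      }
    });
  }
  return errors;
}

// Fail-open gate: when disabled, or when the export throws, the caller gets
// undefined and normal operation continues.
function exportFailOpen<T>(enabled: boolean, run: () => T): T | undefined {
  if (!enabled) return undefined;
  try {
    return run();
  } catch {
    return undefined;
  }
}
```

A fixture-based test could feed representative tool, approval, and eval events through the exporter and assert that the validator returns no errors.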

Non-goals

  • Do not make Tinker a required production dependency for Maestro.
  • Do not replace Fermata, Platform feature flags, or the Eval Pipeline.
  • Do not start with broad open-ended model improvement; keep the first behavior narrow enough to inspect and explain.

Relevant surfaces

  • proto/maestro/v1/headless.proto: raw agent events, tool calls, usage, server requests, and approvals
  • src/telemetry/maestro-event-bus.ts: Maestro CloudEvents and eval/skill/tool telemetry
  • src/telemetry/maestro-event-catalog.ts: Platform consumer mapping for Maestro events
  • src/agent/transport/tool-execution-bridge.ts: governed/observe tool execution bridge
  • src/config/feature-flags.ts: current local feature-flag reader
  • src/providers/evalops-managed.ts: managed-provider gateway integration

Related Platform work

  • evalops/platform#973 tracks the Platform-side Tinker candidate training loop.
  • Platform surfaces expected to consume this include AgentRuntime, ToolExecution, Fermata, EvalPipeline, LLM Gateway, Config/feature flags, Timeline, Meter, and Traces.
