research: export Maestro trajectories for Tinker candidate training

## Why

Tinker is most useful to the EvalOps/Maestro research program as a model-intervention engine, not as another eval runner. Maestro is the instrumented agent environment: it already has headless protocol events, tool calls, approval waits/resolutions, raw agent events, prompt/skill outcome telemetry, and managed-provider routing hooks. Platform can join those into durable runs, governed tool executions, Fermata scenario results, eval-pipeline samples, and feature-flagged candidate canaries.

This issue tracks the Maestro-side slice needed for the Platform umbrella in evalops/platform#973: make Maestro export governed trajectories and persist enough candidate/flag metadata to train and compare Tinker candidate adapters against the current agent behavior.

## Candidate first experiment

Start with approval-boundary and tool-risk behavior.

Train/evaluate whether a narrow Tinker candidate can:

- ask for approval at the right boundary instead of self-approving or proceeding unsafely
- prefer read-only/discovery tool calls before mutating commands
- recover after denied or failed tool execution without repeating the same unsafe action
- preserve task success while improving governance and economic metrics

## Maestro-side work

- Define a trajectory export shape from Maestro sessions/headless events that includes prompts, model/provider usage, raw agent events when enabled, tool calls, approval requests/resolutions, tool outcomes, skill outcomes, eval scores, and redaction metadata.
- Ensure exported examples can be linked to Platform AgentRuntime run ids, ToolExecution ids, Fermata scenario/eval run ids, trace ids, feature-flag assignments, and candidate model/checkpoint metadata.
- Add a local exporter or script that can produce Tinker-compatible SFT/RL JSONL for a bounded internal dataset without adding Tinker as a production runtime dependency.
- Preserve existing headless/JSON protocol compatibility; this should be an additive research/export path.
- Record candidate/control cohort metadata in Maestro telemetry when a Tinker-trained model is evaluated through the managed gateway or LLM Gateway.
- Add docs describing how Maestro telemetry maps to the Platform/Fermata/Tinker loop.

## Acceptance criteria

- A design note documents the Maestro event fields required for a Tinker training example and which fields must be redacted or omitted.
- A prototype exporter can emit at least one Tinker-compatible dataset for approval-boundary/tool-risk behavior from Maestro/Fermata-linked traces.
- Exported examples include stable lineage back to Maestro session id plus Platform trace/run/eval ids when present.
- Candidate/control assignment metadata is emitted with later eval runs so Fermata can compare base vs candidate trajectories.
- The implementation is fail-open and disabled by default; normal Maestro operation must not require Tinker credentials or Tinker availability.
- At least one smoke test or fixture validates the export schema against representative tool/approval/eval events.

## Non-goals

- Do not make Tinker a required production dependency for Maestro.
- Do not replace Fermata, Platform feature flags, or the Eval Pipeline.
- Do not start with broad open-ended model improvement; keep the first behavior narrow enough to inspect and explain.

## Relevant surfaces

- `proto/maestro/v1/headless.proto` raw agent events, tool calls, usage, server requests, and approvals
- `src/telemetry/maestro-event-bus.ts` Maestro CloudEvents and eval/skill/tool telemetry
- `src/telemetry/maestro-event-catalog.ts` Platform consumer mapping for Maestro events
- `src/agent/transport/tool-execution-bridge.ts` governed/observe tool execution bridge
- `src/config/feature-flags.ts` current local feature-flag reader
- `src/providers/evalops-managed.ts` managed-provider gateway integration

## Related Platform work

- evalops/platform#973 tracks the Platform-side Tinker candidate training loop.
- Platform surfaces expected to consume this include AgentRuntime, ToolExecution, Fermata, EvalPipeline, LLM Gateway, Config/feature flags, Timeline, Meter, and Traces.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

research: export Maestro trajectories for Tinker candidate training #180

Why

Candidate first experiment

Maestro-side work

Acceptance criteria

Non-goals

Relevant surfaces

Related Platform work

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

research: export Maestro trajectories for Tinker candidate training #180

Description

Why

Candidate first experiment

Maestro-side work

Acceptance criteria

Non-goals

Relevant surfaces

Related Platform work

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions