Isolate rollout failures and preserve V1 diagnostics #1529
Approvability
Verdict: Needs human review
This PR restructures error handling and failure isolation in the rollout execution system. Changes to how cleanup failures are propagated, new group rollout cancellation logic, and removal of some persisted diagnostic data represent significant runtime behavior changes that warrant human review.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f719345079
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Reviewed commit: f2898e6df4
Reviewed commit: 27d6d2ab38
Reviewed commit: fb9ef97fcd
Reviewed commit: 5aa700eb60
eacd337 to
10aa666
mikasenghaas
left a comment
nice, defo onboard with storing more information in the error class (i think @rasdani has been wanting this for a long time)
Reviewed commit: 7a6fe4a785
Reviewed commit: cf21c213e6
Reviewed commit: 76b9e1be18
Reviewed commit: bc47a81f17
Reviewed commit: 4a82ef74d3
Reviewed commit: 93557feb6e
Reviewed commit: e3a43f7beb
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ec7da17.
Reviewed commit: ec7da17680
Reviewed commit: 7154b6547f
Reviewed commit: 0ce3b5d567

Summary
This PR makes rollout failures easier to inspect without changing the high-level fail-fast contract of the evaluator/trainer runtime.
The important behavior is:
- [...] ErrorData;
The goal is to make failed rollouts explain what actually failed first, while keeping saved artifacts simple and avoiding a runtime where one secondary cleanup problem hides the real rollout/scoring error.
Runtime Contract
This PR keeps fail-fast semantics for unexpected runtime failures. It does not introduce a continue_on_error mode.
- vf.Error paths still use the existing retry behavior.
- vf.Error values remain visible on state and serialize into rollout outputs as structured ErrorData.
- [...] sandbox_oom / sandbox_timeout on state, with detailed context emitted to logs.
What Changed
Primary Error Serialization
Rollout outputs now use a single primary failure field:
- error: the rollout's primary failure, serialized as ErrorData.
ErrorData contains:
- error: exception class name;
- message: exception message;
- error_chain_repr: detailed repr of the causal exception chain;
- error_chain_str: compact causal exception chain.
The output schema intentionally does not include secondary artifact_errors, cleanup_errors, or sandbox_failures fields. Those secondary diagnostics belong in logs.
Base Environment Cleanup And Group Handling
Environment._run_rollout_state() now records scoring end timing and attempts rubric.cleanup(state) even when rollout scoring raises.
Grouped rollout execution now creates explicit sibling tasks. If one sibling raises, pending siblings are cancelled and drained before the original group exception is re-raised.
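The cancel-and-drain pattern described above can be sketched with plain asyncio. This is a minimal illustration, not the PR's implementation: run_group and demo are hypothetical names, and the real code may structure task creation and draining differently.

```python
import asyncio


async def run_group(coros):
    """Run sibling coroutines; on the first failure, cancel and drain the rest.

    Hypothetical helper for illustration only.
    """
    # Create explicit tasks so each sibling can be cancelled individually.
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        # gather() re-raises the first exception raised by any sibling,
        # but leaves the other siblings running.
        return await asyncio.gather(*tasks)
    except BaseException:
        for task in tasks:
            if not task.done():
                task.cancel()
        # Drain: wait until every sibling has finished unwinding. Their
        # secondary exceptions are collected here and deliberately ignored.
        await asyncio.gather(*tasks, return_exceptions=True)
        # Re-raise the original group exception unchanged.
        raise


async def demo():
    events = []

    async def slow(i):
        try:
            await asyncio.sleep(10)
        except asyncio.CancelledError:
            events.append(f"cancelled-{i}")
            raise

    async def boom():
        await asyncio.sleep(0.01)
        raise ValueError("primary failure")

    try:
        await run_group([slow(0), boom(), slow(1)])
    except ValueError as exc:
        events.append(f"raised: {exc}")
    return events
```

Draining before re-raising means no sibling is still running (or still holding resources) by the time the caller sees the original exception.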
Group scoring now records scoring end timing for every state and attempts cleanup for every state whether scoring succeeds or fails.
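A rough sketch of the "record timing and attempt cleanup for every state" behavior follows. The function name, the scoring_end key, and the choice to log (rather than raise) cleanup failures are assumptions made for illustration; the PR text only says that cleanup is attempted for every state and that secondary failures go to logs.

```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)


async def score_group_with_cleanup(states, score_group, cleanup):
    """Score a group, then stamp end time and attempt cleanup for every state.

    Hypothetical helper; `score_group` and `cleanup` are caller-supplied
    coroutine functions standing in for the rubric's scoring and cleanup.
    """
    try:
        await score_group(states)
    finally:
        end = time.time()
        for state in states:
            state["scoring_end"] = end  # hypothetical key name
            try:
                await cleanup(state)
            except Exception:
                # Secondary failure: log it so it cannot mask a scoring error
                # and cannot stop cleanup of the remaining states.
                logger.exception("cleanup failed for one state")
```

Because the cleanup loop lives in a finally block and handles its own exceptions, a scoring error still propagates to the caller as the primary failure.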
Impact:
V1 Runtime And Harness Handling
V1 grouped rollouts follow the same cancellation/drain pattern as the base environment.
V1 keeps rollout errors as live vf.Error instances through group scoring and group cleanup, then serializes them before returning output. This lets group reward/metric code inspect the actual error type instead of only seeing serialized dictionaries.
Harness.run() exposes the current state on exceptions that happen after state replacement, so V1 group recovery and cleanup operate on the same state object that the rollout was using when it failed.
Artifact collection behavior is now:
- vf.Error during artifact collection becomes the primary rollout error when there is no earlier rollout error;
- [...] vf.Error artifact failures without a prior rollout error remain fatal framework/runtime bugs.
Cleanup behavior is now:
- [...] state["error"] do not replace the recorded rollout error;
Sandbox Failure Visibility
Sandbox OOM and timeout detection still sets the state-level flags that downstream consumers already use:
- sandbox_oom
- sandbox_timeout
Detailed sandbox failure context is emitted to logs instead of being written into rollout artifacts as sandbox_failures.
Error Reconstruction
Reconstructed vf.Error values from serialized ErrorData now use ErrorData.message as the exception text and prefer the explicit top-level ErrorData.error type before scanning the error chain.
This keeps the runtime's live-error behavior and serialized-output behavior aligned, especially for V1 group scoring/cleanup paths that need to inspect error types.
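To make the round trip concrete, here is a hedged sketch of serializing an exception into the four ErrorData fields named in this description and rebuilding it with the stated preference order. Only the field names come from the PR; the Error and SandboxError stand-in classes, the KNOWN_ERRORS registry, the " <- " chain separator, and both function names are assumptions for illustration and are not the library's actual API.

```python
from typing import TypedDict


class ErrorData(TypedDict):
    error: str             # exception class name
    message: str           # exception message
    error_chain_repr: str  # detailed repr of the causal chain
    error_chain_str: str   # compact causal chain


class Error(Exception):
    """Stand-in for vf.Error (hypothetical)."""


class SandboxError(Error):
    """Stand-in subclass (hypothetical)."""


# Hypothetical registry mapping class names back to error types.
KNOWN_ERRORS = {cls.__name__: cls for cls in (Error, SandboxError)}
SEP = " <- "  # assumed separator between chain entries


def to_error_data(exc: BaseException) -> ErrorData:
    # Walk the explicit causal chain (raise ... from ...).
    chain = []
    current = exc
    while current is not None:
        chain.append(current)
        current = current.__cause__
    return ErrorData(
        error=type(exc).__name__,
        message=str(exc),
        error_chain_repr=SEP.join(repr(e) for e in chain),
        error_chain_str=SEP.join(type(e).__name__ for e in chain),
    )


def from_error_data(data: ErrorData) -> Error:
    # Prefer the explicit top-level type; scan the chain only as a fallback.
    cls = KNOWN_ERRORS.get(data["error"])
    if cls is None:
        for name in data["error_chain_str"].split(SEP):
            if name in KNOWN_ERRORS:
                cls = KNOWN_ERRORS[name]
                break
    # Use the serialized message as the exception text.
    return (cls or Error)(data["message"])
```

With this ordering, code that checks isinstance on a reconstructed error sees the same type it would have seen on the live error, as long as that type is registered.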
Docs
docs/reference.md and docs/evaluation.md document the primary-error-only output contract and explain that secondary artifact, cleanup, and sandbox-detection failures are logged rather than persisted as rollout output fields.
Validation
- uv run ruff format
- uv run ruff check --fix
- uv run pytest tests/test_v1_runtime_lifecycle.py::test_sandbox_program_patch_cannot_set_lifecycle_fields tests/test_v1_runtime_lifecycle.py::test_sandbox_command_marks_oom_failures tests/test_v1_runtime_lifecycle.py::test_release_sandboxes_keeps_failed_delete_retryable tests/test_v1_runtime_lifecycle.py::test_clear_creation_tasks_keeps_failed_delete_retryable (4 passed)
- uv run pytest tests/ -v --ignore=tests/test_envs.py -m "not prime_sandbox" (1417 passed, 2 deselected)
- uv run pre-commit run --all-files