Isolate rollout failures and preserve V1 diagnostics #1529

Open: xeophon wants to merge 26 commits into main from codex/rollout-error-visibility

@xeophon xeophon commented Jun 3, 2026

Member

Summary

This PR makes rollout failures easier to inspect without changing the high-level fail-fast contract of the evaluator/trainer runtime.

The important behavior is:

  • a rollout's primary failure is preserved in the rollout output as structured ErrorData;
  • secondary failures from cleanup, artifact collection, cancellation drain, and sandbox failure detection are logged as warnings;
  • those secondary failures are not persisted as extra output fields;
  • unexpected environment, scoring, setup, framework, and infrastructure failures still abort the run after cleanup and sibling cancellation.

The goal is for a failed rollout to report its primary failure (the one that occurred first), while keeping saved artifacts simple and ensuring that a secondary cleanup problem cannot hide the real rollout or scoring error.

Runtime Contract

This PR keeps fail-fast semantics for unexpected runtime failures. It does not introduce a continue_on_error mode.

  • Environment setup/loading and global infrastructure failures abort the run.
  • Unexpected rollout, scoring, group execution, setup, and framework exceptions abort the eval/training run.
  • Before re-raising a failure, the runtime now runs cleanup and cancels and drains sibling rollout tasks, so resources and siblings do not keep running in the background.
  • Existing retryable vf.Error paths still use the existing retry behavior.
  • Rollout-local vf.Error values remain visible on state and serialize into rollout outputs as structured ErrorData.
  • Sandbox OOM/timeout detection still marks sandbox_oom / sandbox_timeout on state, with detailed context emitted to logs.

What Changed

Primary Error Serialization

Rollout outputs now use a single primary failure field:

  • error: the rollout's primary failure, serialized as ErrorData.

ErrorData contains:

  • error: exception class name;
  • message: exception message;
  • error_chain_repr: detailed repr of the causal exception chain;
  • error_chain_str: compact causal exception chain.

The output schema intentionally does not include secondary artifact_errors, cleanup_errors, or sandbox_failures fields. Those secondary diagnostics belong in logs.
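
As a rough illustration of the four fields listed above, the sketch below builds a dictionary of that shape from a Python exception. The names to_error_data and causal_chain, the field types, and the " -> " separator are assumptions made for this example; they are not taken from the PR's implementation.

```python
from typing import TypedDict


class ErrorData(TypedDict):
    # Field names follow the PR description; the str types are assumed.
    error: str
    message: str
    error_chain_repr: str
    error_chain_str: str


def causal_chain(exc: BaseException) -> list[BaseException]:
    """Walk the exception chain from the outermost exception inward."""
    chain: list[BaseException] = []
    seen: set[int] = set()
    current = exc
    while current is not None and id(current) not in seen:
        chain.append(current)
        seen.add(id(current))
        nxt = current.__cause__
        # Fall back to the implicit context unless "raise ... from None" hid it.
        if nxt is None and not current.__suppress_context__:
            nxt = current.__context__
        current = nxt
    return chain


def to_error_data(exc: BaseException) -> ErrorData:
    """Hypothetical serializer: one primary error, no secondary fields."""
    chain = causal_chain(exc)
    return ErrorData(
        error=type(exc).__name__,
        message=str(exc),
        error_chain_repr=" -> ".join(repr(e) for e in chain),
        error_chain_str=" -> ".join(type(e).__name__ for e in chain),
    )
```

Note that the resulting dictionary has no slot for artifact, cleanup, or sandbox diagnostics, which matches the primary-error-only contract described here.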

Base Environment Cleanup And Group Handling

Environment._run_rollout_state() now records scoring end timing and attempts rubric.cleanup(state) even when rollout scoring raises.

Grouped rollout execution now creates explicit sibling tasks. If one sibling raises, pending siblings are cancelled and drained before the original group exception is re-raised.

Group scoring now records scoring end timing for every state and attempts cleanup for every state whether scoring succeeds or fails.
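
The ordering described above (record scoring end time, always attempt cleanup, keep the primary failure) can be sketched roughly as follows. This is a simplified synchronous illustration, not the code in Environment._run_rollout_state(): the helper name score_then_cleanup, the scoring_end key, and the logger name are hypothetical.

```python
import logging
import time

logger = logging.getLogger("rollout_sketch")  # hypothetical logger name


def score_then_cleanup(state, score, cleanup):
    """Run scoring, then always record timing and attempt cleanup."""
    primary = None
    try:
        score(state)
    except Exception as exc:
        primary = exc  # remember the primary failure
        raise
    finally:
        # Timing is recorded whether scoring succeeded or raised.
        state["scoring_end"] = time.time()  # hypothetical key
        try:
            cleanup(state)
        except Exception as cleanup_exc:
            if primary is None:
                # Cleanup is the only failure, so it propagates.
                raise
            # Secondary failure: log it, let the primary one propagate.
            logger.warning(
                "cleanup failed after %r: %r", primary, cleanup_exc
            )
```

With this shape, a caller that catches the exception sees the scoring error even when cleanup also fails; the cleanup error only appears in the log.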

Impact:

  • scoring failures no longer skip cleanup;
  • cleanup failures after an existing scoring/rollout failure are logged without replacing the primary failure;
  • a fast failing grouped rollout does not leave slow sibling rollouts running after the group has already failed;
  • cleanup still runs across completed group states before the group failure propagates.
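
The cancel-and-drain behavior for sibling tasks can be illustrated with a minimal asyncio sketch. The run_group helper below is hypothetical and only shows the general pattern (explicit tasks, cancel pending siblings, wait for them to unwind, re-raise the original exception); the PR's actual group execution code may differ.

```python
import asyncio


async def run_group(coros):
    """Run sibling coroutines; on failure, cancel and drain the rest."""
    # Explicit tasks so each sibling can be cancelled individually.
    tasks = [asyncio.create_task(c) for c in coros]
    try:
        return await asyncio.gather(*tasks)
    except BaseException:
        # One sibling raised (or the group itself was cancelled).
        for task in tasks:
            if not task.done():
                task.cancel()
        # Drain: wait until every sibling has finished unwinding.
        # return_exceptions=True keeps their outcomes from masking
        # the original exception.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise  # re-raise the original group exception
```

Without the drain step, a slow sibling could still be running (and holding resources) after the caller has already observed the group failure.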

V1 Runtime And Harness Handling

V1 grouped rollouts follow the same cancellation/drain pattern as the base environment.

V1 keeps rollout errors as live vf.Error instances through group scoring and group cleanup, then serializes them before returning output. This lets group reward/metric code inspect the actual error type instead of only seeing serialized dictionaries.

Harness.run() exposes the current state on exceptions that happen after state replacement, so V1 group recovery and cleanup operate on the same state object that the rollout was using when it failed.

Artifact collection behavior is now:

  • a vf.Error during artifact collection becomes the primary rollout error when there is no earlier rollout error;
  • artifact failures after an existing primary rollout error are logged as secondary warnings;
  • non-vf.Error artifact failures without a prior rollout error remain fatal framework/runtime bugs.
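
The three artifact-collection rules above amount to a small decision table, sketched below. RolloutError is a local stand-in for vf.Error so the example runs without the library; collect_artifacts_guarded, the dict-shaped state, and the logger name are likewise hypothetical.

```python
import logging

logger = logging.getLogger("rollout_sketch")  # hypothetical logger name


class RolloutError(Exception):
    """Stand-in for vf.Error in this sketch."""


def collect_artifacts_guarded(state: dict, collect) -> None:
    """Apply the primary/secondary rules to an artifact collection call."""
    try:
        collect(state)
    except RolloutError as exc:
        if state.get("error") is None:
            # No earlier rollout error: this becomes the primary error.
            state["error"] = exc
        else:
            # An earlier primary error exists: log as secondary.
            logger.warning("artifact collection failed: %r", exc)
    except Exception as exc:
        if state.get("error") is None:
            # Unexpected failure with no prior error stays fatal.
            raise
        logger.warning("artifact collection failed: %r", exc)
```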

Cleanup behavior is now:

  • cleanup failures during already-failing V1 rollout paths are logged as secondary warnings;
  • cleanup failures after a handled state["error"] do not replace the recorded rollout error;
  • V1 group cleanup failures are logged whether cleanup is the primary failure or follows group scoring failure.

Sandbox Failure Visibility

Sandbox OOM and timeout detection still sets the state-level flags that downstream consumers already use:

  • sandbox_oom
  • sandbox_timeout

Detailed sandbox failure context is emitted to logs instead of being written into rollout artifacts as sandbox_failures.

Error Reconstruction

Reconstructed vf.Error values from serialized ErrorData now use ErrorData.message as the exception text and prefer the explicit top-level ErrorData.error type before scanning the error chain.

This keeps the runtime's live-error behavior and serialized-output behavior aligned, especially for V1 group scoring/cleanup paths that need to inspect error types.
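
A hedged sketch of the lookup order described above: try the explicit top-level error type first, then fall back to scanning the chain. The exception classes, the KNOWN_ERRORS registry, the reconstruct helper, and the " -> " chain separator are all assumptions for illustration, not the library's actual types or format.

```python
class SketchError(Exception):
    """Stand-in for vf.Error in this sketch."""


class SketchSandboxError(SketchError):
    """Hypothetical subclass used only for the example."""


class SketchToolError(SketchError):
    """Hypothetical subclass used only for the example."""


# Hypothetical registry mapping class names to reconstructable types.
KNOWN_ERRORS = {
    cls.__name__: cls
    for cls in (SketchError, SketchSandboxError, SketchToolError)
}


def reconstruct(data: dict) -> SketchError:
    """Rebuild a live error from serialized ErrorData-like fields."""
    # 1. Prefer the explicit top-level type.
    cls = KNOWN_ERRORS.get(data["error"])
    if cls is None:
        # 2. Otherwise scan the compact chain for the first known type.
        for name in data.get("error_chain_str", "").split(" -> "):
            if name.strip() in KNOWN_ERRORS:
                cls = KNOWN_ERRORS[name.strip()]
                break
    # 3. Fall back to the base error; message is the exception text.
    return (cls or SketchError)(data["message"])
```

Reward or metric code that checks isinstance() on the reconstructed value then behaves the same as it would on the live error, provided the type name is known.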

Docs

docs/reference.md and docs/evaluation.md document the primary-error-only output contract and explain that secondary artifact, cleanup, and sandbox-detection failures are logged rather than persisted as rollout output fields.

Validation

  • uv run ruff format
  • uv run ruff check --fix
  • uv run pytest tests/test_v1_runtime_lifecycle.py::test_sandbox_program_patch_cannot_set_lifecycle_fields tests/test_v1_runtime_lifecycle.py::test_sandbox_command_marks_oom_failures tests/test_v1_runtime_lifecycle.py::test_release_sandboxes_keeps_failed_delete_retryable tests/test_v1_runtime_lifecycle.py::test_clear_creation_tasks_keeps_failed_delete_retryable (4 passed)
  • uv run pytest tests/ -v --ignore=tests/test_envs.py -m "not prime_sandbox" (1417 passed, 2 deselected)
  • uv run pre-commit run --all-files
  • GitHub CI is passing for Ruff, Ty, Semgrep, CodeQL, Environments, Bugbot, and Verifiers on Python 3.10, 3.11, 3.12, and 3.13.

macroscopeapp Bot commented Jun 3, 2026


Approvability

Verdict: Needs human review

This PR restructures error handling and failure isolation in the rollout execution system. Changes to how cleanup failures are propagated, new group rollout cancellation logic, and removal of some persisted diagnostic data represent significant runtime behavior changes that warrant human review.

Comment thread verifiers/envs/environment.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f719345079

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/envs/environment.py Outdated
Comment thread verifiers/envs/environment.py Outdated
Comment thread verifiers/envs/environment.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2898e6df4

Comment thread verifiers/envs/environment.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27d6d2ab38

Comment thread verifiers/utils/eval_utils.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb9ef97fcd

Comment thread verifiers/envs/environment.py Outdated
macroscopeapp Bot previously approved these changes Jun 4, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5aa700eb60

Comment thread verifiers/v1/env.py Outdated

@mikasenghaas mikasenghaas left a comment

Member

nice, defo onboard with storing more information in the error class (i think @rasdani has been wanting this for a long time)

Comment thread verifiers/utils/error_utils.py Outdated
Comment thread verifiers/types.py Outdated
Comment thread verifiers/envs/environment.py Outdated
Comment thread verifiers/v1/harness.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a6fe4a785

Comment thread verifiers/v1/harness.py
Comment thread verifiers/v1/harness.py
Comment thread verifiers/envs/environment.py Outdated
Comment thread docs/reference.md

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf21c213e6

Comment thread verifiers/v1/harness.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76b9e1be18

Comment thread verifiers/envs/environment.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc47a81f17

Comment thread verifiers/v1/harness.py Outdated
Comment thread verifiers/v1/env.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a82ef74d3

Comment thread verifiers/utils/error_utils.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93557feb6e

Comment thread verifiers/v1/env.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3a43f7beb

Comment thread verifiers/utils/error_utils.py

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ec7da17. Configure here.

Comment thread verifiers/v1/env.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec7da17680

Comment thread verifiers/v1/env.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7154b6547f

Comment thread verifiers/v1/env.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0ce3b5d567

Comment thread verifiers/v1/harness.py

3 participants