Isolate rollout failures and preserve V1 diagnostics by xeophon · Pull Request #1529 · PrimeIntellect-ai/verifiers

xeophon · 2026-06-03T13:17:41Z

Summary

This PR makes rollout failures easier to inspect without changing the high-level fail-fast contract of the evaluator/trainer runtime.

The important behavior is:

a rollout's primary failure is preserved in the rollout output as structured ErrorData;
secondary failures from cleanup, artifact collection, cancellation drain, and sandbox failure detection are logged as warnings;
those secondary failures are not persisted as extra output fields;
unexpected environment, scoring, setup, framework, and infrastructure failures still abort the run after cleanup and sibling cancellation.

The goal is for a failed rollout to report what failed first, while keeping saved artifacts simple and ensuring that a secondary cleanup problem never hides the real rollout or scoring error.

Runtime Contract

This PR keeps fail-fast semantics for unexpected runtime failures. It does not introduce a continue_on_error mode.

Environment setup/loading and global infrastructure failures abort the run.
Unexpected rollout, scoring, group execution, setup, and framework exceptions abort the eval/training run.
Before re-raising a failure, the runtime now releases resources and cancels and drains sibling rollout tasks, so nothing keeps running in the background.
Existing retryable vf.Error paths still use the existing retry behavior.
Rollout-local vf.Error values remain visible on state and serialize into rollout outputs as structured ErrorData.
Sandbox OOM/timeout detection still marks sandbox_oom / sandbox_timeout on state, with detailed context emitted to logs.

What Changed

Primary Error Serialization

Rollout outputs now use a single primary failure field:

error: the rollout's primary failure, serialized as ErrorData.

ErrorData contains:

error: exception class name;
message: exception message;
error_chain_repr: detailed repr of the causal exception chain;
error_chain_str: compact causal exception chain.
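A minimal sketch of how an exception could be flattened into these four fields. The field names come from the list above; the field types, the helper names `causal_chain` and `to_error_data`, and the `" <- "` chain separator are assumptions for illustration, not the library's actual implementation.

```python
from typing import List, Optional, TypedDict


class ErrorData(TypedDict):
    # Field names follow the PR description; the str types are assumed.
    error: str
    message: str
    error_chain_repr: str
    error_chain_str: str


def causal_chain(exc: BaseException) -> List[BaseException]:
    """Walk __cause__ / __context__ from the outermost exception inward."""
    chain: List[BaseException] = []
    seen = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        chain.append(current)
        current = current.__cause__ or current.__context__
    return chain


def to_error_data(exc: BaseException) -> ErrorData:
    """Serialize an exception into the four ErrorData fields (hypothetical helper)."""
    chain = causal_chain(exc)
    return ErrorData(
        error=type(exc).__name__,
        message=str(exc),
        error_chain_repr=" <- ".join(repr(e) for e in chain),
        error_chain_str=" <- ".join(type(e).__name__ for e in chain),
    )
```

With a `RuntimeError` raised `from` a `ValueError`, this sketch yields `error == "RuntimeError"` and a compact chain of `RuntimeError <- ValueError`.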

The output schema intentionally does not include secondary artifact_errors, cleanup_errors, or sandbox_failures fields. Those secondary diagnostics belong in logs.

Base Environment Cleanup And Group Handling

Environment._run_rollout_state() now records scoring end timing and attempts rubric.cleanup(state) even when rollout scoring raises.
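The pattern described here can be sketched roughly as follows. This is an illustration under assumptions, not the code in `Environment._run_rollout_state()`: the `score` and `cleanup` callables, the function name, and the `scoring_end` state key are hypothetical.

```python
import logging
import time

logger = logging.getLogger("rollout_cleanup_sketch")


async def score_then_cleanup(state, score, cleanup):
    """Score a state, then record end timing and attempt cleanup either way."""
    try:
        await score(state)
    except BaseException as primary:
        state["scoring_end"] = time.time()  # hypothetical key name
        try:
            await cleanup(state)
        except Exception as secondary:
            # Secondary failure: log it, never let it replace the primary.
            logger.warning("cleanup failed after %r: %r", primary, secondary)
        raise  # re-raise the primary scoring failure unchanged
    state["scoring_end"] = time.time()
    await cleanup(state)  # a failure here is itself the primary failure
    return state
```

The bare `raise` keeps the original scoring exception and traceback, so a cleanup problem on the failure path only ever appears in the logs.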

Grouped rollout execution now creates explicit sibling tasks. If one sibling raises, pending siblings are cancelled and drained before the original group exception is re-raised.

Group scoring now records scoring end timing for every state and attempts cleanup for every state whether scoring succeeds or fails.

Impact:

scoring failures no longer skip cleanup;
cleanup failures after an existing scoring/rollout failure are logged without replacing the primary failure;
a fast failing grouped rollout does not leave slow sibling rollouts running after the group has already failed;
cleanup still runs across completed group states before the group failure propagates.
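The cancel-and-drain behavior for grouped rollouts can be illustrated with plain `asyncio`. This is a sketch of the general technique, assuming each sibling rollout is a coroutine; `run_group` is a hypothetical name and not the function used in the PR.

```python
import asyncio


async def run_group(coros):
    """Run sibling rollouts; on the first failure, cancel and drain the rest."""
    # Explicit tasks, so pending siblings can be cancelled individually.
    tasks = [asyncio.create_task(c) for c in coros]
    try:
        return await asyncio.gather(*tasks)
    except BaseException:
        for task in tasks:
            if not task.done():
                task.cancel()
        # Drain: wait until every sibling has actually finished, collecting
        # their exceptions instead of raising them over the original one.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise  # re-raise the original group exception
```

Without the drain step, cancelled siblings could still be unwinding (or running their own cleanup) after the group failure has already propagated to the caller.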

V1 Runtime And Harness Handling

V1 grouped rollouts follow the same cancellation/drain pattern as the base environment.

V1 keeps rollout errors as live vf.Error instances through group scoring and group cleanup, then serializes them before returning output. This lets group reward/metric code inspect the actual error type instead of only seeing serialized dictionaries.

Harness.run() exposes the current state on exceptions that happen after state replacement, so V1 group recovery and cleanup operate on the same state object that the rollout was using when it failed.

Artifact collection behavior is now:

a vf.Error during artifact collection becomes the primary rollout error when there is no earlier rollout error;
artifact failures after an existing primary rollout error are logged as secondary warnings;
non-vf.Error artifact failures without a prior rollout error remain fatal framework/runtime bugs.
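The three artifact-collection rules above amount to a small decision table. A sketch under assumptions: `RolloutError` stands in for `vf.Error`, `state` is treated as a plain dict with an `error` key, and `collect_artifacts` and its `collect` argument are hypothetical names.

```python
import logging

logger = logging.getLogger("artifact_sketch")


class RolloutError(Exception):
    """Stand-in for vf.Error; the real class lives in verifiers."""


def collect_artifacts(state, collect):
    """Run artifact collection and apply the primary/secondary error rules."""
    try:
        collect(state)
    except Exception as exc:
        if state.get("error") is not None:
            # A primary rollout error already exists: log as secondary warning.
            logger.warning("artifact collection failed: %r", exc)
        elif isinstance(exc, RolloutError):
            # No earlier error: this becomes the primary rollout error.
            state["error"] = exc
        else:
            # Unexpected failure with no prior error: treat as a fatal bug.
            raise
```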

Cleanup behavior is now:

cleanup failures during already-failing V1 rollout paths are logged as secondary warnings;
cleanup failures after a handled state["error"] do not replace the recorded rollout error;
V1 group cleanup failures are logged whether cleanup is the primary failure or follows group scoring failure.

Sandbox Failure Visibility

Sandbox OOM and timeout detection still sets the state-level flags that downstream consumers already use:

sandbox_oom
sandbox_timeout

Detailed sandbox failure context is emitted to logs instead of being written into rollout artifacts as sandbox_failures.

Error Reconstruction

vf.Error values reconstructed from serialized ErrorData now use ErrorData.message as the exception text, and they prefer the explicit top-level ErrorData.error type, scanning the error chain only as a fallback.

This keeps the runtime's live-error behavior and serialized-output behavior aligned, especially for V1 group scoring/cleanup paths that need to inspect error types.
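A rough sketch of that reconstruction order. The class names, the `KNOWN_ERRORS` registry, the `reconstruct_error` helper, and the `" <- "` chain separator are all assumptions made for illustration; only the ErrorData field names come from this PR.

```python
class RolloutError(Exception):
    """Stand-in for vf.Error."""


class SandboxError(RolloutError):
    """Hypothetical subclass, used only to show type recovery."""


# Hypothetical registry of error types that can be reconstructed by name.
KNOWN_ERRORS = {cls.__name__: cls for cls in (RolloutError, SandboxError)}


def reconstruct_error(data):
    """Rebuild a live error from a serialized ErrorData-like dict."""
    # 1. Prefer the explicit top-level type name.
    cls = KNOWN_ERRORS.get(data.get("error", ""))
    if cls is None:
        # 2. Otherwise scan the compact chain for the first known type.
        for name in data.get("error_chain_str", "").split(" <- "):
            if name in KNOWN_ERRORS:
                cls = KNOWN_ERRORS[name]
                break
    if cls is None:
        cls = RolloutError  # 3. Fall back to the base error type.
    # The serialized message becomes the exception text.
    return cls(data.get("message", ""))
```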

Docs

docs/reference.md and docs/evaluation.md document the primary-error-only output contract and explain that secondary artifact, cleanup, and sandbox-detection failures are logged rather than persisted as rollout output fields.

Validation

uv run ruff format
uv run ruff check --fix
uv run pytest tests/test_v1_runtime_lifecycle.py::test_sandbox_program_patch_cannot_set_lifecycle_fields tests/test_v1_runtime_lifecycle.py::test_sandbox_command_marks_oom_failures tests/test_v1_runtime_lifecycle.py::test_release_sandboxes_keeps_failed_delete_retryable tests/test_v1_runtime_lifecycle.py::test_clear_creation_tasks_keeps_failed_delete_retryable (4 passed)
uv run pytest tests/ -v --ignore=tests/test_envs.py -m "not prime_sandbox" (1417 passed, 2 deselected)
uv run pre-commit run --all-files
GitHub CI is passing for Ruff, Ty, Semgrep, CodeQL, Environments, Bugbot, and Verifiers on Python 3.10, 3.11, 3.12, and 3.13.

macroscopeapp · 2026-06-03T13:20:08Z

Approvability

Verdict: Needs human review

This PR restructures error handling and failure isolation in the rollout execution system. Changes to how cleanup failures are propagated, new group rollout cancellation logic, and removal of some persisted diagnostic data represent significant runtime behavior changes that warrant human review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f719345079

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2898e6df4

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27d6d2ab38

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb9ef97fcd

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5aa700eb60

mikasenghaas

nice, defo onboard with storing more information in the error class (i think @rasdani has been wanting this for a long time)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a6fe4a785

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf21c213e6

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76b9e1be18

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc47a81f17

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a82ef74d3

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93557feb6e

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3a43f7beb

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ec7da17.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec7da17680

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7154b6547f

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0ce3b5d567

Isolate rollout failures and preserve V1 diagnostics

f719345

cursor Bot reviewed Jun 3, 2026

Comment thread verifiers/envs/environment.py Outdated

chatgpt-codex-connector Bot reviewed Jun 3, 2026

Comment thread verifiers/envs/environment.py Outdated

Comment thread verifiers/envs/environment.py Outdated

Simplify structured error info defaults

f2898e6

cursor Bot reviewed Jun 3, 2026

Comment thread verifiers/envs/environment.py Outdated

chatgpt-codex-connector Bot reviewed Jun 3, 2026

Comment thread verifiers/envs/environment.py Outdated

Handle rollout continuation cleanup

27d6d2a

chatgpt-codex-connector Bot reviewed Jun 3, 2026

Comment thread verifiers/utils/eval_utils.py Outdated

xeophon added 3 commits June 3, 2026 16:16

Wire continue_on_error through eval configs

daf145a

Remove rollout continue-on-error knob

3fb071d

Simplify generate task tracking

fb9ef97

chatgpt-codex-connector Bot reviewed Jun 4, 2026

Comment thread verifiers/envs/environment.py Outdated

Preserve primary errors during cleanup

5aa700e

macroscopeapp Bot previously approved these changes Jun 4, 2026

chatgpt-codex-connector Bot reviewed Jun 4, 2026

Comment thread verifiers/v1/env.py Outdated

Surface cancelled sibling cleanup failures

10aa666

xeophon dismissed macroscopeapp[bot]’s stale review via eacd337 June 4, 2026 11:48

xeophon force-pushed the codex/rollout-error-visibility branch from eacd337 to 10aa666 on June 4, 2026 11:56

xeophon mentioned this pull request Jun 4, 2026

Cache Hugging Face tokenizers in CI #1542

Merged

mikasenghaas reviewed Jun 4, 2026

Comment thread verifiers/utils/error_utils.py Outdated

Clean rollout error diagnostics (#1550)

13e9c8f

cursor Bot reviewed Jun 5, 2026

Comment thread verifiers/types.py Outdated

macroscopeapp Bot reviewed Jun 5, 2026

Comment thread verifiers/envs/environment.py Outdated

Merge main into rollout error visibility

7a6fe4a

cursor Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/harness.py Outdated

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/harness.py

Merge remote-tracking branch 'origin/main' into codex/rollout-error-visibility

aec4111

cursor Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/harness.py

Address rollout diagnostic review comments

262c780

macroscopeapp Bot reviewed Jun 5, 2026

Comment thread verifiers/envs/environment.py Outdated

Comment thread docs/reference.md

Address remaining rollout visibility review comments

cf21c21

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/harness.py Outdated

xeophon added 2 commits June 5, 2026 17:21

Log secondary rollout failures

0b673f8

Stabilize rollout failure log tests

76b9e1b

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/envs/environment.py Outdated

Log cancelled v0 group cleanup failures

bc47a81

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/harness.py Outdated

Preserve recorded V1 errors during cleanup

4a82ef7

cursor Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/env.py Outdated

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/utils/error_utils.py Outdated

xeophon added 2 commits June 5, 2026 18:02

Log V1 group drain failures

b21e693

Preserve serialized error messages

93557fe

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/env.py

Use completed V1 group states during cleanup

e3a43f7

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/utils/error_utils.py

Preserve serialized top-level error type

ec7da17

cursor Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/env.py

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/env.py Outdated

xeophon added 2 commits June 5, 2026 18:32

Log V1 group cleanup failures

74c1d05

Preserve failed V1 replacement states

7154b65

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/env.py

Preserve cancelled V1 replacement states

0ce3b5d

chatgpt-codex-connector Bot reviewed Jun 5, 2026

Comment thread verifiers/v1/harness.py

xeophon added 2 commits June 5, 2026 18:58

Attach V1 state to cleanup exceptions

ffa7fce

Prune rollout visibility regression tests

f253ac5
