feat: env-server worker pool for v1 (router + N workers)#1623
Conversation
Reinstate v0's env-server parallelism for v1: a ROUTER broker (serve/pool.py) load-balances requests across `num_workers` worker processes — each an ordinary EnvServer/LegacyEnvServer bound to an ipc address — to the least-busy worker, relieving the single event loop (CPU-bound tokenization/scoring). Same wire protocol as a lone server, so EnvClient is unchanged; works for native v1 and the v0 bridge. - serve_env(num_workers, ...): a single in-process server (<=1) or the pool (>1). - EvalConfig.num_workers (default 0 = in-process); `eval --num-workers N` runs the eval through the pool (run_eval_server) for both v1 and legacy v0 — the same path prime-rl trains through, so it exercises the pool e2e. - EnvServerConfig.num_workers; the `serve` CLI routes through serve_env. Workers spawn-style with a death pipe (orphan self-exit). TODO: restart-on-death and stats/lag monitors (v0 had them; rollout errors return as data, not crashes). Validated: gsm8k-v1 (v1) and echo-v0 (legacy) eval through a 2-worker pool match the in-process baseline (reward 1.0, 0 errors). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The server-backed eval grouped a task's rollouts into one run_group whenever num_rollouts>1, so they all landed on a single worker (group scoring needs the group together) — defeating the pool. Now only a group-scored taskset uses run_group; otherwise each rollout is a separate run_rollout request the broker round-robins (least-busy) across workers. Mirrors the prime-rl dispatcher. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agentic_benchmark.sh runs one harbor task (default: fix-git, terminal-bench-2) at group sizes (-r) across env-server modes (in-process vs N-worker pool), writing per-rollout durations + e2e; agentic_aggregate.py summarizes. With no group reward the rollouts are independent, so the pool round-robins them across workers — stressing concurrent agentic execution + verifier scoring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Repurpose benchmark.sh to the single-turn counterpart of agentic_benchmark.sh: compare in-process vs N-worker pool at group sizes (-r, rollouts of one gsm8k task), RUNTIME a knob (default subprocess). Rename agentic_aggregate.py -> bench_aggregate.py (shared by both, generic label). Remove the superseded runtime benchmark (aggregate.py / plot.py / benchmark.json) — RUNTIME is now a benchmark.sh knob, and the pool comparison is the current focus. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 689a79b. Configure here.
| ] | ||
| results = await asyncio.gather(*units) | ||
| await client.close() | ||
| return [trace for unit_traces in results for trace in unit_traces] |
There was a problem hiding this comment.
Pool eval skips shared tools
Medium Severity
run_eval_server drives rollouts only through EnvClient and never enters Environment.shared_tools, unlike in-process run_eval. Tasksets with tools.shared=True therefore rebuild expensive shared tool servers per rollout instead of once per eval, changing behavior and load versus the default eval path.
Reviewed by Cursor Bugbot for commit 689a79b. Configure here.
| units = [ | ||
| run_rollout_unit(i) for i in idxs for _ in range(config.num_rollouts) | ||
| ] | ||
| results = await asyncio.gather(*units) |
There was a problem hiding this comment.
Group pool ignores rollout concurrency
Medium Severity
For @group_reward tasksets, --num-workers eval holds max_concurrent for an entire run_group RPC while the worker runs all group rollouts concurrently with no per-rollout limit. In-process eval acquires the semaphore once per rollout, so global concurrency can far exceed --max-concurrent.
Reviewed by Cursor Bugbot for commit 689a79b. Configure here.
| ): # server-backed: a worker pool runs rollouts (v1 or legacy) | ||
| from verifiers.v1.cli.runner import run_eval_server | ||
|
|
||
| traces = asyncio.run(run_eval_server(config)) |
There was a problem hiding this comment.
Worker pool docs not updated
Low Severity
This PR adds v1 uv run eval --num-workers and serve --num-workers worker-pool behavior, but no updates appear in docs/ (for example docs/faqs.md or evaluation-related sections) describing the new flags, defaults, or how they differ from in-process eval.
Additional Locations (1)
Triggered by project rule: BugBot Instructions
Reviewed by Cursor Bugbot for commit 689a79b. Configure here.
ApprovabilityVerdict: Needs human review This PR introduces a new worker pool architecture with significant new multiprocessing/ZMQ infrastructure and execution paths. Two unresolved medium-severity comments identify behavioral differences between pool and in-process modes (shared_tools handling, concurrency semantics) that warrant human review. You can customize Macroscope's approvability policy. Learn more. |


Summary
Reinstates v0's env-server parallelism for v1. A lone
EnvServerruns every rollout as anasyncio.Taskon one event loop, so CPU-bound work (renderer tokenization, scoring) competes for that loop; v0 relieved this with a router + worker pool. This brings it back.serve/pool.py—EnvServerPool, a ROUTER broker that load-balances requests to the least-busy ofnum_workersworker processes (each an ordinaryEnvServer/LegacyEnvServeron its own ipc address, over aDEALER). Same wire protocol as a lone server →EnvClientunchanged; replies routed back byrequest_id. Works for native v1 and the v0 bridge.serve_env(num_workers, …)— single in-process server whennum_workers <= 1, else the pool. Used by theserveCLI, eval, and prime-rl.EvalConfig.num_workers(default0= in-process);eval --num-workers Nruns the eval through the pool (run_eval_server).run_eval_serverissues onerun_rolloutper rollout (broker round-robins across workers) for independent tasksets; only a@group_rewardtaskset usesrun_group(one worker, cross-rollout scoring). Mirrors the prime-rl dispatcher.EnvServerConfig.num_workersfor theserveCLI.Workers
spawnwith a death pipe (orphan self-exit). Trimmed (TODO): per-worker restart-on-death + stats/lag monitors (rollout errors return as data, not crashes).Benchmarks (
bench/)Both scripts compare env-server modes (in-process vs N-worker pool) at group sizes (
-r, rollouts of one task), feeding a shared aggregator that records per-rolloutgeneration.duration(p10/p50/p90 — e2e is straggler-gated, so the distribution is the honest comparator) plus e2e, reward, and errors.benchmark.sh— single-turn (gsm8k-v1, subprocess): light per-rollout CPU, where the pool's fixed per-worker overhead is most visible against its event-loop relief.agentic_benchmark.sh— agentic: one harbor task (defaultfix-git, terminal-bench-2) where each rollout carries a multi-turn agent + a verifier.bench_aggregate.py— shared aggregator for both.Agentic —
fix-git(terminal-bench-2), prime runtime, deepseek-v4-flash, default+bash, max_turns=30-re2e speedup (in-process → 4-worker pool): r=32 1.38×, r=64 0.92×, r=128 1.83×. The win grows with concurrency: at r=128 the single event loop starves under 128 concurrent agent loops + verifier scoring (a 176s straggler), and the pool spreads them across 4 workers (max 110s) → 221s→121s. r=64 is within straggler noise (gen p50/p90 are ≈equal). Reward ≈parity, 0 errors in every cell.
Single-turn —
gsm8k-v1, subprocess, deepseek-v4-flash, max_tokens=1024-rNear-neutral: single-turn rollouts are light (gen p50 ~4–6s), so e2e is small (9–16s) and dominated by noise + worker-spawn overhead (r=32/64 slight pool win, r=128 slight pool loss — all within a few seconds). gen p50/p90 ≈parity, reward ≈parity, 0 errors. The pool's payoff is on agentic (heavy per-rollout) work, not single-turn (light).
Verification
gsm8k-v1(v1) andecho-v0(legacy) eval through a 2-worker pool match the in-process baseline (reward 1.0, 0 errors).ruffclean.Note
Medium Risk
New multiprocessing/ZMQ broker path affects how eval and serve run rollouts and teardown (SIGTERM, worker lifecycle); wire protocol is unchanged but concurrent sandbox usage scales with worker count.
Overview
Adds a v1 env-server worker pool (ZMQ ROUTER broker + N spawned
EnvServer/LegacyEnvServerworkers with least-busy routing) and wires it throughserve_env, theserveCLI (EnvServerConfig.num_workers), andeval --num-workersvia newrun_eval_server(spawns the pool, drives rollouts overEnvClient; per-rollout requests for independent tasksets,run_groupwhenrequires_group_scoring).Bench shifts from comparing runtimes × batch size to in-process (
workers=0) vs pool at rollout counts: sharedbench_aggregate.py, updatedbenchmark.sh, newagentic_benchmark.sh; removesaggregate.py,plot.py, and committedbenchmark.json.Reviewed by Cursor Bugbot for commit 689a79b. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Add worker pool support to env-server with router-based dispatch
EnvServerPoolthat binds a ZMQ ROUTER socket and dispatches requests to N worker processes, each running anEnvServerbehind a DEALER socket with least-busy routing.serve_envas a unified entrypoint that starts either a pooled server (num_workers > 1) or a single in-process server, with consistent SIGTERM handling and optional address announcement via a queue.evalCLI to dispatch rollouts via a spawned env server when--num_workers/-w > 0, coordinating requests throughEnvClientwith optional concurrency throttling viaasyncio.Semaphore.bench/bench_aggregate.py.EnvConfigfrom a minimal picklable dict; mismatches between serialized config fields and runtime expectations could cause silent misconfiguration.Macroscope summarized 689a79b.