feat: env-server worker pool for v1 (router + N workers) by mikasenghaas · Pull Request #1623 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-06-11T05:55:06Z

Summary

Reinstates v0's env-server parallelism for v1. A lone EnvServer runs every rollout as an asyncio.Task on one event loop, so CPU-bound work (renderer tokenization, scoring) competes for that loop; v0 relieved this with a router + worker pool. This brings it back.

- serve/pool.py: EnvServerPool, a ROUTER broker that load-balances requests to the least-busy of num_workers worker processes (each an ordinary EnvServer / LegacyEnvServer on its own ipc address, connected over a DEALER). The wire protocol is the same as a lone server's, so EnvClient is unchanged; replies are routed back by request_id. Works for native v1 and the v0 bridge.
- serve_env(num_workers, …): a single in-process server when num_workers <= 1, else the pool. Used by the serve CLI, eval, and prime-rl.
- Eval: EvalConfig.num_workers (default 0 = in-process); eval --num-workers N runs the eval through the pool (run_eval_server).
- Round-robin granularity: run_eval_server issues one run_rollout per rollout for independent tasksets, so the broker can spread them across workers (least-busy first); only a @group_reward taskset uses run_group (one worker, cross-rollout scoring). Mirrors the prime-rl dispatcher.
- EnvServerConfig.num_workers for the serve CLI.
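
The least-busy routing and request_id reply mapping in the list above can be illustrated without any ZMQ sockets. This is a minimal sketch, not the code in serve/pool.py: the class and method names (`LeastBusyRouter`, `dispatch`, `complete`) are hypothetical, and breaking ties by worker order is an assumption.

```python
class LeastBusyRouter:
    """Toy bookkeeping for a broker: pick the worker with the fewest in-flight requests."""

    def __init__(self, workers: list[str]) -> None:
        self.workers = list(workers)
        # Number of requests each worker is currently handling.
        self.in_flight = {w: 0 for w in self.workers}
        # request_id -> worker, so a reply can be matched to its request.
        self.owner: dict[str, str] = {}

    def dispatch(self, request_id: str) -> str:
        # min() returns the first minimal element, so ties go to the earliest worker
        # (an assumption made for this sketch).
        worker = min(self.workers, key=lambda w: self.in_flight[w])
        self.in_flight[worker] += 1
        self.owner[request_id] = worker
        return worker

    def complete(self, request_id: str) -> str:
        # Called when a worker replies: free the slot and forget the request.
        worker = self.owner.pop(request_id)
        self.in_flight[worker] -= 1
        return worker
```

A real broker would additionally forward the request frames to the chosen worker and send the reply back to the originating client; the sketch only covers the load accounting.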

Workers are spawned with a death pipe, so an orphaned worker exits on its own. Trimmed for now (TODO): per-worker restart-on-death and stats/lag monitors; rollout errors return as data, not crashes.
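
For readers unfamiliar with the death-pipe pattern, here is a minimal sketch of the idea, under stated assumptions: the parent keeps the write end of a pipe open and never writes to it, while the worker watches the read end and treats end-of-file as "the parent is gone". The helper name `watch_death_pipe` is hypothetical and this is not the PR's implementation; a real worker would typically pass something like `lambda: os._exit(1)` as the callback.

```python
import threading
from multiprocessing import Pipe
from multiprocessing.connection import Connection
from typing import Callable


def watch_death_pipe(reader: Connection, on_death: Callable[[], None]) -> threading.Thread:
    """Run on_death once the write end of the pipe has been closed."""

    def _watch() -> None:
        try:
            while True:
                # Blocks until data arrives; data is ignored in this sketch.
                reader.recv()
        except (EOFError, OSError):
            # EOF means every copy of the write end is closed (e.g. the parent died).
            on_death()

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    reader, writer = Pipe(duplex=False)
    done = threading.Event()
    watch_death_pipe(reader, done.set)
    writer.close()  # simulate the parent going away
    done.wait(timeout=2.0)
    print("death detected:", done.is_set())
```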

Benchmarks (`bench/`)

Both scripts compare env-server modes (in-process vs N-worker pool) at group sizes (-r, rollouts of one task), feeding a shared aggregator that records per-rollout generation.duration (p10/p50/p90 — e2e is straggler-gated, so the distribution is the honest comparator) plus e2e, reward, and errors.

benchmark.sh — single-turn (gsm8k-v1, subprocess): light per-rollout CPU, where the pool's fixed per-worker overhead is most visible against its event-loop relief.
agentic_benchmark.sh — agentic: one harbor task (default fix-git, terminal-bench-2) where each rollout carries a multi-turn agent + a verifier.
bench_aggregate.py — shared aggregator for both.
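
As an illustration of the per-rollout distribution summary, the following sketch computes p10/p50/p90 and the maximum from a list of durations using only the standard library. It is not the contents of bench_aggregate.py; the function name `summarize_durations` and the choice of the "inclusive" quantile method are assumptions.

```python
import statistics


def summarize_durations(durations: list[float]) -> dict[str, float]:
    """Summarize per-rollout durations (seconds) as p10/p50/p90 plus the maximum."""
    # quantiles(n=10) returns the 9 decile cut points; it needs at least two samples
    # on most Python versions.
    deciles = statistics.quantiles(durations, n=10, method="inclusive")
    return {
        "p10": deciles[0],
        "p50": deciles[4],
        "p90": deciles[8],
        # The slowest rollout is what gates e2e time, so report it separately.
        "max": max(durations),
    }
```

Reporting the maximum next to the percentiles makes it visible when a single straggler, rather than the typical rollout, determines the e2e number.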

Agentic — `fix-git` (terminal-bench-2), prime runtime, deepseek-v4-flash, default+bash, max_turns=30

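| mode | `-r` | e2e | gen p50 (s) | gen p90 (s) | gen max (s) | reward |
|---|---|---|---|---|---|---|
| in-process | 32 | 156s | 68.7 | 81.7 | 131.5 | 1.00 |
| pool-4 | 32 | 113s | 68.2 | 78.7 | 103.8 | 1.00 |
| in-process | 64 | 167s | 70.8 | 95.5 | 140.0 | 0.98 |
| pool-4 | 64 | 181s | 74.1 | 94.3 | 171.3 | 1.00 |
| in-process | 128 | 221s | 75.6 | 104.5 | 176.2 | 1.00 |
| pool-4 | 128 | 121s | 68.3 | 84.7 | 110.6 | 0.99 |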

e2e speedup (in-process → 4-worker pool): r=32 1.38×, r=64 0.92×, r=128 1.83×. The win is largest at the highest concurrency: at r=128 the single event loop starves under 128 concurrent agent loops + verifier scoring (a 176s straggler), while the pool spreads them across 4 workers (max 111s), cutting e2e from 221s to 121s. r=64 is within straggler noise (gen p50/p90 are ≈equal, and the pool run has a 171s straggler). Reward ≈parity, 0 errors in every cell.
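
The speedup figures are simply the in-process e2e time divided by the pool e2e time for the same `-r`. A short check using the e2e values from the agentic table (the variable and function names are illustrative only):

```python
# e2e seconds from the agentic table: rollouts -> (in-process, pool-4)
E2E_SECONDS = {32: (156, 113), 64: (167, 181), 128: (221, 121)}


def e2e_speedup(in_process_s: float, pool_s: float) -> float:
    """Values above 1.0 mean the pool finished faster than the in-process server."""
    return in_process_s / pool_s


speedups = {
    rollouts: round(e2e_speedup(in_process_s, pool_s), 2)
    for rollouts, (in_process_s, pool_s) in E2E_SECONDS.items()
}
print(speedups)
```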

Single-turn — `gsm8k-v1`, subprocess, deepseek-v4-flash, max_tokens=1024

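| mode | `-r` | e2e | gen p50 (s) | gen p90 (s) | gen max (s) | reward |
|---|---|---|---|---|---|---|
| in-process | 32 | 11s | 4.2 | 5.1 | 9.2 | 1.00 |
| pool-4 | 32 | 9s | 3.9 | 4.9 | 5.7 | 0.97 |
| in-process | 64 | 14s | 4.4 | 6.2 | 10.9 | 0.98 |
| pool-4 | 64 | 10s | 4.2 | 5.3 | 6.7 | 1.00 |
| in-process | 128 | 12s | 5.7 | 8.1 | 9.9 | 0.99 |
| pool-4 | 128 | 16s | 6.5 | 7.8 | 12.5 | 1.00 |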

Near-neutral: single-turn rollouts are light (gen p50 ~4–6s), so e2e is small (9–16s) and dominated by noise + worker-spawn overhead (r=32/64 slight pool win, r=128 slight pool loss — all within a few seconds). gen p50/p90 ≈parity, reward ≈parity, 0 errors. The pool's payoff is on agentic (heavy per-rollout) work, not single-turn (light).

Verification

gsm8k-v1 (v1) and echo-v0 (legacy) eval through a 2-worker pool match the in-process baseline (reward 1.0, 0 errors).
Agentic + single-turn matrices above. ruff clean.

Note

Medium Risk
New multiprocessing/ZMQ broker path affects how eval and serve run rollouts and teardown (SIGTERM, worker lifecycle); wire protocol is unchanged but concurrent sandbox usage scales with worker count.

Overview
Adds a v1 env-server worker pool (ZMQ ROUTER broker + N spawned EnvServer/LegacyEnvServer workers with least-busy routing) and wires it through serve_env, the serve CLI (EnvServerConfig.num_workers), and eval --num-workers via new run_eval_server (spawns the pool, drives rollouts over EnvClient; per-rollout requests for independent tasksets, run_group when requires_group_scoring).

Bench shifts from comparing runtimes × batch size to in-process (workers=0) vs pool at rollout counts: shared bench_aggregate.py, updated benchmark.sh, new agentic_benchmark.sh; removes aggregate.py, plot.py, and committed benchmark.json.

Reviewed by Cursor Bugbot for commit 689a79b.

Note

Add worker pool support to env-server with router-based dispatch

Introduces EnvServerPool that binds a ZMQ ROUTER socket and dispatches requests to N worker processes, each running an EnvServer behind a DEALER socket with least-busy routing.
Adds serve_env as a unified entrypoint that starts either a pooled server (num_workers > 1) or a single in-process server, with consistent SIGTERM handling and optional address announcement via a queue.
Extends the eval CLI to dispatch rollouts via a spawned env server when --num_workers/-w > 0, coordinating requests through EnvClient with optional concurrency throttling via asyncio.Semaphore.
Reworks benchmark scripts to compare performance across worker-pool sizes, replacing the old runtime/size loop with workers/rollouts dimensions and aggregating results via bench/bench_aggregate.py.
Risk: worker processes reconstruct EnvConfig from a minimal picklable dict; mismatches between serialized config fields and runtime expectations could cause silent misconfiguration.

Macroscope summarized 689a79b.

Reinstate v0's env-server parallelism for v1: a ROUTER broker (serve/pool.py) load-balances requests across `num_workers` worker processes, each an ordinary EnvServer/LegacyEnvServer bound to an ipc address, sending each request to the least-busy worker and relieving the single event loop (CPU-bound tokenization/scoring). Same wire protocol as a lone server, so EnvClient is unchanged; works for native v1 and the v0 bridge.

- serve_env(num_workers, ...): a single in-process server (<=1) or the pool (>1).
- EvalConfig.num_workers (default 0 = in-process); `eval --num-workers N` runs the eval through the pool (run_eval_server) for both v1 and legacy v0. This is the same path prime-rl trains through, so it exercises the pool e2e.
- EnvServerConfig.num_workers; the `serve` CLI routes through serve_env.

Workers start spawn-style, with a death pipe (orphan self-exit). TODO: restart-on-death and stats/lag monitors (v0 had them; rollout errors return as data, not crashes).

Validated: gsm8k-v1 (v1) and echo-v0 (legacy) eval through a 2-worker pool match the in-process baseline (reward 1.0, 0 errors).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The server-backed eval grouped a task's rollouts into one run_group whenever num_rollouts>1, so they all landed on a single worker (group scoring needs the group together), which defeated the pool. Now only a group-scored taskset uses run_group; otherwise each rollout is a separate run_rollout request that the broker distributes across workers (least-busy). Mirrors the prime-rl dispatcher.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
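
The dispatch rule in this commit message can be sketched as a small planning function. This is an illustration under assumptions, not the body of run_eval_server: `plan_units` and its tuple layout are hypothetical, and only the request names run_group and run_rollout come from the PR.

```python
def plan_units(
    task_idxs: list[int], num_rollouts: int, group_scored: bool
) -> list[tuple[str, int, int]]:
    """Return (request_kind, task_index, rollouts_in_request) for each request to send."""
    if group_scored:
        # Group scoring needs all rollouts of a task together, so each task is
        # one request and therefore lands on one worker.
        return [("run_group", i, num_rollouts) for i in task_idxs]
    # Independent rollouts: one request per rollout, so the broker is free to
    # place each one on a different worker.
    return [("run_rollout", i, 1) for i in task_idxs for _ in range(num_rollouts)]
```

With two tasks and three rollouts each, the independent case produces six requests, while the group-scored case produces two.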

agentic_benchmark.sh runs one harbor task (default: fix-git, terminal-bench-2) at group sizes (-r) across env-server modes (in-process vs N-worker pool), writing per-rollout durations + e2e; agentic_aggregate.py summarizes. With no group reward the rollouts are independent, so the pool round-robins them across workers, stressing concurrent agentic execution + verifier scoring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Repurpose benchmark.sh as the single-turn counterpart of agentic_benchmark.sh: compare in-process vs N-worker pool at group sizes (-r, rollouts of one gsm8k task), with RUNTIME as a knob (default subprocess). Rename agentic_aggregate.py -> bench_aggregate.py (shared by both, generic label). Remove the superseded runtime benchmark (aggregate.py / plot.py / benchmark.json): RUNTIME is now a benchmark.sh knob, and the pool comparison is the current focus.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.


cursor · 2026-06-11T07:05:59Z

+            ]
+        results = await asyncio.gather(*units)
+        await client.close()
+        return [trace for unit_traces in results for trace in unit_traces]


Pool eval skips shared tools

Medium Severity

run_eval_server drives rollouts only through EnvClient and never enters Environment.shared_tools, unlike in-process run_eval. Tasksets with tools.shared=True therefore rebuild expensive shared tool servers per rollout instead of once per eval, changing behavior and load versus the default eval path.


cursor · 2026-06-11T07:05:59Z

+            units = [
+                run_rollout_unit(i) for i in idxs for _ in range(config.num_rollouts)
+            ]
+        results = await asyncio.gather(*units)


Group pool ignores rollout concurrency

Medium Severity

For @group_reward tasksets, --num-workers eval holds max_concurrent for an entire run_group RPC while the worker runs all group rollouts concurrently with no per-rollout limit. In-process eval acquires the semaphore once per rollout, so global concurrency can far exceed --max-concurrent.
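
The difference in semaphore granularity described in this comment can be reproduced with a toy asyncio model. This is a sketch, not the eval runner: `peak_concurrency` and its parameters are hypothetical, and rollouts are simulated with a short sleep.

```python
import asyncio


async def peak_concurrency(
    per_rollout: bool, groups: int = 4, group_size: int = 8, limit: int = 2
) -> int:
    """Return the highest number of simulated rollouts running at the same time."""
    sem = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def rollout() -> None:
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stand-in for real rollout work
        running -= 1

    async def limited_rollout() -> None:
        # In-process style: the semaphore is held once per rollout.
        async with sem:
            await rollout()

    async def group() -> None:
        if per_rollout:
            await asyncio.gather(*(limited_rollout() for _ in range(group_size)))
        else:
            # Pool style for group scoring: one slot covers the whole group,
            # and the rollouts inside it run without any further limit.
            async with sem:
                await asyncio.gather(*(rollout() for _ in range(group_size)))

    await asyncio.gather(*(group() for _ in range(groups)))
    return peak
```

With the defaults, the per-rollout variant peaks at `limit` concurrent rollouts, while the per-group variant peaks at `limit * group_size`, which is the overshoot the comment points out.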


cursor · 2026-06-11T07:05:59Z

+    ):  # server-backed: a worker pool runs rollouts (v1 or legacy)
+        from verifiers.v1.cli.runner import run_eval_server
+
+        traces = asyncio.run(run_eval_server(config))


Worker pool docs not updated

Low Severity

This PR adds v1 uv run eval --num-workers and serve --num-workers worker-pool behavior, but no updates appear in docs/ (for example docs/faqs.md or evaluation-related sections) describing the new flags, defaults, or how they differ from in-process eval.

Additional Locations (1)

verifiers/v1/cli/serve.py#L67-L73

Triggered by project rule: BugBot Instructions


macroscopeapp · 2026-06-11T07:06:29Z

Approvability

Verdict: Needs human review

This PR introduces a new worker pool architecture with significant new multiprocessing/ZMQ infrastructure and execution paths. Two unresolved medium-severity comments identify behavioral differences between pool and in-process modes (shared_tools handling, concurrency semantics) that warrant human review.


mikasenghaas and others added 4 commits June 11, 2026 05:54

mikasenghaas marked this pull request as ready for review June 11, 2026 07:02

mikasenghaas merged commit be76cbc into feat/nano-as-v1 Jun 11, 2026
5 checks passed

cursor Bot reviewed Jun 11, 2026

mikasenghaas merged 4 commits into feat/nano-as-v1 from feat/v1-env-workers
