
feat: env-server worker pool for v1 (router + N workers) #1623

Merged
mikasenghaas merged 4 commits into feat/nano-as-v1 from feat/v1-env-workers
Jun 11, 2026

@mikasenghaas mikasenghaas commented Jun 11, 2026

Member

Summary

Reinstates v0's env-server parallelism for v1. A lone EnvServer runs every rollout as an asyncio.Task on one event loop, so CPU-bound work (renderer tokenization, scoring) competes for that loop; v0 relieved this with a router + worker pool. This brings it back.

  • serve/pool.py: EnvServerPool, a ROUTER broker that load-balances requests to the least-busy of num_workers worker processes (each an ordinary EnvServer / LegacyEnvServer on its own ipc address, over a DEALER). Same wire protocol as a lone server → EnvClient unchanged; replies routed back by request_id. Works for native v1 and the v0 bridge.
  • serve_env(num_workers, …) — single in-process server when num_workers <= 1, else the pool. Used by the serve CLI, eval, and prime-rl.
  • Eval: EvalConfig.num_workers (default 0 = in-process); eval --num-workers N runs the eval through the pool (run_eval_server).
  • Round-robin granularity: run_eval_server issues one run_rollout per rollout (broker round-robins across workers) for independent tasksets; only a @group_reward taskset uses run_group (one worker, cross-rollout scoring). Mirrors the prime-rl dispatcher.
  • EnvServerConfig.num_workers for the serve CLI.
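
The least-busy routing in the first bullet can be sketched without any sockets. This is a minimal illustration, not the code in serve/pool.py: LeastBusyRouter, dispatch, and complete are hypothetical names, and it assumes the broker keeps one in-flight counter per worker and remembers which worker owns each request_id.

```python
from dataclasses import dataclass, field


@dataclass
class LeastBusyRouter:
    """Toy stand-in for the broker's bookkeeping (hypothetical, no sockets)."""

    num_workers: int
    in_flight: list[int] = field(init=False)
    owner: dict[str, int] = field(init=False, default_factory=dict)

    def __post_init__(self) -> None:
        self.in_flight = [0] * self.num_workers

    def dispatch(self, request_id: str) -> int:
        # Pick the worker with the fewest outstanding requests; ties go to the lowest index.
        worker = min(range(self.num_workers), key=self.in_flight.__getitem__)
        self.in_flight[worker] += 1
        self.owner[request_id] = worker
        return worker

    def complete(self, request_id: str) -> int:
        # A reply is matched back to its worker by request_id, freeing one slot.
        worker = self.owner.pop(request_id)
        self.in_flight[worker] -= 1
        return worker


router = LeastBusyRouter(num_workers=2)
assignments = [router.dispatch(f"req-{i}") for i in range(3)]
router.complete("req-0")
router.complete("req-2")  # worker 0 is idle again, worker 1 still holds req-1
next_worker = router.dispatch("req-3")
```

Here req-3 goes to worker 0 because it has drained, whereas a strict round-robin would have sent it to worker 1 regardless of load.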

Workers are spawned with a death pipe, so an orphaned worker exits on its own. Trimmed for now (TODO): per-worker restart-on-death and stats/lag monitors, which v0 had; rollout errors return as data, not crashes.
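
For readers unfamiliar with the death-pipe idiom: the parent keeps the write end of a pipe open, the worker watches the read end, and EOF means the parent is gone. The sketch below is an assumption about the general technique, not the pool's implementation; watch_death_pipe and on_parent_exit are hypothetical names, and closing the write end in the same process stands in for the parent dying.

```python
import os
import threading


def watch_death_pipe(read_fd: int, on_parent_exit) -> threading.Thread:
    """Call on_parent_exit once the write end of the pipe is closed (EOF)."""

    def _watch() -> None:
        # read() blocks while a writer is open and returns b"" once every
        # write end is closed, which the OS does when the parent process dies.
        while os.read(read_fd, 1):
            pass
        on_parent_exit()

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return thread


# Demo in one process: closing the write end stands in for the parent exiting.
read_fd, write_fd = os.pipe()
parent_gone = threading.Event()
watcher = watch_death_pipe(read_fd, parent_gone.set)
os.close(write_fd)
watcher.join(timeout=5)
os.close(read_fd)
```

In a real worker the callback would terminate the process rather than set an event.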

Benchmarks (bench/)

Both scripts compare env-server modes (in-process vs N-worker pool) across group sizes (-r, the number of rollouts of one task). They feed a shared aggregator that records per-rollout generation.duration (p10/p50/p90) plus e2e, reward, and errors. Because e2e is gated by the slowest rollout, the duration distribution is the fairer comparator.

  • benchmark.sh — single-turn (gsm8k-v1, subprocess): light per-rollout CPU, where the pool's fixed per-worker overhead is most visible against its event-loop relief.
  • agentic_benchmark.sh — agentic: one harbor task (default fix-git, terminal-bench-2) where each rollout carries a multi-turn agent + a verifier.
  • bench_aggregate.py — shared aggregator for both.
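
To make the straggler point concrete, here is a small sketch of the kind of summary the aggregator reports. It is not bench_aggregate.py: percentile and summarize are hypothetical helpers, linear interpolation is an assumed percentile method, and the durations are made up.

```python
def percentile(values: list[float], q: float) -> float:
    """Linearly interpolated percentile, q in [0, 100]."""
    if not values:
        raise ValueError("no durations to summarize")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(durations: list[float]) -> dict[str, float]:
    return {
        "p10": percentile(durations, 10),
        "p50": percentile(durations, 50),
        "p90": percentile(durations, 90),
        "max": max(durations),
    }


# Made-up durations (seconds): nine similar rollouts and one straggler.
durations = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 72.0, 74.0, 76.0, 150.0]
summary = summarize(durations)
```

The single 150 s straggler sets max (and therefore e2e) while p50 stays at 69 s, which is why the percentiles are reported alongside e2e.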

Agentic — fix-git (terminal-bench-2), prime runtime, deepseek-v4-flash, default+bash, max_turns=30

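| mode | -r | e2e (s) | gen p50 (s) | gen p90 (s) | gen max (s) | reward | errors |
|---|---|---|---|---|---|---|---|
| in-process | 32 | 156 | 68.7 | 81.7 | 131.5 | 1.00 | 0 |
| pool-4 | 32 | 113 | 68.2 | 78.7 | 103.8 | 1.00 | 0 |
| in-process | 64 | 167 | 70.8 | 95.5 | 140.0 | 0.98 | 0 |
| pool-4 | 64 | 181 | 74.1 | 94.3 | 171.3 | 1.00 | 0 |
| in-process | 128 | 221 | 75.6 | 104.5 | 176.2 | 1.00 | 0 |
| pool-4 | 128 | 121 | 68.3 | 84.7 | 110.6 | 0.99 | 0 |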

e2e speedup (in-process → 4-worker pool): r=32 1.38×, r=64 0.92×, r=128 1.83×. The win grows with concurrency: at r=128 the single event loop starves under 128 concurrent agent loops + verifier scoring (a 176s straggler), and the pool spreads them across 4 workers (max 110s) → 221s→121s. r=64 is within straggler noise (gen p50/p90 are ≈equal). Reward ≈parity, 0 errors in every cell.
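
The speedups quoted above are just the in-process e2e divided by the pool e2e. A quick check using the numbers from the agentic table:

```python
# e2e seconds copied from the agentic table: rollouts -> (in-process, pool-4)
e2e_seconds = {32: (156, 113), 64: (167, 181), 128: (221, 121)}

speedups = {r: round(base / pool, 2) for r, (base, pool) in e2e_seconds.items()}
for r, speedup in speedups.items():
    print(f"r={r}: {speedup:.2f}x")
```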

Single-turn — gsm8k-v1, subprocess, deepseek-v4-flash, max_tokens=1024

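| mode | -r | e2e (s) | gen p50 (s) | gen p90 (s) | gen max (s) | reward | errors |
|---|---|---|---|---|---|---|---|
| in-process | 32 | 11 | 4.2 | 5.1 | 9.2 | 1.00 | 0 |
| pool-4 | 32 | 9 | 3.9 | 4.9 | 5.7 | 0.97 | 0 |
| in-process | 64 | 14 | 4.4 | 6.2 | 10.9 | 0.98 | 0 |
| pool-4 | 64 | 10 | 4.2 | 5.3 | 6.7 | 1.00 | 0 |
| in-process | 128 | 12 | 5.7 | 8.1 | 9.9 | 0.99 | 0 |
| pool-4 | 128 | 16 | 6.5 | 7.8 | 12.5 | 1.00 | 0 |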

Near-neutral: single-turn rollouts are light (gen p50 ~4–6s), so e2e is small (9–16s) and dominated by noise + worker-spawn overhead (r=32/64 slight pool win, r=128 slight pool loss — all within a few seconds). gen p50/p90 ≈parity, reward ≈parity, 0 errors. The pool's payoff is on agentic (heavy per-rollout) work, not single-turn (light).

Verification

  • gsm8k-v1 (v1) and echo-v0 (legacy) eval through a 2-worker pool match the in-process baseline (reward 1.0, 0 errors).
  • Agentic + single-turn matrices above. ruff clean.

Note

Medium Risk
New multiprocessing/ZMQ broker path affects how eval and serve run rollouts and teardown (SIGTERM, worker lifecycle); wire protocol is unchanged but concurrent sandbox usage scales with worker count.

Overview
Adds a v1 env-server worker pool (ZMQ ROUTER broker + N spawned EnvServer/LegacyEnvServer workers with least-busy routing) and wires it through serve_env, the serve CLI (EnvServerConfig.num_workers), and eval --num-workers via new run_eval_server (spawns the pool, drives rollouts over EnvClient; per-rollout requests for independent tasksets, run_group when requires_group_scoring).
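
The dispatch granularity described here (per-rollout requests for independent tasksets, one run_group per task when group scoring is required) can be sketched as a planning step. plan_requests and the tuple layout are hypothetical and only illustrate how many requests the broker sees; they are not the wire protocol or the actual run_eval_server code.

```python
def plan_requests(task_ids, num_rollouts, requires_group_scoring):
    """Return (method, task_id, n_rollouts) tuples, one per request sent to the broker."""
    if requires_group_scoring:
        # Group scoring needs every rollout of a task on the same worker.
        return [("run_group", task_id, num_rollouts) for task_id in task_ids]
    # Independent rollouts: one request each, so the broker can spread them out.
    return [
        ("run_rollout", task_id, 1)
        for task_id in task_ids
        for _ in range(num_rollouts)
    ]


independent = plan_requests([0, 1], num_rollouts=4, requires_group_scoring=False)
grouped = plan_requests([0, 1], num_rollouts=4, requires_group_scoring=True)
```

With two tasks and four rollouts each, the independent plan yields eight requests the broker can balance, while the grouped plan yields two, each pinned to a single worker.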

Bench shifts from comparing runtimes × batch size to in-process (workers=0) vs pool at rollout counts: shared bench_aggregate.py, updated benchmark.sh, new agentic_benchmark.sh; removes aggregate.py, plot.py, and committed benchmark.json.

Reviewed by Cursor Bugbot for commit 689a79b. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Add worker pool support to env-server with router-based dispatch

  • Introduces EnvServerPool that binds a ZMQ ROUTER socket and dispatches requests to N worker processes, each running an EnvServer behind a DEALER socket with least-busy routing.
  • Adds serve_env as a unified entrypoint that starts either a pooled server (num_workers > 1) or a single in-process server, with consistent SIGTERM handling and optional address announcement via a queue.
  • Extends the eval CLI to dispatch rollouts via a spawned env server when --num_workers/-w > 0, coordinating requests through EnvClient with optional concurrency throttling via asyncio.Semaphore.
  • Reworks benchmark scripts to compare performance across worker-pool sizes, replacing the old runtime/size loop with workers/rollouts dimensions and aggregating results via bench/bench_aggregate.py.
  • Risk: worker processes reconstruct EnvConfig from a minimal picklable dict; mismatches between serialized config fields and runtime expectations could cause silent misconfiguration.
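
One way to turn the silent-misconfiguration risk in the last bullet into a loud failure is to reject unknown keys when the worker rebuilds its config. This is only a sketch under assumptions: WorkerConfig and rebuild_config are hypothetical, and the fields shown are placeholders rather than the real EnvConfig schema.

```python
import pickle
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WorkerConfig:
    """Hypothetical stand-in for the config a worker rebuilds; not the real EnvConfig."""

    env_id: str
    max_turns: int = 30


def rebuild_config(payload: dict) -> WorkerConfig:
    known = {f.name for f in fields(WorkerConfig)}
    unknown = set(payload) - known
    if unknown:
        # Fail fast instead of silently dropping fields the worker does not understand.
        raise ValueError(f"unexpected config fields: {sorted(unknown)}")
    return WorkerConfig(**payload)


# Round-trip through pickle, as a payload crossing a process boundary would.
payload = pickle.loads(pickle.dumps({"env_id": "gsm8k-v1", "max_turns": 8}))
config = rebuild_config(payload)
```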

Macroscope summarized 689a79b.

mikasenghaas and others added 4 commits June 11, 2026 05:54
Reinstate v0's env-server parallelism for v1: a ROUTER broker (serve/pool.py)
load-balances requests across `num_workers` worker processes — each an ordinary
EnvServer/LegacyEnvServer bound to an ipc address — to the least-busy worker,
relieving the single event loop (CPU-bound tokenization/scoring). Same wire
protocol as a lone server, so EnvClient is unchanged; works for native v1 and
the v0 bridge.

- serve_env(num_workers, ...): a single in-process server (<=1) or the pool (>1).
- EvalConfig.num_workers (default 0 = in-process); `eval --num-workers N` runs the
  eval through the pool (run_eval_server) for both v1 and legacy v0 — the same path
  prime-rl trains through, so it exercises the pool e2e.
- EnvServerConfig.num_workers; the `serve` CLI routes through serve_env.

Workers spawn-style with a death pipe (orphan self-exit). TODO: restart-on-death
and stats/lag monitors (v0 had them; rollout errors return as data, not crashes).

Validated: gsm8k-v1 (v1) and echo-v0 (legacy) eval through a 2-worker pool match
the in-process baseline (reward 1.0, 0 errors).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The server-backed eval grouped a task's rollouts into one run_group whenever
num_rollouts>1, so they all landed on a single worker (group scoring needs the
group together) — defeating the pool. Now only a group-scored taskset uses
run_group; otherwise each rollout is a separate run_rollout request the broker
round-robins (least-busy) across workers. Mirrors the prime-rl dispatcher.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

agentic_benchmark.sh runs one harbor task (default: fix-git, terminal-bench-2)
at group sizes (-r) across env-server modes (in-process vs N-worker pool),
writing per-rollout durations + e2e; agentic_aggregate.py summarizes. With no
group reward the rollouts are independent, so the pool round-robins them across
workers — stressing concurrent agentic execution + verifier scoring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Repurpose benchmark.sh to the single-turn counterpart of agentic_benchmark.sh:
compare in-process vs N-worker pool at group sizes (-r, rollouts of one gsm8k
task), RUNTIME a knob (default subprocess). Rename agentic_aggregate.py ->
bench_aggregate.py (shared by both, generic label). Remove the superseded
runtime benchmark (aggregate.py / plot.py / benchmark.json) — RUNTIME is now a
benchmark.sh knob, and the pool comparison is the current focus.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas marked this pull request as ready for review June 11, 2026 07:02
@mikasenghaas mikasenghaas merged commit be76cbc into feat/nano-as-v1 Jun 11, 2026
5 checks passed

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

]
results = await asyncio.gather(*units)
await client.close()
return [trace for unit_traces in results for trace in unit_traces]

Pool eval skips shared tools

Medium Severity

run_eval_server drives rollouts only through EnvClient and never enters Environment.shared_tools, unlike in-process run_eval. Tasksets with tools.shared=True therefore rebuild expensive shared tool servers per rollout instead of once per eval, changing behavior and load versus the default eval path.

units = [
run_rollout_unit(i) for i in idxs for _ in range(config.num_rollouts)
]
results = await asyncio.gather(*units)

Group pool ignores rollout concurrency

Medium Severity

For @group_reward tasksets, --num-workers eval holds max_concurrent for an entire run_group RPC while the worker runs all group rollouts concurrently with no per-rollout limit. In-process eval acquires the semaphore once per rollout, so global concurrency can far exceed --max-concurrent.
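
The difference between the two semaphore placements can be reproduced in isolation. The sketch below is not the PR's code: peak_concurrency is a hypothetical helper, the sleeps stand in for rollouts, and the group sizes and limit are illustrative values.

```python
import asyncio


async def peak_concurrency(
    num_groups: int, group_size: int, max_concurrent: int, per_rollout: bool
) -> int:
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def rollout() -> None:
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.05)  # stand-in for the actual rollout work
        running -= 1

    async def limited_rollout() -> None:
        async with semaphore:  # one permit per rollout
            await rollout()

    async def group() -> None:
        if per_rollout:
            await asyncio.gather(*(limited_rollout() for _ in range(group_size)))
        else:
            async with semaphore:  # one permit for the whole group request
                await asyncio.gather(*(rollout() for _ in range(group_size)))

    await asyncio.gather(*(group() for _ in range(num_groups)))
    return peak


per_rollout_peak = asyncio.run(peak_concurrency(4, 8, max_concurrent=2, per_rollout=True))
per_group_peak = asyncio.run(peak_concurrency(4, 8, max_concurrent=2, per_rollout=False))
```

With a limit of 2 and groups of 8, the per-rollout placement peaks at 2 concurrent rollouts, while holding the permit per group lets 2 × 8 = 16 run at once.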

Comment thread verifiers/v1/cli/eval.py
): # server-backed: a worker pool runs rollouts (v1 or legacy)
from verifiers.v1.cli.runner import run_eval_server

traces = asyncio.run(run_eval_server(config))

Worker pool docs not updated

Low Severity

This PR adds v1 uv run eval --num-workers and serve --num-workers worker-pool behavior, but no updates appear in docs/ (for example docs/faqs.md or evaluation-related sections) describing the new flags, defaults, or how they differ from in-process eval.


Triggered by project rule: BugBot Instructions

macroscopeapp Bot commented Jun 11, 2026

Approvability

Verdict: Needs human review

This PR introduces a new worker pool architecture with significant new multiprocessing/ZMQ infrastructure and execution paths. Two unresolved medium-severity comments identify behavioral differences between pool and in-process modes (shared_tools handling, concurrency semantics) that warrant human review.
