feat: step-level dense rewards for multi-turn environments #1554
zamal-db wants to merge 3 commits into
… dense rewards

Wire per-step reward signals into the multi-turn rollout loop, enabling dense credit assignment as described in the Kevin paper (arXiv:2507.11948).

Changes:
- verifiers/decorators.py: Add @vf.step_reward decorator (same overload pattern as @vf.reward) with weight and priority parameters.
- verifiers/envs/multiturn_env.py: Discover @vf.step_reward handlers at init and invoke them after each model response in the rollout loop.
- verifiers/utils/step_reward_utils.py: compute_discounted_returns() for gamma-discounted future returns, apply_step_advantages() for per-step advantage normalization across a group.
- verifiers/rubrics/step_reward_rubric.py: StepRewardRubric that uses step-reward sums as rollout reward and applies per-step discounted advantages during group scoring.
- verifiers/__init__.py: Export step_reward and StepRewardRubric.
- tests/test_step_rewards.py: 18 tests covering decorator, rollout integration, discounted returns, and rubric scoring.

Closes PrimeIntellect-ai#170
    "cleanup",
    "metric",
    "reward",
    "step_reward",
Missing user-facing documentation
Medium Severity
This PR exports step_reward, StepRewardRubric, and step-level rollout scoring as public API, but no updates appear under docs/ (for example docs/training.md or docs/reference.md) describing how to use them.
Additional Locations (1)
Triggered by project rule: BugBot Instructions
Reviewed by Cursor Bugbot for commit 7bcaad5.
- Use maybe_await in _apply_step_rewards so synchronous @step_reward handlers work without raising TypeError (matches rubric pattern)
- Add score_rollout override to StepRewardRubric for single-state scoring
- Add test for sync step_reward handler
- Add module and class docstrings to test file (codebase convention)
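The sync/async handling mentioned in this commit can be illustrated with a small stand-alone sketch. The helper below is an assumed shape for a `maybe_await`-style function, not the one shipped in verifiers (its real signature is not shown in this thread), and `sync_handler` / `async_handler` are hypothetical handlers.

```python
import asyncio
import inspect
from typing import Any, Callable


async def maybe_await(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call func; await the result only when it is awaitable (assumed helper shape)."""
    result = func(*args, **kwargs)
    if inspect.isawaitable(result):
        return await result
    return result


def sync_handler(turn: str) -> float:
    # Hypothetical synchronous step-reward handler.
    return 1.0 if "done" in turn else 0.0


async def async_handler(turn: str) -> float:
    # Hypothetical asynchronous step-reward handler.
    await asyncio.sleep(0)
    return 0.5


async def score_turn(turn: str) -> list[float]:
    # Both kinds of handler go through the same call path.
    return [await maybe_await(h, turn) for h in (sync_handler, async_handler)]


scores = asyncio.run(score_turn("done"))
```

Awaiting a plain float directly would raise TypeError, which is the failure mode this commit describes; routing every handler through one helper avoids branching at each call site.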
_apply_step_rewards now only runs when add_trajectory_step actually appended a step. Prevents misassigning rewards to the previous turn when ComposableEnv's keep_trajectory_step filter drops a step.
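A minimal sketch of the guard described in this commit, assuming the trajectory is a list on the state and that a filter may silently drop a step. All function names and the state layout here are hypothetical stand-ins, not the code in multiturn_env.py.

```python
from typing import Callable

Step = dict
State = dict


def add_trajectory_step(state: State, step: Step, keep: Callable[[Step], bool]) -> None:
    # Stand-in for a step filter: the step is appended only if the filter keeps it.
    if keep(step):
        state["trajectory"].append(step)


def apply_step_rewards(state: State, handlers: list[Callable[[Step], float]]) -> None:
    # Accumulate handler outputs onto the latest trajectory step.
    latest = state["trajectory"][-1]
    latest["reward"] = latest.get("reward", 0.0) + sum(h(latest) for h in handlers)


def record_turn(state: State, step: Step, keep, handlers) -> None:
    before = len(state["trajectory"])
    add_trajectory_step(state, step, keep)
    # Guard: score only when a step was actually appended, so a dropped
    # step cannot leak its reward onto the previous turn.
    if len(state["trajectory"]) > before:
        apply_step_rewards(state, handlers)


state: State = {"trajectory": []}
handlers = [lambda step: 1.0]
keep_visible = lambda step: not step.get("hidden", False)

record_turn(state, {"text": "turn 1"}, keep_visible, handlers)
record_turn(state, {"text": "turn 2", "hidden": True}, keep_visible, handlers)
```

Without the length check, the second call would add another 1.0 to turn 1, which is the misassignment the commit message describes.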
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit cb2c520.
    for state in states:
        state["reward"] = _sum_step_rewards(state)
        state["metrics"] = {"step_reward_sum": cast(float, state["reward"])}
StepRewardRubric ignores weight
Medium Severity
StepRewardRubric registers _sum_step_rewards with a weight on the base Rubric, but its overridden score_rollout and score_group assign the unweighted sum directly. A non-default weight therefore has no effect on state["reward"] or metrics, unlike other rubrics.
Reviewed by Cursor Bugbot for commit cb2c520.
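One way the weight could be honored is sketched below. This is an assumption about a possible fix, not the change made in the PR: the function names, the state layout, and the choice to keep the metric unweighted are all hypothetical.

```python
def sum_step_rewards(state: dict) -> float:
    # Sum per-step rewards, treating a missing or None reward as 0.0.
    return sum(step.get("reward") or 0.0 for step in state["trajectory"])


def score_group(states: list[dict], weight: float = 1.0) -> None:
    for state in states:
        raw = sum_step_rewards(state)
        # Apply the rubric weight to the rollout reward...
        state["reward"] = weight * raw
        # ...and record the raw sum as the metric (one possible convention).
        state["metrics"] = {"step_reward_sum": raw}


states = [{"trajectory": [{"reward": 0.5}, {"reward": 1.5}]}]
score_group(states, weight=0.5)
```

With weight=0.5 the rollout reward becomes 1.0 while the metric stays at the unweighted 2.0, so a non-default weight now changes state["reward"].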
All issues from Bugbot addressed in follow-up commits:
All 163 tests pass (19 step-reward + 144 existing), ruff lint/format clean.


Implements #170.
Problem: Multi-turn RL training currently assigns a single scalar reward at the end of an episode. Every intermediate turn gets the same signal, making credit assignment across steps impossible.
Solution: Per-step reward infrastructure that lets you assign dense rewards at each turn, then compute discounted returns and normalize advantages across the group before training.
What this adds:
- @vf.step_reward decorator - register handler functions that score each turn during rollout (same pattern as @vf.reward but called per-step, not end-of-episode)
- _apply_step_rewards() in MultiTurnEnv rollout loop - calls step_reward handlers after each model response, accumulates via state.add_step_reward()
- compute_discounted_returns(rewards, gamma) utility - backward pass computing R_t = r_t + gamma * R_{t+1}
- apply_step_advantages(states, gamma) utility - extracts per-step rewards from trajectory, computes discounted returns, normalizes advantages (zero-mean unit-variance) across all steps in the group
- StepRewardRubric(gamma, weight) - ready-made rubric that combines episodic sum-of-step-rewards with per-step discounted advantages, wired into the existing rubric pipeline

Based on: the Kevin paper (arXiv:2507.11948) - per-turn training samples with gamma-discounted future returns and advantage normalization across m*n samples.
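The two utilities listed above can be sketched in plain Python. This is an independent re-implementation of the recurrence R_t = r_t + gamma * R_{t+1} and of zero-mean unit-variance normalization, not the code in step_reward_utils.py. The name normalize_group_advantages, the eps guard, and the use of population variance are assumptions.

```python
import math


def compute_discounted_returns(rewards: list[float], gamma: float) -> list[float]:
    """Backward pass: R_t = r_t + gamma * R_{t+1}, with R_T = 0 past the last step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def normalize_group_advantages(
    group_returns: list[list[float]], eps: float = 1e-8
) -> list[list[float]]:
    """Normalize returns across every step of every rollout in the group."""
    flat = [r for returns in group_returns for r in returns]
    if not flat:
        return [[] for _ in group_returns]
    mean = sum(flat) / len(flat)
    # Population variance; eps avoids division by zero when all returns match.
    std = math.sqrt(sum((r - mean) ** 2 for r in flat) / len(flat))
    return [[(r - mean) / (std + eps) for r in returns] for returns in group_returns]


# Two rollouts of different lengths in one group.
returns_a = compute_discounted_returns([1.0, 0.0, 2.0], gamma=0.5)  # [1.5, 1.0, 2.0]
returns_b = compute_discounted_returns([0.0, 0.5], gamma=0.5)       # [0.25, 0.5]
advantages = normalize_group_advantages([returns_a, returns_b])
```

Pooling all steps before normalizing means a step is compared against every other step in the group rather than only against steps in its own rollout.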
Tests: 18 new tests covering discounted returns math, advantage normalization, decorator behavior, rollout integration, and rubric scoring. All existing tests unaffected.
Note
Medium Risk
Changes rollout scoring semantics when StepRewardRubric is used (trajectory step rewards become returns) and adds hooks on every turn in MultiTurnEnv; core training path but opt-in via decorators and rubric.
Overview
Adds per-turn dense rewards for multi-turn RL: environments can register @vf.step_reward handlers (with weight and priority) that run after each model turn, and training can aggregate them with discounted returns and group-normalized advantages.

MultiTurnEnv discovers decorated handlers at init and, after each new trajectory step, calls them via maybe_await and accumulates into the latest step with state.add_step_reward(). New utilities compute_discounted_returns and apply_step_advantages implement γ-discounted returns and zero-mean unit-variance advantages across all steps in a rollout group.

StepRewardRubric sums step rewards for rollout scoring and, on score_group, runs advantage normalization, overwriting each step's reward with discounted returns and setting per-step and rollout-level advantage.

Public API exports step_reward and StepRewardRubric; tests/test_step_rewards.py covers math, decorator metadata, rollout integration (including sync handlers), and rubric behavior.

Reviewed by Cursor Bugbot for commit cb2c520.
Note
Add step-level dense rewards and discounted advantage computation to multi-turn environments
- Adds @step_reward decorator in decorators.py that marks methods with priority/weight metadata, enabling discovery by MultiTurnEnv via discover_decorated.
- Adds _apply_step_rewards to MultiTurnEnv, which is called after each trajectory step to invoke all registered handlers and accumulate weighted rewards via state.add_step_reward.
- Adds compute_discounted_returns (backward gamma-discounted accumulation) and apply_step_advantages (normalizes returns across a group and writes per-step advantages and rollout-level mean advantage).
- Adds StepRewardRubric in step_reward_rubric.py, which sums step rewards into rollout reward and calls apply_step_advantages with configurable gamma in score_group.
- New symbols (step_reward, StepRewardRubric) are exported from the top-level verifiers package.

Macroscope summarized cb2c520.
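The decorator-plus-discovery pattern summarized above can be sketched as follows. This is a self-contained illustration, not the code in decorators.py: the attribute names (_step_reward, _weight, _priority), the helper discover_step_rewards, the ToyEnv class, and the choice to run higher priorities first are all assumptions.

```python
from typing import Any, Callable, Optional


def step_reward(
    func: Optional[Callable[..., Any]] = None, *, weight: float = 1.0, priority: int = 0
):
    """Usable bare (@step_reward) or with arguments (@step_reward(weight=2.0))."""

    def mark(f: Callable[..., Any]) -> Callable[..., Any]:
        # Attach metadata to the function so it can be discovered later.
        f._step_reward = True
        f._weight = weight
        f._priority = priority
        return f

    return mark(func) if func is not None else mark


def discover_step_rewards(obj: Any) -> list[Callable[..., Any]]:
    # Collect bound methods carrying the marker; higher priority first (assumed order).
    found = [
        getattr(obj, name)
        for name in dir(obj)
        if callable(getattr(obj, name)) and getattr(getattr(obj, name), "_step_reward", False)
    ]
    return sorted(found, key=lambda h: -h._priority)


class ToyEnv:
    @step_reward
    def brevity(self, turn: str) -> float:
        return 1.0 if len(turn) < 20 else 0.0

    @step_reward(weight=2.0, priority=10)
    def has_answer(self, turn: str) -> float:
        return 1.0 if "answer" in turn else 0.0


handlers = discover_step_rewards(ToyEnv())
names = [h.__name__ for h in handlers]
total = sum(h._weight * h("answer: 4") for h in handlers)
```

Storing weight and priority on the function itself keeps registration declarative: the environment only has to scan its own methods at init, with no separate registry to maintain.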