feat: step-level dense rewards for multi-turn environments by zamal-db · Pull Request #1554 · PrimeIntellect-ai/verifiers

zamal-db · 2026-06-05T19:54:48Z

Implements #170.

Problem: Multi-turn RL training currently assigns a single scalar reward at the end of an episode. Every intermediate turn gets the same signal, making credit assignment across steps impossible.

Solution: Per-step reward infrastructure that lets you assign dense rewards at each turn, then compute discounted returns and normalize advantages across the group before training.

What this adds:

@vf.step_reward decorator - register handler functions that score each turn during rollout (same pattern as @vf.reward but called per-step, not end-of-episode)
_apply_step_rewards() in the MultiTurnEnv rollout loop - calls step_reward handlers after each model response, accumulates via state.add_step_reward()
compute_discounted_returns(rewards, gamma) utility - backward pass computing R_t = r_t + gamma * R_{t+1}
apply_step_advantages(states, gamma) utility - extracts per-step rewards from trajectory, computes discounted returns, normalizes advantages (zero-mean unit-variance) across all steps in the group
StepRewardRubric(gamma, weight) - ready-made rubric that combines episodic sum-of-step-rewards with per-step discounted advantages, wired into the existing rubric pipeline
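The backward recurrence R_t = r_t + gamma * R_{t+1} listed above can be sketched as follows. This is a minimal illustration, not the PR's compute_discounted_returns; the function name discounted_returns and the convention that the return past the final step is zero are assumptions.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Hypothetical helper: R_t = r_t + gamma * R_{t+1}, with R = 0 past the last step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step sees the return of its successor.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

With gamma = 0.5 the last step keeps its own reward (2.0), the middle step receives half of it (1.0), and the first step receives its own reward plus half of the middle step's return (1.5).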

Based on: Kevin et al. (arXiv:2507.11948) - per-turn training samples with gamma-discounted future returns and advantage normalization across m*n samples.

Tests: 18 new tests covering discounted returns math, advantage normalization, decorator behavior, rollout integration, and rubric scoring. All existing tests unaffected.

Note

Medium Risk
Changes rollout scoring semantics when StepRewardRubric is used (trajectory step rewards become returns) and adds hooks on every turn in MultiTurnEnv; core training path but opt-in via decorators and rubric.

Overview
Adds per-turn dense rewards for multi-turn RL: environments can register @vf.step_reward handlers (with weight and priority) that run after each model turn, and training can aggregate them with discounted returns and group-normalized advantages.

MultiTurnEnv discovers decorated handlers at init and, after each new trajectory step, calls them via maybe_await and accumulates into the latest step with state.add_step_reward(). New utilities compute_discounted_returns and apply_step_advantages implement γ-discounted returns and zero-mean unit-variance advantages across all steps in a rollout group. StepRewardRubric sums step rewards for rollout scoring and, on score_group, runs advantage normalization—overwriting each step’s reward with discounted returns and setting per-step and rollout-level advantage.
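To make "zero-mean unit-variance advantages across all steps in a rollout group" concrete, here is a rough sketch. It is not the PR's apply_step_advantages: the function name, the use of population variance, and the eps stabilizer are all assumptions.

```python
import math
from typing import List


def normalize_group_advantages(
    group_returns: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    """Hypothetical helper: normalize per-step returns over every step in the group."""
    # Pool the returns from all rollouts so statistics span the whole group.
    flat = [r for rollout in group_returns for r in rollout]
    if not flat:
        return [[] for _ in group_returns]
    mean = sum(flat) / len(flat)
    var = sum((r - mean) ** 2 for r in flat) / len(flat)  # population variance (assumption)
    std = math.sqrt(var)
    # eps avoids division by zero when every return in the group is identical.
    return [[(r - mean) / (std + eps) for r in rollout] for rollout in group_returns]


# Two rollouts of different lengths; the output keeps the same shape.
print(normalize_group_advantages([[1.0, 3.0], [2.0]]))
```

Pooling across rollouts means a step is judged relative to every other step in the group, not only to steps within its own trajectory.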

Public API exports step_reward and StepRewardRubric; tests/test_step_rewards.py covers math, decorator metadata, rollout integration (including sync handlers), and rubric behavior.

Reviewed by Cursor Bugbot for commit cb2c520. Bugbot is set up for automated code reviews on this repo.

Note

Add step-level dense rewards and discounted advantage computation to multi-turn environments

Introduces a @step_reward decorator in decorators.py that marks methods with priority/weight metadata, enabling discovery by MultiTurnEnv via discover_decorated.
Adds _apply_step_rewards to MultiTurnEnv, which is called after each trajectory step to invoke all registered handlers and accumulate weighted rewards via state.add_step_reward.
Adds step_reward_utils.py with compute_discounted_returns (backward gamma-discounted accumulation) and apply_step_advantages (normalizes returns across a group and writes per-step advantages and rollout-level mean advantage).
Adds StepRewardRubric in step_reward_rubric.py, which sums step rewards into rollout reward and calls apply_step_advantages with configurable gamma in score_group.
All new symbols (step_reward, StepRewardRubric) are exported from the top-level verifiers package.
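The decorator-plus-discovery pattern summarized above can be illustrated with a self-contained toy. Everything here is hypothetical: step_reward_sketch, the _step_reward_meta attribute, discover_step_rewards, and the choice to run higher-priority handlers first are assumptions, and the real @vf.step_reward and discover_decorated may differ.

```python
from typing import Any, Callable, List, Optional


def step_reward_sketch(
    func: Optional[Callable] = None, *, weight: float = 1.0, priority: int = 0
):
    """Hypothetical decorator: usable bare or with keyword arguments."""

    def mark(f: Callable) -> Callable:
        # Attach metadata so the environment can find the handler later.
        f._step_reward_meta = {"weight": weight, "priority": priority}
        return f

    return mark(func) if func is not None else mark


def discover_step_rewards(obj: Any) -> List[Callable]:
    """Collect marked bound methods; higher priority first (an assumption)."""
    handlers = []
    for name in dir(obj):
        attr = getattr(obj, name, None)
        if callable(attr) and hasattr(attr, "_step_reward_meta"):
            handlers.append(attr)
    return sorted(
        handlers, key=lambda h: h._step_reward_meta["priority"], reverse=True
    )


class ToyEnv:
    @step_reward_sketch
    def brevity(self, response: str) -> float:
        return 1.0 if len(response) < 20 else 0.0

    @step_reward_sketch(weight=0.5, priority=10)
    def politeness(self, response: str) -> float:
        return 1.0 if "please" in response else 0.0


env = ToyEnv()
total = sum(
    h._step_reward_meta["weight"] * h("please stop")
    for h in discover_step_rewards(env)
)
print(total)  # 1.5
```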

Macroscope summarized cb2c520.

@vf

… dense rewards

Wire per-step reward signals into the multi-turn rollout loop, enabling dense credit assignment as described in the Kevin paper (arXiv:2507.11948).

Changes:
- verifiers/decorators.py: Add @vf.step_reward decorator (same overload pattern as @vf.reward) with weight and priority parameters.
- verifiers/envs/multiturn_env.py: Discover @vf.step_reward handlers at init and invoke them after each model response in the rollout loop.
- verifiers/utils/step_reward_utils.py: compute_discounted_returns() for gamma-discounted future returns, apply_step_advantages() for per-step advantage normalization across a group.
- verifiers/rubrics/step_reward_rubric.py: StepRewardRubric that uses step-reward sums as rollout reward and applies per-step discounted advantages during group scoring.
- verifiers/__init__.py: Export step_reward and StepRewardRubric.
- tests/test_step_rewards.py: 18 tests covering decorator, rollout integration, discounted returns, and rubric scoring.

Closes PrimeIntellect-ai#170

cursor · 2026-06-05T19:57:05Z

    "cleanup",
    "metric",
    "reward",
+    "step_reward",


Missing user-facing documentation

Medium Severity

This PR exports step_reward, StepRewardRubric, and step-level rollout scoring as public API, but no updates appear under docs/ (for example docs/training.md or docs/reference.md) describing how to use them.

Additional Locations (1)

verifiers/decorators.py#L268-L284

Triggered by project rule: BugBot Instructions

Reviewed by Cursor Bugbot for commit 7bcaad5.

- Use maybe_await in _apply_step_rewards so synchronous @step_reward handlers work without raising TypeError (matches rubric pattern)
- Add score_rollout override to StepRewardRubric for single-state scoring
- Add test for sync step_reward handler
- Add module and class docstrings to test file (codebase convention)
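For readers unfamiliar with the pattern, a helper in the spirit of maybe_await might look like the sketch below. The name maybe_await_sketch and its signature are assumptions; the library's own maybe_await may be defined differently.

```python
import asyncio
import inspect
from typing import Any, Callable, List


async def maybe_await_sketch(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Hypothetical helper: call func, awaiting the result only if it is awaitable."""
    result = func(*args, **kwargs)
    if inspect.isawaitable(result):
        return await result
    return result


def sync_handler(text: str) -> float:
    return float(len(text))


async def async_handler(text: str) -> float:
    await asyncio.sleep(0)  # stand-in for real async work
    return 1.0


async def main() -> List[float]:
    # Both handler kinds go through the same call site without a TypeError.
    return [await maybe_await_sketch(h, "abc") for h in (sync_handler, async_handler)]


scores = asyncio.run(main())
print(scores)  # [3.0, 1.0]
```

Awaiting a plain float directly would raise TypeError, which is the failure mode the commit describes for synchronous handlers.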

_apply_step_rewards now only runs when add_trajectory_step actually appended a step. Prevents misassigning rewards to the previous turn when ComposableEnv's keep_trajectory_step filter drops a step.
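The guard described here can be sketched in isolation. This is an illustration under assumptions, not the code in multiturn_env.py: record_turn, keep_step, and score_step are hypothetical names, and the trajectory is modeled as a plain list of dicts.

```python
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]


def record_turn(
    trajectory: List[Step],
    step: Step,
    keep_step: Callable[[Step], bool],
    score_step: Callable[[Step], float],
) -> bool:
    """Hypothetical helper: score a turn only if its step was actually appended."""
    length_before = len(trajectory)
    if keep_step(step):  # stand-in for a filter such as keep_trajectory_step
        trajectory.append(step)
    if len(trajectory) == length_before:
        # The step was dropped; scoring now would credit the previous turn.
        return False
    latest = trajectory[-1]
    latest["reward"] = latest.get("reward", 0.0) + score_step(latest)
    return True


def keep(step: Step) -> bool:
    return step["tokens"] > 0


trajectory: List[Step] = []
first = record_turn(trajectory, {"tokens": 12}, keep, lambda s: 1.0)
second = record_turn(trajectory, {"tokens": 0}, keep, lambda s: 5.0)
print(first, second, trajectory)  # True False [{'tokens': 12, 'reward': 1.0}]
```

Without the length check, the second call would have added 5.0 to the first step's reward.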

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit cb2c520.

cursor · 2026-06-05T20:09:38Z

+
+        for state in states:
+            state["reward"] = _sum_step_rewards(state)
+            state["metrics"] = {"step_reward_sum": cast(float, state["reward"])}


StepRewardRubric ignores weight

Medium Severity

StepRewardRubric registers _sum_step_rewards with a weight on the base Rubric, but its overridden score_rollout and score_group assign the unweighted sum directly. A non-default weight therefore has no effect on state["reward"] or metrics, unlike other rubrics.
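One possible way to make the weight take effect is sketched below. It is only an illustration of the reported issue, not a fix taken from the PR: apply_weighted_step_sum, the "trajectory" key, and the choice to keep the unweighted sum in metrics are assumptions.

```python
from typing import Any, Dict


def apply_weighted_step_sum(state: Dict[str, Any], weight: float = 1.0) -> None:
    """Hypothetical helper: scale the summed step rewards by the rubric weight."""
    # Assumes each trajectory step is a dict with an optional "reward" entry.
    raw_sum = sum(step.get("reward", 0.0) for step in state.get("trajectory", []))
    state["reward"] = weight * raw_sum
    # Keeping the unweighted value as a metric is a design choice, not source behavior.
    state["metrics"] = {"step_reward_sum": raw_sum}


state = {"trajectory": [{"reward": 0.5}, {"reward": 0.25}]}
apply_weighted_step_sum(state, weight=2.0)
print(state["reward"], state["metrics"])  # 1.5 {'step_reward_sum': 0.75}
```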

Reviewed by Cursor Bugbot for commit cb2c520.

zamal-db · 2026-06-05T20:11:28Z

All issues from Bugbot addressed in follow-up commits:

Sync step_reward handlers crash - Fixed. _apply_step_rewards now uses maybe_await (same pattern as rubric scoring) so both sync and async handlers work. Added a dedicated test.
Dropped steps misassign rewards - Fixed. The rollout loop now tracks trajectory length before add_model_response and only calls _apply_step_rewards if a step was actually appended. Safe when ComposableEnv.keep_trajectory_step drops a step.
Step rubric skips rollout scoring - Fixed. Added score_rollout override to StepRewardRubric for single-state scoring paths (v1 runtime, RubricGroup).
Missing user-facing documentation - This repo does not have a docs/ directory with API reference pages. Usage is documented in-code via docstrings and the README. Happy to add an example to the README if maintainers want it in a follow-up.

All 163 tests pass (19 step-reward + 144 existing), ruff lint/format clean.

fix: guard step rewards against dropped trajectory steps

cb2c520

zamal-db wants to merge 3 commits into PrimeIntellect-ai:main from zamal-db:feat/step-level-rewards
