feat: step-level dense rewards for multi-turn environments #1554

Open
zamal-db wants to merge 3 commits into PrimeIntellect-ai:main from zamal-db:feat/step-level-rewards

zamal-db commented Jun 5, 2026

Implements #170.

Problem: Multi-turn RL training currently assigns a single scalar reward at the end of an episode. Every intermediate turn gets the same signal, making credit assignment across steps impossible.

Solution: Per-step reward infrastructure that lets you assign dense rewards at each turn, then compute discounted returns and normalize advantages across the group before training.

What this adds:

  1. @vf.step_reward decorator - register handler functions that score each turn during rollout (same pattern as @vf.reward but called per-step, not end-of-episode)
  2. _apply_step_rewards() in the MultiTurnEnv rollout loop - calls step_reward handlers after each model response and accumulates the results via state.add_step_reward()
  3. compute_discounted_returns(rewards, gamma) utility - backward pass computing R_t = r_t + gamma * R_{t+1}
  4. apply_step_advantages(states, gamma) utility - extracts per-step rewards from trajectory, computes discounted returns, normalizes advantages (zero-mean unit-variance) across all steps in the group
  5. StepRewardRubric(gamma, weight) - ready-made rubric that combines episodic sum-of-step-rewards with per-step discounted advantages, wired into the existing rubric pipeline
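
The backward recursion in item 3 can be illustrated with a minimal sketch. The function name `discounted_returns_sketch` is hypothetical and this is not the PR's implementation of compute_discounted_returns; it only assumes the stated recurrence R_t = r_t + gamma * R_{t+1}, with the return past the last step taken as 0.

```python
from typing import List


def discounted_returns_sketch(rewards: List[float], gamma: float) -> List[float]:
    """Return R_t = r_t + gamma * R_{t+1} for every step (R past the end is 0)."""
    returns = [0.0] * len(rewards)
    running = 0.0  # plays the role of R_{t+1}; zero beyond the final step
    # Walk backward so each step can reuse the return of the step after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three turns with rewards 1, 0, 2 and gamma = 0.5:
# R_2 = 2, R_1 = 0 + 0.5 * 2 = 1, R_0 = 1 + 0.5 * 1 = 1.5
print(discounted_returns_sketch([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

With gamma = 1 every step receives the undiscounted sum of its own and all later rewards; with gamma = 0 each step only sees its own reward.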

Based on: the Kevin paper (arXiv:2507.11948) - per-turn training samples with gamma-discounted future returns and advantage normalization across m*n samples.

Tests: 18 new tests covering discounted returns math, advantage normalization, decorator behavior, rollout integration, and rubric scoring. All existing tests unaffected.


Note

Medium Risk
Changes rollout scoring semantics when StepRewardRubric is used (trajectory step rewards become returns) and adds hooks on every turn in MultiTurnEnv; core training path but opt-in via decorators and rubric.

Overview
Adds per-turn dense rewards for multi-turn RL: environments can register @vf.step_reward handlers (with weight and priority) that run after each model turn, and training can aggregate them with discounted returns and group-normalized advantages.

MultiTurnEnv discovers decorated handlers at init and, after each new trajectory step, calls them via maybe_await and accumulates into the latest step with state.add_step_reward(). New utilities compute_discounted_returns and apply_step_advantages implement γ-discounted returns and zero-mean unit-variance advantages across all steps in a rollout group. StepRewardRubric sums step rewards for rollout scoring and, on score_group, runs advantage normalization—overwriting each step’s reward with discounted returns and setting per-step and rollout-level advantage.
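
As a rough illustration of "zero-mean unit-variance advantages across all steps in a rollout group", the sketch below standardizes per-step returns with a single mean and standard deviation computed over the whole group. The function names, the use of population variance, and the `eps` guard are assumptions for illustration; the summary does not say how apply_step_advantages handles these details.

```python
import math
from typing import List


def normalize_group_sketch(
    group_returns: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    """Standardize per-step returns using one mean/std over the whole group."""
    flat = [r for rollout in group_returns for r in rollout]
    if not flat:
        return [[] for _ in group_returns]
    mean = sum(flat) / len(flat)
    # Population variance (an assumption; sample variance is equally plausible).
    std = math.sqrt(sum((r - mean) ** 2 for r in flat) / len(flat))
    # eps keeps the division finite when every return in the group is identical.
    return [[(r - mean) / (std + eps) for r in rollout] for rollout in group_returns]


def rollout_mean_sketch(step_advantages: List[float]) -> float:
    """Rollout-level advantage taken as the mean of its per-step advantages."""
    return sum(step_advantages) / len(step_advantages) if step_advantages else 0.0


# Two rollouts of different lengths share one normalization.
advantages = normalize_group_sketch([[1.0, 3.0], [2.0]])
print(advantages, [rollout_mean_sketch(a) for a in advantages])
```

Normalizing over all steps in the group, rather than per rollout, keeps short and long rollouts on the same scale.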

Public API exports step_reward and StepRewardRubric; tests/test_step_rewards.py covers math, decorator metadata, rollout integration (including sync handlers), and rubric behavior.

Reviewed by Cursor Bugbot for commit cb2c520. Bugbot is set up for automated code reviews on this repo.

Note

Add step-level dense rewards and discounted advantage computation to multi-turn environments

  • Introduces a @step_reward decorator in decorators.py that marks methods with priority/weight metadata, enabling discovery by MultiTurnEnv via discover_decorated.
  • Adds _apply_step_rewards to MultiTurnEnv, which is called after each trajectory step to invoke all registered handlers and accumulate weighted rewards via state.add_step_reward.
  • Adds step_reward_utils.py with compute_discounted_returns (backward gamma-discounted accumulation) and apply_step_advantages (normalizes returns across a group and writes per-step advantages and rollout-level mean advantage).
  • Adds StepRewardRubric in step_reward_rubric.py, which sums step rewards into rollout reward and calls apply_step_advantages with configurable gamma in score_group.
  • All new symbols (step_reward, StepRewardRubric) are exported from the top-level verifiers package.
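
The decorator-plus-discovery pattern summarized above can be sketched without the verifiers package. Everything here is hypothetical: the attribute names, the `step_reward_sketch` and `discover_step_rewards_sketch` helpers, the `ToyEnv` class, and the choice to run higher priorities first are illustrative assumptions, not the behavior of @step_reward or discover_decorated.

```python
from typing import Any, Callable, List


def step_reward_sketch(weight: float = 1.0, priority: int = 0) -> Callable:
    """Tag a method with step-reward metadata (attribute names are made up here)."""

    def mark(fn: Callable) -> Callable:
        fn.step_reward = True
        fn.step_reward_weight = weight
        fn.step_reward_priority = priority
        return fn

    return mark


def discover_step_rewards_sketch(obj: Any) -> List[Callable]:
    """Collect tagged bound methods, highest priority first (ordering is assumed)."""
    handlers = []
    for name in dir(obj):
        attr = getattr(obj, name)
        if callable(attr) and getattr(attr, "step_reward", False):
            handlers.append(attr)
    return sorted(handlers, key=lambda h: h.step_reward_priority, reverse=True)


class ToyEnv:
    @step_reward_sketch(weight=0.5)
    def brevity(self, state: dict) -> float:
        return 1.0 if len(state["completion"]) < 20 else 0.0

    @step_reward_sketch(priority=10)
    def format_ok(self, state: dict) -> float:
        return 1.0 if state["completion"].startswith("ANSWER:") else 0.0


handlers = discover_step_rewards_sketch(ToyEnv())
state = {"completion": "ANSWER: 4"}
# Weighted accumulation for one turn: 1.0 * 1.0 + 0.5 * 1.0
total = sum(h.step_reward_weight * h(state) for h in handlers)
print([h.__name__ for h in handlers], total)
```

Storing metadata on the function object means discovery needs no registry: any instance can be scanned at init time.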

Macroscope summarized cb2c520.

… dense rewards

Wire per-step reward signals into the multi-turn rollout loop, enabling
dense credit assignment as described in the Kevin paper (arXiv:2507.11948).

Changes:
- verifiers/decorators.py: Add @vf.step_reward decorator (same overload
  pattern as @vf.reward) with weight and priority parameters.
- verifiers/envs/multiturn_env.py: Discover @vf.step_reward handlers at
  init and invoke them after each model response in the rollout loop.
- verifiers/utils/step_reward_utils.py: compute_discounted_returns() for
  gamma-discounted future returns, apply_step_advantages() for per-step
  advantage normalization across a group.
- verifiers/rubrics/step_reward_rubric.py: StepRewardRubric that uses
  step-reward sums as rollout reward and applies per-step discounted
  advantages during group scoring.
- verifiers/__init__.py: Export step_reward and StepRewardRubric.
- tests/test_step_rewards.py: 18 tests covering decorator, rollout
  integration, discounted returns, and rubric scoring.

Closes PrimeIntellect-ai#170
Comment thread verifiers/envs/multiturn_env.py Outdated
Comment thread verifiers/rubrics/step_reward_rubric.py
Comment thread verifiers/__init__.py
"cleanup",
"metric",
"reward",
"step_reward",

Missing user-facing documentation

Medium Severity

This PR exports step_reward, StepRewardRubric, and step-level rollout scoring as public API, but no updates appear under docs/ (for example docs/training.md or docs/reference.md) describing how to use them.

Triggered by project rule: BugBot Instructions

Reviewed by Cursor Bugbot for commit 7bcaad5.

- Use maybe_await in _apply_step_rewards so synchronous @step_reward
  handlers work without raising TypeError (matches rubric pattern)
- Add score_rollout override to StepRewardRubric for single-state scoring
- Add test for sync step_reward handler
- Add module and class docstrings to test file (codebase convention)
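
For readers unfamiliar with the pattern, a helper of this kind typically calls the handler and awaits the result only when it is awaitable. The sketch below is a stand-in written for illustration (`maybe_await_sketch` and the two handlers are hypothetical); the actual maybe_await in the codebase may differ in signature and behavior.

```python
import asyncio
import inspect
from typing import Any, Callable


async def maybe_await_sketch(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call fn and await the result only if it is awaitable."""
    result = fn(*args, **kwargs)
    if inspect.isawaitable(result):
        return await result
    return result


def sync_handler(turn: str) -> float:
    return 1.0


async def async_handler(turn: str) -> float:
    await asyncio.sleep(0)  # stands in for real async work
    return 2.0


async def score_turn(turn: str) -> float:
    # Both handler kinds go through the same call site.
    total = 0.0
    for handler in (sync_handler, async_handler):
        total += await maybe_await_sketch(handler, turn)
    return total


print(asyncio.run(score_turn("hello")))  # 3.0
```

Awaiting a plain float directly would raise TypeError, which is the failure mode the commit message describes for synchronous handlers.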
Comment thread verifiers/envs/multiturn_env.py Outdated
_apply_step_rewards now only runs when add_trajectory_step actually
appended a step. Prevents misassigning rewards to the previous turn
when ComposableEnv's keep_trajectory_step filter drops a step.
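
The guard described in this commit can be sketched as a length check around the append. The trajectory layout, the `keep_step` filter, and the function name below are assumptions made for illustration and do not mirror the real rollout loop.

```python
from typing import Callable, Dict, List

Step = Dict[str, float]


def add_step_and_score_sketch(
    trajectory: List[Step],
    step: Step,
    keep_step: Callable[[Step], bool],
    score: Callable[[Step], float],
) -> bool:
    """Append step if the filter keeps it; score only a step that was really appended."""
    length_before = len(trajectory)
    if keep_step(step):
        trajectory.append(step)
    appended = len(trajectory) > length_before
    if appended:
        # Safe: trajectory[-1] is the step just added, not the previous turn.
        last = trajectory[-1]
        last["reward"] = last.get("reward", 0.0) + score(last)
    return appended


trajectory = [{"tokens": 5.0, "reward": 1.0}]
keep_nonempty = lambda s: s["tokens"] > 0
# A dropped step must leave the previous turn's reward untouched.
add_step_and_score_sketch(trajectory, {"tokens": 0.0}, keep_nonempty, lambda s: 10.0)
print(trajectory)  # [{'tokens': 5.0, 'reward': 1.0}]
```

Without the length check, the scoring call would write to trajectory[-1] even when nothing was appended, crediting the earlier turn with a reward it did not earn.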

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit cb2c520.


for state in states:
    state["reward"] = _sum_step_rewards(state)
    state["metrics"] = {"step_reward_sum": cast(float, state["reward"])}

StepRewardRubric ignores weight

Medium Severity

StepRewardRubric registers _sum_step_rewards with a weight on the base Rubric, but its overridden score_rollout and score_group assign the unweighted sum directly. A non-default weight therefore has no effect on state["reward"] or metrics, unlike other rubrics.
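
To make the reported gap concrete, the sketch below shows one way a rubric weight could be applied to the summed step rewards. The state layout (a "trajectory" list of dicts with a "reward" key) and the function name are assumptions; this is not the fix adopted in the PR, and it leaves open whether metrics should report the weighted or the raw sum.

```python
from typing import Any, Dict


def weighted_step_reward_sum_sketch(state: Dict[str, Any], weight: float = 1.0) -> float:
    """Sum per-step rewards and scale by the rubric weight (hypothetical layout)."""
    raw = sum(step.get("reward", 0.0) for step in state.get("trajectory", []))
    return weight * raw


state = {"trajectory": [{"reward": 1.0}, {"reward": 0.5}]}
print(weighted_step_reward_sum_sketch(state))              # 1.5
print(weighted_step_reward_sum_sketch(state, weight=2.0))  # 3.0
```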

Reviewed by Cursor Bugbot for commit cb2c520.

zamal-db commented Jun 5, 2026

All issues from Bugbot addressed in follow-up commits:

  1. Sync step_reward handlers crash - Fixed. _apply_step_rewards now uses maybe_await (same pattern as rubric scoring) so both sync and async handlers work. Added a dedicated test.

  2. Dropped steps misassign rewards - Fixed. The rollout loop now tracks trajectory length before add_model_response and only calls _apply_step_rewards if a step was actually appended. Safe when ComposableEnv.keep_trajectory_step drops a step.

  3. Step rubric skips rollout scoring - Fixed. Added score_rollout override to StepRewardRubric for single-state scoring paths (v1 runtime, RubricGroup).

  4. Missing user-facing documentation - This repo does not have a docs/ directory with API reference pages. Usage is documented in-code via docstrings and the README. Happy to add an example to the README if maintainers want it in a follow-up.

All 163 tests pass (19 step-reward + 144 existing), ruff lint/format clean.
