Skip to content

NaN metrics affect reward (even with weight=0.0)#1546

Open
mbronzi-saifh wants to merge 1 commit into
PrimeIntellect-ai:mainfrom
mbronzi-saifh:fix/zero-weight-metric-nan-reward
Open

NaN metrics affect reward (even with weight=0.0)#1546
mbronzi-saifh wants to merge 1 commit into
PrimeIntellect-ai:mainfrom
mbronzi-saifh:fix/zero-weight-metric-nan-reward

Conversation

@mbronzi-saifh

@mbronzi-saifh mbronzi-saifh commented Jun 4, 2026

Copy link
Copy Markdown

Description

Rubric.add_metric() registers a function with weight 0.0 so it is logged but
should not contribute to the reward. However, score_rollout() computes the
per-rollout reward as sum(score_i * weight_i) over all individual (non-group)
funcs — including weight-0 metrics. Because 0.0 * nan == nan (and
sum([..., nan]) == nan), a metric that returns a non-finite value silently
poisons the entire reward, and therefore any downstream advantage/loss, even
though it was meant to contribute nothing.

This is easy to hit in practice: a metric that reports float("nan") for
malformed/unscored rollouts (a natural "skip me in the mean" sentinel) turns the
reward into nan for those rollouts and NaNs the training loss.

Fix: skip zero-weight terms when summing the reward, so a metric cannot affect the
reward regardless of what it returns. Metrics are still evaluated and recorded in
state["metrics"] exactly as before — only the reward summation changes.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Added test_score_rollout_zero_weight_metric_nan_does_not_poison_reward in
tests/test_rubric.py: a rubric with a weight-1 reward func and a weight-0 metric
returning nan. It asserts the metric is still recorded as nan while the reward
stays finite. The test fails on main (reward == nan) and passes with this
fix
. tests/test_rubric.py (all cases) passes, and ruff format --check /
ruff check are clean.

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

Minimal reproduction:

import math
from verifiers import Rubric

r = Rubric()
r.add_reward_func(lambda **kw: 1.0, weight=1.0)
r.add_metric(lambda **kw: float("nan"))  # weight 0.0
# before: reward == nan ; after: reward == 1.0 (metric still logged as nan)

Scope: only the individual-rollout reward sum in score_rollout is changed. The
group-scoring path is untouched. A metric given a non-zero weight that returns
a non-finite value will still propagate (intentionallyif you weight it into the
reward, its NaN should surface).

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Narrow change to individual rollout reward aggregation with a focused regression test; group scoring is untouched.
> 
> **Overview**
> Fixes a bug where **weight-0 metrics** (via `add_metric`) could still corrupt `state["reward"]` during `score_rollout`. The reward sum previously included `score * weight` for every individual func, so a metric returning `NaN` produced `0.0 * nan == nan` and poisoned training rewards/advantages even though metrics are meant to be logged only.
> 
> **`Rubric.score_rollout`** now **omits zero-weight terms** from the reward sum while still evaluating them and recording values in `state["metrics"]`. A regression test asserts a `NaN` metric stays in metrics but the reward stays finite.
> 
> Group scoring (`score_group`) is unchanged.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 577887aaa3cd72b976b005d126bfe590d78ca7f2. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- Macroscope's pull request summary starts here -->
<!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. -->
<!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. -->
> [!NOTE]
> ### Fix NaN metric values poisoning reward in `Rubric.score_rollout` when weight is 0.0
> In the reward aggregation step, `0.0 * NaN` evaluates to `NaN`, causing zero-weight metrics with non-finite values to corrupt `state['reward']`. The fix filters out zero-weight terms from the list comprehension in [`rubric.py`](https://github.com/PrimeIntellect-ai/verifiers/pull/1546/files#diff-93524b7f0aef49ccda32abbb52371781199c6551aed59a94957e5f432d6e7145) before multiplication, so only metrics with non-zero weights contribute to the final reward.
>
> <!-- Macroscope's review summary starts here -->
>
> <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 577887a.</sup>
> <!-- Macroscope's review summary ends here -->
>
<!-- macroscope-ui-refresh -->
<!-- Macroscope's pull request summary ends here -->

…rd sum so that the value associated to a reward = 0.0 never contributres
@mbronzi-saifh mbronzi-saifh marked this pull request as ready for review June 4, 2026 21:32
@willccbb

willccbb commented Jun 5, 2026

Copy link
Copy Markdown
Member

Hmm I'm not sure we want to formally support returning nan for metrics here as a skip sentinel. The intention is that we're doing a "dumb average", and that metrics + rewards should behave interchangeably, and rewards definitely can't be nan. I think this is a case where we probably want to raise an error instead (@cursoragent pls add).

I do see the argument for having a "skip from mean" option but this pattern feels more likely to cause footguns and issues in downstream systems which expect metric values to be numeric + algebraically manipulable, and this cascades across rollouts/metrics without an easy backdoor other than dropping rollouts/groups containing any nans. We don't support returning None for the same reason (used to, but kept causing issues so we reverted), metrics have to be numeric, sorry.

@mbronzi-saifh

Copy link
Copy Markdown
Author

Thanks @willccbb for the quick reply - if skipping the product cannot be supported, for the reasons that you mentioned, then raising an error would be very appreciated (it could help to understand why a metric with weight=0 is affecting your reward); do I need to open an issue for that?

Related question/comment: it makes total sense to be able to have more reward functions that you can monitor and assembly into a single final reward. The part that confused me is why we have a separate add_metric function that is basically the same as add_reward (if I am not mistaken), and the only difference seems to be that the first one is usually called with weight=0, and the second one with weight != 0. Did I read this correctly?
Asking because - at least from my point of view - when we create the distinction reward/metric, I was never expecting the metric to be able to contribute to the reward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants