NaN metrics affect reward (even with weight=0.0)#1546
Conversation
…rd sum so that the value associated to a reward = 0.0 never contributres
|
Hmm I'm not sure we want to formally support returning I do see the argument for having a "skip from mean" option but this pattern feels more likely to cause footguns and issues in downstream systems which expect metric values to be numeric + algebraically manipulable, and this cascades across rollouts/metrics without an easy backdoor other than dropping rollouts/groups containing any |
|
Thanks @willccbb for the quick reply - if skipping the product cannot be supported, for the reasons that you mentioned, then raising an error would be very appreciated (it could help to understand why a metric with weight=0 is affecting your reward); do I need to open an issue for that? Related question/comment: it makes total sense to be able to have more reward functions that you can monitor and assembly into a single final reward. The part that confused me is why we have a separate add_metric function that is basically the same as add_reward (if I am not mistaken), and the only difference seems to be that the first one is usually called with weight=0, and the second one with weight != 0. Did I read this correctly? |
Description
Rubric.add_metric()registers a function with weight0.0so it is logged butshould not contribute to the reward. However,
score_rollout()computes theper-rollout reward as
sum(score_i * weight_i)over all individual (non-group)funcs — including weight-0 metrics. Because
0.0 * nan == nan(andsum([..., nan]) == nan), a metric that returns a non-finite value silentlypoisons the entire reward, and therefore any downstream advantage/loss, even
though it was meant to contribute nothing.
This is easy to hit in practice: a metric that reports
float("nan")formalformed/unscored rollouts (a natural "skip me in the mean" sentinel) turns the
reward into
nanfor those rollouts and NaNs the training loss.Fix: skip zero-weight terms when summing the reward, so a metric cannot affect the
reward regardless of what it returns. Metrics are still evaluated and recorded in
state["metrics"]exactly as before — only the reward summation changes.Type of Change
Testing
uv run pytestlocally.Added
test_score_rollout_zero_weight_metric_nan_does_not_poison_rewardintests/test_rubric.py: a rubric with a weight-1 reward func and a weight-0 metricreturning
nan. It asserts the metric is still recorded asnanwhile the rewardstays finite. The test fails on
main(reward == nan) and passes with thisfix.
tests/test_rubric.py(all cases) passes, andruff format --check/ruff checkare clean.Checklist
Additional Notes
Minimal reproduction: