This repository provides the TRM-Preference dataset, TRM weights, and implementations for TRM training and TRM-guided policy optimization.
The Thinking Reward Model (TRM) evaluates the quality of reasoning traces rather than final answers. We study how to optimize reasoning itself: instead of only asking “Is the answer correct?”, we ask:
Is this a good way to think?
We characterize reasoning quality along four dimensions (the ME² principle), which enables supervision beyond answer correctness:
- Macro-Efficiency: global structure is disciplined (no unnecessary branching/restarts).
- Macro-Effectiveness: global structure stays coherent and aligned with the goal.
- Micro-Efficiency: individual steps are concise and non-redundant.
- Micro-Effectiveness: individual steps are locally valid and consistent.
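The four dimensions above form a 2×2 grid: a scale (macro or micro) crossed with a criterion (efficiency or effectiveness). The following is a minimal sketch of that grid; the names `Scale`, `Criterion`, and `me2_dimensions` are hypothetical and are not part of this repository.

```python
from enum import Enum
from itertools import product


class Scale(Enum):
    """Granularity at which a reasoning trace is judged (hypothetical name)."""
    MACRO = "Macro"  # global structure of the trace
    MICRO = "Micro"  # individual steps


class Criterion(Enum):
    """What is judged at a given scale (hypothetical name)."""
    EFFICIENCY = "Efficiency"
    EFFECTIVENESS = "Effectiveness"


def me2_dimensions() -> list[str]:
    """Return the four ME² dimension names as scale-criterion pairs."""
    return [f"{s.value}-{c.value}" for s, c in product(Scale, Criterion)]


print(me2_dimensions())
```

The list comes out in the same order as the bullets above, from Macro-Efficiency to Micro-Effectiveness.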
The main process:
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126
pip install -e ./verl-0.6.0
pip install ninja
pip install flash-attn==2.8.1 --no-build-isolation
pip install vllm==0.11.0
pip install sglang==0.5.2
pip install numpy==1.26.4
pip install transformers==4.56.1
pip install flashinfer-python
pip install math-verify
Download our TRM:
huggingface-cli download zzzhr97/TRM-8B --local-dir <local-path>
We use sglang to host the TRM server. Configure the path in trm.sh and run:
bash trm.sh
Use TRM to score reasoning:
import requests
import json
with open("sample.json", "r", encoding="utf-8") as f:
    sample = json.load(f)
prompt = sample["prompt"]
response = sample["response"]
# Score the reasoning trace (before the termination marker).
reasoning = response.split("</think>", 1)[0]
input_text = f"{prompt}\n{reasoning}"
payload = {"model": "RewardModel", "input": input_text}
resp = requests.post("http://<TRM_HOST>:<TRM_PORT>/v1/embeddings", json=payload, timeout=60)
resp.raise_for_status()
score = resp.json()["data"][0]["embedding"][0]
print("TRM score:", score)
Download the TRM training dataset:
huggingface-cli download zzzhr97/TRM-Preference --local-dir <local-path>
Configure the path in train_rm.sh and begin training:
bash train_rm.sh
Download general-verifier for verification:
huggingface-cli download TIGER-Lab/general-verifier --local-dir <local-path>
Download the RL training dataset:
huggingface-cli download zzzhr97/WebInstruct-Verified-Processed --local-dir <local-path>
Configure the path in server/general-verifier.sh and host the general-verifier:
bash server/general-verifier.sh
Set the endpoints DEFAULT_VERIFIER_BACKENDS and DEFAULT_RM_BACKENDS in remote_verifier.py.
Then, configure the training script train.sh and begin training:
bash train.sh
RL gains across benchmarks. TRM-guided training improves performance, showing that thinking rewards provide useful shaping beyond binary correctness.
Reasoning quality improves. Policies trained with TRM achieve higher win rates in pairwise trace evaluation, indicating better reasoning behaviors under the ME² dimensions.
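For readers unfamiliar with the metric, a pairwise win rate can be computed as sketched below. This is only an illustration under stated assumptions: it assumes each trace has already been given a scalar score by some judge, it counts ties as half a win, and the function name `pairwise_win_rate` and the example scores are hypothetical. It is not the evaluation protocol used to produce the results above.

```python
from typing import Sequence


def pairwise_win_rate(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Fraction of prompts on which trace A outscores trace B.

    Ties count as half a win (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be aligned per prompt")
    if not scores_a:
        raise ValueError("need at least one pair")
    wins = 0.0
    for a, b in zip(scores_a, scores_b):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / len(scores_a)


# Made-up per-prompt scores, for illustration only.
policy_scores = [0.9, 0.4, 0.7, 0.2]
baseline_scores = [0.5, 0.6, 0.7, 0.1]
print(pairwise_win_rate(policy_scores, baseline_scores))  # 0.625
```

Under this convention a win rate above 0.5 means policy A is preferred more often than policy B.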
This repo builds on open-source efforts, especially:
https://github.com/TIGER-AI-Lab/General-Reasoner
https://github.com/verl-project/verl
@article{zhang2026characterizing,
title={Characterizing, Evaluating, and Optimizing Complex Reasoning},
author={Zhang, Haoran and Li, Yafu and Wang, Zhi and Wang, Zhilin and Zhang, Shunkai and Qu, Xiaoye and Cheng, Yu},
journal={arXiv preprint arXiv:2602.08498},
year={2026}
}



