Characterizing, Evaluating, and Optimizing Complex Reasoning

This repository provides the TRM-Preference dataset, TRM weights, and implementations for TRM training and TRM-guided policy optimization.

The Thinking Reward Model (TRM) evaluates the quality of reasoning traces rather than final answers. We study how to optimize reasoning itself: instead of only asking “Is the answer correct?”, we ask:

Is this a good way to think?

We characterize reasoning quality along four dimensions (the ME² principle), which enables supervision beyond answer correctness:

Macro-Efficiency: global structure is disciplined (no unnecessary branching/restarts).
Macro-Effectiveness: global structure stays coherent and aligned with the goal.
Micro-Efficiency: individual steps are concise and non-redundant.
Micro-Effectiveness: individual steps are locally valid and consistent.

The main process:

Install required packages

pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126
pip install -e ./verl-0.6.0
pip install ninja
pip install flash-attn==2.8.1 --no-build-isolation
pip install vllm==0.11.0
pip install sglang==0.5.2
pip install numpy==1.26.4
pip install transformers==4.56.1
pip install flashinfer-python
pip install math-verify
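
After installation, it can help to confirm that the pinned versions are the ones actually present in the environment. The following is a minimal sketch, not part of the repository: the helper name `find_mismatches` is hypothetical, the pins are copied from the commands above, and the distribution names are assumed to match the names passed to pip.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip commands above.
EXPECTED = {
    "torch": "2.8.0",
    "flash-attn": "2.8.1",
    "vllm": "0.11.0",
    "sglang": "0.5.2",
    "numpy": "1.26.4",
    "transformers": "4.56.1",
}


def find_mismatches(expected, get_version=version):
    """Return {package: (expected, found)} for packages that are missing or differ."""
    mismatches = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # Wheels from the PyTorch index may carry a local suffix such as "+cu126",
        # so compare only the part before "+".
        if found is None or found.split("+", 1)[0] != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {found or 'not installed'}")
```

An empty result means every listed package matches its pin; packages installed without a pin (ninja, flashinfer-python, math-verify) are not checked here.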

TRM Scoring

Download our TRM:

huggingface-cli download zzzhr97/TRM-8B --local-dir <local-path>

We use sglang to host the TRM server. Configure the path in trm.sh and run:

bash trm.sh

Use TRM to score reasoning:

import requests
import json

with open("sample.json", "r", encoding="utf-8") as f:
    sample = json.load(f)
prompt = sample["prompt"]
response = sample["response"]

# Score the reasoning trace (before the termination marker).
reasoning = response.split("</think>", 1)[0]
input_text = f"{prompt}\n{reasoning}"

payload = {"model": "RewardModel", "input": input_text}
resp = requests.post("http://<TRM_HOST>:<TRM_PORT>/v1/embeddings", json=payload, timeout=60)
resp.raise_for_status()
score = resp.json()["data"][0]["embedding"][0]
print("TRM score:", score)
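
Because the TRM returns one scalar per trace, the same request can be used to compare several candidate responses to one prompt. The sketch below is an illustration under assumptions, not the repository's code: `build_trm_input` and `rank_responses` are hypothetical helper names, and `score_fn` stands for any callable that sends the input text to the TRM server (for example, a wrapper around the request above) and returns the score as a float.

```python
from typing import Callable, List, Sequence, Tuple

THINK_END = "</think>"  # termination marker used in the scoring snippet above


def build_trm_input(prompt: str, response: str) -> str:
    """Join the prompt with the reasoning that precedes the termination marker."""
    reasoning = response.split(THINK_END, 1)[0]
    return f"{prompt}\n{reasoning}"


def rank_responses(
    prompt: str,
    responses: Sequence[str],
    score_fn: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Return (score, response) pairs sorted from highest to lowest score."""
    scored = [(score_fn(build_trm_input(prompt, r)), r) for r in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Stand-in scorer for a dry run only: it prefers shorter inputs and is
    # NOT a substitute for the TRM.
    toy_score = lambda text: -float(len(text))
    candidates = ["long winding trace</think>42", "short trace</think>42"]
    for score, response in rank_responses("What is 6 * 7?", candidates, toy_score):
        print(score, response)
```

Passing the scorer in as an argument keeps the ranking logic independent of the server address and makes it easy to test without a running TRM.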

TRM Training

Download the training dataset:

huggingface-cli download zzzhr97/TRM-Preference --local-dir <local-path>

Configure the path in train_rm.sh and begin training:

bash train_rm.sh

RL Training

Prepare data and models

Download general-verifier for verification:

huggingface-cli download TIGER-Lab/general-verifier --local-dir <local-path>

Download the training dataset:

huggingface-cli download zzzhr97/WebInstruct-Verified-Processed --local-dir <local-path>

Training

Configure the path in server/general-verifier.sh and host the general-verifier:

bash server/general-verifier.sh

Set the endpoints DEFAULT_VERIFIER_BACKENDS and DEFAULT_RM_BACKENDS in remote_verifier.py. Then configure the training script train.sh and begin training:

bash train.sh
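
Training queries two backends, a verifier for answer correctness and the TRM for reasoning quality, so each rollout ends up with two signals. How the repository combines them is defined in its own code and is not reproduced here. As a purely illustrative sketch of the general idea of adding a thinking reward on top of binary correctness, one could write the following, where the function name `shaped_reward` and the weight `alpha` are assumptions rather than values from this repository:

```python
def shaped_reward(is_correct: bool, trm_score: float, alpha: float = 0.1) -> float:
    """Binary correctness plus a weighted thinking reward.

    Illustrative only: `alpha` is a hypothetical weight, and the additive
    form is an assumption, not the formula used by train.sh.
    """
    return float(is_correct) + alpha * trm_score


if __name__ == "__main__":
    # Two correct rollouts: the one with the higher TRM score gets more reward.
    print(shaped_reward(True, 0.8))
    print(shaped_reward(True, -0.3))
```

With an additive form like this, two rollouts that reach the same answer can still be ranked by the quality of their reasoning traces.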

Results

RL gains across benchmarks. TRM-guided training improves performance, showing that thinking rewards provide useful shaping beyond binary correctness.

Reasoning quality improves. Policies trained with TRM achieve higher win rates in pairwise trace evaluation, indicating better reasoning behaviors under the ME² dimensions.

Acknowledgements

This repo builds on open-source efforts, especially:

https://github.com/TIGER-AI-Lab/General-Reasoner
https://github.com/verl-project/verl

Citation

@article{zhang2026characterizing,
  title={Characterizing, Evaluating, and Optimizing Complex Reasoning},
  author={Zhang, Haoran and Li, Yafu and Wang, Zhi and Wang, Zhilin and Zhang, Shunkai and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2602.08498},
  year={2026}
}
