Add QUEST search taskset by rasdani · Pull Request #1555 · PrimeIntellect-ai/verifiers

rasdani · 2026-06-05T20:30:58Z

Summary

Add a composable search taskset factory with an initial quest backend.
Port QUEST objective tasks from osunlp/QUEST-RL-Data into QuestTaskSet.
Vendor the minimal QUEST obj_task_eval runtime needed to load and run generated objective eval scripts.
Score /task/answer.txt answers with the real generated QUEST verifier/judge path, using vf.Error subclasses for infrastructure/evaluator failures instead of ad hoc success metrics.
Preserve upstream QUEST URL-backed verification semantics: invalid, irrelevant, or inaccessible cited webpages can still score the affected claim as unsupported (0.0), matching the authors' implementation.
Add READMEs for the search taskset family and search/quest backend, including a note that future work should consider finer-grained source-access/error handling to distinguish verifier limitations from model-provided bad URLs or unsupported claims.

Validation

uv run ruff check verifiers/envs/experimental/composable/tasksets/search
uv run --python 3.13 ty check verifiers/envs/experimental/composable/tasksets/search
Verifiers pre-push hooks passed.
End-to-end smoke with sibling rlm_search environment passed:
uv run --no-sync vf-eval rlm-search -a '{"task_type":"quest", "category":"objective"}' -n 1 -r 1 -d -A --max-retries 0 -C quest_answer_source,quest_eval_summary
- exit code: 0
- agent_error: 0.000
- objective_reward reported; reward may be 0.0 for an incorrect answer.

Note

Medium Risk
Large new scoring path with external judge API calls, HF eval-script loading, and network PDF/web fetches at grade time; incorrect or inaccessible sources can yield 0.0 without surfacing infra as errors, matching upstream QUEST behavior.

Overview
Adds a new composable search taskset family (SWE-style backend dispatch) with an initial quest backend for objective deep-research tasks from osunlp/QUEST-RL-Data.
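The backend dispatch described above can be pictured as a small registry keyed by task type. This is a minimal sketch rather than the PR's code: `SEARCH_BACKENDS`, `make_search_taskset_sketch`, and the dict-returning `_make_quest` are hypothetical names. Only the `task_type="quest"` and `category="objective"` arguments come from the smoke-test command in the Validation section.

```python
from typing import Any, Callable


def _make_quest(category: str = "objective", **kwargs: Any) -> dict[str, Any]:
    """Hypothetical stand-in for a backend factory (the real one builds a QuestTaskSet)."""
    if category != "objective":
        raise ValueError(f"unsupported QUEST category: {category!r}")
    return {"backend": "quest", "category": category, **kwargs}


# Assumed registry shape: task_type -> factory. Illustration only.
SEARCH_BACKENDS: dict[str, Callable[..., dict[str, Any]]] = {"quest": _make_quest}


def make_search_taskset_sketch(task_type: str, **kwargs: Any) -> dict[str, Any]:
    """Look up the backend by task_type and forward the remaining arguments."""
    try:
        factory = SEARCH_BACKENDS[task_type]
    except KeyError:
        known = ", ".join(sorted(SEARCH_BACKENDS))
        raise ValueError(f"unknown task_type {task_type!r}; known: {known}") from None
    return factory(**kwargs)


taskset = make_search_taskset_sketch("quest", category="objective")
```

Under this assumption, adding a further backend would only mean registering another factory; the caller-facing arguments stay the same.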

QuestTaskSet loads the HF dataset (objective category only), runs agents in a sandbox, and expects the final answer in /task/answer.txt. QuestRubric reads that file (falling back to the completion), dynamically loads the generated eval_scripts/*.py for each task_id from the dataset cache, and scores via the vendored QUEST obj_task_eval runtime (LLM judge, verification tree, URL/PDF fetching). Infra and judge failures map to vf.Error subclasses, while wrong answers score 0.0 without setting state["error"], matching upstream QUEST URL-verification semantics.
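The split between "wrong answer scores 0.0" and "infrastructure failure raises an error" can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `InfraErrorSketch` stands in for a vf.Error subclass, `read_answer` and `score` are hypothetical helpers, the answer file is read from a local path rather than a sandbox, and falling back when the file is empty is an assumption.

```python
from pathlib import Path
from typing import Callable


class InfraErrorSketch(Exception):
    """Hypothetical stand-in for a vf.Error subclass."""


def read_answer(answer_file: Path, completion_text: str) -> str:
    """Prefer the answer file; fall back to the completion text if it is missing or empty."""
    if answer_file.is_file():
        text = answer_file.read_text(encoding="utf-8").strip()
        if text:
            return text
    return completion_text.strip()


def score(answer: str, judge: Callable[[str], bool]) -> float:
    """Return 0.0 for a rejected answer; raise when the judge itself fails."""
    try:
        verdict = judge(answer)
    except (TimeoutError, ConnectionError) as exc:
        # An evaluator failure is surfaced as an error, never as a 0.0 score.
        raise InfraErrorSketch(f"judge unavailable: {exc}") from exc
    return 1.0 if verdict else 0.0


answer = read_answer(Path("/nonexistent/answer.txt"), "  Paris  ")
reward = score(answer, judge=lambda a: a == "Paris")
```

Keeping the two outcomes on separate channels means an aggregate reward is not diluted by runs where the judge never produced a verdict.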

Also adds make_search_taskset / make_quest_taskset, package READMEs, and core deps aiohttp, pymupdf, pillow, certifi (plus lockfile).

Reviewed by Cursor Bugbot for commit 98c3c1c.

Note

Add QUEST objective search taskset with LLM-based answer verification

Adds QuestTaskSet and QuestRubric in taskset.py to load QUEST HuggingFace dataset tasks, run them in a sandbox, and score answers via per-task eval scripts.
Introduces a vendored evaluation runtime (obj_task_eval) with Extractor and Verifier classes that fetch web/PDF content, call a judge LLM with concurrency control, and support majority-vote binary verification.
Implements VerificationNode and Evaluator in the eval toolkit to build hierarchical verification trees with sequential/parallel aggregation, gate-then-average scoring, and disk-based resume support.
Exposes make_quest_taskset and make_search_taskset factory functions from the search package for easy instantiation.
Adds aiohttp, pymupdf, pillow, and certifi as new required dependencies in pyproject.toml.
Risk: PDF parsing via pymupdf and async web fetching run at scoring time, adding latency and external network dependencies to evaluation.
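The majority-vote verification and gate-then-average scoring named in this summary can be illustrated with a small tree. This is one plausible reading, not the vendored runtime: `NodeSketch` and `majority_vote` are hypothetical, and the rule that a failing critical child zeroes its parent while non-critical children are averaged is an assumption. Sequential/parallel aggregation and disk-based resume are omitted.

```python
from dataclasses import dataclass, field
from statistics import mean


def majority_vote(votes: list[bool]) -> bool:
    """True when strictly more than half of the judge votes are True."""
    return sum(votes) * 2 > len(votes)


@dataclass
class NodeSketch:
    name: str
    score: float = 0.0       # leaf score in [0, 1]; ignored for parent nodes
    critical: bool = False   # assumed: critical children act as gates
    children: list["NodeSketch"] = field(default_factory=list)

    def aggregate(self) -> float:
        if not self.children:
            return self.score
        scored = [(child, child.aggregate()) for child in self.children]
        # Gate: any critical child below 1.0 zeroes the parent.
        if any(child.critical and s < 1.0 for child, s in scored):
            return 0.0
        # Average: the remaining (non-critical) children share the credit.
        optional = [s for child, s in scored if not child.critical]
        return mean(optional) if optional else 1.0


gate = NodeSketch("entity_exists", critical=True,
                  score=1.0 if majority_vote([True, True, False]) else 0.0)
root = NodeSketch("task", children=[
    gate,
    NodeSketch("claim_a", score=1.0),
    NodeSketch("claim_b", score=0.0),
])
total = root.aggregate()  # 0.5: gate passes, mean of 1.0 and 0.0
```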

Macroscope summarized 98c3c1c.

cursor · 2026-06-07T17:28:04Z

+            cache=cache,
+            semaphore=self._semaphore,
+            logger=logger,
+            model=self.judge_model,


Shared logger corrupts parallel scoring

Medium Severity

QuestRubric.score_group scores rollouts concurrently, but every evaluate_answer call receives the same module-level logger. The vendored evaluator appends LLM trace data onto that logger instance, so parallel group scoring can interleave or overwrite traces and question metadata across rollouts.

Additional Locations (1)

verifiers/envs/experimental/composable/tasksets/search/quest/obj_task_eval/eval_toolkit.py#L82-L118

Reviewed by Cursor Bugbot for commit ecbf651.
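One common way to avoid the interleaving described in this comment is to give each rollout its own trace container instead of attaching trace data to a shared logger. The sketch below shows that pattern only. `TraceSketch`, `evaluate_answer_sketch`, and `score_group_sketch` are hypothetical names, and nothing here claims to be how the PR resolved the issue.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class TraceSketch:
    """Hypothetical per-rollout trace container."""
    rollout_id: int
    question: str = ""
    events: list[str] = field(default_factory=list)


async def evaluate_answer_sketch(rollout_id: int, question: str) -> TraceSketch:
    trace = TraceSketch(rollout_id)   # fresh state per call, nothing shared
    trace.question = question
    for step in ("extract", "verify"):
        await asyncio.sleep(0)        # yield so concurrent calls interleave
        trace.events.append(f"{step}:{question}")
    return trace


async def score_group_sketch(questions: list[str]) -> list[TraceSketch]:
    # All rollouts run concurrently; each one writes only to its own trace.
    return await asyncio.gather(
        *(evaluate_answer_sketch(i, q) for i, q in enumerate(questions))
    )


traces = asyncio.run(score_group_sketch(["q0", "q1", "q2"]))
```

Because every coroutine appends to an object it created itself, the interleaved scheduling cannot mix events or question metadata across rollouts.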

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 98c3c1c.

cursor · 2026-06-08T22:18:23Z

+                )
+
+        self._auto_save_state("batch_verify")
+        return results


Batch verify masks infra errors

High Severity

In batch_verify, asyncio.gather(..., return_exceptions=True) stores raised vf.Error instances in the result list instead of propagating them like verify() does. Those exception objects are truthy, so common checks such as all(results) can treat judge or infrastructure failures as successful verifications and inflate scores.

Reviewed by Cursor Bugbot for commit 98c3c1c.
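The masking behavior reported here is easy to reproduce in isolation. In the sketch below, `JudgeErrorSketch` stands in for a vf.Error subclass and `verify_sketch` for a single verification call; both are hypothetical. The `asyncio.gather(..., return_exceptions=True)` behavior is standard library behavior. `batch_verify_strict` shows one possible remedy (re-raising the first captured exception), which is an assumption and not necessarily the fix adopted in the repository.

```python
import asyncio


class JudgeErrorSketch(Exception):
    """Hypothetical stand-in for a vf.Error subclass."""


async def verify_sketch(claim: str) -> bool:
    if claim == "boom":
        raise JudgeErrorSketch("judge call failed")
    return claim == "supported"


async def batch_verify_masking(claims: list[str]) -> list:
    # Exceptions are returned as list items instead of being raised.
    return await asyncio.gather(
        *(verify_sketch(c) for c in claims), return_exceptions=True
    )


async def batch_verify_strict(claims: list[str]) -> list[bool]:
    results = await asyncio.gather(
        *(verify_sketch(c) for c in claims), return_exceptions=True
    )
    for result in results:
        if isinstance(result, BaseException):
            raise result  # propagate, as a single verify() call would
    return results


masked = asyncio.run(batch_verify_masking(["supported", "boom"]))
looks_fine = all(masked)  # True: the exception object is truthy
```

With the strict variant, the same input raises `JudgeErrorSketch` instead of yielding a list that passes an `all(...)` check.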

rasdani added 2 commits June 5, 2026 20:27

feat: add QUEST search taskset

0375e01

fix: use framework errors for QUEST scoring

e8c5ccb

cursor Bot reviewed Jun 5, 2026


Comment thread verifiers/envs/experimental/composable/tasksets/search/quest/taskset.py

Comment thread verifiers/envs/experimental/composable/tasksets/search/quest/taskset.py Outdated

fix: use QUEST PDF parser in scoring

4cd970e

cursor Bot reviewed Jun 5, 2026


Comment thread verifiers/envs/experimental/composable/tasksets/search/quest/obj_task_eval/eval_toolkit.py

Comment thread ...s/envs/experimental/composable/tasksets/search/quest/obj_task_eval/utils/load_eval_script.py

docs: note QUEST source verification semantics

f11b074

cursor Bot reviewed Jun 5, 2026


Comment thread verifiers/envs/experimental/composable/tasksets/search/quest/obj_task_eval/eval_toolkit.py

Comment thread verifiers/envs/experimental/composable/tasksets/search/quest/obj_task_eval/eval_toolkit.py

rasdani added 6 commits June 6, 2026 20:59

fix: address QUEST review issues

ca51509

fix: simplify QUEST judge model config

af5a4a5

fix: use shared client setup for QUEST judge

81c87d7

fix: load QUEST eval script symlinks

93320d9

fix: use gpt-5.4-mini QUEST judge default

81ed339

fix: resolve QUEST eval scripts once

3a53f2b

cursor Bot reviewed Jun 6, 2026


Comment thread verifiers/envs/experimental/composable/tasksets/search/quest/obj_task_eval/eval_toolkit.py Outdated

fix: surface QUEST judge errors

ecbf651

cursor Bot reviewed Jun 7, 2026


rasdani mentioned this pull request Jun 8, 2026

Add REDSearcher search taskset rasdani/verifiers#1

Open

Merge origin/main into feat/search-quest-taskset

98c3c1c

cursor Bot reviewed Jun 8, 2026


samsja approved these changes Jun 8, 2026


rasdani requested review from mikasenghaas and snimu June 8, 2026 22:19

rasdani merged commit ffb5e01 into PrimeIntellect-ai:main Jun 8, 2026
7 of 11 checks passed

rasdani mentioned this pull request Jun 8, 2026

feat: add QUEST open-ended scoring #1572

Closed


rasdani merged 12 commits into PrimeIntellect-ai:main from rasdani:feat/search-quest-taskset


2 participants
