Add QUEST search taskset #1555
    cache=cache,
    semaphore=self._semaphore,
    logger=logger,
    model=self.judge_model,
Shared logger corrupts parallel scoring
Medium Severity
QuestRubric.score_group scores rollouts concurrently, but every evaluate_answer call receives the same module-level logger. The vendored evaluator appends LLM trace data onto that logger instance, so parallel group scoring can interleave or overwrite traces and question metadata across rollouts.
Reviewed by Cursor Bugbot for commit ecbf651.
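One way to avoid the interleaving described in this comment is to give each rollout its own trace container instead of sharing one logger instance. The sketch below is illustrative only: `TraceLogger`, `record_trace`, and this simplified `evaluate_answer` signature are hypothetical stand-ins, not the vendored evaluator's API.

```python
import asyncio
import logging


class TraceLogger(logging.LoggerAdapter):
    """Hypothetical per-rollout logger that owns its own trace list."""

    def __init__(self, base: logging.Logger, rollout_id: str) -> None:
        super().__init__(base, {"rollout_id": rollout_id})
        self.traces: list[str] = []

    def record_trace(self, message: str) -> None:
        # Traces are stored on this adapter, never on the shared base logger.
        self.traces.append(message)
        self.debug("[%s] %s", self.extra["rollout_id"], message)


async def evaluate_answer(answer: str, logger: TraceLogger) -> float:
    """Stand-in for the real evaluator: records two trace entries."""
    logger.record_trace(f"judge prompt for {answer!r}")
    await asyncio.sleep(0)  # yield so that concurrent calls interleave
    logger.record_trace(f"judge verdict for {answer!r}")
    return 1.0


async def score_group(answers: list[str]) -> list[TraceLogger]:
    base = logging.getLogger("quest.sketch")
    # One adapter per rollout, all writing through the same base logger.
    loggers = [TraceLogger(base, f"rollout-{i}") for i in range(len(answers))]
    await asyncio.gather(
        *(evaluate_answer(a, lg) for a, lg in zip(answers, loggers))
    )
    return loggers


loggers = asyncio.run(score_group(["a", "b"]))
```

Each adapter ends up with only its own rollout's entries, even though the two coroutines alternate at the `await`.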
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 98c3c1c.
    )

    self._auto_save_state("batch_verify")
    return results
Batch verify masks infra errors
High Severity
In batch_verify, asyncio.gather(..., return_exceptions=True) stores raised vf.Error instances in the result list instead of propagating them like verify() does. Those exception objects are truthy, so common checks such as all(results) can treat judge or infrastructure failures as successful verifications and inflate scores.
Reviewed by Cursor Bugbot for commit 98c3c1c.
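The masking effect is easy to reproduce in isolation. In the sketch below, `JudgeError` is a hypothetical stand-in for a `vf.Error` subclass, and both `batch_verify_*` functions are simplified illustrations rather than the PR's code. The strict variant shows one possible fix: re-raise any exception found in the gathered results.

```python
import asyncio


class JudgeError(Exception):
    """Hypothetical stand-in for a vf.Error subclass."""


async def verify(claim: str) -> bool:
    if claim == "judge-down":
        raise JudgeError("judge API unavailable")
    return claim == "supported"


async def batch_verify_masking(claims: list[str]) -> list:
    # Exceptions come back as list items instead of being raised.
    return await asyncio.gather(
        *(verify(c) for c in claims), return_exceptions=True
    )


async def batch_verify_strict(claims: list[str]) -> list:
    results = await asyncio.gather(
        *(verify(c) for c in claims), return_exceptions=True
    )
    for result in results:
        if isinstance(result, BaseException):
            raise result  # surface the infra failure, as verify() would
    return results


masked = asyncio.run(batch_verify_masking(["supported", "judge-down"]))
# Exception instances are truthy, so this check passes even though
# one verification never completed.
looks_ok = all(masked)
```

Here `looks_ok` is `True` while `masked[1]` is a `JudgeError` instance; calling `batch_verify_strict` with the same input raises instead.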


Summary
- Add search taskset factory with initial quest backend.
- Load osunlp/QUEST-RL-Data into QuestTaskSet.
- Vendor the obj_task_eval runtime needed to load and run generated objective eval scripts.
- Score /task/answer.txt answers with the real generated QUEST verifier/judge path, using vf.Error subclasses for infrastructure/evaluator failures instead of ad hoc success metrics.
- [...] (0.0), matching the authors' implementation.
- Document the search taskset family and search/quest backend, including a note that future work should consider finer-grained source-access/error handling to distinguish verifier limitations from model-provided bad URLs or unsupported claims.
Validation
- uv run ruff check verifiers/envs/experimental/composable/tasksets/search
- uv run --python 3.13 ty check verifiers/envs/experimental/composable/tasksets/search
- rlm_search environment passed: uv run --no-sync vf-eval rlm-search -a '{"task_type":"quest", "category":"objective"}' -n 1 -r 1 -d -A --max-retries 0 -C quest_answer_source,quest_eval_summary
  - agent_error: 0.000
  - objective_reward reported; reward may be 0.0 for an incorrect answer.
Note
Medium Risk
Large new scoring path with external judge API calls, HF eval-script loading, and network PDF/web fetches at grade time; incorrect or inaccessible sources can yield 0.0 without surfacing infra as errors, matching upstream QUEST behavior.
Overview
Adds a new composable search taskset family (SWE-style backend dispatch) with an initial quest backend for objective deep-research tasks from osunlp/QUEST-RL-Data.
QuestTaskSet loads the HF dataset (objective category only), runs agents in a sandbox, and expects the final answer in /task/answer.txt.
QuestRubric reads that file (with completion fallback), dynamically loads per-task_id generated eval_scripts/*.py from the dataset cache, and scores via the vendored QUEST obj_task_eval runtime (LLM judge, verification tree, URL/PDF fetching). Infra/judge failures map to vf.Error subclasses, while wrong answers score 0.0 without state["error"], matching upstream QUEST URL-verification semantics.
Also adds make_search_taskset/make_quest_taskset, package READMEs, and core deps aiohttp, pymupdf, pillow, certifi (plus lockfile).
Reviewed by Cursor Bugbot for commit 98c3c1c.
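Loading a generated eval script by file path, as the overview describes for QuestRubric, can be done with the standard `importlib.util` machinery. The following is a minimal sketch under stated assumptions: `load_eval_script`, the `quest_eval_` module-name prefix, and the `evaluate(answer)` entry point are hypothetical and do not reflect the actual interface of the QUEST eval scripts.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_eval_script(path: Path, task_id: str) -> ModuleType:
    """Import a Python file from disk under a task-specific module name."""
    spec = importlib.util.spec_from_file_location(f"quest_eval_{task_id}", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load eval script at {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the script's top-level code
    return module


with tempfile.TemporaryDirectory() as tmp:
    # Toy script standing in for a generated eval_scripts/*.py file.
    script = Path(tmp) / "task_001.py"
    script.write_text(
        "def evaluate(answer):\n"
        "    return 1.0 if answer.strip() == '42' else 0.0\n"
    )
    module = load_eval_script(script, "task_001")
    score_right = module.evaluate("42\n")
    score_wrong = module.evaluate("41")
```

Because `exec_module` executes arbitrary code from the dataset cache, a loader like this is only as trustworthy as the scripts it imports.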
Note
Add QUEST objective search taskset with LLM-based answer verification
- Adds QuestTaskSet and QuestRubric in taskset.py to load QUEST HuggingFace dataset tasks, run them in a sandbox, and score answers via per-task eval scripts.
- Vendors the QUEST eval runtime (obj_task_eval) with Extractor and Verifier classes that fetch web/PDF content, call a judge LLM with concurrency control, and support majority-vote binary verification.
- Adds VerificationNode and Evaluator in the eval toolkit to build hierarchical verification trees with sequential/parallel aggregation, gate-then-average scoring, and disk-based resume support.
- Exports make_quest_taskset and make_search_taskset factory functions from the search package for easy instantiation.
- Adds aiohttp, pymupdf, pillow, and certifi as new required dependencies in pyproject.toml.
- Risk: PDF parsing with pymupdf and async web fetching run at scoring time, adding latency and external network dependencies to evaluation.
Macroscope summarized 98c3c1c.
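The summary does not define "gate-then-average" scoring. One plausible reading, sketched below, is that critical children act as gates: if any gate scores below 1.0 the parent scores 0.0, and otherwise the parent is the mean of its non-critical children. The `Node` class and `aggregate` function are hypothetical illustrations of that reading, not the `VerificationNode` or `Evaluator` API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical verification-tree node; leaves carry a score."""

    name: str
    score: Optional[float] = None
    critical: bool = False
    children: list["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Gate on critical children, then average the non-critical ones."""
    if not node.children:
        return node.score if node.score is not None else 0.0
    scored = [(child, aggregate(child)) for child in node.children]
    # Gate step: any failed critical child zeroes the whole subtree.
    if any(child.critical and s < 1.0 for child, s in scored):
        return 0.0
    # Average step: only non-critical children contribute to the mean.
    graded = [s for child, s in scored if not child.critical]
    return sum(graded) / len(graded) if graded else 1.0


passing = Node("root", children=[
    Node("answer_present", score=1.0, critical=True),
    Node("claim_a", score=1.0),
    Node("claim_b", score=0.0),
])
failing = Node("root", children=[
    Node("answer_present", score=0.0, critical=True),
    Node("claim_a", score=1.0),
])
passing_score = aggregate(passing)
failing_score = aggregate(failing)
```

Under this assumption the first tree scores 0.5 (gate passed, one of two claims supported) and the second scores 0.0 (gate failed).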