Add QUEST search taskset #1555

Merged
rasdani merged 12 commits into PrimeIntellect-ai:main from rasdani:feat/search-quest-taskset
Jun 8, 2026

@rasdani commented Jun 5, 2026

Contributor

Summary

  • Add a composable search taskset factory with an initial quest backend.
  • Port QUEST objective tasks from osunlp/QUEST-RL-Data into QuestTaskSet.
  • Vendor the minimal QUEST obj_task_eval runtime needed to load and run generated objective eval scripts.
  • Score /task/answer.txt answers with the real generated QUEST verifier/judge path, using vf.Error subclasses for infrastructure/evaluator failures instead of ad hoc success metrics.
  • Preserve upstream QUEST URL-backed verification semantics: invalid, irrelevant, or inaccessible cited webpages can still score the affected claim as unsupported (0.0), matching the authors' implementation.
  • Add READMEs for the search taskset family and search/quest backend, including a note that future work should consider finer-grained source-access/error handling to distinguish verifier limitations from model-provided bad URLs or unsupported claims.
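The split between "evaluator failure" and "unsupported claim" described in the summary can be sketched as follows. This is a toy illustration, not the PR's code: `InfraError` is a hypothetical stand-in for a `vf.Error` subclass, `score_claim` is a hypothetical helper, and the substring check stands in for the real LLM judge.

```python
from typing import Optional


class InfraError(Exception):
    """Hypothetical stand-in for a vf.Error subclass (not the real class)."""


def score_claim(claim: str, page_text: Optional[str], judge_available: bool) -> float:
    # Evaluator-side failure: raise so the rollout is flagged as an error,
    # rather than silently recording a score.
    if not judge_available:
        raise InfraError("judge unavailable")
    # Inaccessible or invalid cited page: the claim counts as unsupported,
    # mirroring the upstream semantics described above.
    if page_text is None:
        return 0.0
    # Toy check standing in for the LLM judge.
    return 1.0 if claim.lower() in page_text.lower() else 0.0
```

Under this sketch a dead link and a wrong answer both produce 0.0, which is the ambiguity the README note about finer-grained source-access handling refers to.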

Validation

  • uv run ruff check verifiers/envs/experimental/composable/tasksets/search
  • uv run --python 3.13 ty check verifiers/envs/experimental/composable/tasksets/search
  • Verifiers pre-push hooks passed.
  • End-to-end smoke with sibling rlm_search environment passed:
    uv run --no-sync vf-eval rlm-search -a '{"task_type":"quest", "category":"objective"}' -n 1 -r 1 -d -A --max-retries 0 -C quest_answer_source,quest_eval_summary
    • exit code: 0
    • agent_error: 0.000
    • objective_reward reported; reward may be 0.0 for an incorrect answer.

Note

Medium Risk
Large new scoring path with external judge API calls, HF eval-script loading, and network PDF/web fetches at grade time; incorrect or inaccessible sources can yield 0.0 without surfacing infra as errors, matching upstream QUEST behavior.

Overview
Adds a new composable search taskset family (SWE-style backend dispatch) with an initial quest backend for objective deep-research tasks from osunlp/QUEST-RL-Data.

QuestTaskSet loads the HF dataset (objective category only), runs agents in a sandbox, and expects the final answer in /task/answer.txt. QuestRubric reads that file (with completion fallback), dynamically loads the generated eval_scripts/*.py for each task_id from the dataset cache, and scores via the vendored QUEST obj_task_eval runtime (LLM judge, verification tree, URL/PDF fetching). Infra and judge failures map to vf.Error subclasses, while wrong answers score 0.0 without setting state["error"], matching upstream QUEST URL-verification semantics.
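A minimal sketch of the answer-file read with completion fallback, assuming the fallback applies when the file is missing or empty. The function name `read_answer` and the source labels `"answer_file"` and `"completion"` are hypothetical and not taken from the PR; the real rubric reads from the sandbox rather than a local path.

```python
import tempfile
from pathlib import Path


def read_answer(answer_path: Path, completion_text: str) -> tuple[str, str]:
    """Return (answer, source); the source labels are hypothetical."""
    if answer_path.is_file():
        text = answer_path.read_text(encoding="utf-8").strip()
        if text:
            return text, "answer_file"
    # Assumed fallback: use the final completion when no usable file exists.
    return completion_text.strip(), "completion"


def demo() -> list[tuple[str, str]]:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "answer.txt"
        missing = read_answer(path, "Paris ")   # file not written yet
        path.write_text("Lyon\n", encoding="utf-8")
        present = read_answer(path, "Paris ")   # file takes precedence
    return [missing, present]
```

Recording which source was used makes it possible to audit how often agents fail to write the file at all.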

Also adds make_search_taskset / make_quest_taskset, package READMEs, and core deps aiohttp, pymupdf, pillow, certifi (plus lockfile).

Reviewed by Cursor Bugbot for commit 98c3c1c.

Note

Add QUEST objective search taskset with LLM-based answer verification

  • Adds QuestTaskSet and QuestRubric in taskset.py to load QUEST HuggingFace dataset tasks, run them in a sandbox, and score answers via per-task eval scripts.
  • Introduces a vendored evaluation runtime (obj_task_eval) with Extractor and Verifier classes that fetch web/PDF content, call a judge LLM with concurrency control, and support majority-vote binary verification.
  • Implements VerificationNode and Evaluator in the eval toolkit to build hierarchical verification trees with sequential/parallel aggregation, gate-then-average scoring, and disk-based resume support.
  • Exposes make_quest_taskset and make_search_taskset factory functions from the search package for easy instantiation.
  • Adds aiohttp, pymupdf, pillow, and certifi as new required dependencies in pyproject.toml.
  • Risk: PDF parsing via pymupdf and async web fetching run at scoring time, adding latency and external network dependencies to evaluation.
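The two aggregation ideas named in this summary, majority-vote binary verification and gate-then-average scoring, might look roughly like the sketch below. The exact rules in the vendored runtime are not shown in this PR page, so the tie-breaking behaviour, the meaning of a "gate", and the empty-list default are all assumptions, and both function names are hypothetical.

```python
def majority_vote(votes: list[bool]) -> bool:
    # Strict majority of judge votes; ties count as "not verified" (assumption).
    return sum(votes) * 2 > len(votes)


def gate_then_average(gates: list[bool], scores: list[float]) -> float:
    # Assumed semantics: every gate (critical check) must pass,
    # otherwise the node scores 0.0 regardless of its other children.
    if not all(gates):
        return 0.0
    # With all gates passed, average the remaining child scores.
    # An empty list scoring 1.0 is an assumption of this sketch.
    return sum(scores) / len(scores) if scores else 1.0
```

For example, `gate_then_average([True], [1.0, 0.0])` gives 0.5, while any failed gate forces the node to 0.0.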

Macroscope summarized 98c3c1c.

Comment thread verifiers/envs/experimental/composable/tasksets/search/quest/taskset.py Outdated
cache=cache,
semaphore=self._semaphore,
logger=logger,
model=self.judge_model,

Shared logger corrupts parallel scoring

Medium Severity

QuestRubric.score_group scores rollouts concurrently, but every evaluate_answer call receives the same module-level logger. The vendored evaluator appends LLM trace data onto that logger instance, so parallel group scoring can interleave or overwrite traces and question metadata across rollouts.
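One way to avoid this kind of cross-talk is to give each rollout its own trace container instead of appending to a shared object. The sketch below is illustrative only: `TraceSink`, `evaluate_answer`, and `score_group` here are simplified, hypothetical stand-ins and not the signatures used in the PR.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class TraceSink:
    """Hypothetical per-rollout container for judge traces."""
    rollout_id: str
    traces: list[str] = field(default_factory=list)


async def evaluate_answer(answer: str, sink: TraceSink) -> float:
    sink.traces.append(f"judge prompt for {answer}")
    await asyncio.sleep(0)  # yield control so concurrent rollouts interleave
    sink.traces.append(f"judge verdict for {answer}")
    return 1.0 if answer == "right" else 0.0


async def score_group(answers: dict[str, str]) -> dict[str, TraceSink]:
    # One sink per rollout: concurrent evaluations never share mutable state.
    sinks = {rid: TraceSink(rid) for rid in answers}
    await asyncio.gather(
        *(evaluate_answer(ans, sinks[rid]) for rid, ans in answers.items())
    )
    return sinks
```

Even though the two evaluations interleave at the `await`, each sink ends up holding only its own rollout's traces.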

Reviewed by Cursor Bugbot for commit ecbf651.

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 98c3c1c.

)

self._auto_save_state("batch_verify")
return results

Batch verify masks infra errors

High Severity

In batch_verify, asyncio.gather(..., return_exceptions=True) stores raised vf.Error instances in the result list instead of propagating them like verify() does. Those exception objects are truthy, so common checks such as all(results) can treat judge or infrastructure failures as successful verifications and inflate scores.
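The failure mode can be reproduced in a few lines. In this sketch `JudgeError` is a hypothetical stand-in for a `vf.Error` subclass and `verify` is a toy coroutine, not the vendored verifier. The strict variant shows one possible remedy (re-raising the first captured exception); it is not necessarily the fix adopted in the PR.

```python
import asyncio


class JudgeError(Exception):
    """Hypothetical stand-in for a vf.Error subclass raised on judge failure."""


async def verify(claim: str) -> bool:
    # Toy verifier: one claim simulates an infrastructure failure.
    if claim == "judge-down":
        raise JudgeError("judge API unavailable")
    return claim == "supported"


async def batch_verify_masking(claims: list[str]) -> list[object]:
    # Exceptions come back as list items instead of being raised.
    return await asyncio.gather(*(verify(c) for c in claims), return_exceptions=True)


async def batch_verify_strict(claims: list[str]) -> list[bool]:
    results = await asyncio.gather(*(verify(c) for c in claims), return_exceptions=True)
    for result in results:
        if isinstance(result, BaseException):
            raise result  # surface the failure instead of scoring it
    return list(results)


def demo() -> tuple[bool, str]:
    claims = ["supported", "judge-down"]
    masked = asyncio.run(batch_verify_masking(claims))
    looks_ok = all(masked)  # the exception object is truthy
    try:
        asyncio.run(batch_verify_strict(claims))
        outcome = "no error"
    except JudgeError:
        outcome = "JudgeError raised"
    return looks_ok, outcome
```

Here `all(masked)` evaluates to `True` even though one verification never completed, which is exactly how an infrastructure failure could be counted as a pass.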

Reviewed by Cursor Bugbot for commit 98c3c1c.

@rasdani requested review from mikasenghaas and snimu on June 8, 2026, 22:19
@rasdani merged commit ffb5e01 into PrimeIntellect-ai:main on Jun 8, 2026
7 of 11 checks passed