AgentWebBench

A benchmark for Multi-Agent Coordination in the Agentic Web.

The Agentic Web is an emerging paradigm in which autonomous agents help users find and use online information. As the paradigm develops, content providers are also deploying agents to manage their data and serve it through controlled interfaces. This shift moves information access from centralized retrieval to decentralized coordination. To study this setting, we introduce AgentWebBench, a benchmark that evaluates how well a user agent synthesizes answers by interacting with website-specific content agents.

News

🔥 [2026-06-16]: We released our code and datasets. Check out our leaderboard!
🔥 [2026-05-01]: Cheers! Our paper was accepted to ICML 2026!
🔥 [2026-04-13]: Our paper is now available on arXiv. Check it out!

Installation

(1) Environment

conda create -n awbench python=3.12
conda activate awbench
pip install -r requirements.txt # CUDA 12.9

(2) API keys

Put your secrets in keys.env at the repo root:

# Google API Key (Gemini)
# Get from: https://ai.google.dev/
GOOGLE_API_KEY=your_google_api_key_here

# OpenAI API Key (GPT models)
# Get from: https://platform.openai.com/api-keys
OPENAI_API_KEY=your_openai_api_key_here

# Hugging Face Token (for model access)
# Get from: https://huggingface.co/settings/tokens
HF_TOKEN=your_huggingface_token_here
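
Before running anything, it can help to confirm that keys.env is filled in. The snippet below is a minimal sketch and not part of the repository: the helper names `parse_env` and `missing_keys` are hypothetical, and it assumes keys.env holds plain KEY=value lines with # comments, as in the template above. Which keys you actually need depends on the backend you plan to use.

```python
from pathlib import Path

# Key names taken from the keys.env template above.
EXPECTED_KEYS = ("GOOGLE_API_KEY", "OPENAI_API_KEY", "HF_TOKEN")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], expected=EXPECTED_KEYS) -> list[str]:
    """Return expected keys that are absent, empty, or still template placeholders."""
    return [
        k for k in expected
        if not values.get(k) or values[k].startswith("your_")
    ]


if __name__ == "__main__":
    env_path = Path("keys.env")  # repo root, per the instructions above
    text = env_path.read_text() if env_path.exists() else ""
    print("Missing or placeholder keys:", missing_keys(parse_env(text)))
```

An empty list means every key in the template has a non-placeholder value; it does not verify that the keys are valid with the providers.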

(3) Verify Search API Status

We provide a Search API for all tasks. To check the API status, run the following command:

curl "https://www.clueweb22.us/awbench/websites"

These vectors and ID maps are derived from ClueWeb22 documents. Use is subject to the ClueWeb22 license.
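
The same status check can be done from Python, for example at the start of a longer run. This is a rough sketch using only the standard library; the function name `check_search_api` and its `opener` parameter are illustrative, and the response format of the endpoint is not documented here, so the sketch reports only the HTTP status and body size.

```python
import urllib.request

# URL copied from the curl command above.
STATUS_URL = "https://www.clueweb22.us/awbench/websites"


def check_search_api(url=STATUS_URL, timeout=10.0, opener=urllib.request.urlopen):
    """Return (ok, detail); ok is True when the endpoint answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read()
            return resp.status == 200, f"HTTP {resp.status}, {len(body)} bytes"
    except OSError as exc:  # URLError and HTTPError are both OSError subclasses
        return False, f"request failed: {exc}"
```

Calling `check_search_api()` with no arguments hits the live endpoint; the `opener` argument exists only so the function can be exercised without network access.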

Benchmark Design

🧩 Tasks

AgentWebBench has four tasks that cover common web information needs:

Ranked retrieval — web search, web recommendation.
Open-ended synthesis — question answering, deep research.

Task	`--task_type`	Dataset	# Examples	Description	Metrics
Web Search	`web_search`	data/web_search/test_354.json	354	Document-retrieval queries	NDCG@{3,5}, Recall@{3,5}
Web Recommendation	`web_recommendation`	data/web_recommendation/test_281.json	281	Browsing history → next intent	NDCG@{3,5}, Recall@{3,5}
Question Answering	`qa`	data/qa/test_53.json	53	Multi-hop QA	Accuracy, F1
Deep Research	`deep_research`	data/deep_research/test_331.json	331	Deep Research questions	KPR, KPC, Clarity, Insight

Each task also ships a test_1.json single-example smoke set.
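
For readers unfamiliar with the ranked-retrieval metrics in the table, the sketch below shows the textbook binary-relevance definitions of Recall@k and NDCG@k. It is illustrative only: the function names are hypothetical, it assumes each ranked list contains unique document IDs, and the repository's own evaluation code may differ in details such as graded relevance or tie handling.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2) = 1
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example with made-up document IDs.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = ["d1", "d2"]
for k in (3, 5):
    print(f"Recall@{k}={recall_at_k(ranked, relevant, k):.3f}",
          f"NDCG@{k}={ndcg_at_k(ranked, relevant, k):.3f}")
```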

🧩 Methods

The --method flag sets how the user agent coordinates content agents:

`--method`	Name	What it does
`classical`	Classical	User agent only; centralized search directly over the global corpus.
`tool_embed`	Tool_E	User agent only; websites are exposed as tools and selected by embedding similarity.
`tool_prompt`	Tool_P	User agent only; websites are exposed as tools and selected by prompt (the user agent picks the sites).
`multi_agent`	Multi-Agent	The user agent dispatches to per-website content agents that search and summarize in parallel.
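
To make the `multi_agent` row concrete, here is a toy sketch of the dispatch pattern: one query is fanned out to several per-website content agents in parallel, and their summaries are merged. Everything in it is hypothetical (the `ContentAgent` interface, the stub agents, the `dispatch` and `synthesize` helpers, and the example site names); the actual agents in the repository call LLMs and the Search API and are structured differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# A content agent here is any callable: query -> summary string (hypothetical interface).
ContentAgent = Callable[[str], str]


def make_stub_agent(site: str) -> ContentAgent:
    """Stand-in for a per-website agent that would search its own site and summarize."""
    def agent(query: str) -> str:
        return f"[{site}] summary for: {query}"
    return agent


def dispatch(query: str, agents: dict[str, ContentAgent], max_workers: int = 4) -> dict[str, str]:
    """Send the query to every selected content agent in parallel and collect replies."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {site: pool.submit(agent, query) for site, agent in agents.items()}
        return {site: fut.result() for site, fut in futures.items()}


def synthesize(query: str, replies: dict[str, str]) -> str:
    """Placeholder for the user agent's LLM call that merges per-site summaries."""
    return "\n".join(replies[site] for site in sorted(replies))


agents = {site: make_stub_agent(site) for site in ("example.org", "example.com")}
question = "example question"
print(synthesize(question, dispatch(question, agents)))
```

Threads suit this pattern because each content agent mostly waits on network or LLM calls; in the sketch the user agent sees only the returned summaries, not the underlying documents.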

Our Search API supports per-site search only. If you need traditional centralized search (`classical`), download ClueWeb22 category B and use the global embeddings available at AgentWebBench-corpus.

Quick Start

Run all commands from the repo root.

# --- Hosted API model ---
sh scripts/run_qa.sh             multi_agent gemini gemini-3-flash-preview
sh scripts/run_web_search.sh     multi_agent gpt    gpt-4o
sh scripts/run_deep_research.sh  multi_agent hf     Qwen/Qwen3-30B-A3B-Thinking-2507:nebius

# --- Local model via vLLM (random port) ---
sh scripts/launchers/run_works_local_4b.sh

Each command runs main.py and then its evaluation. The repo supports four LLM backends (--api_type ∈ {gemini, gpt, hf, local}):

Gemini
GPT
HuggingFace Inference
Local models served with vLLM

See scripts/README.md for details.
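
If you want to run one task under several methods, a small driver can assemble the same commands shown above. The sketch below is an assumption-laden convenience, not part of the repository: `build_command` and `run_sweep` are hypothetical names, the positional order (method, API type, model) is inferred from the Quick Start examples, and it assumes the scripts accept the other `--method` values in the first position. Only the three scripts named above are mapped.

```python
import subprocess

# Script paths copied from the Quick Start commands above; other tasks are not listed there.
TASK_SCRIPTS = {
    "qa": "scripts/run_qa.sh",
    "web_search": "scripts/run_web_search.sh",
    "deep_research": "scripts/run_deep_research.sh",
}


def build_command(task: str, method: str, api_type: str, model: str) -> list[str]:
    """Build the `sh <script> <method> <api_type> <model>` argument list."""
    if task not in TASK_SCRIPTS:
        raise ValueError(f"no script known for task {task!r}")
    return ["sh", TASK_SCRIPTS[task], method, api_type, model]


def run_sweep(task: str, methods: list[str], api_type: str, model: str, dry_run: bool = True) -> None:
    """Run one task under several methods in sequence; dry_run only prints the commands."""
    for method in methods:
        cmd = build_command(task, method, api_type, model)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # must be launched from the repo root


run_sweep("qa", ["tool_embed", "tool_prompt", "multi_agent"], "gpt", "gpt-4o")
```

With the default `dry_run=True` nothing is executed, so you can inspect the commands before spending API quota.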

How to evaluate a new LLM or method?

Existing results are on the Homepage and in the Paper. To evaluate other LLMs or coordination strategies:

New LLM — serve your model with vLLM and run with --api_type local.
New coordination strategy — modify the user agent or content agent.

Acknowledgements

Our work is built on ClueWeb22, Tevatron, MiniCPM-Embedding-Light, and the MS MARCO / DeepResearchGym / ORBIT datasets.

Citation

@inproceedings{zhong2026agentwebbench,
  title={AgentWebBench: Benchmarking Multi-Agent Coordination in Agentic Web},
  author={Zhong, Shanshan and Shen, Kate and Xiong, Chenyan},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}
