AgentArena

A community benchmark for AI coding agent performance

Pick your favorite AI coding setup: agent, model, provider or your own self-hosted rig if you're feeling spicy. Run one of the tests, and send a PR with the results. Your handle goes on the leaderboard. Your rig joins the silicon beasts roster.

Each test is a multi-stage gauntlet: an unattended first build, then progressively harder refinement stages. Every stage is a single prompt fed to the agent as-is, with no hand-holding and no fixing its mistakes between stages. Setups that one-shot the easy stuff but fall apart on the harder later stages get exposed.

The leaderboard is fun, but the goal is real: a community-run, real-world view of how agentic AI coding setups actually perform on the tasks they'll be asked to do, the kind of comparison vendor benchmarks rarely give you. Cloud vs. self-hosted, frontier vs. open-weight, high-effort vs. fast: all on the same prompts, on the same scale.

Tests structure

Each test has its own directory under /tests. Inside each test, you'll find:

test.yaml — Test definition: name, description, and each stage's prompt.
/runs/ — One subdirectory per contributed run. Each run directory contains a run.yaml manifest and an optional subdirectory per stage with the resulting source code.

Example test structure

Here is an example of the directory structure for the live-message-wall test:

/tests
    /live-message-wall
        test.yaml
        /runs
            /tin-cat-claude-code-sonnet-4.6-high-effort
                run.yaml
                /stage-1-first-run
                /stage-2-advanced-features
                /stage-3-refinements
                /stage-4-complex-refinements

Each run directory keeps its data in one place: its run.yaml carries all the metadata (contributor, agent, provider, model, settings, hardware) and per-stage metrics (time, tokens, cost, rating).

Optionally, each stage-*/ subdirectory holds the complete source code that resulted from running that stage (even if most of it is duplicated from earlier stages).
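To get a quick overview of what has been contributed, the layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the repository: the function name list_runs is hypothetical, and it relies only on the directory conventions described here (a runs/ directory per test, a run.yaml per run, optional stage-* subdirectories). It does not read the contents of run.yaml.

```python
from pathlib import Path


def list_runs(tests_root: Path) -> dict[str, dict[str, list[str]]]:
    """Map each test name to its runs and the stage directories each run holds.

    Hypothetical helper: it only inspects the directory layout described above.
    """
    layout: dict[str, dict[str, list[str]]] = {}
    if not tests_root.is_dir():
        return layout
    for test_dir in sorted(p for p in tests_root.iterdir() if p.is_dir()):
        runs: dict[str, list[str]] = {}
        runs_dir = test_dir / "runs"
        if runs_dir.is_dir():
            for run_dir in sorted(p for p in runs_dir.iterdir() if p.is_dir()):
                # Only count directories that carry the run.yaml manifest.
                if not (run_dir / "run.yaml").is_file():
                    continue
                # Stage source directories are optional, so this may be empty.
                runs[run_dir.name] = sorted(
                    p.name for p in run_dir.glob("stage-*") if p.is_dir()
                )
        layout[test_dir.name] = runs
    return layout


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    for test, runs in list_runs(Path("tests")).items():
        print(test)
        for run, stages in runs.items():
            print(f"    {run}: {', '.join(stages) or '(no stage sources)'}")
```

A run that ships only run.yaml still appears in the result, with an empty stage list, which matches the stage directories being optional.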

The CLI

The repository ships with a single-file CLI, agent-arena-cli.py, that handles the boilerplate for adding new tests or runs interactively, and can validate any manual YAML edits.

# Requirements: Python 3.11+

# Add (interactive)
./agent-arena-cli.py run add # record a new run
./agent-arena-cli.py test add # create a new test

# Validate
./agent-arena-cli.py validate # check all yaml files against the schema

Run ./agent-arena-cli.py --help for the full command list.

On Windows

Command Prompt: agent-arena-cli.py run add
PowerShell: .\agent-arena-cli.py run add
Git Bash / WSL: ./agent-arena-cli.py run add

You can always invoke Python directly with py agent-arena-cli.py … or python agent-arena-cli.py ….
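If you want to call the CLI from your own scripts in a way that behaves the same on every platform, one option is to launch it through the current Python interpreter, which mirrors the python agent-arena-cli.py … form above. The sketch below is an illustration only: cli_command and run_validate are hypothetical helper names, and the code assumes nothing about the CLI beyond the validate command and the Python 3.11+ requirement stated in this README.

```python
import subprocess
import sys
from pathlib import Path

MIN_PYTHON = (3, 11)  # taken from the requirements note in this README


def cli_command(*args: str, script: str = "agent-arena-cli.py") -> list[str]:
    """Build an argv list that runs the CLI with the current interpreter."""
    return [sys.executable, script, *args]


def run_validate(repo_root: Path) -> int:
    """Run the validate command from repo_root and return the CLI's exit code."""
    if sys.version_info < MIN_PYTHON:
        raise RuntimeError("agent-arena-cli.py requires Python 3.11+")
    # cwd is set so the relative script path resolves inside the repository.
    return subprocess.run(cli_command("validate"), cwd=repo_root).returncode
```

For example, run_validate(Path(".")) from the repository root is equivalent to typing python agent-arena-cli.py validate, without depending on the shell's handling of ./ or the script's executable bit.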

Contribute

Contributions are welcome: you can submit entirely new tests, or your runs of existing tests. See CONTRIBUTING.md for details.

Honest about what we are

AgentArena aims to be a relaxed, accurate benchmark of real tasks, not a precise, controlled lab study. If you're after rigorous, technical benchmarks, check the AI Agent Benchmark Compendium, a curated list worth your time.

About

Reach me at @lorenzoherrera ❤️ tin.cat
