A benchmark suite for evaluating large language models on coding tasks. The project supports HumanEval, MBPP, BigCodeBench, and SWE-bench, and provides both a command-line runner and a Streamlit interface for interactive analysis.
Repository: https://github.com/abhaymundhara/llm-benchmark-suite
- REQUIREMENTS — hardware, software, storage, and environment prerequisites
- INSTALL — installation, smoke-test, and basic usage validation
- REPLICATION_GUIDE — instructions for repeating and reproducing benchmark results
The suite is designed to compare cloud and local model providers under a shared evaluation pipeline. It records pass rate, latency, token usage, and failure categories, and exports results in both JSON and human-readable summary formats.
- Multiple benchmarks: HumanEval, MBPP, BigCodeBench, and SWE-bench variants
- CLI workflow for reproducible batch execution
- Streamlit UI for run configuration, progress monitoring, and result inspection
- Run comparison tools with overlap-aware task matching (see the sketch after this list)
- Support for OpenAI, Anthropic, Google Gemini, and Ollama-based local models
- Structured reporting with task-level diagnostics and summary metrics
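A minimal sketch of how overlap-aware run comparison can be approached is shown below. It is illustrative only: it assumes each report JSON exposes a `tasks` list with `task_id` and `passed` fields, which may not match the schema the suite actually writes, and the file paths are placeholders.

```python
# Hedged sketch of overlap-aware run comparison.
# Assumes each report JSON has a "tasks" list with "task_id" and "passed" fields;
# the real schema produced by the suite may differ.
import json

def load_results(path):
    with open(path) as f:
        report = json.load(f)
    return {t["task_id"]: bool(t["passed"]) for t in report["tasks"]}

def compare(path_a, path_b):
    a, b = load_results(path_a), load_results(path_b)
    shared = sorted(set(a) & set(b))  # only tasks attempted by both runs
    if not shared:
        return None
    return {
        "shared_tasks": len(shared),
        "pass_rate_a": sum(a[t] for t in shared) / len(shared),
        "pass_rate_b": sum(b[t] for t in shared) / len(shared),
    }

print(compare("reports/benchmark_humaneval_run_a.json",
              "reports/benchmark_humaneval_run_b.json"))
```

Restricting the comparison to the shared task set keeps pass rates comparable when two runs used different `--limit` values or different task subsets.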
Install dependencies:

```bash
python3 -m pip install -r requirements.txt
```

Launch the Streamlit app:

```bash
streamlit run app.py
```

Or run a benchmark from the command line:

```bash
python3 runner.py --model ollama:qwen2.5-coder:7b --benchmark bigcodebench --limit 5
```

For cloud providers, set the required API keys before running benchmarks:

```bash
export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
```

For SWE-bench runs, use `setup_docker.sh` to prepare the required Docker images.
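For reproducible batch execution, the CLI runner can be driven from a short script. The sketch below is illustrative only: it assumes `runner.py` accepts the same flags as the quick-start command above, and the benchmark names in the list are placeholders that should match the suite's benchmark registry.

```python
# Hedged sketch: batch-run several model/benchmark pairs through the CLI runner.
# Flags mirror the quick-start example; benchmark names are placeholders.
import subprocess

RUNS = [
    ("ollama:qwen2.5-coder:7b", "humaneval"),
    ("ollama:qwen2.5-coder:7b", "mbpp"),
]

for model, benchmark in RUNS:
    subprocess.run(
        ["python3", "runner.py",
         "--model", model,
         "--benchmark", benchmark,
         "--limit", "5"],
        check=True,  # stop the batch if a run fails
    )
```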
The checks below are intended to prevent the most frequent installation and execution failures.
- Symptom: package install failures or import errors at startup.
- Check: run `python --version` and confirm a supported interpreter (3.10 or 3.11).
- Resolution: create a fresh virtual environment and reinstall from `requirements.txt`.
- Symptom: `streamlit` is missing when running `run.sh`.
- Check: ensure the virtual environment is active.
- Resolution: reinstall dependencies and run with `python -m streamlit run app.py` if needed.
- Symptom: setup script reports daemon communication failure.
- Check: run `bash setup_docker.sh` and verify daemon health.
- Resolution: start Docker Desktop/Engine and rerun setup.
- Symptom: provider returns authorization or credential errors.
- Check: verify the `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `GOOGLE_API_KEY` values.
- Resolution: export valid keys in the current shell session or in a `.env` file.
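As a quick sanity check before a cloud run, a snippet along these lines can confirm which provider keys are visible to the current process. It only inspects the environment and does not validate the keys against any provider API.

```python
# Hedged sketch: report which provider API keys are set in the current environment.
# Does not verify that the keys are actually valid.
import os

for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"):
    status = "set" if os.environ.get(var) else "missing"
    print(f"{var}: {status}")
```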
- Symptom: generation fails because the requested model tag is missing.
- Check: verify that the Ollama service is running and the model has been pulled.
- Resolution: pull the model first (for example, `ollama pull qwen2.5-coder:7b`).
- Symptom: Docker pulls/builds fail or runs terminate unexpectedly.
- Check: available disk space.
- Resolution: free disk space and rerun setup; SWE-bench workflows require substantially more storage than lightweight benchmarks.
If a problem persists, open a GitHub issue with: OS version, Python version, benchmark/model configuration, command used, and full error output.
- `app.py` — Streamlit application
- `runner.py` — CLI benchmark runner and orchestration layer
- `benchmarks/` — benchmark implementations and registry
- `models/` — model adapters for supported providers
- `requirements.txt` — Python dependencies
- `run.sh` — convenience launcher for the Streamlit app
- `setup_docker.sh` — SWE-bench environment setup
- `Documentation/` — project notes, guides, and dataset references
HumanEval:
- 164 Python function-synthesis tasks
- Lightweight evaluation suitable for quick iteration

MBPP:
- 500 Python programming tasks
- Useful for basic code-generation evaluation

BigCodeBench:
- 1,140 tasks focused on practical coding workflows
- Includes `instruct` and `complete` variants

SWE-bench:
- Repository-level software-engineering tasks based on real GitHub issues
- Requires Docker-based evaluation
Cloud models:
- OpenAI models such as GPT-5.4 and GPT-5.3-Codex
- Anthropic Claude models
- Google Gemini models

Local models (via Ollama):
- Qwen2.5-Coder:7b
- GPT-OSS:20B
- Other models available through Ollama
When Ollama is used with the default `max_tokens=512`, the adapter automatically applies model-specific token limits to reduce truncation and improve output quality.
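The names and values below are purely illustrative and are not the suite's actual adapter API; they only sketch how a per-model lookup can replace the default limit when the caller has not set one explicitly.

```python
# Hedged sketch of a per-model token-limit override for an Ollama adapter.
# Mapping values and function names are assumptions for illustration only.
DEFAULT_MAX_TOKENS = 512
MODEL_TOKEN_LIMITS = {
    "qwen2.5-coder:7b": 2048,  # assumed value
    "gpt-oss:20b": 4096,       # assumed value
}

def effective_max_tokens(model_tag: str, requested: int = DEFAULT_MAX_TOKENS) -> int:
    """Return a model-specific limit when the caller left the default in place."""
    if requested != DEFAULT_MAX_TOKENS:
        return requested  # respect an explicit user override
    return MODEL_TOKEN_LIMITS.get(model_tag, requested)

print(effective_max_tokens("qwen2.5-coder:7b"))  # 2048 under these assumptions
```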
- Pass rate
- Latency per task
- Input, output, and total token usage
- Estimated API cost for cloud models
- Failure categorization and diagnostics
Benchmark runs are written to the `reports/` directory:

- `benchmark_<name>_<timestamp>.json` — full structured results
- `benchmark_<name>_<timestamp>_summary.txt` — concise summary report
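Since the JSON reports are structured, they can be post-processed directly. The snippet below is a sketch that assumes the report contains a `tasks` list with `passed` flags; the field names may not match the exact schema the suite emits.

```python
# Hedged sketch: inspect the most recent structured report from a benchmark run.
# Field names ("tasks", "passed") are assumptions about the report schema.
import glob
import json

reports = sorted(glob.glob("reports/benchmark_*.json"))
if reports:
    with open(reports[-1]) as f:
        report = json.load(f)
    tasks = report.get("tasks", [])
    passed = sum(1 for t in tasks if t.get("passed"))
    print(f"{reports[-1]}: {passed}/{len(tasks)} tasks passed")
```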
- HumanEval — OpenAI
- MBPP — Google
- SWE-bench — Princeton NLP
- BigCodeBench — BigCode Project