feat: Migrate smoke tests to terminal bench runner #140

Open
pmateusz wants to merge 2 commits into master from feat/terminal-bench-smoke-dataset
@pmateusz (Contributor) commented May 5, 2026


Kimchi Summary

What changed

Adds a new terminal-bench-smoke benchmark suite with three tasks (two Go coding tasks and one research task) that reuse the terminal-bench-2 harness. Also extends the kimchi_agent to support runtime toggling between orchestrator and single-model execution modes via the KIMCHI_MULTI_MODEL environment variable.
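A minimal sketch of how such an environment-variable toggle could work, assuming the agent assembles its CLI invocation as a list of arguments. The function name `build_command`, the base command `kimchi-code`, and the case-insensitive comparison are illustrative assumptions, not the code from this PR; only `KIMCHI_MULTI_MODEL` and `--multi-model=false` come from the description.

```python
import os
from typing import List, Mapping, Optional


def build_command(base_cmd: List[str],
                  env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a copy of base_cmd, adding --multi-model=false when requested.

    Hypothetical helper: the real agent.py may structure this differently.
    """
    # Fall back to the process environment when no mapping is supplied.
    env = os.environ if env is None else env
    cmd = list(base_cmd)  # copy so the caller's list is not mutated

    # Assumption: the value is compared case-insensitively; the PR summary
    # only mentions the literal value "false".
    if env.get("KIMCHI_MULTI_MODEL", "").strip().lower() == "false":
        cmd.append("--multi-model=false")
    return cmd
```

With this shape, leaving the variable unset keeps the default (orchestrator) invocation unchanged, so existing task definitions need no edits.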

Why

Provides a lightweight, fast-feedback smoke test suite for validating kimchi-code changes without running the full benchmark. The single-model toggle eliminates the need to duplicate task definitions when comparing orchestrator behavior against standalone model performance.

Key changes

  • benchmark/terminal-bench-2/src/kimchi_agent/agent.py: Detects KIMCHI_MULTI_MODEL=false (passed via --ae flags) and appends --multi-model=false to the CLI invocation, allowing any task to run in single-model mode.
  • benchmark/terminal-bench-smoke/: New smoke test directory containing:
    • tasks/go-rate-limiter/: Token-bucket HTTP middleware implementation task with black-box verifier testing burst limits and per-IP isolation.
    • tasks/go-task-api/: Layered REST API task (POST/GET/PATCH/DELETE) with end-to-end HTTP contract verification.
    • tasks/go-router-research/: Research task with trivial verifier (only checks /app/answer.md exists and is non-empty); intended for comparing token usage and subagent behavior rather than reward.
    • scripts/run-local.sh: Cross-builds a Linux binary from the working tree and runs the smoke suite against it.
    • scripts/run-release.sh: Runs the smoke suite against the latest published kimchi-code release.
    • README.md: Usage examples including single-model tagging syntax (--ae KIMCHI_MULTI_MODEL=false --ae KIMCHI_TAGS=mode:single).
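For illustration, the trivial check described for go-router-research could be sketched roughly as follows. This is an assumption about its shape, not the verifier shipped in the PR: the function name `answer_present` is hypothetical, and "non-empty" is interpreted here as a file size greater than zero bytes.

```python
from pathlib import Path


def answer_present(path: str = "/app/answer.md") -> bool:
    """Return True when the answer file exists and contains at least one byte.

    Hypothetical sketch: the actual verifier may be a shell or test script.
    """
    answer = Path(path)
    # is_file() is False for missing paths and directories, so stat() is
    # only reached for a regular file.
    return answer.is_file() and answer.stat().st_size > 0
```

A check this weak passes for any non-empty answer, which fits the stated intent of comparing token usage and subagent behavior rather than reward.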

@kimchi-review (Bot) commented May 5, 2026

Kimchi Code Review

A review is being prepared and will be posted shortly.

| Property | Value |
|---|---|
| Commit | 4482339 |
| Author | @pmateusz |
| Files changed | 20 |
| Review status | Pending |

What to expect

Kimchi will analyze the changes in this pull request and post:

  • A summary of the overall changes
  • Inline comments on specific lines with findings categorized by issue type

The review typically completes within a few minutes. This comment will be updated once the review is ready.

Interact with Kimchi

  • @kimchi review — re-trigger a full review on the latest commit
  • @kimchi summary — regenerate the PR summary
  • @kimchi ignore — skip this PR (no review will be posted)
  • Reply to any inline comment to ask follow-up questions or request clarification

Configuration

Reviews are configured by your organization admin.
Review instructions, excluded directories, and severity thresholds can be adjusted per repository in the Kimchi dashboard.


Powered by Kimchi — AI-powered code review by CAST AI
