feat: Migrate smoke tests to terminal bench runner by pmateusz · Pull Request #140 · castai/kimchi-dev

pmateusz · 2026-05-05T09:56:51Z

Kimchi Summary

What changed

Adds a new terminal-bench-smoke benchmark suite with three tasks (two Go coding tasks and one research task) that reuse the terminal-bench-2 harness. Also extends the kimchi_agent to support runtime toggling between orchestrator and single-model execution modes via the KIMCHI_MULTI_MODEL environment variable.
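As a rough illustration of the toggle described above, the sketch below shows how an agent wrapper might turn the environment variable into a CLI flag. Only `KIMCHI_MULTI_MODEL` and `--multi-model=false` come from this PR. The helper name `build_cli_args`, the placeholder base command, and the case-insensitive comparison are assumptions, not the actual code in agent.py.

```python
from typing import Mapping, Sequence


def build_cli_args(base_args: Sequence[str], env: Mapping[str, str]) -> list[str]:
    """Return CLI arguments, appending --multi-model=false when the env asks for it.

    Hypothetical helper: the real agent.py may structure this differently.
    """
    args = list(base_args)
    # Assumption: only the value "false" (case-insensitive) disables the
    # orchestrator; an unset variable or any other value keeps the default.
    if env.get("KIMCHI_MULTI_MODEL", "").strip().lower() == "false":
        args.append("--multi-model=false")
    return args
```

Under these assumptions, a run started with `--ae KIMCHI_MULTI_MODEL=false` would gain the extra flag, while a run without it would invoke the CLI unchanged.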

Why

Provides a lightweight, fast-feedback smoke test suite for validating kimchi-code changes without running the full benchmark. The single-model toggle eliminates the need to duplicate task definitions when comparing orchestrator behavior against standalone model performance.

Key changes

benchmark/terminal-bench-2/src/kimchi_agent/agent.py: Detects KIMCHI_MULTI_MODEL=false (via --ae flags) and appends --multi-model=false to CLI invocation, allowing single-model execution of any task.
benchmark/terminal-bench-smoke/: New smoke test directory containing:
- tasks/go-rate-limiter/: Token-bucket HTTP middleware implementation task with black-box verifier testing burst limits and per-IP isolation.
- tasks/go-task-api/: Layered REST API task (POST/GET/PATCH/DELETE) with end-to-end HTTP contract verification.
- tasks/go-router-research/: Research task with trivial verifier (only checks /app/answer.md exists and is non-empty); intended for comparing token usage and subagent behavior rather than reward.
- scripts/run-local.sh: Cross-builds a Linux binary from the working tree and runs the smoke suite against it.
- scripts/run-release.sh: Runs the smoke suite against the latest published kimchi-code release.
- README.md: Usage examples including single-model tagging syntax (--ae KIMCHI_MULTI_MODEL=false --ae KIMCHI_TAGS=mode:single).
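The trivial verifier described for go-router-research could look roughly like the sketch below. The path `/app/answer.md` and the "exists and is non-empty" rule come from the PR description. The function name `answer_is_present` and the choice of Python are assumptions made for illustration; the actual verifier may be a shell script or something else entirely.

```python
from pathlib import Path


def answer_is_present(path: str = "/app/answer.md") -> bool:
    """Return True when the answer file exists and contains at least one byte."""
    answer = Path(path)
    # A missing path or a directory fails the check outright.
    if not answer.is_file():
        return False
    # "Non-empty" is read literally here: any content at all passes.
    return answer.stat().st_size > 0
```

A check this loose cannot distinguish a good answer from a poor one, which fits the stated intent of using the task to compare token usage and subagent behavior rather than reward.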

kimchi-review · 2026-05-05T09:56:54Z

Kimchi Code Review

A review is being prepared and will be posted shortly.

Property	Value
Commit	`4482339`
Author	@pmateusz
Files changed	20
Review status	Pending

What to expect

Kimchi will analyze the changes in this pull request and post:

A summary of the overall changes
Inline comments on specific lines with findings categorized by issue type

The review typically completes within a few minutes. This comment will be updated once the review is ready.

Interact with Kimchi

@kimchi review — re-trigger a full review on the latest commit
@kimchi summary — regenerate the PR summary
@kimchi ignore — skip this PR (no review will be posted)
Reply to any inline comment to ask follow-up questions or request clarification

Configuration

Reviews are configured by your organization admin.
Review instructions, excluded directories, and severity thresholds can be adjusted per repository in the Kimchi dashboard.

Powered by Kimchi — AI-powered code review by CAST AI

pmateusz added 2 commits May 4, 2026 21:58

feat(KimchiAgent): support KIMCHI_MULTI_MODEL env to toggle --multi-model=false

5c03da1

feat(Bench): add terminal-bench-smoke dataset with 3 tasks ported from manual suite

4482339

pmateusz wants to merge 2 commits into master from feat/terminal-bench-smoke-dataset
