Add MSBench CI pipeline for SDK skills evaluation #222

Draft
yumnahussain wants to merge 3 commits into AzureCosmosDB:main from yumnahussain:yumnahussain/ideal-succotash

Conversation

@yumnahussain yumnahussain commented Jun 25, 2026

Summary

Adds a GitHub Actions workflow and supporting scripts to evaluate the `cosmosdb-sdk` skill set using MSBench. The pipeline measures whether the skills help an agent produce Cosmos DB SDK best-practice-compliant code across 5 SDKs.

What's included

| File | Purpose |
| --- | --- |
| `.github/workflows/msbench-eval.yaml` | Workflow (workflow_dispatch) with Azure OIDC login, msbench-cli install |
| `scripts/msbench-eval.py` | Orchestration: submits `msbench-cli run --repeat 3`, polls, reports |
| `scripts/create-skills-issue.py` | Maps failing test classes to `sdk-*` rules, creates GitHub issue |

How it works

  1. Manually trigger via Actions → MSBench SDK Skills Evaluation
  2. Runs 15 Harbor task instances (3 scenarios × 5 SDKs) with 3 independent repeats
  3. Computes pass@k metrics per instance
  4. If any instance falls below threshold (default 90%), creates a GitHub issue mapping failures to specific rules with suggested skill improvements
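As a rough illustration of step 3, the sketch below computes pass@k with the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of repeats and c the number that passed. This is an assumption: the PR does not state which estimator MSBench uses, and the instance names are placeholders.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n repeats of which c passed (assumed estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Placeholder per-instance pass counts out of 3 repeats.
passes = {"sdk-a/scenario-1": 3, "sdk-b/scenario-1": 2, "sdk-c/scenario-2": 0}
for instance, c in passes.items():
    print(instance, round(pass_at_k(3, c, 1), 3), round(pass_at_k(3, c, 3), 3))
```

With only 3 repeats, pass@1 can take just four values (0, 1/3, 2/3, 1), so if the 90% threshold is applied to pass@1 it effectively requires all three repeats to pass.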

Configuration

  • Repeats: 3 (confirmed)
  • Model: Claude Opus 4.8 (highest available)
  • Threshold: 90% (tentative)

Companion benchmark repo

The Harbor tasks and verifier tests live in a separate benchmark registration repo (`cosmos-sdk-skills-bench`). This PR only adds the CI trigger and evaluation logic to `cosmosdb-agent-kit`.

Tentative decisions (feedback welcome)

  • 90% pass threshold

Prerequisites

  • Azure OIDC federated credential configured for this repo (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID secrets)
  • MSBench CLI access via Azure Artifacts feed

Adds a GitHub Actions workflow and supporting scripts to evaluate
the cosmosdb-sdk skill set using MSBench. The pipeline:

- Runs 15 Harbor task instances (3 scenarios × 5 SDKs) with 3 repeats
- Uses pass@k metrics to measure skill reliability
- Creates GitHub issues mapping failures to specific sdk-* rules

New files:
- .github/workflows/msbench-eval.yaml (workflow_dispatch trigger)
- scripts/msbench-eval.py (orchestration: submit, poll, report)
- scripts/create-skills-issue.py (failure → rule mapping → issue)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 25, 2026 21:40
@yumnahussain yumnahussain marked this pull request as draft June 25, 2026 21:41

Copilot AI left a comment

Pull request overview

This PR adds a manually-triggered GitHub Actions pipeline plus helper scripts to run MSBench-based evaluations for Cosmos DB SDK skill guidance, generate a merged pass@k report, and (optionally) open a tracking issue when results fall below a threshold.

Changes:

  • Adds a workflow_dispatch GitHub Actions workflow to authenticate via Azure OIDC, install msbench-cli, run the evaluation, and upload results.json.
  • Adds scripts/msbench-eval.py to orchestrate MSBench runs, poll for completion, generate a merged report, and evaluate per-instance pass rates.
  • Adds scripts/create-skills-issue.py to parse results, map failing tests to sdk-* rules, and create a GitHub issue via gh issue create.
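One way the test-to-rule mapping could look is sketched below. The test class names and rule IDs are placeholders invented for illustration; the real mapping lives in scripts/create-skills-issue.py and is not shown in this conversation.

```python
from collections import defaultdict

# Hypothetical mapping from verifier test class to sdk-* rule IDs.
# Both the class names and the rule IDs are placeholders.
TEST_CLASS_TO_RULES = {
    "TestExampleA": ["sdk-example-rule-a"],
    "TestExampleB": ["sdk-example-rule-b", "sdk-example-rule-c"],
}


def rules_for_failures(failed_classes):
    """Group failing test classes under the rules they exercise.

    Returns (rule -> failing classes, classes with no known rule).
    """
    by_rule = defaultdict(list)
    unmapped = []
    for cls in failed_classes:
        rules = TEST_CLASS_TO_RULES.get(cls)
        if not rules:
            unmapped.append(cls)
            continue
        for rule in rules:
            by_rule[rule].append(cls)
    return dict(by_rule), unmapped


by_rule, unmapped = rules_for_failures(["TestExampleB", "TestUnknown"])
print(by_rule, unmapped)
```

Returning the unmapped classes separately keeps failures visible in the issue body even when no rule is registered for them.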

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| .github/workflows/msbench-eval.yaml | New manual CI workflow to run MSBench evaluation, optionally create an issue, and upload artifacts. |
| scripts/msbench-eval.py | New orchestrator script for running MSBench, merging reports, parsing pass rates, and flagging failing instances. |
| scripts/create-skills-issue.py | New issue-creation script to summarize failures and map them to rule files. |

Comment thread .github/workflows/msbench-eval.yaml Outdated
Comment on lines +52 to +56
      - name: Configure pip for Azure Artifacts
        run: |
          ACCESS_TOKEN=$(az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
          pip config set global.extra-index-url "https://build:${ACCESS_TOKEN}@pkgs.dev.azure.com/devdiv/_packaging/MicrosoftSweBench/pypi/simple/"
          pip config set global.trusted-host "pkgs.dev.azure.com"
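The step above writes the access token into pip's global config, where it persists on disk. A minimal sketch of the alternative, assuming the install is driven from Python: mask the token with the Actions `::add-mask::` workflow command and pass the feed only through the `PIP_EXTRA_INDEX_URL` environment variable of the pip subprocess. The helper name `pip_env_with_token` is hypothetical.

```python
import os


def pip_env_with_token(token: str, feed_url: str, base_env=None) -> dict:
    """Build an environment for a pip subprocess without touching pip config.

    Hypothetical helper: the token travels only in PIP_EXTRA_INDEX_URL.
    """
    prefix = "https://"
    if not feed_url.startswith(prefix):
        raise ValueError("feed_url must be an https:// URL")
    # Ask the Actions runner to mask the token in subsequent log output.
    print(f"::add-mask::{token}")
    env = dict(os.environ if base_env is None else base_env)
    env["PIP_EXTRA_INDEX_URL"] = f"{prefix}build:{token}@{feed_url[len(prefix):]}"
    return env


# Usage sketch (not executed here; the package name is assumed from the PR text):
#   env = pip_env_with_token(token, FEED_URL)
#   subprocess.run([sys.executable, "-m", "pip", "install", "msbench-cli"],
#                  env=env, check=True)
```

Because the variable exists only in the child process environment, nothing needs to be cleaned up after the step.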
Comment thread scripts/msbench-eval.py
Comment on lines +432 to +439
    failing = [item for item in results if item["pass_rate"] < args.threshold]
    if failing:
        print(
            f"\n{len(failing)} instance(s) fell below the {args.threshold * 100:.1f}% threshold.",
            file=sys.stderr,
        )
        invoke_issue_creator(output_path, dry_run=False)
        return 1
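Since the workflow already has its own issue-creation step, the threshold check can simply report and return an exit code. A minimal sketch, assuming each result is a dict with a `pass_rate` fraction; the `instance_id` key is a hypothetical name for the instance label.

```python
import sys


def check_threshold(results, threshold):
    """Return 1 if any instance is below threshold, else 0.

    Only reports to stderr; issue creation is left to the workflow.
    """
    failing = [item for item in results if item["pass_rate"] < threshold]
    if not failing:
        return 0
    print(
        f"{len(failing)} instance(s) fell below the {threshold * 100:.1f}% threshold.",
        file=sys.stderr,
    )
    for item in failing:
        # "instance_id" is an assumed key; fall back if it is absent.
        name = item.get("instance_id", "<unknown>")
        print(f"  {name}: {item['pass_rate']:.1%}", file=sys.stderr)
    return 1
```

A caller would typically end with `sys.exit(check_threshold(results, args.threshold))` so the job fails and the follow-up step can react.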
Comment thread scripts/msbench-eval.py
Comment on lines +254 to +261
    if isinstance(value, str):
        stripped = value.strip().rstrip("%")
        try:
            parsed = float(stripped)
        except ValueError:
            return None
        return normalize_rate(parsed if not value.strip().endswith("%") else parsed)
        return None
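In the snippet above both branches of the conditional pass the same `parsed` value, so a string such as '0.9%' is treated like 0.9. A possible fix is sketched below: an explicit percent sign always divides by 100. The body of `normalize_rate` is not shown in this thread, so the version here (values above 1 are read as percentages) is an assumption, as is the function name `parse_rate`.

```python
def normalize_rate(rate: float) -> float:
    # Assumed behaviour: bare numbers above 1 are percentages.
    return rate / 100.0 if rate > 1.0 else rate


def parse_rate(value):
    """Parse a pass rate into a fraction, or return None if unparseable."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return normalize_rate(float(value))
    if isinstance(value, str):
        text = value.strip()
        is_percent = text.endswith("%")
        try:
            parsed = float(text.rstrip("%").strip())
        except ValueError:
            return None
        # An explicit percent sign is unambiguous, so always divide.
        return parsed / 100.0 if is_percent else normalize_rate(parsed)
    return None
```

With this shape, '0.9%' parses to 0.009 while '0.9' and '90%' both parse to 0.9.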
Comment thread scripts/create-skills-issue.py
Comment on lines +565 to +589
def create_issue(repo: str, title: str, body: str) -> None:
    command = [
        "gh",
        "issue",
        "create",
        "--repo",
        repo,
        "--title",
        title,
        "--label",
        "msbench-eval",
        "--label",
        "skills-gap",
        "--body-file",
        "-",
    ]
    completed = subprocess.run(
        command,
        check=True,
        text=True,
        capture_output=True,
        input=body,
    )
    stdout = completed.stdout.strip()
    if stdout:
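`gh issue create` fails when a requested label does not exist in the repository, so hard-coded labels make the step fragile. The sketch below shows one possible fallback: try with labels, then retry without them. The retry policy and the injectable `run` parameter (used here so the function can be exercised without the `gh` binary) are assumptions, not the PR's actual code.

```python
import subprocess


def create_issue(repo, title, body, labels=(), run=subprocess.run):
    """Create an issue via `gh issue create`; retry unlabeled on failure."""
    base = ["gh", "issue", "create", "--repo", repo,
            "--title", title, "--body-file", "-"]
    with_labels = list(base)
    for label in labels:
        with_labels += ["--label", label]
    kwargs = dict(check=True, text=True, capture_output=True, input=body)
    try:
        completed = run(with_labels, **kwargs)
    except subprocess.CalledProcessError:
        if not labels:
            raise
        # Assumed fallback: the labels may be missing, so retry without them.
        completed = run(base, **kwargs)
    # gh prints the new issue URL on stdout.
    return completed.stdout.strip()
```

A stricter variant would inspect `stderr` and fall back only when the error mentions a label, so that authentication failures are not retried.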
Comment thread .github/workflows/msbench-eval.yaml Outdated
Comment on lines +75 to +84
      # If the evaluation step fails (including threshold failures), create a GitHub issue
      # summarizing the results so the regression can be tracked.
      - name: Create issue if threshold not met
        if: failure()
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python scripts/create-skills-issue.py \
            --results-file results.json \
            --repo "${{ github.repository }}"
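Because `failure()` is also true when an earlier step such as login or install fails, results.json may not exist when this step runs. A minimal sketch of a guard on the script side, assuming results.json is a JSON document; the helper name `load_results_or_none` is hypothetical.

```python
import json
from pathlib import Path


def load_results_or_none(path):
    """Return parsed results, or None if the file is missing or not valid JSON."""
    results_path = Path(path)
    if not results_path.is_file():
        return None
    try:
        return json.loads(results_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None


results = load_results_or_none("results.json")
if results is None:
    print("No usable results.json; skipping issue creation.")
```

Skipping quietly in this case keeps an infrastructure failure from being reported as a skills regression.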
yumnahussain and others added 2 commits June 25, 2026 14:45
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use masked env var for pip token instead of persisting in pip config
- Guard issue creation step on results.json existence
- Fix normalize_rate to correctly handle '0.9%' as 0.009 not 0.9
- Remove duplicate issue creation from msbench-eval.py (workflow handles it)
- Make labels optional in create-skills-issue.py with fallback

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>