Add MSBench CI pipeline for SDK skills evaluation by yumnahussain · Pull Request #222 · AzureCosmosDB/cosmosdb-agent-kit

yumnahussain · 2026-06-25T21:40:55Z

Summary

Adds a GitHub Actions workflow and supporting scripts to evaluate the `cosmosdb-sdk` skill set using MSBench. The pipeline measures whether the skills help an agent produce code that follows Cosmos DB SDK best practices across 5 SDKs.

What's included

File	Purpose
`.github/workflows/msbench-eval.yaml`	Workflow (`workflow_dispatch`) with Azure OIDC login and msbench-cli install
`scripts/msbench-eval.py`	Orchestration: submits `msbench-cli run --repeat 3`, polls, reports
`scripts/create-skills-issue.py`	Maps failing test classes to `sdk-*` rules, creates GitHub issue
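
The failure-to-rule mapping in the last row can be pictured as a dictionary lookup with a fallback bucket. This is a minimal sketch only: the test class names and the `sdk-*` rule names are hypothetical placeholders, and the actual mapping in `scripts/create-skills-issue.py` is not shown on this page.

```python
from typing import Dict, Iterable, List

# Bucket for failing tests that have no known rule (name is an assumption).
UNMAPPED = "unmapped"


def map_failures_to_rules(
    failing_tests: Iterable[str], rule_map: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group failing test class names by the rule they are mapped to."""
    grouped: Dict[str, List[str]] = {}
    for test_name in failing_tests:
        rule = rule_map.get(test_name, UNMAPPED)
        grouped.setdefault(rule, []).append(test_name)
    return grouped


# Hypothetical mapping, for illustration only.
EXAMPLE_RULE_MAP = {
    "TestExampleA": "sdk-example-rule-a",
    "TestExampleB": "sdk-example-rule-b",
}
```

Keeping unmapped failures in their own bucket means a new verifier test without a rule entry still appears in the issue instead of being dropped.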

How it works

Manually trigger via Actions → MSBench SDK Skills Evaluation
Runs 15 Harbor task instances (3 scenarios × 5 SDKs) with 3 independent repeats
Computes pass@k metrics per instance
If any instance falls below threshold (default 90%), creates a GitHub issue mapping failures to specific rules with suggested skill improvements
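
The PR does not say how pass@k is computed or which k is used. As a point of reference, the sketch below uses the commonly cited unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of repeats and c the number of passing repeats. The function names, the instance names, and the choice of k = 1 are assumptions for illustration, not MSBench behavior.

```python
from math import comb
from typing import Dict, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts, drawn without
    replacement from n attempts of which c passed, is a pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def below_threshold(
    passes_by_instance: Dict[str, int],
    repeats: int = 3,
    threshold: float = 0.9,
    k: int = 1,
) -> List[str]:
    """Return the instances whose pass@k estimate is under the threshold."""
    return sorted(
        name
        for name, passed in passes_by_instance.items()
        if pass_at_k(repeats, passed, k) < threshold
    )
```

Under these assumptions, with 3 repeats and k = 1 a single failed repeat gives 2/3, which is already below a 90% threshold, so only instances that pass all 3 repeats would clear it.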

Configuration

Repeats: 3 (confirmed)
Model: Claude Opus 4.8 (highest available)
Threshold: 90% (tentative)

Companion benchmark repo

The Harbor tasks and verifier tests live in a separate benchmark registration repo (`cosmos-sdk-skills-bench`). This PR only adds the CI trigger and evaluation logic to `cosmosdb-agent-kit`.

Tentative decisions (feedback welcome)

90% pass threshold

Prerequisites

Azure OIDC federated credential configured for this repo (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID secrets)
MSBench CLI access via Azure Artifacts feed

Adds a GitHub Actions workflow and supporting scripts to evaluate the cosmosdb-sdk skill set using MSBench.

The pipeline:
- Runs 15 Harbor task instances (3 scenarios × 5 SDKs) with 3 repeats
- Uses pass@k metrics to measure skill reliability
- Creates GitHub issues mapping failures to specific sdk-* rules

New files:
- .github/workflows/msbench-eval.yaml (workflow_dispatch trigger)
- scripts/msbench-eval.py (orchestration: submit, poll, report)
- scripts/create-skills-issue.py (failure → rule mapping → issue)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR adds a manually-triggered GitHub Actions pipeline plus helper scripts to run MSBench-based evaluations for Cosmos DB SDK skill guidance, generate a merged pass@k report, and (optionally) open a tracking issue when results fall below a threshold.

Changes:

Adds a workflow_dispatch GitHub Actions workflow to authenticate via Azure OIDC, install msbench-cli, run the evaluation, and upload results.json.
Adds scripts/msbench-eval.py to orchestrate MSBench runs, poll for completion, generate a merged report, and evaluate per-instance pass rates.
Adds scripts/create-skills-issue.py to parse results, map failing tests to sdk-* rules, and create a GitHub issue via gh issue create.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
`.github/workflows/msbench-eval.yaml`	New manual CI workflow to run MSBench evaluation, optionally create an issue, and upload artifacts.
`scripts/msbench-eval.py`	New orchestrator script for running MSBench, merging reports, parsing pass rates, and flagging failing instances.
`scripts/create-skills-issue.py`	New issue-creation script to summarize failures and map them to rule files.

+      - name: Configure pip for Azure Artifacts
+        run: |
+          ACCESS_TOKEN=$(az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
+          pip config set global.extra-index-url "https://build:${ACCESS_TOKEN}@pkgs.dev.azure.com/devdiv/_packaging/MicrosoftSweBench/pypi/simple/"
+          pip config set global.trusted-host "pkgs.dev.azure.com"
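
The hunk above writes the access token into pip's persistent configuration. One alternative is to pass the index URL to a single pip process through the `PIP_EXTRA_INDEX_URL` environment variable and to register the token with the GitHub Actions `::add-mask::` workflow command. The sketch below assumes that approach; the helper names are hypothetical and it is not necessarily what the follow-up commit does.

```python
import os
from typing import Dict, Optional

# Feed URL taken from the hunk above; the token is spliced in per process.
FEED_URL = "https://build:{token}@pkgs.dev.azure.com/devdiv/_packaging/MicrosoftSweBench/pypi/simple/"


def mask_command(token: str) -> str:
    """Build the workflow command that asks the runner to redact the token in logs."""
    return f"::add-mask::{token}"


def pip_env_with_token(
    token: str, base_env: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment with the feed URL set for one pip call.

    Nothing is written to pip's config file, so the token does not outlive
    the process that receives this environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PIP_EXTRA_INDEX_URL"] = FEED_URL.format(token=token)
    return env


# Possible usage inside a step (the package name is assumed from the PR text):
#   print(mask_command(token))
#   subprocess.run([sys.executable, "-m", "pip", "install", "msbench-cli"],
#                  env=pip_env_with_token(token), check=True)
```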


+    failing = [item for item in results if item["pass_rate"] < args.threshold]
+    if failing:
+        print(
+            f"\n{len(failing)} instance(s) fell below the {args.threshold * 100:.1f}% threshold.",
+            file=sys.stderr,
+        )
+        invoke_issue_creator(output_path, dry_run=False)
+        return 1
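
In this hunk the script creates the issue itself and then exits non-zero, while the workflow also creates an issue on `failure()`, so one failing run could open two issues. A minimal sketch of the single-responsibility variant is below: the script only reports and returns an exit code, and issue creation is left to the workflow step. The function names are hypothetical; only the `pass_rate` key and the message text come from the hunk.

```python
import sys
from typing import Dict, List


def failing_instances(results: List[Dict], threshold: float) -> List[Dict]:
    """Select the result items whose pass rate is under the threshold."""
    return [item for item in results if item["pass_rate"] < threshold]


def report_and_exit_code(results: List[Dict], threshold: float) -> int:
    """Print a summary to stderr and return 1 on a threshold failure, else 0.

    No issue is created here; a non-zero exit lets a later workflow step
    decide whether to open one.
    """
    failing = failing_instances(results, threshold)
    if failing:
        print(
            f"\n{len(failing)} instance(s) fell below the {threshold * 100:.1f}% threshold.",
            file=sys.stderr,
        )
        return 1
    return 0
```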


+    if isinstance(value, str):
+        stripped = value.strip().rstrip("%")
+        try:
+            parsed = float(stripped)
+        except ValueError:
+            return None
+        return normalize_rate(parsed if not value.strip().endswith("%") else parsed)
+    return None
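
In the hunk above both branches of the conditional pass `parsed` through unchanged, so the `%` suffix has no effect on the result. A sketch of a parser that keeps the distinction is below. It assumes that bare numbers are already fractions in [0, 1] and that anything outside that range is rejected; the real `normalize_rate` is not shown here and may treat bare values differently.

```python
from typing import Optional


def parse_rate(value: object) -> Optional[float]:
    """Parse a pass rate into a fraction in [0, 1], or return None.

    Strings ending in '%' are divided by 100, so '0.9%' becomes 0.009.
    """
    if isinstance(value, bool):
        # bool is a subclass of int; treat it as not a rate.
        return None
    if isinstance(value, (int, float)):
        rate = float(value)
    elif isinstance(value, str):
        text = value.strip()
        is_percent = text.endswith("%")
        try:
            number = float(text.rstrip("%").strip())
        except ValueError:
            return None
        rate = number / 100.0 if is_percent else number
    else:
        return None
    # Also rejects NaN and infinities, because the comparison is False for them.
    return rate if 0.0 <= rate <= 1.0 else None
```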


+def create_issue(repo: str, title: str, body: str) -> None:
+    command = [
+        "gh",
+        "issue",
+        "create",
+        "--repo",
+        repo,
+        "--title",
+        title,
+        "--label",
+        "msbench-eval",
+        "--label",
+        "skills-gap",
+        "--body-file",
+        "-",
+    ]
+    completed = subprocess.run(
+        command,
+        check=True,
+        text=True,
+        capture_output=True,
+        input=body,
+    )
+    stdout = completed.stdout.strip()
+    if stdout:
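
`gh issue create` fails when a requested label does not exist in the target repository, so hard-coded labels can block issue creation entirely. One way to make labels optional is to retry without them, sketched below. The retry-on-any-failure policy and the injectable `run` parameter are assumptions made for illustration and testing, not the PR's implementation.

```python
import subprocess
from typing import Callable, List, Sequence


def build_command(repo: str, title: str, labels: Sequence[str]) -> List[str]:
    """Assemble the gh invocation; the issue body is read from stdin."""
    command = ["gh", "issue", "create", "--repo", repo, "--title", title, "--body-file", "-"]
    for label in labels:
        command += ["--label", label]
    return command


def create_issue(
    repo: str,
    title: str,
    body: str,
    labels: Sequence[str] = ("msbench-eval", "skills-gap"),
    run: Callable = subprocess.run,
) -> str:
    """Create the issue with labels, falling back to no labels on failure."""
    options = dict(check=True, text=True, capture_output=True, input=body)
    try:
        completed = run(build_command(repo, title, labels), **options)
    except subprocess.CalledProcessError:
        if not labels:
            raise
        # Coarse fallback: any failure triggers one retry without labels.
        completed = run(build_command(repo, title, ()), **options)
    return completed.stdout.strip()
```

A stricter version could inspect the captured stderr and retry only when the failure mentions a missing label.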


+      # If the evaluation step fails (including threshold failures), create a GitHub issue
+      # summarizing the results so the regression can be tracked.
+      - name: Create issue if threshold not met
+        if: failure()
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          python scripts/create-skills-issue.py \
+            --results-file results.json \
+            --repo "${{ github.repository }}"


Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Use masked env var for pip token instead of persisting in pip config
- Guard issue creation step on results.json existence
- Fix normalize_rate to correctly handle '0.9%' as 0.009 not 0.9
- Remove duplicate issue creation from msbench-eval.py (workflow handles it)
- Make labels optional in create-skills-issue.py with fallback

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 25, 2026 21:40

yumnahussain requested review from TheovanKraay, jaydestro and sajeetharan as code owners June 25, 2026 21:40

Copilot started reviewing on behalf of yumnahussain June 25, 2026 21:41

yumnahussain marked this pull request as draft June 25, 2026 21:41

Copilot AI reviewed Jun 25, 2026

yumnahussain and others added 2 commits June 25, 2026 14:45

Confirm 3 repeats, default model to Claude Opus 4.8

456e4bb

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

yumnahussain wants to merge 3 commits into AzureCosmosDB:main from yumnahussain:yumnahussain/ideal-succotash
