Add MSBench CI pipeline for SDK skills evaluation #222
Draft
yumnahussain wants to merge 3 commits into
Adds a GitHub Actions workflow and supporting scripts to evaluate the cosmosdb-sdk skill set using MSBench.

The pipeline:
- Runs 15 Harbor task instances (3 scenarios × 5 SDKs) with 3 repeats
- Uses pass@k metrics to measure skill reliability
- Creates GitHub issues mapping failures to specific sdk-* rules

New files:
- .github/workflows/msbench-eval.yaml (workflow_dispatch trigger)
- scripts/msbench-eval.py (orchestration: submit, poll, report)
- scripts/create-skills-issue.py (failure → rule mapping → issue)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
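For readers unfamiliar with pass@k, here is a minimal sketch of the commonly used unbiased estimator. The PR does not say how MSBench computes the metric, so the function name `pass_at_k` and the choice of estimator are assumptions for illustration, not the pipeline's actual code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task instance.

    n: total attempts (e.g. the 3 repeats per instance)
    c: attempts that passed
    k: number of attempts sampled without replacement

    Returns the probability that at least one of the k sampled attempts passed.
    Hypothetical helper; MSBench may compute the metric differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any sample of k contains a pass.
        return 1.0
    # 1 minus the probability that all k sampled attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 repeats, an instance that passes once gives pass@1 = 1/3, while pass@3 is 1.0 because any draw of all three attempts includes the pass.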
Pull request overview
This PR adds a manually-triggered GitHub Actions pipeline plus helper scripts to run MSBench-based evaluations for Cosmos DB SDK skill guidance, generate a merged pass@k report, and (optionally) open a tracking issue when results fall below a threshold.
Changes:
- Adds a `workflow_dispatch` GitHub Actions workflow to authenticate via Azure OIDC, install `msbench-cli`, run the evaluation, and upload `results.json`.
- Adds `scripts/msbench-eval.py` to orchestrate MSBench runs, poll for completion, generate a merged report, and evaluate per-instance pass rates.
- Adds `scripts/create-skills-issue.py` to parse results, map failing tests to sdk-* rules, and create a GitHub issue via `gh issue create`.
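The threshold check described in the overview can be sketched roughly as follows. The `pass_rate` key and the `<` comparison appear in the reviewed snippet from `scripts/msbench-eval.py`; the `instance_id` key and both function names are hypothetical and used only for illustration.

```python
import sys
from typing import Any


def find_failing(results: list[dict[str, Any]], threshold: float) -> list[dict[str, Any]]:
    """Return the instances whose pass rate is strictly below the threshold."""
    return [item for item in results if item["pass_rate"] < threshold]


def exit_code_for(results: list[dict[str, Any]], threshold: float) -> int:
    """Report failing instances on stderr and return a process exit code."""
    failing = find_failing(results, threshold)
    for item in failing:
        # "instance_id" is an assumed field name, not confirmed by the PR.
        print(
            f"{item['instance_id']}: {item['pass_rate'] * 100:.1f}% "
            f"is below the {threshold * 100:.1f}% threshold",
            file=sys.stderr,
        )
    return 1 if failing else 0
```

Returning a non-zero exit code is what lets a later workflow step guarded by `if: failure()` react to a threshold miss.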
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `.github/workflows/msbench-eval.yaml` | New manual CI workflow to run MSBench evaluation, optionally create an issue, and upload artifacts. |
| `scripts/msbench-eval.py` | New orchestrator script for running MSBench, merging reports, parsing pass rates, and flagging failing instances. |
| `scripts/create-skills-issue.py` | New issue-creation script to summarize failures and map them to rule files. |
Comment on lines +52 to +56

    - name: Configure pip for Azure Artifacts
      run: |
        ACCESS_TOKEN=$(az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
        pip config set global.extra-index-url "https://build:${ACCESS_TOKEN}@pkgs.dev.azure.com/devdiv/_packaging/MicrosoftSweBench/pypi/simple/"
        pip config set global.trusted-host "pkgs.dev.azure.com"

Comment on lines +432 to +439

    failing = [item for item in results if item["pass_rate"] < args.threshold]
    if failing:
        print(
            f"\n{len(failing)} instance(s) fell below the {args.threshold * 100:.1f}% threshold.",
            file=sys.stderr,
        )
        invoke_issue_creator(output_path, dry_run=False)
        return 1

Comment on lines +254 to +261

    if isinstance(value, str):
        stripped = value.strip().rstrip("%")
        try:
            parsed = float(stripped)
        except ValueError:
            return None
        return normalize_rate(parsed if not value.strip().endswith("%") else parsed)
    return None

Comment on lines +565 to +589

    def create_issue(repo: str, title: str, body: str) -> None:
        command = [
            "gh",
            "issue",
            "create",
            "--repo",
            repo,
            "--title",
            title,
            "--label",
            "msbench-eval",
            "--label",
            "skills-gap",
            "--body-file",
            "-",
        ]
        completed = subprocess.run(
            command,
            check=True,
            text=True,
            capture_output=True,
            input=body,
        )
        stdout = completed.stdout.strip()
        if stdout:

Comment on lines +75 to +84

    # If the evaluation step fails (including threshold failures), create a GitHub issue
    # summarizing the results so the regression can be tracked.
    - name: Create issue if threshold not met
      if: failure()
      env:
        GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      run: |
        python scripts/create-skills-issue.py \
          --results-file results.json \
          --repo "${{ github.repository }}"

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use masked env var for pip token instead of persisting in pip config
- Guard issue creation step on results.json existence
- Fix normalize_rate to correctly handle '0.9%' as 0.009 not 0.9
- Remove duplicate issue creation from msbench-eval.py (workflow handles it)
- Make labels optional in create-skills-issue.py with fallback

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
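The `normalize_rate` fix listed above concerns percent-suffixed strings. A minimal sketch of percent-aware parsing follows; `parse_rate` is a hypothetical name, and the rule that a bare number greater than 1 is a percentage is an assumption, since the real `normalize_rate` is not shown in this PR excerpt.

```python
from typing import Optional


def parse_rate(value: object) -> Optional[float]:
    """Convert a raw pass-rate value to a fraction, or None if unparseable.

    Hypothetical helper illustrating the '0.9%' -> 0.009 behaviour.
    """
    if isinstance(value, bool):
        # bool is a subclass of int; it is not a meaningful rate.
        return None
    if isinstance(value, (int, float)):
        number, is_percent = float(value), False
    elif isinstance(value, str):
        text = value.strip()
        is_percent = text.endswith("%")
        try:
            number = float(text.rstrip("%").strip())
        except ValueError:
            return None
    else:
        return None
    if is_percent:
        # An explicit percent sign always means "divide by 100".
        return number / 100.0
    # Assumption: a bare number above 1 is a percentage (90 means 90%).
    return number / 100.0 if number > 1.0 else number
```

Tracking the percent sign separately from the numeric value is what keeps `'0.9%'` from being mistaken for the fraction 0.9.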
Summary
Adds a GitHub Actions workflow and supporting scripts to evaluate the `cosmosdb-sdk` skill set using MSBench. The pipeline measures whether the skills help an agent produce Cosmos DB SDK best-practice-compliant code across 5 SDKs.
What's included
How it works
Configuration
Companion benchmark repo
The Harbor tasks and verifier tests live in a separate benchmark registration repo (`cosmos-sdk-skills-bench`). This PR only adds the CI trigger and evaluation logic to `cosmosdb-agent-kit`.
Tentative decisions (feedback welcome)
Prerequisites