Skip to content

test(onboard): add strict tool-call probe e2e#4541

Open
cv wants to merge 4 commits into
mainfrom
test/strict-tool-call-probe-e2e
Open

test(onboard): add strict tool-call probe e2e#4541
cv wants to merge 4 commits into
mainfrom
test/strict-tool-call-probe-e2e

Conversation

@cv
Copy link
Copy Markdown
Collaborator

@cv cv commented May 29, 2026

Summary

Adds a hermetic regression E2E for the strict Chat Completions tool-call validation path introduced around Local Ollama onboarding. The test uses a local OpenAI-compatible mock to assert the bounded payload shape, transient retry behavior, and fail-closed behavior when no structured tool call is returned.

Related Issue

Fixes #4537

Changes

  • Add test/e2e/test-strict-tool-call-probe.sh to exercise strict tool-call validation against a mock OpenAI-compatible endpoint.
  • Wire strict-tool-call-probe-e2e into .github/workflows/regression-e2e.yaml for workflow dispatch.
  • Document the thinking-model assumption near the strict probe max_tokens cap in src/lib/inference/onboard-probes.ts.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela cvillela@nvidia.com

Summary by CodeRabbit

  • Tests

    • Added an end-to-end test harness validating strict tool-call probe behavior: success, transient-error recovery, and plain-text fallback; captures logs and reports pass/fail.
  • Chores

    • Updated regression workflow to include and gate the strict probe job, run it in CI (no-sleep mode), and upload probe logs on failure.
  • Documentation

    • Clarified an inline comment about the strict tool-call probe token cap and its current scope.

Review Change Stack

@cv cv self-assigned this May 29, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0bff4170-9443-498f-9fa9-3d659e32c914

📥 Commits

Reviewing files that changed from the base of the PR and between e55d82f and 44b00b9.

📒 Files selected for processing (1)
  • src/lib/inference/onboard-probes.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/inference/onboard-probes.ts

📝 Walkthrough

Walkthrough

Adds a hermetic Bash+Node E2E that validates strict chat-completions tool-call payloads and retry/recovery behavior against a local mock server, and wires the test into the regression-e2e CI with selector gating and failure-only log upload.

Changes

Strict Tool-Call Probe E2E Coverage

Layer / File(s) Summary
E2E test harness and mock server
test/e2e/test-strict-tool-call-probe.sh, src/lib/inference/onboard-probes.ts
Bash harness builds the CLI and runs an embedded Node mock server in three modes (success, transient-502, plain-text). Records requests to JSONL and asserts outbound payload shape (sessions_send presence, tool_choice, max_tokens, stream, temperature), validates retry counts and recovery hook invocation, and updates the onboard-probes comment about Local Ollama max_tokens.
Regression E2E workflow integration
.github/workflows/regression-e2e.yaml
Adds strict-tool-call-probe-e2e to valid dispatch names, exposes strict_tool_call_probe from the selector, sets that output true/false based on requested jobs, gates the new job on that output, runs the test script with NEMOCLAW_TEST_NO_SLEEP=1, and configures failure-only artifact upload for logs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4532: Bounds the strict chat tool-call probe with max_tokens: 256 and stream: false, which this PR validates end-to-end.
  • NVIDIA/NemoClaw#4250: Related E2E changes around the Local Ollama/no-tools behavior that affect strict probe execution.

Suggested labels

CI/CD, E2E, NemoClaw CLI

Suggested reviewers

  • ericksoa

Poem

🐰 A tiny test hops out to see,
A mock, some retries, and payloads agree.
Logs tucked neatly, failures shown clear,
The probe now checks what we hold dear.
Hooray — the rabbit gives a joyful cheer!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'test(onboard): add strict tool-call probe e2e' directly and concisely describes the main change: adding an end-to-end test for the strict tool-call probe in onboarding.
Linked Issues check ✅ Passed The PR comprehensively addresses all coding requirements from issue #4537: hermetic E2E test added to test/e2e/, integrated into regression-e2e.yaml workflow, validates payload shape (tool_choice/max_tokens/stream), verifies success/retry behavior, and documents thinking-model assumption in onboard-probes.ts.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #4537 requirements: new E2E test script, workflow integration, and documentation comment. No extraneous modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/strict-tool-call-probe-e2e

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

E2E Advisor Recommendation

Required E2E: strict-tool-call-probe-e2e
Optional E2E: onboard-inference-smoke-e2e

Dispatch hint: strict-tool-call-probe-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • strict-tool-call-probe-e2e (low; hermetic mock endpoint, no GPU/Ollama infrastructure or external inference secret required): This PR adds the regression job and its test script for strict Chat Completions tool-call probing. Run it to verify the new workflow selection, CLI build, mock endpoint harness, bounded payload assertions, transient retry, and fail-closed behavior all work in CI.

Optional E2E

  • onboard-inference-smoke-e2e (low to medium; hermetic/CI local mock-style smoke without external model-router secret): Adjacent confidence check for onboard inference validation behavior. The runtime source change is comment-only, so this is not merge-blocking, but it covers the broader onboarding path that reports success only after a real chat completion.

New E2E recommendations

  • None.

Dispatch hint

  • Workflow: regression-e2e.yaml
  • jobs input: strict-tool-call-probe-e2e

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. No scenario E2E is recommended. The scenario workflows/catalog/runtime under test/e2e-scenario/ were not changed; the regression workflow and test/e2e script are non-scenario legacy E2E surfaces, and the src/lib/inference/onboard-probes.ts diff is comment-only with no behavioral effect on scenario E2E.

Optional scenario E2E

  • None.

Relevant changed files

  • None.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

PR Review Advisor

Findings: 1 needs attention, 2 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • E2E still bypasses the Local Ollama onboarding caller (test/e2e/test-strict-tool-call-probe.sh:93): Issue Add hermetic E2E coverage for strict chat-completions tool-call validation #4537 asks for PR-safe hermetic E2E coverage that validates strict Chat Completions tool-call behavior through the onboarding flow and says the test should "onboards against an OpenAI-compatible mock endpoint". This harness verifies the validation helper and strict payload shape, but it constructs validation helpers and calls `validateOpenAiLikeSelection` directly. It would still pass if the production Local Ollama branch in `src/lib/onboard.ts` stopped invoking validation, passed the wrong selected model or validation base URL, or threaded the wrong `allowToolsIncompatible` value into `buildOllamaProbeOptions`.
    • Recommendation: Exercise the production Local Ollama onboarding selection/caller path against the mock endpoint, or add a focused integration/contract test at the real caller boundary in `src/lib/onboard.ts` that proves the selected Local Ollama model and validation URL are passed through `localInference.buildOllamaProbeOptions(false)` into strict Chat Completions validation. Keep the existing mock assertions for payload shape, retry, and fail-closed behavior.
    • Evidence: `test/e2e/test-strict-tool-call-probe.sh` defines `validate(endpoint, recoveryCalls = [])`, creates `createInferenceSelectionValidationHelpers(...)`, and calls `helpers.validateOpenAiLikeSelection("Local Ollama", endpoint, "mock-tool-model", ..., strictOllamaProbeOptions())` directly. The production Local Ollama path remains in `src/lib/onboard.ts`, where validation is called after `prepareOllamaModel(...)`, `getLocalProviderValidationBaseUrl(provider)`, and `localInference.buildOllamaProbeOptions(allowToolsIncompatible)` are combined.

🔎 Worth checking

  • Source-of-truth review needed: test/e2e/test-strict-tool-call-probe.sh strict Local Ollama validation coverage: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `validate(endpoint, recoveryCalls = [])` calls `helpers.validateOpenAiLikeSelection(...)` directly with `strictOllamaProbeOptions()`, while the real Local Ollama caller remains in `src/lib/onboard.ts`.
  • Security regression guard stops at the helper boundary (test/e2e/test-strict-tool-call-probe.sh:93): The strict tool-call gate is a fail-closed validation path for Local Ollama onboarding, but the new E2E only tests the lower validation helper. That weakens the security regression guard and can create a false sense of coverage for the production onboarding path.
    • Recommendation: Add caller-boundary coverage for the real Local Ollama onboarding path so the security-relevant strict validation requirement cannot be bypassed by future caller changes.
    • Evidence: The test validates success, transient retry, and plain-text fail-closed behavior through `validateOpenAiLikeSelection`, while the real onboarding caller that decides whether strict validation is required remains outside the exercised path.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Source-of-truth review needed: test/e2e/test-strict-tool-call-probe.sh strict Local Ollama validation coverage: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `validate(endpoint, recoveryCalls = [])` calls `helpers.validateOpenAiLikeSelection(...)` directly with `strictOllamaProbeOptions()`, while the real Local Ollama caller remains in `src/lib/onboard.ts`.
  • E2E still bypasses the Local Ollama onboarding caller (test/e2e/test-strict-tool-call-probe.sh:93): Issue Add hermetic E2E coverage for strict chat-completions tool-call validation #4537 asks for PR-safe hermetic E2E coverage that validates strict Chat Completions tool-call behavior through the onboarding flow and says the test should "onboards against an OpenAI-compatible mock endpoint". This harness verifies the validation helper and strict payload shape, but it constructs validation helpers and calls `validateOpenAiLikeSelection` directly. It would still pass if the production Local Ollama branch in `src/lib/onboard.ts` stopped invoking validation, passed the wrong selected model or validation base URL, or threaded the wrong `allowToolsIncompatible` value into `buildOllamaProbeOptions`.
    • Recommendation: Exercise the production Local Ollama onboarding selection/caller path against the mock endpoint, or add a focused integration/contract test at the real caller boundary in `src/lib/onboard.ts` that proves the selected Local Ollama model and validation URL are passed through `localInference.buildOllamaProbeOptions(false)` into strict Chat Completions validation. Keep the existing mock assertions for payload shape, retry, and fail-closed behavior.
    • Evidence: `test/e2e/test-strict-tool-call-probe.sh` defines `validate(endpoint, recoveryCalls = [])`, creates `createInferenceSelectionValidationHelpers(...)`, and calls `helpers.validateOpenAiLikeSelection("Local Ollama", endpoint, "mock-tool-model", ..., strictOllamaProbeOptions())` directly. The production Local Ollama path remains in `src/lib/onboard.ts`, where validation is called after `prepareOllamaModel(...)`, `getLocalProviderValidationBaseUrl(provider)`, and `localInference.buildOllamaProbeOptions(allowToolsIncompatible)` are combined.
  • Security regression guard stops at the helper boundary (test/e2e/test-strict-tool-call-probe.sh:93): The strict tool-call gate is a fail-closed validation path for Local Ollama onboarding, but the new E2E only tests the lower validation helper. That weakens the security regression guard and can create a false sense of coverage for the production onboarding path.
    • Recommendation: Add caller-boundary coverage for the real Local Ollama onboarding path so the security-relevant strict validation requirement cannot be bypassed by future caller changes.
    • Evidence: The test validates success, transient retry, and plain-text fail-closed behavior through `validateOpenAiLikeSelection`, while the real onboarding caller that decides whether strict validation is required remains outside the exercised path.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/regression-e2e.yaml:
- Around line 270-276: Update the GitHub Actions E2E job to harden the Checkout
and Node setup steps: in the checkout step (the actions/checkout usage) add
persist-credentials: false to prevent credential persistence, and replace the
floating actions/setup-node@v6 reference with a pinned full commit SHA for
setup-node to avoid unexpected changes; locate the lines using the
actions/checkout and actions/setup-node symbols and apply those two changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3a6bca40-67ad-44ed-a275-361bc2ad75e4

📥 Commits

Reviewing files that changed from the base of the PR and between 6fb96bd and 6fd193f.

📒 Files selected for processing (3)
  • .github/workflows/regression-e2e.yaml
  • src/lib/inference/onboard-probes.ts
  • test/e2e/test-strict-tool-call-probe.sh

Comment thread .github/workflows/regression-e2e.yaml
@cv cv added the v0.0.55 Release target label May 29, 2026
@jyaunches jyaunches added R3 v0.0.56 Release target and removed v0.0.55 Release target R3 labels May 29, 2026
@wscurran wscurran added the E2E End-to-end testing — Brev infrastructure, test cases, nightly failures, and coverage gaps label May 29, 2026
@wscurran
Copy link
Copy Markdown
Contributor

@wscurran wscurran added the fix label May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

E2E End-to-end testing — Brev infrastructure, test cases, nightly failures, and coverage gaps fix v0.0.56 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add hermetic E2E coverage for strict chat-completions tool-call validation

3 participants