fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) by Chibionos · Pull Request #1692 · UiPath/uipath-python

Chibionos · 2026-05-29T07:20:57Z

Summary

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini), but worked with OpenAI/GPT.

Both eval mockers requested structured output via OpenAI-only response_format json_schema and parsed response.choices[0].message.content. On the normalized LLM Gateway, response_format structured output is only honored for OpenAI models; for Claude the content comes back empty/None, so json.loads(None) raised → wrapped as UiPathMockResponseGenerationError → AGENT_RUNTIME.UNEXPECTED_ERROR.
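The failure mode described above can be reproduced in isolation. This is a minimal sketch, not the mocker code: the `response` object is a hypothetical stand-in built with `SimpleNamespace`, and `parse_content` is an illustrative helper that only mimics the old "parse the message content as JSON" step.

```python
import json
from types import SimpleNamespace

# Hypothetical stand-in for a gateway response whose message content is None
# (the shape reported for Claude when response_format is not honored).
response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content=None, tool_calls=[]))]
)


def parse_content(content):
    """Parse content as JSON; on failure, return the exception type name."""
    try:
        return json.loads(content)
    except (TypeError, json.JSONDecodeError) as exc:
        return type(exc).__name__


# None is rejected before parsing starts; an empty string fails during parsing.
none_result = parse_content(response.choices[0].message.content)
empty_result = parse_content("")
print(none_result, empty_result)
```

Both the `None` and the empty-string case fail, which matches the "empty/None" wording in the summary.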

Fixes AE-1646 (customer: Sarasota Memorial Health Care System).

Root cause / regression

Regression from #1555, which started routing the agent's model into simulations. Before that, simulation always used a fixed OpenAI model (gpt_4_1_mini), so non-OpenAI providers were never exercised on this path — which is why Claude "worked before."

Fix

Switch both mockers to provider-agnostic function calling, mirroring llm_as_judge_evaluator (whose docstring already states function calling is the way to get structured output across OpenAI/Claude/Gemini):

Build a forced tool that wraps the output/input schema under a response property, force it via tool_choice=required, and read tool_calls[0].arguments["response"] (already a parsed dict).
Hoist nested $defs to the tool-parameters root so $refs from nested Pydantic models still resolve once the schema is wrapped.
The normalized gateway's chat_completions now accepts raw-dict tools (pass-through) so arbitrary nested schemas survive — the ToolDefinition converter only emits flat properties.
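The wrapping and hoisting steps listed above can be sketched as follows. This is an illustration under assumptions, not the contents of `_structured_output.py`: the function names, the tool name `submit_response`, the OpenAI-style `{"type": "function", ...}` dict layout, and the dict-shaped tool calls are all hypothetical.

```python
import copy
from typing import Any


def build_response_tool(schema: dict[str, Any], name: str = "submit_response") -> dict[str, Any]:
    """Wrap `schema` under a `response` property and hoist its $defs to the root.

    Refs such as "#/$defs/Operator" resolve from the root of the parameters
    document, so definitions left nested under properties.response would no
    longer be found once the schema is wrapped.
    """
    wrapped = copy.deepcopy(schema)  # leave the caller's schema untouched
    defs = wrapped.pop("$defs", None)
    parameters: dict[str, Any] = {
        "type": "object",
        "properties": {"response": wrapped},
        "required": ["response"],
    }
    if defs:
        parameters["$defs"] = defs
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": "Return the structured response.",
            "parameters": parameters,
        },
    }


def read_response(tool_calls: list[dict[str, Any]]) -> Any:
    """Return the already-parsed `response` argument of the first tool call."""
    if not tool_calls:
        raise ValueError("model did not return a tool call")
    return tool_calls[0]["arguments"]["response"]


schema = {
    "type": "object",
    "properties": {"value": {"$ref": "#/$defs/Operator"}},
    "$defs": {"Operator": {"type": "string", "enum": ["+", "-"]}},
}
tool = build_response_tool(schema)
```

The request would then pass `tools=[tool]` together with `tool_choice=required`, as described in the list above.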

New shared helper eval/mocks/_structured_output.py keeps both mockers DRY.

Tests

test_llm_mockable_structured_output_via_tool_call — parametrized over gpt-4.1-mini, anthropic.claude-sonnet-4-5, gemini-2.5-pro; reproduces AE-1646 (content=None + tool_calls) and asserts the new contract.
test_build_response_tool_hoists_defs_to_root + helper error-branch unit tests.
test_raw_dict_tool_passthrough_mocked (platform) — asserts a nested array schema is forwarded byte-for-byte.
Existing mocker/input/span tests updated to the function-calling contract (behavior assertions preserved).
Full tests/cli/eval suite + platform mocked LLM tests green; ruff + mypy clean.

Note for reviewers

The OpenAI path also moves to function calling (no longer response_format), matching the judge. Worth a live check that a Claude/Bedrock agent simulating a nested-model tool output round-trips through the gateway, since nested $defs in tool parameters have no precedent in this repo (the judge only used flat schemas).

🤖 Generated with Claude Code

…models work

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini). The mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `choices[0].message.content`; for Claude that content is empty/None, so `json.loads(...)` raised.

Switch both mockers to provider-agnostic function calling (mirrors llm_as_judge_evaluator): build a forced tool that wraps the output/input schema under a `response` property, force it via tool_choice, and read `tool_calls[0].arguments["response"]` (already a parsed dict). Hoist nested `$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still resolve. The normalized LLM gateway now accepts raw-dict tools so arbitrary nested schemas survive (the ToolDefinition converter only emits flat properties).

Regression introduced by #1555, which started routing the agent's model into simulations; before that, simulation always used a fixed OpenAI model, so non-OpenAI providers were never exercised on this path.

Fixes AE-1646.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The mocker fix in uipath depends on the dict-tool passthrough in uipath-platform, so uipath's lower-bound pin is raised to 0.1.60. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ation

The normalized gateway accepts $ref/$defs in response_format but not inside a tool's parameters. Tool outputs typed as nested Pydantic models/enums (e.g. calculator's get_random_operator -> Wrapper[Operator]) produced a tool schema with $ref/$defs that the gateway rejected, so simulation failed. Inline the definitions into a self-contained schema (cyclic refs keep their $defs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
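One way to inline definitions while leaving cyclic references in place is sketched below. It is an assumption-laden illustration, not the PR's `_inline_defs`: the name `inline_defs`, the handling of sibling keywords next to `$ref`, and the choice to leave unknown refs untouched are all hypothetical.

```python
from typing import Any

_REF_PREFIX = "#/$defs/"


def inline_defs(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `schema` with local $refs replaced by their definitions.

    A ref that points back at a definition currently being expanded is left
    as a $ref, and that definition is kept under $defs in the result.
    """
    defs = schema.get("$defs", {})
    cyclic: set[str] = set()

    def resolve(node: Any, stack: tuple[str, ...]) -> Any:
        if isinstance(node, list):
            return [resolve(item, stack) for item in node]
        if not isinstance(node, dict):
            return node
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(_REF_PREFIX):
            name = ref[len(_REF_PREFIX):]
            if name not in defs:
                return dict(node)  # unknown ref: leave untouched
            if name in stack:
                cyclic.add(name)  # cycle: keep the ref, remember the definition
                return dict(node)
            # Keywords written next to $ref (e.g. description) are preserved.
            siblings = {
                k: resolve(v, stack) for k, v in node.items() if k not in ("$ref", "$defs")
            }
            return {**resolve(defs[name], stack + (name,)), **siblings}
        return {k: resolve(v, stack) for k, v in node.items() if k != "$defs"}

    result = resolve(schema, ())
    kept: dict[str, Any] = {}
    while cyclic - set(kept):
        name = (cyclic - set(kept)).pop()
        kept[name] = resolve(defs[name], (name,))
    if kept:
        result["$defs"] = kept
    return result


wrapper = {
    "type": "object",
    "properties": {"value": {"$ref": "#/$defs/Operator"}},
    "$defs": {"Operator": {"type": "string", "enum": ["+", "-"]}},
}
flat = inline_defs(wrapper)

tree = {
    "$ref": "#/$defs/Node",
    "$defs": {
        "Node": {
            "type": "object",
            "properties": {"children": {"type": "array", "items": {"$ref": "#/$defs/Node"}}},
        }
    },
}
recursive = inline_defs(tree)
```

In the `wrapper` case the result contains no `$ref` or `$defs` at all; in the `tree` case the self-reference survives and `Node` stays under `$defs`.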

github-actions · 2026-05-29T07:53:00Z

🚨 Heads up: `uipath-langchain` cross-tests are FAILING 🚨

Your changes may break the uipath-langchain-python integration.

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

🔍 Inspect the failed run →

All-tool-calling regressed OpenAI tool simulation (calculator-evals 'Test Random Addition Using LLM' became flaky: gpt_4_1_mini returned wrong/empty values for a nested-enum output schema via function calling, where response_format was reliable). Make structured-output generation adaptive: prefer response_format (honored reliably by OpenAI, native $defs support) and fall back to a forced tool call only when content comes back empty (the non-OpenAI failure mode, e.g. Claude/Bedrock). Shared in generate_structured_output(), used by both mockers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
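The adaptive strategy in this commit message can be sketched as below. This is not the PR's `generate_structured_output()`: the `call_llm` callable, its keyword arguments, the dict-shaped responses with `content` and `tool_calls` keys, and the tool name are hypothetical stand-ins for the gateway client.

```python
import json
from typing import Any, Callable


def generate_structured_output(
    call_llm: Callable[..., dict[str, Any]],
    messages: list[dict[str, Any]],
    schema: dict[str, Any],
) -> Any:
    """Prefer response_format; fall back to a forced tool call on empty content."""
    first = call_llm(
        messages=messages,
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "response", "schema": schema},
        },
    )
    content = first.get("content")
    if content:
        return json.loads(content)

    # Empty content is the non-OpenAI failure mode: retry with a forced tool.
    tool = {
        "type": "function",
        "function": {
            "name": "submit_response",  # hypothetical tool name
            "parameters": {
                "type": "object",
                "properties": {"response": schema},
                "required": ["response"],
            },
        },
    }
    second = call_llm(messages=messages, tools=[tool], tool_choice="required")
    tool_calls = second.get("tool_calls") or []
    if not tool_calls:
        raise ValueError("model returned neither content nor a tool call")
    return tool_calls[0]["arguments"]["response"]


def fake_openai(**kwargs: Any) -> dict[str, Any]:
    """Fake model that honors response_format."""
    return {"content": '{"op": "-"}', "tool_calls": []}


def fake_claude(**kwargs: Any) -> dict[str, Any]:
    """Fake model that ignores response_format but answers forced tool calls."""
    if "response_format" in kwargs:
        return {"content": None, "tool_calls": []}
    return {"content": None, "tool_calls": [{"arguments": {"response": {"op": "+"}}}]}


openai_result = generate_structured_output(fake_openai, [], {"type": "object"})
claude_result = generate_structured_output(fake_claude, [], {"type": "object"})
```

The trade-off in this design is one extra request for models that return empty content, in exchange for keeping the `response_format` path that the commit message reports as reliable for OpenAI.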

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-05-29T08:22:27Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Chibionos · 2026-05-29T08:28:48Z

Re: the uipath-langchain cross-test heads-up above — that was from an earlier commit. After the adaptive fix (prefer response_format, fall back to function calling only when content is empty), the cross-tests pass on the latest commit: langchain-cross / {alpha,cloud,staging} and test-uipath-langchain (3.11/3.12/3.13 × ubuntu/windows) are all green. The heads-up comment is stale and can be disregarded.

Full suite is green: all eval test-cases (calculator-evals, simulation-testcase, tools-evals, etc.) pass across alpha/cloud/staging; lint, SonarCloud, build all pass.

mjnovice · 2026-06-03T01:04:06Z

+logger = logging.getLogger(__name__)
+
+
+def _inline_defs(


Can we have separate classes for how we handle Anthropic, OpenAI, Gemini, etc.?

mjnovice

Minor comment about making generate_structured_output more modular.

Chibionos and others added 2 commits May 28, 2026 23:57

chore: bump uipath to 2.10.74 and uipath-platform to 0.1.60

ab33199

Chibionos mentioned this pull request May 29, 2026

fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) #1691

Closed

github-actions Bot added the test:uipath-langchain and test:uipath-integrations labels May 29, 2026

Chibionos force-pushed the fix/ae-1646-mocker-non-openai-models branch from 5e4bdb2 to ae78cbe May 29, 2026 08:06

test(eval): add explicit type params to _FakeLLM for mypy

b4954be

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mjnovice reviewed Jun 3, 2026

mjnovice approved these changes Jun 3, 2026

Chibionos wants to merge 5 commits into main from fix/ae-1646-mocker-non-openai-models
