fix(executor): add exponential backoff retry for transient LLM provider errors #3773
Rabba-Meghana wants to merge 5 commits into simstudioai:main from
Conversation
- Add `withRetry()` utility with exponential backoff and jitter
- Wrap `handler.execute()` in block-executor so all agent blocks retry on transient provider errors
- Respect `Retry-After` headers from OpenAI and Anthropic
- Never retry user errors (400, 401, 404), only transient failures
- Full test coverage: success path, 429 retry, maxRetries exceeded, non-retryable errors, Retry-After parsing
PR Summary (Cursor Bugbot): Medium Risk

Written by Cursor Bugbot for commit d6e45a2. This will update automatically on new commits.
Retries are intentionally scoped to transient provider/network errors only (429, 503, 529). These status codes indicate the request never reached or was not processed by the provider, making retries safe. Non-idempotent side effects from tool calls within agent blocks are not affected; those execute only after the provider call succeeds. Happy to add an `idempotent: false` opt-out flag to `RetryOptions` if the team prefers explicit control per block type.
Greptile Summary: This PR adds exponential backoff retry logic (`withRetry()`). Two meaningful behavioral issues need attention before merge:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant BE as BlockExecutor
    participant WR as withRetry
    participant H as BlockHandler
    participant LLM as LLM Provider
    BE->>WR: withRetry(() => handler.execute(...))
    loop Up to maxRetries (3)
        WR->>H: handler.execute(ctx, block, inputs)
        H->>LLM: API Request
        alt Success
            LLM-->>H: 200 OK
            H-->>WR: output
            WR-->>BE: output
        else 429 / 503 / 529 (transient)
            LLM-->>H: Error (status 429/503/529)
            H-->>WR: throw error
            WR->>WR: Check Retry-After header (429 only)
            WR->>WR: calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
            WR->>WR: await setTimeout(delayMs)
        else 400 / 401 / 404 (non-retryable)
            LLM-->>H: Error (status 4xx)
            H-->>WR: throw error
            WR-->>BE: rethrow immediately
        else No status (network / JS error)
            H-->>WR: throw error (no status)
            WR->>WR: treated as retryable (isRetryable = true)
            WR->>WR: await setTimeout(delayMs)
        end
    end
    WR-->>BE: throw lastError (maxRetries exceeded)
```
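The control flow in the diagram can be sketched as a minimal `withRetry` loop. This is a sketch of the diagrammed behavior, not the PR's actual code: the option names and defaults are assumptions, and statusless errors are treated as retryable here because that is what the diagram shows (the follow-up commit narrows this to known network errors).

```typescript
// Sketch of the retry loop from the sequence diagram; not the PR's exact code.
const retryableStatusCodes = [429, 503, 529]

function calculateBackoffDelay(attempt: number, initialDelayMs: number, maxDelayMs: number): number {
  // Exponential growth capped at maxDelayMs (jitter omitted for brevity).
  return Math.min(initialDelayMs * 2 ** attempt, maxDelayMs)
}

async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { maxRetries?: number; initialDelayMs?: number; maxDelayMs?: number } = {}
): Promise<T> {
  const { maxRetries = 3, initialDelayMs = 500, maxDelayMs = 10_000 } = opts
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      const status =
        (error as { status?: number; statusCode?: number })?.status ??
        (error as { statusCode?: number })?.statusCode ??
        null
      // Per the diagram: statusless errors are retryable; known 4xx are not.
      const isRetryable = status === null || retryableStatusCodes.includes(status)
      if (!isRetryable || attempt === maxRetries) throw error
      await new Promise((resolve) =>
        setTimeout(resolve, calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs))
      )
    }
  }
  throw lastError
}
```

A 429 that clears on a later attempt resolves normally; a 400 is rethrown after a single call.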
Reviews (1): Last reviewed commit: "fix(executor): add exponential backoff r..."
```typescript
const output = await withRetry(
  () => handler.executeWithNode
    ? handler.executeWithNode(ctx, block, resolvedInputs, nodeMetadata)
    : handler.execute(ctx, block, resolvedInputs)
)
```
Retry applied to all block types, not just LLM blocks
The withRetry wrapper is placed around every block's handler execution — not just AI/LLM blocks. This means any block that calls an external service (HTTP request blocks, webhook blocks, database blocks, email blocks, etc.) will also be retried up to 3 times on a 429 or 503 response.
For non-idempotent operations — e.g. sending an email, creating a database row, triggering a webhook — this can silently create 2–4 duplicate operations per execution. The PR description says "all agent blocks automatically retry," but the implementation actually retries all block types.
Consider either:
- Only retrying within the LLM/AI provider layer (not at the block executor level), or
- Checking the block type before applying retry logic:

```typescript
const isLLMBlock = /* check block.metadata?.id against known LLM block types */
const output = isLLMBlock
  ? await withRetry(() => handler.executeWithNode ? ... : ...)
  : await (handler.executeWithNode ? ... : ...)
```
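A fuller version of that gating could look like the sketch below. The block ids come from the follow-up commit's stated scope (agent, evaluator, router); the `block.metadata` shape is an assumption for illustration.

```typescript
// Sketch of block-type gating; ids taken from the follow-up commit's scope,
// the metadata shape is assumed, not taken from the actual codebase.
const LLM_BLOCK_TYPES = new Set(['agent', 'evaluator', 'router'])

function isLLMBlock(block: { metadata?: { id?: string } }): boolean {
  return LLM_BLOCK_TYPES.has(block.metadata?.id ?? '')
}
```

Non-LLM blocks (HTTP, email, webhook) then bypass `withRetry` entirely, avoiding duplicate side effects.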
Fixed in 70ce97f — addressed all review feedback: scoped retries to LLM blocks only, added isNetworkError() for statusless errors, added Retry-After support for 503, and added afterEach vi.useRealTimers() to tests.
apps/sim/executor/utils/retry.ts
Outdated
```typescript
const isRetryable =
  status === null ||
  retryableStatusCodes.includes(status)
```
All statusless errors trigger retries, including non-transient JS errors
When status === null (no .status or .statusCode on the thrown error), isRetryable is true. This catches genuine network errors — but it also retries any JavaScript runtime error, validation error, or unexpected exception that doesn't carry a status code.
For example, a TypeError, RangeError, or any internal error thrown by a block handler will be silently retried 3 times with up to 30 seconds of total delay before finally failing. This masks bugs and significantly degrades the user-facing error latency for logic errors.
Consider reversing the default: treat unknown errors as non-retryable unless the error is explicitly recognized as transient (e.g., it is a known network error type, or it matches a recognizable pattern):
```typescript
const isRetryable =
  status !== null
    ? retryableStatusCodes.includes(status)
    : isNetworkError(error) // helper that checks error.code, error.name, etc.
```
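The `isNetworkError` helper in the suggestion is not shown in the diff. One possible sketch, assuming Node-style error codes (the code list below is an assumption, seeded from the ECONNRESET/ETIMEDOUT examples in the follow-up commit message):

```typescript
// Hypothetical helper: treat only well-known transient Node network error
// codes as retryable; everything else without a status is a real bug.
const TRANSIENT_CODES = new Set([
  'ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'EAI_AGAIN', 'EPIPE',
])

function isNetworkError(error: unknown): boolean {
  const code = (error as { code?: string } | null)?.code
  if (code && TRANSIENT_CODES.has(code)) return true
  // fetch() failures surface as a TypeError whose `cause` carries the code.
  const cause = (error as { cause?: { code?: string } } | null)?.cause
  return Boolean(cause?.code && TRANSIENT_CODES.has(cause.code))
}
```

With this default, a `TypeError` or `RangeError` thrown by a handler fails immediately instead of burning three retries.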
Fixed in 70ce97f.
apps/sim/executor/utils/retry.ts
Outdated
```typescript
if (responseHeaders && status === 429) {
  const retryAfterMs = parseRetryAfterHeader(responseHeaders)
  delayMs = retryAfterMs ?? calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
} else {
  delayMs = calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
}
```
Retry-After header not respected for 503 responses
The Retry-After header is parsed and used only when status === 429. However, RFC 7231 §7.1.3 specifies that servers sending 503 Service Unavailable may also include a Retry-After header to signal when the service will recover. Both OpenAI and Anthropic can send 503s with Retry-After.
```typescript
// suggestion
if (responseHeaders && (status === 429 || status === 503)) {
  const retryAfterMs = parseRetryAfterHeader(responseHeaders)
  delayMs = retryAfterMs ?? calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
} else {
  delayMs = calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
}
```
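`parseRetryAfterHeader` itself is referenced but not shown in the diff. A sketch that handles both RFC 7231 forms of the header (delta-seconds and HTTP-date) might look like this; the signature and header-map shape are assumptions:

```typescript
// Hypothetical parser; the PR's actual implementation may differ.
// Returns a delay in milliseconds, or null if the header is absent/unparseable.
function parseRetryAfterHeader(headers: Record<string, string>): number | null {
  const raw = headers['retry-after'] ?? headers['Retry-After']
  if (!raw) return null
  // Form 1: delta-seconds, e.g. "Retry-After: 30"
  const seconds = Number(raw)
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000
  // Form 2: HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2026 07:28:00 GMT"
  const date = Date.parse(raw)
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now())
  return null
}
```

Returning `null` on garbage lets the caller fall back to `calculateBackoffDelay`, as the `??` in the snippet above does.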
Fixed in 70ce97f.
```typescript
  vi.useFakeTimers()
})

it('returns result immediately on success', async () => {
```
Missing afterEach to restore real timers
vi.useFakeTimers() is called in beforeEach but there is no corresponding vi.useRealTimers() in an afterEach. If Vitest runs these tests in the same worker as other test files that depend on real timers (e.g. setTimeout, Date.now()), the leaked fake timer state can cause subtle test failures elsewhere.
```typescript
afterEach(() => {
  vi.useRealTimers()
})
```
Fixed in 70ce97f.
- Scope retries to LLM blocks only (agent, evaluator, router) to avoid duplicate side effects on non-idempotent blocks like HTTP, email, webhook
- Treat statusless errors as non-retryable by default; only retry known network errors (ECONNRESET, ETIMEDOUT, etc.), not JS runtime errors
- Respect Retry-After header for 503 responses in addition to 429 (RFC 7231)
- Add afterEach vi.useRealTimers() to prevent timer leak between tests
Fixed in 70ce97f.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```typescript
const capped = Math.min(maxDelayMs, exponential)
const jitter = Math.random() * capped * 0.2
return Math.floor(Math.min(capped + jitter, maxDelayMs))
}
```
Jitter eliminated when backoff delay reaches max cap
Low Severity
In calculateBackoffDelay, capped is already clamped to maxDelayMs, but then jitter is added to capped and the result is clamped again by Math.min(..., maxDelayMs). When exponential >= maxDelayMs, the jitter is always clamped back to zero, producing a deterministic maxDelayMs with no randomization. This defeats the purpose of jitter (preventing thundering herd) once the exponential component reaches the cap. The jitter needs to be subtracted from (or applied within) capped rather than added on top of it.
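One way to keep randomization at the cap, as the comment suggests, is to apply the jitter inside the capped value rather than on top of it. A sketch (not the PR's final fix; the exponential term is reconstructed from the surrounding snippet):

```typescript
// Sketch: subtract up to 20% jitter from the capped delay so the result stays
// randomized even after the exponential term has saturated at maxDelayMs.
function calculateBackoffDelay(attempt: number, initialDelayMs: number, maxDelayMs: number): number {
  const exponential = initialDelayMs * 2 ** attempt
  const capped = Math.min(maxDelayMs, exponential)
  const jitter = Math.random() * capped * 0.2
  return Math.floor(capped - jitter)
}
```

At the cap this yields values spread across [0.8 * maxDelayMs, maxDelayMs] instead of a deterministic maxDelayMs, which is what breaks up a thundering herd.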


Problem
When LLM providers (OpenAI, Anthropic, Gemini) return a 429 rate limit or 503 service unavailable, the workflow execution crashes immediately with no retry attempt. This silently breaks production agent workflows for users.
Solution
- `withRetry()` utility with exponential backoff and jitter
- Wrap `handler.execute()` in block-executor so all agent blocks automatically retry on transient provider errors
- Respect `Retry-After` headers from OpenAI and Anthropic

Type of Change
Testing
- `retry.test.ts` covers 7 scenarios: immediate success, 429 retry, maxRetries exhausted, non-retryable 400/401, network error, Retry-After header
- Run with `bun test apps/sim/executor/utils/retry.test.ts`

Checklist