fix(executor): add exponential backoff retry for transient LLM provider errors #3773
Rabba-Meghana wants to merge 5 commits into simstudioai:main from
Conversation
- Add `withRetry()` utility with exponential backoff and jitter
- Wrap `handler.execute()` in block-executor so all agent blocks retry on transient provider errors
- Respect `Retry-After` headers from OpenAI and Anthropic
- Never retry user errors (400, 401, 404), only transient failures
- Full test coverage: success path, 429 retry, maxRetries exceeded, non-retryable errors, Retry-After parsing
PR Summary (Cursor Bugbot): Medium Risk

Written by Cursor Bugbot for commit d6e45a2. This will update automatically on new commits.
Retries are intentionally scoped to transient provider/network errors only (429, 503, 529). These status codes indicate the request never reached or was not processed by the provider, making retries safe. Non-idempotent side effects from tool calls within agent blocks are not affected; those execute only after the provider call succeeds. Happy to add an `idempotent: false` opt-out flag to `RetryOptions` if the team prefers explicit control per block type.
Greptile Summary: This PR adds exponential backoff retry logic (`withRetry()`). Two meaningful behavioral issues need attention before merge:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant BE as BlockExecutor
    participant WR as withRetry
    participant H as BlockHandler
    participant LLM as LLM Provider
    BE->>WR: withRetry(() => handler.execute(...))
    loop Up to maxRetries (3)
        WR->>H: handler.execute(ctx, block, inputs)
        H->>LLM: API Request
        alt Success
            LLM-->>H: 200 OK
            H-->>WR: output
            WR-->>BE: output
        else 429 / 503 / 529 (transient)
            LLM-->>H: Error (status 429/503/529)
            H-->>WR: throw error
            WR->>WR: Check Retry-After header (429 only)
            WR->>WR: calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
            WR->>WR: await setTimeout(delayMs)
        else 400 / 401 / 404 (non-retryable)
            LLM-->>H: Error (status 4xx)
            H-->>WR: throw error
            WR-->>BE: rethrow immediately
        else No status (network / JS error)
            H-->>WR: throw error (no status)
            WR->>WR: treated as retryable (isRetryable = true)
            WR->>WR: await setTimeout(delayMs)
        end
    end
    WR-->>BE: throw lastError (maxRetries exceeded)
```
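The control flow in the diagram can be sketched as a minimal `withRetry` loop. This is a sketch of the diagrammed behavior, not the PR's actual code: the option names and defaults are assumptions, and statusless errors are treated as retryable here because that is what the diagram shows (the follow-up commit narrows this to known network errors).

```typescript
// Sketch of the retry loop from the sequence diagram; not the PR's exact code.
const retryableStatusCodes = [429, 503, 529]

function calculateBackoffDelay(attempt: number, initialDelayMs: number, maxDelayMs: number): number {
  // Exponential growth capped at maxDelayMs (jitter omitted for brevity).
  return Math.min(initialDelayMs * 2 ** attempt, maxDelayMs)
}

async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { maxRetries?: number; initialDelayMs?: number; maxDelayMs?: number } = {}
): Promise<T> {
  const { maxRetries = 3, initialDelayMs = 500, maxDelayMs = 10_000 } = opts
  let lastError: unknown
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      const status =
        (error as { status?: number; statusCode?: number })?.status ??
        (error as { statusCode?: number })?.statusCode ??
        null
      // Per the diagram: statusless errors are retryable; known 4xx are not.
      const isRetryable = status === null || retryableStatusCodes.includes(status)
      if (!isRetryable || attempt === maxRetries) throw error
      await new Promise((resolve) =>
        setTimeout(resolve, calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs))
      )
    }
  }
  throw lastError
}
```

A 429 that clears on a later attempt resolves normally; a 400 is rethrown after a single call.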
Reviews (1): Last reviewed commit: "fix(executor): add exponential backoff r..."
```typescript
const output = await withRetry(
  () => handler.executeWithNode
    ? handler.executeWithNode(ctx, block, resolvedInputs, nodeMetadata)
    : handler.execute(ctx, block, resolvedInputs)
)
```
Retry applied to all block types, not just LLM blocks
The withRetry wrapper is placed around every block's handler execution — not just AI/LLM blocks. This means any block that calls an external service (HTTP request blocks, webhook blocks, database blocks, email blocks, etc.) will also be retried up to 3 times on a 429 or 503 response.
For non-idempotent operations — e.g. sending an email, creating a database row, triggering a webhook — this can silently create 2–4 duplicate operations per execution. The PR description says "all agent blocks automatically retry," but the implementation actually retries all block types.
Consider either:
- Only retrying within the LLM/AI provider layer (not at the block executor level), or
- Checking the block type before applying retry logic:

```typescript
const isLLMBlock = /* check block.metadata?.id against known LLM block types */
const output = isLLMBlock
  ? await withRetry(() => handler.executeWithNode ? ... : ...)
  : await (handler.executeWithNode ? ... : ...)
```
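A fuller version of that gating could look like the sketch below. The block ids come from the follow-up commit's stated scope (agent, evaluator, router); the `block.metadata` shape is an assumption for illustration.

```typescript
// Sketch of block-type gating; ids taken from the follow-up commit's scope,
// the metadata shape is assumed, not taken from the actual codebase.
const LLM_BLOCK_TYPES = new Set(['agent', 'evaluator', 'router'])

function isLLMBlock(block: { metadata?: { id?: string } }): boolean {
  return LLM_BLOCK_TYPES.has(block.metadata?.id ?? '')
}
```

Non-LLM blocks (HTTP, email, webhook) then bypass `withRetry` entirely, avoiding duplicate side effects.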
Fixed in 70ce97f — addressed all review feedback: scoped retries to LLM blocks only, added isNetworkError() for statusless errors, added Retry-After support for 503, and added afterEach vi.useRealTimers() to tests.
apps/sim/executor/utils/retry.ts
Outdated
```typescript
const isRetryable =
  status === null ||
  retryableStatusCodes.includes(status)
```
All statusless errors trigger retries, including non-transient JS errors
When status === null (no .status or .statusCode on the thrown error), isRetryable is true. This catches genuine network errors — but it also retries any JavaScript runtime error, validation error, or unexpected exception that doesn't carry a status code.
For example, a TypeError, RangeError, or any internal error thrown by a block handler will be silently retried 3 times with up to 30 seconds of total delay before finally failing. This masks bugs and significantly degrades the user-facing error latency for logic errors.
Consider reversing the default: treat unknown errors as non-retryable unless the error is explicitly recognized as transient (e.g., it is a known network error type, or it matches a recognizable pattern):
```typescript
const isRetryable =
  status !== null
    ? retryableStatusCodes.includes(status)
    : isNetworkError(error) // helper that checks error.code, error.name, etc.
```
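The `isNetworkError` helper in the suggestion is not shown in the diff. One possible sketch, assuming Node-style error codes (the code list below is an assumption, seeded from the ECONNRESET/ETIMEDOUT examples in the follow-up commit message):

```typescript
// Hypothetical helper: treat only well-known transient Node network error
// codes as retryable; everything else without a status is a real bug.
const TRANSIENT_CODES = new Set([
  'ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'EAI_AGAIN', 'EPIPE',
])

function isNetworkError(error: unknown): boolean {
  const code = (error as { code?: string } | null)?.code
  if (code && TRANSIENT_CODES.has(code)) return true
  // fetch() failures surface as a TypeError whose `cause` carries the code.
  const cause = (error as { cause?: { code?: string } } | null)?.cause
  return Boolean(cause?.code && TRANSIENT_CODES.has(cause.code))
}
```

With this default, a `TypeError` or `RangeError` thrown by a handler fails immediately instead of burning three retries.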
Fixed in 70ce97f.
apps/sim/executor/utils/retry.ts
Outdated
```typescript
if (responseHeaders && status === 429) {
  const retryAfterMs = parseRetryAfterHeader(responseHeaders)
  delayMs = retryAfterMs ?? calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
} else {
  delayMs = calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
}
```
Retry-After header not respected for 503 responses
The Retry-After header is parsed and used only when status === 429. However, RFC 7231 §7.1.3 specifies that servers sending 503 Service Unavailable may also include a Retry-After header to signal when the service will recover. Both OpenAI and Anthropic can send 503s with Retry-After.
```typescript
// suggestion
if (responseHeaders && (status === 429 || status === 503)) {
  const retryAfterMs = parseRetryAfterHeader(responseHeaders)
  delayMs = retryAfterMs ?? calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
} else {
  delayMs = calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
}
```
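`parseRetryAfterHeader` itself is referenced but not shown in the diff. A sketch that handles both RFC 7231 forms of the header (delta-seconds and HTTP-date) might look like this; the signature and header-map shape are assumptions:

```typescript
// Hypothetical parser; the PR's actual implementation may differ.
// Returns a delay in milliseconds, or null if the header is absent/unparseable.
function parseRetryAfterHeader(headers: Record<string, string>): number | null {
  const raw = headers['retry-after'] ?? headers['Retry-After']
  if (!raw) return null
  // Form 1: delta-seconds, e.g. "Retry-After: 30"
  const seconds = Number(raw)
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000
  // Form 2: HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2026 07:28:00 GMT"
  const date = Date.parse(raw)
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now())
  return null
}
```

Returning `null` on garbage lets the caller fall back to `calculateBackoffDelay`, as the `??` in the snippet above does.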
Fixed in 70ce97f.
```typescript
  vi.useFakeTimers()
})

it('returns result immediately on success', async () => {
```
Missing afterEach to restore real timers
vi.useFakeTimers() is called in beforeEach but there is no corresponding vi.useRealTimers() in an afterEach. If Vitest runs these tests in the same worker as other test files that depend on real timers (e.g. setTimeout, Date.now()), the leaked fake timer state can cause subtle test failures elsewhere.
```typescript
afterEach(() => {
  vi.useRealTimers()
})
```
Fixed in 70ce97f.
- Scope retries to LLM blocks only (agent, evaluator, router) to avoid duplicate side effects on non-idempotent blocks like HTTP, email, webhook
- Treat statusless errors as non-retryable by default; only retry known network errors (ECONNRESET, ETIMEDOUT, etc.), not JS runtime errors
- Respect Retry-After header for 503 responses in addition to 429 (RFC 7231)
- Add afterEach vi.useRealTimers() to prevent timer leak between tests
Fixed in 70ce97f.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```typescript
const capped = Math.min(maxDelayMs, exponential)
const jitter = Math.random() * capped * 0.2
return Math.floor(Math.min(capped + jitter, maxDelayMs))
}
```
Jitter eliminated when backoff delay reaches max cap
Low Severity
In calculateBackoffDelay, capped is already clamped to maxDelayMs, but then jitter is added to capped and the result is clamped again by Math.min(..., maxDelayMs). When exponential >= maxDelayMs, the jitter is always clamped back to zero, producing a deterministic maxDelayMs with no randomization. This defeats the purpose of jitter (preventing thundering herd) once the exponential component reaches the cap. The jitter needs to be subtracted from (or applied within) capped rather than added on top of it.
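One way to keep randomization at the cap, as the comment suggests, is to apply the jitter inside the capped value rather than on top of it. A sketch (not the PR's final fix; the exponential term is reconstructed from the surrounding snippet):

```typescript
// Sketch: subtract up to 20% jitter from the capped delay so the result stays
// randomized even after the exponential term has saturated at maxDelayMs.
function calculateBackoffDelay(attempt: number, initialDelayMs: number, maxDelayMs: number): number {
  const exponential = initialDelayMs * 2 ** attempt
  const capped = Math.min(maxDelayMs, exponential)
  const jitter = Math.random() * capped * 0.2
  return Math.floor(capped - jitter)
}
```

At the cap this yields values spread across [0.8 * maxDelayMs, maxDelayMs] instead of a deterministic maxDelayMs, which is what breaks up a thundering herd.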


Problem
When LLM providers (OpenAI, Anthropic, Gemini) return a 429 rate limit or 503 service unavailable, the workflow execution crashes immediately with no retry attempt. This silently breaks production agent workflows for users.
Solution
- `withRetry()` utility with exponential backoff and jitter
- Wrap `handler.execute()` in block-executor so all agent blocks automatically retry on transient provider errors
- Respect `Retry-After` headers from OpenAI and Anthropic

Type of Change
Testing
- `retry.test.ts` covers 7 scenarios: immediate success, 429 retry, maxRetries exhausted, non-retryable 400/401, network error, Retry-After header
- Run with `bun test apps/sim/executor/utils/retry.test.ts`

Checklist