Skip to content

fix(core): retry TASK_PROCESS_SIGSEGV under the user's retry policy#3552

Open
matt-aitken wants to merge 1 commit into
mainfrom
feature/tri-9234-retry-task_process_sigsegv-errors-respecting-user-retry
Open

fix(core): retry TASK_PROCESS_SIGSEGV under the user's retry policy#3552
matt-aitken wants to merge 1 commit into
mainfrom
feature/tri-9234-retry-task_process_sigsegv-errors-respecting-user-retry

Conversation

@matt-aitken
Copy link
Copy Markdown
Member

Closes TRI-9234

Summary

  • TASK_PROCESS_SIGSEGV was hard-classified as non-retriable in shouldRetryError (packages/core/src/v3/errors.ts), failing the run on the first segfault regardless of the user's retry policy.
  • That assumed SIGSEGV is always a deterministic native crash. For Node tasks that's not reliably true — many production SIGSEGVs are flaky.
  • This flips SIGSEGV into the return true branch of shouldRetryError. The existing shouldLookupRetrySettings + lockedRetryConfig + maxAttempts chain in internal-packages/run-engine/src/engine/retrying.ts then gates the retry — same path SIGTERM and uncaught-exception already use. Tasks without a retry policy still fail fast.

Why retry

Common Node SIGSEGV causes that are non-deterministic across processes:

  • Native addon racessharp, canvas, better-sqlite3, node-rdkafka, bcrypt, etc. libuv thread-pool work stepping on V8 handles. Different heap layout / thread schedule on a fresh process → retry often succeeds.
  • JIT / GC interaction — V8 turbofan deopt or GC during a native callback. Timing-dependent.
  • Near-OOM in native code — when RSS approaches the cgroup limit, native allocations fail and poorly-written addons dereference NULL → SIGSEGV instead of clean OOM-kill.
  • Host / hardware issues — bit flips, kernel quirks. Retry lands on a different host.

The genuinely deterministic case (a bad pointer in user code that always trips the same addon) is real, but it's a subset — and maxAttempts already bounds the damage.

Pre-existing inconsistency this resolves

  • shouldRetryError returned false for TASK_PROCESS_SIGSEGVfail_run.
  • shouldLookupRetrySettings already lists TASK_PROCESS_SIGSEGV as retry-config-aware — but that branch was unreachable because shouldRetryError short-circuited first in retrying.ts:86-90.
  • We already retry TASK_RUN_UNCAUGHT_EXCEPTION (clearly a user-code bug) under the user's retry policy. Refusing to retry SIGSEGV was the odd one out.

shouldLookupRetrySettings reads like the intended behaviour; shouldRetryError looks like a stale gate.

What this doesn't touch

  • OOM-killed (TASK_PROCESS_OOM_KILLED, TASK_PROCESS_MAYBE_OOM_KILLED) — still false here because OOM has its own retry path in retrying.ts that bumps the machine size before reaching shouldRetryError. Tests assert this stays the case.
  • SIGKILL_TIMEOUT — still false. Process didn't respond to SIGTERM, so retrying without diagnosing why is more likely to mask a problem.
  • Routing SIGSEGV through the OOM machine-bump path is a plausible follow-up — would want data on SIGSEGV-near-OOM frequency before shipping.

Test plan

  • pnpm exec vitest run test/errors.test.ts in packages/core — 26/26 pass (4 new)
  • pnpm run build --filter @trigger.dev/core
  • CI green on PR

🤖 Generated with Claude Code

SIGSEGV was hard-classified as non-retriable in shouldRetryError on the
assumption that it's always a deterministic native crash. For Node
tasks that's not reliably true — many production SIGSEGVs are flaky:

- Native addon races (sharp, canvas, better-sqlite3, node-rdkafka,
  bcrypt, etc.) — libuv thread-pool work stepping on V8 handles.
  Different heap layout / thread schedule on a fresh process,
  retry often succeeds.
- JIT / GC interaction — V8 turbofan deopt or GC during a native
  callback. Timing-dependent.
- Near-OOM in native code — when RSS approaches the cgroup limit,
  native allocations fail and poorly-written addons dereference
  NULL → SIGSEGV instead of a clean OOM-kill. A fresh process
  with cleaner memory often succeeds.
- Host / hardware issues — bit flips, kernel quirks. Retry lands
  on a different host.

The codebase was already inconsistent here: shouldLookupRetrySettings
listed SIGSEGV as retry-config-aware, but the shouldRetryError gate
short-circuited fail_run before that branch could be reached. And we
already retry TASK_RUN_UNCAUGHT_EXCEPTION — clearly a user-code bug —
under the user's retry policy, so refusing to retry SIGSEGV was the
odd one out.

Flip TASK_PROCESS_SIGSEGV from the false branch to the true branch in
shouldRetryError. The existing retrying.ts pipeline then gates the
retry on lockedRetryConfig + maxAttempts — same path SIGTERM and
uncaught-exception already use. No new code paths; tasks without a
retry policy still fail fast.

Tests added in packages/core/test/errors.test.ts lock down the new
classification alongside SIGTERM, SIGKILL_TIMEOUT, and the OOM codes
(still non-retriable here because OOM has its own machine-bump retry
path in retrying.ts that runs before shouldRetryError).

Closes TRI-9234.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 11, 2026

🦋 Changeset detected

Latest commit: cdfe334

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages
Name Type
@trigger.dev/core Patch
@trigger.dev/build Patch
trigger.dev Patch
@trigger.dev/python Patch
@trigger.dev/redis-worker Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
d3-chat Patch
references-d3-openai-agents Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
references-telemetry Patch
@internal/sdk-compat-tests Patch
@trigger.dev/react-hooks Patch
@trigger.dev/rsc Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9401a995-a47a-449d-9563-b21838084501

📥 Commits

Reviewing files that changed from the base of the PR and between 2b84545 and cdfe334.

📒 Files selected for processing (3)
  • .changeset/retry-sigsegv.md
  • packages/core/src/v3/errors.ts
  • packages/core/test/errors.test.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (13)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • packages/core/test/errors.test.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
packages/core/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (packages/core/CLAUDE.md)

Never import the root package (@trigger.dev/core). Always use subpath imports such as @trigger.dev/core/v3, @trigger.dev/core/v3/utils, @trigger.dev/core/logger, or @trigger.dev/core/schemas

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
packages/**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

Use pnpm run build to verify changes in public packages (packages/*)

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
**/*.test.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.{ts,tsx,js}: Use vitest exclusively for testing and never mock anything - use testcontainers instead
Place test files next to source files using the pattern MyService.ts -> MyService.test.ts

**/*.test.{ts,tsx,js}: Use vitest for unit testing and run tests with pnpm run test
Test files should live beside the files under test with descriptive describe and it blocks
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed

Files:

  • packages/core/test/errors.test.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use testcontainers with redisTest, postgresTest, or containerTest from @internal/testcontainers for testing with Redis/PostgreSQL dependencies

Files:

  • packages/core/test/errors.test.ts
{packages,integrations}/**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

When modifying any public package (packages/* or integrations/*), add a changeset using pnpm run changeset:add

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
{package.json,**/*.{ts,tsx,js}}

📄 CodeRabbit inference engine (CLAUDE.md)

Pin Zod to version 3.25.76 exactly across the entire monorepo - never use a different version or version range

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx,js}: Import from @trigger.dev/core using subpaths only, never the root export
Always import tasks from @trigger.dev/sdk, never from @trigger.dev/sdk/v3 or deprecated client.defineJob
Add crumbs to code using // @Crumbs comments or `// `#region` `@crumbs blocks for debug tracing during development

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
**/*.{ts,tsx,js,jsx,json,md,css,scss}

📄 CodeRabbit inference engine (AGENTS.md)

Code formatting is enforced using Prettier. Run pnpm run format before committing

Files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
🧠 Learnings (2)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • packages/core/test/errors.test.ts
  • packages/core/src/v3/errors.ts
🔇 Additional comments (3)
packages/core/src/v3/errors.ts (1)

354-401: Good retry classification fix for SIGSEGV.

Adding TASK_PROCESS_SIGSEGV to the retriable internal-error branch correctly unblocks user-configured retry behavior and matches the existing retry-settings lookup flow.

.changeset/retry-sigsegv.md (1)

1-5: Changeset is clear and aligned with behavior change.

The patch note accurately describes the SIGSEGV retry-policy behavior and expected fast-fail behavior when retry is not configured.

packages/core/test/errors.test.ts (1)

250-274: Nice targeted regression coverage for retry behavior.

These tests validate the SIGSEGV retry-path change and keep guardrails around SIGKILL-timeout/OOM non-retry behavior.


Walkthrough

This PR makes segmentation fault (SIGSEGV) errors retryable when a task has a retry policy configured. The shouldRetryError function in errors.ts is updated to classify TASK_PROCESS_SIGSEGV as a retryable error code, matching the existing behavior for TASK_PROCESS_SIGTERM. Test coverage verifies that SIGSEGV and SIGTERM are retriable while SIGKILL, OOM, and related errors remain non-retriable. A changeset documents the patch-level change to @trigger.dev/core.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main change: reclassifying TASK_PROCESS_SIGSEGV as retriable under the user's retry policy, which is the primary modification across the changeset.
Description check ✅ Passed The description is comprehensive and well-structured with a clear summary, detailed rationale, context on pre-existing inconsistencies, and test plan confirming code review and testing were performed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/tri-9234-retry-task_process_sigsegv-errors-respecting-user-retry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant