feat(runner): opt-in remote cancel on controller wait-timeout (#6891) #6898

Merged: chubes4 merged 1 commit into main from feat/remote-cancel-on-timeout-6891 on Jun 28, 2026

Conversation

@chubes4 chubes4 commented Jun 28, 2026

Member

Closes #6891 (the deferred remote-cancel half; the lock-reclaim half merged as #6895).

Problem

When the controller's runner-exec wait budget expires, it mirrors the remote job and returns a timeout, but by deliberate design it leaves the remote job running. The orphaned job keeps holding its rig lock, which wedges all later runs of that rig with rig.resource_conflict (the #6891 symptom: a 28-min idle bench still held static-site-fixture-matrix).

This is intentional default behavior, covered by the contract test timeout_mirrors_remote_job_without_cancelling and the "still in flight, not cancelled" handoff semantics. The goal is to let an operator opt in to cancelling the remote job on timeout, without changing the default.

Change

  • New opt-in env var HOMEBOY_RUNNER_CANCEL_ON_WAIT_TIMEOUT (truthy spellings: 1/true/yes/on; unset/anything else = off).
  • daemon_job_wait_timeout, the single shared timeout builder used by both the direct-daemon and reverse-broker wait loops, now calls the existing runner_job_cancel primitive (which hits the daemon/broker /jobs/{id}/cancel) only when opted in, on a best-effort basis. When the cancel goes through, the remote job is cancelled and its rig lock is freed.
  • The cancel outcome (disabled / requested / failed) is surfaced in the error message, the new cancel_on_wait_timeout detail, and a dedicated hint. A failed best-effort cancel never masks the underlying timeout.
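
The opt-in check described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the helper names `is_truthy` and `cancel_on_wait_timeout_enabled` are hypothetical, and trimming whitespace and ignoring case are assumptions that the description does not confirm.

```rust
/// Hypothetical helper: treat only the documented spellings as "on".
/// Trimming and case-insensitivity are assumptions, not confirmed by the PR.
fn is_truthy(raw: Option<&str>) -> bool {
    match raw {
        Some(value) => matches!(
            value.trim().to_ascii_lowercase().as_str(),
            "1" | "true" | "yes" | "on"
        ),
        // Unset means off, which keeps the default behavior.
        None => false,
    }
}

/// Hypothetical wrapper that reads the env var named in this PR.
fn cancel_on_wait_timeout_enabled() -> bool {
    is_truthy(
        std::env::var("HOMEBOY_RUNNER_CANCEL_ON_WAIT_TIMEOUT")
            .ok()
            .as_deref(),
    )
}
```

Splitting the pure parser from the env lookup keeps the truthy rules testable without mutating process-wide environment state.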

Default preserved (byte-identical)

With the opt-in off:

  • No cancel call is made.
  • The error message still reads the remote job is still in flight and was not cancelled.
  • detail.cancel_on_wait_timeout = "disabled".
  • timeout_mirrors_remote_job_without_cancelling and all existing handoff-contract tests stay green.
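
One way to picture how the three outcomes map onto the detail field and the message is sketched below. The `CancelOutcome` type and `timeout_message` function are hypothetical names; only the "disabled" detail value and the default "still in flight and was not cancelled" wording are quoted from this PR, and the other two message tails are invented for illustration.

```rust
/// Hypothetical outcome type mirroring disabled / requested / failed.
#[derive(Debug, Clone, PartialEq, Eq)]
enum CancelOutcome {
    Disabled,
    Requested,
    Failed(String),
}

impl CancelOutcome {
    /// Value for the `cancel_on_wait_timeout` detail.
    fn detail(&self) -> &'static str {
        match self {
            CancelOutcome::Disabled => "disabled",
            CancelOutcome::Requested => "requested",
            CancelOutcome::Failed(_) => "failed",
        }
    }
}

/// Hypothetical message builder. The Disabled wording is taken from the PR;
/// the Requested and Failed wordings are illustrative assumptions.
fn timeout_message(job_id: &str, outcome: &CancelOutcome) -> String {
    let tail = match outcome {
        CancelOutcome::Disabled => {
            "the remote job is still in flight and was not cancelled".to_string()
        }
        CancelOutcome::Requested => "remote cancel was requested".to_string(),
        CancelOutcome::Failed(err) => {
            format!("remote cancel failed ({err}); the job may still be running")
        }
    };
    // Every variant still reports the timeout itself, so a failed cancel
    // cannot mask the underlying error.
    format!("wait timed out for job {job_id}: {tail}")
}
```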

Tests

A test-only injectable cancel hook lets the opt-in path be asserted without a live runner:

  • opt_in_cancels_remote_job_on_wait_timeout — flag on: cancel fires exactly once for the job; message reflects cancellation.
  • opt_in_off_leaves_remote_job_uncancelled — flag off: injected cancel is never called; default message intact.
  • opt_in_surfaces_remote_cancel_failure_on_wait_timeout — flag on + cancel errors: timeout still surfaces with a failed outcome and remediation hint.
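
The injectable-hook idea behind these three tests can be sketched as follows. This is an illustration under stated assumptions rather than the PR's test code: `WaitTimeout`, `wait_timeout_with_hook`, and `run_with_recording_hook` are hypothetical names, and the real hook signature is not shown in this description.

```rust
/// Hypothetical timeout result: always a timeout, annotated with the outcome.
#[derive(Debug, PartialEq)]
struct WaitTimeout {
    job_id: String,
    cancel_outcome: &'static str, // "disabled" | "requested" | "failed"
}

/// Hypothetical stand-in for the timeout builder. The `cancel` closure plays
/// the role of the injected hook and is called at most once, only on opt-in.
fn wait_timeout_with_hook<F>(opt_in: bool, job_id: &str, mut cancel: F) -> WaitTimeout
where
    F: FnMut(&str) -> Result<(), String>,
{
    let cancel_outcome = if !opt_in {
        "disabled"
    } else if cancel(job_id).is_ok() {
        "requested"
    } else {
        // Best-effort: the failure is recorded but the timeout is still returned.
        "failed"
    };
    WaitTimeout {
        job_id: job_id.to_string(),
        cancel_outcome,
    }
}

/// Test helper: run the timeout path with a recording hook and return the
/// timeout plus the job ids the hook was asked to cancel.
fn run_with_recording_hook(
    opt_in: bool,
    hook_result: Result<(), String>,
) -> (WaitTimeout, Vec<String>) {
    let mut calls = Vec::new();
    let timeout = wait_timeout_with_hook(opt_in, "job-42", |id| {
        calls.push(id.to_string());
        hook_result.clone()
    });
    (timeout, calls)
}
```

Recording the calls in a plain `Vec` lets a test assert both how many times the hook fired and which job id it received, without a live runner.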

cargo build clean (pre-existing warnings only); cargo test --lib runner::execution → 76 passed, 0 failed.

AI assistance

  • AI assistance: Yes
  • Tool(s): Claude Opus 4.8 via Claude Code
  • Used for: Investigating the wait-timeout path, wiring the opt-in, and writing tests.

🤖 Generated with Claude Code

When the controller's runner-exec wait budget expires it mirrors the
remote job but leaves it running, which keeps the rig lock held and wedges
later runs (#6891). Add an opt-in, off by default, that best-effort cancels
the remote job on wait-timeout via the existing runner_job_cancel primitive.

- New env `HOMEBOY_RUNNER_CANCEL_ON_WAIT_TIMEOUT` (truthy: 1/true/yes/on).
  Off/unset preserves byte-identical default behavior.
- `daemon_job_wait_timeout` (shared by daemon + reverse-broker timeout paths)
  attempts the cancel only when opted in, captures success/failure best-effort,
  and reflects it in the error message, `cancel_on_wait_timeout` detail, and hints.
- Default path is unchanged: message still says the remote job "was not cancelled"
  and no cancel call is made (`timeout_mirrors_remote_job_without_cancelling` green).
- Test-only injectable cancel hook lets the opt-in path be asserted without a live
  runner. New tests cover: opt-in fires cancel once for the job; opt-in off never
  calls cancel; opt-in surfaces a cancel failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@chubes4 chubes4 merged commit 5145554 into main Jun 28, 2026
3 checks passed
@chubes4 chubes4 deleted the feat/remote-cancel-on-timeout-6891 branch June 28, 2026 15:11
Development

Successfully merging this pull request may close these issues.

Controller wait-timeout orphans the remote bench job + leaves the rig lock unreclaimable (resource_conflict blocks all later runs)
