Bug — orphan EC2 instances accumulate when scaleup fails after RunInstances
What I observed
During the v1.0.0-rc.1 release tag push (2026-05-04 ~04:22 UTC), the ami-build.yml workflow failed with VcpuLimitExceeded on the Standard On-Demand vCPU bucket in us-east-2 (32-vCPU limit). Investigating the live state showed:
- 222 messages in jit-runners-scaleup-dlq + 12 in jit-runners-lifecycle-dlq
- 38–48 vCPU running in the c+m bucket while the live workload was modest
- Scaleup Lambda logs showed repeated "spot and on-demand launch both failed" (spot: MaxSpotInstanceCountExceeded, on-demand: VcpuLimitExceeded)
- Scaledown Lambda reports "cleanup complete: stale=0 orphans=0 errors=0" every 5 min; it is NOT detecting the leak
Root cause
Cross-referenced the currently-running EC2 instances against DynamoDB rows. The 16 instances pegging the bucket had this DDB-status distribution:
| DDB status | Running EC2 count | Notes |
|---|---|---|
| running | 1 | Legit in-flight |
| pending | 4 | Legit (just-launched, runner agent registering) |
| failed | ~10 | Orphans: EC2 alive, DDB says failed |
| completed | a couple | Orphans: job done, EC2 still up |
The scaleup error path marks the DDB row as failed (or completes the job and marks completed) but does not always terminate the EC2 instance before doing so. The EC2 keeps running until MaxRunnerAgeMinutes (360 min default) catches it via the age-based path — but in the meantime it consumes vCPU bucket capacity.
Where the gap is
internal/runner/cleanup.go — Cleaner.Clean looks for stale-pending runners (past stale_threshold_minutes) and re-enqueues them. It does NOT scan for runners in terminal states (failed, completed) whose instance_id is still alive on EC2.
lambda/cmd/scaleup/main.go error paths — when scaleup hits an error AFTER RunInstances succeeds (e.g. JIT-token mint fails, or the runner-agent fails to come up), the code marks DDB failed but does not call TerminateInstances.
Operational impact
- Every release that hits the vCPU quota leaks 1–N orphans
- Orphans pin the bucket → next release also hits the quota → more orphans → death spiral
- Tonight's v1.0.0-rc.1 tag push ran AMI build 4 times before the quota happened to be free; each failure left the same orphans behind
Proposed fixes
Three coordinated improvements (cheapest first):
1. Scaleup: terminate EC2 on any post-launch error
In lambda/cmd/scaleup/main.go, the path that marks DDB failed after launch should first call compute.Launcher.Terminate(instanceID). Today the failure path is:
```go
launchedID, err := launcher.Launch(...)
// ... later, if any subsequent call fails:
markFailed(store, recordID) // <-- DDB updated; EC2 left running
```
Should be:
```go
launchedID, err := launcher.Launch(...)
// ... on subsequent error:
_ = launcher.Terminate(ctx, launchedID) // best-effort; ignore err
markFailed(store, recordID)
```
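To keep this from regressing as new post-launch steps are added, a defer-based guard can cover every error path after Launch in one place. A minimal sketch, assuming hypothetical interface shapes (launcher, store, setup) rather than the repo's actual types and signatures:

```go
package scaleup

import (
	"context"
	"log"
)

// Hypothetical shapes mirroring the calls in the snippet above; the real
// interfaces in this repo will differ.
type launcher interface {
	Launch(ctx context.Context) (instanceID string, err error)
	Terminate(ctx context.Context, instanceID string) error
}

type store interface {
	MarkFailed(ctx context.Context, recordID string) error
	MarkRunning(ctx context.Context, recordID, instanceID string) error
}

// provisionRunner finishes runner setup after a successful launch. Any error
// raised after Launch triggers a best-effort Terminate before the DDB row is
// marked failed, so the row and the instance can no longer drift apart.
func provisionRunner(ctx context.Context, l launcher, s store, recordID string,
	setup func(ctx context.Context, instanceID string) error) (err error) {

	instanceID, err := l.Launch(ctx)
	if err != nil {
		_ = s.MarkFailed(ctx, recordID) // nothing launched, nothing to terminate
		return err
	}

	// Guard every post-launch error path, including ones added later.
	defer func() {
		if err != nil {
			if termErr := l.Terminate(ctx, instanceID); termErr != nil {
				log.Printf("best-effort terminate of %s failed: %v", instanceID, termErr)
			}
			_ = s.MarkFailed(ctx, recordID)
		}
	}()

	// setup stands in for the JIT-token mint and runner-agent registration steps.
	if err = setup(ctx, instanceID); err != nil {
		return err
	}
	return s.MarkRunning(ctx, recordID, instanceID)
}
```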
2. Scaledown: reap orphans regardless of DDB state
Extend Cleaner.Clean to also run a second sweep:
- List EC2 instances tagged managed-by=jit-runners in running state
- For each, look up the DDB row by instance_id
- If the DDB row is missing OR status ∈ {failed, completed} AND age > 5 min → terminate
This is independent of the pending-stale logic and catches the drift at the EC2-state level rather than trusting DDB.
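A sketch of what that second sweep could look like with the aws-sdk-go-v2 EC2 client; the tag key and status strings come from this report, while runnerStore and sweepOrphans are assumed names, not the actual cleanup.go code:

```go
package cleanup

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// runnerStore is an assumed shape for the DDB lookup; the real store differs.
type runnerStore interface {
	// StatusByInstanceID returns the DDB status for an instance and
	// found=false when no row exists.
	StatusByInstanceID(ctx context.Context, instanceID string) (status string, found bool, err error)
}

// sweepOrphans terminates running EC2 instances tagged managed-by=jit-runners
// whose DDB row is missing or already terminal (failed/completed) and which
// are older than minAge. It is independent of the stale-pending logic.
func sweepOrphans(ctx context.Context, ec2c *ec2.Client, store runnerStore, minAge time.Duration) error {
	p := ec2.NewDescribeInstancesPaginator(ec2c, &ec2.DescribeInstancesInput{
		Filters: []ec2types.Filter{
			{Name: aws.String("tag:managed-by"), Values: []string{"jit-runners"}},
			{Name: aws.String("instance-state-name"), Values: []string{"running"}},
		},
	})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			return err
		}
		for _, res := range page.Reservations {
			for _, inst := range res.Instances {
				id := aws.ToString(inst.InstanceId)
				if inst.LaunchTime == nil || time.Since(*inst.LaunchTime) < minAge {
					continue // too young to judge; re-check next cycle
				}
				status, found, err := store.StatusByInstanceID(ctx, id)
				if err != nil {
					return err
				}
				if !found || status == "failed" || status == "completed" {
					log.Printf("terminating orphan %s (ddb status=%q found=%v)", id, status, found)
					if _, err := ec2c.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
						InstanceIds: []string{id},
					}); err != nil {
						return err
					}
				}
			}
		}
	}
	return nil
}
```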
3. Use the rebalancer's drift detection for cleanup too
The rebalancer already cycles every 1 min and queries GitHub for queue depth. It could iterate live EC2 + DDB for orphan detection on the same cadence.
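If fix 2 is written as a standalone sweep like the sketch above, wiring it into the rebalancer is one extra call from its handler. Illustrative only; handleRebalance and its parameters are hypothetical, and sweepOrphans/runnerStore are the names from the fix-2 sketch:

```go
// Reuse sweepOrphans inside the rebalancer's existing 1-minute handler,
// after its queue-depth pass. Names are placeholders, not the real code.
func handleRebalance(ctx context.Context, ec2c *ec2.Client, store runnerStore,
	rebalance func(context.Context) error) error {

	if err := rebalance(ctx); err != nil { // existing GitHub queue-depth / scaling logic
		return err
	}
	// Same orphan criteria as scaledown, just on the tighter 1-minute cadence.
	return sweepOrphans(ctx, ec2c, store, 5*time.Minute)
}
```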
Ad-hoc cleanup (already executed)
The 8× c5.xlarge orphans drained on their own while I was investigating (likely picked up by the MaxRunnerAgeMinutes path), but tonight's release would have stalled indefinitely if the timing had been worse.
Adjacent: DLQ depth is silent
The scaleup-dlq had 222 messages. There's no alarm or dashboard surfacing this. A CloudWatch alarm on ApproximateNumberOfMessagesVisible > 5 for either DLQ should page operators.
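For reference, a sketch of creating that alarm with aws-sdk-go-v2; the queue names are from this report, the SNS topic ARN is a placeholder, and in practice this would more likely live in the existing IaC rather than a one-off program:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
	cwtypes "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	cw := cloudwatch.NewFromConfig(cfg)

	for _, queue := range []string{"jit-runners-scaleup-dlq", "jit-runners-lifecycle-dlq"} {
		// Page when either DLQ holds more than 5 visible messages.
		_, err := cw.PutMetricAlarm(ctx, &cloudwatch.PutMetricAlarmInput{
			AlarmName:          aws.String(queue + "-depth"),
			Namespace:          aws.String("AWS/SQS"),
			MetricName:         aws.String("ApproximateNumberOfMessagesVisible"),
			Dimensions:         []cwtypes.Dimension{{Name: aws.String("QueueName"), Value: aws.String(queue)}},
			Statistic:          cwtypes.StatisticMaximum,
			Period:             aws.Int32(300),
			EvaluationPeriods:  aws.Int32(1),
			Threshold:          aws.Float64(5),
			ComparisonOperator: cwtypes.ComparisonOperatorGreaterThanThreshold,
			AlarmActions:       []string{"arn:aws:sns:us-east-2:123456789012:ops-page"}, // placeholder topic
		})
		if err != nil {
			log.Fatalf("put alarm for %s: %v", queue, err)
		}
	}
}
```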
Acceptance criteria
- Scaleup terminates the launched EC2 instance on any post-launch error before marking the DDB row failed
- Scaledown detects and terminates failed/completed-state runners whose instance_id is still alive in EC2
- Re-running ami-build.yml against a fully-saturated runner pool cleans the orphans within 1 scaledown cycle (5 min)