
fix(scaleup): orphan EC2 instances accumulate when scaleup fails post-RunInstances #74

@kaio6fellipe

Description


Bug — orphan EC2 instances accumulate when scaleup fails after RunInstances

What I observed

During the v1.0.0-rc.1 release tag push (2026-05-04 ~04:22 UTC), the ami-build.yml workflow failed with VcpuLimitExceeded on the Standard On-Demand vCPU bucket in us-east-2 (32-vCPU limit). Investigating the live state showed:

  • 222 messages in jit-runners-scaleup-dlq + 12 in jit-runners-lifecycle-dlq
  • 38–48 vCPU running in the c+m bucket while the live workload was modest
  • Scaleup Lambda logs showed repeated:
    spot and on-demand launch both failed
      (spot: MaxSpotInstanceCountExceeded)
      (on-demand: VcpuLimitExceeded)
    
  • Scaledown Lambda reports cleanup complete: stale=0 orphans=0 errors=0 every 5 min — it is NOT detecting the leak

Root cause

Cross-referenced the currently-running EC2 instances against DynamoDB rows. The 16 instances pegging the bucket had this DDB-status distribution:

| DDB status | Running EC2 count | Notes |
| --- | --- | --- |
| running | 1 | Legit in-flight |
| pending | 4 | Legit (just launched, runner agent registering) |
| failed | ~10 | Orphans: EC2 alive, DDB says failed |
| completed | a couple | Orphans: job done, EC2 still up |

The scaleup error path marks the DDB row as failed (or completes the job and marks completed) but does not always terminate the EC2 instance before doing so. The EC2 keeps running until MaxRunnerAgeMinutes (360 min default) catches it via the age-based path — but in the meantime it consumes vCPU bucket capacity.

Where the gap is

  1. Cleaner.Clean in internal/runner/cleanup.go looks for stale-pending runners (past stale_threshold_minutes) and re-enqueues them. It does NOT scan for runners in terminal states (failed, completed) whose instance_id is still alive on EC2.
  2. lambda/cmd/scaleup/main.go error paths — when scaleup hits an error AFTER RunInstances succeeds (e.g. JIT-token mint fails, or the runner-agent fails to come up), the code marks DDB failed but does not call TerminateInstances.

Operational impact

  • Every release that hits the vCPU quota leaks 1–N orphans
  • Orphans pin the bucket → next release also hits the quota → more orphans → death spiral
  • Tonight's v1.0.0-rc.1 tag push ran AMI build 4 times before the quota happened to be free; each failure left the same orphans behind

Proposed fixes

Three coordinated improvements (cheapest first):

1. Scaleup: terminate EC2 on any post-launch error

In lambda/cmd/scaleup/main.go, the path that marks DDB failed after launch should first call compute.Launcher.Terminate(instanceID). Today the failure path is:

```go
launchedID, err := launcher.Launch(...)
// ... later, if any subsequent call fails:
markFailed(store, recordID) // <-- DDB updated; EC2 left running
```

Should be:

```go
launchedID, err := launcher.Launch(...)
// ... on subsequent error:
_ = launcher.Terminate(ctx, launchedID) // best-effort; ignore err
markFailed(store, recordID)
```

2. Scaledown: reap orphans regardless of DDB state

Extend Cleaner.Clean to also run a second sweep:

  • List EC2 instances tagged managed-by=jit-runners in running state
  • For each, look up the DDB row by instance_id
  • If (the DDB row is missing OR status ∈ {failed, completed}) AND instance age > 5 min → terminate

This is independent of the pending-stale logic and catches the drift at the EC2-state level rather than trusting DDB.

3. Use the rebalancer's drift detection for cleanup too

The rebalancer already cycles every 1 min and queries GitHub for queue depth. It could also iterate live EC2 and DDB state on the same cadence to detect orphans sooner than the 5-min scaledown cycle.

Ad-hoc cleanup (no longer needed)

The 8× c5.xlarge orphans drained on their own while I was investigating (likely picked up by the MaxRunnerAgeMinutes path), but tonight's release would have stalled indefinitely if the timing had been worse.

Adjacent: DLQ depth is silent

The scaleup-dlq had 222 messages. There's no alarm or dashboard surfacing this. A CloudWatch alarm on the SQS ApproximateNumberOfMessagesVisible metric (> 5) for either DLQ should page operators.
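To make the threshold semantics concrete, the alarm's trigger condition can be mimicked in code: page when the metric has exceeded 5 for the full evaluation window. `dlqAlarmBreached` is a hypothetical helper for illustration, not an AWS SDK call:

```go
package main

import "fmt"

// dlqAlarmBreached mimics a CloudWatch alarm on the SQS
// ApproximateNumberOfMessagesVisible metric: it fires when every one
// of the last evalPeriods datapoints exceeds the threshold, i.e. the
// simple "N of N datapoints" alarm configuration.
func dlqAlarmBreached(datapoints []float64, threshold float64, evalPeriods int) bool {
	if len(datapoints) < evalPeriods {
		return false // not enough data to evaluate
	}
	for _, dp := range datapoints[len(datapoints)-evalPeriods:] {
		if dp <= threshold {
			return false
		}
	}
	return true
}

func main() {
	// 222 messages sitting in the scaleup DLQ across three periods.
	fmt.Println(dlqAlarmBreached([]float64{0, 222, 222, 222}, 5, 3)) // true
	fmt.Println(dlqAlarmBreached([]float64{0, 0, 4}, 5, 3))          // false
}
```

Requiring the full window to breach avoids paging on a transient single-message blip while still catching a sustained 222-message backlog quickly.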

Acceptance criteria

  • Scaleup error path terminates the EC2 before marking DDB failed
  • Scaledown's orphan sweep also reaps failed/completed-state runners whose instance_id is still alive in EC2
  • Repeating ami-build.yml against a fully-saturated runner pool cleans the orphans within 1 scaledown cycle (5 min)
  • CloudWatch alarm on scaleup-dlq depth > 5
