
fix(scaleup): orphan EC2 instances accumulate when scaleup fails post-RunInstances #74

@kaio6fellipe

Description


Bug — orphan EC2 instances accumulate when scaleup fails after RunInstances

What I observed

During the v1.0.0-rc.1 release tag push (2026-05-04 ~04:22 UTC), the ami-build.yml workflow failed with VcpuLimitExceeded on the Standard On-Demand vCPU bucket in us-east-2 (32-vCPU limit). Investigating the live state showed:

  • 222 messages in jit-runners-scaleup-dlq + 12 in jit-runners-lifecycle-dlq
  • 38–48 vCPU running in the c+m bucket while the live workload was modest
  • Scaleup Lambda logs showed repeated:
    spot and on-demand launch both failed
      (spot: MaxSpotInstanceCountExceeded)
      (on-demand: VcpuLimitExceeded)
    
  • Scaledown Lambda reports cleanup complete: stale=0 orphans=0 errors=0 every 5 min — it is NOT detecting the leak

Root cause

Cross-referenced the currently-running EC2 instances against DynamoDB rows. The 16 instances pegging the bucket had this DDB-status distribution:

| DDB status | Running EC2 count | Notes |
| --- | --- | --- |
| running | 1 | Legit in-flight |
| pending | 4 | Legit (just launched, runner agent registering) |
| failed | ~10 | Orphans: EC2 alive, DDB says failed |
| completed | a couple | Orphans: job done, EC2 still up |

The scaleup error path marks the DDB row as failed (or completes the job and marks completed) but does not always terminate the EC2 instance before doing so. The EC2 keeps running until MaxRunnerAgeMinutes (360 min default) catches it via the age-based path — but in the meantime it consumes vCPU bucket capacity.

Where the gap is

  1. Cleaner.Clean in internal/runner/cleanup.go looks for stale-pending runners (past stale_threshold_minutes) and re-enqueues them. It does NOT scan for runners in terminal states (failed, completed) whose instance_id is still alive on EC2.
  2. lambda/cmd/scaleup/main.go error paths — when scaleup hits an error AFTER RunInstances succeeds (e.g. JIT-token mint fails, or the runner-agent fails to come up), the code marks DDB failed but does not call TerminateInstances.

Operational impact

  • Every release that hits the vCPU quota leaks 1–N orphans
  • Orphans pin the bucket → next release also hits the quota → more orphans → death spiral
  • Tonight's v1.0.0-rc.1 tag push ran AMI build 4 times before the quota happened to be free; each failure left the same orphans behind

Proposed fixes

Three coordinated improvements (cheapest first):

1. Scaleup: terminate EC2 on any post-launch error

In lambda/cmd/scaleup/main.go, the path that marks DDB failed after launch should first call compute.Launcher.Terminate(instanceID). Today the failure path is:

```go
launchedID, err := launcher.Launch(...)
// ... later, if any subsequent call fails:
markFailed(store, recordID) // <-- DDB updated; EC2 left running
```

Should be:

```go
launchedID, err := launcher.Launch(...)
// ... on subsequent error:
_ = launcher.Terminate(ctx, launchedID) // best-effort; ignore err
markFailed(store, recordID)
```

2. Scaledown: reap orphans regardless of DDB state

Extend Cleaner.Clean to also run a second sweep:

  • List EC2 instances tagged managed-by=jit-runners in running state
  • For each, look up the DDB row by instance_id
  • If (the DDB row is missing OR status ∈ {failed, completed}) AND instance age > 5 min → terminate

This is independent of the pending-stale logic and catches the drift at the EC2-state level rather than trusting DDB.

3. Use the rebalancer's drift detection for cleanup too

The rebalancer already cycles every 1 min and queries GitHub for queue depth. It could also iterate live EC2 and DDB state on the same cadence to detect orphans sooner than the 5-min scaledown cycle.

Ad-hoc cleanup (no longer needed)

The 8× c5.xlarge orphans drained on their own while I was investigating (likely picked up by the MaxRunnerAgeMinutes path), but tonight's release would have stalled indefinitely if the timing had been worse.

Adjacent: DLQ depth is silent

The scaleup-dlq had 222 messages. There's no alarm or dashboard surfacing this. A CloudWatch alarm on the SQS ApproximateNumberOfMessagesVisible metric (> 5) for either DLQ should page operators.
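To make the threshold semantics concrete, the alarm's trigger condition can be mimicked in code: page when the metric has exceeded 5 for the full evaluation window. `dlqAlarmBreached` is a hypothetical helper for illustration, not an AWS SDK call:

```go
package main

import "fmt"

// dlqAlarmBreached mimics a CloudWatch alarm on the SQS
// ApproximateNumberOfMessagesVisible metric: it fires when every one
// of the last evalPeriods datapoints exceeds the threshold, i.e. the
// simple "N of N datapoints" alarm configuration.
func dlqAlarmBreached(datapoints []float64, threshold float64, evalPeriods int) bool {
	if len(datapoints) < evalPeriods {
		return false // not enough data to evaluate
	}
	for _, dp := range datapoints[len(datapoints)-evalPeriods:] {
		if dp <= threshold {
			return false
		}
	}
	return true
}

func main() {
	// 222 messages sitting in the scaleup DLQ across three periods.
	fmt.Println(dlqAlarmBreached([]float64{0, 222, 222, 222}, 5, 3)) // true
	fmt.Println(dlqAlarmBreached([]float64{0, 0, 4}, 5, 3))          // false
}
```

Requiring the full window to breach avoids paging on a transient single-message blip while still catching a sustained 222-message backlog quickly.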

Acceptance criteria

  • Scaleup error path terminates the EC2 before marking DDB failed
  • Scaledown's orphan sweep also reaps failed/completed-state runners whose instance_id is still alive in EC2
  • Repeating ami-build.yml against a fully-saturated runner pool cleans the orphans within 1 scaledown cycle (5 min)
  • CloudWatch alarm on scaleup-dlq depth > 5
