Graceful scale-in for actions-runner (development) #265

Merged: akuzminsky merged 1 commit into main from actions-runner-scale-in-race on Apr 18, 2026

@akuzminsky

Summary

Fixes the scale-in race in which a GitHub Actions job dispatched during the ~11s window between the instance entering Terminating:Wait and systemctl stop actions-runner.service fails: gha_prerun.sh exits 1 because the ASG rejects SetInstanceProtection on a terminating instance. Instead of trying to close the window, this PR makes a job that lands in it succeed:

  • gha_prerun.sh tolerates Terminating:Wait/Terminating:Proceed — job proceeds.
  • actions-runner.service uses KillMode=process + KillSignal=SIGTERM + TimeoutStopSec=21600, and the start script execs bin/runsvc.sh so SIGTERM reaches Runner.Listener (it is trapped and converted to SIGINT, and the resulting graceful shutdown finishes the in-flight job and postrun).
  • ExecStopPost runs gha-on-runner-exit.sh → completes the deregistration lifecycle hook so the instance can terminate cleanly.
  • New gha-lifecycle-heartbeater.timer (10 min, oneshot) heartbeats the hook while in Terminating:Wait, covering long jobs.
  • register.pp: new delete_registration_token exec (notify/refreshonly) cleans up the registration token secret after successful registration.
  • Also fixes three latent systemd bugs that silently SIGKILL in-flight jobs at 90s on any clean runner stop (not just scale-in).
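
The first bullet amounts to a small decision on the instance's lifecycle state. The following is a minimal Python sketch of that decision, not the contents of gha_prerun.sh (a shell script that this description does not show). The function name `prerun_action` and the returned labels are hypothetical, and the sketch assumes that protection is requested only on InService and that any state outside the tolerated set still fails the job.

```python
# Hypothetical sketch of the prerun decision; all names are illustrative.
TOLERATED_TERMINATING_STATES = {"Terminating:Wait", "Terminating:Proceed"}


def prerun_action(lifecycle_state: str) -> str:
    """Return what the prerun step should do for an ASG lifecycle state."""
    if lifecycle_state == "InService":
        # Normal path: protect the instance from scale-in while the job runs.
        return "protect"
    if lifecycle_state in TOLERATED_TERMINATING_STATES:
        # Scale-in has already started, so SetInstanceProtection would be
        # rejected. Skip it and let the job run; the graceful stop path
        # lets the runner finish the job before the service exits.
        return "proceed-unprotected"
    # Assumption: every other state is still treated as an error (exit 1).
    return "fail"
```

The point of the design is that the terminating states become an ordinary, successful branch rather than an error to be avoided by timing.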

Development environment only. Sandbox and global promotion will follow in separate PRs.

Full design notes and rationale: .claude/plans/actions-runner-scale-in-race.md.

Tracks: infrahouse/terraform-aws-actions-runner#81

Test plan

  • Apply to a development actions-runner instance; verify service starts cleanly under the new unit.
  • Run a job to completion on the new unit (baseline — prerun/postrun still work on InService).
  • Trigger an ASG scale-in while a job is queued; confirm the job runs to completion (no exit-1 from prerun) and the instance terminates cleanly.
  • Verify no ValidationError on SetInstanceProtection in CloudTrail during scale-in.
  • Confirm heartbeater no-ops on InService and heartbeats on Terminating:Wait (check timer last-run status, CloudTrail for RecordLifecycleActionHeartbeat).
  • Confirm delete_registration_token runs on fresh register and is skipped on subsequent agent runs.
  • Verify that on an old-Terraform ASG (no deregistration_hookname fact), ExecStopPost and heartbeater scripts no-op.
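
The CloudTrail check in the plan above can be scripted once the events are exported as a list of records. This is a rough sketch under assumptions: `eventName` and `errorCode` are standard CloudTrail record fields, but the exact error-code string is taken from the wording of this test plan and should be confirmed against a real event; the function name and the sample records are hypothetical.

```python
def rejected_protection_calls(records, error_code="ValidationError"):
    """Return CloudTrail records where SetInstanceProtection was rejected."""
    return [
        r for r in records
        if r.get("eventName") == "SetInstanceProtection"
        and r.get("errorCode") == error_code
    ]


# Hypothetical, heavily trimmed records for illustration only.
sample = [
    {"eventName": "SetInstanceProtection"},
    {"eventName": "SetInstanceProtection", "errorCode": "ValidationError"},
    {"eventName": "RecordLifecycleActionHeartbeat"},
]
rejected = rejected_protection_calls(sample)
```

An empty result for the scale-in time range is what the fix is expected to produce.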

Notes on backwards compatibility

Safe on old-Terraform ASGs (no deregistration_hookname fact): ExecStopPost and heartbeater exit 0 as no-ops; the old deregistration Lambda still handles lifecycle completion. The delete_registration_token exec will fail with AccessDenied on old-Terraform instance profiles (intentional — we want the IAM gap visible). Net change is strictly an improvement even without the companion Terraform release.
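
The no-op behavior on old-Terraform ASGs can be sketched as follows. This is an illustration in Python, not the actual gha-on-runner-exit.sh; `complete_hook_on_exit` and `RecordingClient` are hypothetical names, and the `CONTINUE` result is an assumption. `complete_lifecycle_action` and its keyword arguments match the boto3 Auto Scaling client, which is replaced here by a stand-in so the sketch runs without AWS access.

```python
class RecordingClient:
    """Stand-in for a boto3 Auto Scaling client; records calls only."""

    def __init__(self):
        self.calls = []

    def complete_lifecycle_action(self, **kwargs):
        self.calls.append(kwargs)


def complete_hook_on_exit(hook_name, asg_name, instance_id, autoscaling):
    """Complete the deregistration hook, or do nothing if none is configured."""
    if not hook_name:
        # Old-Terraform ASG: no hook name was provided, so exit quietly and
        # leave lifecycle completion to the old deregistration Lambda.
        return False
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",  # assumption, not from the PR
        InstanceId=instance_id,
    )
    return True


# Hypothetical identifiers for illustration.
old_asg = RecordingClient()
skipped = complete_hook_on_exit(None, "runners", "i-0123456789abcdef0", old_asg)

new_asg = RecordingClient()
completed = complete_hook_on_exit(
    "deregistration", "runners", "i-0123456789abcdef0", new_asg
)
```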

🤖 Generated with Claude Code

Fix the scale-in race where a job dispatched into the ~11s window between
Terminating:Wait and systemd stop exits 1 from gha_prerun.sh. Rather than
trying to close the window, accept the race: let prerun tolerate
Terminating:Wait/Proceed, have systemd forward SIGTERM via runsvc.sh so the
runner finishes the in-flight job, and complete the deregistration lifecycle
hook in ExecStopPost. A 10-min oneshot timer heartbeats the hook while the
instance is waiting.
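
The stop behavior described above, where SIGTERM requests a shutdown but the in-flight job still finishes, follows a common pattern that can be sketched in Python. This illustrates the pattern only and is not how the GitHub runner or runsvc.sh is implemented; `GracefulWorker` and the job functions are hypothetical. The demo calls the handler directly instead of sending a real signal so that it runs anywhere.

```python
import signal


class GracefulWorker:
    """Finishes the current job after a stop request, then takes no new ones."""

    def __init__(self):
        self.stop_requested = False
        self.completed = []

    def handle_sigterm(self, signum, frame):
        # Only record the request; the job in progress is not interrupted.
        self.stop_requested = True

    def install(self):
        # In a real service this wires the handler to SIGTERM.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, jobs):
        for job in jobs:
            if self.stop_requested:
                break  # stop was requested between jobs: do not start more
            job(self)
            self.completed.append(job.__name__)
        return self.completed


def long_job(worker):
    # Simulate SIGTERM arriving while this job is still running.
    worker.handle_sigterm(signal.SIGTERM, None)


def next_job(worker):
    pass


worker = GracefulWorker()
result = worker.run([long_job, next_job])
```

Here `long_job` completes even though the stop request arrived while it was running, and `next_job` never starts.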

Also fixes three latent systemd bugs on clean runner stops (KillMode
default SIGKILLs jobs at 90s) and adds a post-register secret cleanup.

Development environment only; sandbox/global promotion follows separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
akuzminsky merged commit 058baa9 into main on Apr 18, 2026 (2 checks passed)
akuzminsky deleted the actions-runner-scale-in-race branch on April 18, 2026 19:17
akuzminsky added a commit to infrahouse/terraform-aws-actions-runner that referenced this pull request Apr 19, 2026
Issue #81: jobs failed with `SetInstanceProtection: not in InService or
EnteringStandby or Standby` when the ASG picked their runner for scale-in
in the ~11s window between Terminating:Wait and actions-runner actually
stopping. CloudTrail analysis confirmed the race; see
.claude/plans/warm-pool-protection-race.md for the full RCA.

Paired with puppet-code PR infrahouse/puppet-code#265 which tolerates
the Terminating:Wait state in gha_prerun.sh, lets the runner finish the
job gracefully via SIGTERM, and completes the lifecycle hook via
ExecStopPost.

Changes in this repo:
- Deregistration lifecycle hook heartbeat_timeout lowered from 3600s
  (1h default) to 1800s (30 min). The on-host heartbeater extends it
  every 10 min while a job is running; the shorter timeout is a
  watchdog against a broken heartbeater.
- runner_deregistration Lambda switched to fire-and-forget SSM: sends
  `systemctl stop actions-runner.service` and exits, no longer waits
  for SSM completion or calls CompleteLifecycleAction. The on-host
  ExecStopPost completes the hook once the runner exits gracefully.
  Warm-pool-trim fast path (Warmed:Terminating:Wait) still completes
  the hook immediately.
- Instance profile gains autoscaling:RecordLifecycleActionHeartbeat
  (for the heartbeater) and secretsmanager:DeleteSecret (for Puppet's
  post-register cleanup).
- New Puppet custom fact `deregistration_hookname` injected via the
  existing cloud-init external-facts mechanism. Puppet uses it as
  Environment=DEREGISTRATION_HOOK_NAME=... in the systemd unit.
- Test conftest: setup_logging() now configures the root logger so
  pytest-infrahouse's wait_for_instance_refresh and other helpers emit
  visible output.
- docs/troubleshooting.md: new "Jobs Failing with SetInstanceProtection"
  section with RCA, how-to-verify-the-fix, and a gh-api script to list
  jobs a specific runner executed.
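
The relationship between the 1800s hook timeout and the 10-minute heartbeater can be checked with a little arithmetic, and the heartbeater's behavior sketched alongside it. This is a hedged Python illustration, not the on-host script: the function names and `RecordingClient` are hypothetical, and the sketch assumes each successful heartbeat restarts the full timeout. `record_lifecycle_action_heartbeat` and its keyword arguments match the boto3 Auto Scaling client, replaced here by a stand-in.

```python
class RecordingClient:
    """Stand-in for a boto3 Auto Scaling client; records calls only."""

    def __init__(self):
        self.calls = []

    def record_lifecycle_action_heartbeat(self, **kwargs):
        self.calls.append(kwargs)


def attempts_per_window(hook_timeout_s: int, interval_s: int) -> int:
    """Timer firings that land strictly inside one hook timeout window."""
    # Firings happen at interval_s, 2*interval_s, ... after the last
    # successful heartbeat; a firing exactly at the timeout is too late.
    return (hook_timeout_s - 1) // interval_s


def heartbeat_if_waiting(state, hook_name, asg_name, instance_id, autoscaling):
    """Heartbeat the hook only while the instance sits in Terminating:Wait."""
    if not hook_name or state != "Terminating:Wait":
        return False  # no-op on InService or when no hook is configured
    autoscaling.record_lifecycle_action_heartbeat(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=instance_id,
    )
    return True


new_window = attempts_per_window(1800, 600)  # 2 attempts (at 600s, 1200s)
old_window = attempts_per_window(3600, 600)  # 5 attempts

client = RecordingClient()
idle = heartbeat_if_waiting("InService", "dereg", "runners", "i-0abc", client)
sent = heartbeat_if_waiting("Terminating:Wait", "dereg", "runners", "i-0abc", client)
```

With two attempts per window, a single failed heartbeater run is survivable, while a heartbeater that stays broken lets the hook expire within 30 minutes instead of an hour.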

Registration-token cleanup keeps a belt-and-suspenders shape: Puppet
deletes the secret immediately after register_runner (puppet-code #265),
and the Lambda's existing ensure_registration_token(present=False)
stays as an idempotent safety net.

Closes #81

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
akuzminsky added a commit that referenced this pull request Apr 19, 2026
Follow-up to #265. Copies the graceful scale-in implementation from the
development environment to sandbox so it can bake there before the global
promotion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
akuzminsky added a commit that referenced this pull request Apr 19, 2026
Follow-up to #265 (development) and #268 (sandbox). Copies the graceful
scale-in implementation from environments/sandbox/modules/profile/github_runner/
into the top-level modules/profile/github_runner/ so all remaining environments
pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
infrahouse8 pushed a commit to infrahouse/terraform-aws-actions-runner that referenced this pull request Apr 19, 2026