Graceful scale-in for actions-runner (development) #265
Merged
Fix the scale-in race where a job dispatched into the ~11s window between Terminating:Wait and the systemd stop exits 1 from gha_prerun.sh. Rather than trying to close the window, accept the race: let prerun tolerate Terminating:Wait and Terminating:Proceed, have systemd forward SIGTERM via runsvc.sh so the runner finishes the in-flight job, and complete the deregistration lifecycle hook in ExecStopPost. A 10-min oneshot timer heartbeats the hook while the instance is waiting.

Also fixes three latent systemd bugs on clean runner stops (including the default KillMode SIGKILLing jobs at 90s) and adds a post-register secret cleanup.

Development environment only; sandbox/global promotion follows separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
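The prerun tolerance described above can be sketched roughly as follows. This is an illustration in Python, not the actual `gha_prerun.sh` (a shell script whose contents are not shown here). The function name `prerun_decision` and its return values are hypothetical; only the lifecycle state names come from the description, and failing on any other state is an assumption.

```python
# Hypothetical sketch of the prerun decision; not the real gha_prerun.sh.

# States in which the ASG rejects SetInstanceProtection but the job should
# still be allowed to run, per the PR description.
TOLERATED_STATES = {"Terminating:Wait", "Terminating:Proceed"}


def prerun_decision(lifecycle_state: str) -> str:
    """Return what the prerun hook should do for an ASG lifecycle state."""
    if lifecycle_state == "InService":
        # Normal path: protect the instance from scale-in, then run the job.
        return "protect"
    if lifecycle_state in TOLERATED_STATES:
        # Race path: scale-in already picked this instance. Skip protection
        # and let the job proceed; the graceful stop lets it finish.
        return "proceed"
    # Assumption: any other state is unexpected and fails the job.
    return "fail"
```

With this shape, a job that lands in the ~11s window gets `"proceed"` instead of a non-zero exit.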
infrahouse8
approved these changes
Apr 18, 2026
This was referenced Apr 19, 2026
akuzminsky
added a commit
to infrahouse/terraform-aws-actions-runner
that referenced
this pull request
Apr 19, 2026
Issue #81: jobs failed with `SetInstanceProtection: not in InService or EnteringStandby or Standby` when the ASG picked their runner for scale-in in the ~11s window between Terminating:Wait and actions-runner actually stopping. CloudTrail analysis confirmed the race; see .claude/plans/warm-pool-protection-race.md for the full RCA.

Paired with puppet-code PR infrahouse/puppet-code#265 which tolerates the Terminating:Wait state in gha_prerun.sh, lets the runner finish the job gracefully via SIGTERM, and completes the lifecycle hook via ExecStopPost.

Changes in this repo:

- Deregistration lifecycle hook heartbeat_timeout lowered from 3600s (1h default) to 1800s (30 min). The on-host heartbeater extends it every 10 min while a job is running; the shorter timeout is a watchdog against a broken heartbeater.
- runner_deregistration Lambda switched to fire-and-forget SSM: sends `systemctl stop actions-runner.service` and exits, no longer waits for SSM completion or calls CompleteLifecycleAction. The on-host ExecStopPost completes the hook once the runner exits gracefully. Warm-pool-trim fast path (Warmed:Terminating:Wait) still completes the hook immediately.
- Instance profile gains autoscaling:RecordLifecycleActionHeartbeat (for the heartbeater) and secretsmanager:DeleteSecret (for Puppet's post-register cleanup).
- New Puppet custom fact `deregistration_hookname` injected via the existing cloud-init external-facts mechanism. Puppet uses it as Environment=DEREGISTRATION_HOOK_NAME=... in the systemd unit.
- Test conftest: setup_logging() now configures the root logger so pytest-infrahouse's wait_for_instance_refresh and other helpers emit visible output.
- docs/troubleshooting.md: new "Jobs Failing with SetInstanceProtection" section with RCA, how-to-verify-the-fix, and a gh-api script to list jobs a specific runner executed.

Registration-token cleanup keeps a belt-and-suspenders shape: Puppet deletes the secret immediately after register_runner (puppet-code #265), and the Lambda's existing ensure_registration_token(present=False) stays as an idempotent safety net.

Closes #81

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
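The fire-and-forget behaviour of the deregistration Lambda might look roughly like the sketch below. It is not the module's actual handler: `handle_terminating`, its parameters, and its return values are hypothetical, and `LifecycleActionResult="CONTINUE"` is an assumed choice. The `ssm` and `autoscaling` arguments are expected to behave like boto3 clients; `send_command` and `complete_lifecycle_action` are the only client calls used.

```python
# Hypothetical sketch of a fire-and-forget deregistration handler.
# `ssm` and `autoscaling` are expected to behave like boto3 clients.

STOP_COMMAND = "systemctl stop actions-runner.service"


def handle_terminating(ssm, autoscaling, *, instance_id, asg_name,
                       hook_name, lifecycle_state):
    """React to a terminating instance and return what was done."""
    if lifecycle_state == "Warmed:Terminating:Wait":
        # Warm-pool trim: no runner job can be in flight, so the hook is
        # completed immediately (result value is an assumption).
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=asg_name,
            InstanceId=instance_id,
            LifecycleActionResult="CONTINUE",
        )
        return "completed"

    # Regular scale-in: ask the host to stop the runner and return at once.
    # The command result is not awaited and the hook is not completed here;
    # the on-host ExecStopPost does that after the runner exits.
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [STOP_COMMAND]},
    )
    return "stop-sent"
```

Because the handler never waits on SSM, its runtime no longer depends on how long the in-flight job takes.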
5 tasks
akuzminsky
added a commit
that referenced
this pull request
Apr 19, 2026
Follow-up to #265. Copies the graceful scale-in implementation from the development environment to sandbox so it can bake there before the global promotion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks
akuzminsky
added a commit
that referenced
this pull request
Apr 19, 2026
akuzminsky
added a commit
that referenced
this pull request
Apr 19, 2026
Follow-up to #265 (development) and #268 (sandbox). Copies the graceful scale-in implementation from environments/sandbox/modules/profile/github_runner/ into the top-level modules/profile/github_runner/ so all remaining environments pick it up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9 tasks
akuzminsky
added a commit
that referenced
this pull request
Apr 19, 2026
…#269) Follow-up to #265 (development) and #268 (sandbox). Copies the graceful scale-in implementation from environments/sandbox/modules/profile/github_runner/ into the top-level modules/profile/github_runner/ so all remaining environments pick it up. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
infrahouse8
pushed a commit
to infrahouse/terraform-aws-actions-runner
that referenced
this pull request
Apr 19, 2026
Summary
Fixes the scale-in race where a GitHub Actions job dispatched into the ~11s window between `Terminating:Wait` and `systemctl stop actions-runner.service` exits 1 from `gha_prerun.sh` (ASG rejects `SetInstanceProtection` on a terminating instance). Instead of trying to close the window, we make the path successful:

- `gha_prerun.sh` tolerates `Terminating:Wait` / `Terminating:Proceed`, so the job proceeds.
- `actions-runner.service` uses `KillMode=process` + `KillSignal=SIGTERM` + `TimeoutStopSec=21600`, and the start script `exec`s `bin/runsvc.sh` so SIGTERM reaches `Runner.Listener` (trapped → SIGINT → graceful shutdown finishes the in-flight job + postrun).
- `ExecStopPost` runs `gha-on-runner-exit.sh` → completes the deregistration lifecycle hook so the instance can terminate cleanly.
- `gha-lifecycle-heartbeater.timer` (10 min, oneshot) heartbeats the hook while in `Terminating:Wait`, covering long jobs.
- `register.pp`: new `delete_registration_token` exec (notify/refreshonly) cleans up the registration token secret after successful registration.

Development environment only. Sandbox and global promotion will follow in separate PRs.
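The `ExecStopPost` step from the list above can be sketched as follows. This is a Python illustration, not the real `gha-on-runner-exit.sh`. The function name and parameters are hypothetical; skipping the call outside `Terminating:Wait` and using `LifecycleActionResult="CONTINUE"` are assumptions. The `autoscaling` argument is expected to behave like a boto3 `autoscaling` client.

```python
# Hypothetical sketch of the ExecStopPost step; not the real
# gha-on-runner-exit.sh. `autoscaling` should behave like a boto3 client.


def complete_hook_on_exit(autoscaling, *, instance_id, asg_name,
                          hook_name, lifecycle_state):
    """Complete the deregistration hook after the runner has exited.

    Returns True if the hook was completed, False if nothing was done.
    """
    if lifecycle_state != "Terminating:Wait":
        # Assumption: a stop outside of scale-in (for example a restart)
        # has no pending lifecycle action, so there is nothing to complete.
        return False
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",  # assumed result value
    )
    return True
```

Running this only after the runner process has exited is what ties instance termination to the end of the in-flight job.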
Full design notes and rationale: `.claude/plans/actions-runner-scale-in-race.md`.

Tracks: infrahouse/terraform-aws-actions-runner#81
Test plan
- […] `InService`).
- […] `ValidationError` on `SetInstanceProtection` in CloudTrail during scale-in.
- […] `InService` and heartbeats on `Terminating:Wait` (check timer last-run status, CloudTrail for `RecordLifecycleActionHeartbeat`).
- `delete_registration_token` runs on fresh register and is skipped on subsequent agent runs.
- […] `deregistration_hookname` fact), `ExecStopPost` and heartbeater scripts no-op.

Notes on backwards compatibility

Safe on old-Terraform ASGs (no `deregistration_hookname` fact): `ExecStopPost` and heartbeater exit 0 as no-ops; the old deregistration Lambda still handles lifecycle completion. The `delete_registration_token` exec will fail with `AccessDenied` on old-Terraform instance profiles (intentional: we want the IAM gap visible). Net change is strictly an improvement even without the companion Terraform release.

🤖 Generated with Claude Code
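The heartbeater's no-op behaviour can be sketched as below. This is a Python illustration, not the actual heartbeater script. `heartbeat_once` and its return strings are hypothetical; the empty hook name stands in for a missing `deregistration_hookname` fact. The `autoscaling` argument is expected to behave like a boto3 client, using `describe_auto_scaling_instances` and `record_lifecycle_action_heartbeat`.

```python
# Hypothetical sketch of one heartbeater run; not the real script.
# `autoscaling` is expected to behave like a boto3 "autoscaling" client.


def heartbeat_once(autoscaling, *, instance_id, hook_name):
    """Send one lifecycle heartbeat if, and only if, it is needed."""
    if not hook_name:
        # Old-Terraform ASG: no hook name was injected, so do nothing.
        return "noop:no-hook"

    resp = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    instances = resp.get("AutoScalingInstances", [])
    if not instances:
        return "noop:not-in-asg"

    inst = instances[0]
    if inst["LifecycleState"] != "Terminating:Wait":
        # InService (or any other state): no hook is pending.
        return "noop:" + inst["LifecycleState"]

    # Extend the hook timeout while the in-flight job is still running.
    autoscaling.record_lifecycle_action_heartbeat(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=inst["AutoScalingGroupName"],
        InstanceId=instance_id,
    )
    return "heartbeat"
```

Returning without raising in the no-op cases mirrors the "exit 0" behaviour described for instances that lack the fact.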