fix(cells): Use process groups and properly handle child process reaping#599
Merged
dmah42 merged 3 commits intoaurae-runtime:mainfrom May 4, 2026
Merged
fix(cells): Use process groups and properly handle child process reaping#599dmah42 merged 3 commits intoaurae-runtime:mainfrom
dmah42 merged 3 commits intoaurae-runtime:mainfrom
Conversation
Stop on a tracked executable previously left grandchildren alive on the host. tokio::process::Child::kill signals only the leader PID — even when process_group(0) was set on the Command, the spawned PGID's members are not signaled. The fix sets each executable's process group to its own PGID via process_group(0), then signals the entire group with killpg(SIGKILL) on stop. Adds: - Executable::start: process_group(0) so the spawned child is its own PGID leader. Captures the leader PID at spawn so killpg targets the right group even after Tokio reaps the child internally. - Executable::kill: replaces child.kill() with killpg(SIGKILL) on the captured PGID. Always reaps the child and joins the stdout/stderr reader tasks even if killpg fails, surfacing the killpg error after cleanup. - Executable::pid: now infallible (Option<Pid>); pid is read from the captured field, not from a possibly-reaped Child. - NestedAuraed::kill: tolerates ESRCH from nix::sys::signal::kill (process already gone) so cell teardown is idempotent. - Executables::stop: distinguishes 'never inserted' (ExecutableNotFound) from 'process already exited' (new ExecutableAlreadyExited variant) via ESRCH/ECHILD classification on the io::Error. Cache is evicted in both cases. Other errors still propagate as FailedToStopExecutable. - Executables::broadcast_stop: logs kill failures instead of dropping them silently. - ExecutablesError::ExecutableAlreadyExited: new variant; mapped to Status::not_found in CellsServiceError -> Status. - CellService::stop: reads pid from the infallible pid() and decouples observe-service channel cleanup from the stop result so channels are unregistered even when kill fails. Translates ExecutableNotFound and ExecutableAlreadyExited to an Ok response for idempotency. Tests (auraed/tests/cell_start_stop_delete.rs): - cells_start_stop_delete: happy-path allocate / start / stop / free. - cells_stop_kills_entire_process_group: regression test for aurae-runtime#534. Spawns a bash wrapper that forks two background sleeps; after stop, every PID in the leader's PGID must be gone within 3s. - cells_double_stop_is_idempotent: pins that a second stop returns Ok. - cells_stop_after_natural_exit_is_ok: stop on a process that has already exited on its own returns Ok (drives the ESRCH/ECHILD classification path). Unit tests (executables.rs): - start_should_cache_pid_and_reject_duplicates - stop_after_natural_exit_returns_ok_and_evicts - stop_unknown_name_returns_not_found
VirtualMachines::new called libc::signal(SIGCHLD, SIG_IGN). The disposition is inherited across execve into every nested auraed, so the kernel auto-reaped spawned children. That made Executable::kill's child.wait().await return ECHILD on every cells stop and caused waitpid in nested_auraed to hang waiting for a SIGCHLD that the kernel never delivered. Cloud Hypervisor's Vm::HANDLED_SIGNALS and Vmm::HANDLED_SIGNALS do not include SIGCHLD, so the block_signal loops below are unaffected.
dmah42
approved these changes
May 4, 2026
Contributor
|
wow, great job. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR addresses a couple bugs that prevented cells from properly tearing down and caused us to leave zombie child processes, detailed in #534 and also attempted in #535.
First, we change the cells to use process groups to ensure all child processes of cell executables are properly killed when the executable is killed. Because tokio's
Child::killdoesn't support process groups, we utilizenix::sys::signal::killpg.Secondly, when implementing the VM service, we had borrowed the same patterns in cloud hypervisor where SIGCHLD was ignored via
VirtualMachines::new. This caused children to be automatically reaped and wouldProcess not founderrors when attempting to kill the executable. We've removed that here.Fixes #534