Skip to content

fix(cells): Use process groups and properly handle child process reaping#599

Merged
dmah42 merged 3 commits intoaurae-runtime:mainfrom
mccormickt:fix-cells
May 4, 2026
Merged

fix(cells): Use process groups and properly handle child process reaping#599
dmah42 merged 3 commits intoaurae-runtime:mainfrom
mccormickt:fix-cells

Conversation

@mccormickt
Copy link
Copy Markdown
Contributor

This PR addresses a couple bugs that prevented cells from properly tearing down and caused us to leave zombie child processes, detailed in #534 and also attempted in #535.

First, we change the cells to use process groups to ensure all child processes of cell executables are properly killed when the executable is killed. Because tokio's Child::kill doesn't support process groups, we utilize nix::sys::signal::killpg.

Secondly, when implementing the VM service, we had borrowed the same patterns in cloud hypervisor where SIGCHLD was ignored via VirtualMachines::new. This caused children to be automatically reaped and would Process not found errors when attempting to kill the executable. We've removed that here.

Fixes #534

mccormickt added 2 commits May 3, 2026 10:08
Stop on a tracked executable previously left grandchildren alive on the
host. tokio::process::Child::kill signals only the leader PID — even
when process_group(0) was set on the Command, the spawned PGID's
members are not signaled. The fix sets each executable's process group
to its own PGID via process_group(0), then signals the entire group
with killpg(SIGKILL) on stop.

Adds:
- Executable::start: process_group(0) so the spawned child is its own
  PGID leader. Captures the leader PID at spawn so killpg targets the
  right group even after Tokio reaps the child internally.
- Executable::kill: replaces child.kill() with killpg(SIGKILL) on the
  captured PGID. Always reaps the child and joins the stdout/stderr
  reader tasks even if killpg fails, surfacing the killpg error after
  cleanup.
- Executable::pid: now infallible (Option<Pid>); pid is read from the
  captured field, not from a possibly-reaped Child.
- NestedAuraed::kill: tolerates ESRCH from nix::sys::signal::kill
  (process already gone) so cell teardown is idempotent.
- Executables::stop: distinguishes 'never inserted' (ExecutableNotFound)
  from 'process already exited' (new ExecutableAlreadyExited variant)
  via ESRCH/ECHILD classification on the io::Error. Cache is evicted
  in both cases. Other errors still propagate as FailedToStopExecutable.
- Executables::broadcast_stop: logs kill failures instead of dropping
  them silently.
- ExecutablesError::ExecutableAlreadyExited: new variant; mapped to
  Status::not_found in CellsServiceError -> Status.
- CellService::stop: reads pid from the infallible pid() and decouples
  observe-service channel cleanup from the stop result so channels are
  unregistered even when kill fails. Translates ExecutableNotFound and
  ExecutableAlreadyExited to an Ok response for idempotency.

Tests (auraed/tests/cell_start_stop_delete.rs):
- cells_start_stop_delete: happy-path allocate / start / stop / free.
- cells_stop_kills_entire_process_group: regression test for
  aurae-runtime#534. Spawns a bash
  wrapper that forks two background sleeps; after stop, every PID in
  the leader's PGID must be gone within 3s.
- cells_double_stop_is_idempotent: pins that a second stop returns Ok.
- cells_stop_after_natural_exit_is_ok: stop on a process that has
  already exited on its own returns Ok (drives the ESRCH/ECHILD
  classification path).

Unit tests (executables.rs):
- start_should_cache_pid_and_reject_duplicates
- stop_after_natural_exit_returns_ok_and_evicts
- stop_unknown_name_returns_not_found
VirtualMachines::new called libc::signal(SIGCHLD, SIG_IGN). The
disposition is inherited across execve into every nested auraed, so
the kernel auto-reaped spawned children. That made
Executable::kill's child.wait().await return ECHILD on every cells
stop and caused waitpid in nested_auraed to hang waiting for a
SIGCHLD that the kernel never delivered.

Cloud Hypervisor's Vm::HANDLED_SIGNALS and Vmm::HANDLED_SIGNALS do
not include SIGCHLD, so the block_signal loops below are unaffected.
@dmah42
Copy link
Copy Markdown
Contributor

dmah42 commented May 4, 2026

wow, great job.

@dmah42 dmah42 merged commit 07743c0 into aurae-runtime:main May 4, 2026
5 checks passed
@mccormickt mccormickt deleted the fix-cells branch May 4, 2026 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[cells] PID not found errors when stopping running executables

2 participants