Fix resilience relay bind: drain relay poll-waiters on stop/kill #470

Merged
mvdbeek merged 2 commits into galaxyproject:master from ksuderman:469-fix-resilience-relay
Jul 3, 2026

@ksuderman

Contributor

Refs #469.

The Resilience Suite intermittently fails at fixture setup with
TimeoutError: Pulsar did not bind relay consumers within 60.0s
(test/resilience/harness/pulsar_control.py).

Cause

For AMQP modes, PulsarControl.kill() force-drops the stale broker consumer via
_force_drop_setup_consumer_connections() so the next setup can't race a lingering
registration. Relay mode had no equivalent. Pulsar's relay control consumer holds a
30 s long_poll waiter on the relay (pulsar/messaging/bind_relay.py); after a
stop/kill that waiter lingers on the relay until it notices the dropped connection.
Because the readiness check _relay_has_pulsar_setup_waiter() returns True for any
nonzero */job_setup waiter count, the next test's setup can pass by matching the dead
Pulsar's lingering registration instead of the freshly started process's.
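
To illustrate the race described above, here is a toy model (not the relay's real data structures). The `Waiter` class, the owner names, and the assumption that a stale waiter lingers for the remainder of its 30 s poll window are all hypothetical and only serve to show why a "nonzero count" check can report a false positive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Waiter:
    owner: str         # hypothetical label for the Pulsar process that registered the poll
    expires_at: float  # time at which the relay drops the waiter (assumed worst case)


def setup_waiter_count(waiters: List[Waiter], now: float) -> int:
    # Count the waiters the relay still considers alive at time `now`.
    return sum(1 for w in waiters if w.expires_at > now)


def naive_ready(waiters: List[Waiter], now: float) -> bool:
    # Mirrors the described check: any nonzero count is treated as "bound".
    return setup_waiter_count(waiters, now) > 0


# The old Pulsar was killed at t=0, but its long poll is assumed to linger until t=30.
stale = Waiter(owner="old-pulsar", expires_at=30.0)

# At t=1 the new Pulsar has not bound yet, but the check already passes.
false_positive = naive_ready([stale], now=1.0)

# Once the stale waiter is gone, the check reflects only live registrations.
after_drain = naive_ready([stale], now=31.0)
```

In this model the check cannot tell whose waiter it is counting, which is why draining the old waiter before the next setup removes the ambiguity.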

Fix

Deterministic relay teardown that mirrors the AMQP force-drop:

  • Add _relay_setup_waiter_count() (returns None when the relay is unreachable, so
    "unknown" is distinguishable from a confirmed zero) and refactor
    _relay_has_pulsar_setup_waiter() to build on it.
  • Add _wait_relay_setup_waiters_drained() — polls /messages/poll/stats until zero
    */job_setup waiters; best-effort and bounded so it can never hang the suite.
  • stop() and kill() await this drain in relay mode, so the following
    wait_until_consuming() only ever sees the freshly-started pulsar.
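
The helpers listed above could look roughly like the following sketch. It is not the code from this PR: the function names drop the leading underscore, the stats payload is assumed to be a mapping of topic name to waiter count, matching `*/job_setup` is assumed to mean "topic ends with `/job_setup`", and the timeout and interval defaults are placeholders. The stats fetcher is injected so that no HTTP client or relay URL has to be guessed.

```python
import time
from typing import Callable, Mapping, Optional

# Assumed shape: returns {topic: waiter_count}, or None when the relay is unreachable.
StatsFetcher = Callable[[], Optional[Mapping[str, int]]]


def relay_setup_waiter_count(fetch_stats: StatsFetcher) -> Optional[int]:
    """Return the number of job_setup waiters, or None if the relay state is unknown."""
    stats = fetch_stats()
    if stats is None:
        return None  # "unknown" stays distinguishable from a confirmed zero
    return sum(count for topic, count in stats.items() if topic.endswith("/job_setup"))


def relay_has_pulsar_setup_waiter(fetch_stats: StatsFetcher) -> bool:
    """Readiness check built on the count; unknown (None) is treated as not ready."""
    return bool(relay_setup_waiter_count(fetch_stats))


def wait_relay_setup_waiters_drained(
    fetch_stats: StatsFetcher,
    timeout: float = 35.0,   # placeholder bound, not taken from the PR
    interval: float = 0.5,   # placeholder poll interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until a confirmed zero waiters; return False if the bound is hit first."""
    deadline = clock() + timeout
    while True:
        if relay_setup_waiter_count(fetch_stats) == 0:
            return True
        if clock() >= deadline:
            return False  # best-effort: give up rather than hang the suite
        sleep(interval)
```

Under these assumptions, `stop()` and `kill()` would call `wait_relay_setup_waiters_drained(...)` in relay mode and ignore a `False` result, since the drain is best-effort.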

Harness-only change; no production code touched.

Testing

  • py_compile and flake8 (max-line 150, complexity 14) pass.
  • Not run against the live docker-compose stack — the failure is intermittent, so a
    single green run wouldn't prove the fix.

An independent interim mitigation (bind-timeout bump + fixture-setup retry) is proposed
separately in the 469-resilience-timeout PR; the two are non-overlapping and can land
in either order.

🤖 Generated with Claude Code

…lake

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
And trim claude code fluff comments.
@mvdbeek mvdbeek closed this Jul 3, 2026
@mvdbeek mvdbeek reopened this Jul 3, 2026
@mvdbeek

mvdbeek commented Jul 3, 2026

Member

Green on https://github.com/mvdbeek/pulsar/actions/runs/28662443620/job/85006053383. CI is severely hammered, so I'll merge.

@mvdbeek mvdbeek merged commit f1369c4 into galaxyproject:master Jul 3, 2026
0 of 62 checks passed
@ksuderman

Contributor Author

CI is severely hammered, so I'll merge.

That would be me opening a flurry of pull requests all at the same time 😁. I am done now, though, and I'll stage bulk changes more slowly in the future.

2 participants