Interim mitigation for resilience relay bind: timeout bump + fixture retry #471

Open
ksuderman wants to merge 1 commit into galaxyproject:master from ksuderman:469-resilience-timeout

Conversation

@ksuderman

Contributor

Refs #469 #470

Interim mitigation for the intermittent Resilience Suite bind-timeout error
(TimeoutError: Pulsar did not bind <mode> consumers). This is a stopgap / safety net;
the deterministic root-cause fix is proposed separately in #470. The two are
non-overlapping and can land in either order.

Changes

  • harness/pulsar_control.py: bump the consumer bind timeout from 60.0 to 120.0 to give
    a slow startup more margin.
  • conftest.py: add _start_fresh_with_retry(). The pulsar fixture retries a fresh
    start once on the bind TimeoutError. A blanket setup retry is more robust than
    marking individual scenarios rerun-eligible, since the error is not pinned to a fixed
    test. Each attempt force-recreates the container (fresh=True), so a retry starts from
    a clean slate. Only the bind TimeoutError is retried: a docker compose up failure
    still raises immediately, and the final TimeoutError propagates if every attempt
    times out, so a genuine breakage still surfaces.
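
For reviewers, the retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: FakePulsarControl, its start(fresh=...) method, and the attempts parameter are hypothetical stand-ins for the harness control object, and the real helper in conftest.py may be shaped differently.

```python
class FakePulsarControl:
    """Hypothetical stand-in for the harness control object (not the real API)."""

    def __init__(self, timeouts_before_success):
        self.timeouts_left = timeouts_before_success
        self.start_calls = []  # records the `fresh` flag of every attempt

    def start(self, fresh=False):
        self.start_calls.append(fresh)
        if self.timeouts_left > 0:
            self.timeouts_left -= 1
            raise TimeoutError("Pulsar did not bind consumers")


def start_fresh_with_retry(control, attempts=2):
    """Start from a clean container, retrying only on TimeoutError.

    attempts=2 means one initial start plus one retry (an assumed default).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            # Every attempt asks for a fresh container, so a retry never
            # inherits state from the attempt that timed out.
            control.start(fresh=True)
            return
        except TimeoutError:
            # Only the bind timeout is retried; any other exception type
            # is not caught here and propagates immediately.
            if attempt == attempts - 1:
                raise  # last attempt: surface the genuine breakage


# Usage: one timeout followed by a successful start.
control = FakePulsarControl(timeouts_before_success=1)
start_fresh_with_retry(control)
```

Catching only TimeoutError keeps the retry narrow: an error of any other type raised during startup is never swallowed, and a persistent bind failure still fails the fixture after the last attempt.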

No new test dependency (avoids pulling in pytest-rerunfailures); the fixture retries
silently to match the suite's existing no-logging style.

Testing

  • py_compile and flake8 (max-line 150, complexity 14) pass.
  • Not run against the live docker-compose stack.

🤖 Generated with Claude Code

…ry fresh start

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>