Skip to content

EW-10817 Add HibernationManager unit tests#6732

Open
harrishancock wants to merge 1 commit intomainfrom
harris/EW-10817-hibernation-manager-tests
Open

EW-10817 Add HibernationManager unit tests#6732
harrishancock wants to merge 1 commit intomainfrom
harris/EW-10817-hibernation-manager-tests

Conversation

@harrishancock
Copy link
Copy Markdown
Collaborator

Adds a kj_test suite covering normal HibernationManager behavior (basic comm, close, binary, auto-response, multi-IR, multi-WS, hibernation flows, output gate interactions) along with regression tests for EW-10817 that use KJ_EXPECT_LOG to capture the production "another message send is already in progress" assertion. The regression tests pass while the bug exists and fail loudly when the fix lands. Code comments document each test's purpose, lifecycle, and known workarounds.

Supporting TestFixture additions: SetupParams::actorLoopback so the actor and the HibernationManager share one Loopback; newIoContext() and newIncomingRequest(IoContext&) for multiple IRs per IoContext; drainAndDestroy() and pollEventLoop() helpers; getActor() and getTimerChannel() accessors so tests can construct the HM outside any IoContext; resetActor() to simulate eviction by reconstructing the Worker::Actor with the same id and Loopback.

Two workarounds for known bugs are noted at their use sites: a leaked api::WebSocket ref to dodge the AsyncObject destructor issue, and an explicit end1->receive() drain in some EW-10817 repros to consume the orphaned BlockedSend at teardown. Both go away once EW-10817 is fixed.

Assisted-by: OpenCode:claude-opus-4.7

@harrishancock harrishancock requested review from a team as code owners May 6, 2026 10:43
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented May 6, 2026

That's correct. The alloc<T>(args...) passes env.js plus the websocket as constructor arguments. The WebSocket constructor takes (jsg::Lock&, kj::Own<kj::WebSocket>). This is the standard pattern.

After careful review, I believe this is a high-quality PR. The tests are well-structured, well-commented, and the TestFixture additions are properly designed. The intentional leaks and workarounds are clearly documented with TODO references to the bug being tracked. The BUILD.bazel changes are correct.

The only potential issue I see is minor: the enterContext method constructs an Environment without initializing the features field, but this is a pre-existing pattern (also present in runInIoContext) and the HM tests don't use features, so it's not actionable for this PR.

LGTM

github run

@harrishancock harrishancock force-pushed the harris/EW-10817-hibernation-manager-tests branch from a8642ec to a07592b Compare May 6, 2026 11:23
Adds a kj_test suite covering normal HibernationManager behavior (basic
comm, close, binary, auto-response, multi-IR, multi-WS, hibernation flows,
output gate interactions) along with regression tests for EW-10817 that use
KJ_EXPECT_LOG to capture the production "another message send is already in
progress" assertion. The regression tests pass while the bug exists and
fail loudly when the fix lands. Code comments document each test's purpose,
lifecycle, and known workarounds.

Supporting TestFixture additions: SetupParams::actorLoopback so the actor
and the HibernationManager share one Loopback; newIoContext() and
newIncomingRequest(IoContext&) for multiple IRs per IoContext;
drainAndDestroy() and pollEventLoop() helpers; getActor() and
getTimerChannel() accessors so tests can construct the HM outside any
IoContext; resetActor() to simulate eviction by reconstructing the
Worker::Actor with the same id and Loopback.

Two workarounds for known bugs are noted at their use sites: a leaked
api::WebSocket ref to dodge the AsyncObject destructor issue, and an
explicit end1->receive() drain in some EW-10817 repros to consume the
orphaned BlockedSend at teardown. Both go away once EW-10817 is fixed.

Assisted-by: OpenCode:claude-opus-4.7
@harrishancock harrishancock force-pushed the harris/EW-10817-hibernation-manager-tests branch from a07592b to fd5b913 Compare May 6, 2026 14:20
// Promise — the Worker::Lock is only valid for the synchronous duration of the
// callback). The context is NOT destroyed afterwards — the caller still owns the
// IncomingRequest.
template <typename CallBack>
Copy link
Copy Markdown
Contributor

@jclee jclee May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "CallBack" is consistent with the pre-existing template var for "runInIoContext()" above, but I'd kind of expect it to be "Callback" instead (here and elsewhere in the file), since it's referring to the single-word noun, not the verb.

Comment on lines +162 to +165
// Drain an IncomingRequest (waiting on its waitUntil tasks) and then destroy it. Tests should
// use this rather than letting the IncomingRequest's Own go out of scope, otherwise the
// IncomingRequest destructor logs a warning about un-drained waitUntil tasks. Production code
// paths always drain.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's important to drain, I suppose it might be possible to just have the fixture track the outstanding requests and drain them automatically in its destructor.

But having the explicit drain is fine, too... and maybe offers finer-grained control, or is needed for specific use cases?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'm seeing the tests are using this to selectively drain during the test rather than just at the end... and in some of the repros, selectively not draining the request, presumably because it's unnecessary in those cases.

Comment on lines +161 to +166
auto* leaked = new jsg::Ref<api::WebSocket>(apiWs.addRef());
#if KJ_HAS_COMPILER_FEATURE(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
__lsan_ignore_object(leaked);
#else
(void)leaked;
#endif
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh... I didn't actually realize we had leak detection actively running in asan tests.

Comment on lines +280 to +281
KJ_ASSERT(stats.customEventCalls >= 1, "expected at least one customEvent dispatch",
stats.customEventCalls);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was curious... Is there a reason we aren't asserting for exactly one event? I suppose part of it is that it would be difficult to introspect the custom event type?

Comment on lines +588 to +593
// The KJ_EXPECT_LOG below captures the bug's symptom (the assertion's ERROR log) so the
// test passes while EW-10817 is open. When the bug is fixed, the log won't fire and the
// KJ_EXPECT_LOG will fail — that's the signal to update this test (delete the EXPECT_LOG,
// delete the workaround end1->receive() drain at the end, and add positive assertions about
// what should happen instead).
KJ_EXPECT_LOG(ERROR, "another message send is already in progress");
Copy link
Copy Markdown
Contributor

@jclee jclee May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing the log is actually emitted either during the fixture.enterContext() or fixture.pollEventLoop() calls? In which case, might not hurt to group the triggering call in a block with the log expectation. But not a big deal, especially considering it's temporary.

Oh... Now I'm seeing this is probably to be consistent with the other tests having their log expectations upfront. That's fine, too.

Copy link
Copy Markdown
Contributor

@jclee jclee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good...

It's great to have unit test coverage for this, thanks! I guess we'd been relying mostly on integration tests for coverage, but of course that makes it very hard to test precise event orderings like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants