Skip to content

Fix pause/resume logic in FullRollingRestartIT#22366

Open
MdTanwer wants to merge 1 commit into
opensearch-project:mainfrom
MdTanwer:fix/full-rolling-restart-pause-resume-20064
Open

Fix pause/resume logic in FullRollingRestartIT#22366
MdTanwer wants to merge 1 commit into
opensearch-project:mainfrom
MdTanwer:fix/full-rolling-restart-pause-resume-20064

Conversation

@MdTanwer

@MdTanwer MdTanwer commented Jul 1, 2026

Copy link
Copy Markdown

Summary

Fixes #20064

  • Fixes the logical bug in testDerivedSourceWithConcurrentUpdatesRollingRestart where a single stop flag permanently terminated the background update thread on the first node restart
  • Introduces separate shutdown and paused flags so updates can pause during each node restart and resume afterward.
  • Replaces Thread.sleep with assertBusy to wait for in-flight updates to complete, and removes the @SuppressForbidden annotation.

@MdTanwer MdTanwer requested a review from a team as a code owner July 1, 2026 05:59
@github-actions github-actions Bot added bug Something isn't working Indexing Indexing, Bulk Indexing and anything related to indexing labels Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit 18742bb)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Race Condition

The inFlight counter is incremented inside the try block after nextDocId.getAndIncrement(), but paused is checked before that. After the pause check passes, the thread may increment inFlight and start an update while the main thread simultaneously sets paused=true and calls assertBusy(() -> assertEquals(0, inFlight.get())). This works because assertBusy polls, but if a new update starts (nextDocId increment) between the pause flag being set and inFlight being observed as 0, the check may pass while an update is still about to launch. Consider checking paused inside the same synchronized block that increments inFlight, or move the inFlight.incrementAndGet() before entering the try to ensure atomicity relative to the pause signal.

while (shutdown.get() == false) {
    synchronized (pauseLock) {
        while (paused.get() && shutdown.get() == false) {
            try {
                pauseLock.wait(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
    if (shutdown.get()) {
        break;
    }
    try {
        int docId = nextDocId.getAndIncrement() % docCount;
        inFlight.incrementAndGet();
        try {
            client().prepareUpdate("test", String.valueOf(docId))
                .setRetryOnConflict(3)
                .setDoc("counter", successfulUpdates.get() + 1, "last_updated", System.currentTimeMillis(), "version", 1)
                .execute()
                .actionGet(TimeValue.timeValueSeconds(5));
            successfulUpdates.incrementAndGet();
            initialUpdatesLatch.countDown();
        } finally {
            inFlight.decrementAndGet();
            synchronized (pauseLock) {
                pauseLock.notifyAll();
            }
        }
    } catch (Exception e) {
        logger.warn("Error in background update thread", e);
    }
}

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to 18742bb
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix race between pause and in-flight check

There is a race: the update thread may have passed the paused check but not yet
incremented inFlight when assertBusy observes zero, allowing an update during
restart. Consider having the update thread increment inFlight before/inside the
synchronized pause check, or use a lock/semaphore that atomically couples the pause
state with in-flight tracking.

server/src/internalClusterTest/java/org/opensearch/recovery/FullRollingRestartIT.java [668-670]

 // Pause updates temporarily during node restart
-paused.set(true);
-assertBusy(() -> assertEquals(0, inFlight.get()));
+synchronized (pauseLock) {
+    paused.set(true);
+}
+assertBusy(() -> assertEquals(0, inFlight.get()), 30, TimeUnit.SECONDS);
Suggestion importance[1-10]: 5

__

Why: Identifies a legitimate potential race condition between setting paused and reading inFlight, though the proposed fix (wrapping paused.set(true) in the synchronized block) does not fully close the race window.

Low
Ensure in-flight counter always decrements

The inFlight counter is incremented after the pause check but the exception handler
wraps both the increment and the update call. If an exception is thrown between
inFlight.incrementAndGet() and entering the inner try block, the counter would still
be incremented but never decremented. Move inFlight.incrementAndGet() inside the
inner try block, or ensure the decrement always runs.

server/src/internalClusterTest/java/org/opensearch/recovery/FullRollingRestartIT.java [627-647]

 try {
     int docId = nextDocId.getAndIncrement() % docCount;
     inFlight.incrementAndGet();
     try {
         client().prepareUpdate("test", String.valueOf(docId))
-...
-    } catch (Exception e) {
-        logger.warn("Error in background update thread", e);
+            .setRetryOnConflict(3)
+            .setDoc("counter", successfulUpdates.get() + 1, "last_updated", System.currentTimeMillis(), "version", 1)
+            .execute()
+            .actionGet(TimeValue.timeValueSeconds(5));
+        successfulUpdates.incrementAndGet();
+        initialUpdatesLatch.countDown();
+    } finally {
+        inFlight.decrementAndGet();
+        synchronized (pauseLock) {
+            pauseLock.notifyAll();
+        }
     }
+} catch (Exception e) {
+    logger.warn("Error in background update thread", e);
 }
Suggestion importance[1-10]: 2

__

Why: The code already places inFlight.incrementAndGet() immediately before the inner try/finally, and no operations between them are likely to throw. The suggested improved_code is essentially identical to the existing code, offering no real change.

Low

Previous suggestions

Suggestions up to commit 21cef01
CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix race between pause and in-flight increment

The inFlight counter is incremented inside the inner try but decremented in its
finally, however if an exception is thrown between incrementAndGet() and entering
the inner try block, the counter would still be decremented correctly. However, if
the update is submitted while paused was just set to true, inFlight could still be
incremented after the pause check, causing assertBusy to wait. More critically, the
increment should occur before the pause check to ensure the pause loop observes it,
or the pause check should be inside a lock with the increment to avoid a race where
an update starts just as pause is being set.

server/src/internalClusterTest/java/org/opensearch/recovery/FullRollingRestartIT.java [627-646]

 try {
     int docId = nextDocId.getAndIncrement() % docCount;
-    inFlight.incrementAndGet();
+    synchronized (pauseLock) {
+        if (paused.get() || shutdown.get()) {
+            continue;
+        }
+        inFlight.incrementAndGet();
+    }
     try {
         client().prepareUpdate("test", String.valueOf(docId))
             .setRetryOnConflict(3)
             .setDoc("counter", successfulUpdates.get() + 1, "last_updated", System.currentTimeMillis(), "version", 1)
             .execute()
             .actionGet(TimeValue.timeValueSeconds(5));
         successfulUpdates.incrementAndGet();
         initialUpdatesLatch.countDown();
     } finally {
         inFlight.decrementAndGet();
         synchronized (pauseLock) {
             pauseLock.notifyAll();
         }
     }
 } catch (Exception e) {
     logger.warn("Error in background update thread", e);
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion identifies a valid race condition where inFlight could be incremented after paused is set to true, potentially causing assertBusy to wait or fail. Synchronizing the pause check with the increment improves correctness, though the impact is moderate for a test.

Low
General
Provide explicit timeout for pause drain

assertBusy has a default timeout that may be too short under load; also, if inFlight
never reaches 0 the restart proceeds with concurrent updates in flight. Consider
providing an explicit timeout and ensure the pause check ordering guarantees no new
work is started after paused.set(true).

server/src/internalClusterTest/java/org/opensearch/recovery/FullRollingRestartIT.java [668-670]

 // Pause updates temporarily during node restart
 paused.set(true);
-assertBusy(() -> assertEquals(0, inFlight.get()));
+assertBusy(() -> assertEquals(0, inFlight.get()), 30, TimeUnit.SECONDS);
Suggestion importance[1-10]: 5

__

Why: Providing an explicit timeout for assertBusy makes the test more robust under load, avoiding potential flaky failures with the default timeout. It's a reasonable minor improvement for test reliability.

Low

Separate shutdown from pause so background updates resume after each
node restart, and replace Thread.sleep with assertBusy for in-flight
work. Resolves opensearch-project#20064.

Signed-off-by: MdTanwer <tanw9004167@gmail.com>
@MdTanwer MdTanwer force-pushed the fix/full-rolling-restart-pause-resume-20064 branch from 21cef01 to 18742bb Compare July 1, 2026 06:02
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 18742bb

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

❕ Gradle check result for 18742bb: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.43%. Comparing base (95ece9b) to head (18742bb).

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #22366      +/-   ##
============================================
- Coverage     73.44%   73.43%   -0.02%     
+ Complexity    76114    76111       -3     
============================================
  Files          6076     6076              
  Lines        345508   345508              
  Branches      49732    49732              
============================================
- Hits         253773   253708      -65     
- Misses        71497    71606     +109     
+ Partials      20238    20194      -44     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Indexing Indexing, Bulk Indexing and anything related to indexing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Logical bug in FullRollingRestartIT#testDerivedSourceWithConcurrentUpdatesRollingRestart

1 participant