Skip to content

Bound remote translog purge for small nodes#22387

Open
MdTanwer wants to merge 1 commit into
opensearch-project:mainfrom
MdTanwer:fix/remote-translog-purge-batch-20138
Open

Bound remote translog purge for small nodes#22387
MdTanwer wants to merge 1 commit into
opensearch-project:mainfrom
MdTanwer:fix/remote-translog-purge-batch-20138

Conversation

@MdTanwer

@MdTanwer MdTanwer commented Jul 4, 2026

Copy link
Copy Markdown

Summary

Fix #20138

  • Batch remote translog metadata and data file deletions instead of listing/deleting unbounded file lists on remote_purge, addressing heap and queue buildup on small-core nodes .
  • Add cluster settings cluster.remote_store.translog.purge_batch_size (default 500) and cluster.remote_store.translog.purge_max_batches_per_cycle (default 2), with adaptive throttling when a large backlog is detected.
  • Coalesce overlapping remote purge work per shard via single-flight scheduling in RemoteFsTranslog / RemoteFsTimestampAwareTranslog.

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit f23e34b)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Double permit release / cycle completion

In trimUnreferencedReaders, deleteRemoteGeneration is called with onRemotePurgeComplete (which releases 1 permit and calls completeRemotePurgeCycle), and then immediately followed by translogTransferManager.deleteStaleTranslogMetadataFilesAsync(onRemotePurgeComplete). The same onRemotePurgeComplete runnable is passed to both async operations, so it will be executed twice, releasing 2 permits total (correct for the 2-permit acquisition) but also invoking completeRemotePurgeCycle twice. More critically, if the first callback fires and completes the cycle before the second async operation finishes, remotePurgeInProgress gets reset while the metadata-delete operation is still in-flight, allowing a concurrent purge to be scheduled and violating the single-flight guarantee that this PR aims to introduce.

if (generationsToDelete.isEmpty() == false) {
    try {
        deleteRemoteGeneration(generationsToDelete, onRemotePurgeComplete);
    } catch (Exception e) {
        logger.error("Exception in delete generations flow", e);
        remoteGenerationDeletionPermits.release(REMOTE_DELETION_PERMITS);
        completeRemotePurgeCycle();
        return;
    }
    translogTransferManager.deleteStaleTranslogMetadataFilesAsync(onRemotePurgeComplete);
    deleteStaleRemotePrimaryTerms();
} else {
    remoteGenerationDeletionPermits.release(REMOTE_DELETION_PERMITS);
    completeRemotePurgeCycle();
}
Possible Issue

deleteGenerationAsync catches an exception, calls onCompletion.run(), then rethrows. If deleteGenerationAsyncBatched throws synchronously after already scheduling one or more asynchronous batch deletions, the completion callback may run both from the catch block and (later) from an in-flight batch, causing double invocation. Additionally, the caller in RemoteFsTranslog.trimUnreferencedReaders catches the rethrown exception and again calls completeRemotePurgeCycle/releases permits, compounding the double-completion issue.

public void deleteGenerationAsync(long primaryTerm, Set<Long> generations, Runnable onCompletion) {
    try {
        if (generations.isEmpty()) {
            onCompletion.run();
            return;
        }
        List<Long> generationList = new ArrayList<>(generations);
        deleteGenerationAsyncBatched(primaryTerm, generationList, 0, onCompletion);
    } catch (Exception e) {
        onCompletion.run();
        throw e;
    }
}
Adaptive throttling scope

getEffectiveMaxBatchesPerCycle is called with metadataFiles.size() == translogTransferManager.getTranslogPurgeListLimit() as the backlog indicator, but metadataFilesToBeDeleted may be a small subset of that list after applying pinned-timestamp filtering. This could trigger throttling to 1 batch/cycle even when the actual work to be done is trivial, unnecessarily slowing purge progress.

int maxFilesToDelete = Math.min(
    metadataFilesToBeDeleted.size(),
    translogTransferManager.getTranslogPurgeBatchSize() * translogTransferManager
        .getEffectiveMaxBatchesPerCycle(
            metadataFiles.size() == translogTransferManager.getTranslogPurgeListLimit(),
            metadataFiles.size()
        )
);
Threadpool name assumption

completeRemotePurgeCycle submits the pending purge to ThreadPool.Names.REMOTE_PURGE. If this executor name is not registered on the node (or is misspelled), threadPool.executor(...) will throw IllegalArgumentException, silently dropping the pending purge and leaving remotePurgePending=false with no follow-up. Verify REMOTE_PURGE is a known executor here and consider defensive handling.

if (remotePurgePending.getAndSet(false)) {
    threadPool.executor(ThreadPool.Names.REMOTE_PURGE).execute(() -> {
        try {
            trimUnreferencedReaders(false);
        } catch (IOException e) {
            logger.error("Exception while continuing remote translog purge", e);
        }
    });
}

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to f23e34b
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Avoid double-releasing permits on exception

onRemotePurgeComplete releases only one permit, but deleteGenerationAsync internally
invokes its onCompletion once at the end of all batches. The subsequent metadata
delete branch below also uses onRemotePurgeComplete (releasing one more permit), so
total 2 permits are released—OK. However, on exception here
remoteGenerationDeletionPermits.release(REMOTE_DELETION_PERMITS) double-releases
because onRemotePurgeComplete may already have been invoked by
deleteGenerationAsync's catch block (which calls onCompletion.run() before
throwing). This leads to over-releasing permits. Release only the remaining
permit(s) or make onCompletion idempotent.

server/src/main/java/org/opensearch/index/translog/RemoteFsTimestampAwareTranslog.java [248-263]

 if (generationsToBeDeleted.isEmpty() == false) {
     // Delete stale generations
     try {
         translogTransferManager.deleteGenerationAsync(
             primaryTermSupplier.getAsLong(),
             generationsToBeDeleted,
             onRemotePurgeComplete
         );
     } catch (Exception e) {
         logger.error("Exception in delete generations flow", e);
-        remoteGenerationDeletionPermits.release(REMOTE_DELETION_PERMITS);
+        // onRemotePurgeComplete has already released one permit via deleteGenerationAsync's catch
+        remoteGenerationDeletionPermits.release();
         if (indexDeleted == false) {
             completeRemotePurgeCycle();
         }
         return;
     }
 } else {
     remoteGenerationDeletionPermits.release();
 }
Suggestion importance[1-10]: 7

__

Why: Since deleteGenerationAsync in the catch block calls onCompletion.run() before rethrowing, the caller's own catch releasing REMOTE_DELETION_PERMITS on top can indeed over-release permits. This is a legitimate concern worth addressing.

Medium
Fix race between pending flag and reset

There is a race condition: if a pending purge is signalled between the permits check
and the remotePurgeInProgress flag being reset, the pending flag may be missed.
Additionally, resetting remotePurgeInProgress before consuming remotePurgePending
means a concurrent caller may set remotePurgePending=true but never trigger
re-execution. Check and consume remotePurgePending first, and only reset
remotePurgeInProgress if there is no pending work; otherwise re-run within the same
in-progress window.

server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [663-677]

 protected void completeRemotePurgeCycle() {
     if (remoteGenerationDeletionPermits.availablePermits() != REMOTE_DELETION_PERMITS) {
         return;
     }
-    completeRemotePurgeScheduling();
     if (remotePurgePending.getAndSet(false)) {
         threadPool.executor(ThreadPool.Names.REMOTE_PURGE).execute(() -> {
             try {
                 trimUnreferencedReaders(false);
             } catch (IOException e) {
                 logger.error("Exception while continuing remote translog purge", e);
+            } finally {
+                completeRemotePurgeScheduling();
             }
         });
+    } else {
+        completeRemotePurgeScheduling();
     }
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion identifies a plausible race condition where a concurrent caller could set remotePurgePending after the check but before remotePurgeInProgress is reset. However, the analysis is partially speculative and the proposed fix has its own concerns around ordering with completeRemotePurgeScheduling.

Low
Prevent double invocation of onCompletion

If deleteGenerationAsyncBatched throws synchronously after having already scheduled
some async batches whose continuations will also eventually call onCompletion, the
catch block here will invoke onCompletion again. Since onCompletion releases a
permit and drives the purge cycle state machine, this can double-invoke and corrupt
permit counts. Ensure onCompletion is only invoked once—either by not calling it in
the catch when batching has started, or by making the runnable idempotent.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [530-542]

 public void deleteGenerationAsync(long primaryTerm, Set<Long> generations, Runnable onCompletion) {
+    if (generations.isEmpty()) {
+        onCompletion.run();
+        return;
+    }
+    List<Long> generationList = new ArrayList<>(generations);
     try {
-        if (generations.isEmpty()) {
-            onCompletion.run();
-            return;
-        }
-        List<Long> generationList = new ArrayList<>(generations);
         deleteGenerationAsyncBatched(primaryTerm, generationList, 0, onCompletion);
     } catch (Exception e) {
-        onCompletion.run();
+        // onCompletion may already have been chained; do not call again to avoid double-release
         throw e;
     }
 }
Suggestion importance[1-10]: 5

__

Why: Valid observation that onCompletion could be invoked more than once if a batched call had already scheduled async work before throwing, but such synchronous exceptions after scheduling are unlikely in practice.

Low
General
Fix unreachable throttling condition

The condition listedMetadataFiles > getTranslogPurgeBatchSize() * 10 can never be
true because listedMetadataFiles is bounded above by getTranslogPurgeListLimit() =
batchSize * maxBatchesPerCycle + 1, and maxBatchesPerCycle is capped at 100 but
typical default is 2. More importantly, the list is truncated at the list limit so
this backlog signal is redundant with backlogRemaining. Use backlogRemaining alone
(or compare against batchSize * maxBatchesPerCycle) so the throttling actually
triggers.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [767-773]

 public int getEffectiveMaxBatchesPerCycle(boolean backlogRemaining, int listedMetadataFiles) {
     int maxBatchesPerCycle = getTranslogPurgeMaxBatchesPerCycle();
-    if (backlogRemaining && listedMetadataFiles > getTranslogPurgeBatchSize() * 10) {
+    if (backlogRemaining) {
         return 1;
     }
     return maxBatchesPerCycle;
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly notes that the listedMetadataFiles > batchSize * 10 check may be redundant or unreachable given the list limit constraints, making the backlog throttle less effective than intended.

Low

Previous suggestions

Suggestions up to commit bc9ff3f
CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix race in purge completion signaling

There is a race between completeRemotePurgeScheduling() and the check of
remotePurgePending. If another thread sets remotePurgePending=true after
completeRemotePurgeScheduling() but before getAndSet(false), and another purge
attempt happens between those, the pending signal could be lost or double-scheduled.
Reorder to check pending before clearing in-progress, or re-check pending after
clearing to guarantee no scheduled purge is dropped.

server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [663-677]

 protected void completeRemotePurgeCycle() {
     if (remoteGenerationDeletionPermits.availablePermits() != REMOTE_DELETION_PERMITS) {
         return;
     }
+    boolean pending = remotePurgePending.getAndSet(false);
     completeRemotePurgeScheduling();
-    if (remotePurgePending.getAndSet(false)) {
+    // Re-check to avoid losing a signal set between the two operations above
+    if (pending || remotePurgePending.getAndSet(false)) {
         threadPool.executor(ThreadPool.Names.REMOTE_PURGE).execute(() -> {
             try {
                 trimUnreferencedReaders(false);
             } catch (IOException e) {
                 logger.error("Exception while continuing remote translog purge", e);
             }
         });
     }
 }
Suggestion importance[1-10]: 7

__

Why: Identifies a plausible race condition in the purge completion signaling where a pending signal could be lost between completeRemotePurgeScheduling() and remotePurgePending.getAndSet(false). The reordering improves correctness of the concurrency logic.

Medium
Preserve pending signal on permit failure

When tryAcquire fails, you call completeRemotePurgeScheduling() (clearing
remotePurgeInProgress) but do not check remotePurgePending. If a concurrent
invocation set pending=true expecting the in-progress worker to reschedule, that
signal is now lost because permits were never acquired and completeRemotePurgeCycle
won't be called. Route through completeRemotePurgeCycle (or explicitly re-check
pending) to preserve the pending signal.

server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [608-618]

 if (tryScheduleRemotePurge() == false) {
     return;
 }
 
 // Since remote generation deletion is async, this ensures that only one generation deletion happens at a time.
 // Remote generations involves 2 async operations - 1) Delete translog generation files 2) Delete metadata files
 // We try to acquire 2 permits and if we can not, we return from here itself.
 if (remoteGenerationDeletionPermits.tryAcquire(REMOTE_DELETION_PERMITS) == false) {
-    completeRemotePurgeScheduling();
+    completeRemotePurgeCycle();
     return;
 }
Suggestion importance[1-10]: 6

__

Why: Valid observation that when tryAcquire fails, calling completeRemotePurgeScheduling() directly may lose a concurrently set pending signal. Routing through completeRemotePurgeCycle would preserve it, but permits check would immediately fail so the fix needs care.

Low
Guarantee forward progress on purge

maxFilesToDelete can be 0 when getEffectiveMaxBatchesPerCycle returns a small value
combined with a small batch size, but metadataFilesToBeDeleted is non-empty. In that
edge case subList(0, 0) is passed to deleteMetadataFilesInBatches which immediately
runs onCompletion, but you also skip updating metadataFilesNotToBeDeleted cache and
deleteStaleRemotePrimaryTerms since flow proceeds. Ensure maxFilesToDelete >= 1 when
there are files to delete to guarantee forward progress.

server/src/main/java/org/opensearch/index/translog/RemoteFsTimestampAwareTranslog.java [268-280]

 if (metadataFilesToBeDeleted.isEmpty() == false) {
     // Delete stale metadata files
     try {
-        int maxFilesToDelete = Math.min(
+        int maxFilesToDelete = Math.max(1, Math.min(
             metadataFilesToBeDeleted.size(),
             translogTransferManager.getTranslogPurgeBatchSize() * translogTransferManager
                 .getEffectiveMaxBatchesPerCycle(
                     metadataFiles.size() == translogTransferManager.getTranslogPurgeListLimit(),
                     metadataFiles.size()
                 )
-        );
+        ));
         List<String> metadataFilesThisCycle = metadataFilesToBeDeleted.subList(0, maxFilesToDelete);
         translogTransferManager.deleteMetadataFilesInBatches(metadataFilesThisCycle, onRemotePurgeComplete);
Suggestion importance[1-10]: 4

__

Why: The concern about maxFilesToDelete being 0 is theoretical since getEffectiveMaxBatchesPerCycle has a minimum of 1 and batch size has a minimum of 1, so the product is always at least 1. The suggestion is a defensive improvement but not addressing a real bug.

Low
General
Ensure deterministic batched deletion order

Since generations is a Set, the iteration order of new ArrayList<>(generations) is
undefined (e.g., HashSet). Previously all files were deleted in a single batch so
order didn't matter, but batching now means specific generations get deleted first
per cycle. Sort the generations (e.g., descending or ascending) to ensure
deterministic and predictable deletion progress across cycles.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [530-542]

 public void deleteGenerationAsync(long primaryTerm, Set<Long> generations, Runnable onCompletion) {
     try {
         if (generations.isEmpty()) {
             onCompletion.run();
             return;
         }
         List<Long> generationList = new ArrayList<>(generations);
+        Collections.sort(generationList);
         deleteGenerationAsyncBatched(primaryTerm, generationList, 0, onCompletion);
     } catch (Exception e) {
         onCompletion.run();
         throw e;
     }
 }
Suggestion importance[1-10]: 4

__

Why: Sorting improves determinism and predictability of deletion order across cycles, but functional correctness is not impacted since all generations will eventually be deleted. Minor improvement for observability and debugging.

Low

@github-actions github-actions Bot added bug Something isn't working Indexing Indexing, Bulk Indexing and anything related to indexing Storage:Remote labels Jul 4, 2026
Batch remote translog metadata and data file deletions to cap
memory and remote_purge queue growth on low-core nodes. Add purge
batch cluster settings, single-flight coalescing per shard, and
unit tests for batched delete paths.

Fixes opensearch-project#20138

Signed-off-by: MdTanwer <tanw9004167@gmail.com>
@MdTanwer MdTanwer force-pushed the fix/remote-translog-purge-batch-20138 branch from bc9ff3f to f23e34b Compare July 4, 2026 10:35
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit f23e34b

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

❌ Gradle check result for f23e34b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Indexing Indexing, Bulk Indexing and anything related to indexing Storage:Remote

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[BUG] Remote purge can significantly slow down on smaller instances causing heap to build up

1 participant