Bound remote translog purge for small nodes by MdTanwer · Pull Request #22387 · opensearch-project/OpenSearch

MdTanwer · 2026-07-04T10:29:33Z

Summary

Fix #20138

Batch remote translog metadata and data file deletions instead of listing/deleting unbounded file lists on remote_purge, addressing heap and queue buildup on small-core nodes .
Add cluster settings cluster.remote_store.translog.purge_batch_size (default 500) and cluster.remote_store.translog.purge_max_batches_per_cycle (default 2), with adaptive throttling when a large backlog is detected.
Coalesce overlapping remote purge work per shard via single-flight scheduling in RemoteFsTranslog / RemoteFsTimestampAwareTranslog.

github-actions · 2026-07-04T10:30:49Z

PR Reviewer Guide 🔍

(Review updated until commit `f23e34b`)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Double permit release / cycle completion In `trimUnreferencedReaders`, `deleteRemoteGeneration` is called with `onRemotePurgeComplete` (which releases 1 permit and calls `completeRemotePurgeCycle`), and then immediately followed by `translogTransferManager.deleteStaleTranslogMetadataFilesAsync(onRemotePurgeComplete)`. The same `onRemotePurgeComplete` runnable is passed to both async operations, so it will be executed twice, releasing 2 permits total (correct for the 2-permit acquisition) but also invoking `completeRemotePurgeCycle` twice. More critically, if the first callback fires and completes the cycle before the second async operation finishes, `remotePurgeInProgress` gets reset while the metadata-delete operation is still in-flight, allowing a concurrent purge to be scheduled and violating the single-flight guarantee that this PR aims to introduce. if (generationsToDelete.isEmpty() == false) { try { deleteRemoteGeneration(generationsToDelete, onRemotePurgeComplete); } catch (Exception e) { logger.error("Exception in delete generations flow", e); remoteGenerationDeletionPermits.release(REMOTE_DELETION_PERMITS); completeRemotePurgeCycle(); return; } translogTransferManager.deleteStaleTranslogMetadataFilesAsync(onRemotePurgeComplete); deleteStaleRemotePrimaryTerms(); } else { remoteGenerationDeletionPermits.release(REMOTE_DELETION_PERMITS); completeRemotePurgeCycle(); } Possible Issue `deleteGenerationAsync` catches an exception, calls `onCompletion.run()`, then rethrows. If `deleteGenerationAsyncBatched` throws synchronously after already scheduling one or more asynchronous batch deletions, the completion callback may run both from the catch block and (later) from an in-flight batch, causing double invocation. Additionally, the caller in `RemoteFsTranslog.trimUnreferencedReaders` catches the rethrown exception and again calls `completeRemotePurgeCycle`/releases permits, compounding the double-completion issue. public void deleteGenerationAsync(long primaryTerm, Set<Long> generations, Runnable onCompletion) { try { if (generations.isEmpty()) { onCompletion.run(); return; } List<Long> generationList = new ArrayList<>(generations); deleteGenerationAsyncBatched(primaryTerm, generationList, 0, onCompletion); } catch (Exception e) { onCompletion.run(); throw e; } } Adaptive throttling scope `getEffectiveMaxBatchesPerCycle` is called with `metadataFiles.size() == translogTransferManager.getTranslogPurgeListLimit()` as the backlog indicator, but `metadataFilesToBeDeleted` may be a small subset of that list after applying pinned-timestamp filtering. This could trigger throttling to 1 batch/cycle even when the actual work to be done is trivial, unnecessarily slowing purge progress. int maxFilesToDelete = Math.min( metadataFilesToBeDeleted.size(), translogTransferManager.getTranslogPurgeBatchSize() * translogTransferManager .getEffectiveMaxBatchesPerCycle( metadataFiles.size() == translogTransferManager.getTranslogPurgeListLimit(), metadataFiles.size() ) ); Threadpool name assumption `completeRemotePurgeCycle` submits the pending purge to `ThreadPool.Names.REMOTE_PURGE`. If this executor name is not registered on the node (or is misspelled), `threadPool.executor(...)` will throw `IllegalArgumentException`, silently dropping the pending purge and leaving `remotePurgePending=false` with no follow-up. Verify `REMOTE_PURGE` is a known executor here and consider defensive handling. if (remotePurgePending.getAndSet(false)) { threadPool.executor(ThreadPool.Names.REMOTE_PURGE).execute(() -> { try { trimUnreferencedReaders(false); } catch (IOException e) { logger.error("Exception while continuing remote translog purge", e); } }); }

github-actions · 2026-07-04T10:31:27Z

PR Code Suggestions ✨

Latest suggestions up to f23e34b
Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Avoid double-releasing permits on exception `onRemotePurgeComplete` releases only one permit, but `deleteGenerationAsync` internally invokes its onCompletion once at the end of all batches. The subsequent metadata delete branch below also uses `onRemotePurgeComplete` (releasing one more permit), so total 2 permits are released—OK. However, on exception here `remoteGenerationDeletionPermits.release(REMOTE_DELETION_PERMITS)` double-releases because `onRemotePurgeComplete` may already have been invoked by `deleteGenerationAsync`'s catch block (which calls `onCompletion.run()` before throwing). This leads to over-releasing permits. Release only the remaining permit(s) or make `onCompletion` idempotent. server/src/main/java/org/opensearch/index/translog/RemoteFsTimestampAwareTranslog.java [248-263] if (generationsToBeDeleted.isEmpty() == false) { // Delete stale generations try { translogTransferManager.deleteGenerationAsync( primaryTermSupplier.getAsLong(), generationsToBeDeleted, onRemotePurgeComplete ); } catch (Exception e) { logger.error("Exception in delete generations flow", e); - remoteGenerationDeletionPermits.release(REMOTE_DELETION_PERMITS); + // onRemotePurgeComplete has already released one permit via deleteGenerationAsync's catch + remoteGenerationDeletionPermits.release(); if (indexDeleted == false) { completeRemotePurgeCycle(); } return; } } else { remoteGenerationDeletionPermits.release(); } Suggestion importance[1-10]: 7 __ Why: Since `deleteGenerationAsync` in the catch block calls `onCompletion.run()` before rethrowing, the caller's own catch releasing `REMOTE_DELETION_PERMITS` on top can indeed over-release permits. This is a legitimate concern worth addressing.	Medium
	Fix race between pending flag and reset There is a race condition: if a pending purge is signalled between the permits check and the `remotePurgeInProgress` flag being reset, the pending flag may be missed. Additionally, resetting `remotePurgeInProgress` before consuming `remotePurgePending` means a concurrent caller may set `remotePurgePending=true` but never trigger re-execution. Check and consume `remotePurgePending` first, and only reset `remotePurgeInProgress` if there is no pending work; otherwise re-run within the same in-progress window. server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [663-677] protected void completeRemotePurgeCycle() { if (remoteGenerationDeletionPermits.availablePermits() != REMOTE_DELETION_PERMITS) { return; } - completeRemotePurgeScheduling(); if (remotePurgePending.getAndSet(false)) { threadPool.executor(ThreadPool.Names.REMOTE_PURGE).execute(() -> { try { trimUnreferencedReaders(false); } catch (IOException e) { logger.error("Exception while continuing remote translog purge", e); + } finally { + completeRemotePurgeScheduling(); } }); + } else { + completeRemotePurgeScheduling(); } } Suggestion importance[1-10]: 6 __ Why: The suggestion identifies a plausible race condition where a concurrent caller could set `remotePurgePending` after the check but before `remotePurgeInProgress` is reset. However, the analysis is partially speculative and the proposed fix has its own concerns around ordering with `completeRemotePurgeScheduling`.	Low
	Prevent double invocation of onCompletion If `deleteGenerationAsyncBatched` throws synchronously after having already scheduled some async batches whose continuations will also eventually call `onCompletion`, the catch block here will invoke `onCompletion` again. Since `onCompletion` releases a permit and drives the purge cycle state machine, this can double-invoke and corrupt permit counts. Ensure `onCompletion` is only invoked once—either by not calling it in the catch when batching has started, or by making the runnable idempotent. server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [530-542] public void deleteGenerationAsync(long primaryTerm, Set<Long> generations, Runnable onCompletion) { + if (generations.isEmpty()) { + onCompletion.run(); + return; + } + List<Long> generationList = new ArrayList<>(generations); try { - if (generations.isEmpty()) { - onCompletion.run(); - return; - } - List<Long> generationList = new ArrayList<>(generations); deleteGenerationAsyncBatched(primaryTerm, generationList, 0, onCompletion); } catch (Exception e) { - onCompletion.run(); + // onCompletion may already have been chained; do not call again to avoid double-release throw e; } } Suggestion importance[1-10]: 5 __ Why: Valid observation that `onCompletion` could be invoked more than once if a batched call had already scheduled async work before throwing, but such synchronous exceptions after scheduling are unlikely in practice.	Low
General	Fix unreachable throttling condition The condition `listedMetadataFiles > getTranslogPurgeBatchSize() * 10` can never be true because `listedMetadataFiles` is bounded above by `getTranslogPurgeListLimit()` = `batchSize * maxBatchesPerCycle + 1`, and `maxBatchesPerCycle` is capped at 100 but typical default is 2. More importantly, the list is truncated at the list limit so this backlog signal is redundant with `backlogRemaining`. Use `backlogRemaining` alone (or compare against `batchSize * maxBatchesPerCycle`) so the throttling actually triggers. server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [767-773] public int getEffectiveMaxBatchesPerCycle(boolean backlogRemaining, int listedMetadataFiles) { int maxBatchesPerCycle = getTranslogPurgeMaxBatchesPerCycle(); - if (backlogRemaining && listedMetadataFiles > getTranslogPurgeBatchSize() * 10) { + if (backlogRemaining) { return 1; } return maxBatchesPerCycle; } Suggestion importance[1-10]: 6 __ Why: The suggestion correctly notes that the `listedMetadataFiles > batchSize * 10` check may be redundant or unreachable given the list limit constraints, making the backlog throttle less effective than intended.	Low

Previous suggestions

Suggestions up to commit bc9ff3f

Category	Suggestion	Impact
Possible issue	Fix race in purge completion signaling There is a race between `completeRemotePurgeScheduling()` and the check of `remotePurgePending`. If another thread sets `remotePurgePending=true` after `completeRemotePurgeScheduling()` but before `getAndSet(false)`, and another purge attempt happens between those, the pending signal could be lost or double-scheduled. Reorder to check pending before clearing in-progress, or re-check pending after clearing to guarantee no scheduled purge is dropped. server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [663-677] protected void completeRemotePurgeCycle() { if (remoteGenerationDeletionPermits.availablePermits() != REMOTE_DELETION_PERMITS) { return; } + boolean pending = remotePurgePending.getAndSet(false); completeRemotePurgeScheduling(); - if (remotePurgePending.getAndSet(false)) { + // Re-check to avoid losing a signal set between the two operations above + if (pending \|\| remotePurgePending.getAndSet(false)) { threadPool.executor(ThreadPool.Names.REMOTE_PURGE).execute(() -> { try { trimUnreferencedReaders(false); } catch (IOException e) { logger.error("Exception while continuing remote translog purge", e); } }); } } Suggestion importance[1-10]: 7 __ Why: Identifies a plausible race condition in the purge completion signaling where a pending signal could be lost between `completeRemotePurgeScheduling()` and `remotePurgePending.getAndSet(false)`. The reordering improves correctness of the concurrency logic.	Medium
	Preserve pending signal on permit failure When `tryAcquire` fails, you call `completeRemotePurgeScheduling()` (clearing `remotePurgeInProgress`) but do not check `remotePurgePending`. If a concurrent invocation set pending=true expecting the in-progress worker to reschedule, that signal is now lost because permits were never acquired and `completeRemotePurgeCycle` won't be called. Route through `completeRemotePurgeCycle` (or explicitly re-check pending) to preserve the pending signal. server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [608-618] if (tryScheduleRemotePurge() == false) { return; } // Since remote generation deletion is async, this ensures that only one generation deletion happens at a time. // Remote generations involves 2 async operations - 1) Delete translog generation files 2) Delete metadata files // We try to acquire 2 permits and if we can not, we return from here itself. if (remoteGenerationDeletionPermits.tryAcquire(REMOTE_DELETION_PERMITS) == false) { - completeRemotePurgeScheduling(); + completeRemotePurgeCycle(); return; } Suggestion importance[1-10]: 6 __ Why: Valid observation that when `tryAcquire` fails, calling `completeRemotePurgeScheduling()` directly may lose a concurrently set pending signal. Routing through `completeRemotePurgeCycle` would preserve it, but permits check would immediately fail so the fix needs care.	Low
	Guarantee forward progress on purge `maxFilesToDelete` can be `0` when `getEffectiveMaxBatchesPerCycle` returns a small value combined with a small batch size, but `metadataFilesToBeDeleted` is non-empty. In that edge case `subList(0, 0)` is passed to `deleteMetadataFilesInBatches` which immediately runs `onCompletion`, but you also skip updating `metadataFilesNotToBeDeleted` cache and `deleteStaleRemotePrimaryTerms` since flow proceeds. Ensure `maxFilesToDelete >= 1` when there are files to delete to guarantee forward progress. server/src/main/java/org/opensearch/index/translog/RemoteFsTimestampAwareTranslog.java [268-280] if (metadataFilesToBeDeleted.isEmpty() == false) { // Delete stale metadata files try { - int maxFilesToDelete = Math.min( + int maxFilesToDelete = Math.max(1, Math.min( metadataFilesToBeDeleted.size(), translogTransferManager.getTranslogPurgeBatchSize() * translogTransferManager .getEffectiveMaxBatchesPerCycle( metadataFiles.size() == translogTransferManager.getTranslogPurgeListLimit(), metadataFiles.size() ) - ); + )); List<String> metadataFilesThisCycle = metadataFilesToBeDeleted.subList(0, maxFilesToDelete); translogTransferManager.deleteMetadataFilesInBatches(metadataFilesThisCycle, onRemotePurgeComplete); Suggestion importance[1-10]: 4 __ Why: The concern about `maxFilesToDelete` being 0 is theoretical since `getEffectiveMaxBatchesPerCycle` has a minimum of 1 and batch size has a minimum of 1, so the product is always at least 1. The suggestion is a defensive improvement but not addressing a real bug.	Low
General	Ensure deterministic batched deletion order Since `generations` is a `Set`, the iteration order of `new ArrayList<>(generations)` is undefined (e.g., `HashSet`). Previously all files were deleted in a single batch so order didn't matter, but batching now means specific generations get deleted first per cycle. Sort the generations (e.g., descending or ascending) to ensure deterministic and predictable deletion progress across cycles. server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [530-542] public void deleteGenerationAsync(long primaryTerm, Set<Long> generations, Runnable onCompletion) { try { if (generations.isEmpty()) { onCompletion.run(); return; } List<Long> generationList = new ArrayList<>(generations); + Collections.sort(generationList); deleteGenerationAsyncBatched(primaryTerm, generationList, 0, onCompletion); } catch (Exception e) { onCompletion.run(); throw e; } } Suggestion importance[1-10]: 4 __ Why: Sorting improves determinism and predictability of deletion order across cycles, but functional correctness is not impacted since all generations will eventually be deleted. Minor improvement for observability and debugging.	Low

Batch remote translog metadata and data file deletions to cap memory and remote_purge queue growth on low-core nodes. Add purge batch cluster settings, single-flight coalescing per shard, and unit tests for batched delete paths. Fixes opensearch-project#20138 Signed-off-by: MdTanwer <tanw9004167@gmail.com>

github-actions · 2026-07-04T10:37:33Z

Persistent review updated to latest commit f23e34b

github-actions · 2026-07-04T11:03:10Z

❌ Gradle check result for f23e34b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

MdTanwer requested review from a team, Bukhtawar, andrross, ashking94, cwperks, dbwiddis, gbbafna, jed326, kotwanikunal, mch2, msfroh, owaiskazi19, reta, sachinpkale, saratvemulapalli, shwetathareja and sohami as code owners July 4, 2026 10:29

github-actions Bot added bug Something isn't working Indexing Indexing, Bulk Indexing and anything related to indexing Storage:Remote labels Jul 4, 2026

github-project-automation Bot added this to Storage Project Board Jul 4, 2026

MdTanwer force-pushed the fix/remote-translog-purge-batch-20138 branch from bc9ff3f to f23e34b Compare July 4, 2026 10:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bound remote translog purge for small nodes#22387

Bound remote translog purge for small nodes#22387
MdTanwer wants to merge 1 commit into
opensearch-project:mainfrom
MdTanwer:fix/remote-translog-purge-batch-20138

MdTanwer commented Jul 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

MdTanwer commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

github-actions Bot commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit f23e34b)

Uh oh!

github-actions Bot commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

MdTanwer commented Jul 4, 2026 •

edited

Loading

github-actions Bot commented Jul 4, 2026 •

edited

Loading

(Review updated until commit `f23e34b`)

github-actions Bot commented Jul 4, 2026 •

edited

Loading