Summary
When a CAS blob is deleted from disk (e.g., via the overwrite path in Add()) after a concurrent GET has already looked up the key in the LRU and released the lock, the subsequent os.Open fails with a "not exists" error. The warning is logged, but the stale LRU entry is never removed, so the same warning fires on every subsequent access until the server is restarted.
Impact
On a moderately active server (~10 concurrent Bazel users), we observed 20,145 phantom warnings over 2 days (~49/minute), producing persistent "Lost inputs no longer available remotely" failures for Bazel clients using --remote_download_toplevel. A restart temporarily resolves the issue by rebuilding the LRU from disk, but phantoms accumulate again within hours.
Root Cause
In cache/disk/disk.go, availableOrTryProxy():
    // Slow path retry:
    c.mu.Lock()
    item, listElem = c.lru.Get(key)
    if listElem != nil {
        blobPath = path.Join(c.dir, c.FileLocation(...))
        f, err = os.Open(blobPath)
    }
    c.mu.Unlock()
    if err != nil {
        // WARNING: logs but does NOT evict from LRU
        log.Printf("Warning: expected %q to exist on disk, undersized cache?", blobPath)
    }
Compare to the decompression-error path ~20 lines later, which correctly self-heals:
    if err != nil {
        log.Printf("Warning: expected item to be on disk, but something happened...")
        c.mu.Lock()
        c.lru.RemoveElement(listElem) // ← correctly removes dead entry
        c.mu.Unlock()
    }
Suggested Fix
Add LRU eviction in the file-not-found path, matching the existing decompression-error pattern:
    if err != nil {
        log.Printf("Warning: expected %q to exist on disk, undersized cache?", blobPath)
        c.mu.Lock()
        if listElem != nil {
            c.lru.RemoveElement(listElem)
        }
        c.mu.Unlock()
    }
This makes ghost entries self-healing on first access rather than persisting indefinitely.
Environment
- bazel-remote commit: b857daf1f63c641dc3fe6105a674a6e9ed81cf35
- Go 1.25.6
- Local disk storage (no NFS), zstd compression mode
- ~847K cached files, 126 GB on disk, 1.2 TB max
Reproduction
- Start bazel-remote with multiple concurrent clients
- Trigger overwrites (same CAS hash uploaded by different clients, or after bazel clean --expunge)
- Concurrent GET requests for overwritten blobs hit the race window
- Once triggered, the phantom entry persists forever — visible via repeated "expected to exist on disk" warnings for the same path