fix(bootstrap): download snapshots via ranged ring buffer#1025
Conversation
Bootstrapping from the Cloudflare R2 snapshot bucket intermittently failed with "error decoding response body / operation timed out" while unpacking a segment. The old path streamed a single HTTP response directly into the tar extractor, so the connection stayed open for the entire multi-GB transfer and its lifetime was coupled to disk-write backpressure. R2 tears down such long-lived, slowly-drained responses where S3 tolerated them, and any transient stall on that one connection was fatal and unrecoverable. Replace the single stream with a ranged ring buffer (new `ranged` module): - A background thread downloads the snapshot in bounded 64 MiB byte ranges, staging a small fixed-size window (4 chunks, ~256 MiB) on disk ahead of the extractor. Backpressure is applied *before* a request is issued (via a permit pool), so the server never sees an idle/slow-drained connection. - Each chunk is short-lived and retried with exponential backoff on failure, making transient stalls recoverable instead of fatal. - A HEAD probe selects the ranged path when the endpoint advertises `Accept-Ranges: bytes`; otherwise it falls back to the original single-stream download (with its own untimed client, since an overall timeout must not cap a full-body stream). Verified end-to-end against the live R2 endpoint with an ignored integration test that downloads a prefix through the ring buffer and asserts byte-exactness versus a direct ranged fetch, plus window backpressure and staging cleanup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a new ChangesRanged bootstrap snapshot downloader
Sequence Diagram(s)sequenceDiagram
participant Bootstrap as bootstrap run
participant snapshot.rs
participant ranged
participant RemoteServer
participant StagingDir
Bootstrap->>snapshot.rs: run(config)
snapshot.rs->>ranged: build_client()
snapshot.rs->>ranged: probe(client, url)
ranged->>RemoteServer: HEAD url
RemoteServer-->>ranged: Accept-Ranges, Content-Length
ranged-->>snapshot.rs: RangeProbe
alt supports_ranges
snapshot.rs->>ranged: ranged_reader(client, url, total_size, staging, progress)
loop each chunk
ranged->>RemoteServer: GET Range: bytes=N-M
RemoteServer-->>ranged: chunk bytes
ranged->>StagingDir: write chunk file
end
snapshot.rs->>StagingDir: extract tar.gz
snapshot.rs->>StagingDir: cleanup staging
else fallback
snapshot.rs->>RemoteServer: GET url (streaming)
RemoteServer-->>snapshot.rs: full response stream
snapshot.rs->>snapshot.rs: extract tar.gz inline
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/bin/dolos/bootstrap/ranged.rs`:
- Line 209: Replace all instances of `io::Error::new(io::ErrorKind::Other, ...)`
with the modern `std::io::Error::other(...)` pattern in the error handling code
around the ranged.rs file. This change needs to be made at three locations
(lines 209, 230, and 238). Simply convert each
`io::Error::new(io::ErrorKind::Other, error_value)` call to
`std::io::Error::other(error_value)` to satisfy the clippy warning that treats
this as an error due to `-D warnings` being enabled.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3f1af4af-bff9-40be-8adc-c7e4b7a82810
📒 Files selected for processing (3)
src/bin/dolos/bootstrap/mod.rssrc/bin/dolos/bootstrap/ranged.rssrc/bin/dolos/bootstrap/snapshot.rs
Problem
Bootstrapping from the Cloudflare R2 snapshot bucket intermittently fails while unpacking a segment:
This did not happen on S3.
Root cause
The old path (
fetch_snapshot) streamed a single HTTP response directly into the tar extractor. Consequences:tar::Archive::unpackonly reads the socket as fast as it writes each entry to disk, so the connection sat idle while large.segmentfiles were written.During verification a plain
reqwestranged fetch also stalled once for 300s with the same signature, confirming the failure is an intermittent R2/reqwest hiccup that the old design turned into a hard failure.Fix
Replace the single stream with a ranged ring buffer (new
rangedmodule):Accept-Ranges: bytes; otherwise it falls back to the original single-stream download (with its own untimed client).Bounded disk (~256 MiB) and negligible RAM; no 167 GB temp file.
Verification
cargo build+cargo clippyclean.ranged_matches_direct_fetch) against the live R2 endpoint: downloads a 5 MiB prefix through the ring buffer in 1 MiB chunks (forcing window backpressure), asserts byte-exactness vs a direct ranged fetch, progress total, and staging-dir cleanup. Passes in ~3.4s.Notes
ripget,trauma,downloader).ripget's windowed mode is the closest match but is async/tokio (returns atokio AsyncRead, needs an async→sync bridge into the blocking tar path) and a young single-maintainer crate on a critical path;trauma/downloaderonly download whole files to disk. Kept a small, blocking, disk-bounded implementation that matches the surrounding code.🤖 Generated with Claude Code
Summary by CodeRabbit