fix(bootstrap): download snapshots via ranged ring buffer #1025

Merged: scarmuega merged 2 commits into main from fix/bootstrap-ranged-download, Jun 19, 2026

Conversation

@scarmuega

@scarmuega scarmuega commented Jun 19, 2026

Member

Problem

Bootstrapping from the Cloudflare R2 snapshot bucket intermittently fails while unpacking a segment:

Error:   × Failed to extract snapshot
  ├─▶ failed to unpack `archive/000221.segment` ...
  ├─▶ error decoding response body
  ╰─▶ operation timed out

This did not happen on S3.

Root cause

The old path (fetch_snapshot) streamed a single HTTP response directly into the tar extractor. Consequences:

  • The connection stayed open for the entire multi-GB transfer (the full snapshot is ~167 GB compressed).
  • Its lifetime was coupled to disk-write backpressure: tar::Archive::unpack only reads the socket as fast as it writes each entry to disk, so the connection sat idle while large .segment files were written.
  • R2 (fronted by Cloudflare) tears down such long-lived / slowly-drained streamed responses where S3 tolerated them.
  • Any transient stall on that single connection was fatal and unrecoverable.

During verification a plain reqwest ranged fetch also stalled once for 300s with the same signature, confirming the failure is an intermittent R2/reqwest hiccup that the old design turned into a hard failure.

Fix

Replace the single stream with a ranged ring buffer (new ranged module):

  • A background thread downloads the snapshot in bounded 64 MiB byte ranges, staging a small fixed-size window (4 chunks, ~256 MiB on disk) ahead of the extractor.
  • Backpressure is applied before a request is issued (permit pool): when extraction falls behind, the downloader stops issuing range requests rather than holding a connection open and idle. The server never sees a stalled connection.
  • Each chunk is short-lived and retried with exponential backoff, so transient stalls are recoverable instead of fatal. A coarse per-chunk timeout is safe here because each request is bounded (unlike a full-body stream).
  • A HEAD probe selects the ranged path when the endpoint advertises Accept-Ranges: bytes; otherwise it falls back to the original single-stream download (with its own untimed client).
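The permit-pool idea above can be sketched with plain std channels. This is a minimal illustration, not the code in the ranged module: `run_pipeline`, the in-memory `Vec<u8>` chunks (standing in for staged chunk files), and the counters are all hypothetical names introduced here. The producer must take a permit before it would issue a range request, and the consumer hands the permit back only after it has drained a chunk.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::channel;
use std::sync::Arc;
use std::thread;

/// Hypothetical sketch: moves `total_chunks` fake chunks through a window of
/// `window` permits. Returns the bytes read and the peak number of chunks
/// that were staged at the same time.
fn run_pipeline(total_chunks: usize, window: usize) -> (Vec<u8>, usize) {
    assert!(window > 0, "a zero-sized window would never issue a request");
    let (permit_tx, permit_rx) = channel::<()>();
    let (chunk_tx, chunk_rx) = channel::<Vec<u8>>();
    let staged = Arc::new(AtomicUsize::new(0));
    let max_staged = Arc::new(AtomicUsize::new(0));

    // Pre-fill the pool: one permit per slot in the staging window.
    for _ in 0..window {
        permit_tx.send(()).expect("permit receiver is alive");
    }

    let (staged_p, max_p) = (Arc::clone(&staged), Arc::clone(&max_staged));
    let producer = thread::spawn(move || {
        for i in 0..total_chunks {
            // Block *before* the request would be issued, so no connection
            // is ever held open while the consumer is behind.
            if permit_rx.recv().is_err() {
                return; // consumer went away
            }
            let now = staged_p.fetch_add(1, Ordering::SeqCst) + 1;
            max_p.fetch_max(now, Ordering::SeqCst);
            let chunk = vec![i as u8; 4]; // stand-in for one ranged GET
            if chunk_tx.send(chunk).is_err() {
                return;
            }
        }
    });

    let mut out = Vec::new();
    for chunk in chunk_rx {
        out.extend_from_slice(&chunk);
        staged.fetch_sub(1, Ordering::SeqCst);
        // Return the permit only once the chunk is fully consumed.
        let _ = permit_tx.send(());
    }
    producer.join().expect("producer thread panicked");
    (out, max_staged.load(Ordering::SeqCst))
}
```

Calling `run_pipeline(10, 2)` yields all ten chunks in order while the peak staged count never exceeds the window of 2.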

Bounded disk (~256 MiB) and negligible RAM; no 167 GB temp file.
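For reference, the range arithmetic behind these numbers can be written out as below. The constants come from the description (64 MiB chunks, 4-chunk window); the helper names `chunk_range`, `range_header`, and `chunk_count` are hypothetical and only illustrate how inclusive `Range: bytes=N-M` offsets would be derived.

```rust
/// Chunk size from the PR description: 64 MiB per range request.
const CHUNK_SIZE: u64 = 64 * 1024 * 1024;
/// Chunks staged ahead of the extractor (4, per the description).
const WINDOW_CHUNKS: u64 = 4;

/// Inclusive byte range for chunk `index`, or `None` past the end of the
/// object. HTTP byte ranges use an inclusive end offset. (Hypothetical helper.)
fn chunk_range(index: u64, total_size: u64, chunk_size: u64) -> Option<(u64, u64)> {
    let start = index.checked_mul(chunk_size)?;
    if start >= total_size {
        return None;
    }
    // The last chunk is clamped to the object size.
    let end = (start + chunk_size).min(total_size) - 1;
    Some((start, end))
}

/// Value for the `Range` request header.
fn range_header(start: u64, end: u64) -> String {
    format!("bytes={start}-{end}")
}

/// Number of range requests needed to cover `total_size` bytes.
fn chunk_count(total_size: u64, chunk_size: u64) -> u64 {
    (total_size + chunk_size - 1) / chunk_size
}
```

With these constants the staged window is `CHUNK_SIZE * WINDOW_CHUNKS` = 256 MiB, matching the disk bound quoted above.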

Verification

  • cargo build + cargo clippy clean.
  • Ignored integration test (ranged_matches_direct_fetch) against the live R2 endpoint: downloads a 5 MiB prefix through the ring buffer in 1 MiB chunks (forcing window backpressure), asserts byte-exactness vs a direct ranged fetch, progress total, and staging-dir cleanup. Passes in ~3.4s.
cargo test --bin dolos ranged_matches_direct_fetch -- --ignored --nocapture

Notes

  • Existing crates were evaluated (ripget, trauma, downloader). ripget's windowed mode is the closest match but is async/tokio (returns a tokio AsyncRead, needs an async→sync bridge into the blocking tar path) and a young single-maintainer crate on a critical path; trauma/downloader only download whole files to disk. Kept a small, blocking, disk-bounded implementation that matches the surrounding code.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added efficient, ranged (byte-range) downloading for bootstrap snapshot downloads using chunked, disk-staged transfer with progress reporting.
    • Implemented backpressure to avoid excessive idle or stalled range requests when reading lags behind.
    • Improved reliability with transient-failure retries using exponential backoff.
    • Automatically falls back to streaming downloads when ranged requests aren’t supported or sizing isn’t usable.
  • Tests
    • Added an integration-style test (currently ignored) to validate ranged reads and staging behavior.

Bootstrapping from the Cloudflare R2 snapshot bucket intermittently failed
with "error decoding response body / operation timed out" while unpacking a
segment. The old path streamed a single HTTP response directly into the tar
extractor, so the connection stayed open for the entire multi-GB transfer and
its lifetime was coupled to disk-write backpressure. R2 tears down such
long-lived, slowly-drained responses where S3 tolerated them, and any
transient stall on that one connection was fatal and unrecoverable.

Replace the single stream with a ranged ring buffer (new `ranged` module):

- A background thread downloads the snapshot in bounded 64 MiB byte ranges,
  staging a small fixed-size window (4 chunks, ~256 MiB) on disk ahead of the
  extractor. Backpressure is applied *before* a request is issued (via a permit
  pool), so the server never sees an idle/slow-drained connection.
- Each chunk is short-lived and retried with exponential backoff on failure,
  making transient stalls recoverable instead of fatal.
- A HEAD probe selects the ranged path when the endpoint advertises
  `Accept-Ranges: bytes`; otherwise it falls back to the original single-stream
  download (with its own untimed client, since an overall timeout must not cap
  a full-body stream).
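The per-chunk retry described above could look roughly like the following. It is a sketch under assumptions: the attempt limit, base delay, and the names `retry_with_backoff` and `backoff_delay` are illustrative and are not taken from the merged code, and the closure stands in for one bounded range request.

```rust
use std::thread;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt.
/// The exponent is capped so the shift cannot overflow.
fn backoff_delay(base: Duration, attempt: u32) -> Duration {
    base.saturating_mul(1u32 << attempt.min(16))
}

/// Runs `op` until it succeeds or `max_attempts` tries have failed.
/// `op` receives the 0-based attempt number and always runs at least once.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                thread::sleep(backoff_delay(base_delay, attempt));
                attempt += 1;
            }
        }
    }
}
```

Because each request covers one bounded chunk, a failed attempt only costs that chunk's bytes, which is what makes a coarse per-request timeout plus retry reasonable here.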

Verified end-to-end against the live R2 endpoint with an ignored integration
test that downloads a prefix through the ring buffer and asserts byte-exactness
versus a direct ranged fetch, plus window backpressure and staging cleanup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b8e37110-00b8-40b8-bd48-7f9ec9001485

📥 Commits

Reviewing files that changed from the base of the PR and between 117693b and 817b674.

📒 Files selected for processing (1)
  • src/bin/dolos/bootstrap/ranged.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/bin/dolos/bootstrap/ranged.rs

📝 Walkthrough

Adds a new ranged bootstrap module that implements a disk-staged ring-buffer HTTP downloader. It probes the snapshot endpoint for byte-range support and total size, downloads in fixed-size chunks with retry/backpressure, and provides a Read-compatible interface. snapshot.rs is updated to route between this ranged path and a streaming fallback.
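To illustrate what a "Read-compatible interface" over chunked input means for the blocking tar path, here is a minimal sketch. It is not the RangedReader from this PR: `ChunkReader` is a hypothetical type, and it pulls in-memory `Vec<u8>` chunks from a channel where the real implementation reads staged chunk files from disk.

```rust
use std::io::{self, Read};
use std::sync::mpsc::Receiver;

/// Hypothetical adapter: presents a sequence of chunks as one byte stream.
struct ChunkReader {
    chunks: Receiver<io::Result<Vec<u8>>>,
    current: Vec<u8>,
    pos: usize,
}

impl ChunkReader {
    fn new(chunks: Receiver<io::Result<Vec<u8>>>) -> Self {
        Self { chunks, current: Vec::new(), pos: 0 }
    }
}

impl Read for ChunkReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if buf.is_empty() {
            return Ok(0);
        }
        // Current chunk exhausted: block until the downloader hands over
        // the next one, skipping any empty chunks.
        while self.pos == self.current.len() {
            match self.chunks.recv() {
                Ok(Ok(next)) => {
                    self.current = next;
                    self.pos = 0;
                }
                // A download error is surfaced to the reader as an I/O error.
                Ok(Err(e)) => return Err(e),
                // Sender dropped: every chunk was delivered, so this is EOF.
                Err(_) => return Ok(0),
            }
        }
        let n = buf.len().min(self.current.len() - self.pos);
        buf[..n].copy_from_slice(&self.current[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

Anything that accepts `impl Read`, such as a gzip decoder feeding a tar extractor, can consume a reader like this without knowing the data arrived in ranges.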

Changes

Ranged bootstrap snapshot downloader

| Layer / File(s) | Summary |
|---|---|
| **Module registration and probe infrastructure**<br>`src/bin/dolos/bootstrap/mod.rs`, `src/bin/dolos/bootstrap/ranged.rs` | Declares the `ranged` module, defines chunk size, staging window, timeout, and retry constants, introduces `RangeProbe { total_size, supports_ranges }`, and implements `build_client` and `probe` via a HEAD request inspecting `Accept-Ranges` and `Content-Length` headers. |
| **RangedReader ring-buffer and chunk download**<br>`src/bin/dolos/bootstrap/ranged.rs` | Implements `RangedReader` backed by a spawned background thread that sends byte ranges into a bounded staging window on disk using permit channels for backpressure. `download_chunk` retries with exponential backoff and removes partial files between attempts. `Read` and `Drop` consume and clean chunk files. Includes an ignored integration test comparing ring-buffer output against a direct ranged GET. |
| **Snapshot ranged/streaming routing**<br>`src/bin/dolos/bootstrap/snapshot.rs` | Imports `ranged`, probes the endpoint, and routes to `fetch_snapshot_ranged` (staged chunked download, byte-progress, tar.gz extraction, staging cleanup) or `fetch_snapshot_streaming` (untimed client, limited redirects) based on the probe result. |
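The routing decision can be illustrated with a small header-parsing sketch. The struct name and field names follow the summary above, but the field types, the `header`, `probe_from_headers`, and `use_ranged_path` helpers, and the plain `(&str, &str)` header list are assumptions made for this example; the real `probe` works on an HTTP client response.

```rust
/// Field names follow the walkthrough; the types are assumed.
#[derive(Debug, PartialEq)]
struct RangeProbe {
    total_size: Option<u64>,
    supports_ranges: bool,
}

/// Case-insensitive header lookup over a plain name/value list.
fn header<'a>(headers: &[(&'a str, &'a str)], name: &str) -> Option<&'a str> {
    headers
        .iter()
        .find(|(k, _)| k.eq_ignore_ascii_case(name))
        .map(|&(_, v)| v.trim())
}

/// Builds a probe result from the HEAD response headers.
fn probe_from_headers(headers: &[(&str, &str)]) -> RangeProbe {
    let supports_ranges = header(headers, "accept-ranges")
        .map_or(false, |v| v.eq_ignore_ascii_case("bytes"));
    // A missing, unparsable, or zero Content-Length is treated as unusable.
    let total_size = header(headers, "content-length")
        .and_then(|v| v.parse::<u64>().ok())
        .filter(|&n| n > 0);
    RangeProbe { total_size, supports_ranges }
}

/// Ranged path only when ranges are advertised *and* the size is known;
/// otherwise the caller falls back to the single-stream download.
fn use_ranged_path(probe: &RangeProbe) -> bool {
    probe.supports_ranges && probe.total_size.is_some()
}
```

Requiring a usable size mirrors the fallback rule in the summary: without a total size the downloader cannot plan its chunk ranges.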

Sequence Diagram(s)

sequenceDiagram
  participant Bootstrap as bootstrap run
  participant snapshot.rs
  participant ranged
  participant RemoteServer
  participant StagingDir

  Bootstrap->>snapshot.rs: run(config)
  snapshot.rs->>ranged: build_client()
  snapshot.rs->>ranged: probe(client, url)
  ranged->>RemoteServer: HEAD url
  RemoteServer-->>ranged: Accept-Ranges, Content-Length
  ranged-->>snapshot.rs: RangeProbe

  alt supports_ranges
    snapshot.rs->>ranged: ranged_reader(client, url, total_size, staging, progress)
    loop each chunk
      ranged->>RemoteServer: GET Range: bytes=N-M
      RemoteServer-->>ranged: chunk bytes
      ranged->>StagingDir: write chunk file
    end
    snapshot.rs->>StagingDir: extract tar.gz
    snapshot.rs->>StagingDir: cleanup staging
  else fallback
    snapshot.rs->>RemoteServer: GET url (streaming)
    RemoteServer-->>snapshot.rs: full response stream
    snapshot.rs->>snapshot.rs: extract tar.gz inline
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 Hopping byte by byte through the wire,
Chunk files bloom and quietly expire,
A ring-buffer burrow, staged just right,
Backpressure keeps the permits tight.
When ranges fail, the stream flows free —
The snapshot lands, as soft as can be! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: introducing ranged HTTP download via ring buffer for bootstrap snapshots, which is the core solution to fix R2 timeout issues. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/bin/dolos/bootstrap/ranged.rs`:
- Line 209: Replace all instances of `io::Error::new(io::ErrorKind::Other, ...)`
with the modern `std::io::Error::other(...)` pattern in the error handling code
around the ranged.rs file. This change needs to be made at three locations
(lines 209, 230, and 238). Simply convert each
`io::Error::new(io::ErrorKind::Other, error_value)` call to
`std::io::Error::other(error_value)` to satisfy the clippy warning that treats
this as an error due to `-D warnings` being enabled.
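The requested change is mechanical. A minimal sketch of the new form, using a hypothetical `ChunkError` type as the wrapped error (the actual error values at those lines are not shown in this thread):

```rust
use std::fmt;
use std::io;

/// Hypothetical error type standing in for whatever the call sites wrap.
#[derive(Debug)]
struct ChunkError(String);

impl fmt::Display for ChunkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "chunk failed: {}", self.0)
    }
}

impl std::error::Error for ChunkError {}

/// Before: io::Error::new(io::ErrorKind::Other, e)
/// After:  io::Error::other(e), which produces the same kind and payload.
fn to_io_error(e: ChunkError) -> io::Error {
    io::Error::other(e)
}
```

`io::Error::other` is available in std from Rust 1.74 onward, so the crate's minimum supported toolchain needs to be at least that recent.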

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3f1af4af-bff9-40be-8adc-c7e4b7a82810

📥 Commits

Reviewing files that changed from the base of the PR and between 005ef0c and 117693b.

📒 Files selected for processing (3)
  • src/bin/dolos/bootstrap/mod.rs
  • src/bin/dolos/bootstrap/ranged.rs
  • src/bin/dolos/bootstrap/snapshot.rs

Comment thread src/bin/dolos/bootstrap/ranged.rs Outdated
@scarmuega scarmuega merged commit 0452ec2 into main Jun 19, 2026
14 of 16 checks passed
@scarmuega scarmuega deleted the fix/bootstrap-ranged-download branch June 19, 2026 20:21