
feat(server): Direct_IO - bypasses the OS page cache for more predictable latency #3068

Open
tungtose wants to merge 8 commits into apache:master from tungtose:direct-io-2

Conversation

@tungtose
Contributor

@tungtose tungtose commented Apr 1, 2026

Resolving #1420
WIP: detailed flow, scenarios, and benchmarks.

DirectFile

DirectFile holds an open file using O_DSYNC | O_DIRECT. This is primarily used to solve the Tail Buffer Problem.

Scenario 1: Perfectly aligned write (fast path)

write_all (4096 bytes): no existing tail and the length is a multiple of 4096:

(diagram)

Scenario 2: Sub-alignment write

write_all (2048 bytes): too small for an aligned disk write, so it is buffered in the tail:

(diagram)

Scenario 3: Second write completes a block

Existing tail of 2048 bytes + new write_all (2048b):

(diagram)

Scenario 4: Split write, tail + new data cross the alignment boundary:

Existing tail = 2048b, write_all (4712b), total = 6760b: one full 4096b block goes to disk and the remaining 2664b becomes the new tail

(diagram)

Scenario 5: flush() - pad and write the tail buffer

Force remaining tail bytes to disk

(diagram)
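The write-path scenarios above can be sketched as pure buffer arithmetic. This is an illustrative simplification, not the PR's actual API: names like `plan_write` and `pad_tail` are hypothetical, and the real implementation writes through O_DIRECT rather than returning Vecs:

```rust
const BLOCK: usize = 4096; // assumed O_DIRECT alignment (sector/page size)

/// Result of planning a write: the aligned prefix that can go to disk
/// immediately, and the leftover bytes that stay in the tail buffer.
struct WritePlan {
    aligned: Vec<u8>, // length is a multiple of BLOCK, written via O_DIRECT
    tail: Vec<u8>,    // < BLOCK bytes, kept in memory until the next write/flush
}

/// Combine the existing tail with new data and split at the alignment
/// boundary (covers scenarios 1-4).
fn plan_write(tail: &[u8], data: &[u8]) -> WritePlan {
    let mut buf = Vec::with_capacity(tail.len() + data.len());
    buf.extend_from_slice(tail);
    buf.extend_from_slice(data);
    let aligned_len = (buf.len() / BLOCK) * BLOCK; // largest multiple of BLOCK
    let new_tail = buf.split_off(aligned_len);     // buf keeps the aligned prefix
    WritePlan { aligned: buf, tail: new_tail }
}

/// Scenario 5: flush() pads the tail up to a full block so it can be
/// written with O_DIRECT; flush_and_truncate later drops the padding by
/// truncating the file back to the logical length.
fn pad_tail(tail: &[u8]) -> Vec<u8> {
    let mut block = tail.to_vec();
    block.resize(BLOCK, 0); // zero padding up to the alignment boundary
    block
}
```

Under this sketch, scenario 1 (`plan_write(&[], &[0u8; 4096])`) yields 4096 aligned bytes and an empty tail, while scenario 4 (2048-byte tail + 4712-byte write) yields one 4096-byte block and a 2664-byte tail.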

Scenario 6: Segment rotate

The sealed segment gets flush_and_truncate so it's a clean, exact-size, self-contained file. The new segment starts with a fresh DirectFile at position 0.
WIP

Scenario 7: Shutdown, recovery

flush_and_truncate on the active segment before exiting. On restart, metadata.len() gives the exact logical size, and the tail bytes are reconstructed from disk.
WIP
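The recovery step reduces to simple offset math over the logical length. A hypothetical sketch (`recovery_plan` is an illustrative name, not the PR's code):

```rust
const BLOCK: u64 = 4096; // assumed O_DIRECT alignment

/// Given the exact logical size reported by metadata.len() after
/// flush_and_truncate, compute where the aligned region ends and how many
/// tail bytes must be re-read from disk into the in-memory tail buffer.
fn recovery_plan(logical_len: u64) -> (u64, u64) {
    let tail_len = logical_len % BLOCK;       // bytes past the last full block
    let aligned_end = logical_len - tail_len; // aligned offset to read from
    (aligned_end, tail_len)
}
```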

Scenario 8: Crash

At most the in-memory tail is lost: fewer than 4096 bytes

Benchmark:

WIP

| Metric | Buffered IO | O_DIRECT only | O_DIRECT + O_DSYNC |
| --- | --- | --- | --- |
| **Throughput** | | | |
| Producer | 250 MB/s | 238 MB/s (-5%) | 210 MB/s (-16%) |
| Consumer | 250 MB/s | 237 MB/s (-5%) | 205 MB/s (-18%) |
| Aggregate | 500 MB/s | 475 MB/s (-5%) | 415 MB/s (-17%) |
| **Producer latency** | | | |
| p50 | 1.39 ms | 2.27 ms | 9.12 ms |
| p90 | 2.28 ms | 13.13 ms | 30.57 ms |
| p99 | 3.40 ms | 96.92 ms | 116.13 ms |
| p999 | 64.67 ms | 128.99 ms | 178.05 ms |
| p9999 | 457.92 ms | 149.40 ms | 199.76 ms |
| max | 282.37 ms | 159.16 ms | 203.71 ms |
| std dev | 1.96 ms | 19.55 ms | 12.30 ms |
| **Consumer latency** | | | |
| p50 | 2.09 ms | 7.77 ms | 65.41 ms |
| p90 | 3.17 ms | 170.06 ms | 864.81 ms |
| p99 | 7.67 ms | 514.43 ms | 1,668.90 ms |
| p999 | 65.89 ms | 669.90 ms | 1,813.84 ms |
| p9999 | 459.78 ms | 694.87 ms | 1,849.49 ms |
| max | 563.25 ms | 1,009.93 ms | 2,631.40 ms |
| std dev | 2.85 ms | 107.55 ms | 420.02 ms |

The consumer regression happens because persist_frozen_batches_to_disk blocks the event loop (single thread per shard). With buffered IO this takes <1 ms, so it's unnoticeable. With direct IO it takes ~15 ms (worse under high disk load), blocking consumer reads from running.

Increasing messages_required_to_save and size_of_messages_required_to_save would reduce persist frequency and improve consumer latency, but the benchmark uses default values.

Throughput is lower and producer latency is more stable, as expected. However, there is a significant regression on the consumer side. I am working on optimizing it (deferring the persistence). Since this feature is configurable, it can be safely turned off in config.toml. We can merge this and handle the optimization in a separate PR.

Update:

Deferred persist comes with better benchmark results (code not committed yet; it still needs cleanup):

| Metric | Buffered IO | O_DIRECT + O_DSYNC (inline) | O_DIRECT + O_DSYNC (deferred) |
| --- | --- | --- | --- |
| **Throughput** | | | |
| Producer | 250 MB/s | 210 MB/s | 250 MB/s |
| Consumer | 250 MB/s | 205 MB/s | 250 MB/s |
| Aggregate | 500 MB/s | 415 MB/s | 500 MB/s |
| **Producer latency** | | | |
| p50 | 1.39 ms | 9.12 ms | 1.39 ms |
| p99 | 3.40 ms | 116.13 ms | 5.41 ms |
| p9999 | 457.92 ms | 199.76 ms | 25.79 ms |
| max | 282.37 ms | 203.71 ms | 24.59 ms |
| std dev | 1.96 ms | 12.30 ms | 0.39 ms |
| **Consumer latency** | | | |
| p50 | 2.09 ms | 65.41 ms | 2.25 ms |
| p99 | 7.67 ms | 1,668.90 ms | 7.83 ms |
| p9999 | 459.78 ms | 1,849.49 ms | 31.92 ms |
| max | 563.25 ms | 2,631.40 ms | 34.00 ms |
| std dev | 2.85 ms | 420.02 ms | 1.04 ms |

However, it introduces correctness issues with purge, delete, and rotation operations that can race with in-flight persists. WIP...

@codecov

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 240 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.45%. Comparing base (f5350d9) to head (048e28d).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/common/src/alloc/buffer.rs | 1.21% | 81 Missing ⚠️ |
| ...rc/types/segment_storage/messages_writer/direct.rs | 67.22% | 30 Missing and 9 partials ⚠️ |
| ...re/common/src/types/segment_storage/direct_file.rs | 93.82% | 31 Missing and 7 partials ⚠️ |
| core/partitions/src/iggy_partitions.rs | 0.00% | 28 Missing ⚠️ |
| ...ommon/src/types/segment_storage/messages_reader.rs | 63.79% | 20 Missing and 1 partial ⚠️ |
| ...n/src/types/segment_storage/messages_writer/mod.rs | 69.49% | 17 Missing and 1 partial ⚠️ |
| core/server/src/shard/system/messages.rs | 60.00% | 2 Missing and 2 partials ⚠️ |
| core/common/src/types/segment_storage/mod.rs | 90.00% | 0 Missing and 2 partials ⚠️ |
| core/partitions/src/iggy_index_writer.rs | 0.00% | 2 Missing ⚠️ |
| core/server/src/shard/system/segments.rs | 50.00% | 1 Missing and 1 partial ⚠️ |
| ... and 5 more | | |
Additional details and impacted files
```
@@             Coverage Diff              @@
##             master    #3068      +/-   ##
============================================
- Coverage     72.76%   70.45%   -2.31%     
  Complexity      943      943              
============================================
  Files          1117     1120       +3     
  Lines         96368    93078    -3290     
  Branches      73544    70271    -3273     
============================================
- Hits          70119    65579    -4540     
- Misses        23702    24718    +1016     
- Partials       2547     2781     +234     
```
| Components | Coverage Δ |
| --- | --- |
| Rust Core | 70.47% <76.92%> (-3.02%) ⬇️ |
| Java SDK | 62.30% <ø> (ø) |
| C# SDK | 69.10% <ø> (-0.33%) ⬇️ |
| Python SDK | 81.43% <ø> (ø) |
| Node SDK | 91.40% <ø> (-0.13%) ⬇️ |
| Go SDK | 39.41% <ø> (ø) |
| Files with missing lines | Coverage Δ |
| --- | --- |
| core/common/src/types/message/indexes_mut.rs | 64.95% <100.00%> (ø) |
| core/common/src/types/message/message_view.rs | 92.30% <100.00%> (+0.64%) ⬆️ |
| ...ore/common/src/types/message/messages_batch_mut.rs | 49.89% <100.00%> (ø) |
| .../types/segment_storage/messages_writer/buffered.rs | 0.00% <ø> (ø) |
| core/configs/src/server_config/system.rs | 94.21% <ø> (ø) |
| core/partitions/src/types.rs | 8.86% <ø> (ø) |
| core/server/src/streaming/segments/storage.rs | 100.00% <100.00%> (ø) |
| core/simulator/src/replica.rs | 100.00% <100.00%> (ø) |
| core/configs/src/server_config/defaults.rs | 0.00% <0.00%> (ø) |
| core/server/src/bootstrap.rs | 80.90% <90.90%> (+0.04%) ⬆️ |
| ... and 13 more | |

... and 108 files with indirect coverage changes


@tungtose tungtose force-pushed the direct-io-2 branch 2 times, most recently from ccee726 to 29f2000 Compare April 7, 2026 07:32
@hubcio
Contributor

hubcio commented Apr 7, 2026

@tungtose I will check this today/tomorrow.

Comment on lines +129 to +132

```rust
#[allow(clippy::await_holding_refcell_ref)]
// SAFETY: compio is a single-threaded runtime — no other task runs on
// this thread during .await, so the RefCell borrow cannot conflict
pub async fn save_frozen_batches(
```
Contributor

The safety comment isn't necessarily true; we learned this the hard way during the migration to the thread-per-core, shared-nothing architecture. A single-threaded runtime doesn't imply a lack of concurrency, and concurrent execution can trigger a runtime panic on that borrow.
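The hazard being described can be demonstrated with std alone, no async runtime required. In this illustrative snippet (not compio code), holding a RefCell borrow while a second task is polled on the same thread stands in for a borrow held across an .await:

```rust
use std::cell::RefCell;

fn main() {
    // Two logical tasks interleaving on one thread, as a cooperative
    // runtime would poll them. Task A takes a mutable borrow and holds
    // it across a (simulated) await point.
    let state = RefCell::new(vec![1, 2, 3]);
    let held = state.borrow_mut();

    // Task B now runs on the same thread while A is suspended. A plain
    // borrow_mut() here would panic with BorrowMutError at runtime;
    // try_borrow_mut() makes the conflict visible instead.
    assert!(state.try_borrow_mut().is_err());

    drop(held); // task A resumes and releases its borrow
    assert!(state.try_borrow_mut().is_ok());
}
```

Single-threaded execution only rules out data races, not this kind of reentrancy conflict between interleaved tasks.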

@numinnex
Contributor

I've looked through the PR, and if I understand the design correctly, it trades disk-space savings (from not storing padding with each written buffer) for extra complexity. I am not sure it's the right trade-off, given that we could use background compaction to rewrite those small padded buffers into larger contiguous blocks.

I think that rather than storing the tail as the leftover from misaligned writes, we should pass a collection of already padded and aligned buffers into the write_vectored_all method (I know it's impossible without a lot of changes due to the prolific usage of Bytes; that's why this feature is so hard to implement). But if this PR is meant to be worth merging without immediately becoming technical debt, we have to get it right from the get-go.

You can take a look at the io_buf or iggy_io_buf module (I think that's what it's called now). I've created a refcounted buffer there that uses AVec as its underlying storage; see whether it's even feasible to replace Bytes with it.
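For background on why an aligned buffer type matters here: O_DIRECT generally requires the user buffer address (and usually the file offset and transfer length) to be block-aligned. A minimal std-only sketch of such an allocation (illustrative only; the io_buf module's AVec-backed refcounted buffer is the real mechanism, and `aligned_buf` is a hypothetical name):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

const ALIGN: usize = 4096; // assumed O_DIRECT block alignment

/// Allocate a zeroed buffer whose start address is block-aligned, as
/// O_DIRECT transfers typically require. Returns the raw pointer and its
/// Layout so the caller can free it with dealloc.
fn aligned_buf(len: usize) -> (*mut u8, Layout) {
    let layout = Layout::from_size_align(len, ALIGN).expect("invalid layout");
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

fn main() {
    let (ptr, layout) = aligned_buf(2 * ALIGN);
    // Alignment lets the kernel DMA straight to/from this buffer,
    // bypassing the page cache.
    assert_eq!(ptr as usize % ALIGN, 0);
    unsafe { dealloc(ptr, layout) };
}
```

Bytes makes no alignment guarantee for its backing storage, which is why padding and aligning buffers up front requires replacing it rather than wrapping it.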
