
feat(server): Direct_IO - bypasses the OS page cache for more predictable latency #3068

Open
tungtose wants to merge 8 commits into apache:master from tungtose:direct-io-2

Conversation

@tungtose
Contributor

@tungtose tungtose commented Apr 1, 2026

Resolving #1420
WIP: detailed flow, scenarios, and benchmarks.

DirectFile

DirectFile holds an open file using O_DSYNC | O_DIRECT. This is primarily used to solve the Tail Buffer Problem.

Scenario 1: Perfectly aligned write (fast path)

write_all (4096 bytes): no existing tail and the length is a multiple of 4096:

(diagram)

Scenario 2: Sub-alignment write

write_all (2048 bytes): too small for an aligned disk write, so it is buffered in the tail:

(diagram)

Scenario 3: Second write completes a block

Existing tail of 2048 bytes + new write_all (2048b):

(diagram)

Scenario 4: Split write, tail + new data cross the alignment boundary:

Existing tail = 2048b, write_all (4712b), total = 6760b: one full 4096b block goes to disk and the remaining 2664b becomes the new tail

(diagram)

Scenario 5: flush() - pad and write the tail buffer

Force remaining tail bytes to disk

(diagram)
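The write-path scenarios above can be sketched as pure buffer arithmetic. This is an illustrative simplification, not the PR's actual API: names like `plan_write` and `pad_tail` are hypothetical, and the real implementation writes through O_DIRECT rather than returning Vecs:

```rust
const BLOCK: usize = 4096; // assumed O_DIRECT alignment (sector/page size)

/// Result of planning a write: the aligned prefix that can go to disk
/// immediately, and the leftover bytes that stay in the tail buffer.
struct WritePlan {
    aligned: Vec<u8>, // length is a multiple of BLOCK, written via O_DIRECT
    tail: Vec<u8>,    // < BLOCK bytes, kept in memory until the next write/flush
}

/// Combine the existing tail with new data and split at the alignment
/// boundary (covers scenarios 1-4).
fn plan_write(tail: &[u8], data: &[u8]) -> WritePlan {
    let mut buf = Vec::with_capacity(tail.len() + data.len());
    buf.extend_from_slice(tail);
    buf.extend_from_slice(data);
    let aligned_len = (buf.len() / BLOCK) * BLOCK; // largest multiple of BLOCK
    let new_tail = buf.split_off(aligned_len);     // buf keeps the aligned prefix
    WritePlan { aligned: buf, tail: new_tail }
}

/// Scenario 5: flush() pads the tail up to a full block so it can be
/// written with O_DIRECT; flush_and_truncate later drops the padding by
/// truncating the file back to the logical length.
fn pad_tail(tail: &[u8]) -> Vec<u8> {
    let mut block = tail.to_vec();
    block.resize(BLOCK, 0); // zero padding up to the alignment boundary
    block
}
```

Under this sketch, scenario 1 (`plan_write(&[], &[0u8; 4096])`) yields 4096 aligned bytes and an empty tail, while scenario 4 (2048-byte tail + 4712-byte write) yields one 4096-byte block and a 2664-byte tail.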

Scenario 6: Segment rotate

The sealed segment gets flush_and_truncate so it's a clean, exact-size, self-contained file. The new segment starts with a fresh DirectFile at position 0.
WIP

Scenario 7: Shutdown, recovery

flush_and_truncate on the active segment before exiting. On restart, metadata.len() gives the exact logical size, and the tail bytes are reconstructed from disk.
WIP
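The recovery step reduces to simple offset math over the logical length. A hypothetical sketch (`recovery_plan` is an illustrative name, not the PR's code):

```rust
const BLOCK: u64 = 4096; // assumed O_DIRECT alignment

/// Given the exact logical size reported by metadata.len() after
/// flush_and_truncate, compute where the aligned region ends and how many
/// tail bytes must be re-read from disk into the in-memory tail buffer.
fn recovery_plan(logical_len: u64) -> (u64, u64) {
    let tail_len = logical_len % BLOCK;       // bytes past the last full block
    let aligned_end = logical_len - tail_len; // aligned offset to read from
    (aligned_end, tail_len)
}
```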

Scenario 8: Crash

At most the in-memory tail is lost: fewer than 4096 bytes

Benchmark:

WIP

| Metric | Buffered IO | O_DIRECT only | O_DIRECT + O_DSYNC |
| --- | --- | --- | --- |
| **Throughput** | | | |
| Producer | 250 MB/s | 238 MB/s (-5%) | 210 MB/s (-16%) |
| Consumer | 250 MB/s | 237 MB/s (-5%) | 205 MB/s (-18%) |
| Aggregate | 500 MB/s | 475 MB/s (-5%) | 415 MB/s (-17%) |
| **Producer latency** | | | |
| p50 | 1.39 ms | 2.27 ms | 9.12 ms |
| p90 | 2.28 ms | 13.13 ms | 30.57 ms |
| p99 | 3.40 ms | 96.92 ms | 116.13 ms |
| p999 | 64.67 ms | 128.99 ms | 178.05 ms |
| p9999 | 457.92 ms | 149.40 ms | 199.76 ms |
| max | 282.37 ms | 159.16 ms | 203.71 ms |
| std dev | 1.96 ms | 19.55 ms | 12.30 ms |
| **Consumer latency** | | | |
| p50 | 2.09 ms | 7.77 ms | 65.41 ms |
| p90 | 3.17 ms | 170.06 ms | 864.81 ms |
| p99 | 7.67 ms | 514.43 ms | 1,668.90 ms |
| p999 | 65.89 ms | 669.90 ms | 1,813.84 ms |
| p9999 | 459.78 ms | 694.87 ms | 1,849.49 ms |
| max | 563.25 ms | 1,009.93 ms | 2,631.40 ms |
| std dev | 2.85 ms | 107.55 ms | 420.02 ms |

The consumer regression happens because persist_frozen_batches_to_disk blocks the event loop (single thread per shard). With buffered IO this takes <1 ms, so it's unnoticeable. With direct IO it takes ~15 ms (worse under high disk load), blocking consumer reads from running.

Increasing messages_required_to_save and size_of_messages_required_to_save would reduce persist frequency and improve consumer latency, but the benchmark uses default values.

Throughput is lower and producer latency is more stable, as expected. However, there is a significant regression on the consumer side. I am working on optimizing it (deferring the persistence). Since this feature is configurable, it can be safely turned off in config.toml. We can merge this and handle the optimization in a separate PR.

Update:

Deferred persist comes with better benchmark results (code not committed yet; it still needs cleanup):

| Metric | Buffered IO | O_DIRECT + O_DSYNC (inline) | O_DIRECT + O_DSYNC (deferred) |
| --- | --- | --- | --- |
| **Throughput** | | | |
| Producer | 250 MB/s | 210 MB/s | 250 MB/s |
| Consumer | 250 MB/s | 205 MB/s | 250 MB/s |
| Aggregate | 500 MB/s | 415 MB/s | 500 MB/s |
| **Producer latency** | | | |
| p50 | 1.39 ms | 9.12 ms | 1.39 ms |
| p99 | 3.40 ms | 116.13 ms | 5.41 ms |
| p9999 | 457.92 ms | 199.76 ms | 25.79 ms |
| max | 282.37 ms | 203.71 ms | 24.59 ms |
| std dev | 1.96 ms | 12.30 ms | 0.39 ms |
| **Consumer latency** | | | |
| p50 | 2.09 ms | 65.41 ms | 2.25 ms |
| p99 | 7.67 ms | 1,668.90 ms | 7.83 ms |
| p9999 | 459.78 ms | 1,849.49 ms | 31.92 ms |
| max | 563.25 ms | 2,631.40 ms | 34.00 ms |
| std dev | 2.85 ms | 420.02 ms | 1.04 ms |

However, it introduces correctness issues with purge, delete, and rotation operations that can race with in-flight persists. WIP...

@codecov

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 240 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.45%. Comparing base (f5350d9) to head (048e28d).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/common/src/alloc/buffer.rs | 1.21% | 81 Missing ⚠️ |
| ...rc/types/segment_storage/messages_writer/direct.rs | 67.22% | 30 Missing and 9 partials ⚠️ |
| ...re/common/src/types/segment_storage/direct_file.rs | 93.82% | 31 Missing and 7 partials ⚠️ |
| core/partitions/src/iggy_partitions.rs | 0.00% | 28 Missing ⚠️ |
| ...ommon/src/types/segment_storage/messages_reader.rs | 63.79% | 20 Missing and 1 partial ⚠️ |
| ...n/src/types/segment_storage/messages_writer/mod.rs | 69.49% | 17 Missing and 1 partial ⚠️ |
| core/server/src/shard/system/messages.rs | 60.00% | 2 Missing and 2 partials ⚠️ |
| core/common/src/types/segment_storage/mod.rs | 90.00% | 0 Missing and 2 partials ⚠️ |
| core/partitions/src/iggy_index_writer.rs | 0.00% | 2 Missing ⚠️ |
| core/server/src/shard/system/segments.rs | 50.00% | 1 Missing and 1 partial ⚠️ |
| ... and 5 more | | |
Additional details and impacted files
```
@@             Coverage Diff              @@
##             master    #3068      +/-   ##
============================================
- Coverage     72.76%   70.45%   -2.31%     
  Complexity      943      943              
============================================
  Files          1117     1120       +3     
  Lines         96368    93078    -3290     
  Branches      73544    70271    -3273     
============================================
- Hits          70119    65579    -4540     
- Misses        23702    24718    +1016     
- Partials       2547     2781     +234     
```
| Components | Coverage Δ |
| --- | --- |
| Rust Core | 70.47% <76.92%> (-3.02%) ⬇️ |
| Java SDK | 62.30% <ø> (ø) |
| C# SDK | 69.10% <ø> (-0.33%) ⬇️ |
| Python SDK | 81.43% <ø> (ø) |
| Node SDK | 91.40% <ø> (-0.13%) ⬇️ |
| Go SDK | 39.41% <ø> (ø) |
| Files with missing lines | Coverage Δ |
| --- | --- |
| core/common/src/types/message/indexes_mut.rs | 64.95% <100.00%> (ø) |
| core/common/src/types/message/message_view.rs | 92.30% <100.00%> (+0.64%) ⬆️ |
| ...ore/common/src/types/message/messages_batch_mut.rs | 49.89% <100.00%> (ø) |
| .../types/segment_storage/messages_writer/buffered.rs | 0.00% <ø> (ø) |
| core/configs/src/server_config/system.rs | 94.21% <ø> (ø) |
| core/partitions/src/types.rs | 8.86% <ø> (ø) |
| core/server/src/streaming/segments/storage.rs | 100.00% <100.00%> (ø) |
| core/simulator/src/replica.rs | 100.00% <100.00%> (ø) |
| core/configs/src/server_config/defaults.rs | 0.00% <0.00%> (ø) |
| core/server/src/bootstrap.rs | 80.90% <90.90%> (+0.04%) ⬆️ |
| ... and 13 more | |

... and 108 files with indirect coverage changes


@tungtose tungtose force-pushed the direct-io-2 branch 2 times, most recently from ccee726 to 29f2000 Compare April 7, 2026 07:32
@hubcio
Contributor

hubcio commented Apr 7, 2026

@tungtose I will check this today/tomorrow.

Comment on lines +129 to +132

```rust
#[allow(clippy::await_holding_refcell_ref)]
// SAFETY: compio is a single-threaded runtime — no other task runs on
// this thread during .await, so the RefCell borrow cannot conflict
pub async fn save_frozen_batches(
```
Contributor

The safety comment isn't necessarily true; we learned this the hard way during the migration to the thread-per-core, shared-nothing architecture. A single-threaded runtime doesn't imply a lack of concurrency, and concurrent execution can trigger a runtime panic on that borrow.
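The hazard being described can be demonstrated with std alone, no async runtime required. In this illustrative snippet (not compio code), holding a RefCell borrow while a second task is polled on the same thread stands in for a borrow held across an .await:

```rust
use std::cell::RefCell;

fn main() {
    // Two logical tasks interleaving on one thread, as a cooperative
    // runtime would poll them. Task A takes a mutable borrow and holds
    // it across a (simulated) await point.
    let state = RefCell::new(vec![1, 2, 3]);
    let held = state.borrow_mut();

    // Task B now runs on the same thread while A is suspended. A plain
    // borrow_mut() here would panic with BorrowMutError at runtime;
    // try_borrow_mut() makes the conflict visible instead.
    assert!(state.try_borrow_mut().is_err());

    drop(held); // task A resumes and releases its borrow
    assert!(state.try_borrow_mut().is_ok());
}
```

Single-threaded execution only rules out data races, not this kind of reentrancy conflict between interleaved tasks.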

@numinnex
Contributor

I've looked through the PR, and if I understand the design correctly, it trades disk-space savings (from not storing padding with each written buffer) for extra complexity. I am not sure it's the right trade-off, given that we could use background compaction to rewrite those small padded buffers into larger contiguous blocks.

I think that rather than storing the tail as the leftover from misaligned writes, we should pass a collection of already padded and aligned buffers into the write_vectored_all method (I know it's impossible without a lot of changes due to the prolific usage of Bytes; that's why this feature is so hard to implement). But if this PR is meant to be worth merging without immediately becoming technical debt, we have to get it right from the get-go.

You can take a look at the io_buf or iggy_io_buf module (I think that's what it's called now). I've created a refcounted buffer there that uses AVec as its underlying storage; see whether it's even feasible to replace Bytes with it.
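For background on why an aligned buffer type matters here: O_DIRECT generally requires the user buffer address (and usually the file offset and transfer length) to be block-aligned. A minimal std-only sketch of such an allocation (illustrative only; the io_buf module's AVec-backed refcounted buffer is the real mechanism, and `aligned_buf` is a hypothetical name):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

const ALIGN: usize = 4096; // assumed O_DIRECT block alignment

/// Allocate a zeroed buffer whose start address is block-aligned, as
/// O_DIRECT transfers typically require. Returns the raw pointer and its
/// Layout so the caller can free it with dealloc.
fn aligned_buf(len: usize) -> (*mut u8, Layout) {
    let layout = Layout::from_size_align(len, ALIGN).expect("invalid layout");
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

fn main() {
    let (ptr, layout) = aligned_buf(2 * ALIGN);
    // Alignment lets the kernel DMA straight to/from this buffer,
    // bypassing the page cache.
    assert_eq!(ptr as usize % ALIGN, 0);
    unsafe { dealloc(ptr, layout) };
}
```

Bytes makes no alignment guarantee for its backing storage, which is why padding and aligning buffers up front requires replacing it rather than wrapping it.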
