feat(server): Direct_IO - bypasses the OS page cache for more predictable latency #3068
tungtose wants to merge 8 commits into apache:master
Conversation
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

@@ Coverage Diff @@
## master #3068 +/- ##
============================================
- Coverage 72.76% 70.45% -2.31%
Complexity 943 943
============================================
Files 1117 1120 +3
Lines 96368 93078 -3290
Branches 73544 70271 -3273
============================================
- Hits 70119 65579 -4540
- Misses 23702 24718 +1016
- Partials 2547 2781 +234
Force-pushed from ccee726 to 29f2000.
@tungtose I will check this today/tomorrow.
```rust
#[allow(clippy::await_holding_refcell_ref)]
// SAFETY: compio is a single-threaded runtime — no other task runs on
// this thread during .await, so the RefCell borrow cannot conflict
pub async fn save_frozen_batches(
```
The safety comment isn't necessarily true; we learned that the hard way during the migration to the thread-per-core, shared-nothing architecture. A single-threaded runtime doesn't imply a lack of concurrency, and concurrent execution is exactly what can trigger a runtime panic on that borrow.
I've looked through the PR, and if I understand the design correctly, it trades disk-space savings (by not storing padding on each written buffer) for added complexity. I am not sure that's the right trade-off, given that we could use background compaction to rewrite those small padded buffers into larger contiguous blocks. I think rather than storing the … You can take a look at the …
Resolving #1420
WIP: detailed flow, scenarios, benchmark
DirectFile
DirectFile holds an open file using O_DSYNC | O_DIRECT. This is primarily used to solve the Tail Buffer Problem.

Scenario 1: Perfectly aligned write (fast path)
write_all(4096 bytes): no tail, length is a multiple of 4096.

Scenario 2: Sub-alignment write
write_all(2048 bytes): too small for a disk write.

Scenario 3: Second write completes the block
Existing tail of 2048 bytes + new write_all(2048b).

Scenario 4: Split write, tail + new data across an alignment boundary
Existing tail = 2048b, write_all(4712b), total = 6760b.

Scenario 5: flush() - pad and write the tail buffer

Force remaining tail bytes to disk.
Scenario 6: Segment rotate

The sealed segment gets flush_and_truncate so it's a clean, exact-size, self-contained file. The new segment starts with a fresh DirectFile at position 0. WIP
Scenario 7: Shutdown, recovery

flush_and_truncate runs on the active segment before exiting. On restart, metadata.len() gives the exact logical size, and the tail bytes are reconstructed from disk. WIP
Scenario 8: Crash

Lost < 4096 bytes.

Benchmark:
WIP
The consumer regression happens because persist_frozen_batches_to_disk blocks the event loop (single thread per shard). With buffered IO this takes <1ms, so it's unnoticeable. With Direct IO it takes ~15ms (worse at high disk load), blocking consumer reads from running.

Increasing messages_required_to_save and size_of_messages_required_to_save would reduce persist frequency and improve consumer latency, but the benchmark uses the default values.

Throughput is lower, and producer latency is more stable, as expected. However, there is a significant regression on the consumer side. I am working toward optimizing it (deferring the persistence). Since this feature is configurable, it can be safely turned off in the config.toml file. We can merge this and handle the optimization in a separate PR.

Update:
Deferred persist comes with better benchmark results (code not committed yet, it's a mess right now). However, it introduces correctness issues with purge, delete, and rotation operations that can race with in-flight persists. WIP...