15 changes: 9 additions & 6 deletions .github/workflows/ci.yml
@@ -67,12 +67,13 @@ jobs:

fuzz-smoke:
# M9 (§14.5): a SHORT per-PR smoke of the fuzz targets (F1 recovery, F2
-# decoder). This lane is real and BLOCKING — a reproducible crash here is a
-# genuine D11 bug and must red the PR (not advisory). It is distinct from the
-# time-boxed nightly fuzz lane (fuzz.yml) and from any H1 sign-off: M9 fuzz
-# findings gate M9-relevant PRs; they never red an H1 dispatch run. The full
-# N-CPU-hour §14.13 release gate runs on a dedicated runner, not here.
-name: fuzz smoke (F1/F2)
+# decoder, F3 structure-aware classifier). This lane is real and BLOCKING — a
+# reproducible crash here is a genuine D4/D5/D10/D11 bug and must red the PR
+# (not advisory). It is distinct from the time-boxed nightly fuzz lane
+# (fuzz.yml) and from any H1 sign-off: M9 fuzz findings gate M9-relevant PRs;
+# they never red an H1 dispatch run. The full N-CPU-hour §14.13 release gate
+# runs on a dedicated runner, not here.
+name: fuzz smoke (F1/F2/F3)
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
@@ -98,6 +99,8 @@ jobs:
run: cargo +nightly fuzz run recovery --target x86_64-unknown-linux-gnu -- -runs=20000 -rss_limit_mb=4096
- name: F2 decode smoke (bounded; a crash reds the PR)
run: cargo +nightly fuzz run decode --target x86_64-unknown-linux-gnu -- -runs=40000 -rss_limit_mb=4096
+- name: F3 structure smoke (bounded; a crash reds the PR)
+run: cargo +nightly fuzz run structure --target x86_64-unknown-linux-gnu -- -runs=8000 -rss_limit_mb=4096

dirfsync-presence:
# M8 §14.4d Tier 1 (PRIMARY): the deterministic, FS-independent regression
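
The F3 `structure` smoke step added in this hunk drives a classifier oracle that the PR's status notes state as follows: an invalid interior record (a valid record still follows) must be fatal (D5), and an invalid last record must truncate at its offset (D4). A minimal Rust sketch of that decision is given here. The names `Expected` and `expected_outcome` are hypothetical and are not taken from the repository; the real target drives `Wal::open` and checks more properties than this.

```rust
// Illustrative only: every name below is hypothetical, not the crate's API.

/// What the oracle expects recovery to do after ONE localized mutation
/// has been applied to a segment built from `record_count` valid records.
#[derive(Debug, PartialEq, Eq)]
enum Expected {
    /// The mutation did not invalidate any record: everything survives.
    Intact { records: usize },
    /// D4: the last record is invalid, so recovery truncates at its offset
    /// and exactly the records before it survive.
    TruncatedAt { surviving: usize },
    /// D5: an interior record is invalid while a valid record still follows,
    /// so recovery must fail instead of silently truncating.
    Fatal,
}

/// Classify the expected outcome from the mutated record's index.
fn expected_outcome(record_count: usize, mutated: usize, invalidates: bool) -> Expected {
    assert!(mutated < record_count, "mutation must target a built record");
    if !invalidates {
        Expected::Intact { records: record_count }
    } else if mutated + 1 < record_count {
        // At least one valid record follows the corrupted one.
        Expected::Fatal
    } else {
        // The corrupted record is the last one: a torn tail.
        Expected::TruncatedAt { surviving: mutated }
    }
}
```

For example, with five built records, `expected_outcome(5, 2, true)` is `Expected::Fatal`, while `expected_outcome(5, 4, true)` is `Expected::TruncatedAt { surviving: 4 }`.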
4 changes: 2 additions & 2 deletions .github/workflows/fuzz.yml
@@ -35,8 +35,8 @@ jobs:
strategy:
fail-fast: false
matrix:
-# F3–F4 are appended here as their slices land.
-target: [recovery, decode]
+# F4 is appended here as its slice lands.
+target: [recovery, decode, structure]
steps:
- uses: actions/checkout@v4
- name: Install nightly toolchain
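
The `decode` target in this matrix is described in the PR's status notes as asserting bounds-soundness on any returned record: `payload_len ≤ max`, `framed_len ≤ buf.len()`, `framed_len ≥ 20`, 8-aligned, and `20 + payload_len ≤ framed_len`. The sketch below writes those five checks as one function. `DecodedLens` and `check_bounds` are hypothetical names, and reading "8-aligned" as "`framed_len` is a multiple of 8" is an assumption.

```rust
// Illustrative only: the struct and function names are hypothetical.

/// Header size taken from the notes (`framed_len >= 20`).
const HEADER_LEN: usize = 20;
/// ASSUMPTION: "8-aligned" means the framed length is a multiple of 8.
const ALIGNMENT: usize = 8;

/// The two lengths a successful decode is assumed to report.
struct DecodedLens {
    payload_len: usize,
    framed_len: usize,
}

/// Check the record-level bounds listed in the notes, returning the first
/// violated one as a message.
fn check_bounds(rec: &DecodedLens, buf_len: usize, max_record_size: u32) -> Result<(), String> {
    if rec.payload_len as u64 > u64::from(max_record_size) {
        return Err(format!(
            "payload_len {} exceeds max_record_size {}",
            rec.payload_len, max_record_size
        ));
    }
    if rec.framed_len > buf_len {
        return Err(format!("framed_len {} exceeds buffer length {}", rec.framed_len, buf_len));
    }
    if rec.framed_len < HEADER_LEN {
        return Err(format!("framed_len {} is shorter than the header", rec.framed_len));
    }
    if rec.framed_len % ALIGNMENT != 0 {
        return Err(format!("framed_len {} is not 8-aligned", rec.framed_len));
    }
    // checked_add guards against overflow on adversarial lengths.
    match HEADER_LEN.checked_add(rec.payload_len) {
        Some(needed) if needed <= rec.framed_len => Ok(()),
        _ => Err(format!(
            "header plus payload_len {} does not fit in framed_len {}",
            rec.payload_len, rec.framed_len
        )),
    }
}
```

Including `max_record_size = 0` among the tested bounds is what keeps the first check from being vacuous: a 5-byte payload that passes under a 4096 limit is rejected under a limit of 0.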
2 changes: 2 additions & 0 deletions CLAUDE.md
@@ -109,6 +109,8 @@ The entire value of this component is **correct behavior under crashes and fault

## Project status (keep this updated)

+- **LATEST (2026-06-28): CORRECTNESS FIX (gates F3) — closed a D3/D5 sentinel silent-truncation hole that F3 surfaced (issue #26).** `record::decode` returned `Decoded::Sentinel` on `rec_type == 0` **alone**, *before* the CRC check, so a single-bit corruption of an **interior** record's `rec_type` byte (`1`→`0`) was mistaken for the end-of-records sentinel and **silently dropped every acked record after it** (`durable_lsn` rewound) — violating **D3** (no loss ≤ a returned `durable_lsn`) and **D5** (mid-log corruption must be fatal). **Fix (one line in `src/record.rs::decode`):** a sentinel is recognized only by a full **all-zero 20-byte header** (`rec_type == REC_TYPE_SENTINEL && buf[..20].iter().all(|&b| b == 0)`, short-circuited on `rec_type` so the Full-record hot path stays a single byte-compare). A corrupt `rec_type==0, crc≠0` record now falls through the existing ladder → CRC fails → `Invalid(BadCrc)` → `classify` → interior ⇒ fatal `TornMidLog` (D5) / tail ⇒ torn-tail truncate (D4); **no new code path, D11 preserved**. Genuine sentinels (pre-alloc zero region / §8.2.1 zeroing) are always all-zero ⇒ nothing legitimate lost. **Tests:** new `recovery.rs` regression tests at **interior (D5)** and **tail (D4 + idempotent-zeroed reopen, D10)** — both **fail before the fix, pass after** (demonstrated); `record::sentinel_header_detected` fixture corrected to the honest contract (a real record with a zeroed `rec_type` ⇒ `Invalid(BadCrc)`; all-zero ⇒ `Sentinel`) — **note:** the designer's writeup said the old `0xFF`-filled fixture would be `BadCrc`, but that buffer's `0xFF` length trips `LengthTooLarge` first, so the fixture now uses a genuine record to actually exercise the CRC-catches-`rec_type` mechanism (still `Invalid`, never `Sentinel`). **Blast radius walked:** `segment.rs:259` falls through to the full-record decode → `Invalid` (verified, no edit); `reader.rs:134` already treats `CleanEnd|Invalid` identically ⇒ **reader semantics unchanged** (a corrupt `rec_type==0` tail record shifts `CleanEnd`→`Invalid`, both end the live stream per §15.2). **F3 oracle updated in lockstep:** `Mutation::ZeroRecType` is now `invalidates() == true` (a CRC-caught corruption, no longer a sentinel) — F3 re-smoked **80 000 runs, exit 0, zero crashes, cov 589** (up from 561, since ZeroRecType now exercises the corruption arms). **Spec:** §8.2 step 1, §5.3 table row, §5.4, §4 D5 prose tightened to "all-zero header" + v6 changelog bullet. `cargo test` (both configs), `clippy --all-targets -D warnings` (both), `cargo fmt --check`, MSRV 1.85 green. Ships on the F3 branch (PR #25).
+- **LATEST (2026-06-28): M9 slice 3 — F3 structure-aware classifier fuzz LANDED, built + smoke-green here; gate stays OPEN.** New `fuzz/fuzz_targets/structure.rs` (third cargo-fuzz bin). Builds a **valid dense single segment** (the harness owns the correct CRCs via the `fuzzing` generators, so the deep classifier states a blind byte fuzzer never reaches are hit every run), applies ONE fuzzer-chosen **localized mutation** at a chosen record (flip CRC / flip a CRC-covered body byte / zero `rec_type`→sentinel / extend `length` / tamper padding / reserved `rec_type`+re-CRC via the public `crc32c`), then drives the **real public `Wal::open`**. **Sharp classifier oracle:** invalid **interior** record (a valid record still follows) ⇒ MUST be fatal `TornMidLog`/`Corruption` (**D5**, never silent truncation); invalid **last** record ⇒ MUST truncate at its offset (**D4**); `rec_type`→0 ⇒ sentinel/clean-end. Plus the surviving suffix is a dense **byte-identical** prefix of the built records (**D6/D10**), an **idempotent reopen** yields a clean tail (**D7**, durable zeroing), and the forward scan stays within `scan_bound` (**D11**). **Smoke-green: 150 000 runs, exit 0, zero crashes** (cov 561); corpus `cargo fuzz cmin`'d (111 entries). **Falsifiability shown**: forcing `forward_scan_finds_valid` to always report "no continuation" (the classic D5 bug — mid-log corruption silently truncated) trips `D5: interior corruption returned Ok (silent truncation!)`, then reverted. **No `src/` change** (`git diff src/` empty). CI: `structure` added to `fuzz.yml` matrix + the per-PR smoke in `ci.yml` (now `fuzz smoke (F1/F2/F3)`, a crash reds the M9 PR). `cargo fmt --check`, `clippy --all-targets -D warnings`, `cargo test`, `actionlint` green. **Still NOT done in M9:** F4 (op-script oracle); Miri; `!Sync` trybuild + dir-lock; loom; soak; CI-matrix tidy-up; the N-CPU-hour release-gate observation.
- **LATEST (2026-06-28): M9 slice 2 — F2 single-record decoder fuzz LANDED, built + smoke-green here; gate stays OPEN.** New `fuzz/fuzz_targets/decode.rs` (second cargo-fuzz bin). The **raw fuzzer bytes are the decode buffer** (no `arbitrary` envelope — chosen after the struct-`Arbitrary` byte layout proved fragile; the corpus is now just record bytes), decoded against a boundary-biased `max_record_size` set `{0,1,7,8,64,4096,1<<20,u32::MAX}` so the length bound is hit from both sides for any record the buffer encodes — **including `max < payload`, which keeps the payload-bound assertion non-vacuous**. Asserts bounds-soundness on any returned record: `payload_len ≤ max`, `framed_len ≤ buf.len()`, `framed_len ≥ 20`, 8-aligned, `20 + payload_len ≤ framed_len` (D11, record level). Because a blind byte fuzzer essentially never synthesizes a CRC-valid frame, the corpus is **seeded with genuine CRC-valid records** (a Python `crc32c` generator self-checked against the canonical `0xE3069283` vector so it matches the `crc32c` crate) + `cargo fuzz cmin` (17 entries); **falsifiability shown**: disabling the decoder's length bound trips `payload_len 5 exceeds max_record_size 0` on a valid seed, then reverted. **Smoke-green: 300 000 runs, exit 0, zero crashes**; **no `src/` change** (`git diff src/` empty). CI: `decode` added to `fuzz.yml` matrix + the per-PR smoke in `ci.yml` (renamed `fuzz smoke (F1/F2)`, both targets, a crash reds the M9 PR). `cargo fmt --check`, `clippy --all-targets -D warnings`, record unit tests green. **Still NOT done in M9:** F3 (structure-aware), F4 (op-script oracle); Miri; `!Sync` trybuild + dir-lock; loom; soak; CI-matrix tidy-up; the N-CPU-hour release-gate observation.
- **LATEST (2026-06-27): M9 started — F1 recovery-parser fuzz LANDED (slice 1 of M9), built + smoke-green here; the N-CPU-hour gate stays OPEN.** New `fuzzing` Cargo feature (zero-cost when off) gates a `#[doc(hidden)] pub mod fuzzing` in `src/lib.rs` (exposes the internal parse entry points for the cargo-fuzz targets) and the **bounded-scan instrumentation** in `src/recovery.rs`. Per the designer's load-bearing fix, the `max_record_size + 28` bound is hoisted into one `recovery::scan_bound(max_record_size)` symbol used by **both** the real `forward_scan_finds_valid` loop's window **and** the in-loop `assert!`/thread-local probe — so the gate measures **production**, not a harness copy, and the bound cannot drift. New `fuzz/` cargo-fuzz crate (libFuzzer + `arbitrary` + ASan; standalone, never published): `fuzz/fuzz_targets/recovery.rs` (F1). **Primary surface is the real public `Wal::open`** over an adversarial *directory* of segment files — fuzzer-controlled filenames + `base_lsn`s (out-of-order/duplicate/gapped/`0`/malformed-name), valid-header dense bodies and pure garbage — so filename-parse → discovery → sort → header validation → §8.4 incomplete-highest discard → cross-segment continuity → `recover_segment` are all in the blast radius (D11/D2/contiguity), with a secondary single-file `recover_segment` probe asserting the bound directly. **Built with `cargo +nightly fuzz build` and smoke-green: 60 000 runs, exit 0, zero crashes**, corpus = the fuzzer-grown, `cargo fuzz cmin`-minimized coverage-preserving set (`fuzz/corpus/recovery/`, ~174 entries reaching the multi-segment-continuity coverage that hand-authored entropy seeds miss — per the designer's review note). **Falsifiability shown** (§14.0.3): widening the scan loop past `scan_bound` trips the in-loop `assert!` (`distance 4128 > 4124`), then reverted. 
**Framing (designer note, do not over-read):** the bounded-scan counter holds **structurally** (the loop window *is* `scan_bound`, so `distance ≤ scan_bound` for every input) — it is a **drift/regression guard**, not the headline; the substantive D11 proof in F1 is the **crash-free / no-OOB / no-unbounded-alloc / termination** surface over adversarial inputs. CI: new `.github/workflows/fuzz.yml` (nightly + dispatch, time-boxed, loud "contingent, NOT the N-CPU-hour gate" banner, uploads corpus/artifacts) + a **blocking per-PR smoke** in `ci.yml` (a reproducible crash reds the M9 PR — flag #3; never reds an H1 *dispatch* run). `cargo test` (no feature + `--features fuzzing`, 84 lib + all integration), `cargo clippy --all-targets -D warnings` (both configs), `cargo fmt --check`, MSRV 1.85 (both configs), `cargo build` (no feature ⇒ zero release impact) all green; `actionlint` clean on both workflows. **F4's crash model (when it lands) is the process-crash state machine, not power loss** — flag #2. **Still NOT done in M9:** F2 (decoder), F3 (structure-aware), F4 (op-script oracle); Miri; `!Sync` trybuild + dir-lock; loom publish-barrier; soak; CI-matrix tidy-up; and the F1 N-CPU-hour release-gate observation on a dedicated runner.
- **LATEST (2026-06-25, PRs #20 + #21 off `main`): dm-flakey CI now RUNS, H3-physical PASSES, §14.4d is three-tier.** PR #20 (`claude/m8-dmflakey-ci-fixes`) fixes the hosted dm-flakey gate: provision `linux-modules-extra-$(uname -r)` + `modprobe dm_flakey` (dm-flakey **is** reachable on hosted Azure runners — no self-hosted runner needed), `cmd_check` queries `dmsetup targets` **as root**, and dm table reloads use `dmsetup suspend --noflush --nolockfs` in **both** `flakey_fault` and `flakey_up` (a default suspend's lockfs **freeze** is a full fs-sync that either EIO'd through the erroring target — misread as a §12 violation — or persisted the un-synced data before the drop, defeating the §14.4d controls). **Result: H3-physical ext4 PASSES** (source-confirmed block-layer EIO → §12 poison; evidence on issue #16). PR #21 (`claude/m8-dirfsync-tiers`, stacked on #20) resolves §14.4d per the designer: **the dir-fsync omission is NOT reproducible on ext4/xfs/btrfs** — those journaling FSes transitively persist a new file's dir entry on the segment's own `fsync` (AFSNCE OSDI '14, §18), masking it; `fsync_dir` is kept as a portable-durability safeguard. Three tiers: **Tier-1 (PRIMARY, per-PR, deterministic) = `scripts/m8/dirfsync-presence.sh`** straces the roll path, asserts correct issues the roll-time dir-`fsync` while `inject_no_dir_fsync` does not — **RUN+green here** (`correct=5` vs `inject=1`), wired into `ci.yml`; **Tier-2 = behavioral power-loss via a synchronized mid-run cut** (`src/bin/dirfsync_cut_workload.rs` rolls once, acks a record into the new segment, blocks with the dirent dirty; harness activates `drop_writes` *before* kill/umount, fsck, remount, verify) — **CLOSED as a DOCUMENTED NEGATIVE RESULT (PR #21, owner Fedora 43):** the inject build recovers fully on EVERY config tested — ext4/xfs/btrfs, journal-less ext4 (incl. 
`ext2`-format), and the last attempt, journaled ext4 `data=writeback` (the ext4 driver's weakest ordering; `data=writeback` weakens data ordering, not the metadata/dirent). The dirent reaches disk transitively via the file's own `fdatasync` everywhere. **Mechanism correction:** the earlier "ext2 block-adjacency" claim is RETRACTED — dmesg shows `ext2`-format is serviced by the **ext4 driver journal-less** on modern kernels (standalone ext2 driver removed in Linux 6.9); mechanism not isolated. No readily-available Linux FS exposes it behaviorally ⇒ honest negative result, not a gap. Tier-1 strace carries the DoD; `fsync_dir` retained as a POSIX-portability safeguard. (Note: `data=writeback` requires a journal — NOT combinable with `-O ^has_journal`.); **Tier-3 = ext4/xfs/btrfs INCONCLUSIVE-by-design** (informational, never red on a masked miss, still red on a correct-build data loss). dm-flakey harness also got `wipefs`/zero-before-mkfs + `udevadm settle` + `dmsetup remove --retry/-f --deferred` (fixes the back-to-back "device busy"). Docs corrected (design §14.4d note + §14.13 row, runbook three-tier, this block). `shellcheck`+`cargo fmt --check` clean; the strace gate is self-verified green. **§14.4d behavioral (Tier-2) is now CLOSED as a documented negative result** (Tier-1 satisfies the DoD). **Still owner/CI to observe:** H1 power-pull.
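
The sentinel fix recorded in the status log above can be sketched in isolation. The notes give the new condition as `rec_type == REC_TYPE_SENTINEL && buf[..20].iter().all(|&b| b == 0)`; the sketch below contrasts it with the old `rec_type == 0` check. The function names are hypothetical, the position of `rec_type` inside the header is not stated in the notes (so it is passed in separately), and the length guard via `get` is an addition for safety in a standalone example, not a claim about `src/record.rs`.

```rust
// Illustrative only: function names are hypothetical, not the crate's API.

/// Header size from the notes ("all-zero 20-byte header").
const HEADER_LEN: usize = 20;
/// The notes identify the sentinel by `rec_type == 0`.
const REC_TYPE_SENTINEL: u8 = 0;

/// Behavior before the fix: `rec_type` alone decides, before any CRC check,
/// so a corrupted interior record with `rec_type` flipped to 0 looks like
/// the end-of-records sentinel.
fn is_sentinel_before_fix(rec_type: u8) -> bool {
    rec_type == REC_TYPE_SENTINEL
}

/// Behavior after the fix: a sentinel needs the whole header to be zero.
/// The `rec_type` comparison comes first, so a normal record is rejected
/// after a single byte compare.
fn is_sentinel_after_fix(rec_type: u8, buf: &[u8]) -> bool {
    rec_type == REC_TYPE_SENTINEL
        // ASSUMPTION: a buffer shorter than a header is never a sentinel.
        && buf
            .get(..HEADER_LEN)
            .is_some_and(|header| header.iter().all(|&b| b == 0))
}
```

A header whose `rec_type` is 0 but which still carries nonzero bytes (for instance a leftover CRC) is a sentinel under the old check and not under the new one, so it reaches the CRC check and is classified as corruption instead of a clean end.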