Stashing these here for later
Optimization Review — PR #5 "1.0.0" (binary-ensemble)
PR: peterrrock2/binary-ensemble#5 · branch 1.0.0 @ 6c20a7c
Scope: A performance-focused second opinion, requested by the author (Peter), weighted toward the two areas he's currently finishing — (1) seek strategies in the lookup path that avoid a decode, and (2) a snapshotting strategy for better fast-forward on TwoDelta — plus general hot-path/memory opportunities.
Companion doc: correctness/safety/API/docs/CI review is in PR5-review.md.
Caveat: some of the seek/snapshot work is in-flight/uncommitted. This describes the branch as reviewed; where a finding is exactly what the new work addresses, it's confirmation/validation of the direction rather than news.
Bottom line
Both instincts are right and the payoffs are large/asymptotic. Two structural facts in the current branch make the work necessary, both verified directly in the code:
- Lookup fully bit-unpacks (and clones) every frame before the target.
extract_assignment_ben (ops/extract/mod.rs:53-74) drives next_record_ben, which calls frame.expand() on every frame and assignment.clone() per record (ben.rs:111,115). For Standard/MkvChain this is O(stream)·O(nodes) work where O(skipped frames) header-reads would do — the frame's serialized length is fully self-describing from its 6-byte header.
- TwoDelta emits snapshots only on non-2-swap transitions — never on a cadence. Snapshots (
pending_full_assignment) are set only in the Snapshot/run-too-long arms (xben.rs:266-270, mirror in ben.rs); crossing twodelta_chunk_size calls flush_chunk_inner, which writes a columnar batch of deltas, not a snapshot (xben.rs:245-247). So a clean pairwise-ReCom chain — the intended workload — produces one anchor snapshot and then millions of deltas, and there is no offset index anywhere. Fast-forward therefore has nothing to anchor to today. The DEFAULT_TWODELTA_CHUNK_SIZE = 10000 knob governs only LZMA2 batching, not replay depth (the spec confirms: "the batch size affects only compression and framing, never the decoded result").
So the snapshotting work is not merely an optimization — it's the prerequisite that makes seek/fast-forward possible. An offset index alone would not help: with one anchor snapshot it would only ever point at sample 1. Cadence-based snapshots must come first; the index second.
Format-stability note (ties to the correctness review): every read-side and constant-factor change here is wire-compatible. Periodic snapshots are ordinary legal frames — existing readers decode them transparently, and the .ben/.xben bytes for existing inputs are unchanged. The offset index is the only piece that needs new bytes, and it should live in the .bendl container directory, not the BEN/XBEN stream body (which would force a format major-bump per twodelta-format-spec.md). Done that way, the v1.0.0 stream fixtures stay byte-stable.
Severity (impact-based)
| # |
Severity |
Area |
Opportunity |
| O1 |
HIGH |
Focus #1 |
BEN Standard/MkvChain lookup skips frames by header instead of full-decoding them — O(stream)→O(skip). |
| O2 |
HIGH |
Focus #2 |
Add cadence-based snapshots so TwoDelta fast-forward has an anchor (enabler; wire-compatible). |
| O3 |
HIGH |
Focus #2 |
Add a snapshot offset index (in .bendl) so reaching sample N is O(1) seek + O(cadence) replay instead of O(N). |
| O4 |
HIGH |
General |
Stream the Python relabel_bundle/recompress_bundle transforms — today they buffer the whole stream 2–3× in RAM. |
| O5 |
MEDIUM |
Focus #1/#2 |
XBEN-TwoDelta lookup/subsample replays correctly but re-encodes every skipped frame — drop the throwaway re-encode. |
| O6 |
MEDIUM |
General/both |
Eliminate the per-step assignment.clone() in next_record_* (and the Python cursor) during replay. |
| O7 |
MEDIUM |
General |
XBEN reader overflow.drain(..consumed) per frame is O(remaining) memmove — use a cursor offset. |
| O8 |
MEDIUM |
General |
Decode→JSONL allocates a Value + String + concat per sample — reuse a buffer + serde_json::to_writer. |
| O9 |
MEDIUM |
General |
flush_chunk_inner writes the chunk in dozens of 2–4 byte write_alls straight into the xz encoder — coalesce into one buffer. |
| O10 |
LOW |
Focus #2 |
Make snapshot cadence tunable (own knob, not the chunk size); consider scaling it to assignment size. |
| O11 |
LOW |
Focus #1 |
CLI lookup only handles BEN — an .xben input errors instead of dispatching to extract_assignment_xben. |
| O12 |
LOW |
General |
count_samples_ben allocates+reads each payload to discard it (reuse the O1 skip primitive); minor reserve/with_capacity and a small directory clone. |
Focus #1 — Lookup seek strategies that avoid a decode
O1 [HIGH] — Header-only skip for BEN Standard/MkvChain. A frame's length is self-describing: Standard is a 6-byte header (max_val_bits u8, max_len_bits u8, n_bytes u32-BE) + n_bytes payload, count always 1; MkvChain is the same 6-byte header + payload + a trailing count u16-BE (decode.rs:173-218). So to reach sample N you can read the header, learn n_bytes/count, and step over the payload — no bit-unpack, no assignment alloc, no clone — expanding only the target frame.
- Recommended shape: add a
BenDecodeFrame::skip/read_header_only companion to from_reader. Offer a R: Read + Seek overload (the CLI passes BufReader<File>, which is Seek) that does seek(SeekFrom::Current(n_bytes)), and keep a Read-only io::copy(&mut reader.take(n_bytes), &mut io::sink()) fallback for pipes/&[u8]. (MkvChain wrinkle: count trails the payload, so even header-only must consume/seek past n_bytes to read it.)
- Keep the existing hardening on the skip path: validate
count != 0 and run check_payload_len(n_bytes) (cap 1<<26) before trusting n_bytes for a seek — a corrupt huge length must error, not seek to a wild offset.
- Wire format unchanged; strengthens the lazy-decode invariant (skipped frames are never materialized).
O5 [MEDIUM] — XBEN-TwoDelta lookup re-encodes throwaway frames. XBEN can't header-skip (xz transport isn't seekable and deltas are relative — full sequential replay is required), and I verified the current path is correct: next_frame_xben→next_record_xben keeps the delta chain alive across all frames and re-encodes via encode_ben32_assignments, so the target decodes properly. But every skipped frame pays a throwaway encode_ben32_assignments (frames.rs:151). For XBEN-TwoDelta lookup, drive next_record_xben directly and return the materialized assignment at the target — skipping both the per-frame re-encode and the target re-decode. (Plain extract_assignment_xben for Standard/MkvChain already avoids decoding skipped frames — good.)
O11 [LOW] — CLI lookup is BEN-only. lookup.rs:25 calls extract_assignment_ben unconditionally; an .xben file would be misread as a banner and error. extract_assignment_xben exists and is tested at the library level but isn't reachable from the CLI. Decide whether lookup should sniff the container (banner/extension) and dispatch. Behavioral gap, not a perf issue, but it's in the surface being reworked.
Read-side design for BEN-TwoDelta lookup (depends on O2/O3): walk frames header-only, recognizing the leading tag (BEN_TWODELTA_SNAPSHOT_TAG=0x00 / DELTA=0x01, io/reader/twodelta.rs); remember the byte offset + sample number of the last snapshot ≤ target; then Seek back to it, expand() it once, and replay only the intervening deltas. Replays are then bounded by the snapshot interval rather than the whole stream. A non-Seek fallback retains the most recent snapshot's decoded assignment as it walks forward.
Focus #2 — TwoDelta snapshotting / fast-forward
Your direction (periodic snapshots + an offset index) is the standard, correct design for delta-chain random access. The diagnosis above (one anchor snapshot, no index) confirms why it's needed. Sequencing and traps:
O2 [HIGH, wire-compatible — do first] — Cadence-based snapshots. Force a full snapshot every C samples regardless of transition kind. The machinery already exists and reseeds masks correctly (write_twodelta_snapshot in ben.rs, flush_twodelta_full in xben.rs:469-492); a periodic forced call is small. A forced snapshot mid-chunk must flush the open chunk first (the existing Snapshot arm does this, xben.rs:267) so the "deltas precede the following full frame" invariant holds. This alone makes "start from nearest snapshot" possible and existing readers decode the extra snapshots transparently.
O3 [HIGH, needs a guarded .bendl addition] — Snapshot offset index. Add a trailing snapshot directory mapping sample_number → byte offset of the snapshot frame as a new .bendl directory entry / asset, so a reader does an O(1) lookup + seek then O(C) replay. Pitfalls to design against:
- Location:
.bendl container, not the stream body — keeps .ben/.xben and the v1.0.0 fixtures byte-stable.
- Append/compact survival: store offsets relative to
stream_offset so they're invariant under compaction (which rewrites only the post-stream tail, compact.rs); the index rides the existing directory-rebuild on append.
- Self-contained anchors: snapshots are already independently decodable MkvChain frames that reseed masks — preserve that; a periodic snapshot must be a real full frame, never a delta.
- Reader validation: before trusting an index offset, confirm the byte there is a snapshot tag (
0x00) and resume after the full frame; extend test_fixture_mutations.rs to mutate index entries and assert panic-freedom.
O10 [LOW] — Cadence tuning. Expose a real snapshot_interval (the current --chunk-size/twodelta_chunk_size only batches columns and can't bound replay). Space cost is tiny: a snapshot ≈ its RLE size (~4·D bytes for D districts in XBEN full frames), so at C=10000, D≈100 that's ~0.04 B/sample to bound replay to ≤10000 deltas — extremely favorable; even C=1000 costs ~0.4 B/sample. Since replay time is O(C·M) in node count M, consider "snapshot every C deltas or every K delta-bytes" and aligning cadence to chunk boundaries so index offsets land on tags the reader already resyncs on. Encode side already streams (no whole-chain buffering); the only per-snapshot cost is an unavoidable O(M) mask rebuild.
General hot-path & memory wins
O4 [HIGH] — Stream the Python whole-bundle transforms. recompress_bundle read_to_ends the BEN stream into a Vec then builds the whole XBEN output in another Vec (recompress.rs:86-97); relabel_bundle does the same plus a third copy on write (relabel.rs:156-195). Peak ≈ 2–3× the full stream — directly contradicting the streaming selling point on the large-ensemble case these are used for. Both underlying functions already take R: Read/BufRead + W: Write, and assignment_stream_reader() (bounded, CRC-verifying) + BendlStreamSession (Write) exist — so pass them through instead of buffering. (Same issue as MEDIUM finding #3 in the correctness review; quantified here.)
O6 [MEDIUM] — Per-step clone during replay. next_record_ben/next_record_xben do previous_assignment = Some(assignment.clone()) every record (ben.rs:115, xben.rs:226/242/261) — an O(M) copy per sample, even though only TwoDelta's expand reads previous_assignment (Standard/MkvChain ignore it). for_each_assignment_ben already shows the clone-free pattern (move the buffer back after the borrow). Cheapest fix: only retain the previous for TwoDelta; for the Iterator API, keep an internal scratch cloned only when actually yielded. The Python SampleCursor::next clones again on top (cursor.rs:78,99) — so Python sample-by-sample iteration currently costs ~2 clones/sample.
O7 [MEDIUM] — XBEN overflow drain. After each frame, inner.overflow.drain(..consumed) shifts all remaining buffered bytes down (xben.rs:213/222/227/262) — O(remaining) per frame, so decoding K frames from one buffer fill is O(K·buffer). Track a consumed_offset cursor and slice &overflow[offset..], compacting only on refill → amortized O(1). Touches all the overflow-indexing helpers, so do it with the existing tests.
O8 [MEDIUM] — JSONL per-sample allocation. write_all_jsonl builds a json! Value (heap map + array), serializes to a String, then concatenates "\n" per sample (mod.rs:226-231; duplicated in verify.rs:552) — ~3 allocations/sample on every decode→JSONL. Keep a reusable buffer, serde_json::to_writer the {"assignment":...,"sample":...} slice directly, push b'\n'. Output bytes identical (key order matches). Fix both copies together to avoid drift.
O9 [MEDIUM] — Coalesce XBEN chunk writes. flush_chunk_inner issues ~5 + 8N + 2·total_runs separate 2–4 byte encoder.write_all(&x.to_be_bytes()) calls straight into XzEncoder (no BufWriter between) (xben.rs:494-524) — each crosses the liblzma boundary. Build one scratch Vec<u8> and write_all it once per chunk. Wire bytes identical (determinism preserved).
O12 [LOW] — Misc. count_samples_ben walks frame boundaries without expand() (good) but still allocates+reads each payload to discard it (ben.rs:125-135) — the O1 skip primitive removes that. decode_ben32_line could reserve(count) per run; assign_to_rle could with_capacity(input.len()) (rle/mod.rs:13). verify_all_asset_checksums clones the (small) directory (reader.rs:464) — negligible.
Deliberately not flagged (checked, fine): the core encode_ben_to_xben path streams (no BEN→BEN32→xz triple copy — that only appears in the Python recompress wrapper, O4); CRC is tee'd inline, not a second pass; xz settings are fixture-deterministic (don't "optimize" thread count/block size — it changes bytes); bundle asset buffering is bounded metadata, not the sample stream; graph MLC/RCM hot paths pre-size reasonably (any remaining cost is algorithmic, out of scope).
Suggested sequencing
- O2 (cadence snapshots) + O1 (BEN header-skip) — wire-compatible, unlock fast-forward and fast lookup.
- O6/O8/O9 constant-factor wins (clone, JSONL, chunk writes) and O5 (XBEN-TwoDelta re-encode).
- O4 (stream the Python transforms — also a memory-correctness fix).
- O3 (snapshot offset index in
.bendl, with append/compact + reader validation) — the asymptotic seek win and the only piece needing a guarded format addition.
All read-side/constant-factor items preserve the wire format, lazy-decode, and streaming invariants. Verified locally that the existing suite (1182 tests) is green, so these are deltas against a known-good baseline; each change should keep test_format_stability and the round-trip proptests passing (no behavior/format change without a guard).
How this was reviewed
Three focused read-only analysis passes (lookup/seek; TwoDelta snapshot/fast-forward; general hot-path/memory), each calibrated to an impact-based severity rubric, then adversarial verification of every featured claim against the source. The two load-bearing diagnoses — lookup decoding every skipped frame, and snapshots being content-driven rather than cadence-driven — were confirmed by direct reads of ops/extract/mod.rs, io/reader/stream_reader/{ben,xben}.rs, codec/frames/decode.rs, and io/writer/stream_writer/xben.rs. No microbenchmark was run; the asymptotic claims rest on the code paths (which frames are expanded vs skipped), and the existing 1182-test suite was confirmed green as the baseline. No source files were modified — this is a review.
Optimization Re-check after fresh commits — PR #5 (binary-ensemble)
PR: peterrrock2/binary-ensemble#5 · branch 1.0.0
Compared: HEAD d57d79a against the prior optimization-review point 6c20a7c (20 new commits).
Companion docs: PR5-optimization-review.md (full O1–O12 review) and PR5-review.md (correctness/safety/API/docs/CI).
Verification: cargo test -p binary-ensemble on d57d79a → 1208 passed, 0 failed, 2 ignored (up from 1182; ~26 new tests covering extract/lookup/twodelta). The new optimization work lands on a green suite and the fast paths are covered.
Bottom line
Peter shipped the two focus areas and several other findings — and the headline lookup/fast-forward work looks correct and hardened. What remains is mostly constant-factor reader cleanups plus one structural follow-up (a persisted index) that needs a format addition. The single highest-value remaining lookup item is small: make the new snapshot interval tunable.
Status of the prior findings
| # |
Prior sev |
Status |
Evidence / what's left |
| O1 BEN std/mkv header-skip lookup |
HIGH |
✅ Largely done |
extract_assignment_ben uses into_frames() + expand_self_contained() on the target only — skipped frames are no longer bit-unpacked or cloned (extract/mod.rs:88-98). Remainder: skipped payloads are still read_exact'd into a Vec (decode.rs:185,207) — the seek path delegates std/mkv to the non-seek reader, so it doesn't seek past payloads. The expensive part (decode+clone) is gone; the residual is minor/IO-bound. |
| O2 cadence snapshots (TwoDelta) |
HIGH |
✅ Done (BEN) |
New DEFAULT_TWODELTA_SNAPSHOT_INTERVAL = 50_000 + twodelta_deltas_since_snapshot counter forces a snapshot every 50k delta frames (writer/twodelta.rs:13, stream_writer/ben.rs:71-75,126-141). This is the enabler that bounds replay. BEN-only by design — see notes. |
| O3 persisted snapshot offset index |
HIGH |
⬜ Open |
No index in .bendl. latest_twodelta_snapshot_before byte-scans frame headers from the start each lookup (extract/mod.rs:153-202) — cheap per frame (no decode, seeks past payloads) but O(samples) to locate the snapshot. Needs a guarded container-level addition. |
| O4 stream the Python transforms |
HIGH |
✅ Done |
recompress/relabel now pass assignment_stream_reader() straight into encode_ben_to_xben(BufReader::new(stream), &mut session, …) / relabel-into-session — no read_to_end, no whole-stream Vec (recompress.rs, relabel.rs). Bonus: driving the verified reader to EOF also checks the source CRC. |
| O5 XBEN-TwoDelta lookup re-encode |
MED |
◐ Open, minor |
extract_assignment_xben decodes only the target (good) but skipped XBEN-TwoDelta frames are still replayed+re-encoded. XBEN can't seek (LZMA2), so it's inherently sequential; only the throwaway re-encode is avoidable. Low priority. |
O6 per-record clone in Iterator path |
MED |
⬜ Open |
*previous_assignment = Some(assignment.clone()) still at ben.rs:152, xben.rs:269,285,306. The lookup path no longer hits this; impact is now scoped to sample-by-sample Iterator use (Python for a in decoder, ~2 clones/sample). |
O7 XBEN overflow.drain per frame |
MED |
⬜ Open |
Still overflow.drain(..consumed) at xben.rs:256,265,270,302,307. The +179 lines there were TwoDelta-lookup support, not the drain. |
| O8 JSONL per-sample allocation |
MED |
⬜ Open |
json!{}.to_string() + "\n" still per sample at mod.rs:231 and the duplicate verify.rs:561. |
| O9 coalesce XBEN chunk writes |
MED |
⬜ Open |
flush_chunk_inner (writer xben.rs) unchanged — still per-field 2–4 byte write_alls into the encoder. |
| O10 tunable snapshot cadence |
LOW→MED |
◐ Partial — now the top lookup lever |
The interval exists but is a hardcoded const, not a WriteOption (unlike chunk size, which is plumbed via with_twodelta_chunk_size). See re-prioritization. |
O11 CLI lookup XBEN dispatch |
LOW |
◐ Partial |
lookup now uses the fast seek path (extract_assignment_ben_seek, lookup.rs:23) — good — but still errors on an .xben input; extract_assignment_xben remains CLI-unreachable. |
O12 misc reserve/with_capacity |
LOW |
⬜ Open |
Unchanged; negligible. |
Also resolved (from the correctness review): the HIGH CI-gating finding — ci.yml now triggers on pull_request and runs cargo test + pytest tests/ as fast suites (ci.yml:12,58-59,75-77). Several corruption-hardening commits also landed (unchecked total runs, overlong stream len python side, more corrupt header protection, plus a new header checksum), covering the LOW 32-bit-overflow item and strengthening the reader.
What's left on the table (re-prioritized for the new code)
- [MED — highest-value lookup item now] Expose the snapshot interval; reconsider 50k for large maps (O10). With cadence snapshots in, the dominant single-lookup cost is the ≤50k delta replay after the anchor, which is
O(interval × district-size). On a large map (CO ~140k blocks) a worst-case lookup can replay up to 50k deltas (hundreds of ms → seconds), while a denser cadence costs almost nothing in space (≈4·D/interval bytes/sample — hundredths of a byte at D≈100). Plumb a snapshot_interval through WriteOptions/CLI (mirror with_twodelta_chunk_size), default it relative to assignment size or expose it, and document the space-vs-lookup-latency trade-off. Small change, biggest remaining payoff.
- [MED, needs a guarded
.bendl addition] Persisted snapshot offset index (O3). Removes the O(samples) header-scan, making snapshot location O(1). Most valuable for repeated random access and very deep streams; secondary to O10 for a single lookup. Design caveats: store in the container directory (not the stream body), offsets relative to stream_offset, validate the offset lands on a 0x00 tag, and extend test_fixture_mutations.rs to mutate index entries.
- [MED] Reader constant-factor trio, untouched: O7 (overflow
drain → cursor), O8 (JSONL reuse a buffer + serde_json::to_writer; fix both copies), O9 (coalesce chunk writes). Broadest steady-state wins for full decode/recompress throughput, independent of the lookup work.
- [LOW] O6 (clone in the
Iterator/Python path), O11 (CLI .xben lookup dispatch), O1 remainder (seek past std/mkv payloads on seekable input), O12.
New observations on the fresh code
- The fast TwoDelta lookup looks correct and is hardened.
latest_twodelta_snapshot_before validates count != 0, check_payload_len, and check_twodelta_run_width during the scan (extract/mod.rs:181-231), and the replay reuses BenStreamReader by chaining a synthetic TWODELTA_BEN_BANNER in front of the seeked snapshot (extract/mod.rs:131-135) — a clean way to restart mid-stream. The first frame is guaranteed a snapshot, so the "no snapshot before sample" path is unreachable on well-formed streams.
- Cadence is BEN-only, and that's defensible — but worth a doc line. XBEN can't be randomly seeked (LZMA2), so
extract_assignment_xben is sequential regardless and cadence snapshots wouldn't help it. Practical implication: a .bendl backed by a BEN stream gets fast random-access lookup; one backed by XBEN does not. A real size-vs-seek trade-off worth stating in the docs so users pick the container deliberately.
- The interval counts frames, not samples (a repeat frame with
count>1 increments by 1) — correct, since replay cost is per frame.
Re-check method: diffed the optimization surface since 6c20a7c, read the reworked lookup/snapshot/transform code directly (ops/extract/mod.rs, io/writer/stream_writer/ben.rs, io/writer/twodelta.rs, ben-py/src/{recompress,relabel}.rs, .github/workflows/ci.yml), confirmed each prior finding's status against current HEAD, and re-ran the Rust suite. No source files were modified — this is a review.
Stashing these here for later
Optimization Review — PR #5 "1.0.0" (binary-ensemble)
PR: peterrrock2/binary-ensemble#5 · branch
1.0.0@6c20a7cScope: A performance-focused second opinion, requested by the author (Peter), weighted toward the two areas he's currently finishing — (1) seek strategies in the
lookuppath that avoid a decode, and (2) a snapshotting strategy for better fast-forward on TwoDelta — plus general hot-path/memory opportunities.Companion doc: correctness/safety/API/docs/CI review is in
PR5-review.md.Caveat: some of the seek/snapshot work is in-flight/uncommitted. This describes the branch as reviewed; where a finding is exactly what the new work addresses, it's confirmation/validation of the direction rather than news.
Bottom line
Both instincts are right and the payoffs are large/asymptotic. Two structural facts in the current branch make the work necessary, both verified directly in the code:
extract_assignment_ben(ops/extract/mod.rs:53-74) drivesnext_record_ben, which callsframe.expand()on every frame andassignment.clone()per record (ben.rs:111,115). For Standard/MkvChain this is O(stream)·O(nodes) work where O(skipped frames) header-reads would do — the frame's serialized length is fully self-describing from its 6-byte header.pending_full_assignment) are set only in theSnapshot/run-too-long arms (xben.rs:266-270, mirror inben.rs); crossingtwodelta_chunk_sizecallsflush_chunk_inner, which writes a columnar batch of deltas, not a snapshot (xben.rs:245-247). So a clean pairwise-ReCom chain — the intended workload — produces one anchor snapshot and then millions of deltas, and there is no offset index anywhere. Fast-forward therefore has nothing to anchor to today. TheDEFAULT_TWODELTA_CHUNK_SIZE = 10000knob governs only LZMA2 batching, not replay depth (the spec confirms: "the batch size affects only compression and framing, never the decoded result").So the snapshotting work is not merely an optimization — it's the prerequisite that makes seek/fast-forward possible. An offset index alone would not help: with one anchor snapshot it would only ever point at sample 1. Cadence-based snapshots must come first; the index second.
Severity (impact-based)
lookupskips frames by header instead of full-decoding them — O(stream)→O(skip)..bendl) so reaching sample N is O(1) seek + O(cadence) replay instead of O(N).relabel_bundle/recompress_bundletransforms — today they buffer the whole stream 2–3× in RAM.assignment.clone()innext_record_*(and the Python cursor) during replay.overflow.drain(..consumed)per frame is O(remaining) memmove — use a cursor offset.Value+String+ concat per sample — reuse a buffer +serde_json::to_writer.flush_chunk_innerwrites the chunk in dozens of 2–4 bytewrite_alls straight into the xz encoder — coalesce into one buffer.lookuponly handles BEN — an.xbeninput errors instead of dispatching toextract_assignment_xben.count_samples_benallocates+reads each payload to discard it (reuse the O1 skip primitive); minorreserve/with_capacityand a small directory clone.Focus #1 — Lookup seek strategies that avoid a decode
O1 [HIGH] — Header-only skip for BEN Standard/MkvChain. A frame's length is self-describing: Standard is a 6-byte header (
max_val_bits u8,max_len_bits u8,n_bytes u32-BE) +n_bytespayload, count always 1; MkvChain is the same 6-byte header + payload + a trailingcount u16-BE(decode.rs:173-218). So to reach sample N you can read the header, learnn_bytes/count, and step over the payload — no bit-unpack, no assignment alloc, no clone — expanding only the target frame.BenDecodeFrame::skip/read_header_onlycompanion tofrom_reader. Offer aR: Read + Seekoverload (the CLI passesBufReader<File>, which isSeek) that doesseek(SeekFrom::Current(n_bytes)), and keep aRead-onlyio::copy(&mut reader.take(n_bytes), &mut io::sink())fallback for pipes/&[u8]. (MkvChain wrinkle:counttrails the payload, so even header-only must consume/seek pastn_bytesto read it.)count != 0and runcheck_payload_len(n_bytes)(cap1<<26) before trustingn_bytesfor a seek — a corrupt huge length must error, not seek to a wild offset.O5 [MEDIUM] — XBEN-TwoDelta lookup re-encodes throwaway frames. XBEN can't header-skip (xz transport isn't seekable and deltas are relative — full sequential replay is required), and I verified the current path is correct:
next_frame_xben→next_record_xbenkeeps the delta chain alive across all frames and re-encodes viaencode_ben32_assignments, so the target decodes properly. But every skipped frame pays a throwawayencode_ben32_assignments(frames.rs:151). For XBEN-TwoDelta lookup, drivenext_record_xbendirectly and return the materialized assignment at the target — skipping both the per-frame re-encode and the target re-decode. (Plainextract_assignment_xbenfor Standard/MkvChain already avoids decoding skipped frames — good.)O11 [LOW] — CLI
lookupis BEN-only. lookup.rs:25 callsextract_assignment_benunconditionally; an.xbenfile would be misread as a banner and error.extract_assignment_xbenexists and is tested at the library level but isn't reachable from the CLI. Decide whetherlookupshould sniff the container (banner/extension) and dispatch. Behavioral gap, not a perf issue, but it's in the surface being reworked.Read-side design for BEN-TwoDelta lookup (depends on O2/O3): walk frames header-only, recognizing the leading tag (
BEN_TWODELTA_SNAPSHOT_TAG=0x00/DELTA=0x01, io/reader/twodelta.rs); remember the byte offset + sample number of the last snapshot ≤ target; thenSeekback to it,expand()it once, and replay only the intervening deltas. Replays are then bounded by the snapshot interval rather than the whole stream. A non-Seekfallback retains the most recent snapshot's decoded assignment as it walks forward.Focus #2 — TwoDelta snapshotting / fast-forward
Your direction (periodic snapshots + an offset index) is the standard, correct design for delta-chain random access. The diagnosis above (one anchor snapshot, no index) confirms why it's needed. Sequencing and traps:
O2 [HIGH, wire-compatible — do first] — Cadence-based snapshots. Force a full snapshot every C samples regardless of transition kind. The machinery already exists and reseeds masks correctly (
write_twodelta_snapshotinben.rs,flush_twodelta_fullin xben.rs:469-492); a periodic forced call is small. A forced snapshot mid-chunk must flush the open chunk first (the existingSnapshotarm does this, xben.rs:267) so the "deltas precede the following full frame" invariant holds. This alone makes "start from nearest snapshot" possible and existing readers decode the extra snapshots transparently.O3 [HIGH, needs a guarded
.bendladdition] — Snapshot offset index. Add a trailing snapshot directory mappingsample_number → byte offset of the snapshot frameas a new.bendldirectory entry / asset, so a reader does an O(1) lookup + seek then O(C) replay. Pitfalls to design against:.bendlcontainer, not the stream body — keeps.ben/.xbenand the v1.0.0 fixtures byte-stable.stream_offsetso they're invariant under compaction (which rewrites only the post-stream tail, compact.rs); the index rides the existing directory-rebuild on append.0x00) and resume after the full frame; extendtest_fixture_mutations.rsto mutate index entries and assert panic-freedom.O10 [LOW] — Cadence tuning. Expose a real
snapshot_interval(the current--chunk-size/twodelta_chunk_sizeonly batches columns and can't bound replay). Space cost is tiny: a snapshot ≈ its RLE size (~4·D bytes for D districts in XBEN full frames), so at C=10000, D≈100 that's ~0.04 B/sample to bound replay to ≤10000 deltas — extremely favorable; even C=1000 costs ~0.4 B/sample. Since replay time is O(C·M) in node count M, consider "snapshot every C deltas or every K delta-bytes" and aligning cadence to chunk boundaries so index offsets land on tags the reader already resyncs on. Encode side already streams (no whole-chain buffering); the only per-snapshot cost is an unavoidable O(M) mask rebuild.General hot-path & memory wins
O4 [HIGH] — Stream the Python whole-bundle transforms.
recompress_bundleread_to_ends the BEN stream into aVecthen builds the whole XBEN output in anotherVec(recompress.rs:86-97);relabel_bundledoes the same plus a third copy on write (relabel.rs:156-195). Peak ≈ 2–3× the full stream — directly contradicting the streaming selling point on the large-ensemble case these are used for. Both underlying functions already takeR: Read/BufRead+W: Write, andassignment_stream_reader()(bounded, CRC-verifying) +BendlStreamSession(Write) exist — so pass them through instead of buffering. (Same issue as MEDIUM finding #3 in the correctness review; quantified here.)O6 [MEDIUM] — Per-step clone during replay.
next_record_ben/next_record_xbendoprevious_assignment = Some(assignment.clone())every record (ben.rs:115, xben.rs:226/242/261) — an O(M) copy per sample, even though only TwoDelta'sexpandreadsprevious_assignment(Standard/MkvChain ignore it).for_each_assignment_benalready shows the clone-free pattern (move the buffer back after the borrow). Cheapest fix: only retain the previous for TwoDelta; for theIteratorAPI, keep an internal scratch cloned only when actually yielded. The PythonSampleCursor::nextclones again on top (cursor.rs:78,99) — so Python sample-by-sample iteration currently costs ~2 clones/sample.O7 [MEDIUM] — XBEN overflow drain. After each frame,
inner.overflow.drain(..consumed)shifts all remaining buffered bytes down (xben.rs:213/222/227/262) — O(remaining) per frame, so decoding K frames from one buffer fill is O(K·buffer). Track aconsumed_offsetcursor and slice&overflow[offset..], compacting only on refill → amortized O(1). Touches all the overflow-indexing helpers, so do it with the existing tests.O8 [MEDIUM] — JSONL per-sample allocation.
write_all_jsonlbuilds ajson!Value(heap map + array), serializes to aString, then concatenates"\n"per sample (mod.rs:226-231; duplicated in verify.rs:552) — ~3 allocations/sample on every decode→JSONL. Keep a reusable buffer,serde_json::to_writerthe{"assignment":...,"sample":...}slice directly, pushb'\n'. Output bytes identical (key order matches). Fix both copies together to avoid drift.O9 [MEDIUM] — Coalesce XBEN chunk writes.
flush_chunk_innerissues ~5 + 8N + 2·total_runsseparate 2–4 byteencoder.write_all(&x.to_be_bytes())calls straight intoXzEncoder(noBufWriterbetween) (xben.rs:494-524) — each crosses the liblzma boundary. Build one scratchVec<u8>andwrite_allit once per chunk. Wire bytes identical (determinism preserved).O12 [LOW] — Misc.
count_samples_benwalks frame boundaries withoutexpand()(good) but still allocates+reads each payload to discard it (ben.rs:125-135) — the O1 skip primitive removes that.decode_ben32_linecouldreserve(count)per run;assign_to_rlecouldwith_capacity(input.len())(rle/mod.rs:13).verify_all_asset_checksumsclones the (small) directory (reader.rs:464) — negligible.Deliberately not flagged (checked, fine): the core
encode_ben_to_xbenpath streams (no BEN→BEN32→xz triple copy — that only appears in the Pythonrecompresswrapper, O4); CRC is tee'd inline, not a second pass; xz settings are fixture-deterministic (don't "optimize" thread count/block size — it changes bytes); bundle asset buffering is bounded metadata, not the sample stream; graph MLC/RCM hot paths pre-size reasonably (any remaining cost is algorithmic, out of scope).Suggested sequencing
.bendl, with append/compact + reader validation) — the asymptotic seek win and the only piece needing a guarded format addition.All read-side/constant-factor items preserve the wire format, lazy-decode, and streaming invariants. Verified locally that the existing suite (1182 tests) is green, so these are deltas against a known-good baseline; each change should keep
test_format_stabilityand the round-trip proptests passing (no behavior/format change without a guard).How this was reviewed
Three focused read-only analysis passes (lookup/seek; TwoDelta snapshot/fast-forward; general hot-path/memory), each calibrated to an impact-based severity rubric, then adversarial verification of every featured claim against the source. The two load-bearing diagnoses — lookup decoding every skipped frame, and snapshots being content-driven rather than cadence-driven — were confirmed by direct reads of
ops/extract/mod.rs,io/reader/stream_reader/{ben,xben}.rs,codec/frames/decode.rs, andio/writer/stream_writer/xben.rs. No microbenchmark was run; the asymptotic claims rest on the code paths (which frames are expanded vs skipped), and the existing 1182-test suite was confirmed green as the baseline. No source files were modified — this is a review.Optimization Re-check after fresh commits — PR #5 (binary-ensemble)
PR: peterrrock2/binary-ensemble#5 · branch
1.0.0Compared: HEAD
d57d79aagainst the prior optimization-review point6c20a7c(20 new commits).Companion docs:
PR5-optimization-review.md(full O1–O12 review) andPR5-review.md(correctness/safety/API/docs/CI).Verification:
cargo test -p binary-ensembleond57d79a→ 1208 passed, 0 failed, 2 ignored (up from 1182; ~26 new tests coveringextract/lookup/twodelta). The new optimization work lands on a green suite and the fast paths are covered.Bottom line
Peter shipped the two focus areas and several other findings — and the headline lookup/fast-forward work looks correct and hardened. What remains is mostly constant-factor reader cleanups plus one structural follow-up (a persisted index) that needs a format addition. The single highest-value remaining lookup item is small: make the new snapshot interval tunable.
Status of the prior findings
extract_assignment_benusesinto_frames()+expand_self_contained()on the target only — skipped frames are no longer bit-unpacked or cloned (extract/mod.rs:88-98). Remainder: skipped payloads are stillread_exact'd into aVec(decode.rs:185,207) — the seek path delegates std/mkv to the non-seek reader, so it doesn'tseekpast payloads. The expensive part (decode+clone) is gone; the residual is minor/IO-bound.DEFAULT_TWODELTA_SNAPSHOT_INTERVAL = 50_000+twodelta_deltas_since_snapshotcounter forces a snapshot every 50k delta frames (writer/twodelta.rs:13, stream_writer/ben.rs:71-75,126-141). This is the enabler that bounds replay. BEN-only by design — see notes..bendl.latest_twodelta_snapshot_beforebyte-scans frame headers from the start each lookup (extract/mod.rs:153-202) — cheap per frame (no decode, seeks past payloads) but O(samples) to locate the snapshot. Needs a guarded container-level addition.recompress/relabelnow passassignment_stream_reader()straight intoencode_ben_to_xben(BufReader::new(stream), &mut session, …)/ relabel-into-session — noread_to_end, no whole-streamVec(recompress.rs, relabel.rs). Bonus: driving the verified reader to EOF also checks the source CRC.extract_assignment_xbendecodes only the target (good) but skipped XBEN-TwoDelta frames are still replayed+re-encoded. XBEN can't seek (LZMA2), so it's inherently sequential; only the throwaway re-encode is avoidable. Low priority.Iteratorpath*previous_assignment = Some(assignment.clone())still at ben.rs:152, xben.rs:269,285,306. The lookup path no longer hits this; impact is now scoped to sample-by-sampleIteratoruse (Pythonfor a in decoder, ~2 clones/sample).overflow.drainper frameoverflow.drain(..consumed)at xben.rs:256,265,270,302,307. The +179 lines there were TwoDelta-lookup support, not the drain.json!{}.to_string() + "\n"still per sample at mod.rs:231 and the duplicate verify.rs:561.flush_chunk_inner(writerxben.rs) unchanged — still per-field 2–4 bytewrite_alls into the encoder.const, not aWriteOption(unlike chunk size, which is plumbed viawith_twodelta_chunk_size). See re-prioritization.lookupXBEN dispatchlookupnow uses the fast seek path (extract_assignment_ben_seek, lookup.rs:23) — good — but still errors on an.xbeninput;extract_assignment_xbenremains CLI-unreachable.reserve/with_capacityAlso resolved (from the correctness review): the HIGH CI-gating finding —
ci.ymlnow triggers onpull_requestand runscargo test+pytest tests/as fast suites (ci.yml:12,58-59,75-77). Several corruption-hardening commits also landed (unchecked total runs,overlong stream len python side,more corrupt header protection, plus a new header checksum), covering the LOW 32-bit-overflow item and strengthening the reader.What's left on the table (re-prioritized for the new code)
O(interval × district-size). On a large map (CO ~140k blocks) a worst-case lookup can replay up to 50k deltas (hundreds of ms → seconds), while a denser cadence costs almost nothing in space (≈4·D/intervalbytes/sample — hundredths of a byte at D≈100). Plumb asnapshot_intervalthroughWriteOptions/CLI (mirrorwith_twodelta_chunk_size), default it relative to assignment size or expose it, and document the space-vs-lookup-latency trade-off. Small change, biggest remaining payoff..bendladdition] Persisted snapshot offset index (O3). Removes the O(samples) header-scan, making snapshot location O(1). Most valuable for repeated random access and very deep streams; secondary to O10 for a single lookup. Design caveats: store in the container directory (not the stream body), offsets relative tostream_offset, validate the offset lands on a0x00tag, and extendtest_fixture_mutations.rsto mutate index entries.drain→ cursor), O8 (JSONL reuse a buffer +serde_json::to_writer; fix both copies), O9 (coalesce chunk writes). Broadest steady-state wins for full decode/recompress throughput, independent of the lookup work.Iterator/Python path), O11 (CLI.xbenlookup dispatch), O1 remainder (seek past std/mkv payloads on seekable input), O12.New observations on the fresh code
latest_twodelta_snapshot_beforevalidatescount != 0,check_payload_len, andcheck_twodelta_run_widthduring the scan (extract/mod.rs:181-231), and the replay reusesBenStreamReaderby chaining a syntheticTWODELTA_BEN_BANNERin front of the seeked snapshot (extract/mod.rs:131-135) — a clean way to restart mid-stream. The first frame is guaranteed a snapshot, so the "no snapshot before sample" path is unreachable on well-formed streams.extract_assignment_xbenis sequential regardless and cadence snapshots wouldn't help it. Practical implication: a.bendlbacked by a BEN stream gets fast random-access lookup; one backed by XBEN does not. A real size-vs-seek trade-off worth stating in the docs so users pick the container deliberately.count>1increments by 1) — correct, since replay cost is per frame.Re-check method: diffed the optimization surface since
6c20a7c, read the reworked lookup/snapshot/transform code directly (ops/extract/mod.rs,io/writer/stream_writer/ben.rs,io/writer/twodelta.rs,ben-py/src/{recompress,relabel}.rs,.github/workflows/ci.yml), confirmed each prior finding's status against current HEAD, and re-ran the Rust suite. No source files were modified — this is a review.