Skip to content

Other Optimization Opportunities (for furture release) #6

@peterrrock2

Description

@peterrrock2

Stashing these here for later

Optimization Review — PR #5 "1.0.0" (binary-ensemble)

PR: peterrrock2/binary-ensemble#5 · branch 1.0.0 @ 6c20a7c
Scope: A performance-focused second opinion, requested by the author (Peter), weighted toward the two areas he's currently finishing — (1) seek strategies in the lookup path that avoid a decode, and (2) a snapshotting strategy for better fast-forward on TwoDelta — plus general hot-path/memory opportunities.
Companion doc: correctness/safety/API/docs/CI review is in PR5-review.md.
Caveat: some of the seek/snapshot work is in-flight/uncommitted. This describes the branch as reviewed; where a finding is exactly what the new work addresses, it's confirmation/validation of the direction rather than news.


Bottom line

Both instincts are right and the payoffs are large/asymptotic. Two structural facts in the current branch make the work necessary, both verified directly in the code:

  • Lookup fully bit-unpacks (and clones) every frame before the target. extract_assignment_ben (ops/extract/mod.rs:53-74) drives next_record_ben, which calls frame.expand() on every frame and assignment.clone() per record (ben.rs:111,115). For Standard/MkvChain this is O(stream)·O(nodes) work where O(skipped frames) header-reads would do — the frame's serialized length is fully self-describing from its 6-byte header.
  • TwoDelta emits snapshots only on non-2-swap transitions — never on a cadence. Snapshots (pending_full_assignment) are set only in the Snapshot/run-too-long arms (xben.rs:266-270, mirror in ben.rs); crossing twodelta_chunk_size calls flush_chunk_inner, which writes a columnar batch of deltas, not a snapshot (xben.rs:245-247). So a clean pairwise-ReCom chain — the intended workload — produces one anchor snapshot and then millions of deltas, and there is no offset index anywhere. Fast-forward therefore has nothing to anchor to today. The DEFAULT_TWODELTA_CHUNK_SIZE = 10000 knob governs only LZMA2 batching, not replay depth (the spec confirms: "the batch size affects only compression and framing, never the decoded result").

So the snapshotting work is not merely an optimization — it's the prerequisite that makes seek/fast-forward possible. An offset index alone would not help: with one anchor snapshot it would only ever point at sample 1. Cadence-based snapshots must come first; the index second.

Format-stability note (ties to the correctness review): every read-side and constant-factor change here is wire-compatible. Periodic snapshots are ordinary legal frames — existing readers decode them transparently, and the .ben/.xben bytes for existing inputs are unchanged. The offset index is the only piece that needs new bytes, and it should live in the .bendl container directory, not the BEN/XBEN stream body (which would force a format major-bump per twodelta-format-spec.md). Done that way, the v1.0.0 stream fixtures stay byte-stable.

Severity (impact-based)

# Severity Area Opportunity
O1 HIGH Focus #1 BEN Standard/MkvChain lookup skips frames by header instead of full-decoding them — O(stream)→O(skip).
O2 HIGH Focus #2 Add cadence-based snapshots so TwoDelta fast-forward has an anchor (enabler; wire-compatible).
O3 HIGH Focus #2 Add a snapshot offset index (in .bendl) so reaching sample N is O(1) seek + O(cadence) replay instead of O(N).
O4 HIGH General Stream the Python relabel_bundle/recompress_bundle transforms — today they buffer the whole stream 2–3× in RAM.
O5 MEDIUM Focus #1/#2 XBEN-TwoDelta lookup/subsample replays correctly but re-encodes every skipped frame — drop the throwaway re-encode.
O6 MEDIUM General/both Eliminate the per-step assignment.clone() in next_record_* (and the Python cursor) during replay.
O7 MEDIUM General XBEN reader overflow.drain(..consumed) per frame is O(remaining) memmove — use a cursor offset.
O8 MEDIUM General Decode→JSONL allocates a Value + String + concat per sample — reuse a buffer + serde_json::to_writer.
O9 MEDIUM General flush_chunk_inner writes the chunk in dozens of 2–4 byte write_alls straight into the xz encoder — coalesce into one buffer.
O10 LOW Focus #2 Make snapshot cadence tunable (own knob, not the chunk size); consider scaling it to assignment size.
O11 LOW Focus #1 CLI lookup only handles BEN — an .xben input errors instead of dispatching to extract_assignment_xben.
O12 LOW General count_samples_ben allocates+reads each payload to discard it (reuse the O1 skip primitive); minor reserve/with_capacity and a small directory clone.

Focus #1 — Lookup seek strategies that avoid a decode

O1 [HIGH] — Header-only skip for BEN Standard/MkvChain. A frame's length is self-describing: Standard is a 6-byte header (max_val_bits u8, max_len_bits u8, n_bytes u32-BE) + n_bytes payload, count always 1; MkvChain is the same 6-byte header + payload + a trailing count u16-BE (decode.rs:173-218). So to reach sample N you can read the header, learn n_bytes/count, and step over the payload — no bit-unpack, no assignment alloc, no clone — expanding only the target frame.

  • Recommended shape: add a BenDecodeFrame::skip/read_header_only companion to from_reader. Offer a R: Read + Seek overload (the CLI passes BufReader<File>, which is Seek) that does seek(SeekFrom::Current(n_bytes)), and keep a Read-only io::copy(&mut reader.take(n_bytes), &mut io::sink()) fallback for pipes/&[u8]. (MkvChain wrinkle: count trails the payload, so even header-only must consume/seek past n_bytes to read it.)
  • Keep the existing hardening on the skip path: validate count != 0 and run check_payload_len(n_bytes) (cap 1<<26) before trusting n_bytes for a seek — a corrupt huge length must error, not seek to a wild offset.
  • Wire format unchanged; strengthens the lazy-decode invariant (skipped frames are never materialized).

O5 [MEDIUM] — XBEN-TwoDelta lookup re-encodes throwaway frames. XBEN can't header-skip (xz transport isn't seekable and deltas are relative — full sequential replay is required), and I verified the current path is correct: next_frame_xbennext_record_xben keeps the delta chain alive across all frames and re-encodes via encode_ben32_assignments, so the target decodes properly. But every skipped frame pays a throwaway encode_ben32_assignments (frames.rs:151). For XBEN-TwoDelta lookup, drive next_record_xben directly and return the materialized assignment at the target — skipping both the per-frame re-encode and the target re-decode. (Plain extract_assignment_xben for Standard/MkvChain already avoids decoding skipped frames — good.)

O11 [LOW] — CLI lookup is BEN-only. lookup.rs:25 calls extract_assignment_ben unconditionally; an .xben file would be misread as a banner and error. extract_assignment_xben exists and is tested at the library level but isn't reachable from the CLI. Decide whether lookup should sniff the container (banner/extension) and dispatch. Behavioral gap, not a perf issue, but it's in the surface being reworked.

Read-side design for BEN-TwoDelta lookup (depends on O2/O3): walk frames header-only, recognizing the leading tag (BEN_TWODELTA_SNAPSHOT_TAG=0x00 / DELTA=0x01, io/reader/twodelta.rs); remember the byte offset + sample number of the last snapshot ≤ target; then Seek back to it, expand() it once, and replay only the intervening deltas. Replays are then bounded by the snapshot interval rather than the whole stream. A non-Seek fallback retains the most recent snapshot's decoded assignment as it walks forward.

Focus #2 — TwoDelta snapshotting / fast-forward

Your direction (periodic snapshots + an offset index) is the standard, correct design for delta-chain random access. The diagnosis above (one anchor snapshot, no index) confirms why it's needed. Sequencing and traps:

O2 [HIGH, wire-compatible — do first] — Cadence-based snapshots. Force a full snapshot every C samples regardless of transition kind. The machinery already exists and reseeds masks correctly (write_twodelta_snapshot in ben.rs, flush_twodelta_full in xben.rs:469-492); a periodic forced call is small. A forced snapshot mid-chunk must flush the open chunk first (the existing Snapshot arm does this, xben.rs:267) so the "deltas precede the following full frame" invariant holds. This alone makes "start from nearest snapshot" possible and existing readers decode the extra snapshots transparently.

O3 [HIGH, needs a guarded .bendl addition] — Snapshot offset index. Add a trailing snapshot directory mapping sample_number → byte offset of the snapshot frame as a new .bendl directory entry / asset, so a reader does an O(1) lookup + seek then O(C) replay. Pitfalls to design against:

  • Location: .bendl container, not the stream body — keeps .ben/.xben and the v1.0.0 fixtures byte-stable.
  • Append/compact survival: store offsets relative to stream_offset so they're invariant under compaction (which rewrites only the post-stream tail, compact.rs); the index rides the existing directory-rebuild on append.
  • Self-contained anchors: snapshots are already independently decodable MkvChain frames that reseed masks — preserve that; a periodic snapshot must be a real full frame, never a delta.
  • Reader validation: before trusting an index offset, confirm the byte there is a snapshot tag (0x00) and resume after the full frame; extend test_fixture_mutations.rs to mutate index entries and assert panic-freedom.

O10 [LOW] — Cadence tuning. Expose a real snapshot_interval (the current --chunk-size/twodelta_chunk_size only batches columns and can't bound replay). Space cost is tiny: a snapshot ≈ its RLE size (~4·D bytes for D districts in XBEN full frames), so at C=10000, D≈100 that's ~0.04 B/sample to bound replay to ≤10000 deltas — extremely favorable; even C=1000 costs ~0.4 B/sample. Since replay time is O(C·M) in node count M, consider "snapshot every C deltas or every K delta-bytes" and aligning cadence to chunk boundaries so index offsets land on tags the reader already resyncs on. Encode side already streams (no whole-chain buffering); the only per-snapshot cost is an unavoidable O(M) mask rebuild.

General hot-path & memory wins

O4 [HIGH] — Stream the Python whole-bundle transforms. recompress_bundle read_to_ends the BEN stream into a Vec then builds the whole XBEN output in another Vec (recompress.rs:86-97); relabel_bundle does the same plus a third copy on write (relabel.rs:156-195). Peak ≈ 2–3× the full stream — directly contradicting the streaming selling point on the large-ensemble case these are used for. Both underlying functions already take R: Read/BufRead + W: Write, and assignment_stream_reader() (bounded, CRC-verifying) + BendlStreamSession (Write) exist — so pass them through instead of buffering. (Same issue as MEDIUM finding #3 in the correctness review; quantified here.)

O6 [MEDIUM] — Per-step clone during replay. next_record_ben/next_record_xben do previous_assignment = Some(assignment.clone()) every record (ben.rs:115, xben.rs:226/242/261) — an O(M) copy per sample, even though only TwoDelta's expand reads previous_assignment (Standard/MkvChain ignore it). for_each_assignment_ben already shows the clone-free pattern (move the buffer back after the borrow). Cheapest fix: only retain the previous for TwoDelta; for the Iterator API, keep an internal scratch cloned only when actually yielded. The Python SampleCursor::next clones again on top (cursor.rs:78,99) — so Python sample-by-sample iteration currently costs ~2 clones/sample.

O7 [MEDIUM] — XBEN overflow drain. After each frame, inner.overflow.drain(..consumed) shifts all remaining buffered bytes down (xben.rs:213/222/227/262) — O(remaining) per frame, so decoding K frames from one buffer fill is O(K·buffer). Track a consumed_offset cursor and slice &overflow[offset..], compacting only on refill → amortized O(1). Touches all the overflow-indexing helpers, so do it with the existing tests.

O8 [MEDIUM] — JSONL per-sample allocation. write_all_jsonl builds a json! Value (heap map + array), serializes to a String, then concatenates "\n" per sample (mod.rs:226-231; duplicated in verify.rs:552) — ~3 allocations/sample on every decode→JSONL. Keep a reusable buffer, serde_json::to_writer the {"assignment":...,"sample":...} slice directly, push b'\n'. Output bytes identical (key order matches). Fix both copies together to avoid drift.

O9 [MEDIUM] — Coalesce XBEN chunk writes. flush_chunk_inner issues ~5 + 8N + 2·total_runs separate 2–4 byte encoder.write_all(&x.to_be_bytes()) calls straight into XzEncoder (no BufWriter between) (xben.rs:494-524) — each crosses the liblzma boundary. Build one scratch Vec<u8> and write_all it once per chunk. Wire bytes identical (determinism preserved).

O12 [LOW] — Misc. count_samples_ben walks frame boundaries without expand() (good) but still allocates+reads each payload to discard it (ben.rs:125-135) — the O1 skip primitive removes that. decode_ben32_line could reserve(count) per run; assign_to_rle could with_capacity(input.len()) (rle/mod.rs:13). verify_all_asset_checksums clones the (small) directory (reader.rs:464) — negligible.

Deliberately not flagged (checked, fine): the core encode_ben_to_xben path streams (no BEN→BEN32→xz triple copy — that only appears in the Python recompress wrapper, O4); CRC is tee'd inline, not a second pass; xz settings are fixture-deterministic (don't "optimize" thread count/block size — it changes bytes); bundle asset buffering is bounded metadata, not the sample stream; graph MLC/RCM hot paths pre-size reasonably (any remaining cost is algorithmic, out of scope).

Suggested sequencing

  1. O2 (cadence snapshots) + O1 (BEN header-skip) — wire-compatible, unlock fast-forward and fast lookup.
  2. O6/O8/O9 constant-factor wins (clone, JSONL, chunk writes) and O5 (XBEN-TwoDelta re-encode).
  3. O4 (stream the Python transforms — also a memory-correctness fix).
  4. O3 (snapshot offset index in .bendl, with append/compact + reader validation) — the asymptotic seek win and the only piece needing a guarded format addition.

All read-side/constant-factor items preserve the wire format, lazy-decode, and streaming invariants. Verified locally that the existing suite (1182 tests) is green, so these are deltas against a known-good baseline; each change should keep test_format_stability and the round-trip proptests passing (no behavior/format change without a guard).


How this was reviewed

Three focused read-only analysis passes (lookup/seek; TwoDelta snapshot/fast-forward; general hot-path/memory), each calibrated to an impact-based severity rubric, then adversarial verification of every featured claim against the source. The two load-bearing diagnoses — lookup decoding every skipped frame, and snapshots being content-driven rather than cadence-driven — were confirmed by direct reads of ops/extract/mod.rs, io/reader/stream_reader/{ben,xben}.rs, codec/frames/decode.rs, and io/writer/stream_writer/xben.rs. No microbenchmark was run; the asymptotic claims rest on the code paths (which frames are expanded vs skipped), and the existing 1182-test suite was confirmed green as the baseline. No source files were modified — this is a review.

Optimization Re-check after fresh commits — PR #5 (binary-ensemble)

PR: peterrrock2/binary-ensemble#5 · branch 1.0.0
Compared: HEAD d57d79a against the prior optimization-review point 6c20a7c (20 new commits).
Companion docs: PR5-optimization-review.md (full O1–O12 review) and PR5-review.md (correctness/safety/API/docs/CI).
Verification: cargo test -p binary-ensemble on d57d79a1208 passed, 0 failed, 2 ignored (up from 1182; ~26 new tests covering extract/lookup/twodelta). The new optimization work lands on a green suite and the fast paths are covered.


Bottom line

Peter shipped the two focus areas and several other findings — and the headline lookup/fast-forward work looks correct and hardened. What remains is mostly constant-factor reader cleanups plus one structural follow-up (a persisted index) that needs a format addition. The single highest-value remaining lookup item is small: make the new snapshot interval tunable.

Status of the prior findings

# Prior sev Status Evidence / what's left
O1 BEN std/mkv header-skip lookup HIGH Largely done extract_assignment_ben uses into_frames() + expand_self_contained() on the target only — skipped frames are no longer bit-unpacked or cloned (extract/mod.rs:88-98). Remainder: skipped payloads are still read_exact'd into a Vec (decode.rs:185,207) — the seek path delegates std/mkv to the non-seek reader, so it doesn't seek past payloads. The expensive part (decode+clone) is gone; the residual is minor/IO-bound.
O2 cadence snapshots (TwoDelta) HIGH Done (BEN) New DEFAULT_TWODELTA_SNAPSHOT_INTERVAL = 50_000 + twodelta_deltas_since_snapshot counter forces a snapshot every 50k delta frames (writer/twodelta.rs:13, stream_writer/ben.rs:71-75,126-141). This is the enabler that bounds replay. BEN-only by design — see notes.
O3 persisted snapshot offset index HIGH Open No index in .bendl. latest_twodelta_snapshot_before byte-scans frame headers from the start each lookup (extract/mod.rs:153-202) — cheap per frame (no decode, seeks past payloads) but O(samples) to locate the snapshot. Needs a guarded container-level addition.
O4 stream the Python transforms HIGH Done recompress/relabel now pass assignment_stream_reader() straight into encode_ben_to_xben(BufReader::new(stream), &mut session, …) / relabel-into-session — no read_to_end, no whole-stream Vec (recompress.rs, relabel.rs). Bonus: driving the verified reader to EOF also checks the source CRC.
O5 XBEN-TwoDelta lookup re-encode MED Open, minor extract_assignment_xben decodes only the target (good) but skipped XBEN-TwoDelta frames are still replayed+re-encoded. XBEN can't seek (LZMA2), so it's inherently sequential; only the throwaway re-encode is avoidable. Low priority.
O6 per-record clone in Iterator path MED Open *previous_assignment = Some(assignment.clone()) still at ben.rs:152, xben.rs:269,285,306. The lookup path no longer hits this; impact is now scoped to sample-by-sample Iterator use (Python for a in decoder, ~2 clones/sample).
O7 XBEN overflow.drain per frame MED Open Still overflow.drain(..consumed) at xben.rs:256,265,270,302,307. The +179 lines there were TwoDelta-lookup support, not the drain.
O8 JSONL per-sample allocation MED Open json!{}.to_string() + "\n" still per sample at mod.rs:231 and the duplicate verify.rs:561.
O9 coalesce XBEN chunk writes MED Open flush_chunk_inner (writer xben.rs) unchanged — still per-field 2–4 byte write_alls into the encoder.
O10 tunable snapshot cadence LOW→MED Partial — now the top lookup lever The interval exists but is a hardcoded const, not a WriteOption (unlike chunk size, which is plumbed via with_twodelta_chunk_size). See re-prioritization.
O11 CLI lookup XBEN dispatch LOW Partial lookup now uses the fast seek path (extract_assignment_ben_seek, lookup.rs:23) — good — but still errors on an .xben input; extract_assignment_xben remains CLI-unreachable.
O12 misc reserve/with_capacity LOW ⬜ Open Unchanged; negligible.

Also resolved (from the correctness review): the HIGH CI-gating findingci.yml now triggers on pull_request and runs cargo test + pytest tests/ as fast suites (ci.yml:12,58-59,75-77). Several corruption-hardening commits also landed (unchecked total runs, overlong stream len python side, more corrupt header protection, plus a new header checksum), covering the LOW 32-bit-overflow item and strengthening the reader.

What's left on the table (re-prioritized for the new code)

  1. [MED — highest-value lookup item now] Expose the snapshot interval; reconsider 50k for large maps (O10). With cadence snapshots in, the dominant single-lookup cost is the ≤50k delta replay after the anchor, which is O(interval × district-size). On a large map (CO ~140k blocks) a worst-case lookup can replay up to 50k deltas (hundreds of ms → seconds), while a denser cadence costs almost nothing in space (≈4·D/interval bytes/sample — hundredths of a byte at D≈100). Plumb a snapshot_interval through WriteOptions/CLI (mirror with_twodelta_chunk_size), default it relative to assignment size or expose it, and document the space-vs-lookup-latency trade-off. Small change, biggest remaining payoff.
  2. [MED, needs a guarded .bendl addition] Persisted snapshot offset index (O3). Removes the O(samples) header-scan, making snapshot location O(1). Most valuable for repeated random access and very deep streams; secondary to O10 for a single lookup. Design caveats: store in the container directory (not the stream body), offsets relative to stream_offset, validate the offset lands on a 0x00 tag, and extend test_fixture_mutations.rs to mutate index entries.
  3. [MED] Reader constant-factor trio, untouched: O7 (overflow drain → cursor), O8 (JSONL reuse a buffer + serde_json::to_writer; fix both copies), O9 (coalesce chunk writes). Broadest steady-state wins for full decode/recompress throughput, independent of the lookup work.
  4. [LOW] O6 (clone in the Iterator/Python path), O11 (CLI .xben lookup dispatch), O1 remainder (seek past std/mkv payloads on seekable input), O12.

New observations on the fresh code

  • The fast TwoDelta lookup looks correct and is hardened. latest_twodelta_snapshot_before validates count != 0, check_payload_len, and check_twodelta_run_width during the scan (extract/mod.rs:181-231), and the replay reuses BenStreamReader by chaining a synthetic TWODELTA_BEN_BANNER in front of the seeked snapshot (extract/mod.rs:131-135) — a clean way to restart mid-stream. The first frame is guaranteed a snapshot, so the "no snapshot before sample" path is unreachable on well-formed streams.
  • Cadence is BEN-only, and that's defensible — but worth a doc line. XBEN can't be randomly seeked (LZMA2), so extract_assignment_xben is sequential regardless and cadence snapshots wouldn't help it. Practical implication: a .bendl backed by a BEN stream gets fast random-access lookup; one backed by XBEN does not. A real size-vs-seek trade-off worth stating in the docs so users pick the container deliberately.
  • The interval counts frames, not samples (a repeat frame with count>1 increments by 1) — correct, since replay cost is per frame.

Re-check method: diffed the optimization surface since 6c20a7c, read the reworked lookup/snapshot/transform code directly (ops/extract/mod.rs, io/writer/stream_writer/ben.rs, io/writer/twodelta.rs, ben-py/src/{recompress,relabel}.rs, .github/workflows/ci.yml), confirmed each prior finding's status against current HEAD, and re-ran the Rust suite. No source files were modified — this is a review.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions