Other Optimization Opportunities (for furture release)

Stashing these here for later

# Optimization Review — PR #5 "1.0.0" (binary-ensemble)

**PR:** [peterrrock2/binary-ensemble#5](https://github.com/peterrrock2/binary-ensemble/pull/5) · branch `1.0.0` @ `6c20a7c`
**Scope:** A performance-focused **second opinion**, requested by the author (Peter), weighted toward the two areas he's currently finishing — **(1) seek strategies in the `lookup` path that avoid a decode**, and **(2) a snapshotting strategy for better fast-forward on TwoDelta** — plus general hot-path/memory opportunities.
**Companion doc:** correctness/safety/API/docs/CI review is in `PR5-review.md`.
**Caveat:** some of the seek/snapshot work is in-flight/uncommitted. This describes the branch *as reviewed*; where a finding is exactly what the new work addresses, it's confirmation/validation of the direction rather than news.

---

## Bottom line

Both instincts are right and the payoffs are large/asymptotic. Two structural facts in the current branch make the work necessary, both verified directly in the code:

- **Lookup fully bit-unpacks (and clones) every frame before the target.** `extract_assignment_ben` ([ops/extract/mod.rs:53-74](ben/src/ops/extract/mod.rs)) drives `next_record_ben`, which calls `frame.expand()` on every frame and `assignment.clone()` per record ([ben.rs:111,115](ben/src/io/reader/stream_reader/ben.rs)). For Standard/MkvChain this is O(stream)·O(nodes) work where O(skipped frames) header-reads would do — the frame's serialized length is fully self-describing from its 6-byte header.
- **TwoDelta emits snapshots only on non-2-swap transitions — never on a cadence.** Snapshots (`pending_full_assignment`) are set only in the `Snapshot`/run-too-long arms ([xben.rs:266-270](ben/src/io/writer/stream_writer/xben.rs), mirror in `ben.rs`); crossing `twodelta_chunk_size` calls `flush_chunk_inner`, which writes a columnar batch of **deltas**, not a snapshot ([xben.rs:245-247](ben/src/io/writer/stream_writer/xben.rs)). So a clean pairwise-ReCom chain — the intended workload — produces **one anchor snapshot and then millions of deltas**, and there is no offset index anywhere. Fast-forward therefore has nothing to anchor to today. The `DEFAULT_TWODELTA_CHUNK_SIZE = 10000` knob governs only LZMA2 batching, not replay depth (the spec confirms: "the batch size affects only compression and framing, never the decoded result").

So the snapshotting work is not merely an optimization — it's the prerequisite that makes seek/fast-forward *possible*. **An offset index alone would not help: with one anchor snapshot it would only ever point at sample 1.** Cadence-based snapshots must come first; the index second.

> **Format-stability note (ties to the correctness review):** every read-side and constant-factor change here is wire-compatible. Periodic snapshots are ordinary legal frames — existing readers decode them transparently, and the `.ben`/`.xben` bytes for *existing* inputs are unchanged. The **offset index is the only piece that needs new bytes**, and it should live in the `.bendl` container directory, **not** the BEN/XBEN stream body (which would force a format major-bump per `twodelta-format-spec.md`). Done that way, the v1.0.0 stream fixtures stay byte-stable.

## Severity (impact-based)

| # | Severity | Area | Opportunity |
|---|----------|------|-------------|
| O1 | **HIGH** | Focus #1 | BEN Standard/MkvChain `lookup` skips frames by header instead of full-decoding them — O(stream)→O(skip). |
| O2 | **HIGH** | Focus #2 | Add cadence-based snapshots so TwoDelta fast-forward has an anchor (enabler; wire-compatible). |
| O3 | **HIGH** | Focus #2 | Add a snapshot offset index (in `.bendl`) so reaching sample N is O(1) seek + O(cadence) replay instead of O(N). |
| O4 | **HIGH** | General | Stream the Python `relabel_bundle`/`recompress_bundle` transforms — today they buffer the whole stream 2–3× in RAM. |
| O5 | MEDIUM | Focus #1/#2 | XBEN-TwoDelta lookup/subsample replays correctly but **re-encodes every skipped frame** — drop the throwaway re-encode. |
| O6 | MEDIUM | General/both | Eliminate the per-step `assignment.clone()` in `next_record_*` (and the Python cursor) during replay. |
| O7 | MEDIUM | General | XBEN reader `overflow.drain(..consumed)` per frame is O(remaining) memmove — use a cursor offset. |
| O8 | MEDIUM | General | Decode→JSONL allocates a `Value` + `String` + concat per sample — reuse a buffer + `serde_json::to_writer`. |
| O9 | MEDIUM | General | `flush_chunk_inner` writes the chunk in dozens of 2–4 byte `write_all`s straight into the xz encoder — coalesce into one buffer. |
| O10 | LOW | Focus #2 | Make snapshot cadence tunable (own knob, not the chunk size); consider scaling it to assignment size. |
| O11 | LOW | Focus #1 | CLI `lookup` only handles BEN — an `.xben` input errors instead of dispatching to `extract_assignment_xben`. |
| O12 | LOW | General | `count_samples_ben` allocates+reads each payload to discard it (reuse the O1 skip primitive); minor `reserve`/`with_capacity` and a small directory clone. |

---

## Focus #1 — Lookup seek strategies that avoid a decode

**O1 [HIGH] — Header-only skip for BEN Standard/MkvChain.** A frame's length is self-describing: Standard is a 6-byte header (`max_val_bits u8`, `max_len_bits u8`, `n_bytes u32-BE`) + `n_bytes` payload, count always 1; MkvChain is the same 6-byte header + payload + a trailing `count u16-BE` ([decode.rs:173-218](ben/src/codec/frames/decode.rs)). So to reach sample N you can read the header, learn `n_bytes`/`count`, and step over the payload — **no bit-unpack, no assignment alloc, no clone** — expanding only the target frame.

- Recommended shape: add a `BenDecodeFrame::skip`/`read_header_only` companion to `from_reader`. Offer a `R: Read + Seek` overload (the CLI passes `BufReader<File>`, which is `Seek`) that does `seek(SeekFrom::Current(n_bytes))`, and keep a `Read`-only `io::copy(&mut reader.take(n_bytes), &mut io::sink())` fallback for pipes/`&[u8]`. (MkvChain wrinkle: `count` trails the payload, so even header-only must consume/seek past `n_bytes` to read it.)
- Keep the existing hardening on the skip path: validate `count != 0` and run `check_payload_len(n_bytes)` (cap `1<<26`) **before** trusting `n_bytes` for a seek — a corrupt huge length must error, not seek to a wild offset.
- Wire format unchanged; strengthens the lazy-decode invariant (skipped frames are never materialized).

**O5 [MEDIUM] — XBEN-TwoDelta lookup re-encodes throwaway frames.** XBEN can't header-skip (xz transport isn't seekable and deltas are relative — full sequential replay is required), and I verified the current path is **correct**: `next_frame_xben`→`next_record_xben` keeps the delta chain alive across all frames and re-encodes via `encode_ben32_assignments`, so the target decodes properly. But every *skipped* frame pays a throwaway `encode_ben32_assignments` ([frames.rs:151](ben/src/io/reader/stream_reader/frames.rs)). For XBEN-TwoDelta lookup, drive `next_record_xben` directly and return the materialized assignment at the target — skipping both the per-frame re-encode and the target re-decode. (Plain `extract_assignment_xben` for Standard/MkvChain already avoids decoding skipped frames — good.)

**O11 [LOW] — CLI `lookup` is BEN-only.** [lookup.rs:25](ben/src/cli/ben/modes/lookup.rs) calls `extract_assignment_ben` unconditionally; an `.xben` file would be misread as a banner and error. `extract_assignment_xben` exists and is tested at the library level but isn't reachable from the CLI. Decide whether `lookup` should sniff the container (banner/extension) and dispatch. Behavioral gap, not a perf issue, but it's in the surface being reworked.

**Read-side design for BEN-TwoDelta lookup (depends on O2/O3):** walk frames header-only, recognizing the leading tag (`BEN_TWODELTA_SNAPSHOT_TAG=0x00` / `DELTA=0x01`, [io/reader/twodelta.rs](ben/src/io/reader/twodelta.rs)); remember the byte offset + sample number of the last snapshot ≤ target; then `Seek` back to it, `expand()` it once, and replay only the intervening deltas. Replays are then bounded by the snapshot interval rather than the whole stream. A non-`Seek` fallback retains the most recent snapshot's decoded assignment as it walks forward.

## Focus #2 — TwoDelta snapshotting / fast-forward

Your direction (periodic snapshots + an offset index) is the standard, correct design for delta-chain random access. The diagnosis above (one anchor snapshot, no index) confirms why it's needed. Sequencing and traps:

**O2 [HIGH, wire-compatible — do first] — Cadence-based snapshots.** Force a full snapshot every C samples regardless of transition kind. The machinery already exists and reseeds masks correctly (`write_twodelta_snapshot` in `ben.rs`, `flush_twodelta_full` in [xben.rs:469-492](ben/src/io/writer/stream_writer/xben.rs)); a periodic forced call is small. A forced snapshot mid-chunk **must flush the open chunk first** (the existing `Snapshot` arm does this, [xben.rs:267](ben/src/io/writer/stream_writer/xben.rs)) so the "deltas precede the following full frame" invariant holds. This alone makes "start from nearest snapshot" possible and existing readers decode the extra snapshots transparently.

**O3 [HIGH, needs a guarded `.bendl` addition] — Snapshot offset index.** Add a trailing snapshot directory mapping `sample_number → byte offset of the snapshot frame` as a new `.bendl` directory entry / asset, so a reader does an O(1) lookup + seek then O(C) replay. Pitfalls to design against:
- **Location:** `.bendl` container, not the stream body — keeps `.ben`/`.xben` and the v1.0.0 fixtures byte-stable.
- **Append/compact survival:** store offsets **relative to `stream_offset`** so they're invariant under compaction (which rewrites only the post-stream tail, [compact.rs](ben/src/io/bundle/compact.rs)); the index rides the existing directory-rebuild on append.
- **Self-contained anchors:** snapshots are already independently decodable MkvChain frames that reseed masks — preserve that; a periodic snapshot must be a real full frame, never a delta.
- **Reader validation:** before trusting an index offset, confirm the byte there is a snapshot tag (`0x00`) and resume *after* the full frame; extend `test_fixture_mutations.rs` to mutate index entries and assert panic-freedom.

**O10 [LOW] — Cadence tuning.** Expose a real `snapshot_interval` (the current `--chunk-size`/`twodelta_chunk_size` only batches columns and can't bound replay). Space cost is tiny: a snapshot ≈ its RLE size (~4·D bytes for D districts in XBEN full frames), so at C=10000, D≈100 that's ~0.04 B/sample to bound replay to ≤10000 deltas — extremely favorable; even C=1000 costs ~0.4 B/sample. Since replay time is O(C·M) in node count M, consider "snapshot every C deltas **or** every K delta-bytes" and aligning cadence to chunk boundaries so index offsets land on tags the reader already resyncs on. Encode side already streams (no whole-chain buffering); the only per-snapshot cost is an unavoidable O(M) mask rebuild.

## General hot-path & memory wins

**O4 [HIGH] — Stream the Python whole-bundle transforms.** `recompress_bundle` `read_to_end`s the BEN stream into a `Vec` then builds the whole XBEN output in another `Vec` ([recompress.rs:86-97](ben-py/src/recompress.rs)); `relabel_bundle` does the same plus a third copy on write ([relabel.rs:156-195](ben-py/src/relabel.rs)). Peak ≈ 2–3× the full stream — directly contradicting the streaming selling point on the large-ensemble case these are used for. Both underlying functions already take `R: Read`/`BufRead` + `W: Write`, and `assignment_stream_reader()` (bounded, CRC-verifying) + `BendlStreamSession` (`Write`) exist — so pass them through instead of buffering. (Same issue as MEDIUM finding #3 in the correctness review; quantified here.)

**O6 [MEDIUM] — Per-step clone during replay.** `next_record_ben`/`next_record_xben` do `previous_assignment = Some(assignment.clone())` every record ([ben.rs:115](ben/src/io/reader/stream_reader/ben.rs), [xben.rs:226/242/261](ben/src/io/reader/stream_reader/xben.rs)) — an O(M) copy per sample, even though only TwoDelta's `expand` reads `previous_assignment` (Standard/MkvChain ignore it). `for_each_assignment_ben` already shows the clone-free pattern (move the buffer back after the borrow). Cheapest fix: only retain the previous for TwoDelta; for the `Iterator` API, keep an internal scratch cloned only when actually yielded. The Python `SampleCursor::next` clones again on top ([cursor.rs:78,99](ben-py/src/decode/cursor.rs)) — so Python sample-by-sample iteration currently costs ~2 clones/sample.

**O7 [MEDIUM] — XBEN overflow drain.** After each frame, `inner.overflow.drain(..consumed)` shifts all remaining buffered bytes down ([xben.rs:213/222/227/262](ben/src/io/reader/stream_reader/xben.rs)) — O(remaining) per frame, so decoding K frames from one buffer fill is O(K·buffer). Track a `consumed_offset` cursor and slice `&overflow[offset..]`, compacting only on refill → amortized O(1). Touches all the overflow-indexing helpers, so do it with the existing tests.

**O8 [MEDIUM] — JSONL per-sample allocation.** `write_all_jsonl` builds a `json!` `Value` (heap map + array), serializes to a `String`, then concatenates `"\n"` per sample ([mod.rs:226-231](ben/src/io/reader/stream_reader/mod.rs); duplicated in [verify.rs:552](ben/src/io/bundle/verify.rs)) — ~3 allocations/sample on every decode→JSONL. Keep a reusable buffer, `serde_json::to_writer` the `{"assignment":...,"sample":...}` slice directly, push `b'\n'`. Output bytes identical (key order matches). Fix both copies together to avoid drift.

**O9 [MEDIUM] — Coalesce XBEN chunk writes.** `flush_chunk_inner` issues ~`5 + 8N + 2·total_runs` separate 2–4 byte `encoder.write_all(&x.to_be_bytes())` calls straight into `XzEncoder` (no `BufWriter` between) ([xben.rs:494-524](ben/src/io/writer/stream_writer/xben.rs)) — each crosses the liblzma boundary. Build one scratch `Vec<u8>` and `write_all` it once per chunk. Wire bytes identical (determinism preserved).

**O12 [LOW] — Misc.** `count_samples_ben` walks frame boundaries without `expand()` (good) but still allocates+reads each payload to discard it ([ben.rs:125-135](ben/src/io/reader/stream_reader/ben.rs)) — the O1 skip primitive removes that. `decode_ben32_line` could `reserve(count)` per run; `assign_to_rle` could `with_capacity(input.len())` ([rle/mod.rs:13](ben/src/util/rle/mod.rs)). `verify_all_asset_checksums` clones the (small) directory ([reader.rs:464](ben/src/io/bundle/reader.rs)) — negligible.

**Deliberately not flagged** (checked, fine): the core `encode_ben_to_xben` path streams (no BEN→BEN32→xz triple copy — that only appears in the Python `recompress` wrapper, O4); CRC is tee'd inline, not a second pass; xz settings are fixture-deterministic (don't "optimize" thread count/block size — it changes bytes); bundle asset buffering is bounded metadata, not the sample stream; graph MLC/RCM hot paths pre-size reasonably (any remaining cost is algorithmic, out of scope).

## Suggested sequencing
1. **O2** (cadence snapshots) + **O1** (BEN header-skip) — wire-compatible, unlock fast-forward and fast lookup.
2. **O6/O8/O9** constant-factor wins (clone, JSONL, chunk writes) and **O5** (XBEN-TwoDelta re-encode).
3. **O4** (stream the Python transforms — also a memory-correctness fix).
4. **O3** (snapshot offset index in `.bendl`, with append/compact + reader validation) — the asymptotic seek win and the only piece needing a guarded format addition.

All read-side/constant-factor items preserve the wire format, lazy-decode, and streaming invariants. Verified locally that the existing suite (1182 tests) is green, so these are deltas against a known-good baseline; each change should keep `test_format_stability` and the round-trip proptests passing (no behavior/format change without a guard).

---

## How this was reviewed

Three focused read-only analysis passes (lookup/seek; TwoDelta snapshot/fast-forward; general hot-path/memory), each calibrated to an impact-based severity rubric, then adversarial verification of every featured claim against the source. The two load-bearing diagnoses — lookup decoding every skipped frame, and snapshots being content-driven rather than cadence-driven — were confirmed by direct reads of `ops/extract/mod.rs`, `io/reader/stream_reader/{ben,xben}.rs`, `codec/frames/decode.rs`, and `io/writer/stream_writer/xben.rs`. No microbenchmark was run; the asymptotic claims rest on the code paths (which frames are expanded vs skipped), and the existing 1182-test suite was confirmed green as the baseline. No source files were modified — this is a review.



# Optimization Re-check after fresh commits — PR #5 (binary-ensemble)

**PR:** [peterrrock2/binary-ensemble#5](https://github.com/peterrrock2/binary-ensemble/pull/5) · branch `1.0.0`
**Compared:** HEAD `d57d79a` against the prior optimization-review point `6c20a7c` (20 new commits).
**Companion docs:** `PR5-optimization-review.md` (full O1–O12 review) and `PR5-review.md` (correctness/safety/API/docs/CI).
**Verification:** `cargo test -p binary-ensemble` on `d57d79a` → **1208 passed, 0 failed, 2 ignored** (up from 1182; ~26 new tests covering `extract`/`lookup`/`twodelta`). The new optimization work lands on a green suite and the fast paths are covered.

---

## Bottom line

**Peter shipped the two focus areas and several other findings — and the headline lookup/fast-forward work looks correct and hardened.** What remains is mostly constant-factor reader cleanups plus one structural follow-up (a persisted index) that needs a format addition. The single highest-value remaining lookup item is small: make the new snapshot interval tunable.

## Status of the prior findings

| # | Prior sev | Status | Evidence / what's left |
|---|-----------|--------|------------------------|
| **O1** BEN std/mkv header-skip lookup | HIGH | ✅ **Largely done** | `extract_assignment_ben` uses `into_frames()` + `expand_self_contained()` on the target only — skipped frames are no longer bit-unpacked or cloned ([extract/mod.rs:88-98](ben/src/ops/extract/mod.rs)). *Remainder:* skipped payloads are still `read_exact`'d into a `Vec` ([decode.rs:185,207](ben/src/codec/frames/decode.rs)) — the seek path delegates std/mkv to the non-seek reader, so it doesn't `seek` past payloads. The expensive part (decode+clone) is gone; the residual is minor/IO-bound. |
| **O2** cadence snapshots (TwoDelta) | HIGH | ✅ **Done (BEN)** | New `DEFAULT_TWODELTA_SNAPSHOT_INTERVAL = 50_000` + `twodelta_deltas_since_snapshot` counter forces a snapshot every 50k delta frames ([writer/twodelta.rs:13](ben/src/io/writer/twodelta.rs), [stream_writer/ben.rs:71-75,126-141](ben/src/io/writer/stream_writer/ben.rs)). This is the enabler that bounds replay. BEN-only by design — see notes. |
| **O3** persisted snapshot offset index | HIGH | ⬜ **Open** | No index in `.bendl`. `latest_twodelta_snapshot_before` byte-scans frame headers from the start each lookup ([extract/mod.rs:153-202](ben/src/ops/extract/mod.rs)) — cheap per frame (no decode, seeks past payloads) but O(samples) to *locate* the snapshot. Needs a guarded container-level addition. |
| **O4** stream the Python transforms | HIGH | ✅ **Done** | `recompress`/`relabel` now pass `assignment_stream_reader()` straight into `encode_ben_to_xben(BufReader::new(stream), &mut session, …)` / relabel-into-session — no `read_to_end`, no whole-stream `Vec` ([recompress.rs](ben-py/src/recompress.rs), [relabel.rs](ben-py/src/relabel.rs)). Bonus: driving the verified reader to EOF also checks the source CRC. |
| **O5** XBEN-TwoDelta lookup re-encode | MED | ◐ **Open, minor** | `extract_assignment_xben` decodes only the target (good) but skipped XBEN-TwoDelta frames are still replayed+re-encoded. XBEN can't seek (LZMA2), so it's inherently sequential; only the throwaway re-encode is avoidable. Low priority. |
| **O6** per-record clone in `Iterator` path | MED | ⬜ **Open** | `*previous_assignment = Some(assignment.clone())` still at [ben.rs:152](ben/src/io/reader/stream_reader/ben.rs), [xben.rs:269,285,306](ben/src/io/reader/stream_reader/xben.rs). The *lookup* path no longer hits this; impact is now scoped to sample-by-sample `Iterator` use (Python `for a in decoder`, ~2 clones/sample). |
| **O7** XBEN `overflow.drain` per frame | MED | ⬜ **Open** | Still `overflow.drain(..consumed)` at [xben.rs:256,265,270,302,307](ben/src/io/reader/stream_reader/xben.rs). The +179 lines there were TwoDelta-lookup support, not the drain. |
| **O8** JSONL per-sample allocation | MED | ⬜ **Open** | `json!{}.to_string() + "\n"` still per sample at [mod.rs:231](ben/src/io/reader/stream_reader/mod.rs) and the duplicate [verify.rs:561](ben/src/io/bundle/verify.rs). |
| **O9** coalesce XBEN chunk writes | MED | ⬜ **Open** | `flush_chunk_inner` (writer `xben.rs`) unchanged — still per-field 2–4 byte `write_all`s into the encoder. |
| **O10** tunable snapshot cadence | LOW→**MED** | ◐ **Partial — now the top lookup lever** | The interval exists but is a hardcoded `const`, not a `WriteOption` (unlike chunk size, which *is* plumbed via `with_twodelta_chunk_size`). See re-prioritization. |
| **O11** CLI `lookup` XBEN dispatch | LOW | ◐ **Partial** | `lookup` now uses the fast seek path (`extract_assignment_ben_seek`, [lookup.rs:23](ben/src/cli/ben/modes/lookup.rs)) — good — but still errors on an `.xben` input; `extract_assignment_xben` remains CLI-unreachable. |
| **O12** misc `reserve`/`with_capacity` | LOW | ⬜ Open | Unchanged; negligible. |

**Also resolved (from the correctness review):** the **HIGH CI-gating finding** — `ci.yml` now triggers on `pull_request` and runs `cargo test` + `pytest tests/` as fast suites ([ci.yml:12,58-59,75-77](.github/workflows/ci.yml)). Several corruption-hardening commits also landed (`unchecked total runs`, `overlong stream len python side`, `more corrupt header protection`, plus a new **header checksum**), covering the LOW 32-bit-overflow item and strengthening the reader.

## What's left on the table (re-prioritized for the new code)

1. **[MED — highest-value lookup item now] Expose the snapshot interval; reconsider 50k for large maps (O10).** With cadence snapshots in, the dominant single-lookup cost is the **≤50k delta replay** after the anchor, which is `O(interval × district-size)`. On a large map (CO ~140k blocks) a worst-case lookup can replay up to 50k deltas (hundreds of ms → seconds), while a denser cadence costs almost nothing in space (≈`4·D/interval` bytes/sample — hundredths of a byte at D≈100). Plumb a `snapshot_interval` through `WriteOptions`/CLI (mirror `with_twodelta_chunk_size`), default it relative to assignment size or expose it, and document the space-vs-lookup-latency trade-off. Small change, biggest remaining payoff.
2. **[MED, needs a guarded `.bendl` addition] Persisted snapshot offset index (O3).** Removes the O(samples) header-scan, making snapshot location O(1). Most valuable for *repeated* random access and very deep streams; secondary to O10 for a single lookup. Design caveats: store in the container directory (not the stream body), offsets relative to `stream_offset`, validate the offset lands on a `0x00` tag, and extend `test_fixture_mutations.rs` to mutate index entries.
3. **[MED] Reader constant-factor trio, untouched:** O7 (overflow `drain` → cursor), O8 (JSONL reuse a buffer + `serde_json::to_writer`; fix both copies), O9 (coalesce chunk writes). Broadest steady-state wins for full decode/recompress throughput, independent of the lookup work.
4. **[LOW]** O6 (clone in the `Iterator`/Python path), O11 (CLI `.xben` lookup dispatch), O1 remainder (seek past std/mkv payloads on seekable input), O12.

## New observations on the fresh code

- **The fast TwoDelta lookup looks correct and is hardened.** `latest_twodelta_snapshot_before` validates `count != 0`, `check_payload_len`, and `check_twodelta_run_width` *during the scan* ([extract/mod.rs:181-231](ben/src/ops/extract/mod.rs)), and the replay reuses `BenStreamReader` by chaining a synthetic `TWODELTA_BEN_BANNER` in front of the seeked snapshot ([extract/mod.rs:131-135](ben/src/ops/extract/mod.rs)) — a clean way to restart mid-stream. The first frame is guaranteed a snapshot, so the "no snapshot before sample" path is unreachable on well-formed streams.
- **Cadence is BEN-only, and that's defensible — but worth a doc line.** XBEN can't be randomly seeked (LZMA2), so `extract_assignment_xben` is sequential regardless and cadence snapshots wouldn't help it. Practical implication: **a `.bendl` backed by a BEN stream gets fast random-access lookup; one backed by XBEN does not.** A real size-vs-seek trade-off worth stating in the docs so users pick the container deliberately.
- **The interval counts frames, not samples** (a repeat frame with `count>1` increments by 1) — correct, since replay cost is per frame.

---

*Re-check method: diffed the optimization surface since `6c20a7c`, read the reworked lookup/snapshot/transform code directly (`ops/extract/mod.rs`, `io/writer/stream_writer/ben.rs`, `io/writer/twodelta.rs`, `ben-py/src/{recompress,relabel}.rs`, `.github/workflows/ci.yml`), confirmed each prior finding's status against current HEAD, and re-ran the Rust suite. No source files were modified — this is a review.*


#	Prior sev	Status	Evidence / what's left
O1 BEN std/mkv header-skip lookup	HIGH	✅ Largely done	`extract_assignment_ben` uses `into_frames()` + `expand_self_contained()` on the target only — skipped frames are no longer bit-unpacked or cloned (extract/mod.rs:88-98). Remainder: skipped payloads are still `read_exact`'d into a `Vec` (decode.rs:185,207) — the seek path delegates std/mkv to the non-seek reader, so it doesn't `seek` past payloads. The expensive part (decode+clone) is gone; the residual is minor/IO-bound.
O2 cadence snapshots (TwoDelta)	HIGH	✅ Done (BEN)	New `DEFAULT_TWODELTA_SNAPSHOT_INTERVAL = 50_000` + `twodelta_deltas_since_snapshot` counter forces a snapshot every 50k delta frames (writer/twodelta.rs:13, stream_writer/ben.rs:71-75,126-141). This is the enabler that bounds replay. BEN-only by design — see notes.
O3 persisted snapshot offset index	HIGH	⬜ Open	No index in `.bendl`. `latest_twodelta_snapshot_before` byte-scans frame headers from the start each lookup (extract/mod.rs:153-202) — cheap per frame (no decode, seeks past payloads) but O(samples) to locate the snapshot. Needs a guarded container-level addition.
O4 stream the Python transforms	HIGH	✅ Done	`recompress`/`relabel` now pass `assignment_stream_reader()` straight into `encode_ben_to_xben(BufReader::new(stream), &mut session, …)` / relabel-into-session — no `read_to_end`, no whole-stream `Vec` (recompress.rs, relabel.rs). Bonus: driving the verified reader to EOF also checks the source CRC.
O5 XBEN-TwoDelta lookup re-encode	MED	◐ Open, minor	`extract_assignment_xben` decodes only the target (good) but skipped XBEN-TwoDelta frames are still replayed+re-encoded. XBEN can't seek (LZMA2), so it's inherently sequential; only the throwaway re-encode is avoidable. Low priority.
O6 per-record clone in `Iterator` path	MED	⬜ Open	`previous_assignment = Some(assignment.clone())` still at ben.rs:152, xben.rs:269,285,306. The lookup* path no longer hits this; impact is now scoped to sample-by-sample `Iterator` use (Python `for a in decoder`, ~2 clones/sample).
O7 XBEN `overflow.drain` per frame	MED	⬜ Open	Still `overflow.drain(..consumed)` at xben.rs:256,265,270,302,307. The +179 lines there were TwoDelta-lookup support, not the drain.
O8 JSONL per-sample allocation	MED	⬜ Open	`json!{}.to_string() + "\n"` still per sample at mod.rs:231 and the duplicate verify.rs:561.
O9 coalesce XBEN chunk writes	MED	⬜ Open	`flush_chunk_inner` (writer `xben.rs`) unchanged — still per-field 2–4 byte `write_all`s into the encoder.
O10 tunable snapshot cadence	LOW→MED	◐ Partial — now the top lookup lever	The interval exists but is a hardcoded `const`, not a `WriteOption` (unlike chunk size, which is plumbed via `with_twodelta_chunk_size`). See re-prioritization.
O11 CLI `lookup` XBEN dispatch	LOW	◐ Partial	`lookup` now uses the fast seek path (`extract_assignment_ben_seek`, lookup.rs:23) — good — but still errors on an `.xben` input; `extract_assignment_xben` remains CLI-unreachable.
O12 misc `reserve`/`with_capacity`	LOW	⬜ Open	Unchanged; negligible.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Other Optimization Opportunities (for furture release) #6

Optimization Review — PR #5 "1.0.0" (binary-ensemble)

Bottom line

Severity (impact-based)

Focus #1 — Lookup seek strategies that avoid a decode

Focus #2 — TwoDelta snapshotting / fast-forward

General hot-path & memory wins

Suggested sequencing

How this was reviewed

Optimization Re-check after fresh commits — PR #5 (binary-ensemble)

Bottom line

Status of the prior findings

What's left on the table (re-prioritized for the new code)

New observations on the fresh code

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

#	Severity	Area	Opportunity
O1	HIGH	Focus #1	BEN Standard/MkvChain `lookup` skips frames by header instead of full-decoding them — O(stream)→O(skip).
O2	HIGH	Focus #2	Add cadence-based snapshots so TwoDelta fast-forward has an anchor (enabler; wire-compatible).
O3	HIGH	Focus #2	Add a snapshot offset index (in `.bendl`) so reaching sample N is O(1) seek + O(cadence) replay instead of O(N).
O4	HIGH	General	Stream the Python `relabel_bundle`/`recompress_bundle` transforms — today they buffer the whole stream 2–3× in RAM.
O5	MEDIUM	Focus #1/#2	XBEN-TwoDelta lookup/subsample replays correctly but re-encodes every skipped frame — drop the throwaway re-encode.
O6	MEDIUM	General/both	Eliminate the per-step `assignment.clone()` in `next_record_*` (and the Python cursor) during replay.
O7	MEDIUM	General	XBEN reader `overflow.drain(..consumed)` per frame is O(remaining) memmove — use a cursor offset.
O8	MEDIUM	General	Decode→JSONL allocates a `Value` + `String` + concat per sample — reuse a buffer + `serde_json::to_writer`.
O9	MEDIUM	General	`flush_chunk_inner` writes the chunk in dozens of 2–4 byte `write_all`s straight into the xz encoder — coalesce into one buffer.
O10	LOW	Focus #2	Make snapshot cadence tunable (own knob, not the chunk size); consider scaling it to assignment size.
O11	LOW	Focus #1	CLI `lookup` only handles BEN — an `.xben` input errors instead of dispatching to `extract_assignment_xben`.
O12	LOW	General	`count_samples_ben` allocates+reads each payload to discard it (reuse the O1 skip primitive); minor `reserve`/`with_capacity` and a small directory clone.

Other Optimization Opportunities (for furture release) #6

Description

Optimization Review — PR #5 "1.0.0" (binary-ensemble)

Bottom line

Severity (impact-based)

Focus #1 — Lookup seek strategies that avoid a decode

Focus #2 — TwoDelta snapshotting / fast-forward

General hot-path & memory wins

Suggested sequencing

How this was reviewed

Optimization Re-check after fresh commits — PR #5 (binary-ensemble)

Bottom line

Status of the prior findings

What's left on the table (re-prioritized for the new code)

New observations on the fresh code

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions