Merged
222 commits
43d6b38
reorganize the repo to acutally make it maintainable
peterrrock2 Mar 12, 2026
8da732f
improve testing in ben
peterrrock2 Mar 12, 2026
a85511e
reorg the clis
peterrrock2 Mar 12, 2026
af2586b
add cli tests
peterrrock2 Mar 12, 2026
c46d18c
modernize the logging structure in ben
peterrrock2 Mar 12, 2026
6fdaa1c
small reorg and cleanup of pathologic cases in pyben
peterrrock2 Mar 12, 2026
8c11fc9
migrate to taskfile
peterrrock2 Mar 13, 2026
48bcd73
add tests and coverage for whole package
peterrrock2 Mar 13, 2026
aaf413e
rename internals so imports are use/import binary_ensemble
peterrrock2 Mar 13, 2026
80ef93b
More doc strings
peterrrock2 Mar 16, 2026
72fcaf1
Try to make relabeling faster
peterrrock2 Mar 16, 2026
34c3907
fix pben counting logic
peterrrock2 Mar 16, 2026
475c607
speed up pben conversoin logic
peterrrock2 Mar 16, 2026
34ead30
Add in spectral and Cuthill-McKee ordering to just see how well it does
peterrrock2 Mar 16, 2026
f737f33
swap spectral for nested dissection
peterrrock2 Mar 16, 2026
07750f0
swap nested for mla
peterrrock2 Mar 16, 2026
a1862b9
swap mla for mlc
peterrrock2 Mar 16, 2026
82ca34b
optimize mlc a bit
peterrrock2 Mar 16, 2026
bffa8df
Start of twodelta
peterrrock2 Mar 16, 2026
57e3cd9
more tests
peterrrock2 Mar 17, 2026
80f5eb6
Opitimize the checks for eqality to improve write speed
peterrrock2 Mar 17, 2026
a19fb26
add twodelta to reben
peterrrock2 Mar 17, 2026
e0054a1
add twodelta to xben
peterrrock2 Mar 17, 2026
b023278
move banners to their own section
peterrrock2 Mar 17, 2026
98b8541
add way to convert between ben versions
peterrrock2 Mar 17, 2026
5ca5e65
format
peterrrock2 Mar 17, 2026
befb641
fix read-all bug
peterrrock2 Mar 17, 2026
ed71980
update docs
peterrrock2 Mar 17, 2026
8dc1ce9
fix labelling issue
peterrrock2 Mar 17, 2026
3967b9a
Speed up conversion
peterrrock2 Mar 17, 2026
d1a85d0
optimize the twodelta case
peterrrock2 Mar 17, 2026
f2dd39b
fix header double-check bug
peterrrock2 Mar 17, 2026
9cc8efc
small opt for xben encoder
peterrrock2 Mar 17, 2026
392cf9c
opt decode
peterrrock2 Mar 17, 2026
4a06e29
get rid of extreneous copy
peterrrock2 Mar 17, 2026
69fabd9
possible xz compression improvement for twodelta
peterrrock2 Mar 17, 2026
c6bcfaf
remove mla
peterrrock2 Mar 17, 2026
762c8f6
Cleanup of twodelta method
peterrrock2 Mar 18, 2026
713d303
Start stubbing out bigger reorg
peterrrock2 Mar 19, 2026
7dd91fe
More reorg
peterrrock2 Mar 20, 2026
a0e8f73
Make a better error system
peterrrock2 Mar 20, 2026
2ce7fbf
Fix up twodelta so that the tests all work again
peterrrock2 Mar 20, 2026
a911ff6
More org
peterrrock2 Mar 20, 2026
1199d50
Modify responsibilities of the encoder
peterrrock2 Mar 20, 2026
d352036
Remove redundant code in io/writer/ben
peterrrock2 Mar 20, 2026
0f009b2
Reorg (I will figure out the design I like eventually)
peterrrock2 Mar 20, 2026
6078ab2
Centralized constructor impl for Ben-type frames
peterrrock2 Mar 20, 2026
cd0233a
Change BenEncoder -> AssignmentWriter
peterrrock2 Mar 20, 2026
8321f3e
Change XBenEncoder -> XZAssignmentWriter
peterrrock2 Mar 20, 2026
03da065
Change writer/ben.rs -> write/assignment_writer and make better use o…
peterrrock2 Mar 23, 2026
f8735dc
Update XZAssignment writer to be more parallel with the AssigmentWriter
peterrrock2 Mar 23, 2026
b126117
Rename twodelta.rs -> twodelta_encode.rs
peterrrock2 Mar 23, 2026
c8e2348
Update decoder and readers
peterrrock2 Mar 23, 2026
2433fa9
Update some tests
peterrrock2 Mar 31, 2026
9e8fa32
Add twodelta into cli
peterrrock2 Apr 7, 2026
edc9647
add twodelta to python side
peterrrock2 Apr 8, 2026
f53715e
improve docs for things that should not use twodelta
peterrrock2 Apr 8, 2026
b1c0813
Improve test suite
peterrrock2 Apr 8, 2026
0f8d0ff
add edge-case tests
peterrrock2 Apr 8, 2026
cc53e8a
change format spec for bendl
peterrrock2 Apr 8, 2026
f9224ae
move over to using petgraph internals
peterrrock2 Apr 10, 2026
027ba0d
try louvain
peterrrock2 Apr 10, 2026
266f11e
revert louvain
peterrrock2 Apr 11, 2026
612512e
add bendl
peterrrock2 Apr 11, 2026
5b5ba8b
better stess testing
peterrrock2 Apr 11, 2026
f933629
update pyben decoder
peterrrock2 Apr 11, 2026
cdfd7d8
fix warnings
peterrrock2 Apr 11, 2026
42e191a
allow multiple iterator passes
peterrrock2 Apr 11, 2026
cbcad0d
Lots more tests and remove PyBundleReader
peterrrock2 Apr 13, 2026
9ae12c2
More testing
peterrrock2 Apr 20, 2026
a43cf51
Get coverage to 98% on rust side
peterrrock2 Apr 30, 2026
d58276c
better json reader
peterrrock2 May 1, 2026
75049bf
we will continue testing until moral improves
peterrrock2 May 1, 2026
70912ca
move writer tests
peterrrock2 May 1, 2026
5b35451
reorganize bundle module
peterrrock2 May 1, 2026
092bdac
reorg cli module
peterrrock2 May 1, 2026
ae44ce5
reog pyben side to parallel ben side
peterrrock2 May 1, 2026
aa8ad81
reorg to make cli easier to test
peterrrock2 May 1, 2026
3c86805
fix twodelta inconsistency
peterrrock2 May 4, 2026
288045d
fix up some ambiguous / clashing terminology
peterrrock2 May 5, 2026
1c5255a
tweaks to cli behavior
peterrrock2 May 5, 2026
e6a98ee
change pyben -> ben-py
peterrrock2 May 5, 2026
d9b5a3b
change progress to spinner
peterrrock2 May 5, 2026
12aabce
get rid of frame duplicates
peterrrock2 May 6, 2026
6c1318d
Fix python tests
peterrrock2 May 6, 2026
143bf91
Increase size of xz block to improve ability to parallelize
peterrrock2 May 6, 2026
e71975b
Allow --n-cpus -1 to mean all cores
peterrrock2 May 6, 2026
860012f
Unify stream readers
peterrrock2 May 7, 2026
74d0dd5
small writer extraction
peterrrock2 May 7, 2026
c712179
consolodate relabel module
peterrrock2 May 9, 2026
46655eb
small dedupe in graph module
peterrrock2 May 9, 2026
ecc39df
Unify stream writers
peterrrock2 May 10, 2026
6c2a904
add known asset kind enum for bendl
peterrrock2 May 10, 2026
92f2b3e
add in an XBEN variant
peterrrock2 May 10, 2026
aa185d8
update pcompress translation to be more consistent
peterrrock2 May 10, 2026
2263240
Update bundle protocol
peterrrock2 May 10, 2026
899e270
clean up bendl stream api
peterrrock2 May 10, 2026
7a7ef6c
add in checksum logic
peterrrock2 May 18, 2026
1ce870c
formatting pass
peterrrock2 May 18, 2026
b76de04
checksum for the assignment streams
peterrrock2 May 21, 2026
f4d95e3
Add in some edge-case tests
peterrrock2 May 21, 2026
c2aac6e
add strict payload length enforcement
peterrrock2 May 22, 2026
18ad3ec
add some fixtures and stability tests
peterrrock2 May 22, 2026
c04cee7
enforce bit width consistency and add size guard
peterrrock2 May 22, 2026
7acb805
add adversarial tests
peterrrock2 May 22, 2026
f4a93f9
add more edge-case stress tests
peterrrock2 May 22, 2026
3cfee66
check twodelta boundary test
peterrrock2 May 22, 2026
f8986c8
more boundary tests
peterrrock2 May 22, 2026
24f8672
check label value 0 makes the round trip
peterrrock2 May 22, 2026
f146f52
add property tests for bendl
peterrrock2 May 22, 2026
5d84063
add property-based equivalence tests
peterrrock2 May 22, 2026
da50034
add some forward-compatibility tests
peterrrock2 May 22, 2026
8797434
add tests for multi-step decode
peterrrock2 May 22, 2026
aa00f83
test parallel reads
peterrrock2 May 22, 2026
4187b8c
test zero and one sample edge cases
peterrrock2 May 22, 2026
d6aef70
add cli tests
peterrrock2 May 22, 2026
107d880
better testing of the bendl write path
peterrrock2 May 22, 2026
d0c5b71
more cli path tests for bendl
peterrrock2 May 22, 2026
7b58410
remove redundant tests
peterrrock2 May 22, 2026
abac92d
formatting
peterrrock2 May 25, 2026
4f8c064
split bundle reader up
peterrrock2 May 30, 2026
ff2d3ec
create asset registry
peterrrock2 May 30, 2026
4de8883
Improve crash safety of BENDL and tighten verified stream APIs
peterrrock2 May 30, 2026
39a6ee8
Fix maximum number of directory entries
peterrrock2 May 31, 2026
e343f1f
remove repetition in verify
peterrrock2 May 31, 2026
841d16b
More consistent progress bar semantics
peterrrock2 May 31, 2026
d78b271
clean up fast path in relabel
peterrrock2 May 31, 2026
189b2d6
improve readability
peterrrock2 May 31, 2026
c53baf0
make clippy happy
peterrrock2 May 31, 2026
912e526
better recovery for dropped stream
peterrrock2 May 31, 2026
696038d
finish benpy rename
peterrrock2 Jun 1, 2026
b2a0a37
make two-delta work better on things like SB or arbitrary ensembles
peterrrock2 Jun 2, 2026
b13b0ed
fix clippy
peterrrock2 Jun 2, 2026
6475871
reorg python side and add bundle bindings
peterrrock2 Jun 3, 2026
2d32712
add relabel bundle helper and reformat
peterrrock2 Jun 4, 2026
3ca70d0
api change to add_graph on the python side
peterrrock2 Jun 4, 2026
70caba8
update cli to use dualgraph rather than shapefile
peterrrock2 Jun 5, 2026
1712c3b
major docs overhaul
peterrrock2 Jun 6, 2026
0ae330c
update docs
peterrrock2 Jun 9, 2026
aa038e4
a different color theme
peterrrock2 Jun 9, 2026
c4b616f
better contrast in color theme
peterrrock2 Jun 9, 2026
9dcaee7
better doc strings for the api section
peterrrock2 Jun 9, 2026
ec3c1f7
more docs!!
peterrrock2 Jun 9, 2026
3d20a76
reroute old methods through the ben stream reader
peterrrock2 Jun 10, 2026
6b8adde
format
peterrrock2 Jun 10, 2026
a4f5b39
harden against oversized assets
peterrrock2 Jun 10, 2026
add9fc3
harden against malicious assignment lengths and middle zero-byte corr…
peterrrock2 Jun 10, 2026
1fb97c2
better reader errors (propagate)
peterrrock2 Jun 10, 2026
4123e9a
fix some edge case issues in the pcompress translation
peterrrock2 Jun 10, 2026
7f7e998
handle duplicate index in remap and failed final flush edge cases
peterrrock2 Jun 10, 2026
cb41eec
better coverage harnass
peterrrock2 Jun 10, 2026
c557b11
get rid of the try_from_parts symantics and fix potential zero drop
peterrrock2 Jun 11, 2026
87cf6e6
fix issue with twodelta silently failing on a pathological 2 distrcit TX
peterrrock2 Jun 11, 2026
5593e37
Add in cross architecture tests
peterrrock2 Jun 11, 2026
7f6b0f8
remove the from_parts from twodelta
peterrrock2 Jun 11, 2026
38f9f43
convert panics into errors for better handling
peterrrock2 Jun 11, 2026
a33b35c
add rust linter into workflow
peterrrock2 Jun 11, 2026
5e76363
make sure interop with Pcompress still lives
peterrrock2 Jun 11, 2026
0fe0e4c
better ci
peterrrock2 Jun 11, 2026
b83148a
more fuzzing
peterrrock2 Jun 11, 2026
929ecc4
add soak test
peterrrock2 Jun 11, 2026
0611dc4
better ci / cd with a smoke test on wheels
peterrrock2 Jun 11, 2026
cedd613
autocompress larger assets in bundle
peterrrock2 Jun 12, 2026
72d0018
Note compression bomb risk (consequence of format and nothing to do h…
peterrrock2 Jun 12, 2026
a3b3400
tuning for fuzz tests
peterrrock2 Jun 12, 2026
1361ace
better api for asset payloads
peterrrock2 Jun 12, 2026
bf4179c
make sure python assets verify
peterrrock2 Jun 12, 2026
14304b8
enabling remove asset
peterrrock2 Jun 12, 2026
99c568e
Docs.....
peterrrock2 Jun 12, 2026
f26b2b1
naming fixes
peterrrock2 Jun 12, 2026
44dbe9b
make sure payload lengths do not overrun file size
peterrrock2 Jun 12, 2026
60f6bce
fix mid-write crash bug
peterrrock2 Jun 12, 2026
3e2fac4
unify remove and compact
peterrrock2 Jun 12, 2026
3ff79c2
remove large memory buffering of tail payloads
peterrrock2 Jun 12, 2026
33ce37c
power loss safety because I'm paranoid
peterrrock2 Jun 12, 2026
39c91d8
better file swap dicipline
peterrrock2 Jun 12, 2026
2020092
poison the encoder on failed stream finalize
peterrrock2 Jun 12, 2026
dc8d0ad
guard against mutations to underlying bendl (b/c jupyter)
peterrrock2 Jun 12, 2026
82dfbba
prevent custom assets from claiming known names
peterrrock2 Jun 12, 2026
a698eaa
deal with GIL
peterrrock2 Jun 13, 2026
1eebdf4
validate content type before writing
peterrrock2 Jun 13, 2026
23b8761
dedup pass
peterrrock2 Jun 13, 2026
35975b6
docs
peterrrock2 Jun 13, 2026
a70ac38
put fuzzing back in
peterrrock2 Jun 13, 2026
741bee9
give control over asset compression
peterrrock2 Jun 13, 2026
6c04ef4
cleanup comments
peterrrock2 Jun 13, 2026
0648dd8
more docs for me
peterrrock2 Jun 13, 2026
6e1faf5
better visited link color
peterrrock2 Jun 13, 2026
c17f410
phrasing sweep on docs
peterrrock2 Jun 13, 2026
0fb781e
docs org
peterrrock2 Jun 13, 2026
db2e62c
rename stream -> ben_stream
peterrrock2 Jun 13, 2026
e56ac50
prettify docs
peterrrock2 Jun 13, 2026
5a4a1b1
fix links
peterrrock2 Jun 13, 2026
bf211ba
fix links again
peterrrock2 Jun 13, 2026
75d42cc
increase timout for link check
peterrrock2 Jun 13, 2026
23f45b1
prettier
peterrrock2 Jun 13, 2026
5b7c23a
more link fixing
peterrrock2 Jun 13, 2026
979bb2d
you're killin me smalls
peterrrock2 Jun 13, 2026
967c2e3
link
peterrrock2 Jun 13, 2026
a2f378c
fix link ignore regex
peterrrock2 Jun 13, 2026
6c20a7c
remove landing page buttons
peterrrock2 Jun 13, 2026
4d94ef4
update log levels
peterrrock2 Jun 13, 2026
798efd2
change the way that reben does suffixes
peterrrock2 Jun 14, 2026
a22c62d
make canonicalize start at 0 rather than 1. lock in ability for arbit…
peterrrock2 Jun 14, 2026
cdac4cc
change cli shape
peterrrock2 Jun 14, 2026
e14d5f2
fix up transcode to/from pcompress
peterrrock2 Jun 14, 2026
f791ef0
better banner error
peterrrock2 Jun 14, 2026
d063994
small cosmetic changes to cli
peterrrock2 Jun 14, 2026
35e6a0c
go back to fast lookup on standard and mkv ben files
peterrrock2 Jun 14, 2026
c72ecee
faster twodelta lookup
peterrrock2 Jun 14, 2026
0959ed0
better spinner behavior on xz and lookup
peterrrock2 Jun 15, 2026
27789fe
update ci suite to run fast tests on PRs
peterrrock2 Jun 15, 2026
004672e
more corrupt header protection on the ben side
peterrrock2 Jun 15, 2026
29a77ad
protect against overlong stream len on python side
peterrrock2 Jun 15, 2026
d965df6
fix unchecked total runs and relabel bundle
peterrrock2 Jun 15, 2026
f3fe085
errors / docs cleanliness
peterrrock2 Jun 15, 2026
a76f418
change main CO example
peterrrock2 Jun 15, 2026
951abc8
add checksum for the header
peterrrock2 Jun 15, 2026
2d27f0f
some reorg / cleanup
peterrrock2 Jun 15, 2026
cf955d2
centralize stamp header (tests) and remove ability to write w/o heade…
peterrrock2 Jun 15, 2026
d57d79a
fix fuzz workflow
peterrrock2 Jun 15, 2026
91eda43
Merge branch 'main' into 1.0.0
peterrrock2 Jun 16, 2026
1 change: 1 addition & 0 deletions .gitattributes
@@ -1 +1,2 @@
 example/100k_CO_chain.jsonl.xben filter=lfs diff=lfs merge=lfs -text
+example/50k_CO_chain.xben filter=lfs diff=lfs merge=lfs -text
38 changes: 34 additions & 4 deletions .github/workflows/ci.yml
@@ -1,10 +1,12 @@
 name: CI

-# Lightweight quality gates on every PR: formatting and lints for both languages. The heavier
-# gates (full test suite, big-endian emulation) live in full-tests.yml and run on demand — either
-# from the Actions tab or via a `/ci-full` / `/ci-endian` PR comment.
+# Quality gates on every PR: formatting, lints, and the fast test suites for both languages. The
+# heavier gates (the `#[ignore]`-gated stress suite, big-endian emulation, fuzzing) live in
+# full-tests.yml and run on demand: from the Actions tab or via a `/ci-full` / `/ci-endian` /
+# `/ci-fuzz` PR comment.
 #
-# These mirror `task format` / `task lint`; keep the two in sync.
+# These mirror `task format` / `task lint` / `task test-rust-fast` / `task test-python`; keep them in
+# sync.

 on:
   pull_request:
@@ -45,3 +47,31 @@ jobs:
       - name: ruff check
         working-directory: ben-py
         run: uvx ruff check .
+
+  rust-test:
+    name: rust tests (fast)
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: dtolnay/rust-toolchain@stable
+      - uses: Swatinem/rust-cache@v2
+      - name: cargo test
+        run: cargo test
+
+  python-test:
+    name: python tests
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: dtolnay/rust-toolchain@stable
+      - uses: Swatinem/rust-cache@v2
+      - uses: astral-sh/setup-uv@v5
+      - name: sync environment
+        working-directory: ben-py
+        run: uv sync --all-groups
+      - name: build extension
+        working-directory: ben-py
+        run: uv run maturin develop
+      - name: pytest
+        working-directory: ben-py
+        run: uv run pytest tests/
3 changes: 2 additions & 1 deletion .gitignore
@@ -7,4 +7,5 @@ dev_files
 demo

 __pycache__
-*.so
+*.so
+/ben-py/docs/code-theme-preview.md
7 changes: 5 additions & 2 deletions .readthedocs.yaml
@@ -9,10 +9,13 @@ build:
 python:
   install:
     - method: pip
-      path: ./pyben
+      path: ./ben-py
       extra_requirements:
         - docs

 sphinx:
   builder: dirhtml
-  configuration: pyben/docs/conf.py
+  configuration: ben-py/docs/conf.py
+  # Notebook execution stays off here (NB_EXECUTION_MODE defaults to "off"), so the
+  # hosted build renders the committed notebook outputs. CI executes the notebooks.
+  fail_on_warning: true
144 changes: 144 additions & 0 deletions CONTEXT.md
@@ -0,0 +1,144 @@
# Context

Orientation for anyone (human or agent) landing in the `binary-ensemble` workspace. It explains what
the project is, how the code is shaped, and the invariants that aren't obvious from any single file.

## What this project is

`binary-ensemble` compresses **ensembles of districting plans**. A redistricting sampler (MCMC
ReCom, SMC, etc.) emits thousands to millions of plans as canonicalized JSONL: one
`{"assignment": [...], "sample": n}` line per draw. Those files are enormous and highly redundant.
This workspace turns them into compact binary formats and provides the tooling to encode, decode,
inspect, relabel, and bundle them.

It is the spiritual successor to [PCompress](https://github.com/mggg/pcompress) and interoperates
with it.

The formats, in increasing capability:

- **`.ben`**: a banner plus bit-packed, run-length-encoded frames. One frame per sample.
- **`.xben`**: a BEN stream's payload wrapped in LZMA2 for maximum size reduction.
- **`.bendl`**: a self-describing _bundle_ holding a header, optional assets (dual graph, metadata,
node-permutation map), an embedded BEN/XBEN assignment stream, and a trailing directory. Feels
like one file; supports interrupted writes and post-finalize appends.
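
To make "run-length-encoded" concrete, here is a minimal Rust sketch of RLE over a single assignment. The function names and the `(district_id, run_length)` pair layout are assumptions made for illustration; the actual frame layout is defined by the `codec` module, not by this snippet.

```rust
/// Run-length encode an assignment into (district_id, run_length) pairs.
/// Hypothetical helper for illustration; not the crate's wire format.
fn rle_encode(assignment: &[u16]) -> Vec<(u16, u16)> {
    let mut runs: Vec<(u16, u16)> = Vec::new();
    for &district in assignment {
        if let Some(last) = runs.last_mut() {
            // Extend the current run while the value repeats and the counter fits.
            if last.0 == district && last.1 < u16::MAX {
                last.1 += 1;
                continue;
            }
        }
        // Otherwise start a new run of length one.
        runs.push((district, 1));
    }
    runs
}

/// Expand (district_id, run_length) pairs back into the assignment vector.
fn rle_decode(runs: &[(u16, u16)]) -> Vec<u16> {
    runs.iter()
        .flat_map(|&(value, len)| std::iter::repeat(value).take(len as usize))
        .collect()
}
```

Assignments tend to compress well this way when the node ordering keeps geographically adjacent nodes next to each other, since neighbouring nodes usually share a district.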

## Domain model (in brief)

`docs/glossary.md` is the **source of truth** for terminology and is worth reading before making
changes. The essentials:

- **Plan**: a partition of dual-graph nodes into districts (the mathematical object).
  **Assignment**: its vector encoding, `Vec<u16>` where index _i_ is the district id of node _i_.
  One plan has many assignments, because relabeling the districts changes the vector but not the
  partition.
- **Sample**: `(sample_number, assignment)`. **Ensemble**: an ordered stream of samples from one
sampler run; the thing every format wraps.
- **Variant**: `Standard` | `MkvChain` | `TwoDelta`. Fixed per stream by its banner. `Standard`
stores each sample independently; `MkvChain` collapses repeated consecutive samples with a count;
`TwoDelta` delta-encodes single-ReCom-step transitions. Variant fitness depends on the sampler;
see the glossary.
- **Dual graph**: the geographic adjacency graph that gives a node ordering meaning.
Relabeling/reordering operations are defined against it.
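
The `MkvChain` idea of collapsing repeated consecutive samples with a count can be sketched as follows. This is an illustration under assumptions: the function name, the `u32` count width, and the in-memory `Vec` (the real code streams) are all hypothetical.

```rust
/// Collapse runs of identical consecutive assignments into (assignment, count).
/// Hypothetical sketch of the idea behind the `MkvChain` variant.
fn collapse_repeats(samples: &[Vec<u16>]) -> Vec<(Vec<u16>, u32)> {
    let mut out: Vec<(Vec<u16>, u32)> = Vec::new();
    for assignment in samples {
        if let Some(last) = out.last_mut() {
            // A rejected proposal repeats the previous plan, so only bump the count.
            if &last.0 == assignment {
                last.1 += 1;
                continue;
            }
        }
        out.push((assignment.clone(), 1));
    }
    out
}
```

Only consecutive repeats are merged, so the order of the ensemble is preserved and the original stream can be reconstructed by repeating each entry `count` times.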

The glossary also nails down deliberately-disambiguated words ("header", "extract", "payload",
"canonical\*") and the relabeling taxonomy. Honor those distinctions in code and prose.

## Architecture

A Cargo workspace with two members:

- **`ben/`**: package `binary-ensemble`, library `binary_ensemble`, plus two thin CLI binaries.
- **`ben-py/`**: PyO3 bindings (cdylib) published as the `binary_ensemble` Python package. Depends
on `ben/` by path; the core library has no Python dependency.

### CLI binaries (`ben/src/bin/*.rs`)

Each is a one-line wrapper over `cli::<tool>::run()`. `ben` is a subcommand tree; `bendl` owns the
bundle container role:

| Binary  | Role   | Does                                                                         |
| ------- | ------ | ---------------------------------------------------------------------------- |
| `ben`   | codec  | encode/decode BEN/XBEN + xz; relabel/canonicalize/reencode; pcompress bridge |
| `bendl` | bundle | create / inspect / extract / append `.bendl` containers                      |

`ben` subcommands: `encode`, `xencode`, `decode`, `xdecode`, `lookup`, `xz-compress`,
`xz-decompress`, `relabel`, `canonicalize`, `reencode`, `sort-graph`, and `pcompress` (`from-ben` /
`to-ben` / `to-xben`). The relabel pipeline (decode → transform → re-encode) backs
`relabel`/`canonicalize`/`reencode`; the PCompress bridge backs `pcompress`.

### Library modules (`ben/src/`)

- **`codec/`**: the heart. `encode`, `decode`, `frames` (`BenEncodeFrame` / `BenDecodeFrame`), and
`translate` (BEN ↔ ben32 wire form). Frames keep their `raw_bytes` so they can be moved/subsampled
without eager unpacking.
- **`io/`**: streaming `reader` / `writer` over buffered, generic IO, and `bundle` (the `.bendl`
reader/writer/verify/format machinery).
- **`format/`**: on-disk metadata shared across streams (banners and `FormatError`).
- **`ops/`**: the higher-level operations `relabel` (the single `relabel_ben_file` driver
parameterised by `RelabelOptions`) and `extract`.
- **`json/`**: dual-graph utilities (NetworkX-adjacency IO, MLC and RCM node-ordering algorithms)
used by the relabel pipeline.
- **`progress/`**, **`logging/`**, **`util/`**: spinners (`indicatif`), `tracing` setup, and small
shared helpers (RLE).

### Data flow

```mermaid
flowchart LR
JSONL -->|encode| RLE
RLE -->|bit-pack| frame
frame -->|concat| ben["stream(.ben)"]
ben -->|LZMA2 wrap| xben[".xben"]
ben -->|bundle| bendl[".bendl"]
```

Decode reverses this. The relabel subcommands run decode → transform → re-encode in one streaming
pass. The encoding stack has five named layers (bit-packing, RLE, frame, stream, container); see the
glossary's "Encoding Stack" table.
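
As an illustration of the bit-packing layer, the sketch below packs values into the smallest fixed bit width that fits the largest one. The helper names and the most-significant-bit-first order are assumptions for this example; the bit order BEN actually uses is not stated here.

```rust
/// Smallest number of bits that can represent `max_value` (at least 1).
fn bit_width(max_value: u16) -> u32 {
    (16 - max_value.leading_zeros()).max(1)
}

/// Pack each value into `width` bits, most significant bit first (assumed order).
/// Callers must ensure every value fits in `width` bits.
fn pack_bits(values: &[u16], width: u32) -> Vec<u8> {
    let mut out = Vec::new();
    let mut acc: u64 = 0; // bit accumulator
    let mut filled: u32 = 0; // number of pending bits in `acc`
    for &v in values {
        acc = (acc << width) | u64::from(v);
        filled += width;
        while filled >= 8 {
            filled -= 8;
            out.push((acc >> filled) as u8);
        }
    }
    if filled > 0 {
        // Left-align the leftover bits and pad the final byte with zeros.
        out.push((acc << (8 - filled)) as u8);
    }
    out
}

/// Read back `count` values of `width` bits each; stops early on truncated input.
fn unpack_bits(bytes: &[u8], width: u32, count: usize) -> Vec<u16> {
    let mut out = Vec::with_capacity(count);
    let mut acc: u64 = 0;
    let mut filled: u32 = 0;
    let mut iter = bytes.iter();
    while out.len() < count {
        while filled < width {
            match iter.next() {
                Some(&b) => {
                    acc = (acc << 8) | u64::from(b);
                    filled += 8;
                }
                None => return out,
            }
        }
        filled -= width;
        out.push(((acc >> filled) & ((1u64 << width) - 1)) as u16);
    }
    out
}
```

Because the final byte is zero-padded, a decoder needs the value count (or an equivalent length field) to know where the real data ends.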

## Invariants and cross-cutting concerns

These hold across the codebase and are easy to violate by accident:

- **Format stability is a contract.** Committed fixtures under `ben/tests/fixtures/v<n>/` must keep
decoding forever within a major version. Never regenerate fixtures in place. See
`docs/format-stability.md`.
- **Frames decode lazily.** Keeping `raw_bytes` without unpacking runs is what makes
subsample-by-skip and random-access reads fast. Don't force eager bit-unpacking on read.
- **Integrity is checked with CRC32C.** Verifying read paths are the default; checksum-skipping
variants are explicitly named with an `_unverified` suffix.
- **Terminology is disciplined.** The glossary governs identifiers and prose; when they disagree,
the glossary wins and the code changes.
- **Streaming, not slurping.** Ensembles are too large to hold in memory.
- **64-bit only** (enforced with `compile_error!` in `lib.rs`).
- **Illegal states are unrepresentable where practical**: e.g. `XBenVariant` cannot hold `TwoDelta`,
so BEN32-only paths reject it at compile time.
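
The integrity invariant can be illustrated with a small, dependency-free CRC32C and a pair of read helpers that follow the `_unverified` naming rule. This is a sketch: `read_payload` and `read_payload_unverified` are hypothetical names, and a real implementation would likely use a table-driven or hardware-accelerated CRC rather than this bitwise loop.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // `mask` is all ones when the low bit is set, otherwise zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Default path (hypothetical): return the payload only if its checksum matches.
fn read_payload(payload: &[u8], stored_crc: u32) -> Result<&[u8], String> {
    let actual = crc32c(payload);
    if actual == stored_crc {
        Ok(payload)
    } else {
        Err(format!(
            "checksum mismatch: stored {stored_crc:#010x}, computed {actual:#010x}"
        ))
    }
}

/// Explicit opt-out (hypothetical): the name makes the skipped check visible at call sites.
fn read_payload_unverified(payload: &[u8]) -> &[u8] {
    payload
}
```

Making verification the unmarked default means a caller has to type `_unverified` to skip it, which keeps unchecked reads easy to find in review.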

## Building and testing

The workspace uses a `Taskfile.yml` (the `task` / `go-task` runner) as the single entry point for
local workflows. CI runs the quality gates (formatting, lints, and the fast test suites) on every
PR; the heavier gates (the `#[ignore]`-gated stress suite, big-endian emulation, fuzzing) run on
demand via the Actions tab or a `/ci-full` / `/ci-endian` / `/ci-fuzz` PR comment from a
collaborator. The wheel-publishing workflow is separate and tag-triggered.

- `task test`: Rust fast suite + `#[ignore]`-gated slow/stress suite + Python `pytest`.
- `task format`: `cargo fmt --all` + `ruff format`.
- `task lint`: `cargo clippy --workspace --all-targets` (warnings denied) + `ruff check`.
- `task coverage-summary`: combined Rust + Python coverage.
- `task test-endian`: full ben suite on one big-endian and one little-endian target via `cross`
(Docker + QEMU), proving wire-format endianness regardless of the development machine.
`task check-endian` is the no-Docker compile-only tier.
- `task fuzz`: time-boxed coverage-guided fuzzing (cargo-fuzz/libFuzzer, nightly) of every read
surface, seeded from the committed fixtures. `FUZZ_SECONDS` bounds each target (default 60).

Python development uses `uv` + `maturin` (`task ben-py-develop`).

## Document map

- **`docs/glossary.md`**: terminology, the source of truth.
- **`docs/coding-standards.md`**: how code in this repo is written (errors, logging, naming,
testing, modules, PyO3).
- **`docs/bendl-format-spec.md`**: the `.bendl` on-disk binary layout.
- **`docs/format-stability.md`**: the wire-format stability policy.
- **`README.md`**: user-facing CLI and library usage.
- **`docs/*-plan.md`**: active design plans, written before implementation.