Skip to content

Support shared categorical dictionaries#100

Merged
pbower merged 12 commits into
mainfrom
categorical_dicts
May 25, 2026
Merged

Support shared categorical dictionaries#100
pbower merged 12 commits into
mainfrom
categorical_dicts

Conversation

@pbower
Copy link
Copy Markdown
Owner

@pbower pbower commented May 23, 2026

This feature supports the case where one is working with a Categorical array, and it has a set of mappings as to which integers correspond to which strings (dictionary).

This feature adds support for a standalone dictionary that acts as a common source of truth.

pbower added 11 commits May 16, 2026 04:49
Before this change, every CategoricalArray held its own Vec64<String>
dictionary. Different batches for the same logical column ended up with
independent dictionaries, so code 3 in batch 1 could mean a different
string from code 3 in batch 17. Any operation comparing codes across
batches (group-by, hash-join, filter) had to materialise to strings or
remap on the fly, paying that cost on every operation.

The dictionary now has two modes. Standalone categoricals carry their
own Owned dictionary and mutate freely as before. Categoricals held
inside a SuperTable, SuperArray, or peer-sync set instead carry a
Shared snapshot of a dictionary owned by the parent's CategoryManager.
All batches in the same SuperTable see the same codes, and the cost of
keeping them aligned is paid once at push time rather than inside every
downstream kernel.

Growth on a Shared dictionary cannot happen from a single batch in
isolation, because that would silently desynchronise it from its
siblings. Mutators on a Shared categorical instead demote the dictionary
to Owned in place and log a warning, so the user can audit where their
categorical stopped sharing with the parent. The recommended path for
live ingestion is to route the new value through the parent's intern
method, which extends the shared dictionary once and leaves every
sibling snapshot coherent under the append-only invariant.

Reads are identical in both modes and do not touch any lock. The default
storage on the parent side is Mutex<Arc<DictionaryInner<T>>> which holds
the lock only on growth events; the contended_dict feature swaps that
for arc_swap::ArcSwap with a CAS-loop write path for workloads where
concurrent novel-value insertion is a hot path.

Arrow round-trips are byte-identical. The categorical tests in
tests/arrow_c_integration and tests/apache_arrow pass unchanged under
the default, default_categorical_8, extended_categorical, and
contended_dict feature combinations.
…r Consolidate on typed arrays

Splits the categorical dictionary work into two parts:

1. **Shared-dictionary surface is now opt-in via `shared_dict`** (chains
   into `contended_dict` for the ArcSwap backend). Without the feature
   every `CategoricalArray<T>` owns its dictionary outright and codes are
   not mutually meaningful across batches. Gated behind it:
   `Dictionary::Shared`, `DictionaryError::Shared`, `CategoryManager`,
   `CategoryDispatch`, `CategoricalArray::with_dictionary`,
   `SuperTable.category_managers`, `SuperTable::category_dispatch`, and
   the absorb path. `CategoryDispatch` exposes `install_from` / `absorb`
   methods that exhaustive-match the variant once and delegate to
   `Dictionary::to_inner` and `CategoryManager::absorb`, replacing the
   TypeId/transmute helper layer.

2. **`Consolidate` is now implemented on each typed array** via
   `Vec<TypedAVT<'a, T>>` (the existing AVT view tuples in aliases.rs):
   `IntegerAVT`, `FloatAVT`, `StringAVT`, `BooleanAVT`, `DatetimeAVT`,
   `CategoricalAVT`. Each impl reads windows directly from the source
   buffer - one copy per chunk, same algorithm as the old per-variant
   macros. `Vec<ArrayVT<'a>>::consolidate` exhaustively dispatches on
   the first chunk's variant and gathers typed AVTs. The
   `consolidate_int_variant!`, `consolidate_float_variant!`,
   `consolidate_string_variant!`, `consolidate_temporal_variant!`
   macros and the `consolidate_*_slices` private functions in
   `super_array_view.rs` are deleted in favour of this layering.
   `SuperArrayV::consolidate` keeps its single-slice/contiguous fast
   paths and otherwise builds `Vec<ArrayVT>` view tuples (no copy) and
   calls `.consolidate()`.

The categorical Consolidate impl preserves the shared-Arc fast path
under `shared_dict` (pure index-buffer concat + reuse snapshot) and
falls back to `Concatenate::concat`-fold for divergent dictionaries
(which already handles the prefix and intern-and-remap paths).

Build clean under default, shared_dict, default_categorical_8 + shared_dict,
extended_categorical + shared_dict, contended_dict, datetime,
extended_numeric_types, and --no-default-features. All 594 default tests
and 599 shared_dict tests pass.
…ic, polars test, default_categorical_8 unit tests

Sites the original PR missed when renaming `unique_values: Vec64<String>`
to `dictionary: Dictionary<T>` on `CategoricalArray<T>`:

- `src/traits/byte_size.rs`: `self.dictionary.values.est_bytes()` was a
  field access against the enum (no such field). Compute the container
  size from `dictionary.values().len() * size_of::<String>()`, matching
  the prior Vec64 capacity-based approximation.
- `src/structs/arena.rs` (feature `arena`): four `c.unique_values.clone()`
  and four `a.unique_values.clone()` reads against `CategoricalArray`
  rewritten to `Vec64::from(...dictionary.values().to_vec())`.
- `src/kernels/arithmetic/string.rs` (feature `str_arithmetic`): one
  `.dictionary.iter()` call with no such method, and an
  `Arc::new(Dictionary::from(...))` wrap that produced the wrong type.
  Fixed to `.dictionary.values().iter()` and to drop the Arc wrap.
  Test assertions using `.unique_values` updated to
  `.dictionary.values()`.
- `tests/polars.rs` (feature `cast_polars`): one residual
  `c.unique_values[...]` indexing rewritten to use `dictionary.values()`.
- `src/structs/field_array.rs` and `src/structs/chunked/super_table.rs`:
  test items referencing `fa_cat32` and `CategoricalIndexType::UInt32`
  gated to the same cfg as the `fa_cat32` macro itself, so unit tests
  build under `default_categorical_8` (where Categorical32 is disabled).
- Trimmed PR-introduced unused imports of `Arc` in
  `src/kernels/arithmetic/string.rs`, `src/kernels/string.rs`,
  `src/structs/variants/string.rs`, and gated the `Arc` import in
  `src/structs/dictionary.rs` to `shared_dict`.

Verified across 15 feature combinations: default, shared_dict,
default_categorical_8+shared_dict, extended_categorical+shared_dict,
contended_dict, datetime, extended_numeric_types, size, arena,
str_arithmetic, cast_arrow, cast_polars, cast_arrow+shared_dict,
cast_polars+shared_dict, arena+shared_dict. All lib unit tests pass on
each (594-641 tests per combo); doc and integration tests pass.
…dd cross-batch roundtrip tests

SuperArray now mirrors SuperTable's shared-dictionary machinery under
`shared_dict`:

- New `pub(crate) category_manager: Option<CategoryDispatch>` field
  (cfg-gated). Initialised in every constructor and `From` impl that
  builds a `SuperArray`.
- `push`, `push_with_null_count`, and `push_field_array` run the chunk
  through `CategoryDispatch::install_from` (first push) or
  `CategoryDispatch::absorb` (subsequent pushes) before storing it.
- `category_dispatch(&self) -> Option<&CategoryDispatch>` exposes the
  manager for live ingestion.
- `rebuild_category_manager()` reabsorbs all existing chunks; called
  from constructors that take a `Vec<Array>` upfront and from the
  `Concatenate::concat` impl.
- `PartialEq` switched from derived to a manual impl that compares
  data + field + null_counts (the derived manager is not part of
  observable equality).

Test: `test_shared_dict_absorb_across_pushes` exercises both the
manager-grows case (snapshot Arc rotates, earlier chunk's dictionary
is a prefix of the later one) and the no-growth case (snapshots are
`Arc::ptr_eq`-equal). Both invariants hold by the append-only design.

README's Feature Flags table gains entries for `shared_dict` and
`contended_dict`.

New cross-batch roundtrip tests:
- `tests/apache_arrow.rs::rt_arrow_super_table_shared_categorical32`:
  exports a two-batch SuperTable through arrow-rs `RecordBatch`,
  re-imports, asserts decoded strings survive verbatim and that the
  rebuilt batches preserve the append-only prefix invariant
  (first batch's dictionary is a prefix of the second's).
- `tests/polars.rs::rt_polars_super_table_shared_categorical32`:
  same flow through Polars `DataFrame`. Polars may emit either
  categorical or string-backed columns; both are accepted and the
  decoded strings are checked.

Verified across 15 feature combinations including `cast_arrow,shared_dict`
and `cast_polars,shared_dict`. All lib + integration + doc tests pass.
…hardedIndex)

Lays down the concurrency primitives for `shared_dict`. The cascade
through the rest of the crate (categorical.rs read/write call sites,
super_table.rs absorb path, super_array.rs absorb, super_array_view.rs
consolidate, kernels, FFI, conversions) is not yet done - this commit is
a savepoint of the infrastructure layer only.

## New: src/structs/append_only_vec.rs

`AppendOnlyVec<T>` - lock-free append-only Vec with stable element
addresses. Used internally by `Dictionary<T>` as the code-indexed value
array.

- Multi-writer concurrent push via per-slot `AtomicBool` init flag +
  global `AtomicUsize` reserved counter.
- `push_bounded(value, max_cap)`: CAS-loop on reserved that refuses
  reservation if `>= max_cap`. Solves the u8-cap-256 racing exactly -
  no leaked slots, no double-assigned codes under any number of
  concurrent writers.
- `push(value)`: unbounded fast path using `fetch_add`.
- `get(idx) -> Option<&T>` and `iter() -> (idx, &T)`: lock-free reads.
  `iter` yields the fully-init prefix; mid-write slots are invisible
  until their writer publishes.
- Bucketed allocation: bucket i holds `16 << i` entries; 28 buckets
  cover ~4B entries. Buckets allocated lazily via CAS, never moved or
  freed during the vector's lifetime - so `&T` borrows survive any
  concurrent push.
- ~400 LOC including tests for cap-respecting concurrent writers,
  stable refs across 200+ pushes spanning bucket boundaries, drop
  semantics for initialised slots only, multi-writer + multi-reader
  stress test.

## Rewritten: src/structs/dictionary.rs

`Dictionary<T>` - cloneable handle around
`Arc<DictionaryInner<T>>` where:

- `values: AppendOnlyVec<String>` - lock-free reads, lock-free
  multi-writer push (CAS-bounded by T::MAX + 1).
- `index: ShardedIndex<T>` - 64-shard `Mutex<HashMap<String, T>>` for
  reverse string-to-code lookup. Distinct strings hash to distinct
  shards under uniform hashing and never contend.

Methods:
- `intern(&self, value) -> Result<T, DictionaryError>`: concurrent-safe
  multi-writer intern. Distinct novel strings spread across shards;
  same-shard novel strings briefly serialise on that shard's mutex.
  Capacity-bounded push prevents leaked slots even when many threads
  contend on a narrow-width dict.
- `values(&self) -> &AppendOnlyVec<String>`: lock-free borrow.
- `lookup`, `len`, `is_empty`, `is_prefix_of`, `shares_with`,
  `detach_to_owned`, `clone` (Arc bump), `from_values`, `from_iter`.
- `absorb(&self, &mut CategoricalArray<T>)`: parent-side absorb path.
- `DictionaryError`: only `Overflow` variant (no Shared variant -
  intern is always allowed and routes through the atomic store).
- `CategoryDispatch`: width-erased enum holding `Dictionary<u8/u16/u32/u64>`
  for SuperTable/SuperArray's per-column manager slot.

Tests: empty starts empty, intern assigns dense codes, clones share
state, detach breaks sharing, prefix check, u8 cap exactly honoured
under sequential intern, u8 cap exactly honoured under 16-thread
concurrent intern with no leaks (verifies sum of successes ==
cap and sum of overflows == total attempts - cap), distinct-string
multi-writer intern produces no duplicates.

## Module gating: src/lib.rs

- `pub mod append_only_vec;` and `pub mod dictionary;` both gated
  under `#[cfg(feature = "shared_dict")]`.
- `pub use structs::dictionary::Dictionary;` similarly gated.

## Cargo.toml

- `shared_dict` feature is now `["chunked"]` - zero external deps.
  Previous arc-swap and parking_lot exploration removed.
- `contended_dict` feature was deleted in earlier WIP; not present here.

## src/structs/variants/categorical.rs

Partial: dictionary field docstring updated to describe new model
(still mentions "Arc<ArcSwap>" from intermediate design - needs
correction to AppendOnlyVec phrasing), `with_dictionary` constructor
takes `Dictionary<T>` directly, internal `demote_to_owned()` calls
removed from all mutator methods. Field is still `dictionary:
Dictionary<T>` always - needs splitting under cfg: `unique_values:
Vec64<String>` without shared_dict, `dictionary: Dictionary<T>` with.
This is the cascade-starting state, not the cascade-completed state.
Rewrites `AppendOnlyVec` as a single contiguous fixed-cap buffer with
lock-free reads (`as_slice` returns `&[T]` over the published prefix
via `Acquire` load on the publish counter) and lock-free writes (CAS
claim of the `reserved` counter + `Release`-store on `published`
after a brief spin-wait for predecessors). Cap is fixed at
construction; cap check happens inside the CAS loop so narrow widths
(`u8 -> 256`) are honoured exactly under any multi-writer contention.
No external dependencies.

`Dictionary::values()` now returns `&[String]` directly (via
`AppendOnlyVec::as_slice`) so call sites can use slice indexing,
iteration, and `.contains(...)` uniformly. `max_cap::<T>()` returns
the type's natural max for narrow widths and a soft default of
`1 << 20` for `u32`/`u64` (preallocating the natural cap would
reserve ~100 GB+ of virtual memory per dictionary and is rejected by
the allocator). Users with larger workloads can pass an explicit cap
via `with_capacity` later.

`CategoricalArray<T>`:
- Cfg-gates the field: `unique_values: Vec64<String>` (pre-PR shape)
  without `shared_dict`, `dictionary: Dictionary<T>` with it.
- `values()` returns `&[String]` in both modes via internal cfg branch.
- Private free fn `intern_into(unique_values, value)` provides the
  no-feature linear-scan-and-push behaviour; the cfg branch at each
  mutator site is a 2-liner (call `intern_into` vs `dictionary.intern`).
- `slice_clone` cfg-branches: clones `unique_values` without the
  feature, Arc-bumps the dictionary with it.
- `Concatenate::concat` cfg-branches into a two-path version (no-feat:
  divergent intern) vs the four-path one under the feature (shares_with
  / prefix-each-way / divergent).
- All `push_str`, `set_str`, `set_str_unchecked`, `set_unchecked`,
  `resize`, `extend_from_iter_with_capacity`, `extend_from_slice`,
  `fill`, `insert_rows` reach `intern_into` (no-feat) or
  `self.dictionary.intern(...)` (feat) via inline cfg.
- SuperArray test updated to the new always-Arc-shared semantics:
  every chunk's dictionary `shares_with` every other; growth visible
  immediately from all clones (no Owned/Shared variants in the new
  design).

Sanity tests pass:
  cargo test --features shared_dict --lib append_only_vec    (8/8)
  cargo test --features shared_dict --lib dictionary         (10/10)

The full crate compiles cleanly under `--features shared_dict`. The
cascade through the remaining files (conversions.rs, kernels/*,
ffi/arrow_c_ffi.rs, traits/byte_size.rs, traits/print.rs, enums/array.rs,
structs/variants/string.rs, structs/arena.rs) for the no-feature path
is the next step.
…ize, print, tests

Threads the cfg-gated `unique_values`/`dictionary` field through every
remaining call site so both feature states build and test cleanly.

External callers route through the cfg-branched `cat.values()` method
(returns `&[String]` in either mode) instead of touching the storage
field directly. Struct-literal constructions inside macros and tests
gain inline cfg branches.

Touched:
- `enums/error.rs`: `DictionaryError` variant and its `From` impl
  gated on `shared_dict` (no dictionary errors without the feature).
- `conversions.rs`: `string_to_cat!`, `cat_to_string!`,
  `cat_to_cat_widen!`, `cat_to_cat_narrow!` macros now cfg-branch
  the field at the construction site. `Dictionary` import gated.
- `kernels/string.rs` and `kernels/arithmetic/string.rs`: read
  sites use `cat.values()`. Struct literals cfg-branch the field.
  `Dictionary` import gated. Helper macro `consolidate_string_variant!`
  (and the test fixtures around it) updated.
- `ffi/arrow_c_ffi.rs`: categorical export reads via `cat.values()`.
- `traits/print.rs`: pretty-print reads via `cat.values()`.
- `traits/byte_size.rs`: `est_bytes` reads dictionary size via
  `cat.values()`; no longer touches the field directly.
- `structs/arena.rs`: arena reads use `c.values()` / `a.values()`.
- `structs/variants/categorical.rs`: all mutator sites (`push_str`,
  `set_str`, `set_str_unchecked`, `set_unchecked`, `resize`,
  `extend_from_iter_with_capacity`, `extend_from_slice`, `fill`,
  `insert_rows`) reach the internal `intern_into` free helper
  (no-feature linear-scan-and-push) or `self.dictionary.intern(...)`
  (with-feature) via an inline cfg block. `slice_clone` and
  `Concatenate::concat` cfg-branch the dictionary handling.
- Tests rewritten to use the public `cat.values()` API where
  possible (so they work in both feature states), or rewritten to
  the new always-Arc-shared semantics under `shared_dict`.

Verified across 15 feature combinations:
  default                            (589)
  shared_dict                        (606)
  default_categorical_8,shared_dict  (595)
  extended_categorical,shared_dict   (606)
  datetime                           (636)
  extended_numeric_types             (602)
  size                               (591)
  arena                              (612)
  str_arithmetic                     (613)
  cast_arrow                         (589)
  cast_polars                        (602)
  cast_arrow,shared_dict             (606 + 39 integration)
  cast_polars,shared_dict            (619 + 37 integration)
  arena,shared_dict                  (629)
  size,shared_dict                   (608)

All passing.
The lock-free contiguous append-only vector is general-purpose and
useful beyond minarrow's categorical dictionary. Lifts it into the
`vec64` crate behind an opt-in `append_only_vec` feature and re-imports
it here. No semantic change - the implementation moves verbatim
(`std`-only, no minarrow-specific imports).

Changes here:
- `shared_dict` now chains in `vec64/append_only_vec`, so enabling the
  minarrow feature pulls in `AppendOnlyVec` automatically.
- `src/structs/append_only_vec.rs` deleted; module declaration removed
  from `lib.rs`.
- `src/structs/dictionary.rs` imports `vec64::AppendOnlyVec` instead.
- `Cargo.toml`: pinned to `vec64 = "0.4.5"` (the version that ships
  `append_only_vec`), with a sibling `path = "../../vec64"` override
  for the in-tree dev cycle - drop the path entry once 0.4.5 is on
  crates.io.

Companion changes in `vec64`:
- `src/append_only_vec.rs` added (verbatim from the minarrow source).
- `Cargo.toml`: new `append_only_vec` feature (no deps); bumped version
  to 0.4.5.
- `src/lib.rs`: `pub mod append_only_vec` + `pub use AppendOnlyVec` both
  cfg-gated on the feature.
- `cargo test --features append_only_vec` runs all 8 sanity tests
  green from vec64.

Verified across the same 15 minarrow feature combinations as the prior
cascade commit (all passing; numbers are 8 lower under `shared_dict`
because the AppendOnlyVec tests now report from `vec64`, not here).
@pbower pbower force-pushed the categorical_dicts branch 3 times, most recently from 76d5bf6 to 2b727d2 Compare May 25, 2026 12:54
@pbower pbower force-pushed the categorical_dicts branch from 2b727d2 to 8432a4e Compare May 25, 2026 13:08
@pbower pbower merged commit 5765d83 into main May 25, 2026
14 of 16 checks passed
@pbower pbower deleted the categorical_dicts branch May 25, 2026 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant