Support shared categorical dictionaries#100
Merged
Merged
Conversation
Before this change, every CategoricalArray held its own Vec64<String> dictionary. Different batches for the same logical column ended up with independent dictionaries, so code 3 in batch 1 could mean a different string from code 3 in batch 17. Any operation comparing codes across batches (group-by, hash-join, filter) had to materialise to strings or remap on the fly, paying that cost on every operation. The dictionary now has two modes. Standalone categoricals carry their own Owned dictionary and mutate freely as before. Categoricals held inside a SuperTable, SuperArray, or peer-sync set instead carry a Shared snapshot of a dictionary owned by the parent's CategoryManager. All batches in the same SuperTable see the same codes, and the cost of keeping them aligned is paid once at push time rather than inside every downstream kernel. Growth on a Shared dictionary cannot happen from a single batch in isolation, because that would silently desynchronise it from its siblings. Mutators on a Shared categorical instead demote the dictionary to Owned in place and log a warning, so the user can audit where their categorical stopped sharing with the parent. The recommended path for live ingestion is to route the new value through the parent's intern method, which extends the shared dictionary once and leaves every sibling snapshot coherent under the append-only invariant. Reads are identical in both modes and do not touch any lock. The default storage on the parent side is Mutex<Arc<DictionaryInner<T>>> which holds the lock only on growth events; the contended_dict feature swaps that for arc_swap::ArcSwap with a CAS-loop write path for workloads where concurrent novel-value insertion is a hot path. Arrow round-trips are byte-identical. The categorical tests in tests/arrow_c_integration and tests/apache_arrow pass unchanged under the default, default_categorical_8, extended_categorical, and contended_dict feature combinations.
…r Consolidate on typed arrays Splits the categorical dictionary work into two parts: 1. **Shared-dictionary surface is now opt-in via `shared_dict`** (chains into `contended_dict` for the ArcSwap backend). Without the feature every `CategoricalArray<T>` owns its dictionary outright and codes are not mutually meaningful across batches. Gated behind it: `Dictionary::Shared`, `DictionaryError::Shared`, `CategoryManager`, `CategoryDispatch`, `CategoricalArray::with_dictionary`, `SuperTable.category_managers`, `SuperTable::category_dispatch`, and the absorb path. `CategoryDispatch` exposes `install_from` / `absorb` methods that exhaustive-match the variant once and delegate to `Dictionary::to_inner` and `CategoryManager::absorb`, replacing the TypeId/transmute helper layer. 2. **`Consolidate` is now implemented on each typed array** via `Vec<TypedAVT<'a, T>>` (the existing AVT view tuples in aliases.rs): `IntegerAVT`, `FloatAVT`, `StringAVT`, `BooleanAVT`, `DatetimeAVT`, `CategoricalAVT`. Each impl reads windows directly from the source buffer - one copy per chunk, same algorithm as the old per-variant macros. `Vec<ArrayVT<'a>>::consolidate` exhaustively dispatches on the first chunk's variant and gathers typed AVTs. The `consolidate_int_variant!`, `consolidate_float_variant!`, `consolidate_string_variant!`, `consolidate_temporal_variant!` macros and the `consolidate_*_slices` private functions in `super_array_view.rs` are deleted in favour of this layering. `SuperArrayV::consolidate` keeps its single-slice/contiguous fast paths and otherwise builds `Vec<ArrayVT>` view tuples (no copy) and calls `.consolidate()`. The categorical Consolidate impl preserves the shared-Arc fast path under `shared_dict` (pure index-buffer concat + reuse snapshot) and falls back to `Concatenate::concat`-fold for divergent dictionaries (which already handles the prefix and intern-and-remap paths). Build clean under default, shared_dict, default_categorical_8 + shared_dict, extended_categorical + shared_dict, contended_dict, datetime, extended_numeric_types, and --no-default-features. All 594 default tests and 599 shared_dict tests pass.
…ic, polars test, default_categorical_8 unit tests Sites the original PR missed when renaming `unique_values: Vec64<String>` to `dictionary: Dictionary<T>` on `CategoricalArray<T>`: - `src/traits/byte_size.rs`: `self.dictionary.values.est_bytes()` was a field access against the enum (no such field). Compute the container size from `dictionary.values().len() * size_of::<String>()`, matching the prior Vec64 capacity-based approximation. - `src/structs/arena.rs` (feature `arena`): four `c.unique_values.clone()` and four `a.unique_values.clone()` reads against `CategoricalArray` rewritten to `Vec64::from(...dictionary.values().to_vec())`. - `src/kernels/arithmetic/string.rs` (feature `str_arithmetic`): one `.dictionary.iter()` call with no such method, and an `Arc::new(Dictionary::from(...))` wrap that produced the wrong type. Fixed to `.dictionary.values().iter()` and to drop the Arc wrap. Test assertions using `.unique_values` updated to `.dictionary.values()`. - `tests/polars.rs` (feature `cast_polars`): one residual `c.unique_values[...]` indexing rewritten to use `dictionary.values()`. - `src/structs/field_array.rs` and `src/structs/chunked/super_table.rs`: test items referencing `fa_cat32` and `CategoricalIndexType::UInt32` gated to the same cfg as the `fa_cat32` macro itself, so unit tests build under `default_categorical_8` (where Categorical32 is disabled). - Trimmed PR-introduced unused imports of `Arc` in `src/kernels/arithmetic/string.rs`, `src/kernels/string.rs`, `src/structs/variants/string.rs`, and gated the `Arc` import in `src/structs/dictionary.rs` to `shared_dict`. Verified across 15 feature combinations: default, shared_dict, default_categorical_8+shared_dict, extended_categorical+shared_dict, contended_dict, datetime, extended_numeric_types, size, arena, str_arithmetic, cast_arrow, cast_polars, cast_arrow+shared_dict, cast_polars+shared_dict, arena+shared_dict. All lib unit tests pass on each (594-641 tests per combo); doc and integration tests pass.
…dd cross-batch roundtrip tests SuperArray now mirrors SuperTable's shared-dictionary machinery under `shared_dict`: - New `pub(crate) category_manager: Option<CategoryDispatch>` field (cfg-gated). Initialised in every constructor and `From` impl that builds a `SuperArray`. - `push`, `push_with_null_count`, and `push_field_array` run the chunk through `CategoryDispatch::install_from` (first push) or `CategoryDispatch::absorb` (subsequent pushes) before storing it. - `category_dispatch(&self) -> Option<&CategoryDispatch>` exposes the manager for live ingestion. - `rebuild_category_manager()` reabsorbs all existing chunks; called from constructors that take a `Vec<Array>` upfront and from the `Concatenate::concat` impl. - `PartialEq` switched from derived to a manual impl that compares data + field + null_counts (the derived manager is not part of observable equality). Test: `test_shared_dict_absorb_across_pushes` exercises both the manager-grows case (snapshot Arc rotates, earlier chunk's dictionary is a prefix of the later one) and the no-growth case (snapshots are `Arc::ptr_eq`-equal). Both invariants hold by the append-only design. README's Feature Flags table gains entries for `shared_dict` and `contended_dict`. New cross-batch roundtrip tests: - `tests/apache_arrow.rs::rt_arrow_super_table_shared_categorical32`: exports a two-batch SuperTable through arrow-rs `RecordBatch`, re-imports, asserts decoded strings survive verbatim and that the rebuilt batches preserve the append-only prefix invariant (first batch's dictionary is a prefix of the second's). - `tests/polars.rs::rt_polars_super_table_shared_categorical32`: same flow through Polars `DataFrame`. Polars may emit either categorical or string-backed columns; both are accepted and the decoded strings are checked. Verified across 15 feature combinations including `cast_arrow,shared_dict` and `cast_polars,shared_dict`. All lib + integration + doc tests pass.
…hardedIndex) Lays down the concurrency primitives for `shared_dict`. The cascade through the rest of the crate (categorical.rs read/write call sites, super_table.rs absorb path, super_array.rs absorb, super_array_view.rs consolidate, kernels, FFI, conversions) is not yet done - this commit is a savepoint of the infrastructure layer only. ## New: src/structs/append_only_vec.rs `AppendOnlyVec<T>` - lock-free append-only Vec with stable element addresses. Used internally by `Dictionary<T>` as the code-indexed value array. - Multi-writer concurrent push via per-slot `AtomicBool` init flag + global `AtomicUsize` reserved counter. - `push_bounded(value, max_cap)`: CAS-loop on reserved that refuses reservation if `>= max_cap`. Solves the u8-cap-256 racing exactly - no leaked slots, no double-assigned codes under any number of concurrent writers. - `push(value)`: unbounded fast path using `fetch_add`. - `get(idx) -> Option<&T>` and `iter() -> (idx, &T)`: lock-free reads. `iter` yields the fully-init prefix; mid-write slots are invisible until their writer publishes. - Bucketed allocation: bucket i holds `16 << i` entries; 28 buckets cover ~4B entries. Buckets allocated lazily via CAS, never moved or freed during the vector's lifetime - so `&T` borrows survive any concurrent push. - ~400 LOC including tests for cap-respecting concurrent writers, stable refs across 200+ pushes spanning bucket boundaries, drop semantics for initialised slots only, multi-writer + multi-reader stress test. ## Rewritten: src/structs/dictionary.rs `Dictionary<T>` - cloneable handle around `Arc<DictionaryInner<T>>` where: - `values: AppendOnlyVec<String>` - lock-free reads, lock-free multi-writer push (CAS-bounded by T::MAX + 1). - `index: ShardedIndex<T>` - 64-shard `Mutex<HashMap<String, T>>` for reverse string-to-code lookup. Distinct strings hash to distinct shards under uniform hashing and never contend. Methods: - `intern(&self, value) -> Result<T, DictionaryError>`: concurrent-safe multi-writer intern. Distinct novel strings spread across shards; same-shard novel strings briefly serialise on that shard's mutex. Capacity-bounded push prevents leaked slots even when many threads contend on a narrow-width dict. - `values(&self) -> &AppendOnlyVec<String>`: lock-free borrow. - `lookup`, `len`, `is_empty`, `is_prefix_of`, `shares_with`, `detach_to_owned`, `clone` (Arc bump), `from_values`, `from_iter`. - `absorb(&self, &mut CategoricalArray<T>)`: parent-side absorb path. - `DictionaryError`: only `Overflow` variant (no Shared variant - intern is always allowed and routes through the atomic store). - `CategoryDispatch`: width-erased enum holding `Dictionary<u8/u16/u32/u64>` for SuperTable/SuperArray's per-column manager slot. Tests: empty starts empty, intern assigns dense codes, clones share state, detach breaks sharing, prefix check, u8 cap exactly honoured under sequential intern, u8 cap exactly honoured under 16-thread concurrent intern with no leaks (verifies sum of successes == cap and sum of overflows == total attempts - cap), distinct-string multi-writer intern produces no duplicates. ## Module gating: src/lib.rs - `pub mod append_only_vec;` and `pub mod dictionary;` both gated under `#[cfg(feature = "shared_dict")]`. - `pub use structs::dictionary::Dictionary;` similarly gated. ## Cargo.toml - `shared_dict` feature is now `["chunked"]` - zero external deps. Previous arc-swap and parking_lot exploration removed. - `contended_dict` feature was deleted in earlier WIP; not present here. ## src/structs/variants/categorical.rs Partial: dictionary field docstring updated to describe new model (still mentions "Arc<ArcSwap>" from intermediate design - needs correction to AppendOnlyVec phrasing), `with_dictionary` constructor takes `Dictionary<T>` directly, internal `demote_to_owned()` calls removed from all mutator methods. Field is still `dictionary: Dictionary<T>` always - needs splitting under cfg: `unique_values: Vec64<String>` without shared_dict, `dictionary: Dictionary<T>` with. This is the cascade-starting state, not the cascade-completed state.
Rewrites `AppendOnlyVec` as a single contiguous fixed-cap buffer with lock-free reads (`as_slice` returns `&[T]` over the published prefix via `Acquire` load on the publish counter) and lock-free writes (CAS claim of the `reserved` counter + `Release`-store on `published` after a brief spin-wait for predecessors). Cap is fixed at construction; cap check happens inside the CAS loop so narrow widths (`u8 -> 256`) are honoured exactly under any multi-writer contention. No external dependencies. `Dictionary::values()` now returns `&[String]` directly (via `AppendOnlyVec::as_slice`) so call sites can use slice indexing, iteration, and `.contains(...)` uniformly. `max_cap::<T>()` returns the type's natural max for narrow widths and a soft default of `1 << 20` for `u32`/`u64` (preallocating the natural cap would reserve ~100 GB+ of virtual memory per dictionary and is rejected by the allocator). Users with larger workloads can pass an explicit cap via `with_capacity` later. `CategoricalArray<T>`: - Cfg-gates the field: `unique_values: Vec64<String>` (pre-PR shape) without `shared_dict`, `dictionary: Dictionary<T>` with it. - `values()` returns `&[String]` in both modes via internal cfg branch. - Private free fn `intern_into(unique_values, value)` provides the no-feature linear-scan-and-push behaviour; the cfg branch at each mutator site is a 2-liner (call `intern_into` vs `dictionary.intern`). - `slice_clone` cfg-branches: clones `unique_values` without the feature, Arc-bumps the dictionary with it. - `Concatenate::concat` cfg-branches into a two-path version (no-feat: divergent intern) vs the four-path one under the feature (shares_with / prefix-each-way / divergent). - All `push_str`, `set_str`, `set_str_unchecked`, `set_unchecked`, `resize`, `extend_from_iter_with_capacity`, `extend_from_slice`, `fill`, `insert_rows` reach `intern_into` (no-feat) or `self.dictionary.intern(...)` (feat) via inline cfg. - SuperArray test updated to the new always-Arc-shared semantics: every chunk's dictionary `shares_with` every other; growth visible immediately from all clones (no Owned/Shared variants in the new design). Sanity tests pass: cargo test --features shared_dict --lib append_only_vec (8/8) cargo test --features shared_dict --lib dictionary (10/10) The full crate compiles cleanly under `--features shared_dict`. The cascade through the remaining files (conversions.rs, kernels/*, ffi/arrow_c_ffi.rs, traits/byte_size.rs, traits/print.rs, enums/array.rs, structs/variants/string.rs, structs/arena.rs) for the no-feature path is the next step.
…ize, print, tests Threads the cfg-gated `unique_values`/`dictionary` field through every remaining call site so both feature states build and test cleanly. External callers route through the cfg-branched `cat.values()` method (returns `&[String]` in either mode) instead of touching the storage field directly. Struct-literal constructions inside macros and tests gain inline cfg branches. Touched: - `enums/error.rs`: `DictionaryError` variant and its `From` impl gated on `shared_dict` (no dictionary errors without the feature). - `conversions.rs`: `string_to_cat!`, `cat_to_string!`, `cat_to_cat_widen!`, `cat_to_cat_narrow!` macros now cfg-branch the field at the construction site. `Dictionary` import gated. - `kernels/string.rs` and `kernels/arithmetic/string.rs`: read sites use `cat.values()`. Struct literals cfg-branch the field. `Dictionary` import gated. Helper macro `consolidate_string_variant!` (and the test fixtures around it) updated. - `ffi/arrow_c_ffi.rs`: categorical export reads via `cat.values()`. - `traits/print.rs`: pretty-print reads via `cat.values()`. - `traits/byte_size.rs`: `est_bytes` reads dictionary size via `cat.values()`; no longer touches the field directly. - `structs/arena.rs`: arena reads use `c.values()` / `a.values()`. - `structs/variants/categorical.rs`: all mutator sites (`push_str`, `set_str`, `set_str_unchecked`, `set_unchecked`, `resize`, `extend_from_iter_with_capacity`, `extend_from_slice`, `fill`, `insert_rows`) reach the internal `intern_into` free helper (no-feature linear-scan-and-push) or `self.dictionary.intern(...)` (with-feature) via an inline cfg block. `slice_clone` and `Concatenate::concat` cfg-branch the dictionary handling. - Tests rewritten to use the public `cat.values()` API where possible (so they work in both feature states), or rewritten to the new always-Arc-shared semantics under `shared_dict`. Verified across 15 feature combinations: default (589) shared_dict (606) default_categorical_8,shared_dict (595) extended_categorical,shared_dict (606) datetime (636) extended_numeric_types (602) size (591) arena (612) str_arithmetic (613) cast_arrow (589) cast_polars (602) cast_arrow,shared_dict (606 + 39 integration) cast_polars,shared_dict (619 + 37 integration) arena,shared_dict (629) size,shared_dict (608) All passing.
The lock-free contiguous append-only vector is general-purpose and useful beyond minarrow's categorical dictionary. Lifts it into the `vec64` crate behind an opt-in `append_only_vec` feature and re-imports it here. No semantic change - the implementation moves verbatim (`std`-only, no minarrow-specific imports). Changes here: - `shared_dict` now chains in `vec64/append_only_vec`, so enabling the minarrow feature pulls in `AppendOnlyVec` automatically. - `src/structs/append_only_vec.rs` deleted; module declaration removed from `lib.rs`. - `src/structs/dictionary.rs` imports `vec64::AppendOnlyVec` instead. - `Cargo.toml`: pinned to `vec64 = "0.4.5"` (the version that ships `append_only_vec`), with a sibling `path = "../../vec64"` override for the in-tree dev cycle - drop the path entry once 0.4.5 is on crates.io. Companion changes in `vec64`: - `src/append_only_vec.rs` added (verbatim from the minarrow source). - `Cargo.toml`: new `append_only_vec` feature (no deps); bumped version to 0.4.5. - `src/lib.rs`: `pub mod append_only_vec` + `pub use AppendOnlyVec` both cfg-gated on the feature. - `cargo test --features append_only_vec` runs all 8 sanity tests green from vec64. Verified across the same 15 minarrow feature combinations as the prior cascade commit (all passing; numbers are 8 lower under `shared_dict` because the AppendOnlyVec tests now report from `vec64`, not here).
76d5bf6 to
2b727d2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This feature supports the case where one is working with a Categorical array, and it has a set of mappings as to which integers correspond to which strings (dictionary).
This feature adds support for a standalone dictionary that acts as a common source of truth.