erasure_code: byte-indexed table lookup in *_base hot paths#420
erasure_code: byte-indexed table lookup in *_base hot paths#420scopedog wants to merge 1 commit into
Conversation
|
Take a look at Feb 25 comment from @LokyinZHAO: "Why I call gf_vect_dot_prod directly: ... I’m working on a Hitchhiker-like coding variant where I need to perform linear combinations on a subset of source blocks within a stripe. That requires fine-grained control over per-stripe, per-block dot-product operations that the high-level encode path does not provide (or does not expose with the control granularity I need). For this reason I call the low-level dot_prod routines directly in my experiments." LokyinZHAO's code calls gf_vect_dot_prod directly with calculated pointers to g_tbls assuming 32 bytes between constants, when it used to be 8 bytes between constants for GFNI. The simplest fix for this would have ec_init_tables return the number of bytes between entries in g_tbls so the code can calculate pointers to g_tbls entries. Another alternative would be some high level encode that takes an array of values (used as indexes into g_tbls), like PR #417. Instead @pablodelara - padded the GFNI g_tbls entries with 24 bytes to align them to 32 byte boundaries. Last I looked, the non-SIMD tables were not padded with 31 bytes between entries, and the LokyinZHAO's (and one other user's) code will fail in that case. Note that there is already a define GF_LARGE_TABLES used by ec_base.h that defines a 64K table, where the index for non-SIMD table is (value1 << 8) + value2. |
Replace per-byte gf_mul() calls in the C base path with byte-indexed
per-coefficient table lookups. Each hot-path *_base function builds a
256-byte product table (one coefficient at a time, on a single stack
buffer) and uses single byte-indexed loads in the inner loop instead of
the log+antilog gf_mul path. The routines are allocation-free.
For short regions the table build does not pay off, so each function
falls back to the original per-byte gf_mul() loop when the region length
is below a small threshold (GF_REGION_TBL_MIN_LEN = 256); only longer
regions use the byte-table path. This keeps fine-grained, direct callers
of the low-level functions from regressing.
Functions changed:
- gf_vect_mul_base
- gf_vect_dot_prod_base
- gf_vect_mad_base
- ec_encode_data_base
- ec_encode_data_update_base
Matrix-construction code (gf_gen_rs_matrix, gf_invert_matrix) where
the multiplier varies across the loop is left unchanged; the technique
only applies when the coefficient is fixed, which is the case for all
bulk EC encode/decode work but not for matrix algebra.
Public ABI of ec_init_tables is unchanged, and the coefficient input
format is unchanged (still read at v[j*32 + 1] with 32-byte stride), so
callers using the low-level functions directly are unaffected. The
hot-path lookup is exposed as a private macro GF_REGION_MUL(tbl, x) in
ec_base.h so other internal files can use the same pattern.
Measured ~3.0-4.4x speedup over the original gf_mul-based loops on
i5-1340P (Raptor Lake-P) across (k+m) configurations from 4+2 to
16+4 and stripe sizes from 4 KiB to 1 MiB.
Verified:
- erasure_code_base_test, gf_vect_dot_prod_base_test,
gf_vect_mul_base_test, erasure_code_test, erasure_code_update_test,
gf_inverse_test all pass
- byte-compared against a gf_mul reference across region lengths
straddling the threshold (len = 1..8192) and (k,m) from 2+1 to 16+6,
0 mismatches
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
4cf9c9a to
865ad8a
Compare
|
Thanks @jeffareid for the careful review and the pointer to #334. Though I'm not sure I've fully addressed LokyinZHAO's concern, here's my 1. Removed the allocation. The first version of 2. Added a small-length fallback. For short regions the 256-entry The input format is unchanged — the functions still read coefficients at I chose the 256 threshold conservatively (a region should be at least as |
|
@scopedog - If the non-SIMD g_tbls version is also padded to 32 bytes between entries, then g_tbls grows from 256 bytes to 8192 bytes. I don't know how GF_LARGE_TABLES would be handled by @LokyinZHAO's code. |
|
@jeffareid — I think there may be a mix-up between two separate tables, so let me clarify what this PR does and doesn't touch. This PR doesn't change It's also worth noting that On So this change is orthogonal to the |
|
@jeffareid — fair point, and I agree a stride reported by |
Summary
The C base (non-SIMD) erasure-code path multiplies each input byte by a
fixed GF(2^8) coefficient via
gf_mul()— a log + antilog table pair perbyte. Since the coefficient is constant across a region, this can be
replaced by a single 256-entry product table per coefficient, built once
on function entry, with a single byte-indexed load in the inner loop.
This is a pure software optimization of the fallback path — no new ISA
requirements, no public ABI change.
Functions changed
gf_vect_mul_basegf_vect_dot_prod_basegf_vect_mad_baseec_encode_data_baseec_encode_data_update_baseMatrix-construction helpers (
gf_gen_rs_matrix,gf_invert_matrix) areleft unchanged: the multiplier varies within those loops, so a
per-coefficient table does not apply.
Compatibility
ec_init_tablesand all exported functions unchanged.GF_REGION_MUL(tbl, x)inerasure_code/ec_base.hso other internal files can reuse the pattern.erasure_code/ec_base.canderasure_code/ec_base.h(+112 / -27).
Performance
~3.0–4.4× throughput over the existing
gf_mul-based base loops,measured on an i5-1340P (Raptor Lake-P), single-thread, across (k+m)
configurations from 4+2 to 16+4 and stripe sizes from 4 KiB to 1 MiB.
Only the base/fallback path is affected; the SIMD dispatch path is not.
Testing
erasure_code_test,erasure_code_update_test,gf_inverse_testpass on Zen 3 and Raptor Lake.
gf_mul-basedreference — 0 mismatches on both hosts.
Commit carries
Signed-off-by:per the DCO.