erasure_code: byte-indexed table lookup in *_base hot paths by scopedog · Pull Request #420 · intel/isa-l

scopedog · 2026-06-03T18:20:13Z

Summary

The C base (non-SIMD) erasure-code path multiplies each input byte by a
fixed GF(2^8) coefficient via gf_mul() — a log + antilog table pair per
byte. Since the coefficient is constant across a region, this can be
replaced by a single 256-entry product table per coefficient, built once
on function entry, with a single byte-indexed load in the inner loop.

This is a pure software optimization of the fallback path — no new ISA
requirements, no public ABI change.

Functions changed

gf_vect_mul_base
gf_vect_dot_prod_base
gf_vect_mad_base
ec_encode_data_base
ec_encode_data_update_base

Matrix-construction helpers (gf_gen_rs_matrix, gf_invert_matrix) are
left unchanged: the multiplier varies within those loops, so a
per-coefficient table does not apply.

Compatibility

Public ABI of ec_init_tables and all exported functions unchanged.
The lookup is factored into a private macro GF_REGION_MUL(tbl, x) in
erasure_code/ec_base.h so other internal files can reuse the pattern.
Touches only erasure_code/ec_base.c and erasure_code/ec_base.h
(+112 / -27).

Performance

~3.0–4.4× throughput over the existing gf_mul-based base loops,
measured on an i5-1340P (Raptor Lake-P), single-thread, across (k+m)
configurations from 4+2 to 16+4 and stripe sizes from 4 KiB to 1 MiB.
Only the base/fallback path is affected; the SIMD dispatch path is not.

Testing

erasure_code_test, erasure_code_update_test, gf_inverse_test
pass on Zen 3 and Raptor Lake.
72 (config × size) combinations byte-compared against a gf_mul-based
reference — 0 mismatches on both hosts.

Commit carries Signed-off-by: per the DCO.

jeffareid · 2026-06-03T20:02:13Z

Take a look at Feb 25 comment from @LokyinZHAO:

#334

"Why I call gf_vect_dot_prod directly: ... I’m working on a Hitchhiker-like coding variant where I need to perform linear combinations on a subset of source blocks within a stripe. That requires fine-grained control over per-stripe, per-block dot-product operations that the high-level encode path does not provide (or does not expose with the control granularity I need). For this reason I call the low-level dot_prod routines directly in my experiments."

LokyinZHAO's code calls gf_vect_dot_prod directly with calculated pointers to g_tbls assuming 32 bytes between constants, when it used to be 8 bytes between constants for GFNI. The simplest fix for this would have ec_init_tables return the number of bytes between entries in g_tbls so the code can calculate pointers to g_tbls entries. Another alternative would be some high level encode that takes an array of values (used as indexes into g_tbls), like PR #417.

Instead @pablodelara - padded the GFNI g_tbls entries with 24 bytes to align them to 32 byte boundaries. Last I looked, the non-SIMD tables were not padded with 31 bytes between entries, and the LokyinZHAO's (and one other user's) code will fail in that case.

Note that there is already a define GF_LARGE_TABLES used by ec_base.h that defines a 64K table, where the index for non-SIMD table is (value1 << 8) + value2.

Replace per-byte gf_mul() calls in the C base path with byte-indexed per-coefficient table lookups. Each hot-path *_base function builds a 256-byte product table (one coefficient at a time, on a single stack buffer) and uses single byte-indexed loads in the inner loop instead of the log+antilog gf_mul path. The routines are allocation-free. For short regions the table build does not pay off, so each function falls back to the original per-byte gf_mul() loop when the region length is below a small threshold (GF_REGION_TBL_MIN_LEN = 256); only longer regions use the byte-table path. This keeps fine-grained, direct callers of the low-level functions from regressing. Functions changed: - gf_vect_mul_base - gf_vect_dot_prod_base - gf_vect_mad_base - ec_encode_data_base - ec_encode_data_update_base Matrix-construction code (gf_gen_rs_matrix, gf_invert_matrix) where the multiplier varies across the loop is left unchanged; the technique only applies when the coefficient is fixed, which is the case for all bulk EC encode/decode work but not for matrix algebra. Public ABI of ec_init_tables is unchanged, and the coefficient input format is unchanged (still read at v[j*32 + 1] with 32-byte stride), so callers using the low-level functions directly are unaffected. The hot-path lookup is exposed as a private macro GF_REGION_MUL(tbl, x) in ec_base.h so other internal files can use the same pattern. Measured ~3.0-4.4x speedup over the original gf_mul-based loops on i5-1340P (Raptor Lake-P) across (k+m) configurations from 4+2 to 16+4 and stripe sizes from 4 KiB to 1 MiB. Verified: - erasure_code_base_test, gf_vect_dot_prod_base_test, gf_vect_mul_base_test, erasure_code_test, erasure_code_update_test, gf_inverse_test all pass - byte-compared against a gf_mul reference across region lengths straddling the threshold (len = 1..8192) and (k,m) from 2+1 to 16+6, 0 mismatches Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>

scopedog · 2026-06-03T21:51:59Z

Thanks @jeffareid for the careful review and the pointer to #334.

Though I'm not sure I've fully addressed LokyinZHAO's concern, here's my
comment and what I've changed in response:

1. Removed the allocation. The first version of
gf_vect_dot_prod_base / ec_encode_data_base built the per-coefficient
tables with a malloc, which is a poor choice for a leaf routine. Both
functions process one source at a time, so they now rebuild a single
256-byte stack table per source — there is no heap allocation anywhere in
ec_base.c now.

2. Added a small-length fallback. For short regions the 256-entry
table build does not pay off, and it would penalize the fine-grained,
per-block calls described in #334. Each *_base function now falls back to
the original per-byte gf_mul() loop when len < 256, and only uses the
byte-indexed table for longer regions where the build amortizes.

The input format is unchanged — the functions still read coefficients at
v[j*32 + 1] with the same 32-byte stride — so code that calls these
directly with hand-constructed g_tbls pointers behaves exactly as
before, just faster for len >= 256.

I chose the 256 threshold conservatively (a region should be at least as
long as the table it would build). Happy to tune it, or to revisit the
approach entirely, if there's a better crossover for the direct-call
patterns you and @LokyinZHAO have in mind.

jeffareid · 2026-06-04T07:56:04Z

@scopedog - If the non-SIMD g_tbls version is also padded to 32 bytes between entries, then g_tbls grows from 256 bytes to 8192 bytes. I don't know how GF_LARGE_TABLES would be handled by @LokyinZHAO's code.

scopedog · 2026-06-04T13:09:58Z

@jeffareid — I think there may be a mix-up between two separate tables, so let me clarify what this PR does and doesn't touch.

This PR doesn't change g_tbls or ec_init_tables at all. Every *_base function still reads its coefficients from g_tbls at the same v[j*32 + … + 1] offsets as upstream — same 32-byte stride, same offset. The only new table is a transient 256-byte tbl[256] on the stack, built per source and thrown away. It's not stored in g_tbls and isn't visible to any caller, so there's no padding change and no growth to 8192 bytes. Code like @LokyinZHAO's that walks g_tbls with computed pointers sees byte-for-byte the same layout; it just runs faster for len >= 256.

It's also worth noting that g_tbls is built by gf_vect_mul_init_base, which derives each 32-byte entry directly from the coefficient (the (c << 1) ^ 0x1d doubling trick) — it never calls gf_mul(), so g_tbls doesn't depend on the global multiply tables at all.

On GF_LARGE_TABLES: it only selects the internal gf_mul/gf_inv implementation over static tables — it doesn't appear in g_tbls and isn't part of the public ABI, so there's nothing in it for direct g_tbls callers to handle, and this PR doesn't change that either way. build_mul_tbl() fills its 256-byte table by calling gf_mul() — the exact same gf_mul() the previous per-byte loop used — so whichever backend GF_LARGE_TABLES selects is inherited transparently, with no special-casing.

So this change is orthogonal to the g_tbls stride question in #334 — it runs entirely after g_tbls is constructed and reads it unchanged. If we do want to address the direct-call stride/padding issue (e.g. having ec_init_tables report the entry stride), I'm happy to help, but I'd suggest that as a separate PR since it's an API question independent of this fallback-path optimization.

jeffareid · 2026-06-04T16:17:05Z

@scopedog - thanks for clarifying. As for PR #334 - I would have preferred having ec_init_tables report the entry stride rather than padding and having user code assume a hard value for the stride, but it is not my call.

scopedog · 2026-06-04T16:41:13Z

@jeffareid — fair point, and I agree a stride reported by ec_init_tables would be a nicer, more future-proof contract than assuming 32. Feels like it belongs with the broader #334 discussion rather than this change, so I'd lean toward leaving it for the maintainers there. Glad you raised it, and thanks for the careful review.

scopedog force-pushed the upstream-bytetable branch from 4cf9c9a to 865ad8a Compare June 3, 2026 21:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

erasure_code: byte-indexed table lookup in *_base hot paths#420

erasure_code: byte-indexed table lookup in *_base hot paths#420
scopedog wants to merge 1 commit into
intel:masterfrom
scopedog:upstream-bytetable

scopedog commented Jun 3, 2026

Uh oh!

jeffareid commented Jun 3, 2026 •

edited

Loading

Uh oh!

scopedog commented Jun 3, 2026

Uh oh!

jeffareid commented Jun 4, 2026

Uh oh!

scopedog commented Jun 4, 2026

Uh oh!

jeffareid commented Jun 4, 2026 •

edited

Loading

Uh oh!

scopedog commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

scopedog commented Jun 3, 2026

Summary

Functions changed

Compatibility

Performance

Testing

Uh oh!

jeffareid commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scopedog commented Jun 3, 2026

Uh oh!

jeffareid commented Jun 4, 2026

Uh oh!

scopedog commented Jun 4, 2026

Uh oh!

jeffareid commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scopedog commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jeffareid commented Jun 3, 2026 •

edited

Loading

jeffareid commented Jun 4, 2026 •

edited

Loading