Skip to content

erasure_code: byte-indexed table lookup in *_base hot paths#420

Open
scopedog wants to merge 1 commit into
intel:masterfrom
scopedog:upstream-bytetable
Open

erasure_code: byte-indexed table lookup in *_base hot paths#420
scopedog wants to merge 1 commit into
intel:masterfrom
scopedog:upstream-bytetable

Conversation

@scopedog

@scopedog scopedog commented Jun 3, 2026

Copy link
Copy Markdown

Summary

The C base (non-SIMD) erasure-code path multiplies each input byte by a
fixed GF(2^8) coefficient via gf_mul() — a log + antilog table pair per
byte. Since the coefficient is constant across a region, this can be
replaced by a single 256-entry product table per coefficient, built once
on function entry, with a single byte-indexed load in the inner loop.

This is a pure software optimization of the fallback path — no new ISA
requirements, no public ABI change.

Functions changed

  • gf_vect_mul_base
  • gf_vect_dot_prod_base
  • gf_vect_mad_base
  • ec_encode_data_base
  • ec_encode_data_update_base

Matrix-construction helpers (gf_gen_rs_matrix, gf_invert_matrix) are
left unchanged: the multiplier varies within those loops, so a
per-coefficient table does not apply.

Compatibility

  • Public ABI of ec_init_tables and all exported functions unchanged.
  • The lookup is factored into a private macro GF_REGION_MUL(tbl, x) in
    erasure_code/ec_base.h so other internal files can reuse the pattern.
  • Touches only erasure_code/ec_base.c and erasure_code/ec_base.h
    (+112 / -27).

Performance

~3.0–4.4× throughput over the existing gf_mul-based base loops,
measured on an i5-1340P (Raptor Lake-P), single-thread, across (k+m)
configurations from 4+2 to 16+4 and stripe sizes from 4 KiB to 1 MiB.
Only the base/fallback path is affected; the SIMD dispatch path is not.

Testing

  • erasure_code_test, erasure_code_update_test, gf_inverse_test
    pass on Zen 3 and Raptor Lake.
  • 72 (config × size) combinations byte-compared against a gf_mul-based
    reference — 0 mismatches on both hosts.

Commit carries Signed-off-by: per the DCO.

@jeffareid

jeffareid commented Jun 3, 2026

Copy link
Copy Markdown

Take a look at Feb 25 comment from @LokyinZHAO:

#334

"Why I call gf_vect_dot_prod directly: ... I’m working on a Hitchhiker-like coding variant where I need to perform linear combinations on a subset of source blocks within a stripe. That requires fine-grained control over per-stripe, per-block dot-product operations that the high-level encode path does not provide (or does not expose with the control granularity I need). For this reason I call the low-level dot_prod routines directly in my experiments."

LokyinZHAO's code calls gf_vect_dot_prod directly with calculated pointers to g_tbls assuming 32 bytes between constants, when it used to be 8 bytes between constants for GFNI. The simplest fix for this would have ec_init_tables return the number of bytes between entries in g_tbls so the code can calculate pointers to g_tbls entries. Another alternative would be some high level encode that takes an array of values (used as indexes into g_tbls), like PR #417.

Instead @pablodelara - padded the GFNI g_tbls entries with 24 bytes to align them to 32 byte boundaries. Last I looked, the non-SIMD tables were not padded with 31 bytes between entries, and the LokyinZHAO's (and one other user's) code will fail in that case.

Note that there is already a define GF_LARGE_TABLES used by ec_base.h that defines a 64K table, where the index for non-SIMD table is (value1 << 8) + value2.

Replace per-byte gf_mul() calls in the C base path with byte-indexed
per-coefficient table lookups. Each hot-path *_base function builds a
256-byte product table (one coefficient at a time, on a single stack
buffer) and uses single byte-indexed loads in the inner loop instead of
the log+antilog gf_mul path. The routines are allocation-free.

For short regions the table build does not pay off, so each function
falls back to the original per-byte gf_mul() loop when the region length
is below a small threshold (GF_REGION_TBL_MIN_LEN = 256); only longer
regions use the byte-table path. This keeps fine-grained, direct callers
of the low-level functions from regressing.

Functions changed:
  - gf_vect_mul_base
  - gf_vect_dot_prod_base
  - gf_vect_mad_base
  - ec_encode_data_base
  - ec_encode_data_update_base

Matrix-construction code (gf_gen_rs_matrix, gf_invert_matrix) where
the multiplier varies across the loop is left unchanged; the technique
only applies when the coefficient is fixed, which is the case for all
bulk EC encode/decode work but not for matrix algebra.

Public ABI of ec_init_tables is unchanged, and the coefficient input
format is unchanged (still read at v[j*32 + 1] with 32-byte stride), so
callers using the low-level functions directly are unaffected. The
hot-path lookup is exposed as a private macro GF_REGION_MUL(tbl, x) in
ec_base.h so other internal files can use the same pattern.

Measured ~3.0-4.4x speedup over the original gf_mul-based loops on
i5-1340P (Raptor Lake-P) across (k+m) configurations from 4+2 to
16+4 and stripe sizes from 4 KiB to 1 MiB.

Verified:
  - erasure_code_base_test, gf_vect_dot_prod_base_test,
    gf_vect_mul_base_test, erasure_code_test, erasure_code_update_test,
    gf_inverse_test all pass
  - byte-compared against a gf_mul reference across region lengths
    straddling the threshold (len = 1..8192) and (k,m) from 2+1 to 16+6,
    0 mismatches

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Hiroshi Nishida <nishidafmly@gmail.com>
@scopedog scopedog force-pushed the upstream-bytetable branch from 4cf9c9a to 865ad8a Compare June 3, 2026 21:51
@scopedog

scopedog commented Jun 3, 2026

Copy link
Copy Markdown
Author

Thanks @jeffareid for the careful review and the pointer to #334.

Though I'm not sure I've fully addressed LokyinZHAO's concern, here's my
comment and what I've changed in response:

1. Removed the allocation. The first version of
gf_vect_dot_prod_base / ec_encode_data_base built the per-coefficient
tables with a malloc, which is a poor choice for a leaf routine. Both
functions process one source at a time, so they now rebuild a single
256-byte stack table per source — there is no heap allocation anywhere in
ec_base.c now.

2. Added a small-length fallback. For short regions the 256-entry
table build does not pay off, and it would penalize the fine-grained,
per-block calls described in #334. Each *_base function now falls back to
the original per-byte gf_mul() loop when len < 256, and only uses the
byte-indexed table for longer regions where the build amortizes.

The input format is unchanged — the functions still read coefficients at
v[j*32 + 1] with the same 32-byte stride — so code that calls these
directly with hand-constructed g_tbls pointers behaves exactly as
before, just faster for len >= 256.

I chose the 256 threshold conservatively (a region should be at least as
long as the table it would build). Happy to tune it, or to revisit the
approach entirely, if there's a better crossover for the direct-call
patterns you and @LokyinZHAO have in mind.

@jeffareid

Copy link
Copy Markdown

@scopedog - If the non-SIMD g_tbls version is also padded to 32 bytes between entries, then g_tbls grows from 256 bytes to 8192 bytes. I don't know how GF_LARGE_TABLES would be handled by @LokyinZHAO's code.

@scopedog

scopedog commented Jun 4, 2026

Copy link
Copy Markdown
Author

@jeffareid — I think there may be a mix-up between two separate tables, so let me clarify what this PR does and doesn't touch.

This PR doesn't change g_tbls or ec_init_tables at all. Every *_base function still reads its coefficients from g_tbls at the same v[j*32 + … + 1] offsets as upstream — same 32-byte stride, same offset. The only new table is a transient 256-byte tbl[256] on the stack, built per source and thrown away. It's not stored in g_tbls and isn't visible to any caller, so there's no padding change and no growth to 8192 bytes. Code like @LokyinZHAO's that walks g_tbls with computed pointers sees byte-for-byte the same layout; it just runs faster for len >= 256.

It's also worth noting that g_tbls is built by gf_vect_mul_init_base, which derives each 32-byte entry directly from the coefficient (the (c << 1) ^ 0x1d doubling trick) — it never calls gf_mul(), so g_tbls doesn't depend on the global multiply tables at all.

On GF_LARGE_TABLES: it only selects the internal gf_mul/gf_inv implementation over static tables — it doesn't appear in g_tbls and isn't part of the public ABI, so there's nothing in it for direct g_tbls callers to handle, and this PR doesn't change that either way. build_mul_tbl() fills its 256-byte table by calling gf_mul() — the exact same gf_mul() the previous per-byte loop used — so whichever backend GF_LARGE_TABLES selects is inherited transparently, with no special-casing.

So this change is orthogonal to the g_tbls stride question in #334 — it runs entirely after g_tbls is constructed and reads it unchanged. If we do want to address the direct-call stride/padding issue (e.g. having ec_init_tables report the entry stride), I'm happy to help, but I'd suggest that as a separate PR since it's an API question independent of this fallback-path optimization.

@jeffareid

jeffareid commented Jun 4, 2026

Copy link
Copy Markdown

@scopedog - thanks for clarifying. As for PR #334 - I would have preferred having ec_init_tables report the entry stride rather than padding and having user code assume a hard value for the stride, but it is not my call.

@scopedog

scopedog commented Jun 4, 2026

Copy link
Copy Markdown
Author

@jeffareid — fair point, and I agree a stride reported by ec_init_tables would be a nicer, more future-proof contract than assuming 32. Feels like it belongs with the broader #334 discussion rather than this change, so I'd lean toward leaving it for the maintainers there. Glad you raised it, and thanks for the careful review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants