NTT-CRT MFA: fuse transpose into row-FFT pass (~11% at 40M-100M digits) by mmurshed · Pull Request #78 · mmurshed/biginteger

mmurshed · 2026-05-31T06:59:49Z

What

The Matrix-Fourier (Bailey 6-step) NTT multiply ran each axis as a full out-of-place Transpose immediately followed by a full row-FFT sweep over the just-written buffer. At MFA sizes that scratch buffer is 64–256 MB per prime, so each transpose write + row-FFT read is two extra full round-trips through DRAM per axis. Large MFA on Apple silicon is memory-bandwidth bound (previously profiled), so those round-trips dominate.

This PR fuses them: gather a block of TILE output rows into an L2-resident tile, run the row FFT (and cross-twiddle) there, and write the result once. That collapses (transpose + row-FFT) into a single streaming pass, so transform traffic drops from 8n → 4n bytes per forward/inverse. Per-row FFT math is byte-identical to the old step2/step5, so output is bit-exact.
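The tiling idea above can be sketched as follows. This is a minimal illustration, not the code from this PR: `Elem`, `RowOp`, `transpose_then_rows`, and `fused_rows` are hypothetical names, the 32-bit element width is an assumption, and the row transform is passed in as a callback standing in for the real row FFT plus cross-twiddle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Elem = std::uint32_t;                             // assumed element width
using RowOp = std::function<void(Elem*, std::size_t)>;  // stand-in for the row FFT

// Unfused reference: full out-of-place transpose, then a full row sweep.
// src is rows x cols (row-major); the result is cols x rows.
std::vector<Elem> transpose_then_rows(const std::vector<Elem>& src,
                                      std::size_t rows, std::size_t cols,
                                      const RowOp& row_op) {
    std::vector<Elem> dst(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
    for (std::size_t c = 0; c < cols; ++c)
        row_op(&dst[c * rows], rows);
    return dst;
}

// Fused: gather `tile` output rows into a small buffer, transform them there,
// and write each finished row to dst exactly once.
std::vector<Elem> fused_rows(const std::vector<Elem>& src,
                             std::size_t rows, std::size_t cols,
                             std::size_t tile, const RowOp& row_op) {
    assert(tile > 0 && src.size() == rows * cols);
    std::vector<Elem> dst(rows * cols);
    std::vector<Elem> buf(tile * rows);  // small enough to stay cache-resident
    for (std::size_t c0 = 0; c0 < cols; c0 += tile) {
        const std::size_t nt = std::min(tile, cols - c0);
        // Gather: output row (c0 + t) is input column (c0 + t).
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t t = 0; t < nt; ++t)
                buf[t * rows + r] = src[r * cols + c0 + t];
        // Transform each gathered row in the tile, then store it once.
        for (std::size_t t = 0; t < nt; ++t) {
            row_op(&buf[t * rows], rows);
            std::copy_n(&buf[t * rows], rows, &dst[(c0 + t) * rows]);
        }
    }
    return dst;
}
```

Because both functions apply the same `row_op` to the same row contents, their outputs should match element for element, which mirrors the bit-exactness argument made for the real change.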

Scope / safety

Only the single-level case (n1,n2 <= LEAF) is fused; that covers every in-practice MFA length (2^24..2^26 all factor to sub-FFTs ≤ 2^13). The recursive case still needs `a` live as sub-FFT scratch, so it falls back to the original transpose path.
Gated by BIGMATH_NTT_MFA_FUSE (default on); tile size BIGMATH_NTT_MFA_FUSE_TILE (default 16 rows = 512 KB at the 2^13 leaf). Set -DBIGMATH_NTT_MFA_FUSE=0 to restore the old path.
All four column↔row index maps were hand-derived against the original Transpose+step semantics.
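A compile-time gate of this shape could look like the sketch below. Only the two macro names and their defaults come from this PR; `kLeaf`, `tile_bytes`, and `use_fused_path` are hypothetical helpers, and the 4-byte element size is an assumption inferred from "16 rows = 512 KB at the 2^13 leaf".

```cpp
#include <cstddef>
#include <cstdint>

// Macro names and defaults are from the PR text; the rest is illustrative.
#ifndef BIGMATH_NTT_MFA_FUSE
#define BIGMATH_NTT_MFA_FUSE 1        // default on; -DBIGMATH_NTT_MFA_FUSE=0 disables
#endif
#ifndef BIGMATH_NTT_MFA_FUSE_TILE
#define BIGMATH_NTT_MFA_FUSE_TILE 16  // rows gathered per tile
#endif

// Assumed value of LEAF: the 2^13 sub-FFT bound mentioned above.
constexpr std::size_t kLeaf = std::size_t{1} << 13;

// Bytes in one tile, assuming 4-byte residues.
constexpr std::size_t tile_bytes(std::size_t tile_rows, std::size_t row_len) {
    return tile_rows * row_len * sizeof(std::uint32_t);
}

// Hypothetical dispatch predicate: fuse only when the gate is enabled and
// both factors fit in a leaf (the single-level case); otherwise fall back.
constexpr bool use_fused_path(std::size_t n1, std::size_t n2) {
    return BIGMATH_NTT_MFA_FUSE != 0 && n1 <= kLeaf && n2 <= kLeaf;
}
```

Under these assumptions, 16 rows × 2^13 elements × 4 bytes gives 524,288 bytes, which matches the 512 KB default tile quoted above.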

Verification

Bit-exact vs the unfused path (FNV checksum of product limbs) at 2M / 4M / 5.2M / 6M / 8M limbs.
mult_correctness passes (18/0).
Best-of-7 interleaved, fresh processes, M1 Max (base 2^64): 4M +11.24%, 8M +11.27%. Best-of-5 across 4M–10M: all +9–11%.
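The checksum comparison can be sketched as below. The PR only says "FNV checksum"; the 64-bit FNV-1a variant, the little-endian byte order, and the names `fnv1a64` and `same_product` are assumptions made for illustration, not the project's actual test harness.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// 64-bit FNV-1a over the little-endian bytes of each 64-bit limb.
std::uint64_t fnv1a64(const std::vector<std::uint64_t>& limbs) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (std::uint64_t limb : limbs) {
        for (int b = 0; b < 8; ++b) {
            h ^= (limb >> (8 * b)) & 0xffu;   // fold in one byte
            h *= 0x100000001b3ULL;            // FNV 64-bit prime
        }
    }
    return h;
}

// Hypothetical check: products from the fused and unfused builds should have
// the same length and the same checksum.
bool same_product(const std::vector<std::uint64_t>& fused,
                  const std::vector<std::uint64_t>& unfused) {
    return fused.size() == unfused.size() && fnv1a64(fused) == fnv1a64(unfused);
}
```

A checksum match is strong evidence rather than proof of equality; a direct limb-by-limb comparison is the stricter check when both products fit in memory.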

🤖 Generated with Claude Code

The Matrix-Fourier (Bailey 6-step) multiply ran each axis as a full out-of-place Transpose immediately followed by a full row-FFT sweep over the just-written buffer. At MFA sizes that scratch buffer is 64-256 MB per prime, so the transpose write + row-FFT read are two extra full round-trips through DRAM per axis. Large MFA on Apple silicon is memory-bandwidth bound (profiled), so those round-trips dominate.

Fuse them: gather a block of TILE output rows into an L2-resident tile, run the row FFT (and cross-twiddle) there, and write the result once. This collapses (transpose + row-FFT) into a single streaming pass, cutting transform traffic from 8n to 4n bytes per forward/inverse. The per-row FFT math is byte-identical to the unfused step2/step5, so the output is bit-exact.

Only the single-level case (n1,n2 <= LEAF) is fused; that covers every in-practice MFA length (2^24..2^26 all factor to sub-FFTs <= 2^13). The recursive case still needs `a` live as sub-FFT scratch, so it falls back to the original transpose path. Gated by BIGMATH_NTT_MFA_FUSE (default on); tile by BIGMATH_NTT_MFA_FUSE_TILE (default 16 rows = 512 KB at the 2^13 leaf).

Verified bit-exact vs the unfused path (FNV checksum of product limbs) at 2M/4M/5.2M/6M/8M limbs; mult_correctness passes (18/0). Best-of-7 interleaved on M1 Max (base 2^64): 4M +11.2%, 8M +11.3%; best-of-5 across 4M-10M all +9-11%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds an "MFA transpose fusion (PR #78)" subsection to the multiplication benchmarks: full balanced-Base2_64 sweep (3k–10M limbs) comparing the fused path against -DBIGMATH_NTT_MFA_FUSE=0, plus a two-panel plot (throughput and speedup vs operand size in decimal digits). The fusion holds +7–11% across the MFA band (40–193M digits); below the gate the path is unchanged (sub-ms noise). Plot generated with matplotlib; an SVG copy is included alongside the PNG.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mmurshed and others added 2 commits May 30, 2026 23:59

mmurshed merged commit d38f3c8 into main May 31, 2026
4 of 5 checks passed

mmurshed deleted the perf/ntt-mfa-transpose-fusion branch May 31, 2026 07:32

mmurshed merged 2 commits into main from perf/ntt-mfa-transpose-fusion
