NTT-CRT: skip redundant zero-fills of transform + scratch buffers #76

Merged: mmurshed merged 1 commit into main from perf/ntt-crt-skip-zero-fill on May 31, 2026
@mmurshed

What

The CRT-NTT multiply is memory-bandwidth bound at large sizes (profiled: serial→8-thread = 3.08×, single-core IPC 2.49, 4 shared-nothing processes aggregate only 2.6×). Two per-call zero-fills were pure wasted memory traffic:

  • mfaScratch[0..5] were zero-filled with assign(n, 0) before use, but each scratch buffer is fully written by the step-1 Transpose (recursive sub-calls write their own slice) before any read → ~6n redundant writes per multiply. Now resize(n).
  • fa/fb forward buffers were zero-filled with assign(n, 0), then PackOperand overwrote the head [0, *CoeffSize). Only the zero-pad tail [*CoeffSize, n) needs clearing → resize(n) + std::fill of the tail, removing a double-write of ~aCoeffSize+bCoeffSize elements per multiply.
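
For reviewers, the fa/fb change can be sketched roughly as below. This is an illustration under assumptions, not the code in NTTMultiplicationCrt.h: PackHead is a hypothetical stand-in for PackOperand, and the element type and function names are invented for the example. One caveat worth keeping in mind: std::vector::resize(n) value-initializes any elements it adds, so it only avoids a zero-fill when the buffer is reused and already holds at least n elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Limb = std::uint64_t;  // assumed element type, for illustration only

// Hypothetical stand-in for PackOperand: writes src into the head [0, src.size()).
inline void PackHead(std::vector<Limb>& buf, const std::vector<Limb>& src) {
    assert(src.size() <= buf.size());
    std::copy(src.begin(), src.end(), buf.begin());
}

// Before: every element is zeroed, then the head is written a second time.
inline void PrepareWithAssign(std::vector<Limb>& buf,
                              const std::vector<Limb>& src, std::size_t n) {
    buf.assign(n, Limb{0});
    PackHead(buf, src);
}

// After: the head is written once and only the zero-pad tail is cleared.
inline void PrepareWithTailFill(std::vector<Limb>& buf,
                                const std::vector<Limb>& src, std::size_t n) {
    assert(src.size() <= n);
    buf.resize(n);  // no writes if buf.size() == n already; growth still zero-initializes
    PackHead(buf, src);
    std::fill(buf.begin() + static_cast<std::ptrdiff_t>(src.size()), buf.end(), Limb{0});
}
```

Both functions leave the buffer in the same state; the second skips the redundant zero-write of the head when the buffer is reused across calls.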

Benchmark (mul_xl_bench, M1 Max, base 2^64, best-of-5)

| Size | Digits | Before (ms) | After (ms) | Δ |
|---|---|---|---|---|
| 1.3M limbs | 25M | 1018.7 | 960.9 | −5.6% |
| 2.6M limbs | 50M | 1786.9 | 1726.4 | −3.3% |
| 5.2M limbs | 100M | 3916.4 | 3909.2 | −0.1% |

Win concentrates below the MFA band (n < 2^24) where the dropped fills are a larger fraction; at 100M digits the MFA transform passes dominate and zero-fill (ideal sequential write) is a small time share. No regression at any size.

Correctness

tests/mult_correctness.cpp passes — all algorithms match the classic reference across all sizes including max-carry and unbalanced cases.

🤖 Generated with Claude Code

The CRT-NTT multiply (NTTMultiplicationCrt.h) is memory-bandwidth bound at
large sizes (measured: serial->8-thread = 3.08x, single-core IPC 2.49,
4 shared-nothing processes aggregate only 2.6x). Two of the per-call
zero-fills were pure wasted memory traffic:

- mfaScratch[0..5] were assign(n, 0) before use, but each scratch is fully
  written by the step-1 Transpose (and recursive sub-calls write their slice)
  before any read. The zero-fill was ~6n redundant writes per multiply.
  Changed to resize(n).

- fa/fb forward buffers were assign(n, 0), then PackOperand overwrote the
  coefficient head [0, *CoeffSize). Only the zero-pad tail needs clearing.
  Changed to resize(n) + std::fill of the tail only, removing a double-write
  of ~aCoeffSize+bCoeffSize elements per call.

mul_xl_bench, M1 Max, base 2^64, best-of-5 (before -> after):
  1.3M limbs (25M digits):  1018.7 -> 960.9 ms  (-5.6%)
  2.6M limbs (50M digits):  1786.9 -> 1726.4 ms (-3.3%)
  5.2M limbs (100M digits): 3916.4 -> 3909.2 ms (-0.1%)

Win concentrates below the MFA band (n < 2^24) where the dropped fills are a
larger fraction; at 100M digits the MFA transform passes dominate and zero-fill
(ideal sequential write) is a small time share. No regression at any size.

Correctness: tests/mult_correctness.cpp passes (all algorithms match the
classic reference across all sizes incl. max-carry and unbalanced cases).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mmurshed merged commit 85b802c into main on May 31, 2026
4 of 5 checks passed
mmurshed deleted the perf/ntt-crt-skip-zero-fill branch on May 31, 2026 05:56
mmurshed added a commit that referenced this pull request on May 31, 2026:
Revert PR #76: NTT-CRT zero-fill change (no certifiable net win)