NTT-CRT: skip redundant zero-fills of transform + scratch buffers #76
Merged
The CRT-NTT multiply (NTTMultiplicationCrt.h) is memory-bandwidth bound at large sizes (measured: serial->8-thread = 3.08x, single-core IPC 2.49, 4 shared-nothing processes aggregate only 2.6x). Two of the per-call zero-fills were pure wasted memory traffic:

- mfaScratch[0..5] were assign(n, 0) before use, but each scratch is fully written by the step-1 Transpose (and recursive sub-calls write their slice) before any read. The zero-fill was ~6n redundant writes per multiply. Changed to resize(n).
- fa/fb forward buffers were assign(n, 0), then PackOperand overwrote the coefficient head [0, *CoeffSize). Only the zero-pad tail needs clearing. Changed to resize(n) + std::fill of the tail only, removing a double-write of ~aCoeffSize+bCoeffSize elements per call.

mul_xl_bench, M1 Max, base 2^64, best-of-5 (before -> after):

1.3M limbs (25M digits): 1018.7 -> 960.9 ms (-5.6%)
2.6M limbs (50M digits): 1786.9 -> 1726.4 ms (-3.3%)
5.2M limbs (100M digits): 3916.4 -> 3909.2 ms (-0.1%)

Win concentrates below the MFA band (n < 2^24) where the dropped fills are a larger fraction; at 100M digits the MFA transform passes dominate and zero-fill (ideal sequential write) is a small time share. No regression at any size.

Correctness: tests/mult_correctness.cpp passes (all algorithms match the classic reference across all sizes incl. max-carry and unbalanced cases).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mmurshed added a commit that referenced this pull request on May 31, 2026
Revert PR #76: NTT-CRT zero-fill change (no certifiable net win)
What
The CRT-NTT multiply is memory-bandwidth bound at large sizes (profiled: serial→8-thread = 3.08×, single-core IPC 2.49, 4 shared-nothing processes aggregate only 2.6×). Two per-call zero-fills were pure wasted memory traffic:
- mfaScratch[0..5] were assign(n, 0) before use, but each scratch is fully written by the step-1 Transpose (recursive sub-calls write their slice) before any read → ~6n redundant writes/multiply. Now resize(n).
- fa/fb forward buffers were assign(n, 0), then PackOperand overwrote the head [0, *CoeffSize). Only the zero-pad tail needs clearing → resize(n) + std::fill of the tail, removing a double-write of ~aCoeffSize+bCoeffSize elements/multiply.
Benchmark (mul_xl_bench, M1 Max, base 2^64, best-of-5)
Before → after:
1.3M limbs (25M digits): 1018.7 → 960.9 ms (-5.6%)
2.6M limbs (50M digits): 1786.9 → 1726.4 ms (-3.3%)
5.2M limbs (100M digits): 3916.4 → 3909.2 ms (-0.1%)
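The fa/fb change can be sketched as follows. This is a minimal illustration, not the code from NTTMultiplicationCrt.h: PackHead is a hypothetical stand-in for PackOperand (whose real signature is not shown in this PR), and PrepareWithAssign, PrepareWithTailFill and the uint64_t element type are assumptions made for the example. One caveat worth keeping in mind: std::vector::resize(n) only skips writes when the vector already holds at least n elements; when it grows, the newly appended elements are still value-initialized to zero, so the saving applies to buffers that are reused across calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for PackOperand: writes the coefficients into
// the head of buf, i.e. the range [0, coeffs.size()).
inline void PackHead(std::vector<std::uint64_t>& buf,
                     const std::vector<std::uint64_t>& coeffs) {
    assert(coeffs.size() <= buf.size());  // head must fit in the buffer
    std::copy(coeffs.begin(), coeffs.end(), buf.begin());
}

// Before: zero all n elements, then overwrite the head.
// The head [0, coeffs.size()) is written twice.
inline void PrepareWithAssign(std::vector<std::uint64_t>& buf,
                              const std::vector<std::uint64_t>& coeffs,
                              std::size_t n) {
    buf.assign(n, 0);
    PackHead(buf, coeffs);
}

// After: size the buffer, write the head once, zero only the pad tail.
inline void PrepareWithTailFill(std::vector<std::uint64_t>& buf,
                                const std::vector<std::uint64_t>& coeffs,
                                std::size_t n) {
    buf.resize(n);  // no element writes if buf.size() is already n
    PackHead(buf, coeffs);
    // Clear [coeffs.size(), n), which may hold stale data from a prior call.
    std::fill(buf.begin() + static_cast<std::ptrdiff_t>(coeffs.size()),
              buf.end(), 0);
}
```

Both helpers leave the buffer in the same state; the second simply avoids touching the head twice. The mfaScratch change is the degenerate case of the same idea: if every element is written before it is read, no part of the buffer needs clearing at all.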
Win concentrates below the MFA band (n < 2^24) where the dropped fills are a larger fraction; at 100M digits the MFA transform passes dominate and zero-fill (ideal sequential write) is a small time share. No regression at any size.
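As a rough back-of-envelope check on how much write traffic the two changes remove, the counts quoted above (~6n for the scratch buffers, ~aCoeffSize+bCoeffSize for the operand heads) can be turned into a small helper. The function names, the 8-byte element size and the example sizes are assumptions chosen for illustration; they are not measurements from mul_xl_bench.

```cpp
#include <cstdint>

// Element writes removed per multiply: six scratch buffers of n elements
// each, plus the two operand heads that are no longer written twice.
constexpr std::uint64_t RedundantWritesSaved(std::uint64_t n,
                                             std::uint64_t aCoeffSize,
                                             std::uint64_t bCoeffSize) {
    return 6 * n + aCoeffSize + bCoeffSize;
}

// Same quantity in bytes. elemBytes = 8 is an assumption (64-bit elements).
constexpr std::uint64_t RedundantBytesSaved(std::uint64_t n,
                                            std::uint64_t aCoeffSize,
                                            std::uint64_t bCoeffSize,
                                            std::uint64_t elemBytes = 8) {
    return RedundantWritesSaved(n, aCoeffSize, bCoeffSize) * elemBytes;
}

// Illustrative sizes only: n = 2^24, both operands 2^22 coefficients.
static_assert(RedundantWritesSaved(1ull << 24, 1ull << 22, 1ull << 22)
                  == 109051904ull, "6n + a + b");
static_assert(RedundantBytesSaved(1ull << 24, 1ull << 22, 1ull << 22)
                  == 872415232ull, "832 MiB at 8 bytes per element");
```

Because these are sequential writes, which a memory system handles close to its best case, even several hundred MiB of avoided traffic can amount to a small share of a multiply whose transform passes dominate. That is consistent with the shrinking gain at the largest size.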
Correctness
tests/mult_correctness.cpp passes: all algorithms match the classic reference across all sizes, including max-carry and unbalanced cases.
🤖 Generated with Claude Code