NTT-CRT: skip redundant zero-fills of transform + scratch buffers by mmurshed · Pull Request #76 · mmurshed/biginteger

mmurshed · 2026-05-31T05:39:44Z

What

The CRT-NTT multiply is memory-bandwidth bound at large sizes (profiled: serial→8-thread = 3.08×, single-core IPC 2.49, 4 shared-nothing processes aggregate only 2.6×). Two per-call zero-fills were pure wasted memory traffic:

- mfaScratch[0..5] were initialized with assign(n, 0) before use, but each scratch is fully written by the step-1 Transpose (recursive sub-calls write their own slice) before any read → ~6n redundant writes/multiply. Now resize(n).
- fa/fb forward buffers were initialized with assign(n, 0), then PackOperand overwrote the head [0, *CoeffSize). Only the zero-pad tail needs clearing → resize(n) + std::fill of the tail, removing a double-write of ~aCoeffSize+bCoeffSize elements/multiply.
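
The fa/fb change can be sketched as follows. This is a minimal illustration and not the code from NTTMultiplicationCrt.h: `PackHead`, `PrepareWithAssign`, and `PrepareWithTailFill` are hypothetical names, and `PackHead` only stands in for whatever PackOperand actually does. One caveat worth keeping in mind: `std::vector::resize(n)` value-initializes any elements it appends, so the saving only materializes when the buffer already holds at least n elements from an earlier call (an assumption here; the PR text does not state how the buffers are kept alive).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for PackOperand: writes the coefficients into the
// head [0, coeffs.size()) of buf. Requires coeffs.size() <= buf.size().
static void PackHead(std::vector<std::uint64_t>& buf,
                     const std::vector<std::uint64_t>& coeffs) {
    std::copy(coeffs.begin(), coeffs.end(), buf.begin());
}

// Before: assign(n, 0) writes all n elements, then the head is written again.
static void PrepareWithAssign(std::vector<std::uint64_t>& buf,
                              const std::vector<std::uint64_t>& coeffs,
                              std::size_t n) {
    buf.assign(n, 0);
    PackHead(buf, coeffs);
}

// After: resize(n) leaves existing elements untouched (only newly appended
// elements are zeroed), the head is written once, and only the tail
// [coeffs.size(), n) is cleared explicitly.
static void PrepareWithTailFill(std::vector<std::uint64_t>& buf,
                                const std::vector<std::uint64_t>& coeffs,
                                std::size_t n) {
    buf.resize(n);
    PackHead(buf, coeffs);
    std::fill(buf.begin() + static_cast<std::ptrdiff_t>(coeffs.size()),
              buf.end(), std::uint64_t{0});
}
```

Both functions leave the buffer in the same state even when it starts out holding stale data from a previous multiply; the second simply avoids writing the head twice. The scratch-buffer case is the same idea taken further: if every element is written before it is read, no clearing is needed at all.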

Benchmark (mul_xl_bench, M1 Max, base 2^64, best-of-5)

| size | digits | before | after | Δ |
|------|--------|--------|-------|---|
| 1.3M limbs | 25M | 1018.7 ms | 960.9 ms | −5.6% |
| 2.6M limbs | 50M | 1786.9 ms | 1726.4 ms | −3.3% |
| 5.2M limbs | 100M | 3916.4 ms | 3909.2 ms | −0.1% |

The win concentrates below the MFA band (n < 2^24), where the dropped fills are a larger fraction of the total work; at 100M digits the MFA transform passes dominate and the zero-fill (an ideal sequential write) takes only a small share of the time. No regression at any size.

Correctness

tests/mult_correctness.cpp passes: all algorithms match the classic reference across all sizes, including max-carry and unbalanced cases.

🤖 Generated with Claude Code

The CRT-NTT multiply (NTTMultiplicationCrt.h) is memory-bandwidth bound at large sizes (measured: serial->8-thread = 3.08x, single-core IPC 2.49, 4 shared-nothing processes aggregate only 2.6x). Two of the per-call zero-fills were pure wasted memory traffic:

- mfaScratch[0..5] were assign(n, 0) before use, but each scratch is fully written by the step-1 Transpose (and recursive sub-calls write their slice) before any read. The zero-fill was ~6n redundant writes per multiply. Changed to resize(n).
- fa/fb forward buffers were assign(n, 0), then PackOperand overwrote the coefficient head [0, *CoeffSize). Only the zero-pad tail needs clearing. Changed to resize(n) + std::fill of the tail only, removing a double-write of ~aCoeffSize+bCoeffSize elements per call.

mul_xl_bench, M1 Max, base 2^64, best-of-5 (before -> after):

1.3M limbs (25M digits): 1018.7 -> 960.9 ms (-5.6%)
2.6M limbs (50M digits): 1786.9 -> 1726.4 ms (-3.3%)
5.2M limbs (100M digits): 3916.4 -> 3909.2 ms (-0.1%)

Win concentrates below the MFA band (n < 2^24) where the dropped fills are a larger fraction; at 100M digits the MFA transform passes dominate and zero-fill (ideal sequential write) is a small time share. No regression at any size.

Correctness: tests/mult_correctness.cpp passes (all algorithms match the classic reference across all sizes incl. max-carry and unbalanced cases).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mmurshed merged commit 85b802c into main May 31, 2026
4 of 5 checks passed

mmurshed deleted the perf/ntt-crt-skip-zero-fill branch May 31, 2026 05:56

mmurshed mentioned this pull request May 31, 2026

Revert PR #76: NTT-CRT zero-fill change (no certifiable net win) #77

Merged

mmurshed added a commit that referenced this pull request May 31, 2026

Merge pull request #77 from mmurshed/revert/ntt-zero-fill

95a7c35

Revert PR #76: NTT-CRT zero-fill change (no certifiable net win)

mmurshed merged 1 commit into main from perf/ntt-crt-skip-zero-fill
