Revert PR #76: NTT-CRT zero-fill change (no certifiable net win)#77
Merged
Conversation
…fers" This reverts the buffer zero-fill change from PR #76 (commit 68c7678), restoring the original assign(n, 0) initialization of the forward and scratch buffers in NttCrt::Multiply. A full size sweep (3k–8M limbs, best-of-5, M1 Max) showed the change does not produce a certifiable net speedup. The apparent wins/losses in the original 3-size benchmark were within measurement noise: - std::vector::resize(n) value-initializes any grown region to 0, so on a cold buffer (the common case for a fresh process / first call at a size) resize already zeroes all n elements. The follow-up std::fill of the pad tail was then pure redundant work, making the cold path slower, not faster — the source of the 8M-limb (~-2%) regression observed after merge. - At warm steady state the change is a genuine but small win; across the full sweep the net effect is dominated by run-to-run variance and power-of-2-size codegen-alignment artifacts (e.g. +12% at 200k, -7% at 700k that do not survive controlled re-measurement). Reverting to the proven assign(n, 0) keeps the code simple and removes the post-merge regression. The buffers are memory-bandwidth-relevant, but the real traffic-reduction lever is fusing the MFA transpose/twiddle passes, not micro-optimizing initialization. Correctness: tests/mult_correctness.cpp passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reverts the buffer zero-fill change from #76 (commit 68c7678), restoring
assign(n, 0).Why
A full size sweep (3k–8M limbs, best-of-5, M1 Max) showed no certifiable net speedup; the original 3-size result was within noise.
Root cause:
std::vector::resize(n)value-initializes grown elements to 0, so on a cold buffer (fresh process / first call at a size — the benchmark's case)resizealready zeroes all n. The follow-upstd::fillof the pad tail was then redundant work, making the cold path slower — the source of the ~−2% 8M-limb regression observed after #76 merged.At warm steady state the change is a small genuine win, but across the full sweep the net is dominated by run-to-run variance and power-of-2 codegen-alignment artifacts (+12% at 200k, −7% at 700k that don't survive controlled re-measurement).
Reverting keeps the code simple and removes the post-merge regression. The real traffic-reduction lever remains fusing the MFA transpose/twiddle passes, not initialization micro-opt.
Correctness
tests/mult_correctness.cpppasses.🤖 Generated with Claude Code