Skip to content

Revert PR #76: NTT-CRT zero-fill change (no certifiable net win)#77

Merged
mmurshed merged 1 commit into
mainfrom
revert/ntt-zero-fill
May 31, 2026
Merged

Revert PR #76: NTT-CRT zero-fill change (no certifiable net win)#77
mmurshed merged 1 commit into
mainfrom
revert/ntt-zero-fill

Conversation

@mmurshed

Copy link
Copy Markdown
Owner

Reverts the buffer zero-fill change from #76 (commit 68c7678), restoring assign(n, 0).

Why

A full size sweep (3k–8M limbs, best-of-5, M1 Max) showed no certifiable net speedup; the original 3-size result was within noise.

Root cause: std::vector::resize(n) value-initializes grown elements to 0, so on a cold buffer (fresh process / first call at a size — the benchmark's case) resize already zeroes all n. The follow-up std::fill of the pad tail was then redundant work, making the cold path slower — the source of the ~−2% 8M-limb regression observed after #76 merged.

At warm steady state the change is a small genuine win, but across the full sweep the net is dominated by run-to-run variance and power-of-2 codegen-alignment artifacts (+12% at 200k, −7% at 700k that don't survive controlled re-measurement).

Reverting keeps the code simple and removes the post-merge regression. The real traffic-reduction lever remains fusing the MFA transpose/twiddle passes, not initialization micro-opt.

Correctness

tests/mult_correctness.cpp passes.

🤖 Generated with Claude Code

…fers"

This reverts the buffer zero-fill change from PR #76 (commit 68c7678),
restoring the original assign(n, 0) initialization of the forward and
scratch buffers in NttCrt::Multiply.

A full size sweep (3k–8M limbs, best-of-5, M1 Max) showed the change does
not produce a certifiable net speedup. The apparent wins/losses in the
original 3-size benchmark were within measurement noise:

- std::vector::resize(n) value-initializes any grown region to 0, so on a
  cold buffer (the common case for a fresh process / first call at a size)
  resize already zeroes all n elements. The follow-up std::fill of the pad
  tail was then pure redundant work, making the cold path slower, not
  faster — the source of the 8M-limb (~-2%) regression observed after merge.
- At warm steady state the change is a genuine but small win; across the
  full sweep the net effect is dominated by run-to-run variance and
  power-of-2-size codegen-alignment artifacts (e.g. +12% at 200k, -7% at
  700k that do not survive controlled re-measurement).

Reverting to the proven assign(n, 0) keeps the code simple and removes the
post-merge regression. The buffers are memory-bandwidth-relevant, but the
real traffic-reduction lever is fusing the MFA transpose/twiddle passes,
not micro-optimizing initialization.

Correctness: tests/mult_correctness.cpp passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mmurshed mmurshed merged commit 95a7c35 into main May 31, 2026
4 of 5 checks passed
@mmurshed mmurshed deleted the revert/ntt-zero-fill branch May 31, 2026 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant