Fix near-balanced division: Newton balanced band + single-block pathology #79
Merged
…logy

Profiling the dispatch path on ratio-≈2 division at large divisor sizes surfaced two compounding problems:

1. NewtonDivision single-block path (na ≤ 2n+1) ran a 2n+1-limb chunk through the truncated (chunk·R)>>2n quotient estimate. The truncation error scales with chunk/B^(2n), reaching ~B once the chunk exceeds 2n limbs (the +1 limb routinely appears from the Knuth normalize shift on a ≈ 2n). The estimate undershot Q past the 8-step fixup cap and bailed to quadratic FastDivision: a 23× spike at nb=50000 (5200 ms vs ~360 ms at neighbours).
   Fix: route na > 2n through the blockwise path so every chunk stays ≤ 2n.

2. Burnikel-Ziegler's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, doubling the FFT length. The constant factor compounds across recursion depth into a 5–60× slowdown vs Newton, worst at n = 2^k + 1. Newton pads once and stays flat.
   Fix: a Newton balanced band (b ≥ NEWTON_BALANCED_B=98304, a ≥ 2b) takes the near-balanced large case BZ previously owned.

Measured (Base2_64, M1 Max, dispatch wall-clock, before main d38f3c8 → BZ):

    500k/250k                  5177.7 →  588.3 ms    8.80×
    1M/500k                   10554.7 → 1485.7 ms    7.10×
    2M/1M                     21006.9 → 3747.6 ms    5.61×
    524290/262145 (2^18+1)    92060   →  942   ms   97.7×

Exact-power-of-2 divisor sizes (BZ best case) regress ~4%; rare in practice. Ratio-3 control unchanged (already Newton).

Cross-checked vs BZ limb-for-limb across na ∈ {2n-1,2n,2n+1,2n+2,3n} near the boundary (120 cases) + canonical div_correctness: all match.

Adds tests/performance/division_balanced_bench.cpp and the speedup plot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
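To make the transform-length effect in point 2 concrete, here is a minimal sketch. It assumes a simplified model in which the transform length is the smallest power of two that holds the product's limb count; the real NTT in this repository may pack limbs differently. The names `next_pow2` and `transform_length` are hypothetical and do not come from the codebase.

```cpp
#include <cstddef>

// Smallest power of two >= x (hypothetical helper, not from the repository).
constexpr std::size_t next_pow2(std::size_t x) {
    std::size_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

// ASSUMPTION: one transform slot per output limb, so an la-limb by lb-limb
// product needs a transform of length next_pow2(la + lb).
constexpr std::size_t transform_length(std::size_t la, std::size_t lb) {
    return next_pow2(la + lb);
}

// Operands of exactly 2^17 limbs fit a 2^18 transform...
static_assert(transform_length(131072, 131072) == 262144, "fits exactly");
// ...but one extra limb per operand pushes the product over the boundary
// and doubles the transform length.
static_assert(transform_length(131073, 131073) == 524288, "length doubles");
```

Under this model a single limb over the boundary doubles the work of one multiply; a recursion that repeats that pattern at each level would multiply the cost accordingly, which is consistent with the compounding slowdown described above.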
mmurshed added a commit that referenced this pull request May 31, 2026
docs: sync division dispatch docs with Newton balanced band (PR #79)
This was referenced Jun 12, 2026
Summary
Profiling the division dispatch path on near-balanced (ratio ≈ 2) operands at large divisor sizes surfaced two compounding problems. Both are fixed here.
1. Newton single-block pathology

The single-block path (`na ≤ 2n+1`) ran a `2n+1`-limb chunk through the truncated `(chunk·R) >> 2n` quotient estimate. The truncation error scales with `chunk/B^(2n)`, reaching ~`B` once the chunk exceeds `2n` limbs, and the `+1` limb routinely appears from the Knuth normalize shift on `a ≈ 2n`. The estimate then undershot `Q` past the 8-step fixup cap and bailed to quadratic `FastDivision`. At `nb=50000` this was a 23× spike (5200 ms vs ~360 ms at neighbouring sizes).

Fix: route `na > 2n` through the blockwise path so every chunk stays `≤ 2n`.

2. Burnikel-Ziegler non-power-of-2 blowup
BZ's recursive `2n/n` halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, doubling the FFT length. The constant factor compounds across recursion depth into a 5–60× slowdown vs Newton, worst at `n = 2^k + 1`. Newton pads once and stays flat.

Fix: a Newton balanced band, `b ≥ NEWTON_BALANCED_B (98304)` and `a ≥ 2b`, takes the near-balanced large case BZ previously owned. Exact-power-of-2 divisor sizes (BZ's best case, where it ties Newton) regress ~4 %; rare in practice. The existing ratio-≥3 Newton band is untouched.

Measured (Base2_64, M1 Max, dispatch wall-clock; before = `main` `d38f3c8` → BZ)

| Size (na/nb, limbs) | Before (ms) | After (ms) | Speedup | Note |
|---|---|---|---|---|
| 500000/250000 | 5177.7 | 588.3 | 8.80× | |
| 1000000/500000 | 10554.7 | 1485.7 | 7.10× | |
| 2000000/1000000 | 21006.9 | 3747.6 | 5.61× | |
| 524290/262145 | 92060 | 942 | 97.7× | `2^18+1` |
| 524288/262144 | | | | exact power of 2 (BZ best case); regresses ~4 % |
| 3000000/1000000 | | | | ratio-3 control; unchanged (already Newton) |

Validation
- `div_correctness` (canonical): pass, 0 mismatches
- `unit_tests`: 246 passed, 0 failed
- Cross-check vs BZ, limb-for-limb, `na ∈ {2n-1,2n,2n+1,2n+2,3n}` × `nb` near the boundary (120 cases): all match
- Adds `tests/performance/division_balanced_bench.cpp` + `docs/images/make_division_balanced_plot.py`

🤖 Generated with Claude Code
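The two routing rules from the Summary can be sketched as plain predicates. This is an illustration only, not the repository's dispatch code: the enum and function names are hypothetical, and `PreviousDispatch` stands in for every rule that existed before this PR (including the ratio-≥3 Newton band, whose threshold is not stated here). Only the value 98304 and the conditions `b ≥ NEWTON_BALANCED_B`, `a ≥ 2b`, and `na > 2n` are taken from the description.

```cpp
#include <cstddef>

// Threshold value quoted in the PR description.
constexpr std::size_t NEWTON_BALANCED_B = 98304;

// Hypothetical labels for illustration.
enum class DivAlgo { NewtonBalanced, PreviousDispatch };
enum class NewtonPath { SingleBlock, Blockwise };

// a = dividend length in limbs, b = divisor length in limbs.
// The new balanced band claims large divisors with a >= 2b; everything
// else falls through to whatever the dispatcher did before.
constexpr DivAlgo choose_division(std::size_t a, std::size_t b) {
    return (b >= NEWTON_BALANCED_B && a >= 2 * b) ? DivAlgo::NewtonBalanced
                                                  : DivAlgo::PreviousDispatch;
}

// Inside Newton: any dividend longer than 2n limbs goes blockwise, so no
// chunk handed to the truncated quotient estimate exceeds 2n limbs.
constexpr NewtonPath choose_newton_path(std::size_t na, std::size_t n) {
    return na > 2 * n ? NewtonPath::Blockwise : NewtonPath::SingleBlock;
}

// The 2n+1 case that used to run single-block is now blockwise.
static_assert(choose_newton_path(100001, 50000) == NewtonPath::Blockwise, "");
static_assert(choose_newton_path(100000, 50000) == NewtonPath::SingleBlock, "");
// The 2^18+1 benchmark size falls inside the balanced band.
static_assert(choose_division(524290, 262145) == DivAlgo::NewtonBalanced, "");
```

Keeping the band as a pure function of the two limb counts makes the boundary cases (`2n-1`, `2n`, `2n+1`, `2n+2`) easy to enumerate in a test.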