Fix near-balanced division: Newton balanced band + single-block pathology #79

Merged
mmurshed merged 1 commit into main from perf/newton-balanced-division
May 31, 2026
@mmurshed

Owner

Summary

Profiling the division dispatch path on near-balanced (ratio ≈ 2) operands at large divisor sizes surfaced two compounding problems. Both are fixed here.

1. Newton single-block pathology

The single-block path (na ≤ 2n+1) ran a 2n+1-limb chunk through the truncated (chunk·R) >> 2n quotient estimate. The truncation error scales with chunk/B^(2n), reaching ~B once the chunk exceeds 2n limbs — and the +1 limb routinely appears from the Knuth normalize shift on a ≈ 2n. The estimate then undershot Q past the 8-step fixup cap and bailed to quadratic FastDivision. At nb=50000 this was a 23× spike (5200 ms vs ~360 ms at neighbouring sizes).
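The undershoot can be reproduced with plain Python integers. This is a minimal sketch, not the library's code: it assumes B = 2^64 and a reciprocal R = floor(B^(2n) / d), and the helper names are hypothetical. The divisor 3·B^n/4 is chosen only because its reciprocal is truncated by exactly 1/3, which makes the error easy to follow.

```python
LIMB_BITS = 64          # assumption: 64-bit limbs, as in Base2_64
B = 1 << LIMB_BITS


def reciprocal(d: int, n: int) -> int:
    """Assumed reciprocal: R = floor(B^(2n) / d) for an n-limb divisor d."""
    return (B ** (2 * n)) // d


def estimate_quotient(chunk: int, r: int, n: int) -> int:
    """Truncated estimate: (chunk * R) shifted right by 2n limbs."""
    return (chunk * r) >> (2 * n * LIMB_BITS)


def undershoot(chunk: int, d: int, n: int) -> int:
    """How far the estimate falls below the true quotient chunk // d."""
    return chunk // d - estimate_quotient(chunk, reciprocal(d, n), n)


n = 4
d = 3 * B**n // 4                                  # n limbs, top bit set
within = undershoot(B ** (2 * n) - 1, d, n)        # largest 2n-limb chunk
beyond = undershoot(B ** (2 * n + 1) - 1, d, n)    # largest (2n+1)-limb chunk
print(within, beyond > 8)
```

With a 2n-limb chunk the estimate is off by at most one unit, so a single fixup step suffices. With the extra limb the undershoot for this divisor is roughly B/3, far beyond any 8-step cap.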

Fix: route na > 2n through the blockwise path so every chunk stays ≤ 2n.
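A blockwise loop keeps the chunk bounded by construction. The sketch below is an illustration under assumptions (B = 2^64, R = floor(B^(2n) / d), hypothetical function name); the real blockwise path may split and correct differently.

```python
LIMB_BITS = 64          # assumption: 64-bit limbs
B = 1 << LIMB_BITS
MAX_FIXUPS = 8          # the fixup cap mentioned in the description


def divide_blockwise(a: int, d: int, n: int) -> tuple[int, int]:
    """Divide a by an n-limb divisor d, consuming a in n-limb blocks."""
    block_base = B ** n
    r = (B ** (2 * n)) // d               # assumed reciprocal of d

    # Split the dividend into n-limb blocks, least significant first.
    blocks = []
    while a:
        blocks.append(a % block_base)
        a //= block_base

    q = rem = 0
    for block in reversed(blocks):        # most significant block first
        # rem < d <= B^n, so chunk < d * B^n: never more than 2n limbs.
        chunk = rem * block_base + block
        est = (chunk * r) >> (2 * n * LIMB_BITS)
        rem = chunk - est * d
        fixups = 0
        while rem >= d:                   # estimate can only undershoot
            est += 1
            rem -= d
            fixups += 1
        assert fixups <= MAX_FIXUPS
        q = q * block_base + est
    return q, rem
```

Because the running remainder is always smaller than the divisor, each chunk stays below B^(2n) and the estimate needs at most one correction per block in this sketch.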

2. Burnikel-Ziegler non-power-of-2 blowup

BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, doubling the FFT length. The constant factor compounds across recursion depth into a 5–60× slowdown vs Newton, worst at n = 2^k + 1. Newton pads once and stays flat.
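The boundary effect on a single multiply can be sketched as follows. The sizing rule here is an assumption (transform length equals the next power of two at or above the product length in limbs); the library may pack limbs into transform points differently, and `transform_length` is a hypothetical name.

```python
def transform_length(limbs_a: int, limbs_b: int) -> int:
    """Next power of two >= the product length (assumed sizing rule)."""
    product_limbs = limbs_a + limbs_b
    return 1 << (product_limbs - 1).bit_length()


exact = transform_length(2**17, 2**17)             # operands at a power of two
over = transform_length(2**17 + 1, 2**17 + 1)      # one limb past the boundary
print(over // exact)
```

One extra limb per operand doubles the transform length under this rule, and a recursive scheme pays that penalty again at every level where its sub-operands land just past a boundary.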

Fix: a Newton balanced band (b ≥ NEWTON_BALANCED_B = 98304 and a ≥ 2b) takes the near-balanced large case BZ previously owned. Exact-power-of-2 divisor sizes (BZ's best case, where it ties Newton) regress ~4%; these are rare in practice. The existing ratio-≥3 Newton band is untouched.
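The band condition amounts to a two-part predicate. In this sketch a and b are taken to be the dividend and divisor sizes in limbs, and the function name is hypothetical; the actual dispatch code has other bands that are not modelled here.

```python
NEWTON_BALANCED_B = 98304   # threshold quoted in this PR


def in_newton_balanced_band(a_limbs: int, b_limbs: int) -> bool:
    """True when the shape falls in the new Newton balanced band."""
    return b_limbs >= NEWTON_BALANCED_B and a_limbs >= 2 * b_limbs


print(in_newton_balanced_band(500000, 250000))   # benchmark shape
print(in_newton_balanced_band(100000, 50000))    # divisor below threshold
```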

Measured (Base2_64, M1 Max, dispatch wall-clock; before = main d38f3c8 → BZ)

| shape (limbs) | divisor | before | after | speedup |
|---|---|---|---|---|
| 500000/250000 | non-pow2 | 5177.7 ms | 588.3 ms | 8.80× |
| 1000000/500000 | non-pow2 | 10554.7 ms | 1485.7 ms | 7.10× |
| 2000000/1000000 | non-pow2 | 21006.9 ms | 3747.6 ms | 5.61× |
| 524290/262145 | 2¹⁸+1 (BZ worst) | 92060.5 ms | 941.9 ms | 97.7× |
| 524288/262144 | pow2 (BZ best) | 656.4 ms | 684.0 ms | 0.96× |
| 3000000/1000000 | ratio 3 (control) | 4687.7 ms | 4635.7 ms | 1.01× |

[Figure: speedup plot]

Validation

  • div_correctness (canonical): pass, 0 mismatches
  • unit_tests: 246 passed, 0 failed
  • Newton vs BZ cross-check limb-for-limb across na ∈ {2n-1,2n,2n+1,2n+2,3n} × nb near the boundary (120 cases): all match
  • Adds tests/performance/division_balanced_bench.cpp + docs/images/make_division_balanced_plot.py
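A cross-check over the boundary shapes could look like the sketch below. It is not the test added in this PR: the helper names, the seed, and the use of one random dividend per shape are all assumptions, and any pair of divide functions returning (quotient, remainder) can be plugged in.

```python
import random

LIMB_BITS = 64   # assumption: 64-bit limbs


def boundary_shapes(n: int) -> list[int]:
    """Dividend sizes around the single-block boundary for an n-limb divisor."""
    return [2 * n - 1, 2 * n, 2 * n + 1, 2 * n + 2, 3 * n]


def random_limbs(count: int, rng: random.Random) -> int:
    """Random integer with exactly `count` limbs (top bit forced on)."""
    bits = count * LIMB_BITS
    return rng.getrandbits(bits) | (1 << (bits - 1))


def cross_check(divide, reference, n: int, seed: int = 0) -> list[int]:
    """Return the dividend sizes where the two implementations disagree."""
    rng = random.Random(seed)
    d = random_limbs(n, rng)
    mismatches = []
    for na in boundary_shapes(n):
        a = random_limbs(na, rng)
        if divide(a, d) != reference(a, d):
            mismatches.append(na)
    return mismatches
```

An empty result means every shape matched; a non-empty one names the dividend sizes to investigate.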

🤖 Generated with Claude Code

…logy

Profiling the dispatch path on ratio-≈2 division at large divisor sizes
surfaced two compounding problems:

1. NewtonDivision single-block path (na ≤ 2n+1) ran a 2n+1-limb chunk
   through the truncated (chunk·R)>>2n quotient estimate. The truncation
   error scales with chunk/B^(2n), reaching ~B once the chunk exceeds 2n
   limbs (the +1 limb routinely appears from the Knuth normalize shift on
   a ≈ 2n). The estimate undershot Q past the 8-step fixup cap and bailed
   to quadratic FastDivision — a 23× spike at nb=50000 (5200ms vs ~360ms
   at neighbours). Fix: route na > 2n through the blockwise path so every
   chunk stays ≤ 2n.

2. Burnikel-Ziegler's recursive 2n/n halving lands its intermediate NTT
   multiplies just over power-of-2 transform-length boundaries for
   non-power-of-2 divisor sizes, doubling the FFT length. The constant
   factor compounds across recursion depth into a 5–60× slowdown vs
   Newton, worst at n = 2^k + 1. Newton pads once and stays flat. Fix: a
   Newton balanced band (b ≥ NEWTON_BALANCED_B=98304, a ≥ 2b) takes the
   near-balanced large case BZ previously owned.

Measured (Base2_64, M1 Max, dispatch wall-clock, before main d38f3c8 → BZ):
  500k/250k    5177.7 → 588.3 ms   8.80×
  1M/500k     10554.7 → 1485.7 ms  7.10×
  2M/1M       21006.9 → 3747.6 ms  5.61×
  524290/262145 (2^18+1) 92060 → 942 ms  97.7×
Exact-power-of-2 divisor sizes (BZ best case) regress ~4%; rare in
practice. Ratio-3 control unchanged (already Newton).

Cross-checked vs BZ limb-for-limb across na ∈ {2n-1,2n,2n+1,2n+2,3n}
near the boundary (120 cases) + canonical div_correctness — all match.
Adds tests/performance/division_balanced_bench.cpp and the speedup plot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mmurshed mmurshed merged commit 175b3c0 into main May 31, 2026
4 of 5 checks passed
@mmurshed mmurshed deleted the perf/newton-balanced-division branch May 31, 2026 09:16
mmurshed added a commit that referenced this pull request May 31, 2026
docs: sync division dispatch docs with Newton balanced band (PR #79)
