From e1da54be7517e92f1b162a62b8940f523c1e4f3c Mon Sep 17 00:00:00 2001
From: Max Murshed
Date: Thu, 11 Jun 2026 21:39:56 -0700
Subject: [PATCH] perf: lower Newton balanced band ratio from 2/1 to 4/3

The wraparound-Newton PRs (#85-#87) cut Newton's constant ~35-45%, which
moved the generic BZ/Newton crossover in the near-balanced band from
ratio ~2 down to ~1.3 (measured at nb 100k-200k limbs: Newton wins from
1.3-1.35 up; at ratio 1.5 BZ is 1.7-1.9x slower, at 1.95 up to 3.4x;
2^k+1-family divisor sizes are 68x worse on BZ at ratio 1.5).

Lower NEWTON_BALANCED from 2/1 to 4/3. Exact-power-of-two divisors (BZ's
best case) regress <= ~25% in the narrow (4/3, ~1.4) sliver - same
accepted tradeoff as PR #79.

Measured after the change (M1 Max, min of 3):
- nb=160000 limbs ratio 1.5: 244 -> 165 ms
- nb=100000 limbs ratio 1.4: 201 -> 126 ms
- nb=131073 (2^17+1) ratio 1.5: 10.7 s -> 157 ms

Known residual: ratio in (1, 4/3) still routes to BZ. Generic sizes are
genuinely faster there, but 2^k+1-family sizes still blow up (1.1-5.4 s
at nb=131073, ratios 1.05-1.25). The follow-up fix is quotient-sized
division, which scales with the quotient instead of the divisor.

246 unit tests + div_correctness pass. Docs updated (DIVISION.md,
CLAUDE.md).
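
Illustrative sketch of the band test this change retunes, assuming the
balanced band uses the same integer cross-multiplied comparison that the
medium-skew band uses in Division.cpp (den * na >= num * nb). The names
below are hypothetical stand-ins, not the library's identifiers; only the
98304-limb floor and the 2/1 and 4/3 ratios come from this patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical name: limb floor for the balanced band (98304 per the docs).
constexpr std::size_t kBalancedMinLimbs = 98304;

// True when na/nb >= num/den, evaluated as den*na >= num*nb so the test
// stays in integer arithmetic and never divides.
constexpr bool ratio_at_least(std::size_t na, std::size_t nb,
                              std::size_t num, std::size_t den) {
    return den * na >= num * nb;
}

// Assumed shape of the balanced-band predicate: divisor above the limb
// floor AND dividend/divisor size ratio at or above num/den.
constexpr bool newton_balanced(std::size_t na, std::size_t nb,
                               std::size_t num, std::size_t den) {
    return nb >= kBalancedMinLimbs && ratio_at_least(na, nb, num, den);
}

// Ratio 1.5 at nb=160000: outside the old 2/1 band, inside the new 4/3 band.
static_assert(!newton_balanced(240000, 160000, 2, 1), "old band misses 1.5");
static_assert(newton_balanced(240000, 160000, 4, 3), "new band covers 1.5");

// Ratio 1.25 at nb=160000: still below 4/3, so it stays out of the band.
static_assert(!newton_balanced(200000, 160000, 4, 3), "1.25 stays outside");
```

Under these assumptions a size pair exactly at ratio 4/3 (for example
na=132000, nb=99000) falls inside the band, because the comparison is
non-strict.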

Co-Authored-By: Claude Fable 5
---
 CLAUDE.md                                | 4 ++--
 docs/DIVISION.md                         | 4 ++--
 include/biginteger/algorithms/Division.h | 6 +++---
 src/algorithms/Division.cpp              | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 8dabf9b..153f975 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -43,14 +43,14 @@ CI (`.github/workflows/qa.yaml`) runs `nanoclaw-task qa-agent` on PRs against `o
 **Division dispatch** (`algorithms/Division.h`, thresholds defined in `src/algorithms/Division.cpp`):
 - `NewtonDivision` (Newton-Raphson reciprocal, O(M(n)); handles arbitrary `na/nb` via blockwise mode — top chunk in [n+1, 2n], slide down by n, thread the remainder) when any skew band holds:
   - `b ≥ 4096` at ratio ≥ 3 (`NEWTON_SKEW` 3/1), or
-  - `b ≥ 98304` at ratio ≥ 2 (`NEWTON_BALANCED` 2/1 — the near-balanced band, PR #79), or
+  - `b ≥ 98304` at ratio ≥ 4/3 (`NEWTON_BALANCED` 4/3 — the near-balanced band, PR #79, lowered from 2/1 2026-06-11), or
   - `b ≥ 2048` at ratio ≥ 8 (`NEWTON_HIGH_SKEW` 8/1).
 - else `BurnikelZieglerDivision` for power-of-two base when `b > 512` and the BZ band fits (near-balanced `b ≥ 1024, b+32 ≤ a ≤ 3b`, or big-and-skewed `a > 2048 && a > 3b`).
 - otherwise multi-limb → `FastDivision` (Knuth Algorithm D variant)
 - single-limb divisor → `ClassicDivision`
 - `KnuthDivision` and `ReciprocalDivision` are alternates used by correctness tests for cross-checking.
 
-The balanced band exists because BZ's recursive 2n/n halving lands intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, blowing up 5–60× vs Newton (worst at `n = 2^k+1`); Newton pads once and stays flat. **Known residual:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton would be at ratio 1.5) — the balanced band's `a ≥ 2b` lower bound doesn't cover it yet.
+The balanced band exists because BZ's recursive 2n/n halving lands intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes, blowing up 5–60× vs Newton (worst at `n = 2^k+1`); Newton pads once and stays flat. **Known residual:** ratio ∈ (1, 4/3) at large `b` still routes to BZ; generic sizes are fine there (BZ wins below the ~1.3 crossover) but `2^k+1`-family divisor sizes still blow up — the proper fix for that range is quotient-sized division.
 
 When adding a new algorithm, slot the implementation under `algorithms//.h`, then update the dispatch in `algorithms/.h` — the thresholds there are the only place size cutoffs live.
 
diff --git a/docs/DIVISION.md b/docs/DIVISION.md
index c098643..dab61d6 100644
--- a/docs/DIVISION.md
+++ b/docs/DIVISION.md
@@ -62,7 +62,7 @@ flowchart TD
     C -- no --> D{Compare(a, b)}
     D -- a == b --> Eq[return {1, 0}]
     D -- a < b --> Less[return {0, a}]
-    D -- a > b --> E{Newton band?
b ≥ 4096 and a ≥ 3b
OR b ≥ 98304 and a ≥ 2b
OR b ≥ 2048 and a ≥ 8b}
+    D -- a > b --> E{Newton band?
b ≥ 4096 and a ≥ 3b
OR b ≥ 98304 and 3a ≥ 4b
OR b ≥ 2048 and a ≥ 8b}
     E -- yes --> N[NewtonDivision]
     E -- no --> F{Power-of-two base
AND b.size > 512
AND BZ shape fits?}
     F -- yes --> BZ[BurnikelZieglerDivision]
@@ -105,7 +105,7 @@ The current dispatch logic, paraphrased:
 
 The ordering matters: Newton wins on **large skewed** problems because the per-divisor reciprocal setup amortizes over multiple chunks. BZ wins on **mid-size near-balanced** problems where its 2n/n recursion structure beats both FastDivision and Newton's setup cost.
 
-**Near-balanced band (PR #79).** Above `NEWTON_BALANCED_B` (98304 limbs), ratio-≥2 division goes to Newton instead of BZ. BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes — the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k + 1` (measured ~90 s for a 262145-limb divisor vs Newton's ~0.9 s). Newton pads once to the working size and stays flat. Exact-power-of-2 divisor sizes are BZ's best case (it ties Newton); they regress ~4 % under this band but are rare in practice. **Known residual:** ratio ∈ (1, 2) at large `b` still routes to BZ and hits the same blowup (~2.7× slower than Newton at ratio 1.5); the band's `a ≥ 2b` lower bound does not cover it. FastDivision is the default workhorse for everything else.
+**Near-balanced band (PR #79, lowered to 4/3 on 2026-06-11).** Above `NEWTON_BALANCED_B` (98304 limbs), ratio-≥-4/3 division goes to Newton instead of BZ. BZ's recursive 2n/n halving lands its intermediate NTT multiplies just over power-of-2 transform-length boundaries for non-power-of-2 divisor sizes — the FFT length doubles and the constant factor compounds across recursion depth into a **5–60× slowdown vs Newton**, worst at `n = 2^k + 1` (measured ~90 s for a 262145-limb divisor vs Newton's ~0.9 s). Newton pads once to the working size and stays flat. Exact-power-of-2 divisor sizes are BZ's best case (it ties Newton); they regress ~4 % under this band but are rare in practice. **Band lower edge re-measured 2026-06-11** after the wraparound-Newton PRs (#85–#87) cut Newton's constant: the generic BZ/Newton crossover moved from ~2 down to ~1.3 (nb 100k–200k limbs: Newton wins from ratio 1.3–1.35 up; at 1.5 BZ is 1.7–1.9× slower, at 1.95 up to 3.4×). Band lowered `2/1 → 4/3`. Exact-power-of-2 divisors (BZ's best case) regress ≤ ~25% in the narrow (4/3, 1.4) sliver — same accepted tradeoff as PR #79. **Known residual:** ratio ∈ (1, 4/3) still routes to BZ; fine for generic sizes (BZ genuinely wins below the crossover) but `2^k+1`-family divisor sizes still hit the transform-doubling blowup (measured 1.1–5.4 s vs Newton's 0.16 s at nb = 131073, ratios 1.05–1.25). The right fix there is quotient-sized division (divide the top ~2Δ limbs by the divisor's top Δ limbs when the quotient is short), which scales with the quotient instead of the divisor. FastDivision is the default workhorse for everything else.
 
 `KnuthDivision` and `ReciprocalDivision` exist as alternate implementations used by correctness tests for cross-checking. They are not in the production dispatch path.
 
diff --git a/include/biginteger/algorithms/Division.h b/include/biginteger/algorithms/Division.h
index 9b737c6..6942e9a 100644
--- a/include/biginteger/algorithms/Division.h
+++ b/include/biginteger/algorithms/Division.h
@@ -5,7 +5,7 @@
  * 1. NewtonDivision (blockwise handles arbitrary ratio via reciprocal cache),
  *    when any of these skew bands hold:
  *      - b ≥ NEWTON_MEDIUM_B AND a ≥ NEWTON_SKEW (3/1) · b
- *      - b ≥ NEWTON_BALANCED_B AND a ≥ NEWTON_BALANCED (2/1) · b
+ *      - b ≥ NEWTON_BALANCED_B AND a ≥ NEWTON_BALANCED (4/3) · b
  *      - b ≥ NEWTON_HIGH_SKEW_B AND a ≥ NEWTON_HIGH_SKEW (8/1) · b
  *    The balanced (ratio ≥ 2) band starts higher (96k limbs) because BZ wins
  *    near-balanced below that; above it BZ degrades erratically (measured
@@ -56,11 +56,11 @@ namespace BigMath
 #endif
 
 #ifndef BIGMATH_NEWTON_BALANCED_NUMERATOR
-#define BIGMATH_NEWTON_BALANCED_NUMERATOR 2
+#define BIGMATH_NEWTON_BALANCED_NUMERATOR 4
 #endif
 
 #ifndef BIGMATH_NEWTON_BALANCED_DENOMINATOR
-#define BIGMATH_NEWTON_BALANCED_DENOMINATOR 1
+#define BIGMATH_NEWTON_BALANCED_DENOMINATOR 3
 #endif
 
 #ifndef BIGMATH_NEWTON_HIGH_SKEW_B
diff --git a/src/algorithms/Division.cpp b/src/algorithms/Division.cpp
index eb7e91c..7a66ff8 100644
--- a/src/algorithms/Division.cpp
+++ b/src/algorithms/Division.cpp
@@ -47,7 +47,7 @@ namespace BigMath
         bool newton_medium_skew =
             b.size() >= NEWTON_MEDIUM_B &&
             NEWTON_SKEW_DENOMINATOR * a.size() >= NEWTON_SKEW_NUMERATOR * b.size();
-        // Near-balanced (ratio ≥ 2) band: only above NEWTON_BALANCED_B, where BZ's
+        // Near-balanced (ratio ≥ 4/3) band: only above NEWTON_BALANCED_B, where BZ's
         // near-balanced path degrades erratically (measured 2×–4.5× slower than
         // Newton at b ≥ 100k limbs); below it BZ wins, so leave it alone.
         bool newton_balanced =